
Re: [Condor-users] Condor-G/Globus Problem



On Oct 13, 2005, at 11:26 AM, James E. Dobson wrote:

I have a problem with jobs going into the "globus error 7" state after running
successfully for a while. There is a very long-lived proxy in place:


[edsan@bellows-falls edsan]$ grid-proxy-info -timeleft
7128652

Yet the jobs go into the UNKNOWN state after a while:

[edsan@bellows-falls edsan]$ condor_q -globus |grep pbs
2373.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
2374.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
2395.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar

[edsan@bellows-falls edsan]$ condor_q -l 2395.0
...
LastHoldReason = "Globus error 7: authentication with the remote server failed"
...


The job directory has been deleted, so it looks like the job is done:

[edsan@bellows-falls edsan]$ globus-job-status https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
DONE
[edsan@bellows-falls edsan]$ globus-job-get-output https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
Invalid job id.

On the Gatekeeper itself (also running Condor) the jobs appear to be still
running:


[jed@pbs-01 jed]$ condor_q


-- Submitter: pbs-01.grid.dartmouth.edu : <129.170.30.146:32787> : pbs-01.grid.dartmouth.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE  CMD
  29.0   grid            10/12 12:40   0+23:40:00 R  0   413.6 data
  30.0   grid            10/12 12:42   0+23:38:22 R  0   414.0 data
  38.0   grid            10/12 16:29   0+19:50:56 R  0   399.6 data

3 jobs; 0 idle, 3 running, 0 held

But the job directory doesn't exist:
[jed@pbs-01 jed]$ condor_q -l 38.0 |grep ^Err
Err = "/home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987/stderr"


[jed@pbs-01 jed]$ sudo ls -ld /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987
Password:
ls: /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987: No such file or directory


Has anyone seen this before? Any clue what is causing it? We have some
long-running jobs that are getting hit by this communication or
authentication failure and the directory-deletion problem.

[jed@pbs-01 jed]$ /opt/vdt/vdt/bin/vdt-version
You have installed the complete VDT version 1.3.5:
    Condor/Condor-G 6.7.6
    Globus Toolkit, pre web-services, client 3.2.1
    Globus Toolkit, pre web-services, server 3.2.1

[edsan@bellows-falls edsan]$ condor_version
$CondorVersion: 6.6.7 Oct 11 2004 $
$CondorPlatform: I386-LINUX_RH9 $

Hmm. I'd have to look at the gridmanager (client side) and jobmanager (server side) log files to diagnose this. One possibility: does your CA use CRLs with short lifetimes (shorter than the runtime of your jobs)? We've seen problems where the CRL gets cached in memory and never refreshed as long as the gridmanager is running.
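
One way to check the CRL possibility is to compare the CRL's validity window with how long the jobs run. The Python sketch below is only an illustration under stated assumptions: it shells out to `openssl crl -noout -lastupdate -nextupdate` and parses the `lastUpdate=` / `nextUpdate=` lines it prints. The function names and the CRL file path are hypothetical, and the date parsing assumes an English (C) locale.

```python
import subprocess
from datetime import datetime, timedelta

# Date layout printed by `openssl crl -noout -lastupdate -nextupdate`,
# e.g. "lastUpdate=Oct 12 12:00:00 2005 GMT".
OPENSSL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def read_crl_dates(crl_path: str) -> str:
    """Return openssl's lastUpdate/nextUpdate lines for a PEM CRL file.

    crl_path is whatever CRL file your CA distributes (hypothetical here).
    """
    result = subprocess.run(
        ["openssl", "crl", "-in", crl_path, "-noout",
         "-lastupdate", "-nextupdate"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def crl_lifetime(openssl_output: str) -> timedelta:
    """Compute nextUpdate - lastUpdate from openssl's key=value output."""
    fields = {}
    for line in openssl_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = datetime.strptime(
                value.strip(), OPENSSL_DATE_FORMAT)
    return fields["nextUpdate"] - fields["lastUpdate"]


def job_outlives_crl(openssl_output: str, job_runtime: timedelta) -> bool:
    """True if a job has run longer than the CRL stays valid."""
    return job_runtime > crl_lifetime(openssl_output)
```

For example, `job_outlives_crl(read_crl_dates(path), timedelta(hours=23, minutes=40))` would compare the CRL window with the 0+23:40:00 run time reported for job 29.0 above. A `True` result would be consistent with the stale-CRL theory, but it would not prove it; the gridmanager and jobmanager logs are still needed.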


+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+