
[Condor-users] Condor-G/Globus Problem



Hi,

I have a problem with jobs going into a "Globus error 7" state after
running successfully for a while. A very long-lived proxy is in place:

[edsan@bellows-falls edsan]$ grid-proxy-info -timeleft
7128652
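
For scale, `grid-proxy-info -timeleft` reports the remaining lifetime in seconds. A minimal Python sketch of the conversion follows; the helper name `format_timeleft` is made up for illustration and is not part of any Globus or Condor tool.

```python
from datetime import timedelta


def format_timeleft(seconds: int) -> str:
    """Render a proxy lifetime given in seconds as days and H:MM:SS.

    Hypothetical helper for illustration only.
    """
    return str(timedelta(seconds=seconds))


# Value copied from the grid-proxy-info output above.
print(format_timeleft(7128652))  # 82 days, 12:10:52
```

So the proxy has roughly 82 days left, which makes a plain proxy expiry an unlikely cause of the authentication failure.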

Yet the jobs go into the UNKNOWN state after a while:

[edsan@bellows-falls edsan]$ condor_q -globus |grep pbs
2373.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
2374.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
2395.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar

[edsan@bellows-falls edsan]$ condor_q -l 2395.0
...
LastHoldReason = "Globus error 7: authentication with the remote server failed"
...

The job directory has been deleted, so it looks like the job is done:

[edsan@bellows-falls edsan]$ globus-job-status https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
DONE
[edsan@bellows-falls edsan]$ globus-job-get-output https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
Invalid job id.

On the gatekeeper itself (also running Condor), the jobs still appear to be
running:

[jed@pbs-01 jed]$ condor_q


-- Submitter: pbs-01.grid.dartmouth.edu : <129.170.30.146:32787> : pbs-01.grid.dartmouth.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  29.0   grid           10/12 12:40   0+23:40:00 R  0   413.6 data
  30.0   grid           10/12 12:42   0+23:38:22 R  0   414.0 data
  38.0   grid           10/12 16:29   0+19:50:56 R  0   399.6 data

3 jobs; 0 idle, 3 running, 0 held

But the job directory doesn't exist:
[jed@pbs-01 jed]$ condor_q -l 38.0 |grep ^Err
Err = "/home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987/stderr"

[jed@pbs-01 jed]$ sudo ls -ld /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987
Password:
ls: /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987: No such file or directory
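
In the output above, the two numeric components of the job contact URL (18955 and 1129148987) reappear, joined by a dot, as the job directory name under `.globus/job/<host>/`. The Python sketch below reproduces that mapping. It is inferred only from the paths shown in this message, not from Globus documentation, and the function name `job_dir_from_contact` and its `home` parameter are hypothetical.

```python
from urllib.parse import urlparse


def job_dir_from_contact(contact: str, home: str = "/home/grid") -> str:
    """Guess the gatekeeper-side job directory for a job contact URL.

    Assumption: the layout matches what this message shows, i.e.
    <home>/.globus/job/<host>/<first>.<second>, where <first> and
    <second> are the two path components of the contact URL.
    """
    parsed = urlparse(contact)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2:
        raise ValueError(f"unexpected job contact path: {parsed.path!r}")
    return f"{home}/.globus/job/{parsed.hostname}/{parts[0]}.{parts[1]}"


# Contact string copied from the globus-job-status call above.
contact = "https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/"
print(job_dir_from_contact(contact))
```

For the contact string used here, this yields the same directory that the `ls` above reports as missing, so the DONE status and the deleted directory refer to the same job that Condor on the gatekeeper still lists as running.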

Has anyone seen this before? Any clue what is causing it? We have some
long-running jobs that are being hit by this communication or
authentication failure and the deleted job directory.

[jed@pbs-01 jed]$ /opt/vdt/vdt/bin/vdt-version
You have installed the complete VDT version 1.3.5:
    Condor/Condor-G 6.7.6
    Globus Toolkit, pre web-services, client 3.2.1
    Globus Toolkit, pre web-services, server 3.2.1

[edsan@bellows-falls edsan]$ condor_version
$CondorVersion: 6.6.7 Oct 11 2004 $
$CondorPlatform: I386-LINUX_RH9 $



Thanks,

-jed