Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Condor-G/Globus Problem

Date: Sat, 15 Oct 2005 14:25:49 -0500
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [Condor-users] Condor-G/Globus Problem

On Oct 13, 2005, at 11:26 AM, James E. Dobson wrote:

I have a problem with jobs going into a "Globus error 7" state after running successfully for a while. There is a very long-lived proxy in place:
[edsan@bellows-falls edsan]$ grid-proxy-info -timeleft
7128652
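For scale, grid-proxy-info -timeleft reports the remaining proxy lifetime in seconds. A minimal Python sketch of the conversion follows; the helper name describe_timeleft is illustrative and not part of any Globus tool.

```python
from datetime import timedelta


def describe_timeleft(seconds):
    """Split a lifetime in seconds into (days, hours, minutes, seconds)."""
    td = timedelta(seconds=seconds)
    hours, rem = divmod(td.seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return td.days, hours, minutes, secs


# Value taken from the grid-proxy-info output above.
print(describe_timeleft(7128652))  # (82, 12, 10, 52), i.e. about 82.5 days
```

So the proxy itself has months left and is unlikely to be what expires while the jobs run.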
Yet the jobs go into the UNKNOWN state after a while:
[edsan@bellows-falls edsan]$ condor_q -globus |grep pbs
2373.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
2374.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
2395.0   edsan         UNKNOWN condor   pbs-01.grid.dartmo /afs/northstar.dar
[edsan@bellows-falls edsan]$ condor_q -l 2395.0
...
LastHoldReason = "Globus error 7: authentication with the remote server failed"
...
The job directory has been deleted, so it looks like the job is done:
[edsan@bellows-falls edsan]$ globus-job-status https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
DONE
[edsan@bellows-falls edsan]$ globus-job-get-output https://pbs-01.grid.dartmouth.edu:33674/18955/1129148987/
Invalid job id.
On the Gatekeeper itself (also running Condor) the jobs appear to be still running:
[jed@pbs-01 jed]$ condor_q
-- Submitter: pbs-01.grid.dartmouth.edu : <129.170.30.146:32787> : pbs-01.grid.dartmouth.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  29.0   grid           10/12 12:40   0+23:40:00 R  0   413.6 data
  30.0   grid           10/12 12:42   0+23:38:22 R  0   414.0 data
  38.0   grid           10/12 16:29   0+19:50:56 R  0   399.6 data
3 jobs; 0 idle, 3 running, 0 held
But the job directory doesn't exist:
[jed@pbs-01 jed]$ condor_q -l 38.0 |grep ^Err
Err = "/home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987/stderr"

[jed@pbs-01 jed]$ sudo ls -ld /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987
Password:
ls: /home/grid/.globus/job/pbs-01.grid.dartmouth.edu/18955.1129148987: No such file or directory
Has anyone seen this before? Any clue what is causing it? We have some
long-running jobs that are getting hit by this communication or
authentication failure and the directory deletion problem.
[jed@pbs-01 jed]$ /opt/vdt/vdt/bin/vdt-version
You have installed the complete VDT version 1.3.5:
    Condor/Condor-G 6.7.6
    Globus Toolkit, pre web-services, client 3.2.1
    Globus Toolkit, pre web-services, server 3.2.1
[edsan@bellows-falls edsan]$ condor_version
$CondorVersion: 6.6.7 Oct 11 2004 $
$CondorPlatform: I386-LINUX_RH9 $

Hmm. I'd have to look at the gridmanager (client side) and jobmanager (server side) log files to diagnose this. One possibility: does your CA use CRLs with short lifetimes (shorter than the runtime of your jobs)? We've seen problems where the CRL gets cached in memory and never refreshed as long as the gridmanager is running.
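One way to test the CRL theory is to compare each CRL's nextUpdate time with the expected end time of a job. The Python below is a minimal sketch, not a Condor or Globus facility: the helper names are illustrative, it assumes the openssl command-line tool is installed, and it assumes the CRLs are PEM files (on many Globus installations they live under /etc/grid-security/certificates with a .r0 suffix, but that location is an assumption here).

```python
import subprocess
from datetime import datetime, timedelta, timezone


def parse_next_update(openssl_line):
    """Parse a line such as 'nextUpdate=Oct 15 19:25:49 2005 GMT' into a UTC datetime."""
    _, _, stamp = openssl_line.strip().partition("=")
    # openssl pads single-digit days with an extra space; collapse it.
    stamp = " ".join(stamp.split())
    parsed = datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def read_next_update(crl_path):
    """Ask the openssl CLI for the nextUpdate field of a PEM-encoded CRL."""
    result = subprocess.run(
        ["openssl", "crl", "-in", crl_path, "-noout", "-nextupdate"],
        check=True, capture_output=True, text=True,
    )
    return parse_next_update(result.stdout)


def crl_expires_during_job(next_update, job_start, expected_runtime):
    """Return True if the CRL's nextUpdate falls before the job's expected end."""
    return next_update < job_start + expected_runtime
```

For example, with a hypothetical CRL path and a job expected to run for a week, crl_expires_during_job(read_next_update("/etc/grid-security/certificates/abcd1234.r0"), datetime.now(timezone.utc), timedelta(days=7)) returning True would suggest the CRL lifetime is shorter than the job's runtime, which is the situation described above.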

+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+

Follow-Ups:
- Re: [Condor-users] Condor-G/Globus Problem
  - From: James E. Dobson

References:
- [Condor-users] Condor-G/Globus Problem
  - From: James E. Dobson
