
Re: [Condor-users] Condor-G/Globus Problem



> Hmm. I'd have to look at the gridmanager (client side) and jobmanager
> (server side) log files to diagnose this. One possibility: does your CA
> use CRLs with short lifetimes (shorter than the runtime of your jobs)?
> We've seen problems where the CRL gets cached in memory and never
> refreshed as long as the gridmanager is running.

I'm still having problems. We are not updating CRLs from my CA. I do have
a process which creates a new grid-mapfile, so my guess is that the lookup
happens at a moment when the file isn't there. Shouldn't Condor retry this
before putting the job on hold?
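
If the process that generates the grid-mapfile rewrites it in place, there would be a short window in which the file is missing or only partly written, and a lookup during that window would fail to map the DN. That is only a guess at the cause. A minimal sketch of one way to close the window, assuming the new contents are first written to a separate file: build a temporary copy in the same directory as the target and rename it over the old file, since a rename within one filesystem replaces the file in a single step. The function name install_gridmap and the paths in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: install a freshly generated grid-mapfile atomically.
# Usage: install_gridmap NEW_CONTENT_FILE TARGET
install_gridmap() {
    src=$1
    dest=$2

    # Refuse to install an empty file; an empty map would reject every DN.
    [ -s "$src" ] || { echo "refusing to install empty $src" >&2; return 1; }

    # Create the temp file next to the target so the final rename stays on
    # one filesystem, where rename(2) swaps the file in a single step.
    tmp=$(mktemp "$(dirname "$dest")/.grid-mapfile.XXXXXX") || return 1

    # Copy, set permissions, then rename over the old file. Readers see
    # either the complete old file or the complete new one.
    cat "$src" > "$tmp" && chmod 644 "$tmp" && mv -f "$tmp" "$dest" || {
        rm -f "$tmp"
        return 1
    }
}
```

Usage would look like `install_gridmap /tmp/grid-mapfile.new /etc/grid-security/grid-mapfile`, with both paths adjusted to the local setup.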

[jed@bellows-falls jed]$ condor_q |grep H | awk '{print $1}' | while read x; do condor_q -l $x | egrep "^HoldReason =|^GlobusResource" ; done
GlobusResource = "pbs-01.grid.dartmouth.edu/jobmanager-condor"
HoldReason = "Globus error 7: authentication with the remote server failed"
GlobusResource = "pbs-01.grid.dartmouth.edu/jobmanager-condor"
HoldReason = "Globus error 7: authentication with the remote server failed"
GlobusResource = "pbs-01.grid.dartmouth.edu/jobmanager-condor"
HoldReason = "Globus error 7: authentication with the remote server failed"
GlobusResource = "pbs-01.grid.dartmouth.edu/jobmanager-condor"
HoldReason = "Globus error 7: authentication with the remote server failed"
GlobusResource = "pbs-01.grid.dartmouth.edu/jobmanager-condor"
HoldReason = "Globus error 7: authentication with the remote server failed"
GlobusResource = "pbs-01.grid.dartmouth.edu/jobmanager-condor"
HoldReason = "Globus error 7: authentication with the remote server failed"


From the UserLog:
000 (2443.000.000) 10/18 14:03:55 Job submitted from host: <129.170.30.5:32793>
...
017 (2443.000.000) 10/18 14:04:15 Job submitted to Globus
    RM-Contact: pbs-01.grid.dartmouth.edu/jobmanager-condor
    JM-Contact: https://pbs-01.grid.dartmouth.edu:37397/14340/1129658645/
    Can-Restart-JM: 1
...
001 (2443.000.000) 10/18 14:04:21 Job executing on host: pbs-01.grid.dartmouth.edu
...
012 (2443.000.000) 10/18 23:10:21 Job was held.
        Globus error 7: authentication with the remote server failed
        Code 2 Subcode 7
...


On the server side:


TIME: Tue Oct 18 23:10:20 2005
 PID: 18511 -- Failure: globus_gss_assist_gridmap() failed authorization.
gridmap.c:globus_l_gss_assist_gridmap_lookup:1621:
Gridmap lookup failure: Could not map
/O=Dartmouth/CN=host/bellows-falls.grid.dartmouth.edu


Verbose error follows:
gridmap.c:globus_l_gss_assist_gridmap_lookup:1621:
Gridmap lookup failure: Could not map
/O=Dartmouth/CN=host/bellows-falls.grid.dartmouth.edu