
Re: [Condor-users] Condor-G fun



 
> John, this error sometimes happens when your condor-g client 
> does not have an up-to-date list of Certificate Authorities 
> and Certificate Revocation Lists.  If these are missing it 
> can't make the GSS handshake to the remote grid resource and 
> it will appear to condor-g that the resource is down.
> If you give a couple more details about how your machine is 
> set up, maybe we can give you some ideas on how to get a 
> correct set of CA and CRL's.  In particular, are you using 
> some VDT-based version of Condor-G and Globus, or did you 
> roll your own?

Globus is installed from VDT on all associated servers; Condor comes
directly from the Wisconsin download page.

The CAs and signing policies should be the same in all cases; the CRLs
probably aren't updated.
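
A quick way to see whether CRLs have gone stale is to look at how old the
CRL files are. The sketch below is only a rough check under stated
assumptions: that CRLs live in a certificates directory as files ending in
".r0" (the usual Globus layout, typically /etc/grid-security/certificates),
and that file modification time is a fair proxy for when the CRL was last
fetched. The function name and the 7-day threshold are illustrative, not
anything Condor-G or Globus defines.

```python
import os
import time


def stale_crls(cert_dir, max_age_days=7.0, now=None):
    """Return names of CRL files (*.r0) in cert_dir older than max_age_days.

    Age is judged by file modification time, which is only a proxy for
    CRL freshness; it does not read the CRL's own "next update" field.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for name in sorted(os.listdir(cert_dir)):
        path = os.path.join(cert_dir, name)
        # Assumption: CRL files follow the <hash>.r0 naming convention.
        if name.endswith(".r0") and os.path.isfile(path):
            if os.path.getmtime(path) < cutoff:
                stale.append(name)
    return stale


if __name__ == "__main__":
    # Assumed default location; adjust for the local installation.
    cert_dir = "/etc/grid-security/certificates"
    if os.path.isdir(cert_dir):
        for name in stale_crls(cert_dir):
            print("possibly stale CRL:", name)
```

Running this on the submit machine and on each grid resource would show
whether the CRL sets differ in age between hosts.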

I have also checked "date" on each machine, since clocks out of sync can
also cause authentication problems.

Authentication is usually a yes/no thing, not a sometimes thing, so I
doubt that would explain why, for some machines, a job takes a long time
to run and condor-g sees the grid resources as "down" in the meantime. As
an example of it taking a long time, take a look at the attached log from
a job I sent this lunchtime. It was a fork job which ran instantaneously
when I used globus-job-run, but took 2 1/2 hours via condor-g ... but it
DID run in the end. I have jobs to other resources which never seem to
run, but maybe I am not waiting long enough.

To my knowledge the grid resource didn't go "down" in this period.

The ones that never seem to run get stuck after the "Detected Down Grid
Resource" line.

Cheers

JK

-----

000 (027.000.000) 11/09 12:01:47 Job submitted from host: <193.62.125.99:65171> ...
020 (027.000.000) 11/09 12:04:55 Detected Down Globus Resource
    RM-Contact: ngs.leeds.ac.uk/jobmanager-fork ...
026 (027.000.000) 11/09 12:04:55 Detected Down Grid Resource
    GridResource: gt2 ngs.leeds.ac.uk/jobmanager-fork ...
019 (027.000.000) 11/09 14:32:27 Globus Resource Back Up
    RM-Contact: ngs.leeds.ac.uk/jobmanager-fork ...
025 (027.000.000) 11/09 14:32:27 Grid Resource Back Up
    GridResource: gt2 ngs.leeds.ac.uk/jobmanager-fork ...
001 (027.000.000) 11/09 14:32:33 Job executing on host: gt2 ngs.leeds.ac.uk/jobmanager-fork ...
017 (027.000.000) 11/09 14:32:33 Job submitted to Globus
    RM-Contact: ngs.leeds.ac.uk/jobmanager-fork
    JM-Contact: https://ngs.leeds.ac.uk:64190/2825/1194618750/
    Can-Restart-JM: 1
...
027 (027.000.000) 11/09 14:32:33 Job submitted to grid resource
    GridResource: gt2 ngs.leeds.ac.uk/jobmanager-fork
    GridJobId: gt2 ngs.leeds.ac.uk/jobmanager-fork https://ngs.leeds.ac.uk:64190/2825/1194618750/ ...
005 (027.000.000) 11/09 14:33:10 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job ...
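
To put a number on how long condor-g treated the resource as down, the
event lines in a user log like the one above can be paired up. The sketch
below is a minimal parser, assuming each event header has the shape
"NNN (cluster.proc.sub) MM/DD HH:MM:SS text", and using the codes as they
appear in this log (026 = "Detected Down Grid Resource", 025 = "Grid
Resource Back Up"). The function names are made up for illustration, and
since the log carries no year, the year argument is a placeholder.

```python
import re
from datetime import datetime

# Event header: 3-digit code, (cluster.proc.subproc), MM/DD HH:MM:SS, text
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

# Event header lines copied from the log above (detail lines omitted).
SAMPLE = """\
000 (027.000.000) 11/09 12:01:47 Job submitted from host: <193.62.125.99:65171> ...
026 (027.000.000) 11/09 12:04:55 Detected Down Grid Resource
025 (027.000.000) 11/09 14:32:27 Grid Resource Back Up
027 (027.000.000) 11/09 14:32:33 Job submitted to grid resource
005 (027.000.000) 11/09 14:33:10 Job terminated.
"""


def parse_events(text, year=2007):
    """Return (code, timestamp, message) for each event header line.

    The log has no year, so `year` is an assumed placeholder; it does not
    affect durations within a single day.
    """
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            stamp = datetime.strptime(
                f"{year}/{m.group(5)}", "%Y/%m/%d %H:%M:%S"
            )
            events.append((m.group(1), stamp, m.group(6)))
    return events


def down_intervals(events):
    """Pair each 026 (down) event with the next 025 (back up) event."""
    intervals = []
    down_since = None
    for code, stamp, _message in events:
        if code == "026" and down_since is None:
            down_since = stamp
        elif code == "025" and down_since is not None:
            intervals.append((down_since, stamp))
            down_since = None
    return intervals


if __name__ == "__main__":
    for start, end in down_intervals(parse_events(SAMPLE)):
        print("resource marked down for", end - start)
```

For this job the single down interval runs from 12:04:55 to 14:32:27, i.e.
2 hours 27 minutes 32 seconds, which accounts for nearly all of the delay
between submission and execution.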