
Re: [Condor-users] Condor-G caches expired proxy certificate?

condor-users-bounces@xxxxxxxxxxx wrote on 10/22/2008 04:13:14 PM:

> On Wed, 22 Oct 2008, Jan Ploski wrote:
> > Hi,
> >
> > I have a question regarding proxy certificate management in Condor-G
> > 7.0.1. Here is what I observed in chronological order:
> > 1. I submitted 2 Condor-G jobs yesterday.
> > 2. The proxy certificate used by the jobs expired before the jobs'
> > completion.
> > 3. The jobs were not halted (should they have been?).
> No--the idea is that you should be able to renew your certificate
> and get your results back.

Hi Steven,

I got around to reproducing it today using short jobs (10 minutes of
execution time) and a short proxy lifetime (5 minutes).

The initial job, whose proxy expired before it finished, remains in state
'Active'. It does not terminate after I generate a new proxy. The new job
remains 'Idle'. In the GridManager log I see for the new job:

10/29 16:15:52 [7210] Updating classad values for 22522.0:
10/29 16:15:52 [7210]    GridftpUrlBase = 
10/29 16:15:52 [7210]    GlobusDelegationUri = 
10/29 16:15:52 [7210]    JobLeaseExpiration = 1225336547
10/29 16:15:52 [7210]    GridJobId = "gt4 
10/29 16:15:52 [7210] leaving doContactSchedd()
10/29 16:15:52 [7210] (22522.0) doEvaluateState called: gmState GM_SUBMIT_ID_SAVE, globusState
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT_ID_SAVE -> 
10/29 16:15:52 [7210] (22522.0) proxy is about to expire
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT -> 

Apparently it is NOT using the newly generated proxy, but most probably a 
cached, expired version.
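
To follow a single job through a longer GridManager log, the per-job lines can be filtered with a small script. The following is only a sketch: the regular expression is an assumption derived from the line layout in the excerpt above (date, time, pid in brackets, job id in parentheses), and the function names `job_events` and `proxy_warnings` are made up for illustration. The sample lines are copied from the excerpt as they appear, including the cut-off state names.

```python
import re

# Assumed layout, taken from the excerpt: "MM/DD HH:MM:SS [pid] (cluster.proc) message"
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<pid>\d+)\] \((?P<job>\d+\.\d+)\) (?P<msg>.*)$"
)


def job_events(log_text, job_id):
    """Return (timestamp, message) pairs for lines tagged with job_id."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("job") == job_id:
            timestamp = match.group("date") + " " + match.group("time")
            events.append((timestamp, match.group("msg").strip()))
    return events


def proxy_warnings(log_text, job_id):
    """Return the timestamps at which the job logged a proxy expiry warning."""
    return [
        timestamp
        for timestamp, msg in job_events(log_text, job_id)
        if "proxy is about to expire" in msg
    ]


SAMPLE = """\
10/29 16:15:52 [7210] leaving doContactSchedd()
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT_ID_SAVE ->
10/29 16:15:52 [7210] (22522.0) proxy is about to expire
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT ->
"""

if __name__ == "__main__":
    for timestamp, msg in job_events(SAMPLE, "22522.0"):
        print(timestamp, msg)
```

Lines without a job id in parentheses (such as the doContactSchedd() line) are skipped on purpose, so only the state machine of the job in question remains.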

After some more time the new job is held, with "Job creation failed. ... 
NoSuchResourceException" in GridManager log.
The old job is also held, with "Staging error for RSL element 
fileStageOut" (Caused by: Certificate ... expired).

I noticed that the first job does not have to run with an expired proxy
until its intended finish time for the described scenario to happen. It
also happens if the proxy expires but is regenerated before the first job
finishes. In other words, it seems that the new proxy is simply not picked
up by Condor.

> > 4. I recreated my proxy manually with grid-proxy-init.
> > 5. I submitted another Condor-G job to the same WS GRAM host, but THIS 
> > was held on fileStageIn with a message in GridmanagerLog that my proxy
> > certificate expired.
> > 6. I removed this new job, submitted it again, same error.
> >
> > It worked only after I removed all jobs and then resubmitted the new one.
> > So I guess the question is: how exactly does Condor-G cache the proxy
> > certificate and why does it prefer using an expired certificate instead of
> > the fresh one? Is this a bug?
> >
> Are you using condor_submit -spool?


> If so it would cache the file as part of the submission
> but otherwise it should forward the new proxy to the old jobs
> and the new jobs, as long as it is in the same file name as below.
> If it didn't, then something is wrong.
> What's the output of condor_q -l | grep -i x509userproxy
> for the job in question?

It gives /tmp/x509up_u1002 for both jobs (as expected).
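
When more than a couple of jobs are queued, the same check can be scripted instead of reading the grep output by eye. This is a rough sketch that works on saved `condor_q -l` text. It assumes the usual `Name = "value"` layout of a job ad; the function name `proxy_paths`, the sample ads, cluster id 22521 and the subject string are made up for illustration.

```python
def proxy_paths(condor_q_long_output):
    """Collect the x509userproxy value of every job ad in `condor_q -l` text."""
    paths = []
    for line in condor_q_long_output.splitlines():
        name, sep, value = line.partition("=")
        # Compare the attribute name exactly (ignoring case) so that related
        # attributes such as x509userproxysubject are not picked up.
        if sep and name.strip().lower() == "x509userproxy":
            paths.append(value.strip().strip('"'))
    return paths


# Hypothetical, heavily abbreviated job ads; real output has many more attributes.
SAMPLE = '''\
ClusterId = 22521
x509userproxysubject = "/O=Example/CN=Some User"
x509userproxy = "/tmp/x509up_u1002"

ClusterId = 22522
x509userproxy = "/tmp/x509up_u1002"
'''

if __name__ == "__main__":
    paths = proxy_paths(SAMPLE)
    print(paths)
    print("all jobs share one proxy file:", len(set(paths)) == 1)
```

Unlike a plain `grep -i x509userproxy`, the exact name comparison leaves out other attributes whose names merely start with the same prefix.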

Also, I noticed that your explanation is compatible with the one Jaime 
Frey sent me in April (direct email):

> If you overwrite the job's proxy file with a fresh proxy, Condor will pick
> it up and start using it. To make sure you're updating the right file, you
> can [check] the X509UserProxy attribute in the job ad.

However, Condor misbehaves for me. Can you please try reproducing it in 
your environment?
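
For anyone reproducing this, the "overwrite the job's proxy file" step can be done so that a reader never sees a half-written file. The sketch below copies a fresh proxy over the path the job ad points to, using a temporary file plus an atomic rename. The function name `replace_proxy` and both path arguments are hypothetical. Whether Condor-G then picks up the new file is exactly the open question in this thread, so the snippet shows only the file handling and not a fix.

```python
import os
import shutil
import tempfile


def replace_proxy(fresh_proxy, job_proxy):
    """Atomically overwrite job_proxy with the contents of fresh_proxy."""
    target_dir = os.path.dirname(os.path.abspath(job_proxy))
    # mkstemp creates the file with mode 0600, i.e. accessible by the owner only.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".proxy-")
    try:
        with os.fdopen(fd, "wb") as tmp, open(fresh_proxy, "rb") as src:
            shutil.copyfileobj(src, tmp)
        os.chmod(tmp_path, 0o600)
        # Rename within one directory, so the path always holds a complete file.
        os.replace(tmp_path, job_proxy)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A call would look like `replace_proxy("/path/to/fresh_proxy", "/tmp/x509up_u1002")`, where the first path is a placeholder and the second is the value reported by the X509UserProxy attribute.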

Jan Ploski

Dipl.-Inform. (FH) Jan Ploski
FuE Bereich Energie | R&D Division Energy
Escherweg 2  - 26121 Oldenburg - Germany
Phone/Fax: +49 441 9722 - 184 / 202
E-Mail: Jan.Ploski@xxxxxxxx
URL: http://www.offis.de