
Re: [Condor-users] Condor-G caches expired proxy certificate?



I didn't notice in the previous E-mail that Jan was doing gt4,
WS-Gram.  There may be some complications there.  I haven't done
much with what happens in that case.

What I do know is that, at the beginning of a gt4 job, Condor-G will
automatically delegate your proxy to the GT4 Delegation server on
the remote gatekeeper, and any further authentication during
the job uses that delegated proxy, which is already there.
The question is whether Condor-G is smart enough to re-delegate the new
proxy for gt4 jobs once it detects there's a new one.

Your example seems to suggest no. It will be a while before
I could try this on my own system.

The other thing to keep in mind is that there is a setting within
Condor-G for how long before the proxy expires it will stop sending
new jobs out with an almost-expired proxy.  That holds for at least
the gt2 and gt4 grid universe types.  Possibly that could be tweaked.
Also, you should check the release notes to see if any gt4 handling
changes have been made since Condor 7.0.1.  I know there are some
fixes for the gt2 side coming out in the next stable release.
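
To make the expiration setting concrete, here is a minimal Python sketch of
that kind of check. It is an illustration only, not Condor-G's actual code.
The function name can_submit is hypothetical, and the 180-second threshold is
an assumption (it is meant to stand in for the GRIDMANAGER_MINIMUM_PROXY_TIME
configuration setting; check the manual for your version for the real name
and default).

```python
import time

# Assumed threshold in seconds; the real value comes from the Condor
# configuration and may differ in your version.
MINIMUM_PROXY_TIME = 180


def can_submit(proxy_expiration, now=None, minimum_proxy_time=MINIMUM_PROXY_TIME):
    """Return True if the proxy has enough lifetime left to submit a new job.

    proxy_expiration and now are Unix timestamps in seconds.
    """
    if now is None:
        now = time.time()
    seconds_left = proxy_expiration - now
    return seconds_left > minimum_proxy_time
```

Under that assumed threshold, a 5-minute proxy would leave only about two
minutes after creation in which new jobs could still go out.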


Steve Timm


On Wed, 29 Oct 2008, Jan Ploski wrote:

condor-users-bounces@xxxxxxxxxxx wrote on 10/22/2008 04:13:14 PM:

On Wed, 22 Oct 2008, Jan Ploski wrote:

Hi,

I have a question regarding proxy certificate management in Condor-G
7.0.1. Here is what I observed in chronological order:
1. I submitted 2 Condor-G jobs yesterday.
2. The proxy certificate used by the jobs expired before the jobs'
completion.
3. The jobs were not halted (should they have been?).

No, the idea is that you should be able to renew your certificate
and get your results back.

Hi Steven,

I got around to reproducing it today using short jobs (10 minutes
execution time) and a short proxy expiration time (5 minutes).

The initial job, whose proxy expired before it finished, remains in state
'Active'. It does not terminate after I have generated a new proxy. The
new job remains 'Idle'. In the GridManager log I see for the new job:

10/29 16:15:52 [7210] Updating classad values for 22522.0:
10/29 16:15:52 [7210]    GridftpUrlBase = "gsiftp://srvgrid01.offis.uni-oldenburg.de:20000";
10/29 16:15:52 [7210]    GlobusDelegationUri = "https://juggle-glob.fz-juelich.de:8443/wsrf/services/DelegationService?af8c8930-a5cb-11dd-afec-95dd6b917206";
10/29 16:15:52 [7210]    JobLeaseExpiration = 1225336547
10/29 16:15:52 [7210]    GridJobId = "gt4 uuid:76a37c40-a5cc-11dd-8880-b2c6a63cde4a"
10/29 16:15:52 [7210] leaving doContactSchedd()
10/29 16:15:52 [7210] (22522.0) doEvaluateState called: gmState GM_SUBMIT_ID_SAVE, globusState
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT_ID_SAVE -> GM_SUBMIT
10/29 16:15:52 [7210] (22522.0) proxy is about to expire
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT -> GM_PROXY_EXPIRED

Apparently it is NOT using the newly generated proxy, but most probably a
cached, expired version.
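
In case it helps anyone trying to reproduce this, here is a small Python
sketch for pulling the gm state transitions of one job out of a GridManager
log. It is only a sketch: the function name state_changes is my own, and the
regular expression assumes the line format seen in my log excerpt, which may
differ in other Condor versions. It also tolerates lines that a mail client
has wrapped after the arrow.

```python
import re

# Matches lines such as:
# 10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT -> GM_PROXY_EXPIRED
STATE_CHANGE = re.compile(
    r"\((?P<job>\d+\.\d+)\) gm state change: (?P<old>\w+) -> (?P<new>\w+)"
)


def state_changes(log_text, job_id):
    """Return the (old, new) gm state pairs logged for job_id, in order."""
    # Rejoin transitions that were wrapped right after the arrow.
    unwrapped = re.sub(r"->\s*\n\s*", "-> ", log_text)
    return [
        (m.group("old"), m.group("new"))
        for m in STATE_CHANGE.finditer(unwrapped)
        if m.group("job") == job_id
    ]


sample = """\
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT_ID_SAVE ->
GM_SUBMIT
10/29 16:15:52 [7210] (22522.0) proxy is about to expire
10/29 16:15:52 [7210] (22522.0) gm state change: GM_SUBMIT -> GM_PROXY_EXPIRED
"""

changes = state_changes(sample, "22522.0")
print(changes)
```

A job whose last transition ends in GM_PROXY_EXPIRED right after GM_SUBMIT
is showing the behaviour described here.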

After some more time the new job is held, with "Job creation failed. ...
NoSuchResourceException" in GridManager log.
The old job is also held, with "Staging error for RSL element
fileStageOut" (Caused by: Certificate ... expired).

I noticed that the first job does not have to keep running with an expired
proxy until its intended finish time for the described scenario to occur.
It also happens if the proxy expires but is regenerated before the
first job finishes. In other words, it seems that the new proxy is simply
not picked up by Condor.

4. I recreated my proxy manually with grid-proxy-init.
5. I submitted another Condor-G job to the same WS GRAM host, but THIS job
was held on fileStageIn with a message in GridmanagerLog that my proxy
certificate expired.
6. I removed this new job, submitted it again, same error.

It worked only after I removed all jobs and then resubmitted the new one.
So I guess the question is: how exactly does Condor-G cache the proxy
certificate and why does it prefer using an expired certificate instead of
the fresh one? Is this a bug?

Are you using condor_submit -spool?

No.

If so it would cache the file as part of the submission
but otherwise it should forward the new proxy to the old jobs
and the new jobs, as long as it is in the same file name as below.
If it didn't, then something is wrong.
What's the output of condor_q -l | grep -i x509userproxy
for the job in question?

It gives /tmp/x509up_u1002 for both jobs (as expected).
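
The same check can be scripted. Below is a minimal Python sketch that
extracts the proxy path from condor_q -l style output. The function name
proxy_files and the abbreviated sample ads are hypothetical, and the pattern
assumes the attribute is printed as a quoted string; real job ads contain
many more attributes.

```python
import re

# Case-insensitive, like `grep -i x509userproxy`, but anchored on the "="
# so that related attributes such as x509userproxysubject are skipped.
PROXY_ATTR = re.compile(
    r'^\s*x509userproxy\s*=\s*"([^"]*)"', re.IGNORECASE | re.MULTILINE
)


def proxy_files(classad_text):
    """Return the set of proxy file paths named in the given job ads."""
    return set(PROXY_ATTR.findall(classad_text))


# Hypothetical, abbreviated job ads for two jobs.
sample = '''\
ClusterId = 22521
x509userproxy = "/tmp/x509up_u1002"

ClusterId = 22522
x509userproxy = "/tmp/x509up_u1002"
'''

paths = proxy_files(sample)
print(paths)
```

A single path in the result means both jobs point at the same proxy file, so
a fresh proxy written to that path should be visible to both.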

Also, I noticed that your explanation is compatible with the one Jaime
Frey sent me in April (direct email):

If you overwrite the job's proxy file with a fresh proxy, Condor will pick it up
and start using it. To make sure you're updating the right file, you can check
the X509UserProxy attribute in the job ad.

However, Condor misbehaves for me. Can you please try reproducing it in
your environment?

Regards,
Jan Ploski

--
Dipl.-Inform. (FH) Jan Ploski
OFFIS
FuE Bereich Energie | R&D Division Energy
Escherweg 2  - 26121 Oldenburg - Germany
Phone/Fax: +49 441 9722 - 184 / 202
E-Mail: Jan.Ploski@xxxxxxxx
URL: http://www.offis.de

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.