Re: [Condor-users] Myproxy credential refreshment and Condor-G

I did some snooping around log files and user manuals and learned a
bit more about how things work under the hood, but I still have a
problem with Condor-G and long-running jobs.

I hope someone with experience with Condor-G and GT4 can send some
suggestions my way.

This is what I have done so far.

I have seen many unsuccessful attempts by Condor-G to refresh gt4 job
credentials, and no successful ones.  That's OK though, because the
manual says that credential refreshment is supported for gt2 jobs
only.  However, my environment is exclusively gt4, so I have to come
up with a workaround.  One suggestion was to get the delegation key
from the job description, generate an EPR using that key and the
DelegationURI, and run the Globus-supplied tool
'globus-credentials-refresh' to refresh the job's credentials.  This
seems to refresh the job's credentials on the execution host, but when
the job's output (stdout and stderr) is staged back to the Condor
submit host, I get the following error in the job's log file.

012 (8161.000.000) 06/18 17:06:37 Job was held.
       Globus error: Staging error for RSL element fileStageOut.
       Code 0 Subcode 0
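
For reference, the EPR generation step in the workaround above could
be sketched roughly as follows.  This is only an illustration, not the
tool's documented input format: the two namespace URIs and the
ReferenceProperties/DelegationKey layout are assumptions based on
GT 4.0 conventions and should be checked against the installation, and
build_delegation_epr, the host name and the key are hypothetical.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace URIs as used by GT 4.0; verify them locally.
WSA_NS = "http://schemas.xmlsoap.org/ws/2004/03/addressing"
DELEG_NS = "http://www.globus.org/08/2004/delegationService"


def build_delegation_epr(delegation_uri, delegation_key):
    """Return an EPR document (as a string) for a delegated credential.

    delegation_uri: the DelegationURI of the delegation service.
    delegation_key: the delegation key taken from the job description.
    """
    root = ET.Element("{%s}EndpointReference" % WSA_NS)
    address = ET.SubElement(root, "{%s}Address" % WSA_NS)
    address.text = delegation_uri
    # ASSUMPTION: the key is carried as a reference property.
    props = ET.SubElement(root, "{%s}ReferenceProperties" % WSA_NS)
    key = ET.SubElement(props, "{%s}DelegationKey" % DELEG_NS)
    key.text = delegation_key
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_delegation_epr(
        "https://exec.example.org:8443/wsrf/services/DelegationService",
        "hypothetical-delegation-key"))
```

The resulting document would be saved to a file and handed to the
refresh tool; the tool's command-line options are deliberately not
shown here.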

The container log on the execution host reveals that an RFT transfer
failed because of expired credentials, even though the job's
credentials are still valid (they were reported as successfully
refreshed by globus-credentials-refresh, and I have verified that
manually using grid-proxy-info on the execution host).

From what I have seen, my guess is the following.  When Condor submits
to Globus, it generates a proxy with a maximum lifetime of 12 hours,
delegates that proxy to the remote resource, and submits the job.  If
the job happens to complete within those 12 hours, everything works
out fine.  Otherwise, the job completes after more than 12 hours, at
which point Condor-G requests the transfer of the stdout and stderr
files (part of the job's description).  This transfer is performed by
the RFT service on the execution host, while the credentials for the
operation are provided by Condor-G.  The RFT transfer fails because
Condor-G has delegated expired credentials, hence the entries in the
container.log file.
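
The timing in this guess can be written down as a small check.  This
is just a sketch of the hypothesis, not of what Condor-G actually
does: the 12-hour lifetime is the figure guessed above, and the
function name, submit time and runtimes are made up for illustration.

```python
from datetime import datetime, timedelta

# ASSUMPTION: lifetime of the proxy delegated at submit time (the guess above).
DELEGATED_PROXY_LIFETIME = timedelta(hours=12)


def proxy_valid_at_stage_out(submit_time, job_runtime,
                             proxy_lifetime=DELEGATED_PROXY_LIFETIME):
    """Return True if the proxy delegated at submit time is still valid
    when the job finishes and stdout/stderr stage-out begins."""
    proxy_expiry = submit_time + proxy_lifetime
    stage_out_time = submit_time + job_runtime
    return stage_out_time < proxy_expiry


if __name__ == "__main__":
    submit = datetime(2007, 6, 18, 4, 0, 0)  # hypothetical submit time
    for hours in (10, 13):
        ok = proxy_valid_at_stage_out(submit, timedelta(hours=hours))
        print("runtime %2dh: proxy valid at stage-out = %s" % (hours, ok))
```

Under this model a 10-hour job stages out with a valid proxy and a
13-hour job does not, unless the credential used for the transfer is
re-delegated in the meantime.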

Is my understanding of the mechanics of the grid job execution process
correct, or am I misinterpreting the log files?

Why is Condor failing to refresh credentials for a job even when an
x509userproxy option, pointing to a long-lived proxy credential, is
provided in the job submit file?

And most importantly, why does it seem like Condor is delegating
expired credentials for the stderr and stdout transfer back to the
submit host?  Did something get cached when it shouldn't have been?

P.S. The version of Condor I am using is 6.8.5, and the execution
hosts have Globus 4.0.3 installed.

On 4/21/07, Nayden Markatchev <markatchev@xxxxxxxxx> wrote:

I am trying to understand how to refresh the proxy credential for a
long-running job that is submitted to a gt4 resource using Condor-G.
So far I have gathered from the Condor documentation that credential
refreshment is supported for gt2 jobs only, but a post from Jan 2006
by Jaime Frey says that it would be available in what was then the
next release.
Is this functionality in place in version 6.8.4 of Condor?

I did a bit of testing by submitting a long-running grid job with
short-lived credentials, and I noticed the following line in the
/tmp/GridmanagerLog.$USER file:
/tmp/GridmanagerLog.nayden.old:4/21 14:11:46 [20789]

This leads me to *think* that the gridmanager is attempting to renew
my job's credentials, but the operation is failing for some reason.
Any idea what the problem might be, provided that the feature is
supported in the current release?