
Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



> That may have been a rabbit hole. Turn on D_FULLDEBUG, do you see
> this...
>
> The initial timer is being set up before the lease duration is read, so
> it defaults to 1200 seconds. However, it looks like that timer is being
> reset with the proper value later.

Cool. But it doesn't look like the changed value is getting propagated.

> It's not entirely clear that the claim times are working properly here.
> Have you played with shorter times, say 10 seconds, and seen the job get
> kicked and another come in?

Just tried it with a 60-second lease:

3/26 11:04:34 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid 29943) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40899
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40899"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 60
ClusterId = 40899
ProcId = 0
ARCJob = 40899
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/1100/40899/stdout.txt"
Err = "/data/ichesal/job/20090326/1100/40899/stderr.txt"

3/26 11:04:34 State change: Finished fetching work successfully
3/26 11:04:34 Changing state: Unclaimed -> Claimed
3/26 11:04:34 Warning: starting ClaimLease timer before lease duration set.
3/26 11:04:34 Remote job ID is 40899.0
3/26 11:04:34 Got universe "VANILLA" (5) from request classad
3/26 11:04:34 Changing activity: Idle -> Busy

And 6.5 minutes later it's still running. No claim expired messages in
the StartLog.
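
The arithmetic behind that observation can be sketched as below. This is only an illustration, not anything from Condor itself: the helper names `parse_stamp` and `lease_overdue_seconds` are made up, the year is assumed to be 2009, and the 6.5 minutes is assumed to be measured from the "Unclaimed -> Claimed" line. It shows that the job is well past a 60-second lease but still inside the 1200-second default, which would fit the new value not being propagated.

```python
from datetime import datetime, timedelta

def parse_stamp(line, year=2009):
    """Parse the leading 'M/D HH:MM:SS' stamp of a StartLog line.

    The log carries no year, so one is assumed (hypothetical default: 2009).
    """
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{year}/{date_part} {time_part}",
                             "%Y/%m/%d %H:%M:%S")

def lease_overdue_seconds(claim_line, now, lease_seconds):
    """Seconds by which 'now' exceeds claim time + lease (negative: not yet due)."""
    claimed_at = parse_stamp(claim_line)
    return (now - (claimed_at + timedelta(seconds=lease_seconds))).total_seconds()

# Claim line copied from the StartLog excerpt above.
claim = "3/26 11:04:34 Changing state: Unclaimed -> Claimed"
# "6.5 minutes later", assumed to be counted from the claim.
now = parse_stamp(claim) + timedelta(minutes=6, seconds=30)

overdue_60 = lease_overdue_seconds(claim, now, 60)      # JobLeaseDuration = 60
overdue_1200 = lease_overdue_seconds(claim, now, 1200)  # 1200-second default
print(overdue_60, overdue_1200)
```

A positive result for the 60-second lease together with a negative one for 1200 seconds does not prove which value the timer is using, but it is what one would expect to see if the claim is still running on the default.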

-  Ian
