
Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



Nudge with an update.

I tried setting:

MAX_CLAIM_ALIVES_MISSED = 12000

But the claim lease is still expiring at the 20 minute mark with the
"schedd gone?" message.

Looking at:
http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#6618

The last paragraph is ambiguous. It makes it sound like not having a
JobLeaseDuration in your job classad (and I don't, at least not one my
fetch work script put there) "changes the duration of the claim lease",
but it doesn't say how. It just says:

> This has the further effect of changing the duration of a claim lease,
> the amount of time that the execution machine waits before dropping a
> claim due to missing keep alive messages.

Changes it to what exactly? If I look at the documentation for
ALIVE_INTERVAL (at
http://www.cs.wisc.edu/condor/manual/v7.2/3_5Policy_Configuration.html#22574)
it says a little more about the behaviour change, specifically:

> Initially, as when the condor_schedd starts up, the alive interval
> starts at the value set by the configuration variable ALIVE_INTERVAL.
> It may be modified when a job is started. The job's ClassAd attribute
> JobLeaseDuration is checked. If the value of JobLeaseDuration/3 is
> less than the current alive interval, then the alive interval is set
> to either this lower value or the imposed lowest limit on the alive
> interval of 10 seconds. Thus, the alive interval starts at
> ALIVE_INTERVAL and goes down, never up.

Okay, I'm not setting JobLeaseDuration, so the next paragraph...uh...well
it _might_ apply. It's not made 100% clear, that's for sure:

> If a claim lease expires, the condor_startd will drop the claim.

Yup. That's definitely what's happening to me.

> The length of the claim lease is the job's ClassAd attribute
> JobLeaseDuration. JobLeaseDuration defaults to 20 minutes time, except
> when explicitly set within the job's submit description file.

I don't have a submit description file so does that mean the default for
my fetch work jobs is 20 minutes?

> If JobLeaseDuration is explicitly set to 0, or it is not set as may be
> the case for a Web Services job that does not define the attribute,
> then JobLeaseDuration is given the Undefined value.

Ah! Okay, this has got to be my case, right? I'm not setting a
JobLeaseDuration in my fetch work script. So I should expect that the
rest of this paragraph defines my lease behaviour:

> Further, when undefined, the claim lease duration is calculated with
> MAX_CLAIM_ALIVES_MISSED * alive interval. The alive interval is the
> current value, as sent by the condor_schedd. If the condor_schedd
> reduces the current alive interval, it does not update the
> condor_startd.

Hmm. Well, I tried adjusting MAX_CLAIM_ALIVES_MISSED to something
ridiculously large and my lease still expired at the 20 minute mark.
Even with a schedd running. It looks like the starter doesn't get the
alive interval from the ALIVE_INTERVAL config file setting if the schedd
didn't give it one. That's my guess.
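
To pin down what I'd expect from the manual's wording, here is a small
Python sketch of the rule as I read it. The function name and its
parameters are mine, not anything from Condor, and it only restates the
documented arithmetic. It says nothing about what the startd really does
when no schedd has supplied an alive interval.

```python
from typing import Optional


def claim_lease_seconds(job_lease_duration: Optional[int],
                        alive_interval: int,
                        max_claim_alives_missed: int) -> int:
    """Claim lease length per the quoted manual text (hypothetical helper)."""
    # A JobLeaseDuration of 0, or no attribute at all, counts as Undefined.
    if job_lease_duration:
        return job_lease_duration
    # Undefined: fall back to MAX_CLAIM_ALIVES_MISSED * alive interval.
    return max_claim_alives_missed * alive_interval


# Values from my config: ALIVE_INTERVAL = 239, MAX_CLAIM_ALIVES_MISSED = 6
print(claim_lease_seconds(None, 239, 6))      # 1434 seconds, about 24 minutes
print(claim_lease_seconds(None, 239, 12000))  # 2868000 seconds
print(claim_lease_seconds(1200, 239, 6))      # 1200 seconds, i.e. 20 minutes
```

None of the undefined cases comes out at 1200 seconds, which is why the
20 minute expiry looks like a default JobLeaseDuration being applied
rather than the MAX_CLAIM_ALIVES_MISSED calculation.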

I'm going to try setting JobLeaseDuration in my fetch work classad
output to see if that helps. But the documentation makes it all sound
pretty nebulous. Maybe it will, maybe it won't...
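
For what it's worth, a fetch work hook that adds the attribute might
look roughly like the Python sketch below. Hedge heavily here: the
helper name, the command, owner, working directory and the 24 hour
lease value are all made up for illustration, and whether the startd
honours JobLeaseDuration for fetched jobs is exactly what remains to be
tested.

```python
#!/usr/bin/env python
import sys


def format_classad(attrs):
    """Render a dict as 'Attr = value' lines (hypothetical helper).

    Only strings and integers are handled in this sketch.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, str):
            # ClassAd strings are double-quoted; escape embedded quotes.
            value = '"%s"' % value.replace('"', '\\"')
        lines.append("%s = %s" % (name, value))
    return "\n".join(lines) + "\n"


# Illustrative job ad only; every value here is an assumption.
job = {
    "Cmd": "/bin/hostname",
    "Owner": "ian",
    "JobUniverse": 5,            # 5 is the vanilla universe
    "Iwd": "/tmp",
    "JobLeaseDuration": 86400,   # seconds; assumed value, not a recommendation
}

# The startd reads the job ClassAd from the hook's standard output.
sys.stdout.write(format_classad(job))
```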

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Wednesday, March 25, 2009 5:37 PM
To: 'Condor-Users Mail List'
Subject: Jobs fetched with a hook being killed after 20 minutes

In a nutshell: they're being axed because the startd thinks the claim
has timed out. From the StartLog:

3/25 13:01:15 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.001s)
3/25 13:01:45 State change: claim lease expired (condor_schedd gone?)
3/25 13:01:45 Changing state and activity: Claimed/Busy -> Preempting/Killing
3/25 13:01:45 Calling Handler <receiveJobClassAdUpdate>
3/25 13:01:45 Return from Handler <receiveJobClassAdUpdate>
3/25 13:01:45 DaemonCore: pid 21416 exited with status 0, invoking reaper 3 <reaper>
3/25 13:01:45 Starter pid 21416 exited with status 0
3/25 13:01:45 State change: starter exited
3/25 13:01:45 State change: No preempting claim, returning to owner
3/25 13:01:45 Changing state and activity: Preempting/Killing -> Owner/Idle
3/25 13:01:45 State change: IS_OWNER is false
3/25 13:01:45 Changing state: Owner -> Unclaimed

Second line in that output says it all really. I did not have a schedd
running in this pool. Didn't think I needed one because hooks were
fetching the work for me. I did start one but that hasn't stopped the
problem from occurring. The lease is still expiring.

Right now the jobs are not passing a JobLeaseDuration attribute when the
fetch work hook assigns them to the machine.

I have no other hooks currently defined. Only a fetch work hook.

From my configs:

ALIVE_INTERVAL = 239
MAX_CLAIM_ALIVES_MISSED = 6
MaxJobRetirementTime = 2147483640
PREEMPT = False

I set no JobLeaseDuration default in any config files so that *should*
mean it's undefined. So my lease duration should be 6 * 239 = 1434
seconds, or about 24 minutes. But I'm seeing the claim end at exactly 20
minutes, which makes me think JobLeaseDuration is defaulting to 20
minutes for my jobs. Either way, I'd like to stop the claim from
expiring.

When I'm fetching jobs with a hook should I make MAX_CLAIM_ALIVES_MISSED
be some ridiculously large integer? Is there a more elegant way to
prevent the claim from expiring? This approach seems a mite hack-ish.

Thanks!

- Ian
