
Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



if( c_lease_duration < 0 ) {
	if( c_type == CLAIM_COD ) {
			// COD claims have no lease by default.
		return;
	}
	dprintf( D_ALWAYS, "Warning: starting ClaimLease timer before "
			 "lease duration set.\n" );
	c_lease_duration = 1200;
}

Try 34 years, that's probably enough time, and might not overflow anything...

BTW, 1200 seconds = 20 minutes.
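
To make the fallback concrete, here is a standalone sketch of the same branch. The function name, the constant names, and the -1 "no timer" return value are made up for illustration; only the 1200-second default and the COD exception come from the snippet above. It also assumes the timer then simply runs for c_lease_duration seconds, which the excerpt does not show.

```cpp
// Illustrative only: lease_timer_seconds(), kNoLeaseTimer and
// kFallbackLeaseSecs are hypothetical names, not part of the Condor source.
const int kNoLeaseTimer = -1;        // stands in for the early "return;"
const int kFallbackLeaseSecs = 1200; // the hard-coded default (20 minutes)

// Returns how long the claim lease timer would run, in seconds,
// or kNoLeaseTimer when no timer would be started.
int lease_timer_seconds(int c_lease_duration, bool is_cod_claim) {
    if (c_lease_duration < 0) {
        if (is_cod_claim) {
            return kNoLeaseTimer;   // COD claims have no lease by default
        }
        return kFallbackLeaseSecs;  // lease never set: fall back to 1200
    }
    return c_lease_duration;        // lease was set: use it as-is
}
```

If the lease duration is never set on the claim, lease_timer_seconds(-1, false) comes out at 1200 regardless of what MAX_CLAIM_ALIVES_MISSED says.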

Best,


matt

Ian Chesal wrote:
Here's an oddity. This is from the StartLog on a machine that just
started running a long fetch work assigned job:

3/26 09:06:38 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
18978) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40895
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40895"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 2147483640
ClusterId = 40895
ProcId = 0
ARCJob = 40895
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/0900/40895/stdout.txt"
Err = "/data/ichesal/job/20090326/0900/40895/stderr.txt"

3/26 09:06:38 State change: Finished fetching work successfully
3/26 09:06:38 Changing state: Unclaimed -> Claimed
3/26 09:06:38 Warning: starting ClaimLease timer before lease duration
set.
3/26 09:06:38 Remote job ID is 40895.0
3/26 09:06:38 Got universe "VANILLA" (5) from request classad
3/26 09:06:38 Changing activity: Idle -> Busy

Of particular interest is the line:

3/26 09:06:38 Warning: starting ClaimLease timer before lease duration
set.

Which makes me think that no matter what I do in my job ad it'll have no
effect. :)

True?

I'll know in 20 minutes. This job set a slightly-greater-than 68 year
JobLeaseDuration and sleeps for an hour. Let's see if it makes it past
the 20 minute mark.

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 11:50 AM
To: 'Condor-Users Mail List'
Subject: RE: Jobs fetched with a hook being killed after 20 minutes

Nudge with an update.

I tried setting:

MAX_CLAIM_ALIVES_MISSED = 12000

But the claim lease is still expiring at the 20 minute mark with the
"schedd gone?" message.

Looking at:
http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#6618

The last paragraph is ambiguous. It makes it sound like not having a
JobLeaseDuration in your job classad (and I don't, at least not one my
fetch work script put there) "changes the duration of the claim lease",
but it doesn't say how. It just says:

This has the further effect of changing the duration of a claim lease,
the amount of time that the execution machine waits before dropping a
claim due to missing keep alive messages.

Changes it to what exactly? If I look at the documentation for
ALIVE_INTERVAL (at
http://www.cs.wisc.edu/condor/manual/v7.2/3_5Policy_Configuration.html#22574)
it says a little more about the behaviour change, specifically:

Initially, as when the condor_schedd starts up, the alive interval
starts at the value set by the configuration variable ALIVE_INTERVAL.
It may be modified when a job is started. The job's ClassAd attribute
JobLeaseDuration is checked. If the value of JobLeaseDuration/3 is less
than the current alive interval, then the alive interval is set to
either this lower value or the imposed lowest limit on the alive
interval of 10 seconds. Thus, the alive interval starts at
ALIVE_INTERVAL and goes down, never up.
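
Read literally, the rule quoted above could be sketched like this. This is not Condor source: the function name is hypothetical, the integer division is an assumption, and the outer std::min is only there to honour the "never up" wording when the current interval is already below the 10-second floor.

```cpp
#include <algorithm>

// Hypothetical helper, not Condor source. The 10-second floor is the
// "imposed lowest limit" named in the manual text.
const int kMinAliveInterval = 10;

// Returns the alive interval after a job with the given JobLeaseDuration
// starts. The result is never larger than current_interval.
int adjusted_alive_interval(int current_interval, int job_lease_duration) {
    int candidate = job_lease_duration / 3;  // assumed integer division
    // Lower the interval toward candidate, but not below the floor;
    // the outer min keeps the interval from ever going up.
    return std::min(current_interval,
                    std::max(candidate, kMinAliveInterval));
}
```

With ALIVE_INTERVAL = 239, a JobLeaseDuration of 300 would drop the interval to 100, while a very large JobLeaseDuration would leave it at 239.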

Okay, I'm not setting JobLeaseDuration so the next paragraph...uh...well
it _might_ apply. It's not made 100% clear, that's for sure:

If a claim lease expires, the condor_startd will drop the claim.

Yup. That's definitely what's happening to me.

The length of the claim lease is the job's ClassAd attribute
JobLeaseDuration. JobLeaseDuration defaults to 20 minutes time, except
when explicitly set within the job's submit description file.

I don't have a submit description file so does that mean the default for
my fetch work jobs is 20 minutes?

If JobLeaseDuration is explicitly set to 0, or it is not set as may be
the case for a Web Services job that does not define the attribute, then
JobLeaseDuration is given the Undefined value.

Ah! Okay, that has got to be my case, right? I'm not setting a
JobLeaseDuration in my fetch work script. So I should expect that the
rest of this paragraph defines my lease behaviour:

Further, when undefined, the claim lease duration is calculated with
MAX_CLAIM_ALIVES_MISSED * alive interval. The alive interval is the
current value, as sent by the condor_schedd. If the condor_schedd
reduces the current alive interval, it does not update the
condor_startd.
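
As a rough sketch of what the manual text describes (and only that; it says nothing about what the startd does for fetched work), the calculation would look something like the following. The function name and the use of a negative value to stand for Undefined are assumptions made for the example.

```cpp
// Hypothetical helper, not Condor source. A negative job_lease_duration
// stands in for the ClassAd "Undefined" value (an assumption for this sketch).
long long claim_lease_seconds(long long job_lease_duration,
                              long long max_claim_alives_missed,
                              long long alive_interval) {
    if (job_lease_duration >= 0) {
        // Defined: the claim lease is the job's JobLeaseDuration.
        return job_lease_duration;
    }
    // Undefined: MAX_CLAIM_ALIVES_MISSED * alive interval.
    return max_claim_alives_missed * alive_interval;
}
```

With ALIVE_INTERVAL = 239 and MAX_CLAIM_ALIVES_MISSED = 6 this gives 1434 seconds, and with MAX_CLAIM_ALIVES_MISSED = 12000 it gives 2868000 seconds, so by the documentation alone neither setting should produce a 20 minute lease.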

Hmm. Well, I tried adjusting MAX_CLAIM_ALIVES_MISSED to something
ridiculously large and my lease still expired at the 20 minute mark.
Even with a schedd running. It looks like the starter doesn't get the
alive interval from the ALIVE_INTERVAL config file setting if the schedd
didn't give it one. That's my guess.

I'm going to try setting JobLeaseDuration in my fetch work classad
output to see if that helps. But the documentation makes it all sound
pretty nebulous. Maybe it will, maybe it won't...

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Wednesday, March 25, 2009 5:37 PM
To: 'Condor-Users Mail List'
Subject: Jobs fetched with a hook being killed after 20 minutes

In a nutshell: they're being axed because the startd thinks the claim
has timed out. From the StartLog:

3/25 13:01:15 Return from HandleReq <HandleChildAliveCommand> (handler:
0.000s, sec: 0.001s)
3/25 13:01:45 State change: claim lease expired (condor_schedd gone?)
3/25 13:01:45 Changing state and activity: Claimed/Busy ->
Preempting/Killing
3/25 13:01:45 Calling Handler <receiveJobClassAdUpdate>
3/25 13:01:45 Return from Handler <receiveJobClassAdUpdate>
3/25 13:01:45 DaemonCore: pid 21416 exited with status 0, invoking
reaper 3 <reaper>
3/25 13:01:45 Starter pid 21416 exited with status 0
3/25 13:01:45 State change: starter exited
3/25 13:01:45 State change: No preempting claim, returning to owner
3/25 13:01:45 Changing state and activity: Preempting/Killing ->
Owner/Idle
3/25 13:01:45 State change: IS_OWNER is false
3/25 13:01:45 Changing state: Owner -> Unclaimed

Second line in that output says it all really. I did not have a schedd
running in this pool. Didn't think I needed one because hooks were
fetching the work for me. I did start one but that hasn't stopped the
problem from occurring. The lease is still expiring.

Right now the jobs are not passing a JobLeaseDuration attribute when the
fetch work hook assigns them to the machine.

I have no other hooks currently defined. Only a fetch work hook.

From my configs:

ALIVE_INTERVAL = 239
MAX_CLAIM_ALIVES_MISSED = 6
MaxJobRetirementTime = 2147483640
PREEMPT = False

I set no JobLeaseDuration default in any config files so that *should*
mean it's undefined. So my lease duration should be 6 * 239 = 1434
seconds =~ 24 minutes. But I'm seeing the claim end at exactly 20
minutes, which makes me think JobLeaseDuration is defaulting to 20
minutes for my jobs. Either way, I'd like to stop the claim from
expiring.

When I'm fetching jobs with a hook should I make MAX_CLAIM_ALIVES_MISSED
be some ridiculously large integer? Is there a more elegant way to
prevent the claim from expiring? This approach seems a mite hack-ish.

Thanks!

- Ian
