
Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



That may have been a rabbit hole. Turn on D_FULLDEBUG; do you see this...

3/26 12:52:14 Changing state: Unclaimed -> Claimed
3/26 12:52:14 Warning: starting ClaimLease timer before lease duration set.
3/26 12:52:14 Started ClaimLease timer (12) w/ 1200 second lease duration
3/26 12:52:16 JobLeaseDuration defined in job ClassAd: 315360000
3/26 12:52:16 Resetting ClaimLease timer (12) with new duration

The initial timer is being set up before the lease duration is read, so it defaults to 1200 seconds. However, it looks like that timer is being reset with the proper value later.

It's not entirely clear that the claim times are working properly here. Have you played with shorter times, say 10 seconds, to see if the job gets kicked and another comes in?

Best,


matt

Ian Chesal wrote:
I scaled JobLeaseDuration back to ~10 years but I'm still getting the
warning that indicates Condor is setting it to 1200 seconds for me:

3/26 09:46:46 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid 22364) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40897
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40897"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 315360000
ClusterId = 40897
ProcId = 0
ARCJob = 40897
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/0900/40897/stdout.txt"
Err = "/data/ichesal/job/20090326/0900/40897/stderr.txt"

3/26 09:46:46 State change: Finished fetching work successfully
3/26 09:46:46 Changing state: Unclaimed -> Claimed
3/26 09:46:46 Warning: starting ClaimLease timer before lease duration set.
3/26 09:46:46 Remote job ID is 40897.0
3/26 09:46:46 Got universe "VANILLA" (5) from request classad
3/26 09:46:46 Changing activity: Idle -> Busy

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 12:45 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after
20 minutes

Wait. Matt, I get what you're saying now. I probably set
JobLeaseDuration too large.

I'll scale it back and try again. Thanks!

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 12:41 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after
20 minutes

No go. Setting JobLeaseDuration to 2147483640 changed nothing. The claim
was still lost at the 20 minute mark. I still see that warning.

Is there a way to set the lease duration outside of ALIVE_INTERVAL and
MAX_CLAIM_ALIVES_MISSED? I tried those and it still turns out to be 20
minutes for me.

Thanks.

- Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
Sent: Thursday, March 26, 2009 12:36 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Jobs fetched with a hook being killed after
20 minutes

if( c_lease_duration < 0 ) {
        if( c_type == CLAIM_COD ) {
                // COD claims have no lease by default.
                return;
        }
        dprintf( D_ALWAYS, "Warning: starting ClaimLease timer before "
                 "lease duration set.\n" );
        c_lease_duration = 1200;
}

Try 34 years; that's probably enough time, and might not overflow
anything...

BTW, 1200 seconds = 20 minutes.

Best,


matt

Ian Chesal wrote:
Here's an oddity. This is from the StartLog on a machine that just
started running a long job assigned via the fetch work hook:

3/26 09:06:38 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid 18978) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40895
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40895"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 2147483640
ClusterId = 40895
ProcId = 0
ARCJob = 40895
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/0900/40895/stdout.txt"
Err = "/data/ichesal/job/20090326/0900/40895/stderr.txt"

3/26 09:06:38 State change: Finished fetching work successfully
3/26 09:06:38 Changing state: Unclaimed -> Claimed
3/26 09:06:38 Warning: starting ClaimLease timer before lease duration set.
3/26 09:06:38 Remote job ID is 40895.0
3/26 09:06:38 Got universe "VANILLA" (5) from request classad
3/26 09:06:38 Changing activity: Idle -> Busy

Of particular interest is the line:

3/26 09:06:38 Warning: starting ClaimLease timer before lease duration set.

Which makes me think that no matter what I do in my job ad it'll have no
effect. :)

True?

I'll know in 20 minutes. This job sets a slightly-greater-than-68-year
JobLeaseDuration and sleeps for an hour. Let's see if it makes it past
the 20 minute mark.

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 11:50 AM
To: 'Condor-Users Mail List'
Subject: RE: Jobs fetched with a hook being killed after 20 minutes

Nudge with an update.

I tried setting:

MAX_CLAIM_ALIVES_MISSED = 12000

But the claim lease is still expiring at the 20 minute mark with the
"schedd gone?" message.

Looking at:

http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#6618

The last paragraph is ambiguous. It makes it sound like not having a
JobLeaseDuration in your job classad (and I don't, at least not one that
my fetch work script put there) "changes the duration of the claim
lease", but it doesn't say how. It just says:

    This has the further effect of changing the duration of a claim
    lease, the amount of time that the execution machine waits before
    dropping a claim due to missing keep alive messages.

Changes it to what exactly? If I look at the documentation for
ALIVE_INTERVAL (at
http://www.cs.wisc.edu/condor/manual/v7.2/3_5Policy_Configuration.html#22574)
it says a little more about the behaviour change, specifically:

    Initially, as when the condor_schedd starts up, the alive interval
    starts at the value set by the configuration variable ALIVE_INTERVAL.
    It may be modified when a job is started. The job's ClassAd attribute
    JobLeaseDuration is checked. If the value of JobLeaseDuration/3 is
    less than the current alive interval, then the alive interval is set
    to either this lower value or the imposed lowest limit on the alive
    interval of 10 seconds. Thus, the alive interval starts at
    ALIVE_INTERVAL and goes down, never up.
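If I'm reading that rule right, it boils down to something like this
(just my own model of the documented behaviour, sketched in Python; it
is not Condor's actual code):

# Hypothetical model of the documented alive-interval rule, not Condor source.
ALIVE_INTERVAL = 239   # seconds, from my config
FLOOR = 10             # documented lowest limit on the alive interval

def effective_alive_interval(job_lease_duration=None):
    interval = ALIVE_INTERVAL
    if job_lease_duration is not None and job_lease_duration // 3 < interval:
        # goes down, never up, and never below the 10 second floor
        interval = max(FLOOR, job_lease_duration // 3)
    return interval

print(effective_alive_interval())           # 239: no JobLeaseDuration set
print(effective_alive_interval(24 * 3600))  # 239: a day-long lease / 3 is 28800
print(effective_alive_interval(30))         # 10: 30 / 3 clamped to the floor

So unless a job hands it a tiny JobLeaseDuration, the alive interval
should stay at 239 seconds for me.
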
Okay, I'm not setting JobLeaseDuration so the next paragraph...uh...well
it _might_ apply. It's not made 100% clear, that's for sure:

    If a claim lease expires, the condor_startd will drop the claim.

Yup. That's definitely what's happening to me.

    The length of the claim lease is the job's ClassAd attribute
    JobLeaseDuration. JobLeaseDuration defaults to 20 minutes time, except
    when explicitly set within the job's submit description file.

I don't have a submit description file so does that mean the default for
my fetch work jobs is 20 minutes?

    If JobLeaseDuration is explicitly set to 0, or it is not set as may be
    the case for a Web Services job that does not define the attribute,
    then JobLeaseDuration is given the Undefined value.

Ah! Okay, that has got to be my case, right? I'm not setting a
JobLeaseDuration in my fetch work script. So I should expect that the
rest of this paragraph defines my lease behaviour:

    Further, when undefined, the claim lease duration is calculated with
    MAX_CLAIM_ALIVES_MISSED * alive interval. The alive interval is the
    current value, as sent by the condor_schedd. If the condor_schedd
    reduces the current alive interval, it does not update the
    condor_startd.

Hmm. Well, I tried adjusting MAX_CLAIM_ALIVES_MISSED to something
ridiculously large and my lease still expired at the 20 minute mark.
Even with a schedd running. It looks like the startd doesn't get the
alive interval from the ALIVE_INTERVAL config file setting if the schedd
didn't give it one. That's my guess.
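Just to put numbers on that: if the documented fallback really applied
with the value I tried, the lease should be enormous (quick Python
arithmetic on my part, not anything Condor computes for you):

# Back-of-the-envelope check of the documented fallback, using my values.
ALIVE_INTERVAL = 239             # seconds, from my config
MAX_CLAIM_ALIVES_MISSED = 12000  # the ridiculously large value I tried

lease = MAX_CLAIM_ALIVES_MISSED * ALIVE_INTERVAL
print(lease)            # 2868000 seconds, roughly 33 days
print(lease / 86400.0)  # 33.19...

Yet the claim still drops at the 20 minute mark, so this doesn't look
like the code path that's actually governing my lease.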

I'm going to try setting JobLeaseDuration in my fetch work classad
output to see if that helps. But the documentation makes it all sound
pretty nebulous. Maybe it will, maybe it won't...
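For what it's worth, the change amounts to one more attribute in the
ClassAd my hook prints. A stripped-down sketch of what such a fetch work
hook could look like (illustrative Python only; the real arc_job_fetch
does more, and the job attributes mirror the ones from my StartLog
output):

#!/usr/bin/env python
# Simplified, hypothetical fetch-work hook: print a job ClassAd on stdout,
# now including JobLeaseDuration. This is not the real arc_job_fetch.
import sys

job = [
    ('Cmd',              '"/tools/arc/scripts/arc_execute.sh"'),
    ('Owner',            '"ichesal"'),
    ('Args',             '"40897"'),
    ('JobUniverse',      '5'),
    ('Requirements',     'True'),
    ('JobLeaseDuration', '315360000'),   # e.g. ~10 years; the attribute I've been missing
    ('ClusterId',        '40897'),
    ('ProcId',           '0'),
    ('ARCJob',           '40897'),
    ('IWD',              '"/data/ichesal/arc/sleeper"'),
    ('Out',              '"/data/ichesal/job/20090326/0900/40897/stdout.txt"'),
    ('Err',              '"/data/ichesal/job/20090326/0900/40897/stderr.txt"'),
]

# The startd hands the slot ClassAd to the hook on stdin (as I understand
# the hook protocol); here it is only used for the kind of DEBUG line you
# can see in my StartLog output.
slot_ad = sys.stdin.read()
sys.stderr.write('DEBUG: read %d bytes of slot ad\n' % len(slot_ad))

# Emit the job ClassAd, one attribute per line.
for name, value in job:
    sys.stdout.write('%s = %s\n' % (name, value))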

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Wednesday, March 25, 2009 5:37 PM
To: 'Condor-Users Mail List'
Subject: Jobs fetched with a hook being killed after 20 minutes

In a nutshell: they're being axed because the startd thinks the claim
has timed out. From the StartLog:

3/25 13:01:15 Return from HandleReq <HandleChildAliveCommand> (handler: 0.000s, sec: 0.001s)
3/25 13:01:45 State change: claim lease expired (condor_schedd gone?)
3/25 13:01:45 Changing state and activity: Claimed/Busy -> Preempting/Killing
3/25 13:01:45 Calling Handler <receiveJobClassAdUpdate>
3/25 13:01:45 Return from Handler <receiveJobClassAdUpdate>
3/25 13:01:45 DaemonCore: pid 21416 exited with status 0, invoking reaper 3 <reaper>
3/25 13:01:45 Starter pid 21416 exited with status 0
3/25 13:01:45 State change: starter exited
3/25 13:01:45 State change: No preempting claim, returning to owner
3/25 13:01:45 Changing state and activity: Preempting/Killing -> Owner/Idle
3/25 13:01:45 State change: IS_OWNER is false
3/25 13:01:45 Changing state: Owner -> Unclaimed

The second line in that output says it all, really. I did not have a
schedd running in this pool; I didn't think I needed one because hooks
were fetching the work for me. I did start one, but that hasn't stopped
the problem from occurring. The lease is still expiring.

Right now the jobs are not passing a JobLeaseDuration attribute when
the
fetch work hook assigns them to the machine.

I have no other hooks currently defined. Only a fetch work hook.

From my configs:

ALIVE_INTERVAL = 239
MAX_CLAIM_ALIVES_MISSED = 6
MaxJobRetirementTime = 2147483640
PREEMPT = False

I set no JobLeaseDuration default in any config files, so that *should*
mean it's undefined. My lease duration should therefore be 6 * 239 =
1434 seconds =~ 24 minutes. But I'm seeing the claim end at exactly 20
minutes, which makes me think JobLeaseDuration is defaulting to 20
minutes for my jobs. Either way, I'd like to stop the claim from
expiring.

When I'm fetching jobs with a hook, should I make MAX_CLAIM_ALIVES_MISSED
some ridiculously large integer? Is there a more elegant way to prevent
the claim from expiring? This approach seems a mite hack-ish.

Thanks!

- Ian



Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/