
Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



Wait. Matt, I get what you're saying now. I probably set
JobLeaseDuration too large.

I'll scale it back, try again. Thanks!

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 12:41 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after 20 minutes

No go. Setting JobLeaseDuration to 2147483640 changed nothing. The claim
was still lost at the 20 minute mark. I still see that warning.

Is there a way to set the lease duration other than through
ALIVE_INTERVAL and MAX_CLAIM_ALIVES_MISSED? I tried those and it still
turns out to be 20 minutes for me.

Thanks.

- Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Matthew Farrellee
Sent: Thursday, March 26, 2009 12:36 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes

if( c_lease_duration < 0 ) {
        if( c_type == CLAIM_COD ) {
                // COD claims have no lease by default.
                return;
        }
        // No lease duration was set: warn and fall back to 1200 seconds.
        dprintf( D_ALWAYS, "Warning: starting ClaimLease timer before "
                 "lease duration set.\n" );
        c_lease_duration = 1200;
}

Try 34 years, that's probably enough time, and might not overflow
anything...

BTW, 1200 seconds = 20 minutes.
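
For readers following along, the fallback in the snippet above can be modelled in a few lines of Python. This is only a sketch of the quoted logic, not the startd's code: the function name `effective_lease` and the `is_cod` flag are hypothetical stand-ins for the `c_lease_duration` and `c_type == CLAIM_COD` checks.

```python
DEFAULT_LEASE_SECONDS = 1200  # fallback value from the quoted snippet


def effective_lease(lease_duration, is_cod=False):
    """Return the lease (seconds) the timer would use, or None if no timer.

    Mirrors the quoted check: a negative duration means "not set yet".
    """
    if lease_duration < 0:
        if is_cod:
            # COD claims have no lease by default, so no timer is started.
            return None
        # Any other claim falls back to 1200 seconds (with a warning).
        return DEFAULT_LEASE_SECONDS
    return lease_duration


# Minutes until a claim with an unset lease would expire.
print(effective_lease(-1) / 60)
```

Under this reading, any non-COD claim whose lease duration has not been set by the time the timer starts is dropped after 20 minutes, which matches the StartLog warning quoted below.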

Best,


matt

Ian Chesal wrote:
> Here's an oddity. This is from the StartLog on a machine that just
> started running a long fetch work assigned job:
>
> 3/26 09:06:38 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch
> (pid 18978) printed to stderr: DEBUG: Slot State="Unclaimed"
> Found job 40895
> Cmd = "/tools/arc/scripts/arc_execute.sh"
> Owner = "ichesal"
> Args = "40895"
> JobUniverse = 5
> Requirements = True
> JobLeaseDuration = 2147483640
> ClusterId = 40895
> ProcId = 0
> ARCJob = 40895
> IWD = "/data/ichesal/arc/sleeper"
> Out = "/data/ichesal/job/20090326/0900/40895/stdout.txt"
> Err = "/data/ichesal/job/20090326/0900/40895/stderr.txt"
>
> 3/26 09:06:38 State change: Finished fetching work successfully
> 3/26 09:06:38 Changing state: Unclaimed -> Claimed
> 3/26 09:06:38 Warning: starting ClaimLease timer before lease duration
> set.
> 3/26 09:06:38 Remote job ID is 40895.0
> 3/26 09:06:38 Got universe "VANILLA" (5) from request classad
> 3/26 09:06:38 Changing activity: Idle -> Busy
>
> Of particular interest is the line:
>
> 3/26 09:06:38 Warning: starting ClaimLease timer before lease duration
> set.
>
> Which makes me think that no matter what I do in my job ad it'll have
> no effect. :)
>
> True?
>
> I'll know in 20 minutes. This job set a slightly-greater-than 68 year
> JobLeaseDuration and sleeps for an hour. Let's see if it makes it past
> the 20 minute mark.
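
The numbers in play are easy to check. The sketch below converts the lease value from the job ad to years and, purely as an assumption, tests whether `now + lease` would still fit in a signed 32-bit integer. Nothing in the thread confirms that the startd stores the expiry that way; `fits_in_int32`, `seconds_to_years`, and the 365.25-day year are illustrative choices.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # Julian year, an approximation


def seconds_to_years(seconds):
    return seconds / SECONDS_PER_YEAR


def fits_in_int32(now, lease):
    # Assumption: the expiry is an absolute time computed as now + lease.
    return now + lease <= INT32_MAX


lease = 2147483640  # JobLeaseDuration from the job ad above
now = int(datetime(2009, 3, 26, tzinfo=timezone.utc).timestamp())

print(round(seconds_to_years(lease), 2))  # lease expressed in years
print(INT32_MAX - lease)                  # distance from 2**31 - 1
print(fits_in_int32(now, lease))          # absolute expiry, huge lease
print(fits_in_int32(now, 24 * 60 * 60))   # one-day lease, for comparison
```

If that assumption holds, a lease this close to 2**31 - 1 leaves no room for the current time to be added to it, whereas a lease of days or months does.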
>
> - Ian
>
> -----Original Message-----
> From: Ian Chesal
> Sent: Thursday, March 26, 2009 11:50 AM
> To: 'Condor-Users Mail List'
> Subject: RE: Jobs fetched with a hook being killed after 20 minutes
>
> Nudge with an update.
>
> I tried setting:
>
> MAX_CLAIM_ALIVES_MISSED = 12000
>
> But the claim lease is still expiring at the 20 minute mark with the
> "schedd gone?" message.
>
> Looking at:
>
> http://www.cs.wisc.edu/condor/manual/v7.2/2_15Special_Environment.html#6618
>
> The last paragraph is ambiguous. It makes it sound like not having a
> JobLeaseDuration in your job classad (and I don't, at least not one my
> fetch work script put there) "changes the duration of the claim lease",
> but it doesn't say how. It just says:
>
>> This has the further effect of changing the duration of a claim lease,
>> the amount of time that the execution machine waits before dropping a
>> claim due to missing keep alive messages.
>
> Changes it to what exactly? If I look at the documentation for
> ALIVE_INTERVAL (at
> http://www.cs.wisc.edu/condor/manual/v7.2/3_5Policy_Configuration.html#22574)
> it says a little more about the behaviour change, specifically:
>
>> Initially, as when the condor_schedd starts up, the alive interval
>> starts at the value set by the configuration variable ALIVE_INTERVAL.
>> It may be modified when a job is started. The job's ClassAd attribute
>> JobLeaseDuration is checked. If the value of JobLeaseDuration/3 is
>> less than the current alive interval, then the alive interval is set
>> to either this lower value or the imposed lowest limit on the alive
>> interval of 10 seconds. Thus, the alive interval starts at
>> ALIVE_INTERVAL and goes down, never up.
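
Read literally, the quoted rule amounts to a small clamp. The function below is a sketch of that reading only; `next_alive_interval` is a hypothetical name, and the real condor_schedd may round or order the checks differently.

```python
MIN_ALIVE_INTERVAL = 10  # lowest limit stated in the manual text, in seconds


def next_alive_interval(current, job_lease_duration=None):
    """Apply the manual's rule: the interval can shrink, never grow."""
    if job_lease_duration is None:
        # No JobLeaseDuration in the job ad: nothing to adjust.
        return current
    candidate = job_lease_duration / 3
    if candidate < current:
        # Use the lower value, but not below the 10 second floor,
        # and never above the current interval.
        return min(current, max(candidate, MIN_ALIVE_INTERVAL))
    return current


print(next_alive_interval(239, 300))         # 300/3 is below 239, so it drops
print(next_alive_interval(239, 2147483640))  # a huge lease leaves 239 unchanged
print(next_alive_interval(239, 15))          # 15/3 is clamped to the floor
```

Note that, under this rule, a very large JobLeaseDuration never changes the alive interval at all; it only matters as the claim lease itself.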
>
> Okay, I'm not setting JobLeaseDuration so the next paragraph...uh...well
> it _might_ apply. It's not made 100% clear, that's for sure:
>
>> If a claim lease expires, the condor_startd will drop the claim.
>
> Yup. That's definitely what's happening to me.
>
>> The length of the claim lease is the job's ClassAd attribute
>> JobLeaseDuration. JobLeaseDuration defaults to 20 minutes time, except
>> when explicitly set within the job's submit description file.
>
> I don't have a submit description file so does that mean the default
> for my fetch work jobs is 20 minutes?
>
>> If JobLeaseDuration is explicitly set to 0, or it is not set as may be
>> the case for a Web Services job that does not define the attribute,
>> then JobLeaseDuration is given the Undefined value.
>
> Ah! Okay, that has got to be my case, right? I'm not setting a
> JobLeaseDuration in my fetch work script. So I should expect that the
> rest of this paragraph defines my lease behaviour:
>
>> Further, when undefined, the claim lease duration is calculated with
>> MAX_CLAIM_ALIVES_MISSED * alive interval. The alive interval is the
>> current value, as sent by the condor_schedd. If the condor_schedd
>> reduces the current alive interval, it does not update the
>> condor_startd.
>
> Hmm. Well, I tried adjusting MAX_CLAIM_ALIVES_MISSED to something
> ridiculously large and my lease still expired at the 20 minute mark.
> Even with a schedd running. It looks like the starter doesn't get the
> alive interval from the ALIVE_INTERVAL config file setting if the
> schedd didn't give it one. That's my guess.
>
> I'm going to try setting JobLeaseDuration in my fetch work classad
> output to see if that helps. But the documentation makes it all sound
> pretty nebulous. Maybe it will, maybe it won't...
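
For reference, a fetch-work hook hands the job to the startd as ClassAd text on stdout, in the same "Name = value" form as the ad quoted in the StartLog excerpt above. The sketch below formats such an ad from a Python dict and includes JobLeaseDuration. The helper name `format_classad`, the attribute values, and the one-day lease are all hypothetical; this is not the arc_job_fetch script.

```python
import sys


def format_classad(attrs):
    """Render a dict as 'Name = value' lines, quoting string values.

    Assumes string values contain no double quotes or backslashes.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):
            rendered = "True" if value else "False"
        elif isinstance(value, str):
            rendered = '"' + value + '"'
        else:
            rendered = str(value)
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines) + "\n"


# Illustrative job ad; every value here is made up for the example.
job_ad = {
    "Cmd": "/path/to/execute.sh",
    "Owner": "someuser",
    "Args": "12345",
    "JobUniverse": 5,
    "Requirements": True,
    "JobLeaseDuration": 24 * 60 * 60,  # one day, in seconds
}

if __name__ == "__main__":
    sys.stdout.write(format_classad(job_ad))
```

Whether the startd honours the attribute for fetched jobs is exactly the open question in this message; the sketch only shows where the attribute would go.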
>
> - Ian
>
> -----Original Message-----
> From: Ian Chesal
> Sent: Wednesday, March 25, 2009 5:37 PM
> To: 'Condor-Users Mail List'
> Subject: Jobs fetched with a hook being killed after 20 minutes
>
> In a nutshell: they're being axed because the startd thinks the claim
> has timed out. From the StartLog:
>
> 3/25 13:01:15 Return from HandleReq <HandleChildAliveCommand>
> (handler: 0.000s, sec: 0.001s)
> 3/25 13:01:45 State change: claim lease expired (condor_schedd gone?)
> 3/25 13:01:45 Changing state and activity: Claimed/Busy ->
> Preempting/Killing
> 3/25 13:01:45 Calling Handler <receiveJobClassAdUpdate>
> 3/25 13:01:45 Return from Handler <receiveJobClassAdUpdate>
> 3/25 13:01:45 DaemonCore: pid 21416 exited with status 0, invoking
> reaper 3 <reaper>
> 3/25 13:01:45 Starter pid 21416 exited with status 0
> 3/25 13:01:45 State change: starter exited
> 3/25 13:01:45 State change: No preempting claim, returning to owner
> 3/25 13:01:45 Changing state and activity: Preempting/Killing ->
> Owner/Idle
> 3/25 13:01:45 State change: IS_OWNER is false
> 3/25 13:01:45 Changing state: Owner -> Unclaimed
>
> Second line in that output says it all really. I did not have a schedd
> running in this pool. Didn't think I needed one because hooks were
> fetching the work for me. I did start one but that hasn't stopped the
> problem from occurring. The lease is still expiring.
>
> Right now the jobs are not passing a JobLeaseDuration attribute when
> the fetch work hook assigns them to the machine.
>
> I have no other hooks currently defined. Only a fetch work hook.
>
> From my configs:
>
> ALIVE_INTERVAL = 239
> MAX_CLAIM_ALIVES_MISSED = 6
> MaxJobRetirementTime = 2147483640
> PREEMPT = False
>
> I set no JobLeaseDuration default in any config files so that *should*
> mean it's undefined. So my lease duration should be 6 * 239 = 1434
> seconds =~ 24 minutes. But I'm seeing the claim end at exactly 20
> minutes, making me think JobLeaseDuration is defaulting to 20 minutes
> for my jobs. Either way, I'd like to stop the claim from expiring.
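
The expectation above follows directly from the manual's formula. As a sketch (the function name `expected_claim_lease` is hypothetical, and the 20-minute figure is the behaviour observed in this thread, not a value taken from the config):

```python
def expected_claim_lease(max_claim_alives_missed, alive_interval):
    """Claim lease in seconds per the manual, when JobLeaseDuration is undefined."""
    return max_claim_alives_missed * alive_interval


expected = expected_claim_lease(6, 239)  # values from the config above
observed = 20 * 60                       # claim dropped at the 20 minute mark

print(expected, expected / 60)  # seconds, then minutes (a little under 24)
print(expected - observed)      # gap between the formula and what was seen
```

The mismatch between the two numbers is what suggests the config-derived formula is not the one being applied to these fetched jobs.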
>
> When I'm fetching jobs with a hook should I make
> MAX_CLAIM_ALIVES_MISSED be some ridiculously large integer? Is there a
> more elegant way to prevent the claim from expiring? This approach
> seems a mite hack-ish.
>
> Thanks!
>
> - Ian
>
> Confidentiality Notice.
> This message may contain information that is confidential or otherwise
protected from disclosure. If you are not the intended recipient, you
are hereby notified that any use, disclosure, dissemination,
distribution,  or copying  of this message, or any attachments, is
strictly prohibited.  If you have received this message in error, please
advise the sender by reply e-mail, and delete the message and any
attachments.  Thank you.
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
