[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



So the problem I'm seeing is once the job is evicted, new ones aren't getting run.

Basically, job "sleep 30" with a JobLeaseDuration=5 will get the job evicted after 5 seconds. I'd expect the next call to the fetch hook to retrieve another "sleep 30" job to be run. Seems the slot is stuck "Preemptin/Killing".

Best,


matt

Ian Chesal wrote:
And it looks like it's working now for a 2 year JobLeaseDuration
setting:

3/26 13:52:06 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
3029) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40905
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40905"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 62899200
ClusterId = 40905
ProcId = 0
ARCJob = 40905
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/1300/40905/stdout.txt"
Err = "/data/ichesal/job/20090326/1300/40905/stderr.txt"

3/26 13:52:06 Rank of this fetched claim is: 1.000000
3/26 13:52:06 State change: Finished fetching work successfully
3/26 13:52:06 Set destination state to Claimed
3/26 13:52:06 Changing state: Unclaimed -> Claimed
3/26 13:52:06 Warning: starting ClaimLease timer before lease duration
set.
3/26 13:52:06 in DaemonCore NewTimer()
3/26 13:52:06
3/26 13:52:06 DaemonCore--> Timers
3/26 13:52:06 DaemonCore--> ~~~~~~
3/26 13:52:06 DaemonCore--> id = 0, when = 1238100727, period = 120,
handler_descrip=<check_parent>
3/26 13:52:06 DaemonCore--> id = 8, when = 1238100776, period = 60,
handler_descrip=<eval_and_update_all>
3/26 13:52:06 DaemonCore--> id = 10, when = 1238100776, period = 0,
handler_descrip=<dc_touch_log_file>
3/26 13:52:06 DaemonCore--> id = 5, when = 1238100956, period = 240,
handler_descrip=<self_monitor>
3/26 13:52:06 DaemonCore--> id = 3, when = 1238101016, period = 300,
handler_descrip=<check_session_cache>
3/26 13:52:06 DaemonCore--> id = 7, when = 1238101886, period = 1170,
handler_descrip=<DaemonCore::SendAliveToParent>
3/26 13:52:06 DaemonCore--> id = 17, when = 1238101926, period = 0,
handler_descrip=<Claim::leaseExpired>
3/26 13:52:06 DaemonCore--> id = 4, when = 1238102517, period = 1801,
handler_descrip=<handle_cookie_refresh>
3/26 13:52:06 DaemonCore--> id = 11, when = 1238129516, period = 0,
handler_descrip=<dc_touch_lock_files>
3/26 13:52:06 DaemonCore--> id = 6, when = 1238130095, period = 29383,
handler_descrip=<DaemonCore::refreshDNS()>
3/26 13:52:06 DaemonCore--> id = 14, when = 1238200725, period = 100000,
handler_descrip=<Resource::tryFetchWork>
3/26 13:52:06
3/26 13:52:06 leaving DaemonCore NewTimer, id=17
3/26 13:52:06 Started ClaimLease timer (17) w/ 1200 second lease
duration
3/26 13:52:06 Remote job ID is 40905.0
3/26 13:52:06 JobLeaseDuration defined in job ClassAd: 62899200
3/26 13:52:06 Resetting ClaimLease timer (17) with new duration
3/26 13:52:06 In reset_timer(), id=17, time=62899200, period=0
3/26 13:52:06 in DaemonCore NewTimer()
3/26 13:52:06 In cancel_timer(), id=17
3/26 13:52:06 Renewing timer id 17 for 62899200 secs

This job sleeps for an hour. Lets see if it finishes. I'm hopeful this
time around!

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 4:47 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after
20 minutes

I think I figured it out:

I had an extra whitespace at the end of the integer value for
JobLeaseDuration in STDOUT from my fetch work script. I dropped the
extra whitespace and now the 60 second JobLeaseDuration is being found:

3/26 13:43:37 leaving DaemonCore NewTimer, id=17
3/26 13:43:37 Started ClaimLease timer (17) w/ 1200 second lease
duration
3/26 13:43:37 Remote job ID is 40903.0
3/26 13:43:37 JobLeaseDuration defined in job ClassAd: 60
3/26 13:43:37 Resetting ClaimLease timer (17) with new duration
3/26 13:43:37 In reset_timer(), id=17, time=60, period=0
3/26 13:43:37 in DaemonCore NewTimer()
3/26 13:43:37 In cancel_timer(), id=17
3/26 13:43:37 Renewing timer id 17 for 60 secs

And it was obeyed:

3/26 13:43:37 Changing state: Unclaimed -> Claimed
...
3/26 13:44:37 State change: claim lease expired (condor_schedd gone?)
3/26 13:44:37 Changing state and activity: Claimed/Busy ->
Preempting/Killing
3/26 13:44:37 In Starter::kill() with pid 2439, sig 3 (SIGQUIT)
3/26 13:44:37 Send_Signal(): Doing kill(2439,3) [SIGQUIT]

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 2:56 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after
20 minutes


When you turn on D_FULLDEBUG you don't see the startd at least
attempting to fix up the lease time?

Whoops. You bet.

You can also throw D_DAEMONCORE into the mix to see what is being done
with the timer.

Done. Send me a private email and I'll ship you a tarball with the logs.
I don't want to past them in here. I've included a snippet below.

- Ian


3/26 11:47:00 HookClient /tools/arc/scripts/hooks/arc_job_fetch (pid
32691) exited with status 0
3/26 11:47:00 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
32691) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40900
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40900"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 60
ClusterId = 40900
ProcId = 0
ARCJob = 40900
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/1100/40900/stdout.txt"
Err = "/data/ichesal/job/20090326/1100/40900/stderr.txt"

3/26 11:47:00 Rank of this fetched claim is: 1.000000
3/26 11:47:00 Hook ARC_HOOK_REPLY_FETCH: UNDEFINED
3/26 11:47:00 State change: Finished fetching work successfully
3/26 11:47:00 Set destination state to Claimed
3/26 11:47:00 Changing state: Unclaimed -> Claimed
3/26 11:47:00 Warning: starting ClaimLease timer before lease duration
set.
3/26 11:47:00 in DaemonCore NewTimer()
3/26 11:47:00
3/26 11:47:00 DaemonCore--> Timers
3/26 11:47:00 DaemonCore--> ~~~~~~
3/26 11:47:00 DaemonCore--> id = 9, when = 1238093223, period = 0,
handler_descrip=<do_update>
3/26 11:47:00 DaemonCore--> id = 0, when = 1238093227, period = 120,
handler_descrip=<check_parent>
3/26 11:47:00 DaemonCore--> id = 8, when = 1238093279, period = 60,
handler_descrip=<eval_and_update_all>
3/26 11:47:00 DaemonCore--> id = 10, when = 1238093279, period = 0,
handler_descrip=<dc_touch_log_file>
3/26 11:47:00 DaemonCore--> id = 5, when = 1238093459, period = 240,
handler_descrip=<self_monitor>
3/26 11:47:00 DaemonCore--> id = 3, when = 1238093519, period = 300,
handler_descrip=<check_session_cache>
3/26 11:47:00 DaemonCore--> id = 7, when = 1238094389, period = 1170,
handler_descrip=<DaemonCore::SendAliveToParent>
3/26 11:47:00 DaemonCore--> id = 12, when = 1238094420, period = 0,
handler_descrip=<Claim::leaseExpired>
3/26 11:47:00 DaemonCore--> id = 4, when = 1238095020, period = 1801,
handler_descrip=<handle_cookie_refresh>
3/26 11:47:00 DaemonCore--> id = 11, when = 1238122019, period = 0,
handler_descrip=<dc_touch_lock_files>
3/26 11:47:00 DaemonCore--> id = 6, when = 1238122595, period = 29383,
handler_descrip=<DaemonCore::refreshDNS()>
3/26 11:47:00
3/26 11:47:00 leaving DaemonCore NewTimer, id=12
3/26 11:47:00 Started ClaimLease timer (12) w/ 1200 second lease
duration
3/26 11:47:00 Remote job ID is 40900.0
3/26 11:47:00 In reset_timer(), id=12, time=1200, period=0
3/26 11:47:00 in DaemonCore NewTimer()
3/26 11:47:00 In cancel_timer(), id=12
3/26 11:47:00 Renewing timer id 12 for 1200 secs
3/26 11:47:00

Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/