[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] Jobs fetched with a hook being killed after 20 minutes



And it looks like it's working now for a 2 year JobLeaseDuration
setting:

3/26 13:52:06 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
3029) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40905
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40905"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 62899200
ClusterId = 40905
ProcId = 0
ARCJob = 40905
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/1300/40905/stdout.txt"
Err = "/data/ichesal/job/20090326/1300/40905/stderr.txt"

3/26 13:52:06 Rank of this fetched claim is: 1.000000
3/26 13:52:06 State change: Finished fetching work successfully
3/26 13:52:06 Set destination state to Claimed
3/26 13:52:06 Changing state: Unclaimed -> Claimed
3/26 13:52:06 Warning: starting ClaimLease timer before lease duration
set.
3/26 13:52:06 in DaemonCore NewTimer()
3/26 13:52:06
3/26 13:52:06 DaemonCore--> Timers
3/26 13:52:06 DaemonCore--> ~~~~~~
3/26 13:52:06 DaemonCore--> id = 0, when = 1238100727, period = 120,
handler_descrip=<check_parent>
3/26 13:52:06 DaemonCore--> id = 8, when = 1238100776, period = 60,
handler_descrip=<eval_and_update_all>
3/26 13:52:06 DaemonCore--> id = 10, when = 1238100776, period = 0,
handler_descrip=<dc_touch_log_file>
3/26 13:52:06 DaemonCore--> id = 5, when = 1238100956, period = 240,
handler_descrip=<self_monitor>
3/26 13:52:06 DaemonCore--> id = 3, when = 1238101016, period = 300,
handler_descrip=<check_session_cache>
3/26 13:52:06 DaemonCore--> id = 7, when = 1238101886, period = 1170,
handler_descrip=<DaemonCore::SendAliveToParent>
3/26 13:52:06 DaemonCore--> id = 17, when = 1238101926, period = 0,
handler_descrip=<Claim::leaseExpired>
3/26 13:52:06 DaemonCore--> id = 4, when = 1238102517, period = 1801,
handler_descrip=<handle_cookie_refresh>
3/26 13:52:06 DaemonCore--> id = 11, when = 1238129516, period = 0,
handler_descrip=<dc_touch_lock_files>
3/26 13:52:06 DaemonCore--> id = 6, when = 1238130095, period = 29383,
handler_descrip=<DaemonCore::refreshDNS()>
3/26 13:52:06 DaemonCore--> id = 14, when = 1238200725, period = 100000,
handler_descrip=<Resource::tryFetchWork>
3/26 13:52:06
3/26 13:52:06 leaving DaemonCore NewTimer, id=17
3/26 13:52:06 Started ClaimLease timer (17) w/ 1200 second lease
duration
3/26 13:52:06 Remote job ID is 40905.0
3/26 13:52:06 JobLeaseDuration defined in job ClassAd: 62899200
3/26 13:52:06 Resetting ClaimLease timer (17) with new duration
3/26 13:52:06 In reset_timer(), id=17, time=62899200, period=0
3/26 13:52:06 in DaemonCore NewTimer()
3/26 13:52:06 In cancel_timer(), id=17
3/26 13:52:06 Renewing timer id 17 for 62899200 secs

This job sleeps for an hour. Lets see if it finishes. I'm hopeful this
time around!

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 4:47 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after
20 minutes

I think I figured it out:

I had an extra whitespace at the end of the integer value for
JobLeaseDuration in STDOUT from my fetch work script. I dropped the
extra whitespace and now the 60 second JobLeaseDuration is being found:

3/26 13:43:37 leaving DaemonCore NewTimer, id=17
3/26 13:43:37 Started ClaimLease timer (17) w/ 1200 second lease
duration
3/26 13:43:37 Remote job ID is 40903.0
3/26 13:43:37 JobLeaseDuration defined in job ClassAd: 60
3/26 13:43:37 Resetting ClaimLease timer (17) with new duration
3/26 13:43:37 In reset_timer(), id=17, time=60, period=0
3/26 13:43:37 in DaemonCore NewTimer()
3/26 13:43:37 In cancel_timer(), id=17
3/26 13:43:37 Renewing timer id 17 for 60 secs

And it was obeyed:

3/26 13:43:37 Changing state: Unclaimed -> Claimed
...
3/26 13:44:37 State change: claim lease expired (condor_schedd gone?)
3/26 13:44:37 Changing state and activity: Claimed/Busy ->
Preempting/Killing
3/26 13:44:37 In Starter::kill() with pid 2439, sig 3 (SIGQUIT)
3/26 13:44:37 Send_Signal(): Doing kill(2439,3) [SIGQUIT]

- Ian

-----Original Message-----
From: Ian Chesal
Sent: Thursday, March 26, 2009 2:56 PM
To: 'Condor-Users Mail List'
Subject: RE: [Condor-users] Jobs fetched with a hook being killed after
20 minutes


> When you turn on D_FULLDEBUG you don't see the startd at least
> attempting to fix up the lease time?

Whoops. You bet.

> You can also throw D_DAEMONCORE into the mix to see what is being done
> with the timer.

Done. Send me a private email and I'll ship you a tarball with the logs.
I don't want to past them in here. I've included a snippet below.

- Ian


3/26 11:47:00 HookClient /tools/arc/scripts/hooks/arc_job_fetch (pid
32691) exited with status 0
3/26 11:47:00 Warning, hook /tools/arc/scripts/hooks/arc_job_fetch (pid
32691) printed to stderr: DEBUG: Slot State="Unclaimed"
Found job 40900
Cmd = "/tools/arc/scripts/arc_execute.sh"
Owner = "ichesal"
Args = "40900"
JobUniverse = 5
Requirements = True
JobLeaseDuration = 60
ClusterId = 40900
ProcId = 0
ARCJob = 40900
IWD = "/data/ichesal/arc/sleeper"
Out = "/data/ichesal/job/20090326/1100/40900/stdout.txt"
Err = "/data/ichesal/job/20090326/1100/40900/stderr.txt"

3/26 11:47:00 Rank of this fetched claim is: 1.000000
3/26 11:47:00 Hook ARC_HOOK_REPLY_FETCH: UNDEFINED
3/26 11:47:00 State change: Finished fetching work successfully
3/26 11:47:00 Set destination state to Claimed
3/26 11:47:00 Changing state: Unclaimed -> Claimed
3/26 11:47:00 Warning: starting ClaimLease timer before lease duration
set.
3/26 11:47:00 in DaemonCore NewTimer()
3/26 11:47:00
3/26 11:47:00 DaemonCore--> Timers
3/26 11:47:00 DaemonCore--> ~~~~~~
3/26 11:47:00 DaemonCore--> id = 9, when = 1238093223, period = 0,
handler_descrip=<do_update>
3/26 11:47:00 DaemonCore--> id = 0, when = 1238093227, period = 120,
handler_descrip=<check_parent>
3/26 11:47:00 DaemonCore--> id = 8, when = 1238093279, period = 60,
handler_descrip=<eval_and_update_all>
3/26 11:47:00 DaemonCore--> id = 10, when = 1238093279, period = 0,
handler_descrip=<dc_touch_log_file>
3/26 11:47:00 DaemonCore--> id = 5, when = 1238093459, period = 240,
handler_descrip=<self_monitor>
3/26 11:47:00 DaemonCore--> id = 3, when = 1238093519, period = 300,
handler_descrip=<check_session_cache>
3/26 11:47:00 DaemonCore--> id = 7, when = 1238094389, period = 1170,
handler_descrip=<DaemonCore::SendAliveToParent>
3/26 11:47:00 DaemonCore--> id = 12, when = 1238094420, period = 0,
handler_descrip=<Claim::leaseExpired>
3/26 11:47:00 DaemonCore--> id = 4, when = 1238095020, period = 1801,
handler_descrip=<handle_cookie_refresh>
3/26 11:47:00 DaemonCore--> id = 11, when = 1238122019, period = 0,
handler_descrip=<dc_touch_lock_files>
3/26 11:47:00 DaemonCore--> id = 6, when = 1238122595, period = 29383,
handler_descrip=<DaemonCore::refreshDNS()>
3/26 11:47:00
3/26 11:47:00 leaving DaemonCore NewTimer, id=12
3/26 11:47:00 Started ClaimLease timer (12) w/ 1200 second lease
duration
3/26 11:47:00 Remote job ID is 40900.0
3/26 11:47:00 In reset_timer(), id=12, time=1200, period=0
3/26 11:47:00 in DaemonCore NewTimer()
3/26 11:47:00 In cancel_timer(), id=12
3/26 11:47:00 Renewing timer id 12 for 1200 secs
3/26 11:47:00

Confidentiality Notice.
This message may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient, you are hereby notified that any use, disclosure, dissemination, distribution,  or copying  of this message, or any attachments, is strictly prohibited.  If you have received this message in error, please advise the sender by reply e-mail, and delete the message and any attachments.  Thank you.