
[HTCondor-users] job_lease_duration setting in submit file cause long time jobs killed



I set job_lease_duration = 30 in my job submit file to shorten the delay before a job is rescheduled when a compute node fails. However, this causes long-running jobs to be killed. I am using Condor 7.8.5 (x64) on CentOS 6.3. The relevant log files are below:

In StartLog (showing only the entries for slot3):
02/24/14 10:29:46 slot3: State change: claiming protocol successful
02/24/14 10:29:46 slot3: Changing state: Unclaimed -> Claimed
02/24/14 10:29:46 slot3: Got activate_claim request from shadow (10.1.1.1)
02/24/14 10:29:46 slot3: Error evaluating machine rank expression: None
02/24/14 10:29:46 slot3: Setting RANK to 0.0
02/24/14 10:29:46 slot3: Remote job ID is 496.0
02/24/14 10:29:46 slot3: Got universe "VANILLA" (5) from request classad
02/24/14 10:29:46 slot3: State change: claim-activation protocol successful
02/24/14 10:29:46 slot3: Changing activity: Idle -> Busy
02/24/14 10:44:51 slot3: State change: claim no longer recognized by the schedd - removing claim
02/24/14 10:44:51 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
02/24/14 10:45:21 slot3: starter (pid 28021) is not responding to the request to hardkill its job.  The startd will now directly hard kill the starter and all its decendents.
02/24/14 10:45:21 Starter pid 28021 died on signal 9 (signal 9 (Killed))
02/24/14 10:45:21 slot3: State change: starter exited
02/24/14 10:45:21 slot3: State change: No preempting claim, returning to owner
02/24/14 10:45:21 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
02/24/14 10:45:21 slot3: State change: IS_OWNER is false
02/24/14 10:45:21 slot3: Changing state: Owner -> Unclaimed

In StarterLog.Slot3:
02/24/14 10:29:47 Job 496.0 set to execute immediately
02/24/14 10:29:47 Starting a VANILLA universe job with ID: 496.0
......
02/24/14 10:29:47 About to exec /usr/bin/csfexec
02/24/14 10:29:47 Running job as user root
02/24/14 10:29:47 Create_Process succeeded, pid=28029
02/24/14 10:44:51 Got SIGQUIT.  Performing fast shutdown.
02/24/14 10:44:51 ShutdownFast all jobs.
02/24/14 10:44:52 Process exited, pid=28029, signal=9
02/24/14 10:44:52 condor_write(): Socket closed when trying to write 381 bytes to <10.1.1.1:42920>, fd is 9
02/24/14 10:44:52 Buf::write(): condor_write() failed
02/24/14 10:44:52 condor_write(): Socket closed when trying to write 91 bytes to <10.1.1.1:42920>, fd is 9
02/24/14 10:44:52 Buf::write(): condor_write() failed
02/24/14 10:44:52 Failed to send job exit status to shadow
02/24/14 10:44:52 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
02/24/14 10:44:52 Returning from CStarter::JobReaper()

I think the cause is this line in StartLog: 'State change: claim no longer recognized by the schedd - removing claim'. If I remove the job_lease_duration = 30 setting from the job submit file, the job is not killed.
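To put a number on how long the job ran before the claim was removed, here is a small Python sketch that takes the difference between two of the StartLog timestamps above. The helper names (log_time, elapsed_seconds) are my own and not part of Condor, and the sketch assumes every log line starts with a 17-character 'MM/DD/YY HH:MM:SS' timestamp, as in the excerpts.

```python
from datetime import datetime

# Two lines copied from the StartLog excerpt: job activation and claim removal.
LINES = [
    "02/24/14 10:29:46 slot3: Changing activity: Idle -> Busy",
    "02/24/14 10:44:51 slot3: State change: claim no longer recognized by the schedd - removing claim",
]


def log_time(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a log line (assumed format)."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def elapsed_seconds(start_line, end_line):
    """Whole seconds between the timestamps of two log lines."""
    return int((log_time(end_line) - log_time(start_line)).total_seconds())


secs = elapsed_seconds(LINES[0], LINES[1])
print(f"{secs} s ({secs // 60} min {secs % 60} s)")  # prints: 905 s (15 min 5 s)
```

So the job ran for about 15 minutes before the startd dropped the claim, which is much longer than the 30-second lease I configured.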

Why does this setting cause the job to be killed? How can I keep long-running jobs from being killed?
Thanks in advance!