Re: [HTCondor-users] high rate of killed jobs
- Date: Wed, 28 Mar 2018 11:51:46 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] high rate of killed jobs
On 3/28/2018 11:22 AM, almudena montiel wrote:
Thanks for your explanatory answer.
It shows the job was deleted by the user.
[root@grid003 ~]# condor_history 698148.0 -limit 1 -af JobStatus
3 via condor_rm (by user atlprod033)
I will figure out why that happens.
Glad to help.
On 28/03/2018 at 16:54, Todd Tannenbaum wrote:
On 3/28/2018 2:50 AM, Almudena Montiel wrote:
From the example logs, it looks to me like HTCondor killed running
job 698148.0 because back on the submit machine it was explicitly
removed from the queue. I.e., someone ran "condor_rm" on the job, or
the job's PeriodicRemove expression became True.
I am trying to understand this behaviour: I very often find that jobs
exit with status 102. In the configuration we have set the following
variables so that jobs are neither preempted nor killed:
   SUSPEND = FALSE
   PREEMPT = FALSE
   PREEMPTION_REQUIREMENTS = FALSE
   KILL = FALSE
Here is the telling line:
   03/27/18 19:43:34 (698148.0) (2185496): Requesting graceful removal
I assume job 698148.0 disappeared from the queue after this happened?
If yes, what does:
   condor_history 698148.0 -limit 1
show? condor_history is like condor_q, but for completed/removed
jobs. Does it show the job was removed (a removed job will have Status
= "X")? The job classad likely also will contain a RemoveReason
attribute stating why the job was removed. Some examples from my
Windows laptop (same idea on Linux):
C:\condor\log>condor_history 424.0 -limit 1
 ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD
 424.0  tannenba       3/28 09:41  0+00:00:12   X              ???
C:\condor\log>condor_history 424.0 -limit 1 -l | grep -i remove
OnExitRemove = true
PeriodicRemove = false
RemoveReason = "via condor_rm (by user tannenba)"
C:\condor\log>condor_history 424.0 -limit 1 -af JobStatus RemoveReason
3 via condor_rm (by user tannenba)
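If you want to check many jobs this way, the "-af JobStatus RemoveReason" output above is easy to post-process. Below is a minimal Python sketch, not an official HTCondor tool. The helper name parse_af_line is hypothetical, and it assumes that a missing RemoveReason is printed as the literal word "undefined". The numeric codes are the standard JobStatus values (3 = Removed, as in the output above).

```python
# Standard HTCondor JobStatus codes and their meanings.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def parse_af_line(line):
    """Split one output line of
    `condor_history <id> -limit 1 -af JobStatus RemoveReason`
    into (status_name, remove_reason).

    Hypothetical helper for illustration only. Assumes an unset
    RemoveReason shows up as the word "undefined".
    """
    # The first token is the numeric JobStatus; the rest is the reason text.
    code_text, _, reason = line.strip().partition(" ")
    status = JOB_STATUS.get(int(code_text), "Unknown")
    if reason in ("", "undefined"):
        reason = None
    return status, reason


if __name__ == "__main__":
    # Sample line copied from the condor_history output above.
    print(parse_af_line("3 via condor_rm (by user tannenba)"))
```

A reason starting with "via condor_rm" points at an explicit removal by a user, as opposed to an execute-side policy such as PREEMPT or KILL.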
Hope the above helps,
Todd Tannenbaum <tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
Center for High Throughput Computing     Department of Computer Sciences
HTCondor Technical Lead                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                    Madison, WI 53706-1685