
Re: [HTCondor-users] high rate of killed jobs



Dear Todd,

thanks for your explanatory answer.

It shows that the job was removed by the user themselves.

[root@grid003 ~]# condor_history 698148.0 -limit 1 -af JobStatus RemoveReason
3 via condor_rm (by user atlprod033)

I will figure out why that happens.

Again, thanks!
cheers,
Almudena

On 28/03/2018 at 16:54, Todd Tannenbaum wrote:
On 3/28/2018 2:50 AM, Almudena Montiel wrote:
Hello,

I am trying to understand this behaviour: I very often find that jobs
exit with status 102. In the configuration we have set these variables
so that jobs are neither preempted nor killed:

   SUSPEND = FALSE
   PREEMPT = FALSE
   PREEMPTION_REQUIREMENTS = FALSE
   KILL = FALSE

One example:

From the example logs, it looks to me like HTCondor killed running job 698148.0 because back on the submit machine it was explicitly removed from the queue.  I.e., someone ran "condor_rm" on the job, or the job's PeriodicRemove expression became True.

Here is the telling line:

ShadowLog
03/27/18 19:43:34 (698148.0) (2185496): Requesting graceful removal of job.
I assume job 698148.0 disappeared from the queue after this happened?  If yes, what does:

   condor_history 698148.0 -limit 1

show?  condor_history is like condor_q, but for completed/removed jobs.  Does it show the job was removed (a removed job shows "X" in the ST column)?  The job classad likely also will contain a RemoveReason attribute stating why the job was removed. Some examples from my Windows laptop (same idea on Linux):

C:\condor\log>condor_history 424.0 -limit 1
  ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD
  424.0   tannenba        3/28 09:41   0+00:00:12 X         ???  c:\utils\sleep.exe 300

C:\condor\log>condor_history 424.0 -limit 1 -l | grep -i remove
OnExitRemove = true
PeriodicRemove = false
RemoveReason = "via condor_rm (by user tannenba)"

C:\condor\log>condor_history 424.0 -limit 1 -af JobStatus RemoveReason
3 via condor_rm (by user tannenba)
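
If you want to check many jobs this way, the one-line "-af JobStatus RemoveReason" output above is easy to decode in a script. Below is a minimal Python sketch, not anything shipped with HTCondor: the function name parse_af_line and the JOB_STATUS_NAMES table are my own for illustration, and it assumes each line has the numeric JobStatus first, followed by the RemoveReason text.

```python
# Numeric JobStatus codes used in HTCondor job ClassAds.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def parse_af_line(line):
    """Split one line of `-af JobStatus RemoveReason` output.

    Hypothetical helper: the first whitespace-separated token is taken
    as the numeric JobStatus, the rest of the line as the RemoveReason.
    """
    status_text, _, reason = line.strip().partition(" ")
    status = int(status_text)
    return JOB_STATUS_NAMES.get(status, "Unknown"), reason.strip()


if __name__ == "__main__":
    # Sample line copied from the condor_history output above.
    print(parse_af_line("3 via condor_rm (by user tannenba)"))
```

The 3 in the first column is what matters here: it is the JobStatus code for a removed job, matching the "X" shown in the default listing.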

Hope the above helps,
Todd

--
========================================================================
Almudena Montiel Gonzalez              e-mail: almudena.montiel@xxxxxx
Dept. Theoretical Physics. Block 15.
Laboratory of High Energy Physics
Universidad Autonoma de Madrid.
Phone: 34 91 497 4541      Fax: 34 91 497 3936
James Watt 2, Cantoblanco, 28049 Madrid, Spain.
========================================================================

