
[HTCondor-users] high rate of killed jobs



Hello,

I am trying to understand this behaviour: I very often find that jobs exit with status 102. In our configuration we have defined that jobs should be neither preempted nor killed, via these variables:

 SUSPEND = FALSE
 PREEMPT = FALSE
 PREEMPTION_REQUIREMENTS = FALSE
 KILL = FALSE
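For reference, this is how I checked that the running daemons actually see these values (the hostname is a placeholder for one of our execute nodes; `condor_config_val` with `-startd`/`-negotiator` queries the respective daemon's configuration):

```
# On (or against) an execute node: the startd's policy expressions
condor_config_val -name wn.example.org -startd SUSPEND PREEMPT KILL

# On the central manager: the negotiator's preemption policy
condor_config_val -negotiator PREEMPTION_REQUIREMENTS
```

All of them come back as FALSE for me.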

One example:

SchedLog

03/27/18 19:42:57 (pid:287435) Starting add_shadow_birthdate(698148.0)
03/27/18 19:42:57 (pid:287435) Started shadow for job 698148.0 on slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033, (shadow pid = 2185496)

..

03/27/18 19:44:03 (pid:287435) Shadow pid 2185496 for job 698148.0 exited with status 102
03/27/18 19:44:03 (pid:287435) Checking consistency running and runnable jobs
03/27/18 19:44:03 (pid:287435) Tables are consistent
03/27/18 19:44:03 (pid:287435) Rebuilt prioritized runnable job list in 0.001s.
03/27/18 19:44:03 (pid:287435) match (slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033) out of jobs; relinquishing
03/27/18 19:44:03 (pid:287435) Match record (slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033, 698148.-1) deleted
03/27/18 19:44:03 (pid:287435) Completed RELEASE_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033


ShadowLog

03/27/18 19:42:57 (698148.0) (2185496): Request to run on slot1_1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> was ACCEPTED

[..]

03/27/18 19:42:57 (698148.0) (2185496): File transfer completed successfully.
[...]

03/27/18 19:43:34 (698148.0) (2185496): Requesting graceful removal of job.
[..]

03/27/18 19:44:03 (698148.0) (2185496): Job 698148.0 is being evicted from slot1_1@xxxxxxxxxxxxxxxxxx
03/27/18 19:44:03 (698148.0) (2185496): **** condor_shadow (condor_SHADOW) pid 2185496 EXITING WITH STATUS 102

And in the node running the job:

/var/log/condor/StarterLog.slot1_1

03/27/18 19:42:57 (pid:32723) Job 698148.0 set to execute immediately
03/27/18 19:42:57 (pid:32723) Starting a VANILLA universe job with ID: 698148.0
03/27/18 19:42:57 (pid:32723) IWD: /var/lib/condor/execute/dir_32723
03/27/18 19:42:57 (pid:32723) Output file: /var/lib/condor/execute/dir_32723/_condor_stdout
03/27/18 19:42:57 (pid:32723) Error file: /var/lib/condor/execute/dir_32723/_condor_stderr
03/27/18 19:42:57 (pid:32723) Renice expr "0" evaluated to 0
03/27/18 19:42:57 (pid:32723) Using wrapper /usr/sbin/mjf-job-wrapper to exec /var/lib/condor/execute/dir_32723/condor_exec.exe
03/27/18 19:42:57 (pid:32723) Running job as user atlprod033
03/27/18 19:42:57 (pid:32723) Create_Process succeeded, pid=32727
03/27/18 19:43:34 (pid:32723) Got SIGTERM. Performing graceful shutdown.
03/27/18 19:43:34 (pid:32723) ShutdownGraceful all jobs.
03/27/18 19:44:03 (pid:32723) Got SIGQUIT. Performing fast shutdown.
03/27/18 19:44:03 (pid:32723) ShutdownFast all jobs.
03/27/18 19:44:03 (pid:32723) Process exited, pid=32727, signal=9
03/27/18 19:44:03 (pid:32723) Last process exited, now Starter is exiting
03/27/18 19:44:03 (pid:32723) **** condor_starter (condor_STARTER) pid 32723 EXITING WITH STATUS 0
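To get a feeling for how frequent this is, I tally the shadow exit statuses straight from the ShadowLog with a small Python sketch (the helper name and the sample lines are my own; if I read the documentation right, status 100 is a normal job exit and 102 means the job was killed, but please correct me if that is wrong):

```python
import re
from collections import Counter

# Match the shadow's final log line, e.g.
# "**** condor_shadow (condor_SHADOW) pid 2185496 EXITING WITH STATUS 102"
STATUS_RE = re.compile(r"EXITING WITH STATUS (\d+)")

def shadow_exit_counts(log_text: str) -> Counter:
    """Count occurrences of each condor_shadow exit status in ShadowLog text."""
    return Counter(int(m.group(1)) for m in STATUS_RE.finditer(log_text))

# Two illustrative ShadowLog lines (the second one is invented for contrast)
sample = """\
03/27/18 19:44:03 (698148.0) (2185496): **** condor_shadow (condor_SHADOW) pid 2185496 EXITING WITH STATUS 102
03/27/18 19:50:11 (698149.0) (2185501): **** condor_shadow (condor_SHADOW) pid 2185501 EXITING WITH STATUS 100
"""
print(dict(shadow_exit_counts(sample)))  # {102: 1, 100: 1}
```

On our real ShadowLog the count for status 102 dominates, which is what I mean by a "high rate" of killed jobs.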

I am a newbie with HTCondor, so I'd appreciate any hint that helps me understand why jobs are exiting this way.

cheers,

Almudena

--
========================================================================
Almudena Montiel Gonzalez              e-mail: almudena.montiel@xxxxxx
Dept. Theoretical Physics. Block 15.
Laboratory of High Energy Physics
Universidad Autonoma de Madrid.
Phone: 34 91 497 4541      Fax: 34 91 497 3936
James Watt 2, Cantoblanco, 28049 Madrid, Spain.
========================================================================