
[HTCondor-users] Jobs stay in running state after PID exits



Hi all,

A user noticed that a few of his jobs stayed in the running state long after they should have finished. The StarterLog shows that the job's PID (as reported by condor_ssh_to_job) exited at 4:40 AM, but the job log shows the job running until 12:22 PM, when it was evicted by the user.

This happened on a handful of jobs out of tens of thousands.

Any ideas?

Thanks,
Jon

At 12:21:
$ condor_ssh_to_job 3879.442
Welcome to slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx!
Your condor job is running with pid(s) 3013916.

$ ps faux|grep 3013916
qux      3140710  0.0  0.0 103372  2056 pts/0    SN+  12:21   0:00              \_ grep --color=auto --perl-regexp --line-buffered 3013916

From the job log:
000 (3879.442.000) 10/15 22:13:59 Job submitted from host: <10.40.31.16:9618?addrs=10.40.31.16-9618&noUDP&sock=3873084_5e04_4>
001 (3879.442.000) 10/15 22:14:31 Job executing on host: <10.40.48.189:9618?addrs=10.40.48.189-9618&noUDP&sock=7546_81bf_3>
006 (3879.442.000) 10/15 22:14:39 Image size of job updated: 1573112
...
006 (3879.442.000) 10/16 03:55:20 Image size of job updated: 2438100
006 (3879.442.000) 10/16 12:22:41 Image size of job updated: 3143308
004 (3879.442.000) 10/16 12:22:41 Job was evicted.
009 (3879.442.000) 10/16 12:22:41 Job was aborted by the user.

From the StarterLog:
10/15/16 22:14:31 (pid:3013915) Create_Process succeeded, pid=3013916
10/16/16 03:28:42 (pid:3013915) IWD: /apps/homefs2/q/daily
10/16/16 03:28:42 (pid:3013915) Error file: /spare/condor/encrypted42868/dir_3013915/.condor_ssh_to_job_1/sshd.log
10/16/16 03:28:42 (pid:3013915) Renice expr "1" evaluated to 1
10/16/16 03:28:42 (pid:3013915) Using wrapper /usr/local/sbin/os/condor_wrapper.sh to exec /usr/sbin/sshd -i -e -f /spare/condor/encrypted42868/dir_3013915/.condor_ssh_to_job_1/sshd_config
10/16/16 03:28:42 (pid:3013915) Running job as user q
10/16/16 03:28:42 (pid:3013915) Create_Process succeeded, pid=3064380
10/16/16 03:28:42 (pid:3013915) Process exited, pid=3064368, status=0
10/16/16 03:28:42 (pid:3013915) unhandled job exit: pid=3064368, status=0
10/16/16 04:40:04 (pid:3013915) Process exited, pid=3013916, status=0
10/16/16 12:21:49 (pid:3013915) IWD: /apps/homefs2/q/daily
10/16/16 12:21:49 (pid:3013915) Error file: /spare/condor/encrypted42868/dir_3013915/.condor_ssh_to_job_2/sshd.log
10/16/16 12:21:49 (pid:3013915) Renice expr "1" evaluated to 1
10/16/16 12:21:49 (pid:3013915) Using wrapper /usr/local/sbin/os/condor_wrapper.sh to exec /usr/sbin/sshd -i -e -f /spare/condor/encrypted42868/dir_3013915/.condor_ssh_to_job_2/sshd_config
10/16/16 12:21:49 (pid:3013915) Running job as user q
10/16/16 12:21:49 (pid:3013915) Create_Process succeeded, pid=3140578
10/16/16 12:21:49 (pid:3013915) Process exited, pid=3140565, status=0
10/16/16 12:21:49 (pid:3013915) unhandled job exit: pid=3140565, status=0
10/16/16 12:22:29 (pid:3013915) Process exited, pid=3140578, status=255
10/16/16 12:22:41 (pid:3013915) Got SIGTERM. Performing graceful shutdown.
10/16/16 12:22:41 (pid:3013915) ShutdownGraceful all jobs.
10/16/16 12:22:41 (pid:3013915) Process exited, pid=3064380, signal=15
10/16/16 12:22:41 (pid:3013915) Last process exited, now Starter is exiting
10/16/16 12:22:41 (pid:3013915) **** condor_starter (condor_STARTER) pid 3013915 EXITING WITH STATUS 0
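One way to cross-check a StarterLog like the one above is to pair each "Create_Process succeeded" line with the matching "Process exited" line and see which PIDs the starter was still tracking when the job's PID exited. The following is only a rough Python sketch: it assumes the log line format shown in this excerpt, and the function names process_lifetimes and alive_at are made up for illustration, not part of HTCondor. Run on four lines copied from the excerpt, it reports that pid 3064380 (created at 03:28:42) was still outstanding at 04:40:04 and did not exit until 12:22:41.

```python
import re
from datetime import datetime

# Patterns assume the StarterLog line format shown in this post.
LINE_RE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:(\d+)\) (.*)$")
CREATED_RE = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXITED_RE = re.compile(r"Process exited, pid=(\d+)")


def process_lifetimes(lines):
    """Return {pid: (created, exited)}; exited is None if no exit line was seen."""
    lifetimes = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        msg = m.group(3)
        created = CREATED_RE.search(msg)
        exited = EXITED_RE.search(msg)
        if created:
            lifetimes[int(created.group(1))] = (stamp, None)
        elif exited:
            pid = int(exited.group(1))
            # Skip pids that never had a Create_Process line.
            if pid in lifetimes:
                lifetimes[pid] = (lifetimes[pid][0], stamp)
    return lifetimes


def alive_at(lifetimes, when):
    """Pids created at or before `when` that had not yet exited at `when`."""
    return sorted(
        pid for pid, (start, end) in lifetimes.items()
        if start <= when and (end is None or end > when)
    )


# Four lines copied from the StarterLog excerpt in this post.
sample = """\
10/15/16 22:14:31 (pid:3013915) Create_Process succeeded, pid=3013916
10/16/16 03:28:42 (pid:3013915) Create_Process succeeded, pid=3064380
10/16/16 04:40:04 (pid:3013915) Process exited, pid=3013916, status=0
10/16/16 12:22:41 (pid:3013915) Process exited, pid=3064380, signal=15
""".splitlines()

lifetimes = process_lifetimes(sample)
job_exit = lifetimes[3013916][1]       # when the job's own pid exited
print(alive_at(lifetimes, job_exit))   # prints [3064380]
```

Whether that lingering process explains the job staying in the running state is not something the sketch can establish; it only makes the overlap in the log easier to see.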