Junjun Mao wrote:
Condor had been running fine for a few months, but yesterday all the jobs were suddenly killed (some were restarted). Here is the SchedLog from the master node:

7/31 15:00:03 Shadow pid 10451 for job 573.2 exited with status 100
7/31 15:00:03 match (<>#1175522523#195) out of jobs (cluster id 181); relinquishing
7/31 15:00:03 Sent RELEASE_CLAIM to startd on <>
7/31 15:00:03 Match record (<>, 181, -1) deleted
7/31 15:00:04 Got VACATE_SERVICE from <>
7/31 15:00:04 Shadow pid 9427 for job 636.0 exited with status 100
7/31 15:00:04 match (<>#1175522447#121) out of jobs (cluster id 636); relinquishing
7/31 15:00:04 Sent RELEASE_CLAIM to startd on <>
7/31 15:00:04 Match record (<>, 636, -1) deleted
7/31 15:00:04 Got VACATE_SERVICE from <>
7/31 15:00:04 In DedicatedScheduler::reaper pid 22101 has status 1024
7/31 15:00:04 Shadow pid 22101 exited with status 4
7/31 15:00:04 ERROR: Shadow exited with job exception code!

It seems the shadows exited with status 100 or 4. What do statuses 100 and 4 mean? Do they have anything to do with the network or the file system?

Grep for "ERROR" in your ShadowLog to see what the problem is. If you specified a "Log=" line in your job submit file to get a user log, the error will appear in that file as well.
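
If grep is not handy, the same search can be sketched in a few lines of Python. This is only an illustration: the function name find_errors is made up here, and the two sample lines are copied from the SchedLog excerpt above rather than from a real ShadowLog. In practice you would pass an open file object for your own ShadowLog instead of the sample list.

```python
from typing import Iterable, List


def find_errors(lines: Iterable[str], needle: str = "ERROR") -> List[str]:
    """Return every line containing `needle`, without its trailing newline."""
    return [line.rstrip("\n") for line in lines if needle in line]


# Sample input taken from the SchedLog excerpt in this thread.
# For a real log, pass an open file object instead of this list.
sample = [
    "7/31 15:00:04 Shadow pid 22101 exited with status 4\n",
    "7/31 15:00:04 ERROR: Shadow exited with job exception code!\n",
]

hits = find_errors(sample)
for hit in hits:
    print(hit)
```

The match is case-sensitive, like plain grep without -i.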

For the interested reader, here are all the condor_shadow exit codes and what they mean:

4  	JOB_EXCEPTION  	The job exited with an exception
44 	DPRINTF_ERROR 	There is a fatal error with dprintf()
100 	JOB_EXITED 	The job exited (not killed)
101 	JOB_CKPTED 	The job was checkpointed
102 	JOB_KILLED 	The job was killed
103 	JOB_COREDUMPED 	The job was killed and a core file produced
105 	JOB_NO_MEM 	Not enough memory to start the shadow
106 	JOB_SHADOW_USAGE 	Incorrect arguments to condor_shadow
107 	JOB_NOT_CKPTED 	The job was kicked off without a checkpoint
107 	JOB_SHOULD_REQUEUE 	Deliberately the same number as JOB_NOT_CKPTED, because we want the same behavior: the job is put back in the job queue and run again. (The name "JOB_NOT_CKPTED" doesn't mean much for a job that is not in the standard universe.)
108 	JOB_NOT_STARTED 	Can't connect to startd or request refused
109 	JOB_BAD_STATUS 	Job status != RUNNING on startup
110 	JOB_EXEC_FAILED 	Exec failed for some reason other than ENOMEM
111 	JOB_NO_CKPT_FILE 	There is no checkpoint file (lost)
112 	JOB_SHOULD_HOLD 	The job should be put on hold
113 	JOB_SHOULD_REMOVE 	The job should be removed
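
To tie the table back to the log above: the reaper line reports "status 1024" and the next line reports "exited with status 4". That is consistent with 1024 being a raw Unix wait status, where the exit code of a normally exited process sits in the high byte (1024 >> 8 = 4). The sketch below assumes that interpretation and the usual Linux wait-status layout; the dictionary simply restates the table, and the names SHADOW_EXIT_CODES, exit_code_from_wait_status and describe are made up for this example, not part of Condor.

```python
# Exit codes copied from the table above. 107 carries two names on purpose.
SHADOW_EXIT_CODES = {
    4: "JOB_EXCEPTION",
    44: "DPRINTF_ERROR",
    100: "JOB_EXITED",
    101: "JOB_CKPTED",
    102: "JOB_KILLED",
    103: "JOB_COREDUMPED",
    105: "JOB_NO_MEM",
    106: "JOB_SHADOW_USAGE",
    107: "JOB_NOT_CKPTED / JOB_SHOULD_REQUEUE",
    108: "JOB_NOT_STARTED",
    109: "JOB_BAD_STATUS",
    110: "JOB_EXEC_FAILED",
    111: "JOB_NO_CKPT_FILE",
    112: "JOB_SHOULD_HOLD",
    113: "JOB_SHOULD_REMOVE",
}


def exit_code_from_wait_status(status: int) -> int:
    """Extract the exit code from a raw wait status.

    Assumes the process exited normally and that the exit code is stored
    in bits 8-15, as on Linux.
    """
    return (status >> 8) & 0xFF


def describe(code: int) -> str:
    """Map a shadow exit code to its symbolic name, or "UNKNOWN"."""
    return SHADOW_EXIT_CODES.get(code, "UNKNOWN")


code = exit_code_from_wait_status(1024)  # value from the reaper line in the log
print(code, describe(code))
print(100, describe(100))
```

Under those assumptions, the 1024 in the DedicatedScheduler::reaper line and the 4 in the following line are the same event reported twice, and both point at JOB_EXCEPTION.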