Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] exit status 100 in SchedLog

Date: Thu, 02 Aug 2007 11:43:50 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [Condor-users] exit status 100 in SchedLog

Junjun Mao wrote:

Condor has been running fine for a few months, but all the jobs got killed (some restarted) suddenly yesterday. Here is the SchedLog from the master node:
7/31 15:00:03 Shadow pid 10451 for job 573.2 exited with status 100
7/31 15:00:03 match (<10.10.20.64:49539>#1175522523#195) out of jobs(cluster id 181); relinquishing
7/31 15:00:03 Sent RELEASE_CLAIM to startd on <10.10.20.64:49539>
7/31 15:00:03 Match record (<10.10.20.64:49539>, 181, -1) deleted
7/31 15:00:04 Got VACATE_SERVICE from <10.10.20.64:34423>
7/31 15:00:04 Shadow pid 9427 for job 636.0 exited with status 100
7/31 15:00:04 match (<10.10.20.76:46461>#1175522447#121) out of jobs(cluster id 636); relinquishing
7/31 15:00:04 Sent RELEASE_CLAIM to startd on <10.10.20.76:46461>
7/31 15:00:04 Match record (<10.10.20.76:46461>, 636, -1) deleted
7/31 15:00:04 Got VACATE_SERVICE from <10.10.20.76:59431>
7/31 15:00:04 In DedicatedScheduler::reaper pid 22101 has status 1024
7/31 15:00:04 Shadow pid 22101 exited with status 4
7/31 15:00:04 ERROR: Shadow exited with job exception code!
It seems the shadows exited with status 100 or 4. What do status 100 and 4 mean? Do they have anything to do with the network or file system?

Grep for "ERROR" in your ShadowLog to see what the problem is (or, if you specified a "Log=" line in your job submit file to get a user log, the error will appear in that file as well).
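
If plain grep is not handy, the same search can be sketched in a few lines of Python. This is only an illustration: the helper names find_error_lines and grep_log are made up for this example, and the ShadowLog path is left as an argument because its location depends on your local configuration.

```python
import re
from typing import Iterable, List

# Matches the literal string "ERROR" anywhere in a log line,
# e.g. "7/31 15:00:04 ERROR: Shadow exited with job exception code!"
ERROR_PATTERN = re.compile(r"ERROR")


def find_error_lines(lines: Iterable[str]) -> List[str]:
    """Return every line containing "ERROR", without the trailing newline."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


def grep_log(path: str) -> List[str]:
    """Scan a log file (hypothetical path supplied by the caller) for errors."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, "r", errors="replace") as handle:
        return find_error_lines(handle)
```

Calling grep_log() with the path to your ShadowLog (or to the user log named by "Log=") returns the matching lines in file order.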

For the interested reader, here are all the condor_shadow exit codes and what they mean:


Code  Name               Meaning
4     JOB_EXCEPTION      The job exited with an exception
44    DPRINTF_ERROR      There is a fatal error with dprintf()
100   JOB_EXITED         The job exited (not killed)
101   JOB_CKPTED         The job was checkpointed
102   JOB_KILLED         The job was killed
103   JOB_COREDUMPED     The job was killed and a core file produced
105   JOB_NO_MEM         Not enough memory to start the shadow
106   JOB_SHADOW_USAGE   Incorrect arguments to condor_shadow
107   JOB_NOT_CKPTED     The job was kicked off without a checkpoint

107 JOB_SHOULD_REQUEUE (!) We define this to the same number, since we want the same behavior. However, "JOB_NOT_CKPTED" doesn't mean much if we're not a standard universe job. The effect of this exit code is that we want the job to be put back in the job queue and run again.

108   JOB_NOT_STARTED    Can't connect to startd or request refused
109   JOB_BAD_STATUS     Job status != RUNNING on startup
110   JOB_EXEC_FAILED    Exec failed for some reason other than ENOMEM
111   JOB_NO_CKPT_FILE   There is no checkpoint file (lost)
112   JOB_SHOULD_HOLD    The job should be put on hold
113   JOB_SHOULD_REMOVE  The job should be removed
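
For scripting against the SchedLog, the list above can be turned into a small lookup table. The sketch below is a convenience for readers, not anything shipped with Condor; the names SHADOW_EXIT_CODES, exit_code_from_wait_status and describe are invented for this example. It also assumes that the "status 1024" reported by DedicatedScheduler::reaper in the quoted log is a raw POSIX wait status, where the exit code sits in the high byte, which would explain why the next line reports status 4.

```python
# Exit codes copied from the list above; 107 carries both names because
# JOB_SHOULD_REQUEUE is defined to the same number as JOB_NOT_CKPTED.
SHADOW_EXIT_CODES = {
    4: ("JOB_EXCEPTION", "The job exited with an exception"),
    44: ("DPRINTF_ERROR", "There is a fatal error with dprintf()"),
    100: ("JOB_EXITED", "The job exited (not killed)"),
    101: ("JOB_CKPTED", "The job was checkpointed"),
    102: ("JOB_KILLED", "The job was killed"),
    103: ("JOB_COREDUMPED", "The job was killed and a core file produced"),
    105: ("JOB_NO_MEM", "Not enough memory to start the shadow"),
    106: ("JOB_SHADOW_USAGE", "Incorrect arguments to condor_shadow"),
    107: ("JOB_NOT_CKPTED / JOB_SHOULD_REQUEUE",
          "The job was kicked off without a checkpoint; requeue and run again"),
    108: ("JOB_NOT_STARTED", "Can't connect to startd or request refused"),
    109: ("JOB_BAD_STATUS", "Job status != RUNNING on startup"),
    110: ("JOB_EXEC_FAILED", "Exec failed for some reason other than ENOMEM"),
    111: ("JOB_NO_CKPT_FILE", "There is no checkpoint file (lost)"),
    112: ("JOB_SHOULD_HOLD", "The job should be put on hold"),
    113: ("JOB_SHOULD_REMOVE", "The job should be removed"),
}


def exit_code_from_wait_status(status: int) -> int:
    """Extract the exit code from a raw POSIX wait status (assumed encoding)."""
    return (status >> 8) & 0xFF


def describe(code: int) -> str:
    """Return a one-line description of a condor_shadow exit code."""
    name, meaning = SHADOW_EXIT_CODES.get(
        code, ("UNKNOWN", "not in the list above"))
    return f"{code} {name}: {meaning}"
```

With this, describe(exit_code_from_wait_status(1024)) yields the JOB_EXCEPTION entry, matching the "job exception code" error in the quoted log, while describe(100) shows that status 100 simply means the job exited on its own.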

