Re: [HTCondor-users] Completed jobs stuck on node.
- Date: Wed, 07 Aug 2013 15:56:57 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Completed jobs stuck on node.
On 8/7/2013 11:07 AM, Michael McInerny Murphy wrote:
> Completed jobs are getting stuck on nodes. The _condor_stdout shows a normal
> program finish and the expected output files are present in the
> /var/lib/condor/execute/dir***/ folder. Condor still shows this job as
> running (both on condor_status and condor_q), however, nothing is happening.
> The machine continues to stay in the Busy state.
If the machine stays in the Claimed/Busy state, that implies that HTCondor
thinks the process it spawned to start the job has not yet exited.
So the first thing I'd suggest is to check whether this is indeed the case. I
know you said above that stdout shows a normal program finish, but the
bottom line is whether the process is still active on the system. So
on a stuck node, is there a process running as a child of the starter or
not? My guess is yes. Perhaps the job you are submitting to HTCondor
is a shell script that spawns a process (one that writes your output
files) and then fails to exit for some reason?
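To make that check concrete, here is a minimal Python sketch, not an official HTCondor tool. It assumes a Linux-style `ps -eo pid,ppid,comm` and that the starter shows up in ps as "condor_starter"; the function names are made up for illustration.

```python
import subprocess


def children_of_starter(ps_output, starter_name="condor_starter"):
    """Return (starter_pid, child_pid, child_command) tuples.

    ps_output is the text printed by `ps -eo pid,ppid,comm`.
    Only direct children of a starter process are reported.
    """
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # The header line (PID PPID COMMAND) fails the isdigit() checks.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    # Assumption: the starter's command name is exactly starter_name.
    starters = {pid for pid, _ppid, comm in rows if comm == starter_name}
    return [(ppid, pid, comm) for pid, ppid, comm in rows if ppid in starters]


def current_ps_output():
    """Run ps on the local machine and return its output as text."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a stuck node, `children_of_starter(current_ps_output())` returning a non-empty list would suggest the job process (or a wrapper script) is still alive. Note that it only looks one level down; a process spawned by a wrapper script would be a grandchild of the starter and would need a further walk of the process tree.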
You could also look in the StarterLog.slotX file for lines like:
06/14/13 17:52:24 Running job as user foo
06/14/13 17:52:24 Create_Process succeeded, pid=22439
which says pid 22439 is the actual job pid (it probably shows up in ps as
"condor_exec.exe" if this is a vanilla universe job). Then, if the job
actually exits, you would see a line later in the StarterLog saying:
06/14/13 18:00:47 Process exited, pid=22439, status=0
If the job pid (pid 22439 in the above example) actually no longer
exists on the system and yet nothing in the starterlog shows that pid
exiting, that would be troubling. But I would be very surprised by that.
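The log comparison above can be sketched in Python as follows. This is only an illustration: the two patterns are taken from the example log lines quoted above, the function names are made up, and the code assumes a pid is not reused by a later job within the same log.

```python
import os
import re

# Patterns copied from the example StarterLog lines above.
CREATED = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXITED = re.compile(r"Process exited, pid=(\d+)")


def unexited_job_pids(log_text):
    """Return pids that were created but have no matching exit line."""
    started = []
    exited = set()
    for line in log_text.splitlines():
        match = CREATED.search(line)
        if match:
            started.append(int(match.group(1)))
            continue
        match = EXITED.search(line)
        if match:
            exited.add(int(match.group(1)))
    return [pid for pid in started if pid not in exited]


def pid_is_alive(pid):
    """Check whether a pid exists on this machine (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A pid returned by `unexited_job_pids` for which `pid_is_alive` is also false would be the troubling case described above: the process is gone, yet the starter never logged its exit.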
Hope the above helps to definitively narrow it down to either HTCondor
or something unexpected with your job.
> I'm unsure of the path to fix
> this problem. The StarterLog.slot2 file has the following msg:
> ERROR: the submitting host claims to be in our UidDomain (ierus.local), yet
> its hostname (192.168.1.90) does not match. If the above hostname is actually
> an IP address, Condor could not perform a reverse DNS lookup to convert the IP
> back into a name. To solve this problem, you can either correctly configure
> DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your
> I'm new to administering condor so I'm at a loss on where to start to correct
> this issue. Thanks for your help.
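For the reverse DNS part of that error message, a quick way to see what the execute node gets back for the submit host's IP is a small Python check like the one below. It is a sketch under assumptions: the suffix comparison is my reading of the error text, not HTCondor's actual code, and the function names are invented for this example.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname for ip, or None if the reverse lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except (socket.herror, socket.gaierror):
        return None
    return hostname


def in_uid_domain(hostname, uid_domain):
    """Assumed check: does hostname equal or end in the UID domain?"""
    if hostname is None:
        return False
    hostname = hostname.lower().rstrip(".")
    uid_domain = uid_domain.lower()
    return hostname == uid_domain or hostname.endswith("." + uid_domain)
```

Running `reverse_lookup("192.168.1.90")` on the execute node and getting None, or a name for which `in_uid_domain(name, "ierus.local")` is false, would be consistent with the error quoted above.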
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685