
Re: [HTCondor-users] Completed jobs stuck on node.



On 8/7/2013 11:07 AM, Michael McInerny Murphy wrote:
Completed jobs are getting stuck on nodes.  The _condor_stdout shows a normal
program finish and the expected output files are present in the
/var/lib/condor/execute/dir***/ folder.  Condor still shows this job as
running (both on condor_status and condor_q), however, nothing is happening.
The machine continues to stay in the Busy state.

If the machine stays in Claimed/Busy state, that implies that HTCondor thinks the process it spawned to start the job has not yet exited.

So the first thing I'd suggest is to check whether this is indeed the case. I know you said above that stdout shows a normal program finish, but the bottom line is whether the process is still active on the system. So on a stuck node, is there a process running as a child of the condor_starter or not? My guess is yes. Perhaps the job you are submitting to HTCondor is a shell script that spawns off a process (that writes your output files) and then fails to exit for some reason?
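To make that check concrete, here is a rough Python sketch that lists every descendant of a given parent pid from a process table. It is only an illustration: the helper names (parse_ps, descendants, live_process_table) are made up for this example, and it assumes a Linux node where "ps -eo pid,ppid,comm" is available. You would pass in the pid of the condor_starter on the stuck node.

```python
import subprocess


def parse_ps(ps_output):
    """Parse 'ps -eo pid,ppid,comm' text into (pid, ppid, command) tuples."""
    procs = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that is not "pid ppid command".
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        procs.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return procs


def descendants(procs, parent_pid):
    """Return (pid, command) for every process below parent_pid, breadth first."""
    found = []
    queue = [parent_pid]
    while queue:
        current = queue.pop(0)
        for pid, ppid, command in procs:
            if ppid == current:
                found.append((pid, command))
                queue.append(pid)
    return found


def live_process_table():
    """Assumption: a Linux host whose ps accepts -eo pid,ppid,comm."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout)
```

On the stuck node, something like descendants(live_process_table(), starter_pid) should come back empty if the job really is gone; anything it returns is a process HTCondor is still waiting on.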

You could also look in the StarterLog.slotX file for lines like:

06/14/13 17:52:24 Running job as user foo
06/14/13 17:52:24 Create_Process succeeded, pid=22439

which says pid 22439 is the actual job pid (it probably shows up in ps as "condor_exec.exe" if the job is vanilla universe). Then, if the job actually exits, you would see a line later in the StarterLog saying:

06/14/13 18:00:47 Process exited, pid=22439, status=0

If the job pid (pid 22439 in the above example) actually no longer exists on the system and yet nothing in the starterlog shows that pid exiting, that would be troubling. But I would be very surprised by that.
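If you would rather not eyeball the log, the same comparison can be sketched in Python. This is a minimal example, not an HTCondor tool: it assumes the StarterLog lines look exactly like the two samples above, and the function names (unfinished_job_pids, pid_alive) are invented for illustration. The pid_alive helper relies on Unix signal semantics.

```python
import os
import re

# Patterns taken from the sample StarterLog lines quoted above.
CREATED = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXITED = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")


def unfinished_job_pids(log_lines):
    """Return job pids that were created but have no later 'Process exited' line."""
    pending = []
    for line in log_lines:
        match = CREATED.search(line)
        if match:
            pending.append(int(match.group(1)))
            continue
        match = EXITED.search(line)
        if match:
            pid = int(match.group(1))
            if pid in pending:
                pending.remove(pid)
    return pending


def pid_alive(pid):
    """Unix only: signal 0 checks for existence without touching the process."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A pid that unfinished_job_pids reports and for which pid_alive returns True points at the job (or something it spawned) still running. A reported pid for which pid_alive returns False would be the troubling case described above.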

Hope the above helps to definitively narrow it down to either HTCondor or something unexpected with your job,
regards,
Todd


I'm unsure of the path to fix
this problem.  The StarterLog.slot2 file has the following msg:

ERROR: the submitting host claims to be in our UidDomain (ierus.local), yet
its hostname (192.168.1.90) does not match.  If the above hostname is actually
an IP address, Condor could not perform a reverse DNS lookup to convert the IP
back into a name.  To solve this problem, you can either correctly configure
DNS to allow the reverse lookup, or you can enable TRUST_UID_DOMAIN in your
condor configuration.

I'm new to administering condor so I'm at a loss on where to start to correct
this issue.  Thanks for your help.

Michael


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685