
Re: [HTCondor-users] Completed jobs stuck on node.



On 8/21/2013 8:27 AM, Michael Murphy wrote:
At the end of the report it looks like pid 30774 was spawned but never
exited.  This is confirmed by ps

$ ps -aux | grep condor_
condor   30467  0.0  0.0  91808  5468 ?        Ss   Aug08   0:06 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
condor   30468  0.0  0.0  92000  7668 ?        Ss   Aug08   0:31 condor_startd -f
root     30502  0.0  0.0  23280  2928 ?        S    Aug08   0:22 condor_procd -A /var/run/condor/procd_pipe.STARTD -L /var/log/condor/ProcLog.STARTD -R 10000000 -S 60 -C 104
condor   30770  0.0  0.0  91172  6980 ?        Ss   Aug08   0:00 condor_starter -f -a slot2 192.168.1.93
nobody   30774  353  5.9 3879252 2941908 ?     SNsl Aug08 3226:37 condor_exec.exe <name removed>

So your suspicions were correct. How would I fix this?  The program vlox
(stuck job executable) completes normally outside of condor for the same
batch of 20 run manually.



Some thoughts:

On a stuck job, try running condor_ssh_to_job. This lets you nose around in the job's environment, look at all its files, attach a debugger, and do whatever else it takes to figure out what is happening in real time.
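A rough sketch of that session, in case it helps. The job ID 1234.0 is a made-up placeholder (take the real cluster.proc from condor_q), and the tools listed in the comments are ordinary Linux utilities that may or may not be installed on your execute node:

```shell
# Hypothetical job ID; substitute the cluster.proc that condor_q shows
# for the stuck job.
JOB_ID="1234.0"

# Open a shell on the execute node, inside the running job's sandbox.
# The guard makes this a no-op on a machine without HTCondor installed.
if command -v condor_ssh_to_job >/dev/null 2>&1; then
    condor_ssh_to_job "$JOB_ID"
fi

# Things worth trying once you are inside the session (plain Linux tools,
# not HTCondor features; <pid> is the process ID of the stuck program):
#   ls -la          # what has the job written to its scratch directory?
#   env | sort      # which environment variables does the job see?
#   gdb -p <pid>    # attach a debugger, if gdb is installed
```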

Whenever I hear "it runs fine outside HTCondor, but fails when HTCondor runs it", I immediately think of permissions, ownerships, and environment variables. These are what usually differ between a program running inside vs outside HTCondor. For instance, when you test outside of HTCondor you are running as user "michael" (or whatever), but from the above it looks like HTCondor is configured to run your jobs as user "nobody". Try su-ing to nobody and see if your program works. As for environment variables, try putting "getenv = True" in your submit file so the job picks up all the environment variables from the shell you submit from.
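One way to see exactly which environment variables differ is to capture both environments and diff them. This is only a sketch: the file names (env_interactive.txt, env_test.sub, env_job.txt) are made up for illustration, and it assumes /usr/bin/env exists on the execute node:

```shell
# Save the environment of the interactive shell where the program works.
env | sort > env_interactive.txt

# Hypothetical test submit file: runs /usr/bin/env as the job, so the
# environment HTCondor gives a job ends up in env_job.txt.
cat > env_test.sub <<'EOF'
executable          = /usr/bin/env
transfer_executable = False
output              = env_job.txt
error               = env_job.err
log                 = env_test.log
queue
EOF

# Submit it, and once the job has finished compare the two:
#   condor_submit env_test.sub
#   sort env_job.txt | diff env_interactive.txt -
```

Any variable your program depends on that shows up only on the interactive side is a good suspect.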

Another thought - try condor_submit -i <submit file>. This will start an interactive login/shell on an execute node with the exact same setup that HTCondor uses to run your jobs (you will be user nobody, same environment, permissions, etc). Try running your job interactively that way and see if you discover why your program is hanging.

Maybe it encountered some error condition and is sitting around waiting for console input? If you do not already use stdin with your job, maybe create a file full of "yes" or whatever and specify this file with "input=filename" to have HTCondor use this file as stdin?
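Something like the following would build such a file. The name answers.txt and the count of 1000 lines are arbitrary choices for illustration:

```shell
# Write 1000 lines of "yes" into a file that can serve as the job's stdin.
yes yes | head -n 1000 > answers.txt

# Then add this line to the submit file, before the queue statement:
#   input = answers.txt
```

If the job then exits instead of hanging, the program was most likely blocked on a prompt.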

Sorry if the above is not very specific, but I am not sure what else HTCondor could do here.... if your program is sitting around not exiting, one would hope it is writing something to either stdout or stderr. One last idea - maybe submit some jobs and specify streaming stdout/err by placing
  stream_output = True
  stream_error = True
  output = myjob.out
  error = myjob.err
in your submit file ... the advantage of streaming stdout/err back to the submit machine in real time is that it will be flushed often. This way, if your program is giving some clue as to why it is not exiting, it won't be cached in some stdio buffer someplace where you cannot see it.
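For reference, a complete test submit file with streaming turned on might look roughly like this. The file name stream_test.sub and the log name are invented, and "vlox" stands in for however you normally specify your executable and its arguments:

```shell
# Write a minimal, hypothetical submit file with streaming enabled.
cat > stream_test.sub <<'EOF'
# Placeholder executable; use your own program and arguments here.
executable    = vlox
output        = myjob.out
error         = myjob.err
log           = myjob.log
stream_output = True
stream_error  = True
queue
EOF

# Submit it, then watch the files grow on the submit machine:
#   condor_submit stream_test.sub
#   tail -f myjob.out myjob.err
```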

Hope the above random thoughts help, please let us know what you figure out,
Todd