
Re: [HTCondor-users] condor_q shows jobs still running which have completed



When this happens, is there still a condor_starter process running on the execute node?
Does it have any child processes?
Are you running a standard bioinformatics job (i.e., one that we might also be running here at UW)?
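
For the first two questions, here is a rough sketch of one way to check on the execute node. It assumes a Linux `ps` that accepts `-eo pid,ppid,comm`; the function names (`parse_ps`, `starters_with_children`, `live_report`) are made up for illustration and are not part of HTCondor.

```python
import subprocess
from collections import defaultdict


def parse_ps(text):
    """Parse `ps -eo pid,ppid,comm` output into (pid, ppid, comm) tuples."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # first line is the header
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def starters_with_children(rows):
    """Map each condor_starter PID to a list of its direct children."""
    children = defaultdict(list)
    for pid, ppid, comm in rows:
        children[ppid].append((pid, comm))
    return {
        pid: children.get(pid, [])
        for pid, ppid, comm in rows
        if comm == "condor_starter"
    }


def live_report():
    """Run ps on this machine and print any starters with their children."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, kids in starters_with_children(parse_ps(out)).items():
        print(f"condor_starter {pid}: children = {kids}")
```

Calling `live_report()` on the affected node while condor_q still shows the job as running should tell us whether a starter is lingering and whether anything is still running under it. An empty result means no condor_starter was found.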

What version of HTCondor are you using?
Is the execute node running the same versions of Linux and HTCondor as the submit node?

Thanks,
-tj

On 1/19/2014 10:05 PM, Joe Knapka wrote:
Hello everyone,

I am running a large number of long-running jobs on a 56-node
Linux-based HTCondor cluster, using the "vanilla" universe (because
the programs depend on both fork() and mmap()).  I have found that
occasionally condor_q shows a job as running, when that job has
actually completed hours earlier.  The job has produced its expected
output file, and no job is running on the node it was scheduled on.
When this happens, Condor no longer schedules jobs on the compute node
it thinks the completed job is running on.  I must manually condor_rm
the job in order to get Condor to schedule further jobs on the
affected node.  I have not found references to any similar symptom in
the FAQ or via Google. Any ideas why this might be happening?

Thank you,

Joe Knapka
Bioinformatics / University of Texas / El Paso