
[HTCondor-users] condor_q shows jobs still running which have completed



Hello everyone,

I am running a large number of long-running jobs on a 56-node
Linux-based HTCondor cluster, using the "vanilla" universe (because
the programs depend on both fork() and mmap()).  I have found that
occasionally condor_q shows a job as running when that job
actually completed hours earlier.  The job has produced its expected
output file, and no job is running on the node it was scheduled on.
When this happens, Condor no longer schedules jobs on the compute node
it thinks the completed job is running on.  I must manually condor_rm
the job in order to get Condor to schedule further jobs on the
affected node.  I have not found references to any similar symptom in
the FAQ or via Google. Any ideas why this might be happening?
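
For what it is worth, here is a minimal sketch of how the affected jobs
could be spotted automatically. It is only an illustration under stated
assumptions, not something HTCondor provides: the job records are assumed
to have been collected already (for example from condor_q output) as
dictionaries carrying the ClusterId, ProcId and JobStatus attributes, and
the function find_stale_jobs and the expected_output mapping from job ID
to output path are hypothetical names of my own. An existing output file
only suggests that a job has finished, so each hit should still be checked
by hand before running condor_rm.

```python
import os
from typing import Dict, List

# JobStatus value that HTCondor uses for a job in the running state.
RUNNING = 2


def find_stale_jobs(jobs: List[Dict], expected_output: Dict[str, str]) -> List[str]:
    """Return IDs of jobs reported as running whose output file already exists.

    jobs            -- job records with ClusterId, ProcId and JobStatus keys
    expected_output -- hypothetical mapping from "cluster.proc" to output path
    """
    stale = []
    for job in jobs:
        job_id = f"{job['ClusterId']}.{job['ProcId']}"
        path = expected_output.get(job_id)
        # A job is suspicious if the queue says "running" but its output is on disk.
        if job.get("JobStatus") == RUNNING and path and os.path.exists(path):
            stale.append(job_id)
    return stale
```

The returned IDs are in the cluster.proc form that condor_rm accepts, so
the list can serve as a checklist of candidates for manual removal.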

Thank you,

Joe Knapka
Bioinformatics / University of Texas / El Paso

-- 
"I want them to understand that there is a playground in their minds
and that that is where mathematics happens." - Paul Lockhart