
[Condor-users] killing jobs when condor won't respond



Recently our cluster (running Condor 6.7.18) experienced an impossibly
high load (~800) caused by a great many globus-job-manager scripts
running.  The cluster was fully utilized with ~200 running jobs, but
there were ~500 or more globus-job-manager scripts running.  At one
point, when I was able to run Condor commands, Condor reported ~3000
jobs in the queue, most of them idle.

Unfortunately, the high load often left me unable to run condor_q,
condor_rm, or any other Condor command during this time.  This
prevented me from removing the idle jobs from the queue and killing
the running jobs.

Is there a backdoor way to manipulate the Condor queue and remove jobs
without having to go through condor_rm?  Or are there any suggestions
on how to recover from an overloaded queue?

--Mike
