
Re: [Condor-users] killing jobs when condor won't respond

Michael Thomas wrote:

Recently our cluster (running Condor 6.7.18) experienced an impossibly
high load (~800) due to many, many globus-job-manager scripts running.
The cluster was fully utilized with ~200 running jobs, but there were
~500 or more globus-job-manager scripts running.  At one point, when I
was able to run Condor commands, they reported that there were ~3000
jobs in the queue, most of them idle.

Unfortunately, I was often unable to run condor_q, condor_rm, or any
other Condor command during this time due to the high load, which
prevented me from removing the idle jobs from the queue and killing
the running jobs.

Is there a backdoor way to manipulate the Condor queue and remove jobs
without having to go through condor_rm?  Or are there any suggestions
for how to recover from an overloaded queue?


There is a backdoor of sorts: stop the schedd (i.e., kill -9 all of the Condor processes) and remove job_queue.log from your Condor spool directory. However, I do not believe this would have helped you recover in the circumstances you describe. The globus jobmanagers would continue to run and would not even be aware that the jobs had been removed, because this "backdoor" way of emptying the job queue does not write a job removal event to the job log that the globus jobmanager uses to monitor the job.
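
For what it's worth, the mechanics would look roughly like this (the
spool path is only an example -- condor_config_val SPOOL will tell you
the real one, and the pkill pattern may need adjusting for your setup):

  # Forcibly stop all Condor daemons (condor_master, condor_schedd, ...)
  pkill -9 condor_

  # Find the spool directory; /var/lib/condor/spool below is an example
  condor_config_val SPOOL

  # Remove the persistent job queue so the schedd starts up empty
  rm /var/lib/condor/spool/job_queue.log

  # Restart Condor however you normally do (init script or condor_master)
  condor_master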

An alternate approach, which might work better in situations like this, would be to remove the globus job state files and kill the jobmanagers. When the machine becomes responsive again, you should then be able to use condor_rm to remove the jobs from Condor. What do you think?
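
Concretely, I am thinking of something along these lines (the state-file
path is a placeholder -- where the GRAM job state files actually live
depends on your Globus installation, so check your jobmanager
configuration first):

  # Kill the orphaned globus-job-manager processes
  pkill -9 -f globus-job-manager

  # Remove the jobmanagers' job state files; this path is hypothetical
  rm -rf /opt/globus/tmp/gram_job_state/*

  # Once the load drops and the schedd answers again, clear the queue
  condor_rm -all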

--Dan