
Re: [Condor-users] killing jobs when condor won't respond

Michael Thomas wrote:

Recently our cluster (running Condor 6.7.18) experienced an impossibly
high load (~800) due to many, many globus-job-manager scripts running.
The cluster was fully utilized with ~200 running jobs, but there were
~500 or more globus-job-manager scripts running.  At one point, when I
was able to run Condor commands, they reported that there were ~3000
jobs in the queue, most of them idle.

Unfortunately, I was often unable to run condor_q, condor_rm, or any
other Condor command during this time due to the high load, which
prevented me from removing the idle jobs from the queue and killing
the running jobs.

Is there a backdoor way to manipulate the Condor queue and remove jobs
without having to go through condor_rm?  Or are there any suggestions
for how to recover from an overloaded queue?


There is a backdoor of sorts: stop the schedd (i.e., kill -9 all of the Condor processes) and remove job_queue.log from your Condor spool directory. However, I do not believe this would have helped you recover in the circumstances you describe. The globus jobmanagers would continue to run and would not even be aware that the jobs had been removed, because this "backdoor" way of emptying the job queue does not write a job removal event to the job log that the globus jobmanager uses to monitor the job.
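
For what it's worth, the mechanics would look roughly like this (the
spool path is only an example -- condor_config_val SPOOL will tell you
the real one, and the pkill pattern may need adjusting for your setup):

  # Forcibly stop all Condor daemons (condor_master, condor_schedd, ...)
  pkill -9 condor_

  # Find the spool directory; /var/lib/condor/spool below is an example
  condor_config_val SPOOL

  # Remove the persistent job queue so the schedd starts up empty
  rm /var/lib/condor/spool/job_queue.log

  # Restart Condor however you normally do (init script or condor_master)
  condor_master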

An alternate approach, which might work better in situations like this, would be to remove the globus job state files and kill the jobmanagers. When the machine becomes responsive again, you should then be able to use condor_rm to remove the jobs from Condor. What do you think?
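
Concretely, I am thinking of something along these lines (the state-file
path is a placeholder -- where the GRAM job state files actually live
depends on your Globus installation, so check your jobmanager
configuration first):

  # Kill the orphaned globus-job-manager processes
  pkill -9 -f globus-job-manager

  # Remove the jobmanagers' job state files; this path is hypothetical
  rm -rf /opt/globus/tmp/gram_job_state/*

  # Once the load drops and the schedd answers again, clear the queue
  condor_rm -all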

--Dan