
Re: [Condor-users] condor_rm on lots of jobs



On Thu, 29 Apr 2010, Ian Stokes-Rees wrote:

> We get into a situation about once every 4-6 weeks where we have >10k
> jobs queued, a few thousand running, and discover that some aspect of
> the system (network, files, job descriptions, job scripts) has an error
> that is going to cause most, if not all, of the jobs to fail.  We need
> to flush *all* the jobs.  condor_rm on all the jobs usually brings our
> system to its knees, and Condor stops responding.  Typically it takes
> 4-24 hours to recover, often requiring numerous manual condor_on
> commands to coax Condor back to life.  Our system usually experiences
> a big load spike.  This morning when we did condor_rm on 15k jobs the
> load climbed to ... you guessed it, ~15000.
>
> Can anyone advise if there are better ways of doing this?
>
> FWIW, the jobs are generally managed through DAGMan, and have POST
> scripts associated with them.  I would have thought that a queued job
> that gets "rm'ed" isn't going to go through the POST script, but I
> could be wrong.

This is a case where you *don't* want to manually remove the node jobs. If you remove the node jobs and DAGMan sees those removal events before it is itself removed, it *will* run POST scripts for the removed jobs, which might be part of your load problem.

As I said before, if you just condor_rm DAGMan itself, no POST scripts should get run.
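
For example, something along these lines (a rough sketch; the 1234.0
cluster id below is just a placeholder for your DAGMan job's id, and
JobUniverse == 7 selects scheduler-universe jobs, which is where
condor_dagman runs):

    # find the DAGMan manager job(s), not the node jobs
    condor_q -constraint 'JobUniverse == 7'

    # remove only the DAGMan job; it removes its own node jobs on the
    # way out, without running their POST scripts
    condor_rm 1234.0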

Kent Wenger
Condor Team