
Re: [Condor-users] condor_rm on lots of jobs



Ian,

The DAGMan experts may correct me, but I believe the best practice for removing DAGs is to condor_rm the DAGMan scheduler universe job and let DAGMan condor_rm the jobs in the DAG. Example:

condor_rm -constraint 'JobUniverse == 7'
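
If you only need to flush a single DAG rather than every DAG in the queue, you can instead condor_rm just that DAGMan job by its cluster ID (the cluster number here is only illustrative):

condor_rm 1234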

However, that could still result in a lot of POST scripts running at the same time. When you submit a DAG, the condor_submit_dag option -maxpost specifies the maximum number of POST scripts that will run at the same time within that DAG.
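
For example (the limit and the DAG file name here are only illustrative):

condor_submit_dag -maxpost 5 my_workflow.dag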

--Dan

Ian Stokes-Rees wrote:
We get into a situation about once every 4-6 weeks where we have >10k
jobs queued, a few thousand running, and discover that some aspect of
the system (network, files, job descriptions, job scripts) has an error
that is going to cause most, if not all, of the jobs to fail.  We need
to flush *all* the jobs.  condor_rm on all the jobs usually brings our
system to its knees, and Condor stops responding.  Typically it takes
4-24 hours to recover, often with numerous manual condor_on commands
to coax Condor back to life.  Our system usually experiences a big load
spike.  This morning when we did condor_rm with 15k jobs the load
climbed to ... you guessed it, ~15000.

Can anyone advise if there are better ways of doing this?

FWIW, the jobs are generally managed through DAGMan, and have POST
scripts associated with them.  I would have thought that a queued job
that gets "rm'ed" isn't going to go through the POST script, but I could
be wrong.

Thanks,

Ian