
Re: [Condor-users] condor_rm on lots of jobs



Ian,

The DAGMan experts may correct me, but I believe the best practice for removing DAGs is to condor_rm the DAGMan scheduler universe job and let DAGMan condor_rm the jobs in the DAG. Example:

condor_rm -constraint 'JobUniverse == 7'
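
If you only need to flush a single DAG rather than every DAG in the queue, you can instead condor_rm just that DAGMan job by its cluster ID (the cluster number here is only illustrative):

condor_rm 1234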

However, that could still result in a lot of POST scripts running at the same time. When you submit a DAG, the condor_submit_dag option -maxpost specifies the maximum number of POST scripts that will run at the same time within that DAG.
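
For example (the limit and the DAG file name here are only illustrative):

condor_submit_dag -maxpost 5 my_workflow.dag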

--Dan

Ian Stokes-Rees wrote:
We get into a situation about once every 4-6 weeks where we have >10k
jobs queued, a few thousand running, and discover that some aspect of
the system (network, files, job descriptions, job scripts) has an error
that is going to cause most, if not all, of the jobs to fail.  We need
to flush *all* the jobs.  condor_rm on all the jobs usually brings our
system to its knees, and Condor stops responding.  Typically it takes
4-24 hours to recover, often with numerous manual condor_on commands
to coax Condor back to life.  Our system usually experiences a big load
spike.  This morning when we did condor_rm with 15k jobs the load
climbed to ... you guessed it, ~15000.

Can anyone advise if there are better ways of doing this?

FWIW, the jobs are generally managed through DAGMan, and have POST
scripts associated with them.  I would have thought that a queued job
that gets "rm'ed" isn't going to go through the POST script, but I could
be wrong.

Thanks,

Ian