Re: [Condor-users] condor_rm on lots of jobs

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On Thu, Apr 29, 2010 at 11:12 AM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:

Ian,

The DAGMan experts may correct me, but I believe the best practice for removing DAGs is to condor_rm the DAGMan scheduler universe job and let DAGMan condor_rm the jobs in the DAG. Example:

condor_rm -constraint 'JobUniverse == 7'

However, that could still result in a lot of POST scripts running at the same time. When you submit a dag, there is an option -maxpost which can be used to specify the maximum number of POST scripts that will be run at the same time in that DAG.

--Dan

Ian Stokes-Rees wrote:

We get into a situation about once every 4-6 weeks where we have >10k
jobs queued, a few thousand running, and discover that some aspect of
the system (network, files, job descriptions, job scripts) has an error
that is going to cause most, if not all, of the jobs to fail. We need
to flush *all* the jobs. condor_rm on all the jobs usually brings our
system to its knees, and Condor stops responding. Typically it takes
4-24 hours to recover, often with numerous manually condor_on commands
to coax Condor back to life. Our system usually experiences a big load
spike. This morning when we did condor_rm with 15k jobs the load
climbed to ... you guessed it, ~15000.

Can anyone advise if there are better ways of doing this?

FWIW, the jobs are generally managed through DAGMan, and have POST
scripts associated with them. I would have thought that a queued job
that gets "rm'ed" isn't going to go through the POST script, but I could
be wrong.

Thanks,

Ian

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

Mailing List Archives

Public Access

Re: [Condor-users] condor_rm on lots of jobs