
Re: [Condor-users] Is it possible to remove a job started by a DAG not wanted anymore?



On 04/20/2011 12:17 PM, R. Kent Wenger wrote:
On Wed, 20 Apr 2011, Carsten Aulbert wrote:

we are facing a small little problem, but don't know how to tackle it.

A large number of jobs has been started with dagman, most of the jobs run
fine, but some have very severe memory needs. We would like to "kill" those
without dagman restarting those.

Is it possible to tell the master dagman process that it should ignore a
specific job while the dag is running? Or any other way to navigate around
this problem?

Since you talk about DAGMan restarting the jobs, I assume you have
retries turned on for the relevant DAG nodes. (If you don't have retries
turned on, you should be able to condor_rm the offending jobs; DAGMan
would just consider those nodes failed, and continue to make as much
progress as possible given the failures.)

Assuming that you have retries set for the nodes, you could condor_hold
the jobs you want to get rid of. Once the DAG stops making progress,
condor_rm the DAGMan job, and that should remove the held node jobs.

Unfortunately, there's no way at this point to remove any dependencies
from a running DAG. I think you'll have to edit the rescue DAG file, and
then re-run the DAG.

Kent Wenger
Condor Team

Kent and I were discussing this use case just last week.

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2057

The idea is that you'd have logic in your PRE/POST scripts that said something like,

PRE -

if [ -e "$1.skip" ]; then
  exit 1
fi

POST -

if [ -e "$1.skip" ]; then
  exit 0
fi

To skip a node, you'd touch its $1.skip file. For a job that hasn't run yet, the PRE script skips it and the POST script marks it as succeeded. For a job that is already running, the POST script marks it as succeeded.
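
A slightly fuller sketch of that logic, for illustration only. The function names and the SKIP_DIR variable are made up here, and passing the job's own exit code through when no marker exists is an assumption about what you'd want, not something from the ticket:

```shell
#!/bin/sh
# Hypothetical helpers sketching the skip-marker idea.
# SKIP_DIR (assumed name) is where the <node>.skip files live;
# it defaults to the current directory.

# skip_requested NODE: succeeds if NODE.skip exists.
skip_requested() {
    [ -e "${SKIP_DIR:-.}/$1.skip" ]
}

# pre_status NODE: prints the exit code a PRE script would use.
# 1 when the marker exists (as in the PRE snippet above), else 0.
pre_status() {
    if skip_requested "$1"; then
        echo 1
    else
        echo 0
    fi
}

# post_status NODE JOB_RETURN: prints the exit code a POST script would use.
# 0 when the marker exists (node counted as succeeded); otherwise the
# job's own return value is passed through (an assumption).
post_status() {
    if skip_requested "$1"; then
        echo 0
    else
        echo "$2"
    fi
}
```

A PRE script would then end with something like exit "$(pre_status "$1")", and a POST script with exit "$(post_status "$1" "$2")", assuming the DAG file passes the node name and job return value as arguments (e.g. via DAGMan's $JOB and $RETURN macros on the SCRIPT lines).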

Best,


matt