Re: [Condor-users] dagman capabilities
- Date: Mon, 05 Oct 2009 12:11:01 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] dagman capabilities
michael bane wrote:
I was looking at 'job recovery: the rescue DAG' in the online Condor
manual (2.10.6) but couldn't decide whether DAGMan is capable of handling
the situation of submitting N jobs (embarrassingly parallel, say) to
the Vanilla universe (since we cannot link to the checkpointing for
Standard) and then resubmitting those which are killed (e.g. due to
pre-emption by work on the given nodes)?
No need for DAGMan to do this... Condor itself will automatically
restart a Vanilla universe job that fails to complete. By "complete", I
mean the job executable exits of its own accord.
So if you just need to submit a group of N jobs and have them resilient
to preemption, crashes, etc., just use condor_submit. No need for DAGMan.
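As a rough sketch of the condor_submit route, a submit description file
for N independent vanilla jobs might look something like the following.
The executable name (my_job), the file names, and the count of 100 are
placeholders chosen for illustration and are not from the original
question; adjust them to your own setup.

```
# sweep.sub (hypothetical file name)
universe   = vanilla
executable = my_job
# $(Process) runs from 0 to N-1, giving each job its own argument and files
arguments  = $(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = sweep.log
# N = 100 here, purely as an example
queue 100
```

Submitting this with "condor_submit sweep.sub" queues all the jobs at
once. Since a vanilla job is not checkpointed, a job that gets preempted
starts again from the beginning when it is rerun.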
On the other hand, if you have inter-job dependencies (i.e. after your N
vanilla jobs complete successfully, you want another job to be
submitted), then you want condor_submit_dag.
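
For the dependency case, a minimal DAG input file could be sketched as
below. The node names and submit file names are made up for illustration,
and only three worker nodes are shown to keep it short; each .sub file is
assumed to be an ordinary submit description file.

```
# sweep.dag (hypothetical file name)
JOB work0 work0.sub
JOB work1 work1.sub
JOB work2 work2.sub
JOB final final.sub
# final is submitted only after all three worker nodes succeed
PARENT work0 work1 work2 CHILD final
```

This would be submitted with "condor_submit_dag sweep.dag".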
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685