
Re: [Condor-users] dagman capabilities



michael bane wrote:
I was looking at 'job recovery: the rescue DAG' in the online Condor manual (2.10.6) but couldn't decide whether DAGMan is capable of handling the situation of submitting N jobs (embarrassingly parallel, say) to the Vanilla universe (since we cannot link to the checkpointing for Standard) and then resubmitting those which are killed (e.g. due to pre-emption by work on the given nodes)?

No need for DAGMan to do this... Condor itself will automatically restart a Vanilla universe job that fails to complete. By "complete", I mean the job executable exits of its own accord.

So if you just need to submit a group of N jobs and have them resilient to preemption, crashes, etc., just use condor_submit. There is no need to use condor_submit_dag.
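
As a rough illustration of the condor_submit route, here is a small Python sketch that generates a submit description for N vanilla jobs. The executable name (my_job), the output/error/log file names, and the helper make_submit_description are made up for this example; only the submit commands themselves (universe, executable, arguments, output, error, log, queue) are standard.

```python
def make_submit_description(executable, n_jobs):
    """Return the text of a submit description that queues n_jobs
    vanilla-universe jobs, one per value of $(Process)."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",      # hypothetical program name
        "arguments = $(Process)",          # each job receives its own index
        "output = job.$(Process).out",     # hypothetical file names
        "error = job.$(Process).err",
        "log = jobs.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Example: a description for 10 independent jobs.
submit_text = make_submit_description("my_job", 10)
print(submit_text)
```

Saving the printed text to a file and passing that file to condor_submit would queue the whole group in one step.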

On the other hand, if you have inter-job dependencies (i.e. after your N vanilla jobs complete successfully, you want another job to be submitted), then you want condor_submit_dag.
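
For the dependency case, a DAG input file might look roughly like what the sketch below generates: one JOB line per vanilla job, one more for the follow-up job, and a single PARENT ... CHILD line tying them together. The node names (work0, work1, ..., final), the per-node submit file names, and the helper make_dag are assumptions for illustration only; JOB and PARENT/CHILD are the DAGMan keywords.

```python
def make_dag(n_jobs, final_submit_file="final.sub"):
    """Return DAG file text in which a node named 'final' runs only
    after all n_jobs work nodes have completed successfully."""
    names = [f"work{i}" for i in range(n_jobs)]          # hypothetical node names
    lines = [f"JOB {name} {name}.sub" for name in names]  # hypothetical submit files
    lines.append(f"JOB final {final_submit_file}")
    # Every work node is a parent of the final node.
    lines.append(f"PARENT {' '.join(names)} CHILD final")
    return "\n".join(lines) + "\n"


# Example: three work jobs followed by one dependent job.
dag_text = make_dag(3)
print(dag_text)
```

The resulting file is what you would hand to condor_submit_dag; each referenced submit file still has to exist alongside it.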

regards,
Todd

--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685