
Re: [Condor-users] automatically submit a dagman rescue dag after the original DAG is done?



On Jul 12, 2006, at 11:21 PM, John Wheez wrote:
Can anyone provide some pointers on how to get DAGMan to auto-submit the resulting rescue file?

Are you sure you don't mean "get DAGMan to re-start automatically after a machine crash or shutdown"?

When DAGMan creates a rescue file, it's because it can make no further progress due to a failed node, and human intervention is necessary. However, until recently (6.7.19?) there was a DAGMan submission bug which prevented DAGMan from being correctly re-started by the Condor schedd after some types of machine crashes or shutdowns. In short, if DAGMan itself was killed by a signal, Condor happily recorded it as an abnormal termination and let DAGMan exit the queue, like it would for any other job, instead of restarting it.

Now DAGMan will only leave the queue if it exits of its own accord. This includes successful completion and "I can make no further forward progress due to failed nodes", which is when a rescue file is produced.

If a rescue file is being produced when a simple re-submission would allow the DAG to finish, then it would be better to use the automatic node RETRY feature inside the first DAG, and avoid the rescue file generation in the first place.
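
As a rough sketch of what that might look like, here is a small DAG input file with RETRY lines added. The node names (A, B), the submit file names, and the retry count of 3 are made up for illustration; substitute whatever your DAG actually uses.

```
# example.dag (hypothetical file name)
# Two nodes, B runs after A. The submit file names are placeholders.
JOB A a.submit
JOB B b.submit
PARENT A CHILD B

# If a node fails, DAGMan re-runs it up to 3 more times
# before giving up and treating the node as failed.
RETRY A 3
RETRY B 3
```

You would submit this the usual way with condor_submit_dag example.dag. With the RETRY lines in place, a node that fails for a transient reason gets re-run by DAGMan itself, and a rescue file should only appear if a node still fails after all of its retries are used up.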

I hope this helps...

-Peter

--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685