
Re: [Condor-users] Condor DAG feature request



On Mon, 11 Dec 2006, Armen Babikyan wrote:

> I have a feature request for Condor's DAG system, with respect to
> handling nested DAGs:
>
> Suppose I have DAG A that calls many DAG B's, and each DAG B runs three
> programs in it, in the order "alpha, beta, gamma".  When gamma fails,
> this causes DAG B to end and generate its own rescue file.  DAG B will
> then tell DAG A about its failure, and DAG A will then generate its own
> rescue file, and the job will stop.
>
> I've noticed that in the case of nested DAGs, DAG A's rescue DAG does
> not point to DAG B's *rescue* file, it instead points to DAG B's
> *submit* file, causing all instances of alpha, beta, and gamma to be
> performed again, instead of just gamma.
>
> I have a system where the "beta" stage of a job is very time-consuming,
> and it is possible that a few "gamma" instances may fail.  It would be
> nice if DAGMan had the ability to detect whether it was running another
> DAG as a sub-job, or just a regular job.  In the case of the former, it
> could intelligently point its own rescue file to the rescue file created
> by the DAG sub-job.

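As a workaround, you can do the following:

In the top-level DAG, attach a POST script to the node that runs B.
If B fails, the script renames B.dag.rescue to B.dag, so that when A's
rescue DAG resubmits B, it picks up B's rescue DAG instead of the
original.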
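
A minimal sketch of such a POST script, in Python. The script name
(promote_rescue.py), the node name B, and the helper
promote_rescue_dag are hypothetical. The sketch assumes the rescue file
sits next to B.dag under the name B.dag.rescue, as above, and that the
node's return value is passed in via DAGMan's $RETURN macro. Because a
node with a POST script succeeds or fails according to the script's
exit status, the script exits non-zero whenever B failed, so A still
sees the failure and writes its own rescue DAG.

```python
#!/usr/bin/env python3
"""Hypothetical POST script: promote a failed sub-DAG's rescue file.

Assumed line in the top-level DAG file (node name B is illustrative):
    SCRIPT POST B promote_rescue.py B.dag $RETURN
"""
import os
import sys


def promote_rescue_dag(dag_path, return_code):
    """Rename dag_path + '.rescue' over dag_path if the sub-DAG failed.

    Returns True if the rescue file was promoted, False otherwise.
    """
    rescue_path = dag_path + ".rescue"  # naming assumed from the message above
    if return_code != 0 and os.path.exists(rescue_path):
        # Note: this overwrites the original DAG file; keep a copy
        # elsewhere if you want to be able to rerun B from scratch.
        os.replace(rescue_path, dag_path)
        return True
    return False


# Run as a script only when called with exactly two arguments:
# the DAG file path and the node's return value.
if __name__ == "__main__" and len(sys.argv) == 3:
    node_return = int(sys.argv[2])
    promote_rescue_dag(sys.argv[1], node_return)
    # Propagate success or failure so the node's status is unchanged.
    sys.exit(0 if node_return == 0 else 1)
```

With this in place, re-running A's rescue DAG resubmits B as before,
but B.dag now holds the rescue contents, so B skips the nodes that
already completed.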

Kent Wenger
Condor Team