
[Condor-users] Condor DAG feature request



Hello,

I have a feature request for Condor's DAG system, concerning how it handles nested DAGs:

Suppose I have DAG A, which runs many instances of DAG B, and each DAG B runs three programs in the order "alpha, beta, gamma". When gamma fails, DAG B ends and generates its own rescue file. DAG B then reports its failure to DAG A, which generates its own rescue file, and the whole job stops.

I've noticed that in the case of nested DAGs, DAG A's rescue DAG does not point to DAG B's *rescue* file. Instead, it points to DAG B's *submit* file, so alpha, beta, and gamma are all run again for that DAG B, instead of just gamma.

I have a system where the "beta" stage of a job is very time-consuming, and a few "gamma" instances may occasionally fail. It would be nice if DAGMan could detect whether a node is running another DAG as a sub-job or just a regular job. In the former case, it could intelligently point its own rescue file to the rescue file created by the DAG sub-job.
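
To illustrate the behavior I have in mind, here is a minimal Python sketch of the selection logic only. It is not DAGMan code. The function name dag_to_resume and the ".rescue" suffix are hypothetical placeholders: I am assuming the rescue file sits next to the sub-DAG's file and is named by appending a suffix, which may not match what a given DAGMan version actually writes.

```python
import os

# Assumption: the rescue file lives beside the DAG file and its name is the
# DAG file name plus a suffix. ".rescue" is a placeholder, not a confirmed
# DAGMan convention; adjust it to whatever your version produces.
RESCUE_SUFFIX = ".rescue"


def dag_to_resume(dag_path, rescue_suffix=RESCUE_SUFFIX):
    """Return the rescue file for dag_path if one exists, else dag_path.

    This mirrors the requested behavior: when an inner DAG has already
    left a rescue file behind, the outer rescue DAG should refer to that
    file rather than to the inner DAG's original file.
    """
    rescue_path = dag_path + rescue_suffix
    if os.path.isfile(rescue_path):
        return rescue_path
    return dag_path
```

With this rule, a DAG B that failed in gamma would be resumed from its own rescue file, so alpha and beta would not be repeated, while a DAG B that never failed would be referenced exactly as before.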

Thanks,

 - Armen

--
Armen Babikyan
MIT Lincoln Laboratory
armenb@xxxxxxxxxx . 781-981-1796