
Re: [Condor-users] Recovering from failures of DAGs within DAGs



On Wed, 21 Dec 2005, Craig Robinson wrote:

> We are developing a DAGMan application which will ideally use DAGs
> within DAGs. We have seen in the Condor documentation that such
> applications are supported. How are failures of internal DAGs dealt
> with, and is there any easy way to recover from this?

Expanding on my earlier answer: there is an easy way to make rescue DAGs
work correctly with retries.  In the top-level DAG from my example, use
the following as the POST script for the node that runs the lower-level
DAG:

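    #! /bin/csh -f
    # If the lower-level DAG left a rescue DAG behind, it failed part way.
    if (-e lower.dag.rescue) then
      # Keep the DAG that just ran, then make the rescue DAG the one
      # the next retry will run.
      mv lower.dag lower.dag.orig
      mv lower.dag.rescue lower.dag
    endif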

That way, if the lower-level DAG fails, the retry actually runs the
rescue DAG, which picks up where the first attempt left off (the rescue
DAG records which nodes have already completed).
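
One thing to watch when wiring this up: when a node has a POST script,
DAGMan takes the node's success or failure from the POST script's exit
status, so a script that only does the swap and exits 0 would mark the
node as successful and no retry would happen.  The following is a
minimal sketch of the same swap in plain sh that also reports failure
whenever the node's job failed.  It assumes the top-level DAG passes the
job's exit status with the $RETURN macro and asks for retries, for
example "SCRIPT POST lower post_lower.sh $RETURN" and "RETRY lower 3".
The node name "lower" and the script name "post_lower.sh" are made up
for illustration.  Unlike the csh version, it keeps only the very first
lower.dag as lower.dag.orig on repeated failures, which is a design
choice rather than anything DAGMan requires.

```shell
#!/bin/sh
# Hypothetical post_lower.sh: POST script for the node that runs lower.dag.
# Assumed invocation from the top-level DAG:
#   SCRIPT POST lower post_lower.sh $RETURN

# post_script DAGFILE JOB_STATUS
# Swap in the rescue DAG if one exists, then succeed only if the job did.
post_script() {
    dag=$1
    job_status=$2

    if [ -e "$dag.rescue" ]; then
        # Preserve the original DAG only once, so later retries do not
        # overwrite it with an earlier rescue DAG.
        [ -e "$dag.orig" ] || mv "$dag" "$dag.orig"
        # The next retry of this node will run the rescue DAG.
        mv "$dag.rescue" "$dag"
    fi

    # Return 0 only when the job's status was 0; any other value
    # (including signal codes) becomes failure so that RETRY applies.
    [ "$job_status" = 0 ]
}

# Run only when an argument was passed, as DAGMan would with $RETURN.
if [ "$#" -gt 0 ]; then
    post_script lower.dag "$1"
    exit $?
fi
```

With this in place, a failed run of the lower-level DAG leaves the
rescue DAG in lower.dag and fails the node, so the retry resumes from
the recorded progress; a successful run leaves everything untouched.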

Kent Wenger
Condor Team