
Re: [Condor-users] Recovering from failures of DAGs within DAGs



FYI:  it all works perfectly well. I have a production system built on the idea of embedded DAGs, and it works great.


On 12/21/05, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
On Wed, 21 Dec 2005, Craig Robinson wrote:

> We are developing a DAGMan application which will ideally use DAGs
> within DAGs. We have seen in the Condor documentation that such
> applications are supported. How are failures of internal DAGs dealt
> with, and is there any easy way to recover from this?

Say you have a top-level DAG, "top.dag", that has a node that's a
lower-level DAG, "lower.dag".  If lower.dag fails somewhere, the Condor
job running that DAG exits with a non-zero exit code, so the corresponding
node in top.dag is considered failed.

If lower.dag may have failed because of some transient condition, you can
specify retries for the corresponding node in top.dag.  This will cause
lower.dag to be re-submitted.  However, in most cases (more details
below), lower.dag will then restart from scratch.  (Maybe this is
something we need to change.)
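
As a rough illustration of the retry setup described above, here is a small
Python sketch that generates a minimal top.dag.  It assumes the lower-level
DAG runs as an ordinary node whose submit file is lower.dag.condor.sub (the
name used later in this message).  The node name "lower", the retry count
of 3, and the helper name render_top_dag are made up for the example.

```python
def render_top_dag(node="lower", submit_file="lower.dag.condor.sub", retries=3):
    """Return the text of a minimal top-level DAG with one retried node.

    Hypothetical helper: the node name, retry count, and function name are
    illustrative assumptions, not part of the original message.
    """
    lines = [
        f"JOB {node} {submit_file}",  # node that runs the lower-level DAG
        f"RETRY {node} {retries}",    # re-submit the node up to `retries` times if it fails
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the generated DAG text; redirect it to top.dag to use it.
    print(render_top_dag(), end="")
```

Each retry re-submits the whole lower-level DAG, so work already completed
inside lower.dag is not skipped by the retry alone.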

There are two ways a DAG can fail, resulting in different ways of
recovering.  The most common failure is that a node fails, for whatever
reason.  In this case, DAGMan will write out a rescue DAG.  So if
your DAG file is lower.dag, the rescue DAG file will be lower.dag.rescue.
If you are running the DAG manually, you do 'condor_submit_dag
lower.dag.rescue' to run the rescue DAG, which picks up from where the
DAG failed.  However, if lower.dag is being run from top.dag, DAGMan
isn't smart enough to submit the rescue DAG the second time around --
that's what might be worthwhile for us to change.
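
For a manual re-run, the choice between the original DAG and its rescue DAG
can be sketched in a few lines of Python.  This is only an illustration of
the manual step above, not something DAGMan does itself.  It assumes the
rescue file is named by appending ".rescue" to the DAG file name, as
described in this message, and the helper name pick_dag_file is
hypothetical.

```python
import os
import tempfile


def pick_dag_file(dag_file):
    """Return the rescue DAG path if that file exists, else the original path.

    Assumes the "<dag file>.rescue" naming described in the message;
    the function name is a hypothetical helper for this example.
    """
    rescue = dag_file + ".rescue"
    return rescue if os.path.exists(rescue) else dag_file


if __name__ == "__main__":
    # Demonstrate in a scratch directory; nothing is submitted to Condor.
    with tempfile.TemporaryDirectory() as tmp:
        dag = os.path.join(tmp, "lower.dag")
        # No rescue file yet, so the original DAG is chosen.
        print("condor_submit_dag", os.path.basename(pick_dag_file(dag)))
        # Simulate DAGMan having written a rescue DAG after a node failure.
        open(dag + ".rescue", "w").close()
        print("condor_submit_dag", os.path.basename(pick_dag_file(dag)))
```

The sketch only prints the command it would run; actually invoking
condor_submit_dag is left to the person re-running the DAG.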

The second type of failure is if the DAGMan process itself blows an
assertion or something similar.  In this case, the recovery process is
different -- the DAG lock file will still be there, so a condor_submit
of lower.dag.condor.sub, for example, will run in recovery mode.  In
recovery mode, DAGMan reads through the user logs and figures out which
nodes have completed before it starts running any nodes.  So in that case,
having retries in the top-level DAG will automatically do what you want.

Hopefully that's all somewhat clear!

Kent Wenger
Condor Team
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users