Re: [Condor-users] dagman rescue

Thank you for the reply!

On Sun, Nov 21, 2010 at 10:32 PM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> On Sun, 21 Nov 2010, Mag Gam wrote:
>> To my understanding, a rescue dag is generated at the end of the DAG.
>> What if we have a DAG the size of 100k and I would like to see what
>> nodes failed? Is that possible?
> Yes, that's right.  A rescue DAG is generated either when you condor_rm the
> DAGMan job, or when nodes have failed and the DAG has reached the point
> where it can't make any more progress.
> To see what nodes have failed while the DAG is running, there are a few
> options:
> 1) Search the dagman.out file for the string "failed".  This will probably
> match a few things other than actual node failures, but it will find all of
> the node failures.
> 2) As of version 7.5.4, you can have DAGMan create a node status file that
> has the current status of every node in the DAG, updated as the DAG runs.
> (See
> http://www.cs.wisc.edu/condor/manual/v7.5/2_10DAGMan_Applications.html#SECTION0031010000000000000000
> for more info.)
> 3) Have DAGMan create a dot file that is updated as the DAG runs.  (See
> http://www.cs.wisc.edu/condor/manual/v7.5/2_10DAGMan_Applications.html#SECTION003109000000000000000
> for more info.)
> If you're running 7.5.4 and you just want to see which nodes have failed,
> #2 is probably the best option.  However, #2 and #3 have to be set up when
> the DAG is submitted; you can't tell DAGMan to create the node status file
> or dot file unless you specified them in the original DAG.  So for a DAG
> that's already running, you'll have to go with #1.
> Kent Wenger
> Condor Team
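
For anyone who wants to script option 1, here is a minimal Python sketch.
It assumes only what is stated above: that node failures show up in
dagman.out on lines containing the string "failed". The function names and
the command-line usage are hypothetical, and as Kent notes, some of the
matches will not be actual node failures, so the output still needs a
manual look.

```python
import sys

NEEDLE = "failed"  # the string suggested in option 1 above


def find_failed_lines(lines):
    """Return (line_number, text) pairs for lines containing NEEDLE."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if NEEDLE in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a dagman.out file line by line instead of reading it whole,
    which matters for the log of a very large DAG."""
    with open(path, "r", errors="replace") as handle:
        return find_failed_lines(handle)


# Hypothetical usage: pass the path to the dagman.out file as the only argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    for number, text in scan_file(sys.argv[1]):
        print(f"{number}: {text}")
```

Since this only reads the log, it should be safe to run repeatedly while
the DAG is still in progress.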
