Re: [Condor-users] dagman rescue
- Date: Sun, 21 Nov 2010 23:47:58 -0500
- From: Mag Gam <magawake@xxxxxxxxx>
- Subject: Re: [Condor-users] dagman rescue
Thank you for the reply!
On Sun, Nov 21, 2010 at 10:32 PM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:
> On Sun, 21 Nov 2010, Mag Gam wrote:
>> To my understanding, a rescue dag is generated at the end of the DAG.
>> What if we have a DAG the size of 100k and I would like to see what
>> nodes failed? Is that possible?
> Yes, that's right. A rescue DAG is generated either when you condor_rm the
> DAGMan job, or nodes have failed and it has reached the point that it can't
> make any more progress.
> To see what nodes have failed while the DAG is running, there are a few
> options:
> 1) Look at the dagman.out file and look for the string "failed". This will
> probably find a few things other than actual node failures, but it will find
> all of the node failures.
> 2) As of version 7.5.4, you can have DAGMan create a node status file that
> has the current status of every node in the DAG, updated as the DAG runs.
> for more info.)
> 3) Have DAGMan create a dot file that is updated as the DAG runs. (See
> for more info.)
> If you're running 7.5.4, and you want to just see what nodes have failed, #2
> is probably the best option. However, #2 and #3 have to be set up when the
> DAG is submitted; you can't tell DAGMan to create the node status file or
> dot file unless you specified them in the original DAG. So for something
> that's already running, you'll have to go with #1.
> Kent Wenger
> Condor Team
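
Option #1 above (searching the dagman.out file for the string "failed") can be sketched in a few lines of Python. This is only a minimal illustration, not part of DAGMan: the function name find_failed_lines and the file name in the usage comment are hypothetical, and the match is a plain case-insensitive substring search, so, as Kent notes, it will also return some lines that are not actual node failures.

```python
from typing import Iterable, List, Tuple


def find_failed_lines(lines: Iterable[str],
                      needle: str = "failed") -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line containing `needle`.

    The comparison is case-insensitive. No assumption is made about the
    layout of dagman.out; this is a plain substring search.
    """
    needle = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line.lower():
            hits.append((number, line.rstrip("\n")))
    return hits


# Usage sketch (the file name below is hypothetical; use the dagman.out
# file that belongs to your own DAG):
#
#     with open("my.dag.dagman.out", errors="replace") as handle:
#         for number, text in find_failed_lines(handle):
#             print(f"{number}: {text}")
```

Because the file is read line by line, this should stay cheap even for the log of a very large DAG, and it can be rerun while the DAG is still in progress.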