
Re: [Condor-users] dagman rescue



On Sun, 21 Nov 2010, Mag Gam wrote:

To my understanding, a rescue dag is generated at the end of the DAG.
What if we have a DAG the size of 100k and I would like to see what
nodes failed? Is that possible?

Yes, that's right. A rescue DAG is generated either when you condor_rm the DAGMan job, or when nodes have failed and the DAG has reached the point where it can't make any more progress.

To see what nodes have failed while the DAG is running, there are a few options:

1) Search the dagman.out file for the string "failed". This will probably find a few things other than actual node failures, but it will find all of the node failures.

2) As of version 7.5.4, you can have DAGMan create a node status file that has the current status of every node in the DAG, updated as the DAG runs. (See http://www.cs.wisc.edu/condor/manual/v7.5/2_10DAGMan_Applications.html#SECTION0031010000000000000000
for more info.)

3) Have DAGMan create a dot file that is updated as the DAG runs.  (See
http://www.cs.wisc.edu/condor/manual/v7.5/2_10DAGMan_Applications.html#SECTION003109000000000000000
for more info.)
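
For option 1, a minimal Python sketch of the search might look like the following. It keeps every line containing "failed", as described above, and also tries to pull out node names. The NODE_FAILED pattern is only an assumption about how a node-failure line is worded in dagman.out; check it against your own log before relying on it. The function names are made up for illustration.

```python
import re

# ASSUMPTION: node-failure lines look roughly like
#   "... Node A job proc (12.0) failed with status 1."
# Verify this wording against your own dagman.out file.
NODE_FAILED = re.compile(r"Node (\S+) job proc \([\d.]+\) failed")


def find_failures(lines):
    """Return (hits, nodes): every line containing "failed", and the
    node names extracted from the lines that match NODE_FAILED."""
    hits = []
    nodes = []
    for line in lines:
        if "failed" not in line:
            continue
        hits.append(line.rstrip("\n"))
        match = NODE_FAILED.search(line)
        if match and match.group(1) not in nodes:
            nodes.append(match.group(1))
    return hits, nodes


def report(path):
    """Print the matching lines and the extracted node names for one log."""
    with open(path, errors="replace") as log:
        hits, nodes = find_failures(log)
    for hit in hits:
        print(hit)
    print("nodes that appear to have failed:", ", ".join(nodes) or "(none matched)")
```

For a DAG file named mydag.dag (a hypothetical name), this would be called as report("mydag.dag.dagman.out"). Lines that contain "failed" but are not node failures still show up in the printed hits; they just contribute no node name.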

If you're running 7.5.4 or later, and you just want to see what nodes have failed, #2 is probably the best option. However, #2 and #3 have to be set up when the DAG is submitted; you can't tell DAGMan to create the node status file or dot file unless you specified them in the original DAG. So for something that's already running, you'll have to go with #1.
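
As a rough illustration of setting up #2 and #3 in advance, the DAG file would carry lines like the ones below. The file names, node names, and submit file names are hypothetical, and the manual sections linked above are the authority on the exact syntax and optional arguments for your version.

```
# mydag.dag (hypothetical example)
JOB A a.sub
JOB B b.sub
PARENT A CHILD B

# Option 2: node status file, rewritten as the DAG runs (7.5.4 and later)
NODE_STATUS_FILE mydag.status

# Option 3: dot file, rewritten as the DAG runs
DOT mydag.dot UPDATE
```

Both lines have to be in the DAG file before condor_submit_dag is run; adding them to a DAG that is already running has no effect.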

Kent Wenger
Condor Team