Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [Condor-users] DAG destructor job

Date: Thu, 16 Oct 2008 13:47:37 -0500 (CDT)
From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
Subject: Re: [Condor-users] DAG destructor job

On Thu, 16 Oct 2008, Jan Ploski wrote:

Does DAGMan maintain the DAG's state in memory, or does it re-read the
.dag file after each subjob's execution?


It maintains state in memory -- the DAG file is only parsed at startup.

The scenario I would like to implement is as follows:
in normal case subjobs A, B, C, D are executed
however, if either of A or B fails, the DAG skips ahead to D (which is a
cleanup job - should be run regardless of the DAG's success or failure).

One possibility would be to update the .dag file and mark the jobs as done
in the POST script of A. Do you think that would work? If not, do you have
any hints on how to best implement a "DAG exception handler (or
destructor) job"?


That won't work.  There is a way to do it that's a little more elaborate,
though -- it makes use of the "nested DAGs" feature.

Here's what you do:

* Put the "work" nodes A, B, C into a DAG, and use the ABORT-DAG-ON
feature to bail out of that DAG if one of the nodes fails.

* Make a top-level DAG that has two nodes -- one is the .condor.sub file
for the lower-level DAG; the second is the cleanup node (D).  For this
to work, the node calling the lower-level DAG has to succeed in order
for the cleanup node to be run.  You can accomplish that by either having
ABORT-DAG-ON declare the lower-level DAG successful (add RETURN 0 to
the ABORT-DAG-ON declarations), or else have a POST script on that node
in the upper-level DAG that always returns 0.
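
As a rough sketch, the two DAG files might look something like this.
The node names, submit file names, the A -> B -> C ordering, and the
abort exit value of 1 are all assumptions for illustration.  Note that
ABORT-DAG-ON only triggers on the exact exit value you give it.

```
# inner.dag (hypothetical file name): the "work" nodes
JOB A a.sub
JOB B b.sub
JOB C c.sub
PARENT A CHILD B
PARENT B CHILD C
# If A or B exits with 1, abort this DAG but report success (0)
ABORT-DAG-ON A 1 RETURN 0
ABORT-DAG-ON B 1 RETURN 0
```

```
# outer.dag (hypothetical file name): lower-level DAG, then cleanup
JOB WORK inner.dag.condor.sub
JOB D d.sub
PARENT WORK CHILD D
```

You can generate inner.dag.condor.sub without submitting it by running
condor_submit_dag -no_submit inner.dag.  If you'd rather not use
RETURN 0, the alternative would be something like
SCRIPT POST WORK /bin/true in outer.dag, assuming /bin/true exists on
the submit machine.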


Another way to do it would be to have a POST script on A that turns
B and C into noop jobs if A fails (you'd have to do something like
copy a new submit file over the existing one).  If you do this, it's
critical that the new submit file specify the same log file as the
original one -- otherwise DAGMan won't "see" the events for that job.
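
A minimal sketch of that second approach, assuming a single DAG with
the ordering A -> B -> C -> D.  The script name swap_to_noop.sh, the
submit file names, and the log file name are hypothetical.  The POST
script itself has to exit 0 even when A failed, because with a POST
script on a node it is the script's exit code that decides whether the
node counts as successful.

```
# work.dag (hypothetical file name)
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
PARENT A CHILD B
PARENT B CHILD C
PARENT C CHILD D
# $RETURN passes A's exit code to the script, which copies the
# noop submit file over b.sub and c.sub if that code is non-zero
SCRIPT POST A swap_to_noop.sh $RETURN
```

```
# noop_b.sub (hypothetical): replacement for b.sub
universe   = vanilla
executable = /bin/true
# must be the same log file that the original b.sub specified
log        = b.log
queue
```

You'd need a matching replacement for c.sub, again with the log file
that the original c.sub used.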


Hopefully this is at least somewhat clear -- let me know if it's not.

Kent Wenger
Condor Team

References:
- [Condor-users] DAG destructor job
  - From: Jan Ploski
