
Re: [HTCondor-users] Getting DAG node to fail on file transfer error



On Mon, 3 Nov 2014, Zachary Miller wrote:

On Mon, Nov 03, 2014 at 03:51:43PM +0000, Brian Candler wrote:
(related to my previous post)

If I submit a DAG that uses an http:// URL for an input file and the
file transfer fails, the job goes into the "hold" state. Is it possible to
configure this so that the node fails outright instead?

If the DAG node failed, the whole DAG would fail, and the user would
notice. A job stuck in the 'held' state, however, looks just like a job
that is taking forever to run, so it needs additional monitoring to catch.

I would look into setting "periodic_remove" in your job submit file. You can
make it conditional on the HoldReasonCode that indicates a failed file
transfer, so that jobs held for other reasons are left alone. I'll defer to
Kent Wenger on this, but I believe a removed job causes the DAG to fail.

Yes. Getting a job to fail instead of going on hold works the same way whether or not the job is part of a DAG, so you need to add an appropriate periodic_remove expression to your submit file(s).
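
As a minimal sketch of such an expression: the line below assumes the usual HoldReasonCode numbering, in which 13 means an error transferring input files and 12 an error transferring output files. Check the codes documented for your HTCondor version before relying on it.

```
# Remove the job instead of leaving it held when the hold was caused by
# an input file transfer failure. JobStatus 5 means "held".
# ASSUMPTION: HoldReasonCode 13 is the input-transfer error code.
periodic_remove = (JobStatus == 5) && (HoldReasonCode == 13)
```

To see which code a job that is already held actually carries, something like "condor_q -held -af ClusterId HoldReasonCode HoldReason" should print it alongside the hold reason text.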

Once a node fails in a DAG, no children of that node will run, but the DAG will make as much progress as it can before exiting (e.g., siblings of the failed node and their children will run). If you want the DAG to exit as soon as a node fails, you can use the ABORT-DAG-ON feature.
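
A rough sketch of what ABORT-DAG-ON looks like in a DAG input file follows. The node names, submit file names, and the abort value 1 are all hypothetical placeholders. This thread does not say which value DAGMan records for a node whose job was removed rather than exiting on its own, so look that up in the dagman.out file of a test run and use it in place of the 1.

```
# Hypothetical two-node DAG: "process" depends on "fetch".
JOB fetch fetch.sub
JOB process process.sub
PARENT fetch CHILD process

# Abort the whole DAG as soon as node "fetch" finishes with value 1,
# and make DAGMan itself exit with status 1.
# ASSUMPTION: 1 is a placeholder; substitute the value observed for a
# removed job.
ABORT-DAG-ON fetch 1 RETURN 1
```

Without the ABORT-DAG-ON line, a failure of "fetch" would still stop "process" from running, but any unrelated nodes in the DAG would carry on.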

Kent Wenger
CHTC Team