
[HTCondor-users] Getting DAG node to fail on file transfer error



(related to my previous post)

If I submit a DAG that uses an http:// URL for an input file and the file transfer fails, the job goes into the "hold" state. Is it possible to configure this so that the node fails entirely instead?

If the DAG node failed, the whole DAG would fail, and the user would notice. However, if a job ends up in the 'held' state, it looks just like a job that is taking forever to run, and additional monitoring is needed to detect it.

Example:

$ condor_q -analyze 181.0

-- Submitter: test.example.net : <10.0.2.15:60831> : test.example.net
---
181.000:  Request is held.

Hold reason: Error from test.example.net: STARTER at 192.168.56.15 failed to receive file /var/lib/condor/execute/dir_17283/xxxx.xxxx: FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
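
To illustrate the kind of additional monitoring this implies, here is a minimal Python sketch that scans output like the above for held requests. It assumes the output keeps exactly the layout shown here; `find_held` and `SAMPLE` are hypothetical names for this illustration and are not part of HTCondor.

```python
import re

# Sample text copied from the condor_q -analyze output shown above.
SAMPLE = """\
181.000:  Request is held.

Hold reason: Error from test.example.net: STARTER at 192.168.56.15 failed to receive file /var/lib/condor/execute/dir_17283/xxxx.xxxx: FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
"""


def find_held(text):
    """Return a list of (job_id, hold_reason) pairs found in text.

    Assumes each held job appears as '<id>:  Request is held.' followed
    later by a 'Hold reason: ...' line, as in the sample output.
    A held job with no 'Hold reason:' line is not reported.
    """
    held = []
    job_id = None
    for line in text.splitlines():
        match = re.match(r"^(\d+\.\d+):\s+Request is held\.", line)
        if match:
            job_id = match.group(1)
            continue
        if job_id is not None and line.startswith("Hold reason:"):
            held.append((job_id, line[len("Hold reason:"):].strip()))
            job_id = None
    return held


if __name__ == "__main__":
    for job, reason in find_held(SAMPLE):
        print(job, "is held:", reason)
```

This only detects the held state after the fact; it does not make the DAG node fail, which is what I am actually after.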


I've checked
http://research.cs.wisc.edu/htcondor/manual/current/condor_submit_dag.html
http://research.cs.wisc.edu/htcondor/manual/current/2_10DAGMan_Applications.html
and I can't see how to fail a held node (or fail a node where the file transfer fails), although I can see that the node_status_file and jobstate_log do indicate events for held nodes.

Thanks,

Brian Candler.