
Re: [HTCondor-users] Inconsistencies: hold versus abort



On Tue, 2 Dec 2014, Brian Candler wrote:

Case (1): missing local file
...
If you set a NODE_STATUS_FILE it won't help you: it shows

 DagStatus = 3; /* "STATUS_SUBMITTED ()" */
...
  NodeStatus = 1; /* "STATUS_READY" */

It seems odd that the NODE_STATUS_FILE is not updated when DAGMan terminates - I'd have expected the DagStatus to show STATUS_ERROR, and probably also an indication of which node couldn't be submitted.

What version of DAGMan are you running? In 8.2.3 we fixed a bug that could cause the node status file to not get updated when DAGMan exits.

When I try this, I get the following for the node status:

[
  Type = "NodeStatus";
  Node = "NodeA";
  NodeStatus = 6; /* "STATUS_ERROR" */
  StatusDetails = "Job submit failed";
  RetryCount = 0;
  JobProcsQueued = 0;
  JobProcsHeld = 0;
]

Hopefully this is what you would want.
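
For reference, the node status file above was requested with a line like
this in the DAG file (the file name and update interval here are just an
example):

 NODE_STATUS_FILE mydag.status 30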

Case (2): missing or temporarily unavailable remote file
...
- on a personal condor the job silently succeeds, in the sense that HTCondor makes no attempt to transfer the file :-(

This will have to be dealt with at a level below DAGMan, because if HTCondor claims the job succeeded, DAGMan doesn't have a way to know otherwise (unless you add a POST script that checks the output).
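
For example, here is a minimal sketch of such a check, with the node and
file names assumed; the POST script's nonzero exit status marks the node
as failed:

 # In the DAG file: after the job completes, test that the expected
 # output file exists and is non-empty; if not, /bin/test exits
 # nonzero and DAGMan marks the node as failed.
 JOB NodeA nodeA.sub
 SCRIPT POST NodeA /bin/test -s result.dat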

However, if you set "should_transfer_files = YES", then you get the same behaviour as in the case below.
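
(For reference, a sketch of the relevant submit file lines; the input
file name is assumed:)

 should_transfer_files = YES
 when_to_transfer_output = ON_EXIT
 transfer_input_files = input.dat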

- on a proper cluster with separate submit and execution nodes and different filesystem domains, the job goes into "held" status.

You can find this from the NODE_STATUS_FILE by looking for

 JobProcsHeld = 1;

and JOBSTATE_LOG shows

1417548893 t2 JOB_HELD 4486.0 - - 1

But in both cases you don't get any indication of *why* it was held, and not in the <dag>.dagman.out file either. You have to use condor_q -analyze <job-id> and parse its output.

You could find the hold reason in the DAGMan nodes.log file, or the log file specified in the submit file (if you specify one).
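
You can also ask the queue directly, e.g. for the held job above:

 condor_q -hold 4486.0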

Also, as far as I can tell there are no automatic retries (those would have to be done by condor_startd, presumably?)

As far as DAGMan is concerned, a job that's on hold may still eventually succeed. So if you want the job to fail, you need to put a periodic_remove expression into your submit file that removes the job after it's been on hold for a certain amount of time. Then you could add retries to your DAG node.
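
A sketch of what that might look like (the one-hour threshold and the
node name are assumptions):

 # In the submit file: remove the job once it has been held
 # (JobStatus == 5) for more than an hour.
 periodic_remove = (JobStatus == 5) && (time() - EnteredCurrentStatus > 3600)

 # In the DAG file: allow the node to be retried up to 3 times.
 RETRY NodeA 3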

Case (3): invalid scheme
...
In this case the job hangs in "Idle" state, with an unmatched requirements expression.

Again, DAGMan doesn't know why the job is idle, so it will just wait around for the job to finish.
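
You can dig into why the job doesn't match yourself with something like

 condor_q -better-analyze <job-id>

and read the analysis it prints.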

Kent Wenger
CHTC Team