
Re: [HTCondor-users] Inconsistencies: hold versus abort



On Tue, 2 Dec 2014, Brian Candler wrote:

Case (1): missing local file
...
If you set a NODE_STATUS_FILE it won't help you: it shows

 DagStatus = 3; /* "STATUS_SUBMITTED ()" */
...
  NodeStatus = 1; /* "STATUS_READY" */

It seems odd that the NODE_STATUS_FILE is not updated when DAGMan terminates - I'd have expected the DagStatus to show STATUS_ERROR, and probably also an indication of which node couldn't be submitted.

What version of DAGMan are you running? In 8.2.3 we fixed a bug that could cause the node status file to not get updated when DAGMan exits.

When I try this, I get the following for the node status:

[
  Type = "NodeStatus";
  Node = "NodeA";
  NodeStatus = 6; /* "STATUS_ERROR" */
  StatusDetails = "Job submit failed";
  RetryCount = 0;
  JobProcsQueued = 0;
  JobProcsHeld = 0;
]

Hopefully this is what you would want.
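
For reference, the node status file above was requested with a line like
this in the DAG file (the file name and update interval here are just an
example):

 NODE_STATUS_FILE mydag.status 30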

Case (2): missing or temporarily unavailable remote file
...
- on a personal condor the job silently succeeds, in the sense that HTCondor makes no attempt to transfer the file :-(

This will have to be dealt with at a level below DAGMan, because if HTCondor claims the job succeeded, DAGMan doesn't have a way to know otherwise (unless you add a POST script that checks the output).
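
For example, here is a minimal sketch of such a check, with the node and
file names assumed; the POST script's nonzero exit status marks the node
as failed:

 # In the DAG file: after the job completes, test that the expected
 # output file exists and is non-empty; if not, /bin/test exits
 # nonzero and DAGMan marks the node as failed.
 JOB NodeA nodeA.sub
 SCRIPT POST NodeA /bin/test -s result.dat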

However, if you set "should_transfer_files = YES", then you get the same behaviour as in the case below.
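
(For reference, a sketch of the relevant submit file lines; the input
file name is assumed:)

 should_transfer_files = YES
 when_to_transfer_output = ON_EXIT
 transfer_input_files = input.dat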

- on a proper cluster with separate submit and execution nodes and different filesystem domains, the job goes into "held" status.

You can find this from the NODE_STATUS_FILE by looking for

 JobProcsHeld = 1;

and JOBSTATE_LOG shows

1417548893 t2 JOB_HELD 4486.0 - - 1

But in both cases you don't get any indication of *why* it was held, and not in the <dag>.dagman.out file either. You have to use condor_q -analyze <job-id> and parse its output.

You could find the hold reason in the DAGMan nodes.log file, or the log file specified in the submit file (if you specify one).
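
You can also ask the queue directly, e.g. for the held job above:

 condor_q -hold 4486.0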

Also, as far as I can tell there are no automatic retries (those would have to be done by condor_startd, presumably?)

As far as DAGMan is concerned, a job that's on hold may still eventually succeed. So if you want the job to fail, you need to put a periodic_remove expression into your submit file that removes the job after it's been on hold for a certain amount of time. Then you could add retries to your DAG node.
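
A sketch of what that might look like (the one-hour threshold and the
node name are assumptions):

 # In the submit file: remove the job once it has been held
 # (JobStatus == 5) for more than an hour.
 periodic_remove = (JobStatus == 5) && (time() - EnteredCurrentStatus > 3600)

 # In the DAG file: allow the node to be retried up to 3 times.
 RETRY NodeA 3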

Case (3): invalid scheme
...
In this case the job hangs in "Idle" state, with an unmatched requirements expression.

Again, DAGMan doesn't know why the job is idle, so it will just wait around for the job to finish.
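
You can dig into why the job doesn't match yourself with something like

 condor_q -better-analyze <job-id>

and read the analysis it prints.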

Kent Wenger
CHTC Team