
[HTCondor-users] Inconsistencies: hold versus abort



I'm trying to build a system which submits and supervises condor DAGs as part of a bigger workflow.

I just wanted to point out how apparently similar file transfer problems manifest in different ways, and therefore need to be detected and handled differently. It's really just in case it's useful for anyone else also getting to grips with the system. The examples below were all tested with HTCondor 8.2.4 under Ubuntu 12.04.


Case (1): missing local file

==> t1.dag <==
JOB t1 t1.sub
JOBSTATE_LOG t1.log
NODE_STATUS_FILE t1.status

==> t1.sub <==
universe = vanilla
executable = /bin/true
transfer_executable = false
transfer_input_files = /nonexistent
queue

Result: dagman tries to submit the job at increasing intervals, failing each time.

12/02/14 19:18:12 From submit: Submitting job(s)
12/02/14 19:18:12 From submit: ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/02/14 19:18:12 failed while reading from pipe.
12/02/14 19:18:12 Read so far: Submitting job(s)ERROR: Can't open "/nonexistent" with flags 00 (No such file or directory)
12/02/14 19:18:12 ERROR: submit attempt failed

After 6 tries it gives up and fails the node. This is sensible, and you can handle it like any other sort of node failure - except you won't see any error output from the job itself. This means you have to parse the dagman output to find out what happened.
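For what it's worth, the sort of parsing I mean is along these lines (a rough Python sketch, assuming the default <dagfile>.dagman.out naming and relying only on the "ERROR:" lines shown above):

~~~
# Rough sketch: pull the submit error text out of the dagman output file.
# Assumes the default <dagfile>.dagman.out naming convention.
def submit_errors(dag_file):
    errors = []
    with open(dag_file + ".dagman.out") as f:
        for line in f:
            if "ERROR:" in line:
                errors.append(line.rstrip())
    return errors

# e.g. submit_errors("t1.dag") picks up both
#   '... From submit: ERROR: Can\'t open "/nonexistent" ...'
#   '... ERROR: submit attempt failed'
~~~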

If you set a NODE_STATUS_FILE it won't help you: it shows

  DagStatus = 3; /* "STATUS_SUBMITTED ()" */
...
  NodeStatus = 1; /* "STATUS_READY" */

It seems odd that the NODE_STATUS_FILE is not updated when dagman terminates - I'd have expected the DagStatus to show STATUS_ERROR, and probably also an indication of which node couldn't be submitted.

However, if you parse the JOBSTATE_LOG, you will see instances of

1417547909 t1 SUBMIT_FAILURE - - - 1
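That is easy enough to pick up mechanically - something like this (Python sketch, based only on the field layout shown above):

~~~
# Rough sketch: find nodes with failed submit attempts in a JOBSTATE_LOG.
# Field layout taken from the line above:
#   <timestamp> <node> <event> <condor id> ... <sequence number>
def failed_submits(jobstate_log):
    nodes = set()
    with open(jobstate_log) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3 and fields[2] == "SUBMIT_FAILURE":
                nodes.add(fields[1])      # node name, e.g. "t1"
    return nodes

# failed_submits("t1.log") -> set(['t1'])
~~~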



Case (2): missing or temporarily unavailable remote file

==> t2.dag <==
JOB t2 t2.sub
JOBSTATE_LOG t2.log
NODE_STATUS_FILE t2.status

==> t2.sub <==
universe = vanilla
executable = /bin/true
transfer_executable = false
transfer_input_files = http://127.0.0.1/nonexistent
queue

Result:

- on a personal condor this silently succeeds, in the sense that it makes no attempt to transfer a file :-( This is despite the fact that curl on the command line gives:

$ curl http://127.0.0.1/nonexistent; echo $?
curl: (7) couldn't connect to host
7

However, if you set "should_transfer_files = true", then you get the same behaviour as in the next case.

- on a proper cluster with separate submit and execution nodes and different filesystem domains, the job goes into "held" status.

You can find this from the NODE_STATUS_FILE by looking for

  JobProcsHeld = 1;

and JOBSTATE_LOG shows

1417548893 t2 JOB_HELD 4486.0 - - 1
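Detecting the held state itself is straightforward from either file, e.g. (Python sketch, relying only on the patterns above; note it takes no account of the job later being released):

~~~
# Rough sketch: spot held node jobs from the DAGMan status files.
import re

def any_procs_held(node_status_file):
    # Matches lines like "  JobProcsHeld = 1;"
    with open(node_status_file) as f:
        return any(re.search(r"JobProcsHeld\s*=\s*[1-9]", line) for line in f)

def held_nodes(jobstate_log):
    # Matches lines like "1417548893 t2 JOB_HELD 4486.0 - - 1"
    with open(jobstate_log) as f:
        return [line.split()[1] for line in f if " JOB_HELD " in line]
~~~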

But in both cases you don't get any indication of *why* it was held, and there is nothing in the <dag>.dagman.out file either. You have to run condor_q -analyze <job id> and parse its output:

~~~
4299200.000:  Request is held.

Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.6.213 failed to receive file /var/lib/condor/execute/dir_9329/nonexistent: FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
~~~
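An alternative, which I think would be less fragile than screen-scraping -analyze, is to ask for the HoldReason attribute directly via condor_q's -format option - something like:

~~~
# Rough sketch: fetch the hold reason for a specific job straight from
# condor_q, instead of parsing the -analyze output.  Assumes condor_q's
# -format option and the standard HoldReason job attribute.
import subprocess

def hold_reason(job_id):
    out = subprocess.check_output(
        ["condor_q", "-format", "%s\n", "HoldReason", job_id])
    return out.decode().strip() or None     # None if the job isn't held

# e.g. hold_reason("4486.0") ->
#   'Error from slot1@...: STARTER at 192.168.6.213 failed to receive file ...'
~~~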

Also, as far as I can tell there are no automatic retries (those would have to be done by condor_startd, presumably?)


Case (3): invalid scheme

==> t3.dag <==
JOB t3 t3.sub
JOBSTATE_LOG t3.log
NODE_STATUS_FILE t3.status

==> t3.sub <==
universe = vanilla
executable = /bin/true
transfer_executable = false
transfer_input_files = noscheme://127.0.0.1/nonexistent
should_transfer_files = true
queue

In this case the job hangs in "Idle" state, with an unmatched requirements expression.

condor_q -analyze shows:

~~~
    Condition                         Machines Matched Suggestion
    ---------                         ---------------- ----------
1 ( TARGET.HasFileTransfer && stringListMember("noscheme",HasFileTransferPluginMethods) )
                                      0 REMOVE
2   ( TARGET.Arch == "X86_64" )       1
3   ( TARGET.OpSys == "LINUX" )       1
4   ( TARGET.Disk >= 25 )             1
5 ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                      1
~~~

This one is very hard to identify: it's indistinguishable from a job which has been submitted and is just waiting for resources to become available. And indeed, you can imagine a scenario where some execute machines support that file transfer scheme and others don't.

condor_q -analyze can tell you that no machines match, but the output is difficult to parse (and the -xml flag doesn't seem to work with -analyze).
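One workaround might be a pre-flight check before submitting: ask the pool which transfer plugin methods are advertised, using the same HasFileTransferPluginMethods attribute that appears in the requirements above. Something like:

~~~
# Rough sketch: count machines advertising a file transfer plugin for a
# given URL scheme, before submitting.  Assumes condor_status's -format
# option; HasFileTransferPluginMethods is the comma-separated list used in
# the requirements expression above.
import subprocess

def machines_supporting_scheme(scheme):
    out = subprocess.check_output(
        ["condor_status", "-format", "%s\n", "HasFileTransferPluginMethods"])
    count = 0
    for line in out.decode().splitlines():
        if scheme in [m.strip() for m in line.split(",")]:
            count += 1
    return count

# machines_supporting_scheme("noscheme") -> expect 0
# machines_supporting_scheme("http")     -> should match the curl plugin
~~~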


Incidentally I am using the command line tools as the "API", as recommended at
http://research.cs.wisc.edu/htcondor/manual/v8.2/6_5Command_Line.html

I did try using the python API for submitting DAGs a while back, but it was doing strange things with requirements expressions (this was documented on the mailing list too). While it might be fine for submitting individual jobs, it doesn't seem to be a well-maintained way of submitting DAGs.
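For completeness, the submission side of "command line tools as the API" is just a thin wrapper - something along these lines (Python sketch; the supervision side then watches the jobstate/status files discussed above):

~~~
# Rough sketch: submit a DAG via condor_submit_dag and report whether the
# submission itself succeeded.
import subprocess

def submit_dag(dag_file):
    proc = subprocess.Popen(["condor_submit_dag", dag_file],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode == 0, out.decode()

ok, output = submit_dag("t1.dag")
if not ok:
    raise RuntimeError("condor_submit_dag failed:\n" + output)
~~~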

Regards,

Brian.