Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] proposed change in DAGMan

Date: Thu, 16 Jun 2016 10:08:07 +0100
From: Brian Candler <b.candler@xxxxxxxxx>
Subject: Re: [HTCondor-users] proposed change in DAGMan

On 15/06/2016 20:56, John N Calley wrote:

I think this is an excellent option. I think it would be best for it to be on by default because I think it is most useful for naïve users.

I think the fundamental problem is that there doesn't seem to be much consistency over which error conditions cause a job to go on hold and which cause it to fail.

For example, if a DAG node tries to fetch a remote file via HTTP and the file does not exist (404 error), the job goes on hold. If the user notices that there is no progress and queries the job, they find:


~~~
4299200.000:  Request is held.

Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.6.213
failed to receive file /var/lib/condor/execute/dir_9329/nonexistent:
FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
~~~

However, the user may not notice that the job has gone on hold, and I believe it won't be retried automatically.

So if the aim is to help naïve users, it may be better to treat all such errors as job failures. In fact, I can't think of any case where I would prefer the job to go on hold rather than immediately fail the DAG node.

In the case of file transfers, a more sophisticated option might be to distinguish between temporary errors (e.g. timeout talking to remote HTTP server, or 5xx errors) and permanent errors (4xx errors). The former could go on hold and retry automatically a few times before giving up, while the latter would fail immediately. However, that's a much more substantial change, and it doesn't solve the more general problem of jobs going on hold for other unexpected reasons.
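To make the temporary/permanent split concrete, here is a rough sketch in Python. It is purely illustrative and is not how HTCondor's file transfer plugins behave today; the names `classify_transfer_error`, `run_transfer`, and `attempt_transfer` are made up for this example, and the retry limit of 3 is an arbitrary assumption.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"  # temporary error: worth trying again
    FAIL = "fail"    # permanent error: fail the node immediately


def classify_transfer_error(http_status=None, timed_out=False):
    """Map a failed HTTP transfer to an action (hypothetical policy).

    http_status is None when no response was received at all.
    """
    if timed_out or http_status is None:
        return Action.RETRY
    if 500 <= http_status <= 599:
        return Action.RETRY
    # 4xx, and anything else unexpected, is treated as permanent.
    return Action.FAIL


def run_transfer(attempt_transfer, max_retries=3):
    """Call attempt_transfer() until it succeeds or should give up.

    attempt_transfer is a hypothetical callable returning a tuple
    (ok, http_status, timed_out). Returns True on success and
    False if the node should be failed.
    """
    for _ in range(1 + max_retries):
        ok, status, timed_out = attempt_transfer()
        if ok:
            return True
        if classify_transfer_error(status, timed_out) is Action.FAIL:
            return False
    # Temporary errors persisted through every retry: give up.
    return False
```

With a policy like this, the 404 in the example above would fail the node on the first attempt, whereas a 503 or a timeout would be retried up to the limit before failing.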

Of course, failing a single DAG node doesn't prevent progress from being made in the rest of the DAG, but you can't retry a node until the whole DAG run has completed. I seem to remember discussion around a proposed feature to signal DAGMan to retry failed nodes before the current run is complete. That would be a very useful feature, and I think would probably cover the use cases for jobs 'on hold' better than today: it would work for any node error which you could repair.


Regards,

Brian.

References:
- [HTCondor-users] proposed change in DAGMan
  - From: R. Kent Wenger
- Re: [HTCondor-users] proposed change in DAGMan
  - From: John N Calley
