
Re: [HTCondor-users] proposed change in DAGMan

On 15/06/2016 20:56, John N Calley wrote:
I think this is an excellent option. I think it would be best for it to be on by default because I think it is most useful for naïve users.

I think the fundamental problem is that there doesn't seem to be much consistency over which error conditions cause a job to go on hold, and which cause it to fail.

For example, if a DAG node tries to fetch a remote file via HTTP and the file does not exist (404 error), the job is put on hold. A user who notices that there is no progress and queries the job finds:

4299200.000:  Request is held.

Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at
failed to receive file /var/lib/condor/execute/dir_9329/nonexistent:
FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin

However, the user may not notice that the job has gone on hold, and I believe it won't be retried automatically.

So if the aim is to help naïve users, it may be better to treat all such errors as job failures. In fact, I can't think of any case where I would prefer the job to go on hold rather than immediately fail the DAG node.

In the case of file transfers, a more sophisticated option might be to distinguish between temporary errors (e.g. timeout talking to remote HTTP server, or 5xx errors) and permanent errors (4xx errors). The former could go on hold and retry automatically a few times before giving up, while the latter would fail immediately. However that's a much more substantial change, and it doesn't solve the more general problem of jobs going on hold for other unexpected reasons.
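
To make the temporary/permanent distinction concrete, here is a minimal Python sketch of the policy described above. It is not how HTCondor or its curl_plugin behaves today; the names classify, fetch_with_policy and MAX_ATTEMPTS are hypothetical, the retry limit of 3 is an arbitrary stand-in for "a few times", and treating any status below 400 as success assumes the client follows redirects itself.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 3  # hypothetical limit standing in for "a few times"


def classify(status: Optional[int]) -> str:
    """Classify one transfer attempt.

    status is the HTTP status code, or None when no response was
    received at all (e.g. a timeout talking to the remote server).
    """
    if status is None or 500 <= status <= 599:
        return "temporary"   # timeout or 5xx: worth retrying
    if 400 <= status <= 499:
        return "permanent"   # 4xx: retrying will not help
    return "ok"              # assumption: anything below 400 is success


def fetch_with_policy(fetch: Callable[[], Optional[int]],
                      max_attempts: int = MAX_ATTEMPTS) -> Tuple[str, int]:
    """Call fetch() until it succeeds, fails permanently, or the
    attempt limit is reached. Returns (outcome, attempts_made)."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        kind = classify(fetch())
        if kind == "ok":
            return ("success", attempts)
        if kind == "permanent":
            return ("failed", attempts)   # fail the node immediately
        # temporary error: fall through and try again
    return ("failed", attempts)           # gave up after repeated temporary errors


if __name__ == "__main__":
    responses = iter([503, None, 200])    # two temporary errors, then success
    print(fetch_with_policy(lambda: next(responses)))
    print(fetch_with_policy(lambda: 404)) # permanent error, no retry
```

Under this policy a 404 fails the node on the first attempt, while a run of timeouts or 5xx responses is retried up to the limit before the node fails.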

Of course, failing a single DAG node doesn't prevent progress from being made in the rest of the DAG, but you can't retry a node until the whole DAG run has completed. I seem to remember discussion of a proposed feature to signal DAGMan to retry failed nodes before the current run is complete. That would be a very useful feature, and I think it would probably cover the use cases for jobs 'on hold' better than the current behaviour does: it would work for any node error which you could repair.