Re: [HTCondor-users] proposed change in DAGMan
- Date: Thu, 16 Jun 2016 10:08:07 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: Re: [HTCondor-users] proposed change in DAGMan
On 15/06/2016 20:56, John N Calley wrote:
> I think this is an excellent option. I think it would be best for it to be on by default because I think it is most useful for naïve users.
I think the fundamental problem is that there doesn't seem to be much
consistency over which error conditions cause a job to go on hold and
which cause it to fail.
For example, a DAG node which tries to fetch a remote file via HTTP, and
the file does not exist (404 error), puts the job on hold. If the user
notices that there is no progress, and they query it, they find:
4299200.000: Request is held.
Hold reason: Error from slot1@xxxxxxxxxxxxxxxxxxxx: STARTER at 192.168.6.213
failed to receive file /var/lib/condor/execute/dir_9329/nonexistent:
FILETRANSFER:1:non-zero exit(1792) from /usr/lib/condor/libexec/curl_plugin
However, the user may not notice that the job has gone on hold, and I
believe it won't be retried automatically.
So if the aim is to help naïve users, it may be better to treat all such
errors as job failures. In fact, I can't think of any case where I would
prefer the job to go on hold rather than immediately fail the DAG node.
In the case of file transfers, a more sophisticated option might be to
distinguish between temporary errors (e.g. timeout talking to remote
HTTP server, or 5xx errors) and permanent errors (4xx errors). The
former could go on hold and retry automatically a few times before
giving up, while the latter would fail immediately. However, that's a
much more substantial change, and it doesn't solve the more general
problem of jobs going on hold for other unexpected reasons.
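To make the temporary/permanent distinction concrete, here is a rough sketch of the decision logic in Python. This is purely illustrative and is not how HTCondor or the curl plugin works today: the names Action, classify_transfer and MAX_RETRIES are hypothetical, and it assumes the plugin could report the HTTP status code (or None for a timeout) back to the caller.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"        # transfer succeeded
    RETRY = "retry"  # temporary error: hold and retry automatically
    FAIL = "fail"    # permanent error: fail the DAG node immediately


# Hypothetical limit on automatic retries ("a few times before giving up").
MAX_RETRIES = 3


def classify_transfer(status, attempts=0, max_retries=MAX_RETRIES):
    """Decide what to do after a file transfer attempt.

    status   -- HTTP status code, or None if the request timed out
    attempts -- number of retries already made for this transfer
    """
    if status is not None and 200 <= status < 300:
        return Action.OK
    # Timeouts and 5xx responses are treated as temporary.
    temporary = status is None or 500 <= status < 600
    if temporary and attempts < max_retries:
        return Action.RETRY
    # 4xx responses, exhausted retries, and anything else fail at once.
    return Action.FAIL


if __name__ == "__main__":
    for code in (200, 404, 503, None):
        print(code, classify_transfer(code).value)
```

With this policy the 404 in my example above would fail the node straight away, whereas a 503 or a timeout would be retried up to MAX_RETRIES times before failing.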
Of course, failing a single DAG node doesn't prevent progress from being
made in the rest of the DAG, but you can't retry a node until the whole
DAG run has completed. I seem to remember discussion around a proposed
feature to signal DAGMan to retry failed nodes before the current run is
complete. That would be a very useful feature, and I think would
probably cover the use cases for jobs 'on hold' better than today: it
would work for any node error which you could repair.