
Re: [HTCondor-users] HTCondor and Docker



On 13/04/2015 16:08, R. Kent Wenger wrote:
>> I already get DAGman to retry each node once. I am thinking about retries which require operator intervention, e.g. because of running out of disk space or a bad NFS mount.
>
> Hmm, sounds like you might want this feature once it's implemented:
>
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2831,4

Well, I've certainly wanted to do that in the past - when I've noticed a node failure while the DAG is running (and the retries failed too), but other jobs are continuing to run happily. Sometimes I've just killed the whole DAG so that I don't have to wait for it to finish.

So being able to kick off a manual retry would be a good feature. Another approach I thought of would be for DAGman to delay its retries until the last possible moment - i.e. when there are no other jobs which can proceed - instead of retrying as soon as possible. Or perhaps just the *last* retry should be handled this way.

Anyway... this is just a tweak. The main issue for me is creating a DAG dynamically (in response to a request received in an AMQP message), which in turn means a lifecycle of:

* create a working directory
* run the script to create the DAG/submit/input files in this directory
* submit the DAG
* wait for DAG to complete
* send back success/fail message to submitter, and results
* tidy up (i.e. remove the working directory) on DAG success
* on failure, keep all the temp files for post-mortem analysis; after fixes, resubmit the rescue DAG
* management tools: e.g. list the working directories, clusterID for running jobs, exit status for finished jobs (eventually a web interface)
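
To make the core of that lifecycle concrete, here is a rough Python sketch of the create / submit / wait / tidy-up steps. It is only an illustration under several assumptions: the function name run_dag_lifecycle and the generate_dag callback are made up for this example, the wait step relies on DAGMan writing its own log to <dagfile>.dagman.log, and failure is detected by the presence of a rescue file (<dagfile>.rescueNNN), which is a heuristic rather than an authoritative status check. Replying to the AMQP submitter and the management tooling are left to the caller.

```python
import glob
import os
import shutil
import subprocess
import tempfile


def run_dag_lifecycle(base_dir, generate_dag, run=subprocess.run):
    """Create a working directory, build and submit a DAG, wait, then tidy up.

    generate_dag(work_dir) is a hypothetical callback: it must write the
    DAG/submit/input files into work_dir and return the DAG file name
    relative to work_dir. `run` is injectable so the flow can be tested
    without an HTCondor installation.
    Returns (succeeded, work_dir).
    """
    # Create a fresh working directory for this request.
    work_dir = tempfile.mkdtemp(prefix="dag-", dir=base_dir)
    dag_file = generate_dag(work_dir)

    # Submit the DAG from inside the working directory.
    run(["condor_submit_dag", dag_file], cwd=work_dir, check=True)

    # Wait for the DAGMan job itself to finish (assumes the default
    # <dagfile>.dagman.log log name).
    run(["condor_wait", dag_file + ".dagman.log"], cwd=work_dir, check=True)

    # Assumption: a failed DAG leaves behind a rescue file such as
    # <dagfile>.rescue001; no rescue file is treated as success.
    rescue_files = glob.glob(os.path.join(work_dir, dag_file + ".rescue*"))
    succeeded = not rescue_files

    if succeeded:
        # Tidy up only on success; keep everything for post-mortem otherwise.
        shutil.rmtree(work_dir)
    return succeeded, work_dir
```

On failure the directory is left in place, so after fixing the underlying problem the same DAG file could be resubmitted from that directory; as I understand it, recent DAGMan versions then pick up the newest rescue DAG automatically, but check that against your version.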

I was initially surprised that HTCondor doesn't come with any tooling for that sort of lifecycle - it seems the assumption is that all workflows are set up by hand at the CLI.

I did look for HTCondor front-ends; e.g. I found Pegasus, but as far as I can see, you are still required to create your own working directory to stick the DAX files into, and to keep track of your submission.

Regards,

Brian.