
Re: [HTCondor-users] HTCondor and Docker



On Mon, 13 Apr 2015, Brian Candler wrote:

So being able to kick off a manual retry would be a good feature. Another approach I thought of would be for DAGman to delay its retries until the last possible moment - i.e. when there are no other jobs which can proceed - instead of retrying as soon as possible. Or perhaps just the *last* retry should be handled this way.

Hmm -- DAGMAN_RETRY_NODE_FIRST defaults to false, which means that when a node fails, it goes to the end of the ready queue (if it has retries). But other nodes that become ready after that node fails get added after it. So I guess what you're looking for is a setting that keeps the retry attempt at the end of the ready queue even as other stuff is added.

Anyway... this is just a tweak. The main issue for me is creating a DAG dynamically (in response to a request received in an AMQP message), which in turn means a lifecycle of:

* create a working directory
* run the script to create the DAG/submit/input files in this directory
* submit the DAG
* wait for DAG to complete
* send back success/fail message to submitter, and results
* tidy up (i.e. remove the working directory) on DAG success
* on failure, keep all the temp files for post-mortem analysis; after fixes, resubmit the rescue DAG
* management tools: e.g. list the working directories, clusterID for running jobs, exit status for finished jobs (eventually a web interface)
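
The lifecycle above could be wrapped in a small driver along these lines. This is only a sketch under assumptions, not an existing HTCondor tool: `run_dag_lifecycle` and the `make_dag` callable are hypothetical names, the check for a `<dagfile>.rescue*` file is used here as a rough failure signal, and the AMQP reply and management tooling are left out. It relies only on the `condor_submit_dag` and `condor_wait` commands and on the `<dagfile>.dagman.log` file that DAGMan writes.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_dag_lifecycle(make_dag, run=subprocess.run, base_dir=None):
    """Run one request through the create/submit/wait/tidy cycle.

    make_dag(workdir) is a hypothetical callable that writes the DAG, submit
    and input files into workdir and returns the path of the DAG file.
    """
    # Create a fresh working directory for this request.
    workdir = Path(tempfile.mkdtemp(prefix="dag-", dir=base_dir))
    dag_file = Path(make_dag(workdir))

    # Submit the DAG; condor_submit_dag queues a DAGMan job for it.
    run(["condor_submit_dag", dag_file.name], cwd=workdir, check=True)

    # Block until the DAGMan job recorded in <dagfile>.dagman.log has finished.
    run(["condor_wait", dag_file.name + ".dagman.log"], cwd=workdir, check=True)

    # Assumption: a rescue file next to the DAG means some node failed.
    failed = any(workdir.glob(dag_file.name + ".rescue*"))
    if not failed:
        shutil.rmtree(workdir)  # tidy up only on success
    # On failure the directory is kept for post-mortem and rescue resubmission.
    return {"ok": not failed, "workdir": str(workdir)}
```

The returned dictionary is what a caller would turn into the success/fail message; passing a different `run` makes it possible to exercise the logic without a pool.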

I was initially surprised that HTCondor doesn't come with any tooling for that sort of lifecycle - it seems the assumption is that all workflows are set up by hand at the CLI.

Well, one option is to make that top-level lifecycle into a DAG, and have the "main" DAG be a sub-DAG of the top-level DAG.
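
As a rough illustration of that idea, the snippet below writes a top-level DAG file in which the generated DAG is a single SUBDAG EXTERNAL node, with a PRE script to create it and a POST script to report the result. The names `write_top_level_dag`, `make_dag.sh`, `report.sh`, `main.dag` and `top.dag` are placeholders made up for the example, and it assumes the sub-DAG file only needs to exist by the time the node actually runs.

```python
from pathlib import Path


def write_top_level_dag(workdir, make_dag_script="make_dag.sh",
                        report_script="report.sh", inner_dag="main.dag"):
    """Write a top-level DAG that wraps the generated DAG as a sub-DAG."""
    lines = [
        # The inner DAG runs as a single node of the top-level DAG.
        f"SUBDAG EXTERNAL main {inner_dag}",
        # PRE script: generate the inner DAG and its submit/input files.
        f"SCRIPT PRE main {make_dag_script}",
        # POST script: report the outcome; $RETURN is the node's exit status.
        f"SCRIPT POST main {report_script} $RETURN",
    ]
    path = Path(workdir) / "top.dag"
    path.write_text("\n".join(lines) + "\n")
    return path
```

The resulting `top.dag` would then be the file handed to `condor_submit_dag`, so the whole create/run/report sequence lives inside DAGMan.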

Kent