
Re: [HTCondor-users] HTCondor and Docker



> On Apr 13, 2015, at 11:34 AM, Brian Candler <b.candler@xxxxxxxxx> wrote:
> 
> On 13/04/2015 16:08, R. Kent Wenger wrote:
>>> I already get DAGMan to retry each node once. I am thinking about retries that require operator intervention, e.g. because of running out of disk space or a bad NFS mount.
>> 
>> Hmm, sounds like you might want this feature once it's implemented:
>> 
>> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2831,4
> Well, I've certainly wanted to do that in the past - when I've noticed a node failure while the DAG is running (and the retries failed too), but other jobs are continuing to run happily. Sometimes I've just killed the whole DAG so that I don't have to wait for it to finish.
> 
> So being able to kick off a manual retry would be a good feature. Another approach I thought of would be for DAGMan to delay its retries until the last possible moment - i.e. when there are no other jobs which can proceed - instead of retrying as soon as possible. Or perhaps just the *last* retry should be handled this way.
> 
> Anyway... this is just a tweak. The main issue for me is creating a DAG dynamically (in response to a request received in an AMQP message), which in turn means a lifecycle of:
> 
> * create a working directory
> * run the script to create the DAG/submit/input files in this directory
> * submit the DAG
> * wait for DAG to complete
> * send back success/fail message to submitter, and results
> * tidy up (i.e. remove the working directory) on DAG success
> * on failure, keep all the temp files for post-mortem analysis; after fixes, resubmit the rescue DAG
> * management tools: e.g. list the working directories, clusterID for running jobs, exit status for finished jobs (eventually a web interface)
> 
> I was initially surprised that HTCondor doesn't come with any tooling for that sort of lifecycle - it seems the assumption is that all workflows are set up by hand at the CLI.
> 

This sounds a bit like something we do.  Here's how we solved it:

1) An external application submits the DAG from a separate host with spooling enabled (using the Python bindings, of course).  This means the schedd itself creates the directories for you.  The external application copies over the input files and submit file templates as part of the spooling (a sketch follows after this list).
2) For DAG monitoring, we found no acceptable API to hook things up to an external tool.  We hacked something together instead: we symlink the DAGMan node status file into a directory exposed by httpd.
3) When the DAG is complete:
  - On success, the DAG goes into the 'C' (completed) state.  External clients can pick up the output; once this is done, the schedd automatically deletes the directory in spool.
  - On failure, we set the OnExitHold attribute so the DAG goes into the 'H' (held) state instead, and the directory in spool sticks around.  Users can poke around the directory to figure out what went wrong, then do a "condor_release" equivalent to resubmit the DAG.
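
As a minimal sketch of (1), assuming present-day Python bindings (Submit.from_dag() and the spool keyword postdate the 2015-era API) and placeholder hostnames and file names, the spooled remote submission looks roughly like this (the exact argument spool() wants has shifted a bit between binding versions):

    import htcondor

    # Locate the remote schedd through the collector; hostnames are placeholders.
    coll = htcondor.Collector("central-manager.example.com")
    schedd_ad = coll.locate(htcondor.DaemonTypes.Schedd, "submit.example.com")
    schedd = htcondor.Schedd(schedd_ad)

    # Build the condor_dagman job for a hypothetical workflow.dag.
    # Submit.from_dag() is a newer convenience; older code wrote the
    # condor_dagman submit description out by hand.
    dag_job = htcondor.Submit.from_dag("workflow.dag")

    # Submit with spooling: the schedd creates a per-job spool directory and
    # the client ships the input sandbox (DAG file, node submit files, inputs).
    result = schedd.submit(dag_job, spool=True)
    schedd.spool(list(dag_job.jobs(clusterid=result.cluster())))

    print("DAG submitted as cluster", result.cluster())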
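
For the failure branch of (3), something along these lines; the hold expression and the cluster id are illustrative guesses, since all we do above is set the OnExitHold attribute:

    import htcondor

    schedd = htcondor.Schedd()

    # (a) At submit time: keep a failed DAG in the queue as held instead of
    # letting it leave, so the spool directory survives for post-mortems.
    # 'on_exit_hold' is the submit-language spelling of OnExitHold; the
    # expression below is an assumed example.
    dag_job = htcondor.Submit.from_dag("workflow.dag")
    dag_job["on_exit_hold"] = "ExitCode =!= 0"

    # (b) Later, after fixing whatever broke in the spool directory, the
    # condor_release equivalent lets the DAG run again (picking up its
    # rescue DAG).  1234 is a placeholder cluster id.
    schedd.act(htcondor.JobAction.Release, "ClusterId == 1234")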

For management tools -
1) Various web interfaces revolving around the Python web framework du jour and the Python bindings (directly querying the schedd).
2) Various cron-like Python scripts that summarize the states into static JSON and RRD files (a rough sketch follows below).  The JSON files are served directly from disk; the RRD files are rendered into graphs on the fly.
  - Example: http://hcc-briantest.unl.edu/prodview (note this monitors an application that is not DAG based, but the concept is the same).
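
A rough sketch of (2), assuming the status is read from the DAGMan jobs' own ads (condor_dagman runs as a scheduler-universe job, so JobUniverse == 7); the attribute list and output path are illustrative only:

    import json
    import htcondor

    schedd = htcondor.Schedd()

    # One ad per DAG (the condor_dagman job itself), not per node job.
    ads = schedd.query(
        'JobUniverse == 7',
        ['ClusterId', 'JobStatus', 'DAG_NodesTotal', 'DAG_NodesDone', 'DAG_NodesFailed'],
    )

    summary = [
        {
            "cluster": ad.get("ClusterId"),
            "status": ad.get("JobStatus"),   # 1=idle, 2=running, 4=completed, 5=held
            "nodes_total": ad.get("DAG_NodesTotal"),
            "nodes_done": ad.get("DAG_NodesDone"),
            "nodes_failed": ad.get("DAG_NodesFailed"),
        }
        for ad in ads
    ]

    # Drop the result where httpd can serve it as a static file.
    with open("/var/www/html/dagstatus.json", "w") as fp:
        json.dump(summary, fp, indent=2)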

Brian

PS - this is the final approach; it skips over the various abandoned approaches. ;)