
Re: [HTCondor-users] Condor Submit file: big whole file or several submit files



On Fri, 26 Jul 2013, Antonio Chay wrote:

First of all, glad to hear that you're using DAGMan!

> - For DAGMan: How can I have one big submit file and get all the
> "rescue" benefits? (i.e.: re-running failed jobs only).

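Unfortunately, there is no way to do this. DAGMan can only re-run jobs at the granularity of a node, and each node corresponds to one submit file. (Basically, DAGMan is just running a bunch of condor_submit commands, so it can't really do anything that you can't do by running condor_submit on the command line.) If a node's submit file queues many procs and one of them fails, the whole node counts as failed and the whole cluster gets re-run.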

I don't see functionality like this coming any time soon, either -- just thinking about how to do it, even manually, things get pretty difficult.

So I guess I'd say that this is a reason to *not* have DAGMan nodes consist of a large number of procs in a single cluster.
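
One way to get there without hand-writing hundreds of submit files is to generate a DAG with one node per job, all pointing at the same single-proc submit file and differing only in a VARS value. The following is just a sketch under assumptions: the names job.sub, run.sh, jobs.log and the macro name "index" are made up for illustration, and the submit template would need to be adapted to the real job.

```python
# Sketch: one DAG node per job, all sharing one submit file via VARS.
# File names and the "index" macro are hypothetical placeholders.

SUBMIT_TEMPLATE = """\
executable = run.sh
arguments  = $(index)
output     = job_$(index).out
error      = job_$(index).err
log        = jobs.log
queue
"""


def build_dag(num_jobs, submit_file="job.sub"):
    """Return DAG file text with one single-proc node per job."""
    lines = []
    for i in range(num_jobs):
        node = f"job{i}"
        # Each node runs the same submit file...
        lines.append(f"JOB {node} {submit_file}")
        # ...but gets its own value for $(index) in that file.
        lines.append(f'VARS {node} index="{i}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("job.sub", "w") as f:
        f.write(SUBMIT_TEMPLATE)
    with open("jobs.dag", "w") as f:
        f.write(build_dag(100))
```

With this layout each job is its own node, so a rescue DAG only needs to re-run the nodes that actually failed.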

If there's some kind of hierarchical relationship between jobs that you want to preserve in your DAGs, you might consider using splices or sub-DAGs -- that would allow your top-level DAG to still be quite simple, but you'd get the full rescue capability of DAGMan.
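
As a rough illustration of the splice approach, the sketch below generates a small top-level DAG that pulls in one lower-level DAG file per group of jobs and runs the groups in order. The group names and DAG file names are hypothetical, and the assumption that the groups run strictly one after another is mine, not something from the original question.

```python
# Sketch: a top-level DAG built from splices, one per group of jobs.
# Group names and DAG file names are hypothetical placeholders.

def build_top_level_dag(groups):
    """groups: ordered list of (splice_name, dag_file) pairs.

    Each lower-level DAG file is included as a splice, and consecutive
    groups are chained so that each one runs after the previous one.
    """
    lines = [f"SPLICE {name} {dag_file}" for name, dag_file in groups]
    for (parent, _), (child, _) in zip(groups, groups[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    stages = [("stage1", "stage1.dag"), ("stage2", "stage2.dag")]
    print(build_top_level_dag(stages), end="")
```

For sub-DAGs instead of splices, the SPLICE lines would become "SUBDAG EXTERNAL name file" lines; the PARENT/CHILD lines would stay the same.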

Kent Wenger
CHTC Team