
Re: [HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS



On Thu, 30 Jul 2015, John N Calley wrote:

> I make a lot of use of SPLICE-ing to compose dags into complex workflows and these often have dependencies on each other. DAGMAN deals with this by adding dependencies between every final node for the PARENT dag and every initial node of the CHILD dag. When there are thousands of initial and final nodes (as is common with my workflows) this can result in extremely large numbers of dependencies and I've had cases where parsing a rescue dag took quite a few hours. I've been living with this for a while, but I recently came up with a work-around and I wondered if others might have any thoughts on it or perhaps better ways of dealing with the issue.

We're glad that you're finding splices useful. Hopefully we can make some improvements to make them more useful...

> What I have now started to do is to add a final NOOP job to each of my sub-dags, so at least I just have all the dependencies from initial jobs in the CHILD dag with this single final place-holder node. I assume that I could do the same thing to make every one of my dags start with a NOOP initial node that all the real initial nodes depend on, though I haven't actually tried this. This is clearly not the intended use of the NOOP keyword and it's a bit of a hack, so I wondered if others had better ideas?

Hmm, I wouldn't consider this a hack. There's not really a specific "intended" use for NOOP nodes -- they're for whatever someone finds useful, as in this case.
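
To put rough numbers on why the placeholder node helps, here is a minimal Python sketch (not anything DAGMan itself provides). It counts the dependencies with and without a single join node, and emits the two DAG-file lines that funnel a splice's final nodes into one NOOP node. The function names, the node name JOIN, and the submit file name noop.sub are all hypothetical placeholders.

```python
def edge_counts(n_final, n_initial):
    """Compare dependency counts between a parent and a child splice.

    direct:   every final node of the parent is a parent of every
              initial node of the child (n_final * n_initial edges).
    via_join: every final node feeds one placeholder node, which in
              turn feeds every initial node (n_final + n_initial edges).
    """
    direct = n_final * n_initial
    via_join = n_final + n_initial
    return direct, via_join


def noop_join_lines(final_nodes, join_name="JOIN", submit_file="noop.sub"):
    """Return DAG-file lines that add one NOOP node after all final nodes.

    join_name and submit_file are hypothetical; the JOB ... NOOP and
    PARENT ... CHILD forms are ordinary DAG-file syntax.
    """
    return [
        f"JOB {join_name} {submit_file} NOOP",
        f"PARENT {' '.join(final_nodes)} CHILD {join_name}",
    ]


if __name__ == "__main__":
    print(edge_counts(2000, 2000))
    print("\n".join(noop_join_lines(["A1", "A2", "A3"])))
```

With 2000 final nodes and 2000 initial nodes, that is 4,000,000 direct dependencies versus 4,000 through a join node, which is presumably most of the difference in rescue DAG parsing time.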

> Also, it would seem that it would be easy for DAGMAN to do this for me as part of the SPLICE-ing process and the result would be a good deal cleaner. I don't see any reason for DAGMAN not to do this. Am I missing something? If not, please consider it a feature request.

That's actually something we thought of pretty much when splices were first implemented. Anyhow, there is already a corresponding feature request:

  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3587,4

I guess it's kind of languished until now because nobody has really run into a use case where it was really necessary (or, if they did, we didn't find out about it).

Maybe it's time to move that up in priority... At any rate, though, there's no reason to not do it, other than its relative priority among the several hundred outstanding DAGMan bugs/feature requests.

> What I'd really like to do is to reach 'into' each sub-dag and insert dependencies between specific final nodes and specific initial nodes. I've considered hacking this solution together, but the ways of doing it that I can think of seem inelegant. I wonder if anyone has thoughts on how to do this kind of thing cleanly? To expand a bit, this comes up when I want to do Analysis A on samples 1-2000 and then I want to do Analysis B on the same samples. Analysis B for sample 1 depends on Analysis A for the same sample, but not on Analysis A for any other samples. It's a shame to require that Analysis A finish for all samples before I start Analysis B for any samples, but that is what I feel stuck with at the moment.

So you're saying that right now you have all of the A nodes in one splice, and all of the B nodes in another splice, right? I guess one thing I would want to understand in this case is what is driving your decomposition of the workflow. Because if you have a single splice that has all of your As and all of your Bs, you could do this easily. Or, if your decomposition is governed by size, you could have a splice that has A1-A100 and B1-B100, another splice that has A101-A200, B101-B200, etc.
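
As a rough illustration of the size-based decomposition, here is a Python sketch that generates the DAG-file lines for it. It is only a sketch under stated assumptions: the chunk names (CHUNK_1_100 and so on), the per-chunk file names, and the submit file names analysis_a.sub and analysis_b.sub are all made up, and passing the sample number to each job (for example with VARS) is left out.

```python
def chunk_plan(n_samples, chunk_size):
    """Split samples 1..n_samples into (splice_name, dag_file, ids) chunks.

    The CHUNK_<start>_<end> naming is a hypothetical convention.
    """
    plan = []
    for start in range(1, n_samples + 1, chunk_size):
        end = min(start + chunk_size - 1, n_samples)
        name = f"CHUNK_{start}_{end}"
        plan.append((name, f"{name}.dag", list(range(start, end + 1))))
    return plan


def chunk_dag_lines(sample_ids, a_submit="analysis_a.sub",
                    b_submit="analysis_b.sub"):
    """DAG-file lines for one chunk: B<i> depends only on A<i>.

    The submit file names are hypothetical placeholders.
    """
    lines = []
    for s in sample_ids:
        lines.append(f"JOB A{s} {a_submit}")
        lines.append(f"JOB B{s} {b_submit}")
        lines.append(f"PARENT A{s} CHILD B{s}")
    return lines


def top_level_lines(plan):
    """Top-level DAG lines: one SPLICE per chunk, no inter-splice edges."""
    return [f"SPLICE {name} {dag_file}" for name, dag_file, _ in plan]


if __name__ == "__main__":
    plan = chunk_plan(2000, 100)
    print("\n".join(top_level_lines(plan)[:3]))
    print("\n".join(chunk_dag_lines(plan[0][2][:2])))
```

Because each A/B pair lives in the same splice, no PARENT/CHILD lines between splices are needed at all, so Analysis B for a sample can start as soon as Analysis A for that sample finishes.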

If you really do need to have all of the As in one splice and all of the Bs in another I guess it might be possible to implement some kind of "weaker" dependency between splices, wherein a given node in the second splice only depends on some of the nodes in the first splice. That would definitely take some thinking, though, about how the dependencies should be specified, and this is something that hasn't come up previously, as far as I know, so I don't have any pre-existing ideas on it.

So, to summarize:
1) There's no problem with using NOOP nodes as you describe.
2) There's no reason to not have DAGMan automatically introduce such nodes. (This would also allow splices to have pre and post scripts, which would make them more consistent with sub-DAGs.)
3) Before any kind of implementation of the more flexible inter-splice dependencies, there would have to be some serious thinking involved, probably starting with a better understanding of your use case.

Kent Wenger
CHTC Team