[HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS
- Date: Thu, 30 Jul 2015 17:47:01 +0000
- From: John N Calley <calley_john_n@xxxxxxxxx>
- Subject: [HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS
I make heavy use of SPLICE to compose DAGs into complex workflows, and the spliced DAGs often depend on each other. DAGMan handles a dependency between two splices by adding a dependency from every final node of the PARENT DAG to every initial node of the CHILD DAG. When there are thousands of initial and final nodes (as is common with my workflows), this can produce an extremely large number of dependencies, and I've had cases where parsing a rescue DAG took quite a few hours. I've been living with this for a while, but I recently came up with a workaround, and I wonder whether others have any thoughts on it or better ways of dealing with the issue.
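To make the scaling concrete, here is a rough Python sketch of the cross product described above. The function name `splice_edges` and the node names are made up for illustration, and 2000 is just a figure of the kind mentioned above, not a measurement.

```python
from itertools import product


def splice_edges(parent_final, child_initial):
    """Return one (parent, child) edge for every final/initial node pair."""
    return list(product(parent_final, child_initial))


# Tiny case: 3 final nodes and 2 initial nodes give 3 * 2 = 6 edges.
small = splice_edges(["a1", "a2", "a3"], ["b1", "b2"])
print(len(small))

# The edge count is always len(parent_final) * len(child_initial),
# so 2000 final nodes and 2000 initial nodes would need 4,000,000 edges.
print(2000 * 2000)
```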
What I have now started to do is add a final NOOP job to each of my sub-DAGs, so that the only dependencies between two spliced DAGs run from this single final placeholder node to the initial jobs of the CHILD DAG. I assume I could do the same at the other end and make every one of my DAGs start with an initial NOOP node that all the real initial nodes depend on, though I haven't actually tried this. This is clearly not the intended use of the NOOP keyword and it's a bit of a hack, so I wonder whether others have better ideas. It also seems that it would be easy for DAGMan to do this for me as part of the splicing process, and the result would be a good deal cleaner. I don't see any reason for DAGMan not to do this. Am I missing something? If not, please consider it a feature request.
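A minimal sketch of this workaround, written as a Python generator for the DAG files. The node names, file names (`analysis_a.dag`, `analysis_a.sub`, `noop.sub`) and the helper `subdag_with_join` are all hypothetical; only the JOB ... NOOP, PARENT ... CHILD and SPLICE keywords are DAGMan's own.

```python
def subdag_with_join(prefix: str, n_jobs: int, submit_file: str) -> str:
    """Return DAG-file text: n_jobs independent nodes plus one final NOOP node."""
    nodes = [f"{prefix}_{i}" for i in range(1, n_jobs + 1)]
    lines = [f"JOB {name} {submit_file}" for name in nodes]
    join = f"{prefix}_done"
    # The NOOP keyword makes this a placeholder node: no job is actually run.
    # "noop.sub" is only a placeholder name for the JOB line.
    lines.append(f"JOB {join} noop.sub NOOP")
    # Every real node feeds the placeholder, which becomes the only final node.
    lines.append(f"PARENT {' '.join(nodes)} CHILD {join}")
    return "\n".join(lines) + "\n"


# Top-level DAG: splice the two sub-DAGs and make B depend on A.
TOP_LEVEL = """\
SPLICE A analysis_a.dag
SPLICE B analysis_b.dag
PARENT A CHILD B
"""

if __name__ == "__main__":
    print(subdag_with_join("a", 3, "analysis_a.sub"), end="")
    print(TOP_LEVEL, end="")
```

With n final nodes in A and m initial nodes in B, the placeholder replaces n * m edges between the splices with n edges inside A plus m edges from the placeholder into B.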
What I'd really like to do is reach 'into' each sub-DAG and insert dependencies between specific final nodes and specific initial nodes. I've considered hacking a solution together, but the ways of doing it that I can think of seem inelegant. Does anyone have thoughts on how to do this kind of thing cleanly? To expand a bit, this comes up when I want to run Analysis A on samples 1-2000 and then run Analysis B on the same samples. Analysis B for sample 1 depends on Analysis A for the same sample, but not on Analysis A for any other sample. It's a shame to require that Analysis A finish for all samples before Analysis B can start for any sample, but that is what I feel stuck with at the moment.
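For comparison, the per-sample dependency structure described here can be sketched as a single generated DAG with no splices. This only shows the shape of the desired dependencies, not a way to get them between spliced DAGs. The helper `per_sample_dag`, the node names, the submit file names and the `sample` macro are hypothetical.

```python
def per_sample_dag(n_samples: int,
                   sub_a: str = "analysis_a.sub",
                   sub_b: str = "analysis_b.sub") -> str:
    """Return DAG-file text in which B_i depends only on A_i."""
    lines = []
    for i in range(1, n_samples + 1):
        lines.append(f"JOB A_{i} {sub_a}")
        lines.append(f'VARS A_{i} sample="{i}"')
        lines.append(f"JOB B_{i} {sub_b}")
        lines.append(f'VARS B_{i} sample="{i}"')
        # One edge per sample, so the edge count grows linearly with n_samples.
        lines.append(f"PARENT A_{i} CHILD B_{i}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(per_sample_dag(2), end="")
```

In this layout, Analysis B for a sample becomes eligible to run as soon as Analysis A for that same sample finishes, and 2000 samples need 2000 edges.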
Thank You Very Much,
John Calley, Ph.D.
Genetics and Bioinformatics, Tailored Therapeutics
Eli Lilly and Company
DC0731, Lilly Corporate Center, Indianapolis, IN 46285 USA
317.433.3399 (office) | 317.655.1534 (fax)
calley_john_n@xxxxxxxxx | www.lilly.com