
Re: [HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS



Kent,
   Thanks a lot for your thoughtful response. Let me try to go into a little more detail on my rationale for the 'Analysis A' and 'Analysis B' decomposition I mention below and try to convince you that this is a real need on my part. I do realize that coming up with a mechanism for this would not be trivial. I've been thinking along the lines of something like:
	PARENT dagA+merge_final_(.*) CHILD dagB+count_$1     ---where the capture reg_ex in the parent is translated to $1 in the child.
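To make the proposed semantics concrete, here is a minimal Python sketch of how such a pattern-based dependency could be expanded into explicit one-to-one PARENT/CHILD lines. This is only an illustration of the idea: DAGMan has no such feature in this thread, the function name and the list of node names are hypothetical, and nothing here establishes that DAGMan would accept the generated lines. Note that under Python's regex rules the literal `+` in a spliced node name has to be escaped.

```python
import re

def expand_pattern_dependency(parent_pattern, child_template, node_names):
    """Expand one pattern-based dependency into explicit PARENT/CHILD lines.

    parent_pattern: Python regex matched against whole node names.
    child_template: child node name where $1, $2, ... refer to capture groups.
    node_names: all node names known after splicing (assumed to be available).
    """
    regex = re.compile(parent_pattern)
    known = set(node_names)
    lines = []
    for parent in node_names:
        match = regex.fullmatch(parent)
        if match is None:
            continue
        # Substitute $N with the text captured by group N in the parent name.
        child = re.sub(r"\$(\d+)",
                       lambda g: match.group(int(g.group(1))),
                       child_template)
        if child in known:  # parents without a matching child are skipped
            lines.append(f"PARENT {parent} CHILD {child}")
    return lines

# Hypothetical node names for two samples, s1 and s2.
nodes = [
    "dagA+align_s1",
    "dagA+merge_final_s1",
    "dagA+merge_final_s2",
    "dagB+count_s1",
    "dagB+count_s2",
]
for line in expand_pattern_dependency(r"dagA\+merge_final_(.*)",
                                      "dagB+count_$1", nodes):
    print(line)
```

With this expansion the number of dependencies grows linearly with the number of samples, instead of as the product of final and initial node counts.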

    So why do I think I need this? I have sets of DAGs that perform various kinds of genomic analyses across sets of biological samples. Sometimes it makes sense to run these individually and sometimes it makes sense to run them as part of larger work-flows. In general, a particular DAG may be run on its own, or as part of several other super-dags. So the decomposition into sub-dags is functional, not size based. I don't want to generate combined A/B analyses in a single DAG because this would require me to know up front every way I am going to want to compose things in the future and would get complicated quickly. This way, my script that knows about Analysis A need know nothing about Analysis B. Analysis B only needs to know where to look for the results of Analysis A (or be told it by a composing script), and scripts that compose the two need know very little about either of the component analyses. I hope this helps.


Thanks,

John

John Calley, Ph.D.
Genetics and Bioinformatics, Tailored Therapeutics
Eli Lilly and Company
DC0731, Lilly Corporate Center, Indianapolis, IN 46285 USA 
317.433.3399 (office) | 317.655.1534 (fax)
calley_john_n@xxxxxxxxx | www.lilly.com 



-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of R. Kent Wenger
Sent: Thursday, July 30, 2015 2:59 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Avoiding combinatorial explosion in dependencies between spliced DAGS

On Thu, 30 Jul 2015, John N Calley wrote:

>   I make a lot of use of SPLICE-ing to compose dags into complex 
> workflows and these often have dependencies on each other. DAGMAN 
> deals with this by adding dependencies between every final node for 
> the PARENT dag and every initial node of the CHILD dag. When there are 
> thousands of initial and final nodes (as is common with my workflows) 
> this can result in extremely large numbers of dependencies and I've 
> had cases where parsing a rescue dag took quite a few hours. I've been 
> living with this for a while, but I recently came up with a 
> work-around and I wondered if others might have any thoughts on it or 
> perhaps better ways of dealing with the issue.

We're glad that you're finding splices useful.  Hopefully we can make some improvements to make them more useful...

>  What I have now started to do is to add a final NOOP job to each of 
> my sub-dags, so at least I just have all the dependencies from initial 
> jobs in the CHILD dag with this single final place-holder node. I 
> assume that I could do the same thing to make every one of my dags 
> start with a NOOP initial node that all the real initial nodes depend 
> on, though I haven't actually tried this. This is clearly not the 
> intended use of the NOOP keyword and it's a bit of a hack, so I 
> wondered if others had better ideas?

Hmm, I wouldn't consider this a hack.  There's not really a specific "intended" use for NOOP nodes -- they're for whatever someone finds useful, as in this case.
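The NOOP workaround described above can be sketched as a small Python helper that emits the extra DAG lines for a sub-DAG and compares the resulting dependency counts. This is a rough sketch, not part of the original exchange: the node name `join_final` and the submit file name `noop.sub` are hypothetical placeholders, and the node counts are only an example.

```python
def add_join_node(final_nodes, join_name="join_final", submit_file="noop.sub"):
    """Return DAG lines adding one NOOP node that depends on all final nodes.

    join_name and submit_file are illustrative placeholders.
    """
    return [
        f"JOB {join_name} {submit_file} NOOP",
        f"PARENT {' '.join(final_nodes)} CHILD {join_name}",
    ]

def edge_counts(n_final, n_initial):
    """Dependencies between two spliced DAGs, without and with a join node.

    Without: every final node of the parent connects to every initial node
    of the child. With: each final node connects to the join node, and the
    join node connects to each initial node of the child.
    """
    return n_final * n_initial, n_final + n_initial

for line in add_join_node(["merge_final_s1", "merge_final_s2"]):
    print(line)

direct, with_join = edge_counts(2000, 2000)
print(direct, with_join)
```

For 2000 final and 2000 initial nodes this is the difference between 4,000,000 dependencies and 4,000.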

> Also, it would seem that it would be easy for DAGMAN to do this for me 
> as part of the SPLICE-ing process and the result would be a good deal 
> cleaner. I don't see any reason for DAGMAN not to do this. Am I 
> missing something? If not, please consider it a feature request.

That's actually something we thought of pretty much when splices were first implemented.  Anyhow, there is already a corresponding feature request:

   https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3587,4

I guess it's kind of languished until now because nobody has really run into a use case where it was really necessary (or, if they did, we didn't find out about it).

Maybe it's time to move that up in priority...  At any rate, though, there's no reason to not do it, other than its relative priority among the several hundred outstanding DAGMan bugs/feature requests.

> What I'd really like to do is to reach 'into' each sub-dag and insert 
> dependencies between specific final nodes and specific initial nodes.
> I've considered hacking this solution together, but the ways of doing 
> it that I can think of seem inelegant. I wonder if anyone has thoughts 
> on how to do this kind of thing cleanly? To expand a bit, this comes 
> up when I want to do Analysis A on samples 1-2000 and then I want to 
> do Analysis B on the same samples. Analysis B for sample 1 depends on 
> Analysis A for the same sample, but not on Analysis A for any other 
> samples. It's a shame to require that Analysis A finish for all 
> samples before I start Analysis B for any samples, but that is what I 
> feel stuck with at the moment.

So you're saying that right now you have all of the A nodes in one splice, and all of the B nodes in another splice, right?  I guess one thing I would want to understand in this case is what is driving your decomposition of the workflow.  Because if you have a single splice that has all of your As and all of your Bs, you could do this easily.  Or, if your decomposition is governed by size, you could have a splice that has A1-A100 and B1-B100, another splice that has A101-A200, B101-B200, etc.
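The size-based decomposition suggested here could look roughly like the following Python sketch, which generates the lines for one splice file per chunk of samples, with each B node depending only on the A node for the same sample. The submit file names and the chunk size are assumptions for illustration, and a real workflow would also need per-sample settings (for example via VARS lines), which are omitted.

```python
def chunk_ranges(total, size):
    """Split samples 1..total into inclusive (first, last) ranges of at most size."""
    return [(start, min(start + size - 1, total))
            for start in range(1, total + 1, size)]

def chunk_dag_lines(first, last,
                    a_submit="analysis_a.sub", b_submit="analysis_b.sub"):
    """DAG lines for one chunk: A and B nodes plus per-sample dependencies.

    The submit file names are hypothetical placeholders.
    """
    lines = []
    for i in range(first, last + 1):
        lines.append(f"JOB A{i} {a_submit}")
        lines.append(f"JOB B{i} {b_submit}")
        lines.append(f"PARENT A{i} CHILD B{i}")  # B_i waits only for A_i
    return lines

# Example: 250 samples in chunks of 100 gives three splice files.
for first, last in chunk_ranges(250, 100):
    lines = chunk_dag_lines(first, last)
    print(f"chunk {first}-{last}: {len(lines)} lines")
```

This only helps when the decomposition is governed by size; it does not address the case where A and B must stay in separate, independently reusable DAGs.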

If you really do need to have all of the As in one splice and all of the Bs in another I guess it might be possible to implement some kind of "weaker" dependency between splices, wherein a given node in the second splice only depends on some of the nodes in the first splice.  That would definitely take some thinking, though, about how the dependencies should be specified, and this is something that hasn't come up previously, as far as I know, so I don't have any pre-existing ideas on it.

So, to summarize:
1) There's no problem with using NOOP nodes as you describe.
2) There's no reason to not have DAGMan automatically introduce such nodes.  (This would also allow splices to have pre and post scripts, which would make them more consistent with sub-DAGs.)
3) Before any kind of implementation of the more flexible inter-splice dependencies, there would have to be some serious thinking involved, probably starting with a better understanding of your use case.

Kent Wenger
CHTC Team