
Re: [HTCondor-users] Non trivial way of using DAG



Hi Lorenzo,

Greg gave a solid solution that was in line with my thoughts: an inner DAG for the actual workflow and an outer DAG that repeats the workflow until convergence has been achieved for a dataset. I don't have too much to add, since Greg did a good job of explaining how to set up a looping DAG with a pre-script that creates the next iteration's DAG to run, but there is a bit I can elaborate on.

I was going to suggest a slightly more broken-out solution. The outer DAG, with a post-script and RETRY on the inner DAG to control the repetition, wouldn't change. But rather than having a pre-script write the current iteration's inner DAG, I was going to suggest a predefined inner workflow like the following.

repeater.dag:
------------------------------------------------
SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK <upper limit>
------------------------------------------------ 

work.dag:
------------------------------------------------
JOB SETUP split_dataset_x.sub
JOB ANALYSIS run_n.sub
JOB COMBINE create_dataset_x+1.sub

PARENT SETUP CHILD ANALYSIS
PARENT ANALYSIS CHILD COMBINE
-------------------------------------------------

With this, the inner DAG workflow is a static chain rather than one dynamically created by a pre-script. The nodes would do the following:
  1. [SETUP] A local universe job (so it runs on the AP) that finds the current dataset iteration and splits the dataset into n sub-sets of data to be consumed by the set of jobs run by the ANALYSIS node. The ANALYSIS node's job submit file can either be created dynamically each iteration or be premade, with the information it passes changing each iteration (e.g. using "queue ... matching files" for each file in a directory, or "queue ... from" for all items in a file).
  2. [ANALYSIS] A set of n jobs, submitted via the queue-per-file functionality, that do the actual computation on the data.
  3. [COMBINE] A local universe job (so it runs on the AP) that takes the output just produced by the n jobs and either creates the dataset for the next iteration or produces a file signifying convergence for the workIsDone.sh post-script.
One thing to note: this solution assumes that the n jobs run for each dataset have no dependencies between jobs in the set (i.e. job A must run before job B). If such dependencies are required, you could easily replace the ANALYSIS node with another dynamically created SUBDAG EXTERNAL. Neither my solution nor Greg's accounts for failure within the inner DAG.
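For illustration, the premade variant of the ANALYSIS submit file could look something like the sketch below. Every file name here is hypothetical; it uses the submit-file "queue <var> matching files <glob>" form to queue one job per sub-set file that SETUP produced.

run_n.sub:
------------------------------------------------
# One job per sub-set file produced by SETUP (all paths are examples)
executable            = analyze.sh
arguments             = $(input_file)
transfer_input_files  = $(input_file)
output                = analysis_$(Process).out
error                 = analysis_$(Process).err
log                   = analysis.log
queue input_file matching files subsets/*.dat
------------------------------------------------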

Sorry for the verbose response, but I hope this helps with whatever solution you end up crafting, since specific tasks can be moved around between many different parts of the DAGs. If you have any questions, comments, or concerns about either of these solutions, don't hesitate to reach out.

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Lorenzo Mobilia <l.mobilia@xxxxxxxxxxxxxxxx>
Sent: Thursday, October 26, 2023 2:21 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Non trivial way of using DAG
 
Hi Greg, 

Thank you for the time and effort you put into your answer!

The workflow you specified looks quite interesting and seems like it will fulfill my request. I will try to implement that!

Thank you for your effort,

Lorenzo


On Wed, Oct 25, 2023 at 11:56 PM Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
On 10/25/23 15:53, Matthew T West via HTCondor-users wrote:
>
> Here is the workflow tool MetOffice built to run just this sort of
> thing. I imagine there is a way to do it in DAGMan (maybe) but I
> couldn't figure out how.
>

We can have certain kinds of looping structures with DAGMan, and we have
many users with this pattern of "run a dag until some computed function
converges."  The trick is to have two levels, an outer dag and an inner
dag.  The inner dag is the one that does all the work, and the outer one
manages the control flow.  Let's treat the inner dag as an opaque box
for now, assume that it is defined in a file "work.dag", and not worry
about the shape of the dag.  Given this "work.dag", we can wrap it in an
outer dag that loops, a "repeater.dag" that looks like this:


repeater.dag:
------------------

SUBDAG EXTERNAL WORK work.dag
SCRIPT  POST WORK workIsDone.sh
RETRY WORK 1000000

--------------------

If we condor_submit_dag repeater.dag, HTCondor will run the subdag
work.dag to completion.  When the last job in work.dag is finished,
DAGMan will run the post script "workIsDone.sh" from the repeater.dag,
which runs on the access point, and thus has access to all the outputs
of all the jobs in work.dag.  If the script "workIsDone.sh" returns 0,
DAGMan assumes convergence has happened, all is good, and exits.  If
"workIsDone.sh" returns non-zero, DAGMan assumes that something is wrong
with node "WORK", and resubmits the whole dag.  (We quietly assume that
this will happen before 1,000,000 retries.)
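As a concrete illustration, here is a minimal sketch of what "workIsDone.sh" could look like. It assumes the inner DAG's final node writes a marker file once the result has converged; both the marker mechanism and the file name "converged" are assumptions for the example, not anything prescribed by DAGMan.

```shell
#!/bin/sh
# workIsDone.sh -- minimal sketch of the convergence post script.
# Assumption: the last node of work.dag writes a marker file named
# "converged" once the result has converged (the name is hypothetical).

converged() {
    # Exit status 0 when the marker file exists, non-zero otherwise.
    [ -f "$1" ]
}

# DAGMan reads this script's exit status: 0 means "done, stop";
# non-zero means "retry node WORK" (up to the RETRY limit).
# In the real script the last line would simply be:
#     converged converged
converged "converged" || echo "no marker yet; DAGMan would retry WORK"
```

In practice the check could just as well inspect job output values (a residual, a delta between iterations) instead of a marker file; anything expressible as an exit code works.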


Now, that assumes that "work.dag" is static, and is known when we run
"condor_submit_dag repeater.dag".  Maybe this isn't the case.  If it
isn't, we can just change repeater.dag to look like:


repeater.dag:
------------------
SCRIPT PRE WORK makeWorkDag.sh
SUBDAG EXTERNAL WORK work.dag
SCRIPT  POST WORK workIsDone.sh
RETRY WORK 1000000
--------------------


And in this case, before repeater.dag tries to run work.dag, it runs
the pre script "makeWorkDag.sh", which can generate the "work.dag" file
and any dependencies it needs.
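Such a pre script can be very small. The sketch below writes a work.dag with one SETUP node fanning out to n analysis nodes that feed a COMBINE node; the node names, submit-file names, and the way n is chosen are all assumptions for illustration.

```shell
#!/bin/sh
# makeWorkDag.sh -- sketch of a pre script that generates work.dag.
# Assumption: the number of analysis nodes could be passed as the
# script's first argument (DAGMan SCRIPT lines may pass arguments);
# default to 3 when none is given.

n=${1:-3}

{
    # Declare all nodes first, then the dependencies between them.
    echo "JOB SETUP setup.sub"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "JOB ANALYSIS$i analysis.sub"
        i=$((i + 1))
    done
    echo "JOB COMBINE combine.sub"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "PARENT SETUP CHILD ANALYSIS$i"
        echo "PARENT ANALYSIS$i CHILD COMBINE"
        i=$((i + 1))
    done
} > work.dag
```

Because the pre script runs on the access point before the SUBDAG node starts, it can also read the previous iteration's outputs to decide how wide the fan-out should be.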


I think this should do what you need,


-greg



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/