
Re: [HTCondor-users] Non trivial way of using DAG



Hi Greg,

Thank you for your time and effort for the answer provided!

The workflow you specified looks quite interesting and seems like it will fulfill my request. I will try to implement that!

Thank you for your effort,

Lorenzo


On Wed, 25 Oct 2023 at 23:56, Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
On 10/25/23 15:53, Matthew T West via HTCondor-users wrote:
>
> Here is the workflow tool MetOffice built to run just this sort of
> thing. I imagine there is a way to do it in DAGMan (maybe) but I
> couldn't figure out how.
>

We can have certain kinds of looping structures with DAGMan, and we have
many users with this pattern of "run a dag until some computed function
converges". The trick is to have two levels, an outer dag and an inner
dag. The inner dag is the one that does all the work, and the outer
manages the control flow. Let's treat the inner dag as an opaque box
for now, and assume that it is defined in a file "work.dag", and we
don't care about the shape of the dag. Given this "work.dag", we can
wrap it in an outer dag that loops with a "repeater.dag" that looks like
this:


repeater.dag:
------------------

SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK 1000000

--------------------

If we condor_submit_dag repeater.dag, HTCondor will run the subdag
work.dag to completion. When the last job in the work.dag is finished,
dagman will run the post script "workIsDone.sh", from the repeater.dag,
which runs on the access point, and thus has access to all the outputs
of all the jobs in work.dag. If the script "workIsDone.sh" returns 0,
dagman assumes convergence has happened, all is good, and exits. If
"workIsDone.sh" returns non-zero, dagman assumes that something is wrong
with node "WORK", and, because of the RETRY line, resubmits the whole
work.dag. (We quietly assume that convergence will happen before
1000000 retries.)
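
As a rough illustration of the post script, here is a minimal sketch of
what "workIsDone.sh" might look like. Everything specific in it is an
assumption, not part of the recipe above: it supposes that some job in
work.dag writes the current residual as the last line of a results file
(called residual.txt here), and that the outer dag passes that file name
and a tolerance as arguments, e.g.
"SCRIPT POST WORK workIsDone.sh residual.txt 1e-6". The function name
has_converged is made up for the example.

```sh
#!/bin/sh
# workIsDone.sh: hypothetical POST script for node WORK.
# Assumption: the last line of the results file is a single number
# (the residual) written by a job in work.dag.

# Succeeds (status 0) only if the residual is a valid number below tol.
has_converged() {
    file="$1"
    tol="$2"
    [ -r "$file" ] || return 1            # no output yet: not converged
    residual=$(tail -n 1 "$file")
    awk -v r="$residual" -v t="$tol" 'BEGIN {
        # Reject anything that is not a plain decimal number.
        if (r !~ /^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/) exit 1
        exit !(r + 0 < t + 0)             # status 0 means converged
    }'
}

# Run the check only when a results file is given on the command line,
# so the DAG line must pass one. The exit status is what DAGMan sees.
if [ "$#" -ge 1 ]; then
    has_converged "$1" "${2:-1e-6}"
    exit $?
fi
```

A missing or malformed results file counts as "not converged", so the
node is retried rather than silently declared finished.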


Now, that assumes that "work.dag" is static, and is known when we run
"condor_submit_dag repeater.dag". Maybe this isn't the case. And if it
isn't, we can just change repeater.dag to look like


repeater.dag:
------------------
SCRIPT PRE WORK makeWorkDag.sh
SUBDAG EXTERNAL WORK work.dag
SCRIPT POST WORK workIsDone.sh
RETRY WORK 1000000
--------------------


And in this case, before repeater.dag tries to run the work.dag, it runs
the pre script "makeWorkDag.sh", which can generate the "work.dag" file,
and any dependencies it needs.
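
For illustration only, a pre script along these lines could generate
the inner dag. The shape of the dag (N compute jobs feeding one reduce
job), the submit file names compute.sub and reduce.sub, and the function
name make_work_dag are all assumptions made up for this sketch. It also
assumes the outer dag passes DAGMan's $RETRY macro as the first
argument, e.g. "SCRIPT PRE WORK makeWorkDag.sh $RETRY", so that each
pass through the loop knows its iteration number.

```sh
#!/bin/sh
# makeWorkDag.sh: hypothetical PRE script for node WORK.
# Assumptions: compute.sub and reduce.sub already exist, and the inner
# dag is a fan-in of n_jobs compute nodes into a single reduce node.

# Usage: make_work_dag ITERATION N_JOBS OUTPUT_FILE
make_work_dag() {
    iteration="$1"
    n_jobs="$2"
    out="$3"
    [ "$n_jobs" -ge 1 ] 2>/dev/null || return 1
    i=0
    parents=""
    {
        while [ "$i" -lt "$n_jobs" ]; do
            echo "JOB COMPUTE$i compute.sub"
            echo "VARS COMPUTE$i iteration=\"$iteration\" index=\"$i\""
            parents="$parents COMPUTE$i"
            i=$((i + 1))
        done
        echo "JOB REDUCE reduce.sub"
        echo "VARS REDUCE iteration=\"$iteration\""
        echo "PARENT$parents CHILD REDUCE"
    } > "$out.tmp" && mv "$out.tmp" "$out"   # replace the old file in one step
}

# Run only when the iteration number is given on the command line.
if [ "$#" -ge 1 ]; then
    make_work_dag "$1" "${2:-4}" work.dag
    exit $?
fi
```

Writing to a temporary file and renaming it means a failed run never
leaves a half-written work.dag behind for the next retry to pick up.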


I think this should do what you need,


-greg


