
Re: [HTCondor-users] DAGMan and decision making



On 4/18/23 10:08, Beaumont, Martin wrote:

Hi all,

 

 

Now, as the one in charge of the system, I'm not really liking the fact that an Octave script, running on the submit node, is trying to handle the condor jobs on its own and making plenty of info requests to the scheduler to try handling all possibilities. I'd prefer something more reliable, like DAGMan.

 

We never used DAGMan before, so I'm very new to it. Both me and the users are interested in trying it out to replace the Octave script, but we are unsure if it can do what we require.


Hi Martin:

I think your intuition that DAGMan is better than a polling script is spot on.  Let's start with the simple case.  DAGMan can rerun a node based on the node's exit code, which can be the exit code of a POST script.  In the simplest case a node is just one job:

JOB A some_condor_submit_file
SCRIPT POST A check_for_convergence.sh
RETRY A 1000000

This will have DAGMan run the job specified in some_condor_submit_file in the pool.  When it completes, your script "check_for_convergence.sh" is run on the submit machine.  If it returns non-zero, the node is marked as failed, which here means that job A has not converged, and the RETRY line makes DAGMan run it again.  If it returns zero, that means convergence has been reached, and DAGMan continues on.
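
To make that concrete, here is a minimal sketch of what check_for_convergence.sh could look like.  It assumes, purely for illustration, that job A writes its latest residual as a single number to a file called residual.txt in the submit directory, and that anything below 1e-6 counts as converged.  The file name, the tolerance, and the RESIDUAL_FILE and TOLERANCE variables are all placeholders of mine, not something DAGMan provides.

```shell
#!/bin/sh
# Sketch of a POST script for node A.  Assumptions (not part of DAGMan):
#   - the job writes one number, its residual, to residual.txt
#   - a residual below the tolerance means "converged"

# is_converged FILE TOL: return 0 if the number on the first line of FILE
# is below TOL, non-zero otherwise.
is_converged() {
    file=$1
    tol=$2
    # A missing or empty file counts as "not converged", so the node is retried.
    [ -s "$file" ] || return 1
    # awk does the floating-point comparison; a non-numeric first field
    # also counts as "not converged".
    awk -v tol="$tol" \
        'NR == 1 { exit !($1 ~ /^[0-9.eE+-]+$/ && $1 + 0 < tol + 0) }' "$file"
}

# Only run the check when invoked under the POST script's own name, so the
# function above can also be sourced and tried out by hand.
case "$0" in
    *check_for_convergence.sh)
        # Exit status 0 lets DAGMan move on; non-zero fails node A,
        # and the RETRY line reruns it.
        is_converged "${RESIDUAL_FILE:-residual.txt}" "${TOLERANCE:-1e-6}"
        exit $?
        ;;
esac
```

The only thing DAGMan looks at is the exit status, so the check itself can be whatever your users' Octave script does today to decide whether another iteration is needed.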

Now, maybe what you need is not to re-run a single job, but a DAG of jobs.  In this case, we can repeat the pattern, but replace "JOB A some_condor_submit_file" with a subdag, which looks like a single node to the parent DAG:

SUBDAG EXTERNAL A some_dag_file.dag

The POST script and the RETRY line then work the same way as for a single job, except that a retry reruns the whole subdag.
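
Spelled out, the parent DAG file for the subdag case would look something like the fragment below.  It reuses the placeholder file names from above and attaches both the POST script and the retry count to node A:

```
SUBDAG EXTERNAL A some_dag_file.dag
SCRIPT POST A check_for_convergence.sh
RETRY A 1000000
```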

Now, if you really want to get fancy: because the parent DAGMan just starts a child DAGMan process to run the SUBDAG EXTERNAL node, the "check_for_convergence.sh" script can rewrite the subdag's .dag file, or do whatever else it wants, and the changes will take effect on the next iteration.
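
As a rough sketch of that last idea, the POST script could regenerate the subdag file before it exits non-zero.  The layout here is entirely hypothetical: N independent nodes named W1..WN that all share one submit file.  Your real subdag will look different, so treat write_subdag and its arguments as placeholders.

```shell
#!/bin/sh
# Sketch: regenerate the subdag from the POST script before the next retry.
# Hypothetical layout: N independent nodes that all use the same submit file.

# write_subdag FILE N SUBMIT_FILE: write a DAG with nodes W1..WN to FILE.
write_subdag() {
    out=$1
    n=$2
    submit=$3
    tmp="$out.tmp"
    i=1
    # Start from an empty temporary file.
    : > "$tmp"
    while [ "$i" -le "$n" ]; do
        echo "JOB W$i $submit" >> "$tmp"
        i=$((i + 1))
    done
    # Rename at the end so a half-written .dag file is never left in place.
    mv "$tmp" "$out"
}
```

Inside check_for_convergence.sh you would then call something like "write_subdag some_dag_file.dag 8 worker.sub" (worker.sub being a made-up submit file name) just before returning non-zero, so the next retry of node A picks up the new file.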


Good luck,

-greg