Re: [HTCondor-users] Determine when all jobs in a cluster have finished?

On Wed, Jan 30, 2013 at 08:25:37AM -0500, Brian Pipa wrote:
> So, my dag file would look like
> ###
> Job Workers workers.job
> Job PostProcess postprocess.job
> PARENT Workers CHILD PostProcess
> ###
> and the initial (non-DAGMan job) QueryDB would create the workers.job,
> postprocess.job, and the dag file and submit the DAG job.


Note 1: You'll find that there's a delay of 12 seconds before DAGMan starts
the first job, and a delay of a few seconds between the first job completing
and the second starting.  Hopefully that's acceptable in the overall scheme
of things.

Note 2: DAGMan also has a great feature called "rescue DAG", which means
that if some DAG nodes fail, you can restart the DAG and the nodes that
completed successfully will not be re-run.

However, when a DAG node is a job cluster, if any one job fails, all the
other jobs in the cluster are killed.  The node as a whole then counts as
failed, so restarting the DAG re-runs the entire cluster.

If you want to be able to retry individual failed jobs, then you would make
them separate DAG nodes:

Job Workers1 workers.job
Vars Workers1 instance="1"
Job Workers2 workers.job
Vars Workers2 instance="2"
Job Workers3 workers.job
Vars Workers3 instance="3"
Job Workers4 workers.job
Vars Workers4 instance="4"
Job PostProcess postprocess.job
parent Workers1 Workers2 Workers3 Workers4 child PostProcess

Then use $(instance) in workers.job instead of $(procid) to select between
the different jobs, for example "arguments = $(instance)".
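
Since your QueryDB step is already generating the DAG file, here is a rough
sketch of how it might write out one node per worker.  The function name
build_dag and its parameters are made up for illustration; only the
Job/Vars/PARENT...CHILD lines it emits come from the DAG shown above.

```python
def build_dag(num_workers, worker_submit="workers.job",
              post_submit="postprocess.job"):
    """Return DAG file text: one node per worker, then a PostProcess child.

    Hypothetical helper; the node names (Workers1..N, PostProcess) and the
    "instance" variable follow the hand-written DAG in this message.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")

    lines = []
    names = []
    for i in range(1, num_workers + 1):
        name = f"Workers{i}"
        names.append(name)
        # Every node shares the same submit file; the Vars line must name
        # the node it belongs to, which is what makes $(instance) differ.
        lines.append(f"Job {name} {worker_submit}")
        lines.append(f'Vars {name} instance="{i}"')

    lines.append(f"Job PostProcess {post_submit}")
    # PostProcess starts only after every worker node has succeeded.
    lines.append(f"PARENT {' '.join(names)} CHILD PostProcess")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the four-worker DAG; redirect to a file of your choosing.
    print(build_dag(4), end="")
```

Generating the Vars lines in a loop also avoids the easy copy-and-paste
mistake of attaching every Vars line to the same node.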