Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Determine when all jobs in a cluster have finished?

Date: Wed, 30 Jan 2013 13:43:28 +0000
From: Brian Candler <B.Candler@xxxxxxxxx>
Subject: Re: [HTCondor-users] Determine when all jobs in a cluster have finished?

On Wed, Jan 30, 2013 at 08:25:37AM -0500, Brian Pipa wrote:
> So, my dag file would look like
> ###
> Job Workers workers.job
> Job PostProcess postprocess.job
> PARENT Workers CHILD PostProcess
> ###
> and the initial (non-DAGMan job) QueryDB would create the workers.job,
> postprocess.job, and the dag file and submit the DAG job.

Yes.

Note 1: You'll find that there's a 12-second delay before DAGMan starts the
first job, and a delay of a few seconds between the first job completing and
the second starting.  Hopefully that's acceptable in the overall scheme of
things.

Note 2: DAGMan also has a great feature called "rescue DAG", which means
that if some DAG nodes fail, you can restart the DAG and the
successfully-completed nodes will not be re-run.

However, in the case of a job cluster submitted as a single DAG node, if any
one job fails, all the other jobs in the cluster are killed.

If you want to be able to retry individual failed jobs, then you would make
them separate DAG nodes:

Job Workers1 workers.job
Vars Workers1 instance="1"
Job Workers2 workers.job
Vars Workers2 instance="2"
Job Workers3 workers.job
Vars Workers3 instance="3"
Job Workers4 workers.job
Vars Workers4 instance="4"
Job PostProcess postprocess.job
parent Workers1 Workers2 Workers3 Workers4 child PostProcess

Then use $(instance) in workers.job instead of $(procid), to select between
the different jobs.
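
Since the initial QueryDB step creates the DAG file anyway, the per-node
layout can be generated rather than typed by hand. Here is a minimal sketch
in Python, assuming the node names and submit file names used in this
thread; the function name build_dag and its parameters are made up for
illustration and are not part of HTCondor:

```python
def build_dag(num_workers, worker_submit="workers.job",
              post_submit="postprocess.job"):
    """Return DAG file text: one node per worker, then a PostProcess node.

    build_dag is a hypothetical helper, not an HTCondor API. The node
    names (Workers1..N, PostProcess) follow the example in this thread.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker node")

    lines = []
    names = []
    for i in range(1, num_workers + 1):
        name = f"Workers{i}"
        names.append(name)
        # Every node reuses the same submit file...
        lines.append(f"Job {name} {worker_submit}")
        # ...and gets its own value for $(instance).
        lines.append(f'Vars {name} instance="{i}"')

    lines.append(f"Job PostProcess {post_submit}")
    # PostProcess becomes ready only after every worker node has succeeded.
    lines.append(f"PARENT {' '.join(names)} CHILD PostProcess")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a four-worker DAG; redirect to a file to use it.
    print(build_dag(4), end="")
```

Redirect the output to a file and hand that file to condor_submit_dag.
Each Vars line names its own node, so every worker sees a different
$(instance).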

Cheers,

Brian.
