
Re: [Condor-users] Rescue DAG and clusters



On Wed, Jul 11, 2012 at 10:15:25PM +0100, Brian Candler wrote:
> The documentation for DAGman says:
> 
> "The failure of a single job within a cluster of multiple jobs (within a
> single node) causes the entire cluster of jobs to fail. Any other jobs
> within the failed cluster of jobs are immediately removed."
> 
> A simple test confirms this to be the case:
> 
> ==> A.submit <==
> cmd = /bin/sleep
> args = 20
> queue 10
> 
> ==> B.submit <==
> cmd = /bin/sleep
> args = 30
> queue 10
> 
> ==> test.dag <==
> JOB A A.submit
> JOB B B.submit
> PARENT A CHILD B
> 
> Killing any one of the 'sleep' condor_exec processes causes the others to be
> killed, and a restart of the dag causes all the processes in that cluster to
> be restarted from scratch.
> 
> So suppose job A and job B are doing useful work (e.g. a cluster processing
> N files in parallel), and I need all the job A's to complete before the job
> B's start, but I want to retry individual failed jobs from A or B.
> What's the best way to do this?
> 
> As far as I can see, I need to write out an explicit set of nodes and the
> dependencies between them.
> 
> # A.submit
> ...
> queue 1
> 
> # B.submit
> ...
> queue 1
> 
> # A.dag
> JOB A0 A.submit
> VARS A0 runnumber="0"
> JOB A1 A.submit
> VARS A1 runnumber="1"
> ...
> JOB A9 A.submit
> VARS A9 runnumber="9"
> 
> # B.dag
> JOB B0 B.submit
> VARS B0 runnumber="0"
> JOB B1 B.submit
> VARS B1 runnumber="1"
> ...
> JOB B9 B.submit
> VARS B9 runnumber="9"
> 
> # test2.dag
> SUBDAG EXTERNAL A A.dag
> SUBDAG EXTERNAL B B.dag
> PARENT A CHILD B

This is the best way to do it.
> 
> I've tested this and it works - but I have had to enumerate all 20 jobs
> explicitly, instead of just having 2 clusters of 10 jobs.  Is there any neat
> way to avoid this, similar to the "queue N" parameter in a cluster?
> 

The format above is not too bad, since it is easy to write a script that
generates such a DAG.
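
For example, something along these lines should do it (an untested
sketch; the node prefixes and file names just follow your example
above, and each submit file is assumed to "queue 1" and use
$(runnumber)):

#!/usr/bin/env python
# generate_dags.py - sketch of a generator for A.dag and B.dag.
import sys

def write_dag(path, prefix, submit_file, count):
    # Write one JOB/VARS pair per node, numbered 0..count-1.
    with open(path, "w") as f:
        for i in range(count):
            f.write("JOB %s%d %s\n" % (prefix, i, submit_file))
            f.write("VARS %s%d runnumber=\"%d\"\n" % (prefix, i, i))

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    write_dag("A.dag", "A", "A.submit", n)
    write_dag("B.dag", "B", "B.submit", n)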

> Also, it's a bit slow to start. The first condor_dagman sits around for
> about 10-15 seconds, and then starts the inner condor_dagman.  That also
> sits around for 10-15 seconds, before it starts submitting the 'A' jobs. 
> When those have completed, it takes a while to spawn the second inner
> condor_dagman, and then some more time before the 'B' jobs.
> 

Yes, that is because SUBDAG EXTERNAL has to spawn another
condor_dagman, which has its own startup delay. Splicing does not
incur the cost of an extra condor_dagman.
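
For example, your test2.dag could be spliced like this (splices are
expanded in place into the parent DAG, so no extra condor_dagman is
run):

# test2.dag
SPLICE A A.dag
SPLICE B B.dag
PARENT A CHILD B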

> Replacing "SUBDAG EXTERNAL" with "SPLICE" seems to help by getting rid of
> the second layer of condor_dagman.
> 
> Is there any other parameter I can tweak to speed up the launching of jobs?
> 
> Thanks,
> 
> Brian.