
[Condor-users] Rescue DAG and clusters



The documentation for DAGMan says:

"The failure of a single job within a cluster of multiple jobs (within a
single node) causes the entire cluster of jobs to fail. Any other jobs
within the failed cluster of jobs are immediately removed."

A simple test confirms this to be the case:

==> A.submit <==
cmd = /bin/sleep
args = 20
queue 10

==> B.submit <==
cmd = /bin/sleep
args = 30
queue 10

==> test.dag <==
JOB A A.submit
JOB B B.submit
PARENT A CHILD B

Killing any one of the 'sleep' condor_exec processes causes the others to be
killed, and restarting the DAG causes all the processes in that cluster to
be restarted from scratch.

So suppose job A and job B are doing useful work (e.g. a cluster processing
N files in parallel), and I need all of the A jobs to complete before any of
the B jobs start, but I want to be able to retry individual failed jobs from
A or B. What's the best way to do this?

As far as I can see, I need to write out an explicit set of nodes and the
dependencies between them.

# A.submit
...
queue 1

# B.submit
...
queue 1

# A.dag
JOB A0 A.submit
VARS A0 runnumber="0"
JOB A1 A.submit
VARS A1 runnumber="1"
...
JOB A9 A.submit
VARS A9 runnumber="9"

# B.dag
JOB B0 B.submit
VARS B0 runnumber="0"
JOB B1 B.submit
VARS B1 runnumber="1"
...
JOB B9 B.submit
VARS B9 runnumber="9"

# test2.dag
SUBDAG EXTERNAL A A.dag
SUBDAG EXTERNAL B B.dag
PARENT A CHILD B

I've tested this and it works - but I have had to enumerate all 20 jobs
explicitly, instead of just having 2 clusters of 10 jobs.  Is there any neat
way to avoid this, similar to the "queue N" parameter in a cluster?
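
One workaround, short of a built-in equivalent to "queue N", would be to
generate the node list with a small script instead of typing it out. The
following is only a minimal sketch in Python: the function name node_lines
and its parameters are hypothetical, and it assumes the submit file picks up
the value via $(runnumber) as in the hand-written files above.

```python
def node_lines(prefix, submit_file, count):
    """Return DAG lines declaring `count` single-job nodes.

    Nodes are named prefix0 .. prefix(count-1), and each one gets a
    VARS line setting runnumber to its index (hypothetical helper).
    """
    lines = []
    for i in range(count):
        name = f"{prefix}{i}"
        lines.append(f"JOB {name} {submit_file}")
        lines.append(f'VARS {name} runnumber="{i}"')
    return lines


if __name__ == "__main__":
    # Print the contents of A.dag; redirect the output to a file to use it.
    print("\n".join(node_lines("A", "A.submit", 10)))
```

Running it once for each of A and B and redirecting the output would produce
A.dag and B.dag with the same JOB/VARS pairs as above, so changing the number
of jobs becomes a one-argument change.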

Also, it's a bit slow to start. The first condor_dagman sits around for
about 10-15 seconds, and then starts the inner condor_dagman.  That also
sits around for 10-15 seconds, before it starts submitting the 'A' jobs. 
When those have completed, it takes a while to spawn the second inner
condor_dagman, and then some more time before the 'B' jobs.

Replacing "SUBDAG EXTERNAL" with "SPLICE" seems to help by getting rid of
the second layer of condor_dagman.
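
For reference, the swap only touches the first two lines of test2.dag. The
sketch below (Python, with a hypothetical helper name top_level_dag) just
renders the top-level file with either keyword, assuming the same A.dag and
B.dag files as above.

```python
def top_level_dag(keyword="SPLICE"):
    """Return the text of the top-level DAG file.

    `keyword` is either "SPLICE" or "SUBDAG EXTERNAL"; the node names
    and the dependency line are the same in both variants.
    """
    lines = [
        f"{keyword} A A.dag",
        f"{keyword} B B.dag",
        "PARENT A CHILD B",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(top_level_dag("SPLICE"), end="")
```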

Is there any other parameter I can tweak to speed up the launching of jobs?

Thanks,

Brian.