Re: [HTCondor-users] Don't understand RETRY in DAGMan



On Sat, 18 Oct 2014, Ralph Finch wrote:

> I don't understand the RETRY keyword in DAGMan.
>
> We have a task that runs 500-2000 jobs in our Windows 7 pool. All must run
> for the task to be successful (a calibration of a numerical model), thus, I
> want to retry any jobs that fail.  So, reading the manual, I put for
> instance
>
> JOB 0 dsm2.sub
> VARS 0 JOBNO="$(JOB)"
> RETRY 0 3
> .....
>
> and so forth for all jobs in the .dagman file, with the intention that any
> job that failed would be retried up to 3 times.
>
> Well, two jobs did fail (from the rescue file):
>
> # Total number of Nodes: 532
> # Nodes premarked DONE: 530
> # Nodes that failed: 2
> #   164,280,<ENDLIST>
>
> But on re-submitting the .dagman file, it re-ran all jobs. Is this because
> all were marked to retry? (same rescue file):
>
> DONE 0
> RETRY 0 3
> DONE 1
> RETRY 1 3
> DONE 2
> RETRY 2 3
> DONE 3
> .....

Hmm, this doesn't sound like the correct behavior. Can you send the relevant dagman.out file(s)?

If a job succeeded the first time around, having a RETRY on it should not cause it to be re-run when you re-submit the DAG and a rescue DAG exists. The retry only applies if the job fails.
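
To make the expected behavior concrete, here is a rough Python sketch of the rule: on a re-submit, only nodes that are not marked DONE in the rescue DAG should run again, whether or not they carry a RETRY line. This is only a model of the intended semantics, not DAGMan's actual code. The function names (parse_rescue, nodes_to_rerun) and the sample text are made up for illustration; in particular, the "RETRY 164 3" line is an assumption about how a failed node would appear.

```python
def parse_rescue(text):
    """Collect DONE node names and RETRY limits from rescue-DAG-style text."""
    done = set()
    retries = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines such as "# Nodes that failed: 2".
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        keyword = parts[0].upper()
        if keyword == "DONE" and len(parts) >= 2:
            done.add(parts[1])
        elif keyword == "RETRY" and len(parts) >= 3:
            retries[parts[1]] = int(parts[2])
    return done, retries


def nodes_to_rerun(all_nodes, done):
    """A node is re-run only if it is not marked DONE; RETRY plays no part here."""
    return [node for node in all_nodes if node not in done]


# Hypothetical sample: nodes 0 and 1 succeeded, node 164 failed.
sample = """\
# Nodes that failed: 2
DONE 0
RETRY 0 3
DONE 1
RETRY 1 3
RETRY 164 3
"""

done, retries = parse_rescue(sample)
rerun = nodes_to_rerun(["0", "1", "164"], done)
print(rerun)
```

Nodes 0 and 1 have RETRY lines but are marked DONE, so under this model they are left alone and only node 164 is selected. If all 532 nodes really do run again, the rescue DAG is probably not being picked up at all, which is why the dagman.out file would help.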

Kent Wenger
CHTC Team