
[HTCondor-users] Don't understand RETRY in DAGMan



$CondorVersion: 8.2.1 Jun 27 2014 BuildID: 256063 $
$CondorPlatform: x86_64_Windows8 $

I don't understand the RETRY keyword in DAGMan.

We have a task (a calibration of a numerical model) that runs 500-2000 jobs in our Windows 7 pool. All of them must run for the task to be successful, so I want to retry any jobs that fail. After reading the manual, I put, for instance:

JOB 0 dsm2.sub
VARS 0 JOBNO="$(JOB)"
RETRY 0 3
.....

and so forth for all jobs in the .dagman file, with the intention that any job that failed would be retried up to 3 times.
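
For reference, a minimal Python sketch of how a DAG file with the layout above could be generated for any number of jobs. The function name `dag_lines` and its parameters are hypothetical and only illustrate the JOB/VARS/RETRY pattern shown; this is not the script actually used to build the file.

```python
def dag_lines(n_jobs, submit_file="dsm2.sub", retries=3):
    """Return DAG file lines for n_jobs independent nodes named 0..n_jobs-1.

    Each node gets the same three lines as in the example above:
    a JOB line, a VARS line passing the node name, and a RETRY line.
    """
    lines = []
    for node in range(n_jobs):
        lines.append(f"JOB {node} {submit_file}")
        # $(JOB) is left literal so that DAGMan substitutes the node name.
        lines.append(f'VARS {node} JOBNO="$(JOB)"')
        lines.append(f"RETRY {node} {retries}")
    return lines
```

Writing `"\n".join(dag_lines(532))` to a file would give one JOB/VARS/RETRY triple per node, 532 nodes in all.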

Well, two jobs did fail (from the rescue file):

# Total number of Nodes: 532
# Nodes premarked DONE: 530
# Nodes that failed: 2
#   164,280,<ENDLIST>

But when I re-submitted the .dagman file, it re-ran all the jobs, not just the two that failed. Is this because all were marked to retry? From the same rescue file:

DONE 0
RETRY 0 3
DONE 1
RETRY 1 3
DONE 2
RETRY 2 3
DONE 3
.....
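
To check what a rescue file actually records, a small Python sketch like the following could list the failed nodes and the nodes marked DONE. It assumes only the format visible in the excerpts above (a comment line ending in `<ENDLIST>` for the failed nodes, and `DONE <node>` lines); the function name `summarize_rescue` is hypothetical, and the real rescue file format may contain more than is handled here.

```python
def summarize_rescue(text):
    """Return (failed_nodes, done_nodes) parsed from rescue file text.

    Assumes the layout shown in the excerpts: failed nodes appear in a
    comment line ending with <ENDLIST>, finished nodes as 'DONE <node>'.
    """
    failed, done = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#") and line.endswith("<ENDLIST>"):
            # e.g. "#   164,280,<ENDLIST>" -> ["164", "280"]
            names = line.lstrip("#").replace("<ENDLIST>", "")
            failed = [n.strip() for n in names.split(",") if n.strip()]
        elif line.startswith("DONE "):
            done.append(line.split()[1])
    return failed, done
```

Comparing the two lists would show whether nodes 164 and 280 are the only ones missing a DONE line.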

Thanks,
Ralph Finch
Calif. Dept. of Water Resources
Sacramento, CA USA