
Re: [HTCondor-users] Don't understand RETRY in DAGMan



The problem here is not about retry; it is about rescue DAGs.  Why did
DAGMan run all the jobs if there was a rescue DAG present?

Retry means that when DAGMan sees a job fail, it resubmits that job
before submitting any jobs that depend on it.  When you resubmitted
the DAGMan file, you started a new DAGMan and all the job counts were
set to zero; that new DAGMan did nothing to check whether the jobs had
succeeded or failed under the previous, independent DAGMan.
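
As a rough illustration of the retry behavior described above, here is
a toy Python model.  It is not DAGMan's code: run_with_retry and the
exit-code convention (0 means success) are assumptions made for the
sketch.  It only shows that a retry count of 3 allows up to four
attempts within a single DAGMan run, and that the attempt counter
starts from zero each time.

```python
from typing import Callable, Tuple


def run_with_retry(run_once: Callable[[], int], max_retries: int) -> Tuple[bool, int]:
    """Run a node once, then retry up to max_retries more times on failure.

    Returns (succeeded, attempts_made). Toy model only, not DAGMan's logic.
    """
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        if run_once() == 0:  # assumed convention: exit code 0 is success
            return True, attempts
    return False, attempts


# Hypothetical job that fails twice and then succeeds.
exit_codes = iter([1, 1, 0])
result = run_with_retry(lambda: next(exit_codes), max_retries=3)
```

Dependent jobs would only be submitted once the call returns success;
a job that fails on every attempt is what ends up listed as failed in
the rescue DAG.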

You should check the job logs for jobs 164 and 280 to see why they failed.
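
If you have many nodes, a small script can pull the failed node names
out of the rescue file header instead of reading it by eye.  The sketch
below assumes the header looks like the excerpt quoted in your message
(a "# Nodes that failed:" line followed by one comment line ending in
<ENDLIST>); failed_nodes is a hypothetical helper, and a real rescue
file may lay the list out differently.

```python
from typing import List


def failed_nodes(rescue_text: str) -> List[str]:
    """Return node names from the line after '# Nodes that failed:'.

    Assumes the header layout shown in the quoted rescue file excerpt.
    """
    lines = rescue_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("# Nodes that failed:") and i + 1 < len(lines):
            # Drop the leading '#' and padding, then split the comma list.
            body = lines[i + 1].lstrip("#").strip()
            names = [name.strip() for name in body.split(",")]
            return [n for n in names if n and n != "<ENDLIST>"]
    return []


# Header text copied from the quoted rescue file.
sample = """# Total number of Nodes: 532
# Nodes premarked DONE: 530
# Nodes that failed: 2
#   164,280,<ENDLIST>
"""
nodes = failed_nodes(sample)
```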


On Sat, Oct 18, 2014 at 6:05 PM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:
> $CondorVersion: 8.2.1 Jun 27 2014 BuildID: 256063 $
> $CondorPlatform: x86_64_Windows8 $
>
> I don't understand the RETRY keyword in DAGMan.
>
> We have a task that runs 500-2000 jobs in our Windows 7 pool. All must run
> for the task to be successful (a calibration of a numerical model), thus, I
> want to retry any jobs that fail.  So, reading the manual, I put for
> instance
>
> JOB 0 dsm2.sub
> VARS 0 JOBNO="$(JOB)"
> RETRY 0 3
> .....
>
> and so forth for all jobs in the .dagman file, with the intention that any
> job that failed would be retried up to 3 times.
>
> Well, two jobs did fail (from the rescue file):
>
> # Total number of Nodes: 532
> # Nodes premarked DONE: 530
> # Nodes that failed: 2
> #   164,280,<ENDLIST>
>
> But on re-submitting the .dagman file, it re-ran all jobs. Is this because
> all were marked to retry? (same rescue file):
>
> DONE 0
> RETRY 0 3
> DONE 1
> RETRY 1 3
> DONE 2
> RETRY 2 3
> DONE 3
> .....
>
> Thanks,
> Ralph Finch
> Calif. Dept. of Water Resources
> Sacramento, CA USA



-- 
Nathan Panike