
Re: [HTCondor-users] Don't understand RETRY in DAGMan



On Mon, 20 Oct 2014, Ralph Finch wrote:

> The bug is my misunderstanding of the manual (and its poorly described
> rescue section, IMO). The way I read it, if I add RETRY in the .dag submit
> file for every job, and one or more fails, it will create a rescue DAG
> file...so far, so good (I'm not sure if adding RETRY to the original DAG
> file is necessary for a rescue DAG to be generated). But I find the manual
> unclear about resubmitting at this point:

No, retry and rescue DAGs are separate mechanisms. Even if you don't specify any retries, you still get a rescue DAG if nodes fail. RETRY says, "if the node's job fails, re-try it up to n times before declaring the node failed" (and this happens *before* the rescue DAG is generated).
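For concreteness, a minimal DAG file using RETRY might look like this (the node names and submit-file names are made up for illustration):

  # foo.dag
  JOB A a.sub
  JOB B b.sub
  PARENT A CHILD B
  # Re-try node B up to 3 times before declaring it failed
  RETRY B 3

If B's job fails on the initial try and on all 3 retries, node B is declared failed, and a rescue DAG is written recording which nodes did succeed.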

> "If the DAG is resubmitted utilizing the Rescue DAG, the successfully
> completed nodes will not be re-executed."
> utilizing: what that means at this point is not said, but that's what I
> want: don't re-run good jobs, just the failed ones.
>
> "To run a full Rescue DAG... directly specify the full Rescue DAG file
> instead of the original DAG file."
> Uhh...do I want to run a full Rescue DAG? What I want is to re-run only the
> failed jobs; will re-running a full Rescue DAG do that?
>
> "Re-submission of the original DAG input file causes condor_dagman to try
> to parse the Rescue DAG file in combination with the original DAG input
> file..."
> Oh, apparently here's another way of re-running jobs, by re-submitting the
> original DAG file. What's the difference?

> The manual emphasizes the difference in behavior between past and current
> versions of HTCondor...understandable when the change first happened, but
> confusing to me, since I don't know or care about past versions.
>
> Anyway, I tried both ways: submitting the rescue DAG, and re-submitting the
> original DAG. The first stopped immediately, complaining about no
> information for Job 1. The second ran all jobs again.
>
> So how do I run just the failed jobs?

Yes, this is a little confusing. We're in the process of disabling "full" rescue DAGs, which should allow simplifying the manual. At this point, if you're using the default settings for DAGMan, just forget about anything talking about "full" rescue DAGs.
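If I remember the knob correctly, this is controlled by the DAGMAN_WRITE_PARTIAL_RESCUE configuration macro, which in current versions defaults to True, so partial rescue DAGs (the default behavior described below) are what you get unless you've changed your configuration:

  # condor_config (the default in current versions)
  DAGMAN_WRITE_PARTIAL_RESCUE = True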

Anyhow, here's my guess at what happened: say your original dag was foo.dag. You did
  condor_submit_dag foo.dag
and this had two nodes fail, producing a rescue DAG:
  foo.dag.rescue.001

From your description, it sounds like you did this:
  condor_submit_dag foo.dag.rescue.001
and then
  condor_submit_dag foo.dag

Is that right? Did you remove foo.dag.rescue.001 before re-submitting the original DAG? If so, that would explain why all of the jobs got re-run.

If you're using the default settings for DAGMan, what you want to do after the first run fails is just
  condor_submit_dag foo.dag
which will automatically pick up the status from foo.dag.rescue.001.
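So, assuming foo.dag as in the example above, the whole cycle looks like this (leave the rescue file in place between runs):

  condor_submit_dag foo.dag    # initial run; some nodes fail
  # DAGMan writes foo.dag.rescue.001
  condor_submit_dag foo.dag    # picks up the rescue file automatically;
                               # only failed and never-run nodes are re-run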

It's expected that doing
  condor_submit_dag foo.dag.rescue.001
will fail...

Kent Wenger
CHTC Team