
Re: [HTCondor-users] Don't understand RETRY in DAGMan



On Mon, 20 Oct 2014, Ralph Finch wrote:

> The bug is my misunderstanding of the manual (and its poorly described
> rescue section, IMO). The way I read it, if I add RETRY in the .dag submit
> file for every job, and one or more fails, it will create a rescue DAG
> file...so far, so good (I'm not sure if adding RETRY to the original DAG
> file is necessary for a rescue DAG to be generated). But I find the manual
> unclear about resubmitting at this point:

No, retry and rescue DAGs are separate mechanisms. Even if you don't specify any retries, you still get a rescue DAG if nodes fail. RETRY says, "if the node's job fails, re-try it up to n times before declaring the node failed" (and this happens *before* the rescue DAG is generated).
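For concreteness, a minimal DAG file using RETRY might look like this (the node names and submit-file names are made up for illustration):

  # foo.dag
  JOB A a.sub
  JOB B b.sub
  PARENT A CHILD B
  # Re-try node B up to 3 times before declaring it failed
  RETRY B 3

If B's job fails on the initial try and on all 3 retries, node B is declared failed, and a rescue DAG is written recording which nodes did succeed.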

> "If the DAG is resubmitted utilizing the Rescue DAG, the successfully
> completed nodes will not be re-executed."
> utilizing: what that means at this point is not said, but that's what I
> want: don't re-run good jobs, just the failed ones.
>
> "To run a full Rescue DAG... directly specify the full Rescue DAG file
> instead of the original DAG file."
> Uhh...do I want to run a full Rescue DAG? What I want is to re-run only the
> failed jobs; will re-running a full Rescue DAG do that?
>
> "Re-submission of the original DAG input file causes condor_dagman to try
> to parse the Rescue DAG file in combination with the original DAG input
> file..."
> Oh, apparently here's another way of re-running jobs, by re-submitting the
> original DAG file. What's the difference?

> The manual emphasizes the difference in behavior between past and current
> versions of HTCondor...understandable when the change first happened, but
> confusing to me, since I don't know or care about past versions.
>
> Anyway, I tried both ways: submitting the rescue DAG, and re-submitting the
> original DAG. The first stopped immediately, complaining about no
> information for Job 1. The second ran all jobs again.
>
> So how do I run just the failed jobs?

Yes, this is a little confusing. We're in the process of disabling "full" rescue DAGs, which should allow simplifying the manual. At this point, if you're using the default settings for DAGMan, just forget about anything talking about "full" rescue DAGs.
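If I remember the knob correctly, this is controlled by the DAGMAN_WRITE_PARTIAL_RESCUE configuration macro, which in current versions defaults to True, so partial rescue DAGs (the default behavior described below) are what you get unless you've changed your configuration:

  # condor_config (the default in current versions)
  DAGMAN_WRITE_PARTIAL_RESCUE = True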

Anyhow, here's my guess at what happened: say your original dag was foo.dag. You did
  condor_submit_dag foo.dag
and this had two nodes fail, producing a rescue DAG:
  foo.dag.rescue.001

From your description, it sounds like you did this:
  condor_submit_dag foo.dag.rescue.001
and then
  condor_submit_dag foo.dag

Is that right? Did you remove foo.dag.rescue.001 before re-submitting the original DAG? If so, that would explain why all of the jobs got re-run.

If you're using the default settings for DAGMan, what you want to do after the first run fails is just
  condor_submit_dag foo.dag
which will automatically pick up the status from foo.dag.rescue.001.
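So, assuming foo.dag as in the example above, the whole cycle looks like this (leave the rescue file in place between runs):

  condor_submit_dag foo.dag    # initial run; some nodes fail
  # DAGMan writes foo.dag.rescue.001
  condor_submit_dag foo.dag    # picks up the rescue file automatically;
                               # only failed and never-run nodes are re-run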

It's expected that doing
  condor_submit_dag foo.dag.rescue.001
will fail...

Kent Wenger
CHTC Team