Re: [HTCondor-users] Don't understand RETRY in DAGMan


On Mon, Oct 20, 2014 at 10:36 AM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:

Anyhow, here's my guess at what happened: say your original DAG was foo.dag. You did
  condor_submit_dag foo.dag
and this had two nodes fail, producing a rescue DAG:
  foo.dag.rescue.001

From your description, it sounds like you did this:
  condor_submit_dag foo.dag.rescue.001
and then
  condor_submit_dag foo.dag

Is that right? Did you remove foo.dag.rescue.001 before re-submitting the original DAG? If so, that would explain why all of the jobs got re-run.

That's right. However, I didn't rename or remove foo.dag.rescue.001 before re-submitting the original DAG...thus my confusion as to how to re-run just the failed jobs.

If you're using the default settings for DAGMan, what you want to do after the first run fails is just
  condor_submit_dag foo.dag
which will automatically pick up the status from foo.dag.rescue.001.
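
To check which rescue file a re-submission would pick up, something like the following Python sketch may help. It is only an illustration, not DAGMan's own code: the function name latest_rescue_dag is made up, it assumes the highest-numbered rescue file next to the original DAG is the one that gets used, and it accepts both the foo.dag.rescue.001 spelling used in this thread and a spelling without the dot before the number.

```python
import re
from pathlib import Path
from typing import Optional


def latest_rescue_dag(dag_path: str) -> Optional[Path]:
    """Return the highest-numbered rescue file for dag_path, or None.

    Hypothetical helper for illustration; it only inspects file names
    and does not call any HTCondor tool.
    """
    dag = Path(dag_path)
    # Matches e.g. foo.dag.rescue.001 and foo.dag.rescue001
    pattern = re.compile(re.escape(dag.name) + r"\.rescue\.?(\d+)")
    best = None
    best_num = -1
    for candidate in dag.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match and int(match.group(1)) > best_num:
            best_num = int(match.group(1))
            best = candidate
    return best


if __name__ == "__main__":
    # Prints the rescue file found beside foo.dag, or None if there is none.
    if Path("foo.dag").exists():
        print(latest_rescue_dag("foo.dag"))
```

If this prints None after a failed run, the rescue file was probably removed or renamed, and re-submitting foo.dag would start every node from scratch.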

It's expected that doing
  condor_submit_dag foo.dag.rescue.001
will fail...

But then why does the manual say "To run a full Rescue DAG directly specify the full Rescue DAG file instead of the original DAG file"? Hopefully that will be rewritten for clarity soon.
