
Re: [HTCondor-users] Don't understand RETRY in DAGMan





On Mon, Oct 20, 2014 at 10:36 AM, R. Kent Wenger <wenger@xxxxxxxxxxx> wrote:

Anyhow, here's my guess at what happened: say your original dag was foo.dag. You did
 condor_submit_dag foo.dag
and this had two nodes fail, producing a rescue DAG:
 foo.dag.rescue.001

From your description, it sounds like you did this:
 condor_submit_dag foo.dag.rescue.001
and then
 condor_submit_dag foo.dag

Is that right? Did you remove foo.dag.rescue.001 before re-submitting the original DAG? If so, that would explain why all of the jobs got re-run.

That's right. However, I didn't rename or remove foo.dag.rescue.001 before re-submitting the original DAG...thus my confusion as to how to re-run just the failed jobs.

If you're using the default settings for DAGMan, what you want to do after the first run fails is just
 condor_submit_dag foo.dag
which will automatically pick up the status from foo.dag.rescue.001.

It's expected that doing
 condor_submit_dag foo.dag.rescue.001
will fail...
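
As a rough illustration of the default behaviour described above (re-submitting foo.dag picks up the status from the existing rescue file), the sketch below only reports which rescue file with the highest number is present next to a DAG file. latest_rescue is a hypothetical helper, not an HTCondor command, and it assumes the three-digit .rescue.NNN suffix seen in this thread.

```shell
# Hypothetical helper (not part of HTCondor): print the highest-numbered
# rescue file found next to the given DAG file, or nothing if none exist.
# Assumes the zero-padded three-digit suffix (foo.dag.rescue.001), so a
# plain lexical sort orders the files by rescue number.
latest_rescue() {
    dag="$1"
    ls "$dag".rescue.[0-9][0-9][0-9] 2>/dev/null | sort | tail -n 1
}

# Example use before re-submitting the original DAG:
#   latest_rescue foo.dag
#   condor_submit_dag foo.dag
```

If the helper prints foo.dag.rescue.001, then re-submitting foo.dag (not the rescue file itself) is the step described above.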

But then why does the manual say "To run a full Rescue DAG directly specify the full Rescue DAG file instead of the original DAG file"? Hopefully that will be rewritten for clarity soon.

RF