
Re: [Condor-users] [newbie question: using DAGman how can I restart a job that failed after that another script solve it]



On Fri, 9 May 2008, Jean-Pierre Ocalan wrote:

I'm trying to do some exercises with Condor to better understand how
this huge system works.
Let's say I have a few jobs organized as follows:
A -> B -> C -> F
           -> D -> E

(D depends on B)

Let's say now that B fails ... I don't want to retry B immediately with
the command RETRY B <number of times> ....
I want to launch another script that will repair the problem and restart B.
I guess that I can work with the PRE and POST scripts.
Let's say that my POST script, launched after the execution of B, checks
the returned value and, if there is a problem, fixes it. But how
can I tell DAGMan to restart B?
Do I have to create a new workflow of jobs like this?
B ->C->F
  ->D->E

Hmm, I'm not 100% sure I understand what you're asking, but I'll take a shot at it.

I think the combination of a POST script and RETRY will do what you want, because RETRY works on the node as a whole, not just the Condor job within the node. So, if the Condor job for node B fails, the POST script will get run, and if the POST script then fails (exits with a non-zero value), the node will be retried (assuming you have retries set). In other words, your POST script can fix the problem and then return non-zero so that B is run again.

You can use the UNLESS-EXIT option of RETRY to bail out without further retries if the POST script cannot fix the problem (see http://www.cs.wisc.edu/condor/manual/v7.1/2_10DAGMan_Applications.html#SECTION003102500000000000000).
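
As a rough sketch of how this could look for your example DAG: the submit file names, the script name repair_b.sh, the retry count of 3, and the exit value 2 are all made up for illustration, so adjust them to your setup.

```
# Hypothetical DAG file for A -> B -> C -> F, with B -> D -> E
JOB A a.submit
JOB B b.submit
JOB C c.submit
JOB D d.submit
JOB E e.submit
JOB F f.submit

# repair_b.sh (hypothetical) gets the exit value of B's Condor job via $RETURN.
# Assumed convention for the script:
#   exit 0  if B's job succeeded (node B succeeds, C and D can run)
#   exit 1  if B's job failed and the problem was repaired (node B is retried)
#   exit 2  if B's job failed and the problem could not be repaired (no retry)
SCRIPT POST B repair_b.sh $RETURN

# Retry node B up to 3 times, but give up at once if the node exits with 2
RETRY B 3 UNLESS-EXIT 2

PARENT A CHILD B
PARENT B CHILD C D
PARENT C CHILD F
PARENT D CHILD E
```

With this layout you do not need a second workflow: C, D, E and F stay in the same DAG and only start once node B has finally succeeded.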

Kent Wenger
Condor Team