Re: [Condor-users] [newbie question: using DAGman how can I restart a job that failed after that another script solve it]
- Date: Fri, 9 May 2008 14:36:38 -0500 (CDT)
- From: "R. Kent Wenger" <wenger@xxxxxxxxxxx>
On Fri, 9 May 2008, Jean-Pierre Ocalan wrote:
I'm trying to do some exercises with Condor to better understand how
this huge system works.
Let's say I have a few jobs organized as follows:
A -> B -> C -> F
       -> D -> E
(D depends on B)
Let's say now that B fails. I don't want to retry B immediately with
the command RETRY B <number of times>.
I want to launch another script that will repair the problem and restart B.
I guess that I can work with the PRE and POST scripts.
Let's say that my POST script, launched after the execution of B, checks
the returned value and, if there is a problem, fixes it. But how
can I tell DAGMan to restart B?
Do I have to create a new workflow of jobs like this?
Hmm, I'm not 100% sure I understand what you're asking, but I'll take a
shot at it.
I think the combination of a POST script and RETRY will do what you want,
because retry works on the node as a whole, not just the Condor job
within the node. So, if the Condor job for node B fails, the POST script
will get run, and then if the POST script fails, the node will be retried
(assuming you have retries set).
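As a rough sketch of that idea (not something from the original exchange), a POST script for node B could look like the Python below. It assumes the DAG file passes the job's return value via the $RETURN macro; the file names fix_b.py and b.sub, the retry count, the exit code values, and the try_repair placeholder are all hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical POST script for node B (a sketch under stated assumptions).

Assumed DAG file lines (file names and retry count are placeholders):

    JOB B b.sub
    SCRIPT POST B fix_b.py $RETURN
    RETRY B 3 UNLESS-EXIT 2
"""
import sys

EXIT_OK = 0        # node B succeeded; DAGMan moves on to B's children
EXIT_RETRY = 1     # node B failed; RETRY runs the whole node again
EXIT_GIVE_UP = 2   # chosen to match UNLESS-EXIT 2, so retries stop


def try_repair():
    """Placeholder for the site-specific fix; return True if it worked."""
    return True


def decide_exit_code(job_return, repair=try_repair):
    """Map the Condor job's return value to the POST script's exit code."""
    if job_return == 0:
        return EXIT_OK
    # The job failed: attempt the repair, then fail the node on purpose
    # so that DAGMan retries it (or give up if the repair did not work).
    return EXIT_RETRY if repair() else EXIT_GIVE_UP


if __name__ == "__main__" and len(sys.argv) > 1:
    # DAGMan substitutes $RETURN with the job's return value.
    sys.exit(decide_exit_code(int(sys.argv[1])))
```

The point of the design is that the POST script's exit code, not the job's, decides whether the node counts as failed, so the script deliberately exits non-zero after a successful repair to make the node run again.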
You can use the UNLESS-EXIT option to RETRY to bail out if the POST script
cannot fix the problem (see