On Fri, 9 May 2008, Jean-Pierre Ocalan wrote:
I'm trying to do some exercises with Condor to better understand how this huge system works. Let's say I have a few jobs organized as follows: A -> B -> C -> F -> D -> E (D depends on B) Let's say now that B fails ... I don't want to retry B immediately with the command RETRY B <number of times> .... I want to launch another script that will repair the problem and restart B. I guess that I can work with the PRE and POST scripts. Let's say that my POST script, launched after the execution of B, checks the returned value and, if there is a problem, fixes it. But how can I tell it to restart B? Do I have to create a new workflow of jobs like this? B ->C->F ->D->E
Hmm, I'm not 100% sure I understand what you're asking, but I'll take a shot at it.
I think the combination of a POST script and RETRY will do what you want, because RETRY works on the node as a whole, not just the Condor job within the node. So, if the Condor job for node B fails, the POST script still gets run, and if the POST script then exits with a non-zero value, the node is considered failed and is retried (assuming you have retries set). In other words, have the POST script repair the problem and then exit non-zero so that B runs again.
You can use the UNLESS-EXIT option to RETRY to bail out if the POST script cannot fix the problem (see http://www.cs.wisc.edu/condor/manual/v7.1/2_10DAGMan_Applications.html#SECTION003102500000000000000).
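Here is a rough sketch of how the pieces could fit together. It is only an illustration under assumptions: the script name fix_b.py, the submit file name b.submit, the retry count of 3, the exit value 2, and the repair_problem() placeholder are all made up for the example. The DAG file lines it assumes are shown in the docstring; $RETURN is the DAGMan macro that passes the job's return value to a POST script.

```python
#!/usr/bin/env python3
"""Hypothetical POST script for node B (a sketch, not Condor-provided code).

Assumed DAG file lines (file names and numbers are illustrative):
    JOB B b.submit
    SCRIPT POST B fix_b.py $RETURN
    RETRY B 3 UNLESS-EXIT 2
"""
import sys

NODE_OK = 0     # POST script succeeds: node B succeeds and the DAG continues
RETRY_NODE = 1  # POST script fails: node B fails, so RETRY runs it again
GIVE_UP = 2     # matches UNLESS-EXIT 2: node B fails and is not retried


def repair_problem() -> bool:
    """Placeholder for your own repair logic; return True if the fix worked."""
    return True


def post_exit_code(job_return: int, repair=repair_problem) -> int:
    """Map the job's return value to the exit code of the POST script."""
    if job_return == 0:
        return NODE_OK      # the job itself succeeded, nothing to fix
    if repair():
        return RETRY_NODE   # problem fixed, ask DAGMan to retry the node
    return GIVE_UP          # could not fix it, stop retrying


# Only act when DAGMan passed the job's return value as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(post_exit_code(int(sys.argv[1])))
```

Exiting non-zero after a successful repair is deliberate: the node only gets retried if the POST script reports failure.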
Kent Wenger Condor Team