
Re: [Condor-users] FEATURE QUESTION: Re-submitting using 'on_exit_remove' but for a limited number of re-tries



On Aug 24, 2006, at 9:42 PM, Etan Cohen wrote:

In the documentation there is:

As another example, if your job should only leave the queue if it exited on its own with status 0, you would use this on_exit_remove expression:

         on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

If the job was killed by a signal or exited with a non-zero exit status, Condor would leave the job in the queue to run again.

 

I have a job with both real and intermittent failure modes. I would like a counter on the resubmission, e.g. resubmit up to a maximum of 5 times. This would let the job work through the intermittent failures without looping forever on the real failures.

 

Does such a feature exist?


The job attribute JobRunCount counts the number of times the job was started. However, it doesn't distinguish a re-execution caused by on_exit_remove from one caused by the execute machine evicting the job or dying.
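
As a rough sketch of the JobRunCount approach (untested, and assuming JobRunCount is already up to date when on_exit_remove is evaluated), the submit file line might look like this. The limit of 5 is just the number from the question, and it caps total starts, evictions included, rather than failed exits:

```
# Leave the queue on a clean exit with status 0,
# or once the job has been started 5 times for any reason.
on_exit_remove = ((ExitBySignal == False) && (ExitCode == 0)) || (JobRunCount >= 5)
```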

Another option is to set on_exit_hold and periodic_release, then use NumSystemHolds to count the number of failed executions.
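
A minimal sketch of that second option, untested and with the threshold of 5 taken from the question, could look like the lines below. It assumes that NumSystemHolds is incremented each time on_exit_hold puts the job on hold, and note that periodic_release as written would also release the job if it were held for some other reason:

```
# Put the job on hold instead of re-running it when it was
# killed by a signal or exited with a non-zero status.
on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Release the held job so it runs again, but only while the
# hold count is within the limit. After that it stays on hold.
periodic_release = (NumSystemHolds <= 5)
```

Once the limit is passed, the job remains in the queue on hold, where it can be inspected and then released or removed by hand.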

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+