
Re: [Condor-users] FEATURE QUESTION: Re-submitting using'on_exit_remove' but for a limited number of re-tries



James,

 

What you are describing seems to be post-processing after the job has terminated, i.e. checking the JobRunCount or NumSystemHolds attribute on the submit machine and then deciding whether to re-submit the job. That is too late: if this were happening on the submit machine, I could just check whether the job produced incorrect results and re-submit it.

 

What I was hoping to find was an 'on_exit_remove' type of behavior but with a limited number of attempts, i.e. something managed by Condor without requiring post-processing on the submit side. The reason is that my job-submission scripts terminate once the submissions are done. Without automatic, Condor-controlled re-submission, I would need a script running constantly to check each job's status and whether it terminated correctly, which would be a problem (especially on a laptop).

 

Regards,

 

Etan

 

-----------------------------------------------

Etan G. Cohen

etan.cohen@xxxxxxxxxxxx

(949) 433 1811


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
Sent: Thursday, August 31, 2006 9:22 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] FEATURE QUESTION: Re-submitting using'on_exit_remove' but for a limited number of re-tries

 

On Aug 24, 2006, at 9:42 PM, Etan Cohen wrote:



In the documentation there is:

As another example, if your job should only leave the queue if it exited on its own with status 0, you would use this on_exit_remove expression:

         on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

If the job was killed by a signal or exited with a non-zero exit status, Condor would leave the job in the queue to run again.

 

I have a job with both real and intermittent failure modes. I would like to have a counter on the resubmission, e.g. re-submit up to a maximum of 5 times. This would allow working through the intermittent failures without causing an infinite loop on the real failures.

 

Does such a feature exist?

 

The job attribute JobRunCount counts the number of times the job was started. However, it doesn't distinguish between the job re-executing because of on_exit_remove and because the execute machine evicted the job or died.
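
As a rough sketch of how JobRunCount could be used inside the submit description file itself, the documented expression can be extended with a second clause. The limit of 6 starts (one initial run plus 5 re-tries) is an assumed value, and the sketch assumes JobRunCount already includes the current run when on_exit_remove is evaluated; because of the caveat above, evictions would also count against the limit.

         # Leave the queue on a clean exit, or once the job has been
         # started 6 times (assumed limit: 1 initial run + 5 re-tries).
         on_exit_remove = ((ExitBySignal == False) && (ExitCode == 0)) || (JobRunCount >= 6)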

 

Another option is to set on_exit_hold and periodic_release, then use NumSystemHolds to count the number of failed executions.
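
A minimal sketch of this hold-and-release approach in a submit description file might look like the following. The limit of 5 is an assumed value, and the sketch assumes NumSystemHolds is incremented each time on_exit_hold puts the job on hold, which should be checked against the Condor version in use.

         # Put the job on hold, rather than removing it, when it was
         # killed by a signal or exited with a non-zero status.
         on_exit_hold     = (ExitBySignal == True) || (ExitCode != 0)

         # Release the held job for another attempt until it has been
         # held more than 5 times (assumed limit of 5 re-tries).
         periodic_release = (NumSystemHolds <= 5)

With these settings a job that keeps failing would stay on hold after the last attempt instead of leaving the queue, so it would still need to be removed or inspected by hand. Note also that periodic_release applies to any held job, so it may release jobs that were put on hold for other reasons.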

 

+--------------------------------+-----------------------------------+

|           Jaime Frey           | I used to be a heavy gambler.     |

|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |

| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |

+--------------------------------+-----------------------------------+