Re: [Condor-users] Vanilla jobs not automatically restart





Stuart Anderson wrote:

On Wed, Nov 28, 2007 at 12:02:35PM -0600, Dan Bradley wrote:
The 6.8 series does not provide a good way to avoid multiple runs of a job, because JobRunCount in the job ClassAd is actually the number of times the schedd has started a shadow for the job, not the number of times the job has actually been started. You can try using JobRunCount anyway, but it may sometimes be an overestimate of the number of times the job has started.

In 6.9.5 (about to be released), there is a new attribute of the job ClassAd called NumJobStarts, which I think you should be able to use like so in the submit file:

periodic_hold = NumJobStarts > 0 && JobStatus == 1
requirements = NumJobStarts == 0
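
To make the intent of the two expressions above concrete, here is a small Python sketch of the logic they encode. It is a toy model, not Condor code: the function names should_hold and may_match are made up for illustration, and the only Condor facts assumed are the JobStatus codes (1 = Idle, 2 = Running, 5 = Held).

```python
# Toy model of the two policy expressions; nothing here talks to Condor.
# JobStatus codes as used by Condor: 1 = Idle, 2 = Running, 5 = Held.
IDLE, RUNNING, HELD = 1, 2, 5


def should_hold(num_job_starts: int, job_status: int) -> bool:
    """Mirror of the periodic_hold expression (hypothetical helper).

    Hold the job if it has already started once and is idle again,
    i.e. it is waiting to be restarted.
    """
    return num_job_starts > 0 and job_status == IDLE


def may_match(num_job_starts: int) -> bool:
    """Mirror of the requirements expression (hypothetical helper).

    Only a job that has never started is allowed to match a machine.
    """
    return num_job_starts == 0


if __name__ == "__main__":
    # A fresh idle job: may match, is not held.
    print(may_match(0), should_hold(0, IDLE))
    # A job that ran once and went back to idle: blocked and held.
    print(may_match(1), should_hold(1, IDLE))
```

The two expressions overlap on purpose in this reading: requirements keeps a previously started job from matching again, and periodic_hold moves it out of the idle queue so it is visible as needing attention.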


Dan,
	I am confused by the distinction between JobRunCount and NumJobStarts.
Would you be willing to enumerate the circumstances when these two numbers can
differ?

JobRunCount is deprecated in 6.9.5. As it was implemented in previous versions, it is basically equivalent to the new NumShadowStarts attribute, which is the number of times the schedd has started up a submit-side shadow process for the job. Frequently, the number of shadow starts and the number of job starts are equal, but they can differ. For example, if the power goes off on the submit node and the schedd restarts before the job lease expires (20 minutes by default), then the schedd can start up a new shadow to remotely watch over the still-running job. In this case, the number of shadow starts will be greater than the number of job starts.

--Dan
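
The difference between the two counters described above can be sketched as a toy model in Python. This is only an illustration of the bookkeeping, not how the schedd is implemented: the class JobCounters and its methods start_job and reconnect are hypothetical names, and the model assumes the reconnect happens before the job lease expires.

```python
from dataclasses import dataclass


@dataclass
class JobCounters:
    """Toy stand-in for the two job ClassAd counters (hypothetical class)."""

    num_shadow_starts: int = 0  # models NumShadowStarts (old JobRunCount)
    num_job_starts: int = 0     # models NumJobStarts

    def start_job(self) -> None:
        """The schedd starts a shadow and the job begins executing."""
        self.num_shadow_starts += 1
        self.num_job_starts += 1

    def reconnect(self) -> None:
        """The schedd restarts within the job lease and starts a new
        shadow for the job that is still running; the job itself is
        not started again."""
        self.num_shadow_starts += 1


if __name__ == "__main__":
    counters = JobCounters()
    counters.start_job()   # normal start: both counters go to 1
    counters.reconnect()   # submit node lost power, schedd came back in time
    print(counters)
```

After one normal start and one reconnect, the shadow counter is 2 while the job counter is still 1, which is the overestimate that makes JobRunCount unreliable for "has this job already run?" checks.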