
Re: [Condor-users] Vanilla jobs not automatically restart





José M. Martín wrote:

I mean a crash, a hardware failure or a power cut.
When Condor is re-established, jobs restart automatically. Standard jobs resume from their last checkpoint, but Vanilla jobs restart from the beginning. This is a problem for me because the files created while the jobs were running are truncated. (I could use the data generated up to the point of failure; the programs are simulations and usually never end.)

I would like Vanilla jobs to be put in the held state until users can decide whether to restart them or remove them from the queue.

The 6.8 series does not provide a good way to avoid multiple runs of a job, because JobRunCount in the job ClassAd is the number of times the schedd has started a shadow for the job, not the number of times the job itself has started. You can try using JobRunCount anyway, but it may sometimes overestimate the number of times the job has started.
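
On 6.8, that attempt might be sketched as follows, assuming JobRunCount behaves as described above. It is untested, and because of the overestimate it may hold a job that never actually started.

```
# Sketch for 6.8 (untested): hold a job that is idle again (JobStatus 1 = Idle)
# after the schedd has started at least one shadow for it.
periodic_hold = JobRunCount > 0 && JobStatus == 1
```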

In 6.9.5 (about to be released), there is a new attribute of the job ClassAd called NumJobStarts, which I think you should be able to use like so in the submit file:

periodic_hold = NumJobStarts > 0 && JobStatus == 1
requirements = NumJobStarts == 0
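
For context, a complete submit file built around those two expressions might look like the sketch below. The executable and file names are placeholders, not taken from the original setup, and the file is untested against 6.9.5. It refers to the attribute by its ClassAd name, NumJobStarts.

```
# Sketch only: the executable and file names below are placeholders.
universe      = vanilla
executable    = simulate
output        = simulate.out
error         = simulate.err
log           = simulate.log

# Hold the job if it has already started once and is idle again
# (JobStatus 1 = Idle), for example after a crash or power failure.
periodic_hold = NumJobStarts > 0 && JobStatus == 1

# Do not match the job to a machine again once it has started.
requirements  = NumJobStarts == 0

queue
```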

You may also want to set SHADOW_LAZY_QUEUE_UPDATE=False (new in 6.9.5) in your Condor configuration, to reduce the chance of losing information about jobs that started shortly before a power failure. However, if it is not a problem for a job that ran for only 10 minutes to be restarted after a power failure, then stick with the default "lazy" queue update.

--Dan