Re: [Condor-users] Vanilla jobs not automatically restart
- Date: Wed, 28 Nov 2007 12:02:35 -0600
- From: Dan Bradley <dan@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Vanilla jobs not automatically restart
José M. Martín wrote:
I mean a crash, a hardware failure, or a power cut.
When Condor is re-established, jobs automatically restart. Standard jobs start
from their last checkpoint, but Vanilla jobs restart from the beginning. This
is a problem for me because the files created while the jobs were running are
truncated. (I can use the data generated up to the failure point; the programs
are simulations and usually never end.)
I would like the Vanilla jobs to be put on hold until users can decide whether
to restart them or remove them from the queue.
The 6.8 series does not provide a good way to avoid multiple runs of a
job, because JobRunCount in the job ClassAd is actually the number of
times the schedd has started a shadow for the job, not the number of
times the job has actually been started. You can try using JobRunCount
anyway, but it may sometimes be an overestimate of the number of times
the job has started.
In 6.9.5 (about to be released), there is a new attribute of the job
ClassAd called NumJobStarts, which I think you should be able to use
like so in the submit file:
periodic_hold = NumJobStarts > 0 && JobStatus == 1
requirements = NumJobStarts == 0
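Put together, a complete submit file might look like the following minimal sketch. The executable and file names are hypothetical placeholders, and the attribute is spelled NumJobStarts, matching the ClassAd name given above. It assumes 6.9.5 or later and has not been tested against a particular release.

```
# Sketch of a vanilla-universe submit file (file names are hypothetical).
universe   = vanilla
executable = simulate
output     = simulate.out
error      = simulate.err
log        = simulate.log

# Only match a machine if the job has never been started before.
requirements  = NumJobStarts == 0

# If the job is idle again (JobStatus 1 = Idle) after having started
# at least once, put it on hold instead of letting it rerun.
periodic_hold = NumJobStarts > 0 && JobStatus == 1

queue
```

Once a job is held, the user can list held jobs with condor_q -hold and then either condor_release the job to let it run again or condor_rm it to remove it from the queue. Note that a released job would still need its requirements changed before it could match, since NumJobStarts is no longer 0.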
You may also want to set SHADOW_LAZY_QUEUE_UPDATE=False (new in 6.9.5) in
your Condor configuration in order to decrease the chance of losing
information about jobs that started shortly before a power failure. However,
if it's not a problem that a job which ran for only 10 minutes is restarted
after a power failure, then stick with the default "lazy" queue update.
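
For reference, the configuration setting would be a single line like the sketch below. Placing it in the configuration file read on the submit machine (where the shadows run) is an assumption here, not something stated above.

```
# Condor configuration fragment (6.9.5 or later).
# Have the shadow update the job queue promptly rather than lazily,
# so that job-start information is less likely to be lost in a crash.
SHADOW_LAZY_QUEUE_UPDATE = False
```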