Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Vanilla jobs not automatically restart

Date: Wed, 28 Nov 2007 12:02:35 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Vanilla jobs not automatically restart



José M. Martín wrote:

I mean a crash, a hardware failure, or a power cut.
When Condor is re-established, jobs restart automatically. Standard jobs start from their last checkpoint, but Vanilla jobs restart from the beginning. This is a problem for me because the files created while the jobs were running are truncated. (I can use the data generated up to the failure point; the programs are simulations and usually never end.)
I would like the Vanilla jobs to be put in the held state until users can decide whether to restart them or remove them from the queue.

The 6.8 series does not provide a good way to avoid multiple runs of a job, because JobRunCount in the job ClassAd is actually the number of times the schedd has started a shadow for the job, not the number of times the job has actually been started. You can try using JobRunCount anyway, but it may sometimes be an overestimate of the number of times the job has started.
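
As a rough illustration of the JobRunCount approach on 6.8, a submit file might look like the sketch below. The executable name is hypothetical, and the hold expression is an assumption modelled on the JobStatus == 1 (idle) test used elsewhere in this message. Because JobRunCount counts shadow starts, this may hold a job that never actually ran.

```
# Sketch only: 6.8-series workaround based on JobRunCount.
# JobRunCount counts shadow starts, so it can overestimate real job starts.
universe      = vanilla
# Hypothetical program name, used only for illustration.
executable    = my_simulation
# Hold the job if it is idle again after a shadow was already started for it.
periodic_hold = JobRunCount > 0 && JobStatus == 1
queue
```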

In 6.9.5 (about to be released), there is a new attribute of the job ClassAd called NumJobStarts, which I think you should be able to use like so in the submit file:


periodic_hold = NumJobStarts > 0 && JobStatus == 1
requirements = NumJobStarts == 0

You may also want to set SHADOW_LAZY_QUEUE_UPDATE=False (new in 6.9.5) in your Condor configuration in order to decrease the chance of losing information about jobs that started shortly before a power failure. However, if it's not a problem that a job which ran for only 10 minutes is restarted after a power failure, then stick with the default "lazy" queue update.
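
For reference, the setting would go in the Condor configuration on the submit machine, where the shadow runs. This is a minimal sketch; which configuration file to edit (global or local) is an assumption that depends on your installation.

```
# Sketch only: goes in the Condor configuration on the submit machine.
# Disables the default "lazy" queue update (new in 6.9.5).
SHADOW_LAZY_QUEUE_UPDATE = False
```

Running condor_reconfig afterwards is the usual way to have the daemons re-read their configuration.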


--Dan

Follow-Ups:
- Re: [Condor-users] Vanilla jobs not automatically restart
  - From: Stuart Anderson

References:
- [Condor-users] Vanilla jobs not automatically restart
  - From: José M. Martín
- Re: [Condor-users] Vanilla jobs not automatically restart
  - From: Matt Hope
- Re: [Condor-users] Vanilla jobs not automatically restart
  - From: José M. Martín
