
Re: [Condor-users] Should the schedd/startd's tolerate schedd machine reboots?



On Wed, 2 Feb 2005 13:05:46 -0500, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
> With an appropriately long ALIVE_INTERVAL (the default 300 seconds seems
> fine) and MAX_CLAIM_ALIVES_MISSED (the default of 6 seems fine) I
> expected startds to tolerate a reasonably fast reboot of a schedd
> machine and continue to run jobs. I expected the startd to tolerate an
> outage of up to 30 minutes with the schedd before terminating running
> jobs. I'm not observing this behaviour though. I'm seeing startds vacate
> running jobs as soon as the schedd machine goes down. This is on WinXP
> to WinXP machines with 6.7.3. Is it perhaps due to a shutdown routine in
> the schedd? As the service is brought down does it reach out to startds
> to tell them to terminate running jobs? Can I prevent this so reboots are
> tolerated? Reboots are a necessary evil in our Windows development
> environment, unfortunately.

The job lease duration controls whether running jobs survive a schedd reboot:

http://www.cs.wisc.edu/condor/manual/v6.7/2_13Special_Environment.html#sec:Job-Lease

You must:
1) make sure your execute machines will allow leasing
2) make sure your submitters include "job_lease_duration" in their
submit scripts

Are you sure both of the above are happening?

(Also note that if you are using the other 6.7-series feature,
streaming output, it will prevent leasing from working.)
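
For the submitter side, a minimal sketch of a submit file that asks for a
lease might look like the following. The executable and file names are
placeholders, and the 1200 second value is only an example, not a
recommendation; pick something that comfortably covers your reboot time.
The sketch assumes output streaming is left off.

```
# Hypothetical submit file; my_job.* names are placeholders
universe           = vanilla
executable         = my_job.exe
output             = my_job.out
error              = my_job.err
log                = my_job.log

# Lease length in seconds (example value only)
job_lease_duration = 1200

queue
```

This only covers point 2 above; the execute machines still need to allow
leasing.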

Matt