[Condor-users] Should the schedd/startd's tolerate schedd machine reboots?

With an appropriately long ALIVE_INTERVAL (the default of 300 seconds
seems fine) and MAX_CLAIM_ALIVES_MISSED (the default of 6 seems fine), I
expected startds to tolerate a reasonably fast reboot of a schedd
machine and continue to run jobs. Specifically, I expected a startd to
tolerate losing contact with the schedd for up to 30 minutes (6 missed
keepalives at 300 seconds each) before terminating running jobs. I'm not
observing this behaviour, though: startds vacate running jobs as soon as
the schedd machine goes down. This is WinXP to WinXP with Condor 6.7.3.

Is this perhaps due to a shutdown routine in the schedd? As the service
is brought down, does it reach out to the startds and tell them to
terminate running jobs? Can I prevent this so that reboots are
tolerated? Unfortunately, reboots are a necessary evil in our Windows
development environment.
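
For reference, this is how I arrive at the 30-minute figure. It is only
a sketch of my own arithmetic, assuming the startd gives up on a claim
once MAX_CLAIM_ALIVES_MISSED consecutive keepalives, sent ALIVE_INTERVAL
seconds apart, have been missed. The function name is made up for
illustration and is not part of Condor.

```python
def claim_tolerance_seconds(alive_interval: int, max_alives_missed: int) -> int:
    """Rough outage window a startd should tolerate before dropping a claim.

    Assumption (mine, not confirmed by the Condor docs): the claim is kept
    until `max_alives_missed` keepalives in a row have been missed, with
    keepalives expected every `alive_interval` seconds.
    """
    return alive_interval * max_alives_missed


# Values taken from the defaults mentioned in this post.
tolerance = claim_tolerance_seconds(alive_interval=300, max_alives_missed=6)
print(f"{tolerance} seconds = {tolerance // 60} minutes")
```

If that assumption about how the two settings combine is wrong, that
alone might explain part of what I'm seeing.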

- Ian