Yes, the scheduling node must be up for the jobs to keep running, because the starter and the shadow communicate while the job runs. The submit file attribute +JobLeaseDuration=<#seconds> tells Condor how long the shadow and the starter may be out of contact before the job is killed. If you set it to something longer (1200-2400 seconds), an occasional reboot of your scheduler shouldn't affect running jobs, provided the condor_schedd restarts within the lease duration you specified in the submit file.
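
As a rough sketch, a submit file with a longer lease might look something like the following. The executable name and the 2400-second value are placeholders I made up for illustration, so adjust them to your own jobs and to how long your scheduler typically takes to come back:

```
# Hypothetical submit description file (names and values are placeholders)
universe   = vanilla
executable = my_job.exe

# Allow the shadow and starter to be out of contact for up to 40 minutes
+JobLeaseDuration = 2400

queue
```

The lease only helps if the condor_schedd is back before it expires, so pick a value comfortably longer than a normal reboot of the submit machine.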

This issue is a good reason to have a centralized scheduler that isn't your workstation. You could run a dedicated scheduler node that is always on, and use condor_submit with -r, -n, or -s to remotely spool/submit your jobs to it, depending on your setup.

I hope that helps.


Rob Stevenson wrote:
Dear all,
First, many thanks for previous advice.

I have an issue where it appears that the submitter PC needs to be on
permanently during the whole duration of the run.

I'm running in Windows in the vanilla universe with jobs that, if
suspended or preempted, will restart from the beginning. Therefore I
completely disallow all preempts/suspends which is no problem.

However, I've noticed that when a submitter PC is powered off or crashes
and has to be restarted, any jobs that have been submitted from this but
not yet started (I) will not begin until the submitter is back online.
Also that sometimes, though not always, jobs that are currently active
(r) will stop and restart from the beginning.

I can imagine that starting jobs that haven't yet started may require
the submitter PC to be on, but I'm surprised that already running jobs
occasionally fail. Is this usual?

Best Regards,

