
Re: [Condor-users] The submitting PC suffers power failure and running jobs



Rob,

Yes, the scheduling node must be operational for the jobs to keep running, because the starter and the shadow communicate while the job runs. The submit file attribute +JobLeaseDuration=<#seconds> tells Condor how long the shadow and the starter may be out of contact before the job is killed. If you set it to a longer value (1200-2400 seconds), the occasional reboot of your scheduler shouldn't affect running jobs, provided the condor_schedd restarts within the lease duration you specified in the submit file.
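
As a rough sketch of where that attribute goes, a vanilla-universe submit file might look like the one below. The executable and file names are hypothetical placeholders, and 1800 is just an example value from the 1200-2400 second range mentioned above; pick a lease that comfortably covers how long your submit machine takes to reboot and restart condor_schedd.

```
# Hypothetical submit description file; job.exe and the log/output
# file names are placeholders, not taken from the original question.
universe   = vanilla
executable = job.exe
log        = job.log
output     = job.out
error      = job.err

# Let the shadow and the starter be out of contact for up to
# 30 minutes (1800 seconds) before the job is killed.
+JobLeaseDuration = 1800

queue
```

The lease only helps if the condor_schedd comes back before it expires; a longer outage would still cause the job to be killed.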

This issue is a good reason to have a centralized scheduler that isn't your workstation. You could run a scheduler node that is always on, and use condor_submit -r | -n | -s (-remote, -name, or -spool) to spool or submit your jobs to it remotely, depending on your setup.

I hope that helps.

Regards,
Rob

Rob Stevenson wrote:
Dear all,
First, many thanks for previous advice.

I have an issue where it appears that the submitter PC needs to be on
permanently during the whole duration of the run.

I'm running in Windows in the vanilla universe with jobs that, if
suspended or preempted, will restart from the beginning. Therefore I
completely disallow all preempts/suspends which is no problem.

However, I've noticed that when a submitter PC is powered off or crashes
and has to be restarted, any jobs that have been submitted from it but
not yet started (I) will not begin until the submitter is back online.
I've also noticed that sometimes, though not always, jobs that are
currently active (r) will stop and restart from the beginning.

I can imagine that starting jobs that haven't yet started may require
the submitter pc to be on, but I'm surprised that already running jobs
occasionally fail. Is this usual?

Best Regards,
Rob




--

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com