[HTCondor-users] Problems with power outage etc



Hello all

We have had a few incidents with power outages and the like. When this happens, our jobs are usually restarted, which is generally not what we want. Our jobs typically run for weeks, and we would rather have a job exit than restart, because all result files are usually overwritten when that happens. What is the best approach to avoid this?

This morning we also had a problem when a domain controller went down for a while and the starter could not reach the schedd, even though both were alive. At some point the lease expired and the job was restarted. We want to avoid this as well.

From my standpoint, it would be better if the jobs simply kept running even when the schedd is out of reach. Our cluster is small enough that if a couple of errant jobs keep running, we can fix that manually.

Regards,
Peter