We are having trouble with Condor restarting our jobs. This happens when there is some disturbance (for example, a backup job locking the disk) and the head node loses contact with the worker nodes. I have two questions:
1. How can I change how long the head node waits before it orders a restart of a job?
2. Is it possible to change what happens when a restart is issued? Instead of Condor sending a SIGKILL to the job, could I tell it to run a script that shuts the job down safely? We would prefer that Condor shut the job down quietly rather than restart it.
We use Condor to run CFD (several different commercial codes), so there is no issue with jobs getting out of hand and holding up a node for long stretches of time. Also, our cluster is quite small, so I have a good overview of the jobs that are running.
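
To make question 2 more concrete, here is a rough sketch of the kind of safe shutdown I have in mind. It is only an illustration and rests on two assumptions I have not confirmed: that Condor can be made to deliver a catchable signal such as SIGTERM to the job before any SIGKILL, and that the solver stops cleanly when a stop file appears in its working directory. The function name run_with_graceful_stop and the stop-file convention are hypothetical, not part of Condor or of any particular CFD code.

```python
import signal
import subprocess
from pathlib import Path


def run_with_graceful_stop(cmd, stop_file):
    """Run cmd as a child process; on SIGTERM, create stop_file and
    keep waiting so the child can finish on its own.

    Returns the child's exit code.
    """
    stop_path = Path(stop_file)

    def handle_sigterm(signum, frame):
        # Hypothetical convention: the solver polls for this file and,
        # once it exists, writes its results and exits by itself.
        stop_path.touch()

    # Install the handler before starting the child so an early signal
    # is not lost; restore the previous handler when we are done.
    previous = signal.signal(signal.SIGTERM, handle_sigterm)
    try:
        proc = subprocess.Popen(cmd)
        while True:
            try:
                # Wait in short slices so a pending signal handler
                # always gets a chance to run in the main thread.
                return proc.wait(timeout=1.0)
            except subprocess.TimeoutExpired:
                continue
    finally:
        signal.signal(signal.SIGTERM, previous)
```

The job submitted to Condor would then be this wrapper rather than the solver itself, called for example as run_with_graceful_stop(["solver", "case.inp"], "STOP"), where the solver name and its argument are placeholders. The wrapper never kills the solver; it only asks it to stop, so a SIGKILL sent afterwards would still end the job abruptly.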