Re: [HTCondor-users] Jobs restarting



On 11/18/2015 5:28 AM, Peter Ellevseth wrote:
Hello all

We have some trouble with condor restarting our jobs. This happens when
there is some disturbance (backup job locking the disc) and the head
loses touch with the working nodes. I have two questions

1. How can I change the time it takes before the head node orders a
restart of a job?

If the submit machine fails to hear from the execute machine for more than X seconds, where X is defined by JobLeaseDuration in the job's submit file, then the job will be killed and restarted (potentially someplace else).

By default, X is either 20 minutes or 40 minutes (depending on the HTCondor version).

You can explicitly set it in your job's submit file, e.g.

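  executable = foo.exe
  # job_lease_duration is the submit command that sets the JobLeaseDuration job attribute (value in seconds)
  job_lease_duration = 3600
  queue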

Or you can specify a default in the condor_config file that condor_submit will pick up and use, e.g. append the following to your condor_config:

  JobLeaseDuration = 3600
  SUBMIT_EXPRS = $(SUBMIT_EXPRS) JobLeaseDuration

Some details in the Manual are at http://is.gd/ShifW8


2. Is it possible to change what is done when a restart is issued? Could
I, instead of condor sending a SIGKILL to the job, tell it to run a
script that shuts the job down safely?

I think Ben gave suggestions for this question in an earlier post...

It would be preferable to have
condor shut the job quietly down instead of restarting it.


Do you mean you don't want the job to restart? I.e. you want to run the job once, and if there is a problem, have the job leave the queue instead of restarting? If so, see the HOWTO at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
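
As a rough sketch of the kind of thing that HOWTO describes (written from memory, so please check the page itself for the form that matches your HTCondor version), a run-once submit file could look something like the one below. NumJobStarts and JobStatus are standard job ClassAd attributes; foo.exe is just the placeholder executable from the example above.

  executable = foo.exe
  # Only match a machine if this job has never been started before.
  requirements = NumJobStarts == 0
  # If the job has already started once and is idle again (JobStatus 1 = Idle),
  # remove it from the queue instead of letting it run a second time.
  periodic_remove = NumJobStarts >= 1 && JobStatus == 1
  queue

Keep in mind that periodic expressions are only evaluated at intervals, so the removal may not happen the instant the job goes back to idle.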


Hope the above helps
Todd