
Re: [HTCondor-users] Jobs restarting



Great stuff, I will try this as well as Ben's suggestion.

Thank you,
Peter

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: 20 November 2015 00:18
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Jobs restarting

On 11/18/2015 5:28 AM, Peter Ellevseth wrote:
> Hello all
>
> We have some trouble with condor restarting our jobs. This happens 
> when there is some disturbance (backup job locking the disc) and the 
> head loses touch with the working nodes. I have two questions
>
> 1.How can I change the time it takes before the head node orders a 
> restart of a job.

If the submit machine fails to hear from the execute machine for more than X seconds, where X is defined by JobLeaseDuration in the job's submit file, then the job will be killed and restarted (potentially someplace else).

By default, X is either 20 minutes or 40 minutes (depending on the HTCondor version).

You can explicitly set it in your job's submit file, e.g.

   executable = foo.exe
   JobLeaseDuration = 3600
   queue

Or you can specify a default in the condor_config file that condor_submit will pick up and use, e.g. append the following to your condor_config:

   JobLeaseDuration = 3600
   SUBMIT_EXPRS = $(SUBMIT_EXPRS) JobLeaseDuration

Some details in the Manual are at http://is.gd/ShifW8

>
> 2.Is it possible to change what is done when a restart is issued. 
> Could I, instead of condor sending a SIGKILL to the job, tell it to 
> run a script that shuts the job down safely?

I think Ben gave suggestions for this question in an earlier post...

> It would be preferable to have
> condor shut the job quietly down instead of restarting it.
>

Do you mean you don't want the job to restart?  I.e. you want to run the job once, and if there is a problem, have the job leave the queue instead of restarting?  If so, see the HOWTO at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToAvoidJobRestarts
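
As a rough sketch of that run-once idea (not necessarily the exact recipe in the HOWTO), a submit file along the following lines should behave that way. It assumes the standard job attributes NumJobStarts and JobStatus (where 1 means idle); foo.exe is just the placeholder executable from the example above.

   executable = foo.exe
   # Only match a machine if the job has never been started before
   requirements = NumJobStarts == 0
   # If the job is idle again (JobStatus == 1) after having started once, remove it from the queue
   periodic_remove = JobStatus == 1 && NumJobStarts > 0
   queue

The requirements line keeps the job from being matched a second time, and periodic_remove then takes the job out of the queue instead of leaving it idle forever.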


Hope the above helps
Todd
