Re: [HTCondor-users] Dealing with lost submitters.



One option is to set JOB_DEFAULT_LEASE_DURATION in the configuration files on the submitting machines. The default is 2400 seconds (40 minutes). This controls how long the submit and execute sides will attempt to reconnect before the job execution is aborted. The downside to lowering this value is that you risk killing jobs when the interruption is only temporary, for example when upgrading HTCondor on, or rebooting, the submit machine.
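
As a rough sketch, assuming a 20-minute lease is acceptable at your site (the value 1200 is purely illustrative, not a recommendation), the local configuration on a submit machine could contain something like:

    # Illustrative value only: shorten the job lease from the
    # default of 2400 seconds (40 minutes) to 1200 seconds (20 minutes).
    JOB_DEFAULT_LEASE_DURATION = 1200

As far as I know, the lease is recorded in each job when it is submitted, so jobs already in the queue would keep their existing lease. A single job can also request its own value with the job_lease_duration command in its submit file.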

 - Jaime

On Jun 25, 2022, at 1:15 AM, Dudu Handelman <duduhandelman@xxxxxxxxxxx> wrote:

Hi all.
Sometimes the submitting machine runs out of resources, for example disk space. The condor service is then stopped, and the jobs on the execute side wait for it.

So, in this situation, resources are wasted just waiting.

Usually, I handle it manually by evicting this user's jobs.

How can I deal with it automatically?

Many thanks 
David