Re: [HTCondor-users] Dealing with lost submitters.
- Date: Thu, 30 Jun 2022 06:42:25 +0000
- From: Dudu Handelman <duduhandelman@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Dealing with lost submitters.
I just remembered that I have configured this before. I have a job transform that sets JobLeaseDuration = 1800.
I wonder why it's not working; maybe it's related to the docker universe.
I will try to verify that in the lab.
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Wednesday, June 29, 2022, 23:48
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dealing with lost submitters.
One option is to set JOB_DEFAULT_LEASE_DURATION in the configuration files on the submitting machines. The default is 2400 seconds (40 minutes). This controls how long the submitter and executor will attempt to reconnect before aborting a job execution. The downside to lowering this value is that you risk killing jobs in situations where an interruption is temporary, for example when upgrading HTCondor or rebooting the submit machine.
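As a minimal sketch of that setting, assuming it is placed in a local configuration file on each submit machine (the file location and the value 1200 are illustrative assumptions, not recommendations):

```
# Local HTCondor configuration on the submit machine (file location is an assumption).
# Default job lease duration in seconds; the stock default is 2400 (40 minutes).
# 1200 is an illustrative value only. Shorter leases free execute slots sooner,
# but raise the risk of killing jobs during a brief submit-side outage.
JOB_DEFAULT_LEASE_DURATION = 1200
```

A job can also set its own lease with the job_lease_duration submit command. The default most likely affects only jobs submitted after the change, since the lease is recorded in each job's ad as JobLeaseDuration.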
Sometimes the submitting machine runs out of resources, for example disk space. The condor service is then stopped, and the jobs on the execute side wait for it to come back.
So, in this situation, resources are wasted just waiting.
Usually, I handle it manually by evicting that user's jobs.
How can I deal with this automatically?