
Re: [HTCondor-users] condor_defrag does not start to defrag



The defrag daemon calculates the number of machines to drain per polling interval this way:

	m_draining_per_hour = param_double("DEFRAG_DRAINING_MACHINES_PER_HOUR",0,0);
	double rate = m_draining_per_hour/3600.0*m_polling_interval;
	m_draining_per_poll = (int)floor(rate + 0.00001);

With your configuration this works out to 5.0 / 3600 * 300 ≈ 0.417 per interval, which floor() turns into 0.

There is some logic to account for the truncation of the fractional rate once per hour and once per day,
but the easy fix for you would be to use either a longer polling interval or a larger number of draining machines per hour.
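
For instance (just a sketch using your posted numbers; you can of course pick different values), either of these combinations brings the per-poll count up to 1:

	# keep DEFRAG_INTERVAL = 300 and raise the hourly rate:
	#   12.0 / 3600 * 300 = 1.0  ->  floor(1.0 + 0.00001) = 1 machine per poll
	DEFRAG_DRAINING_MACHINES_PER_HOUR = 12.0

	# or keep 5.0 per hour and poll less often:
	#   5.0 / 3600 * 720 = 1.0  ->  floor(1.0 + 0.00001) = 1 machine per poll
	DEFRAG_INTERVAL = 720

In general you need DEFRAG_DRAINING_MACHINES_PER_HOUR * DEFRAG_INTERVAL / 3600 >= 1 before the per-poll number reaches 1.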

You should also probably adjust your DEFRAG_WHOLE_MACHINE_EXPR if you want to focus draining
on 64-core machines, although I don't think that is the source of your current issue.
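
If you do go that route, something along these lines might work. This is only a sketch mirroring your DEFRAG_REQUIREMENTS; the Cpus == TotalCpus clause is my shorthand for "all cores idle", so check the default DEFRAG_WHOLE_MACHINE_EXPR for your version before replacing it:

	# only count fully idle 64-core machines as "whole" (untested sketch)
	DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus && Offline =!= True && TotalCpus >= 64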

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Carsten Aulbert
Sent: Wednesday, December 2, 2020 7:20 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] condor_defrag does not start to defrag

Hi,

sorry for the completely obvious subject line. Right now, we have some 
large jobs waiting for free slots to become available, but condor_defrag 
does not free up nodes and thus the jobs are stalled in the idle state.

We have this configured on the central manager to enable defrag for our 
largest nodes:

$ condor_config_val -sum|grep DEFRAG
# from /etc/condor/config.d/20_DEFRAG
DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR DEFRAG
DEFRAG_INTERVAL = 300
DEFRAG_DRAINING_MACHINES_PER_HOUR = 5.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline=!=True && TotalCpus >= 64
DEFRAG_DRAINING_START_EXPR = (KillableJob =?= true)
DEFRAG_UPDATE_INTERVAL = 60

(intervals shortened for testing)

and the requirements are indeed fulfilled for some hosts:

$ condor_status -const "PartitionableSlot && Offline=!=True && TotalCpus >= 64" | grep -c slot1@
378

As described in the manual, here is the information from the ad with MyType == "Defrag":

$ condor_status -l -any -constraint 'MyType == "Defrag"'

AddressV1 = "{[ p=\"primary\"; a=\"10.20.40.190\"; port=9618; n=\"Internet\"; spid=\"4846_880a_4\"; noUDP=true; ], [ p=\"IPv4\"; a=\"10.20.40.190\"; port=9618; n=\"Internet\"; spid=\"4846_880a_4\"; noUDP=true; ]}"
AuthenticatedIdentity = "condor_pool@xxxxxxxxxxx"
AuthenticationMethod = "PASSWORD"
AvgDrainingBadput = 57.83840072266071
AvgDrainingUnclaimed = 0.02835789194728204
CondorPlatform = "$CondorPlatform: X86_64-Debian_10 $"
CondorVersion = "$CondorVersion: 8.8.9 May 06 2020 BuildID: Debian-8.8.9-1 PackageID: 8.8.9-1 Debian-8.8.9-1 $"
DaemonLastReconfigTime = 1606899484
DaemonStartTime = 1599749232
DrainedMachines = 29377
DrainFailures = 0
DrainSuccesses = 25
LastHeardFrom = 1606914846
Machine = "condorhub.atlas.local"
MachinesDraining = 0
MachinesDrainingPeak = 0
MeanDrainedArrival = 0.0008874576141337798
MeanDrainedArrivalSD = 0.03087811226198665
MyAddress = "<10.20.40.190:9618?addrs=10.20.40.190-9618&noUDP&sock=4846_880a_4>"
MyCurrentTime = 1606914846
MyType = "Defrag"
Name = "condorhub.atlas.local"
RecentDrainFailures = 0
RecentDrainSuccesses = 5
RecentStatsLifetime = 3000
StatsLifetime = 7165614
TargetType = ""
UpdateSequenceNumber = 24033
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 24032
UpdatesTotal = 24033
WholeMachines = 1864
WholeMachinesPeak = 2468

But still, at the end of each interval the log file shows this while jobs are
waiting for resources to appear:

12/02/20 13:13:54 Newly Arrived whole machines is
12/02/20 13:13:54 (no machines)
12/02/20 13:13:54 Newly departed draining machines is
12/02/20 13:13:54 (no machines)
12/02/20 13:13:54 Lifetime whole machines arrived: 29377
12/02/20 13:13:54 Lifetime mean arrival rate: 3.19485 machines / hour
12/02/20 13:13:54 Lifetime mean arrival rate sd: 111.161
12/02/20 13:13:54 Average pool draining badput = 5783.84%
12/02/20 13:13:54 Average pool draining unclaimed = 2.84%
12/02/20 13:13:54 Doing nothing, because number to drain in next 300s is calculated to be 0.

Anyone with an idea what we are missing?

Cheers

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany Phone: +49 511 762 17185