[HTCondor-users] condor_defrag does not start to defrag



Hi,

Sorry for the completely obvious subject line. We currently have some large jobs waiting for free slots, but condor_defrag does not drain any nodes, so the jobs remain stuck in the idle state.

We have this configured on the central manager to enable defrag for our largest nodes:

$ condor_config_val -sum|grep DEFRAG
# from /etc/condor/config.d/20_DEFRAG
DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR DEFRAG
DEFRAG_INTERVAL = 300
DEFRAG_DRAINING_MACHINES_PER_HOUR = 5.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline=!=True && TotalCpus >= 64
DEFRAG_DRAINING_START_EXPR = (KillableJob =?= true)
DEFRAG_UPDATE_INTERVAL = 60

(shortened intervals due to testing)
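
For reference, a back-of-the-envelope sketch of what these settings would allow per polling interval. It assumes that the hourly rate is simply scaled linearly to DEFRAG_INTERVAL and that only whole machines are drained; that is my assumption, not something taken from the condor_defrag source, and I have not checked whether fractional remainders are carried over to later intervals.

```python
# Rough estimate of the drain budget per polling interval.
# Assumption: the hourly rate is scaled linearly to the interval and
# rounded down to whole machines; the real condor_defrag logic may differ.
import math

draining_machines_per_hour = 5.0   # DEFRAG_DRAINING_MACHINES_PER_HOUR
interval_seconds = 300             # DEFRAG_INTERVAL

# Fraction of the hourly budget that falls into one interval.
budget_per_interval = draining_machines_per_hour * interval_seconds / 3600.0

# Only whole machines can be drained, so round down.
whole_machines_per_interval = math.floor(budget_per_interval)

print(f"budget per interval: {budget_per_interval:.3f} machines")
print(f"whole machines per interval: {whole_machines_per_interval}")
```

Under that assumption the budget per 300 s interval stays below one machine, so the shortened test interval may itself play a role here.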

and the requirements are indeed fulfilled for some hosts:

$ condor_status -const "PartitionableSlot && Offline=!=True && TotalCpus >= 64"|grep -c slot1@
378

As the manual suggests, I queried the ad with MyType == "Defrag":

$ condor_status -l -any -constraint 'MyType == "Defrag"'

AddressV1 = "{[ p=\"primary\"; a=\"10.20.40.190\"; port=9618; n=\"Internet\"; spid=\"4846_880a_4\"; noUDP=true; ], [ p=\"IPv4\"; a=\"10.20.40.190\"; port=9618; n=\"Internet\"; spid=\"4846_880a_4\"; noUDP=true; ]}"
AuthenticatedIdentity = "condor_pool@xxxxxxxxxxx"
AuthenticationMethod = "PASSWORD"
AvgDrainingBadput = 57.83840072266071
AvgDrainingUnclaimed = 0.02835789194728204
CondorPlatform = "$CondorPlatform: X86_64-Debian_10 $"
CondorVersion = "$CondorVersion: 8.8.9 May 06 2020 BuildID: Debian-8.8.9-1 PackageID: 8.8.9-1 Debian-8.8.9-1 $"
DaemonLastReconfigTime = 1606899484
DaemonStartTime = 1599749232
DrainedMachines = 29377
DrainFailures = 0
DrainSuccesses = 25
LastHeardFrom = 1606914846
Machine = "condorhub.atlas.local"
MachinesDraining = 0
MachinesDrainingPeak = 0
MeanDrainedArrival = 0.0008874576141337798
MeanDrainedArrivalSD = 0.03087811226198665
MyAddress = "<10.20.40.190:9618?addrs=10.20.40.190-9618&noUDP&sock=4846_880a_4>"
MyCurrentTime = 1606914846
MyType = "Defrag"
Name = "condorhub.atlas.local"
RecentDrainFailures = 0
RecentDrainSuccesses = 5
RecentStatsLifetime = 3000
StatsLifetime = 7165614
TargetType = ""
UpdateSequenceNumber = 24033
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 24032
UpdatesTotal = 24033
WholeMachines = 1864
WholeMachinesPeak = 2468

Still, at each interval the log file ends with the following, while jobs keep waiting for resources to appear:

12/02/20 13:13:54 Newly Arrived whole machines is
12/02/20 13:13:54 (no machines)
12/02/20 13:13:54 Newly departed draining machines is
12/02/20 13:13:54 (no machines)
12/02/20 13:13:54 Lifetime whole machines arrived: 29377
12/02/20 13:13:54 Lifetime mean arrival rate: 3.19485 machines / hour
12/02/20 13:13:54 Lifetime mean arrival rate sd: 111.161
12/02/20 13:13:54 Average pool draining badput = 5783.84%
12/02/20 13:13:54 Average pool draining unclaimed = 2.84%
12/02/20 13:13:54 Doing nothing, because number to drain in next 300s is calculated to be 0.
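
As a sanity check, the lifetime figures in the log appear to be the values from the Defrag ClassAd, rescaled from per second to per hour and from fraction to percent. A small sketch of that comparison follows. The conversion factors are inferred from the numbers themselves and not from the documentation, and it assumes the ad and the log excerpt were taken close together.

```python
# Values copied from the Defrag ClassAd shown earlier in this mail.
mean_drained_arrival = 0.0008874576141337798     # MeanDrainedArrival
mean_drained_arrival_sd = 0.03087811226198665    # MeanDrainedArrivalSD
avg_draining_badput = 57.83840072266071          # AvgDrainingBadput
avg_draining_unclaimed = 0.02835789194728204     # AvgDrainingUnclaimed

# Assumed conversions (inferred, not documented):
# per-second rates -> per-hour rates, fractions -> percent.
arrival_per_hour = mean_drained_arrival * 3600
arrival_sd_per_hour = mean_drained_arrival_sd * 3600
badput_percent = avg_draining_badput * 100
unclaimed_percent = avg_draining_unclaimed * 100

print(f"mean arrival rate:    {arrival_per_hour:.5f} machines / hour")
print(f"mean arrival rate sd: {arrival_sd_per_hour:.3f}")
print(f"draining badput:      {badput_percent:.2f}%")
print(f"draining unclaimed:   {unclaimed_percent:.2f}%")
```

The rescaled values agree with the log lines, and DrainedMachines = 29377 matches the "Lifetime whole machines arrived" count, so the daemon's ad and its log at least tell the same story.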

Does anyone have an idea what we are missing?

Cheers

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany Phone: +49 511 762 17185
