Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] condor_defrag does not start to defrag

Date: Wed, 02 Dec 2020 14:20:22 +0100
From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
Subject: [HTCondor-users] condor_defrag does not start to defrag

Hi,

Sorry for the completely obvious subject line. Right now, we have some large jobs waiting for free slots to become available, but condor_defrag does not free up nodes and thus the jobs are stalled in the idle state.

We have this configured on the central manager to enable defrag for our largest nodes:


$ condor_config_val -sum|grep DEFRAG
# from /etc/condor/config.d/20_DEFRAG
DAEMON_LIST = MASTER COLLECTOR NEGOTIATOR DEFRAG
DEFRAG_INTERVAL = 300
DEFRAG_DRAINING_MACHINES_PER_HOUR = 5.0
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline=!=True && TotalCpus >= 64
DEFRAG_DRAINING_START_EXPR = (KillableJob =?= true)
DEFRAG_UPDATE_INTERVAL = 60

(shortened intervals due to testing)
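A rough sanity check on the rate settings above, as a sketch only. It assumes the hourly drain budget is scaled linearly to a single DEFRAG_INTERVAL; that scaling is an assumption for illustration, and how the daemon actually does its bookkeeping (for example, whether fractional machines carry over between cycles) is not shown in this message. The helper name `naive_budget_per_cycle` is hypothetical.

```python
# Values copied from the DEFRAG configuration shown above.
DEFRAG_INTERVAL = 300                    # seconds between defrag cycles
DEFRAG_DRAINING_MACHINES_PER_HOUR = 5.0  # hourly drain budget


def naive_budget_per_cycle(per_hour: float, interval_s: float) -> float:
    """Scale an hourly budget to one cycle (assumed linear scaling,
    not necessarily what condor_defrag does internally)."""
    return per_hour * interval_s / 3600.0


budget = naive_budget_per_cycle(DEFRAG_DRAINING_MACHINES_PER_HOUR,
                                DEFRAG_INTERVAL)
cycles_per_machine = 1.0 / budget

print(f"naive budget per cycle: {budget:.4f} machines")
print(f"cycles needed to accumulate one machine: {cycles_per_machine:.1f}")
```

Under that assumption the per-cycle budget is a fraction of a machine, so a single cycle on its own rounds down to zero unless fractions are accumulated across cycles.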

and the requirements are indeed fulfilled for some hosts:

$ condor_status -const "PartitionableSlot && Offline=!=True && TotalCpus >= 64"|grep -c slot1@
378

As described in the manual, I looked at the ad with MyType == "Defrag":

$ condor_status -l -any -constraint 'MyType == "Defrag"'

AddressV1 = "{[ p=\"primary\"; a=\"10.20.40.190\"; port=9618; n=\"Internet\"; spid=\"4846_880a_4\"; noUDP=true; ], [ p=\"IPv4\"; a=\"10.20.40.190\"; port=9618; n=\"Internet\"; spid=\"4846_880a_4\"; noUDP=true; ]}"

AuthenticatedIdentity = "condor_pool@xxxxxxxxxxx"
AuthenticationMethod = "PASSWORD"
AvgDrainingBadput = 57.83840072266071
AvgDrainingUnclaimed = 0.02835789194728204
CondorPlatform = "$CondorPlatform: X86_64-Debian_10 $"

CondorVersion = "$CondorVersion: 8.8.9 May 06 2020 BuildID:Debian-8.8.9-1 PackageID: 8.8.9-1 Debian-8.8.9-1 $"

DaemonLastReconfigTime = 1606899484
DaemonStartTime = 1599749232
DrainedMachines = 29377
DrainFailures = 0
DrainSuccesses = 25
LastHeardFrom = 1606914846
Machine = "condorhub.atlas.local"
MachinesDraining = 0
MachinesDrainingPeak = 0
MeanDrainedArrival = 0.0008874576141337798
MeanDrainedArrivalSD = 0.03087811226198665

MyAddress = "<10.20.40.190:9618?addrs=10.20.40.190-9618&noUDP&sock=4846_880a_4>"

MyCurrentTime = 1606914846
MyType = "Defrag"
Name = "condorhub.atlas.local"
RecentDrainFailures = 0
RecentDrainSuccesses = 5
RecentStatsLifetime = 3000
StatsLifetime = 7165614
TargetType = ""
UpdateSequenceNumber = 24033
UpdatesHistory = "00000000000000000000000000000000"
UpdatesLost = 0
UpdatesSequenced = 24032
UpdatesTotal = 24033
WholeMachines = 1864
WholeMachinesPeak = 2468

But still, each interval the log file ends with this while jobs are waiting for resources to appear:


12/02/20 13:13:54 Newly Arrived whole machines is
12/02/20 13:13:54 (no machines)
12/02/20 13:13:54 Newly departed draining machines is
12/02/20 13:13:54 (no machines)
12/02/20 13:13:54 Lifetime whole machines arrived: 29377
12/02/20 13:13:54 Lifetime mean arrival rate: 3.19485 machines / hour
12/02/20 13:13:54 Lifetime mean arrival rate sd: 111.161
12/02/20 13:13:54 Average pool draining badput = 5783.84%
12/02/20 13:13:54 Average pool draining unclaimed = 2.84%

12/02/20 13:13:54 Doing nothing, because number to drain in next 300s is calculated to be 0.
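For cross-checking the log against the Defrag ad, here is a small sketch. Judging purely from the numbers in this message, the logged rates appear to be the ad's MeanDrainedArrival values multiplied by 3600 (suggesting they are stored per second), and the logged percentages appear to be the ad's fractions multiplied by 100. That mapping is inferred from the values here, not taken from the documentation.

```python
# Attribute values copied from the Defrag ad shown earlier in this message.
ad = {
    "MeanDrainedArrival": 0.0008874576141337798,
    "MeanDrainedArrivalSD": 0.03087811226198665,
    "AvgDrainingBadput": 57.83840072266071,
    "AvgDrainingUnclaimed": 0.02835789194728204,
}

# Inferred conversions (assumption: per-second rates, fractional ratios).
arrival_per_hour = ad["MeanDrainedArrival"] * 3600
arrival_sd_per_hour = ad["MeanDrainedArrivalSD"] * 3600
badput_percent = ad["AvgDrainingBadput"] * 100
unclaimed_percent = ad["AvgDrainingUnclaimed"] * 100

print(f"Lifetime mean arrival rate: {arrival_per_hour:.5f} machines / hour")
print(f"Lifetime mean arrival rate sd: {arrival_sd_per_hour:.3f}")
print(f"Average pool draining badput = {badput_percent:.2f}%")
print(f"Average pool draining unclaimed = {unclaimed_percent:.2f}%")
```

The printed values line up with the log lines above, so the ad and the log seem to report the same statistics in different units.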


Anyone with an idea what we are missing?

Cheers

Carsten

--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany Phone: +49 511 762 17185


Follow-Ups:
- Re: [HTCondor-users] condor_defrag does not start to defrag
  - From: John M Knoeller
