
Re: [HTCondor-users] condor_defrag does not start to defrag



Hi tj,

On 02.12.20 16:49, John M Knoeller wrote:
The defrag daemon calculates the number of machines to drain per polling interval this way.

	m_draining_per_hour = param_double("DEFRAG_DRAINING_MACHINES_PER_HOUR",0,0);
	double rate = m_draining_per_hour/3600.0*m_polling_interval;
	m_draining_per_poll = (int)floor(rate + 0.00001);

This works out to 0.416 per interval with your configuration, which floor turns into 0.

There is some logic to account for truncation of the fractional rate once per hour and once per day,
but the easy fix for you would be just to use a longer polling interval or a larger number of draining machines per hour.
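
As a rough check of the arithmetic in the quoted calculation, here is a minimal C++ sketch of the per-poll rounding. The function name draining_per_poll is hypothetical, and the 300-second polling interval used in the note below is an assumption inferred from the 0.416 figure (5 / 3600 * 300). It is not the actual condor_defrag code and it does not model the hourly and daily remainder logic.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper mirroring the quoted calculation; this is a sketch,
// not the condor_defrag source.
int draining_per_poll(double draining_per_hour, int polling_interval_s)
{
    // A non-positive interval would make the rate meaningless.
    assert(polling_interval_s > 0);

    // Fractional number of machines to drain in one polling interval.
    double rate = draining_per_hour / 3600.0 * polling_interval_s;

    // The small epsilon keeps values such as 0.9999999 from flooring to 0.
    return static_cast<int>(std::floor(rate + 0.00001));
}
```

Under the assumed 300-second interval, draining_per_poll(5, 300) gives 0 (rate about 0.417), while draining_per_poll(25, 300) gives 2 (rate about 2.083). Any per-hour value below 12 would floor to 0 at that interval.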

That makes sense, and indeed raising DEFRAG_DRAINING_MACHINES_PER_HOUR from 5 to 25 almost immediately yields:

12/02/20 16:04:25 Lifetime whole machines arrived: 29443
12/02/20 16:04:25 Lifetime mean arrival rate: 3.19682 machines / hour
12/02/20 16:04:25 Lifetime mean arrival rate sd: 111.038
12/02/20 16:04:25 Average pool draining badput = 10198.21%
12/02/20 16:04:25 Average pool draining unclaimed = 4.44%
12/02/20 16:04:25 Looking for 3 machines to drain.
12/02/20 16:04:25 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxx
12/02/20 16:04:25 Expected draining completion time is 339s; expected draining badput is 130048 cpu-seconds
12/02/20 16:04:25 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxx
12/02/20 16:04:25 Expected draining completion time is 558s; expected draining badput is 151680 cpu-seconds
12/02/20 16:04:25 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxx
12/02/20 16:04:25 Expected draining completion time is 439s; expected draining badput is 152656 cpu-seconds
12/02/20 16:04:25 Drained maximum number of machines allowed in this cycle (3).
12/02/20 16:04:25 Drained 3 machines (wanted to drain 3 machines).



You should also probably adjust your DEFRAG_WHOLE_MACHINE_EXPR if you want to focus draining
on 64 core machines, although I don't think that is the source of your current issue.


But it may still be worth looking into. Should this then be a "negative" expression like

DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus && Offline =!= True && TotalCpus < 64


Cheers

Carsten
