
Re: [HTCondor-users] Drained 0 machines





On 17/07/20 21:27, Beyer, Christoph wrote:
Hi,

look for DEFRAG_REQUIREMENTS & DEFRAG_WHOLE_MACHINE_EXPR
I did.

The DEFRAG_REQUIREMENTS expression matches ~400 nodes.

DEFRAG_WHOLE_MACHINE_EXPR matches 4 nodes (2 of which are not eligible for draining).

I increased the DefragLog verbosity and now I see a reason:

[...]
07/17/20 21:22:14 Skipping slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: because it is already running as a whole machine.
07/17/20 21:22:14 Skipping slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: because it is already running as a whole machine.
[...]

I think this is because I had
DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || (Cpus >= 8)) && (StartJobs =?= True)

and all the skipped machines already have an 8-core job running, so they already count as whole machines under this expression.
I changed DEFRAG_WHOLE_MACHINE_EXPR to
((Cpus == TotalCpus) || (Cpus >= 16)) && (StartJobs =?= True)

and now I see more machines being drained.
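
To make the effect of the threshold change concrete, here is a minimal Python sketch that mimics how the two expressions evaluate for a single partitionable slot. The node shape (16 total cores, 8 still unclaimed) is an assumption for illustration only and is not taken from the logs above; the function name and its parameters are hypothetical, and this is not how condor_defrag itself is implemented.

```python
def is_whole_machine(cpus, total_cpus, start_jobs, threshold):
    """Mimic ((Cpus == TotalCpus) || (Cpus >= threshold)) && (StartJobs =?= True).

    `start_jobs is True` stands in for the ClassAd `=?= True` test, which is
    only true when the attribute is defined and exactly True.
    """
    return (cpus == total_cpus or cpus >= threshold) and start_jobs is True


# Assumed example node: 16 cores in total, 8 still free in the partitionable slot.
node = {"cpus": 8, "total_cpus": 16, "start_jobs": True}

old = is_whole_machine(**node, threshold=8)    # original expression (Cpus >= 8)
new = is_whole_machine(**node, threshold=16)   # changed expression (Cpus >= 16)

print("whole machine with threshold 8: ", old)
print("whole machine with threshold 16:", new)
```

Under these assumptions the node counts as a whole machine with the old threshold, so the defrag daemon skips it, and no longer counts as one with the new threshold, so it becomes a draining candidate.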

Thanks,
Stefano

These knobs define which machines can be drained and what is considered a drained machine.

For example:

# machine should be partitionable and online
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline=!=True
# drain down to a blob of 12 online cores
DEFRAG_WHOLE_MACHINE_EXPR = Cpus == 12 && Offline=!=True

Best
christoph