
Re: [HTCondor-users] condor_defrag only some machines?


The ExpectedMachineGracefulDrainingBadput is an estimate of how much work will be lost when the jobs running on that machine are evicted. When a machine is drained, the jobs are instructed to gracefully evict, which means they are sent a TERM signal (by default) and allowed up to the MaxJobRetirementTime (default of zero) to shut down before being kill -9'd.

A machine with 10 jobs that have each accumulated 30 minutes of runtime will have a minimum of 300 minutes of badput if evicted, while a machine with 1 job with 60 minutes of runtime will have 60 minutes of badput. Since DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput ranks machines with less badput higher, the second machine will be chosen for draining ahead of the first.
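
To make that arithmetic concrete, here is a rough Python sketch of the comparison. It is only a simplified model: it assumes badput is just the sum of the accumulated job runtimes in minutes, and the machine names and helper functions are made up for illustration. The real attribute is calculated by HTCondor itself and may weigh other factors.

```python
# Simplified model (an assumption, not HTCondor's actual calculation):
# badput for a machine = sum of the runtimes its jobs have accumulated.
def expected_badput_minutes(job_runtimes_minutes):
    return sum(job_runtimes_minutes)


# Mirrors DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput:
# less badput gives a higher (less negative) rank.
def defrag_rank(job_runtimes_minutes):
    return -expected_badput_minutes(job_runtimes_minutes)


# Hypothetical machines matching the two cases described above.
machines = {
    "machine_a": [30] * 10,  # 10 jobs with 30 minutes each
    "machine_b": [60],       # 1 job with 60 minutes
}

# Highest rank first, i.e. the order in which machines would be preferred for draining.
drain_order = sorted(machines, key=lambda name: defrag_rank(machines[name]), reverse=True)

for name in drain_order:
    runtimes = machines[name]
    print(name, expected_badput_minutes(runtimes), defrag_rank(runtimes))
```

In this model machine_b (60 minutes of badput, rank -60) sorts ahead of machine_a (300 minutes, rank -300), which matches the ordering described above.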

Have you taken a look at pslot preemption? I wonder if that might be more useful for your situation than defragmenting. It seems like that might give you more control over when a whole-machine job can evict the single-core jobs, and avoid any draining at all if there are no whole-machine jobs waiting to run.

Also, make sure that you're doing a depth-first fill of the machines for the single-core jobs, which may give the whole-machine jobs a better fighting chance; and make sure your job_lease_duration is set to something reasonable - the default is 40 minutes, but I usually use 20 (it depends on the characteristics of your jobs).

	-Michael Pelletier.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Di Domenico
Sent: Thursday, January 12, 2017 1:22 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_defrag only some machines?

having let the pool run for a while longer, it does appear to have pulled in some of the nodes that originally weren't.

so i guess what this really boils down to is that I don't understand what

DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput

really means as it relates to the current state of my pool

I can see ExpectedMachineGracefulDrainingBadput is a ClassAd attribute attached to each of the machines in my pool, which represents a calculated number, but i don't fully understand it

i see the explanation in the manual, but it's still not clear.  does anyone have a pointer to something that might make it more clear how this is actually choosing machines to set to draining state?