Re: [Condor-users] Partitionable Slot Starvation




On 8/15/12 5:44 PM, Todd Tannenbaum wrote:
On 8/15/2012 4:49 PM, William Strecker-Kellogg wrote:

Each execute machine is 24-core and configured thus:

SLOT_TYPE_1 = cpus=16, ram=2/3, swap=2/3, disk=2/3
SLOT_TYPE_2 = cpus=auto, ram=auto, swap=auto, disk=auto

SLOT_TYPE_1_PARTITIONABLE = True

NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS_TYPE_2 = 8

so we have a combination of partitionable and regular slots.

[snip]
Enter the defrag daemon. As I understand it, it was designed to prevent
this kind of starvation.  I configured it as follows:

DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_INTERVAL = 90
DEFRAG_DRAINING_MACHINES_PER_HOUR = 12.0
DEFRAG_MAX_WHOLE_MACHINES = 4
DEFRAG_MAX_CONCURRENT_DRAINING = 4

Does anyone who has had experience setting up & running this have any
pointers or feedback?


Hi Will -

Warning: this feedback comes with a grand total of 10 seconds of thought, sent out right before I walk out the door. Having said that, my initial thought is that because you are mixing static and partitionable slots on each machine (a perfectly reasonable thing to do, btw), the defrag daemon's default settings for DEFRAG_REQUIREMENTS and/or DEFRAG_WHOLE_MACHINE_EXPR may not be appropriate and should be tweaked in your config. That is, the default values for these two knobs may assume that all the slots on a startd are partitionable.

regards,
Todd

If the problem were caused by DEFRAG_REQUIREMENTS and/or DEFRAG_WHOLE_MACHINE_EXPR, the defrag log would say so with a message like one of the following:

"Drained 0 machines (wanted to drain X machines)."

"Doing nothing, because DEFRAG_MAX_WHOLE_MACHINES=X and there are Y whole machines."
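
A quick way to check for those two messages is to scan the defrag log for them. The sketch below is only an illustration: the function name and result labels are made up, and the regular expressions are built from the message texts quoted above, with X and Y assumed to be plain integers. Real log lines will carry a timestamp prefix, which the patterns skip by using search rather than match.

```python
import re

# Patterns built from the two messages quoted above; X and Y assumed integers.
DRAINED_NONE = re.compile(
    r"Drained 0 machines \(wanted to drain (\d+) machines\)\."
)
DOING_NOTHING = re.compile(
    r"Doing nothing, because DEFRAG_MAX_WHOLE_MACHINES=(\d+) "
    r"and there are (\d+) whole machines\."
)

def find_defrag_symptoms(lines):
    """Return a (kind, numbers) tuple for each matching log line, in order."""
    hits = []
    for line in lines:
        m = DRAINED_NONE.search(line)
        if m:
            # numbers = (machines the daemon wanted to drain,)
            hits.append(("drained_none", (int(m.group(1)),)))
            continue
        m = DOING_NOTHING.search(line)
        if m:
            # numbers = (configured maximum, current whole machines)
            hits.append(("doing_nothing", (int(m.group(1)), int(m.group(2)))))
    return hits
```

Passing an open file object for the defrag log as `lines` should work, since the function only iterates over lines.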


As a sanity check, what numbers do you see in the following line in the log when defrag starts up or is reconfigured?

"polling interval %ds, DEFRAG_DRAINING_MACHINES_PER_HOUR = %f/hour = %d/interval + %d/hour + %d/day"
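
For reference, with the settings quoted above (DEFRAG_INTERVAL = 90 and DEFRAG_DRAINING_MACHINES_PER_HOUR = 12.0), the average drain rate works out to well under one machine per polling interval. The arithmetic below is just a sketch of that average. It does not try to reproduce how the daemon splits the rate into the per-interval, per-hour and per-day integers shown in the log line, and the function name is made up for illustration.

```python
from fractions import Fraction

def machines_per_interval(per_hour, interval_seconds):
    """Average number of machines to drain per polling interval."""
    return Fraction(per_hour) * Fraction(interval_seconds, 3600)

rate = machines_per_interval(12, 90)   # 12 machines/hour, polled every 90 s
seconds_between_drains = 90 / rate     # average spacing between new drains
print(float(rate), float(seconds_between_drains))  # prints: 0.3 300.0
```

So on average one new drain would start about every 300 seconds, subject to the DEFRAG_MAX_CONCURRENT_DRAINING and DEFRAG_MAX_WHOLE_MACHINES limits.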

And what numbers do you see in the most recent log line of the following form:

"There are currently %d draining and %d whole machines."
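
To pull the most recent numbers out of the log and compare them with the configured limits, something like the following sketch could be used. The function names are hypothetical, the pattern is derived from the format string quoted above, and the defaults of 4 are the DEFRAG_MAX_CONCURRENT_DRAINING and DEFRAG_MAX_WHOLE_MACHINES values from the original post. Treating a limit as reached once the count is greater than or equal to it is an assumption about the daemon's behavior, not something confirmed in this thread.

```python
import re

# Built from the format string quoted above: two %d fields.
STATUS = re.compile(
    r"There are currently (\d+) draining and (\d+) whole machines\."
)

def latest_status(lines):
    """Return (draining, whole) from the last matching line, or None."""
    result = None
    for line in lines:
        m = STATUS.search(line)
        if m:
            result = (int(m.group(1)), int(m.group(2)))
    return result

def limits_reached(draining, whole, max_draining=4, max_whole=4):
    """Report which configured limits the current counts have hit.

    Assumes a limit counts as reached when count >= limit.
    """
    return {
        "concurrent_draining": draining >= max_draining,
        "whole_machines": whole >= max_whole,
    }
```

If either entry comes back True, the daemon declining to start new drains would be expected rather than a symptom of a misconfigured expression.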

One word of warning: defrag drains the whole startd, partitionable slots and static slots alike. If you only want it to drain some slots and not others, you need to run multiple startds and set DEFRAG_REQUIREMENTS to only match the slots of the startd to be drained and not the slots of the other startd.

--Dan