[Condor-users] Partitionable Slot Starvation

Hi all,

Is anyone out there using partitionable slots who has experience with
the new condor_defrag daemon?  I have set up a small test cluster with
condor-7.8.1 as follows:

1 central manager (CM) running the collector, negotiator, and defrag daemons
5 execute machines, each running a startd and a schedd

Each execute machine is 24-core and configured thus:

SLOT_TYPE_1 = cpus=16, ram=2/3, swap=2/3, disk=2/3
SLOT_TYPE_2 = cpus=auto, ram=auto, swap=auto, disk=auto

SLOT_TYPE_1_PARTITIONABLE = True

NUM_SLOTS_TYPE_1 = 1
NUM_SLOTS_TYPE_2 = 8

so we have a combination of partitionable and regular slots.
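
For reference, I have been sanity-checking the resulting layout with
condor_status; the attribute names below are just the standard machine
ClassAd ones, as far as I understand them:

  condor_status -format "%s " Name -format "%s " SlotType -format "%d\n" Cpus

and, if I am reading the output right, each machine shows one
Partitionable 16-core slot plus eight single-core static slots.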

If I submit several thousand regular single-core test jobs (sleep 600),
they fill up the nodes and partition each 16-core slot into 16 dynamic
slots, as expected.  However, if I then submit, as the same user, 20
test jobs with request_cpus = 8, they never start running until the
large backlog of single-core jobs has completely finished
(CLAIM_WORKLIFE is 1 hour).
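
In case it matters, the 8-core test jobs are submitted with a minimal
submit file along these lines (the sleep executable is just a
placeholder for my test binary):

  universe     = vanilla
  executable   = /bin/sleep
  arguments    = 600
  request_cpus = 8
  queue 20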

Enter the defrag daemon. As I understand it, it was designed to prevent
this kind of starvation.  I configured it as follows:

DAEMON_LIST = $(DAEMON_LIST) DEFRAG
DEFRAG_INTERVAL = 90
DEFRAG_DRAINING_MACHINES_PER_HOUR = 12.0
DEFRAG_MAX_WHOLE_MACHINES = 4
DEFRAG_MAX_CONCURRENT_DRAINING = 4

on the central manager, expecting it to start draining one or two
machines within a few minutes.  Instead, after three hours all I see in
the log is:

08/15/12 16:58:52 Doing nothing, because number to drain in next 90s is
calculated to be 0.
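
For what it's worth, 12 machines/hour over a 90-second interval works
out to 12 * 90 / 3600 = 0.3 machines per interval, so my guess is that
this is being rounded down to 0 every time, though I would have
expected the fraction to accumulate across intervals.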

I'm wondering whether (1) I have misunderstood the purpose of the
defrag daemon, (2) misconfigured it, or (3) there is something wrong
with its behavior.

Does anyone with experience setting up and running this have any
pointers or feedback?

Thanks a lot,
-Will