
[HTCondor-users] DEFRAG Large to Small slots



Hi All,

I am testing an HTCondor configuration with partitionable slots and a defrag daemon. The slots on my worker nodes are configured like this:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE

Each worker node has 2 CPUs, 4 GB of RAM, and 40 GB of disk space. To these workers I submitted jobs with the following requirements:

request_cpus = 2
request_memory = 3000
request_disk = 30000

This caused my test queue to be partitioned like this:

slot1@server-1e433 LINUX      X86_64 Unclaimed Idle 0.000 2778  0+02:49:38
slot1_1@server-1e4 LINUX      X86_64 Claimed   Busy 0.000 3072  0+00:00:38
slot1@server-38fcb LINUX      X86_64 Unclaimed Idle 0.000 2778  0+02:49:44
slot1_1@server-38f LINUX      X86_64 Claimed   Busy 0.000 3072  0+00:01:47
slot1@server-4fcf0 LINUX      X86_64 Unclaimed Idle 0.000 2778  0+02:49:43
slot1_1@server-4fc LINUX      X86_64 Claimed   Busy 0.000 3072  0+00:00:47
slot1@server-51c6b LINUX      X86_64 Unclaimed Idle 0.010 2778  0+02:49:36
slot1_1@server-51c LINUX      X86_64 Claimed   Busy 0.000 3072  0+00:01:15
slot1@server-5f5ae LINUX      X86_64 Unclaimed Idle 0.000 2778  0+02:54:48
slot1_1@server-5f5 LINUX      X86_64 Claimed   Busy 0.000 3072  0+00:06:26
...

Now I submitted jobs with the following requirements:

request_cpus = 1
request_memory = 512
request_disk = 15000
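
As a rough check of how many such jobs one worker could hold, here is a minimal Python sketch. The helper jobs_per_node and the dictionary keys are hypothetical names for illustration only. It assumes request_memory is in MiB and request_disk is in KiB (HTCondor's defaults when no unit is given), and it takes the node size from the figures quoted earlier (2 CPUs, 4 GB RAM, 40 GB disk). It ignores whatever the startd reserves for itself, so treat the numbers as an upper bound.

```python
# Hypothetical helper: how many copies of one request fit on a single node,
# limited by whichever resource runs out first.
def jobs_per_node(node, request):
    return min(node[key] // request[key] for key in request)

# Node size from the post; units are an assumption (MiB for memory, KiB for disk).
node = {"cpus": 2, "memory_mb": 4096, "disk_kb": 40 * 1024 * 1024}

big = {"cpus": 2, "memory_mb": 3000, "disk_kb": 30000}    # first batch of jobs
small = {"cpus": 1, "memory_mb": 512, "disk_kb": 15000}   # second batch of jobs

print(jobs_per_node(node, big))    # 1
print(jobs_per_node(node, small))  # 2
```

Under these assumptions the CPU count is the limiting resource in both cases: one large job or two small jobs per machine.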

Clearly these may be served by the existing slots, but it would be more efficient if the existing slots were removed and I instead ended up with two slots per machine. This does not seem to happen. Here is the log of my defrag daemon:

04/02/14 11:24:00 ******************************************************
04/02/14 11:24:00 ** condor_defrag (CONDOR_DEFRAG) STARTING UP
04/02/14 11:24:00 ** /usr/libexec/condor/condor_defrag
04/02/14 11:24:00 ** SubsystemInfo: name=DEFRAG type=DAEMON(12) class=DAEMON(1)
04/02/14 11:24:00 ** Configuration: subsystem:DEFRAG local:<NONE> class:DAEMON
04/02/14 11:24:00 ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
04/02/14 11:24:00 ** $CondorPlatform: x86_64_RedHat6 $
04/02/14 11:24:00 ** PID = 17334
04/02/14 11:24:00 ** Log last touched 4/2 11:21:43
04/02/14 11:24:00 ******************************************************
04/02/14 11:24:00 Using config source: /etc/condor/condor_config
04/02/14 11:24:00 Using local config sources: 
04/02/14 11:24:00    /etc/condor/config.d/defrag
04/02/14 11:24:00    /etc/condor/config.d/partition
04/02/14 11:24:00    /etc/condor/config.d/ports
04/02/14 11:24:00    /etc/condor/config.d/scaling
04/02/14 11:24:00    /etc/condor/config.d/soap
04/02/14 11:24:00    /etc/condor/condor_config.local
04/02/14 11:24:00 Daemon Log is logging: D_ALWAYS D_ERROR
04/02/14 11:24:00 DaemonCore: command socket at <myip:40438>
04/02/14 11:24:00 DaemonCore: private command socket at <myip:40438>
04/02/14 11:24:00 State file /var/lock/condor/defrag_state does not yet exist.
04/02/14 11:24:00 Will evaluate defragmentation policy every DEFRAG_INTERVAL=300 seconds.
04/02/14 11:24:00 polling interval 300s, DEFRAG_DRAINING_MACHINES_PER_HOUR = 10.000000/hour = 0/interval + 10/hour + 0/day
04/02/14 11:24:00 There are currently 0 draining and 16 whole machines.
04/02/14 11:24:00 Average pool draining badput = 0.00%
04/02/14 11:24:00 Average pool draining unclaimed = 0.00%
04/02/14 11:24:00 Looking for 7 machines to drain.
04/02/14 11:24:00 Drained 0 machines (wanted to drain 7 machines).
04/02/14 11:29:00 There are currently 0 draining and 16 whole machines.
04/02/14 11:29:00 Average pool draining badput = 0.00%
04/02/14 11:29:00 Average pool draining unclaimed = 0.00%
04/02/14 11:29:00 Doing nothing, because number to drain in next 300s is calculated to be 0.
04/02/14 11:34:01 There are currently 0 draining and 16 whole machines.
04/02/14 11:34:01 Average pool draining badput = 0.00%
04/02/14 11:34:01 Average pool draining unclaimed = 0.00%
04/02/14 11:34:01 Doing nothing, because number to drain in next 300s is calculated to be 0.
04/02/14 11:39:01 There are currently 0 draining and 16 whole machines.
04/02/14 11:39:01 Average pool draining badput = 0.00%
04/02/14 11:39:01 Average pool draining unclaimed = 0.00%
04/02/14 11:39:01 Doing nothing, because number to drain in next 300s is calculated to be 0.
04/02/14 11:44:01 There are currently 0 draining and 16 whole machines.
04/02/14 11:44:01 Average pool draining badput = 0.00%
04/02/14 11:44:01 Average pool draining unclaimed = 0.00%
04/02/14 11:44:01 Doing nothing, because number to drain in next 300s is calculated to be 0.
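
The rate line in the log above ("10.000000/hour = 0/interval + 10/hour + 0/day") can be reproduced with simple arithmetic. The Python sketch below is my own reconstruction, not the daemon's actual code, and the function name split_rate is hypothetical. It assumes the hourly rate is first converted to whole machines per polling interval, and the remainder is then expressed as whole machines per hour and per day.

```python
import math

# Hypothetical reconstruction of the split shown in the log line;
# this is an assumption about the arithmetic, not condor_defrag's source.
def split_rate(machines_per_hour, interval_seconds):
    intervals_per_hour = 3600 / interval_seconds
    # Whole machines that can be drained in every polling interval.
    per_interval = math.floor(machines_per_hour / intervals_per_hour)
    remaining = machines_per_hour - per_interval * intervals_per_hour
    # Whole machines per hour left over after the per-interval share.
    per_hour = math.floor(remaining)
    remaining -= per_hour
    # Any fractional remainder expressed as whole machines per day.
    per_day = math.floor(remaining * 24)
    return per_interval, per_hour, per_day

print(split_rate(10.0, 300))  # (0, 10, 0)
```

With DEFRAG_INTERVAL = 300 and DEFRAG_DRAINING_MACHINES_PER_HOUR = 10, this gives less than one machine per interval, which is consistent with the "0/interval + 10/hour + 0/day" in the log.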


The defrag daemon is configured as follows:

DAEMON_LIST = $(DAEMON_LIST), DEFRAG
DEFRAG_INTERVAL = 300
DEFRAG_DRAINING_MACHINES_PER_HOUR = 10
DEFRAG_MAX_WHOLE_MACHINES = 20
DEFRAG_MAX_CONCURRENT_DRAINING = 10

Am I doing something wrong here? Do the slots need to be Idle for the defrag to happen?

Best Regards,
-Frank



----------
Frank Berghaus
University of Victoria
Research Associate
Physics & Astronomy
UVic Phone: +1 (250) 472-4085
UVic Office: Elliot 201