[HTCondor-users] DEFRAG Large to Small slots


Hi All,

I am testing an HTCondor configuration with partitionable slots and a defrag daemon. The slots on my worker nodes are configured like this:

NUM_SLOTS = 1

NUM_SLOTS_TYPE_1 = 1

SLOT_TYPE_1 = cpus=2

SLOT_TYPE_1_PARTITIONABLE = TRUE

Each worker node has 2 CPUs, 4 GB of RAM, and 40 GB of disk space. To these workers I submitted jobs with the following requirements:

request_cpus = 2

request_memory = 3000

request_disk = 30000

This caused my test queue to be partitioned like this:

slot1@server-1e433 LINUX X86_64 Unclaimed Idle 0.000 2778 0+02:49:38

slot1_1@server-1e4 LINUX X86_64 Claimed Busy 0.000 3072 0+00:00:38

slot1@server-38fcb LINUX X86_64 Unclaimed Idle 0.000 2778 0+02:49:44

slot1_1@server-38f LINUX X86_64 Claimed Busy 0.000 3072 0+00:01:47

slot1@server-4fcf0 LINUX X86_64 Unclaimed Idle 0.000 2778 0+02:49:43

slot1_1@server-4fc LINUX X86_64 Claimed Busy 0.000 3072 0+00:00:47

slot1@server-51c6b LINUX X86_64 Unclaimed Idle 0.010 2778 0+02:49:36

slot1_1@server-51c LINUX X86_64 Claimed Busy 0.000 3072 0+00:01:15

slot1@server-5f5ae LINUX X86_64 Unclaimed Idle 0.000 2778 0+02:54:48

slot1_1@server-5f5 LINUX X86_64 Claimed Busy 0.000 3072 0+00:06:26

...

Now I submitted jobs with the following requirements:

request_cpus = 1

request_memory = 512

request_disk = 15000
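
As a rough check of how many jobs of each size fit on one worker at a time, here is a minimal Python sketch. It assumes HTCondor's default units (request_memory in MiB, request_disk in KiB) and treats the node as exactly 4096 MiB of RAM and 40 GiB of disk; the function and dictionary names are made up for illustration, and it ignores any rounding the startd applies when it carves out dynamic slots.

```python
def jobs_per_node(node, job):
    """Return how many copies of `job` fit on `node` at the same time.

    The most constrained resource decides the answer.
    """
    return min(node[key] // job[key] for key in job)


# Worker node from the post; the unit conversions are assumptions.
node = {"cpus": 2, "memory_mib": 4 * 1024, "disk_kib": 40 * 1024 * 1024}

# The two job shapes from the post.
large_job = {"cpus": 2, "memory_mib": 3000, "disk_kib": 30000}
small_job = {"cpus": 1, "memory_mib": 512, "disk_kib": 15000}

print(jobs_per_node(node, large_job))  # 1
print(jobs_per_node(node, small_job))  # 2
```

Under these assumptions CPUs are the limiting resource for the small jobs, which is where the figure of two slots per machine comes from.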

Clearly these may be served by the existing slots, but it would be more efficient if the existing slots were removed, so that I would end up with two slots per machine instead. This does not seem to happen. Here is the log of my defrag daemon:

04/02/14 11:24:00 ******************************************************

04/02/14 11:24:00 ** condor_defrag (CONDOR_DEFRAG) STARTING UP

04/02/14 11:24:00 ** /usr/libexec/condor/condor_defrag

04/02/14 11:24:00 ** SubsystemInfo: name=DEFRAG type=DAEMON(12) class=DAEMON(1)

04/02/14 11:24:00 ** Configuration: subsystem:DEFRAG local:<NONE> class:DAEMON

04/02/14 11:24:00 ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $

04/02/14 11:24:00 ** $CondorPlatform: x86_64_RedHat6 $

04/02/14 11:24:00 ** PID = 17334

04/02/14 11:24:00 ** Log last touched 4/2 11:21:43

04/02/14 11:24:00 ******************************************************

04/02/14 11:24:00 Using config source: /etc/condor/condor_config

04/02/14 11:24:00 Using local config sources:

04/02/14 11:24:00 /etc/condor/config.d/defrag

04/02/14 11:24:00 /etc/condor/config.d/partition

04/02/14 11:24:00 /etc/condor/config.d/ports

04/02/14 11:24:00 /etc/condor/config.d/scaling

04/02/14 11:24:00 /etc/condor/config.d/soap

04/02/14 11:24:00 /etc/condor/condor_config.local

04/02/14 11:24:00 Daemon Log is logging: D_ALWAYS D_ERROR

04/02/14 11:24:00 DaemonCore: command socket at <myip:40438>

04/02/14 11:24:00 DaemonCore: private command socket at <myip:40438>

04/02/14 11:24:00 State file /var/lock/condor/defrag_state does not yet exist.

04/02/14 11:24:00 Will evaluate defragmentation policy every DEFRAG_INTERVAL=300 seconds.

04/02/14 11:24:00 polling interval 300s, DEFRAG_DRAINING_MACHINES_PER_HOUR = 10.000000/hour = 0/interval + 10/hour + 0/day

04/02/14 11:24:00 There are currently 0 draining and 16 whole machines.

04/02/14 11:24:00 Average pool draining badput = 0.00%

04/02/14 11:24:00 Average pool draining unclaimed = 0.00%

04/02/14 11:24:00 Looking for 7 machines to drain.

04/02/14 11:24:00 Drained 0 machines (wanted to drain 7 machines).

04/02/14 11:29:00 There are currently 0 draining and 16 whole machines.

04/02/14 11:29:00 Average pool draining badput = 0.00%

04/02/14 11:29:00 Average pool draining unclaimed = 0.00%

04/02/14 11:29:00 Doing nothing, because number to drain in next 300s is calculated to be 0.

04/02/14 11:34:01 There are currently 0 draining and 16 whole machines.

04/02/14 11:34:01 Average pool draining badput = 0.00%

04/02/14 11:34:01 Average pool draining unclaimed = 0.00%

04/02/14 11:34:01 Doing nothing, because number to drain in next 300s is calculated to be 0.

04/02/14 11:39:01 There are currently 0 draining and 16 whole machines.

04/02/14 11:39:01 Average pool draining badput = 0.00%

04/02/14 11:39:01 Average pool draining unclaimed = 0.00%

04/02/14 11:39:01 Doing nothing, because number to drain in next 300s is calculated to be 0.

04/02/14 11:44:01 There are currently 0 draining and 16 whole machines.

04/02/14 11:44:01 Average pool draining badput = 0.00%

04/02/14 11:44:01 Average pool draining unclaimed = 0.00%

04/02/14 11:44:01 Doing nothing, because number to drain in next 300s is calculated to be 0.
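
For reference, the rate line in the log ("10.000000/hour = 0/interval + 10/hour + 0/day") can be reproduced with the arithmetic below. This is only a sketch of one decomposition that matches the logged numbers; it is not taken from the condor_defrag source, and the function name is hypothetical.

```python
def split_rate(per_hour, interval_s):
    """Split an hourly drain rate into whole machines per polling
    interval, plus leftovers per hour and per day.

    Assumption: this mirrors the numbers in the log line, not
    necessarily the daemon's actual implementation.
    """
    polls_per_hour = 3600 / interval_s
    per_interval = int(per_hour / polls_per_hour)
    remainder = per_hour - per_interval * polls_per_hour
    extra_per_hour = int(remainder)
    extra_per_day = int(round((remainder - extra_per_hour) * 24))
    return per_interval, extra_per_hour, extra_per_day


# DEFRAG_DRAINING_MACHINES_PER_HOUR = 10, DEFRAG_INTERVAL = 300
print(split_rate(10, 300))  # (0, 10, 0)
```

With a 300 second interval there are 12 polls per hour, so 10 machines per hour is less than one machine per poll. That is consistent with the polls that report the number to drain in the next 300s as 0.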

The defrag daemon is configured as follows:

DAEMON_LIST = $(DAEMON_LIST), DEFRAG

DEFRAG_INTERVAL = 300

DEFRAG_DRAINING_MACHINES_PER_HOUR = 10

DEFRAG_MAX_WHOLE_MACHINES = 20

DEFRAG_MAX_CONCURRENT_DRAINING = 10

Am I doing something wrong here? Do the slots need to be Idle for the defrag to happen?

Best Regards,

-Frank

----------

Frank Berghaus

University of Victoria

Research Associate

Physics & Astronomy

UVic Phone: +1 (250) 472-4085

UVic Office: Elliot 201
