[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Cluster utilization is low for mixed memory intensive jobs



Hello,

 

I have tons of single-core jobs to run, each with a different memory request. As a preliminary step, I set up HTCondor on a single machine with a dynamic (partitionable) slot configuration. My goal is simple: keep the cluster fully utilized.

 

Initially, it seems that all memory of the target machine is allocated. However, after running several rounds, the unclaimed slot1 starts holding more and more memory, and only a few jobs run in parallel, with a lot of jobs sitting idle in the queue, as shown below. The jobs' memory requests are between 500 MB and 1600 MB, so I am pretty sure the cluster should be running more jobs.
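For what it's worth, this is how I look at the memory accounting on the partitionable slot versus its dynamic children (Memory and TotalSlotMemory are the standard machine-ad attributes; I am assuming these are the relevant ones here):

```
# Per-slot provisioned memory vs. the parent slot's total, with headers
condor_status -af:h Name SlotType Memory TotalSlotMemory
```

The SlotType column distinguishes the parent "Partitionable" slot1 from the "Dynamic" slot1_N children, which makes it easy to see where the memory is sitting.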

 

$condor_status

Name                            OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

 

slot1@DummyServer   LINUX      X86_64 Unclaimed Idle      0.000 84510  0+00:28:17

slot1_1@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:01

slot1_2@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:01

slot1_3@DummyServer LINUX      X86_64 Claimed   Busy      0.020  1000  0+00:00:04

slot1_4@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:03

slot1_5@DummyServer LINUX      X86_64 Claimed   Busy      0.020  1000  0+00:00:05

slot1_6@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:03

slot1_7@DummyServer LINUX      X86_64 Claimed   Busy      0.020  1000  0+00:00:02

slot1_8@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:18

 

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

 

  X86_64/LINUX     9     0       8         1       0          0        0      0

 

         Total     9     0       8         1       0          0        0      0

 

$condor_q

OWNER BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS


Total for query: 2138 jobs; 0 completed, 0 removed, 2130 idle, 8 running, 0 held, 0 suspended

Total for all users: 2138 jobs; 0 completed, 0 removed, 2130 idle, 8 running, 0 held, 0 suspended

 

 

I haven't done any fancy setup in condor_config.local. I've tried both:

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

NUM_SLOTS = 1

NUM_SLOTS_TYPE_1 = 1

SLOT_TYPE_1 = cpus=100%

SLOT_TYPE_1_PARTITIONABLE = true

 

and

           

NUM_SLOTS = 1

NUM_SLOTS_TYPE_1 = 1

SLOT_TYPE_1 = cpus=100%

SLOT_TYPE_1_PARTITIONABLE = true

CLAIM_WORKLIFE = 0

 

Cluster utilization is similarly low in both setups. With CLAIM_WORKLIFE = 0, I thought that after each job completes, the corresponding claimed slot would be returned to the parent unclaimed slot1, so that when a new job shows up, a new dynamic slot is carved out with the requested memory. Again, my job workload is mixed, and I don't think keeping a specific number of static slots can meet my requirements.
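In case it matters, this is how I confirm the startd actually picked up the setting after a reconfig (assuming condor_config_val is run on the same machine as the startd):

```
condor_reconfig
condor_config_val -startd CLAIM_WORKLIFE
```

It reports 0 as expected, so the setting itself seems to be in effect.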

 

Here is my sample submission file:

 

executable = xxx.sh

should_transfer_files   = NO

request_cpus     = 1

request_memory   = 1600

log = xxx.log

output = xxx.txt

Queue
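Since the real workload mixes memory sizes, what I eventually want to run looks more like the sketch below, using condor_submit's queue-from-list syntax (the 500/900/1600 values are just placeholders for my actual per-job requests):

```
executable            = xxx.sh
should_transfer_files = NO
request_cpus          = 1
request_memory        = $(mem)
log                   = xxx.log
output                = xxx.txt
queue mem from (
  500
  900
  1600
)
```

Each list entry becomes one job with its own request_memory, which is why I expected the partitionable slot to keep carving out differently sized dynamic slots.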

 

And here is my condor version:

 

$CondorVersion: 8.8.3 May 26 2019 BuildID: 470254 $

$CondorPlatform: x86_64_Ubuntu18 $

 

Any comments and suggestions are appreciated.

 

Best,

Shunxing