
Re: [HTCondor-users] Cluster utilization is low for mixed memory intensive jobs



My guess in the example below would be that your machine has 8 CPU cores; "condor_status DummyServer -autoformat TotalCpus" would tell you. By default, HTCondor uses CPU cores, memory, and scratch disk space to determine if there's room for another job.

If your machine does advertise more than eight cores, then that's definitely a puzzle. You'd want to look at the "Disk" machine attribute to see if that might be imposing a constraint, although the default value for that is fairly low and disk drives these days are fairly big. Clearly there's enough memory for more jobs, assuming there's disk and CPUs available.
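
To make the arithmetic concrete, here is a rough Python sketch of how the scarcest resource caps the number of dynamic slots. It is only an illustration, not HTCondor's actual matching code. The memory total comes from the condor_status output quoted below (84510 MB unclaimed plus 8 x 1000 MB claimed); the core count of 8 is my guess from above, and the disk figures and the helper name max_concurrent_jobs are purely hypothetical.

```python
def max_concurrent_jobs(total, request):
    """Return (job_count, limiting_resource) for identical jobs carved
    out of one partitionable slot. Each job needs its full request of
    every resource, so the scarcest resource sets the cap."""
    fits = {name: total[name] // request[name] for name in request}
    limiting = min(fits, key=fits.get)
    return fits[limiting], limiting


# Memory: 84510 MB unclaimed + 8 * 1000 MB claimed (from condor_status).
# cpus=8 is a guess; the disk numbers are hypothetical placeholders.
machine = {"cpus": 8, "memory_mb": 84510 + 8 * 1000, "disk_kb": 500_000_000}
job = {"cpus": 1, "memory_mb": 1000, "disk_kb": 1_000_000}

count, limiting = max_concurrent_jobs(machine, job)
print(f"{count} jobs fit; limited by {limiting}")
```

Under those assumptions memory would allow roughly 92 jobs, but the core count stops it at 8, which matches the eight claimed slots in the output.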

The claim worklife can be left at the default value rather than being set to zero - an existing slot for which the claim has not yet expired and which matches another job from the same submitter will be matched and dispatched without negotiator overhead, which improves efficiency a bit.


Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Bao, Shunxing
Sent: Monday, July 15, 2019 12:59 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] [HTCondor-users] Cluster utilization is low for mixed memory intensive jobs

Hello,

I have tons of single-core jobs to run with different memory requests. As preliminary work, I set up HTCondor on a single machine using a dynamic slot setup. My goal is simple: keep the cluster fully utilized.

Initially, it seems that all memory of the target machine is allocated. However, after running several rounds, the unclaimed slot1 holds more and more memory, and only a few jobs run in parallel, with a lot of jobs idle in the queue, as shown below. Those jobs' memory requests are between 500 MB and 1600 MB, so I am pretty sure the cluster should be running more jobs.

$ condor_status
Name                OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@DummyServer   LINUX      X86_64 Unclaimed Idle      0.000 84510  0+00:28:17
slot1_1@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:01
slot1_2@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:01
slot1_3@DummyServer LINUX      X86_64 Claimed   Busy      0.020  1000  0+00:00:04
slot1_4@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:03
slot1_5@DummyServer LINUX      X86_64 Claimed   Busy      0.020  1000  0+00:00:05
slot1_6@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:03
slot1_7@DummyServer LINUX      X86_64 Claimed   Busy      0.020  1000  0+00:00:02
slot1_8@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:18

               Total Owner Claimed Unclaimed Matched Preempting Backfill Drain

  X86_64/LINUX     9     0       8         1       0          0        0     0

         Total     9     0       8         1       0          0        0     0

$ condor_q
OWNER BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
……
……
Total for query: 2138 jobs; 0 completed, 0 removed, 2130 idle, 8 running, 0 held, 0 suspended
Total for all users: 2138 jobs; 0 completed, 0 removed, 2130 idle, 8 running, 0 held, 0 suspended


I haven't done any fancy setup in condor_config.local. I've tried both:

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true

and

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true
CLAIM_WORKLIFE = 0

The cluster utilization is similarly low in both setups. With CLAIM_WORKLIFE = 0, I thought that after each job completes, the corresponding claimed slot would be returned to the original unclaimed slot1, so that once a new job shows up, a new slot is created with the matching memory allocation. Again, my job workload is mixed, and I don't think keeping a fixed number of static slots can meet my needs.

Here is my sample submission file:

executable = xxx.sh
should_transfer_files = NO
request_cpus = 1
request_memory = 1600
log = xxx.log
output = xxx.txt
Queue

And here is my condor version:

$CondorVersion: 8.8.3 May 26 2019 BuildID: 470254 $
$CondorPlatform: x86_64_Ubuntu18 $

Any comments and suggestions are appreciated.

Best,
Shunxing