
Re: [HTCondor-users] RESERVED_MEMORY not considered by HTCondor



Jewel,

128 * 1024 = 131,072 MB, and 131,072 - 3,072 = 128,000 MB.

So you're 723MB over your reserved limit.
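
For anyone who wants to re-check that arithmetic, here is a minimal Python sketch. It uses only the figures quoted in this thread; the variable names are mine and purely illustrative, and it assumes the machine really reports the full 128 * 1024 MB.

```python
# Figures from the thread; names are illustrative, not HTCondor attributes.
physical_mb = 128 * 1024        # 128 GB expressed in MB (assumed detected size)
reserved_mb = 3072              # RESERVED_MEMORY from the worker config
usable_mb = physical_mb - reserved_mb

dynamic_slots_mb = 128_723      # sum of dynamic slot sizes reported in the original post
overage_mb = dynamic_slots_mb - usable_mb

print(usable_mb)                # 128000
print(overage_mb)               # 723
```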

This may be a result of consumption policies - the default memory consumption policy rounds up the memory for the slot to the next 128-megabyte increment:

CONSUMPTION_MEMORY = quantize(target.RequestMemory,{128})

The extra 723 megabytes is about 5.65 times 128, so if each of the 13 jobs requested roughly 64 megabytes short of the next multiple of 128, the quantize would wind up with about that number. I'm not certain, however, whether the consumption policy is applied before or after the match. If it's after the match, that may be the root cause.
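
To illustrate the rounding, here is a rough Python sketch of what quantize does with a single-element list such as {128}: it rounds the request up to the next multiple of 128. The function name quantize_up and the sample requests are hypothetical, made up for illustration, and this is not HTCondor's own code.

```python
def quantize_up(value_mb: int, step_mb: int = 128) -> int:
    """Round value_mb up to the next multiple of step_mb (integer arithmetic)."""
    return -(-value_mb // step_mb) * step_mb

# Hypothetical RequestMemory values in MB, not taken from the original post.
requests_mb = [1000, 2000, 4032]

consumed_mb = [quantize_up(r) for r in requests_mb]
overhead_mb = sum(consumed_mb) - sum(requests_mb)

print(consumed_mb)   # [1024, 2048, 4096]
print(overhead_mb)   # 136
```

Each job can pick up anywhere from 0 to 127 MB this way, so a dozen or so jobs can easily account for several hundred megabytes beyond what was requested.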

Try submitting your jobs with memory requests that are exact multiples of 128 megabytes, and see if that helps.

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Zhuo Zhang via HTCondor-users
Sent: Tuesday, October 23, 2018 11:17 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Zhuo Zhang <zhuo.zhang@xxxxxxxx>
Subject: [External] [HTCondor-users] RESERVED_MEMORY not considered by HTCondor

Hi HTCondor Experts,
Recently we have been experiencing machine crashes because of OOM. Each worker in our cluster has 128GB memory, and each has 3072MB reserved memory that cannot be used by HTCondor:
RESERVED_MEMORY = 3072

In addition, each worker has 1 partitionable slot defined as below:
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

However, if you add up the dynamic slot sizes shown below in the second-to-last column (MB), you get 128,723MB. Condor evidently does not subtract the 3072MB (RESERVED_MEMORY) from the physical memory of the machine.