
[HTCondor-users] RESERVED_MEMORY not considered by HTCondor



Hi HTCondor Experts,

Recently our machines have been crashing because of OOM (out of memory). Each worker in our cluster has 128GB of memory, of which 3072MB is reserved and should not be used by HTCondor:
RESERVED_MEMORY = 3072

In addition, each worker has one partitionable slot, defined as follows:
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

However, if you add up the slot memory sizes shown below in the second-to-last column (MB), you get 128,723MB. HTCondor apparently does not subtract the 3072MB (RESERVED_MEMORY) from the physical memory of the machine.

master1:4} condor_status | grep worker1
Slot1@worker1    LINUX   X86_64 Unclaimed Idle   1.000   339 0+03:18:58
slot1_1@worker1  LINUX   X86_64 Claimed   Busy   1.020 15360 0+00:00:03
slot1_2@worker1  LINUX   X86_64 Claimed   Busy   1.000 15104 0+00:07:57
slot1_3@worker1  LINUX   X86_64 Claimed   Busy   1.000  4096 0+00:02:20
slot1_4@worker1  LINUX   X86_64 Claimed   Busy   0.650  5888 0+00:08:08
slot1_5@worker1  LINUX   X86_64 Claimed   Busy   1.020  8064 0+00:00:32
slot1_6@worker1  LINUX   X86_64 Claimed   Busy   1.000 20096 0+00:00:03
slot1_7@worker1  LINUX   X86_64 Claimed   Busy   1.010  5888 0+00:00:32
slot1_8@worker1  LINUX   X86_64 Claimed   Busy   0.000 20096 0+00:36:00
slot1_9@worker1  LINUX   X86_64 Claimed   Busy   1.000  4096 0+00:00:06
slot1_10@worker1 LINUX   X86_64 Claimed   Busy   1.040  4096 0+00:00:03
slot1_11@worker1 LINUX   X86_64 Claimed   Busy   1.010  5120 0+00:00:03
slot1_12@worker1 LINUX   X86_64 Claimed   Busy   1.000 15360 0+00:02:08
slot1_13@worker1 LINUX   X86_64 Claimed   Busy   1.000  5120 0+00:04:14
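
For anyone who wants to check the arithmetic, here is a small Python sketch that sums the memory column of the output above. The helper name total_slot_memory_mb is just something made up for this illustration. It also assumes the machine's nominal memory is exactly 128 * 1024 MB, which may not match what HTCondor actually detects on the worker.

```python
# Slot lines copied from the condor_status output above.
STATUS_OUTPUT = """\
Slot1@worker1 LINUX X86_64 Unclaimed Idle 1.000 339 0+03:18:58
slot1_1@worker1 LINUX X86_64 Claimed Busy 1.020 15360 0+00:00:03
slot1_2@worker1 LINUX X86_64 Claimed Busy 1.000 15104 0+00:07:57
slot1_3@worker1 LINUX X86_64 Claimed Busy 1.000 4096 0+00:02:20
slot1_4@worker1 LINUX X86_64 Claimed Busy 0.650 5888 0+00:08:08
slot1_5@worker1 LINUX X86_64 Claimed Busy 1.020 8064 0+00:00:32
slot1_6@worker1 LINUX X86_64 Claimed Busy 1.000 20096 0+00:00:03
slot1_7@worker1 LINUX X86_64 Claimed Busy 1.010 5888 0+00:00:32
slot1_8@worker1 LINUX X86_64 Claimed Busy 0.000 20096 0+00:36:00
slot1_9@worker1 LINUX X86_64 Claimed Busy 1.000 4096 0+00:00:06
slot1_10@worker1 LINUX X86_64 Claimed Busy 1.040 4096 0+00:00:03
slot1_11@worker1 LINUX X86_64 Claimed Busy 1.010 5120 0+00:00:03
slot1_12@worker1 LINUX X86_64 Claimed Busy 1.000 15360 0+00:02:08
slot1_13@worker1 LINUX X86_64 Claimed Busy 1.000 5120 0+00:04:14
"""

# Assumption: nominal physical memory, not a value reported by HTCondor.
NOMINAL_MEMORY_MB = 128 * 1024
RESERVED_MEMORY_MB = 3072  # from the RESERVED_MEMORY setting


def total_slot_memory_mb(status_text):
    """Sum the second-to-last column (memory in MB) over all slot lines."""
    total = 0
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        total += int(fields[-2])
    return total


total_mb = total_slot_memory_mb(STATUS_OUTPUT)
expected_max_mb = NOMINAL_MEMORY_MB - RESERVED_MEMORY_MB

print("sum of slot memory (MB):", total_mb)
print("nominal minus reserved (MB):", expected_max_mb)
print("difference (MB):", total_mb - expected_max_mb)
```

Under that assumption, the slots together hold more memory than the nominal total minus the reserved amount, which is what prompted my question.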

My question is why RESERVED_MEMORY is not considered by HTCondor in this case.

Thank you in advance,
Jewel