
[HTCondor-users] memory overprovisioning: restart needed?



Hi,

 

Because jobs that are supposed to use 2 GB actually need about 2.5 times more memory, I changed our MEMORY setting to 2.6 times the real memory of the systems…

When I query the startd's configuration after issuing a condor_reconfig, I can see the change:

 

# condor_config_val -name wn272 -startd memory

2.6 * quantize( 64364, 1000 )
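
The setting behind that expression is of this form (presumably with $(DETECTED_MEMORY), the detected physical RAM in MB, as the base):

MEMORY = 2.6 * quantize( $(DETECTED_MEMORY), 1000 )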

 

But condor_status still shows me this:

 

Name         Cpu  Mem  LoadAv  KbdIdle    State    StateTime  Activ  ActvtyTime
slot1@wn272.  11   700  0.680 56+01:00:41 Unclaim  0+02:01:14 Idle   0+02:01:14
slot1_10@wn2   1  5200  0.990 56+01:00:41 Claimed  0+02:01:14 Busy   0+02:01:13
slot1_11@wn2   1  5200  0.930 56+01:00:41 Claimed  0+02:35:48 Busy   0+02:35:47
slot1_12@wn2   1  5200  1.000 56+01:00:41 Claimed  0+02:46:01 Busy   0+02:46:00
slot1_13@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:53:25 Busy   1+01:53:25
slot1_14@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:52:24 Busy   1+01:52:24
slot1_15@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:51:24 Busy   1+01:51:24
slot1_16@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:50:25 Busy   1+01:50:23
slot1_17@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:56:19 Busy   0+12:56:18
slot1_18@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:55:19 Busy   0+12:55:18
slot1_19@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:54:19 Busy   0+12:54:18
slot1_1@wn27   1  5200  1.000 56+01:00:41 Claimed  1+01:57:35 Busy   1+01:57:34
slot1_20@wn2   1  5200  0.690 56+01:00:41 Claimed  0+12:53:19 Busy   0+12:53:17
slot1_22@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:50:19 Busy   0+12:50:18
slot1_2@wn27   1  5200  0.810 56+01:00:41 Claimed  0+12:57:19 Busy   0+12:57:18
slot1_3@wn27   1  5200  0.940 56+01:00:41 Claimed  0+03:02:14 Busy   0+03:02:13
slot1_4@wn27   1  2100  1.000 56+01:00:41 Claimed  0+07:30:43 Busy   0+07:30:41
slot1_5@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:30:24 Busy   0+05:30:24
slot1_6@wn27   1  5200  1.000 56+01:00:41 Claimed  0+02:45:40 Busy   0+02:45:39
slot1_7@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:12:24 Busy   0+05:12:24
slot1_8@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:28:24 Busy   0+05:28:22
slot1_9@wn27   1  5200  1.000 56+01:00:41 Claimed  0+02:40:28 Busy   0+02:40:27

 

As you can see, this still corresponds to the old 1.5 overcommit factor:

# condor_status -state wn272|grep slot|gawk '{s+=$3} END{print s}'

97500

 

And not the new 2.6 factor, which should raise the advertised (overcommitted) memory to roughly 169 GB…
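
As a quick sanity check (assuming quantize() rounds 64364 up to 65000):

  old setting: 1.5 * 65000 =  97500 MB  (exactly the sum above)
  new setting: 2.6 * 65000 = 169000 MB

So the partitionable slot is clearly still advertising the old total.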

I tried restarting Condor on one node (service condor restart) and the correct memory then appeared, but I also lost all the jobs that were running there.
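
What I am considering trying next (untested, so take it as a sketch) is restarting only the startd peacefully, so that it re-reads MEMORY once its running jobs have finished, something like:

# restart just the startd on wn272, waiting for its running jobs to finish first
condor_restart -peaceful -startd -name wn272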

 

My question therefore is: is there a way to make Condor pick up the new setting without losing the running jobs?

The slots are configured like this:

 

NUM_SLOTS = 1

SLOT_TYPE_1               = cpus=100%,mem=100%,auto

NUM_SLOTS_TYPE_1          = 1

SLOT_TYPE_1_PARTITIONABLE = TRUE
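
For completeness, this is one way to see what the partitionable slot currently advertises (TotalSlotMemory being the slot's configured total and Memory the still-unallocated part):

# dump the partitionable slot's ClassAd and pick out the memory attributes
condor_status -constraint 'PartitionableSlot' -long wn272 | grep -E '^(TotalSlotMemory|Memory) ='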

 

Condor version: condor-8.2.6-287355.x86_64

 

Thanks