
[HTCondor-users] memory overprovisioning: restart needed?



Hi,

 

Because jobs that are supposed to use 2 GB actually need about 2.5 times more memory, I changed our MEMORY setting to 2.6 times the real memory of the systems…

When I query the startd's configuration after issuing a condor_reconfig, I can see the change:

 

# condor_config_val -name wn272 -startd memory

2.6 * quantize( 64364, 1000 )
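
The setting behind that expression is of this form (presumably with $(DETECTED_MEMORY), the detected physical RAM in MB, as the base):

MEMORY = 2.6 * quantize( $(DETECTED_MEMORY), 1000 )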

 

But condor_status still shows me this:

 

Name         Cpu  Mem  LoadAv  KbdIdle    State    StateTime  Activ  ActvtyTime
slot1@wn272.  11   700  0.680 56+01:00:41 Unclaim  0+02:01:14 Idle   0+02:01:14
slot1_10@wn2   1  5200  0.990 56+01:00:41 Claimed  0+02:01:14 Busy   0+02:01:13
slot1_11@wn2   1  5200  0.930 56+01:00:41 Claimed  0+02:35:48 Busy   0+02:35:47
slot1_12@wn2   1  5200  1.000 56+01:00:41 Claimed  0+02:46:01 Busy   0+02:46:00
slot1_13@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:53:25 Busy   1+01:53:25
slot1_14@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:52:24 Busy   1+01:52:24
slot1_15@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:51:24 Busy   1+01:51:24
slot1_16@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:50:25 Busy   1+01:50:23
slot1_17@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:56:19 Busy   0+12:56:18
slot1_18@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:55:19 Busy   0+12:55:18
slot1_19@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:54:19 Busy   0+12:54:18
slot1_1@wn27   1  5200  1.000 56+01:00:41 Claimed  1+01:57:35 Busy   1+01:57:34
slot1_20@wn2   1  5200  0.690 56+01:00:41 Claimed  0+12:53:19 Busy   0+12:53:17
slot1_22@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:50:19 Busy   0+12:50:18
slot1_2@wn27   1  5200  0.810 56+01:00:41 Claimed  0+12:57:19 Busy   0+12:57:18
slot1_3@wn27   1  5200  0.940 56+01:00:41 Claimed  0+03:02:14 Busy   0+03:02:13
slot1_4@wn27   1  2100  1.000 56+01:00:41 Claimed  0+07:30:43 Busy   0+07:30:41
slot1_5@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:30:24 Busy   0+05:30:24
slot1_6@wn27   1  5200  1.000 56+01:00:41 Claimed  0+02:45:40 Busy   0+02:45:39
slot1_7@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:12:24 Busy   0+05:12:24
slot1_8@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:28:24 Busy   0+05:28:22
slot1_9@wn27   1  5200  1.000 56+01:00:41 Claimed  0+02:40:28 Busy   0+02:40:27

 

As you can see, this still corresponds to the old 1.5 overcommit factor:

# condor_status -state wn272|grep slot|gawk '{s+=$3} END{print s}'

97500

 

And not the new 2.6 factor, which should raise the advertised (overcommitted) memory to roughly 169 GB…
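
As a quick sanity check (assuming quantize() rounds 64364 up to 65000):

  old setting: 1.5 * 65000 =  97500 MB  (exactly the sum above)
  new setting: 2.6 * 65000 = 169000 MB

So the partitionable slot is clearly still advertising the old total.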

I tried restarting Condor on one node (service condor restart) and the correct memory then appeared, but I also lost all the jobs that were running there.
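
What I am considering trying next (untested, so take it as a sketch) is restarting only the startd peacefully, so that it re-reads MEMORY once its running jobs have finished, something like:

# restart just the startd on wn272, waiting for its running jobs to finish first
condor_restart -peaceful -startd -name wn272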

 

My question therefore is: is there a way to make Condor pick up the new setting without losing the running jobs?

The slots are configured like this:

 

NUM_SLOTS = 1

SLOT_TYPE_1               = cpus=100%,mem=100%,auto

NUM_SLOTS_TYPE_1          = 1

SLOT_TYPE_1_PARTITIONABLE = TRUE
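
For completeness, this is one way to see what the partitionable slot currently advertises (TotalSlotMemory being the slot's configured total and Memory the still-unallocated part):

# dump the partitionable slot's ClassAd and pick out the memory attributes
condor_status -constraint 'PartitionableSlot' -long wn272 | grep -E '^(TotalSlotMemory|Memory) ='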

 

Condor version: condor-8.2.6-287355.x86_64

 

Thanks