
Re: [HTCondor-users] memory overprovisioning: restart needed?



Hi Frederic,

This section of the manual lists the macros that require a restart:

http://research.cs.wisc.edu/htcondor/manual/v8.3/3_3Configuration.html#sec:Macros-Requiring-Restart

However, I think MEMORY really should be on that list too (it is simply missing from the documentation).

In general, I find that any time you change the properties of the nodes (# of CPUs, amount of memory, number of slots), the startd needs to be restarted.
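
That said, the restart does not have to kill the running jobs. A minimal sketch, assuming HTCondor 8.x and using wn272 from your output below (I believe condor_restart accepts the same -peaceful option as condor_off):

# Peacefully restart only the startd on that node: it stops accepting new
# matches, waits for the running jobs to finish, then comes back up with
# the freshly read MEMORY value.
condor_restart -peaceful -startd -name wn272

(Draining the node with condor_drain first and then doing a plain restart would be another way to get the same effect.)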

Does anyone else have a different experience?

Brian

On Feb 12, 2015, at 8:58 AM, SCHAER Frederic <frederic.schaer@xxxxxx> wrote:

Hi,
 
Because jobs that should use 2GB actually require 2.5 times more memory, I changed our MEMORY setting to be 2.6 times the real memory of the systems…
When I query the daemons after issuing a condor_reconfig, I see the change:
 
# condor_config_val -name wn272 -startd memory
2.6 * quantize( 64364, 1000 )
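
For context, that raw value presumably comes from a configuration line along these lines ($(DETECTED_MEMORY) is expanded when the file is read, which would explain the literal 64364):

MEMORY = 2.6 * quantize( $(DETECTED_MEMORY), 1000 )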
 
But when I look at the condor_status output, it tells me this:
 
Name         Cpu  Mem  LoadAv  KbdIdle    State    StateTime  Activ  ActvtyTime
 
slot1@wn272.  11   700  0.680 56+01:00:41 Unclaim  0+02:01:14 Idle   0+02:01:14
slot1_10@wn2   1  5200  0.990 56+01:00:41 Claimed  0+02:01:14 Busy   0+02:01:13
slot1_11@wn2   1  5200  0.930 56+01:00:41 Claimed  0+02:35:48 Busy   0+02:35:47
slot1_12@wn2   1  5200  1.000 56+01:00:41 Claimed  0+02:46:01 Busy   0+02:46:00
slot1_13@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:53:25 Busy   1+01:53:25
slot1_14@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:52:24 Busy   1+01:52:24
slot1_15@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:51:24 Busy   1+01:51:24
slot1_16@wn2   1  5200  1.000 56+01:00:41 Claimed  1+01:50:25 Busy   1+01:50:23
slot1_17@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:56:19 Busy   0+12:56:18
slot1_18@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:55:19 Busy   0+12:55:18
slot1_19@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:54:19 Busy   0+12:54:18
slot1_1@wn27   1  5200  1.000 56+01:00:41 Claimed  1+01:57:35 Busy   1+01:57:34
slot1_20@wn2   1  5200  0.690 56+01:00:41 Claimed  0+12:53:19 Busy   0+12:53:17
slot1_22@wn2   1  5200  1.000 56+01:00:41 Claimed  0+12:50:19 Busy   0+12:50:18
slot1_2@wn27   1  5200  0.810 56+01:00:41 Claimed  0+12:57:19 Busy   0+12:57:18
slot1_3@wn27   1  5200  0.940 56+01:00:41 Claimed  0+03:02:14 Busy   0+03:02:13
slot1_4@wn27   1  2100  1.000 56+01:00:41 Claimed  0+07:30:43 Busy   0+07:30:41
slot1_5@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:30:24 Busy   0+05:30:24
slot1_6@wn27   1  5200  1.000 56+01:00:41 Claimed  0+02:45:40 Busy   0+02:45:39
slot1_7@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:12:24 Busy   0+05:12:24
slot1_8@wn27   1  2100  1.000 56+01:00:41 Claimed  0+05:28:24 Busy   0+05:28:22
slot1_9@wn27   1  5200  1.000 56+01:00:41 Claimed  0+02:40:28 Busy   0+02:40:27
 
As you can see, the advertised memory still adds up to the old 1.5 overcommit factor:
# condor_status -state wn272|grep slot|gawk '{s+=$3} END{print s}'
97500
 
And not the new 2.6 one, which should raise the available virtual memory to roughly 160GB (2.6 * quantize(64364, 1000) = 2.6 * 64000 = 166400 MB)…
I tried restarting condor on one node (service condor restart) and the correct memory resources appeared, but I also lost all running jobs.
 
My question therefore is: is there a way to make condor take this new setting into account without restarting it (and killing the jobs)?
The slots are configured like this :
 
NUM_SLOTS = 1
SLOT_TYPE_1               = cpus=100%,mem=100%,auto
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
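
One way to double-check what the startd actually advertises, as opposed to what the configuration file says, might be something along these lines (TotalMemory should reflect the configured MEMORY once the startd has picked it up):

# what the configuration now says
condor_config_val -name wn272 -startd MEMORY
# what the startd is actually advertising to the collector
condor_status -long wn272 | grep -i '^TotalMemory'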
 
Condor version: condor-8.2.6-287355.x86_64
 
Thanks 