
Re: [HTCondor-users] nodes without cgrouped jobs?



Sorry to trouble you all again :-[

Could it be that we have screwed up our config for how frozen jobs are
handled?

I just noticed that on the suspicious nodes the cpu,cpuacct and memory
sub-slices for the jobs are not necessarily complete, yet all running
jobs have a slice in the freezer controller.

E.g., on one node we have a job running on a dynamic slot (slot1_4),
and as I would expect, the condor_starter PID and all of the job's
children are listed in the tasks files of the corresponding cpu and
memory slices [1].

However, of the 17 jobs currently running on this node, only 3 have
slices in the cpu or memory controllers, but all of them have freezer
slices [2].
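
To check this per job, here is a minimal sketch that parses the content
of /proc/PID/cgroup (cgroup v1 format, "hierarchy-ID:controllers:path")
and reports the controllers for which a process does not sit in a
per-slot slice. The function names, the "_slot" marker and the sample
text are made up for illustration; the sample is not real output from
our nodes.

```python
def parse_cgroup_v1(text):
    """Map each controller name to its cgroup path (content of /proc/<pid>/cgroup)."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # cgroup v1 line format: hierarchy-ID:controller-list:cgroup-path
        _hierarchy, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            if name:
                mapping[name] = path
    return mapping


def missing_job_slice(mapping,
                      wanted=("cpu", "cpuacct", "memory", "freezer"),
                      marker="_slot"):
    """Return the wanted controllers whose path does not contain the marker.

    The marker "_slot" is an assumption based on slice names such as
    condor_var_lib_condor_execute_slot1_4@...
    """
    return [name for name in wanted if marker not in mapping.get(name, "")]


# Hypothetical sample: the job is only in a per-slot freezer slice.
SAMPLE = """\
11:freezer:/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@example
7:memory:/system.slice/condor.service
4:cpu,cpuacct:/system.slice/condor.service
"""

if __name__ == "__main__":
    print(missing_job_slice(parse_cgroup_v1(SAMPLE)))
```

On a node one would feed it open("/proc/%d/cgroup" % pid).read() for
each condor_exec PID instead of the sample text.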

According to the StarterLogs, all jobs actually got memory limits [3].
So for the moment I would rule out that we failed to set limits for
these jobs at submission.

I am not aware that we have any settings specific to the freezer
controller, but I have to check in more detail.

Maybe somebody has an idea why some jobs end up only in the freezer
controller but in no other resource control group?

Cheers,
  Thomas

[*]
Under
  https://desycloud.desy.de/index.php/s/wribqLz37AEPgbH/download
is a tarball including the cpu,cpuacct/condor.system and
freezer/condor.system slices, the condor_execs' /proc/PID/cgroup, ...
as well as greps for the starter PIDs in the slots' StarterLogs.


[1]
* on slot1_4
condor_starter PID: 5218 --> condor_exec PID: 5311

[2]
* on slot1_20
condor_starter PID: 24163 --> condor_exec PID: 24507
 or
* on slot1_1
condor_starter PID: 39966 --> condor_exec PID: 40069

I don't see a time pattern: for example, the job on slot1_1 started
~3.5h before the one on slot1_4 and got no cpu slice, while slot1_4 did.

[3]
From the naming I would assume that soft/hard/memsw are mapped to
memory.soft_limit_in_bytes, memory.limit_in_bytes and
memory.memsw.limit_in_bytes, respectively.

slot1_4: with memory slice
06/21/18 09:09:07 (pid:5218) Limiting (soft) memory usage to 0 bytes
06/21/18 09:09:07 (pid:5218) Limiting memsw usage to 9223372036854775807 bytes
06/21/18 09:09:07 (pid:5218) Limiting (soft) memory usage to 7918845952 bytes
06/21/18 09:09:07 (pid:5218) Limiting (hard) memory usage to 143370891264 bytes
06/21/18 09:09:07 (pid:5218) Limiting memsw usage to 143370895360 bytes

slot1_1: without memory slice
> grep Limiting /tmp/StarterLog.slot1_1.39966
06/21/18 05:40:17 (pid:39966) Limiting (soft) memory usage to 0 bytes
06/21/18 05:40:17 (pid:39966) Limiting memsw usage to 9223372036854775807 bytes
06/21/18 05:40:17 (pid:39966) Limiting (soft) memory usage to 1073741824 bytes
06/21/18 05:40:17 (pid:39966) Limiting (hard) memory usage to 143340998656 bytes
06/21/18 05:40:17 (pid:39966) Limiting memsw usage to 143341002752 bytes
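
Since each limit is logged more than once, a small sketch like the
following can pull the last value per kind out of a StarterLog. The
regex is written against the "Limiting ..." lines quoted here only, so
it is an assumption that other HTCondor versions log the same wording;
final_limits and SLOT1_1_LOG are names made up for this example.

```python
import re

# Matches "Limiting (soft) memory usage to N bytes",
# "Limiting (hard) memory usage to N bytes" and
# "Limiting memsw usage to N bytes", also when the mail client
# wrapped the line before the number or before "bytes".
LIMIT_RE = re.compile(
    r"Limiting\s+(?:\((soft|hard)\)\s+memory|(memsw))\s+usage\s+to\s+(\d+)\s+bytes"
)


def final_limits(log_text):
    """Return the last logged value per limit kind (soft, hard, memsw)."""
    limits = {}
    for match in LIMIT_RE.finditer(log_text):
        kind = match.group(1) or match.group(2)
        limits[kind] = int(match.group(3))  # later lines overwrite earlier ones
    return limits


# The slot1_1 lines from this mail, including the line wrapping.
SLOT1_1_LOG = """\
06/21/18 05:40:17 (pid:39966) Limiting (soft) memory usage to 0 bytes
06/21/18 05:40:17 (pid:39966) Limiting memsw usage to
9223372036854775807 bytes
06/21/18 05:40:17 (pid:39966) Limiting (soft) memory usage to 1073741824
bytes
06/21/18 05:40:17 (pid:39966) Limiting (hard) memory usage to
143340998656 bytes
06/21/18 05:40:17 (pid:39966) Limiting memsw usage to 143341002752 bytes
"""

if __name__ == "__main__":
    for kind, value in sorted(final_limits(SLOT1_1_LOG).items()):
        print(kind, value)
```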

> slot1_4 aka 5218 memory slice limits
/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.kmem.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.kmem.tcp.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.limit_in_bytes
143370891264
/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.memsw.limit_in_bytes
143370895360
/sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.soft_limit_in_bytes
7918845952
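
To compare such a slice against what the starter logged, a rough sketch
could look like this. The soft/hard/memsw to file mapping is only my
assumption from the naming (not checked against the HTCondor source),
and read_limits, is_unlimited and mismatches are hypothetical helper
names. The "unlimited" check assumes a 64-bit kernel with 4096-byte
pages, where 2^63 - 1 rounded down to a page boundary gives the
9223372036854771712 seen in the kmem files.

```python
from pathlib import Path

# Assumed mapping, taken from the naming only.
LIMIT_FILES = {
    "soft": "memory.soft_limit_in_bytes",
    "hard": "memory.limit_in_bytes",
    "memsw": "memory.memsw.limit_in_bytes",
}


def read_limits(cgroup_dir):
    """Read the limit files that exist in one cgroup v1 memory directory."""
    limits = {}
    for kind, name in LIMIT_FILES.items():
        path = Path(cgroup_dir) / name
        if path.is_file():
            limits[kind] = int(path.read_text().strip())
    return limits


def is_unlimited(value, page_size=4096):
    """True if value is at least the largest page-aligned signed 64-bit number."""
    return value >= (2**63 - 1) // page_size * page_size


def mismatches(logged, actual):
    """Limit kinds whose logged value differs from, or is missing in, the cgroup."""
    return sorted(kind for kind, value in logged.items() if actual.get(kind) != value)
```

For slot1_4 the logged values (soft 7918845952, hard 143370891264,
memsw 143370895360) are the same numbers as in the three files listed
above, so mismatches() would come back empty there; for a job without a
memory slice there is no directory to read and all three kinds would be
reported.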
