Sorry to trouble you all again :-[

Might it be that we have screwed up our config for how freezing jobs is handled? I just noticed that, while on the suspicious nodes the cpu,cpuacct and memory job sub-slices are not necessarily complete, all running jobs have a slice in the freezer controller.

E.g., on one node we have a job running on a dynamic slot (slot1_20), and as I would expect, the condor_starter PID and all of the job's children are listed in the tasks files of the corresponding cpu and memory slices [1]. So, of the currently 17 jobs in total on this node, only 3 have slices in the cpu or memory controller, but all have freezer slices [2].

According to the StarterLogs, all jobs actually got memory limits [3]. So for the moment I would rule out that we failed to set limits for these jobs during submission.

I am not aware that we have any settings specific to the freezer controller, but I have to check for more details.

Maybe somebody has an idea why some jobs end up only in the freezer and in no other resource control group?

Cheers,
  Thomas

[*] Under https://desycloud.desy.de/index.php/s/wribqLz37AEPgbH/download is a tarball including the cpu,cpuacct/condor.system and freezer/condor.system slices, the condor_execs' /proc/PID/cgroup, ... as well as greps for the starter PIDs in the slots' StarterLogs.

[1]
* on slot1_4 condor_starter PID: 5218 --> condor_exec PID: 5311

[2]
* on slot1_20 condor_starter PID: 24163 --> condor_exec PID: 24507
or
* on slot1_1 condor_starter PID: 39966 --> condor_exec PID: 40069

I don't see a time pattern: for example, the job on slot1_1 started ~3.5h before the one on slot1_4 and got no cpu slice, whereas slot1_4 did.
[3] From the naming I would assume that soft/hard/memsw are mapped to memory.memsw.limit_in_bytes and memory.soft_limit_in_bytes.

slot1_4: with memory slice
  06/21/18 09:09:07 (pid:5218) Limiting (soft) memory usage to 0 bytes
  06/21/18 09:09:07 (pid:5218) Limiting memsw usage to 9223372036854775807 bytes
  06/21/18 09:09:07 (pid:5218) Limiting (soft) memory usage to 7918845952 bytes
  06/21/18 09:09:07 (pid:5218) Limiting (hard) memory usage to 143370891264 bytes
  06/21/18 09:09:07 (pid:5218) Limiting memsw usage to 143370895360 bytes

slot1_1: without memory slice
  > grep Limiting /tmp/StarterLog.slot1_1.39966
  06/21/18 05:40:17 (pid:39966) Limiting (soft) memory usage to 0 bytes
  06/21/18 05:40:17 (pid:39966) Limiting memsw usage to 9223372036854775807 bytes
  06/21/18 05:40:17 (pid:39966) Limiting (soft) memory usage to 1073741824 bytes
  06/21/18 05:40:17 (pid:39966) Limiting (hard) memory usage to 143340998656 bytes
  06/21/18 05:40:17 (pid:39966) Limiting memsw usage to 143341002752 bytes

slot1_4 aka 5218 memory slice limits
  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.kmem.limit_in_bytes
    9223372036854771712
  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.kmem.tcp.limit_in_bytes
    9223372036854771712
  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.limit_in_bytes
    143370891264
  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.memsw.limit_in_bytes
    143370895360
  /sys/fs/cgroup/memory/system.slice/condor.service/condor_var_lib_condor_execute_slot1_4@xxxxxxxxxxxxxxxxx/memory.soft_limit_in_bytes
    7918845952
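
For anyone who wants to repeat the check on their own nodes, here is a minimal sketch of the per-process comparison. It assumes cgroup v1, and that a job's cgroup path contains the slot name followed by "@", as in the memory paths above. The function names and the sample text are made up for illustration (the sample is not taken from the tarball). The sketch only parses the "hierarchy-ID:controllers:path" lines of /proc/PID/cgroup and reports which controllers place the process in a slot-specific cgroup.

```python
def parse_proc_cgroup(text):
    """Map each cgroup v1 controller name to the process's cgroup path."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line has the form: hierarchy-ID:controller-list:cgroup-path
        _, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            if controller:  # the list is empty for a cgroup v2 hierarchy
                mapping[controller] = path
    return mapping


def controllers_with_slot(mapping, marker):
    """Return, sorted, the controllers whose cgroup path contains marker."""
    return sorted(c for c, p in mapping.items() if marker in p)


def read_proc_cgroup(pid):
    """Read /proc/<pid>/cgroup of a live process (Linux only)."""
    with open("/proc/%d/cgroup" % pid) as fh:
        return fh.read()


# Hypothetical sample: a process that sits in a job cgroup only for the
# freezer controller, and in the parent condor.service cgroup otherwise.
SAMPLE = """\
11:freezer:/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@host
8:memory:/system.slice/condor.service
4:cpuacct,cpu:/system.slice/condor.service
1:name=systemd:/system.slice/condor.service
"""

if __name__ == "__main__":
    # "slot1_1@" rather than "slot1_1", so that slot1_10 etc. do not match.
    print(controllers_with_slot(parse_proc_cgroup(SAMPLE), "slot1_1@"))
```

On a real node one would pass read_proc_cgroup(pid) for a condor_exec PID instead of SAMPLE. A job that has all its slices should then list cpu, cpuacct, freezer and memory, while an affected job would list only freezer.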