
[HTCondor-users] Use of cgroups



Hi all

I am struggling to understand how the cgroup mechanism affects my jobs. I have added a fresh node to our cluster and started a lot of jobs on it, but all of a sudden it starts killing my jobs. I have traced this back to the OOM killer. However, the execute machine has 250 GB of memory, and my jobs are not using anywhere close to that.

I wanted to try tuning the OOM killer, but I can't seem to find the relevant service (I looked for systemd-oomd; the OS is Ubuntu 22.04). I also haven't found out how to disable it.

Right now I am able to run about 40 jobs (the node has 48 cores). Each uses about 0.5% of total memory. When I submit more jobs, the OOM killer steps in and kills them.
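
As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It uses only the approximate figures above (250 GB total, about 0.5% per job); the function name is purely illustrative, and it assumes every job uses about the same amount of memory.

```python
# Approximate figures from this message; none of these are measured values.
TOTAL_MEMORY_GB = 250      # physical memory on the execute machine
PER_JOB_PERCENT = 0.5      # each job uses roughly 0.5% of total memory


def total_job_memory_gb(jobs, per_job_percent=PER_JOB_PERCENT,
                        total_gb=TOTAL_MEMORY_GB):
    """Estimate the combined memory of `jobs` identical jobs, in GB."""
    per_job_gb = total_gb * per_job_percent / 100
    return jobs * per_job_gb


if __name__ == "__main__":
    for jobs in (40, 48):
        used = total_job_memory_gb(jobs)
        print(f"{jobs} jobs: about {used:.0f} GB of {TOTAL_MEMORY_GB} GB")
```

So even with all 48 cores busy, the jobs themselves should add up to only a fraction of the physical memory, which is why the OOM kills surprise me.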

I am noticing that the OS seems to be using a lot of swap even when there is a lot of physical memory available.

Are there any knobs in HTCondor I can tune to help with this?

P

 

Peter Ellevseth 

Principal Advisor

+47 93 43 56 01 / +47 73 90 05 00

 peter.ellevseth@xxxxxxxxxx

 safetec.no