Re: [HTCondor-users] Use of cgroups

8/28/23 13:24:54 (pid:3101831) Error while locating memcg controller for starter: 50014 Cgroup not initialized

08/28/23 13:25:03 (pid:3101831) ProcFamilyDirectCgroupV2::get_usage cannot open /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot3@stnode1/memory.peak: 2 No such file or directory

It doesn't seem to affect the job though, still running smoothly.
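To see which of the cgroup v2 memory files the starter is looking for actually exist on this node, a small check like the following might help. It is only a sketch: the function name and the list of files are my own choice, and the path in the usage line is the one from the log message above. As far as I know, memory.peak only appeared in Linux 5.19, so a stock Ubuntu 22.04 kernel (5.15) would not have it, which would explain the second message without any effect on the job.

```python
from pathlib import Path

# cgroup v2 memory interface files worth checking (list chosen for illustration)
MEMORY_FILES = ("memory.current", "memory.max", "memory.peak", "memory.events")


def missing_memory_files(cgroup_dir, names=MEMORY_FILES):
    """Return the entries of `names` that are not present in `cgroup_dir`."""
    base = Path(cgroup_dir)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # Path taken from the StarterLog line above; adjust for your slot name.
    slot = "/sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot3@stnode1"
    print(missing_memory_files(slot))
```

If only memory.peak is reported as missing while the other files are there, the cgroup itself is set up and the message is most likely just about the kernel version.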

The server was set up externally, but it shouldn't be anything more than standard Ubuntu.

Peter

Peter Ellevseth
Principal Advisor / Principal Advisor
	+47 93 43 56 01 / +47 73 90 05 00
	peter.ellevseth@xxxxxxxxxx
	safetec.no

Hi Peter,

Out of the box, memory is handled by Condor itself and not by the memory
cgroup (what version are you using?). IIRC one reason is that otherwise
the kernel might kill whole process trees, and neither Condor nor the
users would have a good clue why. Have you set up your cluster to shift
OOM handling from the starter to the kernel itself?
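One way to tell whether the kernel did the killing inside a job's cgroup is to look at the oom_kill counter in that cgroup's memory.events file (a standard cgroup v2 file). A rough sketch, with function names made up for illustration, that parses the contents of such a file:

```python
def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "<name> <non-negative integer>"
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def kernel_oom_kills(text):
    """Number of processes in the cgroup killed by the kernel OOM killer."""
    return parse_memory_events(text).get("oom_kill", 0)


if __name__ == "__main__":
    # Made-up sample contents, not taken from a real node.
    sample = "low 0\nhigh 0\nmax 12\noom 3\noom_kill 2\n"
    print(kernel_oom_kills(sample))
```

A non-zero oom_kill for a slot's cgroup would point at the kernel enforcing a cgroup limit rather than the machine as a whole running out of memory.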

Cheers,
Thomas

On 28/08/2023 05.56, Peter Ellevseth wrote:
> Hi all
>
> I am struggling to understand how the cgroup mechanism affects my jobs.
> I have added a fresh node to our cluster. I started a lot of jobs on
> it, but all of a sudden it starts killing my jobs. I have traced
> it back to the OOM killer. However, the execute machine has 250GB of
> memory and my jobs are not using anything close to that.
>
> I wanted to try to tune the oom-killer, but I can't seem to find the
> relevant services (systemd-oomd, OS is Ubuntu 22.04). I also haven't
> found out how to disable it.
>
> Right now I am able to run about 40 jobs (the node has 48 cores). Each
> uses about 0.5% of total memory. When I submit more jobs, the oom-killer
> steps in and kills them.
>
> I am noticing that the OS seems to be using a lot of swap even when
> there is a lot of physical memory available.
>
> Are there any knobs in condor I can tune to aid with this?
>
> P
