
Re: [HTCondor-users] Use of cgroups



Hi

I am running Ubuntu 22 and HTCondor 10.4. The installation is more or less out of the box: I have implemented the steps from the get_condor script in an Ansible playbook. The configuration is minimal, and I have not intentionally touched the memory handling.

I see that cgroups are mentioned in StarterLog.slotX, for example:

8/28/23 13:24:54 (pid:3101831) Error while locating memcg controller for starter: 50014 Cgroup not initialized
08/28/23 13:25:03 (pid:3101831) ProcFamilyDirectCgroupV2::get_usage cannot open /sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot3@stnode1/memory.peak: 2 No such file or directory

It doesn't seem to affect the job, though; it is still running smoothly.
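
One way to see which cgroup v2 memory files the kernel actually exposes for a slot is to list them directly. The sketch below is only an illustration, not something HTCondor provides: the helper name `inspect_memory_cgroup` is made up, and the file names are the standard cgroup v2 memory interface files. `memory.peak` is a relatively recent addition to that interface, so it may simply be absent on older kernels, which would match the "No such file or directory" message above.

```python
from pathlib import Path

# cgroup v2 memory interface files (names from the kernel's cgroup v2 docs).
MEMORY_FILES = ("memory.current", "memory.peak", "memory.max", "memory.swap.current")


def inspect_memory_cgroup(cgroup_dir):
    """Return {file name: contents or None} plus the parsed memory.events counters."""
    base = Path(cgroup_dir)
    report = {}
    for name in MEMORY_FILES:
        path = base / name
        # None marks a file the kernel does not expose for this cgroup.
        report[name] = path.read_text().strip() if path.is_file() else None

    events = {}
    events_path = base / "memory.events"
    if events_path.is_file():
        # Each line looks like "oom_kill 2"; keep only numeric counters.
        for line in events_path.read_text().splitlines():
            key, _, value = line.partition(" ")
            if value.strip().isdigit():
                events[key] = int(value)
    report["memory.events"] = events
    return report


if __name__ == "__main__":
    # Path taken from the StarterLog line above; adjust the slot name as needed.
    slot = "/sys/fs/cgroup/htcondor/condor_var_lib_condor_execute_slot3@stnode1"
    for name, value in inspect_memory_cgroup(slot).items():
        print(f"{name}: {value}")
```

A non-zero `oom_kill` counter in `memory.events` would suggest that the kernel killed a process inside that cgroup, which is one way to tell a cgroup-level OOM kill from a system-wide one.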

The server was set up externally, but it shouldn't be more than a standard Ubuntu installation.

Peter

 

Peter Ellevseth 

Principal Advisor / Principal Advisor

+47 93 43 56 01 / +47 73 90 05 00

 peter.ellevseth@xxxxxxxxxx

 safetec.no

 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Hartmann <thomas.hartmann@xxxxxxx>
Sent: Monday, August 28, 2023 14:28
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Use of cgroups
 
Hi Peter,

Out of the box, memory is handled by Condor itself rather than by the
memory cgroup (what version are you using?). IIRC one reason is that
otherwise the kernel might kill process trees, and Condor and the users
would have no good clue why. Have you configured your cluster to shift
OOM handling from the starter to the kernel itself?

Cheers,
  Thomas

On 28/08/2023 05.56, Peter Ellevseth wrote:
> Hi all
>
> I am struggling to understand how the cgroup mechanism affects my jobs.
> I have added a fresh node to our cluster. I have started a lot of
> jobs on it, but all of a sudden it starts killing my jobs. I have traced
> it back to the OOM killer. However, the execute machine has 250GB of
> memory and my jobs are not using close to that.
>
> I wanted to try to tune the oom-killer, but I can't seem to find the
> relevant services (systemd-oomd, OS is ubuntu 22.04). Also haven't found
> out how to disable it.
>
> Right now I am able to run about 40 (out of 48 cores) jobs. Each use
> about 0.5% of total memory. When I submit more jobs, the oom-killer
> steps in and kills them.
>
> I am noticing that the OS seems to be using a lot of swap even when
> there is a lot of physical memory available.
>
> Are there any knobs in condor I can tune to aid with this?
>
> P