
Re: [HTCondor-users] Use of cgroups



Hi Peter,

Out of the box, memory is handled by Condor rather than by the memory cgroup (which version are you using?). IIRC, one reason is that otherwise the kernel might kill process trees, and Condor and the users would have no good clue why. Have you configured your cluster to shift OOM handling from the starter to the kernel itself?
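
In case it helps to see which side is doing the killing, here is a minimal sketch (not an HTCondor tool) that walks a cgroup subtree and prints each cgroup's memory limit, current usage and kernel OOM-kill count. It assumes cgroup v2, which is the default on Ubuntu 22.04, mounted at /sys/fs/cgroup. The `htcondor` path in `BASE` is an assumption and will need adjusting to wherever the job cgroups live on your node.

```python
from pathlib import Path


def parse_memory_events(text):
    """Parse the contents of a cgroup v2 memory.events file into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def parse_memory_max(text):
    """Return the limit in bytes, or None if the cgroup is unlimited ("max")."""
    value = text.strip()
    return None if value == "max" else int(value)


def report(base):
    """Yield (cgroup path, limit, current usage, oom_kill count) below base."""
    for events_file in sorted(Path(base).rglob("memory.events")):
        cg = events_file.parent
        try:
            limit = parse_memory_max((cg / "memory.max").read_text())
            current = int((cg / "memory.current").read_text())
            events = parse_memory_events(events_file.read_text())
        except (OSError, ValueError):
            continue  # cgroup vanished or lacks these files; skip it
        yield str(cg), limit, current, events.get("oom_kill", 0)


if __name__ == "__main__":
    BASE = "/sys/fs/cgroup/htcondor"  # assumption: adjust to your node
    for path, limit, current, kills in report(BASE):
        shown = "unlimited" if limit is None else f"{limit / 2**30:.2f} GiB"
        print(f"{path}: limit={shown} "
              f"current={current / 2**30:.2f} GiB oom_kill={kills}")
```

A non-zero oom_kill count on a job cgroup with a limit well below the node's memory would suggest the kernel is enforcing a per-job limit rather than the node running out of memory.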

Cheers,
 Thomas

On 28/08/2023 05.56, Peter Ellevseth wrote:
Hi all

I am struggling to understand how the cgroup mechanism affects my jobs. I have added a fresh node to our cluster. I have started a lot of jobs on it, but all of a sudden it starts killing my jobs. I have traced it back to the OOM killer. However, the execute machine has 250GB of memory and my jobs are not using close to that.

I wanted to try tuning the OOM killer, but I can't seem to find the relevant service (systemd-oomd; the OS is Ubuntu 22.04). I also haven't found out how to disable it.

Right now I am able to run about 40 jobs (on 48 cores). Each uses about 0.5% of total memory. When I submit more jobs, the OOM killer steps in and kills them.
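
To put rough numbers on that, here is a back-of-the-envelope sketch that takes the 250GB total and the 0.5% per job at face value. The variable names are only for illustration.

```python
total_gb = 250            # physical memory on the execute machine
per_job_fraction = 0.005  # each job uses about 0.5% of total memory
running_jobs = 40         # jobs that currently run without being killed
cores = 48                # job slots if every core were used

per_job_gb = total_gb * per_job_fraction
in_use_gb = per_job_gb * running_jobs
full_node_gb = per_job_gb * cores

print(f"per job:   {per_job_gb:.2f} GB")
print(f"{running_jobs} jobs:   {in_use_gb:.2f} GB "
      f"({in_use_gb / total_gb:.0%} of total)")
print(f"{cores} jobs:   {full_node_gb:.2f} GB "
      f"({full_node_gb / total_gb:.0%} of total)")
```

So even with a job on every core, the jobs themselves would account for roughly a quarter of the physical memory.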

I am noticing that the OS seems to be using a lot of swap even when there is a lot of physical memory available.

Are there any knobs in condor I can tune to aid with this?

P

	

Peter Ellevseth
Principal Advisor
+47 93 43 56 01 / +47 73 90 05 00
peter.ellevseth@xxxxxxxxxx
https://safetec.no/


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
