
Re: [HTCondor-users] enforcing cgroups v2 memory limits



Hi Greg,

Our general idea was to reserve 1-2 core-weight equivalents plus a bit of memory on the execution points for system processes, which should not have a real performance impact with SMT enabled.
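A minimal Python sketch of the reservation arithmetic in the previous paragraph. The node size and the reserved amounts (2 hardware threads, 4 GiB) are hypothetical example values, and `Machine`/`job_share` are made-up names; the sketch only shows what is left over for jobs, not how HTCondor itself applies such a reservation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    logical_cores: int  # with SMT enabled this counts hardware threads
    memory_mib: int


def job_share(machine: Machine, reserved_cores: int, reserved_memory_mib: int) -> tuple[int, int]:
    """Return (cores, memory in MiB) left for jobs after reserving a share for system processes."""
    if not 0 <= reserved_cores < machine.logical_cores:
        raise ValueError("core reservation must be smaller than the machine")
    if not 0 <= reserved_memory_mib < machine.memory_mib:
        raise ValueError("memory reservation must be smaller than the machine")
    return machine.logical_cores - reserved_cores, machine.memory_mib - reserved_memory_mib


# Hypothetical execution point: 128 hardware threads, 512 GiB of RAM.
node = Machine(logical_cores=128, memory_mib=512 * 1024)
cores, memory = job_share(node, reserved_cores=2, reserved_memory_mib=4 * 1024)
print(cores, memory)  # 126 520192
```

On a node of this (assumed) size, reserving 2 of 128 hardware threads is about 1.6% of the CPU share, which is in the same 1-2% range discussed below.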

But after another discussion, we are now leaning more towards moving the reweighting/reservation of resource shares into the condor group. Specifically, we occasionally observe execution point startds being overloaded with calculating slot weights. When a startd spends all its time calculating slot weights, the node can drop out of its collector. Although it does not solve the underlying issue (certain user jobs), one option might be to reserve core weights and memory for the whole condor group, but to reweight the job requirements in a transformation so that the combined core weights and memory of the job child cgroups do not add up to 100% but leave 1-2% for the startd (and other processes).
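A minimal Python sketch of the arithmetic such a reweighting transformation would have to perform. It rests on assumptions that are not from this thread: each job's cgroup v2 `cpu.weight` is taken to be `request_cpus * 100`, the headroom is a fixed 2%, and `reweight` is a hypothetical helper, not HTCondor's job transform syntax.

```python
# Hypothetical headroom left for the startd and other processes, in percent.
HEADROOM_PCT = 2

# cgroup v2 accepts cpu.weight values in the range [1, 10000].
CPU_WEIGHT_MIN, CPU_WEIGHT_MAX = 1, 10000


def reweight(request_cpus: int, request_memory_mib: int,
             base_weight: int = 100, headroom_pct: int = HEADROOM_PCT) -> tuple[int, int]:
    """Scale a job's CPU weight and memory limit down by headroom_pct percent.

    Assumes (hypothetically) that a job's cpu.weight is request_cpus * base_weight.
    Integer arithmetic rounds down, so the summed job values come to at most
    (100 - headroom_pct)% of the unscaled totals (apart from the clamp to 1).
    """
    keep = 100 - headroom_pct
    weight = request_cpus * base_weight * keep // 100
    weight = max(CPU_WEIGHT_MIN, min(CPU_WEIGHT_MAX, weight))
    memory_mib = request_memory_mib * keep // 100
    return weight, memory_mib


print(reweight(8, 16000))  # (784, 15680)
```

Two caveats follow from cgroup v2 semantics. `cpu.weight` is relative among sibling cgroups, so scaling every job down by the same factor only frees CPU for the startd if its own cgroup competes as a sibling with a non-trivial weight. The memory limit, by contrast, is absolute: each job's limit ends up slightly below what it requested.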

Cheers,
  Thomas

On 16/02/2024 20.34, Greg Thain via HTCondor-users wrote:
On 2/16/24 05:12, Thomas Hartmann wrote:
Hi all,

I would also like to enforce a memory limit of around 95% of the total memory under cgroups v2. However, I am not sure how Condor's OOM watchdog would react to it.


Hi Thomas:

I'm not quite sure what your requirements are here -- are you OK if any one job goes over the per-slot memory limit, but only care if all the jobs, in total, go over some limit?


-greg
