
Re: [HTCondor-users] nodes without cgrouped jobs?



> On 06/29/2018 03:16 AM, Thomas Hartmann wrote:
> 
> Hi Greg,
> 
> many thanks - turning on cgroup delegation seems to do the trick!! :)
> 
> With the delegate option on for all controllers in the Condor unit's
> service section [1], the job slices survived all new/restarts of other
> units!


I can confirm that Greg's suggestion to set Delegate=true solved a similar cgroup problem for LIGO. In particular, as Greg discovered, systemd under certain circumstances (e.g., systemctl daemon-reload or restarting non-Condor services) was moving Condor user processes in dynamic slots out of their Condor-created cgroups, which bypassed the resource rules put in place for each dynamic slot.
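
For anyone who wants to spot-check a node, here is a minimal Python sketch that reads /proc/<pid>/cgroup for a job process and reports any controllers whose cgroup path no longer sits under a Condor cgroup. The "condor" path marker and the helper names are assumptions for illustration only; the actual cgroup path depends on the local configuration.

```python
from pathlib import Path


def parse_cgroup_file(text):
    """Parse the contents of /proc/<pid>/cgroup into {controller-list: path}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        _hierarchy, controllers, path = line.split(":", 2)
        entries[controllers] = path
    return entries


def controllers_outside(text, marker="condor"):
    """Return the controller lists whose cgroup path does not contain marker.

    The default marker "condor" is an assumption; adjust it to match the
    cgroup path used on your nodes.
    """
    parsed = parse_cgroup_file(text)
    return sorted(c for c, path in parsed.items() if marker not in path)


def check_pid(pid, marker="condor"):
    """Check a live process by reading its /proc/<pid>/cgroup file."""
    text = Path(f"/proc/{pid}/cgroup").read_text()
    return controllers_outside(text, marker)
```

Calling check_pid() on the PID of a job process in a dynamic slot should return an empty list while the job is still inside its cgroups. A non-empty list after a daemon-reload would suggest the process has been moved.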

A cgroup expert should confirm, but according to https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ it looks like Condor should be setting Delegate=yes by default:

"Services must set Delegate=yes for the units they intend to manage subcgroups of. If they create and manipulate cgroups outside of units that have Delegate=yes set, they violate the access contract for control groups."
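
To see whether delegation is actually in effect for a running unit, something like the following Python sketch could be used. It asks systemctl for the unit's Delegate property and parses the answer. The unit name condor.service is an assumption and may differ between installations.

```python
import subprocess


def parse_delegate(show_output):
    """Return True if `systemctl show` output contains Delegate=yes."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Delegate":
            return value.strip().lower() == "yes"
    # Property missing from the output: treat as not delegated.
    return False


def unit_delegates(unit="condor.service"):
    """Query systemd for the Delegate property of a unit.

    The default unit name is an assumption; pass your own if it differs.
    """
    result = subprocess.run(
        ["systemctl", "show", "--property=Delegate", unit],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_delegate(result.stdout)
```

If unit_delegates() returns False on a node, the Delegate setting has not taken effect for that unit.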

Kudos to Greg for tracking this down!

P.S. This has only been confirmed with single-machine testing so far, but I fully expect it will solve a cluster-wide problem once we restart all of the startds with Delegate=true.

--
Stuart Anderson  anderson@xxxxxxxxxxxxxxxx
http://www.ligo.caltech.edu/~anderson