
Re: [HTCondor-users] cgroups question/problem



From: Bob Ball <ball@xxxxxxxxx>
Date: 03/16/2016 11:00 PM
 
> It is interesting, we are getting a fair number of OOM Holds these days,
> but only a few seem to end this way with the WN locked up. One I just
> observed started near the end of what I would call a small storm in the
> number of kernel process creations on the WN. Typical is around <25/s,
> and this one was running around 600-700/s.  I am leaving this WN "as-is"
> until at least tomorrow should there be anything I could pull out of
> this for you.
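
For anyone wanting to watch for that kind of process-creation storm: on Linux, the "processes" line in /proc/stat is a cumulative count of forks since boot, so two samples give a rate. Below is a rough sketch of that idea, not what was actually used on the WN above; the function names and the 5-second default interval are made up for illustration.

```python
import time


def parse_fork_count(stat_text):
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The line of interest looks like: "processes 123456"
        if len(fields) == 2 and fields[0] == "processes":
            return int(fields[1])
    raise ValueError("no 'processes' line found")


def fork_rate(count_before, count_after, seconds):
    """Average process creations per second between two samples."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (count_after - count_before) / seconds


def sample_fork_rate(interval=5.0, path="/proc/stat"):
    """Sample the fork counter twice and return forks per second.

    Assumes a Linux host; 'interval' and 'path' are illustrative defaults.
    """
    with open(path) as f:
        before = parse_fork_count(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = parse_fork_count(f.read())
    return fork_rate(before, after, time.monotonic() - start)
```

Calling sample_fork_rate() on a worker node and comparing the result with its usual baseline (around 25/s in the case above) would show whether it is in a storm like the 600-700/s one described.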

I've seen this sort of thing before too - I think what might have been
happening in the instances where the machine locked up is that the memory
ballooning was happening too quickly for the OOM killer to cope with,
and it ran out of memory itself. That's just a theory, since I never
bothered to peel that onion - I had plenty of other unrelated things to
make me weep.

At the time I was running RHEL6.5. I found that after I got up to
RHEL6.7 - having skipped 6.6 - it stopped happening and my exec nodes
stayed up for months at a time despite the best efforts of my users.
I'm not sure whether that's because they straightened out their code or
because some fix improved the reliability and effectiveness of cgroups
and the OOM killer, but there you have it.

        -Michael Pelletier.