
Re: [HTCondor-users] condor_shadow and file descriptors



Tracebacks are at a level of detail I don't have from my admin team.  I can tell you it forces automatic crashes/reboots.  It has happened multiple times in the past few months with different Condor users.

We will just plan to increase the file descriptor limits.  Juggling too many balls at the moment to study the core dumps ;)
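For anyone following along, here is a rough, untested sketch of what we plan to script to sanity-check the limits before and after the change.  It only uses Python's resource module and standard /proc entries, so treat it as a sketch rather than anything HTCondor-specific (if I remember right, HTCondor also has a MAX_FILE_DESCRIPTORS config knob for its daemons, but that is separate from the OS limits below):

#!/usr/bin/env python
# Untested sketch: report the per-process and system-wide file
# descriptor limits on a Linux submit node, using only the stdlib
# resource module and standard /proc entries.

import resource

# Per-process limit inherited by condor_shadow (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process nofile: soft=%d hard=%d" % (soft, hard))

# System-wide handle usage and limit (allocated, unused, max).
with open("/proc/sys/fs/file-nr") as f:
    allocated, _unused, file_max = (int(x) for x in f.read().split())
print("system-wide file handles: allocated=%d max=%d" % (allocated, file_max))

Nothing fancy, but it should show whether we are anywhere near the ceiling when the schedd is busy.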


On Fri, Sep 27, 2013 at 3:08 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

On Sep 27, 2013, at 2:04 PM, Paul Brenner <paul.r.brenner@xxxxxx> wrote:

> RedHat 6.X with 128GB of RAM and 64 cores on the frontend that most recently crashed.  I would certainly guess that the Scientific Linux defaults are much higher for "non-enterprise workloads".  I guess if you figure a "default" RedHat/CentOS in the enterprise world may be a modest LAMP web server, the file descriptor defaults may not be that surprising.  As mentioned, we have used the current configuration for many years with Grid Engine running 10K+ concurrent jobs and never experienced a file descriptor limitation.  In the world of Condor/HTC the configuration tuning is definitely weighted differently.
>
> Good to know that SL runs at a million plus as a "default".  Our smallest head node has 32GB of RAM, so growing 10x from 65K to 650K should be reasonably low risk.
>

Hi Paul,

Actually, the behavior I described (initial limit based on installed memory) is a kernel default (not an SL-specific tuning).  I'm surprised that RHEL would tune it lower!  I'm kinda scratching my head on that one.
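If you want to see what the kernel picked on a given box, comparing /proc/meminfo against fs.file-max shows the memory-derived default -- roughly speaking it scales with installed RAM, though the exact formula varies by kernel version.  Something like this (quick, untested sketch):

#!/usr/bin/env python
# Quick sketch: compare installed memory with the kernel's
# system-wide file handle limit (fs.file-max).

mem_kb = 0
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith("MemTotal:"):
            mem_kb = int(line.split()[1])
            break

with open("/proc/sys/fs/file-max") as f:
    file_max = int(f.read())

print("MemTotal: %d kB   fs.file-max: %d   (~%.1f handles per MB of RAM)"
      % (mem_kb, file_max, file_max / (mem_kb / 1024.0)))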

I didn't know that running out of file descriptors could crash the kernel.  What do the tracebacks look like?

Brian




--
Paul R Brenner, PhD, P.E.
Center for Research Computing
The University of Notre Dame