
Re: [HTCondor-users] condor_shadow and file descriptors



RedHat 6.X, with 128GB of RAM and 64 cores on the frontend that crashed most recently.  I would guess that the Scientific Linux defaults are much higher for "non enterprise workloads".  If you figure the "default" RedHat/CentOS box in the enterprise world is a modest LAMP webserver, the file descriptor defaults may not be that surprising.  As mentioned, we have used the current configuration for many years with Grid Engine running 10K+ concurrent jobs and never hit a file descriptor limit.  In the Condor/HTC world the configuration tuning is definitely weighted differently.

Good to know that SL runs at a million plus as a "default".  Our smallest head node has 32GB of RAM, so growing 10x from 65K to 650K should be reasonably low risk.
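
For the list archive, here is a rough sketch of the kind of change we are looking at.  The values are illustrative, using the 650K figure above (2K shadows at ~50 descriptors each is already ~100K, so 65K clearly cannot hold), and the "condor" account name and whether the schedd runs under it are assumptions about our setup, not a recommendation.  Since all 20+ frontends share one image, this would go into the common image rather than be set per host:

    # /etc/sysctl.conf -- raise the system-wide descriptor ceiling
    fs.file-max = 655360

    # /etc/security/limits.conf -- per-process limit for the condor user,
    # in case the schedd itself (rather than the kernel table) becomes the bottleneck
    condor  soft  nofile  65536
    condor  hard  nofile  65536

Then "sysctl -p" and a restart of the HTCondor daemons.  If I remember correctly HTCondor also has a MAX_FILE_DESCRIPTORS knob to raise the daemons' own limit, but we have not looked at that yet.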


On Fri, Sep 27, 2013 at 2:50 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

On Sep 27, 2013, at 11:34 AM, Paul Brenner <paul.r.brenner@xxxxxx> wrote:

> Thanks Mats, Dan, and Greg,
>
> We were trying to count how many files each Condor job transferred/opened and could not justify the massive file descriptor requirement.  Now that we understand each shadow process can open 40-50 file descriptors, it is clear that 65K file descriptors is not enough for 2K concurrently running jobs.  The RHEL defaults are an order of magnitude lower than 65K.  Sounds like we will need to raise this another order of magnitude.
>
> We regularly run 10K+ concurrent jobs from the same submit hosts with Grid Engine but the master/slave submission model is totally different.  We will do some quick research regarding any pitfalls for raising the file descriptor count even higher and then proceed accordingly (all of our cluster frontends [20+] have the same image, so we need to be careful with any base OS config changes).
>
>

Hi Paul,

What version of RHEL do you run?

I'm scratching my head a bit because modern kernels set /proc/sys/fs/file-max according to the amount of memory on the machine.  For example, on a machine with 8GB of RAM, this works out to roughly 0.8M file descriptors.  A machine with 32GB of RAM should out-of-the-box have a maximum of 3.2M.
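
A quick way to sanity-check what the kernel actually picked, and how close you are to the ceiling, on one of those frontends:

    # system-wide ceiling chosen at boot
    cat /proc/sys/fs/file-max
    # descriptors allocated, free, and the max (i.e. how close you are to running out)
    cat /proc/sys/fs/file-nr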

Of course, "out of the box" to me refers to SL, not RHEL.  Is it possible that is the difference?

Brian



--
Paul R Brenner, PhD, P.E.
Center for Research Computing
The University of Notre Dame