
Re: [HTCondor-users] condor_shadow and file descriptors



Thanks Mats, Dan, and Greg,

We were trying to count how many files each Condor job transferred/opened and could not justify the massive file descriptor requirement.  Now that we understand each shadow process can open 40-50 file descriptors, it is clear that 65K file descriptors is not enough for 2K concurrently running jobs.  The RHEL defaults are an order of magnitude lower than 65K; it sounds like we will need to raise this setting by another order of magnitude.
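
For anyone else hitting this, a rough way to sanity-check the per-shadow usage on a live submit host would be something like the loop below (just standard Linux tooling, nothing Condor-specific; assumes pgrep is available and that you run it as root or as the owner of the shadow processes):

# count open file descriptors held by each running condor_shadow
for pid in $(pgrep -f condor_shadow); do
    echo "shadow $pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l) open fds"
done

At roughly 40-50 fds per shadow, 2K running jobs already lands in the 80K-100K range, which is well past a 65,536 ceiling.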

We regularly run 10K+ concurrent jobs from the same submit hosts with Grid Engine, but its master/slave submission model is totally different.  We will do some quick research regarding any pitfalls of raising the file descriptor count even higher and then proceed accordingly, along the lines of the settings Mats quotes below (all of our cluster frontends [20+] share the same image, so we need to be careful with any base OS config changes).
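
Once the new values are pushed out, a quick verification pass on each frontend would look roughly like this (standard RHEL commands; the exact limit values are whatever we settle on, not shown here):

# re-read /etc/sysctl.conf without rebooting and confirm the system-wide limit
sysctl -p
cat /proc/sys/fs/file-max

# check the per-process limits picked up from /etc/security/limits.conf
# (run in a fresh login session for the user that owns the schedd/shadow processes)
ulimit -Sn
ulimit -Hn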


On Fri, Sep 27, 2013 at 11:23 AM, Mats Rynge <rynge@xxxxxxx> wrote:
On 09/27/2013 05:44 AM, Paul Brenner wrote:
> The submit host crashes due to exhaustion of the maximum number of file
> descriptors.  We already have raised this default setting in RH Linux to
> 65,536

I'm not sure how we arrived at these numbers, but for a RHEL5 system
which regularly runs 12k jobs, we have in /etc/sysctl.conf:

# we need more file descriptors due to Condor's port usage
fs.file-max = 1639200


And in /etc/security/limits.conf

# we need a lot of file descriptors for Condor work
*                soft    nofile          150000
*                hard    nofile          160000

--
Mats Rynge
USC/ISI - Pegasus Team <http://pegasus.isi.edu>



--
Paul R Brenner, PhD, P.E.
Center for Research Computing
The University of Notre Dame