
Re: [HTCondor-users] condor_shadow and file descriptors



On 09/27/2013 07:44 AM, Paul Brenner wrote:
Hello,

We have a large Linux-based Condor pool (5,000 - 15,000 slots, depending on opportunistic availability). We also have numerous front-end server nodes (all with >32 GB of RAM) from which our campus users can submit jobs.

In terms of scale, we have reached a limitation due to the condor_shadow process's need to hold open file descriptors. With the maximum number of concurrent jobs per submit host set to 2000, we occasionally have users who get nearly 2000 jobs running before the submit host crashes.

The submit host crashes when it exhausts the maximum number of file descriptors. We have already raised this limit from the RHEL default to 65,536.
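For reference, both the per-process and the system-wide limits can be checked directly on a submit host; the following is only a minimal sketch against the standard Linux interfaces (RLIMIT_NOFILE and /proc/sys/fs), nothing Condor-specific:

# Minimal sketch: print the per-process fd limit (what "ulimit -n" and the
# "nofile" entries in /etc/security/limits.conf control) and the
# system-wide limit (fs.file-max), plus current system-wide usage.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process nofile limit : soft=%d hard=%d" % (soft, hard))

with open("/proc/sys/fs/file-max") as f:
    print("system-wide fs.file-max  : %s" % f.read().strip())

with open("/proc/sys/fs/file-nr") as f:
    allocated, unused, maximum = f.read().split()
    print("file handles in use      : %s (of %s)" % (allocated, maximum))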


Paul:

The problem here is that the shadow uses a surprisingly large number of file descriptors. It turns out that on Linux, one fd is consumed per shared library that an application links with, for the duration of the process. Depending on the platform, the shadow is dynamically linked to something like 40 libraries, so 64k open fds is probably still not enough for 2000 jobs. Can you up the per-user fd limit by a factor of 10 more?
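If you want to see how much of that budget the shadows are actually eating on a given submit host, something like the rough sketch below will count it up by walking /proc. The only assumptions are the standard Linux /proc layout and the process name condor_shadow; it has to run as root or as the user owning the shadows, since /proc/<pid>/fd is only readable by the process owner.

# Rough sketch: count the file descriptors held by each running
# condor_shadow on this host, by walking /proc.
import os

def is_shadow(pid):
    # /proc/<pid>/cmdline is NUL-separated; the first entry is argv[0].
    try:
        with open("/proc/%s/cmdline" % pid) as f:
            argv0 = f.read().split("\0")[0]
        return os.path.basename(argv0) == "condor_shadow"
    except IOError:
        return False            # process exited or is unreadable

def fd_count(pid):
    try:
        return len(os.listdir("/proc/%s/fd" % pid))
    except OSError:
        return 0                # no permission, or process exited

shadows = [p for p in os.listdir("/proc") if p.isdigit() and is_shadow(p)]
counts  = [fd_count(p) for p in shadows]

print("condor_shadow processes : %d" % len(shadows))
print("file descriptors held   : %d" % sum(counts))
if counts:
    print("average per shadow      : %.1f" % (float(sum(counts)) / len(counts)))

As a rough budget, something like 40 fds per shadow times 2000 running jobs is on the order of 80,000, which is already past 65,536 before counting anything else on the host.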

We'll update the wiki with this information.

-Greg