
[HTCondor-users] condor_shadow and file descriptors



Hello,

We have a large Linux-based Condor pool (5,000 - 15,000 slots, depending on opportunistic availability).  We also have numerous front-end server nodes (all with >32 GB of RAM) from which our campus users can submit jobs.

In terms of scale, we have hit a limitation caused by the condor_shadow process's need to hold open file descriptors.  With the maximum number of concurrent jobs per submit host set to 2,000, we occasionally have users who get nearly 2,000 jobs running and crash the submit host.

The submit host crashes when it exhausts its maximum number of file descriptors.  We have already raised this default setting in RH Linux to 65,536.
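
In case it helps to characterize the problem, a quick way to see how many descriptors each shadow actually holds, and how close the submit host is to the system-wide limit, is something like the following (standard RHEL tools, run as root on the submit host):

    # Count open descriptors for each running condor_shadow
    for pid in $(pgrep condor_shadow); do
        echo "$pid: $(ls /proc/$pid/fd 2>/dev/null | wc -l)"
    done

    # System-wide file handle usage vs. limit (allocated, free, max)
    cat /proc/sys/fs/file-nr
    sysctl fs.file-max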

I looked through:
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageLargeCondorPools
and
http://research.cs.wisc.edu/htcondor/manual/v7.8/3_3Configuration.html#SECTION004312000000000000000

to see if there was a way to tune the condor_shadow process to not hold open as many file descriptors; unfortunately, I did not find any discussion of this.

Has anyone run into this issue with large submissions?  If the answer is in a prior thread, could it be added to the wiki page on managing large Condor pools?

If the condor_shadow process must open file descriptors for file transfer/spooling at startup, will it close all of them once the transfer is done?  If so, would one potential solution be to artificially "slow down" job startup so that the transfer of initial input files is staggered and the peak number of open descriptors at any one time is reduced?
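
If that is the case, something along these lines in the submit host's condor_config might stagger the load; I have not tested these settings at our scale, so treat the macro names and values as a sketch to experiment with rather than a recipe:

    # Throttle how quickly the schedd spawns new shadows:
    # start at most JOB_START_COUNT jobs every JOB_START_DELAY seconds
    JOB_START_COUNT = 10
    JOB_START_DELAY = 2

    # Cap the number of simultaneous file transfers the schedd will allow
    MAX_CONCURRENT_DOWNLOADS = 10
    MAX_CONCURRENT_UPLOADS   = 10

With JOB_START_COUNT = 10 and JOB_START_DELAY = 2, roughly 300 jobs would start per minute, so a burst of 2,000 jobs would ramp up over several minutes rather than all at once.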

--
Paul R Brenner, PhD, P.E.
Center for Research Computing
The University of Notre Dame