Re: [Condor-users] Out of file descriptors problem
- Date: Thu, 20 Apr 2006 14:21:10 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Out of file descriptors problem
At 07:07 AM 4/20/2006, John Horne wrote:
We are running a Fedora Core 3 server with Condor 6.7.6. We have had a
lot of Windows jobs put into the queue by one user, just over 187,000.
However, we have let these run with no problem for the past month or so.
We have around 1,300 Windows nodes currently available.
The server is configured with 16384 file descriptors for all users, both
the soft and hard limit in /etc/security/limits.conf. As such, I don't see
how it can be out of file descriptors, and how come the problem is only
showing up now?
Just some thoughts off the top of my head:
There are lots of limits. Sounds like you already upped the total
number of descriptors allowed on the system, but there still may be
limits per user id, limits per process, or limits on ephemeral ports.
Take a look at
It may be helpful to you.
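As a rough way to see those different limits side by side on Linux, something like the following sketch works (the /proc paths are assumptions about a Linux kernel; this is illustrative, not Condor-specific):

```python
import resource

# Per-process cap on open descriptors (applies to each shadow individually).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process fd limit: soft=%d hard=%d" % (soft, hard))

# System-wide descriptor limit and current usage (Linux-specific paths).
with open("/proc/sys/fs/file-max") as f:
    print("system-wide fd limit:", f.read().strip())
with open("/proc/sys/fs/file-nr") as f:
    allocated = f.read().split()[0]
    print("fds currently allocated system-wide:", allocated)

# Ephemeral port range: each shadow's outbound sockets draw from this pool.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = (int(x) for x in f.read().split())
    print("ephemeral ports available:", hi - lo + 1)
```

Comparing the "currently allocated" figure against the limits as the queue fills up should show which ceiling you are approaching.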
As for 16k descriptors: with over 1,000 jobs simultaneously running
(and thus over 1,000 shadow processes), that only leaves 16
descriptors per shadow --- and actually far less than that, since you
set your system-wide limit to 16k (i.e. these descriptors are also
used by non-Condor processes on your system). Recall that "file
descriptors" on Unix are used not only for open files but also for open
network sockets. A process is usually born with stdin, stdout, and stderr open, and
then the shadow needs to make network connections to the schedd, to
the remote starter, etc. If the shadow needs to transfer files,
stream stdio from the remote job, or write to logs, that is more
descriptors. If your jobs are standard universe, then the shadow
needs to open a descriptor for each file your remote job opens. I
don't think it is out of the question that the system is
honest-to-goodness running out of descriptors --- I would suggest
checking which limit is actually being hit.
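To put numbers on that budget (the 16k limit and shadow count come from this thread; the per-shadow baseline figures are rough assumptions):

```python
system_limit = 16384   # soft/hard limit from /etc/security/limits.conf
shadows = 1000         # roughly one condor_shadow per running job

budget_per_shadow = system_limit // shadows
print("descriptors per shadow, at best:", budget_per_shadow)

# Each shadow burns several of those before doing any real work
# (assumed baseline: stdin/out/err, sockets to schedd and starter, a log).
baseline = 3 + 2 + 1
print("left for file transfer, streamed stdio, standard-universe opens:",
      budget_per_shadow - baseline)
```

And that is before subtracting whatever every non-Condor process on the machine is holding open.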
Maybe the reason it did not happen in the past is because this submit
machine never was able to start so many shadows in the past, or the
nature of the submitted job changed (i.e. the user now desires to
stream stdout or something). Or maybe some other process on the
system is now using more descriptors than in the past (database? web
server? java container?).
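One way to find out is to count open descriptors per process by listing /proc/<pid>/fd (Linux-specific layout; a minimal sketch, not a polished tool):

```python
import os

def fd_counts():
    """Map each visible pid to its number of open file descriptors."""
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            counts[int(pid)] = len(os.listdir("/proc/%s/fd" % pid))
        except (PermissionError, FileNotFoundError):
            pass  # process exited, or we lack permission to peek
    return counts

# Print the top five descriptor consumers.
for pid, n in sorted(fd_counts().items(), key=lambda kv: -kv[1])[:5]:
    print(pid, n)
```

Run as root so every process is visible; whatever tops the list is the place to start looking.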
hope this is helpful,
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Condor Project Research
http://www.cs.wisc.edu/~tannenba
Department of Computer Sciences, University of Wisconsin-Madison
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777