Re: [Condor-users] Out of file descriptors problem
- Date: Thu, 20 Apr 2006 14:21:10 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Out of file descriptors problem
At 07:07 AM 4/20/2006, John Horne wrote:
We are running a Fedora Core 3 server with Condor 6.7.6. We have had a
lot of Windows jobs put into the queue by one user, just over 187,000.
However, we have let these run with no problem for the past month or so.
We have around 1,300 Windows nodes currently available.
The server is configured with 16384 file descriptors for all users, both
the soft and hard limit in /etc/security/limits.conf. As such, I don't see
how it can be out of file descriptors, and how come the problem is only
showing up now?
Just some thoughts off the top of my head:
There are lots of limits. Sounds like you already upped the total
number of descriptors allowed on the system, but there still may be
limits per user id, limits per process, or limits on ephemeral ports.
Take a look at
It may be helpful to you.
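As a rough way to see those different limits side by side on Linux, something like the following sketch works (the /proc paths are assumptions about a Linux kernel; this is illustrative, not Condor-specific):

```python
import resource

# Per-process cap on open descriptors (applies to each shadow individually).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("per-process fd limit: soft=%d hard=%d" % (soft, hard))

# System-wide descriptor limit and current usage (Linux-specific paths).
with open("/proc/sys/fs/file-max") as f:
    print("system-wide fd limit:", f.read().strip())
with open("/proc/sys/fs/file-nr") as f:
    allocated = f.read().split()[0]
    print("fds currently allocated system-wide:", allocated)

# Ephemeral port range: each shadow's outbound sockets draw from this pool.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = (int(x) for x in f.read().split())
    print("ephemeral ports available:", hi - lo + 1)
```

Comparing the "currently allocated" figure against the limits as the queue fills up should show which ceiling you are approaching.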
As for 16k descriptors: with over 1,000 jobs simultaneously running
(and thus over 1,000 shadow processes), that only leaves 16
descriptors per shadow --- and actually far less than that, since you
set your system-wide limit to 16k (i.e. these descriptors are also
used by non-Condor processes on your system). Recall that "file
descriptors" on Unix are used not only for open files but also for open
network sockets. A process is usually born with stdin, stdout, and stderr open, and
then the shadow needs to make network connections to the schedd, to
the remote starter, etc. If the shadow needs to transfer files,
stream stdio from the remote job, or write to logs, that is more
descriptors. If your jobs are standard universe, then the shadow
needs to open a descriptor for each file your remote job opens. I
don't think it is out of the question that the system is
honest-to-goodness running out of descriptors --- I would suggest
checking which limit is actually being hit.
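To put numbers on that budget (the 16k limit and shadow count come from this thread; the per-shadow baseline figures are rough assumptions):

```python
system_limit = 16384   # soft/hard limit from /etc/security/limits.conf
shadows = 1000         # roughly one condor_shadow per running job

budget_per_shadow = system_limit // shadows
print("descriptors per shadow, at best:", budget_per_shadow)

# Each shadow burns several of those before doing any real work
# (assumed baseline: stdin/out/err, sockets to schedd and starter, a log).
baseline = 3 + 2 + 1
print("left for file transfer, streamed stdio, standard-universe opens:",
      budget_per_shadow - baseline)
```

And that is before subtracting whatever every non-Condor process on the machine is holding open.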
Maybe the reason it did not happen in the past is because this submit
machine never was able to start so many shadows in the past, or the
nature of the submitted job changed (i.e. the user now desires to
stream stdout or something). Or maybe some other process on the
system is now using more descriptors than in the past (database? web
server? java container?).
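One way to find out is to count open descriptors per process by listing /proc/<pid>/fd (Linux-specific layout; a minimal sketch, not a polished tool):

```python
import os

def fd_counts():
    """Map each visible pid to its number of open file descriptors."""
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            counts[int(pid)] = len(os.listdir("/proc/%s/fd" % pid))
        except (PermissionError, FileNotFoundError):
            pass  # process exited, or we lack permission to peek
    return counts

# Print the top five descriptor consumers.
for pid, n in sorted(fd_counts().items(), key=lambda kv: -kv[1])[:5]:
    print(pid, n)
```

Run as root so every process is visible; whatever tops the list is the place to start looking.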
hope this is helpful,
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Condor Project Research
http://www.cs.wisc.edu/~tannenba
Department of Computer Sciences, University of Wisconsin-Madison
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777