
[Condor-users] Out of file descriptors problem



Hello,

We are running a Fedora Core 3 server with Condor 6.7.6. We have had a
lot of Windows jobs put into the queue by one user, just over 187,000.
However, we have let these run with no problem for the past month or so.
We have around 1,300 Windows nodes currently available.

We noticed a few days ago that the number of nodes used dropped to about
20. Nothing on the server had changed. I looked through various log
files but could see no obvious reason. I restarted Condor (using
'condor_restart') and it again claimed around 1000 nodes. (For some
reason there always seem to be a few hundred nodes available. Why
aren't all the nodes used?) However, this morning we are again back to
about 20 nodes being used, when over 1000 are available.
The queue currently has 106,000 jobs in it.

I have found in the log directory for the server the file
'dprintf_failure.SCHEDD'. It contains:

  4/20 12:54:01 dprintf() had a fatal error in pid 30053
  **** PANIC -- OUT OF FILE DESCRIPTORS at line 297 in sock.C
  euid: 1985, ruid: 0

The server is configured with 16384 file descriptors for all users, as
both the soft and hard limit in /etc/security/limits.conf. As such I
don't see how it can be out of file descriptors, nor why the problem is
only showing up now.
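
To make the question concrete, here is a minimal sketch of how the limit
actually in effect, and the number of descriptors in use, could be
checked. It assumes a Linux host with Python available; the function
names are illustrative and nothing here is part of Condor. As far as I
know, limits.conf is applied by pam_limits at login, so a daemon started
from an init script may not inherit those values, which is one reason to
check rather than assume.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid):
    """Count the descriptors open in process `pid` via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"descriptors open in this process: {count_open_fds(os.getpid())}")
```

The limits reported are those of the shell the script is run from, not
necessarily those of the running condor_schedd. Counting the schedd's own
descriptors means passing its PID to count_open_fds() as root or as the
user the daemon runs as.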

I also received an email message from the server indicating the same
problem:

============================================================
This is an automated email from the Condor system
on machine "ltsp.csd.plymouth.ac.uk".  Do not reply.

"/opt/condor/sbin/condor_schedd" on "ltsp.csd.plymouth.ac.uk" exited
with status 44.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
**** PANIC -- OUT OF FILE DESCRIPTORS at line 297 in sock.C
4/20 12:08:12 ******************************************************
4/20 12:08:12 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
4/20 12:08:12 ** /opt/condor-6.7.6/sbin/condor_schedd
4/20 12:08:12 ** $CondorVersion: 6.7.6 Mar 15 2005 $
4/20 12:08:12 ** $CondorPlatform: I386-LINUX_RH9 $
4/20 12:08:12 ** PID = 27967
4/20 12:08:12 ******************************************************
4/20 12:08:12 Using config file: /opt/condor/etc/condor_config
4/20 12:08:12 Using local config files: /opt/condor/hosts/ltsp/condor_config.local
4/20 12:08:12 DaemonCore: Command Socket at <141.163.66.135:39335>
4/20 12:22:39 Sent ad to central manager for phu@xxxxxxxxxxxxxxxxxxxxxxx
4/20 12:22:39 Sent ad to 1 collectors for phu@xxxxxxxxxxxxxxxxxxxxxxx
4/20 12:24:44 DaemonCore: Command received via TCP from host <141.163.66.135:39372>
4/20 12:24:44 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
4/20 12:24:44 Negotiating for owner: phu@xxxxxxxxxxxxxxxxxxxxxxx
4/20 12:24:47 Checking consistency running and runnable jobs
4/20 12:24:47 Tables are consistent
4/20 12:30:55 Out of servers - 1052 jobs matched, 105897 jobs idle, 30 jobs rejected
**** PANIC -- OUT OF FILE DESCRIPTORS at line 297 in sock.C
*** End of file SchedLog
============================================================

I can try increasing the number of file descriptors further, but would
rather ask on the list first whether anyone has any ideas about this.



Thanks,

John.

-- 
---------------------------------------------------------------
John Horne, University of Plymouth, UK  Tel: +44 (0)1752 233914
E-mail: John.Horne@xxxxxxxxxxxxxx       Fax: +44 (0)1752 233839