
[HTCondor-users] idle workers die with panic -- out of file descriptors



Hello all,

I've got a problem with workers dying with "**** PANIC -- OUT OF FILE DESCRIPTORS at line 454 in /slots/01/dir_57518/userdir/src/condor_io/sock.cpp".  There are 4 machines (CentOS 6.3 x64, HTCondor 8.0.6 from the Red Hat rpm in the wisc repo) that all exhibit the same problem.  They're all set up with just MASTER,STARTD.  The panic seems to happen around 10 hours after starting.

Being a new deployment, the queue is idle.  The workers haven't run any jobs since starting (none were submitted).  They all ran jobs earlier, while I was testing.

From googling, it appears that this message is most common on the submit machine, but my submit machine keeps trucking along (Ubuntu 10.04 LTS x64, HTCondor 8.0.6 compiled from source).  The submit machine is also a worker.  condor_status says it's still available for processing.

I've just deployed condor on these machines so I've probably made an error in a config file.  I wouldn't expect to run out of descriptors with nothing to run so I'm a bit lost as to why it happens and how to fix it.
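
One way to check whether the startd is really leaking descriptors while idle is to watch its open-descriptor count over time. This is a minimal sketch, assuming a Linux /proc filesystem and permission to read the target process's /proc entries (typically root for a daemon). The helper names count_open_fds and max_open_files are hypothetical and not part of HTCondor. It avoids f-strings so it should also run on the older Python shipped with CentOS 6.

```python
import os
import sys


def count_open_fds(pid):
    """Return how many file descriptors the process currently has open.

    Relies on the Linux /proc/<pid>/fd directory, which holds one entry
    per open descriptor.
    """
    return len(os.listdir("/proc/%d/fd" % pid))


def max_open_files(pid):
    """Return the soft 'Max open files' limit for pid, or None if it is
    unlimited or the line is missing from /proc/<pid>/limits."""
    with open("/proc/%d/limits" % pid) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units
                soft = line.split()[3]
                return int(soft) if soft.isdigit() else None
    return None


if __name__ == "__main__":
    # Pass the startd pid as the first argument (for example 26476 from
    # the MasterLog); with no argument the script inspects itself.
    args = sys.argv[1:]
    pid = int(args[0]) if args and args[0].isdigit() else os.getpid()
    print("pid %d: %d descriptors open, soft limit %s"
          % (pid, count_open_fds(pid), max_open_files(pid)))
```

Running this every few minutes against the startd pid (from cron, say) would show whether the count climbs steadily toward the limit or jumps just before the panic.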

Tail of MasterLog
04/07/14 17:31:40 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 26383
04/07/14 18:41:52 DefaultReaper unexpectedly called on pid 26383, status 0.
04/07/14 18:41:52 The STARTD (pid 26383) exited with status 0
04/07/14 18:41:52 restarting /usr/sbin/condor_startd in 10 seconds
04/07/14 18:42:02 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 26430
04/07/14 19:52:13 DefaultReaper unexpectedly called on pid 26430, status 0.
04/07/14 19:52:13 The STARTD (pid 26430) exited with status 0
04/07/14 19:52:13 restarting /usr/sbin/condor_startd in 10 seconds
04/07/14 19:52:23 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 26476
**** PANIC -- OUT OF FILE DESCRIPTORS at line 454 in /slots/01/dir_57518/userdir/src/condor_io/sock.cpp
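
The timestamps above can be turned into per-process lifetimes to see how regularly the startd exits. The following is a minimal sketch that parses only the two MasterLog line shapes shown in this excerpt. The function name startd_lifetimes and the regular expressions are illustrative assumptions, not an HTCondor tool, and the date format is assumed to be MM/DD/YY.

```python
import re
from datetime import datetime

# Patterns for the two line shapes that appear in the excerpt above.
START = re.compile(
    r'^(\S+ \S+) Started DaemonCore process "[^"]*condor_startd", '
    r'pid and pgroup = (\d+)')
EXIT = re.compile(
    r'^(\S+ \S+) The STARTD \(pid (\d+)\) exited with status (\d+)')

TIME_FORMAT = "%m/%d/%y %H:%M:%S"  # assumed MM/DD/YY, as in the log

SAMPLE_LOG = '''\
04/07/14 17:31:40 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 26383
04/07/14 18:41:52 DefaultReaper unexpectedly called on pid 26383, status 0.
04/07/14 18:41:52 The STARTD (pid 26383) exited with status 0
04/07/14 18:41:52 restarting /usr/sbin/condor_startd in 10 seconds
04/07/14 18:42:02 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 26430
04/07/14 19:52:13 DefaultReaper unexpectedly called on pid 26430, status 0.
04/07/14 19:52:13 The STARTD (pid 26430) exited with status 0
04/07/14 19:52:13 restarting /usr/sbin/condor_startd in 10 seconds
04/07/14 19:52:23 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 26476
'''


def startd_lifetimes(lines):
    """Return a list of (pid, lifetime, exit_status) for each startd that
    has both a start line and an exit line in the given log lines."""
    started = {}
    results = []
    for line in lines:
        match = START.match(line)
        if match:
            started[match.group(2)] = datetime.strptime(
                match.group(1), TIME_FORMAT)
            continue
        match = EXIT.match(line)
        if match and match.group(2) in started:
            pid = match.group(2)
            ended = datetime.strptime(match.group(1), TIME_FORMAT)
            results.append(
                (int(pid), ended - started.pop(pid), int(match.group(3))))
    return results


if __name__ == "__main__":
    for pid, lifetime, status in startd_lifetimes(SAMPLE_LOG.splitlines()):
        print("pid %d ran for %s, exit status %d" % (pid, lifetime, status))
    # pid 26383 ran for 1:10:12, exit status 0
    # pid 26430 ran for 1:10:11, exit status 0
```

On this excerpt both completed startd processes ran for about 70 minutes before exiting with status 0.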

Non-default settings
DAEMON_LIST = MASTER, STARTD
i7 = True
STARTD_EXPRS = i7, $(STARTD_EXPRS)
COUNT_HYPERTHREAD_CPUS = False
CONDOR_HOST     = 10.1.1.54
COLLECTOR_NAME          = AGBU
ALLOW_WRITE = 10.1.*
DEFAULT_DOMAIN_NAME = agbu.localdomain
NO_DNS = True
TRUST_UID_DOMAIN = True

-- 
Klint Gore
Database Manager
Sheep CRC
A.G.B.U.
University of New England
Armidale NSW 2350

Ph: 02 6773 3789  
Fax: 02 6773 3266
EMail: kgore4@xxxxxxxxxx