
Re: [Condor-users] Lots of TIME_WAIT sockets killing server



What is your value of SEC_DEFAULT_SESSION_DURATION
in your condor_config file?  If you have a lot of machines going in
and out of the pool and you are using TCP, which it sounds like you
are, then the security session is cached for some amount of time
and a socket is kept open for every machine that has been
in the pool.  If you shorten SEC_DEFAULT_SESSION_DURATION it should
help.  In newer versions of Condor it is also possible to
set it per subsystem, e.g. TOOL.SEC_DEFAULT_SESSION_DURATION=60.
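
As a minimal sketch of what that could look like in condor_config
(the numbers are illustrative assumptions, not recommended values;
the duration is given in seconds):

```
# Illustrative values only; tune them for your own pool.
# Shorten how long cached security sessions are kept (seconds).
SEC_DEFAULT_SESSION_DURATION = 3600

# Newer versions only: per-subsystem override, here for the
# command-line tools.
TOOL.SEC_DEFAULT_SESSION_DURATION = 60
```

The daemons have to re-read the configuration (for example with
condor_reconfig) before a changed value can take effect.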

Steve Timm


On Tue, 1 Jun 2010, J.A. Gutierrez wrote:


	Hello

	I've found a problem in condor and I can't find the cause:

	Since we upgraded our Linux condor slave ("execute") nodes
	from Fedora Core 2 to CentOS 5.2 (and then, to CentOS 5.4),
	if condor is active for a couple of days, the condor master host
	gets its connection table filled with thousands of "TIME_WAIT"
	sockets, so no new connections can be opened and the server
	(which also acts as central NFS/NIS+ server) gets killed.


	Our current setup is:

	* NFS/NIS+/Condor master server:

	- Sun SPARC server running Solaris 8.
	- Condor master version 7.4.2

	* NFS/NIS+/Condor clients:

	- x86 PC's running Linux CentOS 5.4
	- Condor 7.4.2
	(when the server starts becoming unresponsive, usually there are
	no more than 6 PC's running condor)


	Condor configuration:

	- Common FILESYSTEM_DOMAIN/UID_DOMAIN on master and slaves
	- USE_NFS = False
	- USE_AFS = False
	- ~condor is local on every PC
	- mostly default settings for everything


	IIRC, the problem started with the upgrade from Fedora Core 2
	to CentOS 5.2, while keeping the same condor installation.
	Then, I upgraded condor to current release, but I got the same
	problem.


	Any idea?


	Thanks...




--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.