
[Condor-users] Lots of TIME_WAIT sockets killing server



	Hello

	I've found a problem in condor and I can't find the cause:

	Since we upgraded our Linux condor slave ("execute") nodes
	from Fedora Core 2 to CentOS 5.2 (and later to CentOS 5.4),
	whenever condor has been active for a couple of days, the condor
	master host gets its connection table filled with thousands of
	sockets in the "TIME_WAIT" state, so no new connections can be
	opened and the server (which also acts as the central NFS/NIS+
	server) becomes unusable.
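
	For anyone who wants to watch the connection table fill up, here
	is a minimal sketch (not part of condor) that tallies TCP states
	from saved "netstat -an" output. It assumes the state is the last
	column of each line, which should hold for plain "netstat -an" on
	both Solaris and Linux; the function name and the sample addresses
	are made up for illustration.

```python
from collections import Counter

# State names as printed by netstat; Linux and Solaris spell a few differently.
TCP_STATES = {
    "LISTEN", "ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT", "LAST_ACK",
    "SYN_SENT", "SYN_RECV", "SYN_RCVD", "CLOSING", "CLOSED",
    "FIN_WAIT1", "FIN_WAIT2", "FIN_WAIT_1", "FIN_WAIT_2",
}


def count_tcp_states(netstat_lines):
    """Count how many sockets are in each TCP state.

    Assumes the state is the last whitespace-separated field of a line;
    headers and non-TCP lines are skipped because their last field is
    not a known state name.
    """
    counts = Counter()
    for line in netstat_lines:
        fields = line.split()
        if fields and fields[-1] in TCP_STATES:
            counts[fields[-1]] += 1
    return counts


# Hypothetical sample lines (addresses and ports are invented).
SAMPLE = [
    "   Local Address        Remote Address    Swind Send-Q Rwind Recv-Q  State",
    "192.168.1.10.9618    192.168.1.21.40112   49640      0 49640      0 TIME_WAIT",
    "192.168.1.10.9618    192.168.1.22.40177   49640      0 49640      0 TIME_WAIT",
    "192.168.1.10.2049    192.168.1.21.799     49640      0 49640      0 ESTABLISHED",
    "tcp        0      0 192.168.1.21:40112     192.168.1.10:9618      TIME_WAIT",
]

if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(state, n)
```

	Passing sys.stdin instead of SAMPLE lets the same function read
	piped netstat output; running it periodically would show how fast
	the TIME_WAIT count grows.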


	Our current setup is:

	* NFS/NIS+/Condor master server:

	- Sun SPARC server running Solaris 8.
	- Condor master version 7.4.2

	* NFS/NIS+/Condor clients:

	- x86 PCs running Linux CentOS 5.4
	- Condor 7.4.2
	(when the server starts becoming unresponsive, there are usually
	no more than 6 PCs running condor)


	Condor configuration:

	- Common FILESYSTEM_DOMAIN/UID_DOMAIN on master and slaves
	- USE_NFS = False 
	- USE_AFS = False
	- ~condor is local on every PC
	- mostly default settings for everything


	IIRC, the problem started with the upgrade from Fedora Core 2
	to CentOS 5.2, while keeping the same condor installation.
	I then upgraded condor to the current release, but the problem
	remained.


	Any ideas?


	Thanks...


-- 
PGP and other useless info at      \
http://webdiis.unizar.es/~spd/      \
finger://daphne.cps.unizar.es/spd    \       Timeo Danaos et dona ferentes
ftp://ivo.cps.unizar.es/pub/          \                         (Virgilio)