
Re: [Condor-users] Lots of TIME_WAIT sockets killing server



> >	if condor is active for a couple of days, the condor master host
> >	gets its connection table filled with thousands of "TIME_WAIT"
> >	sockets, so no new connections can be opened and the server
> >	(which also acts as central NFS/NIS+ server) gets killed.
> >

	More information:

	I started condor master 24 hours ago.

	Now, `condor_status` shows 3 Linux clients and 1 Sparc client.
	
	`condor_q` shows 20 jobs queued, two of which are running
	(with 6 and 18 CPU hours, respectively).

---------------------------------------------------------------------------
  18.0   *****           6/2  11:15   0+06:03:07 I  0   219.7 convert.sh        
  18.1   *****           6/2  11:15   0+19:01:04 R  0   219.7 convert.sh        
  18.2   *****           6/2  11:15   0+00:00:00 I  0   0.0  convert.sh        
	...
---------------------------------------------------------------------------
	
	The log file for job 18.1 has 278 lines saying:

---------------------------------------------------------------------------
	010 (018.001.000) 06/03 06:36:40 Job was suspended.
---------------------------------------------------------------------------

	(the host which started the job was shut down)
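
	(I counted the suspensions with grep; `job.log` below is just a
	placeholder for whatever the submit file's `log` entry points at:)

---------------------------------------------------------------------------
grep -c "Job was suspended" job.log
---------------------------------------------------------------------------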

	On the master, `netstat -an | egrep "17.14.*TIME_WAIT" | tail -1`
	shows:

---------------------------------------------------------------------------
***.***.***.***.32772 10.3.17.14.607        5888      0 24616      0 TIME_WAIT
---------------------------------------------------------------------------

	and `netstat -an | egrep "17.14.*TIME_WAIT" | wc -l` gives "211",
	a count that keeps growing every few minutes...
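
	(In case it helps: a quick way to tally TIME_WAIT sockets per remote
	host, assuming the Solaris-style netstat output shown above, where
	the remote address is the second field:)

---------------------------------------------------------------------------
netstat -an | awk '$NF == "TIME_WAIT" {
        # field 2 is "remote-ip.remote-port"; keep only the address part
        split($2, a, "."); ip = a[1] "." a[2] "." a[3] "." a[4]
        count[ip]++
    }
    END { for (h in count) print count[h], h }' | sort -rn
---------------------------------------------------------------------------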


	The host 10.3.17.14 is up and has joined the Condor pool, but it is
	not executing any job.


	At this point, if I try to log in to 10.3.17.14 as a user (whose
	$HOME is automounted via NFS from the master host), I can't, since
	automount can't mount $HOME (manually mounting directories from the
	master still works).

	The automount process on the client host is unresponsive; I can't
	even kill it with -TERM, I have to use -KILL.

	After restarting automount on the client, I can log in as a user
	again, but in the meantime the number of "TIME_WAIT" sockets on the
	server has grown to 412.

	But shortly after that, the "TIME_WAIT" sockets for 10.3.17.14 are
	gone! (replaced by 217 similar sockets to 10.3.17.12).

	The port on the server is always "32772", which is the port assigned
	to rpc.nisd (the NIS+ service daemon)...
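
	(For anyone wanting to double-check which RPC service owns a given
	port on their own server, rpcinfo can map it; a small sketch, using
	the port number from above:)

---------------------------------------------------------------------------
# the last column of `rpcinfo -p` is the service name from /etc/rpc
rpcinfo -p localhost | grep 32772
---------------------------------------------------------------------------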




	So I must conclude it's a problem with the Linux NIS+
	client/automount which is only triggered by Condor; but I can't
	imagine how.

	It seems a very dirty workaround could be monitoring TIME_WAIT
	sockets on the master and restarting automountd on the hosts with
	lots of them (a rough sketch follows below), but I'd like to find
	a better solution.
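
	Something along these lines, run from cron on the master (only a
	sketch: the threshold, the client list and the restart command are
	all guesses that would need adapting to the local setup):

---------------------------------------------------------------------------
#!/bin/sh
# Hypothetical watchdog: restart automountd on any client that has piled
# up too many TIME_WAIT sockets against this server. All values below
# are examples, not a tested configuration.

THRESHOLD=100                       # guessed per-client limit
CLIENTS="10.3.17.12 10.3.17.14"     # example addresses from above

for host in $CLIENTS; do
    n=`netstat -an | grep "$host" | grep -c TIME_WAIT`
    if [ "$n" -gt "$THRESHOLD" ]; then
        echo "`date`: $host has $n TIME_WAIT sockets, restarting automountd"
        # the restart command depends on the client's init scripts;
        # this is just one possibility
        rsh "$host" "/etc/init.d/autofs restart"
    fi
done
---------------------------------------------------------------------------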


-- 
PGP and other useless info at      \
http://webdiis.unizar.es/~spd/      \
finger://daphne.cps.unizar.es/spd    \       Timeo Danaos et dona ferentes
ftp://ivo.cps.unizar.es/pub/          \                         (Virgilio)