
Re: [Condor-users] Computers missing from Condor pool

Hi Erik,

Thank you for your reply. I've been wary of changing to TCP because of the warnings in condor_config and the manual, as well as the effect it might have on network / system load, but I'm willing to explore this option further.

From the manual, I understand that I need to set COLLECTOR_SOCKET_CACHE_SIZE to the number of machines in the pool multiplied by the number of daemons per machine, and that the collector process must be able to manage at least that many file descriptors. In our case, this means the collector would need at least 10,000 file descriptors.
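
As a minimal sketch of that sizing rule in Python: the machine and daemon counts below are hypothetical placeholders, chosen only so that the product matches the 10,000 figure above, and the function name is made up for illustration.

```python
def socket_cache_size(machines: int, daemons_per_machine: int) -> int:
    """Sizing rule as described above: one cached socket per daemon in the pool."""
    if machines < 0 or daemons_per_machine < 0:
        raise ValueError("counts must be non-negative")
    return machines * daemons_per_machine


# Hypothetical pool: 2,500 machines with 4 updating daemons each.
required = socket_cache_size(machines=2500, daemons_per_machine=4)
print(f"COLLECTOR_SOCKET_CACHE_SIZE = {required}")
```

The collector would then need at least `required` file descriptors, plus whatever it already uses for log files and other connections.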

The default OS-wide limit on file descriptors seems high enough at 206,151, but the default per-process limit on Linux seems to be 1024, so to enable TCP updates I'd have to raise that roughly tenfold. Is that a safe thing to do?
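
For reference, one way to inspect the per-process limit is Python's standard `resource` module. This is only a sketch: it reports the limits of the process running the script, which are not necessarily those the collector runs under, and the `can_hold` helper and the 10,000 target are illustrative assumptions.

```python
import resource


def can_hold(required: int, limit: int) -> bool:
    """True if a per-process descriptor limit covers `required` descriptors."""
    return limit == resource.RLIM_INFINITY or limit >= required


# Soft limit is what the process gets now; hard limit is the ceiling
# an unprivileged process may raise its soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

required = 10_000  # assumed target, taken from the estimate above
for name, limit in (("soft", soft), ("hard", hard)):
    verdict = "sufficient" if can_hold(required, limit) else "too low"
    print(f"{name} limit = {limit}: {verdict} for {required} descriptors")
```

If the soft limit is too low but the hard limit is sufficient, the limit can be raised without root privileges in the environment that starts the daemons; otherwise the hard limit has to be raised by an administrator first.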


Rob de Graaf

Erik Paulson wrote:
> On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
>> The suggested fix, adding a delay by setting the D_NETWORK debug flag, has been applied on all computers and has had some effect; the average pool size has gone up, but not by as much as we had hoped, and ping sweeps still reveal many more live machines not appearing in the pool, leading us to believe there is still some other problem.
>>
>> We've looked at master and startd log files but we haven't been able to find anything seriously wrong, and we're running out of ideas.
>>
>> What could be causing computers to sometimes be missing from our pool, and what else can we do to find them?
>
> Turn on TCP updates to the collector, instead of UDP.

