Re: [Condor-users] Computers missing from Condor pool
- Date: Tue, 26 Feb 2008 16:40:25 +0100
- From: Rob de Graaf <r.degraaf@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Computers missing from Condor pool
Thank you for your reply. I've been wary of changing to TCP because of
the warnings in condor_config and the manual, as well as the effect it
might have on network / system load, but I'm willing to explore this.
From the manual, I understand I need to set COLLECTOR_SOCKET_CACHE_SIZE
to the number of machines in the pool, multiplied by the number of
daemons per machine, and the collector process will need to be able to
manage at least that many file descriptors. In our case, this means the
collector would need at least 10,000 file descriptors.
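As a rough sketch of that arithmetic (Python, for illustration only): the
helper name and the pool figures below are hypothetical, picked just so the
product lands on the 10,000 figure above. Only the rule itself (machines
multiplied by daemons per machine) comes from the manual.

```python
def collector_socket_cache_size(machines: int, daemons_per_machine: int) -> int:
    """Rule of thumb from the manual: one cached socket per daemon per machine."""
    if machines < 0 or daemons_per_machine < 0:
        raise ValueError("counts must be non-negative")
    return machines * daemons_per_machine


# Hypothetical pool: these two numbers are assumptions, not measurements.
example_machines = 2500
example_daemons_per_machine = 4

needed = collector_socket_cache_size(example_machines, example_daemons_per_machine)

# The collector must also be allowed at least this many file descriptors.
print(f"COLLECTOR_SOCKET_CACHE_SIZE = {needed}")
```

With real numbers for the pool substituted in, the printed line can be
compared against the per-process descriptor limit on the collector host.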
The default OS-wide limit on file descriptors seems high enough at
206,151, but the default per-process limit on file descriptors in Linux
seems to be 1024, so to enable TCP updates I'd have to increase that by
a factor of 10. Is that a safe thing to do?
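For reference, one way to see the per-process limits a process actually runs
with is Python's standard resource module (Unix only). This is a minimal
sketch: the function name is made up for illustration, and 10,000 is just the
estimate from above. It reports the limits of the Python process itself, so
it would have to be run in the same environment that starts the collector to
say anything about the collector.

```python
import resource


def check_nofile_limit(required: int) -> dict:
    """Compare this process's open-file limits against a required count."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def covers(limit: int) -> bool:
        # RLIM_INFINITY means "no limit", which trivially covers any request.
        return limit == resource.RLIM_INFINITY or limit >= required

    return {
        "soft": soft,
        "hard": hard,
        "soft_ok": covers(soft),  # usable right now
        "hard_ok": covers(hard),  # reachable without root by raising the soft limit
    }


report = check_nofile_limit(10_000)
print(report)
```

If soft_ok is false but hard_ok is true, the soft limit can be raised up to
the hard limit without extra privileges; if both are false, the hard limit
itself has to be raised by an administrator.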
Rob de Graaf
Erik Paulson wrote:
> On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
>> The suggested fix, adding a delay by setting the D_NETWORK debug flag,
>> has been applied on all computers and has had some effect; the average
>> pool size has gone up, but not by as much as we had hoped, and ping
>> sweeps still reveal many more live machines not appearing in the pool,
>> leading us to believe there is still some other problem.
>> We've looked at master and startd log files but we haven't been able to
>> find anything seriously wrong, and we're running out of ideas.
>> What could be causing computers to sometimes be missing from our pool,
>> and what else can we do to find them?
>
> Turn on TCP updates to the collector, instead of UDP.