
Re: [Condor-users] Computers missing from Condor pool




Yes, you would need to increase the per-process limit on file descriptors. You can do that in the condor init script, for example.
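As a rough illustration of what raising the limit involves, here is a minimal Python sketch that raises a process's soft descriptor limit as far as its hard limit allows. The function name is made up for this example and is not part of Condor. In an init script the usual equivalent is the shell's `ulimit -n`, run before the daemons start, so that they inherit the higher limit.

```python
import resource

def raise_nofile_limit(wanted):
    """Raise this process's soft RLIMIT_NOFILE toward `wanted`.

    The soft limit can only go as high as the hard limit unless the
    process is privileged, so the request is capped at the hard limit.
    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)
    # Only ever raise the limit; never lower an already generous one.
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Child processes inherit the limit, which is why setting it in the script that launches the daemons is enough.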

I am not aware of anybody who has tested the scalability of the collector with ~10,000 open TCP connections. Up until Condor 6.9.3, Condor would fail with more than 1024, so don't try it in 6.8. In the worst case, if there are scalability problems in the collector due to the number of open connections, you would be able to work around the problem by having a bank of N collectors, with each execute machine configured to report to just one of them. Then (in 7.0.1) you can configure these collectors to forward ClassAds to a single collector that is used for matchmaking purposes. This forwarding happens via UDP, but since it would be Linux to Linux, you shouldn't suffer from the Windows UDP problem.
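One possible way to decide which collector each execute machine reports to is a stable hash of its hostname. The sketch below is only an illustration of that idea: the function and the hostnames are hypothetical, and it does not show any Condor configuration.

```python
import hashlib

def pick_collector(hostname, collectors):
    """Map an execute machine to exactly one collector in the bank.

    SHA-1 is used because, unlike Python's built-in hash() on strings,
    it gives the same answer in every process and on every machine.
    """
    digest = hashlib.sha1(hostname.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]

# Hypothetical bank of N = 3 collectors.
bank = ["collector1.example.org", "collector2.example.org", "collector3.example.org"]
assignment = {h: pick_collector(h, bank) for h in ("pc001", "pc002", "pc003")}
```

With a reasonably uniform hash, each collector ends up holding roughly 1/N of the connections. Note that changing N moves most machines to a different collector.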

Of course, the real solution is for Condor to work around the Windows UDP problem if at all possible. I hope this will be addressed soon.

--Dan

Rob de Graaf wrote:

Hi Erik,

Thank you for your reply. I've been wary of changing to TCP because of the warnings in condor_config and the manual, as well as the effect it might have on network / system load, but I'm willing to explore this option further.

From the manual, I understand I need to set COLLECTOR_SOCKET_CACHE_SIZE to the number of machines in the pool, multiplied by the number of daemons per machine, and the collector process will need to be able to manage at least that many file descriptors. In our case, this means the collector would need at least 10,000 file descriptors.
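The sizing rule above can be sketched in a few lines of Python. The machine and daemon counts are placeholders chosen only to land on the 10,000 figure, not our actual numbers, and the `reserve` margin for log files and other descriptors is an assumption of this sketch.

```python
import resource

def sockets_needed(machines, daemons_per_machine):
    """Socket cache size per the rule: machines times daemons per machine."""
    return machines * daemons_per_machine

def fits_in_soft_limit(needed, reserve=64):
    """Check whether `needed` sockets plus a small reserve fit under
    the current per-process soft limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return needed + reserve <= soft

# Placeholder figures, for illustration only.
needed = sockets_needed(machines=5000, daemons_per_machine=2)
enough = fits_in_soft_limit(needed)
```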

The default OS-wide limit on file descriptors seems high enough at 206,151, but the default per-process limit on file descriptors in Linux seems to be 1024, so to enable TCP updates I'd have to increase that by a factor of 10. Is that a safe thing to do?

Regards,

Rob de Graaf

Erik Paulson wrote:
On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
The suggested fix, adding a delay by setting the D_NETWORK debug flag, has been applied on all computers and has had some effect; the average pool size has gone up, but not by as much as we had hoped, and ping sweeps still reveal many more live machines not appearing in the pool, leading us to believe there is still some other problem.

We've looked at master and startd log files but we haven't been able to find anything seriously wrong, and we're running out of ideas.

What could be causing computers to sometimes be missing from our pool, and what else can we do to find them?

Turn on TCP updates to the collector, instead of UDP.

-Erik

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/