Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Computers missing from Condor pool

Date: Tue, 26 Feb 2008 12:38:54 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Computers missing from Condor pool

Yes, you would need to increase the per-process limit on file descriptors. You can do that in the Condor init script, for example, before the daemons are started.
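As a rough illustration of the sizing involved, here is a minimal Python sketch. It is not part of Condor: the function names, the headroom of 256 descriptors, and the example pool of 5,000 machines with 2 daemons each are all assumptions made up for this example. It estimates how many descriptors the collector would need and raises the soft limit of the current process if the hard limit allows it. A raised limit is inherited by child processes, which is why setting it in the init script (for instance with the shell's `ulimit -n`) before the daemons start has the desired effect.

```python
import resource


def required_descriptors(machines, daemons_per_machine, headroom=256):
    """Estimate the descriptors the collector needs.

    One cached socket per reporting daemon, plus headroom (an assumed
    value) for log files, listening sockets, and so on.
    """
    return machines * daemons_per_machine + headroom


def raise_nofile_limit(required):
    """Raise this process's soft RLIMIT_NOFILE to `required` if needed.

    Returns the soft limit in effect afterwards. Raises RuntimeError when
    the hard limit is too low; in that case root has to raise the hard
    limit first.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= required:
        return soft  # already sufficient, nothing to change
    if hard != resource.RLIM_INFINITY and required > hard:
        raise RuntimeError(
            f"need {required} descriptors but the hard limit is {hard}"
        )
    resource.setrlimit(resource.RLIMIT_NOFILE, (required, hard))
    return required


if __name__ == "__main__":
    # Hypothetical pool: 5,000 machines, 2 daemons each.
    needed = required_descriptors(machines=5000, daemons_per_machine=2)
    print("descriptors needed:", needed)
    print("soft limit now:", raise_nofile_limit(needed))
```

With the hypothetical numbers above, the estimate comes to 10,256 descriptors, which is in line with the roughly 10,000 discussed in this thread and well over the default per-process limit of 1024.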

I am not aware of anybody who has tested the scalability of the collector with ~10,000 open TCP connections. Up until Condor 6.9.3, Condor would fail with more than 1024, so don't try it in 6.8. In the worst case, if there are scalability problems in the collector due to the number of open connections, you would be able to work around the problem by having a bank of N collectors with each execute machine configured to report to just one. Then (in 7.0.1) you can configure these collectors to forward ClassAds to a single collector that is used for matchmaking purposes. This forwarding happens via UDP, but since it would be Linux to Linux, you shouldn't suffer from the Windows UDP problem.
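One way to split execute machines across such a bank is a stable hash of the hostname, so each machine always maps to the same collector without any central bookkeeping. The following Python sketch only illustrates that idea. The hostnames, the collector names, and the helper functions are hypothetical, and the forwarding setup on the collectors themselves is not shown. The chosen collector would then go into that machine's Condor configuration.

```python
import hashlib


def assign_collector(hostname, collectors):
    """Pick one collector for a host using a stable hash of its name."""
    if not collectors:
        raise ValueError("at least one collector is required")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket key.
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


def assign_all(hostnames, collectors):
    """Map every execute machine to exactly one collector."""
    return {host: assign_collector(host, collectors) for host in hostnames}


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    bank = ["collector1.example.org", "collector2.example.org"]
    machines = [f"exec{i:03d}.example.org" for i in range(6)]
    for host, collector in assign_all(machines, bank).items():
        print(host, "->", collector)
```

The mapping depends only on the hostname and the list of collectors, so it can be recomputed on every machine independently. Note that changing the size of the bank reshuffles most assignments under this simple modulo scheme.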

Of course, the real solution is for Condor to work around the Windows UDP problem if at all possible. I hope this will be addressed soon.


--Dan

Rob de Graaf wrote:

Hi Erik,
Thank you for your reply. I've been wary of changing to TCP because of the warnings in condor_config and the manual, as well as the effect it might have on network / system load, but I'm willing to explore this option further.
From the manual, I understand I need to set COLLECTOR_SOCKET_CACHE_SIZE to the number of machines in the pool, multiplied by the number of daemons per machine, and the collector process will need to be able to manage at least that many file descriptors. In our case, this means the collector would need at least 10,000 file descriptors.
The default OS-wide limit on file descriptors seems high enough at 206,151, but the default per-process limit on file descriptors in Linux seems to be 1024, so to enable TCP updates I'd have to increase that by a factor of 10. Is that a safe thing to do?
Regards,

Rob de Graaf

Erik Paulson wrote:
On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
The suggested fix, adding a delay by setting the D_NETWORK debug flag, has been applied on all computers and has had some effect; the average pool size has gone up, but not by as much as we had hoped, and ping sweeps still reveal many more live machines not appearing in the pool, leading us to believe there is still some other problem.
We've looked at master and startd log files but we haven't been able to find anything seriously wrong, and we're running out of ideas.
What could be causing computers to sometimes be missing from our pool, and what else can we do to find them?
Turn on TCP updates to the collector, instead of UDP.

-Erik

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at: https://lists.cs.wisc.edu/archive/condor-users/

