Re: [Condor-users] Computers missing from Condor pool
Perhaps you've investigated this path already, but I thought I would
mention it... Might be useful information to someone else...
Run

    condor_status -l | condor_updates_stats | grep "Stats:"

and check for lost updates.
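
In case it helps, here is a minimal Python sketch for totalling those "Stats:" lines across the whole pool. It assumes each line looks like "Stats: Total=713, Seq=712, Lost=3 (0.42%)"; that format is an assumption on my part, so check it against your own output first. The names STATS_RE and summarize_lost_updates are made up for this example.

```python
import re

# Assumed shape of a condor_updates_stats summary line, e.g.
#   "Stats: Total=713, Seq=712, Lost=3 (0.42%)"
# Verify this against real output before relying on it.
STATS_RE = re.compile(r"Stats:\s*Total=(\d+),\s*Seq=(\d+),\s*Lost=(\d+)")


def summarize_lost_updates(lines):
    """Sum the Total and Lost counters over all matching lines.

    Returns (total, lost, lost_fraction); lost_fraction is 0.0 when
    no updates were counted at all.
    """
    total = 0
    lost = 0
    for line in lines:
        match = STATS_RE.search(line)
        if match is None:
            continue  # skip anything that is not a Stats: line
        total += int(match.group(1))
        lost += int(match.group(3))
    fraction = lost / total if total else 0.0
    return total, lost, fraction
```

To use it, pass in the lines of the command output (for example sys.stdin when piping) and print the returned fraction.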
We get anything between 0% and 60% lost updates; the result is that at
any one time around 10% to 20% of our machines are not visible. The
reason is a problem with our Cisco routers, which implement a
security 'feature' that drops the first packet of each new
communication from a computer that hasn't communicated in a few
minutes.
For reference, the problem was originally discussed here -
https://lists.cs.wisc.edu/archive/condor-users/2007-October/msg00074.shtml
The only reason we get machines participating at all is that our
machines are generally talkative, so the router trusts them
and doesn't kick in the security 'feature' all the time. So far we've
put up with the problem, knowing that some machines
which were not present one minute will be present the next, and vice
versa. It doesn't seem to affect job throughput (much): when a
machine appears, it still successfully accepts jobs etc.
There are two solutions. Lowering the time interval between machines
pinging the collector from 300 seconds down to 100 reduces the
problem. Otherwise, we may investigate switching to TCP in the future.
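
As a back-of-envelope illustration of why a shorter interval helps, here is a rough Python sketch. It rests on two assumptions that are mine, not measured: the collector drops a machine's ad after 900 seconds without an update (I believe this is the CLASSAD_LIFETIME default, but please check), and each update is lost independently with the same probability. The second assumption is crude, since our router drops packets based on idle time. The function name invisible_fraction is hypothetical.

```python
def invisible_fraction(loss_rate, update_interval, ad_lifetime=900):
    """Estimate the fraction of machines missing from the collector.

    Simplified model (an assumption, not Condor's actual logic): a machine
    vanishes only if every update sent within one ad lifetime is lost, and
    losses are independent with probability loss_rate.
    """
    # Number of update attempts that fit in one ad lifetime (at least one).
    attempts = max(1, ad_lifetime // update_interval)
    return loss_rate ** attempts


if __name__ == "__main__":
    for interval in (300, 100):
        estimate = invisible_fraction(0.6, interval)
        print(f"interval={interval}s -> about {estimate:.1%} invisible")
```

Under these assumptions, 60% loss gives roughly 22% of machines missing with a 300 second interval and about 1% with 100 seconds, which is at least in the same ballpark as what we see.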
regards,
james
On Wed, Feb 27, 2008 at 5:38 AM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
>
> Yes, you would need to increase the per-process limit on file
> descriptors. You can do that in the condor init script, for example.
>
> I am not aware of anybody who has tested the scalability of the
> collector with ~10,000 open TCP connections. Up until Condor 6.9.3,
> Condor would fail with more than 1024, so don't try it in 6.8. In the
> worst case, if there are scalability problems in the collector due to
> number of open connections, you would be able to work around the problem
> by having a bank of N collectors with each execute machine configured to
> report to just one. Then (in 7.0.1) you can configure these collectors
> to forward ClassAds to a single collector that is used for matchmaking
> purposes. This forwarding happens via UDP, but since it would be Linux
> to Linux, you shouldn't suffer from the Windows UDP problem.
>
> Of course, the real solution is for Condor to work around the Windows
> UDP problem if at all possible. I hope this will be addressed soon.
>
> --Dan
>
>
>
> Rob de Graaf wrote:
>
> >Hi Erik,
> >
> >Thank you for your reply. I've been wary of changing to TCP because of
> >the warnings in condor_config and the manual, as well as the effect it
> >might have on network / system load, but I'm willing to explore this
> >option further.
> >
> >From the manual, I understand I need to set COLLECTOR_SOCKET_CACHE_SIZE
> >to the number of machines in the pool, multiplied by the number of
> >daemons per machine, and the collector process will need to be able to
> >manage at least that many file descriptors. In our case, this means the
> >collector would need at least 10.000 file descriptors.
> >
> >The default OS-wide limit on file descriptors seems high enough at
> >206.151, but the default per-process limit on file descriptors in Linux
> >seems to be 1024, so to enable TCP updates I'd have to increase that by
> >a factor of 10. Is that a safe thing to do?
> >
> >Regards,
> >
> >Rob de Graaf
> >
> >Erik Paulson wrote:
> >
> >
> >>On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
> >>
> >>
> >>>The suggested fix, adding a delay by setting the D_NETWORK debug flag,
> >>>has been applied on all computers and has had some effect; the average
> >>>pool size has gone up, but not by as much as we had hoped, and ping
> >>>sweeps still reveal many more live machines not appearing in the pool,
> >>>leading us to believe there is still some other problem.
> >>>
> >>>We've looked at master and startd log files but we haven't been able to
> >>>find anything seriously wrong, and we're running out of ideas.
> >>>
> >>>What could be causing computers to sometimes be missing from our pool,
> >>>and what else can we do to find them?
> >>>
> >>>
> >>>
> >>Turn on TCP updates to the collector, instead of UDP.
> >>
> >>-Erik
> >>
> >>_______________________________________________
> >>Condor-users mailing list
> >>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> >>subject: Unsubscribe
> >>You can also unsubscribe by visiting
> >>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>
> >>The archives can be found at:
> >>https://lists.cs.wisc.edu/archive/condor-users/
> >>
> >>