
Re: [Condor-users] Computers missing from Condor pool



Perhaps you've investigated this path already, but I thought I would
mention it... Might be useful information to someone else...

Run "condor_status -l | condor_updates_stats | grep Stats:" and
check for lost updates.
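
If you would rather rank machines by loss rate than read the raw
output, here is a minimal sketch that works from a saved copy of the
"condor_status -l" output. It assumes that each machine ad carries the
collector's UpdatesSequenced and UpdatesLost counters, that ads are
separated by blank lines, and that lost/sequenced is a reasonable loss
measure. The function names and the 10% threshold are made up for
illustration.

```python
import re

# Matches ClassAd lines such as: UpdatesLost = 3
_ATTR = re.compile(r"^(\w+)\s*=\s*(.+)$")


def parse_ads(text):
    """Split `condor_status -l` output into one dict per ad.

    Assumption: ads are separated by blank lines.
    """
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                ads.append(current)
                current = {}
            continue
        match = _ATTR.match(line)
        if match:
            # Drop surrounding quotes from string-valued attributes.
            current[match.group(1)] = match.group(2).strip().strip('"')
    if current:
        ads.append(current)
    return ads


def lost_fraction(ad):
    """Return lost/sequenced for one ad, or None if counters are missing."""
    try:
        lost = int(ad["UpdatesLost"])
        sequenced = int(ad["UpdatesSequenced"])
    except (KeyError, ValueError):
        return None
    return lost / sequenced if sequenced else 0.0


def worst_offenders(text, threshold=0.10):
    """List (name, fraction) for ads at or above the threshold, worst first."""
    rows = []
    for ad in parse_ads(text):
        fraction = lost_fraction(ad)
        if fraction is not None and fraction >= threshold:
            rows.append((ad.get("Name", "?"), fraction))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Feed it the text of a file produced with "condor_status -l > ads.txt";
ads that lack the counters are skipped rather than counted as healthy.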

We get anywhere from 0% to 60% lost updates; the result is that at
any one time around 10% to 20% of our machines are not visible. The
cause is a problem with our Cisco routers, which implement a
security 'feature' that drops the first packet of each new
communication from a computer that hasn't communicated in a few
minutes.

For reference, the problem was originally discussed here -
https://lists.cs.wisc.edu/archive/condor-users/2007-October/msg00074.shtml

The only reason we get machines participating at all is that our
machines are generally talkative, so the router trusts them and
doesn't kick in the security 'feature' all the time. So far we've
put up with the problem, knowing that some machines which were not
present one minute will be present the next, and vice versa. It
doesn't seem to affect job throughput (much): when a machine
appears, it still successfully accepts jobs etc.

There are two solutions. Lowering the time interval between machines
pinging the collector from 300 seconds down to 100 reduces the
problem. Otherwise, we may investigate switching to TCP in the future.
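
For anyone weighing the same two options, here is a rough sketch that
prints the settings involved. It assumes the interval in question is
the UPDATE_INTERVAL knob and that TCP updates are enabled with
UPDATE_COLLECTOR_WITH_TCP; the cache size follows the sizing rule
quoted further down in this thread (machines times daemons per
machine). The function name, the 256-descriptor reserve, and the
example pool size are hypothetical.

```python
def collector_update_plan(machines, daemons_per_machine, fd_soft_limit,
                          update_interval=100, reserve=256):
    """Sketch the config lines and descriptor needs for the two options.

    `reserve` is a hypothetical allowance for the collector's own
    files and sockets; it is not a documented Condor value.
    """
    cache_size = machines * daemons_per_machine
    needed_fds = cache_size + reserve
    return {
        # Option 1: send UDP updates more often (set on execute machines).
        "udp_config": [f"UPDATE_INTERVAL = {update_interval}"],
        # Option 2: switch updates to TCP and size the socket cache.
        "tcp_config": [
            "UPDATE_COLLECTOR_WITH_TCP = True",
            f"COLLECTOR_SOCKET_CACHE_SIZE = {cache_size}",
        ],
        "needed_fds": needed_fds,
        # True when the per-process limit must be raised before using TCP.
        "raise_fd_limit": needed_fds > fd_soft_limit,
    }


# Hypothetical pool: 5000 machines, 2 daemons each, default limit of 1024.
plan = collector_update_plan(5000, 2, fd_soft_limit=1024)
for line in plan["udp_config"] + plan["tcp_config"]:
    print(line)
```

With those example numbers the cache size comes to 10000, which is
well past a 1024 per-process descriptor limit, so the limit would
have to be raised before turning TCP on.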

regards,

james

On Wed, Feb 27, 2008 at 5:38 AM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
>
>  Yes, you would need to increase the per-process limit on file
>  descriptors.  You can do that in the condor init script, for example.
>
>  I am not aware of anybody who has tested the scalability of the
>  collector with ~10,000 open TCP connections.  Up until Condor 6.9.3,
>  Condor would fail with more than 1024, so don't try it in 6.8.  In the
>  worst case, if there are scalability problems in the collector due to
>  number of open connections, you would be able to work around the problem
>  by having a bank of N collectors with each execute machine configured to
>  report to just one.  Then (in 7.0.1) you can configure these collectors
>  to forward ClassAds to a single collector that is used for matchmaking
>  purposes.  This forwarding happens via UDP, but since it would be Linux
>  to Linux, you shouldn't suffer from the Windows UDP problem.
>
>  Of course, the real solution is for Condor to work around the Windows
>  UDP problem if at all possible.  I hope this will be addressed soon.
>
>  --Dan
>
>
>
>  Rob de Graaf wrote:
>
>  >Hi Erik,
>  >
>  >Thank you for your reply. I've been wary of changing to TCP because of
>  >the warnings in condor_config and the manual, as well as the effect it
>  >might have on network / system load, but I'm willing to explore this
>  >option further.
>  >
>  >From the manual, I understand I need to set COLLECTOR_SOCKET_CACHE_SIZE
>  >to the number of machines in the pool, multiplied by the number of
>  >daemons per machine, and the collector process will need to be able to
>  >manage at least that many file descriptors. In our case, this means the
>  >collector would need at least 10,000 file descriptors.
>  >
>  >The default OS-wide limit on file descriptors seems high enough at
>  >206,151, but the default per-process limit on file descriptors in Linux
>  >seems to be 1024, so to enable TCP updates I'd have to increase that by
>  >a factor of 10. Is that a safe thing to do?
>  >
>  >Regards,
>  >
>  >Rob de Graaf
>  >
>  >Erik Paulson wrote:
>  >
>  >
>  >>On Tue, Feb 26, 2008 at 03:34:39PM +0100, Rob de Graaf wrote:
>  >>
>  >>
>  >>>The suggested fix, adding a delay by setting the D_NETWORK debug flag,
>  >>>has been applied on all computers and has had some effect; the average
>  >>>pool size has gone up, but not by as much as we had hoped, and ping
>  >>>sweeps still reveal many more live machines not appearing in the pool,
>  >>>leading us to believe there is still some other problem.
>  >>>
>  >>>We've looked at master and startd log files but we haven't been able to
>  >>>find anything seriously wrong, and we're running out of ideas.
>  >>>
>  >>>What could be causing computers to sometimes be missing from our pool,
>  >>>and what else can we do to find them?
>  >>>
>  >>>
>  >>>
>  >>Turn on TCP updates to the collector, instead of UDP.
>  >>
>  >>-Erik
>  >>
>  >>_______________________________________________
>  >>Condor-users mailing list
>  >>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>  >>subject: Unsubscribe
>  >>You can also unsubscribe by visiting
>  >>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>  >>
>  >>The archives can be found at:
>  >>https://lists.cs.wisc.edu/archive/condor-users/
>  >>
>  >>