Re: [Condor-users] lost updates / network issues?
- Date: Sun, 07 Oct 2007 20:18:55 -0500 (CDT)
- From: Steven Timm <timm@xxxxxxxx>
- Subject: Re: [Condor-users] lost updates / network issues?
The two problems are related: if you miss that many updates,
the collector will give up on a resource from time to time
and time out its ClassAd. This may be the time to turn on
UPDATE_COLLECTOR_WITH_TCP, which will make the updates much
more reliable. At one point I had this level of UpdatesLost,
changed to TCP, and it solved the problem altogether.
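
A minimal sketch of what enabling that setting might look like in the Condor configuration. The file placement and the cache size value are assumptions for illustration, not values from this thread. The manual for this era of Condor also describes COLLECTOR_SOCKET_CACHE_SIZE as needing to be set on the collector side for TCP updates, so check the documentation for your version:

```
# On each machine that sends updates to the collector
# (for example in its local config file; placement is an assumption)
UPDATE_COLLECTOR_WITH_TCP = True

# On the central manager running condor_collector.
# 1000 is an illustrative value only; size it to the number of
# daemons that will hold TCP connections open to the collector.
COLLECTOR_SOCKET_CACHE_SIZE = 1000
```

The daemons have to re-read their configuration before the change takes effect.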
Steven C. Timm, Ph.D (630) 840-8525
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
On Mon, 8 Oct 2007, Wojtek Goscinski wrote:
I'm hoping maybe someone can give me some advice about how to diagnose a
problem with our pool.
We're running a test pool with a handful of resources. Condorview is showing
that resources are sometimes appearing and disappearing (see attached
screenshot) - though I've only noticed this rarely with condor_status. There
is no specific reason for resources to join and leave - apart from network
In addition, condor_status shows me that a lot of updates are being lost -
sometimes around 1/4 (see below).
Hence, I've got two questions:
- Is this number of lost updates cause for concern? Machines are on a busy
student network. Should I be upping the rate at which updates occur?
- Why might CondorView be showing me that resources are entering and leaving
the pool? Is this cause for concern?
UpdatesTotal = 4725
UpdatesSequenced = 4793
UpdatesLost = 1028

UpdatesTotal = 5151
UpdatesSequenced = 5148
UpdatesLost = 366

UpdatesTotal = 4636
UpdatesSequenced = 4612
UpdatesLost = 916

UpdatesTotal = 3688
UpdatesSequenced = 3630
UpdatesLost = 1175

UpdatesTotal = 5214
UpdatesSequenced = 5213
UpdatesLost = 361

UpdatesTotal = 5202
UpdatesSequenced = 5201
UpdatesLost = 1471

UpdatesTotal = 5221
UpdatesSequenced = 5220
UpdatesLost = 284
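
To put the "around 1/4" estimate in numbers, here is a rough sketch that computes a loss ratio for each group of counters above. Dividing UpdatesLost by UpdatesTotal is an assumption made for illustration, not a formula stated in this thread, and the machine numbering is hypothetical because the output does not include host names. Under that assumption the ratios range from roughly 5% to 32%.

```python
# (UpdatesTotal, UpdatesSequenced, UpdatesLost) for each machine,
# copied from the counters listed above, in the same order.
stats = [
    (4725, 4793, 1028),
    (5151, 5148, 366),
    (4636, 4612, 916),
    (3688, 3630, 1175),
    (5214, 5213, 361),
    (5202, 5201, 1471),
    (5221, 5220, 284),
]


def loss_ratio(total: int, lost: int) -> float:
    """Return lost/total, an assumed rough indicator of update loss."""
    return lost / total if total else 0.0


# One ratio per machine; UpdatesSequenced is not used in this estimate.
ratios = [loss_ratio(total, lost) for total, _sequenced, lost in stats]

for index, ratio in enumerate(ratios, start=1):
    # "machine N" is a placeholder label, not a real host name.
    print(f"machine {index}: {ratio:.1%} of updates lost")
```

Comparing these ratios before and after a configuration change gives a simple way to see whether the change helped.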