
Re: [Condor-users] lost updates / network issues?

The two problems are related: if you miss that many updates,
the collector will from time to time give up on a resource
and time out its ClassAd.  This may be the time to turn on
UPDATE_COLLECTOR_WITH_TCP, which makes the updates much
more reliable.  At one point I had this level of UpdatesLost,
and switching to TCP solved the problem altogether.
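
For reference, a minimal sketch of what that change might look like in
the Condor configuration. Only UPDATE_COLLECTOR_WITH_TCP is named above;
the COLLECTOR_SOCKET_CACHE_SIZE line and its value of 1000 are
assumptions based on manuals of that era, which describe it as needed on
the central manager for the collector to accept TCP updates. Check the
manual for your version before copying this.

```
# On the machines that send updates to the collector:
UPDATE_COLLECTOR_WITH_TCP = True

# On the central manager (assumption: required by older versions;
# the value 1000 is illustrative, not a recommendation):
COLLECTOR_SOCKET_CACHE_SIZE = 1000
```

The daemons need to re-read their configuration (for example via
condor_reconfig, or a restart) before the setting takes effect.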


Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.

On Mon, 8 Oct 2007, Wojtek Goscinski wrote:

Howdy All,

I'm hoping maybe someone can give me some advice about how to diagnose a
problem with our pool.

We're running a test pool with a handful of resources. CondorView is showing
that resources sometimes appear and disappear (see attached
screenshot), though I've only rarely noticed this with condor_status. There
is no specific reason for resources to join and leave, apart from network
issues perhaps...

In addition, condor_status shows me that a lot of updates are being lost -
sometimes around 1/4 (see below).

Hence, I've got 2 questions:

- Is this number of lost updates cause for concern? Machines are on a busy
student network. Should I be upping the rate at which updates occur?
- Why might CondorView be showing me that resources are entering and leaving
the pool? Is this cause for concern?



UpdatesTotal = 4725
UpdatesSequenced = 4793
UpdatesLost = 1028

UpdatesTotal = 5151
UpdatesSequenced = 5148
UpdatesLost = 366

UpdatesTotal = 4636
UpdatesSequenced = 4612
UpdatesLost = 916

UpdatesTotal = 3688
UpdatesSequenced = 3630
UpdatesLost = 1175

UpdatesTotal = 5214
UpdatesSequenced = 5213
UpdatesLost = 361

UpdatesTotal = 5202
UpdatesSequenced = 5201
UpdatesLost = 1471

UpdatesTotal = 5221
UpdatesSequenced = 5220
UpdatesLost = 284
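
To put the "around 1/4" estimate in concrete terms, here is a small
sketch that computes a per-machine loss fraction from the counters
listed above. Dividing UpdatesLost by UpdatesTotal is an assumption made
for illustration; the exact denominator the collector intends is not
stated in this thread. The function name loss_fraction is hypothetical.

```python
# Counters copied from the condor_status output above, one tuple per machine:
# (UpdatesTotal, UpdatesSequenced, UpdatesLost)
samples = [
    (4725, 4793, 1028),
    (5151, 5148, 366),
    (4636, 4612, 916),
    (3688, 3630, 1175),
    (5214, 5213, 361),
    (5202, 5201, 1471),
    (5221, 5220, 284),
]


def loss_fraction(total, lost):
    """Return lost/total, or 0.0 when no updates have been recorded.

    Assumption: UpdatesTotal is used as the denominator.
    """
    return lost / total if total else 0.0


for total, sequenced, lost in samples:
    frac = loss_fraction(total, lost)
    print(f"total={total:5d}  lost={lost:5d}  loss={frac:.1%}")
```

Under that assumption the seven machines range from roughly 5% to 32%
lost, so "around 1/4" describes the worse machines rather than the pool
as a whole.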