Re: [Condor-users] udp missing



JK,

> I have some 2 processor (not hyperthreaded) machines, both Linux and
> Windows. The vm1 get loads of missed udp to the collector and can
> fall from the pool whereas the vm2 doesn't.

As I stated in a private e-mail, we have a pool here with similar
symptoms (up to 20% update loss, only on vm1, on machines with either
2 or 4 total VMs), and haven't found the cause.  There are, however,
things you can look at (assuming the collector is on a Linux box).

Check the output of "netstat -su".  Run it once, run it again 5
minutes later, and compare the "packet receive errors" counts.  If
the count has grown, packets are reaching the collector machine but
are being dropped because they are not processed quickly enough.
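
If you want to script that comparison, here is a minimal sketch in
Python.  It assumes the usual Linux "netstat -su" layout (a "Udp:"
header followed by indented counter lines); the function names are
made up for illustration and are not part of Condor.

```python
import re


def receive_errors(netstat_su_output: str) -> int:
    """Return the 'packet receive errors' count from the Udp: section."""
    in_udp = False
    for line in netstat_su_output.splitlines():
        stripped = line.strip()
        # Section headers such as "Udp:" or "UdpLite:" start in column 0.
        if line and not line[0].isspace():
            in_udp = stripped.lower() == "udp:"
            continue
        if in_udp:
            match = re.match(r"(\d+) packet receive errors", stripped)
            if match:
                return int(match.group(1))
    raise ValueError("no 'packet receive errors' line in the Udp: section")


def errors_between(first_sample: str, second_sample: str) -> int:
    """Number of receive errors that occurred between two samples."""
    return receive_errors(second_sample) - receive_errors(first_sample)
```

Feed it the saved output of two runs taken 5 minutes apart; a result
above zero means the collector machine dropped UDP packets in that
interval.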

If this count is increasing, consider setting COLLECTOR_SOCKET_BUFSIZE
to a larger value.  Also run "cat /proc/sys/net/core/rmem_max" and
"cat /proc/sys/net/core/wmem_max" to see if they are too small.  Look
at the 6.8.3 release notes and search for "rmem_max" for additional
information about this.
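
As a rough check, the sketch below (Python) reads one of those limits
and reports whether a requested buffer size would exceed it.  Linux
caps an ordinary SO_RCVBUF request at rmem_max; whether the collector
is affected in exactly this way is an assumption here, and the
1048576 value is only an example, not a recommendation.

```python
from pathlib import Path


def read_limit(path: str = "/proc/sys/net/core/rmem_max") -> int:
    """Read a kernel socket buffer limit (in bytes) from /proc."""
    return int(Path(path).read_text().strip())


def exceeds_limit(requested_bufsize: int, limit: int) -> bool:
    """True if the requested buffer size is larger than the kernel limit."""
    return requested_bufsize > limit


if __name__ == "__main__":
    requested = 1048576  # hypothetical COLLECTOR_SOCKET_BUFSIZE value
    limit = read_limit()
    if exceeds_limit(requested, limit):
        print(f"rmem_max ({limit}) is smaller than the requested {requested}")
    else:
        print(f"rmem_max ({limit}) allows the requested {requested}")
```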

> Does each process (condor_master, condor_startd) send 1 udp update
> to the collector for all the processors in its care? If so, is this
> just the reporting that is confusing?

There is one update for the master and one update for each VM being
run by the startd.  (Basically, there is one update for each unique
resource that is reported by "condor_status -any").

> Further information on this. I have tried setting
> STARTD_DEBUG = D_NETWORK

FYI - you probably wanted "STARTD_DEBUG = D_COMMAND D_NETWORK" so you
don't lose the debug information that was set by default.

> and get the following being sent:
> 
> 10/16 13:49:23 SEND [1000] <A.B.C.D:41444> <A.B.C.E:9618>
> 10/16 13:49:23 SEND [1000] <A.B.C.D:41444> <A.B.C.E:9618>
> 10/16 13:49:23 SEND [1000] <A.B.C.D:41444> <A.B.C.E:9618>
> 10/16 13:49:23 SEND [882] <A.B.C.D:41444> <A.B.C.E:9618>
> 10/16 13:49:24 SEND [1000] <A.B.C.D:41445> <A.B.C.E:9618>
> 10/16 13:49:24 SEND [1000] <A.B.C.D:41445> <A.B.C.E:9618>
> 10/16 13:49:24 SEND [1000] <A.B.C.D:41445> <A.B.C.E:9618>
> 10/16 13:49:24 SEND [882] <A.B.C.D:41445> <A.B.C.E:9618>
> 
> etc, every 5 mins
> 
> I presume it sends values for each "vm", hence 2 lots of values, only
> 1s apart.
> 
> What is 1000 vs 882?
> 
> Why 4 messages? I have condor_master and condor_startd only running
> on this machine

The update for each VM is taking 3882 bytes on this machine.  This is
being broken up into 1000 byte chunks to fit into UDP packets that
should make it across your network without being further fragmented.
Note that this requires all 4 UDP packets to arrive at the collector
in order for the update to succeed.
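
To make the arithmetic concrete, here is a small Python sketch.  The
1000 byte chunk size and the 3882 byte total come from the log above;
treating packet losses as independent is a simplifying assumption, and
the function names are invented for illustration.

```python
def split_update(total_bytes: int, chunk: int = 1000) -> list[int]:
    """Sizes of the UDP packets needed to carry one update."""
    full, rest = divmod(total_bytes, chunk)
    return [chunk] * full + ([rest] if rest else [])


def update_success_probability(packet_loss: float, packets: int) -> float:
    """Chance that every packet of an update arrives.

    Assumes each packet is lost independently with the same probability.
    """
    return (1.0 - packet_loss) ** packets


if __name__ == "__main__":
    sizes = split_update(3882)
    print(sizes)  # [1000, 1000, 1000, 882]
    print(update_success_probability(0.05, len(sizes)))
```

Under that assumption, a 5% loss rate per packet already means roughly
19% of 4-packet updates fail to arrive complete.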

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison