
Re: [Condor-users] condor_restart and missing machines (enhancement request)



Cheers, thanks Daniel.

Any idea how turning on D_NETWORK compares with turning on TCP updates
with respect to network performance? That is, setting:
UPDATE_COLLECTOR_WITH_TCP = TRUE
(see https://lists.cs.wisc.edu/archive/condor-users/2005-April/msg00204.shtml )

cheers

JK
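
For reference, a rough sketch of the two settings being compared, as they
might appear in the local config file of a Windows execute machine. The
choice of which daemons get D_NETWORK (Master and Startd here) is an
assumption for illustration, not something prescribed in this thread:

```
# Option 1: network-level debug logging for the daemons that send updates.
# Each update then also writes to the daemon log, which adds a small delay.
MASTER_DEBUG = D_NETWORK
STARTD_DEBUG = D_NETWORK

# Option 2: send collector updates over TCP instead of UDP.
UPDATE_COLLECTOR_WITH_TCP = TRUE
```

Either change would need a condor_reconfig (or a restart) on the affected
machines before it takes effect.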


> -----Original Message-----
> From: Daniel Forrest [mailto:forrest@xxxxxxxxxxxxx]
> Sent: Friday, July 13, 2007 3:28 PM
> To: Kewley, J (John)
> Cc: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_restart and missing machines
> (enhancement request)
> 
> 
> JK,
> 
> > I am currently still experiencing problems with the reports from
> > Windows PCs failing to arrive at the central manager. I suspect this
> > problem will go away when they are all upgraded from 6.6 to 6.8, but
> > in the meantime have this request:
> 
> We have had the same problem here (version 6.8.1).  One thing that
> will ease the problem is to set D_NETWORK for the daemons on the
> Windows PCs.  A quote from a message (unanswered) to condor-admin
> earlier this year:
> 
> <QUOTE>
> 
> Putting a network sniffer on the problem machines shows that what is
> happening is that the first fragment of a multi-fragment message is
> never making it to the wire.  For example, a typical Master Update is
> 1244 bytes split into two fragments of 1000 and 244 bytes.  When the
> update fails, the 1000 byte fragment never leaves the machine, but the
> 244 byte fragment does.  Furthermore, if D_NETWORK is enabled for the
> Master, then the majority of the updates start to work.  That is, the
> writing of the debugging information to the log file and the resulting
> small delay between the sending of the two fragments is usually enough
> to resolve the problem.
> 
> Clearly there is a problem with Windows losing (i.e. failing to send)
> UDP packets if the time between multiple calls to sendto() is too
> short.  I've Googled on this and there was some indication this might
> be related to using nonblocking sockets with sendto(), but setting
> "NONBLOCKING_COLLECTOR_UPDATE = False" doesn't solve the problem.
> 
> </QUOTE>
> 
> With D_NETWORK set, our Windows machines show a <1.5% loss of updates.
> Before this, enough updates were lost from several machines that they
> were regularly dropped from condor_status.
> 
> -- 
> Daniel K. Forrest	Laboratory for Molecular and
> forrest@xxxxxxxxxxxxx	Computational Genomics
> (608) 262 - 9479	University of Wisconsin, Madison
>
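
A minimal Python sketch of the behaviour described in the quoted message:
one update is sent as several UDP datagrams, with an optional short pause
between successive sendto() calls. This is not Condor code. The function
names, the 1000-byte fragment size taken from the example above, and the
delay value are all assumptions for illustration; the pause merely stands in
for the delay that D_NETWORK logging happens to introduce.

```python
import socket
import time
from typing import List, Tuple


def split_fragments(payload: bytes, fragment_size: int = 1000) -> List[bytes]:
    """Split a payload into consecutive chunks of at most fragment_size bytes."""
    return [payload[i:i + fragment_size]
            for i in range(0, len(payload), fragment_size)]


def send_fragments(sock: socket.socket, payload: bytes,
                   address: Tuple[str, int],
                   fragment_size: int = 1000,
                   delay_s: float = 0.0) -> List[int]:
    """Send each fragment as its own UDP datagram.

    Sleeps for delay_s seconds between consecutive sendto() calls (hypothetical
    workaround) and returns the size of each fragment sent.
    """
    fragments = split_fragments(payload, fragment_size)
    for index, fragment in enumerate(fragments):
        if index > 0 and delay_s > 0:
            time.sleep(delay_s)  # small gap before the next sendto()
        sock.sendto(fragment, address)
    return [len(fragment) for fragment in fragments]
```

Called with a 1244-byte payload, send_fragments() issues two sendto() calls,
one of 1000 bytes and one of 244 bytes, matching the Master Update example.
Whether a given delay_s is long enough to avoid the loss on a particular
Windows machine would have to be found by experiment.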