
Re: [Condor-users] condor_restart and missing machines (enhancement request)



JK,

> I am currently still experiencing problems with the reports from
> Windows PCs failing to arrive at the central manager. I suspect this
> problem will go away when they are all upgraded from 6.6 to 6.8, but
> in the meantime have this request:

We have had the same problem here on version 6.8.1, so the upgrade
from 6.6 may not make it go away.  One thing that will ease the
problem is to enable D_NETWORK debugging for the daemons on the
Windows PCs.  A quote from a message (unanswered) to condor-admin
earlier this year:

<QUOTE>

Putting a network sniffer on the problem machines shows that what is
happening is that the first fragment of a multi-fragment message is
never making it to the wire.  For example, a typical Master Update is
1244 bytes split into two fragments of 1000 and 244 bytes.  When the
update fails, the 1000 byte fragment never leaves the machine, but the
244 byte fragment does.  Furthermore, if D_NETWORK is enabled for the
Master, then the majority of the updates start to work.  That is, the
writing of the debugging information to the log file and the resulting
small delay between the sending of the two fragments is usually enough
to resolve the problem.

Clearly there is a problem with Windows losing (i.e. failing to send)
UDP packets if the time between multiple calls to sendto() is too
short.  I've Googled on this and there was some indication this might
be related to using nonblocking sockets with sendto(), but setting
"NONBLOCKING_COLLECTOR_UPDATE = False" doesn't solve the problem.

</QUOTE>
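
To make the timing effect concrete, here is a rough Python sketch of
the pattern described in the quote: a message is cut into fragments of
at most 1000 bytes, each fragment goes out in its own sendto() call,
and a short pause separates the calls.  This is not Condor's code.
The function names, the 10 ms pause and the idea of pausing on purpose
are all my assumptions; in our case the delay was only a side effect
of writing the D_NETWORK log lines.

```python
import socket
import time


def split_fragments(payload, max_size=1000):
    """Cut payload into consecutive chunks of at most max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [payload[i:i + max_size] for i in range(0, len(payload), max_size)]


def send_paced(sock, addr, payload, max_size=1000, pause=0.01):
    """Send each fragment with its own sendto(), pausing between calls.

    The pause (hypothetical value) stands in for the small delay that
    writing debug output happened to introduce between the two sends.
    Returns the list of fragment sizes that were handed to sendto().
    """
    fragments = split_fragments(payload, max_size)
    for index, fragment in enumerate(fragments):
        if index > 0:
            time.sleep(pause)  # gap between consecutive sendto() calls
        sock.sendto(fragment, addr)
    return [len(fragment) for fragment in fragments]


if __name__ == "__main__":
    # Loopback demo only; the receiver below is a stand-in, not a collector.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    print(send_paced(sender, receiver.getsockname(), bytes(1244)))
    sender.close()
    receiver.close()
```

For a 1244 byte payload the fragment sizes come out as 1000 and 244,
matching the Master Update seen in the sniffer trace.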

With D_NETWORK set, our Windows machines lose fewer than 1.5% of their
updates.  Before this, several machines lost enough updates that they
were regularly dropped from the condor_status output.
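
If you want to compare your own pool against that figure, the
arithmetic is just lost updates over expected updates.  A minimal
sketch follows; the function name and the counts in the example are
made up for illustration and are not measurements from our pool.

```python
def loss_percent(expected, received):
    """Return the percentage of expected updates that never arrived."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    if not 0 <= received <= expected:
        raise ValueError("received must be between 0 and expected")
    return 100.0 * (expected - received) / expected


if __name__ == "__main__":
    # Hypothetical counts: 1000 updates expected, 985 seen by the collector.
    print(loss_percent(1000, 985))
```

How you obtain the two counts depends on what your collector and logs
record, so that part is left out here.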

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison