Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] condor_restart and missing machines (enhancement request)

Date: Fri, 13 Jul 2007 09:28:10 -0500
From: Daniel Forrest <forrest@xxxxxxxxxxxxx>
Subject: Re: [Condor-users] condor_restart and missing machines (enhancement request)

JK,

> I am currently still experiencing problems with the reports from
> Windows PCs failing to arrive at the central manager. I suspect this
> problem will go away when they are all upgraded from 6.6 to 6.8, but
> in the meantime have this request:

We have had the same problem here (version 6.8.1).  One thing that
will ease the problem is to enable D_NETWORK debugging for the daemons
on the Windows PCs.  Here is a quote from a message (unanswered) sent
to condor-admin earlier this year:

<QUOTE>

Putting a network sniffer on the problem machines shows that what is
happening is that the first fragment of a multi-fragment message is
never making it to the wire.  For example, a typical Master Update is
1244 bytes split into two fragments of 1000 and 244 bytes.  When the
update fails, the 1000 byte fragment never leaves the machine, but the
244 byte fragment does.  Furthermore, if D_NETWORK is enabled for the
Master, then the majority of the updates start to work.  That is, the
writing of the debugging information to the log file and the resulting
small delay between the sending of the two fragments is usually enough
to resolve the problem.

Clearly there is a problem with Windows losing (i.e. failing to send)
UDP packets if the time between multiple calls to sendto() is too
short.  I've Googled on this and there was some indication this might
be related to using nonblocking sockets with sendto(), but setting
"NONBLOCKING_COLLECTOR_UPDATE = False" doesn't solve the problem.

</QUOTE>
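
To make the mechanism in the quote concrete, here is a minimal Python
sketch of a sender that splits one update into fragments and pauses
briefly between the sendto() calls.  This is not Condor's code.  The
1000-byte fragment size comes from the sniffer trace above; the function
names and the 10 ms delay are assumptions chosen only for illustration,
and whether any particular delay is enough on an affected Windows
machine is untested here.

```python
import socket
import time
from typing import List, Tuple


def split_into_fragments(payload: bytes, max_fragment: int = 1000) -> List[bytes]:
    """Cut a payload into consecutive chunks of at most max_fragment bytes."""
    return [payload[i:i + max_fragment]
            for i in range(0, len(payload), max_fragment)]


def send_with_delay(sock: socket.socket,
                    addr: Tuple[str, int],
                    payload: bytes,
                    delay: float = 0.01,
                    max_fragment: int = 1000) -> List[int]:
    """Send each fragment in its own sendto() call, sleeping between calls.

    The sleep stands in for the small delay that writing D_NETWORK log
    output introduces; the 10 ms default is an arbitrary assumption.
    Returns the byte count reported by each sendto() call.
    """
    sent = []
    for index, fragment in enumerate(split_into_fragments(payload, max_fragment)):
        if index > 0:
            time.sleep(delay)  # pause before every fragment after the first
        sent.append(sock.sendto(fragment, addr))
    return sent
```

With a 1244-byte payload, split_into_fragments() yields the same
1000-byte and 244-byte pieces seen in the trace, and send_with_delay()
issues one sendto() per piece.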

With D_NETWORK set, our Windows machines lose fewer than 1.5% of their updates.
Before this, enough updates were lost from several machines that they
were regularly dropped from condor_status.
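
For reference, a sketch of what the setting might look like in the local
configuration file on a Windows execute machine.  It assumes the usual
<SUBSYS>_DEBUG configuration macros; only the Master is confirmed above
to benefit, so the STARTD line is an assumption, and any debug flags you
already set would need to be kept on the same line.

```
# Enable network-level debug output for the Master (confirmed above to help)
MASTER_DEBUG = D_NETWORK

# Assumption: the same flag for the startd, whose updates may also be affected
STARTD_DEBUG = D_NETWORK
```

The daemons need a condor_reconfig (or a restart) to pick up the change.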

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison

Follow-Ups:
- Re: [Condor-users] condor_restart and missing machines (enhancement request)
  - From: Kewley, J (John)

References:
- Re: [Condor-users] Condor Java Macs
  - From: Ian Cottam
- [Condor-users] condor_restart and missing machines (enhancement request)
  - From: Kewley, J (John)
