
Re: [Condor-users] Condor nodes vanish temporarily



This note provides some further information on the topic. As I wrote
earlier, we have two Windows clusters in use: one running Windows Server
2003 R2 and the other running Windows Server 2008 R2. While the settings

STARTD_DEBUG = D_COMMAND D_NETWORK
MASTER_DEBUG = D_COMMAND D_NETWORK

in the Condor config file solved the issue on the Windows Server 2003 R2
cluster, it persisted on the Server 2008 R2 cluster. However, knowing
that the problem is related to UDP, we tried switching to TCP for the
Condor update messages. This worked perfectly, and the nodes now remain
in the list.
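
For anyone who wants to try the same thing, here is a minimal sketch of
the kind of change involved. It assumes the UPDATE_COLLECTOR_WITH_TCP
and COLLECTOR_SOCKET_CACHE_SIZE knobs described in the Condor manual.
The cache size shown is only an illustrative placeholder, so please
check the manual for your Condor version before relying on it.

# On the pool machines: send collector updates over TCP instead of UDP
UPDATE_COLLECTOR_WITH_TCP = True

# On the collector host: allow it to cache incoming TCP sockets
# (placeholder value, size it to the number of daemons in your pool)
COLLECTOR_SOCKET_CACHE_SIZE = 1000

The daemons have to re-read their configuration (for example via
condor_reconfig, or a restart) before the change takes effect.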



On Monday, 7 March 2011, at 15:38 -0600, Daniel Forrest wrote:
> Felix Wolfheimer wrote:
> 
> > > This sounds familiar.  On the pool machines, try setting this:
> > > 
> > > STARTD_DEBUG = D_COMMAND D_NETWORK
> > > MASTER_DEBUG = D_COMMAND D_NETWORK
> > > 
> > > in "condor_config".  If this clears up the problem, I can go into more
> > > detail as to what the problem might be and why this "fixes" it.
> > 
> > I tried your suggestion and after setting the keys in the condor_config
> > file on the pool my machines do not vanish anymore from the list.
> > 
> > Thank you very much for your help!
> 
> That's good to hear.  And since it worked I can tell you what is
> probably happening.  It was several years ago when I first diagnosed
> this problem, so this is all from memory.
> 
> Updates to the condor_collector are sent as UDP messages.  A normal
> update is broken up into several UDP packets.  Under Windows, a UDP
> packet is considered "sent" once it is buffered in memory.  If another
> packet is queued too quickly it can overwrite the previous packet
> before it is actually put on the wire.  If you were to run an ethernet
> "snooper" on the packets received by the condor_collector you would
> find many updates that are missing packets.  (As I recall, in my case
> it was always the first packet of an update that was missing.)  By
> adding the additional debugging output (D_NETWORK) you are providing
> enough delay to keep this from happening.
> 
> The other alternative here is to use TCP for updates, but at the time
> I discovered this "fix" that option was not available.
> 
> I'm not sure why more Windows sites don't see this problem.  Using
> Google back then I found several other Windows applications that had
> similar problems that were solved by introducing a delay in the send
> loop.  I assumed that Windows had improved since then because no one
> else has been noticing this particular problem.
>