
Re: [Condor-users] Condor nodes vanish temporarily



Felix Wolfheimer wrote:

> > This sounds familiar.  On the pool machines, try setting this:
> > 
> > STARTD_DEBUG = D_COMMAND D_NETWORK
> > MASTER_DEBUG = D_COMMAND D_NETWORK
> > 
> > in "condor_config".  If this clears up the problem, I can go into more
> > detail as to what the problem might be and why this "fixes" it.
> 
> I tried your suggestion and after setting the keys in the condor_config
> file on the pool my machines do not vanish anymore from the list.
> 
> Thank you very much for your help!

That's good to hear.  Since it worked, I can tell you what is
probably happening.  I first diagnosed this problem several years
ago, so this is all from memory.

Updates to the condor_collector are sent as UDP messages.  A normal
update is broken up into several UDP packets.  Under Windows, a UDP
packet is considered "sent" once it is buffered in memory.  If another
packet is queued too quickly it can overwrite the previous packet
before it is actually put on the wire.  If you were to run an Ethernet
packet sniffer on the traffic received by the condor_collector, you
would find many updates that are missing packets.  (As I recall, in my
case it was always the first packet of an update that was missing.)
Writing the additional debugging output (D_NETWORK) slows the send
loop down just enough to keep this from happening.
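
To make the "delay in the send loop" idea concrete, here is a minimal
Python sketch.  It is not Condor's code: the function names, the chunk
size, and the 5 ms delay are all hypothetical, chosen only to show one
logical update being split into several UDP datagrams with a short
pause between sends.

```python
import socket
import time


def split_update(payload: bytes, max_size: int) -> list[bytes]:
    """Split one logical update into datagram-sized chunks.

    Hypothetical helper; Condor's real fragmentation is not shown here.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [payload[i:i + max_size] for i in range(0, len(payload), max_size)]


def send_paced(sock: socket.socket, addr, chunks, delay_s: float = 0.005) -> int:
    """Send each chunk as its own UDP datagram, pausing between sends.

    The pause plays the role that the extra debug logging plays in the
    workaround: it gives the stack time to put one datagram on the wire
    before the next one is queued.  The default delay is an assumption.
    """
    sent = 0
    for index, chunk in enumerate(chunks):
        if index > 0:
            time.sleep(delay_s)  # no pause needed before the first datagram
        sock.sendto(chunk, addr)
        sent += 1
    return sent


if __name__ == "__main__":
    # Loopback demonstration: a receiver on an ephemeral port, one sender.
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    chunks = split_update(b"example update payload", 8)
    count = send_paced(sender, receiver.getsockname(), chunks)
    received = [receiver.recvfrom(64)[0] for _ in range(count)]
    print(f"sent {count} datagrams, received {len(received)}")

    sender.close()
    receiver.close()
```

A fixed sleep is the crudest form of pacing.  It trades update latency
for reliability, which is the same trade the debug-logging workaround
makes by accident.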

The alternative is to use TCP for updates, but that option was not
available at the time I discovered this "fix".
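
For reference, later Condor releases expose this as a configuration
setting.  The fragment below is a sketch that assumes your version
supports UPDATE_COLLECTOR_WITH_TCP; check the manual for your release,
since older versions may also need collector-side settings before TCP
updates are accepted.

```
# In condor_config on the pool machines (assumes a release with TCP updates)
UPDATE_COLLECTOR_WITH_TCP = True
```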

I'm not sure why more Windows sites don't see this problem.  Searching
with Google back then, I found several other Windows applications with
similar problems that were solved by introducing a delay in the send
loop.  I had assumed that Windows had improved since then, because no
one else seemed to be noticing this particular problem.

-- 
Dan