
Re: [Condor-users] Condor nodes vanish temporarily



Hi Dan,

Thanks for the precise explanation of the cause of the problem.
As for your assumption that Windows has improved since you first
ran into this: I suspect that is not the case. A colleague of mine
(in another department of my company) experienced exactly the same
problem with his pool, which consists of machines running Windows
Server 2008 R2, i.e. a very recent version of Windows.

Anyway, thanks again for your great help!



On Monday, 7 March 2011, at 15:38 -0600, Daniel Forrest wrote:
> Felix Wolfheimer wrote:
> 
> > > This sounds familiar.  On the pool machines, try setting this:
> > > 
> > > STARTD_DEBUG = D_COMMAND D_NETWORK
> > > MASTER_DEBUG = D_COMMAND D_NETWORK
> > > 
> > > in "condor_config".  If this clears up the problem, I can go into more
> > > detail as to what the problem might be and why this "fixes" it.
> > 
> > I tried your suggestion. After setting these keys in the condor_config
> > file on the pool machines, they no longer vanish from the list.
> > 
> > Thank you very much for your help!
> 
> That's good to hear.  And since it worked I can tell you what is
> probably happening.  It was several years ago when I first diagnosed
> this problem, so this is all from memory.
> 
> Updates to the condor_collector are sent as UDP messages.  A normal
> update is broken up into several UDP packets.  Under Windows, a UDP
> packet is considered "sent" once it is buffered in memory.  If another
> packet is queued too quickly it can overwrite the previous packet
> before it is actually put on the wire.  If you were to run an Ethernet
> "snooper" on the packets received by the condor_collector you would
> find many updates that are missing packets.  (As I recall, in my case
> it was always the first packet of an update that was missing.)  By
> adding the additional debugging output (D_NETWORK) you are providing
> enough delay to keep this from happening.
> 
> The other alternative here is to use TCP for updates, but at the time
> I discovered this "fix" that option was not available.
> 
> I'm not sure why more Windows sites don't see this problem.  Using
> Google back then I found several other Windows applications that had
> similar problems that were solved by introducing a delay in the send
> loop.  I assumed that Windows had improved since then because no one
> else has been noticing this particular problem.
>
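
For illustration, the "delay in the send loop" workaround described in
the quoted message might look roughly like the Python sketch below. This
is not Condor code: the function names, the fragment size and the delay
value are all assumptions made up for this example. It only shows the
general idea of pausing briefly between the datagrams that make up one
update, which is the side effect the extra D_NETWORK logging appears to
have.

```python
import time


def split_into_fragments(payload: bytes, fragment_size: int) -> list:
    """Split one update into chunks, one chunk per UDP datagram."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [
        payload[offset:offset + fragment_size]
        for offset in range(0, len(payload), fragment_size)
    ]


def send_paced(send, payload: bytes, fragment_size: int = 1000,
               delay_s: float = 0.001) -> int:
    """Pass each fragment to send(), sleeping between consecutive fragments.

    `send` is any callable that takes one bytes object. The values of
    fragment_size and delay_s are illustrative guesses, not tuned numbers.
    Returns the number of fragments handed to send().
    """
    fragments = split_into_fragments(payload, fragment_size)
    for index, fragment in enumerate(fragments):
        if index > 0:
            # Give the network stack time to put the previous datagram
            # on the wire before the next one is queued.
            time.sleep(delay_s)
        send(fragment)
    return len(fragments)
```

With a real UDP socket, `send` could be something like
`lambda data: sock.sendto(data, (host, port))`, where `sock` is a
`socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and `host` and `port`
point at the receiving machine. Taking the sender as a parameter keeps
the pacing logic separate from the socket, so it can be tried out
without any network traffic.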