
Re: [Condor-users] Restarting comms with XP nodes (SP2 UDP bugette)

> I would also like to say that I tried to switch over to using TCP with
> no luck.  I configured the master/collector to have 99 sockets in its
> cache.  Then I picked two machines from my pool as guinea pigs and set
> their UPDATE_COLLECTOR_WITH_TCP value to TRUE.  It didn't work.  After
> many attempts to reconfig all the daemons and even fully restarting the
> services on the master and test nodes, the CollectorLog continued to
> indicate that TCP updates were being rejected:
> 	...
> 	6/22 10:01:56 Received UPDATE command via TCP; ignored
> 	6/22 10:02:05 Received UPDATE command via TCP; ignored
> 	6/22 10:02:14 Received UPDATE command via TCP; ignored

for this, you need to also define COLLECTOR_SOCKET_CACHE_SIZE.

> What's equally frustrating is that since the deployment of SP2, even my
> submit nodes seem to become disconnected.  Users submit jobs from their
> desktops, the jobs appear in their local queue, but no one else can see
> the jobs via condor_q -global.  The jobs never run until the local
> service is shutdown and restarted.
> Lastly, I'm still plagued by errors like this in my CollectorLog which
> must have some correlation to all of these other symptoms:
> 	6/22 10:01:47 DC_AUTHENTICATE: attempt to open invalid session...
> 	... IQ00:3320:1118923534:1043, failing.

this is probably the cause of your problem(s).

if you are running old (say, pre-6.6.2) binaries, there are known bugs which
caused sessions to be purged.  but i'm guessing that's not the case.

> I saw several postings regarding DC_AUTHENTICATE errors but none of them
> suggested any remedy.  One posting suggested that most likely the
> "session" has been purged from the session cache so it is indeed
> invalid.  That's great but what exactly IS a session in this context?
> And how do I prevent this error from occurring?  Is it bad?  I don't
> know.

by the way, a "session" is essentially a security context.  it holds bits of
information like if/how you authenticated and crypto keys.  if you aren't
actually using any authentication, encryption, or MD5 checksums, you can
turn sessions off with:

perhaps that will help you.  but in general, there should be no harm in
using sessions even if you are not using any security features.

say for example you restart the collector.  when each startd sends its next
update to the collector, it will now try to use a session that the collector
does not know about.  this will result in the "attempt to open invalid session"
message.  however, when this happens a message is sent back to startd to tell
it to invalidate that session and start a new one next time.  so the problem
*should* correct itself.  however, it's likely that when combined with flaky
UDP packets, the problem is worsened.
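
to make the sequence above concrete, here is a toy model in python.  it is only
a sketch: the class names, method names and the "INVALIDATE" reply are all
hypothetical and are not condor's real code or wire format.  the reply_lost
flag is an assumption that models the guess above, namely that a dropped UDP
reply leaves the startd holding on to its stale session.

```python
# Toy model of the session recovery described above. Every name here is
# hypothetical; this is not Condor's actual implementation or protocol.
import secrets


class Collector:
    def __init__(self):
        self.sessions = set()  # session ids this collector currently knows

    def new_session(self):
        sid = secrets.token_hex(4)
        self.sessions.add(sid)
        return sid

    def restart(self):
        # A restart loses the in-memory session cache.
        self.sessions.clear()

    def receive_update(self, sid):
        if sid not in self.sessions:
            # Stands in for "attempt to open invalid session ... failing."
            return "INVALIDATE"
        return "OK"


class Startd:
    def __init__(self, collector):
        self.collector = collector
        self.sid = None  # no session negotiated yet

    def send_update(self, reply_lost=False):
        if self.sid is None:
            self.sid = self.collector.new_session()
        reply = self.collector.receive_update(self.sid)
        if reply == "INVALIDATE":
            if not reply_lost:
                # The notice arrived: drop the stale session so the next
                # update negotiates a fresh one.
                self.sid = None
            # If the notice was lost, the stale session is kept and the
            # next update fails the same way.
            return "FAILED"
        return "OK"


if __name__ == "__main__":
    collector = Collector()
    startd = Startd(collector)
    results = [startd.send_update()]
    collector.restart()
    results.append(startd.send_update())  # rejected, session invalidated
    results.append(startd.send_update())  # succeeds with a new session
    print(results)
```

in this model one update fails after a collector restart and the next one
succeeds.  if every call is made with reply_lost=True, the startd never learns
that its session is stale and every update keeps failing, which is the kind of
"stuck" behaviour that lossy UDP could plausibly cause.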

i'm interested in hearing if any of the hotfixes that bruce mentioned help the
situation at all.

in the meantime, you can try working around it by using TCP updates to the
collector (remember to also define COLLECTOR_SOCKET_CACHE_SIZE) but this will
probably not cure all the problems as condor uses UDP quite a bit.
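
as a rough sketch of the TCP workaround, the two settings named above would
look something like this in the condor config files.  the value 99 is just the
figure from the quoted message, not a recommendation, and the split between
machines reflects my understanding of where each setting is read.

```
# on the collector (central manager) machine
COLLECTOR_SOCKET_CACHE_SIZE = 99

# on each machine that should send its updates to the collector over TCP
UPDATE_COLLECTOR_WITH_TCP = True
```

without the cache size defined on the collector side, TCP updates are rejected
with the "Received UPDATE command via TCP; ignored" message shown in the quoted
CollectorLog.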