
RE: [Condor-users] Restarting comms with XP nodes (SP2 UDP bugette)



Zach,

> for this, you need to also define COLLECTOR_SOCKET_CACHE_SIZE.

Yeah, did that too.  Sorry I wasn't clearer.  I followed the
procedure in the manual and put these two settings in my config files:

	COLLECTOR_SOCKET_CACHE_SIZE = 99
	UPDATE_COLLECTOR_WITH_TCP   = TRUE

I pushed the updated config files out to my master/collector and to two
test execution/submit nodes.  I issued condor_reconfig to all three.  I
immediately began receiving the "...via TCP; ignored" errors.  I tried
restarting condor on the master first, then on the two execution/submit
nodes.  The error persisted.
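
One way to tell whether a config change actually made a difference is to count the relevant messages in the CollectorLog before and after the reconfig. The following is only a minimal sketch: it assumes the log line format shown in the excerpts quoted in this thread ("month/day time message"), and the function name and sample lines are hypothetical, not part of Condor.

```python
import re
from collections import Counter

# Assumed line format, taken from the CollectorLog excerpts in this thread:
#   "6/22 10:01:56 Received UPDATE command via TCP; ignored"
LINE_RE = re.compile(r"^\s*(\d+/\d+)\s+(\d+:\d+:\d+)\s+(.*)$")


def count_collector_errors(lines):
    """Count the two error messages discussed in this thread.

    `lines` is any iterable of log lines (a list or an open file).
    Lines that do not match the assumed format are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        message = match.group(3)
        if "via TCP; ignored" in message:
            counts["tcp_update_ignored"] += 1
        elif "attempt to open invalid" in message:
            counts["invalid_session"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample built from the messages quoted in this thread.
    sample = [
        "6/22 10:01:47 DC_AUTHENTICATE: attempt to open invalid session...",
        "6/22 10:01:56 Received UPDATE command via TCP; ignored",
        "6/22 10:02:05 Received UPDATE command via TCP; ignored",
    ]
    print(dict(count_collector_errors(sample)))
```

To run it against a real log, pass an open file object instead of the sample list; the path to the CollectorLog depends on the local installation.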

Thanks for the "session" info.  I keep my entire pool at the same
revision level so there is no mixed versioning going on here.  Everyone
is running 6.6.9 currently.   I'm curious, you said that if the
collector is restarted, it would invalidate any open sessions.  Does the
same hold true in reverse?  That is, if an execution node is restarted,
is there a potential for the same problem?

I'll try setting SEC_DEFAULT_NEGOTIATION = OPTIONAL and see if that
helps.  I also see that v6.6.10 was released yesterday so I'll take a
look at that too.

I've done some preliminary investigation regarding Bruce's post.  I'll
respond to his email separately to keep the threads intact.

-Bryan

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Zachary Miller
Sent: Wednesday, June 22, 2005 6:21 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Restarting comms with XP nodes (SP2 UDP bugette)

> I would also like to say that I tried to switch over to using TCP with
> no luck.  I configured the master/collector to have 99 sockets in its
> cache.  Then I picked two machines from my pool as guinea pigs and set
> their UPDATE_COLLECTOR_WITH_TCP value to TRUE.  It didn't work.  After
> many attempts to reconfig all the daemons and even fully restarting the
> services on the master and test nodes, the CollectorLog continued to
> indicate that TCP updates were being rejected:
> 
> 	...
> 	6/22 10:01:56 Received UPDATE command via TCP; ignored
> 	6/22 10:02:05 Received UPDATE command via TCP; ignored
> 	6/22 10:02:14 Received UPDATE command via TCP; ignored

for this, you need to also define COLLECTOR_SOCKET_CACHE_SIZE.

> What's equally frustrating is that since the deployment of SP2, even my
> submit nodes seem to become disconnected.  Users submit jobs from their
> desktops, the jobs appear in their local queue, but no one else can see
> the jobs via condor_q -global.  The jobs never run until the local
> service is shut down and restarted.
> 
> Lastly, I'm still plagued by errors like this in my CollectorLog which
> must have some correlation to all of these other symptoms:
> 
> 	6/22 10:01:47 DC_AUTHENTICATE: attempt to open invalid
> session...
> 	... IQ00:3320:1118923534:1043, failing.

this is probably the cause of your problem(s).

if you are running old (say pre-6.6.2) binaries there are known bugs which
caused sessions to be purged.  but i'm guessing that's not the case.


> I saw several postings regarding DC_AUTHENTICATE errors but none of them
> suggested any remedy.  One posting suggested that most likely the
> "session" has been purged from the session cache so it is indeed
> invalid.  That's great but what exactly IS a session in this context?
> And how do I prevent this error from occurring?  Is it bad?  I don't
> know.

by the way, a "session" is essentially a security context.  it holds bits of
information like if/how you authenticated and crypto keys.  if you aren't
actually using any authentication, encryption, or MD5 checksums, you can
turn sessions off with:
  SEC_DEFAULT_NEGOTIATION = OPTIONAL

perhaps that will help you.  but in general, there should be no harm in
using sessions even if you are not using any security features.

say for example you restart the collector.  when each startd sends its next
update to the collector, it will now try to use a session that the collector
does not know about.  this will result in the "attempt to open invalid session"
message.  however, when this happens a message is sent back to the startd to
tell it to invalidate that session and start a new one next time.  so the
problem *should* correct itself.  however, it's likely that when combined with
flaky UDP packets, the problem is worsened.
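
the recovery sequence described above can be illustrated with a toy model. this is a sketch only, not condor's actual implementation: the class names, the integer session ids, and the `reply_lost` flag (standing in for a dropped UDP reply) are all hypothetical.

```python
import itertools


class ToyCollector:
    """Toy collector that remembers which session ids it has issued."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.sessions = set()

    def restart(self):
        # A restart forgets every open session.
        self.sessions.clear()

    def new_session(self):
        session_id = next(self._ids)
        self.sessions.add(session_id)
        return session_id

    def knows(self, session_id):
        return session_id in self.sessions


class ToyStartd:
    """Toy startd that reuses its cached session for every update."""

    def __init__(self, collector):
        self.collector = collector
        self.session = None

    def send_update(self, reply_lost=False):
        if self.session is None:
            self.session = self.collector.new_session()
        if self.collector.knows(self.session):
            return "accepted"
        # The collector logs "attempt to open invalid session" and replies
        # telling the startd to drop the session. If that reply is lost
        # (assumed here to model an unreliable network), the stale session
        # stays cached and the next update fails the same way.
        if not reply_lost:
            self.session = None
        return "invalid session"


if __name__ == "__main__":
    collector = ToyCollector()
    startd = ToyStartd(collector)
    print(startd.send_update())   # first update creates a session
    collector.restart()
    print(startd.send_update())   # stale session is rejected
    print(startd.send_update())   # fresh session on the next update
```

in this model a single rejected update is enough to recover, unless the invalidation reply keeps getting lost, in which case the startd keeps presenting the stale session.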

i'm interested in hearing if any of the hotfixes that bruce mentioned help the
situation at all.

in the meantime, you can try working around it by using TCP updates to the
collector (remember to also define COLLECTOR_SOCKET_CACHE_SIZE) but this will
probably not cure all the problems as condor uses UDP quite a bit.


cheers,
-zach

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users