
RE: [Condor-users] Restarting comms with XP nodes (SP2 UDP bugette)



[I apologize upfront for the long post.]

I'd like to add some comments to Rob's suggestion.  

My pool has been suffering from disappearing nodes (i.e. "... master
dropping connections...") for quite some time.  This problem predates
the installation of SP2 on my XP boxes and has plagued my pool through
Condor versions 6.6.5 to 6.6.9. Additionally, almost all of my execution
nodes are running Windows Server 2003 and they disappear from the pool
just as often as the XP boxes.  The server boxes have not yet been
upgraded to SP1 so they do not have a firewall running and therefore
should not be plagued by firewall-related issues.

I, too, noticed that issuing a condor_reconfig would often, but not
always, reestablish the connection.  Just as often, condor_reconfig
returns with "Can't find address for master..." because the connection
has been dead for so long the master has purged it from the cache.  The
only way condor_reconfig will work under that condition is to run it
from the affected node itself.  To this end I created a scheduled task
on each execution node that runs after every 30 minutes of idle time;
when it wakes up, it issues a condor_reconfig to the local host.  I
decided to run it
at idle because it appeared to me that the hosts were only dropping out
of the pool when they were idle for too long.  But this is just an
observation; I have no evidence to back it up.
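
For anyone who wants to try the same workaround, here is a minimal
sketch of what such a scheduled task could run.  It is not my actual
task.  The function names are made up for illustration, and it assumes
that condor_reconfig is on the PATH and that running it with no host
argument targets the local machine.

```python
import subprocess


def build_reconfig_command(executable="condor_reconfig"):
    """Build the argv for a local reconfig.

    Assumption: with no host argument, condor_reconfig acts on the
    local machine, which is what the scheduled task needs.
    """
    return [executable]


def run_reconfig(runner=subprocess.run, executable="condor_reconfig"):
    """Run condor_reconfig locally and return (exit_code, output_text).

    `runner` is injectable so the function can be exercised without
    Condor installed; by default it is subprocess.run.
    """
    cmd = build_reconfig_command(executable)
    try:
        result = runner(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing executable or a hung command: report it as a failure
        # instead of raising, so the scheduler sees a nonzero code.
        return 1, str(exc)
    return result.returncode, (result.stdout + result.stderr).strip()
```

The scheduled task would call run_reconfig() and could log the returned
text, which makes it easier to tell later whether the reconfig
succeeded each time the task fired.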

This scheduled task has kept my pool listing full at all times but has
resulted in a new problem.  Many of the machines now go into the Claimed
Idle state.  The executable never gets run but the job shows as running
in the queue.  I think there is a reason the machines drop out of the
queue and when you force them back into the queue via condor_reconfig,
they continue to exhibit bad behavior.

I would also like to say that I tried to switch over to using TCP with
no luck.  I configured the master/collector to have 99 sockets in its
cache.  Then I picked two machines from my pool as guinea pigs and set
their UPDATE_COLLECTOR_WITH_TCP value to TRUE.  It didn't work.  After
many attempts to reconfig all the daemons and even fully restarting the
services on the master and test nodes, the CollectorLog continued to
indicate that TCP updates were being rejected:

	...
	6/22 10:01:56 Received UPDATE command via TCP; ignored
	6/22 10:02:05 Received UPDATE command via TCP; ignored
	6/22 10:02:14 Received UPDATE command via TCP; ignored
	...

What's equally frustrating is that since the deployment of SP2, even my
submit nodes seem to become disconnected.  Users submit jobs from their
desktops, the jobs appear in their local queue, but no one else can see
the jobs via condor_q -global.  The jobs never run until the local
service is shut down and restarted.

Lastly, I'm still plagued by errors like this in my CollectorLog which
must have some correlation to all of these other symptoms:

	6/22 10:01:47 DC_AUTHENTICATE: attempt to open invalid session...
	... IQ00:3320:1118923534:1043, failing.
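
To check whether the two kinds of CollectorLog messages quoted in this
post rise and fall together, a rough tally per day could look like the
sketch below.  The function and label names are hypothetical.  The
match strings are taken from the excerpts above, and the code assumes
each log entry sits on one line beginning with a date field such as
"6/22".

```python
from collections import Counter

# Substrings copied from the CollectorLog excerpts in this post.
PATTERNS = {
    "tcp_update_ignored": "Received UPDATE command via TCP; ignored",
    "invalid_session": "DC_AUTHENTICATE: attempt to open invalid session",
}


def tally_collector_log(lines):
    """Count matching log lines, keyed by (label, date).

    The date is taken to be the first whitespace-separated field on
    the line, which is an assumption based on the excerpts.
    """
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                date = line.split(None, 1)[0]
                counts[(label, date)] += 1
    return counts
```

Passing it an open file handle for the CollectorLog (the path depends
on the installation) gives a Counter that can be printed or compared
day by day.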

I saw several postings regarding DC_AUTHENTICATE errors but none of them
suggested any remedy.  One posting suggested that most likely the
"session" has been purged from the session cache so it is indeed
invalid.  That's great but what exactly IS a session in this context?
And how do I prevent this error from occurring?  Is it bad?  I don't
know.

I would greatly appreciate hearing any insight anyone has on these
issues.  Condor has been a great resource for us and we have come to
rely on it quite heavily.  After 15 months of use, I thought I would
have it running like clockwork.  Instead, I find myself fiddling with
it almost daily.  

Sorry again for the long post.

-Bryan

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Rob Fletcher
Sent: Wednesday, June 22, 2005 9:09 AM
To: Condor Users
Subject: [Condor-users] Restarting comms with XP nodes (SP2 UDP bugette)

Hi,

If you are having problems with your master dropping connections with
Windows XP nodes (which are running SP2), due to the UDP bug and have
not as yet moved over to using TCP connections then...

Issuing a condor_reconfig <node>

Will re-establish the connection. e.g. condor_reconfig xpnode0

Maybe this note will be useful to some people.

Cheers,

Rob

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users