
RE: [condor-users] Condor shuts down our network



We've seen this problem again.  When we came in this morning, we found that one particular grid node (not the same one as last time) was sending the Central Manager MasterAds continuously, effectively shutting down our network.  I don't know how long it had been going on, but that node alone sent the Central Manager about 1,035,016 messages between 7:38:46 and 8:27:02 this morning, almost entirely MasterAds.
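For a sense of scale, the figures above work out to a sustained rate of a few hundred messages per second. Here is a quick sketch of that arithmetic in Python; the function name `message_rate` is made up for illustration, and it assumes both clock times fall on the same day.

```python
from datetime import datetime

def message_rate(count, start, end, fmt="%H:%M:%S"):
    """Return messages per second between two same-day clock times."""
    elapsed = (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()
    if elapsed <= 0:
        raise ValueError("end must be later than start")
    return count / elapsed

# Figures taken from the message above: 1,035,016 messages, 7:38:46 to 8:27:02.
rate = message_rate(1_035_016, "7:38:46", "8:27:02")
print(f"{rate:.0f} messages/second")  # roughly 357
```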

We've retrieved the relevant logfiles from the node that was sending the messages, and I've attached them.  Although there are a lot of hints of poor communication in the logfiles, I see nothing conclusive.  The strangest thing I see is this, from the MasterLog--notice the datestamps:

[lots of messages ranging smoothly up to 4/15 07:01]
4/15 07:01:48 Child 3300 died, but not a daemon -- Ignored
4/16 03:35:23 Can't send EOM to the collector (NSI-DELL3975)
4/15 12:28:29 ERROR: Child pid 1664 appears hung! Killing it hard.
4/16 04:27:58 DaemonCore: Command received via UDP from host <10.83.1.168:4120>
[a few more messages ranging smoothly up from 4/16 04:27]

Aside from that, I don't see anything overly strange.  Any ideas?
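The out-of-order datestamps in the MasterLog excerpt above can also be flagged mechanically rather than by eye. The following is only a rough sketch: it assumes every entry starts with a `M/D HH:MM:SS` stamp as in the excerpt, and since the log lines carry no year, the `year` default is an assumption. The helper name `out_of_order` is hypothetical and is not part of Condor.

```python
import re
from datetime import datetime

# Matches the "M/D HH:MM:SS " prefix seen in the MasterLog excerpt (assumed format).
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")

def out_of_order(lines, year=2004):
    """Yield (line_number, line) for each entry stamped earlier than the one before it."""
    previous = None
    for number, line in enumerate(lines, start=1):
        match = STAMP.match(line)
        if not match:
            continue  # skip anything without a leading datestamp
        month, day, hour, minute, second = map(int, match.groups())
        current = datetime(year, month, day, hour, minute, second)
        if previous is not None and current < previous:
            yield number, line.rstrip("\n")
        previous = current

excerpt = [
    "4/15 07:01:48 Child 3300 died, but not a daemon -- Ignored",
    "4/16 03:35:23 Can't send EOM to the collector (NSI-DELL3975)",
    "4/15 12:28:29 ERROR: Child pid 1664 appears hung! Killing it hard.",
    "4/16 04:27:58 DaemonCore: Command received via UDP from host <10.83.1.168:4120>",
]

flagged = list(out_of_order(excerpt))
for number, line in flagged:
    print(f"line {number} goes backwards in time: {line}")
```

Run against a whole logfile (for example `out_of_order(open(path))`, with `path` pointing at a copy of the MasterLog), this would list every place where the timestamps jump backwards.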

-David

-----Original Message-----
From: David Vestal 
Sent: Wednesday, March 31, 2004 3:23 PM
To: 'Erik Paulson'
Subject: RE: [condor-users] Condor shuts down our network


Erik,

Our intern just came back from the computer that sent all the messages.  I've attached the logfiles from that machine, but the relevant entries are below.  The surge of Master Ads happened around 3/30/04 at 8:29 AM.

The MasterLog has:
3/26 12:25:08 Started DaemonCore process "C:\Condor/bin/condor_startd.exe", pid and pgroup = 1676
...and later...
3/30 08:29:03 Can't send EOM to the collector (NSI-DELL3975)
3/30 08:36:47 ERROR: Child pid 1676 appears hung! Killing it hard.
3/30 09:36:50 ERROR: Child pid 1676 appears hung! Killing it hard.

And if it's relevant, the StartLog contains several hundred occurrences of:
3/29 13:50:28 no loadavg samples this minute, maybe thread died???
...mostly on 3/29, but including one on 3/30 at 8:29:02.

I've also attached the config file for that node.

Thanks,
-David

-----Original Message-----
From: Erik Paulson [mailto:epaulson@xxxxxxxxxxx]
Sent: Tuesday, March 30, 2004 5:07 PM
To: David Vestal
Subject: Re: [condor-users] Condor shuts down our network


On Tue, Mar 30, 2004 at 04:27:16PM -0500, David Vestal wrote:
> > > > It appears to be an update of machine status.  Our logging software
> > > > logged 57874 of them in just over 56 seconds.
> > >
> > > That's certainly not right.
> >
> >You're right.  A few other messages are mixed in.  When I separated
> >out all except the messages sent from this one node to the central manager,
> >I found that this node had sent 56159 messages to the CM, over the course
> >of three minutes, 34 seconds.  Still, that's clearly way too many.
> 
> I know how Erik talks. What he meant is, "David, you're absolutely correct: 
> something is seriously wrong." He didn't mean, "Your statement is incorrect."
> 
> -alain
> 
> 
> Erik,
> 
> Please accept my apologies; if I'd read your message once more, I'd have picked up on what you meant.  I'm a bit red-faced right now.  Sorry about that.
> 

No problem! :)

You can make it up to me with a master log showing me what's going on with
the machine that's sending all of the traffic :)

-Erik

Attachment: Message Overflow.zip