
Re: [Condor-users] condor_status stuck



It's quite possible I've been bitten by the CONDOR_DEVELOPERS_COLLECTOR
macro.

In my configuration, the setting of this macro was commented out as it
comes out-of-the-box.  

I understand the default behaviour is:

"By default, they will be sent to condor.cs.wisc.edu. If you do not want
these updates to be sent from your pool, explicitly set this macro to
NONE. If undefined (commented out) in the configuration file, Condor
follows its default behavior."

The thing is my cluster is on a private network.

I've been restarting my pool today to diagnose why the collector
apparently is non-responsive.  After every restart it would hang, with
the other daemons reporting "timeout reading 5 bytes from <... : 9618>".

Finally, after setting this macro to NONE, I have a responsive pool.
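
For anyone hitting the same symptom, the change amounts to the fragment
below. The macro name and the NONE value come from the manual text quoted
above; which file it goes in is only an assumption on my part, so put the
line wherever your pool keeps its local settings.

```
# Local configuration file on the central manager (assumed location;
# adjust to your setup).
# Stop the periodic pool updates to the Condor developers' collector.
CONDOR_DEVELOPERS_COLLECTOR = NONE
```

Running 'condor_config_val CONDOR_DEVELOPERS_COLLECTOR' afterwards should
show whether the new value is being picked up. The collector presumably
needs a reconfig or restart before the change takes effect.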

I wonder if the collector, when attempting to "phone home", has some
ridiculous timeout associated with it causing it to be non-responsive if
it cannot get a connection.

Again, this is my best explanation for why the collector was locking up.
But it would also explain why the lock-up is intermittent - that is,
perhaps the intermittent part is coming from the "periodic" updates?
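
One way to test this hypothesis from the central manager is to time how
long it takes to resolve and reach the outside host from the private
network. The sketch below is only that, a sketch: the function name
'probe' and the 5 second timeout are my own choices, and it says nothing
about how Condor itself sends its updates (it may well use UDP rather
than TCP). It only shows whether name resolution or an outbound
connection is slow or failing from this machine.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Resolve host, then attempt a TCP connect with a bounded timeout.

    Returns (stage, seconds), where stage is 'ok', 'dns-failed' or
    'connect-failed' and seconds is the total elapsed time.
    """
    start = time.monotonic()
    try:
        # Note: getaddrinfo has no timeout of its own, so a very long
        # elapsed time here points at name resolution.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failed", time.monotonic() - start

    family, socktype, proto, _canonname, addr = infos[0]
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)  # bound the connect attempt
            sock.connect(addr)
    except OSError:  # includes timeouts and refused connections
        return "connect-failed", time.monotonic() - start
    return "ok", time.monotonic() - start
```

Calling probe("condor.cs.wisc.edu", 9618) on the central manager and
looking at the elapsed time would be the obvious first check; the host
name is the one from the manual text above, and the port is an assumption
borrowed from the collector port in my logs.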

thanks for the help - especially focusing my attention on the collector

if this diagnosis is correct, then hopefully this will help someone down
the road...



 

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jason Stowe
Sent: Thursday, March 27, 2008 1:44 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_status stuck

Andrew,
>  - a few other of the same PERMISSION DENIED for QUERY_STARTD_PVT_ADS
Based upon the info you've given, all signs point to the collector
needing to be restarted, or to your security settings having changed or
otherwise preventing the querying of ClassAds.

Probably need to look at the security settings, and if those haven't
changed since when condor_status was working, try restarting your
collector process. Hope that helps!

Good Luck,
Jason


--
===================================
Jason A. Stowe

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com

On Thu, Mar 27, 2008 at 12:04 PM, Pleat, Andrew C.
<andrew.pleat@xxxxxxx> wrote:
> One other unusual message which most likely is unrelated is:
>
>  on execution machine CollectorLog periodically (~ every 10 minutes):
>  - Trying to query collector < (central manager) : 9618 >
>  - condor_read(): Socket closed when trying to read 5 bytes from ... 9618
>  - IO: EOF reading packet header
>  - Couldn't fetch ads: communication error
>  - Aborting negotiation cycle
>
>  and on central manager at same time:
>  - DaemonCore: PERMISSION DENIED to unknown user from host <(the
>  execution machine above):9625> for command 49 (UPDATE_NEGOTIATOR_AD)
>  - a few other of the same PERMISSION DENIED for QUERY_STARTD_PVT_ADS
>
>  again, no idea if it's related but something to fix...
>
>  thanks again
>
>
>
>  -----Original Message-----
>  From: condor-users-bounces@xxxxxxxxxxx
>  [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Pleat, Andrew C.
>  Sent: Thursday, March 27, 2008 11:35 AM
>  To: Condor-Users Mail List
>  Subject: Re: [Condor-users] condor_status stuck
>
>  - condor_status on central manager is hanging
>  - condor_status is hanging on other machines as well
>  - CollectorLog
>         - lots of apparently normal messages up until 10:30 and then silence
>         - only unusual message is at 10:17:
>                 - can't send UPDATE_COLLECTOR_AD to collector ((nul):
>                   Failed to send UDP update command to collector
>                 - Housekeeper: Ready to clean old ads
>                 -   <bunch of 'Cleaning' messages>
>                 - then resume normal messages up until 10:30 silence
>  - condor_status eventually failed (tens of minutes later):
>         - SECMAN:2003:TCP connection to <... : 9618> failed
>  - subsequently CollectorLog shows:
>         - condor_collector (CONDOR_COLLECTOR) STARTING UP
>         - this must be the master restarting it (as Steve Timm indicated)
>  - reissued 'condor_status' - again stuck
>  - MasterLog
>         - at 11:25 shows:
>                 - NEGOTIATOR recovered
>                 - COLLECTOR recovered
>                 - SCHEDD recovered
>  - the 'condor_restart -subsystem schedd' that I issued initially
>  finally went through (although I now understand it wasn't the likely
>  culprit)
>  - reissued 'condor_q' and same result : Failed to fetch ads ... : 9679
>         - note the port changed
>
>  thanks for the responses
>
>  -----Original Message-----
>  From: condor-users-bounces@xxxxxxxxxxx
>  [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
>  Sent: Thursday, March 27, 2008 11:15 AM
>  To: Condor-Users Mail List
>  Subject: Re: [Condor-users] condor_status stuck
>
>
>
>  Hi Andrew -
>
>  Based upon your clues below, everything points to the condor_collector
>  process not responding.    What does the CollectorLog on your central
>  manager machine have to say for itself?    Can you run "condor_status"
>  on your central manager?
>
>  thanks,
>  Todd
>
>
>  Pleat, Andrew C. wrote:
>  >
>  >
>  > Condor 6.8.5
>  >
>  > Occasionally, there's some sort of lock-up occurring in my cluster.
>  > The symptoms are:
>  >
>  > - condor_status hangs indefinitely
>  > - condor_q hangs for about a minute and returns 'Failed to fetch ads
>  > from: <... : 9683> : ..'
>  > - condor_restart -subsystem schedd hangs
>  >         - I tried this based on looking at condor_users mail
>  > - condor processes still running (although no apparent activity)
>  >
>  > Logs:
>  > - MasterLog shows normal activity
>  > - NegotiatorLog seems to have stopped reporting
>  >         - normally it writes messages every 5 minutes
>  >         - the last report was "Getting all public ads ..."
>  > - SchedLog reports 'Called reschedule_negotiator()' as last message
>  >         - a condor_submit_dag had been performed in the same time frame
>  >         - normally, the next message is "Activity on stashed
>  > negotiator socket"
>  > - StartLog has nothing special (although file is still being touched)
>  >         - the only other file still being touched is MasterLog
>  >
>  > My conclusion would be the negotiator is somehow stuck.
>  >
>  > any ideas
>  >
>  > thank you
>  > andy pleat
>  >
>  >
>  >
>  >
>  >
>  > ------------------------------------------------------------------------
>  >
>  > _______________________________________________
>  > Condor-users mailing list
>  > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
>  > with a subject: Unsubscribe
>  > You can also unsubscribe by visiting
>  > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>  >
>  > The archives can be found at:
>  > https://lists.cs.wisc.edu/archive/condor-users/
>
>
>  --
>  Todd Tannenbaum                       University of Wisconsin-Madison
>  Condor Project Research               Department of Computer Sciences
>  tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
>  Phone: (608) 263-7132                 Madison, WI 53706-1685