
Re: [Condor-users] condor_status stuck



- condor_status on central manager is hanging 
- condor_status is hanging on other machines as well
- CollectorLog
	- lots of apparently normal messages up until 10:30 and then silence
	- only unusual message is at 10:17:
		- can't send UPDATE_COLLECTOR_AD to collector ((nul): Failed to send UDP update command to collector
		- Housekeeper: Ready to clean old ads
		-   <bunch of 'Cleaning' messages>
		- then normal messages resume until the silence at 10:30
- condor_status eventually failed (tens of minutes later):
	- SECMAN:2003:TCP connection to <... : 9618> failed
- subsequently CollectorLog shows:
	- condor_collector (CONDOR_COLLECTOR) STARTING UP
	- this must be the master restarting it (as Steve Timm indicated)
- reissued 'condor_status' - again stuck
- MasterLog
	- at 11:25 shows:
		- NEGOTIATOR recovered
		- COLLECTOR recovered
		- SCHEDD recovered
- the 'condor_restart -subsystem schedd' that I issued initially finally
  went through (although I now understand the schedd wasn't the likely
  culprit)
- reissued 'condor_q' and same result : Failed to fetch ads ... : 9679
	- note the port changed 
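
For anyone who wants to find a silence like the one at 10:30 without
reading the whole CollectorLog by eye, here is a minimal sketch. It
assumes each log line starts with a "month/day hour:minute:second"
stamp (for example "3/27 10:17:03"); that format, the find_gaps name,
the 10-minute threshold and the default year are all assumptions of
this sketch, so adjust them to whatever your log actually contains.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "M/D HH:MM:SS " (no year in the stamp).
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def find_gaps(lines, min_gap=timedelta(minutes=10), year=2008):
    """Return (last_seen, next_seen) pairs where the log went quiet.

    Lines without a leading timestamp (continuations, blank lines) are
    skipped. The year is a placeholder because the assumed stamp does not
    carry one, so a log that spans New Year will confuse this sketch.
    """
    gaps = []
    prev = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        month, day, hour, minute, second = map(int, match.groups())
        now = datetime(year, month, day, hour, minute, second)
        if prev is not None and now - prev >= min_gap:
            gaps.append((prev, now))
        prev = now
    return gaps
```

Calling find_gaps(open(path)) with the path to your CollectorLog
returns one pair per quiet period; the same function can be pointed at
NegotiatorLog or MasterLog to see whether the daemons went silent at the
same time.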

thanks for the responses

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Thursday, March 27, 2008 11:15 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_status stuck



Hi Andrew -

Based upon your clues below, everything points to the condor_collector 
process not responding.    What does the CollectorLog on your central 
manager machine have to say for itself?    Can you run "condor_status" 
on your central manager?
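
Since a hung condor_status can block for tens of minutes, it may help
to run the check under a timeout so a hang is reported instead of
waited out. The following is only a sketch: the probe name, the
60-second limit and the result labels are choices made for this
example, not anything Condor provides.

```python
import subprocess


def probe(cmd, timeout_s=60):
    """Run cmd and classify the outcome as ok, failed, hung or missing."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"  # the command is not on PATH
    return "ok" if result.returncode == 0 else "failed"
```

Running probe(["condor_status"]) on the central manager and again on
another machine shows whether the tool returns, fails, or hangs in each
place. The same call works for condor_q.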

thanks,
Todd


Pleat, Andrew C. wrote:
> 
> 
> Condor 6.8.5
> 
> Occasionally, there's some sort of lock-up occurring in my cluster.  
> The symptoms are:
> 
> - condor_status hangs indefinitely
> - condor_q hangs for about a minute and returns 'Failed to fetch ads
> from: <... : 9683> : ..'
> - condor_restart -subsystem schedd hangs
>         - I tried this based on looking at condor_users mail
> - condor processes still running (although no apparent activity)
> 
> Logs:
> - MasterLog shows normal activity
> - NegotiatorLog seems to have stopped reporting
>         - normally it writes messages every 5 minutes
>         - the last report was "Getting all public ads ..."
> - SchedLog reports 'Called reschedule_negotiator()' as last message
>         - a condor_submit_dag had been performed in the same time frame
>         - normally, the next message is "Activity on stashed 
> negotiator socket"
> - StartLog has nothing special (although file is still being touched)
>         - the only other file still being touched is MasterLog
> 
> My conclusion would be the negotiator is somehow stuck.
> 
> any ideas
> 
> thank you
> andy pleat
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> with a subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> 
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/


-- 
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685