
Re: [Condor-users] condor_status stuck



It's very likely a stuck Collector.
Have a look at the CollectorLog.

Do you see anything like this?

3/27 09:59:30 Calling Handler <sockCacheHandler>
3/27 09:59:30 Got INVALIDATE_STARTD_ADS
3/27 10:00:30 condor_write(): timed out writing 29 bytes to unknown source
3/27 10:00:30 Buf::write(): condor_write() failed
3/27 10:00:30 Unable to acknowledge invalidation
3/27 10:01:30 condor_write(): timed out writing 29 bytes to unknown source
3/27 10:01:30 Buf::write(): condor_write() failed
3/27 10:01:30 Unable to acknowledge invalidation
3/27 10:01:30 Return from Handler <sockCacheHandler>
3/27 10:01:30 Calling Handler <sockCacheHandler>
3/27 10:01:30 Got INVALIDATE_STARTD_ADS
3/27 10:02:40 KEYCACHE: created: 0xa3dbb30

If so, then you've hit a long-standing bug in the collector,
which I'm told is going to get fixed Real Soon Now.
If the collector stays hung for long enough, the master will
detect that and restart it.
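
To check for this pattern without reading the whole log by eye, something
like the following might help. It is only a rough sketch: it assumes the
two message strings look exactly as in the excerpt above, and the function
names are made up for illustration. It counts write timeouts that are
later followed by "Unable to acknowledge invalidation".

```python
import re

# Assumption: message formats are taken verbatim from the excerpt above.
TIMEOUT_RE = re.compile(r"condor_write\(\): timed out writing \d+ bytes")
ACK_FAIL = "Unable to acknowledge invalidation"


def count_stuck_invalidations(lines):
    """Count write timeouts that are followed by a failed invalidation ack."""
    count = 0
    pending_timeout = False
    for line in lines:
        if TIMEOUT_RE.search(line):
            pending_timeout = True
        elif ACK_FAIL in line and pending_timeout:
            count += 1
            pending_timeout = False
    return count


def scan_collector_log(path):
    """Run the count over a log file; the path is supplied by the caller."""
    with open(path, errors="replace") as f:
        return count_stuck_invalidations(f)
```

Point scan_collector_log() at your own CollectorLog; condor_config_val LOG
should print the directory it lives in. A non-zero count suggests the
collector has been blocking on these writes.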

Steve Timm

On Thu, 27 Mar 2008, Pleat, Andrew C. wrote:

Condor 6.8.5

Occasionally, there's some sort of lock-up occurring in my cluster.  The
symptoms are:

- condor_status hangs indefinitely
- condor_q hangs for about a minute and returns 'Failed to fetch ads
from: <... : 9683> : ..'
- condor_restart -subsystem schedd hangs
	- I tried this based on looking at condor_users mail
- condor processes still running (although no apparent activity)

Logs:
- MasterLog shows normal activity
- NegotiatorLog seems to have stopped reporting
	- normally it writes messages every 5 minutes
	- the last report was "Getting all public ads ..."
- SchedLog reports 'Called reschedule_negotiator()' as last message
	- a condor_submit_dag had been performed in the same time frame
	- normally, the next message is "Activity on stashed negotiator
socket"
- StartLog has nothing special (although file is still being touched)
	- the only other file still being touched is MasterLog
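
The log observations above come down to asking which daemon logs have gone
quiet. A small sketch of that check follows. The helper names and the
staleness threshold are assumptions for illustration, not anything Condor
provides; the log file names are the ones mentioned in this message.

```python
import os


def stale_logs(mtimes, now, max_age_seconds):
    """Return the names of logs last modified more than max_age_seconds ago."""
    return sorted(
        name for name, mtime in mtimes.items() if now - mtime > max_age_seconds
    )


def log_mtimes(log_dir, names):
    """Collect modification times for the named log files found in log_dir."""
    result = {}
    for name in names:
        path = os.path.join(log_dir, name)
        if os.path.exists(path):
            result[name] = os.path.getmtime(path)
    return result
```

For example, stale_logs(log_mtimes(log_dir, ["MasterLog", "NegotiatorLog",
"SchedLog", "StartLog"]), time.time(), 600) would list the logs untouched
for more than ten minutes (import time first, and set log_dir to your own
log directory). Since the negotiator normally writes every 5 minutes here,
ten minutes is just a guess at a reasonable cutoff.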

My conclusion would be the negotiator is somehow stuck.

Any ideas?

Thank you,
andy pleat





--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.