Occasionally, some sort of lock-up occurs in my cluster. The symptoms are:
- condor_status hangs indefinitely
- condor_q hangs for about a minute and returns 'Failed to fetch ads from: <... : 9683> : ..'
- condor_restart -subsystem schedd hangs
  - I tried this based on looking at condor_users mail
- condor processes still running (although no apparent activity)
- MasterLog shows normal activity
- NegotiatorLog seems to have stopped reporting
  - normally it writes messages every 5 minutes
  - the last report was "Getting all public ads ..."
- SchedLog reports 'Called reschedule_negotiator()' as last message
  - a condor_submit_dag had been performed in the same time frame
  - normally, the next message is "Activity on stashed negotiator socket"
- StartLog has nothing special (although the file is still being touched)
  - the only other file still being touched is MasterLog
My conclusion would be that the negotiator is somehow stuck.
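
To catch this state the next time it happens, the "which logs are still being touched" check above could be automated. What follows is only a minimal sketch in Python, not a Condor tool: the function name stale_logs, the log directory, and the 600-second limit (twice the 5-minute negotiator interval mentioned above) are my own assumptions and would need adjusting to the actual setup.

```python
import os
import time


def stale_logs(log_dir, max_age_seconds, now=None):
    """Return the names of logs not modified within their allowed age.

    log_dir         -- directory holding the daemon logs (assumed location)
    max_age_seconds -- dict mapping log file name to maximum age in seconds
    now             -- reference time as a Unix timestamp (defaults to time.time())
    """
    now = time.time() if now is None else now
    stale = []
    for name, max_age in sorted(max_age_seconds.items()):
        path = os.path.join(log_dir, name)
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            # A missing or unreadable log is reported as stale too.
            stale.append(name)
            continue
        if age > max_age:
            stale.append(name)
    return stale


# Hypothetical usage; the directory is a placeholder, not a real default:
# limits = {"NegotiatorLog": 600, "MasterLog": 600, "StartLog": 600}
# print(stale_logs("/path/to/condor/log", limits))
```

In the situation described above, such a check would be expected to list NegotiatorLog while MasterLog and StartLog keep passing, which would at least give a timestamp for when the negotiator went quiet.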