[Condor-users] schedd goes offline, kills jobs


We saw an issue on our condor installation on Friday afternoon that
killed all jobs in the cluster. Details are below. I'm looking to find
out what happened, why it killed jobs, and how to keep it from happening
again.

The first symptom was that some of our monitoring software started to
hang, because condor_q was hanging on queries against condor_schedd.

The NegotiatorLog has several messages like:
03/02/12 16:23:24 condor_read(): timeout reading 5 bytes from schedd
03/02/12 16:23:24 IO: Failed to read packet header
03/02/12 16:23:24     Failed to get reply from schedd
03/02/12 16:23:24   Error: Ignoring submitter for this cycle
03/02/12 16:23:24  negotiateWithGroup resources used scheddAds length

Finally, condor_q started failing altogether with the message:
Error: Collector has no record of schedd/submitter

At that point, I restarted condor on the gatekeeper, which runs
condor_master and condor_schedd. I've previously restarted condor on the
gatekeeper, and even rebooted it, without dropping jobs. However, this
time it didn't work that way. In worker-node logs I see messages like:

03/02/12 16:14:08 slot16: Failed to connect to schedd
03/02/12 16:14:11 slot8: State change: claim lease expired
(condor_schedd gone?)
03/02/12 16:14:11 slot8: Changing state and activity: Claimed/Busy ->

The SchedLog on osg-gk, oddly, shows nothing unusual during this time.
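
To line up when claims were dropped against when the restart happened, it
may help to pull the lease-expiry events out of the worker-node logs. The
following is only a rough sketch: the function name and the regular
expression are my own, and the pattern is based solely on the log lines
quoted above, so it may need adjusting for other Condor versions.

```python
import re

# Pattern inferred from the worker-node log lines quoted in this post
# (an assumption, not an official log format definition).
LEASE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (slot\d+): "
    r"State change: claim lease expired"
)


def lease_expirations(lines):
    """Return (timestamp, slot) pairs for each 'claim lease expired' line."""
    events = []
    for line in lines:
        match = LEASE_RE.match(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


# The excerpt from this post, used as sample input.
STARTLOG_SAMPLE = """\
03/02/12 16:14:08 slot16: Failed to connect to schedd
03/02/12 16:14:11 slot8: State change: claim lease expired
(condor_schedd gone?)
03/02/12 16:14:11 slot8: Changing state and activity: Claimed/Busy ->
"""

if __name__ == "__main__":
    for timestamp, slot in lease_expirations(STARTLOG_SAMPLE.splitlines()):
        print(timestamp, slot)
```

Running the same function over the logs from every worker node would give
one list of expiry times that can be compared with the restart time.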

I put a snapshot of the log files up here, in case anyone wants to
browse them:

On Saturday, I got an email from condor on the manager saying that the
condor_negotiator was killed because it was unresponsive.  The email
says the last lines of the NegotiatorLog were:
03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
03/03/12 16:32:19 Phase 1:  Obtaining ads from collector ...
03/03/12 16:32:19   Getting all public ads ...
03/03/12 16:32:33   Sorting 6473 ads ...
03/03/12 16:32:33   Getting startd private ads ...
03/03/12 17:12:05 Got ads: 6473 public and 5941 private
03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
03/03/12 17:12:25 Phase 2:  Performing accounting ...
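
The excerpt above shows a long silence between "Getting startd private
ads" and "Got ads". A small script can flag stalls like that across a whole
log. This is a sketch only: the helper name and the 5-minute threshold are
arbitrary choices of mine, and the timestamp format is assumed to match the
lines quoted in this post.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%m/%d/%y %H:%M:%S"   # e.g. "03/03/12 16:32:19" (assumed format)
TS_LEN = 17                       # length of that timestamp prefix


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (gap, line_before, line_after) for each silence >= threshold."""
    gaps = []
    prev_ts = None
    prev_line = None
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LEN], TS_FORMAT)
        except ValueError:
            continue  # skip continuation lines that carry no timestamp
        if prev_ts is not None and ts - prev_ts >= threshold:
            gaps.append((ts - prev_ts, prev_line.rstrip(), line.rstrip()))
        prev_ts, prev_line = ts, line
    return gaps


# The NegotiatorLog lines quoted in the email, used as sample input.
NEGOTIATOR_SAMPLE = """\
03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
03/03/12 16:32:19 Phase 1:  Obtaining ads from collector ...
03/03/12 16:32:19   Getting all public ads ...
03/03/12 16:32:33   Sorting 6473 ads ...
03/03/12 16:32:33   Getting startd private ads ...
03/03/12 17:12:05 Got ads: 6473 public and 5941 private
03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
03/03/12 17:12:25 Phase 2:  Performing accounting ...
"""

if __name__ == "__main__":
    for gap, before, after in find_gaps(NEGOTIATOR_SAMPLE.splitlines()):
        print(f"{gap} of silence after: {before}")
```

On the sample, the only gap reported is the one of roughly 40 minutes spent
fetching the startd private ads.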

The email was sent at 17:22, which matches the time the MasterLog says
it killed the negotiator process.  I don't see any messages on the
worker nodes saying that anything went into Preempting/Killing, so I
assume this event did not kill any jobs.