
Re: [Condor-users] sched goes offline, kills jobs



I have been facing a similar issue for a while. My solution was to rebuild the program so that it's more tolerant of failure.



On Mon, Mar 5, 2012 at 3:13 PM, Sarah Williams <saewill@xxxxxxxxx> wrote:
I should have mentioned, the condor head nodes run 7.6.0-1, and the
worker nodes run 7.6.4-1.

On 3/5/12 3:12 PM, Sarah Williams wrote:
> Hello,
>
> We saw an issue on our condor installation on Friday afternoon that
> killed all jobs in the cluster.  Details are below. I'm looking to find
> out what happened, why it killed jobs, and how to keep it from happening
> again.
>
> The first symptom was that some of our monitoring software started to
> hang, because condor_q was hanging on queries against the
> condor_schedd.
>
> The NegotiatorLog has several messages like:
> 03/02/12 16:23:24 condor_read(): timeout reading 5 bytes from schedd
> group_atlasprod.usatlas1@xxxxxxxxxxxxxxx.
> 03/02/12 16:23:24 IO: Failed to read packet header
> 03/02/12 16:23:24     Failed to get reply from schedd
> 03/02/12 16:23:24   Error: Ignoring submitter for this cycle
> 03/02/12 16:23:24  negotiateWithGroup resources used scheddAds length
>
> Finally, condor_q started failing altogether with the message:
> Error: Collector has no record of schedd/submitter
>
> At that point, I restarted condor on the gatekeeper, which runs
> condor_master and schedd.   I've previously restarted condor on the
> gatekeeper, and even rebooted it, without dropping jobs. However, this
> time it didn't work that way. In worker-node logs I see messages like:
>
> 03/02/12 16:14:08 slot16: Failed to connect to schedd
> <128.135.158.146:39156>
> 03/02/12 16:14:11 slot8: State change: claim lease expired
> (condor_schedd gone?)
> 03/02/12 16:14:11 slot8: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
>
> The SchedLog on osg-gk, oddly, shows nothing unusual during this time.
>
> I put a snapshot of the log files up here, in case anyone wants to
> browse them:
> http://www.mwt2.org/~sarah/condor/
>
> On Saturday, I got an email from condor on the manager saying that the
> condor_negotiator was killed because it was unresponsive.  The email
> says the last lines of the NegotiatorLog were:
> 03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
> 03/03/12 16:32:19 Phase 1:  Obtaining ads from collector ...
> 03/03/12 16:32:19   Getting all public ads ...
> 03/03/12 16:32:33   Sorting 6473 ads ...
> 03/03/12 16:32:33   Getting startd private ads ...
> 03/03/12 17:12:05 Got ads: 6473 public and 5941 private
> 03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
> 03/03/12 17:12:25 Phase 2:  Performing accounting ...
>
> The email was sent at 17:22, which matches the time the MasterLog says
> it killed the negotiator process.  I don't see any messages on the
> worker nodes saying that anything went into Preempting/Killing, so I
> assume this event did not kill any jobs.
>
> --Sarah
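
For what it's worth, stalls like the one in the NegotiatorLog excerpt quoted above are easier to find if you scan for large jumps between consecutive timestamps. Below is a minimal sketch in Python, not anything that ships with Condor. The names parse_timestamp and find_gaps are hypothetical, the timestamp format is assumed from the quoted excerpts, and the 5-minute threshold is an arbitrary choice.

```python
from datetime import datetime, timedelta

# Sample lines copied from the NegotiatorLog excerpt quoted in this thread.
SAMPLE_LOG = """\
03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
03/03/12 16:32:19 Phase 1:  Obtaining ads from collector ...
03/03/12 16:32:19   Getting all public ads ...
03/03/12 16:32:33   Sorting 6473 ads ...
03/03/12 16:32:33   Getting startd private ads ...
03/03/12 17:12:05 Got ads: 6473 public and 5941 private
03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
03/03/12 17:12:25 Phase 2:  Performing accounting ...
"""

# Assumption: every log line starts with "MM/DD/YY HH:MM:SS", as in the excerpts.
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M:%S"
TIMESTAMP_WIDTH = 17  # length of "03/03/12 16:32:19"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TIMESTAMP_WIDTH], TIMESTAMP_FORMAT)
    except ValueError:
        # Wrapped continuation lines carry no timestamp; skip them.
        return None


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (gap, line_before, line_after) for each jump >= threshold."""
    gaps = []
    previous = None  # (timestamp, line) of the last timestamped line seen
    for line in lines:
        stamp = parse_timestamp(line)
        if stamp is None:
            continue
        if previous is not None and stamp - previous[0] >= threshold:
            gaps.append((stamp - previous[0], previous[1], line))
        previous = (stamp, line)
    return gaps


if __name__ == "__main__":
    for gap, before, after in find_gaps(SAMPLE_LOG.splitlines()):
        print(f"{gap} with no output after: {before.strip()}")
```

On the excerpt above this reports a single gap of 0:39:32, right after "Getting startd private ads ...", which is the stretch where the negotiator sat waiting before the master killed it. To run it against a real log, pass the file's lines to find_gaps instead of SAMPLE_LOG.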




--
--- Get your facts first, then you can distort them as you please.--