
[condor-users] Condor 6.6 for Windows: Negotiator fails if any grid member fails



So, I added a machine to my Windows grid, but whenever I turn that machine on, the whole grid stops running jobs.
 
After turning on D_FULLDEBUG, I found the following in the negotiator log on the central manager:
 

11/20 12:01:31 ---------- Started Negotiation Cycle ----------
11/20 12:01:31 Phase 1:  Obtaining ads from collector ...
11/20 12:01:31   Getting all public ads ...
11/20 12:01:31   Sorting 26 ads ...
11/20 12:01:31   Getting startd private ads ...
11/20 12:01:31 Got ads: 26 public and 10 private
11/20 12:01:31 Public ads include 1 submitter, 10 startd
11/20 12:01:31 Phase 2:  Performing accounting ...
11/20 12:01:31 Phase 3:  Sorting submitter ads by priority ...
11/20 12:01:31 Phase 4.1:  Negotiating with schedds ...
11/20 12:01:31   Negotiating with Heinzm@USWPUS00431 at <54.14.48.190:4844>
11/20 12:01:31     Request 00021.00000:
11/20 12:01:52 Can't connect to <54.14.48.95:1168>:0, errno = 10060
11/20 12:01:52 Will keep trying for 10 seconds...
11/20 12:01:53 Connect failed for 10 seconds; returning FALSE
11/20 12:01:53 ERROR: SECMAN:2003:TCP connection to <54.14.48.95:1168> failed
 
11/20 12:01:53 condor_write(): Socket closed when trying to write buffer
11/20 12:01:53 Buf::write(): condor_write() failed
11/20 12:01:53       Could not send PERMISSION
11/20 12:01:53   Error: Ignoring schedd for this cycle
11/20 12:01:53 ---------- Finished Negotiation Cycle ----------
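To pin down which machine the negotiator is stalling on, a small script can pull the failed connection attempts out of a NegotiatorLog excerpt like the one above. This is only a sketch: the regular expression assumes the exact `Can't connect to <ip:port>` line format shown here, and the function names are made up for illustration. (On Windows, errno 10060 is the Winsock "connection timed out" code, WSAETIMEDOUT.)

```python
import re

# Assumed line format, copied from the log above:
# "11/20 12:01:52 Can't connect to <54.14.48.95:1168>:0, errno = 10060"
CONNECT_FAIL = re.compile(
    r"^(?P<stamp>\d\d/\d\d \d\d:\d\d:\d\d) Can't connect to "
    r"<(?P<addr>[\d.]+:\d+)>.*errno = (?P<errno>\d+)"
)


def connect_failures(log_text):
    """Hypothetical helper: return (timestamp, address, errno) for each failed connect."""
    failures = []
    for line in log_text.splitlines():
        m = CONNECT_FAIL.match(line.strip())
        if m:
            failures.append((m["stamp"], m["addr"], int(m["errno"])))
    return failures


def schedd_ignored(log_text):
    """Hypothetical helper: True if the negotiator gave up on a schedd this cycle."""
    return "Ignoring schedd for this cycle" in log_text


sample = """11/20 12:01:31     Request 00021.00000:
11/20 12:01:52 Can't connect to <54.14.48.95:1168>:0, errno = 10060
11/20 12:01:53   Error: Ignoring schedd for this cycle"""
print(connect_failures(sample))  # [('11/20 12:01:52', '54.14.48.95:1168', 10060)]
print(schedd_ignored(sample))    # True
```

Run against the full log, this points at 54.14.48.95 as the only address the negotiator could not reach before it dropped the schedd.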
 
For some reason, this one machine causes the negotiator to ignore the schedd for the entire cycle, so none of my jobs get matched. Turning off Condor on this machine makes negotiation work again:
 
11/20 12:06:53 ---------- Started Negotiation Cycle ----------
11/20 12:06:53 Phase 1:  Obtaining ads from collector ...
11/20 12:06:53   Getting all public ads ...
11/20 12:06:54   Sorting 24 ads ...
11/20 12:06:54   Getting startd private ads ...
11/20 12:06:54 Got ads: 24 public and 9 private
11/20 12:06:54 Public ads include 1 submitter, 9 startd
11/20 12:06:54 Phase 2:  Performing accounting ...
11/20 12:06:54 Phase 3:  Sorting submitter ads by priority ...
11/20 12:06:54 Phase 4.1:  Negotiating with schedds ...
11/20 12:06:54   Negotiating with Heinzm@USWPUS00431 at <54.14.48.190:4844>
11/20 12:06:54     Request 00021.00002:
11/20 12:06:54       Matched 21.2 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.51:1065>
11/20 12:06:54       Successfully matched with CDPGRID04.merck.com
11/20 12:06:54     Request 00021.00003:
11/20 12:06:54       Matched 21.3 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.52:1063>
11/20 12:06:54       Successfully matched with CDPGRID05.merck.com
11/20 12:06:54     Request 00021.00004:
11/20 12:06:55       Matched 21.4 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.177:1775>
11/20 12:06:55       Successfully matched with CDPGRID02.merck.com
11/20 12:06:55     Request 00021.00005:
11/20 12:06:55       Matched 21.5 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.76:2342>
11/20 12:06:55       Successfully matched with PRIMENTIA3
11/20 12:06:55     Request 00021.00006:
11/20 12:06:55       Matched 21.6 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.87:2229>
11/20 12:06:55       Successfully matched with PRIMENTIA2
11/20 12:06:55     Request 00021.00007:
11/20 12:06:55       Matched 21.7 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.111:1028>
11/20 12:06:55       Successfully matched with cdpdtof01.merck.com
11/20 12:06:55     Request 00021.00008:
11/20 12:06:55       Matched 21.8 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.1:1081>
11/20 12:06:55       Successfully matched with CDPGRID03.merck.com
11/20 12:06:55     Request 00021.00009:
11/20 12:06:55       Matched 21.9 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.9:4060>
11/20 12:06:56       Successfully matched with CDPGRID01.merck.com
11/20 12:06:56     Over submitter resource limit (8) ... only consider startd ranks
11/20 12:06:56     Got NO_MORE_JOBS;  done negotiating
11/20 12:06:56 Phase 4.2:  Negotiating with schedds ...
11/20 12:06:56   Negotiating with Heinzm@USWPUS00431 at <54.14.48.190:4844>
11/20 12:06:56 ---------- Finished Negotiation Cycle ----------
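For comparison, the successful cycle can be summarized the same way, which makes it easy to check that the address the negotiator timed out on in the first log never appears among the matched startds. Again a hypothetical helper, assuming each `Matched ...` line is followed by its `Successfully matched with ...` line exactly as above.

```python
import re

# Assumed formats, copied from the log above:
# "Matched 21.2 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.51:1065>"
# "Successfully matched with CDPGRID04.merck.com"
MATCHED = re.compile(
    r"Matched (?P<job>\d+\.\d+) \S+ <[^>]+> preempting \S+ <(?P<startd>[^>]+)>"
)
MATCHED_WITH = re.compile(r"Successfully matched with (?P<machine>\S+)")


def matches(log_text):
    """Hypothetical helper: return (job_id, startd_address, machine) per successful match."""
    result = []
    pending = None  # (job_id, startd_address) waiting for its "Successfully matched" line
    for line in log_text.splitlines():
        m = MATCHED.search(line)
        if m:
            pending = (m["job"], m["startd"])
            continue
        w = MATCHED_WITH.search(line)
        if w and pending:
            result.append((pending[0], pending[1], w["machine"]))
            pending = None
    return result


sample = """11/20 12:06:54       Matched 21.2 Heinzm@USWPUS00431 <54.14.48.190:4844> preempting none <54.14.48.51:1065>
11/20 12:06:54       Successfully matched with CDPGRID04.merck.com"""
found = matches(sample)
print(found)  # [('21.2', '54.14.48.51:1065', 'CDPGRID04.merck.com')]
print(any(addr == "54.14.48.95:1168" for _, addr, _ in found))  # False
```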
 
I've had this problem with two different machines with Condor 6.6 installed. Can you please help?