
[HTCondor-users] condor_write(): Socket closed when trying to write



Hi,

I have two HTCondor clusters (version 8.0.6), each with two nodes (a master and a worker node), and I want to enable flocking between them.

The first cluster has the following members:
master01.demo01.org -> 192.168.251.2
wn01.demo01.org -> 192.168.251.3

These are the members of the second cluster:
master02.demo02.org -> 192.168.252.2
wn02.demo02.org -> 192.168.252.3

I have added some configuration variables to the HTCondor configuration files, and I can successfully run the command condor_status -pool master0*.
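
For reference, flocking between two pools like these is usually set up with variables along the following lines. This is only a sketch that assumes the host names listed above; it is not a copy of the configuration actually in use, and the security settings of a real pool will likely need more than this.

```
# Sketch only: assumed values, not the actual configuration files.

# On master01.demo01.org (the pool the jobs are submitted from)
FLOCK_TO = master02.demo02.org
# Let the remote negotiator talk to the local schedd
ALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

# On master02.demo02.org (the pool being flocked to)
FLOCK_FROM = master01.demo01.org
```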

However, when I try to run a "large" task (which runs the hostname command 350 times), the flocking mechanism does not work. The task is submitted from master01.demo01.org.
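
A task of this kind can be described with a submit file roughly like the one below. This is a minimal sketch: the file names are illustrative assumptions, not the actual submit file.

```
# Sketch only: output/error/log file names are assumed for illustration.
universe   = vanilla
executable = /bin/hostname
output     = hostname.$(Cluster).$(Process).out
error      = hostname.$(Cluster).$(Process).err
log        = hostname.log
queue 350
```

Passing such a file to condor_submit queues 350 jobs in a single cluster, numbered 0 through 349.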

Here are some snippets from Negotiator log files (same date and same hour).

master01 -> NegotiatorLog
08/06/14 19:55:20 ---------- Started Negotiation Cycle ----------
08/06/14 19:55:20 Phase 1:  Obtaining ads from collector ...
08/06/14 19:55:20   Getting startd private ads ...
08/06/14 19:55:20   Getting Scheduler, Submitter and Machine ads ...
08/06/14 19:55:20   Sorting 3 ads ...
08/06/14 19:55:20 Got ads: 3 public and 1 private
08/06/14 19:55:20 Public ads include 1 submitter, 1 startd
08/06/14 19:55:20 Phase 2:  Performing accounting ...
08/06/14 19:55:20 Phase 3:  Sorting submitter ads by priority ...
08/06/14 19:55:20 Phase 4.1:  Negotiating with schedds ...
08/06/14 19:55:20   Negotiating with condor@xxxxxxxxxx at <192.168.251.2:57952>
08/06/14 19:55:20 0 seconds so far
08/06/14 19:55:20     Request 00021.00029:
08/06/14 19:55:20       Rejected 21.29 condor@xxxxxxxxxx <192.168.251.2:57952>: no match found
08/06/14 19:55:20     Got NO_MORE_JOBS;  done negotiating
08/06/14 19:55:20  negotiateWithGroup resources used scheddAds length 1
08/06/14 19:55:20 ---------- Finished Negotiation Cycle ----------

master02 -> NegotiatorLog
08/06/14 19:55:20 ---------- Started Negotiation Cycle ----------
08/06/14 19:55:20 Phase 1:  Obtaining ads from collector ...
08/06/14 19:55:20   Getting startd private ads ...
08/06/14 19:55:20   Getting Scheduler, Submitter and Machine ads ...
08/06/14 19:55:20   Sorting 4 ads ...
08/06/14 19:55:20 Got ads: 4 public and 1 private
08/06/14 19:55:20 Public ads include 1 submitter, 1 startd
08/06/14 19:55:20 Phase 2:  Performing accounting ...
08/06/14 19:55:20 Phase 3:  Sorting submitter ads by priority ...
08/06/14 19:55:20 Phase 4.1:  Negotiating with schedds ...
08/06/14 19:55:20   Negotiating with condor@xxxxxxxxxx at <192.168.251.2:57952>
08/06/14 19:55:20 0 seconds so far
08/06/14 19:55:20 condor_write(): Socket closed when trying to write 200 bytes to schedd condor@xxxxxxxxxx, fd is 8
08/06/14 19:55:20 Buf::write(): condor_write() failed
08/06/14 19:55:20     Failed to send scheddName/eom to condor@xxxxxxxxxx (<192.168.251.2:57952>)
08/06/14 19:55:20   Error: Ignoring submitter for this cycle
08/06/14 19:55:20  negotiateWithGroup resources used scheddAds length 0
08/06/14 19:55:20 ---------- Finished Negotiation Cycle ----------

Any clues as to why this is happening?

Thanks a lot for your hints.

John