
[Condor-users] condor flocking



All,
I'm testing Condor 7.4.2 on my two clusters, cluster1 (10.122.100.0/24) and cluster2 (10.122.200.0/24). I set up flocking from cluster2 to cluster1, and the Condor config files look like this:

# Cluster1 Central Manager
CONDOR_HOST = $(FULL_HOSTNAME) #same as central-manager.cluster1.net
FLOCK_FROM = *.cluster2.net
FLOCK_TO = Not Defined
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = *
ALLOW_WRITE = *
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD    = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR  = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD     = $(ALLOW_READ), $(FLOCK_FROM)

# Cluster2 Central Manager
CONDOR_HOST = $(FULL_HOSTNAME) #same as central-manager.cluster2.net
FLOCK_FROM = Not Defined
FLOCK_TO = central-manager.cluster1.net
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_READ = *
ALLOW_WRITE = *
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD    = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR  = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD     = $(ALLOW_READ), $(FLOCK_FROM)

Without flocking, each cluster runs its jobs correctly. But when I enable flocking between the clusters and submit a job on cluster1, none of the jobs run.
Has anyone run into this problem?
NB: DNS is working fine.

NegotiatorLog
06/04 21:45:19 ---------- Started Negotiation Cycle ----------
06/04 21:45:19 Phase 1:  Obtaining ads from collector ...
06/04 21:45:19   Getting all public ads ...
06/04 21:45:25   Sorting 16 ads ...
06/04 21:45:25   Getting startd private ads ...
06/04 21:45:25 Got ads: 16 public and 8 private
06/04 21:45:25 Public ads include 1 submitter, 8 startd
06/04 21:45:25 Phase 2:  Performing accounting ...
06/04 21:45:25 Phase 3:  Sorting submitter ads by priority ...
06/04 21:45:25 Phase 4.1:  Negotiating with schedds ...
06/04 21:45:25   Negotiating with condor@xxxxxxxxxxxxxxxxxxxxx at <10.122.200.1:46303>
06/04 21:45:25 0 seconds so far
06/04 21:45:25 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from schedd condor@xxxxxxxxxxxxxxxxx
06/04 21:45:25 IO: Failed to read packet header
06/04 21:45:25     Failed to get reply from schedd
06/04 21:45:25   Error: Ignoring submitter for this cycle
06/04 21:45:25 ---------- Finished Negotiation Cycle ----------

ScheddLog
06/04 20:43:57 (pid:31170) NOTE: QUEUE_ALL_USERS_TRUSTED=TRUE - all queue access checks disabled!
06/04 20:43:57 (pid:31170) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
06/04 20:51:54 (pid:31170) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxxxx
06/04 20:51:54 (pid:31170) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxxxx
06/04 20:52:08 (pid:31170) condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from collector at <10.122.100.1:9618>.
06/04 20:52:08 (pid:31170) IO: Failed to read packet header
06/04 20:52:08 (pid:31170) Unknown negotiator (10.122.100.1).  Aborting negotiation.
06/04 20:52:28 (pid:31170) Unknown negotiator (10.122.100.1).  Aborting negotiation.
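
To make it easier to see which cluster each address in the logs above belongs to, here is a small Python sketch that checks the logged IPs against the two subnets. The helper names (cluster_of, addresses_in) are only for illustration and are not part of Condor, and the sample lines are shortened copies of the log entries above.

```python
import ipaddress
import re

# Subnets as given at the top of this post; the names are just labels.
CLUSTERS = {
    "cluster1": ipaddress.ip_network("10.122.100.0/24"),
    "cluster2": ipaddress.ip_network("10.122.200.0/24"),
}


def cluster_of(ip):
    """Return the label of the subnet containing ip, or None if neither does."""
    addr = ipaddress.ip_address(ip)
    for name, net in CLUSTERS.items():
        if addr in net:
            return name
    return None


def addresses_in(line):
    """Pull dotted-quad IPv4 addresses out of a log line (ports are dropped)."""
    return re.findall(r"\d{1,3}(?:\.\d{1,3}){3}", line)


# Shortened copies of one NegotiatorLog line and one ScheddLog line.
sample_lines = [
    "Negotiating with condor@... at <10.122.200.1:46303>",
    "Unknown negotiator (10.122.100.1).  Aborting negotiation.",
]

for line in sample_lines:
    for ip in addresses_in(line):
        print(ip, "->", cluster_of(ip))
```

This only shows subnet membership: the schedd address in the NegotiatorLog is in cluster2's range, and the negotiator that the schedd calls unknown is in cluster1's range.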

Thanks in advance.

Regards,

Iwan