Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Flocking - remote nodes matching, but not executing

Date: Thu, 25 Oct 2007 12:29:21 -0500
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] Flocking - remote nodes matching, but not executing

Flocking does not remove the requirement of bidirectional connectivity between the submit node and the execute node. For example, your submit node in pool A must be able to connect to execute machines in pool B in order to run jobs on them.
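One rough way to check that requirement from the submit node is to attempt a TCP connection to each remote execute machine. The sketch below is only an illustration: the host names and port in the usage comment are hypothetical, and Condor daemons on execute machines use dynamically assigned ports unless the pool has been configured with a fixed range, so the port to probe has to come from the remote pool's administrator.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable(targets, timeout=3.0):
    """Return the (host, port) pairs from targets that cannot be reached."""
    return [(host, port) for host, port in targets
            if not can_connect(host, port, timeout)]


# Hypothetical usage, run on the submit node in pool A:
#   unreachable([("node1.pool-b.example", 9700), ("node2.pool-b.example", 9700)])
```

An empty list only shows that the outbound direction works. The execute machines must also be able to connect back to the submit node, so the same check would need to be run from pool B.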

The Linux version of Condor has a component called GCB which can be used to reverse the direction of connections in cases like this. Currently, this is not supported by the Windows version of Condor.

Depending on your usage requirements, another possible option is to use Condor-C. This allows you to submit jobs on submit node A, which are then internally resubmitted onto submit node B. Once the job finishes running in cluster B, the output files are copied back to submit node A, and the job in the queue on submit node A is marked as completed. This is generally less convenient than flocking, because you have to submit jobs on submit node A that are specifically targeted for resubmission to node B, whereas in flocking, the jobs you submit can run in either pool, depending on availability of resources. You can get fancier with Condor-C and use site-level matchmaking to try to load balance across the two clusters, but it simply isn't as seamless as flocking.
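To give a feel for what "specifically targeted" means, here is a small Python helper that writes out a Condor-C style submit description. Treat it as a sketch under assumptions: the helper name, the host names, and the file names are all hypothetical, and the grid universe syntax (universe = grid with grid_resource = condor followed by the remote schedd and the remote collector) should be checked against the manual for your Condor version, which may also call for additional remote_ attributes.

```python
def condor_c_submit_description(executable, remote_schedd, remote_collector):
    """Build the text of a minimal Condor-C submit description (hypothetical helper)."""
    lines = [
        "universe = grid",
        # Names the schedd in pool B and that pool's collector (central manager).
        f"grid_resource = condor {remote_schedd} {remote_collector}",
        f"executable = {executable}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage on submit node A:
#   text = condor_c_submit_description("myjob.exe", "clusterhead_B", "clusterhead_B")
#   open("myjob.sub", "w").write(text)   # then: condor_submit myjob.sub
```

Only the two cluster heads need to reach each other in this arrangement, because the job is handed from the schedd on node A to the schedd on node B rather than being run directly on pool B's execute machines from node A.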


I hope that explanation helps.

--Dan

Peter Myerscough-Jackopson wrote:

Dear all,
I am having a problem with joining two pools via flocking, and I suspect it is mainly my assumptions that are wrong.

Background
--------------------------
Pool A has 10 machines
Pool B has 10 machines
All machines are running WinXP 64-bit on private networks without domain controllers.

The cluster heads on pool A and B are connected via a VPN, but none of the other nodes of each cluster are connected, nor is IP traffic forwarded.

I am running these pools in collaboration with someone else and I don't have direct access to pool B.

To join the two pools together, both masters (cluster heads) have BIND_ALL_INTERFACES = true so that they can operate on their internal network interface and the VPN interface.

We have also added the name of the opposing pool's cluster head into our "hosts" file, e.g. 192.168.1.10 clusterhead_A

We have then added that name (not the IP address) to the FLOCK_TO and FLOCK_FROM macros in the condor_config file.

Our HOSTALLOW_READ and HOSTALLOW_WRITE are both *, which I know is bad, but the clusters are behind firewalls and VPNs and so only accessible by trusted parties. I was hoping to reduce the number of hoops flocking had to jump through, and hope to bring this back up to some more secure settings.

I can run "condor_q -name clusterhead_A" and see the opposing pool's queue, but if I use the IP address, i.e. "condor_q -name 192.168.1.15", I get the error message:
"Error: Collector has no record of schedd/submitter"

"condor_q -global" also successfully returns the queue from the other pool.

I have not changed the NO_DNS macro nor the DEFAULT_DOMAIN_NAME macro in the condor_config file; both are commented out. If I do change them and run condor_reconfig, then I get the following error message:
ERROR "gethostname failed, errno = 0" at line 266 in file ..\src\condor_c++_util\my_hostname.C

------------------------

The problem I get is as the subject line reads, and as you can see I've tried a few things.

What should I do to get Condor flocking working such that jobs migrate and run on the other pool, without requiring a direct connection from my head to their execute nodes?

I was under the impression that jobs would migrate to the opposing pool's queue and then be submitted and managed by the opposing pool, with the results being passed back. Am I wrong about this?

From my log files I can see my cluster head is trying to directly connect to the remote cluster's nodes, which it can't do. It also seems to have trouble connecting to itself on its VPN IP address, even though I have BIND_ALL_INTERFACES = true.

If anyone has any ideas/solutions please do reply,

Peter

P.S. I can ping the remote cluster head across the VPN and also the VPN IP address of my own machine.

Dr Peter Myerscough-Jackopson
Engineer, MAC Ltd

phone: +44 (0) 23 8076 7808  fax: +44 (0) 23 8076 0602
email: peter.myerscough-jackopson@xxxxxxxxxx  web: www.macltd.com

Multiple Access Communications Limited is a company registered in
England at Delta House, Southampton Science Park, Southampton,
SO16 7NS, United Kingdom with Company Number 1979185

