
Re: [Condor-users] Condor multiple pools





Cor Cornelisse wrote:

Condor glide-in looks a bit like overkill to me, since we'll then be
running condor within condor.

The overhead of running an extra startd for each job is typically not
significant.  However, glidein still requires bi-directional
connectivity between submit and execute machines, so you would need to
use GCB within the glidein pool itself.  Within the underlying pools,
you would not necessarily have to use GCB, as long as you have one
public schedd per pool.  The glideins could be submitted on-demand from
some central location to each of these publicly accessible schedds.  Of
course, it would take some effort to set that all up and maintain it.


So let's say I have two Condor pools and one additional submit machine,
and this submit machine has bi-directional communication with both pool
schedulers. Then glidein creates a sort of "virtual" pool, and will need
GCB to enable execute machines from one pool to contact execute machines
from the other pool?! Sounds nice; it would be interesting to see how much
load this puts on the network.

Yes. If there was one publicly accessible schedd per pool, you could submit the glideins to these via Condor-C. These schedds would then run the glideins on their respective pools, and the glideins would "phone home" and become part of a pool that spans across the different parts of your network. The glidein pool would need GCB to provide bidirectional connectivity in the following cases:

central manager <--> execute machines
submit machines <--> execute machines

As Greg Thain pointed out, communication between the execute machines (e.g. for MPI) is an additional problem that would need to be worked out.
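
To make the two cases above concrete, here is a minimal Python sketch that enumerates the daemon pairs needing two-way connectivity in a small glidein pool. The host names and the helper function are hypothetical and exist only for illustration; nothing here is part of Condor or GCB.

```python
from itertools import product

def links_needing_gcb(central_managers, submitters, executors):
    """List the host pairs that need bidirectional connectivity.

    Covers only the two cases named in the post:
    central manager <--> execute and submit <--> execute.
    Execute-to-execute traffic (e.g. MPI) is deliberately left out.
    """
    links = list(product(central_managers, executors))
    links += list(product(submitters, executors))
    return links

# Hypothetical hosts: two glideins running in each underlying pool.
executors = [f"poolA-node{i}" for i in range(2)] + [f"poolB-node{i}" for i in range(2)]
links = links_needing_gcb(["cm"], ["submit"], executors)

for a, b in links:
    print(f"{a} <--> {b}")
```

The number of such links grows linearly with the number of execute machines, whereas the execute-to-execute case mentioned above grows quadratically.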

I've spent quite some time reading documentation and the only thing I
could come up with is using GCB to create one big pool. However, this
would severely affect the scalability.

From what I have seen, pools on the order of 2000 CPUs are practical,
with some attention to configuration details.  Beyond that, I lack
experience to comment.


We are talking about fewer than a hundred boxes right now; maybe in the
future it will scale up, but certainly never to more than a few hundred.

Then I would say a single pool is no problem from a scalability standpoint, assuming all the network problems could be worked out.



We might like to add an existing cluster in the future, and if we were
using GCB, the existing cluster's configuration would have to be adapted
to use GCB and join our pool.

There is an active effort to make GCB less invasive, so, for example,
communication within a pool could take place without any dependence on
GCB, but communication with external submitters would use GCB.  As it
exists today, you are correct that GCB is all or nothing.

I should clarify my statement that "GCB is all or nothing". What I mean is that in order to allow Condor daemons on a node to accept incoming connections from outside of a NAT or firewall, you need to have this node use GCB, and this currently implies that all connections to this node, even from within the NAT or firewall, will also involve GCB. It does _not_ mean that all network traffic will pass through the GCB server. With a suitably configured GCB routing table, it is possible for a direct connection to be formed in many cases. However, the current implementation cannot form this direct connection without some communication with the GCB server, which adds latency and creates an additional point of failure.



Let's say I have one MPI job requiring 30 CPUs, and submit it to the
cluster, which is, say, made up of 2 pools with 20 worker nodes. One
Condor pool needs to be able to communicate with the other one, right? Even
worse, every worker node needs to be able to contact any other worker
node? That would, in my case, imply adding GCB to the basic needs, which in
turn might be easiest to realize through glidein.

Whether glidein is the "easiest" approach in this case depends on how difficult it would be for you to apply GCB to the underlying pools versus how difficult it would be for you to submit condor glideins to the different pools.
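
For a rough sense of scale in the MPI example, here is a small Python sketch of the arithmetic. It assumes each of the 2 pools has 20 single-CPU worker nodes (one reading of the numbers above) and that any rank may need to talk to any other rank; the function names are made up for this illustration.

```python
from math import comb

def mpi_split(ranks, pool_sizes):
    """Fill the pools in order and return how many ranks land in each."""
    placement = []
    remaining = ranks
    for size in pool_sizes:
        used = min(size, remaining)
        placement.append(used)
        remaining -= used
    if remaining:
        raise ValueError("not enough CPUs across all pools")
    return placement

def cross_pool_pairs(placement):
    """Count rank pairs whose two members sit in different pools."""
    all_pairs = comb(sum(placement), 2)
    same_pool_pairs = sum(comb(n, 2) for n in placement)
    return all_pairs - same_pool_pairs

# Assumption: 2 pools x 20 single-CPU nodes, one 30-rank MPI job.
placement = mpi_split(30, [20, 20])
crossing = cross_pool_pairs(placement)
total = comb(30, 2)
print(placement, crossing, total)
```

Under these assumptions the job splits 20/10, and 200 of the 435 possible rank pairs cross the pool boundary, so node-to-node connectivity between the pools cannot be avoided for a job of this size.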

--Dan