
Re: [Condor-users] Condor multiple pools





Steffen Grunewald wrote:
On Mon, Oct 30, 2006 at 12:21:18PM +0100, Cor Cornelisse wrote:
Hi,

We are setting up a cluster at our campus, and since we have some experience
with Condor we plan on using it. The machines which will join the cluster
are scattered throughout the building. Since there are not enough power /
network connections available to fit them into one room, it comes down to
small clusters of, say, 10 to 20 boxes, each connected to the
internal network via NAT.
At first we thought of creating separate Condor pools on all these
subclusters and then using job flocking. However, we'd like to have the
ability to use ALL machines for ONE big job. If I'm correct, job flocking can
only move a whole job from one pool to another.
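
For reference, flocking between two of the subcluster pools is configured roughly as sketched below. The host names are hypothetical, and the settings should be checked against the manual for the installed Condor version. The target pool must also grant the flocking submit machine the usual host-based write access.

```
## On the submit machine in pool A (hypothetical host names throughout)
## Central manager(s) of the pool(s) that idle jobs may flock to
FLOCK_TO = cm.pool-b.example.org

## On the machines in pool B
## Submit machines in other pools that are allowed to flock in
FLOCK_FROM = submit.pool-a.example.org
```

Each flocked job still runs entirely inside one pool; nothing here lets a single job span several pools.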

When you say, "one big job", are you talking about MPI or something like that?

Condor glide-in looks a bit like overkill to me, since we'd then be
running Condor within Condor.

The overhead of running an extra startd for each job is typically not significant. However, glidein still requires bi-directional connectivity between submit and execute machines, so you would need to use GCB within the glidein pool itself. Within the underlying pools, you would not necessarily have to use GCB, as long as you have one public schedd per pool. The glideins could be submitted on-demand from some central location to each of these publicly accessible schedds. Of course, it would take some effort to set that all up and maintain it.

I've spent quite some time reading documentation and the only thing I
could come up with is using GCB to create one big pool. However, this
would severely affect the scalability.

From what I have seen, pools on the order of 2000 CPUs are practical, with some attention to configuration details. Beyond that, I lack experience to comment.

We might like to add an existing
cluster in the future and if we would be using GCB, the existing cluster's
configuration would have to be adapted to use GCB and join our pool.

There is an active effort to make GCB less invasive, so, for example, communication within a pool could take place without any dependence on GCB, but communication with external submitters would use GCB. As it exists today, you are correct that GCB is all or nothing.
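
For anyone unfamiliar with what "adapting to GCB" means on each node, a rough sketch of the per-machine settings follows. The broker address and the routing-table path are placeholders, and the macro names are as I recall them from the 6.8-era manual, so verify them against the documentation for your version.

```
## Per-machine settings to route Condor traffic through a GCB broker
## (placeholder values; not a tested configuration)
NET_REMAP_ENABLE  = true
NET_REMAP_SERVICE = GCB
## IP address of the GCB broker reachable from this node
NET_REMAP_INAGENT = 192.0.2.10
## Optional routing table describing which networks need the broker
NET_REMAP_ROUTE   = /path/to/gcb_routing_table
```

Because these settings apply to every daemon on the machine, an existing cluster joining the pool would need them on all of its nodes, which is the all-or-nothing aspect mentioned above.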

I find it hard to believe I'm the only one who would like to join multiple
pools and still have the ability to have one job running over multiple
pools. I must be overlooking something; can someone give me a hint in the
right direction? I do understand Condor is about HTC, and what I'm
requesting is actually an HPC kind of thing, but does this mean I will have
to go looking for something else instead of Condor?

Sounds like a VLAN might be a solution for you. It would let you use the general
network infrastructure and still keep the clusters separated from other
stuff... I'd ask the IT guys whether they can make this possible.

Cheers
 Steffen