
Re: [Condor-users] Condor multiple pools



>
>
> Steffen Grunewald wrote:
>> On Mon, Oct 30, 2006 at 12:21:18PM +0100, Cor Cornelisse wrote:
>>
>>> Hi,
>>>
>>> We are setting up a cluster at our campus, and since we have some
>>> experience with Condor we plan on using it. The machines which will
>>> join the cluster are scattered throughout the building. Since there
>>> are not enough power / network connections available to fit them into
>>> one room, it comes down to small clusters of roughly 10-20 boxes,
>>> which are connected to the internal network via NAT.
>>> At first we thought of creating separate Condor pools on all these
>>> sub-clusters and then using job flocking. However, we would like to
>>> have the ability to use ALL machines for ONE big job. Job flocking,
>>> if I am correct, can only migrate a job from one pool to another.
>>>
>
> When you say "one big job", are you talking about MPI or something like
> that?
>

Yes, that would be MPI.
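
For what it's worth, the kind of job I have in mind would go through
Condor's parallel universe. A rough sketch of the submit file (the
mp1script wrapper and program name below are just placeholders taken
from the examples shipped with Condor, not our actual setup):

    # hypothetical submit description for a 30-CPU MPI run
    universe        = parallel
    executable      = mp1script        # wrapper that eventually calls mpirun
    arguments       = my_mpi_program
    machine_count   = 30
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = my_mpi_program
    queue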

>>> Condor glide-in looks a bit like overkill to me, since we'll then be
>>> running condor within condor.
>>>
>
> The overhead of running an extra startd for each job is typically not
> significant.  However, glidein still requires bi-directional
> connectivity between submit and execute machines, so you would need to
> use GCB within the glidein pool itself.  Within the underlying pools,
> you would not necessarily have to use GCB, as long as you have one
> public schedd per pool.  The glideins could be submitted on-demand from
> some central location to each of these publicly accessible schedds.  Of
> course, it would take some effort to set that all up and maintain it.
>

So let's say I have two Condor pools and one additional submit machine,
and this submit machine has bi-directional communication with both pool
schedulers. Glidein then creates a sort of "virtual" pool, and GCB is
needed so that execute machines from one pool can contact execute
machines from the other pool?! Sounds nice; it would be interesting to
see how much load this puts on the network.
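
If I read the GCB chapter of the manual correctly, the NATed execute
nodes would then carry a few extra settings along these lines (the
broker address is a placeholder, and the exact macro names should be
double-checked against the manual for the version we end up running):

    # sketch of GCB-related settings on a NATed execute node
    NET_REMAP_ENABLE    = TRUE
    NET_REMAP_SERVICE   = GCB
    NET_REMAP_INAGENT   = 192.0.2.10   # public GCB broker for this subcluster
    BIND_ALL_INTERFACES = TRUE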

>>> I've spent quite some time reading documentation, and the only thing I
>>> could come up with is using GCB to create one big pool. However, this
>>> would severely affect the scalability.
>
> From what I have seen, pools on the order of 2000 CPUs are practical,
> with some attention to configuration details.  Beyond that, I lack
> experience to comment.
>

We are talking about under a hundred boxes right now; it may scale up in
the future, but certainly never to more than a few hundred.

>>> We might like to add an existing cluster in the future, and if we were
>>> using GCB, the existing cluster's configuration would have to be
>>> adapted to use GCB in order to join our pool.
>>>
>
> There is an active effort to make GCB less invasive, so, for example,
> communication within a pool could take place without any dependence on
> GCB, but communication with external submitters would use GCB.  As it
> exists today, you are correct that GCB is all or nothing.
>

Let's say I have one MPI job requiring 30 CPUs and submit it to the
cluster, which is, say, made up of 2 pools of 20 worker nodes each. One
Condor pool needs to be able to communicate with the other one, right?
Even worse, every worker node needs to be able to contact every other
worker node? That would in my case make GCB one of the basic
requirements, which in turn might be easiest to realize through glidein.
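
Plain flocking, by contrast, would only take a couple of configuration
lines, but as far as I understand each flocked job still runs entirely
inside a single pool, so a 30-CPU parallel job would never match in
either 20-node pool. Roughly, with placeholder hostnames:

    # on our submit machine
    FLOCK_TO   = cm-pool-b.example.edu

    # on pool B's central manager and execute nodes
    FLOCK_FROM = submit.example.edu
    HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), submit.example.edu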

>>> I find it hard to believe I'm the only one who would like to join
>>> multiple pools and still have the ability to have one job running over
>>> multiple pools. I must be overlooking something; can someone give me a
>>> hint in the right direction? I do understand Condor is about HTC, and
>>> what I'm requesting is actually an HPC kind of thing, but does this
>>> mean I will have to go looking for something else instead of Condor?
>>>
>>
>> Sounds like a VLAN might be a solution for you - it lets you use the
>> general network infrastructure and still keep the clusters separated
>> from other stuff... I'd ask the IT guys whether they can make this
>> possible.
>>
>> Cheers
>>  Steffen
>>

Unfortunately, a VLAN is not a solution. System administration is
seriously understaffed at our site, and we were already happy that the
needed IP addresses were configured within a week.


Thanks for the replies so far!

-- 
A lie told often enough becomes the truth.

Lenin (1870 - 1924)