
Re: [condor-users] One vm or two?



On Thu, 04 Dec 2003 16:30:30 -0800  Rob Malouf wrote:

> Unfortunately, though, I don't think it'll do what I need.  When I
> say "big" or "small" jobs, I mean that in terms of memory use, not
> running time.  Creating three equal vms would mean that each has
> one-third of the machine's physical memory, and then no job that
> requires more than that would ever run.  Do I have that right?

yes.

> Or is it possible to assign each vm all of the memory, and then use
> some other strategy to make them share it?

no.  

i think what alain was suggesting was to try to configure the startd
so that 1 VM advertised all the RAM, and the other 2 each advertised
1/2 the RAM.  unfortunately, you can't really do that. :( at least,
you can't have all 3 advertising themselves simultaneously.  the
startd prevents you from over-committing any shared system resources
(ram, swap space or disk space) when you're carving up an SMP
machine.  the only exception is the number of cpus (and therefore,
VMs).  condor lets users "lie" about that, but it prevents you from
lying about the ram.
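
to make that rule concrete, here is a rough sketch in python.  it is
only an illustration of the arithmetic, not the startd's actual code;
the function name is made up, and the 512 meg machine is the one
assumed in the examples in this thread.

```python
def fits_in_ram(vm_ram_mb, physical_ram_mb):
    """Return True if the VMs' combined RAM does not exceed physical RAM.

    Hypothetical helper: it only mirrors the "no over-committing RAM"
    rule described above, nothing more.
    """
    return sum(vm_ram_mb) <= physical_ram_mb


# assumed 512 MB dual-CPU machine
print(fits_in_ram([256, 256], 512))       # two equal VMs
print(fits_in_ram([480, 32], 512))        # one big VM, one tiny VM
print(fits_in_ram([512, 256, 256], 512))  # one full-size VM plus two halves
```

the first two layouts fit, the third does not, which is why all 3 VMs
can't be advertised at the same time.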
  

your question gets to the heart of a long-standing problem in Condor's
support for SMP machines.  as the author of that code, i feel
obligated to speak up here. :) i personally did not like the proposed
design that is now implemented.  i wanted to take longer to initially
add the support to make it more complicated and powerful.  i wanted an
SMP startd to advertise all the available resources at any given time,
and whenever a job was matched, the startd would advertise whatever
remaining resources were not used by the job and were still available.
unfortunately, that's really complicated to do correctly for a lot of
reasons not worth getting into right now.  the end result is that it
was more important to get the basic functionality out quickly, than to
delay the release while waiting for the more fancy stuff.


the "party line" on how to solve your problem given the current
implementation is as follows...

the one thing you *can* do with the existing code is define separate
"virtual machine types".  it's described in gory detail in section
3.10.6 of the 6.6.0 manual on the web "Configuring the Startd for SMP
Machines".  you can find it here:

http://www.cs.wisc.edu/condor/manual/v6.6/3_10Setting_Up.html#SECTION004106000000000000000

if you define multiple VM types for your host, you can dynamically
reconfigure how many of each VM type the startd is advertising at any
given time.  you do not have to restart the startd for these changes
to take effect, you just have to send a condor_reconfig.  the only
requirement is that you can't remove a virtual machine of a given type
unless the VM is idle.  also, you can't change the *definition* of any
virtual machine types without restarting the startd.  what you *can*
change on the fly is the number of each type being advertised at any
given time.

in your case, you'd probably want to define 3 types in your config
file.  something like:

VIRTUAL_MACHINE_TYPE_1 = cpus=1, ram=256
VIRTUAL_MACHINE_TYPE_2 = cpus=1, ram=480
VIRTUAL_MACHINE_TYPE_3 = cpus=1, ram=32

the relative sizes of ram allocated to #2 and #3 would depend on your
needs.  if you really wanted all 512, you could just use 2 types, and
not even try to run tiny jobs on the 2nd CPU when a big-memory job is
running on the machine.  something like this:

VIRTUAL_MACHINE_TYPE_1 = cpus=1, ram=256
VIRTUAL_MACHINE_TYPE_2 = cpus=1, ram=512

in the common case to have your 2 "evenly divided" VMs, each with 256
megs, you'd use this in your config file:

NUM_VIRTUAL_MACHINES_TYPE_1 = 2

however, if you wanted to switch this machine to "uneven" mode, you'd
want this, instead:

#NUM_VIRTUAL_MACHINES_TYPE_1 = 2
NUM_VIRTUAL_MACHINES_TYPE_2 = 1
NUM_VIRTUAL_MACHINES_TYPE_3 = 1

once you made that change and sent a condor_reconfig, you'd still have
2 VMs advertised, but one would have 480 megs, and the other 32.
given that you'd have to remove both type 1 VMs, you could only do
this reconfig if both VMs were idle.  when you wanted to switch back
to even partitioning, you'd just revert to the initial settings and
reconfig again.
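
if you end up flipping between the two layouts often, a small helper
that prints the NUM_VIRTUAL_MACHINES_TYPE_<N> lines for each mode can
save some typing.  this is just a sketch: the function and the mode
names "even" and "uneven" are made up here, and the type numbers assume
the 3-type definitions shown earlier (type 1 = 256, type 2 = 480,
type 3 = 32).

```python
# VM counts per type for each layout; assumes the three
# VIRTUAL_MACHINE_TYPE_<N> definitions from the example config.
MODES = {
    "even": {1: 2},          # two 256 MB VMs
    "uneven": {2: 1, 3: 1},  # one 480 MB VM and one 32 MB VM
}


def num_vm_settings(mode):
    """Return the NUM_VIRTUAL_MACHINES_TYPE_<N> lines for a layout.

    Hypothetical helper: it only builds text, it does not touch any
    config file or talk to the startd.
    """
    counts = MODES[mode]
    return [
        f"NUM_VIRTUAL_MACHINES_TYPE_{vm_type} = {count}"
        for vm_type, count in sorted(counts.items())
    ]


for line in num_vm_settings("uneven"):
    print(line)
```

the lines for a mode are meant to replace the previous mode's lines,
not to be added alongside them, and you still need to send the
reconfig yourself afterwards.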

technically, you don't even have to change the config file to make
these changes, since you could take advantage of the
"condor_config_val -set" functionality (RTFM for more details).


so, back to the party line...  what you're supposed to do if you care
about this problem is write some "outside agent" to monitor your job
queues, notice if jobs aren't running because they need more RAM than
is available, and dynamically reconfigure your pool on demand.  sorry,
that's not a joke. :)
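
for what it's worth, the decision such an agent has to make is small.
here's a minimal sketch in python of just that decision, under some
loud assumptions: the function name, the mode names and the 256/480
meg thresholds are invented for this example (taken from the sample
config above), and how you'd collect the idle jobs' memory needs or
apply the new settings is deliberately left out.

```python
def choose_mode(idle_job_ram_mb, vms_idle, current_mode,
                even_vm_ram_mb=256, big_vm_ram_mb=480):
    """Pick "even" or "uneven" partitioning for one SMP machine.

    Hypothetical policy, not anything Condor ships:
      * VMs can only be removed while idle, so if any VM on the
        machine is busy, keep the current layout;
      * go "uneven" only if some waiting job is too big for an even
        VM but would fit in the big VM;
      * otherwise use the even split.
    """
    if not vms_idle:
        return current_mode
    needs_big = any(
        even_vm_ram_mb < ram <= big_vm_ram_mb for ram in idle_job_ram_mb
    )
    return "uneven" if needs_big else "even"


# a 400 MB job is waiting and the machine is idle
print(choose_mode([100, 400], vms_idle=True, current_mode="even"))
```

a real agent would run something like this in a loop for every
machine in the pool, which is where the "quite complicated" part
comes in.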


since it's *possible* (granted, quite complicated and difficult for
users to get working) with the existing implementation, i was told
"that's good enough for now, and we can make it better later if it
turns out lots of people have this problem".  unfortunately "later"
never came, and i've been working on dozens of other features and
tasks, never coming back to the SMP support.  we'll add your name to
the growing list of folks who have inquired about this, and maybe
someday i or another condor developer will have time to work on this
again.

if anyone wants to find us a grant that's willing to pay for this
functionality, we'd love to hear from you! :)


another approach, given enough machines, is to just permanently
configure some machines as evenly divided, and some as unevenly
divided.  by "permanent" i just mean you'd do it as a human
administrator, on the time scale of days/weeks at a time, instead of
writing a program to monitor it on the time scale of minute-to-minute
demand.  this is in fact what we do with our cluster of dedicated
compute nodes at our pool at UW-Madison Computer Science.  some of
them are "big memory" machines, and some are not.  if we find there
are lots of big memory jobs in the queue that aren't getting enough
service, we convert more regular nodes into big memory nodes.  it's
inelegant, but we get by.

given that we, the developers of Condor and world-experts on the
system, haven't taken the time or effort to write the "outside agent"
to solve this problem for ourselves yet, i don't see how we can
reasonably expect anyone else to write it, either.  but, that's the
party line. ;)


sorry to be the bearer of bad news, but that's the current state of
things.  hopefully this will change, but at this point, i can't
honestly claim anyone is actively working on this particular problem.
it's on an enormous wish-list of features and improvements we
developers want to add to condor.  unfortunately, that list is
prioritized and ordered by the commitments to grants and
collaborations, not necessarily what the developers ourselves think
are the most interesting or useful things to add.

i hope this clarifies your options, and the background on why things
are as they are.  good luck!

-derek
