
RE: [condor-users] condor ClassAds and pvm?



Mark,

I had replied offline to Jeff Linderoth, to spare the list the long
description of what I am doing, but since you ask, I might as well post it
here. Jeff's suggestions follow, and my questions follow that (it will be a
long post).

> -----Original Message-----
> From: Mark Silberstein [mailto:marks@xxxxxxxxxxxxxxxxxxxxxxx]
> Sent: Monday, January 26, 2004 4:59 PM
> To: gardner@xxxxxxxxxxxxxx
> Subject: Re: [condor-users] condor ClassAds and pvm?
>
>
> Hi
> I'm not sure you can get exactly what you want, but the closest thing is
> Condor MW library
> (http://www.cs.wisc.edu/condor/mw) and Condor PVM
> (http://www.cs.wisc.edu/condor/pvm)
> They allow you to run PVM applications ( Master-Worker ) over Condor, so
> that Condor manages the resource allocation, and PVM is used as a
> transport.
> By the way, I would most appreciate it if you could describe your
> application and its general purpose in a little more detail. I'm trying
> to analyze different kinds of applications with high memory
> requirements, and your application could contribute greatly. In return
> I can promise to help with making Condor work for you ...
> Thanks a lot
> Mark
>

----------------------------------------
My description of the problem:
----------------------------------------

> We are trying to solve an impossibly large MIP. In order to do it, we have
> algorithms that break the one problem into about 5000 smaller problems,
> using geographical boundaries and other information to minimize the
> interaction between these problems. Of the 5000 smaller problems, perhaps
> 100 or so are still large enough to require between 1 and 2GB of RAM to
> solve. The rest scale pretty evenly down to trivial solutions.
>
> For largely historical reasons, the program starts off with 14 parallel
> processes (one for each geographical region). Each of these processes
> calculates the splits into smaller problems. We don't have enough memory
> to then solve the problems inside these processes. Instead, they each
> take the largest 'n' problems and split them off into more parallel
> processes. This would result in perhaps 500 simultaneous parallel
> processes, which would overwhelm the computer system.
>
> To handle that, I have a dedicated PVM process (mbg_spawn) which does the
> actual spawns. When my solver processes want to spawn a new problem, they
> submit a request to mbg_spawn, which puts it in a queue and starts the
> largest problems first, so that they will all finish at roughly the same
> time. mbg_spawn takes a parameter giving the maximum number of problems
> that can run simultaneously, and queues up the others until processors
> are free.

> We originally ran this system on an MP machine with 32GB of shared RAM,
> so I really didn't care which problem ran on which processor. Now that I
> have it working, we are moving towards two sets of changes that require
> me to be smarter about scheduling.
>
> First, we are putting this on a Beowulf cluster. We can configure the
> cluster so that some of the machines have more memory than others, so I
> would like to put the biggest problems on those machines.
>
> Second, we will be running a simulation system, where multiple people can
> each run multiple simulations. I don't want to run only one at a time,
> because some may be small and not require the entire cluster. Instead, I
> would like to dole out problems from the first few simulation jobs, to
> keep the cluster fully, but not over, loaded.
>
> Since I am communicating with mbg_spawn through pvm, each user has their
> own copy running and doesn't know about the others. As such, if I tell
> mbg_spawn to let 10 jobs run at once, when I have 5 users, I will be
> running 50 jobs. That's no good.
>
> I basically want a way to communicate with a process similar to mbg_spawn
> where I can say "I need a processor with at least 1GB of RAM" and the
> process will wait until there is one free, or perhaps one with 2GB of RAM
> that only has a couple of small processes running. It needs to handle
> requests from multiple users, so that no more processes run than I have
> processors (or perhaps 2 or 3 processes per processor, but no

----------------------------------------
Jeff's reply:
----------------------------------------

Are you trying to avoid installing Condor on all of the machines
in the cluster?  If you use Condor, it should be pretty easy to do what
you want.


Way 1
-----
First, I'll talk about how (I think) you might be able to do what you
want with a straight/kludgy Condor implementation.


The mbg_spawn() process can call
system("condor_status -l");
which would return the full ClassAd for each machine in the
cluster...  For example,

MyType = "Machine"
TargetType = "Job"
Name = "vm2@xxxxxxxxxxxxx"
Machine = "fire9.cluster"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
VirtualMachineID = 2
ExecutableSize = 3947
JobUniverse = 1
NiceUser = FALSE
ImageSize = 43904
VirtualMemory = 524412
Disk = 6092210
CondorLoadAvg = 0.996394
LoadAvg = 0.996394
KeyboardIdle = 89660803
ConsoleIdle = 89660803
Memory = 250
Cpus = 1
StartdIpAddr = "<192.168.0.9:32772>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "cluster"
FileSystemDomain = "cluster"
Subnet = "192.168.0"
HasIOProxy = TRUE
TotalVirtualMemory = 1048824
TotalDisk = 12184420
KFlops = 416598
Mips = 1291
LastBenchmark = 1075061676
TotalLoadAvg = 1.990000
TotalCondorLoadAvg = 1.990000
ClockMin = 1044
ClockDay = 0
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasMPI = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList =
"HasFileTransfer,HasMPI,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1075065268
Activity = "Busy"
EnteredCurrentActivity = 1075065324
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <=
0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
RemoteUser = "bad0@cluster"
RemoteOwner = "bad0@cluster"
ClientMachine = "fire1.cluster"
JobId = "619.5"
JobStart = 1075065324
LastPeriodicCheckpoint = 1075065324
UpdateSequenceNumber = 8090
DaemonStartTime = 1072792169
LastHeardFrom = 1075069482


And it could use this information to decide whether or not it can
pvm_spawn() a new job on this machine.  Note you only need Condor, and
not Condor-PVM to do it this way.  But then it's up to you to do the
scheduling.
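To make Way 1 concrete, here is a rough sketch (in Python, purely illustrative) of how a process like mbg_spawn could consume that output. The attribute names (Name, State, Memory) come from the sample ad above; the function names and the idea of filtering on "Unclaimed" machines with enough memory are my assumptions, not anything Condor provides:

```python
# Illustrative sketch: split the long-form condor_status output into one
# dict per machine ClassAd (ads are separated by blank lines, each line
# is "Attr = Value"), then filter by memory and state.
def parse_classads(text):
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                ads.append(current)
                current = {}
            continue
        if "=" in line:
            attr, _, value = line.partition("=")
            current[attr.strip()] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads

def free_machines(ads, min_memory_mb):
    """Names of unclaimed machines with at least min_memory_mb of RAM."""
    return [ad["Name"] for ad in ads
            if ad.get("State") == "Unclaimed"
            and int(ad.get("Memory", 0)) >= min_memory_mb]
```

In real use the text would come from running condor_status via popen(); a request for "a processor with at least 1GB" would call free_machines(ads, 1024) and pvm_spawn() onto one of the returned names. As Jeff notes, though, the race conditions are then all yours.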



Way 2
-----
Condor-PVM

This all assumes you are running Condor on the nodes of your cluster,
and each also has the condor-enabled pvmd.

I think you should run the mbg_spawn process through condor.
(condor_submit).  Imagine that you have in your condor submit file
specified that there will be three classes of processes that you might
want to run.

Class 1: < 256MB
Class 2: 256MB -- 512MB
Class 3: > 512MB

In Condor-PVM, pvm_addhosts() doesn't block, so you can say
pvm_addhosts(CLASS)

You will get a pvm_notify() message when a machine of class CLASS is
available.  (Please excuse my abuse of the PVM functions -- I can't
remember the argument lists off the top of my head.)  Now you don't have
to keep checking -- you just have to respond to the PVM messages, since
Condor-PVM is doing the checking for you.
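The control flow Jeff is describing can be modeled with a toy simulation. To be clear: this is not PVM code (the real calls are the C functions pvm_addhosts/pvm_notify/pvm_spawn), and whether Condor-PVM actually queues simultaneous requesters in order like this is exactly one of the open questions below. All names here are made up for illustration:

```python
# Toy model of the request/notify handshake: a requester asks for a host
# of a given memory class; it is "notified" immediately if one is free,
# otherwise it waits in line and is notified when a host is released.
from collections import deque

class ClassScheduler:
    def __init__(self, hosts_by_class):
        # e.g. {"class1": ["fire2", "fire3"], "class3": ["fire1"]}
        self.free = {c: deque(h) for c, h in hosts_by_class.items()}
        self.waiting = {c: deque() for c in hosts_by_class}
        self.notifications = []          # (requester, host) pairs

    def addhost(self, requester, cls):
        """Non-blocking request: notify now or queue the requester."""
        if self.free[cls]:
            self.notifications.append((requester,
                                       self.free[cls].popleft()))
        else:
            self.waiting[cls].append(requester)

    def host_released(self, cls, host):
        """A job finished: hand the host to the next waiter in line."""
        if self.waiting[cls]:
            self.notifications.append((self.waiting[cls].popleft(), host))
        else:
            self.free[cls].append(host)
```

Under this model, three users requesting class1 when one class1 host is free produce exactly one notification; the other two queue and each later release notifies exactly one waiter. Whether Condor-PVM behaves this way, or notifies every waiter, is the question posed below.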


Way 3
-----

MW.

Yes, I wrote some of MW, and I am going to be doing a whole lot of MW
coding again soon.  First a few facts.  MW is different than
Condor-PVM.  One flavor of MW uses Condor-PVM to do its resource
management and communication.

I'd also like to talk with you about how we might be able to do what you
want with MW.  We're revamping MW right now, and we're looking for
users, but depending on when you want to roll stuff out, it might not be
ready fast enough for you.  But let me know if you would be willing to
be an MW Guinea Pig...

----------------------------------------
Back to more of my questions:
----------------------------------------

First of all, thanks for all replies from everyone.

Jeff, I don't really like the first way, for obvious reasons: I have to
do the scheduling, and I would get into some pretty clear race conditions
as multiple spawn requests came in simultaneously.

The second approach sounds pretty good, but I want to make sure I
understand it. When my mbg_spawn program wants to spawn a new process, it
should do an addHost(machineClass) and then wait for a pvm_notify before
it calls pvm_spawn(machineClass, executable)? (Forgive the pseudo-code; I
am too lazy to look up the actual syntax.)

This actually sounds pretty good. Some questions: if 3 people are running
simulation jobs (each has an mbg_spawn program) and they all request
pvm_addhost(class1), when a class1 host becomes available, will each
mbg_spawn get a pvm_notify message? When they each pvm_spawn my model
program, what will happen? Do they wait in line? What if another host
becomes ready while they are waiting? Will all 3 mbg_spawns get notified
again?

Also, will condor_pvm do a pvm_delhosts() after each program is complete,
so that I will get a pvm_notify every time a host is available? I.e.,
will each host get added and deleted once for every pvm_spawn?

Finally, there are a couple of things that concern me about adopting
condor_pvm. First, I thought I read that condor_pvm will only run a
single process on a node at a time. In what little tuning I have done, it
seems we are better off running 2 processes per node (given adequate
RAM), because the processes do some I/O and the processor sits idle while
that is happening.

Second, I am testing my cluster environment on a cluster running Scyld
Beowulf. I don't know if you are familiar with that software, but they
have a modified pvm as well, which is very efficient and makes my
non-cluster-aware programs work out of the box (memory-mapped files
across the network are OK). I am concerned about the implications of
replacing their pvm.

Now, the third approach (use MW) might fix all of these. My mbg_spawn is
sort of a master, although all it actually does is spawn the new tasks
and get notified when they are done. It is very general-purpose in that
it has no knowledge of what these tasks are. Here is my mbg_spawn.parm
file (parameters):

[MBG_SPAWN]
maxload = 5            # max number of programs to run simultaneously
interval = 1           # num seconds between checks of system load
num_startups = 3       # number of startup jobs given

[MBG_SPAWN_00]
cmd = pvm_gs

[MBG_SPAWN_01]
cmd = nbh_control -dir $(NN_DATA_DIR)/$(USER) -parm nbh.parm

[MBG_SPAWN_02]
cmd = email.ksh $(NBH_MAILFROM) $(NBH_MAILTO) $(NN_DATA_DIR)/$(USER)_started $(NN_DATA_DIR)/$(USER)/nbh.parm

When mbg_spawn starts up, it just kicks off the jobs listed in order. Then
it sits and waits for pvm processes to send a message saying "execute this
command, which is this big".

The actual master program in the whole system is nbh_control. It feeds data
to nbh_segment workers and to nbh_phase2 workers and gets results back from
them. mbg_spawn is just a primitive resource manager.
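For reference, the core of mbg_spawn's policy ("run at most maxload jobs, largest first") boils down to something like this sketch. This is my paraphrase in Python, not the real mbg_spawn code; the method names and size units are illustrative:

```python
# Minimal sketch of the mbg_spawn queueing policy: accept "execute this
# command, which is this big" requests, run at most `maxload` at once,
# and always start the largest queued problem first so the big ones
# finish at roughly the same time as everything else.
import heapq

class SpawnQueue:
    def __init__(self, maxload):
        self.maxload = maxload
        self.pending = []                # max-heap via negated size
        self.running = set()

    def submit(self, cmd, size_mb):
        heapq.heappush(self.pending, (-size_mb, cmd))
        self._fill()

    def job_done(self, cmd):
        self.running.discard(cmd)
        self._fill()

    def _fill(self):
        # Start queued jobs, largest first, up to the maxload cap.
        while self.pending and len(self.running) < self.maxload:
            _, cmd = heapq.heappop(self.pending)
            self.running.add(cmd)        # the real code pvm_spawns here
```

The missing piece, and the point of this whole thread, is that `maxload` is per-user: five copies of this loop know nothing about each other, which is where Condor's matchmaking would come in.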

Anyway, this is the longest post I have ever done. If you get down here,
please reply with the secret word (betelguese) so that I know you didn't
fall asleep along the way. :)

Thanks again for all the help,

- Gardner

Condor Support Information:
http://www.cs.wisc.edu/condor/condor-support/
To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
unsubscribe condor-users <your_email_address>