
Re: [Condor-users] memory "sharing" question




> Nicolas,
>
> Thanks for asking this question. As I write this, I'm in the process of
> designing the next 10 or so nodes to add to my pool.  I'm strongly
> favoring the AMD X2 line of processors and I'm wondering how I can best
> make use of the dual execution cores.  I had wondered how Condor handled
> things like memory resources. Do the class ads adjust on the fly?  So if
> a job landed on a dual cpu node with 2GB of RAM and the job used up
> 1.5GB, would the class ad for the other cpu change to 512MB?

The quick answer is no.

Condor uses a static division of resources on multi-VM machines
because it's the easiest and most straightforward approach. Condor
does not enforce the division, so a job that said it needed 512 MB of
memory can actually use 513 MB and Condor won't do anything about it.
However, if your job had advertised that it was going to use 513 MB of
memory, it would never have run on a machine with only 512 MB offered
up. You can refine your resource estimates by looking at the job
history. JobSize gets adjusted in real time once the job starts
running on the machine. It's accurate for Linux and Solaris; I'm not
sure how accurate it is for Windows (Condor Guys: Does it track memory
use on Windows for the main process spawned plus any children, or just
the main process?).
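
To make the matching behaviour concrete, here is a rough Python sketch of
the static split. It is not Condor code: static_vm_memory and matches are
made-up names, and the even split across VMs is an assumption for
illustration only.

```python
def static_vm_memory(total_mb, num_vms):
    """Split physical memory evenly across VMs (assumed even split)."""
    return [total_mb // num_vms] * num_vms


def matches(requested_mb, offered_mb):
    """A job only matches a VM that offers at least what the job asked for."""
    return requested_mb <= offered_mb


# A hypothetical 1 GB machine with two VMs: each VM advertises 512 MB,
# no matter what is actually running on the other VM.
slots = static_vm_memory(1024, 2)

fits = matches(512, slots[0])       # a 512 MB request can match
too_big = matches(513, slots[0])    # a 513 MB request never matches
```

Note that nothing in the sketch looks at real usage after the match, which
mirrors the point above: the limit matters at match time only.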

I have a gripe with the JobSize attribute: it resets itself once the job
starts to run. So if you've advertised the need for 1 GB of memory in a
vanilla job, and it starts to run and gets preempted before it reaches
its peak memory use, you'll end up with a JobSize that's lower than what
you asked for. You almost want to set JobSize and have a secondary
attribute called JobSizeActual that tracks the real use, at least for
vanilla jobs. I've taken the opposite approach here: we set
AlteraJobSize to match the memory the job needs and let JobSize track
only the actual memory use for historical analysis.
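
The two-attribute idea can be sketched in a few lines of Python. This is
only a toy model: JobRecord, requested_mb, observed_mb, start_run and
sample are hypothetical names, and treating the observed value as a
running peak is my assumption, not a statement about how Condor computes
JobSize.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    requested_mb: int       # plays the role of AlteraJobSize: set once, never reset
    observed_mb: int = 0    # plays the role of JobSize: tracks what the job really used

    def start_run(self):
        # Assumed behaviour: the observed value starts over on every (re)start.
        self.observed_mb = 0

    def sample(self, usage_mb):
        # Assumed behaviour: keep the highest usage seen during this run.
        self.observed_mb = max(self.observed_mb, usage_mb)


job = JobRecord(requested_mb=1024)
job.start_run()
job.sample(300)   # preempted here, long before the job reaches its peak
```

After the early preemption the observed value sits at 300 MB while the
request is still 1024 MB, so matching on the request stays safe and the
observed number is left for history.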

One approach I've toyed with is using cross-VM class ad attributes to
implement a dynamic memory tracking system in Condor. This is by no
means tested. Right now it's just a theory. It goes something like this:

If you have a two-VM machine, you have each VM export to the other VM the
current AlteraJobSize of its running job. No running job would
hopefully mean 0 gets exported. Then you set the AvailableMemory
attribute on each VM to be the total physical memory in the machine
minus the AlteraJobSize of the job running on the other VM. Now you have
a dynamic memory value.
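
The arithmetic of that scheme, as an untested Python sketch. It is not
Condor configuration: available_memory is a made-up helper, and summing
over "all other VMs" is my generalisation of the two-VM case described
above.

```python
def available_memory(total_mb, job_sizes_mb, vm_index):
    """Memory one VM could advertise: total physical memory minus the
    AlteraJobSize exported by every other VM (0 for an idle VM)."""
    used_by_others = sum(
        size for i, size in enumerate(job_sizes_mb) if i != vm_index
    )
    # Never advertise a negative amount if the other jobs oversubscribe.
    return max(total_mb - used_by_others, 0)


# The case from the original question: a 2 GB dual-CPU node where the job
# on VM 0 has an AlteraJobSize of 1.5 GB and VM 1 is idle.
job_sizes = [1536, 0]
vm0 = available_memory(2048, job_sizes, 0)   # other VM is idle
vm1 = available_memory(2048, job_sizes, 1)   # other VM holds the 1.5 GB job
```

With these numbers the idle VM would advertise 512 MB, which is the
behaviour the original question asked about.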

Of course this only affects job matching. But it's better than nothing.
There was some talk on the list about modifying Condor to enforce
resource usage limits on jobs. You could probably cook up some enforcer
scripts using Hawkeye (http://www.cs.wisc.edu/condor/hawkeye/).

- Ian