
Re: [Condor-users] Multi-Slot settings for single nodes



Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> wrote:
> I have a few boxes with newly arrived quad-core CPUs. The usual set-up
> would be to define 4 slots for that machine and each will get 25% of the
> installed memory (say 2 GByte each).
> 
> However, from time to time a user comes up and tells me, I need to run a
> program which requires 5-6 GByte of memory. As far as I know, this could
> be achieved by

This is a case we've been thinking about, but we don't have a
direct answer or a timeline at the moment.

In the meantime you can use STARTD_SLOT_ATTRS (STARTD_VM_EXPRS
in 6.9.2 and earlier) to share information between slots.
(http://www.cs.wisc.edu/condor/manual/v6.9/3_3Configuration.html#13390)
The general strategy is to create more slots than you have
processors, then give the different slots different policies
(using SlotID in your START and other policy expressions).  So
you might have a policy roughly like the following (a concrete
configuration sketch follows the list):

- Slots 1-4: "normal", each advertises 1/4 of the RAM.  Refuses
  to START if slot5_State == "Claimed"

- Slot 5: "bigjob", advertises all, or most, of the RAM.  Refuses
  to START if any of slot[1-4]_State == "Claimed".  Or perhaps
  refuses to start if (slot1_ImageSize + slot2_ImageSize +
  slot3_ImageSize + slot4_ImageSize) > 4 GBytes.  (Of course,
  ImageSize can be undefined, so your actual expression will be
  more complex.)
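
As a rough, untested sketch of what that could look like in the
local configuration file (assuming a 4-core, 8 GByte box; the
doubled MEMORY value, the slot-type fractions, and the exact
layout are assumptions you would adapt):

  # Four "normal" 2 GByte slots plus a fifth "bigjob" slot on a
  # 4-core, 8 GByte machine.  MEMORY is deliberately declared as
  # twice the physical RAM so slot 5 can advertise a large amount
  # on top of the four normal slots; the allocations are only
  # advertised numbers (see the note at the end of this mail).
  NUM_CPUS = 5
  MEMORY   = 16384
  SLOT_TYPE_1      = cpus=1, memory=1/8
  NUM_SLOTS_TYPE_1 = 4
  SLOT_TYPE_2      = cpus=1, memory=1/2
  NUM_SLOTS_TYPE_2 = 1

  # Cross-advertise each slot's state so the START expression can
  # refer to slot1_State ... slot5_State.
  STARTD_SLOT_ATTRS = State, Activity, ImageSize

  # Slots 1-4 only start while slot 5 is not claimed; slot 5 only
  # starts while slots 1-4 are all unclaimed.  "=!=" keeps the
  # test true while a neighbour's attribute is still undefined.
  START = ( (SlotID <= 4) && (slot5_State =!= "Claimed") ) || \
          ( (SlotID == 5) && (slot1_State =!= "Claimed") && \
                             (slot2_State =!= "Claimed") && \
                             (slot3_State =!= "Claimed") && \
                             (slot4_State =!= "Claimed") )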

Some general discussion on crafting per-slot policies is here:
http://www.cs.wisc.edu/condor/manual/v6.9/3_12Setting_Up.html#SECTION004127500000000000000

If you're willing to spend the time crafting the expressions, it
should be possible to build a complex system with multiple types
of logical slots.

This does rely on jobs either providing reasonably accurate
ImageSize information, or perhaps marking themselves with special
attributes (+WantBigMachine=TRUE).  Condor will eventually report
a reasonably accurate ImageSize, but until the job has run and
Condor has taken a better measurement, it will likely make poor
matches.
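
For the self-marking approach, a hypothetical submit file
fragment might look like this (WantBigMachine is just an invented
attribute name; adapt to taste):

  # Submit description for a job that knows it needs 5-6 GBytes.
  universe     = vanilla
  executable   = bigjob
  # Custom job attribute the startd policy can test for.
  +WantBigMachine = True
  # Only match slots advertising enough memory (Memory is MBytes).
  requirements = (Memory >= 5000)
  queue

The slot-5 branch of the START expression above could then also
require (TARGET.WantBigMachine =?= True).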

Depending on your workload, you might need to worry about
starvation where a large number of small jobs keep the nodes busy
enough that the larger jobs never get a chance to run.  It
should be possible to adjust various priorities to make this
work (say, using RANK, sketched below), but they will likely
need tuning for your typical workload.
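
For instance, under the same assumptions as above, a machine RANK
along these lines would make slot 5 prefer flagged jobs whenever
both kinds are waiting (untested, and note that a higher startd
rank can also trigger preemption of the currently running job):

  # Boolean rank: evaluates to 1.0 for flagged jobs on slot 5,
  # 0.0 otherwise.
  RANK = (SlotID == 5) && (TARGET.WantBigMachine =?= True)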

The paper on the Bologna Batch System gives some concrete
examples of using more VMs than processors with the goal of
creating different logical pools.
http://pages.cs.wisc.edu/~pfc/bologna_batch_system.html

Remember that Condor's memory allocations are really guidelines:
unless your policies are written to evict jobs that grow too
large, Condor will happily let a job use more RAM than its slot
was allocated.
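
If you do want eviction, something along these lines is the usual
shape (untested; ImageSize is in KBytes and Memory in MBytes, and
you would fold this into whatever PREEMPT policy you already
have):

  # Evict jobs whose memory image grows past the slot's
  # advertised allocation.
  MEMORY_EXCEEDED = ( (TARGET.ImageSize =!= UNDEFINED) && \
                      (TARGET.ImageSize > (1024 * Memory)) )
  PREEMPT     = $(MEMORY_EXCEEDED)
  WANT_VACATE = True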

-- 
Alan De Smet                              Condor Project Research
adesmet@xxxxxxxxxxx                 http://www.condorproject.org/