Re: [Condor-users] Multi-Slot settings for single nodes



On Wed, Jun 20, 2007 at 03:06:51PM -0500, Alan De Smet wrote:
> (using SlotID in your START and other policy expressions).  So
> you might have a policy roughly like:
> 
> - Slot 1-4: "normal", advertises 1/4 of the RAM.  Refuses to
>   START if slot5_State=="Claimed"
> 
> - Slot 5: "bigjob", advertises all, or lots of the RAM.  Refuses
>   to START if slot[1-4]_State=="Claimed".  Or perhaps refuses to
>   start if (slot1_ImageSize + slot2_ImageSize + slot3_ImageSize +
>   slot4_ImageSize) > 4 GB.  (Of course, ImageSize can be
>   undefined, so your actual expression will be more complex.)
> 
> Some general discussion on crafting per-VM policies is here:
> http://www.cs.wisc.edu/condor/manual/v6.9/3_12Setting_Up.html#SECTION004127500000000000000
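To make the quoted slot policy concrete, here is a minimal Python model of it. It is not Condor configuration syntax: in a real pool these checks would be ClassAd expressions in START. The sketch assumes each slot can see its sibling slots' State and ImageSize. The constants, the dictionaries standing in for those attributes, and treating an undefined ImageSize as 0 are all illustrative assumptions.

```python
from typing import Dict, Optional

# Hypothetical layout: slots 1-4 are "normal", slot 5 is "bigjob".
SMALL_SLOTS = (1, 2, 3, 4)
BIG_SLOT = 5
# ImageSize is modelled in KiB (as in Condor job ClassAds), so 4 GB = 4 GiB here.
FOUR_GIB_IN_KIB = 4 * 1024 * 1024


def small_slot_may_start(state: Dict[int, str]) -> bool:
    """Slots 1-4 refuse to START while the big slot is Claimed."""
    return state.get(BIG_SLOT) != "Claimed"


def big_slot_may_start_strict(state: Dict[int, str]) -> bool:
    """Slot 5 refuses to START if any of slots 1-4 is Claimed."""
    return all(state.get(s) != "Claimed" for s in SMALL_SLOTS)


def big_slot_may_start_by_size(image_size: Dict[int, Optional[int]]) -> bool:
    """Alternative: refuse if the small slots' jobs together exceed 4 GiB.

    None (or a missing key) models an undefined ImageSize and is counted
    as 0. That is one possible choice; the real expression must make it
    explicitly, which is why it gets more complex.
    """
    total = sum(image_size.get(s) or 0 for s in SMALL_SLOTS)
    return total <= FOUR_GIB_IN_KIB
```

The strict variant blocks slot 5 as soon as any small job runs; the size-based variant lets a big job share the node with small jobs as long as their combined footprint stays under the threshold.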

Also, so that this can be found in the archive next time, here's
a pointer to what is (AFAICT) the first post on the topic:

https://lists.cs.wisc.edu/archive/condor-users/2006-November/msg00068.shtml

> Depending on your workload, you might need to worry about
> starvation where a large number of small jobs keep the nodes busy
> enough that the larger jobs never get a chance to run.  It should
> be possible to tune various priorities to make it work (say,
> using RANK), but it will likely need to be tuned to your typical
> workload.

An easy way to do this seems to be to modify NEGOTIATOR_PRE_JOB_RANK
(a very powerful setting! I'm glad I learned about it) so that it
*prefers* nodes that are already partially occupied.
If you add a positive number to that rank for each CPU/slot already in
use, these machines become more attractive to small jobs, while big jobs
are not affected (a partially occupied node no longer satisfies their
requirements anyway). This leaves more machines completely unclaimed,
giving the big-job users more opportunities for a match.
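The effect of such a rank can be illustrated with a toy Python model of the negotiator's ordering. It is not Condor's matchmaking code. The `Machine` class, the per-busy-slot weight of 10, and the one-slot jobs are hypothetical. The real preference would be a ClassAd expression assigned to NEGOTIATOR_PRE_JOB_RANK, and the machine attribute used to count busy slots is not shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Machine:
    name: str
    slots: int      # total CPUs/slots on the node
    busy: int = 0   # slots already claimed


def pre_job_rank(m: Machine, weight: int) -> int:
    # Hypothetical rank: a constant per slot already in use. A positive
    # weight favours partially occupied nodes; a negative one spreads jobs.
    return weight * m.busy


def place_small_job(pool: List[Machine], weight: int) -> None:
    """Give a one-slot job to the highest-ranked machine with a free slot."""
    candidates = [m for m in pool if m.busy < m.slots]
    if candidates:
        # max() keeps the first machine among equal ranks.
        best = max(candidates, key=lambda m: pre_job_rank(m, weight))
        best.busy += 1


def simulate(weight: int, jobs: int = 6) -> Tuple[List[int], int]:
    """Return busy slots per node and the number of fully idle nodes."""
    pool = [Machine(f"node{i}", slots=4) for i in range(4)]
    for _ in range(jobs):
        place_small_job(pool, weight)
    idle = sum(1 for m in pool if m.busy == 0)
    return [m.busy for m in pool], idle


print(simulate(weight=10))   # packs small jobs onto few nodes
print(simulate(weight=-10))  # spreads them over all nodes
```

With the positive weight, six small jobs end up as `[4, 2, 0, 0]`, leaving two whole nodes free for a big job. With the negative weight they spread out as `[2, 2, 1, 1]` and no node stays idle.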

I'm not sure whether it makes sense (or is desirable) to model features
of other batch systems (like PBS) in Condor - but this approach seems
to be a reasonable workaround. Perhaps the Condor developers will come
up with something really beautiful and elegant? :-)

> Remember that Condor's allocations of memory are really
> guidelines, and unless your policies are written to evict jobs
> that grow too large, Condor will happily let a job use more RAM
> than the slot is allocated.

... and users risk never finishing their job: if it gets evicted for some
reason after its reported memory footprint has grown beyond the memory
offered in the machine (slot) ClassAds, it can no longer be matched. That'd
be the final punishment, after slowing down due to swapping. B-]
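A small Python sketch of that trap. It assumes slot Memory is in MiB and job ImageSize in KiB, and that matching requires roughly `Memory * 1024 >= ImageSize`. This mirrors the usual Condor ClassAd convention, but treat it as an assumption here. The slot sizes and the eviction rule are hypothetical.

```python
from typing import List


def fits(slot_memory_mib: int, image_size_kib: int) -> bool:
    # Assumed match condition: the slot's Memory (MiB) covers ImageSize (KiB).
    return slot_memory_mib * 1024 >= image_size_kib


def should_evict(slot_memory_mib: int, image_size_kib: int) -> bool:
    # Hypothetical policy: evict a job that has outgrown its slot.
    return not fits(slot_memory_mib, image_size_kib)


def matching_slots(slots_mib: List[int], image_size_kib: int) -> List[int]:
    """Indices of slots whose advertised memory could still take the job."""
    return [i for i, mem in enumerate(slots_mib) if fits(mem, image_size_kib)]


# Hypothetical node: four 2 GiB "normal" slots plus one 8 GiB "bigjob" slot.
slots = [2048, 2048, 2048, 2048, 8192]
started_kib = 1 * 1024 * 1024   # job starts at 1 GiB
grown_kib = 10 * 1024 * 1024    # and later grows to 10 GiB

print(matching_slots(slots, started_kib))  # every slot would accept it
print(should_evict(2048, grown_kib))       # the grown job gets evicted
print(matching_slots(slots, grown_kib))    # and nothing matches it any more
```

Once the recorded ImageSize exceeds the largest slot, the job stays idle until someone edits its requirements or ImageSize by hand.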

Steffen

-- 
Steffen Grunewald * MPI Grav.Phys.(AEI) * Am Mühlenberg 1, D-14476 Potsdam
Cluster Admin * http://pandora.aei.mpg.de/merlin/ * http://www.aei.mpg.de/
* e-mail: steffen.grunewald(*)aei.mpg.de * +49-331-567-{fon:7233,fax:7298}
No Word/PPT mails - http://www.gnu.org/philosophy/no-word-attachments.html