Re: [HTCondor-users] Dynamic slots



On 1/31/2014 4:41 PM, Shrum, Donald C wrote:
I have a cluster of machines that are dedicated to HTCondor.

I've read some on dynamic slots; specifically this powerpoint:
http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/thain-dynamic-slots.pdf

as well as this http://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html#sec:SMP-dynamicprovisioning

I've enabled whole machine jobs on our cluster.  I presume if I use dynamic slots I'll do away with the configuration for whole machine jobs.

Is that the case and is using dynamic slots a better practice?   Any input would be appreciated.


As with most things in life, the answer is "it depends". :)

A "whole machine" static slot configuration, as described at
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots
has some real shortcomings. A big one is the all-or-nothing approach, where the setup typically can give a job either one core or all the cores. This can be a bummer depending on your typical job mix. If you have, for instance, 32-core servers and a job mix that wants a combination of 1-, 8-, and 16-core jobs, then using dynamic slots will likely allow much better utilization because you will be able to pack a lot more jobs onto each server. Dynamic slots are also nice because cores are not the only "axis" they take into account - maybe, for instance, your mix of 1/8/16-core jobs is further split into large-memory vs small-memory jobs. Because dynamic slots are created "best fit" with respect to both cores and memory (and any other server resources as well), that could be a nice win too. Finally, dynamic slots result in a simpler configuration - the "whole machine" config at the above URL is pretty complicated, both for a human to debug/tweak and for "condor_q -analyze" to give helpful results.
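
To make the utilization argument concrete, here is a toy Python sketch (not HTCondor code) that compares the two approaches on one 32-core server. It assumes jobs are considered in queue order and looks at cores only; the function names and the job queue are made up for illustration, and real matchmaking also weighs memory, disk, and priorities.

```python
def pack_dynamic(total_cores, requests):
    """Carve dynamic slots from one partitionable slot, in queue order."""
    free = total_cores
    started = []
    for cores in requests:
        if cores <= free:
            free -= cores
            started.append(cores)
    return started


def pack_whole_machine(total_cores, requests):
    """All-or-nothing: a job gets either one core or the entire machine."""
    free = total_cores
    started = []
    for cores in requests:
        if cores == 1 and free >= 1:
            free -= 1
            started.append(cores)
        elif cores > 1 and free == total_cores:
            free = 0  # the multi-core job claims every core, used or not
            started.append(cores)
    return started


# Hypothetical queue: one 16-core job, one 8-core job, eight 1-core jobs.
queue = [16, 8] + [1] * 8
busy_dynamic = sum(pack_dynamic(32, queue))      # cores doing useful work
busy_whole = sum(pack_whole_machine(32, queue))
print(busy_dynamic, busy_whole)
```

With this particular queue the dynamic packing keeps all 32 cores busy, while the all-or-nothing setup hands the whole machine to the 16-core job and leaves the other 16 cores unused.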

But dynamic slots have their downsides as well. First off, the user has to tell HTCondor what their job needs in terms of cores, memory, etc., and what they tell HTCondor has real implications. Unfortunately, many users simply have no idea what their jobs require or can effectively utilize, so sometimes static slots created by a system admin who is familiar with the workloads of the organization (especially if the cluster is used for a repetitive/predictable workload) could be better. Also, dynamic slots currently do not work with startd RANK policies (i.e. if you have machines that need to prefer certain types of jobs), but we are currently working to fix that shortcoming.
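
As an illustration of what "telling HTCondor what the job needs" looks like, here is a small Python sketch that writes out a submit description using the request_cpus, request_memory, and request_disk submit commands. The helper name, the executable name, and the numbers are hypothetical; by default request_memory is read as megabytes and request_disk as kilobytes.

```python
def render_submit(executable, cpus, memory_mb, disk_kb):
    """Build the text of a submit description with explicit resource requests."""
    lines = [
        f"executable = {executable}",
        f"request_cpus = {cpus}",
        f"request_memory = {memory_mb}",  # megabytes by default
        f"request_disk = {disk_kb}",      # kilobytes by default
        "queue",
    ]
    return "\n".join(lines)


# Hypothetical 8-core job asking for 4 GB of memory and 1 GB of disk.
submit_text = render_submit("my_job.sh", 8, 4096, 1048576)
print(submit_text)
```

If a user overestimates these numbers, the dynamic slot carved out for the job is larger than it needs to be and fewer jobs fit on the server; if they underestimate, the job may not get the resources it actually uses.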

Another complication with dynamic slots is starvation. For instance, a simple dynamic slot setup could result in multicore jobs starving (waiting forever) if there is an infinite supply of incoming single-core jobs. The whole-machine-slots static recipe above gets around this problem by always prioritizing whole-machine jobs; if a whole-machine job matches, the server will then immediately "drain" out all the single-core jobs (i.e. not start any new single-core jobs while waiting for existing single-core jobs to complete). This "always prioritize whole-machine jobs" strategy is not ideal, and thus most whole-machine-slot sites deal with this by only setting up some percentage of their servers with a whole-machine-slot policy. Draining costs utilization; you will have slots sitting around idle waiting for all the single-core jobs to exit. But either job preemption (i.e. killing a job before it is done and starting it over) or server draining is the price one must pay in order to avoid starvation of larger-core jobs. Dynamic slot configs are usually set up to do draining with the help of the condor_defrag service as described here http://goo.gl/Qh8UXu (or via some other external application-aware service issuing the condor_drain command-line tool), but note that the condor_defrag daemon is going to drain some percentage of servers regardless of whether there are multi-core jobs submitted or not. The condor_defrag service also produces better results if your cluster of servers is more homogeneous, since in a very heterogeneous server pool it is possible the condor_defrag service may drain machines that no multicore jobs want.
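
The starvation and draining tradeoff can be sketched with a toy Python simulation of a single server (again, not HTCondor code). It assumes discrete time ticks, an endless supply of single-core jobs of a fixed length, and one multi-core job waiting; all names and numbers are made up for illustration.

```python
def simulate(initial_remaining, big_job_cores, refill_len, drain, max_ticks=100):
    """Return (tick at which the big job can start, or None; idle core-ticks)."""
    # remaining[i] = ticks left for the single-core job running on core i
    remaining = list(initial_remaining)
    idle_core_ticks = 0
    for tick in range(max_ticks):
        free = sum(1 for r in remaining if r == 0)
        if free >= big_job_cores:
            return tick, idle_core_ticks
        if drain:
            # Draining: freed cores sit idle until enough have accumulated.
            idle_core_ticks += free
        else:
            # No draining: a new single-core job grabs every freed core.
            remaining = [r if r > 0 else refill_len for r in remaining]
        remaining = [max(r - 1, 0) for r in remaining]
    return None, idle_core_ticks


# Four cores running staggered single-core jobs; a 4-core job is waiting.
with_drain = simulate([1, 2, 3, 4], 4, 4, drain=True)
without_drain = simulate([1, 2, 3, 4], 4, 4, drain=False)
print(with_drain, without_drain)
```

In this setup the draining run starts the 4-core job at tick 4 after wasting 6 core-ticks, while the run without draining never frees four cores at once within the 100 simulated ticks, which is the starvation case described above.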

All in all, I think most sites that have switched to dynamic slots feel it is a definite improvement (especially in increased utilization), but there is not a clear and obvious winner in every case, and balancing out your draining policy can be tricky. Down the road we hope to keep enhancing HTCondor to make things easier/smarter, and to create better tools to communicate these tradeoffs happening on a cluster more directly to administrators.

I know the above is not a clear answer and probably more than you wanted, but hopefully it will help give an idea of the tradeoffs involved.

regards,
Todd


Thanks and have a good weekend.

Donny Shrum
FSU RCC





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685