Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Dynamic slots

Date: Mon, 03 Feb 2014 12:34:53 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dynamic slots

On 1/31/2014 4:41 PM, Shrum, Donald C wrote:

I have a cluster of machines that are dedicated to HTCondor.

I've read some on dynamic slots; specifically this powerpoint:
http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/thain-dynamic-slots.pdf

as well as this http://research.cs.wisc.edu/htcondor/manual/v8.0/3_5Policy_Configuration.html#sec:SMP-dynamicprovisioning

I've enabled whole machine jobs on our cluster.  I presume if I use dynamic slots I'll do away with the configuration for whole machine jobs.

Is that the case and is using dynamic slots a better practice?   Any input would be appreciated.


As with most things in life, the answer is "it depends". :)

A "whole machine" static slot configuration, as described at
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=WholeMachineSlots

has some real shortcomings. A big one is the all-or-nothing approach, where the setup typically can give a job either one core or all the cores. This can be a bummer depending on your typical job mix. If you have, for instance, 32-core servers and a job mix that wants a combination of 1-, 8-, and 16-core jobs, then using dynamic slots will likely allow much better utilization because you will be able to pack a lot more jobs onto your server. Dynamic slots are also nice because cores are not the only "axis" they are concerned about - maybe, for instance, your mix of 1/8/16 core jobs is further split into large memory vs small memory. Because dynamic slots are created "best fit" with respect to both cores and memory (and any other server resources as well), that could be a nice win as well. Finally, dynamic slots result in a simpler configuration - the "whole machine" config at the above URL is pretty complicated, both for a human to debug/tweak and also for "condor_q -analyze" to give helpful results.
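To make the utilization argument concrete, here is a rough sketch in Python. It is a toy model and not how HTCondor actually matches jobs: the function names, the 128 GB of server memory, and the per-job memory figures are all assumptions made for illustration, and the dynamic case carves slots first-fit in queue order rather than doing a true best fit.

```python
from typing import List, Tuple

Job = Tuple[int, int]  # (cores requested, memory requested in GB)


def pack_dynamic(cores: int, mem_gb: int, queue: List[Job]) -> List[Job]:
    """Carve one slot per job, in queue order, while cores AND memory remain."""
    placed = []
    for want_cores, want_mem in queue:
        if want_cores <= cores and want_mem <= mem_gb:
            cores -= want_cores
            mem_gb -= want_mem
            placed.append((want_cores, want_mem))
    return placed


def pack_all_or_nothing(cores: int, queue: List[Job]) -> List[Job]:
    """A job gets either one core or the whole machine; memory is ignored."""
    placed = []
    free = cores
    for want_cores, want_mem in queue:
        if want_cores == 1 and free >= 1:
            free -= 1
            placed.append((want_cores, want_mem))
        elif want_cores > 1 and free == cores:
            # The multicore job claims every core, whether it uses them or not.
            free = 0
            placed.append((want_cores, want_mem))
    return placed


def busy_cores(placed: List[Job]) -> int:
    """Cores that jobs actually asked for (and so can actually use)."""
    return sum(c for c, _ in placed)


# Hypothetical queue: one 16-core job, one 8-core job, eight 1-core jobs.
queue = [(16, 64), (8, 16)] + [(1, 2)] * 8

dynamic = pack_dynamic(32, 128, queue)
static = pack_all_or_nothing(32, queue)
print("dynamic busy cores:", busy_cores(dynamic))
print("all-or-nothing busy cores:", busy_cores(static))
```

In this made-up queue the dynamic packing keeps all 32 cores busy, while the all-or-nothing policy hands the whole server to the 16-core job and leaves the other 16 cores unused.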

But dynamic slots have their downfalls as well. First off, the user has to tell HTCondor what their job needs in terms of cores, memory, etc, and what they tell HTCondor has real implications. Unfortunately, many users simply have no idea what their jobs require or can effectively utilize, so sometimes static slots created by a system admin who is familiar with the workloads of the organization (esp if the cluster is used for a repetitive/predictable workload) could be better. Also, dynamic slots currently do not work with startd RANK policies (i.e. if you have machines that need to prefer certain types of jobs), but we are currently working to fix that shortcoming.

Another complication with dynamic slots is starvation. For instance, a simple dynamic slot setup could result in multicore jobs starving (waiting forever) if there is an infinite supply of incoming single core jobs. The whole-machine-slots static recipe above gets around this problem by always prioritizing whole-machine jobs; if a whole-machine job matches, the server will then immediately "drain" out all the single core jobs (i.e. not start any new single core jobs while waiting for existing single core jobs to complete). This "always prioritize large machine jobs" strategy is not ideal, and thus most whole-machine-slot sites deal with this by only setting up some percentage of their servers with a whole-machine-slot policy. Draining costs utilization; you will have slots sitting around idle waiting for all the single-core jobs to exit. But either job preemption (i.e. killing a job before it is done and starting it over) or server draining is the price one must pay in order to avoid starvation of larger core jobs. Dynamic slot configs are usually set up to do draining with the help of the condor_defrag service as described here http://goo.gl/Qh8UXu (or via some other external application-aware service issuing the condor_drain command-line tool), but note that the condor_defrag daemon is going to drain some percentage of servers regardless of whether there are multi-core jobs submitted or not. The condor_defrag service also produces better results if your cluster of servers is more homogeneous, since in a very heterogeneous server pool it is possible the condor_defrag service may drain machines that no multicore jobs want.
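The starvation-versus-draining tradeoff can be illustrated with a small discrete-time toy model in Python. It makes no attempt to reproduce what the negotiator, condor_defrag, or condor_drain actually do; the function name, the 8-step job length, and the assumption that single-core jobs always win any freed core are invented for the example.

```python
from typing import Optional, Tuple


def wait_for_multicore(total_cores: int, need: int, job_len: int,
                       steps: int, drain: bool) -> Tuple[Optional[int], int]:
    """Return (step at which the multicore job fits, idle core-steps so far).

    The server starts full of single-core jobs with staggered finish times.
    Without draining, an endless supply of single-core jobs refills every
    freed core. With draining, freed cores are held idle until `need` cores
    are free at the same time.
    """
    remaining = [(i % job_len) + 1 for i in range(total_cores)]
    idle = 0
    for step in range(1, steps + 1):
        # Advance time by one step and drop the jobs that just finished.
        remaining = [r - 1 for r in remaining if r - 1 > 0]
        free = total_cores - len(remaining)
        if free >= need:
            return step, idle
        if drain:
            idle += free  # cores held back for the multicore job
        else:
            remaining.extend([job_len] * free)  # single-core jobs refill
    return None, idle


no_drain = wait_for_multicore(32, 16, 8, 100, drain=False)
with_drain = wait_for_multicore(32, 16, 8, 100, drain=True)
print("no draining:  ", no_drain)
print("with draining:", with_drain)
```

Under these assumptions the 16-core job never starts without draining, because only four cores are ever free at once. With draining it starts at step 4, at the cost of 24 idle core-steps, which is the utilization price described above.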

All in all, I think most sites that have switched to dynamic slots feel it is a definite improvement (esp in increased utilization), but there is not a clear and obvious winner in every case, and balancing out your draining policy can be tricky. Down the road we hope to keep enhancing HTCondor to make things easier/smarter, and to create better tools to communicate these tradeoffs happening on a cluster more directly to administrators.

I know the above is not a clear answer and probably more than you wanted, but hopefully it will help give an idea of the tradeoffs involved.


regards,
Todd

Thanks and have a good weekend.

Donny Shrum
FSU RCC





--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
