Re: [HTCondor-users] negotiator "poor" performance issue
- Date: Fri, 14 Mar 2014 13:09:04 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] negotiator "poor" performance issue
On 3/14/2014 5:42 AM, Pek Daniel wrote:
Hi Daniel, some thoughts inline...
> I assigned to the jobs I submitted randomized priorities, because
> otherwise the negotiator would go through the schedds sequentially
> (first, it runs all the jobs from schedd1, then from schedd2, etc).
> I've also set:
> USE_GLOBAL_JOB_PRIOS = true
Just FYI - the negotiator communicates with schedds in user priority
order regardless of schedd. So if your jobs were submitted from
different users (or with different accounting_groups), the negotiator
would not go through all the schedds sequentially.
> I don't use job arrays or clusters and I can't consider using them,
> this is a constraint.
^^^ This is a bummer...
> In this way, I could achieve ~10 jobs / sec negotiation (dispatching)
> rate (not using priorities doesn't change this).
> - did anybody measure before a higher dispatch rate?
> - is this 10 jobs / sec considered a "normal" or "good enough" value
> in case of HTCondor?
Of course we are always working to improve the rate at which the negotiator
makes matches, and we have several ideas/plans on the horizon.
However, negotiator match rate for most real-world scenarios is not as
important as it may seem. The reason is that negotiator match rate
has little to do with job start rate in HTCondor. When the negotiator
makes a match, it hands it out to a schedd. This schedd then claims the
slot, and starts a job. A key point is that when the job completes, the
schedd will find another job from that same user that matches the slot
and start it **without any involvement from the condor_negotiator**.
The schedd will keep using and reusing a slot it has claimed for job
after job until the match is broken. With a default CLAIM_WORKLIFE (see
http://goo.gl/VOg9nm ) of an hour there are not typically that many
Unclaimed machines on any given negotiation cycle (i.e. machines that
are not already assigned to a schedd) that the negotiator has to worry
about. In other words, the negotiator is not typically involved at job
boundaries, but only when claims need to move from one user/schedd to
another due to priorities...
Hope the above makes sense...
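To put rough numbers on the claim reuse described above, here is a minimal sketch of the arithmetic. It is not HTCondor code. It assumes an idealized model in which every job has the same duration and a claim is reused for back-to-back jobs until it is older than CLAIM_WORKLIFE; the function names and the example figures are made up for illustration.

```python
def jobs_per_claim(claim_worklife_s: int, job_duration_s: int) -> int:
    """Jobs one claim can start before the schedd stops reusing it.

    Idealized assumption: jobs run back to back, and a new job may start
    only while the claim is younger than claim_worklife_s. The first job
    always starts, so the result is at least 1.
    """
    if claim_worklife_s <= 0 or job_duration_s <= 0:
        raise ValueError("durations must be positive")
    # Start times are 0, d, 2d, ... strictly below the worklife (ceiling division).
    return (claim_worklife_s + job_duration_s - 1) // job_duration_s


def matches_needed(total_jobs: int, claim_worklife_s: int, job_duration_s: int) -> int:
    """Negotiator matches needed to run total_jobs under the same assumptions."""
    per_claim = jobs_per_claim(claim_worklife_s, job_duration_s)
    return (total_jobs + per_claim - 1) // per_claim


if __name__ == "__main__":
    # Hypothetical workload: 6000 one-minute jobs, one-hour claim worklife.
    print(jobs_per_claim(3600, 60))
    print(matches_needed(6000, 3600, 60))
```

Under these assumptions the negotiator is consulted once per claim rather than once per job, so short jobs need far fewer matches than job starts. A real pool will differ because of preemption, varying job lengths, and schedd limits.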
> - can I do anything without touching the source to increase the
Tuning knobs like NEGOTIATOR_INFORM_STARTD could help, but not sure how
much. I guess you also need to think about how important/relevant of a
metric negotiator dispatch rate is for your scenario. Maybe sustained
job completion rate makes more sense. See
for a bunch of performance graphs starting around slide 18. For
example, tests back with v7.6.0 showed a negotiator matchmaking rate of
8 per second (close to what you found), but because the schedd reuses
matches, the sustained job completion rate for just one schedd was 80
jobs/second. And of course, you can scale job completion rate
horizontally by adding more schedds.
You may find the following paper of interest, even though it is getting
a bit old:
Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron
Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability
limits of the Condor batch system", Journal of Physics: Conference
Series, Vol. 331, No. 6, 2011
p.s. Also be aware the negotiator classad ("condor_status -negotiator
-l") publishes a number of statistics related to matchmaking
performance, see http://goo.gl/BbIp9R . Useful for graphing with
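As one possible way to collect those statistics for graphing, the sketch below runs the "condor_status -negotiator -l" command mentioned above and keeps the attributes whose names begin with LastNegotiationCycle. The helper names are hypothetical, the parser only handles simple "Name = value" lines, and the exact set of attributes depends on the HTCondor version, so treat it as a starting point rather than a supported interface.

```python
import subprocess


def parse_classad_text(text: str) -> dict:
    """Parse 'Attr = value' lines from long-form ClassAd output into a dict.

    Values are kept as strings with surrounding double quotes removed.
    Lines without '=' (blank lines, separators) are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        attrs[name.strip()] = value.strip().strip('"')
    return attrs


def negotiation_stats(attrs: dict, prefix: str = "LastNegotiationCycle") -> dict:
    """Keep only the attributes whose names start with the given prefix."""
    return {k: v for k, v in attrs.items() if k.startswith(prefix)}


def fetch_negotiator_ad() -> str:
    """Return the raw output of 'condor_status -negotiator -l'.

    Requires the HTCondor command-line tools on PATH and a reachable pool.
    """
    result = subprocess.run(
        ["condor_status", "-negotiator", "-l"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a machine with the HTCondor tools installed, negotiation_stats(parse_classad_text(fetch_negotiator_ad())) returns a dict that can be logged periodically and fed to whatever graphing tool you prefer.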
Todd Tannenbaum <tannenba@xxxxxxxxxxx>   University of Wisconsin-Madison
Center for High Throughput Computing     Department of Computer Sciences
HTCondor Technical Lead                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                    Madison, WI 53706-1685