Re: [HTCondor-users] negotiator "poor" performance issue



On 3/14/2014 5:42 AM, Pek Daniel wrote:
Hi,


Hi Daniel, some thoughts inline...


I assigned to the jobs I submitted randomized priorities, because
otherwise the negotiator would go through the schedds sequentially
(first, it runs all the jobs from schedd1, then from schedd2, etc).
I've also set:
USE_GLOBAL_JOB_PRIOS = true


Just FYI - the negotiator communicates with schedds in user priority order regardless of schedd. So if your jobs were submitted from different users (or with different accounting_groups), the negotiator would not go through all the schedds sequentially.

I don't use job arrays or clusters and I can't consider using them,
this is a constraint.


^^^ This is a bummer...


In this way, I could achieve ~10 jobs / sec negotiation (dispatching)
rate (not using priorities doesn't change this).

My questions:
- did anybody measure before a higher dispatch rate?
- is this 10 jobs / sec considered a "normal" or "good enough" value
in case of HTCondor?

Of course we are always working to improve the rate at which the negotiator makes matches, and we have several ideas/plans on the horizon.

However, negotiator match rate for most real-world scenarios is not as important as it may seem. The reason is that negotiator match rate has little to do with job start rate in HTCondor. When the negotiator makes a match, it hands it out to a schedd. This schedd then claims the slot and starts a job. A key point is that when the job completes, the schedd will find another job from that same user that matches the slot and start it **without any involvement from the condor_negotiator**. The schedd will keep using and reusing a slot it has claimed for job after job until the match is broken. With a default CLAIM_WORKLIFE (see http://goo.gl/VOg9nm ) of an hour, there are typically not that many Unclaimed machines (i.e. machines not already assigned to a schedd) for the negotiator to worry about on any given negotiation cycle. In other words, the negotiator is not typically involved at job boundaries, but only when claims need to move from one user/schedd to another due to priorities...
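
To put rough numbers on the claim-reuse point, here is a minimal back-of-the-envelope sketch in Python. It is a toy model, not HTCondor code: it assumes jobs run back to back on a claimed slot, all have the same runtime, and a new job may start only while the claim is younger than CLAIM_WORKLIFE. The function names are made up for illustration.

```python
import math


def jobs_per_claim(claim_worklife_s, job_runtime_s):
    """Jobs a schedd can start on one claimed slot before the claim ages out.

    Toy model (an assumption, not HTCondor's actual logic): jobs run back
    to back, and a new job may start only while the claim is younger than
    claim_worklife_s. At least one job always runs per claim.
    """
    return max(1, math.ceil(claim_worklife_s / job_runtime_s))


def matches_per_second_needed(job_start_rate, claim_worklife_s, job_runtime_s):
    """Negotiator match rate needed to sustain a given job start rate."""
    return job_start_rate / jobs_per_claim(claim_worklife_s, job_runtime_s)


if __name__ == "__main__":
    # One-hour worklife, target of 80 job starts per second.
    for runtime in (10, 60, 600):
        n = jobs_per_claim(3600, runtime)
        rate = matches_per_second_needed(80.0, 3600, runtime)
        print(f"{runtime:>4}s jobs: {n} jobs per claim, "
              f"{rate:.3f} matches/s for 80 starts/s")
```

Under these assumptions, the shorter the jobs are relative to the claim lifetime, the fewer matches the negotiator has to make per job started.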

Hope the above makes sense...

- can I do anything without touching the source to increase the
negotiation performance?


Tuning knobs like NEGOTIATOR_INFORM_STARTD could help, but I'm not sure how much. I guess you also need to think about how important/relevant a metric negotiator dispatch rate is for your scenario. Maybe sustained job completion rate makes more sense. See

http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/tannenba-roadmap.pdf
for a bunch of performance graphs starting around slide 18. For example, tests back with v7.6.0 showed a negotiator matchmaking rate of 8 per second (close to what you found), but because the schedd reuses matches, the sustained job completion rate for just one schedd was 80 jobs/second. And of course, you can scale job completion rate horizontally by adding more schedds.

You may find the following paper of interest, even though it is getting a bit old:

Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron Livny, Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability limits of the Condor batch system", Journal of Physics: Conference Series, Vol. 331, No. 6, 2011

http://research.cs.wisc.edu/htcondor/doc/chep10_condor_scalability.pdf

regards,
Todd

p.s. Also be aware the negotiator classad ("condor_status -negotiator -l") publishes a number of statistics related to matchmaking performance, see http://goo.gl/BbIp9R . Useful for graphing with condor_gangliad.
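
If you want a quick number before setting up graphing, something like the following Python sketch could turn the "Attr = value" lines printed by "condor_status -negotiator -l" into a match rate for the most recent cycle. Treat it as an illustration: the attribute names LastNegotiationCycleMatches0 and LastNegotiationCycleDuration0 are as I recall them from the manual, so check them against the output of your version, and the sample ad text below is invented.

```python
def parse_classad_long(text):
    """Parse 'Attr = value' lines (condor_status -l style) into a dict."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        try:
            ad[name.strip()] = float(value) if "." in value else int(value)
        except ValueError:
            # Not a number: keep as a string, dropping surrounding quotes.
            ad[name.strip()] = value.strip('"')
    return ad


def last_cycle_match_rate(ad):
    """Matches per second in the most recent cycle, or None if unavailable."""
    # Attribute names are assumed; verify against your negotiator ad.
    matches = ad.get("LastNegotiationCycleMatches0")
    duration = ad.get("LastNegotiationCycleDuration0")
    if not isinstance(matches, (int, float)):
        return None
    if not isinstance(duration, (int, float)) or duration <= 0:
        return None
    return matches / duration


# Invented sample input, only to show the expected shape.
SAMPLE = """\
Name = "negotiator@example"
LastNegotiationCycleMatches0 = 120
LastNegotiationCycleDuration0 = 15
"""

if __name__ == "__main__":
    print(last_cycle_match_rate(parse_classad_long(SAMPLE)))
```

In practice you would feed it the real command output, for example by piping "condor_status -negotiator -l" into a small script that reads standard input.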

--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685