
Re: [HTCondor-users] negotiator "poor" performance issue



Thanks Todd! This is really useful information!

2014-03-14 19:09 GMT+01:00 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
> On 3/14/2014 5:42 AM, Pek Daniel wrote:
>>
>> Hi,
>>
>
> Hi Daniel, some thoughts inline...
>
>
>>
>> I assigned to the jobs I submitted randomized priorities, because
>> otherwise the negotiator would go through the schedds sequentially
>> (first, it runs all the jobs from schedd1, then from schedd2, etc).
>> I've also set:
>> USE_GLOBAL_JOB_PRIOS = true
>>
>
> Just FYI -  the negotiator communicates with schedds in user priority order
> regardless of schedd.  So if your jobs were submitted from different users
> (or with different accounting_groups), the negotiator would not go through
> all the schedds sequentially.
>
>
>> I don't use job arrays or clusters and I can't consider using them,
>> this is a constraint.
>>
>
> ^^^ This is a bummer...
>
>>
>> In this way, I could achieve ~10 jobs / sec negotiation (dispatching)
>> rate (not using priorities doesn't change this).
>>
>> My questions:
>> - did anybody measure before a higher dispatch rate?
>> - is this 10 jobs / sec considered a "normal" or "good enough" value
>> in case of HTCondor?
>
>
> Of course we are always working to improve the rate at which the negotiator
> makes matches, and we have several ideas/plans on the horizon.
>
> However, negotiator match rate for most real-world scenarios is not as
> important as it may seem.  The reason is that negotiator match rate has
> little to do with job start rate in HTCondor.  When the negotiator makes a
> match, it hands it out to a schedd.  This schedd then claims the slot, and
> starts a job.  A key point is that when the job completes, the schedd will
> find another job from that same user that matches the slot and start it
> **without any involvement from the condor_negotiator**. The schedd will keep
> using and reusing a slot it has claimed for job after job until the match is
> broken.  With a default CLAIM_WORKLIFE (see http://goo.gl/VOg9nm ) of an
> hour there are not typically that many Unclaimed machines on any given
> negotiation cycle (i.e. machines that are not already assigned to a schedd)
> that the negotiator has to worry about.  In other words, the negotiator is
> not typically involved at job boundaries, but only when claims need to move
> from one user/schedd to another due to priorities...
>
> Hope the above makes sense...
>
>
>> - can I do anything without touching the source to increase the
>> negotiation performance?
>>
>
> Tuning knobs like NEGOTIATOR_INFORM_STARTD could help, but not sure how
> much.  I guess you also need to think about how important/relevant of a
> metric negotiator dispatch rate is for your scenario.  Maybe sustained job
> completion rate makes more sense.  See
>
> http://research.cs.wisc.edu/htcondor/CondorWeek2011/presentations/tannenba-roadmap.pdf
> for a bunch of performance graphs starting around slide 18.  For example,
> tests back with v7.6.0 showed a negotiator matchmaking rate of 8 per second
> (close to what you found), but because the schedd reuses matches, the
> sustained job completion rate for just one schedd was 80 jobs/second.  And
> of course, you can scale job completion rate horizontally by adding more
> schedds.
>
> You may find the following paper of interest, even though it is getting a
> bit old:
>
> Dan Bradley, Timothy St Clair, Matthew Farrellee, Ziliang Guo, Miron Livny,
> Igor Sfiligoi, and Todd Tannenbaum, "An update on the scalability limits of
> the Condor batch system", Journal of Physics: Conference Series, Vol. 331,
> No. 6, 2011
>
> http://research.cs.wisc.edu/htcondor/doc/chep10_condor_scalability.pdf
>
> regards,
> Todd
>
> p.s. Also be aware the negotiator classad ("condor_status -negotiator -l")
> publishes a number of statistics related to matchmaking performance, see
> http://goo.gl/BbIp9R .  Useful for graphing with condor_gangliad.
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing   Department of Computer Sciences
> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                  Madison, WI 53706-1685
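
For anyone reading this thread later, here is a rough back-of-the-envelope sketch of Todd's point about claim reuse. It is a simplified model, not anything taken from the HTCondor source: it assumes a schedd can start a new job on a claimed slot only while the claim is younger than CLAIM_WORKLIFE, that jobs run back to back, and that all jobs have the same length. The function names, the slot count, and the job length are made up for illustration; the one-hour worklife is the figure quoted in the thread, and schedd and startd overheads are ignored entirely.

```python
import math


def jobs_per_claim(claim_worklife_s: float, job_duration_s: float) -> int:
    """Jobs one claim can start back to back before the worklife expires.

    Model assumption: jobs start at t = 0, d, 2d, ... as long as t is
    strictly less than the worklife; a job already running may finish.
    """
    if claim_worklife_s <= 0 or job_duration_s <= 0:
        raise ValueError("worklife and job duration must be positive")
    return math.ceil(claim_worklife_s / job_duration_s)


def sustained_job_rate(match_rate_per_s: float,
                       claim_worklife_s: float,
                       job_duration_s: float,
                       slots: int) -> float:
    """Upper bound on jobs started per second in this toy model.

    Two limits apply: each negotiator match is reused for several jobs,
    and the pool cannot run more than slots / job_duration jobs per second.
    """
    negotiator_bound = match_rate_per_s * jobs_per_claim(
        claim_worklife_s, job_duration_s)
    slot_bound = slots / job_duration_s
    return min(negotiator_bound, slot_bound)


if __name__ == "__main__":
    # Hypothetical pool: 10,000 slots, 60-second jobs, one-hour worklife.
    reuse = jobs_per_claim(3600, 60)
    rate = sustained_job_rate(8, 3600, 60, 10_000)
    print(f"jobs per claim: {reuse}")
    print(f"job start rate bound: {rate:.1f} jobs/s")
```

In this model the negotiator only becomes the bottleneck when match rate times jobs-per-claim drops below what the slots can absorb, which is why short claims or very long jobs make the match rate matter more.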