Re: [HTCondor-users] Minimize time to start job.



On 8/13/2019 6:21 AM, don_vanchos wrote:
> Hello,
> 
> I noticed that for the simplest vanilla jobs on my cluster, the time 
> difference `job_ad["JobStartDate"] - job_ad["QDate"]` is from 5 to 20 
> seconds. So quite a lot of time elapses between sending job to the queue 
> and starting the process.
> 
> Then I set the NEGOTIATOR_CYCLE_DELAY setting to 0, and this time 
> difference dropped to between 0 and 1 second.
> 
> My goal is to make job launch as fast as possible! What are the 
> consequences if I make this setting equal to zero? Maybe performance 
> degradation? Or maybe the wrong behavior in some cases? If the '0' value 
> is harmful, then how can I minimize this time difference?
> 
> P.S. Many thanks to all for the answers to my questions in the 
> neighboring branches and in this (in advance).
> 
> -- 
> Sincerely yours,
> Ivan Ergunov                                                 

Hi Ivan,

Here is how it works: the condor_schedd (running on your submit node) 
maintains a set of execute node slots it has claimed.  The time it takes 
the schedd to start an idle job on a slot it has already claimed is 
typically very short (sub-second).  However, if the schedd does not have 
a claimed slot available, it needs to ask the negotiator for a match; 
this is the 5 to 20 second delay you initially observed.  So if you 
submit 1000 jobs to a pool with 10 cpu cores, it will take a few seconds 
for the schedd to get the matches initially and start the first 10 jobs, 
but jobs 11 through 1000 will start practically immediately when an 
earlier job completes (because the schedd does not need to talk to the 
negotiator again; it already has the slots claimed).
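
As a rough illustration of the paragraph above, here is a toy Python 
model (not HTCondor code). It assumes every job has the same runtime, 
that all slots become claimed after one negotiator match delay, and that 
handing a claimed slot to the next idle job takes zero time. The name 
start_times and all of the numbers are made up for this example.

```python
import heapq


def start_times(n_jobs, n_slots, match_delay, runtime):
    """Toy model: return the start time (in seconds) of each job.

    Assumptions (not taken from HTCondor itself):
      - all jobs are submitted at t = 0 and run for `runtime` seconds
      - every slot becomes claimed at t = match_delay
      - reusing an already-claimed slot costs no time
    """
    # Each heap entry is the time at which a claimed slot is next free.
    free_at = [match_delay] * n_slots
    heapq.heapify(free_at)

    starts = []
    for _ in range(n_jobs):
        t = heapq.heappop(free_at)         # earliest free claimed slot
        starts.append(t)
        heapq.heappush(free_at, t + runtime)  # slot is busy until the job ends
    return starts
```

With 10 slots, a 5 second match delay and 60 second jobs, 
start_times(25, 10, 5.0, 60.0) gives 5.0 for the first ten jobs and 65.0 
for the next ten: only the first wave pays for negotiation.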

To answer your question above about NEGOTIATOR_CYCLE_DELAY: if your pool 
is of modest size (e.g. about one thousand cores or fewer), is all 
located on the same local-area network (i.e. your pool is not spread 
across a high-latency wide-area network), and you have just one submit 
node (i.e. one schedd) where you are submitting jobs, I think a 
NEGOTIATOR_CYCLE_DELAY of 1 or 2 would be fine. Because the negotiator 
is mostly stateless, the idea of NEGOTIATOR_CYCLE_DELAY is to give the 
schedd time to claim the slots matched by the negotiator, and for those 
claims to be reflected in the collector, before the next negotiator 
cycle starts, so that the negotiator does not waste time giving out the 
same resources over and over again.

Note that the negotiator cycle itself is started periodically 
(controlled by the config knob NEGOTIATOR_INTERVAL) or is triggered 
whenever a condor_submit or condor_reschedule command is issued.
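
To make the interaction of the two knobs concrete, here is a small 
Python sketch of when the next cycle could start. It is only a toy model 
of the description above, not the negotiator's actual logic: it assumes 
the delay is measured from the start of the previous cycle, and the 
function name next_cycle_start and the numbers used with it are 
hypothetical.

```python
from typing import Optional


def next_cycle_start(last_start: float,
                     interval: float,
                     cycle_delay: float,
                     trigger_time: Optional[float] = None) -> float:
    """Toy model of when the next negotiation cycle may begin.

    interval     -- stands in for NEGOTIATOR_INTERVAL (periodic start)
    cycle_delay  -- stands in for NEGOTIATOR_CYCLE_DELAY (minimum gap)
    trigger_time -- time of a condor_submit / condor_reschedule, if any

    Assumption: the minimum gap is counted from the start of the
    previous cycle; the real daemon may measure it differently.
    """
    periodic = last_start + interval
    # A submit or reschedule can ask for a cycle before the periodic timer fires.
    wanted = periodic if trigger_time is None else min(periodic, trigger_time)
    # The cycle is never allowed to start before the minimum gap has passed.
    return max(wanted, last_start + cycle_delay)
```

In this model, with an interval of 60 and a submit at t=5, a cycle delay 
of 20 pushes the next cycle out to t=20, while a cycle delay of 0 lets 
it start at t=5.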

Hope the above helps; feel free to ask follow-up questions if anything 
was unclear,

regards
Todd

-- 
Todd Tannenbaum                        University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257