
Re: [Condor-users] Condor Rank question



> It seems condor has no concept of load average on system. We have a
> farm about 400 servers each with 4 cores. There are many times when
> 1,5,15 minute load average is around 5.00. Is it possible to have
> condor avoid boxes like this to execute new jobs? I don't want the box
> to be oversubscribed.

The startd already reports LoadAvg and CondorLoadAvg values to the collector (see: http://www.cs.wisc.edu/condor/manual/v7.2/Appendix_A_ClassAd.html#77090)

If you don't like those values, you can use a startd cron job to publish a set of attributes to the machine ad with numbers you prefer. Once the machine's load is published in its ClassAd, Condor can make scheduling decisions using this information.
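As a rough illustration of the startd cron idea, here is a minimal sketch of a script that prints the system load averages as "Name = value" lines on stdout, which is the form a startd cron job uses to hand attributes to the machine ad. The attribute names (CustomLoadAvg1 and so on) and the "Custom" prefix are made up for this example, and the configuration needed to register the script with the startd is not shown because it depends on your Condor version.

```python
import os


def load_attributes(prefix="Custom"):
    """Return 'Name = value' lines for the 1, 5 and 15 minute load averages.

    The attribute names are hypothetical; pick whatever names you want
    to reference later in your rank or START expressions.
    """
    one, five, fifteen = os.getloadavg()  # Unix-only call
    return [
        f"{prefix}LoadAvg1 = {one:.2f}",
        f"{prefix}LoadAvg5 = {five:.2f}",
        f"{prefix}LoadAvg15 = {fifteen:.2f}",
    ]


if __name__ == "__main__":
    # One attribute per line on stdout.
    print("\n".join(load_attributes()))
```

With something like this in place you could rank or gate on, say, the 5 minute value instead of the stock LoadAvg.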

Now, here's the rub: the ClassAd used by the scheduler when evaluating your rank expressions is potentially stale, so the load may be off by as much as a few minutes.

The only way I know of to get around this is to also enforce some policy in your START expression, which will be evaluated when a job actually tries to start on a machine.

Here's an example. Let's say you're using LoadAvg in your negotiator ranking, so you have:

##  The NEGOTIATOR_POST_JOB_RANK expression chooses between
##  resources that are equally preferred by the job.
##  Break ties by looking for machines that have been idle longer than others
##  and use them first:
##
##     (((State =?= "Owner") * (Activity =?= "Idle")) * 1000000000) + ((State =?= "Unclaimed") * 100000000)
##
##  Also try to use faster machines before slower machines:
##
##      + (KFlops * 0.001)
##
##  And assign jobs to separate machines before we start putting
##  two jobs on a machine:
##      - (SlotID * 10)
##
##  And try to put jobs on machines that are more lightly loaded:
##      - LoadAvg
NEGOTIATOR_POST_JOB_RANK = (((State =?= "Owner") * (Activity =?= "Idle")) * 1000000000) + ((State =?= "Unclaimed") * 100000000) + (KFlops * 0.001) - (SlotID * 10) - LoadAvg
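To get a feel for how the terms in that expression weigh against each other, here is a small sketch of the same arithmetic in Python. It is not how Condor evaluates ClassAds; the function name and the two boolean arguments (standing in for the state/activity tests) are just for illustration, and the KFlops figure is an arbitrary sample value.

```python
def post_job_rank(owner_idle, unclaimed, kflops, slot_id, load_avg):
    """Mirror the arithmetic of the rank expression above.

    owner_idle and unclaimed are booleans standing in for the two
    state/activity tests; True counts as 1 and False as 0.
    """
    return (
        owner_idle * 1_000_000_000
        + unclaimed * 100_000_000
        + kflops * 0.001
        - slot_id * 10
        - load_avg
    )


# Same machine class, same slot: the lighter load wins.
light = post_job_rank(False, True, 1_000_000, 1, 0.5)
heavy = post_job_rank(False, True, 1_000_000, 1, 5.0)

# Slot 2 on a completely idle box still ranks below slot 1 on a loaded
# one, because the slot term (20) outweighs the load term (5).
idle_slot2 = post_job_rank(False, True, 1_000_000, 2, 0.0)

print(light > heavy, heavy > idle_slot2)
```

The point is that LoadAvg is only a small tie-breaker here; a loaded machine gets pushed down the list but never removed from it.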


But maybe you want to be really strict about this? You really don't want to run a job on a machine if its LoadAvg is over 5. This ranking expression doesn't help you enforce that. It'll certainly put those machines lower on the list of preferred machines, but it's not strict. So you need to add some logic to your START expression on the machine:

##  Take whatever I'm currently doing for a START expression and make sure jobs
##  don't run here if the load is too high.
START = $(START) && (LoadAvg < 5)

Now you've got some closer-to-real-time evaluation of LoadAvg, but it's still not perfect. LoadAvg still isn't real time, but it's a lot closer to reality at the startd than it is at the collector. That *shouldn't* preempt existing, running jobs on a machine where LoadAvg >= 5, but it'll stop new jobs from starting there and it will ensure the negotiator isn't trying too hard to put jobs on these machines.
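If it helps to see the two stages side by side, here is a toy model in Python: the ordering uses the load the collector last heard about, and the final yes/no uses the load the machine sees right now. Everything in it (the Machine class, the function names, the sample numbers) is invented for illustration, and it glosses over a lot of what the negotiator and startd really do.

```python
from dataclasses import dataclass

LOAD_LIMIT = 5.0  # same threshold as the START expression above


@dataclass
class Machine:
    name: str
    advertised_load: float  # what the collector's (possibly stale) ad says
    current_load: float     # what the startd sees at start time


def negotiator_order(machines):
    """Stand-in for ranking: lightest advertised load first."""
    return sorted(machines, key=lambda m: m.advertised_load)


def start_allows(machine):
    """Stand-in for '&& (LoadAvg < 5)' evaluated on the machine itself."""
    return machine.current_load < LOAD_LIMIT


def place_job(machines):
    """Walk the ranked list and return the first machine that accepts."""
    for machine in negotiator_order(machines):
        if start_allows(machine):
            return machine.name
    return None


pool = [
    Machine("node-a", advertised_load=0.5, current_load=6.2),  # stale ad
    Machine("node-b", advertised_load=1.0, current_load=1.1),
    Machine("node-c", advertised_load=5.5, current_load=5.5),
]
print(place_job(pool))
```

Here node-a looks best on paper because its ad is out of date, but the check on the machine turns the job away and it lands on node-b instead.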

- Ian
