
Re: [HTCondor-users] default host ranking



Hi,

I think I finally found the problem. The negotiator pre- and post-job-rank variables have default values, as I noticed in the NegotiatorLog:
12/05/14 16:53:11 NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory
12/05/14 16:53:11 NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (ifthenElse(isUndefined(KFlops), 1000, Kflops) - SlotID - 1.0e10*(Offline=?=True))

Now what I am guessing is that My.Rank is expected to also pick up the rank value if set in a job submit file, but it only picks up rank values set in the configuration file on the various hosts. In our case, ‘Memory’ is 2012 on host1 and 500 on host2. Since the default pre-rank subtracts Memory, host2 (with less memory) ranks higher, which causes host2 to fill up first.

I have now removed the Memory and Cpus terms from the pre-rank, and things work as (we) expected.
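
For the record, the override on the central manager is now roughly the default pre-rank with those two terms dropped (a sketch; adjust as needed):

NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED))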

Cheers,
Yngve

> On 03 Dec 2014, at 09:46, Yngve Levinsen <Yngve.Levinsen@xxxxxxx> wrote:
> 
> Hi,
> 
> Thanks for your suggestions Todd!
> 
>> On 01 Dec 2014, at 19:15, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>> 
>> On 12/1/2014 10:22 AM, Yngve Levinsen wrote:
>>> Hi all,
>>> 
>>> I would like to set up a default rank of the hosts in our pool (unless
>>> the user specifies another ranking). Where is this set?
>> 
>> You can use knob NEGOTIATOR_POST_JOB_RANK for this purpose.  It may be helpful to see the recipe
>> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToSteerJobs
> 
> Thanks for the link, I was not aware of this functionality. NEGOTIATOR_POST_JOB_RANK should then be defined only in the condor_config.local (or similar) of the machine(s) that run the negotiator, right? This did not seem to work for me. I noticed in “condor_status -long” that there is a variable “CurrentRank” which seems to hold the current rank of running jobs (otherwise 0). This variable stayed 0 after I had defined NEGOTIATOR_POST_JOB_RANK (on host1, which is my central host).
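> 
> (To watch this I used something along the lines of: condor_status -af Machine SlotID CurrentRank)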
> 
> After a bit more testing, I did notice that it actually had some effect. I set it equal to SlotID and noticed that instead of starting to use slot1, slot2, etc., it would start using slot24, slot23, etc. However, this was just internal ranking within one host; condor would still fill up the slots on host2 before filling up slots on host1. There seems to be some remnant preference for host2 over host1, and I am not sure where/how this is defined.
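> 
> For reference, that test was simply this line in the central manager's configuration (higher rank wins, hence slot24 first):
> 
> NEGOTIATOR_POST_JOB_RANK = SlotID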
> 
> What I eventually figured out worked was to define “RANK” in my condor_config.local on the various machines. I set it equal to SlotID, which was a very simple way to distribute the jobs evenly across the machines we have, more or less in the way we want. It would be nice if the user could also override the ranking, but that is less crucial for us.
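> 
> Concretely, the working setup is just this in the condor_config.local of each execute machine:
> 
> RANK = SlotID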
> 
>> 
>>> Currently we
>>> only have two machines, the central host and “host2”. It looks to me
>>> like HTCondor is always filling up the slots on “host2” before starting
>>> to use the slots on the central machine. I suppose this makes sense for
>>> larger pools where you want to keep the resources on the central host
>>> free for as long as possible. However, we would currently like it to be
>>> the opposite (or, even better, to distribute the jobs evenly between the hosts in the pool).
>>> 
>> 
>> On your execute machines, are you using static slots (the default), or partitionable slots?
>> 
>> Assuming static slots, the above recipe gives an example NEGOTIATOR_POST_JOB_RANK to distribute the jobs evenly between hosts in the pool.
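>> 
>> For example, something along these lines (check the wiki page for the exact expression; RemoteOwner is undefined on idle slots, so among idle slots this prefers the lowest SlotID across all machines, i.e. it fills the pool breadth-first):
>> 
>> NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (- SlotID)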
> 
> I use static slots, the default. I am 99% sure of that (the last percent because I am never completely convinced I understand the configuration correctly).
> 
>> 
>>> Further, I tried to use the “rank” parameter in the job file without
>>> success. I added this line to the job configuration file:
>>> 
>>> rank = ( 2 * (machine == "host1") ) + (machine == "host2")
>>> 
>>> With this, condor was still populating the slots on host2 before using
>>> the slots on host1. I then figured maybe there is some other ranking
>>> being done, such that I needed to increase the number. However, neither
>>> rank = ( 1000 * (machine == "host1") ) + (machine == "host2")
>>> nor
>>> rank = ( 1000 * (machine == "host1") )
>>> changed anything (that I noticed).
>>> 
>>> That made me think that maybe I was simply using the wrong hostnames, so I
>>> added them to the “requirements” instead. That worked: unless I wrote
>>> “host1” and/or “host2” (spelled correctly), the respective hosts
>>> would not be used.
>>> 
>>> Is ranking not turned on by default, or is there something else I might
>>> be missing?
>>> 
>> 
>> I would have expected the above to work (assuming static slots).
>> 
>> What version of HTCondor are you using?  There was a bug related to job rank that was fixed starting with v8.2.2 which may be causing you problems.  See
>>  https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=4403
> 
> I am using 8.2.4 on both machines.
> 
>> 
>> Assuming static slots and no jobs running on host1 or host2, what happens if you try the following submit file?
>> 
>> requirements = ( machine == "host1" || machine == "host2" ) && ( Rank =!= UNDEFINED )
>> 
>> rank = 1000 - SlotID
>> 
>> Also, does the following command
>>  condor_status -af machine
>> display "host1" and "host2" spelled the same etc as what you used in your job submit file?
>> 
> 
> Yes, and if I put one or both into the requirements expression instead, it works as expected (I tried misspelling one of them to make 100% sure, and then it would not submit anything to that host).
> 
> With your suggested rank/requirements the jobs still run fine, but CurrentRank is, as before, equal to 0. The same is true for Rank (“condor_status -af Rank”).
> 
>> Hope the above helps,
>> Todd
> 
> Thanks for suggestions,
> Yngve
> 
>> 
>> 
>> 
>>> In case I have explained myself poorly, I attach my job configuration file.
>>> 
>>> Cheers,
>>> Yngve
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>> Center for High Throughput Computing   Department of Computer Sciences
>> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>> Phone: (608) 263-7132                  Madison, WI 53706-1685
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/