
Re: [HTCondor-users] Configuring NEGOTIATOR_PRE_JOB_RANK and NEGOTIATOR_POST_JOB_RANK to optimize matching



Hi Lars,

On Wed, 2019-03-13 at 16:31:28 +0000, Lars Henrik Sowa wrote:
> Hi all,
> 
> I am currently trying to optimize the matching for our machines using NEGOTIATOR_PRE_JOB_RANK and NEGOTIATOR_POST_JOB_RANK (at the moment we face some issues with offline machines being matched before online machines, and the offline ones cannot, for various reasons, be woken up via Wake-on-LAN).
> 
> I hope you can clarify some questions I have.
> 
> By default these values are set to (I have checked them via "condor_config_val -v ..."):
> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory
> NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (ifthenElse(isUndefined(KFlops), 1000, Kflops) - SlotID - 1.0e10*(Offline=?=True))
> 
> Currently I do not understand why Memory is in there with a negative sign. According to the documentation, Memory is "The amount of RAM in MiB in this slot. For static slots, this value will be the same as in TotalSlotMemory. For a partitionable slot, this value will be the quantity remaining in the partitionable slot.". In our case this is equal to TotalSlotMemory as well as TotalMemory, as we only have one slot per machine. If I understand this correctly, the default NEGOTIATOR_PRE_JOB_RANK prefers machines with less memory over ones with more memory. Is this correct? And if I actually want machines with more memory to be matched first, would I just have to turn the "- Memory" into "+ Memory"?

The idea of the ranking is usually to spend smaller resources before bigger ones, so that
big resources stay available for jobs that wouldn't fit into smaller (already partitioned) slots.
Therefore, one would rank an already fragmented machine above a more complete one
(this doesn't change the rule that the machine must still match the job's requirements).
So yes: with the default, a slot with less Memory gets the higher rank, and flipping
the sign would prefer the bigger ones.
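
As an untested sketch of what flipping the sign would look like, assuming the rest
of the default expression you quoted is kept unchanged:

```
# Untested sketch: same as the quoted default, but with "+ Memory" so that
# slots with more Memory (in MiB) get a higher pre job rank.
NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) + Memory
```

Keep the weights in mind: Memory is in MiB and the Cpus term is weighted with 100000,
so a memory difference of more than 100000 MiB would outweigh a difference of one CPU.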

I used the RemoteOwner clause for quite some time, but recently dropped it. YMMV.

> With these default values it might also happen that offline machines are matched before online machines, because the Offline attribute is only taken into account in the post job rank. To avoid this, would I just move the "- 1.0e10*(Offline=?=True)" term from the post job rank to the pre job rank, so that these offline machines are definitely sorted to the end of the list?

That's how I understand it...
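
Something along these lines, as an untested sketch based on the defaults you quoted
(the only change is where the Offline term sits):

```
# Untested sketch: Offline penalty moved from the post job rank to the pre job rank.
NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory - 1.0e10 * (Offline =?= True)
NEGOTIATOR_POST_JOB_RANK = (RemoteOwner =?= UNDEFINED) * (ifThenElse(isUndefined(KFlops), 1000, KFlops) - SlotID)
```

Note that 1.0e10 only dominates the first term as long as the machine Rank stays
below 1000 (10000000 * 1000 = 1.0e10), so check what your Rank expressions can evaluate to.
After a condor_reconfig on the negotiator host, "condor_config_val -v NEGOTIATOR_PRE_JOB_RANK"
should show the new value.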

Cheers,
 S