
[HTCondor-users] job/load balancing question



 

Hi Everyone, 

 

My on-premises HTCondor pool is a fixed size, with a known number of cores across a set number of machines.

 

I recently found a suggestion that looked like it would help balance the load a little bit.

 

NEGOTIATOR_PRE_JOB_RANK = isUndefined(RemoteOwner) * (- SlotId)
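
If I am reading it right (and I may not be), the reason this spreads jobs is:

   # isUndefined(RemoteOwner) is 1 for an unclaimed slot, 0 for a claimed one,
   # so an unclaimed slot ranks -SlotId:  slot1 = -1, slot2 = -2, etc.
   # Slot #1 on every machine then outranks slot #2 anywhere, so the
   # negotiator fills one slot per machine before doubling up.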

 

As advertised, this definitely changed the behavior: instead of loading all of the jobs onto one machine, it now spreads the jobs across all of the machines.  BUT I just realized it broke another part of my system.

 

Under the default formula

 

NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory

 

Jobs match the core with the least memory first, and my load balancing depends on that.  I have some cores set aside for the big-memory jobs; jobs with little or no memory requirements should go to the other cores first, and only use the big-memory cores if nothing else is available and nothing else needs them.
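
My reading of the terms in the default (so correct me if I have it wrong):

   # (10000000 * My.Rank)                      honor the machine's own Rank first
   # (1000000 * (RemoteOwner =?= UNDEFINED))   then prefer unclaimed slots
   # - (100000 * Cpus)                         then prefer slots with fewer cores
   # - Memory                                  then prefer slots with less memory

which is why the small jobs land on the small-memory cores first.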

 

Spreading jobs across many nodes appears to fill slot #1 first, but that is not one of my lower-memory slots.  I want jobs to match the lower-memory cores first, leaving the cores with larger memory allocations available for the big jobs that come later in the sequence.

 

THE REAL QUESTION IS HERE---------------

I am guessing there is a way to combine these two interests?  I want to spread the jobs widely across machines instead of filling up all the cores on one machine first -- but I also want the memory requirements to be given weight.  (A rough, untested sketch of what I mean is below, after the details.)

 

More info that might help or might just confuse...

   Ideally: the 1st job goes to a low-memory core on machine X.
            The 2nd job goes to a low-memory core on a different machine, etc.
            Then, once there is one job on each machine, the next job goes to the second low-memory core on a node, etc.

   In the meantime, when a job requiring more memory shows up, it can go to a larger-memory core on one of the machines.

   But if all of the low-memory cores fill up and a larger core is available, go ahead and let a small job use it.  The job run times are short enough to absorb that wait.  The current problem is that the small jobs are taking up the bigger cores in a way that is noticeably delaying the start of large jobs.
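
Something like this is the sort of combination I am imagining -- completely untested, and the scale factors are guesses -- keeping the unclaimed and memory terms from the default but swapping in the SlotId trick from the suggestion above:

   # Untested sketch: prefer unclaimed slots, then slots with less memory,
   # then lower slot numbers, so equal-memory slots fill breadth-first
   # across the machines.  The weights assume slot Memory (in MB) stays
   # well under 1000000 and SlotId stays under 1000.
   NEGOTIATOR_PRE_JOB_RANK = (1000000000 * isUndefined(RemoteOwner)) \
                             - (1000 * Memory) - SlotId

If that reasoning holds, small jobs would fill the low-memory slot #1s across the pool first, and would only match a big-memory core once every low-memory core is claimed -- which is the fallback behavior I want.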

 

         Mary   

 

Mary Romelfanger

Deputy Branch Manager/

Principal Computer Scientist

Data Systems Branch

.___.
{o,o}      Phone 410-338-6708
/)__)      Cell  443-244-0191
-"-"-      mary@xxxxxxxxx

 

Space Telescope Science Institute

3700 San Martin Drive

Baltimore, MD 21218