
Re: [HTCondor-users] imbalance question



Hi Nomad,

this is a more or less religious question. IMHO there is no such thing as an intelligent or a dumb way to do this, because 'it depends': if you imagine there would be some 'whole-node-type' jobs flying in at any time, it would be nice to fill the pool vertically. You are working with virtual machines; maybe these get spawned as needed, and I bet you would not want to spawn a VM for every job, and so on ...

The good news is that as soon as you teach condor what an intelligent way to fill the pool is, it will roll up the sleeves on its little electronic arms and do what it's told :)

Assuming your pool is rather homogeneous, each job, after having been negotiated, will come up with a long list of possible slots to run on. This list is then sorted by different mechanisms. I am not 100% sure, but the Rank expression that comes with the job is one of them (it might be applied pre-negotiation, though).
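
(Just as an illustration, and not something you need for your case: the job-side preference is the 'rank' line in the submit file; the executable name here is made up, of course:

    executable = my_job.sh
    rank       = TotalMemory
    queue

A higher rank value means a more desirable machine from the job's point of view.)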

Anyway, the more general approach is NEGOTIATOR_POST_JOB_RANK (on the negotiator). You can use it to sort the said list as you like; the higher the number, the better. If you simply want memory to be the all-deciding factor, just put memory in there: NEGOTIATOR_POST_JOB_RANK = TotalMemory.

You can also do algebra inside the NEGOTIATOR_POST_JOB_RANK, like (TotalMemory - (10 * TotalLoadAvg)), which would take the load on the machine into account.
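
In config-file terms, on the central manager that is nothing more than one of these two lines (followed by a condor_reconfig):

    # simplest version: the more memory, the better
    NEGOTIATOR_POST_JOB_RANK = TotalMemory

    # or the load-aware version from above
    NEGOTIATOR_POST_JOB_RANK = (TotalMemory - (10 * TotalLoadAvg))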

If you have an even more sophisticated view of your hosts, you can use startd_cron to fetch some metrics like disk fill level or network bandwidth and then sort your machines according to these parameters, for example like the sketch below.
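
A rough sketch of such a cron hook (the script path and the attribute name are made up, of course):

    # startd config on the worker nodes
    STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DISKLEVEL
    STARTD_CRON_DISKLEVEL_EXECUTABLE = /usr/local/bin/disk_fill_level.sh
    STARTD_CRON_DISKLEVEL_PERIOD = 5m

    # disk_fill_level.sh just prints a ClassAd attribute to stdout, e.g.
    #   DiskFillPercent = 42
    # which ends up in the machine ad and can then be used on the negotiator,
    # e.g. NEGOTIATOR_POST_JOB_RANK = TotalMemory - DiskFillPercent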

It is quite simple to test your expression: condor_status -af <your attributes> and a little bit of awk will spit out a list quite similar to what you can expect from the negotiator. There is a tiny bit of blurriness here, as the ranking then includes machines that might not be suitable for the job for other reasons, but it is good enough to judge your NEGOTIATOR_POST_JOB_RANK expression for sure!
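
Something along these lines (assuming the memory/load example from above):

    condor_status -af Machine TotalMemory TotalLoadAvg \
      | awk '{printf "%s %.1f\n", $1, $2 - 10*$3}' \
      | sort -k2 -rn

gives you the machines in roughly the order the negotiator would prefer them.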

If you simply want to fill horizontally, the slot id is the factor you want to sort by (prefer small slot ids). If you use partitionable slots, the slot id is always '1', so you need a different approach; see this recipe for example (and the sketch below):

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToFillPoolBreadthFirst
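
To sketch both cases (the wiki recipe above is the authoritative version, this is just the rough idea):

    # static slots: fill horizontally by preferring small slot ids
    NEGOTIATOR_POST_JOB_RANK = 0 - SlotID

    # partitionable slots: prefer the machines with the most unclaimed
    # cores, since Cpus on the parent slot is what is still left over
    NEGOTIATOR_POST_JOB_RANK = Cpus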

sorry for the lengthy e-mail, seems I am in a chatty mood ;)

Hope this helps !

Best
Christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Lee Damon" <nomad@xxxxxxxxxxxxxxxxx>
An: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Gesendet: Mittwoch, 11. MÃrz 2020 00:29:28
Betreff: [HTCondor-users] imbalance question

I suspect I'm missing something fundamental but it's the end of the work day and my brain is done.
I have a 6-host cluster. The hosts are mostly the same: they're all VMs running the same OS, configured the same (configuration management via puppet*), and they all have the same NFS mount access to the data. The only real difference is how much RAM the hosts have.

Users are submitting jobs and those jobs keep going to the two busiest nodes in the cluster instead of being spread around. I've just tested and see the same behavior.

When I put a requirements = (name of idle host), the job goes to the idle host with no problems. However, if no hostname requirements are set, the jobs keep going to the same busy hosts. Oddly, the busiest hosts are the ones with the least available RAM overall.

I was pretty sure condor should be doing a better job of balancing the loads. What am I missing here?

; condor_status
Name                                         OpSys      Arch   State     Activi

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle  
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle  
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle  
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy  
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX      X86_64 Unclaimed Idle  
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX      X86_64 Unclaimed Idle  
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX      X86_64 Unclaimed Idle  

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX       15     0       9         6       0          0      0

         Total       15     0       9         6       0          0      0

; ssh chrusm0 uptime ; ssh chrusm1 uptime ; ssh chrulg0 uptime ; ssh omics0 uptime ; ssh omics1 uptime ; ssh omics2 uptime
 16:24:15 up 21 days, 40 min, 10 users,  load average: 12.07, 13.56, 15.00
 16:24:16 up 20 days, 23:34,  0 users,  load average: 7.20, 7.20, 7.15
 16:24:16 up 21 days, 40 min,  5 users,  load average: 0.00, 0.02, 0.11
 16:24:17 up 76 days,  4:58,  0 users,  load average: 0.02, 1.53, 2.91
 16:24:18 up 76 days,  4:57,  0 users,  load average: 0.00, 0.40, 1.14
 16:24:18 up 76 days,  4:55,  0 users,  load average: 0.00, 0.01, 0.10

thanks,
nomad

* - this is a different lab than the one I emailed about last week. Different hosts and configuration management system.


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/