[HTCondor-users] imbalance question



I suspect I'm missing something fundamental but it's the end of the work day and my brain is done.

I have a 6-host cluster. The hosts are mostly the same: they're all VMs running the same OS, configured identically (configuration management via Puppet*), and they all have the same NFS mount access to the data. The only real difference is how much RAM each host has.

Users are submitting jobs and those jobs keep going to the two busiest nodes in the cluster instead of being spread around. I've just tested and see the same behavior.

When I set requirements = (name of idle host), the job goes to the idle host with no problems. However, if no hostname requirement is set, the jobs keep going to the same busy hosts. Oddly, the busiest hosts are the ones with the least available RAM overall.
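
A minimal sketch of the kind of pinned submit file I mean, assuming the constraint is written against the Machine attribute. The executable, arguments, and hostname are placeholders, not the real ones:

```
# Hypothetical test job: executable and hostname are placeholders.
executable   = /bin/sleep
arguments    = 60
# Pin the job to one machine by matching its Machine ClassAd attribute.
requirements = (Machine == "idle-host.example.org")
queue
```

Removing the requirements line is the only change between the job that lands on the idle host and the one that lands on a busy host.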

I was pretty sure condor should be doing a better job of balancing the loads. What am I missing here?

; condor_status
Name                                  OpSys  Arch   State     Activi

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX  X86_64 Unclaimed Idle
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX  X86_64 Unclaimed Idle
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX  X86_64 Unclaimed Idle
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_5@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_6@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1_7@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX  X86_64 Claimed   Busy
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64 Unclaimed Idle

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX       15     0       9         6       0          0      0

         Total       15     0       9         6       0          0      0

; ssh chrusm0 uptime ; ssh chrusm1 uptime ; ssh chrulg0 uptime ; ssh omics0 uptime ; ssh omics1 uptime ; ssh omics2 uptime
 16:24:15 up 21 days, 40 min, 10 users,  load average: 12.07, 13.56, 15.00
 16:24:16 up 20 days, 23:34,  0 users,  load average: 7.20, 7.20, 7.15
 16:24:16 up 21 days, 40 min,  5 users,  load average: 0.00, 0.02, 0.11
 16:24:17 up 76 days,  4:58,  0 users,  load average: 0.02, 1.53, 2.91
 16:24:18 up 76 days,  4:57,  0 users,  load average: 0.00, 0.40, 1.14
 16:24:18 up 76 days,  4:55,  0 users,  load average: 0.00, 0.01, 0.10

thanks,
nomad

* - this is a different lab than the one I emailed about last week. Different hosts and configuration management system.