
Re: [HTCondor-users] Cluster utilization is low for mixed memory intensive jobs



Thanks so much for your advice, Michael! I should have attached my hardware configuration. I wonder if there is any configuration that can schedule jobs faster so the cluster is fully utilized.

The CPU information of my dummy server is shown below, so in principle I should be able to run 256 single-CPU jobs in parallel. The total memory of the machine is about 96 GB. In the real world, my jobs are a mix requiring from 16 GB down to 4 GB of memory. To "run" more jobs on my test dummy server, I scale the memory usage down by a factor of 10 and replace the real processing with a sleep (16 GB to 4 GB -> 1600 MB to 400 MB); a sketch of one such dummy job appears after the lscpu output. That means ideally I can run 60 large-memory jobs, or about 240 small-memory jobs (neither case can use all 256 CPUs, and that's my definition of memory-intensive jobs).

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               87
Model name:          Intel(R) Xeon Phi(TM) CPU 7230 @ 1.30GHz
......
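
For reference, one of the scaled-down dummy jobs looks roughly like the submit description below (the sleep duration, file name, and queue count are illustrative; only the resource requests matter for this test):

# dummy 1600 MB-tier job, scaled down 10x from a real 16 GB job (values illustrative)
executable     = /bin/sleep
arguments      = 600
request_cpus   = 1
request_memory = 1600MB
request_disk   = 1MB
queue 100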

I did indeed set each job's request_disk = 1MB for testing. My workload submits jobs in order: first hundreds of jobs requesting 1600 MB, then jobs requesting 1000 MB, 800 MB, 400 MB, and so forth.
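
Something like the following driver captures that submission order (the dummy.sub file name and the exact tiers are illustrative):

# submit the memory tiers in descending order; the appended request_memory
# overrides whatever the submit file sets
for mem in 1600 1000 800 400; do
    condor_submit -append "request_memory = ${mem}MB" dummy.sub
done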

Here is the condor_status output before my submission:

Name                          OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@DummyServer      LINUX      X86_64 Unclaimed Idle      0.000 96510  1+00:00:17

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

  X86_64/LINUX     1     0       0         1       0          0        0      0

         Total     1     0       0         1       0          0        0      0

Initially, the large-memory jobs saturate the machine in terms of memory utilization, as expected:

slot1@DummyServer   LINUX      X86_64 Unclaimed Idle      0.000  510  0+00:01:41
slot1_1@DummyServer LINUX      X86_64 Claimed   Busy      0.000 1600  0+00:00:02
..... (in total 60 claimed slots)
slot1_60@DummyServerLINUX      X86_64 Claimed   Busy      0.000 1600  0+00:00:02

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

  X86_64/LINUX    61     0      60         1       0          0        0      0

         Total    61     0      60         1       0          0        0      0

However, after a few rounds there is an enormous number of jobs in the queue, yet cluster utilization is low. And yes, it is still a puzzle to me :-(

Name                             OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

slot1@DummyServer   LINUX      X86_64 Unclaimed Idle      0.000 71510  0+00:10:34
slot1_2@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:06
slot1_3@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:15
slot1_5@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:31
slot1_6@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:12
slot1_8@DummyServer LINUX      X86_64 Claimed   Busy      0.000  1000  0+00:02:32
slot1_9@DummyServer LINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:20
slot1_11@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:00:09
slot1_12@DummyServerLINUX      X86_64 Claimed   Busy      0.010  1000  0+00:00:09
slot1_19@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:03:01
slot1_31@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:03:14
slot1_54@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:02:28
slot1_57@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:02:24
slot1_60@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:02:16
slot1_61@DummyServerLINUX      X86_64 Claimed   Busy      0.000  1000  0+00:02:15

               Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

  X86_64/LINUX    15     0      14         1       0          0        0      0

         Total    15     0      14         1       0          0        0      0

Here is the condor_q output:

Total for query: 2739 jobs; 0 completed, 0 removed, 2718 idle, 21 running, 0 held, 0 suspended
Total for all users: 2739 jobs; 0 completed, 0 removed, 2718 idle, 21 running, 0 held, 0 suspended

Best,
Shunxing

-----Original Message-----
    
    My guess in the example below would be that your machine has 8 CPU cores - "condor_status DummyServer -autoformat TotalCpus" would tell you. By default it uses CPU cores, memory, and scratch disk space to determine if there's room for another job.
    
    If your machine does advertise more than eight cores, then that's definitely a puzzle - you'd want to look at the "Disk" machine attribute to see if that might be imposing a constraint; however, the default disk request is fairly low and disk drives these days are fairly big. Clearly there's enough memory for more jobs, assuming there are disk and CPUs available.
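
    For example, something along these lines (using the DummyServer name from above) would show the advertised totals and the per-slot Disk values:

    condor_status DummyServer -autoformat TotalCpus TotalMemory TotalDisk
    condor_status DummyServer -autoformat Name Cpus Memory Disk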
    
    The claim worklife can be left at the default value rather than being set to zero - an existing slot for which the claim has not yet expired and which matches another job from the same submitter will be matched and dispatched without negotiator overhead, which improves efficiency a bit.
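
    If it was set to zero for testing, removing that override restores the default - for example, in the execute node's configuration (1200 seconds is, I believe, the stock default):

    # reuse claims for up to 20 minutes before going back to the negotiator
    CLAIM_WORKLIFE = 1200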
    
    
    Michael V. Pelletier
    Information Technology
    Digital Transformation & Innovation
    Integrated Defense Systems
    Raytheon Company