
Re: [HTCondor-users] CPU accounting: NonCondorLoadAvg



And here's another observation. On some slots - but not all - LoadAvg is exactly 1 larger than CondorLoadAvg.

Showing columns in the following order:
* TotalLoadAvg
* TotalCondorLoadAvg
* LoadAvg
* CondorLoadAvg

I see the following:

$ condor_status -format %17.17s Name -format " %-9.9s" State -format " %-8.8s" Activity -format " %4d" Cpus -format " %6.3f" TotalLoadAvg -format " %6.3f" TotalCondorLoadAvg -format " %6.3f" LoadAvg -format " %6.3f\n" CondorLoadAvg | grep dar3
slot1@xxxxxxxxxxx Owner     Idle       18 13.700  9.770  1.000 0.000
slot1_11@xxxxxxxx Claimed   Busy        1 13.700  9.770  0.650 0.650
slot1_12@xxxxxxxx Claimed   Busy        1 13.700  9.770  0.650 0.650
slot1_13@xxxxxxxx Claimed   Busy        1 13.700  9.770  0.670 0.670
slot1_14@xxxxxxxx Claimed   Busy        1 13.700  9.770  1.630 0.700
slot1_15@xxxxxxxx Claimed   Busy        1 13.700  9.770  1.720 0.720
slot1_16@xxxxxxxx Claimed   Busy        1 13.700  9.770  1.770 0.770
slot1_1@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.710 0.710
slot1_2@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.710 0.710
slot1_3@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.760 0.760
slot1_4@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.710 0.710
slot1_5@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.690 0.690
slot1_6@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.710 0.710
slot1_7@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.660 0.660
slot1_8@xxxxxxxxx Claimed   Busy        1 13.700  9.770  0.670 0.670
$ ssh dar3 uptime
 17:04:52 up 19 days, 22:48,  0 users,  load average: 13.79, 13.82, 14.05
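As a sanity check of those dar3 numbers (hypothetical pipeline, just re-summing the rows pasted above): the per-slot LoadAvg minus CondorLoadAvg differences should add up to TotalLoadAvg - TotalCondorLoadAvg, i.e. 13.700 - 9.770 = 3.930.

```shell
# Columns: LoadAvg CondorLoadAvg, one line per dar3 slot as shown above.
printf '%s\n' \
  '1.000 0.000' '0.650 0.650' '0.650 0.650' '0.670 0.670' \
  '1.630 0.700' '1.720 0.720' '1.770 0.770' '0.710 0.710' \
  '0.710 0.710' '0.760 0.760' '0.710 0.710' '0.690 0.690' \
  '0.710 0.710' '0.660 0.660' '0.670 0.670' |
awk '{ d += $1 - $2 } END { printf "non-Condor load = %.3f\n", d }'
```

which prints "non-Condor load = 3.930", matching 13.700 - 9.770.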

$ condor_status -format %17.17s Name -format " %-9.9s" State -format " %-8.8s" Activity -format " %4d" Cpus -format " %6.3f" TotalLoadAvg -format " %6.3f" TotalCondorLoadAvg -format " %6.3f" LoadAvg -format " %6.3f\n" CondorLoadAvg | grep dar4
slot1@xxxxxxxxxxx Owner     Idle       27  5.300  1.000  1.000 0.000
slot1_1@xxxxxxxxx Claimed   Busy        1  5.300  1.000  0.200 0.200
slot1_2@xxxxxxxxx Claimed   Busy        1  5.300  1.000  0.500 0.200
slot1_4@xxxxxxxxx Claimed   Busy        1  5.300  1.000  1.200 0.200
slot1_7@xxxxxxxxx Claimed   Busy        1  5.300  1.000  1.200 0.200
slot1_8@xxxxxxxxx Claimed   Busy        1  5.300  1.000  1.200 0.200
$ ssh dar4 uptime
 17:04:38 up 19 days, 22:40,  0 users,  load average: 5.33, 5.22, 5.22

It looks like TotalLoadAvg is the sum of LoadAvg across the slots (5.300 = 1.000+0.200+0.500+1.200+1.200+1.200). Note that this includes the 1.000 reported by slot1, which is idle!

Likewise, TotalCondorLoadAvg is the sum of CondorLoadAvg (1.000 = 0.200+0.200+0.200+0.200+0.200).
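Re-summing the dar4 rows the same way (again just a hypothetical check on the pasted output, columns are LoadAvg and CondorLoadAvg):

```shell
# One line per dar4 slot as shown above: LoadAvg CondorLoadAvg.
printf '%s\n' \
  '1.000 0.000' '0.200 0.200' '0.500 0.200' \
  '1.200 0.200' '1.200 0.200' '1.200 0.200' |
awk '{ l += $1; c += $2 }
     END { printf "TotalLoadAvg=%.3f TotalCondorLoadAvg=%.3f\n", l, c }'
```

which prints "TotalLoadAvg=5.300 TotalCondorLoadAvg=1.000", matching the advertised totals.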

I found some code in src/condor_startd.V6/ResMgr.cpp that appears to spread the "owner load" across the slots at up to 1.0 per slot, which I think explains this. That "owner load" is computed as m_attr->load() - m_attr->condor_load().
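Here is my reading of that spreading logic as an awk sketch (this paraphrases what I think ResMgr.cpp does, it is not the actual code): the owner load, i.e. total load minus Condor load, is handed out to slots in chunks of at most 1.0 until it is used up.

```shell
# Sketch of the assumed owner-load spreading, using the dar3 numbers:
# owner load = TotalLoadAvg - TotalCondorLoadAvg = 13.700 - 9.770.
awk 'BEGIN {
  owner = 13.700 - 9.770               # owner load to distribute: 3.930
  for (slot = 1; owner > 0; slot++) {
    share = (owner >= 1.0) ? 1.0 : owner   # at most 1.0 per slot
    owner -= share
    printf "slot %d assigned owner load %.3f\n", slot, share
  }
}'
```

That yields 1.000 on three slots and 0.930 on a fourth, which is exactly the pattern of LoadAvg - CondorLoadAvg in the dar3 listing above (slot1, slot1_14, slot1_15, slot1_16).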

Unfortunately, with I/O-waiting applications, the summed CPU utilisation of the processes is not directly comparable to the /proc/loadavg values, so the difference isn't going to give the "owner" load as far as I can see.
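For anyone following along, the reason the two measures diverge (Linux-specific): the kernel's load average counts tasks in uninterruptible sleep (D state, usually I/O wait) as well as runnable tasks, so it can exceed the summed CPU utilisation of the very same processes. A trivial look at the raw value:

```shell
# /proc/loadavg fields 1-3 are the 1-, 5- and 15-minute load averages;
# these include D-state (I/O-waiting) tasks, not just CPU-busy ones.
awk '{ printf "1-min load average: %s\n", $1 }' /proc/loadavg
```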

Regards,

Brian.