
Re: [HTCondor-users] CPUsUsage versus computed values



Hi Todd,

Thanks for the reality check, now I feel a bit embarrassed - no idea where I read the one minute average thing. -_-;

As for the discrepancy between CpusUsage and expression [2]:
I'm only looking at our grid queue, so no suspensions, no scheduler universe, or other obvious "exotic" situations. All execution points are practically the same and running on HTCondor 10, but the entry points still are HTCondor 9 (due to them being CEs with GSI auth).

Some things I've checked:
- It occurs for about 25%-30% of jobs.
- NumJobMatches and NumShadowStarts are 1 so no mismatch between cumulative and recent execution.
- CommittedTime is usually way more than an hour, so this shouldn't be an initialisation issue.
- No relation to submitter, CPU or memory request. No relation to few, specific machines. No relation to failed/completed jobs.
- All computations (C and ClassAd) are 64 bit ints and doubles holding real time ranges, so overflow or precision should not be an issue.
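
A minimal sketch of how the checks above could be scripted, assuming the job ads have already been exported from condor_history as plain Python dicts and paired with a separately computed usage value. The helper names, the one-hour threshold, and the 20% relative tolerance are illustrative assumptions, not part of HTCondor.

```python
def is_clean_single_run(ad, min_committed=3600):
    """True if the job was matched and started exactly once and ran long enough.

    min_committed (seconds) is an assumed cut-off to rule out start-up effects.
    """
    return (
        ad.get("NumJobMatches") == 1
        and ad.get("NumShadowStarts") == 1
        and ad.get("CommittedTime", 0) > min_committed
    )


def discrepant_fraction(jobs, rel_tol=0.2):
    """Share of clean jobs whose CpusUsage deviates from the computed value.

    jobs is an iterable of (ad, computed) pairs, where computed is the
    usage derived from the cumulative CPU and wall-clock attributes.
    Returns None if no job passes the filter.
    """
    clean = [(ad, c) for ad, c in jobs if is_clean_single_run(ad) and c]
    if not clean:
        return None
    bad = sum(1 for ad, c in clean if abs(ad["CpusUsage"] - c) > rel_tol * c)
    return bad / len(clean)
```

Grouping the flagged jobs by submitter, machine, or exit status would then be a matter of counting over the same filtered list.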

Let me know if you have any other idea what I could double-check.

Cheers,
Max

On 6. Mar 2024, at 20:19, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:


Hi Max,

I am not positive, but my recollection is the CpusUsage attribute in the job ad is averaged over the lifetime of the job.  This is also what the documentation for this job attribute states at
  https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html

Some quick observations / questions:

Re expression [1] below: this expression does not take suspension into account, in case your site's job policy allows for job suspension.

Re the discrepancies between expression [2] and CpusUsage from condor_history, this is very interesting. Question: did you limit the jobs you inspected to vanilla jobs that completed normally, or did your analysis include jobs that were removed, scheduler universe jobs, etc.? Is there any notable correlation that separates the jobs in history with sensible values from those with bogus values, such as all the bogus jobs being container universe or something similar?

Thank you for sharing, and hope this helps,
best regards,
Todd

On 3/6/2024 3:04 AM, Fischer, Max (SCC) wrote:
Hi all,

We are currently putting an extra close eye on CPU usage and I'm a bit confused by the options available (let's not delve into what is "the" CPU usage). I'm using both the inbuilt CPUsUsage (via ProcFamily CgroupV1) and computed expressions for running [1] and completed [2] jobs. Of course they don't quite agree so I'm interested if I'm doing it right and if anyone has better suggestions.

For running jobs the CPUsUsage is consistently higher than the computed value but rather close (e.g. 9.02 vs 8.7).
- The docs for CPUsUsage say it's the one-minute CPU usage. However, if I'm reading the code [3] right only the total cpu metrics for the entire job cgroup are collected and used. So is CPUsUsage over a specific time range or the entire lifetime of the job?
- Is my expression basically replicating what CPUsUsage is doing and just limited by timing resolution?

For completed jobs the CPUsUsage is sometimes sensible (e.g. 7.86 vs 7.78) but oftentimes completely bogus (e.g. 0.11 vs 7.22).
- Is the CPUsUsage actually meaningful in the history?
- Can we somehow record the peak or average CPUsUsage in history?

Cheers,
Max

[1] expression for condor_q -run
'(RemoteSysCPU + RemoteUserCpu) / (ServerTime - JobCurrentStartDate)'

[2] expression for condor_history
'(CumulativeRemoteSysCpu + CumulativeRemoteUserCpu) / (RemoteWallClockTime - CumulativeSuspensionTime)'

[3] ProcFamilyDirectCgroupV1::get_usage
https://github.com/htcondor/htcondor/blob/66aadf0278a07ee219eaa184068403c7dee1db4d/src/condor_utils/proc_family_direct_cgroup_v1.cpp#L292-L338
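
For reference, expressions [1] and [2] can be written out as a small Python sketch. It assumes the job ad is available as a plain dict keyed by the attribute names used above; the function names are hypothetical and the zero-denominator guards are an addition that the ClassAd expressions themselves do not have.

```python
def running_cpus(ad):
    """Expression [1]: CPU seconds of the current run per elapsed second."""
    elapsed = ad["ServerTime"] - ad["JobCurrentStartDate"]
    if elapsed <= 0:
        return None  # job just started; no meaningful average yet
    return (ad["RemoteSysCpu"] + ad["RemoteUserCpu"]) / elapsed


def completed_cpus(ad):
    """Expression [2]: cumulative CPU seconds per non-suspended wall-clock second."""
    active = ad["RemoteWallClockTime"] - ad.get("CumulativeSuspensionTime", 0)
    if active <= 0:
        return None  # no recorded run time
    return (ad["CumulativeRemoteSysCpu"] + ad["CumulativeRemoteUserCpu"]) / active
```

Both functions return an average number of busy cores over the respective interval, which is the quantity being compared against CPUsUsage in this thread.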



-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685 
