
Re: [HTCondor-users] CPUsUsage versus computed values




Hi Max,

I am not positive, but my recollection is the CpusUsage attribute in the job ad is averaged over the lifetime of the job.  This is also what the documentation for this job attribute states at
  https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html

Some quick observations / questions:

Re expression [1] below: if your job policy allows for job suspension, note that this expression does not take suspension into account. That matters only if your site policy actually suspends jobs.
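
To illustrate the suspension point, here is a minimal sketch in Python. The helper name average_cpus and its parameters are hypothetical, not anything HTCondor provides; it only mirrors the arithmetic of the expressions quoted below, with the suspended time optionally subtracted from the wall time.

```python
def average_cpus(cpu_seconds, wall_seconds, suspended_seconds=0.0):
    """Average number of cores used: CPU time divided by active wall time.

    Hypothetical helper for illustration only. suspended_seconds is the
    time the job spent suspended, which should not count as run time.
    """
    active = wall_seconds - suspended_seconds
    if active <= 0:
        # No usable run time yet (job just started, or clock skew).
        return None
    return cpu_seconds / active


# A job that burned 8000 CPU-seconds over 1000 s of wall time,
# 200 s of which it spent suspended:
ignoring_suspension = average_cpus(8000, 1000)
with_suspension = average_cpus(8000, 1000, suspended_seconds=200)
```

Ignoring the suspended time makes the job look less busy than it was while it was actually running, so the computed value comes out too low.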

Re the discrepancies between expression [2] and CpusUsage from condor_history, this is very interesting. Question: did you limit the jobs you inspected to vanilla jobs that completed normally, or did your analysis include jobs that were removed, scheduler universe jobs, etc.? Is there any notable correlation between the history jobs with sensible values vs. bogus ones, such as all the bogus jobs being container universe jobs, or something similar?
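
One way to look for such a correlation is sketched below in Python. It assumes the history ads have already been loaded as plain dictionaries keyed by attribute name (how you export them is up to you). The function names and the 20% tolerance are arbitrary choices for illustration. The attribute names are the ones from expression [2], plus JobUniverse and JobStatus.

```python
from collections import Counter


def computed_cpus(ad):
    """Expression [2]: cumulative CPU time over non-suspended wall time."""
    cpu = ad.get("CumulativeRemoteSysCpu", 0) + ad.get("CumulativeRemoteUserCpu", 0)
    active = ad.get("RemoteWallClockTime", 0) - ad.get("CumulativeSuspensionTime", 0)
    return cpu / active if active > 0 else None


def tally_mismatches(ads, rel_tol=0.2):
    """Count jobs per (JobUniverse, JobStatus, mismatch) combination.

    mismatch is True when the reported CpusUsage differs from the computed
    value by more than rel_tol (an assumed threshold) of the computed value.
    """
    counts = Counter()
    for ad in ads:
        computed = computed_cpus(ad)
        reported = ad.get("CpusUsage")
        if computed is None or reported is None:
            continue  # nothing to compare for this job
        mismatch = abs(reported - computed) > rel_tol * computed
        counts[(ad.get("JobUniverse"), ad.get("JobStatus"), mismatch)] += 1
    return counts
```

If the mismatches cluster in one universe or in jobs with a status other than completed (JobStatus 4), that would narrow things down considerably.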

Thank you for sharing, and hope this helps,
best regards,
Todd

On 3/6/2024 3:04 AM, Fischer, Max (SCC) wrote:
Hi all,

We are currently putting an extra close eye on CPU usage and I'm a bit confused by the options available (let's not delve into what is "the" CPU usage). I'm using both the inbuilt CPUsUsage (via ProcFamily CgroupV1) and computed expressions for running [1] and completed [2] jobs. Of course they don't quite agree so I'm interested if I'm doing it right and if anyone has better suggestions.

For running jobs the CPUsUsage is consistently higher than the computed value but rather close (e.g. 9.02 vs 8.7).
- The docs for CPUsUsage say it's the one-minute CPU usage. However, if I'm reading the code [3] right only the total cpu metrics for the entire job cgroup are collected and used. So is CPUsUsage over a specific time range or the entire lifetime of the job?
- Is my expression basically replicating what CPUsUsage is doing and just limited by timing resolution?

For completed jobs the CPUsUsage is sometimes sensible (e.g. 7.86 vs 7.78) but oftentimes completely bogus (e.g. 0.11 vs 7.22).
- Is the CPUsUsage actually meaningful in the history?
- Can we somehow record the peak or average CPUsUsage in history?

Cheers,
Max

[1] expression for condor_q -run
'(RemoteSysCPU + RemoteUserCpu) / (ServerTime - JobCurrentStartDate)'

[2] expression for condor_history
'(CumulativeRemoteSysCpu + CumulativeRemoteUserCpu) / (RemoteWallClockTime - CumulativeSuspensionTime)'

[3] ProcFamilyDirectCgroupV1::get_usage
https://github.com/htcondor/htcondor/blob/66aadf0278a07ee219eaa184068403c7dee1db4d/src/condor_utils/proc_family_direct_cgroup_v1.cpp#L292-L338


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
Calendar: https://tinyurl.com/yd55mtgd  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685