Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage
- Date: Wed, 18 May 2022 15:23:02 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage
On 5/18/2022 1:27 PM, Todd L Miller wrote:
So, at least, I understand that we can
play with DeviceGpusAverageUsage to
check if the utilization is 0, but I do not understand the difference
between DeviceGpusAverageUsage and GpusAverageUsage, or why
GpusAverageUsage is undefined while DeviceGpusAverageUsage is defined.
If I recall correctly --
The GPU monitor can only monitor the utilization of a given
GPU; it knows nothing about which jobs are using which device. It
reports the "Device*" values for each GPU to the specific slot
assigned that GPU. "GPUsAverageUsage" is a per-_job_ attribute,
derived from the "Device*" values, and is set in the _job_ by the
startd. Those job-ad attributes are mirrored into the slot ad by
Additionally, none of this works for sufficiently-short jobs,
although since you're talking about checking four hours in, that
shouldn't be a problem.
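
As a rough illustration of the distinction described above, here is a minimal Python sketch of a "GPU looks idle" check that prefers the per-job value and falls back to the per-device value when the per-job one is undefined. The function names, the plain-dict representation of a slot ad, and the four-hour default are assumptions made for illustration; this is not HTCondor code or an HTCondor API.

```python
from typing import Any, Mapping, Optional

FOUR_HOURS = 4 * 60 * 60  # seconds; the threshold mentioned in the thread


def _lookup(ad: Mapping[str, Any], name: str) -> Optional[Any]:
    """ClassAd attribute names are case-insensitive, so compare lower-cased keys."""
    wanted = name.lower()
    for key, value in ad.items():
        if key.lower() == wanted:
            return value
    return None  # stands in for an undefined attribute


def gpu_looks_idle(ad: Mapping[str, Any],
                   runtime_seconds: int,
                   min_runtime: int = FOUR_HOURS) -> Optional[bool]:
    """Hypothetical helper: True/False, or None when no decision is possible.

    `ad` is assumed to be a dict holding a slot ad's attributes.
    """
    if runtime_seconds < min_runtime:
        return None  # too early to judge; short jobs may have no data at all

    # Prefer the per-job attribute; fall back to the per-device one.
    usage = _lookup(ad, "GPUsAverageUsage")
    if usage is None:
        usage = _lookup(ad, "DeviceGPUsAverageUsage")
    if usage is None:
        return None  # neither attribute is defined

    return float(usage) == 0.0
```

For example, gpu_looks_idle({"DeviceGPUsAverageUsage": 0.0}, 5 * 3600) returns True, while an ad with neither attribute returns None rather than guessing.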
I haven't tested this recently, but last time I did, average
GPU utilization and peak GPU memory usage were certainly being
recorded in the job log (where the other usage is reported), and I
believe in the job ad as well. AFAIK, there's no reason why the
whole job ad wouldn't be written to the history file.
Realize there are _two_ history files involved here - 1) the job
history (on the submit/access point machine) which contains all the
job classads that left the job queue, and 2) the startd history
(that lives on the execute machine) which contains all the job
classads that ran on the execute machine.
I think ToddM above was talking about the job history (item 1
above), and Christoph in his email was looking at startd history
(item 2 above).
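
To compare the two histories side by side, something like the following Python sketch could build the corresponding condor_history command lines. It assumes that the installed condor_history supports the -startd, -limit and -af options, and that the startd variant is run on the execute machine; the helper name and the attribute list are only illustrative. Nothing is executed here, the commands are only constructed and printed.

```python
import shlex
from typing import List

# Attributes to print; the two GPU names are the ones discussed in this thread.
GPU_ATTRS = ["ClusterId", "ProcId", "GPUsAverageUsage", "DeviceGPUsAverageUsage"]


def history_command(source: str, limit: int = 5) -> List[str]:
    """Hypothetical helper: argv for querying the job or the startd history."""
    if source not in ("job", "startd"):
        raise ValueError("source must be 'job' or 'startd'")

    cmd = ["condor_history"]
    if source == "startd":
        # Assumption: -startd selects the startd history on the execute machine.
        cmd.append("-startd")
    cmd += ["-limit", str(limit), "-af", *GPU_ATTRS]
    return cmd


# 1) job history, to be run on the submit/access point machine
print(shlex.join(history_command("job")))
# 2) startd history, to be run on the execute machine
print(shlex.join(history_command("startd")))
```

If an attribute is missing from one of the two histories, the corresponding column should show up as undefined in that output, which would make the difference between the two files easy to spot.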
@ToddM: do you expect the GPU usage attributes to appear in the
startd history as well as the job history?