Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage
- Date: Wed, 18 May 2022 13:27:51 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage
> So, at least, I understand that we can play with DeviceGpusAverageUsage to
> check if the utilization is 0, but I do not understand the connection
> between DeviceGpusAverageUsage and GpusAverageUsage or why the
> GpusAverageUsage is undefined while the DeviceGpusAverageUsage is not.

If I recall correctly --
The GPU monitor can only monitor the utilization of a given GPU;
it knows nothing about which jobs are using which device. It reports the
"Device*" values for each GPU to the specific slot assigned that GPU.
"GPUsAverageUsage" is a per-_job_ attribute, derived from the "Device*"
values, and is set in the _job_ by the startd. Those job-ad attributes
are mirrored into the slot ad by STARTD_JOB_ATTRS.
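
As a rough illustration of the distinction above, here is a minimal Python sketch of how a script might read the two attributes from a slot ad. It uses a plain dict as a stand-in for the ad, not the HTCondor Python bindings, and the helper names (effective_gpu_usage, looks_idle) are hypothetical. It assumes only what is described above: the job-level value may be undefined while the device-level value is present, and ClassAd attribute names are case-insensitive.

```python
from typing import Mapping, Optional

# Attribute names as discussed in this thread; lookups are case-insensitive.
JOB_ATTR = "GPUsAverageUsage"
DEVICE_ATTR = "DeviceGPUsAverageUsage"


def effective_gpu_usage(slot_ad: Mapping[str, object]) -> Optional[float]:
    """Return the job-level usage if defined, else the device-level one.

    `slot_ad` is a plain dict standing in for a slot ad (an assumption
    made for this sketch). Returns None if neither attribute is numeric.
    """
    lowered = {key.lower(): value for key, value in slot_ad.items()}
    for attr in (JOB_ATTR, DEVICE_ATTR):
        value = lowered.get(attr.lower())
        # Skip undefined or non-numeric values (bool is excluded on purpose).
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
    return None


def looks_idle(slot_ad: Mapping[str, object], threshold: float = 0.0) -> bool:
    """True only when a usage value exists and is at or below `threshold`."""
    usage = effective_gpu_usage(slot_ad)
    return usage is not None and usage <= threshold
```

With this sketch, a slot ad that carries only the device-level value is still usable for an idle check, while an ad with neither value is never reported as idle.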
Additionally, none of this works for sufficiently-short jobs,
although since you're talking about checking four hours in, that shouldn't
be a problem.
I haven't tested this recently, but last time I did,
average GPU utilization and peak GPU memory usage were certainly being
recorded in the job log (where the other usage is reported), and I believe
in the job ad as well. AFAIK, there's no reason why the whole job ad
wouldn't be written to the history file.