Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage

On 5/18/2022 1:27 PM, Todd L Miller wrote:
So, at least, I understand that we can play with DeviceGpusAverageUsage to
check if the utilization is 0, but I do not understand the connection
between DeviceGpusAverageUsage and GpusAverageUsage or why the
GpusAverageUsage is undefined while the DeviceGpusAverageUsage is not.

If I recall correctly --

    The GPU monitor can only monitor the utilization of a given GPU; it knows nothing about which jobs are using which device.  It reports the "Device*" values for each GPU to the specific slot assigned that GPU. "GPUsAverageUsage" is a per-_job_ attribute, derived from the "Device*" values, and is set in the _job_ by the startd.  Those job-ad attributes are mirrored into the slot ad by STARTD_JOB_ATTRS.
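A minimal sketch of how a check for an idle GPU could use both attributes, assuming the slot ad has already been fetched and is held as a plain Python dict (that representation, the function name `gpu_looks_idle`, and the `threshold` parameter are hypothetical and not part of HTCondor). It prefers the per-job GPUsAverageUsage when that is defined and falls back to the per-device value otherwise, which is one possible policy rather than a recommended one. The lookup ignores case because ClassAd attribute names are case-insensitive.

```python
from typing import Mapping, Optional


def gpu_looks_idle(slot_ad: Mapping[str, object],
                   threshold: float = 0.0) -> Optional[bool]:
    """Return True/False when a usage value is defined, None otherwise.

    slot_ad is assumed to be a plain dict of attribute name -> value;
    a missing key (or a None value) stands in for an undefined attribute.
    """
    # ClassAd attribute names are case-insensitive, so normalize the keys.
    ad = {name.lower(): value for name, value in slot_ad.items()}

    # Prefer the per-job value; fall back to the per-device value.
    for attr in ("GPUsAverageUsage", "DeviceGPUsAverageUsage"):
        value = ad.get(attr.lower())
        if value is not None:
            return float(value) <= threshold

    # Neither attribute is defined, so no decision can be made.
    return None


if __name__ == "__main__":
    # Hypothetical slot ad where only the Device* value is defined.
    print(gpu_looks_idle({"DeviceGpusAverageUsage": 0.0}))
```

Returning None instead of False when both attributes are undefined keeps "no data yet" (for example, a very short job) distinct from "the GPU is busy".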

    Additionally, none of this works for sufficiently-short jobs, although since you're talking about checking four hours in, that shouldn't be a problem.

    I haven't tested this recently, but last time I did, average GPU utilization and peak GPU memory usage were certainly being recorded in the job log (where the other usage is reported), and I believe in the job ad as well.  AFAIK, there's no reason why the whole job ad wouldn't be written to the history file.

Hi folks,

Realize there are _two_ history files involved here - 1) the job history (on the submit/access point machine) which contains all the job classads that left the job queue, and 2) the startd history (that lives on the execute machine) which contains all the job classads that ran on the execute machine.

I think ToddM above was talking about the job history (item 1 above), and Christoph in his email was looking at startd history (item 2 above).

@ToddM: do you expect the GPU usage attributes to appear in the startd history as well as in the job history?

Todd T