
Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage



> So, at least, I understand that we can play with DeviceGpusAverageUsage to
> check if the utilization is 0, but I do not understand the connection
> between DeviceGpusAverageUsage and GpusAverageUsage or why the
> GpusAverageUsage is undefined while the DeviceGpusAverageUsage is not.

If I recall correctly --

The GPU monitor can only monitor the utilization of a given GPU; it knows nothing about which jobs are using which device. It writes the "Device*" values for each GPU (e.g., DeviceGpusAverageUsage) into the ad of the slot that was assigned that GPU. "GPUsAverageUsage" is a per-_job_ attribute, derived from the "Device*" values, and is set in the _job_ ad by the startd. Those job-ad attributes are mirrored into the slot ad by STARTD_JOB_ATTRS.
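To make the data flow concrete, here is a toy model in Python. It is not HTCondor code: ads are plain dicts, a missing key stands in for an undefined ClassAd attribute, and the function names (monitor_report, startd_update) are hypothetical. It also assumes the per-job value is simply copied from the slot's Device* value; the real derivation may differ. The point is only to show why a slot ad can carry DeviceGpusAverageUsage while GpusAverageUsage stays undefined there if the attribute is not mirrored back.

```python
# Toy model (not HTCondor code) of how GPU usage attributes move between
# the slot ad and the job ad. A missing dict key plays the role of an
# undefined ClassAd attribute.

def monitor_report(slot_ad, device_average_usage):
    """The GPU monitor only knows per-device utilization, so it writes
    the Device* value into the slot that owns the GPU."""
    slot_ad["DeviceGPUsAverageUsage"] = device_average_usage

def startd_update(slot_ad, job_ad, startd_job_attrs):
    """The startd derives the per-job value from the slot's Device* value
    (copied verbatim here, an assumption of this model), then mirrors the
    job attributes listed in STARTD_JOB_ATTRS back into the slot ad."""
    if "DeviceGPUsAverageUsage" in slot_ad:
        job_ad["GPUsAverageUsage"] = slot_ad["DeviceGPUsAverageUsage"]
    for attr in startd_job_attrs:
        if attr in job_ad:
            slot_ad[attr] = job_ad[attr]

slot, job = {}, {}
monitor_report(slot, 0.0)
startd_update(slot, job, startd_job_attrs=[])
print(slot.get("GPUsAverageUsage"))  # None: not mirrored, so undefined in the slot
startd_update(slot, job, startd_job_attrs=["GPUsAverageUsage"])
print(slot.get("GPUsAverageUsage"))  # 0.0: now mirrored from the job ad
```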

Additionally, none of this works for sufficiently short jobs, although since you're talking about checking four hours in, that shouldn't be a problem.
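The "is the GPU idle four hours in?" check can be sketched as below. This is a hypothetical helper, not an HTCondor API: it takes the slot ad as a dict and the job's runtime in seconds as a plain argument (how you obtain the runtime is left open). It deliberately treats a missing value, as you might see for a very short job, as "not idle" rather than as zero.

```python
# Hypothetical check (not an HTCondor API): flag a slot whose job has run
# for at least four hours while its GPU average usage is exactly zero.
FOUR_HOURS = 4 * 60 * 60  # seconds

def gpu_idle(slot_ad, runtime_seconds, min_runtime=FOUR_HOURS):
    """Return True only when the job is old enough to have been sampled
    and the monitor reported zero utilization. A missing (undefined)
    value is never treated as idle."""
    if runtime_seconds < min_runtime:
        return False
    usage = slot_ad.get("DeviceGPUsAverageUsage")
    return usage is not None and usage == 0

print(gpu_idle({"DeviceGPUsAverageUsage": 0.0}, 5 * 3600))  # True
print(gpu_idle({}, 5 * 3600))                               # False
print(gpu_idle({"DeviceGPUsAverageUsage": 0.0}, 600))       # False
```

The same care applies in a real ClassAd policy expression: comparing an undefined attribute with `==` yields UNDEFINED rather than false, so the expression should decide explicitly what an undefined value means.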

I haven't tested this recently, but last time I did, average GPU utilization and peak GPU memory usage were certainly being recorded in the job log (where the other usage is reported), and I believe in the job ad as well. AFAIK, there's no reason why the whole job ad wouldn't be written to the history file.
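If the job ads are in the history file, one way to pull the GPU figures out is to ask `condor_history` for just those attributes with `-af` (autoformat) and parse the result. The sketch below parses text of that shape, assuming one job per line, whitespace-separated values, and the literal string `undefined` for missing attributes. The attribute name `GPUsMemoryUsage` for peak GPU memory is an assumption; check the job ClassAd attribute reference for your version. The function name `parse_history` is hypothetical.

```python
# Parse output shaped like
#   condor_history -af ClusterId ProcId GPUsAverageUsage GPUsMemoryUsage
# one job per line, with "undefined" for attributes missing from the ad.
# GPUsMemoryUsage is assumed to be the peak GPU memory attribute.

def parse_history(text):
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        cluster, proc, avg, mem = fields
        jobs.append({
            "job": f"{cluster}.{proc}",
            "GPUsAverageUsage": None if avg == "undefined" else float(avg),
            "GPUsMemoryUsage": None if mem == "undefined" else float(mem),
        })
    return jobs

sample = "101 0 0.0 512\n102 0 undefined undefined\n"
for job in parse_history(sample):
    print(job)
```

A job whose values come back as `undefined` is a hint that the attribute never made it into that job's ad, which is the case worth chasing in the job log.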

- ToddM