
Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage



Hi Todd M, Todd T,

Thank you for your responses. They helped me understand how GPU monitoring works in HTCondor.

I am checking condor_history on the schedd side as well as the startd history. GpusAverageUsage is reported for some jobs but not for all; as Todd M commented, this may be because those jobs are not long enough. I'll continue to investigate.
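
A minimal sketch of that check, assuming the history ads have already been exported into plain Python dicts (for example from condor_history output). The helper name split_by_gpu_usage and the sample ads are hypothetical; RemoteWallClockTime is used as the measure of job length. The idea is simply to list wall-clock time next to each job, so it is easy to see whether the jobs without the attribute are the short ones.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def split_by_gpu_usage(
    ads: Iterable[Mapping[str, Any]],
) -> Tuple[List[tuple], List[tuple]]:
    """Split job ads into those that report GPUsAverageUsage and those that do not.

    Hypothetical helper: `ads` are plain dicts, not live ClassAd objects.
    """
    reported, missing = [], []
    for ad in ads:
        # ClassAd attribute names are case-insensitive, so normalise the keys.
        norm = {str(key).lower(): value for key, value in ad.items()}
        job_id = f"{norm.get('clusterid')}.{norm.get('procid')}"
        wall = norm.get("remotewallclocktime")
        usage = norm.get("gpusaverageusage")
        if usage is None:
            missing.append((job_id, wall))
        else:
            reported.append((job_id, wall, usage))
    return reported, missing


if __name__ == "__main__":
    # Made-up sample ads, for illustration only.
    sample = [
        {"ClusterId": 101, "ProcId": 0, "RemoteWallClockTime": 14400.0,
         "GPUsAverageUsage": 0.82},
        {"ClusterId": 102, "ProcId": 0, "RemoteWallClockTime": 45.0},
    ]
    reported, missing = split_by_gpu_usage(sample)
    print("reported:", reported)
    print("missing: ", missing)
```

If the jobs in the "missing" list all have small wall-clock times, that would be consistent with the short-job explanation; if long jobs show up there too, something else is going on.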

Cheers,

Carles

On Wed, 18 May 2022 at 22:24, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 5/18/2022 1:27 PM, Todd L Miller wrote:
So, at least, I understand that we can play with DeviceGpusAverageUsage to
check if the utilization is 0, but I do not understand the connection
between DeviceGpusAverageUsage and GpusAverageUsage or why the
GpusAverageUsage is undefined while the DeviceGpusAverageUsage is not.

If I recall correctly --

The GPU monitor can only monitor the utilization of a given GPU; it knows nothing about which jobs are using which device. It reports the "Device*" values for each GPU to the specific slot assigned that GPU. "GPUsAverageUsage" is a per-_job_ attribute, derived from the "Device*" values, and is set in the _job_ by the startd. Those job-ad attributes are mirrored into the slot ad by STARTD_JOB_ATTRS.

Additionally, none of this works for sufficiently-short jobs, although since you're talking about checking four hours in, that shouldn't be a problem.

I haven't tested this recently, but last time I did, average GPU utilization and peak GPU memory usage were certainly being recorded in the job log (where the other usage is reported), and I believe in the job ad as well. AFAIK, there's no reason why the whole job ad wouldn't be written to the history file.


Hi folks,

Realize there are _two_ history files involved here - 1) the job history (on the submit/access point machine) which contains all the job classads that left the job queue, and 2) the startd history (that lives on the execute machine) which contains all the job classads that ran on the execute machine.

I think ToddM above was talking about the job history (item 1 above), and Christoph in his email was looking at startd history (item 2 above).

@ToddM: do you expect the GPU usage attributes to appear in the startd history as well as in the job history?

regards,
Todd T




--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es