
Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage



Hello,

Going back to this topic: as Christoph commented, the GpusAverageUsage value was undefined for many GPU jobs for unknown reasons, and it was not just for short jobs. I do not know if the following explanation makes any sense, but this is what we found...

We started to suspect that the 8 GPUs of this machine were not always responding when the condor_gpu_utilization script continuously queries their usage in WaitForExit mode. We switched the condor_gpu_utilization script to run in Periodic mode with a timeout of 20 seconds, and it worked better for a while. Although some jobs still had an undefined GpusAverageUsage, most of them reported the usage correctly. After several days, we moved back to WaitForExit mode to check it again, and we observed that GpusAverageUsage was correct again and not undefined...
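
For context, this is roughly the kind of startd cron configuration involved. It is only a sketch, not our exact setup: the GPUs_MONITOR job name is the one mentioned below, but the executable path and the period value are illustrative.

    # Sketch only: run the GPU monitor as a Periodic startd cron job
    # instead of WaitForExit (the period value is just an example).
    STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)/condor_gpu_utilization
    STARTD_CRON_GPUs_MONITOR_MODE = Periodic
    STARTD_CRON_GPUs_MONITOR_PERIOD = 120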

What it seems is that after a "systemctl reload condor" on the WN, the GpusAverageUsage value becomes undefined for most of the jobs, and a restart that recreates the CronJob is needed to obtain GpusAverageUsage values again. The same effect is achieved by changing the monitor mode from WaitForExit to Periodic (or anything else) and doing a reload: the old GPUs_MONITOR CronJob is removed, a new one is created, and then GpusAverageUsage is no longer undefined. Does this make any sense?
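
To illustrate the difference between the two actions (the reload action of the condor systemd unit normally only triggers a reconfiguration, while a restart recreates the startd cron jobs; the slot name below is a placeholder):

    # Reload: re-reads the configuration only; after this,
    # GpusAverageUsage came back undefined for most jobs.
    systemctl reload condor

    # Restart: tears down and recreates the GPUs_MONITOR CronJob,
    # after which the values were reported correctly again.
    systemctl restart condor

    # Check what a slot is currently advertising:
    condor_status -long slot1_1@wn01.example.org | grep -i GpusAverageUsage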

Right now, the machine is reporting GpusAverageUsage for most of the jobs, and our expression that puts on hold the jobs that have not been using the GPU for the last 4 hours is working fine.
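
We do not paste our literal expression here; a simplified, hypothetical schedd-side sketch of a policy along those lines (the attribute names and the 0.01 threshold are illustrative) would be:

    # Hypothetical sketch: hold GPU jobs that have run for more than
    # 4 hours while reporting (almost) no average GPU usage.
    SYSTEM_PERIODIC_HOLD = (RequestGpus > 0) && \
        ((time() - JobCurrentStartDate) > 4 * 3600) && \
        (GpusAverageUsage =!= UNDEFINED) && (GpusAverageUsage < 0.01)
    SYSTEM_PERIODIC_HOLD_REASON = "Job has not been using its GPU for the last 4 hours"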

Thank you very much and excuse me if I have not explained myself well.

Cheers,

Carles

On Thu, 19 May 2022 at 07:17, Carles Acosta <cacosta@xxxxxx> wrote:
Hi Todd M, Todd T,

Thank you for your responses. You helped me understand how the GPU monitoring works in HTCondor.

I am checking condor_history both on the schedd side and in the startd_history, and the GpusAverageUsage is reported for some jobs but not for all; as Todd M commented, maybe this is because those jobs are not long enough. I'll continue to investigate.

Cheers,

Carles

On Wed, 18 May 2022 at 22:24, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 5/18/2022 1:27 PM, Todd L Miller wrote:
So, at least, I understand that we can play with DeviceGpusAverageUsage to
check if the utilization is 0, but I do not understand the connection
between DeviceGpusAverageUsage and GpusAverageUsage or why the
GpusAverageUsage is undefined while the DeviceGpusAverageUsage is not.

If I recall correctly --

    The GPU monitor can only monitor the utilization of a given GPU; it knows nothing about which jobs are using which device. It reports the "Device*" values for each GPU to the specific slot assigned that GPU. "GPUsAverageUsage" is a per-_job_ attribute, derived from the "Device*" values, and is set in the _job_ by the startd. Those job-ad attributes are mirrored into the slot ad by STARTD_JOB_ATTRS.
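
    For illustration, the two families of attributes can be inspected like this (the slot name and job id are placeholders):

        # Per-device values the GPU monitor publishes into the slot ad:
        condor_status -long slot1_1@wn01.example.org | grep -i DeviceGpus

        # Per-job value the startd writes into the job ad:
        condor_q -long 1234.0 | grep -i GpusAverageUsage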

    Additionally, none of this works for sufficiently-short jobs, although since you're talking about checking four hours in, that shouldn't be a problem.

    I haven't tested this recently, but last time I did, average GPU utilization and peak GPU memory usage were certainly being recorded in the job log (where the other usage is reported), and I believe in the job ad as well. AFAIK, there's no reason why the whole job ad wouldn't be written to the history file.


Hi folks,

Realize there are _two_ history files involved here - 1) the job history (on the submit/access point machine) which contains all the job classads that left the job queue, and 2) the startd history (that lives on the execute machine) which contains all the job classads that ran on the execute machine.
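
For illustration, both can be queried with condor_history; the startd history lives wherever STARTD_HISTORY points on the execute node (the path and the job id below are placeholders):

    # 1) Job history on the submit/access point:
    condor_history 1234.0 -af GpusAverageUsage

    # 2) Startd history on the execute machine (check the real path first):
    condor_config_val STARTD_HISTORY
    condor_history -file /var/lib/condor/spool/startd_history 1234.0 -af GpusAverageUsage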

I think ToddM above was talking about the job history (item 1 above), and Christoph in his email was looking at startd history (item 2 above).

@ToddM: do you expect the GPU usage attributes to appear in the startd history as well as the job history?

regards,
Todd T


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es