
Re: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage



Hi Christoph,

Thank you very much for your response :)

We have GPU monitoring enabled.

[root@gpu01 ~]# condor_config_val GPU_MONITOR
/usr/libexec/condor/condor_gpu_utilization

Looking at the startd's condor_history, many of the executed jobs report an undefined GpusMemoryUsage even though GpusProvisioned is not 0. In any case, DeviceGpusAverageUsage is always undefined in the startd_history.

But, with condor_status queries, we can see the DeviceGpusAverageUsage:

[root@gpu01 ~]# condor_status gpu01 -const 'Gpus >0' -af Name Gpus GpusAverageUsage GpusMemoryUsage DeviceGpusAverageUsage DeviceGpusMemoryPeakUsage
slot2_1@xxxxxxxxxxxx 1 undefined undefined 5.224270059614254E-06 2957
slot2_2@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_3@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_4@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_5@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_6@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_7@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_9@xxxxxxxxxxxx 1 undefined undefined 1.831859628341712E-06 2160
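Output like the above can be filtered with a few lines of Python. This is only a sketch: the sample rows are copied from the condor_status output above (with the redacted hostnames), the field order is assumed to match the -af arguments used there, and the helper names parse_value and idle_gpu_slots are made up for illustration. Since DeviceGpusAverageUsage is averaged since startd startup, a value of 0.0 here only says the device has not been used at all in that period.

```python
# Rows copied from the condor_status -af output above (hostnames are redacted).
# Column order: Name Gpus GpusAverageUsage GpusMemoryUsage
#               DeviceGpusAverageUsage DeviceGpusMemoryPeakUsage
STATUS_SAMPLE = """\
slot2_1@xxxxxxxxxxxx 1 undefined undefined 5.224270059614254E-06 2957
slot2_2@xxxxxxxxxxxx 1 undefined undefined 0.0 3
slot2_9@xxxxxxxxxxxx 1 undefined undefined 1.831859628341712E-06 2160
"""


def parse_value(token):
    """Convert an autoformat token to float, or None for 'undefined'."""
    return None if token == "undefined" else float(token)


def idle_gpu_slots(text):
    """Return the slot names whose DeviceGpusAverageUsage is exactly 0.0."""
    idle = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip anything that is not a full six-column row
        name = fields[0]
        device_avg = parse_value(fields[4])
        # An undefined value (None) is never treated as idle.
        if device_avg == 0.0:
            idle.append(name)
    return idle


print(idle_gpu_slots(STATUS_SAMPLE))  # prints ['slot2_2@xxxxxxxxxxxx']
```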

So at least we can use DeviceGpusAverageUsage to check whether the utilization is 0, but I do not understand the relationship between DeviceGpusAverageUsage and GpusAverageUsage, or why GpusAverageUsage is undefined while DeviceGpusAverageUsage is not.

Cheers,

Carles


On Wed, 18 May 2022 at 10:18, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi Carles,

I have been struggling with more or less the same issue for quite a while: the GPU monitoring seems unreliable, or at least the results do not end up in the job history. Here is what I know about it ;)

On the GPU node you need GPU_MONITOR to be set; it should look something like this:

[root@batchg004 ~]# condor_config_val GPU_MONITOR
/usr/libexec/condor/condor_gpu_utilization

You can start it on the command line and check the output:

[root@batchg004 ~]# /usr/libexec/condor/condor_gpu_utilization
SlotMergeConstraint = StringListMember("CUDA0", AssignedGPUs) || StringListMember("GPU-6e5f40be", AssignedGPUs) || StringListMember("GPU-6e5f40be-cd37-ca8a-bdbb-3e03e8f44f34", AssignedGPUs)
UptimeGPUsSeconds = 9.968521
UptimeGPUsMemoryPeakUsage = 10861
- GPUsSlot0
SlotMergeConstraint = StringListMember("CUDA0", AssignedGPUs) || StringListMember("GPU-6e5f40be", AssignedGPUs) || StringListMember("GPU-6e5f40be-cd37-ca8a-bdbb-3e03e8f44f34", AssignedGPUs)
UptimeGPUsSeconds = 9.795428
UptimeGPUsMemoryPeakUsage = 10861
- GPUsSlot0
<snip>

Checking the jobs on the GPU node reveals:

[root@batchg004 ~]# condor_history -file /var/log/condor/startd_history -af:l GPUsMemoryUsage GPUsAverageUsage
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = 2690.0 GPUsAverageUsage = 0.1477277118427709
GPUsMemoryUsage = 10776.0 GPUsAverageUsage = 0.9403373569880673
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = 10776.0 GPUsAverageUsage = 0.8581894960995667
GPUsMemoryUsage = 7668.0 GPUsAverageUsage = 0.2796400375929942
GPUsMemoryUsage = 3.0 GPUsAverageUsage = 0.0
GPUsMemoryUsage = 3.0 GPUsAverageUsage = 0.0
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
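To put a number on how often the attributes are missing, the -af:l output can be tallied with a short script. This is a sketch under assumptions: the three sample lines are copied from the output above, the "Name = value" layout is assumed to hold for every line, and parse_record and summarize are hypothetical helper names, not part of HTCondor.

```python
import re

# Three representative lines copied from the condor_history output above.
HISTORY_SAMPLE = """\
GPUsMemoryUsage = undefined GPUsAverageUsage = undefined
GPUsMemoryUsage = 2690.0 GPUsAverageUsage = 0.1477277118427709
GPUsMemoryUsage = 3.0 GPUsAverageUsage = 0.0
"""

# Matches each "Attribute = value" pair on a line of -af:l output.
PAIR = re.compile(r"(\w+) = (\S+)")


def parse_record(line):
    """Turn one output line into a dict; 'undefined' becomes None."""
    return {
        name: (None if value == "undefined" else float(value))
        for name, value in PAIR.findall(line)
    }


def summarize(text):
    """Return (total jobs, jobs reporting GPUsAverageUsage, jobs reporting 0.0)."""
    records = [parse_record(line) for line in text.splitlines() if line.strip()]
    reported = [r for r in records if r.get("GPUsAverageUsage") is not None]
    idle = [r for r in reported if r["GPUsAverageUsage"] == 0.0]
    return len(records), len(reported), len(idle)


print(summarize(HISTORY_SAMPLE))  # prints (3, 2, 1)
```

Keeping "undefined" separate from 0.0 matters here: a job with no measurement is not the same as a job that left the GPU idle.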

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Carles Acosta" <cacosta@xxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 18 May 2022 07:39:57
Subject: [HTCondor-users] DeviceGpusAverageUsage and GpusAverageUsage

Dear all,
In our HTCondor cluster running 9.0.12 we have a few machines with GPUs.

We would like to be sure that users requesting GPUs are really using them. For that reason, we are interested in creating an expression that says something like: if the GPU average usage is still 0.0 after 4 hours, the job is put on hold.

Our first question is where we can get the GPU average usage. There is DeviceGpusAverageUsage, and the documentation says that it counts the GPU usage of the slot against the time since the startd started up. However, there is also GpusAverageUsage, which is undefined most of the time, although we have seen it defined in some cases, with values slightly different from DeviceGpusAverageUsage. What is the difference between DeviceGpusAverageUsage and GpusAverageUsage?
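To make the intended rule concrete, here is the decision logic written out in Python rather than as a ClassAd expression. It is a sketch only: should_hold and its parameters are hypothetical names, which attribute should supply average_usage is exactly the open question, and treating an undefined value as "do not hold" is an assumption chosen so that jobs with no measurement are not penalized.

```python
FOUR_HOURS = 4 * 3600  # threshold from the description above, in seconds


def should_hold(average_usage, runtime_seconds, min_runtime=FOUR_HOURS):
    """Decide whether a GPU job looks idle enough to be held.

    average_usage   -- GPU average usage as a float, or None if undefined
    runtime_seconds -- how long the job has been running
    """
    if average_usage is None:
        # Assumption: no measurement means no evidence of an idle GPU.
        return False
    return runtime_seconds >= min_runtime and average_usage == 0.0


print(should_hold(0.0, 5 * 3600))   # prints True
print(should_hold(0.0, 1 * 3600))   # prints False
print(should_hold(None, 5 * 3600))  # prints False
```

A real policy would express the same three conditions (usage defined, usage equal to 0.0, runtime past the threshold) in the configuration, using whichever attribute turns out to be reliable.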

Thank you in advance.

Best regards,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
Avís - Aviso - Legal Notice: http://legal.ifae.es

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es