
Re: [HTCondor-users] GPU benchmarking

Hi Masaj,


The CUDAComputeUnits figure is reported based on the card or cards installed in the system. There's actually no attribute CUDA0ComputeUnits, since that's expected to be the same across all cards.


Here's what is generated in per-card attributes with the "-extra -dynamic" options:


CUDA0DevicePciBusId = "0000:06:00.0"

CUDA0DeviceUuid = "520c5858-f08d-0e24-83b6-47e072996f2b"

CUDA0DieTempC = 32

CUDA0EccErrorsDoubleBit = 0

CUDA0EccErrorsSingleBit = 0

CUDA0FreeGlobalMemory = 8518

CUDA0PowerUsage_mw = 41538

CUDA0UtilizationPct = 77


You can write expressions to incorporate these values, but they won't have any impact on which card is chosen for the job. The startd simply takes the next unclaimed device in sequence from the AssignedGPUs list.


One way you can tweak that mechanism is to alter the order of the DetectedGPUs list as the inventory is being taken, perhaps with a wrapper around condor_gpu_discovery. If your machine causes condor_gpu_discovery to list all the cards in one cooling region followed by all the cards in another cooling region within the system, you could balance the heating across both cooling regions by changing the order to "CUDA0,CUDA2,CUDA1,CUDA3" so that GPU assignments would alternate between cooling regions, for example.
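
As a rough illustration of the reordering step such a wrapper might perform, here is a minimal Python sketch. It assumes the discovery output contains a line of the form DetectedGPUs="CUDA0, CUDA1, ..." and that each cooling region's cards are listed contiguously; the function names and the region_size parameter are hypothetical, not part of HTCondor.

```python
import re


def interleave(devices, region_size):
    """Reorder devices so consecutive entries alternate between cooling regions.

    Assumes each region's cards appear contiguously, region_size cards per region.
    """
    regions = [devices[i:i + region_size]
               for i in range(0, len(devices), region_size)]
    reordered = []
    for position in range(region_size):
        for region in regions:
            if position < len(region):
                reordered.append(region[position])
    return reordered


def rewrite_detected_gpus(discovery_output, region_size):
    """Rewrite only the DetectedGPUs line; every other line passes through unchanged."""
    # Assumed line format: DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
    pattern = re.compile(r'^(DetectedGPUs\s*=\s*")([^"]*)(")\s*$')
    out_lines = []
    for line in discovery_output.splitlines():
        match = pattern.match(line)
        if match:
            devices = [d.strip() for d in match.group(2).split(",") if d.strip()]
            line = (match.group(1)
                    + ",".join(interleave(devices, region_size))
                    + match.group(3))
        out_lines.append(line)
    return "\n".join(out_lines)


sample = 'DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"\nCUDACapability=7.0'
print(rewrite_detected_gpus(sample, region_size=2))
```

A real wrapper would capture the actual output of condor_gpu_discovery, pass it through a function like this, and print the result for the startd to read; check the output format on your own nodes before relying on the pattern above.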


Michael V Pelletier

Principal Engineer

Raytheon Technologies

Digital Technology

HPC Support Team


From: Martin Sajdl <masaj.xxx@xxxxxxxxx>
Sent: Tuesday, May 25, 2021 8:15 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael Pelletier <michael.v.pelletier@xxxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] GPU benchmarking


Thank you Michael!
The formula below looks like a good idea. I have one additional question. Is it okay to use ClassAd attributes in the form TARGET.CUDAComputeUnits when the real slot ClassAd looks like CUDA0ComputeUnits or CUDA1ComputeUnits? Is Condor automatically able to translate to the correct value using AssignedGPUs?


On 5/20/2021 10:14 PM, Michael Pelletier via HTCondor-users wrote:

For my GPU jobs, I set up a ranking based on the number of compute units, times the number of cores per CU. You might also add the global memory. I do like the idea of factoring in the CUDA capability level as well, if your cluster has more than one type of card in it.


So for example, in a submit description:


rank = TARGET.CUDAComputeUnits * TARGET.CUDACoresPerCU + CUDAFreeGlobalMemory
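
To see how the terms of that expression weigh against each other, the arithmetic can be mirrored in a few lines of Python. This is only a sketch: the slot names and attribute values below are invented for illustration and do not come from any real pool.

```python
def gpu_rank(slot):
    """Mirror the submit-file rank expression using plain Python arithmetic."""
    return (slot["CUDAComputeUnits"] * slot["CUDACoresPerCU"]
            + slot["CUDAFreeGlobalMemory"])


# Hypothetical slot ads; names and numbers are made up for this example.
slots = {
    "slot1@node-a": {"CUDAComputeUnits": 80, "CUDACoresPerCU": 64,
                     "CUDAFreeGlobalMemory": 16000},
    "slot1@node-b": {"CUDAComputeUnits": 28, "CUDACoresPerCU": 128,
                     "CUDAFreeGlobalMemory": 11000},
}

# Highest rank first, the way a job would prefer machines.
ranked = sorted(slots, key=lambda name: gpu_rank(slots[name]), reverse=True)
for name in ranked:
    print(name, gpu_rank(slots[name]))
```

With these made-up values the memory term is larger than the core-count term, so it may be worth scaling one of them if you want core count to dominate the ranking.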


Michael V Pelletier

Principal Engineer

Raytheon Technologies

Digital Technology

HPC Support Team


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd Tannenbaum
Sent: Thursday, May 20, 2021 12:49 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Martin Sajdl <masaj.xxx@xxxxxxxxx>
Subject: [External] Re: [HTCondor-users] GPU benchmarking


On 5/20/2021 8:56 AM, Martin Sajdl wrote:


We have a cluster of nodes with GPUs, and we need to set a benchmark number for each GPU slot so that we can control job ranking correctly, that is, start a job on the most powerful GPU available.
Does anyone use or know of a GPU benchmark tool? Ideally multi-platform (Linux, Windows)...

Hi Martin,

Just a quick thought:

While it is not strictly a benchmark, perhaps a decent proxy would be to use the CUDACapability attribute that is likely already present in each slot with a GPU (assuming they are NVIDIA GPUs, that is).

You could enter the following condor_status command to see if you feel that CUDACapability makes intuitive sense as a performance metric on your pool:

    condor_status -cons 'gpus>0' -sort CUDACapability -af name CudaCapability CudaDevicename

Hope the above helps
