
Re: [HTCondor-users] GPU benchmarking



Hi Michael,

thank you again! To be honest, our nodes are configured so that there are as many slots as there are GPUs plugged in - each slot has just one GPU. So I think the tweak you mentioned is not needed there.
But I wanted to make sure that I can use values like TARGET.CUDAComputeUnits in a job rank and that they will be correctly translated to e.g. CUDA1ComputeUnits on a slot where AssignedGPUs = "CUDA1".
My ClassAds for an example slot are below; each slot has just one GPU assigned, but carries CUDA* ClassAds for both GPUs plugged into the node.

AssignedGPUs = "CUDA1"
CUDA0Capability = 7.5
CUDA0ClockMhz = 1695.0
CUDA0ComputeUnits = 34
CUDA0CoresPerCU = 64
CUDA0DeviceName = "GeForce RTX 2060 SUPER"
CUDA0DevicePciBusId = "0000:01:00.0"
CUDA0DeviceUuid = "5ffaf895-e943-8da2-23f4-d751418ba217"
CUDA0DriverVersion = 11.2
CUDA0ECCEnabled = false
CUDA0GlobalMemoryMb = 8192
CUDA0OpenCLVersion = 1.2
CUDA0RuntimeVersion = 10.2
CUDA1Capability = 7.5
CUDA1ClockMhz = 1695.0
CUDA1ComputeUnits = 34
CUDA1CoresPerCU = 64
CUDA1DeviceName = "GeForce RTX 2060 SUPER"
CUDA1DevicePciBusId = "0000:02:00.0"
CUDA1DeviceUuid = "d777aeb6-a721-c756-7075-9f19a3a54c2a"
CUDA1DriverVersion = 11.2
CUDA1ECCEnabled = false
CUDA1GlobalMemoryMb = 8192
CUDA1OpenCLVersion = 1.2
CUDA1RuntimeVersion = 10.2
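
For reference, a quick way to see how these per-card values look on each slot (just a sketch using condor_status and the attribute names from the listing above) would be something like:

    # show each GPU slot, its assigned GPU, and both per-card compute-unit values
    condor_status -constraint 'Gpus > 0' -af Name AssignedGPUs CUDA0ComputeUnits CUDA1ComputeUnits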

Masaj


On 5/25/2021 3:47 PM, Michael Pelletier via HTCondor-users wrote:

Hi Masaj,

The CUDAComputeUnits figure is reported based on the card or cards installed in the system. There's actually no attribute CUDA0ComputeUnits, since that's expected to be the same across all cards.

Here's what is generated in per-card attributes with the "-extra -dynamic" options:

CUDA0DevicePciBusId = "0000:06:00.0"
CUDA0DeviceUuid = "520c5858-f08d-0e24-83b6-47e072996f2b"
CUDA0DieTempC = 32
CUDA0EccErrorsDoubleBit = 0
CUDA0EccErrorsSingleBit = 0
CUDA0FreeGlobalMemory = 8518
CUDA0PowerUsage_mw = 41538
CUDA0UtilizationPct = 77

You can write expressions to incorporate these values, but it won't have any impact on which card is chosen for the job. The startd simply takes the next unclaimed device in sequence from the AssignedGPUs list.

One way you can tweak that mechanism is to alter the order of the DetectedGPUs list as the inventory is being taken, perhaps with a wrapper around condor_gpu_discovery. If your machine causes condor_gpu_discovery to list all the cards in one cooling region followed by all the cards in another cooling region within the system, you could balance the heating across both cooling regions by changing the order to "CUDA0,CUDA2,CUDA1,CUDA3" so that GPU assignments would alternate between cooling regions, for example.
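
For illustration only, a minimal sketch of such a wrapper (the install path, the four-GPU layout, and the exact DetectedGPUs output format are assumptions here) could look like:

    #!/bin/sh
    # Hypothetical wrapper: run the real condor_gpu_discovery, then rewrite the
    # DetectedGPUs line so assignments alternate between the two cooling regions.
    # Assumes CUDA0/CUDA1 sit in one region and CUDA2/CUDA3 in the other.
    /usr/libexec/condor/condor_gpu_discovery "$@" | \
        sed 's/DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"/DetectedGPUs="CUDA0, CUDA2, CUDA1, CUDA3"/'

The startd would then be pointed at the wrapper instead of the stock binary, e.g. via the MACHINE_RESOURCE_INVENTORY_GPUs knob, if that is how GPU discovery is configured on the node.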


Michael V Pelletier

Principal Engineer

Raytheon Technologies

Digital Technology

HPC Support Team


From: Martin Sajdl <masaj.xxx@xxxxxxxxx>
Sent: Tuesday, May 25, 2021 8:15 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Michael Pelletier <michael.v.pelletier@xxxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] GPU benchmarking


Thank you Michael!
The formula below looks like a good idea. I have one additional question. Is it okay to use ClassAds in the form TARGET.CUDAComputeUnits when the real slot ClassAd looks like CUDA0ComputeUnits or CUDA1ComputeUnits? Is Condor automatically able to translate them to the correct value using AssignedGPUs?

Regards,
Masaj

On 5/20/2021 10:14 PM, Michael Pelletier via HTCondor-users wrote:

For my GPU jobs, I set up a ranking based on the number of compute units, times the number of cores per CU. You might also add the global memory. I do like the idea of factoring in the CUDA capability level as well, if your cluster has more than one type of card in it.


So for example, in a submit description:


rank = TARGET.CUDAComputeUnits * TARGET.CUDACoresPerCU + CUDAFreeGlobalMemory


Michael V Pelletier

Principal Engineer

Raytheon Technologies

Digital Technology

HPC Support Team


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Todd Tannenbaum
Sent: Thursday, May 20, 2021 12:49 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Martin Sajdl <masaj.xxx@xxxxxxxxx>
Subject: [External] Re: [HTCondor-users] GPU benchmarking


On 5/20/2021 8:56 AM, Martin Sajdl wrote:

Hi!

we have a cluster of nodes with GPUs, and we need to set a benchmark number for each slot with a GPU so that we can correctly control job ranking - i.e., start a job on the most powerful GPU available.
Does anyone use or know of a GPU benchmark tool? Ideally multi-platform (Linux, Windows)...


Hi Martin,

Just a quick thought:

While it is not strictly a benchmark, perhaps a decent proxy would be to use the CUDACapability attribute that is likely already present in each slot with a GPU (assuming they are NVIDIA GPUs, that is).

You could enter the following condor_status command to see if you feel that CUDACapability makes intuitive sense as a performance metric on your pool:

  condor_status -cons 'gpus>0' -sort CUDACapability -af name CudaCapability CudaDevicename

Hope the above helps
Todd




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/