
Re: [HTCondor-users] Condor version 9.0.17 GPU repeat/divide with GPU MIGs



Hi Vikrant,

I can't say for certain whether the discrepancy in the memory size is due to condor or the CUDA libraries. I am more inclined to think the CUDA libraries are responsible, since condor_gpu_discovery just takes the memory in bytes and converts it into MB.

As for running the command, note that using both -repeat and -divide is pointless, since only the last one specified takes effect. In addition, -divide is just -repeat with GlobalMemoryMb divided by the number of repeats. The repeated GPU devices appear in the DetectedGpus list, so your grep may be hiding information from you. Finally, you should be able to divide or repeat by any integer value greater than one.
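
To make the arithmetic concrete, here is a minimal Python sketch of the behavior described above. It is not condor_gpu_discovery's actual code: the function names repeat_gpu and divide_gpu are hypothetical, and the integer truncation is an assumption inferred from the 4985 value in your -divide 2 output (9971 / 2 = 4985.5).

```python
from typing import List


def repeat_gpu(global_memory_mb: int, n: int) -> List[int]:
    """Advertise one device n times, each with the full memory (sketch of -repeat)."""
    if n < 2:
        raise ValueError("n must be an integer greater than one")
    return [global_memory_mb] * n


def divide_gpu(global_memory_mb: int, n: int) -> List[int]:
    """Like repeat_gpu, but each copy gets memory // n (sketch of -divide).

    Truncating integer division is an assumption based on the reported
    value 4985 for 9971 MB divided by 2.
    """
    if n < 2:
        raise ValueError("n must be an integer greater than one")
    return [global_memory_mb // n] * n


print(repeat_gpu(9971, 2))  # [9971, 9971]
print(divide_gpu(9971, 2))  # [4985, 4985]
```

Under these assumptions, DeviceMemoryMb would stay at 9971 while each advertised copy carries the divided GlobalMemoryMb, which is consistent with the output you posted.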

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Thursday, January 11, 2024 2:40 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Condor version 9.0.17 GPU repeat/divide with GPU MIGs
 
Hello Experts,

We have an NVIDIA H100 GPU on a machine and have created 7 MIG instances on it.

- GlobalMemoryMb has the following value for 6 of the MIGs, but one MIG reports a different value, 9969 MB:

GlobalMemoryMb=9971 

nvidia-smi command shows the same value 9984 MiB for all MIGs. 

Is this a condor or CUDA library issue? 

- We are using the following command to divide the MIG further. It shows a GlobalMemoryMb lower than DeviceMemoryMb. Shouldn't I expect to see two devices, each with 4985 MB of memory? Also, I can't increase the value of -repeat and -divide (to something like 4; I don't need it, but is there a reason behind this?).

# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 2 -divide 2 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=4985

With a value of 4, it only shows DeviceMemoryMb; GlobalMemoryMb disappears:

#  `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 4 -divide 4 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"

Without divide and repeat:

#  `condor_config_val LIBEXEC`/condor_gpu_discovery -extra | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=9971



Thanks & Regards,
Vikrant Aggarwal