
[HTCondor-users] Condor version 9.0.17 GPU repeat/divide with GPU MIGs



Hello Experts,

We have an NVIDIA H100 GPU on a machine and have created 7 MIG instances on it.

- The following value is the same for 6 of the MIGs, but one MIG reports a different value, 9969 MB:

GlobalMemoryMb=9971

The nvidia-smi command shows the same value, 9984 MiB, for all MIGs.

Is this an HTCondor issue or a CUDA library issue?

- I am using the following command to divide the MIG further. It shows GlobalMemoryMb lower than DeviceMemoryMb. Shouldn't I expect to see two devices, each with 4985 MB of memory? Also, I can't increase the values of -repeat and -divide (to something like 4; I don't need that, but is there a reason behind it?).

# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 2 -divide 2 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=4985
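
As a quick sanity check on the numbers above, here is a minimal Python sketch of the arithmetic. It assumes the reported memory is split evenly and rounded down. That is only an assumption that happens to match the 9971 and 4985 values in this output, not a statement of how condor_gpu_discovery computes it. The function name split_memory_mb is hypothetical.

```python
def split_memory_mb(device_memory_mb: int, divide: int) -> int:
    """Per-slot memory if device memory is split evenly, rounding down.

    Assumption: floor division. This matches the output above
    (9971 -> 4985 with -divide 2) but is not taken from HTCondor's source.
    """
    if divide < 1:
        raise ValueError("divide must be at least 1")
    return device_memory_mb // divide


# Values taken from the -repeat 2 -divide 2 output above.
per_slot = split_memory_mb(9971, 2)
print(per_slot)  # 4985
```

Under that assumption, DeviceMemoryMb=9971 would be the whole MIG and GlobalMemoryMb=4985 the per-slot share.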

With -repeat 4 -divide 4, it only shows DeviceMemoryMb; GlobalMemoryMb has disappeared:

# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra -repeat 4 -divide 4 | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceMemoryMb=9971
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"

Without divide and repeat:

# `condor_config_val LIBEXEC`/condor_gpu_discovery -extra | grep MIG_ea8bzzz
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxDeviceUuid="MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"
MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxxGlobalMemoryMb=9971
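
To compare the three runs side by side, the grep output can be parsed into per-device attributes. The following is a rough Python sketch that works only on the line shapes pasted in this message (a device prefix fused with DeviceMemoryMb, DeviceUuid or GlobalMemoryMb). The function and variable names are hypothetical, and it assumes no other attribute names need handling.

```python
import re

# Only the three attribute names seen in the pasted output are recognised.
ATTR_RE = re.compile(
    r"^(?P<prefix>.+?)(?P<attr>DeviceMemoryMb|DeviceUuid|GlobalMemoryMb)=(?P<value>.*)$"
)


def parse_discovery(text):
    """Return {device_prefix: {attribute: value}} for recognised lines."""
    devices = {}
    for line in text.splitlines():
        match = ATTR_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        value = match.group("value").strip().strip('"')
        if match.group("attr").endswith("Mb"):
            value = int(value)
        devices.setdefault(match.group("prefix"), {})[match.group("attr")] = value
    return devices


DEV = "MIG_ea8bzzz_6a1a_562d_9f13_xxxxxxxxxx"

# Output copied from the three runs in this message.
runs = {
    "repeat2_divide2": f'{DEV}DeviceMemoryMb=9971\n{DEV}DeviceUuid="{DEV}"\n{DEV}GlobalMemoryMb=4985\n',
    "repeat4_divide4": f'{DEV}DeviceMemoryMb=9971\n{DEV}DeviceUuid="{DEV}"\n',
    "plain": f'{DEV}DeviceUuid="{DEV}"\n{DEV}GlobalMemoryMb=9971\n',
}

for name, output in runs.items():
    attrs = parse_discovery(output)[DEV]
    print(name, attrs.get("DeviceMemoryMb"), attrs.get("GlobalMemoryMb"))
```

A value of None in the printed result marks an attribute that a given run did not report, which makes the missing GlobalMemoryMb in the -repeat 4 -divide 4 case easy to spot.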



Thanks & Regards,
Vikrant Aggarwal