Hello HTCondor team and users,
We have an HTCondor cluster with several A100 GPUs capable of using Multi-Instance GPU (MIG).
I am currently working on enabling our HTCondor batch system (version 9.0.5) to utilize these resources for our local group as well as for the WLCG.
While HTCondor is able to detect these instances, several issues remain.
More details about these issues are below:
1. Currently, "condor_gpu_discovery" returns the device names as short UUIDs.
As described in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
this is a valid way to set the "CUDA_VISIBLE_DEVICES" variable for GPU resources.
It should be noted that the naming convention for MIG devices was recently changed:
unfortunately, MIG devices now have to be addressed by their full UUID to obtain a valid "CUDA_VISIBLE_DEVICES".
This should be fixable by reporting all UUIDs in their long form.
The option for this already exists with the "-uuid" argument of "condor_gpu_discovery",
although it is currently not the default for setting "CUDA_VISIBLE_DEVICES".
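For admins who want to opt in to the long form today, a configuration sketch, assuming the documented "GPU_DISCOVERY_EXTRA" knob for passing extra arguments to "condor_gpu_discovery" (note that, as described in (2.) below, the prefix problem currently remains even then):

```
use feature : GPUs
GPU_DISCOVERY_EXTRA = -uuid
```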
2. In our tests, the names returned by "condor_gpu_discovery" always start with the prefix "GPU-",
even if the UUID of an MIG device is printed. This is the case for both short and long UUIDs.
(Example "condor_gpu_discovery": "GPU-MIG-183c")
(Example "condor_gpu_discovery -uuid": "GPU-MIG-183c36e4-43ce-531b-8aaf-b0eed9800604")
As this is not a correct UUID for MIG devices, the resulting "CUDA_VISIBLE_DEVICES" is invalid.
As described in (1.), this might be caused by the recent change in the MIG naming convention.
A valid value for "CUDA_VISIBLE_DEVICES" would be something like "MIG-183c36e4-43ce-531b-8aaf-b0eed9800604".
While I am not sure how "condor_gpu_discovery" obtains its results,
it should be possible to prepend only the "MIG-" prefix instead of both.
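As a stop-gap on the worker node, the malformed identifiers could be repaired before the variable is exported; a minimal sketch (the helper name is ours, not part of HTCondor):

```python
def fix_mig_uuid(dev: str) -> str:
    """Strip the spurious "GPU-" prefix that condor_gpu_discovery
    currently prepends to MIG UUIDs (as observed with version 9.0.5)."""
    if dev.startswith("GPU-MIG-"):
        return dev[len("GPU-"):]  # keep only the "MIG-..." part
    return dev  # plain GPU UUIDs are left untouched

raw = "GPU-MIG-183c36e4-43ce-531b-8aaf-b0eed9800604,GPU-edf948a0-ba3e-36f9-e8a5-399b7cb63ba0"
print(",".join(fix_mig_uuid(d) for d in raw.split(",")))
# MIG-183c36e4-43ce-531b-8aaf-b0eed9800604,GPU-edf948a0-ba3e-36f9-e8a5-399b7cb63ba0
```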
After "fixing" the above issue by hard-coding the available MIG UUIDs, two additional issues occurred during scheduling.
3. "condor_gpu_discovery" can return the parent GPU in addition to its child MIGs.
Result of "condor_gpu_discovery -uuid":
DetectedGPUs="GPU-edf948a0-ba3e-36f9-e8a5-399b7cb63ba0, GPU-MIG-183c36e4-43ce-531b-8aaf-b0eed9800604, GPU-MIG-491e8b14-a9c6-58a1-9ef2-03b0095224c3,
GPU-MIG-a08e6fad-ada6-5417-81c3-ca3fdb5cdc30, GPU-MIG-b9a54a8c-bbf3-52df-9e24-ad7db4402a5b, GPU-9f253d6d-b549-2c5f-4340-2d02fd902061"
Result of "nvidia-smi -L":
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-edf948a0-ba3e-36f9-e8a5-399b7cb63ba0)
MIG 7g.40gb Device 0: (UUID: MIG-a08e6fad-ada6-5417-81c3-ca3fdb5cdc30)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-78a0883a-f92b-997f-233d-b3c796bfb1b2)
MIG 4g.20gb Device 0: (UUID: MIG-183c36e4-43ce-531b-8aaf-b0eed9800604)
MIG 2g.10gb Device 1: (UUID: MIG-491e8b14-a9c6-58a1-9ef2-03b0095224c3)
MIG 1g.5gb Device 2: (UUID: MIG-b9a54a8c-bbf3-52df-9e24-ad7db4402a5b)
GPU 2: NVIDIA A100-PCIE-40GB (UUID: GPU-9f253d6d-b549-2c5f-4340-2d02fd902061)
Please note that the UUID of GPU 0 as well as the UUID of its MIG instance are listed in the discovery output.
This is not the case for GPU 1, where only the MIG UUIDs are listed.
We are not sure what causes this inconsistency.
In any case, the discovery of both parent GPU and child MIG can lead to the assignment of jobs to both.
The default behavior (in our experience) for a GPU with an MIG is
that a task assigned to the parent GPU is run on the first listed child MIG.
During job scheduling, this can lead to multiple issues.
a) If both the parent GPU and the first child MIG are assigned a job at the same time,
the two tasks will run on the same MIG, possibly leading to issues.
b) MIGs only possess a portion of the entire GPU's capabilities.
Jobs that require an entire GPU might still be assigned to the parent GPU,
leading to the child MIG running a task for which it may not be sufficient.
This issue might be fixable by excluding from discovery all GPUs that provide one or more MIG instances.
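Such a filter could be sketched against the "nvidia-smi -L" output shown above; the parsing below is our illustration, not the actual discovery logic:

```python
import re

def filter_mig_parents(smi_output: str) -> list[str]:
    """From `nvidia-smi -L` text, return the UUIDs worth advertising:
    MIG UUIDs where a GPU is partitioned, the plain GPU UUID otherwise."""
    devices = []
    current_gpu, gpu_has_mig = None, False
    for line in smi_output.splitlines():
        gpu = re.match(r"GPU \d+: .* \(UUID: (GPU-[0-9a-f-]+)\)", line.strip())
        mig = re.match(r"MIG .* Device \d+: \(UUID: (MIG-[0-9a-f-]+)\)", line.strip())
        if gpu:
            if current_gpu and not gpu_has_mig:
                devices.append(current_gpu)  # previous GPU had no MIGs: keep it
            current_gpu, gpu_has_mig = gpu.group(1), False
        elif mig:
            gpu_has_mig = True
            devices.append(mig.group(1))  # advertise the MIG, never its parent
    if current_gpu and not gpu_has_mig:
        devices.append(current_gpu)
    return devices
```

Applied to the listing above, this would keep the four MIG UUIDs plus GPU 2, and drop the parent GPUs 0 and 1.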
4. The current approach for multi-GPU jobs seems to be to append as many GPUs as necessary to the
"CUDA_VISIBLE_DEVICES" variable from among the uuids discovered by "condor_gpu_discovery".
While this is valid if only GPU resources exist, it is unfortunately not usable if MIG resources are present.
As stated in https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#cuda-visible-devices
MIG devices cannot be combined with other MIG/GPU devices.
In its current form, it is possible that multiple MIGs, or a mix of MIGs and GPUs, are assigned to a multi-GPU job.
This leads to an invalid "CUDA_VISIBLE_DEVICES" variable.
At this point, many common packages like TensorFlow or PyTorch fall back to some default behavior:
typically, when no valid "CUDA_VISIBLE_DEVICES" is set, the first GPU found with nvidia-smi is used.
(This also applies to the issues (1.) and (2.))
Combined with the issue described in (3.) this can lead to multiple multi-GPU jobs running on the same MIG.
Although NVIDIA hints that MIGs could be combined with GPUs at some point in the future,
at present this issue is probably only solvable by excluding all MIGs from multi-GPU jobs.
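A simple sanity check on the assigned device list would catch such invalid combinations before job start; a sketch of the rule from the MIG user guide (the function itself is hypothetical):

```python
def valid_cuda_visible_devices(devices: list[str]) -> bool:
    """Per NVIDIA's MIG constraints, a MIG instance cannot be combined
    with any other MIG or GPU device in CUDA_VISIBLE_DEVICES."""
    has_mig = any(d.startswith("MIG-") for d in devices)
    if not has_mig:
        return True           # GPU-only lists may contain several devices
    return len(devices) == 1  # a MIG must be the only visible device
```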
Do other users also see these issues? I would be glad if the problems observed with the current version of "condor_gpu_discovery" and MIGs could be fixed.
We would also be happy to discuss this outside the mailing list, via separate emails or tickets, if you have further questions or comments.