It doesn’t appear to be working, with either setting:
OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0
OFFLINE_MACHINE_RESOURCE_GPUS = "CUDA0"
I set this up in the local config of the exec node which has the GPUs, did a condor_reconfig on that node (no args), and when I submit a GPU-requesting job to it, I get the following in the dynamic slot:
When I restart, it works:
But a restart doesn’t fly for a dynamic-availability feature. When it takes CUDA0 offline at startup, the AssignedGPUs in the partitionable slot changes to omit that string.
Did I maybe not wait long enough for the collector ad to update after the reconfig, or some such?
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
I think a restart is only required if you are using static slots and want the GPUs to be unassigned from a static slot.
I don’t believe a restart is actually required when using partitionable slots; the offline GPUs will just not be assigned to any NEW dynamic slot.
The intent of this knob is that you would set it via condor_config_val -set and then reconfig.
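For reference, that workflow might look roughly like the sketch below. The knob name and the CUDA0 identifier come from this thread; the run and offline_gpu helpers and the DRY_RUN switch are hypothetical, added only so the commands can be previewed before running them. It also assumes the execute node permits persistent runtime configuration for this knob (ENABLE_PERSISTENT_CONFIG, PERSISTENT_CONFIG_DIR, and a matching SETTABLE_ATTRS_* entry), which nothing in this thread confirms.

```shell
#!/bin/sh
# Sketch: mark one GPU offline on a running startd, then reconfig (no restart).
# Assumption: persistent runtime config is enabled for this knob on the node.

# DRY_RUN=1 (the default in this sketch) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: set the knob in the startd's config, then reconfig.
offline_gpu() {
    gpu_id="$1"
    # The whole "NAME = value" assignment is passed to -set as one argument.
    run condor_config_val -startd -set "OFFLINE_MACHINE_RESOURCE_GPUS = $gpu_id"
    # Plain reconfig with no arguments, as was done on the exec node above.
    run condor_reconfig
}

offline_gpu CUDA0
```

With DRY_RUN=0 the two commands are actually executed. Going by the explanation above, with partitionable slots the change should only affect dynamic slots created after the reconfig.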