
Re: [HTCondor-users] Preventing HTCondor assignment of a given GPU based on GPU state policy?



It doesn’t appear to be working, with either setting:

 

OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0

OFFLINE_MACHINE_RESOURCE_GPUS = "CUDA0"

 

I set this up in the local config of the execute node that has the GPUs, ran condor_reconfig on that node (no arguments), and when I submit a GPU-requesting job to it, I get the following in the dynamic slot:

 

CUDA_VISIBLE_DEVICES=0

_CONDOR_AssignedGPUs=CUDA0

 

When I restart, it works:

 

CUDA_VISIBLE_DEVICES=1

_CONDOR_AssignedGPUs=CUDA1

 

But a restart isn’t viable for a feature meant to track dynamic GPU availability. When the startd takes CUDA0 offline at startup, AssignedGPUs in the partitionable slot changes to omit that string.

 

Did I maybe not wait long enough for the collector ad to update after the reconfig, or some such?
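One way to rule out collector propagation delay is to query the startd’s ad directly. A sketch (the host name is a placeholder; AssignedGPUs is the attribute shown above):

```shell
# Ask the startd itself for the partitionable slot's current GPU assignment,
# bypassing the collector so any ad-update lag doesn't mask the change:
condor_status -direct gpu-node.example.com -af:h Name State AssignedGPUs
```

If AssignedGPUs still lists CUDA0 here after the reconfig, the startd itself hasn’t applied the change, and waiting on the collector won’t help.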

 

                -Michael Pelletier.

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Tuesday, May 16, 2017 5:11 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Preventing HTCondor assignment of a given GPU based on GPU state policy?

 

I think a restart is only required if you are using static slots and want the GPUs to be unassigned from a static slot.

 

I don’t believe a restart is actually required when using partitionable slots; the offline GPUs will just not be assigned to any NEW dynamic slot.

 

The intent of this knob is that you would set it via condor_config_val -set and then reconfig.
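A sketch of that intended workflow, assuming the pool permits remote runtime configuration (the knob must be settable for the client, and the host name here is a placeholder):

```shell
# Set the knob in the startd's runtime config on the execute node,
# then reconfig that node so the startd re-evaluates its GPU resources:
condor_config_val -name gpu-node.example.com -startd -set "OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0"
condor_reconfig gpu-node.example.com
```

Runtime settings made this way persist across reconfigs but are separate from the on-disk local config file.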

 

-tj