
Re: [HTCondor-users] Preventing HTCondor assignment of a given GPU based on GPU state policy?



I think a restart is only required if you are using static slots and want the GPUs to be unassigned from a static slot.

 

I don’t believe a restart is actually required when using partitionable slots; the offline GPUs will just not be assigned to any NEW dynamic slot.

 

The intent of this knob is that you would set it via condor_config_val -set and then reconfig.
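As a rough sketch of that workflow, the Python snippet below only builds the two command lines involved; it does not run them. It assumes the startd allows persistent configuration changes (typically ENABLE_PERSISTENT_CONFIG plus a SETTABLE_ATTRS_<level> entry covering this knob), which is not shown in this thread, and that a comma-separated device list is acceptable. The helper name offline_gpu_commands is hypothetical.

```python
def offline_gpu_commands(devices):
    """Return argv lists that mark GPUs offline, then reconfigure the daemons."""
    # Assumption: a comma-separated list of device names is accepted here.
    value = ",".join(devices)
    set_cmd = [
        "condor_config_val", "-startd", "-set",
        "OFFLINE_MACHINE_RESOURCE_GPUS = " + value,
    ]
    reconfig_cmd = ["condor_reconfig"]
    return [set_cmd, reconfig_cmd]


# Print the commands for review instead of executing them.
for cmd in offline_gpu_commands(["CUDA0"]):
    print(cmd)
```

Each list could be handed to subprocess.run() on the execute node once the permissions above are confirmed.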

 

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Pelletier
Sent: Tuesday, May 16, 2017 3:23 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Preventing HTCondor assignment of a given GPU based on GPU state policy?

 

Unfortunately, according to section 3.3.11 of the manual, changes to OFFLINE_MACHINE_RESOURCE_<name> require a restart rather than a reconfig. I’d like to do this dynamically, so that the negotiator can try to avoid GPUs in use by jobs which haven’t yet been brought under HTCondor’s purview.

 

Would the negotiator’s assignment work as expected if the AssignedGPUs string were made into some sort of ugly strcat() expression based on the GPUs’ state attributes, with an equally ugly sum expression for GPUs? That way if, say, CUDA0UtilizationPct is 100, “CUDA0” would be omitted from the partitionable slot’s AssignedGPUs attribute if it wasn’t assigned by HTCondor, and GPUs would be one fewer.
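To make the intended result concrete, here is a small Python sketch of the filtering such an expression would have to perform. It only models the desired values of AssignedGPUs and GPUs; it says nothing about whether HTCondor would honor them. The function name advertised_gpus, the condor_assigned set, and the busy_pct threshold are all hypothetical.

```python
def advertised_gpus(utilization, condor_assigned, busy_pct=100):
    """Drop devices that are busy but were not handed out by HTCondor.

    utilization:     dict of device name -> utilization percent
    condor_assigned: set of device names HTCondor itself has assigned
    """
    kept = [
        dev for dev in utilization
        if utilization[dev] < busy_pct or dev in condor_assigned
    ]
    # First value models AssignedGPUs, second models the GPUs count.
    return ",".join(kept), len(kept)


assigned, count = advertised_gpus(
    {"CUDA0": 100, "CUDA1": 0, "CUDA2": 0, "CUDA3": 0},
    condor_assigned=set(),
)
print(assigned, count)
```

A device that is fully utilized by an HTCondor job stays in the list, since it appears in condor_assigned.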

 

                -Michael Pelletier.

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Tuesday, May 16, 2017 3:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Preventing HTCondor assignment of a given GPU based on GPU state policy?

 

You can configure

 

OFFLINE_MACHINE_RESOURCE_GPUS = CUDA0

 

to prevent HTCondor from assigning that GPU to a slot.

 

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael Pelletier
Sent: Tuesday, May 16, 2017 11:45 AM
To: HTCondor-Users Mail List (htcondor-users@xxxxxxxxxxx) <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Preventing HTCondor assignment of a given GPU based on GPU state policy?

 

Hi folks,

 

I’m working on getting a new exec node stood up with multiple GPUs, for use by jobs that need dedicated GPU assignment, a first in our pools. Other jobs I’ve dealt with had an internal lock and queue mechanism to be able to share all the GPUs on the system, so I didn’t need to worry about HTCondor assignments.

 

I’d like to be able to prevent HTCondor from assigning a GPU that’s already in use by a non-HTCondor process to one of its jobs. I wrote a wrapper for nvidia-smi which pulls in an ad like so:

 

hostname$ /user/condor/libexec/condor_nvidia_probe
CUDA0FreeGlobalMemory = 2441
CUDA0UtilizationPct = 100
CUDA1FreeGlobalMemory = 4031
CUDA1UtilizationPct = 0
CUDA2FreeGlobalMemory = 4031
CUDA2UtilizationPct = 0
CUDA3FreeGlobalMemory = 4031
CUDA3UtilizationPct = 0
CUDAFreeGlobalMemory = 14534
CUDAUtilization = 25.0
--
hostname$


(This might be a good addition to condor_gpu_discovery, a “-utilization” argument.)

So in the above case, I’d like to prevent any HTCondor job from being assigned the CUDA0 device since it’s 100% used, and preferably advertise one fewer GPU available on the system. Is there any means to do this? I’ve been mulling the kinds of expressions I think I might need and my brain is starting to hurt a bit.
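For illustration, the selection I have in mind looks roughly like the Python sketch below, which scans probe output of the form shown above and lists the devices at or above a utilization threshold. The function name busy_devices and the threshold parameter are hypothetical, and the sample text is a shortened copy of the probe output.

```python
import re

# Shortened copy of the probe output shown above.
PROBE_OUTPUT = """\
CUDA0FreeGlobalMemory = 2441
CUDA0UtilizationPct = 100
CUDA1FreeGlobalMemory = 4031
CUDA1UtilizationPct = 0
CUDAFreeGlobalMemory = 14534
CUDAUtilization = 25.0
--
"""


def busy_devices(probe_text, threshold=100):
    """Return device names whose per-device utilization is >= threshold."""
    # Matches only per-device lines such as "CUDA0UtilizationPct = 100";
    # the aggregate "CUDAUtilization" line has no device index and is skipped.
    pattern = re.compile(r"^(CUDA\d+)UtilizationPct\s*=\s*(\d+)$")
    busy = []
    for line in probe_text.splitlines():
        match = pattern.match(line.strip())
        if match and int(match.group(2)) >= threshold:
            busy.append(match.group(1))
    return busy


print(busy_devices(PROBE_OUTPUT))
```

The remaining question is how to feed a list like this back into what the startd advertises.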

 

Michael V. Pelletier
Principal Engineer
Information Technology
Future Technologies & Cloud
Integrated Defense Systems
Raytheon Company

+1 978-858-9681   (office)
+1 339-293-9149   (cell)
7-225-9681   (tie line)
Michael.V.Pelletier@xxxxxxxxxxxx

50 Apple Hill Drive
Tewksbury, MA 01876 USA
www.raytheon.com
