
Re: [HTCondor-users] Question regarding overlapping GPU slots and nested partitioning of slots



Hi Tim. 

Sorry to say, but there is currently no way to have multiple jobs share a GPU other than to have condor_gpu_discovery pretend that there are multiple instances of each GPU, using the -repeat or -divide options.

But once you use -repeat or -divide, jobs that request multiple GPUs aren't guaranteed to get multiple real GPUs,  so you have to decide for each EP whether you want to allow GPU sharing, or to allow jobs that request multiple GPUs.   A single EP cannot do both at the same time.
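
As a minimal sketch, the two alternative EP configurations would look roughly like this (the divide count of 4 is only an example, not taken from your setup):

    # GPU-sharing EP: advertise each physical GPU as several instances,
    # so that several jobs can be matched to the same device.  A job
    # asking for more than one GPU may then be handed the same physical
    # device more than once.
    use feature : GPUs
    GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -divide 4

    # Multi-GPU EP: leave discovery at the default, so each GPU is
    # advertised exactly once and a job requesting N GPUs gets N
    # distinct devices.
    use feature : GPUs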

You can switch between these two configurations with a restart of the EP. So what you might want to look into is having a pool-wide process that looks at the jobs in the queue and then reconfigures your EPs to be either GPU-sharing EPs or EPs that support giving multiple GPUs to a single job, based on what is currently needed.
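
A very rough sketch of such a switch, assuming the mode is kept in a small local file that the pool-wide process rewrites (file name, count, and commands are only illustrative):

    # in each EP's configuration, pull in the current GPU mode
    include : /etc/condor/gpu_mode.conf

    # The pool-wide process would inspect the queue (e.g. with condor_q),
    # then write
    #     GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -divide 4
    # into gpu_mode.conf for the sharing mode, or leave the file empty
    # for the multi-GPU mode, and finally restart the startd on that EP:
    #     condor_restart -startd <EP hostname>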

In the future we are planning to add a capability that will allow a GPU to be shared across multiple dynamic slots at the request of a single AP.  But this plan will only allow GPUs to be shared between multiple jobs submitted by a single user.   The user would opt in to GPU sharing between their own jobs.

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Voigtländer, Tim Aike (ETP) <tim.voigtlaender@xxxxxxx>
Sent: Wednesday, February 14, 2024 4:54 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Question regarding overlapping GPU slots and nested partitioning of slots
 
Hi all,

I'm currently trying to set up a GPU node that accepts both jobs that request multiple GPUs at once
and jobs that request only a portion of a GPU (by specifying the amount of device memory).

I was able to set up both by using partitionable slots.
For the multi-GPU variant, all GPUs are assigned to the same slot and dynamic slots get assigned a number of the available GPUs. (This is the default setup.)
For the split-GPU variant, one slot is defined per GPU, and by using the `-repeat` option of `condor_gpu_discovery`, multiple jobs can be assigned to the same GPU.
In the second setup, the requested device memory is the limiting factor.
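
For illustration, the submit-file side of such a split-GPU job might look like the following (assuming the memory limit is expressed through the advertised GPU properties; the attached config may implement it differently):

    # request one (shared) GPU instance with at least 8 GiB of device memory
    request_GPUs = 1
    require_gpus = GlobalMemoryMb >= 8192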

I'd like to use both on the same node for maximum flexibility, but this seems to be difficult.
It is possible to have both kinds of setup exist at the same time,
having the GPUs and their copies in per-GPU partitionable slots as well as all the GPUs in one partitionable slot (see attached config file).
However, there seems to be no way to prevent the multi-GPU jobs from being assigned to GPUs that are already assigned to split-GPU jobs, and vice versa.

I was wondering if there is a straightforward solution to such an issue of conflicting slots.
Alternatively, is it possible to achieve a sort of nested partitionable slot?
That way, a whole GPU could be dynamically assigned to a partitionable slot that can then be further subdivided.

I've attached an excerpt of the condor_config for clarity.
The version used is 23.0.4 on CentOS 7.


Cheers and thanks,
Tim Voigtländer