
Re: [HTCondor-users] Fractional GPU



Thanks, but that is only supported on NVIDIA H100, A100, and A30
Tensor Core GPUs - we don't have any of those.

On Fri, Feb 23, 2024 at 9:57 AM Matthew T West via HTCondor-users
<htcondor-users@xxxxxxxxxxx> wrote:
>
> Hi Larry,
>
> Have you investigated NVIDIA's MIG
> https://www.nvidia.com/en-gb/technologies/multi-instance-gpu/?
>
> AFAIK, if you partition the cards at boot into sub-units, HTCondor's GPU
> discovery will pick up each of those as distinct entities on the compute
> node. Would you always want them divided into 1/4s or does this need to
> be dynamic partitioning?
>
> Cheers,
> Matt
>
> Matthew T. West
> DevOps & HPC SysAdmin
> University of Exeter, Research IT
> http://www.exeter.ac.uk/research/researchcomputing/support/researchit
> 57 Laver Building, North Park Road, Exeter, EX4 4QE, United Kingdom
>
> On 22/02/2024 22:45, Larry Martell wrote:
> >
> > Proceeding under the assumption that condor does not directly support
> > fractional GPUs, I am trying what I read here:
> > https://www-auth.cs.wisc.edu/lists/htcondor-users/2020-December/msg00018.shtml:
> >
> >> You can get HTCondor to do this just by having the same device show up more than once in the device enumeration.
> >> For instance, if you have two GPUs and your configuration is
> >> MACHINE_RESOURCE_GPUS = CUDA0, CUDA1
> >> You can run two jobs on each GPU by configuring
> >> MACHINE_RESOURCE_GPUS = CUDA0, CUDA1, CUDA0, CUDA1
> > I have 1 GPU and this is what I have in my config file:
> >
> > #use feature:GPUs
> > #GPU_DISCOVERY_EXTRA = -extra
> > MACHINE_RESOURCE_GPUs = CUDA0, CUDA0, CUDA0, CUDA0
> >
> > and this env setting: CUDA_VISIBLE_DEVICES="0"
> >
> > But when I run multiple jobs requesting a GPU they run serially, not
> > in parallel.
> >
> > Has anyone been able to get something like this working?
> >
> > On Thu, Feb 22, 2024 at 3:53 PM Larry Martell <larry.martell@xxxxxxxxx> wrote:
> >> Does condor support fractional GPUs? I am setting request_GPUs = 0.25
> >> and it is matching (I can see that with -better-analyze and in the
> >> StartLog) but the job never runs, it stays in idle state.
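
For reference, the repeated-device idea quoted above can also be expressed through GPU discovery instead of a hand-written device list. This is a minimal sketch, not a confirmed fix for the serial-execution problem. It assumes the installed condor_gpu_discovery accepts the -repeat option (it also documents -divide, which additionally splits the reported device memory), and the count of 4 simply mirrors the four CUDA0 entries in the quoted config.

```
# Sketch only: assumes condor_gpu_discovery in this HTCondor version accepts -repeat.
# Let HTCondor discover the GPU instead of listing devices by hand.
use feature:GPUs
# Repeat the list of detected GPUs 4 times, so one physical GPU
# is offered as four GPU resources.
GPU_DISCOVERY_EXTRA = $(GPU_DISCOVERY_EXTRA) -repeat 4
```

With a setup like this, each job would ask for one advertised device (request_GPUs = 1) rather than 0.25. Every running job also needs its own CPU core and memory from the slot, so a machine advertising a single core would still run such jobs one at a time; whether that applies here is not clear from the thread.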