
Re: [HTCondor-users] Condor to manage GPUs only



HTCondor does support AMD GPUs. They will be marked as OpenCL devices rather than ROCm devices, though.
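
For example, to see what the detection tool reports for an AMD card, something like the following should work (a rough sketch: the -opencl flag asks condor_gpu_discovery to enumerate devices via OpenCL instead of CUDA, and the exact attribute names in the output can vary by HTCondor version):

    # Enumerate GPUs via OpenCL rather than CUDA and include extra device properties
    /usr/libexec/condor/condor_gpu_discovery -opencl -extra

The devices should then be advertised with OpenCL-flavoured attribute names rather than CUDA or ROCm ones.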

Benedikt

On Fri, Jul 28, 2023 at 2:27 PM Valerio Bellizzomi <valerio@xxxxxxxxxx> wrote:
I am interested in running HTCondor with AMD ROCm. Does HTCondor support
ROCm?

Thanks
Valerio


On Fri, 2023-07-28 at 15:13 +0000, John M Knoeller via HTCondor-users
wrote:
> Thanks for the suggestion. I'm not sure about the GPU ordinal; I
> don't think we have that information for GPUs that have a UUID, which
> should be all NVIDIA GPUs at this point.
> -tj
>
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> Of Valerio Bellizzomi
> Sent: Friday, July 28, 2023 2:01 AM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Condor to manage GPUs only
>
> On Thu, 2023-07-27 at 22:11 +0000, John M Knoeller via HTCondor-users
> wrote:
> > We are currently working on
> >
> >     condor_status -gpus
> >
> > and hope to have something in the next version of
> > HTCondor. Something like this is likely:
> >
> > Name                User                  GPUs GPU-Memory GPU-Name
> > slot1@machine1      user1@xxxxxxxxxxxxx   1    10.6 GB    NVIDIA GeForce RTX 2080 Ti
> > slot1@machine2      user2@xxxxxxxxxxxxx   1    15.9 GB    Tesla P100-PCIE-16GB
> > ...
> >
> > I would be interested in your thoughts about what sort of
> > information
> > you would like to see.
> >
> > -tj
>
> Maybe add the GPU ordinal, GPU global memory, and the GPU UUID?
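>
> (In the meantime, something along these lines should pull most of that
> out of the machine ads; a rough sketch, assuming the GPUs_-prefixed
> attributes shown in Russell's output further down are being advertised:)
>
>     # illustrative query; attribute names depend on the GPU discovery settings
>     condor_status -compact -constraint '!isUndefined(DetectedGPUs)' \
>         -af:h Machine GPUs_DeviceUuid GPUs_GlobalMemoryMb GPUs_DeviceName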
>
> Cheers
> Valerio
>
>
>
> > -----Original Message-----
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> > Of Russell Smithies
> > Sent: Wednesday, July 26, 2023 9:57 PM
> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Condor to manage GPUs only
> >
> > I figured it out eventually - I had the "use feature" bit in the
> > config, but the attribute names start with "GPUs_", not "CUDA",
> > e.g. "GPUs_DeviceName" rather than "CUDADeviceName".
> >
> > muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine \
> >     GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb \
> >     GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
> > machine                  GPUs_DeviceName       GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUs_DeviceUuid
> > kscprod-data3.esr.cri.nz NVIDIA A100 80GB PCIe 8.0             12.2               81051               5f846c33-4dd5-ad62-eb12-c3813915d819
> >
> > My next issue is sorting out MUNGE authentication; can anyone point
> > me to some useful docs? I can't get it to use anything but the
> > default tokens ;-(
> > We've used MUNGE with Slurm, so I don't see any great need to change.
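> >
> > (For what it's worth, a minimal sketch of what pointing the daemons at
> > MUNGE might look like, assuming MUNGE support is compiled into this
> > HTCondor build; untested, and the method list is only an illustration:)
> >
> >     # hypothetical security snippet; adjust the method list as needed
> >     SEC_DEFAULT_AUTHENTICATION = REQUIRED
> >     SEC_DEFAULT_AUTHENTICATION_METHODS = MUNGE, FS, IDTOKENS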
> >
> > --Russell
> >
> > -----Original Message-----
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> > Of John M Knoeller via HTCondor-users
> > Sent: Thursday, July 27, 2023 1:08 PM
> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Condor to manage GPUs only
> >
> > Add
> >
> >     use FEATURE : GPUs
> >
> > to the configuration of your STARTD to have it run
> > condor_gpu_discovery on startup and treat the detected GPUs as slot
> > resources.
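> >
> > (As a rough sketch of what a GPU-oriented STARTD config might build on
> > top of that, using the standard SLOT_TYPE_<N> knobs; the single
> > partitionable slot below is purely illustrative:)
> >
> >     use FEATURE : GPUs
> >     # one partitionable slot that owns all GPUs, CPUs and memory
> >     SLOT_TYPE_1 = GPUs=100%, CPUs=100%, Memory=100%
> >     SLOT_TYPE_1_PARTITIONABLE = True
> >     NUM_SLOTS_TYPE_1 = 1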
> >
> > -tj
> >
> > -----Original Message-----
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> > Of Russell Smithies
> > Sent: Wednesday, July 26, 2023 3:35 PM
> > To: htcondor-users@xxxxxxxxxxx
> > Subject: [HTCondor-users] Condor to manage GPUs only
> >
> >
> > Hi all,
> > I used Condor 20 years ago and am trying to transition back from
> > Slurm.
> >
> > Initially I only want to use Condor to manage the GPUs on 3 servers:
> > two servers have 2 x A100s and one server has 2 x V100s.
> > I'm not sure of the best way to do this, or if it's even possible.
> > Surely, given the number of products that are "powered by GPUs", it
> > must be.
> >
> > When I do a "condor_gpu_discovery" I can see the GPUs:
> >     muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
> >     DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
> >     Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
> >     GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
> >     GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
> >
> > But when I do "condor_status" I don't see the GPUs; I only see the
> > CPU resources. And on this server, with a pair of AMD EPYC 75F3
> > processors, that's 128 slots to scroll through.
> > What I really want to see is no CPU slots, only the GPUs.
> > Is this possible, or am I asking too much?
> > Is there a better way of scheduling jobs for GPUs?
> >
> > Thanx,
> >
> > Russell Smithies
> >

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


--
Benedikt Riedel
Global Computing Coordinator IceCube Neutrino Observatory
Technical Coordinator IceCube Neutrino Observatory
Computing Manager Wisconsin IceCube Particle Astrophysics Center
University of Wisconsin-Madison