
Re: [HTCondor-users] Condor to manage GPUs only



I am interested in running HTCondor with AMD ROCm. Does HTCondor support
ROCm?

Thanks
Valerio


On Fri, 2023-07-28 at 15:13 +0000, John M Knoeller via HTCondor-users
wrote:
> Thanks for the suggestion. I'm not sure about the GPU ordinal; I
> don't think we have that information for GPUs that have a UUID, which
> should be all NVIDIA GPUs at this point.
> -tj
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> Of Valerio Bellizzomi
> Sent: Friday, July 28, 2023 2:01 AM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Condor to manage GPUs only
> 
> On Thu, 2023-07-27 at 22:11 +0000, John M Knoeller via HTCondor-users
> wrote:
> > We are currently working on 
> > 
> >     condor_status -gpus
> > 
> > and hope to have something in the next version of HTCondor. Something
> > like this is likely:
> > 
> > Name            User                 GPUs  GPU-Memory  GPU-Name
> >
> > slot1@machine1  user1@xxxxxxxxxxxxx  1     10.6 GB     NVIDIA GeForce RTX 2080 Ti
> > slot1@machine2  user2@xxxxxxxxxxxxx  1     15.9 GB     Tesla P100-PCIE-16GB
> > ...
> > 
> > I would be interested in your thoughts about what sort of
> > information
> > you would like to see.
> > 
> > -tj
> 
> Maybe add the GPU ordinal, GPU global memory, and the GPU UUID?
> 
> Cheers
> Valerio
> 
> 
> 
> > -----Original Message-----
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> > Of Russell Smithies
> > Sent: Wednesday, July 26, 2023 9:57 PM
> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Condor to manage GPUs only
> > 
> > I figured it out eventually - I had the "use feature" bit in the
> > config, but the tags start with "GPUs_", not "CUDA", e.g.
> > "GPUs_DeviceName", not "CUDADeviceName".
> > 
> > muthur# condor_status -constraint '!isUndefined(DetectedGPUs)' -compact -af:h machine GPUs_DeviceName GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
> > machine                  GPUs_DeviceName       GPUs_Capability GPUs_DriverVersion GPUs_GlobalMemoryMb GPUs_DeviceUuid
> > kscprod-data3.esr.cri.nz NVIDIA A100 80GB PCIe 8.0             12.2               81051               5f846c33-4dd5-ad62-eb12-c3813915d819
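
[Editorial sketch, not part of the original message.] The query above can be assembled from a script so the attribute list is kept in one place. This is a minimal Python sketch that only builds the command line using the flags quoted in the message (`-constraint`, `-compact`, `-af:h`); the function names `gpu_status_command` and `to_gpus_prefix` are hypothetical helpers, and the prefix rename assumes every old name maps by simply swapping `CUDA` for `GPUs_`, which the thread confirms only for `DeviceName`.

```python
import shlex

# Attribute names taken from the condor_status output quoted in the thread.
GPU_ATTRS = [
    "GPUs_DeviceName",
    "GPUs_Capability",
    "GPUs_DriverVersion",
    "GPUs_GlobalMemoryMb",
    "GPUs_DeviceUuid",
]


def gpu_status_command(attrs=GPU_ATTRS):
    """Build (but do not run) a condor_status query for machines with GPUs."""
    return [
        "condor_status",
        "-constraint", "!isUndefined(DetectedGPUs)",
        "-compact",
        "-af:h", "machine", *attrs,
    ]


def to_gpus_prefix(name):
    """Hypothetical helper: rewrite an old 'CUDA...' name to the 'GPUs_...' form."""
    if name.startswith("CUDA"):
        return "GPUs_" + name[len("CUDA"):]
    return name


if __name__ == "__main__":
    # shlex.join quotes the constraint so the line can be pasted into a shell.
    print(shlex.join(gpu_status_command()))
    print(to_gpus_prefix("CUDADeviceName"))
```

Passing the list to `subprocess.run` instead of printing it would execute the query on a host where `condor_status` is installed.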
> > 
> > My next issue is sorting out munge authentication. Can anyone point
> > me to some useful docs? I can't get it to use anything but the
> > default tokens ;-(
> > We've used munge on Slurm, so I don't see any great need to change.
> > 
> > --Russell
> > 
> > -----Original Message-----
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> > Of John M Knoeller via HTCondor-users
> > Sent: Thursday, July 27, 2023 1:08 PM
> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] Condor to manage GPUs only
> > 
> > Add
> > 
> >     use FEATURE : GPUs
> > 
> > to the configuration of your STARTD to have it run
> > condor_gpu_discovery on startup and treat the GPUs as slot
> > resources.
> > 
> > -tj
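
[Editorial sketch, not part of the original message.] Once GPUs are advertised as slot resources, a job asks for one with the `request_gpus` submit command. The Python sketch below only renders a minimal submit description as text; the function name `gpu_submit_description` and the executable name `train.sh` are hypothetical, and the single-CPU request is an assumption rather than anything stated in the thread.

```python
def gpu_submit_description(executable, gpus=1):
    """Render a minimal HTCondor submit description that requests GPUs.

    The executable name is supplied by the caller; request_cpus = 1 is an
    illustrative assumption, not a value taken from the thread.
    """
    lines = [
        f"executable   = {executable}",
        f"request_gpus = {gpus}",
        "request_cpus = 1",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 'train.sh' is a placeholder job script.
    print(gpu_submit_description("train.sh"), end="")
```

The resulting text can be saved to a file and handed to `condor_submit`.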
> > 
> > -----Original Message-----
> > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf
> > Of Russell Smithies
> > Sent: Wednesday, July 26, 2023 3:35 PM
> > To: htcondor-users@xxxxxxxxxxx
> > Subject: [HTCondor-users] Condor to manage GPUs only
> > 
> > 
> > Hi all,
> > I used Condor 20 years ago and am trying to transition back from
> > Slurm.
> > 
> > I want to initially use Condor only for managing the GPUs on 3
> > servers: two servers have 2 x A100s and one server has 2 x V100s.
> > I'm not sure of the best way to do this, or if it's even possible.
> > Surely given the number of products that are "powered by GPUs" it
> > must be.
> > 
> > When I do a "condor_gpu_discovery" I can see the GPUs:
> >    muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
> >    DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
> >    Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
> >    GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
> >    GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
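
[Editorial sketch, not part of the original message.] The `-nested` output above has one `Common` record plus one record per GPU, so a per-device view needs the two merged. The following Python sketch parses the text exactly as quoted in the message; the function names are hypothetical, values are kept as raw strings, and the parser assumes no attribute value contains a semicolon. It is not a general ClassAd parser.

```python
# Sample text copied from the condor_gpu_discovery output quoted in the thread.
SAMPLE = '''DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
'''


def parse_value(text):
    """Strip surrounding double quotes; numbers and booleans stay as strings."""
    text = text.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return text


def parse_nested(text):
    """Parse 'Name=[ k=v; ... ]' and 'Name="..."' lines into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        name, _, rest = line.partition("=")
        rest = rest.strip()
        if rest.startswith("[") and rest.endswith("]"):
            attrs = {}
            # Naive split: assumes no ';' appears inside a quoted value.
            for item in rest[1:-1].split(";"):
                if "=" in item:
                    key, _, value = item.partition("=")
                    attrs[key.strip()] = parse_value(value)
            result[name.strip()] = attrs
        else:
            result[name.strip()] = parse_value(rest)
    return result


def merged_gpus(parsed):
    """Return {gpu_id: attrs}, with per-GPU attributes layered over Common."""
    common = parsed.get("Common", {})
    gpus = {}
    for gpu_id in parsed["DetectedGPUs"].split(","):
        gpu_id = gpu_id.strip()
        # Record names use '_' where the id uses '-', as in the output above.
        record = parsed.get(gpu_id.replace("-", "_"), {})
        gpus[gpu_id] = {**common, **record}
    return gpus


if __name__ == "__main__":
    for gpu_id, attrs in merged_gpus(parse_nested(SAMPLE)).items():
        print(gpu_id, attrs["DeviceName"], attrs["DevicePciBusId"])
```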
> > 
> > But when I do "condor_status" I don't see the GPUs, only the
> > CPU resources. On this server, with a pair of AMD EPYC 75F3
> > processors, that's 128 slots to scroll through.
> > What I really want to see is no CPU slots, only the GPUs.
> > Is this possible, or am I asking too much?
> > Is there a better way of job scheduling for GPUs?
> > 
> > Thanx,
> > 
> > Russell Smithies
> > 
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to 
> > htcondor-users-request@xxxxxxxxxxx
> > with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> > 
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/
> > 