Re: [HTCondor-users] Condor to manage GPUs only



That's exactly what we'd like.
I did a few installs and uninstalls and miraculously the servers connected. I still have no idea why, but it's working now!
I'm only seeing one GPU per node (the first device?), which is odd as all the servers have two GPUs. Could it be the way I have my constraints?
 
muthur# /usr/libexec/condor/condor_gpu_discovery -extra
DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
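
To compare what the discovery tool reports with what condor_status shows, the output above can be parsed directly. This is a minimal Python sketch, assuming the output keeps the format shown in this message; the function names and the SAMPLE constant are hypothetical helpers, not part of HTCondor.

```python
import re

# Sample text copied from the condor_gpu_discovery output in this message.
SAMPLE = '''DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
'''


def detected_gpus(discovery_output):
    """Return the GPU ids listed on the DetectedGPUs line, or [] if it is absent."""
    match = re.search(r'^\s*DetectedGPUs="([^"]*)"', discovery_output, re.MULTILINE)
    if match is None:
        return []
    return [gpu.strip() for gpu in match.group(1).split(",") if gpu.strip()]


def device_uuids(discovery_output):
    """Return every DeviceUuid value found in the per-GPU lines, in order."""
    return re.findall(r'DeviceUuid="([^"]+)"', discovery_output)


if __name__ == "__main__":
    print(len(detected_gpus(SAMPLE)), "GPUs reported by discovery")
    for uuid in device_uuids(SAMPLE):
        print("  ", uuid)
```

If the count here is two but condor_status lists a single UUID per machine, the difference lies in how the status query is displayed or constrained rather than in detection.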
 
 
muthur# condor_status -constraint  '!isUndefined(DetectedGPUs)' -compact  -af:h machine GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
machine                  GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUsMemoryUsage       GPUs_DeviceUuid                      DeviceGPUsAverageUsage
kscprod-data1 Tesla V100-PCIE-32GB  7.0                   12.1                  32501               271.0                 5e382249-1938-0c64-2b04-04631b812baa 0.0
kscprod-data2 NVIDIA A100-PCIE-40GB 8.0                   12.1                  40377               29496.0               387fd653-c749-2ec6-8eab-f967090d6579 0.6562980190294957
kscprod-data3 NVIDIA A100 80GB PCIe 8.0                   12.2                  81051               878.0                 5f846c33-4dd5-ad62-eb12-c3813915d819 0.0001105264908529342
 
thanx,
 
--Russell
 
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Friday, July 28, 2023 10:12 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only
 
We are currently working on
 
    condor_status -gpus
 
and hope to have something in the next version of HTCondor. Something like this is likely:
 
Name            User                 GPUs  GPU-Memory  GPU-Name

slot1@machine1  user1@xxxxxxxxxxxxx     1  10.6 GB     NVIDIA GeForce RTX 2080 Ti
slot1@machine2  user2@xxxxxxxxxxxxx     1  15.9 GB     Tesla P100-PCIE-16GB
...
 
I would be interested in your thoughts about what sort of information you would like to see.
 
-tj
 
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies
Sent: Wednesday, July 26, 2023 9:57 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only
 
I figured it out eventually. I had the "use feature" bit in the config, but the tags start with "GPUs_", not "CUDA", e.g. "GPUs_DeviceName", not "CUDADeviceName".
 
muthur# condor_status -constraint  '!isUndefined(DetectedGPUs)' -compact  -af:h machine GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage
machine                  GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUs_DeviceUuid
kscprod-data3.esr.cri.nz NVIDIA A100 80GB PCIe 8.0                   12.2                  81051               5f846c33-4dd5-ad62-eb12-c3813915d819
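
The prefix point can be made concrete with a small helper that builds the query from bare attribute names. This is only a sketch in Python: the function name is hypothetical, and it uses only the condor_status options that appear in the command above.

```python
import shlex


def gpu_status_command(attributes, prefix="GPUs_"):
    """Build a condor_status argument list that prints the given GPU attributes.

    Each bare name (e.g. "DeviceName") gets the "GPUs_" prefix seen in this
    thread, rather than a "CUDA" prefix.
    """
    columns = ["machine"] + [prefix + name for name in attributes]
    return [
        "condor_status",
        "-constraint", "!isUndefined(DetectedGPUs)",
        "-compact",
        "-af:h",
    ] + columns


if __name__ == "__main__":
    argv = gpu_status_command(["DeviceName", "Capability", "GlobalMemoryMb"])
    # shlex.join quotes the constraint so the line can be pasted into a shell.
    print(shlex.join(argv))
```

The list form can also be passed to subprocess.run unchanged, which avoids shell quoting of the "!" in the constraint.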
 
My next issue is sorting out munge authentication, if anyone can point me to some useful docs. I can't get it to use anything but the default tokens ;-( We've used munge on slurm, so I don't see any great need to change.
 
--Russell
 
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Thursday, July 27, 2023 1:08 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only
 
Add
 
    use FEATURE : GPUs
 
to the configuration of your STARTD to have it run condor_gpu_discovery on startup and treat the GPUs as slot resources.
 
-tj
 
-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies
Sent: Wednesday, July 26, 2023 3:35 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Condor to manage GPUs only
 
 
Hi all,
I used Condor 20 years ago and am trying to transition back from slurm.
 
I want to initially only use Condor for managing the GPUs on 3 servers: two servers have 2 x A100s and one server has 2 x V100s.
I'm not sure of the best way to do this - or if it's even possible? Surely given the number of products that are "powered by GPUs" it must be.
 
When I do a "condor_gpu_discovery" I can see the GPUs:
   muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
   DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
   Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
   GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
   GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]
 
But when I do "condor_status" I don't see the GPUs but only see the CPU resources. And on this server with a pair of AMD EPYC 75F3 processors that's 128 slots to scroll through.
What I really want to see is no CPU slots, only the GPUs.
Is this possible, or am I asking too much?
Is there a better way of job scheduling for GPUs?
 
Thanx,
 
Russell Smithies
 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://aus01.safelinks.protection.outlook.com/?url="">
 
The archives can be found at:
https://aus01.safelinks.protection.outlook.com/?url="">
 