
Re: [HTCondor-users] Condor to manage GPUs only



It's not going to be possible to hide the CPUs entirely.  HTCondor will require at least 1 CPU, some Memory and some Disk to be provisioned in each slot or no job will ever match.

 

The best you can do is to configure

 

NUM_CPUS = <n>

 

Where <n> is the number of GPUs on the machine or the maximum number of slots you want to be able to create on that machine.

 

Similarly, you can configure

 

MEMORY = <mm>

DISK = <dd>

 

To set the size of the total pool of memory and disk to make slots from.
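
Put together, a minimal sketch of such a configuration for a two-GPU machine might look like the following. The numbers are placeholders rather than recommendations, and the units noted in the comments (MB for MEMORY, KiB for DISK) are assumptions that should be checked against the manual for your HTCondor version. The "use FEATURE : GPUs" line is the one suggested earlier in this thread.

```
# Sketch only: all values are placeholders for a machine with two GPUs.

# Run GPU discovery at startd startup and treat GPUs as slot resources.
use FEATURE : GPUs

# Advertise one CPU per GPU instead of every detected core.
NUM_CPUS = 2

# Total memory to make slots from (assumed to be in MB).
MEMORY = 65536

# Total disk to make slots from (assumed to be in KiB).
DISK = 104857600
```

After changing the configuration, the startd has to be reconfigured or restarted before condor_status reflects the new totals.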

 

-tj

 

From: Russell Smithies <Russell.Smithies@xxxxxxxxxx>
Sent: Sunday, July 30, 2023 3:46 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: Condor to manage GPUs only

 

You're right John, I turned off -compact and it worked fine.
I guess more coffee and time to RTFM would help a lot here 
😊

 

The next really odd request is I don't want any CPUs to be available, just GPUs. We're just trialling condor and still using slurm for CPUs, and slurm isn't aware of external jobs or server loads.

I thought setting "CPUS = 0" or "CpuBusy = True" or "TotalCpus = 0" in the config might do it but I still see them all as available.

Any ideas?

 

--Russell

 

From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Saturday, July 29, 2023 3:01 AM
To: Russell Smithies <Russell.Smithies@xxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: Condor to manage GPUs only

 

It might be -compact, which is like adding

 

-constraint "PartitionableSlot =?= true || DynamicSlot =!= true"

 

But -compact only shows one line per machine, even if it gets back multiple ads for that machine.

This can lead to weird results when you mix -compact with -af but have static slots or multiple p-slots.

 

-tj

 

From: Russell Smithies <Russell.Smithies@xxxxxxxxxx>
Sent: Thursday, July 27, 2023 6:11 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: RE: Condor to manage GPUs only

 

That's exactly what we'd like  😊

I did a few installs and uninstalls and miraculously the servers connected - I still have no idea why but it's working now!

I'm only seeing one GPU per node (the first device?) which is odd as all the servers have two GPUs, it could be the way I have my constraints?

 

muthur# /usr/libexec/condor/condor_gpu_discovery -extra

DetectedGPUs="GPU-5f846c33, GPU-c60861f1"

Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]

GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]

GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

 

 

muthur# condor_status -constraint  '!isUndefined(DetectedGPUs)' -compact  -af:h machine GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage

machine                  GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUsMemoryUsage       GPUs_DeviceUuid                      DeviceGPUsAverageUsage

kscprod-data1 Tesla V100-PCIE-32GB  7.0                   12.1                  32501               271.0                 5e382249-1938-0c64-2b04-04631b812baa 0.0

kscprod-data2 NVIDIA A100-PCIE-40GB 8.0                   12.1                  40377               29496.0               387fd653-c749-2ec6-8eab-f967090d6579 0.6562980190294957

kscprod-data3 NVIDIA A100 80GB PCIe 8.0                   12.2                  81051               878.0                 5f846c33-4dd5-ad62-eb12-c3813915d819 0.0001105264908529342

 

thanx,

 

--Russell

 

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Friday, July 28, 2023 10:12 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only

 

We are currently working on

 

    condor_status -gpus

 

and hope to have something in the next version of HTCondor.   Something like this is likely

 

Name                             User                           GPUs GPU-Memory GPU-Name                      

 

slot1@machine1         user1@xxxxxxxxxxxxx    1   10.6 GB           NVIDIA GeForce RTX 2080 Ti

slot1@machine2         user2@xxxxxxxxxxxxx    1   15.9 GB           Tesla P100-PCIE-16GB

...

 

I would be interested in your thoughts about what sort of information you would like to see.

 

-tj

 

-----Original Message-----

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies

Sent: Wednesday, July 26, 2023 9:57 PM

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] Condor to manage GPUs only

 

I figured it out eventually - I had the "use feature" bit in the config, but the tags start with "GPUs_" not "CUDA", e.g. "GPUs_DeviceName" not "CUDADeviceName".

 

muthur# condor_status -constraint  '!isUndefined(DetectedGPUs)' -compact  -af:h machine GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUsMemoryUsage GPUs_DeviceUuid DeviceGPUsAverageUsage

machine                  GPUs_DeviceName       GPUs_Capability       GPUs_DriverVersion    GPUs_GlobalMemoryMb GPUs_DeviceUuid

kscprod-data3.esr.cri.nz NVIDIA A100 80GB PCIe 8.0                   12.2                  81051               5f846c33-4dd5-ad62-eb12-c3813915d819

 

My next issue is sorting out munge authentication if anyone can point me to some useful docs?  I can't get it to use anything but the default tokens ;-( We've used munge on slurm so I don't see any great need to change.

 

--Russell

 

-----Original Message-----

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users

Sent: Thursday, July 27, 2023 1:08 PM

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Cc: John M Knoeller <johnkn@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] Condor to manage GPUs only

 

Add

 

    use FEATURE : GPUs

 

to the configuration of your STARTD to have it run condor_gpu_discovery on startup and treat the GPUs as slot resources.

 

-tj

 

-----Original Message-----

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies

Sent: Wednesday, July 26, 2023 3:35 PM

Subject: [HTCondor-users] Condor to manage GPUs only

 

 

Hi all,

I used Condor 20 years ago and am trying to transition back from slurm.

 

I want to initially only use Condor for managing the GPUs on 3 servers, two servers have 2 x A100s and one server has 2 X V100.

I'm not sure of the best way to do this - or if it's even possible? Surely given the number of products that are "powered by GPUs" it must be.

 

When I do a "condor_gpu_discovery" I can see the GPUs:

   muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested

   DetectedGPUs="GPU-5f846c33, GPU-c60861f1"

   Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]

   GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]

   GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

 

But when I do "condor_status" I don't see the GPUs but only see the CPU resources. And on this server with a pair of AMD EPYC 75F3 processors that's 128 slots to scroll through.

What I really want to see is no CPU slots, only the GPUs.

Is this possible or am I asking too much?

Is there a better way of job scheduling for GPUs?

 

Thanx,

 

Russell Smithies

 

_______________________________________________

HTCondor-users mailing list

To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a

subject: Unsubscribe

You can also unsubscribe by visiting

 

The archives can be found at:

 
