Re: [HTCondor-users] Condor to manage GPUs only



HTCondor's support for munge authentication is pretty basic. We don't have any options for altering how the munge credential is created. Looking through the munge docs, I don't see any mention of alternate tokens. Can you be more specific about the munge option you need?

 - Jaime

On Jul 26, 2023, at 9:56 PM, Russell Smithies <Russell.Smithies@xxxxxxxxxx> wrote:

My next issue is sorting out munge authentication if anyone can point me to some useful docs?  I can't get it to use anything but the default tokens ;-(
We've used munge on slurm so I don't see any great need to change.

--Russell

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John M Knoeller via HTCondor-users
Sent: Thursday, July 27, 2023 1:08 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: John M Knoeller <johnkn@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Condor to manage GPUs only

Add

   use FEATURE : GPUs

to the configuration of your STARTD to have it run condor_gpu_discovery on startup and treat the GPUs as slot resources.
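
As a rough sketch of how that might fit together (the file name, the job script name, and the resource requests below are illustrative assumptions, not something taken from this thread), the startd configuration could look like:

   # /etc/condor/config.d/50-gpus.conf   (hypothetical file name)
   # Run GPU discovery at startd startup and advertise GPUs as a slot resource.
   use FEATURE : GPUs
   # Optional assumption: advertise one partitionable slot per machine
   # instead of one static slot per core.
   use FEATURE : PartitionableSlot

The startd will most likely need a restart before the new resource shows up. A job would then ask for a GPU in its submit file, for example:

   # gpu_job.sub   (hypothetical submit file)
   executable     = my_gpu_job.sh
   request_gpus   = 1
   request_cpus   = 1
   request_memory = 4GB
   queue

To list only the machines that advertise GPUs, something along these lines should work, assuming the slot ad carries a TotalGPUs attribute:

   condor_status -compact -constraint 'TotalGPUs > 0'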

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Russell Smithies
Sent: Wednesday, July 26, 2023 3:35 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Condor to manage GPUs only


Hi all,
I used Condor 20 years ago and am trying to transition back from slurm.

I want to initially only use Condor for managing the GPUs on 3 servers: two servers have 2 x A100s and one server has 2 x V100s.
I'm not sure of the best way to do this - or if it's even possible? Surely given the number of products that are "powered by GPUs" it must be.

When I do a "condor_gpu_discovery" I can see the GPUs:
  muthur# /usr/libexec/condor/condor_gpu_discovery -extra -nested
  DetectedGPUs="GPU-5f846c33, GPU-c60861f1"
  Common=[ Capability=8.0; ClockMhz=1410.00; ComputeUnits=108; CoresPerCU=64; DeviceName="NVIDIA A100 80GB PCIe"; DriverVersion=12.20; ECCEnabled=true; GlobalMemoryMb=81051; MaxSupportedVersion=12020; ]
  GPU_5f846c33=[ id="GPU-5f846c33"; DevicePciBusId="0000:41:00.0"; DeviceUuid="5f846c33-4dd5-ad62-eb12-c3813915d819"; ]
  GPU_c60861f1=[ id="GPU-c60861f1"; DevicePciBusId="0000:A1:00.0"; DeviceUuid="c60861f1-85ee-082a-6211-8564787ede57"; ]

But when I do "condor_status" I don't see the GPUs but only see the CPU resources. And on this server with a pair of AMD EPYC 75F3 processors that's 128 slots to scroll through.
What I really want to see is no CPU slots, only the GPUs.
Is this possible, or am I asking too much?
Is there a better way of job scheduling for GPUs?

Thanx,

Russell Smithies

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
