
Re: [HTCondor-users] Detecting GPU




Hi,

Yep, that sounds like a possible issue. The easiest check would be to do a 'su condor' and run it from there.

Everything else looks as expected, I am afraid ...

Best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Martin Sajdl" <masaj.xxx@xxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>, "Josef Mitlöhner" <josef.mitlohner@xxxxxx>
Sent: Friday, 3 April 2020 12:36:36
Subject: Re: [HTCondor-users] Detecting GPU

Hi guys,
it seems the issue is that the condor_gpu_discovery utility behaves a bit differently when it is launched from a normal user session in Windows than when it is launched from the context of a running service (the condor daemon).
As Josef wrote, it seems to have limited access to the GPU from the service context. The limitation still appears to be linked to the GPU type (this one is very old), because it does not seem to occur on systems with newer GPUs.

Masaj

On 03.04.2020 11:59, Josef Mitlöhner wrote:
C:\>condor_config_val -dump | grep -i gpu
ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)/condor_gpu_utilization
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 1
STARTD_CRON_JOBLIST =  GPUs_MONITOR GPUs_MONITOR STARTCFG

Best regards
Josef

On 3.4.2020 11:41, Beyer, Christoph wrote:
there seems to be a little something missing somewhere ;)

I had similar problems when we started to use GPUs; the cause was an individual configuration overwriting the feature config.

What does condor_config_val say? It should look similar to this:

[root@batchg003 ~]# condor_config_val -dump | grep -i gpu
ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
SLOT_TYPE_1 = GPUs=1, CPUs=2
SLOT_WEIGHT = GPUs
START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True) && TARGET.RequestGpus && (RequestRuntime <= 12000)
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)/condor_gpu_utilization
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 1
STARTD_CRON_JOBLIST = NODEHEALTH GPUs_MONITOR GPUs_MONITOR
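
To check whether an individual configuration is overriding the feature config, the two GPU-related dumps posted in this thread can be compared mechanically. The following is a minimal Python sketch of that comparison. The helper names (parse_dump, compare) and the variable names are hypothetical and not part of HTCondor; the sample text is copied from the listings in this thread.

```python
# Sketch: compare two `condor_config_val -dump | grep -i gpu` listings.
# parse_dump and compare are illustrative helpers, not HTCondor tools.

def parse_dump(text):
    """Turn 'NAME = value' lines into a dict; lines without '=' are skipped."""
    config = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")  # split on the first '=' only
        if sep and name.strip():
            config[name.strip()] = value.strip()
    return config

def compare(reference, candidate):
    """Return (names only in reference, names whose values differ)."""
    only_ref = sorted(set(reference) - set(candidate))
    differing = sorted(
        name for name in reference.keys() & candidate.keys()
        if reference[name] != candidate[name]
    )
    return only_ref, differing

# Listing from the working Linux host (batchg003), as posted in this thread.
LINUX_DUMP = """\
ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
SLOT_TYPE_1 = GPUs=1, CPUs=2
SLOT_WEIGHT = GPUs
START = (NODE_IS_HEALTHY =?= True) && (StartJobs =?= True) && TARGET.RequestGpus && (RequestRuntime <= 12000)
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)/condor_gpu_utilization
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 1
STARTD_CRON_JOBLIST = NODEHEALTH GPUs_MONITOR GPUs_MONITOR
"""

# Listing from the Windows host that reports DetectedGPUs = 0.
WINDOWS_DUMP = """\
ENVIRONMENT_FOR_AssignedGPUs = GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES
ENVIRONMENT_VALUE_FOR_UnAssignedGPUs = 10000
MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
STARTD_CRON_GPUs_MONITOR_EXECUTABLE = $(LIBEXEC)/condor_gpu_utilization
STARTD_CRON_GPUs_MONITOR_METRICS = SUM:GPUs, PEAK:GPUsMemory
STARTD_CRON_GPUs_MONITOR_MODE = WaitForExit
STARTD_CRON_GPUs_MONITOR_PERIOD = 1
STARTD_CRON_JOBLIST =  GPUs_MONITOR GPUs_MONITOR STARTCFG
"""

ref = parse_dump(LINUX_DUMP)
cand = parse_dump(WINDOWS_DUMP)
missing, differing = compare(ref, cand)
print("only in reference:", missing)
print("different values:", differing)
```

For these two listings the discovery-related settings are identical; the names that differ are the slot and START definitions and the cron job list, which look like site policy rather than GPU detection settings.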

Best
Christoph




From: "Josef Mitlöhner" <josef.mitlohner@xxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, 3 April 2020 11:32:48
Subject: Re: [HTCondor-users] Detecting GPU

Hello everyone,
I created a task that ran only "condor_gpu_discovery -extra", and its output was just "DetectedGPUs = 0". However, when I execute the command manually, it returns:

C:\> condor_gpu_discovery -extra
DetectedGPUs = "CUDA1"
CUDACapability = 1.2
CUDAClockMhz = 1402.00
CUDAComputeUnits = 2
CUDACoresPerCU = 8
CUDADeviceName = "GeForce 210"
CUDADevicePciBusId = "0000:05:00.0"
CUDADeviceUuid = "00000000-0000-0000-0000-000000000000"
CUDADriverVersion = 6.50
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 1024
CUDARuntimeVersion = 10.20

So in the configuration context, condor_gpu_discovery does not have access to any GPU information.
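
The comparison above (manual run versus run by condor) can be made repeatable with a small script. This is only a rough Python sketch: parse_attrs and detected_gpu_count are hypothetical helper names, the sample text is taken from the output in this message, and the handling of several comma-separated GPU names is an assumption, since this thread only ever shows a single name or 0.

```python
# Sketch: count the GPUs reported in condor_gpu_discovery output captured
# in two different contexts. Helper names are illustrative only.

def parse_attrs(text):
    """Parse 'Name = value' lines; surrounding double quotes are dropped."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip():
            attrs[name.strip()] = value.strip().strip('"')
    return attrs

def detected_gpu_count(attrs):
    """DetectedGPUs shows up in this thread as 0 or as a name like CUDA0.

    Assumption: several GPUs would be listed as comma-separated names.
    """
    value = attrs.get("DetectedGPUs", "0")
    if value.isdigit():
        return int(value)
    return len([name for name in value.split(",") if name.strip()])

# Output of the manual run in a user session (excerpt from this message).
INTERACTIVE = """\
DetectedGPUs = "CUDA1"
CUDACapability = 1.2
CUDADeviceName = "GeForce 210"
CUDAGlobalMemoryMb = 1024
"""

# Output of the same command when run as a task through condor.
SERVICE = "DetectedGPUs = 0"

interactive_count = detected_gpu_count(parse_attrs(INTERACTIVE))
service_count = detected_gpu_count(parse_attrs(SERVICE))
print("interactive:", interactive_count, "service:", service_count)
if interactive_count > service_count:
    print("discovery sees fewer GPUs when run by condor than when run by hand")
```

Saving the output of both runs to files and feeding them to the same two functions would show whether the difference persists after a driver or configuration change.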

Best regards
Josef


On 2.4.2020 13:34, Josef Mitlöhner wrote:
Hi,

lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)

C:\>condor_status -l mitlohner-w764 | grep -i gpu
DetectedGPUs = 0
GPUs = 0
MachineResources = "Cpus Memory Disk Swap GPUs"
TotalGPUs = 0
TotalSlotGPUs = 0

Best regards
Josef

On 2.4.2020 12:45, Beyer, Christoph wrote:
Hmm,

what does

lspci | grep -i nvidia

say?

The condor_status output should look something like this:

[root@batchg003 ~]# condor_status -l batchg003 | grep -i gpu
AssignedGPUs = "CUDA0"
DetectedGPUs = 1
GPUs = 1
MachineResources = "Cpus Memory Disk Swap GPUs"
SlotWeight = GPUs
Start = (NODE_IS_HEALTHY =?= true) && (StartJobs =?= true) && TARGET.RequestGpus && (RequestRuntime <= 12000)
TotalGPUs = 1
TotalSlotGPUs = 1
[root@batchg003 ~]# condor_status -l batchg003 | grep -i cuda
AssignedGPUs = "CUDA0"
CUDACapability = 6.1
CUDADeviceName = "GeForce GTX 1080 Ti"
CUDADevicePciBusId = "0000:65:00.0"
CUDADeviceUuid = "3f2d719f-7d89-c75c-1a71-94316a2fcd12"
CUDADriverVersion = 10.2
CUDAECCEnabled = false
CUDAGlobalMemoryMb = 11178

Best
Christoph



From: "Josef Mitlöhner" <josef.mitlohner@xxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 2 April 2020 12:08:40
Subject: Re: [HTCondor-users] Detecting GPU

Hi,
thank you for your reply.

The result is the same. The only change (after installing the CUDA package) is in the "condor_gpu_discovery -properties" listing:

DetectedGPUs="CUDA0"
CUDACapability=1.2
CUDADeviceName="GeForce 210"
CUDADevicePciBusId="0000:05:00.0"
CUDADeviceUuid="00000000-0000-0000-0000-000000000000"
CUDADriverVersion=6.50
CUDAECCEnabled=false
CUDAGlobalMemoryMb=1024
CUDARuntimeVersion=10.20

Thanks for your help,
Best regards
Josef

On 2.4.2020 10:24, Beyer, Christoph wrote:
Hi,

try
@use feature : GPUs
@use feature : GPUsMonitor

The second one is not mandatory of course but you will want it ;)

Install the cuda and nvidia-driver packages (I think those come with the cuda package, though):

cuda.x86_64

Restart the host and check ...

Best
christoph



From: "Josef Mitlöhner" <josef.mitlohner@xxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 2 April 2020 10:13:53
Subject: [HTCondor-users] Detecting GPU

Hello,
when I run the command "condor_gpu_discovery -properties" on my computer, it detects one GPU:

DetectedGPUs="CUDA0"
can't open SOFTWARE\NVIDIA Corporation\GPU Computing Toolkit\CUDA
CUDACapability=1.2
CUDADeviceName="GeForce 210"
CUDADevicePciBusId="0000:05:00.0"
CUDADeviceUuid="00000000-0000-0000-0000-000000000000"
CUDADriverVersion=6.50
CUDAECCEnabled=false
CUDAGlobalMemoryMb=1024

In condor.config I have a line with "use feature : GPUs".


Why does my HTCondor server say (condor_status -l):
...
DetectedGPUs = 0
...

?
Thank you for your reply
Josef


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
