
Re: [HTCondor-users] CUDA_VISIBLE_DEVICES not in the environment



Oops,

I guess I stand corrected ;) 

Indeed, as this is the only 4-GPU machine, its configuration has some history; while it was running 4 static slots, everybody was happy. I decluttered the configuration and now it looks much better - thanks a lot!

One thing I noticed is that the partitionable GPU slot also accepts regular non-GPU jobs, which is not my intention, hence I added:

STARTD_ATTRS = IsJupyterSlot, Request_GPUs, $(STARTD_ATTRS)
SLOT_TYPE_1_START = $(START:True) && (TARGET.Request_GPUs >= 1)

Is that correct, and do I need to declare 'Request_GPUs' as a startd attribute at all?
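
If I read the manual right, the resulting Start expression can be inspected per slot in raw, unevaluated form, which is how I was planning to sanity-check this:

  condor_status batchg010.desy.de -af:r Name Start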

Best
Christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "johnkn" <johnkn@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, December 12, 2019 21:35:03
Subject: Re: [HTCondor-users] CUDA_VISIBLE_DEVICES not in the environment

The config on that machine has 

  use feature:gpus

But it *also* has some other knobs, possibly left over from earlier experimentation, that are overriding use feature:gpus.

Try running:

condor_config_val -v -dump MACHINE_RESOURCE

You will see that the config on that machine has

MACHINE_RESOURCE_gpu = 4
MACHINE_RESOURCE_NAMES =  gpu gpu

Note that this is gpu, not gpus. This config knob is disabling the GPUs resource.
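
A quick way to see which config file those leftover knobs come from is to run something like this on the machine itself:

  condor_config_val -v MACHINE_RESOURCE_NAMES

which should also print the file and line where the value was set.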

Also, the slot configuration is assigning gpu, but not GPUs, to slot_type 1:

SLOT_TYPE_1 = gpu=4, cpu=8

Once again, this looks like some cruft left over from earlier experimentation. What you want to do is:

Remove these from your config:
MACHINE_RESOURCE_gpu
MACHINE_RESOURCE_NAMES

And change your SLOT_TYPE_1 config to this:

SLOT_TYPE_1 = GPUs=4, CPUs=8
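
For reference, a minimal sketch of the GPU slot fragment after the cleanup could look like this (assuming slot type 1 should stay a single partitionable slot; adjust the counts to taste):

  use feature:gpus
  SLOT_TYPE_1 = GPUs=4, CPUs=8
  SLOT_TYPE_1_PARTITIONABLE = True
  NUM_SLOTS_TYPE_1 = 1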

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Beyer, Christoph
Sent: Thursday, December 12, 2019 1:06 PM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] CUDA_VISIBLE_DEVICES not in the environment

Hi Tj,

it is all undefined: 

[root@bird-htc-sched13 ~]# condor_status batchg010.desy.de -af AssignedGPUs
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined

There is one partitionable slot with 4 GPUs and a couple of static Jupyter slots to make some use of the CPU power of the machine:

[root@bird-htc-sched13 ~]# condor_status batchg010.desy.de 
Name                      OpSys      Arch   State     Activity LoadAv Mem     ActvtyTime

slot1@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000 342351  1+05:25:05
slot1_2@xxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      1.150   1536  0+02:12:03
slot1_3@xxxxxxxxxxxxxxxxx LINUX      X86_64 Claimed   Busy      1.160   1536  0+01:56:02
slot2@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot3@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot4@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot5@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot6@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot7@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot8@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot9@xxxxxxxxxxxxxxxxx   LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot10@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15
slot11@xxxxxxxxxxxxxxxxx  LINUX      X86_64 Unclaimed Idle      0.000   4000  1+05:25:15

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX       13     0       2        11       0          0      0

         Total       13     0       2        11       0          0      0

Best
christoph

-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "johnkn" <johnkn@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, December 12, 2019 17:20:03
Subject: Re: [HTCondor-users] CUDA_VISIBLE_DEVICES not in the environment

What GPUs are getting assigned to the slot?

   condor_status -af Name AssignedGPUs

Does CUDA_VISIBLE_DEVICES get set in the environment when you don't use the job wrapper?
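
For example, a bare-bones test job along these lines (hypothetical file names) would show what actually reaches the job environment:

  # env_test.sub -- minimal GPU job that just dumps its environment
  universe     = vanilla
  executable   = /usr/bin/env
  request_gpus = 1
  output       = env_test.out
  error        = env_test.err
  log          = env_test.log
  queue

Then grep for CUDA_VISIBLE_DEVICES in env_test.out once it has run.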

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Beyer, Christoph
Sent: Thursday, December 12, 2019 6:58 AM
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] CUDA_VISIBLE_DEVICES not in the environment

Hi,

I am struggling a bit with the parallel usage of GPUs, as I mentioned earlier. As a matter of fact, part of my problems result from CUDA_VISIBLE_DEVICES not being set in the job environment.

I use the GPUs feature, which expands as expected to:

[root@batchg010 condor]# condor_config_val use feature:gpus
use FEATURE:GPUs is
	MACHINE_RESOURCE_INVENTORY_GPUs=$(LIBEXEC)/condor_gpu_discovery -properties $(GPU_DISCOVERY_EXTRA)
	ENVIRONMENT_FOR_AssignedGPUs=GPU_DEVICE_ORDINAL=/(CUDA|OCL)//  CUDA_VISIBLE_DEVICES
	ENVIRONMENT_VALUE_FOR_UnAssignedGPUs=10000

I am running a job wrapper, but even in the job wrapper environment I see no sign of CUDA_VISIBLE_DEVICES being set; same thing in the environment once the job is running.

As a result, I get all 4 GPUs in a single GPU slot:

/usr/libexec/condor/condor_gpu_discovery
DetectedGPUs="CUDA0, CUDA1, CUDA2, CUDA3"
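
(Per the feature expansion quoted above, the startd itself calls discovery with the -properties option; run by hand that would be roughly:

  /usr/libexec/condor/condor_gpu_discovery -properties
)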


Is there an additional trick that I missed?

This is on:

$CondorVersion: 8.9.1 Apr 17 2019 BuildID: 466671 PackageID: 8.9.1-1 $




-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
