
Re: [HTCondor-users] Adding GPUs to machine resources



John:

The problem was that I had a script in /etc/profile.d that was setting both PATH and LD_LIBRARY_PATH for users who log in and, e.g., run condor_gpu_discovery. I replaced the part of the script that sets LD_LIBRARY_PATH with a proper entry in /etc/ld.so.conf.d, ran ldconfig, restarted HTCondor, and found the correct machine ClassAd attributes (^CUDA.*) set.
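For reference, a minimal sketch of that kind of fix; the library path here is an assumption, so substitute wherever the CUDA/NVIDIA libraries actually live on your system:

    # hypothetical driver library location -- adjust to your installation
    echo "/usr/lib/nvidia-current" > /etc/ld.so.conf.d/cuda.conf
    ldconfig
    service condor restart    # or however HTCondor is restarted on your distribution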

It is worth emphasizing a point I don't think you picked up on: I set ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES, and nevertheless these values were set:

AssignedGPUs = "OCL0,OCL1"
OCLDeviceName = "GeForce GTX 690"
OCLOpenCLVersion = 1.1
OCLGlobalMemoryMb = 2048

You may or may not consider this behavior correct. It is definitely not what I would expect. These attributes are no longer set once LD_LIBRARY_PATH is fixed as described above.
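For the record, the check that previously came back empty (see the quoted messages below) now shows the CUDA attributes; the slot name is a placeholder, as in the masked output quoted below:

    condor_status -long slot1@<execute-node> | grep -i cuda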

I'll begin testing the job submission / GPU assignment issues soon and report back. Thanks for getting back to me.

--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678


On Fri, Mar 28, 2014 at 10:03 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
>
> Changes to resources in a STARTD always require a full daemon restart, not just a reconfig; this is true for GPUs as well.
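> (For example, on the execute node, something along these lines is needed after changing the resource configuration:)
>
>     condor_restart    # a full restart; condor_reconfig alone will not pick up the change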
>
> Other than that, I'm not sure what you are saying is wrong with your configuration.  Are you expecting to see CUDA GPUs
> and not seeing them?  The condor_gpu_discovery tool dynamically loads the CUDA libraries, so if you are expecting to
> see CUDA GPUs but aren't, the problem is likely that the libraries aren't in the path for the STARTD.
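> (One way to check whether the CUDA libraries are resolvable outside of a login environment -- the commands are
> standard, the library name is whatever your driver installs:)
>
>     env -i /usr/lib/condor/libexec/condor_gpu_discovery -properties
>     ldconfig -p | grep -i libcuda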
>
> -tj
>
>
> On 3/27/2014 5:53 PM, Tom Downes wrote:
>
> Hi John:
>
> Thanks for getting back to me. I'm still not seeing CUDA variables coming in, but it looks like it's closer to the mark. I've changed the configuration to:
>
> MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
> ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
>
> slot_type_1_partitionable = true
> slot_type_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
> num_slots_type_1 = 1
>
> And I get this:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep OCL
> AssignedGPUs = "OCL0,OCL1"
> OCLDeviceName = "GeForce GTX 690"
> OCLOpenCLVersion = 1.1
> OCLGlobalMemoryMb = 2048
>
> You'll note that my previous messages show only CUDA information when running condor_gpu_discovery manually and that the condor_config only asks for CUDA_VISIBLE_DEVICES.
>
> Also: getting the OCL variables to show up requires a full daemon restart, not just a condor_reconfig with the right MACHINE_RESOURCE_INVENTORY_GPUs.
>
> --
> Tom Downes
> Associate Scientist and Data Center Manager
> Center for Gravitation, Cosmology and Astrophysics
> University of Wisconsin-Milwaukee
> 414.229.2678
>
>
> On Thu, Mar 27, 2014 at 7:24 PM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
> >
> > Apologies to all on this list. There was a mistake in the htcondor-wiki page https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus that was fixed just this morning.
> >
> > Where it said
> >
> >    MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
> >
> > it should have said
> >
> >    MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
> >
> > You should only use MACHINE_RESOURCE_GPUs when you intend to specify the number or IDs of the GPUs directly, rather than by running the GPU discovery tool,
> > so
> >
> >     MACHINE_RESOURCE_GPUs = CUDA0 CUDA1
> >
> > would be a valid declaration of 2 GPUs with IDs of CUDA0 and CUDA1.
> >
> >
> >
> > On 3/26/2014 10:14 AM, Tom Downes wrote:
> >
> > Hi:
> >
> > I've installed the Condor development series (8.1.4) on execute nodes that have GPUs installed. The rest of the Condor cluster is all on 8.0.5. I am following the instructions at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus to advertise the GPUs as part of the Machine ClassAd. The machine is configured as a single partitionable slot with all CPUs/RAM/GPUs:
> >
> > MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
> > ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
> >
> > slot_type_1_partitionable = true
> > slot_type_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
> > num_slots_type_1 = 1
> >
> > This is what I get:
> >
> > root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | grep -i gpu
> > TotalGPUs = 2
> > TotalSlotGPUs = 2
> > MachineResources = "Cpus Memory Disk Swap GPUs"
> > GPUs = 2
> > WithinResourceLimits = # long reasonable expression
> > AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery,-properties"
> > DetectedGPUs = 2
> > childGPUs = { 0,0 }
> >
> > Note, in particular, the value of AssignedGPUs. Also note this:
> >
> > root@nemo-slave3000:~# /usr/lib/condor/libexec/condor_gpu_discovery -properties
> > DetectedGPUs="CUDA0, CUDA1"
> > CUDACapability=3.0
> > CUDADeviceName="GeForce GTX 690"
> > CUDADriverVersion=6.0
> > CUDAECCEnabled=false
> > CUDAGlobalMemoryMb=2048
> > CUDARuntimeVersion=5.50
> >
> > Following a hunch from ticket #3386, I added the -dynamic argument:
> >
> > root@nemo-slave3000:~# /usr/lib/condor/libexec/condor_gpu_discovery -dynamic -properties
> > DetectedGPUs="CUDA0, CUDA1"
> > CUDACapability=3.0
> > CUDADeviceName="GeForce GTX 690"
> > CUDADriverVersion=6.0
> > CUDAECCEnabled=false
> > CUDAGlobalMemoryMb=2048
> > CUDARuntimeVersion=5.50
> > CUDA0FanSpeedPct=30
> > CUDA0DieTempF=34
> > CUDA1FanSpeedPct=30
> > CUDA1DieTempF=32
> >
> > This results in:
> >
> > root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i GPU
> > TotalGPUs = 3
> > TotalSlotGPUs = 3
> > MachineResources = "Cpus Memory Disk Swap GPUs"
> > GPUs = 3
> > WithinResourceLimits = # long..
> > AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery,-properties,-dynamic"
> > DetectedGPUs = 3
> > childGPUs = { 0,0 }
> >
> > Note the detection of 3 GPUs according to Condor...
> >
> > So one issue is that I'm not sure if AssignedGPUs is correct. No matter what I do, the following command returns empty:
> >
> > root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i cuda
> >
> > --
> > Tom Downes
> > Associate Scientist and Data Center Manager
> > Center for Gravitation, Cosmology and Astrophysics
> > University of Wisconsin-Milwaukee
> > 414.229.2678
> >
> >
> > On Wed, Mar 12, 2014 at 4:06 PM, Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx> wrote:
> > >
> > > I've been running Condor for more than a decade now, but being
> > > rather new to the Condor/GPU business, I'm having a hard time.
> > >
> > > Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
> > > to add two GPUs to the resources available to a standalone machine
> > > with a number of CPU cores, by defining in condor_config.d/gpu:
> > >
> > > MACHINE_RESOURCE_NAMES    = GPUS
> > > MACHINE_RESOURCE_GPUS     = 2
> > >
> > > SLOT_TYPE_1               = cpus=100%,auto
> > > SLOT_TYPE_1_PARTITIONABLE = TRUE
> > > NUM_SLOTS_TYPE_1          = 1
> > >
> > > I added a "request_gpus" line to my - otherwise rather simplistic -
> > > submit file, specifying either 1 or 0.
> > > This works: depending on the amount of free resources (obviously,
> > > the GPUs are the least abundant one), jobs get matched and started.
> > > Checking the output of condor_status -l for the individual dynamic
> > > slots, the numbers look OK.
> > > (I'm wondering whether I'd have to set request_gpus=0 somewhere.
> > > Seems to default to 0 though.)
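> > > (A minimal sketch of such a submit file, with hypothetical executable
> > > and file names:)
> > >
> > >     universe     = vanilla
> > >     executable   = gpu_job.sh      # hypothetical payload/wrapper script
> > >     output       = gpu_job.out
> > >     error        = gpu_job.err
> > >     log          = gpu_job.log
> > >     request_cpus = 1
> > >     request_gpus = 1
> > >     queue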
> > >
> > > Now the idea is to tell the job - via arguments, environment,
> > > or a job wrapper - which GPU to use. This is where I ran out of
> > > ideas.
> > >
> > > https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
> > > suggests using
> > >   arguments = @...$((AssignedGPUs))
> > > but this macro cannot be expanded on job submission...
> > >
> > > There's no _CONDOR_AssignedGPUs in the "printenv" output.
> > >
> > > Even
> > > # grep -i gpu /var/lib/condor/execute/dir_*/.{machine,job}.ad
> > > doesn't show anything that looks helpful.
> > >
> > > Addition of a line
> > > ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
> > > as suggested in the wiki page has no effect at all.
> > >
> > > Also, $(LIBEXEC)/condor_gpu_discovery doesn't work as expected:
> > > # /usr/lib/condor/libexec/condor_gpu_discovery [-properties]
> > > modprobe: FATAL: Module nvidia-uvm not found.
> > > 2
> > > (and -properties makes no difference)
> > >
> > > In the end, I'd like to have up to TotalGpus slots with one (or
> > > both) GPUs assigned to them, and $CUDA_VISIBLE_DEVICES or
> > > another environment variable telling me (and a possible wrapper
> > > script) the device numbers. (I also suppose that a non-GPU slot
> > > would have to set $CUDA_VISIBLE_DEVICES to the empty string or
> > > -1?)
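> > > (A sketch of the kind of wrapper meant here, assuming the startd does
> > > export the assigned device IDs as CUDA_VISIBLE_DEVICES once
> > > ENVIRONMENT_FOR_AssignedGPUs is configured as discussed above:)
> > >
> > >     #!/bin/sh
> > >     # hypothetical job wrapper: report which GPU(s) this slot was given,
> > >     # then run the real payload
> > >     echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}" >&2
> > >     exec "$@"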
> > >
> > > In an era of partitionable resources, will I still have to revert
> > > to static assignments of the individual GPUs to static slots? I
> > > hope not (as this doesn't provide an easy means to allocate
> > > both GPUs to a single job)...
> > >
> > > Any suggestions?
> > >
> > > Thanks,
> > >  S
> > >
> > > --
> > > Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
> > > MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
> > > http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}
>
>
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/