
Re: [HTCondor-users] Adding GPUs to machine resources



Changes to resources in a STARTD always require a full daemon restart, not just a reconfig; this is true for GPUs as well.

Other than that, I'm not sure what you are saying is wrong with your configuration.  Are you expecting to see CUDA GPUs
and not seeing them?  The condor_gpu_discovery tool dynamically loads the CUDA libraries, so if you are expecting to
see CUDA GPUs but aren't seeing them, the problem is likely that the libraries aren't in the path for the STARTD.
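
For example, a quick way to check both of those things from a root shell on the execute node would be something like this (just a sketch; it assumes a standard library layout and that you can restart HTCondor on that host):

    # check whether the dynamic loader can find the CUDA libraries
    ldconfig -p | grep -i libcuda

    # run the discovery tool by hand and compare against what the slot ads show
    /usr/lib/condor/libexec/condor_gpu_discovery -properties

    # after any change to machine resources, restart the daemons (a reconfig is not enough)
    condor_restart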

-tj

On 3/27/2014 5:53 PM, Tom Downes wrote:
Hi John:

Thanks for getting back to me. I'm still not seeing CUDA variables coming in, but it looks like it's closer to the mark. I've changed the configuration to:

MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES

slot_type_1_partitionable = true
slot_type_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
num_slots_type_1 = 1


And I get this:

root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep OCL
AssignedGPUs = "OCL0,OCL1"
OCLDeviceName = "GeForce GTX 690"
OCLOpenCLVersion = 1.1
OCLGlobalMemoryMb = 2048


You'll note that my previous messages show only CUDA information when running condor_gpu_discovery manually and that the condor_config only asks for CUDA_VISIBLE_DEVICES.

Also: getting the OCL variables to show up requires a full daemon restart, not just a condor_reconfig with the right MACHINE_RESOURCE_INVENTORY_GPUs.
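
For the record, here is roughly what I plan to check next (only a sketch; it assumes the startd runs as the condor user and that sudo is available on the node):

    # confirm which knob the running configuration actually defines, and where it was set
    condor_config_val -verbose MACHINE_RESOURCE_INVENTORY_GPUs

    # run the discovery tool as the startd's user, to see whether the
    # CUDA libraries are visible from that environment
    sudo -u condor /usr/lib/condor/libexec/condor_gpu_discovery -properties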

--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678


On Thu, Mar 27, 2014 at 7:24 PM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
>
> Apologies to all on this list. There was a mistake in the htcondor-wiki page https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus that was fixed just this morning.
>
> Where it said
>
>    MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
>
> it should have said
>
>    MACHINE_RESOURCE_INVENTORY_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
>
> You should only use MACHINE_RESOURCE_GPUs when you intend to specify the number or IDs of the GPUs directly, rather than by running the GPU discovery tool,
> so
>
>     MACHINE_RESOURCE_GPUs = CUDA0 CUDA1
>
> would be a valid declaration of two GPUs with IDs CUDA0 and CUDA1.
>
>
>
> On 3/26/2014 10:14 AM, Tom Downes wrote:
>
> Hi:
>
> I've installed the Condor development series (8.1.4) on execute nodes that have GPUs. The rest of the Condor cluster is on 8.0.5. I am following the instructions at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus to advertise the GPUs as part of the Machine ClassAd. The machine is configured as a single partitionable slot with all CPUs/RAM/GPUs:
>
> MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
> ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
>
> slot_type_1_partitionable = true
> slot_type_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
> num_slots_type_1 = 1
>
> This is what I get:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | grep -i gpu
> TotalGPUs = 2
> TotalSlotGPUs = 2
> MachineResources = "Cpus Memory Disk Swap GPUs"
> GPUs = 2
> WithinResourceLimits = # long reasonable expression
> AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery,-properties"
> DetectedGPUs = 2
> childGPUs = { 0,0 }
>
> Note, in particular, the value of AssignedGPUs. Also note this:
>
> root@nemo-slave3000:~# /usr/lib/condor/libexec/condor_gpu_discovery -properties
> DetectedGPUs="CUDA0, CUDA1"
> CUDACapability=3.0
> CUDADeviceName="GeForce GTX 690"
> CUDADriverVersion=6.0
> CUDAECCEnabled=false
> CUDAGlobalMemoryMb=2048
> CUDARuntimeVersion=5.50
>
> Following a hunch from ticket #3386, I added the -dynamic argument:
>
> root@nemo-slave3000:~# /usr/lib/condor/libexec/condor_gpu_discovery -dynamic -properties
> DetectedGPUs="CUDA0, CUDA1"
> CUDACapability=3.0
> CUDADeviceName="GeForce GTX 690"
> CUDADriverVersion=6.0
> CUDAECCEnabled=false
> CUDAGlobalMemoryMb=2048
> CUDARuntimeVersion=5.50
> CUDA0FanSpeedPct=30
> CUDA0DieTempF=34
> CUDA1FanSpeedPct=30
> CUDA1DieTempF=32
>
> This results in:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i GPU
> TotalGPUs = 3
> TotalSlotGPUs = 3
> MachineResources = "Cpus Memory Disk Swap GPUs"
> GPUs = 3
> WithinResourceLimits = # long..
> AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery,-properties,-dynamic"
> DetectedGPUs = 3
> childGPUs = { 0,0 }
>
> Note the detection of 3 GPUs according to Condor...
>
> So one issue is that I'm not sure if AssignedGPUs is correct. No matter what I do, the following command returns empty:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i cuda
>
> --
> Tom Downes
> Associate Scientist and Data Center Manager
> Center for Gravitation, Cosmology and Astrophysics
> University of Wisconsin-Milwaukee
> 414.229.2678
>
>
> On Wed, Mar 12, 2014 at 4:06 PM, Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx> wrote:
> >
> > I've been running Condor for more than a decade, but being rather
> > new to the Condor/GPU business, I'm having a hard time now.
> >
> > Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
> > to add two GPUs to the resources available to a standalone machine
> > with a number of CPU cores, by defining in condor_config.d/gpu:
> >
> > MACHINE_RESOURCE_NAMES    = GPUS
> > MACHINE_RESOURCE_GPUS     = 2
> >
> > SLOT_TYPE_1               = cpus=100%,auto
> > SLOT_TYPE_1_PARTITIONABLE = TRUE
> > NUM_SLOTS_TYPE_1          = 1
> >
> > I added a "request_gpus" line to my - otherwise rather simplistic -
> > submit file, specifying either 1 or 0.
> > This works: depending on the amount of free resources (obviously,
> > the GPUs are the least abundant), jobs get matched and started.
> > Checking the output of condor_status -l for the individual dynamic
> > slots, the numbers look OK.
> > (I'm wondering whether I'd have to set request_gpus=0 somewhere.
> > Seems to default to 0 though.)
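> >
> > For reference, the relevant part of my submit file looks roughly like
> > this (a simplified sketch; "my_gpu_job.sh" is just a placeholder name):
> >
> >   universe     = vanilla
> >   executable   = my_gpu_job.sh
> >   request_cpus = 1
> >   request_gpus = 1
> >   queue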
> >
> > Now the idea is to tell the job - via arguments, environment,
> > or a job wrapper - which GPU to use. This is where I ran out of
> > ideas.
> >
> > https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
> > suggests using
> >   arguments = @...$((AssignedGPUs))
> > but this macro cannot be expanded on job submission...
> >
> > There's no _CONDOR_AssignedGPUs in the "printenv" output.
> >
> > Even
> > # grep -i gpu /var/lib/condor/execute/dir_*/.{machine,job}.ad
> > doesn't show anything that looks helpful.
> >
> > Adding the line
> > ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
> > as suggested in the wiki page has no effect at all.
> >
> > Also, $(LIBEXEC)/condor_gpu_discovery doesn't work as expected:
> > # /usr/lib/condor/libexec/condor_gpu_discovery [-properties]
> > modprobe: FATAL: Module nvidia-uvm not found.
> > 2
> > (and -properties makes no difference)
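> >
> > (Next on my list, assuming the NVIDIA driver packages are actually
> > installed on this box, is to check whether the kernel module loads and
> > whether the driver sees the cards at all - a sketch:
> >   modprobe nvidia-uvm
> >   nvidia-smi
> > )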
> >
> > In the end, I'd like to have up to TotalGpus slots with one (or both)
> > GPUs assigned to them, and $CUDA_VISIBLE_DEVICES or another environment
> > variable telling me (and a possible wrapper script) the device numbers.
> > (I also suppose that a non-GPU slot would have to set
> > $CUDA_VISIBLE_DEVICES to the empty string or -1?)
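> >
> > The sort of wrapper I have in mind would be something like this (only a
> > sketch, assuming the device list does eventually arrive in
> > $CUDA_VISIBLE_DEVICES; "payload" is a placeholder for the real program):
> >
> >   #!/bin/sh
> >   # pass the assigned GPU list through; fall back to -1 so the payload
> >   # sees no devices when nothing was assigned
> >   export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:--1}"
> >   exec ./payload "$@"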
> >
> > In an era of partitionable resources, will I still have to revert
> > to static assignments of the individual GPUs to static slots? I
> > hope not (as this doesn't provide an easy means to allocate
> > both GPUs to a single job)...
> >
> > Any suggestions?
> >
> > Thanks,
> >  S
> >
> > --
> > Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
> > MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
> > http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/