
Re: [HTCondor-users] Adding GPUs to machine resources



I tried an alternative method with two extra static slots:

# configure GPUs if available
MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES

slot_type_1_partitionable = true
slot_type_1 = cpus=30, mem=60000, gpus=0
num_slots_type_1 = 1

slot_type_2_partitionable = false
slot_type_2 = cpus=1, mem=512, gpus=1
num_slots_type_2 = 2

The GPU count and assignment are correct, but the CUDA variables are still not coming along for the ride:

root@nemo-slave3000:~# condor_status -long slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i GPU
TotalGPUs = 2
TotalSlotGPUs = 1
MachineResources = "Cpus Memory Disk Swap GPUs"
GPUs = 1
AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery"
DetectedGPUs = 2
root@nemo-slave3000:~# condor_status -long slot3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i GPU
TotalGPUs = 2
TotalSlotGPUs = 1
MachineResources = "Cpus Memory Disk Swap GPUs"
GPUs = 1
AssignedGPUs = "-properties"
DetectedGPUs = 2
root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i GPU
TotalGPUs = 2
TotalSlotGPUs = 0
MachineResources = "Cpus Memory Disk Swap GPUs"
GPUs = 0
WithinResourceLimits = # long
DetectedGPUs = 2
childGPUs = { 0 }
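
A throwaway probe job should show whether anything CUDA-related reaches the job environment at all -- just a sketch, file names are arbitrary:

# env_probe.sub
universe     = vanilla
executable   = /usr/bin/env
output       = env_probe.$(Cluster).$(Process).out
error        = env_probe.$(Cluster).$(Process).err
log          = env_probe.log
request_cpus = 1
request_gpus = 1
queue

condor_submit that and grep the .out file for CUDA / _CONDOR_Assigned once it completes.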


This is how I set up the library environment for a user shell:

root@nemo-slave3000:~# cat /etc/profile.d/gpu.sh
GPU_SCRIPT=/usr/local/gpu/setup.sh
if [ -f $GPU_SCRIPT ] ; then
  . $GPU_SCRIPT
fi
root@nemo-slave3000:~# cat /usr/local/gpu/setup.sh
export CUDA_INSTALL_PATH=/usr/local/gpu/cuda
export PATH=${PATH}:/usr/local/gpu/cuda/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/gpu/cuda/lib64:/usr/local/gpu/cuda/lib


Should this also be sufficient for the Condor environment? If not, that would be an obvious place for the problem to hide.
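
In that case, a USER_JOB_WRAPPER that sources the same setup script before exec'ing the job would presumably paper over it -- untested sketch, the wrapper path is made up:

# in the condor config
USER_JOB_WRAPPER = /usr/local/gpu/condor_job_wrapper.sh

# /usr/local/gpu/condor_job_wrapper.sh (mode 0755)
#!/bin/sh
# give Condor jobs the same CUDA paths interactive shells get from /etc/profile.d/gpu.sh
if [ -f /usr/local/gpu/setup.sh ]; then
  . /usr/local/gpu/setup.sh
fi
# hand control to the real job, preserving its arguments
exec "$@"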

--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678


On Wed, Mar 26, 2014 at 4:14 PM, Tom Downes <tpdownes@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi:
>
> I've installed the Condor development series (8.1.4) on execute nodes that have GPUs; the rest of the Condor cluster is on 8.0.5. I am following the instructions at https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus to advertise the GPUs as part of the Machine ClassAd. The machine is configured as a single partitionable slot (with all CPUs/RAM/GPUs):
>
> MACHINE_RESOURCE_GPUs = $(LIBEXEC)/condor_gpu_discovery -properties
> ENVIRONMENT_FOR_AssignedGPUs = CUDA_VISIBLE_DEVICES
>
> slot_type_1_partitionable = true
> slot_type_1 = cpus=$(DETECTED_CORES), mem=$(DETECTED_MEMORY), gpus=auto
> num_slots_type_1 = 1
>
> This is what I get:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | grep -i gpu
> TotalGPUs = 2
> TotalSlotGPUs = 2
> MachineResources = "Cpus Memory Disk Swap GPUs"
> GPUs = 2
> WithinResourceLimits = # long reasonable expression
> AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery,-properties"
> DetectedGPUs = 2
> childGPUs = { 0,0 }
>
> Note, in particular, the value of AssignedGPUs: it is the discovery command line itself split into a list, not any actual device names. Also note this:
>
> root@nemo-slave3000:~# /usr/lib/condor/libexec/condor_gpu_discovery -properties
> DetectedGPUs="CUDA0, CUDA1"
> CUDACapability=3.0
> CUDADeviceName="GeForce GTX 690"
> CUDADriverVersion=6.0
> CUDAECCEnabled=false
> CUDAGlobalMemoryMb=2048
> CUDARuntimeVersion=5.50
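>
> (For completeness, the raw value the config system hands the startd for that macro can be dumped directly -- worth comparing against what ends up in AssignedGPUs:)
>
> condor_config_val MACHINE_RESOURCE_GPUs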
>
> Following a hunch from ticket #3386, I added the -dynamic argument:
>
> root@nemo-slave3000:~# /usr/lib/condor/libexec/condor_gpu_discovery -dynamic -properties
> DetectedGPUs="CUDA0, CUDA1"
> CUDACapability=3.0
> CUDADeviceName="GeForce GTX 690"
> CUDADriverVersion=6.0
> CUDAECCEnabled=false
> CUDAGlobalMemoryMb=2048
> CUDARuntimeVersion=5.50
> CUDA0FanSpeedPct=30
> CUDA0DieTempF=34
> CUDA1FanSpeedPct=30
> CUDA1DieTempF=32
>
> This results in:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i GPU
> TotalGPUs = 3
> TotalSlotGPUs = 3
> MachineResources = "Cpus Memory Disk Swap GPUs"
> GPUs = 3
> WithinResourceLimits = # long..
> AssignedGPUs = "/usr/lib/condor/libexec/condor_gpu_discovery,-properties,-dynamic"
> DetectedGPUs = 3
> childGPUs = { 0,0 }
>
> Note the detection of 3 GPUs according to Condor...
>
> So one issue is that I'm not sure if AssignedGPUs is correct. No matter what I do, the following command returns empty:
>
> root@nemo-slave3000:~# condor_status -long slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  | grep -i cuda
>
> --
> Tom Downes
> Associate Scientist and Data Center Manager
> Center for Gravitation, Cosmology and Astrophysics
> University of Wisconsin-Milwaukee
> 414.229.2678
>
>
>
> On Wed, Mar 12, 2014 at 4:06 PM, Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx> wrote:
> >
> > I've been running Condor for more than a decade now, but being
> > rather new to the Condor/GPU business, I'm having a hard time now.
> >
> > Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
> > to add two GPUs to the resources available to a standalone machine
> > with a number of CPU cores, by defining in condor_config.d/gpu:
> >
> > MACHINE_RESOURCE_NAMES    = GPUS
> > MACHINE_RESOURCE_GPUS     = 2
> >
> > SLOT_TYPE_1               = cpus=100%,auto
> > SLOT_TYPE_1_PARTITIONABLE = TRUE
> > NUM_SLOTS_TYPE_1          = 1
> >
> > I added a "request_gpus" line to my - otherwise rather simplistic -
> > submit file, specifying either 1 or 0.
> > This works - depending on the amount of free resources (obviously,
> > the GPUS are the least abundant one), jobs get matched and started.
> > Checking the output of condor_status -l for the individual dynamic
> > slots, the numbers look OK.
> > (I'm wondering whether I'd have to set request_gpus=0 somewhere.
> > Seems to default to 0 though.)
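> >
> > (Roughly, the submit file is nothing fancier than this -- the executable name is just a placeholder:)
> >
> > # placeholder executable; the real job script goes here
> > universe     = vanilla
> > executable   = my_gpu_job.sh
> > request_cpus = 1
> > request_gpus = 1
> > queue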
> >
> > Now the idea is to tell the job - via arguments, environment,
> > or a job wrapper - which GPU to use. This is where I ran out of
> > ideas.
> >
> > https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
> > suggests to use
> >   arguments = @...$$(AssignedGPUs)
> > but this macro cannot be expanded on job submission...
> >
> > There's no _CONDOR_AssignedGPUs in the "printenv" output.
> >
> > Even
> > # grep -i gpu /var/lib/condor/execute/dir_*/.{machine,job}.ad
> > doesn't show anything that looks helpful.
> >
> > Addition of a line
> > ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
> > as suggested in the wiki page shows no effect at all.
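> >
> > (What I'm after in the end is a trivial wrapper along these lines -- pure sketch, assuming those variables ever make it into the job environment; the binary name is made up:)
> >
> > #!/bin/sh
> > # the starter is supposed to export the assigned device(s) into the job environment
> > echo "CUDA_VISIBLE_DEVICES='${CUDA_VISIBLE_DEVICES}'"
> > echo "_CONDOR_AssignedGPUs='${_CONDOR_AssignedGPUs}'"
> > # run the actual GPU code against whatever devices this slot was handed
> > exec ./my_gpu_binary "$@"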
> >
> > Also, $(LIBEXEC)/condor_gpu_discovery doesn't work as expected:
> > # /usr/lib/condor/libexec/condor_gpu_discovery [-properties]
> > modprobe: FATAL: Module nvidia-uvm not found.
> > 2
> > (and -properties makes no difference)
> >
> > In the end, I'd like to have up to TotalGpus slots with a (or
> > both) GPU/s assigned to it/them, and $CUDA_VISIBLE_DEVICES or
> > another environment variable telling me (and a possible wrapper
> > script) the device numbers. (I also suppose that a non-GPU slot
> > would have to set $CUDA_VISIBLE_DEVICES to the empty string or
> > -1?)
> >
> > In an era of partitionable resources, will I still have to revert
> > to static assignments of the individual GPUs to static slots? I
> > don't hope so (as this doesn't provide an easy means to allocate
> > both GPUs to a single job)...
> >
> > Any suggestions?
> >
> > Thanks,
> >  S
> >
> > --
> > Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
> > MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
> > http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/