Re: [HTCondor-users] Adding GPUs to machine resources



On Wed, Mar 12, 2014 at 04:06:46PM +0100, Steffen Grunewald wrote:
> 
> Following http://spinningmatt.wordpress.com/2012/11/19, I have tried
> to add two GPUs to the resources available to a standalone machine
> with a number of CPU cores, by defining in condor_config.d/gpu:
> 
> MACHINE_RESOURCE_NAMES    = GPUS
> MACHINE_RESOURCE_GPUS     = 2
> 
> SLOT_TYPE_1               = cpus=100%,auto
> SLOT_TYPE_1_PARTITIONABLE = TRUE
> NUM_SLOTS_TYPE_1          = 1
> 
> I added a "request_gpus" line to my - otherwise rather simplistic -
> submit file, specifying either 1 or 0.
> This works: depending on the amount of free resources (the GPUS
> obviously being the least abundant), jobs get matched and started.
> Checking the output of condor_status -l for the individual dynamic
> slots, the numbers look OK.
> (I'm wondering whether I'd have to set request_gpus=0 somewhere.
> Seems to default to 0 though.)
> 
> Now the idea is to tell the job - via arguments, environment,
> or a job wrapper - which GPU to use. This is where I ran out of
> ideas.
> 
> https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus
> 
> Addition of a line
> ENVIRONMENT_FOR_AssignedGpus = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL
> as suggested in the wiki page shows no effect at all.
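
(For reference, the submit file mentioned above was essentially just a
minimal sketch along these lines, with gpu-job.sh standing in for the
real executable:

  universe     = vanilla
  executable   = gpu-job.sh
  request_cpus = 1
  request_gpus = 1
  queue
)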

I eventually found that upgrading from 8.0.5 to 8.1.4 adds the
functionality I was looking for, and that even the condor_gpu_discovery
tool yields better results:

root@krakatoa# /usr/lib/condor/libexec/condor_gpu_discovery -properties
modprobe: FATAL: Module nvidia-uvm not found.
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=3.5
CUDADeviceName="Tesla K20c"
CUDADriverVersion=6.0
CUDAECCEnabled=false
CUDAGlobalMemoryMb=4800
CUDARuntimeVersion=5.50
root@krakatoa# /usr/lib/condor/libexec/condor_gpu_discovery -properties -dynamic
modprobe: FATAL: Module nvidia-uvm not found.
DetectedGPUs="CUDA0, CUDA1"
CUDACapability=3.5
CUDADeviceName="Tesla K20c"
CUDADriverVersion=6.0
CUDAECCEnabled=false
CUDAGlobalMemoryMb=4800
CUDARuntimeVersion=5.50
CUDA0FanSpeedPct=36
CUDA0PowerUsage_mw=49804
CUDA0DieTempF=45
CUDA0EccErrorsSingleBit=0
CUDA0EccErrorsDoubleBit=0
CUDA1FanSpeedPct=33
CUDA1PowerUsage_mw=43265
CUDA1DieTempF=44
CUDA1EccErrorsSingleBit=0
CUDA1EccErrorsDoubleBit=0

As the "nvidia" module had already been loaded, the "FATAL" modprobe
error seems to have no ill side-effects (and I suppose the stderr
output gets dropped anyway).
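
(Easy to check by silencing stderr explicitly when running the same
command by hand:

  /usr/lib/condor/libexec/condor_gpu_discovery -properties 2>/dev/null
)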

I'll proceed with MACHINE_RESOURCE_INVENTORY_GPUS and work my way
through the rest of the configuration...
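
Roughly, the GPU part of condor_config.d/gpu should then boil down to
something like this (a sketch based on the wiki page, replacing the
static MACHINE_RESOURCE_GPUS = 2 with the discovery tool; untested on
my side so far):

  MACHINE_RESOURCE_INVENTORY_GPUS = $(LIBEXEC)/condor_gpu_discovery -properties
  ENVIRONMENT_FOR_AssignedGPUs    = CUDA_VISIBLE_DEVICES, GPU_DEVICE_ORDINAL

With that, the startd should pick up DetectedGPUs from the tool's
output, and a job that requests GPUs should find its assigned
device(s) in CUDA_VISIBLE_DEVICES.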

Thanks to all who responded.

- S