
[HTCondor-users] Standardizing Condor GPU interface



Hello,

Here at IceCube we are about to start using Condor to run jobs on both nVidia and AMD GPUs. We'd like our GPU jobs to be compatible with other sites, so we followed https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToManageGpus, which seems to be the closest thing to a standard for defining a Condor interface for GPU jobs. Based on our experience at IceCube, I thought I'd share a few ideas for improving that document so that it better accommodates mixed GPU environments.


GPU_API should be a list, since both CUDA and OpenCL can run on nVidia GPUs, but only OpenCL can be used with AMD cards.


Currently, the wiki does not mention a ClassAd attribute that identifies the GPU's manufacturer, e.g. for users who want to run only on AMD cards. We decided to include the manufacturer in GPU_NAME. However, one problem with GPU_NAME is that it's not obvious how its content should be formatted in order to be compatible across sites. We thought about keeping it consistent with lspci, but lspci output can be cryptic (e.g. a GTX 690 is listed as GK104), and nvidia-smi and clinfo don't quite work either. Right now we just set it manually in Puppet to things like "nVidia GeForce GTX 690" and "AMD Radeon HD 7970".


It may be useful to mention in the "Identify the GPU" section that both CUDA and OpenCL use environment variables to control which GPUs an application may run on. We use something like the following in our USER_JOB_WRAPPER script to set them automatically, so that things don't break if the user forgets to set the environment appropriately in the submit file:

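#!/bin/bash
# Read the GPU device index assigned to this slot from the machine ad.
gpu_dev=$(awk -F ' = ' '/^GPU_DEV = /{print $2}' "$_CONDOR_MACHINE_AD")
# Restrict CUDA applications to that device.
export CUDA_VISIBLE_DEVICES=$gpu_dev
# Restrict OpenCL applications on AMD cards to that device.
export COMPUTE=:0.$gpu_dev
export GPU_DEVICE_ORDINAL=$gpu_dev
# Hand control over to the actual job.
exec "$@"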
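
A slightly more defensive variant of the same idea is sketched below. It is only an illustration, not what we run: the function names gpu_dev_from_ad and export_gpu_env are made up for this example, and it assumes that GPU_DEV appears in the machine ad as a single "GPU_DEV = <value>" line, possibly quoted. The difference from the wrapper above is that nothing is exported when the slot advertises no GPU_DEV, so CPU-only jobs keep an untouched environment.

```shell
#!/bin/bash
# Hypothetical helper: print the GPU_DEV value found in a machine ad file.
# Prints nothing if the file is unreadable or has no GPU_DEV line.
gpu_dev_from_ad() {
    local ad_file="$1"
    [ -r "$ad_file" ] || return 0
    # Strip quotes in case the value was advertised as a string.
    awk -F ' = ' '/^GPU_DEV = /{gsub(/"/, "", $2); print $2; exit}' "$ad_file"
}

# Hypothetical helper: export the GPU selection variables only when the
# machine ad actually names a device.
export_gpu_env() {
    local gpu_dev
    gpu_dev=$(gpu_dev_from_ad "$1")
    if [ -n "$gpu_dev" ]; then
        export CUDA_VISIBLE_DEVICES="$gpu_dev"
        export COMPUTE=":0.$gpu_dev"
        export GPU_DEVICE_ORDINAL="$gpu_dev"
    fi
}

# In a wrapper script these would be followed by:
#   export_gpu_env "$_CONDOR_MACHINE_AD"
#   exec "$@"
```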

One problem we encountered is that users who run primarily GPU jobs tend to have much better priority than users who primarily run CPU jobs (because there are many fewer GPUs than CPUs). This results in heavy CPU users being almost completely locked out of using GPUs. We added the following to reduce the severity of this problem, which may also be useful for the wiki:

SlotWeight = ifthenelse(isUndefined(HAS_GPU), Cpus, 100)
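
To make the effect concrete, here is a rough model of that expression in shell. It is a simplification for illustration only: the function names slot_weight and weighted_usage are invented, and it assumes that accounting charges usage in proportion to slot weight multiplied by run time.

```shell
#!/bin/bash
# Hypothetical model of the expression above: a slot without a HAS_GPU
# attribute weighs as much as its CPU count; any slot that defines it weighs 100.
slot_weight() {
    local cpus="$1" has_gpu="${2:-}"
    if [ -z "$has_gpu" ]; then
        echo "$cpus"
    else
        echo 100
    fi
}

# Usage charged for occupying a slot for a number of hours (assumed model).
weighted_usage() {
    local hours="$1" cpus="$2" has_gpu="${3:-}"
    echo $(( hours * $(slot_weight "$cpus" "$has_gpu") ))
}
```

Under these assumptions, "weighted_usage 10 1" gives 10 for ten hours on a single-CPU slot, while "weighted_usage 10 1 true" gives 1000 for ten hours on a GPU slot, so heavy GPU users see their priority worsen much faster than before.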


So these are my two cents. I'd be really interested to hear what other people are doing.

Vlad