
[HTCondor-users] Detecting GPUs in a job slot



Greetings,
Are condor and condor_gpu_discovery detecting all the GPUs in the node, or only the ones assigned to the job?

I'd like to know if I'm doing the right thing in the glidein to configure its startd and advertise the GPUs to the user jobs.

Premise:
- the glidein is a job submitted via condor (different universes possible) that at the end starts a startd
- the glidein can get a whole node or be scheduled as a job in its slot (together w/ other jobs running on the same node)
- the batch system where the glidein is scheduled can be HTCondor or something else (PBS, SLURM, LSF, ...)
- the glidein prepares the configuration and starts a startd that will allow user jobs to use GPUs (if available)

When a glidein lands on a node/slot, it uses condor_gpu_discovery to find out about the GPUs.
The result is used internally to log what has been found (the function at the end parses the output).
Then condor is configured to discover the GPUs itself (config fragments are below).

Questions:
1. Is condor discovering the actual GPUs for the slot (i.e. the GPUs assigned to this glidein) or all GPUs available on the "hardware" (real or VM)?
2. If it is discovering the GPUs in the slot, does the discovery work on different batch systems?
3. Is condor_gpu_discovery (DetectedGPUs) consistent w/ the setting condor adds in the classad or is it finding all and condor filters that before making the machine ad?
4. Does it make sense to let the users specify the number of GPUs? I.e., is over-provisioning possible/handled by condor (like for CPUs)?
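
For question 1, one thing the glidein could log next to the discovery result is how many devices the outer batch system exposes through the environment. Here is a minimal sketch, assuming the batch system exports CUDA_VISIBLE_DEVICES as a comma-separated list for the GPUs it assigned (not every site or batch system does); the function name count_visible_gpus is hypothetical and not part of the glidein code:

```shell
# Hypothetical helper: count the devices listed in CUDA_VISIBLE_DEVICES.
# Assumption: the outer batch system sets this variable for the GPUs assigned to the glidein.
# Prints "unset" when the variable is not defined, otherwise the number of listed entries.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES+x}" ]; then
        echo "unset"
        return 0
    fi
    # Remove spaces so that "0, 1" and "0,1" are counted the same way
    local value="${CUDA_VISIBLE_DEVICES// /}"
    if [ -z "$value" ]; then
        # Defined but empty: no device is exposed
        echo 0
        return 0
    fi
    local -a devs
    IFS=',' read -r -a devs <<< "$value"
    echo "${#devs[@]}"
}
```

If this count and the number reported by condor_gpu_discovery differ, that would suggest the discovery is seeing more than what was assigned to the slot.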

Thank you,
Marco


PS: The function parsing the condor_gpu_discovery output is at the end of this message; its results are used internally by the glidein and logged.

Fragments from the condor_config used by the glidein for its startd:

This startd configuration lets condor auto-discover the GPUs:

# Declare GPUs resource, auto-discovered: ${i}
use feature : GPUs
# GPUsMonitor is automatically included in newer HTCondor
use feature : GPUsMonitor
GPU_DISCOVERY_EXTRA = -extra
# Protect against no GPUs found
if defined MACHINE_RESOURCE_GPUS
else
  MACHINE_RESOURCE_GPUS = 0
endif

We also have an option for users to force the number of GPUs (res_num is user-provided):

# Declare GPU resource, forcing ${res_num}: ${i}
use feature : GPUs
# GPUsMonitor is automatically included in newer HTCondor
use feature : GPUsMonitor
GPU_DISCOVERY_EXTRA = -extra
MACHINE_RESOURCE_GPUS = ${res_num}


This is the function used to parse condor_gpu_discovery for Glidein use:

function find_gpus_num {
    # Use the condor tools to find the available GPUs.
    # Prints the number of GPUs on stdout; messages go to stderr.
    if [ ! -f "$CONDOR_DIR/libexec/condor_gpu_discovery" ]; then
        echo "WARNING: condor_gpu_discovery not found" 1>&2
        return 1
    fi
    # Declared separately so that $? below is the exit code of the tool, not of 'local'
    local tmp1
    tmp1="$("$CONDOR_DIR"/libexec/condor_gpu_discovery)"
    local ec=$?
    if [ $ec -ne 0 ]; then
        echo "WARNING: condor_gpu_discovery failed (exit code: $ec)" 1>&2
        return $ec
    fi
    local tmp
    tmp="$(echo "$tmp1" | grep "^DetectedGPUs=")"
    # ${tmp:13} skips the 13-character "DetectedGPUs=" prefix
    if [ "${tmp:13}" = 0 ]; then
        echo "No GPUs found with condor_gpu_discovery, setting them to 0" 1>&2
        echo 0
        return
    fi
    # Word-split the list (e.g. DetectedGPUs="CUDA0, CUDA1") and count the items
    set -- $tmp
    echo "condor_gpu_discovery found $# GPUs: $tmp" 1>&2
    echo $#
}
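
The counting above relies on the items being separated by a comma followed by a space. To check the parsing separately from the tool, here is a minimal sketch that reads the output from stdin and splits on commas instead. It assumes the output contains a line of the form DetectedGPUs="CUDA0, CUDA1" (or DetectedGPUs=0 when nothing is found); the name count_detected_gpus is hypothetical and not part of the glidein code:

```shell
# Hypothetical helper: count the GPUs listed in condor_gpu_discovery output read from stdin.
# Assumption: the output has a line like DetectedGPUs="CUDA0, CUDA1" or DetectedGPUs=0.
count_detected_gpus() {
    local line value
    # No DetectedGPUs line at all is treated as zero GPUs
    line="$(grep -m1 '^DetectedGPUs=')" || { echo 0; return 0; }
    # Strip the attribute name, then the quotes and spaces around the items
    value="$(printf '%s' "$line" | sed -e 's/^DetectedGPUs=//' | tr -d '" ')"
    if [ -z "$value" ] || [ "$value" = 0 ]; then
        echo 0
        return 0
    fi
    # Split on commas and count the resulting items
    local -a items
    IFS=',' read -r -a items <<< "$value"
    echo "${#items[@]}"
}

# Possible use in the glidein (not run here):
#   "$CONDOR_DIR"/libexec/condor_gpu_discovery | count_detected_gpus
```

Reading from stdin means the same function can be fed saved output from different nodes or batch systems when comparing results.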