
Re: [HTCondor-users] Detecting GPUs in a job slot



Hi Marco,

1 & 2: GPU discovery is independent of batch system and works by looking at the system hardware. We at IceCube asked for some sort of slot restriction feature to be added to discovery, and here is what the developers came up with:

If CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL is set in the environment when condor_gpu_discovery is run, it will report only devices present in those lists.

So if you get a whole node, you don't need to set those variables and it will find all the GPUs. If you have a smaller slot, those variables should be set to limit the GPUs discovered. Note that this is a newer feature added within the last few point releases.
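
As a minimal sketch of the point above, a glidein could log which of the two variables is set before it calls discovery, so the log shows whether the reported GPUs were restricted. The function name gpu_mask_in_effect is hypothetical. Reporting both variables rather than choosing one is an assumption, since the precedence between them is not stated here.

```shell
#!/bin/bash
# Hypothetical helper: print each device-mask variable that is set in the
# environment, or "none" when discovery would see every GPU on the node.
gpu_mask_in_effect() {
  local found=0
  # ${VAR+x} expands to "x" only when VAR is set (even if it is empty)
  if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    found=1
  fi
  if [ -n "${GPU_DEVICE_ORDINAL+x}" ]; then
    echo "GPU_DEVICE_ORDINAL=$GPU_DEVICE_ORDINAL"
    found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "none"
  fi
}

# Possible use before discovery (path as in the glidein function):
#   gpu_mask_in_effect 1>&2
#   "$CONDOR_DIR/libexec/condor_gpu_discovery"
```

If the batch system does not set either variable for a partial-node slot, the glidein would have to set one itself for the restriction to take effect.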

3. I'm unsure what setting MACHINE_RESOURCE_GPUS will do to the detection, and will let someone else answer that.

4. Yes, request_gpus=2 or similar should work fine from a user perspective.

David

On Thu, Sep 24, 2020 at 12:39 PM Marco Mambelli <marcom@xxxxxxxx> wrote:
Greetings,
Are condor and condor_gpu_discovery detecting all the GPUs in the node, or only the ones assigned to the job?

I'd like to know if I'm doing the right thing in the glidein to configure its startd and advertise the GPUs to the user jobs.

Premise:
- the glidein is a job submitted via condor (different universes possible) that at the end starts a startd
- the glidein can get a whole node or be scheduled as a job in its slot (together w/ other jobs running on the same node)
- the batch system where the glidein is scheduled can be HTCondor or something else (PBS, SLURM, LSF, ...)
- the glidein prepares the configuration and starts a startd that will allow user jobs to use GPUs (if available)

When a glidein lands on a node/slot, it uses condor_gpu_discovery to find out about the GPUs.
The output is used internally to log what has been found (we use the function at the end to parse it).
Then condor is configured to discover the GPUs (config fragments are below).

Questions:
1. Is condor discovering the actual GPUs for the slot (i.e. the GPUs assigned to this glidein) or all GPUs available on the "hardware" (real or VM)?
2. If it is discovering the GPUs in the slot, does the discovery work on different batch systems?
3. Is condor_gpu_discovery (DetectedGPUs) consistent w/ the setting condor adds in the classad or is it finding all and condor filters that before making the machine ad?
4. Does it make sense to let the users specify the number of GPUs? I.e., is over-provisioning possible/handled by condor (like for CPUs)?

Thank you,
Marco


PS: Below are the config fragments and the function that parses the condor_gpu_discovery output; its results are used internally by the glidein and logged.

Fragments from the condor_config used by the glidein for its startd

The configuration for the startd is letting condor auto-discover:

# Declare GPUs resource, auto-discovered: ${i}
use feature : GPUs
# GPUsMonitor is automatically included in newer HTCondor
use feature : GPUsMonitor
GPU_DISCOVERY_EXTRA = -extra
# Protect against no GPUs found
if defined MACHINE_RESOURCE_GPUS
else
 MACHINE_RESOURCE_GPUS = 0
endif

We also have an option for users to force the number of GPUs (res_num is user-provided):

# Declare GPU resource, forcing ${res_num}: ${i}
use feature : GPUs
# GPUsMonitor is automatically included in newer HTCondor
use feature : GPUsMonitor
GPU_DISCOVERY_EXTRA = -extra
MACHINE_RESOURCE_GPUS = ${res_num}
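
For illustration, the two fragments above could be produced by one shell helper that switches on whether a count was forced. This is only a sketch: the function name write_gpu_config and the integer check on res_num are assumptions, not the glidein's actual code.

```shell
#!/bin/bash
# Hypothetical helper: print the startd GPU config fragment on stdout.
# $1 (optional): forced GPU count (res_num); when omitted, auto-discovery is used.
write_gpu_config() {
  local res_num="${1:-}"
  # Assumption: a forced count must be a non-negative integer
  if [ -n "$res_num" ] && ! [[ "$res_num" =~ ^[0-9]+$ ]]; then
    echo "WARNING: invalid GPU count: $res_num" 1>&2
    return 1
  fi
  # Lines common to both fragments
  cat <<'EOF'
use feature : GPUs
use feature : GPUsMonitor
GPU_DISCOVERY_EXTRA = -extra
EOF
  if [ -n "$res_num" ]; then
    # Forced number of GPUs
    echo "MACHINE_RESOURCE_GPUS = $res_num"
  else
    # Auto-discovery, protected against no GPUs being found
    cat <<'EOF'
if defined MACHINE_RESOURCE_GPUS
else
 MACHINE_RESOURCE_GPUS = 0
endif
EOF
  fi
}
```

Calling write_gpu_config 2 would emit the forced variant, and calling it with no argument would emit the auto-discovered one.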


This is the function used to parse condor_gpu_discovery for Glidein use:

function find_gpus_num {
  # Use the condor tools to find the available GPUs.
  # Prints the GPU count on stdout; diagnostics go to stderr.
  if [ ! -f "$CONDOR_DIR/libexec/condor_gpu_discovery" ]; then
    echo "WARNING: condor_gpu_discovery not found" 1>&2
    return 1
  fi
  # Declared separately so that $? below reflects the command, not 'local'
  local tmp1
  tmp1="$("$CONDOR_DIR"/libexec/condor_gpu_discovery)"
  local ec=$?
  if [ $ec -ne 0 ]; then
    echo "WARNING: condor_gpu_discovery failed (exit code: $ec)" 1>&2
    return $ec
  fi
  local tmp
  tmp="$(echo "$tmp1" | grep "^DetectedGPUs=")"
  # "DetectedGPUs=" is 13 characters long; the value follows it
  if [ "${tmp:13}" = 0 ]; then
    echo "No GPUs found with condor_gpu_discovery, setting them to 0" 1>&2
    echo 0
    return
  fi
  # Word splitting counts the entries (relies on whitespace between GPU names)
  set -- $tmp
  echo "condor_gpu_discovery found $# GPUs: $tmp" 1>&2
  echo $#
}
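
The counting step in the function above depends on whitespace between the GPU names. A variant that splits on commas instead could look like the sketch below. It assumes the output contains a line such as DetectedGPUs="CUDA0, CUDA1", or DetectedGPUs=0 when nothing is found. The function name count_detected_gpus is hypothetical, and it reads the discovery output from stdin so it can be tried without a GPU node.

```shell
#!/bin/bash
# Hypothetical helper: read condor_gpu_discovery output on stdin and
# print the number of GPUs listed on the DetectedGPUs line.
count_detected_gpus() {
  local line value
  # Take the first DetectedGPUs line; tolerate its absence
  line="$(grep -m1 '^DetectedGPUs=' || true)"
  # Drop the attribute name, then the quotes and spaces around the names
  value="$(printf '%s' "${line#DetectedGPUs=}" | tr -d '" ')"
  if [ -z "$value" ] || [ "$value" = "0" ]; then
    echo 0
    return
  fi
  # Split the remaining comma-separated names into an array and count them
  local -a ids
  IFS=',' read -r -a ids <<< "$value"
  echo "${#ids[@]}"
}

# Possible use:
#   "$CONDOR_DIR/libexec/condor_gpu_discovery" | count_detected_gpus
```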