
Re: [HTCondor-users] Detecting GPUs in a job slot



Thanks David,
I'll have a look at those variables.

> On Sep 24, 2020, at 13:23, David Schultz <david.schultz@xxxxxxxxxxxxxxxx> wrote:
> 
> Hi Marco,
> 
> 1 & 2: GPU discovery is independent of batch system and works by looking at the system hardware.  We at IceCube asked for some sort of slot restriction feature to be added to discovery, and here is what the developers came up with:
> 
> If CUDA_VISIBLE_DEVICES or GPU_DEVICE_ORDINAL is set in the environment when condor_gpu_discovery is run, it will report only devices present in those lists.
> 
> So if you get a whole node, you don't need to set those variables and it will find all the GPUs.  If you have a smaller slot, those variables should be set to limit the GPUs discovered.  Note that this is a newer feature added within the last few point releases.
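
(For reference, a quick way to exercise that behavior by hand, assuming condor_gpu_discovery is in its usual libexec location; the expected output is my guess from your description:)

# restrict discovery to the first device only
export CUDA_VISIBLE_DEVICES=0
"$CONDOR_DIR"/libexec/condor_gpu_discovery
# expected: a DetectedGPUs= line listing only CUDA0
# remove the restriction and all GPUs on the node should be reported
unset CUDA_VISIBLE_DEVICES
"$CONDOR_DIR"/libexec/condor_gpu_discovery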

Do you know if condor sets these (CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL) when it assigns a slot to a job?
E.g. my glidein is a condor job running in a partitionable slot of a node w/ a GPU.
- Will condor set those variables when I request (or do not request) some GPUs?
- Will condor then police whether I use more GPUs than I requested/was assigned?
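
For now this is what I plan to check from inside a running job (a trivial sketch, just echoing the two variables under discussion):

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
echo "GPU_DEVICE_ORDINAL=${GPU_DEVICE_ORDINAL:-unset}"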

> 
> 3. I'm unsure what setting MACHINE_RESOURCE_GPUS will do to the detection, and will let someone else answer that.
> 
> 4. Yes, request_gpus=2 or similar should work fine from a user perspective.
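
(From the user side that would be something like this minimal submit file sketch, if I understand correctly; the executable name is just a placeholder:)

universe     = vanilla
executable   = my_gpu_job.sh
request_GPUs = 2
request_CPUs = 1
queue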

I was actually asking from a system (worker node) admin point of view. Does it make sense to advertise more GPUs than there actually are on the node?
This is done for CPUs (cores) at times to allow over-provisioning (or sometimes fewer are advertised, to under-provision and leave more memory per core).
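
To sketch what I mean with the CPU knobs for comparison (NUM_CPUS can be set to an expression overriding the detected core count, if I read the docs right; whether MACHINE_RESOURCE_GPUS behaves the same way is exactly my question):

# over-provision cores: advertise twice the detected CPUs
NUM_CPUS = 2 * $(DETECTED_CPUS)
# would this similarly over-provision GPUs on a 2-GPU node?
MACHINE_RESOURCE_GPUS = 4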

Thanks,
Marco


> 
> David
> 
> On Thu, Sep 24, 2020 at 12:39 PM Marco Mambelli <marcom@xxxxxxxx> wrote:
> Greetings,
> are condor and condor_gpu_discovery detecting all the GPUs on the node, or only the ones assigned to the job?
> 
> I'd like to know if I'm doing the right thing in the glidein to configure its startd and advertise the GPUs to the user jobs.
> 
> Premise:
> - the glidein is a job submitted via condor (different universes are possible) that in the end starts a startd
> - the glidein can get a whole node or be scheduled as a job in a slot (together w/ other jobs running on the same node)
> - the batch system where the glidein is scheduled can be HTCondor or something else (PBS, SLURM, LSF, ...)
> - the glidein prepares the configuration and starts a startd that will allow user jobs to use GPUs (if available)
> 
> When a glidein lands on a node/slot, it uses condor_gpu_discovery to find out about the GPUs.
> This is used internally to log what has been found (we use the function at the end to parse the output).
> Then condor is configured to discover the GPUs (config fragments are below).
> 
> Questions:
> 1. Is condor discovering the actual GPUs for the slot (i.e. the GPUs assigned to this glidein) or all GPUs available on the "hardware" (real or VM)?
> 2. If it is discovering the GPUs in the slot, does the discovery work on different batch systems?
> 3. Is condor_gpu_discovery (DetectedGPUs) consistent w/ the value condor adds to the classad, or does it find all GPUs that condor then filters before making the machine ad?
> 4. Does it make sense to let the users specify the number of GPUs? I.e. is over-provisioning possible/handled by condor (like for CPUs)?
> 
> Thank you,
> Marco
> 
> 
> PS: Here is the function parsing condor_gpu_discovery; the results are used internally by the glidein and logged.
> 
> Fragments from the condor_config used by the glidein for its startd
> 
> The configuration for the startd lets condor auto-discover:
> 
> # Declare GPUs resource, auto-discovered: ${i}
> use feature : GPUs
> # GPUsMonitor is automatically included in newer HTCondor
> use feature : GPUsMonitor
> GPU_DISCOVERY_EXTRA = -extra
> # Protect against no GPUs found
> if defined MACHINE_RESOURCE_GPUS
> else
>   MACHINE_RESOURCE_GPUS = 0
> endif
> 
> We also have an option for users to force the number of GPUs (res_num is user-provided):
> 
> # Declare GPU resource, forcing ${res_num}: ${i}
> use feature : GPUs
> # GPUsMonitor is automatically included in newer HTCondor
> use feature : GPUsMonitor
> GPU_DISCOVERY_EXTRA = -extra
> MACHINE_RESOURCE_GPUS = ${res_num}
> 
> 
> This is the function used to parse the condor_gpu_discovery output for glidein use:
> 
> function find_gpus_num {
>     # Use HTCondor's own tool to count the GPUs visible on the node/slot.
>     # Prints the number of GPUs on stdout; diagnostics go to stderr.
>     if [ ! -f "$CONDOR_DIR/libexec/condor_gpu_discovery" ]; then
>         echo "WARNING: condor_gpu_discovery not found" 1>&2
>         return 1
>     fi
>     local tmp1
>     tmp1=$("$CONDOR_DIR"/libexec/condor_gpu_discovery)
>     local ec=$?
>     if [ $ec -ne 0 ]; then
>         echo "WARNING: condor_gpu_discovery failed (exit code: $ec)" 1>&2
>         return $ec
>     fi
>     # Keep only the summary line, e.g. DetectedGPUs=0 or
>     # DetectedGPUs="CUDA0, CUDA1"
>     local tmp
>     tmp=$(echo "$tmp1" | grep "^DetectedGPUs=")
>     if [ "${tmp#DetectedGPUs=}" = 0 ]; then
>         echo "No GPUs found with condor_gpu_discovery, setting them to 0" 1>&2
>         echo 0
>         return
>     fi
>     # Word-splitting the line yields one token per device, so $# is the count
>     set -- $tmp
>     echo "condor_gpu_discovery found $# GPUs: $tmp" 1>&2
>     echo $#
> }
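> 
> For completeness, the glidein consumes it roughly like this (paraphrased sketch; the variable name is made up):
> 
> num_gpus=$(find_gpus_num) || num_gpus=0
> # num_gpus then feeds MACHINE_RESOURCE_GPUS in the forced configuration above
> echo "startd will advertise $num_gpus GPUs" 1>&2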
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/