
Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU



Hi Eric,

Sure, I'm happy to. Like I said, our machines have one partitionable job slot for each GPU, so the worker job slot config looks something like this:

>> NUM_SLOTS = 2
>> NUM_SLOTS_TYPE_1 = 1
>> SLOT_TYPE_1 = cpus=3,mem=30122
>> SLOT_TYPE_1_PARTITIONABLE = true
>> SLOT_TYPE_1_GPU_NUM = 0
>> NUM_SLOTS_TYPE_2 = 1
>> SLOT_TYPE_2 = cpus=3,mem=30122
>> SLOT_TYPE_2_PARTITIONABLE = true
>> SLOT_TYPE_2_GPU_NUM = 1
>> GPU_MEMORY = 8000
>> MACHINE_RESOURCE_GPUMEMORY = 16000
>>
>> STARTD_ATTRS = GPU_NUM, $(STARTD_ATTRS)

The default case of using full GPUs is handled by condor with
>> use feature : gpus

The key parts here are that we set a STARTD_ATTR called GPU_NUM for each slot, which is later used to set CUDA_VISIBLE_DEVICES, and that we add a new machine resource, GPUMEMORY (in this instance we have two identical GPUs with 8 GB of VRAM each).
A user can then request a certain amount of GPU memory in their submit file the same way they would request other machine resources:
>> Request_GpuMemory    = 2000
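
Just as an illustration (the executable name and the other numbers below are placeholders, not taken from one of our real jobs), a complete submit file for such a shared-GPU job might look roughly like this:
>> Executable           = run_inference.sh
>> Request_Cpus         = 1
>> Request_Memory       = 4096
>> Request_GpuMemory    = 2000
>> # Request_GPUs is deliberately left unset, so the START expression treats this as a shared-GPU job
>> Queue
A job that needs a full GPU would instead use Request_GPUs = 1 as usual and leave Request_GpuMemory unset.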

Since we allow both the request of full GPUs and the request of only a part of the memory, we make sure that they don't collide. If some GPU memory is already used, no full GPU can be requested, and vice versa. This is done in the START expression:
>> START = (IfThenElse(target.RequestGpuMemory =?= UNDEFINED, 0, target.RequestGpuMemory) == 0 || my.GPUs == my.TotalSlotGPUs) && \
>>      (IfThenElse(target.RequestGPUs =?= UNDEFINED, 0, target.RequestGPUs) == 0 || my.GpuMemory == my.TotalSlotGpuMemory)

Setting the CUDA_VISIBLE_DEVICES environment variable is done in a user job wrapper [1], which is defined in the worker config:
>> USER_JOB_WRAPPER=/etc/condor/set_cuda_env

To monitor the GPU memory usage separately for each condor job, we replace the default GPU monitoring script with our own [2].
>> # GPU Memory monitor
>> STARTD_CRON_GPUsMEMORY_MONITOR_EXECUTABLE=/etc/condor/monitor_gpus.py
>> STARTD_CRON_GPUsMEMORY_MONITOR_METRICS = PEAK:GPUsMemory
>> STARTD_CRON_GPUsMEMORY_MONITOR_MODE = Periodic
>> STARTD_CRON_GPUsMEMORY_MONITOR_PERIOD = 30
>>
>> STARTD_CRON_GPUs_MONITOR_EXECUTABLE=/bin/false
>> STARTD_CRON_JOBLIST=$(STARTD_CRON_JOBLIST),GPUsMEMORY_MONITOR
>>
>> STARTD_JOB_ATTRS=$(STARTD_JOB_ATTRS),GPUsMemory
>> UPDATE_INTERVAL=30

Finally, if the used GPU memory is reported to be larger than the requested amount, the job is removed by the SYSTEM_PERIODIC_REMOVE macro:
>> SYSTEM_PERIODIC_REMOVE = $(SYSTEM_PERIODIC_REMOVE) || ((GPUsMemoryUsage > RequestGpuMemory) && (RequestGPUs == 0))

Best regards,
Yannik

------------------------------------

[1]
#!/bin/bash

if [ "$_CONDOR_MACHINE_AD" != "" ]; then
    GPU_NUM="$(egrep '^GPU_NUM' "$_CONDOR_MACHINE_AD" | cut -d ' ' -f 3)"
    SLOT_GPUS="$(egrep '^TotalSlotGPUs' "$_CONDOR_MACHINE_AD" | cut -d ' ' -f 3)"
    SLOT_GPUMEM="$(egrep '^TotalSlotGPUMEMORY' "$_CONDOR_MACHINE_AD" | cut -d ' ' -f 3)"

    # If GPU number is defined (on the partitionable slot) and the job is a GPU job, set visible device
    if [[ "$GPU_NUM" != "" ]] && ( [[ "$SLOT_GPUS" != "0" ]] || [[ "$SLOT_GPUMEM" != "0" ]] ); then
        export CUDA_VISIBLE_DEVICES="$GPU_NUM"
    else
        export CUDA_VISIBLE_DEVICES="-1"
    fi
fi

exec "$@"


[2]
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from subprocess import check_output
from collections import defaultdict
from psutil import Process

starter_signature = ["condor_starter", "-f", "-a"]


def query(kind, values):
    if not isinstance(values, dict):
        values = {v: str for v in values}

    gpu_query = check_output(["nvidia-smi", "--query-{}={}".format(kind, ",".join(values.keys())),
        "--format=csv,nounits,noheader"], universal_newlines=True)

    query_results = []
    for line in gpu_query.splitlines():
        line_results = {}
        for (key, type_converter), value in zip(values.items(), line.strip().split(", ")):
            line_results[key] = type_converter(value)
        query_results.append(line_results)

    return query_results


def get_slot(pid):
    process = Process(pid)
    while process:
        cmdline = process.cmdline()
        if cmdline[:3] == starter_signature:
            return cmdline[3]
        process = process.parent()
    return None


def get_slot_updates(slot_name, values):
    slot_info = []
    for attr, value in values.items():
        slot_info.append("Uptime{}PeakUsage = {}".format(attr, value))
    slot_info.append('SlotName = "{}@"'.format(slot_name))
    slot_info.append("- {}".format(slot_name))
    return slot_info


gpu_info = query("gpu", {"index": int, "utilization.gpu": int})
application_info = query("compute-apps", {"pid": int, "used_gpu_memory": int})

# get total memory usage for each (partitioned) job
slot_gpu_memories = defaultdict(int)
for line in application_info:
    slot_id = get_slot(line["pid"])
    if slot_id:
        slot_gpu_memories[slot_id] += line["used_gpu_memory"]

updates = []
for slot_id, memory in slot_gpu_memories.items():
    updates.extend(get_slot_updates(slot_id, {"GPUsMemory": memory}))

if updates:
    updates.append("- update:true")
    print("\n".join(updates))



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Eric Sedore via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Monday, November 30, 2020 04:26
To: HTCondor-Users Mail List
Cc: Eric Sedore
Subject: Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU
 

Thanks Yannik – yes, if you have time and are willing that would be very helpful.

 

-Eric

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Rath, Yannik
Sent: Wednesday, November 25, 2020 6:51 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Hi Eric,

we also have a number of jobs on our cluster that do not use a full GPU. We ended up with a solution that is rather specialized to our use case, but maybe it happens to align with yours.

For each GPU on a machine, we have one partitionable job slot.
This is one limitation of our approach, as it means we have to associate a certain fraction of RAM and CPU cores with each GPU, and that the same machine cannot run jobs that require multiple GPUs.

We add an additional resource to the job slot, which we name GPUMemory. A user can request either a full GPU as usual or a certain amount of GPU memory (or, of course, neither for non-GPU jobs).
In our job start expression we make sure these things don't collide, i.e. a full GPU can't be requested if part of its memory is already used and vice versa.

The job slot also has a configuration variable that identifies the associated GPU, which is used to set the CUDA_VISIBLE_DEVICES environment variable in a user job wrapper.

Finally, we have a monitoring script for the used GPU memory, so that condor kills jobs using more memory than they requested.

In case this sounds like something that would make sense for you, I can collect the configuration parts and share them here.

Best regards,
Yannik

 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, November 24, 2020 18:08
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Hi Eric.   

 

NVIDIA is adding the ability to share a GPU between processes in newer hardware, with hardware enforcement of memory isolation between processes. HTCondor does plan to support that, but it does not yet, and I don’t think the NVIDIA devices that support this are very common yet. This is work in progress…

 

However, you can share a GPU between processes *without* any kind of protection between them simply by having more than one process set the environment variable CUDA_VISIBLE_DEVICES to the same value.

 

You can get HTCondor to do this just by having the same device show up more than once in the device enumeration.  

 

For instance, if you have two GPUs and your configuration is

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1

 

You can run two jobs on each GPU by configuring

 

MACHINE_RESOURCE_GPUS = CUDA0, CUDA1, CUDA0, CUDA1

 

If you don’t use the MACHINE_RESOURCE_GPUS knob and instead use HTCondor’s GPU detection, you can use the same trick; it’s just a little more work.

 

# enable GPU discovery

use FEATURE : GPUs

# then override the GPU device enumeration with a wrapper script that duplicates the detected GPUs

MACHINE_RESOURCE_INVENTORY_GPUs = $(ETC)/bin/condor_gpu_discovery.sh $(1) -properties $(GPU_DISCOVERY_EXTRA)

 

The wrapper script $(ETC)/bin/condor_gpu_discovery.sh is something that you need to write.

 

condor_gpu_discovery produces output like this

 

DetectedGPUs="CUDA0, CUDA1"

CUDACapability=6.0

CUDADeviceName="Tesla P100-PCIE-16GB"

CUDADriverVersion=11.0

CUDAECCEnabled=true

CUDAGlobalMemoryMb=16281

CUDAMaxSupportedVersion=11000

CUDA0DevicePciBusId="0000:3B:00.0"

CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"

CUDA1DevicePciBusId="0000:D8:00.0"

CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"

 

Your wrapper script should produce the same output, but with a modified value for DetectedGPUs like this

 

DetectedGPUs="CUDA0, CUDA1, CUDA0, CUDA1"

CUDACapability=6.0

CUDADeviceName="Tesla P100-PCIE-16GB"

CUDADriverVersion=11.0

CUDAECCEnabled=true

CUDAGlobalMemoryMb=16281

CUDAMaxSupportedVersion=11000

CUDA0DevicePciBusId="0000:3B:00.0"

CUDA0DeviceUuid="dddddddd-dddd-dddd-dddd-dddddddddddd"

CUDA1DevicePciBusId="0000:D8:00.0"

CUDA1DeviceUuid="cccccccc-cccc-cccc-cccc-cccccccccccc"
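
A minimal sketch of such a wrapper (untested, and assuming the real condor_gpu_discovery binary is on the PATH and that listing every device twice is all you need) could simply rewrite the DetectedGPUs line and pass everything else through:

#!/bin/bash
# Hypothetical wrapper sketch: run the stock GPU discovery, then repeat the
# detected device list once more so HTCondor will schedule two jobs per GPU.
condor_gpu_discovery "$@" | sed 's/^DetectedGPUs="\(.*\)"/DetectedGPUs="\1, \1"/'

Saved as $(ETC)/bin/condor_gpu_discovery.sh and made executable, this would produce the duplicated enumeration shown above.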

 

-tj

 

 

 

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Eric Sedore via HTCondor-users
Sent: Thursday, November 19, 2020 11:44 PM
To: htcondor-users@xxxxxxxxxxx
Cc: Eric Sedore <essedore@xxxxxxx>
Subject: [HTCondor-users] Running multiple jobs simultaneously on a single GPU

 

Good evening everyone,

 

I’ve listened to a few presentations that mentioned there is a way (either ready now or planned) to allow multiple jobs to utilize a single GPU.  This would be helpful as we have a number of workloads/jobs that do not consume the entire GPU (memory or processing).  Is there documentation (apologies if I missed it) that would assist with how to set up this configuration?

 

Happy to provide more of a description if my question is not clear.

 

Thanks,

-Eric