
[HTCondor-users] Prioritizing GPU jobs on partitionable/dynamic slots


I am reconfiguring our cluster to use dynamic GPU slots instead of static ones, and I'm having trouble figuring out how to keep non-GPU jobs from starving GPU jobs without wasting or over-committing resources.

For example, with the slot definition below, two single-core CPU jobs, or a single job that requests 2 GB of memory, will block GPU jobs from landing on this node:
SLOT_TYPE_1 = cpus=2, mem=2GB, gpus=2

Ideally, I'd like non-GPU jobs to be killed one by one, starting with the youngest, until there is room for a GPU job, but only if there are idle GPU jobs in the queue that could use this machine (if it weren't for the CPU jobs). I have no idea how to implement this, though, without external scripts.

The only ways I can think of to prevent GPU job starvation are either to create a separate partitionable slot just for GPU jobs, or to keep a single partitionable slot and use START and APPEND_REQUIREMENTS expressions to stop CPU jobs from consuming all of its CPUs and memory.
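
To make the second option concrete, here is a rough, untested sketch of what I have in mind. It assumes a larger, hypothetical machine (8 cores, 16 GB, 2 GPUs, written in the same notation as the slot definition above), that GPUs are already advertised as the custom resource "GPUs", and an arbitrary reserve of 1 core and 2048 MB per unclaimed GPU. IS_GPU_JOB, GPU_RESERVE_CPUS and GPU_RESERVE_MB are macro names I made up for illustration.

```
# Single partitionable slot holding all of the (hypothetical) machine.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=8, mem=16GB, gpus=2
SLOT_TYPE_1_PARTITIONABLE = TRUE

# Illustrative reserve per unclaimed GPU; the values are arbitrary.
GPU_RESERVE_CPUS = 1
GPU_RESERVE_MB = 2048

# True only for jobs that explicitly request at least one GPU.
IS_GPU_JOB = (TARGET.RequestGPUs =!= UNDEFINED && TARGET.RequestGPUs > 0)

# GPU jobs may always start. A non-GPU job may start only if, after it
# takes its share, the slot still has the reserve left for every GPU
# that is not yet claimed (Cpus, Memory and GPUs are the unclaimed
# amounts on the partitionable slot).
START = $(IS_GPU_JOB) || \
        ( (Cpus - TARGET.RequestCpus >= GPUs * $(GPU_RESERVE_CPUS)) && \
          (Memory - TARGET.RequestMemory >= GPUs * $(GPU_RESERVE_MB)) )
```

The drawback is the one I mentioned at the start: the reserved cores and memory sit idle whenever no GPU jobs are waiting.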

Is there a better way?