[HTCondor-users] RequestGPUs on HTCondor via HTCondor-CE



Hi all,

We're mainly running an HTCondor CPU cluster, but can occasionally acquire some GPU resources for our community. We are running a standard Grid HTCondor-CE v5 in front of the cluster, and all jobs go through this.
Now, we realised that jobs aren't actually scheduled with respect to their requested GPUs: *if a node has a GPU*, then it respects RequestGPUs, and otherwise the request is just ignored. So we end up with jobs that request GPUs running on nodes that do not have any.

So, what I am wondering is: in a scenario with mixed GPU and non-GPU resources in an HTCondor cluster behind an HTCondor-CE, who is actually responsible for enforcing GPU requirements?

* The CE job router sets RequestGPUs, but no Requirements clause to enforce it. Do we have to manually add GPUs to the Requirements clause in the router if we expect GPU jobs?
* The Schedd would usually create a GPU clause for Requirements if we passed it `request_gpus` at submission, but obviously we do not do that in this case.
* The nodes' `WithinResourceLimits` only care about GPUs if the node actually has some. Do we have to announce at least "0 GPUs" on every node to get the resource limit?

My hunch is that we have to manually add the requirement in the CE, but would rather get some other/expert voices on this.
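
To make the three options above concrete, here is a toy Python model of the matching behaviour as I understand it. This is only a sketch, not HTCondor's actual ClassAd evaluation: the function names, the dictionary layout, and the node names are made up for illustration, and the job-side clause is assumed to be something along the lines of `TARGET.GPUs >= RequestGPUs`.

```python
def within_resource_limits(node, job):
    """Toy stand-in for the node-side check described above.

    Assumption: the GPU limit is only evaluated when the node
    advertises a GPUs attribute at all.
    """
    ok = job.get("RequestCpus", 1) <= node.get("Cpus", 0)
    if "GPUs" in node:
        ok = ok and job.get("RequestGPUs", 0) <= node["GPUs"]
    return ok


def matches(node, job, enforce_gpu_clause):
    """Return True if the job would be matched to the node in this toy model.

    enforce_gpu_clause models an explicit job-side GPU requirement,
    e.g. one added by the CE job router (hypothetical setup).
    """
    if not within_resource_limits(node, job):
        return False
    if enforce_gpu_clause and job.get("RequestGPUs", 0) > 0:
        # A node without a GPUs attribute never satisfies the clause.
        return node.get("GPUs", 0) >= job["RequestGPUs"]
    return True


# Hypothetical nodes and job for illustration only.
cpu_node = {"Name": "cpu-node", "Cpus": 8}
zero_gpu_node = {"Name": "cpu-node-0gpu", "Cpus": 8, "GPUs": 0}
gpu_node = {"Name": "gpu-node", "Cpus": 8, "GPUs": 2}
gpu_job = {"RequestCpus": 1, "RequestGPUs": 1}

for node in (cpu_node, zero_gpu_node, gpu_node):
    print(
        node["Name"],
        "no clause:", matches(node, gpu_job, enforce_gpu_clause=False),
        "with clause:", matches(node, gpu_job, enforce_gpu_clause=True),
    )
```

In this model the GPU job lands on the plain CPU node unless either the job carries an explicit GPU clause or the node advertises zero GPUs, which is exactly the choice I am asking about.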

Cheers,
Max
