
Re: [HTCondor-users] RequestGPUs on HTCondor via HTCondor-CE



Max, the safest option is to have the CE add a GPUs clause to the job Requirements, just like condor_submit does.   

If you are using the new-syntax JobRouter configuration that is available in HTCondor 9.0, you can add clauses to the job's Requirements expression cleanly, like this:

if defined My.RequestGpus
   SET Requirements = ($(My.Requirements)) && (TARGET.GPUs >= RequestGPUs)
endif

If you control all of the execute nodes, then it might be easier to just have all of the nodes advertise GPUs as a custom resource, with machines that have no GPUs advertising 0 GPUs, so that WithinResourceLimits will refuse to start jobs that require GPUs there.
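
A minimal sketch of what that per-node configuration might look like. This is untested and rests on assumptions: that the GPU nodes use the stock GPUs feature metaknob, and that your HTCondor version accepts a quantity of 0 for the MACHINE_RESOURCE_GPUS knob. Check the resulting slot ads before relying on it.

# Assumption: on nodes that have GPUs, let HTCondor detect and advertise them
use feature : GPUs

# Assumption: on nodes without GPUs, define the GPUs resource with quantity 0
# (do not combine this with the metaknob above on the same node)
MACHINE_RESOURCE_GPUS = 0

After a condor_reconfig, something like "condor_status -af Machine GPUs" should show a GPUs value for every slot, including 0 on the nodes without GPUs.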

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fischer, Max (SCC) <max.fischer@xxxxxxx>
Sent: Monday, July 5, 2021 11:26 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] RequestGPUs on HTCondor via HTCondor-CE
 
Hi all,

we’re mainly running an HTCondor CPU cluster, but occasionally can acquire some GPU resources for our community. We are running a standard Grid HTCondor-CE v5 in front of the cluster, and all jobs go through this.
Now, we realised that jobs aren’t actually scheduled with respect to their requested GPUs: *if a node has a GPU*, then it respects RequestGPUs, and otherwise it is just ignored. So we end up with jobs that want GPUs running on nodes that do not have GPUs.

So, what I am wondering is: In the scenario of having mixed GPU and non-GPU resources in an HTCondor cluster behind an HTCondor-CE, who is actually responsible for enforcing GPU requirements?

* The CE job router sets RequestGPUs, but no Requirements clause to enforce it. Do we have to manually add GPUs to the Requirements clause in the router if we expect GPU jobs?
* The Schedd would usually create a GPU clause for Requirements if we passed it `request_gpus`, but obviously we do not do that in this case.
* The nodes' `WithinResourceLimits` only cares about GPUs if the node actually has some. Do we have to announce at least “0 GPUs” on every node to get the resource limit?

My hunch is that we have to manually add the requirement in the CE, but would rather get some other/expert voices on this.

Cheers,
Max