[HTCondor-users] assigning multiple GPUs to a single job


Hi,

I have been scheduling GPU jobs in our cluster by:

1) setting in the config file for each node

"use feature: GPUs"

"GPU_DISCOVERY_EXTRA = -extra"

(as suggested in the documentation, running condor_gpu_discovery -properties manually produces the right results for each machine)

2) setting up a number of static slots with 1 CPU each, e.g. on a 2-GPU machine:

"SLOT_TYPE_1 = cpus=1,mem=auto

SLOT_TYPE_1_PARTITIONABLE = FALSE

NUM_SLOTS_TYPE_1 = 2"
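Put together, the settings from steps 1 and 2 would look roughly like this in the local config file of the 2-GPU machine. This is only a sketch assembled from the lines quoted above; the comments are my own reading of what each line does.

```
# Step 1: enable GPU detection and advertise extra GPU attributes
use feature: GPUs
GPU_DISCOVERY_EXTRA = -extra

# Step 2: two static (non-partitionable) slots with 1 CPU each
SLOT_TYPE_1 = cpus=1,mem=auto
SLOT_TYPE_1_PARTITIONABLE = FALSE
NUM_SLOTS_TYPE_1 = 2
```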

When I submit jobs that have "request_GPUs=1" in the submit file, they get scheduled to machines that have a GPU, and across multiple machines no more jobs are scheduled than there are GPUs. However, when I specify "request_GPUs=2", the job stays in the queue with status "I" (idle), even though the requested number of GPUs is available.
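For reference, the submit file for the failing case looks roughly like the sketch below. Only the request_GPUs line is taken from my actual setup; the executable name and the request_cpus line are placeholders for illustration.

```
# Minimal submit description sketch (gpu_job.sh is a hypothetical name)
executable   = gpu_job.sh
request_cpus = 1
# Works with 1; the job stays idle with 2
request_GPUs = 2
queue
```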

Hence, I am wondering what I am doing wrong and whether I have set up the slot configuration in step 2 incorrectly. The GPU discovery works beautifully, so I suspect I am overcomplicating things.

Thank you for your help!

Francisco
