Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] gpu's and preemption

Date: Thu, 28 Feb 2019 09:00:09 -0500
From: Michael Di Domenico <mdidomenico4@xxxxxxxxx>
Subject: [HTCondor-users] gpu's and preemption

My pool slots are currently set up as partitionable with
cpu=auto, memory=auto, gpus=auto.  Preempt/Suspend/etc. are all set to
false.

For 99% of our pool usage this works fine, using just standard userprio
to divvy up the jobs fairly.

We've trundled down the preemption path in the past, but backed away
from it because it was too complicated and finicky to get right
(without making people cranky).  However, I'm now faced with a
situation where preemption might be useful.

Given this scenario: the node has 8 cores and 4 gpus.

usera comes along and submits jobs that do not require gpus, each
single slot/single cpu:
request_cpus = 1
request_memory = 1000

slot1@compute1 unclaimed
slot1_1@compute1 claimed usera
slot1_2@compute1 claimed usera
slot1_3@compute1 claimed usera
slot1_4@compute1 claimed usera
slot1_5@compute1 claimed usera
slot1_6@compute1 claimed usera
slot1_7@compute1 claimed usera
slot1_8@compute1 claimed usera

userb comes along and submits jobs that do require a gpu, each single
slot/single cpu/single gpu:
request_cpus = 1
request_gpus = 1
request_memory = 1000

userb has to wait until one of usera's eight jobs finishes in order to
use a gpu.
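
To make the stall concrete, here is a minimal Python sketch of the
scenario.  It is a toy model of a partitionable slot, not HTCondor
code: the class name, the method names, and the 64000 MB memory figure
are assumptions made up for illustration.  It only shows that once all
8 cores are carved off, a 1 cpu/1 gpu request no longer fits even
though all 4 gpus are idle.

```python
from dataclasses import dataclass


@dataclass
class PartitionableSlot:
    """Toy model of the unclaimed resources left in slot1@compute1."""
    cpus: int
    gpus: int
    memory_mb: int

    def fits(self, cpus: int, gpus: int, memory_mb: int) -> bool:
        # A request fits only if every requested resource is still available.
        return (cpus <= self.cpus
                and gpus <= self.gpus
                and memory_mb <= self.memory_mb)

    def carve(self, cpus: int, gpus: int, memory_mb: int) -> bool:
        # Carve a dynamic slot out of the remaining resources, if possible.
        if not self.fits(cpus, gpus, memory_mb):
            return False
        self.cpus -= cpus
        self.gpus -= gpus
        self.memory_mb -= memory_mb
        return True


# 8 cores and 4 gpus as in the scenario; the memory size is an assumption.
node = PartitionableSlot(cpus=8, gpus=4, memory_mb=64000)

# usera: cpu-only jobs, 1 cpu and 1000 MB each, until the cores run out.
usera_started = sum(node.carve(1, 0, 1000) for _ in range(8))

# userb: 1 cpu, 1 gpu, 1000 MB. All gpus are idle, but no cores are left.
userb_fits = node.fits(1, 1, 1000)

print(usera_started, node.cpus, node.gpus, userb_fits)  # 8 0 4 False
```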

ideally what i'd like to happen is that 4 of usera's jobs are
preempted for userb's.  just the fact that a user is asking for a gpu
should be enough to preempt another person from a slot that isn't

As extra credit: what happens when the box has 16 cores and 4 gpus,
and userb comes along and asks for two cpus/one gpu per job?  Does it
kick eight of usera's jobs off?
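
The arithmetic behind both cases can be sketched in a few lines of
Python.  This is only a back-of-the-envelope calculation of how many
single-cpu jobs would have to be displaced to keep every idle gpu busy;
the function name and parameters are hypothetical, and it says nothing
about what HTCondor would actually decide to preempt.

```python
import math


def cpu_jobs_to_displace(free_cpus: int, free_gpus: int,
                         gpu_job_cpus: int, gpu_job_gpus: int,
                         cpu_job_cpus: int = 1) -> tuple[int, int]:
    """Return (gpu jobs that could run, cpu-only jobs that must go)."""
    # How many gpu jobs the idle gpus could host.
    gpu_jobs = free_gpus // gpu_job_gpus
    # Cores those gpu jobs need, minus whatever cores are already free.
    shortfall = max(0, gpu_jobs * gpu_job_cpus - free_cpus)
    # Each displaced cpu-only job frees cpu_job_cpus cores.
    return gpu_jobs, math.ceil(shortfall / cpu_job_cpus)


# 8 cores, all busy with 1-cpu jobs; userb asks for 1 cpu/1 gpu per job.
small_box = cpu_jobs_to_displace(free_cpus=0, free_gpus=4,
                                 gpu_job_cpus=1, gpu_job_gpus=1)

# 16 cores, all busy with 1-cpu jobs; userb asks for 2 cpus/1 gpu per job.
big_box = cpu_jobs_to_displace(free_cpus=0, free_gpus=4,
                               gpu_job_cpus=2, gpu_job_gpus=1)

print(small_box, big_box)  # (4, 4) (4, 8)
```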

Follow-Ups:
- Re: [HTCondor-users] gpu's and preemption
  - From: Todd L Miller
