
Re: [HTCondor-users] gpu's and preemption



	Disclaimer: I am not a preemption expert.

> ideally what i'd like to happen is that 4 of usera's jobs are
> preempted for userb's.  just the fact that a user is asking for a gpu
> should be enough to preempt another person from a slot that isn't

This sounds like a job for a (machine) RANK expression. (Something like RANK = RequestGPUs, probably.) That will work just fine with pslots, except ...
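A minimal sketch of what that might look like in the startd (machine) configuration -- treat this as illustrative, not a tested recipe:

```
# Machine RANK: jobs requesting GPUs outrank jobs that don't,
# so a GPU job can preempt a non-GPU job on this machine.
RANK = RequestGPUs
```

Since RANK is evaluated against the candidate job's ad, a job with RequestGPUs = 1 ranks above one with RequestGPUs = 0 (or undefined).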

> as extra credit, what happens when the box has 16 cores and 4 gpus,
> and userb comes along and asks for two cpus/one gpu per job, does it
> kick eight of usera's jobs off?

... it won't kick eight of usera's jobs off. If you want that, you'll have to set ALLOW_PSLOT_PREEMPTION = TRUE in the configuration /of the negotiator/. By itself, this doesn't enable priority-based preemption, but it does change how preemption interacts with pslots for your entire pool, so you'll want to be sure that no other preemption is taking place (or that you understand the consequences).
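For concreteness, the negotiator-side knob is just:

```
# Negotiator configuration (not the startd): lets the negotiator
# combine multiple dynamic slots back into their parent pslot
# when matching a job that no single dslot can satisfy.
ALLOW_PSLOT_PREEMPTION = TRUE
```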

Be aware, however, that the way HTCondor combines slots when doing pslot preemption does /not/ wait for the corresponding jobs to finish exiting before reassigning their resources, so HTCondor may overcommit in some cases (e.g., the non-GPU jobs take so long to vacate that the GPU job finishes transferring and starts before they finish).

If you don't set ALLOW_PSLOT_PREEMPTION, undersized dynamic slots will be ignored. If you'd rather not preempt at all, you can attempt to address the issue via draining, instead.
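Draining could look something like the following (the hostname is hypothetical):

```
# Stop matching new jobs on this machine and let the running
# jobs finish, freeing the whole machine for the GPU jobs:
condor_drain -graceful gpu-node01.example.edu
```

The trade-off versus preemption is that no work is killed, but the resources sit partially idle while the machine drains.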

If the issue happens rarely enough for manual intervention to be reasonable, starting with 8.9.0, you'll be able to address the issue with the condor_now command (whose version of slot coalescing doesn't suffer from the overcommit issue described above). I suppose you could try to script the tool, but that seems fraught with peril.
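An illustrative invocation (job IDs are hypothetical): given userb's idle GPU job 14.0 and usera's running jobs 12.0 and 12.1 on the target machine, something like

```
# Vacate jobs 12.0 and 12.1 and run job 14.0 on the
# coalesced resources they free up:
condor_now 14.0 12.0 12.1
```

should do the trick, since condor_now waits for the vacated jobs' resources before starting the beneficiary job.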

- Toddm