
Re: [HTCondor-users] Possible for user to limit number of jobs per physical machine?



Hi Tom,

I am replying only to email #2, but referring to both.

In short, I also prefer the dynamic approach of #1, but I fear it
may fall short if, say, a few hundred jobs of the I/O-heavy class are
waiting for resources and a 100-core machine comes back online after
maintenance (or becomes available after another user removes her jobs).
In that scenario the negotiator ranking would not matter: the jobs would
still fill up the node, which would then hammer its local disk for hours
to days.

On 9/10/20 7:07 PM, tpdownes@xxxxxxxxx wrote:
> A quicker-to-implement way might be to use a custom machine resource:
> 
> https://htcondor.readthedocs.io/en/latest/admin-manual/policy-configuration.html?highlight=MACHINE_RESOURCE_NAMES#dividing-system-resources-in-multi-core-machines
> 
> and then direct the user to explicitly consume that resource.
>

The clear disadvantage of this, IMHO, is that it requires a full startd
restart to update the slot configuration, which makes it a pretty
worrisome configuration update to roll out across a pool. On the other
hand, one could predefine a number of virtual resources per machine and
tell users to consume these. Besides cluttering the slot definitions with
virtA, virtB, virtC, ..., users may accidentally pick the same virtual
resource, for example because they chose the letter based on their first
names ;-).
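
For illustration, a minimal sketch of what the predefined-resource variant
might look like. The resource name "virtA" follows the naming above, but the
count of 4 per machine and the partitionable-slot layout are assumptions of
mine, not something we have deployed or tested:

```
# condor_config.local on each execute node (sketch, untested).
# Advertise 4 units of a custom resource called "virtA"; the number 4
# is a placeholder for "max. I/O-heavy jobs per physical machine".
MACHINE_RESOURCE_virtA = 4

# Assumed slot layout: one partitionable slot owning the whole machine,
# so the custom resource is carved up together with CPUs and memory.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

A user who wants at most 4 of her jobs per machine would then add the
following line to the submit description file:

```
# Each job consumes one unit of virtA, so no more than 4 such jobs
# can be matched to the same machine at any time.
request_virtA = 1
```

Jobs that do not request virtA are unaffected and can still fill the
remaining cores of the node.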

Right now, I think I like the simplicity of the latter approach more
even though it may (and according to Murphy will) break down sooner or
later. But I need to think more about it.

Cheers

Carsten

-- 
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
