Just replying to email #2, but referencing both.
In short, I also like the dynamic approach of #1 better, but I fear it
may fall short if, say, a few hundred jobs of the I/O-hard class are
waiting for resources and a 100-core machine comes back online after
maintenance (or becomes available after another user removes her jobs).
In that scenario the negotiator ranking would not matter: the jobs would
still fill up the node, which would then hammer its local disk for hours
to days.
I see what you're saying in terms of HTCondor scheduling. I would
(a) really investigate the cgroups limit options for IOPS/bandwidth limiting. The hierarchical limits should be easy to impose at the parent cgroup, although I read them as simple hard limits shared among all child cgroups.
(b) pair this approach with a start / drain / preemption policy that identifies nodes that have gone "bad" and prevents new jobs from matching.
All in all, in running HTCondor, I have found that the best way to protect my sanity is to use policies that return the system to health "generally" rather than targeting today's problem of the week.
HTCondor Gods: "what would be really cool" is if you turned IOAccounting on by default for the htcondor cgroup and:
1. determined the physical device for the EXECUTE directory
2. looked at the ioaccounting stats for it (cumulative IOPS, bytes xferred)
3. published load-like stats (ExecuteIOPS1, ExecuteIOPS5, ExecuteIOPS15) in the Machine ClassAd
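
For illustration only, here is a rough Python sketch of what those three steps could look like outside of HTCondor. It is not something HTCondor does today. As a simplification it reads whole-device counters from /proc/diskstats instead of the per-cgroup IO accounting suggested above, and it assumes the EXECUTE directory sits directly on a block device that appears there (this may not hold for LVM, btrfs, or overlay filesystems). The function names are made up for this sketch; the ExecuteIOPS* names are the proposed attributes from the list, smoothed the way the kernel smooths load averages.

```python
import math
import os

# Proposed (not existing) Machine ClassAd attributes and their windows in seconds.
WINDOWS = {"ExecuteIOPS1": 60.0, "ExecuteIOPS5": 300.0, "ExecuteIOPS15": 900.0}


def device_numbers(path):
    """Step 1: return (major, minor) of the device backing `path`."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)


def completed_ios(diskstats_text, major, minor):
    """Step 2: cumulative reads + writes completed for one device.

    `diskstats_text` is the content of /proc/diskstats. After splitting a
    line, field 3 is reads completed and field 7 is writes completed.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 8 and int(fields[0]) == major and int(fields[1]) == minor:
            return int(fields[3]) + int(fields[7])
    raise LookupError(f"device {major}:{minor} not found in diskstats")


def update_averages(averages, iops, dt):
    """Step 3: fold one IOPS sample into load-average-style moving averages.

    `dt` is the number of seconds since the previous sample.
    """
    updated = {}
    for name, window in WINDOWS.items():
        decay = math.exp(-dt / window)
        updated[name] = averages.get(name, 0.0) * decay + iops * (1.0 - decay)
    return updated
```

A collector built on this would sample completed_ios() every few seconds, divide the difference between two samples by the elapsed time to get IOPS, and pass that to update_averages(). The resulting values could then be fed into the Machine ClassAd, for example through a startd cron job.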
The IMHO clear disadvantage of this is that it requires a full startd
restart to update the slot configuration, making it a pretty worrisome
configuration update throughout a pool. On the other hand, one could
predefine a number of virtual resources per machine and tell users to
consume these. Besides cluttering the slot definitions with virtA,
virtB, virtC, ... users may just by accident try to use the same virtual
resource because they chose the same letter based on their first names ;-).
Right now, I think I like the simplicity of the latter approach more
even though it may (and according to Murphy will) break down sooner or
later. But I need to think more about it.
If you're really looking to avoid a startd restart, a quick thought is something like:
STARTD_JOB_ATTRS = UsingLotsOfDiskIOPS
STARTD_PARTITIONABLE_SLOT_ATTRS = UsingLotsOfDiskIOPS
Then add some kind of START / preemption / draining policies that do the thing you want. I think that ought to work, and you could create job transforms that automatically apply the tag to jobs that match a specific user and/or command.