
Re: [HTCondor-users] Possible for user to limit number of jobs per physical machine?

A quicker-to-implement way might be to use a custom machine resource:


and then direct the user to explicitly consume that resource.
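A minimal sketch of that approach, assuming a made-up resource name
"IOTOKEN" and two tokens per machine (both the name and the quantity
are illustrative, not from the original mail):

```
# condor_config on the execute nodes: advertise 2 units of a custom
# machine resource called IOTOKEN on every machine
MACHINE_RESOURCE_IOTOKEN = 2
```

The cooperative user then requests one unit per I/O-heavy job in the
submit file:

```
request_IOTOKEN = 1
```

With two tokens advertised per machine, at most two such jobs can
match onto any one node at a time.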

I prefer e-mail #1 because it monitors a scarce resource (block I/O), implements a sensible scheduling policy (breadth-first), and "taxes" users for consuming too much of it. But the custom-resource approach will work with a cooperative user and only a day or so of work.


On Thu, Sep 10, 2020 at 11:23 AM Tom Downes <tpdownes@xxxxxxxxx> wrote:

I too don't like hacky approaches to a specific problem. I would see how far something like this gets you:

1. Turn on IO accounting on the HTCondor cgroup (the parent one)
2. Create a ClassAd hook that monitors the stats coming out of that IO accounting
3. Figure out what "bad values" of your stats are and impose some limits (in cgroups) that are below the bad value.
4. Create NEGOTIATOR_POST_JOB_RANK that puts machines approaching "bad" at the end of the list. You might have to combine this with "breadth-first" filling policies based on CPUs to ensure that initial job matching goes that way.
5. A question for the crowd: since everything is an expression, could one modify the SLOT_WEIGHT so that I/O usage is included in the user's priority factor?

For 5, you might scale the I/O weight so that "1000 IOPS = 0.5 CPUs", or so that anything below 1000 IOPS counts as 0, etc.
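One way to sketch the idea in 5 (hypothetical: the IOPS_Usage
attribute would have to be advertised by the ClassAd hook from step 2,
and neither that name nor the scaling here is anything official):

```
# condor_config sketch: on top of the slot's CPUs, charge users half
# a CPU's worth of weight per 1000 IOPS above the 1000-IOPS threshold.
# IOPS_Usage is an assumed attribute published by the monitoring hook.
SLOT_WEIGHT = Cpus + (ifThenElse(IOPS_Usage > 1000, IOPS_Usage, 0) / 2000.0)
```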

For 3, if you manage the cgroup under the condor.service unit, you can use systemd overrides at /etc/systemd/system/condor.service.d/iolimits.conf


Look at IOAccounting, etc. Read your actual OS man page to ensure you use settings appropriate to your release of systemd.
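For example, an iolimits.conf along these lines (the directive names
are from systemd.resource-control; the device path and limit are
purely illustrative, and the IO*Max settings need cgroup v2 on a
reasonably recent systemd):

```
# /etc/systemd/system/condor.service.d/iolimits.conf
[Service]
IOAccounting=yes
# Illustrative cap: at most 1000 write IOPS on the scratch device
IOWriteIOPSMax=/dev/sda 1000
```

After editing, `systemctl daemon-reload` followed by a restart of
condor.service picks up the override.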


On Thu, Sep 10, 2020 at 10:31 AM Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> wrote:
Hi all,

a user of ours runs very I/O-intensive jobs and would like to limit
himself to one or two jobs per defined slot - as we currently define
only a single slot per physical machine, that should not be a problem.

However, we admins do not want to change the nodes' configuration on a
per-user basis or that often, especially not since each user only has
a subset of jobs which are that demanding.

Therefore the question: does anyone have a recipe for how a user could
limit himself to running only a limited number of jobs per node,
regardless of how many sub-slots a partitionable main slot on a
machine may have?

While browsing the docs and the mailing list archive, the only place I
found where this information may be readily available is the machine
ad attribute "ChildRemoteUser" on the PartitionableSlot. However,
given that this seems to be a stringified list, I do not know if and
how it could be used in the requirements expression of a submit
file.[1]

Anyone with an idea?



[1] While writing this email - thus without having tested it so far -
I wondered whether it would be possible to use any of the predefined
functions[2] in the user's submit file to target only machines where
this particular user has nothing running so far. Or would that in the
end lead to a situation where the negotiator proposes a match but the
node refuses to run the job?
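One such untested sketch for the submit file (whether
ChildRemoteUser can be used this way at match time is exactly the
open question above; the size() ClassAd function does exist, but note
this would count all users' jobs on the slot, not only one's own):

```
# Untested: match only partitionable slots that currently have fewer
# than two dynamic child slots claimed, regardless of which user owns
# them.
requirements = TARGET.PartitionableSlot && size(TARGET.ChildRemoteUser) < 2
```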


Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185

HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting

The archives can be found at: