
Re: [HTCondor-users] Possible for user to limit number of jobs per physical machine?



Carsten:

I too don't like hacky approaches to a specific problem. I would see how far something like this gets you:

1. Turn on IO accounting on the HTCondor cgroup (the parent one)
2. Create a ClassAd hook that monitors the stats coming out of that IO accounting
3. Figure out what "bad values" of your stats are and impose some limits (in cgroups) that are below the bad value.
4. Create a NEGOTIATOR_POST_JOB_RANK expression that puts machines approaching "bad" at the end of the list. You might have to combine it with "breadth-first" filling policies based on CPUs to ensure that initial job matching goes that way.
5. A question for the crowd: since everything is an expression, could one modify the SLOT_WEIGHT so that I/O usage is included in the user's priority factor?

For 5, you might scale the I/O weight so that "1000 IOPS = 0.5 CPUs" or anything below 1000 IOPS = 0, etc.
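To make the arithmetic for 5 concrete, here is a minimal Python sketch of that scaling. It only illustrates the numbers; it is not HTCondor configuration, and whether SLOT_WEIGHT can actually pick up a live I/O attribute is still the open question. The function names and the linear scaling above the threshold are my assumptions.

```python
def io_weight(iops, threshold=1000.0, cpus_per_threshold=0.5):
    """Convert an IOPS reading into CPU-equivalents.

    Assumption: anything below the threshold costs nothing, and above it
    the cost grows linearly, so 1000 IOPS = 0.5 CPUs, 2000 IOPS = 1.0 CPUs.
    """
    if iops < threshold:
        return 0.0
    return cpus_per_threshold * (iops / threshold)


def weighted_slot(cpus, iops):
    """Hypothetical combined weight: CPU count plus the I/O surcharge."""
    return cpus + io_weight(iops)


if __name__ == "__main__":
    for cpus, iops in [(4, 200), (4, 1000), (4, 3000)]:
        print(cpus, iops, weighted_slot(cpus, iops))
```

With these defaults a 4-CPU slot doing 3000 IOPS would be charged like a 5.5-CPU slot, while the same slot at 200 IOPS is charged for its CPUs only.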

For 3, if you manage the cgroup under the condor.service group, you can use SystemD overrides at /etc/systemd/system/condor.service.d/iolimits.conf

https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html

Look at IOAccounting, etc. Read your actual OS man page to ensure you use settings appropriate to your release of SystemD.
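For 2, once accounting is on, the monitoring side could look roughly like the Python sketch below. It assumes cgroup v2, where the kernel exposes per-device counters in an io.stat file with lines of the form "MAJ:MIN rbytes=... wbytes=... rios=... wios=..."; on cgroup v1 the blkio files look different. The function names are made up for illustration, and where the io.stat file for the condor cgroup lives on your system is something you would have to check.

```python
def parse_io_stat(text):
    """Parse cgroup v2 io.stat content into {device: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        device, counters = fields[0], {}
        for field in fields[1:]:
            key, _, value = field.partition("=")
            if value.isdigit():  # skip anything that is not a plain counter
                counters[key] = int(value)
        stats[device] = counters
    return stats


def total_iops(before, after, interval_seconds):
    """Read+write operations per second, summed over all devices,
    between two parsed snapshots taken interval_seconds apart."""
    ops = 0
    for device, counters in after.items():
        previous = before.get(device, {})
        for key in ("rios", "wios"):
            ops += counters.get(key, 0) - previous.get(key, 0)
    return ops / interval_seconds


if __name__ == "__main__":
    # Two made-up snapshots taken 2 seconds apart.
    first = parse_io_stat("8:0 rbytes=4096 wbytes=8192 rios=10 wios=20 dbytes=0 dios=0")
    second = parse_io_stat("8:0 rbytes=9000 wbytes=99000 rios=510 wios=1520 dbytes=0 dios=0")
    print(total_iops(first, second, 2.0))
```

A hook could take two such snapshots per interval and publish the resulting rate as a machine ad attribute, which is what the ranking in 4 would then look at.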

Tom

On Thu, Sep 10, 2020 at 10:31 AM Carsten Aulbert <carsten.aulbert@xxxxxxxxxx> wrote:
Hi all,

a current user needs to run very I/O-intensive jobs and
would like to limit himself to one or two jobs per defined slot. As we
currently only define a single slot per physical machine, that should
not be a problem.

However, we admins do not want to change the nodes' configuration on
a per-user basis or that often, especially not if each user only has a
subset of jobs which are that demanding.

Therefore the question: does anyone have a recipe for how a user could
limit himself to running only a limited number of jobs per node,
regardless of how many subslots a machine's partitionable main slot may have?

While browsing around the docs and mailing list archive, the only place
I found where this information may be readily available is the machine
ad "ChildRemoteUser" from the PartitionableSlot. However, given that
this seems to be a stringified list, I do not know if and how this could
be used in the Requirements section of a submit file.[1]

Anyone with an idea?

Cheers

Carsten

[1] While writing this email - thus without testing it so far - I
wondered if it were possible to use any of the predefined functions[2]
in the user's submit file to target only machines where this particular
user has nothing running so far? Or would that in the end lead to a
situation where the Negotiator would propose a match but the node may
refuse the job to run?

[2]
https://htcondor.readthedocs.io/en/latest/misc-concepts/classad-mechanism.html#predefined-functions


--
Dr. Carsten Aulbert, Max Planck Institute for Gravitational Physics,
Callinstraße 38, 30167 Hannover, Germany
Phone: +49 511 762 17185
