
Re: [HTCondor-users] Setting Python 3.13 env var PYTHON_CPU_COUNT in HTCondor?



On 3/7/24 07:38, Joachim Meyer wrote:

Hi,

Python 3.13 will introduce PYTHON_CPU_COUNT as a configuration variable to override the `os.cpu_count()` (and friends) return value (https://github.com/python/cpython/commit/0362cbf908aff2b87298f8a9422e7b368f890071).

We've seen, a number of times, that users call `os.cpu_count()` to determine how many subprocesses they may launch.

This obviously blows up the CPU load when multiple jobs each request only 16 cores but start 256 processes...

So, I'm wondering, should HTCondor set PYTHON_CPU_COUNT to alleviate this for future Python versions?


Also, if you know workarounds that already work with released Python versions, I'm happy to learn about them!


Hi Joachim:

Thanks for the heads-up -- this is just the sort of thing we rely on HTCondor users to keep us informed of.  We will definitely add PYTHON_CPU_COUNT to the set of env vars that condor sets by default.  FWIW, "There is a knob for this" (tm): today you can add

PYTHON_CPU_COUNT to STARTER_NUM_THREADS_ENV_VARS

https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#STARTER_NUM_THREADS_ENV_VARS


and HTCondor will set this for you today.
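For example, the local config change might look like the sketch below. The macro's exact default value varies by HTCondor version, so appending to the existing list (rather than replacing it) is the safe pattern; treat the comma-separated form as an assumption and check the linked docs for your version.

```
# condor_config.local (sketch, not verified against every version):
# also export PYTHON_CPU_COUNT with the slot's provisioned CPU count,
# alongside the variables condor already sets by default
STARTER_NUM_THREADS_ENV_VARS = $(STARTER_NUM_THREADS_ENV_VARS), PYTHON_CPU_COUNT
```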


And, no, we don't know of workarounds that work with older python versions (or other codes).  If anyone knows, we'd love to hear about it -- as this is an ongoing problem we face.  Note that with cgroups, we can limit the amount of cpu that any job uses, which protects the *machine*. But if a job is cgroup cpu limited to one core, and spawns 256 threads, even if the machine is protected, the job might run significantly slower than it could.
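[Editor's note: for job authors who control their own code, a partial, user-side mitigation is possible today. This is a minimal sketch, not anything HTCondor does for you: it prefers scheduler-provided environment variables over the raw `os.cpu_count()`, falling back to the Linux CPU-affinity mask, which reflects cpuset pinning when present. The variable names checked are just common conventions and an assumption here.]

```python
import os

def available_cpus():
    """Best-effort CPU count for a batch job.

    Prefer environment variables that a scheduler or admin may export
    (names here are assumptions, e.g. via STARTER_NUM_THREADS_ENV_VARS),
    since os.cpu_count() always reports the whole machine.
    """
    for var in ("PYTHON_CPU_COUNT", "OMP_NUM_THREADS"):
        val = os.environ.get(var)
        if val and val.isdigit():
            return int(val)
    # On Linux, the affinity mask shrinks if the job is pinned to a cpuset;
    # it does NOT reflect a cgroup cpu *quota*, so this is only a partial fix.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1
```

A job would then size its pool with `available_cpus()` instead of `os.cpu_count()`, e.g. `multiprocessing.Pool(available_cpus())`.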

We've investigated such hackery as bind-mounting over /proc/cpuinfo, which had mixed success, but if there are better ways to solve this globally, we are all ears.

Thanks,

-greg