
Re: [HTCondor-users] forcing concurrency limits



I want to allocate 90% of the cores to jobs that run
indefinitely, possibly for weeks. I would also like to allocate the remaining
cores to short-lived jobs, 10 minutes maximum. I can hold a job if it
runs longer than 10 minutes.

But I am not sure how to enforce the ratio. A short job can also run on
the cores meant for long-running jobs (the 90%). I believe I can do it with
concurrency limits, but is there a way to force a concurrency limit, or at
least default to a certain one? Or is there a better way to do this?

I would expect that you could force all jobs to specify a concurrency limit by using submit requirements, or add a default to all jobs by using submit transforms. See https://htcondor.readthedocs.io/en/latest/admin-manual/policy-configuration.html#submit-requirements and https://htcondor.readthedocs.io/en/latest/admin-manual/policy-configuration.html#job-transforms respectively. If you have enough jobs, and the jobs are all the same size, you could maintain the ratio you desire just by giving the concurrency limits the appropriate ratio (for example, 576 "long" to 64 "short", which is 9:1).
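
An untested sketch of what that could look like, assuming the limits are named "long" and "short". The names DefaultToShort and NeedLimit are placeholders made up for illustration, and the transform syntax shown is the newer keyword style, which may differ on older HTCondor versions:

```
# Negotiator configuration: pool-wide limits in a 9:1 ratio.
LONG_LIMIT  = 576
SHORT_LIMIT = 64

# Schedd configuration, option 1: give any job that did not
# specify a concurrency limit a default one ("short" is an assumption).
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) DefaultToShort
JOB_TRANSFORM_DefaultToShort @=end
   REQUIREMENTS ConcurrencyLimits =?= undefined
   SET ConcurrencyLimits = "short"
@end

# Schedd configuration, option 2: reject jobs that did not
# specify any concurrency limit at all.
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) NeedLimit
SUBMIT_REQUIREMENT_NeedLimit = TARGET.ConcurrencyLimits =!= undefined
SUBMIT_REQUIREMENT_NeedLimit_REASON = "Please set concurrency_limits to long or short."
```

Users would then put concurrency_limits = long (or short) in their submit files. Note that the requirement above only checks that some limit is present, not that it is one of the two expected names.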

If your pool generally has a fixed membership, and enough machines, you could enforce the ratio for jobs requesting multiple cores by splitting up the machines: nine run only "long" jobs and one runs only "short" jobs. Of course, this proportion won't be maintained if one of the machines stops working properly.
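
One possible way to express that split is with the START expression on each machine. This is only a sketch: it assumes jobs carry a boolean ShortJob attribute, which is a convention chosen for illustration and not something HTCondor defines.

```
# Local configuration on each of the nine "long" machines:
# refuse jobs marked as short.
START = $(START) && (TARGET.ShortJob =!= True)

# Local configuration on the one "short" machine:
# accept only jobs marked as short.
START = $(START) && (TARGET.ShortJob =?= True)
```

Each machine gets only one of the two START lines, not both.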

If your jobs request multiple CPUs, you could probably use submit requirements and transforms to require that the job request as many "long" tokens (or "short" tokens, as appropriate) as CPUs it requests. (This could lead to single-CPU jobs dominating the mix of jobs; I don't know what can be done about that.)
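
For illustration, a submit file for a hypothetical four-core long job would then ask for four "long" tokens, using the name:count form of concurrency_limits (the limit name "long" is assumed to match the pool configuration):

```
# Request four cores and consume four "long" tokens to match.
request_cpus       = 4
concurrency_limits = long:4
```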

If you want to put short jobs on hold after ten minutes, it's probably easiest to do that with system periodic hold:

https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#SYSTEM_PERIODIC_HOLD%20and%20SYSTEM_PERIODIC_HOLD_%3CName%3E

You can add an attribute (ShortJob = True) to the job in the submit transform to make writing the hold expression easier, something like:

SYSTEM_PERIODIC_HOLD_NAMES = $(SYSTEM_PERIODIC_HOLD_NAMES) SHORT_JOB_NOT_SHORT
SYSTEM_PERIODIC_HOLD_SHORT_JOB_NOT_SHORT = (ShortJob === True) && (JobStatus == 2) && ((time() - EnteredCurrentStatus) > 600)
SYSTEM_PERIODIC_HOLD_SHORT_JOB_NOT_SHORT_REASON = "Your short job ran for more than ten minutes.  Try resubmitting it with LongJob = True"

You can then use EXTENDED_SUBMIT_COMMANDS

https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#EXTENDED_SUBMIT_COMMANDS

to enable the use of "LongJob" without the +.
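
A sketch of how these pieces might fit together, untested. The transform name TagShortJobs is a placeholder, and treating every job that does not set LongJob as a short job is an assumption about the policy you want:

```
# Let users write "LongJob = true" in a submit file without the "+".
# The value given here indicates the attribute's type (boolean).
EXTENDED_SUBMIT_COMMANDS @=end
   LongJob = true
@end

# Mark every job that did not ask to be long as a short job,
# so the periodic hold expression can test ShortJob.
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) TagShortJobs
JOB_TRANSFORM_TagShortJobs @=end
   REQUIREMENTS LongJob =!= True
   SET ShortJob = True
@end
```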

-- ToddM