
Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests



Thank you Thomas and Christoph for your answers.
Being from the "users should know what they are doing" school, I'm trying to implement hard limits.
So in terms of memory, I guess that
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 1000 * RequestMemory )
is what I'm looking for. (Shouldn't the factor be 1024?)

As for the CPU limit, if I have to choose between affinity and letting it be, I prefer to let it be.
That said, I am still looking for a way to prevent users, usually the same ones, from using more resources than they ask for.
As this usually happens at the beginning of the execution, it might be doable, given the right parameter names.


David


On Wed, Feb 24, 2021 at 3:19 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,

As far as I know,

CGROUP_MEMORY_LIMIT_POLICY = hard

should do what you expect (the job goes on hold with a hold reason). It needs to be set on the worker node.

The syntax for periodic hold would be:

SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 3000 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (> 3 x requested-memory)"

There is a mismatch in units: ResidentSetSize is in KB and RequestMemory is in MB, hence the '3000' rather than '3'.
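
The unit arithmetic behind that factor can be checked with a small sketch. It is only an illustration, not HTCondor code: the function name and parameters are hypothetical, and it assumes ResidentSetSize is reported in KiB and RequestMemory in MiB, in which case 1024 rather than 1000 would be the exact conversion and 1000 gives a slightly stricter threshold.

```python
KIB_PER_MIB = 1024  # exact conversion, assuming KiB and MiB units


def exceeds_memory(resident_set_size_kib, request_memory_mib,
                   factor=1.0, kib_per_mib=KIB_PER_MIB):
    """Return True if the resident set size is above factor x the request.

    Mirrors the shape of the periodic-hold expression:
        ResidentSetSize > (factor * kib_per_mib) * RequestMemory
    """
    threshold_kib = factor * kib_per_mib * request_memory_mib
    return resident_set_size_kib > threshold_kib


# A job that requested 2048 MiB and currently uses 2,050,000 KiB:
rss_kib = 2_050_000
request_mib = 2048

# With a multiplier of 1000 the threshold is 2,048,000 KiB, so the job trips it.
over_with_1000 = exceeds_memory(rss_kib, request_mib, kib_per_mib=1000)

# With a multiplier of 1024 the threshold is 2,097,152 KiB, so it does not.
over_with_1024 = exceeds_memory(rss_kib, request_mib, kib_per_mib=1024)
```

The difference between the two multipliers is about 2.4 percent of the request, so it matters only for jobs running very close to their limit.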

If you want the job removed instead, use

SYSTEM_PERIODIC_REMOVE
and
SYSTEM_PERIODIC_REMOVE_REASON

These 'periodic' expressions need to be configured on the scheduler.

For the cores: the condor job is not bound to a specific core of the machine. It will use CPU cycles on whichever cores are free, and the usage will be recorded in the job's cgroup. If you put a SYSTEM_PERIODIC_HOLD on that value, I expect the job will go on hold with a corresponding reason.

By default the condor job can use as many cores as it likes, as long as the cores are available at the given time, which I guess is desirable.

You can set ASSIGN_CPU_AFFINITY to true to alter this behaviour and strictly limit the job to the number of cores reserved for the slot.
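
To make the idea of a CPU-based check concrete, here is a minimal sketch of the comparison such an expression would have to perform. Everything in it is an assumption for illustration: the function names, the tolerance, and the minimum runtime are hypothetical, and it does not name the job attributes a real SYSTEM_PERIODIC_HOLD expression would use.

```python
def average_cores_used(user_cpu_s, sys_cpu_s, wall_s):
    """Average number of cores kept busy: CPU seconds per wall-clock second."""
    if wall_s <= 0:
        return 0.0
    return (user_cpu_s + sys_cpu_s) / wall_s


def uses_too_many_cores(user_cpu_s, sys_cpu_s, wall_s, request_cpus,
                        tolerance=1.25, min_wall_s=60):
    """Return True if average core usage exceeds tolerance x requested cores.

    min_wall_s skips very short runtimes, where the ratio is too noisy
    to be meaningful. Both defaults are arbitrary illustrative values.
    """
    if wall_s < min_wall_s:
        return False
    return average_cores_used(user_cpu_s, sys_cpu_s, wall_s) > tolerance * request_cpus


# A job that requested 1 core but burned 1800 CPU seconds in 600 wall seconds
# has kept 3 cores busy on average.
greedy = uses_too_many_cores(1800, 0, 600, request_cpus=1)

# A job that used 550 CPU seconds in the same 600 seconds stayed within 1 core.
well_behaved = uses_too_many_cores(500, 50, 600, request_cpus=1)
```

Because the ratio is an average over the whole runtime, a short burst at the start of a long job is diluted over time; a check meant to catch early overuse would need to run while the job is still young.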

This is all afaik, and I am prepared to stand corrected in a minute or two though ;)

Best
christoph



--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 24 February 2021 13:31:32
Subject: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Hi,
I was under the, apparently wrong, impression that setting
CGROUP_MEMORY_LIMIT_POLICY = HARD
would suffice to kill jobs running over the requested memory.
I now understand that I have to back it up with a SYSTEM_PERIODIC_HOLD.
As the system is in production, I don't want to risk getting it wrong and killing innocent jobs.

While I'm at it, can I also use that method to remove jobs that are using more cores than requested (CPU usage > CPU requested)?

Thanks,
David




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/