
Re: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests



Hi,

As far as I know,

CGROUP_MEMORY_LIMIT_POLICY = hard

should do what you expect (the job goes on hold with a hold reason). It needs to be set on the worker node.

The syntax for periodic hold would be:

SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 3000 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (> 3 x requested-memory)"

The units do not match: ResidentSetSize is in KB and RequestMemory is in MB, hence the factor '3000' rather than '3'.
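
To make the unit conversion concrete, here is a small Python sketch of the same comparison. It only illustrates the arithmetic and is not HTCondor code; the function name and the `factor` parameter are made up for this example, and it uses the same 1 MB = 1000 KB approximation as the expression above.

```python
def exceeds_memory_limit(resident_set_size_kb, request_memory_mb, factor=3):
    """Mirror ResidentSetSize > 3000 * RequestMemory (hypothetical helper).

    resident_set_size_kb: memory in use, in KB (the unit of ResidentSetSize)
    request_memory_mb:    requested memory, in MB (the unit of RequestMemory)
    factor:               allowed multiple of the request (3 in the expression)
    """
    # Convert the request from MB to KB (approximating 1 MB as 1000 KB),
    # then scale by the allowed factor: 3 * 1000 = 3000.
    limit_kb = factor * 1000 * request_memory_mb
    return resident_set_size_kb > limit_kb


# A job requesting 2048 MB is held only above 3000 * 2048 = 6,144,000 KB.
print(exceeds_memory_limit(5_000_000, 2048))  # False
print(exceeds_memory_limit(7_000_000, 2048))  # True
```

Writing '3' instead of '3000' would compare KB against MB and hold practically every job, which is exactly the kind of mistake to avoid on a production pool.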

If you would rather remove the jobs than hold them, use

SYSTEM_PERIODIC_REMOVE
and
SYSTEM_PERIODIC_REMOVE_REASON

These periodic expressions need to be configured on the scheduler (the schedd).

For the cores: the condor job is not bound to a specific core of the machine. It uses CPU cycles on whichever cores are free at the moment, and the usage is recorded in the cgroup of the job. If you put a SYSTEM_PERIODIC_HOLD on that value, I expect the job will go on hold with a corresponding reason.
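
As a rough sketch of the comparison such a check would make (Python again, purely illustrative): the average number of cores used is taken as accumulated CPU time divided by wall-clock time. The function names and the `tolerance` parameter are assumptions for this example; which job attributes carry these values in your HTCondor version needs to be checked in the manual.

```python
def average_cores_used(cpu_seconds, wall_seconds):
    """Average number of cores kept busy over the run (hypothetical helper)."""
    if wall_seconds <= 0:
        # No wall-clock time has elapsed yet, so there is nothing to average.
        return 0.0
    return cpu_seconds / wall_seconds


def exceeds_cpu_request(cpu_seconds, wall_seconds, request_cpus, tolerance=0.1):
    """True if the average core usage is above the request plus a margin.

    tolerance is an assumed safety margin (10% here) so that small
    overshoots do not put innocent jobs on hold.
    """
    allowed = request_cpus * (1 + tolerance)
    return average_cores_used(cpu_seconds, wall_seconds) > allowed


# One requested core, one hour of wall time:
print(exceeds_cpu_request(3600, 3600, 1))  # False, about one core used
print(exceeds_cpu_request(7200, 3600, 1))  # True, about two cores used
```

A margin like this matters on a production system, because a job with a short multi-threaded phase would otherwise be held just as readily as one that permanently uses too many cores.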

By default the condor job can use as many cores as it likes, as long as the cores are available at the given time, which is a desirable thing I guess.

You can set ASSIGN_CPU_AFFINITY to true to alter this behaviour and strictly limit the job to the number of cores reserved for the slot.

This is all afaik and I am prepared to stand corrected in a minute or two though ;)

Best
christoph




--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 24 February 2021 13:31:32
Subject: [HTCondor-users] Periodic Hold for jobs exceeding memory and CPU requests

Hi,
I was under the (apparently wrong) impression that setting
CGROUP_MEMORY_LIMIT_POLICY = HARD
would suffice to kill jobs running over the requested memory.
I now understand that I have to back it up with a SYSTEM_PERIODIC_HOLD.
As the system is in production, I don't want to risk getting it wrong and killing innocent jobs.

While I'm at it, can I also use that method to remove jobs that are using more cores than requested (cpu usage > cpu requested)?

Thanks,
David



