From what I know,
CGROUP_MEMORY_LIMIT_POLICY = hard
should do what you expect (the job goes on hold with a hold reason). It needs to be set on the worker node.
The syntax for periodic hold would be:
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 3000 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (> 3 x requested-memory)"
There is a mismatch in units: ResidentSetSize is in KB and RequestMemory is in MB, hence the '3000' and not '3'.
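To make the unit arithmetic concrete, here is a small sketch. It assumes the two attributes are really reported in binary units (KiB and MiB), in which case the exact factor would be 3 * 1024 rather than the rounded 3000; the RequestMemory value of 2000 in the comments is only an illustration.

```
# Rounded factor, as above: a job with RequestMemory = 2000 (MB)
# is held once ResidentSetSize exceeds 3000 * 2000 = 6000000 (KB).
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 3000 * RequestMemory )

# Exact factor, assuming KiB and MiB: the same job would be held
# above 3 * 1024 * 2000 = 6144000 (KiB). Use one variant, not both.
# SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 3 * 1024 * RequestMemory )
```

The difference between the two is about 2.4 %, so the rounded form is slightly stricter.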
If you are looking for a remove use
These 'periodic' expressions need to be configured on the scheduler (the schedd), not on the worker node.
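For removing jobs instead of holding them, a minimal sketch might look like the following. It assumes the SYSTEM_PERIODIC_REMOVE knob with the same threshold as the hold expression above; the threshold is only an illustration and should be checked against real job ads before it goes into production.

```
# Schedd configuration: remove (rather than hold) jobs whose resident
# set size exceeds three times the requested memory.
# ResidentSetSize is in KB, RequestMemory in MB, hence the factor 3000.
SYSTEM_PERIODIC_REMOVE = ( ResidentSetSize > 3000 * RequestMemory )
```

A removed job is gone from the queue, while a held job can still be inspected and released, so starting with the hold variant is the lower-risk choice.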
For the cores: the condor job is not bound to a specific core of the machine. It will use core-duty-cycles every now and then, and the usage will be recorded in the cgroup of the job. If you put a SYSTEM_PERIODIC_HOLD on that value, I expect the job will go into hold with a corresponding reason.
By default the condor job can use as many cores as it likes, as long as the cores are available at the given time, which is a desirable thing I guess.
You can set ASSIGN_CPU_AFFINITY to true to alter this behaviour and strictly limit the job to the number of cores reserved for the slot.
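As a sketch, the affinity setting would be a single line, assuming it goes into the configuration of the worker node where the slots are defined:

```
# Worker node configuration: pin each job to the cores reserved for
# its slot, so it cannot use idle cores beyond its request.
ASSIGN_CPU_AFFINITY = True
```

With this in place a job that starts more threads than it requested cores simply runs slower; it does not take cores away from other slots.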
This is all afaik, and I am prepared to stand corrected in a minute or two though ;)
I was under the (apparently wrong) impression that setting
CGROUP_MEMORY_LIMIT_POLICY = HARD
would suffice to kill jobs running over the requested memory.
I now understand that I have to back it up with a SYSTEM_PERIODIC_HOLD.
As the system is in production I don't want to risk getting it wrong and killing innocent jobs.
While I'm at it, can I also use that method to remove jobs that are using more cores than requested (CPU usage > CPUs requested)?