
Re: [HTCondor-users] Killing job instead of putting on hold when Memory is exhausted



From: Tomas Kouba <tomas.kouba@xxxxxxx>
Date: 05/23/2016 04:40 AM
 
> Hello,
>
> I have configured HTCondor to run jobs with limited amount of memory via cgroups.
> Now I am testing what happens with jobs that allocate too much:
> - put on Hold
> - HoldReason = "Error from slot1@<node>: Job has gone over memory limit"
>
> Is it possible to tell HTCondor to kill the job instead of putting jobs on hold?
> (actually I would prefer killing jobs instead of holding under all
> circumstances, not only memory
> exhaustion).

The hold action is tied to the cgroup OOM killer, so it's not under user control,
but you can implement your desired policy by setting a "periodic_remove"
expression. For example, check whether JobStatus is 5 (held) and HoldReasonCode
(page 955 of the 8.4.6 manual) is 34, which indicates a memory limit was hit. When
both conditions are true, periodic_remove evaluates to true, and the job will
leave the queue at the next periodic evaluation after being held due to memory
exhaustion.

periodic_remove = (JobStatus == 5 && HoldReasonCode == 34)

My own system_periodic_remove expression allows held jobs to stay in the queue for
up to five days keyed from the CompletionDate attribute, so that the users can see
them and adjust their submissions accordingly but without requiring the users
to manually clean up after themselves (since we all know how that usually goes).
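
A policy along those lines could be sketched as follows. This is an illustration
only, not my exact production expression: it assumes the age of the hold is
measured from EnteredCurrentStatus (the time the job last changed state) instead
of the CompletionDate attribute mentioned above, and it spells out five days in
seconds.

# condor_config on the submit host (sketch under the assumptions above)
# Remove any job that has been held for more than five days.
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > (5 * 24 * 60 * 60))

Because this is a schedd-side setting, it applies to every job in the queue
without the users having to add anything to their submit files.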

One thing you might consider, however, is using hold code 34 to resize the memory
request of the job by some factor and allowing it to restart. You can do this with
a periodic_release expression coupled with a request_memory expression that sets
the memory request to either the baseline value for the job or some percentage
increase over the memory used in the last run where the memory was exhausted,
allowing the job to claim more memory at each run until it's able to finish
successfully. To limit the number of attempts, you'd incorporate NumJobStarts in
the periodic_release expression.
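
As a rough sketch of that idea, a submit description might contain something like
the lines below. The numbers are placeholders I've assumed for illustration: a
2048 MB baseline, a 50% increase over the last observed MemoryUsage, and a cap of
five starts. Test it against your own pool before relying on it.

# Submit description fragment (sketch; all values are assumed placeholders)
# Request 1.5x the last observed MemoryUsage, but never less than the 2048 MB baseline.
request_memory = ifThenElse(MemoryUsage =!= undefined && (MemoryUsage * 3 / 2) > 2048, MemoryUsage * 3 / 2, 2048)
# Release jobs held for exceeding the memory limit, up to five starts in total.
periodic_release = (HoldReasonCode == 34) && (NumJobStarts < 5)

Once NumJobStarts reaches the cap, the job simply stays on hold, where a
periodic_remove or system_periodic_remove policy can pick it up.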

        -Michael Pelletier.