
Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored



Yes, you are right.
I did some tests, trying different values to see how many jobs are caught.
I think I'll start by enforcing the rule on jobs whose memory usage is less than a tenth of the requested amount, so that users can adjust without too many jobs being held.
Later on I'll enforce a somewhat more restrictive value.
The goal is to raise the upper limit of allowed reservation, while making sure users won't always request the max allowed "just to be safe".

David

On Fri, Aug 27, 2021 at 10:42 AM Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
Hi,

On 27/08/21 06:20, David Cohen wrote:
Hooray!!
It's working now, and jobs running over time are evicted.
Now on to my next project: holding jobs that, after 30 minutes of running, still use less than 10% of the requested memory:
WastingMemory = (JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800) && (RequestMemory > 8192) && (ResidentSetSize/1024 < RequestMemory/10)
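To make the units explicit, here is a rough Python sketch of the same clause. It is only an illustration, not part of the HTCondor configuration. It relies on ResidentSetSize being reported in KiB and RequestMemory in MiB, and it assumes that dividing two integers in a ClassAd expression truncates, which is why `//` is used. The function name, argument names and the `min_runtime` parameter are made up for this example.

```python
RUNNING = 2  # JobStatus value of a running job


def wasting_memory(job_status, now, start_executing, request_memory_mb,
                   resident_set_size_kb, min_runtime=1800):
    """Mirror the WastingMemory clause for one job.

    Integer division (//) imitates ClassAd integer arithmetic,
    which is an assumption of this sketch.
    """
    ran_long_enough = (job_status == RUNNING
                       and (now - start_executing) > min_runtime)
    big_request = request_memory_mb > 8192
    # KiB -> MiB on the left, one tenth of the request on the right.
    low_usage = resident_set_size_kb // 1024 < request_memory_mb // 10
    return ran_long_enough and big_request and low_usage


now = 1_630_000_000  # arbitrary epoch timestamp for the example

# Running for 1 hour, 16 GiB requested, 500 MiB resident: flagged.
print(wasting_memory(2, now, now - 3600, 16384, 500 * 1024))   # True
# Same job using 4000 MiB: above one tenth of the request, not flagged.
print(wasting_memory(2, now, now - 3600, 16384, 4000 * 1024))  # False
# Requests of 8192 MiB or less are never flagged.
print(wasting_memory(2, now, now - 3600, 8192, 10 * 1024))     # False
```

Passing a larger `min_runtime` is an easy way to try out a longer grace period on sample values before changing the real policy.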

I believe that thread gives me all the tools needed to manage that one.

Experts here might want to confirm: I think that some job ClassAd attributes (such as ResidentSetSize) are actually updated only every 15 minutes.
If that is true, this policy could put a job on hold now based on a value measured up to 15 minutes earlier.
A simple remedy would be to wait 2700 seconds (1800 plus the 900-second update interval) instead of 1800.

When considering a hold policy, I use condor_q to check for candidate jobs and verify that no "innocent" jobs are involved,
running something like this or a variant:

condor_q -glob -all -cons '(JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800)' -af:j owner '(RequestMemory > 8192)' '(ResidentSetSize < RequestMemory * 102.4)'


This can help to confirm that the right jobs are affected before enforcing the rule.
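One detail worth keeping in mind when comparing the query output with the policy: the condor_q check compares ResidentSetSize against RequestMemory * 102.4 in floating point, while the policy clause divides both sides by integers. Assuming ClassAd integer division truncates, the two forms can disagree for jobs sitting right at the 10% boundary. The Python sketch below only illustrates that arithmetic; the function names and sample values are made up.

```python
def low_usage_policy(rss_kb, request_mb):
    # Form used in the policy clause: ResidentSetSize/1024 < RequestMemory/10,
    # with truncating integer division (assumed ClassAd behaviour).
    return rss_kb // 1024 < request_mb // 10


def low_usage_query(rss_kb, request_mb):
    # Form used in the condor_q check: ResidentSetSize < RequestMemory * 102.4
    return rss_kb < request_mb * 102.4


# Far from the boundary both forms agree.
print(low_usage_policy(512_000, 16384), low_usage_query(512_000, 16384))
# prints: True True

# Near the boundary they can differ:
# 839000 // 1024 == 819 and 8195 // 10 == 819, so the policy form is False,
# while 839000 < 8195 * 102.4 (about 839168), so the query form is True.
print(low_usage_policy(839_000, 8195), low_usage_query(839_000, 8195))
# prints: False True
```

In practice the difference only matters for jobs within about 1 MiB of the threshold, so the query stays a good preview of what the rule will catch.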

Stefano


Many thanks,
David


On Thu, Aug 26, 2021 at 4:48 PM Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
On 26/08/21 15:12, Stefano Dal Pra wrote:
> [SNIP]
>>
>> That works perfectly for MEMORY_EXCEEDED but totally ignored for
>> TIME_EXCEEDED.
[SNIP]

I stumbled on a job that had somehow survived and been running for 21 days, so I wrote a
clause to get it held and verify that it works:

TooMuchTime = (jobstatus == 2 && (time() - JobStartDate > 86400 * 7))

This clause works, but it only takes effect after a condor restart:
condor_reconfig is not enough.

Stefano