
Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored



Hi,

On 27/08/21 06:20, David Cohen wrote:
Hooray!!
It's working now, and jobs running over time are evicted.
Now on to my next project, holding jobs that after 30 minutes of running still use less than 10% of their requested memory:
WastingMemory = (JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800) && (RequestMemory > 8192) && (ResidentSetSize/1024 < RequestMemory/10)

I believe that thread gives me all the tools needed to manage that one.

Experts here might want to confirm: I think that some job ClassAd attributes (such as ResidentSetSize) are only updated every 15 minutes.
If that is true, this policy could put a job on hold now based on a value measured up to 15 minutes earlier.
A simple remedy would be to wait 2700 seconds (1800 + 900) instead of 1800.
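
To make the timing argument concrete, here is a minimal Python sketch that mirrors the WastingMemory clause on a plain dict. The helper name wasting_memory, the sample job, and the 15-minute UPDATE_INTERVAL are assumptions for illustration only; this is not an HTCondor API, and the real update interval should be checked on your pool.

```python
# Hypothetical helper mirroring the WastingMemory clause; not an HTCondor API.
UPDATE_INTERVAL = 15 * 60  # assumed refresh period of ResidentSetSize, in seconds


def wasting_memory(job, now, min_runtime=1800):
    """Return True if the job would match the WastingMemory clause.

    job is a plain dict using the job ClassAd attribute names:
    ResidentSetSize is in KiB, RequestMemory is in MiB.
    """
    running = job["JobStatus"] == 2
    long_enough = (now - job["JobCurrentStartExecutingDate"]) > min_runtime
    big_request = job["RequestMemory"] > 8192
    # Convert KiB to MiB, then compare against 10% of the request.
    low_usage = job["ResidentSetSize"] / 1024 < job["RequestMemory"] / 10
    return running and long_enough and big_request and low_usage


now = 1_630_000_000  # arbitrary epoch timestamp for the example
job = {
    "JobStatus": 2,
    "JobCurrentStartExecutingDate": now - 2400,  # started 40 minutes ago
    "RequestMemory": 16384,                      # 16 GiB requested
    "ResidentSetSize": 512 * 1024,               # 512 MiB, possibly a stale sample
}

print(wasting_memory(job, now))                          # 1800 s threshold
print(wasting_memory(job, now, 1800 + UPDATE_INTERVAL))  # 2700 s threshold
```

With the 1800-second threshold this job matches after 40 minutes even though its memory sample may predate the 30-minute mark; with 2700 seconds it is left alone until a fresher sample is guaranteed under the assumed update interval.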

When considering a hold policy, I use condor_q to check for candidate jobs and verify that no "innocent" jobs are involved,
running something like this or a variant:

condor_q -glob -all -cons '(JobStatus == 2 && (time() - JobCurrentStartExecutingDate) > 1800)' -af:j owner '(RequestMemory > 8192)' '(ResidentSetSize < RequestMemory * 102.4)'

This can help confirm that the right jobs are affected before enforcing the rule.
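
The 102.4 factor in the condor_q command comes from the unit conversion: ResidentSetSize is in KiB and RequestMemory is in MiB, so ResidentSetSize/1024 < RequestMemory/10 is the same test as ResidentSetSize < RequestMemory * 1024/10. A small Python sanity check of that equivalence follows; the function names and sample values are made up for illustration. It uses exact fractions, so it says nothing about how ClassAd arithmetic rounds right at the boundary.

```python
from fractions import Fraction


def clause_form(rss_kib, request_mib):
    # Form used in the hold clause: RSS converted to MiB vs. 10% of the request.
    return Fraction(rss_kib, 1024) < Fraction(request_mib, 10)


def condor_q_form(rss_kib, request_mib):
    # Form used in the condor_q projection: 10% of the request converted to KiB.
    return rss_kib < request_mib * Fraction(1024, 10)


# (ResidentSetSize in KiB, RequestMemory in MiB), illustrative values only
samples = [(512 * 1024, 16384), (2 * 1024 * 1024, 16384), (900_000, 8193)]
for rss, req in samples:
    print(rss, req, clause_form(rss, req), condor_q_form(rss, req))
```

Both columns should agree for every sample, which is the point: the query and the hold clause select the same jobs as long as the units are as assumed.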

Stefano


Many thanks,
David


On Thu, Aug 26, 2021 at 4:48 PM Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
On 26/08/21 15:12, Stefano Dal Pra wrote:
> [SNIP]
>>
>> That works perfectly for MEMORY_EXCEEDED but totally ignored for
>> TIME_EXCEEDED.
[SNIP]

I stumbled on a job that had somehow survived and been running for 21 days,
so I wrote a clause to get it held and verify that it works:

TooMuchTime = (jobstatus == 2 && (time() - JobStartDate > 86400 * 7))

This clause works, but it only takes effect after a condor restart:
condor_reconfig is not enough.

Stefano