Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored



Hi David,

a couple of ideas:

1st
Consider using ServerTime instead of time(), with a fallback to time() in case ServerTime is undefined (not sure this could happen).
Also, make sure the job is running:
(JobStatus == 2) && (((ServerTime ?: time()) - JobCurrentStartDate) > IfthenElse(...) )

2nd
I agree; CpusUsage can be undefined for a few minutes before being set. Consider using

( (CpusUsage ?: 0) > (RequestCpus * 1.1) + 0.8 )

Instead of

(CpusUsage > RequestCpus + 0.8 + (RequestCpus * 0.1))
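
To illustrate why the ?: guard matters, here is a minimal Python sketch of how an undefined value propagates through a comparison. It is a toy model only: None stands in for ClassAd undefined, and the helper names (gt, default, cpu_exceeded_raw, cpu_exceeded_guarded) are made up for this example and are not part of HTCondor.

```python
# Toy model of ClassAd "undefined" handling; None plays the role of undefined.
# Illustration only, not HTCondor's real expression evaluator.

def gt(a, b):
    """ClassAd-like '>': the result is undefined if either operand is undefined."""
    if a is None or b is None:
        return None
    return a > b


def default(value, fallback):
    """ClassAd-like 'value ?: fallback': use fallback when value is undefined."""
    return fallback if value is None else value


def cpu_exceeded_raw(cpus_usage, request_cpus):
    # Mirrors (CpusUsage > RequestCpus + 0.8 + (RequestCpus * 0.1))
    return gt(cpus_usage, request_cpus + 0.8 + request_cpus * 0.1)


def cpu_exceeded_guarded(cpus_usage, request_cpus):
    # Mirrors ((CpusUsage ?: 0) > (RequestCpus * 1.1) + 0.8)
    return gt(default(cpus_usage, 0), request_cpus * 1.1 + 0.8)


if __name__ == "__main__":
    # A job that has just started and has no CpusUsage yet
    print(cpu_exceeded_raw(None, 1))      # undefined (None)
    print(cpu_exceeded_guarded(None, 1))  # a plain False
```

With the guard, a freshly started job yields a plain False instead of undefined, so the clause cannot make the combined hold expression undefined.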

Stefano

On 26/08/21 09:24, David Cohen wrote:
Hi Stefano,

I temporarily omitted the CPU checks from the SYSTEM_PERIODIC_HOLD as I want to focus on Memory and Time.
So now I have:
MEMORY_EXCEEDED = ( ResidentSetSize > 1024 * RequestMemory )
TIME_EXCEEDED = ((Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600))
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(MEMORY_EXCEEDED) || $(TIME_EXCEEDED)
SYSTEM_PERIODIC_HOLD_REASON = {"","MEMORY EXCEEDED", "TIME EXCEEDED"}[max({int($(MEMORY_EXCEEDED))*1, int($(TIME_EXCEEDED))*2})]

That works perfectly for MEMORY_EXCEEDED but is totally ignored for TIME_EXCEEDED.
I checked the expression following your example:
condor_q -allusers -run -af '((Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600))'
And verified that the output is either true or false, no undefined or otherwise.
Do you have any idea what has gone wrong?


Nevertheless, I checked the CpusUsage expression, for future use:
I changed Cpus to RequestCpus:
condor_q -allusers -run -lim 2 -af '(CpusUsage > RequestCpus + 0.8 + (RequestCpus * 0.1))' CpusUsage RequestCpus
false 0.999672281096517 1
undefined undefined 1

I suspect that the undefined comes from jobs that just started and don't return CPU statistics yet.

Thanks,
David

On Tue, Aug 24, 2021 at 11:35 AM Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
Hi David, condor_reconfig was enough in my case;
the syntax is very "delicate", I think; I had similar problems until things started working as expected.
My "take home" experience is that when writing the condition it is fundamental to prevent
it from evaluating to undefined.

For example, consider the expression for CPU_EXCEEDED: when applied to a running job it
should provide a True / False value only; however:

[root@ce06-htc ~]# condor_q -run -lim 2 -af '(CpusUsage > Cpus + 0.8 + (Cpus * 0.1))' CpusUsage Cpus
undefined 0.9824864979138553 undefined
undefined 0.9867808965373372 undefined

The problem here is that Cpus is not a job ClassAd attribute, so it always evaluates to undefined. No running job has a
value for Cpus:
[root@ce06-htc ~]# condor_q -glob -all -cons '(jobstatus == 2) && (Cpus =!= undefined)' -af:j Owner
[root@ce06-htc ~]#

Stefano




On 24/08/21 09:44, David Cohen wrote:
Hi,
I changed SYSTEM_PERIODIC_HOLD_REASON to "all at once" as you suggested.
It seems that condor_reconfig is not enough to apply those changes,
neither to running jobs nor to new ones (test jobs still get the old hold reason).
Is there a way, other than draining and restarting the startd, to apply these changes?


CPU_EXCEEDED = (CpusUsage > Cpus + 0.8 + (Cpus * 0.1))
MEMORY_EXCEEDED = ( ResidentSetSize > 1024 * RequestMemory )
TIME_EXCEEDED = (Time() - JobCurrentStartExecutingDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(MEMORY_EXCEEDED) || $(TIME_EXCEEDED) || $(CPU_EXCEEDED)
SYSTEM_PERIODIC_HOLD_REASON = {"","MEMORY_EXCEEDED", "TIME_EXCEEDED", "CPU usage exceeded request_cpus"}[max({int($(MEMORY_EXCEEDED))*1,int($(TIME_EXCEEDED))*2,int($(CPU_EXCEEDED))*3})]

David

On Mon, Aug 23, 2021 at 12:27 PM Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:
Hello,

I finally tested the method, and it turns out that it only works by defining SYSTEM_PERIODIC_HOLD_REASON
all at once, this way:

SYSTEM_PERIODIC_HOLD_REASON = {"","message1", "message2", "..."}[max({int($(condition1))*1, int($(condition2))*2, int($(condition3))*3})]

The initial plan of defining MyHoldReason = {"", ... } and then

SYSTEM_PERIODIC_HOLD_REASON = $MyHoldReason[max(...)]
does not work. Not sure if some syntactic adjustment could help.
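
The indexing trick can be sketched in Python as follows. This is only a model of the idea, with a hypothetical helper name (hold_reason); it is not how HTCondor evaluates the expression. Each condition is multiplied by its position so that the maximum is the index of the matching message; without the multipliers every true condition would map to index 1.

```python
def hold_reason(messages, conditions):
    """Pick a message by index, like {"", m1, m2}[max({int(c1)*1, int(c2)*2})].

    messages[0] is the empty "nothing triggered" entry, and conditions[i]
    pairs with messages[i + 1]. If several conditions are true, the one
    with the highest position wins, because max() keeps the largest index.
    """
    index = max(int(cond) * (pos + 1) for pos, cond in enumerate(conditions))
    return messages[index]


if __name__ == "__main__":
    messages = ["", "MEMORY EXCEEDED", "TIME EXCEEDED"]
    # Order of conditions: memory first, time second
    print(hold_reason(messages, [True, False]))
    print(hold_reason(messages, [False, True]))
```

Note that int() here requires real booleans; in the ClassAd version an undefined condition would not convert cleanly, which is one more reason to keep each clause strictly True / False.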

Stefano


On 22/08/21 13:01, David Cohen wrote:
Hi,
I followed Stefano's example (with Jaime's correction), and created the following:

CPU_EXCEEDED = (CpusUsage > Cpus + 0.8 + (Cpus * 0.1))
MEMORY_EXCEEDED = ( ResidentSetSize > 1024 * RequestMemory )
TIME_EXCEEDED = (Time() - JobCurrentStartExecutingDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
MyHoldReason = {"","MEMORY_EXCEEDED eval to True", "TIME_EXCEEDED eval to True", "CPU_EXCEEDED eval to True"}
SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || $(MEMORY_EXCEEDED) || $(TIME_EXCEEDED) || $(CPU_EXCEEDED)
SYSTEM_PERIODIC_HOLD_REASON = $(MyHoldReason)[max({int($(MEMORY_EXCEEDED))*1,int($(TIME_EXCEEDED))*2,int($(CPU_EXCEEDED))*3})]

I first tried adding this to the startd config and running condor_reconfig; the overtime job wasn't removed. I then tried the same on the schedd, with the same result.
When I had only the time rule, it was on the schedd. The CPU and memory rules, which end up conflicting with the schedd SYSTEM_PERIODIC_HOLD, come from the startd.
So maybe my failure is the attempt to combine them in one location.

Any ideas?

David



On Thu, Aug 19, 2021 at 11:40 PM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
I commend you on your advanced use of ClassAds operators.
You will need to use $() when referencing the TooMuch* parameters in your SYSTEM_PERIODIC_HOLD and SYSTEM_PERIODIC_HOLD_REASON values. Since the TooMuch* parameters are config file macros and not ClassAd attributes in the job ads, they need to be expanded at config file parsing/lookup time.
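
A rough Python model of the difference between a bare name and a $() reference is sketched below. It treats macro expansion as plain text substitution; the function name expand and the regular expression are assumptions made for this illustration, and the real HTCondor config parser handles more cases (defaults such as $(NAME:value), self-references, and so on).

```python
import re

# Matches simple $(NAME) references only; forms like $(NAME:default)
# are deliberately left untouched by this toy model.
_MACRO = re.compile(r"\$\((\w+)\)")


def expand(value, macros):
    """Textually replace each $(NAME) with its macro body until none remain.

    Assumes every referenced macro exists and that no macro refers to itself.
    """
    while _MACRO.search(value):
        value = _MACRO.sub(lambda m: macros[m.group(1)], value)
    return value


if __name__ == "__main__":
    macros = {"TooMuchTime": "(ServerTime - JobStartDate > 86400 * 7)"}
    # The bare name TooMuchMemory survives as-is, so the job ad would be
    # searched for an attribute of that name, which does not exist.
    print(expand("$(TooMuchTime) || TooMuchMemory", macros))
```

After expansion, only the $() reference has become an expression over job attributes; the bare name is left for ClassAd evaluation, where it is undefined.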

- Jaime

On Aug 19, 2021, at 2:57 PM, Stefano Dal Pra <stefano.dalpra@xxxxxxxxxxxx> wrote:

Hello, i was about to test something similar, by defining the following checks:

TooMuchDisk   = (DiskUsage_raw > 20 * CpusProvisioned * 1024000)
TooMuchTime   = (ServerTime - JobStartDate > 86400 * 7)
TooMuchMemory = (MemoryProvisioned > 6000)
TooMuchImg = ImageSize_RAW/1e6 > 35 * CpusProvisioned

SYSTEM_PERIODIC_HOLD = $(SYSTEM_PERIODIC_HOLD:False) || TooMuchDisk || TooMuchTime || TooMuchMemory || TooMuchImg

MyHoldReason = {"","TooMuchDisk eval to True", "TooMuchTime eval to True", "TooMuchMemory eval to True", "TooMuchImg eval to True"}
SYSTEM_PERIODIC_HOLD_REASON = $(MyHoldReason)[max({int(TooMuchDisk),int(TooMuchTime)*2,int(TooMuchMemory)*3})]

The idea is to define MyHoldReason as an array of strings, and set
SYSTEM_PERIODIC_HOLD_REASON as one string from the array, whose index comes from the boolean values of the checks.

I think this should work, provided that int(True) == 1 and int(False) == 0, but I have not yet tested it.
Stefano


On 19/08/21 18:47, Jaime Frey wrote:
It's a little cumbersome to have multiple hold triggers with distinct reason messages. You need to chain them together manually. Here's a pattern to follow to keep it from becoming too confusing:

HOLD_CLAUSE_1 = ( ResidentSetSize > 1024 * RequestMemory )
HOLD_REASON_1 = "Memory usage too high (Trying to use more than requested-memory)"

HOLD_CLAUSE_2 = (Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
HOLD_REASON_2 = "Job Is Running over time"

SYSTEM_PERIODIC_HOLD = $(HOLD_CLAUSE_1) || $(HOLD_CLAUSE_2)
SYSTEM_PERIODIC_HOLD_REASON = $(HOLD_CLAUSE_1) ? $(HOLD_REASON_1) : $(HOLD_REASON_2)

If you have more than two hold expressions, you may need to add some parentheses to the SYSTEM_PERIODIC_HOLD_REASON expression to ensure the nested ?: operators evaluate properly.
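
For more than two clauses, the nesting would look like c1 ? r1 : (c2 ? r2 : r3). A small Python sketch of that chaining follows; the function name chained_reason is hypothetical and the code only models the selection logic, not ClassAd evaluation.

```python
def chained_reason(clauses, reasons):
    """Model of c1 ? r1 : (c2 ? r2 : (... : r_last)).

    clauses and reasons have the same length. The first true clause
    selects its reason; the last reason is the fall-through branch and
    is returned without testing its clause, as in the two-clause pattern.
    """
    if len(reasons) == 1:
        return reasons[0]
    if clauses[0]:
        return reasons[0]
    return chained_reason(clauses[1:], reasons[1:])


if __name__ == "__main__":
    reasons = ["memory too high", "running over time", "cpu over request"]
    print(chained_reason([False, True, False], reasons))
```

Here the first true clause wins. The fall-through reason is only meaningful because the reason expression is consulted when the hold expression itself is already true.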

- Jaime

On Aug 19, 2021, at 7:10 AM, David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Thanks Christoph,
<quote><doublequote><tick>
didn't return results, but:
<doublequote><tick>
did the trick.

It returned 6 jobs that were held due to high memory usage, not for running over time.
That indicated that the following, from the startd configuration, is causing the conflict:
SYSTEM_PERIODIC_HOLD = ( ResidentSetSize > 1024 * RequestMemory )
SYSTEM_PERIODIC_HOLD_REASON = "Memory usage too high (Trying to use more than requested-memory)"

What is the proper way to create multiple SYSTEM_PERIODIC_HOLD without them conflicting with each other?


On Thu, Aug 19, 2021 at 2:16 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,

try

condor_q -all -nobatch -constraint '"`condor_config_val SYSTEM_PERIODIC_HOLD`"'

(<quote><doublequote><tick>)


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "David Cohen" <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, 19 August 2021 12:59:15
Subject: Re: [HTCondor-users] SYSTEM_PERIODIC_HOLD ignored

Thanks Jaime for your reply,

condor_q -all -nobatch -constraint `condor_config_val SYSTEM_PERIODIC_HOLD`

-- Parse error in constraint expression "("

Looking at a job that should have been put on hold:
HiMemUser = 0
RequestMemory = 5120
JobCurrentStartDate = 1628598643    ## Time() - 1628598643 > 72*3600 - Assuming Time() is working properly and returning the time as Epoch value.

The error seems to indicate a typo, but I cannot figure it out.
All the arguments that need to be evaluated are present and have the expected values.
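
As a sanity check on the arithmetic for the job above, here is a short Python sketch. It assumes Time() returns Unix epoch seconds, as stated in the comment on JobCurrentStartDate; the names over_time and DEADLINE are made up for this example.

```python
from datetime import datetime, timezone

JOB_CURRENT_START_DATE = 1628598643  # from the job ad above
LIMIT_SECONDS = 72 * 3600            # HiMemUser = 0, so the 72-hour branch applies

# First epoch second after which the condition can become true
DEADLINE = JOB_CURRENT_START_DATE + LIMIT_SECONDS


def over_time(now_epoch):
    """Mirrors (Time() - JobCurrentStartDate) > 72*3600 for this job."""
    return (now_epoch - JOB_CURRENT_START_DATE) > LIMIT_SECONDS


if __name__ == "__main__":
    start = datetime.fromtimestamp(JOB_CURRENT_START_DATE, tz=timezone.utc)
    deadline = datetime.fromtimestamp(DEADLINE, tz=timezone.utc)
    print("started :", start.isoformat())
    print("deadline:", deadline.isoformat())
```

The job started on 10 August 2021 (UTC) and the 72-hour limit ran out on 13 August 2021, so on the date of this message (19 August 2021) the condition itself should evaluate to true for this job.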




On Wed, Aug 18, 2021 at 12:03 AM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
I can't think of anything that would normally cause a periodic hold expression to stop working.
Here are a couple of ideas for debugging the problem...

When there's a job in the queue that you think should be affected by the periodic hold expression, try running this command:
condor_q -all -nobatch -constraint `condor_config_val SYSTEM_PERIODIC_HOLD`

If that doesn't display the problematic job(s), try altering the expression (removing or adjusting terms) to see what's needed to make the jobs appear. That can reveal differences between what you're checking for and what's in the job ads.

To ensure the schedd is evaluating the periodic job expressions on a timely basis, you can try amending the expression to always hold special test jobs. For example, you can add this to the end of your config files:
SYSTEM_PERIODIC_HOLD = ($(SYSTEM_PERIODIC_HOLD)) || AdminHoldJob=?=true

Then, submit a test job with the following line in the submit file:
+AdminHoldJob=True

Then, wait and see if the job gets held.

- Jaime

> On Aug 17, 2021, at 5:09 AM, David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
> A SYSTEM_PERIODIC_HOLD, configured on the schedd, that used to work is ignored lately:
>
> SYSTEM_PERIODIC_HOLD = (Time() - JobCurrentStartDate) > IfthenElse(HiMemUser && (RequestMemory > 40*1024), 120*3600 , 72*3600)
> SYSTEM_PERIODIC_HOLD_Reason = "Job Is Running over time"
> SYSTEM_PERIODIC_REMOVE = JobStatus == 5 && (Time() - EnteredCurrentStatus) > 600
>
> I could find no reference to that in the system's log.
> How can I debug that?
>
> Best,
> David


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
