
Re: [HTCondor-users] Avoiding CPU wastage



Hi Vikrant,

I think your periodic_hold is evaluating to ERROR when runtime < 300 because of division by zero.

Using the values from condor_q:
((((0 + 8) / 1) / ifthenelse(true && (11) > 300, (11), 0 ) * 100) > 40)
((8 / 0 * 100) > 40)
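
To make the arithmetic concrete, here is a minimal Python sketch of that evaluation. It is only an emulation, not HTCondor code: `ERROR`, `ifthenelse` and `divide` are hypothetical stand-ins, and it assumes that a division by zero yields ERROR and that ERROR propagates through the rest of the expression. The second sample (runtime 301) is made up for contrast.

```python
# Hypothetical Python emulation of the ClassAd evaluation; not HTCondor code.
ERROR = "ERROR"  # stand-in for the ClassAd ERROR value


def ifthenelse(cond, a, b):
    """Stand-in for the ClassAd ifThenElse() function."""
    return a if cond else b


def divide(num, den):
    """Assumption: dividing by zero yields ERROR, and ERROR propagates."""
    if num == ERROR or den == ERROR or den == 0:
        return ERROR
    return num / den


def periodic_hold(sys_cpu, user_cpu, request_cpus, runtime, running=True):
    # Mirrors TotalExecutingTime: 0 until the job has run for over 300 s.
    executing = ifthenelse(running and runtime > 300, runtime, 0)
    cpu_per_core = divide(sys_cpu + user_cpu, request_cpus)
    ratio = divide(cpu_per_core, executing)
    if ratio == ERROR:
        return ERROR
    return ratio * 100 > 40


# Values from condor_q for job 2564.0: 0.0 8.0 1 11.0
print(periodic_hold(0.0, 8.0, 1, 11.0))   # denominator is 0 here
# Made-up later sample, past the 300 second mark
print(periodic_hold(0.0, 8.0, 1, 301.0))
```

With the 11 second sample the denominator is 0, so the comparison never gets a number to work with.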

I'd guess it only worked for the sleep job, and in 8.6, because the expression wasn't evaluated until after the job had been running for 300 seconds.

I would try:
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / RemoteWallClockTime * 100)
periodic_hold = ifthenelse(JobStatus == 2 && RemoteWallClockTime > 300, $(RemoteCpuUtilizationPercent) > 40, False)
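
The intent of that guard, as a minimal Python sketch rather than anything HTCondor runs: the utilization is only computed once the job is running and past the minimum runtime, so the division never sees a zero denominator. The function name, the default arguments and the sample values in the usage lines are hypothetical, and I am assuming ifthenelse does not let the unselected branch affect the result, the way Python's `if` does not.

```python
def should_hold(job_status, sys_cpu, user_cpu, request_cpus, wall_clock,
                min_runtime=300, threshold_pct=40):
    """Hold only a running job (JobStatus == 2) that is past min_runtime."""
    if job_status == 2 and wall_clock > min_runtime:
        # Safe: wall_clock is known to be greater than min_runtime here.
        pct = (sys_cpu + user_cpu) / request_cpus / wall_clock * 100
        return pct > threshold_pct
    # Too early, or not running: never hold.
    return False


# Sample from condor_q (11 s of runtime): guard is false, so no hold.
print(should_hold(2, 0.0, 8.0, 1, 11.0))
# Made-up sample: 250 CPU seconds over 301 s of wall clock time.
print(should_hold(2, 0.0, 250.0, 1, 301.0))
```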

Best,
Collin

On Wed, May 15, 2019 at 12:22 AM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Team,

Any inputs to make this work on 8.5.8?

Thanks & Regards,
Vikrant Aggarwal


On Mon, May 13, 2019 at 7:32 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
This seems to misbehave only with condor version 8.5.8; in version 8.6.13 it works as expected.

Thanks & Regards,
Vikrant Aggarwal


On Mon, May 13, 2019 at 3:30 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Team,

Referring to the old email threads [1], [2], I am testing in the lab how to take action on jobs that have been running for more than 180s with either high or low CPU utilization.

- In the following submit file I am generating load using stress, and if the CPU utilization goes above .4 I put the job on hold and release it using periodic_release so that it can get scheduled on another node. Strangely, the job goes into hold status within 11s of running as per remotewallclocktime. The parameters used for evaluating the expression should return a value greater than .4, so I am not sure why the hold reason shows the condition as UNDEFINED.

~~~
$ cat stress.sh
#!/bin/bash
stress --cpu 1 -t 360

$ cat stress.sub
executable              = sleep.sh
log                     = stress.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) > 40)
periodic_hold_reason = "Using cpu more than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue


$ condor_q 2564.0


-- Schedd: testmachine : <IPaddress:9618?... @ 05/13/19 05:32:39
OWNER     BATCH_NAME       SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
vaggarwal CMD: stress.sh  5/13 05:30      _      _      _      1      1 2564.0

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q 2566.0 -af holdreason holdreasoncode
The job attribute PeriodicHold expression '( ( ( ( RemoteSysCpu + RemoteUserCpu ) / RequestCpus ) / ifthenelse(JobStatus == 2 && ( time() - JobCurrentStartDate ) > 300,( time() - JobCurrentStartDate ),0) * 100 ) > 40 )' evaluated to UNDEFINED 5

$ condor_q 2564.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 8.0 1 11.0
~~~

- If I run the same scenario for a sleep job, it works as expected. The job went into hold status 3 times, and the last time it stayed on hold because the PeriodicRelease condition was false.

~~~
$ cat sleep.sh
#!/bin/bash
# file name: sleep.sh

TIMETOWAIT="1020"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

$ cat sleep.sub
executable              = sleep.sh
log                     = sleep.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) < 20)
periodic_hold_reason = "Using cpu less than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue


$ condor_q 2555.0 -af holdreason holdreasoncode
Using cpu less than threshold 3

$ condor_q 2555.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 0.0 1 301.0
~~~
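
As a quick cross-check of the two `condor_q -af` samples above, here is a small Python sketch that parses such a line and computes the utilization percentage against remotewallclocktime. It is not part of the submit files; the helper name `utilization_pct` is made up, and it assumes the four fields arrive in the order RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime, as in the commands shown.

```python
def utilization_pct(af_line):
    """Parse 'sys user cpus wallclock' and return CPU utilization in percent.

    Returns None when the wall clock time is zero, since the ratio is
    not defined in that case.
    """
    sys_cpu, user_cpu, request_cpus, wall_clock = map(float, af_line.split())
    if wall_clock == 0:
        return None
    return (sys_cpu + user_cpu) / request_cpus / wall_clock * 100


# Stress job 2564.0 and sleep job 2555.0, copied from the output above.
for line in ("0.0 8.0 1 11.0", "0.0 0.0 1 301.0"):
    print(line, "->", utilization_pct(line))
```

Measured this way, the stress sample is well above the 40 threshold and the sleep sample is below the 20 threshold.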


My objective is to hold the job if it's not doing any activity, which seems to be working fine, but I want to confirm the other way around as well to ensure that the behavior is as expected.

Some queries related to RemoteSysCpu and RemoteUserCpu:

- During testing I observed that RemoteSysCpu and RemoteUserCpu change only when the job status changes; otherwise they remain 0.
- Also, are these parameters cumulative, like remotewallclocktime?



Thanks & Regards,
Vikrant Aggarwal

--
Collin Mehring | PE-JoSE - Software Engineer