Re: [HTCondor-users] Avoiding CPU wastage

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

On Wed, May 15, 2019 at 12:22 AM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Team,

Any inputs to make this work on 8.5.8?

Thanks & Regards,

Vikrant Aggarwal

On Mon, May 13, 2019 at 7:32 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
This seems to misbehave only with HTCondor version 8.5.8; in version 8.6.13 it works as expected.

Thanks & Regards,

Vikrant Aggarwal

On Mon, May 13, 2019 at 3:30 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Team,

Referring to the old email threads [1] and [2], I am testing in a lab how to take action on jobs that have been running for more than 180s with either high or low CPU utilization.

- In the following submit file I generate load using stress; if the CPU utilization goes above 0.4 (40%), the job is put on hold and then released using periodic_release so that it can get scheduled on another node. Strangely, the job goes into hold status within 11s of running, as per remotewallclocktime. The parameters used for evaluating the expression should return a value greater than 0.4, so I am not sure why the hold reason shows that the condition is UNDEFINED.

~~~
$ cat stress.sh
#!/bin/bash
stress --cpu 1 -t 360

$ cat stress.sub
executable              = sleep.sh
log                     = stress.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) > 40)
periodic_hold_reason = "Using cpu more than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue

$ condor_q 2564.0

-- Schedd: testmachine : <IPaddress:9618?... @ 05/13/19 05:32:39
OWNER     BATCH_NAME       SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
vaggarwal CMD: stress.sh  5/13 05:30      _      _      _      1      1 2564.0

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q 2566.0 -af holdreason holdreasoncode
The job attribute PeriodicHold expression '( ( ( ( RemoteSysCpu + RemoteUserCpu ) / RequestCpus ) / ifthenelse(JobStatus == 2 && ( time() - JobCurrentStartDate ) > 300,( time() - JobCurrentStartDate ),0) * 100 ) > 40 )' evaluated to UNDEFINED 5

$ condor_q 2564.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 8.0 1 11.0
~~~
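As a rough cross-check of the arithmetic only, the submit-file macros above can be mirrored in a few lines of Python. This is a sketch under stated assumptions, not a model of HTCondor's ClassAd evaluator: the function names are made up for illustration, the 11s remotewallclocktime is used as a stand-in for `time() - JobCurrentStartDate`, and the second call uses hypothetical values. It shows that while the runtime is at or below 300s the `ifthenelse` yields 0, so the utilization formula divides by zero; how the evaluator reports that case is not modelled here.

```python
def total_executing_time(job_status, runtime):
    # Mirrors: ifthenelse(JobStatus == 2 && runtime > 300, runtime, 0)
    return runtime if job_status == 2 and runtime > 300 else 0


def cpu_utilization_percent(sys_cpu, user_cpu, request_cpus, job_status, runtime):
    # Mirrors: ((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / TotalExecutingTime * 100
    denominator = total_executing_time(job_status, runtime)
    if denominator == 0:
        # Division by zero: no meaningful percentage exists here.
        # What the ClassAd evaluator returns in this case is NOT modelled.
        return None
    return ((sys_cpu + user_cpu) / request_cpus) / denominator * 100


# Values reported by condor_q for job 2564.0 (runtime approximated by
# remotewallclocktime, which is an assumption).
early = cpu_utilization_percent(0.0, 8.0, 1, job_status=2, runtime=11.0)

# Hypothetical values for a job that has passed the 300s threshold.
late = cpu_utilization_percent(0.0, 200.0, 1, job_status=2, runtime=400.0)

print(early)  # no percentage: the denominator is 0 before 300s
print(late)   # a percentage that can be compared against the 40 threshold
```

With the posted values the denominator is 0 at 11s, so the comparison against 40 has no numeric left-hand side at that point.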

- When I run the same scenario with a sleep job, it works as expected. The job went into hold status 3 times, and the last time it remained on hold because the PeriodicRelease condition evaluated to false.

~~~
$ cat sleep.sh
#!/bin/bash
# file name: sleep.sh

TIMETOWAIT="1020"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

$ cat sleep.sub
executable              = sleep.sh
log                     = sleep.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) < 20)
periodic_hold_reason = "Using cpu less than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue

$ condor_q 2555.0 -af holdreason holdreasoncode
Using cpu less than threshold 3

$ condor_q 2555.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 0.0 1 301.0
~~~

My objective is to hold a job if it is not doing any work, which seems to be working fine, but I also want to confirm the opposite case (high CPU utilization) to make sure the behavior is as expected.

Some questions related to RemoteSysCpu and RemoteUserCpu:

- During testing I observed that RemoteSysCpu and RemoteUserCpu change only when the job status changes; otherwise they remain 0.
- Also, are these attributes cumulative, like remotewallclocktime?

[1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-April/msg00103.shtml
[2] https://www-auth.cs.wisc.edu/lists/htcondor-users/2018-September/msg00092.sht

Thanks & Regards,

Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
