
Re: [HTCondor-users] Avoiding CPU wastage



This seems to misbehave only with HTCondor version 8.5.8; in version 8.6.13 it works as expected.

Thanks & Regards,
Vikrant Aggarwal


On Mon, May 13, 2019 at 3:30 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Team,

Following the old email threads [1] and [2], I am testing in a lab how to take action on jobs that have been running for more than 180 s with either high or low CPU utilization.

- In the following submit file I generate load using stress. If the CPU utilization goes above 0.4 (40%), the job is put on hold and then released via periodic_release so that it can be scheduled on another node. Strangely, the job goes into hold status within 11 s of running, according to RemoteWallClockTime. The attributes used to evaluate the expression should yield a value greater than 0.4, so I am not sure why the hold reason says the condition evaluated to UNDEFINED.

~~~
$ cat stress.sh
#!/bin/bash
stress --cpu 1 -t 360

$ cat stress.sub
executable              = sleep.sh
log                     = stress.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) > 40)
periodic_hold_reason = "Using cpu more than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue


$ condor_q 2564.0


-- Schedd: testmachine : <IPaddress:9618?... @ 05/13/19 05:32:39
OWNER     BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
vaggarwal CMD: stress.sh 5/13 05:30      _      _      _      1      1 2564.0

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q 2566.0 -af holdreason holdreasoncode
The job attribute PeriodicHold expression '( ( ( ( RemoteSysCpu + RemoteUserCpu ) / RequestCpus ) / ifthenelse(JobStatus == 2 && ( time() - JobCurrentStartDate ) > 300,( time() - JobCurrentStartDate ),0) * 100 ) > 40 )' evaluated to UNDEFINED 5

$ condor_q 2564.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 8.0 1 11.0
~~~
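
For reference, the arithmetic in the periodic_hold expression can be traced by hand with the values reported above. The following is a minimal Python sketch, not HTCondor code: the helper name `cpu_utilization_percent`, its parameter names, and the use of `None` for the zero-divisor case are assumptions made purely for illustration, and the sketch does not model how ClassAds actually evaluate the expression.

```python
def cpu_utilization_percent(remote_sys_cpu, remote_user_cpu, request_cpus,
                            runtime, job_status=2, min_runtime=300):
    """Mirror the submit-file formula (hypothetical helper, illustration only)."""
    # TotalExecutingTime = ifthenelse(JobStatus == 2 && runtime > 300, runtime, 0)
    if job_status == 2 and runtime > min_runtime:
        total_executing_time = runtime
    else:
        total_executing_time = 0

    if total_executing_time == 0:
        # The divisor is 0 here. Returning None is an assumption of this
        # sketch, not a statement about ClassAd semantics.
        return None

    cpu_seconds_per_core = (remote_sys_cpu + remote_user_cpu) / request_cpus
    return cpu_seconds_per_core / total_executing_time * 100


# Values from the condor_q output above: 0.0 8.0 1 11.0
early = cpu_utilization_percent(0.0, 8.0, 1, 11.0)

# Same CPU figures with the 300 s guard disabled, for comparison only
unguarded = cpu_utilization_percent(0.0, 8.0, 1, 11.0, min_runtime=0)

print(early, round(unguarded, 1))
```

At 11 s of runtime the guard selects 0 as the divisor, so the formula as written cannot produce a number to compare against 40 at that point. Whether this is related to the UNDEFINED result seen on 8.5.8 is not established by this sketch.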

- When I run the same scenario with a sleep job, it works as expected. The job went into hold status 3 times, and the last time it remained on hold because the PeriodicRelease condition was false.

~~~
$ cat sleep.sh
#!/bin/bash
# file name: sleep.sh

TIMETOWAIT="1020"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

$ cat sleep.sub
executable              = sleep.sh
log                     = sleep.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) < 20)
periodic_hold_reason = "Using cpu less than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue


$ condor_q 2555.0 -af holdreason holdreasoncode
Using cpu less than threshold 3

$ condor_q 2555.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 0.0 1 301.0
~~~
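
The release condition used in both submit files can be read as a simple three-part predicate. Below is a small Python sketch of that logic only; the function name `should_release` and its parameters are hypothetical, and it says nothing about when or how often HTCondor evaluates PeriodicRelease.

```python
def should_release(job_run_count, hold_reason_code, hold_subcode,
                   max_runs=5, expected_code=3, expected_subcode=30):
    """Model of: JobRunCount < 5 && HoldReasonCode == 3 && subcode == 30.

    Hypothetical helper for illustration; not part of HTCondor.
    """
    return (job_run_count < max_runs
            and hold_reason_code == expected_code
            and hold_subcode == expected_subcode)


# Hold with reason code 3 and subcode 30, as in the sleep job above
for run_count in range(1, 7):
    print(run_count, should_release(run_count, 3, 30))

# Hold with reason code 5, as reported for the stress job
print(should_release(1, 5, 30))
```

Under this reading, a job held with reason code 3 stays eligible for release until JobRunCount reaches 5, while a job held with reason code 5 (as in the stress case) never satisfies the condition and stays on hold.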


My objective is to hold a job if it is not doing any work, which seems to be working fine, but I also want to confirm the opposite case to make sure the behavior is as expected.

Some questions about RemoteSysCpu and RemoteUserCpu:

- During testing I observed that RemoteSysCpu and RemoteUserCpu change only when the job status changes; otherwise they remain 0.
- Also, are these attributes cumulative, like RemoteWallClockTime?


[1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-April/msg00103.shtml
[2] https://www-auth.cs.wisc.edu/lists/htcondor-users/2018-September/msg00092.sht

Thanks & Regards,
Vikrant Aggarwal