Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Best fit of CPUTime and WallTime

Date: Thu, 27 Sep 2018 16:03:27 +0000
From: Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Best fit of CPUTime and WallTime

Hi Petr,

The RemoteWallClockTime attribute is only updated when the job changes status, not while it's running. To get the current remote wall clock time, you need to do a bit of math:

(CurrentTime - JobCurrentStartExecutingDate)

That is only valid while the job is running (JobStatus == 2), and it gives the elapsed time for the current run only. If the job has run more than once, add RemoteWallClockTime to that value to get the total time it has been assigned to a machine and running across all starts.
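For anyone doing this outside of a ClassAd expression, here is a minimal Python sketch of the same arithmetic. It assumes the job ad has already been fetched into a plain dict keyed by attribute name; the function name wall_clock_so_far and the dict representation are illustrative only, not part of HTCondor.

```python
def wall_clock_so_far(ad, now):
    """Seconds the job has been assigned to a machine across all starts.

    `ad` is a hypothetical dict of job attributes; `now` is the current
    Unix time, standing in for CurrentTime.
    """
    # Wall clock time recorded at earlier status changes (0 if never set).
    previous = ad.get("RemoteWallClockTime", 0)
    # The in-progress run only counts while the job is running (status 2).
    if ad.get("JobStatus") == 2 and "JobCurrentStartExecutingDate" in ad:
        return previous + (now - ad["JobCurrentStartExecutingDate"])
    return previous


running = {"JobStatus": 2, "RemoteWallClockTime": 600,
           "JobCurrentStartExecutingDate": 1000}
print(wall_clock_so_far(running, now=1300))
```

With the sample values, the 600 seconds from earlier runs are added to the 300 seconds of the current run.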

Is TotalJobRunTime a custom attribute at your site?

Here's how I approached calculating CPU utilization for a job:

TotalExecutingTime = \
  ( ifThenElse(! isUndefined(RemoteWallClockTime), \
        RemoteWallClockTime, 0) - \
    ifThenElse(! isUndefined(CumulativeSuspensionTime), \
        CumulativeSuspensionTime, 0) \
  ) + \
  ( ifThenElse(JobStatus == 2, \
        CurrentTime - JobCurrentStartDate, 0) \
  ) + \
  ( ifThenElse(JobStatus == 7, \
        LastSuspensionTime - JobCurrentStartDate, 0) \
  )

The first clause takes any previous runs which have updated RemoteWallClockTime, and subtracts out any time that the job has spent suspended. The second clause adds the current runtime for a running job, and the third clause adds how much time a suspended job spent running before its most recent suspension. This doesn't take multiple suspensions into account: if you suspend a running job for a while, then continue it, then suspend it again, the first suspension time will be included in the executing time. I didn't see an easy way around that when I was writing it, though there may be one, and it didn't seem worth the effort to address.
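The three clauses can be mirrored in Python to check the arithmetic by hand. This is only a sketch under the same assumption of a job ad held in a plain dict; the function name total_executing_time is made up for illustration, and unlike the ClassAd version it raises KeyError rather than evaluating to UNDEFINED if JobCurrentStartDate or LastSuspensionTime is missing.

```python
def total_executing_time(ad, now):
    """Python mirror of the TotalExecutingTime expression above.

    `ad` is a hypothetical dict of job attributes; `now` stands in for
    CurrentTime.
    """
    # Clause 1: completed runs, minus time spent suspended.
    total = (ad.get("RemoteWallClockTime", 0)
             - ad.get("CumulativeSuspensionTime", 0))
    status = ad.get("JobStatus")
    if status == 2:
        # Clause 2: the current run of a running job.
        total += now - ad["JobCurrentStartDate"]
    elif status == 7:
        # Clause 3: the current run up to the most recent suspension.
        total += ad["LastSuspensionTime"] - ad["JobCurrentStartDate"]
    return total


ad = {"RemoteWallClockTime": 600, "CumulativeSuspensionTime": 100,
      "JobStatus": 2, "JobCurrentStartDate": 1000}
print(total_executing_time(ad, now=1300))
```

Here the first clause contributes 500 seconds and the second contributes 300.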

Looking at this nearly three years later, it should probably use JobCurrentStartExecutingDate, since the StartDate includes the time spent running the PreCmd, and also input transfer time if I'm remembering correctly. Normally that's negligible for the kinds of jobs my users run.

Next, I use that value to calculate a RemoteCpuUtilizationPercent value:

RemoteCpuUtilizationPercent = \
  ifThenElse(! isUndefined(TotalExecutingTime) && TotalExecutingTime > 0, \
    ((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / TotalExecutingTime * 100, \
     UNDEFINED)

As you can see, it's normalized for the number of CPUs the job requested, so a job which requests 1 CPU but uses 4 will show 400% utilization, while a job which requests 4 but uses 2 will show 50%. This makes it simple to spot who's under- and over-requesting, and you can write a watchdog to send out nagging e-mails if you're into that sort of thing.
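To make the normalization concrete, here is a small Python sketch of the same formula applied to the two cases just described. The function name and the sample figures (1000 seconds of executing time) are invented for illustration.

```python
def cpu_utilization_percent(sys_cpu, user_cpu, request_cpus, executing_time):
    """CPU seconds used per requested CPU, as a percentage of executing time.

    Returns None where the ClassAd expression would give UNDEFINED.
    """
    if executing_time is None or executing_time <= 0:
        return None
    return (sys_cpu + user_cpu) / request_cpus / executing_time * 100


# Requests 1 CPU but keeps 4 cores busy for 1000 s: 4000 CPU-seconds.
over = cpu_utilization_percent(0, 4000, 1, 1000)
# Requests 4 CPUs but keeps only 2 cores busy for 1000 s: 2000 CPU-seconds.
under = cpu_utilization_percent(0, 2000, 4, 1000)
print(over, under)  # 400.0 50.0
```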

I calculate the RemoteUserCpuUtilizationPercent in the same way, and then RemoteSysCpuUtilizationPercent is just the difference between the total and user figures.

Michael V. Pelletier

References:
- [HTCondor-users] Dedicated Scheduler Config to enable Parallel Jobs.
  - From: Sofya Urbaniec
- [HTCondor-users] Best fit of CPUTime and WallTime
  - From: Petr Horak
