
Re: [HTCondor-users] RemoteWallClockTime vs CommittedTime

Many thanks, Dan... more comments embedded below.

John Weigand

On 10/16/2013 9:34 AM, Daniel Forrest wrote:
Hi John,

No, this is always after the job has completed or been killed.  The
RemoteWallClockTime is always > 0.

Interesting that you mention that RemoteWallClockTime is updated if the job
gets evicted.  I am now wondering whether this is an additive process or just
a plain overwrite.  That is, if a job gets evicted multiple times, does
RemoteWallClockTime get added to each time, or is it just overwritten with
the latest value?

RemoteWallClockTime is cumulative.

If there is some means of killing the job (maybe kill -9), that would leave
the values in the state I am seeing.

Maybe I am not understanding the doc correctly, but it seems that
RemoteWallClockTime should always be >= CommittedTime.

I have access to almost a decade worth of Condor history data so I ran
a script overnight to look for RemoteWallClockTime < CommittedTime.
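Such a scan can be sketched roughly like this (a hypothetical sketch: the `ads` list stands in for job ClassAds read from a Condor history file, here reduced to plain dicts; the attribute names are the real ones under discussion):

```python
def find_suspect_ads(ads):
    """Return job ads where RemoteWallClockTime < CommittedTime,
    which the documentation suggests should never happen."""
    suspect = []
    for ad in ads:
        wall = ad.get("RemoteWallClockTime", 0)
        committed = ad.get("CommittedTime", 0)
        if wall < committed:
            suspect.append(ad)
    return suspect

# Toy ads illustrating the kind of record being searched for:
ads = [
    {"ClusterId": 1, "RemoteWallClockTime": 20289, "CommittedTime": 22869},
    {"ClusterId": 2, "RemoteWallClockTime": 20097, "CommittedTime": 0},
]
print([ad["ClusterId"] for ad in find_suspect_ads(ads)])  # -> [1]
```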

I have found multiple cases (and luckily some still had UserLog files
lying around) with two different scenarios.

1.) Errors connecting to the schedd after writing a checkpoint.

     The CommittedTime is logged when the checkpoint is written, but
     the RemoteWallClockTime is logged at shadow exit.  If this second
     update fails then some RemoteWallClockTime will be lost.
This may then explain cases like this one I am seeing...
   21,690 CPU
   20,289 RemoteWallClockTime
   22,869 CommittedTime

which, using RemoteWallClockTime, gives CPU efficiency = 106.9%
      and, using CommittedTime, CPU efficiency = 94.8%
Even though 94.8% is probably not a realistic utilization factor, it is more believable than 106.9%.
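Those two figures are just the ratio of CPU time to wall time; a quick check with the numbers above:

```python
def cpu_efficiency(cpu_seconds, wall_seconds):
    """CPU efficiency as a percentage; a value over 100% suggests the
    wall-time denominator is undercounted (e.g. a lost final update)."""
    return 100.0 * cpu_seconds / wall_seconds

print(round(cpu_efficiency(21690, 20289), 1))  # 106.9, using RemoteWallClockTime
print(round(cpu_efficiency(21690, 22869), 1))  # 94.8,  using CommittedTime
```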

It could also explain a lot of these (there are many more)...
   22,054 CPU
   20,097 RemoteWallClockTime
   0      CommittedTime

This second-update failure may be more common than #2 below.
I also see many cases where RemoteWallClockTime = CommittedTime yet CPU is greater than both, and cores don't appear to be a factor (though I can't rule them out yet).

Is there a way to tell when that update fails?

2.) Job running a second time with TerminationPending = TRUE.

     A comment from condor_shadow.V6/shadow.C:

     /* If the completed job had been committed to the job queue,
        but for some reason the shadow exited wierdly and the
        schedd is trying to run it again, then simply write
        the job termination events and send the email as if the job had
        just ended. */

     I believe this is the case you are seeing when RemoteWallClockTime
     is only 1 second.  That is the amount of time it takes for the job
     to exit immediately because TerminationPending is set.  It appears
     to be a bug that RemoteWallClockTime is being reset.  To verify
     this you should check for TerminationPending in the ClassAd.
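That check could be sketched roughly like this (a hypothetical sketch: in practice the ads would come from `condor_history -l` output or a site's history file, here reduced to plain dicts):

```python
def is_termination_pending_rerun(ad):
    """Flag ads matching case 2: TerminationPending set together with a
    suspiciously small RemoteWallClockTime (e.g. the 1-second jobs)."""
    return bool(ad.get("TerminationPending")) and ad.get("RemoteWallClockTime", 0) <= 1

ads = [
    {"ClusterId": 10, "TerminationPending": True, "RemoteWallClockTime": 1},
    {"ClusterId": 11, "TerminationPending": False, "RemoteWallClockTime": 20097},
]
print([ad["ClusterId"] for ad in ads if is_termination_pending_rerun(ad)])  # -> [10]
```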

Unfortunately, I only see the accounting data sent up to our accounting system (Gratia) from many grids within OSG, so I will have to get one of the admins to check their Condor history logs for some of these to verify. This case is much rarer than #1 above, but it was the one that triggered my curiosity about how we are handling the data. I have a lot more #1s.

This particular one (RemoteWallClockTime = 1) came up from one site on one day and then stopped.