
Re: [HTCondor-users] RemoteWallClockTime vs CommittedTime

Hi John,

> No, this is always after the job has completed or been killed.  The 
> RemoteWallClockTime is always > 0.
> Interesting that you mention that if the job gets evicted, then 
> RemoteWallClockTime is updated.  I am now wondering if this is an additive 
> process or just a plain update.  By that I mean, if a job gets evicted 
> multiple times, does the RemoteWallClockTime get added to each time or just 
> updated with the latest time.

RemoteWallClockTime is cumulative.

> or
> If there is some means of killing the job (maybe kill -9) that would leave 
> the values in the state I am seeing.
> Maybe I am not understanding the doc correctly but it seems that 
> RemoteWallClockTime  should always be >= CommittedTime.

I have access to almost a decade's worth of Condor history data, so I ran
a script overnight to look for RemoteWallClockTime < CommittedTime.
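
For anyone who wants to run a similar check, here is a minimal sketch of
such a scan. It is not the script I ran. It assumes the history has been
dumped as text in long format (one "Attr = Value" per line, ads separated
by blank lines, as `condor_history -l` prints it). The function names and
the sample ads at the bottom are made up for illustration.

```python
def parse_long_ads(text):
    """Split long-format ad text into a list of {attr: raw_value} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current ad.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name] = value
    if current:
        ads.append(current)
    return ads


def as_number(ad, attr):
    """Return the attribute as a float, or None if missing or non-numeric."""
    try:
        return float(ad[attr])
    except (KeyError, ValueError):
        return None


def find_inconsistent(ads):
    """Return the ads where RemoteWallClockTime < CommittedTime."""
    hits = []
    for ad in ads:
        wall = as_number(ad, "RemoteWallClockTime")
        committed = as_number(ad, "CommittedTime")
        if wall is not None and committed is not None and wall < committed:
            hits.append(ad)
    return hits


# Made-up sample input, only to show the expected shape of the text.
SAMPLE = """\
ClusterId = 101
RemoteWallClockTime = 3600.000000
CommittedTime = 3600

ClusterId = 102
RemoteWallClockTime = 1.000000
CommittedTime = 5400
"""

if __name__ == "__main__":
    for ad in find_inconsistent(parse_long_ads(SAMPLE)):
        print(ad["ClusterId"], ad["RemoteWallClockTime"], ad["CommittedTime"])
```

Ads that lack either attribute are skipped rather than reported, so the
list of hits only contains jobs where both values were actually recorded.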

I have found multiple cases (and luckily some still had UserLog files
lying around) with two different scenarios.

1.) Errors connecting to the schedd after writing a checkpoint.

    The CommittedTime is logged when the checkpoint is written, but
    the RemoteWallClockTime is logged at shadow exit.  If this second
    update fails, then the wall clock time for that run is never
    recorded, and RemoteWallClockTime can end up smaller than
    CommittedTime.

2.) Job running a second time with TerminationPending = TRUE.

    A comment from condor_shadow.V6/shadow.C:

    /* If the completed job had been committed to the job queue,
       but for some reason the shadow exited wierdly and the
       schedd is trying to run it again, then simply write
       the job termination events and send the email as if the job had
       just ended. */

    I believe this is the case you are seeing when RemoteWallClockTime
    is only 1 second.  That is how long it takes for the job to exit
    immediately when TerminationPending is set.  It appears to be a
    bug that RemoteWallClockTime is being reset.  To verify this, you
    should check for TerminationPending in the job ClassAd.
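
A rough sketch of that check, assuming the job ad has already been read
into a dict of attribute name to raw string value. The function names and
the 5 second threshold are my own assumptions for illustration, not
anything defined by Condor.

```python
def is_termination_pending(ad):
    """True if the ad carries TerminationPending with a true value."""
    return ad.get("TerminationPending", "").strip().lower() == "true"


def looks_like_rerun_reset(ad, max_wall_seconds=5.0):
    """Flag ads matching scenario 2: TerminationPending is set and a tiny
    RemoteWallClockTime sits below CommittedTime.

    max_wall_seconds is an arbitrary cutoff chosen for this sketch.
    """
    if not is_termination_pending(ad):
        return False
    try:
        wall = float(ad["RemoteWallClockTime"])
        committed = float(ad["CommittedTime"])
    except (KeyError, ValueError):
        # Missing or non-numeric values: cannot decide, so do not flag.
        return False
    return wall <= max_wall_seconds and wall < committed


if __name__ == "__main__":
    # Made-up ad, shaped like the case described above.
    suspect = {
        "TerminationPending": "TRUE",
        "RemoteWallClockTime": "1.000000",
        "CommittedTime": "5400",
    }
    print(looks_like_rerun_reset(suspect))
```

A job that matches neither test but still has RemoteWallClockTime <
CommittedTime is more likely to be the first scenario, and its UserLog
(if it still exists) is the place to look for the failed schedd update.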