Re: [HTCondor-users] RemoteWallClockTime vs CommittedTime
- Date: Wed, 16 Oct 2013 09:34:55 -0500
- From: Daniel Forrest <dan.forrest@xxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] RemoteWallClockTime vs CommittedTime
> No, this is always after the job has completed or been killed. The
> RemoteWallClockTime is always > 0.
> Interesting that you mention that if the job gets evicted, then
> RemoteWallClockTime is updated. I am now wondering if this is an additive
> process or just a plain update. By that I mean, if a job gets evicted
> multiple times, does the RemoteWallClockTime get added to each time or just
> updated with the latest time.
RemoteWallClockTime is cumulative.
> If there is some means of killing the job (maybe kill -9) that would leave
> the values in the state I am seeing.
> Maybe I am not understanding the doc correctly but it seems that
> RemoteWallClockTime should always be >= CommittedTime.
I have access to almost a decade's worth of Condor history data, so I
ran a script overnight to look for RemoteWallClockTime < CommittedTime.
I found multiple cases (and luckily some still had UserLog files
lying around) that fall into two different scenarios.
1.) Errors connecting to the schedd after writing a checkpoint.
The CommittedTime is logged when the checkpoint is written, but
the RemoteWallClockTime is logged at shadow exit. If this second
update fails then some RemoteWallClockTime will be lost.
2.) Job running a second time with TerminationPending = TRUE.
A comment from condor_shadow.V6/shadow.C:
/* If the completed job had been committed to the job queue,
but for some reason the shadow exited wierdly and the
schedd is trying to run it again, then simply write
the job termination events and send the email as if the job had
just ended. */
I believe this is the case you are seeing when RemoteWallClockTime
is only 1 second. That is how long it takes the job to exit
immediately when TerminationPending is set. It appears to be a bug
that RemoteWallClockTime is reset in this case. To verify this, you
should check for TerminationPending in the job's ClassAd.
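
For anyone who wants to run the same kind of check, here is a minimal
Python sketch. It is not the script I ran overnight. It assumes the
history has already been dumped as long-format ClassAd text (one
"Name = value" line per attribute, with ads separated by blank lines,
as "condor_history -l" prints them). The helper names and the SAMPLE
ads are made up for illustration. It flags ads where
RemoteWallClockTime < CommittedTime and reports whether
TerminationPending is set, which should help separate scenario 2 from
scenario 1.

```python
import re

# Matches one "Name = value" line of a long-format ClassAd dump.
ATTR_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_ads(text):
    """Split long-format ClassAd text into a list of {name: raw value} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            current[match.group(1)] = match.group(2)
        elif current:
            # A blank (or non-attribute) line ends the current ad.
            ads.append(current)
            current = {}
    if current:
        ads.append(current)
    return ads


def as_number(ad, name):
    """Return the attribute as a float, or None if missing or non-numeric."""
    try:
        return float(ad[name])
    except (KeyError, ValueError):
        return None


def suspicious_jobs(ads):
    """Return (job id, wall, committed, termination_pending) tuples for
    ads where RemoteWallClockTime < CommittedTime."""
    hits = []
    for ad in ads:
        wall = as_number(ad, "RemoteWallClockTime")
        committed = as_number(ad, "CommittedTime")
        if wall is None or committed is None:
            continue  # cannot compare, so skip this ad
        if wall < committed:
            job_id = "%s.%s" % (ad.get("ClusterId", "?"), ad.get("ProcId", "?"))
            pending = ad.get("TerminationPending", "").lower() == "true"
            hits.append((job_id, wall, committed, pending))
    return hits


# Made-up sample ads, for illustration only (not real history data).
SAMPLE = """\
ClusterId = 101
ProcId = 0
RemoteWallClockTime = 1.000000
CommittedTime = 3600
TerminationPending = TRUE

ClusterId = 102
ProcId = 0
RemoteWallClockTime = 7200.000000
CommittedTime = 3600
"""

for job_id, wall, committed, pending in suspicious_jobs(parse_ads(SAMPLE)):
    print(job_id, wall, committed, "TerminationPending" if pending else "")
```

A flagged job with TerminationPending set points at the second
scenario. A flagged job without it is more likely a lost shadow-exit
update, and the UserLog is the place to confirm that.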