
[HTCondor-users] Re: RemoteWallClockTime vs CommittedTime




----- Reply message -----
From: "Daniel Forrest" <dan.forrest@xxxxxxxxxxxxx>
To: "John Weigand" <weigand@xxxxxxxx>
Cc: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Tanya Levshina" <tlevshin@xxxxxxxx>
Subject: [HTCondor-users] RemoteWallClockTime vs CommittedTime
Date: Wed, 16 Oct 2013, 10:34


Hi John,

> No, this is always after the job has completed or been killed.  The
> RemoteWallClockTime is always > 0.
>
> Interesting that you mention that if the job gets evicted, then
> RemoteWallClockTime is updated.  I am now wondering if this is an additive
> process or just a plain update.  By that I mean, if a job gets evicted
> multiple times, does the RemoteWallClockTime get added to each time or just
> updated with the latest time.

RemoteWallClockTime is cumulative.

> or
> If there is some means of killing the job (maybe kill -9) that would leave
> the values in the state I am seeing.
>
> Maybe I am not understanding the doc correctly but it seems that
> RemoteWallClockTime  should always be >= CommittedTime.

I have access to almost a decade's worth of Condor history data, so I ran
a script overnight to look for RemoteWallClockTime < CommittedTime.
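
A minimal sketch of what such a scan could look like (this is not the
script actually used).  It assumes the history has been dumped as
long-format ClassAd text, one "Name = value" pair per line with a blank
line between ads; the helper names and the sample data are hypothetical.

```python
import re

# One "Name = value" pair per line, as in long-format ClassAd text.
ATTR_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_ads(text):
    """Split long-format ClassAd text into a list of dicts of raw strings.

    Assumption: ads are separated by blank lines.
    """
    ads, current = [], {}
    for line in text.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            current[match.group(1)] = match.group(2)
        elif not line.strip() and current:
            ads.append(current)
            current = {}
    if current:
        ads.append(current)
    return ads


def as_number(ad, name):
    """Return a numeric attribute as float, or None if absent or non-numeric."""
    try:
        return float(ad[name])
    except (KeyError, ValueError):
        return None


def find_anomalies(ads):
    """Return the ads in which RemoteWallClockTime < CommittedTime."""
    hits = []
    for ad in ads:
        wall = as_number(ad, "RemoteWallClockTime")
        committed = as_number(ad, "CommittedTime")
        if wall is not None and committed is not None and wall < committed:
            hits.append(ad)
    return hits


# Made-up sample data, for illustration only.
SAMPLE = """\
ClusterId = 101
ProcId = 0
RemoteWallClockTime = 3600.0
CommittedTime = 3600

ClusterId = 102
ProcId = 0
RemoteWallClockTime = 1.0
CommittedTime = 5400
TerminationPending = true
"""

if __name__ == "__main__":
    for ad in find_anomalies(parse_ads(SAMPLE)):
        print(ad["ClusterId"], ad["ProcId"],
              ad["RemoteWallClockTime"], ad["CommittedTime"])
```

Ads missing either attribute are skipped rather than counted as
anomalies, so incomplete records do not produce false hits.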

I have found multiple cases (and luckily some still had UserLog files
lying around) with two different scenarios.

1.) Errors connecting to the schedd after writing a checkpoint.

   The CommittedTime is logged when the checkpoint is written, but
   the RemoteWallClockTime is logged at shadow exit.  If this second
   update fails then some RemoteWallClockTime will be lost.

2.) Job running a second time with TerminationPending = TRUE.

   A comment from condor_shadow.V6/shadow.C:

   /* If the completed job had been committed to the job queue,
      but for some reason the shadow exited wierdly and the
      schedd is trying to run it again, then simply write
      the job termination events and send the email as if the job had
      just ended. */

   I believe this is the case you are seeing when RemoteWallClockTime
   is only 1 second.  That is about how long the job takes to exit
   immediately once TerminationPending is set.  Resetting
   RemoteWallClockTime in this case appears to be a bug.  To verify
   this, check for TerminationPending in the job's ClassAd.

--
Dan
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/