[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] History file with incorrect JobCurrentStartDate? - version windows 8.4.4



Hi John

 

Thanks for the replies.

 

I'm not sure that the clock jumping scenario is the cause in this case.

 

Below is an extract of part of the output of condor_history for a particular history file.

 

As you can see there is a mix of -ve values amongst the correct +ve ones and these

always correspond to a jump in the JobCurrentStartDate. For this history file this JobCurrentStartDate

is ALWAYS the same. For other history files that also have -ve CommittedSlotTime they

also have a constant (but different for each history file) JobCurrentStartDate value.

 

If you want I can send you this history file offline?

 

condor_history -f history.20180320T072752 -format "%6s " ClusterId -format "%6s " ProcId -format "%10s " CommittedSlotTime -format "%d " CompletionDate -format "%d\n" JobCurrentStartDate

 

865057  18518     1707.0 1521453470 1521451762

865057   1842     1703.0 1521453099 1521451396

865057  19534     1700.0 1521452998 1521451297

865057  17784     1680.0 1521451762 1521450082

865057  15067     1718.0 1521451396 1521449647

865057  13836     1694.0 1521451297 1521449603

865057  14087  -186856.0 1521450290 1521637146

865057  12124     1800.0 1521450129 1521448329

865057  13952  -187151.0 1521449995 1521637146

865057  10558     1673.0 1521449895 1521448222

865057  14468  -187269.0 1521449877 1521637146

865057  13944  -187322.0 1521449824 1521637146

865057  10696  -187575.0 1521449571 1521637146

865057  16028  -187576.0 1521449570 1521637146

865057  12565  -187625.0 1521449521 1521637146

865057  27255  -187676.0 1521449470 1521637146

865057  15906  -187707.0 1521449439 1521637146

865057  27356  -187709.0 1521449437 1521637146

865057  15751  -187719.0 1521449429 1521637146

865057  15977  -187774.0 1521449372 1521637146

865057  26993  -187778.0 1521449368 1521637146

865057  26663  -187815.0 1521449331 1521637146

865057  26529  -187831.0 1521449315 1521637146

865057   6636     1765.0 1521449315 1521447550

865057   9149  -187842.0 1521449304 1521637146

865057  25891  -187858.0 1521449288 1521637146

865057  13271  -187865.0 1521449281 1521637146

865057  11931  -187935.0 1521449211 1521637146

865057  13019  -187938.0 1521449208 1521637146

865057   5811     1720.0 1521449198 1521447478

865057  13516  -187952.0 1521449194 1521637146

865057  15757  -187953.0 1521449193 1521637146

865057   6362  -187955.0 1521449192 1521637146

865057  15131  -187957.0 1521449189 1521637146

865057  15861  -187977.0 1521449169 1521637146

865057  15777  -187977.0 1521449169 1521637146

865057   5322     1737.0 1521449157 1521447404

865057   9882  -188014.0 1521449157 1521637146

865057  15754  -188003.0 1521449157 1521637146

 

 

I’ve done some more checking on the submit node itself and the 2 history files with –ve CommitedSlotTime

values are either side of the .schedd_address.super file (time wise) in the spool directory.

i.e. history.20180320T221705 and history.20180320T072752

 

 

This time also corresponds to what looks like a condor restart at that same time, based on the

times for these files in the log directory.

 

 

Maybe this has something to do with the issue?

 

Thanks.

 

Cheers

 

Greg

 

 

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller
Sent: Thursday, 5 April 2018 11:46 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] History file with incorrect JobCurrentStartDate? - version windows 8.4.4

 

We've done some digging in the code, and the only thing we can think that would explain this is if the clock jumped ahead by 2 days

just at the time the job started, then it jumped back again for the remainder of the job execution.

 

JobCurrentStartDate is set by the schedd at the time it creates the shadow,

the rest of the numbers are set or calculated by the shadow based on the current system clock time.

 

So in order for this to happen,  the schedd would need to get 1521637146 from the system clock when it creates the shadow

and the shadow would need to get 1521445871 from the system clock when the job actually begins running just a few seconds

later.  Otherwise, we can't explain how the CommittedSlotTime is negative - it is calculated by the shadow which means that the shadow

would have to SEE the JobCurrentStartDate that is in the future relative to it's own clock.

 

If this were a case of all of the values but JobCurrentStartDate being stale somehow, then the CommittedSlotTime would not be negative.

 

-tj

 

-----Original Message-----

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of John M Knoeller

Sent: Thursday, April 5, 2018 10:12 AM

To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>

Subject: Re: [HTCondor-users] History file with incorrect JobCurrentStartDate? - version windows 8.4.4

 

This is not a known bug. but it seems similar to this bug

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6626

 

which will be fixed in the upcoming 8.6.11 release.

 

could you send me the entire job ad so I can have a look?

 

thanks

-tj

 

 

-----Original Message-----

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Greg.Hitchen@xxxxxxxx

Sent: Thursday, April 5, 2018 2:00 AM

To: htcondor-users@xxxxxxxxxxx

Subject: [HTCondor-users] History file with incorrect JobCurrentStartDate? - version windows 8.4.4

 

Hi All

 

We have a number of windows submit nodes in our HTCondor pool.

 

We have reporting scripts on a linux machine that download all the history files and can

provide monthly (or daily) usage on a per user basis.

 

In our latest report for March 2018 we noticed something strange, -ve run times!

After tracking back through the scripts we eventually found the answer in the history files themselves.

 

e.g. for one job these are the relevant values in the history file

 

JobCurrentStartDate                                      1521637146

CompletionDate                                               1521450290

EnteredCurrentStatus                                   1521450290

JobCurrentStartTransferOutputDate      1521450290

JobFinishedHookDone                                  1521450290

LastJobLeaseRenewal                                    1521450290

JobCurrentStartExecutingDate                  1521445871

LastMatchTime                                                 1521445870

LastVacateTime                                                1521444418

JobLastStartDate                                              1521442616

LastRejMatchTime                                          1521436370

JobStartDate                                                      1521192534

QDate                                                                   1521181266

CumulativeSlotTime                                       45124

RemoteWallClockTime                                  45124

RemoteUserCpu                                              2970

RemoteSysCpu                                                 1286

CommittedSuspensionTime                        0

CumulativeSuspensionTime                        0

CommittedSlotTime                                       -186856

CommittedTime                                               -186856

 

Note the 2 -ve committed time values. These are equal to (CompletionDate - JobCurrentStartDate).

 

In fact the JobCurrentStartDate is nearly 2 days AFTER the CompletionDate!?

 

For the moment we will need to change our scripts to NOT use the CommittedSlotTime BUT

calculate it from (CompletionDate - JobLastStartDate).

 

Is this a known bug? Or something that someone has come across before?

I've had a look through the 8.2.* and 8.4.* release notes and bug fixes but couldn't see anything.

 

Thanks for any help.

 

Cheers

 

Greg

 

P.S. we will soon(ish) be upgrading our submit nodes from win2008 to win2016 and will upgrade

to the latest 8.6.* version then. Meanwhile the history files already exist so we will kludge

a workaround.

 

 

 

_______________________________________________

HTCondor-users mailing list

To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a

subject: Unsubscribe

You can also unsubscribe by visiting

https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

 

The archives can be found at:

https://lists.cs.wisc.edu/archive/htcondor-users/

 

_______________________________________________

HTCondor-users mailing list

To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a

subject: Unsubscribe

You can also unsubscribe by visiting

https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

 

The archives can be found at:

https://lists.cs.wisc.edu/archive/htcondor-users/

 

_______________________________________________

HTCondor-users mailing list

To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a

subject: Unsubscribe

You can also unsubscribe by visiting

https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

 

The archives can be found at:

https://lists.cs.wisc.edu/archive/htcondor-users/