
Re: [HTCondor-users] removed job with no walltime but start dates



Hi Jason,

many thanks for the detailed explanation! :)

The thing is that I am trying to compile, from the history files, some
historic accounting records of the aggregated CPU times used by our
groups. However, for one group (CMS), the total core-weighted walltimes
I get are significantly off from the integrated statistics reported by
the VO itself, as well as from what we see in the accumulated monitoring
stats of active jobs (i.e., ~condor_q).
So far my only suspicion is that many jobs of this group/VO ended up
in state 3/removed, so I assume that part of the missing walltime
might actually have been run in these removed jobs. But I can't
really make heads or tails of these jobs :-/
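
For reference, a minimal sketch of the kind of aggregation meant here, in
Python. It assumes the history ads have already been parsed into dicts.
Using AcctGroup as the group key and RequestCpus as the core weight is an
assumption, and summarize() is only an illustrative name. For removed jobs
without a RemoteWallClockTime it records a rough upper bound for the last
run attempt only (EnteredCurrentStatus - JobCurrentStartDate), which is not
a measured walltime.

```python
from collections import defaultdict


def _num(value):
    """Return value as float, or None for missing or 'undefined' entries."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize(ads):
    """Sum core-weighted walltime per group and flag removed jobs without walltime."""
    totals = defaultdict(lambda: {"core_seconds": 0.0,
                                  "unaccounted_jobs": 0,
                                  "upper_bound_core_seconds": 0.0})
    for ad in ads:
        group = ad.get("AcctGroup", "unknown")       # assumed grouping attribute
        cores = _num(ad.get("RequestCpus")) or 1.0   # assumed core weight
        wall = _num(ad.get("RemoteWallClockTime"))
        entry = totals[group]
        if wall is not None:
            entry["core_seconds"] += cores * wall
            continue
        # No walltime reported: bound the last attempt by its start and
        # the moment the job entered its final status.
        started = _num(ad.get("JobCurrentStartDate"))
        ended = _num(ad.get("EnteredCurrentStatus"))
        if _num(ad.get("JobStatus")) == 3 and started and ended and ended > started:
            entry["unaccounted_jobs"] += 1
            entry["upper_bound_core_seconds"] += cores * (ended - started)
    return dict(totals)
```

Comparing core_seconds with upper_bound_core_seconds per group would at
least show whether the removed jobs could plausibly account for the gap.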

Cheers,
  Thomas

On 11/02/2022 20.53, Jason Patton wrote:
> Hi Thomas,
> 
> The timeline of which attributes get updated when and by which daemon is
> a bit of a hairy mess, which probably deserves better documentation
> as the current description in the manual isn't enough to keep us from
> having to do code archeology every time we want to figure out when
> something is touched. Here's my best understanding of the attrs you
> mentioned:
> 
> JobCurrentStartDate, NumShadowStarts, and JobRunCount - These are set by
> the schedd each time a shadow starts.
> JobCurrentStartExecutingDate and NumJobStarts - These are set by the
> shadow when it first notices that the starter is running the job's
> executable (after file transfer). There could be any number of issues on
> the starter side (file transfer problem, exception in the starter
> itself, communication error, etc.) that result in the shadow never
> receiving the "job has started executing" signal from the starter, which
> means these attributes may not get updated even if code has started
> executing on the remote machine.
> RemoteWallClockTime - I thought that this should be initially set to 0
> by condor_submit, so I'm confused why it would be undefined. Maybe this
> behavior has changed. Anyway, the startd updates this only when the job
> has exited (including if it is exiting because of job removal). If
> something goes wrong at the remote machine, again, this value may not
> end up being computed or sent back to the schedd.
> EnteredCurrentStatus - Updated by the schedd when a job is submitted,
> held, or released, or when the schedd notices that a job is running or
> stopped. Again, for noticing when jobs are running or stopped, it relies
> on communication coming back from the starter.
> 
> To answer your question, my medium-confidence guess would be that your
> job tried to start running twice (NumShadowStarts == JobRunCount ==
> 2)... the first time there was a problem before or during file transfer
> (so NumJobStarts was not incremented), and then on the second attempt,
> file transfer was successful and execution started (NumJobStarts == 1)
> but something bad happened either at the execute point or with the
> communication channel just at or before the time that you removed the
> job, so the update to RemoteWallClockTime never got sent back.
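
The reasoning above can be turned into a small check on a job ad. This is
only a sketch in Python: describe_run_history() is a made-up helper, the ad
is assumed to be a plain dict, and the notes restate the attribute
descriptions from this thread rather than any documented HTCondor
behaviour. The last check relies on JobCurrentStartDate being reset at
every shadow start: an executing date older than that would have to stem
from an earlier attempt.

```python
def _as_int(value):
    """Return value as int, or 0 for missing or 'undefined' entries."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def describe_run_history(ad):
    """Return human-readable notes on what the counters in a job ad imply."""
    shadow_starts = _as_int(ad.get("NumShadowStarts"))
    job_starts = _as_int(ad.get("NumJobStarts"))
    notes = []
    if shadow_starts == 0:
        notes.append("no shadow was ever started")
    unconfirmed = shadow_starts - job_starts
    if unconfirmed > 0:
        notes.append(f"{unconfirmed} attempt(s) never reported that execution started")
    if job_starts > 0:
        notes.append(f"{job_starts} attempt(s) reported that execution started")
        if ad.get("RemoteWallClockTime") in (None, "undefined"):
            notes.append("walltime was never reported back")
    start = _as_int(ad.get("JobCurrentStartDate"))
    executing = _as_int(ad.get("JobCurrentStartExecutingDate"))
    if start and executing and executing < start:
        notes.append("executing date predates the current start date, "
                     "so it likely belongs to an earlier attempt")
    return notes
```

Feeding it the values from [1] below yields all four kinds of note,
including the one about the executing date.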
> 
> Does that make sense, or have I missed explaining another mysterious
> value in the ad?
> 
> Jason Patton
> 
> On 2/11/22 10:56 AM, Thomas Hartmann wrote:
>> Hi all,
>>
>> I am struggling to interpret jobs at our site that ended up in state
>> 3/removed and have event dates that seem odd to me.
>>
>> For example, job [1] got submitted through a CondorCE onto the cluster
>> and got removed around 1641290372 (last current status).
>> The job has no RemoteWallClockTime (undefined) - however, the job
>> (shadow??) actually has a number of start dates. Since the job has two
>> shadow starts and a run count of two, but only one actual job start, I
>> am unsure how to interpret the job start dates here.
>> I suppose the initial start date points to the first shadow/job count,
>> with a second start(?) around the CurrentStarts.
>>
>> But has there actually been a job instance that ran on a node? Or are
>> the various start dates referring to shadow events (and if so, what did
>> the shadow do)?
>> While there was no RemoteWallClockTime logged, what happened between
>> CurrentStart*Date and EnteredCurrentStatus?
>>
>> btw: what is actually the difference between JobCurrentStartDate and
>> JobCurrentStartExecutingDate?
>> I would read [2] in a way that JobCurrentStartDate is the moment the
>> sandbox transfer is initiated and that JobCurrentStartExecutingDate is
>> the moment the transfer finished and the job actually starts, right?
>> (however, this interpretation would break down here, where the
>> *ExecutingDate is earlier than *StartDate)
>>
>> Maybe somebody has an idea what the event flow of this shadow/job might
>> have been?
>>
>> (package versions during history generation were [3])
>>
>> Cheers and thanks for ideas,
>>   Thomas
>>
>>
>>
>> [1]
>> ClusterID: 2131886
>> JobStatus: 3
>> QDate: 1641290372
>> JobStartDate: 1641270888
>> JobCurrentStartDate: 1641278464
>> JobCurrentStartExecutingDate: 1641270889
>> EnteredCurrentStatus: 1641290372
>> RemoteWallClockTime: undefined
>> CumulativeSlotTime: 0
>> CommittedTime: 0
>> CompletionDate: 0
>> NumShadowStarts: 2
>> NumJobStarts: 1
>> JobRunCount: 2
>>
>>
>> [2]
>> https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html?highlight=JobCurrentStartExecutingDate#job-classad-attributes
>>
>>
>>
>> [3]
>> condor-9.0.8-1.el7.x86_64
>> condor-classads-9.0.8-1.el7.x86_64
>> condor-externals-9.0.8-1.el7.x86_64
>> condor-procd-9.0.8-1.el7.x86_64
>> htcondor-ce-5.1.2-1.el7.noarch
>> htcondor-ce-apel-5.1.2-1.el7.noarch
>> htcondor-ce-client-5.1.2-1.el7.noarch
>> htcondor-ce-condor-5.1.2-1.el7.noarch
>> htcondor-ce-view-5.1.2-1.el7.noarch
>> python2-condor-9.0.8-1.el7.x86_64
>> python3-condor-9.0.8-1.el7.x86_64
>>
>> on EL7 3.10.0-1160.36.2.el7.x86_64
>>
>>
>> [queries.a]
>>> condor_history -file history.a 2131886 -af ClusterID RoutedFromJobId
>> JobStatus QDate JobStartDate JobCurrentStartDate
>> JobCurrentStartExecutingDate EnteredCurrentStatus RemoteWallClockTime
>> CumulativeSlotTime CommittedTime CompletionDate NumShadowStarts
>> NumJobStarts JobRunCount
>> 2131886 1143391.0 3 1641290372 1641270888 1641278464 1641270889
>> 1641290372 undefined 0 0 0 2 1 2
>>
>> [queries.b]
>>> condor_history -file history.a 2131886 -format "ClusterID: %d\n"
>> ClusterID -format "RoutedFromJobId: %s\n" RoutedFromJobId -format
>> "JobStatus: %d\n" JobStatus -format "QDate: %d\n" QDate -format
>> "JobStartDate: %d\n" JobStartDate -format "JobCurrentStartDate: %d\n"
>> JobCurrentStartDate -format "JobCurrentStartExecutingDate: %d\n"
>> JobCurrentStartExecutingDate -format "EnteredCurrentStatus: %d\n"
>> EnteredCurrentStatus -format "RemoteWallClockTime: %V\n"
>> RemoteWallClockTime -format "CumulativeSlotTime: %d\n"
>> CumulativeSlotTime -format "CommittedTime: %d\n" CommittedTime -format
>> "CompletionDate: %d\n" CompletionDate -format "NumShadowStarts: %d\n"
>> NumShadowStarts -format "NumJobStarts: %d\n" NumJobStarts -format
>> "JobRunCount: %d\n" JobRunCount
>> ClusterID: 2131886
>> RoutedFromJobId: 1143391.0
>> JobStatus: 3
>> QDate: 1641290372
>> JobStartDate: 1641270888
>> JobCurrentStartDate: 1641278464
>> JobCurrentStartExecutingDate: 1641270889
>> EnteredCurrentStatus: 1641290372
>> RemoteWallClockTime: undefined
>> CumulativeSlotTime: 0
>> CommittedTime: 0
>> CompletionDate: 0
>> NumShadowStarts: 2
>> NumJobStarts: 1
>> JobRunCount: 2
>>
>>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
