
[HTCondor-users] removed job with no walltime but start dates



Hi all,

I am struggling to interpret jobs at our site that ended up in state
3 (removed) and have event dates that seem odd to me.

For example, job [1] was submitted through a CondorCE onto the cluster
and was removed around 1641290372 (its EnteredCurrentStatus).
The job has no RemoteWallClockTime (undefined); however, the job (or
its shadow?) does have a number of start dates. Since the job shows two
shadow starts and a run count of two, but only one actual job start, I
am unsure how to interpret the start dates here.
I suppose the initial start date (JobStartDate) points to the first
shadow start/run count, with a second start(?) around the
JobCurrentStart*Date values.

But has there actually been a job instance that ran on a node? Or do
the various start dates refer to shadow events (and if so, what did
the shadow do)?
Given that no RemoteWallClockTime was logged, what happened between
JobCurrentStart*Date and EnteredCurrentStatus?

By the way, what is actually the difference between JobCurrentStartDate
and JobCurrentStartExecutingDate?
I would read [2] to mean that JobCurrentStartDate is the moment the
sandbox transfer is initiated and that JobCurrentStartExecutingDate is
the moment the transfer has finished and the job actually starts. Is
that right?
(However, this interpretation would break down here, where
JobCurrentStartExecutingDate is earlier than JobCurrentStartDate.)

Maybe somebody has an idea what the event flow of this shadow/job might
have been?

(package versions during history generation were [3])

Cheers and thanks for ideas,
  Thomas



[1]
ClusterID: 2131886
JobStatus: 3
QDate: 1641290372
JobStartDate: 1641270888
JobCurrentStartDate: 1641278464
JobCurrentStartExecutingDate: 1641270889
EnteredCurrentStatus: 1641290372
RemoteWallClockTime: undefined
CumulativeSlotTime: 0
CommittedTime: 0
CompletionDate: 0
NumShadowStarts: 2
NumJobStarts: 1
JobRunCount: 2
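
To make the ordering of the timestamps in [1] easier to see, here is a
minimal Python sketch that sorts them and prints each one as UTC with its
offset from the earliest event. The helper name `timeline` is only
illustrative and not part of any HTCondor tool; the values are copied
from the listing above, and CompletionDate is left out because it is 0
(unset).

```python
from datetime import datetime, timezone

# Epoch timestamps copied from listing [1]; CompletionDate (0) is omitted.
events = {
    "JobStartDate": 1641270888,
    "JobCurrentStartExecutingDate": 1641270889,
    "JobCurrentStartDate": 1641278464,
    "EnteredCurrentStatus": 1641290372,
    "QDate": 1641290372,
}


def timeline(stamps):
    """Return (name, epoch, seconds since earliest event), sorted by time."""
    ordered = sorted(stamps.items(), key=lambda kv: kv[1])
    first = ordered[0][1]
    return [(name, ts, ts - first) for name, ts in ordered]


for name, ts, offset in timeline(events):
    utc = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    print(f"{name:30s} {ts}  {utc}  +{offset}s")
```

In this record JobCurrentStartExecutingDate sits 1 second after
JobStartDate and 7575 seconds before JobCurrentStartDate, while
EnteredCurrentStatus is 11908 seconds after JobCurrentStartDate.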


[2]
https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html?highlight=JobCurrentStartExecutingDate#job-classad-attributes


[3]
condor-9.0.8-1.el7.x86_64
condor-classads-9.0.8-1.el7.x86_64
condor-externals-9.0.8-1.el7.x86_64
condor-procd-9.0.8-1.el7.x86_64
htcondor-ce-5.1.2-1.el7.noarch
htcondor-ce-apel-5.1.2-1.el7.noarch
htcondor-ce-client-5.1.2-1.el7.noarch
htcondor-ce-condor-5.1.2-1.el7.noarch
htcondor-ce-view-5.1.2-1.el7.noarch
python2-condor-9.0.8-1.el7.x86_64
python3-condor-9.0.8-1.el7.x86_64

on EL7 3.10.0-1160.36.2.el7.x86_64


[queries.a]
> condor_history -file history.a 2131886 -af ClusterID RoutedFromJobId \
    JobStatus QDate JobStartDate JobCurrentStartDate \
    JobCurrentStartExecutingDate EnteredCurrentStatus RemoteWallClockTime \
    CumulativeSlotTime CommittedTime CompletionDate NumShadowStarts \
    NumJobStarts JobRunCount
2131886 1143391.0 3 1641290372 1641270888 1641278464 1641270889 1641290372 undefined 0 0 0 2 1 2
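
Since the positional -af output above is hard to read, a small Python
sketch can map it back onto the attribute names. The function
`parse_af_line` is a hypothetical helper, not an HTCondor API. It assumes
the attributes are requested in exactly the order shown in the query and
that none of the values contains whitespace (true for the attributes
used here).

```python
# Attribute order as passed to `condor_history ... -af` in [queries.a].
ATTRS = [
    "ClusterID", "RoutedFromJobId", "JobStatus", "QDate", "JobStartDate",
    "JobCurrentStartDate", "JobCurrentStartExecutingDate",
    "EnteredCurrentStatus", "RemoteWallClockTime", "CumulativeSlotTime",
    "CommittedTime", "CompletionDate", "NumShadowStarts", "NumJobStarts",
    "JobRunCount",
]


def parse_af_line(line, attrs=ATTRS):
    """Map one whitespace-separated -af output line onto attribute names."""
    values = line.split()
    if len(values) != len(attrs):
        raise ValueError(f"expected {len(attrs)} fields, got {len(values)}")
    record = {}
    for name, raw in zip(attrs, values):
        if raw == "undefined":
            record[name] = None   # attribute missing from the job ad
        elif name == "RoutedFromJobId":
            record[name] = raw    # keep the cluster.proc id as a string
        else:
            record[name] = int(raw)
    return record


# Output line copied from [queries.a].
line = ("2131886 1143391.0 3 1641290372 1641270888 1641278464 1641270889 "
        "1641290372 undefined 0 0 0 2 1 2")
job = parse_af_line(line)
for name in ATTRS:
    print(f"{name}: {job[name]}")
```

Treating "undefined" as None keeps a missing RemoteWallClockTime distinct
from a value of 0, which matters for the question asked here.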

[queries.b]
> condor_history -file history.a 2131886 \
    -format "ClusterID: %d\n" ClusterID \
    -format "RoutedFromJobId: %s\n" RoutedFromJobId \
    -format "JobStatus: %d\n" JobStatus \
    -format "QDate: %d\n" QDate \
    -format "JobStartDate: %d\n" JobStartDate \
    -format "JobCurrentStartDate: %d\n" JobCurrentStartDate \
    -format "JobCurrentStartExecutingDate: %d\n" JobCurrentStartExecutingDate \
    -format "EnteredCurrentStatus: %d\n" EnteredCurrentStatus \
    -format "RemoteWallClockTime: %V\n" RemoteWallClockTime \
    -format "CumulativeSlotTime: %d\n" CumulativeSlotTime \
    -format "CommittedTime: %d\n" CommittedTime \
    -format "CompletionDate: %d\n" CompletionDate \
    -format "NumShadowStarts: %d\n" NumShadowStarts \
    -format "NumJobStarts: %d\n" NumJobStarts \
    -format "JobRunCount: %d\n" JobRunCount
ClusterID: 2131886
RoutedFromJobId: 1143391.0
JobStatus: 3
QDate: 1641290372
JobStartDate: 1641270888
JobCurrentStartDate: 1641278464
JobCurrentStartExecutingDate: 1641270889
EnteredCurrentStatus: 1641290372
RemoteWallClockTime: undefined
CumulativeSlotTime: 0
CommittedTime: 0
CompletionDate: 0
NumShadowStarts: 2
NumJobStarts: 1
JobRunCount: 2

