
Re: [HTCondor-users] removed job with no walltime but start dates



Hi Thomas,

The timeline of which attributes get updated when, and by which daemon, is a bit of a hairy mess. It probably deserves better documentation, as the current description in the manual isn't enough to keep us from having to do code archeology every time we want to figure out when something is touched. Here's my best understanding of the attrs you mentioned:

JobCurrentStartDate, NumShadowStarts, and JobRunCount - These are set by the schedd each time a shadow starts.
JobCurrentStartExecutingDate and NumJobStarts - These are set by the shadow when it first notices that the starter is running the job's executable (after file transfer). Any number of issues on the starter side (file transfer problem, exception in the starter itself, communication error, etc.) can result in the shadow never receiving the "job has started executing" signal from the starter, which means these attributes may not get updated even if code has started executing on the remote machine.
RemoteWallClockTime - I thought that this should be initially set to 0 by condor_submit, so I'm confused why it would be undefined. Maybe this behavior has changed. Anyway, the startd updates this only when the job has exited (including if it is exiting because of job removal). If something goes wrong at the remote machine, again, this value may not end up being computed or sent back to the schedd.
EnteredCurrentStatus - Updated by the schedd when a job is submitted, held, or released, or when the schedd notices that a job is running or stopped. Again, for noticing when jobs are running or stopped, it relies on communication coming back from the starter.
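
Since the attributes are written at different moments by different daemons, it can help to sort the timestamps before reading anything into them. Here is a minimal sketch in plain Python (no HTCondor bindings), using only the epoch values from your ad [1]; the names timeline and AD_DATES are just illustrative:

```python
from datetime import datetime, timezone

# Epoch timestamps copied from the job ad [1] in the quoted message.
AD_DATES = {
    "QDate": 1641290372,
    "JobStartDate": 1641270888,
    "JobCurrentStartDate": 1641278464,
    "JobCurrentStartExecutingDate": 1641270889,
    "EnteredCurrentStatus": 1641290372,
}


def timeline(dates):
    """Return (attribute, epoch, UTC string) tuples sorted by time.

    Ties are broken alphabetically by attribute name so the order is stable.
    """
    rows = []
    for name, epoch in sorted(dates.items(), key=lambda kv: (kv[1], kv[0])):
        stamp = datetime.fromtimestamp(epoch, tz=timezone.utc)
        rows.append((name, epoch, stamp.strftime("%Y-%m-%d %H:%M:%S")))
    return rows


if __name__ == "__main__":
    for name, epoch, stamp in timeline(AD_DATES):
        print(f"{stamp} UTC  {epoch}  {name}")
```

Sorted this way, JobStartDate and JobCurrentStartExecutingDate sit one second apart at the front, JobCurrentStartDate follows 7576 seconds later, and EnteredCurrentStatus comes 11908 seconds after that.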

To answer your question, my medium-confidence guess would be that your job tried to start running twice (NumShadowStarts == JobRunCount == 2). The first time, there was a problem before or during file transfer (so NumJobStarts was not incremented). On the second attempt, file transfer was successful and execution started (NumJobStarts == 1), but something bad happened either at the execute point or with the communication channel just at or before the time that you removed the job, so the update to RemoteWallClockTime never got sent back.
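
The counter arithmetic behind that guess can be sketched as follows. This is only an illustration in plain Python: summarize_starts is a made-up helper, not anything in HTCondor, and it treats an undefined RemoteWallClockTime as None:

```python
def summarize_starts(ad):
    """Compare shadow starts with starts the shadow saw begin executing.

    `ad` is a plain dict of job attributes; an undefined attribute is
    represented as None (an assumption of this sketch).
    """
    shadow_starts = ad.get("NumShadowStarts") or 0
    job_starts = ad.get("NumJobStarts") or 0
    return {
        "shadow_starts": shadow_starts,
        "reported_executing": job_starts,
        # Shadow starts for which the "job has started executing"
        # signal never reached the shadow.
        "never_reported_executing": shadow_starts - job_starts,
        "wallclock_recorded": ad.get("RemoteWallClockTime") is not None,
    }


# Counter values copied from the job ad [1] in the quoted message.
AD = {
    "NumShadowStarts": 2,
    "NumJobStarts": 1,
    "JobRunCount": 2,
    "RemoteWallClockTime": None,  # "undefined" in the history output
}

if __name__ == "__main__":
    for key, value in summarize_starts(AD).items():
        print(f"{key}: {value}")
```

For your ad this gives one shadow start that never reported execution and no recorded wall clock time, which is consistent with the two-attempt story above but does not prove it.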

Does that make sense, or have I missed explaining another mysterious value in the ad?

Jason Patton

On 2/11/22 10:56 AM, Thomas Hartmann wrote:
Hi all,

I am struggling to interpret jobs at our site that ended up in state
3/removed and have event dates that seem odd to me.

For example, job [1] got submitted through a CondorCE onto the cluster
and got removed around 1641290372 (last current status).
The job has no RemoteWallClockTime (undefined) - however, the job
(shadow??) actually has a number of start dates. Since the job has two
shadow starts and two run counts but only one actual job start, I am
unsure how to interpret the job start dates here.
I suppose the initial start date points to the first shadow/job count,
with a second start(?) around the CurrentStarts.

But has there actually been a job instance that ran on a node? Or do
the various start dates refer to shadow events (and if so, what did
the shadow do)?
Given that no RemoteWallClockTime was logged, what happened between
CurrentStart*Date and EnteredCurrentStatus?

btw: what is actually the difference between JobCurrentStartDate and
JobCurrentStartExecutingDate?
I would read [2] as saying that JobCurrentStartDate is the moment the
sandbox transfer is initiated and that JobCurrentStartExecutingDate is
the moment the transfer finished and the job actually starts, right?
(however, this interpretation would break down here, where the
*ExecutingDate is earlier than *StartDate)

Maybe somebody has an idea what the event flow of this shadow/job might
have been?

(package versions during history generation were [3])

Cheers and thanks for ideas,
   Thomas



[1]
ClusterID: 2131886
JobStatus: 3
QDate: 1641290372
JobStartDate: 1641270888
JobCurrentStartDate: 1641278464
JobCurrentStartExecutingDate: 1641270889
EnteredCurrentStatus: 1641290372
RemoteWallClockTime: undefined
CumulativeSlotTime: 0
CommittedTime: 0
CompletionDate: 0
NumShadowStarts: 2
NumJobStarts: 1
JobRunCount: 2


[2]
https://htcondor.readthedocs.io/en/latest/classad-attributes/job-classad-attributes.html?highlight=JobCurrentStartExecutingDate#job-classad-attributes


[3]
condor-9.0.8-1.el7.x86_64
condor-classads-9.0.8-1.el7.x86_64
condor-externals-9.0.8-1.el7.x86_64
condor-procd-9.0.8-1.el7.x86_64
htcondor-ce-5.1.2-1.el7.noarch
htcondor-ce-apel-5.1.2-1.el7.noarch
htcondor-ce-client-5.1.2-1.el7.noarch
htcondor-ce-condor-5.1.2-1.el7.noarch
htcondor-ce-view-5.1.2-1.el7.noarch
python2-condor-9.0.8-1.el7.x86_64
python3-condor-9.0.8-1.el7.x86_64

on EL7 3.10.0-1160.36.2.el7.x86_64


[queries.a]
condor_history -file history.a 2131886 -af ClusterID RoutedFromJobId
JobStatus QDate JobStartDate JobCurrentStartDate
JobCurrentStartExecutingDate EnteredCurrentStatus RemoteWallClockTime
CumulativeSlotTime CommittedTime CompletionDate NumShadowStarts
NumJobStarts JobRunCount
2131886 1143391.0 3 1641290372 1641270888 1641278464 1641270889
1641290372 undefined 0 0 0 2 1 2

[queries.b]
condor_history -file history.a 2131886 -format "ClusterID: %d\n"
ClusterID -format "RoutedFromJobId: %s\n" RoutedFromJobId -format
"JobStatus: %d\n" JobStatus -format "QDate: %d\n" QDate -format
"JobStartDate: %d\n" JobStartDate -format "JobCurrentStartDate: %d\n"
JobCurrentStartDate -format "JobCurrentStartExecutingDate: %d\n"
JobCurrentStartExecutingDate -format "EnteredCurrentStatus: %d\n"
EnteredCurrentStatus -format "RemoteWallClockTime: %V\n"
RemoteWallClockTime -format "CumulativeSlotTime: %d\n"
CumulativeSlotTime -format "CommittedTime: %d\n" CommittedTime -format
"CompletionDate: %d\n" CompletionDate -format "NumShadowStarts: %d\n"
NumShadowStarts -format "NumJobStarts: %d\n" NumJobStarts -format
"JobRunCount: %d\n" JobRunCount
ClusterID: 2131886
RoutedFromJobId: 1143391.0
JobStatus: 3
QDate: 1641290372
JobStartDate: 1641270888
JobCurrentStartDate: 1641278464
JobCurrentStartExecutingDate: 1641270889
EnteredCurrentStatus: 1641290372
RemoteWallClockTime: undefined
CumulativeSlotTime: 0
CommittedTime: 0
CompletionDate: 0
NumShadowStarts: 2
NumJobStarts: 1
JobRunCount: 2


