
Re: [HTCondor-users] strangeness with condor_history





On Sep 15, 2021, at 8:36 AM, Jeff Templon <templon@xxxxxxxxx> wrote:

Yo,

On 15 Sep 2021, at 14:58, Bockelman, Brian wrote:

Hi Jeff,
Correct - if you want a record of all the jobs that left the queue, you don't want to use "CompletionDate" as that attribute only exists for jobs that ran to completion.
Rather, I suspect you want "EnteredCurrentStatus". This is the time the job transitioned to its completed *or* removed status -- i.e., when the job left the queue.

Yep, I found that one too.

I wouldn't use the '-completedsince' flag (in fact, I'm not really sure about the utility of that flag) but, if you're trying to reliably extract all the jobs from the file, you probably want "-since".
Examples:
# All jobs that left the queue in the last hour.
condor_history -since 'EnteredCurrentStatus < time()-3600'
# All jobs that left the queue after job 16115497.0.
condor_history -since '16115497.0'

This is faster than what I had, which put EnteredCurrentStatus in the constraint clause and therefore scanned all available history.


Yes - the "since" option is quite nice as it informs the schedd / client when it can stop scanning and ignore the rest of the history.

If you're scraping through logs, it's also useful to note that condor_history can work against a remote schedd (and has reasonable Python bindings to boot). This way, your cronjob only needs to run at one location instead of trying to keep N cronjobs alive for N hosts.
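
For the Python route, here is a minimal sketch of the same "-since" query against a remote schedd. It assumes the htcondor bindings are installed and that your version's Schedd.history() accepts a "since" argument (check the documentation for your release). The helper names since_expr and recent_jobs, and the chosen projection, are made up for illustration.

```python
def since_expr(seconds_back):
    """Build a 'since' expression: scanning stops at the first job that
    left the queue more than `seconds_back` seconds ago."""
    return f"EnteredCurrentStatus < time() - {int(seconds_back)}"


def recent_jobs(schedd_name=None, seconds_back=3600):
    """Return history ads for jobs that left the queue recently.

    schedd_name is a hypothetical parameter: None means the local schedd,
    otherwise the named schedd is looked up through the collector.
    """
    import htcondor  # needs the HTCondor Python bindings

    if schedd_name is None:
        schedd = htcondor.Schedd()
    else:
        location = htcondor.Collector().locate(
            htcondor.DaemonTypes.Schedd, schedd_name
        )
        schedd = htcondor.Schedd(location)

    # Assumed projection; ask for whatever attributes you need.
    projection = ["ClusterId", "ProcId", "JobStatus", "EnteredCurrentStatus"]
    return list(
        schedd.history("true", projection, match=-1, since=since_expr(seconds_back))
    )
```

Calling recent_jobs("schedd.example.org") from a single cron host would then stand in for running condor_history on each submit machine; the hostname is only a placeholder.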


One deficiency of the condor job ad is it, surprisingly, doesn't provide a reliable way to get the walltime and CPU time of the *last* execution of the job -- you only get the aggregate information across all runs.

Isn't this the difference between the "Cumulative" ads and the non-cumulative, like CumulativeRemoteUserCpu vs RemoteUserCpu? There is also the difference between RemoteWallClockTime and CommittedTime.



Ah -- but again this is tricky.

"CommittedTime" is the aggregate wall clock time for jobs that run to completion or successfully self-checkpoint.  So, a job that is removed after 24 hours of running will not see any CommittedTime.
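
To make the arithmetic concrete, here is a rough sketch that treats a job ad as a plain dict and compares the two attributes. The function name is made up, and it assumes RemoteWallClockTime is the cumulative wall clock time across all run attempts, so the difference is the time spent on attempts that were never committed.

```python
def uncommitted_wall_time(ad):
    """Wall clock seconds spent on run attempts that neither completed
    nor checkpointed (hypothetical helper; `ad` is a dict-like job ad)."""
    total = ad.get("RemoteWallClockTime", 0)
    committed = ad.get("CommittedTime", 0)
    return total - committed


# A job removed after 24 hours of running: all of its wall time is uncommitted.
removed_job = {"RemoteWallClockTime": 24 * 3600, "CommittedTime": 0}
print(uncommitted_wall_time(removed_job))
```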

You might be right on the CPU usage pieces; that part has left my memory.

Brian