
Re: [HTCondor-users] strangeness with condor_history



Hi Jeff,

Correct - if you want a record of all the jobs that left the queue, you don't want to use "CompletionDate" as that attribute only exists for jobs that ran to completion.

Rather, I suspect you want "EnteredCurrentStatus".  This is the time the job transitioned to its completed *or* removed status -- i.e., when the job left the queue.
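
As a rough sketch of turning a wall-clock time into such a constraint, here is a Python equivalent of the `date -d ... +"%s"` step from the quoted session. The helper name `left_queue_after` is made up for illustration, and the UTC+2 offset is an assumption inferred from the 1631538000 value in the quoted output.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the site clock ran at UTC+2, which is what the epoch value
# printed by `date` in the quoted session implies.
SITE_TZ = timezone(timedelta(hours=2))


def left_queue_after(local_time: str) -> str:
    """Build a constraint matching jobs that left the queue after local_time."""
    moment = datetime.strptime(local_time, "%Y-%m-%d %H:%M:%S")
    moment = moment.replace(tzinfo=SITE_TZ)
    return f"EnteredCurrentStatus > {int(moment.timestamp())}"


constraint = left_queue_after("2021-09-13 15:00:00")
print(constraint)  # EnteredCurrentStatus > 1631538000
```

The resulting string can be passed to `condor_history -constraint` in place of the CompletionDate expression.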

I wouldn't use the '-completedsince' flag (in fact, I'm not really sure about the utility of that flag) but, if you're trying to reliably extract all the jobs from the file, you probably want "-since".

Examples:

# All jobs that left the queue in the last hour.
condor_history -since 'EnteredCurrentStatus < time()-3600'

# All jobs that left the queue after job 16115497.0.
condor_history -since '16115497.0'
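
To make the difference concrete, here is a toy Python model of the two scans. It is not HTCondor code: the job ads are invented, and the stop rule for '-completedsince' follows David's description quoted below (halt at the first job with CompletionDate <= the timestamp).

```python
# Toy model only: each dict stands in for a job ad in the history file,
# listed newest first. JobStatus 3 = removed, 4 = completed.
HISTORY_NEWEST_FIRST = [
    {"JobId": "105.0", "JobStatus": 3, "CompletionDate": 0,
     "EnteredCurrentStatus": 1631542500},
    {"JobId": "104.0", "JobStatus": 4, "CompletionDate": 1631542200,
     "EnteredCurrentStatus": 1631542200},
    {"JobId": "103.0", "JobStatus": 4, "CompletionDate": 1631541900,
     "EnteredCurrentStatus": 1631541900},
    {"JobId": "102.0", "JobStatus": 4, "CompletionDate": 1631541000,
     "EnteredCurrentStatus": 1631541000},
]


def scan_until(history, stop):
    """Walk newest to oldest; return the job ids seen before stop() first matches."""
    found = []
    for ad in history:
        if stop(ad):
            break
        found.append(ad["JobId"])
    return found


CUTOFF = 1631541600  # 2021-09-13 16:00:00 in the quoted session

# The removed job at the head has CompletionDate = 0, so the scan halts at once.
completed_since = scan_until(
    HISTORY_NEWEST_FIRST, lambda ad: ad["CompletionDate"] <= CUTOFF)

# EnteredCurrentStatus is set for removed jobs too, so the scan keeps going.
since_entered = scan_until(
    HISTORY_NEWEST_FIRST, lambda ad: ad["EnteredCurrentStatus"] <= CUTOFF)

print(completed_since)  # []
print(since_entered)    # ['105.0', '104.0', '103.0']
```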

The latter example is useful if you're trying to harvest the history files -- you only need to remember where you left off the last time you ran.
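
A minimal sketch of that "remember where you left off" loop in Python, under some assumptions: the function names and the checkpoint file are hypothetical, and it assumes `-af:j` prints the job id as the first field of each line with the newest job first.

```python
import subprocess
from pathlib import Path


def build_history_command(last_job_id=None):
    """Return the condor_history argv; scan the whole history on the first run."""
    cmd = ["condor_history"]
    if last_job_id:
        cmd += ["-since", last_job_id]
    # Assumption: -af:j puts the job id in the first column of each line.
    cmd += ["-af:j", "EnteredCurrentStatus"]
    return cmd


def newest_job_id(output):
    """Return the job id on the first non-empty line, or None if nothing came back."""
    for line in output.splitlines():
        fields = line.split()
        if fields:
            return fields[0]
    return None


def harvest(checkpoint: Path):
    """Fetch jobs newer than the checkpointed id, then move the checkpoint forward."""
    last = checkpoint.read_text().strip() if checkpoint.exists() else None
    result = subprocess.run(build_history_command(last),
                            capture_output=True, text=True, check=True)
    newest = newest_job_id(result.stdout)
    if newest:
        checkpoint.write_text(newest + "\n")
    return result.stdout
```

Calling `harvest(Path("last_job_id.txt"))` from cron would then pick up only the jobs that left the queue since the previous run; the file name is just a placeholder.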

One deficiency of the HTCondor job ad is that, surprisingly, it doesn't provide a reliable way to get the walltime and CPU time of the *last* execution of the job -- you only get the aggregate information across all runs.

Note that there's a relatively new script, condor_adstash, that will do all this for you if you want to take the job history and put it into ElasticSearch.

Brian

> On Sep 15, 2021, at 6:35 AM, Jeff Templon <templon@xxxxxxxxx> wrote:
> 
> Hi David,
> 
> Thanks for your answer. I get it. I guess this means that I cannot rely on CompletionDate at all, since I want an accounting record at job end, regardless of how it was ended. CPU used is cpu used and Shall Be Accounted. So back to the drawing board. Any suggestions for alternatives are welcome. The criterion is, I want a single record of resources used by the job (see the examples), generated at the moment that the job is no longer running (so job ends, crashes, removed by condor_rm, whatever, doesn't matter).
> 
> Thanks
> 
> JT
> 
> On 13 Sep 2021, at 17:17, David Schultz wrote:
> 
> 
> Hi JT,
> 
> The problem here is that removed jobs are not considered completed, and have `CompletionDate = 0`. completedsince is a "scan until the first job that completed on or before the given unix timestamp," which means it does a reverse search until CompletionDate<=time_expr and lists all jobs more recent than that (but not including that matching job).  So in the case of a removed job being the most recent job on the history, it will match the expression and halt the search, returning no jobs.  (this may be something the condor team should "fix")
> 
> Personally, I don't use the completedsince flag and use constraints directly.  You might consider using a constraint with EnteredCurrentStatus to filter the history, since that should always be set.
> 
> Best,
> David
> 
> On Mon, Sep 13, 2021 at 9:21 AM Jeff Templon <templon@xxxxxxxxx> wrote:
> Yo
> 
> I wrote a while back about using condor_history to generate an "accounting log file" that reproduces what we have with our Torque LRMS. Note here I use accounting in the Torque sense : accounting means transaction records like a bank account. What HTCondor calls accounting is something different, I already know that.
> 
> For right now, our HTCondor system only has one schedd, making the simplest approach to use condor_history on the schedd node. My prototype command is like this:
> 
> condor_history -json -completedsince $(date -d "2021-09-13 15:00:00" +"%s") \
>    -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
>           RemoteHost RequestCpus RequestMemory ExitCode CpusUsage \
>           ResidentSetSize ImageSize RemoteWallClockTime
> 
> Everything was fine, until I realised that I had made a mistake while submitting a bunch of jobs aimed at testing the command. I executed condor_rm to remove those jobs, and doing so not only removed the jobs from the queue, it removed them from the history!!. This violates some fundamental principle of accounting for me - deleting something from the queue should not erase the record of it ever having happened. While looking into this, I discovered something very weird: using a different form of the constraint command (above completedsince) gives different results. I am using "grep CpusUs" to grab out one line of output per found job, and then wc -l to count how many jobs were found.
> 
> $ date -d "2021-09-13 15:00:00" +"%s"
> 1631538000
> $ condor_history -json -completedsince $(date -d "2021-09-13 15:00:00" +"%s")  \
>     -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
>     RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
>     ImageSize RemoteWallClockTime | grep CpusUs | wc -l
> 23
> $ condor_history -json -constraint "CompletionDate > 1631538000 "  -af:hj Owner \
>     Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId RemoteHost \
>     RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
>     RemoteWallClockTime | grep CpusUs | wc -l
> 331
> 
> This difference seems to depend on whether the condor_rm has been issued or not. As I write this, it's 16:00 :
> 
> $ date -d "2021-09-13 16:00:00" +"%s"
> 1631541600
> $ condor_history -json -constraint "Owner == \"templon\" && CompletionDate > 1631541600" \
>      -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
>      RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
>      ImageSize RemoteWallClockTime | grep CpusUs | wc -l
> 0
> $ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
>      -constraint "Owner == \"templon\""  -af:hj Owner Cmd Args QDate  \
>      JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
>      RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
>      RemoteWallClockTime | grep CpusUs | wc -l
> 0
> 
> All looks fine. Now I submit a bunch of jobs, some of which will complete quickly, wait a bit, and try again.
> 
> $ condor_history -json -constraint "Owner == \"templon\" && CompletionDate > 1631541600" \
>      -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
>      RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
>      ImageSize RemoteWallClockTime | grep CpusUs | wc -l
> 40
> $ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
>      -constraint "Owner == \"templon\""  -af:hj Owner Cmd Args QDate  \
>      JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
>      RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
>      RemoteWallClockTime | grep CpusUs | wc -l
> 40
> 
> Still looks fine. Now use condor_rm to delete the entire set of jobs:
> 
> $ condor_history -json -constraint "Owner == \"templon\" && CompletionDate > 1631541600" \
>      -af:hj Owner Cmd Args QDate JobCurrentStartDate CompletionDate GlobalJobId \
>      RemoteHost RequestCpus RequestMemory ExitCode CpusUsage ResidentSetSize \
>      ImageSize RemoteWallClockTime | grep CpusUs | wc -l
> 41
> $ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
>      -constraint "Owner == \"templon\""  -af:hj Owner Cmd Args QDate  \
>      JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
>      RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
>      RemoteWallClockTime | grep CpusUs | wc -l
> 0
> 
> The 41 result instead of 40 is because one more job completed between the last condor_history command and the condor_rm command.
> 
> Using the CompletionDate form of the constraint, all the jobs are still there, but completedsince is no longer accurate.
> 
> What's going on?? The accounting records (note again: accounting in the Torque/bank account sense!) should be holy, and using completedsince they are not holy. This makes me wonder in what other circumstances are the records not holy. Is there some documentation on how to ensure that only holy output will result from my commands?
> 
> Thanks,
> 
> JT
> 
> ps: I did indeed check the jobs disappear (and not just the CpusUsage field).
> 
> $ condor_history -json -completedsince $(date -d "2021-09-13 16:00:00" +"%s") \
>      -constraint "Owner == \"templon\""  -af:hj Owner Cmd Args QDate  \
>      JobCurrentStartDate CompletionDate GlobalJobId RemoteHost RequestCpus \
>      RequestMemory ExitCode CpusUsage ResidentSetSize ImageSize \
>      RemoteWallClockTime
> [
> ]
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/