Re: [HTCondor-users] Extracting dag and job run-time information
- Date: Wed, 7 Mar 2018 16:34:10 -0600
- From: Mark Coatsworth <coatsworth@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Extracting dag and job run-time information
We do gather some runtime statistics that can probably help you here.
First off, starting in version 8.7.4, we display a set of DAGMan runtime statistics in the .dagman.out file, describing how fast DAGMan is actually submitting jobs. We've found that when dealing with lots of short-running jobs (on the order of seconds or minutes), DAGMan spends most of its time sitting idle when it could be submitting new jobs. If you notice that SubmitCycleTimeSum is dramatically lower than SleepCycleTimeSum, then that's the case here (if so, let me know and I'll send some suggestions).
The script you've used to scrape job running times works very well, but you're right, extending it to work correctly for held/evicted jobs would be a lot of work. I think you're better off using condor_history.
With condor_history, you can get information for all of a DAG's jobs using a constraint. The following looks up all children of $DagJobID$ (replace it with the cluster ID of the DAGMan job itself) and displays their individual IDs and running time (RemoteWallClockTime). You can customize the list of attributes as you see fit:
condor_history -constraint "DAGManJobId == $DagJobID$" -af:lh ClusterId DAGManJobId RemoteWallClockTime
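If you want totals rather than a per-job listing, something along these lines might work as a starting point. It is only a sketch: it assumes the plain -af format (one job per line, values separated by whitespace) rather than -af:lh, and the function names parse_runtimes, summarize and query_dag are made up for illustration, not part of HTCondor.

```python
import subprocess
import sys
from typing import Dict, List, Tuple


def parse_runtimes(text: str) -> List[Tuple[int, float]]:
    """Parse lines of the form '<ClusterId> <RemoteWallClockTime>'.

    Assumes plain `-af` output: one job per line, whitespace-separated.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        cluster, wall = fields
        try:
            rows.append((int(cluster), float(wall)))
        except ValueError:
            continue  # e.g. "undefined" when an attribute is missing
    return rows


def summarize(rows: List[Tuple[int, float]]) -> Dict[str, float]:
    """Return the job count, total run time and longest run time (seconds)."""
    times = [wall for _, wall in rows]
    return {"jobs": len(times), "total": sum(times), "max": max(times, default=0.0)}


def query_dag(dag_job_id: int) -> str:
    """Run condor_history for all jobs whose DAGManJobId matches (hypothetical helper)."""
    cmd = [
        "condor_history",
        "-constraint", f"DAGManJobId == {dag_job_id}",
        "-af", "ClusterId", "RemoteWallClockTime",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    # Usage: python dag_runtimes.py <DagJobID>
    print(summarize(parse_runtimes(query_dag(int(sys.argv[1])))))
```

Lines that do not parse as an integer ID followed by a number are skipped, so the count reflects only jobs that reported a run time.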
Another advantage to this approach is there's probably another attribute describing the job types. I don't know how you tell the different types apart, but if you specify it in the submit files, you should see it somewhere in condor_history.
Hope this helps! Let me know if you have any other questions,