
Re: [HTCondor-users] Extracting dag and job run-time information



Duncan:

 

I wouldn't bother parsing ASCII for the reasons you're encountering. It is easy to interact with condor_history and the job queue in Python. This is something I wrote a few years back, so it may not quite work today.

 

https://drive.google.com/file/d/1aRNpFeP6z-8dke_y9epHIhN-q4-JCV7E/view?usp=sharing

 

What is unstated but hopefully clear: only jobs that have completed end up in condor_history. It does not cover idle, running, or held jobs, but it does cover jobs that exited successfully or unsuccessfully. Those could be queried and statistically described using code similar to the above.
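The gist of it, as a minimal sketch using the htcondor Python bindings (not necessarily what the script above does; the constraint, projection, and match limit here are illustrative):

import htcondor

# Query the local schedd's history (i.e. completed jobs) and project
# out just the attributes of interest.  The constraint ("true" matches
# everything) and the 100-ad match limit are illustrative.
schedd = htcondor.Schedd()
for ad in schedd.history("true", ["ClusterId", "ProcId", "JobStatus", "RemoteWallClockTime"], 100):
    print(ad.get("ClusterId"), ad.get("RemoteWallClockTime"))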

 

Tom

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Mark Coatsworth <coatsworth@xxxxxxxxxxx>
Reply-To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Date: Wednesday, March 7, 2018 at 4:36 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Extracting dag and job run-time information

 

Hi Duncan,

 

We do gather some runtime statistics that can probably help you here.

 

First off, starting in version 8.7.4, we display a bunch of DAGMan runtime statistics in the .dagman.out file, related to how fast DAGMan is actually submitting jobs. We've found that when dealing with lots of short-running jobs (on the order of seconds or minutes), DAGMan spends most of its time sitting idle when it could be submitting new jobs. If you notice that SubmitCycleTimeSum is dramatically lower than SleepCycleTimeSum, that's what is happening (if so, let me know and I'll send some suggestions).
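If you want to check that quickly, here's a rough sketch that pulls both sums out of the .dagman.out file (this assumes the statistics appear as "Name = value" pairs, which may vary between versions; the file name is illustrative):

import re

stats = {}
with open("my.dag.dagman.out") as f:  # path is illustrative
    for line in f:
        # Keep the most recent value seen for each statistic
        m = re.search(r"(SubmitCycleTimeSum|SleepCycleTimeSum)\s*=\s*([\d.]+)", line)
        if m:
            stats[m.group(1)] = float(m.group(2))

if stats.get("SubmitCycleTimeSum", 0.0) < stats.get("SleepCycleTimeSum", 0.0):
    print("DAGMan is spending more time asleep than submitting.")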

 

The script you've used to scrape job run times works very well, but you're right: extending it to handle held/evicted jobs correctly would be a lot of work. I think you're better off using condor_history.

 

With condor_history, you can get information for all of a DAG's jobs using a constraint. The following looks up all children of $DagJobID$ and displays their individual IDs and running times (RemoteWallClockTime). You can customize the list of attributes as you see fit:

condor_history -constraint "DAGManJobId == $DagJobID$" -af:lh ClusterId DAGManJobId RemoteWallClockTime
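If you want to fold that output straight into your histograms, a rough Python sketch (the DAG job id 472898 and the 10-minute buckets are placeholders):

import subprocess
from collections import Counter

# Run the condor_history query above; substitute your own DAG job id.
out = subprocess.check_output([
    "condor_history",
    "-constraint", "DAGManJobId == 472898",
    "-af", "ClusterId", "RemoteWallClockTime",
]).decode()

hist = Counter()
for line in out.splitlines():
    cluster, walltime = line.split()
    if walltime == "undefined":  # attribute missing from the ad
        continue
    hist[int(float(walltime)) // 600] += 1  # 10-minute buckets

for bucket in sorted(hist):
    print("%6d-%6d s: %d jobs" % (bucket * 600, (bucket + 1) * 600, hist[bucket]))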

 

Another advantage of this approach is that there is probably also an attribute describing the job type. I don't know how you tell the different types apart, but if you set it in the submit files, you should see it somewhere in condor_history; a sketch follows below.
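For example, if each submit file tags its jobs with a custom attribute (the leading + puts it into the job ad; the name JobType here is purely illustrative):

+JobType = "gstlal_inspiral"

then condor_history can report on it directly:

condor_history -constraint "DAGManJobId == $DagJobID$" -af ClusterId JobType RemoteWallClockTime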

 

Hope this helps! Let me know if you have any other questions,

Mark

 

On Wed, Mar 7, 2018 at 12:24 PM, Duncan Meacher <duncan.meacher@xxxxxxx> wrote:

Hi all,

 

Sorry if this has been asked before; I've tried searching around but haven't seen anything. I'm trying to do some profiling on our dags to get a better idea of where bottlenecks are occurring. For this I'm just looking at the run times of all jobs within a dag and histogramming them by job type. The current dags I'm looking at contain ~42,000 individual jobs and ~30 unique job types. My current method is to search through the dag.dagman.out file for the submit and completion times and calculate the duration:

 

from datetime import datetime

# Collect two dictionaries of all job start and stop times
jobstartdict = {}
jobstopdict = {}
# options.input_file comes from the script's option parser (not shown)
for line in open(options.input_file):
    # 09/20/16 22:11:04 Event: ULOG_SUBMIT for HTCondor Node gstlal_inspiral_0001 (472899.0.0) {09/20/16 22:09:49}
    if "ULOG_SUBMIT" in line:
        date, time, _, _, _, _, _, jobname, _, _, _ = line.split()
        jobstartdict[jobname] = datetime.strptime("%s %s" % (date, time), "%x %X")
    # 09/20/16 22:12:45 Node gstlal_inspiral_0001 job proc (472899.0.0) completed successfully.
    if "completed successfully" in line:
        date, time, _, jobname, _, _, _, _, _ = line.split()
        jobstopdict[jobname] = datetime.strptime("%s %s" % (date, time), "%x %X")

# Collect jobs that have both start and stop times
correctKeys = set(jobstartdict.keys()) & set(jobstopdict.keys())

# Calculate each job duration in seconds
duration = {}
for k in correctKeys:
    duration[k] = (jobstopdict[k] - jobstartdict[k]).total_seconds()

 

This method works 99% of the time, when the dag/jobs behave normally. However, it doesn't work when jobs are either placed on hold or go back to idle after being submitted and running. I'm looking for a better way to get the run time of each job.

 

I've looked at condor_history, but I can't output exactly what I need. I can get information for all jobs submitted by a user, but I need all the jobs from just a single dag. I know I can also use -userlog on the nodes.log file and use the condor ID numbers, but this has issues if the dag is brought down and then re-submitted, as a new nodes.log file is then created.

 

Any help on this would be greatly appreciated.

 

Thanks, Duncan

 

--

==========================

Duncan Meacher, PhD
Postdoctoral Researcher
Institute for Gravitation and the Cosmos
Department of Physics
Pennsylvania State University
104 Davey Lab #040
University Park, PA 16802
Tel: +1 814 865 3243
==========================


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/



 

--

Mark Coatsworth

Systems Programmer

Center for High Throughput Computing

Department of Computer Sciences

University of Wisconsin-Madison

+1 608 206 4703