[HTCondor-users] Extracting dag and job run-time information

Hi all,

Sorry if this has been asked before; I've tried searching around but haven't seen anything. I'm trying to do some profiling on our dags to get a better idea of where bottlenecks are occurring. For this I'm just looking at the run time of all jobs within a dag and histogramming them by job type. The dags I'm currently looking at contain ~42,000 individual jobs and ~30 unique job types. My current method is to search through the dag.dagman.out file for the submit and completion times to calculate the duration:

from datetime import datetime

# Timestamp layout used in dagman.out, e.g. "09/20/16 22:11:04".
# Spelled out explicitly so parsing does not depend on the locale (as "%x %X" does).
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M:%S"

# Collect two dictionaries of all job start and stop times
jobstartdict = {}
jobstopdict = {}
with open(options.input_file) as dagman_out:
    for line in dagman_out:
        # 09/20/16 22:11:04 Event: ULOG_SUBMIT for HTCondor Node gstlal_inspiral_0001 (472899.0.0) {09/20/16 22:09:49}
        if "ULOG_SUBMIT" in line:
            date, time, _, _, _, _, _, jobname, _, _, _ = line.split()
            jobstartdict[jobname] = datetime.strptime("%s %s" % (date, time), TIMESTAMP_FORMAT)
        # 09/20/16 22:12:45 Node gstlal_inspiral_0001 job proc (472899.0.0) completed successfully.
        if "completed successfully" in line:
            date, time, _, jobname, _, _, _, _, _ = line.split()
            jobstopdict[jobname] = datetime.strptime("%s %s" % (date, time), TIMESTAMP_FORMAT)

# Collect jobs that have both start and stop times
correctKeys = set(jobstartdict.keys()) & set(jobstopdict.keys())

# Calculate each job duration
duration = {}
for k in correctKeys:
    duration[k] = (jobstopdict[k] - jobstartdict[k]).total_seconds()
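
For the histogram step, here is a minimal sketch of how the durations could then be grouped by job type. It assumes the job type is simply the node name with its trailing numeric index stripped (e.g. gstlal_inspiral_0001 -> gstlal_inspiral). The helper names job_type and group_by_type, the second node name, and the numbers in the example are made up for illustration.

```python
import re
from collections import defaultdict


def job_type(node_name):
    """Strip a trailing "_<digits>" index from a DAG node name.

    Assumption: node names look like "<job type>_<index>".
    """
    return re.sub(r"_\d+$", "", node_name)


def group_by_type(duration):
    """Map each job type to the list of durations (seconds) of its nodes."""
    by_type = defaultdict(list)
    for node_name, seconds in duration.items():
        by_type[job_type(node_name)].append(seconds)
    return dict(by_type)


# Illustrative values only, not real measurements.
example_duration = {
    "gstlal_inspiral_0001": 101.0,
    "gstlal_inspiral_0002": 95.0,
    "other_job_0001": 12.0,  # hypothetical node name
}
grouped = group_by_type(example_duration)
for name, values in sorted(grouped.items()):
    print("%s: %d jobs, mean %.1f s" % (name, len(values), sum(values) / len(values)))
```

Each list in the result can then be passed to whatever histogramming tool is at hand. Note that this only regroups the durations computed above, so it inherits the same problem with held or re-idled jobs.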

This method works 99% of the time, when the dag and its jobs behave normally. However, it doesn't work when jobs are either placed on hold or go back to idle after being submitted and running. I'm looking for a better way to get the run time of each job.

I'm looking at condor_history, but I can't get it to output exactly what I need. I can get information for all jobs submitted by a user, but I need all jobs from just a single dag. I know I can also use -userlog on the nodes.log file and use the condor ID numbers, but this also has issues if the dag is brought down and then re-submitted, as a new nodes.log file is then created.

Any help on this would be greatly appreciated.

Thanks, Duncan


Duncan Meacher, PhD
Postdoctoral Researcher
Institute for Gravitation and the Cosmos
Department of Physics
Pennsylvania State University
104 Davey Lab #040
University Park, PA 16802
Tel: +1 814 865 3243