
[HTCondor-users] Extracting dag and job run-time information



Hi all,

Sorry if this has been asked before; I've tried searching around but haven't seen anything. I'm trying to profile our dags to get a better idea of where bottlenecks occur. For this I'm looking at the run time of all jobs within a dag and histogramming them by job type. The dags I'm currently looking at contain ~42,000 individual jobs and ~30 unique job types. My current method is to search through the dag.dagman.out file for each job's submit and completion times and calculate the duration:

from datetime import datetime

# Collect two dictionaries of all job start and stop times
jobstartdict = {}
jobstopdict = {}
for line in open(options.input_file):
    # 09/20/16 22:11:04 Event: ULOG_SUBMIT for HTCondor Node gstlal_inspiral_0001 (472899.0.0) {09/20/16 22:09:49}
    if "ULOG_SUBMIT" in line:
        date, time, _, _, _, _, _, jobname, _, _, _ = line.split()
        jobstartdict[jobname] = datetime.strptime("%s %s" % (date, time), "%x %X")
    # 09/20/16 22:12:45 Node gstlal_inspiral_0001 job proc (472899.0.0) completed successfully.
    if "completed successfully" in line:
        date, time, _, jobname, _, _, _, _, _ = line.split()
        jobstopdict[jobname] = datetime.strptime("%s %s" % (date, time), "%x %X")

# Collect jobs that have both start and stop times
correctKeys = set(jobstartdict.keys()) & set(jobstopdict.keys())

# Calculate each job duration
duration = {}
for k in correctKeys:
    duration[k] = (jobstopdict[k] - jobstartdict[k]).total_seconds()
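
For the histogram step, the grouping by job type can be sketched roughly as below. This assumes that a node name is the job type followed by an underscore and a numeric index (as in gstlal_inspiral_0001); the helper names job_type and durations_by_type, the node name other_job_0001 and the example durations are made up for illustration.

```python
import re
from collections import defaultdict


def job_type(jobname):
    # Assumption: node names look like "<job type>_<numeric index>",
    # e.g. "gstlal_inspiral_0001" -> "gstlal_inspiral".
    return re.sub(r"_\d+$", "", jobname)


def durations_by_type(duration):
    # Group per-node durations (in seconds) into one list per job type.
    grouped = defaultdict(list)
    for jobname, seconds in duration.items():
        grouped[job_type(jobname)].append(seconds)
    return dict(grouped)


# Made-up durations, standing in for the "duration" dictionary built above.
example = {
    "gstlal_inspiral_0001": 101.0,
    "gstlal_inspiral_0002": 95.0,
    "other_job_0001": 30.0,
}

for jtype, secs in sorted(durations_by_type(example).items()):
    # One summary line per job type: count, min, max and mean duration.
    print(jtype, len(secs), min(secs), max(secs), sum(secs) / len(secs))
```

Each per-type list can then be passed to whatever plotting tool is used for the histograms.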

This method works 99% of the time, when the dag and its jobs behave normally. It breaks down when jobs are placed on hold, or go back to idle after being submitted and running, because the interval between submission and completion then includes time when the job was not actually running. I'm looking for a better way to get the run time of each job.
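
To make the failure case concrete, here is a rough sketch of the quantity I'm after: the sum of the intervals during which a job was actually running. The "start" and "stop" labels are placeholders of my own, not HTCondor event names, and the events themselves are made up; which log lines should map to them is part of what I'm unsure about.

```python
from collections import defaultdict
from datetime import datetime

# Timestamp layout as it appears in the dagman.out lines quoted above.
FMT = "%m/%d/%y %H:%M:%S"


def accumulated_runtime(events):
    # events: (timestamp, label, jobname) tuples in time order, where label
    # is the placeholder "start" (job begins running) or "stop" (job stops
    # running for any reason: hold, back to idle, or completion).
    running_since = {}
    total = defaultdict(float)
    for stamp, label, jobname in events:
        when = datetime.strptime(stamp, FMT)
        if label == "start":
            running_since[jobname] = when
        elif label == "stop" and jobname in running_since:
            started = running_since.pop(jobname)
            total[jobname] += (when - started).total_seconds()
    return dict(total)


# Made-up events: the job runs for 60 s, stops, then runs again for 100 s.
events = [
    ("09/20/16 22:09:49", "start", "gstlal_inspiral_0001"),
    ("09/20/16 22:10:49", "stop", "gstlal_inspiral_0001"),
    ("09/20/16 22:30:00", "start", "gstlal_inspiral_0001"),
    ("09/20/16 22:31:40", "stop", "gstlal_inspiral_0001"),
]

print(accumulated_runtime(events))
```

For these events the accumulated run time is 160 seconds, whereas the difference between the first and last timestamps is 1311 seconds.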

I'm looking at condor_history, but I can't get it to output exactly what I need. I can get information for all jobs submitted by a user, but I need only the jobs from a single dag. I know I can also use -userlog on the nodes.log file together with the condor ID numbers, but this also has issues if the dag is brought down and then re-submitted, as a new nodes.log file is then created.

Any help on this would be greatly appreciated.

Thanks, Duncan

--
==========================

Duncan Meacher, PhD
Postdoctoral Researcher
Institute for Gravitation and the Cosmos
Department of Physics
Pennsylvania State University
104 Davey Lab #040
University Park, PA 16802
Tel: +1 814 865 3243
==========================