[condor-users] watch_dag utility


I recently learned how to generate and submit Condor DAGs, and have been
executing DAGs with ~1000 nodes.  The structure of my DAGs is fairly
simple: ~200 parallel paths of 5 steps each, with minimal cross-linking.
To view the progress of DAG execution in a concise way, I wrote a Tcl
script called 'watch_dag' which scans the *.dag and *.dag.dagman.out
files, identifies "stages" in the DAG, and prints out a summary of job
status per stage.  Here is an example of running this in the middle of
executing a DAG with 1190 nodes:

pshawhan> watch_dag H1H2part1.pss2.dag
Stage  Executable               Total Waiting Queued Running Succeeded Failed
  1    lalapps_tmpltbank          238       0      0       0       228     10
  2    lalapps_inspiral           238      10      0       7       221      0
  3    lalapps_inca               238      17      0       0       221      0
  4    lalapps_inspiral           238      17      2      77       142      0
  5    lalapps_inca               238     142     36       0        60      0
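
For readers curious how a per-stage summary like the one above can be
derived, here is a minimal Python sketch.  It is not the watch_dag code
(which is Tcl) and may differ from what that script does.  It assumes a
"stage" is a node's depth along the longest chain of PARENT/CHILD links
in the *.dag file.  The node names, submit file names, and the status
dictionary are hypothetical; the dictionary stands in for whatever a
real tool would extract from the *.dag.dagman.out file, whose format is
not reproduced here.

```python
from collections import Counter, defaultdict


def parse_dag(text):
    """Read JOB and PARENT ... CHILD lines from the text of a .dag file."""
    jobs = {}                    # node name -> submit file
    parents = defaultdict(set)   # node name -> names of its parent nodes
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB" and len(words) >= 3:
            jobs[words[1]] = words[2]
        elif keyword == "PARENT":
            upper = [w.upper() for w in words]
            if "CHILD" not in upper:
                continue
            split = upper.index("CHILD")
            for child in words[split + 1:]:
                parents[child].update(words[1:split])
    return jobs, parents


def assign_stages(jobs, parents):
    """Assumed rule: stage = 1 + the largest stage among a node's parents."""
    children = defaultdict(list)
    pending = {}                 # node -> number of parents not yet processed
    for node in jobs:
        known = [p for p in parents.get(node, ()) if p in jobs]
        pending[node] = len(known)
        for p in known:
            children[p].append(node)
    stage = {node: 1 for node in jobs}
    ready = [node for node, count in pending.items() if count == 0]
    while ready:
        node = ready.pop()
        for child in children[node]:
            stage[child] = max(stage[child], stage[node] + 1)
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    return stage


def summarize(stage, status):
    """Count nodes per stage and state; nodes with no status are 'Waiting'."""
    counts = defaultdict(Counter)
    for node, number in stage.items():
        counts[number][status.get(node, "Waiting")] += 1
    return {number: dict(c) for number, c in sorted(counts.items())}


# Hypothetical five-node DAG, for illustration only.
EXAMPLE_DAG = """\
JOB tb1 tmpltbank.sub
JOB tb2 tmpltbank.sub
JOB in1 inspiral.sub
JOB in2 inspiral.sub
JOB inca1 inca.sub
PARENT tb1 CHILD in1
PARENT tb2 CHILD in2
PARENT in1 in2 CHILD inca1
"""

jobs, parents = parse_dag(EXAMPLE_DAG)
stage = assign_stages(jobs, parents)
# Hypothetical node states; a real tool would read these from the log.
status = {"tb1": "Succeeded", "tb2": "Failed", "in1": "Running"}
summary = summarize(stage, status)
for number, counts in summary.items():
    print(number, counts)
```

Grouping by depth is only one possible definition of a stage; grouping
nodes by their executable, or by both, would be equally reasonable and
would need only a different key in summarize().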

I have put a copy of this utility (~300 lines of Tcl code) at
http://www.ligo.caltech.edu/~pshawhan/watch_dag ; feel free to download
and use it.  (It requires tclsh to be somewhere in your PATH.  Put the
watch_dag script into a directory in your PATH and run
'chmod +x watch_dag'.)  Type 'watch_dag' without any arguments for a
usage summary.

No warranty is implied; this is just something I threw together based on
reverse-engineering the contents of some *.dag and *.dagman.out files,
but it seems to work pretty well (for my jobs, at least).  I'd appreciate
hearing any bug reports or suggestions for improvement.

Peter Shawhan