
[HTCondor-users] Detailed monitoring of a DAG




Dear all,

I have a DAG containing ~30 parallel "blocks", each consisting of 3-4 jobs connected by parent-child links. This DAG could be triggered automatically a dozen or so times per day, each time running on different "live" data.

What (Python) framework or approach would you recommend for monitoring each DAG instance in detail: which DAGs/blocks/jobs completed successfully or failed, how long each DAG/block/job took, why a particular job took that long (evictions, etc.), and so on? I would then use the per-DAG summary data to build long-term statistics and to identify problems in my code or the software environment.

All that information is available by combining the .dag and .dag.dagman.out files. Are there existing tools that parse these files and could be used directly, or adapted, for this goal?
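
To illustrate the .dag half of this, here is a minimal Python sketch that reads the node list and the parent-child links and groups the nodes into independent "blocks" (connected components). It assumes the file uses only plain JOB and PARENT ... CHILD lines (no SUBDAG, SPLICE or macros); the function names parse_dag and find_blocks are hypothetical, and the .dag.dagman.out side is not covered.

```python
from collections import defaultdict


def parse_dag(text):
    """Return (jobs, children) from the text of a .dag file.

    jobs maps node name -> submit file; children maps parent -> set of children.
    Assumption: only JOB and PARENT ... CHILD lines matter; others are ignored.
    """
    jobs = {}
    children = defaultdict(set)
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens or tokens[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = tokens[0].upper()
        if keyword == "JOB" and len(tokens) >= 3:
            jobs[tokens[1]] = tokens[2]
        elif keyword == "PARENT":
            upper = [t.upper() for t in tokens]
            if "CHILD" not in upper[1:]:
                continue  # malformed line, ignore it
            split = upper.index("CHILD", 1)
            for parent in tokens[1:split]:
                children[parent].update(tokens[split + 1:])
    return jobs, dict(children)


def find_blocks(jobs, children):
    """Group nodes into connected components, treating links as undirected."""
    neighbours = {name: set() for name in jobs}
    for parent, kids in children.items():
        for kid in kids:
            neighbours.setdefault(parent, set()).add(kid)
            neighbours.setdefault(kid, set()).add(parent)
    seen = set()
    result = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:  # iterative depth-first walk of one block
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(sorted(component))
    return result
```

Each block returned by find_blocks could then serve as the key under which per-job durations and outcomes are aggregated.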

Thanks in advance for your advice,

Nicolas

--

============================================
= Nicolas ARNAUD                           =
=                                          =
= Laboratoire de physique des deux infinis =
= Irène Joliot-Curie (IJCLab)              =
= CNRS/IN2P3 & Université Paris-Saclay     =
=                                          =
= Virgo Experiment                         =
=                                          =
= European Gravitational Observatory (EGO) =
= Via E. Amaldi, 5                         =
= 56021 Santo Stefano a Macerata           =
= Cascina (PI) -- Italia                   =
= Tel: + 39 050 752 314                    =
============================================