
Re: [HTCondor-users] Detailled monitoring of a DAG



On 8/31/21 5:33 AM, Nicolas Arnaud wrote:

Dear all,

What (Python) framework or approach would you recommend for monitoring each DAG instance in detail as it runs? I would like to know which DAGs/blocks/jobs completed successfully or failed, how long each DAG/block/job took, and why a particular job took that long (evictions, etc.). I would then use the per-DAG summary data to build long-term statistics and to identify problems in my code or in the software environment.


Hi Nicolas:

I don't think there is an existing, comprehensive solution for this today. The HTCondor Python bindings have tools to read the job logs (not the DAG logs, but the job logs), and the job logs are annotated with the DAG node name, so that might be helpful. Some groups add a PRE or POST script to their DAG nodes to explicitly log additional information about job starts and restarts.
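
As a rough illustration of the job-log approach, here is a minimal sketch, not a complete monitoring tool. It assumes the Python bindings' htcondor.JobEventLog reader, and it assumes the DAG node name appears in the submit event's LogNotes attribute as "DAG Node: <name>" and that the exit code is in the termination event's ReturnValue attribute; please check both against your own logs. The names read_events and summarize, and the log path in the usage note, are hypothetical.

```python
def read_events(log_path):
    """Yield (kind, job_id, timestamp, node, return_value) tuples from a job log."""
    import htcondor  # imported lazily so summarize() works without the bindings

    kinds = {
        htcondor.JobEventType.SUBMIT: "submit",
        htcondor.JobEventType.EXECUTE: "execute",
        htcondor.JobEventType.JOB_EVICTED: "evicted",
        htcondor.JobEventType.JOB_TERMINATED: "terminated",
    }
    log = htcondor.JobEventLog(log_path)
    # stop_after=0 returns once the events currently in the log are consumed.
    for event in log.events(stop_after=0):
        kind = kinds.get(event.type)
        if kind is None:
            continue  # ignore event types this sketch does not summarize
        node = None
        # Assumption: DAGMan records the node name in the submit event's LogNotes.
        notes = event.get("LogNotes", "") or ""
        if kind == "submit" and "DAG Node:" in notes:
            node = notes.split("DAG Node:", 1)[1].strip()
        # Assumption: the exit code is stored as ReturnValue on termination.
        return_value = event.get("ReturnValue") if kind == "terminated" else None
        yield kind, (event.cluster, event.proc), event.timestamp, node, return_value


def summarize(events):
    """Aggregate per-node start counts, eviction counts, and elapsed time."""
    nodes = {}
    job_to_node = {}
    for kind, job_id, timestamp, node, return_value in events:
        if kind == "submit":
            # Fall back to the job id when the log carries no node name.
            name = node or str(job_id)
            job_to_node[job_id] = name
            nodes.setdefault(name, {
                "submitted": timestamp,
                "starts": 0,
                "evictions": 0,
                "elapsed": None,
                "return_value": None,
            })
            continue
        name = job_to_node.get(job_id)
        if name is None:
            continue  # event for a job whose submit event was not seen
        record = nodes[name]
        if kind == "execute":
            record["starts"] += 1
        elif kind == "evicted":
            record["evictions"] += 1
        elif kind == "terminated":
            # Seconds from first submission to termination, including idle time.
            record["elapsed"] = timestamp - record["submitted"]
            record["return_value"] = return_value
    return nodes
```

Something like summarize(read_events("my_dag_jobs.log")) would then give one record per node. A node with more starts than expected, or a non-zero eviction count, is a candidate explanation for a long run time.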


-greg