
Re: [HTCondor-users] best practice for periodic metric script along jobs


We use update job info hooks (<Keyword>_HOOK_UPDATE_JOB_INFO) for that [1]. The default update period happens to be 5m, although that can be configured, and IIRC the hook even runs in the same environment as the job (although we don't use PID namespaces, so I don't know about that case). We also use job start and end hooks, which might also be helpful depending on your use case.

So far it has worked pretty well for us.
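
To make the hook idea a bit more concrete, here is a minimal sketch of what such a hook script could look like. It assumes the hook receives the job ClassAd on standard input as "Name = Value" lines, and it simply appends a summary line to a log file. The script name, the METRICS keyword, the log path and the choice of attributes are all made up for illustration, and I have not checked what (if anything) HTCondor does with the hook's own output.

```shell
#!/bin/sh
# update_job_info_hook.sh (hypothetical name): sketch of an update job info hook.
# Assumed execute node configuration, with METRICS as a made-up keyword:
#   STARTER_JOB_HOOK_KEYWORD = METRICS
#   METRICS_HOOK_UPDATE_JOB_INFO = /some/path/update_job_info_hook.sh

# Where to append the summaries (hypothetical default path).
LOG="${METRICS_LOG:-/tmp/job_metrics.log}"

# Print the value of one attribute from ClassAd text read on stdin.
get_attr() {
    sed -n "s/^$1 = //p" | head -n 1
}

# Turn a ClassAd read on stdin into a one-line summary.
summarize() {
    ad=$(cat)
    cluster=$(printf '%s\n' "$ad" | get_attr ClusterId)
    proc=$(printf '%s\n' "$ad" | get_attr ProcId)
    image=$(printf '%s\n' "$ad" | get_attr ImageSize)
    printf 'job=%s.%s ImageSize=%s\n' "$cluster" "$proc" "$image"
}

# Start only when run as a script under this name, so the functions can also
# be sourced and tried by hand.
case "${0##*/}" in
    update_job_info_hook.sh) summarize >> "$LOG" ;;
esac
```

Any other attribute present in the ad could be pulled out the same way with get_attr.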



[1]: https://htcondor.readthedocs.io/en/v8_8_3/misc-concepts/hooks.html?highlight=%3CKeyword%3E_HOOK_UPDATE_JOB_INFO#work-fetching-hooks-invoked-by-htcondor

On 18/6/19 0:29, Todd Tannenbaum wrote:
On 6/17/2019 10:01 AM, Thomas Hartmann wrote:
Hi all,

I would like to ask if there is some 'established best practice' for
periodically running a script alongside each job.

I.e., I would like to run a small metrics script periodically (~5m) for
each job, collect the output and add a summary of the metrics to the
job's summary.

I guess it should work to start such a script in the background as a
pre-job process, loop and write the metrics to a separate file/pipe, and
collect the metrics with a post-job script.
But I wonder, if there is a more Condor way(?), e.g., a cron for each
starter (startd?) and storing the metrics in an extra job class ad (or
adding it to the job log with a grep'able identifier)?


Hi Thomas!

A quick thought: if you have control of the execute nodes involved,
you could set the config knob

    USER_JOB_WRAPPER = /some/path/monitor_my_jobs.sh

and monitor_my_jobs.sh could be:

    #!/bin/sh
    # Run my monitor script in the background
    collect_metrics.sh &
    # Exec my actual job, keeping the same pid
    exec "$@"

and collect_metrics.sh could then monitor whatever you want.  If
USE_PID_NAMESPACES=True is set, the only processes it would "see" are
the pids associated with the job.  Every five minutes it
could publish metrics via
    condor_chirp set_job_attr_delayed <JobAttributeName> <AttributeValue>
which will cause the metrics to get published into the job classad so
they are visible in the history classad.  See "man condor_chirp".
Warning... the above was just the first idea I had, I didn't test it...
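
For illustration, collect_metrics.sh could look roughly like the sketch below. It is equally untested against a real pool. The metric (number of visible processes), the ChirpMetrics attribute prefix and the METRICS_INTERVAL variable are assumptions made up for this example. As far as I recall, delayed chirp updates are by default only accepted for attribute names starting with "Chirp", and condor_chirp often lives in the libexec directory rather than on PATH, so the CHIRP default probably needs adjusting per site.

```shell
#!/bin/sh
# collect_metrics.sh: sketch of a per-job monitor started by the job wrapper.

# Path to condor_chirp (assumption: override CHIRP if it is not on PATH).
CHIRP="${CHIRP:-condor_chirp}"
# Seconds between samples (five minutes, as suggested above).
INTERVAL="${METRICS_INTERVAL:-300}"

# Build the job attribute name for a metric. The "ChirpMetrics" prefix is a
# hypothetical naming choice for this sketch.
attr_name() {
    printf 'ChirpMetrics%s\n' "$1"
}

# Count the processes this script can see.
count_procs() {
    ps -e -o pid= | wc -l | tr -d ' '
}

# Queue one attribute update for the job ClassAd.
publish() {
    "$CHIRP" set_job_attr_delayed "$(attr_name "$1")" "$2"
}

# Sample and publish until the monitor is killed.
monitor_loop() {
    while :; do
        # A failed update should not stop the monitor.
        publish NumProcs "$(count_procs)" || true
        sleep "$INTERVAL"
    done
}

# Start looping only when run as a script under this name, so the functions
# can also be sourced and tried by hand.
case "${0##*/}" in
    collect_metrics.sh) monitor_loop ;;
esac
```

With this, each job's ClassAd would carry a ChirpMetricsNumProcs attribute holding the most recent sample. Keeping a running maximum or average in the loop and publishing that instead would be a small change.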

But a question I have for you... what metrics would your script collect?
HTCondor is already collecting info about memory, cpu, local disk
usage, and a few others... what other metrics are you interested in?



Dr. Joan Josep Piles-Contreras
ZWE Scientific Computing
Max Planck Institute for Intelligent Systems
(p) +49 7071 601 1750