
Re: [HTCondor-users] best practice for periodic metric script along jobs



On 6/17/2019 10:01 AM, Thomas Hartmann wrote:
> Hi all,
> 
> I would like to ask, if there is some 'established best practice' to run
> periodically a script along each job.
> 
> I.e., I would like to run a small metrics script periodically (~5m) for
> each job, collect the output and add a summary of the metrics to the
> job's summary.
> 
> I guess, it should work to start such a script as pre job process into
> the background, loop/write the metrics in a separate file/pipe and
> collect the metrics by a post job script.
> But I wonder, if there is a more Condor way(?), e.g., a cron for each
> starter (startd?) and storing the metrics in an extra job class ad (or
> adding it to the job log with a grep'able identifier)?
> 
> Cheers,
>    Thomas
> 

Hi Thomas!

A quick thought: if you have control of the execute nodes involved,
you could set the config knobs

   USE_PID_NAMESPACES = True
   USER_JOB_WRAPPER = /some/path/monitor_my_jobs.sh

and monitor_my_jobs.sh could be:

   #!/bin/bash
   # Run my monitor script
   collect_metrics.sh &
   # Exec my actual job, keeping the same pid
   exec "$@"

and collect_metrics.sh could then monitor whatever you want.  The only
processes it would "see" would be the pids associated with the job 
(which is what USE_PID_NAMESPACES=True does).  Every five minutes it 
could publish metrics via
   condor_chirp set_job_attr_delayed <JobAttributeName> <AttributeValue>
which will cause the metrics to get published into the job classad so 
they are visible in the history classad.  See "man condor_chirp". 
Warning... the above was just the first idea I had, I didn't test it...
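
For illustration only, here is a minimal and equally untested sketch of what
collect_metrics.sh could look like. The attribute name MetricsNumProcs, the
METRICS_INTERVAL variable, and the process-count metric are placeholders
invented for this example, not anything HTCondor defines. It assumes Linux
with /proc, that the job environment contains _CONDOR_SCRATCH_DIR, and that
condor_chirp is usable from the job (which normally requires the job to be
submitted with +WantIOProxy = true).

```shell
#!/bin/bash
# collect_metrics.sh: hypothetical sketch, not tested against a real pool.

# Seconds between samples; METRICS_INTERVAL is a made-up override knob.
INTERVAL="${METRICS_INTERVAL:-300}"

# Example metric: number of process directories visible under /proc.
# With PID namespaces enabled this should be limited to the job's processes.
count_procs() {
    local dirs=(/proc/[0-9]*)
    echo "${#dirs[@]}"
}

# Publish one sample into the job ad. The value is parsed as a ClassAd
# expression, so integers can be passed bare; strings would need quoting.
sample_once() {
    condor_chirp set_job_attr_delayed MetricsNumProcs "$(count_procs)"
}

# Keep sampling until a chirp call fails (for example, chirp is unavailable).
main() {
    while sample_once; do
        sleep "$INTERVAL"
    done
}

# Only start the loop when running inside an HTCondor job environment.
if [[ -n "${_CONDOR_SCRATCH_DIR:-}" ]]; then
    main
fi
```

Once the job has left the queue, the placeholder attribute should be readable
with something like "condor_history <JobId> -af MetricsNumProcs".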

But a question I have for you... what metrics would your script collect? 
  HTCondor is already collecting info about memory, cpu, local disk 
usage, and a few others... what other metrics are you interested in?

Thanks
Todd