
Re: [HTCondor-users] best practice for periodic metric script along jobs



You're looking for the UPDATE_JOB_INFO hook (<Keyword>_HOOK_UPDATE_JOB_INFO in the configuration). I used it to build a mechanism that looks for and terminates stalled or hung jobs, but you can do anything you want in your script:

https://research.cs.wisc.edu/htcondor/HTCondorWeek2017/presentations/ThuPelletier_Monitoring.pdf

By default, the hook runs eight seconds (STARTER_INITIAL_UPDATE_INTERVAL) after the job starts, and then once every five minutes (STARTER_UPDATE_INTERVAL). Hooks are established in the server-side configuration, and a job submission selects one by its hook keyword (the +HookKeyword attribute in the submit file).

You can't use a Pre script to launch an adjunct process for the job, because the environment it runs in is killed off and cleaned up before the job starts, including any background processes it started, as I discovered through hard-won experience. At HTCondor Week in '17 and '18 I did mention to the CHTC team the idea of a submit-specified periodic executable, as an alternative to the hook and its server-side configuration, but needless to say it's a back-burner item compared to the sexier stuff going on in development. I haven't gotten around to writing a patch and PR myself, either.

I would recommend collecting the metrics in the job ClassAd using Chirp, as I did in the Checkfile hook described in my presentation. These updates are recorded in the HTCondor log file as they're made. Since you're only interested in metrics, rather than wanting to take action immediately when a problem is identified as I did, you could use Chirp's delayed update (condor_chirp set_job_attr_delayed), so the job wouldn't need the +WantIOProxy attribute set in the submit file.
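
As a rough illustration of the above, here is a minimal sketch of what such a periodic metrics script might look like in Python. It is not the Checkfile hook from my presentation. The "ChirpMetric" attribute names, the two example metrics, and the collect_metrics() and chirp_command() helpers are all hypothetical; the only HTCondor piece it relies on is "condor_chirp set_job_attr_delayed". It assumes that condor_chirp is on the PATH and can reach the starter from wherever the script runs, and that delayed updates are limited to attribute names matching CHIRP_DELAYED_UPDATE_PREFIX (which I believe defaults to "Chirp*"), so check both against your pool's configuration.

```python
#!/usr/bin/env python3
"""Sketch: collect a few metrics and store them in the job ClassAd via Chirp."""
import os
import subprocess
import sys

# Hypothetical prefix. Assumption: delayed Chirp updates only accept attribute
# names matching CHIRP_DELAYED_UPDATE_PREFIX (believed to default to "Chirp*").
ATTR_PREFIX = "ChirpMetric"


def collect_metrics(scratch_dir):
    """Return example metrics as a dict; replace with whatever you track."""
    total_bytes = 0
    for root, _dirs, files in os.walk(scratch_dir):
        for name in files:
            try:
                total_bytes += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished between listing and stat
    return {
        "LoadAvg1": round(os.getloadavg()[0], 2),  # 1-minute load average
        "ScratchBytes": total_bytes,               # bytes under scratch_dir
    }


def chirp_command(name, value):
    """Build the condor_chirp argument list for one delayed attribute update."""
    # Numbers are passed as-is; a string value would need ClassAd quoting.
    return ["condor_chirp", "set_job_attr_delayed", ATTR_PREFIX + name, str(value)]


def main():
    # Assumption: fall back to the current directory if the scratch
    # directory variable is not set in the script's environment.
    scratch_dir = os.environ.get("_CONDOR_SCRATCH_DIR", os.getcwd())
    for name, value in collect_metrics(scratch_dir).items():
        try:
            result = subprocess.run(chirp_command(name, value), check=False)
            failed = result.returncode != 0
        except FileNotFoundError:
            failed = True  # condor_chirp is not on the PATH
        if failed:
            print(f"chirp update failed for {name}", file=sys.stderr)
    return 0  # never fail the caller just because a metric was lost


if __name__ == "__main__":
    sys.exit(main())
```

Each run overwrites the previous values, so the job ClassAd always holds the latest sample; if you want a running maximum or average for the summary, the script would have to keep that state itself, for example in a small file in the scratch directory.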

Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Thomas Hartmann
Sent: Monday, June 17, 2019 11:02 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] [HTCondor-users] best practice for periodic metric script along jobs

Hi all,

I would like to ask whether there is some 'established best practice' for periodically running a script alongside each job.

I.e., I would like to run a small metrics script periodically (every ~5 minutes) for each job, collect the output, and add a summary of the metrics to the job's summary.

I guess it should work to start such a script in the background as a pre-job process, have it loop and write the metrics to a separate file/pipe, and collect the metrics with a post-job script.
But I wonder if there is a more Condor way(?), e.g., a cron for each starter (startd?) that stores the metrics in an extra job ClassAd attribute (or adds them to the job log with a grep'able identifier)?

Cheers,
  Thomas