
Re: [HTCondor-users] best practice for periodic metric script along jobs



Hi Michael, Todd and Joan,

many thanks for the detailed input!

Michael's Checkfile hook looks like it delivers everything I have in mind
- but then, as Todd says, I have full control over the nodes anyway. I
will give both approaches a try and see which fits best.

@Todd
The thing is that I would like to compile a rough power-consumption
summary for each job, i.e., read a node's power metrics and derive a
very rough estimate (scaled by the number of cores) of a job's power
consumption. The motivation is to give users a 'real-life' clue about
their resource usage, i.e., "your job/task used ##Wh of energy -
approximately causing #g of CO2".
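
The back-of-envelope arithmetic I have in mind would be something like
the following sketch (all numbers are placeholders, and the 400 gCO2/kWh
carbon intensity is an assumed value, not a measurement):

```shell
#!/bin/bash
# Rough per-job energy estimate: scale the node's power draw by the
# job's share of the cores. All inputs below are placeholder values;
# in practice they would come from the node's power metrics and the
# job's classad.

node_watts=250        # measured average node power draw [W] (placeholder)
node_cores=32         # total cores on the node (placeholder)
job_cores=4           # cores allocated to the job (placeholder)
runtime_hours=10      # job wall time [h] (placeholder)
gco2_per_kwh=400      # assumed grid carbon intensity [gCO2/kWh]

# Job's share of the node's energy, in watt-hours (integer arithmetic)
job_wh=$(( node_watts * job_cores * runtime_hours / node_cores ))

# Convert Wh -> kWh -> grams of CO2
job_gco2=$(( job_wh * gco2_per_kwh / 1000 ))

echo "your job used ~${job_wh} Wh of energy - approximately causing ${job_gco2} g of CO2"
# prints: your job used ~312 Wh of energy - approximately causing 124 g of CO2
```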

Cheers and many thanks
  Thomas


On 18/06/2019 00.29, Todd Tannenbaum wrote:
> On 6/17/2019 10:01 AM, Thomas Hartmann wrote:
>> Hi all,
>>
>> I would like to ask if there is an 'established best practice' for
>> running a script periodically alongside each job.
>>
>> I.e., I would like to run a small metrics script periodically (~5m) for
>> each job, collect the output and add a summary of the metrics to the
>> job's summary.
>>
>> I guess it should work to start such a script as a pre-job process in
>> the background, loop/write the metrics to a separate file/pipe, and
>> collect the metrics with a post-job script.
>> But I wonder if there is a more Condor-native way(?), e.g., a cron for
>> each starter (startd?) that stores the metrics in an extra job classad
>> (or adds them to the job log with a grep'able identifier)?
>>
>> Cheers,
>>    Thomas
>>
> 
> Hi Thomas!
> 
> A quick thought :  If you have control of the execute nodes involved, 
> you could set the config knobs
> 
>    USE_PID_NAMESPACES = True
>    USER_JOB_WRAPPER = /some/path/monitor_my_jobs.sh
> 
> and monitor_my_jobs.sh could be:
> 
>    #!/bin/bash
>    # Run my monitor script
>    collect_metrics.sh &
>    # Exec my actual job, keeping the same pid
>    exec "$@"
> 
> and collect_metrics.sh can then monitor whatever you want.  The only 
> processes it would "see" would be the pids associated with the job 
> (which is what USE_PID_NAMESPACES=True does).  Every five minutes it 
> could publish metrics via
>    condor_chirp set_job_attr_delayed <JobAttributeName> <AttributeValue>
> which will cause the metrics to get published into the job classad so 
> they are visible in the history classad.  See "man condor_chirp". 
> Warning... the above was just the first idea I had, I didn't test it...
> 
> But a question I have for you... what metrics would your script collect? 
>   HTCondor is already collecting info about memory, cpu, local disk 
> usage, and a few others... what other metrics are you interested in?
> 
> Thanks
> Todd
> 
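
For the record, Todd's collect_metrics.sh idea could be sketched roughly
like this - untested, and both the read_node_watts helper and the
PowerSampleWatts attribute name are made-up placeholders (condor_chirp
set_job_attr_delayed is the real command from "man condor_chirp"):

```shell
#!/bin/bash
# Minimal, untested sketch of collect_metrics.sh from Todd's wrapper idea.
# Assumptions: condor_chirp is on the PATH inside the job sandbox, and
# PowerSampleWatts is an arbitrary attribute name of our choosing.

read_node_watts() {
    # Placeholder: replace with a real query of the node's power meter,
    # e.g. IPMI sensors or RAPL counters.
    echo 250
}

publish_sample() {
    watts=$(read_node_watts)
    # Publish into the job classad; visible later in the history classad.
    condor_chirp set_job_attr_delayed PowerSampleWatts "$watts"
}

# Only loop when started explicitly as the monitor, so sourcing the file
# for testing has no side effects.
if [ "${1:-}" = "--loop" ]; then
    while true; do
        publish_sample
        sleep 300   # sample every five minutes
    done
fi
```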
