
Re: [HTCondor-users] best practice for periodic metric script along jobs



On second thought - Todd's suggestion might be better suited, as it
should also work with universes other than the Docker one.

On 19/06/2019 10.20, Thomas Hartmann wrote:
> Hi Michael, Todd and Joan,
> 
> many thanks for the detailed input!
> 
> Michael's Checkfile hook looks like it delivers everything I have in
> mind - but then, as Todd says, I have full control over the nodes
> anyway. I will give both approaches a try and see what fits best.
> 
> @Todd
> The thing is that I would like to compile a rough power-consumption
> summary for each job, i.e., read a node's power metrics and derive a
> very rough estimate (scaled by #cores) of a job's power consumption.
> The motivation is to give users a 'real-life' clue about their resource
> usage, i.e., "your job/task used ##Wh of energy - approximately causing
> #g of CO2"
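> 
> As a back-of-the-envelope sketch (all numbers below are hypothetical,
> and ~400 g CO2/kWh is only a rough grid-average assumption), the
> estimate could be computed like this:

```shell
#!/bin/bash
# Rough per-job energy/CO2 estimate (all inputs hypothetical).
node_watts=350   # sampled node power draw in W, e.g. from IPMI
node_cores=32    # cores on the node
job_cores=8      # cores allocated to the job
runtime_h=2      # job wall time in hours

# Scale the node draw by the job's core share, times runtime -> Wh
job_wh=$(awk -v w="$node_watts" -v jc="$job_cores" -v nc="$node_cores" \
             -v h="$runtime_h" 'BEGIN { printf "%.0f", w * jc / nc * h }')

# ~400 g CO2 per kWh is a rough grid-average assumption
co2_g=$(awk -v wh="$job_wh" 'BEGIN { printf "%.0f", wh / 1000 * 400 }')

echo "your job used ~${job_wh} Wh - approximately causing ${co2_g} g of CO2"
```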
> 
> Cheers and many thanks
>   Thomas
> 
> 
> On 18/06/2019 00.29, Todd Tannenbaum wrote:
>> On 6/17/2019 10:01 AM, Thomas Hartmann wrote:
>>> Hi all,
>>>
>>> I would like to ask whether there is an 'established best practice' for
>>> running a script periodically alongside each job.
>>>
>>> I.e., I would like to run a small metrics script periodically (~5m) for
>>> each job, collect the output and add a summary of the metrics to the
>>> job's summary.
>>>
>>> I guess it should work to start such a script as a pre-job process in
>>> the background, have it loop and write the metrics to a separate
>>> file/pipe, and collect them with a post-job script.
>>> But I wonder if there is a more Condor-like way(?), e.g., a cron for
>>> each starter (startd?) that stores the metrics in an extra job classad
>>> (or adds them to the job log with a grep'able identifier)?
>>>
>>> Cheers,
>>>    Thomas
>>>
>>
>> Hi Thomas!
>>
>> A quick thought: if you have control of the execute nodes involved, 
>> you could set the config knobs
>>
>>    USE_PID_NAMESPACES = True
>>    USER_JOB_WRAPPER = /some/path/monitor_my_jobs.sh
>>
>> and monitor_my_jobs.sh could be:
>>
>>    #!/bin/bash
>>    # Run my monitor script
>>    collect_metrics.sh &
>>    # Exec my actual job, keeping the same pid
>>    exec "$@"
>>
>> and collect_metrics.sh can then monitor whatever you want.  The only 
>> processes it would "see" would be the pids associated with the job 
>> (which is what USE_PID_NAMESPACES=True does).  Every five minutes it 
>> could publish metrics via
>>    condor_chirp set_job_attr_delayed <JobAttributeName> <AttributeValue>
>> which will cause the metrics to get published into the job classad so 
>> they are visible in the history classad.  See "man condor_chirp". 
>> Warning... the above was just the first idea I had, I didn't test it...
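>>
>> A hedged sketch of what collect_metrics.sh could look like (the
>> attribute name MyJobCpuJiffies is made up, and it assumes condor_chirp
>> is on the PATH inside the job environment):

```shell
#!/bin/bash
# Sketch of collect_metrics.sh (untested): with USE_PID_NAMESPACES=True,
# /proc shows only the job's own processes, so summing utime+stime
# (fields 14 and 15 of /proc/<pid>/stat) approximates the job's CPU usage.
sample_cpu_jiffies() {
    cat /proc/[0-9]*/stat 2>/dev/null |
        awk '{ sum += $14 + $15 } END { print sum + 0 }'
}

# Publish a sample every $1 seconds; without an argument the loop is
# skipped, so the file can also be sourced just for its function.
interval="${1:-}"
while [ -n "$interval" ]; do
    condor_chirp set_job_attr_delayed MyJobCpuJiffies "$(sample_cpu_jiffies)"
    sleep "$interval"
done
```

>> i.e. started from the wrapper as "collect_metrics.sh 300 &".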
>>
>> But a question I have for you... what metrics would your script collect? 
>>   HTCondor is already collecting info about memory, cpu, local disk 
>> usage, and a few others... what other metrics are you interested in?
>>
>> Thanks
>> Todd
>>
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 
