
Re: [HTCondor-users] RemoteWallClockTime broken down per run?

Hi Mark,

many thanks for the info and the suggestions!

While attractive in principle, I am not sure whether the JobEventLog
would work well. My general idea is a smallish tool to roughly
approximate the energy consumption. Since our users do remote submits, I
would need to inject a UserLog/JobEventLog file AFAIS, and I am not sure
how well that would scale with some of our users' odder jobs.

Alternatively, I would aim for the general event log, which for the
schedds should contain all of their jobs' state transitions.
By forwarding the daemon events into a DB or into something parsable
with Spark, it should be possible to prepare a query for the job runs
and fold them with the node stats (not sure about the core count).
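
To make the "fold the job runs with the node stats" idea concrete, here
is a minimal sketch in plain Python. It does not use the HTCondor API:
the event tuples, the kind names ("execute", "evicted", "terminated"),
the node_stats mapping and both function names are assumptions made for
illustration. Real event log records would first have to be mapped into
this shape. The per-run term follows the formula from [2], summed per
run as in [3].

```python
from collections import defaultdict

# Hypothetical event shape: (job_id, timestamp_seconds, kind, host).
# The kind names below are placeholders, not HTCondor event names.
START_KINDS = {"execute"}
STOP_KINDS = {"evicted", "terminated"}


def runs_per_job(events):
    """Pair each start event with the next stop event of the same job.

    Returns {job_id: [(host, wall_seconds), ...]} with one entry per run.
    """
    open_runs = {}            # job_id -> (start_time, host)
    runs = defaultdict(list)  # job_id -> list of finished runs
    for job_id, ts, kind, host in sorted(events, key=lambda e: e[1]):
        if kind in START_KINDS:
            open_runs[job_id] = (ts, host)
        elif kind in STOP_KINDS and job_id in open_runs:
            start, run_host = open_runs.pop(job_id)
            runs[job_id].append((run_host, ts - start))
    return dict(runs)


def job_estimate(runs, node_stats, request_cpus):
    """Sum wall_hours * cpus * HS06PerSlot / HS06perWatt over all runs.

    node_stats maps host -> (hs06_per_slot, hs06_per_watt); this lookup
    table is assumed to be filled from the machine ads.
    """
    total = 0.0
    for host, wall_seconds in runs:
        hs06_per_slot, hs06_per_watt = node_stats[host]
        total += (wall_seconds / 3600.0 * request_cpus
                  * hs06_per_slot / hs06_per_watt)
    return total


# Made-up example: one job that ran twice on two different nodes.
example_events = [
    ("151.0", 0, "execute", "nodeA"),
    ("151.0", 3600, "evicted", "nodeA"),
    ("151.0", 4000, "execute", "nodeB"),
    ("151.0", 11200, "terminated", "nodeB"),
]
example_stats = {"nodeA": (10.0, 2.0), "nodeB": (12.0, 4.0)}
example_runs = runs_per_job(example_events)
example_total = job_estimate(example_runs["151.0"], example_stats, 1)
```

A run that never gets a stop event (for example a job that is still
running) is simply left out of the sum in this sketch.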

While we are in principle writing the event logs as XML, I had to
disable parsing them into JSON and forwarding that into ES due to load
issues. Thus, I would be very interested in JSON as a native event log
output in the stable series (noticed it in 8.9/9.1).

Cheers and thanks,

On 03/11/2021 21.55, Mark Coatsworth wrote:
> Hi Thomas,
> Unfortunately we don't have anything native in HTCondor that breaks
> down RemoteWallClockTime into individual runs like you're asking for.
> However, we just introduced a new feature that may help. In our
> upcoming HTCondor v9.4.0 release (shipping next month) we're adding a
> new attribute called LastRemoteWallClockTime. This just records the
> runtime for the last job execution.
> Do you think that by polling the job, or maybe using some clever
> scripts that run whenever a job iteration completes, you could grab
> the information from there?
> Another idea: I think all the information you're looking for is in the
> job event log (or global event log) which shows all the individual
> execution start/stop times. You could probably write a very simple
> Python script using our JobEventLog API to scrape this information.
> Would that get you what you need?
> Mark
> On Tue, Nov 2, 2021 at 11:23 AM Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>> Hi all,
>> is there a job class ad like RemoteWallClockTime (or CommittedTime),
>> that is broken down per individual runs?
>> Background is, that I would like to calculate power usage statistics for
>> our users' jobs.
>> Thus, we add a few benchmark values as additional machine ads. After
>> injecting these machine ads via a transform into the jobs, I can in
>> principle calculate my stats with these [2].
>> However, unfortunately not all our users' jobs are single job runs. So,
>> I would need to sum over all run iterations of a job - which might have
>> run on different nodes with different benchmark values.
>> But AFAIS `RemoteWallClockTime` is the total wall time over all job runs
>> - where I would need the wall times broken down per run [3]
>> Is there a job ad, that describes the wall time per run - or am I
>> probably overthinking?
>> Cheers,
>>   Thomas
>> [1]
>> JobMachineAttrs = "HS06PerSlot HS06perWatt..."
>> [2]
>>> condor_history 151. -af "RemoteWallClockTime/60.0/60.0 * RequestCpus *
>> MachineAttrHS06PerSlot0 / MachineAttrHS06perWatt0"
>> 0.1323784722222222
>> [3]
>>   RemoteWallClockTime0 * ... / MachineAttrHS06perWatt0
>>   +
>>   RemoteWallClockTime1 * ... / MachineAttrHS06perWatt1
>>   +
>>   ...
