Re: [HTCondor-users] Machine activity for partitionable slots



> On Jun 30, 2021, at 10:29 AM, niels.reuter@xxxxxxxxx wrote:
> 
> Hey all,
> 
> I'm currently working on a tool that automatically turns on machines when their resources are requested via idle jobs in the condor queue, and turns these machines off again when they have been idle for longer than an hour. This is done to reduce power consumption, as our GPU machines consume a lot of power when idle.
> 
> I'm currently having difficulty determining the idle time of a machine with a whole-machine partitionable slot. The "Activity" and "EnteredCurrentActivity" ClassAd attributes update for the dynamic slots created, but not for the parent. Once the dynamic slots finish and disappear, the parent slot reports a long idle time, even if a child slot recently existed. Is there a way to determine how long a whole machine or partitionable slot has been (truly) idle?

As you observed, the EnteredCurrentActivity and EnteredCurrentState attributes of a partitionable slot don't change when a dynamic slot is created or destroyed. I don't believe that information is currently available in the machine ads.
This is not the first time it would have been nice to have an attribute that records when a dynamic slot was last created/destroyed. I'll look into adding it for a future release.

At present, I see a couple of possibilities for your tool:
* Have your tool remember between scans which machines have no dynamic slots (NumDynamicSlots==0) and when that became true.

* Define a STARTD_CRON job that determines if the local machine is currently idle and remembers between executions when that became true. The job would output the timestamp for inclusion in the slot ad.
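
A rough sketch of the first option, in Python, might look like the following. It assumes the tool has already fetched the partitionable slot ads by some means and turned each one into a plain dict holding at least the Machine and NumDynamicSlots attributes. The function names, the state dict, and the one-hour threshold parameter are made up for illustration and are not part of HTCondor.

```python
import time


def update_idle_since(idle_since, ads, now=None):
    """Return a new {machine: timestamp} map of machines with no dynamic slots.

    idle_since: the map returned by the previous scan (empty on first run).
    ads: partitionable slot ads as dicts with "Machine" and "NumDynamicSlots".
    A machine keeps its earlier timestamp while it stays idle, gets `now`
    when it is first seen idle, and is dropped once any dynamic slot exists.
    """
    now = time.time() if now is None else now
    # Only ads that carry NumDynamicSlots are considered (partitionable slots).
    seen = {ad["Machine"] for ad in ads if "NumDynamicSlots" in ad}
    busy = {ad["Machine"] for ad in ads if ad.get("NumDynamicSlots", 0) > 0}
    return {m: idle_since.get(m, now) for m in seen - busy}


def machines_to_power_off(idle_since, now, threshold=3600):
    """List machines that have been idle for longer than `threshold` seconds."""
    return sorted(m for m, t in idle_since.items() if now - t > threshold)


if __name__ == "__main__":
    # Hypothetical ads from two consecutive scans of the pool.
    state = update_idle_since({}, [
        {"Machine": "gpu1", "NumDynamicSlots": 0},
        {"Machine": "gpu2", "NumDynamicSlots": 2},
    ], now=100)
    state = update_idle_since(state, [
        {"Machine": "gpu1", "NumDynamicSlots": 0},
        {"Machine": "gpu2", "NumDynamicSlots": 0},
    ], now=200)
    print(state)
    print(machines_to_power_off(state, now=3750))
```

The state map would need to be saved between scans (a small JSON file would do). Note that the recorded time is when the tool first saw the machine without dynamic slots, so it can lag the true time by up to one scan interval. The same bookkeeping could live in a STARTD_CRON script for the second option, with the script printing the timestamp as an attribute for the slot ad.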

 - Jaime