
Re: [HTCondor-users] Machine activity for partitionable slots



Note that there is some power management built into HTCondor itself. Are you using those features? In particular, there is a way to see with condor_status which machines are currently powered down, and to shut them down and revive them as necessary. A few years ago someone gave a talk about heating a greenhouse with a rack of servers, using HTCondor to control the temperature by running however many servers it took to keep the greenhouse warm.

Steve


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of niels.reuter@xxxxxxxxx <niels.reuter@xxxxxxxxx>
Sent: Wednesday, June 30, 2021 10:29 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Machine activity for partitionable slots
 
Hey all,

I'm currently working on a tool that automatically turns on machines when their resources are requested via idle jobs in the condor queue, and turns these machines off again when they have been idle for longer than an hour. This is done to reduce power consumption, as our GPU machines consume a lot of power when idle.

I'm currently having difficulty determining the idle time of a machine with a whole-machine partitionable slot. The "Activity" and "EnteredCurrentActivity" ClassAd attributes update for the dynamic slots that get created, but not for the parent. Once the dynamic slots finish and disappear, the parent slot reports a long idle time, even if a child slot existed only moments ago. Is there a way to determine how long a whole machine or partitionable slot has been (truly) idle?
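Neither message in this thread proposes a fix, so the following is only a minimal sketch of one possible workaround, not an HTCondor feature: poll the slot ads periodically and remember, per machine, the last time any dynamic slot (or any non-Idle activity) was seen, instead of relying on the parent slot's EnteredCurrentActivity. The class and method names are hypothetical, and the slot ads are assumed to be plain dicts carrying the Machine, SlotType and Activity attributes (for example, collected by a periodic condor_status or collector query in the power-management tool).

```python
from typing import Dict, Iterable, List, Mapping, Set

IDLE_THRESHOLD_S = 3600.0  # one hour, the power-off policy described above


class MachineIdleTracker:
    """Remember when each machine last showed any sign of work.

    Hypothetical helper: each slot ad is a plain dict carrying the
    Machine, SlotType and Activity ClassAd attributes.
    """

    def __init__(self) -> None:
        self.last_busy: Dict[str, float] = {}  # machine -> last time seen busy
        self.present: Set[str] = set()  # machines reporting in the latest poll

    def observe(self, slot_ads: Iterable[Mapping[str, str]], now: float) -> None:
        seen: Set[str] = set()
        for ad in slot_ads:
            machine = ad["Machine"]
            seen.add(machine)
            # A dynamic slot only exists while it holds carved-out resources,
            # so its presence marks the whole machine busy at this poll,
            # whatever the partitionable parent's Activity says.
            if ad.get("SlotType") == "Dynamic" or ad.get("Activity", "Idle") != "Idle":
                self.last_busy[machine] = now
        for machine in seen:
            # First sighting of an idle machine: start its idle clock now
            # rather than trusting the parent's EnteredCurrentActivity.
            self.last_busy.setdefault(machine, now)
        self.present = seen

    def idle_seconds(self, machine: str, now: float) -> float:
        return now - self.last_busy[machine]

    def power_off_candidates(self, now: float,
                             threshold: float = IDLE_THRESHOLD_S) -> List[str]:
        # Only consider machines still reporting; powered-off ones drop out.
        return sorted(m for m in self.present
                      if now - self.last_busy[m] >= threshold)


if __name__ == "__main__":
    idle = {"SlotType": "Partitionable", "Activity": "Idle"}
    tracker = MachineIdleTracker()
    tracker.observe([{"Machine": "gpu01", **idle}, {"Machine": "gpu02", **idle}], now=0.0)
    tracker.observe([{"Machine": "gpu01", **idle},
                     {"Machine": "gpu01", "SlotType": "Dynamic", "Activity": "Busy"},
                     {"Machine": "gpu02", **idle}], now=1800.0)
    tracker.observe([{"Machine": "gpu01", **idle}, {"Machine": "gpu02", **idle}], now=3600.0)
    print(tracker.power_off_candidates(now=3600.0))  # ['gpu02']
```

The trade-off of this polling approach is resolution: a job that starts and finishes entirely between two polls is never seen, so the poll interval should be short compared with the one-hour threshold.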

Thanks for the help,

Niels