Re: [HTCondor-users] power management: ROOSTER_UNHIBERNATE not working



11/15/23 09:42:12 Got 0 startd ads matching ROOSTER_UNHIBERNATE=Offline

How do I troubleshoot and fix this?

When a machine hibernates but rooster can't find its ad in the collector, the usual cause is that the startd correctly sent an offline ad to the collector but then sent an invalidate ad before actually shutting down; the invalidate ad removes the offline ad.

To confirm / debug this, turn up the debug level on either your startd or your collector; the former should log when it sends ads and the latter when it receives them.

Arguably, sending an invalidate ad shouldn't remove offline ads, but if your hibernate script allows/requires the system to shut down normally, that's probably the problem: the startd will invalidate its ad before exiting when sent a SIGTERM (as is the usual case). This problem has been reported to us before (https://opensciencegrid.atlassian.net/browse/HTCONDOR-1806), but we haven't been able to address it yet; my apologies.

The work-around is to kill the startd with a SIGKILL before shutting HTCondor down; depending on how long shutdown takes, you may also need to kill the condor master to prevent it from respawning the startd.
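
As a rough illustration of that work-around, here is a minimal Python sketch that a hibernate script could run just before suspending the machine. It is not an HTCondor-provided tool. It assumes the daemons run under the process names condor_startd and condor_master, that pgrep is available, and that the script runs with enough privilege to signal them; the function names find_pids and kill_daemons are made up for this example.

```python
import os
import signal
import subprocess

# Startd first, as described above; then the master so that it cannot
# respawn the startd. The process names are assumptions about the install.
DAEMONS = ("condor_startd", "condor_master")


def find_pids(name):
    """Return the PIDs of processes named exactly `name`, using pgrep -x."""
    result = subprocess.run(["pgrep", "-x", name],
                            capture_output=True, text=True)
    # pgrep prints one PID per line and prints nothing when there is no match.
    return [int(tok) for tok in result.stdout.split()]


def kill_daemons(find=find_pids, kill=os.kill):
    """Send SIGKILL to each daemon in DAEMONS order.

    Returns the (name, pid) pairs that were signalled. `find` and `kill`
    are parameters only so the function can be exercised without touching
    real processes.
    """
    killed = []
    for name in DAEMONS:
        for pid in find(name):
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                continue  # the process exited before we could signal it
            killed.append((name, pid))
    return killed
```

Calling kill_daemons() with no arguments signals the real daemons. Because SIGKILL cannot be caught, the startd has no chance to send its invalidate ad, which is the point of the work-around.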

- ToddM