
Re: [HTCondor-users] power management: ROOSTER_UNHIBERNATE not working



Thanks for the replies.

I think the ads are still invalidated:
$ condor_status bench7 -af Offline
returns nothing.


Here's my full hibernate script (thanks Christoph) including the pkill:
#!/bin/bash
if [[ $1 == ad ]]
then
   echo "HibernationMethod = \"systemctl\""
   HibernationMethod="systemctl suspend"
   echo "HibernationRawMask = 8"
   HibernationRawMask="8"
   echo "HibernationSupportedStates = \"S5\""
   HibernationSupportedStates="S5"
fi

if [[ $@ == "set S5" ]]
then
   # prevent invalidating the class ad by killing the startd first
   sudo pkill -SIGKILL condor_startd
   sudo pkill -SIGKILL condor_master
   sudo /sbin/poweroff
fi
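
To check the branching without actually powering the machine off, a dry-run variant along these lines might help. It is only a sketch: DRY_RUN, run and hibernate_plugin are hypothetical names introduced here, not HTCondor features, and the "ad" / "set S5" arguments are simply the ones the script above tests for.

```shell
#!/bin/bash
# Sketch of the same logic as the hibernate script above, with the
# destructive commands routed through run(). With DRY_RUN=1 they are only
# printed. DRY_RUN, run and hibernate_plugin are made-up names.
run() {
    if [[ "${DRY_RUN:-0}" == 1 ]]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

hibernate_plugin() {
    if [[ "$1" == ad ]]; then
        # Same attributes the original script prints for the "ad" argument.
        echo 'HibernationMethod = "systemctl"'
        echo 'HibernationRawMask = 8'
        echo 'HibernationSupportedStates = "S5"'
    fi

    if [[ "$*" == "set S5" ]]; then
        # Kill the startd and master first so no invalidate is sent.
        run sudo pkill -SIGKILL condor_startd
        run sudo pkill -SIGKILL condor_master
        run sudo /sbin/poweroff
    fi
}

hibernate_plugin "$@"
```

Saved under a name of your choice and called as `DRY_RUN=1 ./script set S5`, it should print the three commands instead of executing them.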

I turned on Collector debug in the CM config:
COLLECTOR_DEBUG = D_FULLDEBUG
COLLECTOR_PERSISTENT_AD_LOG = /var/log/condor/PersistentAdLog
ABSENT_REQUIREMENTS = ( (HibernationLevel?:0) == 0 )
EXPIRE_INVALIDATED_ADS = True
CLASSAD_LIFETIME = 900
# 604800s is 7 days
ABSENT_EXPIRE_ADS_AFTER = 604800
OFFLINE_EXPIRE_ADS_AFTER = 604800
ROOSTER_INTERVAL = 180
ROOSTER_DEBUG = D_FULLDEBUG
ROOSTER_UNHIBERNATE = Offline


$ grep -A5 -B5 bench7 /var/log/condor/CollectorLog

11/17/23 08:53:33 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 08:53:33 Got INVALIDATE_MASTER_ADS
11/17/23 08:53:33 In OfflineCollectorPlugin::expire()
11/17/23 08:53:33               **** Removed(1) stale ad(s): "< bench7.timehole.org >"
11/17/23 08:53:33 (Invalidated 1 ads)
11/17/23 08:53:33 In OfflineCollectorPlugin::update ( 15 )
11/17/23 08:53:44 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 08:53:44 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 08:53:44 In OfflineCollectorPlugin::update ( 0 )
11/17/23 08:53:44 Removed ad from persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 08:53:44 Got INVALIDATE_STARTD_ADS
11/17/23 08:53:44 In OfflineCollectorPlugin::expire()
11/17/23 08:53:44 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 08:53:44 (Invalidated 0 ads)
11/17/23 08:53:44 In OfflineCollectorPlugin::expire()
11/17/23 08:53:44 OfflineCollectorPlugin::persistentStoreRemove: Replacing existing offline ad.
11/17/23 08:53:44 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 08:53:44 (Invalidated 0 ads)
11/17/23 08:53:44 In OfflineCollectorPlugin::update ( 13 )
11/17/23 08:53:44 Removed ad from persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 08:53:44 condor_read(): Socket closed when trying to read 5 bytes from <192.168.1.7:37067> in non-blocking mode
11/17/23 08:53:44 IO: EOF reading packet header
11/17/23 08:53:44 DaemonCore: Can't receive command request from 192.168.1.7 (perhaps a timeout?)
11/17/23 08:53:44 condor_read(): Socket closed when trying to read 5 bytes from <192.168.1.7:39905> in non-blocking mode
11/17/23 08:53:44 IO: EOF reading packet header
-
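
To make the ordering easier to follow, the log can be reduced to the persistent-store and invalidate lines. This is a rough sketch only: the message strings are copied from the excerpt above, and store_events and last_store_action are hypothetical helper names, not part of HTCondor.

```shell
#!/bin/bash
# Sketch: keep only CollectorLog lines that add/remove an ad in the
# persistent store or report an INVALIDATE_* command.
# Both function names are made up for this example.
store_events() {
    grep -E 'persistent store key=|Got INVALIDATE_' "$1"
}

# Print "Added" or "Removed" for the last persistent-store line in the log.
# Assumes the "date time message" layout seen in the excerpt above,
# so the action word is the third whitespace-separated field.
last_store_action() {
    store_events "$1" | grep 'persistent store key=' | tail -n 1 |
        awk '{ print $3 }'
}
```

For example, `store_events /var/log/condor/CollectorLog` followed by `last_store_action /var/log/condor/CollectorLog`. In the excerpt above the last persistent-store line is a "Removed", which would fit with condor_status returning nothing.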

Is the condor_read() log message a problem or is it caused by the machine hibernating?

Who/what else invalidates the persistent class ads?

Thanks very much.

JK



> On Nov 15, 2023, at 4:22 PM, Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
>
>> 11/15/23 09:42:12 Got 0 startd ads matching ROOSTER_UNHIBERNATE=Offline
>>
>> How do I troubleshoot and fix this?
>
>       When a machine hibernates but rooster can't find its ad in the
> collector, the usual problem is that the startd correctly sent an offline
> ad to the collector but then sent an invalidate ad to the collector before
> actually shutting down; the invalidate ad invalidates the offline ad.
>
>       To confirm / debug this, turn up the debug level on either your
> startd or your collector; the former should log when it sends ads and the
> latter when it receives them.
>
>       Arguably, sending an invalidate ad shouldn't remove offline ads,
> but if your hibernate script allows/requires the system to shut down
> normally, that's probably the problem: the startd will invalidate its ad
> before exiting when sent a SIGTERM (as is the usual case).  This problem
> has been reported to us before
> (https://opensciencegrid.atlassian.net/browse/HTCONDOR-1806),
> but we haven't been able to address it yet; my apologies.
>
>       The work-around is to kill the startd with a SIGKILL before
> shutting HTCondor down; depending on how long shutdown takes, you may also
> need to kill the condor master to prevent it from respawning the startd.
>
> - ToddM