
Re: [HTCondor-users] power management: ROOSTER_UNHIBERNATE not working



Thanks, Todd, but the machine is still not waking up.

The CollectorLog just before the machine hibernates:

11/17/23 13:46:28 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:28 Want private ads, but no socket given!
11/17/23 13:46:28 In OfflineCollectorPlugin::update ( 60 )
11/17/23 13:46:28 Machine ad lifetime: 604800
11/17/23 13:46:28 Added ad to persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 13:46:28 Got INVALIDATE_MASTER_ADS
11/17/23 13:46:28 		**** Removed(1) ad(s): "< bench7.timehole.org >"
11/17/23 13:46:28 (Invalidated 1 ads)
11/17/23 13:46:28 In OfflineCollectorPlugin::update ( 15 )
11/17/23 13:46:30 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:30 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:30 In OfflineCollectorPlugin::update ( 0 )
11/17/23 13:46:37 MasterAd     : Updating ... "< bench6.timehole.org >"
11/17/23 13:46:37 In OfflineCollectorPlugin::update ( 2 )
11/17/23 13:46:38 StartdAd     : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:38 StartdPvtAd  : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:38 In OfflineCollectorPlugin::update ( 0 )
11/17/23 13:46:39 Removed ad from persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
11/17/23 13:46:39 Got INVALIDATE_STARTD_ADS
11/17/23 13:46:39 		**** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:39 (Invalidated 1 ads)
11/17/23 13:46:39 		**** Removed(1) ad(s): "< slot1@xxxxxxxxxxxxxxxxxxx >"
11/17/23 13:46:39 (Invalidated 1 ads)
11/17/23 13:46:39 In OfflineCollectorPlugin::update ( 13 )
11/17/23 13:46:39 condor_read(): Socket closed when trying to read 5 bytes from <192.168.1.7:40515> in non-blocking mode
11/17/23 13:46:39 IO: EOF reading packet header
11/17/23 13:46:39 DaemonCore: Can't receive command request from 192.168.1.7 (perhaps a timeout?)
11/17/23 13:46:39 condor_read(): Socket closed when trying to read 5 bytes from <192.168.1.7:43069> in non-blocking mode
11/17/23 13:46:39 IO: EOF reading packet header
11/17/23 13:46:39 DaemonCore: Can't receive command request from 192.168.1.7 (perhaps a timeout?)
11/17/23 13:46:58 Got QUERY_STARTD_PVT_ADS
11/17/23 13:46:58 QueryWorker: forked new high priority worker with id 7344 ( max 4 active 1 pending 0 )
11/17/23 13:46:58 Query after modification: *(true) && (Absent =!= True)*
11/17/23 13:46:58 (Sending 3 ads in response to query)
11/17/23 13:46:58 Query info: matched=3; skipped=0; query_time=0.001522; send_time=0.000641; type=MachinePrivate; requirements={(true) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<127.0.0.1:21931>; projection={}; filter_private_attrs=0
11/17/23 13:46:58 Got QUERY_ANY_ADS
11/17/23 13:46:58 QueryWorker: forked new high priority worker with id 7345 ( max 4 active 2 pending 0 )
11/17/23 13:46:58 QueryWorker: Child 7344 done
11/17/23 13:46:58 Query after modification: *((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= True)*
11/17/23 13:46:58 (Sending 4 ads in response to query)
11/17/23 13:46:58 Query info: matched=4; skipped=9; query_time=0.001622; send_time=0.001977; type=Any; requirements={((((MyType == "Submitter")) || ((MyType == "Machine")))) && (Absent =!= true)}; locate=0; limit=0; from=COLLECTOR; peer=<127.0.0.1:4433>; projection={}; filter_private_attrs=0
11/17/23 13:46:58 QueryWorker: Child 7345 done
11/17/23 13:46:58 AccountingAd  : Updating ... "< <none>bench12.timehole.org >"
11/17/23 13:46:58 In OfflineCollectorPlugin::update ( 77 )

And yet the PersistentAdLog contains only the bench7 ad, with Offline set to true.
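
To make the add/remove sequence easier to follow, here is a minimal sketch that pulls only the persistent-store events out of a CollectorLog and replays them. It assumes the log lines have exactly the "Added ad to persistent store key=<...>" / "Removed ad from persistent store key=<...>" wording shown above. The function names are made up for illustration and are not part of HTCondor.

```python
import re

# Matches lines such as:
#   11/17/23 13:46:28 Added ad to persistent store key=<slot1@host>
#   11/17/23 13:46:39 Removed ad from persistent store key=<slot1@host>
# (format assumed from the log excerpt above)
EVENT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<action>Added ad to|Removed ad from) "
    r"persistent store key=<(?P<key>[^>]*)>"
)


def persistent_store_events(lines):
    """Return (timestamp, action, key) tuples for persistent-store lines."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m:
            action = "added" if m.group("action").startswith("Added") else "removed"
            events.append((m.group("ts"), action, m.group("key")))
    return events


def keys_still_stored(events):
    """Replay events in order; return keys whose last event was an add."""
    stored = set()
    for _ts, action, key in events:
        if action == "added":
            stored.add(key)
        else:
            stored.discard(key)
    return stored


# Usage (the log path is site-specific, so it is left as a parameter):
#   with open(collector_log_path) as f:
#       events = persistent_store_events(f)
#   print(keys_still_stored(events))
```

Run against the excerpt above, this would report the slot1 ad as added at 13:46:28 and removed again at 13:46:39, which is consistent with it being missing from the PersistentAdLog.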


JK




> On Nov 17, 2023, at 12:44 PM, Todd L Miller <tlmiller@xxxxxxxxxxx> wrote:
> 
> 
>> 11/17/23 08:53:44 In OfflineCollectorPlugin::update ( 13 )
> 
>       This should almost certainly be 'OfflineCollectorPlugin::invalidate',
> instead; command 13 is INVALIDATE_STARTD_ADS.
> 
>> 11/17/23 08:53:44 Removed ad from persistent store key=<slot1@xxxxxxxxxxxxxxxxxxx>
> 
>> Is the condor_read() log message a problem or is it caused by the
>> machine hibernating?
> 
>       I would expect it's caused by the SIGKILLs, but it's not
> necessarily a problem.
> 
>> Who/what else invalidates the persistent class ads?
> 
>       That is a very good question.  The log fragment starts with an ad
> being added to the persistent store (but doesn't include the line saying
> why), and includes a master ad invalidation, a startd ad invalidation, and
> in the middle what looks like a startd update ad.
> 
>       However, I noticed that you set
> 
> EXPIRE_INVALIDATED_ADS = TRUE
> 
> in order to turn on absent ads.  Could you try setting that back to FALSE
> (the default) and running the offline-ad experiment again?
> 
> - ToddM