
Re: [HTCondor-users] Absent node still active



Hi Mark,

maybe I should have provided some log excerpts from the start...

Okay, here is a more detailed timeline, with the relevant log excerpts:

----
(StartLog) - crash at 14:52, reason unknown
----
07/13/20 14:52:06 Setting up slot pairings
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
....
^@^@^@^@^@^@^@^@^@^@^@^@07/14/20 09:38:02 ******************************************************
07/14/20 09:38:02 ** condor_startd (CONDOR_STARTD) STARTING UP
07/14/20 09:38:02 ** /usr/sbin/condor_startd
....

----
(CollectorLog)
----
07/13/20 15:12:44       **** Removing stale ad: "< slot2_3@xxxxxxxxxxxxxxxxx , 127.0.0.1 >"
07/13/20 15:12:44 Added ad to persistent store key=<slot2_3@xxxxxxxxxxxxxxxxx,127.0.0.1>

From the source code, these two lines should be produced precisely when an ad gets
the Absent attribute set (by the offline plugin in the collector), so from this point on:

----
(condor_status -absent)
slot1@xxxxxxxxxxxxxxxxx            LINUX      X86_64    7/13 14:52  8/12 14:52
...

So far, i.e. before the reboot at 09:38, everything is as it should be after such a crash.
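
As an aside, for anyone trying to reproduce this: the raw attributes behind
that output can be queried directly. A sketch, assuming the attribute names
Absent, LastHeardFrom and ClassAdLifetime as I read them in the source
(untested):

  # list absent ads with the time they were last heard from and how long
  # the collector intends to keep them around
  condor_status -absent -af:h Name Absent LastHeardFrom ClassAdLifetime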

Now the node reboots, the startd comes back up, and the node gets matched to a job:

(CollectorLog)
07/14/20 09:38:02 MasterAd     : Inserting ** "< batch1066.desy.de >"
07/14/20 09:38:14 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
07/14/20 09:38:14 StartdPvtAd  : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
07/14/20 09:38:14 StartdAd     : Inserting ** "< slot2@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
07/14/20 09:38:14 StartdPvtAd  : Inserting ** "< slot2@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
07/14/20 09:44:01 StartdAd     : Inserting ** "< slot2_1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
...(the remaining slots follow gradually, up to the final slot)
07/14/20 09:49:32 StartdAd     : Inserting ** "< slot2_47@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
07/14/20 09:49:32 StartdPvtAd  : Inserting ** "< slot2_47@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"

(NegotiatorLog)
07/14/20 09:43:58     Request 50111935.00000: autocluster 915 (request count 87 of 100)
07/14/20 09:43:58       Matched 50111935.0 BIRD_cms.lite.user@xxxxxxx <131.169.223.41:9618?addrs=131.169.223.41-9618+[2001-638-700-10df--1-29]-9618&noUDP&sock=schedd_2006168_f5e8_3> preempting none <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> slot2@xxxxxxxxxxxxxxxxx
07/14/20 09:43:58       Successfully matched with slot2@xxxxxxxxxxxxxxxxx

(SchedLog)
7/14/20 09:43:59 (pid:3957219) Started shadow for job 50111938.9 on slot2@xxxxxxxxxxxxxxxxx <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> for BIRD_cms.lite.user, (shadow pid = 3165751)
...
07/14/20 10:09:27 (pid:3957219) Shadow pid 3165751 for job 50111938.9 exited with status 115
...
07/14/20 10:09:27 (pid:3957219) Match record (slot2@xxxxxxxxxxxxxxxxx <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> for BIRD_cms.lite.user, 50111938.9) deleted

The output of condor_status -absent remains unchanged a few hours after the reboot, and presumably has been unchanged the whole time since the reboot.
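
So right now the node seems to be a normal, matchable resource and an absent
one at the same time. That should be visible by querying both views; a sketch,
where the constraint on the Machine attribute is just my guess at how to
select the node (untested):

  # the node shows up as a regular, claimable resource...
  condor_status -const 'Machine == "batch1066.desy.de"' -af Name State Activity

  # ...while at the same time still being listed as absent
  condor_status -absent -const 'Machine == "batch1066.desy.de"' -af Name Absent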

Then the node crashes again and is marked absent once more:

(CollectorLog)
07/14/20 23:42:44       **** Removing stale ad: "< slot2_22@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
07/14/20 23:42:44 Added ad to persistent store key=<slot2_22@xxxxxxxxxxxxxxxxx,131.169.160.166>

At this point I would have expected the absent timestamp to change to 23:42, but a few hours later it is still the one from 15:12.

I did search the CollectorLog for lines matching

07/14/20 hh:mm:ss Removed ad from persistent store key=<slotX_Y@xxxxxxxxxxxxxxxxx,131.169.160.166>

which are produced for other nodes on at least two occasions: an explicit invalidate sent by the node, and (presumably) the regular removal of expired absent ads.

I am fairly sure these would indicate removal of the ad with the Absent attribute.

But there were none for this node, so I figure the old absent ads somehow remained in the persistent store alongside the newly inserted ones above, and that this is possibly the cause of the whole behaviour.
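
If the persistent store can be inspected directly, that should be easy to
confirm. A sketch, assuming the store is the file that the
COLLECTOR_PERSISTENT_AD_LOG knob points to (untested):

  # look for leftover entries for one of the affected slots
  grep 'slot2_22@' "$(condor_config_val COLLECTOR_PERSISTENT_AD_LOG)"

And if stale absent ads really are stuck in there, perhaps an explicit
invalidation would flush them out, along the lines of the invalidation ads
described for condor_advertise. Again just a sketch, untested, with a
hypothetical file name:

  # contents of invalidate.ad:
  #   MyType = "Query"
  #   TargetType = "Machine"
  #   Name = "slot2_22@xxxxxxxxxxxxxxxxx"
  #   Requirements = Name == "slot2_22@xxxxxxxxxxxxxxxxx"
  #
  # then, on the collector host:
  condor_advertise INVALIDATE_STARTD_ADS invalidate.ad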

Hopefully this is more helpful.

Best
  Kruno

----- Original Message -----
> From: "Mark Coatsworth" <coatsworth@xxxxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Thursday, 16 July, 2020 00:39:31
> Subject: Re: [HTCondor-users] Absent node still active

> Hi Kruno,
> 
> I'm trying to reproduce this problem here on a 3-node testbed cluster.
> 
> Can you explain how exactly the startd is crashing such that
> `condor_status -absent` shows output? Can you attach what it says?
> 
> In my tests with both controlled and forced shutdowns, the collector
> seems to react well, and I don't know how to get it into a state where
> -absent returns anything. I'm running HTCondor v8.8.9.
> 
> Mark
> 
> 
> 
> 
> On Wed, Jul 15, 2020 at 3:30 AM Sever, Krunoslav
> <krunoslav.sever@xxxxxxx> wrote:
>>
>> Hi,
>>
>> I've got an absent node that should not be absent... here is the story:
>>
>> First it crashed and I see in the log where the collector dutifully set it to
>> absent about 30 minutes later, when the startd ad expired.
>>
>> condor_status -absent (currently) shows that time.
>>
>> After the node started up again, I see that the collector received new startd
>> ads, so I assume these would replace the absent ad.
>>
>> But condor_status -absent still shows the node, unchanged, a few hours after
>> reboot.
>>
>> Moreover, a few minutes after reboot, the negotiator (surprisingly?) matched a
>> job for the node, which was scheduled and ran.
>>
>> Even more interesting, the node apparently crashed again a few hours later and
>> again I see the log entry where the collector sets the Absent attribute.
>>
>> But condor_status -absent *still* shows the original absent date, i.e. from the
>> first crash.
>>
>> Looking through the sources I see that the offline plugin in the collector is
>> the only place where the Absent attribute is set.
>>
>> A few other source files reference the attribute but only for reading purposes
>> (e.g. condor_status).
>>
>> I also note that the absent ads that were written to persistent storage were
>> never removed after the node rebooted.
>>
>> This removal is done when a node actively invalidates an ad, so maybe that's
>> missing or didn't run somehow?
>>
>> Any ideas?
>>
>> Best
>>   Kruno
>>
> 
> 
> 
> --
> Mark Coatsworth
> Systems Programmer
> Center for High Throughput Computing
> Department of Computer Sciences
> University of Wisconsin-Madison

-- 
------------------------------------------------------------------------
Krunoslav Sever            Deutsches Elektronen-Synchrotron (IT-Systems)
                        Ein Forschungszentrum der Helmholtz-Gemeinschaft
                                                            Notkestr. 85
phone:  +49-40-8998-1648                                   22607 Hamburg
e-mail: krunoslav.sever@xxxxxxx                                  Germany
------------------------------------------------------------------------