
[HTCondor-users] Unsubscribe




> On Jul 16, 2020, at 1:12 AM, Sever, Krunoslav <krunoslav.sever@xxxxxxx> wrote:
> 
> Hi Mark,
> 
> maybe I should have provided some log excerpts from the start...
> 
> Okay, here is a more detailed timeline in terms of logs:
> 
> ----
> (Startlog) - crash at 14:52, unknown reason
> ----
> 07/13/20 14:52:06 Setting up slot pairings
> ^@^@^@^@^@^@^@^@ [... remainder of the log zero-filled with NUL (^@) bytes after the crash ...]
> ^@^@^@^@^@^@^@^@^@^@^@^@07/14/20 09:38:02 ******************************************************
> 07/14/20 09:38:02 ** condor_startd (CONDOR_STARTD) STARTING UP
> 07/14/20 09:38:02 ** /usr/sbin/condor_startd
> ....
> 
> ----
> (CollectorLog)
> ----
> 07/13/20 15:12:44       **** Removing stale ad: "< slot2_3@xxxxxxxxxxxxxxxxx , 127.0.0.1 >"
> 07/13/20 15:12:44 Added ad to persistent store key=<slot2_3@xxxxxxxxxxxxxxxxx,127.0.0.1>
> 
> From the source code, these two lines should be produced precisely when an ad gets the Absent attribute
> (set by the offline plugin in the collector), hence from this point on:
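As a cross-check, that pairing of CollectorLog lines can be extracted mechanically. A minimal sketch in Python (plain log-text parsing, no HTCondor bindings; the regexes are assumptions based only on the excerpts quoted in this thread, not on the HTCondor source, and the hostnames below are placeholders):

```python
import re

# The two CollectorLog lines quoted above; the exact wording is assumed
# from the excerpts in this thread, not taken from the HTCondor source.
STALE_RE = re.compile(
    r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+\*+ Removing stale ad: "< (\S+) ')
ADDED_RE = re.compile(
    r'^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) Added ad to persistent store key=<([^,>]+),')

def absent_events(log_lines):
    """Return (timestamp, slot) pairs where a stale ad was re-added to the
    persistent store, i.e. (presumably) marked Absent by the offline plugin."""
    events = []
    pending = {}  # slot name -> timestamp of its "Removing stale ad" line
    for line in log_lines:
        m = STALE_RE.match(line)
        if m:
            pending[m.group(2)] = m.group(1)
            continue
        m = ADDED_RE.match(line)
        if m and m.group(2) in pending:
            events.append((m.group(1), m.group(2)))
            del pending[m.group(2)]
    return events
```

Run over a CollectorLog, this should list every slot that went absent together with the time it happened.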
> 
> ----
> (condor_status -absent)
> slot1@xxxxxxxxxxxxxxxxx            LINUX      X86_64    7/13 14:52  8/12 14:52
> ...
> 
> So far (i.e. before the reboot at 09:38), everything is as it should be after such a crash.
> 
> Now the node with the startd reboots and gets a job:
> 
> (CollectorLog)
> 07/14/20 09:38:02 MasterAd     : Inserting ** "< batch1066.desy.de >"
> 07/14/20 09:38:14 StartdAd     : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:38:14 StartdPvtAd  : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:38:14 StartdAd     : Inserting ** "< slot2@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:38:14 StartdPvtAd  : Inserting ** "< slot2@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:44:01 StartdAd     : Inserting ** "< slot2_1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> ...(rest of the slots, gradually until final slot)
> 07/14/20 09:49:32 StartdAd     : Inserting ** "< slot2_47@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:49:32 StartdPvtAd  : Inserting ** "< slot2_47@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 
> (NegotiatorLog)
> 07/14/20 09:43:58     Request 50111935.00000: autocluster 915 (request count 87 of 100)
> 07/14/20 09:43:58       Matched 50111935.0 BIRD_cms.lite.user@xxxxxxx <131.169.223.41:9618?addrs=131.169.223.41-9618+[2001-638-700-10df--1-29]-9618&noUDP&sock=schedd_2006168_f5e8_3> preempting none <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> slot2@xxxxxxxxxxxxxxxxx
> 07/14/20 09:43:58       Successfully matched with slot2@xxxxxxxxxxxxxxxxx
> 
> (SchedLog)
> 7/14/20 09:43:59 (pid:3957219) Started shadow for job 50111938.9 on slot2@xxxxxxxxxxxxxxxxx <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> for BIRD_cms.lite.user, (shadow pid = 3165751)
> ...
> 07/14/20 10:09:27 (pid:3957219) Shadow pid 3165751 for job 50111938.9 exited with status 115
> ...
> 07/14/20 10:09:27 (pid:3957219) Match record (slot2@xxxxxxxxxxxxxxxxx <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> for BIRD_cms.lite.user, 50111938.9) deleted
> 
> The output of condor_status -absent remains unchanged (a few hours after the reboot) and presumably all the time since reboot.
> 
> Then there is another crash of the node and the node is marked absent again:
> 
> (CollectorLog)
> 07/14/20 23:42:44       **** Removing stale ad: "< slot2_22@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 23:42:44 Added ad to persistent store key=<slot2_22@xxxxxxxxxxxxxxxxx,131.169.160.166>
> 
> At this point, I would have expected the absent output to have changed to 23:42; a few hours later, it still shows 15:12.
> 
> I did search the CollectorLog for lines matching
> 
> 07/14/20 hh:mm:ss Removed ad from persistent store key=<slotX_Y@xxxxxxxxxxxxxxxxx,131.169.160.166>
> 
> which are produced for other nodes on at least two occasions (an explicit invalidate from the node, and presumably the regular removal of expired absent ads).
> 
> I am fairly sure these would indicate removal of the ad with the Absent attribute.
> 
> But there were none, so I figure the old absent ads somehow remained alongside the newly inserted ones above, and are possibly the reason for this whole behaviour.
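That search can be automated the same way: collect every key "Added" to the persistent store, drop the ones later "Removed", and whatever is left should be the ads stuck in the store. A sketch under the same assumption about the exact log wording:

```python
import re

# Wording assumed from the CollectorLog lines quoted in this thread.
ADDED_RE   = re.compile(r'Added ad to persistent store key=<([^>]+)>')
REMOVED_RE = re.compile(r'Removed ad from persistent store key=<([^>]+)>')

def stuck_keys(log_lines):
    """Keys added to the collector's persistent store and never removed."""
    live = set()
    for line in log_lines:
        m = ADDED_RE.search(line)
        if m:
            live.add(m.group(1))
            continue
        m = REMOVED_RE.search(line)
        if m:
            live.discard(m.group(1))
    return sorted(live)
```

Any key this reports for a node that is demonstrably back up would be a candidate for the leftover absent ads suspected above.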
> 
> Hopefully this is more helpful.
> 
> Best
>  Kruno
> 
> ----- Original Message -----
>> From: "Mark Coatsworth" <coatsworth@xxxxxxxxxxx>
>> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
>> Sent: Thursday, 16 July, 2020 00:39:31
>> Subject: Re: [HTCondor-users] Absent node still active
> 
>> Hi Kruno,
>> 
>> I'm trying to reproduce this problem here on a 3-node testbed cluster.
>> 
>> Can you explain how exactly the startd is crashing such that
>> `condor_status -absent` shows output? Can you attach what it says?
>> 
>> In my tests with both controlled and forced shutdowns, the collector
>> seems to react correctly, and I don't know how to get it into a state
>> where -absent returns anything. I'm running HTCondor v8.8.9.
>> 
>> Mark
>> 
>> 
>> 
>> 
>> On Wed, Jul 15, 2020 at 3:30 AM Sever, Krunoslav
>> <krunoslav.sever@xxxxxxx> wrote:
>>> 
>>> Hi,
>>> 
>>> got an absent node that should not be absent... here is the story:
>>> 
>>> First it crashed and I see in the log where the collector dutifully set it to
>>> absent about 30 minutes later, when the startd ad expired.
>>> 
>>> condor_status -absent (currently) shows that time.
>>> 
>>> After the node started up again, I see that the collector received new startd
>>> ads, so I assume these would replace the absent ad.
>>> 
>>> But condor_status -absent still shows the node, unchanged, a few hours after
>>> reboot.
>>> 
>>> Moreover, a few minutes after reboot, the negotiator (surprisingly?) matched a
>>> job for the node, which was scheduled and ran.
>>> 
>>> Even more interesting, the node apparently crashed again a few hours later and
>>> again I see the log entry where the collector sets the Absent attribute.
>>> 
>>> But condor_status -absent *still* shows the original absent date, i.e. from the
>>> first crash.
>>> 
>>> Looking through the sources I see that the offline plugin in the collector is
>>> the only place where the Absent attribute is set.
>>> 
>>> A few other source files reference the attribute but only for reading purposes
>>> (e.g. condor_status).
>>> 
>>> I also note that the persistent storage where the absent ads are put was never
>>> removed after reboot of the node.
>>> 
>>> This removal is done when a node actively invalidates an ad, so maybe that's
>>> missing or didn't run somehow?
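For what it's worth, this machinery is driven by a handful of collector knobs (names per the HTCondor manual; the values below are illustrative defaults, not a recommendation). The 30-day window in the `condor_status -absent` excerpt above (7/13 14:52 → 8/12 14:52) matches the documented default of ABSENT_EXPIRE_ADS_AFTER:

```
# Where the collector persists absent/offline ads across restarts
COLLECTOR_PERSISTENT_AD_LOG = /var/lib/condor/spool/OfflineAdsLog
# Expression selecting which expiring ads get the Absent attribute
ABSENT_REQUIREMENTS = True
# How long an absent ad is kept (seconds; default 30 days)
ABSENT_EXPIRE_ADS_AFTER = 2592000
# If True, explicitly invalidated ads are expired (and may be marked
# absent) instead of being deleted outright
EXPIRE_INVALIDATED_ADS = False
```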
>>> 
>>> Any ideas?
>>> 
>>> Best
>>>  Kruno
>>> 
>>> --
>>> ------------------------------------------------------------------------
>>> Krunoslav Sever            Deutsches Elektronen-Synchrotron (IT-Systems)
>>>                        Ein Forschungszentrum der Helmholtz-Gemeinschaft
>>>                                                            Notkestr. 85
>>> phone:  +49-40-8998-1648                                   22607 Hamburg
>>> e-mail: krunoslav.sever@xxxxxxx                                  Germany
>>> ------------------------------------------------------------------------
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>> 
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>> 
>> 
>> 
>> --
>> Mark Coatsworth
>> Systems Programmer
>> Center for High Throughput Computing
>> Department of Computer Sciences
>> University of Wisconsin-Madison
> 