
Re: [HTCondor-users] Absent node still active

Hi Kruno,

I'm trying to reproduce this problem here on a 3-node testbed cluster.

Can you explain how exactly the startd is crashing such that
`condor_status -absent` shows output? Can you attach what it says?

In my tests with both controlled and forced shutdowns, the controller
seems to react correctly, and I can't get it into a state where
`condor_status -absent` returns anything. I'm running HTCondor v8.8.9.
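
To check that I understand the behaviour you expected (as described in
your message below), here is a toy Python model of it. This is only a
sketch and not HTCondor code: the class name, the `AbsentSince` key and
the 1800-second lifetime are all hypothetical, and it assumes that a
fresh startd ad simply replaces a stored absent ad.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCollector:
    """Toy stand-in for a collector. Not HTCondor code."""

    ad_lifetime: int = 1800  # seconds before an un-refreshed ad expires (assumed)
    ads: dict = field(default_factory=dict)  # machine name -> ad (plain dict)

    def update(self, name: str, now: int) -> None:
        # Assumption: a fresh ad replaces whatever is stored, absent or not.
        self.ads[name] = {"Name": name, "LastHeardFrom": now, "Absent": False}

    def expire(self, now: int) -> None:
        # Rather than dropping an expired ad, keep it and mark it absent.
        for ad in self.ads.values():
            if not ad["Absent"] and now - ad["LastHeardFrom"] >= self.ad_lifetime:
                ad["Absent"] = True
                ad["AbsentSince"] = now  # hypothetical key, for illustration

    def absent(self) -> list:
        # Rough analogue of listing only the ads currently marked absent.
        return [ad for ad in self.ads.values() if ad["Absent"]]
```

In this model a node that crashes, reboots and crashes again would show
up as absent with the time of the second crash, not the first. What you
are seeing differs from that, which is the part I'm trying to reproduce.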


On Wed, Jul 15, 2020 at 3:30 AM Sever, Krunoslav
<krunoslav.sever@xxxxxxx> wrote:
> Hi,
> I've got an absent node that should not be absent. Here is the story:
> First it crashed and I see in the log where the collector dutifully set it to absent about 30 minutes later, when the startd ad expired.
> condor_status -absent (currently) shows that time.
> After the node started up again, I see that the collector received new startd ads, so I assume these would replace the absent ad.
> But condor_status -absent still shows the node, unchanged, a few hours after reboot.
> Moreover, a few minutes after reboot, the negotiator (surprisingly?) matched a job for the node, which was scheduled and ran.
> Even more interesting, the node apparently crashed again a few hours later and again I see the log entry where the collector sets the Absent attribute.
> But condor_status -absent *still* shows the original absent date, i.e. from the first crash.
> Looking through the sources I see that the offline plugin in the collector is the only place where the Absent attribute is set.
> A few other source files reference the attribute but only for reading purposes (e.g. condor_status).
> I also note that the persistent storage where the absent ads are put was never removed after reboot of the node.
> This removal is done when a node actively invalidates an ad, so maybe that's missing or didn't run somehow?
> Any ideas?
> Best
>   Kruno
> --
> ------------------------------------------------------------------------
> Krunoslav Sever            Deutsches Elektronen-Synchrotron (IT-Systems)
>                         Ein Forschungszentrum der Helmholtz-Gemeinschaft
>                                                             Notkestr. 85
> phone:  +49-40-8998-1648                                   22607 Hamburg
> e-mail: krunoslav.sever@xxxxxxx                                  Germany
> ------------------------------------------------------------------------
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

Mark Coatsworth
Systems Programmer
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin-Madison