
[HTCondor-users] Absent node still active


I have a node marked absent that should not be absent. Here is the story:

First the node crashed, and I see in the log that the collector dutifully set it to absent about 30 minutes later, when the startd ad expired.

condor_status -absent (currently) shows that time.

After the node started up again, I see that the collector received new startd ads, so I assume these would replace the absent ad.

But condor_status -absent still shows the node, unchanged, a few hours after reboot.

Moreover, a few minutes after reboot, the negotiator (surprisingly?) matched a job for the node, which was scheduled and ran.

Even more interesting, the node apparently crashed again a few hours later and again I see the log entry where the collector sets the Absent attribute.

But condor_status -absent *still* shows the original absent date, i.e. from the first crash.
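
One way to make this observation reproducible is to dump both the live and the absent ad for the node and compare them side by side. The following is only a minimal sketch: the helper names and the host name in the usage note are hypothetical, and it assumes condor_status is on the PATH and that the startd ads carry the usual Machine attribute. It uses only the -absent, -long and -constraint options of condor_status.

```python
import shutil
import subprocess


def status_cmd(machine, absent):
    """Build a condor_status command line for one machine (hypothetical helper)."""
    cmd = ["condor_status"]
    if absent:
        # Ask the collector for absent ads instead of live ones.
        cmd.append("-absent")
    # -long dumps the whole ClassAd; the constraint narrows it to one host.
    cmd += ["-long", "-constraint", f'Machine == "{machine}"']
    return cmd


def dump_ads(machine):
    """Return the raw live and absent ad text, or None if condor_status is missing."""
    if shutil.which("condor_status") is None:
        return None
    ads = {}
    for label, absent in (("live", False), ("absent", True)):
        result = subprocess.run(
            status_cmd(machine, absent),
            capture_output=True,
            text=True,
            check=False,
        )
        ads[label] = result.stdout
    return ads
```

Calling dump_ads("node01.example.org") (a placeholder host name) returns a dictionary with the two dumps. If both are non-empty for the same slot name, the collector is holding a live ad and an absent ad for the node at the same time.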

Looking through the sources I see that the offline plugin in the collector is the only place where the Absent attribute is set.

A few other source files reference the attribute but only for reading purposes (e.g. condor_status).

I also note that the persistently stored absent ad for this node was never removed after the node rebooted.

This removal is done when a node actively invalidates an ad, so maybe that's missing or didn't run somehow?
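
To check where the absent ads are persisted and which absent-related settings are in effect, the configuration can be queried on the collector host. This is a hedged sketch, not a statement of how the collector works internally: it assumes the knobs COLLECTOR_PERSISTENT_AD_LOG, ABSENT_REQUIREMENTS and ABSENT_EXPIRE_ADS_AFTER are the relevant ones for the installed version (older releases may use a different name for the persistent log), and the helper names are hypothetical.

```python
import shutil
import subprocess

# Assumed to be the relevant knobs; verify against the manual for your version.
KNOBS = (
    "COLLECTOR_PERSISTENT_AD_LOG",
    "ABSENT_REQUIREMENTS",
    "ABSENT_EXPIRE_ADS_AFTER",
)


def config_cmd(knob):
    """Build the condor_config_val command line for one knob (hypothetical helper)."""
    return ["condor_config_val", knob]


def read_knobs(knobs=KNOBS):
    """Return {knob: value or None}; empty dict if condor_config_val is missing."""
    if shutil.which("condor_config_val") is None:
        return {}
    values = {}
    for knob in knobs:
        result = subprocess.run(
            config_cmd(knob), capture_output=True, text=True, check=False
        )
        # A non-zero exit status is treated here as "knob not defined".
        values[knob] = result.stdout.strip() if result.returncode == 0 else None
    return values
```

If the persistent log path is defined, inspecting that file for the node's name before and after a reboot would show whether the stored absent ad is ever removed.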

Any ideas?


Krunoslav Sever            Deutsches Elektronen-Synchrotron (IT-Systems)
                        Ein Forschungszentrum der Helmholtz-Gemeinschaft
                                                            Notkestr. 85
phone:  +49-40-8998-1648                                   22607 Hamburg
e-mail: krunoslav.sever@xxxxxxx                                  Germany