Re: [Condor-users] Condor 7.2.4 / 7.4.1 — "Can't find resource with ClaimId" errors from startd



Dan Bradley <dan@xxxxxxxxxxxx> writes:

> I recommend the following course of action to debug this problem further:
> If you haven't already, turn on verbose debugging information in the execute
> node configuration:

[...]

> It may also be useful to see the collector and negotiator logs for the same
> time period and with the same extra debugging options.

I hadn't, but one of the changes I made while trying to narrow it down seems
to have made the problem vanish: setting 'NEGOTIATOR_INFORM_STARTD = False',
which stops the negotiator from notifying the startd of a match, so the
startd never enters the 'Matched' state.
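
For the archives, the change amounts to a single line in whatever
configuration the negotiator reads.  This is only a sketch: I am assuming it
goes in the local config file on the machine that runs the negotiator, and
where that file lives will differ from site to site.

    # Assumed location: the local config file read by the negotiator.
    # Stop the negotiator from notifying the startd when a match is made.
    NEGOTIATOR_INFORM_STARTD = False

The negotiator has to re-read its configuration before the setting takes
effect.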

This could be related to our network topology: we have two sites connected
over a WAN.  The Condor central manager (the machine running the negotiator)
is remote, but the submission and execution nodes are local.

I wonder if this was a race between the negotiator informing the execute
node of the match and the submission node connecting to claim it?[1]


Failing that, I will take another look at the debugging you suggested and
see whether I can reproduce the problem reliably enough to capture something
useful.

        Daniel


Footnotes: 
[1]  These issues coincided with a problem that significantly raised packet
     loss across the WAN link, which would have made communication between
     the negotiator and the execution node much less reliable since, IIRC,
     they use UDP for the move to the Matched state...

-- 
✣ Daniel Pittman            ✉ daniel@xxxxxxxxxxxx            ☎ +61 401 155 707
               ♽ made with 100 percent post-consumer electrons