[Condor-users] Quill out of sync



Hi all,

We might have a problem here, caused by a networking issue yesterday when our
management network was flooded with traffic.

We have four head nodes which share a negotiator in HA mode. At some point
yesterday one node decided it would be the negotiator for a couple of minutes,
as it could not connect to any other head node. Now we have this weird
situation where quill and the "direct" query are out of sync:

Querying against quill:
atlas2# condor_q -g |grep running
2 jobs; 0 idle, 2 running, 0 held
9648 jobs; 3150 idle, 6498 running, 0 held

Direct query:
atlas2# condor_q -g -direct schedd|grep running
21 jobs; 8 idle, 13 running, 0 held
2081 jobs; 1 idle, 2080 running, 0 held
1 jobs; 0 idle, 1 running, 0 held

condor_status believes this:
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX  6602     0    2070        31       0          0     4501

               Total  6602     0    2070        31       0          0     4501

The negotiator agrees by telling me (for any user):
Got NO_MORE_JOBS;  done negotiating

How do we get quill and the daemons back in sync? It has been in this state
for more than 12 hours now, so I would assume it has had a chance to replay
the "forgotten" transactions, right?

Cheers

Carsten