[HTCondor-users] negotiator dies


I've been running various performance tests on HTCondor for six days,
and during that period my negotiator was killed and restarted from
time to time, with this message:

"/usr/sbin/condor_negotiator" on "condormaster1" was killed because
it was no longer responding.
Condor will automatically restart this process in 10 seconds.

The last log lines before each kill are completely ordinary; there is
nothing suspicious in them.

I can't see any obvious bottleneck, except there's a peak on the
networking plots, ~450 sent packets/sec, ~0.4 sent MBytes/sec, ~900
received packets/sec, ~1.4 received MBytes/sec.
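
As a rough sanity check on those peak figures, the sketch below works out the average packet size and the share of the link that the peak would occupy. Two things here are assumptions and not measurements: that "MBytes" on the plots means 10^6 bytes, and that the link is 1 Gbit/s (the actual link speed is not stated above).

```python
# Rough sanity check of the peak network figures quoted above.
# Assumption: "MBytes" means 10**6 bytes (the plots might use 2**20).
BYTES_PER_MBYTE = 10**6


def avg_packet_bytes(mbytes_per_sec: float, packets_per_sec: float) -> float:
    """Average number of bytes per packet at the given rates."""
    return mbytes_per_sec * BYTES_PER_MBYTE / packets_per_sec


def link_utilisation(mbytes_per_sec: float, link_mbit_per_sec: float = 1000.0) -> float:
    """Fraction of the link in use.

    The 1 Gbit/s default is an assumed link speed, not a known value.
    """
    return mbytes_per_sec * 8 / link_mbit_per_sec


sent_avg = avg_packet_bytes(0.4, 450)      # peak sent traffic
received_avg = avg_packet_bytes(1.4, 900)  # peak received traffic

print(f"average sent packet:     {sent_avg:.0f} bytes")
print(f"average received packet: {received_avg:.0f} bytes")
print(f"receive link share:      {link_utilisation(1.4):.1%}")
```

Under those assumptions the peak is only a small share of the link capacity, which fits with the network not looking like an obvious bottleneck.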

I have 100 subcollectors running on 2 machines (50 on each); one of
the machines also runs the main collector, and the other runs the
negotiator. I have ~700 worker nodes with 33,600 job slots (I turn
them on and off during the test), and during the tests I repeatedly
submitted something like 80,000 - 400,000 jobs spread among 10
schedds. So I think I can say that I've done everything to keep the
negotiator really busy.
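
To give a feel for the scale, the figures above can be broken down as in the sketch below. It assumes that the jobs are spread evenly across the schedds and that all slots are switched on at once, neither of which was strictly true during the tests.

```python
# Scale of the test pool, using the figures quoted above.
WORKER_NODES = 700
JOB_SLOTS = 33_600
SCHEDDS = 10
JOBS_MIN, JOBS_MAX = 80_000, 400_000  # jobs per submission round

# Average number of slots on each worker node.
slots_per_node = JOB_SLOTS / WORKER_NODES

# Jobs queued on each schedd, assuming an even spread (an assumption).
jobs_per_schedd = (JOBS_MIN / SCHEDDS, JOBS_MAX / SCHEDDS)

# Queued jobs per slot, assuming every slot is switched on (an assumption).
jobs_per_slot = (JOBS_MIN / JOB_SLOTS, JOBS_MAX / JOB_SLOTS)

print(f"slots per node:  {slots_per_node:.0f}")
print(f"jobs per schedd: {jobs_per_schedd[0]:.0f} to {jobs_per_schedd[1]:.0f}")
print(f"jobs per slot:   {jobs_per_slot[0]:.1f} to {jobs_per_slot[1]:.1f}")
```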

I attach a weekly graph covering the testing period.

[root@condormaster1 ~]# condor_version
$CondorVersion: 8.1.2 Oct 19 2013 BuildID: 189797 $
$CondorPlatform: x86_64_RedHat6 $


Attachment: graph.png
Description: PNG image