[condor-users] condor_status and condor_q failing

On our cluster, when there is a lot of work going on and many jobs in
the queue, condor_q and condor_status occasionally fail to connect to
the collector.  Is there a known cause or fix for this?
Below is one of the messages we get, with further comments after it
(hostnames/IPs removed):

CEDAR:6001:Failed to connect to <###.##.#.##:9618>
Error: Couldn't contact the condor_collector on hostname.domainname.

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.

If you are the system administrator, check that the condor_collector is
running on hostname.domainname, check the HOSTALLOW configuration in
your condor_config, and check the MasterLog and CollectorLog files in your
log directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.

I'm running these commands from the Condor master itself, so HOSTALLOW
isn't the issue (and I know it's not, because the commands work most of
the time; they fail maybe 10% of the time, and only under load).  Also,
when this happens, there is no corresponding entry in the MasterLog or
CollectorLog to indicate a problem.
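
To put a firmer number on that "maybe 10%", something like the following
could be run on the same machine while the pool is busy. It is only a
minimal sketch: it opens plain TCP connections to the collector port (9618,
taken from the error message above) and counts how many attempts fail. It
does not speak the Condor protocol, so it can only show whether the connect
step itself is what fails. The function name, parameters, and defaults are
made up for illustration and are not part of Condor.

```python
import socket
import time


def probe_collector(host, port=9618, attempts=100, timeout=5.0, delay=0.0,
                    connect=socket.create_connection):
    """Attempt `attempts` TCP connections to host:port; return the failure count.

    `connect` is injectable so the loop can be exercised without a network.
    This is a hypothetical helper, not a Condor tool.
    """
    failures = 0
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections, timeouts, and name-resolution errors.
            failures += 1
        else:
            conn.close()
        if delay:
            # Optional pause between attempts so the probe adds little load.
            time.sleep(delay)
    return failures
```

For example, probe_collector("hostname.domainname", attempts=100, delay=0.1)
returns the number of failed attempts out of 100. If that count is near zero
while condor_status is still failing, the problem is probably not at the TCP
connect stage.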

This is Condor 6.5.5 (the RedHat 9 dynamic package), running under Gentoo Linux.


Corey Shields - IU Unix Systems Support Group

