We have a Condor-6.6.9 cluster, and for no apparent reason no more than two jobs are able to run on it at once. I checked the SchedLog
file on the master server and noticed the following entries:
Sent ad to central manager for ak791@xxxxxxxxxxxxxx
Activity on stashed negotiator socket
Negotiating for owner ak791@xxxxxxxxxxxxxx
Checking consistency running and runnable jobs
Tables are consistent
Out of servers - 0 jobs matched, 8 jobs idle, 8 jobs rejected
I then checked the MatchLog file and found numerous instances of the following:
Rejected 13149.x ak791@xxxxxxxxxxxxxx <192.168.1.103:59494>: no match found
The NegotiatorLog file had the following entries:
Rejected 13149.x ak791@xxxxxxxxxxxxxx <192.168.1.103:59494>: no match found
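To see how many rejections there are and whether they all share the same reason, the log lines can be tallied with a short script. This is only a sketch: it assumes the "Rejected <job> <owner> <address>: <reason>" layout shown in the lines above, and the names REJECTED and tally_rejections are made up for illustration.

```python
import re
from collections import Counter

# Pattern for lines of the form seen above:
#   Rejected 13149.x owner@domain <192.168.1.103:59494>: no match found
# (assumed layout; any leading text on the line is ignored by search()).
REJECTED = re.compile(r"Rejected\s+(\S+)\s+(\S+)\s+<\s*([^>\s]+)\s*>:\s*(.+)")


def tally_rejections(lines):
    """Count rejections per (owner, reason) pair."""
    counts = Counter()
    for line in lines:
        match = REJECTED.search(line)
        if match:
            _job, owner, _addr, reason = match.groups()
            counts[(owner, reason.strip())] += 1
    return counts
```

Passing an open log file to tally_rejections() returns a Counter keyed by owner and rejection reason; lines that do not match the pattern are skipped.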
I also noticed that the system in question, oneofxeon, has trouble connecting to several of the nodes in the cluster through either SSH or telnet. Connection
attempts fail with the error: No route to host. I verified that the /etc/hosts entries are all correct.
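A check like the one described above can be scripted to show exactly which nodes are unreachable. The following is a minimal sketch rather than a definitive tool: it assumes TCP port 22 (SSH) is the port to test, and the function names are hypothetical. "No route to host" corresponds to the EHOSTUNREACH error code.

```python
import errno
import socket


def describe_error(exc):
    """Map a connection exception to a short, readable status."""
    if isinstance(exc, socket.timeout):
        return "timed out"
    if isinstance(exc, socket.gaierror):
        return "name lookup failed"
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host"
    if exc.errno == errno.ECONNREFUSED:
        return "connection refused"
    return "error: %s" % exc


def check_host(host, port=22, timeout=3.0):
    """Attempt one TCP connection to host:port and report the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return describe_error(exc)


def check_nodes(hosts, port=22):
    """Return {host: status} for every host in the list."""
    return {host: check_host(host, port) for host in hosts}
```

Calling check_nodes() with the list of node names from /etc/hosts returns a dictionary of host to status, which makes it easy to compare the nodes that fail with the ones that accept jobs.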
Has anyone seen this before, and does anyone know what steps are needed to correct it? Thanks.