
[Condor-users] Problems Submitting Long Jobs to the Cluster



We have a Condor 6.6.9 cluster and, for no apparent reason, no more than two jobs are able to run on it. I checked the SchedLog file on the master server and noticed the following entries:

 

Sent ad to central manager for ak791@xxxxxxxxxxxxxx

Activity on stashed negotiator socket

Negotiating for owner ak791@xxxxxxxxxxxxxx

Checking consistency running and runnable jobs

Tables are consistent

Out of servers – 0 jobs matched, 8 jobs idle, 8 jobs rejected

 

I then checked the MatchLog file and found numerous instances of the following:

 

Rejected 13149.x ak791@xxxxxxxxxxxxxx <192.168.1.103:59494>: no match found

 

The NegotiatorLog file had the following entries:

 

Request 13149.0000x

            Rejected 13149.x ak791@xxxxxxxxxxxxxx <192.168.1.103:59494>: no match found
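To get a sense of how many of these rejections there are and whether they all carry the same reason, the log lines can be tallied with a short script. This is only a rough sketch: the regular expression is inferred from the excerpts quoted here, not from any documented Condor log format, and `tally_rejections` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the excerpts in this post (an assumption, not an
# official log format): "Rejected <job> <owner> <addr>: <reason>".
REJECT_RE = re.compile(
    r"Rejected\s+(?P<job>\d+\.\S+)\s+(?P<owner>\S+)\s+"
    r"<\s*(?P<addr>[^>]+)>:\s*(?P<reason>.+)"
)

def tally_rejections(lines):
    """Count rejection lines per (owner, reason) pair."""
    counts = Counter()
    for line in lines:
        match = REJECT_RE.search(line)
        if match:
            key = (match.group("owner"), match.group("reason").strip())
            counts[key] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python tally.py < MatchLog
    for (owner, reason), n in tally_rejections(sys.stdin).most_common():
        print(f"{n:6d}  {owner}  {reason}")
```

If every rejection shows the same "no match found" reason for the same owner, that points at the matchmaking step rather than at individual jobs.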

 

I noticed that the system in question, oneofxeon, has problems connecting to several of the nodes in the cluster through either SSH or telnet. Connection attempts fail with the error: No route to host. I have verified that the /etc/hosts entries are all correct.
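The same connectivity check can be repeated across all nodes with a small script, which makes it easier to see exactly which hosts are unreachable. The sketch below is a minimal example under stated assumptions: the node names in `NODES` are hypothetical placeholders, and port 22 (SSH) is assumed to be open on a healthy node.

```python
import errno
import socket

# Hypothetical node names; replace with the real cluster hosts.
NODES = ["node01", "node02", "node03"]

def classify(exc):
    """Map a socket-level exception to a short status string."""
    if isinstance(exc, socket.gaierror):
        return "name resolution failed"
    if isinstance(exc, socket.timeout):
        return "timed out"
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host"
    if exc.errno == errno.ECONNREFUSED:
        return "connection refused"
    return f"other error: {exc}"

def probe(host, port=22, timeout=3.0):
    """Attempt a TCP connection to host:port and report the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:  # gaierror and timeout are OSError subclasses
        return classify(exc)

if __name__ == "__main__":
    for node in NODES:
        print(f"{node:20s} {probe(node)}")
```

A "no route to host" result for a node whose name resolves correctly suggests a routing, interface, or firewall problem rather than a bad /etc/hosts entry.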

 

Has anyone seen this before, and does anyone know what steps are needed to correct it? Thanks.

 
