
[Condor-users] flocking: trying to connect directly to the private network



Hi all

I've just finished installing Condor on a cluster (viz) whose head node is vizhead and whose nodes are viznode1, viznode2, ... on a 10.1.1.* private network.

From my submit node (127.39.27.70) I submitted a job that matches only the OS of the viz
cluster, so it cannot execute anywhere but there.

The submit node seems to talk to vizhead, and a match is made. Up to this point, everything is okay.

However, it seems that the submit node then tries to connect directly to the cluster nodes, which are in the private network (10.1.1.*) and, of course, fails.

How can I work around this?

The SchedLog on vizhead follows:

9/24 16:40:27 DaemonCore: Command received via TCP from host <127.39.27.70:44768>
9/24 16:40:27 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
9/24 16:40:27 Negotiating for owner: hbucher@xxxxxxxxxxxxxxxxxxxxxxxxxx
9/24 16:40:27 Checking consistency running and runnable jobs
9/24 16:40:27 Tables are consistent
9/24 16:40:27 Out of servers - 0 jobs matched, 10 jobs idle, 1 jobs rejected
9/24 16:40:27 Increasing flock level for hbucher@xxxxxxxxxxxxxxxxxxxxxxxxxx to 1.
9/24 16:40:31 Sent ad to central manager for hbucher@xxxxxxxxxxxxxxxxxxxxxxxxxx
9/24 16:41:02 DaemonCore: Command received via TCP from host <127.39.27.155:51294>
9/24 16:41:02 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
9/24 16:41:02 Negotiating for owner: hbucher@xxxxxxxxxxxxxxxxxxxxxxxxxx
9/24 16:41:02 Checking consistency running and runnable jobs
9/24 16:41:02 Tables are consistent
9/24 16:41:02 Out of servers - 7 jobs matched, 3 jobs idle, 1 jobs rejected
9/24 16:41:47 select returns 0, connect failed
9/24 16:41:47 Will keep trying for 45 seconds...
9/24 16:41:48 Connect failed for 45 seconds; returning FALSE
9/24 16:41:48 Couldn't send REQUEST_CLAIM to startd at <10.1.1.111:38543>
9/24 16:43:03 Can't connect to <10.1.1.111:38543>:0, errno = 145
9/24 16:43:03 Will keep trying for 10 seconds...
9/24 16:43:04 Connect failed for 10 seconds; returning FALSE
9/24 16:43:04 ERROR:
SECMAN:2003:TCP connection to <10.1.1.111:38543> failed


9/24 16:43:04 Sent RELEASE_CLAIM to startd on <10.1.1.111:38543>
9/24 16:43:04 Match record (<10.1.1.111:38543>, 22, 0) deleted
9/24 16:43:49 select returns 0, connect failed
9/24 16:43:49 Will keep trying for 45 seconds...
9/24 16:43:50 Connect failed for 45 seconds; returning FALSE
9/24 16:43:50 Couldn't send REQUEST_CLAIM to startd at <10.1.1.112:33865>
9/24 16:45:05 Can't connect to <10.1.1.112:33865>:0, errno = 145
9/24 16:45:05 Will keep trying for 10 seconds...
9/24 16:45:06 Connect failed for 10 seconds; returning FALSE
9/24 16:45:06 ERROR:
SECMAN:2003:TCP connection to <10.1.1.112:33865> failed
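
To make it easier to see which addresses the failed claims were aimed at, here is a minimal Python sketch that pulls the startd addresses out of the "Couldn't send REQUEST_CLAIM" lines and checks whether they fall in the RFC 1918 private ranges. It is not part of Condor; the function name and the regular expression are my own and assume the log format shown above.

```python
import ipaddress
import re

# RFC 1918 private ranges; these are not routable from outside the cluster.
PRIVATE_NETS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Assumed line format, taken from the SchedLog excerpt above:
#   Couldn't send REQUEST_CLAIM to startd at <10.1.1.111:38543>
CLAIM_FAILURE = re.compile(
    r"Couldn't send REQUEST_CLAIM to startd at <(\d+\.\d+\.\d+\.\d+):(\d+)>"
)


def unreachable_private_startds(log_text):
    """Return the sorted, unique startd IPs from failed claims that are private."""
    found = set()
    for match in CLAIM_FAILURE.finditer(log_text):
        addr = ipaddress.ip_address(match.group(1))
        if any(addr in net for net in PRIVATE_NETS):
            found.add(str(addr))
    return sorted(found)


sample = """\
9/24 16:41:48 Couldn't send REQUEST_CLAIM to startd at <10.1.1.111:38543>
9/24 16:43:04 Sent RELEASE_CLAIM to startd on <10.1.1.111:38543>
9/24 16:43:50 Couldn't send REQUEST_CLAIM to startd at <10.1.1.112:33865>
"""

print(unreachable_private_startds(sample))  # ['10.1.1.111', '10.1.1.112']
```

Both addresses reported are inside 10.0.0.0/8, which is consistent with the connection being attempted directly against the private network.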

Thanks!

Henrique