[Condor-users] strange ipaddress problem



hi all,
we are running a cluster with 600+ cpus. the head node has two interfaces: one
facing the internet (128.180.2.45) and the other on a private net (192.168.*.*).
users log into this node to submit their jobs. all the other nodes in the
cluster are on the private net.

everything seems fine except for 10 nodes in the cluster. these nodes have
ip addresses 192.168.1.10 through 192.168.1.19 (and hostnames blaze10 through
blaze19). if i run the following on the head node:

[asm4@blaze1 ~]$ condor_status blaze10 -l | grep IpAdd
PublicNetworkIpAddr = "<128.180.2.450:56927>"
StartdIpAddr = "<128.180.2.450:56927>"
PublicNetworkIpAddr = "<128.180.2.450:56927>"
StartdIpAddr = "<128.180.2.450:56927>"

similarly, blaze11 shows ip address 128.180.2.451 in condor_status on blaze1,
and so on. however, the same command run on some other node, say blaze2,
gives:
[asm4@blaze2 ~]$ condor_status blaze10 -l | grep IpAdd
PublicNetworkIpAddr = "<192.168.1.10:56927>"
StartdIpAddr = "<192.168.1.10:56927>"
PublicNetworkIpAddr = "<192.168.1.10:56927>"
StartdIpAddr = "<192.168.1.10:56927>"

which is the correct address.
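
note that 128.180.2.450 is not even a valid IPv4 address, since the last octet is above 255. a minimal sketch of how one might flag such entries automatically is below. it assumes the addresses appear in the `<host:port>` form shown in the condor_status output above; the names `SINFUL_RE` and `check_sinful` are made up for this illustration and are not part of condor.

```python
import ipaddress
import re

# Matches address strings of the form <host:port>, as seen in the
# condor_status output above. (Pattern name is hypothetical.)
SINFUL_RE = re.compile(r"<([^:>]+):(\d+)>")


def check_sinful(text):
    """Return (host, port, is_valid_ipv4) for each <host:port> found in text."""
    results = []
    for host, port in SINFUL_RE.findall(text):
        try:
            # Raises AddressValueError for things like an octet above 255.
            ipaddress.IPv4Address(host)
            valid = True
        except ipaddress.AddressValueError:
            valid = False
        results.append((host, int(port), valid))
    return results


if __name__ == "__main__":
    sample = (
        'StartdIpAddr = "<128.180.2.450:56927>"\n'
        'StartdIpAddr = "<192.168.1.10:56927>"\n'
    )
    for host, port, valid in check_sinful(sample):
        print(host, port, "ok" if valid else "INVALID")
```

piping the output of `condor_status -l` for every node through something like this would show whether any hosts other than blaze10 through blaze19 are advertised with a malformed address.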


in the NegotiatorLog on the head node i see,
6/4 20:24:32     Request 147588.00000:
6/4 20:24:32     Failed to initiate socket to send MATCH_INFO to slot2@xxxxxxxxxxxxxxxxxxxxx
6/4 20:24:32       Matched 147588.0 bad0@xxxxxxxxxxxxx <128.180.2.45:45179> preempting none <128.180.2.450:56927> slot2@xxxxxxxxxxxxxxxxxxxxx
6/4 20:24:32       Successfully matched with slot2@xxxxxxxxxxxxxxxxxxxxx

repeatedly.
i can log into each of these 10 nodes, and their ip addresses seem to be set
correctly.
we are running condor 7.0.1 on all nodes (X86_64-LINUX_RHEL5).

we also have BIND_ALL_INTERFACES set to true because we were trying a few
things with flocking.

any ideas what could be wrong? thanks in advance.
--
regards
Ashutosh Mahajan
http://www.lehigh.edu/~asm4