
Re: [Condor-users] strange ipaddress problem



Ashutosh,

Here is a link to patched Condor executables that should solve this problem. You only need to replace the condor_collector, condor_negotiator, and condor_schedd on your central manager. The rest of your pool should not need to be patched.

http://www.cs.wisc.edu/~danb/condor_7.0.2_lehigh/

This Condor build was made from the 7.0.2 pre-release, but the patch I am giving you may not be in time to make it into 7.0.2. I'll let you know. I built Condor on CentOS release 4.5 32-bit. Hopefully that is compatible with your system. If not, I can build it on some other platform.

--Dan

Dan Bradley wrote:

Hi Ashutosh,

This is a bug in Condor. It is affecting your nodes with an IP address matching the private address of your central manager plus trailing digits.
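The symptom reported below is consistent with an address rewrite that matches on a plain string prefix. The following is only a minimal sketch of that failure mode, not the actual Condor code, which is not shown in this thread. The function names are hypothetical, and the head node's private address 192.168.1.1 is an assumption (it is not stated anywhere in the thread).

```python
# Hypothetical illustration of a prefix-matching address rewrite.
# PRIVATE is an ASSUMED value; the thread never states the head node's
# private address. PUBLIC is the address given in the original report.
PRIVATE = "192.168.1.1"
PUBLIC = "128.180.2.45"


def naive_rewrite(sinful, private_ip, public_ip):
    """Buggy: substring replacement with no check for where the address ends."""
    return sinful.replace(private_ip, public_ip)


def boundary_aware_rewrite(sinful, private_ip, public_ip):
    """Rewrite only when the host part equals private_ip exactly."""
    host, sep, port = sinful.strip("<>").partition(":")
    if host == private_ip:
        host = public_ip
    return "<" + host + sep + port + ">"


if __name__ == "__main__":
    addr = "<192.168.1.10:56927>"
    print(naive_rewrite(addr, PRIVATE, PUBLIC))
    print(boundary_aware_rewrite(addr, PRIVATE, PUBLIC))
```

Under these assumptions the naive version turns 192.168.1.10 into 128.180.2.450, the same invalid address that appears in the condor_status output quoted below, while the boundary-aware version leaves it untouched.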

I have a patch ready, but I may be too late to sneak it into 7.0.2. I'll send you some patched Condor executables to solve the problem.

Sorry you hit this!

--Dan

Ashutosh Mahajan wrote:

hi all,
we are running a cluster with 600+ cpus. the head node has two interfaces: one facing the internet (128.180.2.45) and the other on a private net (192.168.*.*). users log into this node to submit their jobs. all the other nodes in the cluster are on the private net.

everything seems fine except for 10 nodes in the cluster. these nodes have ip addresses 192.168.1.10 through 192.168.1.19 (and hostnames blaze10 through blaze19). if i do the following on the head node:

[asm4@blaze1 ~]$ condor_status blaze10 -l | grep IpAdd
PublicNetworkIpAddr = "<128.180.2.450:56927>"
StartdIpAddr = "<128.180.2.450:56927>"
PublicNetworkIpAddr = "<128.180.2.450:56927>"
StartdIpAddr = "<128.180.2.450:56927>"

similarly, blaze11 shows ip address 128.180.2.451 in condor_status on blaze1, and so on. however, the same command, when run on some other node, say blaze2, gives:
[asm4@blaze2 ~]$ condor_status blaze10 -l | grep IpAdd
PublicNetworkIpAddr = "<192.168.1.10:56927>"
StartdIpAddr = "<192.168.1.10:56927>"
PublicNetworkIpAddr = "<192.168.1.10:56927>"
StartdIpAddr = "<192.168.1.10:56927>"

which is the correct address.
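A quick way to spot such entries across a whole pool is to check every address in the output for validity. This is a rough sketch under stated assumptions: it expects text in the `<host:port>` form shown above, and the function name `invalid_addresses` is made up for illustration. It uses only the Python standard library and does not call Condor itself.

```python
import ipaddress
import re

# Matches addresses of the form <host:port>, as printed in the output above.
SINFUL_RE = re.compile(r"<([^:>]+):(\d+)>")


def invalid_addresses(text):
    """Return the host parts that are not valid IPv4 addresses."""
    bad = []
    for host, _port in SINFUL_RE.findall(text):
        try:
            ipaddress.IPv4Address(host)
        except ipaddress.AddressValueError:
            # For example, an octet larger than 255.
            bad.append(host)
    return bad


if __name__ == "__main__":
    sample = (
        'PublicNetworkIpAddr = "<128.180.2.450:56927>"\n'
        'StartdIpAddr = "<192.168.1.10:56927>"\n'
    )
    print(invalid_addresses(sample))
```

The text to check could be saved from a command such as the `condor_status blaze10 -l | grep IpAdd` shown above and passed to the function as a string.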


in NegotiatorLog of the head node i see,
6/4 20:24:32     Request 147588.00000:
6/4 20:24:32     Failed to initiate socket to send MATCH_INFO to slot2@xxxxxxxxxxxxxxxxxxxxx
6/4 20:24:32 Matched 147588.0 bad0@xxxxxxxxxxxxx <128.180.2.45:45179> preempting none <128.180.2.450:56927> slot2@xxxxxxxxxxxxxxxxxxxxx
6/4 20:24:32       Successfully matched with slot2@xxxxxxxxxxxxxxxxxxxxx

repeatedly.
i can log into each of these 10 nodes and their ip addresses seem to be set correctly.
we have condor 7.0.1 running on all nodes (X86_64-LINUX_RHEL5).

we also have BIND_ALL_INTERFACES set to true because we were trying a few things with flocking.

any ideas what could be wrong? thanks in advance.
--
regards
Ashutosh Mahajan
http://www.lehigh.edu/~asm4
