Re: [Condor-users] GCB "Unable to determine local IP address"



>  Assuming your glidein target is Linux, and assuming your config file is
>  not setting NETWORK_INTERFACE.

I tried setting NETWORK_INTERFACE and got the same result. There's
something in the manual that says NET_REMAP_ENABLE turns on
BIND_ALL_INTERFACES. Would that cause NETWORK_INTERFACE to be ignored?

>  Unfortunately, I think some clues may have appeared in your MasterLog
>  file on lines earlier than where you started above.  Could you send the
>  entire MasterLog?  Specifically, I am hoping that a line(s) containing
>  "_all_myIP" appears in the log, since it looks like this is the
>  underlying function that failed --- and it looks like it always will log
>  why it is failing.

Here's the entire log from a later test:

2/12 18:03:08 ******************************************************
2/12 18:03:08 ** condor_master (CONDOR_MASTER) STARTING UP
2/12 18:03:08 ** /home/geovault-00/juve/Condor_glidein/7.0.0-x86_64-pc-Linux-2.4-glibc2.3/condor_master
2/12 18:03:08 ** $CondorVersion: 7.0.0 Jan 22 2008 BuildID: 72173 $
2/12 18:03:08 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
2/12 18:03:08 ** PID = 5784
2/12 18:03:08 ** Log last touched time unavailable (No such file or directory)
2/12 18:03:08 ******************************************************
2/12 18:03:08 Using config source: /home/geovault-00/juve/Condor_glidein/glidein_condor_config
2/12 18:03:08 GCB: GCB_bind(7[GCB_SOCKET], 0x7fbfffe7a0-><0.0.0.0:0>): Unable to determine local IP address (_myIP failed).
2/12 18:03:08 Failed to bind to command ReliSock
2/12 18:03:08 (Make sure your IP address is correct in /etc/hosts.)
2/12 18:03:08 ERROR "BindAnyCommandPort() failed" at line 8385 in file daemon_core.C

>  Possible reasons why it could fail:
>    1. out of memory (malloc returns NULL)
>    2. failure to open a datagram socket
>    3. failure to call ioctl SIOCGIFCONF to get list of all interfaces
>    4. failure to ioctl SIOCGIFFLAGS to find out if an interface is up
>    5. more than 10 network interfaces on the machine
>
>  Thoughts on the above:
>  (1) - does not seem likely.
>  (2) - perhaps a limit on number of descriptors or sockets for this user
>  is being hit?  could try running "limit" as the glidein user.

The limit on descriptors is set to 4096.

>  (3) - don't know why this would fail unless there is some uncommon
>  permission settings going on.  can you successfully run "/sbin/ifconfig"
>  as the glidein user?
>  (4) - same thoughts as #3.  also, are any of the interfaces listed from
>  ifconfig really strange?  maybe we are failing to see if some virtual
>  interface, like from some VPN software, is up or down and that is
>  confusing GCB.
>  (5) - when you run /sbin/ifconfig, how many entries do you see?  If the
>  answer is 10 or more, I think we have discovered the problem.  The GCB
>  code has a static limit of 10 network interfaces.  If this is indeed the
>  problem you are hitting, we could improve this.

I can run ifconfig on the nodes. It shows two interfaces: eth0 and lo.

eth0      Link encap:Ethernet  HWaddr 00:11:43:32:D7:F9
          inet addr:10.255.255.247  Bcast:10.255.255.255  Mask:255.0.0.0
          inet6 addr: fe80::211:43ff:fe32:d7f9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:829942 errors:0 dropped:0 overruns:0 frame:0
          TX packets:965888 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:161701145 (154.2 MiB)  TX bytes:159499192 (152.1 MiB)
          Base address:0xdcc0 Memory:dfae0000-dfb00000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:28031 errors:0 dropped:0 overruns:0 frame:0
          TX packets:28031 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3865070 (3.6 MiB)  TX bytes:3865070 (3.6 MiB)

>  Another idea is to side-step the problem altogether by specifying
>  NETWORK_INTERFACE in your glidein's config file, although I realize this
>  could be a pain in the rear to do, and ideally I'd like to understand
>  why the above setup is failing in your environment....

I am wondering if there might be an issue with _myIP and _all_myIP on
some platforms. If I compile GCB on the nodes and call those two
functions from a test program linked with libGCBcomm.a, _all_myIP
returns only 127.0.0.1, and that causes _myIP to fail. Specifically, I
think that the part of _all_myIP where it loops through the interfaces
might be setting the pointer incorrectly. If I change:

ptr += sizeof(ifr->ifr_name) + len;

To:

ptr += sizeof(struct ifreq);

Then _all_myIP returns the other interface(s) on the machines I am
having problems with.

Gideon

Test program (linked with libGCBcomm.a):

#include <stdio.h>
#include <stdint.h>

int _all_myIP(uint32_t *ip_arr, int *size);
int _myIP(uint32_t *ipaddr);

int test_all_myIP()
{
	uint32_t ip_arr[10], ipaddr;
	int size = 10;
	int i, res;

	res = _all_myIP(ip_arr, &size);
	if (res < 0) return res;

	printf("_all_myIP(): %d\n",size);
	for (i=0; i<size; i++)
	{
		ipaddr = ip_arr[i];
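		/* Low-order byte first: this is dotted-quad order for an address in
		   network byte order on a little-endian host. %u matches the
		   unsigned operands. */
		printf("%u.%u.%u.%u\n",
			(ipaddr >>  0) & 0xff,
			(ipaddr >>  8) & 0xff,
			(ipaddr >> 16) & 0xff,
			(ipaddr >> 24) & 0xff);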
	}

	return res;
}

int test_myIP()
{
	uint32_t ipaddr;
	int res;
	
	res = _myIP(&ipaddr);
	if (res < 0) return res;

	/* Same byte order as in test_all_myIP(); %u matches the unsigned operands. */
	printf("_myIP():\n%u.%u.%u.%u\n",
		(ipaddr >>  0) & 0xff,
		(ipaddr >>  8) & 0xff,
		(ipaddr >> 16) & 0xff,
		(ipaddr >> 24) & 0xff);

	return res;
}

int main(int argc, char **argv)
{
	int res;
	
	res = test_all_myIP();
	printf("_all_myIP() returned %d\n",res);
	
	res = test_myIP();
	printf("_myIP() returned %d\n",res);
	
	return 0;
}