
[Condor-users] network problem



Hi, all... First of all, I apologize for the long message. I'm stuck
installing Condor on an intranet of Xeon machines running Ubuntu, and I need
some help, please.

I've installed Condor, apparently without problems, but when I run
condor_status from the central manager
(hostname xeon2, IP 192.168.1.22), I get the following error:

-----------------------------------------------------------------------
- attempt to connect to <255.255.255.255:9618> failed: Network is unreachable
- (connect errno = 101).
-----------------------------------------------------------------------

I don't know why it tries to connect to 255.255.255.255.

The IP of the central manager is 192.168.1.22, as shown in the /etc/hosts file
on that machine:

-------------------------------------------------
- 127.0.0.1       localhost
- 127.0.1.1       xeon2
-
- # Internal IP numbers for cluster machines
- 192.168.1.1     bioinfo
-
- 192.168.1.11    thales
- 192.168.1.12    pentium2
- 192.168.1.13    pentium3
- 192.168.1.14    pentium4
- 192.168.1.15    pentium5
-
- 192.168.1.21    bioxeon
- 192.168.1.22    xeon2
- 192.168.1.23    xeon3
- 192.168.1.24    xeon4
- 192.168.1.25    xeon5
- 192.168.1.26    xeon6
- 192.168.1.27    xeon7
- 192.168.1.28    xeon8
-
- # The following lines are desirable for IPv6 capable hosts
- ::1     ip6-localhost ip6-loopback
- fe00::0 ip6-localnet
- ff00::0 ip6-mcastprefix
- ff02::1 ip6-allnodes
- ff02::2 ip6-allrouters
- ff02::3 ip6-allhosts
-----------------------------------------------------------
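
To double-check which addresses this file assigns to a given name, here is a
minimal Python sketch. The helper name addresses_for and the shortened excerpt
are only illustrative assumptions; it just scans the file text in order and
does not reproduce what the system resolver or Condor actually does.

```python
def addresses_for(hosts_text, name):
    """Return every address the hosts-file text maps to `name`, in file order."""
    found = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if not line:
            continue  # skip empty and comment-only lines
        address, *names = line.split()
        if name in names:
            found.append(address)
    return found


# Shortened excerpt of the file shown above (illustrative, not the full file).
HOSTS_EXCERPT = """\
127.0.0.1       localhost
127.0.1.1       xeon2
# Internal IP numbers for cluster machines
192.168.1.21    bioxeon
192.168.1.22    xeon2
"""

print(addresses_for(HOSTS_EXCERPT, "xeon2"))  # ['127.0.1.1', '192.168.1.22']
```

With the excerpt above, xeon2 is listed under two addresses, 127.0.1.1 and
192.168.1.22.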

And /etc/network/interfaces on that machine also seems to be right:

-----------------------------------
- # The loopback network interface
- auto lo
- iface lo inet loopback
-
- # LAN
- auto eth0
- iface eth0 inet static
- address 192.168.1.22
- netmask 255.255.255.0
- network 192.168.1.0
------------------------------------
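
As a sanity check on these values, a small Python sketch using the standard
ipaddress module (the variable names are mine, chosen for illustration)
confirms that the address belongs to the configured network and shows the
broadcast address of that subnet:

```python
import ipaddress

# Values copied from the interfaces file above.
network = ipaddress.ip_network("192.168.1.0/255.255.255.0")
address = ipaddress.ip_address("192.168.1.22")

print(address in network)         # True
print(network.broadcast_address)  # 192.168.1.255
```

So the broadcast address of this subnet is 192.168.1.255, not the
255.255.255.255 that appears in the error.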

It is not a general problem with the intranet, since I can ping and ssh between
192.168.1.* nodes.

The installation steps were:

I installed Condor on the central manager first (hostname xeon2). I ran
condor_configure as root like this:

sudo ./condor_configure --install --local-dir=/home/condor \
    --type=manager,execute,submit --central-manager=xeon2

I tried adding the parameter --owner=root, but it gave me an error, so I
assumed it was not necessary and removed it. As a result, the daemons run as
the condor user (perhaps they are started by root and then switch to the condor
user? Or could that be the problem?).

Then I edited the condor_config and condor_config.local files to set the
following parameters:

UID_DOMAIN=$(FULL_HOSTNAME)
FILESYSTEM_DOMAIN=$(FULL_HOSTNAME)
HOSTALLOW_READ=192.168.1.*,*.cs.wisc.edu
HOSTALLOW_WRITE=192.168.1.*
COLLECTOR_NAME=IBMCP-cluster
USE_NFS=False
USE_AFS=False
DEFAULT_DOMAIN_NAME=ibmcp-cluster.upv.es #(not a real domain, since it is an
intranet with no internet access)
NO_DNS=True
TRUST_UID_DOMAIN=True
CONDOR_HOST=xeon2
NETWORK_INTERFACE=192.168.1.22

I'm not really sure whether these settings are right. The nodes are in an
intranet, so they don't have fully qualified Internet domain names (they have
single-label hostnames and 192.168.1.* IP addresses). Only two of them have a
second network card with Internet access and a fully qualified hostname, but
the central manager node is not one of them.

Then I ran condor_master (or started Condor using condor.boot in /etc/init.d
and the corresponding /etc/rc*.d directories), and got the following daemons
running:

jforment@xeon2:~$ ps aux | grep condor
condor 4458 40.4 0.1 32324 3200 ? Ss 15:26 /usr/local/condor/sbin/condor_master
condor 4463  0.0 0.1 31328 3344 ? Ss 15:26 condor_collector -f
condor 4464 40.8 0.1 31144 3296 ? Ss 15:26 condor_negotiator -f
condor 4474 48.6 0.2 33332 4384 ? Ss 15:26 condor_startd -f
condor 4477 40.1 0.1 32636 4016 ? Ss 15:26 condor_schedd -f

The logs seem normal, except for the connection error shown above (I include
them below for reference). To sum up, I think the problem is that it always
tries to connect to 255.255.255.255, which is wrong: it should connect to
192.168.1.22. I suspect the problem comes from wrong values for the macros
listed above, specifically those related to the machines not having fully
qualified Internet domain names and hostnames, but I cannot fix it. I've tried
changing some of them, without success.

Thanks a lot,

Javier.


(CollectorLog)
11/28 15:36:37 ******************************************************
11/28 15:36:37 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
11/28 15:36:37 ** /usr/local/condor-6.8.6/sbin/condor_collector
11/28 15:36:37 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/28 15:36:37 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/28 15:36:37 ** PID = 4512
11/28 15:36:37 ** Log last touched 11/28 15:33:29
11/28 15:36:37 ******************************************************
11/28 15:36:37 Using config source: /home/condor/condor_config
11/28 15:36:37 Using local config sources:
11/28 15:36:37    /home/condor/condor_config.local
11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:9618>
11/28 15:36:37 In ViewServer::Init()
11/28 15:36:37 In CollectorDaemon::Init()
11/28 15:36:37 In ViewServer::Config()
11/28 15:36:37 In CollectorDaemon::Config()
11/28 15:36:37 enable: Creating stats hash table
11/28 15:51:37 Housekeeper:  Ready to clean old ads
11/28 15:51:37  Cleaning StartdAds ...
11/28 15:51:37  Cleaning StartdPrivateAds ...
11/28 15:51:37  Cleaning QuillAds ...
11/28 15:51:37  Cleaning ScheddAds ...
11/28 15:51:37  Cleaning SubmittorAds ...
11/28 15:51:37  Cleaning LicenseAds ...
11/28 15:51:37  Cleaning MasterAds ...
11/28 15:51:37  Cleaning CkptServerAds ...
11/28 15:51:37  Cleaning CollectorAds ...
11/28 15:51:37  Cleaning StorageAds ...
11/28 15:51:37  Cleaning NegotiatorAds ...
11/28 15:51:37  Cleaning HadAds ...
11/28 15:51:37  Cleaning Generic Ads ...
11/28 15:51:37 Housekeeper:  Done cleaning

(MasterLog)
11/28 15:36:36 ******************************************************
11/28 15:36:36 ** condor_master (CONDOR_MASTER) STARTING UP
11/28 15:36:36 ** /usr/local/condor-6.8.6/sbin/condor_master
11/28 15:36:36 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/28 15:36:36 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/28 15:36:36 ** PID = 4511
11/28 15:36:36 ** Log last touched 11/28 15:33:29
11/28 15:36:36 ******************************************************
11/28 15:36:36 Using config source: /home/condor/condor_config
11/28 15:36:36 Using local config sources:
11/28 15:36:36    /home/condor/condor_config.local
11/28 15:36:36 DaemonCore: Command Socket at <192.168.1.22:58724>
11/28 15:36:37 Collector port not defined, will use default: 9618
11/28 15:36:37 Started DaemonCore process
"/usr/local/condor/sbin/condor_collector", pid and pgroup = 4512
11/28 15:36:37 Started DaemonCore process
"/usr/local/condor/sbin/condor_negotiator", pid and pgroup = 4513
11/28 15:36:37 Started DaemonCore process
"/usr/local/condor/sbin/condor_startd", pid and pgroup = 4514
11/28 15:36:37 Started DaemonCore process
"/usr/local/condor/sbin/condor_schedd", pid and pgroup = 4515
11/28 15:36:42 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20

11/28 15:37:02 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
11/28 15:37:02 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618>
failed

11/28 15:37:02 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:41:42 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20

11/28 15:42:02 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
11/28 15:42:02 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618>
failed

11/28 15:42:02 Failed to start non-blocking update to <255.255.255.255:9618>.
... (the last error repeats every 5 minutes)

(NegotiatorLog)
11/28 15:36:37 ******************************************************
11/28 15:36:37 Using config source: /home/condor/condor_config
11/28 15:36:37 Using local config sources:
11/28 15:36:37    /home/condor/condor_config.local
11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:44272>
11/28 15:36:38 ACCOUNTANT_HOST = None (local)
11/28 15:36:38 NEGOTIATOR_INTERVAL = 300 sec
11/28 15:36:38 NEGOTIATOR_TIMEOUT = 30 sec
11/28 15:36:38 MAX_TIME_PER_SUBMITTER = 31536000 sec
11/28 15:36:38 MAX_TIME_PER_PIESPIN = 31536000 sec
11/28 15:36:38 PREEMPTION_REQUIREMENTS = ( (CurrentTime - EnteredCurrentState) >
(1 * (60 * 60)) && RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser =
11/28 15:36:38 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
11/28 15:36:38 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
11/28 15:36:38 NEGOTIATOR_POST_JOB_RANK = None
11/28 15:36:38 Warning: attempting to compare null hostnames in same_host.
11/28 15:36:38 ---------- Started Negotiation Cycle ----------
11/28 15:36:38 Phase 1:  Obtaining ads from collector ...
11/28 15:36:38   Getting all public ads ...
11/28 15:36:38 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).  Will keep trying for 10 total seconds (10

11/28 15:36:48 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
11/28 15:36:48 Couldn't fetch ads: communication error
11/28 15:36:48 Aborting negotiation cycle
11/28 15:36:48 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20

11/28 15:37:08 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
11/28 15:37:08 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618>
failed

11/28 15:37:08 Failed to start non-blocking update to <255.255.255.255:9618>.
... (the last error repeats every 5 minutes)

(ScheddLog)
11/28 15:36:37 (pid:4515) ******************************************************
11/28 15:36:37 (pid:4515) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
11/28 15:36:37 (pid:4515) ** /usr/local/condor-6.8.6/sbin/condor_schedd
11/28 15:36:37 (pid:4515) ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/28 15:36:37 (pid:4515) ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/28 15:36:37 (pid:4515) ** PID = 4515
11/28 15:36:37 (pid:4515) ** Log last touched 11/28 15:33:29
11/28 15:36:37 (pid:4515) ******************************************************
11/28 15:36:37 (pid:4515) Using config source: /home/condor/condor_config
11/28 15:36:37 (pid:4515) Using local config sources:
11/28 15:36:37 (pid:4515)    /home/condor/condor_config.local
11/28 15:36:37 (pid:4515) DaemonCore: Command Socket at <192.168.1.22:48552>
11/28 15:36:37 (pid:4515) History file rotation is enabled.
11/28 15:36:37 (pid:4515)   Maximum history file size is: 20971520 bytes
11/28 15:36:37 (pid:4515)   Number of rotated history files is: 2
11/28 15:36:38 (pid:4515) "/usr/local/condor/sbin/condor_shadow.pvm -classad"
did not produce any output, ignoring
11/28 15:36:38 (pid:4515) attempt to connect to <255.255.255.255:9618> failed:
Network is unreachable (connect errno = 101).  Will keep trying for 20 total

11/28 15:36:58 (pid:4515) attempt to connect to <255.255.255.255:9618> failed:
Network is unreachable (connect errno = 101).
11/28 15:36:58 (pid:4515) ERROR: SECMAN:2003:TCP connection to
<255.255.255.255:9618> failed

11/28 15:36:58 (pid:4515) Failed to start non-blocking update to
<255.255.255.255:9618>.
... (the last error repeats every 5 minutes)

(StartLog)
11/28 15:36:37 ******************************************************
11/28 15:36:37 ** condor_startd (CONDOR_STARTD) STARTING UP
11/28 15:36:37 ** /usr/local/condor-6.8.6/sbin/condor_startd
11/28 15:36:37 ** $CondorVersion: 6.8.6 Sep 13 2007 $
11/28 15:36:37 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
11/28 15:36:37 ** PID = 4514
11/28 15:36:37 ** Log last touched 11/28 15:33:29
11/28 15:36:37 ******************************************************
11/28 15:36:37 Using config source: /home/condor/condor_config
11/28 15:36:37 Using local config sources:
11/28 15:36:37    /home/condor/condor_config.local
11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:49167>
11/28 15:36:38 "/usr/local/condor/sbin/condor_starter.pvm -classad" did not
produce any output, ignoring
11/28 15:36:39 vm1: New machine resource allocated
11/28 15:36:39 vm2: New machine resource allocated
11/28 15:36:39 vm3: New machine resource allocated
11/28 15:36:39 vm4: New machine resource allocated
11/28 15:36:39 vm5: New machine resource allocated
11/28 15:36:39 vm6: New machine resource allocated
11/28 15:36:39 vm7: New machine resource allocated
11/28 15:36:39 vm8: New machine resource allocated
11/28 15:36:39 About to run initial benchmarks.
11/28 15:36:43 Completed initial benchmarks.
11/28 15:36:46 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20

11/28 15:37:06 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
11/28 15:37:06 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618>
failed

11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to
<255.255.255.255:9618>, but it failed.
11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
... (the last error repeats every 5 minutes)
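
To summarize logs like the ones above, here is a minimal Python sketch that
counts the targets of failed connection attempts. The function name
failed_targets and the regular expression are my own assumptions, based only
on the message format visible in these logs; the sample is a shortened excerpt
of the StartLog.

```python
import re
from collections import Counter

# Matches lines such as:
#   attempt to connect to <255.255.255.255:9618> failed: Network is ...
FAILED_CONNECT = re.compile(r"attempt to connect to <([\d.]+):(\d+)> failed")


def failed_targets(log_text):
    """Count failed connection attempts per 'address:port' target."""
    return Counter(
        f"{match.group(1)}:{match.group(2)}"
        for match in FAILED_CONNECT.finditer(log_text)
    )


# Shortened excerpt of the StartLog above (illustrative).
SAMPLE = """\
11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:49167>
11/28 15:36:46 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
11/28 15:37:06 attempt to connect to <255.255.255.255:9618> failed: Network is
unreachable (connect errno = 101).
"""

for target, count in failed_targets(SAMPLE).items():
    print(target, count)  # 255.255.255.255:9618 2
```

In this excerpt, every failed attempt goes to 255.255.255.255:9618, while the
daemon's own command socket is bound to 192.168.1.22.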




-- 
Javier Forment Millet
Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
 Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
 Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
Tlf.:+34-96-3877858
FAX: +34-96-3877859
jforment@xxxxxxxxxxxx

