
Re: [Condor-users] network problem



Hi all... I've fixed the strange problem by just changing

CONDOR_HOST=xeon2

to

CONDOR_HOST=192.168.1.22

I don't know why, but it works. Still, I would be grateful if an expert
could check my Condor settings, described below.
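
For readers who hit the same error, one thing worth checking is how the
hostname is listed in /etc/hosts. In the file quoted below, xeon2 appears
twice: once under 127.0.1.1 and once under 192.168.1.22. Whether that is what
produced the 255.255.255.255 address is only a guess; nothing in this thread
confirms it. The following is a minimal sketch, not part of Condor, that flags
such names. The helper names parse_hosts and suspicious_names and the
shortened SAMPLE table are made up for illustration.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname to the addresses it is listed under, in file order."""
    table = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        addr, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name, []).append(addr)
    return table


def suspicious_names(table):
    """Return names listed under several addresses, one of them loopback."""
    flagged = {}
    for name, addrs in table.items():
        has_loopback = any(ipaddress.ip_address(a).is_loopback for a in addrs)
        if len(addrs) > 1 and has_loopback:
            flagged[name] = addrs
    return flagged


# Shortened, illustrative copy of the hosts table quoted below.
SAMPLE = """\
127.0.0.1       localhost
127.0.1.1       xeon2
# Internal IP numbers for cluster machines
192.168.1.22    xeon2
192.168.1.23    xeon3
"""

if __name__ == "__main__":
    print(suspicious_names(parse_hosts(SAMPLE)))
```

To check a real machine, pass the contents of /etc/hosts to parse_hosts
instead of SAMPLE. Setting CONDOR_HOST to the literal IP, as above, avoids
relying on the name lookup at all.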

Thanks,

Javier.


Quoting Javier Forment Millet <jforment@xxxxxxxxxxxx>:

> Hi, all... First of all, I apologize for the long message. I'm stuck
> installing Condor on an intranet of Ubuntu Xeon machines, and need some help,
> please.
>
> I've installed Condor, apparently without problems, but when I run
> condor_status from the central manager (hostname xeon2, IP 192.168.1.22),
> I get the following error:
>
> -----------------------------------------------------------------------
> - attempt to connect to <255.255.255.255:9618> failed: Network is unreachable
> - (connect errno = 101).
> -----------------------------------------------------------------------
>
> I don't know why it tries to connect to 255.255.255.255.
>
> The IP for the central manager is 192.168.1.22, as shown in the /etc/hosts
> file on that machine:
>
> -------------------------------------------------
> - 127.0.0.1       localhost
> - 127.0.1.1       xeon2
> -
> - # Internal IP numbers for cluster machines
> - 192.168.1.1     bioinfo
> -
> - 192.168.1.11    thales
> - 192.168.1.12    pentium2
> - 192.168.1.13    pentium3
> - 192.168.1.14    pentium4
> - 192.168.1.15    pentium5
> -
> - 192.168.1.21    bioxeon
> - 192.168.1.22    xeon2
> - 192.168.1.23    xeon3
> - 192.168.1.24    xeon4
> - 192.168.1.25    xeon5
> - 192.168.1.26    xeon6
> - 192.168.1.27    xeon7
> - 192.168.1.28    xeon8
> -
> - # The following lines are desirable for IPv6 capable hosts
> - ::1     ip6-localhost ip6-loopback
> - fe00::0 ip6-localnet
> - ff00::0 ip6-mcastprefix
> - ff02::1 ip6-allnodes
> - ff02::2 ip6-allrouters
> - ff02::3 ip6-allhosts
> -----------------------------------------------------------
>
> And /etc/network/interfaces on that machine also seems to be right:
>
> -----------------------------------
> - # The loopback network interface
> - auto lo
> - iface lo inet loopback
> -
> - # LAN
> - auto eth0
> - iface eth0 inet static
> - address 192.168.1.22
> - netmask 255.255.255.0
> - network 192.168.1.0
> ------------------------------------
>
> It is not a general problem with the intranet, since I can ping and ssh
> between 192.168.1.* nodes.
>
> The installation steps were:
>
> I installed Condor on the central manager first (hostname xeon2). I ran
> condor_configure as root like this:
>
> sudo ./condor_configure --install --local-dir=/home/condor --type=manager,execute,submit --central-manager=xeon2
>
> I tried adding the parameter --owner=root, but it gave me an error, so I
> assumed it was not necessary and removed it. The daemons are therefore run by
> the condor user (maybe they are started by root and then switched to the
> condor user? Or maybe that is the problem?).
>
> Then I've edited the condor_config and condor_config.local files to set the
> following parameters:
>
> UID_DOMAIN=$(FULL_HOSTNAME)
> FILESYSTEM_DOMAIN=$(FULL_HOSTNAME)
> HOSTALLOW_READ=192.168.1.*,*.cs.wisc.edu
> HOSTALLOW_WRITE=192.168.1.*
> COLLECTOR_NAME=IBMCP-cluster
> USE_NFS=False
> USE_AFS=False
> DEFAULT_DOMAIN_NAME=ibmcp-cluster.upv.es #(not a real domain, since it is an intranet with no internet access)
> NO_DNS=True
> TRUST_UID_DOMAIN=True
> CONDOR_HOST=xeon2
> NETWORK_INTERFACE=192.168.1.22
>
> I'm not really sure if these settings are right. The nodes are in an
> intranet, so they don't have fully qualified internet domain names (they have
> single hostnames and 192.168.1.* IP numbers). Only two of them have a second
> network card with internet access and a fully qualified internet hostname,
> but that is not the case for the central manager node.
>
> Then I run condor_master (or start Condor using condor.boot in /etc/init.d
> and the corresponding /etc/rc*.d directories), and get the following daemons
> running:
>
> jforment@xeon2:~$ ps aux | grep condor
> condor 4458 40.4 0.1 32324 3200 ? Ss 15:26 /usr/local/condor/sbin/condor_master
> condor 4463  0.0 0.1 31328 3344 ? Ss 15:26 condor_collector -f
> condor 4464 40.8 0.1 31144 3296 ? Ss 15:26 condor_negotiator -f
> condor 4474 48.6 0.2 33332 4384 ? Ss 15:26 condor_startd -f
> condor 4477 40.1 0.1 32636 4016 ? Ss 15:26 condor_schedd -f
>
> The logs seem normal, except for the connection error indicated above (I
> include them below for review). To sum up, I think the problem is that it
> always tries to connect to 255.255.255.255, which is wrong: it must connect
> to 192.168.1.22. I suspect that the problem arises from wrong settings in the
> macros listed above, specifically those related to the machines not having
> fully qualified internet domains and hostnames, but I cannot fix the problem.
> I've tried changing some of them, without success.
>
> Thanks a lot,
>
> Javier.
>
>
> (CollectorLog)
> 11/28 15:36:37 ******************************************************
> 11/28 15:36:37 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
> 11/28 15:36:37 ** /usr/local/condor-6.8.6/sbin/condor_collector
> 11/28 15:36:37 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/28 15:36:37 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 11/28 15:36:37 ** PID = 4512
> 11/28 15:36:37 ** Log last touched 11/28 15:33:29
> 11/28 15:36:37 ******************************************************
> 11/28 15:36:37 Using config source: /home/condor/condor_config
> 11/28 15:36:37 Using local config sources:
> 11/28 15:36:37    /home/condor/condor_config.local
> 11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:9618>
> 11/28 15:36:37 In ViewServer::Init()
> 11/28 15:36:37 In CollectorDaemon::Init()
> 11/28 15:36:37 In ViewServer::Config()
> 11/28 15:36:37 In CollectorDaemon::Config()
> 11/28 15:36:37 enable: Creating stats hash table
> 11/28 15:51:37 Housekeeper:  Ready to clean old ads
> 11/28 15:51:37  Cleaning StartdAds ...
> 11/28 15:51:37  Cleaning StartdPrivateAds ...
> 11/28 15:51:37  Cleaning QuillAds ...
> 11/28 15:51:37  Cleaning ScheddAds ...
> 11/28 15:51:37  Cleaning SubmittorAds ...
> 11/28 15:51:37  Cleaning LicenseAds ...
> 11/28 15:51:37  Cleaning MasterAds ...
> 11/28 15:51:37  Cleaning CkptServerAds ...
> 11/28 15:51:37  Cleaning CollectorAds ...
> 11/28 15:51:37  Cleaning StorageAds ...
> 11/28 15:51:37  Cleaning NegotiatorAds ...
> 11/28 15:51:37  Cleaning HadAds ...
> 11/28 15:51:37  Cleaning Generic Ads ...
> 11/28 15:51:37 Housekeeper:  Done cleaning
>
> (MasterLog)
> 11/28 15:36:36 ******************************************************
> 11/28 15:36:36 ** condor_master (CONDOR_MASTER) STARTING UP
> 11/28 15:36:36 ** /usr/local/condor-6.8.6/sbin/condor_master
> 11/28 15:36:36 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/28 15:36:36 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 11/28 15:36:36 ** PID = 4511
> 11/28 15:36:36 ** Log last touched 11/28 15:33:29
> 11/28 15:36:36 ******************************************************
> 11/28 15:36:36 Using config source: /home/condor/condor_config
> 11/28 15:36:36 Using local config sources:
> 11/28 15:36:36    /home/condor/condor_config.local
> 11/28 15:36:36 DaemonCore: Command Socket at <192.168.1.22:58724>
> 11/28 15:36:37 Collector port not defined, will use default: 9618
> 11/28 15:36:37 Started DaemonCore process "/usr/local/condor/sbin/condor_collector", pid and pgroup = 4512
> 11/28 15:36:37 Started DaemonCore process "/usr/local/condor/sbin/condor_negotiator", pid and pgroup = 4513
> 11/28 15:36:37 Started DaemonCore process "/usr/local/condor/sbin/condor_startd", pid and pgroup = 4514
> 11/28 15:36:37 Started DaemonCore process "/usr/local/condor/sbin/condor_schedd", pid and pgroup = 4515
> 11/28 15:36:42 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20
> 11/28 15:37:02 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).
> 11/28 15:37:02 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618> failed
> 11/28 15:37:02 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:41:42 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20
> 11/28 15:42:02 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).
> 11/28 15:42:02 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618> failed
> 11/28 15:42:02 Failed to start non-blocking update to <255.255.255.255:9618>.
> ... (the last error repeats every 5 minutes)
>
> (NegotiatorLog)
> 11/28 15:36:37 ******************************************************
> 11/28 15:36:37 Using config source: /home/condor/condor_config
> 11/28 15:36:37 Using local config sources:
> 11/28 15:36:37    /home/condor/condor_config.local
> 11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:44272>
> 11/28 15:36:38 ACCOUNTANT_HOST = None (local)
> 11/28 15:36:38 NEGOTIATOR_INTERVAL = 300 sec
> 11/28 15:36:38 NEGOTIATOR_TIMEOUT = 30 sec
> 11/28 15:36:38 MAX_TIME_PER_SUBMITTER = 31536000 sec
> 11/28 15:36:38 MAX_TIME_PER_PIESPIN = 31536000 sec
> 11/28 15:36:38 PREEMPTION_REQUIREMENTS = ( (CurrentTime - EnteredCurrentState) > (1 * (60 * 60)) && RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser =
> 11/28 15:36:38 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - TARGET.ImageSize
> 11/28 15:36:38 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
> 11/28 15:36:38 NEGOTIATOR_POST_JOB_RANK = None
> 11/28 15:36:38 Warning: attempting to compare null hostnames in same_host.
> 11/28 15:36:38 ---------- Started Negotiation Cycle ----------
> 11/28 15:36:38 Phase 1:  Obtaining ads from collector ...
> 11/28 15:36:38   Getting all public ads ...
> 11/28 15:36:38 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 10 total seconds (10
> 11/28 15:36:48 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).
> 11/28 15:36:48 Couldn't fetch ads: communication error
> 11/28 15:36:48 Aborting negotiation cycle
> 11/28 15:36:48 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20
> 11/28 15:37:08 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).
> 11/28 15:37:08 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618> failed
> 11/28 15:37:08 Failed to start non-blocking update to <255.255.255.255:9618>.
> ... (the last error repeats every 5 minutes)
>
> (ScheddLog)
> 11/28 15:36:37 (pid:4515) ******************************************************
> 11/28 15:36:37 (pid:4515) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 11/28 15:36:37 (pid:4515) ** /usr/local/condor-6.8.6/sbin/condor_schedd
> 11/28 15:36:37 (pid:4515) ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/28 15:36:37 (pid:4515) ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 11/28 15:36:37 (pid:4515) ** PID = 4515
> 11/28 15:36:37 (pid:4515) ** Log last touched 11/28 15:33:29
> 11/28 15:36:37 (pid:4515) ******************************************************
> 11/28 15:36:37 (pid:4515) Using config source: /home/condor/condor_config
> 11/28 15:36:37 (pid:4515) Using local config sources:
> 11/28 15:36:37 (pid:4515)    /home/condor/condor_config.local
> 11/28 15:36:37 (pid:4515) DaemonCore: Command Socket at <192.168.1.22:48552>
> 11/28 15:36:37 (pid:4515) History file rotation is enabled.
> 11/28 15:36:37 (pid:4515)   Maximum history file size is: 20971520 bytes
> 11/28 15:36:37 (pid:4515)   Number of rotated history files is: 2
> 11/28 15:36:38 (pid:4515) "/usr/local/condor/sbin/condor_shadow.pvm -classad" did not produce any output, ignoring
> 11/28 15:36:38 (pid:4515) attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total
> 11/28 15:36:58 (pid:4515) attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).
> 11/28 15:36:58 (pid:4515) ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618> failed
> 11/28 15:36:58 (pid:4515) Failed to start non-blocking update to <255.255.255.255:9618>.
> ... (the last error repeats every 5 minutes)
>
> (StartLog)
> 11/28 15:36:37 ******************************************************
> 11/28 15:36:37 ** condor_startd (CONDOR_STARTD) STARTING UP
> 11/28 15:36:37 ** /usr/local/condor-6.8.6/sbin/condor_startd
> 11/28 15:36:37 ** $CondorVersion: 6.8.6 Sep 13 2007 $
> 11/28 15:36:37 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
> 11/28 15:36:37 ** PID = 4514
> 11/28 15:36:37 ** Log last touched 11/28 15:33:29
> 11/28 15:36:37 ******************************************************
> 11/28 15:36:37 Using config source: /home/condor/condor_config
> 11/28 15:36:37 Using local config sources:
> 11/28 15:36:37    /home/condor/condor_config.local
> 11/28 15:36:37 DaemonCore: Command Socket at <192.168.1.22:49167>
> 11/28 15:36:38 "/usr/local/condor/sbin/condor_starter.pvm -classad" did not produce any output, ignoring
> 11/28 15:36:39 vm1: New machine resource allocated
> 11/28 15:36:39 vm2: New machine resource allocated
> 11/28 15:36:39 vm3: New machine resource allocated
> 11/28 15:36:39 vm4: New machine resource allocated
> 11/28 15:36:39 vm5: New machine resource allocated
> 11/28 15:36:39 vm6: New machine resource allocated
> 11/28 15:36:39 vm7: New machine resource allocated
> 11/28 15:36:39 vm8: New machine resource allocated
> 11/28 15:36:39 About to run initial benchmarks.
> 11/28 15:36:43 Completed initial benchmarks.
> 11/28 15:36:46 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (20
> 11/28 15:37:06 attempt to connect to <255.255.255.255:9618> failed: Network is unreachable (connect errno = 101).
> 11/28 15:37:06 ERROR: SECMAN:2003:TCP connection to <255.255.255.255:9618> failed
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> 11/28 15:37:06 ERROR: SECMAN:2004:Was waiting for TCP auth session to <255.255.255.255:9618>, but it failed.
> 11/28 15:37:06 Failed to start non-blocking update to <255.255.255.255:9618>.
> ... (the last error repeats every 5 minutes)
>
>
>
>
> --
> Javier Forment Millet
> Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
>  Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
>  Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
> Tlf.:+34-96-3877858
> FAX: +34-96-3877859
> jforment@xxxxxxxxxxxx
>
>


-- 
Javier Forment Millet
Instituto de Biología Celular y Molecular de Plantas (IBMCP) CSIC-UPV
 Ciudad Politécnica de la Innovación (CPI) Edificio 8 E, Escalera 7 Puerta E
 Calle Ing. Fausto Elio s/n. 46022 Valencia, Spain
Tlf.:+34-96-3877858
FAX: +34-96-3877859
jforment@xxxxxxxxxxxx