
Re: [Condor-users] Getting mad trying to flocking in condor



Hi,

Could a firewall be blocking the communication that flocking needs?
I had this problem and had to add exceptions to the iptables firewall rules on both machines, like this:

on machine X:

-A INPUT -m state --state NEW -m tcp -p tcp -s xxx.xxx.xxx.xxx/16 -d ip.addr.of.X -j ACCEPT
-A INPUT -p udp -s xxx.xxx.xxx.xxx/16 -d ip.addr.of.X -j ACCEPT

where "xxx.xxx.xxx.xxx/16" includes all machines in my pool, including the two flocking machines.
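
As a quick sanity check on the source range in such rules, the sketch below uses Python's standard ipaddress module to test whether given addresses fall inside a CIDR block. The function name and the example network 172.18.0.0/16 are assumptions chosen for illustration; the two addresses are the ones quoted for the cluster masters in the original message below. It is only a rough check and not part of Condor or iptables.

```python
import ipaddress

def covered_by(cidr, addresses):
    """Map each address to True if it lies inside the CIDR block."""
    network = ipaddress.ip_network(cidr, strict=False)
    return {addr: ipaddress.ip_address(addr) in network for addr in addresses}

if __name__ == "__main__":
    # Hypothetical source range; substitute the one used in your own rules.
    result = covered_by("172.18.0.0/16", ["172.18.0.2", "178.12.100.2"])
    for addr, inside in result.items():
        print(addr, "covered" if inside else "NOT covered")
```

With these example values the second address falls outside the range, so a single /16 source rule would not admit it; a second rule or a wider range would be needed when the two pools sit on different networks.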

Rob.


From: Michell Guzman Cancimance <michellrad@xxxxxxxxx>
To: condor-users@xxxxxxxxxxx
Sent: Wednesday, July 4, 2012 4:57 PM
Subject: [Condor-users] Getting mad trying to flocking in condor

Hi,

I'm going mad trying to flock a job from cluster A (master.cluster.org, 172.18.0.2) to cluster B (cl-master.mycluster.org, 178.12.100.2).
Each cluster has a master and two worker nodes. Cluster A's nodes have arch X86_64, and cluster B's nodes have arch INTEL (32-bit).
I have configured the flocking section of condor_config on the master node of each cluster (master.cluster.org and cl-master.mycluster.org), following the steps in http://research.cs.wisc.edu/condor/manual/v6.8/5_2Connecting_Condor.html. When I run a job in each cluster separately, it works fine, but when I submit a job that requires arch INTEL to cluster A (whose nodes are X86_64), expecting it to flock to cluster B, it doesn't work.
I have tried a lot of things without success. I would appreciate any help in solving this problem.


Best regards
Michell


This is my SchedLog file

07/03/12 07:05:38 (pid:813) Can't open directory "/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
07/03/12 07:05:38 (pid:813) Can't open directory "/opt/condor/tmp/condor/local.master/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
07/03/12 07:05:38 (pid:813) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
07/03/12 07:05:38 (pid:813) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
07/03/12 07:05:38 (pid:813) Setting maximum accepts per cycle 4.
07/03/12 07:05:38 (pid:813) ******************************************************
07/03/12 07:05:38 (pid:813) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
07/03/12 07:05:38 (pid:813) ** /opt/condor/tmp/condor/sbin/condor_schedd
07/03/12 07:05:38 (pid:813) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
07/03/12 07:05:38 (pid:813) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
07/03/12 07:05:38 (pid:813) ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
07/03/12 07:05:38 (pid:813) ** $CondorPlatform: x86_64_deb_5.0 $
07/03/12 07:05:38 (pid:813) ** PID = 813
07/03/12 07:05:38 (pid:813) ** Log last touched 7/3 07:04:28
07/03/12 07:05:38 (pid:813) ******************************************************
07/03/12 07:05:38 (pid:813) Using config source: /opt/condor/tmp/condor/etc/condor_config
07/03/12 07:05:38 (pid:813) Using local config sources:
07/03/12 07:05:38 (pid:813)    /opt/condor/tmp/condor/local.master/condor_config.local
07/03/12 07:05:38 (pid:813) DaemonCore: command socket at <10.0.2.15:49711>
07/03/12 07:05:38 (pid:813) DaemonCore: private command socket at <10.0.2.15:49711>
07/03/12 07:05:38 (pid:813) Setting maximum accepts per cycle 4.
07/03/12 07:05:38 (pid:813) History file rotation is enabled.
07/03/12 07:05:38 (pid:813)   Maximum history file size is: 20971520 bytes
07/03/12 07:05:38 (pid:813)   Number of rotated history files is: 2
07/03/12 07:05:40 (pid:813) About to rotate ClassAd log /opt/condor/tmp/condor/local.master/spool/job_queue.log
07/03/12 07:05:48 (pid:813) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/03/12 07:05:48 (pid:813) Sent ad to central manager for vagrant@xxxxxxxxxxx
07/03/12 07:05:48 (pid:813) Sent ad to 1 collectors for vagrant@xxxxxxxxxxx
07/03/12 07:06:38 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:06:38 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:06:38 (pid:813) AutoCluster:config() significant attributes changed to
07/03/12 07:06:38 (pid:813) Checking consistency running and runnable jobs
07/03/12 07:06:38 (pid:813) Tables are consistent
07/03/12 07:06:38 (pid:813) Rebuilt prioritized runnable job list in 0.001s.
07/03/12 07:06:38 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:06:38 (pid:813) Increasing flock level for vagrant to 1.
07/03/12 07:06:38 (pid:813) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/03/12 07:06:38 (pid:813) Sent ad to central manager for vagrant@xxxxxxxxxxx
07/03/12 07:06:38 (pid:813) Sent ad to 1 collectors for vagrant@xxxxxxxxxxx
07/03/12 07:06:59 (pid:813) attempt to connect to <67.215.65.132:9618> failed: Connection timed out (connect errno = 110).
07/03/12 07:06:59 (pid:813) attempt to connect to <67.215.65.132:9618> failed: timed out after 20 seconds.
07/03/12 07:06:59 (pid:813) ERROR: SECMAN:2004:Failed to create security session to <67.215.65.132:9618> with TCP.
|SECMAN:2003:TCP connection to <67.215.65.132:9618> failed.
07/03/12 07:06:59 (pid:813) Failed to start non-blocking update to <67.215.65.132:9618>.
07/03/12 07:06:59 (pid:813) ERROR: SECMAN:2004:Failed to create security session to <67.215.65.132:9618> with TCP.
|SECMAN:2003:TCP connection to <67.215.65.132:9618> failed.
07/03/12 07:06:59 (pid:813) Failed to start non-blocking update to <67.215.65.132:9618>.
07/03/12 07:07:38 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:07:38 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:07:38 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:07:38 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:08:38 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:08:38 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:08:38 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:08:38 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:09:38 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:09:38 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:09:38 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:09:38 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:09:47 (pid:813) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/03/12 07:09:47 (pid:813) Sent ad to central manager for vagrant@xxxxxxxxxxx
07/03/12 07:09:47 (pid:813) Sent ad to 1 collectors for vagrant@xxxxxxxxxxx
07/03/12 07:10:09 (pid:813) attempt to connect to <67.215.65.132:9618> failed: Connection timed out (connect errno = 110).  Will keep trying for 60 total seconds (39 to go).

07/03/12 07:10:48 (pid:813) attempt to connect to <67.215.65.132:9618> failed: Connection timed out (connect errno = 110).
07/03/12 07:10:48 (pid:813) Failed to send RESCHEDULE to negotiator cl-master.mycluster.org:
07/03/12 07:10:48 (pid:813) attempt to connect to <67.215.65.132:9618> failed: Connection timed out (connect errno = 110).
07/03/12 07:10:48 (pid:813) attempt to connect to <67.215.65.132:9618> failed: Connection timed out (connect errno = 110).
07/03/12 07:10:48 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:10:48 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:10:48 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:10:48 (pid:813) Checking consistency running and runnable jobs
07/03/12 07:10:48 (pid:813) Tables are consistent
07/03/12 07:10:48 (pid:813) Rebuilt prioritized runnable job list in 0.000s.
07/03/12 07:10:48 (pid:813) ERROR: SECMAN:2004:Failed to create security session to <67.215.65.132:9618> with TCP.
|SECMAN:2003:TCP connection to <67.215.65.132:9618> failed.
07/03/12 07:10:48 (pid:813) Failed to start non-blocking update to <67.215.65.132:9618>.
07/03/12 07:10:48 (pid:813) ERROR: SECMAN:2004:Failed to create security session to <67.215.65.132:9618> with TCP.
|SECMAN:2003:TCP connection to <67.215.65.132:9618> failed.
07/03/12 07:10:48 (pid:813) Failed to start non-blocking update to <67.215.65.132:9618>.
07/03/12 07:10:48 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:11:08 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:11:08 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:11:08 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:11:08 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:12:08 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:12:08 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:12:08 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:12:08 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 07:13:08 (pid:813) Activity on stashed negotiator socket: <172.18.0.2:45498>
07/03/12 07:13:08 (pid:813) Using negotiation protocol: NEGOTIATE
07/03/12 07:13:08 (pid:813) Negotiating for owner: vagrant@xxxxxxxxxxx
07/03/12 07:13:08 (pid:813) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/03/12 16:54:51 (pid:836) Can't open directory "/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
07/03/12 16:54:51 (pid:836) Can't open directory "/opt/condor/tmp/condor/local.master/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
07/03/12 16:54:51 (pid:836) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
07/03/12 16:54:51 (pid:836) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
07/03/12 16:54:51 (pid:836) Setting maximum accepts per cycle 4.


07/04/12 08:13:01 (pid:1936) Can't open directory "/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
07/04/12 08:13:01 (pid:1936) Can't open directory "/opt/condor/tmp/condor/local.master/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
07/04/12 08:13:01 (pid:1936) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
07/04/12 08:13:01 (pid:1936) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
07/04/12 08:13:01 (pid:1936) Setting maximum accepts per cycle 4.
07/04/12 08:13:01 (pid:1936) ******************************************************
07/04/12 08:13:01 (pid:1936) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
07/04/12 08:13:01 (pid:1936) ** /opt/condor/tmp/condor/sbin/condor_schedd
07/04/12 08:13:01 (pid:1936) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
07/04/12 08:13:01 (pid:1936) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
07/04/12 08:13:01 (pid:1936) ** $CondorVersion: 7.6.4 Oct 20 2011 BuildID: 379441 $
07/04/12 08:13:01 (pid:1936) ** $CondorPlatform: x86_64_deb_5.0 $
07/04/12 08:13:01 (pid:1936) ** PID = 1936
07/04/12 08:13:01 (pid:1936) ** Log last touched 7/4 08:13:01
07/04/12 08:13:01 (pid:1936) ******************************************************
07/04/12 08:13:01 (pid:1936) Using config source: /opt/condor/tmp/condor/etc/condor_config
07/04/12 08:13:01 (pid:1936) Using local config sources:
07/04/12 08:13:01 (pid:1936)    /opt/condor/tmp/condor/local.master/condor_config.local
07/04/12 08:13:01 (pid:1936) DaemonCore: command socket at <10.0.2.15:33007>
07/04/12 08:13:01 (pid:1936) DaemonCore: private command socket at <10.0.2.15:33007>
07/04/12 08:13:01 (pid:1936) Setting maximum accepts per cycle 4.
07/04/12 08:13:01 (pid:1936) History file rotation is enabled.
07/04/12 08:13:01 (pid:1936)   Maximum history file size is: 20971520 bytes
07/04/12 08:13:01 (pid:1936)   Number of rotated history files is: 2
07/04/12 08:13:41 (pid:1936) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/04/12 08:13:41 (pid:1936) Sent ad to central manager for vagrant@xxxxxxxxxxx
07/04/12 08:13:41 (pid:1936) Sent ad to 1 collectors for vagrant@xxxxxxxxxxx
07/04/12 08:14:01 (pid:1936) IPVERIFY: unable to resolve IP address of cl-master.mycluster.org
07/04/12 08:14:21 (pid:1936) Using negotiation protocol: NEGOTIATE
07/04/12 08:14:21 (pid:1936) Negotiating for owner: vagrant@xxxxxxxxxxx
07/04/12 08:14:21 (pid:1936) AutoCluster:config() significant attributes changed to
07/04/12 08:14:21 (pid:1936) Checking consistency running and runnable jobs
07/04/12 08:14:21 (pid:1936) Tables are consistent
07/04/12 08:14:21 (pid:1936) Rebuilt prioritized runnable job list in 0.002s.
07/04/12 08:14:21 (pid:1936) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/04/12 08:14:21 (pid:1936) Increasing flock level for vagrant to 1.
07/04/12 08:14:21 (pid:1936) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
07/04/12 08:14:41 (pid:1936) Failed to start non-blocking update to unknown.
07/04/12 08:14:41 (pid:1936) Sent ad to central manager for vagrant@xxxxxxxxxxx
07/04/12 08:14:41 (pid:1936) Sent ad to 1 collectors for vagrant@xxxxxxxxxxx
07/04/12 08:15:01 (pid:1936) Failed to start non-blocking update to unknown.
07/04/12 08:15:26 (pid:1936) Using negotiation protocol: NEGOTIATE
07/04/12 08:15:26 (pid:1936) Negotiating for owner: vagrant@xxxxxxxxxxx
07/04/12 08:15:26 (pid:1936) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,Scheduler
07/04/12 08:15:26 (pid:1936) Checking consistency running and runnable jobs
07/04/12 08:15:26 (pid:1936) Tables are consistent
07/04/12 08:15:26 (pid:1936) Rebuilt prioritized runnable job list in 0.002s.
07/04/12 08:15:26 (pid:1936) Activity on stashed negotiator socket: <172.18.0.2:38674>
07/04/12 08:15:26 (pid:1936) Using negotiation protocol: NEGOTIATE
07/04/12 08:15:26 (pid:1936) Negotiating for owner: vagrant@xxxxxxxxxxx
07/04/12 08:15:26 (pid:1936) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/04/12 08:15:26 (pid:1936) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/04/12 08:15:45 (pid:1936) Using negotiation protocol: NEGOTIATE
07/04/12 08:15:45 (pid:1936) Negotiating for owner: vagrant@xxxxxxxxxxx
07/04/12 08:15:45 (pid:1936) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/04/12 08:16:27 (pid:1936) Activity on stashed negotiator socket: <172.18.0.2:40374>
07/04/12 08:16:27 (pid:1936) Using negotiation protocol: NEGOTIATE
07/04/12 08:16:27 (pid:1936) Negotiating for owner: vagrant@xxxxxxxxxxx
07/04/12 08:16:27 (pid:1936) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
07/04/12 08:16:27 (pid:1936) Activity on stashed negotiator socket: <172.18.0.2:38674>
07/04/12 08:16:27 (pid:1936) Using negotiation protocol: NEGOTIATE
07/04/12 08:16:27 (pid:1936) Negotiating for owner: vagrant@xxxxxxxxxxx
07/04/12 08:16:27 (pid:1936) Finished negotiating for vagrant in local pool: 0 matched, 2 rejected
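
In the log above, the schedd tries to reach cl-master.mycluster.org at 67.215.65.132 rather than at the 178.12.100.2 address given for cluster B, and in the later run it reports that it cannot resolve the name at all. That suggests checking name resolution on the submitting machine as well as the firewall. A minimal sketch in Python follows; the function name and the injectable resolver parameter are assumptions made for illustration, and it relies only on the standard socket module.

```python
import socket

def check_resolution(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return (resolved_ip_or_None, matches_expected) for a host name."""
    try:
        resolved = resolver(hostname)
    except socket.gaierror:
        # The name could not be resolved at all.
        return None, False
    # The name resolved; report whether it is the address we expected.
    return resolved, resolved == expected_ip
```

Calling check_resolution("cl-master.mycluster.org", "178.12.100.2") on the schedd host would show whether the name resolves there and, if so, whether it points at the intended central manager.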


--
"Nullius addictus jurare in verba magistri"
