
Re: [Condor-users] Setting up condor on ec2 machines



Hi,

Matt, thanks to your post
(http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/),
it worked just fine with 7.6.3.

My main problem now is getting rid of:
ALLOW_WRITE = $(ALLOW_WRITE), 10.112.45.248, ip-10-112-45-248.ec2.internal
for every node in my pool.

I guess that
ALLOW_WRITE = $(ALLOW_WRITE), '*.ec2.internal'

should work, but then I am opening write access to any node on EC2.
Is there a way to accept only authenticated connections? What is the
impact on performance?
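For concreteness, here is a sketch of what I have in mind, based on the security section of the manual (untested on my pool; the password file path is just an example, and I am not sure this exact combination of knobs is right):

```
# Sketch: require pool-password authentication instead of trusting hostnames
SEC_DEFAULT_AUTHENTICATION         = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = PASSWORD
SEC_PASSWORD_FILE                  = /etc/condor/pool_password

# With authentication in place, ALLOW_WRITE could match the authenticated
# identity rather than '*.ec2.internal':
ALLOW_WRITE = condor_pool@*
```

My understanding is that the authentication cost is paid per security session, not per command, so the steady-state performance impact should be small, but I would like confirmation.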

Then I might need to move my manager inside my organization; I guess
this can be taken care of by using the condor_shared_port daemon.
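If I read the manual correctly, enabling it would look roughly like this (a sketch, not something I have tried yet):

```
# Sketch: funnel all inbound daemon traffic through one port, so only
# 9618 needs to be opened in the firewall
USE_SHARED_PORT  = TRUE
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST      = $(DAEMON_LIST) SHARED_PORT
```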

The other problem I foresee is that condor is using internal names:
I had to add ip-10-112-45-248.ec2.internal to ALLOW_WRITE, which the
manager will no longer be able to resolve once it sits outside EC2.
Can I force condor to use the public interface and FQDN?
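These are the knobs I was planning to experiment with on the master (values are from my master node; I have not verified that NETWORK_HOSTNAME behaves this way on 7.6.3, so treat this as a guess):

```
# Sketch: make the master advertise its public identity instead of the
# EC2-internal one
NETWORK_INTERFACE   = 10.116.39.128
TCP_FORWARDING_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
# NETWORK_HOSTNAME is supposed to override what Condor thinks its FQDN is;
# unverified on my setup
NETWORK_HOSTNAME    = ec2-75-101-194-52.compute-1.amazonaws.com
```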

Regards
Guillaume



On Thu, Aug 25, 2011 at 7:20 PM, Timothy St. Clair <tstclair@xxxxxxxxxx> wrote:
> Just as a side note, I'm in the process of creating a full outline of
> how to configure + setup Fedora to allow spill-over to the cloud, using
> the JobRouter.
>
> I'll ping back this thread when I'm done and have something functioning.
> I'll also make the AMI's publicly available.
>
> Cheers,
> Tim
>
> On Thu, 2011-08-25 at 19:14 +0530, tog wrote:
>> Hi Matt
>>
>> Thanks for the quick answer ...
>> From what I read I should move to the latest 7.6.3 ;-)
>>
>> Anyway, I believe that until I move one of the roles outside of
>> EC2, I should not face firewall problems. I think all ports are open
>> between machines belonging to the same EC2 reservation.
>>
>> Here is the port info:
>>
>> ubuntu@ec2-75-101-194-52:~$ sudo netstat -tlnp
>> Active Internet connections (only servers)
>> Proto Recv-Q Send-Q Local Address    Foreign Address  State   PID/Program name
>> tcp    0      0     0.0.0.0:22       0.0.0.0:*        LISTEN  472/sshd
>> tcp    0      0     0.0.0.0:40002    0.0.0.0:*        LISTEN  3274/condor_negotia
>> tcp    0      0     0.0.0.0:40034    0.0.0.0:*        LISTEN  3258/condor_startd
>> tcp    0      0     0.0.0.0:40009    0.0.0.0:*        LISTEN  3259/condor_schedd
>> tcp    0      0     0.0.0.0:40014    0.0.0.0:*        LISTEN  3257/condor_master
>> tcp    0      0     0.0.0.0:9618     0.0.0.0:*        LISTEN  3292/condor_collect
>> tcp    0      0     0.0.0.0:40019    0.0.0.0:*        LISTEN  3259/condor_schedd
>> tcp6   0      0     :::22            :::*             LISTEN  472/sshd
>> tcp6   0      0     :::1527          :::*             LISTEN  3024/java
>>
>> Will keep you updated once I have 7.6.3 installed ... I might be
>> able to use condor_configure.
>>
>> Best Regards
>> Guillaume
>>
>>
>>
>> On Thu, Aug 25, 2011 at 6:26 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
>> > (inline)
>> >
>> > On 08/25/2011 08:26 AM, tog wrote:
>> >>
>> >> Hi
>> >>
>> >> I am trying to set up a full installation on EC2; by "full" I mean
>> >> that all processes will be internal to the cloud (while keeping the
>> >> possibility of having the manager internal to my organization).
>> >> Therefore I decided to set the hostnames of all machines to their
>> >> public DNS names (which is not the default on EC2).
>> >
>> > CM+Sched -(firewall)-> {dragons} -> {EC2: Startd} ?
>> >
>> > That's a pretty popular configuration, it seems. The Cloud Foundations
>> > architecture documents have a walk-through of it, but since you are
>> > using Ubuntu you may not have access to them.
>> >
>> >
>> >> I have 2 machines running Ubuntu Lucid with condor coming from the
>> >> distribution itself (7.2.4)
>> >>
>> >> Master  ec2-75-101-194-52.compute-1.amazonaws.com      75.101.194.52
>> >> 10.116.39.128
>> >> Worker  ec2-50-16-126-254.compute-1.amazonaws.com      50.16.126.254
>> >> 10.112.45.248
>> >
>> > The first problem is possibly that 7.2.4 is painfully old, with many known
>> > bugs at this point. Newer versions include the shared_port daemon, which
>> > helps significantly with firewall configuration.
>> >
>> > http://spinningmatt.wordpress.com/2011/06/21/getting-started-multiple-node-condor-pool-with-firewalls/
>> >
>> >
>> >> I have the following error messages that I cannot explain:
>> >>
>> >> MasterLog
>> >>
>> >> 8/25 11:17:36 DaemonCore: Command Socket at<75.101.194.52:40014>
>> >> 8/25 11:17:36 Failed to listen(9618) on TCP command socket.
>> >> 8/25 11:17:36 ERROR: Create_Process failed trying to start
>> >> /usr/sbin/condor_collector
>> >> 8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds
>> >
>> > Failed to listen on 9618, what's on that port (netstat -tl)?
>> >
>> >
>> >> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd",
>> >> pid and pgroup = 3258
>> >> 8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd",
>> >> pid and pgroup = 3259
>> >> 8/25 11:17:36 Started DaemonCore process
>> >> "/usr/sbin/condor_negotiator", pid and pgroup = 3274
>> >> 8/25 11:17:36 condor_write(): Socket closed when trying to write 973
>> >> bytes to<10.116.39.128:9618>, fd is 8
>> >> 8/25 11:17:36 Buf::write(): condor_write() failed
>> >> 8/25 11:17:36 Failed to send non-blocking update to<10.116.39.128:9618>.
>> >> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
>> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>> >> reason: DAEMON authorization policy contains no matching ALLOW entry
>> >> for this request;
>> >> identifiers used for this host:
>> >> 10.116.39.128,ip-10-116-39-128.ec2.internal
>> >> 8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host
>> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>> >> reason: cached result for DAEMON; see first case for the full reason
>> >> 8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host
>> >> 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON:
>> >> reason: cached result for DAEMON; see first case for the full reason
>> >> 8/25 11:17:46 Failed to listen(9618) on TCP command socket.
>> >> 8/25 11:17:46 ERROR: Create_Process failed trying to start
>> >> /usr/sbin/condor_collector
>> >> 8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
>> >> 8/25 11:17:46 condor_write(): Socket closed when trying to write 1057
>> >> bytes to unknown source, fd is 8, errno=104
>> >> 8/25 11:17:46 Buf::write(): condor_write() failed
>> >
>> > CHILDALIVE is failing because the node is not configured to talk to itself
>> > at the DAEMON access level; you'll need something like
>> > ALLOW_DAEMON = ..., $(IP_ADDRESS), $(FULL_HOSTNAME)
>> >
>> > This isn't fatal, but it means that the master cannot watch for hung
>> > daemons.
>> >
>> > http://spinningmatt.wordpress.com/2009/10/21/condor_master-for-managing-processes/
>> >
>> >
>> >> SchedLog
>> >>
>> >> 8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to
>> >> write 218 bytes to unknown source, fd is 12, errno=104
>> >> 8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
>> >> 8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
>> >> 8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505
>> >> EXITING WITH STATUS 0
>> >
>> > This could be a number of things. Best to leave it until you have the other
>> > issues resolved.
>> >
>> >
>> >> My configuration files are the following:
>> >>
>> >> Master node
>> >>
>> >> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
>> >> TCP_FORWARDING_HOST=75.101.194.52
>> >> PRIVATE_NETWORK_INTERFACE=10.116.39.128
>> >> UPDATE_COLLECTOR_WITH_TCP=True
>> >> HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
>> >> HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
>> >> LOWPORT=40000
>> >> HIGHPORT=40050
>> >> COLLECTOR_SOCKET_CACHE_SIZE=1000
>> >>
>> >> '*.internal' matches my internal hostnames
>> >>
>> >> Slave node
>> >>
>> >> ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
>> >> COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
>> >> PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
>> >> TCP_FORWARDING_HOST = 50.16.126.154
>> >> PRIVATE_NETWORK_INTERFACE = 10.112.45.248
>> >> COUNT_HYPERTHREAD_CPUS = False
>> >> DAEMON_LIST = MASTER, STARTD
>> >> UPDATE_COLLECTOR_WITH_TCP = True
>> >> #COUNT_HYPERTHREAD_CPUS = False
>> >> #UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
>> >> #MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
>> >> LOWPORT=40000
>> >> HIGHPORT=40050
>> >> DAEMON_LIST = MASTER, STARTD
>> >>
>> >> Thanks
>> >> Guillaume
>> >
>> > Along with getting away from Condor 7.2, you should get away from
>> > HOSTALLOW_* and avoid mixing HOSTALLOW_* with ALLOW_*.
>> >
>> > 7.2 has bugs in TCP_FORWARDING_HOST handling, some discovered while
>> > creating the Cloud Foundations reference architecture.
>> >
>> > Best,
>> >
>> >
>> > matt
>> >
>>
>>
>>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>



-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net