
[Condor-users] Setting up condor on ec2 machines



Hi,

I am trying to set up a full Condor installation on EC2. By "full" I
mean that all processes will run inside the cloud, although I would
like to keep the possibility of running the manager inside my
organization.
I therefore decided to set the hostnames of all machines to their
public DNS names (which is not the default on EC2).

I have two machines running Ubuntu Lucid with the Condor package that
ships with the distribution itself (7.2.4).

Role    Public DNS name                               Public IP      Private IP
Master  ec2-75-101-194-52.compute-1.amazonaws.com     75.101.194.52  10.116.39.128
Worker  ec2-50-16-126-254.compute-1.amazonaws.com     50.16.126.254  10.112.45.248

I have the following error messages that I cannot explain:

MasterLog

8/25 11:17:36 DaemonCore: Command Socket at <75.101.194.52:40014>
8/25 11:17:36 Failed to listen(9618) on TCP command socket.
8/25 11:17:36 ERROR: Create_Process failed trying to start /usr/sbin/condor_collector
8/25 11:17:36 restarting /usr/sbin/condor_collector in 10 seconds
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 3258
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 3259
8/25 11:17:36 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 3274
8/25 11:17:36 condor_write(): Socket closed when trying to write 973 bytes to <10.116.39.128:9618>, fd is 8
8/25 11:17:36 Buf::write(): condor_write() failed
8/25 11:17:36 Failed to send non-blocking update to <10.116.39.128:9618>.
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.116.39.128,ip-10-116-39-128.ec2.internal
8/25 11:17:36 PERMISSION DENIED to unauthenticated user from host 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
8/25 11:17:41 PERMISSION DENIED to unauthenticated user from host 10.116.39.128 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
8/25 11:17:46 Failed to listen(9618) on TCP command socket.
8/25 11:17:46 ERROR: Create_Process failed trying to start /usr/sbin/condor_collector
8/25 11:17:46 restarting /usr/sbin/condor_collector in 120 seconds
8/25 11:17:46 condor_write(): Socket closed when trying to write 1057 bytes to unknown source, fd is 8, errno=104
8/25 11:17:46 Buf::write(): condor_write() failed

SchedLog

8/25 11:18:06 (pid:1505) condor_write(): Socket closed when trying to write 218 bytes to unknown source, fd is 12, errno=104
8/25 11:18:06 (pid:1505) Buf::write(): condor_write() failed
8/25 11:18:06 (pid:1505) All shadows are gone, exiting.
8/25 11:18:06 (pid:1505) **** condor_schedd (condor_SCHEDD) pid 1505 EXITING WITH STATUS 0
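
The "Failed to listen(9618)" lines in the MasterLog would be consistent
with something else already holding port 9618 on the master, though the
logs alone do not prove that. A minimal Python sketch for checking
whether the port can be bound follows. The helper name can_bind is
hypothetical, and it only tests a plain TCP bind/listen, not anything
Condor-specific.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP listening socket can be opened on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            sock.listen(1)
        except OSError:
            # Typically EADDRINUSE (port taken) or EACCES (not permitted).
            return False
    return True


if __name__ == "__main__":
    # 9618 is the collector port that appears in the MasterLog.
    print("port 9618 can be bound:", can_bind("0.0.0.0", 9618))
```

This would be run on the master while Condor is stopped and again while
it is running. A False result with Condor stopped would suggest that
another process is holding the port.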


My configuration files are as follows:

Master node

PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST=75.101.194.52
PRIVATE_NETWORK_INTERFACE=10.116.39.128
UPDATE_COLLECTOR_WITH_TCP=True
HOSTALLOW_WRITE=$(ALLOW_WRITE), '*.internal'
HOSTALLOW_READ=$(ALLOW_READ),'*.internal'
LOWPORT=40000
HIGHPORT=40050
COLLECTOR_SOCKET_CACHE_SIZE=1000

'*.internal' matches my internal hostnames.

Worker node

ubuntu@ec2-50-16-126-254:~$ sudo more /etc/condor_conf*
COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST = 50.16.126.154
PRIVATE_NETWORK_INTERFACE = 10.112.45.248
COUNT_HYPERTHREAD_CPUS = False
DAEMON_LIST = MASTER, STARTD
UPDATE_COLLECTOR_WITH_TCP = True
#COUNT_HYPERTHREAD_CPUS = False
#UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
#MASTER_UPDATE_INTERVAL = $RANDOM_INTEGER(230, 370)
LOWPORT=40000
HIGHPORT=40050
DAEMON_LIST = MASTER, STARTD
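
Since each node's settings have to agree with the address table near the
top of this message, a small cross-check can help. The Python sketch
below is only an illustration: parse_config and check_addresses are
hypothetical helpers, and the parser handles plain NAME = VALUE lines
only, not Condor's macro expansion or line continuations.

```python
def parse_config(text: str) -> dict:
    """Parse simple NAME = VALUE lines; a later assignment overrides an earlier one."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def check_addresses(settings: dict, public_ip: str, private_ip: str) -> list:
    """Return a list of messages for address settings that differ from the table."""
    expected = {
        "TCP_FORWARDING_HOST": public_ip,
        "PRIVATE_NETWORK_INTERFACE": private_ip,
    }
    problems = []
    for name, want in expected.items():
        got = settings.get(name)
        if got != want:
            problems.append(f"{name} is {got!r}, expected {want!r}")
    return problems


# Address-related lines copied from the worker configuration above.
WORKER_CONFIG = """
COLLECTOR_HOST = ec2-75-101-194-52.compute-1.amazonaws.com
PRIVATE_NETWORK_NAME=amazon-ec2-us-east-1d
TCP_FORWARDING_HOST = 50.16.126.154
PRIVATE_NETWORK_INTERFACE = 10.112.45.248
"""

if __name__ == "__main__":
    worker = parse_config(WORKER_CONFIG)
    for problem in check_addresses(worker, "50.16.126.254", "10.112.45.248"):
        print(problem)
```

With the values exactly as pasted here, the check flags
TCP_FORWARDING_HOST on the worker (50.16.126.154 in the file versus
50.16.126.254 in the table).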

Thanks
Guillaume




-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net