This may be more of a question for an Amazon AWS forum, but I’d like to see if there’s a way of reconfiguring Condor to work around the problem, or if anyone has experienced anything similar.
I have had a few Amazon EC2 instances running Condor on Scientific Linux 6.3 quite successfully, but we now want to run them in an Amazon Virtual Private Cloud (VPC). The problem seems to be that instances running in a VPC do not have access to internal DNS in the same way that regular EC2 instances do; i.e. nothing is contactable via hostname, not even the local machine:
[root@ip-10-0-14-137 ~]# hostname
[root@ip-10-0-14-137 ~]# ping ip-10-0-14-137
ping: unknown host ip-10-0-14-137
[root@ip-10-0-14-137 ~]# nslookup ip-10-0-14-137
** server can't find ip-10-0-14-137: NXDOMAIN
[root@ip-10-0-14-137 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 0E:6D:51:79:E6:48
inet addr:10.0.14.137 Bcast:10.0.15.255 Mask:255.255.240.0
[root@ip-10-0-14-137 ~]# ping 10.0.14.137
PING 10.0.14.137 (10.0.14.137) 56(84) bytes of data.
64 bytes from 10.0.14.137: icmp_seq=1 ttl=64 time=0.024 ms
The effect seems to be that Condor doesn’t pick up a machine’s hostname when it can’t resolve the IP address back to that hostname:
[root@ip-10-0-5-109 run2]# condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
slot1@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:00:04
slot2@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:00:05
slot3@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:00:06
slot4@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:00:07
slot5@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:01:26
slot6@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:01:27
slot7@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:02:28
slot8@ LINUX X86_64 Unclaimed Idle 0.000 3761 0+00:02:21
slot1@ip-10-0-5-10 LINUX X86_64 Claimed Busy 1.000 3735 0+00:01:18
slot2@ip-10-0-5-10 LINUX X86_64 Claimed Busy 0.990 3735 0+00:01:28
slot3@ip-10-0-5-10 LINUX X86_64 Claimed Busy 1.060 3735 0+00:00:55
slot4@ip-10-0-5-10 LINUX X86_64 Claimed Busy 0.960 3735 0+00:00:54
You can see from the last 4 slots, which are running on the submit node, that the hostname does show up there. That is because I have added an entry to /etc/hosts on that node, pointing the hostname at the correct IP.
We could obviously do the same for each of the worker nodes, but this isn’t going to be practical when running many spot instances.
I’m assuming this is also what’s causing problems with running these jobs on the worker nodes currently.
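For reference, here is a rough sketch of how the /etc/hosts workaround might be automated, so that each instance adds its own entry at boot instead of being edited by hand. It assumes that hostnames always follow the ip-A-B-C-D pattern seen in the prompts above. The function names hosts_line and add_hosts_entry are made up for illustration, and I have not tested this on spot instances.

```shell
#!/bin/sh
# Sketch only: build and register an /etc/hosts line for a VPC instance.
# Assumption: the hostname is "ip-" followed by the private IP with dots
# replaced by dashes (e.g. 10.0.14.137 -> ip-10-0-14-137).

# hosts_line IP
# Prints "IP ip-A-B-C-D" for the given private IP address.
hosts_line() {
    ip="$1"
    name="ip-$(printf '%s' "$ip" | tr '.' '-')"
    printf '%s %s\n' "$ip" "$name"
}

# add_hosts_entry IP FILE
# Appends the line to FILE unless FILE already lists that hostname.
add_hosts_entry() {
    line="$(hosts_line "$1")"
    file="$2"
    name="${line#* }"
    # Look for the hostname as an exact field after the address column.
    if ! awk -v n="$name" '
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Run as root from a boot script, this would be called as `add_hosts_entry 10.0.14.137 /etc/hosts`, with the address taken from wherever is convenient (for example the local-ipv4 value in the EC2 instance metadata). It only fixes resolution of the instance's own name; other nodes would still be unable to resolve it.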