[HTCondor-users] Problem with Internal DNS on Amazon AWS VPC

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

This may be more of a question for an Amazon AWS forum, but I’d like to see if there’s a way of reconfiguring Condor to work around the problem, or if anyone has experienced anything similar.

I have had a few Amazon EC2 instances running Condor on Scientific Linux 6.3 quite successfully, but we now want to run them in an Amazon Virtual Private Cloud (VPC). The problem seems to be that instances running in a VPC do not have access to internal DNS in the way that regular EC2 instances do, i.e. nothing is contactable by hostname, not even the local machine:

[root@ip-10-0-14-137 ~]# hostname
ip-10-0-14-137
[root@ip-10-0-14-137 ~]# ping ip-10-0-14-137
ping: unknown host ip-10-0-14-137
[root@ip-10-0-14-137 ~]# nslookup ip-10-0-14-137
Server:         10.0.0.2
Address:        10.0.0.2#53
** server can't find ip-10-0-14-137: NXDOMAIN
[root@ip-10-0-14-137 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 0E:6D:51:79:E6:48
          inet addr:10.0.14.137  Bcast:10.0.15.255  Mask:255.255.240.0
……
[root@ip-10-0-14-137 ~]# ping 10.0.14.137
PING 10.0.14.137 (10.0.14.137) 56(84) bytes of data.
64 bytes from 10.0.14.137: icmp_seq=1 ttl=64 time=0.024 ms
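The same forward and reverse lookups can be checked from a script, which may be handy for testing many instances. This is only a minimal sketch: the function name `check_resolution` and its injectable `forward`/`reverse` parameters are illustrative assumptions, not part of Condor or any AWS tool.

```python
import socket


def check_resolution(hostname,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Report whether `hostname` resolves to an IP and back to a name.

    `forward` and `reverse` default to the standard resolver calls; they are
    parameters only so the check can be exercised without real DNS.
    """
    result = {"hostname": hostname, "forward_ip": None, "reverse_name": None}
    try:
        # Forward lookup: name -> IPv4 address (fails with NXDOMAIN in the VPC case).
        result["forward_ip"] = forward(hostname)
    except OSError:
        return result
    try:
        # Reverse lookup: address -> primary name.
        result["reverse_name"] = reverse(result["forward_ip"])[0]
    except OSError:
        pass
    return result


# Example use on an instance:
#   print(check_resolution(socket.gethostname()))
```

On a node showing the behaviour above, `forward_ip` would be expected to come back as `None` for the local hostname.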

The effect seems to be that Condor cannot determine a machine's hostname when the IP address does not resolve back to it, so the slot names are missing the host part:

[root@ip-10-0-5-109 run2]# condor_status
Name               OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime

slot1@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:00:04
slot2@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:00:05
slot3@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:00:06
slot4@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:00:07
slot5@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:01:26
slot6@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:01:27
slot7@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:02:28
slot8@             LINUX  X86_64  Unclaimed  Idle      0.000   3761  0+00:02:21
slot1@ip-10-0-5-10 LINUX  X86_64  Claimed    Busy      1.000   3735  0+00:01:18
slot2@ip-10-0-5-10 LINUX  X86_64  Claimed    Busy      0.990   3735  0+00:01:28
slot3@ip-10-0-5-10 LINUX  X86_64  Claimed    Busy      1.060   3735  0+00:00:55
slot4@ip-10-0-5-10 LINUX  X86_64  Claimed    Busy      0.960   3735  0+00:00:54

As you can see from the last four slots, which are running on the submit node, I have added an entry to /etc/hosts on that node mapping its hostname to the correct IP address.

We could obviously do the same for each of the worker nodes, but this isn’t going to be practical when running many spot instances.
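For illustration, the per-node /etc/hosts entry could in principle be generated rather than typed by hand. The sketch below assumes every instance follows the `ip-a-b-c-d` naming pattern seen in the output above; the function name `hosts_entry` is hypothetical, and actually appending the line to /etc/hosts (which needs root) is left out.

```python
import ipaddress


def hosts_entry(ip):
    """Build an /etc/hosts line for a private IPv4 address.

    Assumes the hostname is the address with dots replaced by dashes and an
    "ip-" prefix, as in the examples in this post.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    name = "ip-" + str(addr).replace(".", "-")
    return f"{addr}\t{name}"


# Example: one line per node that needs to be resolvable.
for node_ip in ["10.0.14.137", "10.0.5.109"]:
    print(hosts_entry(node_ip))
```

Run at boot on each instance for its own address, something like this would avoid maintaining the entries manually, though it does not make the nodes resolvable to one another unless each node also has entries for its peers.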

I’m assuming this is also what is currently causing problems with running jobs on the worker nodes.

Any thoughts?

Thanks,

Giles
