[Condor-users] Name resolution anomaly
- Date: Fri, 21 Oct 2005 23:41:25 +0300
- From: Marko Kääramees <marko.kaaramees@xxxxxx>
- Subject: [Condor-users] Name resolution anomaly
Hello,
It seems that I have found a name resolution anomaly while testing
parallel universe jobs in Condor 6.7.12.
Our test setup includes one front-end machine and two computational
nodes. The front-end server has two network interfaces and runs the NFS
service. It does not do computations. The computational nodes have only
master and startd running. All are CERN SL 3 hosts.
The server has two interfaces: external (turing.xxx) and internal
(turing.tud.xxx). Forward DNS resolution works for both; reverse
resolution works only for the external interface. All the name
resolution info is also in the hosts file:
127.0.0.1 localhost.localdomain localhost
193.40.111.222 turing.xxx turing
192.168.19.253 turing.tud.xxx turing
192.168.16.1 p4-1.tud.xxx p4-1
192.168.16.2 p4-2.tud.xxx p4-2
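The forward/reverse situation described above can be checked with a few
lines of Python. This is only a minimal sketch: check_resolution is a
hypothetical helper, not part of Condor, and it relies on the system
resolver, so the result reflects whatever the host consults (hosts file,
DNS, NIS) in its configured order.

```python
import socket


def check_resolution(name,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve name to an address and back; report whether the names agree.

    forward and reverse default to the system resolver and can be replaced
    by stand-ins for testing (an assumption made for this sketch).
    """
    try:
        address = forward(name)
    except OSError:
        # Forward lookup failed (socket.gaierror is a subclass of OSError).
        return {"name": name, "address": None,
                "reverse": None, "consistent": False}
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        reverse_name = reverse(address)[0]
    except OSError:
        # No reverse record (socket.herror is a subclass of OSError).
        reverse_name = None
    return {"name": name, "address": address,
            "reverse": reverse_name, "consistent": reverse_name == name}


# Example call on the front-end host (names as written in this message):
# print(check_resolution("turing.tud.xxx"))
```

A result with "consistent" set to False for the internal name would
match the setup described here, where reverse resolution works only for
the external interface.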
Condor should work only on the internal interface.
NETWORK_INTERFACE = 192.168.19.253
FULL_HOSTNAME = turing.tud.xxx
The problem occurred when submitting parallel jobs. They went into the
queue, but never ran. I saw lines with the wrong host name in the logs
(job_queue.log):
103 21.0 Scheduler "DedicatedScheduler@xxxxxxxxxx"
but I had in my configuration files
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxx"
which is right and I didn't want to change it.
Changing the host's hostname didn't help, and changing the HOSTNAME
environment variable of the root shell that starts condor_master didn't
help either. After a lot of experimenting I found that the order of
records in the hosts file matters. Writing the turing.tud.xxx record
before the turing.xxx record solved the problem. It seems that the
conversion from short hostname to FQDN is done using the /etc/hosts
file. At the same time, the output of the hostname (and hostname -f)
command did not depend on this order.
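The effect of the record order can be illustrated with a small Python
sketch. It assumes the short name is expanded by scanning the hosts file
top to bottom and taking the first record that lists it; this is a model
of the observed behaviour, not Condor's actual code, and first_fqdn_for
is a hypothetical helper.

```python
def first_fqdn_for(short_name, hosts_text):
    """Return the canonical name of the first hosts record listing short_name.

    Each record is "address canonical_name [aliases...]". Returns None if
    no record matches.
    """
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        canonical, aliases = fields[1], fields[2:]
        if short_name == canonical or short_name in aliases:
            return canonical  # first match wins, so record order matters
    return None


# Original order: the external record comes first.
EXTERNAL_FIRST = """\
127.0.0.1 localhost.localdomain localhost
193.40.111.222 turing.xxx turing
192.168.19.253 turing.tud.xxx turing
"""

# Reordered: the internal record comes first.
INTERNAL_FIRST = """\
127.0.0.1 localhost.localdomain localhost
192.168.19.253 turing.tud.xxx turing
193.40.111.222 turing.xxx turing
"""

print(first_fqdn_for("turing", EXTERNAL_FIRST))  # turing.xxx
print(first_fqdn_for("turing", INTERNAL_FIRST))  # turing.tud.xxx
```

Under this first-match assumption, both records claim the short name
turing, so whichever one appears first decides the FQDN.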
Later I saw that condor_q uses the same strange resolution when
printing its results (the IP is right, but the name should be
turing.tud.xxx):
[root@turing condor]# condor_q
-- Submitter: turing.xxx : <192.168.19.253:34868> : turing.xxx
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
I think that the name resolution does not behave in the most reasonable
way when there are two interfaces with the same short name and local
(NIS) name resolution is in use.
Best regards,
Marko Kääramees