
[Condor-users] Name resolution anomaly



Hello,

I seem to have found a name resolution anomaly while testing
parallel universe jobs in Condor 6.7.12.

Our test setup consists of one front-end machine and two computational
nodes. The front-end server has two network interfaces and runs the NFS
service. It does not do computations. The computational nodes run only
the master and startd. All are CERN SL 3 hosts.

The server has two interfaces: external (turing.xxx) and internal
(turing.tud.xxx). Forward DNS resolution works for both; reverse
resolution works only for the external interface. All the name
resolution info is also in the hosts file:

127.0.0.1               localhost.localdomain localhost
193.40.111.222          turing.xxx turing
192.168.19.253          turing.tud.xxx turing

192.168.16.1            p4-1.tud.xxx p4-1
192.168.16.2            p4-2.tud.xxx p4-2

Condor should use only the internal interface, so the configuration has:
NETWORK_INTERFACE = 192.168.19.253
FULL_HOSTNAME = turing.tud.xxx

The problem occurred when submitting parallel jobs. They went into the
queue but never ran. I saw lines with the wrong host name in the logs
(job_queue.log):

103 21.0 Scheduler "DedicatedScheduler@xxxxxxxxxx"

but my configuration files contain
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxx"
which is right, and I didn't want to change it.

Changing the host's hostname didn't help, and neither did changing the
HOSTNAME environment variable of the root shell that starts
condor_master. After a lot of experimenting I found that the order of
records in the hosts file matters. Putting the turing.tud.xxx record
before the turing.xxx record solved the problem. It seems that the
conversion from the short hostname to the FQDN is done using the
/etc/hosts file, and the first matching record wins. The output of the
hostname (and hostname -f) command, however, did not depend on this
order.
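
To illustrate the ordering effect described above, here is a minimal
Python sketch of a first-match lookup over hosts-file text. It is only a
simplified model of how a short name can map to different FQDNs
depending on record order; it is not Condor's actual resolution code,
and the function name first_match is made up for this example.

```python
# Simplified model of a first-match lookup in a hosts file.
# Assumption: the resolver returns the first record that lists the name,
# and treats the first name on that record as the canonical name.

HOSTS_ORIGINAL = """\
127.0.0.1               localhost.localdomain localhost
193.40.111.222          turing.xxx turing
192.168.19.253          turing.tud.xxx turing
"""

HOSTS_REORDERED = """\
127.0.0.1               localhost.localdomain localhost
192.168.19.253          turing.tud.xxx turing
193.40.111.222          turing.xxx turing
"""


def first_match(hosts_text, name):
    """Return (address, canonical_name) of the first record listing name."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if name in names:
            return address, names[0]
    return None


print(first_match(HOSTS_ORIGINAL, "turing"))
print(first_match(HOSTS_REORDERED, "turing"))
```

With the original order the short name "turing" maps to the external
record; with the records swapped it maps to turing.tud.xxx, which
matches what I observed.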

Later I saw that condor_q uses the same strange resolution when
printing its results (the IP is right, but the name should be
turing.tud.xxx):

[root@turing condor]# condor_q
-- Submitter: turing.xxx : <192.168.19.253:34868> : turing.xxx
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD


I think that name resolution does not work in the most reasonable way
when there are two interfaces with the same short name and local (NIS)
name resolution.


Best regards,
Marko Kääramees