
Re: [HTCondor-users] HTCondor 9.1



Hi John,

I have managed to get it all working now and will document what I've done shortly, but you are correct: our /etc/hosts files were incorrect for condor's heuristic method of determining hostname and domain names. That was the root cause of both problems. I found this article, which pointed me along your line of thinking and helped me correct our error: https://spinningmatt.wordpress.com/2010/07/28/how-condor-determines-a-nodes-ip-and-hostname/.

To try to answer your question: it appears condor recognizes both the loopback and the external IP, and that is why it was able to communicate (see below).

Cheers, and thanks again!

Lyle

--- output of condor heuristic; note this is using a good /etc/hosts
lyle@tuna:~$ env _CONDOR_TOOL_DEBUG=D_HOSTNAME condor_config_val -debug FULL_HOSTNAME
08/03/21 05:37:52 NETWORK_INTERFACE=* matches lo 127.0.0.1, enp0s31f6 192.168.7.144, docker_gwbridge 172.18.0.1, docker0 172.17.0.1, lo ::1, enp0s31f6 fe80::c8f7:3616:850e:e934, docker_gwbridge fe80::42:51ff:fea1:e70, docker0 fe80::42:9ff:fec0:5dff, veth208f15f fe80::68b3:40ff:fefe:dd88, vetha4a8c59 fe80::e:80ff:fef7:62bd, vethcf3513d fe80::7c8f:7bff:fe23:4af2, choosing IP 192.168.7.144
08/03/21 05:37:52 DNS returned:
08/03/21 05:37:52 127.0.1.1
08/03/21 05:37:52 192.168.7.144
08/03/21 05:37:52 We returned:
08/03/21 05:37:52 127.0.1.1
08/03/21 05:37:52 192.168.7.144
08/03/21 05:37:52 hostname: tuna.ocwen.com
08/03/21 05:37:52 I am: hostname: tuna, fully qualified doman name: tuna.ocwen.com, IP: 192.168.7.144, IPv4: 192.168.7.144, IPv6:
08/03/21 05:37:52 Trying to getting network interface information after reading config
08/03/21 05:37:52 NETWORK_INTERFACE=* matches lo 127.0.0.1, enp0s31f6 192.168.7.144, docker_gwbridge 172.18.0.1, docker0 172.17.0.1, lo ::1, enp0s31f6 fe80::c8f7:3616:850e:e934, docker_gwbridge fe80::42:51ff:fea1:e70, docker0 fe80::42:9ff:fec0:5dff, veth208f15f fe80::68b3:40ff:fefe:dd88, vetha4a8c59 fe80::e:80ff:fef7:62bd, vethcf3513d fe80::7c8f:7bff:fe23:4af2, choosing IP 192.168.7.144
08/03/21 05:37:52 NETWORK_INTERFACE=* matches lo 127.0.0.1, enp0s31f6 192.168.7.144, docker_gwbridge 172.18.0.1, docker0 172.17.0.1, lo ::1, enp0s31f6 fe80::c8f7:3616:850e:e934, docker_gwbridge fe80::42:51ff:fea1:e70, docker0 fe80::42:9ff:fec0:5dff, veth208f15f fe80::68b3:40ff:fefe:dd88, vetha4a8c59 fe80::e:80ff:fef7:62bd, vethcf3513d fe80::7c8f:7bff:fe23:4af2, choosing IP 192.168.7.144
08/03/21 05:37:52 DNS returned:
08/03/21 05:37:52 127.0.1.1
08/03/21 05:37:52 192.168.7.144
08/03/21 05:37:52 We returned:
08/03/21 05:37:52 127.0.1.1
08/03/21 05:37:52 192.168.7.144
08/03/21 05:37:52 hostname: tuna.ocwen.com
08/03/21 05:37:52 I am: hostname: tuna, fully qualified doman name: tuna.ocwen.com, IP: 192.168.7.144, IPv4: 192.168.7.144, IPv6:
08/03/21 05:37:52 NETWORK_INTERFACE=* matches lo 127.0.0.1, enp0s31f6 192.168.7.144, docker_gwbridge 172.18.0.1, docker0 172.17.0.1, lo ::1, enp0s31f6 fe80::c8f7:3616:850e:e934, docker_gwbridge fe80::42:51ff:fea1:e70, docker0 fe80::42:9ff:fec0:5dff, veth208f15f fe80::68b3:40ff:fefe:dd88, vetha4a8c59 fe80::e:80ff:fef7:62bd, vethcf3513d fe80::7c8f:7bff:fe23:4af2, choosing IP 192.168.7.144
08/03/21 05:37:52 DNS returned:
08/03/21 05:37:52 127.0.1.1
08/03/21 05:37:52 192.168.7.144
08/03/21 05:37:52 We returned:
08/03/21 05:37:52 127.0.1.1
08/03/21 05:37:52 192.168.7.144
08/03/21 05:37:52 hostname: tuna.ocwen.com
08/03/21 05:37:52 I am: hostname: tuna, fully qualified doman name: tuna.ocwen.com, IP: 192.168.7.144, IPv4: 192.168.7.144, IPv6:
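
To check the same thing on other nodes without reading the long lines by eye, the "matches ... choosing IP" line can be split into interface/address pairs. The following is only a rough sketch: the sample line is abbreviated from the output above, the function name `parse_matches` is made up for illustration, and the line format is assumed from this one log rather than from any documented HTCondor guarantee.

```python
import ipaddress
import re

# Abbreviated copy of one D_HOSTNAME line from the output above.
LINE = ("08/03/21 05:37:52 NETWORK_INTERFACE=* matches lo 127.0.0.1, "
        "enp0s31f6 192.168.7.144, docker0 172.17.0.1, lo ::1, "
        "enp0s31f6 fe80::c8f7:3616:850e:e934, choosing IP 192.168.7.144")


def parse_matches(line):
    """Return ([(interface, address), ...], chosen_address) for one log line."""
    m = re.search(r"matches (.*), choosing IP (\S+)$", line)
    if m is None:
        raise ValueError("not a NETWORK_INTERFACE match line")
    pairs = []
    for item in m.group(1).split(", "):
        iface, addr = item.split()
        pairs.append((iface, ipaddress.ip_address(addr)))
    return pairs, ipaddress.ip_address(m.group(2))


pairs, chosen = parse_matches(LINE)
loopback = [str(addr) for _, addr in pairs if addr.is_loopback]
print(loopback)            # ['127.0.0.1', '::1']
print(chosen.is_loopback)  # False
```

For this line the loopback addresses are among the candidates, but the address condor chooses is the external one.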

On Tue, Aug 3, 2021 at 1:56 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
I think the fundamental problem is a combination of your hosts file and the fact that you seem to be forcing HTCondor to use 127.0.0.1 as the preferred IP address.

We look up tuna and get 127.0.0.1, and then we look up 127.0.0.1 and the first answer in the hosts file is localhost, so that becomes the hostname.

I think you either need to remove tuna from the hosts file, give it a different IP address (like the public IP address), or make it the first entry in the hosts file for 127.0.0.1
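
The lookup chain described above can be modeled in a few lines. This is a simplified sketch, not HTCondor's or the system resolver's actual code: it assumes forward lookup returns the first entry that lists the name, and reverse lookup returns the first name on the first entry for that IP. `BAD` is the /etc/hosts quoted later in this thread; `FIXED` is a hypothetical file applying the first suggested option (removing tuna from the 127.0.0.1 line), not necessarily the fix that was actually used.

```python
def parse_hosts(text):
    """Return a list of (ip, [names]) entries in file order."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        entries.append((ip, names))
    return entries


def forward(entries, name):
    """IP of the first entry that lists the name, or None."""
    for ip, names in entries:
        if name in names:
            return ip
    return None


def reverse(entries, ip):
    """First name on the first entry for the IP, or None."""
    for entry_ip, names in entries:
        if entry_ip == ip and names:
            return names[0]
    return None


def resolve(text, name):
    """Forward lookup followed by reverse lookup, as described above."""
    entries = parse_hosts(text)
    return reverse(entries, forward(entries, name))


BAD = """
127.0.0.1    localhost        tuna
127.0.1.1    tuna.ocwen.com   tuna
"""

# Hypothetical corrected file: tuna no longer appears on the 127.0.0.1 line.
FIXED = """
127.0.0.1    localhost
127.0.1.1    tuna.ocwen.com   tuna
"""

print(resolve(BAD, "tuna"))    # localhost
print(resolve(FIXED, "tuna"))  # tuna.ocwen.com
```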

But I'm confused how you can have a 3 node pool that is working at all if you are telling HTCondor to use 127.0.0.1 for communication. The nodes should be unable to talk to each other.

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Lyle Pakula <Lyle@xxxxxxxxxxxxxxxx>
Sent: Sunday, August 1, 2021 9:33 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor 9.1
Hi John,

Thanks for the help.

1/ NETWORK_INTERFACE is the same on all machines

lyle@tuna$ condor_config_val -v NETWORK_INTERFACE
NETWORK_INTERFACE = *
 # at: <Default>
 # raw: NETWORK_INTERFACE = *


FYI my /etc/hosts on all machines follows a standard layout, i.e. for tuna:

lyle@tuna$ cat /etc/hosts
127.0.0.1    localhost        tuna
127.0.1.1    tuna.ocwen.com   tuna


All machines have an /etc/hostname file containing their "hostname", but domainname is blank.

2/ UID_DOMAIN is also the same on all machines, that is, the default of

lyle@grenadier:$ condor_config_val -v UID_DOMAIN
UID_DOMAIN = localhost
 # at: <Default>
 # raw: UID_DOMAIN = $(FULL_HOSTNAME)
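
For comparing these reports across machines, the `-v` output can be parsed and the values checked for equality, which is the comparison described in the earlier reply quoted below. This is a minimal sketch under stated assumptions: the report layout is taken only from the output pasted in this thread, `parse_verbose` and `same_uid_domain` are made-up helper names, and the second report is hypothetical. Equal values are only the condition named in that reply; here every machine reported `localhost` and jobs still ran as nobody, so the real decision evidently involves more than this check.

```python
def parse_verbose(text):
    """Parse one `condor_config_val -v NAME` report into a dict."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Annotation lines look like "# at: <Default>" or "# raw: NAME = ..."
            key, _, value = line.lstrip("# ").partition(":")
            result[key.strip()] = value.strip()
        elif "name" not in result:
            # First non-comment line is "NAME = value"
            name, _, value = line.partition("=")
            result["name"] = name.strip()
            result["value"] = value.strip()
    return result


def same_uid_domain(submit_report, execute_report):
    """True if both reports show the same expanded value."""
    return (parse_verbose(submit_report)["value"]
            == parse_verbose(execute_report)["value"])


# Report as pasted above (grenadier).
GRENADIER = """\
UID_DOMAIN = localhost
 # at: <Default>
 # raw: UID_DOMAIN = $(FULL_HOSTNAME)
"""

# Hypothetical report from another machine, for illustration only.
OTHER = """\
UID_DOMAIN = ocwen.com
 # at: <Default>
 # raw: UID_DOMAIN = ocwen.com
"""

print(parse_verbose(GRENADIER))
print(same_uid_domain(GRENADIER, OTHER))      # False
print(same_uid_domain(GRENADIER, GRENADIER))  # True
```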


... What I tried
It looked to me as if condor is not picking up the actual hostname, and perhaps this is because we have no domainname configured.

lyle@grenadier:/etc/condor/config.d$ hostname
grenadier

lyle@grenadier:/etc/condor/config.d$ condor_config_val -v HOSTNAME
HOSTNAME = localhost
 # at: <Detected>
 # raw: HOSTNAME = localhost

lyle@grenadier:/etc/condor/config.d$ condor_config_val -v FULL_HOSTNAME
FULL_HOSTNAME = localhost
 # at: <Detected>
 # raw: FULL_HOSTNAME = localhost

* I tried pointing NETWORK_INTERFACE to 127.0.1.1 on all machines and also to the CENTRAL MANAGER IP (something I read), but this did not change what condor picks up as the hostname.
* I tried setting UID_DOMAIN=ocwen.com on all machines, but this did not work (everything still runs as nobody), and I suspect this is because the hostname is not picked up correctly either.

Thanks, Lyle


On Wed, Jul 28, 2021 at 1:59 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
I think slots are appearing as localhost because your condor_config is telling condor to use localhost as the primary network interface.

What does the condor_config have set for NETWORK_INTERFACE?

Try running

  condor_config_val -v NETWORK_INTERFACE

By the way, you can see all of your configuration that differs from the default HTCondor configuration by running

  condor_config_val -summary

When a job runs, files will be written as nobody if the job runs as nobody, which happens when HTCondor does not think that the submit node and the execute node have the same set of user ids. It decides this by comparing the value of UID_DOMAIN on both of these machines.

Try running

  condor_config_val -v UID_DOMAIN

on both the submit machine and the execute machine, what is the value?

Now, having files written as nobody on the execute node is not a problem when HTCondor is doing file transfer, because it will change ownership of the files as it transfers the results back. But if you are using a shared file system, you may need to do some additional configuration.

Instructions for setting up HTCondor to use a shared file system are here



-tj



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Lyle Pakula <Lyle@xxxxxxxxxxxxxxxx>
Sent: Monday, July 26, 2021 7:14 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor 9.1
Hi Everyone and thanks for everyone's help in advance!

We have recently upgraded from a very old install of 7.6 to 9.1 on Ubuntu 18.04 by basically blowing away everything old (uninstall, remove systemctl, delete "condor user" from all machines) and then following https://htcondor.readthedocs.io/en/latest/getting-htcondor/admin-quick-start.html.

* Starting with a basic setup (3 machines, 3 roles) + NAS mounted on all machines.
* Vanilla universe jobs read/write to and from the NAS.

Question 1 - Why are slots appearing as "localhost" and not the machine name they are actually on?
lyle@tuna:~$ condor_status
Name            OpSys  Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:39
slot2@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:36
slot3@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:33
slot4@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:32
slot5@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:31
slot6@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:42
slot7@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:41
slot8@localhost LINUX  X86_64 Unclaimed Idle      0.000 1990 0+00:30:41

Question 2 - Files are written as nobody:nouser; how can we change this?
The problem here is that the written files are unreadable/unwritable to the submitter.

Tried this but it did not work

Thanks, Lyle

--
AE CAPITAL
Ground Floor, 555 Bourke Street, Melbourne Australia 3000

p +61 3 9020 7801
m +61 (0)434 872 054
w http://www.aecapital.com.au


AE Capital Pty Limited (ACN 153 242 865) is regulated by the Australian Securities & Investments Commission and is a Corporate Authorised Representative of JFM Pty Limited (ACN 125 150 656), holder of an Australian Financial Services Licence (AFSL 314585). AE Capital Pty Limited is a member of the National Futures Association (ID 0498660).
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

