
Re: [HTCondor-users] HTCondor 9.1



Hi,

I'm documenting here my recipe for solving my installation issues. If there is a better forum for contributing to HTCondor, please let me know. Also, please feel free to let me know if something here is wrong or could be improved. I hope this helps someone else googling in the future.

Task: Upgrade Condor 7.xx to 9.1 and do breadth-first load balancing

Problem 1 - HTCondor installation script won't complete because the previous Condor install was not completely removed with apt purge
What: HTCondor installation script keeps failing
Why: Previous install not completely removed via apt purge
Remedy: Manually remove everything

# Remove old condor user no longer needed
sudo deluser --remove-home condor
# Purge Condor
sudo apt purge condor
# Ensure systemctl services and entries are stopped and removed respectively
sudo systemctl stop condor
sudo systemctl disable condor
# -f: do not complain about entries that are already gone
sudo rm -f /etc/systemd/system/condor
sudo rm -f /etc/systemd/system/condor.service
sudo rm -f /usr/lib/systemd/system/condor
sudo systemctl daemon-reload
sudo systemctl reset-failed
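
To double-check that the cleanup above took, a rough sketch like the following lists any Condor-named entries still sitting in the systemd directories. The function name and the default directory list are my own assumptions for illustration, not part of the HTCondor installer.

```shell
#!/bin/sh
# Hypothetical helper: print leftover Condor systemd entries found at the
# top level of the given directories. With no arguments it looks in the
# two locations cleaned up in the recipe above.
list_condor_leftovers() {
    if [ "$#" -eq 0 ]; then
        set -- /etc/systemd/system /usr/lib/systemd/system
    fi
    for dir in "$@"; do
        # Skip directories that do not exist on this machine
        [ -d "$dir" ] || continue
        # Matches condor, condor.service and similar names
        find "$dir" -maxdepth 1 -name 'condor*' 2>/dev/null
    done
}
```

Running list_condor_leftovers with no arguments should print nothing on a clean machine. For the package side, dpkg -l | grep -i condor shows anything apt still knows about.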

Problem 2 - Condor detecting all execute nodes as localhost because of /etc/hosts format
What: Condor sees all execute hosts as slot1@localhost, slot2@localhost, etc.
Why: /etc/hosts needs to conform to a specific format so that Condor's heuristic for determining hostname and domain name operates correctly
Remedy: Alter /etc/hosts
Specifically, the 127.0.0.1 loopback line must not list machinename, otherwise Condor takes "localhost" as the machine's name.

i.e. this is bad
127.0.0.1 localhost machinename
127.0.1.1 machinename.domain.name machinename
192.168.7.1 machinename
and this is good...
127.0.0.1 localhost 
127.0.1.1 machinename.domain.name machinename
192.168.7.1 machinename
I feel this may well be a limitation in Condor. Given that we use Docker and containers and can have multiple machine names on the same IP, it's not clear how Condor would resolve this in a more complicated setup.
See also
https://ccl.cse.nd.edu/operations/condor/hostname.shtml
https://spinningmatt.wordpress.com/2010/07/28/how-condor-determines-a-nodes-ip-and-hostname/
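
The bad/good /etc/hosts example above can be turned into a quick check. This is only a sketch under my own assumptions: the function name is made up, and it looks solely at the 127.0.0.1 line, since that is the one that caused trouble here (the 127.0.1.1 line is allowed to carry the machine name).

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if no 127.0.0.1 line in the hosts file ($1)
# also lists the machine name ($2), exit 1 if one does.
loopback_is_clean() {
    hosts_file=$1
    machine=$2
    awk -v name="$machine" '
        BEGIN { bad = 0 }
        /^[[:space:]]*#/ { next }              # skip comment lines
        $1 == "127.0.0.1" {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break            # ignore trailing comments
                if ($i == name) bad = 1         # machine name on loopback
            }
        }
        END { exit bad }
    ' "$hosts_file"
}
```

Something like loopback_is_clean /etc/hosts "$(hostname -s)" || echo "fix /etc/hosts" could then be run on each node before starting Condor.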

Problem 3 - Jobs always run as "nobody" rather than as the submitting user, because UID_DOMAIN differs between submit and execute machines in the default setup on our (no domain) network
What: Files written to the shared file system are owned by nobody:nouser (user:group), which makes them unreadable to the submitter.
Why: The default UID_DOMAIN in the Condor install is $(FULL_HOSTNAME), which differs for every machine in our setup. Further, the submitter's domain looks different on its own machine because we had no domain name in our Ubuntu setup.

Remedy: Override UID_DOMAIN by adding custom config in /etc/condor/config.d/99_custom.config on every machine

# Make sure every machine knows it's on the same
# user id domain and thus can run the job as the submitting user (eg lyle)

# we do not use domain names on our internal network so set a default one
DEFAULT_DOMAIN_NAME=ocwen.com
# For the UID_DOMAIN to be the same on all machines
UID_DOMAIN=ocwen.com
# Don't check submitter's UID_DOMAIN, just trust it
TRUST_UID_DOMAIN=true
SOFT_UID_DOMAIN=true


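Note that I went pretty hard here to force Condor to match UID_DOMAIN: TRUST_UID_DOMAIN skips the check of the domain the submit machine claims, and SOFT_UID_DOMAIN lets a job run under the submitter's UID even when that user has no passwd entry on the execute machine.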
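
One way to confirm that every machine ended up with the same UID_DOMAIN is to collect the value each one reports and compare them. The sketch below assumes you have gathered one "machine value" pair per line yourself, for example from condor_config_val UID_DOMAIN run on each host; the helper name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "machine value" pairs on stdin and succeed
# only if there is at least one line and every value is identical.
all_same_value() {
    awk '
        NR == 1      { first = $2 }      # remember the first value seen
        $2 != first  { mismatch = 1 }    # any later difference is a failure
        END          { exit ((NR == 0 || mismatch) ? 1 : 0) }
    '
}
```

On each machine, echo "$(hostname) $(condor_config_val UID_DOMAIN)" produces a suitable line; concatenate the lines and pipe them into all_same_value.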

Problem 4 - Condor by default allocates resources to the best machine first, then the second-best machine and so on (aka "depth first"). We would rather Condor spread jobs over the best machines (aka "breadth first").
What: If two machines have slightly different KFlops and many slots, potentially all slots on one machine will fill first before the other is used
Why: Default NEGOTIATOR_PRE_JOB_RANK setting
Remedy: Modify NEGOTIATOR_PRE_JOB_RANK for breadth-first allocation to slots on a machine in /etc/condor/config.d/99_custom.config

NEGOTIATOR_PRE_JOB_RANK = ( $(NEGOTIATOR_PRE_JOB_RANK) ) * (SlotID / TotalSlots)
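
To get a feel for what the (SlotID / TotalSlots) factor does to the rank, the sketch below prints it for every slot of one machine, once with integer division and once with real-valued division. It is plain shell arithmetic, not Condor itself, and the function name is made up. My understanding is that ClassAd expressions divide two integers as integers; if that holds, the factor would be 0 for every slot except the last, and real(SlotID) / TotalSlots may be closer to the intent. Treat that as an assumption to check against your own pool.

```shell
#!/bin/sh
# Hypothetical helper: for a machine with $1 slots, print the
# SlotID / TotalSlots factor per slot under integer and real division.
slot_factors() {
    total=$1
    slot=1
    while [ "$slot" -le "$total" ]; do
        # Shell arithmetic truncates, like integer division
        int_factor=$(( slot / total ))
        # awk divides as floating point
        real_factor=$(awk -v s="$slot" -v t="$total" 'BEGIN { printf "%.3f", s / t }')
        echo "slot$slot int=$int_factor real=$real_factor"
        slot=$(( slot + 1 ))
    done
}
```

With real-valued division, the same slot number on different machines gets the same factor, which is what spreads jobs across machines instead of filling one machine first.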

Problem 5 - Default Condor install allows only 1 role per machine
What: A submit machine won't be an execute machine by default. Obviously sharing is caring here.
Why: Default DAEMON_LIST from the meta-config is geared towards one role
Remedy: Add the required daemon in /etc/condor/config.d/99_custom.config
# Make Every Condor submit machine also an execute node
DAEMON_LIST=$(DAEMON_LIST) STARTD
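
After a condor_reconfig, the effective DAEMON_LIST can be read back with condor_config_val DAEMON_LIST. The small sketch below checks whether a given daemon is in such a list; the helper name is my own, and it assumes the list is separated by spaces and/or commas.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the daemon named in $2 appears in the
# space- or comma-separated list in $1 (case-insensitive, whole word).
list_has_daemon() {
    printf '%s\n' "$1" | tr ', ' '\n\n' | grep -qixF -- "$2"
}
```

For example, list_has_daemon "$(condor_config_val DAEMON_LIST)" STARTD && echo "execute role enabled".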

Finally... Problem 6 - How to add nodes fast
We use JumpCloud for SSO on all machines, which has a nifty tool to run commands remotely as root according to defined "groups" of devices. So to add a machine to our pool, all we do is run
curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name
Then tag it in JumpCloud under "condor submit group", where we have the following command that is run on any tagged machine
cat <<EOL > /etc/condor/config.d/02_aec_custom.config && condor_reconfig && mount -a

# Make sure every machine knows it's on the same
# user id domain and thus can run the job as the submitting user (eg lyle)
DEFAULT_DOMAIN_NAME=ocwen.com
UID_DOMAIN=ocwen.com
TRUST_UID_DOMAIN=true
SOFT_UID_DOMAIN=true

# Make Every Condor machine also an execute node
DAEMON_LIST=\$(DAEMON_LIST) STARTD

# Define Load Balancing on the AE Pool
NEGOTIATOR_PRE_JOB_RANK=( \$(NEGOTIATOR_PRE_JOB_RANK) ) * (SlotID/ TotalSlots)

EOL
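
A note on the backslashes in the command above: with an unquoted here-document delimiter the shell would try to run $(DAEMON_LIST) as a command, so the Condor macros have to be escaped. The sketch below shows the escaped form next to the alternative of quoting the delimiter; both write the same line. The function names and the target path argument are made up for illustration.

```shell
#!/bin/sh
# Unquoted delimiter (<<EOL): the shell expands $(...) inside the
# here-document, so the Condor macro is escaped as \$(DAEMON_LIST).
write_escaped() {
    cat <<EOL > "$1"
DAEMON_LIST=\$(DAEMON_LIST) STARTD
EOL
}

# Quoted delimiter (<<'EOL'): nothing is expanded, no backslash needed.
write_quoted() {
    cat <<'EOL' > "$1"
DAEMON_LIST=$(DAEMON_LIST) STARTD
EOL
}
```

The quoted form is handy when the config has many macros, as long as no real shell variables need to be substituted into the file.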

On Tue, Jul 27, 2021 at 10:14 AM Lyle Pakula <Lyle@xxxxxxxxxxxxxxxx> wrote:
Hi Everyone and thanks for everyone's help in advance!

We have recently upgraded from a very old install of 7.6 to 9.1 on Ubuntu 18.04 by basically blowing away everything old (uninstall, remove systemctl, delete "condor user" from all machines) and then following https://htcondor.readthedocs.io/en/latest/getting-htcondor/admin-quick-start.html.

* Starting with a basic setup (3 machines, 3 roles) + NAS mounted on all machines.
* Vanilla universe jobs read/write to and from the NAS

Question 1 - Why are slots appearing as "localhost" and not the machine name they are actually on?
lyle@tuna:~$ condor_status
Name            OpSys Arch   State     Activity LoadAv Mem  ActvtyTime

slot1@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:39
slot2@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:36
slot3@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:33
slot4@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:32
slot5@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:31
slot6@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:42
slot7@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:41
slot8@localhost LINUX X86_64 Unclaimed Idle     0.000  1990 0+00:30:41

Question 2 - Files are written as nobody:nouser, how can we change this?
The problem here is that the written files are unreadable/unwritable to the submitter

Tried this but it did not work
http://personalpages.to.infn.it/~gariazzo/htcondor/concepts.html#perms

Thanks, Lyle

--
AE CAPITAL
Ground Floor, 555 Bourke Street, Melbourne Australia 3000

p +61 3 9020 7801
m +61 (0)434 872 054
w http://www.aecapital.com.au


AE Capital Pty Limited (ACN 153 242 865) is regulated by the Australian Securities & Investments Commission and is a Corporate Authorised Representative of JFM Pty Limited (ACN 125 150 656), holder of an Australian Financial Services Licence (AFSL 314585). AE Capital Pty Limited is a member of the National Futures Association (ID 0498660).

