
Re: [HTCondor-users] Can't See Worker Machines - condor_status is blank



Hi All.


I'm still in need of help, if there's anyone out there who could spare a bit of time. I read that a Condor master running on a VM needs Condor to run on the host machine. This was from a search I had done specifically for a VMware Workstation VM, but I noticed the same was apparently true for VirtualBox. But I had previously been using Condor with a VirtualBox VM without Condor on the host and it worked just fine.

I've upgraded my 2 servers to Condor 9.5 (CentOS 7) and also installed it on my master (Oracle Linux 8 on VMware Workstation 16), which runs on Windows 10. I've added Condor to my Windows host too.

It's unclear to me how to set the Windows host up. Other things I note: for previous versions of Condor there were quite a few online tutorials on how to set up a simple master-worker system on 2 different machines. There doesn't seem to be much available on this anymore. I found a YouTube video on your channel that explains it from a V8 perspective, but the simple setup referenced in the video doesn't exist in the V9 manual and the link to the V8 manual is dead. (https://www.youtube.com/watch?v=cZ_DTsuRbk4)

The manual for V9 has the word "example" more than 1200 times, but none of those examples seem to be for a simple setup to get started.

I created a Condor token by running condor_token_create -identity jfisher@xxxxxxxxxxxxxxxxxx as root, then copied the passwords.d/POOL file from the master machine to the workers, but I get an authentication error. Both worker machines return:

condor_status

Error: communication error

AUTHENTICATE:1003:Failed to authenticate with any method

AUTHENTICATE:1004:Failed to authenticate using FS

AUTHENTICATE:1004:Failed to authenticate using IDTOKENS


The master still returns nothing.

The logs and setup files can be found here:

https://www.dropbox.com/t/9uO3VGrZTKtT7p4B

I do appreciate that everyone is busy, but I really would be extremely grateful if someone could point out where I'm going wrong. I've not been able to get anything running for a few weeks now.



--
Kind regards,

Justin Fisher.


On Tue, Dec 28, 2021 at 6:51 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
Did you also switch to a newer version of HTCondor?

I think these messages from the CollectorLog on the central manager show the problem:

12/28/21 17:20:03 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy denies all access
12/28/21 17:20:03 DC_AUTHENTICATE: Command not authorized, done!
12/28/21 17:20:13 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy denies all access
12/28/21 17:20:13 DC_AUTHENTICATE: Command not authorized, done!
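When a CollectorLog is long, it can help to tally these denials per host and access level. The following is only a rough sketch: the regular expression is modelled on the four log lines quoted above, and the function name summarize_denials is made up for illustration. Other HTCondor versions may word these messages differently.

```python
import re
from collections import Counter

# Pattern modelled on the CollectorLog lines quoted above (an assumption:
# other HTCondor versions may format these messages differently).
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>\S+) from host (?P<host>\S+) "
    r"for command \d+ \((?P<command>\w+)\), access level (?P<level>\w+)"
)


def summarize_denials(log_text):
    """Count denials per (host, access level) in CollectorLog text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DENIED.search(line)
        if match:
            counts[(match["host"], match["level"])] += 1
    return counts


# Sample text copied from the log excerpt in this thread.
sample = """\
12/28/21 17:20:03 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy denies all access
12/28/21 17:20:03 DC_AUTHENTICATE: Command not authorized, done!
12/28/21 17:20:13 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy denies all access
12/28/21 17:20:13 DC_AUTHENTICATE: Command not authorized, done!
"""

for (host, level), n in sorted(summarize_denials(sample).items()):
    print(f"{host} denied at {level}: {n}")
```

Each access level that shows up in the summary corresponds to an ALLOW_<level> setting worth checking on the central manager.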

The configuration on the central manager does not have any value for ALLOW_ADVERTISE_MASTER or ALLOW_ADVERTISE_STARTD.

If you were running HTCondor 8.8.*, then the ALLOW_WRITE configuration value would be used when those had no value, but during the 8.9 series
we made HTCondor more secure by default, and part of that was that ALLOW_ADVERTISE_MASTER and ALLOW_ADVERTISE_STARTD stopped inheriting
the value of ALLOW_WRITE.

You can add these lines to the configuration of your central manager to fix this:

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE)
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE)
ALLOW_ADVERTISE_SCHEDD = $(ALLOW_WRITE)
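As a quick sanity check before restarting anything, a config file can be scanned for the three settings above. This is a deliberately naive sketch: it only understands plain NAME = value lines, does not expand $(MACRO) references, and ignores includes and other config files, so the condor_config_val tool remains the authoritative way to see what a daemon actually uses. The helper names parse_simple_config and missing_advertise_knobs are invented for this example.

```python
ADVERTISE_KNOBS = (
    "ALLOW_ADVERTISE_MASTER",
    "ALLOW_ADVERTISE_STARTD",
    "ALLOW_ADVERTISE_SCHEDD",
)


def parse_simple_config(text):
    """Parse plain NAME = value lines; macros are NOT expanded."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Names are upper-cased so lookups here ignore case.
        settings[name.strip().upper()] = value.strip()
    return settings


def missing_advertise_knobs(text):
    """Return the ADVERTISE settings that are absent or empty in text."""
    settings = parse_simple_config(text)
    return [knob for knob in ADVERTISE_KNOBS if not settings.get(knob)]


# A fragment of the central manager config posted in this thread.
before = """\
ALLOW_WRITE = 192.168.178.*
ALLOW_NEGOTIATOR = or8.ingenazure.com, 192.168.178.*
"""

# The same fragment with the three suggested lines appended.
after = before + """\
ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE)
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE)
ALLOW_ADVERTISE_SCHEDD = $(ALLOW_WRITE)
"""

print("missing before:", missing_advertise_knobs(before))
print("missing after: ", missing_advertise_knobs(after))
```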

HTCondor is trying to move away from authentication based on IP addresses, since that sort of installation is vulnerable to misuse by
anyone who has the ability to run programs from within your firewall. If you trust everyone who has access to your 192.168.178.* IP
address range, then making the change above is fine. But if you want a more secure HTCondor installation, you should upgrade
to HTCondor 9.0 or 9.5 and switch to IDTOKEN authentication.

-tj



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of justin0419@xxxxxxxxx <justin0419@xxxxxxxxx>
Sent: Tuesday, December 28, 2021 10:56 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Can't See Worker Machines - condor_status is blank
Hi All.

I recently changed my master server from CentOS 7 to Oracle Linux 8. I followed the installation instructions from:

https://research.cs.wisc.edu/htcondor/instructions/el/8/development/

Having set up the Condor master and adjusted the worker servers to suit the new master (IP address and name), I find I can't run Condor over the network.

condor_status comes up blank.

If I add STARTD to my master config file, I do get a list of slots in the master machine, but I don't want to run anything on the master machine. But at least it tells me I've got some small percentage of the installation correct.

I did have this problem before, which you very kindly supplied an answer for. I went through all the great suggestions you guys gave me last time but this time they don't work, so I'm clearly doing something else wrong.

This isn't a firewall problem. For now I've disabled firewalld and selinux on all machines.

My /etc/condor/condor_config file is untouched from the installation.

Below are some log files, my /etc/hosts and the config files from the master and one of the workers. If anyone could clue me in I'd be most grateful.

--

Kind regards,

Justin Fisher


----------------------------------------------------------------------------------------------------
$CondorVersion: 8.9.13 Mar 30 2021 BuildID: 535058 PackageID: 8.9.13-1 $

ps ax | grep condor
 19369 ?        Ss     0:00 /usr/sbin/condor_master -f
 19419 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 973
 19420 ?        Ss     0:00 condor_shared_port -p 9618
 19421 ?        Ss     0:00 condor_collector
 19422 ?        Ss     0:00 condor_negotiator
 19423 ?        Ss     0:00 condor_schedd
 21617 pts/0    S+     0:00 grep --color=auto condor



----------------------------------------------------------------------------------------------------
tail -n10 CollectorLog
12/28/21 17:19:47 Query info: matched=0; skipped=0; query_time=0.000180; send_time=0.000103; type=MachinePrivate; requirements={true}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:22405>; projection={}; filter_private_ads=0
12/28/21 17:19:47 (Sending 0 ads in response to query)
12/28/21 17:19:47 QueryWorker: forked new high priority worker with id 20004 ( max 4 active 2 pending 0 )
12/28/21 17:19:47 Query info: matched=0; skipped=14; query_time=0.000182; send_time=0.000084; type=Any; requirements={(((MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:5845>; projection={}; filter_private_ads=0
12/28/21 17:20:03 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy denies all access
12/28/21 17:20:03 DC_AUTHENTICATE: Command not authorized, done!
12/28/21 17:20:13 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy denies all access
12/28/21 17:20:13 DC_AUTHENTICATE: Command not authorized, done!
12/28/21 17:20:13 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy denies all access
12/28/21 17:20:13 DC_AUTHENTICATE: Command not authorized, done!

----------------------------------------------------------------------------------------------------
tail -n10 MasterLog
12/28/21 17:03:46 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 19420
12/28/21 17:03:46 Waiting for /var/lock/condor/shared_port_ad to appear.
12/28/21 17:03:46 Found /var/lock/condor/shared_port_ad.
12/28/21 17:03:46 Cannot remove wait-for-startup file /var/log/condor/.collector_address
12/28/21 17:03:47 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 19421
12/28/21 17:03:47 Waiting for /var/log/condor/.collector_address to appear.
12/28/21 17:03:47 Found /var/log/condor/.collector_address.
12/28/21 17:03:47 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 19422
12/28/21 17:03:47 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 19423
12/28/21 17:03:47 Daemons::StartAllDaemons all daemons were started


----------------------------------------------------------------------------------------------------
tail -n10 SchedLog
12/28/21 17:03:47 (pid:19423) DaemonCore: command socket at <192.168.178.63:9618?addrs=192.168.178.63-9618+[2001-871-262-b1ea-20c-29ff-feff-a619]-9618&alias=or8.ingenazure.com&noUDP&sock=schedd_19369_19f7>
12/28/21 17:03:47 (pid:19423) DaemonCore: private command socket at <192.168.178.63:9618?addrs=192.168.178.63-9618+[2001-871-262-b1ea-20c-29ff-feff-a619]-9618&alias=or8.ingenazure.com&noUDP&sock=schedd_19369_19f7>
12/28/21 17:03:47 (pid:19423) History file rotation is enabled.
12/28/21 17:03:47 (pid:19423)   Maximum history file size is: 20971520 bytes
12/28/21 17:03:47 (pid:19423)   Number of rotated history files is: 2
12/28/21 17:03:47 (pid:19423) Reloading job factories
12/28/21 17:03:47 (pid:19423) Loaded 0 job factories, 0 were paused, 0 failed to load
12/28/21 17:03:47 (pid:19423) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
12/28/21 17:03:47 (pid:19423) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
12/28/21 17:03:47 (pid:19423) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
[jfisher@or8 condor]$



----------------------------------------------------------------------------------------------------
All /etc/hosts files are identical:

more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.178.63 or8.ingenazure.com
192.168.178.61 eda1.ingenazure.com
192.168.178.60 eda2.ingenazure.com

Pinging from the master machine to make sure there are no typos in /etc/hosts:

ping or8.ingenazure.com
PING or8.ingenazure.com (192.168.178.63) 56(84) bytes of data.
64 bytes from or8.ingenazure.com (192.168.178.63): icmp_seq=1 ttl=64 time=0.018 ms

ping eda1.ingenazure.com
PING eda1.ingenazure.com (192.168.178.61) 56(84) bytes of data.
64 bytes from eda1.ingenazure.com (192.168.178.61): icmp_seq=1 ttl=64 time=0.848 ms

ping eda2.ingenazure.com
PING eda2.ingenazure.com (192.168.178.60) 56(84) bytes of data.
64 bytes from eda2.ingenazure.com (192.168.178.60): icmp_seq=1 ttl=64 time=0.848 ms
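The same check can be done without ping by parsing the hosts file text directly, which is handy when comparing the file across several machines. A minimal sketch follows, assuming the standard "address name [alias ...]" layout; the function names parse_hosts and check_expected are made up for this example.

```python
def parse_hosts(text):
    """Map each hostname or alias to the list of addresses it is listed under."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), []).append(address)
    return table


def check_expected(text, expected):
    """Return {name: addresses found} for every name that does not map to exactly its expected address."""
    table = parse_hosts(text)
    problems = {}
    for name, address in expected.items():
        found = table.get(name.lower(), [])
        if found != [address]:
            problems[name] = found
    return problems


# Contents of /etc/hosts as posted in this thread.
hosts = """\
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.178.63 or8.ingenazure.com
192.168.178.61 eda1.ingenazure.com
192.168.178.60 eda2.ingenazure.com
"""

expected = {
    "or8.ingenazure.com": "192.168.178.63",
    "eda1.ingenazure.com": "192.168.178.61",
    "eda2.ingenazure.com": "192.168.178.60",
}

print(check_expected(hosts, expected) or "all pool hosts map as expected")
```

This only validates the file contents; it says nothing about whether the machines are reachable, which is what the ping test covers.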

----------------------------------------------------------------------------------------------------
Master machine (or8.ingenazure.com)
/etc/condor/config.d/00master.config

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT

START = true
ALLOW_ADMINISTRATOR = jfisher@xxxxxxxxxxxxxx
DEFAULT_DOMAIN_NAME = ingenazure.com
UID_DOMAIN = ingenazure.com
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
ALLOW_WRITE = 192.168.178.*
ALLOW_READ = */*.ingenazure.com, or8.ingenazure.com
ALLOW_NEGOTIATOR = or8.ingenazure.com, 192.168.178.*
CONDOR_ADMIN = jfisher@xxxxxxxxxxxxxx
CONDOR_HOST = or8.ingenazure.com
USE_NFS = FALSE
HOSTNAME = or8

USE_SHARED_PORT=TRUE
SHARED_PORT_ARGS = -p 9618
COLLECTOR_USES_SHARED_PORT=TRUE
COLLECTOR_HOST = $(CONDOR_HOST):9618
StartJobs = TRUE

MASTER_INSTANCE_LOCK = /var/lock/condor/InstanceLock
MAX_DEFAULT_LOG = 1000000
EVENT_LOG = $(LOG)/EventLog
EVENT_LOG_JOB_AD_INFORMATION_ATTRS=Owner,CurrentHosts,x509userproxysubject,x509UserProxyVOName,AccountingGroup,GlobalJobId,QDate,JobStartDate,JobCurrentStartDate,JobFinishedHookDone
EVENT_LOG_MAX_SIZE = 10000000
EVENT_LOG_MAX_ROTATIONS = 5
POOL_HISTORY_DIR = /var/log/condor
KEEP_POOL_HISTORY = True

GROUP_NAMES = group_ANALOG, group_DIGITAL, group_OTHER, #set the shares for your users
GROUP_QUOTA_DYNAMIC_group_ANALOG = 1
GROUP_QUOTA_DYNAMIC_group_DIGITAL = 1
GROUP_QUOTA_DYNAMIC_group_OTHER = 0.5
GROUP_ACCEPT_SURPLUS = TRUE


----------------------------------------------------------------------------------------------------
Worker machine 1 (eda1.ingenazure.com)
/etc/condor/config.d/00worker.config

CAL_CONFIG_DIR = /etc/condor/config.d
DAEMON_LIST = MASTER,STARTD
DEFAULT_DOMAIN_NAME = ingenazure.com
CONDOR_HOST = or8.ingenazure.com
UID_DOMAIN = ingenazure.com
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST), 192.168.178.*
ALLOW_READ = *.$(UID_DOMAIN), 192.168.178.*
CONDOR_ADMIN = jfisher@xxxxxxxxxxxxxx
USE_NFS = FALSE
StartJobs = true
STARTD_ATTRS = StartJobs, $(STARTD_ATTRS)
START = true
HOSTALLOW_CONFIG = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST)
ENABLE_RUNTIME_CONFIG = True
RUNTIME_CONFIG_ADMIN = $(CONDOR_HOST)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/persistent
USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9618
COLLECTOR_USES_SHARED_PORT=TRUE
COLLECTOR_HOST = $(CONDOR_HOST):9618

# Enable CGROUP control
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft

# slots
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 24
SLOT_TYPE_1 = cpus=1, ram=4%, swap=4%, disk=4%
SLOT_TYPE_1_PARTITIONABLE = true
COUNT_HYPERTHREAD_CPUS = true
----------------------------------------------------------------------------------------------------