
[HTCondor-users] Can't See Worker Machines - condor_status is blank



Hi All.

I recently changed my master server from CentOS 7 to Oracle Linux 8. I followed the installation instructions from:

https://research.cs.wisc.edu/htcondor/instructions/el/8/development/

Having set up the Condor master and adjusted the worker servers to suit the new master (IP address and name), I find I can't run Condor over the network.

condor_status comes up blank.

If I add STARTD to my master config file, I do get a list of slots in the master machine, but I don't want to run anything on the master machine. But at least it tells me I've got some small percentage of the installation correct.

I did have this problem before, which you very kindly supplied an answer for. I went through all the great suggestions you guys gave me last time but this time they don't work, so I'm clearly doing something else wrong.

This isn't a firewall problem. For now I've disabled firewalld and selinux on all machines.
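
One way to double-check that from a worker is to attempt a plain TCP connection to the collector port. This is a minimal Python sketch, assuming the shared port 9618 from the config further down; `port_open` is an illustrative helper name, not part of HTCondor.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `port_open("or8.ingenazure.com", 9618)` on a worker should return True if the network path is open. It only tests TCP reachability and says nothing about HTCondor authorization.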

My /etc/condor/condor_config file is untouched from the installation.

Below are some log files, my /etc/hosts and the config files from the master and one of the workers. If anyone could clue me in I'd be most grateful.

--

Kind regards,

Justin Fisher


----------------------------------------------------------------------------------------------------
$CondorVersion: 8.9.13 Mar 30 2021 BuildID: 535058 PackageID: 8.9.13-1 $

ps ax | grep condor
 19369 ?        Ss     0:00 /usr/sbin/condor_master -f
 19419 ?        S      0:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 973
 19420 ?        Ss     0:00 condor_shared_port -p 9618
 19421 ?        Ss     0:00 condor_collector
 19422 ?        Ss     0:00 condor_negotiator
 19423 ?        Ss     0:00 condor_schedd
 21617 pts/0    S+     0:00 grep --color=auto condor



----------------------------------------------------------------------------------------------------
tail -n10 CollectorLog
12/28/21 17:19:47 Query info: matched=0; skipped=0; query_time=0.000180; send_time=0.000103; type=MachinePrivate; requirements={true}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:22405>; projection={}; filter_private_ads=0
12/28/21 17:19:47 (Sending 0 ads in response to query)
12/28/21 17:19:47 QueryWorker: forked new high priority worker with id 20004 ( max 4 active 2 pending 0 )
12/28/21 17:19:47 Query info: matched=0; skipped=14; query_time=0.000182; send_time=0.000084; type=Any; requirements={(((MyType == "Submitter")) || ((MyType == "Machine")))}; locate=0; limit=0; from=COLLECTOR; peer=<192.168.178.63:5845>; projection={}; filter_private_ads=0
12/28/21 17:20:03 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy denies all access
12/28/21 17:20:03 DC_AUTHENTICATE: Command not authorized, done!
12/28/21 17:20:13 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy denies all access
12/28/21 17:20:13 DC_AUTHENTICATE: Command not authorized, done!
12/28/21 17:20:13 PERMISSION DENIED to unauthenticated@unmapped from host 192.168.178.61 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy denies all access
12/28/21 17:20:13 DC_AUTHENTICATE: Command not authorized, done!
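
To see at a glance which hosts are being refused and at which access level, the PERMISSION DENIED lines can be tallied. This is a rough Python sketch; the regular expression is written against the line format shown in this log excerpt only, and `summarize_denials` is a made-up helper name.

```python
import re
from collections import Counter

# Pattern derived from the CollectorLog lines quoted in this post.
DENIED = re.compile(
    r"PERMISSION DENIED to (?P<user>\S+) from host (?P<host>\S+) "
    r"for command \d+ \((?P<command>\w+)\), access level (?P<level>\w+):"
)


def summarize_denials(log_text):
    """Count (host, access level) pairs across PERMISSION DENIED lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DENIED.search(line)
        if match:
            counts[(match.group("host"), match.group("level"))] += 1
    return counts
```

Feeding it the contents of CollectorLog, for example `summarize_denials(open("/var/log/condor/CollectorLog").read())`, gives one counter entry per host and access level.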

----------------------------------------------------------------------------------------------------
tail -n10 MasterLog
12/28/21 17:03:46 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 19420
12/28/21 17:03:46 Waiting for /var/lock/condor/shared_port_ad to appear.
12/28/21 17:03:46 Found /var/lock/condor/shared_port_ad.
12/28/21 17:03:46 Cannot remove wait-for-startup file /var/log/condor/.collector_address
12/28/21 17:03:47 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 19421
12/28/21 17:03:47 Waiting for /var/log/condor/.collector_address to appear.
12/28/21 17:03:47 Found /var/log/condor/.collector_address.
12/28/21 17:03:47 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 19422
12/28/21 17:03:47 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 19423
12/28/21 17:03:47 Daemons::StartAllDaemons all daemons were started


----------------------------------------------------------------------------------------------------
tail -n10 SchedLog
12/28/21 17:03:47 (pid:19423) DaemonCore: command socket at <192.168.178.63:9618?addrs=192.168.178.63-9618+[2001-871-262-b1ea-20c-29ff-feff-a619]-9618&alias=or8.ingenazure.com&noUDP&sock=schedd_19369_19f7>
12/28/21 17:03:47 (pid:19423) DaemonCore: private command socket at <192.168.178.63:9618?addrs=192.168.178.63-9618+[2001-871-262-b1ea-20c-29ff-feff-a619]-9618&alias=or8.ingenazure.com&noUDP&sock=schedd_19369_19f7>
12/28/21 17:03:47 (pid:19423) History file rotation is enabled.
12/28/21 17:03:47 (pid:19423)   Maximum history file size is: 20971520 bytes
12/28/21 17:03:47 (pid:19423)   Number of rotated history files is: 2
12/28/21 17:03:47 (pid:19423) Reloading job factories
12/28/21 17:03:47 (pid:19423) Loaded 0 job factories, 0 were paused, 0 failed to load
12/28/21 17:03:47 (pid:19423) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
12/28/21 17:03:47 (pid:19423) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
12/28/21 17:03:47 (pid:19423) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
[jfisher@or8 condor]$



----------------------------------------------------------------------------------------------------
All /etc/hosts files are identical:

more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.178.63 or8.ingenazure.com
192.168.178.61 eda1.ingenazure.com
192.168.178.60 eda2.ingenazure.com

Pinging from the master machine to ensure there are no typos in /etc/hosts:

ping or8.ingenazure.com
PING or8.ingenazure.com (192.168.178.63) 56(84) bytes of data.
64 bytes from or8.ingenazure.com (192.168.178.63): icmp_seq=1 ttl=64 time=0.018 ms

ping eda1.ingenazure.com
PING eda1.ingenazure.com (192.168.178.61) 56(84) bytes of data.
64 bytes from eda1.ingenazure.com (192.168.178.61): icmp_seq=1 ttl=64 time=0.848 ms

ping eda2.ingenazure.com
PING eda2.ingenazure.com (192.168.178.60) 56(84) bytes of data.
64 bytes from eda2.ingenazure.com (192.168.178.60): icmp_seq=1 ttl=64 time=0.848 ms
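
The same name-to-address check can be done without ping by parsing the hosts file directly. This is a small Python sketch; `parse_hosts` and `mismatches` are hypothetical helper names, and the expected addresses would be the ones listed above.

```python
def parse_hosts(text):
    """Map every host name in /etc/hosts-style text to its address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first address seen for a given name.
            mapping.setdefault(name, address)
    return mapping


def mismatches(text, expected):
    """Return {name: address found (or None)} for names that differ from expected."""
    hosts = parse_hosts(text)
    return {
        name: hosts.get(name)
        for name, address in expected.items()
        if hosts.get(name) != address
    }
```

An empty result from `mismatches(open("/etc/hosts").read(), expected)` on each machine means the hosts file agrees with the expected table.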

----------------------------------------------------------------------------------------------------
Master machine (or8.ingenazure.com)
/etc/condor/config.d/00master.config

DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, SHARED_PORT

START = true
ALLOW_ADMINISTRATOR = jfisher@xxxxxxxxxxxxxx
DEFAULT_DOMAIN_NAME = ingenazure.com
UID_DOMAIN = ingenazure.com
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
ALLOW_WRITE = 192.168.178.*
ALLOW_READ = */*.ingenazure.com, or8.ingenazure.com
ALLOW_NEGOTIATOR = or8.ingenazure.com, 192.168.178.*
CONDOR_ADMIN = jfisher@xxxxxxxxxxxxxx
CONDOR_HOST = or8.ingenazure.com
USE_NFS = FALSE
HOSTNAME = or8

USE_SHARED_PORT=TRUE
SHARED_PORT_ARGS = -p 9618
COLLECTOR_USES_SHARED_PORT=TRUE
COLLECTOR_HOST = $(CONDOR_HOST):9618
StartJobs = TRUE

MASTER_INSTANCE_LOCK = /var/lock/condor/InstanceLock
MAX_DEFAULT_LOG = 1000000
EVENT_LOG = $(LOG)/EventLog
EVENT_LOG_JOB_AD_INFORMATION_ATTRS=Owner,CurrentHosts,x509userproxysubject,x509UserProxyVOName,AccountingGroup,GlobalJobId,QDate,JobStartDate,JobCurrentStartDate,JobFinishedHookDone
EVENT_LOG_MAX_SIZE = 10000000
EVENT_LOG_MAX_ROTATIONS = 5
POOL_HISTORY_DIR = /var/log/condor
KEEP_POOL_HISTORY = True

GROUP_NAMES = group_ANALOG, group_DIGITAL, group_OTHER, #set the shares for your users
GROUP_QUOTA_DYNAMIC_group_ANALOG = 1
GROUP_QUOTA_DYNAMIC_group_DIGITAL = 1
GROUP_QUOTA_DYNAMIC_group_OTHER = 0.5
GROUP_ACCEPT_SURPLUS = TRUE


----------------------------------------------------------------------------------------------------
Worker machine 1 (eda1.ingenazure.com)
/etc/condor/config.d/00worker.config

CAL_CONFIG_DIR = /etc/condor/config.d
DAEMON_LIST = MASTER,STARTD
DEFAULT_DOMAIN_NAME = ingenazure.com
CONDOR_HOST = or8.ingenazure.com
UID_DOMAIN = ingenazure.com
FILESYSTEM_DOMAIN = $(UID_DOMAIN)
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST), 192.168.178.*
ALLOW_READ = *.$(UID_DOMAIN), 192.168.178.*
CONDOR_ADMIN = jfisher@xxxxxxxxxxxxxx
USE_NFS = FALSE
StartJobs = true
STARTD_ATTRS = StartJobs, $(STARTD_ATTRS)
START = true
HOSTALLOW_CONFIG = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST)
ENABLE_RUNTIME_CONFIG = True
RUNTIME_CONFIG_ADMIN = $(CONDOR_HOST)
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/persistent
USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9618
COLLECTOR_USES_SHARED_PORT=TRUE
COLLECTOR_HOST = $(CONDOR_HOST):9618

# Enable CGROUP control
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft

# slots
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 24
SLOT_TYPE_1 = cpus=1, ram=4%, swap=4%, disk=4%
SLOT_TYPE_1_PARTITIONABLE = true
COUNT_HYPERTHREAD_CPUS = true
----------------------------------------------------------------------------------------------------
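
To rule out a plain mismatch between the two files, the settings that master and worker have in common can be compared mechanically. This is a rough Python sketch that reads simple `KEY = value` lines only; it does not expand `$(...)` macros or follow includes the way HTCondor does, and `parse_config` and `differing` are illustrative names.

```python
def parse_config(text):
    """Parse KEY = value lines as raw strings (no macro expansion)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def differing(first, second, keys):
    """Return {key: (first value, second value)} where the two configs disagree."""
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }
```

Comparing keys such as CONDOR_HOST, UID_DOMAIN, COLLECTOR_HOST and USE_SHARED_PORT between 00master.config and 00worker.config should give an empty result if the two sides agree on the raw values.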