
[HTCondor-users] condor_status shows nothing



Hi,

I manage two clusters, and one of them is acting a little odd: condor_status returns nothing.

On the master node:

systemctl status condor -l
● condor.service - Condor Distributed High-Throughput-Computing
   Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-03-18 11:31:47 GMT; 23h ago
 Main PID: 16112 (condor_master)
   Status: "All daemons are responding"
    Tasks: 6 (limit: 32767)
   Memory: 24.7M
   CGroup: /system.slice/condor.service
           ├─16112 /usr/sbin/condor_master -f
           ├─16154 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
           ├─16155 condor_shared_port -f
           ├─16157 condor_collector -f
           ├─16158 condor_negotiator -f
           └─16159 condor_schedd -f

Mar 18 11:31:47 fastpc2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

[This looks OK.]

tail -50 /var/log/condor/MasterLog

03/18/19 11:00:08 Preen (pid 15523) exited with status 0
03/18/19 11:31:47 Got SIGQUIT.  Performing fast shutdown.
03/18/19 11:31:47 Sent SIGQUIT to COLLECTOR (pid 6428)
03/18/19 11:31:47 Sent SIGQUIT to NEGOTIATOR (pid 6432)
03/18/19 11:31:47 Sent SIGQUIT to SCHEDD (pid 6433)
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6432, status 0.
03/18/19 11:31:47 The NEGOTIATOR (pid 6432) exited with status 0
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6428, status 0.
03/18/19 11:31:47 The COLLECTOR (pid 6428) exited with status 0
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6433, status 0.
03/18/19 11:31:47 The SCHEDD (pid 6433) exited with status 0
03/18/19 11:31:47 Sent SIGTERM to SHARED_PORT (pid 6387)
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6387, status 0.
03/18/19 11:31:47 The SHARED_PORT (pid 6387) exited with status 0
03/18/19 11:31:47 All daemons are gone.  Exiting.
03/18/19 11:31:47 **** condor_master (condor_MASTER) pid 5538 EXITING WITH STATUS 0
03/18/19 11:31:47 ******************************************************
03/18/19 11:31:47 ** condor_master (CONDOR_MASTER) STARTING UP
03/18/19 11:31:47 ** /usr/sbin/condor_master
03/18/19 11:31:47 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/18/19 11:31:47 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/18/19 11:31:47 ** $CondorVersion: 8.9.0 Feb 27 2019 BuildID: 462330 PackageID: 8.9.0-1 $
03/18/19 11:31:47 ** $CondorPlatform: x86_64_RedHat7 $
03/18/19 11:31:47 ** PID = 16112
03/18/19 11:31:47 ** Log last touched 3/18 11:31:47
03/18/19 11:31:47 ******************************************************
03/18/19 11:31:47 Using config source: /etc/condor/condor_config
03/18/19 11:31:47 Using local config sources: 
03/18/19 11:31:47    /etc/condor/config.d/condor_master_fastpc2.config
03/18/19 11:31:47    /etc/condor/config.d/condor_master_fastpc2.config.bak
03/18/19 11:31:47    /etc/condor/condor_config.local
03/18/19 11:31:47 config Macros = 75, Sorted = 75, StringBytes = 1939, TablesBytes = 2756
03/18/19 11:31:47 CLASSAD_CACHING is OFF
03/18/19 11:31:47 Daemon Log is logging: D_ALWAYS D_ERROR
03/18/19 11:31:48 SharedPortEndpoint: waiting for connections to named socket 16112_1201
03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
03/18/19 11:31:48 DaemonCore: private command socket at <192.168.20.12:0?sock=16112_1201>
03/18/19 11:31:48 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
03/18/19 11:31:48 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1551328706)
03/18/19 11:31:48 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 16155
03/18/19 11:31:48 Waiting for /var/lock/condor/shared_port_ad to appear.
03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.
03/18/19 11:31:49 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 16157
03/18/19 11:31:49 Waiting for /var/log/condor/.collector_address to appear.
03/18/19 11:31:50 Found /var/log/condor/.collector_address.
03/18/19 11:31:50 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 16158
03/18/19 11:31:50 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 16159
03/18/19 12:31:48 Preen pid is 16679
03/18/19 12:31:48 Preen (pid 16679) exited with status 0

[This is from yesterday. The line "SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory" seems ominous.]
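
One way to judge whether that "failed to open" message is a lasting problem is to check whether the same path is later reported as found. The following is only a rough sketch, not an HTCondor tool: the function name and the two regular expressions are my own assumptions, written against the wording of the three log lines copied from the excerpt above.

```python
import re

# Sample lines copied verbatim from the MasterLog excerpt above.
LOG = """\
03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/18/19 11:31:48 Waiting for /var/lock/condor/shared_port_ad to appear.
03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.
"""

# Patterns are assumptions based on the wording seen in this log only.
FAILED = re.compile(r"failed to open (\S+): No such file or directory")
FOUND = re.compile(r"Found (\S+)\.$")


def unresolved_missing_files(log_text):
    """Return paths reported as missing that never get a later 'Found' line."""
    missing = []
    for line in log_text.splitlines():
        m = FAILED.search(line)
        if m:
            path = m.group(1)
            if path not in missing:
                missing.append(path)
            continue
        m = FOUND.search(line)
        if m and m.group(1) in missing:
            # The file showed up after the failure, so drop it from the list.
            missing.remove(m.group(1))
    return missing


print(unresolved_missing_files(LOG))
```

For the lines above the list comes back empty, since the log reports "Found /var/lock/condor/shared_port_ad." one second after the failure.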

ls -ltrh /var/lock/condor/shared_port_ad
-rw-r--r-- 1 condor condor 281 Mar 19 10:56 /var/lock/condor/shared_port_ad

cat /var/lock/condor/shared_port_ad
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<192.168.20.12:9618?addrs=192.168.20.12-9618&noUDP>"
RequestsBlocked = 0
RequestsFailed = 0
RequestsPendingCurrent = 0
RequestsPendingPeak = 2
RequestsSucceeded = 47969
SharedPortCommandSinfuls = "<192.168.20.12:9618>"

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:26:b9:5d:3e:77 brd ff:ff:ff:ff:ff:ff
    inet 130.88.20.80/24 brd 130.88.20.255 scope global noprefixroute dynamic em1
       valid_lft 1119485sec preferred_lft 1119485sec
    inet6 fe80::226:b9ff:fe5d:3e77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:26:b9:5d:3e:78 brd ff:ff:ff:ff:ff:ff
    inet 192.168.20.12/24 brd 192.168.20.255 scope global noprefixroute em2
       valid_lft forever preferred_lft forever
    inet6 fe80::226:b9ff:fe5d:3e78/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:73:df:89:96 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:73ff:fedf:8996/64 scope link 
       valid_lft forever preferred_lft forever
6: veth532a0b9@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether ca:e3:89:de:ab:6e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c8e3:89ff:fede:ab6e/64 scope link 
       valid_lft forever preferred_lft forever
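
To cross-check the address in shared_port_ad against the interfaces listed by "ip a", something along these lines could be used. It is a minimal sketch under stated assumptions: the helper names are hypothetical, the regular expression only handles the IPv4 form of MyAddress seen here, and the attribute and interface values are copied by hand from the output above rather than read from the machine.

```python
import ipaddress
import re

# Two attributes copied from the shared_port_ad contents above.
AD_TEXT = '''\
MyAddress = "<192.168.20.12:9618?addrs=192.168.20.12-9618&noUDP>"
SharedPortCommandSinfuls = "<192.168.20.12:9618>"
'''

# Interface addresses copied from the "ip a" output above.
INTERFACES = {
    "em1": "130.88.20.80/24",
    "em2": "192.168.20.12/24",
    "docker0": "172.17.0.1/16",
}


def advertised_host(ad_text):
    """Extract the IPv4 host from the MyAddress attribute, or None if absent."""
    m = re.search(
        r'^MyAddress\s*=\s*"<(\d+\.\d+\.\d+\.\d+):(\d+)', ad_text, re.MULTILINE
    )
    return ipaddress.ip_address(m.group(1)) if m else None


def owning_interface(host, interfaces):
    """Return the name of the interface whose own address equals host, if any."""
    for name, cidr in interfaces.items():
        if ipaddress.ip_interface(cidr).ip == host:
            return name
    return None


host = advertised_host(AD_TEXT)
print(host, owning_interface(host, INTERFACES), host.is_private)
```

With the values above this reports that the advertised address belongs to em2, the 192.168.20.0/24 interface, rather than to em1.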

cat /etc/*eleas*
NAME="Scientific Linux"
VERSION="7.6 (Nitrogen)"
ID="scientific"
ID_LIKE="rhel centos fedora"

If anyone has any suggestions or wants more information, please let me know.

Best,
Ben
----------------------------------------------------------------------------
   Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>           
   School of Physics and Astronomy,   Tel.  0161-275-4231
   The University of Manchester,          Fax. 0161-275-5509
   Manchester, M13 9PL.                                             
----------------------------------------------------------------------------