
Re: [HTCondor-users] condor_status shows nothing



In your MasterLog this

03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.

is followed a second later by this

03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.

So no, this is not a problem. When the Master starts up, it starts the SharedPort daemon and then has to wait for the shared_port_ad
file to appear before it starts the other daemons.
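
If you want to check a longer MasterLog for the same pattern, here is a rough Python sketch. It pairs each "failed to open <path>" line with a later "Found <path>." line and reports only the paths that never show up. The function name and the regular expressions are my own and are matched against the three log lines quoted above, so treat them as assumptions rather than as anything HTCondor ships.

```python
import re

# Patterns modelled on the MasterLog lines quoted above (an assumption, not an official format).
FAILED = re.compile(r"failed to open (\S+): No such file or directory")
FOUND = re.compile(r"Found (\S+?)\.\s*$")


def unresolved_missing_files(log_lines):
    """Return {path: line_number} for files reported missing and never found later."""
    missing = {}
    for number, line in enumerate(log_lines, start=1):
        failed = FAILED.search(line)
        if failed:
            # Remember the first line on which this path was reported missing.
            missing.setdefault(failed.group(1), number)
            continue
        found = FOUND.search(line)
        if found:
            # A later "Found" line clears the earlier complaint.
            missing.pop(found.group(1), None)
    return missing


sample = [
    "03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory",
    "03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.",
    "03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.",
]

print(unresolved_missing_files(sample))  # prints {}
```

An empty result means every "failed to open" message was followed by a matching "Found" message, which is the harmless startup case described above.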

Also, condor_status will show nothing if there are no Startds in your pool that are configured to send ads to this collector, or if this
collector is refusing their ads.

Use 

  condor_status -all 

to see all of the ads in the collector, not just Startd ads.

Check the CollectorLog to see if it is refusing to accept any ads.   

Use

   condor_config_val -dump ALLOW_

to see the configuration related to allowing Schedds, Startds, etc. to send ads to this collector. The relevant entries start
with ALLOW_ADVERTISE (ALLOW_DAEMON for some ads, but not for Startd or Schedd ads).

In 8.9.0 we tightened up the default security behavior. In 8.8 you could set ALLOW_WRITE to give permission to send ads to the Collector, because ALLOW_ADVERTISE would inherit from ALLOW_WRITE. This no longer happens in 8.9.0. See the release notes:

http://research.cs.wisc.edu/htcondor/manual/v8.9.0/DevelopmentReleaseSeries89.html
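
As a quick way to spot the ALLOW_WRITE-only case, here is a rough Python sketch that reads saved output of the condor_config_val command above and lists which ALLOW_ADVERTISE_STARTD / ALLOW_ADVERTISE_SCHEDD entries are absent or empty. It assumes the dump consists of "NAME = value" lines with "#" comment lines. The helper names and the sample text are hypothetical and were not taken from the original poster's machine.

```python
def parse_dump(text):
    """Parse 'NAME = value' lines into a dict, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().upper()] = value.strip()
    return settings


def missing_advertise_knobs(settings, daemons=("STARTD", "SCHEDD")):
    """List the ALLOW_ADVERTISE_<daemon> names that are absent or empty."""
    return [
        f"ALLOW_ADVERTISE_{daemon}"
        for daemon in daemons
        if not settings.get(f"ALLOW_ADVERTISE_{daemon}")
    ]


# Hypothetical dump text for illustration only.
sample_dump = """\
# hypothetical output, not from a real pool
ALLOW_WRITE = 192.168.20.*
ALLOW_ADVERTISE_SCHEDD = 192.168.20.*
"""

print(missing_advertise_knobs(parse_dump(sample_dump)))  # prints ['ALLOW_ADVERTISE_STARTD']
```

A name in the result is only a hint about where to look. The CollectorLog is still the place to confirm whether ads are actually being refused.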

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Ben Pietras
Sent: Tuesday, March 19, 2019 6:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] condor_status shows nothing

Hi,

I manage two clusters, one of which is acting a little odd: condor_status returns nothing.

On the master node:

systemctl status condor -l
● condor.service - Condor Distributed High-Throughput-Computing
   Loaded: loaded (/usr/lib/systemd/system/condor.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-03-18 11:31:47 GMT; 23h ago
 Main PID: 16112 (condor_master)
   Status: "All daemons are responding"
    Tasks: 6 (limit: 32767)
   Memory: 24.7M
   CGroup: /system.slice/condor.service
           ├─16112 /usr/sbin/condor_master -f
           ├─16154 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
           ├─16155 condor_shared_port -f
           ├─16157 condor_collector -f
           ├─16158 condor_negotiator -f
           └─16159 condor_schedd -f

Mar 18 11:31:47 fastpc2 systemd[1]: Started Condor Distributed High-Throughput-Computing.

[this looks OK]

tail -50 /var/log/condor/MasterLog

03/18/19 11:00:08 Preen (pid 15523) exited with status 0
03/18/19 11:31:47 Got SIGQUIT.  Performing fast shutdown.
03/18/19 11:31:47 Sent SIGQUIT to COLLECTOR (pid 6428)
03/18/19 11:31:47 Sent SIGQUIT to NEGOTIATOR (pid 6432)
03/18/19 11:31:47 Sent SIGQUIT to SCHEDD (pid 6433)
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6432, status 0.
03/18/19 11:31:47 The NEGOTIATOR (pid 6432) exited with status 0
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6428, status 0.
03/18/19 11:31:47 The COLLECTOR (pid 6428) exited with status 0
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6433, status 0.
03/18/19 11:31:47 The SCHEDD (pid 6433) exited with status 0
03/18/19 11:31:47 Sent SIGTERM to SHARED_PORT (pid 6387)
03/18/19 11:31:47 AllReaper unexpectedly called on pid 6387, status 0.
03/18/19 11:31:47 The SHARED_PORT (pid 6387) exited with status 0
03/18/19 11:31:47 All daemons are gone.  Exiting.
03/18/19 11:31:47 **** condor_master (condor_MASTER) pid 5538 EXITING WITH STATUS 0
03/18/19 11:31:47 ******************************************************
03/18/19 11:31:47 ** condor_master (CONDOR_MASTER) STARTING UP
03/18/19 11:31:47 ** /usr/sbin/condor_master
03/18/19 11:31:47 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/18/19 11:31:47 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/18/19 11:31:47 ** $CondorVersion: 8.9.0 Feb 27 2019 BuildID: 462330 PackageID: 8.9.0-1 $
03/18/19 11:31:47 ** $CondorPlatform: x86_64_RedHat7 $
03/18/19 11:31:47 ** PID = 16112
03/18/19 11:31:47 ** Log last touched 3/18 11:31:47
03/18/19 11:31:47 ******************************************************
03/18/19 11:31:47 Using config source: /etc/condor/condor_config
03/18/19 11:31:47 Using local config sources: 
03/18/19 11:31:47    /etc/condor/config.d/condor_master_fastpc2.config
03/18/19 11:31:47    /etc/condor/config.d/condor_master_fastpc2.config.bak
03/18/19 11:31:47    /etc/condor/condor_config.local
03/18/19 11:31:47 config Macros = 75, Sorted = 75, StringBytes = 1939, TablesBytes = 2756
03/18/19 11:31:47 CLASSAD_CACHING is OFF
03/18/19 11:31:47 Daemon Log is logging: D_ALWAYS D_ERROR
03/18/19 11:31:48 SharedPortEndpoint: waiting for connections to named socket 16112_1201
03/18/19 11:31:48 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
03/18/19 11:31:48 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
03/18/19 11:31:48 DaemonCore: private command socket at <192.168.20.12:0?sock=16112_1201>
03/18/19 11:31:48 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
03/18/19 11:31:48 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1551328706)
03/18/19 11:31:48 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 16155
03/18/19 11:31:48 Waiting for /var/lock/condor/shared_port_ad to appear.
03/18/19 11:31:49 Found /var/lock/condor/shared_port_ad.
03/18/19 11:31:49 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 16157
03/18/19 11:31:49 Waiting for /var/log/condor/.collector_address to appear.
03/18/19 11:31:50 Found /var/log/condor/.collector_address.
03/18/19 11:31:50 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 16158
03/18/19 11:31:50 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 16159
03/18/19 12:31:48 Preen pid is 16679
03/18/19 12:31:48 Preen (pid 16679) exited with status 0

[this is from yesterday, "SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory" seems ominous]

ls -ltrh /var/lock/condor/shared_port_ad
-rw-r--r-- 1 condor condor 281 Mar 19 10:56 /var/lock/condor/shared_port_ad

cat /var/lock/condor/shared_port_ad
ForkedChildrenCurrent = 0
ForkedChildrenPeak = 0
MyAddress = "<192.168.20.12:9618?addrs=192.168.20.12-9618&noUDP>"
RequestsBlocked = 0
RequestsFailed = 0
RequestsPendingCurrent = 0
RequestsPendingPeak = 2
RequestsSucceeded = 47969
SharedPortCommandSinfuls = "<192.168.20.12:9618>"

ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:26:b9:5d:3e:77 brd ff:ff:ff:ff:ff:ff
    inet 130.88.20.80/24 brd 130.88.20.255 scope global noprefixroute dynamic em1
       valid_lft 1119485sec preferred_lft 1119485sec
    inet6 fe80::226:b9ff:fe5d:3e77/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever
3: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:26:b9:5d:3e:78 brd ff:ff:ff:ff:ff:ff
    inet 192.168.20.12/24 brd 192.168.20.255 scope global noprefixroute em2
       valid_lft forever preferred_lft forever
    inet6 fe80::226:b9ff:fe5d:3e78/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:73:df:89:96 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:73ff:fedf:8996/64 scope link 
       valid_lft forever preferred_lft forever
6: veth532a0b9@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether ca:e3:89:de:ab:6e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::c8e3:89ff:fede:ab6e/64 scope link 
       valid_lft forever preferred_lft forever

cat /etc/*eleas*
NAME="Scientific Linux"
VERSION="7.6 (Nitrogen)"
ID="scientific"
ID_LIKE="rhel centos fedora"

If anyone has any suggestions / wants more info, please let me know.

Best,
Ben
----------------------------------------------------------------------------
   Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>           
   School of Physics and Astronomy,   Tel.  0161-275-4231
   The University of Manchester,          Fax. 0161-275-5509
   Manchester, M13 9PL.                                             
----------------------------------------------------------------------------