
Re: [HTCondor-users] [EXTERNAL] Collector down and not restarting properly



I apologize – I found the issue. /var is full. Badly mapped disks.
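For anyone who lands on this thread with the same symptoms, here is a minimal sketch of the check that would have caught it sooner. It assumes Python 3 is available on the host; the two directories are simply the ones that appear in the MasterLog quoted below, and the helper names `used_fraction` and `nearly_full` and the 95% threshold are made up for illustration.

```python
import shutil


def used_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total if usage.total else 0.0


def nearly_full(path, threshold=0.95):
    """True when the filesystem holding `path` is at or above `threshold` used."""
    return used_fraction(path) >= threshold


if __name__ == "__main__":
    # Directories taken from the MasterLog excerpt; adjust for your install.
    for path in ("/var/log/condor", "/var/lock/condor"):
        try:
            flag = " (nearly full)" if nearly_full(path) else ""
            print(f"{path}: {used_fraction(path):.1%} used{flag}")
        except OSError as exc:
            print(f"{path}: cannot check ({exc})")
```

The percentage may differ a little from what `df` reports, since `df` accounts for reserved blocks differently.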

 

Michael Fienen, Ph. D.
Research Hydrologist
United States Geological Survey
Upper Midwest Water Science Center
8505 Research Way
Middleton, WI  53562-3581
phone:  608.821.3894
https://www.usgs.gov/staff-profiles/michael-n-fienen

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Fienen, Michael N via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Date: Monday, March 29, 2021 at 5:12 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Fienen, Michael N <mnfienen@xxxxxxxx>
Subject: [EXTERNAL] [HTCondor-users] Collector down and not restarting properly

 


 

Hello Condor World! Been a minute….

 

We have been running rock-solid for months, but just ran into a problem where a user's submitted jobs are all staying in the Idle state. I rebooted the schedd and am getting errors as it tries to come back up. From MasterLog:


03/28/21 22:14:29 DefaultReaper unexpectedly called on pid 2310192, status 11264.

03/28/21 22:14:29 The COLLECTOR (pid 2310192) exited with status 44

03/28/21 22:14:29 Sending obituary for "/usr/sbin/condor_collector"

03/28/21 22:14:29 restarting /usr/sbin/condor_collector in 10 seconds

03/28/21 22:14:29 condor_write(): Socket closed when trying to write 1513 bytes to collector <schedd_name_here>, fd is 10

03/28/21 22:14:29 Buf::write(): condor_write() failed

03/28/21 22:14:29 condor_read(): Socket closed abnormally when trying to read 5 bytes from collector <schedd_name_here>, in non-blocking mode, errno=104 Connection reset by peer

03/28/21 22:14:29 SECMAN: no classad from server, failing

03/28/21 22:14:29 ERROR: SECMAN:2007:Failed to end classad message.

03/28/21 22:14:29 Failed to start non-blocking update to <schedd_IP_here>:9168,

03/28/21 22:14:33 DefaultReaper unexpectedly cal

03/29/21 14:04:03 ******************************************************

03/29/21 14:04:03 ** condor_master (CONDOR_MASTER) STARTING UP

03/29/21 14:04:03 ** /usr/sbin/condor_master

03/29/21 14:04:03 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)

03/29/21 14:04:03 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON

03/29/21 14:04:03 ** $CondorVersion: 8.8.13 Mar 23 2021 BuildID: 534541 PackageID: 8.8.13-1 $

03/29/21 14:04:03 ** $CondorPlatform: x86_64_CentOS7 $

03/29/21 14:04:03 ** PID = 1344

03/29/21 14:04:03 ** Log last touched 3/29 14:01:34

03/29/21 14:04:03 ******************************************************

03/29/21 14:04:03 Using config source: /etc/condor/condor_config

03/29/21 14:04:03 Using local config sources: 

03/29/21 14:04:03    /etc/condor/condor_config.local

03/29/21 14:04:03 config Macros = 70, Sorted = 70, StringBytes = 1827, TablesBytes = 2568

03/29/21 14:04:03 CLASSAD_CACHING is OFF

03/29/21 14:04:03 Daemon Log is logging: D_ALWAYS D_ERROR

03/29/21 14:04:04 SharedPortEndpoint: waiting for connections to named socket 1344_7360

03/29/21 14:04:04 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory

03/29/21 14:04:04 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.

03/29/21 14:04:04 DaemonCore: private command socket at < schedd_IP_here:0?sock=1344_7360>

03/29/21 14:04:04 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)

03/29/21 14:04:04 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port

03/29/21 14:04:04 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1616514849)

03/29/21 14:04:04 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 2077

03/29/21 14:04:04 Waiting for /var/lock/condor/shared_port_ad to appear.

03/29/21 14:04:05 Found /var/lock/condor/shared_port_ad.

03/29/21 14:04:05 DaemonCore: ERROR: Can't open address file /var/log/condor/.master_address.new

03/29/21 14:04:05 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 2815

03/29/21 14:04:05 Waiting for /var/log/condor/.collector_address to appear.

03/29/21 14:04:06 Waiting for /var/log/condor/.collector_address to appear.

 

 

That last line about waiting for .collector_address to appear is now just filling up the MasterLog – writing once per second.

 

Seems like permissions somehow, but I don’t see how this could have changed on its own. Any ideas?
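One way to tell a permissions problem apart from other causes is to attempt a small write in the directory and look at the errno that comes back. The following is only a sketch, assuming Python 3 on the host; `probe_write` is a hypothetical helper, not part of HTCondor, and it is only meaningful when run as the same user the daemons run as.

```python
import errno
import os
import tempfile


def probe_write(directory):
    """Create, write, and sync a small temporary file in `directory`.

    Returns "ok" on success, otherwise the symbolic errno name,
    e.g. "EACCES" (permissions), "ENOSPC" (no space), "ENOENT" (missing).
    """
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())  # force the data to disk so ENOSPC can surface
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))


if __name__ == "__main__":
    # Directory taken from the MasterLog lines above; adjust as needed.
    print(probe_write("/var/log/condor"))
```

A result of "ok" does not rule out a nearly full disk, since a write this small can still succeed when larger ones fail.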

Many thanks!

Mike

 
