[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Condor hanging, failing to start



Thanks so much for getting back to me Brian.  Iâve upped the log levels for the Master and pasted the results of a new start below.  I also got an email from the server last night stating that the collector process had crashed:

This is an automated email from the Condor system
on machine "vuwunicocondor03.ods.vuw.ac.nz".  Do not reply.

"/sbin/condor_collector" on "vuwunicocondor03.ods.vuw.ac.nz" was killed because
it was no longer responding.
Condor will automatically restart this process in 10 seconds.

I actually have an extraordinarily keen user who actually sent a job to Condor last night and he reports that itâs been very slow, but some jobs are running.  Anyway - hereâs a snip of the MasterLog - please let me know if you need any more info or any extra testing or logs.

Many thanks, Craig

ââ
04/25/20 16:11:00 Result of reading /etc/issue:  \S

 

04/25/20 16:11:00 Result of reading /etc/redhat-release:  Red Hat Enterprise Linux Server release 7.7 (Maipo)

 

04/25/20 16:11:00 Using processor count: 2 processors, 2 CPUs, 0 HTs
04/25/20 16:11:00 Reading condor configuration from '/etc/condor/condor_config'
04/25/20 16:11:00 Enumerating interfaces: lo 127.0.0.1 up
04/25/20 16:11:00 Enumerating interfaces: eno16780032 10.40.18.11 up
04/25/20 16:11:00 ******************************************************
04/25/20 16:11:00 ** condor_master (CONDOR_MASTER) STARTING UP
04/25/20 16:11:00 ** /usr/sbin/condor_master
04/25/20 16:11:00 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/25/20 16:11:00 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/25/20 16:11:00 ** $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
04/25/20 16:11:00 ** $CondorPlatform: x86_64_RedHat7 $
04/25/20 16:11:00 ** PID = 1289047
04/25/20 16:11:00 ** Log last touched 4/25 16:09:41
04/25/20 16:11:00 ******************************************************
04/25/20 16:11:00 Using config source: /etc/condor/condor_config
04/25/20 16:11:00 Using local config sources: 
04/25/20 16:11:00    /etc/condor/config.d/00VUWCondor_config.local
04/25/20 16:11:00    /etc/condor/config.d/00VUWCondor_config.local
04/25/20 16:11:00 config Macros = 115, Sorted = 115, StringBytes = 3967, TablesBytes = 4196
04/25/20 16:11:00 CLASSAD_CACHING is OFF
04/25/20 16:11:00 Daemon Log is logging: D_FULLDEBUG D_ALWAYS D_ERROR
04/25/20 16:11:00 Attempting to lock /var/log/condor/InstanceLock.
04/25/20 16:11:00 FileLock object is updating timestamp on: /var/log/condor/InstanceLock
04/25/20 16:11:00 FileLock::obtain(1) - @1587787860.331143 lock on /var/log/condor/InstanceLock now WRITE
04/25/20 16:11:00 Obtained lock on /var/log/condor/InstanceLock.
04/25/20 16:11:00 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:01 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:02 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:03 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:04 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:05 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:06 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:07 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:08 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:09 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:10 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:11 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:12 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:13 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:14 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:15 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:16 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:17 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:18 Sleeping one second for kernel parameter tuning (pid 1289048).
04/25/20 16:11:19 Waited too long for kernel parameters to be tuned, hard-killing script.
04/25/20 16:11:19 SharedPortEndpoint: waiting for connections to named socket 1289047_9a26
04/25/20 16:11:19 SharedPortEndpoint: failed to open /var/log/condor/shared_port_ad: No such file or directory
04/25/20 16:11:19 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
04/25/20 16:11:19 DaemonCore: private command socket at <10.40.18.11:0?sock=1289047_9a26>
04/25/20 16:11:19 Setting maximum accepts per cycle 8.
04/25/20 16:11:19 Will use TCP to update collector vuwunicocondor03.ods.vuw.ac.nz <10.40.18.11:9618>
04/25/20 16:11:19 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
04/25/20 16:11:19 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1540925514)
04/25/20 16:11:19 No sockets passed from systemd
04/25/20 16:11:19 Set systemd to be notified once every 400 seconds.
04/25/20 16:11:19 ::RealStart; SHARED_PORT >
04/25/20 16:11:19 Looking for matching Collector on 'vuwunicocondor03.ods.vuw.ac.nz' ...
04/25/20 16:11:19 Matching 'vuwunicocondor03.ods.vuw.ac.nz:9618'
04/25/20 16:11:19 Host name matches collector <10.40.18.11:9618>.
04/25/20 16:11:19 Finished looking for Collectors.
04/25/20 16:11:19 Starting Collector on port 9618
04/25/20 16:11:19 Starting daemon on TCP port 9618
04/25/20 16:11:19 Started DaemonCore process "/usr/libexec/condor/condor_shared_port -f", pid and pgroup = 1289257
04/25/20 16:11:19 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:19 FileLock object is updating timestamp on: /var/log/condor/InstanceLock
04/25/20 16:11:19 DaemonCore: No more children processes to reap.
04/25/20 16:11:19 Getting monitoring info for pid 1289047
04/25/20 16:11:20 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:21 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:22 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:23 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:24 enter Daemons::UpdateCollector
04/25/20 16:11:24 Trying to update collector <10.40.18.11:9618>
04/25/20 16:11:24 Attempting to send update via TCP to collector vuwunicocondor03.ods.vuw.ac.nz <10.40.18.11:9618>
04/25/20 16:11:24 File descriptor limits: max 32767, safe 26214
04/25/20 16:11:24 exit Daemons::UpdateCollector
04/25/20 16:11:24 enter Daemons::CheckForNewExecutable
04/25/20 16:11:24 Time stamp of running /usr/sbin/condor_master: 1540925514
04/25/20 16:11:24 GetTimeStamp returned: 1540925514
04/25/20 16:11:24 Waiting for /var/log/condor/shared_port_ad to appear.
04/25/20 16:11:24 condor_read() failed: recv() 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz returned -1, timeout=20, errno=104 Connection reset by peer.
04/25/20 16:11:24 IO: Failed to read packet header
04/25/20 16:11:24 condor_read(): Socket closed when trying to read 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz in non-blocking mode
04/25/20 16:11:24 IO: EOF reading packet header
04/25/20 16:11:24 condor_read(): Socket closed when trying to read 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz
04/25/20 16:11:24 IO: EOF reading packet header
04/25/20 16:11:24 SECMAN: no classad from server, failing
04/25/20 16:11:24 ERROR: SECMAN:2007:Failed to end classad message.
04/25/20 16:11:24 Failed to start non-blocking update to <10.40.18.11:9618>.
04/25/20 16:11:44 Found /var/log/condor/shared_port_ad.
04/25/20 16:11:44 ::RealStart; COLLECTOR >
04/25/20 16:11:44 Looking for matching Collector on 'vuwunicocondor03.ods.vuw.ac.nz' ...
04/25/20 16:11:44 Matching 'vuwunicocondor03.ods.vuw.ac.nz:9618'
04/25/20 16:11:44 Host name matches collector <10.40.18.11:9618>.
04/25/20 16:11:44 Finished looking for Collectors.
04/25/20 16:11:44 Starting collector with shared port id collector
04/25/20 16:11:44 Starting daemon with shared port id collector
04/25/20 16:11:44 Started DaemonCore process "/sbin/condor_collector -f", pid and pgroup = 1289278
04/25/20 16:11:44 Waiting for /var/log/condor/.collector_address to appear.
04/25/20 16:11:45 Found /var/log/condor/.collector_address.
04/25/20 16:11:45 ::RealStart; NEGOTIATOR >
04/25/20 16:11:45 Started DaemonCore process "/sbin/condor_negotiator -f", pid and pgroup = 1289280
04/25/20 16:11:45 ::RealStart; SCHEDD >
04/25/20 16:11:45 Started DaemonCore process "/usr/sbin/condor_schedd -f", pid and pgroup = 1289281
04/25/20 16:15:19 Getting monitoring info for pid 1289047
04/25/20 16:16:24 enter Daemons::UpdateCollector
04/25/20 16:16:24 Trying to update collector <10.40.18.11:9618>
04/25/20 16:16:24 Attempting to send update via TCP to collector vuwunicocondor03.ods.vuw.ac.nz <10.40.18.11:9618>
04/25/20 16:16:24 exit Daemons::UpdateCollector
04/25/20 16:16:24 enter Daemons::CheckForNewExecutable
04/25/20 16:16:24 Time stamp of running /usr/sbin/condor_master: 1540925514
04/25/20 16:16:24 GetTimeStamp returned: 1540925514
04/25/20 16:18:25 ERROR: SECMAN:2003:deadline for security handshake with collector vuwunicocondor03.ods.vuw.ac.nz has expired.
04/25/20 16:18:25 Failed to start non-blocking update to <10.40.18.11:9618>.
04/25/20 16:19:19 Getting monitoring info for pid 1289047



On 24/04/2020, at 4:06 PM, Bockelman, Brian <BBockelman@xxxxxxxxxxxxx> wrote:

Hi Craig,

Indeed, this is very fishy:

04/24/20 15:34:59 Daemon Log is logging: D_ALWAYS D_ERROR
04/24/20 15:35:18 SharedPortEndpoint: waiting for connections to named socket 1034787_63fd

I can't think of what might be causing a 20 second delay there.  That's highly indicative of a configuration issue with the shared port daemon.  It's not clear from that log level *what* the problem might be.

Can you restart with:

MASTER_DEBUG = D_FULLDEBUG

and post the relevant pieces from the condor_master log?  That may provide additional information about your problem.

Thanks!

Brian

On Apr 23, 2020, at 10:58 PM, Craig Parker <craig.parker@xxxxxxxxx> wrote:

Hi all, hope you are all well during these turbulent times.   

I have a weird problem with my HTCondor instnace here at VUW over the past couple of days.  Jobs were alternately being held âidleâ for long periods of time for no reasons discernable by me, and I *think* I may have discovered and resolved an issue afecting this with a number of the client machines, where their SharedPort addresses were being set to 127.0.0.1.

I thought that it was all sorted, but returning to the server for some final testing has ruined my day a little.  Iâm currently unable to restart the Condor service on the server, and looking at the MasterLog it seems that the machine isnât able to determine its own communication addresses - if that makes sense.  Below I have a snip of the MasterLog during Condor startup: the long time it takes for the 'shared_port_ad' file to appear looks to be suspicious to me?

Honestly Iâm a little lost with this though, and would really appreciate any kind of assistance at all, even if itâs just to say Iâm barking up the wrong tree.  If you need any more info or logs etc, please let me know and Iâll get them to you.

Many thanks, Craig

ITS Client Technology Manager
Ph: +64 4 463 6052
Mob: 027 564 6052
Rankine Brown level 8
Victoria University of Wellington,
PO Box 600, Wellington 6140, New Zealand

------
04/24/20 15:34:59 ** condor_master (CONDOR_MASTER) STARTING UP
04/24/20 15:34:59 ** /usr/sbin/condor_master
04/24/20 15:34:59 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/24/20 15:34:59 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/24/20 15:34:59 ** $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
04/24/20 15:34:59 ** $CondorPlatform: x86_64_RedHat7 $
04/24/20 15:34:59 ** PID = 1034787
04/24/20 15:34:59 ** Log last touched 4/24 15:32:07
04/24/20 15:34:59 ******************************************************
04/24/20 15:34:59 Using config source: /etc/condor/condor_config
04/24/20 15:34:59 Using local config sources: 
04/24/20 15:34:59    /etc/condor/config.d/00VUWCondor_config.local
04/24/20 15:34:59    /etc/condor/config.d/00VUWCondor_config.local
04/24/20 15:34:59 config Macros = 114, Sorted = 114, StringBytes = 3955, TablesBytes = 4160
04/24/20 15:34:59 CLASSAD_CACHING is OFF
04/24/20 15:34:59 Daemon Log is logging: D_ALWAYS D_ERROR
04/24/20 15:35:18 SharedPortEndpoint: waiting for connections to named socket 1034787_63fd
04/24/20 15:35:18 SharedPortEndpoint: failed to open /var/log/condor/shared_port_ad: No such file or directory
04/24/20 15:35:18 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
04/24/20 15:35:18 DaemonCore: private command socket at <10.40.18.11:0?sock=1034787_63fd>
04/24/20 15:35:18 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
04/24/20 15:35:18 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1540925514)
04/24/20 15:35:18 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1034852
04/24/20 15:35:18 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:19 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:20 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:21 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:22 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:23 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:23 condor_read() failed: recv() 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz returned -1, timeout=20, errno=104 Connection reset by peer.
04/24/20 15:35:23 IO: Failed to read packet header
04/24/20 15:35:23 SECMAN: no classad from server, failing
04/24/20 15:35:23 ERROR: SECMAN:2007:Failed to end classad message.
04/24/20 15:35:23 Failed to start non-blocking update to <10.40.18.11:9618>.
04/24/20 15:35:43 Found /var/log/condor/shared_port_ad.
04/24/20 15:35:43 Started DaemonCore process "/sbin/condor_collector", pid and pgroup = 1034877
04/24/20 15:35:43 Waiting for /var/log/condor/.collector_address to appear.
04/24/20 15:35:44 Found /var/log/condor/.collector_address.
04/24/20 15:35:44 Started DaemonCore process "/sbin/condor_negotiator", pid and pgroup = 1034879
04/24/20 15:35:44 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1034880
04/24/20 15:42:24 ERROR: SECMAN:2003:deadline for security handshake with collector vuwunicocondor03.ods.vuw.ac.nz has expired.
04/24/20 15:42:24 Failed to start non-blocking update to <10.40.18.11:9618>.
04/24/20 15:47:24 ERROR: SECMAN:2003:deadline for security handshake with collector vuwunicocondor03.ods.vuw.ac.nz has expired.
04/24/20 15:47:24 Failed to start non-blocking update to <10.40.18.11:9618>.


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://apc01.safelinks.protection.outlook.com/?url="">

The archives can be found at:
https://apc01.safelinks.protection.outlook.com/?url="">