
Re: [HTCondor-users] Condor hanging, failing to start



Hi Craig,

Indeed, this is very fishy:

04/24/20 15:34:59 Daemon Log is logging: D_ALWAYS D_ERROR
04/24/20 15:35:18 SharedPortEndpoint: waiting for connections to named socket 1034787_63fd

I can't think of what might be causing a 20 second delay there.  That's highly indicative of a configuration issue with the shared port daemon.  It's not clear from that log level *what* the problem might be.
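
In a longer log, gaps like this one can be found mechanically rather than by eye. Below is a minimal Python sketch, assuming every log line of interest begins with the `MM/DD/YY HH:MM:SS` stamp seen in the excerpt above. The function name `find_gaps` and the 10-second threshold are made up for illustration and are not part of HTCondor.

```python
from datetime import datetime

def find_gaps(lines, threshold_seconds=10):
    """Return (gap_seconds, previous_line, current_line) for each pair of
    consecutive timestamped lines separated by at least the threshold."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        line = line.rstrip("\n")
        try:
            # Assumed prefix format, e.g. "04/24/20 15:34:59" (17 characters)
            stamp = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # skip blank lines or lines without a leading timestamp
        if prev_time is not None:
            delta = (stamp - prev_time).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, prev_line, line))
        prev_time, prev_line = stamp, line
    return gaps

# The two lines quoted above, plus one more from the same second
sample = [
    "04/24/20 15:34:59 Daemon Log is logging: D_ALWAYS D_ERROR",
    "04/24/20 15:35:18 SharedPortEndpoint: waiting for connections to named socket 1034787_63fd",
    "04/24/20 15:35:18 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.",
]

for delta, before, after in find_gaps(sample):
    print(f"{delta:.0f}s gap before: {after}")
```

To run it against a real file, pass the open file object instead of `sample`, for example `find_gaps(open("/var/log/condor/MasterLog"))`, adjusting the path to wherever the master log actually lives on the machine.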

Can you restart with:

MASTER_DEBUG = D_FULLDEBUG

and post the relevant pieces from the condor_master log?  That may provide additional information about your problem.

Thanks!

Brian

On Apr 23, 2020, at 10:58 PM, Craig Parker <craig.parker@xxxxxxxxx> wrote:

Hi all, hope you are all well during these turbulent times.   

I have a weird problem with my HTCondor instance here at VUW over the past couple of days.  Jobs were alternately being held "idle" for long periods of time for no reasons discernible by me, and I *think* I may have discovered and resolved an issue affecting this with a number of the client machines, where their SharedPort addresses were being set to 127.0.0.1.

I thought that it was all sorted, but returning to the server for some final testing has ruined my day a little.  I'm currently unable to restart the Condor service on the server, and looking at the MasterLog it seems that the machine isn't able to determine its own communication addresses - if that makes sense.  Below I have a snip of the MasterLog during Condor startup: the long time it takes for the 'shared_port_ad' file to appear looks to be suspicious to me?

Honestly I'm a little lost with this though, and would really appreciate any kind of assistance at all, even if it's just to say I'm barking up the wrong tree.  If you need any more info or logs etc, please let me know and I'll get them to you.

Many thanks, Craig

ITS Client Technology Manager
Ph: +64 4 463 6052
Mob: 027 564 6052
Rankine Brown level 8
Victoria University of Wellington,
PO Box 600, Wellington 6140, New Zealand

------
04/24/20 15:34:59 ** condor_master (CONDOR_MASTER) STARTING UP
04/24/20 15:34:59 ** /usr/sbin/condor_master
04/24/20 15:34:59 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/24/20 15:34:59 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/24/20 15:34:59 ** $CondorVersion: 8.6.13 Oct 30 2018 BuildID: 453497 $
04/24/20 15:34:59 ** $CondorPlatform: x86_64_RedHat7 $
04/24/20 15:34:59 ** PID = 1034787
04/24/20 15:34:59 ** Log last touched 4/24 15:32:07
04/24/20 15:34:59 ******************************************************
04/24/20 15:34:59 Using config source: /etc/condor/condor_config
04/24/20 15:34:59 Using local config sources: 
04/24/20 15:34:59    /etc/condor/config.d/00VUWCondor_config.local
04/24/20 15:34:59    /etc/condor/config.d/00VUWCondor_config.local
04/24/20 15:34:59 config Macros = 114, Sorted = 114, StringBytes = 3955, TablesBytes = 4160
04/24/20 15:34:59 CLASSAD_CACHING is OFF
04/24/20 15:34:59 Daemon Log is logging: D_ALWAYS D_ERROR
04/24/20 15:35:18 SharedPortEndpoint: waiting for connections to named socket 1034787_63fd
04/24/20 15:35:18 SharedPortEndpoint: failed to open /var/log/condor/shared_port_ad: No such file or directory
04/24/20 15:35:18 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
04/24/20 15:35:18 DaemonCore: private command socket at <10.40.18.11:0?sock=1034787_63fd>
04/24/20 15:35:18 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
04/24/20 15:35:18 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1540925514)
04/24/20 15:35:18 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1034852
04/24/20 15:35:18 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:19 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:20 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:21 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:22 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:23 Waiting for /var/log/condor/shared_port_ad to appear.
04/24/20 15:35:23 condor_read() failed: recv() 5 bytes from collector vuwunicocondor03.ods.vuw.ac.nz returned -1, timeout=20, errno=104 Connection reset by peer.
04/24/20 15:35:23 IO: Failed to read packet header
04/24/20 15:35:23 SECMAN: no classad from server, failing
04/24/20 15:35:23 ERROR: SECMAN:2007:Failed to end classad message.
04/24/20 15:35:23 Failed to start non-blocking update to <10.40.18.11:9618>.
04/24/20 15:35:43 Found /var/log/condor/shared_port_ad.
04/24/20 15:35:43 Started DaemonCore process "/sbin/condor_collector", pid and pgroup = 1034877
04/24/20 15:35:43 Waiting for /var/log/condor/.collector_address to appear.
04/24/20 15:35:44 Found /var/log/condor/.collector_address.
04/24/20 15:35:44 Started DaemonCore process "/sbin/condor_negotiator", pid and pgroup = 1034879
04/24/20 15:35:44 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 1034880
04/24/20 15:42:24 ERROR: SECMAN:2003:deadline for security handshake with collector vuwunicocondor03.ods.vuw.ac.nz has expired.
04/24/20 15:42:24 Failed to start non-blocking update to <10.40.18.11:9618>.
04/24/20 15:47:24 ERROR: SECMAN:2003:deadline for security handshake with collector vuwunicocondor03.ods.vuw.ac.nz has expired.
04/24/20 15:47:24 Failed to start non-blocking update to <10.40.18.11:9618>.

