[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] 6.7.20 collector timeouts




I am experiencing bad timeouts with the condor collector at 6.7.20
and have a ticket open with condor support on it. I am just bringing
this to the attention of condor-users to see if anyone else has
seen this problem.

I have just completed restarting my whole system, all startd, schedd,
negotiator, collector, everything, and the problems are the same as before. I had not even started any of the compute nodes yet--just
starting up the three head nodes and we already had problems.
In general there are three or four timeout messages that I don't understand
why they are happening, and each corresponds to holding up a condor_status
command.  Typically condor_status takes 2-3 minutes to complete.
This is a new problem which didn't happen right away when we upgraded
to 6.7.20, only about a week later.  Also collector has crashed twice
during that time.

So does anyone know what these error messages are in the CollectorLog
(debug right now is [root@fnpcsrv2 log]# condor_config_val COLLECTOR_DEBUG
D_FULLDEBUG, D_COMMAND, D_SECURITY, D_PROTOCOL

7/27 21:04:27 DaemonCore: Can't receive command request (perhaps a timeout?)
7/27 21:04:37 condor_read(): timeout reading buffer.
7/27 21:04:57 condor_read(): timeout reading buffer.
7/27 21:04:57 IO: Failed to read packet header
7/27 21:04:57 DaemonCore: Can't receive command request (perhaps a timeout?)
7/27 21:04:57 DaemonCore: received unregistered command request 25 !
7/27 21:05:27 SECMAN: Error sending response classad!

I didn't change any security settings going from 6.7.18 to 6.7.20,
I am just using host-based authentication.  Should I have changed any?
Why is SECMAN involved?
(I am using TCP to talk to the collector).

Also see the following errors:

/27 21:02:57 DC_AUTHENTICATE: attempt to open invalid session fnpcsrv2:29720:1. 7/27 21:03:57 DC_AUTHENTICATE: attempt to open invalid session fnpcsrv2:29720:1.
7/27 21:05:27 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:14:31 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:18:59 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:19:59 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:20:29 DC_AUTHENTICATE: attempt to open invalid session fnpcsrv2:29720:1.

As I write the E-mail there have been no timeouts since 21:32 and it is now
21:40, and condor_status comes back OK.  Much is claimed at the moment but
little is busy. Still interested in what is causing these errors above, though, because I've seen the timeouts come and go.

steve



-----------------startup of collector is below.


7/27 20:56:28 **** condor_collector (condor_COLLECTOR) EXITING WITH STATUS 0
7/27 20:58:51 ******************************************************
7/27 20:58:51 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
7/27 20:58:51 ** /opt/condor-6.7.20/sbin/condor_collector
7/27 20:58:51 ** $CondorVersion: 6.7.20 Jun 21 2006 $
7/27 20:58:51 ** $CondorPlatform: I386-LINUX_RH9 $7/27 20:58:51 ** $CondorPlatform: I386-LINUX_RH9 $
7/27 20:58:51 ** PID = 20646
7/27 20:58:51 ** Log last touched 7/27 20:56:28
7/27 20:58:51 ******************************************************
7/27 20:58:51 Using config source: /etc/condor/condor_config
7/27 20:58:51 Using local config sources:
7/27 20:58:51    /opt/condor/etc/group_params.config
7/27 20:58:51    /opt/condor/local/condor_config.local
7/27 20:58:51 DaemonCore: Command Socket at <131.225.167.47:9618>
7/27 20:58:51 In ViewServer::Init()
7/27 20:58:51 In CollectorDaemon::Init()
7/27 20:58:51 In ViewServer::Config()
7/27 20:58:51 In CollectorDaemon::Config()
7/27 20:58:51 Will forward ads on to View Server fnpcsrv1.fnal.gov
7/27 20:58:51 enable: Creating stats hash table
7/27 20:58:51 (Sending 0 ads in response to query)
7/27 20:58:51 Got QUERY_STARTD_PVT_ADS
7/27 20:58:51 (Sending 0 ads in response to query)
7/27 20:58:51 NegotiatorAd  : Inserting ** "< fnpcsrv2.fnal.gov >"
7/27 20:58:51 stats: Inserting new hashent for 'Negotiator':'fnpcsrv2.fnal.gov':
'131.225.167.47'
7/27 20:58:55 Got QUERY_STARTD_ADS
7/27 20:58:55 (Sending 0 ads in response to query)
7/27 20:58:56 MasterAd     : Inserting ** "< fnpcsrv2.fnal.gov >"
7/27 20:58:56 stats: Inserting new hashent for 'Master':'fnpcsrv2.fnal.gov':'131
.225.167.47'
7/27 20:59:02 Got QUERY_STARTD_ADS
7/27 20:59:02 (Sending 0 ads in response to query)
7/27 20:59:20 Got QUERY_MASTER_ADS
7/27 20:59:20 (Sending 1 ads in response to query)
7/27 20:59:24 MasterAd     : Inserting ** "< samgfarm2.fnal.gov >"
7/27 20:59:24 stats: Inserting new hashent for 'Master':'samgfarm2.fnal.gov':'13
1.225.167.49'
7/27 20:59:30 Got QUERY_STARTD_ADS
7/27 20:59:30 (Sending 0 ads in response to query)
7/27 20:59:33 Got QUERY_SCHEDD_ADS
7/27 20:59:33 (Sending 0 ads in response to query)
7/27 20:59:43 WARNING:  No master ad for < fnpcsrv1.fnal.gov >
7/27 20:59:43 ScheddAd : Inserting ** "< fnpcsrv1.fnal.gov , 131.225.167.44
"
7/27 20:59:43 stats: Inserting new hashent for 'Schedd':'fnpcsrv1.fnal.gov':'131
.225.167.44'
7/27 20:59:43 SubmittorAd : Inserting ** "< yoo@xxxxxxxxxxxxxxxxxxxxxxxxx , 131
.225.167.44 >"
7/27 20:59:43 stats: Inserting new hashent for 'Submittor':'yoo@xxxxxxxx':'131.225.167.44' 7/27 20:59:43 SubmittorAd : Inserting ** "< rubin@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
31.225.167.44 >"
7/27 20:59:43 stats: Inserting new hashent for 'Submittor':'rubin@xxxxxxxx':'131
.225.167.44'
7/27 20:59:43 IO: Failed to read packet header
7/27 20:59:43 IO: Failed to read packet header
7/27 20:59:45 ** Master < fnpcsrv1.fnal.gov > rejuvenated from recently down 7/27 20:59:45 stats: Inserting new hashent for 'Master':'fnpcsrv1.fnal.gov':'131
.225.167.44'
7/27 20:59:50 WARNING:  No master ad for < vm1@xxxxxxxxxxxxxxxxx >
7/27 20:59:50 StartdAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxx , 131.225.167
.44 >"
7/27 20:59:50 stats: Inserting new hashent for 'Start':'vm1@xxxxxxxxxxxxxxxxx':'
131.225.167.44'
7/27 20:59:50 StartdPvtAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxx , 131.225.167
.44 >"
7/27 20:59:50 stats: Inserting new hashent for 'StartdPvt':'vm1@xxxxxxxxxxxxxxxx
v':'131.225.167.44'
7/27 20:59:51 WARNING:  No master ad for < vm2@xxxxxxxxxxxxxxxxx >
7/27 20:59:51 StartdAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxx , 131.225.167 .44 >"7/27 20:59:51 stats: Inserting new hashent for 'Start':'vm2@xxxxxxxxxxxxxxxxx':'
131.225.167.44'
7/27 20:59:51 StartdPvtAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxx , 131.225.167
.44 >"
7/27 20:59:51 stats: Inserting new hashent for 'StartdPvt':'vm2@xxxxxxxxxxxxxxxx
v':'131.225.167.44'
7/27 20:59:56 Got QUERY_STARTD_ADS
7/27 20:59:56 (Sending 2 ads in response to query)
7/27 21:00:02 Got QUERY_SCHEDD_ADS
7/27 21:00:02 (Sending 1 ads in response to query)
7/27 21:00:02 Got QUERY_SCHEDD_ADS
7/27 21:00:02 (Sending 1 ads in response to query)
7/27 21:00:02 Got QUERY_STARTD_ADS
7/27 21:00:02 (Sending 2 ads in response to query)
7/27 21:00:07 Got QUERY_STARTD_ADS
7/27 21:00:07 (Sending 2 ads in response to query)
7/27 21:00:07 Got QUERY_SUBMITTOR_ADS
7/27 21:00:07 (Sending 2 ads in response to query)
7/27 21:00:13 Got QUERY_STARTD_ADS
7/27 21:00:13 (Sending 2 ads in response to query)
7/27 21:00:18 Got QUERY_MASTER_ADS
7/27 21:00:18 (Sending 3 ads in response to query)7/27 21:00:22 Got QUERY_SCHEDD_ADS
7/27 21:00:22 (Sending 1 ads in response to query)
7/27 21:00:38 Got QUERY_STARTD_ADS
7/27 21:00:38 (Sending 2 ads in response to query)
7/27 21:00:51 WARNING:  No master ad for < quill@xxxxxxxxxxxxxxxxx >
7/27 21:00:51 QuillAd : Inserting ** "< quill@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
.gov >"
7/27 21:00:51 stats: Inserting new hashent for 'Quill':'quill@xxxxxxxxxxxxxxxxx'
:'131.225.167.42'
7/27 21:00:51 Got QUERY_SCHEDD_ADS
7/27 21:00:51 (Sending 1 ads in response to query)
7/27 21:00:55 MasterAd     : Inserting ** "< fngp-osg.fnal.gov >"
7/27 21:00:55 stats: Inserting new hashent for 'Master':'fngp-osg.fnal.gov':'131
.225.167.42'
7/27 21:00:57 ScheddAd : Inserting ** "< fngp-osg.fnal.gov , 131.225.167.42
"
7/27 21:00:57 stats: Inserting new hashent for 'Schedd':'fngp-osg.fnal.gov':'131
.225.167.42'
7/27 21:00:57 SubmittorAd : Inserting ** "< e907@xxxxxxxxxxxxxxxxxxxxxxxxx , 13
1.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'e907@xxxxxxxx':'131. 225.167.42'7/27 21:00:57 SubmittorAd : Inserting ** "< polly@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
31.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'polly@xxxxxxxx':'131
.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd : Inserting ** "< group_e898.minboone@xxxxxxxxxxxxxxx
g.fnal.gov , 131.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'group_e898.minboone@
fnal.gov':'131.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd : Inserting ** "< grase@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
31.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'grase@xxxxxxxx':'131
.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd : Inserting ** "< group_e907.mipp@xxxxxxxxxxxxxxxxxxx
al.gov , 131.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'group_e907.mipp@fnal
.gov':'131.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd : Inserting ** "< group_cdms.cdms@xxxxxxxxxxxxxxxxxxx al.gov , 131.225.167.42 >"7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'group_cdms.cdms@fnal
.gov':'131.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd : Inserting ** "< sdss@xxxxxxxxxxxxxxxxxxxxxxxxx , 13
1.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'sdss@xxxxxxxx':'131.
225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd : Inserting ** "< mis@xxxxxxxxxxxxxxxxxxxxxxxxx , 131
.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for 'Submittor':'mis@xxxxxxxx':'131.2
25.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:01:07 condor_read(): timeout reading buffer.
7/27 21:01:27 condor_read(): timeout reading buffer.
7/27 21:01:27 IO: Failed to read packet header
7/27 21:01:27 DaemonCore: Can't receive command request (perhaps a timeout?)
7/27 21:01:27 IO: Failed to read packet header
7/27 21:01:37 condor_read(): timeout reading buffer.
7/27 21:01:57 condor_read(): timeout reading buffer.
7/27 21:01:57 IO: Failed to read packet header
7/27 21:01:57 DaemonCore: Can't receive command request (perhaps a timeout?)
--More--((%)

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team