[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] 6.7.20 collector timeouts



If you are on windows..try adding this to all the local configs

SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
SEC_DEFAULT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE




>
>
>
>>
>> I am experiencing bad timeouts with the condor collector at 6.7.20
>> and have a ticket open with condor support on it. I am just bringing
>> this to the attention of condor-users to see if anyone else has
>> seen this problem.
>>
>> I have just completed restarting my whole system, all startd, schedd,
>> negotiator, collector, everything, and the problems are the same as
>> before.
>
> About 1/2 hour after the restart of the system, most of the timeouts
> appear to have stopped for now and condor_status
> comes back in reasonable time again, and the system
> is now filling up with jobs and nodes are getting busy again.
>
> Still interested in knowing what these error messages
> below are all about though.
>
> Thanks
>
> Steve Timm
>
>
>>I had not even started any of the compute nodes yet--just
>> starting up the three head nodes and we already had problems.
>> In general there are three or four timeout messages that I don't
>> understand
>> why they are happening, and each corresponds to holding up a
>> condor_status
>> command.  Typically condor_status takes 2-3 minutes to complete.
>> This is a new problem which didn't happen right away when we upgraded
>> to 6.7.20, only about a week later.  Also collector has crashed twice
>> during that time.
>
>
>>
>> So does anyone know what these error messages are in the CollectorLog
>> (debug right now is [root@fnpcsrv2 log]# condor_config_val
>> COLLECTOR_DEBUG
>> D_FULLDEBUG, D_COMMAND, D_SECURITY, D_PROTOCOL
>>
>> 7/27 21:04:27 DaemonCore: Can't receive command request (perhaps a
>> timeout?)
>> 7/27 21:04:37 condor_read(): timeout reading buffer.
>> 7/27 21:04:57 condor_read(): timeout reading buffer.
>> 7/27 21:04:57 IO: Failed to read packet header
>> 7/27 21:04:57 DaemonCore: Can't receive command request (perhaps a
>> timeout?)
>> 7/27 21:04:57 DaemonCore: received unregistered command request 25 !
>> 7/27 21:05:27 SECMAN: Error sending response classad!
>>
>> I didn't change any security settings going from 6.7.18 to 6.7.20,
>> I am just using host-based authentication.  Should I have changed any?
>> Why is SECMAN involved?
>> (I am using TCP to talk to the collector).
>>
>> Also see the following errors:
>>
>> /27 21:02:57 DC_AUTHENTICATE: attempt to open invalid session
>> fnpcsrv2:29720:1.
>> 7/27 21:03:57 DC_AUTHENTICATE: attempt to open invalid session
>> fnpcsrv2:29720:1.
>> 7/27 21:05:27 ERROR: DC_AUTHENTICATE unable to receive auth_info!
>> 7/27 21:14:31 ERROR: DC_AUTHENTICATE unable to receive auth_info!
>> 7/27 21:18:59 ERROR: DC_AUTHENTICATE unable to receive auth_info!
>> 7/27 21:19:59 ERROR: DC_AUTHENTICATE unable to receive auth_info!
>> 7/27 21:20:29 DC_AUTHENTICATE: attempt to open invalid session
>> fnpcsrv2:29720:1.
>>
>> As I write the E-mail there have been no timeouts since 21:32 and it is
>> now
>> 21:40, and condor_status comes back OK.  Much is claimed at the moment
>> but
>> little is busy.  Still interested in what is causing these errors above,
>> though, because I've seen the timeouts come and go.
>>
>> steve
>>
>>
>>
>> -----------------startup of collector is below.
>>
>>
>> 7/27 20:56:28 **** condor_collector (condor_COLLECTOR) EXITING WITH
>> STATUS
>> 0
>> 7/27 20:58:51 ******************************************************
>> 7/27 20:58:51 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
>> 7/27 20:58:51 ** /opt/condor-6.7.20/sbin/condor_collector
>> 7/27 20:58:51 ** $CondorVersion: 6.7.20 Jun 21 2006 $
>> 7/27 20:58:51 ** $CondorPlatform: I386-LINUX_RH9 $7/27 20:58:51 **
>> $CondorPlatform: I386-LINUX_RH9 $
>> 7/27 20:58:51 ** PID = 20646
>> 7/27 20:58:51 ** Log last touched 7/27 20:56:28
>> 7/27 20:58:51 ******************************************************
>> 7/27 20:58:51 Using config source: /etc/condor/condor_config
>> 7/27 20:58:51 Using local config sources:
>> 7/27 20:58:51    /opt/condor/etc/group_params.config
>> 7/27 20:58:51    /opt/condor/local/condor_config.local
>> 7/27 20:58:51 DaemonCore: Command Socket at <131.225.167.47:9618>
>> 7/27 20:58:51 In ViewServer::Init()
>> 7/27 20:58:51 In CollectorDaemon::Init()
>> 7/27 20:58:51 In ViewServer::Config()
>> 7/27 20:58:51 In CollectorDaemon::Config()
>> 7/27 20:58:51 Will forward ads on to View Server fnpcsrv1.fnal.gov
>> 7/27 20:58:51 enable: Creating stats hash table
>> 7/27 20:58:51 (Sending 0 ads in response to query)
>> 7/27 20:58:51 Got QUERY_STARTD_PVT_ADS
>> 7/27 20:58:51 (Sending 0 ads in response to query)
>> 7/27 20:58:51 NegotiatorAd  : Inserting ** "< fnpcsrv2.fnal.gov >"
>> 7/27 20:58:51 stats: Inserting new hashent for
>> 'Negotiator':'fnpcsrv2.fnal.gov':
>> '131.225.167.47'
>> 7/27 20:58:55 Got QUERY_STARTD_ADS
>> 7/27 20:58:55 (Sending 0 ads in response to query)
>> 7/27 20:58:56 MasterAd     : Inserting ** "< fnpcsrv2.fnal.gov >"
>> 7/27 20:58:56 stats: Inserting new hashent for
>> 'Master':'fnpcsrv2.fnal.gov':'131
>> .225.167.47'
>> 7/27 20:59:02 Got QUERY_STARTD_ADS
>> 7/27 20:59:02 (Sending 0 ads in response to query)
>> 7/27 20:59:20 Got QUERY_MASTER_ADS
>> 7/27 20:59:20 (Sending 1 ads in response to query)
>> 7/27 20:59:24 MasterAd     : Inserting ** "< samgfarm2.fnal.gov >"
>> 7/27 20:59:24 stats: Inserting new hashent for
>> 'Master':'samgfarm2.fnal.gov':'13
>> 1.225.167.49'
>> 7/27 20:59:30 Got QUERY_STARTD_ADS
>> 7/27 20:59:30 (Sending 0 ads in response to query)
>> 7/27 20:59:33 Got QUERY_SCHEDD_ADS
>> 7/27 20:59:33 (Sending 0 ads in response to query)
>> 7/27 20:59:43 WARNING:  No master ad for < fnpcsrv1.fnal.gov >
>> 7/27 20:59:43 ScheddAd     : Inserting ** "< fnpcsrv1.fnal.gov ,
>> 131.225.167.44
>>> "
>> 7/27 20:59:43 stats: Inserting new hashent for
>> 'Schedd':'fnpcsrv1.fnal.gov':'131
>> .225.167.44'
>> 7/27 20:59:43 SubmittorAd  : Inserting ** "<
>> yoo@xxxxxxxxxxxxxxxxxxxxxxxxx
>> , 131
>> .225.167.44 >"
>> 7/27 20:59:43 stats: Inserting new hashent for
>> 'Submittor':'yoo@xxxxxxxx':'131.225.167.44'
>> 7/27 20:59:43 SubmittorAd  : Inserting ** "<
>> rubin@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
>> 31.225.167.44 >"
>> 7/27 20:59:43 stats: Inserting new hashent for
>> 'Submittor':'rubin@xxxxxxxx':'131
>> .225.167.44'
>> 7/27 20:59:43 IO: Failed to read packet header
>> 7/27 20:59:43 IO: Failed to read packet header
>> 7/27 20:59:45 ** Master < fnpcsrv1.fnal.gov > rejuvenated from recently
>> down
>> 7/27 20:59:45 stats: Inserting new hashent for
>> 'Master':'fnpcsrv1.fnal.gov':'131
>> .225.167.44'
>> 7/27 20:59:50 WARNING:  No master ad for < vm1@xxxxxxxxxxxxxxxxx >
>> 7/27 20:59:50 StartdAd     : Inserting ** "< vm1@xxxxxxxxxxxxxxxxx ,
>> 131.225.167
>> .44 >"
>> 7/27 20:59:50 stats: Inserting new hashent for
>> 'Start':'vm1@xxxxxxxxxxxxxxxxx':'
>> 131.225.167.44'
>> 7/27 20:59:50 StartdPvtAd  : Inserting ** "< vm1@xxxxxxxxxxxxxxxxx ,
>> 131.225.167
>> .44 >"
>> 7/27 20:59:50 stats: Inserting new hashent for
>> 'StartdPvt':'vm1@xxxxxxxxxxxxxxxx
>> v':'131.225.167.44'
>> 7/27 20:59:51 WARNING:  No master ad for < vm2@xxxxxxxxxxxxxxxxx >
>> 7/27 20:59:51 StartdAd     : Inserting ** "< vm2@xxxxxxxxxxxxxxxxx ,
>> 131.225.167
>> .44 >"7/27 20:59:51 stats: Inserting new hashent for
>> 'Start':'vm2@xxxxxxxxxxxxxxxxx':'
>> 131.225.167.44'
>> 7/27 20:59:51 StartdPvtAd  : Inserting ** "< vm2@xxxxxxxxxxxxxxxxx ,
>> 131.225.167
>> .44 >"
>> 7/27 20:59:51 stats: Inserting new hashent for
>> 'StartdPvt':'vm2@xxxxxxxxxxxxxxxx
>> v':'131.225.167.44'
>> 7/27 20:59:56 Got QUERY_STARTD_ADS
>> 7/27 20:59:56 (Sending 2 ads in response to query)
>> 7/27 21:00:02 Got QUERY_SCHEDD_ADS
>> 7/27 21:00:02 (Sending 1 ads in response to query)
>> 7/27 21:00:02 Got QUERY_SCHEDD_ADS
>> 7/27 21:00:02 (Sending 1 ads in response to query)
>> 7/27 21:00:02 Got QUERY_STARTD_ADS
>> 7/27 21:00:02 (Sending 2 ads in response to query)
>> 7/27 21:00:07 Got QUERY_STARTD_ADS
>> 7/27 21:00:07 (Sending 2 ads in response to query)
>> 7/27 21:00:07 Got QUERY_SUBMITTOR_ADS
>> 7/27 21:00:07 (Sending 2 ads in response to query)
>> 7/27 21:00:13 Got QUERY_STARTD_ADS
>> 7/27 21:00:13 (Sending 2 ads in response to query)
>> 7/27 21:00:18 Got QUERY_MASTER_ADS
>> 7/27 21:00:18 (Sending 3 ads in response to query)7/27 21:00:22 Got
>> QUERY_SCHEDD_ADS
>> 7/27 21:00:22 (Sending 1 ads in response to query)
>> 7/27 21:00:38 Got QUERY_STARTD_ADS
>> 7/27 21:00:38 (Sending 2 ads in response to query)
>> 7/27 21:00:51 WARNING:  No master ad for < quill@xxxxxxxxxxxxxxxxx >
>> 7/27 21:00:51 QuillAd     : Inserting ** "<
>> quill@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>> .gov >"
>> 7/27 21:00:51 stats: Inserting new hashent for
>> 'Quill':'quill@xxxxxxxxxxxxxxxxx'
>> :'131.225.167.42'
>> 7/27 21:00:51 Got QUERY_SCHEDD_ADS
>> 7/27 21:00:51 (Sending 1 ads in response to query)
>> 7/27 21:00:55 MasterAd     : Inserting ** "< fngp-osg.fnal.gov >"
>> 7/27 21:00:55 stats: Inserting new hashent for
>> 'Master':'fngp-osg.fnal.gov':'131
>> .225.167.42'
>> 7/27 21:00:57 ScheddAd     : Inserting ** "< fngp-osg.fnal.gov ,
>> 131.225.167.42
>>> "
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Schedd':'fngp-osg.fnal.gov':'131
>> .225.167.42'
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> e907@xxxxxxxxxxxxxxxxxxxxxxxxx , 13
>> 1.225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'e907@xxxxxxxx':'131.
>> 225.167.42'7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> polly@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
>> 31.225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'polly@xxxxxxxx':'131
>> .225.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> group_e898.minboone@xxxxxxxxxxxxxxx
>> g.fnal.gov , 131.225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'group_e898.minboone@
>> fnal.gov':'131.225.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> grase@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
>> 31.225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'grase@xxxxxxxx':'131
>> .225.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> group_e907.mipp@xxxxxxxxxxxxxxxxxxx
>> al.gov , 131.225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'group_e907.mipp@fnal
>> .gov':'131.225.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> group_cdms.cdms@xxxxxxxxxxxxxxxxxxx
>> al.gov , 131.225.167.42 >"7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'group_cdms.cdms@fnal
>> .gov':'131.225.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> sdss@xxxxxxxxxxxxxxxxxxxxxxxxx , 13
>> 1.225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'sdss@xxxxxxxx':'131.
>> 225.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:00:57 SubmittorAd  : Inserting ** "<
>> mis@xxxxxxxxxxxxxxxxxxxxxxxxx
>> , 131
>> .225.167.42 >"
>> 7/27 21:00:57 stats: Inserting new hashent for
>> 'Submittor':'mis@xxxxxxxx':'131.2
>> 25.167.42'
>> 7/27 21:00:57 IO: Failed to read packet header
>> 7/27 21:01:07 condor_read(): timeout reading buffer.
>> 7/27 21:01:27 condor_read(): timeout reading buffer.
>> 7/27 21:01:27 IO: Failed to read packet header
>> 7/27 21:01:27 DaemonCore: Can't receive command request (perhaps a
>> timeout?)
>> 7/27 21:01:27 IO: Failed to read packet header
>> 7/27 21:01:37 condor_read(): timeout reading buffer.
>> 7/27 21:01:57 condor_read(): timeout reading buffer.
>> 7/27 21:01:57 IO: Failed to read packet header
>> 7/27 21:01:57 DaemonCore: Can't receive command request (perhaps a
>> timeout?)
>> --More--((%)
>>
>> ------------------------------------------------------------------
>> Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx
>> http://home.fnal.gov/~timm/
>> Fermilab Computing Div/Core Support Services Dept./Scientific Computing
>> Section
>> Assistant Group Leader, Farms and Clustered Systems Group
>> Lead of Computing Farms Team
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with
>> a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at either
>> https://lists.cs.wisc.edu/archive/condor-users/
>> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at either
> https://lists.cs.wisc.edu/archive/condor-users/
> http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>


----------------------------
Adam Chrystie
Pipeline/Infrastructure Supervisor
Ext#:
adamc@xxxxxxxxxxxxxx