[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Condor-users] 6.7.20 collector timeouts



On Fri, 28 Jul 2006, Dan Bradley wrote:


This sounds a lot to me like some non-condor process was connecting to
he collector's command port.  If you have the problem again, you may
want to trace the network activity.

There certainly are such non-condor processes floating around
and about Fermilab.  We have our
own denial-of-service engine--its modus operandi is to scan
for ports that look like they can support any type of rudimentary
http connection--if it finds one, then it scans all the ports
0-65536 on the machine to map for other ports.  We have seen evidence
of it connecting to condor ports in the past.  But no obvious footprints
here at the time to which I was referring.  Ran several tcpdumps looking
for evidence too.

Are you saying that such a port-mucker could either (1)  put the collector
into a wierd state such as we saw or (2) even crash it?

I went for a full restart of everything, which appears to have worked
overnight, after having experienced something like this on another
cluster where everything was going slow and there were lots of
timeouts.

In this case--a couple days before we had restarted our busiest
schedd, and kept the Collector going, then yesterday we had upped
the number of sockets in the collector cache on our collectors.
Things were slow but in the evening it got to the point
where condor_submit was just hanging for ever and that was
when I decided that total restart couldn't hurt anything.

Thanks

Steve Timm




--Dan

Steven Timm wrote:




I am experiencing bad timeouts with the condor collector at 6.7.20
and have a ticket open with condor support on it. I am just bringing
this to the attention of condor-users to see if anyone else has
seen this problem.

I have just completed restarting my whole system, all startd, schedd,
negotiator, collector, everything, and the problems are the same as
before.



About 1/2 hour after the restart of the system, most of the timeouts
appear to have stopped for now and condor_status
comes back in reasonable time again, and the system
is now filling up with jobs and nodes are getting busy again.

Still interested in knowing what these error messages
below are all about though.

Thanks

Steve Timm




I had not even started any of the compute nodes yet--just
starting up the three head nodes and we already had problems.
In general there are three or four timeout messages that I don't
understand
why they are happening, and each corresponds to holding up a condor_status
command.  Typically condor_status takes 2-3 minutes to complete.
This is a new problem which didn't happen right away when we upgraded
to 6.7.20, only about a week later.  Also collector has crashed twice
during that time.






So does anyone know what these error messages are in the CollectorLog
(debug right now is [root@fnpcsrv2 log]# condor_config_val COLLECTOR_DEBUG
D_FULLDEBUG, D_COMMAND, D_SECURITY, D_PROTOCOL

7/27 21:04:27 DaemonCore: Can't receive command request (perhaps a
timeout?)
7/27 21:04:37 condor_read(): timeout reading buffer.
7/27 21:04:57 condor_read(): timeout reading buffer.
7/27 21:04:57 IO: Failed to read packet header
7/27 21:04:57 DaemonCore: Can't receive command request (perhaps a
timeout?)
7/27 21:04:57 DaemonCore: received unregistered command request 25 !
7/27 21:05:27 SECMAN: Error sending response classad!

I didn't change any security settings going from 6.7.18 to 6.7.20,
I am just using host-based authentication.  Should I have changed any?
Why is SECMAN involved?
(I am using TCP to talk to the collector).

Also see the following errors:

/27 21:02:57 DC_AUTHENTICATE: attempt to open invalid session
fnpcsrv2:29720:1.
7/27 21:03:57 DC_AUTHENTICATE: attempt to open invalid session
fnpcsrv2:29720:1.
7/27 21:05:27 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:14:31 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:18:59 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:19:59 ERROR: DC_AUTHENTICATE unable to receive auth_info!
7/27 21:20:29 DC_AUTHENTICATE: attempt to open invalid session
fnpcsrv2:29720:1.

As I write the E-mail there have been no timeouts since 21:32 and it is
now
21:40, and condor_status comes back OK.  Much is claimed at the moment but
little is busy.  Still interested in what is causing these errors above,
though, because I've seen the timeouts come and go.

steve



-----------------startup of collector is below.


7/27 20:56:28 **** condor_collector (condor_COLLECTOR) EXITING WITH STATUS
0
7/27 20:58:51 ******************************************************
7/27 20:58:51 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
7/27 20:58:51 ** /opt/condor-6.7.20/sbin/condor_collector
7/27 20:58:51 ** $CondorVersion: 6.7.20 Jun 21 2006 $
7/27 20:58:51 ** $CondorPlatform: I386-LINUX_RH9 $7/27 20:58:51 **
$CondorPlatform: I386-LINUX_RH9 $
7/27 20:58:51 ** PID = 20646
7/27 20:58:51 ** Log last touched 7/27 20:56:28
7/27 20:58:51 ******************************************************
7/27 20:58:51 Using config source: /etc/condor/condor_config
7/27 20:58:51 Using local config sources:
7/27 20:58:51    /opt/condor/etc/group_params.config
7/27 20:58:51    /opt/condor/local/condor_config.local
7/27 20:58:51 DaemonCore: Command Socket at <131.225.167.47:9618>
7/27 20:58:51 In ViewServer::Init()
7/27 20:58:51 In CollectorDaemon::Init()
7/27 20:58:51 In ViewServer::Config()
7/27 20:58:51 In CollectorDaemon::Config()
7/27 20:58:51 Will forward ads on to View Server fnpcsrv1.fnal.gov
7/27 20:58:51 enable: Creating stats hash table
7/27 20:58:51 (Sending 0 ads in response to query)
7/27 20:58:51 Got QUERY_STARTD_PVT_ADS
7/27 20:58:51 (Sending 0 ads in response to query)
7/27 20:58:51 NegotiatorAd  : Inserting ** "< fnpcsrv2.fnal.gov >"
7/27 20:58:51 stats: Inserting new hashent for
'Negotiator':'fnpcsrv2.fnal.gov':
'131.225.167.47'
7/27 20:58:55 Got QUERY_STARTD_ADS
7/27 20:58:55 (Sending 0 ads in response to query)
7/27 20:58:56 MasterAd     : Inserting ** "< fnpcsrv2.fnal.gov >"
7/27 20:58:56 stats: Inserting new hashent for
'Master':'fnpcsrv2.fnal.gov':'131
.225.167.47'
7/27 20:59:02 Got QUERY_STARTD_ADS
7/27 20:59:02 (Sending 0 ads in response to query)
7/27 20:59:20 Got QUERY_MASTER_ADS
7/27 20:59:20 (Sending 1 ads in response to query)
7/27 20:59:24 MasterAd     : Inserting ** "< samgfarm2.fnal.gov >"
7/27 20:59:24 stats: Inserting new hashent for
'Master':'samgfarm2.fnal.gov':'13
1.225.167.49'
7/27 20:59:30 Got QUERY_STARTD_ADS
7/27 20:59:30 (Sending 0 ads in response to query)
7/27 20:59:33 Got QUERY_SCHEDD_ADS
7/27 20:59:33 (Sending 0 ads in response to query)
7/27 20:59:43 WARNING:  No master ad for < fnpcsrv1.fnal.gov >
7/27 20:59:43 ScheddAd     : Inserting ** "< fnpcsrv1.fnal.gov ,
131.225.167.44


"


7/27 20:59:43 stats: Inserting new hashent for
'Schedd':'fnpcsrv1.fnal.gov':'131
.225.167.44'
7/27 20:59:43 SubmittorAd  : Inserting ** "< yoo@xxxxxxxxxxxxxxxxxxxxxxxxx
, 131
.225.167.44 >"
7/27 20:59:43 stats: Inserting new hashent for
'Submittor':'yoo@xxxxxxxx':'131.225.167.44'
7/27 20:59:43 SubmittorAd  : Inserting ** "<
rubin@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
31.225.167.44 >"
7/27 20:59:43 stats: Inserting new hashent for
'Submittor':'rubin@xxxxxxxx':'131
.225.167.44'
7/27 20:59:43 IO: Failed to read packet header
7/27 20:59:43 IO: Failed to read packet header
7/27 20:59:45 ** Master < fnpcsrv1.fnal.gov > rejuvenated from recently
down
7/27 20:59:45 stats: Inserting new hashent for
'Master':'fnpcsrv1.fnal.gov':'131
.225.167.44'
7/27 20:59:50 WARNING:  No master ad for < vm1@xxxxxxxxxxxxxxxxx >
7/27 20:59:50 StartdAd     : Inserting ** "< vm1@xxxxxxxxxxxxxxxxx ,
131.225.167
.44 >"
7/27 20:59:50 stats: Inserting new hashent for
'Start':'vm1@xxxxxxxxxxxxxxxxx':'
131.225.167.44'
7/27 20:59:50 StartdPvtAd  : Inserting ** "< vm1@xxxxxxxxxxxxxxxxx ,
131.225.167
.44 >"
7/27 20:59:50 stats: Inserting new hashent for
'StartdPvt':'vm1@xxxxxxxxxxxxxxxx
v':'131.225.167.44'
7/27 20:59:51 WARNING:  No master ad for < vm2@xxxxxxxxxxxxxxxxx >
7/27 20:59:51 StartdAd     : Inserting ** "< vm2@xxxxxxxxxxxxxxxxx ,
131.225.167
.44 >"7/27 20:59:51 stats: Inserting new hashent for
'Start':'vm2@xxxxxxxxxxxxxxxxx':'
131.225.167.44'
7/27 20:59:51 StartdPvtAd  : Inserting ** "< vm2@xxxxxxxxxxxxxxxxx ,
131.225.167
.44 >"
7/27 20:59:51 stats: Inserting new hashent for
'StartdPvt':'vm2@xxxxxxxxxxxxxxxx
v':'131.225.167.44'
7/27 20:59:56 Got QUERY_STARTD_ADS
7/27 20:59:56 (Sending 2 ads in response to query)
7/27 21:00:02 Got QUERY_SCHEDD_ADS
7/27 21:00:02 (Sending 1 ads in response to query)
7/27 21:00:02 Got QUERY_SCHEDD_ADS
7/27 21:00:02 (Sending 1 ads in response to query)
7/27 21:00:02 Got QUERY_STARTD_ADS
7/27 21:00:02 (Sending 2 ads in response to query)
7/27 21:00:07 Got QUERY_STARTD_ADS
7/27 21:00:07 (Sending 2 ads in response to query)
7/27 21:00:07 Got QUERY_SUBMITTOR_ADS
7/27 21:00:07 (Sending 2 ads in response to query)
7/27 21:00:13 Got QUERY_STARTD_ADS
7/27 21:00:13 (Sending 2 ads in response to query)
7/27 21:00:18 Got QUERY_MASTER_ADS
7/27 21:00:18 (Sending 3 ads in response to query)7/27 21:00:22 Got
QUERY_SCHEDD_ADS
7/27 21:00:22 (Sending 1 ads in response to query)
7/27 21:00:38 Got QUERY_STARTD_ADS
7/27 21:00:38 (Sending 2 ads in response to query)
7/27 21:00:51 WARNING:  No master ad for < quill@xxxxxxxxxxxxxxxxx >
7/27 21:00:51 QuillAd     : Inserting ** "<
quill@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
.gov >"
7/27 21:00:51 stats: Inserting new hashent for
'Quill':'quill@xxxxxxxxxxxxxxxxx'
:'131.225.167.42'
7/27 21:00:51 Got QUERY_SCHEDD_ADS
7/27 21:00:51 (Sending 1 ads in response to query)
7/27 21:00:55 MasterAd     : Inserting ** "< fngp-osg.fnal.gov >"
7/27 21:00:55 stats: Inserting new hashent for
'Master':'fngp-osg.fnal.gov':'131
.225.167.42'
7/27 21:00:57 ScheddAd     : Inserting ** "< fngp-osg.fnal.gov ,
131.225.167.42


"


7/27 21:00:57 stats: Inserting new hashent for
'Schedd':'fngp-osg.fnal.gov':'131
.225.167.42'
7/27 21:00:57 SubmittorAd  : Inserting ** "<
e907@xxxxxxxxxxxxxxxxxxxxxxxxx , 13
1.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'e907@xxxxxxxx':'131.
225.167.42'7/27 21:00:57 SubmittorAd  : Inserting ** "<
polly@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
31.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'polly@xxxxxxxx':'131
.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd  : Inserting ** "<
group_e898.minboone@xxxxxxxxxxxxxxx
g.fnal.gov , 131.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'group_e898.minboone@
fnal.gov':'131.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd  : Inserting ** "<
grase@xxxxxxxxxxxxxxxxxxxxxxxxx , 1
31.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'grase@xxxxxxxx':'131
.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd  : Inserting ** "<
group_e907.mipp@xxxxxxxxxxxxxxxxxxx
al.gov , 131.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'group_e907.mipp@fnal
.gov':'131.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd  : Inserting ** "<
group_cdms.cdms@xxxxxxxxxxxxxxxxxxx
al.gov , 131.225.167.42 >"7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'group_cdms.cdms@fnal
.gov':'131.225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd  : Inserting ** "<
sdss@xxxxxxxxxxxxxxxxxxxxxxxxx , 13
1.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'sdss@xxxxxxxx':'131.
225.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:00:57 SubmittorAd  : Inserting ** "< mis@xxxxxxxxxxxxxxxxxxxxxxxxx
, 131
.225.167.42 >"
7/27 21:00:57 stats: Inserting new hashent for
'Submittor':'mis@xxxxxxxx':'131.2
25.167.42'
7/27 21:00:57 IO: Failed to read packet header
7/27 21:01:07 condor_read(): timeout reading buffer.
7/27 21:01:27 condor_read(): timeout reading buffer.
7/27 21:01:27 IO: Failed to read packet header
7/27 21:01:27 DaemonCore: Can't receive command request (perhaps a
timeout?)
7/27 21:01:27 IO: Failed to read packet header
7/27 21:01:37 condor_read(): timeout reading buffer.
7/27 21:01:57 condor_read(): timeout reading buffer.
7/27 21:01:57 IO: Failed to read packet header
7/27 21:01:57 DaemonCore: Can't receive command request (perhaps a
timeout?)
--More--((%)

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR



_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR