
Re: [Condor-users] production ccb pool not communicating



I forgot to mention that we're using this version:

[samgrid@samgfwd06 ~]$ condor_version
$CondorVersion: 7.3.1 May 19 2009 BuildID: 154007 $
$CondorPlatform: I386-LINUX_RHEL3 $

This system was set up before 7.4 came out, and we wanted to use CCB. Should I do a quick upgrade to the released 7.4.x? Maybe bugs relating to this were fixed there?

Thanks,

joe

Joe Boyd wrote:
Hello,

I've got a production condor pool that has been running for a while but is now getting communication errors with multiple remote sites. CCB is used to talk to the remote sites. The local machine is 131.225.216.64 and the schedd log seems to be saying that the local schedd can't connect to the local CCB servers running on ports 9877, 9878, and 9879.

This pool was in use and I'm not aware of any config changes being made. I have restarted the pool and removed all the remote glideins, since jobs weren't running anyway. Nothing helped.

The remote machines do seem to be properly reporting back to the Collector via CCB and it's the local daemons that don't seem to be able to communicate.
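One way to confirm this independently of Condor is to try plain TCP connections from the local machine to the three CCB ports. The following is only a minimal sketch: the host and ports are the ones mentioned above, the 20 second default mirrors the timeout seen in the logs, and the function names `can_connect` and `check_ports` are made up for illustration.

```python
import socket


def can_connect(host, port, timeout=20.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=20.0):
    """Map each port to whether a TCP connection to it could be opened."""
    return {port: can_connect(host, port, timeout) for port in ports}


# Example call (not executed here), using the address and ports from this message:
#   check_ports("131.225.216.64", [9877, 9878, 9879])
```

If every port comes back False from the local machine while the remote glideins can still reach the Collector, that would point at something local (firewall rules, or the collectors not accepting new connections) rather than at the remote sites.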

Any help appreciated.

joe


Here is a snippet of the ShadowLog:

02/26 09:32:19 (63446.0) (21089): attempt to connect to <131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:19 (63446.0) (21089): Failed to reverse connect to startd glidein_19053@xxxxxxxxxxxxxxxxxxxx via CCB.
02/26 09:32:19 (63446.0) (21089): glidein_19053@xxxxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send command ACTIVATE_CLAIM to the startd
02/26 09:32:19 (63446.0) (21089): Job 63446.0 is being evicted from glidein_19053@xxxxxxxxxxxxxxxxxxxx
02/26 09:32:25 (63401.0) (21462): attempt to connect to <131.225.216.64:9879> failed: timed out after 20 seconds.
02/26 09:32:25 (63401.0) (21462): Failed to reverse connect to startd glidein_1605@xxxxxxxxxxxxxxxxxxxx via CCB.
02/26 09:32:25 (63401.0) (21462): glidein_1605@xxxxxxxxxxxxxxxxxxxx: DCStartd::activateClaim: Failed to send command ACTIVATE_CLAIM to the startd
02/26 09:32:25 (63401.0) (21462): Job 63401.0 is being evicted from glidein_1605@xxxxxxxxxxxxxxxxxxxx
02/26 09:32:33 (64305.0) (20721): attempt to connect to <131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:33 (64305.0) (20721): Failed to reverse connect to <145.100.48.37:53084> via CCB.
02/26 09:32:33 (64305.0) (20721): RemoteResource::killStarter(): Could not send command to startd
02/26 09:32:33 (64305.0) (20721): logEvictEvent with unknown reason (108), aborting
02/26 09:32:33 (64305.0) (20721): **** condor_shadow (condor_SHADOW) pid 20721 EXITING WITH STATUS 108
02/26 09:32:34 (63430.0) (19812): attempt to connect to <131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:34 (63430.0) (19812): Failed to reverse connect to <134.158.73.56:52363> via CCB.
02/26 09:32:34 (63430.0) (19812): RemoteResource::killStarter(): Could not send command to startd
02/26 09:32:34 (63430.0) (19812): logEvictEvent with unknown reason (108), aborting
02/26 09:32:34 (63430.0) (19812): **** condor_shadow (condor_SHADOW) pid 19812 EXITING WITH STATUS 108
02/26 09:32:35 (63418.0) (20805): attempt to connect to <131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:32:35 (63418.0) (20805): Failed to reverse connect to <134.158.73.11:36610> via CCB.


And a snippet from the SchedLog:

02/26 09:58:03 (pid:26356) CCBClient: no more CCB servers to try for requesting reversed connection to startd at <155.198.216.230:58302>; giving up.
02/26 09:58:03 (pid:26356) Failed to send RELEASE_CLAIM to startd at <155.198.216.230:58302>: SECMAN:2003:TCP connection to startd at <155.198.216.230:58302> failed.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9878> failed: timed out after 20 seconds.
02/26 09:58:13 (pid:26356) attempt to connect to <131.225.216.64:9877> failed: timed out after 20 seconds.
02/26 09:58:13 (pid:26356) Failed to send CCB_REQUEST to collector 131.225.216.64:9878: SECMAN:2003:TCP connection to collector 131.225.216.64:9878 failed.
02/26 09:58:13 (pid:26356) CCBClient: no more CCB servers to try for requesting reversed connection to startd at <134.158.73.87:51338>; giving up.
02/26 09:58:13 (pid:26356) Failed to send RELEASE_CLAIM to startd at <134.158.73.87:51338>: SECMAN:2003:TCP connection to startd at <134.158.73.87:51338> failed.
02/26 09:58:13 (pid:26356) Failed to send CCB_REQUEST to collector 131.225.216.64:9877: SECMAN:2003:TCP connection to collector 131.225.216.64:9877 failed.
02/26 09:58:13 (pid:26356) CCBClient: no more CCB servers to try for requesting reversed connection to startd at <155.198.216.131:41092>; giving up.
02/26 09:58:13 (pid:26356) Failed to send RELEASE_CLAIM to startd at <155.198.216.131:41092>: SECMAN:2003:TCP connection to startd at <155.198.216.131:41092> failed.
02/26 09:58:23 (pid:26356) attempt to connect to <155.198.217.34:60842> failed: No route to host (connect errno = 113).
02/26 09:58:23 (pid:26356) Failed to send REQUEST_CLAIM to startd glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for samgrid@xxxxxxxxxxxxxxxxxx: SECMAN:2003:TCP connection to startd glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for samgrid@xxxxxxxxxxxxxxxxxx failed.
02/26 09:58:23 (pid:26356) Match record (glidein_13694@xxxxxxxxxxxxxxxxxxxxx <155.198.217.34:60842> for samgrid@xxxxxxxxxxxxxxxxxx, 62202.0) deleted
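
To see at a glance which endpoints are failing and why, the "attempt to connect to ... failed" messages in these logs can be tallied per address and reason. This is just a rough sketch, assuming the message wording shown in the snippets above; the function name `count_connect_failures` is hypothetical and not part of Condor.

```python
import re
from collections import Counter

# Matches messages such as:
#   attempt to connect to <131.225.216.64:9878> failed: timed out after 20 seconds.
# Group 1 is the host:port endpoint, group 2 the reason up to the closing period.
FAILURE_RE = re.compile(r"attempt to connect to <([^>]+)> failed: ([^.]+)\.")


def count_connect_failures(log_text):
    """Count connection failures per (endpoint, reason) pair in a block of log text."""
    return Counter(
        (match.group(1), match.group(2)) for match in FAILURE_RE.finditer(log_text)
    )
```

Feeding it the contents of the ShadowLog or SchedLog and printing `most_common()` on the result separates the timeouts against the local CCB ports from failures against remote startds, such as the "No route to host" entry above.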

One of the Collector logs:

02/26 11:22:17 MasterAd : Inserting ** "< glidein_9430@xxxxxxxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:17 stats: Inserting new hashent for 'Master':'glidein_9430@xxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.98.36'
02/26 11:22:21 StartdAd : Inserting ** "< monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx , 194.171.99.24 >"
02/26 11:22:21 stats: Inserting new hashent for 'Start':'monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.24'
02/26 11:22:21 StartdPvtAd : Inserting ** "< monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx , 194.171.99.24 >"
02/26 11:22:21 stats: Inserting new hashent for 'StartdPvt':'monitor_4874@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.24'
02/26 11:22:22 MasterAd : Inserting ** "< glidein_5940@xxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:22 stats: Inserting new hashent for 'Master':'glidein_5940@xxxxxxxxxxxxxxxxxxxx':'134.158.73.25'
02/26 11:22:23 MasterAd : Inserting ** "< glidein_9490@xxxxxxxxxxxxxxxxxxxxxxxxxx >"
02/26 11:22:23 stats: Inserting new hashent for 'Master':'glidein_9490@xxxxxxxxxxxxxxxxxxxxxxxxxx':'194.171.99.133'
02/26 11:22:23 condor_write(): Socket closed when trying to write 285 bytes to <194.171.99.140:41788>, fd is 1064
02/26 11:22:23 Buf::write(): condor_write() failed
02/26 11:22:23 SECMAN: Error sending response classad!
MyType = "(unknown type)"
TargetType = "(unknown type)"
AuthMethods = "GSI"
CryptoMethods = "3DES,BLOWFISH"
OutgoingNegotiation = "REQUIRED"
Authentication = "REQUIRED"
Encryption = "OPTIONAL"
Integrity = "REQUIRED"
Enact = "NO"
Subsystem = "MASTER"
ServerPid = 12469
SessionDuration = "60"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.3.1 May 19 2009 BuildID: 154007 $"
ServerCommandSock = "<194.171.99.140:57692>"
Command = 67
02/26 11:22:23 condor_write(): Socket closed when trying to write 288 bytes to <194.171.99.139:45429>, fd is 1064
02/26 11:22:23 Buf::write(): condor_write() failed
02/26 11:22:23 SECMAN: Error sending response classad!