
Re: [Condor-users] flocking / CCB



Hi Dan,

 

The error message has disappeared.  I did two things: I restarted condor on the processing nodes, and I changed PRIVATE_NETWORK_NAME to local, since our internal domain is local:

[root@condor-36 condor]# host condor-36

condor-36.local has address 10.178.6.36

 

I’m not sure which of those two changes fixed it, but it is fixed.  I previously had a unique identifier in PRIVATE_NETWORK_NAME (fsu-hpc-condor-private) that did not reflect our internal domain.
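
For the archives, the setting as it now stands looks roughly like this. It is only a sketch of the one line I changed, and since I can't say whether the rename or the restart was the actual fix, treat the value as what worked here rather than as a requirement:

```
# PRIVATE_NETWORK_NAME after the change; "local" matches our internal DNS domain
PRIVATE_NETWORK_NAME = local
```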

 

I’m sending this so my solution is stuffed into the archives :)

 

The full message from the log is below:

 

StartLog:06/09/12 16:33:04 CCBListener: registered with CCB server 10.178.6.5 as ccbid 144.174.50.29:9618?PrivNet=fsu-hpc-condor-private#124

StartLog:06/09/12 16:39:05 CCBListener: failed to receive message from CCB server 10.178.6.5

StartLog:06/09/12 16:39:05 CCBListener: connection to CCB server 10.178.6.5 failed; will try to reconnect in 60 seconds.

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: Tuesday, June 12, 2012 11:48 AM
To: condor-users@xxxxxxxxxxx
Subject: Re: [Condor-users] flocking / CCB

 

Hi Don,


06/09/12 16:39:05 CCBListener: failed to receive message from CCB server 10.178.6.5

 


Could you provide more logs?  I'm specifically interested in any log message containing CCB.

It also may be helpful to add D_FULLDEBUG and D_COMMAND to COLLECTOR_DEBUG on the machine serving as your CCB server.  This will give you messages when daemons try to register themselves for CCB access.
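
As a rough sketch, that would be a line like the following in the condor_config on the CCB server machine. This assumes there are no other COLLECTOR_DEBUG flags already set there that you want to keep; if there are, append these two to the existing list instead. The collector then needs to be reconfigured or restarted for the change to take effect.

```
# Extra collector logging, including incoming commands such as CCB registrations
COLLECTOR_DEBUG = D_FULLDEBUG D_COMMAND
```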

--Dan

On 6/9/12 4:16 PM, Shrum, Donald C wrote:

I'm trying to get a test job to flock between FSU and USF here in Florida.

 

As our cluster is on a private network and we have a public IP only on the central manager, I added the following to condor_config on the central manager:

 

PRIVATE_NETWORK_NAME = fsu-hpc-condor-private

PRIVATE_NETWORK_INTERFACE = 10.178.6.5

 

 

I added CCB_ADDRESS and the same PRIVATE_NETWORK_NAME to the processing nodes' condor_config.
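
A sketch of what that looks like on a processing node is below. The CCB_ADDRESS value is an assumption for illustration (pointing it at the collector is the usual pattern); the network name is the same one set on the central manager.

```
# condor_config on a processing node (sketch)
# Assumption: the collector on the central manager acts as the CCB server
CCB_ADDRESS = $(COLLECTOR_HOST)
# Same private network name as on the central manager
PRIVATE_NETWORK_NAME = fsu-hpc-condor-private
```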

 

As far as I can tell, the CCB server runs as part of the collector, so I don't need to explicitly set it to run.

 

 

 

I must be missing something simple in the setup.  I see errors that read:

06/09/12 16:39:05 CCBListener: failed to receive message from CCB server 10.178.6.5

 

I ran condor_reconfig on the processing nodes.  Do I need to restart condor on all the nodes as a result of the change?  The error message makes me think not.

 

Any pointers to debug this would be appreciated.

 

Thanks for the help.

 

Don

FSU HPC

 



