
Re: [Condor-users] CCB errors leading to job evictions.



Thanks Dan for getting back to me so quickly. You've given me a few things to chase down; I'll let you know how it goes. Colin.

Dan Bradley wrote:
Colin,

The situation you describe is caused by a job getting matched to a startd that is no longer connected to the CCB server. If you can track down the startd logs, it would be helpful to determine why the startd was no longer connected. Was the startd dead? If not, did the startd notice that it was disconnected from the CCB server? If not, perhaps some network device silently dropped the connection. In some cases, that can be avoided by configuring a shorter CCB_HEARTBEAT_INTERVAL, which forces more frequent activity on the connection.

Another possible explanation is that the CCB server (i.e. the collector) is running low on resources and therefore is failing to stay connected to all of the daemons. I have seen this happen when using iptables with too small a value for ip_conntrack_max on the CCB server machine.
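One way to check whether the connection tracking table is close to its limit on the CCB server is sketched below. This is only an illustration: it assumes the legacy ip_conntrack module, which exposes ip_conntrack_count and ip_conntrack_max under /proc/sys/net/ipv4/netfilter (newer kernels typically use nf_conntrack names in a different directory), and the helper name conntrack_usage is made up for this example. It needs Python 3.

```python
from pathlib import Path

# Assumption: legacy ip_conntrack sysctl location (2.6.18-era kernels).
# Newer kernels usually expose nf_conntrack_* files elsewhere.
CONNTRACK_DIR = Path("/proc/sys/net/ipv4/netfilter")


def conntrack_usage(base_dir=CONNTRACK_DIR):
    """Return (count, maximum, fraction_used) for the conntrack table.

    base_dir is a parameter so the function can be pointed at another
    directory, for example on kernels with a different layout.
    """
    base = Path(base_dir)
    # Each file holds a single integer followed by a newline.
    count = int((base / "ip_conntrack_count").read_text().strip())
    maximum = int((base / "ip_conntrack_max").read_text().strip())
    return count, maximum, count / maximum
```

Calling conntrack_usage() on the collector machine and watching the returned fraction over time should show whether the table is filling up. A value that stays near 1.0 would be consistent with connections being dropped for lack of tracking slots.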

Hope that helps.

--Dan

On 6/14/11 4:53 PM, Colin Leavett-Brown wrote:
We are running Condor 7.6.1 in a Xen virtual machine (both the real host and the VM have Scientific Linux SL release 5.5 (Boron) installed), and between 6% and 10% of our jobs are being evicted and restarted multiple times, apparently because of CCB failures. Jobs also often hit CCB errors when starting, which delays them. The following ShadowLog extract is for a job that experienced both kinds of issue:


Problems when starting:

06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:34 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:34 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:42 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:42 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:42 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:59 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:59 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:59 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:36:51 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:36:51 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>


Communication error leading to eviction:

06/04/11 06:44:59 (3331.0) (25184): Job 3331.0 is being evicted from vm192.cloud.nrc.ca
06/04/11 06:45:00 (3331.0) (25184): condor_read(): Socket closed when trying to read 21 bytes from <132.246.148.92:40009>
06/04/11 06:45:00 (3331.0) (25184): DCStartd::deactivateClaim: failed to read response ad.
06/04/11 06:45:00 (3331.0) (25184): **** condor_shadow (condor_SHADOW) pid 25184 EXITING WITH STATUS 107

Has anyone else experienced these kinds of problems?
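To see which startds account for most of the CCB rejections, a ShadowLog can be scanned for the "CCBClient: received failure message" lines and tallied per startd and ccbid. The sketch below is only illustrative: the regular expression is inferred from the sample lines in this message, not from any documented log format, it expects one log entry per line, and the names CCB_REJECT and count_ccb_rejections are made up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the ShadowLog samples in this thread (an assumption,
# not an official format): date, time, (job id), (pid), then the CCB message.
CCB_REJECT = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \(\d+\): CCBClient: received failure message "
    r".*?to startd (?P<startd>\S+): .*?ccbid (?P<ccbid>\d+)"
)


def count_ccb_rejections(lines):
    """Count CCB rejection messages per (startd hostname, ccbid)."""
    counts = Counter()
    for line in lines:
        match = CCB_REJECT.match(line)
        if match:
            counts[(match.group("startd"), match.group("ccbid"))] += 1
    return counts
```

Passing an open ShadowLog file object to count_ccb_rejections() and printing counts.most_common() would list the worst-affected startds first, which may help decide which StartLog files to look at.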




--
Colin Leavett-Brown
Department of Physics & Astronomy
University of Victoria
250-721-7728