
[Condor-users] CCB errors leading to job evictions.



We are running Condor 7.6.1 in a Xen virtual machine (both the physical host and the VM run Scientific Linux SL release 5.5 (Boron)). Between 6% and 10% of our jobs are evicted and restarted multiple times, apparently because of CCB failures. Jobs also often hit CCB errors at startup, which delays them. The following ShadowLog extract is for one job that experienced both kinds of issue:


Problems when starting:

06/04/11 06:35:34 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:34 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:34 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:42 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:42 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:42 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:35:59 (3331.0) (25184): CCBClient: received failure message from CCB server collector 206.12.154.58:9618 in response to request for reversed connection to startd vm192.cloud.nrc.ca: CCB server rejecting request for ccbid 9024 because no daemon is currently registered with that id (perhaps it recently disconnected).
06/04/11 06:35:59 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:35:59 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>
06/04/11 06:36:51 (3331.0) (25184): Failed to reverse connect to startd vm192.cloud.nrc.ca via CCB.
06/04/11 06:36:51 (3331.0) (25184): locateStarter(): Failed to connect to startd <132.246.148.92:40009?CCBID=206.12.154.58:9618#9024>


Communication error leading to eviction:

06/04/11 06:44:59 (3331.0) (25184): Job 3331.0 is being evicted from vm192.cloud.nrc.ca
06/04/11 06:45:00 (3331.0) (25184): condor_read(): Socket closed when trying to read 21 bytes from <132.246.148.92:40009>
06/04/11 06:45:00 (3331.0) (25184): DCStartd::deactivateClaim: failed to read response ad.
06/04/11 06:45:00 (3331.0) (25184): **** condor_shadow (condor_SHADOW) pid 25184 EXITING WITH STATUS 107
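
For anyone who wants to check their own ShadowLog for the same pattern, here is a rough Python sketch that counts failed CCB reverse connections and evictions per job. It assumes one log entry per line, with the "date time (cluster.proc) (pid): message" prefix seen in the messages quoted here; the function name summarize_shadow_log and the two message substrings it looks for are my own choices for illustration, not anything provided by Condor.

```python
import re
from collections import Counter

# Prefix of a ShadowLog entry as seen in the extract:
# "MM/DD/YY HH:MM:SS (cluster.proc) (pid): message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \((\d+\.\d+)\) \((\d+)\): (.*)$"
)


def summarize_shadow_log(lines):
    """Return (ccb_failures, evictions), each a Counter keyed by job id.

    Hypothetical helper: it matches only the two message texts quoted
    in this post and ignores every other line.
    """
    ccb_failures = Counter()
    evictions = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a job-tagged log entry
        _timestamp, job_id, _pid, message = match.groups()
        if message.startswith("Failed to reverse connect to startd"):
            ccb_failures[job_id] += 1
        elif "is being evicted from" in message:
            evictions[job_id] += 1
    return ccb_failures, evictions
```

To use it, pass an open file object for your ShadowLog (its location depends on your configuration) and compare the two counters to see which jobs saw both CCB failures and evictions.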

Has anyone else experienced these kinds of problems?



--
Colin Leavett-Brown
Department of Physics & Astronomy
University of Victoria
250-721-7728