[Condor-users] Large number of shadow exceptions due to Connection time out



Hi all,

We are currently seeing a large number of shadows dying due to connection
timeouts. These are almost certainly caused by a couple of issues our network
is having right now. However, is there any setting in Condor or the Linux
kernel that would mitigate this a bit as a short-term solution, until we can
weed out the networking problems at their root?

The messages look like this (a DAG running standard universe jobs):

000 (28889363.000.000) 11/15 23:19:55 Job submitted from host: <10.20.30.2:51388>
     DAG Node: A9630


001 (28889363.000.000) 11/15 23:20:27 Job executing on host: <10.10.4.85:36245>


007 (28889363.000.000) 11/15 23:24:22 Shadow exception!
         Unable to talk to job: Connection timed out

         281  -  Run Bytes Sent By Job
         9824800  -  Run Bytes Received By Job


001 (28889363.000.000) 11/15 23:24:33 Job executing on host: <10.10.9.8:43219>


007 (28889363.000.000) 11/15 23:33:32 Shadow exception!
         Unable to talk to job: Connection timed out

         313  -  Run Bytes Sent By Job
         9824832  -  Run Bytes Received By Job


001 (28889363.000.000) 11/15 23:33:44 Job executing on host: <10.10.0.90:51334>
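
To see whether the timeouts cluster on particular execute hosts, or after a
particular run time, the excerpt above could be summarised with a small script.
This is only a rough sketch: it assumes the log has exactly the layout shown
here (event code, job id, MM/DD HH:MM:SS timestamp, message), the names
failed_runs, EVENT_RE, HOST_RE and ASSUMED_YEAR are made up for illustration,
and since the log carries no year, a fixed one is assumed, so runs spanning
New Year are not handled.

```python
import re
from datetime import datetime

# Event header as it appears in the excerpt: "007 (28889363.000.000) 11/15 23:24:22 ..."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)
# Host address as it appears in the excerpt: "<10.10.4.85:36245>"
HOST_RE = re.compile(r"<([\d.]+):(\d+)>")

# Assumption: the log has no year, so a fixed (leap) year is used for parsing.
ASSUMED_YEAR = "2000"


def failed_runs(log_text):
    """Return (job_id, execute_host, seconds_run) for every run that
    started with event 001 and ended in event 007 (shadow exception)."""
    running = {}   # job id -> (start time, execute host)
    results = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        m = EVENT_RE.match(line.strip())
        if not m:
            continue
        code, job, stamp, rest = m.groups()
        when = datetime.strptime(
            ASSUMED_YEAR + "/" + stamp, "%Y/%m/%d %H:%M:%S"
        )
        if code == "001":
            # The address may have been wrapped onto the following line.
            host = HOST_RE.search(rest)
            if host is None and i + 1 < len(lines):
                host = HOST_RE.search(lines[i + 1])
            running[job] = (when, host.group(1) if host else None)
        elif code == "007" and job in running:
            start, host = running.pop(job)
            results.append((job, host, int((when - start).total_seconds())))
    return results
```

For the two failed runs quoted above, this gives 235 seconds on 10.10.4.85 and
539 seconds on 10.10.9.8 between "Job executing" and "Shadow exception".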

I'm currently looking into Linux's TCP settings to mitigate this, but any 
advice would be helpful!
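
As a starting point for looking at the kernel side, the current values of a
few TCP retransmission and keepalive settings can be read from
/proc/sys/net/ipv4. The sketch below only reads them. Which of these, if any,
actually governs the timeout the shadow runs into is an assumption on my part,
and the helper name read_tcp_settings and the selection in TCP_KNOBS are
purely illustrative.

```python
from pathlib import Path

# Standard Linux sysctl entries under net.ipv4; whether they are the ones
# relevant to the shadow timeouts is an assumption, not a confirmed fact.
TCP_KNOBS = (
    "tcp_retries2",          # retransmissions before an established connection is dropped
    "tcp_syn_retries",       # SYN retransmissions for outgoing connection attempts
    "tcp_keepalive_time",    # idle seconds before the first keepalive probe
    "tcp_keepalive_intvl",   # seconds between keepalive probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_tcp_settings(base="/proc/sys/net/ipv4", names=TCP_KNOBS):
    """Return {setting name: integer value, or None if it cannot be read}."""
    settings = {}
    for name in names:
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file, no permission, or non-integer content.
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, value in read_tcp_settings().items():
        print(f"net.ipv4.{name} = {value}")
```

Changing a value would be done separately, as root, for example with
"sysctl -w net.ipv4.tcp_retries2=<value>"; the script itself changes nothing.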

Cheers and TALIA

Carsten
-- 
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
Phone/Fax: +49 511 762-17185 / -17193
http://www.top500.org/system/9234 | http://www.top500.org/connfam/6
CaCert Assurer | Get free certificates from http://www.cacert.org/