
Re: [Condor-users] Shadow exception with LamMpi jobs



Thanks, Greg, for the suggestion. I set STARTER_UPLOAD_TIMEOUT = 3600 in
the config file, restarted Condor, and resubmitted the job, but the
shadow exception still occurs:

000 (18790.000.000) 03/14 12:18:30 Job submitted from host: <10.7.7.250:38867>
...
014 (18790.000.014) 03/14 12:22:48 Node 14 executing on host: <10.7.7.14:39641>
...
014 (18790.000.021) 03/14 12:22:48 Node 21 executing on host: <10.7.7.14:39641>
...
014 (18790.000.010) 03/14 12:24:12 Node 10 executing on host: <10.7.7.13:51425>
...
014 (18790.000.019) 03/14 12:24:12 Node 19 executing on host: <10.7.7.14:39641>
...
014 (18790.000.017) 03/14 12:24:55 Node 17 executing on host: <10.7.7.14:39641>
...
014 (18790.000.013) 03/14 12:24:56 Node 13 executing on host: <10.7.7.13:51425>
...
014 (18790.000.015) 03/14 12:26:18 Node 15 executing on host: <10.7.7.13:51425>
...
014 (18790.000.016) 03/14 12:27:07 Node 16 executing on host: <10.7.7.13:51425>
...
014 (18790.000.026) 03/14 12:27:50 Node 26 executing on host: <10.7.7.17:36396>
...
014 (18790.000.029) 03/14 12:28:35 Node 29 executing on host: <10.7.7.17:36396>
...
014 (18790.000.031) 03/14 12:29:26 Node 31 executing on host: <10.7.7.17:36396>
...
014 (18790.000.001) 03/14 12:30:14 Node 1 executing on host: <10.7.7.11:58617>
...
007 (18790.000.000) 03/14 12:30:14 Shadow exception!
        Error from starter on slot1@xxxxxxxxxxxxxx: Failed to transfer files
        0  -  Run Bytes Sent By Job
        30215708672  -  Run Bytes Received By Job
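
To compare the time the job ran before failing against the configured timeout, a log excerpt like the one above can be summarized with a short script. This is only a rough sketch: the line layout is inferred from this excerpt alone, the function names (parse_events, summarize) are made up for illustration, and the year 2000 is a placeholder because the log lines carry no year.

```python
import re
from datetime import datetime

# Matches event header lines such as:
#   014 (18790.000.014) 03/14 12:22:48 Node 14 executing on host: <10.7.7.14:39641>
# The layout is an assumption based on the excerpt in this message.
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)

def parse_events(log_text):
    """Return (event_code, subproc, timestamp, text) for each event header line."""
    events = []
    for line in log_text.splitlines():
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue  # separator ("...") or indented detail line
        # Placeholder year: the log has no year, and only time differences matter here.
        when = datetime.strptime(
            "2000/" + m["date"] + " " + m["time"], "%Y/%m/%d %H:%M:%S"
        )
        events.append((m["code"], int(m["sub"]), when, m["text"]))
    return events

def summarize(log_text):
    """Seconds from submission (000) to the first shadow exception (007),
    plus the sorted node numbers that reported an execute event (014)."""
    events = parse_events(log_text)
    submitted = next(w for code, _, w, _ in events if code == "000")
    failed = next(w for code, _, w, _ in events if code == "007")
    nodes = sorted(sub for code, sub, _, _ in events if code == "014")
    return {
        "seconds_to_failure": (failed - submitted).total_seconds(),
        "nodes_started": nodes,
    }
```

For the log above, the time from submission (12:18:30) to the shadow exception (12:30:14) is 704 seconds.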

Pasquale

>  Try setting in the config file
>
>  STARTER_UPLOAD_TIMEOUT = 1200
>
>  or set it to another large value, and see if the problem goes away.