
[Condor-users] odd communications errors



Recently I've seen a rash of errors like the ones in the log excerpt below, and today a user sent me the report quoted after it, which seems related. We've checked the switches for evidence of network problems and found none. Any ideas what might be causing this?

Condor 6.7.20 on Ubuntu Dapper with kernel 2.6.17.4


_________________________________

10/2 17:46:33 condor_read(): timeout reading buffer.
10/2 17:46:33 IO: Failed to read packet header
10/2 17:46:33 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm8: WARNING: No processes found in starter's family
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm7: WARNING: No processes found in starter's family
10/2 17:46:43 condor_read(): timeout reading buffer.
10/2 17:47:03 condor_read(): timeout reading buffer.
10/2 17:47:03 IO: Failed to read packet header
10/2 17:47:03 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:47:03 DaemonCore: Command received via UDP from host <128.83.120.40:47058>
10/2 17:47:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
10/2 17:47:03 vm6: State change: received RELEASE_CLAIM command from preempting claim
10/2 17:47:03 Starter pid 15838 exited with status 0

____________________________________


[A USER] wrote:
> I've been getting a lot of log messages like this lately:
>
>
> ...
> 022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to vm1@[MACHINE]
>
> ...then later...
>
> ...
> 024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
>     Job disconnected too long: JobLeaseDuration (1200 seconds) expired
>     Can not reconnect to vm1@[MACHINE], rescheduling job
>
>
> .... Often the same job will cycle between these two errors several times. The system seems to try to reconnect for 20 minutes, and my jobs don't seem to make any progress while this is going on. Also, I have yet to see any log message indicating that a job has successfully reconnected. I don't know whether that's because those events aren't logged or because the jobs never reconnect once disconnected.
>
> I've never seen this behavior before.  Any idea what's happening?
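
To check how long each disconnect lasts before the job is rescheduled, the user log can be scanned for 022/024 pairs. What follows is only a rough sketch, assuming event headers look exactly like the two quoted above; the function name disconnect_gaps and the placeholder year are made up for illustration and are not part of Condor.

```python
import re
from datetime import datetime

# Event header as it appears in the quoted user log, e.g.
#   022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>[\d.]+)\) "
    r"(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
)

DISCONNECTED = "022"      # "Job disconnected, attempting to reconnect"
RECONNECT_FAILED = "024"  # "Job reconnection failed"

# Assumption: the log carries no year, so a placeholder year is used.
# Differences are only meaningful within a single calendar year.
PLACEHOLDER_YEAR = "2000"


def disconnect_gaps(lines):
    """Return (job, seconds) for each disconnect followed by a failed reconnect."""
    pending = {}  # job id -> time of the most recent unmatched disconnect
    gaps = []
    for line in lines:
        match = EVENT_RE.match(line)
        if not match:
            continue  # detail lines and "..." separators are skipped
        when = datetime.strptime(
            PLACEHOLDER_YEAR + "/" + match["stamp"], "%Y/%m/%d %H:%M:%S"
        )
        job = match["job"]
        if match["code"] == DISCONNECTED:
            pending[job] = when
        elif match["code"] == RECONNECT_FAILED and job in pending:
            elapsed = when - pending.pop(job)
            gaps.append((job, int(elapsed.total_seconds())))
    return gaps
```

For the two events quoted above this gives 1234 seconds between the disconnect and the failed reconnect, just over the 1200-second JobLeaseDuration named in the message.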