Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] odd communications errors

Date: Wed, 04 Oct 2006 09:10:15 -0500
From: "David A. Kotz" <dkotz@xxxxxxxxxxxxx>
Subject: [Condor-users] odd communications errors

Recently I've had a rash of reports like the following error log, and today I received the following error report from a user, which seems related. We've examined the switches for evidence of network problems and found none. Any ideas what might be causing this?


Condor 6.7.20 on Ubuntu Dapper with kernel 2.6.17.4


_________________________________

10/2 17:46:33 condor_read(): timeout reading buffer.
10/2 17:46:33 IO: Failed to read packet header
10/2 17:46:33 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm8: WARNING: No processes found in starter's family
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm7: WARNING: No processes found in starter's family
10/2 17:46:43 condor_read(): timeout reading buffer.
10/2 17:47:03 condor_read(): timeout reading buffer.
10/2 17:47:03 IO: Failed to read packet header
10/2 17:47:03 DaemonCore: Can't receive command request (perhaps a timeout?)

10/2 17:47:03 DaemonCore: Command received via UDP from host <128.83.120.40:47058>
10/2 17:47:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
10/2 17:47:03 vm6: State change: received RELEASE_CLAIM command from preempting claim

10/2 17:47:03 Starter pid 15838 exited with status 0

____________________________________


[A USER] wrote:
> I've been getting a lot of log messages like this lately:
>
>
> ...

> 022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to vm1@[MACHINE]
>
> ...then later...
>
> ...
> 024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
>     Job disconnected too long: JobLeaseDuration (1200 seconds) expired
>     Can not reconnect to vm1@[MACHINE], rescheduling job
>
>

> .... Often the same job will cycle between these two errors several times. The system seems to try to reconnect for 20 minutes, and it doesn't seem like my jobs make any progress while this is going on. Also, I have yet to see any log messages indicating that the job has successfully reconnected. I don't know if that's because those events aren't logged, or because the jobs never successfully reconnect once disconnected.

>
> I've never seen this behavior before.  Any idea what's happening?
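
For anyone who wants to see how often this happens, here is a rough Python sketch that pairs each 022 (disconnected) event with the next 024 (reconnection failed) event for the same job and reports the gap in seconds. The line pattern is only an assumption based on the excerpt quoted above, not on the full user log format. The function name and the year argument are made up for illustration (the log lines carry no year).

```python
import re
from datetime import datetime

# Assumed line shape, taken only from the quoted excerpt:
#   "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)

def disconnect_durations(lines, year=2006):
    """Return (job_id, seconds) for each 022 event followed by a 024 event."""
    pending = {}   # job id -> time of the most recent disconnect
    results = []
    for line in lines:
        # Drop mail quote markers and leading spaces before matching.
        m = EVENT_RE.match(line.lstrip("> ").rstrip())
        if not m:
            continue  # continuation lines such as "Socket between ..." are skipped
        code, job, stamp, _message = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
        if code == "022":
            pending[job] = when
        elif code == "024" and job in pending:
            gap = when - pending.pop(job)
            results.append((job, int(gap.total_seconds())))
    return results

sample = [
    "> 022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect",
    ">     Socket between submit and execute hosts closed unexpectedly",
    "> 024 (4182.144.000) 09/27 14:22:02 Job reconnection failed",
]
print(disconnect_durations(sample))
```

On the two events quoted above, the gap works out to 1234 seconds, a little over the 1200-second JobLeaseDuration named in the failure message.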

Follow-Ups:
- Re: [Condor-users] odd communications errors
  - From: Steven Timm
