Re: [Condor-users] odd communications errors



Condor 6.7.20 has a known bug in non-blocking TCP communications.
Do you have any of the non-blocking TCP communication options turned on
in your config file? The bug is fixed in 6.8.1.

Steve


On Wed, 4 Oct 2006, David A. Kotz wrote:

Recently I've had a rash of reports like the following error log, and
today I received the following error report from a user, which seems
related.  We've examined the switches for evidence of network problems
and found none.  Any ideas what might be causing this?

Condor 6.7.20 on Ubuntu Dapper with kernel 2.6.17.4


_________________________________

10/2 17:46:33 condor_read(): timeout reading buffer.
10/2 17:46:33 IO: Failed to read packet header
10/2 17:46:33 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm8: WARNING: No processes found in starter's family
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm7: WARNING: No processes found in starter's family
10/2 17:46:43 condor_read(): timeout reading buffer.
10/2 17:47:03 condor_read(): timeout reading buffer.
10/2 17:47:03 IO: Failed to read packet header
10/2 17:47:03 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:47:03 DaemonCore: Command received via UDP from host <128.83.120.40:47058>
10/2 17:47:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
10/2 17:47:03 vm6: State change: received RELEASE_CLAIM command from preempting claim
10/2 17:47:03 Starter pid 15838 exited with status 0

____________________________________


[A USER] wrote:
> I've been getting a lot of log messages like this lately:
>
>
> ...
> 022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
>     Socket between submit and execute hosts closed unexpectedly
>     Trying to reconnect to vm1@[MACHINE]
>
> ...then later...
>
> ...
> 024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
>     Job disconnected too long: JobLeaseDuration (1200 seconds) expired
>     Can not reconnect to vm1@[MACHINE], rescheduling job
>
>
> ....  Often the same job will cycle between these two errors several
> times.  The system seems to try to reconnect for 20 minutes, and it
> doesn't seem like my jobs make any progress while this is going on.
> Also, I have yet to see any log messages indicating that the job has
> successfully reconnected.  I don't know if that's because those events
> aren't logged, or because the jobs never successfully reconnect once
> disconnected.
>
> I've never seen this behavior before.  Any idea what's happening?
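
One way to check whether any job ever reconnects is to tally the event
codes in the job's user log. The sketch below is only an illustration,
not part of Condor: the function names are made up, the header regex is
inferred from the two events quoted above, the year is an assumption
(the log lines carry none), and it assumes that 023 is the event code
for a successful reconnect, which should be verified against the
manual for your version.

```python
import re
from collections import Counter
from datetime import datetime

# Event header lines in the quoted log look like:
#   022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(lines, year=2006):
    """Return (code, job, timestamp, text) for every event header line.

    The log has no year, so `year` is an assumption supplied by the caller.
    Leading "> " quote markers from the email are stripped first.
    """
    events = []
    for line in lines:
        match = EVENT_HEADER.match(line.lstrip("> ").rstrip())
        if match is None:
            continue  # detail line, "..." separator, or blank line
        code, job, stamp, text = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
        events.append((code, job, when, text))
    return events


def count_by_code(events):
    """Count how many times each event code appears."""
    return Counter(code for code, _job, _when, _text in events)


def failed_reconnect_gaps(events):
    """Seconds between each 022 (disconnected) and the 024 (failed) that follows.

    Assumes 023 means the job reconnected, which cancels a pending 022.
    """
    pending = {}
    gaps = []
    for code, job, when, _text in events:
        if code == "022":
            pending[job] = when
        elif code == "023":
            pending.pop(job, None)
        elif code == "024" and job in pending:
            gaps.append((job, (when - pending.pop(job)).total_seconds()))
    return gaps
```

Feeding it the lines of a user log (for example `parse_events(open(path))`,
where `path` is whatever file the submit description names as its log)
and comparing the counts for 022, 023 and 024 shows whether reconnects
ever succeed. For the two events quoted above, the gap comes out slightly
longer than the 1200-second JobLeaseDuration, which matches the 20 minutes
of retrying described.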

--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team