
Re: [Condor-users] odd communications errors



Steve,

Thanks loads for the instant response.

Unless there is a default to use TCP for something, I'm not using it. A grep of my main condor_config shows only the following for "TCP", and it is commented out:

#UPDATE_COLLECTOR_WITH_TCP = True
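
For anyone repeating this check, here is a minimal Python sketch of the same idea as the grep: list only the uncommented settings whose name mentions TCP. The function name and the sample text are hypothetical, and the sketch assumes simple one-line `NAME = value` entries. It does not follow any local config files that the main file may pull in, so those would need the same check.

```python
import re

def active_tcp_settings(config_text):
    """Return (name, value) pairs for uncommented settings whose name mentions TCP."""
    found = []
    for line in config_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and commented-out settings.
        if not stripped or stripped.startswith("#"):
            continue
        match = re.match(r"([A-Za-z0-9_.]+)\s*=\s*(.*)", stripped)
        if match and "TCP" in match.group(1).upper():
            found.append((match.group(1), match.group(2).strip()))
    return found

# Hypothetical sample: the one TCP-related line is commented out, as in the message above.
config_sample = """\
#UPDATE_COLLECTOR_WITH_TCP = True
COLLECTOR_NAME = example pool
"""
print(active_tcp_settings(config_sample))
```

An empty result for a config file matches the situation described above: no TCP setting has been turned on explicitly in that file.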

- dave


Steven Timm wrote:
There's a known bug in Condor 6.7.20 with non-blocking TCP communications;
it's fixed in 6.8.1.
Do you have any of the TCP communications turned on in your config file?

Steve


On Wed, 4 Oct 2006, David A. Kotz wrote:

Recently I've seen a rash of errors like those in the log excerpt below,
and today I received the report further down from a user, which seems
related.  We've examined the switches for evidence of network problems
and found none.  Any ideas what might be causing this?

Condor 6.7.20 on Ubuntu Dapper with kernel 2.6.17.4


_________________________________

10/2 17:46:33 condor_read(): timeout reading buffer.
10/2 17:46:33 IO: Failed to read packet header
10/2 17:46:33 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm8: WARNING: No processes found in starter's family
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm7: WARNING: No processes found in starter's family
10/2 17:46:43 condor_read(): timeout reading buffer.
10/2 17:47:03 condor_read(): timeout reading buffer.
10/2 17:47:03 IO: Failed to read packet header
10/2 17:47:03 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:47:03 DaemonCore: Command received via UDP from host <128.83.120.40:47058>
10/2 17:47:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
10/2 17:47:03 vm6: State change: received RELEASE_CLAIM command from preempting claim
10/2 17:47:03 Starter pid 15838 exited with status 0

____________________________________
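
To put a number on how often these timeouts occur, a small Python sketch like the following could tally them per day from a daemon log. It assumes each log line starts with an `M/D HH:MM:SS` stamp, as in the excerpt above. The function name and the sample variable are hypothetical.

```python
from collections import Counter

def count_messages(log_text, needle="condor_read(): timeout reading buffer."):
    """Count lines containing `needle`, keyed by the date that starts each line."""
    counts = Counter()
    for line in log_text.splitlines():
        # Split into date, time, and the rest of the message.
        parts = line.split(" ", 2)
        if len(parts) == 3 and needle in parts[2]:
            counts[parts[0]] += 1
    return counts

# Sample lines copied from the excerpt above.
daemon_log_sample = """\
10/2 17:46:33 condor_read(): timeout reading buffer.
10/2 17:46:33 IO: Failed to read packet header
10/2 17:46:43 condor_read(): timeout reading buffer.
10/2 17:47:03 condor_read(): timeout reading buffer.
10/2 17:47:03 Starter pid 15838 exited with status 0
"""
print(count_messages(daemon_log_sample))  # Counter({'10/2': 3})
```

Comparing the daily counts across machines may help show whether the timeouts cluster on particular hosts or dates.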


[A USER] wrote:
I've been getting a lot of log messages like this lately:


...
022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm1@[MACHINE]

...then later...

...
024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
    Can not reconnect to vm1@[MACHINE], rescheduling job


....  Often the same job will cycle between these two errors several
times.  The system seems to try to reconnect for 20 minutes, and it
doesn't seem like my jobs make any progress while this is going on.
Also, I have yet to see any log messages indicating that the job has
successfully reconnected.  I don't know if that's because those events
aren't logged, or because the jobs never successfully reconnect once
disconnected.
I've never seen this behavior before.  Any idea what's happening?
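
The "20 minutes" observation can be checked against the timestamps in the two events above. The following Python sketch pairs each 022 (disconnected) event with the next 024 (reconnection failed) event for the same job and reports the elapsed seconds. The event codes and header layout are taken from the excerpt. The function name is hypothetical, and the year is an assumption, since the log stamps carry none.

```python
import re
from datetime import datetime

# Event header as it appears in the excerpt: code, (cluster.proc.subproc), MM/DD HH:MM:SS
EVENT_RE = re.compile(r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")

def disconnect_durations(log_text, year=2006):
    """Return (job_id, seconds) for each 022 event followed by a 024 event.

    The year is an assumption supplied by the caller; the log does not record it.
    """
    pending = {}
    results = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if not match:
            continue
        code, job, stamp = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
        if code == "022":
            pending[job] = when
        elif code == "024" and job in pending:
            elapsed = when - pending.pop(job)
            results.append((job, int(elapsed.total_seconds())))
    return results

# Sample lines copied from the excerpt above.
user_log_sample = """\
022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
    Job disconnected too long: JobLeaseDuration (1200 seconds) expired
"""
print(disconnect_durations(user_log_sample))  # [('4182.144.000', 1234)]
```

For the events quoted here the gap is 1234 seconds, just past the 1200-second JobLeaseDuration named in the failure message, which is consistent with reconnect attempts running for about 20 minutes before the job is rescheduled.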