
Re: [Condor-users] Odd communications errors



Look at your MasterLog. It appears that your condor_schedd
died for some reason.  Increasing the value of JobLeaseDuration
will not stop the schedd from dying, but it gives the restarted
schedd longer to reconnect before the job is rescheduled.

Steve


On Mon, 20 Nov 2006, Guoli Wang wrote:


Steve, Dave, and all,

I just found a similar problem which Steve and Dave had talked about before,
but my problem may be more interesting:


The job had been running on one machine (10.40.16.24) for about a week, but
it suddenly lost its connection (there was no power shutdown or anything
obvious), and afterwards the job was assigned to another machine (10.40.16.25)!

Any advice? I'm using condor 6.8.1 !!!


000 (016.000.000) 11/09 15:03:13 Job submitted from host: <10.40.16.24:4805>
...
001 (016.000.000) 11/09 15:03:21 Job executing on host: <10.40.16.24:4806>
...
006 (016.000.000) 11/09 15:03:29 Image size of job updated: 39644
...
006 (016.000.000) 11/09 15:23:29 Image size of job updated: 225588
....

006 (016.000.000) 11/18 01:23:59 Image size of job updated: 355776
...
022 (016.000.000) 11/18 03:07:40 Job disconnected, attempting to reconnect
   Local schedd and job shadow died, schedd now running again
   Trying to reconnect to vm1@antec5 <10.40.16.24:4806>
...
024 (016.000.000) 11/18 03:24:15 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (1200 seconds) expired
   Can not reconnect to vm1@antec5, rescheduling job
...
001 (016.000.000) 11/19 03:13:37 Job executing on host: <10.40.16.25:1212>
...
006 (016.000.000) 11/19 03:33:45 Image size of job updated: 220428
...
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Thanks,

Guoli
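
For anyone who wants to check how long a job sat disconnected before Condor
gave up, here is a minimal Python sketch that pairs each 022 (disconnected)
event with the following 024 (reconnection failed) event for the same job.
The function names, the regular expression, and the year=2006 default are my
own assumptions for illustration (the user log prints no year); this is not
a Condor tool, only a sketch of the arithmetic.

```python
import re
from datetime import datetime

# Matches user-log event header lines such as:
# 022 (016.000.000) 11/18 03:07:40 Job disconnected, attempting to reconnect
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(text, year=2006):
    """Return (code, job_id, timestamp, message) for each event header line."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue  # detail lines and "..." separators are skipped
        code, cluster, proc, _sub, stamp, msg = m.groups()
        # The log carries no year, so the caller has to supply one (assumption).
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        events.append((code, f"{int(cluster)}.{int(proc)}", when, msg))
    return events


def failed_reconnects(events):
    """Pair each 022 event with the next 024 event for the same job."""
    pending = {}
    results = []
    for code, job, when, _msg in events:
        if code == "022":
            pending[job] = when
        elif code == "024" and job in pending:
            start = pending.pop(job)
            seconds = int((when - start).total_seconds())
            results.append((job, start, when, seconds))
    return results


# Excerpt of the user log quoted in this message.
SAMPLE = """\
022 (016.000.000) 11/18 03:07:40 Job disconnected, attempting to reconnect
   Local schedd and job shadow died, schedd now running again
...
024 (016.000.000) 11/18 03:24:15 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (1200 seconds) expired
...
"""

for job, start, end, secs in failed_reconnects(parse_events(SAMPLE)):
    print(f"job {job}: disconnected {start:%m/%d %H:%M:%S}, gave up after {secs} s")
```

On the excerpt above this reports 995 seconds between the two events for
job 16.0. Comparing that figure with the JobLeaseDuration printed in the 024
event is a quick way to see how much headroom a longer lease would have to
provide.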





Steve,

Thanks loads for the instant response.

Unless there is a default to use TCP for something, I'm not using it. A grep
of my main condor_config shows only the following for "TCP", and it is
commented out:


#UPDATE_COLLECTOR_WITH_TCP = True

- dave
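
The check described above (grepping the config for "TCP") can also be done
with a short Python sketch that reports whether each matching setting is
active or commented out. The function name find_settings and the sample
config text are hypothetical, the host name in it is made up, and unlike a
plain grep this only looks at setting names, not at values or comments.

```python
import re

# A setting line looks like "NAME = value", optionally prefixed by "#".
SETTING_RE = re.compile(r"^\s*(#*)\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*)$")


def find_settings(config_text, needle):
    """Yield (line_number, name, value, active) for names containing needle."""
    for number, line in enumerate(config_text.splitlines(), start=1):
        m = SETTING_RE.match(line)
        if m and needle.lower() in m.group(2).lower():
            hashes, name, value = m.groups()
            # A leading "#" means the setting is commented out, so not active.
            yield number, name, value.strip(), hashes == ""


# Hypothetical config excerpt; only the first line comes from this thread.
SAMPLE = """\
#UPDATE_COLLECTOR_WITH_TCP = True
COLLECTOR_HOST = central.example.org
"""

for number, name, value, active in find_settings(SAMPLE, "TCP"):
    state = "active" if active else "commented out"
    print(f"line {number}: {name} = {value} ({state})")
```

Note that this only inspects one file's text. Settings picked up from local
config files or built-in defaults would need to be checked separately.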


Steven Timm wrote:

There's a known bug in Condor 6.7.20 with non-blocking TCP communications.
Do you have any of the TCP communications turned on in your config file?
It's fixed in 6.8.1.

Steve


On Wed, 4 Oct 2006, David A. Kotz wrote:

Recently I've had a rash of reports like the following error log, and
today I received the following error report from a user, which seems
related.  We've examined the switches for evidence of network problems
and found none.  Any ideas what might be causing this?

Condor 6.7.20 on Ubuntu Dapper with kernel 2.6.17.4


_________________________________

10/2 17:46:33 condor_read(): timeout reading buffer.
10/2 17:46:33 IO: Failed to read packet header
10/2 17:46:33 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm8: WARNING: No processes found in starter's family
10/2 17:46:33 ProcFamily::currentfamily: ERROR: family_size is 0
10/2 17:46:33 vm7: WARNING: No processes found in starter's family
10/2 17:46:43 condor_read(): timeout reading buffer.
10/2 17:47:03 condor_read(): timeout reading buffer.
10/2 17:47:03 IO: Failed to read packet header
10/2 17:47:03 DaemonCore: Can't receive command request (perhaps a timeout?)
10/2 17:47:03 DaemonCore: Command received via UDP from host <128.83.120.40:47058>
10/2 17:47:03 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
10/2 17:47:03 vm6: State change: received RELEASE_CLAIM command from preempting claim
10/2 17:47:03 Starter pid 15838 exited with status 0

____________________________________


[A USER] wrote:

I've been getting a lot of log messages like this lately:


...
022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
   Socket between submit and execute hosts closed unexpectedly
   Trying to reconnect to vm1@[MACHINE]

...then later...

...
024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (1200 seconds) expired
   Can not reconnect to vm1@[MACHINE], rescheduling job


Often the same job will cycle between these two errors several
times.  The system seems to try to reconnect for 20 minutes, and it
doesn't seem like my jobs make any progress while this is going on.
Also, I have yet to see any log messages indicating that the job has
successfully reconnected.  I don't know if that's because those events
aren't logged, or because the jobs never successfully reconnect once
disconnected.

I've never seen this behavior before.  Any idea what's happening?
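
One way to answer the question of whether any job ever reconnects is to
tally the reconnect-related events per job in the user log. The sketch below
is a minimal Python example. It assumes that event code 023 marks a
successful reconnect, which is my understanding of Condor's event numbering
and does not appear anywhere in the logs quoted here. The function name and
labels are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Event header lines start with a 3-digit code and (cluster.proc.subproc).
HEADER_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) ")

# 022 and 024 appear in the quoted log; 023 is assumed to mean "reconnected".
LABELS = {"022": "disconnected", "023": "reconnected", "024": "reconnect_failed"}


def tally_reconnects(log_text):
    """Count disconnect, reconnect, and reconnect-failed events per job."""
    counts = defaultdict(Counter)
    for line in log_text.splitlines():
        m = HEADER_RE.match(line)
        if m and m.group(1) in LABELS:
            job = f"{int(m.group(2))}.{int(m.group(3))}"
            counts[job][LABELS[m.group(1)]] += 1
    return counts


# Excerpt of the log quoted above.
SAMPLE = """\
022 (4182.144.000) 09/27 14:01:28 Job disconnected, attempting to reconnect
   Socket between submit and execute hosts closed unexpectedly
...
024 (4182.144.000) 09/27 14:22:02 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (1200 seconds) expired
...
"""

for job, counter in tally_reconnects(SAMPLE).items():
    print(job, dict(counter))
```

A job whose tally shows disconnects and failures but never a "reconnected"
count would support the suspicion that reconnection never succeeds, rather
than that successes simply go unlogged.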

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR




--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team