[Condor-users] "create_tcp_port(): bind() failed" for standard universe jobs
- Date: Fri, 04 May 2012 11:40:42 +0100
- From: Mark Calleja <mc321@xxxxxxxxx>
- Subject: [Condor-users] "create_tcp_port(): bind() failed" for standard universe jobs
Hi,
One of our users is seeing some of his migrating standard universe jobs
(Linux, Condor v7.6.6) fail to restart with:
001 (12814.129.000) 04/29 14:59:01 Job executing on host: <xxx.xxx.xxx.xxx:9210>
...
007 (12814.129.000) 04/29 14:59:01 Shadow exception!
create_tcp_port(): bind() failed: 98(Address already in use)
125 - Run Bytes Sent By Job
6501894 - Run Bytes Received By Job
The execute hosts we see this failing on are a mixture of distros,
including Ubuntu 10.04, Debian 6.0, and SLES 10. I've come across one
related thread in the Condor-users mailing list (begins at
https://lists.cs.wisc.edu/archive/condor-users/2011-January/msg00037.shtml),
but since most of the Condor installations on these execute hosts were
done from tarballs, I don't think that what's in that thread is
relevant.
Can anyone shed light on what this bind failure indicates? Has the
machine run out of ephemeral ports for the job (unlikely, as many
machines don't define a port range), or is the standard universe
functionality really trying to bind to a specific port that's already
in use? (I thought that the latter couldn't be the case, as the
standard universe abstracts away specific port usage.)
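For reference, here is a minimal sketch of the second scenario using
plain Python sockets. It is not Condor code and makes no claim about
what create_tcp_port() does internally; the helper names try_bind() and
demo() are hypothetical. It only shows that letting the kernel choose a
port (port 0) succeeds, while asking for a specific port that is still
held fails with EADDRINUSE, which Linux reports as errno 98.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind a TCP socket to (host, port).

    Returns (socket, None) on success or (None, errno) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        return None, exc.errno
    return sock, None


def demo():
    # Port 0 asks the kernel for any free ephemeral port.
    first, first_err = try_bind(0)
    port = first.getsockname()[1]
    first.listen(1)

    # Asking for that exact port again fails while it is still held.
    second, second_err = try_bind(port)
    if second is not None:
        second.close()

    first.close()
    return port, first_err, second_err


if __name__ == "__main__":
    port, first_err, second_err = demo()
    print("kernel-assigned port:", port)
    print("second bind errno:", second_err, errno.errorcode.get(second_err))
```

If the failing bind were a kernel-assigned one, I would expect the
failure only once every port in the allowed range is taken, so telling
the two cases apart probably comes down to whether a specific port
number is being requested.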
Any hints as to the underlying cause of this issue would be gratefully
received.
Ta,
Mark