
[Condor-users] "create_tcp_port(): bind() failed" for standard universe jobs



Hi,

One of our users is seeing some of his migrating standard universe jobs (Linux, Condor v7.6.6) fail to restart with:

001 (12814.129.000) 04/29 14:59:01 Job executing on host: <xxx.xxx.xxx.xxx:9210>
...
007 (12814.129.000) 04/29 14:59:01 Shadow exception!
        create_tcp_port(): bind() failed: 98(Address already in use)
        125  -  Run Bytes Sent By Job
        6501894  -  Run Bytes Received By Job

The execute hosts where we see this failure run a mixture of distros, including Ubuntu 10.04, Debian 6.0, and SLES 10. I've come across one related thread on the Condor-users mailing list (begins at https://lists.cs.wisc.edu/archive/condor-users/2011-January/msg00037.shtml), but since most of the Condor installations on these execute hosts were done from tarballs, I don't think what's in that thread is relevant.

Can anyone shed light on what this bind failure indicates? Is it a case of the machine having run out of ephemeral ports for the job (unlikely, as many machines don't define a port range), or is the standard universe functionality really trying to bind to a specific port that's already in use? (I thought the latter couldn't be the case, as the standard universe abstracts away specific port usage.)
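
For reference, here is a minimal Python sketch of the two bind() cases I'm distinguishing. It is not Condor code and says nothing about how create_tcp_port() is actually implemented; the helper name try_bind and the use of the loopback address are my own assumptions for illustration. It only shows that asking the kernel for any free port (port 0) succeeds while a free port exists, whereas binding to a specific port that another socket already holds fails with EADDRINUSE (errno 98 on Linux, as in the log above).

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Hypothetical helper: bind a TCP socket to (host, port).

    Returns (socket, None) on success or (None, errno) on failure.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return s, None
    except OSError as e:
        s.close()
        return None, e.errno


# Case 1: port 0 lets the kernel pick any free ephemeral port.
first, first_err = try_bind(0)
taken_port = first.getsockname()[1]
first.listen(1)

# Case 2: binding a second socket to that same, now occupied, port fails.
second, second_err = try_bind(taken_port)

print("kernel-assigned port:", taken_port)
print("second bind errno:", errno.errorcode.get(second_err, second_err))

first.close()
```

If the failure is of the second kind, something must be requesting a fixed port (or a port from a very narrow range) rather than leaving the choice to the kernel.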

Any hints as to the underlying cause of this issue would be gratefully received.

Ta,
Mark