On Jul 27, 2012, at 7:19 AM, Bob Briscoe wrote:
The standard universe uses a lot of code that is not used anywhere else in Condor, so it may not fully respect BIND_ALL_INTERFACES and NETWORK_INTERFACE. Its network traffic is also a little different. When the starter needs to transfer the job executable from the shadow, the shadow creates a child process that listens on a newly bound port, and the shadow sends that child's address to the starter. The starter then connects to the shadow's child to perform the transfer.
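To make the transfer pattern above concrete, here is a minimal Python sketch of the same idea: one side binds a fresh port, advertises the resulting address, and the other side connects to it. This is not Condor code. The function names (serve_once, fetch, transfer_demo) and the loopback host are assumptions made purely for illustration.

```python
import socket
import threading

def serve_once(listener, payload):
    """Accept one connection and send the payload (the role of the shadow's child)."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(payload)

def fetch(addr):
    """Connect to the advertised address and read until EOF (the role of the starter)."""
    chunks = []
    with socket.create_connection(addr, timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def transfer_demo(payload=b"executable bytes", host="127.0.0.1"):
    """Hypothetical demo: bind a new port, advertise it, and transfer the payload."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(5)      # do not block forever if nobody connects
    listener.bind((host, 0))    # port 0 asks the OS for a free, newly bound port
    listener.listen(1)
    addr = listener.getsockname()  # the address that would be sent to the peer
    worker = threading.Thread(target=serve_once, args=(listener, payload))
    worker.start()
    try:
        received = fetch(addr)
    finally:
        worker.join()
        listener.close()
    return addr, received
```

The point of the sketch is that the advertised address depends on the interface the listener was bound to. If that interface is not reachable from the connecting side, the connection fails even though the listener itself is healthy.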
You can get more information on which network interfaces are being used by turning on network debugging for both the shadow and the starter.
On the submit machine, add this line to the config file:
SHADOW_DEBUG = D_NETWORK
Then look for this sequence of lines in the shadow log:
file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"
addr = <220.127.116.11:9000>
This will tell you the address that the shadow's child is listening on.
Add this line to the Condor config file on the execution machines where standard universe jobs are failing:
STARTER_DEBUG = D_NETWORK
Then, look for a line like this in the starter log:
Opening TCP stream to <18.104.22.168:9000>
This is the address the starter is attempting to connect to. It should match the address that the shadow's child reported in the shadow log.
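If the logs are long, the comparison can be scripted. The following Python sketch assumes the log lines look exactly like the samples in this message (an "addr = <ip:port>" line in the shadow log and an "Opening TCP stream to <ip:port>" line in the starter log). The function names are hypothetical, and you would need to read the log text from wherever your shadow and starter logs are written.

```python
import re

# IPv4 address and port inside angle brackets, e.g. <192.0.2.10:9000>
ADDR = r"<(\d{1,3}(?:\.\d{1,3}){3}:\d+)>"
SHADOW_RE = re.compile(r"addr\s*=\s*" + ADDR)
STARTER_RE = re.compile(r"Opening TCP stream to\s*" + ADDR)

def find_addrs(pattern, log_text):
    """Return every ip:port captured by the pattern in the given log text."""
    return pattern.findall(log_text)

def compare(shadow_log_text, starter_log_text):
    """Group addresses by whether they appear in both logs or only one."""
    shadow = set(find_addrs(SHADOW_RE, shadow_log_text))
    starter = set(find_addrs(STARTER_RE, starter_log_text))
    return {
        "matched": sorted(shadow & starter),
        "shadow_only": sorted(shadow - starter),
        "starter_only": sorted(starter - shadow),
    }
```

An address under "shadow_only" means the shadow's child was listening on an address the starter never tried. The starter may open other TCP streams as well, so entries under "starter_only" are not necessarily a problem.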
Thanks and regards,
UW-Madison Condor Team