Re: [Condor-users] Network traffic associated with a standard universe job

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Tue, 31 Jul 2012 16:19:57 -0500

Subject: Re: [Condor-users] Network traffic associated with a standard universe job

On Jul 27, 2012, at 7:19 AM, Bob Briscoe wrote:

I'm trying to figure out what network traffic is associated with a standard universe job, and how it differs from a vanilla universe one. I'm especially interested in the sequence of traffic at job start-up, as that's when we see our problem (for a full account of the symptoms see the thread at: https://lists.cs.wisc.edu/archive/condor-users/2012-July/msg00141.shtml, which unfortunately got no responses). So, could a Condor developer kindly state what connections are initiated, from/to which daemon/process, and the protocol being used, and confirm that all such traffic respects the settings in BIND_ALL_INTERFACES and NETWORK_INTERFACE?

The standard universe uses a lot of code not used anywhere else in Condor. It may not fully respect BIND_ALL_INTERFACES and NETWORK_INTERFACE. The network traffic is also a little different. When the starter needs to transfer the job executable from the shadow, the shadow creates a child process listening on a newly-bound port and sends the address the child is listening on to the starter. The starter then connects to the child of the shadow to perform the transfer.
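To make the connection pattern above concrete, here is a minimal sketch in Python of the same shape: one side binds a fresh port, reports the address, and the other side connects to it to pull the data. This is not Condor code. The function names are hypothetical, a thread stands in for the shadow's child process, and the loopback address is an assumption so the example runs on one machine.

```python
import socket
import threading


def serve_file_once(listener, payload):
    """Accept a single connection and send the payload (the 'shadow child' role)."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(payload)
    listener.close()


def start_transfer_server(payload, bind_host="127.0.0.1"):
    """Bind a newly chosen port and return the address a peer should connect to."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((bind_host, 0))  # port 0 asks the OS for any free port
    listener.listen(1)
    addr = listener.getsockname()  # (host, port) that would be sent to the peer
    worker = threading.Thread(
        target=serve_file_once, args=(listener, payload), daemon=True
    )
    worker.start()
    return addr, worker


def fetch(addr):
    """Connect to the advertised address and read until EOF (the 'starter' role)."""
    chunks = []
    with socket.create_connection(addr, timeout=5) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

The point of the sketch is that the address handed to the connecting side depends entirely on which interface the listening side bound to. If that address is not reachable from the execute machine, the connection fails at job start-up.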

To see which network interfaces and addresses are actually being used, enable network debugging in the shadow and starter logs.

On the submit machine, add this line to the config file:

SHADOW_DEBUG = D_NETWORK

Then look for this sequence of lines:

Entering pseudo_get_file_stream

file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"

addr = <12.34.56.78:9000>

This will tell you the address that the shadow's child is listening on.

Add this line to the Condor config file on the execution machines where standard universe jobs are failing:

STARTER_DEBUG = D_NETWORK

Then, look for a line like this in the starter log:

Opening TCP stream to <12.34.56.78:9000>

This is the address the starter is attempting to connect to, which should match the address in the shadow log.
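If there are many jobs to check, the comparison can be scripted. The following is a rough sketch in Python that assumes the log lines look like the excerpts above (real log lines normally carry a timestamp prefix, which the matching tolerates). The function names are hypothetical and the log file locations are left to the caller.

```python
import re

# Matches a Condor-style address such as <12.34.56.78:9000>
ADDR_RE = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")


def shadow_listen_addrs(lines):
    """Collect the 'addr = <ip:port>' that follows each pseudo_get_file_stream entry."""
    addrs = []
    in_block = False
    for line in lines:
        if "Entering pseudo_get_file_stream" in line:
            in_block = True
            continue
        if in_block and "addr =" in line:
            match = ADDR_RE.search(line)
            if match:
                addrs.append((match.group(1), int(match.group(2))))
            in_block = False
    return addrs


def starter_connect_addrs(lines):
    """Collect every address the starter reports opening a TCP stream to."""
    addrs = []
    for line in lines:
        if "Opening TCP stream to" in line:
            match = ADDR_RE.search(line)
            if match:
                addrs.append((match.group(1), int(match.group(2))))
    return addrs


def unmatched_connections(shadow_lines, starter_lines):
    """Return starter targets that never appear as a shadow listen address."""
    listening = set(shadow_listen_addrs(shadow_lines))
    return [a for a in starter_connect_addrs(starter_lines) if a not in listening]
```

Pass it the lines of the shadow log from the submit machine and the starter log from the execute machine, for example `unmatched_connections(open(shadow_path), open(starter_path))`, where both paths are whatever your configuration uses. An empty result only means the two logs agree on the address. The starter also opens other TCP streams, so an unmatched entry is a lead to inspect rather than proof of a problem.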

Thanks and regards,

Jaime Frey

UW-Madison Condor Team
