Re: [Condor-users] Network traffic associated with a standard universe job

On Jul 27, 2012, at 7:19 AM, Bob Briscoe wrote:

I'm trying to figure out what network traffic is associated with a standard universe job, and how it differs from that of a vanilla universe job. I'm especially interested in the sequence of traffic at job start-up, as that's when we see our problem (for a full account of the symptoms see the thread at: https://lists.cs.wisc.edu/archive/condor-users/2012-July/msg00141.shtml, which unfortunately got no responses). So, could a Condor developer kindly state what connections are initiated, from/to which daemon/process, and the protocol being used, and confirm that all such traffic respects the settings in BIND_ALL_INTERFACES and NETWORK_INTERFACE?

The standard universe uses a lot of code not used anywhere else in Condor. It may not fully respect BIND_ALL_INTERFACES and NETWORK_INTERFACE. The network traffic is also a little different. When the starter needs to transfer the job executable from the shadow, the shadow creates a child process listening on a newly-bound port and sends the address the child is listening on to the starter. The starter then connects to the child of the shadow to perform the transfer.
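To make the handoff above concrete, here is a minimal Python sketch of the same pattern: one side binds a fresh port, reports the address it is listening on, and the other side connects to that address to pull the data. This is not Condor code. The names serve_file_once and fetch_file are made up for illustration, a thread stands in for the shadow's child process, and the sketch binds to 127.0.0.1 only so that it runs on a single machine.

```python
import socket
import threading


def serve_file_once(payload, bind_host="127.0.0.1"):
    """Listen on a newly bound port and send payload to the first client.

    Hypothetical stand-in for the shadow's child process. Port 0 asks the
    OS for any unused port. Returns the (host, port) being listened on,
    plus the worker thread so the caller can wait for it.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((bind_host, 0))
    listener.listen(1)
    address = listener.getsockname()

    def worker():
        # Accept exactly one connection, send the data, then close.
        with listener:
            conn, _peer = listener.accept()
            with conn:
                conn.sendall(payload)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return address, thread


def fetch_file(address):
    """Connect to the advertised address and read until the peer closes.

    Hypothetical stand-in for the starter side of the transfer.
    """
    chunks = []
    with socket.create_connection(address, timeout=5) as sock:
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    addr, worker_thread = serve_file_once(b"pretend this is the executable")
    print("listening on", addr)
    data = fetch_file(addr)
    worker_thread.join(timeout=5)
    print(len(data), "bytes received")
```

The point of the sketch is that the connecting side can only succeed if the advertised address is one it can actually reach, so the interface the listener binds to matters.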

You can get more information on the network interfaces being used.
On the submit machine, add this line to the config file:

Then look for this sequence of lines in the shadow log:
   Entering pseudo_get_file_stream
   file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"
   addr = <>

This will tell you the address that the shadow's child is listening on.

Add this line to the Condor config file on the execution machines where standard universe jobs are failing:

Then, look for a line like this in the starter log:
  Opening TCP stream to <>

This is the address the starter is attempting to connect to, which should match the address in the shadow log.
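If there are many such lines, the comparison can be scripted. The following Python sketch is only an illustration under some assumptions: it assumes the address appears between angle brackets exactly as in the log lines quoted above, the function names are made up, and the addresses in the usage example are hypothetical values from documentation-only IP ranges, not taken from any real log.

```python
import re

# Patterns based on the log lines quoted in this message. The text
# captured between < and > is treated as an opaque string.
SHADOW_ADDR = re.compile(r"addr\s*=\s*<([^>]*)>")
STARTER_ADDR = re.compile(r"Opening TCP stream to\s*<([^>]*)>")


def shadow_addresses(lines):
    """Return the addr values that follow 'Entering pseudo_get_file_stream'."""
    found = []
    in_block = False
    for line in lines:
        if "Entering pseudo_get_file_stream" in line:
            in_block = True
            continue
        if in_block:
            match = SHADOW_ADDR.search(line)
            if match:
                found.append(match.group(1))
                in_block = False
    return found


def starter_addresses(lines):
    """Return every address the starter reports opening a TCP stream to."""
    found = []
    for line in lines:
        match = STARTER_ADDR.search(line)
        if match:
            found.append(match.group(1))
    return found


def unmatched_starter_addresses(shadow_lines, starter_lines):
    """Starter targets that never appear as an address in the shadow log."""
    advertised = set(shadow_addresses(shadow_lines))
    return [a for a in starter_addresses(starter_lines) if a not in advertised]


if __name__ == "__main__":
    # Hypothetical log excerpts, for illustration only.
    shadow_log = [
        "Entering pseudo_get_file_stream",
        '   file = "/var/lib/condor/spool/55/cluster55.ickpt.subproc0"',
        "   addr = <192.0.2.10:40001>",
    ]
    starter_log = [
        "Opening TCP stream to <192.0.2.10:40001>",
        "Opening TCP stream to <198.51.100.7:40002>",
    ]
    print(unmatched_starter_addresses(shadow_log, starter_log))
```

Any address this reports is one the starter tried to reach that the shadow never advertised, which is where I would start looking.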

Thanks and regards,
Jaime Frey
UW-Madison Condor Team