
Re: [HTCondor-users] InfiniBand

On 2/24/2016 12:37 PM, Jonathan Knudson wrote:
I have been tasked with setting up a new server cluster.  There is one
head node and 12 compute nodes.  This system is connected via
InfiniBand.  I read in the documentation that IB is useful for parallel
jobs.  Can I utilize this network with the Vanilla Universe?

The main advantage of IB is low latency, which helps parallel jobs that pass many small messages between compute nodes (i.e. MPI jobs). Many sites want only MPI traffic on their IB network, and purposefully direct all non-MPI traffic (HTCondor traffic, NFS traffic, ssh/scp traffic, etc. -- anything that is not especially latency-sensitive) to ethernet so as not to degrade the performance of their MPI jobs.

The HTCondor config knobs do not control the pathway for traffic generated by your jobs themselves; they only control traffic originating from the HTCondor daemons, such as file transfer performed by the condor_starter (if you are not using a shared file system) and system traffic such as ClassAds to/from the collector. You will have to configure your MPI library and your shared filesystem (NFS, Gluster, whatever) to use the IB network separately -- HTCondor's config file has no impact on those services.
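To make that division of responsibility concrete, here is a sketch (the host name and mount point are invented for illustration):

```
# condor_config: affects only HTCondor daemon traffic
# (starter file transfer, collector updates, etc.)
NETWORK_INTERFACE = 192.168.0.*

# /etc/fstab: NFS over the IB network is configured separately;
# HTCondor's config file has no effect on this mount
ib-head:/export/home  /home  nfs  defaults  0 0
```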

Unless you are asking HTCondor to transfer large amounts of data via the transfer_input_files / transfer_output_files commands in your job submit file, I am not sure there is any advantage to setting up HTCondor to use the IB. And even in that case, file transfer is primarily a bandwidth issue, not a latency issue.
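For reference, the file-transfer traffic in question is the kind generated by submit-file commands like these (a minimal sketch; the executable and file names are made up):

```
# hypothetical vanilla-universe submit file; names are examples only
universe                = vanilla
executable              = my_analysis
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
transfer_output_files   = results.dat
queue
```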

If you still want to set up the HTCondor daemons to use the IB network, I think the issue you are facing below is that there is no routing between your IB network and your ethernet network. That means that if you set NETWORK_INTERFACE = 192.168.0.* on your CM, you likely want to set CONDOR_HOST everywhere else to an explicit IP address on the IB network, since a DNS name may resolve to the address of the ethernet interface.
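Concretely, the setup described above might look like this (192.168.0.10 is a placeholder for your central manager's actual IB address):

```
# On the central manager (CM): bind daemons to the IB interface
NETWORK_INTERFACE = 192.168.0.*

# On the execute nodes: point at the CM by explicit IB address,
# not a DNS name that may resolve to the ethernet interface
CONDOR_HOST = 192.168.0.10
```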

Hope the above helps.

Right now I have Condor installed on the head node and one of the
compute nodes.  I have read through the documentation about having
multiple NICs.  On the compute node I have BIND_ALL_INTERFACES set
to True.  On the CM I have set NETWORK_INTERFACE = (this is the IB
address), but I still get the "Failed to connect" error.  The CM is on
our production network and on the 192.168.0.x network, and has 3 IP
addresses assigned.

When I added NETWORK_INTERFACE = on the compute node and changed
BIND_ALL_INTERFACES = False (or commented it out), I get the error
"Can't connect to local master."

When using either the IB or GbE network I get the "Error: communication"
message:

CEDAR:6001:Failed to connect to <


This might be a Linux issue, which is another problem in itself…



HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685