Three Simple Steps to Achieving Peak DDS Performance
Written by Dave Seltz
May 10, 2017
RTI Connext® DDS provides an order of magnitude performance improvement over most other messaging middleware. But occasionally we run into customers who are trying to improve the performance of their DDS communications. This performance improvement can be achieved in either throughput or latency. In this blog, I will go through the three simple steps required to assess the performance of your system and will also review some of the most common ways customers have improved performance of their DDS communications.
Step 1: What performance should you be getting?
Compare the numbers you are getting with the comprehensive DDS benchmarks that RTI provides here: https://www.rti.com/products/dds/benchmarks.
If you are not getting close to the numbers you see in the DDS benchmarks, there are a couple things to try:
Use RTI Perftest to make sure you’re comparing apples to apples.
The configuration of the NIC and the network switch, as well as the maximum network throughput and the CPU, all have an impact on the final DDS performance results. So, to make a fair comparison run the DDS benchmarks on your hardware. RTI makes the DDS benchmark program, “RTI Perftest,” available in source code format with complete documentation. You can find a copy of Perftest here: https://community.rti.com/downloads/rti-connext-dds-performance-test
Make sure you are running your tests using the network interface you think you are using.
DDS enables shared memory and UDPv4 transports by default. If Shared memory is available between two nodes DDS will use that by default. But if there are many network interfaces available DDS will only use the first four. I’ve seen developers want to test out a certain network interface, say Infiniband, but it was not one of the first four listed and so DDS was not adding it to the mix. In fact, on Windows systems, the order that network interfaces are listed by the OS, and thus selected by DDS, is random and so the network interface you are actually using can change from run to run. In fact, DDS will actually send the same data over two paths, if they exist, to the same endpoint. This can take up CPU time and slow throughput. You can explicitly select the interface you want (or do not want) using the transport QOS “allow-interfaces” property. Here is a good RTI Community article on the subject: https://community.rti.com/howto/control-or-restrict-network-interfaces-nics-used-discovery-and-data-distribution.
Following is the actual XML code for “allow_interfaces” and “deny_interfaces” QOS that lets you explicitly pick the network interface you want to use or do not want to use:
Step 2. Use the RTI DDS tools to diagnose your performance issues.
Use RTI Monitor to look for the number of ACKs, NACKs, dropped packets, and duplicate packets. If these numbers are high, it can be due to several things:
- Transport buffer sizes are too small
- MTU is not optimized for switch
- There may be too many heartbeats causing multiple resends for single NACKs, indicating the reader is not keeping up
- The CPU and memory process(es) are bound.
Use RTI Monitor or Admin Console to compare QOS settings of the DataReaders and DataWriters. Sometimes you are not using the QOS values you think you are using.
A great way to learn about using the Admin Console and the Monitor tools is to watch this video from our tools lead, Ken Brophy.
Step 3. Now let’s start to look at your application to see how we can speed things up by changing the “shape” of the data in motion.
RTI DDS gives you many ways to fine tune your system using QOS settings. This flexibility is great because you have a lot of control over how DDS works. But all the options can be daunting! I won’t go over every setting (this blog would quickly grow to be a textbook) but I will hit on what I feel are the most important settings to check in regards to performance.
First, don’t use strict reliability if it is not needed. Strict reliability makes sure that every sample reaches every reliable destination and will re-send samples if necessary. Resending samples and the structure that supports them take time and memory. Many applications would be fine missing a sample very occasionally or waiting longer for it to be re-transmitted.
If you do need to use strict reliability then start with the DDS built-in profile “StrictReliable.HighThroughput". It is a good idea in general to use the built-in profiles that RTI provides. These built-in profiles are set up by RTI to have all of the default settings needed for the most common DDS use cases. The built-in profiles can be used as-is or can be used as the basis for your QOS configuration and then tweaked for your specific needs. You can read about using DDS built-in profiles and get a working example here: https://community.rti.com/examples/built-qos-profiles
Using Extensible types (XTypes) and sequences of structures can hurt performance. DDS serializes and de-serializes data it sends and receives, and this process takes a lot longer with complicated data types.
Adjust heartbeat_period/ ACKNACK combo. In reliable communications, the DataWritersends DDS data samples and heartbeats to reliable DataReaders. A DataReader responds to a heartbeat by sending an ACKNACK, which tells the DataWriter what the DataReader has received so far. In addition, the DataReader can request missing DDS samples (by sending an ACKNACK) and the DataWriter will respond by resending the missing DDS samples. So, the heartbeat_period can control how quickly a data reader can acknowledge receipt of a sample or ask for a sample to be re-sent, impacting performance. Here is an article that talks about how the heartbeat_period can impact latency and throughput.
Modify the Asynchronous Publisher configuration to use flow control to lower the data rate. Sometimes if the data rate from the writer is too fast, the reader gets swamped and the resulting dropped samples and resends slow down the system. Lowering the writer’s data rate a little leaves room for repairs, etc. This gives DDS time to handle incoming data and avoids costly resends. You can use a flow controller to shape the output traffic your publisher will generate. By using an asynchronous publisher and custom flow controller you can lower the data rate. You can see a working example of how to use the asynchronous publisher here: https://community.rti.com/examples/asynchronous-publisher
For smaller sample sizes, use batching and/or Turbo Mode. Batching groups of small samples into a single large packet is more efficient to send and can result in a large throughput increase. But note that while the use of batching increases throughput, it can hurt latency when little data is being sent (because of the added time needed to batch small samples). In high-throughput cases, though, average latency results because of all the CPU saved on the subscriber side of the interface.
Turbo Mode is an experimental feature that uses an intelligent algorithm that adjusts the number of bytes in each batch at runtime according to current system conditions, such as write speed (or write frequency) and sample size. This intelligence gives Turbo Mode the ability to increase throughput at high message rates and avoid negatively impacting message latency at low message rates.
Here is an article that goes into detail on how to use batching and includes a working example: https://community.rti.com/examples/batching-and-turbo-mode
Use multicast for topics with more than a couple of subscribers. Multicast allows a publisher to send to multiple readers with a single write, greatly reduces network and publisher-side processor utilization. Note that sometimes this feature is not available at the network level. Here is a good article on how to implement multicast: https://community.rti.com/best-practices/use-multicast-one-many-data
For reliable communications modify the Send Window size. When a reliable DataWriter writes a DDS sample, it keeps that sample in its queue until it has received acknowledgments from all of its subscribing DataReaders that the sample was received. The number of outstanding DDS samples allowed is referred to as the DataWriter's "send window." Once the number of outstanding DDS samples has reached the send window size, subsequent writes will block until an outstanding DDS sample is acknowledged. Anytime the writer blocks, it hurts performance. You can read about adjusting the Send Window in section 126.96.36.199 of the DDS User’s Manual.
Modify the transport settings. Whether you are using UDPv4 or shared memory or a custom transport, having the right buffer sizes and message sizes configured is extremely important when trying to optimize performance. Following is XML code for modifying transport message size and buffers sizes for the UDPv4 transport:
Note that the sizes used here are suggestions for optimizing performance when using large samples. You can make these values smaller for smaller samples.
I hope this advice is helpful in getting the best performance out of your DDS Application. I’ve listed the tips I’ve found most helpful for improving DDS performance but there are other methods that can also be helpful depending on the circumstances. In order to get more information on improving throughput or latency (or really help with any other Connext DDS issue), I encourage you to check out the RTI Community portal. The RTI Community portal is an excellent source of information and support! And of course, always feel free to contact our great support department or your local Field Application Engineer for further help.