RTI benchmarked Connext DDS with a wide variety of latency and throughput tests using RTI Performance Test (PerfTest). These results show that Connext DDS provides sub-millisecond latency that scales linearly with data payload size and throughput that easily exceeds 90% of line rate over
The PerfTest benchmarking tool is available free of charge, together with its documentation and a video tutorial.
- RTI Connext DDS
- CentOS Linux release 7.1.1503 (Core) (Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux)
- RTI Perftest 2.2
- Java: jdk1.7.0_76
Test duration: 300 seconds per test (
Data payload sizes: 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768 and 63000 bytes
Switch Configuration: D-Link DXS-3350 SR
- 176Gbps Switching Capacity
- Dual 10-Gig stacking ports and optional 10-Gig uplinks
- Stacks up to 8 units per stack
- Memory: 4MB (Packet Buffer Size)
- Interfaces: 48 x 10/100/1000BASE-T ports
- Intel I350 Gigabit NIC
- Intel Core i7 CPU
- Cache: 12MB
- Number of cores: 6 (12 threads)
- Speed: 3.33 GHz
- Memory: 12GB
Connext DDS provides low and predictable latency that scales linearly with message size.
These charts show the one-way latency for publish/subscribe messaging in applications. Latency was measured, in microseconds, by having the consumer (DDS DataReader) echo messages back to the producer (DDS DataWriter). This allowed round-trip latency to be measured on the sending machine, avoiding time synchronization issues. The round-trip latency was divided in half to get the one-way latency that is shown. The test was repeated up to a maximum payload of 63,000 bytes. RTI Connext DDS APIs were used with the standard RTPS interoperability protocol and the message exchanges were reliable.
Even at larger message sizes, the variation between the minimum and 99.99%-ile* latency remains consistently low. RTI Connext DDS exhibits very low jitter and very high determinism, making it suitable for time- and mission-critical applications.
* 99.99 percentile means only one out of 10,000 samples is expected to have longer latency.
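As a minimal sketch of the post-processing described above (not PerfTest's actual implementation), the following Python snippet halves each round-trip time to estimate one-way latency and then computes a nearest-rank percentile. The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math
from fractions import Fraction


def one_way_latencies_us(round_trip_us):
    """Estimate one-way latency by halving each measured round-trip time."""
    return [rtt / 2.0 for rtt in round_trip_us]


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids floating-point surprises for values like 99.99.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical round-trip times in microseconds: 9,999 fast samples and one outlier.
round_trips_us = [100.0] * 9999 + [400.0]
one_way_us = one_way_latencies_us(round_trips_us)
p9999_us = percentile(one_way_us, 99.99)

print("min one-way latency (us):", min(one_way_us))
print("99.99th percentile (us):", p9999_us)
print("max one-way latency (us):", max(one_way_us))
```

With 10,000 samples, exactly one sample lies above the 99.99th percentile, which matches the footnote's "one out of 10,000" reading.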
Latency - Connext DDS C++ API - UDP v4
Latency - Connext DDS C++ API - Shared Memory
Latency - Connext DDS Java API - UDP v4
Latency - Connext DDS Java API - Shared Memory
Connext DDS enables sustained, high throughput that approaches the theoretical network bandwidth, with modest CPU requirements.
The following graphs show sustainable one-to-one (point-to-point) publish/subscribe throughput in terms of network bandwidth (megabits per second). Throughput was measured between a single producing thread and a single consuming thread, over Gigabit Ethernet, using a single DDS topic.
Accounting for Ethernet, IP and UDP overhead, the maximum bandwidth available for message data (and metadata) is slightly over 950 megabits/sec on 1 Gbps Ethernet. The data shows Connext DDS is able to fully utilize all of this available bandwidth when sending messages larger than 256 bytes with C++ applications and larger than 1024 bytes with Java applications. Essentially, throughput is limited by the network and not by the CPU or Connext DDS protocol overhead.
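The figure of slightly over 950 megabits/sec can be reproduced with a back-of-the-envelope calculation. The sketch below assumes standard Ethernet II framing, IPv4 without options, a 1500-byte MTU, no VLAN tag, and one UDP datagram per full-size frame; these assumptions and the function name are illustrative and are not stated in the benchmark description.

```python
# Per-frame overhead in bytes on the wire (standard Ethernet II / IPv4 / UDP values).
PREAMBLE_AND_SFD = 8   # preamble plus start-of-frame delimiter
ETHERNET_HEADER = 14   # destination MAC, source MAC, EtherType
ETHERNET_FCS = 4       # frame check sequence
INTERFRAME_GAP = 12    # mandatory idle time between frames
IPV4_HEADER = 20       # assumes no IP options
UDP_HEADER = 8
MTU = 1500             # assumed maximum IP packet size


def max_udp_goodput_mbps(line_rate_mbps=1000.0, mtu=MTU):
    """Upper bound on UDP payload bandwidth for full-size frames."""
    udp_payload = mtu - IPV4_HEADER - UDP_HEADER
    wire_bytes = (mtu + ETHERNET_HEADER + ETHERNET_FCS
                  + PREAMBLE_AND_SFD + INTERFRAME_GAP)
    return line_rate_mbps * udp_payload / wire_bytes


print(f"{max_udp_goodput_mbps():.1f} Mbps available for message data and metadata")
```

Under these assumptions, 1472 of every 1538 bytes on the wire carry UDP payload, which is about 95.7% of line rate.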
Because Connext DDS uses peer-to-peer messaging — without a centralized or per-node Enterprise Service Bus (ESB), message broker, server or daemon processes — it does not impose any inherent limit on aggregate messaging capacity. Throughput is limited only by the network infrastructure. In practice, Connext DDS can deliver orders of magnitude higher capacity than other solutions.
Throughput - Connext DDS C++ API - UDP v4
Throughput - Connext DDS C++ API - Shared Memory
Throughput - Connext DDS Java API - UDP v4
Throughput - Connext DDS Java API - Shared Memory
Connext DDS maintains excellent latency, even at high message rates.
Enterprise messaging middleware typically queues messages (or blocks producers) when volume exceeds capacity. In contrast, Connext DDS is designed for real-time applications in which the consequences of excessive latency could be catastrophic, e.g., autonomous cars, automated trading applications, combat systems or any time-critical IIoT applications.
The following chart shows how latency increases with throughput. Even at more than 200K samples per second, latency remains under 100 microseconds.
Platforms and libraries
C++ benchmark application:
- i86 Linux CentOS 5.5 using RTI Connext DDS release target libraries for i86Linux2.6gcc4.1.1.
- x64 Linux CentOS 5.5 using RTI Connext DDS release target libraries for x64Linux2.6gcc4.1.1.
The benchmark application uses an updated version of the Connext DDS libraries that instruments the calls that allocate memory from the heap in order to measure the memory usage.
The program memory reflects the memory required to load the dynamic libraries into memory.
| Library | Size for arch: i86Linux2.6gcc4.1.1 | Size for arch: x64Linux2.6gcc4.1.1 |
| --- | --- | --- |
| libnddscpp.so | 1,450,110 bytes | 1,500,170 bytes |
| libnddscpp2.so | 1,158,808 bytes | 1,174,743 bytes |
| libnddsc.so | 5,049,030 bytes | 5,450,549 bytes |
| libnddscore.so | 5,415,947 bytes | 5,813,595 bytes |
This section provides the default and minimum stack size for all the different threads created by the middleware. This includes the following threads:
- Database thread
- Event thread
- Receive threads
- Asynchronous publishing thread
- Batching thread
The actual number of threads created by the middleware will depend on the configuration of various QoS policies.
| Thread | Default stack size | Minimum stack size² |
| --- | --- | --- |
| User thread | OS default¹ | 30,000 bytes |
| Database thread | OS default¹ | 7,400 bytes |
| Event thread | OS default¹ | 18,000 bytes |
| Receiver thread | OS default¹ | 11,300 bytes |
| Asynchronous publishing thread | OS default¹ | 9,000 bytes |
| Batch thread | OS default¹ | 9,000 bytes |
1 In Linux, the OS default can be obtained by invoking the
2 This value refers to the minimum stack size needed for a given thread. This value assumes no user-specific stack space is needed; therefore, if the user adds any data on the thread's stack, that size must be taken into account.
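To illustrate the second footnote, here is a small sketch that adds user-specific stack usage to the minimum values from the table. The function names and the 4,096-byte user figure are hypothetical; the per-thread minimums are copied from the table.

```python
# Minimum stack sizes in bytes, copied from the table above.
MIN_STACK_BYTES = {
    "User thread": 30_000,
    "Database thread": 7_400,
    "Event thread": 18_000,
    "Receiver thread": 11_300,
    "Asynchronous publishing thread": 9_000,
    "Batch thread": 9_000,
}


def required_stack_bytes(thread, user_stack_bytes=0):
    """Minimum stack size plus any stack space the user's own code needs on that thread."""
    if user_stack_bytes < 0:
        raise ValueError("user_stack_bytes must not be negative")
    return MIN_STACK_BYTES[thread] + user_stack_bytes


def is_stack_size_sufficient(thread, configured_bytes, user_stack_bytes=0):
    """Check a configured stack size against the computed requirement."""
    return configured_bytes >= required_stack_bytes(thread, user_stack_bytes)


# Hypothetical example: user code places 4,096 bytes on the event thread's stack.
print(required_stack_bytes("Event thread", 4_096))
print(is_stack_size_sufficient("Event thread", 20_000, 4_096))
```

In this example a 20,000-byte stack would satisfy the table minimum for the event thread but not the combined requirement once the user data is included.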
The Database thread (also referred to as the Database
The event thread handles all timed events, including checking for timeouts and deadlines as well as sending periodic heartbeats and repair traffic. There is one event thread per DomainParticipant.
The receive threads are used to receive and process the data from the installed transports. There is one receive thread per (transport, receive port) pair.
When using the built-in UDPv4 and SHMEM transports (the default configuration), the middleware creates five receive threads:
For discovery:
- 2 for unicast (one for UDPv4, one for SHMEM).
- 1 for multicast (for UDPv4).
For user data:
- 2 for unicast (one for UDPv4, one for SHMEM).
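The count of five follows from the rule of one receive thread per (transport, receive port) pair. The sketch below simply enumerates the default pairs listed above; grouping the first three under discovery traffic is inferred from context, and the tuple layout and names are illustrative only.

```python
from collections import Counter

# One entry per (traffic kind, addressing, transport) receive port in the default configuration.
DEFAULT_RECEIVE_PORTS = [
    ("discovery", "unicast", "UDPv4"),
    ("discovery", "unicast", "SHMEM"),
    ("discovery", "multicast", "UDPv4"),
    ("user data", "unicast", "UDPv4"),
    ("user data", "unicast", "SHMEM"),
]


def receive_thread_count(receive_ports):
    """One receive thread per distinct (transport, receive port) pair."""
    return len(set(receive_ports))


threads_per_traffic_kind = Counter(kind for kind, _, _ in DEFAULT_RECEIVE_PORTS)

print(receive_thread_count(DEFAULT_RECEIVE_PORTS))
print(dict(threads_per_traffic_kind))
```

Enabling an additional transport or a multicast receive address for user data would add entries to this list, and therefore receive threads.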
Asynchronous Publishing Thread
The asynchronous publishing thread handles the data transmission when asynchronous publishing is enabled in a DataWriter.
There is one asynchronous publishing thread per Publisher. This thread is created only if at least one DataWriter in the Publisher enables asynchronous publishing.
The batch thread handles the asynchronous flushing of a batch when batching is enabled in a DataWriter and the flush_period is set to a value other than DDS_DURATION_INFINITE.
There is one batch thread per Publisher. This thread is created only if at least one DataWriter in the Publisher enables batching and sets a finite flush_period.
By default, the stack size value assigned to each one of these threads is platform and OS dependent. This value can be modified by updating the thread stack size QoS value, but a minimum is required.
This section provides the memory allocated by the OS for the built-in transports: UDPv4, UDPv6, and SHMEM when using the default QoS.
| Receive socket buffer size | 131,072 bytes | 131,072 bytes |
| Send socket buffer size | 131,072 bytes | 131,072 bytes |
For SHMEM, the receive buffer size also depends on the maximum size of a SHMEM message and the maximum number of received SHMEM messages:
(SHMEM_RECEIVED_MESSAGE_COUNT_MAX_DEFAULT * SHMEM_MESSAGE_SIZE_MAX_DEFAULT / 4) = (64 * 65536 / 4) = 1,048,576 bytes
| Receive buffer size | 1,048,576 bytes |
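The same arithmetic can be written as a short sketch. The two constants carry the default values quoted in the formula; the function name is hypothetical, and the formula is taken from this section as is.

```python
# Default values quoted in the formula above.
SHMEM_RECEIVED_MESSAGE_COUNT_MAX_DEFAULT = 64
SHMEM_MESSAGE_SIZE_MAX_DEFAULT = 65536  # bytes


def shmem_receive_buffer_bytes(
    received_message_count_max=SHMEM_RECEIVED_MESSAGE_COUNT_MAX_DEFAULT,
    message_size_max=SHMEM_MESSAGE_SIZE_MAX_DEFAULT,
):
    """Receive buffer size as given by the formula: count * size / 4."""
    return received_message_count_max * message_size_max // 4


print(shmem_receive_buffer_bytes())
```

Halving either parameter halves the resulting buffer, which is a quick way to estimate the effect of tuning the transport settings.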
When using UDPv4 with the default configuration, for each new DomainParticipant created, the middleware uses:
- 1 receive socket to receive Unicast-Discovery data.
- 1 receive socket to receive Multicast-Discovery data.
- 1 receive socket to receive Unicast-UserData data.
- 1 socket to send Unicast data.
- N sockets to send Multicast-Discovery where N is the number of multicast interfaces in the host.
The port assigned for the receive socket depends on the domain ID and participant ID.
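This section does not give the port formula. As an illustration, the sketch below uses the default port-mapping parameters from the OMG DDSI-RTPS specification (port base 7400, domain gain 250, participant gain 2, offsets 0, 10, 1 and 11). It assumes these defaults are in effect; they are configurable, so treat the results as indicative rather than as a statement about any particular deployment.

```python
# Default port-mapping parameters from the OMG DDSI-RTPS specification.
PORT_BASE = 7400
DOMAIN_ID_GAIN = 250
PARTICIPANT_ID_GAIN = 2
D0, D1, D2, D3 = 0, 10, 1, 11  # per-port-kind offsets


def rtps_default_ports(domain_id, participant_id):
    """Well-known RTPS ports for a participant, assuming the specification defaults."""
    base = PORT_BASE + DOMAIN_ID_GAIN * domain_id
    return {
        "discovery_multicast": base + D0,
        "discovery_unicast": base + D1 + PARTICIPANT_ID_GAIN * participant_id,
        "user_multicast": base + D2,
        "user_unicast": base + D3 + PARTICIPANT_ID_GAIN * participant_id,
    }


print(rtps_default_ports(domain_id=0, participant_id=0))
print(rtps_default_ports(domain_id=1, participant_id=2))
```

Only the unicast ports depend on the participant ID, which is what allows several participants in the same domain to run on one host.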
The same number of sockets is opened when using UDPv6.
For SHMEM, RTI Connext DDS uses the following by default:
- 1 shared memory buffer for Unicast-Discovery.
- 1 shared memory buffer for Unicast-UserData.
The receive and send socket buffer sizes can be configured by modifying the transport QoS settings.
Heap Usage of RTI Connext DDS Entities
RTI has designed and implemented a benchmark application that measures the memory that is directly allocated by the middleware using malloc(). This benchmark application uses updated libraries that have been instrumented to measure the heap allocations. In addition, the RTI Connext DDS libraries request the OS to allocate other memory, including:
- Socket buffers (see RTI Transports)
- Shared memory regions (see RTI Transports)
- Thread stacks (see RTI Threads)
All the memory allocated by the OS can be tuned using QoS parameters or DDS transport properties.
The following table reports the average heap allocation for the different DDS entities that can be used in an RTI Connext DDS application.
The amount of memory required for an entity depends on the value of different QoS policies. For this benchmark, RTI has used a QoS profile that results in minimum memory usage.
| DomainParticipant | 1,377,577 bytes | 1,876,271 bytes |
| Type registered | 887 bytes | 1,191 bytes |
| Topic | 1,322 bytes | 1,814 bytes |
| Subscriber | 8,962 bytes | 14,063 bytes |
| Publisher | 2,322 bytes | 3,222 bytes |
| DataReader | 52,359 bytes | 82,445 bytes |
| DataWriter | 30,447 bytes | 44,145 bytes |
| Instance registered in DataWriter | 289 bytes | 442 bytes |
| Sample store in DataWriter | 925 bytes | 1,312 bytes |
| Remote DataReader | 4,734 bytes | 6,416 bytes |
| Remote DataWriter | 9,627 bytes | 14,362 bytes |
| Instance registered in DataReader | 616 bytes | 973 bytes |
| Sample store in DataReader | 654 bytes | 1,137 bytes |
| Remote DomainParticipant | 51,081 bytes | 67,253 bytes |
| DomainParticipantFactory | 60,441 bytes | 75,355 bytes |
Note: To manage the creation and deletion of DDS entities and samples efficiently, RTI Connext DDS implements its own memory manager. The memory manager allocates and manages multiple buffers to avoid continuous memory allocation. Therefore, memory growth is not necessarily linear in the number of DDS entities and samples created. The pre-allocation scheme of the memory manager is configurable.
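For a rough first estimate, the per-entity averages from the table can be summed for a planned application. The sketch below uses the values from the first value column of the table; the function name and the entity counts are hypothetical. Because of the pre-allocation described in the note, the result is only an approximation, not a prediction of actual heap usage.

```python
# Average heap allocation per entity in bytes (first value column of the table above).
HEAP_BYTES_PER_ENTITY = {
    "DomainParticipantFactory": 60_441,
    "DomainParticipant": 1_377_577,
    "Type registered": 887,
    "Topic": 1_322,
    "Subscriber": 8_962,
    "Publisher": 2_322,
    "DataReader": 52_359,
    "DataWriter": 30_447,
}


def estimate_heap_bytes(entity_counts):
    """Linear estimate: sum of (count * average heap bytes) over all entity kinds."""
    return sum(HEAP_BYTES_PER_ENTITY[kind] * count
               for kind, count in entity_counts.items())


# Hypothetical minimal publishing application.
publisher_app = {
    "DomainParticipantFactory": 1,
    "DomainParticipant": 1,
    "Type registered": 1,
    "Topic": 1,
    "Publisher": 1,
    "DataWriter": 1,
}

print(estimate_heap_bytes(publisher_app), "bytes")
```

The DomainParticipant dominates the total, so the number of participants per process is usually the first thing to look at when reducing heap usage.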
Benchmarking Best Practices
Are you or your team planning to benchmark the performance of your system? How do you ensure that the tests are realistic?
Performance doesn’t just mean “fast.” Performance requirements tie your system to the real world by imposing time-related constraints. For example, latency requirements may look like “shall do X functionality in N milliseconds,” whereas throughput requirements may look like “shall handle M thousand Ys per second.”
Before you jump into testing, here are 5 best practices to remember when designing performance tests:
Create a performance model before testing the performance of a system
Waiting until the system is operational is too late to discover performance problems. Pre-defining performance goals, such as maximum response times and acceptable performance metrics, helps you interpret the test results.
Create realistic tests
The test setup must use the actual payload sizes, be representative of the application architecture, and run on the same or comparable (less ideal) hardware platforms and OS.
Perform an apples-to-apples comparison with the performance model
The configuration of the Network Interface Card (NIC) and the network switch, as well as the maximum network throughput and the CPU, all have an impact on the final performance results. So, to make a relevant comparison:
- include the impact of the network device configuration in the performance model and
- run the benchmarks on the same hardware you use to create a performance model.
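One lightweight way to make such a comparison repeatable is to write the performance model down as data and check measured results against it. The following is a hypothetical sketch: the metric names, limits, and measured values are invented for illustration and are not taken from the benchmarks above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Goal:
    """One entry of a performance model: a metric and the limit it must respect."""
    metric: str
    limit: float
    higher_is_better: bool

    def is_met(self, measured: float) -> bool:
        if self.higher_is_better:
            return measured >= self.limit
        return measured <= self.limit


def unmet_goals(goals, measurements):
    """Return the names of all metrics whose measured value misses its goal."""
    return [g.metric for g in goals if not g.is_met(measurements[g.metric])]


# Hypothetical model: "shall respond within 1 ms" and "shall handle 50 thousand samples/s".
MODEL = [
    Goal("latency_p99_us", limit=1000.0, higher_is_better=False),
    Goal("throughput_samples_per_s", limit=50_000.0, higher_is_better=True),
]

# Hypothetical results measured on the target hardware.
measured = {"latency_p99_us": 180.0, "throughput_samples_per_s": 42_000.0}

print(unmet_goals(MODEL, measured))
```

Keeping the model in this form makes it easy to rerun the same check whenever the hardware or network configuration changes.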
Ensure that the tests are using the network interface you think you are using
Typically, test applications can be configured to use multiple network interfaces, and default settings may select other transports, such as shared memory or intra-process transports. If the intent is to test performance over certain network interfaces, ensure that the test application is explicitly configured to exercise the intended network interface(s).
Measure the average while including outliers
The average of any metric is informative, but it can be misleading by itself. Be sure to include other metrics, such as the 90th percentile or the standard deviation, to get a better view of system performance.
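A small, made-up data set shows why the average alone can mislead. In the sketch below, the latency values are hypothetical, and the nearest-rank percentile helper is one of several common percentile definitions.

```python
import statistics


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in microseconds: 95 typical samples and 5 large outliers.
latencies_us = [100.0] * 95 + [5000.0] * 5

mean_us = statistics.mean(latencies_us)
median_us = statistics.median(latencies_us)
stdev_us = statistics.pstdev(latencies_us)
p90_us = nearest_rank_percentile(latencies_us, 90)
p99_us = nearest_rank_percentile(latencies_us, 99)

print(f"mean={mean_us} median={median_us} stdev={stdev_us:.1f} "
      f"p90={p90_us} p99={p99_us}")
```

Here the mean falls between the typical value and the outliers and describes neither, while the percentiles and the large standard deviation reveal both the normal behavior and the tail.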
These benchmarks show that Connext DDS:
- Exhibits high throughput, approaching the theoretical bandwidth of Gigabit Ethernet, using modest CPUs
- Provides very low latency that increases linearly with data payload size
- Sustains low latency and high throughput even at very high levels of message traffic
To learn more or run these benchmarks on your own hardware, please download: