RTI Connext® DDS provides an order-of-magnitude performance advantage over most other messaging and integration middleware on every supported platform.
RTI benchmarked Connext DDS 5.2.3 with a wide variety of latency and throughput tests using RTI Performance Test (PerfTest). These results show that Connext DDS provides sub-millisecond latency that scales linearly with data payload size and throughput that easily exceeds 90% of line rate over gigabit Ethernet. Moreover, latency remains low as data throughput increases.
The PerfTest benchmarking tool is available free of charge, along with its documentation and a video tutorial.
- RTI Connext DDS 5.2.3
- CentOS Linux release 7.1.1503 (Core) (Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux)
- RTI Perftest 5.2.0
- Java: jdk1.7.0_76
Test duration: 300 seconds per test (datapoint)
Data payload sizes: 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768 and 63000 bytes
Switch Configuration: D-Link DXS-3350 SR
- 176Gbps Switching Capacity
- Dual 10-Gig stacking ports and optional 10-Gig uplinks
- Stacks up to 8 units per stack
- Memory: 4MB (Packet Buffer Size)
- Interfaces: 48 x 10/100/1000BASE-T ports
- Intel I350 Gigabit NIC
- Intel Core i7 CPU
- Cache: 12MB
- Number of cores: 6 (12 threads)
- Speed: 3.33 GHz
- Memory: 12GB
Connext DDS provides low and predictable latency that scales linearly with message size.
These charts show the one-way latency for publish/subscribe messaging in applications. Latency was measured, in microseconds, by having the consumer (DDS DataReader) echo messages back to the producer (DDS DataWriter). This allowed round-trip latency to be measured on the sending machine, avoiding time synchronization issues. The round-trip latency was divided in half to get the one-way latency that is shown. The test was repeated up to a maximum payload of 63,000 bytes. RTI Connext DDS APIs were used with the standard RTPS interoperability protocol and the message exchanges were reliable.
Latency - Connext DDS C++ API - UDP v4
Latency - Connext DDS C++ API - Shared Memory
Latency - Connext DDS Java API - UDP v4
Latency - Connext DDS Java API - Shared Memory
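The echo-based measurement described above can be sketched as follows. This is a minimal illustration of the half-round-trip technique, not the PerfTest implementation: the `echo` callable is a hypothetical stand-in for "send a sample and wait for the consumer to echo it back", and the function names are assumptions made for this example.

```python
import time


def one_way_latency_us(send_ns, echo_recv_ns):
    """Return half of the round-trip time, converted from ns to microseconds."""
    return (echo_recv_ns - send_ns) / 2 / 1000.0


def measure(echo, payload, samples=1000):
    """Time `samples` round trips on the sending side only.

    `echo` is a hypothetical blocking callable: it sends `payload` and
    returns once the echoed copy has arrived. Because both timestamps are
    taken on the same machine, no clock synchronization is required.
    """
    latencies = []
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        echo(payload)
        t1 = time.perf_counter_ns()
        latencies.append(one_way_latency_us(t0, t1))
    return latencies
```

Halving the round trip assumes the two directions take about the same time, which is reasonable when both hosts and both paths are identical.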
Even at larger message sizes, the variation between the minimum and 99.99%-ile* latency remains consistently low. RTI Connext DDS exhibits very low jitter and very high determinism, making it suitable for time- and mission-critical applications.
* The 99.99th percentile means only one out of 10,000 samples is expected to have longer latency.
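To make the percentile definition concrete, here is a small sketch that computes a nearest-rank percentile and the spread between the minimum and the 99.99th percentile. The nearest-rank method and the function names are assumptions for illustration; the source does not say which percentile method PerfTest uses.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    # Fraction(str(pct)) avoids floating-point rounding for values such as 99.99.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def jitter_summary(latencies_us):
    """Summarize latency spread as minimum, 99.99th percentile, and their difference."""
    low = min(latencies_us)
    tail = percentile(latencies_us, 99.99)
    return {"min": low, "p99_99": tail, "spread": tail - low}
```

With 10,000 samples, the 99.99th percentile under this method is the second-largest value, so exactly one sample lies above it.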
Connext DDS enables sustained, high throughput that approaches the theoretical network bandwidth, with modest CPU requirements.
The following graphs show sustainable one-to-one (point-to-point) publish/subscribe throughput in terms of network bandwidth (megabits per second). It was measured between a single producing and consuming thread, over Gigabit Ethernet and a single DDS topic.
Accounting for Ethernet, IP and UDP overhead, the maximum bandwidth available for message data (and metadata) is slightly over 950 megabits/sec on 1 Gbps Ethernet. The data shows Connext DDS is able to fully utilize all of this available bandwidth when sending messages larger than 256 bytes with C++ applications and larger than 1024 bytes with Java applications. Essentially, throughput is limited by the network and not by the CPU or Connext DDS protocol overhead.
Throughput - Connext DDS C++ API - UDP v4
Throughput - Connext DDS C++ API - Shared Memory
Throughput - Connext DDS Java API - UDP v4
Throughput - Connext DDS Java API - Shared Memory
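The figure of slightly over 950 megabits/sec quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a standard 1500-byte MTU, IPv4 without options, and no VLAN tag; none of these are stated in the test description, so treat them as assumptions.

```python
# Per-frame overhead in bytes for standard Ethernet (assumed: no VLAN tag).
PREAMBLE_AND_SFD = 8
ETHERNET_HEADER = 14
FRAME_CHECK_SEQUENCE = 4
INTERFRAME_GAP = 12
IPV4_HEADER = 20  # assumed: no IP options
UDP_HEADER = 8


def max_udp_goodput_mbps(link_mbps=1000.0, mtu=1500):
    """Upper bound on UDP payload bandwidth when every frame is full-sized."""
    udp_payload = mtu - IPV4_HEADER - UDP_HEADER
    wire_bytes = (mtu + PREAMBLE_AND_SFD + ETHERNET_HEADER
                  + FRAME_CHECK_SEQUENCE + INTERFRAME_GAP)
    return link_mbps * udp_payload / wire_bytes
```

Under these assumptions each full frame carries 1472 payload bytes in 1538 bytes of wire time, or roughly 957 megabits/sec on a 1 Gbps link. That bound still includes the RTPS headers, which is why the text refers to message data and metadata together.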
Because Connext DDS uses peer-to-peer messaging — without a centralized or per-node Enterprise Service Bus (ESB), message broker, server or daemon processes — it does not impose any inherent limit on aggregate messaging capacity. Throughput is limited only by the network infrastructure. In practice, Connext DDS can deliver orders of magnitude higher capacity than other solutions.
Connext DDS maintains excellent latency, even at high message rates.
Enterprise messaging middleware typically queues messages (or blocks producers) when volume exceeds capacity. In contrast, Connext DDS is designed for real-time applications in which the consequences of excessive latency could be catastrophic, e.g., autonomous cars, automated trading applications, combat systems or any time-critical IIoT applications.
The following chart shows how latency increases with throughput. Even at more than 200K samples per second, latency remains under 100 microseconds.
Latency vs Throughput
Connext DDS performance withstands heavy data traffic.
Platforms and libraries
All measurements were gathered on the following Linux platforms using a C++ benchmark application:
- i86 Linux CentOS 5.5 using RTI Connext DDS release target libraries for i86Linux2.6gcc4.1.1.
- x64 Linux CentOS 5.5 using RTI Connext DDS release target libraries for x64Linux2.6gcc4.1.1.
The benchmark application uses an updated version of the Connext DDS libraries that instruments the calls that allocate memory from the heap in order to measure the memory usage.
The program memory reflects the memory required to load the dynamic libraries into memory.
|Library||Size for arch: i86Linux2.6gcc4.1.1||Size for arch: x64Linux2.6gcc4.1.1|
|libnddscpp.so||1,280,701 bytes||1,325,598 bytes|
|libnddsc.so||4,270,194 bytes||4,585,913 bytes|
|libnddscore.so||4,842,557 bytes||5,199,797 bytes|
This section provides the default and minimum stack size for all the different threads created by the middleware. This includes the following threads:
- Database thread
- Event thread
- Receive threads
- Asynchronous publishing thread
- Batching thread
The actual number of threads created by the middleware will depend on the configuration of various QoS policies.
|Thread||Default stack size||Minimum stack size [2]|
|Database thread||OS default [1]||10,500 bytes|
|Event thread||OS default [1]||22,500 bytes|
|Receive thread||OS default [1]||22,500 bytes|
|Asynchronous publishing thread||OS default [1]||8,700 bytes|
|Batch thread||OS default [1]||8,700 bytes|
[1] On Linux, the OS default can be obtained by invoking the ulimit command (ulimit -s). On the CentOS 5 machines used for these measurements, this size was 10,240 KB (on both the i86 and x64 machines).
[2] This value is the minimum stack size needed for a given thread. It assumes no user-specific stack space is needed; if the user places any data on the thread's stack, that size must be added.
The Database thread (also referred to as the Database cleanup thread) is created to garbage-collect records related to deleted entities from the in-memory database used by the middleware. There is one database thread per DomainParticipant.
The event thread handles all timed events, including checking for timeouts and deadlines as well as sending periodic heartbeats and repair traffic. There is one event thread per DomainParticipant.
The receive threads are used to receive and process the data from the installed transports. There is one receive thread per (transport, receive port) pair.
When using the built-in UDPv4 and SHMEM transports (the default configuration), the middleware creates five receive threads:
For discovery:
- 2 for unicast (one for UDPv4, one for SHMEM).
- 1 for multicast (for UDPv4).
For user data:
- 2 for unicast (one for UDPv4, one for SHMEM).
Asynchronous Publishing Thread
The asynchronous publishing thread handles data transmission when asynchronous publishing is enabled in a DataWriter.
There is one asynchronous publishing thread per Publisher. This thread is created only if at least one DataWriter in the Publisher enables asynchronous publishing.
The batch thread handles the asynchronous flushing of a batch when batching is enabled in a DataWriter and the flush_period is set to a value different than DDS_DURATION_INFINITE.
There is one batch thread per Publisher. This thread is created only if at least one DataWriter in the Publisher enables batching and sets a finite flush_period.
By default, the stack size assigned to each of these threads depends on the platform and OS. This value can be changed by updating the thread stack size QoS value, but a minimum is required.
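As a rough illustration of how the minimum stack sizes add up, the sketch below totals them for one DomainParticipant. The per-thread minimums come from the table above and the thread counts from the descriptions in this section; the function and parameter names are hypothetical, and the result is a lower bound that excludes any user data placed on the stacks.

```python
# Minimum stack sizes in bytes, taken from the table in this section.
MIN_STACK_BYTES = {
    "database": 10_500,
    "event": 22_500,
    "receive": 22_500,
    "async_publishing": 8_700,
    "batch": 8_700,
}


def min_stack_bytes(receive_threads=5, async_publishers=0, batching_publishers=0):
    """Lower bound on total thread stack for one DomainParticipant.

    receive_threads: 5 with the default UDPv4 + SHMEM configuration.
    async_publishers: Publishers with a DataWriter using asynchronous publishing.
    batching_publishers: Publishers with a DataWriter that batches with a finite flush_period.
    """
    total = MIN_STACK_BYTES["database"] + MIN_STACK_BYTES["event"]
    total += receive_threads * MIN_STACK_BYTES["receive"]
    total += async_publishers * MIN_STACK_BYTES["async_publishing"]
    total += batching_publishers * MIN_STACK_BYTES["batch"]
    return total
```

For the default configuration this gives 145,500 bytes, far below the 10,240 KB per thread that the OS default would reserve on the CentOS 5 test machines.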
This section provides the memory allocated by the OS for the built-in transports: UDPv4, UDPv6, and SHMEM when using the default QoS.
|Receive socket buffer size||131,072 bytes||131,072 bytes|
|Send socket buffer size||131,072 bytes||131,072 bytes|
For SHMEM, the value also depends on the maximum size of a SHMEM message and the maximum number of received SHMEM messages:
(SHMEM_RECEIVED_MESSAGE_COUNT_MAX_DEFAULT*SHMEM_MESSAGE_SIZE_MAX_DEFAULT/4) = (64*65536/4) = 1048576
|Receive buffer size||1,048,576 bytes|
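The same arithmetic, written out as a small sketch. The constant names mirror those in the formula above; the helper function is hypothetical and only restates the formula, so it can be reused to see how the buffer size changes if either limit is reconfigured.

```python
# Default values as given in the formula above.
SHMEM_RECEIVED_MESSAGE_COUNT_MAX_DEFAULT = 64
SHMEM_MESSAGE_SIZE_MAX_DEFAULT = 65536


def shmem_receive_buffer_bytes(
    received_message_count_max=SHMEM_RECEIVED_MESSAGE_COUNT_MAX_DEFAULT,
    message_size_max=SHMEM_MESSAGE_SIZE_MAX_DEFAULT,
):
    """Receive buffer size in bytes: count * size / 4, per the formula above."""
    return received_message_count_max * message_size_max // 4
```

For example, doubling the maximum number of received messages to 128 while keeping the message size limit doubles the buffer to 2,097,152 bytes.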
When using UDPv4 with the default configuration, for each new DomainParticipant created, the middleware uses:
- 1 receive socket to receive Unicast-Discovery data.
- 1 receive socket to receive Multicast-Discovery data.
- 1 receive socket to receive Unicast-UserData data.
- 1 socket to send Unicast data.
- N sockets to send Multicast-Discovery where N is the number of multicast interfaces in the host.
The port assigned for the receive socket depends on the domain ID and participant ID.
The same number of sockets is opened when using UDPv6.
Regarding SHMEM, RTI Connext DDS will use by default:
- 1 shared memory buffer for Unicast-Discovery.
- 1 shared memory buffer for Unicast-UserData.
The receive and send socket buffer sizes can be configured by modifying the transport QoS settings.
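Putting the socket and shared-memory counts together gives a nominal per-participant figure for transport buffers. The sketch below assumes the default QoS values listed above, UDPv4 and SHMEM both enabled, and UDPv6 disabled. It also assumes every socket is sized at the full configured buffer, which the source does not state, so read the result as a nominal ceiling rather than measured usage. All names are hypothetical.

```python
SOCKET_BUFFER_BYTES = 131_072           # default send and receive socket buffer size
SHMEM_RECEIVE_BUFFER_BYTES = 1_048_576  # default SHMEM receive buffer size


def transport_buffer_bytes(multicast_interfaces=1):
    """Nominal transport buffer sizes for one DomainParticipant (default configuration)."""
    receive_sockets = 3                      # unicast discovery, multicast discovery, unicast user data
    send_sockets = 1 + multicast_interfaces  # one unicast sender plus one per multicast interface
    shmem_buffers = 2                        # unicast discovery and unicast user data
    udp = (receive_sockets + send_sockets) * SOCKET_BUFFER_BYTES
    shmem = shmem_buffers * SHMEM_RECEIVE_BUFFER_BYTES
    return {"udpv4": udp, "shmem": shmem, "total": udp + shmem}
```

On a host with a single multicast interface this comes to 655,360 bytes for UDPv4 plus 2,097,152 bytes for SHMEM.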
Heap Usage of RTI Connext DDS Entities
RTI has designed and implemented a benchmark application that measures the memory that is directly allocated by the middleware using malloc(). This benchmark application uses updated libraries that have been instrumented to measure the heap allocations. The RTI Connext DDS libraries also request the OS to allocate other memory, including:
- Socket buffers (see RTI Transports)
- Shared memory regions (see RTI Transports)
- Thread stacks (see RTI Threads)
All the memory allocated by the OS can be tuned using QoS parameters or DDS transport properties.
The following table reports the average heap allocation for the different DDS entities that can be used in an RTI Connext DDS application.
The amount of memory required for an entity depends on the value of different QoS policies. For this benchmark, RTI has used a QoS profile that results in minimum memory usage.
|DomainParticipant||913,056 bytes||1,261,912 bytes|
|Type registered||877 bytes||1,313 bytes|
|Topic||1,321 bytes||1,792 bytes|
|Subscriber||9,022 bytes||13,914 bytes|
|Publisher||2,204 bytes||3,017 bytes|
|DataReader||49,811 bytes||79,532 bytes|
|DataWriter||28,133 bytes||41,179 bytes|
|Instance registered in DataWriter||276 bytes||440 bytes|
|Sample store in DataWriter||845 bytes||1,185 bytes|
|Remote DataReader||3,848 bytes||4,952 bytes|
|Remote DataWriter||8,574 bytes||13,154 bytes|
|Instance registered in DataReader||620 bytes||972 bytes|
|Sample store in DataReader||626 bytes||904 bytes|
|Remote DomainParticipant||30,628 bytes||41,738 bytes|
|DomainParticipantFactory||27,942 bytes||41,912 bytes|
The memory reported for samples and instances does not include the user-data, only the meta-data.
Note: To manage the creation and deletion of DDS entities and samples efficiently, RTI Connext DDS implements its own memory manager. The memory manager allocates and manages multiple buffers to avoid continuous memory allocation. Therefore, memory growth is not necessarily linear in the number of DDS entities and samples created. The pre-allocation scheme of the memory manager is configurable.
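For a first-order estimate, the averages in the table can be summed per entity. The sketch below uses the first value column of the table for a subset of entities; the dictionary and function names are hypothetical. Because of the pre-allocation behavior described in the note above, treat the result as a rough guide rather than a prediction.

```python
# Average heap allocation in bytes, first value column of the table above.
HEAP_BYTES = {
    "DomainParticipantFactory": 27_942,
    "DomainParticipant": 913_056,
    "Type registered": 877,
    "Topic": 1_321,
    "Subscriber": 9_022,
    "Publisher": 2_204,
    "DataReader": 49_811,
    "DataWriter": 28_133,
}


def estimate_heap_bytes(counts):
    """Sum the average heap cost for a mapping of entity name to number of instances."""
    return sum(HEAP_BYTES[name] * n for name, n in counts.items())


# Example: a minimal publishing application with a single topic.
publisher_app = {
    "DomainParticipantFactory": 1,
    "DomainParticipant": 1,
    "Type registered": 1,
    "Topic": 1,
    "Publisher": 1,
    "DataWriter": 1,
}
print(estimate_heap_bytes(publisher_app))  # 973533
```

The DomainParticipant dominates the total, so adding more topics or writers to an existing participant changes the estimate only slightly.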
These benchmarks show that Connext DDS
- Exhibits high throughput, approaching the theoretical bandwidth of Gigabit Ethernet, using modest CPUs
- Provides very low latency that increases linearly with data payload size
- Sustains low latency and high throughput even at very high levels of message traffic
To learn more or run these benchmarks on your own hardware, please download: