RTI benchmarked Connext DDS with a wide variety of latency and throughput tests using RTI Performance Test (PerfTest). These results show that Connext DDS provides sub-millisecond latency that scales linearly with data payload size and throughput that easily exceeds 90% of line rate over gigabit Ethernet. Moreover, latency remains low as data throughput increases.

The PerfTest benchmarking tool is completely free, along with documentation and a video tutorial.

Benchmarking Environment


  • RTI Connext DDS
  • CentOS Linux release 7.1.1503 (Core) (Linux 3.10.0-229.4.2.el7.x86_64 #1 SMP Wed May 13 10:06:09 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux)
  • RTI Perftest 2.2
  • Java: jdk1.7.0_76

Test duration: 300 seconds per test (datapoint)

Data payload sizes: 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768 and 63000 bytes


Switch Configuration: D-Link DXS-3350 SR

  • 176Gbps Switching Capacity
  • Dual 10-Gig stacking ports and optional 10-Gig uplinks
  • Stacks up to 8 units per stack
  • Memory: 4MB (Packet Buffer Size)
  • Interfaces: 48 x 10/100/1000BASE-T ports

Test Machines

  • Intel I350 Gigabit NIC
  • Intel Core i7 CPU
    • Cache: 12MB
    • Number of cores: 6 (12 threads)
    • Speed: 3.33 GHz
  • Memory: 12GB


Latency

Connext DDS provides low and predictable latency that scales linearly with message size.

These charts show the one-way latency for publish/subscribe messaging in applications. Latency was measured, in microseconds, by having the consumer (DDS DataReader) echo messages back to the producer (DDS DataWriter). This allowed round-trip latency to be measured on the sending machine, avoiding time synchronization issues. The round-trip latency was divided in half to get the one-way latency that is shown. The test was repeated up to a maximum payload of 63,000 bytes. RTI Connext DDS APIs were used with the standard RTPS interoperability protocol and the message exchanges were reliable.
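Conceptually, the echo-based measurement looks like the following (a minimal sketch in plain C++ using std::chrono; send_sample() and wait_for_echo() are hypothetical stand-ins for the DDS DataWriter write and DataReader echo-reception calls that PerfTest actually performs):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical transport hooks; PerfTest implements these with a
// DDS DataWriter (send) and a DDS DataReader (echo reception).
void send_sample(const void *payload, std::size_t len);
void wait_for_echo();

// Measure one-way latency by halving the echo round trip.
// Both timestamps are taken on the sending machine, so no clock
// synchronization between the two hosts is required.
std::int64_t measure_one_way_latency_us(const void *payload, std::size_t len)
{
    auto t0 = std::chrono::steady_clock::now();
    send_sample(payload, len);   // producer -> consumer
    wait_for_echo();             // consumer echoes the sample back
    auto t1 = std::chrono::steady_clock::now();

    auto round_trip_us =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return round_trip_us / 2;    // one-way = round trip / 2
}
```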

Even at larger message sizes, the variation between the minimum and 99.99%-ile* latency remains consistently low. RTI Connext DDS exhibits very low jitter and very high determinism, making it suitable for time- and mission-critical applications.

* 99.99th percentile: only one out of 10,000 samples is expected to have a longer latency.

Latency - Connext DDS C++ API - UDP v4


Latency - Connext DDS C++ API - Shared Memory


Latency - Connext DDS Java API - UDP v4


Latency - Connext DDS Java API - Shared Memory



Throughput

Connext DDS enables sustained, high throughput that approaches the theoretical network bandwidth, with modest CPU requirements.

The following graphs show sustainable one-to-one (point-to-point) publish/subscribe throughput in terms of network bandwidth (megabits per second). Throughput was measured between a single producing thread and a single consuming thread, over Gigabit Ethernet, using a single DDS topic.

Accounting for Ethernet, IP and UDP overhead, the maximum bandwidth available for message data (and metadata) is slightly over 950 megabits/sec on 1 Gbps Ethernet. The data shows Connext DDS is able to fully utilize all of this available bandwidth when sending messages larger than 256 bytes with C++ applications and larger than 1024 bytes with Java applications. Essentially, throughput is limited by the network and not by the CPU or Connext DDS protocol overhead.
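The "slightly over 950 megabits/sec" figure can be reproduced with simple frame arithmetic (a back-of-the-envelope sketch assuming full-size frames at the standard 1500-byte MTU; real traffic also carries RTPS headers, which count toward the message data and metadata mentioned above):

```cpp
#include <cstdio>

int main()
{
    // Per full-size frame on the wire at the standard 1500-byte MTU:
    const double mtu           = 1500.0;           // IP packet size
    const double eth_overhead  = 14 + 4 + 8 + 12;  // header + FCS + preamble + interframe gap
    const double ip_udp_header = 20 + 8;           // IPv4 + UDP headers

    const double wire_bytes    = mtu + eth_overhead;   // 1538 bytes on the wire
    const double payload_bytes = mtu - ip_udp_header;  // 1472 bytes of UDP payload

    // Fraction of the 1000 Mbps line rate left for UDP payload.
    const double goodput_mbps = 1000.0 * payload_bytes / wire_bytes;
    std::printf("max UDP goodput: %.0f Mbps\n", goodput_mbps);  // ~957 Mbps
    return 0;
}
```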

Because Connext DDS uses peer-to-peer messaging — without a centralized or per-node Enterprise Service Bus (ESB), message broker, server or daemon processes — it does not impose any inherent limit on aggregate messaging capacity. Throughput is limited only by the network infrastructure. In practice, Connext DDS can deliver orders of magnitude higher capacity than other solutions.

Throughput - Connext DDS C++ API - UDP v4


Throughput - Connext DDS C++ API - Shared Memory


Throughput - Connext DDS Java API - UDP v4


Throughput - Connext DDS Java API - Shared Memory


Latency vs Throughput

Connext DDS maintains excellent latency, even at high message rates.

Enterprise messaging middleware typically queues messages (or blocks producers) when volume exceeds capacity. In contrast, Connext DDS is designed for real-time applications in which the consequences of excessive latency could be catastrophic, e.g., autonomous cars, automated trading, combat systems and other time-critical IIoT applications.

The following chart shows how latency increases with throughput. Even at more than 200K samples per second, latency remains under 100 microseconds.


Memory Usage

Platforms and libraries

C++ benchmark application:
  • i86 Linux CentOS 5.5 using RTI Connext DDS release target libraries for i86Linux2.6gcc4.1.1.
  • x64 Linux CentOS 5.5 using RTI Connext DDS release target libraries for x64Linux2.6gcc4.1.1.

The benchmark application uses an instrumented build of the Connext DDS libraries that intercepts the calls that allocate memory from the heap, in order to measure memory usage.

Program Memory

Program memory is the memory required to load the Connext DDS dynamic libraries.

  Library          i86Linux2.6gcc4.1.1   x64Linux2.6gcc4.1.1
  libnddscpp.so    1,450,110 bytes       1,500,170 bytes
  libnddscpp2.so   1,158,808 bytes       1,174,743 bytes
  libnddsc.so      5,049,030 bytes       5,450,549 bytes
  libnddscore.so   5,415,947 bytes       5,813,595 bytes


Stack Memory

This section provides the default and minimum stack size for each of the different threads created by the middleware. These include the following threads:

  • Database thread
  • Event thread
  • Receive threads
  • Asynchronous publishing thread
  • Batching thread

The actual number of threads created by the middleware will depend on the configuration of various QoS policies.

  Thread                           Default stack size   Minimum stack size (2)
  User thread                      OS default (1)       30,000 bytes
  Database thread                  OS default (1)       7,400 bytes
  Event thread                     OS default (1)       18,000 bytes
  Receive thread                   OS default (1)       11,300 bytes
  Asynchronous publishing thread   OS default (1)       9,000 bytes
  Batch thread                     OS default (1)       9,000 bytes


(1) On Linux, the OS default can be obtained by invoking the ulimit command. On the CentOS 5 machines used for this benchmark, this size was 10240 KB (for both the i86 and x64 machines).

(2) The minimum stack size needed for a given thread. This value assumes no user-specific stack space is needed; therefore, if the user adds any data on the thread's stack, that size must be taken into account as well.

Database Thread

The Database thread (also referred to as the Database cleanup thread) is created to garbage-collect records related to deleted entities from the in-memory database used by the middleware. There is one database thread per DomainParticipant.

Event Thread

The event thread handles all timed events, including checking for timeouts and deadlines, as well as sending periodic heartbeats and repair traffic. There is one event thread per DomainParticipant.

Receive Threads

The receive threads are used to receive and process the data from the installed transports. There is one receive thread per (transport, receive port) pair.

When using the built-in UDPv4 and SHMEM transports (the default configuration), the middleware creates five receive threads:

For discovery:

  • 2 for unicast (one for UDPv4, one for SHMEM).
  • 1 for multicast (for UDPv4).

For user data:

  • 2 for unicast (one for UDPv4, one for SHMEM).

Asynchronous Publishing Thread

The asynchronous publishing thread handles the data transmission when asynchronous publishing is enabled in a DataWriter.

There is one asynchronous publishing thread per Publisher. This thread is created only if at least one DataWriter in the Publisher enables asynchronous publishing.

Batch Thread

The batch thread handles the asynchronous flushing of a batch when batching is enabled in a DataWriter and the flush_period is set to a value other than DDS_DURATION_INFINITE.

There is one batch thread per Publisher. This thread is created only if at least one DataWriter in the Publisher enables batching and sets a finite flush_period.

By default, the stack size assigned to each of these threads is platform- and OS-dependent. It can be modified through the thread stack-size QoS settings, subject to the minimums listed above.
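For example, with the classic C++ API the per-thread stack sizes live in RTI's extension QoS policies on the DomainParticipant (a sketch; the fields follow the DATABASE, EVENT and RECEIVER_POOL extension policies, and the byte values are illustrative, not recommendations):

```cpp
#include "ndds/ndds_cpp.h"

// Sketch: set the stack size of the middleware threads above the
// documented minimums before creating the DomainParticipant.
DDSDomainParticipant *create_participant_with_stacks(int domain_id)
{
    DDS_DomainParticipantQos qos;
    DDSTheParticipantFactory->get_default_participant_qos(qos);

    // Illustrative values; see the table above for the minimums.
    qos.database.thread.stack_size      = 64 * 1024;  // database cleanup thread
    qos.event.thread.stack_size         = 64 * 1024;  // event thread
    qos.receiver_pool.thread.stack_size = 64 * 1024;  // receive threads

    return DDSTheParticipantFactory->create_participant(
        domain_id, qos, NULL /* listener */, DDS_STATUS_MASK_NONE);
}
```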

RTI Transports

This section reports the memory allocated by the OS for the built-in transports (UDPv4, UDPv6, and SHMEM) when using the default QoS.


                               UDPv4           UDPv6
  Receive socket buffer size   131,072 bytes   131,072 bytes
  Send socket buffer size      131,072 bytes   131,072 bytes


For SHMEM, the allocated size also depends on the maximum size of a SHMEM message and the maximum number of received SHMEM messages that can be buffered:


  Receive buffer size   1,048,576 bytes
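Both sizing knobs can be adjusted through transport properties on the DomainParticipant QoS (a sketch using the classic C++ PropertyQosPolicyHelper; the property names follow RTI's builtin SHMEM transport, and the values are illustrative only):

```cpp
#include "ndds/ndds_cpp.h"

// Sketch: resize the builtin SHMEM transport buffers. Must be applied
// to the QoS before the DomainParticipant is created.
void configure_shmem(DDS_DomainParticipantQos &qos)
{
    // Maximum size of a single SHMEM message, in bytes.
    DDSPropertyQosPolicyHelper::add_property(
        qos.property,
        "dds.transport.shmem.builtin.parent.message_size_max",
        "65536", DDS_BOOLEAN_FALSE);

    // Total shared-memory receive buffer, in bytes (here 2 MB).
    DDSPropertyQosPolicyHelper::add_property(
        qos.property,
        "dds.transport.shmem.builtin.receive_buffer_size",
        "2097152", DDS_BOOLEAN_FALSE);
}
```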

When using UDPv4 with the default configuration, for each new DomainParticipant created, the middleware uses:

  • 1 receive socket to receive Unicast-Discovery data.
  • 1 receive socket to receive Multicast-Discovery data.
  • 1 receive socket to receive Unicast-UserData data.
  • 1 socket to send Unicast data.
  • N sockets to send Multicast-Discovery data, where N is the number of multicast interfaces on the host.

The port assigned to each receive socket depends on the domain ID and the participant ID.
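The mapping follows the standard RTPS well-known port formula (a sketch using the default RTPS parameters PB=7400, DG=250, PG=2, d0=0, d1=10, d3=11; Connext DDS lets you tune these through the WIRE_PROTOCOL QoS policy):

```cpp
// Default RTPS well-known ports as a function of domain and participant IDs.
const int PB = 7400, DG = 250, PG = 2;   // port base, domain gain, participant gain
const int d0 = 0, d1 = 10, d3 = 11;      // port offsets

int discovery_multicast_port(int domain_id)
{
    return PB + DG * domain_id + d0;     // e.g. 7400 for domain 0
}

int discovery_unicast_port(int domain_id, int participant_id)
{
    return PB + DG * domain_id + d1 + PG * participant_id;  // e.g. 7410
}

int user_unicast_port(int domain_id, int participant_id)
{
    return PB + DG * domain_id + d3 + PG * participant_id;  // e.g. 7411
}
```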

The same number of sockets is opened when using UDPv6.

Regarding SHMEM, RTI Connext DDS will use by default:

  • 1 shared memory buffer for Unicast-Discovery.
  • 1 shared memory buffer for Unicast-UserData.

The receive and send socket buffer sizes can be configured by modifying the transport QoS settings.
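For example (a sketch in the classic C++ API; the property names follow RTI's builtin UDPv4 transport, and the 2 MB values are illustrative only):

```cpp
#include "ndds/ndds_cpp.h"

// Sketch: grow the UDPv4 socket buffers. Must be applied to the QoS
// before the DomainParticipant is created.
void configure_udpv4_buffers(DDS_DomainParticipantQos &qos)
{
    DDSPropertyQosPolicyHelper::add_property(
        qos.property,
        "dds.transport.UDPv4.builtin.recv_socket_buffer_size",
        "2097152", DDS_BOOLEAN_FALSE);   // 2 MB receive buffer

    DDSPropertyQosPolicyHelper::add_property(
        qos.property,
        "dds.transport.UDPv4.builtin.send_socket_buffer_size",
        "2097152", DDS_BOOLEAN_FALSE);   // 2 MB send buffer
}
```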

Heap Usage of RTI Connext DDS Entities

RTI has designed and implemented a benchmark application that measures the memory directly allocated by the middleware using malloc(). This benchmark application uses the instrumented libraries described above to record heap allocations. Additionally, the RTI Connext DDS libraries request the OS to allocate other memory, such as the thread stacks and transport buffers covered in the previous sections.

All of this OS-allocated memory can be tuned using QoS parameters or DDS transport properties.
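As a rough illustration of the instrumentation approach, the global allocator can be wrapped to keep a running byte count (a minimal sketch; RTI's instrumented libraries hook their internal allocation calls, not ::operator new):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Running total of live heap bytes requested through operator new.
static std::atomic<std::size_t> g_heap_bytes{0};

void *operator new(std::size_t size)
{
    // Reserve one max-aligned slot in front of the block to record its size.
    void *raw = std::malloc(size + alignof(std::max_align_t));
    if (!raw) throw std::bad_alloc();
    *static_cast<std::size_t *>(raw) = size;
    g_heap_bytes.fetch_add(size, std::memory_order_relaxed);
    return static_cast<char *>(raw) + alignof(std::max_align_t);
}

void operator delete(void *ptr) noexcept
{
    if (!ptr) return;
    char *raw = static_cast<char *>(ptr) - alignof(std::max_align_t);
    g_heap_bytes.fetch_sub(*reinterpret_cast<std::size_t *>(raw),
                           std::memory_order_relaxed);
    std::free(raw);
}

// Usage: snapshot g_heap_bytes before and after creating a DDS entity;
// the difference approximates that entity's heap footprint.
```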

The following table reports the average heap allocation for the different DDS entities that can be used in an RTI Connext DDS application.

The amount of memory required for an entity depends on the value of different QoS policies. For this benchmark, RTI has used a QoS profile that results in minimum memory usage.

  Entity                              i86Linux2.6gcc4.1.1   x64Linux2.6gcc4.1.1
  DomainParticipant                   1,377,577 bytes       1,876,271 bytes
  Type registered                     887 bytes             1,191 bytes
  Topic                               1,322 bytes           1,814 bytes
  Subscriber                          8,962 bytes           14,063 bytes
  Publisher                           2,322 bytes           3,222 bytes
  DataReader                          52,359 bytes          82,445 bytes
  DataWriter                          30,447 bytes          44,145 bytes
  Instance registered in DataWriter   289 bytes             442 bytes
  Sample stored in DataWriter         925 bytes             1,312 bytes
  Remote DataReader                   4,734 bytes           6,416 bytes
  Remote DataWriter                   9,627 bytes           14,362 bytes
  Instance registered in DataReader   616 bytes             973 bytes
  Sample stored in DataReader         654 bytes             1,137 bytes
  Remote DomainParticipant            51,081 bytes          67,253 bytes
  DomainParticipantFactory            60,441 bytes          75,355 bytes


Note: In order to efficiently manage the creation and deletion of DDS entities and samples, RTI Connext DDS implements its own memory manager. The memory manager allocates and manages multiple buffers to avoid continuous memory allocation. Therefore, memory usage does not necessarily grow linearly with the creation of DDS entities and samples. The pre-allocation scheme of the memory manager is configurable.

Benchmarking Best Practices

Are you or your team planning to benchmark the performance of your system? How do you ensure that the tests are realistic?

Performance doesn’t just mean “fast.” Performance requirements tie your system to the real world by imposing time-related constraints. For example, latency requirements may look like “shall do X functionality in N milliseconds,” whereas throughput requirements may look like “shall handle M thousand Ys per second.” 

Before you jump into testing, here are 5 best practices to remember when designing performance tests:

  1. Create a performance model before testing the performance of a system
    Waiting until the system is operational is too late to discover performance problems. Pre-defining performance goals, such as maximum response times and acceptable performance metrics, helps you interpret the results of tests.
  2. Create realistic tests
    The test setup must reflect the actual payload sizes, be representative of the application architecture, and run on the same or comparable (if less ideal) hardware platforms and OS.
  3. Perform an apples-to-apples comparison with the performance model
    The configuration of the Network Interface Card (NIC) and the network switch, as well as the maximum network throughput and the CPU, all have an impact on the final performance results. So, to make a relevant comparison:
    1. include the impact of the network device configuration in the performance model, and
    2. run the benchmarks on the same hardware used to create the performance model.
  4. Ensure that the tests are using the network interface you think you are using
    Test applications can typically be configured to use multiple network interfaces, and default settings may select other transports, such as shared memory or intra-process transports. If the intent is to test performance over certain network interfaces, ensure that the test application is explicitly configured to exercise the intended network interface(s).
  5. Measure averages, but include outliers
    The average of any metric is informative, but it can be misleading by itself. Be sure to include other metrics, such as the 90th percentile or the standard deviation, to get a fuller view of system performance; a minimal percentile computation is sketched below.
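For instance, percentiles are straightforward to compute from the raw latency samples (a minimal sketch using the nearest-rank method; the sample values are made up for illustration):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Nearest-rank percentile: sorts a copy of the samples and indexes in.
double percentile(std::vector<double> samples, double pct)
{
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(pct / 100.0 * samples.size());
    if (rank >= samples.size()) rank = samples.size() - 1;
    return samples[rank];
}

int main()
{
    // Illustrative latency samples in microseconds; note the outlier.
    std::vector<double> latency_us = {12.1, 11.8, 12.3, 95.0, 12.0, 11.9};
    std::printf("median : %.1f us\n", percentile(latency_us, 50.0));
    std::printf("p90    : %.1f us\n", percentile(latency_us, 90.0));
    std::printf("p99.99 : %.1f us\n", percentile(latency_us, 99.99));
    return 0;
}
```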


These benchmarks show that Connext DDS:

  • Exhibits high throughput, approaching the theoretical bandwidth of Gigabit Ethernet, using modest CPUs
  • Provides very low latency that increases linearly with data payload size
  • Sustains low latency and high throughput, even at very high levels of message traffic

To learn more, or to run these benchmarks on your own hardware, download the free RTI Perftest tool, or contact your local RTI representative.