The Top 10 Reasons for Dropped DDS Messages

Today’s IoT systems often require hundreds of thousands of concurrent data points communicated in many different configurations between thousands of endpoints. This type of complex communication is made possible with Data Distribution Service® (DDS), and the speed at which this processing takes place enables applications such as rocket launches, autonomous vehicles, sophisticated robotics and more. In the processing of these hundreds of thousands of data points, messages may be dropped. So why is this happening? What are the causes and cures for DDS messages getting dropped?

DDS is not actually “losing” a message. DDS can be configured to intentionally not deliver a message to a data reader under certain conditions. To fully understand what causes samples to be dropped, let's first distinguish between the different circumstances that can be involved.

To start, it is entirely possible that the message is lost by the network or the operating system, rather than by DDS. In this case, when using DDS Reliability, the DataReader would ask the DataWriter to resend the message. DDS automatically corrects for messages dropped by the network.

So why does it sometimes appear that DDS has dropped a message? Here we have to make a separate distinction between “rejected” samples versus “lost” samples.

When a DataReader receives a message but has insufficient resources to accept it, the message is dropped by the DDS middleware. The sample may still be queued at the DataWriter, in which case it could be retransmitted later depending on Reliability settings. Since the message can still be repaired, it is considered “Rejected” and not “Lost.” But DataWriters have limits on their resources too, and if the DataWriter continues to publish data, a new message may overwrite the old message that has not yet been received by the DataReader. In this case, the old message is considered “Lost”; that is, there is no way for the DataWriter to repair this missing message for the DataReader.

So, it’s important to reiterate that DDS is not actually “losing” messages. DDS is configured to intentionally not deliver messages to the data reader under certain conditions. And there are a number of different ways that a user can configure DDS that would result in dropped messages (messages not being delivered).

The Top 10 Reasons DDS (Intentionally) Drops Messages

Listed below are the Top 10 most common reasons, ranging from most common to least common (also included are references, when available, for further reading):

Number 1: Best Effort Communication

The most common reason that DDS may drop messages is because the DataReader is configured for Best Effort Communication. With Best Effort reliability, typically there are two ways you can drop messages:

  • Out of order samples: A Best Effort reader will drop any messages it receives that are older than the most recent message it has received. So, in a case where a Best Effort reader receives message 7, followed by messages 6 and 5, only 7 will be delivered to the application and messages 5 and 6 will be dropped by DDS. If the order of arrival is 5, 7, 6, messages 5 and 7 would be delivered to the application, while message 6 would be dropped because it is older than the previously delivered message 7. With a Reliable communication setting, however, this would not be the case. No messages would get dropped; they would all be delivered to the application in the right order.
  • Reader queue overflow: With Best Effort Communication, when the data reader’s queue is full and more messages are received, the older messages get dropped and they are Lost. They will not be repaired.
  • For more on Best Effort reliability, see section 10.1.1 of the DDS User’s Manual.
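
The out-of-order rule described above can be illustrated with a toy model. This is a minimal Python sketch, not Connext DDS code: the function name `best_effort_deliver` and the use of plain integers as sequence numbers are assumptions made purely for illustration.

```python
def best_effort_deliver(arrivals):
    """Toy model of a Best Effort reader: deliver a sample only if its
    sequence number is newer than the last one delivered."""
    delivered, dropped = [], []
    last = None  # sequence number of the most recently delivered sample
    for seq in arrivals:
        if last is None or seq > last:
            delivered.append(seq)
            last = seq
        else:
            dropped.append(seq)  # older than a sample already delivered
    return delivered, dropped


print(best_effort_deliver([7, 6, 5]))  # first arrival order from the text
print(best_effort_deliver([5, 7, 6]))  # second arrival order from the text
```

The function returns two lists, the samples handed to the application and the samples dropped, so the two arrival orders from the text can be compared side by side.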

Number 2: Using KEEP_LAST Reliability

DDS may drop messages if you are using KEEP_LAST instead of strict reliable Quality of Service (QoS). Strict reliability means reliability QoS is enabled and history kind is set to KEEP_ALL. If reliability is turned on but history kind is KEEP_LAST, then messages can get dropped:

  • If running on the same kind of CPU/computer, a DataWriter can typically write faster than a DataReader can handle.
  • When the DataReader’s queue gets full, new messages replace the oldest ones in the queue. Overwritten messages will then be Lost.
  • With strict reliable mode, the DataWriter will block if the reader’s queue fills up. The messages will be rejected, but they will not be overwritten or dropped, so they can still be repaired.
  • For more information on KEEP_ALL vs. KEEP_LAST history QoS, read this article: “Why isn't my reliable data reader receiving all data samples.”
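
The difference between the two history kinds can be sketched with a toy reader queue that the application never drains. This is an illustrative Python model, not the Connext DDS API: `fill_queue`, its parameters, and the simplification that rejected samples are merely recorded (rather than actually retransmitted) are all assumptions.

```python
from collections import deque


def fill_queue(samples, depth, keep_all):
    """Toy reader queue of size `depth` that is never read by the application.

    KEEP_LAST: a new sample replaces the oldest one, which is then lost.
    KEEP_ALL:  a sample that does not fit is rejected; in a real system it
               would stay queued at the writer and could be repaired later.
    """
    queue = deque()
    lost, rejected = [], []
    for sample in samples:
        if len(queue) < depth:
            queue.append(sample)
        elif keep_all:
            rejected.append(sample)        # still repairable
        else:
            lost.append(queue.popleft())   # overwritten, cannot be repaired
            queue.append(sample)
    return list(queue), lost, rejected
```

For example, `fill_queue([1, 2, 3, 4, 5], depth=3, keep_all=False)` leaves the three newest samples in the queue and reports the two oldest as lost, while the same call with `keep_all=True` keeps the three oldest and reports the two newest as rejected.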

 

Number 3: Time-Based Filters or Content Filtered Topics

DDS may drop messages if a time-based or content-based filter is being used, because messages that fall outside the parameters of the filter will be dropped. These messages may appear to be lost but are either intentionally not sent by the DataWriter or are discarded by the DataReader.
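
As a rough illustration of how both kinds of filter remove samples, the Python sketch below models a time-based filter as a minimum separation between accepted samples and a content filter as a predicate on the value. It is a toy model, not DDS code; `apply_filters`, the tuple layout, and the integer timestamps are assumptions.

```python
def apply_filters(samples, minimum_separation, predicate):
    """samples: list of (timestamp, value) tuples in arrival order.

    A sample is filtered out if it arrives less than `minimum_separation`
    after the last accepted sample, or if `predicate(value)` is false.
    """
    accepted, filtered = [], []
    last_time = None  # timestamp of the last accepted sample
    for timestamp, value in samples:
        too_soon = (last_time is not None
                    and (timestamp - last_time) < minimum_separation)
        if too_soon or not predicate(value):
            filtered.append((timestamp, value))
        else:
            accepted.append((timestamp, value))
            last_time = timestamp
    return accepted, filtered
```

From the application's point of view, everything in `filtered` simply never shows up, which is why filtered samples can look like lost ones.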

Number 4: Durability QoS Not Set

Messages can be lost at startup when the sending application begins publishing DDS messages before the receiving application is ready. If the sender (DataWriter) starts before the receiver (DataReader) is ready to receive messages, the DataReader might miss the first few messages. To avoid this, turn on the Durability QoS setting, that is, set it to a kind other than the default VOLATILE, such as TRANSIENT_LOCAL. When Durability is used, the DataWriter saves N messages in its queue (dictated by History size). Late-joining readers will be sent the messages they missed, as dictated by their own History setting but limited by what is in the writer’s history. For more information on Durability, read DURABILITY QosPolicy.
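
The interaction between the writer's history and the late joiner's history can be sketched as follows. This is a toy Python model under stated assumptions, not the Connext DDS API: `ToyWriter`, `samples_for_late_joiner`, and the single `durable` flag (standing in for a non-volatile Durability kind) are hypothetical names.

```python
from collections import deque


class ToyWriter:
    """Writer that remembers its last `history_depth` samples."""

    def __init__(self, history_depth, durable):
        self.durable = durable
        # deque(maxlen=N) silently discards the oldest entry when full.
        self.history = deque(maxlen=history_depth)

    def write(self, sample):
        self.history.append(sample)

    def samples_for_late_joiner(self, reader_depth):
        """Samples a reader that joins now would be sent."""
        if not self.durable or reader_depth <= 0:
            return []  # without durability, earlier samples are simply missed
        # Limited by the reader's depth and by what the writer still holds.
        return list(self.history)[-reader_depth:]
```

With a writer depth of 3 and five samples written, a late joiner with depth 2 would get the last two samples, and a late joiner with depth 10 would still get only the three the writer kept.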

Number 5: No Space in the Write Queue

With strict reliability, if the DataWriter’s queue is full and max_blocking_time elapses before space becomes available, the message will not be delivered and can appear to be dropped. The write operation returns the error code DDS_RETCODE_TIMEOUT and the message is never added to the write queue. From the middleware’s point of view, this "dropped" message never existed: it has no sequence number, so it will not be repaired even though strict reliability is used. However, because the write call reports the timeout, the application can detect the failure and simply write the message again. The message is therefore not "lost forever." Read about the reliability protocol in the User’s Guide for more information.
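
The "check the return code and write again" idea can be sketched with a toy write queue. This Python model is an assumption-laden illustration, not Connext DDS code: the class `ToyWriteQueue`, the string return codes, and the helper `write_with_retry` are hypothetical, and the blocking wait itself is not simulated (a full queue is treated as a wait that has already timed out).

```python
RETCODE_OK = "OK"
RETCODE_TIMEOUT = "TIMEOUT"


class ToyWriteQueue:
    def __init__(self, max_samples):
        self.max_samples = max_samples
        self.queue = []      # (sequence_number, sample) awaiting acknowledgment
        self.next_seq = 1

    def write(self, sample):
        """Model of a write whose blocking time has already expired."""
        if len(self.queue) >= self.max_samples:
            return RETCODE_TIMEOUT   # never queued, no sequence number used
        self.queue.append((self.next_seq, sample))
        self.next_seq += 1
        return RETCODE_OK

    def acknowledge_oldest(self):
        """Free one slot, as if a reader acknowledged the oldest sample."""
        if self.queue:
            self.queue.pop(0)


def write_with_retry(writer, sample, make_room, max_attempts=3):
    """Retry a timed-out write; `make_room` stands in for waiting on ACKs."""
    for _ in range(max_attempts):
        if writer.write(sample) == RETCODE_OK:
            return True
        make_room()
    return False
```

Note that the failed write does not consume a sequence number in this model, which mirrors why the middleware has nothing to repair: only the application knows the write was attempted.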

Number 6: Listener Context Switch

In DDS, messages can be lost as a result of using a Listener rather than a WaitSet to handle inbound data. Typically, the listener is called back from one of only a few middleware threads, so it is important not to block or do any long processing in the callback. If this thread blocks, there are many potential negative consequences:

  • Losing data for the DataReader the listener is installed on, because the receive thread is not removing it from the socket buffer and it gets overwritten.
  • Receiving strictly reliable data with a delay, because the receive thread is not removing it from the socket buffer; if it is overwritten there, it must be re-sent.
  • Losing or delaying data for other DataReaders, because by default all DataReaders created with the same DomainParticipant share the same threads.
  • Not being notified of periodic events on time.
  • For more on listener context switch and listener callback, read the article: Never Block in a Listener Callback.
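
One generic way to keep a callback short is to hand each sample to a worker thread and return immediately. The Python sketch below shows that hand-off pattern only; it is not the WaitSet API and not Connext DDS code. `HandoffListener`, `on_data_available` as a plain method taking a sample, and `wait_until_idle` are hypothetical names chosen for illustration.

```python
import queue
import threading


class HandoffListener:
    """Callback that only enqueues; a worker thread does the slow part."""

    def __init__(self, process):
        self._inbox = queue.Queue()
        self._process = process
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_data_available(self, sample):
        # Imagined to run on a middleware receive thread: return right away.
        self._inbox.put(sample)

    def _run(self):
        while True:
            sample = self._inbox.get()
            try:
                self._process(sample)  # long processing happens off the receive thread
            finally:
                self._inbox.task_done()

    def wait_until_idle(self):
        """Block the caller until every queued sample has been processed."""
        self._inbox.join()
```

The trade-off is that the hand-off queue can itself grow without bound if processing is consistently slower than the arrival rate, so a real design would still need a limit or a drop policy of its own.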

 

Number 7: Controlling Heartbeats and Retries

Controlling heartbeats and retries refers to a specific situation: when not using strict reliability, if max_heartbeat_retries is reached, the DataWriter will consider the DataReader inactive and messages may be lost. Also, if your heartbeat rate is too slow and you are using KEEP_LAST, there may not be time to repair a sample before newer ones arrive, in which case the older one will be lost. For more information, read section 10.3.4, Controlling Heartbeats and Retries with DataWriterProtocol QosPolicy, in the Connext DDS User’s Guide.
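
A back-of-the-envelope check can show when a slow heartbeat puts KEEP_LAST samples at risk. The Python sketch below is a deliberately simplified estimate, not a description of the actual protocol timing: it assumes a single instance, a constant write rate, and that a repair takes roughly one heartbeat period plus a round trip. The function names and parameters are hypothetical.

```python
def sample_lifetime_s(history_depth, write_rate_hz):
    """Approximate time a sample stays in a KEEP_LAST writer history
    before newer samples push it out."""
    return history_depth / write_rate_hz


def repair_at_risk(history_depth, write_rate_hz, heartbeat_period_s,
                   round_trip_s=0.0):
    """True if a repair would likely arrive after the sample was replaced."""
    repair_delay = heartbeat_period_s + round_trip_s
    return repair_delay > sample_lifetime_s(history_depth, write_rate_hz)
```

With a depth of 10 and a 100 Hz write rate, a sample survives about 0.1 s in the writer's history, so under these assumptions a 0.5 s heartbeat period would be too slow while a 0.05 s period would not.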

Number 8: Too Many Elements

Normally, when sequences or strings are defined with different maximum lengths, they are not “assignable.” This means DDS will determine there is a type mismatch, and the DataReader and the DataWriter of these two types will not be allowed to communicate. After setting ignore_sequence_bounds and ignore_string_bounds to TRUE, the two types become assignable; however, the DataReader will drop samples published with an actual sequence or string length greater than its own maximum length.

See section 2.4 of the Getting Started Guide Addendum for extensible types.
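
The bound check described above amounts to a length comparison on the reader side. The following Python sketch is a toy model, not Connext DDS code; `receive_with_bounds` and its parameters are assumptions made for illustration.

```python
def receive_with_bounds(samples, reader_max_length):
    """Split incoming strings or sequences into those that fit the reader's
    declared maximum length and those that exceed it and are dropped."""
    accepted = [s for s in samples if len(s) <= reader_max_length]
    dropped = [s for s in samples if len(s) > reader_max_length]
    return accepted, dropped
```

In this model a writer whose type allows longer values can communicate with the reader, but any individual sample that is actually longer than the reader's bound never reaches the application.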

Number 9: When IDL File Annotations Are Used

If the @range, @min, or @max annotations are used in an IDL file to limit message values, then messages published by the DataWriter that fall outside these ranges will not be received by the DataReader. For instance, the DataReader will drop the message {x=170} when using the following annotation: @range(min=100, max=150) long x.

The reader will not provide that message to the application because x is outside the valid range [100,150].
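
The range check from the example can be written out as a small Python sketch. This is a toy model of the behavior described, not generated type code or the Connext DDS API; `in_range`, `deliverable`, and the dictionary representation of a sample are assumptions.

```python
def in_range(value, minimum, maximum):
    """Inclusive range check, matching the closed interval [min, max]."""
    return minimum <= value <= maximum


def deliverable(samples, minimum=100, maximum=150):
    """Keep only samples whose field x lies inside the annotated range."""
    return [s for s in samples if in_range(s["x"], minimum, maximum)]
```

Here a sample with x=170 is excluded, while samples at the boundaries (x=100 and x=150) are kept because the interval is closed.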

When a DataReader drops a message for this reason, Connext DDS logs a warning but does not update the SAMPLE_LOST or SAMPLE_REJECTED Status.

For more information see section 2.4 of the Getting Started Guide Addendum for extensible types.

Number 10: When Using destination_order QoS

When using destination_order QoS configured with "by source timestamp," data will be delivered by a DataReader in the order in which it was sent. If data for an instance arrives on the network with a source timestamp earlier than the source timestamp of the last data delivered, the new data will be dropped (if the timestamp difference is greater than source_timestamp_tolerance). This ordering therefore works best when system clocks are relatively synchronized among writing machines. For more information, see section 6.5.8, DESTINATION_ORDER QosPolicy, in the Connext DDS User’s Guide.
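
The drop rule as described in the paragraph above can be modeled in a few lines of Python. This sketch follows the wording of this article only and is not a statement of the exact middleware behavior; `deliver_by_source_timestamp`, the integer timestamps, and the single-instance simplification are assumptions.

```python
def deliver_by_source_timestamp(arrivals, tolerance):
    """arrivals: list of (source_timestamp, value) in arrival order.

    A sample is dropped when its source timestamp is earlier than that of
    the latest delivered sample by more than `tolerance`.
    """
    delivered, dropped = [], []
    latest = None  # largest source timestamp delivered so far
    for timestamp, value in arrivals:
        if latest is not None and (latest - timestamp) > tolerance:
            dropped.append((timestamp, value))
        else:
            delivered.append((timestamp, value))
            latest = timestamp if latest is None else max(latest, timestamp)
    return delivered, dropped
```

If two writers publish to the same instance and one machine's clock runs behind, its samples carry earlier source timestamps and are the ones dropped in this model, which is why clock synchronization matters.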

Note that these are just my top ten reasons why messages can be Rejected and Lost. There are other reasons and the User’s Manual lists the reasons in sections “7.3.7.8 SAMPLE_REJECTED Status” and “7.3.7.7 SAMPLE_LOST Status.”

Which Issue is it?

Okay! That covers the Top 10 reasons messages might get lost. But how can one determine which of these 10 happens to be the reason a particular system is losing packets?

RTI provides many tools for debugging issues involving lost samples, including a rich API that reports system status and a powerful logging mechanism, and Wireshark can dissect DDS messages. But perhaps the most powerful resource is the RTI Monitor tool, which gives detailed information about all transactions. This includes information on the ACKs, NACKs, dropped messages, queue sizes and queue usage in the system. The Monitor also provides a full list of QoS settings. It can quickly help answer the following questions:

  • Is the system set to Best Effort vs. Reliable communication?
  • Is the History set to KEEP_LAST vs. KEEP_ALL?
  • What size are the Read and Write queues set to?
  • What are the different time-outs set to?

RTI Monitor needs to use instrumented libraries. See: Using the Monitor Library in Your Application for more information.

Conclusion

There are many reasons why messages will get dropped, or seem to get dropped, during DDS communications. In all cases, whether it is the network dropping the messages or DDS dropping the messages, DDS can be configured to address the issue. The trick is determining why exactly the system is dropping messages. Fortunately, RTI has a number of tools and methods for finding this information. So while a DDS system can appear to be losing messages, most of the time this is a configuration issue that can be fixed with the right QoS settings.


About the Author

David Seltz is a Field Application Engineer for Real-Time Innovations supporting customers in the New England area. David has been in the embedded industry for over 32 years working in engineering and sales roles. Previous to his work at RTI, David was the world-wide FAE manager for Wind River Systems. David holds a Bachelor of Science degree in computer engineering from Lehigh University, and a Master of Science degree in computer engineering from the University of Massachusetts.

 
