A few weeks ago I ordered something from Amazon.com. Something that I needed soon-ish. I am a loyal Amazon customer and Amazon Prime member, and I've never experienced a problem with timely delivery of my order. Unfortunately, my experience with this order was different. It was a mess.
They had somehow lost my order en route, and as a result I would not receive it in time. Apparently, one of the carriers delivered my package to the wrong USPS office, and from there everything went wrong.
This experience with Amazon and my order made me think about how frustrating it is for something to not arrive when it is expected. In my domain, DDS and distributed systems, this would translate into you not receiving the data you expect, when you expect it. Why would this happen? This blog post is going to provide some insight into the three main causes of unexpected data loss in a Connext DDS system, and also let you know what to do about it.
Cause #1: Discovery is not complete
From time to time, we get e-mails from users who mention that some of their DataReaders miss the first few data samples coming from a DataWriter, even when communication is configured to be reliable.
Usually this occurs because the DataWriter has not yet discovered the affected DataReaders when the first samples are published. In this case, the samples are simply dropped because there are no receivers. If nobody is at home, the DataWriter will not deliver the sample.
To deal with this situation, the user can do two things:
- Configure the DataWriter to keep samples around for late-joiner or not-yet-discovered DataReaders. This is done by setting DurabilityQosPolicy to TRANSIENT_LOCAL.
- Wait to discover the DataReaders before publishing the first samples. This can be done by monitoring the PublicationMatchedStatus on the DataWriter.
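To make the difference between these options concrete, here is a minimal toy model in Python. It is not the Connext DDS API: `ToyWriter`, `ToyReader`, `match` and `run` are hypothetical names invented for this sketch, and discovery is reduced to a single `match` call. It only illustrates why early samples are dropped, and how either keeping them for late joiners or waiting for the match avoids the loss.

```python
from dataclasses import dataclass, field


@dataclass
class ToyReader:
    # Hypothetical stand-in for a DataReader.
    name: str
    received: list = field(default_factory=list)


@dataclass
class ToyWriter:
    # Hypothetical stand-in for a DataWriter; not the Connext DDS API.
    transient_local: bool = False
    matched: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def write(self, sample):
        if self.transient_local:
            # Keep the sample around for readers discovered later.
            self.history.append(sample)
        # Only readers that are already matched get the sample now.
        for reader in self.matched:
            reader.received.append(sample)

    def match(self, reader):
        # Models discovery completing for this reader:
        # replay whatever was kept, then start delivering live samples.
        for sample in self.history:
            reader.received.append(sample)
        self.matched.append(reader)


def run(transient_local, wait_for_match):
    writer = ToyWriter(transient_local=transient_local)
    reader = ToyReader("late")
    if wait_for_match:
        writer.match(reader)   # discovery finishes before the first write
    for sample in range(3):
        writer.write(sample)
    if not wait_for_match:
        writer.match(reader)   # discovery finishes after the first writes
    writer.write(3)
    return reader.received
```

With neither option, `run(False, False)` leaves the reader with only the last sample. With either `run(True, False)` (keep samples for late joiners) or `run(False, True)` (wait for the match before publishing), the reader ends up with all four.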
For additional information on this, see the following solution on Community: https://community.rti.com/kb/why-does-my-dds-datareader-miss-first-few-samples
Cause #2: The User (that would be you) thinks communication is configured to be strict reliable, but it is not
Other times, a user will mention that they've configured communication to be reliable, but they're still losing some of the samples (not necessarily the first samples). What is going on?
To guarantee that a DataWriter delivers a sample to a matching DataReader, the communication has to be configured to be STRICT reliable. This requires two things:
- Configuring the DataWriter and DataReader so that (1) the DataReader can ACK/NACK samples and (2) the DataWriter can repair NACKed samples. The user does this by setting ReliabilityQosPolicy to be RELIABLE.
- Configuring (1) the DataWriter cache to keep all samples until ACKed and (2) the DataReader cache to keep all samples until taken. The user does that by setting HistoryQosPolicy to be KEEP_ALL.
The challenge in this case is that the user may forget to change HistoryQosPolicy from KEEP_LAST (the default) to KEEP_ALL. In lossy or congested networks, for example a satellite link, this may lead to sample losses: by the time the DataWriter receives a NACK for a sample, that sample may already have been replaced by newer ones in the DataWriter cache.
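The replacement effect can be sketched with a few lines of Python. This is a simplified model, not the Connext DDS API: the hypothetical `repairable` function treats the writer cache as one queue of sequence numbers, where a `history_depth` of `None` stands in for KEEP_ALL and an integer stands in for KEEP_LAST with that depth. It ignores per-instance history and resource limits.

```python
from collections import deque


def repairable(history_depth, samples_written, nacked_sample):
    """Return True if the toy writer cache still holds the NACKed sample.

    history_depth=None models KEEP_ALL; an integer models KEEP_LAST
    with that depth. Hypothetical helper, not part of any DDS API.
    """
    # A deque with maxlen=None is unbounded; with an integer maxlen,
    # appending to a full deque silently evicts the oldest entry.
    cache = deque(maxlen=history_depth)
    for seq in range(1, samples_written + 1):
        cache.append(seq)
    return nacked_sample in cache
```

For instance, with a depth of 1, after five samples have been written a NACK for sample 3 can no longer be repaired (`repairable(1, 5, 3)` is `False`), while in the KEEP_ALL case it can (`repairable(None, 5, 3)` is `True`).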
For more information, check out this solution on Community: https://community.rti.com/kb/why-isnt-my-reliable-data-reader-receiving-all-data-samples
Cause #3: The DataReader becomes unresponsive and it is marked as inactive
This one is tricky. I have to admit that I have spent a few hours of my time debugging this scenario.
The reliability contract between a DataWriter and a DataReader is maintained as long as the DataReader is responsive. In Connext DDS, a DataReader will be considered inactive if it does not send any ACK/NACK messages in response to 'n' periodic HEARTBEAT (HB) messages sent by the DataWriter (where 'n' is configured using the DataWriter's protocol parameter max_heartbeat_retries).
By default, max_heartbeat_retries is ten, and the default HB period is three seconds, resulting in an inactivity period of thirty seconds. So far, so good.
The problem is this: to increase throughput and make the system more responsive, some users reduce the HB period without increasing max_heartbeat_retries. This can lead to sample losses, as the DataReader may not send ACK/NACK messages as frequently as required.
For example, if the user reduces the HB period to one hundred milliseconds and does not change max_heartbeat_retries, the inactivity period would become one second.
Make sure that when you adjust the HB period, you also adjust max_heartbeat_retries. To address this usability issue, we have filed an RFE (Request For Enhancement) to make max_heartbeat_retries a timeout/period, decoupling it from changes to the HB period.
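The arithmetic is simple enough to check in a couple of lines. The helpers below are hypothetical names made up for this sketch, not Connext DDS calls; they assume the inactivity period is just the HB period multiplied by max_heartbeat_retries, as described above, and work in whole milliseconds to avoid floating-point rounding.

```python
def inactivity_period_ms(heartbeat_period_ms, max_heartbeat_retries):
    # Time without any ACK/NACK after which the reader is considered inactive.
    return heartbeat_period_ms * max_heartbeat_retries


def retries_for(target_inactivity_ms, heartbeat_period_ms):
    # Smallest max_heartbeat_retries that gives at least the target period
    # (integer ceiling division).
    return -(-target_inactivity_ms // heartbeat_period_ms)


default_period = inactivity_period_ms(3000, 10)   # 30000 ms, i.e. thirty seconds
reduced_period = inactivity_period_ms(100, 10)    # 1000 ms, i.e. one second
needed_retries = retries_for(30000, 100)          # 300
```

In other words, to keep the original thirty-second inactivity period with a one-hundred-millisecond HB period, max_heartbeat_retries would need to go from ten to three hundred.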
Hopefully this post will help you find answers to your questions involving unexpected data loss. At RTI, we are aware that sometimes debugging your DDS distributed system may be challenging, and we work every day to make your experience much easier and smoother. Connext DDS has a great tool ecosystem that allows you to debug your system at run-time. To see some great demo videos of these tools in action, check out this blog post. In addition, I must highlight the great community that we are building around the product. Be sure to visit community.rti.com and make use of the active forum and large knowledge base.