Time to Sync Up on Consistency in IIoT Systems
Written by Sander Mertens
June 21, 2018
An important (and hotly debated) topic in distributed system design is which consistency model to use. Consistency models influence many parts of system design, and picking one over another impacts things like system availability and robustness against network failures. This blog is meant for system architects who want to get a better handle on what it means to be or not to be consistent.
First, let’s clarify that this blog is not about the "C" in ACID (https://en.wikipedia.org/wiki/ACID). ACID consistency ensures that updates to a datastore are valid according to a set of constraints. This blog focuses on the kind of consistency that describes what happens when data is replicated between distributed stores. As it turns out, ACID does say something about that, but it is the "I" (Isolation), not the "C." Confusing? A little bit, but bear with me.
When applied to replicated data, this kind of guarantee is usually referred to as strong consistency. When a system is strongly consistent, all writes and reads between distributed stores are executed in the same order for every application in that system. Put simply: when one application writes something, all applications that read after the write are guaranteed to see the new data.
This turns out to be an incredibly useful property in many systems. It makes sure that no two people can order the same refrigerator from a shopping website, not even when they purchase at the exact same time. Strong consistency enforces the same order of operations globally, so two purchases are always processed by everyone in the same order. In practice this means that the second purchase attempt is guaranteed to see that the refrigerator is out of stock.
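To make the refrigerator example concrete, here is a minimal Python sketch. It assumes a single, hypothetical in-memory Store that puts every purchase into one order with a lock; the class and function names are invented for illustration and do not come from any real product. A real shopping site would enforce the ordering in its database rather than in one process, but the effect is the same: only one buyer can get the last item.

```python
import threading


class Store:
    """Hypothetical inventory with one authority; a lock serializes purchases."""

    def __init__(self, stock):
        self._stock = stock
        self._lock = threading.Lock()

    def purchase(self):
        # The lock forces all purchases into a single global order, so the
        # stock check and the decrement can never interleave.
        with self._lock:
            if self._stock > 0:
                self._stock -= 1
                return True
            return False  # a later purchase in the order sees "out of stock"


def race_for_last_item(buyers):
    """Let several buyers try to purchase the one remaining item at once."""
    store = Store(stock=1)
    results = {}

    def attempt(buyer):
        results[buyer] = store.purchase()

    threads = [threading.Thread(target=attempt, args=(b,)) for b in buyers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling race_for_last_item(["alice", "bob"]) returns one True and one False. Which buyer wins depends on thread scheduling, but there is never more than one winner.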
Strong consistency sounds like a pretty neat thing to have, so why don’t all systems use it? It is because of something called the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem). First, a quick note: CAP has been rightfully criticized by many because it is too simple to describe the behavior of complex distributed systems – so be careful when using it – but it does provide a useful framework for discussing consistency models. I won’t go into the details of CAP because the internet has plenty of resources that do a much better job than I could hope to do here.
In summary, CAP tells us what happens in distributed systems when applications temporarily lose the ability to "talk", or in other words: when the network goes down. It turns out that it is impossible for a system to be strongly consistent and always guarantee uptime, regardless of loss of connectivity. Bummer.
That sounds complex, but is actually quite intuitive. Remember how strong consistency requires a global order to all operations in a system? That means a read must see all previous writes, from everyone. If not all of the applications are connected, then this becomes impossible to guarantee. One application may have put in an order for a refrigerator, but if not all applications have received this order yet, they cannot put in new orders. That results in downtime for the shopping website!
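The trade-off can be sketched in a few lines of Python. This is a deliberately simplified "write to every replica" scheme with hypothetical names (Replica, StrongStore, Unavailable); real strongly consistent stores often use quorum or consensus protocols instead, but the failure mode is similar: when a replica cannot be reached, the write is refused rather than applied out of order.

```python
class Unavailable(Exception):
    """Raised when a write cannot reach every replica."""


class Replica:
    def __init__(self, name, stock=1):
        self.name = name
        self.stock = stock
        self.connected = True  # toggled to simulate a network failure


class StrongStore:
    """Hypothetical store that only accepts writes all replicas can apply."""

    def __init__(self, replicas):
        self.replicas = replicas

    def purchase(self):
        # Without every replica, no global order can be guaranteed,
        # so the store chooses consistency over availability.
        if not all(r.connected for r in self.replicas):
            raise Unavailable("network partition: refusing the order")
        if self.replicas[0].stock == 0:
            return False  # out of stock, and every replica agrees
        for r in self.replicas:
            r.stock -= 1
        return True
```

While one replica has connected set to False, purchase() raises Unavailable, which is the downtime described above. Once the replica reconnects, orders are accepted again and all replicas stay in agreement.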
Downtime can be mitigated by throwing more resources at the problem, like doing database replication (more storage) or deploying redundant web servers (more compute). Today this is almost trivial to do in public cloud infrastructures, though it gets quite expensive and complex when a system has many moving parts, as with microservice architectures. When a system is not running on a cloud, adding more resources is far from trivial, because storage, compute and bandwidth are all much more constrained in non-cloud environments.
So while strong consistency is convenient for applications, it is taxing on an infrastructure (and your wallet!). To get around these issues, clever people have come up with solutions that are not as pedantic when it comes to consistency, but still workable from an application perspective. What we are talking about is “eventual consistency.” Time for another definition.
A system is eventually consistent when, once updates to a given item stop, all applications eventually see the same value. Or in layman's terms: everyone eventually sees the same data if they wait long enough. This means that applications are able to read and write at the same time, and are able to do so even while the network is down! Eventually, the system's infrastructure delivers all the updates to the applications.
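Here is a small Python sketch of that idea. It assumes a "last writer wins" rule, which is only one of several ways to reconcile replicas and is not meant to describe how any particular product does it. The LwwReplica class is hypothetical, and timestamps are passed in by the caller to keep the example deterministic.

```python
class LwwReplica:
    """Hypothetical replica that accepts local writes without coordination."""

    def __init__(self):
        self._data = {}  # key -> (timestamp, writer_id, value)

    def write(self, key, value, timestamp, writer_id):
        # A local write never waits for other replicas.
        self._apply(key, (timestamp, writer_id, value))

    def read(self, key):
        entry = self._data.get(key)
        return entry[2] if entry else None

    def _apply(self, key, entry):
        current = self._data.get(key)
        # Keep the entry with the larger (timestamp, writer_id) pair;
        # writer_id breaks ties so all replicas pick the same winner.
        if current is None or entry[:2] > current[:2]:
            self._data[key] = entry

    def sync_from(self, other):
        # Delivery of missed updates, for example after a reconnect.
        for key, entry in other._data.items():
            self._apply(key, entry)
```

Two replicas that are written to while disconnected will briefly return different values. After each one syncs from the other, both return the value with the latest timestamp, no matter in which order the syncs happen.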
Because applications do not have to wait for each other, the uptime of an eventually consistent system is theoretically infinite (provided that your applications do not crash and no power outage occurs). Infinity is not practical, however; after some time you do expect your applications to reconnect. Therefore, eventually consistent systems typically put a limit on the time it may take to become consistent. When that limit expires and applications have not had a chance to synchronize, failure recovery takes place.
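A time limit like this can be sketched as a simple bookkeeping class. The SyncMonitor name, the max_staleness_s parameter and the idea of tracking the last sync per peer are all assumptions made for illustration; what "failure recovery" actually does is system-specific and left out.

```python
class SyncMonitor:
    """Hypothetical tracker for how long ago each peer last synchronized."""

    def __init__(self, max_staleness_s):
        self.max_staleness_s = max_staleness_s
        self._last_sync = {}  # peer name -> time of last successful sync

    def record_sync(self, peer, now):
        self._last_sync[peer] = now

    def overdue_peers(self, now):
        # Peers that exceeded the limit are candidates for failure recovery.
        return sorted(
            peer
            for peer, last in self._last_sync.items()
            if now - last > self.max_staleness_s
        )
```

With a limit of 5 seconds, a peer that last synchronized 10 seconds ago is reported as overdue, while a peer that synchronized 2 seconds ago is not.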
Availability is not the only advantage of eventually consistent systems. Because reads and writes do not have to be synchronized, as they do in strongly consistent systems, they are much faster. The lack of synchronization also enables direct peer-to-peer communication, which further improves both performance and robustness because it removes the need for a centralized message broker, which introduces a single point of failure.
While eventual consistency is a poor fit for cases like the refrigerator purchase on a shopping website, when considering its advantages (availability, performance, robustness, resource efficiency) it should come as no surprise that it is used a lot in mission-critical systems.
The RTI Connext Product Suite is the leading implementation of the OMG DDS standard, which is widely deployed as a protocol in mission-critical Industrial IoT systems. A big difference between OMG DDS and other connectivity protocols is that DDS behaves like a distributed database which is continuously synchronized between applications, whereas other protocols typically provide an interface to send around messages and leave state management up to the application.
If you think that a distributed database sounds like something that has to deal with consistency, you are right. RTI Connext DDS has to constantly balance availability and performance versus consistency to be able to work in the most demanding mission-critical environments. By default, RTI Connext DDS uses eventual consistency, which guarantees that applications built with it do not stop working when the network fails, while ensuring that all applications eventually share the same view of "the world."
Now you see how something that sounds as abstract as "consistency" has far-reaching consequences and should be treated as an important topic in early system design. Unfortunately, it is never as simple as being either "strongly consistent" or "eventually consistent." A lambda architecture (https://en.wikipedia.org/wiki/Lambda_architecture) is just one example that uses both strong and eventual consistency to get the best of both worlds. With so many shades of consistency, system architects have to make complex decisions on how much consistency their system can afford.
At RTI, our professional services team helps you make those decisions, reason through the consequences and configure our products to create a consistency solution that works for you.
Learn more about RTI services here: https://www.rti.com/services