Achieving fault tolerance by replicating DataWriters
Written by Reinier Torenbeek
October 30, 2012
DDS is often used to build mission critical systems consisting of multiple components executing simultaneously and collaborating together to achieve a certain task. In such systems, there are usually several components that are critically important for the overall functioning. Those components require extra attention when designing the system to ensure that the likelihood of them failing is minimized.
DDS provides advanced features that help achieve such goals. Many ways lead to Rome though, and there is a lot to tell about this subject. This blog article focuses on achieving fault tolerance with regard to failing publishing applications or machines, as achieved by means of active process replication.
The guiding example
For the purpose of illustrating the theory in this article with an example, the use case of the simple distributed finite state machine (FSM) as described in a previous article will be used. The FSM consists of a few components: a Master Participant, a Worker Participant and a Manager. Then there is the Observer component, which is not essential for the functionality of the pattern since it is passively observing only. For the purpose of our example, we will focus on the interaction between the Manager and the Observer.
Let's assume that the Manager is the most critical process in the pattern -- and that is likely to be the case, especially if the pattern is expanded to support multiple Masters and Workers. The most obvious way to make the Manager component more robust is by running multiple processes simultaneously, all with the same responsibility. With that approach, the Manager component is defined as a conceptual item that may consist of multiple, simultaneously running application processes. For the sake of simplicity, the collection of those processes is from here on referred to as the Manager component. Note that those could be executing in a distributed fashion on multiple machines.
Active process replication
Simultaneously running multiple Manager processes with the same purpose will decrease the likelihood that the Manager Component will break down completely. DDS allows for adding and removing participants on the fly without requiring configuration changes, so the basic concept of process replication is supported. Still, there are multiple options to choose from.
Plain process replication
The most basic approach is to just run multiple, identical Manager applications. Each of them is acting the same, receiving the same transition requests and publishing the same state updates in response. As a consequence, applications observing the state machine will see multiple instances of that state machine simultaneously and will have to be able to deal with that. This approach requires a minor adjustment to the data-model to allow multiple task instances to exist side-by-side. This is done by adding a key attribute identifying the originating Manager application.
An advantage of this approach is its low complexity; just replicating everything is an approach everybody understands. There is no impact on the Manager application code and all participants remain completely decoupled from each other. Additionally, the mechanism does not rely on any built-in replication mechanism from the middleware. This could be considered an advantage as well, because the developer now has every aspect of the replication process under control.
That latter argument could be seen as a disadvantage at the same time. Requiring all Observers of the state machine to be able to deal with multiple, identical versions simultaneously does have an impact on the application code and as such introduces an extra burden on the application developer. Also, the number of instances an instance state updates scales linearly with the number of Manager processes. For this particular state machine example, that is not that big of a deal, but in the general case, it could have a significant impact on network bandwidth consumption and memory usage.
Process replication with writer-side peer awareness
Another option, avoiding some of the disadvantages mentioned in the previous section, is to make Manager processes aware of each other at the application level. If all Manager processes know which other Manager processes are currently participating in the infrastructure, then they all could decide for themselves, according to a system-wide consistent selection algorithm, which of the replicated Manager processes is the currently active one. All Manager processes would be participating in the pattern and updating the relevant states accordingly -- but only the active Manager would be actually communicating those updates to the data-space. If the active Manager leaves the system for some reason, then the remaining Manager processes need to decide for themselves whether or not they are the new active Manager, and adjust their behavior accordingly.
DDS supports the necessary features to implement this approach. By subscribing to so-called built-in Topics, applications can be made aware of other DDS Entities in the cloud. This kind of subscription can also notify the application of lost liveliness or destruction of each of these Entities. Those two aspects combined are sufficient for our purposes.
Note that this approach has no application-level impact on any of the subscribing components in the system and consequently, they do not require any code changes. This kind of replication also does not consume any extra bandwidth or resources like the plain replication does.
There is some impact at the application level for the Manager processes though. The mechanism of peer-awareness and leader selection has to be created. This is not trivial and the impact of bugs in that piece of code could be high -- imagine it could lead to none of the processes stepping up as the leader!
Process replication with DDS's exclusive ownership mechanism
The last option presented here takes advantage of a native DDS feature called exclusive ownership. Normally, DDS allows multiple publishers to update the same data-item simultaneously. This behavior is called shared ownership and it is the default setting for the ownership Quality of Service (QoS) setting. If the non-default setting, called exclusive ownership, is selected, the infrastructure will make one of the publishers the owner of the data-item -- at any time, only the owner of the data-item can update its state. For this leader selection, another policy called ownership strength can be adjusted. This integer value will be inspected by the middleware to identify the current leader at any time, that is the publisher with the highest ownership strength. This selection is done consistently throughout the whole system.
Other than selecting the right QoS, this solution is transparent to all components in the system. The different Manager processes are not aware of each other and execute their tasks independently. Similarly, the subscribing components do not (have to) know that in fact, multiple publishers are present in the system. Everything needed to achieve the required functionality is handled within the middleware. As a consequence, no extra code has to be added, nor tested, nor maintained.
The exclusive ownership functionality is described in the DDS specification and was designed with the mechanism of process replication in mind. In that sense, it is not a surprise that this solution is usually the best fit. The specification does not indicate how this functionality should be achieved by the middleware; that is left up to the product vendors. However, it is good to know that typically, for optimal robustness, implementations do send data over the wire for each of the active publishers, and the leader selection takes place at the receiver side. This means that the feature does result in extra networking traffic.
Recovering from faults
Process replication is an important aspect of increasing fault tolerance in a system. However, there is more to it. If a system, by virtue of the process replication, is able to continue after a fault, then it will have entered a stage in which it is less fault-tolerant simply because one of the replicating processes is gone. It is still essential that the system recovers as quickly as possible from the fault and returns to the fault-tolerant state. In our example, this means that the failing Manager process will have to restart, obtain the current state of all task data-items and start participating in the pattern again. DDS can help make the task of recovery easier as well, but that will be the subject of another post.