Monitoring the Health of Your IIoT Systems
Written by Kyoungho An
September 13, 2018
How do you ensure that your IIoT system is healthy? When your system is running, it may experience network loss or delay, node failures, or unexpected changes due to software upgrades and new application deployments. These problems affect the performance of your application. If you do not continuously monitor the system, identifying the source of a problem can be quite challenging. The RTI Research team is working on architectural solutions for operational monitoring of distributed energy systems. However, this approach can be applied to any vertical application, including yours.
Operational monitoring provides you with a clear understanding of your system health by collecting performance metrics and events over time. Specifically, it gives you insights through real-time visualization and analysis. To support this operational monitoring capability for DDS-based systems, the RTI Research team evaluated relevant technologies and developed prototype software for demonstration (this work was done as part of a DOE-funded research contract).
Three key components are needed for monitoring: a solution for data collection, a solution for data storage and a solution for visualization.
Time-Series Database for Operational Monitoring
For operational monitoring, we used a software stack from InfluxData called TICK (named after the initials of its components: Telegraf, InfluxDB, Chronograf, and Kapacitor). It is shown in the figure below. Telegraf is a plugin-driven agent for collecting monitoring data. It supports more than 100 plugins, so you can collect data from many different sources. You can also extend your monitoring sources by developing your own plugin. Once Telegraf has collected monitoring data, it hands the data off to InfluxDB, a time-series database. From InfluxDB, the data can be passed on to Chronograf for visualization; Kapacitor provides alerting based on user-defined rules.
In particular, InfluxDB is an open source time-series database for monitoring that provides several interesting features:
- A SQL-like, time-centric query language
- Built-in time-series functions in the query language
- Automated data retention policies
- A schema-less approach
- Downsampling via continuous queries
- High availability via distributed clustering (only supported in the commercial version)
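To make two of these features more concrete, here is a minimal Python sketch of what downsampling and a retention policy do conceptually. It is not how InfluxDB implements them; the function names `downsample_mean` and `apply_retention`, the window size, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def downsample_mean(points, window_s):
    """Average (timestamp_s, value) points into fixed windows of window_s seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the window it falls into.
        buckets[ts - ts % window_s].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(points, now_s, max_age_s):
    """Keep only the points that are at most max_age_s seconds old."""
    return [(ts, value) for ts, value in points if now_s - ts <= max_age_s]


# Hypothetical raw samples: (timestamp in seconds, CPU usage in percent).
raw = [(0, 10.0), (20, 20.0), (40, 30.0), (60, 50.0), (90, 70.0)]

downsampled = downsample_mean(raw, window_s=60)
recent = apply_retention(raw, now_s=100, max_age_s=60)

print(downsampled)  # [(0, 20.0), (60, 60.0)]
print(recent)       # [(40, 30.0), (60, 50.0), (90, 70.0)]
```

In a time-series database, the same idea runs continuously inside the server: high-resolution data is kept for a short period, while the averaged series is kept for longer.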
Monitoring Architecture and Implementation
TICK formed the foundation of our Administration Layer (as depicted below). In addition, we needed to provide tools that generated the health monitoring data – what we call our Management Services Layer.
The figure above describes the monitoring architecture that we built for our project. This architecture consists mainly of a management services layer and an administration layer.
- The management services layer includes software components that collect monitoring data from a node where user applications are running. For our project, the user applications are OpenFMB simulation applications, but they can be any DDS applications.
- The administration layer consists of software components that store, visualize, and alert on collected time-series monitoring data.
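As a rough illustration of what "alerting on collected data" means in the administration layer, the following Python sketch evaluates user-defined threshold rules against one monitoring data point. It does not use any Kapacitor or Grafana API; the `Rule` class, the `evaluate` function, and the metric names are hypothetical.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may use.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Rule:
    name: str         # label reported when the rule fires
    metric: str       # key to look up in the data point
    op: str           # one of the keys of _OPS
    threshold: float  # value the metric is compared against


def evaluate(rules, point):
    """Return the names of all rules that fire for the given data point."""
    fired = []
    for rule in rules:
        value = point.get(rule.metric)
        # A rule whose metric is missing from the point is skipped.
        if value is not None and _OPS[rule.op](value, rule.threshold):
            fired.append(rule.name)
    return fired


# Hypothetical rules and a hypothetical node-metrics sample.
rules = [
    Rule("high_cpu", "cpu_percent", ">", 90.0),
    Rule("low_memory", "mem_free_mb", "<", 256.0),
]
alerts = evaluate(rules, {"cpu_percent": 95.0, "mem_free_mb": 1024.0})
print(alerts)  # ['high_cpu']
```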
The types of data we collected with this architecture include:
- Node metrics: CPU, memory, and network usage of nodes
- Container metrics: CPU, memory, and network usage of containers
- DDS metrics: discovery stats, protocol stats, and events (e.g., liveliness loss, sample lost, sample rejected)
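Metrics like these are typically written to InfluxDB as points in its line protocol, where each point has a measurement name, optional tags, one or more fields, and a timestamp. The Python sketch below formats such a point; it is a simplified illustration (for example, it does not handle every escaping rule), and the measurement, tag, and field names are made up rather than taken from our project.

```python
def _escape(text, chars):
    """Backslash-escape each character of chars that occurs in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value: booleans, integers (with an 'i' suffix), floats, strings."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes inside are escaped.
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable output
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical DDS event point: three lost samples reported by node-1.
line = to_line_protocol(
    "dds_events",
    {"host": "node-1", "kind": "sample_lost"},
    {"count": 3},
    1536796800000000000,
)
print(line)  # dds_events,host=node-1,kind=sample_lost count=3i 1536796800000000000
```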
To implement the architecture, we used existing Telegraf plugins to collect node and container metrics. These metrics are collected from the operating system and the container engine. For DDS metrics, we leveraged the RTI Monitoring Library.
Our intelligent bridge transforms locally collected data from our monitoring agents into remote data to be passed over the monitoring databus. If needed, the bridge can filter the collected data to reduce the amount of data sent over the network, and it can also enrich the data (e.g., by adding the hostname as a tag for grouping time-series data).
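The filter-and-enrich step can be sketched in a few lines of Python. This is only an illustration of the idea, not the bridge we implemented: the `bridge` function, the dictionary layout of a sample, and the measurement names are assumptions, and the DDS publishing step is left out entirely.

```python
import socket


def bridge(samples, allowed_measurements, hostname=None):
    """Drop samples that are not of interest and tag the rest with the hostname."""
    host = hostname or socket.gethostname()
    forwarded = []
    for sample in samples:
        # Filter: forward only the measurements the administration side asked for.
        if sample["measurement"] not in allowed_measurements:
            continue
        # Enrich: copy the sample and add the hostname as a tag for later grouping.
        enriched = dict(sample)
        enriched["tags"] = {**sample.get("tags", {}), "host": host}
        forwarded.append(enriched)
    return forwarded


# Hypothetical locally collected samples.
local_samples = [
    {"measurement": "cpu", "fields": {"usage": 12.5}},
    {"measurement": "debug_trace", "fields": {"depth": 4}},
    {"measurement": "dds_events", "tags": {"kind": "sample_lost"}, "fields": {"count": 3}},
]

remote_samples = bridge(local_samples, {"cpu", "dds_events"}, hostname="node-1")
for sample in remote_samples:
    print(sample["measurement"], sample["tags"])
```

The input samples are copied rather than modified, so the local agents can keep using them unchanged.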
To subscribe to data from the monitoring databus on the administration side, we used a Telegraf instance with a DDS plugin (the Metrics Collection Service in the architecture). As the Telegraf plugin framework is written in Go, we also developed a DDS Go binding with RTI Connector! It is currently available at https://github.com/rticommunity/rticonnextdds-connector-go. For visualization and alerts, we used Grafana.
With all of these artifacts, we could demonstrate an end-to-end operational monitoring capability for DDS-based systems using our energy system simulations as user applications (available via our Case + Code page: https://www.rti.com/resources/usecases/microgrid-openfmb). We are happy to share our work and get feedback from you. If you are interested, please let us know!
In the next blog post, we will delve much deeper into our InfluxDB integration and provide you with source code and documentation so that you can give it a spin for yourself!