The emergence of service-oriented architectures has driven a shift towards a service-oriented paradigm that has been adopted in several application domains. The advent of cloud computing and, more recently, of edge computing environments has accelerated this shift towards service provisioning. In this context, various “traditional” critical infrastructure components have been turned into services that are deployed and managed on top of cloud and edge computing infrastructures. However, this deployment model poses a specific challenge: the services of critical infrastructures within and across application verticals/domains (e.g., transportation, health, and industrial venues) need to be continuously available with near-zero downtime. To address this challenge, this paper presents an approach for high-performance monitoring and failure detection of critical infrastructure services that are deployed in virtualized environments. The failure detection framework consists of distributed agents (i.e., monitoring services) that ensure timely collection of monitoring data, and it is enhanced with a voting algorithm to minimize false positives. The goal of the proposed approach is to detect failures in datacenters that support critical infrastructures by targeting both the performant acquisition of monitoring data and the minimization of false positives in failure detection. The approach serves as the baseline for decision making and the triggering of actions at runtime to ensure high service availability, given that it provides the data required for decision making on time and with high accuracy.
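As an illustration of how voting across distributed agents can suppress false positives, the following minimal sketch declares a service failed only when more than a given fraction of the reporting agents agree. The abstract does not specify the voting algorithm, so the simple majority rule, the `Report` type, the `service_failed` function, and the `quorum` parameter are all assumptions made here for illustration rather than the method of the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Report:
    """One health observation from one monitoring agent (hypothetical type)."""
    agent_id: str
    service: str
    healthy: bool


def service_failed(reports: Iterable[Report], service: str, quorum: float = 0.5) -> bool:
    """Return True if more than `quorum` of the agents report `service` as unhealthy.

    Assumption: a plain majority vote; the paper's algorithm may differ.
    """
    votes = [r.healthy for r in reports if r.service == service]
    if not votes:
        # No observations for this service: do not declare a failure.
        return False
    failures = sum(1 for healthy in votes if not healthy)
    return failures / len(votes) > quorum


# A single dissenting agent (e.g., one with a local network glitch)
# does not trigger a failure on its own.
one_dissenter = [
    Report("agent-a", "db", True),
    Report("agent-b", "db", False),
    Report("agent-c", "db", True),
]

# When most agents agree, the failure is declared.
majority_down = [
    Report("agent-a", "db", False),
    Report("agent-b", "db", False),
    Report("agent-c", "db", True),
]
```

With these sample reports, `service_failed(one_dissenter, "db")` evaluates to `False` and `service_failed(majority_down, "db")` evaluates to `True`. Raising `quorum` trades slower or missed detection for fewer false positives.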
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.