Smart Preventive Maintenance of Hybrid Networks and IoT Systems Using Software Sensing and Future State Prediction

Minea, Marius; Minea, Viviana Laetitia; Semenescu, Augustin

doi:10.3390/s23136012

Open AccessArticle

Smart Preventive Maintenance of Hybrid Networks and IoT Systems Using Software Sensing and Future State Prediction

by

Marius Minea

^1,*

,

Viviana Laetitia Minea

² and

Augustin Semenescu

^3,4

¹

Department Telematics and Electronics for Transports, University Politehnica of Bucharest, 060042 Bucharest, Romania

²

Department IT, Orange Services Romania, 020334 Bucharest, Romania

³

Faculty of Materials Science and Engineering, University Politehnica of Bucharest, 060042 Bucharest, Romania

⁴

Romanian Academy of Scientists, 050045 Bucharest, Romania

^*

Author to whom correspondence should be addressed.

Sensors 2023, 23(13), 6012; https://doi.org/10.3390/s23136012

Submission received: 10 May 2023 / Revised: 20 June 2023 / Accepted: 25 June 2023 / Published: 28 June 2023

(This article belongs to the Topic AI-Enabled Sustainable Computing for Digital Infrastructures: Challenges and Innovations)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

At present, IoT and intelligent applications are developed on a large scale. However, these types of new applications require stable wireless connectivity with sensors, based on several standards of communication, such as ZigBee, LoRA, nRF, Bluetooth, or cellular (LTE, 5G, etc.). The continuous expansion of these networks and services also comes with the requirement of a stable level of service, which makes the task of maintenance operators more difficult. Therefore, in this research, an integrated solution for the management of preventive maintenance is proposed, employing software-defined sensing for hardware components, applications, and client satisfaction. A specific algorithm for monitoring the levels of services was developed, and an integrated instrument to assist the management of preventive maintenance was proposed, which are based on the network of future states prediction. A case study was also investigated for smart city applications to verify the expandability and flexibility of the approach. The purpose of this research is to improve the efficiency and response time of the preventive maintenance, helping to rapidly recover the required levels of service, thus increasing the resilience of complex systems.

Keywords:

preventive maintenance; Markov Chains; future state prediction; state matrix; risk assessment

1. Introduction

Today, the current developments in large cities are oriented towards the introduction of smart applications, reduction of environmental impact, and new services to ease day-to-day living for citizens. Continuous growth in data collection, storage, processing, and transmission led to creating heterogeneous structures for communications and data mining, which are not always compatible and/or well structured. Finally, smart mobility is also seen as one of the main solutions to reduce traffic congestions, stress, and pollution in urban areas. All these services are largely based on heterogeneous telecommunications solutions and hybrid subsystems. Standards and technologies for communication evolve on a permanent basis. For example, new standardization includes Matter, an open-source connectivity standard for smart home and Internet-of-things devices, aimed at improving compatibility and security, of which Version 1.0 of the specification was published on 4 October 2022. There is also Thread, an IPv6-based, low-power mesh networking technology for Internet-of-things (IoT) products. Hence, the evolution of network standards, technologies, and hardware is in a continuous process.

On the other hand, the more intensive use of wireless sensors and non-intrusive detection of vehicles, passengers, and/or travelers, along with the introduction of other smart-city-specific services, means that communications are also extending and becoming very heterogeneous. Maintaining the necessary level of service for such complex networks is becoming a difficult task, even though several applications and techniques help maintenance operators to track and discover malfunctions and the excessive loading of network channels or slow the response of applications. Consequently, the maintenance of such complex systems and networks is also becoming complex, making it difficult for the human operators and specific services to efficiently manage all functionalities in real-time and to ensure flawless services. Up until this moment, the different solutions for hardware, traffic, and applications monitoring are not integrated in a single platform, and standardization in this field is still at a poor level. The present work is intended to represent an additional solution for improving the response time of the SMC (Services’ Monitoring Centre), considering the continuous increase in complexity of telecommunication networks, in the context of the exploitation of smart city services. Another goal is to improve the overall application response time. From the period of the pandemic until the present day, there has been an intense emphasis on the digitalization of as many services as possible that citizens can access. Therefore, there is fierce competition between companies to offer clients the opportunity to digitally achieve their forecasted goals, assessing the services from the comfort of their home. The criterion that makes the difference and leads to customer loyalty in this situation is the desired availability of applications. Situations in which applications have a high response time provide a negative experience to users and make them re-orient towards competition. Therefore, this work is also focused on improving this aspect.

-: This research was aimed at creating a platform for integrated monitoring of reliability, level of service, and client satisfaction, employing simple solutions that do not require difficult programming tasks and/or intensive computing power—a solution to collect, store, and analyze all information regarding hardware/software malfunctions and application performance. The approach assumed that intelligent agents are employed for the management of different services (e.g., specific for smart city and communication networks), which collect the relevant information regarding levels of service. The collected data were stored and used to build a state matrix, which was then employed to produce a prognosis on future states of the network and to issue early warnings for preventive maintenance. This involved the integration of intelligent agents for information collection regarding hardware and software monitoring, combined with application and client satisfaction monitoring.
-: A model was created for a state matrix based on collected data and building a data base for cyclic and/or event-triggered updating and analysis.
-: An algorithm was created for building and updating the state transition matrix based on the Markov approach. This solution was chosen to keep the necessary computing power at a low level.
-: Development and adaptation of the solution for client satisfaction analysis were proposed.
-: All these approaches were integrated into a single platform to assist the maintenance operators in early detection and warning, regarding malfunctions and network decrease in performance, also based on a risk assessment matrix.

2. Related Work

Studies and research were performed worldwide to enhance preventive maintenance solutions and to keep in line with the rapid development of technologies and services. The management of complex networks must begin with a deep understanding of the system architecture, based on the topics defined by the ISO network management model: fault, configuration, accounting, performance, and security managements [1]. This model provides a comprehensive means for managing the major functions of network management. Modern and heterogeneous communication networks challenge the maintenance services’ accuracy and the effective processing of big data in a real-time manner. Mobility of some wireless sensors, and/or monitored devices, also may create complex behavior of network traffic, difficult for analysis and interpretation for early detection of anomalies. In this direction, deep learning has been efficiently employed to facilitate analytics knowledge discovery in big data systems to detect hidden and complex patterns. Deep learning models are applied in network traffic monitoring and analysis.

Modern communication networks, including Cognitive Radio [2,3], as well as the reliability of the communication link between the users, are based on several Quality-of-Service (QoS) indicators, such as connection availability, channel availability, service retainability, and/or network unserviceable probability. These are evaluated under a variety of channel failure and PU arrival rates, allowing for “on-line” monitoring of the network viability. Still, these improvements depend on other, random factors, such as the channel or the receiver’s availability (which may be in different state—out of reach, busy, etc.). The authors of [2,3] conclude that another important KPI of these networks’ reliability, which should be included in the QoS study, should be the receiver’s availability. Of course, this represents a very promising advance, but not all networks are presently at this stage of development. Therefore, the methodology proposed in this work comes as a complementary service to a heterogeneous type of network, integrating QoS-related data in a solution for post-processing and forecasting of the network’s state of functionality. However, for the proposed solution, mostly fixed receivers have been considered (i.e., sensors of the smart city services), and studying the availability of the receivers may constitute, probably, a future work.

The work of S. Rezaei and X. Liu [4] presented a survey on a specific part of the models for different Deep Learning-based network traffic classifications. Aniello et al. [5] also introduced, in their study, some machine learning-based models (both unsupervised and supervised) in a scenario involving malware analysis, but they do not extend their research to malware detection.

A more in-depth analysis on network traffic analysis was performed by Conti et al. [6], which had some interesting points in considering the level of traffic at which the network is monitored and the aim of this analysis. Some non-supervised learning algorithms, such as k-means, or supervised, i.e., Random Forest, are analyzed, along with a very pertinent organization of the main KPIs in traffic monitoring, such as traffic characterization, app identification, usage study, malware detection, user action identification, OS identification, position estimation, ad fraud identification, tethering (internet sharing) information, or website fingerprinting.

Additionally, Fadlullah et al. [7] presented deep learning models and architectures for network traffic control systems, covering mainly the network infrastructural aspects.

For larger networks and big data analytics, D’Alconzo et al. [8] focused on anomaly detection and security mechanisms with the purpose of identifying and reacting in a fast manner to unpredictable events while monitoring many heterogeneous events. The authors also categorize previous research on network traffic monitoring and analysis (NTMA) that work with big data approaches. In the same domain of NTMA, the work [9] by M. Abbasi, A. Shahraki and A. Taherkordi is mentioned, which provides a comprehensive review on applications of deep learning in NTMA, analyzes the integration of deep learning and NTMA, and performs a review of DL techniques for NTMA.

Similarly related work is given in [10,11,12]: the passive flow monitoring of hybrid network connections, usefulness of machine learning in network monitoring, and the challenges and opportunities that big data present in this direction of research.

Another interesting direction of research is focused on analyzing data traffic statistics and detecting anomalies [13]. Most of the actual methods for detecting anomalies in data traffic, especially in public networks and institutions, have been analyzed and presented in a comparative study: statistically based methods, distance-based methods, density-based methods, clustering-based methods, graph-based methods, and learning-based methods. The research concludes with the proposal of including an Anomaly Detection Module (ADM), based on a combination of the above-described technologies.

There are different, other domains where this approach is also welcomed: power grids need very accurate monitoring of operation status to ensure uninterruptible operation. One solution for this is based on random matrix theory and qualitative trend analysis [14]. The solution considers two types of elements: the variability and the overall performance of the system, ignoring the complex physical structure of the power grid and using the data generated during the operation of the power grid more effectively. On the other hand, not only natural factors may produce failures of such networks. In smart metering methods, human intervention may also be a cause of malfunction, instability, or bad operation. Data-driven fraud detection methods are analyzed in [15], comprised of AI-based supervised methods, including wide and deep neural networks and multi-data-source deep learning models, along with unsupervised methods, e.g., clustering. Complementary to these methods, vulnerabilities are analyzed from as many aspects as possible, and the researchers recommend employing lightweight privacy-preserving detection to preserve relevant data for accurate detection, as well as the use of AI-based self-learning detectors.

One other and important aspect of smart city services is the distribution of utilities. Research in the direction of improvement of water distribution normal operation and validity include some innovative solutions, such as Digital Twins, for rapidly detecting leaking and maintaining pressure control, fractal control, partitioning (pressure management areas), or multi-objective optimization, which is an approach that is based on the Gomory–Hu tree to maintain control over each segment, etc. [16].

In the same domain of energy grids management, some researchers propose hybrid data transmission networks to compensate for the missing of GSM signals in remote locations. Similar hybrid networks, based on a combination of RS485 and RF modules (nRF), according to study [17], can be successfully used in solar power parks as an alternative to GSM networks.

Hardware gear can also be a cause of a system’s or a network’s malfunction. A solution for monitoring complex hardware computing equipment could be HDD failure monitoring, which is based on self-monitoring analysis and reporting technology [18].

However, in complex distribution grids, correct operation might be corrupted via false data injection attacks (FDIAs). In [19], a novel deep neural network approach is proposed to perform simultaneously distribution system state estimation calculation (using regression) and FDIA detection.

With the increasing role of complex networks in the era of information, another problem that has been in focus in recent years is the prediction of data links related to air transports networks to improve the efficiency of transportation in complex networks of airports [20].

Regarding electricity distribution, consumption, and related policies, an arising concept and a side-effect is so-called “energy justice”, concerning the effect of introducing advanced techniques for data collection and AI-related applications in the field, which may lead to privacy infringements. In [21], it is explained that “Energy justice” is a concept that has emerged predominantly in social science research to highlight that energy related decisions, particularly as part of the energy transition, should produce just outcomes. Therefore, the authors of the study recommend that “it is important to take energy justice in consideration from an early stage in the development or design of AI techniques”.

Technologies for monitoring and the maintenance of public transformers in an energy distribution network are considered in work [22]. The aim of the research is to remotely determine public transformers load and to construct a load prediction model, based on the LSTM (Long-Short Term Memory) algorithm, to be used for detection and the accurate location of heavy overload risks in advance, therefore being a preventive maintenance technique.

Preventive maintenance has always been a priority for critical applications and industry. Therefore, many researchers are focused on finding the most appropriate solutions to improve efficiency of this aspect. Different strategies are tested, and they prove their efficacity in increasing reliability [23], such as using a logistic regression model to assess the health condition of equipment and a neural network model to estimate its failure probability, considering the scheduled workloads. Besides the industrial process, employment of intelligent agents to verify on a continuous basis the load of different components in a communication network has also been implemented. The goal is to determine the best operational status of a server in each time slot, based on Markov chain models, as well as to optimize the system’s performance, which is measured in terms of throughput [24]. However, modern communication networks now rely on optical fiber, which is immune to e.m. interferences, but the optical fiber is also part of the reliability chain, so it also needs monitoring in terms of its operational status. Therefore, there are some solutions to improve the performance of FO via integration with optical amplifier boards, able to detect optical layer events and fiber soft/hard failures with online remote management [25]. Processes increase in complexity when they are developed in cloud applications. In order to extend the preventive maintenance at this level, some researchers propose a Recurrent Neural Network (RNN)-based method to proactively predict faults, in the event of insufficient resources in fog devices, based on a conceptual LSTM and novel Computation Memory and Power (CRP) rule-based network policy [26]. For networks and systems based on sensors, some authors employ Bayesian Network Models (BNM) that can be improved via fusion-learning methodology: merging different data from sensors and metrology logs, combined with a human-in-the-loop approach for expert knowledge elicitation of the BN structure [27]. Another solution is data prediction using a v-Support Vector Regression (vSVR) algorithm [28], the latter being very useful for high network loads, such as in emergency support during festivals and large-scale activities.

Other methods for improving reliability and resilience of different systems and networks with models of operation use Least Squares Support Vector Machine (LSSVM) [27], an exponentially weighted moving average method combined with a continuous deep belief network for constructing the reliability model [28], or even intelligent solutions to prevent security breaches with a delay-based attack detection and isolation scheme (DA-DIS) [29]. For underground medium-voltage power supplying networks, a novel method for improving reliability is proposed in [30], using various machine learning classification algorithms.

When complex systems, including more networks and subsystems, are to be monitored, different approaches include dedicated sensors, IoT platforms, and a LSTM ensemble neural, which are all developed to predict the operational status [31], and for avoiding cascading failures, a hybridization of two meta-heuristic techniques, namely, the snake optimizer and the sine-cosine algorithm (SO-SCA), are proposed to solve the problem [32]. A fault-tolerant topology algorithm for agricultural WSN, based on a double-price function, is designed in [33] to improve the connectivity and reliability of the WSN, while some approaches employ a trained multi-agent for comparing the computed future state with the actual state and early detect faults [34].

Many of the techniques applied for improving early fault detection and preventive maintenance are reviewed and analyzed together [35,36]. The authors conclude that “These monitoring tools can be used for achieving the goal of high performance and reliable networks as they are capable of analyzing the resources for configuring the network problems and alert the administrator if any network issue occurs”.

When it comes to preventive maintenance, grid networks and energy supplying distributed systems are in the center of preoccupation; methodologies include: distributed data collection network [37] or adding QoS to low cost protocols, such as ZigBee (using IEEE 802.15.4 defined physical and MAC layer) and Bluetooth (IEEE 802.15.1), by providing differentiated service for traffic of different priority at the MAC layer [38]; also, the DFS (Depth First Search) algorithm is used to divide the network in zones and to capture the influence of maintenance decisions in the model of the served load from DGs and batteries by generating topological constraints [39]. Finally, state transitions and risk models [40] have also been employed for the preventive maintenance. Regarding communication networks, different approaches are considered by several researchers: usage of infrastructure monitoring tools [41], cloud applications monitoring [42], runtime software-fault monitoring tools [43], distributed performance monitoring [44], or lightweight distributed metric services [45] to cope with very large networks and continuous monitoring of applications [46,47].

There are some research works that survey the state-of-the-art in the field of scalable networks for heterogeneous systems, software-based networking, and hybrid systems involving several categories of smart devices, such as [48], where the authors present studies of ML/DL applications in software-defined environments.

The methods for assessing the network performance may be split into two categories:

-: active methods for network efficiency and level of service monitoring, involving the injection of probe traffic into the network to learn about its state of operation, as well as
-: passive methods, observing and analyzing different KPIs collected in big data storages.

Table 1 presents in a comparative mode some of these aspects.

Taking into consideration the information presented in Table 1, it is obvious that a combination of the two techniques is the most beneficial for the NTMA. However, this is a complicated process to implement because it needs a deep understanding of the network and messages structures, and for this to become effective, a very complex team of experts with a period of accommodation, or training, is also needed.

Complex systems and distributed network maintenance have also been the preoccupations of many researchers [48,49,50], and modeling of the present and future states using different models, including Markov Chains and/or Hidden Markov, are discussed in connection with some applications for several systems [51], based on the modeling of hidden states of those systems. These solutions might involve complex algorithms and also presume higher computation power for achieving usable results in the prognosis of a system’s future states, as well as possible training, using simulated or collected data. Markov Chains and Hidden Markov Chains (HMMs) are both mathematical models used to describe stochastic processes, where the state of a system evolves over time. The Markov Chain consists of a finite set of states and a transition probability matrix. The matrix defines the probability of transitioning from one state to another. Each state has a fixed set of transition probabilities associated with it, and these probabilities remain constant throughout the process.

A Hidden Markov Chain is an extension of the Markov Chain model that incorporates hidden or unobservable states. In a HMM, the system has a set of observable states, but the underlying state of the system is hidden or unknown. The observed states are generated by the hidden states through a set of probability distributions. In general, HMMs require more processing power than simple Markov Chains due to the additional complexity involved in inferring the hidden states from the observed states. The computational complexity of HMMs arises from the need to estimate or infer the hidden states using algorithms, such as the Viterbi algorithm or the Baum-Welch algorithm. Moreover, HMMs often involve more complex probability distributions for emission and transition probabilities compared to the constant probabilities in simple Markov Chains. These probability distributions usually require additional calculations and more processing power to handle.

The present work is focused on proposing an integrated platform for preventive maintenance, which is dedicated to complex smart city services and involved data communication networks, based on a less demanding computation power. Therefore, it uses only observable indicators, using data collected by different intelligent agents. These agents harvest information both from hardware and communication channels loads, as well as from the applications’ availability and response times. As a continuation of a previous research [52], the use of intelligent agents in early discovering and noticing deviations of normal operation and lowering of the level of service is associated in this work with the updating of a current state matrix and computing different state probabilities for a future state prediction matrix. The latter is aimed at providing the operator with alerts and suggestions for alleviating malfunctions’ and maloperations’ negative effects.

The remainder of this article is organized as follows: Section 3, Materials and Methods, describes the main aspects regarding the permanent monitoring of reliability and levels of service based on Markov Chain modeling of a future state matrix. Section 4 proposes an algorithm for integrating the state matrix and clients’ satisfaction in a common monitoring platform, as well as application on a case study with six smart city services, and, finally, Section 5 and Section 6 are proposed, where an analysis on the utility of the proposed solution is discussed, along with future developments.

3. Materials and Methods

3.1. Reliability and Maintenance Relationship

Due to their required high level of service, smart city services and supporting data communications networks need permanent monitoring and maintenance. Due to the continuous development and the growing complexity, these networks have become difficult to monitor and maintain.

Therefore, there is a need for automated maintenance processes, supported by intelligent agents able to early detect failures, malfunctions, and any other defective operations. At the same time, even manual upgrading, deployment of new software versions, operational support, troubleshooting, etc., may become sources of defective operation of some of the functional components from the complex networks. In fact, as personal observations reveal, on some of the mobile communication networks in Romania, intensive upgrading and improvements in the functional (hardware or software) components caused more than 55% of the events causing low levels of service. This might be somehow justified, considering the vast complexity of the network and implications that one server, or application, have in the overall process, implications that the human personnel might not be able to envisage from the beginning. Moreover, there are some causes that cannot be forecasted, such as natural disasters (flooding, earthquakes, fire, etc.), or works in the field, performed by other parties, which are possible to intrude in the physical cabling, but these seem to be much rare than the functional failures or human intervention effects, such as third-party vendor services failures, security breaches, and so on.

For achieving the goal of obtaining a simple solution for the integrated management of hardware failures, software problems and customer satisfaction, in this research, the following aspects have been addressed:

-: Integration of (existing) intelligent agents for hardware, applications, and services monitoring
-: Proposing an algorithm for building a state matrix for the system
-: Proposing an algorithm for building and updating the state-transition matrix, based on the Markov approach
-: Development and adaptation of the solution for the clients’ satisfaction analysis
-: Integration of all these approaches in a single platform to assist the maintenance operators in early detection of malfunctions and network decrease in performance
-: Creating a risk evaluation matrix for the maintenance operations

In general, the probability of failure is best described in reliability theory by the failure rate,

λ (t) = \sum_{i = 1}^{N} λ_{i} (t)

(1)

where;

λ_{i} (t)

represents the failure rate of the independent functional component, and

N

is the total number of functional components taken into consideration. We say that

λ (t)

is a probability that the product will work without failure until the considered moment and fail during the immediately following time unit (if this unit is small).

Then, the overall reliability function

R_{T} (t)

of the system for a year is given by:

R_{T} (t) = e^{- \sum_{i = 1}^{N} λ_{i} (t) \cdot t}

(2)

where; t is the duration of time corresponding to a year, expressed in hours, and the mean time between failures (MTBF) is given by:

M T B F = \frac{1}{\sum_{i = 1}^{N} λ_{i} (t)}

(3)

(if the chain of reliability only considers the equipment) Because determining the utility function (failure-free operation) requires a large volume of experience, the reliability of a product is generally characterized by the average duration of operation:

T_{0} = M τ = \int_{0}^{\infty} t q (t) d t = - t P (t) |_{0}^{\infty} + \int_{0}^{\infty} P (t) d t = \int_{0}^{\infty} P (t) d t

(4)

where;

M [T_{d}]

or

M [T_{i}]

represent the average value of a repair or replacement of time between two consecutive successful states of operation, during which the respective installation repaired or replaced, and

P (t)

represents the probability that the product will work without breaking down until time t:

P (t) = P (τ \geq t)

.

For example, in a mobile communication network in Romania, there have been cases where applications which were making requests to a specific public domain failed because other, banned domains, stole the public IP of the legal one, a process which led to blacklisting the correct IP. Intensive maintenance of complex networks, consequently, could also produce negative effects, such as randomly lowering of some service levels, increasing operation costs, causing outage duration costs, etc. A balance is necessary to be made between maintenance costs and outage duration costs.

The most appropriate maintenance service can be determined via two different approaches:

-: Preventive maintenance—via scheduled procedures, condition-based procedures, or reliability-centered maintenance
-: Corrective maintenance—operation is performed after the failure has manifested. It might also trigger corrective measures, or changes in the structure of the network, upgrading of software components, etc.

Mathematical modeling of maintenance should consider an objective function, seeking an optimum between the following criteria: minimization of restoring time, minimization of maintenance costs, and risk minimization. It is considered that a model which employs risk management is important in AI-assisted preventive maintenance, being more efficient in suggesting the human operators the appropriate measures to be taken and their forecasted risks in terms of operating levels of service for the different hardware and software components. This is because the quantification of risks enables determining an optimal level of risk, which provides the most efficient maintenance strategy for complex systems and networks.

The methodology in this paper proposes the automation of multiple integrated processes, namely, (i) the introduction of risk assessment-based functional monitoring agents and (ii) the monitoring of the clients’ satisfaction. To determine an optimal preventive maintenance objective, it is necessary to analyze multiple possible operating states and scenarios, based on state transition matrixes. A multi-level approach is easier to introduce in practice, especially when complex networks and services are involved. In this way, a dedicated monitoring application and model should be developed for the data communication network. Then, a higher-level application for monitoring complex services (including the monitored network) is to be set on a superior level of implementation. This superior-level application shall be in charge, also, of monitoring clients’ satisfaction.

3.2. Building the Algorithm for Network and Service Risk Assessment

This subsection describes the proposed approach for obtaining an automated preventive maintenance process, helping human operators in the fast recovery activities of the data communication network, or preventing the occurrence of a failure, due to early warning messaging.

The basis of this model is founded on the analysis of a complex data communication network and a set of relevant smart city-related monitoring agents, and, from the point of view of the operating states, the main causes of the decrease in the level of some services, as well as the analysis of the causes of the most frequent hardware and/or application failures, are considered. A transition matrix is then built, considering different failure rates and the corresponding risk factors, with associated causes. Risk is defined as the product between the probability that a failure occurs and the expected value of costs that the failure produces in the system. The risk is defined at the level of the considered data network. The evaluated data network is a complex one, with different services and applications, and it is used as a backbone data communication network in a smart-city environment, where different services also rely on smaller communication networks, such as ZigBee, Bluetooth, or LoRa.

For the intelligent monitoring of the backbone data network, previous work results have been presented in [52]. The following intelligent agents have been in use for monitoring smart city services:

Traffic service levels monitoring service
Energy distribution service levels monitoring
Environment monitoring service
Crowdsourcing monitoring service
Public lighting monitoring service
Waste disposal monitoring service

Each individual agent is set to monitor a specific service from the point of view of its functionality, iteratively, and/or by event triggered. Each record is indexed with event start and event end timestamps to determine the service unavailability duration. The assessment took place for a one-year period, during which all six services have been monitored from the availability point of view, namely, the ratio between the count of successful requests of the service and overall requests (successful plus failed requests to services). The diagram presented in Figure 1 shows a sample analysis for a month period, where a specific service has experienced some failures. The number of failures is represented by the vertical (blue) bars, while the availability index is presented in the upper part of the diagram, in percents, and the red line shows the evolution in time of these indexes.

The next figures present, in detail, samples of the six independent service activities during the monitoring period: Figure 2—traffic monitoring service, Figure 3—energy distribution monitoring service, Figure 4—environmental monitoring service, Figure 5—crowdsourcing monitoring service, Figure 6—public lighting monitoring service—vertical lines represent the division of time for monitoring the service due to the fact that this specific service is only monitored during night time, and, finally, Figure 7—waste disposal monitoring service. The red vertical lines represent decreases in services’ availability due to the different causes, including malfunctions, equipment failures, maintenance operations, software upgrading, OSI physical level degradation, etc.

Some of the most common failures noticed have been caused by human interventions, including corrective maintenance, curative maintenance, software upgrading, preventive maintenance, peer migration, hardware replacement, hardware upgrade, and standardization.

Considering the impact of these malfunctions, the following represent the main effects, on a scale from the worst to the less harmful impact: complete failure, traffic loss, incoherence/loss of data, latency, loss of administration, loss of supervision, mini failure (complete failure for a max. 10 min time), and slow response.

The probability of uninterrupted functioning for this service, computed based on collected data, was

P_{t m} = 0.99725

.