Application of the Learning Automaton Model for Ensuring Cyber Resiliency

Abstract: This work addresses the functional approach to ensuring cyber resiliency as a kind of adaptive security management. For this purpose, we propose a learning automaton model capable of self-learning and adapting to changes while interacting with the external environment. Each node in the under-controlled system has a set of probable actions with respect to neighboring nodes. The same actions are represented in the graph of the learning automaton, but the probabilities of actions in the graph model are continuously updated based on the received reinforcement signals. Due to the adaptive reconfiguration of the nodes, the system is able to counteract cyberattacks, preserving resiliency. The experimental study results for the emulated wireless sensor network (WSN) are presented and discussed. The packet loss rate stays below 20% when the number of malicious nodes is 20% of the total number of nodes, while the common system loses more than 70% of packets. The network uptime with the proposed solution is 30% longer; the legitimate nodes detect malicious nodes and rebuild their interaction with them, thereby saving their energy. The proposed mechanism allows ensuring the security and functional sustainability of the protected system regardless of its complexity and mission.


Introduction
Currently, there is a trend toward the use of devices capable of interacting with the external environment and exchanging information with each other via an internal network or the Internet. The number of such devices is constantly increasing, which indicates the transition to the next technological concept of a reconfigurable system-of-systems. A significant part of such a system consists of unattended computing nodes that move and are distributed in space. Combining such devices into self-organizing networks for collecting, processing and transmitting information, e.g., a wireless sensor network (WSN), the internet of things (IoT), vehicle ad hoc networks (VANET), flying ad hoc networks (FANET), and cyber-physical systems (CPS), expands the possibilities for presenting information about manufacturing processes, the environment, and automation, as well as improving 'human-to-human' or 'human-to-machine' interaction.
Ensuring security plays a crucial role in the use of the connected systems since they are used in critical areas that are directly related to the economy and life of society. This is evidenced by a number of security incidents targeted at the IoT-based formations [1] as well as regulatory acts [2] related to the security of critical infrastructure around the world. Many research efforts are already underway to develop security technologies concerning the security of the reconfigurable systems [3,4]. However, a crucial component of making ubiquitous security a reality is the ability to shift the balance that currently favors the attackers, in part by using cyber resiliency techniques.
Cyber resiliency is the ability of a system to anticipate, withstand, recover from, and adapt to adverse conditions and threats [5], and today it matters for critical infrastructure. From an engineering perspective, cyber resiliency is an aspect of an emergent quality, the trustworthiness of the system [6][7][8]. A cyber-resilient system is one that has security safeguards built into it as a fundamental element of its design, that displays a high level of resistance to cyber threats and security faults, and that continues operating in a degraded or debilitated state to carry out the system's mission-essential functions.
Today, cyber resiliency can be reached using trust-based and IDS-aided techniques that, in most cases, only allow us to identify the security issues in the systems. These issues are then eliminated by isolating the malicious node from the system or by informing us how to counteract the cyber threat. Within our research, the objective is to review the security resiliency technologies, discuss a functional adaptability method based on a learning automaton (LA) model, and examine it. The LA model is capable of self-learning and adapting to changes while interacting with the external environment. Each node in the under-controlled system has a set of probable actions corresponding to neighboring nodes. The same actions are represented in the graph of the LA, but the probabilities of actions in the graph model are continuously updated based on the received reinforcement signals. Due to adaptive reconfiguration, the system is able to counteract the developing attacks. The preservation of system operability is ensured by the LA, which changes the node's behavior with respect to malicious nodes, following the analysis of input signals. In this sense, the benefit of the proposed method of defensive resiliency is its ability to give defenders an asymmetric advantage in the cyber conflict with adversaries; the proposed approach thus belongs to the symmetry/asymmetry aspect that matters for the more stable and longer functioning of the protected system.
The following sections of the paper present the functional approach to cyber resiliency based on adaptive control with a learning automaton (LA) model (Section 2), the experimental study of the proposed control mechanism of the LA (Section 3), the final discussion (Section 4), and conclusion (Section 5).

Approaches to Ensuring Cyber Resiliency
The response to security impacts should be comprehensive and purposeful to ensure preservation of the system's functional properties, and should thus rely on methods and means that maintain the system's dynamic development or reduce the degree of influence of security impacts on different system levels: components, subsystems, missions, and business functions. Every level requires its own approaches to ensuring resiliency. Such approaches can be based on principles following the biological analogy [8]. This similarity to biological systems can be applied to transfer the experience of millions of years of natural evolution to the safety of digital systems.
A digital system, as well as its biological prototype, can be represented as a complex topological structure forming a set of connected elementary units (nodes) through which complex functional or informational processes take place. Any threats or intrusions of the outer environment on such a system can cause the generation of additional signals and changes in node-to-node communications, system topology of the nodes, and reactions (functions) at each level of the system representation.
Summarizing the resiliency-providing principles in biological systems and transferring them to digital systems results in three main approaches to ensuring cyber resiliency (Figure 1).

• Homeostatic approach that preserves the system state under external influences [9]. Consideration of a digital system with a network structure as a homeostatic system implies the formation of some set of attributes, which should be satisfied by the system, forming a view of its state. In the case that, as a result of operation, one or more attributes cease to meet the required criteria, corrections are applied to the system [10][11][12][13];
• Functional approach, which is based on the theory of functional systems [7,14]. The main principle of the functional approach is the preservation of the system function under external influences. In its framework, a digital system is considered a system with one or more functions, and its performance under destructive influences is a priority. In this case, the goal of management is the preservation of this function or their set with the help of various methods and tools [15,16];
• Ahead reflection (anticipation) that prevents a destructive impact and its consequences before it is committed [17,18]. The essence of the ahead reflection is to predict possible security impacts and take measures to neutralize them by creating resource reserves, applying an anticipatory effect [19,20].
Each approach to cyber resiliency follows one of the dominant principles:
• Expenses reducing: choosing the way of reaction (out of acceptable ones) to a destructive impact (e.g., cyberattack) requires the minimization of costs (amount of resources or energy) for its implementation. For example, the number of operations on the graph of the system should be minimized to preserve the functional route or homeostatic equilibrium.
• Maximization of the system's freedom degrees: This is to maximize information exchange with minimum entropy in the system. For resiliency, it causes maximization of communication links and interactions between the connected nodes of the system.
• Cyber resistance preservation: When responding to an external impact (e.g., cyberattack), the system has to ensure the preservation (if possible) of a sufficient stock of components for subsequent compensatory and anticipatory actions. For some systems, this principle can be formulated as maintaining the margins of stability.
The quantitative assessment of resiliency depends on the type of the system and approach to ensuring the system resiliency. For example, it can be expressed as a risk score or a number of reserved functional routes and a number of redundant nodes.
The dominance of one or another principle agrees with one of the above listed approaches and the goal of the system protection. Each approach has its own characteristics and key differences with respect to its point of view on cyber resiliency (Table 1). In this work, we are inspired by the functional approach as a kind of adaptive control, the most promising technique for resiliency against either natural (for bio-systems) or digital (for computer and cyber-physical systems) threats. Adaptive control solutions mainly allow us to identify the security threat and, as a counteraction, remove the malicious node from the system topology. At the same time, the functional approach is mainly aimed at providing protection against targeted attacks by malicious nodes without taking into account the occurrence of natural threats. Consequently, existing protection methods are not sufficiently effective for system nodes operating for a long period of time without maintenance on a non-monitored perimeter. The goal of our work is to add adaptive behavior whereby a node modifies the rules of interaction with its neighbors. The preservation of infrastructure operability is proposed to be ensured by changing the behavior of a node relative to a malicious node based on intelligent analysis of input signals.
The maintenance of functional stability is proposed to be achieved by applying a computing model of a learning automaton. The learning automaton (LA) is capable of self-learning while interacting with the external environment and adapting to changes. Each node in the under-controlled system has a set of probable actions with respect to neighboring nodes. Similar actions are represented in the LA, but the probabilities of actions in the LA are continuously updated based on the received reinforcement signals. The reinforcement signal is generated based on the observed performance of neighboring nodes during network operation. In this model, each node changes the rules of interaction with its neighbors. Through the use of adaptive behavior, nodes are able to counteract both natural security threats and targeted attacks.

The Learning Automaton Model
Originally, the LA was specified by M. Tsetlin [21]; it is an optimization model used to determine an optimal action out of a set of actions. In this regard, learning for the automaton is effective if and only if the system in which the automaton works has a high level of fuzziness. In systems with a low level of fuzziness, the automaton learning may not be a correct measure for selection [22]. Therefore, this model is well suited for the application of security adaptability. For digital systems, the necessary characteristics of the LA are defined as follows. The probability vector is updated using the systems of Equations (1) and (2), where $P_j(n)$ is the probability of action $j$ at time step $n$, and $a$ and $b$ are coefficients whose values depend on the chosen learning model. The LA uses the learning model $L_{R-P}$ for actions $\alpha_1$ and $\alpha_2$, and $L_{R-I}$ for action $\alpha_3$. Equation (1) is used when the automaton wins, and Equation (2) when it loses. Taking into account the given parameters, a graph representation of the LA is depicted in Figure 2. The graph's vertices and arcs are marked with the following labels: PSD/OSD, receiving/sending service data; PID/OID, receiving/sending information data; VID/VSD, verification of information/service data; OD, data processing; RST, reset of the connection; and $p^{\alpha_m}_n$, the probability of the action. This graph is general and not bound to the routing protocol used in a concrete system. If a particular routing protocol has to be referred to, the states of the PSD, OSD, PID and OID vertices should be specified according to the corresponding rules of the protocol. Data transfer between the nodes is divided into an information flow and a service flow. Information messages include data required for the system to perform its functional task.
For example, the wireless sensor network (WSN) is utilized for monitoring the surrounding manufacturing conditions, and, in this case, the information messages can be data about the pressure and temperature in the system's surroundings. Service data messages contain information required for the routing protocol, and convey information about the state of the system's communications and nodes (e.g., performance indicators, and behavior indicators).
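For orientation, the textbook linear update for an automaton with $r$ actions, in which action $\alpha_i$ was selected at step $n$, is usually written as shown below. That Equations (1) and (2) take exactly this form is an assumption; the symbols $r$ and $i$ are introduced here for illustration only.

$$
\text{win:}\qquad
\begin{aligned}
P_i(n+1) &= P_i(n) + a\,\bigl[1 - P_i(n)\bigr], \\
P_j(n+1) &= (1 - a)\,P_j(n), \qquad j \neq i,
\end{aligned}
$$

$$
\text{loss:}\qquad
\begin{aligned}
P_i(n+1) &= (1 - b)\,P_i(n), \\
P_j(n+1) &= \frac{b}{r - 1} + (1 - b)\,P_j(n), \qquad j \neq i.
\end{aligned}
$$

In this notation, the $L_{R-P}$ model corresponds to $a = b$ and the $L_{R-I}$ model to $b = 0$, so under $L_{R-I}$ a loss leaves the probabilities unchanged. In both cases the updated probabilities still sum to one.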
The node function is based on the method of adaptive control of the node interaction rules. The system cyber resiliency is ensured by changing the probabilities of transitions between the states on the graph model. The transition on the graph occurs depending on the value of the complex indicator $Q$, which is calculated following Formula (3), where $g_i$ is the obsolescence rate ($g_k = \varphi^{m-k}$, $\varphi \in [0, 1]$), $m$ is the window size, and $BI_{i,j}$ is the complex behavioral indicator. The complex behavioral indicator describes the node's performance of the target function assigned to it. This indicator is calculated according to Formula (4) and consists of the direct behavioral indicator calculated by this node and the indirect behavioral indicator obtained from other nodes.
In Formula (4), $DI_{i,j}$ is the direct behavioral indicator, $II_{i,j}$ is the indirect behavioral indicator, and $C_{i,j}$ is the confidence coefficient. The direct behavioral indicator is calculated by the node itself by analyzing the behavior of its neighbors. For example, the node monitors the fact that a neighboring node has transmitted a packet to it (verification of the correct re-transmission of the packet through the network).
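The way the two indicators might be combined and then aged over a window can be sketched as follows. This is only an illustration under stated assumptions: the convex blend used for Formula (4) and the normalization of the weighted sum used for Formula (3) are guesses, and only the weights $g_k = \varphi^{m-k}$ are taken from the text. The function names are hypothetical.

```python
def combined_indicator(direct, indirect, confidence):
    """Blend DI and II into BI.

    Assumption: a convex combination weighted by the confidence
    coefficient C in [0, 1]; the exact form of Formula (4) may differ.
    """
    return confidence * direct + (1.0 - confidence) * indirect


def complex_indicator_q(bi_window, phi):
    """Aggregate the last m values of BI into Q.

    bi_window is ordered oldest to newest, so index k = m is the latest.
    The weights g_k = phi ** (m - k) follow the text; dividing by the sum
    of the weights is an assumption made to keep Q in the range of BI.
    """
    m = len(bi_window)
    if m == 0:
        raise ValueError("the window must contain at least one value")
    weights = [phi ** (m - k) for k in range(1, m + 1)]
    weighted = sum(w * b for w, b in zip(weights, bi_window))
    return weighted / sum(weights)


if __name__ == "__main__":
    bi = combined_indicator(direct=0.8, indirect=0.4, confidence=0.75)
    print("BI:", bi)
    print("Q:", complex_indicator_q([0.0, 1.0], phi=0.5))
```

With $\varphi$ close to 0 only the latest observation matters, while $\varphi = 1$ weights the whole window equally.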
The following indicators are distinguished, the use of which allows us to protect the system against cyber threats:
• Packet re-transmission: After transmitting a packet to a neighbor node, the sending node switches to a monitoring mode to track the re-transmission of its packet further through the network. It is used to counteract the nodes that re-transmit packets selectively or not at all.
• Packet integrity: In addition to checking for re-transmissions, the sending node also checks the checksum of the packet sent by its neighbor through the network.
• Node data generation intensity: The indicator is defined as the number of packets received from a node for a certain period of time and is designed to protect nodes from attacks of energy depletion and channel clogging.
• Volume of sent data: similar to the previous one. Accounting for the volume of data sent by neighboring nodes helps protect against resource exhaustion attacks.
The values of the packet re-transmission and packet integrity indicators are calculated by Formula (5) and represent the ratio of successful interaction events to all events:
$$I_m^{i,j} = \frac{S_m^{i,j}}{S_m^{i,j} + F_m^{i,j}} \tag{5}$$
where $S_m^{i,j}$ is the number of successful events between nodes $i$ and $j$ in the context of the corresponding indicator, and $F_m^{i,j}$ is the number of unsuccessful (failure) events.
The indicators are evaluated based on the thresholds. The thresholds depend on the application area of the system and type of the transmitted data.
To obtain the overall direct indicator of behavior $DI_{i,j}$, the values of the individual indicators are summed with their weights according to Formula (6). The weight of an indicator depends on the type and use case of the system.
$$DI_{i,j} = \sum_{m=1}^{N} W_m \, I_m^{i,j} \tag{6}$$
where $N$ is the number of behavioral indicators, $W_m$ is the indicator's weight, and $I_m^{i,j}$ is the value of the indicator for a particular aspect of behavior relative to the corresponding node $j$.
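The ratio indicator of Formula (5) and the weighted sum of Formula (6) are simple enough to sketch directly. In the minimal example below, the function names, the sample counts, and the weights are hypothetical; returning 1.0 when no events have been observed is an assumption, chosen because all nodes are treated as legitimate when the simulation starts.

```python
def ratio_indicator(successes, failures):
    """Share of successful events among all observed events (Formula (5))."""
    total = successes + failures
    if total == 0:
        # Assumption: with no history yet, the neighbor is treated as legitimate.
        return 1.0
    return successes / total


def direct_indicator(values, weights):
    """Weighted sum of the per-aspect indicators (Formula (6))."""
    if len(values) != len(weights):
        raise ValueError("one weight is needed per indicator")
    return sum(w * v for w, v in zip(weights, values))


if __name__ == "__main__":
    # Hypothetical counts observed by node i for neighbor j.
    retransmission = ratio_indicator(successes=8, failures=2)
    integrity = ratio_indicator(successes=10, failures=0)
    # Hypothetical weights; the text says they depend on the use case.
    print(direct_indicator([retransmission, integrity], [0.5, 0.5]))
```

Choosing weights that sum to one keeps the direct indicator in the same range as the individual indicators, which makes a single threshold easier to interpret.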
The indirect behavioral indicator is calculated by Formula (7) on the basis of direct behavioral indicators obtained from all neighbors relative to the node for which this indicator is calculated, where $DI_{i,k_l}$ is the value of the direct indicator of node $i$ relative to node $k_l$, $DI_{k_l,j}$ is the value of the indicator of node $k_l$ relative to node $j$, and $n$ is the number of nodes that provided their value. This indicator is introduced to compensate for insufficient data to calculate the correct value of the direct indicator. Over time, with an increase in the number of interactions, the weight of the direct indicator becomes greater than the weight of the indirect indicator.
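One plausible reading of the indirect indicator is an average of the neighbors' opinions about node j, each scaled by how much node i trusts the reporting neighbor. The sketch below follows that reading; the averaging form is an assumption rather than a transcription of Formula (7), and the function name and sample values are hypothetical.

```python
def indirect_indicator(reports):
    """Average the neighbors' opinions about node j, scaled by trust.

    reports is a list of (di_i_k, di_k_j) pairs: node i's direct indicator
    for reporter k, and reporter k's direct indicator for node j.
    Assumption: each opinion is multiplied by the trust in its reporter
    and the products are averaged over the n reporters.
    """
    if not reports:
        return None  # no neighbor has provided a value yet
    return sum(trust * opinion for trust, opinion in reports) / len(reports)


if __name__ == "__main__":
    # Two hypothetical reporters: a fully trusted one with a poor opinion
    # of j, and a half-trusted one with a good opinion of j.
    print(indirect_indicator([(1.0, 0.5), (0.5, 1.0)]))
```

Under this reading, a report from a poorly trusted neighbor contributes little, which limits the effect of malicious nodes that badmouth legitimate ones.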
Let us consider an example of how this algorithm works in a single round of network node interactions. There are three nodes: i, j, k. Node i needs to transmit data through the network to the base station. For a conventional routing algorithm, nodes j and k are both acceptable for further data transfer.
Based on the LA graph, node i is in the PSD state and needs to move to the OID state to send the packet. Let us assume that the probability of moving to the OID state toward node j is higher than toward node k. Then, node i will send a packet to node j and then go into monitoring mode to check whether node j has re-transmitted its packet further down the network. However, node j did not re-transmit the packet through the network because of the cyberattack. Accordingly, node i will recalculate the indicators $I_{i,j}$ and $BI_{i,j}$ and, based on the resulting complex indicator $Q_{i,j}$, register the outcome of the action in the LA as a loss.
Using the system of Equations (2), the probabilities are recalculated; on the next iteration, when sending the packet again, node i will send its data to node k for further re-transmission. If node k re-transmits successfully, the probability of node i choosing node k grows.
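The round described above can be traced numerically. The sketch below uses the textbook linear reward and penalty steps as a stand-in for Equations (1) and (2), which is an assumption, and reduces the LA to a choice between the two next hops j and k. The starting probabilities and the coefficients a and b are made up for illustration.

```python
def reward(probs, chosen, a):
    """Win: shift probability mass toward the chosen action."""
    return [p + a * (1.0 - p) if idx == chosen else (1.0 - a) * p
            for idx, p in enumerate(probs)]


def penalize(probs, chosen, b):
    """Loss: shift probability mass away from the chosen action."""
    r = len(probs)
    return [(1.0 - b) * p if idx == chosen else b / (r - 1) + (1.0 - b) * p
            for idx, p in enumerate(probs)]


def single_round_example():
    # Hypothetical preferences of node i: index 0 is next hop j, index 1 is k.
    probs = [0.6, 0.4]
    # Node j drops the packet, so choosing j is registered as a loss.
    after_loss = penalize(probs, chosen=0, b=0.5)
    # On the retry, node k forwards the packet, so choosing k is a win.
    after_win = reward(after_loss, chosen=1, a=0.5)
    return after_loss, after_win


if __name__ == "__main__":
    loss, win = single_round_example()
    print("after the loss on j:", loss)
    print("after the win on k:", win)
```

With these numbers the preference moves from (0.6, 0.4) to about (0.3, 0.7) after the loss and to about (0.15, 0.85) after the win. Node j is deprioritized rather than excluded, so it can regain probability if it later behaves correctly.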
Experiments with the use-case system were conducted, and their outputs are discussed in the following section.

Results
For the experimental study, a virtual ad hoc network was constructed in the NS-3 simulator [24]. The nodes are emulated MICAz devices [25] with an Atmel ATmega128L microcontroller and 2.4 GHz IEEE 802.15.4 modules used to create a low-power wireless sensor network (WSN). The system topology on which the efficiency of the developed adaptive control method was evaluated includes 500 nodes located randomly in an area of 1000 × 500 m. The test bench consists of legitimate nodes, malicious nodes, and a base station governed by the AODV routing protocol. The number and behavior style of the compromised WSN nodes are varied.
The malicious nodes can perform packet modification attacks [26], black hole and grey hole attacks [27] in which all or some packets are dropped, and energy depletion attacks [28,29]. When the simulation starts, all nodes in the WSN consider each other legitimate. Figure 3 shows the relationship between the number of lost packets and the number of malicious nodes in the system. In this experiment, the nodes performed black hole and grey hole attacks with packet dropping as well as packet modification attacks. Only the correct packets arriving at the base station are counted. An incorrect packet is one that was generated by a legitimate node and did not reach the base station, or one that reached the base station in a modified form differing from the original content. The system with the LA-based control works better than conventional management in delivering the correct packets to the base station. Nodes successfully identify the malicious nodes and correct the rules of interaction with them. The loss rate stays below 20% when the number of malicious nodes is 20% of the total number of system nodes. Under the same conditions, the original system loses more than 70% of packets, which is too high for sensitive cyberspace.
Further experiments were carried out to evaluate the energy efficiency of the developed cyber resiliency method. A refined energy cost accounting model of the NS-3 simulator was utilized for the evaluation. Figure 4 depicts the dependence of the number of functioning nodes on the number of rounds passed. A round is defined as a certain period of time during which nodes transmit packets (in this example, 100 s). In this experiment, the network worked in normal mode and there were no malicious nodes.
With the LA-based control, the nodes consume more power. In the following experiment, an additional 100 malicious nodes were deployed to conduct a resource exhaustion attack. The results are presented in Figure 5. The network uptime with active protection is 30% longer; the legitimate nodes successfully detect malicious nodes and rebuild the process of interaction with them, thereby not wasting their energy resources.

Discussion
Most cyber threats aim to disrupt the stable functioning of the system. Cutting off the nodes of the system, breaking node-to-node communication via the functional and control links connecting the system nodes, disturbing the operation of the routing protocol, and other impacts on resiliency motivate researchers all over the world to design and develop various methods for ensuring cyber resiliency.
The first group of the related works (e.g., [30][31][32][33][34][35][36][37]) proposes a technique of trust-based control for the distributed systems. The system nodes are grouped into unions (clusters), and each cluster allocates one node, the cluster head, which is responsible for the safety of the group. Data exchange in the cluster also takes place over the cluster head. Different studies (e.g., [38,39]) are focused on cluster head selection and voting mechanisms. The functional stability of the system is arranged as a reputation and trustworthiness model. It is based on the calculation of the score of the trust of nodes to each other while monitoring the activity of individual nodes and, consequentially, changing the nodes behavior to adapt to the varying circumstances [40]. Differences in approaches can arise due to the peculiarities of the environment in which nodes interact. There are different interpretations of reputation and trustworthiness considering various attributes, objects and subjects for trust measurement. For instance, Ref. [39] proposes swarm intelligence as an energy-efficient method. Additionally, Ref. [41] suggests an automaton-like model to select a trusted connection of the nodes within the re-configurable system.
The trust-based technique for the system control is promising because its application does not require the nodes to expend a large amount of system resources, while being able to protect against multiple cyber threats of different kinds. A weakness of this approach is the requirement of the computing resources for the head node and trustworthy calculations. Additionally, the introduction of a compromised node into a protected cluster can cause much more damage than the usual destructive impact of such a compromised node in a fully distributed system. Additional data transmission during trustworthy computing also affects the energy resources and lifetime of nodes. Data aggregation is used to optimize data transmission and thus preserve the functional stability of the network [42,43]. Aggregation techniques can, in some cases, prevent the targeted power depletion attacks [44]. For this purpose, special nodes, the aggregation points, are inserted into the structure for safe aggregation. However, the aggregation points can be susceptible to various types of attacks and hence require robust protection because false data can be inserted through the compromised nodes, which is guaranteed to lead to security faults [44].
The second group of competing works uses intrusion detection systems (IDS) operating to monitor the system nodes, identify the incidents and respond to security impacts. IDS functionality can be supplemented with additional operations to complete adaptability (e.g., [45][46][47][48][49][50][51]). IDS may be extended with an intelligent a posteriori security management subsystem, e.g., Ref. [52] suggests a smart advice system to improve the security maintenance, and Refs. [53,54] propose a finite state automaton to pave a secure data transmission route. The key weakness of this approach is the hard requirement for IDS agents running on the nodes and consuming vast computing and energy resources. A possible alternative, a centralized or cloud IDS, which requires a dedicated server, is not always feasible due to the nature of modern self-organizing cyberspaces of the IoT, WSN, VANET, and FANET.
Both trust-based and IDS-aided techniques mainly only allow us to identify the security impacts. In most cases, these impacts are eliminated by isolating the malicious node from the system or by informing us how to counteract the cyber threat. To preserve cyber resiliency, the proposed LA-based method adds adaptive behavior control whereby a node changes its behavior while continuing to interact with its neighbors. This mirrors biological systems, in which, in most cases, an injury to some part of an organism does not force the organism to cut that part off. The preservation of system operability is ensured by the LA, which changes the node's behavior with respect to a malicious node, following the analysis of input signals.
The LA, as a control model, marshals adaptability for the under-controlled system. With respect to alternatives, the LA allows ensuring the security and functional sustainability for the protected system. The maintenance of cyber resiliency is achieved by adaptive behavior based on the system's nodes dynamically changing the rules of interaction with their neighbors. Through the use of adaptive behavior, nodes are able to counteract both natural security threats and targeted attacks. It was experimentally demonstrated that the loss rate stays below 20% when the number of malicious nodes is 20% of the total number of nodes, while the common system loses more than 70% of packets. The network uptime with the LA-based solution is 30% longer.
The proposed mechanism of the LA-based control allows the security and functional resiliency to be ensured in the protected system regardless of its complexity and mission.

Conclusions
In this work, a novel method for the functional adaptation of the reconfigurable systems is composed on the basis of the LA model, which takes into account the current factors of the under-controlled system. The configuration of the LA model is built to describe dynamically changing the rules for the under-controlled interactions of the system nodes, and it allows nodes to self-adapt to the changes in the environment and thus preserve system resiliency. The flexible behavior of the system nodes is achieved by automatically changing the probabilities between machine states. Experimental evaluation of the effectiveness of the proposed model for security control was carried out; it shows the superiority of the proposed approach over the traditional techniques. The developed solution supports the survivability of the supervised system, reducing the data loss rate and increasing the duration of faultless lifetime.
In further work, we are interested in modifying and adapting the developed method to expand its applications, for example, to cyber-physical technologies such as VANET and FANET.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: