A Continuous Terminal Sliding-Mode Observer-Based Anomaly Detection Approach for Industrial Communication Networks

Dynamic traffic monitoring is a critical part of industrial communication network cybersecurity, which can be used to analyze traffic behavior and identify anomalies. In this paper, industrial networks are modeled by a dynamic fluid-flow model of TCP behavior. The model can be described as a class of systems with unmeasurable states. In the system, anomalies and normal variants are represented by the queuing dynamics of additional traffic flow (ATF) and can be considered as a disturbance. The novel contributions are described as follows: (1) a novel continuous terminal sliding-mode observer (TSMO) is proposed for such systems to estimate the disturbance for traffic monitoring; (2) in TSMO, a novel output injection strategy is proposed using the finite-time stability theory to speed up convergence of the internal dynamics; and (3) a full-order sliding-mode-based mechanism is developed to generate a smooth output injection signal for real-time estimations, which is directly used for anomaly detection. To verify the effectiveness of the proposed approach, the real traffic profiles from the Center for Applied Internet Data Analysis (CAIDA) DDoS attack datasets are used.


Introduction
An industrial network is a communication network that is applied in an industrial environment, e.g., manufacturing, power generation, energy distribution, and transportation, with protocols that provide real-time control and monitoring of industrial systems. With the development of the Industrial Internet of Things (IIoT), a variety of technologies, such as sensors, wireless communications, and computing, have paved the way from local to remote networks for performing remote operations, monitoring, and maintenance through the Internet. This has raised security concerns about the IIoT. On 21 October 2016, attackers used the Mirai IoT botnet to launch high-impact distributed denial of service (DDoS) attacks against the Dyn DNS service, which caused an extended Internet outage [1]. The vulnerability of industrial networks has therefore reinforced the importance of safety and security in protecting industrial systems against cyber threats [2]. To detect and prevent such attacks, researchers have focused on designing traffic monitoring devices, such as firewalls and intrusion detection systems (IDSs), that are placed at different levels of industrial networks [3].
TSMO is proposed to speed up the estimation dynamics of the abnormal traffic: the estimation error converges to a small bounded area within a finite time and then converges to zero asymptotically. For network communication scenarios, the observer must meet two criteria: robustness and a smooth output injection signal. The estimate of the ATF can then be used for anomaly detection. The paper aims at overcoming the following three challenges from the theoretical viewpoint:
1. How to develop an observer for a class of systems where some of the states are unmeasurable.
2. How to increase the convergence speed of the internal dynamics in the observer.
3. How to design a smooth output injection of the observer and apply it directly in the estimation algorithm.
The remainder of the paper is organized as follows. The fluid-flow model of industrial networks is described in Section 3. The sliding-mode observer for the system is proposed in Section 4. In Section 5, the practical traffic replay is carried out to illustrate the effectiveness of the proposed method. Finally, conclusions are given in Section 6.
The objective in the paper is to design an observer for estimating the disturbance δ(t) in (1). An observer is proposed for the system (1) in the form

$$\dot{\hat{x}}(t) = A\hat{x}(t) + A_d\hat{x}(t-\tau) + bu(t) + v(t), \qquad (2)$$

where $\hat{x}(t) = [\hat{x}_1(t), \hat{x}_2(t)]^T \in \mathbb{R}^2$ is the estimate of x(t), and $v(t) = [v_1(t), v_2(t)]^T \in \mathbb{R}^2$ is the output injection of the observer. If the errors between the estimates and the true states are written as $e(t) = \hat{x}(t) - x(t)$, then the error system (3) is obtained from (1) and (2), and the estimate of the disturbance δ(t) follows from (4). The estimation process includes the following two steps:
1. The error system (3) converges to zero asymptotically or in finite time by using the output injection of the observer.
2. Once the error system (3) converges to zero, the disturbance in (1) can be estimated using (4).
The output injection v(t) of the observer in (2) can only utilize the measurable error $e_2$, i.e., $v_1 = v_1(e_2)$ and $v_2 = v_2(e_2)$. The output injection $v_2 = v_2(e_2)$ can be designed to force $e_2$ to converge to zero, although the unmeasurable $e_1$ and the disturbance δ(t) are present in the error system (3). However, in the conventional observer [22], there is no output injection $v_1$ for the internal dynamics of the error system (3). In such a case, the error state $e_1$ converges to zero only asymptotically, by Assumption 1. As a result, the convergence of $e_1$ cannot be affected by the signal $v_2$ and may be very slow. To address this problem, an output injection signal $v_1$ is added to the error system (3) to speed up the convergence of its internal dynamics. When the error system (3) converges to zero, the estimate of the disturbance can be obtained using (4). Because this estimate is taken from the output injection, $v_2(t)$ is required to be smooth, which is a challenge in the design of the SMO.
Two lemmas are stated below and will be used in the proofs of the theorems later.
Lemma 1. Consider a nonlinear system $\dot{x} = f(x)$, where $x \in \mathbb{R}^n$, $f(0) = 0$, and $f(\cdot): \mathbb{R}^n \to \mathbb{R}^n$ is a continuous function. If there exists a continuous positive definite function V(x) such that $\dot{V}(x)$ [...]
To prove the theorems in the paper, the stability of the following form of linear systems with time-varying delay is considered:

$$\dot{x}(t) = Ax(t) + A_d x(t-\tau(t)), \qquad x(t) = \varphi(t), \quad t \in [-\tau_2, 0], \qquad (5)$$

where $x(t) \in \mathbb{R}^n$ is the state, $A$ and $A_d$ are constant matrices with appropriate dimensions, the time delay τ(t) is a time-varying continuous function that satisfies $\tau_1 < \tau(t) < \tau_2$ and $\dot{\tau}(t) \le \mu$, where $\tau_1$, $\tau_2$, and µ are all known positive constants, and the initial condition $\varphi(t) \in \mathbb{R}^n$ is a continuous function of $t \in [-\tau_2, 0]$.
Lemma 2. System (5) is asymptotically stable if there exist matrices $P > 0$; $Q_i > 0$ and $Z_j > 0$, for i = 1, 2, 3 and j = 1, 2; and $N_i$, $M_i$, and $S_i$, i = 1, 2, with appropriate dimensions such that the LMI (6) holds.

Fluid-Flow Model of Industrial Networks
Industrial networks interconnect various industrial control systems (ICSs), e.g., local-area switched networks, such as distributed control systems, and wide-area routed networks, such as SCADA, to support the communication between devices. Most ICSs adopt specialized protocols, such as Open Platform Communications, Modbus, Distributed Network Protocol, Inter-Control Center Protocol, and Profibus. However, these protocols were initially designed for serial communications and must be adapted to operate over TCP/IP networks on standard Ethernet, which is widely deployed in common network infrastructure. For this reason, industrial TCP/IP networks are studied in the paper.
An industrial TCP/IP network consists of multiple hosts and clients in industrial control systems, which are physically connected in any of a number of topologies, including star, tree, and even full mesh. In industrial networks, a star topology is very commonly used to connect end devices [33]. A typical industrial TCP/IP network in a star topology is therefore adopted in this study. In this topology, all nodes (hosts or any other industrial control system peripherals) are connected to an industrial router, and each connected host has a dedicated point-to-point connection to the router. It is assumed that there are N homogeneous sources, i.e., all sources are the same in structure, nature, parameters, and software implementation. They connect to a destination (a host or a client device) through a router, in which two mechanisms are embedded: an Active Queue Management (AQM) scheme and an observer. The AQM regulates the queue length in the router buffer by randomly choosing connections to notify of congestion, so that network utilization can be improved. The observer is used to estimate the traffic flow and to detect abnormal traffic behavior in industrial TCP/IP networks.
To describe the traffic behavior in industrial networks, the fluid-flow model (7) of TCP behavior can be used [9], where w(t) is the average TCP congestion window size in packets. The congestion window ($C_{wnd}$) is a TCP state variable that limits the amount of data that TCP can send into the network before receiving an ACK. q(t) is the expected queue length in packets. w and q are nonnegative and bounded, i.e., $w \in [0, \bar{w}]$ and $q \in [0, \bar{q}]$, where $\bar{w}$ and $\bar{q}$ are known and denote the maximum window size and the buffer size, respectively. τ(t) is the round-trip time in seconds, which induces a time-varying delay in the communication channel. p(t) is the probability of packet loss and takes values in [0, 1]. $T_p$ is the propagation delay in seconds. N and C are the number of TCP sessions and the link bandwidth in packets/second, respectively.
In system (7), δ(t) represents the unmeasurable queuing dynamics of ATF in the network. It includes the modeling errors and anomalies. Both of them are uncertain and perturb the normal TCP/IP network behavior at the router level. In normal working conditions, δ(t) is around a fixed value, which forms a layer near the value; however, when an anomaly intrusion happens, it will suddenly increase.
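As a rough illustration of how the ATF perturbs the queue, the sketch below integrates a delay-free simplification of the fluid-flow model with forward Euler. Everything specific here is an assumption made for illustration: the model form (the window dynamics commonly used in the TCP fluid-flow literature, with queue dynamics dq/dt = N w/τ − C + δ and τ = q/C + T_p), the function name `simulate_queue`, the constant loss probability (no AQM loop), and the step-shaped δ(t). The parameter values are borrowed from the replay setup described later in the paper, assuming the 15 Mb link capacity is per second so that C = 3750 packets/s.

```python
def simulate_queue(delta_step, t_end=30.0, dt=1e-3,
                   n_flows=60, capacity=3750.0, t_prop=0.2,
                   q_ref=175.0, q_max=800.0, t_attack=5.0):
    """Forward-Euler run of a delay-free fluid-flow sketch (assumed model).

        dw/dt = (1 - w^2 * p / 2) / tau
        dq/dt = n_flows * w / tau - capacity + delta(t)
        tau   = q / capacity + t_prop

    Delays inside the window equation are ignored, and the loss
    probability p is frozen at its equilibrium value.
    """
    tau0 = q_ref / capacity + t_prop
    w0 = tau0 * capacity / n_flows      # equilibrium window for delta = 0
    p0 = 2.0 / w0 ** 2                  # equilibrium loss probability
    w, q = w0, q_ref
    for k in range(int(round(t_end / dt))):
        t = k * dt
        delta = delta_step if t >= t_attack else 0.0  # hypothetical ATF step
        tau = q / capacity + t_prop
        dw = (1.0 - 0.5 * w * w * p0) / tau
        dq = n_flows * w / tau - capacity + delta
        w = max(w + dt * dw, 0.0)
        q = min(max(q + dt * dq, 0.0), q_max)         # buffer is finite
    return w, q


w_normal, q_normal = simulate_queue(0.0)     # no additional traffic
w_attack, q_attack = simulate_queue(500.0)   # 500 packets/s of ATF after 5 s
```

Under these assumptions the queue stays at the reference length without ATF and settles at a visibly longer queue once the additional flow is injected, which is the effect the observer is meant to attribute to δ(t).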
The purpose of the paper is to estimate δ(t) only using q(t) in (7). After obtaining the estimate of δ(t), we can detect and further analyze the anomalies.
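The paper does not spell out a decision rule at this point, so the following is only a hypothetical sketch of how an estimate of δ(t) could be turned into an alarm. It assumes that the first `baseline_len` samples are attack-free, and it flags a sample once `min_run` consecutive samples exceed the baseline mean plus `k_sigma` standard deviations. The function name and all three parameters are illustrative choices, not part of the proposed method.

```python
from statistics import mean, pstdev


def detect_anomalies(delta_hat, baseline_len, k_sigma=3.0, min_run=3):
    """Flag samples of an estimated ATF signal that leave the normal band.

    Hypothetical rule: threshold = mean + k_sigma * std of the baseline
    window; a sample is flagged once `min_run` consecutive samples are
    above the threshold (this suppresses isolated spikes).
    """
    baseline = delta_hat[:baseline_len]
    threshold = mean(baseline) + k_sigma * pstdev(baseline)
    flagged, run = [], 0
    for i, value in enumerate(delta_hat):
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            flagged.append(i)
    return threshold, flagged


# Synthetic estimate: quiet baseline, one isolated spike, then a sustained rise.
samples = [10, 11, 9, 10, 10, 11, 9, 10, 10, 50, 10, 60, 60, 60, 60, 60]
threshold, flagged = detect_anomalies(samples, baseline_len=8)
```

The run-length requirement is one simple way to trade detection delay against false alarms; other rules, such as a cumulative-sum test, would fit the same interface.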
The equilibrium point of system (7) is denoted $(w_0, q_0)$, where $w_0$ is the equilibrium window size and $q_0$ is the required queue length set by the AQM. $p_0$ is the equilibrium input value, and $\tau_0$ is the equilibrium round-trip time. They are determined by setting $\dot{w}(t) = 0$ and $\dot{q}(t) = 0$ in (7). The system (7) can be linearized around its equilibrium point. Defining the perturbations about the equilibrium point as ∆w(t) = w(t) − $w_0$ and ∆q(t) = q(t) − $q_0$, the dynamics of the industrial TCP network (7) can be linearized to the model (8), where q(t) and p(t) are available in the router. Software tools such as Netflow, PacketScope, and Loss Measurement Management, which are installed in routers, can monitor and measure p(t) [34]. The congestion window w(t) cannot be used in the AQM or the observer because it is unmeasurable.
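For orientation, the equilibrium relations can be worked out as follows. This is a sketch that assumes the fluid-flow form commonly used in the TCP/AQM literature (written in the comments below) and takes δ = 0 at the equilibrium; the exact form of (7) used by the authors may differ in detail.

```latex
% Assumed model form (an assumption, not quoted from the paper):
%   \dot w(t) = \frac{1}{\tau(t)}
%             - \frac{w(t)\, w(t-\tau(t))}{2\,\tau(t-\tau(t))}\, p(t-\tau(t)),
%   \dot q(t) = \frac{N}{\tau(t)}\, w(t) - C + \delta(t),
%   \tau(t)   = \frac{q(t)}{C} + T_p .
\begin{aligned}
\dot w = 0 &\;\Rightarrow\; w_0^2\, p_0 = 2, \\
\dot q = 0,\ \delta = 0 &\;\Rightarrow\; w_0 = \frac{\tau_0 C}{N}, \\
\tau_0 &= \frac{q_0}{C} + T_p .
\end{aligned}
```

Given the AQM set point $q_0$, these three relations fix $\tau_0$, then $w_0$, then $p_0$, in that order.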
To simplify the design of the observer for the linearized model of the industrial TCP/IP network (8), a state transformation is made first.
The time delay τ(t) in (9) satisfies the following inequality:

$$T_p \le \tau(t) \le \bar{q}/C + T_p, \qquad (10)$$

where $\bar{q}$, C, and $T_p$ are defined in (7). The lower bound of τ(t) in (10) is $T_p$, the propagation delay when there is neither congestion nor queuing delay in the router. The upper bound of τ(t) in (10) is the sum of the propagation delay and the maximum queuing delay under worst-case congestion in the router buffer, i.e., τ(t) cannot exceed $\bar{q}/C + T_p$.
The derivative of τ(t) can be assumed to satisfy

$$\dot{\tau}(t) \le \mu, \qquad (11)$$

where µ is a known positive constant. Conditions (10) and (11) can be justified as follows. Differentiating the last equation in (7) with respect to time t gives (12). The term Nw(t) + δ(t)τ(t) in (12) is the amount of data being transmitted in the TCP/IP network, which is physically constrained by the network capacity, namely $Nw(t) + \delta(t)\tau(t) \le \mathrm{BDP} + \bar{q}$, where $\bar{q}$ is the buffer capacity defined in (7) and BDP is the bandwidth-delay product, which represents the amount of data that can be in transit [35]. The BDP is the product of a data link's capacity C and its round-trip delay time τ(t), i.e., BDP = Cτ(t), where C and τ(t) are defined in (7). Normally, the buffer capacity of a router depends on the BDP, i.e., $\bar{q} = \mu C\tau(t)$, where $\mu = 1/\sqrt{N}$ is a constant [36]. It then follows that $Nw(t) + \delta(t)\tau(t) \le C\tau(t) + \mu C\tau(t)$, and hence condition (11) holds.
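The step from the capacity constraint to the delay-rate bound can be made explicit. The derivation below is a sketch that assumes the queue and delay equations take the form $\dot q = Nw/\tau - C + \delta$ and $\tau = q/C + T_p$, which is consistent with the term Nw(t) + δ(t)τ(t) discussed above.

```latex
\begin{aligned}
\tau(t) = \frac{q(t)}{C} + T_p
  \;&\Rightarrow\;
  \dot\tau(t) = \frac{\dot q(t)}{C}
  = \frac{1}{C}\left(\frac{N w(t)}{\tau(t)} - C + \delta(t)\right)
  = \frac{N w(t) + \delta(t)\,\tau(t)}{C\,\tau(t)} - 1, \\
N w(t) + \delta(t)\,\tau(t) \le (1+\mu)\, C\,\tau(t)
  \;&\Rightarrow\;
  \dot\tau(t) \le \frac{(1+\mu)\, C\,\tau(t)}{C\,\tau(t)} - 1 = \mu .
\end{aligned}
```

The bound therefore depends only on the buffer sizing constant µ and not on the instantaneous window or queue length.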
The state variable x(t) in the linearized model of the TCP/IP network (9) satisfies the following inequality:

$$|x(t)| \le \bar{w}, \qquad (13)$$

where $\bar{w}$ is a known positive constant, the maximum window size defined in (7). In TCP/IP networks, the window size refers to the amount of data that a host is currently willing to send. Normally, the maximum window size at a host is configured as a constant, i.e., $\bar{w}$ is set to 65,535 (0xFFFF) bytes [37]. As seen in (8) and (9), x(t) is the perturbation of w(t) around the equilibrium point, and w(t) is limited to the known constant maximum window size. Since x(t) = ∆w(t), |x(t)| cannot exceed the maximum value of w(t), i.e., the inequality (13) holds.
The aforementioned amount of data being transmitted in the TCP/IP network in (12) includes the traffic flow of all N TCP sessions, Nw(t), as well as the dynamics of the ATF, δ(t)τ(t). It is physically constrained by the TCP/IP network capacity, namely $Nw(t) + \delta(t)\tau(t) \le \mathrm{BDP} + \bar{q}$. As δ(t) is physically limited by the router communication capacity, its rate of change is always constrained by $\dot{\delta}(t) \le d_m/T$, where T is the sampling period and is kept constant at 1/C [9]. Hence, the rate of change of δ(t) is bounded by a known positive constant $d_1$. Summarizing the analysis above gives the bounds (14), where both $d_m$ and $d_1$ are known positive constants.
The block diagram of the AQM and observer in a router is shown in Figure 1. The AQM is utilized to control the queue length q(t) to a required value by regulating the probability of packet loss p(t). The inputs of the observer, i.e., q(t) and p(t), are measurable. The output of the observer is the estimate of δ(t). The paper aims to design an observer for estimating the dynamics of the ATF in real time and for further detecting anomalies in industrial networks.

Figure 1. Block diagram of the AQM and observer in a router (blocks: controller with reference $q_{ref}$, AQM, TCP dynamics, observer; signals: traffic flow, estimation of δ(t)).

Design of the TSM Observer
In the fluid-flow model of TCP/IP networks in (9), the ATF dynamics δ(t) can be considered as a disturbance. The estimate of δ(t) can be used for anomaly detection. To estimate δ(t), an observer is proposed as (15), where $\hat{x}(t)$ and $\hat{y}(t)$ represent the estimates of the system state x(t) and output y(t), respectively, and $v_1(t)$ and $v_2(t)$ are the output injection signals of the observer.
Define $\xi(t) = [\xi_1(t), \xi_2(t)]^T$ as the errors between the system states and their estimates. The error system (16) can then be obtained from (9) and (15). It should be noted that the state $\xi_2$ in the error system (16) is measurable and can be used in the design of the output injection. However, the state $\xi_1$ is unmeasurable and cannot be used in the design of the output injection, i.e., $v_1$ and $v_2$ in (16) can depend only on $\xi_2$.

Proof. Using (17), the manifold (18) can be rewritten in terms of the error states, and substituting (19) and (20) into the result gives (22). Differentiating s(t) in (22) with respect to time t along the measurable error subsystem (17), and further substituting (21), yields an expression for $\dot{s}(t)$. Introduce the candidate Lyapunov function $V_1(t) = 0.5 s^2(t)$. Taking the derivative of $V_1(t)$ along the trajectories of (16) and using this expression together with the conditions (13) and (14) gives a bound on $\dot{V}_1(t)$, from which it can be seen that the measurable error subsystem (17) reaches s(t) = 0 within the finite time $t_r \le |s(0)|/\eta_2$; in other words, s(t) = 0 for all $t \ge t_r$. Once the ideal sliding mode s(t) = 0 is established, the measurable error subsystem (17) remains on s(t) = 0 thereafter and behaves identically to

$$\dot{\xi}_2(t) = -\alpha\,\xi_2(t) - \beta\,\xi_2^{\varphi/\rho}(t),$$

which converges to zero along s(t) = 0 in the finite time $t_s$.
Theorem 1 yields a method of designing the output injection in (17) by using only the measurable $\xi_2(t)$, which forces $\xi_2(t)$ to converge to zero in finite time even though the unmeasurable $\xi_1(t)$ and the unknown disturbance δ(t) are present in (17).
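The finite-time behaviour of the reduced dynamics on the manifold can be checked numerically. The sketch below assumes that φ and ρ are odd positive integers with φ < ρ, so that the fractional power can be written as sign(x)|x|^(φ/ρ); under that assumption, the substitution y = |x|^((ρ−φ)/ρ) turns the scalar equation into a linear one and gives the closed-form settling time used in `settling_time`. The function names and the numerical values of α, β, φ, and ρ are illustrative and are not taken from the paper.

```python
import math


def settling_time(x0, alpha, beta, phi, rho):
    """Time for dx/dt = -alpha*x - beta*sign(x)*|x|^(phi/rho) to reach zero.

    Assumes alpha > 0, beta > 0 and odd integers phi < rho.
    """
    gamma = (rho - phi) / rho
    return math.log((alpha * abs(x0) ** gamma + beta) / beta) / (alpha * gamma)


def simulate_reduced_dynamics(x0, alpha, beta, phi, rho, t_end, dt=1e-4):
    """Forward-Euler integration of the reduced-order sliding dynamics."""
    x = x0
    for _ in range(int(round(t_end / dt))):
        fractional = math.copysign(abs(x) ** (phi / rho), x)
        x += dt * (-alpha * x - beta * fractional)
    return x


t_s = settling_time(1.0, alpha=1.0, beta=1.0, phi=3, rho=5)
x_terminal = simulate_reduced_dynamics(1.0, 1.0, 1.0, 3, 5, t_end=t_s)
x_linear = simulate_reduced_dynamics(1.0, 1.0, 0.0, 3, 5, t_end=t_s)  # beta = 0
```

With the fractional-power term the state is numerically at zero by the predicted time, whereas the purely linear dynamics (β = 0) have only decayed exponentially by then, which is the motivation for the terminal term.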

Unmeasurable Error Subsystem
For the unmeasurable error subsystem in (16), namely (23), define an area Γ for the unmeasurable $\xi_1$ near zero as in (24), where ϕ is a positive constant defined as $\phi = a_{21}^{-1} d_m + \varepsilon$, $d_m$ is defined in (14), and ε is a positive constant that can be chosen such that $0 < \varepsilon < a_{21}^{-1} d_m / 2$. The purpose of introducing the area Γ is to design an output injection strategy, given in the following theorem, that increases the convergence speed of the error $\xi_1$ when it is outside Γ.
Theorem 2. The unmeasurable error subsystem (23) will converge to zero asymptotically if the output injection is given by (25), where $k_1 = a_{11}\bar{w} + \eta_1$, $\bar{w}$ is a constant defined in (7), and $\eta_1 > 0$ is a constant.
Proof. The error state space of $\xi_1$ can be divided into two different areas, $\Gamma_o$ and Γ, defined in (24). Accordingly, two cases, Case 1 and Case 2, are considered.
Case 1: the error state $\xi_1$ is in area $\Gamma_o$. The measurable error subsystem (17) moves toward the sliding manifold s = 0 under the output injection (19)-(21). When the measurable error subsystem reaches and stays on the sliding manifold, s(t) = 0, under the output injection in Theorem 1, a relation follows from (22), and combining it with (26) gives (28). As $\xi_1$ is in area $\Gamma_o$, the inequality $\xi_1 - a_{21}^{-1}\delta > \phi$ holds. According to (28) and the above inequality, $\xi_1(t) > \phi$. So, the output injection (25) can be rewritten as $v_1(t) = -k_1\,\mathrm{sgn}(\xi_1(t))$.
When the system states $\xi_1$, $\xi_2$ are in $\Omega_1$, the boxcar function $\Pi_{-\sigma,\sigma}(s) = 0$, and $v_1(t)$ in (25) is then equal to zero, which means that the measurable error subsystem (17) has not yet reached the sliding manifold s(t) = 0. In this case, the output injection (25) is not applied to the unmeasurable error subsystem (23).
The output injection strategies (19)-(21) in Theorem 1 drive the error subsystem (17) toward the sliding manifold s = 0 and keep it on the manifold thereafter, which guarantees that the system states $\xi_1$, $\xi_2$ converge into the area $\Omega_2$ in finite time. Then, the unmeasurable error system (23) will converge to zero asymptotically.
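The role of the boxcar function as a gate can be sketched in a few lines. This is only an illustration of the gating idea under stated assumptions: `boxcar` implements the usual rectangular function that equals one on (−σ, σ) and zero elsewhere, while `gated_injection` is a hypothetical stand-in in which the argument `direction` replaces whatever quantity the signum function in (25) actually acts on.

```python
def boxcar(s, sigma):
    """Rectangular (boxcar) function: 1 inside (-sigma, sigma), 0 elsewhere."""
    return 1.0 if -sigma < s < sigma else 0.0


def gated_injection(s, sigma, k1, direction):
    """Hypothetical gated switching term, active only near the manifold.

    `direction` is a placeholder for the sign information used by the
    observer; the exact argument of the signum function is not assumed.
    """
    sign = (direction > 0) - (direction < 0)
    return -k1 * sign * boxcar(s, sigma)


near_manifold = gated_injection(s=0.05, sigma=0.1, k1=2.0, direction=3.0)
far_from_manifold = gated_injection(s=0.5, sigma=0.1, k1=2.0, direction=3.0)
```

The gate keeps the additional injection switched off while the measurable subsystem is still in its reaching phase and enables it only once s is within the detectable band.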
In the ideal condition, σ = 0, i.e., the ideal sliding mode s = 0 can be detected. However, in practical environments, detecting the ideal s = 0 is not possible, and only an area near zero, |s| < σ, can be detected. In this case, substituting (19) and (20) into (18) gives $a_{21}\xi_1(t) - \delta(t) = a_{21}\hat{x}(t) - v_{2n}(t) + s(t)$, where |s(t)| < σ. Hence, σ can be chosen as $\sigma = \kappa\, a_{21}\hat{\xi}_1(t)$, where κ is between 0.02 and 0.05. It should be noted that σ affects only the convergence speed during the transient and does not affect the final observation.
Theorem 3. If the two output injection signals in the error system (16) are designed using Theorems 1 and 2, respectively, then the estimation errors satisfy $\lim_{t \to (t_r + t_s)} \xi_2(t) = 0$ and $\lim_{t \to \infty} \xi_1(t) = 0$. Then, the ATF dynamics δ(t) in (9) can be estimated by (36), where $v_2(t)$ is designed in (19).
Proof. Based on Theorem 1, the measurable error subsystem (17) under the output injection (19) reaches the sliding manifold s(t) = 0 in the finite time $t_r$ and remains on s(t) = 0 thereafter. The measurable error $\xi_2(t)$ then converges to zero in finite time along s(t) = 0. The resulting relation on the sliding manifold then follows from (17). From Theorem 2, the unmeasurable error state $\xi_1(t)$ under the output injection (25) converges to zero asymptotically. From (36), the ATF dynamics δ(t) can be estimated directly by the smooth $v_2(t)$ in (19) once the unmeasurable error state $\xi_1(t)$ has converged to zero. This completes the proof.

Real Traffic Replay Results
Real traffic replay results are given to verify the effectiveness of the proposed TSMO method in real time.

Real Traffic Replay Setup
For experimental purposes, we used the real traffic dataset from CAIDA, which is governed by the Regents of the University of California and located at the University of California San Diego (UCSD) [40].
In the paper, the CAIDA "DDoS Attack 2007" dataset is used to test the proposed method. This dataset contains approximately one hour of anonymized traffic traces from a DDoS attack on 4 August 2007 (20:50:08 UTC to 21:56:16 UTC). The DDoS attack attempts to block access to the targeted server by consuming computing resources on the server and by consuming all of the bandwidth of the network connecting the server to the Internet. The 1-h trace is split up into 5-min pcap files, where pcap is an application programming interface for capturing network traffic. The total uncompressed size of the dataset is 21 GB. The traces include only attack traffic to the victim and responses to the attack from the victim. The non-attack traffic in the traces has been removed as much as possible. Traces in this dataset are anonymized using CryptoPAn prefix-preserving anonymization with a single key. The payload has been removed from all packets. These traces can be read with any software that reads the packet capture (pcap) format, including the CoralReef Software Suite, Tcpdump, Wireshark, and many others. The details of the traffic features are shown in Table 1. In this experiment, the real-time DDoS attack scenarios for the CAIDA datasets are considered. This collection groups the backscatter datasets, which were created from the massive amount of data continuously collected from the UCSD Network Telescope.
To study the network traffic behavior, a network simulator is used to set up the network environment. It is a discrete event-based network simulator for networking research, which contains the necessary features, e.g., a traffic trace generator, to replay real traffic trace profiles. A typical star topology of the TCP/IP network, consisting of a number of hosts and clients with one network gateway, is considered in the study. N source agents and N destination agents are created to represent the hosts and clients in the network, respectively, where N = 60. The 'newreno tcp' agents are used for the sources with 'ftp' connections to generate long-lived TCP flows to the destination clients. The maximum value of $C_{wnd}$ in each 'tcp' agent is set to the same value, 0.12 Mb. The link capacity C of the network gateway router is set to 15 Mb. Moreover, the packet size is set to 500 bytes. The connections between each host/client and the router are set to 'full-duplex', which constructs bi-directional links with propagation delay $T_p$ = 200 ms. The proportional integral (PI) AQM mechanism is applied to regulate the queue length (QL) at a desired value of $q_0$ = 175 packets in the router buffer [41]. The capacity of the router buffer, $\bar{q}$, is set to 800 packets. A traffic trace generates payload bursts according to the given trace file of the DDoS attack profile from the CAIDA dataset. In the network simulator, the traffic trace is implemented by using the C++ class 'TrafficTrace', which is bound to the specified real DDoS attack traffic trace file in the OTcl domain.
A hundred distributed attackers are created and attached to the real traffic trace files from the CAIDA datasets. In the paper, an increasing rate attack profile of the CAIDA DDoS 2007 datasets is used to test the proposed method. This DDoS attack lasts for a period of five minutes.
The parameters in the linearized TCP/IP network model (9) are: $a_{11}$ = 0.2630, $a_{12}$ = 0.0044, $b_d$ = 481.7708, $a_{21}$ = 243.2432, and $a_{22}$ = 4.0541. Figures 2 and 5 show the traffic dynamics of the QL captured at the router, which include the normal traffic flows and the DDoS attack profiles. By simple observation of these QL dynamics, the anomalies in the traffic cannot be identified and detected in real time. By contrast, the TSMO-based real-time network traffic monitoring (NTM) scheme, which is implemented at the router, is capable of extracting the TCP traffic flows from the total traffic dynamics in the buffer and estimating the dynamics of the ATF for anomaly detection.
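The coefficient values quoted above can be cross-checked against the replay setup. The sketch below assumes the standard linearization of the TCP fluid-flow model about its equilibrium, as found in the TCP/AQM literature, and assumes that the link capacity of 15 Mb is meant per second, so that with 500-byte packets C = 3750 packets/s. The function name and dictionary keys are illustrative.

```python
def linearized_coefficients(n_flows, capacity_pps, q_ref, t_prop):
    """Coefficients of the linearized fluid-flow model (assumed standard form).

        tau0 = q_ref / C + T_p
        a11 = N / (tau0^2 C)    a12 = 1 / (tau0^2 C)
        a21 = N / tau0          a22 = 1 / tau0
        b_d = tau0 C^2 / (2 N^2)
    """
    tau0 = q_ref / capacity_pps + t_prop
    return {
        "tau0": tau0,
        "a11": n_flows / (tau0 ** 2 * capacity_pps),
        "a12": 1.0 / (tau0 ** 2 * capacity_pps),
        "a21": n_flows / tau0,
        "a22": 1.0 / tau0,
        "b_d": tau0 * capacity_pps ** 2 / (2.0 * n_flows ** 2),
    }


# Assumed unit conversion: 15 Mbit/s link, 500-byte packets -> packets/s.
capacity_pps = 15e6 / (8 * 500)
coeffs = linearized_coefficients(n_flows=60, capacity_pps=capacity_pps,
                                 q_ref=175.0, t_prop=0.2)
```

Under these assumptions the computed coefficients agree with the quoted values to four decimal places, which suggests that the model parameters were derived from N = 60, $q_0$ = 175 packets, and $T_p$ = 200 ms in this way.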

Real Traffic Replay Results and Discussion
According to Theorem 1, the measurable error subsystem (17) reaches the predesigned manifold (18), i.e., s(t) = 0, within the finite time $t_r$. The estimation error $\xi_2$ of the QL is then governed by the output injection (19) and converges to zero in the finite time $t_s$ along s(t) = 0.
In addition to forcing the estimation error $\xi_2$ to zero in finite time, the other aim is to speed up the convergence of the internal dynamics of the error system (16) so that the estimation is precise enough to meet the real-time criteria. By Theorem 2, the internal dynamics, i.e., the estimation error $\xi_1$, is forced into the defined area (24) in finite time and then converges to zero asymptotically. As presented in Figures 3 and 6, the congestion window is accurately estimated, which reflects the serious degradation in sending rate, throughput, and bandwidth utilization in the network when the DDoS attacks start in the scenario. By Theorem 3, the dynamics of the ATF, i.e., δ(t), which is represented by the increasing rate attack profile and the subgroup attack profile from the CAIDA datasets, is estimated quickly and accurately. The estimated dynamics of the DDoS rate are depicted in Figures 4 and 7.
As the experimental results in Figures 2 to 7 illustrate, the proposed TSMO-based NTM tracks the real traffic trace profile well for anomaly detection and retains the main features of SMC systems. These real traffic replay results demonstrate the effectiveness and efficiency of the proposed TSMO algorithm for real-time monitoring under real traffic profiles.

Comparative Studies
Four different observer algorithms are evaluated in the real traffic replay tests.

The Luenberger Observer (LO)
The output injection strategies of the LO can be designed as in [12], where $\hat{y}^{lo}(t)$ is the estimate of y(t) in (9), $v^{lo}_1(t)$ and $v^{lo}_2(t)$ are the output injection signals of the observer, and $L^{lo}_1$ and $L^{lo}_2$ are the gains of the output injection.
Because the signum function causes high-frequency switching in $v^{csmo}_2(t)$, a low-pass filter is needed to extract the equivalent signal.
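The filtering step can be illustrated with a first-order low-pass filter applied to a synthetic switching signal. This is a generic sketch, not the filter used in the comparison: the function name, the time constant, and the test signal (a ±1 square wave whose duty cycle gives an average of 0.3) are all assumptions chosen for illustration.

```python
def low_pass(signal, dt, time_constant):
    """First-order low-pass filter T*dy/dt + y = u, forward-Euler discretized."""
    y, filtered = 0.0, []
    gain = dt / time_constant
    for u in signal:
        y += gain * (u - y)
        filtered.append(y)
    return filtered


# Synthetic switching signal: +1 for 13 samples, -1 for 7 samples, repeated.
# Its average value (the "equivalent signal") is (13 - 7) / 20 = 0.3.
period = [1.0] * 13 + [-1.0] * 7
switching = period * 1000                    # 20,000 samples
smoothed = low_pass(switching, dt=1e-4, time_constant=0.1)
equivalent = smoothed[-1]
```

The filter recovers the average of the switching signal, but only after a delay of several time constants; a larger time constant reduces the residual ripple at the cost of more lag, which is the trade-off that a smooth output injection avoids.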

The Terminal Sliding Mode Observer (TSMO)
The output injection strategies of the proposed TSMO are designed using Theorems 1 and 2.
To make a fair comparison, the parameters of the four observer schemes are tested repeatedly to obtain the best parameters for each. In this tuning, a tradeoff is made between the dynamic performance and the steady-state performance of the closed-loop error system. Under this condition, the convergence speed and the steady-state performance of the observers are compared with each other.
To compare the four observer algorithms quantitatively in terms of the steady-state performance of the closed-loop error systems, Table 2 provides the average displacement error (ADE) and the standard deviation of displacement error (SDE) in the scenario. According to the comparative results in Table 2, the proposed TSMO features the fastest dynamical response and the best steady-state accuracy in estimating w(t) and δ(t) compared with the other three observers.
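The two metrics can be computed as follows. The paper does not give their formulas here, so this sketch assumes that the displacement error is the absolute difference between an estimate and the reference value at each sample, that the ADE is its mean, and that the SDE is its population standard deviation. The function name is illustrative.

```python
from statistics import mean, pstdev


def displacement_errors(estimates, references):
    """Return (ADE, SDE) under the assumed definitions.

    ADE: mean of |estimate - reference| over all samples.
    SDE: population standard deviation of those absolute errors.
    """
    errors = [abs(e - r) for e, r in zip(estimates, references)]
    return mean(errors), pstdev(errors)


ade, sde = displacement_errors([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

Evaluating the metrics only over a steady-state window, after the transient has died out, would match the stated purpose of comparing steady-state performance.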

Conclusions
This paper has proposed an SMO-based network traffic monitoring approach to estimate the ATF dynamics. The main contributions of this work can be summarized as follows: (i) One output injection of the observer is specially designed to be smooth using the full-order SMC technique. It can be used directly for the estimation of traffic flows in real time and does not need any low-pass filter. (ii) A novel strategy for the other output injection of the observer is proposed to increase the convergence speed of the internal dynamics of the observer, which can improve the speed of the estimation algorithm. (iii) The proposed TSMO can be used for a class of linear systems with time-varying delay where some system states are unmeasurable. The parameters of the proposed observer need to be set carefully. The experimental results have verified the efficiency of the proposed TSMO through comparative studies on real traffic profiles from the CAIDA DDoS attack datasets. Future work will focus on anomaly detection applications in multiple-area communication networks.