LT-FS-ID: Log-Transformed Feature Learning and Feature-Scaling-Based Machine Learning Algorithms to Predict the k-Barriers for Intrusion Detection Using Wireless Sensor Network

The dramatic increase in the computational facilities integrated with the explainable machine learning algorithms allows us to do fast intrusion detection and prevention at border areas using Wireless Sensor Networks (WSNs). This study proposed a novel approach to accurately predict the number of barriers required for fast intrusion detection and prevention. To do so, we extracted four features through Monte Carlo simulation: area of the Region of Interest (RoI), sensing range of the sensors, transmission range of the sensor, and the number of sensors. We evaluated feature importance and feature sensitivity to measure the relevancy and riskiness of the selected features. We applied log transformation and feature scaling on the feature set and trained the tuned Support Vector Regression (SVR) model (i.e., LT-FS-SVR model). We found that the model accurately predicts the number of barriers with a correlation coefficient (R) = 0.98, Root Mean Square Error (RMSE) = 6.47, and bias = 12.35. For a fair evaluation, we compared the performance of the proposed approach with the benchmark algorithms, namely, Gaussian Process Regression (GPR), Generalised Regression Neural Network (GRNN), Artificial Neural Network (ANN), and Random Forest (RF). We found that the proposed model outperforms all the benchmark algorithms.


Introduction
These days, security is one of the primary concerns for every nation because of the highly unpredictable and harmful events taking place across the globe [1][2][3]. Every nation wants to secure and protect its borders from any kind of intrusion and attack by enemy forces. In addition, unauthorised and illegal entry is another vital matter that requires immediate attention from the concerned authorities [4]. To protect their international borders from enemies and unfriendly forces, several nations maintain regular armies. Soldiers patrol along the border stretches, but patrolling methods are conventional, periodic, and limited. Since a country may have international boundaries that are thousands of miles long, it is impossible to deploy soldiers at every single location. Consequently, a large area along the international borders remains unguarded. Enemies may take advantage of these unguarded locations and enter the territories. They may steal classified documents crucial to the security of a nation, decimate defence personnel, or demolish crucial infrastructure. Surveillance and monitoring along the international borders and checkpoints can be achieved with the help of WSNs.
WSNs are a widely accepted and well-established technology because they are cheap, readily available, and can be installed on the fly in almost no time at any place [5,6]. In addition, WSNs consist of small and homogeneous sensors that work in a decentralised fashion, requiring no pre-installed infrastructure and communicating over wireless channels [7]. Therefore, WSNs are employed for many civilian and military applications such as precision agriculture, health monitoring, structural health monitoring, industrial monitoring, disaster management, rescue operations, wild animal monitoring, landslide monitoring, fire detection, monitoring and surveillance in border areas, and many more [8][9][10][11]. Furthermore, intrusion detection in border areas and unauthorised access detection in restricted areas and infrastructures are pivotal applications of WSNs. For example, a WSN can be deployed to form a sensor barrier for any possible intrusion path, as shown in Figure 1. The studies conducted so far on intrusion detection can be divided into two categories. In the first, intrusion detection is described as a monitoring or surveillance system that detects an invader or an unauthorised entry in the RoI. In the second, it is assumed to be a component of a WSN system specifically designed and implemented to diagnose compromised and/or vulnerable sensors, avoiding false alarms and ensuring correct network behaviour [12]. In this work, we concentrate on the first category. The work presented in [13] proposed a fusion algorithm with three levels of hierarchy to spot a passive mobile intruder. The authors employed two crucial modalities, namely the sensing probability model and the acoustic signal model, to ascertain the presence of an invader. In addition, they analysed the influence of the number of sensors and the intruder speed on the probability of detection, detection accuracy, and false alarm rate, and found that the proposed algorithm outperforms the other fusion algorithms.
Another work presented in [14] proposed optimal trajectories for mobile sensors employed for intrusion detection in a given RoI. The proposed trajectories maximise the coverage area and reduce energy consumption, which increases the lifetime of the sensor network and thus improves intrusion detection performance. A distributed border surveillance system is proposed in [15], where the performance of the system is estimated in terms of the number of barriers obtained for a possible intrusion path in shadowed and non-shadowed environmental conditions. The authors found that the number of barriers obtained for shadowed environmental conditions is greater than that for non-shadowed environmental conditions. Similarly, the work in [16] proposed a smart border surveillance system that uses ultrasonic, passive infra-red, and camera sensors to detect the presence of an intruder. The proposed system is capable of distinguishing between animals and human beings. The system sends an alert message and video streams to the control system as soon as it identifies an intruder. In Ref. [17], the authors proposed a border surveillance system architecture that renders high energy efficiency and load balancing capabilities, thus increasing the network lifetime. Furthermore, the proposed methodology needs less maintenance, involves low-cost installation, and delivers enhanced reliability. The authors claim that the proposed system outperforms other available intrusion detection systems and has an enhanced network lifetime. Another work provided in [18] presented an analytical model to detect a mobile intruder using mobile sensor networks. The authors obtained an analytical formula to calculate the k-barrier coverage probability for an invader trying to cross a rectangular-belt region following a given path. They also investigated the effect of network parameters such as sensor-to-intruder velocity ratio, sensing range, sensor count, and intrusion path angle on the performance metric. The proposed model is very effective in detecting an intrusion and tracking the enemy movements. Most recently, the authors in [19] proposed a remote surveillance system using robots with CCTV cameras. The authors claim that the proposed work will be useful for border surveillance and internal monitoring.
It is important to mention that the above-discussed works [13][14][15][16][17][18][19] contribute significantly to the research domain. However, their models are validated through Monte Carlo simulation, which incurs a very high computational cost and time. For instance, it requires approximately 15 hours to achieve a single outcome through simulation runs at a given value of parameters. In addition, the simulation time increases exponentially with the increase in the number of sensors, sensing range, and other network parameters. This is because WSNs produce a large volume of data that requires plenty of time for processing and analysis. Applications like infiltration detection in border regions are time-sensitive because a delay of seconds may cause catastrophes. Thus, it is vital to detect any kind of intrusion along the borders and around the prohibited regions as quickly as possible.
The problem at hand can be resolved by employing machine learning approaches, which are computationally efficient [20,21]. For instance, the work presented in [22] provided a mathematical framework to evaluate the k-barrier coverage probability for a given intrusion path using mobile WSNs. To overcome the computational and time complexity problem, the authors proposed three machine learning models based on the GPR algorithm to predict the k-barrier coverage probability. In doing so, they considered sensing range, the number of sensors, sensor-to-intruder velocity ratio, mobile-to-static sensor ratio, required value of k, and intrusion path angle as potential features. The proposed machine learning model can predict the k-barrier coverage probability with higher accuracy than the other benchmark algorithms.
In this study, we propose an efficacious machine learning-based approach to accurately predict the number of barriers for fast intrusion detection and prevention using relevant features. We extracted relevant features (i.e., the area of the RoI, sensing and transmission range of the sensor, and the total number of sensors) synthetically through Monte Carlo simulations. Subsequently, we applied feature transformation and scaling operations and trained an SVR model. We assess the performance of the trained model by using R, RMSE, bias, and computational time complexity as the performance metrics. The main contributions of this paper are as follows:
• We introduced a synthetic data generation framework for a cost-effective solution.
• We estimated the relative importance score of each feature by using the regression tree ensemble approach.
• We performed the sensitivity analysis of the features using Partial Dependency Plot (PDP) analysis.
• We proposed a novel algorithm based on log-transformed feature learning and feature scaling to accurately predict the number of barriers for fast intrusion detection and prevention. We also performed a sensitivity analysis of the proposed algorithm.

Preparation of the Datasets
The performance of any machine learning model depends on the quality of datasets on which it is trained [23]. These datasets can either be field derived (obtained by direct measurements) or generated synthetically (obtained through simple rules, statistical modelling, and simulations) [24]. The use of synthetic data is increasing exponentially in the domain of healthcare [25,26], WSNs [22,27], and data privacy [28].
In this study, we extracted the datasets synthetically through simulations. To do so, we consider a finite number of sensors (N), distributed uniformly and randomly in a rectangular RoI. Each sensor is assumed to be homogeneous, i.e., sensing, transmission, and computational capabilities are identical for each sensor. The dimensions of the network deployment RoI are varied from 100 × 50 m² to 250 × 200 m². The entire dataset used for training and testing purposes is obtained through simulations using network simulator NS-2.35. The complete procedure for simulation outcomes is explained below.
Any two arbitrary sensors in the deployed WSN can communicate with each other if they satisfy the condition $R_{tx} \geq 2R_s$, where $R_{tx}$ and $R_s$ indicate the transmission and sensing range of the sensors, respectively. Here, we have considered the most widely employed sensing range model, known as the Binary Sensing Model (BSM), to estimate the performance of WSNs. According to BSM [29], a sensor $S_i$ can detect a target with probability equal to one if the target falls within its sensing range $R_s$. Otherwise, the target detection probability will be equal to zero. Mathematically, it can be represented by Equation (1):
$$p(S_i, P) = \begin{cases} 1, & d(S_i, P) \leq R_s \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where $p(S_i, P)$ is the probability that sensor $S_i$ detects a target at point $P$, and $d(S_i, P)$ represents the Euclidean distance between the sensor $S_i$ and target point $P$.
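The Binary Sensing Model above can be illustrated with a minimal sketch. The function names and the coordinates are illustrative assumptions, not taken from the simulation code used in this study.

```python
import math

def detection_probability(sensor, target, sensing_range):
    """Binary Sensing Model: 1 if the target is within the sensing range, else 0."""
    return 1 if math.dist(sensor, target) <= sensing_range else 0

def is_detected(sensors, target, sensing_range):
    """A target counts as detected if at least one sensor covers it."""
    return any(detection_probability(s, target, sensing_range) for s in sensors)

# Illustrative sensor coordinates in metres (hypothetical values).
sensors = [(10.0, 10.0), (40.0, 25.0)]
print(detection_probability(sensors[0], (13.0, 14.0), 5.0))  # target is exactly 5 m away
print(is_detected(sensors, (90.0, 45.0), 5.0))               # target far from both sensors
```

The boundary case (distance equal to the sensing range) is treated as detected here, which is one possible reading of "falls within the sensing range".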
To identify the existence of intruders, a barrier is formed by connecting a sensor cluster over the entire RoI. To detect an intruder successfully, there should be at least one barrier for each possible intrusion path to ensure barrier coverage. The total number of sensors required to achieve the desired k-barrier coverage can be computed by $k = L/(2R_s)$ [1], and the maximum number of Barrier Paths ($BP_{max}$) that can be constructed for a given intrusion path is computed as $BP_{max} = N/k$, where $L$ indicates the length of the rectangular RoI. The k-coverage ensures that each point in the target RoI is monitored by k distinct sensors, where k is a positive integer having a value greater than one. Table 1 shows different network parameters and their values used to get the simulation results.
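As a worked illustration of the two expressions above, the following sketch computes k and the maximum number of barrier paths for one parameter setting. Rounding k up and the barrier count down to whole numbers is an assumption made here so that the results are integers; the input values are hypothetical.

```python
import math

def sensors_per_barrier(length_m, sensing_range_m):
    """k = L / (2 * R_s); rounding up to a whole sensor is an assumption."""
    return math.ceil(length_m / (2 * sensing_range_m))

def max_barrier_paths(num_sensors, length_m, sensing_range_m):
    """BP_max = N / k; rounding down to a whole barrier is an assumption."""
    k = sensors_per_barrier(length_m, sensing_range_m)
    return num_sensors // k

# Hypothetical setting: L = 100 m, R_s = 10 m, N = 200 sensors.
print(sensors_per_barrier(100.0, 10.0))     # 100 / 20 = 5 sensors per barrier
print(max_barrier_paths(200, 100.0, 10.0))  # 200 / 5 = 40 barrier paths at most
```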

Calculation of Feature Importance and Sensitivity
To calculate each feature's relative importance score, we created a regression ensemble through boosting ensemble learning. We leveraged the LSBoost (Least Squares gradient Boosting) algorithm to boost one hundred regression trees, each having a unity learning rate [22,30]. This algorithm treats each decision tree as a weak learner and processes the trees sequentially by identifying their weak points. Each new weak learner concentrates on the weak aspects of the previous learner. In this way, the algorithm iteratively forms an ensemble of weak learners. Once the ensemble was generated, we calculated the feature importance by summing the total change in the normalised node risk.
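The idea behind boosting-based feature importance can be sketched in a few lines. This is a simplified stand-in, not the LSBoost implementation used in this study: it boosts single-split trees (stumps) on the residuals and credits each feature with the mean-squared-error reduction of the splits made on it, which is analogous to summing changes in node risk. The data and all function names are hypothetical.

```python
import random

def best_stump(X, residual):
    """Return (gain, feature, threshold, left_mean, right_mean) of the best single split."""
    n = len(X)
    total = sum(residual)
    best = None
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(n - 1):
            i, nxt = order[pos], order[pos + 1]
            left_sum += residual[i]
            if X[i][j] == X[nxt][j]:
                continue  # cannot split between two equal feature values
            n_left, n_right = pos + 1, n - pos - 1
            right_sum = total - left_sum
            # Reduction in the sum of squared errors achieved by this split.
            gain = left_sum ** 2 / n_left + right_sum ** 2 / n_right - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, X[i][j], left_sum / n_left, right_sum / n_right)
    return best

def boosted_importance(X, y, n_rounds=100, learning_rate=1.0):
    """Least-squares boosting of stumps; importance = summed MSE reduction per feature."""
    n = len(y)
    pred = [sum(y) / n] * n
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residual = [yi - pi for yi, pi in zip(y, pred)]
        stump = best_stump(X, residual)
        if stump is None:
            break
        gain, j, thr, left_mean, right_mean = stump
        importance[j] += gain / n
        pred = [p + learning_rate * (left_mean if row[j] <= thr else right_mean)
                for p, row in zip(pred, X)]
    return importance

# Hypothetical data: the response depends strongly on feature 0, weakly on feature 1.
rng = random.Random(0)
X = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(200)]
y = [10 * a + 0.1 * b for a, b in X]
scores = boosted_importance(X, y, n_rounds=20)
print(scores)
```

On this toy data the first feature should receive the larger score, mirroring how a bar chart of importance scores would rank the features.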
Further, we performed the Partial Dependency Plot (PDP) analysis to assess the impact of each individual feature on the predictand. It computes the partial dependency of the considered feature set on the predictand by marginalising the impact of remaining features [27,30]. We considered a set of two features and computed their partial dependency on the predictand. For a set of four features, we have a total of six pairs of features. We plotted the 2D and 3D variation profiles.
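The marginalisation step behind a partial dependency curve can be sketched as follows. This simplified version varies a single feature rather than a pair, and `toy_model` with its coefficients is a hypothetical stand-in for a trained ensemble.

```python
def partial_dependence(model, X, feature, grid):
    """Average the model output over the data with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # force the feature of interest to the grid value
            total += model(modified)
        curve.append(total / len(X))   # the remaining features are averaged out
    return curve

def toy_model(row):
    """Hypothetical predictor: rises with sensing range, falls with area."""
    area, sensing_range = row
    return 0.5 * sensing_range - 0.001 * area

# Hypothetical samples: [area in m^2, sensing range in m].
X = [[5000.0, 15.0], [20000.0, 25.0], [50000.0, 40.0]]
pd_range = partial_dependence(toy_model, X, feature=1, grid=[15.0, 25.0, 40.0])
pd_area = partial_dependence(toy_model, X, feature=0, grid=[5000.0, 50000.0])
print(pd_range, pd_area)
```

A rising curve indicates a positive effect of the feature on the predictand, and a falling curve a negative one. Extending the loop to a two-dimensional grid gives the pairwise surfaces described above.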

SVR Model Set-Up
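In this section, we discuss the modelling of SVR [31,32] for the prediction of the number of barriers (Figure 2). It is an effective algorithm to address prediction problems, solve sample issues, and provide significant generalisation performance [30,33]. Using a nonlinear mapping $\phi(\cdot): \mathbb{R}^{n} \rightarrow \mathbb{R}^{n_h}$, the training sets $(x_i, y_i)$, where i = 1 to n, are mapped into a high-dimensional feature space, $\mathbb{R}^{n_h}$. Then, a linear function, f, is used to express the nonlinear association among features and the response variable. The SVR function [34] is a linear function which is represented as:
$$f(x) = w^{T}\phi(x) + B \tag{2}$$
where $f(x)$ indicates the forecasting values, $w \in \mathbb{R}^{n_h}$ indicates the weighting matrix, and $B \in \mathbb{R}$ indicates the bias term. The SVR approach intends to reduce the empirical risk as:
$$R_{emp}(f) = \frac{1}{n}\sum_{i=1}^{n} \Theta_{\varepsilon}\left(y_i, w^{T}\phi(x_i) + B\right) \tag{3}$$
where $\Theta_{\varepsilon}(y_i, w^{T}\phi(x_i) + B)$ indicates the $\varepsilon$-insensitive loss function that determines the optimal hyperplane on a high-dimensional feature space to maximise the distance between two subsets of the input dataset. It is determined by:
$$\Theta_{\varepsilon}(y, f(x)) = \begin{cases} |f(x) - y| - \varepsilon, & |f(x) - y| \geq \varepsilon \\ 0, & \text{otherwise} \end{cases}$$
Hence, SVR is concerned with identifying the optimal hyperplane and decreasing the residual between the training datasets and the $\varepsilon$-insensitive loss function. Moreover, SVR reduces the total errors by:
$$\min_{w, B, \xi^{*}, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \left(\xi_i^{*} + \xi_i\right) \tag{4}$$
with the following constraints:
$$y_i - w^{T}\phi(x_i) - B \leq \varepsilon + \xi_i^{*}, \qquad w^{T}\phi(x_i) + B - y_i \leq \varepsilon + \xi_i, \qquad \xi_i^{*}, \xi_i \geq 0, \qquad i = 1, \ldots, n \tag{5}$$
Equation (4) normalises weight sizes, ensures regression function flatness, and penalises the training residuals between $f(x)$ and $y$ through the $\varepsilon$-insensitive loss function, where C represents the penalty parameter. Training residuals above $\varepsilon$ are represented as $\xi_i^{*}$, and those below $-\varepsilon$ are represented as $\xi_i$. However, in the dual space, the SVR function is represented as:
$$f(x) = \sum_{i=1}^{n} \left(\alpha_i^{*} - \alpha_i\right) K(x_i, x) + B \tag{6}$$
where $\alpha_i^{*}$ and $\alpha_i$ are the Lagrange multipliers and $K(x_i, x_j)$ represents the kernel function. It is the inner product of the vectors $x_i$ and $x_j$ in the feature space, i.e., of $\phi(x_i)$ and $\phi(x_j)$. We have used the polynomial kernel (Equation (7)) as it belongs to the group of non-stationary kernels that perform effectively over standardised and transformed features [35].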
In Equation (7), γ and ω are the kernel function's structural parameter and polynomial degree, respectively. The prediction accuracy of an SVR model is governed by the good tuning of the hyperparameters (C and ε). If the residual between the observed and predicted value is greater than the hyperparameter ε, then the other hyperparameter, C, penalises the model. Hence, a high value of C results in under-fitting, and a lower value leads to high computational complexity [27].
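The ε-insensitive loss and the empirical risk described above can be illustrated with a short sketch. The function names and the sample values are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube; grows linearly with the residual outside it."""
    residual = abs(y_true - y_pred)
    return max(0.0, residual - epsilon)

def empirical_risk(y_true, y_pred, epsilon):
    """Mean epsilon-insensitive loss over all samples."""
    losses = [epsilon_insensitive_loss(t, p, epsilon) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Hypothetical observed and predicted numbers of barriers.
observed = [10.0, 10.0]
predicted = [10.25, 12.0]
print(empirical_risk(observed, predicted, epsilon=0.5))  # (0 + 1.5) / 2 = 0.75
```

Residuals smaller than ε contribute nothing, so only predictions that leave the tube are penalised, and the parameter C then weights that penalty against the flatness term.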
In this study, we applied the universal grid optimisation algorithm [36] to optimise the hyperparameters. We selected the most frequently used Mean Square Error (MSE) function [37] as the objective function, given by:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(f_i - \hat{f}_i\right)^{2} \tag{8}$$
where n is the sampling size, $f_i$ is the observed value, and $\hat{f}_i$ is the predicted value. We iteratively optimised C for all possible ε by considering the MSE function as the objective function. We found the optimal values to be C = 0.1 and ε = 0.01. Afterward, we applied log transformation (LT) [38] and mean z-score scaling (Equation (9)) on the input features:
$$z = \frac{x_f - \bar{x}_f}{\sigma} \tag{9}$$
where $z$ is the scaled feature, $x_f$ is the input feature set, $\bar{x}_f$ is the mean of the feature set, and $\sigma$ is the standard deviation of the feature set. Once we applied feature pre-processing, we split the datasets in an 80:20 ratio to train and evaluate the SVR model. The datasets are divided randomly using the Mersenne Twister random number generator. We illustrated the complete methodology in Figure 3 and also enumerated the complete process in the following steps:

1. We synthetically generated the input features (i.e., area of the RoI, sensing range of the sensors, transmission range of the sensor, and the number of sensors) through Monte Carlo simulations.
2. We trained a regression tree ensemble to estimate each feature's relative feature importance score.
3. We leveraged PDP analysis to perform the sensitivity analysis of each feature.
4. We applied feature scaling on the selected features after log transformation.
5. We used the Mersenne Twister generator with a random seed to randomly divide the datasets for training and testing the model in a ratio of 80:20.
6. We used 80% of the datasets to set up the machine learning model.
7. We used the remaining 20% of the datasets to test the performance of the trained model.
8. We performed the sensitivity analysis of the trained model.
9. We performed the error analysis using error histogram analysis to understand the distribution of the errors.
10. We compared the performance of the trained model with the benchmark algorithms (i.e., ANN, GRNN, GPR, and Random Forest).
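The pre-processing and splitting steps in the list above (log transformation, z-score scaling, and the seeded 80:20 split) can be sketched as follows. This is a minimal illustration under stated assumptions: the natural logarithm and the population standard deviation are assumed, the seed value and the sample rows are hypothetical, and Python's built-in `random` module is used because it is based on the Mersenne Twister generator.

```python
import math
import random
import statistics

def log_transform(column):
    """Natural-log transform; requires strictly positive feature values."""
    return [math.log(v) for v in column]

def z_score(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def preprocess(rows):
    """Apply log transformation followed by z-score scaling, feature by feature."""
    columns = list(zip(*rows))
    scaled = [z_score(log_transform(col)) for col in columns]
    return [list(r) for r in zip(*scaled)]

def split_80_20(rows, targets, seed=42):
    """Shuffle indices with a seeded Mersenne Twister generator and split 80:20."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [targets[i] for i in train],
            [rows[i] for i in test], [targets[i] for i in test])

# Hypothetical rows: [area, sensing range, transmission range, number of sensors].
rows = [[5000.0 * (i + 1), 15.0 + i, 30.0 + 2 * i, 100.0 + 10 * i] for i in range(10)]
targets = [float(i) for i in range(10)]  # hypothetical numbers of barriers

scaled = preprocess(rows)
X_train, y_train, X_test, y_test = split_80_20(scaled, targets)
print(len(X_train), len(X_test))
```

Fixing the seed makes the split reproducible, so that different scaling variants and benchmark models can be compared on identical training and testing subsets.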

Results
In this section, firstly, we discuss the results of feature importance and sensitivity analysis. Afterward, we discuss the performance of the proposed model.

Feature Importance and Sensitivity
We evaluated the prominence of each feature through the regression tree ensemble approach. The bars in Figure 4 show the relative feature importance score of each feature. The feature importance scores of all four features range between 60 and 140. We found that the area of the RoI has the lowest feature importance of all, indicating that the area of the rectangular region is the least relevant feature in predicting the number of barriers for fast intrusion detection and prevention. Surprisingly, we found that the sensing range of the sensor, the transmission range of the sensor, and the number of sensors have the same and highest feature importance score, indicating that they are the most relevant features in predicting the number of barriers.
Further, we performed the feature sensitivity analysis of all four features through the Partial Dependency Plot (PDP) analysis (Figure 5). We observed that the area of the rectangular region has a negative effect on the response variable (i.e., number of barriers). In contrast, the sensing range of the sensors, the transmission range of the sensors, and the number of sensors have a positive effect on the response variable.

Model Performance
Once our model is trained, we evaluate its performance by using R, RMSE, and bias as the performance metrics. To do so, we fed the testing datasets into the trained model's input and obtained the predicted response from the model. Afterward, we plotted a linear fit line between the observed and predicted response variable in Figure 6a. In doing so, we observed that the predicted values accord well with the observed values (R = 0.98, RMSE = 6.47, and bias = 12.35). All the data points lie around the regression line, with very few (especially the lower values) lying beyond the 95% Confidence Interval (C.I.).
However, the presence of positive bias indicates that the model is slightly overestimating the response variable.
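The three performance metrics can be computed as in the sketch below. The bias formula is an assumption: it is taken here as the mean signed error (predicted minus observed), so that a positive value indicates overestimation; the exact definition used in this study is not stated. The sample values are hypothetical.

```python
import math

def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mean_bias(obs, pred):
    """Mean signed error (assumed definition): positive means overestimation."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

# Hypothetical observed and predicted numbers of barriers.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 21.0, 33.0, 42.0]
print(pearson_r(observed, predicted), rmse(observed, predicted), mean_bias(observed, predicted))
```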
Further, to understand the distribution of errors in the model, we plotted the error histogram of the model using 10 bins (Figure 6b). We fitted a continuous Gaussian curve to the error distribution and found that the error follows a left-skewed distribution (also called a negatively skewed distribution). The error ranges from −7.4 (leftmost bin) to 21.4 (rightmost bin). Negative errors (left of the zero-error line) represent the underestimated region, and positive errors (right of the zero-error line) represent the overestimated region. The peak of the distribution lies in the overestimated region, indicating the presence of positive bias.
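The binning behind such an error histogram can be sketched as follows. Only the two extreme error values are taken from the text; the remaining values are hypothetical, and errors are assumed to be computed as predicted minus observed.

```python
def error_histogram(errors, bins=10):
    """Count errors in equal-width bins spanning the minimum and maximum error."""
    lo, hi = min(errors), max(errors)
    width = (hi - lo) / bins  # assumes the errors are not all identical
    counts = [0] * bins
    for e in errors:
        idx = min(int((e - lo) / width), bins - 1)  # the maximum goes into the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical errors, except for the extremes -7.4 and 21.4 quoted above.
errors = [-7.4, -2.0, 0.5, 3.1, 4.0, 6.2, 8.8, 12.0, 15.5, 21.4]
edges, counts = error_histogram(errors, bins=10)
positive_share = sum(1 for e in errors if e > 0) / len(errors)
print(counts, positive_share)
```

The share of positive errors gives a quick numerical check of an overestimation tendency alongside the histogram.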

Comparison with Other Scaling Methods
We also evaluated and compared the performance of other scaling approaches. We considered Center Mean (CM) scaling and Min-Max (MM) scaling along with Z-score scaling. We also considered the Non-Scaled (NS) version for an appropriate comparison. After log transformation of the features, we applied these scaling techniques and trained the SVR model. We reported the performance of LT-NS-SVR, LT-CM-SVR, LT-ZM-SVR, and LT-MM-SVR in Table 2. Interestingly, we found that the predicted barriers accord well with the observed values for all the variants. However, the RMSE, MSE, bias, and computational time complexity of LT-NS-SVR are the worst among all.
Table 2. Comparison of the performance of Z-score scaling (i.e., LT-ZM-SVR) with other scaling methods (i.e., LT-NS-SVR, LT-CM-SVR, and LT-MM-SVR).
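The three scaling variants compared here differ only in how each feature column is shifted and rescaled, as the sketch below illustrates. Interpreting Center Mean scaling as subtracting the mean without dividing by the spread is an assumption, and the sample column is hypothetical.

```python
import statistics

def centre_mean(column):
    """Center Mean scaling (assumed): subtract the mean only."""
    mean = statistics.fmean(column)
    return [v - mean for v in column]

def min_max(column):
    """Min-Max scaling: map the column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Z-score scaling: zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

column = [1.0, 2.0, 3.0]  # hypothetical log-transformed feature values
print(centre_mean(column))  # [-1.0, 0.0, 1.0]
print(min_max(column))      # [0.0, 0.5, 1.0]
print(z_score(column))
```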

Comparison with Benchmark Algorithms
To ensure an unbiased conclusion, we compared the performance of the proposed approach with different benchmark algorithms. In doing so, we evaluated the performance of ANN [39], GRNN [40], GPR [41,42], and Random Forest (RF) [43] over the same datasets after performing LT and z-score scaling on the features (Table 3). These models are selected based upon their performance in different applications such as remote sensing [30], WSNs [44], IoT [45], and blockchain [46]. We selected R, RMSE, MSE, bias, and computational time as the performance metrics. In this comparison, we found that the proposed approach outperforms the benchmark algorithms in terms of RMSE, MSE, and bias. Additionally, LT-ZM-SVR emerges as the most computationally efficient approach. Surprisingly, we found that RF has the best R, although with a poor RMSE. We observed a positive bias (i.e., overestimation tendency) in GRNN, GPR, RF, and LT-ZM-SVR. In contrast, a negative bias (i.e., underestimation tendency) is observed with ANN.

Sensitivity Analysis of the LT-ZM-SVR
Finally, we performed the sensitivity analysis of the LT-ZM-SVR model to evaluate its robustness in the presence of uncertainty in input features. To do so, we introduced a fixed amount of variation in any one of the input features, keeping the others constant. We performed this iteratively for all the features and reported the percentage change in the response variable in Figure 7. From the heat map, we found that overall the model is quite stable in the presence of small uncertainty. Relatively, the model is more vulnerable to the uncertainty present in the number of sensors.
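The one-at-a-time perturbation procedure described above can be sketched as follows. The `toy_model` function, its coefficients, the baseline values, and the perturbation sizes are all hypothetical stand-ins for the trained LT-ZM-SVR model and the settings used in this study.

```python
def percentage_change(model, baseline, feature, relative_shift):
    """Perturb one feature by a relative amount and report the % change in the output."""
    perturbed = list(baseline)
    perturbed[feature] *= (1 + relative_shift)
    base = model(baseline)
    return 100.0 * (model(perturbed) - base) / base

def toy_model(x):
    """Hypothetical predictor of the number of barriers."""
    area, r_s, r_tx, n = x
    return 0.5 * n + 2.0 * r_s + 0.1 * r_tx - 0.0005 * area

# Hypothetical baseline: [area, sensing range, transmission range, number of sensors].
baseline = [10000.0, 20.0, 40.0, 200.0]
shifts = [-0.05, 0.05]

# Rows: features; columns: perturbation sizes (the data behind a heat map).
sensitivity = [[percentage_change(toy_model, baseline, f, s) for s in shifts]
               for f in range(len(baseline))]
for name, row in zip(["area", "R_s", "R_tx", "N"], sensitivity):
    print(name, [round(v, 2) for v in row])
```

Each row of the resulting matrix corresponds to one feature, so the row with the largest absolute values identifies the feature to which the model is most vulnerable.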

Conclusions
This study proposed a novel approach to estimate the number of barriers required for intrusion detection. To do so, we extracted relevant features from the network parameters through Monte Carlo simulations. We evaluated the relevancy of each feature through feature importance analysis. We found the area of the RoI to be the least relevant feature in estimating the number of barriers. All other features (i.e., the sensing range, the transmission range, and the number of sensors) carry equal and the highest relevancy. Additionally, to measure the impact of each feature on the response variable, we performed a feature sensitivity analysis. We observed that, except for the area of the RoI, all other features positively impact the response variable. Afterward, we applied log transformation and scaling operations on the selected features. After feature pre-processing, we applied the tuned SVR algorithm as an interpretable data-driven model. Once our model was trained, we evaluated its performance on the testing datasets using R, RMSE, MSE, bias, and computational time complexity as performance metrics. We found that the proposed approach predicts the number of barriers for fast intrusion detection and prevention accurately and in a timely manner.
For a robust conclusion, we compared the performance of the proposed approach with different scaling and benchmark algorithms. We found that the proposed methodology outperforms all the benchmark algorithms. However, the limitation of the proposed algorithm is that it assumes the values of the input features to be a positive real number. This study is a step towards fast intrusion detection and prevention using WSNs. Our approach can be employed for near-real-time applications such as border surveillance.

Informed Consent Statement:
The computer algorithms originated during the current study can be made available from the corresponding author or first author on a reasonable request.

Data Availability Statement:
The datasets generated during and/or analysed during the current study can be downloaded from https://www.kaggle.com/abhilashdata/intrusion-data-wsn (accessed on 26 January 2022).