χ2-BidLSTM: A Feature Driven Intrusion Detection System Based on χ2 Statistical Model and Bidirectional LSTM

In a network architecture, an intrusion detection system (IDS) is one of the most commonly used approaches to secure the integrity and availability of critical assets in protected systems. Many existing network intrusion detection systems (NIDS) utilize stand-alone classifier models to classify network traffic as an attack or as normal. Due to the vast data volume, these stand-alone models struggle to reach higher intrusion detection rates with low false alarm rates( FAR). Additionally, irrelevant features in datasets can also increase the running time required to develop a model. However, data can be reduced effectively to an optimal feature set without information loss by employing a dimensionality reduction method, which a classification model then uses for accurate predictions of the various network intrusions. In this study, we propose a novel feature-driven intrusion detection system, namely χ2-BidLSTM, that integrates a χ2 statistical model and bidirectional long short-term memory (BidLSTM). The NSL-KDD dataset is used to train and evaluate the proposed approach. In the first phase, the χ2-BidLSTM system uses a χ2 model to rank all the features, then searches an optimal subset using a forward best search algorithm. In next phase, the optimal set is fed to the BidLSTM model for classification purposes. The experimental results indicate that our proposed χ2-BidLSTM approach achieves a detection accuracy of 95.62% and an F-score of 95.65%, with a low FAR of 2.11% on NSL-KDDTest+. Furthermore, our model obtains an accuracy of 89.55%, an F-score of 89.77%, and an FAR of 2.71% on NSL-KDDTest−21, indicating the superiority of the proposed approach over the standard LSTM method and other existing feature-selection-based NIDS methods.


Introduction
In present-day society, various organizations and individuals have become more and more reliant upon information and communication technology (ICT), due to the increasing number of useful technologies. The rise in reliance has resulted in a greater demand for more stable and reliable ICT components and services. As a section of ICT, the Internet provides a medium for individuals and organizations to accomplish tasks in their everyday lives. However, as the data flow and the information traffic over the Internet increase, user privacy and transactions become more prone to malicious users' threats and attacks (intrusions). An intrusion is a succession of activities aiming to jeopardize the security of a network system [1].
Intrusion detection systems (IDSs) have proven essential in the security domain and play a vital role in detecting different types of malicious behaviors and attacks. IDSs can be grouped into three basic strategic concepts (misuse detection, anomaly detection, and a hybrid of the two) [2,3]. Misuse detection is a signature-based approach used to identify a particular matching behavior or signature, compare it to recorded user behavior or activities, and raise a signal [4][5][6]. Anomaly detection is used to spot activities that are significantly different from normal user activity. In anomaly detection systems, an action is raised if there is some deviation from a predefined computer state [7][8][9]. Hybrid detection is a fusion of anomaly and misuse detection methods used to identify malicious activities [2,10,11]. It is vital to mention that network intrusion or attacks can come from outside the network (outsider attacks) or from within the network (insider attacks). Researchers have proposed several different intrusion detection systems over the past few decades using machine learning, deep learning, and other statistical methods. However, in recent times, machine learning and deep learning techniques have gained more attention in many different research areas, including intrusion detection [12]. They have become the most commonly adopted approaches for many intrusion detection systems (IDS).
In the literature, machine learning methods such as support vector machine (SVM), decision trees (DT), k-nearest neighbor (KNN), artificial neural networks, and deep neural networks (DNN) have been widely used for the detection of network intruders [13][14][15][16]. However, the performance of these techniques depends heavily on simulated datasets. These datasets often require many features for training, making them computationally expensive for most classification models. Furthermore, using large numbers of features may result in low performance, because some features may be redundant and irrelevant to the performance of a model. Therefore, it is necessary to perform feature selection before training, to eliminate redundant and irrelevant features from the datasets. Feature selection plays an important role in data preprocessing for most machine learning models. It is the process of selecting features with the highest contributions to the predictive variable. Feature selection can be performed manually or using algorithms (automatically) to reduce the dimensions of the data to a subset of features relevant to building a predictive model. There are three main categories of feature selection in the literature: wrapper, filter, and hybrid methods [17]. The wrapper method utilizes the greedy search strategy to evaluate all possible feature combinations against a criterion for evaluation based on machine learning algorithms [17,18]. The filter method, on the other hand, is not dependent on any machine learning algorithm. Features are selected based on the variable characteristics or intrinsic properties, which are measured via statistical analysis [19,20]. A hybrid or embedded method uses a combination of the properties of wrapper and filter methods [17,21]. Motivated by the positive impact of feature selection on the performance of machine learning and deep learning models for several different problems, we have developed a new IDS called χ 2 -BidLSTM for network systems.

Main Contributions
The proposed χ 2 -BidLSTM IDS integrates χ 2 with a BidLSTM-based deep learning model. The χ 2 statistical model is used for the ranking and selection of features based on their χ 2 test scores. The selected optimal features are used to train a bidirectional long short-term memory (BidLSTM)-based recurrent neural network (RNN) for network intrusion detection. The NSL-KDD dataset, which can be accessed via the University of New Brunswick (UNB) data repository, is used to train and evaluate our χ 2 -BidLSTM model's performance. The contributions of this paper are as follows: 1. Developing and implementing an intrusion detection system based on a bidirectional long short-term memory integrated with a χ 2 feature selection model.

2.
To the best of our knowledge, no prior work has addressed the hybridization of the bidirectional LSTM model with χ 2 statistical model for network intrusion detection. 3. The χ 2 -BidLSTM method uses fewer features for training and testing purposes, and thus reduces the complexity of traditional BidLSTM and also improves its classification accuracy. 4. A better classification accuracy than the traditional bidirectional LSTM model is obtained. Additionally, our approach outperforms existing state-of-the-art methods in the literature.
The remainder of this work is organized as follows. Section 2 presents a review of related work in the literature. A description of the dataset and the proposed methodology is presented in Section 3. Section 4 presents the implementation, experimental results, and discussion. Section 5 discusses the model complexity and limitations. In Section 6, we present the conclusions and future directions of the study.

Related Work
As an essential element for ensuring security in network systems, IDSs continuously draw the interest of many researchers. Many models have been developed to enhance the effectiveness of IDSs in network systems. In this section, we discuss the literature related to IDS techniques based on machine learning (ML) and deep learning (DL) that leverage feature selection for network anomaly detection.
The authors in [22] proposed a hybrid IDS approach using the NSL-KDD dataset, which focuses on combining the probability distributions of different learning algorithms using information gain (IG) and a voting algorithm to select relevant features for classification. The hybrid method comprises the J48, Random Tree, Meta Pagging, REPTree, Decision Stump, AdaBoostM1, and naive Bayes base classifiers. Although the technique demonstrated a good performance of 99.81% and 98.56% accuracy for binary and multiclass problems, respectively, there are still some concerns that need attention. The feature selection process in this approach is often biased towards variables with distinct values, not variables that have observations with large values, which can result in over-fitting and poor performance. In [23], Hota and Shrivas also developed a framework that utilizes different feature selection methods for irrelevant feature removal. The findings suggested that the C4.5 algorithm could obtain the greatest accuracy with IG for just 17 features of the NSL-KDD dataset. The study investigated the performance of four different feature selection methods (i.e., correlation, information gain, relief, and symmetrical uncertainty) integrated with the C4.5 decision tree algorithm for classification. According to the experimental findings, the most efficient amongst the four selection methods was information gain with C4.5, which obtained a detection accuracy of 99.68%. Although the result is promising, the method tends to be skewed towards attributes with many possible values, leading to poor generalization. Moreover, the entropy model employed in C4.5 has many timeconsuming logarithmic operations, sorting operations, and continuous values resulting in high computational cost. Using logistic regression combined with a search strategy, the authors in [24] presented a feature-selection-based IDS model that selects the best subset of features from the KDDCUP'99 and the UNSW-NB15 datasets. The findings indicated that their algorithm yields a good detection accuracy with just 18 selected features from the KDDCUP'99 dataset and 20 selected features from the UNSW-NB15 dataset.
Acharya and Singh [25] proposed a novel bio-driven feature selection approach that utilizes the Intelligent Water Drops algorithm combined with an SVM classifier for network intrusion detection. Their approach, also known as a swarm optimization algorithm, produced a high performance on the KDDCUP'99 dataset. The results indicated that the approach obtained a high accuracy of 93.12%, a detection rate of 91.35%, and a reduced false alarm rate of 3.35%, compared to other methods. The authors in [26] introduced a hybrid IDS mechanism that integrates feature selection and clustering using SVM and K-medoids clustering strategies. In this approach, the authors trained a naïve Bayes classifier on the KDDCUP'99 dataset. They evaluated the model using the detection rate, accuracy, and false alarm rate. The evaluation results indicated that the proposed approach obtained a higher detection rate of 90.1%, an accuracy of 91.5%, and a false alarm rate of 6.36%. In [27], Jabbar et al. presented a cluster-oriented ensemble model for network intrusion detection. The model was developed using the alternating decision tree technique (ADTree) and the K-nearest neighbor (KNN) algorithm. In experiments, their proposed approach showed a better performance with regard to accuracy and detection rate, compared to other methods in the literature.
In another approach [28], Paulauskas and Auskalnis introduced an ensemble IDS model. The model was developed using naïve Bayes (NB), C5.0, J48, and the partial decision list algorithms as base classifiers, with the notion of integrating multiple learners. Experimental findings indicated that the approach obtained better accuracy for network intrusion detection. To combat the high-dimensionality problem in network traffic, Zhou and Cheng [29] developed a heuristic feature selection algorithm known as the correlationbased feature selection bat algorithm (CFS-BA). Their strategy obtains the best feature subset by evaluating the correlations among features. The authors further built an ensemble model that incorporates random forest, forest-oriented penalizing attribute, and C4.5 decisions, using the rule of the average of probabilities (AoP). The model was trained and evaluated using CIC-IDS2017, KDDCUP'99, and the NSL-KDD datasets. The results showed that the CFS-BA ensemble approach produced a better performance, compared to other existing methods.
In [30], Pham et al. presented a hybrid approach that leverages gain ratio and bagging techniques for network intrusion detection. The former (gain ratio) is utilized to obtain the best features. The latter (bagging) is used to integrate tree-based core classifiers. The approach was evaluated using the NSL-KDD dataset. The results showed that the bagging method combined with J48 as the core classifier produced better performance for 35 features. The authors in [31] proposed a wrapper-based IDS that utilizes a hyper-graph (HG) and a genetic algorithm (GA) for producing possible subsets of features. The approach uses SVM as a classification algorithm, which is evaluated on the NSL-KDD dataset. From the evaluation, their proposed method exhibited a performance accuracy of 96.72% with 35 selected features.
In [32], Abdullah et al. developed an IDS model based on splitting the data input into several subsets relative to the attack types. In this work, IG was used to select the best features for each subset. Using random forest (RF) and partial decision list (PART) as core classifiers, the method was evaluated on the NSL-KDD dataset. Experimental findings illustrated that higher accuracy was achieved with the RF and PART classifiers combined with product probability. In [33], Mohammadi et al. introduced a feature-selection-based IDS that incorporates a clustering algorithm. The methodology was developed using a wrapper method that leverages a linear correlation coefficient (LCC) algorithm and a filter strategy that utilizes a cuttlefish algorithm (CFA). Their approach trained a decision tree (DT) classifier on the KDDCUP'99 dataset. Experimental results with 10-fold crossvalidation showed that the method obtained a 95.03% accuracy, a 95.23% detection rate, and a reduced false alarm rate of 1.6%. The authors in [34] developed a hybrid intrusion detection system that integrates principal component analysis (PCA) and information gain (IG) algorithms for feature selection. Their approach was evaluated on the NSL-KDD, Kyoto 2006+, and ISCX 2012 datasets using three ensemble classifiers (i.e., multi-layer perceptron (MLP), SVM, and instance-based learning algorithms (IBK)). The IG-PCA method exhibited more remarkable performance in detection rate, accuracy, and false alarm rate than other existing strategies.
Using the organic combination of several deep learning methods, the authors in [35] proposed a novel anomaly detection approach known as HELAD. The authors first performed feature extraction and selection using the damped incremental statistics algorithm (DISA). An autoencoder was then trained with selected features of a label dataset while noting the irregular (abnormal) score labels in the data traffic. They further trained an LSTM model using the irregular score label and obtained the final score using a weighted technique. In the experiment, the HELAD method produced a better accuracy compared to other state-of-the-art methods. In [36], the authors used a multi-objective technique to obtain the best subsets of features, which were then evaluated based on three decision tree algorithms (i.e., NB, RF, and C4.5). The three algorithms were trained and tested using the CIC-IDS2017, UNSW-NB15, and NSL-KDD datasets. In the experiment, the NSGA2-LR approach showed promising results compared to other methods.

Materials and Methods
In this section, we present a detailed description of the dataset used in our study and the proposed methodology.

Description of Dataset
One of the benchmark datasets utilized by researchers on intrusion detection in the security domain is the NSL-KDD dataset [37]. It is publicly available in the online data repository of the University of New Brunswick (UNB) [38]. The NSL-KDD dataset is a modified form of the KDDCUP'99 dataset presented in [39]. The proposed model was trained and evaluated on the NSL-KDD dataset. We selected this dataset because of the following advantages: 1. The dataset has a reasonable and sufficient number of traffic records that can be used to perform the study. 2. It does not contain redundant traffic in the training set, ensuring that classifiers are not biased toward more frequently occurring records. 3. The testing set has no duplicate records; hence, the performance of learning algorithms is not biased by models with higher detection rates on more frequently occurring records. 4. The fraction of records in the main KDD dataset is inversely proportional to the overall records chosen from each difficulty level category. Therefore, the prediction rates of various ML algorithms differ over a greater range, making accurate evaluation of various learning methods more effective.
The dataset includes a training set (i.e., KDDTrain + ) containing 125,973 data records and two different test sets (i.e., KDDTest −21 and KDDTest + ) containing 11,850 and 22,544 data records, respectively, as presented in Table 1.

Data Preprocessing
As presented in Table 2, the NSL-KDD dataset has forty-one (41) features, of which three are non-numeric. The non-numeric features are service, protocol_type, and flag. The dataset has one classification label that can be categorized into two classes (i.e., normal and attack) for a 2-class classification or five classes for a 5-class classification. The five classes include Remote-2-Local (R2L), User-to-Root (U2R), Denial of Service (DoS), Probe, and Normal. Apart from the normal class, the remaining four classes represent the different attack types found in the dataset (see Table 3). Like any neural network model, the proposed approach uses only numeric values as inputs. Hence, we converted all the non-numeric data inputs to numeric form by encoding and assigning unique integer values to each of them. As an integral part of data preprocessing, normalization plays an essential role in producing a balanced dataset. The commonly used normalization strategies in machine learning and data science include decimal scaling, z-score, and min-max. The NSL-KDD dataset exhibits uneven distribution for some features (e.g., src_bytes and dst_bytes), leading to biased results. To ensure that the proposed model does not produce biased results, we transformed all 41 features to values within the range of 0 to 1 by utilizing the min-max feature scaling technique, as shown in Equation (1): where z signifies the original value of the feature and x represents the newly scaled number.

Proposed Approach
The proposed method (see Figure 1) in this paper is chi-square bidirectional long short-term memory (χ 2 -BidLSTM), which involves two steps. The first step utilizes a chi-square statistical model to select optimal features from the dataset. The second step trains a bidirectional LSTM predictive model on the optimal set. 3.3.1. Chi-Square (χ 2 ) Feature Selection A χ 2 model computes the χ 2 statistics for every feature (F i ) and class (θ) to measure the level of independence between each class and feature. It also indicates features that are most likely to be irrelevant (not class-dependent) for classification [40]. The feature selection process first partitions the data and ranks the features, and then performs a search to obtain an optimal subset from the ranked set of features [41]. The features are ranked using the χ 2 test scores. For instance, consider a 2-class (i.e., Normal and Attack) classification with m instances. We can construct a table to obtain the χ 2 test scores, as shown in Table 4. Table 4. Computation of chi-square test scores.

Normal Class Attack Class Total
Here, ϕ represents the total number of instances with feature F i , and the total number of instances without F i is represented by m − ϕ. In addition, δ denotes the total number of normal instances. The total number of attack instances is denoted by m − δ. The χ 2 test statistic compares the observed values (O) measured from the data with the expected values (E). From Table 4, the observed values are n, a, ν, and ρ. Let E n , E a , E ν , and E ρ represent their expected values, respectively. Using the assumption that the two occurrences are independent, we can compute the expected values as: Similarly to Equation (2), we can also compute the expected values E a , E ν , and E ρ . The χ 2 test statistic for a goodness-of-fit test is generally obtained by: where i is the number of different data classes, O denotes the observed values measured from the data, and E represents the expected values. We can therefore compute the χ 2 test statistic for Table 4 as: The test statistic in Equation (4) is used to rank the features. Subsequently, we perform a forward best-first search to select the features with the highest test scores as the optimal set. Thus, we first select a feature with the highest χ 2 test result and check its performance using the BidLSTM model. Another feature is added to the subset of features in the subsequent iteration based on the test score. Again, we investigate the performance of the subset of features with the BidLSTM model. This procedure is repeated until every ranked feature is added to the subset. The subset of features with the best performance is then selected as the optimal set and supplied to the BidLSTM predictive model to produce the best performance results.

BidLSTM Model
In this subsection, we present the deep learning methods used in this study. We give a brief overview of the working principles of RNNs in general and narrow it down to the main methods: the LSTM and the BidLSTM.
RNN is a generalized form of traditional feed-forward network with internal memory, capable of propagating information from the past to the future. It generates loops in the networks which enable information to persist. The loops are utilized together with the memory state to process a sequence of input data [42]. RNN is a category of DNN that is able to utilize previous outputs while maintaining hidden layers that serve as storage for information [43,44]. The same weights and biases are supplied to all layers, thereby minimizing the challenge of memorizing and increasing parameters. The basic architecture of an RNN is shown in Figure 2. RNNs may be suitable for solving several research problems, but they suffer from the drawback of vanishing gradients, which inspired the development of LSTM in [45]. The LSTM network is a more advanced version of the RNN that learns long-term dependencies via a gating mechanism. It is a solution to the vanishing gradients problem encountered when training conventional RNNs [47]. The gates and cell state are the LSTM network's basic principles. The cell state is considered as the network's memory and serves as route to propagate relevant information. The gates (i.e., forget, input, and output gates) control the information flow and determine what knowledge should be kept or discarded (forgotten), as shown in Figure 3. Equations (5)- (9) give the expressions for the cell state and various gates at the periods t and t − 1, as follows: where i T denotes the input gate, α represents the input vector, o T is the output gate, β T denotes the output, and f T represents the forget function. The cell state is given by ζ, witĥ W and λ as the weight and bias parameters, respectively. The proposed method, which is the bidirectional LSTM (BidLSTM), augments the conventional LSTM to enhance a network model's performance. The BidLSTM model utilizes two hidden LSTM layers to process data inputs in two directions (i.e., forward and backward) [48,49]. The basic concept of a BidLSTM model is quite simple. It involves duplicating the primary recursive layer of the neural model. In training, the input to the primary layer consists of the actual data, while that of the duplicate layer is a reverse copy of the data. This technique effectively increases the amount of information available to the model. Figure 4 displays the structure of a BidLSTM model. The Keras library in Python provides a wrapper for the bidirectional layers used for developing BidLSTMs. It permits users to decide the merging mode, which determines how the outputs from both directions (i.e., forward and backward) are combined before feeding them to the subsequent layer.
The forward hidden layer ( → β ), the backward hidden layer ( ← β ), and the output (o) of a BidLSTM can be obtained from the following equations [49,50]:  It is evident from the literature that bidirectional RNN models perform considerably better than standard models in several research areas, including intrusion detection. In this approach, we evaluated the performance of BidLSTM using the NSL-KDD intrusion detection dataset. The χ 2 statistical model was hybridized with the BidLSTM to further improve the model's performance. We carried out experiments to discover the hyperparameter values that would result in the best IDS performance metrics. The trained χ 2 -BidLSTM model consisted of an input layer with 64 neurons, three hidden layers with 32 neurons each, and an output layer with five neurons corresponding to the five class labels. We set the number of epochs to 100, with a range of 0 to 0.05 as the model weights. We defined the loss function in the training process to assess the model weights. Since the study deals with a multi-class classification issue, we chose an algorithmic loss function specified in the Keras library as "categorical_crossentropy". This loss function measures how the predicted values vary from the actual values. We employed ReLU as the activation function for all layers except the output layer, which used Softmax activation. The model uses an "Adam" optimization algorithm with a learning rate of 0.008. Finally, we fitted the model to the dataset using the "fit" function. We adopted the K-fold cross-validation scheme with the value of K set to 10, to evaluate the model's performance.

Experimental Results and Discussion
This section presents the process of implementing χ 2 -BidLSTM (Algorithm 1) and discusses the experimental results. To investigate the proposed method's robustness, we evaluated the model's performance using different metrics such as accuracy, precision, F-score, and false alarm rate (FAR). In addition, this section compares the findings to those of the standard LSTM model and other techniques in the literature. The complexity and runtime analyses of the proposed algorithm are also provided.

Implementation
The proposed method is a feature-selection-based IDS called χ 2 -BidLSTM. Several tools in the literature can be used to perform this type of experiment. In this study, the Python programming language was utilized to implement the different phases of the proposed method. To be precise, we used Python's TensorFlow and Keras libraries to implement the various components of the model. All experiments and evaluations were carried out using a personal computer (PC) running on the Windows 10 Operating System (OS), with the following specifications: Intel Core i5-9300H CPU, 8GB Random Access Memory (RAM), NVIDIA GeForce GTX 1050, and a 4GB dedicated GDDR5 VRAM. The implementation was in two phases. The first phase was feature selection with a chi-square statistical model. As mentioned earlier, the NSL-KDD dataset contains 41 training features. However, the dataset contains some irrelevant features that can hinder the performance of a model. To improve the prediction accuracy, we used a χ 2 feature selection method to narrow down the features to those most relevant for classification. As shown in Section 1, the χ 2 model computes the scores between each feature D[i] and class label L and ranks the features in descending order based on their test scores. The result is saved in SELECTED. The algorithm finally returns SELECTED, containing the list of ranked feature indexes. After ranking all features, a forward best search was performed to select an optimal set, as stated in Section 3.3.1. The search first finds the feature having the highest χ 2 test score using the evaluation function v() and appends it to SELECTED. The next iteration finds the subsequent feature that achieves the highest score in addition to the features in SELECTED. The procedure is repeated until an ideal feature combination is achieved and fed to the classification models to produce the best results.

Results and Discussion
In this section, we present a discussion of the results obtained from all experiments. We performed a total of four separate experiments using the NSL-KDD dataset with 10-fold cross-validation.

Experiment No. 1: Standard LSTM Trained with All 41 Features
In experiment 1, we investigated a standard LSTM model's performance using all 41 features for 5-class classification (i.e., DoS, Probe, U2R, R2L, and Normal). Table 5 illustrates the confusion matrices used to evaluate the model's performance, and the results are reported in Tables 6 and 7.

Algorithm 1 χ 2 -BidLSTM implementation process
Obtain the χ 2 test scores for each feature using χ 2 statistical model Rank(sort) the features in descending order based on their χ 2 test scores 1: arr ← {} 2: for i ← 1 to n do 3: test_score ← chi.sqaured(D[i], L) Compute the χ 2 score between features in the dataset D and class labels L 4: append (i, test_score) to arr 5: end for 6: rank the features of arr Sort the features in a descending order based on their χ 2 test scores 7: store the feature scores of arr to SELECTED 8: return SELECTED Find the features with the highest test value (v_max) from the ranked features Obtain the best feature subset for training using forward search 9: SELECTED ← {} 10: v_max ← −1 11: SubF ← index of D 12: while SubF != NULL do 13: index ← NULL 14: for i ← 0 to SubF length do 15: temp f eature_list ← (SELECTED ∪ SubF[i]) 16: temp v ← v(temp f eature_list) 17: if temp v > v_max then 18: index ← i 19: end if 20: end for 21: if index == NULL then 22: break 23: else 24: append SubF[index] to SELECTED Model training interface with a K-fold cross-validation using the optimal set 29: for f = 1 to k do 30: [ ]Training_set = New_List[|V|] 31: [ ]Testing_set = New_List [|V|] Construct the training set 32: for m = 1 to k do 33: if m == f then 34: continue 35: end if 36: for v = 1 to |V| do 37  As reported in Table 6, the standard LSTM model produces a detection accuracy of 87.26%, a precision of 90.34%, an F-Score of 88.03%, and a false alarm rate of 4.03% for the NSL-KDDTest + dataset. From Table 7, the model produced a 74.49% detection accuracy, 81.53% precision, an F-Score of 75.76%, and a 5.96% false alarm rate for the NSL-KDDTest −21 dataset.

Experiment No. 2: BidLSTM Trained with all 41 Features
The second phase of the experiments involved a bidirectional LSTM model trained with all 41 features of the dataset. We evaluated the model's performance using the confusion matrix and experimental findings shown in Tables 8-10.  To obtain a better intuition about the numbers of correctly classified attacks and the misclassification rates, we tabulated the confusion matrix in Table 8 for the two test sets (i.e., NSL-KDDTest + and NSL-KDDTest −21 ). The vertical labels denote the true classes while the horizontal labels represent the predicted classes. As mentioned in Section 3.1, the NSL-KDDTest + dataset contains 22,544 traffic records, out of which 12,833 samples are malicious behaviors (attacks) and 9711 are normal behaviors. Out of the 12,833 attack samples, the BidLSTM model could correctly detect 11,332, producing a detection accuracy of 91.36% from the confusion matrix, a precision of 92.81%, and an F-score of 91.67%. The model misclassified 1501 attack samples, yielding a low false alarm rate of 3.06%. Similarly, the NSL-KDDTest −21 test set contains 2152 normal and 9698 attack records. BidLSTM correctly detected 7947 attacks while 1751 were misclassified. Thus, the model achieved 82.05% detection accuracy, 85.91% precision, an F-Score of 82.77%, and a false alarm rate of 4.20% compared to the standard LSTM model. The performances of our proposed approach, BidLSTM, and those of other current techniques using all the 41 features in the NSL-KDD dataset are reported in Table 11. Based on the results presented in Table 11, our approach shows substantial advantages over the other methods on the NSL-KDD dataset. BidLSTM trained using all 41 features reliably exhibits a greater detection accuracy and a better F-score than the other methods on the two test sets (i.e., NSL-KDDTest + and NSL-KDDTest −21 ), as shown in Figure 5. Additionally, it also has a lower FAR than these approaches, indicating its effectiveness in detecting intrusions.

Experiment No. 3: Standard LSTM Trained with Reduced Features
In this section, we investigate the performance of the chi-squared feature selection integrated with the standard LSTM model. Using the χ 2 statistical model, we achieved different subsets of features. These subsets were fed successively to the standard LSTM classification model for training, and the performance of each subset was recorded as shown in Section 6. The subset of features that produced the best performance results was selected as the optimal set as shown in Figure 6. Table 12 presents the list of features in the chosen subset.
As shown in Table 13, with just 21 features, the model could correctly detect 11,377 malicious records out of the 12,833 records in the NSL-KDDTest + dataset, producing a higher accuracy of 91.16% compared to training the model with all 41 features. Additionally, with the reduced set of features, the LSTM model obtained 91.86% precision, 96.23% specificity, an F-score of 91.32%, and a recall of 91.20%. It also produced a low false alarm rate of 3.77% compared to the standard LSTM trained with all features.
The experimental results from Table 14 indicate that the standard LSTM model trained with the reduced feature set improved the performance by 3.90%.
In the same vein, Table 15 shows that the model improved the performance by 6.49% with the reduced feature set. That is, with the NSL-KDDTest −21 dataset, as presented in Table 13, the model accurately detected 7760 attack records out of a total of 9698, achieving a detection accuracy of 80.98% compared to when it was trained with the complete feature set. It obtained precision, specificity, recall and F-score values of 84.95%, 95.49%, 80.97%, and 81.68%, respectively, with a low false alarm rate of 4.51%.   In this experiment, we evaluated the performance of the proposed BidLSTM method using a reduced feature set. Similarly to experiment No. 3, we obtained different subsets of features after applying χ 2 feature selection. Sequentially, these feature subsets were fed to the BidLSTM model for training and classification. We then selected the subset with the best performance as the optimal feature set, as shown in Table 12.
From Tables 16-18, we can observe that the BidLSTM model had higher detection accuracy, precision, specificity, F-score, and recall. It is evident from the results that BidLSTM trained with a reduced set of features improves the performance of BidLSTM trained with all 41 features by 4.26% and 7.50% on the NSL-KDDTest + and NSL-KDDTest −21 datasets, respectively. With 17 features, as presented in Table 12, the model could correctly detect 11,976 attack samples out of the 12,833 samples in the NSL-KDDTest + dataset, yielding a greater accuracy of 95.62%. Furthermore, it achieved a higher precision of 95.88%, a specificity of 97.89%, an F-score of 95.65%, and a recall of 95.62%. It produced a 2.11% false alarm rate, which was lower than when the model was trained with a complete feature set. In addition, it can be observed from Table 16 that BidLSTM trained with 17 features could effectively detect 8644 attacks from a total of 9698 attack records in the NSL-KDDTest −21 dataset. That is, using the NSL-KDDTest −21 test dataset, BidLSTM with the reduced feature set obtained a detection accuracy of 89.55%, with a low false alarm rate of 2.71%, as shown in Table 18. It also achieved 90.75% precision, 97.29% specificity, 89.55% recall, and an F-score of 89.77%. A comparison of the performances of our proposed method and other existing feature reduction methods is shown in Table 19.  To broaden the scope of the benchmark, we compared the performance of our χ 2 -BidLSTM approach to that of earlier studies that used the NSL-KDD Test + and NSL-KDDTest −21 datasets. Figure 7 shows the comparison of our results with some of the earlier techniques on these two test sets. The proposed approach, which outperforms other contemporary IDS algorithms, achieved the best detection accuracy based on experimental findings on the NSL-KDD datasets. In addition to having greater detection accuracy, the proposed approach outperformed prior approaches significantly in terms of the false-alarmrate measure. Our proposed χ 2 -BidLSTM method achieved greater accuracy of 95.62% and an F-score of 95.65%, with a false alarm rate of 2.11% on the NSL-KDDTest + dataset, using only 17 features. Furthermore, the proposed method obtained an accuracy of 89.55%, an F-score of 89.77%, and a false alarm rate of 2.71%, with just 17 features, according to the experimental findings on the KDDTest −21 dataset, which is superior to the χ 2 -LSTM method and the other existing approaches based on all the performance measures presented in Table 19. The table shows that feature selection improves the performance of both the standard LSTM and BidLSTM models considerably in predicting network intrusion. Chisquare feature selection, compared to other existing feature-selection-based approaches (i.e., PCA, information gain, mutual nformation, CFS, and gain ratio), exhibited superiority in terms of detection accuracy and FAR.

Model Complexity and Limitations
This subsection addresses the complexity of the proposed χ 2 -BidLSTM method and the time needed for training and testing. Furthermore, we also present the limitations of the proposed approach.

Time Complexity
We evaluated the time complexity of our proposed χ 2 -BidLSTM approach with regard to the various units of the model implementation: feature ranking using the χ 2 statistical model, optimal feature selection using the forward best search algorithm, and BidLSTM. To obtain the best feature combination set for training, we first used the χ 2 statistical model to rank all features in a descending order based on their χ 2 test scores, as shown in Section 1. The time complexity is O(n × F), where n is the number of classes and F is the number of features. After ranking all the features, we used the forward best search algorithm to obtain the optimal feature set for each model. The algorithm begins with an empty set. It searches for the feature with the highest χ 2 test score using the evaluation function and appends it to SELECTED (see Algorithm Section 1; lines 9-28). The algorithm continuously finds the next feature that can achieve the best evaluation score with the feature(s) in SELECTED until the desired dimension is reached and no additional features can improve the accuracy. The algorithm's time complexity is O(D), where D is the desired dimension.
The core unit of the approach is the BidLSTM model, which trains two LSTM layers (i.e., the first layer in the forward direction and the second in reverse order).

Execution Time Analysis
In this subsection, we analyze the testing and training times of the models used in this study. To ensure a fair and accurate analysis, we performed all the experiments using an Intel Core i5 PC with an 8 GB memory. The training and testing times of the various methods are reported in Figure 8a and Figure 8b, respectively. From Figure 8a, it can be seen that the BidLSTM approach with the complete set of features requires more time (9789.24 s) to train than the standard LSTM model (5546.31 s). The reason is that BidLSTM trains two LSTMs with an entry shape of a dimensional matrix of the data length and the number of features used. In this domain, the most important characteristic of a model is its ability to accurately, actively, and effectively detect network intrusion. As such, there is a trade-off between training time and performance. Therefore, the standard LSTM model may require a shorter training time, but the BidLSTM model exhibits better accuracies in detecting intrusions. From Figure 8, it is evident that feature selection reduces the training and testing times of both methods considerably. With the reduced set of features, the standard LSTM model requires a training time of 2397.36 s, whiles it takes 4678.62 s to train the BidLSTM model. Thus, feature selection not only improves the performances of the various models but also minimizes the computational times of the models.

Limitations
The experimental findings demonstrate that the proposed method can efficiently detect network intrusions. However, more study in this domain is still required to improve the overall performance of the proposed approach further. The proposed χ 2 -BidLSTM model has a higher complexity and needs more training time than the standard LSTM model, as demonstrated by the complexity and runtime analyses of the model. In the real-world network scenario of computer systems, new forms of intrusions emerge continually, which may not be captured by the NSL-KDD dataset. The absence of emerging novel attacks in the dataset can make it difficult for the proposed method to adapt to recently emerging attacks in a network system. These are the primary limitations of the χ 2 -BidLSTM-based intrusion detection model.

Conclusions
Even though various machine learning methods have been proposed to enhance the performance of IDSs, most existing intrusion detection methods still struggle to achieve good performance. This study offers a new IDS approach called χ 2 -BidLSTM that integrates the chi-square (χ 2 ) statistical model with a bidirectional long short-term memory (BidLSTM) model. We used the χ 2 statistical model to reduce the dataset to the optimal set, to handle the imbalance and high dimensionality of the data. The NSL-KDD dataset with 10-fold cross-validation was used to evaluate the performance of our proposed χ 2 -BidLSTM approach, and the results were compared to other existing intrusion detection approaches. The experimental findings indicated that the proposed χ 2 -BidLSTM method improved the intrusion detection accuracies of the standard LSTM and BidLSTM models by 3.90% and 4.26%, respectively, on the NSL-KDDTest + test set, and by 6.49% and 7.50%, respectively, on the NSL-KDDTest −21 test set. Compared with previously existing techniques that utilize feature selection, the proposed approach achieved higher detection accuracy and F-scores on the two test datasets, while maintaining lower false alarm rates. Furthermore, the proposed χ 2 -BidLSTM method exhibited good performance in detecting minority attack types such as User-to-Root (U2R) and Remote-2-Local (R2L) attacks compared to the other techniques, indicating the robustness and effectiveness of our proposed approach.

Future Directions
The future direction of this study is to explore more feature selection algorithms to further improve the intrusion detection rate of our model. Additionally, as mentioned in Section 5.3, although the proposed approach has a higher intrusion detection accuracy with a low false alarm rate (FAR), the approach has a significant execution-time cost due to the BidLSTM deep network architecture and computations within the LSTM memory cells. Hence, we intend to investigate how to reduce the computational complexity of the proposed method further while maintaining high detection accuracy and a low false alarm rate. We also plan to investigate the performance of our approach with the latest intrusion detection datasets with real-world traffic, such as the UNSW-NB15 and CIC-IDS2017 datasets. Data Availability Statement: The dataset used or analyzed during the current study is publicly available for use from the University of New Brunswick (UNB) data repository. It can be accessed using the following url: Available online: https://www.unb.ca/cic/datasets/nsl.html (accessed on 10 January 2022).

Conflicts of Interest:
The authors declare there is no conflict of interest regarding the publication of this paper. The manuscript has been submitted solely to this journal and is not published, in press, or submitted elsewhere.

Abbreviations
The following abbreviations are used in this study: