Network Intrusion Detection Model Based on CNN and GRU

A network intrusion detection model that fuses a convolutional neural network (CNN) and a gated recurrent unit (GRU) is proposed to address two weaknesses of existing intrusion detection models: low accuracy in the multi-class classification of intrusions and low accuracy on class-imbalanced data. In this model, a hybrid sampling algorithm combining Adaptive Synthetic Sampling (ADASYN) and Repeated Edited Nearest Neighbors (RENN) is used for sample processing to solve the problem of positive and negative sample imbalance in the original dataset. Feature selection is carried out by combining the Random Forest algorithm and Pearson correlation analysis to solve the problem of feature redundancy. Then, the spatial features are extracted by a convolutional neural network and further aggregated by fusing Averagepooling and Maxpooling, while an attention mechanism assigns different weights to the features, thus reducing the overhead and improving the model performance. At the same time, a GRU is used to extract long-distance dependency features to achieve comprehensive and effective feature learning. Finally, a softmax function is used for classification. The proposed intrusion detection model is evaluated on the UNSW-NB15, NSL-KDD, and CIC-IDS2017 datasets. The experimental results show that the classification accuracy reaches 86.25%, 99.69%, and 99.65%, respectively, which is 1.95%, 0.47%, and 0.12% higher than that of a CNN-GRU model of the same type, and that the model can solve the problems of low classification accuracy and class imbalance well.


Introduction
Network intrusion detection is a security mechanism that has been developed in recent years to dynamically monitor, prevent and defend against system intrusions. It specifically refers to collecting information from several nodes of a computer network or system and analyzing this information to discover whether there is an attack or a breach of security policy in the network system. Research on intrusion detection technology has been conducted worldwide since the 1980s, and it has now developed into an integral part of the network security architecture [1].
Traditional machine learning methods have been widely used in network intrusion detection systems, such as Bayesian [2][3][4], Support Vector Machines [5][6][7][8][9][10], Decision Trees [11][12][13], Logistic Regression [14][15][16], etc. These methods have achieved good results. However, these methods are not suitable for massive and high-dimensional data, and cannot solve the problem of degraded classification performance due to their own sensitivity to outliers and noise. At the same time, due to the continuous development of digital technology and the increasingly diverse means of cyber-attacks, traditional machine learning methods have had difficulty in meeting the needs of users.
In recent years, deep learning techniques have been widely used in natural language processing [17], image recognition [18] and other fields. These techniques have also achieved good results in the field of intrusion detection by combining low-level features.
The main contributions of this paper are as follows:
(1) To address the problem of feature redundancy, this paper proposes a feature selection algorithm (RFP algorithm). First, a random forest algorithm is introduced to calculate the importance of features, and then Pearson correlation analysis is used to select features;
(2) To address the problem of sample imbalance, this paper proposes a hybrid sampling algorithm (ADRDB algorithm) that combines the Adaptive Synthetic Sampling (ADASYN) [26] and Repeated Edited Nearest Neighbors (RENN) [27] sampling methods, while using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [28] to reject noise, finally achieving a balanced dataset;
(3) In this paper, a CNN is introduced to extract spatial features from the network data traffic and to improve speed through its weight sharing; a GRU network is introduced to extract temporal features and learn the dependency between features, so as to avoid overfitting problems; and an attention mechanism is introduced to assign different weights to the features, thus reducing the overhead and improving the model performance.

Related Work
Network security intrusion detection is a relatively broad area of research. Existing models used in the field of intrusion detection include Convolutional Neural Networks, Recurrent Neural Networks, traditional machine learning methods, and hybrid models. Scholars have used a variety of approaches to address the problems of low detection accuracy and the difficulty of detecting minority-class samples. Convolutional neural networks are primarily used in tasks related to image and video analysis, such as image classification, face recognition, target recognition and image processing, and in recent years they have also been widely used in the field of intrusion detection. Recurrent neural networks are mainly used in tasks such as connected handwriting recognition and speech recognition, and they are also commonly used in the field of intrusion detection because of their effectiveness in processing time series data.
In terms of improving detection accuracy: Tama, B.A. et al. used a combination of particle swarm optimization algorithms, ant colony algorithms and genetic algorithms for feature selection to reduce the feature size of the training data, followed by a secondary classification method to detect abnormal behavior in the network [29]. Bu, S.J. et al. combined a traditional learning classifier system with a convolutional neural network for the detection of anomalous behavior, and the proposed system has adaptive and learning capabilities [30]. Le, T.T.H. et al. first performed feature selection via the SFSDT model, followed by classification via recurrent neural networks, achieving better results on both the NSL-KDD dataset and the ISCX dataset [31]. Hassan, M.M. et al. proposed an intrusion detection system based on CNN and a weight-dropped long short-term memory network, and achieved more satisfactory results [32].
In terms of addressing the class imbalance: Louk, M.H.L. et al. compared existing sampling methods and found that EasyEnsemble performed better in resolving sample imbalance [33]. Liu, L. et al. divided the dataset into hard and easy sets by ENN, and reduced the imbalance of the original dataset by processing the samples in the hard set through the K-Means algorithm [34]. Yan, M. et al. identified anomalous traffic with good accuracy by an improved density peak clustering algorithm [35]. The recent work related to network intrusion detection using optimization algorithms, deep learning algorithms, and machine learning algorithms is given in Table 1.

Table 1. A brief description of the recent related works on network intrusion detection, using several optimization, deep, and machine learning algorithms.
Ref. | Dataset | Methods | Evaluation Metrics | Accuracy
[36] | N-BaIoT | LGBA-NN | precision, recall, F1-score, support | Gained 90% accuracy

To effectively select feature subsets and hyperparameters, Abdullah Alharbi et al. [36] proposed a local-global optimal neural network Bat algorithm (LGBA-NN). They tested it on the N-BaIoT dataset and compared it with several recent advanced methods, such as the particle swarm optimization (PSO) algorithm and the Bat algorithm. The experiments demonstrate that the LGBA-NN algorithm significantly improves the detection accuracy for multi-class botnet attacks, reaching up to 90%.
Jevgenijus Toldinas et al. [37] first transformed the original network data into four-channel (Red, Green, Blue, and Alpha) images, which were then divided into a training set and a testing set. The obtained images were classified with the ResNet50 model and evaluated on the UNSW-NB15 and BOUN_Ddos datasets. The experimental results demonstrate that the proposed model has a high accuracy in detecting anomalous attacks.
In response to the ongoing and changing malicious threats in the current cyber environment, Muhammad Ashfaq Khan [38] proposes a hybrid intrusion detection system based on deep learning (HRCNNIDS). The authors use convolutional neural networks to capture regional features and temporal features through recurrent neural networks and evaluate the proposed system based on CSE-CIC-IDS2018 dataset. The experimental results show that the method has a high detection rate for malicious attacks.
As the number and complexity of attacks on network environments have continued to rise in recent years, Guo Pu et al. [39] proposed an unsupervised anomaly detection method for network intrusion detection. They combined subspace clustering (SSC) and a one-class support vector machine (OCSVM) to achieve better performance on the NSL-KDD dataset.
Gia Nhu Nguyen et al. [40] introduced blockchain data transfer technology to the field of intrusion detection to both secure data and improve detection efficiency. Their proposed model uses sensors to collect data, after which intrusion detection is performed through deep belief networks. In addition, the model ensures privacy and security by sharing the created model. The authors evaluated the proposed model through NSL-KDD 2015 and CIDDS-001 datasets and achieved better results.
To address the existing problem of imbalanced intrusion detection data samples, Ranjit Panigrahi et al. [41] proposed a host-based intrusion detection algorithm. They first generate balanced samples from high-level imbalanced datasets in the preprocessing stage by an improved random sampling mechanism. Then the datasets are filtered by an improved multi-class feature selection mechanism. Finally, a merged tree construction algorithm based on the C4.5-based detector is built. The experimental results show that their proposed algorithm achieves a very high detection accuracy.
In their study, MohammadNoor Injadat et al. [42] fully considered the impact of sampling techniques on model performance, compared two feature selection techniques based on information gain and correlation, and finally proposed a multi-stage optimized intrusion detection system based on machine learning. In this system, they first sample the dataset with the SMOTE algorithm, then perform feature selection, and finally tune the parameters of the model with an optimization algorithm. Evaluation on the UNSW-NB15 and CIC-IDS2017 datasets demonstrates that the proposed model significantly improves the detection accuracy while reducing the training sample size and feature set size.
A study [43] introduces sample-weighted and class-weighted algorithms into support vector machines to solve the problems faced by intrusion detection. Experimental results show that the algorithm can achieve the advantages of short time consumption, high recognition accuracy, low false alarm rate and high classification accuracy in different situations.
Xiaokang Zhou et al. [44] propose a few-shot learning model based on a Siamese convolutional neural network for intrusion detection tasks. In their model, they first construct a Siamese convolutional neural network to measure the distance based on the optimized feature representation of the input samples, and then efficiently identify the cyber-physical attack types. Furthermore, to improve the efficiency of the training process, they propose a robust cost function that includes three specific losses (transformation loss in the relative feature representation, coding loss in the encoding process, and prediction loss based on the distance between features). The experimental results show that their proposed method has better detection performance.
To cope with the security problems in large-scale data streams, Xiaokang Zhou et al. [45] proposed a variational long short-term memory learning model based on a reconstructed feature representation. They first designed an encoder neural network associated with a variational reparameterization scheme to represent the low-dimensional features of the original data. Then, three loss functions are defined and quantified to constrain the reconstructed hidden variables to a more explicit and meaningful form. Experimental results show that the model can effectively deal with imbalance and high-dimensional problems, and also achieves better detection results.

Network Intrusion Detection Model Based on CNN and GRU
Traditional intrusion detection models are more concerned with time series features and ignore spatial features in the process of detecting attacks. The CNN structure is effective in extracting the spatial features of the data traffic, but its ability to extract long-distance dependency information is mediocre. The GRU structure is effective in extracting long-distance dependency information and can avoid forgetting during the learning process, but its number of parameters is large and its training time is long. Therefore, this paper integrates the two so that the model can fully extract features from both the spatial and temporal dimensions, and thus achieve a higher classification detection accuracy.
The proposed network intrusion detection model combining a convolutional neural network and a GRU, referred to as the CNN-GRU model, consists of three main stages. The structure of the model is shown in Figure 1.
(1) The pre-processing stage, in which the original data is converted into numerical features and normalized, the dataset is balanced by the ADRDB algorithm, the features are selected by the RFP algorithm, and the result is finally converted into a grey-scale map.
(2) The training stage, in which the features of the pre-processed data are first assigned different weights by the residual-based Convolutional Block Attention Module (CBAM); the spatial features are then extracted by the CNN module, and the spatial information is further aggregated by combining Averagepooling and Maxpooling. After that, the temporal features are extracted by multiple GRU units. Finally, the classification is performed by the Softmax function.
(3) The testing stage, in which the test set is passed into the trained model for classification.

Data Pre-Processing
In the pre-processing stage, this paper firstly converts the non-numerical features in the original flow data into numerical features and normalizes the features; secondly, a hybrid sampling algorithm (ADRDB algorithm) combining ADASYN and RENN is used for sampling; afterwards, the feature selection algorithm (RFP algorithm) is used for feature selection; finally, the resulting data is converted into a grey-scale map. The specific process of this stage is shown in Figure 1.

Non-Numerical Feature Transformation and Normalization
Raw data can be used as model input only after it has been cleaned, labelled, annotated and prepared. In this paper, the LabelEncoder class in the scikit-learn library is used to convert the non-numeric features in the raw data traffic to numeric features, so that all feature values are numeric and the model can learn the data features.
When non-numeric features are converted to numeric, there is a tendency for the clustering of sample points in the feature space to be guided by individual feature values and less influenced by other feature values. Data normalization rescales the features to a fixed range, thus reducing the impact of outliers. In this paper, we use min-max normalization to normalize the feature values to between zero and one, as shown in the formula:
$$h'_{i,j} = \frac{h_{i,j} - \min(h_{j})}{\max(h_{j}) - \min(h_{j})}$$
where $h_{i,j}$ represents the feature value in row i and column j of the dataset, and $\min(h_{j})$ and $\max(h_{j})$ are the minimum and maximum values of column j.
The normalized values are balanced by the proposed ADRDB algorithm for the majority and minority class samples, respectively, to obtain the balanced dataset. After that, the features are extracted by the RFP algorithm to obtain the pre-processed dataset.
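The two pre-processing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: `label_encode` mimics the sorted integer coding that scikit-learn's LabelEncoder produces, and the column names and values (`protocols`, `durations`) are made up for the example.

```python
def label_encode(column):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(column))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in column], classes


def min_max_normalize(rows):
    """Rescale every column of a row-major table to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [
            # A constant column has max == min; map it to 0.0 to avoid
            # dividing by zero.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Hypothetical toy data: one categorical and one numeric feature.
protocols = ["tcp", "udp", "icmp", "tcp"]
durations = [0.0, 2.0, 4.0, 1.0]

codes, classes = label_encode(protocols)
table = [[float(c), d] for c, d in zip(codes, durations)]
normalized = min_max_normalize(table)
```

Integer coding imposes an arbitrary order on the categories, which is one reason the subsequent normalization matters: without it, a feature with many categories would span a wider numeric range than the others.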

Hybrid Sampling Method Combining ADASYN and RENN
The core idea of the hybrid sampling method proposed in this paper, which combines ADASYN and RENN, is as follows: firstly, divide the original data set into a majority class sample set and a minority class sample set; secondly, obtain a new majority class sample set by undersampling with the RENN algorithm for the majority class sample set, and obtain a new minority class sample set by oversampling with the ADASYN algorithm for the minority class sample set; thirdly, the DBSCAN clustering algorithm is used to remove the noise in the new sample set, and then the two datasets are merged to obtain the balanced dataset. The specific steps of the hybrid sampling method combining ADASYN and RENN are as follows. The detailed procedure is shown in Algorithm 1.
The inputs to the algorithm are the original majority class sample set N and minority class sample set P and the number of samples contained in both, and the outputs are the balanced majority class sample set newN and minority class sample set newP.
(1) Calculate the degree of imbalance of the dataset, $d = |P| / |N|$, where $|P|$ and $|N|$ represent the number of samples in the minority and majority class sample sets, respectively.
(2) If $d < d_{th}$ (where $d_{th}$ is a pre-determined threshold for the maximum allowed degree of imbalance), the following operations are performed: firstly, calculate the number of samples $G$ that need to be generated for the minority class; secondly, for each sample $x_i$ in $P$, find its $k_1$ nearest neighbors and calculate the ratio $r_i = \Delta_i / k_1$, where $\Delta_i$ denotes the number of samples belonging to the majority class among the $k_1$ nearest neighbors of $x_i$; afterwards, normalize $r_i$ to $\hat{r}_i = r_i / \sum_i r_i$; finally, calculate the number of samples that need to be synthesized for each minority class sample, $g_i = \hat{r}_i \times G$.
Output: the balanced minority class sample set, newP, and the balanced majority class sample set, newN.
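The ADASYN part of steps (1) and (2) can be illustrated with a short sketch. It only computes how many synthetic samples each minority sample should receive; the interpolation itself, the RENN undersampling and the DBSCAN noise removal are not shown. The formula $G = (|N| - |P|) \times \beta$ and the parameter `beta` follow the standard ADASYN formulation and are an assumption here, as are the function name and the toy coordinates.

```python
import math


def adasyn_allocation(minority, majority, k=5, beta=1.0, d_th=1.0):
    """Return the number of synthetic samples to create per minority sample."""
    d = len(minority) / len(majority)          # degree of imbalance
    if d >= d_th:
        return [0] * len(minority)             # already balanced enough
    # Assumption: standard ADASYN total, G = (|N| - |P|) * beta.
    G = (len(majority) - len(minority)) * beta
    labelled = [(p, 0) for p in minority] + [(n, 1) for n in majority]
    ratios = []
    for i, x in enumerate(minority):
        # Distances from x to every other sample (x itself is skipped).
        others = [
            (math.dist(x, y), label)
            for j, (y, label) in enumerate(labelled)
            if j != i
        ]
        others.sort(key=lambda item: item[0])
        delta = sum(label for _, label in others[:k])  # majority neighbours
        ratios.append(delta / k)                       # r_i
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)
    return [round(r / total * G) for r in ratios]      # g_i


# Hypothetical 2D toy data: three minority points far from the majority
# cluster and one minority point inside it.
P = [(0, 0), (0, 1), (1, 0), (5, 5)]
N = [(5, 6), (6, 5), (5, 4), (4, 5), (6, 6),
     (4, 4), (6, 4), (4, 6), (7, 7), (8, 8)]
allocation = adasyn_allocation(P, N, k=3)
```

In this toy run the minority point surrounded by majority samples receives the largest share, which is the adaptive behaviour the ratio $r_i$ is meant to produce: harder-to-learn samples get more synthetic neighbours.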

Feature Selection Algorithm
This paper proposes a new feature selection algorithm to address the problem of feature redundancy. The algorithm first calculates the importance of sample features with the Random Forest algorithm and ranks them in order of importance; then analyses the correlation between features with the Pearson correlation coefficient; and finally combines the two results to select the features.
The Random Forest (RF) algorithm is an ensemble learning algorithm that uses Decision Trees as the base learner. In feature engineering, RF can identify important features among a large number of sample features. In essence, the algorithm calculates the contribution of each feature in each tree, averages these contributions, and compares the averages across features to identify the important ones [46]. Existing methods usually use the Gini index or the out-of-bag data error rate as the evaluation metric. The specific steps are as follows:
(1) For each base learner, select the corresponding out-of-bag data (the remaining samples that were not selected for training) and calculate its error, denoted as error_a;
(2) Randomly add disturbances to the feature across all samples of the out-of-bag data and calculate the error again, denoted as error_b;
(3) Assuming that the forest contains M trees, the importance of a feature is as follows, where the sum runs over the M trees:
$$\text{Importance} = \frac{1}{M}\sum \left(\text{error\_b} - \text{error\_a}\right)$$
(4) A new dataset is constructed by selecting the features with a high level of importance.
The Pearson correlation coefficient is used to measure the linear correlation between two variables X and Y, and its value lies in the range [−1, 1] [47]. It is obtained by dividing the covariance of the two variables by the product of their standard deviations:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
If the Pearson correlation coefficient is close to ±1, this indicates a high correlation between the two features, and the relationship can be well represented by a linear equation. If the Pearson correlation coefficient is close to zero, it means that there is no linear relationship between the two features.
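As an illustration of how the two results might be combined, the sketch below computes the Pearson coefficient directly from its definition and then keeps features greedily in decreasing order of importance, skipping any feature that is highly correlated with one already kept. The text does not state the exact combination rule, so the greedy rule, the threshold `corr_threshold`, the feature names and the importance scores are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance over the product of the std devs."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # The 1/n factors cancel; a constant column has no defined correlation.
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0


def select_features(columns, importance, corr_threshold=0.9):
    """Keep important features that are not redundant with kept ones."""
    order = sorted(columns, key=lambda name: importance[name], reverse=True)
    kept = []
    for name in order:
        redundant = any(
            abs(pearson(columns[name], columns[other])) >= corr_threshold
            for other in kept
        )
        if not redundant:
            kept.append(name)
    return kept


# Hypothetical feature columns and (made-up) random forest importances.
columns = {
    "bytes_sent": [1.0, 2.0, 3.0, 4.0, 5.0],
    "packets_sent": [2.0, 4.0, 6.0, 8.0, 10.0],  # linear in bytes_sent
    "duration": [5.0, 1.0, 4.0, 2.0, 3.0],
}
importance = {"bytes_sent": 0.5, "packets_sent": 0.3, "duration": 0.2}
selected = select_features(columns, importance)
```

Here `packets_sent` is dropped because it carries the same linear information as the more important `bytes_sent`, while the weakly correlated `duration` is retained.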
The pseudo code of the feature selection algorithm proposed in this paper is shown in Algorithm 2.
The pre-processed dataset is obtained by converting the data after data balancing and feature selection into a grayscale map. The converted grayscale maps for the different categories are shown in Figure 2.
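The text does not specify how a feature vector is laid out as a grayscale map, so the following is only one plausible sketch: the normalized vector is zero-padded to the next perfect square, reshaped row by row, and scaled to 8-bit intensities. The function name `to_grayscale` and the padding choice are assumptions.

```python
import math


def to_grayscale(features):
    """Turn a [0, 1] feature vector into a square 8-bit grayscale image."""
    side = math.ceil(math.sqrt(len(features)))
    # Assumption: unused pixels are filled with zeros (black).
    padded = list(features) + [0.0] * (side * side - len(features))
    pixels = [int(round(v * 255)) for v in padded]
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Five normalized features become a 3 x 3 image with four padding pixels.
image = to_grayscale([0.0, 0.2, 1.0, 0.4, 0.8])
```

Laying the features out in two dimensions is what allows the 2D convolution and the spatial attention described later to operate on tabular traffic records.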

Model Structure

Convolutional Block Attention Module
The purpose of introducing an attention mechanism in the model is mainly to improve the representational power: a larger weight is given to important features and a smaller weight to unnecessary features. The Convolutional Block Attention Module (CBAM) is a lightweight attention module proposed by Woo et al. in 2018, which consists of two main parts: the channel attention module and the spatial attention module [48].
The specific structure of the channel attention module in CBAM is shown in Figure 3. It uses the relationship of features between channels to generate a channel attention map from the input. The spatial information is first aggregated using Averagepooling and Maxpooling, respectively, to generate two different sources of spatial information, $F^{c}_{avg}$ and $F^{c}_{max}$. The two spatial information sources are fed into a shared network consisting of a multi-layer perceptron with one hidden layer, the two output feature vectors are merged by element-wise summation, and the required channel attention map $M_c$ is generated:
$$M_c(F) = \sigma\left(W_1\left(W_0\left(F^{c}_{avg}\right)\right) + W_1\left(W_0\left(F^{c}_{max}\right)\right)\right)$$
where σ represents the sigmoid function, $W_0$ and $W_1$ are the weights of the multi-layer perceptron, F is the input and $M_c$ is the final output.
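The channel attention computation can be sketched with NumPy as follows. This is an illustrative forward pass with random weights, not the trained module: the tensor shape, the reduction ratio `r` and the ReLU in the hidden layer (as in the original CBAM formulation) are assumptions for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F, W0, W1):
    """Channel attention map M_c for a feature map F of shape (C, H, W).

    W0 has shape (C // r, C) and W1 has shape (C, C // r); both are shared
    between the average-pooled and the max-pooled descriptors.
    """
    f_avg = F.mean(axis=(1, 2))     # average pooling over H and W
    f_max = F.max(axis=(1, 2))      # max pooling over H and W

    def mlp(v):
        # One hidden layer with ReLU, then projection back to C channels.
        return W1 @ np.maximum(W0 @ v, 0.0)

    # Element-wise sum of the two branches, then sigmoid.
    return sigmoid(mlp(f_avg) + mlp(f_max)).reshape(-1, 1, 1)


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2             # hypothetical sizes
F = rng.standard_normal((C, H, W))
W0 = rng.standard_normal((C // r, C)) * 0.1
W1 = rng.standard_normal((C, C // r)) * 0.1

M_c = channel_attention(F, W0, W1)  # one weight per channel
F_refined = M_c * F                 # broadcast over H and W
```

Each channel of the input is scaled by a single weight between zero and one, so informative channels are emphasised and uninformative ones are suppressed.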
The specific structure of the spatial attention module in CBAM is shown in Figure 4. It uses the spatial relationships between features to generate a spatial attention map from the input. Two 2D feature maps, $F^{s}_{avg}$ and $F^{s}_{max}$, are first generated by Averagepooling and Maxpooling, which effectively highlight the informative regions. The two are then fused, and the desired 2D spatial attention map is generated by a standard convolution:
$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[F^{s}_{avg}; F^{s}_{max}\right]\right)\right)$$
where σ represents the sigmoid function, $f^{7 \times 7}$ represents a convolution operation with filter size 7 × 7, F is the input and $M_s$ is the final output.
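A minimal NumPy sketch of the spatial attention step is given below. The convolution is written as an explicit loop with zero padding so that the example is self-contained; the random kernel and the tensor sizes are placeholders, not trained values.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Single-output-channel convolution with zero padding.

    x has shape (C_in, H, W) and kernel has shape (C_in, k, k); the result
    has shape (H, W). As in deep learning frameworks, this is technically a
    cross-correlation (the kernel is not flipped).
    """
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out


def spatial_attention(F, kernel):
    """Spatial attention map M_s for a feature map F of shape (C, H, W)."""
    f_avg = F.mean(axis=0, keepdims=True)             # pool across channels
    f_max = F.max(axis=0, keepdims=True)
    stacked = np.concatenate([f_avg, f_max], axis=0)  # shape (2, H, W)
    return 1.0 / (1.0 + np.exp(-conv2d_same(stacked, kernel)))


rng = np.random.default_rng(1)
F = rng.standard_normal((8, 5, 5))                    # hypothetical input
kernel = rng.standard_normal((2, 7, 7)) * 0.05        # 7 x 7 filter
M_s = spatial_attention(F, kernel)
F_refined = M_s[np.newaxis, :, :] * F                 # same weight per pixel
```

Whereas channel attention yields one weight per channel, this map yields one weight per spatial position, shared by all channels.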
CBAM applies the channel attention module and the spatial attention module in turn, enabling the model to learn features along the channel and spatial axes, respectively. Given the initial feature map F as input, CBAM generates a 1D channel attention map $M_c$ and a 2D spatial attention map $M_s$ in turn. The whole process is shown in Figure 5 and summarized as follows:
$$F' = M_c(F) \otimes F$$
$$F'' = M_s(F') \otimes F'$$
where ⊗ denotes element-wise multiplication, $F'$ is the output after the channel attention module and $F''$ is the final output.

Convolutional Neural Networks
The two most popular existing CNN structures are the Residual Network (ResNet) [49] and the Inception Network [50]: ResNet introduces the concept of residual connections, and Inception introduces the concept of split-transform-merge.
In order to improve the expressive power of the CNN and to fully learn the diversity of features in the classification process, this paper follows the idea of Inception: the input data is processed by multiple convolutional neural networks so that the model can learn simple to complex feature transformations. 2D convolution has shown excellent performance in the field of computer vision, so this paper uses 2D convolution to extract the spatial features of the data.
The CNN module used in this paper works as follows: firstly, the processed grayscale maps are input to three convolutional modules with strides of 1 × 1, 1 × 1 and 3 × 3, respectively, to extract features. After that, the resulting features are fused to obtain the processed feature maps. Its structure is shown in Figure 6.

Model Structure
The intrusion detection model proposed in this paper consists of three main parts. In order to learn the features of the data comprehensively and effectively, the features are first assigned different weights by the residual-based CBAM attention module; secondly, the spatial features are extracted by the CNN module, and the spatial information is further aggregated using fused Averagepooling and Maxpooling; then the temporal features are extracted by the GRU. Finally, the classification is carried out by the Softmax function. The specific structure of the model is shown in Figure 7.
(1) The grayscale map obtained after pre-processing is input to the residual-based CBAM module, and the features are given different weights to obtain the output F.
(2) The new feature map F is input into the CNN module for feature extraction, after which the spatial information is aggregated using Maxpooling and Averagepooling to obtain the new feature map $F_C$.
(3) Pass $F_C$ to the GRU unit to extract the dependencies between features and obtain the output $F_G$.
(4) Pass $F_G$ to the fully connected layer, which uses Softmax as the activation function to achieve the classification of intrusion detection behavior.
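Steps (3) and (4) can be sketched as a NumPy forward pass. The GRU equations below are the commonly used formulation (update gate z, reset gate r, candidate state), which the text does not spell out, so they are an assumption; the weights are random and the sizes (six time steps, four features, five hidden units, five classes) are placeholders.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_forward(X, p, hidden):
    """Run a single-layer GRU over X of shape (T, D); return the last state."""
    h = np.zeros(hidden)
    for x in X:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])        # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])        # reset gate
        h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
        h = (1.0 - z) * h + z * h_cand                          # new state
    return h


def softmax(v):
    e = np.exp(v - v.max())        # shift for numerical stability
    return e / e.sum()


rng = np.random.default_rng(2)
T, D, hidden, n_classes = 6, 4, 5, 5     # hypothetical sizes
params = {}
for gate in "zrh":
    params["W" + gate] = rng.standard_normal((hidden, D)) * 0.1
    params["U" + gate] = rng.standard_normal((hidden, hidden)) * 0.1
    params["b" + gate] = np.zeros(hidden)

F_C = rng.standard_normal((T, D))        # stands in for the CNN output
F_G = gru_forward(F_C, params, hidden)   # step (3)

W_out = rng.standard_normal((n_classes, hidden)) * 0.1
b_out = np.zeros(n_classes)
probs = softmax(W_out @ F_G + b_out)     # step (4): class probabilities
predicted_class = int(np.argmax(probs))
```

The gates let the hidden state carry information across many time steps, which is the long-distance dependency property that motivates adding the GRU after the CNN.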

Experimental Setup
To test the performance of the proposed network intrusion detection method fusing CNN and GRU, several sets of experiments are designed in this paper. The model parameters, determined after several rounds of experimental validation, are specified in Table 2.

Dataset and Evaluation Criteria
Over the years, many datasets related to intrusion detection have been introduced for research and development, including KDDCup99 [51], UNSW-NB15 [52], NSL-KDD [53], CIC-IDS2017 [54] and LITNET-2020 [55]. In this paper, we use the UNSW-NB15, NSL-KDD and CIC-IDS2017 datasets to evaluate the proposed model. These are among the most widely used datasets in the intrusion detection field. NSL-KDD is a relatively early intrusion detection dataset, and the related research is more mature. UNSW-NB15 and CIC-IDS2017 are more recent datasets that better reflect the real network environment. The three datasets are described below.
The NSL-KDD dataset is an improvement of the KDD99 dataset: it removes the redundant and duplicate records from the KDD99 training and testing sets, so that both sets are constructed in a more reasonable way. It contains 41 attribute features and one category feature, covering five classes: Normal, Probe, DoS, R2L, and U2R. The distribution of the different categories in the dataset is shown in Table 3.
The UNSW-NB15 dataset was generated in 2015 by the Cyber Range Laboratory of the Australian Centre for Cyber Security (ACCS) using the IXIA PerfectStorm tool to simulate realistic network environments. The dataset consists of 47 attribute features and two category features, and contains nine attack types: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. In this paper, we directly use the initial training set and testing set to test the performance of the model. The distribution of the different categories in the dataset is shown in Table 4.
The CIC-IDS2017 dataset was collected by the Canadian Institute for Cybersecurity (CIC) from 3 to 7 July 2017. It contains benign traffic as well as recent common attacks in the field of network intrusion, filling the gap of no cyber-based attacks in the UNSW-NB15 dataset. The dataset contains 78 attribute features and one category feature covering 15 types. In this paper, anomalous behaviors of a similar nature are merged, and the final dataset contains nine categories: Benign, DoS, PortScan, DDoS, Patator, Bot, Web attack, Infiltration, and Heartbleed. The distribution of the different categories in the dataset is shown in Table 5.
The evaluation of the network intrusion detection model relies on four main metrics: accuracy, precision, recall, and F1-score.
In the detection results, T (true) and F (false) indicate whether a sample is classified correctly or incorrectly, respectively. P (positive) and N (negative) indicate whether the detection system predicts a sample to be abnormal or normal, respectively. Every sample in the dataset falls into one of four categories: TP, TN, FP and FN. TP indicates that the system predicts the data as abnormal attack data and the prediction is correct; TN indicates that the system predicts the data as normal data and the prediction is correct; FP indicates that the system predicts the data as abnormal attack data but the prediction is wrong; FN indicates that the system predicts the data as normal data but the prediction is wrong. The classification results of the model are represented by the confusion matrix, as shown in Table 6.
Accuracy describes the ratio of the number of correctly predicted samples to the total number of samples and is calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision describes the ratio of the number of samples correctly predicted to be positive to the number of all samples predicted to be positive and is calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall describes the ratio of the number of samples correctly predicted to be positive to the number of all samples that are actually positive and is calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1-score is the harmonic mean of precision and recall and is calculated as follows:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
It can be seen that $F_1$ takes a large value only when both recall and precision are large.
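As a small illustration of the four metrics defined above, the following sketch computes them from the confusion-matrix counts for the binary (abnormal versus normal) case. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted-positive samples that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of truly positive samples that were predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(metrics)
```

With these counts, precision (80/90) is higher than recall (80/100), and the F1-score lies between the two, closer to the smaller value.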

Experimental Results and Analysis
To fully verify the effectiveness of the proposed model, several sets of experiments are set up in this paper, as described in the following subsections.

Experiment on Feature Selection Analysis
To verify the classification performance of the CNN-GRU algorithm proposed in this paper, the public datasets UNSW_NB15, NSL-KDD and CIC-IDS2017 were selected. This section uses the UNSW_NB15 dataset as the detailed example and describes the feature selection process of the RFP algorithm proposed in this paper.
In this paper, the importance of each feature in the UNSW_NB15 dataset is first calculated by the Random Forest algorithm, and the features are ranked by importance, as shown in Figure 8. The figure shows that the importance of different features varies greatly: for example, the importance value of the feature sbytes is 0.115, while the importance values of is_ftp_login and ct_ftp_cmd are zero. The importance values of all features are distributed between 0 and 0.12. Feature selection based on feature importance alone relies on a single criterion, and the results obtained are not very convincing, so this paper combines feature importance and Pearson correlation analysis for feature selection. To visualize the correlation between features, a feature correlation diagram was created, as shown in Figure 9. The figure makes the degree of correlation among the 42-dimensional features clear. To further observe whether features X and Y are correlated in their plane distribution, a correlation graph with feature X as the x-axis and feature Y as the y-axis is established.
Because of the large number of feature dimensions, this paper selects the feature pairs with correlation indices greater than or equal to 0.9 or less than or equal to −0.9 for analysis, as shown in Table 7. Feature selection then combines Figures 8 and 9. For features with strong linear correlation, the more important feature is retained according to the degree of importance; for features with weak linear correlation, the importance index is analyzed, and features with an index lower than 0.001 are eliminated; for features whose correlation index is not within the analysis interval, the importance index is also analyzed, and features with an importance level below 0.0001 are excluded. Finally, 28 features remain for the NSL-KDD dataset, 28 features remain for the UNSW_NB15 dataset, and 52 features remain for the CIC-IDS2017 dataset.
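The selection rule described above can be sketched as follows. This is a simplified illustration and not the authors' RFP implementation: it applies only the strong-correlation rule (|r| ≥ 0.9, keep the more important feature) followed by a single importance cut-off of 0.001, and it takes the Random Forest importances as given rather than computing them. The function names, the toy column values, and the importance values for `spkts` and `dur` are hypothetical; only the values 0.115 for `sbytes` and 0 for `is_ftp_login` come from the text.

```python
import math
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(columns, importance, corr_threshold=0.9,
                    min_importance=0.001):
    """Keep the more important feature of each strongly correlated pair,
    then drop remaining features whose importance is below the cut-off."""
    kept = set(columns)
    for a, b in combinations(columns, 2):
        if a in kept and b in kept:
            if abs(pearson(columns[a], columns[b])) >= corr_threshold:
                # Discard the less important member of the pair.
                kept.discard(a if importance[a] < importance[b] else b)
    return [name for name in columns
            if name in kept and importance[name] >= min_importance]


# Toy data: spkts is almost a linear function of sbytes.
columns = {
    "sbytes": [1, 2, 3, 4, 5],
    "spkts": [2, 4, 6, 8, 11],
    "dur": [5, 1, 4, 2, 3],
    "is_ftp_login": [0, 0, 0, 0, 0],
}
importance = {"sbytes": 0.115, "spkts": 0.05, "dur": 0.03,
              "is_ftp_login": 0.0}
selected = select_features(columns, importance)
print(selected)
```

In this toy case `spkts` is removed because it is strongly correlated with the more important `sbytes`, and `is_ftp_login` is removed because its importance is zero.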

Comparison Experiments of Different Feature Selection Methods
To verify the effectiveness and applicability of the feature selection method proposed in this paper, this section sets up a comparison experiment of different feature selection methods: the proposed feature selection method (RFP) is compared with existing methods such as PCA [23] and AE [24] under the same experimental conditions. The data obtained after pre-processing is first sampled using the ADRDB algorithm to balance the dataset. Each of the three methods is then used to reduce the features of the NSL-KDD dataset to 28 dimensions, the features of the UNSW_NB15 dataset to 28 dimensions, and the features of the CIC-IDS2017 dataset to 52 dimensions. Finally, the classification experiments are conducted with the proposed model, with the model parameters set as shown in Table 1. The results obtained are shown in Table 8, where bold text indicates the evaluation indicators of the proposed model. Table 8 shows that the model achieves better results on the data processed by the proposed RFP algorithm. The reasons are as follows. The PCA method relies on variance when reducing the dimensionality of the data, but non-principal components with small variance may also contain important information about the differences between samples, so the dimensionality reduction affects the subsequent data processing. The AE method depends heavily on the training data when reconstructing the feature space. Thus, neither of these two methods achieves better results. The RFP algorithm proposed in this paper starts from the data itself and selects features according to their importance and the correlation between them, which improves the classification accuracy of the model.

Experiment Comparing Single Model with Hybrid Model
To verify the effectiveness of the proposed model for intrusion recognition, this section sets up performance analysis experiments on the intrusion detection model fusing CNN and GRU. Under the same experimental conditions, the datasets were processed using the sampling and preprocessing methods described in Section 3.1, after which CNN, GRU and CNN-GRU were tested on the UNSW-NB15, NSL-KDD and CIC-IDS2017 datasets to obtain their classification accuracy, precision, recall and F1-score values, as shown in Table 9. The bold text indicates the evaluation indicators of the proposed model. As can be seen from the table, compared with using CNN or GRU alone, the CNN-GRU model extracts the features of the original data more effectively and therefore performs intrusion detection more effectively. The detection accuracy, recall, precision, and F1-score on the NSL-KDD dataset reached 99.69%, 99.65%, 99.69%, and 99.70%, respectively; on the UNSW_NB15 dataset they reached 86.25%, 86.92%, 86.25%, and 85.59%, respectively; and on the CIC-IDS2017 dataset they reached 99.65%, 99.63%, 99.65%, and 99.64%, respectively. The reason is that the CNN structure can learn spatial features effectively by increasing the width of the network, while the GRU structure is better at extracting the temporal features of the data. The model in this paper fuses CNN and GRU to learn both spatial and temporal features of the data, and introduces the attention mechanism to learn features comprehensively and effectively, thus achieving better results.

Comparison Experiments of Different Sampling Methods
To solve the problem of the unbalanced dataset, this paper adopts a hybrid ADASYN and RENN sampling method (the ADRDB algorithm) to process the dataset. To verify the effectiveness of the proposed method, this section sets up a comparison experiment of different sampling methods: under the same experimental conditions, seven different methods, namely SMOTE, ADASYN, Random Oversampler, ENN, RENN, Random Undersampler, and ADRDB, are used to handle the imbalanced dataset. The RFP algorithm is then used to filter the features. Finally, the classification experiments are conducted with the proposed model, with the model parameters set as shown in Table 1. The resulting detection accuracy is shown in Table 10, where bold text indicates the evaluation indicators of the proposed model. As can be seen from the table, among the sampling methods compared, the ADRDB algorithm proposed in this paper handles unbalanced samples better. The reasons are as follows. Single oversampling methods such as Random Oversampler, SMOTE and ADASYN cannot effectively discriminate noisy data and are prone to generating a large amount of noisy data when synthesizing new samples, which degrades the classification performance of the model. Single undersampling methods such as Random Undersampler, ENN and RENN are prone to losing key information from the majority-class samples, which also degrades the classification performance of the model. The proposed hybrid sampling method samples the majority and minority classes separately and rejects the noisy data with the DBSCAN algorithm, which both avoids the loss of key information and reduces the influence of noisy data on the classifier, thus achieving better results.
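To make the undersampling half of the hybrid scheme more concrete, the sketch below shows a plain Repeated Edited Nearest Neighbours (RENN) pass in pure Python. It is not the authors' ADRDB algorithm: the ADASYN oversampling step and the DBSCAN noise rejection are not shown, and the function names, the choice of k = 3, and the toy points are all assumptions made for illustration. The idea is that a sample of the targeted (majority) class is dropped when most of its k nearest neighbours belong to other classes, and the pass is repeated until nothing more is removed.

```python
import math
from collections import Counter


def enn_pass(points, labels, target, k=3):
    """One Edited Nearest Neighbours pass over the `target` class."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != target:
            keep.append(i)  # Only the targeted class is edited.
            continue
        # Indices of all other samples, nearest first (Euclidean distance).
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] == y:
            keep.append(i)  # Neighbourhood agrees with the label.
    return [points[i] for i in keep], [labels[i] for i in keep]


def renn(points, labels, target, k=3, max_iter=100):
    """Repeat ENN passes until no sample is removed."""
    for _ in range(max_iter):
        new_points, new_labels = enn_pass(points, labels, target, k)
        if len(new_points) == len(points):
            break
        points, labels = new_points, new_labels
    return points, labels


# Toy data: class 0 is the majority; (5.0, 5.2) is a class-0 sample
# lying inside the class-1 cluster and should be edited out.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.2),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
clean_points, clean_labels = renn(points, labels, target=0, k=3)
print(clean_points, clean_labels)
```

Only the majority-class sample surrounded by minority-class neighbours is removed here; the minority class is left untouched, which is the behaviour an undersampling step is meant to have.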

Comparison Experiments with Different Pooling Methods
In this paper, we use a fusion of max pooling and average pooling to address the insufficient feature extraction ability of the model. To verify the effectiveness of the proposed method, this section sets up a comparison experiment of different pooling methods: under the same experimental conditions, the dataset is first processed by the preprocessing method described in Section 3.1, and the model then aggregates spatial information by each of three methods: average pooling, max pooling, and fusion pooling. The model parameters are those given in Table 1. The resulting detection accuracy is shown in Table 11, where bold text indicates the evaluation indicators of the proposed model. Averagepooling extracts features by averaging the global range of features, while Maxpooling extracts features by taking the maximum value of the feature points in the neighborhood. The experimental results show that fusion pooling improves the model's ability to learn features and that the classification results improve accordingly.
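A minimal sketch of fusion pooling on a single 2D feature map is given below. The text does not state how the two pooled maps are combined, so the element-wise sum used here is an assumption (concatenation would be another plausible choice); the function names, the 2 × 2 window, and the toy feature map are likewise hypothetical.

```python
def pool2d(fmap, size, reduce):
    """Non-overlapping pooling: apply `reduce` to each size x size window."""
    rows, cols = len(fmap), len(fmap[0])
    return [[reduce([fmap[r + i][c + j]
                     for i in range(size) for j in range(size)])
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]


def fused_pool2d(fmap, size=2):
    """Max pooling and average pooling fused by element-wise sum (assumed)."""
    max_map = pool2d(fmap, size, max)
    avg_map = pool2d(fmap, size, lambda window: sum(window) / len(window))
    return [[m + a for m, a in zip(max_row, avg_row)]
            for max_row, avg_row in zip(max_map, avg_map)]


# Toy 4 x 4 feature map.
feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
fused = fused_pool2d(feature_map)
print(fused)
```

Each output cell keeps both the strongest response in its window (from max pooling) and the overall level of the window (from average pooling), which is the motivation given above for combining the two.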

Comparison Experiments of Run-Time Performance
To evaluate the run-time performance of the proposed model, this section sets up comparison experiments. Under the same experimental conditions, experiments are conducted for each of the following five scenarios to compare their training time, prediction time and classification accuracy:
(1) "Before RFP" means that the data is only sampled using the ADRDB algorithm during the pre-processing stage, and then converted into grayscale maps for input into the classification model;
(2) "Before ADRDB" means that the data is not sampled using the ADRDB algorithm in the pre-processing stage; the normalized data only undergoes feature selection with the RFP algorithm, and is then converted into grayscale maps for input into the classification model;
(3) "CNN" means that the data is preprocessed using the method described in Section 3.1, and then converted into grayscale maps for input into the classification model, which uses only the CNN structure;
(4) "GRU" means that the data is preprocessed using the method described in Section 3.1, and then converted into grayscale maps for input into the classification model, which uses only the GRU structure;
(5) "Proposed Model" means that the data is preprocessed using the method described in Section 3.1, and then converted into grayscale maps for input into the proposed classification model.
In addition, "Train" indicates the time taken to train the model for 200 epochs. The experimental results are shown in Table 12, where bold text indicates the evaluation indicators of the proposed model.
Table 12 shows that the proposed model consumes more time than "CNN", "GRU" and "Before ADRDB", and less time than "Before RFP". The main reasons are as follows. The ADRDB algorithm generates new samples during sampling, so more samples are used for training and testing, which takes more time. The RFP algorithm reduces the number of features in each sample, which takes less time. A single CNN or GRU structure has fewer model parameters and therefore takes relatively less time. The proposed model consumes relatively more time, but its accuracy is improved. Figure 10 plots the classification accuracy and loss value of the CNN-GRU intrusion detection model against the number of iteration steps. The figure shows that the model converges well.
To further verify the effectiveness of the intrusion detection model proposed in this paper, a performance comparison experiment is set up in this section: under the same experimental conditions, common machine learning methods such as random forest, K-means clustering and decision tree, as well as other recently proposed intrusion detection models, are applied to the datasets, and the performance comparison is shown in Table 13. The bold text indicates the evaluation indicators of the proposed model.
Compared with machine learning algorithms, this model learns features through neural networks, which can combine low-level features to form a more abstract, non-linear high-level representation and then explore the input-output relationship in the data, effectively improving the accuracy of intrusion detection. Compared with the S-ResNet, CNN, CNN-GRU, CNN-LSTM, CNN-BiLSTM and CNN-GRU-attention models, the CNN-GRU intrusion detection model proposed in this paper reduces the effects of feature redundancy and class imbalance on the one hand, and extracts both spatial and temporal features of the data on the other, so that the extracted feature information is more comprehensive and better results are achieved.

Statistical Test
To evaluate the proposed intrusion detection method accurately, significance tests were conducted on the three datasets, UNSW_NB15, NSL-KDD, and CIC-IDS2017, following the methods used in the literature [62,63]. We used a two-tailed t-test to analyze the significance of the indicators obtained by the proposed method and thus verify that the results were not obtained by chance; the calculation is given by Equations (10)-(12). We use the UNSW_NB15 dataset as an example for detailed discussion. First, the average accuracy of the proposed method is calculated as 0.8625 according to Equation (10), where $\hat{y}_i$ represents the i-th test accuracy value and n = 20. Next, the standard deviation of the accuracy is calculated as 0.01040 by Equation (11). Finally, the test statistic X is calculated by Equation (12), where $y_0$ is the assumed minimum accuracy value and $y_0$ = 0.8438. The resulting value, 5.268, is greater than the critical value of 2.845 obtained from the two-tailed t-test table. This result indicates that the tested accuracy of the proposed model is greater than the assumed minimum value of 0.8438 at a confidence level of 1 − α = 0.99. The experimental results for the three datasets are shown in Table 14. In Table 14, the assumed minimum accuracies are all higher than the maximum values achieved by the other models mentioned in the paper. Therefore, the proposed model shows a statistically significant difference compared with the single CNN model, the single GRU model and other existing models, and has a significant advantage over them.
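The procedure described above (mean, standard deviation, then a statistic relative to an assumed minimum accuracy) has the shape of a standard one-sample t-test, which can be sketched as follows. This is an illustration under assumptions, not a reproduction of Equations (10)-(12): the sample standard deviation with an n − 1 denominator is assumed, the function name `one_sample_t` is hypothetical, and the accuracy values are made up for the example rather than taken from the paper's 20 runs.

```python
import math
from statistics import mean, stdev


def one_sample_t(accuracies, y0):
    """Return (mean, sample std, t statistic) for H0: true mean equals y0."""
    n = len(accuracies)
    y_bar = mean(accuracies)
    s = stdev(accuracies)  # Sample standard deviation (n - 1 denominator).
    t = (y_bar - y0) / (s / math.sqrt(n))
    return y_bar, s, t


# Hypothetical accuracies from repeated runs and a hypothetical baseline.
runs = [0.85, 0.86, 0.87, 0.86]
y_bar, s, t_stat = one_sample_t(runs, y0=0.85)
print(f"mean={y_bar:.4f} std={s:.5f} t={t_stat:.3f}")
```

The statistic is then compared with the critical value from a t-table for n − 1 degrees of freedom at the chosen significance level; a statistic above the critical value supports the claim that the mean accuracy exceeds the assumed minimum.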

Conclusions
Traditional intrusion detection models generally suffer from incomplete feature extraction and mediocre multi-classification performance. To address these problems, this paper proposes an intrusion detection model that fuses convolutional neural networks and gated recurrent units. The model addresses dataset imbalance and feature redundancy by using the ADRDB algorithm and the RFP algorithm, and then achieves comprehensive and sufficient feature learning by fusing CNN and GRU, while introducing the attention module to assign different weights to the features, thus reducing the overhead and improving the model performance. The accuracy of the proposed model on the NSL-KDD, UNSW_NB15 and CIC-IDS2017 datasets is 99.69%, 86.25% and 99.65%, respectively. Furthermore, the precision can reach 99.65%, 86.92% and 99.63%.
In summary, through feature selection analysis experiments, hybrid model versus single model comparison experiments, feature extraction method comparison experiments, pooling method comparison experiments and performance analysis experiments on the datasets, this paper demonstrates that the model has a strong feature extraction capability, high detection accuracy and a low false alarm rate when dealing with large-scale, high-dimensional network data. It also improves the detection of minority classes, which makes it promising for real-time intrusion detection systems. However, the proposed model still has some drawbacks: first, the number of model parameters is relatively high; second, the running time of the model is relatively long; third, although the detection accuracy for minority-class samples is improved, the improvement is modest. In subsequent research, we will study model lightweighting, further improve the accuracy of minority-class detection and the overall classification performance of the model, and further reduce the running-time cost.

Data Availability Statement: All data used in this paper can be obtained by contacting the authors of this study.