Using a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) to Classify Network Attacks

Abstract: An intrusion detection system (IDS) identifies whether network traffic behavior is normal or abnormal, and can also identify the attack types. Recently, deep learning has emerged as a successful approach in IDSs, having a high accuracy rate with its distinctive learning mechanism. In this research, we developed a new method for intrusion detection to classify the NSL-KDD dataset by combining a genetic algorithm (GA) for optimal feature selection and long short-term memory (LSTM) with a recurrent neural network (RNN). We found that using LSTM-RNN classifiers with the optimal feature set improves intrusion detection. The performance of the IDS was analyzed by calculating the accuracy, recall, precision, f-score, and confusion matrix. The NSL-KDD dataset was used to analyze the performances of the classifiers. An LSTM-RNN was used to classify the NSL-KDD datasets into binary (normal and abnormal) and multi-class (normal, DoS, probing, U2R, and R2L) sets. The results indicate that applying the GA increases the classification accuracy of the LSTM-RNN in both binary and multi-class classification. The results of the LSTM-RNN classifier were also compared with the results obtained using a support vector machine (SVM) and random forest (RF). For multi-class classification, the classification accuracy of LSTM-RNN with the GA model is much higher than that of SVM and RF. For binary classification, the classification accuracy of LSTM-RNN is similar to that of RF and higher than that of SVM.


Introduction
With the rapid growth of networks in accessing valuable information, network-based services are growing faster than ever before. This situation has raised the number of cyber offense cases. A network intrusion detection system (NIDS) is a tool that detects malicious activities in a system that violate security and privacy. Any malicious activity is typically reported to an administrator or collected centrally. A security information and event management (SIEM) system collects and combines information about malicious activities from multiple sources and distinguishes malicious activity from false alarms. Denning (1987) implemented an IDS for the first time [1], and since then, research on IDS has developed to provide better solutions that may protect network systems from various types of network attacks.
Organizations are using NIDSs to protect their systems from attacks that come with network connectivity. NIDSs are classified into two categories: (1) signature-based NIDSs and (2) anomaly-based NIDSs. A signature-based NIDS filters the signatures of the attack patterns. The analysis is static and is limited to the detection of only known attack patterns. Signature-based NIDSs give high accuracy and low false-alarm rates for detecting known attacks, but they perform poorly against previously unseen attacks. In this research, the proposed model achieved an accuracy of about 93.88% for multi-class classification and about 99.91% for binary classification, and the LSTM-RNN also provided a high TPR and a low FPR.
The rest of the paper is organized as follows. Section 2 provides the background on deep learning, long short-term memory (LSTM) networks, recurrent neural networks (RNNs), and genetic algorithms (GAs). It also describes the NSL-KDD dataset, which is used in the research. Section 3 discusses the existing research on LSTM and RNN in NIDS, which is related to this research. Section 4 presents the methodology of the LSTM-RNN with the GA model in detail. Section 5 presents the results of various experiments as well as a performance comparison of several classifiers. In Section 6, we discuss the results of our experiments. Finally, we provide conclusions and future work in Section 7.

Deep Learning Architecture
Deep learning is one of the machine learning methods that implements artificial neural networks. A deep learning network is a multi-layer neural network. Deep learning networks include deep neural networks (DNN), convolutional neural networks (CNN), recurrent neural networks (RNN), deep belief networks (DBN), and others. In this section, we will describe the architecture of RNN and long short-term memory (LSTM).

Recurrent Neural Network (RNN)
An RNN is a type of artificial neural network (ANN) wherein the connections between the nodes resemble the neurons of a human brain. Neural network connections can transmit signals to other neurons/nodes [24], like synapses in a biological brain. An artificial neuron processes the received signal and transmits it to the other connected neurons/nodes. Neurons and connections typically have weights that are adjusted during the learning process. A weight can vary to adjust the strength of the signal as it travels from the input layer to the output layer. An ANN contains hidden layers between the input and output layers, so an RNN has at least three layers: input, hidden, and output. The basic architecture of an RNN includes input units, output units, and hidden units, with the hidden units performing all the calculations by weight adjustment to produce the outputs [18,25,26]. The RNN model has a one-way flow of information from the input units to the hidden units, plus a directional loop that compares the error of the current hidden layer with that of the previous hidden layer and adjusts the weights between the hidden layers. Figure 1 represents a simple RNN architecture with two hidden layers.
Information 2020, 11, 243

An RNN is an extension of the traditional feed-forward neural network (FFNN). In an FFNN, information moves in only the forward direction, from the input nodes through the hidden nodes to the output nodes; there are no cycles or loops in the network. Hidden layers are optional in traditional FFNNs. We assume an input vector sequence, a hidden vector sequence, and an output vector sequence denoted by X, H, and Y, respectively. An input vector sequence is given as X = (x_1, x_2, ..., x_T). A traditional RNN calculates the hidden vector sequence H = (h_1, h_2, ..., h_T) and the output vector sequence Y = (y_1, y_2, ..., y_T) for t = 1 to T as follows:

h_t = σ(W_xh x_t + W_hh h_{t−1} + b_h)  (1)

y_t = W_hy h_t + b_y  (2)

where the function σ is a nonlinear activation function, W is a weight matrix (e.g., W_xh is the input-to-hidden weight matrix), and b is a bias term.
In Equation (1), h_t is the hidden-layer output at time step t, and h_{t−1} denotes the hidden-layer output at the previous time step. An RNN uses gradient-based methods to learn time sequences: back-propagation through time (BPTT) or real-time recurrent learning (RTRL) [27]. In BPTT, the network is unfolded through time into a multilayer FFNN, with one layer per time step of the processed sequence; the training data are first fed through the model, and then the output error gradient is saved for each time step. BPTT uses the standard backpropagation algorithm to train the unfolded FFNN, and it updates the weights using the sum of the gradients obtained for the weights in all layers of the network.
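The recurrence can be made concrete with a minimal sketch. The scalar weights below are illustrative placeholders, not trained parameters from this research; the loop simply applies the hidden-state update of Equation (1) followed by the output mapping at each time step:

```python
import math

def rnn_forward(xs, w_xh=0.5, w_hh=0.8, b_h=0.0, w_hy=1.0, b_y=0.0):
    """Scalar RNN: h_t = tanh(w_xh*x_t + w_hh*h_{t-1} + b_h), y_t = w_hy*h_t + b_y."""
    h = 0.0                      # h_0: initial hidden state
    hs, ys = [], []
    for x in xs:
        h = math.tanh(w_xh * x + w_hh * h + b_h)   # hidden-state update, Equation (1)
        hs.append(h)
        ys.append(w_hy * h + b_y)                  # output mapping
    return hs, ys

hs, ys = rnn_forward([1.0, 0.0, -1.0])
# Each h_t depends on h_{t-1}, which is exactly the loop that BPTT unfolds.
```

Unfolding this loop over T time steps yields the multilayer FFNN that BPTT trains with standard backpropagation.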
RTRL is an online learning algorithm, where the error gradient is computed, and weights are updated for each time step in a forward propagation manner. It computes the gradients of the internal and output nodes with respect to all the weights of the network.
Standard RNNs are not able to bridge time lags of more than 5-10 time steps. A vanishing gradient problem may arise in an RNN when gradient-based learning methods are used to update the weights. In each training iteration, every weight receives an update proportional to the partial derivative of the error function, and in some cases this gradient becomes very small. The back-propagated error signals may either blow up or vanish: blown-up error signals may cause the weights to oscillate, while vanishing error signals prevent the weights from changing value, so that learning takes an unacceptable amount of time or does not work at all. In [25], a detailed theoretical analysis of this problem with long-term dependencies, along with a solution, is presented.
RNNs can be used for supervised classification learning [14,15]. RNNs are difficult to train because of vanishing and exploding gradients. The problems of vanishing and exploding gradients arise due to improperly assigned weights (assigned to either very high or very low value). Thus, an LSTM with forget gates is often combined with an RNN to overcome these training issues [26,27]. However, RNNs are a good choice for solving time series sequence prediction problems. In [18][19][20], researchers showed that some implementations of LSTM-RNN provide notable performance in intrusion detection. In [21], Staudemeyer et al. used LSTM-RNN on the KDD Cup'99 datasets and showed that LSTM-RNN could learn all attack classes hidden in the training data. They also learned from their experiment that the receiver operating characteristics (ROC) curve and the AUC (area under the ROC curve) are well suited for selecting high performing networks. The ROC curve is the probability curve plotting the TPR against the FPR, where TPR is on the y-axis and FPR is on the x-axis. A ROC curve shows the performance of a classification model at all classification thresholds. AUC value measures the entire two-dimensional area under the ROC curve. AUC values range from 0 to 1. AUC = 0.0 means the prediction of the model is 100% wrong and AUC = 1.0 means the prediction is 100% correct.
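The AUC described above can also be read through its pairwise-ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A small self-contained sketch of that computation (for illustration only; actual evaluations typically use library routines):

```python
def auc_score(labels, scores):
    """AUC via the pairwise-ranking definition: the fraction of (positive, negative)
    pairs in which the positive example is scored higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that ranks every attack above every normal record gets AUC = 1.0;
# one that always ranks them below gets 0.0; random scoring hovers around 0.5.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```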

Long Short-Term Memory
LSTM can mitigate the problem of vanishing error [19,21,23]. LSTM can learn how to bridge more than 1000 discrete time steps [23]. LSTM networks replace all units in the hidden layer with memory blocks. Each memory block has at least one memory cell. Figure 2 demonstrates one cell in a basic LSTM network.
The memory cells are activated through regulating gates that control the incoming and outgoing information flow. A forget gate is placed between an input gate and an output gate; forget gates can reset the state of the linear unit if the stored information is no longer needed. These gates are simple sigmoid threshold units whose activation values range from 0 to 1.
The output y_{c_j}(t) of the LSTM memory cell shown in Figure 2 is computed as:

y_{c_j}(t) = y_{out_j}(t) · h(s_{c_j}(t))

where y_{out_j}(t) is the output gate activation, s_{c_j}(t) is the internal state of the memory cell, and h is the activation function applied to the cell state to produce the hidden-layer output.
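For context, the gating computations surrounding this output equation can be sketched as follows. This is a generic LSTM cell in the standard formulation with illustrative scalar weights, not the exact network trained in this research:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, h_prev, s_prev, w):
    """One step of a standard LSTM memory cell with input (i), forget (f),
    and output (o) sigmoid gates; s is the internal cell state, y the cell output."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate: can reset s
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate update
    s = f * s_prev + i * g                                   # internal cell state
    y = o * math.tanh(s)                                     # y_c(t) = y_out(t) * h(s_c(t))
    return y, s

# Illustrative weights; all set to 0.5 purely for demonstration.
w = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wg", "ug", "bg")}
y, s = lstm_cell_step(x=1.0, h_prev=0.0, s_prev=0.0, w=w)
```

Because the forget gate f multiplies the previous state, the cell can carry information across many time steps or discard it when it is no longer needed.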
Genetic Algorithm (GA)
GAs are inspired by Darwin's theory of evolution and apply evolutionary biology techniques, such as selection, mutation, recombination, and inheritance, to solve a given problem. A GA is a heuristic search method used for finding optimal solutions.
A GA [28] generates a pool of possible solutions for a given problem. It then creates a group of features/individuals randomly from the given problem. The individuals are then evaluated with evaluation functions provided by the programmer. These evaluation functions include some recombination and mutations, just as in natural genetics. Each of the solutions is assigned a fitness value, and the filtered individuals give a higher chance to fit into the model, thereby providing a list of good solutions rather than just one. GAs perform well in a large dataset where there are various features.
GAs provide various advantages in machine learning applications. They are used mostly in feature selection when many features need to be considered to predict a class in a classification problem. GAs can manage a dataset with many features, and they do not need any specific knowledge about the problem. A GA randomly generates solutions by recombination and mutation of the given features. Though this requires a lot of computation for building a predictive model, it can select the best subset of the features. As a result, using the GA solution, the NIDS model needs less execution time than when using the complete feature sets.
In this research, an NSL-KDD dataset is used which contains 125,973 records for training and 22,544 records for testing. Since a GA performs well on a large dataset, we applied GAs in this research to find the features that give near-optimal solutions. This not only minimized the training and testing time but also gave a high detection rate, a high true positive rate, and a low false alarm rate. A high true positive rate and a low false alarm rate indicate an ideal NIDS. The outcomes of using GAs are presented in Section 5.

NSL-KDD Dataset
NSL-KDD is a refined version of the KDD Cup'99 datasets: it removes all redundant records of KDD Cup'99 so that unbiased classification results can be derived, and eliminating the duplicated data also allows a better detection rate to be achieved. The NSL-KDD dataset [29] contains KDDTrain+, the full training dataset, and KDDTest+, the full testing dataset, both including attack-type labels and difficulty levels in CSV format. In the experiment in [29], the authors randomly created three subsets of the KDD training set with fifty thousand records each. They employed 21 learning machines that had been formerly trained using the created training sets to label the records of the entire training and testing sets, which provided 21 predicted labels for each record. The difficulty level of a record in the NSL-KDD dataset denotes the number of successful predictions obtained from the 21 learning machines, i.e., the number of learning machines that predicted the correct label; thus the highest possible difficulty level for any record is 21.

[Table: major attack categories and their subcategories]

The distribution of normal and attack records available in the NSL-KDD dataset is shown in Table 3.

Related Work
According to Yin, Zhu, Fei, and He [15], an RNN-based intrusion detection system (RNN-IDS) is very suitable for designing a classification model with high accuracy. They performed their analysis on the NSL-KDD datasets. The performance of RNN-IDS is superior to that of traditional machine learning classification methods in both binary and multi-class classification. Researchers have also compared the performance of RNN-IDS with some traditional machine learning algorithms such as J48, SVM, RF, and ANN. In [16], the authors used the CNN-LSTM model on the KDD Cup'99 datasets and showed that CNN-LSTM performed better than CNN-GRU and CNN-RNN. In [21], the authors used the LSTM-RNN model on the KDD Cup'99 datasets. They performed several experiments to find the optimum hyper-parameter values, such as those of learning rate and hidden layer size and analyzed the impacts of the hidden layer size and learning rate on the performance of the classifier. In [19], Staudemeyer stated that the LSTM-RNN model is able to learn all attack classes hidden in the KDD Cup'99 training dataset. They tested the model on all features and on extracted minimal feature sets, respectively. The DT and backward elimination process were used to extract the minimal features from the original datasets. They showed the performance of the model by calculating the ROC curve and corresponding AUC value. In each of the projects [18][19][20][21], LSTM performed excellently.
We used the feature selection concept from [19] but using the GA on the NSL-KDD datasets. We performed several experiments to set up the hyper-parameter value. Then we examined the performance of the LSTM-RNN network on all features and on extracted optimal feature sets, respectively. We also analyzed the performances of SVM and RF classifiers on both the original features set and the optimal features set. SVM and RF are the two traditional machine learning approaches most often used to verify the performances of classifiers. All the research efforts mentioned in this section have SVM and RF classifiers as the common baselines for comparing their performances with those of proposed classifiers. Thus, we applied them to scale the accuracy of the LSTM-RNN classifier.

LSTM-RNN NIDS with a Genetic Algorithm
This section describes the LSTM-RNN NIDS with the feature-selection GA. The RNN is the basic model, and an LSTM network is used to achieve a high detection rate.
In this research, the training dataset was used to train the classifier, and the testing dataset was then used to measure the accuracy of the classifier. Two types of classification were conducted: binary and multi-class. Normal and anomaly are the two classes in binary classification, whereas normal, denial-of-service (DoS), probe, user-to-root (U2R), and remote-to-local (R2L) are the five categories detected using multi-class classification.
The classification metrics considered in this research are accuracy, precision, recall, f-score, true positive rate, and false-positive rate. The confusion matrix was calculated to show the records of true positive, true negative, false positive, and false negative records achieved in each model.
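These metrics follow directly from the confusion-matrix counts. A small sketch with hypothetical counts (not results from this research):

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall (= TPR), f-score, and FPR from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)          # true positive rate
    f_score   = 2 * precision * recall / (precision + recall)
    fpr       = fp / (fp + tn)          # false positive rate
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score, "fpr": fpr}

# Hypothetical counts for illustration only:
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
```

For multi-class classification the same quantities are computed per class from the full confusion matrix.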
The LSTM-RNN models were designed with 5, 10, 20, 40, 60, 80, and 100 neurons in the hidden layers. Python 3.7 with the scikit-learn, TensorFlow, and Keras packages was used to develop the programs along with other Python open-source libraries. Figure 3 shows the following steps performed for classifying network attacks using the LSTM-RNN classifier along with the GA.

LSTM-RNN Classifier with GA Architecture
Step 1: Data Preprocessing

Numericalization: Our training and testing datasets include 38 numeric features and three non-numeric features. We must make all the features numeric to apply them to the model, so we converted the non-numeric features, protocol_type, service, and flag, into numerals. There were three values in the protocol_type feature, 70 values in the service feature, and 11 values in the flag feature. In transforming all non-numeric features, we mapped the 41 features into 122 (= 38 + 3 + 70 + 11) features [15].

To transform the categorical features into binary features, we first transformed them into integers using LabelEncoder, a scikit-learn class. The integers were then passed to the One-Hot-Encoder, which takes a matrix of integers representing the different sub-categories of the categorical features, with the values of the input features lying in the range [0, n]. The output is a sparse matrix in which each column corresponds to one possible value of one feature.
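The encoding pipeline can be illustrated in a few lines of plain Python (a sketch of the idea only; the research itself used the scikit-learn LabelEncoder and One-Hot-Encoder):

```python
def one_hot_column(values):
    """Label-encode a categorical column, then expand it into binary indicator
    columns, mirroring LabelEncoder followed by One-Hot-Encoder."""
    categories = sorted(set(values))                 # label encoding: category -> integer
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)                  # one binary column per category value
        row[index[v]] = 1
        rows.append(row)
    return categories, rows

cats, rows = one_hot_column(["tcp", "udp", "icmp", "tcp"])
# The 3 protocol_type values become 3 binary columns; applied to all categorical
# features, the 41 raw features expand to 38 + 3 + 70 + 11 = 122 columns.
```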
Normalization: Some features in the feature space, such as duration, src_bytes, and dst_bytes, have a very large difference between their maximum and minimum values. We applied a logarithmic scaling method to map them into the range [0, 1] [15].
Normalization is not required for every dataset, but a dataset like NSL-KDD, where the features have different ranges of values, requires normalizing the values. Normalization changes the values of a dataset to a common scale. If the features do not have a similar range of values, gradient descent may take too long to converge; by normalizing, we ensure that gradient descent converges more quickly. An added benefit is that normalizing the data can also increase the accuracy of a classifier.
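One simple way to realize such a logarithmic scaling (a sketch of the general idea; the exact transform used in [15] may differ) is to compress each value with log(1 + x) and then divide by the column maximum:

```python
import math

def log_scale(values):
    """Logarithmically scale non-negative values into [0, 1]:
    compress with log(1 + x), then divide by the column maximum."""
    logs = [math.log1p(v) for v in values]
    top = max(logs)
    return [v / top if top > 0 else 0.0 for v in logs]

# src_bytes-like values spanning several orders of magnitude:
scaled = log_scale([0, 9, 99, 9999])
```

The log step keeps very large byte counts from dominating the gradient, while the division maps every column onto the same [0, 1] scale.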

Step 2: Feature Selection
Feature selection from the dataset is the process of finding the most relevant feature-set for the future predictive model implementation. This technique was used to identify and remove the unneeded, redundant, and irrelevant features that do not contribute to, or that even decrease, the accuracy of the predictive model used for the classification.
Optimal feature selection is one of the most important tasks in developing an excellent NIDS because some features can bias the classifier to identify a particular class, and that may increase the amount of misclassification. To minimize the misclassification rate, to minimize the training time, and to maximize the accuracy of the classifier, we have applied a GA in our experiment. Using a GA, we have obtained an optimal subset of 99 features from the original 122 features. After applying the GA, insignificant data are removed from the original features set. The GA selects only a subset of relevant features. Accuracy is used as the fitness function in this research. We have calculated the maximum, minimum and average accuracy of the training dataset according to their labels. The GA provided a set of solutions by performing recombination and mutation of the features and selected the features with maximum, minimum, and average accuracy. The subset of features that gives the maximum accuracy was selected as the optimal feature set. We then prepared our training and testing datasets with this optimal feature set.
The flow-chart of the GA implementation is shown in Figure 4.
The GA is a heuristic optimization method, and it was implemented to find the best feature-set from the NSL-KDD datasets using the DEAP (Distributed Evolutionary Algorithms in Python) evolutionary computation framework [32].
Initialize the population: The NSL-KDD datasets were mapped into 122 features in this experiment. The GA initializes the population of 10 individual feature-sets, each time randomly selected from the total 122 features. Each of the 10 individual feature-sets is represented as a feature vector with 122 attributes. Each attribute from the feature-set contains either a value 1 that represents that the feature is included in the individual, or a value 0 that represents the feature is not included in the individual.
Fitness value assignment: After the initialization of the population, the GA assigns the fitness for each individual feature-set. Here the logistic regression (LR) is used for training the model. The LR model gets trained from each individual feature-set and evaluates the selection error. The lowest selection error represents the highest fitness. The GA ranks each individual feature-set according to its fitness.
Selection of individual for the next generation: After the fitness value assignment, the selection operator chooses the individuals that recombine later for the next generation. The selection operator selects the individuals according to their fitness levels. The number of selected individuals is N/2, N being the population size; i.e., almost half of the individuals will get selected in this step.
Crossover: In this phase, the GA recombines the selected individuals to generate a new population from the roughly half of the population already selected in the previous step, although the total number of features in the new population remains the same. Here the GA chooses the best combination of feature-sets from the N/2 individuals through the uniform crossover method.
Mutation: The GA performs mutation to diversify the population and to prevent premature convergence. The GA flips the value of the attributes of the input individual and returns the new individual. In this experiment, the probability of each attribute to be flipped is set to 0.05. The The GA is a heuristic optimization method and it was implemented to find the best feature-set from the NSL-KDD datasets using the DEAP (distributed Evolutionary Algorithms in Python) evolutionary computation framework [32].
Initialize the population: The NSL-KDD datasets were mapped into 122 features in this experiment. The GA initializes the population of 10 individual feature-sets, each time randomly selected from the total 122 features. Each of the 10 individual feature-sets is represented as a feature vector with 122 attributes. Each attribute from the feature-set contains either a value 1 that represents that the feature is included in the individual, or a value 0 that represents the feature is not included in the individual.
Fitness value assignment: After the initialization of the population, the GA assigns the fitness for each individual feature-set. Here the logistic regression (LR) is used for training the model. The LR model gets trained from each individual feature-set and evaluates the selection error. The lowest selection error represents the highest fitness. The GA ranks each individual feature-set according to its fitness.
Selection of individuals for the next generation: After the fitness value assignment, the selection operator chooses the individuals that will later recombine to form the next generation. The selection operator selects individuals according to their fitness levels. The number of selected individuals is N/2, N being the population size; i.e., about half of the individuals are selected in this step.
Crossover: In this phase, the GA recombines the selected individuals to generate a new population from the half of the population selected in the previous step, although the total number of features in the new population remains the same. Here, the GA chooses the best combination of features from the N/2 individuals through the uniform crossover method.
Mutation: The GA performs mutation to diversify the population and to prevent premature convergence. The GA flips the values of the attributes of an input individual and returns the new individual. In this experiment, the probability of each attribute being flipped is set to 0.05. Introducing a very low mutation probability helps the GA search more widely over the individuals to form a diverse population of feature-sets. It also helps prevent the GA from getting stuck in a local minimum during feature selection.
Final decision and termination: If 10 generations have been processed, the GA selects the individual feature-set with the highest predictive fitness and then terminates. Otherwise, the GA returns to the "fitness value assignment" phase to produce the next generation, continuing until it makes a final decision on the most predictive feature-set before termination.
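The GA loop above can be sketched in plain Python. This is an illustrative sketch, not the authors' DEAP implementation; the `selection_error` callback stands in for training the logistic regression model on each candidate feature-set and returning its error.

```python
import random

N_FEATURES = 122     # NSL-KDD features after mapping/encoding
POP_SIZE = 10        # individuals per generation
N_GENERATIONS = 10
MUT_PROB = 0.05      # per-attribute flip probability

def init_population():
    # Each individual is a 122-bit vector; 1 = feature included.
    return [[random.randint(0, 1) for _ in range(N_FEATURES)]
            for _ in range(POP_SIZE)]

def select(population, fitnesses):
    # Rank by fitness (negated selection error) and keep the best N/2.
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    return [ind for _, ind in ranked[:POP_SIZE // 2]]

def uniform_crossover(a, b):
    # Each attribute is inherited from either parent with equal probability.
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(ind):
    # Flip each attribute with probability MUT_PROB to diversify the population.
    return [1 - bit if random.random() < MUT_PROB else bit for bit in ind]

def run_ga(selection_error):
    # selection_error(individual) -> float; lower error means higher fitness.
    pop = init_population()
    for _ in range(N_GENERATIONS):
        fits = [-selection_error(ind) for ind in pop]
        parents = select(pop, fits)
        children = [mutate(uniform_crossover(*random.sample(parents, 2)))
                    for _ in range(POP_SIZE - len(parents))]
        pop = parents + children
    return min(pop, key=selection_error)  # best feature mask found
```

In the actual experiment the error callback would train an LR model on the selected feature columns; here any function from a 122-bit mask to a number will drive the same loop.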
In our experiment, we identified a set of 70 unique features using the GA on the training dataset. Additionally, we obtained a set of 66 unique features by using the GA on the testing dataset (with the label removed). Finally, we prepared a feature set of 99 features as the union of the training (70 features) and testing (66 features) subsets. Figure 5 represents the set of 99 features that we obtained by applying the GA. We observed that most of the features selected by the GA are from the service or flag categories. This is hardly surprising, as most of the features are of these two types. Protocol type is also a significant feature type. The GA selects the unique features that can best predict an attack type, so most of the features in the optimal feature set come from the service, protocol_type, and flag categories. Some features outside these three categories are also distinctive and carry considerable predictive value, so the GA selects them too.


Step 3: LSTM-RNN Model Construction
To build the LSTM-RNN model, first we selected hyper-parameters and optimizers for both binary and multi-class classification. We determined the following hyper-parameters: batch size, the number of epochs, learning rate, dropout, and activation function.

a. Batch size is the number of training records in one forward and one backward pass.
b. An epoch means one forward and one backward pass of all the training examples.
c. Learning rate is the proportion of the weights that are updated during the training of the LSTM-RNN model. It can be chosen from the range [0.0-1.0].
d. Dropout is a regularization technique, where randomly selected neurons are ignored during training. The selected neurons are temporarily removed on the forward pass.
e. The activation function converts an input signal of a node in a neural network to an output signal, which is then used as the input of the next hidden layer and so on. To find the output of a layer, we calculated the sum of the products of inputs and their corresponding weights, applied the activation function to that sum, and then fed that output as an input to the next layer.
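The layer computation described in item (e) can be illustrated with a short, self-contained sketch; the sigmoid activation and the toy inputs are illustrative choices, not values from the experiments.

```python
import math

def sigmoid(x):
    # Squashes the weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, bias, activation=sigmoid):
    # Sum of the products of inputs and their corresponding weights,
    # then the activation function applied to that sum (item e above).
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)
```

The output of `layer_output` for one layer becomes an element of the input vector for the next layer, repeated across the hidden layers of the network.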
A suitable optimizer plays a crucial part in classification. From the various available optimizers, we selected stochastic gradient descent (SGD) for multi-class classification and the Adam optimizer for binary classification. SGD provides a lower computational cost than the other gradient descent optimizers considered for multi-class classification: optimizers such as batch gradient descent use all the training samples to complete one iteration, which makes them computationally expensive. The advantage of SGD is that it uses only a single batch of samples for each iteration.
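The cost difference comes down to how many samples feed one weight update. A minimal sketch of a single SGD update follows, using squared error on one sample purely for illustration (the loss and learning rate here are not the ones used in the experiments):

```python
def sgd_step(weights, x, y, lr=0.1):
    # One SGD weight update from a single training sample, using the
    # squared error (w·x - y)^2. Batch gradient descent would average
    # this gradient over every training sample before updating, which
    # is far more expensive per step.
    pred = sum(w * xi for w, xi in zip(weights, x))
    grad = [2.0 * (pred - y) * xi for xi in x]
    return [w - lr * g for w, g in zip(weights, grad)]
```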
The Adam optimizer is easy to implement, computationally efficient, and requires little memory, which suits the binary classification task.

Step 4: Prediction and performance metrics calculation
After fitting the model using the training dataset and testing dataset, we obtained the testing accuracy rate and loss value of the classifier. The testing dataset is used as a validation set to get an unbiased estimate of accuracy during the learning process. Then we used the predict_classes module of Keras to get the prediction matrix of the model, that is, the classes predicted by the model. At this point, we have both the expected classes, defined in the original KDDTest+ dataset, and the predicted classes. Next, we calculated classification metrics such as precision, recall, f-score, true positive rate (TPR), false positive rate (FPR), and the confusion matrix from the expected and predicted class matrices. These metrics are explained in Appendix A.
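The metrics computed in this step can be sketched as follows for the binary case; this is a generic illustration of the Appendix A definitions, not the exact evaluation code used in the experiments.

```python
def confusion_counts(expected, predicted, positive=1):
    # Tally the four cells of the binary confusion matrix.
    tp = fp = fn = tn = 0
    for e, p in zip(expected, predicted):
        if p == positive:
            if e == positive: tp += 1
            else:             fp += 1
        else:
            if e == positive: fn += 1
            else:             tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall is the TPR
    fpr = fp / (fp + tn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f_score
```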

Step 5: Evaluation by performance analysis
We performed Step 4 for both SVM and RF classifiers and then evaluated the performances of SVM, RF, and LSTM-RNN by comparing the metrics.

Experiment Environment Setup
TensorFlow and Keras were used to implement the LSTM-RNN model, and scikit-learn was used to implement SVM and RF. Two types of classification have been performed to study the performance of the LSTM-RNN model: binary and multi-class/5-category. Binary classification gives normal and anomaly detection results, and multi-class/5-category classification gives normal, DoS, Probe, R2L, and U2R attack detection results. These two types of classification have been performed using a set of 122 features and an optimal set of 99 features. The optimal set of 99 features was obtained using the genetic algorithm.
To prevent overfitting in LSTM-RNN, we introduced several dropout amounts during the training of the model. While experimenting with two and three hidden layers, we randomly selected 1% of the neurons to drop out. We dropped out 10%-15% of the neurons while experimenting with four and five hidden layers. A dropout of 20%-30% was performed while experimenting with six, seven, and eight hidden layers. We adjusted the dropout percentage for the hidden layers through several executions of the same experiment, running each experiment at least four times on different machines. The resulting dropout amounts are given in Table 4.
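The dropout mechanism itself can be sketched in a few lines. The scaling of surviving activations by 1/(1-rate) ("inverted dropout") is one common convention and is an assumption here, not a detail stated in the paper.

```python
import random

def dropout(activations, rate):
    # During training, each neuron's output is zeroed with probability
    # `rate`; survivors are scaled by 1/(1-rate) so that the expected
    # activation is unchanged when dropout is disabled at test time.
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]
```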
We implemented the SVM using scikit-learn's default C-Support Vector Classification (SVC) function, selecting the linear kernel.
In addition, we implemented the RF using the default RandomForest classifier function of scikit-learn, which uses n_estimators = 100 by default.

Hyper-Parameter Setup for Experiments with the LSTM-RNN Model
We have experimented with the LSTM-RNN model by using 5, 10, 20, 40, 60, 80, and 100 neurons in the hidden layers for both binary and multi-class classification experiments. The hyper-parameters selected for the binary and multi-class classification experiments are shown in Table 4; the loss functions are binary_crossentropy for binary classification and categorical_crossentropy for multi-class classification. The sizes of the hidden layers used in the experiments are shown in Table 5.

Experimental Results
In this section, we present the classification performance of the LSTM-RNN classifier according to accuracy, precision, recall, f-score, TPR, and FPR with 5, 10, 20, 40, 60, 80, and 100 neurons in the hidden layers, respectively. The classification performance is analyzed for both binary and 5-class classification. For 5-class classification, the classes are normal, DoS, probe, U2R, and R2L attacks, while for binary classification, the classes are normal and anomaly. We also compared the performance of LSTM-RNN with traditional machine learning approaches, namely SVM and RF, using the same feature sets. We compared the performance using the 122-feature set and the 99-feature set, the latter obtained with the genetic algorithm.

Classification Results of the LSTM-RNN Classifier
The binary and multiclass classification results of the LSTM-RNN classifier are presented below. Tables 6 and 7 show the performance of the LSTM-RNN classifier for binary classification and multi-class classification, respectively, while using the 122-dimensional feature space.

Classification Results Using all Features
From Table 6, we observed that, in binary classification using all features, the highest testing accuracy obtained is 96.51% and the highest training accuracy obtained is 99.88%, in both cases using only five neurons in the hidden layers.

Table 6. Binary classification performance results using 122 features.

From Table 7, we observe that, in multi-class classification using all features, the highest testing accuracy obtained is 82.68% and the highest training accuracy obtained is 99.87%, in both cases using 60 neurons in the hidden layers.

Table 7. Multi-class classification performance results using 122 features.

Tables 8 and 9 show the performance of the LSTM-RNN classifier after applying the genetic algorithm with the 99 prominent features in binary and multi-class classification, respectively.
From Table 8, we observe that, in binary classification using 99 optimal features, the highest testing accuracy obtained is 99.91% and the highest training accuracy obtained is 99.99%, in both cases using 40 neurons in the hidden layers.

Table 9. Multi-class classification performance results using 99 features.

From Table 9, we observed that, in multi-class classification using the optimal set of 99 features, the highest testing accuracy obtained is 93.88% and the highest training accuracy obtained is 99.0%, in both cases using 80 neurons in the hidden layers.

Performance Comparison between Classifiers
The binary and multi-class classification performance comparisons between the LSTM-RNN classifier and the SVM and RF classifiers are presented below. Tables 10 and 11 show the performance results of the SVM, RF, and LSTM-RNN classifiers for binary classification and multi-class classification, respectively, when using 122 features.

From Table 10, we observed that LSTM-RNN performed better than the SVM, although RF was the best classifier in the binary classification experiments using all features. Nevertheless, LSTM-RNN was able to obtain 96.51% testing accuracy with a TPR of 0.944.

Table 11 shows that LSTM-RNN provides the best testing accuracy of 82.68% among the classifiers, where SVM gives 67.20% and RF gives 80.70% testing accuracy in the multi-class classification experiments using all features. The TPR (0.707) is also higher for LSTM-RNN than for SVM and RF.

Tables 12 and 13 show the performance results of the SVM, random forest, and LSTM-RNN classifiers for binary and multi-class classification, respectively, when using the optimal set of 99 features.

Table 12 shows that, using the optimal set of 99 features, LSTM-RNN, SVM, and RF give similar testing accuracy, around 99.9%; in binary classification with the optimal feature set, the three classifiers perform almost identically. In Table 13, we observe that, in the multi-class classification experiment using the optimal set of 99 features, LSTM-RNN gives the best testing accuracy of 93.88% among all the classifiers tested, while SVM provides 68.10% and RF provides 84.90% testing accuracy. The TPR (0.896) is also higher for LSTM-RNN compared with SVM and RF in that experiment.

Discussion
In this research, we obtained binary classification results and 5-class classification results. In both categories of classification, we applied a genetic algorithm for feature selection. In this section, we discuss the outcomes and findings of our experiments.

Binary Classification
In binary classification, while using 122 features, we obtained an accuracy of 96.51% for KDDTest+ and an accuracy of 99.88% for the KDDTrain+ dataset. We obtained this high detection rate while using only five neurons in the hidden layers. The TPR and FPR are 0.944 and 0.007, respectively.
After applying the genetic algorithm and using the optimal set of 99 features, we were able to obtain an increased accuracy of 99.91% for KDDTest+ and 99.99% for KDDTrain+ using 40 neurons in the hidden layers with a TPR of 0.999 and FPR of 0.003. We were able to enhance the accuracy by more than 3%. We were able to obtain 99.83% testing accuracy using only five neurons, which is very similar to the highest accuracy of our experiment. Thus, after applying the GA, we could reach the maximum testing accuracy using a small LSTM-RNN network. By using this small network architecture, we were able to minimize the running time as well.
In this experiment, we observed that RF gives the highest testing accuracy in binary classification, because it is easy to build random binary decision trees when the classes are well defined by the features. The random trees are built from different random subsets of the data, and the forest then searches for the best outcome across those subsets. In the experiment, we implemented the default RandomForest classifier function of scikit-learn, which combines hundreds of decision trees. In each tree, there are two final predictions to make: either normal (labeled as 0) or anomalous (labeled as 1).
In this experiment, LSTM-RNN was implemented with several parameters, such as dropout, batch size, epochs, and optimizers. The performance of the classifier is sensitive to those parameters. Thus, the performance of LSTM-RNN in binary classification is slightly lower than that of RF.
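The final prediction step of the random forest described above reduces to a majority vote across the trees, which can be sketched as:

```python
from collections import Counter

def forest_predict(tree_votes):
    # Each of the forest's decision trees casts a vote: 0 (normal) or
    # 1 (anomalous). The forest's final prediction is the majority vote.
    return Counter(tree_votes).most_common(1)[0][0]
```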
The confusion matrices for binary classification on KDDTest+ using 122 features and 99 features are presented in Table 14. We can observe that the use of the genetic algorithm decreased the amount of misclassification. The numbers of correctly classified records are shown in the gray boxes.

Multi-Class Classification
In multi-class classification while using 122 features, we obtained 82.68% accuracy for KDDTest+ and 99.87% accuracy for the KDDTrain+ dataset. We obtained this high detection rate while using 60 neurons in the hidden layers. The TPR and FPR are 0.707 and 0.015, respectively.
After applying the genetic algorithm and obtaining the optimal set of 99 features, we obtained an increased accuracy of 93.88% for KDDTest+ and 99.0% for KDDTrain+ with a TPR of 0.896 and an FPR of 0.005 while using 80 neurons in the hidden layers. The accuracy is more than 10% higher than when using all features. The experiment using the genetic algorithm also reduced the training and testing times: after applying the GA, we already reached an accuracy of 82.68%, the best accuracy obtained with all 122 features (Table 7), using only 10 neurons in the hidden layers.
The confusion matrices of the multi-class classification on KDDTest+ using 122 features and 99 features are presented in Table 15. There are 60 neurons in the hidden layer. We can observe that the use of the genetic algorithm decreased the number of misclassifications. The numbers of correctly classified records are shown in the gray boxes.

Table 16 shows the best results of the evaluation metrics for each individual class of the multi-class classification experiment using 122 features. From this table, we can see that the accuracy rate for the R2L and U2R classes is comparatively lower than that for the other classes. This is due to the small number of training records for U2R and R2L.

Table 17 shows the best results of the evaluation metrics for each individual class of the multi-class classification experiments using 99 features and 80 neurons in the hidden layers. It also shows that using the GA enhanced the classification accuracy of the R2L and U2R classes compared with using 122 features.

In [22], the researchers proposed a deep learning approach, self-taught learning, which can achieve a high accuracy of 88.39% for binary classification and 79.10% for 5-class classification. Based on the evaluation using the KDDTrain+ dataset, we found the performance of LSTM-RNN to be comparable to the best results obtained in several previous works [16,22]. In [20], the researchers applied LSTM-RNN on the KDD Cup '99 datasets and obtained 93.82% accuracy in 5-class classification by using a 0.1 learning rate and 1000 epochs. In our research using LSTM-RNN with the GA, we were able to obtain 99.99% training accuracy and 99.91% testing accuracy in the binary classification experiment. Moreover, we obtained 99% training accuracy and 93.88% testing accuracy in the five-class classification experiment.

Conclusions
In this paper, we compared different classifiers on the NSL-KDD dataset for both binary and multi-class classification. We considered SVM, random forest, and the LSTM-RNN model. We have shown that our proposed model produced the highest accuracy rates of 96.51% and 99.91% for binary classification using 122 features and the optimal set of 99 features, respectively. The LSTM-RNN obtained higher accuracy than the SVM in binary classification, although random forest was the best classifier overall in that case. Using the 99-feature set, however, we were able to get testing accuracy similar to that of RF.
In addition, we achieved a testing accuracy of 82.68% using the 122-feature set and 93.88% accuracy using the 99-feature set in multi-class classification. Compared with the SVM and random forest, LSTM-RNN performed the best in the multi-class classification experiments; our LSTM-RNN-with-GA model achieved an accuracy roughly 9% higher than random forest in that experiment (93.88% vs. 84.90%). Using the GA improves the performance of the LSTM-RNN classifier significantly. Finally, we conclude that LSTM-RNN is suitable for large datasets; with small datasets having few features, its advantage is less noticeable. However, enhanced with the GA, LSTM-RNN performs better than the traditional machine learning approaches.
In this research, all the experiments were executed on a Windows platform, and we did not record the training time. In the future, we will execute all the experiments on a GPU-based system to obtain faster training times. Future work also includes using a more recent dataset, such as UNSW-NB15, to verify the efficiency of the LSTM-RNN model, as well as developing a framework for collecting real-time network data, labeling the dataset with expert knowledge, and then applying the proposed LSTM-RNN classifier on that dataset. Additionally, we will extend our experiments to compare the performance of LSTM-RNN-with-GA to those of other deep-learning approaches, such as CNN-LSTM, STL, and deep belief networks (DBN), on the latest datasets.
Author Contributions: Conceptualization: P.S.M. and P.C.; methodology: P.S.M. and P.C.; software: P.S.M., P.C. and K.R.; validation: P.S.M., P.C., X.Y., and K.R.; formal analysis: P.S.M. and X.Y.; investigation: P.S.M.; resources: P.S.M. and K.R.; data curation: P.S.M. and P.C.; writing-original draft preparation: P.S.M. and P.C.; writing-review and editing: X.Y., P.S.M., P.C. and A.E.; visualization: X.Y. and P.S.M.; supervision: X.Y.; project administration: X.Y.; funding acquisition: X.Y. All authors have read and agreed to the published version of the manuscript.

F-Score: This is the harmonic mean (in percentile) of precision and recall, and always has a value near the smaller of precision and recall. It provides a more realistic measure of a test's accuracy by using both precision and recall. If the costs of false positives and false negatives are very different, the F-score works best.

F = (2 · P · R) / (P + R) × 100%

Confusion matrix: This is a table that explains the performance of classification on a set of test data where the true values are known.