Our strategy to reduce FNR and FPR and to increase the detection of under-represented attack categories consists of three points, as depicted in
Figure 1. The first point,
distribution alteration, refers to the idea of altering the distribution of the original datasets. The rationale is that the split proposed by the original dataset’s authors is sub-optimal, limiting the final accuracy of the trained model. Our idea is that by reshuffling the datasets, it is possible to improve the detection rate of most of the attack categories.
4.1. Strategy Implementation
Figure 2 shows the phases we followed to construct and evaluate the generated models. In the following, we give an overall description of each phase; the details of the data preparation and model architecture are given in
Section 4.2 and
Section 4.3, respectively.
The first step was to collect the data. As mentioned in
Section 3.1, we selected NSL-KDD and UNSW-NB15 to have two different datasets considered among the most suitable for our research. Since both datasets are divided into smaller datasets, the following were chosen for our research: KDDTrain+, KDDTest+, and the full UNSW-NB15 dataset, which is distributed as four CSV files.
The next step is to prepare the datasets so they can be ready to be used for training our model. In this phase, we processed the datasets by removing missing and redundant values, normalizing the numerical data, and encoding the categorical data into numerical.
Section 4.2 gives a detailed explanation of this step.
The third phase is constructing the deep neural network used in the research and setting all of its parameters. A detailed explanation of the architecture and the parameters chosen for the model is given in
Section 4.3.
The fourth step, training the neural network, is essential in deep learning: the training dataset is used to fit the model and improve its ability to make predictions.
After the training of the model, the fifth step is to evaluate the model to see how it performs. The testing datasets are used in this step, in order to see how well the model will perform on the data that it has never seen before.
After the evaluation process, the next step is to tune the hyperparameters to see if it would be possible to improve the learning process and achieve better results. The hyperparameters are the parameters used to control the learning process, as opposed to other parameters, such as node weights, whose values come from training. Some of the parameters modified in this phase to obtain better results are, for example, the number of epochs and the learning rate. We detail all the values and the obtained results of this step in
Section 4.3.
The final step is the prediction step, in which we achieve the final results of the model. In this step, we conclude how well the model performed and if it reached the experiment’s goal. The predictions of each experiment and the evaluation of their results are in
Section 5 and
Section 6, respectively.
4.2. Data Preparation
Preparing the data is a crucial step and can significantly impact the model’s learning process. If we do not give appropriate input to the model, it might not give us the result that we want to obtain. As mentioned earlier, we have two datasets used in this research, the NSL-KDD and the UNSW-NB15 datasets. Both of these datasets need to be processed, and since they have a similar structure, we used the same preparation process.
4.2.1. Preparation of the NSL-KDD Dataset
As a starting point in the data preparation of the NSL-KDD dataset, we have two subsets of data already divided by the authors, KDDTrain+ and KDDTest+. Each subset has 43 features; the KDDTrain+ subset has 125,793 records and the KDDTest+ subset has 22,544. We verified that neither subset contains missing values. Therefore, we could proceed with the rest of the data preparation on the subsets as they are.
The goal of multiclass classification is to assign each record that represents a network attack to the attack category it belongs to. Therefore, it is necessary to change the label of every record from the attack type to the class to which that attack type belongs. This step is repeated for both subsets. For the model to learn from this data, the categorical features must be transformed into numerical values. For this transformation, we employed one-hot encoding, a technique used for categorical features where no ordinal relationship exists. In such cases integer encoding (assigning each category an integer) is not enough, because it would suggest an order among the categories that does not exist. One-hot encoding creates one new binary column for each unique value of a categorical feature. In other words, it converts the categorical data into a binary vector representation. We applied one-hot encoding to the training and test subsets for the following features: protocol_type, service, and flag. Finally, we removed the original categorical columns and obtained a dataset with 124 columns.
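As an illustration of the one-hot encoding described above, the following is a minimal sketch in plain Python. The function name `one_hot_encode` and the toy records are hypothetical and are not taken from our implementation; real NSL-KDD records contain many more features.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row (a dict) with one binary column per category."""
    # Sort the observed categories so the order of the new columns is deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every feature except the original categorical column.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Toy records (made up for illustration).
records = [
    {"duration": 0, "protocol_type": "tcp"},
    {"duration": 2, "protocol_type": "udp"},
    {"duration": 5, "protocol_type": "icmp"},
]
encoded = one_hot_encode(records, "protocol_type")
```

Note that the categories are collected from the rows passed in. To obtain identical columns for the training and test subsets, one option is to collect the categories over both subsets before encoding; this is a design choice of the sketch, not a statement about the procedure used here.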
The next step was the encoding of the label. For binary classification, the ‘normal’ value was represented by 0, while all the other (‘abnormal’) values were given the value 1. For multiclass classification, the ‘normal’ value was again given the value 0, and the remaining attack categories were integer encoded, so the multiclass values range from 0 to 4. This was done for both subsets.
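The relabeling and label encoding can be sketched as follows. The mapping lists only a few well-known NSL-KDD attack types for illustration, and the integer codes assigned to the attack categories are an assumption: the text only fixes ‘normal’ to 0 and the range to 0 to 4. The names `ATTACK_CATEGORY`, `CATEGORY_CODE`, and `encode_labels` are hypothetical.

```python
# Partial mapping from NSL-KDD attack types to attack categories
# (only a handful of attack types are listed here).
ATTACK_CATEGORY = {
    "neptune": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    "guess_passwd": "R2L", "warezclient": "R2L",
    "buffer_overflow": "U2R", "rootkit": "U2R",
}

# Assumed integer codes: only 'normal' -> 0 is stated in the text.
CATEGORY_CODE = {"normal": 0, "DoS": 1, "Probe": 2, "R2L": 3, "U2R": 4}


def encode_labels(attack_type):
    """Return (binary_label, multiclass_label) for the attack type of one record."""
    category = "normal" if attack_type == "normal" else ATTACK_CATEGORY[attack_type]
    binary = 0 if category == "normal" else 1
    return binary, CATEGORY_CODE[category]
```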
The next step was to strip the label and attack category columns from the train and test datasets, building the effective subsets used to generate the model. The combination of the feature columns with the label column is used for the binary classification, while the combination with the attack category column is used for the multiclass classification. Thus, we divided the training and the testing subsets into six subsets: the training and testing feature subsets, the training and testing label subsets, and the training and testing attack category subsets. The two feature subsets contain all the columns of the original training and testing datasets except for the label and attack category columns; they are given to the model as the input. The label columns for training and testing, used for binary classification, went into the two label subsets, while the attack category columns went into the two attack category subsets.
The last preparation step was to normalize the data in the training and testing feature subsets using the min-max method. For every feature, the minimum value is mapped to 0, the maximum value is mapped to 1, and every other value is transformed into a decimal value between 0 and 1 using the following formula: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$. The final feature subsets now contain 123 columns each, and all the data is encoded into numerical values and normalized.
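A minimal sketch of min-max normalization for a single feature column is given below, using x' = (x - min) / (max - min). The function name `min_max_scale` is hypothetical, and the handling of a constant column (mapped to 0.0) is an assumption of this sketch rather than a documented choice.

```python
def min_max_scale(values):
    """Scale one feature column to the range [0, 1]."""
    minimum, maximum = min(values), max(values)
    span = maximum - minimum
    if span == 0:
        # Assumption: a constant column is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(value - minimum) / span for value in values]
```

For example, the column `[2, 4, 6]` becomes `[0.0, 0.5, 1.0]`: the minimum maps to 0, the maximum to 1, and the middle value to the corresponding fraction of the range.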
4.2.2. Preparation of the UNSW-NB15 Dataset
Unlike the NSL-KDD dataset, we opted to use the original full UNSW-NB15 dataset, which contains 2,540,044 records, instead of using the two subsets pre-divided by the authors. The authors have provided four separate CSV files which contain the records of this dataset. The first step was to load all four CSV files and merge them into one dataset.
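Loading and merging the files can be sketched with the standard `csv` module. The sketch assumes that none of the files carries a header row; the function name `merge_csv_sources` is hypothetical, and in-memory strings with made-up contents stand in for the four CSV files.

```python
import csv
import io


def merge_csv_sources(sources):
    """Concatenate the rows of several CSV file-like objects into one list."""
    merged = []
    for source in sources:
        # csv.reader yields each record as a list of strings.
        merged.extend(csv.reader(source))
    return merged


# In-memory stand-ins for the CSV files (contents are made up).
parts = [io.StringIO("1,tcp,0\n2,udp,0\n"), io.StringIO("3,tcp,1\n")]
rows = merge_csv_sources(parts)
```

With real files, each element of `sources` would be a file object opened with `open(path, newline="")`.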
The next step was to check if there were any duplicate records and remove them. The removal of the duplicates is essential to avoid having the same records in the training and testing subsets because the testing subset should contain only the records that were not previously seen by the neural network. During this phase, we removed 480,625 duplicate records.
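Duplicate removal can be sketched as follows. The function name `drop_duplicates` is hypothetical; the sketch treats two records as duplicates only when all of their fields are equal, and it keeps the first occurrence.

```python
def drop_duplicates(rows):
    """Return the rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```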
The next step was to check whether the dataset contained any missing values. Three features contained missing values: ‘ct_flw_http_mthd’, ‘is_ftp_login’ and ‘ct_ftp_cmd’. The missing values were replaced with ‘0’. We noted that the dataset contains the value ‘–’ for the feature ‘service’ in a significant number of records, so we renamed this value to ‘undefined’ to make it more meaningful. Then, we removed the columns ‘srcip’ and ‘dstip’. We also fixed some white-space inconsistencies among records with the same values and other minor typos (e.g., ‘Backdoors’ instead of ‘Backdoor’ in the ‘attack_cat’ field).
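The cleaning steps above can be sketched for a single record as follows. The sketch assumes that each record is a dictionary of strings, that missing values appear as empty strings, and that the placeholder in the ‘service’ field is an ASCII hyphen; the names `FILL_ZERO` and `clean_record` are hypothetical.

```python
# Features whose missing values are replaced with 0.
FILL_ZERO = ("ct_flw_http_mthd", "is_ftp_login", "ct_ftp_cmd")


def clean_record(record):
    """Return a cleaned copy of one record given as a dict of strings."""
    cleaned = {}
    for key, value in record.items():
        if key in ("srcip", "dstip"):
            continue  # drop the IP address columns
        value = value.strip()  # remove stray white space
        if key in FILL_ZERO and value == "":
            value = "0"  # assumed representation of a missing value
        cleaned[key] = value
    if cleaned.get("service") == "-":
        cleaned["service"] = "undefined"
    if cleaned.get("attack_cat") == "Backdoors":
        cleaned["attack_cat"] = "Backdoor"
    return cleaned
```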
We repeated the one-hot encoding for the whole dataset, changing the categorical features ‘proto’, ‘service’, and ‘state’. At the end of this process, the dataset contained 202 columns.
While the column ‘label’ used for binary classification already contained 0 for regular traffic and 1 for abnormal traffic, the ‘attack category’ column required an encoding for the multiclass classification. Thus, in the next step, we encoded the ‘normal’ (no-attack) value as 0 and assigned values from 1 to 9 to the attack categories.
The next step was to split the dataset into training and testing subsets. The training subset was a random sample with 80% of the original records, while the testing subset contained a random sample with 20%.
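A random 80/20 split can be sketched as follows. The function name `train_test_split` and the fixed seed are assumptions made so that the sketch is reproducible; they do not describe the sampling routine actually used.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and return (training_rows, testing_rows)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(rows)      # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

Because the duplicates were removed beforehand, no record can appear in both the training and the testing subset after this split.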
As with the NSL-KDD dataset, we separated the feature columns of the training and testing subsets from the corresponding label and attack category columns.
As with the NSL-KDD dataset, the final step was to normalize the numerical variables of the training and testing feature subsets with the min-max method. In the end, these subsets contain 200 columns.
4.3. Model Architecture
After the data preparation phase, we started training the deep neural network. We adopted the same model architecture for both datasets to evaluate which would perform better. Different activation functions are used for different layers of the neural network. We differentiated the model for the binary classification and the one for multiclass classification, changing the number of nodes in the output layer and the activation function for the output layer. The hyperparameters related to the training algorithm are:
Batch size. This is a training parameter that indicates the number of records passed and processed by the algorithm before updating the model.
Number of epochs. This is also a training parameter which indicates the number of passes done through the complete training dataset.
Optimizer. The optimizer is the algorithm used to update the attributes of the network, such as the weights, in order to reduce the loss. Widely used optimizers include gradient descent, stochastic gradient descent, Adagrad, and adaptive moment estimation (Adam) [
37]. The optimizer used for the model is stochastic gradient descent (SGD) with Nesterov momentum.
Momentum. This parameter is used to help predict the direction of the next step, based on the previous steps. It is used to prevent oscillations. The usual choice is a number between 0.5 and 0.9.
Learning rate. The learning rate is a parameter that controls the speed at which the neural network learns. It is usually a small positive value in the range between 0.0 and 1.0. This parameter controls how much the model is changed in response to the estimated error each time the weights of the model are updated [
38].
Loss function. The loss function in a neural network is used to calculate the difference between the expected output and the output that was generated by the model. This function allows acquiring the gradients that the algorithm will use to update the neural network’s weights. The loss function used for this model for binary classification is the binary cross-entropy loss function. On the other hand, we used a sparse categorical cross-entropy loss function for multiclass classification.
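The two loss functions named above can be sketched in plain Python as follows. The function names and the clipping constant `eps` are assumptions of this sketch; deep learning libraries provide their own, numerically more careful implementations.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def sparse_categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy for integer class labels and per-class probability rows."""
    total = 0.0
    for label, probabilities in zip(y_true, y_pred):
        # Only the probability assigned to the true class contributes.
        total += -math.log(max(probabilities[label], eps))
    return total / len(y_true)
```

The "sparse" variant takes the integer-encoded attack category directly, so the labels do not need to be one-hot encoded.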
At the end of our experiments, the final values chosen for the training are provided in
Table 4. These final values were reached after a process of manual hyperparameter tuning which included a series of trials with different values. The number of epochs shown in
Table 4 indicates the maximum number of epochs, but Early Stopping is used in the experiments in order to prevent overfitting.
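The early stopping logic can be sketched as follows. The monitored quantity (validation loss), the `patience` value, and the function name `early_stopping_epoch` are assumptions of this sketch, since the exact configuration is not stated here.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs; if that never
    happens, all epochs are run.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)
```

In this sketch, the number of entries in `val_losses` plays the role of the maximum number of epochs.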
The neural network used for the experiment is a feed-forward neural network, which means that the connections between the nodes do not form any cycles and the data in the network moves only forward from the input nodes, going through the hidden nodes, and in the end reaching the output nodes. The algorithm used to train the network is the backpropagation algorithm. As mentioned earlier, backpropagation is short for “backward propagation of errors”. Given an error function and an artificial neural network, the backpropagation algorithm calculates the gradient of the error function with respect to the weights of the neural network [
39].
Moreover, the network has six layers: one input layer, four hidden layers, and one output layer. The input layer takes an input dimension equal to the number of features in the training dataset. The first hidden layer has 496 neurons and uses the Parametric Rectified Linear Unit (PReLU) activation function. PReLU generalizes the traditional rectified unit with a learnable slope for negative values and is formally defined as [
40]:
$$f(y_i) = \begin{cases} y_i, & \text{if } y_i > 0 \\ a_i y_i, & \text{if } y_i \le 0 \end{cases}$$
where $a_i$ is the learnable coefficient that controls the slope of the negative part.
The other hidden layers use the Rectified Linear Unit (ReLU) activation function. This function was designed to overcome the vanishing gradient problem: it returns 0 for any negative input and returns the input itself for any positive input. It can be defined as:
$$f(x) = \max(0, x)$$
The second, third and fourth hidden layers have 248, 124 and 62 nodes, respectively. The output layer has a different activation function and a different number of neurons depending on the type of classification. For binary classification, the output layer uses the sigmoid activation function and has only one neuron. The sigmoid function takes a real value as the input and outputs a value between 0 and 1. It can be defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
On the other hand, for the multiclass classification, the number of neurons in the output layer is equal to the number of classes in the dataset, and the activation function is the softmax function. This function converts a vector of K real values into a vector of K real values that sum to 1 [
41]. It can be defined as:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K$$
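The four activation functions used in the network (PReLU, ReLU, sigmoid, and softmax) can be sketched in plain Python as follows. The function names are hypothetical, and the value of `alpha` passed to `prelu` stands in for a slope that the network learns during training.

```python
import math


def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


def prelu(x, alpha):
    """Like ReLU, but negative inputs are multiplied by the slope `alpha`."""
    return x if x > 0 else alpha * x


def sigmoid(x):
    """Map a real value to the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # Equivalent form that avoids overflow for large negative inputs.
    z = math.exp(x)
    return z / (1.0 + z)


def softmax(z):
    """Convert a list of K real values into K non-negative values that sum to 1."""
    largest = max(z)  # subtracting the maximum improves numerical stability
    exps = [math.exp(value - largest) for value in z]
    total = sum(exps)
    return [value / total for value in exps]
```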
Additionally, to prevent overfitting during the training phase, we applied dropout to all the hidden layers. Dropout is a regularization method that causes some of the neurons of a layer to be randomly dropped out (ignored) during the training of the network. Dropped neurons are not considered during the corresponding forward or backward pass through the neural network. The dropout rate chosen for this network, for each hidden layer, was 0.1. This means that, on average, 10% of the units are dropped (set to 0) at each step. The units that are not dropped are scaled up by
$1/(1 - 0.1)$ so that the expected sum of all the units remains unchanged. A graphical representation of the architecture of the neural network can be seen in
Figure 3.
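The dropout behavior during training can be sketched as follows: each unit is set to 0 with probability `rate`, and the surviving units are multiplied by 1 / (1 - rate). The function name `dropout` and the optional `rng` argument are assumptions of this sketch; at inference time dropout is not applied.

```python
import random


def dropout(activations, rate=0.1, rng=None):
    """Apply dropout to a list of activations during training."""
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)  # compensates for the dropped units
    return [
        0.0 if rng.random() < rate else activation * keep_scale
        for activation in activations
    ]
```

With a rate of 0.1, a surviving unit with value 1.0 becomes approximately 1.111, so the expected sum over the layer stays the same as without dropout.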