4.2. Comparative Analysis of Feature Correlation-Based and Sequential Pixel Mapping
This section explains why the image generation process represents each feature value sequentially at an individual pixel. As mentioned in Section 3.3.4, CNN models perform computations that exploit spatial correlations, specifically the adjacency of pixels, during image processing. To examine the impact of such correlations on image generation, we conducted comparative experiments between correlation-based pixel mapping and sequential pixel mapping strategies.
Figure 6 compares models trained on images generated with and without consideration of feature correlation. The blue curve represents the model trained with correlated features, while the red curve denotes the model trained without considering feature correlation. Solid lines indicate results from the training set, whereas dashed lines represent results from the validation set. Analysis of the accuracy graph (Figure 6a) and the loss graph (Figure 6b) suggests that feature correlation does not significantly impact model performance. As evidenced by the measured values in Table 8, the differences were minimal. More specifically, the model trained with correlated features achieved an accuracy and validation accuracy of 99.59% and 99.09%, respectively, whereas the model without correlation consideration achieved 99.44% accuracy and 98.98% validation accuracy, a difference of merely 0.15% in training accuracy and 0.11% in validation accuracy. Similarly, the difference in loss metrics was negligible (0.0028 for training loss and 0.0024 for validation loss). These minimal differences fall within typical statistical variance ranges for neural network training, indicating no practically significant advantage for correlation-based pixel arrangement. As shown in Figure 6a,b, both models converge to similar performance levels after approximately 40 epochs, with the correlation-based model showing only marginally better results. This marginal performance difference results from two key factors. First, the information gain-based feature selection prioritized individually discriminative features, reducing the additional benefit of correlation-based spatial arrangements. Second, the convolution operations of the CNN effectively capture feature interactions regardless of pixel positioning within our compact 6 × 6 image representation.
This experiment verifies that feature correlation has a minimal effect on the model's classification performance. Therefore, we adopted an image generation method that sequentially maps each feature to a single pixel, prioritizing simplicity and computational efficiency. This choice provides equivalent classification performance while significantly reducing the preprocessing overhead of measuring inter-feature correlations and performing correlation-based pixel mapping.
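To make the sequential mapping concrete, the following minimal sketch reshapes a vector of 36 preprocessed feature values into a 6 × 6 single-channel image in row-major order. It is an illustrative assumption rather than the exact implementation of this work; the function name `features_to_image` and the random stand-in sample are hypothetical.

```python
import numpy as np

def features_to_image(feature_vector, side=6):
    """Map a 1-D vector of selected feature values onto a square grayscale
    image by filling pixels sequentially in row-major order.

    Sketch only: assumes the 36 selected features were already scaled to
    [0, 1] during preprocessing, as described earlier in the paper.
    """
    assert len(feature_vector) == side * side
    # Sequential mapping: feature i lands at pixel (i // side, i % side).
    return np.asarray(feature_vector, dtype=np.float32).reshape(side, side)

# Example: 36 normalized feature values become one 6x6 single-channel image.
sample = np.random.rand(36)          # stand-in for one preprocessed NSL-KDD record
image = features_to_image(sample)    # shape (6, 6)
image = image[..., np.newaxis]       # shape (6, 6, 1), ready for a CNN input layer
```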
4.3. The Results of CNN Model Training
To evaluate the classification performance of our proposed model, the generated image dataset of 148,517 images was divided into three sets: a training set (60%), a validation set (20%), and a test set (20%). The model was trained using the training set, and the validation set was used to monitor classification accuracy and tune hyperparameters. Finally, the test set was employed to evaluate the model's generalization performance. When training a model, the choice of weight initialization method is an important factor, as it can influence the training outcome. TensorFlow offers various weight initialization methods, including Glorot initialization [89] and He initialization [90]. In this study, we used the Glorot initialization method, which initializes weights randomly; as a result, the training results can vary from run to run. We therefore trained the model a total of 10 times and report the average as its final classification performance.
Figure 7 shows the results of the 10 training runs. Each run was performed for 100 epochs with a batch size of 32 and a learning rate of 0.0001. The average accuracy was 99.46% and the average val_accuracy was 99.09%, indicating excellent classification performance. Over the 100 epochs, the average loss and val_loss converged to 0.0139 and 0.041, respectively, although the val_loss tended to increase again after the 60th epoch.
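The repeated-training protocol described above can be summarized in the following hedged sketch. `build_model()` is a hypothetical stand-in for the CNN designed in this work, and the placeholder arrays merely mimic the shape of the generated 6 × 6 image dataset; only the Glorot initialization, 10 runs, 100 epochs, batch size 32, and learning rate 0.0001 are taken from the text.

```python
import numpy as np
import tensorflow as tf

def build_model():
    """Hypothetical stand-in for the CNN architecture designed in this work."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(6, 6, 1)),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               kernel_initializer="glorot_uniform"),  # Glorot init [89]
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Placeholder data standing in for the generated 6x6 image dataset.
x_train = np.random.rand(1000, 6, 6, 1); y_train = np.random.randint(0, 2, 1000)
x_val   = np.random.rand(200, 6, 6, 1);  y_val   = np.random.randint(0, 2, 200)

val_accuracies = []
for run in range(10):                      # 10 runs; Glorot init is random each time
    model = build_model()
    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=100, batch_size=32, verbose=0)
    val_accuracies.append(history.history["val_accuracy"][-1])

# Final classification performance is reported as the average over the 10 runs.
print("mean val_accuracy:", np.mean(val_accuracies))
```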
To evaluate the model's classification performance, we used a confusion matrix. Detailing the terms within this matrix, True Positive (TP) refers to the accurate classification of attack samples as attack, and True Negative (TN) denotes the correct classification of normal samples as normal. Conversely, False Positive (FP) signifies the misclassification of normal samples as attack, and False Negative (FN) indicates the misclassification of attack samples as normal. These four quantities constitute the confusion matrix, from which performance is measured using Equations (9)–(14).
Figure 8 and Table 9 display the confusion matrix and evaluation metric results, respectively.
The model’s generalization performance was evaluated using the test set. Based on the confusion matrix shown in Figure 9 (TP = 14,115, FN = 178, FP = 100, and TN = 15,311), the precision, recall, F1-score, and accuracy were 0.9930, 0.9875, 0.9902, and 99.06%, respectively. Furthermore, the proposed model achieved a false positive rate (FPR) of 0.65% and a false negative rate (FNR) of 1.25%. The FPR represents the proportion of normal traffic incorrectly classified as attacks, which can lead to alert fatigue and increased operational overhead for security analysts in production environments. Conversely, the FNR indicates the proportion of actual attacks misclassified as normal traffic, which poses more severe security risks because such attacks go undetected. It is noteworthy that this FNR of 1.25% was obtained from a balanced dataset with an approximately 5:5 ratio of normal to malicious samples. In real-world network environments, where normal traffic constitutes the vast majority of total traffic, the FNR percentage is expected to remain similar, but the absolute number of missed attacks would be far lower because the volume of actual attack traffic is much smaller than in our test conditions. Our model thus demonstrates excellent performance with a low false positive rate and a relatively low false negative rate, striking an effective balance between minimizing alert fatigue and maintaining comprehensive attack detection.
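As a quick sanity check, the reported metrics can be recomputed directly from the confusion matrix values above by applying the standard definitions from Equations (9)–(14):

```python
# Recomputing the reported test-set metrics from the confusion matrix
# given above (TP = 14,115, FN = 178, FP = 100, TN = 15,311).
TP, FN, FP, TN = 14_115, 178, 100, 15_311

precision = TP / (TP + FP)                                  # 0.9930
recall    = TP / (TP + FN)                                  # 0.9875
f1_score  = 2 * precision * recall / (precision + recall)   # 0.9902
accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 0.9906 -> 99.06%
fpr       = FP / (FP + TN)                                  # 0.0065 -> 0.65%
fnr       = FN / (FN + TP)                                  # 0.0125 -> 1.25%

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1_score:.4f} "
      f"acc={accuracy:.4%} fpr={fpr:.2%} fnr={fnr:.2%}")
```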
4.4. Effectiveness of Feature Engineering
To evaluate the effectiveness of dimensionality reduction in improving training efficiency, we conducted comparison experiments between two models: one trained on images without dimensionality reduction and one trained on images with dimensionality reduction. The images without dimensionality reduction were created using one-hot encoding during data preprocessing: the categorical features protocol_type, service, and flag have 3, 70, and 11 unique values, respectively, yielding 84 new one-hot features. Adding these to the existing 38 numeric features gives a total of 122 features. To obtain a square image, we removed the urgent feature, which had the lowest importance, leaving 121 features that form an 11 × 11-pixel image. The image dataset generated in this way was used to train our designed CNN model.
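The feature expansion just described can be sketched as follows. This is an illustrative assumption rather than the exact preprocessing code of this work: it presumes a pandas DataFrame `df` already loaded with the standard 41 NSL-KDD features plus a `label` column.

```python
import pandas as pd

# Assumed input: a DataFrame `df` holding the 41 NSL-KDD features and a label.
categorical = ["protocol_type", "service", "flag"]   # 3, 70, and 11 unique values

# One-hot encoding expands the 3 categorical features into 84 binary columns,
# which together with the 38 numeric features gives 122 features in total.
encoded = pd.get_dummies(df, columns=categorical)

# Dropping the least-important feature ('urgent') leaves 121 features,
# which reshape into one 11 x 11 single-channel image per record.
encoded = encoded.drop(columns=["urgent"])
images = encoded.drop(columns=["label"]).to_numpy(dtype="float32")
images = images.reshape(-1, 11, 11, 1)
```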
Our model uses only 6 × 6-pixel images, compared to the 11 × 11-pixel images produced by the one-hot encoding method. Nevertheless, as can be seen in Figure 9, our model achieves both higher accuracy and shorter training time.
While Figure 9 and Figure 10 summarize the overall performance advantages of our approach in terms of accuracy and training time reduction, a more detailed statistical analysis is needed to verify the consistency of these improvements. To comprehensively evaluate the consistency and reliability of our results, each model was trained and tested 10 times. Figure 11 presents boxplots showing the distribution of performance metrics across these 10 experimental runs.
Table 10 summarizes the mean and standard deviation of each metric. As shown in Table 10, our proposed model with dimensionality reduction consistently outperforms the dimensionality-reduction-free model across all metrics. Not only does it achieve higher average accuracy in training (0.9946 ± 0.00051), validation (0.9909 ± 0.00101), and testing (0.9906 ± 0.00099) than the model without feature reduction (0.9938 ± 0.00080, 0.9890 ± 0.00140, and 0.9898 ± 0.00116, respectively), but it also demonstrates greater stability, with lower standard deviations in most metrics. The boxplots in Figure 11 visually confirm this trend, showing tighter distributions for the model with feature reduction.
Similarly, the loss metrics indicate better performance for our proposed model, with a training loss of 0.01397 ± 0.00128 compared to 0.01604 ± 0.00209 for the model without dimensionality reduction, as seen in Table 10. The validation loss shows a similar pattern (0.04132 ± 0.00542 vs. 0.04154 ± 0.00384), though with slightly higher variability in our model.
Figure 10 shows the training time of the proposed model in comparison to the model without dimensionality reduction. As presented in Table 10 and visualized in Figure 11, the average training time for the model without dimensionality reduction was 926.47 ± 21.34 s, while our model was trained in an average of 764.66 ± 32.53 s. Our model therefore trained about 17.47% faster, indicating that the image dataset generated using one-hot encoding consumes considerably more model training time.
On the other hand, the testing time was measured on the test set used to evaluate the prediction performance of the trained model. As shown in Table 10, the dimensionality-reduction-free model took an average of 3.71 ± 0.43 s, whereas the proposed model completed the evaluation in an average of 2.99 ± 0.33 s, a 19.44% improvement in inference speed. These results confirm that the proposed framework improved the speed and efficiency of normal and attack classification while minimizing resource consumption through dimensionality reduction. The small standard deviations observed across accuracy measurements, loss metrics, training times, and inference times indicate that the performance improvements are consistent and not due to random variation in model initialization. The boxplots further demonstrate that even in the experimental runs with the lowest observed performance (worst-case scenarios), our model with dimensionality reduction outperforms the dimensionality-reduction-free model across most metrics.
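The timing comparison above can be reproduced with a simple wall-clock measurement around training and prediction. The sketch below is a hedged illustration under the assumption that `model`, `x_train`, `y_train`, and `x_test` come from the training setup of Section 4.3; the exact instrumentation used in this work is not specified.

```python
import time

# Wall-clock timing of one training run (100 epochs, batch size 32).
start = time.perf_counter()
model.fit(x_train, y_train, epochs=100, batch_size=32, verbose=0)
train_seconds = time.perf_counter() - start      # reported average: ~764.66 s

# Wall-clock timing of inference over the full test set.
start = time.perf_counter()
predictions = model.predict(x_test, verbose=0)
infer_seconds = time.perf_counter() - start      # reported average: ~2.99 s

# Relative speedup over the baseline without dimensionality reduction.
train_speedup = (926.47 - 764.66) / 926.47       # ~17.47% faster training
print(f"train: {train_seconds:.2f}s  infer: {infer_seconds:.2f}s  "
      f"training speedup vs. baseline: {train_speedup:.2%}")
```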
In conclusion, the proposed framework showed high classification accuracy together with fast model training and testing times. These results demonstrate that the framework can achieve significant performance improvements in real-time malicious network traffic detection in IDSs.
To comprehensively evaluate our dimensionality reduction approach, we compared our framework with existing studies that employed various dimensionality reduction techniques on intrusion detection datasets, as summarized in Table 11.
Our approach demonstrates several notable advantages when considering the holistic trade-offs between the extent of dimensionality reduction, accuracy, and computational efficiency. First, a key strength of our approach is achieving superior classification accuracy (99.06%) with only minimal dimensionality reduction (12.2%, from 41 to 36 features, as shown in the ‘Red. (%)’ column). Unlike studies applying substantial feature reduction (ranging from 49.0% to 94.3% in Table 11), this minimal reduction prioritizes preserving the original information content. This strategic feature selection contributed to our model’s high accuracy (99.06%), which outperformed other NSL-KDD-based approaches in terms of ‘Acc. After’, such as those of Li et al. [62] (96.12%), Obeidat et al. [58] (80.60%), and Wu et al. [60] (85.73%). Furthermore, another significant strength is the balanced improvement across multiple performance facets: our framework exhibits substantial reductions in both training time (approx. 17.47%; ‘Train Time Δ (%)’) and inference time (approx. 19.44%; ‘Infer. Time Δ (%)’) while simultaneously maintaining, and even slightly improving, classification accuracy (+0.08% ‘Acc. Δ’). This positive ‘Acc. Δ’ contrasts sharply with the weaknesses observed in several other approaches, where accuracy degradation was a notable trade-off. For instance, the studies by Moyano et al. [54], Awad et al. [66], and Chen et al. [68] reported accuracy degradations (‘Acc. Δ’ values of −0.20%, −0.26%, and −0.30%, respectively) after applying their DR techniques.
Regarding computational efficiency, while our method shows significant improvements, some studies present different trade-off profiles. For example, Li et al. [62] and Obeidat et al. [58] experienced considerably increased training times (indicated by negative ‘Train Time Δ (%)’ values of −16.7% and −357.7%, respectively, where ‘−’ signifies degradation, as per the Table 11 note) despite reducing features, highlighting a critical inefficiency in their post-DR modeling or in the DR process itself. In contrast, our positive ‘Train Time Δ (%)’ of +17.47% underscores our efficiency gain. Although Wu et al. [60] achieved higher percentage improvements in training time (+31.8%) and inference time (+27.2%) than our approach, their significant trade-off was a considerably lower overall accuracy post-DR (85.73% ‘Acc. After’), whereas our method achieved 99.06%. This highlights a crucial strength of our framework: the ability to deliver substantial computational benefits without a major sacrifice in detection accuracy, a balance not achieved by all compared methods.
In essence, while some existing DR techniques exhibit weaknesses such as significant accuracy degradation (e.g., Moyano et al. [54]), substantial increases in training overhead despite feature reduction (e.g., Li et al. [62] and Obeidat et al. [58]), or higher percentage efficiency gains achieved at a steep cost to final accuracy (e.g., Wu et al. [60]), our approach demonstrates a more robust and holistically optimized profile. It carefully balances minimal, information-preserving feature reduction with tangible improvements in computational efficiency and maintained or enhanced detection accuracy. This balanced approach to feature reduction demonstrates why our method performs effectively in CNN-based network intrusion detection.
4.5. Performance Comparison with Existing CNN-Based IDS Models
To compare the performance of our model with the latest CNN-based IDS models, we surveyed related studies with similar approaches. Table 12 shows the performance comparison between these recent models and ours. Although the NSL-KDD dataset is an outdated network traffic dataset, it is still widely used for the performance evaluation of deep learning-based IDSs. In addition, many studies have utilized CNN architectures as classification models, and research on hybrid detection models that combine two or more deep learning models is also actively being conducted. As shown in Table 12, the binary classification accuracy results reported in the preceding experiments were used as the performance comparison criterion. Our model’s performance (99.06% accuracy with a standard CNN) exceeded that of most CNN-based IDS models proposed in previous studies. Exceptions include the models by Cao et al. [18] (99.81%) and Akhtar et al. [46] (99.50%), which reported slightly higher classification accuracy, with differences of only 0.75% and 0.44%, respectively. Yan et al. [20] also reported a comparable accuracy of 99.13% using a more complex CNN (VGG16) architecture. These minor differences highlight the competitiveness of our streamlined approach.
A key strength of our model becomes evident when considering other performance indicators and model complexity. For instance, while Akhtar et al. [46] achieved a marginally higher accuracy, the other evaluation indicators reported in their study, such as precision (0.98), recall (0.97), and F1-score (0.97), were lower than our model’s corresponding scores of 0.9930, 0.9875, and 0.9902 (as detailed in Section 4.3, Table 9). This suggests that our model provides a better balance in identifying true positives while minimizing false alarms.
The model by Cao et al. [18] achieved the highest accuracy in Table 12 by employing a hybrid (CNN + GRU) architecture. However, as a general principle, combining two or more deep learning models typically increases the number of neural network layers and parameters, which often leads to increased training time and memory usage. Thus, a potential trade-off for the higher accuracy achieved by such hybrid models is a reduction in computational efficiency. In this respect, while the model of Cao et al. [18] showed high classification accuracy, its complex hybrid structure implies that it may not be as resource-efficient as our single, comparatively simpler CNN architecture. Similarly, the CNN (VGG16) model used by Yan et al. [20], while achieving comparable accuracy, is known to be a deeper and more computationally intensive architecture than the standard CNN developed in our work.
Furthermore, our model’s accuracy significantly surpasses that of several other hybrid models listed, such as those of Meliboev et al. [47] (CNN + LSTM, 82.60%), Cui et al. [19] (CNN + LSTM, 85.59%), and Yang et al. [50] (CNN + BiLSTM, 82.60%). This indicates that increasing model complexity through hybridization does not inherently guarantee superior accuracy and may, in some cases, underperform a well-optimized single CNN like ours. Our model also outperforms other single-CNN approaches, such as those of Al-Turaiki et al. [17] (90.14%), Wang et al. [16] (97.70%), Zhang et al. [13] (83.31%), and Li et al. [14] (86.95%), further underscoring the effectiveness of our specific architecture and feature engineering.
Therefore, our model can be regarded as a balanced detection model that combines high classification accuracy with strong potential for system resource efficiency, a crucial aspect for practical IDS deployment.