Abstract
Machine learning (ML)-based network intrusion detection systems (NIDSs) depend entirely on the performance of their ML models. Therefore, many studies have been conducted to improve the performance of these models. Nevertheless, relatively few studies have focused on the feature set, which significantly affects model performance. In addition, features are typically generated by analyzing data collected after the session ends, which requires a significant amount of memory and a long processing time. To address this problem, this study presents a new session feature set to improve existing NIDSs. Existing session-feature-based NIDSs fall largely into two categories: those using a single-host feature set and those using a multi-host feature set. This research merges the two session feature sets into an integrated feature set, which is used to train an ML model for the NIDS. In addition, an incremental feature generation approach is proposed to eliminate the delay between the session end time and the creation time of the integrated features. The improved performance of the NIDS using the integrated features was confirmed through experiments. Compared to NIDSs based on ML models using the existing single-host and multi-host feature sets, the NIDS with the proposed integrated feature set improves the detection rate by 4.15% and 5.9% on average, respectively.
1. Introduction
At present, network intrusions are highly diverse and sophisticated, making them increasingly difficult to detect accurately. To improve the accuracy of network intrusion detection, several studies have been conducted using various technologies. In particular, as machine learning (ML) has evolved significantly, many network intrusion detection systems (NIDSs) that use ML have recently been proposed [1,2,3]. Whereas early ML-based NIDSs mainly relied on simple ML models, recent systems employ complex deep learning models for network intrusion detection. Unlike conventional pattern-based NIDSs, deep-learning-based NIDSs demonstrate robust detection performance against evasion attacks that partially modify intrusion methods and achieve high detection performance against zero-day attacks, which are new and previously unknown [4,5,6]. Deep learning has therefore become a core technology for countering network intrusions.
However, several important problems must be solved to implement a deep-learning-based NIDS. First, a large dataset consisting of many intrusion and normal traffic samples is needed to train a deep learning model. Various research institutes have steadily released datasets that include the latest intrusion traffic, and studies on removing the bias of datasets generated at specific sites are also being conducted extensively. Therefore, the problem of the quantity of data required to train deep learning models has been significantly alleviated. The second problem is the design of the deep learning model. The most critical design factor is determining which data characteristics are used as learning features, because both the detection performance of the NIDS and the complexity of the model depend on the features. However, unlike the dataset problem, relatively few studies have addressed this issue. In an NIDS, it is common to use features that reflect a session’s overall characteristics rather than features of individual packets [7]. Such session features have been widely used since the early days, starting with those presented in the KDD99 dataset [8]. However, the session features presented in the KDD99 dataset are costly to generate because sessions belonging to multiple hosts (sessions created by multiple hosts, or created by one host toward multiple destinations) must be analyzed simultaneously.
The session features used in the ISCX2012 dataset, later released by the University of New Brunswick (UNB), are partially similar to the session features presented in the KDD99 dataset [9]. The fundamental difference, however, is that the ISCX2012 dataset comprises only features that can be created by analyzing sessions belonging to a single host (sessions created between the same single source and the same single destination). Therefore, generating the session features of the ISCX2012 dataset from network traffic is much easier than generating the session features of the KDD99 dataset. Nevertheless, little research has been conducted on how particular session features affect intrusion detection performance. To improve the accuracy of existing ML-based NIDSs, we therefore need to answer the following questions: First, which features are most suitable for the ML models used in NIDSs? Second, can those features be generated without significant overhead so that they can be applied to existing NIDSs?
To answer these questions, we conducted and analyzed experiments, focusing on session features, to determine how they can be used to increase detection performance. In addition, we propose a feature set that further improves the detection performance of existing session-feature-based NIDSs. Finally, we introduce an incremental generation method that builds the new feature set from network traffic in semi-real time. Our contributions are as follows.
- We propose a unique integrated feature set that combines single-host and multi-host features to significantly improve the detection accuracy of existing NIDSs. Through extensive experiments on features, we identify the feature set most suitable for ML-based NIDSs.
- We present an incremental generation algorithm that builds the integrated features in real time without significant overhead. Although the integrated feature set improves the classification accuracy of an NIDS the most, applying it to NIDSs is impractical with the existing generation algorithms because of their high overhead. To solve this problem, we present a very lightweight, real-time feature generation algorithm that is fundamentally different from the existing algorithms.
The remainder of this study is structured as follows. Section 2 explains the features used in previous studies, and Section 3 presents the new feature sets and progressive generation methods. Section 4 analyzes the performance of the NIDS by applying various ML models to the existing session feature set, including the proposed feature set. Finally, Section 5 presents the conclusions of this study.
2. Existing Work
The feature sets widely used in recent ML-based NIDSs can be divided into session features, which are obtained by analyzing the traffic of an entire session rather than individual packets, and packet features, which are created directly from packet data. Session features can be further classified into single-host features, which are created by analyzing sessions between a single source and a single destination, and multi-host features, which are created by analyzing sessions between a single source and multiple destinations or between multiple sources and a single destination.
The well-known datasets that use single-host features are the ISCX2012, CIC-IDS2017, and CSE-CIC-IDS2018 datasets, published by the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick (UNB) [10,11]. Single-host features are easier to create because only the traffic of the session itself is required to create its features. In general, to create a session feature, the traffic of each session must be collected and analyzed from the beginning to the end of the session; thus, the memory complexity of generating a session feature is θ(n), where n is the number of packets in the session. Recently, a new approach for generating single-host features in-line has been proposed, which shows that a session feature can be built up gradually by updating a few data fields whenever a packet is received [12]. In this method, the memory complexity is θ(f), where f is the number of features, which significantly reduces the complexity compared to the existing method and allows session features to be generated in semi-real time, immediately after the session terminates, by minimizing the feature extraction time. This indicates that a non-real-time NIDS can be extended to a real-time network intrusion prevention system (NIPS).
In contrast to the single-host feature, the multi-host feature is created using all sessions with the same destination or the same source among the sessions created within a specific period. Because many sessions must be considered simultaneously to create a single feature, both the memory and time complexity are higher than those of a single-host feature. Due to this complexity, multi-host features are not commonly used at present. However, the multi-host feature contains valuable information for detecting distributed attacks that use multiple zombie hosts. Because distributed attacks are increasingly common, research on methods of creating multi-host features in real time at a low cost is growing in importance; in particular, if multi-host features can be generated in real time, distributed attacks could be detected accurately in real time. Single-host features are often used because they are simpler to create than multi-host features. However, single-host and multi-host features carry different information about the traffic, so if they are used simultaneously, they can compensate for each other’s shortcomings. To use both feature types simultaneously, it is therefore important to develop a means of efficiently generating multi-host features in real time while minimizing resource usage. Each feature set type and the corresponding datasets are described in detail in Table 1.
Table 1.
Comparison of single-host and multi-host features.
2.1. Single-Host Feature
The key to ML-based NIDS research is creating suitable datasets. However, this task requires considerable effort and time. When ML-based NIDSs were first proposed, only a limited number of datasets were practically usable. The CIC has recently provided various datasets necessary for network security research, as shown in Table 2. Therefore, most ML-based NIDS studies use CIC datasets.
Table 2.
Datasets provided by CIC [12].
The CIC creates session features using a self-developed tool called CICFlowMeter [13,14]. Table 3 shows a part of the feature set generated by CICFlowMeter v3. Note that the features created by CICFlowMeter are obtained by analyzing the packets of all sessions generated between a single source and a single destination. Typical features include the number of packets transmitted and received within one TCP session and the average packet size within one session. The total number of features is 80. These features can be further divided into two types: intra-flow features, obtained by analyzing only one session, and inter-flow features, obtained by analyzing several sessions simultaneously.
Table 3.
Partial feature set generated by CICFlowMeter. IAT—inter-arrival time.
To create a single-host-based dataset, the CIC saves a dump file of packets and then analyzes the packet data of each session with CICFlowMeter to create the features. This process cannot be performed in real time within an NIDS because its memory usage is high. Therefore, the NIDS generates features in non-real time for terminated sessions and performs intrusion detection using an ML classifier. In a recent study, a method was proposed to create and update meta-feature values whenever a packet is received, so that single-host features can be created immediately after session termination, without CICFlowMeter and without high time and space complexity. With this method, features can be created for sessions in near-real time, minimizing the delay from session termination to detection and thus resolving the biggest drawback of NIDSs based on existing single-host datasets.
This approach is called incremental feature generation (IFG) [15]. Figure 1 presents the basic structure of IFG. Information about received packets is stored and processed independently according to packet direction; this information is updated each time a packet is received, and the complete single-host feature set is updated from it. Therefore, unlike methods that analyze the session traffic after the session is terminated, the features are available immediately after session termination.
Figure 1.
IFG block diagram for single-host feature creation.
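To make the idea concrete, the following is a minimal sketch of IFG-style incremental feature generation; it is not the implementation of [15]. A fixed-size per-session state is updated on every packet, so a handful of illustrative single-host features (packet counts, byte totals, mean inter-arrival time) are available the instant the session terminates. All field and feature names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SessionState:
    """Fixed-size per-session state updated on every packet (theta(f) memory)."""
    fwd_pkts: int = 0                 # packets seen in the forward direction
    bwd_pkts: int = 0                 # packets seen in the backward direction
    fwd_bytes: int = 0
    bwd_bytes: int = 0
    last_ts: Optional[float] = None   # timestamp of the previous packet
    iat_sum: float = 0.0              # running sum of inter-arrival times
    iat_cnt: int = 0

    def update(self, ts: float, length: int, forward: bool) -> None:
        """Incorporate one packet; no per-packet data is stored."""
        if forward:
            self.fwd_pkts += 1
            self.fwd_bytes += length
        else:
            self.bwd_pkts += 1
            self.bwd_bytes += length
        if self.last_ts is not None:
            self.iat_sum += ts - self.last_ts
            self.iat_cnt += 1
        self.last_ts = ts

    def features(self) -> dict:
        """Single-host features, available the instant the session terminates."""
        return {
            "total_fwd_packets": self.fwd_pkts,
            "total_bwd_packets": self.bwd_pkts,
            "total_bytes": self.fwd_bytes + self.bwd_bytes,
            "flow_iat_mean": self.iat_sum / self.iat_cnt if self.iat_cnt else 0.0,
        }

# Example: three packets of one session.
state = SessionState()
state.update(ts=0.00, length=60, forward=True)
state.update(ts=0.01, length=1500, forward=False)
state.update(ts=0.03, length=60, forward=True)
print(state.features())
```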
2.2. Multi-Host Feature
The KDD99 and Kyoto2016 datasets contain single-host features similar to those created by the CIC [16,17]. However, these datasets also include multi-host features with characteristics distinct from the CIC datasets. Creating a multi-host feature requires higher computational and memory complexity than creating single-host features. As the KDD99 and Kyoto2016 datasets are very similar, only the Kyoto2016 dataset is described here.
The feature set used in the Kyoto2016 dataset is as follows [17]. Excluding session-dependent fields, it consists of 11 single-host features and 7 multi-host features, as shown in Table 4. Although this is a very small number of features compared to the 80 features of the CIC datasets, it achieves a high intrusion detection success rate thanks to the multi-host features.
Table 4.
Partial feature set used in the Kyoto dataset.
2.3. Packet Feature
As mentioned above, because session features are created by analyzing all packets from the first to the last packet of each session, the features are inevitably created only after the intrusion ends. In contrast, an NIDS using packet features collects a certain number of packets and uses their data directly as features. To use session features, the traffic characteristics to be used as features must be decided in advance; an ML model using packet features does not require such preliminary work. Instead, deep learning technologies, such as CNNs, are essential because meaningful information must be extracted directly from the packet data [4]. The most severe problem with packet features is that, because each byte of packet data is used as one feature, a very large number of features is generated, making real-time processing impossible. As one-hot encoding is applied to each byte, 100 bytes of packet data are converted into 25,600 features, and significantly high processing power and time are required to apply a deep learning algorithm to such a high-dimensional dataset. Figure 2 shows the packet feature generation process of HAST-IDS, a representative NIDS that uses packet features [4]. After sequentially collecting packets up to a specific size starting from the first packet, each byte value is expanded by one-hot encoding to create the packet feature.
Figure 2.
Process of generating packet features used in HAST-IDS.
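The following sketch illustrates the dimensionality problem described above: one-hot encoding the first 100 bytes of a packet produces 100 × 256 = 25,600 features. It illustrates the general encoding step rather than the HAST-IDS code; the function name and zero-padding policy are our assumptions.

```python
import numpy as np

def packet_to_one_hot(payload: bytes, length: int = 100) -> np.ndarray:
    """One-hot encode the first `length` bytes of a packet.

    Each byte (0-255) becomes a 256-dimensional one-hot vector, so a
    100-byte packet yields 100 * 256 = 25,600 features. Shorter packets
    are zero-padded.
    """
    encoded = np.zeros((length, 256), dtype=np.float32)
    for i, byte_value in enumerate(payload[:length]):
        encoded[i, byte_value] = 1.0
    return encoded.reshape(-1)   # flatten to a 25,600-dimensional feature vector

features = packet_to_one_hot(b"\x45\x00\x00\x3c" + b"\x00" * 200)
print(features.shape)   # (25600,)
```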
3. Integrated Session Feature-Based NIDS
3.1. Incremental Session Feature
A new feature set is proposed to overcome the disadvantages of the existing single-host and multi-host features. The proposed method integrates the existing single-host and multi-host features and removes duplicates. We also propose an incremental generation method so that the integrated features can be created without delay when a session terminates. Packet features were not considered in this study because they cannot be generated in real time. Among the fields of the unified feature set, the features duplicated between the two existing sets are duration and service, and the source IP, destination IP, and source port are excluded from the integrated features because they vary from session to session. The resulting total number of features is 97, as shown in Table 5.
Table 5.
Feature set size.
In the conventional approach, every packet is stored as it is received, and when the session terminates, the features are created from all packet data using Zeek and CICFlowMeter [17]. However, this consumes a large amount of computation and memory after the session ends. To improve this, we extended the method of incrementally creating the existing single-host session features so that both single-host and multi-host session features can be created incrementally. The incremental feature-creation method is illustrated in Figure 3. The structure used to create the existing single-host session features is extended with a structure that stores all sessions with the same SIP from the last 2 s and counts the number of sessions with the same service and with SYN errors. In this way, all necessary single-host session features are created incrementally.
Figure 3.
Block diagram for incrementally generating integrated features from network traffic.
In addition, a new structure for creating multi-host session features was added. For all sessions with the same DIP, the method maintains a window that stores only the latest 100 sessions and a two-second window that stores the sessions that occurred during the last two seconds. The number of sessions with the same SIP and with SYN errors is calculated in real time. In this way, the multi-host session features can be created without delay when the session is terminated.
3.2. Incremental Feature Generation
Let us describe in detail how to incrementally generate the session features. Due to space limitations, we only show how to obtain the features of the Kyoto2016 dataset, which are more difficult to obtain than those of ISCX2012. As shown in Figure 3, to generate the single-host and multi-host features, the proposed NIDS must manage the sessions of the past 2 s regardless of DIP as well as the last 100 sessions for each DIP. To manage many sessions in real time, the proposed method uses a linear queue for the sessions of the past 2 s, as shown in Figure 4; if a session’s end time is more than 2 s old, its entry is deleted from the linear queue. In addition, to manage the last 100 sessions for each DIP, the proposed NIDS uses a hash table with the DIP as the key and circular queues referenced by the hash entries.
Figure 4.
Examples of two queues and corresponding session entry structures. The circular queue contains three sessions while the linear queue contains two sessions.
A session entry consists of the session information (i.e., SIP, DIP, service, SYN error status, and source port number) in addition to a session end time, a pointer list to hash table entries, and a reference counter. Since the proposed scheme uses four hash tables, the pointer to the k-th hash table entry is denoted by pk (k = 1, 2, …, 4). The reference counter indicates how many queues the session is stored in. Since the proposed method uses two queues to manage sessions, the reference counter can take the value 0, 1, or 2.
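To illustrate the structures of Figure 4, the sketch below models a session entry and the two queues: a linear queue holding the sessions of the past 2 s and, per DIP, a circular queue of the 100 most recent sessions. This is our reading of the design; the names and the counter-decrement hooks noted in the comments are illustrative.

```python
import collections
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SessionEntry:
    sip: str
    dip: str
    service: str
    sport: int
    syn_error: bool
    end_time: float
    hash_ptrs: List[Optional[dict]] = field(default_factory=lambda: [None] * 4)  # p1..p4
    ref_count: int = 0    # how many queues currently hold this entry (0, 1, or 2)

# Linear queue: sessions of the last 2 seconds, regardless of DIP.
linear_queue = collections.deque()

# One circular queue per DIP, holding only the 100 most recent sessions.
circular_queues = collections.defaultdict(lambda: collections.deque(maxlen=100))

def expire_linear_queue(now: float, window: float = 2.0) -> None:
    """Remove entries whose end time is more than `window` seconds old."""
    while linear_queue and now - linear_queue[0].end_time > window:
        old = linear_queue.popleft()
        old.ref_count -= 1
        # ...decrement the 2-second counters of the hash tables via old.hash_ptrs here.

def insert_session(entry: SessionEntry, now: float) -> None:
    """Store a new session entry in both queues."""
    expire_linear_queue(now)
    cq = circular_queues[entry.dip]
    if len(cq) == cq.maxlen:        # evict the oldest of the last 100 sessions for this DIP
        evicted = cq.popleft()
        evicted.ref_count -= 1
        # ...decrement the last-100 counters of the hash tables via evicted.hash_ptrs here.
    cq.append(entry)
    linear_queue.append(entry)
    entry.ref_count = 2
```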
When a session is created, a session entry is created and stored in the circular and linear queues. For the sessions stored in these two queues, four additional hash tables are used to maintain the counting values needed to generate the session features. Figure 5 shows the structure of the hash tables used by the proposed method. The pk of the session entries are used to quickly update the four hash tables when the existing session entry is deleted.
Figure 5.
Four hashes to maintain statistics needed to generate multi-host session features in real-time. The blue and red colors represent values associated with sessions stored in the linear queue and the circular queue, respectively.
As the proposed NIDS uses each hash table similarly, let us describe only how hash table 1 works. Each entry in hash table 1 contains a “count” field, which stores the number of sessions with the same SIP and DIP among the sessions in the linear queue, and a “serror_count” field, which stores the number of sessions with SYN errors among them. It also contains a “dst_host_count” field, which stores the number of sessions with the same SIP and DIP among the sessions in the circular queue, and a “dst_host_serror_count” field, which stores the number of sessions with SYN errors among them. These counters are updated whenever a new session is created.
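Continuing the sketch above, the counters of hash table 1 could be maintained as follows; the (SIP, DIP) key follows the description above, while the exact update hooks are our assumption.

```python
import collections

def new_hash1_entry() -> dict:
    return {"count": 0, "serror_count": 0,
            "dst_host_count": 0, "dst_host_serror_count": 0}

hash_table1 = collections.defaultdict(new_hash1_entry)   # keyed by (SIP, DIP)

def on_session_inserted(entry) -> None:
    """Increment the counters when a session enters both queues."""
    h = hash_table1[(entry.sip, entry.dip)]
    h["count"] += 1                       # linear-queue (2-second) population
    h["dst_host_count"] += 1              # circular-queue (last-100) population
    if entry.syn_error:
        h["serror_count"] += 1
        h["dst_host_serror_count"] += 1
    entry.hash_ptrs[0] = h                # p1: direct pointer for later O(1) access

def on_linear_expiry(entry) -> None:
    """Decrement the 2-second counters when the entry leaves the linear queue."""
    h = entry.hash_ptrs[0]
    h["count"] -= 1
    if entry.syn_error:
        h["serror_count"] -= 1

def on_circular_eviction(entry) -> None:
    """Decrement the last-100 counters when the entry leaves the circular queue."""
    h = entry.hash_ptrs[0]
    h["dst_host_count"] -= 1
    if entry.syn_error:
        h["dst_host_serror_count"] -= 1
```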
We describe only how to obtain the multi-host session features, which are more complicated to compute than the single-host session features. Whenever a session is created or terminated, the circular queue, the linear queue, or the four hash tables are updated. Then, when the proposed method needs the multi-host session features for a particular session, the features are computed as shown in Table 6. The method first finds the session entry for the currently received session and then, using pk, accesses the corresponding entry of each hash table directly. Eventually, all the features in Table 6 can be obtained in constant time, requiring only four hash table accesses. For example, for session S, the feature ‘Same srv rate’ is given by

Same srv rate(S) = same_srv_count(S) / count(S),

where ‘same_srv_count’ for session S can be read via the p2 pointer of the session entry for S, and ‘count’ can be read through the p1 pointer, making the feature very fast to calculate.
Table 6.
List of multi-host features and their calculation expressions. hash table k(f) denotes the value of field f in the entry of the k-th hash table that matches the given session.
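With the pointers in place, the multi-host features of Table 6 reduce to a few constant-time reads per session. The sketch below assumes, for illustration, that hash table 2 is keyed so that its entry holds the ‘same_srv_count’ of the current session; the guard against empty counters is also our addition.

```python
def multi_host_features(entry) -> dict:
    """Compute multi-host features for one session via direct hash-entry reads."""
    h1 = entry.hash_ptrs[0]   # p1: (SIP, DIP) statistics, e.g. count, serror_count
    h2 = entry.hash_ptrs[1]   # p2: service-related statistics, e.g. same_srv_count (assumed)
    count = max(h1["count"], 1)                  # avoid division by zero
    dst_host_count = max(h1["dst_host_count"], 1)
    return {
        "count": h1["count"],
        "serror_rate": h1["serror_count"] / count,
        "same_srv_rate": h2["same_srv_count"] / count,
        "dst_host_count": h1["dst_host_count"],
        "dst_host_serror_rate": h1["dst_host_serror_count"] / dst_host_count,
    }
```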
4. Performance Evaluation
4.1. Experiment Environment
We evaluated the performance of the feature-creation method described above and the effectiveness of the proposed feature set. We compare the proposed integrated feature set with the single-host feature set created using CICFlowMeter and with the multi-host feature set used in the Kyoto2016 dataset, which was created using Zeek. The dataset used for the performance analysis was CIC-IDS2017, published by the CIC; its size is listed in Table 7. The entire dataset was randomly divided into 60% and 40% to create the training and test datasets, respectively.
Table 7.
Dataset size of CIC-IDS2017.
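For reference, a minimal sketch of the random 60/40 split described above; the file name and the “Label” column are assumptions about how the integrated feature table is stored.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV containing the integrated features and a class label per session.
df = pd.read_csv("cicids2017_integrated_features.csv")
X, y = df.drop(columns=["Label"]), df["Label"]

# Random 60% / 40% split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=0)
```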
The algorithms used for the performance comparison are the decision tree (DT), DT with naïve Bayes (DTNB), DT with k-NN (DTKNN), synthetic minority over-sampling technique (SMOTE) + random forest (RF), support vector machine (SVM), and 1D convolutional neural network (1D-CNN), which are widely used in NIDSs [4,18,19,20,21,22,23,24]. The performance metrics analyzed were accuracy, precision, recall, and F1-score, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-score = 2 × Precision × Recall / (Precision + Recall),

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.
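Equivalently, these metrics can be computed with scikit-learn; the toy labels below and the macro averaging over classes are our assumptions, since the averaging method is not stated in the text.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy example: replace with the true and predicted test-set labels of the NIDS.
y_true = ["Benign", "DoS", "Benign", "Bot", "DoS", "Benign"]
y_pred = ["Benign", "DoS", "DoS",    "Bot", "DoS", "Benign"]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
```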
4.2. Detection Rate
The detection performance of each machine learning algorithm for each feature set is presented in Figure 6. Based on the F1-scores, the algorithms were measured to perform well in the order of 1D-CNN, DTNB, SMOTE + RF, DT, DTKNN, and SVM. Regardless of the type of algorithm, whether traditional ML or a recent deep learning model, the NIDS using single-host features performs 1.75% better on average than the NIDS using multi-host features in terms of F1-score. In addition, the NIDS using the integrated features created by combining the two session feature sets outperforms the NIDSs using the multi-host and single-host features by 5.9% and 4.15%, respectively. Considering that the difference between the single-host and multi-host features is 1.75%, this improvement is substantial. Excluding SMOTE + RF, the performance increases in the order of multi-host, single-host, and integrated features for all metrics: accuracy, precision, recall, and F1-score. For DTKNN, one of the algorithms showing the highest performance, Figure 7 shows that the integrated features significantly improve all metrics.
Figure 6.
Detection rates in F1-score for machine learning models, according to each feature set.
Figure 7.
Performance metrics for machine learning models, according to each feature set.
Similar to the other algorithms, SVM achieves its highest performance with the integrated features. Although SVM shows the lowest performance with single-host features compared to the other algorithms, its improvement with the integrated features is the greatest; that is, among the evaluated algorithms, the detection performance of SVM benefits most from the integrated features.
It is important to compare confusion matrices to analyze the performance for each class in detail. However, due to space limitations, we only show the confusion matrices of DTKNN, which shows the highest classification accuracy, instead of those of all classification algorithms. Table 8, Table 9 and Table 10 show the confusion matrices for DTKNN with each feature set.
Table 8.
Confusion matrix for DTKNN using the single-host feature.
Table 9.
Confusion matrix for DTKNN using the multi-host feature.
Table 10.
Confusion matrix for DTKNN using the integrated feature.
The results show that the integrated feature achieves the highest accuracy in the Benign, DoS Slowhttp, Bot, and DDoS classes, while the other classes show almost the same accuracy as with the single-host or multi-host features. With the single-host feature, the DoS Hulk, DoS Slowloris, Web Bruteforce, and Portscan classes show the highest accuracy, but the difference from the integrated feature is marginal. With the multi-host feature, the SSH-Patator class shows the highest accuracy, but again it is almost the same as with the integrated feature. In short, some classes are detected more accurately with single-host features and others with multi-host features, but the integrated feature exploits the advantages of both, so an NIDS with the integrated feature achieves the same or higher accuracy for all classes than either of the two existing feature sets.
Single-host features reflect the details of a session between one source and one destination [25,26]. Multi-host features, on the other hand, contain aggregated information about sessions between multiple sources and a single destination. Thus, the information provided by single-host and multi-host features can reveal different levels of information for a specific session without duplication, allowing for a more detailed analysis of the session. Ultimately, this synergy leads to an integrated feature-based NIDS outperforming single-host feature or multi-host feature-based NIDSs in detection.
4.3. Training and Testing Time
As indicated by the training time results in Figure 8, the ML models using the multi-host features have the shortest training times, whereas the NIDS using the integrated features has the longest. Considering that the multi-host feature set is the smallest and the integrated feature set is the largest, the training time tends to be proportional to the number of features. Because the multi-host feature set is the smallest, its training time is the shortest for all algorithms; an NIDS using the multi-host features can be trained almost three times faster than one using the single-host features. Using the integrated features, which form the largest set, generally requires more training time than using the single-host features.
Figure 8.
Each algorithm’s relative training and testing time according to each feature set. (a) Training time. (b) Testing time.
One of the most critical aspects of an NIDS is the testing time, because it determines the maximum processing throughput the NIDS can handle. As shown in Figure 8, the overall testing-time trends are similar to those of the training time. Excluding DTKNN, the NIDS using the multi-host features shows the shortest testing time, whereas the NIDS using the integrated features shows the longest. However, unlike the training time, the testing time with the integrated features does not increase significantly compared to that with the single-host features. For DTKNN, which has the highest detection rate, the testing time increases by only 8%; even for DT, which shows the largest increase, the testing time increases by only 61%. Although the testing speed with the integrated features is lower, it remains comparable to that with the multi-host features.
For SVM, both the training and testing times are the shortest with the integrated features and the longest with the multi-host features. Considering that the multi-host feature set is the smallest and the integrated feature set is the largest, this result seems counterintuitive. However, noting that the training time complexity of SVM lies between O(n²) and O(n³), where n is the number of training samples, depending on the data distribution [27,28], we can see that although the number of features affects the time complexity of the classifiers, the distribution of the samples has a greater impact on the time complexity of SVM.
4.4. Feature Selection
To resolve the problem of the increased classification time when using the integrated features, we compare and analyze the performance obtained through feature selection [29]. In this experiment, a random forest is used for feature selection: the feature importance is calculated, and the k features with the highest importance values are selected [30]. The classifier is then trained using only the selected features, and the F1-score and test time are measured. Since most classifiers show similar results, we present only the results for DT, which shows the largest increase in test time when using the integrated features.
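A sketch of this procedure, reusing the X_train/X_test split from the earlier sketch; the random-forest settings and the choice k = 30 reflect our reading of Table 11 and are not the authors’ exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def select_top_k_features(X, y, k):
    """Rank features by random-forest importance and return the top-k column indices."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1][:k]

# Train a DT on the k most important integrated features and measure the F1-score.
k = 30                                   # the size that maximized the F1-score in Table 11
top_k = select_top_k_features(X_train.values, y_train, k)
dt = DecisionTreeClassifier(random_state=0)
dt.fit(X_train.values[:, top_k], y_train)
pred = dt.predict(X_test.values[:, top_k])
print("F1-score:", f1_score(y_test, pred, average="macro", zero_division=0))
```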
Table 11 shows the F1-score and test time according to the number of features selected through feature selection. As the number of selected features decreases, the test time decreases proportionally, because fewer features reduce the complexity of building the internal tree of the DT. The F1-score, in contrast, first increases and then decreases as the number of features grows; it is maximized at 30 features, where the test time drops to 0.145 s, 54.7% less than the original 0.32 s. Considering that the single-host and multi-host feature sets contain 81 and 39 features, respectively, a DT based on 39 selected integrated features already shows a shorter test time than a DT using the single-host or multi-host features. Therefore, the increased test time caused by the integrated features can be mitigated by feature selection. Moreover, feature selection is essential when using the integrated features, because it improves not only the test time but also the F1-score.
Table 11.
Comparison of the F1-score and test time for the integrated-feature-based DT, according to the number of features selected by feature selection.
5. Conclusions
The existing single-host features are easy to create but cannot capture relationships with other sessions, whereas multi-host features are more difficult to create but do capture such relationships. Therefore, in this study, we proposed a method to integrate the two feature sets. Since generating the integrated features incurs a significant overhead, we also proposed an incremental feature generation method to solve this problem. The proposed integrated session features can detect network intrusions more accurately because merging the multi-host and single-host features allows single-session and multi-session information to be used simultaneously. The experimental results indicate that the proposed integrated features improve the detection rate by 4.15% and 5.9% on average compared to NIDSs using the traditional single-host and multi-host features, respectively.
As the number of features increases, the time required to classify a received session with the ML model increases, and more powerful hardware is required to support the same session-processing speed. However, an NIDS based on the integrated session features can improve its classification performance by adopting feature selection or by using multiple ML classifiers in parallel. In addition, because hardware for ML is rapidly improving, the slower classification speed will be overcome technically.
For more accurate network intrusion detection, feature set design is essential; however, research on this has been relatively lacking. Therefore, the integrated session feature proposed in this study could significantly help to design a feature set for NIDSs that can safely protect networks and users from malicious users.
Author Contributions
T.K. and W.P. have written this paper and have conducted the research. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by a National Research Foundation of Korea (NRF) grant, funded by the Korean government (Ministry of Science and ICT) (NRF-2022R1A2C1011774).
Data Availability Statement
The dataset utilized in this paper is the CIC-IDS2017 dataset (https://www.unb.ca/cic/datasets/ids-2017.html, accessed on 6 March 2023).
Conflicts of Interest
The authors declare no conflict of interest.
References
- Kruegel, C.; Toth, T. Using decision trees to improve signature-based intrusion detection. In Proceedings of the 2003 International Workshop on Recent Advances in Intrusion Detection, Pittsburgh, PA, USA, 8–10 September 2003; pp. 173–191. [Google Scholar] [CrossRef]
- Wu, S.X.; Banzhaf, W. The use of computational intelligence in intrusion detection systems: A review. Appl. Soft Comput. 2010, 10, 1–35. [Google Scholar] [CrossRef]
- Ektefa, M.; Memar, S.; Sidi, F.; Affendey, L.S. Intrusion detection using data mining techniques. In Proceedings of the 2010 Information Retrieval & Knowledge Management (CAMP), Shah Alam, Selangor, Malaysia, 17–18 May 2010; pp. 200–203. [Google Scholar] [CrossRef]
- Wang, W.; Sheng, Y.; Wang, J.; Zeng, X.; Ye, X.; Huang, Y.; Zhu, M. HAST-IDS: Learning hierarchical spatial-temporal features using deep neural networks to improve intrusion detection. IEEE Access 2017, 6, 1792–1806. [Google Scholar] [CrossRef]
- Bilge, L.; Dumitras, T. Before we knew it: An empirical study of zero-day attacks in the real world. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, Raleigh, NC, USA, 16–18 October 2012; pp. 833–844. [Google Scholar] [CrossRef]
- Al-Qatf, M.; Lasheng, Y.; Al-Habib, M.; Al-Sabahi, K. Deep Learning Approach Combining Sparse Autoencoder with SVM for Network Intrusion Detection. IEEE Access 2018, 6, 52843–52856. [Google Scholar] [CrossRef]
- Li, L.; Yu, Y.; Bai, S.; Hou, Y.; Chen, X. An Effective Two-Step Intrusion Detection Approach Based on Binary Classification and k-NN. IEEE Access 2017, 6, 12060–12073. [Google Scholar] [CrossRef]
- Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. Detailed Analysis of the KDD CUP 99 Data Set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), Ottawa, ON, Canada, 8–10 December 2009. [Google Scholar] [CrossRef]
- Shiravi, A.; Shiravi, H.; Tavallaee, M.; Ghorbani, A.A. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Comput. Secur. 2012, 31, 357–374. [Google Scholar] [CrossRef]
- Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 2018 4th International Conference on Information Systems Security and Privacy (ICISSP), Funchal-Madeira, Portugal, 22–24 January 2018. [Google Scholar] [CrossRef]
- Soheily-Khah, S.; Marteau, P.; Béchet, N. Intrusion Detection in Network Systems Through Hybrid Supervised and Unsupervised Machine Learning Process: A Case Study on the ISCX Dataset. In Proceedings of the 1st International Conference on Data Intelligence and Security (ICDIS), South Padre Island, TX, USA, 8–10 April 2018; pp. 219–226. [Google Scholar] [CrossRef]
- Sharafaldin, I.; Lashkari, A.H.; Hakak, S.; Ghorbani, A.A. Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy. In Proceedings of the IEEE 53rd International Carnahan Conference on Security Technology, Chennai, India, 1–3 October 2019. [Google Scholar] [CrossRef]
- Lashkari, A.H.; Draper-Gil, G.; Mamun, M.; Ghorbani, A.A. Characterization of Tor Traffic Using Time Based Features. In Proceedings of the 2017 3rd International Conference on Information Systems Security and Privacy, Porto, Portugal, 19–21 February 2017; SCITEPRESS: Setúbal, Portugal. [Google Scholar] [CrossRef]
- Drapper-Gil, G.; Lashkari, A.H.; Mamun, M.; Ghorbani, A.A. Characterization of Encrypted and VPN Traffic Using Time-Related Features. In Proceedings of the 2016 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), Rome, Italy, 19–21 February 2016; pp. 407–414. [Google Scholar] [CrossRef]
- Ma, C.; Du, X.; Cao, L. Analysis of Multi-Types of Flow Features Based on Hybrid Neural Network for Improving Network Anomaly Detection. IEEE Access 2019, 7, 148363–148380. [Google Scholar] [CrossRef]
- Panwar, S.S.; Raiwani, Y.P.; Panwar, L.S. An Intrusion Detection Model for CICIDS-2017 Dataset Using Machine Learning Algorithms. In Proceedings of the International Conference on Advances in Computing, Communication and Materials (ICACCM), Dehradun, India, 10–11 November 2022; pp. 1–10. [Google Scholar] [CrossRef]
- Uhm, Y.; Pak, W. Real-Time Network Intrusion Prevention System Using Incremental Feature Generation. CMC-Comput. Mater. Contin. 2022, 70, 1631–1648. [Google Scholar] [CrossRef]
- Sahu, S.; Mehtre, B.M. Network intrusion detection system using J48 Decision Tree. In Proceedings of the 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, India, 10–13 August 2015; pp. 2023–2026. [Google Scholar] [CrossRef]
- Description of Kyoto University Benchmark Data. Available online: https://www.takakura.com/Kyoto_data/BenchmarkData-Description-v5.pdf (accessed on 13 January 2023).
- Han, X.; Dong, P.; Liu, S.; Jiang, B.; Lu, Z.; Cui, Z. IV-IDM: Reliable Intrusion Detection Method based on Involution and Voting. In Proceedings of the 2022 IEEE International Conference on Communications (ICC), Seoul, Republic of Korea, 16–20 May 2022; pp. 4162–4167. [Google Scholar] [CrossRef]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Yan, B.; Han, G.; Sun, M.; Ye, S. A novel region adaptive SMOTE algorithm for intrusion detection on imbalanced problem. In Proceedings of the 2017 IEEE International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2017; pp. 1281–1286. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Kiranyaz, S.; Avci, O.; Abdeljaber, O.; Ince, T.; Gabbouj, M.; Inman, D.J. 1D convolutional neural networks and applications: A survey. Mech. Syst. Signal Process. 2021, 151, 107398. [Google Scholar] [CrossRef]
- Wang, W.; Harrou, F.; Bouyeddou, B.; Senouci, S.-M.; Sun, Y. Cyber-attacks detection in industrial systems using artificial intelligence-driven methods. Int. J. Crit. Infrastruct. Prot. 2022, 38, 100542. [Google Scholar] [CrossRef]
- Dairi, A.; Harrou, F.; Bouyeddou, B.; Senouci, S.M.; Sun, Y. Semi-supervised Deep Learning-Driven Anomaly Detection Schemes for Cyber-Attack Detection in Smart Grids. In Power Systems Cybersecurity. Power Systems; Springer: Cham, Switzerland, 2023; pp. 265–295. [Google Scholar] [CrossRef]
- Bottou, L. Support Vector Machine Solvers. Available online: https://leon.bottou.org/publications/pdf/lin-2006.pdf (accessed on 23 March 2023).
- Simon, H.; List, N. SVM-Optimization and Steepest-Descent Line Search. In Proceedings of the 22nd Conference on Learning Theory (COLT), Montreal, QC, Canada, 18–21 June 2009. [Google Scholar]
- Guyon, I.; Elisseeff, A. An Introduction to Variable and Feature Selection. JMLR 2003, 3, 1157–1182. [Google Scholar]
- Ho, T.K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).