Mathematics
  • Article
  • Open Access

26 November 2022

Three-Branch Random Forest Intrusion Detection Model

1
College of Science, North China University of Science and Technology, Tangshan 063210, China
2
Key Laboratory of Data Science and Application of Hebei Province, Tangshan 063210, China
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Engineering Calculation and Data Modeling

Abstract

Network intrusion detection involves large volumes of data with numerous attributes, and the attributes differ in how important they are for detection. In a random forest, however, attributes are selected uniformly at random, so the detection results can deviate considerably. To address these problems, and to increase the probability that essential features are selected, a network intrusion detection model based on a three-way selection random forest (IDTSRF) is proposed, which integrates three-way decision branches with random forest. Firstly, according to the characteristics of the attributes, attribute importance is evaluated by combining decision boundary entropy, and three-way decision rules are used to divide the attributes; secondly, to preserve the randomness of attribute selection, three random attribute selection rules are established, and a certain number of attributes are randomly drawn from the three candidate domains according to these rules; finally, the training sample sets are formed by bootstrap sampling combined with the randomly selected attribute sets, and multiple decision trees are trained to form a random forest. The experimental results show that the model achieves high precision and recall.

1. Introduction

Science and technology have progressed rapidly, and networking continues to accelerate. While the network brings convenience to people's lives, its security problems are increasingly prominent: once a security incident occurs, it causes significant economic losses and social impact. To strengthen the network defense system, new intrusion detection models need to be studied continually to improve the overall performance of intrusion detection systems. With the rapid development of machine learning, many scholars began to apply machine learning to intrusion detection, for example decision trees [1,2], support vector machines [3], Bayesian networks [4,5], and K-nearest neighbors [6], and improvements to these algorithms are constantly being put forward. As a common algorithm in intrusion detection, the decision tree has achieved good results. With the improvement and development of detection models in recent years, however, traditional machine learning has exposed many drawbacks: a single machine learning algorithm is often prone to blind spots, resulting in low detection accuracy, poor discrimination, and other problems. Ensemble learning therefore gradually appeared in the field of intrusion detection. Ensemble learning trains multiple classifiers and then integrates the individual classifiers through certain strategies to obtain a stable model that performs well in all respects. Ensemble intrusion detection methods are mainly divided into AdaBoost [7] and bagging [8]. AdaBoost-based intrusion detection tends to generate poor base classifiers as the number of learners increases, which degrades ensemble performance and affects detection. Bagging-based intrusion detection mainly means intrusion detection based on random forest. As a typical ensemble learning model, random forest is an algorithm obtained by integrating multiple decision trees, and many scholars have applied it to intrusion detection. For example, in [9], considering that a traditional ensemble learning algorithm cannot accept all training samples at once because new attacks are continuously generated, the idea of semi-supervision is introduced into ensemble learning; bootstrap sampling with replacement is used to obtain base classifiers and selectively labeled samples, which improves the detection accuracy of the model. Reference [10] defines the concept of approximate reduction, divides the attributes of the dataset, and trains on them separately to obtain base classifiers with significant differences, which ensures the generalization of the final ensemble learner. To address overfitting, study [11] proposed an improved random forest network intrusion detection method that increases the diversity of the random forest through Gaussian mixture clustering. Reference [12] proposed an intrusion detection algorithm based on random forest and artificial immunity to address the low detection rate of traditional methods for attack types such as Probe, U2R, and R2L, as well as the false and missed detection of intrusion behaviors. Reference [13] proposed a network intrusion detection model based on random forest and achieved a good detection rate.
Reference [14] proposed combining extreme gradient boosting (XGBoost) trees with random forest to build an intrusion detection model; the improved random forest algorithm also deals with the problem of data imbalance, and the detection accuracy is effectively improved.
Intrusion detection datasets usually have many attributes, and not every attribute is equally essential. In a random forest, however, attributes of high and low importance have the same probability of being selected. If the likelihood of highly important attributes being chosen increases, the classification performance of the decision trees increases accordingly. Therefore, considering the impact of attribute importance on the final classification accuracy, it is necessary to increase the probability of selecting the essential attributes of the intrusion detection dataset. The measurement of attribute importance is not unique; examples include attribute importance based on information entropy [15] and attribute importance based on rough sets [16,17,18,19,20]. However, most methods that exploit attribute importance directly delete the attributes of low importance, which loses data information to a certain extent.
Three-way decision [21,22,23] refers to adding a delayed decision to the original two decisions, deferring judgment on cases that are still uncertain. This paper integrates this idea into attribute selection: some attributes are re-judged, and an attribute selection method based on three-way decisions and attribute importance is proposed. The complexity and uncertainty of the data are fully considered; the attributes are divided into the positive domain, negative domain, and boundary domain, and a certain number of attributes are randomly selected from the three domains, which both gives priority to important attributes and preserves the randomness of attribute selection. Decision trees are trained on the different attribute subsets obtained from the three-way selection, and the resulting trees are integrated by majority voting to obtain the three-branch random forest ensemble model.
In summary, an attribute measure based on decision boundary entropy is used to calculate importance, taking into account information from the positive domain and the boundary domain simultaneously; randomized three-way decision rules are then used to select attributes, ensuring the randomness of attribute selection. Finally, decision trees are trained on the resulting attribute subsets and integrated into the IDTSRF model, which is used for intrusion detection. Recall and F1 value are selected as the main evaluation indicators in the experiments. Compared with random forest, AdaBoost, XGBoost, extra trees, CNN, and CNN-LSTM, the results show that IDTSRF achieves high recall and F1 values.

3. Three-Branch Random Forest Intrusion Detection Model

3.1. IDTSRF Intrusion Detection Model Framework

Figure 5 shows the overall framework of the intrusion detection algorithm based on three-way selected random forest (IDTSRF).
Figure 5. IDTSRF Model.
As shown in Figure 5, the steps of the IDTSRF intrusion detection model are as follows (a minimal code skeleton is sketched at the end of this subsection):
1. Stratified sampling of the original dataset in proportion;
2. Because the dataset contains continuous attributes, the data are discretized by equal-distance or equal-frequency discretization, according to the characteristics of each attribute, to obtain the dataset D required by the experiment;
3. Bootstrap sampling is performed on the dataset D to generate N data subsets D_1, D_2, ..., D_N and the out-of-bag data;
4. The attribute evaluation function is set to the attribute importance based on decision boundary entropy (DBE), the thresholds (α, β) are given, and the attributes in each data subset are divided into the positive, boundary, or negative domain, respectively;
5. √k attributes are selected according to the three attribute selection rules (see Section 3.3);
6. The Gini index is selected as the criterion for node splitting in the decision trees, and N trees are generated;
7. Majority voting is adopted to integrate the results into the final ensemble;
8. Test data are input into the ensemble to verify the effectiveness of the algorithm.
The critical problems of the IDTSRF intrusion detection model are to calculate the importance of the attributes, to make the three-way selection of attributes, and to design an improved random forest algorithm on this basis.
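As a concrete illustration of the eight steps, the following minimal Python skeleton strings the stages together. It is a sketch under stated assumptions, not the authors' implementation: `discretize`, `train_tsrf`, and `predict_tsrf` are hypothetical helpers (sketched in Sections 3.6 and 3.4 below), `raw` and `test_data` are placeholder DataFrames with a `label` column, and the sampling proportion is an assumed value; the (α, β, δ) values are those chosen in Section 3.7.

```python
# Hypothetical end-to-end skeleton of the IDTSRF pipeline (steps 1-8).
from sklearn.model_selection import train_test_split

# step 1: stratified sampling in proportion; train_size=0.1 is assumed
sample, _ = train_test_split(raw, train_size=0.1, stratify=raw["label"])
# step 2: equal-distance / equal-frequency discretization (see Section 3.6)
data = discretize(sample)
# steps 3-6: bootstrap subsets, three-way attribute selection, CART training
forest = train_tsrf(data, d="label", alpha=9.52e-4, beta=5.95e-5, delta=0.3)
# steps 7-8: majority vote of all trees on the test data
y_pred = predict_tsrf(forest, test_data)
```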

3.2. Decision Boundary Entropy and Attribute Importance

In the IDTSRF intrusion detection model, attribute importance is redefined based on decision boundary entropy and used as the evaluation function to construct the three-way decision on attributes, which divides the attributes into three candidate domains: the positive, negative, and boundary domains.
Definition 1.
(Decision Boundary Entropy, DBE) [43]. Given a decision table $DT = (U, C, D, V, f)$ with $U/IND(D) = \{D_1, D_2, \ldots, D_m\}$ and $B \subseteq C$, the decision boundary entropy of $D$ related to $B$ is defined as [43]:
$$DBE(D, B) = \left(1 - \alpha_B(D)\right) \times \sum_{i=1}^{m} \frac{|BND_B(D_i)|}{|U|} \log_2\!\left(\frac{|BND_B(D_i)|}{|U|} + 1\right) \quad (1)$$
where $\alpha_B(D)$ denotes the approximate classification accuracy and $BND_B(D_i)$ denotes the boundary region of $D_i$.
Definition 1 uses the approximate classification accuracy and the ratio $|BND_B(D_i)|/|U|$ to construct the decision boundary entropy. First, because the boundary domain is the way rough sets estimate uncertain information, and Shannon's information entropy also measures uncertain information, it is reasonable to combine $|BND_B(D_i)|/|U|$ with entropy to define the decision boundary entropy. Secondly, because $|BND_B(D_i)|/|U|$ carries information from the boundary domain [43] while the approximate classification accuracy carries information from the positive domain, combining the two effectively integrates the information of the positive and boundary domains and yields a new criterion for measuring attribute quality.
It can be seen from the definition that the decision boundary entropy decreases as $\alpha_B(D)$ increases and grows with $|BND_B(D_i)|/|U|$.
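To make Definition 1 concrete, the following Python sketch computes DBE(D, B) on a discretized pandas DataFrame whose decision attribute is a column `d`. The helper names (`boundary_stats`, `dbe`) and the data layout are illustrative assumptions, not the authors' code.

```python
# Sketch of DBE(D, B) from Definition 1, assuming a discretized DataFrame.
import numpy as np
import pandas as pd

def boundary_stats(df: pd.DataFrame, B: list, d: str):
    """Return |BND_B(D_i)| per decision class and alpha_B(D)."""
    lower = {c: 0 for c in df[d].unique()}    # sizes of lower approximations
    upper = {c: 0 for c in df[d].unique()}    # sizes of upper approximations
    # equivalence classes of U/IND(B): rows with identical values on B
    for _, labels in df.groupby(B, sort=False)[d]:
        classes = labels.unique()
        for c in classes:
            upper[c] += len(labels)           # granule meets class c
        if len(classes) == 1:
            lower[classes[0]] += len(labels)  # granule inside a single class
    bnd = {c: upper[c] - lower[c] for c in upper}
    # approximate classification accuracy: sum of lower / sum of upper sizes
    alpha = sum(lower.values()) / sum(upper.values())
    return bnd, alpha

def dbe(df: pd.DataFrame, B: list, d: str) -> float:
    """Decision boundary entropy DBE(D, B), Formula (1)."""
    bnd, alpha = boundary_stats(df, B, d)
    n = len(df)
    return (1 - alpha) * sum((v / n) * np.log2(v / n + 1) for v in bnd.values())
```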
Definition 2.
(Attribute importance based on DBE) [43]. Given a decision table $DT = (U, C, D, V, f)$, $B \subseteq C$, and $a \in C - B$, the importance of attribute $a$ with respect to $B$ and $D$ is defined as:
$$Sig(a, B, D) = DBE(D, B) - DBE(D, B \cup \{a\}) \quad (2)$$
In the IDTSRF intrusion detection model, this DBE-based attribute importance is used as the evaluation function of the three-way attribute decision, dividing the attributes into the three candidate domains: positive, negative, and boundary.
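Under the same assumptions, Formula (2) becomes a one-line difference on top of the `dbe` helper sketched above. The sign convention (importance as the drop in DBE when a is added to B) is our reading of the reconstructed definition; what the three-way division in Section 3.3 uses is the resulting non-negative score.

```python
# Sketch of Formula (2): DBE-based significance of attribute a w.r.t. B.
def sig(df, a: str, B: list, d: str) -> float:
    # adding an informative attribute shrinks the boundary, lowering DBE
    return dbe(df, B, d) - dbe(df, B + [a], d)
```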

3.3. Randomized Three-Branch Decision of Attributes

Let $DT = (U, C, D, V, f)$ be a decision table, where $U$ is a nonempty finite set containing all instances, called the universe; $C = \{a_1, a_2, \ldots, a_k\}$ is the set of $k$ condition attributes; $D = \{d_1, d_2, \ldots, d_m\}$ is the set of $m$ decision attributes; $V$ is the set of attribute values; and $f: U \times (C \cup D) \to V$ is a mapping [43].
Definition 3.
(Indispensable attribute) [43]. Given a decision table $DT = (U, C, D, V, f)$ and $c \in C$, if $DBE(D, C) \neq DBE(D, C - \{c\})$, then $c$ is an indispensable attribute of $D$ on $C$; otherwise, $c$ is an unnecessary attribute of $D$ on $C$.
Definition 4.
(Core) [43]. Given a decision table $DT = (U, C, D, V, f)$ and $B \subseteq C$, if every $b \in B$ is an indispensable attribute of $D$ on $C$ and every $c \in C - B$ is an unnecessary attribute of $D$ on $C$, then $B$ is called the core of $D$ on $C$.
Definition 5.
(Reduction) [43]. Given a decision table $DT = (U, C, D, V, f)$ and $R \subseteq C$, if $DBE(D, R) = DBE(D, C)$ and, for each $b \in R$, $DBE(D, R - \{b\}) > DBE(D, C)$, then $R$ is called a reduction of $C$ related to $D$ [47].
To make the three-way selection on the attributes of the decision table $DT = (U, C, D, V, f)$, the evaluation function is $F(a) = Sig(a, C, D)$. For $a \in C$: when $Sig(a, C, D) \geq \alpha$, attribute $a$ is divided into the positive decision domain $POS$; when $Sig(a, C, D) \leq \beta$, attribute $a$ is divided into the negative decision domain $NEG$; and when $\beta < Sig(a, C, D) < \alpha$, attribute $a$ is divided into the delayed decision domain $BND$. Thus,
$$\begin{cases} POS = \{\, a \in C \mid Sig(a, C, D) \geq \alpha \,\} \\ BND = \{\, a \in C \mid \beta < Sig(a, C, D) < \alpha \,\} \\ NEG = \{\, a \in C \mid Sig(a, C, D) \leq \beta \,\} \end{cases}$$
After the attributes are divided into the three domains, the following three attribute selection rules are defined to keep the attribute selection random (a code sketch follows at the end of this subsection):
  • If $|POS| > (1+\delta)\sqrt{k}$, randomly select $\sqrt{k}$ attributes from the positive domain as the attribute subset;
  • If $|POS| + |BND| > (1+\delta)\sqrt{k}$, randomly select $\sqrt{k}$ attributes from the positive and boundary domains as the attribute subset [43];
  • If $|POS| + |BND| \leq (1+\delta)\sqrt{k}$, randomly select $\frac{|POS|+|BND|}{k}\sqrt{k}$ attributes from the positive and boundary domains, and randomly select $\left(1-\frac{|POS|+|BND|}{k}\right)\sqrt{k}$ attributes from the negative domain to form the attribute subset [43].
Note: The application priority of the three rules is (1) > (2) > (3).
Here, $k = |POS| + |BND| + |NEG|$ denotes the total number of attributes, and $|POS|$, $|BND|$, and $|NEG|$ denote the numbers of attributes in the positive, boundary, and negative domains, respectively. In random forests, $\sqrt{k}$ attributes are generally used to train each decision tree, and $\sqrt{k}$ attributes are also used here [43].
The theoretical value range of δ is $(0, \sqrt{k}-1)$. The coefficient $(1+\delta)$ of $\sqrt{k}$ is set so that the number of candidate attributes is greater than $\sqrt{k}$. When $\delta = 0$ and $|POS| = \sqrt{k}$ (or $|POS| + |BND| = \sqrt{k}$), the candidate attribute set in $POS$ (or $POS \cup BND$) provides only one attribute combination, and there is no randomness [43]. As δ increases while $|POS| > (1+\delta)\sqrt{k}$ (or $|POS| + |BND| > (1+\delta)\sqrt{k}$) still holds, the number of candidate attributes grows, attribute randomness increases, and the fault tolerance and robustness of the individual learners increase; conversely, the accuracy of the individual learners decreases. When δ is large enough that only rule (3) can be satisfied, attribute randomness is maximal and the accuracy of the individual learners is worst.
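The following is a minimal sketch of the three-way partition and the three selection rules, assuming `importances` maps each condition attribute to Sig(a, C, D). Rounding the fractional counts with round/ceil is our interpretation of how 7 of the k = 42 attributes are drawn in Section 3.7.2; the paper does not spell this out.

```python
# Sketch of the randomized three-way attribute selection of Section 3.3.
import math
import random

def three_way_partition(importances: dict, alpha: float, beta: float):
    pos = [a for a, s in importances.items() if s >= alpha]
    neg = [a for a, s in importances.items() if s <= beta]
    bnd = [a for a in importances if a not in pos and a not in neg]
    return pos, bnd, neg

def select_attributes(pos, bnd, neg, k: int, delta: float):
    s = math.sqrt(k)                           # rule thresholds use sqrt(k)
    m = round(s)                               # attributes drawn per tree
    if len(pos) > (1 + delta) * s:             # rule (1)
        return random.sample(pos, m)
    if len(pos) + len(bnd) > (1 + delta) * s:  # rule (2)
        return random.sample(pos + bnd, m)
    # rule (3): proportional share from POS and BND, the rest from NEG
    n_pb = math.ceil((len(pos) + len(bnd)) / k * m)
    return random.sample(pos + bnd, n_pb) + random.sample(neg, m - n_pb)
```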

3.4. Three-Branch Attribute Selection Random Forest Algorithm

In the IDTSRF intrusion detection model, the key is the three-way selection of attributes and its integration into the random forest algorithm. To this end, a new random forest algorithm based on three-way attribute selection is constructed and described as follows (Algorithm 1):
Algorithm 1: Three-branch attribute selection random forest algorithm
Input: decision table $DT = (U, C, D, V, f)$; three-way attribute selection thresholds (α, β); attribute randomness δ
Output: TSRF (forest)
1 For j = 1, 2, ..., N
2     U_j = Bootstrap_Sample(U)
3     Initialize POS = {}, BND = {}, NEG = {}
4     For each a ∈ C do
5         Calculate F(a) according to Formula (2)
6         If F(a) ≥ α then
7             POS ← POS ∪ {a}
8         Else
9             If β < F(a) < α then
10                BND ← BND ∪ {a}
11            Else
12                NEG ← NEG ∪ {a}
13            End if
14        End if
15    End for
16    If |POS| > (1 + δ)√k
17        Randomly select √k attributes from the positive domain to form attribute set A
18    Else
19        If |POS| + |BND| > (1 + δ)√k
20            Randomly select √k attributes from the positive and boundary domains to form attribute set A
21        Else
22            Select ((|POS| + |BND|)/k)·√k attributes from the positive and boundary domains and (|NEG|/k)·√k attributes from the negative domain to form attribute set A
23        End if
24    End if
25    U_j = dataset with attribute set A
26 End for
27 Train decision trees on U_1, U_2, ..., U_N
28 Combine the results by majority voting
The algorithm consists of two parts. The first part performs the three-way attribute selection to form the training datasets. Step 2 performs bootstrap sampling on the original dataset to generate N different datasets. Steps 4 to 15 carry out the three-way division of the attributes: Step 5 calculates the importance of each attribute with Formula (2), and Steps 6 to 15 use the thresholds (α, β) to divide the k attributes into the positive, boundary, and negative domains; the time complexity is O(2k) [43]. Steps 16 to 24 select the attributes at random: given the attribute randomness threshold δ, attributes are drawn from the three domains according to the random selection rules and the bound (1 + δ)√k, yielding the attribute set A. Outputting the datasets U_1, U_2, ..., U_N containing the attributes of A has time complexity O(1). If each dataset contains n samples, the time complexity of Steps 4 to 24 is O(Nkn).
The second part trains and integrates the decision trees. Step 27 trains U_1, U_2, ..., U_N into N CART trees. If the depth of a CART tree is d, the time complexity of a single tree is O(√k·d·n), and that of the N trees is O(N·√k·d·n). Step 28 combines the trained trees by majority voting to obtain the final classification result.
Based on the above analysis, the time complexity of the algorithm is O(Nkn) + O(N·√k·d·n); because √k·d ≤ k when the tree depth does not exceed √k, the time complexity is actually O(Nkn) [43].
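Putting the pieces together, the following sketch reconstructs Algorithm 1 from its pseudocode, reusing the `sig`, `three_way_partition`, and `select_attributes` helpers sketched above. It assumes a discretized, label-encoded DataFrame and scikit-learn's CART implementation; it is a minimal reconstruction, not the authors' code.

```python
# Sketch of Algorithm 1: three-branch attribute selection random forest.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def train_tsrf(df, d, alpha, beta, delta, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    cond = [c for c in df.columns if c != d]   # condition attributes C
    k = len(cond)
    forest = []
    for _ in range(n_trees):
        # Step 2: bootstrap sample of the universe U
        boot = df.sample(len(df), replace=True,
                         random_state=int(rng.integers(2**31)))
        # Steps 4-15: Sig(a, C, D), read as Formula (2) with B = C \ {a}
        # (expensive; shown for clarity, not efficiency)
        imp = {a: sig(boot, a, [c for c in cond if c != a], d) for a in cond}
        pos, bnd, neg = three_way_partition(imp, alpha, beta)
        # Steps 16-24: randomized selection of about sqrt(k) attributes
        attrs = select_attributes(pos, bnd, neg, k, delta)
        # Step 27: CART tree split on the Gini index; assumes numeric features
        tree = DecisionTreeClassifier(criterion="gini").fit(boot[attrs], boot[d])
        forest.append((tree, attrs))
    return forest

def predict_tsrf(forest, X):
    # Step 28: majority vote over all trees
    votes = pd.DataFrame([t.predict(X[attrs]) for t, attrs in forest])
    return votes.mode(axis=0).iloc[0].to_numpy()
```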

3.5. Data Set and Experimental Environment

KDD CUP 99 is the intrusion detection dataset used in the 1999 KDD competition; the NSL-KDD dataset used in this article is a partial improvement of it. To address the shortcomings of the KDD CUP 99 dataset, NSL-KDD removes the redundant records, which overcomes the tendency of classifiers to favor repeated records and the resulting degradation of learning performance. In addition, the proportion of normal and abnormal data is chosen properly, and the numbers of test and training records are more reasonable [48], making it better suited for effective and accurate comparison of different machine learning techniques.
The NSL-KDD dataset contains four categories of attacks and a total of 42 features: 9 basic TCP connection features, 13 TCP connection content features, 9 time-based network traffic statistical features, 10 host-based network traffic statistical features, and 1 label feature. The distribution of the dataset is shown in Table 2.
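For orientation, a hypothetical loading snippet is shown below; the file name and the presence of a header row with named columns are assumptions about local preprocessing (the raw NSL-KDD files ship without column names).

```python
# Hypothetical NSL-KDD loading sketch; file name and header are assumptions.
import pandas as pd

raw = pd.read_csv("nsl_kdd_train.csv")   # assumed preprocessed CSV
print(raw.shape)                         # expect 42 columns: 41 features + label
print(raw["label"].value_counts())       # class distribution, cf. Table 2
```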
Table 2. Data type distribution table.
The experiments are run on an Intel Xeon W-2123 processor under Windows Server 2019, and the code is written in Python 3.7.9.

3.6. Data Preprocessing

Most attribute values in the original data are continuous, and the data are discretized by equal-distance or equal-frequency discretization. Discretization is needed because the decision tree cannot use continuous variables directly, and the NSL-KDD dataset has 41 attribute features, all of which are continuous except for 9 discrete features; the continuous attributes therefore need to be discretized. Equal-distance discretization divides the value range of a continuous attribute into n equal parts, so that every interval has the same width. Equal-frequency discretization sorts the data points of a continuous attribute and divides them into parts containing the same number of points.
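Both discretizations map directly onto pandas: `pd.cut` bins by equal distance and `pd.qcut` by equal frequency. The snippet below is a sketch on the `src_bytes` attribute named in Section 3.7.1; the bin count `n_bins` is an assumed choice, not one reported in the paper.

```python
# Equal-distance vs. equal-frequency discretization of a continuous attribute.
import pandas as pd

n_bins = 10                              # assumed; the paper's n is not given
# equal distance: n_bins intervals of equal width over the value range
raw["src_bytes_eqw"] = pd.cut(raw["src_bytes"], bins=n_bins, labels=False)
# equal frequency: bins holding roughly the same number of data points
raw["src_bytes_eqf"] = pd.qcut(raw["src_bytes"], q=n_bins,
                               labels=False, duplicates="drop")
```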

3.7. Random Selection of Three Attributes

3.7.1. Parameter Selection Experiment

For the three-way decision, the DBE-based attribute importance is selected as the evaluation function of the attributes. The importance of all attributes is calculated, and the importance of some of them is shown in Table 3.
Table 3. Attribute importance.
Since the role of the thresholds α and β is to divide the attributes into three domains, determining their sizes requires first calculating the attribute importance on each dataset. According to the calculation results, the maximum value is θ_max = 1.189756 × 10⁻³ and the minimum value is θ_min = 0, so the overall value range of the two thresholds is (0, 1.189756 × 10⁻³). Since α > β, we set θ_min ≤ β < (θ_max + θ_min)/2 and (θ_max + θ_min)/2 ≤ α ≤ θ_max.
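These constraints translate into a small search grid; the sketch below enumerates candidate (α, β) pairs within the stated ranges. The grid resolution is an assumption for illustration.

```python
# Sketch of the (alpha, beta) candidate grid implied by the threshold ranges.
import numpy as np

theta_min, theta_max = 0.0, 1.189756e-3
mid = (theta_max + theta_min) / 2
betas = np.linspace(theta_min, mid, num=8, endpoint=False)  # theta_min <= beta < mid
alphas = np.linspace(mid, theta_max, num=8)                 # mid <= alpha <= theta_max
grid = [(a, b) for a in alphas for b in betas if a > b]     # enforce alpha > beta
```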
Figure 6 shows how the precision on NSL-KDD changes with (α, β). The overall precision is between 80% and 91%, and about 40% of the settings give a precision greater than 85%. When (α, β) takes (0.00119, 5.34 × 10⁻⁴), (0.001071, 5.34 × 10⁻⁴), (0.000833, 5.34 × 10⁻⁴), or (0.00119, 4.16 × 10⁻⁴), the precision reaches 90%.
Figure 6. Precision change graph.
Figure 7 shows how the recall on NSL-KDD changes with (α, β). The overall recall results are good (more than 90%), and about 80% of the settings reach 100%.
Figure 7. Recall variation chart.
For intrusion detection, the evaluation focuses on how completely the intrusion behaviors in the dataset are detected, that is, on the recall rate. Therefore, the experiment pays more attention to the recall rate when selecting parameters, and the final parameter chosen for the experiment is (α, β) = (9.52 × 10⁻⁴, 5.95 × 10⁻⁵). According to the α and β values, all attributes are divided into three domains: the positive, boundary, and negative domains.
Positive domain: POS = {src_bytes}
Boundary domain: BND = {service, diff_srv_rate}
Negative domain: NEG = {duration, protocol_type, flag, count, dst_host_count, …}
The positive domain contains one attribute, the boundary domain contains two attributes, and the negative domain contains the most, with 38 attributes.

3.7.2. Parameter δ Value Determination and Attribute Random Selection

After the attributes are divided, they must be selected according to the selection rules, which first requires choosing an appropriate δ value. In this experiment, the interval of δ is set to [0, 0.65], and a comparative test is conducted in steps of 0.05 to observe the influence of different δ values on precision and recall. The experimental results are shown in Figure 8.
Figure 8. Changes in precision and recall.
It can be seen from Figure 8 that the precision fluctuates noticeably compared with the recall. Since recall is the more critical value in this experiment, while precision only represents the proportion of truly positive samples among those predicted as positive, the maximum precision must be selected under the condition that the recall is maximal. Figure 8 shows that the recall reaches its maximum of 100%, at which the maximum precision is 87.63%; the corresponding δ value is 0.3.
Here k = 42 and √k ≈ 6.48, so 7 attributes are selected per tree. Applying the random selection rules in priority order:
  • $|POS| = 1 < (1 + 0.3) \times \sqrt{42} \approx 8.42$, so rule (1) is not satisfied;
  • $|BND| = 2$, so $|POS| + |BND| = 3 < (1 + 0.3) \times \sqrt{42} \approx 8.42$, and rule (2) is not satisfied either;
  • Since $|POS| + |BND| = 3 \leq (1 + 0.3) \times \sqrt{42}$, rule (3) applies: $\lceil \frac{|POS|+|BND|}{k}\sqrt{k} \rceil = \lceil \frac{3}{42} \times 7 \rceil = 1$ attribute is randomly selected from the positive and boundary domains, and $\lfloor (1 - \frac{|POS|+|BND|}{k})\sqrt{k} \rfloor = \lfloor (1 - \frac{3}{42}) \times 7 \rfloor = 6$ attributes are randomly selected from the negative domain to form the attribute subset (the arithmetic is checked in the sketch below). N data subsets containing 7 attributes are obtained, and these N subsets are used to train N decision trees. Finally, the out-of-bag data are input into all the decision trees, and majority voting is used to combine the results into the final ensemble.
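The arithmetic behind this selection can be checked with a few lines (a quick verification sketch, not part of the model):

```python
# Quick check of rule (3) for k = 42, delta = 0.3 (Section 3.7.2).
import math

k, delta = 42, 0.3
s = math.sqrt(k)                        # ~6.48; rule thresholds use sqrt(k)
m = round(s)                            # 7 attributes per tree
pos, bnd = 1, 2
assert not pos > (1 + delta) * s        # rule (1) fails: 1 < 8.42
assert not pos + bnd > (1 + delta) * s  # rule (2) fails: 3 < 8.42
n_pb = math.ceil((pos + bnd) / k * m)   # ceil(3/42 * 7) = 1 from POS + BND
n_neg = m - n_pb                        # 7 - 1 = 6 from NEG
print(n_pb, n_neg)                      # -> 1 6
```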

4. Evaluation Results and Discussion

4.1. Evaluation Index

Considering the needs of network intrusion detection, the model evaluation indicators are the precision P, the recall R, the false alarm rate FP, the missed detection rate FR, and Fβ:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$FP = \frac{FP}{TP + FP}$$
$$FR = \frac{FN}{TP + FN}$$
Here, TP denotes the number of true positives, FN the false negatives, FP the false positives, and TN the true negatives.
The formula of F1 is
$$F_1 = \frac{2 \times P \times R}{P + R}$$
F1 is the harmonic mean of precision and recall, which comprehensively describes the performance of an algorithm [43]. In practical applications, different data backgrounds place different requirements on precision and recall, hence the weighted harmonic mean Fβ. The formula of Fβ is
$$F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}$$
where β > 0 measures the relative importance of recall to precision. When β = 1, Fβ reduces to F1, so Fβ is the general form of F1; when β > 1, recall has the greater influence, and when β < 1, precision has the greater influence.
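All five indicators derive from the binary confusion matrix; the following sketch computes them with scikit-learn, assuming label-encoded binary classes (`y_true` and `y_pred` are placeholders).

```python
# Sketch of the evaluation indicators, derived from the confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # binary 0/1 labels
P = tp / (tp + fp)                 # precision
R = tp / (tp + fn)                 # recall
FP_rate = fp / (tp + fp)           # false alarm rate, equals 1 - P
FR_rate = fn / (tp + fn)           # missed detection rate, equals 1 - R
f1 = 2 * P * R / (P + R)
beta = 1.5                         # recall weighted more than precision
f_beta = (1 + beta**2) * P * R / (beta**2 * P + R)
```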

4.2. Algorithm Comparison Experiment

To further verify the properties of IDTSRF, the average value of 10 runs is taken (as shown in Table 4) and compared with the results of random forest, AdaBoost, XGBoost, extra trees, CNN, and CNN-LSTM, as shown in Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13.
Table 4. IDTSRF running results.
Figure 9. R Comparison chart.
Figure 10. P Comparison chart.
Figure 11. FP Comparison chart.
Figure 12. FR Comparison chart.
Figure 13. F1 Comparison chart.
Due to the randomness of attribute selection, the results of each run are not exactly the same, as shown in Table 4. Intrusion detection pays more attention to how completely intrusion behavior is detected, so the comparison focuses on recall. The precision indicates the proportion of truly positive samples among those the model classifies as positive; its overall value lies between 80% and 93% and is relatively volatile compared with the recall. The recall is basically stable and good, essentially reaching 100%. Figure 9 and Figure 10 show the recall and precision of all algorithms, respectively. The recall of the IDTSRF algorithm is clearly better than that of the other algorithms; AdaBoost performs worst, with a recall below 90%.
Figure 11, Figure 12 and Figure 13 compare the false alarm rate, the missed detection rate, and the F1 values of the different algorithms. The false alarm rate represents the proportion of misclassified samples among all samples the model predicts as positive, and the missed detection rate represents the proportion of missed positive samples among all true positive samples. F1 is the harmonic mean of precision and recall and comprehensively describes the performance of an algorithm, so the comparison focuses on the F value. Since NSL-KDD emphasizes recall, Fβ is taken as the evaluation standard with β = 1.5; the result is shown in Figure 13. The comparison shows that the F value of the IDTSRF algorithm is basically the same as that of CNN and CNN-LSTM and superior to the other algorithms.

5. Conclusions

In this paper, an attribute measure based on decision boundary entropy is used to calculate attribute importance and increase the probability that important attributes are selected. Through the idea of three-way decisions, a certain number of attributes are randomly selected from the positive, negative, and boundary domains to obtain attribute subsets; decision trees are trained separately on these subsets and finally integrated into a random forest. On the one hand, this method takes into account the impact of attribute importance on classification results; on the other hand, it uses the idea of ensemble learning, training base classifiers on the selected subsets and integrating them into the random forest model. The experimental results show that the proposed algorithm is superior to random forest, AdaBoost, XGBoost, and extra trees in recall and F1 value, and is basically on par with CNN and CNN-LSTM, which proves its effectiveness. Future work will further improve the precision of the algorithm. Considering the characteristics of intrusion detection datasets, we will apply three-way selection to the data as well, and consider using the idea of three-way decisions to select and integrate the constructed base classifiers, so as to improve the accuracy and diversity of the ensemble and thus the recall and precision of intrusion detection.

Author Contributions

Conceptualization, L.L. and C.Z.; data curation, W.W.; formal analysis, W.W. and L.L.; funding acquisition, C.Z.; investigation, J.R. and W.W.; methodology, L.L.; project administration, W.W. and C.Z.; resources, L.W.; software, W.W.; supervision, L.L. and C.Z.; validation, L.L., W.W. and J.R.; visualization, W.W.; writing—original draft, W.W. and L.L.; writing—review and editing, W.W. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hebei Natural Science Foundation (No.: F2018209374), the Hebei Professional Master’s Teaching Case Library Construction Project (No.: KCJSZ2022073) and the Hebei Postgraduate Course Ideological and Political Demonstration Course Construction (No.: YKCSZ2021091).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous reviewers and associate editor for their comments that greatly improved this paper.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationship that could have appeared to influence the work reported in this paper.

References

  1. Yange, T.S.; Onyekware, O.; Abdulmuminu, Y.M. A Data Analytics System for Network Intrusion Detection Using Decision Tree. J. Comput. Sci. Appl. 2020, 8, 21–29. [Google Scholar]
  2. Hassan, E.; Saleh, M.; Ahmed, A. Network Intrusion Detection Approach using Machine Learning Based on Decision Tree Algorithm. J. Eng. Appl. Sci. 2020, 7, 1. [Google Scholar] [CrossRef]
  3. Bhati, B.S.; Rai, C.S. Analysis of Support Vector Machine-based Intrusion Detection Techniques. Arab. J. Sci. Eng. 2020, 45, 2371–2383. [Google Scholar] [CrossRef]
  4. Shi, Q.; Kang, J.; Wang, R.; Yi, H.; Lin, Y.; Wang, J. A Framework of Intrusion Detection System based on Bayesian Network in IoT. Int. J. Perform. Eng. 2018, 14, 2280–2288. [Google Scholar] [CrossRef][Green Version]
  5. Prasath, M.K.; Perumal, B. A meta-heuristic Bayesian network classification for intrusion detection. Int. J. Netw. Manag. 2019, 29, e2047. [Google Scholar] [CrossRef]
  6. Xu, G. Research on K-Nearest Neighbor High Speed Matching Algorithm in Network Intrusion Detection. Netinfo Secur. 2020, 20, 71–80. [Google Scholar]
  7. Chao, D.; Gang, Z.; Liu, Y.; Zhang, D.L. The detection of network intrusion based on improved AdaBoost algorithm. J. Sichuan Univ. (Nat. Sci.Ed.) 2015, 52, 1225–1229. [Google Scholar]
  8. Zhang, K.; Liao, G. Network intrusion detection method based on improving Bagging-SVM integration diversity. J. Northeast. Norm. Univ. (Nat. Sci.Ed.) 2020, 52, 53–59. [Google Scholar]
  9. Li, B.; Zhang, Y. Research on Self-adaptive Intrusion Detection Based on Semi-Supervised Ensemble Learning. Electr. Autom. 2021, 43, 101–104. [Google Scholar]
  10. Jiang, F.; Zhang, Y.Q.; Du, J.W.; Liu, G.Z.; Sui, Y.F. Approximate Reducts-based Ensemble Learning Algorithm and Its Application in Intrusion Detection. J. Beijing Univ. Technol. 2016, 42, 877–885. [Google Scholar]
  11. Xia, J.M.; Li, C.; Tan, L.; Zhou, G. Improved Random Forest Classifier Network Intrusion Detection Method. Comput. Eng. Des. 2019, 40, 2146–2150. [Google Scholar]
  12. Zhang, L.; Zhang, J.; Sang, Y. Intrusion Detection Algorithm Based on Random Forest and Artificial Immunity. Computer Engineering 2020, 46, 146–152. [Google Scholar]
  13. Qiao, J.; Li, J.; Chen, C.; Chen, Y.; Lv, Y. Network Intrusion Detection Method Based on Random Forest. Comput. Eng. Appl. 2020, 56, 82–88. [Google Scholar]
  14. Qiao, N.; Li, Z.; Zhao, G. Intrusion Detection Model of Internet of Things Based on XGBoost-RF. J. Chin. Comput. Syst. 2022, 43, 152–158. [Google Scholar]
  15. Liang, B.; Wang, L.; Liu, Y. Attribute Reduction Based On Improved Information Entropy. J. Intell. Fuzzy Syst. 2019, 36, 709–718. [Google Scholar] [CrossRef]
  16. Ayşegül, A.U.; Murat, D. Generalized Textural Rough Sets: Rough Set Models Over Two Universes. Inf. Sci. 2020, 521, 398–421. [Google Scholar]
  17. Zhang, P.; Li, T.; Wang, G.; Luo, C.; Chen, H.; Zhang, J.; Wang, D.; Yu, Z. Multi-Source Information Fusion Based On Rough Set Theory: A Review. Inf. Fusion 2021, 68, 85–117. [Google Scholar] [CrossRef]
  18. An, S.; Hu, Q.; Wang, C. Probability granular distance-based fuzzy rough set model. Appl. Soft Comput. 2021, 102, 107064. [Google Scholar] [CrossRef]
  19. Han, S.E. Topological Properties of Locally Finite Covering Rough Sets And K-Topological Rough Set Structures. Soft Comput. 2021, 25, 6865–6877. [Google Scholar] [CrossRef]
  20. Liu, J.; Bai, M.; Jiang, N.; Yu, D. A novel measure of attribute significance with complexity weight. Appl. Soft Comput. 2019, 82, 105543. [Google Scholar] [CrossRef]
  21. Yao, Y. Three-Way Decision: An Interpretation of Rules in Rough Set Theory; Rough Sets and Knowledge Technology Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  22. Yao, Y. Three-way decisions with probabilistic rough sets. Inf. Sci. Int. J. 2010, 180, 341–353. [Google Scholar] [CrossRef]
  23. Yao, Y. The superiority of three-way decisions in probabilistic rough set models. Inf. Sci. 2011, 181, 1080–1096. [Google Scholar] [CrossRef]
  24. Rajadurai, H.; Gandhi, U.D. Naive Bayes and deep learning model for wireless intrusion detection systems. Int. J. Eng. Syst. Model. Simul. 2021, 12, 111–119. [Google Scholar] [CrossRef]
  25. Xu, J.; Han, D.; Li, K.C.; Jiang, H. A K-means algorithm based on characteristics of density applied to network intrusion detection. Comput. Sci. Inf. Syst. 2020, 17, 665–687. [Google Scholar] [CrossRef]
  26. Liu, J.; Liu, P.; Pei, S.; Tian, C. Design and Implementation of Network Anomaly Detection System Based on Association Rules. Cyber Secur. Data Gov. 2020, 39, 14–22. [Google Scholar] [CrossRef]
  27. Jia, W.; Zhang, F.; Tong, B.; Wan, C. Application of Self-Organizing Mapping Neural Network in Intrusion Detection. Comput. Eng. Appl. 2009, 45, 115–117. [Google Scholar]
  28. Sohn, I. Deep belief network based intrusion detection techniques: A survey. Expert Syst. Appl. 2021, 167, 114170. [Google Scholar] [CrossRef]
  29. Wang, H.; Cao, Z.; Hong, B. A network intrusion detection system based on convolutional Neural Network. J. Intell. Fuzzy Syst. 2020, 38, 7623–7637. [Google Scholar] [CrossRef]
  30. Sun, X. Intrusion Detection Method Based on Recurrent Neural Network. Master’s Thesis, Tianjin University, Tianjin, China, 2020. [Google Scholar] [CrossRef]
  31. Rodríguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1619–1630. [Google Scholar] [CrossRef]
  32. Yulianto, A.; Sukarno, P.; Suwastika, N.A. Improving AdaBoost-based Intrusion Detection System (IDS) Performance on CIC IDS 2017 Dataset. J. Phys. Conf. Ser. 2019, 1192, 012018. [Google Scholar] [CrossRef]
  33. Dhaliwal, S.S.; Nahid, A.A.; Abbas, R. Effective Intrusion Detection System Using XGBoost. Information 2018, 9, 149. [Google Scholar] [CrossRef]
  34. Resende, P.A.A.; Drummond, A.C. A Survey of Random Forest Based Methods for Intrusion Detection Systems. ACM Comput. Surv. (CSUR) 2018, 51, 1–36. [Google Scholar] [CrossRef]
  35. Wang, L.; Gu, C. Overview of Machine Learning Methods for Intrusion Detection. J. Shanghai Univ. Electr. Power 2021, 37, 591–596. [Google Scholar]
  36. Yang, C.; Zhang, Q.; Zhao, F. Hierarchical Three-Way Decisions with Intuitionistic Fuzzy Numbers in Multi-Granularity Spaces. IEEE Access 2019, 7, 24362–24375. [Google Scholar] [CrossRef]
  37. Wu, Q.; Huang, S. Intrusion Detection Algorithm Combining Convolutional Neural Network and Three-Branch Decision. Comput. Eng. Appl. 2022, 58, 119–127. [Google Scholar]
  38. Du, X.; Li, Y. Intrusion Detection Algorithm Based on Deep Belief Network and Three Branch Decision. J. Nanjing Univ. (Nat. Sci.) 2021, 57, 272–278. [Google Scholar]
  39. Zhang, S.; Li, Y. Intrusion Detection Method Based on Denoising Autoencoder and Three-way Decisions. Comput. Sci. 2021, 48, 345–351. [Google Scholar]
  40. Hassan, M.; Butt, M.A.; Zaman, M. An Ensemble Random Forest Algorithm for Privacy Preserving Distributed Medical Data Mining. Int. J. E-Health Med. Commun. (IJEHMC) 2021, 12, 23. [Google Scholar] [CrossRef]
  41. Zong, F.; Zeng, M.; He, Z.; Yuan, Y. Bus-Car Mode Identification: Traffic Condition–Based Random-Forests Method. J. Transp.Eng. Part A Syst. 2020, 146, 04020113. [Google Scholar] [CrossRef]
  42. Zhang, P.; Jin, Y.F.; Yin, Z.Y.; Yang, Y. Random Forest based artificial intelligent model for predicting failure envelopes of caisson foundations in sand. Appl. Ocean. Res. 2020, 101, 102223. [Google Scholar] [CrossRef]
  43. Zhang, C.; Ren, J.; Liu, F.; Li, X.; Liu, S. Three-way selection Random Forest algorithm based on decision boundary entropy. Appl. Intell. 2022, 52, 13384–13397. [Google Scholar] [CrossRef]
  44. Bamakan SM, H.; Amiri, B.; Mirzabagheri, M.; Shi, Y. A New Intrusion Detection Approach using PSO based Multiple Criteria Linear Programming. Procedia Comput. Sci. 2015, 55, 231–237. [Google Scholar]
  45. Shi, Y.; Tian, Y.; Kou, G.; Peng, Y.; Li, J. Optimization Based Data Mining: Theory and Applications: Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  46. Aghdam, M.H.; Kabiri, P. Feature Selection for Intrusion Detection System Using Ant Colony Optimization. Int. J. Netw. Secur. 2016, 18, 420–432. [Google Scholar]
  47. Jiang, F.; Yu, X.; Du, J.; Gong, D.; Zhang, Y.; Peng, Y. Ensemble learning based on approximate reducts and bootstrap sampling. Inf. Sci. 2021, 547, 797–813. [Google Scholar] [CrossRef]
  48. Meng, Q.; Zheng, S.; Cai, Y. Deep Learning SDN Intrusion Detection Scheme Based on TW-Pooling. J. Adv. Comput. Intell. Intell. Inform. 2019, 23, 396–401. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
