Electronics
  • Article
  • Open Access

21 April 2022

The Impact of Partial Balance of Imbalanced Dataset on Classification Performance

Department of Information System Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
This article belongs to the Special Issue Machine Learning with Applications: Dealing with Interpretability and Imbalanced Datasets

Abstract

The imbalance of network data seriously affects the classification performance of algorithms. Most studies have described data imbalance only roughly, with little exploration of the specific factors affecting classification performance, which makes it difficult to put forward targeted solutions. In this paper, we find that the impact of medium categories on classification performance cannot be ignored, and we therefore propose the concept of partial balance, consisting of the Class Number of Partial Balance (β) and the Balance Degree of Partial Samples (μ). Combined with the Global Slope (α), a parameterized model is established to describe the differences among imbalanced datasets. Experiments are performed on the Moore dataset and the CICIDS 2017 dataset. The experimental results on Random Forest, Decision Tree and Deep Neural Network show that increasing α improves the performance of both the minority classes and the overall classes. When β of the dominant categories increases, that of the inferior classes decreases, which results in a decrease in the average performance of the minority classes. The lower μ is, the closer the sample size of the medium classes is to that of the minority classes, and the better the average performance is. Based on these conclusions, we propose some basic strategies and verify them with various classical algorithms.

1. Introduction

In massive network data, due to different user preferences and service types, the data distribution is often imbalanced. There are majority classes and minority classes, i.e., the number of samples in some categories is far smaller than that in other categories. For example, KDD CUP 99 [1] and the Moore dataset [2] are typical imbalanced network datasets. Traditional classification methods usually assume that the data distribution is balanced and that misclassification costs are equal. When traditional classification algorithms are used to deal with imbalanced data, taking overall accuracy as the goal makes the classification model favor the majority classes and ignore the minority classes, resulting in low classification accuracy on the minority classes. From the perspective of data mining, the discovery and identification of minority classes is of higher analytical value, e.g., the attack data in network intrusion detection [3]. Therefore, researching data imbalance has important theoretical value and practical significance.
Many methods have made significant progress in addressing data imbalance and have improved classification performance, but unresolved problems remain. Existing studies offer only a rough description of data imbalance without substantial exploration of its essence. The specific impact of data imbalance on classification performance is not clear enough, and the factors affecting classification performance have not been explored in depth. Therefore, it is difficult to provide precise and targeted guidance for follow-up solution strategies. Garcia et al. [4] investigated the influence of both the imbalance ratio and the classifier on the performance of several resampling strategies. Their experiments showed that oversampling the minority class consistently outperforms undersampling the majority class when datasets are strongly imbalanced. Buda et al. [5] proposed two indicators to describe data imbalance and constructed imbalanced image datasets to verify the impact of imbalanced data on CNNs under different parameter settings. However, only majority and minority classes were considered, and using only ROC curves and AUC (area under the receiver operating characteristic curve) as metrics to evaluate classifier performance was not comprehensive. Thabtah et al. [6] studied the precise nature of the relationship between the degree of class imbalance and the corresponding classifier performance. By changing class imbalance ratios and using the probabilistic Naïve Bayes as the base classifier, their experiments highlighted the effects of class imbalance. Kulkarni et al. [7] resolved two essential statistical elements, the degree of class imbalance and the complexity of the concept, which helped in building the foundations of a data democracy. They focused on the main causes of imbalance, namely class overlap and small disjuncts.
Actual datasets are complex: there are not only majority and minority classes but also medium classes between them. Most oversampling methods aim to balance all categories by increasing the sample size of the minority classes [8,9,10,11]. However, in the face of extreme imbalance, they still cannot solve the fundamental problem. Therefore, with respect to improving classification performance on imbalanced data, further exploration of the factors affecting that performance is a worthwhile research direction. Research on data imbalance can give us a clearer understanding of the essence of the problem, and when different scenarios require different optimization objectives, we can put forward precise and targeted strategies.
The main contributions of this paper are as follows:
(1) Proposal of a parameterized model based on imbalanced network datasets and a solution to the problem of insufficient description of data imbalance.
(2) Clarification of the factors affecting the classification performance of imbalanced data, and finding and verifying the influence of partial balance.
(3) Proposal and proof of the basic strategies to solve the problem of network data imbalance.
The structure of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the parameterized description of imbalanced datasets. In Section 4, we conduct experiments to explore the impact of different parameters on the classification performance of imbalanced datasets. Section 5 explores the differences in classification performance of imbalanced datasets with partial balance and the impact of partial balance; in addition, some feasible strategies are put forward and verified on several algorithms. Section 6 concludes this paper.

3. Parameterized Model of Imbalanced Dataset

Network services are taken as an example to analyze the existing data imbalance.
Table 1 shows the proportion distribution statistics of the Moore dataset. The dataset contains 377,526 network flow samples, which are divided into 10 application types. The Moore dataset is a typical imbalanced dataset, and the sample size of various application types varies greatly.
Table 1. Distribution of Moore dataset.
From the perspective of similar applications, the launch of new applications will lead to a big gap in network data scale between new applications and existing similar applications. Moreover, influenced by regional cultural differences and user preferences (such as Skype, used internationally, and WeChat, mainly used in China), there will also be great differences in data scale between mainstream and non-mainstream applications. In addition, different software data in similar applications with relatively stable user groups will produce a relative data balance (the relatively concentrated area of Type 1–Type 3 in Figure 1) or a stable data-scale gap (Type 4). From the perspective of different types of applications, data imbalance is mainly caused by different business attributes. For example, business flows such as web data are enormous in number compared with other data (the majority classes in Type 1–Type 4).
Figure 1. The types of datasets in the actual network environment.
Through the above analysis, the following four dataset types with data imbalance are further summarized.
As shown in Figure 1, Type 1 includes a majority class, several medium classes and a minority class, and the distribution area of medium classes is relatively concentrated. We call this partial balance. Type 2 includes medium classes and a minority class, and medium classes show partial balance. Type 3 includes a majority class and several minority classes, and minority classes show partial balance. Type 4 has the characteristics of a majority class and minority class, but the data scale of the medium classes has no relatively centralized attribute and has a linear characteristic. We consider that there is no partial balance in this case. Type 5 is a balanced dataset for comparative analysis.
Based on the above analysis, the characteristics of the actual dataset types include the sample sizes and the class numbers of the majority, medium and minority categories. According to these characteristics, imbalanced datasets are modeled and several parameters are proposed to describe them as follows:
Define a dataset as $D = D_1 \cup D_2 \cup \cdots \cup D_N$, with $D_i \cap D_j = \emptyset$ $(i \neq j)$, where $D_i$ and $D_j$ are subclasses of $D$, $N$ is the number of categories in $D$, and $i, j \in \{1, 2, \ldots, N\}$.
Imbalanced Dataset  D : {Global Slope, Class Number of Partial Balance, Balance Degree of Partial Samples}.
Partial Balance: {Class Number of Partial Balance, Balance Degree of Partial Samples}. These two parameters characterize the balanced part of an imbalanced dataset; partial balance is a phenomenon within the imbalanced dataset.
Global Slope: $\alpha$ is the ratio of the minority class sample size to the majority class sample size in the dataset, defined as:
$$\alpha = \frac{\min_i |D_i|}{\max_i |D_i|}, \quad i = 1, 2, \ldots, N$$
Class Number of Partial Balance: $\beta$ is the class number of partial balance. According to the dataset types in Figure 1, $\beta_{maj}$, $\beta_{med}$ and $\beta_{min}$ represent the class number of partial balance of the majority, medium and minority classes, respectively. $\beta$ is an integer with $0 \leq \beta \leq N$:
$$\beta \in \{0, 1, 2, \ldots, N\}$$
When $\beta = 0$, the dataset shows linear imbalance, i.e., Type 4. When $\beta = N$, the dataset is balanced, i.e., Type 5. The larger $\beta$, the higher the degree of partial balance.
Balance Degree of Partial Samples: $\mu$ is the ratio of the partial average sample size to the average sample size of the majority classes, depicting the degree of partial balance. The lower $\mu$, the lower the height of the partial sample size. $\mu$ is expressed as:
$$\mu = \frac{\dfrac{1}{\beta_{med}} \sum_{m=1}^{n_{med}} |D_m|}{\dfrac{1}{\beta_{maj}} \sum_{i=1}^{j_{maj}} |D_i|},$$
where $n, j \leq N$ and the sums run over the medium and majority classes, respectively.
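To make the definitions concrete, the following sketch computes α and μ from per-class sample counts. It assumes the majority, medium and minority groups have already been identified; the class names and sizes are illustrative, not taken from the paper.

```python
# Illustrative sketch: computing the Global Slope (alpha) and the Balance
# Degree of Partial Samples (mu) from per-class sample counts. The grouping
# of classes into majority/medium/minority is assumed to be given.

def global_slope(class_sizes):
    """alpha = min |D_i| / max |D_i| over all classes in the dataset."""
    sizes = list(class_sizes.values())
    return min(sizes) / max(sizes)

def balance_degree(class_sizes, medium_classes, majority_classes):
    """mu = (mean size of the medium classes) / (mean size of the majority classes)."""
    med_mean = sum(class_sizes[c] for c in medium_classes) / len(medium_classes)
    maj_mean = sum(class_sizes[c] for c in majority_classes) / len(majority_classes)
    return med_mean / maj_mean

# Hypothetical Type 1 profile: one majority class, three medium classes,
# one minority class (numbers are made up for illustration).
sizes = {"WWW": 5000, "MAIL": 2500, "BULK": 2400, "SERVICES": 2600, "GAME": 50}
alpha = global_slope(sizes)                                        # 50 / 5000 = 0.01
beta_med = 3                                                       # class number of partial balance
mu = balance_degree(sizes, ["MAIL", "BULK", "SERVICES"], ["WWW"])  # 2500 / 5000 = 0.5
print(alpha, beta_med, mu)
```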
The above parameters can describe the main characteristics of dataset types. To simulate the types of actual datasets, some random fluctuations are added to the sample quantity of each category. For example, σ = 20 % means that all categories’ sample sizes in the dataset fluctuate by 20%.
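The fluctuation can be sketched as below. Drawing each class size uniformly within ±σ of its nominal value is our reading of the description; the paper does not state the exact noise model.

```python
import random

def fluctuate(nominal_sizes, sigma=0.20, seed=None):
    """Perturb each class's nominal sample size by up to +/- sigma (a fraction).
    The uniform perturbation is an assumption; the paper only states that
    sample sizes fluctuate by sigma."""
    rng = random.Random(seed)
    return {c: int(n * (1 + rng.uniform(-sigma, sigma)))
            for c, n in nominal_sizes.items()}

print(fluctuate({"WWW": 5000, "MAIL": 2500, "GAME": 50}, sigma=0.20, seed=42))
```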

4. Impact of Imbalanced Dataset Parameters on Classification Performance

When classifying the same type of imbalanced dataset, the average classification performance of majority classes, minority classes and the overall classes can be affected by the parameters. To explore the influence of parameters on classification performance, the following experiments are designed.

4.1. Experimental Details

4.1.1. Experimental Environment

Our experimental environment is shown in Table 2.
Table 2. Experimental environment settings.

4.1.2. Data Sets

The datasets used in this paper are the Moore dataset and CICIDS 2017 dataset [27].
Moore Dataset: The Moore dataset is represented by 249 attribute features. To reduce redundant features and the amount of calculation, and to improve classification efficiency, only the 10 features per network flow used in reference [28] are adopted in our experiments. In addition, only 6 application types are used, namely WWW, MAIL, BULK, DATABASE, SERVICES and P2P. These categories were chosen because each contains more than 2000 samples, which provides enough data for the experiments.
CICIDS 2017 Dataset: The CICIDS 2017 dataset is a network traffic dataset close to the real world, including normal and abnormal traffic. There are 79 features in the dataset, including a label feature and a duplicate feature. Our experiments use the machine learning CSV data and, to reduce redundant features, adopt the 15 features used in [29] and 6 types, namely Benign, DOS Hulk, PortScan, DOS Slowhttptest, DOS Slowloris and Web Attack.

4.1.3. Basic Experiment Settings

The basic experiment settings are shown in Table 3. In addition, the average sample quantity of the majority classes is always 5000. The classifiers used in these experiments are Random Forest (RF), Decision Tree (DT) and Deep Neural Network (DNN) from the Scikit-learn package, with the default parameters shown in Table 4. Each category is in turn assigned as the minority category and the majority category, to reduce the impact of the category itself on classification performance. Each experiment reports the statistical average over 100 repetitions.
Table 3. Dataset Settings.
Table 4. Classifier Settings.
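As a minimal sketch of this protocol, the code below runs scikit-learn's RandomForestClassifier, DecisionTreeClassifier and MLPClassifier (assumed here to stand in for RF, DT and the DNN) with default parameters over repeated random splits and averages the minority-class recall; the toy data, split ratio and reduced repetition count are ours, not the paper's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

def average_minor_recall(X, y, minority_labels, clf_factory, repeats=100):
    """Average recall over the minority classes across repeated random splits,
    mirroring the paper's statistical averaging over repetitions."""
    scores = []
    for _ in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
        clf = clf_factory()  # default parameters, as in Table 4
        clf.fit(X_tr, y_tr)
        scores.append(recall_score(y_te, clf.predict(X_te),
                                   labels=minority_labels, average="macro"))
    return float(np.mean(scores))

# Toy data standing in for a sampled Moore/CICIDS subset; class 2 is the minority.
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           weights=[0.60, 0.38, 0.02], random_state=0)
for name, factory in [("RF", RandomForestClassifier),
                      ("DT", DecisionTreeClassifier),
                      ("DNN", MLPClassifier)]:
    print(name, average_minor_recall(X, y, [2], factory, repeats=5))
```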

4.1.4. Evaluation Indicators

In machine learning, the commonly used performance metrics are the recall rate ($rec$), the precision rate ($pre$) and the F1-score ($f1$). $pre$ reflects the proportion of real positive samples among the samples that the classifier judges positive. $rec$ reflects the proportion of positive samples correctly judged among all positive samples. $f1$ is the harmonic mean of $pre$ and $rec$. For a given category $D_i$, they are defined by Equations (4)–(6):
$$pre_{D_i} = \frac{TP_{D_i}}{TP_{D_i} + FP_{D_i}}$$
$$rec_{D_i} = \frac{TP_{D_i}}{TP_{D_i} + FN_{D_i}}$$
$$f1_{D_i} = \frac{2 \, rec_{D_i} \, pre_{D_i}}{rec_{D_i} + pre_{D_i}}$$
$TP_{D_i}$ is the number of samples correctly labeled as $D_i$. $FP_{D_i}$ denotes samples whose predicted label is $D_i$ but whose actual label is not $D_i$. $FN_{D_i}$ denotes samples whose predicted label is not $D_i$ but whose actual label is $D_i$. $TN_{D_i}$ means that neither the predicted label nor the actual label is $D_i$.
In addition, macro indicators, given by Equations (7)–(9), are used to measure the overall classification performance, where $N$ is the total number of categories. $\text{Aver.Pre}$ is the average $pre$, $\text{Aver.Rec}$ is the average $rec$, and $\text{Aver.F1}$ is the average $f1$. For example, Macro Aver.Rec is the average $rec$ over all categories, and Minor Aver.Rec is that over the minority categories.
$$\text{Aver.Pre} = \frac{1}{N} \sum_{i=1}^{N} pre_{D_i}$$
$$\text{Aver.Rec} = \frac{1}{N} \sum_{i=1}^{N} rec_{D_i}$$
$$\text{Aver.F1} = \frac{1}{N} \sum_{i=1}^{N} f1_{D_i}$$
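These indicators map directly onto scikit-learn's metric functions; one way to compute them is sketched below (the helper name and the example label set are ours).

```python
from sklearn.metrics import precision_recall_fscore_support

def macro_indicators(y_true, y_pred, labels=None):
    """Return (Aver.Pre, Aver.Rec, Aver.F1) as unweighted macro averages.
    With labels=None the averages run over all categories (Macro Aver.);
    restricting labels to the minority classes yields the Minor Aver. variants."""
    pre, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average="macro", zero_division=0)
    return pre, rec, f1

# Example with hypothetical labels, where classes 4 and 5 are the minority:
# macro_p, macro_r, macro_f1 = macro_indicators(y_true, y_pred)
# _, minor_r, _ = macro_indicators(y_true, y_pred, labels=[4, 5])
```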

4.2. Parameters’ Impact on Classification Performance of Type 1

Different types of datasets use different parameters. According to the characteristics of Type 1, three sets of experiments are set up to explore the effects of the parameters α, β and μ on classification performance, as shown in Table 5.
Table 5. Experimental parameter settings of Type 1.

4.2.1. 1-A Impact of α

To explore the effect of Global Slope on classification performance of Type 1, α is changed and other parameters are fixed.
The lower α, the higher the degree of data imbalance. According to the experimental results shown in Figure 2, with the decrease in α: (1) the average recall rate and F1-score of the minority categories show a downward trend; when α = 0.003, the recall rate of Random Forest on the Moore dataset is 56.96%, so the classifier is no longer reliable. This shows that data imbalance has a large negative impact on the classification performance of the minority classes. (2) The overall classification performance also shows a downward trend.
Figure 2. The effect of α on the classification performance of Type 1. (a) Moore Dataset; (b) CICIDS 2017 Dataset.

4.2.2. 1-B Impact of β

Experiment 1-B explores the effect of the Class Number of Partial Balance on the classification performance of Type 1, so β is changed and the other parameters are fixed. In this case, $\beta$ means $\beta_{med}$, i.e., the class number of medium classes in the dataset. The larger $\beta_{med}$, the higher the degree of partial balance.
According to Figure 3, as $\beta_{med}$ increases, the recall rate and F1-score of the minority classes decrease. However, the recall rate and F1-score of the overall classification performance show an upward trend, which indicates that increasing $\beta_{med}$ is conducive to improving the overall performance.
Figure 3. The effect of β on the classification performance of Type 1. (a) Moore Dataset; (b) CICIDS 2017 Dataset.

4.2.3. 1-C Impact of μ

Experiment 1-C explores the effect of the Balance Degree of Partial Samples on the classification performance of Type 1, so μ is changed. The lower μ, the lower the height of the medium classes, and the closer the medium classes are to the minority classes.
As revealed by Figure 4, as μ increases, all evaluation indicators of both the minority classes and the overall classes decrease. This shows that the lower μ is, the better the classification performance is.
Figure 4. The effect of μ on the classification performance of Type 1. (a) Moore Dataset; (b) CICIDS 2017 Dataset.

4.3. Parameters’ Impact on Classification Performance of Type 2

Although the practical significance of Type 2 and Type 3 differs, both can be described by α and β. From the perspective of modeling, they can be regarded as the same case. Therefore, the discussion of Type 2 in this section applies equally to Type 3. According to the characteristics of these types, α and β are explored, and the experimental parameter settings can be seen in Table 6.
Table 6. Experimental parameter settings of type 2.

4.3.1. 2-A Impact of α

Experiment 2-A explores the effect of α on the classification performance of Type 2.
According to Figure 5, as α decreases, the degree of imbalance intensifies, and the recall rate and F1-score of the minority classes and the overall classes show a downward trend. It can be concluded that α is the main parameter affecting the classification performance of Type 2 and Type 3, and that increasing α is beneficial to classification performance.
Figure 5. The effect of α on the classification performance of Type 2. (a) Moore Dataset; (b) CICIDS 2017 Dataset.

4.3.2. 2-B Impact of β

Experiment 2-B explores the effect of β on the classification performance of Type 2.
As shown in Figure 6, with the increase in β , the average performance of minority categories shows a downward trend, but the overall performance is on the rise.
Figure 6. The effect of β on the classification performance of Type 2. (a) Moore Dataset; (b) CICIDS 2017 Dataset.

4.4. Parameters’ Impact on Classification Performance of Type 4

Since Type 4 describes a linearly imbalanced dataset, its most salient feature is the Global Slope. Therefore, only α is used to describe this type. The experimental settings are $\alpha \in \{0.002, 0.01, 0.02, 0.1\}$.
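For reference, a Type 4 class-size profile can be generated by spacing sample sizes linearly from the majority size down to α times that size; this construction is our reading of the "linear characteristic" in Figure 1, not a procedure given in the paper.

```python
import numpy as np

def type4_profile(n_classes, majority_size=5000, alpha=0.01):
    """Linearly spaced class sizes from majority_size down to
    alpha * majority_size (Type 4: no partially balanced group)."""
    return np.linspace(majority_size, alpha * majority_size, n_classes).astype(int)

print(type4_profile(6, majority_size=5000, alpha=0.01))
# -> [5000 4010 3020 2030 1040   50]
```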
The experimental results are shown in Figure 7. As α decreases, the classification performance indicators of both the minority classes and the overall classes decrease.
Figure 7. The effect of α on the classification performance of Type 4. (a) Moore Dataset; (b) CICIDS 2017 Dataset.

4.5. Result Analysis

A decrease in α negatively affects the average performance of the minority and overall categories. As the results in Table 7 show, as α decreases, the sample proportion of the minority classes in the overall dataset decreases, and the average classification performance of the minority and overall categories shows a downward trend.
Table 7. Parameters’ impact on classification performance of different dataset types.
When β of the dominant categories increases, that of the inferior classes decreases, which leads to a decrease in the average performance of the minority classes. In Type 1, $\beta = \beta_{med}$: the more medium categories there are, the fewer minority classes, and the lower the average performance of the minority classes. In Type 2, $\beta = \beta_{maj}$: the more majority classes there are, the worse the average performance. In either case, $\beta_{med}$ or $\beta_{maj}$ is the class number of the dominant categories in the dataset, whose growth leads to a decline in the average performance of the minority classes.
The lower μ is, the closer the medium classes are to the minority classes, and the better the average performance is. From the results, in Type 1, the decrease of μ represents the decrease in the sample size of the medium categories, and the classification performance of minority and overall classes increase.

5. Parameters’ Impact on Classification Performance for Imbalanced Data Sets with Partial Balance

Further analysis shows that Type 1 and Type 2 are more complex. Partial balance exists in both: Type 1 is affected by α, β and μ, while Type 2 is affected by α and β. To further study the impact of partial balance and the differences in classification performance of imbalanced datasets with partial balance, the following experiments vary multiple parameters in coordination.

5.1. Experimental Setup and Results

This section uses the same experimental environment and classifiers as Section 4. α and β are adjusted at the same time to observe the performance difference between Type 1 and Type 2, where $\alpha \in \{0.001, 0.01, 0.1\}$, $\beta_{min} \in \{1, 2, 3, 4\}$, $\mu = 0.5$, and $\beta_{maj} = 1$ in Type 1. There are 12 groups of experiments.
As shown in Figure 8 and Figure 9, as α decreases, the degree of data imbalance intensifies, and the average performance of the minority categories and the overall average performance show a downward trend; this holds for every $\beta_{min} \in \{1, 2, 3, 4\}$. As $\beta_{min}$ increases, the class number of minority categories increases while the class number of majority categories decreases. Accordingly, the average performance of the minority categories in both Type 1 and Type 2 increases, while the overall performance of both decreases. This is consistent with the conclusions in Section 4.
Figure 8. Classification performance comparison of Type 1 and Type 2 on Moore dataset. (a) Minor Aver.Rec; (b) Macro Aver.Rec.
Figure 9. Classification performance comparison of Type 1 and Type 2 on CICIDS 2017 dataset. (a) Minor Aver.Rec; (b) Macro Aver.Rec.
The larger $\beta_{min}$ is, the higher the partial balance of the minority categories, the better their average performance, and the worse the overall performance of Type 1.
Therefore, it can be concluded that the classification performance of Type 1 and Type 2 differs: in terms of the average performance of the minority classes and the overall classes, Type 2 is better than Type 1.

5.2. Strategy Validation for Improving Classification Performance

From Section 4, it can be seen that α and μ affect the classification performance of the minority classes and the overall classes. To further verify this conclusion, several classical data-level algorithms are used for experimental comparison and validation.

5.2.1. Experimental Setup

The oversampling strategy and the undersampling strategy are applied to the Moore dataset. The categories WWW, MAIL and BULK are regarded as majority classes, MULT, INT and GAME as minority classes, and the other four categories as medium classes. The detailed experimental settings are shown in Table 8. Exp. No. 1 is the original imbalanced Moore dataset. Oversampling increases the sample size of the minority classes, changing α. When undersampling acts on the majority classes, the changed parameter is α; when it acts on the medium classes, the changed parameter is μ. The Random Forest classifier is used to perform the classification experiments, and each experiment reports the statistical average over 50 repetitions.
Table 8. Experimental setup and description.
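To make the two strategy families concrete, the sketch below pairs ADASYN oversampling (raising α) with random undersampling of a medium class (lowering μ) using the imbalanced-learn package on toy data; the class labels, sizes and target counts are illustrative and do not reproduce Table 8.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced data standing in for the Moore flows:
# 0 = majority, 1 = medium, 2 = minority (weights are illustrative).
X, y = make_classification(n_samples=6000, n_classes=3, n_informative=6,
                           weights=[0.70, 0.25, 0.05], random_state=0)

# Raising alpha: oversample the minority class toward the majority size.
X_over, y_over = ADASYN(sampling_strategy="minority").fit_resample(X, y)

# Lowering mu: undersample the medium class down to the minority size.
X_under, y_under = RandomUnderSampler(
    sampling_strategy={1: int((y == 2).sum())}).fit_resample(X, y)

print(Counter(y), Counter(y_over), Counter(y_under))
```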

5.2.2. Experimental Results

As shown in Figure 10, in Exp. No. 1, the F1-score of the minority classes is 83% and the overall F1-score is 91%. By using different algorithms to adjust the parameters, the classification performance of the minority classes can be effectively improved. In Exp. No. 4, α is changed by using the ADASYN algorithm; after oversampling, the F1-score of the minority classes reaches 96%, 13 percentage points higher than on the original imbalanced dataset and the most obvious improvement. In addition, the algorithms of the different strategies not only improve the classification performance of the minority classes but also effectively preserve the overall classification performance. The experiment confirms the conclusions of Section 4.
Figure 10. The classification performance of Moore dataset.
Because of the different application scenarios of network data, the categories concerned are different, and the performance indicators focused on are also different. For example, network traffic classification is required for network bandwidth allocation and network resource scheduling, instead of being concerned with the performance of a specific category. Therefore, the overall classification performance should be given much attention. In intrusion detection, malicious traffic as minority classes needs to be paid more attention so that the classification performance of minority classes becomes more important. Therefore, the classification performance of minority classes and overall classes can be improved by changing α and μ . Specifically, undersampling for the majority classes or medium classes can be carried out; or oversampling for the minority classes can be carried out.

6. Conclusions

In existing research on data imbalance, most studies used the ratio between the majority and minority classes to describe data imbalance but neglected the medium categories. Through analyzing actual datasets, we find that medium categories have a significant impact on classification performance. We therefore propose, for the first time in the field of network traffic classification, the concept of partial balance. We define the Class Number of Partial Balance (β) and the Balance Degree of Partial Samples (μ) to describe the class number and the degree of partial balance, respectively. Combined with the Global Slope (α), a parameterized model is put forward to depict data imbalance.
By using three machine learning classification algorithms on two classical network traffic datasets, we clarify the factors affecting classification performance. Experimental results show that the lower α , the worse the classification performance of the minority classes and overall classes. When β of dominant categories increases, the classification performance of minority classes decreases. The lower μ , the better the average performance.
Based on these conclusions, the classification performance of minority classes and the overall classes can be improved through adjusting α and μ , which can be achieved by resampling strategies. Therefore, we propose that undersampling for majority classes or medium classes, or oversampling for minority classes, can be conducted to improve classification performance. Experiments on several classical sampling algorithms verified the feasibility of the proposed strategies.
There are still some limitations. The experiments were not conducted on the datasets of other fields, such as the well-known image datasets MNIST, CIFAR, etc. On different datasets, the same parameter may result in different classification performance.
In future work, deep learning methods such as Generative Adversarial Networks can be considered for oversampling. When conducting resampling, the sampling degree at which classification performance is best remains to be further studied. Furthermore, under imbalanced conditions, the problems of unlabeled data and concept drift need to be further discussed.

Author Contributions

Conceptualization, Q.L.; methodology, Q.L. and C.Z.; software and validation, Q.L., C.Z. and X.H.; data curation, K.C. and R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. KDD Cup 1999 Data. University of California, Irvine. Available online: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 2 March 2022).
  2. Moore, A.W.; Zuev, D. Internet traffic classification using Bayesian analysis techniques. In SIGMETRICS'05, Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Banff, AB, Canada, 6–10 June 2005; Association for Computing Machinery: New York, NY, USA, 2005; pp. 50–60.
  3. Zhao, G.; Xu, K.; Xu, L.; Wu, B. Detecting APT malware infections based on malicious DNS and traffic analysis. IEEE Access 2015, 3, 1132–1142.
  4. Garcia, V.; Sanchez, J.S.; Mollineda, R.A. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 2012, 25, 13–21.
  5. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259.
  6. Thabtah, F.; Hammoud, S.; Kamalov, F.; Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 2020, 513, 429–441.
  7. Kulkarni, A.; Chong, D.; Batarseh, F.A. Foundations of data imbalance and solutions for a data democracy. In Data Democracy; Academic Press: Cambridge, MA, USA, 2020; pp. 83–106.
  8. Wang, Z.; Wang, P.; Zhou, X.; Li, S.; Zhang, M. FLOWGAN: Unbalanced Network Encrypted Traffic Identification Method Based on GAN. In Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Xiamen, China, 16–18 December 2019; pp. 975–983.
  9. Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 2018, 91, 464–471.
  10. Vu, L.; Bui, C.T.; Nguyen, Q.U. A deep learning based method for handling imbalanced problem in network traffic classification. In Proceedings of the Eighth International Symposium on Information and Communication Technology, Nha Trang, Vietnam, 7–8 December 2017; pp. 333–339.
  11. Hasibi, R.; Shokri, M.; Dehghan, M. Augmentation scheme for dealing with imbalanced network traffic classification using deep learning. arXiv 2019, arXiv:1901.00204.
  12. Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2008, 39, 539–550.
  13. Phua, C.; Alahakoon, D.; Lee, V. Minority report in fraud detection: Classification of skewed data. ACM SIGKDD Explor. Newsl. 2004, 6, 50–59.
  14. Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. In Proceedings of the Conference on Artificial Intelligence in Medicine in Europe; Springer: Berlin/Heidelberg, Germany, 2001; pp. 63–66.
  15. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  16. De La Calleja, J.; Fuentes, O. A Distance-Based Over-Sampling Method for Learning from Imbalanced Data Sets. In Proceedings of the FLAIRS Conference, Key West, FL, USA, 7–9 May 2007; pp. 634–635.
  17. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
  18. Lee, S.S. Noisy replication in skewed binary classification. Comput. Stat. Data Anal. 2000, 34, 165–191.
  19. Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3573–3587.
  20. Sahin, Y.; Bulkan, S.; Duman, E. A cost-sensitive decision tree approach for fraud detection. Expert Syst. Appl. 2013, 40, 5916–5923.
  21. Dhar, S.; Cherkassky, V. Development and Evaluation of Cost-Sensitive Universum-SVM. IEEE Trans. Cybern. 2015, 45, 806–818.
  22. Wang, S.; Liu, W.; Wu, J.; Cao, L.; Meng, Q.; Kennedy, P.J. Training deep neural networks on imbalanced data sets. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 4368–4374.
  23. Maldonado, S.; Montecinos, C. Robust classification of imbalanced data using one-class and two-class SVM-based multiclassifiers. Intell. Data Anal. 2014, 18, 95–112.
  24. Chaki, S.; Verma, A.K.; Routray, A.; Mohanty, W.K.; Jenamani, M. A One-Class Classifier based Framework using SVDD: Application to an Imbalanced Geological Dataset. In Proceedings of the 3rd IEEE Students' Technology Symposium, Kharagpur, India, 28 February–2 March 2016; pp. 76–81.
  25. Chen, Y.; Li, Y.; Tseng, A.; Lin, T. Deep learning for malicious flow detection. In Proceedings of the 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), Montreal, QC, Canada, 8–13 October 2017; pp. 1–7.
  26. Zhang, Y.; Chen, X.; Guo, D.; Song, M.; Teng, Y.; Wang, X. PCCN: Parallel Cross Convolutional Neural Network for Abnormal Network Traffic Flows Detection in Multi-Class Imbalanced Network Traffic Flows. IEEE Access 2019, 7, 119904–119916.
  27. University of New Brunswick. Intrusion Detection Evaluation Dataset (CICIDS2017). Available online: https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 23 November 2021).
  28. Li, L.; Zhang, X.; Zhang, X.; Li, Q. Semi-supervised traffic classification algorithm based on K-means and k-nearest neighbors. J. Inform. Eng. Univ. 2015, 16, 234–239.
  29. Kurniabudi; Stiawan, D.; Darmawijoyo; Idris, M.Y.B.; Budiarto, R. CICIDS-2017 Dataset Feature Analysis with Information Gain for Anomaly Detection. IEEE Access 2020, 8, 132911–132921.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
