Article
Peer-Review Record

Improved KNN Algorithm for Fine-Grained Classification of Encrypted Network Flow

Electronics 2020, 9(2), 324; https://doi.org/10.3390/electronics9020324
by Chencheng Ma 1,2, Xuehui Du 1,2,* and Lifeng Cao 1,2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 8 January 2020 / Revised: 7 February 2020 / Accepted: 11 February 2020 / Published: 13 February 2020
(This article belongs to the Section Networks)

Round 1

Reviewer 1 Report

This paper proposes a new weighted-feature KNN algorithm (WKNN), based on KNN, to train a model with a small amount of data while considering the importance of different features. The proposed classifier incorporates a self-adaptive mechanism that adjusts the feature weights. It is also used within a three-layer framework for the Fine-grained Classification of Encrypted network flows (FCE-KNN), in particular classifying the encrypted network status, encrypted application type, and encrypted content type.

Here are some comments:
- I'd recommend rephrasing the first sentence. For instance: "Traffic classification technology plays an important role... It is the basis for analysing..."

- The authors tend to use very long sentences, which affects readability. For instance, the second paragraph in the Introduction:
"Machine learning is a well-known method in the field of encrypted traffic classification [8]. But to achieve fine-grained classification, machine learning methods need a great amount of labeled data to train a model [9], which is difficult to be realized in actual network for the reasons that labeled data are hard to get much [10] and the model should be updated periodically for coping with concept drift [11-12]."


- Also, the authors link concept drift with the difficulty of getting labelled data; however, the connection is not clear. Concept drift typically refers to a change in the statistical properties of the target variable over time in an unpredictable way.
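To illustrate what concept drift means in practice, here is a minimal hypothetical sketch (the data and classifier are illustrative only, not the authors' method): a simple threshold classifier fitted on an early time window degrades once the class distribution shifts in a later window.

```python
def train_threshold(points):
    """Fit a one-feature threshold classifier: the midpoint between
    the two class means. Labels: 0 below the threshold, 1 above."""
    lo = [x for x, y in points if y == 0]
    hi = [x for x, y in points if y == 1]
    return (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2

def accuracy(points, thr):
    """Fraction of points whose side of the threshold matches the label."""
    return sum((x > thr) == (y == 1) for x, y in points) / len(points)

# Early window: class 0 near 0.0, class 1 near 1.0.
early = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)]
thr = train_threshold(early)          # midpoint threshold: 0.5

# Later window: the class-1 feature distribution has drifted downward,
# so the model trained on the early window starts misclassifying.
later = [(0.0, 0), (0.2, 0), (0.3, 1), (0.45, 1)]
acc_early = accuracy(early, thr)      # 1.0
acc_later = accuracy(later, thr)      # 0.5
```

The point is that drift degrades a *fixed* model over time, which is a motivation for periodic retraining; it is a separate issue from how much labelled data is available in the first place.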

- The following paragraph is essentially a single sentence and is very hard to understand, yet it is important because it gives the reader an example of the capabilities of the classification framework:
"Assuming a situation that flow was identified as an abnormal application or content type of flow, the flow is likely an malicious flow. However, assuming more a situation that a flow was identified as a flow of Youtube application and file content, the flow would be considered as a normal flow based on the analysis of the single attribute of the flow, i.e., Youtube or file. As a matter of fact, a Youtube flow would not be file flow in most cases, so the flow is suspected of a malicious flow. The framework proposed can distinguish this type of abnormal flows based on the analysis of the correlation between the attributes.". Also, in the same paragraph, "an malicious flow." => *a* malicious flow.

- In Algorithm 1 (WKNN), it seems that the feature weights are applied to the per-feature distances between two points. Since KNN already classifies data based on distance, please provide more explanation of how WKNN differs from plain KNN.
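To make the distinction concrete, here is a minimal sketch of how per-feature weights change a plain KNN decision (the function and data are illustrative assumptions, not the authors' Algorithm 1): down-weighting a noisy feature lets the informative feature dominate the neighbour ranking.

```python
import math
from collections import Counter

def weighted_knn_predict(train, labels, query, weights, k=3):
    """Classify `query` by majority vote among the k nearest training
    points, using a feature-weighted Euclidean distance.  With all
    weights equal to 1 this reduces to plain KNN."""
    def dist(a, b):
        return math.sqrt(sum(w * (x - y) ** 2
                             for w, x, y in zip(weights, a, b)))
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Two features: feature 0 is informative, feature 1 is noise.
train = [(0.0, 9.0), (0.1, 7.0), (1.0, 0.4), (1.1, 8.0)]
labels = ["a", "a", "b", "b"]
query = (0.05, 0.5)   # by feature 0 alone, clearly class "a"

plain    = weighted_knn_predict(train, labels, query, weights=[1, 1], k=1)     # "b"
weighted = weighted_knn_predict(train, labels, query, weights=[1, 0.01], k=1)  # "a"
```

With equal weights the noisy feature pulls the query toward the wrong class; once feature 1 is down-weighted, the prediction flips to the class suggested by the informative feature.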

- "so it can fully learn the law of each sample and realize training with small set." => Perhaps, you can say so it can fully learn and adapt well to the dynamic changes in the features for the small training dataset. Also, *a small set*

- The authors mention that they propose a self-adaptive mechanism that adapts the feature weights; however, Section 3.2 says that the feature weights are retrained based on the new feature set. Retraining the feature weights is not, by itself, a self-adaptive mechanism. The authors should elaborate and explain this point clearly.
- In which cases might δ be null?
- Do you consider a threshold for each feature?
- "In the previous studies, static features such as IP, port number, and TCP flags were used as..." => please add references to the previous studies you refer to.
- What about overfitting?
- On page 5, the paragraph starting with "Next, in lines 6–9, the algorithm calculates..." is quite confusing. The names of the parameters and their descriptions are not clear.
- Why didn't the authors consider using any class-imbalance techniques? The argument here seems to be that the proposed classifier can predict under limited data labels. To that end, why didn't the authors compare the proposed classifier on balanced and imbalanced datasets? Also, on encrypted and decrypted datasets?
- Please provide an example of an encrypted record from the dataset.
- It seems the authors assume a fixed set of features. In other words, what if a new packet has a new feature?
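Regarding the class-imbalance point above, one standard technique the authors could compare against is naive random oversampling, sketched below (illustrative assumption only; this is not from the paper):

```python
import random

def oversample(samples, labels, seed=0):
    """Naive random oversampling: duplicate randomly chosen minority-class
    samples until every class matches the majority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_s, out_y = [], []
    for y, group in by_class.items():
        picks = group + [rng.choice(group) for _ in range(target - len(group))]
        out_s.extend(picks)
        out_y.extend([y] * target)
    return out_s, out_y

# 9 benign flows vs 1 malicious flow: a 9:1 imbalance.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = ["benign"] * 9 + ["malicious"] * 1
Xb, yb = oversample(X, y)   # both classes now have 9 samples
```

Comparing the classifier with and without such balancing would separate the "limited labels" claim from the distinct issue of skewed label distributions.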

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors present an improvement of the classic K-nearest neighbor algorithm in machine learning. In this study, a weighted KNN (WKNN) and a self-adaptive version of the latter, called WKNN-Selfada, are presented. Finally, a fine-grained classification procedure is proposed. Some numerical examples are displayed using the ISCX VPN-nonVPN dataset, adding some fine-grained classes to the dataset.

The article is correctly written but requires thorough English proofreading. The results are well explained, with a good number of scenarios to evaluate the performance of the proposals. Some suggestions and comments are listed as follows:

1.- The introduction and related work are easy to understand, providing enough background and including relevant references.

2.- The references are up to date, generally from 2015 onwards, though there may be too many of them.

3.- Several algorithms for the KNN problem are presented and their performance is evaluated. However, some critical performance measurements are not given, such as the computational complexity of the algorithms. The authors should include the complexity of all the algorithms in the text.

4.- In Section 5.3, several metrics are presented to evaluate the performance of the method. On lines 503 and 510, the F1-Score and OC-F1-Score are defined. However, as written the definitions make no sense: both Precision and Recall cancel, and the F1-Score simplifies to two. The same happens with the OC-F1-Score. A more precise definition of these functions is required.
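For reference, the standard F1-Score the reviewer alludes to is the harmonic mean of precision and recall computed from raw counts; a minimal sketch (illustrative, not the paper's notation):

```python
def f1_score(tp, fp, fn):
    """F1 = 2*P*R / (P + R), with precision P and recall R computed
    from true positives, false positives, and false negatives.
    Written this way, nothing cancels: the ratio does not algebraically
    simplify to 2 unless P and R are treated as identical symbols."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 80 true positives, 20 false positives, 40 false negatives.
# precision = 0.8, recall = 80/120 ≈ 0.667, so F1 = 8/11 ≈ 0.727
score = f1_score(80, 20, 40)
```

A definition in the paper should make explicit that Precision and Recall are distinct quantities derived from the confusion matrix, so the formula cannot be reduced.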

5.- Accuracy is one of the leading performance metrics shown, which makes sense. However, some other performance metrics should also be evaluated; for instance, an execution-time metric must be included. WKNN-Selfada is clearly more complex than the regular and weighted versions, so its execution time should be much higher. How much slower does it produce results?

6.- The English in the text must be significantly improved; it includes errors, typos, and grammatical mistakes.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

This paper proposes a boosting-like approach to adjust the feature weights in a weighted version of the kNN classification algorithm, enabling accurate, "fine-grained" network flow detection and examination for cyber security. Overall, I feel that the work is well-motivated, the idea is more or less reasonable and promising, and the problem of network flow classification is important. However, the paper is ramblingly written, especially the descriptions of the algorithms, which are lengthy and not explained very clearly, even though the underlying idea seems to be straightforward.

It is still not very clear to me how accurate network flow classification is achieved using "a small amount of data", as claimed at the beginning of the paper. The paper does not seem to provide any evidence or evaluation of this claim. Following on from this point, the paper concluded that FCE-KNN is more time-consuming than other algorithms such as decision trees; this seems counterintuitive given the claim that the proposed classification algorithms can do their job using a small amount of data -- if only a small amount of data is used, why would it be time-consuming? I feel that the paper should explain this.

For the dataset used in this work, it is not clear how features like inter-packet delay are obtained. Since this paper is about applying machine learning methods to networking problems, such information is critical. Are these features already included in the dataset or obtained differently? Are they measured at end hosts or at in-network routers? More generally, how were the datasets used in this paper obtained, and what do they look like? Was any data preprocessing performed? The paper should provide a citation to the datasets used, acknowledging the data source.

In Sec. 5.3, some brief descriptions of the algorithms compared against the ones proposed in this work, e.g., DTW-KNN, ADA, AISVM, should be provided. What are the essential approaches used in them? Why does FCE-KNN perform better than them? Finally, there are quite a few grammar/typo issues; a careful proofread would be highly desirable.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

This work proposes a method that can be used for classification models of encrypted data. The proposed approach is based on weighted-feature KNN (WKNN), an improvement of the traditional KNN algorithm. In WKNN, the optimal features can be selected based on a weight self-adaptive algorithm, using a defined threshold that can be adjusted to avoid misclassification problems.

Furthermore, the authors present a three-layer framework to identify encrypted network traffic, based on a fine-grained classifier and the WKNN algorithm. The framework can recognize encrypted traffic flows using the association between three different traffic attributes. The traffic classification is based on the correlation analysis of traffic status (encrypted or not encrypted), application type (YouTube, Facebook, Skype, etc.) and content type (File, VoIP, Chat, etc.).

This work is presented well and the proposed approach is novel, but it is not efficient. Although the authors claim that their proposed model gives high accuracy in comparison with some existing methods, in my opinion this approach suffers from the following limitations:

- Classifying the flow traffic based on three attributes might not be enough to identify malicious flows. Nowadays, attackers' methods and techniques are becoming more sophisticated and more similar to normal traffic. An attacker can easily bypass the detection technique if he successfully identifies the attributes used by the detection method.
- Deploying the proposed framework for real-time traffic classification can be difficult, since the fine-grained classifier needs a large number of attributes to provide high-accuracy results. The rapid growth of Internet traffic means that a huge number of packets needs to be analyzed. As a result, using multi-layer classification can be time-consuming and adds an extra layer of complexity to the classifier model.
- There are several input parameters of the classifier framework that need to be adjusted to obtain high-accuracy results for each layer separately. Further, the feature set for model training changes from one layer to another. Deploying a framework with many hierarchical stages requires combining many features across all traffic-flow attributes, which poses another challenge for implementing this model in real-time classification.
- The proposed model is trained on the ISCX VPN-nonVPN dataset, which contains a variety of application traffic for normal data only, without considering any malicious flows.

 

A minor correction: p. 141: "presented" instead of "present".

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report


This revised paper proposes a weighted feature KNN for classification tasks with limited labelled data availability while attaining high performance. The classifier is also used within a three-layer framework for classifying encrypted network flows and in particular, classifying the encrypted network status, encrypted application type, and encrypted content type.

The paper has been improved. However, there are still some issues:

- The new updated text (in page 2) "if a flow was identified as a flow of Youtube application and file content, it would be considered as a normal flow based on the analysis of the single attribute of the flow, a Youtube flow would not be file flow in most cases, so the flow is probably a malicious flow." => It is not clear why Youtube is a malicious flow in most of the cases. Please add more details.

- In Section 3. Improved of KNN algorithm
"Further, by this section also proposes a feature selection and feature weight.."=> Please correct e.g. "Furthermore, this section also includes our proposed feature selection..."


-"KNN is a supervised machine learning algorithm that recognizes the similarity between two.." =>Please use finds instead of recognizes

- KNN uses a distance measure to find similarities between data points; it is not necessary to use the Minkowski distance. Please clarify this in Section 3.
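For concreteness, the Minkowski family the reviewer refers to includes Manhattan (p = 1) and Euclidean (p = 2) distances as special cases; a minimal sketch (illustrative, not the paper's code):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points:
    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p -> infinity approaches the Chebyshev (max-coordinate) distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski(a, b, 1)   # 7.0  (Manhattan: 3 + 4)
d2 = minkowski(a, b, 2)   # 5.0  (Euclidean: sqrt(9 + 16))
```

The choice of p (or of a non-Minkowski metric altogether) is a design parameter of KNN, which is presumably why the reviewer asks the authors to state and justify the metric used.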

- It is still not clear how the proposed algorithm is self-adaptive. In particular, the authors mention in the response letter and in the paper that once the features have been selected, the algorithm takes the weights of the chosen features and retrains the weights based on the new feature set. So what is the difference between running the algorithm for the first time and running it again? It seems that each time, features are selected, new weights are obtained, and the algorithm is trained on the new weights. It remains unclear how the algorithm is self-adaptive.


- Assuming a single threshold for all features makes the algorithm very limited. In most cases, data points have heterogeneous features, not a single feature or homogeneous features.

- In Section 4.3 (Fine-grained classification method), please add how the model avoids overfitting. I understand that the authors updated the text to explain overfitting itself, but it would be better to explain how the proposed model avoids it.

- In their response to Point 12, the authors mention that the proposed method can deal with imbalanced labels. However, this is not clear in the main text of the paper. In particular, the proposed method should work with limited data labels; however, label imbalance is a different issue: you might have 1000 data points, but 90% of them represent one label.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The revision has addressed most of the potential issues I raised in the first round of review.

 

The authors argued that “training with a small amount of data and high time consumption are not counterintuitive”, gave an analysis of the polynomial running-time complexity, in particular O(n^2), and provided some experimental evaluation results in Table 10 to show the time consumption. Still, several questions:

1) For the training dataset used to produce the numerical results (in unit of seconds) in Table 10, how many records are there?

2) Should an O(n^2) algorithm really be considered time-consuming? What are the complexities of the other algorithms used as a comparison basis?
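For context on point 2: the O(n^2) cost in KNN-style training typically comes from an all-pairs pass over the training set. A small sketch of how that work grows (an illustrative count, not the authors' implementation):

```python
def pairwise_distance_evaluations(n):
    """Number of distance evaluations in an all-pairs pass over a
    training set of n samples: n*(n-1)/2, the usual source of
    O(n^2) cost in KNN-style training or weight-fitting procedures."""
    return n * (n - 1) // 2

counts = [pairwise_distance_evaluations(n) for n in (100, 1000, 10000)]
# [4950, 499500, 49995000] -- a 10x larger training set costs ~100x more work
```

This is why "small training set" and "quadratic cost" are not contradictory: the quadratic term stays cheap precisely because n is small, but scales poorly if the training set grows.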

3) “the algorithm has fully learned the characteristics of each sample” -- what are the “fully-learned characteristics of each sample”? Does this mean all feature values of each sample? Please elaborate on this phrase.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
