SFCWGAN-BiTCN with Sequential Features for Malware Detection
Round 1
Reviewer 1 Report
If my understanding is correct, this study analyzed the semantic logic between API calls and opcode call sequences using Word2Vec and proposed a GAN-based sequence-feature generation algorithm based on it.
The structure of the study is well organized, and it shows that the malware classifier model trained with the generated data achieves high performance on well-known datasets.
I have a few minor questions; if these are addressed, I think the research is sufficient for publication.
1. What is the source of the real samples of the malware files used in the CWGAN structure? If they are based on a Kaggle or DataCon dataset, a detailed explanation of the possible overfitting problems is needed.
2. The resolution of Figure 1 is too low. It is difficult to see the contents of the picture in detail. Improvement will be needed.
3. An explanation of Equation 18 is needed. Perhaps F1-score?
4. Typos written as F1-socre need to be corrected.
5. Only studies from 2020 were used for the performance comparison; I think it is necessary to also compare with studies published in 2021 and 2022.
Author Response
Response to Reviewer 1 Comments
Original Manuscript ID: applsci-2160429
Original Article Title: “SFCWGAN-BiTCN with Sequential Features for Malware Detection”
To: information Editor
Re: Response to reviewers
Dear Editor,
Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments.
We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with red highlighting indicating changes, and (c) a clean updated manuscript without highlights (PDF main document).
Thanks again for reviewers’ meticulous review and valuable suggestions to improve our manuscript. We hope that the revision has addressed all the issues in the old version. If there are still any problems in the revised manuscript, please do not hesitate to point out, and we will do our best to amend the manuscript according to your suggestions. We are looking forward to your positive response.
Best regards,
Bona Xuan et al.
Point 1: What is the source of the real samples of the malware files used in the CWGAN structure? If they are based on a Kaggle or DataCon dataset, a detailed explanation of the possible overfitting problems is needed.
Response 1: Thanks for your very thoughtful suggestion. The real samples of the malware files used in the CWGAN structure come from the Kaggle and DataCon datasets. We apologize for not clearly explaining the overfitting problem in the paper.
According to your suggestion, we have added a discussion of the overfitting problem during the training process in Section 3.6.
Causes of the overfitting problem:
1. The training period is too long.
2. The model is too large.
3. The amount of data is too small.
To avoid overfitting during training, we adopt the following two measures. First, some families have too few samples, which leads to overfitting; we therefore balance the dataset by generating malicious samples with the GAN. Second, because the model is large, dropout is used to prune the BiTCN model and avoid overfitting. The whole dropout process is equivalent to averaging over an ensemble of neural networks: when the network overfits, some of the "reverse" fits cancel each other out, reducing the overall overfitting.
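As an illustration of the dropout mechanism described above (a generic numpy sketch, not the actual BiTCN implementation), inverted dropout zeroes activations at random and rescales the survivors so the expected activation is unchanged:

```python
import numpy as np

def dropout(x, p=0.5, rng=None):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p) so the expected activation is unchanged."""
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(x.shape) >= p  # keep-mask drawn per activation
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
y = dropout(x, p=0.5)
# each entry of y is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

At inference time no units are dropped and, thanks to the rescaling, no further correction is needed.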
Point 2: The resolution of Figure 1 is too low. It is difficult to see the contents of the picture in detail. Improvement will be needed.
Response 2: Thanks for your very thoughtful suggestion. We apologize for not adjusting the resolution and sharpness of the figures. According to your suggestion, we have checked the resolution and clarity of all images in the paper and revised the unclear ones to make the manuscript clearer.
Point 3: An explanation of Equation 18 is needed. Perhaps F1-score.
Response 3: Thanks for your very thoughtful suggestion. We apologize for overlooking this error. According to your suggestion, we have checked all the equations in the paper and corrected Equation 18, which defines the F1-score.
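For reference, the corrected Equation 18 presumably takes the standard F1-score form, with Precision and Recall defined as usual from the true/false positive and negative counts:

```latex
\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```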
Point 4: Typos written as F1-socre need to be corrected.
Response 4: Thanks for your very thoughtful suggestion. We apologize for not catching this typo. According to your suggestion, we have checked the spelling and grammar throughout the paper and corrected the errors, including the "F1-socre" typos.
Point 5: There were only 2020 studies for comparison of performance, but I think it will be necessary to compare with studies published in 2021 and 2022.
Response 5: Thanks for your very thoughtful suggestion. We apologize for not considering this issue. According to your suggestion, we have added and cited studies published in 2021 and 2022 for comparison and completed the comparative experiments and analysis of the results.
Author Response File: Author Response.pdf
Reviewer 2 Report
A selection-feature conditional Wasserstein generative adversarial network (SFCWGAN) and a bidirectional temporal convolutional network (BiTCN) are proposed in this paper. More detailed comments are given as follows:
1- In the introduction, add the structure of the paper to the final paragraph.
2- Redraw Figure 1.
3- In Section 4.2 (Experimental assessment criteria), add the citation.
4- Add the future work at the end of the conclusion section.
5- Discuss the limitations of the proposed method.
6- Recent references, such as from 2022, are very few. There are many works from 2022, so add some of them.
7- In the training phase, is the dataset partitioned randomly or not? I suggest you use K-fold cross-validation. How many folds were used?
8- In the experimental results, the evaluation (train-test round) must be repeated for N rounds. I suggest repeating it for many rounds to ensure that bias is minimized.
Author Response
Response to Reviewer 2 Comments
Point 1: In the introduction, add the structure of the paper to the final paragraph.
Response 1: Thanks for your very thoughtful suggestion. According to it, we have completed the restructuring of the article, including adding the structure of the paper to the final paragraph of the introduction.
Point 2: Redraw Figure 1.
Response 2: Thanks for your very thoughtful suggestion. We apologize for not adjusting the resolution and sharpness of the figures. According to your suggestion, we have checked the resolution and clarity of all images in the paper, including Figure 1, and redrawn the unclear ones to make the manuscript clearer.
Point 3: In Section 4.2 (Experimental assessment criteria), add the citation.
Response 3: Thanks for your very thoughtful suggestion. According to it, we have reviewed Section 4.2; however, we are not sure whether citations are required for these standard assessment criteria.
Point 4: Add the future work at the end of the conclusion section.
Response 4: Thank you for your comments. We took the reviewer's advice and updated the conclusion. After careful consideration, we added the percentage results of the experiments and the future work at the end of the conclusion:
However, the accuracy of this model on the DataCon dataset suggests that there is room for improvement. Future work will focus on this deficiency, and we will investigate the construction of feature extraction and classification models to find ways to improve detection accuracy.
Point 5: Discuss the limitations of the proposed method.
Response 5: Thank you for your comments. After careful consideration, we added a discussion of the limitations at the end of the conclusion:
However, the accuracy of this model on the DataCon dataset suggests that there is room for improvement. Future work will focus on this deficiency, and we will investigate the construction of feature extraction and classification models to find ways to improve detection accuracy.
Point 6: Recent references, such as from 2022, are very few. There are many works from 2022, so add some of them.
Response 6: Thanks for your very thoughtful suggestion. We apologize for not considering this issue. According to your suggestion, we have added and cited studies published in 2021 and 2022 for comparison and completed the comparative experiments and analysis of the results.
Point 7: In the training phase, is the dataset partitioned randomly or not? I suggest you use K-fold cross-validation. How many folds were used?
Response 7: Thanks for your valuable advice. We have adopted the reviewer's suggestion and added clarifications on the partitioning of the training and test sets in Section 4.3.1. The added content is as follows:
In order to fully evaluate our method, experiments were conducted on the Kaggle and DataCon datasets to validate the model under different settings. On the basis of the dataset expanded with SFCWGAN-generated features, the data were randomly divided into 10 parts using cross-validation, and 8 of them were selected as the training set and 2 as the test set.
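The random 8-of-10 partitioning described above can be sketched as follows (a minimal illustration with hypothetical helper names, not the authors' actual partitioning code):

```python
import random

def split_folds(indices, n_parts=10, n_train=8, seed=42):
    """Shuffle the sample indices, divide them into n_parts roughly equal
    folds, and use the first n_train folds as the training set and the
    remaining folds as the test set (the 8-of-10 split described above)."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_parts] for i in range(n_parts)]
    train = [i for fold in folds[:n_train] for i in fold]
    test = [i for fold in folds[n_train:] for i in fold]
    return train, test

train, test = split_folds(range(100))
# 80 training indices and 20 test indices, with no overlap
```

Rotating which folds serve as the test set across repetitions would turn this into a standard cross-validation loop.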
Point 8: In the experimental results, the evaluation (train-test round) must be repeated for N rounds. I suggest repeating it for many rounds to ensure that bias is minimized.
Response 8: Thanks for your valuable advice. We have adopted the reviewer's suggestion. However, given the short revision time, we will provide an additional explanation of the error-minimization problem after completing the experimental part.
Author Response File: Author Response.pdf
Reviewer 3 Report
The paper proposes novel methods to deal with imbalanced malware datasets in an attempt to improve the performance of the classifiers. The work seems interesting and attractive. However, the following queries have to be addressed:
1. The paper mainly focuses on feature selection and pre-processing to improve the performance. However, the paper lacks discussion about the features of the data sets used.
2. The details of features of the datasets (Kaggle and DataCon) used must be clearly discussed. (Tabular form)
3. The number of samples and number of features of Kaggle dataset used is not mentioned. URL of the dataset used must be provided.
4. Number of features and description not given for DataCon data set. (Table)
5. Check Equation 18. It is incomplete.
6. The purpose and motivation of Table 1 is not clear. WOA, GWO and BO are not explained. Are they used for feature selection? If so, what are the features selected? Why are you comparing these three? Citations are not mentioned in the Table?
7. The algorithms in Tables 3, 4, 5, and 6 do not have citations. The authors claim that "BSO, ROS, ADASYN, SMOTE, CWGAN and SFCWGAN methods will be tested as classifiers for feature enhancement", but the results do not discuss the features at all. Only accuracy, without any details about the features?
8. Citations to be included in Table 7, 8 algorithms.
9. The abstract must be revised including the details about features.
10. The conclusion must be revised and enhanced.
11. All abbreviations must be provided in their full form at their first appearance.
12. Literature review can be improved. With closing remarks and motivation.
Author Response
Response to Reviewer 3 Comments
Point 1: The paper mainly focuses on feature selection and pre-processing to improve the performance. However, the paper lacks discussion about the features of the data sets used.
Response 1: Thanks for your valuable advice. We have incorporated the reviewer's suggestions and added a description of the dataset features in Section 4.1. The updated content is as follows:
The feature preprocessing extracts and encodes the API calls and opcodes of the Kaggle and DataCon datasets. The API and opcode information is extracted from the .asm disassembly files of the Kaggle dataset, while the DataCon dataset is disassembled into .asm files using the IDA Pro tool, from which the API and opcode information is obtained. Based on the virtual addresses, the whole sequence of API calls and opcodes is generated and populated with the opcode and API vector matrices corresponding to those addresses. Finally, the sequence features are generated by encoding the vector matrix with Word2Vec. The features are preprocessed as follows:
Figure 4. Feature pre-processing: (a) opcode feature preprocessing; (b) API feature preprocessing.
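To make the preprocessing concrete, the following toy sketch recovers tokens from a disassembly listing and orders them by virtual address before Word2Vec encoding; the addresses and tokens are hypothetical, not taken from the actual datasets:

```python
# Toy disassembly lines: (virtual_address, token) pairs in arbitrary order.
lines = [
    (0x401010, "push"),
    (0x401005, "mov"),
    (0x401015, "call CreateFileA"),
    (0x401000, "sub"),
]

def build_sequence(lines):
    """Sort tokens by virtual address, then separate opcodes from API
    calls so each can be embedded as its own Word2Vec sequence."""
    seq = [tok for _, tok in sorted(lines)]  # address-ordered sequence
    apis = [t.split()[1] for t in seq if t.startswith("call ")]
    opcodes = [t for t in seq if not t.startswith("call ")]
    return seq, opcodes, apis

seq, opcodes, apis = build_sequence(lines)
# opcodes: ['sub', 'mov', 'push']; apis: ['CreateFileA']
```

A Word2Vec model would then be trained on such opcode and API sequences to produce the vector matrices mentioned above.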
Point 2: The details of features of the datasets (Kaggle and DataCon) used must be clearly discussed. (Tabular form).
Response 2: Thanks for your valuable advice. We have incorporated the reviewers' suggestions and added a description of the dataset features in Section 4.1. I added the details of features of the datasets in more detail, and the updated content is as follows:
In Section 4.1.1:
Each sample file has two formats: .asm and .bytes. Each malware file has an identifier (a 20-character hash that uniquely identifies the file) and a class label (an integer indicating one of the nine families to which the malware belongs). In addition, the dataset includes not only the raw data but also logs of various information extracted from the binary files (e.g., function calls, APIs, and parameters).
In Section 4.1.2:
Statistical analysis shows that the dataset contains samples packed with a variety of compression or encryption shells, including UPX, PEPack, ASPack, PECompact, Themida, VMP, and Mpress, as well as disguised shells.
Point 3: The number of samples and number of features of Kaggle dataset used is not mentioned. URL of the dataset used must be provided.
Response 3: Thanks for your valuable advice. We have incorporated the reviewers' suggestions and added a description of the dataset features in Section 4.1. I added the content of datasets more detail, and the updated content is as follows:
Table 2. Quantity Distribution of Sample Kaggle DataSet.

| Family Name | Samples | Type |
|---|---|---|
| Ramnit | 1541 | Worm |
| Lollipop | 2478 | Adware |
| Kelihos_ver3 | 2942 | Backdoor |
| Vundo | 475 | Trojan |
| Simda | 42 | Backdoor |
| Tracur | 751 | TrojanDownloader |
| Kelihos_ver1 | 398 | Backdoor |
| Obfuscator.ACY | 1228 | Any kind of obfuscated malware |
| Gatak | 1013 | Backdoor |
We have also added: Available online: https://www.kaggle.com/c/malware-classification/data.
Point 4: Number of features and description not given for DataCon data set. (Table).
Response 4: Thanks for your valuable advice. We have incorporated the reviewers' suggestions and added a description of the dataset features in Section 4.1. I added the content of datasets in more detail, and the updated content is as follows:
In Section 4.1.2:
Statistical analysis shows that the dataset contains samples packed with a variety of compression or encryption shells, including UPX, PEPack, ASPack, PECompact, Themida, VMP, and Mpress, as well as disguised shells.
Table 3. Quantity Distribution of Sample DataCon DataSet.

| Family Name | Samples | Type |
|---|---|---|
| White | 15759 | No_Miner |
| Black | 7896 | Miner |
Point 5: Check Equation 18. It is incomplete.
Response 5: Thanks for your very thoughtful suggestion. We apologize for overlooking this error. According to your suggestion, we have checked all the equations in the paper and completed Equation 18 to make the manuscript accurate.
Point 6: The purpose and motivation of Table 1 is not clear. WOA, GWO and BO are not explained. Are they used for feature selection? If so, what are the features selected? Why are you comparing these three? Citations are not mentioned in the Table?
Response 6: Thank you for your comments. The reviewer's comments made us realize that there were some problems in the article, so we carefully considered them and made additions and changes. WOA, GWO, and BO have been explained in Section 4.3.1. They are feature selection methods: the importance of the features is analyzed, the most important features are retained, and redundant features are filtered out. Their performance is evaluated using XGBoost.
The algorithm analyzes the features extracted from the training-set samples of the dataset, filters out redundant ones, keeps the important ones, and simplifies the sample structure. The 1632 extracted opcodes and 265 APIs were combined and deduplicated to rank their importance, and the top 73 opcodes and 73 APIs were retained to better analyze the influence of opcodes and APIs on the model.
In this paper, following Ref. [35], three optimization algorithms are compared: WOA-XGBoost, grey wolf optimizer (GWO)-XGBoost, and Bayesian optimization (BO)-XGBoost, to determine the importance ranking of the features and to analyze the detection accuracy of these models, as shown in Table 1.
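The retain-the-top-k step described above can be sketched in a few lines; the feature names and importance values below are purely illustrative, standing in for scores such as an XGBoost model's feature importances:

```python
# Hypothetical importance scores for combined opcode/API features.
importance = {"push": 0.31, "CreateFileA": 0.27, "mov": 0.22,
              "xor": 0.12, "nop": 0.08}

def top_k_features(importance, k):
    """Rank features by importance (descending) and keep the top k,
    discarding the redundant low-importance features."""
    return sorted(importance, key=importance.get, reverse=True)[:k]

selected = top_k_features(importance, k=3)
# ['push', 'CreateFileA', 'mov']
```

In the paper's setting, k would be 73 for the opcode features and 73 for the API features.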
Point 7: The algorithms of Table 3 and Table 4, 5, and 6 do not have citations? The authors claim "BSO, ROS, ADASYN, SMOTE, CWGAN and SFCWGAN methods will be tested as classifiers for feature enhancement", but the results does not talk anything about the features. Only accuracy without any details about the features?
Response 7: Thank you for your comments. Due to a translation error, "sample balance" was mistranslated as "feature enhancement". This part is an experiment on the sample-imbalance problem, comparing several common sampling methods. After careful consideration, we checked the text and added citations and details about the resampling methods to make the manuscript accurate.
In this section, to address the sample-imbalance problem, oversampling methods and a GAN are used to obtain a balanced distribution of the data. We compare the BSO [36], ROS [37], ADASYN [38], SMOTE [7], CWGAN [39], and SFCWGAN methods for sample balance on the Kaggle and DataCon datasets under the same experimental conditions; the results are shown in Tables 3 and 4. These comparisons of different sample-balance algorithms, applied to the semantic call relationship between APIs and opcodes, verify the advantages of the SFCWGAN sample-balance algorithm for malware detection.
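As a concrete example of the simplest baseline in that comparison, random oversampling (ROS) duplicates minority-class samples until all classes match the majority count; this is a generic sketch, not the SFCWGAN generator or the cited ROS implementation:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Random oversampling (ROS): duplicate minority-class samples at
    random until every class reaches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):  # add duplicates until balanced
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

bal_x, bal_y = random_oversample(["a", "b", "c", "d"], [0, 0, 0, 1])
# both classes now have 3 samples each
```

Unlike ROS, the GAN-based methods (CWGAN, SFCWGAN) synthesize new minority-class samples rather than duplicating existing ones, which is what motivates the comparison.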
Point 8: Citations to be included for the algorithms in Tables 7 and 8.
Response 8: Thanks for your very thoughtful suggestion. We apologize for not considering this. According to your suggestion, we have added citations to the published studies for these algorithms.
Point 9: The abstract must be revised including the details about features.
Response 9: Thank you for your comments. We took the reviewer's advice and updated the abstract. After careful consideration, the abstract was revised and improved to add details about the features. The content is as follows:
First, we extract the features of the malware opcode and API sequences and use Word2Vec to represent them, emphasizing the semantic logic between the API calling and opcode calling sequences.
Point 10: The conclusion must be revised and enhanced.
Response 10: Thank you for your comments. We took the reviewer's advice and updated the conclusion. After careful consideration, the conclusion section was revised and enhanced. The content is as follows.
However, on the one hand, the accuracy and bias of this model on the DataCon dataset suggest that there is room for improvement. Future work will focus on this deficiency, and we will investigate the construction of feature extraction and classification models to find ways to improve detection accuracy. On the other hand, malware has become one of the most serious security threats to the IoT. The classification of malware variants in the IoT needs to be explored further, given the series of problems caused by the propagation of malicious code from traditional networks to the IoT and within the IoT.
Point 11: All abbreviations must be provided in their full form at their first appearance.
Response 11: Thanks for your very thoughtful suggestion. We apologize for overlooking this. According to your suggestion, we have checked all abbreviations in the paper and provided their full forms at first appearance.
Point 12: Literature review can be improved. With closing remarks and motivation.
Response 12: Thank you for your comments. We took the reviewer's advice and updated the literature review. After careful consideration, the literature review section was revised and improved with closing remarks and motivation. The content is as follows.
In summary, most of the above deep learning-based approaches target sequence features for classification and detection. Traditional methods show limitations with respect to the class imbalance among malware families, and the impact of interference and obfuscation techniques on existing malware classification models is not considered.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
The authors have addressed the concerns raised. Hope now it would be more clear for the readers.
I appreciate the authors for updating the paper.
I would suggest to include the following article from MDPI electronics journal as a reference, because it is closely related.
Swarm Optimization and Machine Learning Applied to PE Malware Detection towards Cyber Threat Intelligence, Electronics, 2023.