Dual Convolutional Malware Network (DCMN): An Image-Based Malware Classification Using Dual Convolutional Neural Networks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes a DCNN-based architecture for malware classification applicable to grayscale images. The article utilizes the Malimg dataset for experimentation, claiming a higher accuracy. Here are some suggestions for improvement:
1. Section 2. Background on Malware Visualization is comparatively short and should be extended to add more details on the variety of visualization techniques and why the grayscale technique is chosen over colored images. It should list its advantages and drawbacks and explain how malware detection using colored images differs from this approach.
2. Figure 9 looks stretched and blurry. It would be great if the image pixels/quality could be enhanced. This will make the text clearer and easier to read.
3. Table 2. A comparison of the proposed model with…. is not referred to in the text. It should be cited in the text, and a textual explanation of the columns should be added.
4. Finally, the paper utilizes the Malimg dataset, which comprises 9,339 grayscale images representing 25 distinct malware families. This is a relatively old dataset, and there are newer, similar/extensive datasets available. There is a need to add details to justify the reason for choosing this dataset over newer ones.
Author Response
- Section 2. Background on Malware Visualization is comparatively short and should be extended to add more details on the variety of visualization techniques and why the grayscale technique is chosen over colored images. It should list its advantages and drawbacks and explain how malware detection using colored images differs from this approach.
Thank you for highlighting this issue. We agree with your observation. Consequently, we have expanded Section 2, "Background on Malware Visualization," to include more details about the diverse visualization techniques available and the advantages and disadvantages of the grayscale technique (refer to lines 96–124).
Fu et al. [12] proposed a novel method for generating RGB-colored images from malware binary files. Instead of directly converting grayscale images to RGB, they focused on populating the red, green, and blue channels with more informative data.
The core steps of their method involve section division and feature computation. Initially, malicious code was filtered to ensure compliance with the PE format and the preservation of its original structure. Figure 3 illustrates the PE file structure and the necessary field information for filtering malware.
Subsequently, the malware was divided into multiple sections based on the PE format, and these sections were characterized using entropy, byte value, and relative size. The red, green, and blue channels of each pixel were then populated with these values, resulting in a combined RGB-colored image. Figure 4 provides a visual representation of this detailed process. For further information on this method, please refer to reference [12].
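The per-section features named above (entropy, byte value, relative size) can be illustrated with a minimal sketch. The channel assignment (red = entropy, green = mean byte value, blue = relative size), the scaling to 0–255, and the function names are assumptions made for illustration; reference [12] should be consulted for the actual procedure.

```python
import math
from collections import Counter
from typing import Tuple


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def section_to_rgb(section: bytes, file_size: int) -> Tuple[int, int, int]:
    """Map one section to a hypothetical (R, G, B) triple.

    Assumed mapping, not taken from reference [12]:
      R = entropy scaled from [0, 8] to [0, 255]
      G = mean byte value (already in [0, 255])
      B = section size relative to the whole file, scaled to [0, 255]
    """
    red = round(shannon_entropy(section) / 8.0 * 255)
    green = round(sum(section) / len(section)) if section else 0
    blue = round(len(section) / file_size * 255) if file_size else 0
    # Clamp defensively so every channel stays a valid 8-bit value.
    return tuple(max(0, min(255, v)) for v in (red, green, blue))
```

A section consisting only of zero bytes yields zero entropy and a zero mean byte value, so only the relative-size channel is non-zero.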
Grayscale malware images offer several advantages. First, they require less storage space and processing power compared to RGB images, leading to faster training and inference times. Second, the single channel in grayscale images simplifies feature extraction, potentially improving the performance of certain machine learning algorithms. Finally, grayscale images are not affected by color variations, which can be beneficial for malware classification tasks where color differences are irrelevant.
While grayscale images have their advantages, they also have some drawbacks. One significant drawback is the loss of color information. Color patterns can sometimes be relevant for malware classification, and converting to grayscale discards this information. Additionally, grayscale images may require more sophisticated feature extraction techniques to capture relevant information, especially if color-related features are important for distinguishing malware families. This could potentially limit the discriminative power of the model.
In conclusion, the choice between grayscale and RGB malware images depends on the specific characteristics of the malware dataset and the classification task at hand. If color information is not crucial and computational efficiency is a priority, grayscale images may be a suitable choice. However, if color-related features are important for distinguishing malware families, RGB images may provide better performance.
- Figure 9 looks stretched and blurry. It would be great if the image pixels/quality could be enhanced. This will make the text clearer and easier to read.
Agreed. We have accordingly replaced Figure 9 with a clearer version and included a more detailed discussion of the misclassifications (refer to lines 404–408).
As shown in Figure 9, the confusion matrix revealed a low number of misclassified instances, with off-diagonal elements representing errors. This aligns with the model’s overall high accuracy of 99.89%. Despite this accuracy, the model exhibited some misclassifications among certain malware families. For instance, a malware image belonging to the Fakerean family was erroneously classified as belonging to the Wintrim.BX family, suggesting potential similarities or overlaps in their behavioral patterns. Additionally, the model struggled to distinguish between specific malware families, such as Lolyda.AA1 and Lolyda.AA2.
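The reading of off-diagonal elements as errors can be made concrete with a short sketch. It assumes rows are true classes and columns are predicted classes; the 3×3 matrix and the labels "A", "B", "C" are toy values for illustration only and are not taken from Figure 9.

```python
def misclassified_pairs(matrix, labels):
    """List (true_label, predicted_label, count) for every non-zero
    off-diagonal cell, most frequent confusion first."""
    pairs = []
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((labels[i], labels[j], count))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


def accuracy(matrix):
    """Overall accuracy: diagonal sum divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Toy example (made-up numbers, not the paper's results).
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [10, 0, 0],
    [0, 8, 2],
    [1, 0, 9],
]
toy_errors = misclassified_pairs(toy_matrix, toy_labels)
toy_accuracy = accuracy(toy_matrix)
```

Sorting the confused pairs by count is one simple way to surface which families a model mixes up most often.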
- Table 2. A comparison of the proposed model with…. is not referred to in the text. It should be cited in the text, and a textual explanation of the columns should be added.
Thank you for pointing this out. We agree with this comment. Therefore, we have added an ablation study, which is presented in Table 2 (refer to lines 391–401).
To assess the contribution of each component within the proposed DCMN model, we conducted ablation studies by evaluating the custom CNN (regular CNN) and the pre-trained ResNet-50 separately. The results demonstrated that the pre-trained ResNet-50 significantly outperformed the custom CNN. When using the dataset without the Benign class, the training, validation, and testing accuracies of the custom CNN were 85.25%, 74.93%, and 85.30%, respectively, whereas the pre-trained ResNet-50 achieved accuracies of 99.97%, 96.70%, and 98.12%, respectively (see Table 2).
Upon adding the Benign class, the results exhibited minimal change from the previous experiment. The training, validation, and testing accuracies of the custom CNN were 85.67%, 80.62%, and 87.28%, respectively, while the pre-trained ResNet-50 achieved accuracies of 99.84%, 97.44%, and 97.89%, respectively (see Table 3).
- Finally, the paper utilizes the Malimg dataset, which comprises 9,339 grayscale images representing 25 distinct malware families. This is a relatively old dataset, and there are newer, similar/extensive datasets available. There is a need to add details to justify the reason for choosing this dataset over newer ones.
Thank you for highlighting this issue. We have accordingly added details that justify our rationale for selecting the Malimg dataset (refer to lines 331–336).
Given the widespread popularity of the Malimg dataset, it has been adopted in numerous existing and emerging research endeavors. This widespread usage facilitates the comparison of the proposed model’s results with those of various previous and contemporary models. Furthermore, the dataset’s classification as imbalanced presents an opportunity to investigate the impact of training our proposed model on an imbalanced dataset, both before and after applying techniques to balance the dataset.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper presents an interesting approach by combining a custom CNN with a pre-trained ResNet-50 model for malware classification. However, while the DCNN architecture shows promising results, the novelty of the approach is somewhat limited, given that the use of dual networks and transfer learning in malware classification is not entirely new. The paper could benefit from a deeper exploration of how its approach significantly advances beyond existing methods.
The methodology is well-structured and rigorous, but some specific improvements could be considered:
Control Experiments: Conducting ablation studies to isolate the effects of different components in the dual CNN architecture (e.g., comparing the custom CNN and ResNet-50 contributions) would strengthen the findings.
Hyperparameter Tuning: More detailed exploration of hyperparameter tuning, particularly in the custom CNN, could potentially yield even better performance.
The reliance on the Malimg dataset, while useful, raises concerns about the generalizability of the results. The authors should explain why this dataset was chosen.
Figures 7 and 9 need improvement. Figure 9 (confusion matrix) could be accompanied by a more detailed discussion of the misclassifications.
With minor enhancements, the work could be accepted and could serve as a foundation for future research in the area.
Author Response
Control Experiments: Conducting ablation studies to isolate the effects of different components in the dual CNN architecture (e.g., comparing the custom CNN and ResNet-50 contributions) would strengthen the findings.
Thank you for pointing this out. We agree with this comment. Therefore, we have added an ablation study, which is presented in Table 2 (refer to lines 391–401).
To assess the contribution of each component within the proposed DCMN model, we conducted ablation studies by evaluating the custom CNN (regular CNN) and the pre-trained ResNet-50 separately. The results demonstrated that the pre-trained ResNet-50 significantly outperformed the custom CNN. When using the dataset without the Benign class, the training, validation, and testing accuracies of the custom CNN were 85.25%, 74.93%, and 85.30%, respectively, whereas the pre-trained ResNet-50 achieved accuracies of 99.97%, 96.70%, and 98.12%, respectively (see Table 2).
Upon adding the Benign class, the results exhibited minimal change from the previous experiment. The training, validation, and testing accuracies of the custom CNN were 85.67%, 80.62%, and 87.28%, respectively, while the pre-trained ResNet-50 achieved accuracies of 99.84%, 97.44%, and 97.89%, respectively (see Table 3).
Hyperparameter Tuning: More detailed exploration of hyperparameter tuning, particularly in the custom CNN, could potentially yield even better performance.
Thank you for highlighting this issue. We agree with your observation. Consequently, we have updated Section 5.3, "Experimental Setup," to include more details about hyperparameter tuning (refer to lines 358-367).
The proposed model was trained for seven epochs each, both with and without the benign class, using a batch size of 28. Experiments were conducted using TensorFlow and Keras libraries in Python, running on a Google Colab T4 GPU. The model was compiled with the Adam optimizer, employing categorical cross-entropy as the loss function. Accuracy was selected as the metric for evaluating performance. The model was trained with a learning rate of 3 × 10⁻⁵. To dynamically adjust the learning rate during training, the ReduceLROnPlateau callback was utilized. This callback reduces the learning rate by a factor of 0.2 if the validation loss plateaus, with a minimum learning rate of 1 × 10⁻⁶. The custom CNN consisted of four layers, as illustrated in Table 1. The ReLU activation function and max pooling were incorporated into the custom CNN to enhance its performance.
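The learning-rate rule described above can be traced with a small pure-Python sketch. It mimics only the core behaviour of a reduce-on-plateau schedule (multiply by 0.2 when the validation loss stops improving, never go below 1 × 10⁻⁶). The patience value is not stated in the response and is an assumption here, and the real Keras callback has further options (such as a minimum improvement threshold and a cooldown) that this sketch omits. The validation losses are made-up inputs.

```python
def reduce_on_plateau(val_losses, lr=3e-5, factor=0.2, min_lr=1e-6, patience=1):
    """Return the learning rate in effect after each epoch.

    Simplified rule: if the validation loss has not improved for
    `patience` consecutive epochs, multiply the rate by `factor`,
    clamped at `min_lr`. `patience=1` is an assumed value.
    """
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses that stop improving after epoch 2.
schedule = reduce_on_plateau([1.0, 0.9, 0.9, 0.9, 0.9])
```

Starting from 3 × 10⁻⁵, two reductions give 6 × 10⁻⁶ and 1.2 × 10⁻⁶; a third would give 2.4 × 10⁻⁷, so the floor of 1 × 10⁻⁶ takes over.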
The reliance on the Malimg dataset, while useful, raises concerns about the generalizability of the results. The authors should explain why this dataset was chosen.
Thank you for highlighting this issue. We have accordingly added details that justify our rationale for selecting the Malimg dataset (refer to lines 331–336).
Given the widespread popularity of the Malimg dataset, it has been adopted in numerous existing and emerging research endeavors. This widespread usage facilitates the comparison of the proposed model’s results with those of various previous and contemporary models. Furthermore, the dataset’s classification as imbalanced presents an opportunity to investigate the impact of training our proposed model on an imbalanced dataset, both before and after applying techniques to balance the dataset.
Figures 7 and 9 need improvement. Figure 9 (confusion matrix) could be accompanied by a more detailed discussion of the misclassifications.
Agreed. We have accordingly replaced Figures 7 and 9 with clearer versions and included a more detailed discussion of the misclassifications (refer to lines 404–408).
As shown in Figure 9, the confusion matrix revealed a low number of misclassified instances, with off-diagonal elements representing errors. This aligns with the model’s overall high accuracy of 99.89%. Despite this accuracy, the model exhibited some misclassifications among certain malware families. For instance, a malware image belonging to the Fakerean family was erroneously classified as belonging to the Wintrim.BX family, suggesting potential similarities or overlaps in their behavioral patterns. Additionally, the model struggled to distinguish between specific malware families, such as Lolyda.AA1 and Lolyda.AA2.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript is well organized and relatively well written, even though some parts require revision: some claims are contradictory and/or not sufficiently supported by data or suitable citations. Furthermore, some important parts of the proposed method are not described in sufficient detail.
The following is a list of issues (all of which are also required changes):
- In the Introduction, the authors claim that "These features remain robust even with the application of obfuscation, packing, or reassembly techniques. This is because the underlying texture will simply appear at a different location in the image, regardless of the obfuscation method. This method offers superior performance for classifying malware and is resilient to obfuscation and other modification.". This quite strong claim would require equally strong support, by means of proofs and/or citations. The authors provide only one citation, which is to a preprint (i.e., with no peer review).
- In the Introduction, the authors claim that deep-learning models have limitations in real-world security applications and that they struggle with zero-day attacks. Most of the current works on attack detection that I have read in recent years use deep-learning techniques. Again, this claim is quite strong and requires strong support (or should be removed). Furthermore, the authors mention as a drawback that DL models require hyperparameter tuning: they should explain why and how this is a real drawback, as tuning is part of the standard process.
- The authors, after having said that DL models are, in their opinion, not suitable for detecting attacks, right in the next paragraph propose their method, based on DL (convolutional neural networks are, to my knowledge, considered to be deep learning), which is certainly contradictory.
- In Section 2, it is mentioned that "The remaining area appears black, suggesting zero padding used to fill out the section.": the authors should explain where this padding is coming from (is it added while creating images?), since they are dealing with binary code.
- Section 2 is rather short and should probably be included in another section (e.g., the current Section 3).
- All sections should have an introductory text before subsections start (e.g., Section 4.1 starts right after the title of Section 4: a text introducing the contents of Section 4 should be present instead).
- The authors mention multiple times that "The custom branch captures structural details", but they never show it nor discuss why this happens.
- "The custom CNN branch consists of 4 layers and has approximately 34 MB parameters." --> It is not clear what 34MB parameters are, maybe the authors meant 34 million parameters?
- It is not clear why the authors decided to break the flow of the (rather short) explanation of their technique to provide, in Section 4.3.1-4.3.5, some background knowledge.
- "Tables (2,3)" --> "Tables 2 and 3".
- The gain in detection performance of the proposed DCMN w.r.t. the sole pre-trained ResNet-50 looks rather modest: the contribution of the "custom branch" should be discussed, also considering the additional computational resources required by the custom branch (i.e., is it worth having, considering the added resources and the gain in detection performance?).
- In the Conclusions the authors state that "Additionally, developing techniques to classify encrypted malware files without compromising their integrity could be a promising research direction, despite the inherent challenges posed by strong encryption": this looks pretty unfeasible to me without decrypting the code.
Comments on the Quality of English Language
The manuscript would benefit from a careful proofread.
Author Response
- In the Introduction, the authors claim that "These features remain robust even with the application of obfuscation, packing, or reassembly techniques. This is because the underlying texture will simply appear at a different location in the image, regardless of the obfuscation method. This method offers superior performance for classifying malware and is resilient to obfuscation and other modification.". This quite strong claim would require equally strong support, by means of proofs and/or citations. The authors provide only one citation, which is to a preprint (i.e., with no peer review).
Answer:
Thank you for bringing this issue to our attention. We concur with your observation. As a result, we have added another citation (reference 4). The authors in this reference assert that:
“Texture features could be extracted from the malware image and could be employed for training the classifier. This will not fail since, due to obfuscation or packing or reassembling, the texture will occur at a different position in the malware image. So the features can act as the telltale clue to detect the occurrence of the texture and thus help in finding the malware, their invariants, and any of their modified versions. This technique aids better performance for classifying malware and is resilient to obfuscation and other modification techniques.”
The authors provided supporting evidence for their claim in their paper.
- In the Introduction, the authors claim that deep-learning models have limitations in real-world security applications and that they struggle with zero-day attacks. Most of the current works on attack detection that I have read in recent years use deep-learning techniques. Again, this claim is quite strong and requires strong support (or should be removed). Furthermore, the authors mention as a drawback that DL models require hyperparameter tuning: they should explain why and how this is a real drawback, as tuning is part of the standard process.
Answer:
Thank you for highlighting this issue. We agree with your observation. As a result, we have removed this claim and revised the paragraph to read as follows (refer to lines 54–65, page 2):
Based on the above converted malware images, various machine learning and deep learning techniques have been actively explored for developing intelligent malware detection and classification systems [8–11]. To enhance these systems, we propose adopting a dual CNN approach with transfer learning. This approach offers several advantages over existing methods, as it utilizes two separate CNN branches for extracting complementary features. First, each branch focuses on learning a distinct set of features from the input data. These features are complementary, meaning they capture different but important aspects of the data. This helps in separately extracting both the high-level structures (coarse structures and shapes) and the fine-grained details (detailed textures and patterns). Then, these two sets of features are concatenated for further learning and multi-class classification using nonlinear activation functions. Our experiments showed improvements in detection accuracy, precision, and computational efficiency compared to using a single CNN architecture.
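The fusion step described in this paragraph (two feature sets concatenated, then classified) can be sketched in a few lines of plain Python. This is only an illustration of concatenation followed by a single linear layer and softmax; the feature vectors, weights, and function names are hypothetical, and the actual DCMN head may contain additional layers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(custom_features, resnet_features, weights, biases):
    """Concatenate the two branch outputs and apply one linear layer
    plus softmax. `weights` has one row per class, and each row has as
    many entries as the fused feature vector."""
    fused = list(custom_features) + list(resnet_features)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical two-class example with two features per branch.
probabilities = fuse_and_classify(
    custom_features=[1.0, 0.0],
    resnet_features=[0.0, 2.0],
    weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    biases=[0.0, 0.0],
)
```

Because the classifier sees the concatenated vector, it can weight evidence from either branch independently, which is the sense in which the two feature sets are complementary.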
- The authors, after having said that DL models are, in their opinion, not suitable for detecting attacks, right in the next paragraph propose their method, based on DL (convolutional neural networks are, to my knowledge, considered to be deep learning), which is certainly contradictory.
Answer:
Agreed. We have therefore removed our first claim and revised the paragraph as previously mentioned.
- In Section 2, it is mentioned that " The remaining area appears black, suggesting zero padding used to fill out the section.": the authors should explain where this padding is coming from (is it added while creating images?), since they are dealing with binary code.
Answer:
Thank you for pointing this out. We agree with this comment. Therefore, we have added the following sentence (refer to lines 86–87, page 3):
This padding is added during the image creation process.
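A minimal sketch of how such padding can arise, assuming (as the response states) that it is introduced when the byte stream is reshaped into fixed-width rows: the last row is filled with zero bytes, which render as black pixels. The function name and the width used in the example are hypothetical.

```python
def bytes_to_grayscale_rows(data: bytes, width: int):
    """Reshape a byte string into rows of `width` pixels (values 0-255).

    If the length is not a multiple of `width`, the final row is
    padded with zeros, which appear black in a grayscale image.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Five bytes at width 4 produce two rows; the second is zero-padded.
rows = bytes_to_grayscale_rows(b"\x01\x02\x03\x04\x05", 4)
```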
- Section 2 is rather short and should probably be included in another section (e.g., the current Section 3).
Answer:
Thank you for highlighting this issue. We agree with your observation. Consequently, we have expanded Section 2, "Background on Malware Visualization," to include more details about the diverse visualization techniques available and the advantages and disadvantages of the grayscale technique (refer to lines 90–118, page 3).
Fu et al. [12] proposed a novel method for generating RGB-colored images from malware binary files. Instead of directly converting grayscale images to RGB, they focused on populating the red, green, and blue channels with more informative data.
The core steps of their method involve section division and feature computation. Initially, malicious code was filtered to ensure compliance with the PE format and the preservation of its original structure. Figure 3 illustrates the PE file structure and the necessary field information for filtering malware.
Subsequently, the malware was divided into multiple sections based on the PE format, and these sections were characterized using entropy, byte value, and relative size. The red, green, and blue channels of each pixel were then populated with these values, resulting in a combined RGB-colored image. Figure 4 provides a visual representation of this detailed process. For further information on this method, please refer to reference [12].
Grayscale malware images offer several advantages. First, they require less storage space and processing power compared to RGB images, leading to faster training and inference times. Second, the single channel in grayscale images simplifies feature extraction, potentially improving the performance of certain machine learning algorithms. Finally, grayscale images are not affected by color variations, which can be beneficial for malware classification tasks where color differences are irrelevant.
While grayscale images have their advantages, they also have some drawbacks. One significant drawback is the loss of color information. Color patterns can sometimes be relevant for malware classification, and converting to grayscale discards this information. Additionally, grayscale images may require more sophisticated feature extraction techniques to capture relevant information, especially if color-related features are important for distinguishing malware families. This could potentially limit the discriminative power of the model.
In conclusion, the choice between grayscale and RGB malware images depends on the specific characteristics of the malware dataset and the classification task at hand. If color information is not crucial and computational efficiency is a priority, grayscale images may be a suitable choice. However, if color-related features are important for distinguishing malware families, RGB images may provide better performance.
- All sections should have an introductory text before subsections start (e.g., Section 4.1 starts right after the title of Section 4: a text introducing the contents of Section 4 should be present instead).
Answer:
Thank you for highlighting this issue. We agree with your observation. Consequently, we have added the following introductory text to Section 4 (refer to lines 219–222, page 7):
This section outlines the proposed Dual Convolutional Malware Network (DCMN) architecture and its implementation details. The DCMN is designed to leverage the strengths of dual convolutional neural networks (CNNs) to effectively classify malware based on their visual representations.
- The authors mention multiple times that " The custom branch captures structural details", but they never show it nor discuss why this happens.
Answer:
Thank you for pointing this out. We agree with this comment. Therefore, we have added the following sentences to explain why the custom branch captures structural details (refer to lines 258–262, page 8):
The max pooling layers with a stride of 2 effectively downsample the feature maps, which can help the network focus on more global structural information. Subsequently, the use of 3x3 kernels in convolutional layers after the max pooling layers is a good choice for capturing structural details, as it allows the network to extract features from a relatively large region of the input image.
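The claim that a 3×3 convolution placed after stride-2 pooling sees a larger region of the input can be checked with the standard receptive-field recurrence. The layer stack used in the example (3×3 convolution, 2×2 max pooling with stride 2, 3×3 convolution) is an assumed illustration, since Table 1 is not reproduced in this response.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per axis) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    Recurrence: field += (kernel - 1) * jump; jump *= stride,
    where `jump` is the spacing of adjacent units measured in input pixels.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# A single 3x3 convolution versus the same convolution placed after
# an earlier 3x3 convolution and a 2x2, stride-2 max pooling layer.
single_conv = receptive_field([(3, 1)])
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])
```

Under these assumptions the field grows from 3 pixels to 8 pixels per axis, because each step of the second convolution spans two input pixels after downsampling.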
- "The custom CNN branch consists of 4 layers and has approximately 34 MB parameters." --> It is not clear what 34MB parameters are, maybe the authors meant 34 million parameters?
Answer:
Agreed. We have therefore replaced 34MB with 34 million parameters.
- It is not clear why the authors decided to break the flow of the (rather short) explanation of their technique to provide, in Section 4.3.1-4.3.5, some background knowledge.
Answer:
Thank you for highlighting this issue. We have accordingly added the following to Section 4.3 (refer to lines 269–272, page 8):
In the following section, we will provide a detailed overview of the Dual Convolutional Neural Network (Dual CNN) architecture employed in our proposed DCMN. The Dual CNN consists of three primary components: convolutional layers, pooling layers, and a softmax layer.
- "Tables (2,3)" --> "Tables 2 and 3".
Answer:
Agreed. We have therefore replaced "Tables (2,3)" with "Tables 2 and 3".
- The gain in detection performance of the proposed DCMN w.r.t. the sole Pre-trained ResNet-50 looks rather modest: the contribution of the "custom branch" should be discussed, also considering the additional computational resources required by the custom branch (i.e., is it worth to have it considering the added resources and the gain in detection performance?).
Answer:
Thank you for pointing this out. We agree with this comment. When using the dataset without the benign class, our proposed DCMN achieved a testing accuracy gain of 1.77% compared to the pre-trained ResNet-50. This gain was 1.52% when the benign class was included. While the improvement in detection performance might seem modest, the DCMN's superior learning speed during training is a significant advantage. For example, when using the dataset without the benign class, the pre-trained ResNet-50 took 1037.38 seconds to reach 96.70% validation accuracy, whereas our DCMN achieved 99.13% validation accuracy in only 570.5 seconds. Similar results were observed with the benign class included: the pre-trained ResNet-50 took 1125.52 seconds to reach 97.44% validation accuracy, whereas our DCMN achieved 99.65% validation accuracy in only 561.88 seconds. To illustrate this point, Figure 10 presents a comparative accuracy graph of our DCMN model and the pre-trained ResNet-50 model.
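The timings quoted in this response can be turned into ratios with a few lines of arithmetic. Note that the two models stopped at different validation accuracies, so these figures are ratios of reported training times, not times to reach the same accuracy. The helper name is hypothetical; the numbers are copied from the paragraph above.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate finished than the baseline."""
    return baseline_seconds / candidate_seconds


# Reported training times: ResNet-50 versus DCMN.
without_benign = speedup(1037.38, 570.5)
with_benign = speedup(1125.52, 561.88)

# DCMN test accuracy implied by the reported gain over ResNet-50
# on the dataset without the benign class (98.12% + 1.77 points).
implied_dcmn_test_accuracy = round(98.12 + 1.77, 2)

print(f"without benign class: {without_benign:.2f}x")
print(f"with benign class:    {with_benign:.2f}x")
```

The implied 99.89% matches the overall accuracy quoted in the Figure 9 discussion, and the time ratios come out at roughly 1.8 and 2.0.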
Additionally, we conducted ablation studies, presented in Tables 2 and 3, to evaluate the contributions of the custom CNN and pre-trained ResNet-50 (see lines 398-408 page 14).
To assess the contribution of each component within the proposed DCMN model, we conducted ablation studies by evaluating the custom CNN (regular CNN) and the pre-trained ResNet-50 separately. The results demonstrated that the pre-trained ResNet-50 significantly outperformed the custom CNN. When using the dataset without the Benign class, the training, validation, and testing accuracies of the custom CNN were 85.25%, 74.93%, and 85.30%, respectively, whereas the pre-trained ResNet-50 achieved accuracies of 99.97%, 96.70%, and 98.12%, respectively (see Table 2).
Upon adding the Benign class, the results exhibited minimal change from the previous experiment. The training, validation, and testing accuracies of the custom CNN were 85.67%, 80.62%, and 87.28%, respectively, while the pre-trained ResNet-50 achieved accuracies of 99.84%, 97.44%, and 97.89%, respectively (see Table 3).
- In the Conclusions the authors state that "Additionally, developing techniques to classify encrypted malware files without compromising their integrity could be a promising research direction, despite the inherent challenges posed by strong encryption": this looks pretty unfeasible to me without decrypting the code.
Answer:
Thank you for bringing this issue to our attention. We concur with your observation. We have accordingly added the following to the end of Section 6 (refer to lines 427–433, page 15):
Additionally, developing techniques to classify encrypted malware files without compromising their integrity could be a promising research direction. One potential approach in this area involves analyzing metadata. Examining the embedded metadata within the encrypted file, such as file headers, footers, or timestamps, might provide clues about the malware's type, behavior, or origin. Additionally, analyzing the network traffic generated by the encrypted malware can reveal patterns and behaviors that are indicative of specific types of malware.
Reviewer 4 Report
Comments and Suggestions for Authors
The paper presents a Dual Convolutional Neural Network (DCNN) architecture designed for malware classification. The approach involves converting malware binary files into 2D gray-scale images and then training a customized dual CNN for multi-classification. This model leverages the strengths of both a custom structure extraction branch and a pre-trained ResNet-50 model, leading to improved performance compared to single-branch methods. The proposed method offers an efficient way to enhance malware classification, helping to protect valuable assets and build trust by preventing cascading effects of malware attacks.
Even if the paper appears to be interesting, I have a couple of concerns.
The first point concerns the originality of the contribution. Given the current proliferation of studies in the field of machine learning applied to intrusion detection systems, it is crucial for the authors to clearly highlight the novel aspects of their work in comparison to existing technical literature.
The second concern relates to the time complexity of the proposed technique, as the authors focus mainly on classic performance. However, in the context of intrusion detection systems, it is also essential to consider the time required by different detection techniques. This is important because, paradoxically, some network attacks might be executed faster than the defense mechanisms themselves. Although the "time" dimension is mentioned in the results, additional details are needed.
To address this issue, the section on related work could be expanded to include reputable studies that explore the time consumption of various techniques in addition to their accuracy and overall performance. Examples include:
"Experimental Review of Neural-Based Approaches for Network Intrusion Management," IEEE Transactions on Network and Service Management, 2020;
"Network Abnormal Traffic Detection Model Based on Semi-Supervised Deep Reinforcement Learning," IEEE Transactions on Network and Service Management, 2021;
"Deep Learning for the Classification of Sentinel-2 Image Time Series," IEEE IGARSS, 2019.
Author Response
The first point concerns the originality of the contribution. Given the current proliferation of studies in the field of machine learning applied to intrusion detection systems, it is crucial for the authors to clearly highlight the novel aspects of their work in comparison to existing technical literature.
Thank you for highlighting this issue. We concur with your observation. Consequently, we have expanded the literature review to include more recent works, such as the following reference: Duraibi, S. Enhanced Image-Based Malware Classification using Snake Optimization Algorithm with Deep Convolutional Neural Network. IEEE Access 2024. Additionally, we have mentioned this work in Table 2 to facilitate a comparison with our proposed model (refer to lines 191–223).
The second concern relates to the time complexity of the proposed technique, as the authors focus mainly on classic performance. However, in the context of intrusion detection systems, it is also essential to consider the time required by different detection techniques. This is important because, paradoxically, some network attacks might be executed faster than the defense mechanisms themselves. Although the "time" dimension is mentioned in the results, additional details are needed.
Thank you for pointing this out. We agree with this comment. Therefore, we have expanded the literature review to include a more comprehensive examination of studies that explore the time consumption of various techniques, in addition to their accuracy and overall performance (refer to lines 200–223).
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors modified the manuscript according to the comments from the reviewers, and they successfully solved the most important issues that I reported. However, I still have some concerns, as follows:
- In the new text inserted in Section 2, the authors mention the possible loss of information in converting RGB images to grayscale. This is true in general, but it does not necessarily apply to malware images, as they should not need to be converted: the image generation process should be aware of the requirements on images and generate either RGB or grayscale images from the beginning. If grayscale images are generated and not obtained from a conversion, no information loss should happen (i.e., the same information is in the two types of images, just represented in a different way).
- In explaining the custom branch, the authors have now added an explanation on why it is supposed to capture structural details, even though no proof of that is provided.
- The ablation study proposed by the authors in the new version does not state much about the advantages of the proposed DCMN over ResNet-50. Advantages in training time of DCMN (reported in the reply from the authors, but not in the manuscript) have limited value, apart from specific cases. The authors should at least introduce a discussion that underlines the other possible advantages of using DCMN over ResNet-50. In the discussion, they should evaluate detection performance, memory and computational resources, as well as any other aspect that is relevant (e.g., training time, if there is any specific reason to consider it).
Comments on the Quality of English Language
The manuscript would still benefit from a careful proofread, also in the new/modified parts.
Author Response
- In the new text inserted in Section 2, the authors mention the possible loss of information in converting RGB images to grayscale. This is true in general, but it does not necessarily apply to malware images, as they should not need to be converted: the image generation process should be aware of the requirements on images and generate either RGB or grayscale images from the beginning. If grayscale images are generated and not obtained from a conversion, no information loss should happen (i.e., the same information is in the two types of images, just represented in a different way).
Answer:
Thank you for highlighting this issue. We agree with your observation. Consequently, we have removed this drawback and revised the paragraph to read as follows: (refer to lines 108 – 111 page 3)
While grayscale images have their advantages, they may require more sophisticated feature extraction techniques to capture relevant information, especially if color-related features are important for distinguishing malware families. This could potentially limit the discriminative power of the model.
- In explaining the custom branch, the authors have now added an explanation on why it is supposed to capture structural details, even though no proof of that is provided.
Answer:
Thank you for pointing this out. We agree with your comment. Therefore, we have added a new citation (reference 31) at the end of our explanation in the previous reply. In this reference, the authors proposed a multi-branch deep learning framework that combines global contextual features with multi-scale features to identify complex land scenes. They assert in section 3.1 that "The global contextual branch follows the pipeline of a traditional convolutional neural network". Their work demonstrates that using a series of traditional pooling and convolution techniques enables the model to effectively extract global features. We have added the following paragraph: (refer to lines 268 – 272 page 8)
The effectiveness of this method has been validated in [31]. The authors proposed a multi-branch deep learning framework that combines global contextual features with multi-scale features to identify complex land scenes. Their work shows that a series of traditional pooling and convolution techniques allows the model to effectively extract global features.
- The ablation study proposed by the authors in the new version does not state much about the advantages of the proposed DCMN over ResNet-50. Advantages in training time of DCMN (reported in the reply from the authors, but not in the manuscript) have limited value, apart from specific cases. The authors should at least introduce a discussion that underlines the other possible advantages of using DCMN over ResNet-50. In the discussion, they should evaluate detection performance, memory and computational resources, as well as any other aspect that is relevant (e.g., training time, if there is any specific reason to consider it).
Answer:
Thank you for bringing this issue to our attention. We concur with your observation. As a result, we have added the following paragraph: (refer to lines 399 – 408 page 14-15)
Figure 10 presents a comparative accuracy graph of our DCMN model and the pre-trained ResNet-50 model. When using a dataset without the benign class, our proposed DCMN achieved a testing accuracy gain of 1.77% compared to the pre-trained ResNet-50. This gain is 1.52% when the benign class is included. In addition to this improvement in detection performance, the DCMN's superior learning speed during training is a significant advantage. For example, when using a dataset without the benign class, the pre-trained ResNet-50 took 1037.38 seconds to reach 96.70% validation accuracy, while our DCMN achieved 99.13% validation accuracy in only 570.5 seconds. In a similar vein, with the benign class included, the pre-trained ResNet-50 took 1125.52 seconds to reach 97.44% validation accuracy, whereas our DCMN achieved 99.65% validation accuracy in only 561.88 seconds.
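The training-time figures quoted in this paragraph can be put side by side with a short calculation. The following is only a minimal sketch in Python: the dictionary layout and the helper name `compare` are illustrative assumptions, and the only data used are the four time and validation-accuracy pairs stated in the response. Note that the two models stop at different validation accuracies, so the time ratio is not a like-for-like time-to-target comparison, and the validation gaps computed here are distinct from the testing-accuracy gains of 1.77% and 1.52% reported above.

```python
# Figures quoted in the response: (training time in seconds, validation accuracy in %).
# The dictionary layout and key names are illustrative assumptions.
runs = {
    "without benign class": {"resnet50": (1037.38, 96.70), "dcmn": (570.5, 99.13)},
    "with benign class": {"resnet50": (1125.52, 97.44), "dcmn": (561.88, 99.65)},
}


def compare(resnet50, dcmn):
    """Return (ResNet-50 time / DCMN time, DCMN minus ResNet-50 validation accuracy)."""
    (t_res, acc_res), (t_dcmn, acc_dcmn) = resnet50, dcmn
    return t_res / t_dcmn, acc_dcmn - acc_res


summary = {name: compare(r["resnet50"], r["dcmn"]) for name, r in runs.items()}

for name, (ratio, gap) in summary.items():
    print(f"{name}: ResNet-50 took {ratio:.2f}x as long; "
          f"validation gap {gap:.2f} percentage points")
```

Under these numbers, the pre-trained ResNet-50 run took roughly 1.8 times as long as the DCMN run without the benign class and roughly twice as long with it, while ending at a lower validation accuracy in both settings.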
Reviewer 4 Report
Comments and Suggestions for Authors
The authors made a nice effort to address all my comments raised in the previous round of review. In particular they have:
- Better highlighted their original contribution;
- Better discussed the time complexity, by also comparing some recent and credited works in this direction.
In my opinion, the paper can now be accepted in its current form.
Author Response
- Better highlighted their original contribution;
Answer:
Thank you for highlighting this issue. We concur with your observation. Consequently, we have expanded the literature review to include more recent works (refer to lines 216–223 page 6-7).
- Better discussed the time complexity, by also comparing some recent and credited works in this direction.
Answer:
Thank you for pointing this out. We agree with this comment. Therefore, we have expanded the literature review to include a more comprehensive examination of studies that explore the time consumption of various techniques, in addition to their accuracy and overall performance (refer to lines 216–223 page 6-7).