Enhancing Fine-Grained Image Recognition with Multi-Channel Self-Attention Mechanisms: A Focus on Fruit Fly Species Classification
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Major comment:
1. The authors should test their pipeline on flies or insects of similar sizes. The entirety of the conclusion hinges upon this critical control experiment.
Minor comment:
1. Figure 1: It would be better to present the data in a clustered column bar graph, or similar graphs where all of the data points are clearly visible.
2. The authors should comment on if the pipeline would work for other applications as well.
Author Response
Major comment:
- The authors should test their pipeline on flies or insects of similar sizes. The entirety of the conclusion hinges upon this critical control experiment.
Reply: Thanks for your valuable suggestion. The test results of the orange fly test set have been added in Section 3.5, as shown in Table 2:
Table 2. F1 value results from different methods

Number of experiments | Proposed method | Method of reference [6] | Method of reference [7] | Method of reference [8]
1 | 0.98 | 0.88 | 0.91 | 0.81
2 | 0.96 | 0.86 | 0.93 | 0.83
3 | 0.99 | 0.90 | 0.94 | 0.79
4 | 0.95 | 0.88 | 0.92 | 0.81
5 | 0.97 | 0.89 | 0.95 | 0.82
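For reference, the F1 values reported in Table 2 combine precision P and recall R in the standard way:

F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN).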
Minor comment:
- Figure 1: It would be better to present the data in a clustered column bar graph, or similar graphs where all of the data points are clearly visible.
Reply: Thanks for your valuable suggestion. We have revised Figure 1 according to your comment so that all data points are clearly visible.
- The authors should comment on if the pipeline would work for other applications as well.
Reply: We are grateful for this good suggestion. The method described in this paper can be used in other applications; this has been added to the conclusion section (Section 4) as follows: The multi-channel self-attention mechanism in this method can combine information from other modalities and be extended to multi-task learning, giving it broad applicability in fields such as image processing, computer vision, and pattern recognition, especially in scenarios that require capturing fine-grained information in images, fusing multimodal data, or performing multi-task learning.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
I was asked to review the paper:
Enhancing Fine-grained Image Recognition with Multi-channel Self-attention Mechanisms: A Focus on Fruit Fly Species Classification
The paper is very interesting and needed some time to understand all of its aspects.
Some observations should be elucidated:
a) L 46: “In conclusion, our study introduces a novel fine-grained image recognition method for the effective identification of fruit flies. Two primary innovations define our approach:
(1) First, the incorporation of a multi-channel self-attention mechanism enhances feature extraction, allowing for nuanced recognition of subtle differences between fruit fly species.
(2) Second, the utilization of long-term and short-term memory networks as feature extractors contribute to the robustness of the framework, ensuring consistent and accurate recognition across diverse backgrounds. Together, these innovations mark a significant advancement in fine-grained image recognition techniques tailored specifically for fruit fly identification.”
-At this point, the purpose of the research, the hypotheses that will be studied and the research program should be stated
L84 “Experiments showed that the recognition accuracy of this method is 87.5%. However, this method may overlook important components or contextual information in some fine-grained images, as erasing the most responsive part may result in the network being unable to accurately recognize the image.”
Experiments carried out by whom? a reference should be included.
L143 “The Softmax classifier is used to map weighted fruit fly features to categories and output a function of classification probability. Based on the center loss function, the similarity between fruit fly features and their class centers is increased, which reduces the distance between the two in the feature space and solves the problem of significant intra class differences.”
Softmax classifier used to map weighted fruit fly features by whom? a reference should be included.
L 146
“Based on deep learning, this paper designs a network framework for fine-grained image recognition of fruit flies, which combines fine-grained images X of fruit flies input the feature extractor to obtain the underlying features F , using long and short term memory networks (LSTM) as feature extractors;”
Very interesting. - LSTMs are predominantly used to learn, process, and classify sequential data because these networks can learn long-term dependencies between time steps of data. Common LSTM applications include sentiment analysis, language modeling, and video analysis.
L 207
“ through the convolution layer of 16 1x1” the relation is incorrect
L 239 – 266
The exposition of the algorithm is rather vague; maybe more features could be exposed.
L 271 “Softmax is a commonly used classification recognition function in deep learning models, which can map the input weighted fruit fly features to various categories and return the probability results of classification.” - a reference should be included.
Conclusion
The presented results should be discussed in comparison with other methods.
Comments for author File: Comments.pdf
Author Response
Dear reviewer:
Thank you for your comments. According to your comments, I have made the following modifications:
Some observations should be elucidated:
- a) L 46: “In conclusion, our study introduces a novel fine-grained image recognition method for the effective identification of fruit flies. Two primary innovations define our approach:
(1) First, the incorporation of a multi-channel self-attention mechanism enhances feature extraction, allowing for nuanced recognition of subtle differences between fruit fly species.
(2) Second, the utilization of long-term and short-term memory networks as feature extractors contribute to the robustness of the framework, ensuring consistent and accurate recognition across diverse backgrounds. Together, these innovations mark a significant advancement in fine-grained image recognition techniques tailored specifically for fruit fly identification.”
-At this point, the purpose of the research, the hypotheses that will be studied and the research program should be stated
Reply: Based on your comment, a description of the research purpose, hypotheses, and research plan has been added between Section 3 and Section 3.1. The specific supplementary content is as follows:
Accurate recognition and analysis of fruit fly images are crucial for fine-grained species identification. In order to accurately extract and recognize fine-grained fruit fly images, this paper proposes a method based on a multi-channel self-attention mechanism. We hypothesize that the performance of fine-grained image recognition can be significantly improved by combining a multi-channel self-attention mechanism with a long short-term memory network; the expected outcomes are as follows:
1) The multi-channel self-attention mechanism can capture global and local features in an image and strengthen task-related feature representations by assigning different weights to each channel;
2) As a feature extractor, the LSTM can handle long-term dependencies in sequence data and thereby extract low-level features from images, which are crucial for identifying fine-grained objects such as fruit flies.
By combining these two techniques, we expect to achieve efficient and accurate recognition of fine-grained fruit fly images even when processing images with complex backgrounds. Fine-grained image recognition is implemented through the following steps (an illustrative sketch follows the list):
Step 1: Use the LSTM to extract the low-level features of the image and capture the feature information in the fine-grained fruit fly image;
Step 2: Multiply the attention feature maps by their weights, integrating low-level geometric information and high-level semantic information to obtain the fruit fly features;
Step 3: The Softmax classifier outputs the image recognition result;
Step 4: Improve the Softmax loss function to A-Softmax;
Step 5: Merge the AM-Softmax loss, center loss, and inverse sample weighted loss values to obtain the final loss function, achieving accurate image recognition.
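A minimal sketch of how Steps 1-3 fit together, assuming a PyTorch implementation (the class name, tensor sizes, and patch-based input are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class FruitFlyRecognizer(nn.Module):
    """Illustrative sketch: LSTM feature extractor + multi-channel attention
    weighting + Softmax classifier (not the authors' implementation)."""

    def __init__(self, patch_dim=256, hidden_dim=128, attn_channels=32, num_classes=10):
        super().__init__()
        # Step 1: LSTM over a sequence of image patches extracts low-level features
        self.lstm = nn.LSTM(patch_dim, hidden_dim, batch_first=True)
        # Multi-channel attention: one weight map per attention channel
        self.attn = nn.Linear(hidden_dim, attn_channels)
        # Step 3: classifier over the fused feature
        self.fc = nn.Linear(hidden_dim * attn_channels, num_classes)

    def forward(self, patches):                           # patches: (B, T, patch_dim)
        feats, _ = self.lstm(patches)                     # (B, T, hidden_dim)
        weights = torch.softmax(self.attn(feats), dim=1)  # (B, T, C), normalized over patches
        # Step 2: weight the low-level features with each attention channel
        fused = torch.einsum('btc,bth->bch', weights, feats).flatten(1)  # (B, C*hidden_dim)
        return self.fc(fused)                             # logits; softmax gives class probabilities

model = FruitFlyRecognizer()
logits = model(torch.randn(4, 49, 256))                   # e.g. a 7x7 grid of patch embeddings
probs = logits.softmax(dim=-1)
```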
L84 “Experiments showed that the recognition accuracy of this method is 87.5%. However, this method may overlook important components or contextual information in some fine-grained images, as erasing the most responsive part may result in the network being unable to accurately recognize the image.”
Experiments carried out by whom? a reference should be included.
Reply: The experiment described here was conducted using the method of reference [8]. Since quoting specific experimental figures in the review section may lead to misunderstandings among readers, the sentence "Experiments showed that the recognition accuracy of this method is 87.5%" has been deleted.
L143 “The Softmax classifier is used to map weighted fruit fly features to categories and output a function of classification probability. Based on the center loss function, the similarity between fruit fly features and their class centers is increased, which reduces the distance between the two in the feature space and solves the problem of significant intra class differences.”
Softmax classifier used to map weighted fruit fly features by whom? a reference should be included.
Reply: A reference to the relevant research has been provided, as shown in reference [23]:
[23] Liao M., Yingqiong P., Deng H., et al. CNN-SVM: A classification method for fruit fly image with the complex background. IET Cyber-Physical Systems: Theory & Applications, 2020, 5(2): 181-185.
L 146
“Based on deep learning, this paper designs a network framework for fine-grained image recognition of fruit flies, which combines fine-grained images X of fruit flies input the feature extractor to obtain the underlying features F , using long and short term memory networks (LSTM) as feature extractors;”
Very interesting. - LSTMs are predominantly used to learn, process, and classify sequential data because these networks can learn long-term dependencies between time steps of data. Common LSTM applications include sentiment analysis, language modeling, and video analysis.
Reply: As you mentioned, the LSTM is not the most commonly used network structure in image processing, especially in fine-grained image recognition. However, under specific conditions, when an image is divided into a series of regions or blocks that are input into the network in a fixed order, an LSTM can be used to process these serialized regions.
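As a hedged illustration of this serialization idea (the helper name and patch size are assumptions, not the paper's implementation), an image can be cut into a row-major sequence of patches and then fed to an LSTM:

```python
import torch

def image_to_patch_sequence(img, patch=16):
    """Serialize an image tensor (C, H, W) into a row-major sequence of flattened
    patches so it can be fed to an LSTM (illustrative helper only)."""
    c, h, w = img.shape
    blocks = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (C, H/p, W/p, p, p)
    return blocks.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

seq = image_to_patch_sequence(torch.randn(3, 224, 224))            # (196, 768)
lstm = torch.nn.LSTM(input_size=768, hidden_size=128, batch_first=True)
low_level_feats, _ = lstm(seq.unsqueeze(0))                        # (1, 196, 128)
```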
L 207
“ through the convolution layer of 16 1x1” the relation is incorrect
Reply: Based on your comment, I have rechecked and revised the wording in this section as follows: the representation utilizes a convolutional layer with 16 1×1 kernels, combined with the bottom features of the fruit fly.
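Read this way, the operation is a channel-wise projection; a short sketch under assumed tensor shapes (not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# A 1x1 convolution with 16 output channels projects the bottom (low-level)
# feature map channel-wise without changing its spatial size. Shapes are assumed.
bottom_features = torch.randn(1, 128, 28, 28)            # (B, C_in, H, W)
proj = nn.Conv2d(in_channels=128, out_channels=16, kernel_size=1)
maps = proj(bottom_features)                              # (1, 16, 28, 28): one map per kernel
```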
L 239 – 266
the exposition of the algorithm is rather vague. maybe more features could be exposed
Reply: Based on your comment, I have supplemented the overall description of each step in this section to improve the clarity of the paper. By integrating global and local attention features together with the attention mean, a comprehensive feature representation containing both low-level geometric information and high-level semantic information is obtained, thereby improving the performance of fine-grained fruit fly image recognition.
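One possible reading of this fusion, sketched under assumed shapes and an assumed formulation of the attention mean (the paper's exact equations may differ):

```python
import torch

def fuse_attention_features(global_feats, local_feats):
    """Assumed fusion of global and local attention features plus an
    'attention mean' term; illustrative only."""
    # global_feats, local_feats: (B, C, D) -- C attention channels, D feature dims
    attn_mean = (global_feats + local_feats).mean(dim=1, keepdim=True)            # (B, 1, D)
    return torch.cat(
        [global_feats, local_feats, attn_mean.expand_as(global_feats)], dim=-1)   # (B, C, 3D)

fused = fuse_attention_features(torch.randn(2, 16, 64), torch.randn(2, 16, 64))   # (2, 16, 192)
```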
L 271 “Softmax is a commonly used classification recognition function in deep learning models, which can map the input weighted fruit fly features to various categories and return the probability results of classification.” - a reference should be included.
Reply: A relevant reference has been added, as shown in reference [23]:
[23] Liao M., Yingqiong P., Deng H., et al. CNN-SVM: A classification method for fruit fly image with the complex background. IET Cyber-Physical Systems: Theory & Applications, 2020, 5(2): 181-185.
Conclusion
The presented results should be discussed in comparison with other methods.
Reply: The discussion results of the proposed method and other methods have been added below Table 2, specifically:
This is because the proposed method effectively integrates global and local attention features through a multi-channel self-attention feature fusion strategy and introduces the attention mean to extract higher-order features that are more relevant to the fruit fly categories. This strategy enables the proposed method to capture more fine-grained information when recognizing fruit fly images, thereby improving the accuracy and stability of recognition. In contrast, although the methods in references [6], [7], and [8] may have advantages in certain respects, their performance on the fruit fly image recognition task is inferior to that of the proposed method. This is due to shortcomings of these methods in feature extraction, attention mechanism design, or model optimization, which prevent them from fully capturing and utilizing the fine-grained information in the fruit fly images.
The above modifications have been highlighted in green; please review.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper “Enhancing Fine-grained Image Recognition with Multi-channel Self-attention Mechanisms: A Focus on Fruit Fly Species Classification” needs the following improvements.
· Provide a tabular comparison of the proposed model and other models.
· Include mathematical expression of Kappa Coefficient.
· Which number of channels is used to obtain Figure 5?
· Is the number of training and testing images used in the comparison models the same or different?
· Include a figure of the proposed model or a complete pipeline of the whole process.
· Provide results with the original fine-grained image feature distribution, LSTM extracted low-level features, and fine-grained image features finally extracted.
· You mentioned on line 427-428 that “good balance between recognition accuracy and model complexity can be achieved when the number of attention weight map channels is 16 or 32.” Discuss what is the computational cost when channels are 4, 8, 16, 32, and 64.
Author Response
- Provide the tabular comparison of the proposed model and other model.
Reply: Thank you for your comment. The F1 value comparison between the multi-channel self-attention mechanism model and the methods of references [6], [7], and [8] has been added, as shown in Table 2.
Table 2. F1 value results from different methods

Number of experiments | Proposed method | Method of reference [6] | Method of reference [7] | Method of reference [8]
1 | 0.98 | 0.88 | 0.91 | 0.81
2 | 0.96 | 0.86 | 0.93 | 0.83
3 | 0.99 | 0.90 | 0.94 | 0.79
4 | 0.95 | 0.88 | 0.92 | 0.81
5 | 0.97 | 0.89 | 0.95 | 0.82
2. Include the mathematical expression of the Kappa coefficient.
Reply: We are grateful for this good suggestion. The mathematical expression for the Kappa coefficient has been added.
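For reference, the Kappa coefficient is conventionally defined as

kappa = (p_o - p_e) / (1 - p_e),

where p_o is the observed agreement (overall accuracy) and p_e is the agreement expected by chance from the marginal class frequencies of the confusion matrix; the notation used in the revised manuscript may differ.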
3. Which number of channels is used to obtain Figure 5?
Reply: 32 channels were used.
4. Is the number of training and testing images used in the comparison models the same or different?
Reply: We thank the reviewer for pointing out this issue. In the comparison of the four methods in Table 2, the number of test images used is the same.
5. Include a figure of the proposed model or a complete pipeline of the whole process.
Reply: According to your comment, we have added the complete steps of image recognition using the method described in this paper in Section 3, as follows:
Step 1: Use the LSTM to extract the low-level features of the image and capture the feature information in the fine-grained fruit fly image;
Step 2: Multiply the attention feature maps by their weights, integrating low-level geometric information and high-level semantic information to obtain the fruit fly features;
Step 3: The Softmax classifier outputs the image recognition result;
Step 4: Improve the Softmax loss function to A-Softmax;
Step 5: Merge the AM-Softmax loss, center loss, and inverse sample weighted loss values to obtain the final loss function, achieving accurate image recognition (an illustrative sketch of this combined loss follows the list).
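A hedged sketch of the Step 5 loss combination, assuming PyTorch (the margin, scale, and weighting coefficients are placeholders, not the values used in the paper):

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(cos_theta, labels, s=30.0, m=0.35):
    """AM-Softmax: subtract an additive margin m from the target-class cosine
    similarity and rescale by s before softmax cross-entropy."""
    target = F.one_hot(labels, cos_theta.size(1)).bool()
    return F.cross_entropy(s * torch.where(target, cos_theta - m, cos_theta), labels)

def combined_loss(cos_theta, feats, centers, labels, class_freq,
                  w_center=0.01, w_weighted=1.0):
    """Merge AM-Softmax loss, center loss, and an inverse-frequency weighted loss.
    The weighting coefficients here are assumptions."""
    l_am = am_softmax_loss(cos_theta, labels)
    l_center = ((feats - centers[labels]) ** 2).sum(dim=1).mean()           # center loss
    inv_w = 1.0 / class_freq[labels].float()                                # rarer classes weigh more
    l_weighted = (F.cross_entropy(cos_theta, labels, reduction='none') * inv_w).mean()
    return l_am + w_center * l_center + w_weighted * l_weighted

# usage with dummy tensors: 8 samples, 5 classes, 64-dim features
loss = combined_loss(torch.randn(8, 5).tanh(), torch.randn(8, 64),
                     torch.randn(5, 64), torch.randint(0, 5, (8,)),
                     torch.tensor([100, 80, 60, 40, 20]))
```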
6. Provide results with the original fine-grained image feature distribution, the LSTM-extracted low-level features, and the finally extracted fine-grained image features.
Reply: Thank you very much for your valuable suggestions. We have added a specific explanation below Figure 2 of how the feature distribution evolves from the original fine-grained image features to the finally extracted fine-grained image features. The supplementary content is as follows:
Figure 2(a) shows the distribution of the original fine-grained fruit fly image features after they are reduced to two dimensions with the t-SNE method. The distribution of these features in two-dimensional space is relatively scattered, with significant overlap between categories and no clear boundaries, which indicates that without further processing the original image features are difficult to use for distinguishing fruit fly species. Figures 2(b) and 2(c) show the feature maps of the LSTM layer and the last layer of the proposed method, respectively. As shown in Figure 2(b), the low-level features extracted by the LSTM layer begin to exhibit a certain degree of structure and clustering in two-dimensional space. Although some overlap between categories remains, the LSTM layer already extracts information useful for distinguishing fruit fly species compared with the original features. After the deeper layers of the network and the multi-channel self-attention mechanism introduced in the last layer, the finally extracted fine-grained image features are shown in Figure 2(c). This feature map shows a very clear clustering effect: features of different categories form distinct boundaries in two-dimensional space, and the overlapping regions are greatly reduced. This indicates that, through the hierarchical feature extraction of the deep structure and the multi-channel self-attention mechanism of the proposed method, the model can extract highly discriminative features and thus achieve good fine-grained recognition of fruit fly images.
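The kind of visualization described above can be reproduced with a short t-SNE script such as the following (scikit-learn and matplotlib assumed; variable names are placeholders, not the paper's code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project features to 2D with t-SNE and scatter-plot them by class,
    mirroring the panels of Figure 2 (illustrative code)."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(features)
    plt.figure()
    for c in np.unique(labels):
        pts = emb[labels == c]
        plt.scatter(pts[:, 0], pts[:, 1], s=8, label=f'class {c}')
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_tsne(raw_features, y, 'original image features')         # Figure 2(a)
#      plot_tsne(lstm_features, y, 'LSTM low-level features')        # Figure 2(b)
#      plot_tsne(final_features, y, 'final fine-grained features')   # Figure 2(c)
```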
7. You mentioned on lines 427-428 that “good balance between recognition accuracy and model complexity can be achieved when the number of attention weight map channels is 16 or 32.” Discuss what the computational cost is when the number of channels is 4, 8, 16, 32, or 64.
Reply: Thanks for your positive and constructive suggestions. The computational costs for the different channel numbers have been provided in Table 1.
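As a rough way to see how the cost scales with the number of attention weight map channels (the head design below is an assumption, so the absolute numbers will differ from Table 1):

```python
import torch.nn as nn

def attention_head_params(channels, in_dim=128):
    """Parameter count of a hypothetical 1x1-conv head that produces `channels`
    attention weight maps from in_dim feature channels (assumed design)."""
    head = nn.Conv2d(in_dim, channels, kernel_size=1)
    return sum(p.numel() for p in head.parameters())

for c in (4, 8, 16, 32, 64):
    print(c, attention_head_params(c))   # the head's cost grows roughly linearly with channels
```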
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed the points.
Reviewer 3 Report
Comments and Suggestions for Authors
Most of my comments are addressed. I recommend acceptance of this article in its current form.