DCFA-YOLO: A Dual-Channel Cross-Feature-Fusion Attention YOLO Network for Cherry Tomato Bunch Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. Line 133: the stated contribution of the paper is too vague and non-specific. The authors should explain what specifically was done rather than describe the module itself in detail, since the modules are introduced later; the current arrangement is not reasonable.
2. In Figure 1, two of the three output features come from two branches (feature1) that should in essence produce the same output. The figure drawn to demonstrate multimodality is too cumbersome.
3. Section 2.1 lacks a specific description of the module, so it is unclear where the inputs and outputs of the innovative module come from. The description makes it look as though BiFPN and concat are simply placed together.
4. Equation 2 does not explain why the final value is 3.
5. Line 203, Section 2.2: the explanation of the optimised module is not prominent enough, and its mechanism of action and advantages are difficult to discern. It is suggested that the authors present the original model for comparison.
6. The colours of the modules in Figure 2 are too similar to one another and do not clearly match those used in Figure 1.
7. Equation 3 does not define its variables.
8. Section 3.1 (Evaluation Indicators) gives the formulae for AP and mAP, but the text also mentions the F1 score, recall rate, etc.; the corresponding formulae need to be added.
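For reference, the standard definitions the reviewer is asking the authors to add (in terms of true positives TP, false positives FP, and false negatives FN) are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```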
9. For Section 3.1, it is suggested to add specific location information and photographs taken in the field.
10. The experimental description in Section 3.1 is not specific, and the introduction of model parameters is too brief.
11. There are two sections numbered 3.1, and too few evaluation indicators are used.
12. Line 359: in the last two rows of Table 1, the various evaluation indices of the CBAM and SPPF_CBAM modules differ very little, although SPPF_CBAM is the best overall; it is recommended to add an explanation.
13. Lines 361 to 382 are somewhat lengthy; it would be sufficient to describe the improvements in accuracy and precision and the impact of lightweighting.
14. Line 383, the "Comparison with other DNN models" part of Section 3.3, compares too few models; it is suggested to increase the number of comparison models to at least 8-10.
15. The text does not give details of the hyperparameter settings, data preprocessing methods, etc. used when training the comparison models.
16. Little information can currently be derived from Figure 6; it is suggested that the model used be labelled below each sub-figure and that the detection accuracy be labelled in each detection box.
17. There are too few experiments; it is recommended to add experiments in a variety of complex backgrounds and in backlit and cluttered environments, including visual detection result images, detection heat maps, and other experiments related to the improved modules.
Author Response
Please see attached file. Thank you!
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The proposed manuscript presents an advanced model, DCFA-YOLO, for tomato cluster detection using multimodal RGB and depth images. By integrating innovative attention and feature fusion mechanisms, the model offers significant improvements in accuracy and computational efficiency compared to other YOLO models.
The introduction is well structured and provides an appropriate context for the importance of automation in tomato harvesting. However, it lacks a detailed description of the shortcomings of existing models. I suggest adding a direct comparison with similar approaches to highlight the unique value of DCFA-YOLO.
The objectives are clear, but the emphasis on practical applications could be clearer. I would suggest discussing how the model could affect productivity or reduce waste in automated collection.
The methodology is written very clearly. The description of the data set is detailed, but lacks an explanation of the variety of environmental conditions included. I would therefore suggest indicating whether the dataset represents different types of lighting and positioning of bunches to improve generalisability.
The architecture of the model is well described but complex. Readers would benefit from more explanatory diagrams. I would recommend expanding the figure captions to explain the main functions of the modules (e.g. Concat_BiFPN and C2f_RepGhost).
The metrics used are appropriate, but it would be helpful to discuss why they were chosen specifically for this application.
The results section clearly shows the effectiveness of each module. However, the presentation of the results is dense and could be simplified. The authors should summarise the main improvements in an additional table or diagram. There is no discussion of why DCFA-YOLO outperforms other YOLO models.
The discussion is detailed but lacks a critical assessment of the limitations of the model. The authors should consider scenarios where the model might have difficulties, such as in environments with large variations in illumination.
Furthermore, the manuscript could be enriched with a discussion on how the model could be integrated into existing agricultural systems, such as harvesting robots or drones.
The conclusion summarises the results well, but does not sufficiently emphasise the model's contribution to agricultural sustainability. The authors could highlight the potential of the model to reduce waste and increase efficiency in tomato harvesting.
Overall, the manuscript represents a significant contribution to tomato cluster detection for precision agriculture. With targeted revisions to improve clarity and discussion, the paper can be accepted.
Author Response
Please see the attached file. Thank you!
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have responded well to the reviewers' questions and have improved and optimised the quality of the manuscript, which I believe is now at a level where it can be accepted for publication.