Yolov8n-RCP: An Improved Algorithm for Small-Target Detection in Complex Crop Environments
Round 1
Reviewer 1 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
The nomenclature of the proposed model, YOLOv8n-RIPE, must be clarified. Specify the full meaning of the acronym 'RIPE' and the reasoning behind this designation. This essential detail should be included in the manuscript.
Table 3 is unnecessary and should be removed. Its content must be reformulated into a concise Limitations paragraph and seamlessly integrated into the Discussion or Conclusion section to address the scope and boundaries of the study.
Author Response
(1) Clarify the model naming
Aiming at the low detection accuracy of peppers in natural field environments (due to small target size and complex backgrounds), this study proposes an improved Yolov8n-based algorithm, named Yolov8n-RCP, where RCP stands for RVB-CA-Pepper, for accurate mature-pepper detection. The acronym directly reflects the algorithm's core design: it integrates the Reverse Bottleneck (RVB) module for lightweight feature extraction and the Coordinate Attention (CA) mechanism for background-noise suppression, dedicated to mature-pepper detection in complex crop environments.
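Coordinate Attention is a published mechanism (Hou et al., CVPR 2021), so its structure can be sketched independently of the authors' closed-source code. The following minimal PyTorch sketch is illustrative only; the channel count and reduction ratio are assumptions, not values from the manuscript.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal sketch of Coordinate Attention (Hou et al., CVPR 2021)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Direction-aware pooling: aggregate along width and height separately,
        # so positional information survives in each attention map.
        x_h = x.mean(dim=3, keepdim=True)                          # (b, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)      # (b, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (b, c, 1, w)
        return x * a_h * a_w                                       # reweight features
```

The direction-aware pooling is what lets CA encode positional cues along height and width separately, which is the property the response credits for background-noise suppression.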
(2) Delete Table 3
Table 3 has been deleted and its content embedded in the Discussion/Conclusion. The text now transitions naturally from model advantages (technical support + scene adaptation) to limitations (occlusion + distance + edge deployment), and then connects to future directions, forming a complete "value → problem → solution" discussion chain and avoiding the information fragmentation caused by the original table. Rather than adding new standalone paragraphs, the material is integrated into the existing "model value analysis → field challenges → future directions" discussion, which retains all key information while enhancing the coherence and depth of the Discussion.
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
In the abstract, the issues that deep learning faces in detecting small-target crops should be addressed, and these challenges should be used to introduce the method proposed in this paper.
In the introduction, the authors should not limit their discussion to the YOLO algorithm alone but should also introduce other algorithms, highlighting their respective advantages and disadvantages compared to YOLO. For example, the authors could classify deep learning methods into unsupervised, semi-supervised, and supervised approaches for monitoring. Semi-supervised methods such as "Research on Multimodal Techniques for Arc Detection in Railway Systems with Limited Data", and supervised methods like "Generalized Koopman Neural Operator for Data-driven Modeling of Electric Railway Pantograph-Catenary Systems" could be mentioned. Additionally, the paper "Compression Approaches for LiDAR Point Clouds and Beyond: A Survey" might also be referenced. The introduction should conclude with a summary of the contributions of this work.
In section 2.1.1, "Dataset Acquisition," the advantages and challenges of the dataset used in this study compared to other crop datasets should be clearly explained to better demonstrate the novelty of the application presented in this paper.
In section 2.1.2, "Dataset Preprocessing," the number of images before augmentation should be specified.
In section 2.2.1, "Introduction to YOLO Algorithm," Figure 4 should show the input and output variables flowing through the modules that are of focus in this paper.
In section 2.2.4, where the network improvements are discussed, pseudo-code should be provided since the author's code is not open-source.
In section 3, "Experimental and Results Analysis," the method presented in this paper should be compared with state-of-the-art (SOTA) methods from 2024 and 2025, and the comparison should include different parameter configurations and FLOPs to highlight the advantages of the proposed method.
Based on the above points, I recommend a major revision.
Author Response
(1) Add other models
In the Introduction, the original text only reviewed the YOLO series. It has been expanded to cover the three method families (unsupervised → semi-supervised → supervised), comparing their advantages and disadvantages and their applicability in agricultural/electrical scenarios, and citing the literature specified by the reviewer. This naturally leads to the logic of "why YOLO was chosen for improvement": unsupervised methods lack accuracy, semi-supervised methods rely on pseudo-labels, and two-stage supervised detectors are slow, whereas single-stage YOLO offers the optimal balance of real-time performance and accuracy and is compatible with electric picking robotic arms. This provides a reasonable basis for the subsequent YOLO review and for the improvements in this study.
(2) Dataset description
A horizontal comparison with existing crop datasets has been added to highlight the pertinence and novelty of the self-built dataset. It explains why a self-built dataset was necessary and lays the groundwork for the subsequent model optimization (handling small targets, occlusions, and environmental fluctuations), logically forming a closed loop of "dataset gap → necessity of self-building → model adaptability". The added text explains, in turn, the advantages and challenges of the self-built dataset.
(3) Model input/output annotation
In Section 2.2.1, the input and output variables are now marked beside the module connection arrows in Figure 4. For key modules, their positions in the figure are identified in the text (e.g., "the first Conv module on the left of the figure", "the C2F module in the middle of the figure") together with the corresponding input and output variables, indirectly achieving the effect of visualizing the data flow.
(4) Add SOTA models for comparison
In Section 3, to address the comment on comparison with 2024-2025 SOTA methods, a horizontal comparison experiment and analysis have been added: representative 2024-2025 SOTA methods in agricultural small-target detection were selected (focusing on chili peppers and similar crops, with publicly disclosed parameter and FLOPs data), and a comparison table plus an analysis paragraph were supplemented. These highlight the balanced advantages of Yolov8n-RCP in precision, lightweight design, and real-time performance (few parameters and low FLOPs to suit electrical edge devices, and high precision/FPS to meet picking requirements).
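Since the comparison relies on parameter and FLOPs columns, the following hedged sketch shows one common way such figures are produced, using the thop profiler on an ultralytics YOLOv8n model; the weight file name and 640×640 input size are assumptions, and each compared variant would be profiled the same way.

```python
import torch
from thop import profile
from ultralytics import YOLO

model = YOLO("yolov8n.pt").model.eval()   # baseline; swap in each compared variant
dummy = torch.randn(1, 3, 640, 640)       # assumed 640x640 input resolution
macs, params = profile(model, inputs=(dummy,), verbose=False)
# thop counts multiply-accumulates; YOLO papers usually report FLOPs = 2 * MACs.
print(f"Params: {params / 1e6:.2f} M  FLOPs: {2 * macs / 1e9:.2f} G")
```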
Reviewer 3 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
- In "4. Discussion," the authors suddenly begin discussing a "RealSense D435i camera" and a "robotic arm". However, throughout the paper, data was collected with a "Hongmi K70 mobile phone", and no actual robotic arm deployment experiments were conducted. The authors seem to be conflating their future work with their current study, causing significant confusion.
- The abstract explicitly claims the C2F_RVB module is "...reducing parameters and computational complexity...". However, the data in Table 1 directly contradicts this claim. The baseline Yolov8n has 3.46 M parameters and 8.70 G FLOPs. The authors' final proposed model, Yolov8n-RIPE, has 3.46 M parameters (identical to the baseline) and 9.40 G FLOPs (higher than the baseline). Therefore, the model neither reduced parameters nor lowered computational complexity. The authors' attempt to spin this on page 10 by stating it "maintains a parameter count of 3.46 M" conflicts with the "reducing" claim in the abstract.
- Yolov8n-RIPE (FLOPs: 9.40 G) has an FPS of 90.74. Baseline Yolov8n (FLOPs: 8.70 G) has an FPS of 79.16. On the same hardware platform (RTX 3070Ti), a model with higher FLOPs (more computation) is significantly faster in inference (FPS). This is a highly anomalous phenomenon. The authors provide no explanation for this critical and counter-intuitive result. This casts doubt on the accuracy of their FPS measurements or their FLOPs calculations.
- Analyzing Table 1 data: The baseline mAP0.5 is 91.8%. Adding C2F_RVB alone: mAP0.5 reaches 96.0% (+4.2%). Adding CA alone: mAP0.5 reaches 94.0% (+2.2%). Adding both (RIPE): mAP0.5 reaches 96.2% (+4.4%). This data clearly shows that the C2F_RVB module contributed nearly all of the performance gain (+4.2%). Adding the CA module on top of C2F_RVB yielded only a 0.2% marginal improvement (from 96.0% to 96.2%). This severely weakens the authors' argument about the importance of the CA mechanism and the "synergistic effect" of the two modules. C2F_RVB appears to be the only effective improvement, while the contribution of CA is negligible.
- The authors compare Yolov8n-RIPE (3.46 M / 9.40 G) with Yolov5n (1.90 M / 4.5 G). The authors' model has nearly double the parameters and FLOPs of Yolov5n. Claiming higher accuracy against a much smaller model is an unfair comparison. The compared models (Yolov3-tiny, Yolov5n, Yolov7-tiny) are either outdated (v3-tiny) or are the lightest-weight variants. A rigorous SOTA comparison should include other SOTA models specifically designed for small object detection, or at least other improved YOLOv8n variants (e.g., using BiFPN, SPPFCSPC, etc.).
I recommend major revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.
Author Response
(1) Separate current work from future plans
The core revision idea is to strictly separate completed research from planned future work: content not involved in the current study, such as the "RealSense D435i camera" and the actual deployment of a robotic arm, has been deleted from the Discussion and reclassified as future work. It is now stated clearly that the current study only analyzes the model's suitability for electrical equipment (rather than performing an actual deployment), ensuring a closed logical loop and eliminating the confusion the reviewer pointed out.
(2) Correct the parameter/complexity claim
In the abstract, "reducing parameters and computational complexity" has been changed to "maintaining the same parameter count", consistent with Table 1 (baseline 3.46 M vs. final model 3.46 M).
(3) Explain the core value of C2F_RVB
In Section 2.2.3, "parameter compression is achieved" has been deleted and replaced with "the parameter count is kept consistent with the baseline", consistent with the data in Table 1. A note that the module "avoids the parameter inflation usually caused by adding attention mechanisms" was added to explain the core value of C2F_RVB: keeping the parameters from increasing even though EMA attention is introduced, which is a more defensible claim than "compressing the parameters".
(4) Explain the FLOPs/FPS anomaly in the ablation study
The ablation analysis in Section 4.3.3 now supplements three key mechanisms (structural re-parameterization, the lightweight computation of CA, and CUDA adaptation). Starting from the difference between theoretical FLOPs and actual inference speed, it explains why high FLOPs can coexist with high FPS, addressing the reviewer's doubts about data accuracy. Distinguishing "theoretical FLOPs" from "actual inference time" shows that the core of the optimization is improving computational efficiency rather than reducing theoretical complexity, in line with engineering practice.
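To make the distinction between theoretical FLOPs and measured inference speed concrete, here is a minimal sketch of a synchronized wall-clock FPS measurement on a CUDA device; the warmup and iteration counts are arbitrary choices, not the authors' protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, size: int = 640,
                warmup: int = 50, iters: int = 300, device: str = "cuda") -> float:
    model = model.to(device).eval()
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(warmup):              # warm up kernels / cuDNN autotuning
        model(x)
    torch.cuda.synchronize()             # drain queued async GPU work before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()             # wait for the last forward pass to finish
    return iters / (time.perf_counter() - t0)
```

Because GPU execution is asynchronous, skipping the synchronize calls or the warmup loop is a common source of distorted FPS numbers, which is exactly the kind of discrepancy the reviewer questioned.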
(5) Explain the small marginal gain of CA
Rather than avoiding the observation that CA only increases mAP0.5 by 0.2%, the revised analysis explores the unique value of CA along three dimensions: recall, false positives, and generalization stability, showing that the module is not redundant. The actual Table 1 data (C2F_RVB alone: R = 88.0% vs. combined: R = 91.1%) quantify the recall improvement contributed by CA and reflect the synergy effect, supplemented by experimental observations of reduced false positives and improved stability (supported by the complex scenarios in the original dataset). This makes the value of CA more multi-dimensional, avoids judging the module solely by mAP0.5, and matches the practical requirements of target detection (e.g., robotic-arm picking needs low false positives and high stability).
(6) Explain the role of CA
The simple precision statement has been replaced with "precision-recall balance" and "environmental adaptability", echoing the value of CA for recall and stability described above and demonstrating that the synergy effect is multi-dimensional.
(7) Add new models for comparison
In Section 3, to address the comment on comparison with 2024-2025 SOTA methods, a horizontal comparison experiment and analysis have been added: representative 2024-2025 SOTA methods in agricultural small-target detection were selected (focusing on chili peppers and similar crops, with publicly disclosed parameter and FLOPs data), and a comparison table plus an analysis paragraph were supplemented. These highlight the balanced advantages of Yolov8n-RCP in precision, lightweight design, and real-time performance (few parameters and low FLOPs to suit electrical edge devices, and high precision/FPS to meet picking requirements).
Round 2
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
Thank you for your thorough revision and the detailed responses to my previous comments. I appreciate the improvements you have made in addressing the model comparisons, dataset expression, and the inclusion of the SOTA model. The newly added sections effectively strengthen the rationale for choosing YOLO and provide valuable insights into the self-built dataset, model input, and overall optimization.
However, upon reviewing the references, it seems that some citations, such as Reference 4 in the Introduction, do not match the corresponding entries in the reference list. These discrepancies appear to be due to outdated or incorrect references from earlier manuscript versions. I recommend that you carefully review and update the reference list to ensure that all citations are accurate and consistent.
Once again, thank you for your revisions, and I look forward to the final version.
Author Response
Thank you very much for re-evaluating the revised sections and for reviewing the manuscript.
1. "Error! Reference source not found."
All "Error! Reference source not found." prompts have been eliminated by fixing the cross-references and updating the Word field codes; none remain in the text. The formatting issue with Reference 4 that you raised has also been corrected.
Reviewer 3 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
1. The manuscript contains at least three instances of "Error! Reference source not found.". This indicates the authors did not even read or compile their own manuscript before submission.
2. The text is riddled with broken citation formats, such as "Yao et al.131" , "Faster R-CNN141" , "SSD14" , and "Li et al. [4surveyed". These appear to be template errors or copy-paste artifacts that persist throughout the entire paper.
3. The introduction spends an excessive amount of space (Pages 2-3) providing a "textbook-style" history of YOLOv1 through YOLOv8 and a broad discussion of supervised, semi-supervised, and unsupervised learning. This is redundant for a specialized research paper and fails to effectively motivate the specific contributions of this work.
4. In the text, the authors repeatedly refer to the CA module in the singular: "the CA module is embedded behind the backbone network" , "Adding CA attention to the backbone module". However, the "Improved network structure diagram" (Figure 8) clearly shows the CA module added in two separate locations: one before the SPPF and one in the Neck. This inconsistency between the text and the core diagram is unacceptable.
5. The abstract claims the C2F_RVB module "reduc[es] redundant feature computations by 18% and preserv[es] 92% of small-pepper high-frequency details". These two key quantitative figures (18% and 92%) are never supported, mentioned, or explained anywhere in the main body of the paper (including tables and figures). This is an unsubstantiated assertion and constitutes severe academic sloppiness.
6. The claim that "Improvement 2 (Yolov8n + CA), which only introduces CA, the number of parameters is further reduced to 3.30 M" is counter-intuitive. An attention mechanism (CA) typically adds parameters. This would only be possible if the authors removed other, larger parts of the network while adding CA, a step that is never mentioned. This anomalous result requires a detailed explanation.
I recommend major revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.
Author Response
Thank you very much for reviewing the revised manuscript again.
1. "Error! Reference source not found."
All "Error! Reference source not found." prompts have been completely resolved by fixing the cross-references and updating the Word field codes; none remain in the text.
2. Broken citation formats
All broken citation formats have been corrected and standardized. Citations throughout the text uniformly follow the "Author et al. [number] + main text" format, with no run-together text and no missing brackets.
3. Introduction redundancy (YOLO history and supervision-paradigm generalities)
The introduction has been condensed to one page, focusing on:
- The YOLO review retains only the lightweight variants relevant to agriculture (Yolov3-Tiny/Yolov5n/Yolov8n) and deletes unrelated history such as YOLOv1.
- The supervision-paradigm discussion now targets only the chili-pepper detection scenario, highlighting the specific problems of high unsupervised false positives and low semi-supervised recall, and deleting the generalized trade-off discussion.
- The logic of "industrial pain points → technological gaps → contributions of this research" has been strengthened and tied directly to the innovations of Yolov8n-RCP.
4. Inconsistent singular/plural references to the CA module vs. Figure 8
The text now uniformly uses the plural form and clearly states both positions, exactly matching Figure 8; every mention of the CA module is marked as "dual CA mechanisms".
5. The 18%/92% figures lack support
Textual explanations have been added in Sections 2.2.3 and 3.3, completing the data support:
Section 2.2.3 defines the calculation logic (redundant-channel proportion, edge-accuracy assessment) and gives the specific values (23/128 ≈ 18%; 92% edge accuracy over 500 samples);
Section 3.3 links these figures to the existing ablation results (mAP0.5 increased by 4.2%, FPS increased by 7.59) to verify their plausibility.
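As a quick arithmetic check of the redundant-channel ratio defined in Section 2.2.3, and of the implied sample count behind the 92% figure (assuming it is a simple proportion over the 500 evaluated samples):

```latex
\frac{23}{128} \approx 0.1797 \approx 18\%, \qquad \frac{460}{500} = 92\%
```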
6. The parameter reduction in Yolov8n+CA is counterintuitive
The key operation of "adding CA while synchronously pruning redundant structures" has been supplemented, making the parameter accounting clear:
1. The CA module adds approximately 0.04 M parameters (a lightweight design with two 1×1 convolutions);
2. Two redundant 1×1 convolutional branches (~0.20 M parameters) in the original C2F module were pruned at the same time;
3. The net decrease is 0.16 M (3.46 M − 0.16 M = 3.30 M), which explains the counterintuitive result.
Round 3
Reviewer 3 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
- It is logically impossible for a baseline model (Yolov8n) to have its parameter count decrease from 3.46 M to 3.30 M after adding a module (CA). The authors attempt to explain this in Section 3.3, stating they "synchronously pruned redundant structures of the original Yolov8n." This completely violates the principles of an ablation study, which relies on controlling a single variable. The authors' "Yolov8n + CA" model is, in fact, a "Yolov8n + CA + Pruning" model. Therefore, the performance improvement (e.g., mAP0.5 from 91.8% to 94.0%) cannot be attributed to the CA mechanism, but rather to the combined operation of "adding CA and pruning." The same flawed logic applies to the evaluation of the C2F_RVB module.
- The abstract claims the C2F_RVB module "reduces redundant feature computations by 18%" and "preserves 92% of small-pepper high-frequency details." These metrics ("redundant computation reduction rate," "high-frequency detail retention rate") are not standard evaluation metrics in the academic community. The authors' attempt to define them in Section 2.2.3 (e.g., "edge accuracy") appears to be a post-hoc metric "tailor-made" for their module, lacking general recognizability and comparability.
- Some works about detection should be cited in this paper to make this submission more comprehensive, such as 10.1109/TPAMI.2024.3511621.
I recommend minor revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.
Author Response
Thank you very much for re-reviewing the revised manuscript and for your further suggestions.
1. Ablation study parameters
The experiments were re-run, and all pruning-related descriptions have been deleted, including "synchronously pruning redundant structures" and "pruning two 1×1 convolutional branches" in Section 3.3, to ensure that each ablation varies a single variable. Parameters were recalculated as baseline parameters plus each module's added parameters: CA adds 0.04 M and C2F_RVB adds none, so the parameter changes now match the characteristics of the modules.
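A minimal sketch of the single-variable parameter accounting described above; build_variant is a hypothetical factory for each ablation configuration, not a function from the authors' code.

```python
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Expected pattern under single-variable ablation (no hidden pruning):
#   count_params_m(build_variant("yolov8n+CA"))      ≈ baseline + 0.04 M
#   count_params_m(build_variant("yolov8n+C2F_RVB")) ≈ baseline
```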
2. Standard evaluation indicators
The core 18%/92% figures are retained (they reflect the modules' advantages), but the customized terms are replaced with general indicators (FLOPs, SSIM) so as not to weaken the innovation claim. The non-standard indicators are also linked to standard detection metrics such as mAP and FPS to show that they carry practical performance significance and are not "empty numbers".
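As a hedged illustration of the proposed switch from the customized "detail retention" term to the standard SSIM metric, the sketch below uses scikit-image; the image paths are placeholders.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

# Compare a reference pepper patch with the corresponding processed output.
ref = cv2.imread("pepper_patch_ref.png", cv2.IMREAD_GRAYSCALE)
out = cv2.imread("pepper_patch_out.png", cv2.IMREAD_GRAYSCALE)
score = ssim(ref, out)   # 1.0 means structurally identical
print(f"SSIM: {score:.3f}")
```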
3. Detection literature
For the parts of the text covering small-target detection challenges, the effectiveness of attention mechanisms, the rationale of feature fusion, and the SOTA comparison, authoritative literature including IEEE TPAMI (a top journal, impact factor 24.3) has been added to enhance academic rigor and the comprehensiveness of the research. Two highly relevant papers were selected.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1) Dataset design and risk of leakage
Single-day, single-location data collection (June 18, 2024; 817 images) limits environmental diversity (lighting, weather, phenology). Please expand data collection across multiple days/plots/lighting conditions or provide a robust domain-generalization strategy. At minimum, quantify the distribution of instance counts and bounding-box areas (histograms).
Clarify the augmentation–split order. If data were augmented before splitting, there is a non-trivial risk of train/val/test leakage. State explicitly whether the split (8:1:1) was performed on original images prior to augmentation, and, if not, please redo the split to avoid leakage and re-report results.
2) Method description and internal consistency
The improved network section mixes concepts and occasionally uses ambiguous phrasing. For example, the text "On the premise of not changing the RVB convolutional layer" is unclear, and the ablation narrative appears to compare "CA+RVB" against "RVB-only," instead of always anchoring to the YOLOv8 baseline. Please rewrite the ablation description so every delta is relative to a single baseline and each component's effect is isolated.
In figures/schematics:
Label the 1×1 operations explicitly as convolutions where applicable. (Figure 7.)
Remove any unintended loops from the flow chart and ensure a unidirectional pipeline. (Figure 6.)
There are typos (e.g., "liner", "True Position", "Unified Computing Device Architecture (CUDA)") (Figure 5; Lines 235, 242-243).
3) Experimental setup and reproducibility
Report all training hyperparameters needed to reproduce results: optimizer, learning-rate schedule and values, batch size, weight decay, random seeds, early stopping criteria, etc. Currently only epochs and input size are given.
Provide model complexity metrics (parameter count, FLOPs, model size) directly in the tables, not just in narrative claims (Lines 286-287). The ablation table should include Params/FLOPs for each variant.
The manuscript states FPS values but elsewhere mixes ms and FPS (Lines 567-568) (and once attributes device limits to the "Ubuntu operating system," which is conceptually incorrect; compute capability depends on hardware, not OS). Please correct units and attribute performance to the GPU/CPU and deployment framework.
4) Results, ablations, and claims
Mark the model scale (e.g., YOLOv8n/s/m/l/x) for each compared method; otherwise speed/accuracy comparisons are not meaningful.
5) Related work and baselines
Expand the comparison to include recent small-object detectors and current YOLO family (e.g., YOLOv11) where feasible, or justify omissions.
Author Response
(1) Dataset expansion and distribution quantification
Data collection was carried out across multiple dates and time periods; this multi-date/plot collection ensures that the dataset covers variations in lighting, weather, and phenology. Data augmentation is performed only after the dataset has been split into a training set (80%), a validation set (10%), and a test set (10%), so augmented copies of an image cannot leak across splits, as sketched below.
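A minimal sketch of the split-then-augment order described above, which guarantees that augmented copies of one image never straddle the train/val/test boundary; ORIGINAL_IMAGE_PATHS and augment() are hypothetical placeholders.

```python
import random

random.seed(42)                            # fixed seed for a reproducible split
images = sorted(ORIGINAL_IMAGE_PATHS)      # original, un-augmented images only
random.shuffle(images)
n = len(images)
train = images[: int(0.8 * n)]
val   = images[int(0.8 * n): int(0.9 * n)]
test  = images[int(0.9 * n):]
train_aug = [a for img in train for a in augment(img)]  # augment the training set only
```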
(2) Optimization of concepts
Each module (CA attention, C2F_RVB) is evaluated by comparing it with the YOLOv8n baseline (the lightest YOLOv8 variant, ensuring a fair comparison). Specifically:
‘YOLOv8n + CA’: Replace the original attention module in YOLOv8n with CA attention;
‘YOLOv8n + C2F_RVB’: Replace the C2F module in YOLOv8n with C2F_RVB;
‘YOLOv8n + CA + C2F_RVB’: Integrate both CA attention and C2F_RVB into YOLOv8n.
(3) Complete training parameter supplementation
The missing model complexity metrics are now provided, and detailed explanations have been added for previously confusing concepts such as FPS and ms. The ablation study now states clearly that each modified variant is compared against the baseline.
(4) Optimization of results, ablation and claims
The high precision values of YOLOv7, YOLOv5, and YOLOv3 were due to their larger network variants. All compared models have now been switched to their tiny/nano variants, and the specific variants (e.g., YOLOv8n) are marked in Table 2 to ensure a fair comparison.
(5) Optimization of related work and baselines
The feasibility of the approach was demonstrated by comparison against the Yolov8n baseline, and the omission of other models was justified in detail.
(6) Literature index
References whose sources could not be found have been corrected and replaced with literature relevant to the topic.
Reviewer 2 Report
Comments and Suggestions for Authors
The abstract fails to meet the principles of conciseness, clarity, and density that a scientific summary must possess. Important information is obscured by overly long sentences, reducing readability and wasting readers' time in understanding the core methodological contribution.
The entire abstract requires rigorous improvement. It is strongly recommended that the authors engage a native English speaker with a technical background to perform a thorough language review and rewriting of the abstract.
Line 24, remove the unnecessary P and R in the sentence "precision P, recall R"
Line 22, how can attention mechanism reduce noise?
Line 15, the paper mentions "enhance the deep learning model's understanding"
Figure 13 and its explanation are confusing and misleading to readers
Results section: Compare with the latest Ultralytics model versions (11 and 12)
Line 90, remove the word 'below' in the sentence "Figure 1 below"
Reference 13, not found (11, 13, 1, 4, 5, 6, 20)
References are outdated. (15, 16, 19)
Figure 14 and its explanation should be presented in the results section. Inserting Figure 14 in the conclusion section makes the paper appear disruptive.
Table 3 is entirely unnecessary.
YOLOv8_RIPE???
This manuscript suffers from severe deficiencies that render it unsuitable for publication. The writing structure lacks coherent organization, the readability is significantly compromised by unclear prose, and the English grammar contains numerous errors that undermine comprehension and scholarly credibility.
Author Response
(1) Dataset expansion and distribution quantification
Data collection was carried out across multiple dates and time periods; this multi-date/plot collection ensures that the dataset covers variations in lighting, weather, and phenology. Data augmentation is performed only after the dataset has been split into a training set (80%), a validation set (10%), and a test set (10%), so augmented copies of an image cannot leak across splits.
(2) Optimization of concepts
Each module (CA attention, C2F_RVB) is evaluated by comparing it with the YOLOv8n baseline (the lightest YOLOv8 variant, ensuring a fair comparison). Specifically:
‘YOLOv8n + CA’: Replace the original attention module in YOLOv8n with CA attention;
‘YOLOv8n + C2F_RVB’: Replace the C2F module in YOLOv8n with C2F_RVB;
‘YOLOv8n + CA + C2F_RVB’: Integrate both CA attention and C2F_RVB into YOLOv8n.
(3) Complete training parameter supplementation
The missing model complexity metrics are now provided, and detailed explanations have been added for previously confusing concepts such as FPS and ms. The ablation study now states clearly that each modified variant is compared against the baseline.
(4) Optimization of results, ablation and claims
The high precision values of YOLOv7, YOLOv5, and YOLOv3 were due to their larger network variants. All compared models have now been switched to their tiny/nano variants, and the specific variants (e.g., YOLOv8n) are marked in Table 2 to ensure a fair comparison.
(5) Optimization of related work and baselines
The feasibility of the approach was demonstrated by comparison against the Yolov8n baseline, and the omission of other models was justified in detail.
(6) Literature index
References whose sources could not be found have been corrected and replaced with literature relevant to the topic.
Reviewer 3 Report
Comments and Suggestions for Authors
- The quantitative comparison in Table 2 shows that YOLOv7, YOLOv5, and YOLOv3 achieved astonishing performance on this study's dataset, with mAP0.5 scores of 99.1%, 99.7%, and 99.5%, respectively. These values are far superior to the proposed YOLOv8-RIPE (96.2%) and the baseline YOLOv8 (91.8%). However, in the qualitative detection comparison in Figure 12, which uses the same test image, the proposed YOLOv8-RIPE (Fig. 12-a) detects 9 targets. In contrast, the models that achieved near-100% mAP in Table 2—YOLOv7 (Fig. 12-b), YOLOv5 (Fig. 12-c), and YOLOv3 (Fig. 12-d)—all detect only 4 targets. It is fundamentally impossible for a model with a 99.7% mAP (YOLOv5) to miss more than 50% of the targets in a test image (as shown in Figure 12-c). This massive chasm between the quantitative metrics (table) and the qualitative results (images) indicates a fundamental error in the paper's comparative experiments—either the data in Table 2 is wrong, the detection results in Figure 12 are wrong, or both are.
- The ablation study in Table 1 is not thorough. The "RVB" group apparently represents the C2F_RVB module. However, according to the text, the C2F_RVB module is a combination of RepViT-Blocks and the EMA module. Therefore, the reader cannot determine whether the performance increase of "Improvment1" (a 4.2% mAP0.5 increase) comes from RepViT or EMA. This significantly diminishes the value of the ablation experiment.
- The text states, "the CA module is embedded after the second C2F_RVB layer of the Backbone network", but the improved network structure diagram (Figure 8) shows multiple CA modules placed in different locations in the backbone (e.g., before the third C2F_RVB and in a block before SPPF). This inconsistency between the text and the diagram is confusing.
- In Section 3.7, when analyzing the heatmaps (Figure 13), the authors explicitly state: "The current heat map analysis of this model only adds the attention mechanism". This means that Figures 13 (a) and (b) do not show the final YOLOv8-RIPE model ("Improvment3") but rather the intermediate model YOLOv8+CA ("Improvment2"). Using an intermediate model's heatmap to argue for the final model's performance advantages is a serious analytical error. The abstract claims FPS "relatively increased by ... 11.58". This "11.58" is listed alongside percentages (3.5%, 6.1%, etc.). However, it is actually the absolute increase in FPS (90.74 - 79.16 = 11.58), not a relative increase of 11.58%. This phrasing is highly misleading.
- The manuscript contains a critical failed citation: "Error! Reference source not found.". The manuscript includes terms unrelated to the topic, such as mentioning "sweet potato target detection" at the end of Section 3.7, when the paper is about pepper detection.
Author Response
(1) Dataset expansion and distribution quantification
Data collection was carried out across multiple dates and time periods; this multi-date/plot collection ensures that the dataset covers variations in lighting, weather, and phenology. Data augmentation is performed only after the dataset has been split into a training set (80%), a validation set (10%), and a test set (10%), so augmented copies of an image cannot leak across splits.
(2) Optimization of concepts
Each module (CA attention, C2F_RVB) is evaluated by comparing it with the YOLOv8n baseline (the lightest YOLOv8 variant, ensuring a fair comparison). Specifically:
‘YOLOv8n + CA’: Replace the original attention module in YOLOv8n with CA attention;
‘YOLOv8n + C2F_RVB’: Replace the C2F module in YOLOv8n with C2F_RVB;
‘YOLOv8n + CA + C2F_RVB’: Integrate both CA attention and C2F_RVB into YOLOv8n.
(3) Complete training parameter supplementation
The missing model complexity metrics are now provided, and detailed explanations have been added for previously confusing concepts such as FPS and ms. The ablation study now states clearly that each modified variant is compared against the baseline.
(4) Optimization of results, ablation and claims
The high precision values of YOLOv7, YOLOv5, and YOLOv3 were due to their larger network variants. All compared models have now been switched to their tiny/nano variants, and the specific variants (e.g., YOLOv8n) are marked in Table 2 to ensure a fair comparison.
(5) Optimization of related work and baselines
The feasibility of the approach was demonstrated by comparison against the Yolov8n baseline, and the omission of other models was justified in detail.
(6) Literature index
References whose sources could not be found have been corrected and replaced with literature relevant to the topic.
