Article
Peer-Review Record

Lightweight Pepper Disease Detection Based on Improved YOLOv8n

AgriEngineering 2025, 7(5), 153; https://doi.org/10.3390/agriengineering7050153
by Yuzhu Wu 1, Junjie Huang 1, Siji Wang 1, Yujian Bao 1, Yizhe Wang 1, Jia Song 2 and Wenwu Liu 1,*
Submission received: 24 March 2025 / Revised: 25 April 2025 / Accepted: 6 May 2025 / Published: 12 May 2025
(This article belongs to the Topic Digital Agriculture, Smart Farming and Crop Monitoring)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes DD-YOLO, a lightweight model for chili pepper leaf disease detection. Starting from YOLOv8n, the authors integrate several recent modules, namely DCNv2 (Deformable Convolutional Networks v2), iRMB (Inverted Residual Mobile Block), the DySample upsampler, and LSKA (Large Separable Kernel Attention), to enhance detection accuracy and reduce model complexity. The modifications to YOLOv8n are rational and empirically validated. Aside from the limitations noted below, the paper is technically sound and well-conceived. With a few minor revisions, it would be a worthy candidate for publication in this journal.

- The manuscript is replete with grammatical and typographical errors that obscure clarity (e.g., "interms", "comper pest model", etc.). Proofreading or professional language editing is strongly recommended to improve readability.
- While the paper compares DD-YOLO to several other models, such as YOLOv5n and SSD, it does not compare against other current lightweight detectors such as YOLOv7-Tiny or MobileNet-SSD. These could be added to give a better view of the model's relative performance.
- While 2112 images were collected, the dataset itself is relatively small and possibly not representative of the full range of real-world agricultural settings. The authors are aware of this and suggest GAN-based augmentation and domain adaptation as future work. Could the dataset be updated with images from other sources?
- The paper would benefit from a more extended discussion of generalization, for example, model performance under varying lighting and occlusion, or on disease types not encountered during training.

Comments on the Quality of English Language

The manuscript is replete with grammatical and typographical errors that obscure clarity (e.g., "interms", "comper pest model", etc.). Proofreading or professional language editing is strongly recommended to improve readability.

Author Response

Reviewer Comment: This paper proposes DD-YOLO, a lightweight model for chili pepper leaf disease detection. Starting from YOLOv8n, the authors integrate several recent modules, namely DCNv2 (Deformable Convolutional Networks v2), iRMB (Inverted Residual Mobile Block), the DySample upsampler, and LSKA (Large Separable Kernel Attention), to enhance detection accuracy and reduce model complexity. The modifications to YOLOv8n are rational and empirically validated. Aside from the limitations noted below, the paper is technically sound and well-conceived. With a few minor revisions, it would be a worthy candidate for publication in this journal.

Response: We sincerely thank the reviewer for their positive and encouraging comments regarding the technical soundness and overall structure of our work. We are grateful that the rationale behind the integration of DCNv2, iRMB, DySample, and LSKA into the YOLOv8n framework was found to be valid and empirically well supported. We have carefully addressed all noted limitations and minor revision points in the revised manuscript, and we appreciate the reviewer’s constructive suggestions, which have helped us improve the clarity and impact of the paper.

 

Reviewer Comment: The manuscript is replete with grammatical and typographical errors that obscure clarity (e.g., "interms", "comper pest model", etc.). Proofreading or professional language editing is strongly recommended to improve readability.

Response: We sincerely appreciate the reviewer’s constructive comment regarding the grammatical and typographical issues in our manuscript. In response, we have thoroughly proofread the entire manuscript and corrected the identified errors, such as "interms" and "comper pest model," along with other typographical and grammatical issues throughout the text.

Additionally, we sought professional language editing services to ensure the manuscript meets high linguistic and clarity standards. We believe these revisions have significantly improved the readability and clarity of the manuscript.

We hope that these improvements address the reviewer's concerns and enhance the overall quality of the paper.

 

Reviewer Comment: While the paper compares DD-YOLO to several other models, such as YOLOv5n and SSD, it does not compare against other current lightweight detectors such as YOLOv7-Tiny or MobileNet-SSD. These could be added to give a better view of the model's relative performance.

Response: We sincerely thank the reviewer for this valuable suggestion. In the revised manuscript, we have added a comparative analysis of our proposed DD-YOLO model against two widely used lightweight detectors, YOLOv7-Tiny and MobileNet-SSD, to better illustrate the relative performance and advantages of our method. The results are summarized in the newly updated Table 6. As detailed in the manuscript (Section 3.3), DD-YOLO achieves a precision improvement of 9.2% over YOLOv7-Tiny and 19.8% over MobileNet-SSD. These enhancements further confirm the superior performance of DD-YOLO in terms of both detection accuracy and computational efficiency. We believe this addition provides a more comprehensive evaluation of the proposed model.

 

Reviewer Comment: While 2112 images were collected, the dataset itself is relatively small and possibly not representative of the full range of real-world agricultural settings. The authors are aware of this and suggest GAN-based augmentation and domain adaptation as future work. Could the dataset be updated with images from other sources?

Response: We greatly appreciate the reviewer’s insightful comments regarding the scale and representativeness of the dataset. In response, we would like to clarify that in addition to the 2112 field images initially collected from our experimental chili pepper park, we further expanded the dataset by sourcing a large number of supplementary images from publicly available online sources. These images were selected to reflect a wider range of lighting conditions, environmental variations, and disease manifestations in real-world agricultural settings, thereby improving the diversity and generalization capacity of the dataset.

To address the limited data volume and enhance model robustness, we also employed a Generative Adversarial Network (GAN)-based data augmentation strategy. As detailed in Section 2.1 of the revised manuscript, the GAN framework was utilized to generate synthetic disease images that mimic realistic lesion patterns. This technique significantly enriched the visual diversity of the training data and contributed to the improved performance of the proposed model.
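For readers less familiar with this technique, the generator side of a minimal DCGAN-style setup is sketched below; the layer sizes and the 64x64 patch resolution are placeholder assumptions for illustration and do not reflect the exact architecture described in Section 2.1:

    import torch
    import torch.nn as nn

    class Generator(nn.Module):
        # Map a latent vector z to a 64x64 RGB patch through strided transposed
        # convolutions (1x1 -> 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64).
        def __init__(self, z_dim: int = 100, ch: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0), nn.BatchNorm2d(ch * 8), nn.ReLU(True),
                nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1), nn.BatchNorm2d(ch * 4), nn.ReLU(True),
                nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1), nn.BatchNorm2d(ch * 2), nn.ReLU(True),
                nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1), nn.BatchNorm2d(ch), nn.ReLU(True),
                nn.ConvTranspose2d(ch, 3, 4, 2, 1), nn.Tanh(),
            )

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            return self.net(z)

    z = torch.randn(8, 100, 1, 1)
    fake_patches = Generator()(z)   # (8, 3, 64, 64), pixel values in [-1, 1]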

Furthermore, we fully recognize the importance of incorporating multi-regional data to better simulate field deployment scenarios. As discussed in the Discussion section, our future work will focus on domain adaptation and federated learning techniques to integrate chili pepper disease images from geographically diverse regions (e.g., Chongqing and Sichuan), especially those from mountainous areas with complex agro-ecological conditions. We believe these ongoing efforts will help further strengthen the practical applicability and generalization of our proposed model.

 

Reviewer Comment: The paper would benefit from a more extended discussion of generalization, for example, model performance under varying lighting and occlusion, or on disease types not encountered during training.

Response: We sincerely thank the reviewer for this insightful suggestion. We fully agree that generalization is a critical aspect of practical disease detection models, particularly in agricultural scenarios where lighting conditions, occlusion levels, and disease expression may vary significantly.

In response, we have enhanced our discussion in the revised manuscript (Section 4, Discussion) to more explicitly address the model’s generalization capabilities. As described in Section 2.1, during dataset construction we intentionally collected images under diverse field conditions—including variations in lighting (sunny, overcast, backlight), partial occlusion, and different shooting angles—to enrich environmental diversity and promote model robustness.

Furthermore, we applied a GAN-based data augmentation strategy to generate synthetic images with a wide range of lesion features. While this has proven effective in improving model training performance, we also acknowledge that GAN-generated data may introduce a domain gap when applied to real-world field environments. As discussed in Section 4, our future work will focus on domain adaptation and adversarial robustness techniques to better handle unseen disease types and unpredictable environmental variations.

We appreciate this valuable suggestion, which has helped us better articulate the strengths and current limitations of our model in real-world deployment settings.

Reviewer 2 Report

Comments and Suggestions for Authors
  • Include a dedicated section discussing the validation methods, software testing techniques, and software quality assurance methods used in validating your deep learning software.
  • The paper lacks clarity about the robustness testing of your approach, specifically the robustness of the DD-YOLO model against adversarial examples, noise, or real-world environmental variations.
  • There is insufficient coverage of important software attributes; please describe how your proposed DD-YOLO model addresses software maintainability, scalability (especially regarding dataset sizes and model deployment), and reliability in various field environments.
  • The experimental design does not explicitly validate software efficiency in a real-world deployment scenario. Metrics related to inference latency and resource utilization under realistic deployment scenarios on actual edge devices (beyond lab setups) are lacking.
  • Include additional experiments or at least a clear discussion on how synthetic datasets generated by GANs could impact the generalization of your software in practical, unseen scenarios.
  • Several technical descriptions, especially concerning network components (e.g., DCNv2, iRMB, DySample, LSKA), are excessively dense and difficult for non-specialist readers.
  • Ensure all figures are self-contained with sufficient explanatory captions.
  • Provide clearer information about the software tools used, as these directly impact reproducibility.

Author Response

Reviewer Comment: Include a dedicated section discussing the validation methods, software testing techniques, and software quality assurance methods used in validating your deep learning software.

Response: We appreciate the reviewer's observation regarding the need for rigorous validation and quality assurance in deep learning workflows. While our current study primarily focuses on model performance evaluation through standard practices, such as dataset splitting, metric-based evaluation, and ablation studies, we fully acknowledge the importance of adopting formal software testing techniques in future iterations.

In this work, all experiments were conducted under a consistent computing environment with fixed random seeds, uniform hyperparameters, and repeated runs to ensure reproducibility. Additionally, module-level evaluations were performed via ablation tests to validate the individual contribution of each component (Section 3.2). These steps constitute the core validation procedures commonly adopted in deep learning model development.
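For illustration, the seed-fixing setup we refer to follows the usual PyTorch pattern below (a minimal sketch; the seed value is a placeholder, and the exact settings used are documented in Section 2.3):

    import random

    import numpy as np
    import torch

    def set_seed(seed: int = 0) -> None:
        # Fix every RNG that affects training so repeated runs are comparable.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some throughput for deterministic cuDNN kernel selection.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

    set_seed(0)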

That said, we appreciate the reviewer’s emphasis on software quality assurance methods, which are especially relevant for large-scale deployment and maintenance. We plan to incorporate more structured testing pipelines—such as unit testing, robustness evaluation, and QA frameworks—in our future work to ensure better software engineering quality.

 

Reviewer Comment: The paper lacks clarity about the robustness testing of your approach, specifically the robustness of the DD-YOLO model against adversarial examples, noise, or real-world environmental variations.

Response: We thank the reviewer for highlighting the importance of robustness evaluation in practical deployment scenarios. We fully agree that assessing model stability under adversarial perturbations, noise, and real-world environmental variations is critical, particularly for field applications in agriculture.

In our current study, robustness was addressed primarily through the construction of a diverse dataset—including variations in lighting, occlusion, and background complexity—and further enhanced using GAN-based data augmentation. These measures aimed to improve the model’s generalization to natural variations encountered in real-world settings.

However, we acknowledge that the DD-YOLO model has not been formally evaluated against adversarial examples or controlled noise perturbations. We consider this a valuable direction for future work. In subsequent studies, we intend to incorporate robustness evaluation techniques such as adversarial noise injection, synthetic occlusion testing, and quantitative degradation analysis under challenging visual conditions, to comprehensively assess and enhance the resilience of our model.

 

Reviewer Comment: There is insufficient coverage of important software attributes; please describe how your proposed DD-YOLO model addresses software maintainability, scalability (especially regarding dataset sizes and model deployment), and reliability in various field environments.

Response: We sincerely thank the reviewer for emphasizing the importance of software engineering attributes such as maintainability, scalability, and reliability. In response to this suggestion, we have revised the manuscript to incorporate a more detailed discussion of these aspects, particularly in Sections 2.1, 3.3, 3.5, and 4.

Maintainability is addressed through the modular design of the DD-YOLO model, where each enhancement module (iRMB, DCNv2, DySample, LSKA) can be independently integrated or replaced within the YOLOv8n backbone. This facilitates future upgrades and model fine-tuning.

Scalability has been significantly strengthened in the revised manuscript. In Section 3.5, we now include lightweight deployment experiments on the Jetson Nano platform using TensorRT acceleration, demonstrating that the 4.8 MB model can perform real-time inference on resource-constrained edge devices. Additionally, we emphasize that the use of efficient operators such as DySample and iRMB lays a solid foundation for extending the model to larger datasets or real-time applications.

To address reliability, we elaborated in Section 2.1 and the Discussion that our dataset was built with extensive diversity in lighting conditions, camera angles, occlusions, and field backgrounds. Furthermore, we enhanced the Discussion section to include a forward-looking plan to implement formal robustness testing (e.g., adversarial noise, unseen disease generalization) and domain adaptation to further improve real-world performance consistency.

We are grateful for the reviewer’s constructive feedback, which has helped us improve the engineering relevance and practical value of our proposed method.

 

Reviewer Comment: The experimental design does not explicitly validate software efficiency in a real-world deployment scenario. Metrics related to inference latency and resource utilization under realistic deployment scenarios on actual edge devices (beyond lab setups) are lacking.

Response: We sincerely thank the reviewer for highlighting the importance of evaluating model efficiency under real-world deployment conditions. In response to this valuable suggestion, we have incorporated additional experiments in Section 3.5 of the revised manuscript, where we deploy the proposed DD-YOLO model on a Jetson Nano edge computing device using the TensorRT inference framework.

We now report key metrics such as inference latency, model size (in MB), and detection accuracy under both standard and accelerated conditions. For example, the TensorRT-accelerated DD-YOLO model achieves an inference latency of 67.6 ms per image with a model size of only 4.8 MB, while maintaining high detection performance. These results demonstrate the model's suitability for real-time deployment on resource-constrained embedded platforms.
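For illustration, the export-and-timing procedure follows the pattern sketched below (assuming the Ultralytics YOLO Python API; file paths, the test image, and iteration counts are placeholders rather than our exact benchmarking script):

    import time

    from ultralytics import YOLO

    # Export trained weights to a TensorRT engine ("best.pt" is a placeholder path);
    # the export must be run on the target device (e.g., the Jetson Nano itself).
    YOLO("best.pt").export(format="engine", half=True)
    model = YOLO("best.engine")

    # Warm up, then average per-image latency over repeated runs.
    for _ in range(10):
        model.predict("leaf.jpg", verbose=False)

    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model.predict("leaf.jpg", verbose=False)
    print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms/image")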

We fully agree that further testing in more diverse field environments and across a broader range of edge devices (e.g., mobile CPUs, IoT nodes) is an important direction, and we plan to explore this in future work.

 

Reviewer Comment: Include additional experiments or at least a clear discussion on how synthetic datasets generated by GANs could impact the generalization of your software in practical, unseen scenarios.

Response: We sincerely thank the reviewer for the valuable suggestion regarding the inclusion of additional experiments or a more detailed discussion on the impact of synthetic datasets generated by GANs on the generalization performance of our model in practical, unseen scenarios.

In response, we have conducted further analysis of the effect of GAN-based data augmentation on the model's generalization ability. As shown in Section 3.1.1 (Generalization Validation Experiment of GAN), we performed a comparative experiment between a model trained on the original dataset and one trained on a GAN-augmented dataset. The results are summarized in Table 3:
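    Training data            Precision   Recall   mAP@0.5
    GAN-augmented dataset      91.6%      88.9%    94.4%
    Original dataset           92.9%      92.3%    96.3%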

As indicated in the table, while the GAN-augmented dataset improved data diversity, the model performance in terms of precision, recall, and mAP@0.5 was slightly lower compared to the model trained on the original dataset. Specifically, the GAN-enhanced model achieved 91.6% precision, 88.9% recall, and 94.4% mAP@0.5, while the non-GAN model achieved 92.9%, 92.3%, and 96.3%, respectively. These results highlight that while synthetic data can help augment the dataset and enhance feature diversity, it may introduce domain-specific biases that negatively affect the model's ability to generalize to real-world, unseen conditions.

In addition to this analysis, we have expanded the discussion in Section 4 (Discussion) to better explain the potential impact of synthetic data on generalization. We have emphasized that although GAN-based data augmentation aids in addressing data scarcity, the synthetic samples may not fully capture the complex variability present in natural scenarios, such as lighting variations, occlusions, and disease-specific visual cues not seen in the training data.

We propose further improvements in future work, such as integrating domain adaptation strategies, enhancing GAN architectures to generate more realistic synthetic samples, and collecting additional real-world data from diverse environments to improve model generalization.

We hope these additions and clarifications address the reviewer's concerns and provide a more comprehensive discussion on the impact of synthetic datasets generated by GANs on the generalization of our model.

 

Reviewer Comment: Several technical descriptions, especially concerning network components (e.g., DCNv2, iRMB, DySample, LSKA), are excessively dense and difficult for non-specialist readers.

Response: We thank the reviewer for this valuable feedback. We fully acknowledge that several technical sections, particularly those describing the internal mechanisms of DCNv2, iRMB, DySample, and LSKA, may appear dense or challenging to non-specialist readers. In the revised manuscript, we have made targeted modifications to improve clarity and accessibility.

Specifically, we have simplified the technical descriptions by reducing overly detailed mathematical expressions, breaking down long sentences, and adding brief functional summaries before or after each module explanation (see Sections 2.2.2 and 3.2). For example, we now clarify the practical roles of iRMB and DySample within the feature extraction and upsampling pipeline, respectively, in more intuitive terms. We also ensured that figure captions and module diagrams (Figs. 4–7) contain concise explanations that complement the text, making it easier for readers to visually understand the function and structure of each component.

We greatly appreciate this suggestion, which has helped us improve the manuscript’s readability for a broader audience, while preserving technical accuracy.

 

Reviewer Comment: Ensure all figures are self-contained with sufficient explanatory captions. Provide clearer information about the software tools used, as these directly impact reproducibility.

Response: We thank the reviewer for highlighting these important points related to clarity and reproducibility.

In response, we have revised all figure captions—especially for Figures 3 through 7—to ensure they are more self-contained. Each caption now includes a concise description of key components, functional roles of modules (e.g., DySample, iRMB), and any directional flow (e.g., arrows representing data propagation). These improvements help readers better understand the figures without referring back to the main text.

To further enhance reproducibility, we have clarified the software environment in Section 2.3. Specifically, we state that all models were implemented in PyTorch 1.11.0 and deployed using the TensorRT framework on Ubuntu 20.04 with CUDA 11.3. We also indicate that all training and testing were conducted under the same system environment with fixed seeds and hyperparameters for consistency.

We believe these additions significantly improve the transparency and reproducibility of our experimental design, and we thank the reviewer for the constructive suggestions.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper proposes a lightweight pepper disease detection method (DD-YOLO) based on an improved YOLOv8n, which significantly improves the detection accuracy and computational efficiency of the model by introducing deformable convolution modules (DCNv2), inverted residual mobile blocks (iRMB), a dynamic sampling operator (DySample), and a large separable kernel attention mechanism (LSKA). The research is closely aligned with practical agricultural needs, the experimental design is reasonable, and the analysis of results is comprehensive, but there is still some room for improvement.

(1) Although DCNv2, iRMB, and other modules are introduced, the theoretical analysis of why these modules are suitable for chili disease detection is insufficient. It is suggested to supplement the relevant theoretical basis (such as the adaptability of DCNv2 to irregular disease-spot features).
(2) In the methods section, the mathematical descriptions of DySample and LSKA are brief; the specific calculation process or references to the relevant literature could be supplemented.
(3) The dataset contains only 2112 images; this small sample size may affect the generalization ability of the model. It is suggested to discuss the feasibility of data augmentation.
(4) It is not mentioned whether the sample distribution across the various diseases in the dataset is balanced. A class imbalance problem could adversely affect the detection accuracy of the model.
(5) The paper currently compares only traditional models such as SSD and Faster R-CNN; it is suggested to add comparisons with currently popular models (such as YOLOv9, YOLOv10, or MobileNet-based methods) to better demonstrate the advantages of this method.
(6) Some abbreviations (such as DySample and LSKA) are not given in full at first mention and should be defined in the text or in an abbreviation table.

Author Response

Reviewer Comment: This paper proposes a lightweight pepper disease detection method (DD-YOLO) based on an improved YOLOv8n, which significantly improves the detection accuracy and computational efficiency of the model by introducing deformable convolution modules (DCNv2), inverted residual mobile blocks (iRMB), a dynamic sampling operator (DySample), and a large separable kernel attention mechanism (LSKA). The research is closely aligned with practical agricultural needs, the experimental design is reasonable, and the analysis of results is comprehensive, but there is still some room for improvement.

Response: We sincerely thank the reviewer for their positive and constructive feedback. We are especially grateful for the recognition of our model design, its strong alignment with agricultural needs, and the comprehensiveness of our experimental analysis.

In response to the reviewer’s suggestion that there is still room for improvement, we have carefully revisited the manuscript and made several important revisions, including:
  • Simplifying technical descriptions to enhance accessibility for broader audiences;
  • Improving figure captions to be more self-contained and explanatory;
  • Providing detailed deployment experiments on edge devices (Jetson Nano) with latency metrics;
  • Expanding the discussion on model generalization, robustness under field conditions, and the impact of GAN-based data augmentation;
  • Clarifying the software tools and environments used to support reproducibility.

We greatly appreciate the reviewer’s thoughtful evaluation, which has helped us strengthen the quality and clarity of the manuscript.

 

(1) Reviewer Comment: Although DCNv2, iRMB, and other modules are introduced, the theoretical analysis of why these modules are suitable for chili disease detection is insufficient. It is suggested to supplement the relevant theoretical basis (such as the adaptability of DCNv2 to irregular disease-spot features).

Response: We thank the reviewer for this insightful comment. We agree that it is important not only to report performance improvements but also to explain why the selected modules are theoretically well-suited to the specific characteristics of chili pepper disease detection.

In the revised manuscript, we have supplemented the theoretical rationale behind the integration of each module:

DCNv2: Chili leaf diseases often manifest as irregularly shaped lesions, spots, and textures. Standard convolutional kernels with fixed receptive fields may struggle to adapt to such deformations. DCNv2 introduces spatially adaptive offsets, enabling the network to align with and extract features from irregular disease patterns (see the formulation sketched after this list). We have added this explanation in Section 2.2.2, along with a figure showing the role of deformable sampling in localizing variable shapes.

iRMB: The iRMB module enhances feature expressiveness while maintaining lightweight structure. In the context of small disease lesions or fine-grained leaf textures, iRMB’s combination of inverted residual connections and attention-guided convolution improves sensitivity to subtle features. This is particularly important for detecting early-stage or small-scale symptoms. We have included this justification in Section 2.2.2.

DySample and LSKA: DySample’s hardware-agnostic, lightweight sampling strategy helps retain fine-grained information during upsampling, and LSKA enhances multi-scale semantic fusion through large kernel factorization, which is beneficial for lesions of varying sizes. These points are clarified in their respective module descriptions.
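For reference, the spatially adaptive sampling behind the DCNv2 argument follows the standard modulated deformable convolution of the original DCNv2 paper: for each output location $p$, with $p_k$ denoting the $K$ fixed kernel offsets,

$$ y(p) = \sum_{k=1}^{K} w_k \, x\left(p + p_k + \Delta p_k\right) \Delta m_k , $$

where the offsets $\Delta p_k$ and modulation scalars $\Delta m_k \in [0, 1]$ are predicted from the input, letting the effective receptive field bend toward irregularly shaped lesions rather than staying on a fixed grid.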

These theoretical foundations have also been echoed in the revised Discussion section, where we further relate model design choices to the biological characteristics of chili disease symptoms.

We are grateful for this suggestion, which has helped us strengthen the explanatory rigor of the methodology section.

 

(2) Reviewer Comment: In the methods section, the mathematical descriptions of DySample and LSKA are brief; the specific calculation process or references to the relevant literature could be supplemented.

Response: We thank the reviewer for pointing out this important detail. We fully agree that a clearer and more complete mathematical description is beneficial for the understanding and reproducibility of our work.

In the revised manuscript, we have expanded the descriptions of the DySample and LSKA modules in Section 2.2.2:

For DySample, we provided more detailed explanations of the offset calculation process, including the roles of the linear projection, the sampling offset O, and the final sampled positions S = G + O. We also clarified the definitions of each variable in the equations and added a reference to the original DySample work [24].
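To make this concrete for readers, a simplified PyTorch rendering of the S = G + O sampling step is given below (an illustrative sketch of the idea from [24]; channel counts and the scale factor are placeholders, and this is not the implementation used in our code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DySampleSketch(nn.Module):
        # Predict per-position offsets O from the input, add them to a regular
        # grid G, and bilinearly resample the features at S = G + O.
        def __init__(self, channels: int, scale: int = 2):
            super().__init__()
            self.scale = scale
            # Linear (1x1) projection producing (x, y) offsets per upsampled position.
            self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, _, h, w = x.shape
            H, W = h * self.scale, w * self.scale
            o = F.pixel_shuffle(self.offset(x), self.scale)      # O: (b, 2, H, W)
            ys = torch.linspace(-1, 1, H, device=x.device)
            xs = torch.linspace(-1, 1, W, device=x.device)
            gy, gx = torch.meshgrid(ys, xs, indexing="ij")
            g = torch.stack((gx, gy))                            # G: (2, H, W), normalized coords
            s = (g + o).permute(0, 2, 3, 1)                      # S = G + O: (b, H, W, 2)
            return F.grid_sample(x, s, align_corners=True)

    up = DySampleSketch(64)(torch.randn(1, 64, 20, 20))          # -> (1, 64, 40, 40)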

For LSKA, we supplemented the explanation of how the parallel depth-wise dilated convolutions and softmax-based attention weighting are used to fuse multi-scale semantic features. We added citations to the original LSKA design paper [25] and clarified the use of dilated kernels and global descriptors in the attention computation.
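Similarly, the separable large-kernel idea can be sketched as follows (a minimal PyTorch illustration of one common LSKA-style factorization; the kernel size k and dilation d below are placeholder values, not the exact configuration from [25] or from our module):

    import torch
    import torch.nn as nn

    class LSKASketch(nn.Module):
        # Approximate a large k x k depth-wise kernel by cascaded 1-D depth-wise
        # convolutions (plain, then dilated), followed by a 1x1 projection; the
        # result gates the input multiplicatively as an attention map.
        def __init__(self, dim: int, k: int = 23, d: int = 3):
            super().__init__()
            m = 2 * d - 1                  # extent of the local (non-dilated) kernels
            n = k // d                     # extent of the dilated kernels
            self.dw_h = nn.Conv2d(dim, dim, (1, m), padding=(0, m // 2), groups=dim)
            self.dw_v = nn.Conv2d(dim, dim, (m, 1), padding=(m // 2, 0), groups=dim)
            self.dwd_h = nn.Conv2d(dim, dim, (1, n), padding=(0, (n // 2) * d),
                                   dilation=d, groups=dim)
            self.dwd_v = nn.Conv2d(dim, dim, (n, 1), padding=((n // 2) * d, 0),
                                   dilation=d, groups=dim)
            self.proj = nn.Conv2d(dim, dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            attn = self.dw_v(self.dw_h(x))        # local context
            attn = self.dwd_v(self.dwd_h(attn))   # long-range context via dilation
            return x * self.proj(attn)            # attention-style gating

    out = LSKASketch(64)(torch.randn(1, 64, 40, 40))   # same shape as the input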

These improvements aim to make the mathematical formulations more informative and accessible for readers, especially those wishing to reproduce or adapt the modules. We greatly appreciate this helpful suggestion.

 

(3) Reviewer Comment: The dataset contains only 2112 images; this small sample size may affect the generalization ability of the model. It is suggested to discuss the feasibility of data augmentation.

Response: We thank the reviewer for pointing out this important concern regarding dataset size and model generalization. While our original field collection included 2112 images, we have significantly expanded the dataset through two methods:
  • Supplementation with publicly available images from online sources, increasing the total dataset size to 8987 images with a diverse range of lighting, background, and disease conditions;
  • Data augmentation using a Generative Adversarial Network (GAN), as illustrated in Figure 2 and described in Section 2.1. This approach generated additional samples with realistic lesion patterns and varied visual features, helping improve model generalization and robustness.

We also discussed the dual impact of GAN-based augmentation in the revised Discussion section. While synthetic data improved training diversity, it also introduced some distributional shift when applied to natural field images. Therefore, we plan to further refine the augmentation pipeline in future work—possibly combining GAN-based synthesis with traditional augmentation techniques and domain adaptation—to enhance model performance under unseen scenarios.

We appreciate the reviewer’s suggestion, which aligns well with the practical challenges of limited real-world data in agricultural applications.

 

(4) Reviewer Comment: It is not mentioned whether the sample distribution across the various diseases in the dataset is balanced. A class imbalance problem could adversely affect the detection accuracy of the model.

Response: We thank the reviewer for highlighting this important point regarding dataset class distribution. In our dataset construction process (see Table 1 in Section 2.1), we ensured that the sample sizes across the six categories of chili pepper diseases are reasonably balanced. For instance, the number of images per class ranges from 1400 to 1643, and the corresponding annotation counts vary from 10,494 to 14,010 labels. This distribution helps reduce the risk of significant class imbalance affecting the model performance.

In addition, the use of GAN-based data augmentation and supplemental web-sourced images was conducted with attention to preserving inter-class balance, to avoid amplifying any existing discrepancies. We further evaluated the detection performance for each class individually (see Table 5 in Section 3.5), and the results indicate consistently high precision and recall across all categories, suggesting that class distribution did not adversely impact detection accuracy in this case.

Nonetheless, we acknowledge that slight imbalances may still exist, especially in terms of lesion complexity or intra-class variability. We have added a brief clarification in the revised manuscript (Section 2.1 and Discussion), and we plan to explore the application of adaptive loss weighting and targeted augmentation techniques in future work to further mitigate potential imbalance effects.

 

(5) Reviewer Comment: The paper currently compares only traditional models such as SSD and Faster R-CNN; it is suggested to add comparisons with currently popular models (such as YOLOv9, YOLOv10, or MobileNet-based methods) to better demonstrate the advantages of this method.

Response: We appreciate the reviewer's suggestion to expand the model comparison to include more recent and lightweight detection frameworks. In fact, beyond traditional baselines such as SSD and Faster R-CNN, our work already includes comparisons with several modern, widely used models that are representative of current lightweight detection research.

Specifically, we compare DD-YOLO with:
  • YOLOv8n, the baseline model on which our work is based;
  • YOLOv10n, a recently proposed model designed to balance accuracy and efficiency;
  • YOLOv5n and YOLOv7-tiny, two popular compact versions of the YOLO family;
  • MobileNet-SSD, a classic lightweight architecture for mobile and embedded deployment.

These comparisons are presented in Table 6 (Section 3.3) and evaluated using four standard detection metrics: precision, recall, mAP@0.5, and mAP@0.5:0.95. Results show that DD-YOLO consistently outperforms the compared models across all metrics, demonstrating its superior performance in both accuracy and computational efficiency.
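For reference, these metrics follow their standard definitions:

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_0^1 p(r)\,\mathrm{d}r, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i , $$

where mAP@0.5 evaluates AP at an IoU threshold of 0.5, and mAP@0.5:0.95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.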

While YOLOv9 was not included due to the lack of a stable and reproducible public release at the time of our experiments, we acknowledge its potential and will consider including it in future extensions of this work.

We have revised the manuscript to more clearly emphasize the inclusion of these modern baselines and thank the reviewer again for this helpful suggestion.

 

(6) Reviewer Comment: Some abbreviations (such as DySample and LSKA) are not given in full at first mention and should be defined in the text or in an abbreviation table.

Response: We thank the reviewer for this helpful observation regarding abbreviation clarity. In the revised manuscript, we have carefully reviewed all technical terms and ensured that each abbreviation—such as DCNv2 (Deformable Convolutional Networks v2), iRMB (Inverted Residual Mobile Block), DySample (Dynamic Sampling Operator), and LSKA (Large Separable Kernel Attention)—is fully defined at its first occurrence in the main text, particularly in the Abstract and Section 2.2.2.

Additionally, we have revised the corresponding figure captions (e.g., Figures 3–7) to include expanded definitions for key modules, and added clarifying phrases to improve readability for first-time readers.

We appreciate the reviewer’s attention to detail, which helped us improve the overall clarity and professionalism of the manuscript.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors
  • Explicitly state the limitations in robustness against adversarial attacks and noise perturbations in the Discussion or Conclusion, emphasizing the practical implications rather than deferring the issue entirely to future work.
  • Potential biases and limitations should be discussed explicitly, especially the risks associated with synthetic-data bias in practical deployments.
  • A final round of careful proofreading or a professional editing service is suggested to ensure the manuscript fully meets publication standards.
Comments on the Quality of English Language

There are still a few grammatical and stylistic errors scattered throughout the manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.docx
