Open Access Article

Peer-Review Record

A Scale-Adaptive and Frequency-Aware Attention Network for Precise Detection of Strawberry Diseases

Agronomy 2025, 15(8), 1969; https://doi.org/10.3390/agronomy15081969

by Kaijie Zhang¹, Yuchen Ye¹, Kaihao Chen², Zao Li² and Hongxing Peng¹,*

Reviewer 1: Jyotika Purohit

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4: Anonymous

Submission received: 16 July 2025 / Revised: 7 August 2025 / Accepted: 14 August 2025 / Published: 15 August 2025

(This article belongs to the Special Issue Modern Control of Biotic Stress in Crops: Intelligent Detection and Precision Pesticide Application)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript entitled 'A Scale-Adaptive and Frequency-Aware Attention Network for Precise Detection of Strawberry Diseases and Pests' describes the successful design and validation of an efficient and accurate strawberry disease and pest detection framework, PPA-MC-YOLO. Although the purpose of the study is good, a few points need to be addressed properly.

The authors studied strawberry diseases only; no pest information was included. Thus, the title must be changed.
The keywords must be short and specific. Kindly modify these.
The authors are requested to cite proper references and follow the proper reference format.
In the introduction, the research gap is missing; it is mentioned only in the related work (review) section. This must be improved.
In many sections (introduction, methodology and results), the information must be written in paragraph form, not in point form. The authors are requested to rectify this.
The authors have used the strawberry disease database, where pictures of different diseases are available. How do the authors classify close symptoms, like leaf spot and angular leaf spot, and how much do these symptoms differ in terms of aetiology?
It will be very interesting to see how these models behave when complex symptoms appear on plants, as most biological systems are very complex and multiple diseases appear in the field at the same time.

Comments for author File: Comments.pdf

Author Response

Thank you for your valuable feedback. We have carefully reviewed your comments and revised the manuscript accordingly. Below are our responses to each of your points. (Changes made in response to your comments are highlighted in yellow in the manuscript.)

Question 1: The manuscript is entitled 'A Scale-Adaptive and Frequency-Aware Attention Network for Precise Detection of Strawberry Diseases and Pests'. However, the study only focuses on strawberry diseases and does not include any information on pests. Therefore, the title must be changed.

Response 1: Thank you for the reminder. I have removed "and Pests" from the title.

Question 2: The keywords must be short and specific. Kindly modify these.

Response 2: Thank you for your comments. The keywords section has been revised.

Question 3: Authors are requested to cite proper references and follow proper reference format.

Response 3: Thank you for your comments. The citations and references sections have been revised.

Question 4: In the introduction, the research gap is missing; it is mentioned only in the related work (review) section. This must be improved.

Response 4: Thank you for your guidance. I have added a research gap discussion to the introduction and highlighted it (approximately lines 60-100).

Question 5: In many sections (introduction, methodology and results), the information must be written in paragraph form, not in point form. Authors are requested to rectify them.

Response 5: Thank you for the reminder. I have completed the modification and highlighted it in yellow.

Question 6: Authors have used the strawberry disease database, where different disease pictures are available. How do authors classify the close symptoms, like leaf spot and angular leaf spot, and how much do these symptoms differ in terms of aetiology?

Response 6: Thank you for your reminder. I have marked it in yellow and added some explanations around lines 221-227.

The ability to distinguish between visually similar diseases like leaf spot and angular leaf spot is a key challenge that the PPA-MC-YOLO model is designed to address. The model uses its advanced attention mechanisms to identify subtle but critical differences in the symptoms.

Angular Leaf Spot is a bacterial disease caused by Xanthomonas fragariae. Its key feature is that the lesions are angular, bordered by leaf veins, and may appear translucent when held up to the light.

Leaf Spot is a fungal disease caused by Mycosphaerella fragariae. The spots are round to irregular with a distinct purplish-red border and a tan or gray center. They are not restricted by the leaf veins.

These distinct characteristics, though subtle, are what the model's scale-adaptive and frequency-aware attention components would focus on to ensure accurate classification.

Question 7: It will be very interesting to see how these models behave when complex symptoms appear on plants, as most biological systems are very complex and multiple diseases appear in the field at the same time.

Response 7: Thank you for your insightful comment. We fully agree that real-world agricultural conditions are complex, often involving plants with multiple diseases at once. To address this, we will incorporate a more robust evaluation of our model in the future.

First, we will expand our dataset to include images where a single strawberry plant or leaf exhibits multiple diseases simultaneously. This will allow us to rigorously test the model's ability to handle overlapping symptoms. Next, we will use multi-label classification metrics, such as mean Average Precision (mAP), to assess how accurately the model identifies and localizes all present diseases, not just the dominant one. Finally, we will acknowledge the challenges of multi-disease detection in our discussion and propose it as a key area for future research. This will involve exploring specialized modules designed to better handle these complex, real-world scenarios.
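For readers less familiar with the metric named above, per-class Average Precision (AP) and its mean over classes (mAP) can be sketched as follows. This is a minimal illustration, not the evaluation code used in the manuscript: it assumes each predicted box has already been marked as a true or false positive by IoU matching against the annotations, and the function names `average_precision` and `mean_average_precision` are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs, one per predicted
    box. The true/false-positive flag is assumed to come from an IoU matching
    step that is not shown here.
    num_ground_truth: number of annotated boxes of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Rank predictions from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    fp = 0
    precisions = []
    recalls = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision with the best precision at any higher recall,
    # which gives the usual monotone precision envelope.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the envelope wherever recall increases.
    ap = 0.0
    prev_recall = 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class):
    """per_class: non-empty dict of class name -> (detections, num_ground_truth)."""
    aps = [average_precision(d, n) for d, n in per_class.values()]
    return sum(aps) / len(aps)
```

Because AP is computed per class and then averaged, a leaf showing two diseases contributes ground-truth boxes to two classes, so a model that finds only the dominant disease loses recall on the other class.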

We sincerely appreciate this valuable feedback, as it will significantly improve the practical relevance of our study.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The robotization of machine operations in agriculture aligns with the global implementation of the Agriculture 5.0 concept. Therefore, the subject matter addressed in this article is highly relevant and important for agricultural practice. The authors employed one of the most well-known object identification algorithms, YOLO (specifically a modified version of YOLOv12), to accurately detect strawberry diseases and pests.

This algorithm, widely recognized for computer-based object detection, is successfully used in various fields, including robotics, surveillance systems, autonomous vehicles, and more (with numerous publications appearing in MDPI journals). It enables the rapid and accurate identification of specific objects in images or video footage.

This work should be considered a case study, as it focuses on a specific plant under particular environmental conditions. Therefore, generalization of the results and formulation of broader conclusions is limited.

The detection and identification of crop diseases is merely a preliminary step — the crucial aspect is their elimination. One might question whether the use of advanced algorithms such as YOLO is truly necessary in this context. After all, an average agronomist with a secondary-level education could identify the disease, recommend an appropriate treatment, determine the dosage, and specify the application date. What truly requires improvement are these agronomic interventions, which could be optimized using computer vision and machine learning techniques.

To achieve such optimization, a cost analysis should be performed, either using modern methods with dedicated software or through traditional approaches, which can be completed in just a few minutes. In many cases, a simple visual inspection of the plants on-site may suffice.

Furthermore, the reported 2% increase in detection accuracy falls within the margin of error and is statistically insignificant. A statistical analysis is needed to support this claim, but it is currently absent from the paper.

Finally, the study lacks an analysis of alternative methods. Why were algorithms such as CBAM, SENet, or Self-Attention not considered?

Editorial Notes:
Please correct the bibliographic citations in Chapters 1 and 2, as some errors are present.

Author Response

Question 1: This work should be considered a case study, as it focuses on a specific plant under particular environmental conditions. Therefore, generalization of the results and formulation of broader conclusions is limited. The detection and identification of crop diseases is merely a preliminary step — the crucial aspect is their elimination. One might question whether the use of advanced algorithms such as YOLO is truly necessary in this context. After all, an average agronomist with a secondary-level education could identify the disease, recommend an appropriate treatment, determine the dosage, and specify the application date. What truly requires improvement are these agronomic interventions, which could be optimized using computer vision and machine learning techniques.

Response 1: We acknowledge the importance of a cost analysis. While a simple visual inspection is possible, it is often unscalable, labor-intensive, and prone to human error when dealing with large-scale commercial farming. Our system offers an automated, consistent, and tireless solution. A cost-benefit analysis comparing the expenses of our system (hardware, software) against the benefits of improved yield, reduced labor costs, and lower chemical usage is an excellent suggestion. We will incorporate this as a key point for future work, as it would provide a more complete economic justification for the adoption of such robotic systems.

Question 2: Furthermore, the reported 2% increase in detection accuracy falls within the margin of error and is statistically insignificant. A statistical analysis is needed to support this claim, but it is currently absent from the paper.

Response 2: (Revised in Section 4.1, around lines 645-656) Thank you for your critical and insightful comment. We agree on the importance of demonstrating that our performance gains are not due to random chance. While performing multiple full training runs for a formal statistical test is computationally prohibitive in our setting, we have revised the manuscript to provide a multi-faceted analysis that demonstrates the practical and contextual significance of our observed 2.1% mAP improvement.

Our argument is structured around three key points, which are now detailed in Section 4.1.

Gains on a Strong Baseline

Our 2.1% improvement is achieved over a highly competitive YOLOv12 baseline. Advancing the state-of-the-art on such a strong foundation is inherently challenging, making even seemingly small gains highly valuable.

Targeted, Not Random, Improvements

The overall 2.1% average improvement masks dramatic gains on specific, critical, and hard-to-detect classes. As detailed in our per-class analysis in Table 2, our model achieves a remarkable 11.0% absolute AP increase on the 'powdery_mildew_fruit' class. This class is a key challenge due to its small, low-contrast symptoms. This demonstrates that our improvements are not random noise but are the direct result of our framework systematically addressing the problem's core difficulties.

Superior Performance-Efficiency Trade-off

Crucially, this 2.1% accuracy gain is achieved while simultaneously reducing model complexity (17.3% fewer parameters and 4.5% fewer GFLOPs). Achieving higher accuracy with a more efficient model represents a significant advancement in the performance-efficiency trade-off, highlighting the superiority of our architectural design.

We believe this detailed contextual analysis robustly demonstrates that our model's advantages are tangible, significant, and directly linked to our proposed innovations, rather than being a product of statistical noise.
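For context on the reviewer's request, one lower-cost check that does not require retraining is a paired bootstrap over the test images. The sketch below is illustrative only and is not something the authors report having done. It assumes a per-image score (for example, per-image F1) is available for both models on the same test images, which is a simplification because mAP itself is not a per-image average. The function name `paired_bootstrap_ci` and its parameters are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean(scores_b) - mean(scores_a).

    scores_a, scores_b: per-image scores of the two models on the SAME images,
    in the same order (assumed inputs, not data from the manuscript).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    # Work on paired differences so image difficulty cancels out.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample images with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

An interval that excludes zero suggests the gain is unlikely to be an artifact of which test images were sampled. It says nothing about variance between training runs, which only repeated training with different seeds can measure.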

Question 3: Finally, the study lacks an analysis of alternative methods. Why were algorithms such as CBAM, SENet, or Self-Attention not considered?

Response 3: Thanks for your comments. I have added them in lines 307 and 364.

We found that CBAM's serial processing could cause the loss of fine-grained detail, which is crucial for detecting small targets like tiny disease spots. To solve this, we designed the Parallel Pyramid Attention (PPA) module, which processes features at multiple scales concurrently, preserving small-target information more effectively.

Similarly, while traditional Self-Attention is powerful, its high computational cost ($O(n^2)$) makes it unsuitable for real-time applications. As an alternative, we developed the Monte Carlo Attention (MCAttn) module, which uses stochastic sampling to achieve similar contextual awareness with significantly reduced computational cost.

In short, our paper innovates by adapting the core ideas of existing attention mechanisms into new, specialized modules (PPA and MCAttn) that are better suited for the unique demands of high-precision, real-time agricultural object detection.

Editorial Notes: Please correct the bibliographic citations in Chapters 1 and 2, as some errors are present.

Response 4: Thank you for this essential correction. I have completed the revision of the citations and references.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Fragaria × ananassa should be in italics as a botanical name.
You must solve the issue with references and quotations. Now almost all the quotations show this message: Error! Reference source not found.
The Introduction is somewhat incomplete and lacks real examples of the drawbacks, with concrete rather than general statements for each. Besides, an Introduction should usually contain 15-20 references. You may resolve this issue by merging the Introduction and Related Work sections.
In Table 1 it would be interesting to add computational performance metrics such as memory and CPU load for each model, as well as model size in MB.
It would also be interesting to know how you are going to implement the developed model in practice; in particular, it is not clear how scalable it is or how it could be used in embedded systems, for example. At present the study seems to have only theoretical value, and no prospects for putting it into real-life use were outlined in the Discussion or Conclusion sections.

Author Response

Question 1: Fragaria × ananassa should be in italics as a botanic name.

Response 1: (line 39) We appreciate you pointing this out. This is a crucial formatting detail, and we apologize for the oversight. We will conduct a thorough review of the entire manuscript to ensure that the scientific name, Fragaria × ananassa, is consistently italicized wherever it appears.

Question 2: You must solve the issue with references and quotations. Now almost all the quotations show this message: Error! Reference source not found.

Response 2: We sincerely apologize for this technical error. This issue likely occurred during the file conversion process. We will meticulously check all citations and the reference list to ensure every quotation is correctly linked to its source and that all references are properly formatted and resolved.

Question 3: The Introduction is somewhat incomplete and lacks real examples of the drawbacks, with concrete rather than general statements for each. Besides, an Introduction should usually contain 15-20 references. You may resolve this issue by merging the Introduction and Related Work sections.

Response 3: (lines 43-44, 60-79) We agree with your suggestion. We have streamlined and integrated the "Introduction" and "Related Work" sections to create a more comprehensive and rigorous opening section. The new section not only includes more references but also provides more concrete examples to illustrate the limitations of existing methods and clearly defines the specific research gaps that our study aims to fill.

Question 4: In Table 1 it would be interesting to add the metrics of computation performance such as memory and CPU loads by each model, as well as model size in MB.

Response 4: (lines 621-628) Thank you for this insightful suggestion. We fully agree that providing more practical, hardware-related performance metrics would significantly enhance the engineering value and completeness of our study. Following your advice, we have conducted a new series of standardized benchmark tests for all models compared in our paper and have updated Table 1 accordingly.

The revised Table 1 now includes the following additional metrics:

Model Size (MB): The on-disk size of the final model weights, providing a direct measure of storage requirements.

VRAM (GB): The peak GPU memory usage during inference, which is a critical metric for deployment on resource-constrained devices.

CPU Usage (%): The average CPU core utilization during inference, offering insights into the model's performance on platforms without dedicated GPUs.

To ensure clarity and reproducibility, we have also updated the note below Table 1 to specify the precise conditions under which these new metrics were measured. Furthermore, we have briefly incorporated an analysis of these new metrics into the "4.5. Efficiency Analysis" section. These additions demonstrate that our proposed PPA-MC-YOLO model is not only superior in theoretical complexity (Parameters and GFLOPs) but also maintains a competitive on-disk size and memory footprint, further highlighting its suitability for real-world deployment.

Question 5: It would also be interesting to know how you are going to implement the developed model in practice; in particular, it is not clear how scalable it is or how it could be used in embedded systems, for example. At present the study seems to have only theoretical value, and no prospects for putting it into real-life use were outlined in the Discussion or Conclusion sections.

Response 5: (around line 1031) Thank you for your insightful comment regarding the practical application of our model. We completely agree that a discussion on implementation and scalability is crucial for demonstrating the real-world value of our research.

We have addressed this by performing a significant revision of our Discussion and Conclusion sections. In these updated sections, we have provided a detailed analysis of our model's potential for real-world implementation. Specifically, we have discussed its scalability for large-scale agricultural operations and its suitability for deployment on embedded systems, such as drones or agricultural robots. We have also outlined concrete prospects for its use in precision agriculture, thereby demonstrating that our study provides not only theoretical value but also a foundation for practical, real-world solutions.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

1. Specify what is new about using the already known pyramidal attention module (PPA) in ML. This module has long been used to detect small objects in images.
2. FD Conv was used before you to adapt filters. Specify the difference.
3. It is known that the MCAttn module is already used to reduce the computational costs of self-attention mechanisms. What is your difference?
4. SD Loss needs to be compared with other loss functions. Using the word optimization requires specifying which parameter and criterion. I did not see the optimization process.
5. In addition to the comparisons given in Table 1, you need to provide images that will show your advantage.
6. References need to be supplemented and expanded.

Author Response

Question 1: Specify what is new about using the already known pyramidal attention module (PPA) in ML. This module has long been used to detect small objects in images.

Response 1: Thank you for this very constructive and insightful feedback. We agree that the concept of pyramid attention is well-established in the literature for multi-scale feature extraction, and our original manuscript did not make this distinction clear enough. We apologize for this ambiguity.

Our contribution is not the invention of pyramid attention itself, but rather the design and application of a novel Parallel Pyramid Attention (PPA) module specifically engineered for the detection head of our model. The novelty of our PPA module lies in three key areas:

Novel Parallel Architecture: Unlike traditional pyramid structures that often process features serially, our PPA module employs a unique parallel multi-branch architecture. It consists of three synergistic branches running concurrently: a Global Context Branch, a Local Parallel Branch (with multiple dilation rates), and a Serial Pyramid Branch. This parallel design is specifically motivated by the challenges in agricultural vision, where the model must simultaneously process both microscopic targets (like pests) and large-area features (like leaf lesions) without information loss from sequential processing.

Targeted Application in the Detection Head: We have integrated this PPA module directly into the detection head, which is a departure from its common use within the backbone or neck. This allows the final detection stage to have a more powerful and adaptive focus on targets of varying scales, which is crucial for the final bounding box prediction.

Synergistic Integration: The effectiveness of our PPA module is demonstrated through its synergistic integration with the other components of our PPA-MC-YOLO framework, which collectively address the complex problem of strawberry disease detection.

To address your comment, we have made substantial revisions to the manuscript to clarify our specific contribution. The changes include:

Related Work (Section 2.3): We have added a discussion of existing pyramid attention models and explicitly differentiated our proposed parallel architecture.

Methods (Section 3.2.3): We have rewritten the "Design Motivation and Innovation" subsection to state that our work is inspired by established pyramid attention concepts but innovates with a parallel multi-branch design tailored for agricultural scenes.

Introduction, Abstract, and Discussion: We have rephrased these sections to ensure our claims are precise and accurately reflect our contribution.

We believe these revisions now accurately position our work within the existing literature and clearly highlight the novelty of our proposed PPA module. Thank you again for helping us improve the quality of our manuscript.

Question 2: FD Conv was used before you to adapt filters. Specify the difference.

Response 2: (lines 452-465) Thank you for raising this question. We agree that the concept of dynamic filters has been explored in previous research. However, our proposed implementation of Frequency-Dynamic Convolution (FDConv) features a key innovation.

While previous methods typically generate filters based on global image features, which incurs high computational costs, our approach is fundamentally different. Our FDConv module employs a novel frequency-diversity-based kernel construction method that operates without adding any parameters or computational cost, thus keeping the model lightweight.

By transforming the frequency-domain perspective into convolutional kernels with inherent diversity, our FDConv module is specifically designed to capture a wider range of textures and fine-grained patterns. This is particularly critical for distinguishing between visually similar but pathologically distinct diseases, a challenge often overlooked by general-purpose dynamic filter approaches. This unique design choice provides an efficient way to enrich feature representations and enhance the model's discriminative power.

Question 3: It is known that the MCAttn module is already used to reduce the computational costs of self-attention mechanisms. What is your difference?

Response 3: (lines 370-380) Thank you for your question. You are correct that Monte Carlo Attention (MCAttn) has been used to improve computational efficiency. The core innovation of our work lies in the specialized application and integration of MCAttn, making it highly compatible with our PPA-MC-YOLO framework.

Our proposed MCAttn paradigm addresses computational costs by integrating a novel stochastic sampling mechanism. Instead of computing attention scores for every possible pair of pixels, it randomly samples a subset of key-value pairs from different spatial locations. This method achieves similar contextual awareness while significantly reducing computational expense.

This is a critical design choice for our application, as it allows the module to efficiently capture multi-scale contextual information across the entire image. This enables our framework to maintain real-time inference speeds without sacrificing the robustness required for our targets, thus achieving a balance between performance and efficiency that is well-suited for dynamic agricultural environments.
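The general idea of subsampling key-value pairs can be illustrated in a few lines. The sketch below is a generic example of attention over a random subset of positions and should not be read as the MCAttn module from the manuscript, whose exact sampling scheme is not described in this record. The function names and the `num_samples` parameter are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def sampled_attention(queries, keys, values, num_samples, seed=0):
    """Attend each query to a random subset of key-value pairs.

    With n positions, full self-attention computes n * n scores; here each
    query computes at most num_samples scores, so the cost is O(n * num_samples).
    """
    rng = random.Random(seed)
    n = len(keys)
    m = min(num_samples, n)
    outputs = []
    for q in queries:
        # Draw m distinct positions uniformly at random for this query.
        idx = rng.sample(range(n), m)
        outputs.append(attention(q, [keys[i] for i in idx], [values[i] for i in idx]))
    return outputs
```

When `num_samples` is at least the number of positions, the result matches full attention; smaller values trade accuracy of the context estimate for fewer score computations.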

Question 4: SD Loss needed to be compared with other loss functions. Using the word optimization requires specifying which parameter and criterion. I did not see the optimization process.

Response 4: (lines 745-759 and 433-438) Thank you for pointing out the need for a more rigorous validation of our proposed Scale-Decoupled (SD) Loss. We acknowledge that a quantitative comparison is essential to substantiate its effectiveness. To address this, we have performed two key modifications:

Quantitative Comparison with State-of-the-Art Loss Functions:
We have conducted a new set of ablation studies where we benchmarked our SD Loss against several other widely used and state-of-the-art loss functions, including the default CIoU Loss, Focal Loss, and the advanced Wise-IoU (WIoU) Loss. These experiments were carried out on our full PPA-MC-YOLO framework to ensure a fair comparison. The results have been compiled into a new Table 4 within the "4.2. Ablation Study" section.

The results in Table 4 clearly demonstrate that SD Loss outperforms the other loss functions, particularly in the crucial metric of Average Precision for small objects (AP_small). This provides strong, empirical evidence that our method effectively addresses the challenge of scale-dependent localization instability, which was its primary design goal. We have also added a detailed analysis of these results in the text following Table 4.

Clarification of the "Optimization Process":
To resolve any ambiguity regarding the term "optimization," we have added a clarifying statement in the "3.2.5. Scale-Decoupled Loss (SD Loss)" section. We now explicitly state that "optimization" in this context refers to the optimization of the model training process, where the SD Loss function itself serves as an advanced strategy. We clarify that the parameters being optimized are the model's weights, and the criterion is the minimization of the total loss, guided by the scale-aware weighting mechanism of SD Loss.

Question 5: In addition to the comparisons given in Table 1, you need to provide images that will show your advantage.

Response 5: Thank you for this excellent suggestion. We completely agree that visual evidence is crucial for demonstrating the practical advantages of our proposed PPA-MC-YOLO model, going beyond the quantitative metrics presented in Table 1. A qualitative comparison effectively illustrates how our model overcomes specific, challenging detection scenarios where the baseline model falters.

To address your comment, we have included a dedicated section for visual analysis in our revised manuscript. Specifically:

A New Figure (Figure 6): We have presented a figure that provides a direct, side-by-side comparison between our PPA-MC-YOLO and the baseline YOLOv12. This figure showcases four carefully selected, challenging scenarios that are common in real-world agriculture:

Small and dense targets.

Concealed targets in complex backgrounds.

The co-existence of multi-scale targets.

The discrimination of visually similar (confusing) disease classes.

A Detailed Analysis Section (Section 4.4): We have provided a detailed, qualitative analysis to accompany Figure 6. In this section, we break down the visual results for each scenario, explaining precisely why our model performs better and linking these improvements directly back to our proposed innovations (the PPA module, MCAttn, and FDConv).

Question 6: References need to be supplemented and expanded.

Response 6: Thank you for this important feedback. We have already begun a comprehensive review of our references. We will supplement and expand our reference list with additional relevant and up-to-date publications, ensuring that our work is well-contextualized within the current state of the art in deep learning for agricultural applications.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed all the queries raised and incorporated the changes in the manuscript.

Reviewer 4 Report

Comments and Suggestions for Authors

I am almost satisfied with the answers to my concerns. The changes and additions made have significantly improved the perception of the results obtained. Recently, random point processes have been effectively used to predict plant disease, for example, R. Kosarevych, 2023 (doi.org/10.3390/rs15163941). I think this can be mentioned in the final version of the article.
