A Scale-Adaptive and Frequency-Aware Attention Network for Precise Detection of Strawberry Diseases

Zhang, Kaijie; Ye, Yuchen; Chen, Kaihao; Li, Zao; Peng, Hongxing

doi:10.3390/agronomy15081969


by Kaijie Zhang ¹, Yuchen Ye ¹, Kaihao Chen ², Zao Li ² and Hongxing Peng ¹,*

¹ College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China

² College of Software Engineering, South China Agricultural University, Guangzhou 510642, China

* Author to whom correspondence should be addressed.

Agronomy 2025, 15(8), 1969; https://doi.org/10.3390/agronomy15081969

Submission received: 16 July 2025 / Revised: 7 August 2025 / Accepted: 14 August 2025 / Published: 15 August 2025

(This article belongs to the Special Issue Modern Control of Biotic Stress in Crops: Intelligent Detection and Precision Pesticide Application)


Abstract

Accurate, automated disease detection is crucial for sustainable strawberry production. However, the small size, mutual occlusion, and high intra-class variance of symptoms in complex agricultural environments make this difficult, and mainstream deep learning detectors often perform poorly under these conditions. To address this gap, we propose a detection framework designed for accuracy and robustness, built on four key innovations. First, we propose an attention-driven detection head featuring our Parallel Pyramid Attention (PPA) module. Inspired by pyramid attention principles, the module’s parallel multi-branch architecture is designed to overcome the limitations of serial processing: it simultaneously integrates global, local, and serial features to generate a fine-grained attention map, improving the model’s focus on targets of varying scales. Second, we enhance the core feature fusion blocks by integrating Monte Carlo Attention (MCAttn), which helps the model recognize targets across diverse scales. Third, to improve the feature representation capacity of the backbone without increasing the parametric overhead, we replace standard convolutions with Frequency-Dynamic Convolutions (FDConv), which construct highly diverse kernels in the frequency domain. Finally, we employ the Scale-Decoupled Loss function to optimize training dynamics. By adaptively re-weighting the localization and scale losses based on target size, we stabilize the training process and improve the precision of bounding box regression for small objects. Extensive experiments on a challenging strawberry disease dataset demonstrate that our proposed model achieves a mean Average Precision (mAP) of 81.1%, an improvement of 2.1% over the strong YOLOv12-n baseline, highlighting its practical value as an effective tool for intelligent disease protection.

Keywords:

strawberry disease detection; deep learning; attention mechanism; small object detection; precision agriculture

1. Introduction

Strawberry (Fragaria × ananassa Duch.) is a globally significant economic crop, highly valued for its nutritional content and substantial financial benefits [1]. However, various diseases persistently challenge the strawberry industry, leading to significant yield losses, degraded fruit quality, and increased production costs [2]. Consequently, developing methods for rapidly and accurately identifying these phytopathological threats is imperative for realizing precision agriculture and sustainable development [3,4]. Such methods can facilitate timely interventions to mitigate economic losses and reduce the reliance on broad-spectrum pesticides [5]. Traditional methods, such as manual field scouting and diagnosis, are often labor-intensive, time-consuming, subjective, and heavily reliant on expert experience, making them inadequate for the fine-grained management demands of modern, large-scale agriculture [6].

In recent years, the advancement of artificial intelligence (AI), particularly deep learning (DL) and computer vision techniques, has ushered in a new era for automated agricultural monitoring [7,8]. Single-stage object detection models, exemplified by the You Only Look Once (YOLO) series, have demonstrated immense potential for rapidly localizing and identifying various objects in images, including plant diseases, owing to their excellent balance between speed and accuracy [9,10,11].

Nevertheless, deploying existing detection models in complex, real-world agricultural environments—especially for fine-grained tasks like strawberry disease recognition—still confronts several fundamental and persistent challenges. These unresolved issues form the core motivation for this research:

A primary issue (Gap 1) is the difficulty of detecting small, dense, and occluded targets. Early-stage symptoms and pests, such as aphids, thrips, or initial powdery mildew spots, manifest as small, densely distributed, or heavily occluded targets. Standard detection heads often struggle to focus on these critical yet faint visual cues, leading to a high missed-detection rate [12]. Furthermore (Gap 2), the visual characteristics of these diseases can vary dramatically in size, from minuscule spots to large lesions, so multi-scale contextual information is insufficiently captured [13]. Because of their inherently local receptive fields, conventional convolutional networks have difficulty integrating this multi-scale information, which is crucial for distinguishing pathological features from background noise. This problem is compounded by scale-dependent localization instability (Gap 3), where minor pixel shifts in bounding boxes for small targets can cause drastic fluctuations in Intersection over Union (IoU) values. This IoU sensitivity disrupts the positive-negative sample assignment during training, leading to model instability and ultimately compromising localization accuracy [14]. Finally (Gap 4), standard convolution operations often produce redundant or insufficiently diverse features, resulting in limited feature representation capability. This limitation hinders the model’s ability to discriminate between visually similar yet pathologically distinct diseases or pests, particularly amidst complex background textures [15].

To address the challenges above, this paper proposes a novel detection framework named PPA-MC-YOLO, which has been meticulously optimized for strawberry disease recognition. The framework is built upon the high-performance YOLOv12 baseline and is enhanced by systematically integrating four synergistic innovations to overcome the above challenges. The primary contributions of this work are summarized as follows:

We introduce a novel Parallel Pyramid Attention (PPA) module specifically designed for the detection head. While inspired by established pyramid attention concepts, our module employs a unique multi-branch parallel architecture to simultaneously capture global context, multi-scale local details, and hierarchical information. This design is tailored to enhance the model’s focus on the small, dense, and occluded targets characteristic of agricultural scenes, directly addressing Gap 1.

To improve multi-scale fusion, we integrate Monte Carlo Attention (MCAttn). It leverages stochastic sampling pooling to generate scale-invariant attention, improving contextual awareness for targets of varying sizes, thereby resolving Gap 2.

We introduce the Scale-Decoupled (SD) Loss function to stabilize small object training. Dynamically re-weighting losses effectively mitigates the training instability caused by IoU sensitivity, thus overcoming Gap 3.

To enrich feature representations, we adopt Frequency Dynamic Convolution (FDConv). Without increasing the computational burden, this method generates diverse convolutional kernels, substantially enhancing the model’s feature discriminability for similar diseases, aiming to fill Gap 4.

Extensive experiments on a public strawberry disease dataset demonstrate that our proposed PPA-MC-YOLO model significantly outperforms a range of state-of-the-art detectors, including NanoDet-Plus, MobileNetV2-SSD, RT-DETR-R18, YOLOv8-n, YOLOv10-n, YOLOv11-n, and our baseline model, YOLOv12-n. These results substantiate the effectiveness of our approach and highlight its substantial potential as a reliable tool for intelligent agricultural management and precision crop protection.

2. Related Work

This section systematically reviews relevant research in plant disease detection, focusing on three primary areas: traditional machine learning-based detection methods, deep learning-based object detection methods, and advanced techniques targeting small objects and multi-scale challenges. By outlining the progress and limitations of existing studies, it clarifies the research motivation and innovative value of the proposed method.

2.1. Traditional Methods for Plant Disease Detection

Before the widespread adoption of deep learning, the automated detection of plant diseases predominantly relied on traditional computer vision and machine learning techniques. These methods typically followed a fixed pipeline: image preprocessing, lesion segmentation, feature extraction, and classifier training. Researchers utilized visual cues such as color, texture, and shape to distinguish between healthy and diseased tissues. For instance, Jamjoom et al. [16] employed K-means clustering for lesion segmentation and a Support Vector Machine (SVM) to classify plant leaf diseases based on color and texture features. Similarly, Yogeshwari and Thailambal [17] achieved plant disease identification by extracting color histograms and Gray-Level Co-occurrence Matrix (GLCM) texture features.

However, these traditional methods suffer from significant limitations. Firstly, they depend heavily on hand-crafted features, which generalize poorly and struggle to adapt to the complex and variable conditions of field environments, such as fluctuations in lighting, background interference, and the morphological diversity of diseases. Secondly, the accuracy of lesion segmentation directly impacts the final detection performance, yet achieving robust segmentation against complex backgrounds remains a challenging problem. Lastly, these methods can typically only determine whether a disease is present in an image; they struggle to accurately localize its position or identify multiple instances, and therefore fail to meet the requirements of precision spraying.

2.2. Deep Learning-Based Object Detection Methods

With the rise of deep learning, Convolutional Neural Networks (CNNs) have achieved breakthrough progress in plant disease detection due to their powerful automatic feature extraction capabilities. Deep learning-based object detection models can identify disease categories and precisely localize their positions within an image, providing robust technical support for precision agriculture. These models are primarily categorized into two types:

Two-Stage Detectors: This category is represented by the R-CNN series, including Fast R-CNN [18] and Faster R-CNN [19]. They generate a series of region proposals and then perform classification and bounding box regression on these regions. For example, Fuentes et al. [20] successfully detected multiple diseases in tomatoes using Faster R-CNN, validating its effectiveness in complex backgrounds. Although two-stage detectors generally achieve high detection accuracy, their high computational complexity and slow inference speeds make them challenging to deploy for real-time detection applications.

Single-Stage Detectors: This category of models directly predicts bounding boxes and classes on the image, bypassing the region proposal generation step, which makes them significantly faster. Representative models include the Single Shot MultiBox Detector (SSD) [21] and the You Only Look Once (YOLO) series [9,22]. YOLO, in particular, has become a mainstream choice in agricultural applications due to its excellent balance between speed and accuracy. Numerous researchers have proposed improvements based on YOLO. For instance, Guo et al. [23] developed YOLO-T, an enhanced YOLOv7 variant that improved tea leaf disease detection through architectural optimizations. Similarly, Tao et al. [24] achieved higher precision in bell pepper disease detection by adapting YOLOv5’s feature extraction and training methodology.

Despite the remarkable success of YOLO-based models, they still face challenges when dealing with targets characterized by small size, dense distribution, and occlusion, as is common in strawberry disease detection. Standard YOLO models have room for improvement in multi-scale feature fusion and attention to minuscule targets, which is a key motivation for the present work.

2.3. Advanced Techniques for Small Object and Multi-Scale Detection

Researchers have proposed various enhancement strategies to address the difficulties of small object detection and poor multi-scale adaptability, with attention mechanisms and advanced feature fusion structures being the primary approaches.

Attention Mechanisms: Mimicking the human visual system, attention mechanisms enable models to adaptively focus on critical information in an image while suppressing irrelevant background noise. Squeeze-and-Excitation (SE) [25] recalibrates channel-wise features by learning inter-channel correlations. The Convolutional Block Attention Module (CBAM) [26] further enhances feature discriminability by combining channel and spatial attention. These attention modules have been widely applied in agricultural object detection to augment the response to key regions such as disease lesions. However, most existing attention mechanisms are general-purpose designs and may not fully account for the specific patterns of pathological features in agricultural scenes.

Multi-Scale Feature Learning: Feature Pyramid Network (FPN) [27] and its variants, such as PANet [28] and AugFPN [29], enhance the model’s ability to detect multi-scale targets by fusing feature maps from different levels. These structures effectively combine high-level semantic information with low-level detailed information. Furthermore, some studies have explored more flexible feature aggregation methods. For example, Stochastic Pooling [30] was introduced to prevent overfitting by incorporating randomness and has been shown to help models learn more robust feature representations. The MCAttn module proposed in this paper draws inspiration from this idea, constructing scale-invariant attention maps through stochastic sampling.

While the aforementioned feature pyramid networks improve multi-scale representation, Pyramid Attention mechanisms have also been explored to explicitly capture multi-scale context within attention modules themselves. These methods are effective but often rely on a serial or hierarchical structure to fuse pyramid features. Our work diverges from this trend by proposing a parallel architecture for pyramid attention within the detection head. This parallel design is motivated by the specific need in agriculture to concurrently process microscopic disease spots and large lesions, which we argue is more effective than sequential fusion for preserving fine-grained details of small targets.

2.4. Current Limitations of Strawberry Disease Detection Methods

While significant progress has been made in deep learning-based object detection, current mainstream detectors still exhibit limitations when applied to the precise detection of strawberry diseases in complex agricultural environments. Specifically, these methods often struggle with the inherent challenges of small symptom sizes, mutual occlusion among plant parts and symptoms, and the high intra-class variance of disease manifestations. Existing approaches, while effective in controlled settings, frequently show reduced performance and robustness when confronted with these real-world complexities. These identified limitations underscore the pressing need for a more robust and adaptive detection framework capable of overcoming these hurdles, which serves as the primary motivation for the present study.

In conclusion, the work presented in this paper provides a high-performance technical solution for strawberry disease detection, but, more importantly, its modular design philosophy offers new insights and empirical support for resolving common technical challenges prevalent in agricultural vision tasks.

3. Materials and Methods

3.1. Dataset

This study utilizes a publicly available strawberry disease image dataset titled “Strawberry Disease,” which is accessible on the Roboflow Universe platform to support research in the automated detection of strawberry diseases. The dataset can be accessed at: https://universe.roboflow.com/tts-workspace/strawberry-disease-u9xwk (accessed on 15 June 2025). A detailed description of this dataset is provided below.

3.1.1. Dataset Composition and Classes

The dataset comprises 5333 images of strawberry plants captured under various conditions, encompassing multiple disease types and healthy samples. For supervised learning, the dataset contains 12,519 meticulously hand-annotated bounding boxes. A total of seven distinct classes are defined, which include: anthracnose_fruit_rot, blossom_blight, gray_mold, leaf_spot, powdery_mildew_fruit, powdery_mildew_leaf, and angular_leafspot.

To ensure the model can perform precise diagnosis, the dataset distinguishes between visually similar diseases that have different aetiologies. For instance, leaf spot is a fungal disease, typically characterized by circular lesions, while angular leafspot is a bacterial disease, with lesions presenting a distinct angular shape due to being restricted by leaf veins. This detailed classification is crucial for training a robust model that can accurately identify different diseases.

We conducted a detailed statistical analysis to better understand the dataset’s characteristics. Figure 1 illustrates that the dataset exhibits a significant class imbalance issue. For instance, the number of cases for the leaf_spot class far exceeds that of other classes, whereas samples for classes like anthracnose_fruit_rot are relatively scarce. This long-tailed distribution challenges the model’s generalization ability, requiring it to learn features beyond those of the majority classes.

Furthermore, we analyzed the size distribution of all annotated objects, as shown in Figure 2. The results indicate that most target objects in the dataset have an area smaller than 96 × 96 pixels, with a substantial portion smaller than 32 × 32 pixels, classifying them as typical small objects. The minuscule size characteristic of early-stage disease symptoms presents a formidable challenge to detection models’ feature extraction and localization capabilities, which was a primary motivation for our design of targeted strategies such as the PPA and MCAttn modules.
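The 32 × 32 and 96 × 96 pixel thresholds used above coincide with the small/medium/large area convention of COCO-style evaluation. The following is a minimal sketch of that bucketing, assuming boxes are given in pixels as (xmin, ymin, xmax, ymax) corners; the function name `size_bucket` and the sample boxes are illustrative and are not taken from the dataset.

```python
def size_bucket(xmin, ymin, xmax, ymax):
    """Classify a box by pixel area using the 32x32 and 96x96 thresholds."""
    area = max(0.0, xmax - xmin) * max(0.0, ymax - ymin)
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


# Hypothetical boxes (xmin, ymin, xmax, ymax), for illustration only.
boxes = [(10, 10, 30, 28), (100, 50, 160, 120), (0, 0, 200, 150)]
counts = {"small": 0, "medium": 0, "large": 0}
for box in boxes:
    counts[size_bucket(*box)] += 1
print(counts)
```

Running the same function over all annotated boxes would yield a size histogram of the kind summarized in Figure 2.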

3.1.2. Annotation and Partitioning

All images were annotated with bounding boxes that delineate diseased areas or healthy organs, and the annotation information was stored in the standard Pascal VOC XML format. To facilitate model training, validation, and testing, the dataset was partitioned into three independent subsets: a training set with 4311 images, a validation set with 715 images, and a test set with 307 images.

This partitioning scheme, which follows an approximate 80:15:5 ratio, ensures sufficient data for model training while allowing for a reliable and unbiased evaluation of the model’s generalization performance on independent validation and test sets.
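As a quick arithmetic check on the subset sizes stated above, the shares of each split can be computed directly. This short sketch uses only the three image counts given in the text; the variable names are illustrative.

```python
# Image counts per subset, as stated in the text.
splits = {"train": 4311, "val": 715, "test": 307}

total = sum(splits.values())
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)   # 5333
print(shares)  # {'train': 80.8, 'val': 13.4, 'test': 5.8}
```

The three subsets sum to the 5333 images of the full dataset, with shares of roughly 80.8%, 13.4%, and 5.8%.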

3.1.3. Data Augmentation and Preprocessing

To enhance the model’s robustness and prevent overfitting, we applied a comprehensive suite of data augmentation techniques to the training set. These techniques include geometric transformations (e.g., random horizontal flips, random rotations within a range of −15° to +15°) and photometric distortions (e.g., random adjustments to brightness, contrast, and saturation). Additionally, we employed advanced augmentation strategies such as Mosaic and MixUp to enrich the contextual information of training samples and improve the model’s generalization to object occlusion and complex backgrounds. Before being fed into the model, all images were uniformly preprocessed to a resolution of 640 × 640 pixels to meet the model’s input format requirements.
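Geometric augmentations must transform the bounding boxes together with the image. As a minimal illustration for the horizontal flip, assuming boxes are stored as continuous pixel corners (xmin, ymin, xmax, ymax), a box can be mirrored as follows; `hflip_box` and the sample box are hypothetical and not part of the authors' pipeline.

```python
def hflip_box(box, image_width):
    """Mirror a (xmin, ymin, xmax, ymax) box across the vertical image axis."""
    xmin, ymin, xmax, ymax = box
    # The left edge becomes the right edge and vice versa; y is unchanged.
    return (image_width - xmax, ymin, image_width - xmin, ymax)


box = (100, 200, 180, 260)      # hypothetical lesion box in a 640-px-wide image
flipped = hflip_box(box, 640)
print(flipped)                  # (460, 200, 540, 260)
```

The box width and height are preserved, and flipping twice returns the original box.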

3.2. The Proposed PPA-MC-YOLO Framework

Our proposed model, named PPA-MC-YOLO, is based on the YOLOv12 architecture and is enhanced through four key synergistic innovations: the Parallel Pyramid Attention (PPA) module, Monte Carlo Attention (MCAttn), Scale-Decoupled Loss (SD Loss), and Frequency Dynamic Convolution (FDConv).

3.2.1. Baseline Architecture: YOLOv12

Our framework is built upon the YOLOv12 architecture, which serves as a powerful and efficient baseline. The original YOLOv12 model consists of three main components (as illustrated in Figure 3):

Backbone: The primary feature extraction network typically utilizes an advanced Cross Stage Partial (CSP) structure to balance computational cost and feature extraction capability.

Neck: The network for multi-scale feature fusion employs a Path Aggregation Network (PANet) [28] structure to effectively combine high-level semantic information and low-level localization information from different layers.

Head: The network responsible for performing the final detection, which predicts the bounding boxes, confidence scores, and class probabilities of targets on the multi-scale feature maps output by the neck.

3.2.2. Overall Architecture of PPA-MC-YOLO

Building upon the YOLOv12 baseline, we introduce four synergistic innovations to construct the PPA-MC-YOLO model. These enhancement modules are strategically integrated into the baseline architecture to address specific challenges in agricultural object detection. The overall architecture of our proposed framework is detailed in Figure 4.

3.2.3. Parallel Pyramid Attention (PPA) Head

Design Motivation and Innovation:

The design of our detection head is inspired by the established concept of pyramid attention, which is highly effective for handling multi-scale objects. However, many existing attention mechanisms, such as CBAM [26], and traditional pyramid structures often rely on serial processing architectures. These can be suboptimal for agricultural scenes with extreme scale variations (e.g., simultaneously processing microscopic pests and large-area lesions), as information from small targets can be diluted during sequential fusion.

Unlike CBAM’s sequential approach, which first applies channel attention and then spatial attention, our PPA module employs a parallel multi-branch architecture. This design allows the model to capture and fuse feature information at multiple scales simultaneously, directly addressing the inherent challenge of scale diversity in agricultural vision tasks. By operating in parallel, the PPA can better preserve the detailed features of small targets while also integrating a broader contextual understanding from larger receptive fields. This is particularly beneficial for distinguishing tiny disease spots or pests from complex backgrounds, a task where CBAM’s serial processing might lose critical fine-grained information. In essence, while CBAM enhances features, our PPA is specifically engineered to handle the scale-specific feature extraction and fusion challenges that are critical for high-precision agricultural object detection.

In short, the core novelty of the PPA module lies in its parallel multi-branch architecture, which operates concurrently on multiple feature scales. Unlike general-purpose serial modules, this design is tailored to agricultural vision, so that fine-grained details of small targets and broad contextual information from large targets are processed and fused simultaneously.

Structural Principle:

The efficacy of the PPA module is derived from its ability to extract and fuse features in parallel from varying receptive fields. Its architecture comprises three synergistic branches: a global context branch, a local parallel branch, and a serial pyramid branch. Let the input feature map be $F \in \mathbb{R}^{C \times H \times W}$.

Global Context Branch (GCB): This branch is designed to capture the global contextual priors of the entire feature map, which helps the model distinguish foreground targets from complex backgrounds. It employs a global average pooling (GAP) operation to compress the spatial dimensions into a single feature vector, followed by a 1 × 1 convolution to encode channel-wise statistics. Finally, the feature vector is upsampled to the original spatial resolution to form the global context feature map, $F_{g}$:

$$F_{g} = \mathrm{Upsample}(\mathrm{Conv}_{1 \times 1}(\mathrm{GAP}(F)))$$

Local Parallel Branch (LPB): This branch aims to simultaneously capture multi-scale local contextual information. It utilizes parallel convolutional layers with different dilation rates (e.g., 1, 3, 5). Each dilated convolution processes the input feature map $F$ with a distinct receptive field, allowing for the extraction of features from objects of varying sizes. The outputs from these parallel paths are then concatenated and fused through a 1 × 1 convolution to generate the multi-scale local feature map, $F_{lp}$:

$$F_{lp} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}[\mathrm{DConv}_{d=1}(F), \mathrm{DConv}_{d=3}(F), \mathrm{DConv}_{d=5}(F)])$$
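To make the effect of the dilation rates concrete, the spatial extent covered by a dilated kernel can be computed as k + (k − 1)(d − 1). The sketch below assumes 3 × 3 kernels, which the text does not state explicitly; `effective_kernel` is an illustrative helper.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Assumed 3x3 kernels with the example dilation rates 1, 3, 5.
extents = {d: effective_kernel(3, d) for d in (1, 3, 5)}
print(extents)  # {1: 3, 3: 7, 5: 11}
```

Under this assumption the three parallel paths see 3 × 3, 7 × 7, and 11 × 11 neighborhoods while each path keeps the parameter count of a 3 × 3 kernel.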

Serial Pyramid Branch (SPB): This branch simulates a hierarchical feature pyramid to capture features with progressively larger receptive fields, mimicking the information flow in deep networks. It consists of a sequence of 3 × 3 convolutions interleaved with max-pooling layers. This structure allows the model to extract features from a pyramid of scales in a serial manner, capturing dependencies across different levels of abstraction to form the serial pyramid feature map, $F_{sp}$.
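The growth of the receptive field along such a serial stack can be traced with the standard recurrence (the receptive field grows by (k − 1) times the accumulated stride at each layer). The sketch below assumes three 3 × 3 convolutions interleaved with 2 × 2, stride-2 max-pooling; the depth and pooling size are assumptions for illustration, not values given in the text.

```python
def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride) layers, in input pixels."""
    rf, jump = 1, 1
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump   # widen by the kernel, scaled by the current stride
        jump *= stride                   # accumulated stride seen by later layers
    return rf


conv3 = (3, 1)   # 3x3 convolution, stride 1
pool2 = (2, 2)   # assumed 2x2 max-pooling, stride 2
stack = [conv3, pool2, conv3, pool2, conv3]

fields = [receptive_field(stack[:depth]) for depth in (1, 3, 5)]
print(fields)  # [3, 8, 18]
```

Each additional convolution after a pooling step covers a markedly larger region of the input, which is the sense in which the branch forms a pyramid of scales.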

The features from these three parallel branches are then fused. Let the three branches’ outputs be $F_{g}$, $F_{lp}$, and $F_{sp}$, respectively. The output attention map $A_{PPA} \in \mathbb{R}^{1 \times H \times W}$ is formally expressed as:

$$A_{PPA} = \sigma(\mathrm{Conv}_{1 \times 1}(F_{g} \oplus F_{lp} \oplus F_{sp})) \tag{1}$$

Finally, the PPA-enhanced feature map $F'$ is obtained via element-wise multiplication with the input feature map:

$$F' = F \otimes A_{PPA} \tag{2}$$

where $\oplus$ denotes the feature fusion operation (element-wise addition or channel-wise concatenation in this work), $\mathrm{Conv}_{1 \times 1}$ represents a 1 × 1 convolutional layer for dimensionality reduction and information integration, $\sigma$ is the Sigmoid activation function, and $\otimes$ signifies element-wise multiplication, with $A_{PPA}$ broadcast across the $C$ channels of $F$. This structure enables the model to focus on the most informative regions adaptively, significantly enhancing its detection performance for minuscule and other challenging targets.
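Equations (1) and (2) can be traced on a toy example. The sketch below is a dependency-free illustration, not the authors' implementation: it assumes element-wise addition for the fusion, treats the 1 × 1 convolution to one channel as a per-pixel weighted sum over channels, and uses hypothetical weights and tensor values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ppa_attention(f_g, f_lp, f_sp, weights, bias=0.0):
    """Eq. (1): fuse three C x H x W branch outputs into one H x W attention map."""
    channels, height, width = len(f_g), len(f_g[0]), len(f_g[0][0])
    attn = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            logit = bias
            for c in range(channels):
                # Fusion by element-wise addition, then the 1x1 conv as a channel-weighted sum.
                fused = f_g[c][y][x] + f_lp[c][y][x] + f_sp[c][y][x]
                logit += weights[c] * fused
            attn[y][x] = sigmoid(logit)
    return attn


def apply_attention(features, attn):
    """Eq. (2): multiply every channel of F by the same H x W attention map."""
    return [[[v * a for v, a in zip(row, attn_row)]
             for row, attn_row in zip(channel, attn)]
            for channel in features]


# Toy 2-channel, 2x2 tensors with hypothetical values.
zeros = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
f_lp = [[[2.0, 0.0], [0.0, -2.0]], [[2.0, 0.0], [0.0, -2.0]]]
features = [[[1.0, 1.0], [1.0, 1.0]], [[4.0, 4.0], [4.0, 4.0]]]

attn = ppa_attention(zeros, f_lp, zeros, weights=[1.0, 1.0])
enhanced = apply_attention(features, attn)
print([[round(a, 3) for a in row] for row in attn])
```

Locations with a strong fused response receive attention values close to 1 and are passed through almost unchanged, whereas locations with a negative response are suppressed in every channel.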

3.2.4. Monte Carlo Attention (MCAttn) Integration

Design Motivation and Innovation:

To further enhance the model’s adaptability to multi-scale targets during the feature fusion stage, particularly its ability to recognize small objects against complex backgrounds, we introduce the Monte Carlo Attention (MCAttn) module. While powerful self-attention mechanisms, as seen in models like Transformers, are highly effective at capturing global dependencies, their quadratic computational complexity makes them computationally prohibitive for real-time applications and lightweight models.

Our proposed MCAttn paradigm addresses this issue by integrating a novel stochastic sampling mechanism. Instead of computing attention scores for every possible pair of pixels, MCAttn randomly samples a subset of key-value pairs from different spatial locations. This approach achieves similar contextual awareness with significantly reduced computational cost. This is a critical design choice for our application, as it allows the module to effectively capture multi-scale contextual information across the entire image in a highly efficient manner. This enables our PPA-MC-YOLO framework to maintain real-time inference speed without sacrificing the robustness required for targets of varying sizes, providing a balance of performance and efficiency that is well-suited for the dynamic conditions of agricultural environments.
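The cost argument can be illustrated with a simple counting model. The sketch below compares the number of query-key pairs evaluated by full self-attention on an H × W map with the number evaluated when each location interacts with only N sampled points; the feature-map size and N are hypothetical, and the model ignores constant factors and the cost of the sampling itself.

```python
def pairwise_interactions(height, width):
    """Query-key pairs evaluated by full self-attention on an H x W map."""
    tokens = height * width
    return tokens * tokens


def sampled_interactions(height, width, num_samples):
    """Pairs evaluated if each location interacts with num_samples sampled points."""
    return height * width * num_samples


h, w, n = 80, 80, 64   # hypothetical feature-map size and sample count
full = pairwise_interactions(h, w)
sampled = sampled_interactions(h, w, n)
print(full, sampled, full // sampled)   # 40960000 409600 100
```

In this toy setting the sampled variant evaluates one hundredth of the pairs, and the gap widens as the feature map grows because the full count scales quadratically with H × W.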

Structural Integration and Working Mechanism:

The MCAttn module is embedded within a parallel architecture inside our proposed A2C2f_MoCA block. When an input feature tensor $F_{in} \in \mathbb{R}^{C \times H \times W}$ enters the module, it is split along the channel dimension into two data streams, $F_{1}$ and $F_{2}$. The stream $F_{2}$ is fed concurrently into two parallel sub-branches, while $F_{1}$ is retained for the final recombination:

Main Feature Extraction Branch: This branch comprises a standard stack of bottleneck blocks for deep feature transformation. It processes $F_{2}$ to extract high-level semantic features, denoted as $F_{b} = \mathrm{Bottleneck}(F_{2})$.

Attention Computation Branch (MCAttn): This branch receives $F_{2}$ as input and generates a spatial attention map $A_{MC}$ using our novel Monte Carlo pooling. The process is as follows. First, for the input feature map $F_{2}$, we randomly generate $N$ sampling points with coordinates $(x_{i}, y_{i})$ within the normalized range of [−1, 1]. Feature Sampling: We use bilinear interpolation to sample the feature values at these $N$ random locations. This results in a feature set $\{v_{1}, v_{2}, \dots, v_{N}\}$. Attention Map Generation: The sampled features are aggregated (e.g., through a small MLP) and then reshaped and upsampled to generate the final spatial attention map $A_{MC} \in \mathbb{R}^{1 \times H \times W}$, which is subsequently normalized by a Sigmoid function.

This process can be abstractly represented as:

$$F_{b} = \mathrm{Bottleneck}(F_{2}) \tag{3}$$

$$A_{MC} = \mathrm{MCAttn}(F_{2}) \tag{4}$$

where $F_{b}$ represents the deep features extracted by the main branch, and $A_{MC}$ is the normalized spatial attention map. Subsequently, the attention map is applied to the deep features via element-wise multiplication to produce an attention-modulated feature tensor $F_{att}$:

$$F_{att} = F_{b} \otimes A_{MC} \tag{5}$$

Finally, $F_{att}$ is concatenated with the feature stream $F_{1}$ from the initial split to complete the feature recombination and enhancement. This parallel design effectively decouples the feature transformation process from the attention computation. By relying on a representative subset of features rather than a deterministic aggregate, MCAttn functions as an intelligent “saliency controller.” It dynamically amplifies target-relevant features and suppresses background noise without disrupting the main feature extraction flow, thereby elevating the representation quality of the entire feature pyramid, especially for targets with significant scale variations.
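The random sampling step of the attention branch described above can be sketched for a single-channel map. This is a minimal illustration under stated assumptions: coordinates −1 and +1 are mapped to the centers of the first and last pixels, the function names are hypothetical, and the subsequent aggregation (e.g., the small MLP) is omitted.

```python
import random


def bilinear_sample(feature, x_norm, y_norm):
    """Sample an H x W map at normalized coordinates in [-1, 1].

    Assumes -1 and +1 map to the centers of the first and last pixels.
    """
    height, width = len(feature), len(feature[0])
    x = (x_norm + 1.0) / 2.0 * (width - 1)
    y = (y_norm + 1.0) / 2.0 * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def monte_carlo_samples(feature, num_samples, rng):
    """Draw num_samples uniform points in [-1, 1]^2 and sample the map there."""
    points = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
              for _ in range(num_samples)]
    return [bilinear_sample(feature, x, y) for x, y in points]


# Toy 3x3 map whose value encodes its position: value = x + 10 * y.
feature = [[x + 10 * y for x in range(3)] for y in range(3)]
rng = random.Random(0)   # fixed seed so the sketch is reproducible
values = monte_carlo_samples(feature, num_samples=8, rng=rng)
print(len(values))
```

Because the samples are drawn anew on each call, repeated passes over the same map see different subsets of locations, which is the stochastic behavior that distinguishes this pooling from a deterministic aggregate such as global average pooling.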

3.2.5. Scale-Decoupled Loss (SD Loss)

Design Motivation and Innovation:

In addition to optimizing the network architecture, we addressed the training strategy by proposing the Scale-Decoupled Loss (SD Loss). A fundamental limitation of standard object detection loss functions (e.g., CIoU) is their inherent assumption of treating targets of all scales equally. This approach can lead to significant training instability when dealing with minuscule targets, primarily due to drastic fluctuations in their Intersection over Union (IoU) values, even with minor pixel shifts. While methods like Focal Loss [31] have successfully addressed the class imbalance issue, they do not directly tackle this scale-dependent problem. SD Loss is predicated on a new premise: for minuscule targets, achieving accurate classification is often more critical and feasible than attaining pixel-perfect localization. We, therefore, decouple the importance of localization and classification tasks based on the target scale. This allows the network to prioritize a more stable and robust learning signal for small objects, thereby improving overall detection performance without compromising the accuracy of larger targets.
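The IoU sensitivity of small targets is easy to verify numerically. The sketch below applies the same 2-pixel diagonal shift to a 10 × 10 box and a 100 × 100 box; the box sizes and the shift are hypothetical values chosen for illustration.

```python
def iou(box_a, box_b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def shifted(box, dx, dy):
    """Translate a box by (dx, dy) pixels."""
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


small = (0, 0, 10, 10)
large = (0, 0, 100, 100)
iou_small = round(iou(small, shifted(small, 2, 2)), 3)
iou_large = round(iou(large, shifted(large, 2, 2)), 3)
print(iou_small)  # 0.471
print(iou_large)  # 0.924
```

With a typical positive-sample threshold of 0.5, the shifted small box would no longer count as a match, while the large box is barely affected by the same displacement.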

In the context of our work, “optimization” refers to the optimization of the model training process. The SD Loss function acts as an advanced optimization strategy by dynamically adjusting the learning objective based on the target’s scale. The goal is to provide more stable and task-relevant gradient signals to guide the model’s parameters (the optimization parameters) towards a better solution, with the minimization of the total loss (the optimization criterion) being the ultimate goal.

Fundamental Principle:

SD Loss dynamically adjusts the contributions of the localization loss (L_{loc}) and the classification loss (L_{cls}) through scale-aware weighting functions α(s) and β(s), where s represents the normalized scale of the target’s bounding box. The total loss for a single prediction i is defined as:

L_{SD_{i}} = α(s_{i}) \cdot L_{loc}(b_{i}, \hat{b}_{i}) + β(s_{i}) \cdot L_{cls}(c_{i}, \hat{c}_{i}) + L_{obj}(o_{i}, \hat{o}_{i})    (6)

The weighting functions are designed to decrease the weight of the localization loss and increase the weight of the classification loss as the target becomes smaller. We implement this through the following functional forms:

α(s) = w_{min} + (w_{max} - w_{min}) \cdot (1 - \exp(-s / s_{0}))    (7)

β(s) = w_{min} + w_{max} - α(s)    (8)

where w_{min} and w_{max} are predefined minimum and maximum weights (set to 0.5 and 1.5 in this work, respectively), s is the target scale, and s_{0} is a hyperparameter that controls the shape of the function. As s approaches 0, α(s) approaches w_{min}, thereby reducing the weight of the localization loss, while β(s) correspondingly approaches w_{max}, increasing the weight of the classification loss. This adaptive weighting strategy provides more stable and task-relevant gradient signals during model training, significantly improving the model’s detection consistency and robustness across the entire size spectrum.
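The weighting scheme of Equations (6) to (8) can be sketched in a few lines. The values of w_{min} and w_{max} are those stated above; the value of s_{0} is not reported in the text, so the constant `S0` below is a hypothetical placeholder, and the function names are illustrative.

```python
import math

W_MIN, W_MAX = 0.5, 1.5   # values stated in the text
S0 = 0.1                  # hypothetical: the text does not report s_0


def alpha(s, w_min=W_MIN, w_max=W_MAX, s0=S0):
    """Localization weight, Eq. (7): rises from w_min toward w_max as the scale grows."""
    return w_min + (w_max - w_min) * (1.0 - math.exp(-s / s0))


def beta(s, w_min=W_MIN, w_max=W_MAX, s0=S0):
    """Classification weight, Eq. (8): falls from w_max toward w_min as the scale grows."""
    return w_min + w_max - alpha(s, w_min, w_max, s0)


def sd_loss(loss_loc, loss_cls, loss_obj, s):
    """Per-prediction total loss, Eq. (6); the objectness term is not re-weighted."""
    return alpha(s) * loss_loc + beta(s) * loss_cls + loss_obj


for s in (0.0, 0.05, 0.5):
    print(f"s={s:.2f}  alpha={alpha(s):.3f}  beta={beta(s):.3f}")
```

Because Equation (8) fixes α(s) + β(s) = w_{min} + w_{max}, the scheme redistributes a constant weight budget between localization and classification rather than changing the overall loss magnitude.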

3.2.6. Frequency Dynamic Convolution (FDConv)

Design Motivation and Innovation:

Finally, to enhance the model’s feature representation capability at a fundamental level, we introduce Frequency-Dynamic Convolution (FDConv). While the concept of dynamic filters has been explored in prior works to adapt to input content, our implementation of FDConv differs significantly. Previous methods often generate filters based on global image features, which can be computationally expensive. Our approach focuses on a novel, frequency-diverse kernel construction method that operates without adding any parameters or computational costs, thus maintaining a lightweight model design. By constructing kernels in the frequency domain so that they are intrinsically diverse, our FDConv module is specifically designed to capture a broader spectrum of textures and fine-grained patterns. This is particularly crucial for distinguishing between visually similar yet pathologically distinct diseases, a challenge that is often overlooked by general-purpose dynamic filter methods. This design choice provides an efficient way to enrich feature representations and enhance the model’s discriminative ability.

Working Mechanism:

A standard convolution operation is given by Y = F \times W. In FDConv, the convolutional kernel W is decomposed into a sum of M frequency groups:

W = \sum_{m=1}^{M} W_{m}    (9)

Each frequency group kernel W_{m} is constrained to approximate a specific frequency-domain basis function B_{m} (e.g., a DCT basis function), accompanied by a learnable coefficient c_{m}. Consequently, the output of FDConv, Y_{FD}, can be expressed as:

Y_{FD} = F \times (\sum_{m=1}^{M} W_{m}) = \sum_{m=1}^{M} (F \times W_{m})    (10)

This formulation indicates that the output of FDConv is a linear combination of the results from convolving the input feature with multiple filters, each having a distinct frequency response. This allows the model to process a rich information spectrum in parallel within a single layer, simultaneously capturing smooth, global structures and fine-grained, local textures. As a result, the discriminability and generalization ability of the learned features are substantially enhanced.
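The linearity expressed in Equation (10) can be checked numerically with a one-dimensional toy example. The sketch assumes each group kernel takes the form W_{m} = c_{m} B_{m} with DCT-II basis vectors; this form, the coefficient values, and the 1-D setting are illustrative assumptions rather than the authors' construction.

```python
import math


def dct_basis(m, size):
    """DCT-II basis vector of index m (one common choice for B_m)."""
    return [math.cos(math.pi * (n + 0.5) * m / size) for n in range(size)]


def conv1d(signal, kernel):
    """'Valid' 1-D cross-correlation, the sliding product used by CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


K = 3                                   # kernel length
coeffs = [0.6, -0.3, 0.2]               # hypothetical coefficients c_m (indexed from 0 here)
groups = [[c * b for b in dct_basis(m, K)] for m, c in enumerate(coeffs)]

# Eq. (9): the full kernel is the sum of the frequency-group kernels.
kernel = [sum(g[j] for g in groups) for j in range(K)]

signal = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
fused = conv1d(signal, kernel)                          # left-hand side of Eq. (10)
branch_outputs = [conv1d(signal, g) for g in groups]    # one output per frequency group
summed = [sum(vals) for vals in zip(*branch_outputs)]   # right-hand side of Eq. (10)
```

The two results agree up to floating-point error, so a single convolution with the summed kernel is equivalent to running the per-frequency filters separately and adding their outputs.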

3.3. Experimental Protocol and Implementation

3.3.1. Experimental Platform and Framework

To ensure the rigor and reproducibility of this study, all experiments were conducted on a unified hardware and software platform. We implemented all models, including our proposed PPA-MC-YOLO, in Python (v3.11) with the PyTorch (v2.4.1) framework. The model construction and training pipelines were based on the standard implementation of the Ultralytics library (v8.3.63).

We adopted a standardized initialization strategy for all models to establish a fair comparative benchmark. All YOLO-based architectures, including our proposed PPA-MC-YOLO and the comparative models such as YOLOv10-n and YOLOv11-n, were fine-tuned starting from their official weights pre-trained on the COCO dataset. Other comparative models, such as the lightweight NanoDet-Plus and MobileNetV2-SSD, and the Transformer-based RT-DETR-R18, were implemented and trained following their official guidelines to ensure a fair comparison. This strategy ensures that all models leverage the benefits of transfer learning and are initialized optimally within their respective frameworks.

The experiments were performed on a high-performance workstation with the following core configuration: an NVIDIA RTX 4090 GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), an Intel Xeon Gold 6430 CPU (16 cores; Intel Corporation, Santa Clara, CA, USA), and 120 GB of system memory. The software environment was Ubuntu 22.04 LTS with the CUDA (v12.4) toolchain.

3.3.2. Training Protocol and Hyperparameters

To guarantee a fair comparison, all models were trained using a unified training protocol and hyperparameter settings.

Epochs: To ensure full convergence, all models were fine-tuned on the target dataset for 300 epochs starting from their pre-trained weights.

Batch Size: The batch size was set to 32 during training.

Optimizer and Learning Rate: We employed the Stochastic Gradient Descent (SGD) optimizer with momentum. The momentum was set to 0.937, and the weight decay was set to 0.0005. The initial learning rate (lr0) was 0.01, which was dynamically decayed using a cosine annealing schedule. A linear warm-up strategy was applied during the first three epochs to stabilize the initial training process.

Input Size and Data Augmentation: The input image size for training was uniformly resized to 640 × 640 pixels. To enhance model generalization, we utilized a set of online data augmentation strategies, including: Mosaic (probability = 1.0), random scaling (scale, range = [0.5, 1.5]), and Copy-Paste (probability = 0.1). Preliminary ablation studies indicated that MixUp data augmentation (probability = 0.1) did not yield significant improvements for the mAP@0.5:0.95 metric. Consequently, it was disabled in our final training configuration.

Training Strategies: All experiments were conducted with Automatic Mixed Precision (AMP) training, which enabled the acceleration of computation and the optimization of GPU memory usage. Concurrently, an Early Stopping mechanism was implemented to prevent overfitting. This mechanism monitored the mAP@0.5:0.95 metric on the validation set with a patience of 50 epochs. This metric was chosen because it comprehensively evaluates the model’s overall performance across different IoU thresholds.
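The protocol above can be summarized as a configuration sketch. The values are those reported in this section; the key names follow Ultralytics-style training arguments and are an assumption about how the settings were passed. The final learning-rate fraction is not reported, so `FINAL_LR_FRACTION` is a hypothetical placeholder, and the per-epoch schedule is a simplification.

```python
import math

# Hyperparameters as reported in the text. Key names are assumed to match
# Ultralytics-style training arguments; they are not confirmed by the source.
train_cfg = {
    "epochs": 300,
    "batch": 32,
    "imgsz": 640,
    "optimizer": "SGD",
    "lr0": 0.01,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "cos_lr": True,
    "warmup_epochs": 3,
    "mosaic": 1.0,
    "scale": 0.5,        # assumed to denote a +/-50% gain, i.e. the range [0.5, 1.5]
    "copy_paste": 0.1,
    "mixup": 0.0,        # disabled in the final configuration
    "amp": True,
    "patience": 50,      # early stopping on validation mAP@0.5:0.95
}

FINAL_LR_FRACTION = 0.01  # hypothetical: the final learning rate is not reported


def lr_at_epoch(epoch, cfg=train_cfg, final_fraction=FINAL_LR_FRACTION):
    """Linear warm-up, then cosine annealing from lr0 down to lr0 * final_fraction."""
    lr0, warmup, total = cfg["lr0"], cfg["warmup_epochs"], cfg["epochs"]
    if epoch < warmup:
        return lr0 * (epoch + 1) / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr0 * (final_fraction + (1.0 - final_fraction) * cosine)
```

The schedule is evaluated once per epoch here for readability; training frameworks typically apply warm-up per iteration, so the first few values are only indicative.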

3.4. Evaluation Protocol

We employed a set of standardized evaluation metrics widely recognized in object detection to conduct a comprehensive, objective, and quantitative evaluation of our proposed PPA-MC-YOLO and all comparative models. This evaluation protocol is designed to provide a holistic assessment from three dimensions: Accuracy, Robustness, and Efficiency.

3.4.1. Accuracy and Robustness Metrics

The detection accuracy of the models is primarily measured using the following metrics:

Precision (P) and Recall (R): Precision measures the accuracy of the model’s positive predictions (i.e., a low false positive rate), while Recall measures the model’s ability to identify all true positive instances (i.e., a low false negative rate). In phytopathological applications, a high Recall is particularly crucial as it directly relates to the timely detection of early-stage diseases to prevent their spread.

F1-Score: As the harmonic mean of Precision and Recall, the F1-score provides a comprehensive performance measure, especially suitable for scenarios where a trade-off between P and R is necessary.

Mean Average Precision (mAP): This is the core metric for evaluating the overall performance of an object detection model. In our experiments, we report two standard mAP metrics:

mAP@0.5: This metric is calculated at an Intersection over Union (IoU) threshold of 0.5 and primarily evaluates the model’s ability to recognize and roughly localize targets.

mAP@0.5:0.95 (COCO Style): This metric is obtained by calculating a series of mAP values over an IoU threshold range from 0.5 to 0.95 (with a step of 0.05) and then averaging them. It imposes stricter requirements on the localization accuracy of the bounding boxes and is currently the gold standard for measuring comprehensive model performance. It can better reflect the improvements our model has made in precise localization.
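The metric definitions above translate directly into code. The sketch below assumes that the per-threshold mAP values have already been computed by an evaluator; the counts in the usage line are hypothetical and serve only to exercise the functions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(map_per_threshold):
    """Average the mAP values obtained at each IoU threshold."""
    thresholds = coco_iou_thresholds()
    return sum(map_per_threshold[t] for t in thresholds) / len(thresholds)


# Hypothetical counts, for illustration only.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

Since mAP@0.5:0.95 averages over thresholds up to 0.95, a detector whose boxes are only roughly aligned scores well at 0.5 but loses most of its contribution at the stricter thresholds.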

3.4.2. Efficiency and Complexity Metrics

In addition to detection accuracy, a model’s computational efficiency and resource requirements are critical for its practical deployment. We evaluate these aspects using the following metrics:

Parameters: Measured in millions (M), this metric reflects the model’s storage requirements.

FLOPs (Floating Point Operations): Measured in giga-FLOPs (GFLOPs), this metric quantifies the computational resources required for a single forward pass of the model, directly correlating with its hardware performance demands.

Inference Speed: Reported in Frames Per Second (FPS). All speed tests were conducted on a single NVIDIA RTX 4090 GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA) with a batch size of 1 to simulate a realistic single-image inference scenario.
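A single-image throughput measurement of this kind can be sketched as follows. The warm-up and iteration counts are illustrative choices, and `dummy_infer` is a hypothetical stand-in for a detector forward pass; the text does not describe the timing procedure in this detail.

```python
import time


def measure_fps(infer, sample, warmup=10, iters=100):
    """Average throughput of `infer` on one sample, in frames per second."""
    for _ in range(warmup):           # warm-up calls are excluded from the timing
        infer(sample)
    start = time.perf_counter()
    for _ in range(iters):
        infer(sample)
    elapsed = time.perf_counter() - start
    return iters / elapsed


def dummy_infer(image):
    """Hypothetical stand-in for a detector forward pass on one image."""
    return sum(image)


fps = measure_fps(dummy_infer, sample=list(range(1000)))
```

When timing a GPU model, the device should be synchronized before each timer reading, because GPU kernels are launched asynchronously and the host clock would otherwise stop too early.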

3.5. Comparative Experiment Setup

To comprehensively benchmark the performance of our proposed lightweight model, PPA-MC-YOLO, we systematically compared it against a series of representative, state-of-the-art (SOTA) models focusing on high efficiency.

Fairness Assurance:

To ensure the direct comparability of all performance metrics and to eliminate potential biases introduced by different implementations or training configurations, we adhered to a strict principle of controlled variables. All comparative models were retrained and evaluated on our custom strawberry disease dataset, following the same training protocol detailed in Section 3.3.

Selection of Comparative Models:

Our selection of comparative models followed a core principle: to compare detection accuracy at a similar level of model complexity and computational scale. This efficiency-centric comparison strategy is designed to rigorously validate our model’s ability to achieve superior performance while maintaining its lightweight characteristics. The selected models encompass different architectural paradigms, including CNN-based and Transformer-based approaches.

Direct Baseline:

YOLOv12-n [32]: This is the direct starting point for all our modifications. The comparison with this model aims to quantify the net performance gain brought about by our proposed PPA, MCAttn, SD Loss, and FDConv modules, with virtually no additional computational burden.

CNN-based SOTA Lightweight Models:

Classic Architectures:

MobileNetV2-SSD [21,33]: A classic lightweight detector for mobile devices, representing traditional efficient models.

NanoDet-Plus [34]: A modern lightweight model with widespread influence in industry and academia, known for its excellent speed-accuracy trade-off.

Cutting-Edge YOLO Series Models:

YOLOv8-n [35]: Represents a significant performance peak for lightweight models in the YOLO series.

YOLOv10-n [36]: The nano version of YOLOv10, aimed at achieving an ultimate balance between efficiency and performance.

YOLOv11-n [37]: Represents the forefront of exploration in lightweight design within the YOLO series.

Transformer-based SOTA Lightweight Models:

RT-DETR-R18 [38]: To benchmark our CNN-based approach against the increasingly prominent Transformer paradigm, the most lightweight version of RT-DETR, which uses ResNet-18 as its backbone, was selected. This comparison allows for an assessment of our model’s performance advantage when competing against a state-of-the-art, real-time Transformer-based detector.

Through a comprehensive comparative analysis against these state-of-the-art models—all of which prioritize high efficiency and low complexity—this study robustly demonstrates the significant potential of PPA-MC-YOLO as a high-performance, lightweight detection solution tailored for resource-constrained application scenarios.

4. Results

This section details the experimental results for the proposed enhanced YOLOv12 framework and other comparative models on the strawberry disease dataset. We begin by introducing the main detection performance comparison, followed by ablation studies to validate the effectiveness of each innovative module, and conclude with a visualization analysis of several challenging detection cases.

4.1. Main Detection Performance Comparison

To comprehensively evaluate the performance of our proposed PPA-MC-YOLO framework, we conducted a fair comparison against a series of representative, state-of-the-art (SOTA) detectors spanning various architectures. These include lightweight models (NanoDet-Plus, MobileNetV2-SSD), a Transformer-based model (RT-DETR-R18), and advanced models from the YOLO series (YOLOv8-n, v10-n, v11-n). We selected YOLOv12, which has the closest model complexity to our own, as the direct baseline for comparison. All experiments were conducted on the same hardware platform, and the detailed performance comparison is presented in Table 1.

As demonstrated by the comprehensive data in Table 1, our proposed PPA-MC-YOLO model effectively balances accuracy and efficiency.

Regarding detection accuracy, our model attains the best performance across all major evaluation metrics. It achieves an mAP@0.5 of 81.1% and an mAP@0.5:0.95 of 63.2%, representing improvements of 2.1 and 1.9 percentage points, respectively, over the baseline YOLOv12. This accuracy enhancement substantiates the collective efficacy of our introduced improvements. Specifically, the adoption of SD Loss stabilized the training process, while the PPA module enhanced the focus on hard-to-detect targets; together, these improvements account for the overall gain in detection accuracy.

Regarding model efficiency, PPA-MC-YOLO exhibits superior lightweight characteristics. With only 2.44 M parameters and 6.3 GFLOPs, it is the most lightweight among all high-accuracy models (mAP@0.5 > 79%). This is primarily attributed to the application of FDConv, which effectively enhances feature representation capabilities without significantly increasing computational costs.

While a 2.1 percentage point improvement in overall mAP@0.5 may appear modest, its significance is underscored by several factors. Firstly, this gain is achieved over the highly optimized YOLOv12 baseline, where further improvements are notoriously difficult to obtain. Secondly, and more importantly, the overall average conceals substantial, targeted gains on the most challenging disease classes. As evidenced by the per-class analysis in Table 2, our model achieves an 11.0 percentage point AP increase for ‘powdery_mildew_fruit’, a class characterized by minute, hard-to-recognize symptoms. This indicates that our framework’s improvements are not random but systematically address key pain points in strawberry disease detection. Finally, this superior accuracy is delivered by a more efficient model, with 17.3% fewer parameters and 4.5% fewer GFLOPs than the baseline. This enhanced performance-efficiency trade-off provides strong evidence for the architectural superiority of our proposed PPA-MC-YOLO.

Our model demonstrates outstanding overall performance in the trade-off between accuracy and efficiency. Compared to the baseline YOLOv12, PPA-MC-YOLO achieves an mAP improvement of over two percentage points while simultaneously reducing parameters and GFLOPs by 17.3% and 4.5%, respectively. Although there is a slight decrease in inference speed (from 480 to 465 FPS), it still maintains a high capacity for real-time inference. This proves that our framework design successfully achieves effective model lightweighting while boosting detection accuracy.

To further investigate the specific sources of performance improvement in the PPA-MC-YOLO model, we conducted a detailed analysis of its Average Precision (AP), Precision (P), and Recall (R) for each disease class in comparison to the baseline YOLOv12 model. The detailed results are presented in Table 2.

The per-class analysis results in Table 2 provide compelling evidence for the success of our model at a micro-level. PPA-MC-YOLO achieves a 2.1 percentage point increase in overall mAP and demonstrates targeted improvements on several challenging classes.

Significantly Enhanced Detection of Small and Hard-to-Recognize Targets: The most noteworthy improvement is observed in the ‘powdery_mildew_fruit’ class. This category is extremely difficult to detect due to its minute early-stage symptoms and low contrast against the fruit background. The baseline model achieved a Recall of only 46.5%, indicating many missed detections. In contrast, our PPA-MC-YOLO model, benefiting from the effective capture of fine-grained features by the PPA module and the aggregation of multi-scale information by MCAttn, increased the Recall to 52.0% (a 5.5 percentage point increase). Its AP rose from 63.5% to 74.5% (an 11.0 percentage point increase). This substantiates the effectiveness of our improvements in addressing the challenges of detecting small and difficult-to-recognize targets.

Accuracy Gains from Improved Feature Discriminability: Our model also performs well on diseases with visually similar and easily confusable features, such as ‘leaf_spot’ and ‘angular_leafspot’. For instance, the AP for ‘angular_leafspot’ increased by 2.5 percentage points, and the AP for ‘leaf_spot’ rose by 2.4 percentage points. This is primarily attributed to the synergistic effects of modules like FDConv. The model can more accurately distinguish the subtle differences between these similar diseases by constructing richer feature representations, thereby improving detection accuracy.

Achieved a Better Precision-Recall Balance: On average, our model increased the overall Recall from 73.2% to 76.2% (a 3.0 percentage point increase) while raising the overall Precision from 81.2% to 81.6%. This indicates that the performance improvement is comprehensive and balanced. Our model effectively controls false detections through stronger feature representations and a more stable training process while simultaneously reducing missed detections, thus achieving superior overall detection performance.

In summary, the results of the per-class analysis further confirm the effectiveness of our proposed PPA-MC-YOLO framework in handling the detection of complex pathological features. It demonstrates significant improvements, particularly in recognizing small targets and similar diseases, highlighting the model’s potential for practical application in agriculture.

4.2. Ablation Study

To systematically quantify the individual contributions of our four proposed modules—PPA, MCAttn, SD Loss, and FDConv—and to investigate their potential synergistic effects, we conducted a series of exhaustive ablation experiments using YOLOv12 as the baseline under identical experimental settings. The mAP@0.5 was uniformly adopted as the evaluation metric, and the detailed results are presented in Table 3.

Analysis of Results:

The results in Table 3 deconstruct the sources of our model’s performance improvement, from which several key insights can be drawn.

Independent Efficacy of Each Module: The forward-addition experiments (#2 to #5) confirm that our proposed modules independently contribute to performance gains on top of the baseline model. Notably, the MCAttn module exhibits the most significant individual gain, boosting the mAP@0.5 by 1.0 percentage point (experiment #3). This finding provides a crucial insight for our task: enhancing the model’s ability to generate scale-invariant attention for multi-scale contextual awareness is the most effective single intervention for improving detection performance. The PPA module also contributes a significant 0.7 percentage point increase (experiment #2), demonstrating the value of its attention-driven detection head in focusing on fine-grained target features. While the SD Loss and FDConv modules show smaller individual gains, they provide essential training stability and feature enrichment for the model.

Synergistic Effects Between Modules: The combination experiments show that the modules complement one another. For example, when PPA and MCAttn are integrated (experiment #6), the mAP@0.5 increases by 1.5 percentage points, more than either module achieves alone (0.7 and 1.0 percentage points, respectively), although slightly less than the sum of their individual gains. This combined effect indicates that the local-focusing capability of PPA and the global, scale-invariant contextual ability of MCAttn are largely complementary, working together to achieve a more robust feature representation. Ultimately, the final model integrating all four components (experiment #8) reaches a peak performance of 81.1%, a total improvement of 2.1 percentage points, supporting the effectiveness of our overall framework design.

Indispensability of Core Components: The backward-removal experiments (#9 and #10) further validate the necessity of each key module. Removing any single component from the final PPA-MC-YOLO model results in a performance drop. In particular, removing the MCAttn module causes the largest performance degradation (−1.0 percentage point, experiment #10), reaffirming its status as the most critical contributor to our framework’s success. Similarly, removing the PPA module also leads to a notable performance decline of 0.7 percentage points.

To further validate the effectiveness of our proposed Scale-Decoupled (SD) Loss, we conducted a comparative ablation study against several state-of-the-art loss functions. As shown in Table 4, the results demonstrate the superiority of our approach.

When compared to the default CIoU loss, SD Loss achieves a notable improvement across all key metrics, particularly boosting the AP for small objects (AP_small) by 2.0 percentage points. This directly confirms its efficacy in stabilizing the training for minuscule targets, which was its primary design motivation.

While Focal Loss slightly improves Precision, it comes at the cost of Recall and overall mAP, indicating that solely addressing class imbalance is insufficient for this task. The advanced Wise-IoU Loss shows competitive performance, but our SD Loss still outperforms it, especially in the crucial AP_small metric.

This quantitative comparison provides strong evidence that by adaptively re-weighting localization and classification losses based on target scale, SD Loss offers a more effective optimization strategy for detecting objects with significant scale variations, a common challenge in agricultural scenes.

In conclusion, the ablation study validates the effectiveness of all four of our innovations and highlights the core driving role played by the MCAttn module in enhancing multi-scale awareness. This core function, supported by the focused attention of PPA, the enriched features from FDConv, and the stabilized training from SD Loss, collectively forms a powerful and synergistically efficient detection framework capable of effectively addressing complex agricultural scenes.

4.3. Parameter Sensitivity Analysis

We conducted a series of parameter sensitivity analyses to validate the design rationale of our key modules and determine their optimal hyperparameters. These analyses provide a solid empirical basis for the final configuration of the PPA-MC-YOLO framework.

4.3.1. Impact of the Number of Sampling Points N in MCAttn

The number of sampling points, N, in the MCAttn module is a critical hyperparameter that directly influences the model’s accuracy and efficiency. We tested the impact on the mAP@0.5 and FPS of the full PPA-MC-YOLO model as N was varied within the set {4, 8, 16, 32, 64}. The results are illustrated in Figure 5.

Analysis: As illustrated in Figure 5, the model’s performance is sensitive to the value of N. As N increases from 4 to 16, the mAP@0.5 steadily rises from 79.9% to a peak of 81.1%. This indicates that moderate stochastic sampling can effectively form a robust feature representation. However, as N continues to increase to 64, the mAP slightly decreases to 80.8%, possibly because oversampling diminishes the advantages brought by randomness. Concurrently, the FPS monotonically decreases (from 475 to 425) as N increases. Considering both aspects, N = 16 strikes the optimal balance between accuracy and efficiency and was therefore selected for the final configuration.

4.3.2. Impact of Branch Combinations in the PPA Module

The PPA module is composed of three branches: Global (G), Local Parallel (L), and Serial (S). To verify the necessity of each branch, we conducted an internal ablation study on their combinations. The results are presented in Table 5.

Analysis: The results in Table 5 validate the effectiveness of our multi-branch design. The individual Local (L) and Serial (S) branches significantly improve performance. When combined (L+S), the mAP increases to 80.7%, demonstrating that these two paradigms—one for capturing multi-scale local details and the other for hierarchical information—are highly complementary.

Crucially, introducing the Global (G) branch pushes the performance to its peak of 81.1%. This confirms that the global context is vital for the model to distinguish foreground from complex backgrounds. Therefore, the complete architecture incorporating all three branches (G+L+S) is confirmed as the optimal design for the PPA module.

The parameter sensitivity analyses above provided sufficient experimental support for PPA-MC-YOLO’s key design and hyperparameter selections.

4.4. Visualization Results and Analysis

To intuitively evaluate the PPA-MC-YOLO model’s performance and investigate its internal working mechanisms, this section presents qualitative analysis results under typical challenging scenarios.

Figure 6 compares the detection results between our proposed PPA-MC-YOLO model and the baseline YOLOv12 on four representative types of challenging samples. These results reveal the practical performance gains from the modules we introduced.

Small and Dense Targets (Figure 6a,b): The first row displays a leaf with fine leaf-spot symptoms. The baseline model (a) uses a single, imprecise, and overly large box to cover all lesions, failing to distinguish individual spots effectively. In contrast, our model (b) generates multiple smaller, more precise bounding boxes, individually localizing each lesion area. This provides compelling evidence for the superiority of the PPA module in focusing on and separating dense, minuscule targets.

Detection of Concealed Targets in Complex Backgrounds (Figure 6c,d): The second row presents a complex scene containing a withered flower (gray_mold). This target is small, has an irregular shape, and is easily mistaken for harmless background clutter. The baseline model (c) failed to identify this concealed target, resulting in a missed detection. Our model (d), however, successfully detected it. This highlights the comprehensive advantages of our framework: the PPA module enables it to focus on small targets, while MCAttn and FDConv work in synergy to enhance the model’s ability to discriminate true targets from complex backgrounds.

Co-existence of Multi-scale Targets (Figure 6e,f): In the sample from the third row, the main leaf exhibits clear powdery_mildew_leaf, while another leaf in the background has a much smaller lesion. The baseline model (e) only detected the large target in the foreground, completely ignoring the small one. Our model (f) accurately detected the large target and successfully captured the distant, tiny lesion, demonstrating its excellent multi-scale detection capability, which is attributable to the synergistic effect of all our innovative modules.

Discrimination of Confusing Classes (Figure 6g,h): The fourth row provides an excellent case for distinguishing between two visually similar diseases. The baseline model (g) misclassified the sample as angular_leafspot, whereas our model (h) correctly identified it as leaf_spot. This improvement is attributed to the FDConv module, which enables the model to capture the subtle visual cues that differentiate these similar diseases by extracting richer and more discriminative frequency-domain features.

In conclusion, the qualitative comparison results in Figure 6 intuitively validate the superiority of the PPA-MC-YOLO framework in addressing various complex detection challenges. Whether dealing with small, dense targets or identifying concealed, easily confused diseases in complex backgrounds, our model consistently demonstrates greater robustness and accuracy than the baseline model. This fully substantiates the practical value of our proposed multi-module synergistic design in real-world applications.

4.5. Efficiency Analysis

While pursuing higher detection accuracy, this study has also paid close attention to model efficiency. As shown in Table 1, PPA-MC-YOLO demonstrates a significant advantage in terms of model complexity. Compared to the baseline YOLOv12, our model’s parameter count is reduced from 2.95 M to 2.44 M (a 17.3% decrease), and its theoretical computational load (GFLOPs) is also reduced from 6.6 to 6.3 (a 4.5% decrease). This proves that our proposed framework, particularly modules like FDConv, can effectively enhance feature representation capabilities without increasing—and in fact, while decreasing—model complexity.
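The quoted reductions follow directly from the Table 1 values cited above, as the short calculation below shows; the helper name is illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage decrease of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


params_drop = relative_reduction(2.95, 2.44)   # parameters, in millions
gflops_drop = relative_reduction(6.6, 6.3)     # GFLOPs per forward pass

print(f"parameters: -{params_drop:.1f}%  GFLOPs: -{gflops_drop:.1f}%")
```

Both figures are relative changes with respect to the baseline, in contrast to the mAP gains, which are reported as absolute percentage-point differences.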

Notably, despite the reduction in theoretical computational load, the model’s actual inference speed (FPS) slightly decreased from 480 to 465. This discrepancy between theoretical efficiency and practical speed may stem from the operations introduced in the PPA and MCAttn modules, such as multi-branch processing and attention calculations. Although these operations may not have a high total number of floating-point operations, they might increase the Memory Access Cost (MAC) or data dependencies, thereby affecting parallel computing efficiency on the GPU.

Nevertheless, PPA-MC-YOLO achieves an mAP improvement of over two percentage points while simultaneously reducing model complexity. Its inference speed meets the real-time or near-real-time requirements of agricultural scenarios. This indicates that our model has struck an excellent balance among accuracy, complexity, and speed, making it a highly efficient and powerful solution.

5. Discussion

The PPA-MC-YOLO framework proposed in this study is an enhanced object detector that has demonstrated superior performance on a challenging strawberry disease dataset, surpassing the baseline YOLOv12 and other state-of-the-art (SOTA) models. This section aims to provide an in-depth interpretation of the mechanisms driving these performance gains, critically evaluate the model’s strengths and inherent limitations, and situate our contributions within the context of related work, while also proposing promising avenues for future research.

5.1. Interpretation of Performance Enhancement Mechanisms

The empirical results, particularly the comprehensive ablation studies in Table 3, unequivocally demonstrate that our proposed modules positively contribute to the model’s final efficacy. This performance uplift is not coincidental but stems from a synergistic design in which each component targets specific challenges prevalent in agricultural computer vision.

Efficacy of the PPA Module in Detecting Small and Ambiguous Targets. The success of the Parallel Pyramid Attention (PPA) module can be attributed to its innovative multi-branch parallel feature extraction strategy. This design overcomes the limitations of conventional detection heads, which typically employ a single convolutional path with a fixed receptive field. The PPA’s architecture facilitates a more comprehensive feature analysis: (1) a global branch provides contextual priors, enabling the model to distinguish foreground from complex backgrounds; (2) a parallel local branch, utilizing dilated convolutions with varying rates, functions like a multi-focal “microscope,” capturing fine-grained details across a spectrum of scales, from minuscule spots to medium-sized lesions; and (3) a serial branch preserves hierarchical feature extraction capabilities. Fusing these branches generates an adaptive attention map that optimally allocates feature weights for targets of different sizes. The effectiveness of this mechanism is validated by the model’s performance on notoriously difficult classes. For instance, in the powdery_mildew_fruit class, characterized by minute, low-contrast symptoms, the AP achieved a remarkable 11.0 percentage point surge, with a corresponding 5.5-point increase in Recall (Table 2). This demonstrates the critical role of PPA in enhancing focus on early-stage, subtle pathological features.
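The three-branch idea described above can be illustrated with a toy 1-D sketch. This is not the authors' PPA implementation: the averaging kernel, the dilation rates (1, 2, 3), the fusion by summation followed by a sigmoid, and all function names are assumptions chosen only to show how parallel dilated branches widen the receptive field while a global and a serial branch contribute to the same attention map.

```python
import math

def receptive_field(kernel_size, rate):
    """Span (in positions) covered by a dilated kernel."""
    return (kernel_size - 1) * rate + 1

def dilated_conv1d(x, kernel, rate):
    """Zero-padded, same-length 1-D cross-correlation with dilation `rate`."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate  # dilation spreads the taps apart
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def parallel_attention(x, rates=(1, 2, 3)):
    kernel = [1 / 3, 1 / 3, 1 / 3]  # placeholder weights; learned in a real network
    # Global branch: one context value shared by every position.
    global_ctx = sum(x) / len(x)
    # Local branch: several dilation rates evaluated in parallel, then averaged.
    local_maps = [dilated_conv1d(x, kernel, r) for r in rates]
    local = [sum(vals) / len(vals) for vals in zip(*local_maps)]
    # Serial branch: the same kernel applied twice in sequence.
    serial = dilated_conv1d(dilated_conv1d(x, kernel, 1), kernel, 1)
    # Fuse by summation (an assumption) and squash to attention weights in (0, 1).
    attn = [sigmoid(global_ctx + l + s) for l, s in zip(local, serial)]
    weighted = [a * v for a, v in zip(attn, x)]
    return weighted, attn

signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # a single small "lesion"
weighted, attn = parallel_attention(signal)
print([receptive_field(3, r) for r in (1, 2, 3)])  # -> [3, 5, 7]
print([round(a, 3) for a in attn])
```

With a 3-tap kernel, the three parallel rates cover spans of 3, 5, and 7 positions without any downsampling, which is the property the "multi-focal" description refers to.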

Role of MCAttn in Achieving Scale-Invariant Recognition. The core innovation of the Monte Carlo Attention (MCAttn) module lies in its stochastic sampling-based pooling mechanism. This approach diverges from traditional deterministic pooling methods (e.g., average or max pooling), which are inherently sensitive to target scale and background interference. By employing random sampling, MCAttn shifts the attention dependency from an aggregate of all pixels within a region to a representative subset. This confers two distinct advantages: (1) Scale Invariance: The stochastic nature of the sampling ensures that stable and effective attention responses can be generated regardless of target size, as long as its core features are sampled, thereby mitigating overfitting to specific scales. (2) Robustness: Random sampling reduces the model’s sensitivity to spurious, high-activation pixels, which may be noise, allowing it to extract true disease features from complex background textures more robustly. The pivotal contribution of MCAttn is substantiated by the ablation study (Table 3), where its standalone integration yielded the largest single performance gain (+1.0% mAP@0.5), and its removal caused the most significant performance degradation (−1.0% mAP@0.5).
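The contrast between deterministic and sampling-based pooling can be made concrete with a small sketch. It is a generic illustration of pooling over a random subset of activations, not the authors' MCAttn module; the function names, the toy activation values, and the choice of four samples drawn without replacement are all assumptions.

```python
import random

def average_pool(values):
    """Deterministic pooling: mean of all activations in the region."""
    return sum(values) / len(values)

def max_pool(values):
    """Deterministic pooling: largest activation in the region."""
    return max(values)

def monte_carlo_pool(values, n_samples, rng):
    """Average a random subset of n_samples activations (no replacement)."""
    subset = rng.sample(values, n_samples)
    return sum(subset) / n_samples

# A 4x4 region flattened to 16 activations: mostly 0.2, plus one spurious spike.
region = [0.2] * 15 + [5.0]
rng = random.Random(0)  # fixed seed so the run is repeatable

estimates = [monte_carlo_pool(region, n_samples=4, rng=rng) for _ in range(1000)]
hit_rate = sum(e > 1.0 for e in estimates) / len(estimates)

print("max pool:", max_pool(region))  # always dominated by the spike
print("average pool:", round(average_pool(region), 3))
print("share of sampled pools containing the spike:", hit_rate)
```

In this toy setting max pooling always returns the spike, whereas a four-point sample includes it in only about a quarter of the draws. In expectation the sampled mean equals the average pool, so the difference lies in the per-draw variability, which is what acts as a regularizer during training.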

Synergistic Contributions of FDConv and SD Loss. While the PPA and MCAttn modules provide the primary performance boosts, FDConv and SD Loss are also important for constructing a comprehensive, high-performance model. FDConv enriches feature representation without incurring additional parametric costs by constructing frequency-diverse convolution kernels. Low-frequency kernels capture smooth, global structures (e.g., leaf contours), while high-frequency kernels focus on fine textural details (e.g., lesion edges). This enhanced feature diversity is likely the reason for the model’s improved ability to distinguish between visually similar diseases, as evidenced by the AP gains of 2.5 and 2.4 percentage points for angular_leafspot and leaf_spot, respectively (Table 2). Concurrently, SD Loss provides optimization stability. By adaptively re-weighting the localization and scale losses according to target size during training, it addresses the inherent IoU sensitivity problem in small object detection, leading to more precise bounding box regression. This is reflected in the model’s overall performance, where a 3.0-point increase in Recall was achieved alongside a 0.4-point improvement in Precision (Table 2), indicating a better precision-recall balance.
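The IoU sensitivity mentioned above can be seen in a small worked example: the same localization error costs a small box far more IoU than a large one. The sketch below is not SD Loss itself; the box sizes and the 2-pixel diagonal offset are arbitrary values chosen for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def shifted_iou(size, shift):
    """IoU between a square ground-truth box and the same box shifted diagonally."""
    gt = (0.0, 0.0, size, size)
    pred = (shift, shift, size + shift, size + shift)
    return iou(gt, pred)

for size in (8, 32, 128):
    print(f"{size:>3} px box, 2 px offset: IoU = {shifted_iou(size, shift=2):.3f}")
```

A 2-pixel offset leaves a 128-pixel box at an IoU of about 0.94 but drops an 8-pixel box to about 0.39, below the 0.5 matching threshold. This is the effect that a size-dependent re-weighting of the loss terms is meant to compensate for.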

5.2. Strengths and Limitations

Strengths:

Superior Accuracy on Challenging Targets: The most significant advantage of PPA-MC-YOLO is its marked improvement in detection accuracy, particularly when handling typical challenges in agricultural scenes such as small, dense, and partially occluded targets. As detailed in Table 2, the model achieved substantial AP gains on key difficult classes, highlighting its practical utility.

Targeted and Synergistic Design: Unlike general-purpose object detectors, PPA-MC-YOLO components were specifically engineered to address known pain points in agricultural vision. The ablation study (Table 3) confirms that the synergy between these targeted modules yields a more significant performance uplift than any single modification could.

Excellent Accuracy-Efficiency Trade-off: A core strength is that PPA-MC-YOLO achieves a high degree of model lightweighting while simultaneously improving accuracy. Compared to the baseline YOLOv12, our model not only reduces the number of parameters by a remarkable 17.3% but also decreases the theoretical computational load (GFLOPs) from 6.6 to 6.3 (a 4.5% reduction) (Table 1). This result demonstrates that our proposed modules, particularly FDConv, possess exceptional computational efficiency while enhancing feature representation. Achieving higher detection accuracy at a lower computational and storage cost is paramount for future deployment on resource-constrained edge devices.

Limitations:

Trade-off between Theoretical Computation and Practical Speed: Although PPA-MC-YOLO is superior to the baseline model in terms of theoretical computational load (GFLOPs), its actual inference speed (FPS) slightly decreased (from 480 to 465). This reveals a common trade-off in model design: certain operations, despite having fewer floating-point operations, may introduce higher Memory Access Cost (MAC) or more complex control flows, thereby impacting parallel execution efficiency on specific hardware like GPUs. The multi-branch and attention computations in the PPA and MCAttn modules likely fall into this category. Therefore, a comprehensive consideration of theoretical efficiency and practical deployment performance is necessary when evaluating the model.

Sensitivity to Extreme Environmental Conditions: While the model was trained on a diverse dataset, its performance may degrade under conditions not well-represented in the training data, such as extreme over- or under-exposure, or severe motion blur caused by rapid movement.

Unverified Generalizability to Other Crops: The current model was trained and optimized exclusively on a strawberry dataset. Its efficacy on other crops (e.g., tomato, cucumber, rice) remains unproven. The pathological features and environmental contexts can vary significantly across different plant species, which may necessitate domain-specific fine-tuning or adaptation.

5.3. Comparison with Related Work and Future Directions

Comparison with Related Work:

The PPA-MC-YOLO framework proposed in this study surpasses various mainstream detectors in performance, with its advantages rooted in a deep understanding of and targeted design for the specific pain points of agricultural scenes.

Firstly, our approach is more systematic and comprehensive compared to other studies. For instance, Guo et al. [23] improved YOLOv7-Tiny to detect common rice leaf diseases. Similarly, Tao et al. [24] applied an improved YOLOv5 to strawberry maturity recognition. These works demonstrate the effectiveness of tailoring the YOLO architecture for specialized application domains. However, our PPA-MC-YOLO framework introduces synergistic innovations at four distinct levels: the detection head (PPA), feature fusion (MCAttn), convolutional kernel design (FDConv), and loss function (SD Loss). The ablation study (Table 3) shows that this multi-pronged strategy yields a larger performance gain than any single-module improvement, forming a more complete solution.

Secondly, our method exhibits a distinct design philosophy in addressing the challenges of small objects and multi-scale issues. Many advanced detectors, such as EfficientDet [39] with its BiFPN, improve performance through complex cross-scale connections and weighted feature fusion. While effective, these serial or iterative fusion methods may lose the fine-grained features of minuscule targets during repeated downsampling and upsampling. Our results indicate that the parallel branch design of the PPA module leads to a very significant improvement in small object detection (e.g., the AP increase for ‘powdery_mildew_fruit’ in Table 2). This is likely because its parallel structure can better preserve original features, avoiding information attenuation during transmission through deep networks. This forms an interesting contrast and complement to the literature [27,29] that emphasizes the importance of top-down pathways.

Furthermore, our modules are more targeted than general-purpose attention mechanisms like SE [25] or CBAM [26]. SE and CBAM primarily re-weight features along the channel and spatial dimensions, whereas our MCAttn module addresses the fundamental problem of scale invariance. By introducing stochastic sampling, it resolves the issue of scale sensitivity caused by deterministic pooling, which is particularly effective when dealing with disease targets of dramatically varying sizes.

Future Research Directions:

Based on the findings and identified limitations of this research, we propose the following promising avenues for future investigation:

Model Lightweighting and Inference Speed Optimization: Although PPA-MC-YOLO has been made lightweight regarding parameters and GFLOPs, there is still room for optimization in its practical inference speed (see Table 1). Future work should focus on optimizing the implementation of modules like PPA and MCAttn, for instance, through operator fusion or designing structures better suited for parallel computation. This would reduce memory access costs and fully translate theoretical computational advantages into practical speed improvements, better adapting the model for real-time deployment on mobile or edge devices.

Multi-Modal Data Fusion: The current research is based solely on RGB images. However, multispectral or hyperspectral imagery can provide plant physiological information invisible to the naked eye, which is crucial for the early, pre-symptomatic detection of diseases. Future research could explore how to effectively fuse this multi-modal data with our model by designing dual-stream network architectures, potentially enabling earlier warnings at the initial onset of a disease.

Cross-Crop Generalization and Domain Adaptation: The current model has been optimized and validated on a strawberry dataset. When applied to other crops (e.g., tomato, cucumber, rice), it may face performance degradation due to differences in pathological features and growing environments. Therefore, investigating Domain Adaptation techniques and exploring how to efficiently transfer the knowledge learned from strawberries to new crops using unsupervised or few-shot learning is a research direction of significant practical value.

Disease Progression Prediction Combined with Time-Series Analysis: The current model processes static images. Future research could incorporate time-series analysis. By continuously monitoring images of the same plant at different time points, it would be possible to detect diseases and predict their development trends and spread rates, providing data support for more precise intervention measures.

6. Conclusions

The rapid and accurate detection of diseases is a critical step in ensuring stable yields and enhancing the quality of strawberries. However, existing models often face challenges in complex field environments, such as insufficient capability in detecting small and ambiguous targets, poor multi-scale adaptability, and limited feature discriminability. To address these challenges, this paper proposes an enhanced deep learning model optimized for strawberry disease detection—PPA-MC-YOLO. This framework achieves a comprehensive performance improvement by systematically integrating four synergistic innovative modules:

The Parallel Pyramid Attention (PPA) Module, which introduces a multi-branch parallel structure in the detection head to effectively enhance the model’s ability to focus on the features of small, dense, and occluded targets; Monte Carlo Attention (MCAttn), which achieves scale-invariant contextual awareness through a novel stochastic sampling mechanism, significantly improving recognition robustness for targets of varying sizes; Frequency Dynamic Convolution (FDConv), which, without adding any parameters or computational costs, greatly enriches feature representations by constructing frequency-diverse convolutional kernels, enhancing the model’s ability to discriminate between similar diseases; and Scale-Decoupled Loss (SD Loss), which dynamically adjusts loss weights to optimize the training process, effectively mitigating the IoU sensitivity issue in small object detection and improving the model’s precision-recall balance.

Extensive experimental results on a custom strawberry disease dataset demonstrate that the PPA-MC-YOLO framework matches or surpasses the baseline YOLOv12 and the other compared models on every accuracy metric in Table 1. It achieves an mAP@0.5 of 81.1% and an mAP@0.5:0.95 of 63.2%, representing improvements of 2.1 and 1.9 percentage points over the baseline model, respectively. Crucially, this performance gain is achieved with a more lightweight model, making it well suited to resource-constrained environments. Compared to the baseline, PPA-MC-YOLO’s parameters and GFLOPs are reduced by 17.3% and 4.5%, respectively, which is a key advantage for deployment on embedded systems, although its measured inference speed is slightly lower (465 vs. 480 FPS). The model performs particularly well on visually ambiguous and hard-to-recognize classes, such as ‘powdery_mildew_fruit’, with an 11.0 percentage point increase in AP. This supports its effectiveness in real-world agricultural applications, including automated crop health management and precision spraying. Such accuracy on challenging classes underscores its practical value in mitigating economic losses. Comprehensive ablation studies and visualization analyses confirm each module’s independent efficacy and their combined effect.

In conclusion, this study successfully developed and validated PPA-MC-YOLO, an efficient and accurate framework for detecting strawberry diseases. By employing a multi-module approach, it significantly enhances detection performance and model efficiency, offering a robust solution for intelligent crop health management. This research provides a promising blueprint for designing next-generation vision systems for fine-grained agricultural tasks. Building on these advancements, future work will focus on both technical optimization and a comprehensive cost–benefit analysis. Specifically, we plan to migrate the framework to a wider range of crop species and optimize the model for edge computing devices through techniques like model quantization to enable deployment on agricultural robots, drones, and portable diagnostic tools. Simultaneously, we will collaborate with agricultural experts and economists to conduct a cost–benefit analysis. This will quantify the model’s economic value by measuring savings in pesticide and labor costs, as well as reduced yield losses due to early disease control, providing a data-driven basis for its commercialization and widespread application.

Author Contributions

Software, K.Z. and Z.L.; Investigation, K.C.; Resources, K.Z. and H.P.; Data curation, Z.L.; Writing—original draft, K.Z.; Visualization, Y.Y.; Supervision, K.C.; Funding acquisition, H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of Guangdong Province (Grant No. 2025A1515011771) and Guangzhou Science and Technology Plan Project (Grant No. 2023B01J0046, 2024E04J1242).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Simpson, D. The economic importance of strawberry crops. In The Genomes of Rosaceous Berries and Their Wild Relatives; Springer International Publishing: Cham, Switzerland, 2018; pp. 1–7.
2. Li, G.; Jiao, L.; Chen, P.; Liu, K.; Wang, R.; Dong, S.; Kang, C. Spatial convolutional self-attention-based transformer module for strawberry disease identification under complex background. Comput. Electron. Agric. 2023, 212, 108121.
3. Upadhyay, A.; Chandel, N.S.; Singh, K.P.; Chakraborty, S.K.; Nandede, B.M.; Kumar, M.; Subeesh, A.; Upendar, K.; Salem, A.; Elbeltagi, A. Deep learning and computer vision in plant disease detection: A comprehensive review of techniques, models, and trends in precision agriculture. Artif. Intell. Rev. 2025, 58, 92.
4. Qin, Y.M.; Tu, Y.H.; Li, T.; Ni, Y.; Wang, R.F.; Wang, H. Deep Learning for sustainable agriculture: A systematic review on applications in lettuce cultivation. Sustainability 2025, 17, 3190.
5. Zhou, W.; Li, M.; Achal, V. A comprehensive review on environmental and human health impacts of chemical pesticide usage. Emerg. Contam. 2025, 11, 100410.
6. Lu, H.; Dong, B.; Zhu, B.; Ma, S.; Zhang, Z.; Peng, J.; Song, K. A survey on deep learning-based object detection for crop monitoring: Pest, yield, weed, and growth applications. Vis. Comput. 2025, 1–26.
7. Padhiary, M.; Saha, D.; Kumar, R.; Sethi, L.N.; Kumar, A. Enhancing precision agriculture: A comprehensive review of machine learning and AI vision applications in all-terrain vehicle for farm automation. Smart Agric. Technol. 2024, 8, 100483.
8. Dhanya, V.; Subeesh, A.; Kushwaha, N.; Vishwakarma, D.K.; Kumar, T.N.; Ritika, G.; Singh, A. Deep learning based computer vision approaches for smart agricultural applications. Artif. Intell. Agric. 2022, 6, 211–229.
9. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090.
10. Ge, H.; Wang, J.; Zhen, T.; Li, Z.; Zhu, Y.; Pan, Q. FCA-YOLO: An Efficient Deep Learning Framework for Real-Time Monitoring of Stored-Grain Pests in Smart Warehouses. Agronomy 2025, 15, 1313.
11. de Almeida, G.P.S.; dos Santos, L.N.S.; da Silva Souza, L.R.; da Costa Gontijo, P.; de Oliveira, R.; Teixeira, M.C.; De Oliveira, M.; Teixeira, M.B.; do Carmo França, H.F. Performance Analysis of YOLO and Detectron2 Models for Detecting Corn and Soybean Pests Employing Customized Dataset. Agronomy 2024, 14, 2194.
12. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488.
13. Deng, R.; Cui, C.; Remedios, L.W.; Bao, S.; Womick, R.M.; Chiron, S.; Li, J.; Roland, J.T.; Lau, K.S.; Liu, Q.; et al. Cross-Scale Multi-Instance Learning for Pathological Image Diagnosis. Med. Image Anal. 2024, 94, 103124.
14. Abud, K.; Lavrushkin, S.; Vatolin, D. Evaluating Adversarial Robustness of No-Reference Image and Video Quality Assessment Models with Frequency-Masked Gradient Orthogonalization Adversarial Attack. Big Data Cogn. Comput. 2025, 9, 166.
15. Younesi, A.; Ansari, M.; Fazli, M.; Ejlali, A.; Shafique, M.; Henkel, J. A comprehensive survey of convolutions in deep learning: Applications, challenges, and future trends. IEEE Access 2024, 12, 41180–41218.
16. Jamjoom, M.; Elhadad, A.; Abulkasim, H.; Abbas, S. Plant Leaf Diseases Classification Using Improved K-Means Clustering and SVM Algorithm for Segmentation. Comput. Mater. Contin. 2023, 76, 367–382.
17. Yogeshwari, M.; Thailambal, G. Automatic feature extraction and detection of plant leaf disease using GLCM features and convolutional neural networks. Mater. Today Proc. 2023, 81, 530–536.
18. Li, K.R.; Duan, L.J.; Deng, Y.J.; Liu, J.L.; Long, C.F.; Zhu, X.H. Pest detection based on lightweight locality-aware faster R-CNN. Agronomy 2024, 14, 2303.
19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
20. Fuentes, A.; Yoon, S.; Lee, M.H.; Park, D.S. Improving accuracy of tomato plant disease diagnosis based on deep learning with explicit control of hidden classes. Front. Plant Sci. 2021, 12, 682230.
21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision – ECCV 2016, Proceedings of 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
22. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
23. Guo, F.; Li, J.; Liu, X.; Chen, S.; Zhang, H.; Cao, Y.; Wei, S. Improved YOLOv7-Tiny for the Detection of Common Rice Leaf Diseases in Smart Agriculture. Agronomy 2024, 14, 2796.
24. Tao, Z.; Li, K.; Rao, Y.; Li, W.; Zhu, J. Strawberry maturity recognition based on improved YOLOv5. Agronomy 2024, 14, 460.
25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–23 June 2018; pp. 7132–7141.
26. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
27. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
28. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206.
29. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12595–12604.
30. Zeiler, M.D.; Fergus, R. Stochastic pooling for regularization of deep convolutional neural networks. arXiv 2013, arXiv:1301.3557.
31. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
32. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
33. Peng, H.; Xu, H.; Shen, G.; Liu, H.; Guan, X.; Li, M. A lightweight crop pest classification method based on improved MobileNet-V2 model. Agronomy 2024, 14, 1334.
34. Xu, Y.; Yi, J.; Gao, J. Defect detection of automotive leather based on Nanodet-Plus. In Proceedings of the 2023 35th Chinese Control and Decision Conference (CCDC), IEEE, Yichang, China, 20–22 May 2023; pp. 1458–1463.
35. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-time flying object detection with YOLOv8. arXiv 2023, arXiv:2305.09972.
36. Huang, Y.; Liu, Z.; Zhao, H.; Tang, C.; Liu, B.; Li, Z.; Wan, F.; Qian, W.; Qiao, X. YOLO-YSTs: An improved YOLOv10n-based method for real-time field pest detection. Agronomy 2025, 15, 575.
37. Xie, X.; Zhang, R.; Guo, J.; Lu, L.; Pan, H.; Luo, X.; Meng, S. Strawberry Disease Detection Algorithm Based on YOLO11-Strawberry. Food Qual. Saf. 2025, fyaf027.
38. Wu, M.; Qiu, Y.; Wang, W.; Su, X.; Cao, Y.; Bai, Y. Improved RT-DETR and its application to fruit ripeness detection. Front. Plant Sci. 2025, 16, 1423682.
39. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.

Figure 1. Class distribution of instances in the strawberry disease dataset.

Figure 2. Size distribution of annotated bounding boxes in the dataset.

Figure 3. A schematic diagram of the YOLOv12 baseline architecture, illustrating the flow of features through the Backbone, Neck, and Head components.

Figure 4. The overall architecture of the proposed PPA-MC-YOLO framework and its key components. The diagram illustrates the strategic integration of our four innovations: (a) The standard detection head is replaced by our PPA-driven head, which uses Parallel Pyramid Attention to enhance focus on small targets. (b) The C2f modules are enhanced into A2C2f_MoCA blocks, incorporating a parallel MCAttn branch for adaptive feature recalibration (detailed on the left). (c) Key convolutional layers are upgraded to FDConv to enrich feature diversity. (d) The entire head is optimized with our SD-Loss to stabilize training for small objects.

Figure 5. Sensitivity analysis of the number of sampling points N in the MCAttn module.

Figure 6. A comparison of detection results between the baseline YOLOv12 and our PPA-MC-YOLO on four types of typical challenging samples. Subfigures (a,c,e,g) show the results of the baseline model, while subfigures (b,d,f,h) show the results of our model. The rows, respectively, display performance differences in the scenarios of (a,b) small and dense targets, (c,d) concealed targets in complex backgrounds, (e,f) co-existence of multi-scale targets, and (g,h) discrimination of confusing classes.

Table 1. Performance comparison of different models on the strawberry disease dataset.

Model	mAP@0.5 (%)	mAP@0.5:0.95 (%)	P (%)	R (%)	F1 (%)	GFLOPs	FPS	Params (M)	Model Size (MB)	CPU Load (%)
NanoDet-Plus	75.2	56.8	73.5	71.2	72.3	1.8	625	0.98	5.2	25.4
MobileNetV2-SSD	69.5	47.8	68.2	66.8	67.5	6.8	485	3.50	14.1	31.2
RT-DETR-R18	77.6	58.9	75.8	72.8	74.3	12.3	285	11.8	47.5	45.1
YOLOv8-n	77.8	57.8	73.2	75.8	74.5	8.1	455	3.01	12.3	35.7
YOLOv10-n	79.4	60.5	76.6	73.0	74.8	8.2	455	2.70	11.0	36.6
YOLOv11-n	78.9	60.0	77.0	76.2	76.6	6.3	345	2.58	10.5	33.8
YOLOv12-n (Baseline)	79.0	61.3	81.2	73.2	77.0	6.6	480	2.95	11.9	34.1
PPA-MC-YOLO (Ours)	81.1	63.2	81.6	76.2	78.8	6.3	465	2.44	9.9	38.5

Note: P: Precision; R: Recall; F1: F1-score; Params (M): model parameters (in millions); Model Size (MB): size of the saved model weights file; GFLOPs: Giga Floating-point Operations at 640 × 640 resolution; VRAM (GB) and CPU Load (%) represent the peak GPU memory usage and average CPU core usage, respectively, measured during inference on a single 640 × 640 image with a batch size of 1. FPS: Frames Per Second. All hardware-related tests were conducted on a single NVIDIA RTX 4090 GPU. The best performance is highlighted in bold, and the second-best is underlined.
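The F1 column in Table 1 is the harmonic mean of the P and R columns. As a quick consistency check, the sketch below recomputes it for the baseline and the proposed model; the function name `f1_score` is ours and is used only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) copied from Table 1.
rows = {
    "YOLOv12-n (Baseline)": (81.2, 73.2, 77.0),
    "PPA-MC-YOLO (Ours)": (81.6, 76.2, 78.8),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.1f}, reported = {reported}")
```

Both recomputed values agree with the reported F1 scores to one decimal place.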

Table 2. Performance comparison of PPA-MC-YOLO and the baseline model across different disease classes.

Class	YOLOv12 (Baseline) AP@0.5 (%)	YOLOv12 (Baseline) P (%)	YOLOv12 (Baseline) R (%)	PPA-MC-YOLO AP@0.5 (%)	PPA-MC-YOLO P (%)	PPA-MC-YOLO R (%)
angular_leafspot	76.1	84.4	68.9	78.6 (+2.5)	90.3 (+5.9)	69.5 (+0.6)
anthracnose_fruit_rot	71.3	73.1	57.2	71.9 (+0.6)	73.5 (+0.4)	60.0 (+2.8)
blossom_blight	87.0	73.1	100	85.7 (−1.3)	74.1 (+1.0)	100
gray_mold	85.6	80.4	83.5	86.2 (+0.6)	76.5 (−3.9)	87.5 (+4.0)
leaf_spot	80.7	90.7	70.7	83.1 (+2.4)	87.4 (−3.3)	74.4 (+3.7)
powdery_mildew_fruit	63.5	84.4	46.5	74.5 (+11.0)	84.4 (0.0)	52.0 (+5.5)
powdery_mildew_leaf	89.1	82.0	85.1	92.6 (+3.5)	85.2 (+3.2)	88.2 (+3.1)
All	79.0	81.2	73.2	81.1 (+2.1)	81.6 (+0.4)	76.2 (+3.0)

Note: Values in parentheses indicate the performance change in the PPA-MC-YOLO model relative to the baseline YOLOv12. Red font highlights the significant improvements in key challenging classes. Blue font indicates the average performance across all classes.

Table 3. Ablation study of each module’s impact on the performance of PPA-MC-YOLO (mAP@0.5, %).

Experiment Number	Model Configuration	mAP@0.5 (%)	ΔmAP (%)
#1	YOLOv12 (Baseline)	79.0	-
#2	+ PPA	79.7	+0.7
#3	+ MCAttn	80.0	+1.0
#4	+ SD Loss	79.2	+0.2
#5	+ FDConv	79.3	+0.3
#6	+ PPA + MCAttn	80.5	+1.5
#7	+ SD Loss + FDConv	79.4	+0.4
#8	PPA-MC-YOLO (Ours)	81.1	+2.1
#9	w/o PPA	80.4	−0.7
#10	w/o MCAttn	80.1	−1.0
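The ΔmAP column for experiments #2 to #8 can be recomputed from the mAP@0.5 values, which also makes it easy to compare the full-model gain with the sum of the single-module gains. The sketch below uses only numbers from Table 3; the variable names are ours.

```python
# mAP@0.5 (%) values copied from Table 3.
baseline = 79.0
single = {"PPA": 79.7, "MCAttn": 80.0, "SD Loss": 79.2, "FDConv": 79.3}
full_model = 81.1

# Gain of each single-module configuration over the baseline, in percentage points.
gains = {name: round(score - baseline, 1) for name, score in single.items()}
sum_of_single_gains = round(sum(gains.values()), 1)
full_gain = round(full_model - baseline, 1)

print(gains)
print("sum of single-module gains:", sum_of_single_gains)
print("full-model gain:", full_gain)
```

On these numbers the full-model gain (2.1 points) exceeds every single-module gain and is close to the sum of the four individual gains (2.2 points), so the contributions are roughly additive.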

Table 4. Ablation study on the effectiveness of SD Loss compared to other loss functions. The experiments were conducted on the full PPA-MC-YOLO framework, with only the loss function being varied.

Loss Function	mAP@0.5 (%)	mAP@0.5:0.95 (%)	AP_small (%)	Precision (%)	Recall (%)
Default (CIoU)	80.8	62.1	41.5	80.5	75.1
Focal Loss	80.5	61.8	40.9	82.1	73.9
Wise-IoU Loss	80.9	62.5	42.1	81.1	75.8
SD Loss (Ours)	81.1	63.2	43.5	81.6	76.2

Table 5. Performance impact of different branch combinations within the PPA module.

Experiment Setup (Inside the PPA Module)	mAP@0.5 (%)	ΔmAP (%)	Description
L	80.2	+0.2	Local feature branch only
S	80.0	Baseline	Spatial attention branch only
L+S	80.7	+0.7	Local + Spatial branches
G+L+S (Full PPA)	81.1	+1.1	Complete PPA module

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, K.; Ye, Y.; Chen, K.; Li, Z.; Peng, H. A Scale-Adaptive and Frequency-Aware Attention Network for Precise Detection of Strawberry Diseases. Agronomy 2025, 15, 1969. https://doi.org/10.3390/agronomy15081969

