Article
Peer-Review Record

Dynamic and Lightweight Detection of Strawberry Diseases Using Enhanced YOLOv10

Electronics 2025, 14(19), 3768; https://doi.org/10.3390/electronics14193768
by Huilong Jin 1,2, Xiangrong Ji 1,3 and Wanming Liu 1,2,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 10 August 2025 / Revised: 11 September 2025 / Accepted: 21 September 2025 / Published: 24 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes YOLO10-SC, an enhanced version of YOLOv10 designed for real-time detection of strawberry diseases under natural conditions. The model integrates three main improvements: (1) a Convolutional Block Attention Module (CBAM) to focus on disease-related features while suppressing irrelevant background noise, (2) an SCConv module integrated into the C2f structure for better fine-grained feature representation, and (3) DySample, a lightweight dynamic upsampler that improves boundary smoothness and detail preservation. Using a dataset of 5,000 augmented images covering seven strawberry disease categories, the authors conduct experiments showing that YOLO10-SC outperforms baseline YOLOv10 and other state-of-the-art detectors (YOLOv5–YOLOv9, RT-DETR, Faster R-CNN, Mask R-CNN, etc.) in precision, recall, F1-score, and mAP50, while also reducing computational complexity. An Android-based strawberry disease detection app prototype is also demonstrated.

The manuscript would benefit from a more critical discussion of dataset limitations and generalization capacity. While augmentation increases dataset size, reliance on a relatively small base set may restrict model robustness. A section acknowledging this limitation and proposing future expansion to multi-crop or multi-regional datasets would strengthen the contribution.

The comparison with baselines should clarify whether all models were retrained under consistent settings (same dataset splits, training epochs, hardware) to ensure fairness. If not, the authors should explicitly acknowledge potential inconsistencies.

A deeper error analysis should be included. Presenting confusion matrices, example failure cases, and interpretability outputs (e.g., CBAM attention heatmaps or Grad-CAM visualizations) would reveal where the model succeeds and where it struggles, adding valuable insights.

The presentation of methods could be improved by balancing detailed mathematical descriptions with more intuitive explanations. For instance, SCConv and DySample could be described in simpler terms alongside diagrams, to make their benefits clearer to readers from agriculture rather than deep learning.

Finally, the discussion of deployment and future work could be expanded. The authors should consider addressing practical aspects such as inference speed on mobile devices in field conditions, battery and connectivity constraints, and opportunities for extending YOLO10-SC to multimodal approaches (e.g., combining visual detection with sensor or text data).

Author Response

The paper proposes YOLO10-SC, an enhanced version of YOLOv10 designed for real-time detection of strawberry diseases under natural conditions. The model integrates three main improvements: (1) a Convolutional Block Attention Module (CBAM) to focus on disease-related features while suppressing irrelevant background noise, (2) an SCConv module integrated into the C2f structure for better fine-grained feature representation, and (3) DySample, a lightweight dynamic upsampler that improves boundary smoothness and detail preservation. Using a dataset of 5,000 augmented images covering seven strawberry disease categories, the authors conduct experiments showing that YOLO10-SC outperforms baseline YOLOv10 and other state-of-the-art detectors (YOLOv5–YOLOv9, RT-DETR, Faster R-CNN, Mask R-CNN, etc.) in precision, recall, F1-score, and mAP50, while also reducing computational complexity. An Android-based strawberry disease detection app prototype is also demonstrated.

Response: We sincerely thank the reviewers for their positive evaluation and recognition of our work. We have carefully considered all suggestions and made comprehensive revisions. Our point-by-point responses to the review comments and all modifications are outlined below, with key changes highlighted in the revised manuscript.

 

The manuscript would benefit from a more critical discussion of dataset limitations and generalization capacity. While augmentation increases dataset size, reliance on a relatively small base set may restrict model robustness. A section acknowledging this limitation and proposing future expansion to multi-crop or multi-regional datasets would strengthen the contribution.

Response: We greatly appreciate your professional comments on our paper. As you pointed out, algorithms trained solely on strawberry disease datasets have limitations. We have expanded the Discussion section to address this issue, indicating our future research will focus on multi-source, multi-crop, and multi-region datasets, while also exploring multimodal applications in this field. Specific corrections are as follows.

Action: [Section 4]

First, although this paper expanded the size of the data set through data augmentation, several limitations remain. These include: data sourced from a single origin, inclusion of only strawberries as the target crop, and the absence of analysis on disease severity or occlusion levels. These constraints may limit the model's generalization capabilities and robustness across diverse real-world scenarios. Moving forward, our work will focus on collecting data from multiple countries and regions, with plans to expand research to pest and disease detection across multiple crops. This will enhance model performance and contribute more effectively to smart agriculture.

 

The comparison with baselines should clarify whether all models were retrained under consistent settings (same dataset splits, training epochs, hardware) to ensure fairness. If not, the authors should explicitly acknowledge potential inconsistencies.

Response: Thank you very much for your feedback. We consider this a valuable suggestion. For comparison with the baseline, all models were retrained under consistent settings to ensure fairness. The specific parameters and relevant explanations have been added to the paper.

Action: [Section 3.2]

The experiments were conducted on the Windows operating system, using GPU acceleration with the PyTorch and CUDA frameworks; the parameters are specified in Table 2. All algorithms in this paper were trained with identical hyperparameters to ensure fairness.

Table 2. Configuration of the experimental training environment and hyperparameters

| Software and hardware platform | Model parameters |
| --- | --- |
| Operating system | Windows 11 |
| Processing unit (CPU) | 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50 GHz |
| Graphics card (GPU) | NVIDIA GeForce RTX 3080 |
| Framework | PyTorch 2.3.1 |
| Programming environment | Python 3.9 |
| Video memory | 36 GB |
| Memory | 32 GB |
| Image size | 640×640 |
| Optimizer | AdamW |
| Learning rate | 0.01 |
| Epochs | 200 |
| Batch size | 32 |
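For concreteness, the Table 2 hyperparameters map onto an Ultralytics-style training call as in the minimal sketch below; the dataset YAML and model file names are hypothetical placeholders rather than files from the paper.

```python
from ultralytics import YOLO

# Placeholder model/data files: the actual YOLO10-SC definition is not distributed here.
model = YOLO("yolov10n.pt")            # baseline checkpoint standing in for YOLO10-SC
model.train(
    data="strawberry_disease.yaml",    # hypothetical dataset config (7 disease classes)
    imgsz=640,                         # Image size
    epochs=200,                        # Epochs
    batch=32,                          # Batch size
    optimizer="AdamW",                 # Optimizer
    lr0=0.01,                          # initial learning rate
)
```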

 

A deeper error analysis should be included. Presenting confusion matrices, example failure cases, and interpretability outputs (e.g., CBAM attention heatmaps or Grad-CAM visualizations) would reveal where the model succeeds and where it struggles, adding valuable insights.

Response: We appreciate the reviewers for raising these important points. In the experimental section of the paper, we have incorporated confusion matrices and Grad-CAM visualization results, and carefully analyzed the aspects where the model fails during the analysis of results.

Action: [Section 3.4]

Figure 16 shows the normalized confusion matrices for the baseline model and YOLO10-SC. Comparisons reveal that the improved model significantly outperforms the baseline model in classification accuracy for most disease categories, such as Anthracnose Fruit Rot (correct classification rate increased from 0.50 to 0.79) and Blossom Blight (increased from 0.98 to 1.00). Furthermore, the confusion between background and Anthracnose Fruit Rot decreased from 0.32 to 0.18. However, the baseline model achieved a slightly higher classification accuracy (0.94) for Angular Leafspot than the improved model (0.90); although accuracy remains high, the reasons for the reduced classification performance in this category warrant further investigation. Meanwhile, categories such as Gray Mold and Leaf Spot demonstrated consistently high and stable classification accuracy across both models, reflecting the model's strong robustness in identifying these diseases.

(a) YOLOv10                                                   (b) YOLO10-SC

Figure 16. Comparison of normalized confusion matrix before and after improvement
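For readers reproducing Figure 16, a minimal sketch with scikit-learn follows, assuming per-detection labels have already been matched between predictions and ground truth; the label arrays are random placeholders, and only the five disease classes named above are listed.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Disease classes named in the text; the full dataset has seven categories,
# and detector confusion matrices usually add a "background" row/column.
classes = ["Angular Leafspot", "Anthracnose Fruit Rot", "Blossom Blight",
           "Gray Mold", "Leaf Spot"]

# Placeholder labels: in a detection pipeline, predictions are first matched
# to ground-truth boxes by IoU, and the matched class ids are compared.
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(classes), 200)
y_pred = rng.integers(0, len(classes), 200)

cm = confusion_matrix(y_true, y_pred, normalize="true")  # row-normalized, as in Figure 16
ConfusionMatrixDisplay(cm, display_labels=classes).plot(cmap="Blues", xticks_rotation=45)
plt.tight_layout()
plt.show()
```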

 

Figure 17 displays the heatmap visualizations of the improved YOLO10-SC model compared to the baseline YOLOv10 model. The comparison reveals that the improved YOLO10-SC heatmap exhibits superior visual coherence and target focus: its high-activation regions (warm tones) precisely cover the core lesion areas, forming a distinct gradient difference with the thermal boundaries of healthy tissue and background regions, demonstrating enhanced lesion-background discrimination capability. In contrast, the baseline YOLOv10 exhibits relatively diffuse heat distribution, with redundant activations persisting in healthy areas surrounding lesions. This reduces the visual distinctiveness of target contours, validating YOLO10-SC's enhanced capability to capture key visual patterns of disease during feature extraction while improving robustness in disease identification.

      

(a) Original image               (b) YOLOv10                      (c) YOLO10-SC

Figure 17. Comparison of heatmap before and after improvement
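Heatmaps like those in Figure 17 are typically produced with Grad-CAM; a minimal hook-based sketch follows. It assumes the model returns a (1, num_classes) score tensor, so for a detector such as YOLOv10 one would instead backpropagate the class confidence of a chosen predicted box, and target_layer stands for whichever backbone or neck layer is being visualized.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Weight a layer's activations by the spatially pooled gradient of a class score."""
    store = {}
    fwd = target_layer.register_forward_hook(lambda m, inp, out: store.update(act=out))
    bwd = target_layer.register_full_backward_hook(lambda m, gin, gout: store.update(grad=gout[0]))
    score = model(image)[0, class_idx]   # assumes a (1, num_classes) score tensor
    model.zero_grad()
    score.backward()
    fwd.remove()
    bwd.remove()
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)     # global-average-pool gradients
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```

The normalized map can then be colorized and alpha-blended over the original image to obtain overlays comparable to Figure 17.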

 

The presentation of methods could be improved by balancing detailed mathematical descriptions with more intuitive explanations. For instance, SCConv and DySample could be described in simpler terms alongside diagrams, to make their benefits clearer to readers from agriculture rather than deep learning.

Response: Thank you for your suggestions. We have streamlined the more complex sections of the methodology, significantly reducing the length. Additionally, we have included comparisons of feature maps before and after introducing each module to provide readers with a clearer and more intuitive understanding of the advantages of the algorithm presented in this paper. The newly added feature map comparisons and their brief descriptions are as follows.

Action: [Section 2]

To visually demonstrate the optimising effect of the CBAM attention mechanism on feature representation, Figures 3 and 4 compare feature map visualisations before and after incorporating CBAM. The visualisation reveals that without CBAM, highly activated regions appear relatively dispersed, demonstrating insufficient focus on critical semantic information. Conversely, with CBAM applied, high-response areas in the feature map concentrate more precisely on regions containing significant objects or structures, markedly enhancing the specificity and discriminative power of activation distribution. This comparison validates CBAM's ability to effectively guide the network towards more representative features, thereby improving the efficacy of feature representation.

Figure 3. Feature map prior to incorporating the CBAM attention mechanism

Figure 4. Feature map after incorporating the CBAM attention mechanism
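Feature-map comparisons such as Figures 3 and 4 (and Figures 9-12 below) can be generated with a simple forward hook; a minimal sketch follows, where layer stands in for the module output being inspected (e.g., the block before or after CBAM).

```python
import torch
import matplotlib.pyplot as plt

def show_feature_maps(model, layer, image, n_channels=8):
    """Capture one layer's output via a forward hook and plot its first few channels."""
    store = {}
    handle = layer.register_forward_hook(lambda m, inp, out: store.update(fmap=out.detach()))
    with torch.no_grad():
        model(image)                      # image: a (1, 3, H, W) preprocessed tensor
    handle.remove()
    fmap = store["fmap"][0]               # (C, H', W')
    fig, axes = plt.subplots(1, n_channels, figsize=(2 * n_channels, 2))
    for k, ax in enumerate(axes.ravel()):
        ax.imshow(fmap[k].cpu(), cmap="viridis")
        ax.axis("off")
    plt.show()
```

Running it on the same input for the layer before and after inserting a module (e.g., C2f versus C2f_SCConv) yields side-by-side comparisons of this kind.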

 

As shown in Figures 9 and 10, the enhanced feature map demonstrates greater focus on the activation of key leaf structures, reduced background interference, and clearer depiction of leaf texture details. This effectively strengthens the expression of target features, validating the effectiveness of the improvement.

 

Figure 9. Feature map output by the C2f module

Figure 10. Feature map output by the C2f_SCConv module

 

As shown in Figures 11 and 12, after introducing DySample, the highlighted key regions in the feature maps become more focused and exhibit sharper details. Compared to the original upsampling method, this approach effectively enhances the resolution and specificity of the features.

Figure 11. Feature map before introducing DySample

Figure 12. Feature map after introducing DySample

 

Finally, the discussion of deployment and future work could be expanded. The authors should consider addressing practical aspects such as inference speed on mobile devices in field conditions, battery and connectivity constraints, and opportunities for extending YOLO10-SC to multimodal approaches (e.g., combining visual detection with sensor or text data).

Response: We sincerely appreciate the reviewer's suggestions. We have added explanations for relevant parameters in Section 3.7. Strawberry Disease Detection System, clarified issues related to inference speed and network connectivity limitations, and discussed shortcomings and future work in Chapter 4. Discussion, as detailed below.

Action1: [Section 3.7]

To simulate the hardware usage scenarios of agricultural users in the field, deployment performance was validated on a mid-range mobile device: the HUAWEI nova9, equipped with a Snapdragon 778G processor and 8GB of RAM. During operation, the frame rate generally remained within the 50-100 FPS range, exceeding 100 frames per second under favourable conditions. This system operates offline, eliminating concerns regarding signal delays in remote environments.

Action2: [Section 4]

Secondly, although this study achieved promising results in detecting diseases on strawberry images, it remains confined to a single modality: visual information. With advancements in multimodal fusion and sensor technologies, our future work will focus on integrating visual data with sensor-derived information such as temperature and humidity, alongside textual and audio data, to provide farmers with enhanced decision support.

Finally, although this paper deploys YOLO10-SC within a mobile application, testing has verified that the system effectively addresses strawberry pest and disease detection tasks in productive agricultural cultivation. Its inference speed meets practical requirements in real-world environments, and the system operates offline without network constraints. However, due to limitations of the experimental site, it was not possible to test inference speed and battery endurance in field conditions. Future work will involve selecting suitable locations and seasons to conduct more comprehensive testing and refinement of the mobile application.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The study uses YOLOv10 as the base model and innovatively integrates three core components: through the CBAM attention module, the representation ability of key disease features is enhanced; by using SCConv, the convolution module is reconstructed to improve the accuracy of distinguishing subtle differences in diseases. The overall technical solution conforms to the cutting-edge trends in the application of deep learning in intelligent agriculture. Among them, the lightweight design concept can effectively adapt to mobile terminals and edge computing devices, providing a feasible technical path for real-time disease detection. In terms of result verification, the study has constructed a comprehensive analysis system: through control experiments, the performance advantages of the improved YOLO10-SC algorithm compared to the original YOLOv10 model were quantitatively verified; through ablation experiments, the independent contribution and collaborative enhancement mechanism of each improvement module were clarified; through generalization tests on the COCO dataset and horizontal comparisons with 13 mainstream detection algorithms, the generalization ability and industry competitiveness of the proposed algorithm were fully confirmed.

Revision Comments

  1. The article only mentions the number of iterations, but does not specify core training parameters such as the learning rate, optimizer type, and batch size. It is recommended to complete this part of the data.
  2. The introduction of modules such as CBAM and C2f_SCConv in the text is overly complex. It is recommended to simplify it. Besides, the original framework in YOLOv10 does not need much explanation.
  3. What advantages does the dataset in this article, which is from a laboratory in South Korea, have over the one cited in the article "An instance segmentation model for strawberry diseases based on Mask R-CNN" by U. Afzaal, B. Bhattarai, Y. R. Pandeya, and J. Lee, Sensors, vol. 21, 2021?
  4. The comparative experiments in the article do not include comparisons with the higher-level YOLO models. It is suggested to conduct comparisons with YOLOv11, v12, etc.
  5. In this paper, only numerical verification was conducted to assess the improvement effects of CBAM and DySample, but no intuitive visualization of the impact of the module on feature extraction was provided; it is suggested to add visual comparisons of the feature maps before and after the module's addition.

Author Response

[Reviewer 2]

The study uses YOLOv10 as the base model and innovatively integrates three core components: through the CBAM attention module, the representation ability of key disease features is enhanced; by using SCConv, the convolution module is reconstructed to improve the accuracy of distinguishing subtle differences in diseases. The overall technical solution conforms to the cutting-edge trends in the application of deep learning in intelligent agriculture. Among them, the lightweight design concept can effectively adapt to mobile terminals and edge computing devices, providing a feasible technical path for real-time disease detection. In terms of result verification, the study has constructed a comprehensive analysis system: through control experiments, the performance advantages of the improved YOLO10-SC algorithm compared to the original YOLOv10 model were quantitatively verified; through ablation experiments, the independent contribution and collaborative enhancement mechanism of each improvement module were clarified; through generalization tests on the COCO dataset and horizontal comparisons with 13 mainstream detection algorithms, the generalization ability and industry competitiveness of the proposed algorithm were fully confirmed.

Response: We sincerely thank the reviewers for their positive evaluation and precise summary of our work. We have carefully incorporated all suggestions and conducted comprehensive revisions. Our point-by-point responses to the review comments and all modifications are outlined below, with key revisions highlighted in the revised manuscript.

 

Revision Comments

1. The article only mentions the number of iterations, but does not specify core training parameters such as the learning rate, optimizer type, and batch size. It is recommended to complete this part of the data.

Response: We appreciate the issues pointed out by the reviewer. After thorough review, we acknowledge that the training parameters were indeed not described in sufficient detail. All experiments were conducted under consistent parameter settings, and we have supplemented the specific details in the paper as follows.

Action: [Section 3.2]

The experiments were conducted on the Windows operating system, using GPU acceleration with the PyTorch and CUDA frameworks; the parameters are specified in Table 2. All algorithms in this paper were trained with identical hyperparameters to ensure fairness.

Table 2. Configuration of the experimental training environment and hyperparameters

| Software and hardware platform | Model parameters |
| --- | --- |
| Operating system | Windows 11 |
| Processing unit (CPU) | 11th Gen Intel(R) Core(TM) i9-11900 @ 2.50 GHz |
| Graphics card (GPU) | NVIDIA GeForce RTX 3080 |
| Framework | PyTorch 2.3.1 |
| Programming environment | Python 3.9 |
| Video memory | 36 GB |
| Memory | 32 GB |
| Image size | 640×640 |
| Optimizer | AdamW |
| Learning rate | 0.01 |
| Epochs | 200 |
| Batch size | 32 |

 

 

2. The introduction of modules such as CBAM and C2f_SCConv in the text is overly complex. It is recommended to simplify it. Besides, the original framework in YOLOv10 does not need much explanation.

Response: We appreciate the reviewer's feedback. We have streamlined the description of object detection in the introduction and simplified the explanation of the original YOLOv10 framework. Below are excerpts from the revised sections.

Action1: [Section 1]

Target detection is an important branch of computer vision, mainly used to identify and analyze targets in images, and is widely applied in crop pest detection and yield estimation. It is divided into two-stage and single-stage target detection. Single-stage detection algorithms output detection classifications and prediction boxes in a single network pass, so they offer good detection speed and are well suited to mobile deployment, while retaining enough structural space for adding algorithmic modules to meet the varied needs of detection applications. The YOLO algorithm is the representative single-stage target detection algorithm.

Action2: [Section 2.1]

The introduction of the Channel Attention Mechanism facilitates the efficient detection of contour features associated with the target, thereby enriching the information available for target detection. This mechanism enables the network to prioritize critical feature channels pertinent to specific tasks, ultimately enhancing both the performance and efficiency of the network. The structure of the channel attention module is shown in Figure 2(a). The computation of the channel attention mechanism is expressed as follows:
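In the standard CBAM formulation (Woo et al., 2018), which this module follows, the channel attention map is

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\mathrm{avg}})) + W_1(W_0(F^c_{\mathrm{max}}))\big)$$

where $\sigma$ is the sigmoid function and $W_0$, $W_1$ are the weights of the shared multilayer perceptron.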

The Spatial Attention Mechanism operates by compressing the channel dimension and performing mean and maximum pooling along this axis. By integrating the Spatial Attention Module, the model can effectively localize the detection target, thereby enhancing the detection rate. The structure of the spatial attention module is shown in Figure 2(b). The spatial attention mechanism is expressed as follows:
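Likewise, the standard CBAM spatial attention map is

$$M_s(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F);\,\mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times7}([F^s_{\mathrm{avg}};\,F^s_{\mathrm{max}}])\big)$$

where $f^{7\times7}$ denotes a convolution with a 7×7 kernel over the channel-concatenated pooled maps.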

Action3: [Section 2.2]

SRU uses a separation-reconstruction approach: separation distinguishes high- and low-information feature maps via Group Normalization-derived scaling factors to suppress spatial redundancy, while reconstruction merges these features for more informative outputs and optimized spatial utilization; its structure is shown in Figure 6.

CRU is a channel reconstruction unit that utilizes a segmentation-transformation-fusion strategy to reduce channel-dimension redundancy as well as computational cost and storage. The structure of CRU is shown in Figure 7.
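As a rough intuition for SRU, a minimal PyTorch sketch is shown below: GroupNorm scaling factors rank channels by information content, a sigmoid gate splits the map into informative and redundant parts, and a cross-reconstruction merges the halves. The group count and 0.5 threshold are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SRUSketch(nn.Module):
    """Illustrative Spatial Reconstruction Unit: separate, then cross-reconstruct."""
    def __init__(self, channels: int, groups: int = 16, gate_threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.gate_threshold = gate_threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gn_x = self.gn(x)
        # Normalized GroupNorm scale factors act as per-channel information weights.
        w = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(gn_x * w)
        informative = torch.where(gate >= self.gate_threshold, gate, torch.zeros_like(gate)) * x
        redundant = torch.where(gate < self.gate_threshold, gate, torch.zeros_like(gate)) * x
        # Cross-reconstruction: redundant features reinforce informative ones.
        i1, i2 = torch.chunk(informative, 2, dim=1)   # requires an even channel count
        r1, r2 = torch.chunk(redundant, 2, dim=1)
        return torch.cat([i1 + r2, i2 + r1], dim=1)
```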

Action4: [Section 2.3]

Feature up-sampling is crucial for target detection, as it restores feature resolution to boost classification and localization accuracy. However, YOLOv10's original nearest-neighbor interpolation for up-sampling depends only on pixel spatial positions, ignoring the feature map's semantic information and surrounding points, which results in low-quality outputs. Although dynamic upsamplers like CARAFE[25], FADE[26], and SAPA[27] improve performance via content-aware kernels, they add extra complexity, with FADE and SAPA even requiring high-resolution feature inputs.
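For intuition, here is a minimal point-sampling upsampler in the spirit of DySample: a lightweight head predicts content-aware offsets, and grid_sample reads the input at the shifted positions. This is an illustrative sketch under stated assumptions, not the official implementation; the 1×1 offset head and the 0.25 offset scale are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Illustrative point-sampling upsampler in the spirit of DySample."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Predict one (dx, dy) offset per output position from the input content.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        hs, ws = h * self.scale, w * self.scale
        # Small content-aware offsets, rearranged to (B, 2, H*s, W*s); the 0.25
        # factor keeps sampling points near their initial grid positions.
        offsets = F.pixel_shuffle(self.offset(x) * 0.25, self.scale)
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, hs, device=x.device)
        xs = torch.linspace(-1, 1, ws, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Shift each sampling point by its learned offset, then sample bilinearly.
        grid = grid + offsets.permute(0, 2, 3, 1)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```

Because the offsets are predicted from the features themselves, boundaries can be reconstructed more smoothly than with nearest-neighbor interpolation, at the cost of only a 1×1 convolution.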

 

3. What advantages does the dataset in this article, which is from a laboratory in South Korea, have over the one cited in the article "An instance segmentation model for strawberry diseases based on Mask R-CNN" by U. Afzaal, B. Bhattarai, Y. R. Pandeya, and J. Lee, Sensors, vol. 21, 2021?

Response: We thank the reviewers for raising this point. Indeed, the advantages of our work over that proposed in the article 'An instance segmentation model for strawberry diseases based on Mask R-CNN' were not made sufficiently evident with respect to the dataset. We have now addressed this by adding a dedicated paragraph in the experimental section detailing these advantages. Specifically:

Action: [Section 3.6]

The algorithm in this paper improves the P value by 0.183, the R value by 0.05, the mAP50 value by 0.09, and the F1 score by 0.121 compared with the Mask R-CNN model reported in 2021, when the dataset used in this paper was first proposed; the performance of our algorithm is therefore better. Regarding the dataset, the original images contained relatively few challenging detection conditions. This study implemented data augmentation techniques: noise addition to simulate blurred images and weather conditions like fog or dust storms; brightness adjustment to mimic nighttime and midday scenarios; random masking of original images to simulate occlusions; and rotation, cropping, translation, and mirroring to replicate images captured from various angles. Models trained on the augmented dataset demonstrate stronger generalization capabilities and better adaptability for detecting strawberry diseases under diverse conditions. In summary, the proposed method demonstrates superior performance compared to approaches used when establishing the dataset.
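A sketch of this augmentation recipe with the albumentations library might look as follows; all probabilities and magnitudes are illustrative assumptions, and some argument names (e.g., for CoarseDropout) differ across library versions.

```python
import albumentations as A

augment = A.Compose(
    [
        A.GaussNoise(p=0.3),                           # noise: blur, fog, dust-storm-like images
        A.RandomBrightnessContrast(p=0.5),             # brightness: nighttime vs. midday
        A.CoarseDropout(max_holes=8, p=0.3),           # random masking: simulated occlusion
        A.Rotate(limit=30, p=0.5),                     # rotation
        A.RandomCrop(height=640, width=640, p=0.3),    # cropping
        A.Affine(translate_percent=0.1, p=0.3),        # translation
        A.HorizontalFlip(p=0.5),                       # mirroring
    ],
    # Keep YOLO-format boxes consistent with the transformed image.
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```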

 

4. The comparative experiments in the article do not include comparisons with the higher-level YOLO models. It is suggested to conduct comparisons with YOLOv11, v12, etc.

Response: We are grateful for the reviewer's suggestions. Indeed, we had not compared our proposed algorithm with newer versions of baseline models such as YOLOv11 and YOLOv12. In our latest revisions, we have supplemented the experiments in this regard. The specific results are presented below.

Action: [Section 3.6]

Compared with YOLOv11 and YOLOv12, the latest single-stage object detection algorithms of the past two years, the proposed algorithm also achieved the best performance in strawberry disease detection, making it more suitable for application in smart agriculture.

Table 7. Results of comparison experiments

| Model | P | R | mAP50 | F1 |
| --- | --- | --- | --- | --- |
| YOLOv11 (2024) | 0.889 | 0.824 | 0.885 | 0.855 |
| YOLOv12 (2025) | 0.880 | 0.823 | 0.892 | 0.851 |
| YOLO10-SC | 0.885 | 0.865 | 0.914 | 0.875 |
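As a quick consistency check on Table 7, the F1 scores follow directly from precision and recall via the standard harmonic mean:

$$F1 = \frac{2PR}{P+R}, \qquad F1_{\text{YOLO10-SC}} = \frac{2 \times 0.885 \times 0.865}{0.885 + 0.865} = \frac{1.531}{1.750} \approx 0.875$$

which matches the tabulated value (and likewise 0.855 for YOLOv11 and 0.851 for YOLOv12).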

 

5. In this paper, only numerical verification was conducted to assess the improvement effects of CBAM and DySample, but no intuitive visualization of the impact of the module on feature extraction was provided; it is suggested to add visual comparisons of the feature maps before and after the module's addition.

Response: We are grateful for the reviewer's suggestion that adding visual comparisons of feature maps before and after module insertion would indeed enhance the readability of the paper. We have now generated comparative feature maps for each module before and after insertion. These are presented below.

Action: [Section 2]

To visually demonstrate the optimising effect of the CBAM attention mechanism on feature representation, Figures 3 and 4 compare feature map visualisations before and after incorporating CBAM. The visualisation reveals that without CBAM, highly activated regions appear relatively dispersed, demonstrating insufficient focus on critical semantic information. Conversely, with CBAM applied, high-response areas in the feature map concentrate more precisely on regions containing significant objects or structures, markedly enhancing the specificity and discriminative power of activation distribution. This comparison validates CBAM's ability to effectively guide the network towards more representative features, thereby improving the efficacy of feature representation.

Figure 3. Feature map prior to incorporating the CBAM attention mechanism

Figure 4. Feature map after incorporating the CBAM attention mechanism

 

As shown in Figures 9 and 10, the enhanced feature map demonstrates greater focus on the activation of key leaf structures, reduced background interference, and clearer depiction of leaf texture details. This effectively strengthens the expression of target features, validating the effectiveness of the improvement.

 

Figure 9. Feature map output by the C2f module

Figure 10. Feature map output by the C2f_SCConv module

 

As shown in Figures 11 and 12, after introducing DySample, the highlighted key regions in the feature maps become more focused and exhibit sharper details. Compared to the original upsampling method, this approach effectively enhances the resolution and specificity of the features.

Figure 11. Feature map before introducing DySample

Figure 12. Feature map after introducing DySample

 

 

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper revised the YOLOv10 for strawberry plant disease detection tasks. It leverages the channel attention mechanism. However, the research is not able to find significant improvement. For example, for the COCO dataset, which is a very old dataset, there is no significant improvement. This makes the main results of improvement not quite convincing and there is a chance of overfitting the small dataset.

More innovative methods are needed to make the paper more convincing. 

Author Response

[Reviewer 3]

The paper revised the YOLOv10 for strawberry plant disease detection tasks. It leverages the channel attention mechanism. However, the research is not able to find significant improvement. For example, for the COCO dataset, which is a very old dataset, there is no significant improvement. This makes the main results of improvement not quite convincing and there is a chance of overfitting the small dataset.

More innovative methods are needed to make the paper more convincing. 

Response: We sincerely appreciate the reviewers' meticulous comments on the improvement directions and argumentation logic of this study. Your insights have provided crucial guidance for more precisely articulating the research value. We have refined and supplemented the manuscript accordingly and humbly request further guidance from the reviewers.

First, regarding the observation that “no significant improvement was observed on the COCO dataset,” the core focus of this study is scenario-specific optimization for strawberry disease detection, rather than enhancing general object detection models. In actual experiments, the improved YOLO10-SC achieves enhanced key metrics on a strawberry disease-specific dataset (containing 7 common disease categories and 5,000 field-captured samples). Detailed comparisons are presented in Tables 3 and 4 and Figure 15 of the paper. In the latest revision, we have incorporated comparisons of confusion matrices before and after improvements, heatmap comparisons, and feature map comparisons before and after module insertion to more intuitively demonstrate the effectiveness of our proposed solution.

Second, regarding innovation, this study's contribution lies not in breakthroughs within individual modules but in scenario-based collaborative design addressing three core challenges in strawberry disease detection. We acknowledge that the foundational structures of CBAM, SCConv, and DySample modules were established in prior literature. Subsequent revisions have supplemented the section on “Innovative combination adaptation and optimization tailored for strawberry disease detection scenarios.”

Finally, future research will prioritize model generalization by extending the algorithm to multi-region, multi-crop, and even multi-modal scenarios. Integrating sensor data and textual information will further advance contributions to smart agriculture. Specific modifications are outlined below.

Action1: [Section 3.4]

Table 3. Performance comparison before and after improvement on the dataset of this paper (without pre-training).

| Algorithm | P | R | mAP50 | F1 | Params | GFLOPs | Model size | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Before improvement | 0.882 | 0.812 | 0.873 | 0.846 | 2,697,146 | 8.2 | 5.8 | 142.8 |
| After improvement | 0.885 | 0.865 | 0.914 | 0.875 | 2,624,204 | 7.9 | 5.7 | 149 |

Table 4. Performance comparison before and after improvement on the dataset of this paper (with pre-training).

| Algorithm | P | R | mAP50 | F1 | Params | GFLOPs | Model size | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Before improvement | 0.935 | 0.851 | 0.917 | 0.891 | 2,697,146 | 8.2 | 5.8 | 141.1 |
| After improvement | 0.951 | 0.903 | 0.958 | 0.926 | 2,624,204 | 7.9 | 5.7 | 146.5 |

Figure 15. Comparison of mAP50 visualization before and after improvement

Action2: [Section 3.4]

Figure 16 shows the normalized confusion matrices for the baseline model and YOLO10-SC. Comparisons reveal that the improved model significantly outperforms the baseline model in classification accuracy for most disease categories, such as Anthracnose Fruit Rot (correct classification rate increased from 0.50 to 0.79) and Blossom Blight (increased from 0.98 to 1.00). Furthermore, the confusion between background and Anthracnose Fruit Rot decreased from 0.32 to 0.18. However, the baseline model achieved a slightly higher classification accuracy (0.94) for Angular Leafspot than the improved model (0.90); although accuracy remains high, the reasons for the reduced classification performance in this category warrant further investigation. Meanwhile, categories such as Gray Mold and Leaf Spot demonstrated consistently high and stable classification accuracy across both models, reflecting the model's strong robustness in identifying these diseases.

(a) YOLOv10                                                   (b) YOLO10-SC

Figure 16. Comparison of normalized confusion matrix before and after improvement

 

Figure 17 displays the heatmap visualizations of the improved YOLO10-SC model compared to the baseline YOLOv10 model. The comparison reveals that the improved YOLO10-SC heatmap exhibits superior visual coherence and target focus: its high-activation regions (warm tones) precisely cover the core lesion areas, forming a distinct gradient difference with the thermal boundaries of healthy tissue and background regions, demonstrating enhanced lesion-background discrimination capability. In contrast, the baseline YOLOv10 exhibits relatively diffuse heat distribution, with redundant activations persisting in healthy areas surrounding lesions. This reduces the visual distinctiveness of target contours, validating YOLO10-SC's enhanced capability to capture key visual patterns of disease during feature extraction while improving robustness in disease identification.

      

(a) Original image               (b) YOLOv10                      (c) YOLO10-SC

Figure 17. Comparison of heatmap before and after improvement

 

Action3: [Section 2]

To visually demonstrate the optimising effect of the CBAM attention mechanism on feature representation, Figures 3 and 4 compare feature map visualisations before and after incorporating CBAM. The visualisation reveals that without CBAM, highly activated regions appear relatively dispersed, demonstrating insufficient focus on critical semantic information. Conversely, with CBAM applied, high-response areas in the feature map concentrate more precisely on regions containing significant objects or structures, markedly enhancing the specificity and discriminative power of activation distribution. This comparison validates CBAM's ability to effectively guide the network towards more representative features, thereby improving the efficacy of feature representation.

Figure 3. Feature map prior to incorporating the CBAM attention mechanism

Figure 4. Feature map after incorporating the CBAM attention mechanism

 

As shown in Figures 9 and 10, the enhanced feature map demonstrates greater focus on the activation of key leaf structures, reduced background interference, and clearer depiction of leaf texture details. This effectively strengthens the expression of target features, validating the effectiveness of the improvement.

 

Figure 9. Feature map output by the C2f module

Figure 10. Feature map output by the C2f_SCConv module

 

As shown in Figures 11 and 12, after introducing DySample, the highlighted key regions in the feature maps become more focused and exhibit sharper details. Compared to the original upsampling method, this approach effectively enhances the resolution and specificity of the features.

Figure 11. Feature map before introducing DySample

Figure 12. Feature map after introducing DySample

 

Action4: [Section 4]

The proposed YOLO10-SC is a scenario-specific collaborative solution designed to address the unique challenges of strawberry pest and disease detection—filling technical gaps that individual modules or generic combinations cannot cover. This integrated design targets three critical pain points in strawberry disease detection: complex backgrounds, subtle morphological differences between disease categories, and computational constraints on edge devices. The closed-loop synergistic optimization chain formed by the three modules amplifies the model's performance advantages: CBAM reduces background interference, thereby alleviating the computational load of SCConv's fine-grained discrimination; the highly recognizable feature maps output by SCConv enhance the precision of DySample's detail reconstruction; while DySample's efficient upsampling ensures optimized features fully serve detection tasks. This synergy enables YOLO10-SC to achieve a perfect balance of detection accuracy, speed, and device adaptability in strawberry disease detection, providing an efficient solution for real-time field monitoring.

 

Action5: [Section 4]

First, although this paper expanded the size of the data set through data augmentation, several limitations remain. These include: data sourced from a single origin, inclusion of only strawberries as the target crop, and the absence of analysis on disease severity or occlusion levels. These constraints may limit the model's generalization capabilities and robustness across diverse real-world scenarios. Moving forward, our work will focus on collecting data from multiple countries and regions, with plans to expand research to pest and disease detection across multiple crops. This will enhance model performance and contribute more effectively to smart agriculture.

Secondly, although this study achieved promising results in detecting diseases on strawberry images, it remains confined to a single modality: visual information. With advancements in multimodal fusion and sensor technologies, our future work will focus on integrating visual data with sensor-derived information such as temperature and humidity, alongside textual and audio data, to provide farmers with enhanced decision support.

Finally, although this paper deploys YOLO10-SC within a mobile application, testing has verified that the system effectively addresses strawberry pest and disease detection tasks in productive agricultural cultivation. Its inference speed meets practical requirements in real-world environments, and the system operates offline without network constraints. However, due to limitations of the experimental site, it was not possible to test inference speed and battery endurance in field conditions. Future work will involve selecting suitable locations and seasons to conduct more comprehensive testing and refinement of the mobile application.

 

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript presents an improved single-stage detection framework (YOLO10-SC) for strawberry pest and disease recognition in natural environments. The authors integrate three modules into YOLOv10: (1) CBAM for enhanced feature attention, (2) SCConv embedded into C2f to improve fine-grained discrimination, and (3) DySample for lightweight and adaptive upsampling. Experimental validation on a strawberry disease dataset demonstrates improvements in precision, recall, mAP50, F1-score, FPS, and reduced computational complexity. Comparative and ablation studies further support the effectiveness of the proposed method, and a prototype mobile app demonstrates its potential for smart agriculture applications. However, I mainly have the following comments to be addressed.

  1. While the integration of CBAM, SCConv, and DySample into YOLOv10 demonstrates performance gains, the incremental novelty is somewhat limited since all three modules are established in prior literature. The manuscript should better highlight what is unique about their specific combination or adaptation for strawberry disease detection.
  2. The dataset is relatively small (2,500 original images, expanded to 5,000 after augmentation) and originates from a single source. This may limit generalizability. More discussion is needed on dataset diversity, potential biases, and whether the model would perform robustly under different real-world conditions (e.g., field lighting, occlusion, disease severity stages).
  3. Although pre-training and ablation studies are reported, the paper lacks statistical validation (e.g., confidence intervals, significance tests) to confirm the robustness of improvements. Additionally, comparisons to more recent lightweight YOLO variants should be deepened, not just numerically but also with qualitative visualization of detection outputs.
  4. The mobile app deployment is briefly mentioned, but details are lacking on real-time inference performance on actual mobile/edge devices. This information is critical to validate claims of suitability for real-world smart agriculture use.
  5. The manuscript is generally well organized but tends to be verbose with repeated explanations of standard modules (e.g., CBAM, SCConv). Condensing these sections would improve readability. The conclusion could also better articulate limitations and future directions.

Author Response

[Reviewer 4]

The manuscript presents an improved single-stage detection framework (YOLO10-SC) for strawberry pest and disease recognition in natural environments. The authors integrate three modules into YOLOv10: (1) CBAM for enhanced feature attention, (2) SCConv embedded into C2f to improve fine-grained discrimination, and (3) DySample for lightweight and adaptive upsampling. Experimental validation on a strawberry disease dataset demonstrates improvements in precision, recall, mAP50, F1-score, FPS, and reduced computational complexity. Comparative and ablation studies further support the effectiveness of the proposed method, and a prototype mobile app demonstrates its potential for smart agriculture applications. However, I mainly have the following comments to be addressed.

Response: We are most grateful for the reviewers' valuable feedback, which we have utilised to enhance the quality of our paper. In accordance with your suggestions, we have carefully revised the previous manuscript, with specific amendments listed below.

 

1. While the integration of CBAM, SCConv, and DySample into YOLOv10 demonstrates performance gains, the incremental novelty is somewhat limited since all three modules are established in prior literature. The manuscript should better highlight what is unique about their specific combination or adaptation for strawberry disease detection.

Response: We are grateful to the reviewers for their insightful comments on the novelty of this research. We acknowledge that the fundamental architecture of the CBAM, SCConv, and DySample modules has been established in prior literature. In subsequent revisions, we have supplemented the discussion on the ‘innovative combination adaptation and optimisation tailored for strawberry disease detection scenarios’, as detailed below.

Action: [Section 4]

The proposed YOLO10-SC is a scenario-specific collaborative solution designed to address the unique challenges of strawberry pest and disease detection—filling technical gaps that individual modules or generic combinations cannot cover. This integrated design targets three critical pain points in strawberry disease detection: complex backgrounds, subtle morphological differences between disease categories, and computational constraints on edge devices. The closed-loop synergistic optimization chain formed by the three modules amplifies the model's performance advantages: CBAM reduces background interference, thereby alleviating the computational load of SCConv's fine-grained discrimination; the highly recognizable feature maps output by SCConv enhance the precision of DySample's detail reconstruction; while DySample's efficient upsampling ensures optimized features fully serve detection tasks. This synergy enables YOLO10-SC to achieve a perfect balance of detection accuracy, speed, and device adaptability in strawberry disease detection, providing an efficient solution for real-time field monitoring.

 

2. The dataset is relatively small (2,500 original images, expanded to 5,000 after augmentation) and originates from a single source. This may limit generalizability. More discussion is needed on dataset diversity, potential biases, and whether the model would perform robustly under different real-world conditions (e.g., field lighting, occlusion, disease severity stages).

Response: We are most grateful for your professional comments on our paper. As you rightly pointed out, algorithms trained solely on strawberry disease datasets have limitations. We have expanded the Discussion section to address this issue, indicating our future research will explore multi-source, multi-crop, and multi-region datasets, while also delving into the application of multimodal approaches in this field. The specific corrections are as follows.

Action: [Section 4]

First, although this paper expanded the size of the data set through data augmentation, several limitations remain. These include: data sourced from a single origin, inclusion of only strawberries as the target crop, and the absence of analysis on disease severity or occlusion levels. These constraints may limit the model's generalization capabilities and robustness across diverse real-world scenarios. Moving forward, our work will focus on collecting data from multiple countries and regions, with plans to expand research to pest and disease detection across multiple crops. This will enhance model performance and contribute more effectively to smart agriculture.

Secondly, although this study achieved promising results in detecting diseases on strawberry images, it remains confined to a single modality: visual information. With advancements in multimodal fusion and sensor technologies, our future work will focus on integrating visual data with sensor-derived information such as temperature and humidity, alongside textual and audio data, to provide farmers with enhanced decision support.

 

3. Although pre-training and ablation studies are reported, the paper lacks statistical validation (e.g., confidence intervals, significance tests) to confirm the robustness of improvements. Additionally, comparisons to more recent lightweight YOLO variants should be deepened, not just numerically but also with qualitative visualization of detection outputs.

Response: We are grateful to the reviewers for their attention to the research validation section. While the paper did indeed lack statistical validation, we have incorporated additional visualisation results, including heatmap comparisons, feature map comparisons before and after module integration, confusion matrix comparisons, and experimental results for the latest YOLOv11 and YOLOv12 models in comparative tests. This ensures the reliability and practicality of our conclusions through enhanced numerical comparisons and more intuitive visual evidence. Specific modifications are outlined below.

Action1: [Section 2]

To visually demonstrate the optimising effect of the CBAM attention mechanism on feature representation, Figures 3 and 4 compare feature map visualisations before and after incorporating CBAM. The visualisation reveals that without CBAM, highly activated regions appear relatively dispersed, demonstrating insufficient focus on critical semantic information. Conversely, with CBAM applied, high-response areas in the feature map concentrate more precisely on regions containing significant objects or structures, markedly enhancing the specificity and discriminative power of activation distribution. This comparison validates CBAM's ability to effectively guide the network towards more representative features, thereby improving the efficacy of feature representation.

Figure 3. Feature map prior to incorporating the CBAM attention mechanism

Figure 4. Feature map after incorporating the CBAM attention mechanism

 

As shown in Figures 9 and 10, the enhanced feature map demonstrates greater focus on the activation of key leaf structures, reduced background interference, and clearer depiction of leaf texture details. This effectively strengthens the expression of target features, validating the effectiveness of the improvement.

 

Figure 9. Feature map output by the C2f module

Figure 10. Feature map output by the C2f_SCConv module

 

As shown in Figures 11 and 12, after introducing DySample, the highlighted key regions in the feature maps become more focused and exhibit sharper details. Compared to the original upsampling method, this approach effectively enhances the resolution and specificity of the features.

Figure 11. Feature map before introducing DySample

Figure 12. Feature map after introducing DySample

 

Action2: [Section 3.4]

Figure 16 shows the normalized confusion matrices for the baseline model and YOLO10-SC. Comparisons reveal that the improved model significantly outperforms the baseline model in classification accuracy for most disease categories, such as Anthracnose Fruit Rot (correct classification rate increased from 0.50 to 0.79) and Blossom Blight (increased from 0.98 to 1.00). Furthermore, the confusion between background and Anthracnose Fruit Rot decreased from 0.32 to 0.18. However, the baseline model achieved a slightly higher classification accuracy (0.94) for Angular Leafspot than the improved model (0.90); although accuracy remains high, the reasons for the reduced classification performance in this category warrant further investigation. Meanwhile, categories such as Gray Mold and Leaf Spot demonstrated consistently high and stable classification accuracy across both models, reflecting the model's strong robustness in identifying these diseases.

(a) YOLOv10                                                   (b) YOLO10-SC

Figure 16. Comparison of normalized confusion matrix before and after improvement

 

Figure 17 displays the heatmap visualizations of the improved YOLO10-SC model compared to the baseline YOLOv10 model. The comparison reveals that the improved YOLO10-SC heatmap exhibits superior visual coherence and target focus: its high-activation regions (warm tones) precisely cover the core lesion areas, forming a distinct gradient difference with the thermal boundaries of healthy tissue and background regions, demonstrating enhanced lesion-background discrimination capability. In contrast, the baseline YOLOv10 exhibits relatively diffuse heat distribution, with redundant activations persisting in healthy areas surrounding lesions. This reduces the visual distinctiveness of target contours, validating YOLO10-SC's enhanced capability to capture key visual patterns of disease during feature extraction while improving robustness in disease identification.

      

(a) Original image               (b) YOLOv10                      (c) YOLO10-SC

Figure 17. Comparison of heatmap before and after improvement

 

Action3: [Section 3.6]

Compared with YOLOv11 and YOLOv12, the latest single-stage object detection algorithms of the past two years, the proposed algorithm also achieved the best performance in strawberry disease detection, making it more suitable for application in smart agriculture.

Table 7. Results of comparison experiments

| Model | P | R | mAP50 | F1 |
| --- | --- | --- | --- | --- |
| YOLOv11 (2024) | 0.889 | 0.824 | 0.885 | 0.855 |
| YOLOv12 (2025) | 0.880 | 0.823 | 0.892 | 0.851 |
| YOLO10-SC | 0.885 | 0.865 | 0.914 | 0.875 |

 

 

4. The mobile app deployment is briefly mentioned, but details are lacking on real-time inference performance on actual mobile/edge devices. This information is critical to validate claims of suitability for real-world smart agriculture use.

Response: We are grateful to the reviewers for their suggestions regarding the paper. We acknowledge that our description of the mobile application deployment was indeed insufficiently detailed. The relevant information has now been supplemented in the manuscript, and the shortcomings along with future work have been addressed in the Discussion section. The specific details are as follows.

Action1: [Section 3.7]

To simulate the hardware usage scenarios of agricultural users in the field, deployment performance was validated on a mid-range mobile device: the HUAWEI nova9, equipped with a Snapdragon 778G processor and 8GB of RAM. During operation, the frame rate generally remained within the 50-100 FPS range, exceeding 100 frames per second under favourable conditions. This system operates offline, eliminating concerns regarding signal delays in remote environments.

Action2: [Section 4]

Finally, although this paper deploys YOLO10-SC within a mobile application, testing has verified that the system effectively addresses strawberry pest and disease detection tasks in productive agricultural cultivation. Its inference speed meets practical requirements in real-world environments, and the system operates offline without network constraints. However, due to limitations of the experimental site, it was not possible to test inference speed and battery endurance in field conditions. Future work will involve selecting suitable locations and seasons to conduct more comprehensive testing and refinement of the mobile application.

 

5. The manuscript is generally well organized but tends to be verbose with repeated explanations of standard modules (e.g., CBAM, SCConv). Condensing these sections would improve readability. The conclusion could also better articulate limitations and future directions.

Response: We are grateful to the reviewers for their suggestions regarding the article's structure. We have streamlined the descriptions of the three introduced modules, with examples of the revised paragraphs provided below. Additionally, we have incorporated a new Discussion section outlining the limitations of this study and potential avenues for future research.

Action1: [Section 2.1]

The introduction of the Channel Attention Mechanism facilitates the efficient detection of contour features associated with the target, thereby enriching the information available for target detection. This mechanism enables the network to prioritize critical feature channels pertinent to specific tasks, ultimately enhancing both the performance and efficiency of the network. The structure of the channel attention module is shown in Figure 2(a). The computation of the channel attention mechanism is expressed as follows:
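In the standard CBAM formulation (Woo et al., 2018), which this module follows, the channel attention map is

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{\mathrm{avg}})) + W_1(W_0(F^c_{\mathrm{max}}))\big)$$

where $\sigma$ is the sigmoid function and $W_0$, $W_1$ are the weights of the shared multilayer perceptron.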

The Spatial Attention Mechanism operates by compressing the channel dimension and performing mean and maximum pooling along this axis. By integrating the Spatial Attention Module, the model can effectively localize the detection target, thereby enhancing the detection rate. The structure of the spatial attention module is shown in Figure 2(b). The spatial attention mechanism is expressed as follows:
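Likewise, the standard CBAM spatial attention map is

$$M_s(F) = \sigma\big(f^{7\times7}([\mathrm{AvgPool}(F);\,\mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7\times7}([F^s_{\mathrm{avg}};\,F^s_{\mathrm{max}}])\big)$$

where $f^{7\times7}$ denotes a convolution with a 7×7 kernel over the channel-concatenated pooled maps.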

Action2: [Section 2.2]

SRU uses a separation-reconstruction approach: separation distinguishes high- and low-information feature maps via Group Normalization-derived scaling factors to suppress spatial redundancy, while reconstruction merges these features for more informative outputs and optimized spatial utilization; its structure is shown in Figure 6.

CRU is a channel reconstruction unit that utilizes a segmentation-transformation-fusion strategy to reduce channel-dimension redundancy as well as computational cost and storage. The structure of CRU is shown in Figure 7.

Action3: [Section 2.3]

Feature up-sampling is crucial for target detection, as it restores feature resolution to boost classification and localization accuracy. However, YOLOv10's original nearest-neighbor interpolation for up-sampling depends only on pixel spatial positions, ignoring the feature map's semantic information and surrounding points, which results in low-quality outputs. Although dynamic upsamplers like CARAFE[25], FADE[26], and SAPA[27] improve performance via content-aware kernels, they add extra complexity, with FADE and SAPA even requiring high-resolution feature inputs.

Action4: [Section 4]

First, although this paper expanded the size of the data set through data augmentation, several limitations remain. These include: data sourced from a single origin, inclusion of only strawberries as the target crop, and the absence of analysis on disease severity or occlusion levels. These constraints may limit the model's generalization capabilities and robustness across diverse real-world scenarios. Moving forward, our work will focus on collecting data from multiple countries and regions, with plans to expand research to pest and disease detection across multiple crops. This will enhance model performance and contribute more effectively to smart agriculture.

Secondly, although this study achieved promising results in detecting diseases on strawberry images, it remains confined to a single modality: visual information. With advancements in multimodal fusion and sensor technologies, our future work will focus on integrating visual data with sensor-derived information such as temperature and humidity, alongside textual and audio data, to provide farmers with enhanced decision support.

Finally, although this paper deploys YOLO10-SC within a mobile application, testing has verified that the system effectively addresses strawberry pest and disease detection tasks in productive agricultural cultivation. Its inference speed meets practical requirements in real-world environments, and the system operates offline without network constraints. However, due to limitations of the experimental site, it was not possible to test inference speed and battery endurance in field conditions. Future work will involve selecting suitable locations and seasons to conduct more comprehensive testing and refinement of the mobile application.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Authors have improved the manuscript and have addressed most of the suggestions. 
