Article

StrawberryNet: Fast and Precise Recognition of Strawberry Disease Based on Channel and Spatial Information Reconstruction

1 School of Computer Science and Artificial Intelligence, Chaohu University, Hefei 238000, China
2 School of Internet, Anhui University, Hefei 230601, China
3 Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
4 Department of Aeronautical and Aviation Engineering, Hong Kong Polytechnic University, Hong Kong 999077, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(7), 779; https://doi.org/10.3390/agriculture15070779
Submission received: 24 February 2025 / Revised: 27 March 2025 / Accepted: 2 April 2025 / Published: 3 April 2025
(This article belongs to the Section Crop Protection, Diseases, Pests and Weeds)

Simple Summary

The frequent outbreak of strawberry diseases severely affects strawberry yield and quality. Accurate and rapid identification of strawberry disease categories is a critical step in effective prevention and control. Artificial intelligence technologies are widely used for plant disease identification; however, in strawberry disease identification, the high similarity between different disease categories and the complexity of their backgrounds lead to low recognition accuracy. Moreover, the redundant computations of current methods result in poor real-time performance, making it difficult to meet the requirements of practical applications. To address these problems, we propose StrawberryNet, a lightweight model designed for accurate and real-time identification of multiple strawberry diseases.

Abstract

Timely and effective identification and diagnosis play an essential role in the prevention of strawberry diseases. Nevertheless, the high similarity between various types of strawberry diseases poses a great challenge to accurate identification, and recent models with high parameter counts are not suitable for real-time identification and monitoring. Therefore, in this paper, we propose a lightweight strawberry disease identification method, termed StrawberryNet, to achieve accurate and real-time identification of strawberry diseases. First, to decrease the number of parameters, partial convolution is selected instead of standard convolution to construct the backbone for extracting strawberry disease features, which significantly improves efficiency. Then, a discriminative feature extractor, consisting of a channel information reconstruction network (CIR-Net) and a spatial information reconstruction network (SIR-Net), is designed to abstract the identifiable features of different types of strawberry disease. Extensive experiments were conducted on the constructed strawberry disease dataset, which contains 2903 images covering 10 common strawberry diseases as well as normal leaves and fruits. The results show that the recognition accuracy of the proposed method reaches 99.01% with only 3.6 M parameters, achieving a good balance between identification precision and speed compared with other excellent models.

1. Introduction

Strawberry is one of the most important commercial fruit crops in the world [1]. It ranks first among small berries in terms of worldwide planting area. As of 2019, strawberries were planted on 396,401 hm² worldwide with an annual production of up to 8.34 million tons, and they have become one of the major cash crops in China, the United States, Mexico, Turkey, and other countries [2]. Unfortunately, the quality and yield of strawberries suffer seriously from various diseases. Strawberry gray mold, for example, is a major disease that affects strawberries grown in both open fields and greenhouses. If effective prevention and control measures are not taken promptly, strawberry yield is greatly threatened. Therefore, the fast and accurate recognition of strawberry disease is an essential means for timely prevention and control.
In the early stages, the detection and recognition of strawberry diseases usually relied on manual identification by agricultural experts, which is time-consuming and requires skilled taxonomic expertise. These methods form the basis for diagnosing plant diseases, but they are time-consuming, labor-intensive, and inefficient [3]. Recently, the rapid development of machine learning and deep learning has significantly advanced computer vision, which has been transferred to smart agriculture, for example, plant phenotyping [4], plant disease recognition, and so on. Machine-learning methods adopt support vector machines, random forests, and other classifiers to recognize plant disease types from extracted features such as LBP, HOG, and SIFT. Nevertheless, to raise the performance of machine-learning classifiers, the feature descriptors need to be carefully tuned for high precision, especially under complex backgrounds.
To further improve the ability of feature expression, deep learning-based approaches have been introduced, which can automatically learn features using convolutional neural networks (CNNs). A large number of CNN-based models have been proposed for image classification. On the one hand, some methods focus on improving classification accuracy. GoogLeNet increases the depth and width of the network by stacking multiple Inception modules to boost precision [5]. Other deep CNN-based networks, such as VGG-Net [6], ResNet [7], and DarkNet [8], improve the capability of abstracting features by increasing depth. In addition, attention mechanisms have been introduced to further enhance performance. An attention mechanism facilitates adaptive feature recalibration, enabling the network to prioritize significant components. Broadly, within computer vision it is classified into two main types, spatial attention and channel attention, each serving a distinct purpose. Spatial attention focuses primarily on identifying crucial spatial regions [9,10], whereas channel attention guides the network to focus selectively on significant objects, a concept that previous studies [11,12] have highlighted as crucial. The recently popular Vision Transformers [13,14,15,16,17,18], in contrast, often overlook adaptability within the channel dimension. On the other hand, many works pay more attention to recognition efficiency. Depthwise separable convolution and group convolution have been introduced to reduce the number of parameters in lightweight networks; related work includes MobileNet [19], ShuffleNet [20], and GhostNet [21]. In addition, to reduce computational complexity, partial convolution was introduced to design the FasterNet model [22].
Apart from CNN-based networks, Transformer-based methods have been widely used in computer vision due to their ability to capture long-range dependencies. For instance, the Vision Transformer (ViT) [13] divides the image into sequences with position coding and then extracts parameterized vectors as visual representations using cascaded Transformer blocks, achieving better results than CNNs on ultra-large datasets. However, due to its low-resolution feature maps and the quadratic growth of complexity with image size, its structure is not suitable as a backbone network for dense vision tasks or high-resolution input images. To strike a better balance between accuracy and speed, Liu et al. proposed the Swin Transformer by designing shifted-window multi-head attention, which further improves the accuracy and efficiency of image classification [14]. The Swin Transformer thus demonstrates that Transformers can be widely used in the computer vision community and perform well on a series of vision tasks. In addition, many excellent Transformer-based models have been introduced, such as DeiT [23], Conv2Former [24], and so on.
CNN-based and Transformer-based models have been introduced into agriculture to accomplish the recognition of plant disease. Based on the Faster R-CNN detector, Zhao et al. proposed a multi-scale feature fusion CNN-based model for strawberry disease detection, reaching 92.18% mAP [25]. Attallah developed an automatic identification method for tomato leaf diseases using three compact CNNs [26], employing transfer learning to obtain deep features from the final fully connected layer of each CNN, and achieved an accuracy of 92%. DFNet [27] uses multiple CNN models to take full advantage of the rich features in the image and increase the diversity of the extracted features, which improves the recognition ability of the disease recognition network and raises the recognition accuracy to 95%. To deal with the various shapes and scales of apple leaf disease, Liu et al. adopted multi-branch convolution and deformable convolution to improve the extraction of lesion feature maps, accomplishing the automatic detection of apple leaf disease [28]. At the same time, an improved MobileNet has been presented to quickly and accurately recognize seven types of cucumber disease under natural environments [29]. In addition, many researchers have developed Transformer-based models to recognize and detect plant disease. The Vision Transformer has been adopted to detect seven types of strawberry disease, reaching 92.7% accuracy and surpassing other approaches [30]. A "Swin-MLP" method based on a Swin Transformer and a multilayer perceptron (MLP) has been proposed to extract strawberry disease features, which are then fed into the MLP to accurately and quickly identify strawberry diseases, achieving an accuracy of 98.4% [31]. In our previous work, we proposed the SCSA-Transformer [32], which combines a Transformer and a CNN for spatial feature extraction of images, leading to an improvement in recognition efficiency. Although recent methods attain promising recognition accuracy, there are still limitations: the number of parameters is too large and the computational requirements are high, which hinders the mobile deployment of Transformer-based models.
In this paper, to reduce the computational burden, we propose a lightweight strawberry disease recognition model that uses partial convolution [22] instead of standard convolution for fast feature extraction of lesions. In addition, we design a spatial and channel information reconstruction module to accurately extract the features of strawberry lesions with irregular edges. Finally, to validate the performance of the proposed approach, a large number of experiments are conducted on our constructed strawberry disease dataset, showing excellent recognition results for strawberry lesions of different categories, small sizes, and various shapes. In summary, the main contributions are as follows:
(1) An effective and simple convolution, named partial convolution, is introduced to design a lightweight backbone for extracting strawberry lesion features, largely decreasing the number of parameters.
(2) We propose a discriminative feature extractor that fuses spatial and channel feature recalibration, leading to precise feature expression. This allows the proposed strawberry disease recognition method to achieve superior performance compared with other models.
(3) Extensive experiments on the large-scale strawberry disease dataset illustrate the effectiveness of the proposed approach, which achieves state-of-the-art performance in both efficiency and accuracy compared with other classification methods.

2. Materials and Methods

2.1. Strawberry Disease Dataset

In previous work [32], we built a large-scale, multi-category strawberry disease dataset. Specifically, 2903 strawberry disease images covering 10 common types of strawberry disease, plus two categories of normal strawberry fruits and leaves, were collected under natural environments. The images were saved in JPG format with a size of 3942 × 2611. Figure 1 shows some examples of each category. Figure 2 demonstrates that the dataset follows a long-tailed distribution, in which a small number of categories account for the majority of the samples while most categories contain only a few samples. For example, there are 886 images of strawberry fertilizer damage, which biases the network towards this category and harms the precise recognition of the other types of strawberry disease. To alleviate the long-tailed distribution, the commonly used data augmentation strategies Cutout [33] and Mixup [34] were adopted to increase the quantity and diversity of the dataset, raising the number of strawberry disease images from 2903 to 5369.
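As an illustration of these two augmentation strategies, a minimal PyTorch-style sketch is given below. The cutout patch size, the Mixup coefficient alpha, and the 12-class one-hot encoding (10 disease classes plus the two normal categories) are illustrative assumptions rather than the exact settings used to build the augmented dataset.

```python
import numpy as np
import torch
import torch.nn.functional as F

def cutout(img, size=56):
    """Cutout [33]: zero out a random square patch of an image tensor (C, H, W)."""
    _, h, w = img.shape
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, w)
    img = img.clone()
    img[:, y1:y2, x1:x2] = 0.0
    return img

def mixup(x, y, alpha=0.2, num_classes=12):
    """Mixup [34]: convex combination of a batch (B, C, H, W) with a shuffled copy
    of itself; labels are mixed with the same coefficient."""
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0))
    y_onehot = F.one_hot(y, num_classes).float()
    mixed_x = lam * x + (1 - lam) * x[index]
    mixed_y = lam * y_onehot + (1 - lam) * y_onehot[index]
    return mixed_x, mixed_y
```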

2.2. Proposed Method

2.2.1. Overall Architecture of StrawberryNet

To accomplish the identification of multiple categories of strawberry disease, we propose a lightweight and accurate CNN-based network, termed StrawberryNet. Figure 3 shows the overall architecture of the proposed StrawberryNet. Similar to the role of ResNet or the Swin Transformer, FasterNet serves as the backbone for feature extraction and consists of four stages. The strawberry disease image is first processed by a linear embedding operation that splits the image into patch embeddings and outputs C-dimensional vectors (C = 20). These vectors are then fed into the FasterNet stages to extract strawberry disease feature information. In addition, the Merging module [31] is adopted to down-sample the feature maps, producing a hierarchical feature representation of the strawberry disease image. To further capture distinguishing features, we devise the discriminative feature extraction (DFE) module to enhance the features of strawberry disease. Finally, a strawberry disease classifier built from a global average pooling layer and a fully connected (FC) layer outputs the classification confidence and category of each strawberry disease image.
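To make the data flow of Figure 3 concrete, the skeleton below traces the pipeline in PyTorch: patch embedding, four stages separated by Merging down-sampling, a DFE slot at the last stage, global average pooling, and the FC classifier. It is only a structural sketch under stated assumptions: the per-stage depths, the channel-doubling rule, and the Conv-BN-ReLU stand-ins for the FasterNet blocks (detailed in Sections 2.2.2 and 2.2.3) are placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def faster_stage(dim, depth=2):
    """Placeholder for a FasterNet stage; a Conv-BN-ReLU stack stands in for the
    partial-convolution blocks described in Sections 2.2.2 and 2.2.3."""
    layers = []
    for _ in range(depth):
        layers += [nn.Conv2d(dim, dim, 3, padding=1, bias=False),
                   nn.BatchNorm2d(dim), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class StrawberryNetSkeleton(nn.Module):
    """Macro-structure of Figure 3: embedding -> 4 stages with Merging -> DFE -> GAP -> FC."""
    def __init__(self, num_classes=12, embed_dim=20):
        super().__init__()
        dims = [embed_dim * 2 ** i for i in range(4)]              # channel width assumed to double per stage
        self.embed = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)  # linear patch embedding, C = 20
        stages = []
        for i, d in enumerate(dims):
            stages.append(faster_stage(d))
            if i < 3:                                              # Merging: 2x spatial down-sampling
                stages.append(nn.Conv2d(d, dims[i + 1], kernel_size=2, stride=2))
        self.backbone = nn.Sequential(*stages)
        self.dfe = nn.Identity()                                   # placeholder for the DFE block of Section 2.2.4
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):                                          # x: (B, 3, H, W)
        x = self.dfe(self.backbone(self.embed(x)))
        x = x.mean(dim=(2, 3))                                     # global average pooling
        return self.head(x)
```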

2.2.2. Review of Depthwise Separable Convolution and Partial Convolution

Based on standard convolution, depthwise separable convolution (DSConv) and partial convolution have been successively proposed to improve running time. DSConv first applies a depthwise convolution, in which each convolutional kernel is responsible for a single channel. Thus, for input feature maps F with a size of w × h × c, when using a depthwise convolution with a kernel size of k × k × c, the output feature map O can be defined as:
O_{k,l,m} = \sum_{i,j} K_{i,j,m} \cdot F_{k+i-1,\, l+j-1,\, m}
where K represents the convolutional filters with a size of k × k × c, and the m-th kernel in K is applied to the m-th channel of the feature map F to produce the m-th channel of the output feature map O. Compared with standard convolution, depthwise convolution (DWConv) significantly cuts down the computation cost. However, it performs convolutional operations independently on each channel of the input feature map and therefore cannot efficiently exploit the information from different channels at the same spatial location. Accordingly, the feature maps generated by the depthwise convolution are processed by a pointwise convolution with a 1 × 1 kernel, which merges the feature maps from the previous layer across the depth dimension through weighted summation to generate new feature maps. Note that the number of output feature maps equals the number of pointwise convolution kernels. Thus, the FLOPs of DSConv can be computed as:
\mathrm{FLOPs}(\mathrm{DSConv}) = h \times w \times k \times k \times c
where h and w represent the height and width of the feature map, k denotes the kernel size, and c denotes the number of channels.
Although DWConv followed by a pointwise convolution is effective in reducing FLOPs, it cannot be substituted directly for a regular convolution because such a substitution would significantly compromise the accuracy of the model. Partial convolution (PConv) has therefore been proposed to further optimize the computation cost. Figure 4 demonstrates how PConv works. PConv performs spatial feature extraction with a regular convolution only on the first consecutive channels of the input, leaving the remaining channels untouched. This allows PConv to better utilize the computational power of the device. Thus, the FLOPs of a PConv are:
\mathrm{FLOPs}(\mathrm{PConv}) = h \times w \times k \times k \times c_p \times c_p
where c_p represents the number of channels processed by the partial convolution, which is set to c/4.
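To make the difference between the two operations concrete, a minimal PyTorch sketch of both is given below; the 3 × 3 kernel and the c/4 channel split follow the description above, while the class and argument names are our own.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """DSConv: a per-channel (depthwise) k x k convolution followed by a
    1 x 1 pointwise convolution that mixes information across channels."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class PartialConv(nn.Module):
    """PConv: a regular k x k convolution applied only to the first c_p = c/4
    channels; the remaining channels pass through unchanged."""
    def __init__(self, dim, n_div=4, kernel_size=3):
        super().__init__()
        self.dim_conv = dim // n_div
        self.dim_untouched = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)
```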

2.2.3. FasterNet Stage

Recent image classification models achieve high accuracy, but often at the cost of efficiency. Therefore, researchers have paid increasing attention to designing lightweight neural networks by reducing computation, as measured by FLOPs. Works such as MobileNet, ShuffleNet, and GhostNet decrease FLOPs by using depthwise separable convolution to build the feature extraction module. However, reducing FLOPs alone often comes with increased memory access. Therefore, in this paper, instead of DWConv, we introduce partial convolution (PConv) into the FasterNet module to effectively reduce both computational redundancy and the number of memory accesses.
There are four stages arranged sequentially in FasterNet. The input and output of stage l are F_{l−1} ∈ ℝ^{c×h×w} and F_l ∈ ℝ^{c×h×w}, respectively. Each FasterNet stage has the same architecture: a PConv layer followed by two 1 × 1 convolutional layers, with a batch normalization (BN) layer and a ReLU activation layer inserted between the two 1 × 1 convolutions. The BN layer can be merged into the adjacent convolutional layer at inference time, leading to faster inference. Considering both effectiveness and efficiency, ReLU is selected as the activation function. At the same time, the FasterNet block reuses the input feature map to form a residual structure, which can be expressed as follows:
F_l = F_{l-1} + \mathrm{Conv}_{1\times1}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}(\mathrm{PConv}(F_{l-1} \ast w)))))
where w is the convolutional filter, F_{l−1} is the input feature map of the block, and Conv_{1×1}(·) is the 1 × 1 convolutional operation. PConv(·) denotes the partial convolution, ReLU denotes the ReLU activation function used for the nonlinear transformation, and BN represents batch normalization, which accelerates network convergence and improves training efficiency.
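A minimal sketch of one FasterNet block, building on the PartialConv sketch in Section 2.2.2, is shown below; the hidden expansion ratio of 2 is an assumption (FasterNet's default [22]) rather than a value stated in this paper.

```python
import torch.nn as nn

class FasterNetBlock(nn.Module):
    """Residual FasterNet block: PConv -> 1x1 Conv -> BN -> ReLU -> 1x1 Conv,
    added back to the input feature map, matching the residual formula above."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PartialConv(dim)                       # spatial mixing on c/4 channels
        self.conv1 = nn.Conv2d(dim, hidden, 1, bias=False)  # channel expansion
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(hidden, dim, 1, bias=False)  # channel projection

    def forward(self, x):
        return x + self.conv2(self.act(self.bn(self.conv1(self.pconv(x)))))
```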

2.2.4. Discriminative Feature Extraction Block

In order to extract identifiable features, a discriminative feature extraction (DFE) block is introduced. Figure 5 illustrates the architecture of the DFE block, which contains a channel information reconstruction network (CIR-Net) and a spatial information reconstruction network (SIR-Net). With the help of the inter-channel relationships of the feature map, the CIR-Net is built to explore the usefulness of the information in each channel. Specifically, average pooling and max pooling are used to squeeze the spatial dimension of the feature map for computational efficiency, producing two different descriptors, F_avg and F_max. These descriptors are then fed into a shared network with one hidden layer, which generates the feature maps G_avg and G_max, respectively, and the two are combined by element-wise summation. Finally, to highlight the important information, the input feature maps X are merged with the produced attention maps by dot product. In summary, the CIR-Net module can be expressed as:
\mathrm{CIR}(X) = \delta(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X)))
where AvgPool(·) denotes the average pooling operation, MaxPool(·) denotes the max pooling operation, δ represents the sigmoid function, and X denotes the input feature maps. MLP is the shared network, a multi-layer perceptron with one hidden layer.
By using the CIR-Net, we can fully explore the channel information of the feature map and determine which channels deserve more attention. Nevertheless, we still do not know where the most informative parts of the feature maps are located. Therefore, we introduce the SIR-Net to reconstruct the spatial information and find the most salient regions. Similar to the CIR-Net, average pooling and max pooling are employed to reduce the computational burden, this time along the channel dimension. The resulting maps are merged by concatenation, and a convolutional layer with a kernel size of 7 × 7 is adopted to encode the spatial attention of the feature map. The process of the SIR-Net can be defined as:
\mathrm{SIR}(X) = \delta(\mathrm{Conv}(\mathrm{Concat}(\mathrm{AvgPool}(X), \mathrm{MaxPool}(X))))
where δ represents the sigmoid function, X denotes the input feature maps, and Conv(·) denotes a standard convolution with a 7 × 7 kernel.
Given the input X, the two modules, CIR-Net and SIR-Net, are used to extract the channel and spatial information of the feature map. The proposed DFE can then be expressed as:
\mathrm{DFE}(X) = \mathrm{Conv}(\mathrm{CIR}(X) + \mathrm{SIR}(X)) \cdot X
where Conv(·) denotes a convolution operation and X denotes the input feature map. CIR and SIR are the CIR-Net and SIR-Net, which output the channel attention map and the spatial attention map, respectively.
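The following is a minimal PyTorch sketch of how the DFE block could be realized from the three equations above; the reduction ratio of the shared MLP and the 1 × 1 kernel of the fusion convolution are assumptions, as they are not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIRNet(nn.Module):
    """Channel information reconstruction: pooled channel descriptors are passed
    through a shared one-hidden-layer MLP, summed, and gated by a sigmoid."""
    def __init__(self, dim, reduction=16):                       # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1, bias=False))

    def forward(self, x):
        avg = F.adaptive_avg_pool2d(x, 1)                        # F_avg: (B, C, 1, 1)
        mx = F.adaptive_max_pool2d(x, 1)                         # F_max: (B, C, 1, 1)
        return torch.sigmoid(self.mlp(avg) + self.mlp(mx))       # channel attention map

class SIRNet(nn.Module):
    """Spatial information reconstruction: channel-pooled maps are concatenated
    and encoded by a 7 x 7 convolution, then gated by a sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                        # (B, 1, H, W)
        mx = x.max(dim=1, keepdim=True).values                   # (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial attention map

class DFEBlock(nn.Module):
    """DFE(X) = Conv(CIR(X) + SIR(X)) * X, with the two attention maps broadcast-summed."""
    def __init__(self, dim):
        super().__init__()
        self.cir, self.sir = CIRNet(dim), SIRNet()
        self.fuse = nn.Conv2d(dim, dim, 1, bias=False)           # fusion conv; kernel size assumed

    def forward(self, x):
        attn = self.fuse(self.cir(x) + self.sir(x))              # broadcast to (B, C, H, W)
        return attn * x
```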

2.3. Evaluation Metrics

The evaluation metrics of accuracy, precision, recall, specificity, and F1-score are adopted to comprehensively evaluate the performance of StrawberryNet and the compared models. These metrics are calculated as follows:
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{Specificity} = \frac{TN}{TN + FP}
TP (true positive) is the number of samples predicted to be positive that are actually positive; FP (false positive) is the number of samples predicted to be positive that are actually negative; TN (true negative) is the number of samples predicted to be negative that are actually negative; and FN (false negative) is the number of samples predicted to be negative that are actually positive.
Since precision and recall influence each other, the F1-score, the harmonic mean of precision and recall, is introduced to further evaluate the performance of the model. It is computed as:
F_1\text{-}score = \frac{2PR}{P + R}
Furthermore, the efficiency of the models is assessed by FLOPs, the number of parameters, and testing speed (images/s).
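To make the computation of these metrics explicit, a small NumPy sketch is given below; it derives per-class TP, FP, FN, and TN counts from a confusion matrix in the one-vs-rest manner implied by the definitions above. The function name and the epsilon guard against division by zero are our own.

```python
import numpy as np

def per_class_metrics(conf_matrix, eps=1e-12):
    """Compute accuracy, precision, recall, specificity, and F1-score per class
    from a (num_classes x num_classes) confusion matrix (rows: true, cols: predicted)."""
    cm = np.asarray(conf_matrix, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp        # predicted as this class but actually another
    fn = cm.sum(axis=1) - tp        # actually this class but predicted as another
    tn = total - tp - fp - fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return accuracy, precision, recall, specificity, f1
```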

3. Experimental Results and Analysis

3.1. Experimental Settings

The models in this paper are trained and tested on a single RTX 3060 graphics card with 12 GB of memory, configured with Python 3.8, PyTorch 1.10.0, and cuDNN 8.2.0. The learning rate is 0.002, training runs for 100 epochs, and the batch size is set to 64. The training and testing sets are divided in a 7:3 ratio. We conducted experiments on the constructed strawberry disease dataset, taking RegNet [35], MobileNetV2 [19], MobileViT, ShuffleNetV2 [20], the Swin Transformer [14], L-GhostNet [21], the SCSA-Transformer [32], and FasterNet [22] as compared models.
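A condensed training-loop sketch under these settings is shown below. The learning rate, epoch count, batch size, and 7:3 split are taken from this section; the choice of SGD with momentum and the cross-entropy loss are assumptions, since the optimizer and loss function are not specified in the paper.

```python
import torch
from torch.utils.data import DataLoader, random_split

EPOCHS, BATCH_SIZE, LR = 100, 64, 0.002          # settings reported in Section 3.1

def train(model, dataset, device="cuda"):
    n_train = int(0.7 * len(dataset))            # 7:3 train/test split
    train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
    loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=LR, momentum=0.9)   # optimizer assumed
    criterion = torch.nn.CrossEntropyLoss()                                # loss assumed
    model.to(device).train()
    for _ in range(EPOCHS):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model, test_set
```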

3.2. Experimental Results

3.2.1. Overall Performance on Strawberry Disease Dataset

Many experiments were conducted on the constructed strawberry disease dataset, as shown in Table 1, which reports the results of the proposed method and the comparison approaches. The proposed method achieves the highest recognition accuracy among the lightweight models, 99.01%, with recall, precision, specificity, and F1-score of 97.66%, 96.88%, 99.22%, and 97.27%, respectively. Notably, StrawberryNet better balances recognition accuracy and efficiency, yielding a lightweight model with high recognition accuracy. Specifically, on the one hand, its recognition accuracy of 99.01% is second only to our previous work, the SCSA-Transformer, while the number of parameters of StrawberryNet is reduced by 85% compared with the SCSA-Transformer. On the other hand, the number of parameters of StrawberryNet is only 3.6 M, an increase of only 0.02 M over the most lightweight compared network, FasterNet. This is due to the introduction of PConv, which reduces redundant computation and memory access, while the introduction of the DFE module enhances the feature representation of strawberry disease images and further improves recognition accuracy. Therefore, the proposed method is suitable for real-time monitoring of strawberry disease images in complex backgrounds.
As shown in Figure 6, we also plot the loss curves of the proposed model and other excellent image classification methods, including MobileNetV2, RegNet, MobileViT, ShuffleNetV2, the Swin Transformer, and FasterNet. In terms of convergence speed during training, our model reaches the minimum loss with less training time than the other models. In Figure 6b, the identification accuracy curves show that our method outperforms the other lightweight classifiers and does not fall behind the Transformer-based methods. This performance results from the design of a lightweight network architecture that captures richer spatial information of strawberry disease.

3.2.2. Ablation Experiments

In order to explore the effect of the DFE module, we add it to different FasterNet stages. Table 2 reports the recognition accuracy and efficiency (FLOPs, number of parameters, and recognition speed). Without the DFE module, the identification accuracy reaches 97.25% with 3.58 M parameters. When the DFE module is used in Stage IV, the classification of strawberry disease reaches an accuracy of 99.01% with 3.6 M parameters. The table also shows that an accuracy of 99.01% can be obtained when the DFE module is added to all stages of FasterNet; nevertheless, both the number of parameters and the FLOPs of the model rise. In addition, adopting the DFE module at Stage I or Stage II harms the recognition accuracy. In summary, we adopt the DFE module only at Stage IV of FasterNet, which improves the recognition accuracy while keeping the computational burden low. This improvement is likely because the DFE block extracts the identifiable features of highly similar strawberry diseases, which is conducive to accurate recognition.

3.2.3. Efficiency Analysis of Strawberry Disease

To further validate the effectiveness of the proposed StrawberryNet, Table 3 reports the FLOPs, number of parameters, and testing speed of StrawberryNet and the other classification models. StrawberryNet attains 3.6 M parameters, 0.42 G FLOPs, and a testing speed of 4.3 images/s, while its identification accuracy reaches 99.01%, realizing a good balance between precision and efficiency. In addition, the small number of parameters is beneficial for model deployment on mobile devices.

3.2.4. Visualized Analysis of StrawberryNet

Gradient-weighted class activation mapping (Grad-CAM) is applied as a visualization tool to demonstrate how StrawberryNet extracts spatial features. In the heat maps of Figure 7, deeper red hues represent areas to which the model directs greater attention, whereas blue hues signify regions that are likely to be uninformative. Figure 7 clearly demonstrates that the DFE module enhances StrawberryNet's capability to concentrate on the regions affected by strawberry disease, thereby facilitating improved recognition of strawberry disease images, even in complex backgrounds. Additionally, Figure 8 presents some prediction results of StrawberryNet on strawberry disease images. This visualization reveals that our proposed StrawberryNet achieves accurate predictions with high confidence, affirming its efficacy.
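For readers who wish to reproduce such heat maps, a minimal Grad-CAM sketch is given below; it is a generic implementation of the standard technique with hook-based gradient capture, and the function name and target-layer choice are left to the user rather than prescribed by this paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of the class score, then ReLU and normalize to [0, 1]."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output.detach()

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0].detach()

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    logits = model(image.unsqueeze(0))                   # image: (3, H, W)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # GAP over gradients
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0], class_idx
```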

4. Conclusions and Future Work

Strawberry yield suffers seriously from strawberry diseases, which can even lead to complete crop failure and cause significant economic losses. The precise and fast identification of strawberry disease is therefore crucial for increasing strawberry yield and quality. However, because of the lack of large-scale strawberry disease datasets and the large number of parameters in recent deep-learning models, it is hard to quickly and accurately recognize the types of strawberry disease. Therefore, in this paper, we have proposed a lightweight strawberry disease recognition model, StrawberryNet, which uses partial convolution instead of standard convolution for fast feature extraction of lesions. In addition, a spatial and channel information reconstruction module has been designed to accurately extract the features of strawberry lesions with irregular edges. Finally, to validate the performance of the proposed approach, a large number of experiments have been conducted on our constructed strawberry disease dataset, showing excellent recognition results for strawberry lesions of different categories, small sizes, and various shapes. The proposed StrawberryNet obtains a recognition accuracy of 99.01% with only 3.6 M parameters. Thus, StrawberryNet can not only improve the efficiency of disease diagnosis and reduce economic losses but also promote the sustainable development of intelligent agriculture. However, two tasks remain for future work. On the one hand, we need to take into account the evaluation of the severity level of strawberry diseases to guide refined pesticide application. On the other hand, the model needs to be deployed on mobile devices or cell phones for convenient use.

Author Contributions

Conceptualization: X.L.; Funding acquisition: L.J.; Investigation: K.L.; Methodology: X.L. and L.J.; Software: X.L. and L.J.; Supervision: L.J.; Validation: K.L. and Q.L.; Writing—original draft: X.L.; Writing—review and editing: X.L., L.J., K.L., Q.L. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2024 Central Guiding Local Science and Technology Development Special Plan Funding and Projects (No. 202407a12020010), Chaohu University’s research start-up fund (No. KYQD-2024006), Natural Science Foundation of Anhui Higher Education Institutions of China (No. KJ2021A0025), the Open Research Fund of the National Engineering Research Center for Agro-Ecological Big Data Analysis Application, Anhui University (No. AE202213), and the Natural Science Foundation of Anhui Province (No. 2208085MC57).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Due to a confidentiality agreement, the dataset cannot currently be fully disclosed. However, researchers may contact the corresponding author via email to obtain it for research purposes.

Acknowledgments

Our thanks to all the authors cited in this paper and the anonymous referees for their helpful comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, D.; Wang, X.; Chen, Y.; Wu, Y.; Zhang, X. Strawberry ripeness classification method in facility environment based on red color ratio of fruit rind. Comput. Electron. Agric. 2023, 214, 108313. [Google Scholar] [CrossRef]
  2. Liu, J.; Zhao, S.; Li, N.; Faheem, M.; Li, P. Development and Field Test of an Autonomous Strawberry Plug Seeding Transplanter for Use in Elevated Cultivation. Appl. Eng. Agric. 2019, 35, 1067–1078. [Google Scholar]
  3. Dong, A.Y.; Wang, Z.; Huang, J.J.; Song, B.A.; Hao, G.F. Bioinformatic tools support decision-making in plant disease management. Trends Plant Sci. 2021, 26, 953–967. [Google Scholar] [PubMed]
  4. Zhou, Y.; Zhou, H.; Chen, Y. An automated phenotyping method for Chinese Cymbidium seedlings based on 3D point cloud. Plant Methods 2024, 20, 151. [Google Scholar] [PubMed]
  5. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  6. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  9. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar] [CrossRef]
  10. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27. [Google Scholar]
  11. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  12. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  15. Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal Self-attention for Local-Global Interactions in Vision Transformers. arXiv 2021, arXiv:2107.00641. [Google Scholar]
  16. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  17. Liu, R.; Deng, H.; Huang, Y.; Shi, X.; Lu, L.; Sun, W.; Wang, X.; Dai, J.; Li, H. Fuseformer: Fusing fine-grained information in transformers for video inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 14040–14049. [Google Scholar]
  18. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  19. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  20. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  21. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features From Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1577–1586. [Google Scholar] [CrossRef]
  22. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar] [CrossRef]
  23. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  24. Hou, Q.; Lu, C.Z.; Cheng, M.M.; Feng, J. Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8274–8283. [Google Scholar] [CrossRef] [PubMed]
  25. Zhao, S.; Liu, J.; Wu, S. Multiple disease detection method for greenhouse-cultivated strawberry based on multiscale feature fusion Faster R_CNN. Comput. Electron. Agric. 2022, 199, 107176. [Google Scholar] [CrossRef]
  26. Attallah, O. Tomato leaf disease classification via compact convolutional neural networks with transfer learning and feature selection. Horticulturae 2023, 9, 149. [Google Scholar] [CrossRef]
  27. Faisal, M.; Leu, J.S.; Avian, C.; Prakosa, S.W.; Köppen, M. DFNet: Dense fusion convolution neural network for plant leaf disease classification. Agron. J. 2024, 116, 826–838. [Google Scholar]
  28. Liu, B.; Huang, X.; Sun, L.; Wei, X.; Ji, Z.; Zhang, H. MCDCNet: Multi-scale constrained deformable convolution network for apple leaf disease detection. Comput. Electron. Agric. 2024, 222, 109028. [Google Scholar]
  29. Liu, Y.; Wang, Z.; Wang, R.; Chen, J.; Gao, H. Flooding-based MobileNet to identify cucumber diseases from leaf images in natural scenes. Comput. Electron. Agric. 2023, 213, 108166. [Google Scholar] [CrossRef]
  30. Nguyen, H.T.; Tran, T.D.; Nguyen, T.T.; Pham, N.M.; Nguyen Ly, P.H.; Luong, H.H. Strawberry disease identification with vision transformer-based models. Multimed. Tools Appl. 2024, 83, 73101–73126. [Google Scholar] [CrossRef]
  31. Zheng, H.; Wang, G.; Li, X. Swin-MLP: A strawberry appearance quality identification method by Swin Transformer and multi-layer perceptron. J. Food Meas. Charact. 2022, 16, 2789–2800. [Google Scholar] [CrossRef]
  32. Li, G.; Jiao, L.; Chen, P.; Liu, K.; Wang, R.; Dong, S.; Kang, C. Spatial convolutional self-attention-based transformer module for strawberry disease identification under complex background. Comput. Electron. Agric. 2023, 212, 108121. [Google Scholar] [CrossRef]
  33. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar]
  34. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  35. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville TN, USA, 11–15 June 2020; pp. 10428–10436. [Google Scholar]
Figure 1. Some examples of strawberry disease images.
Figure 2. A histogram of the distribution of the strawberry disease dataset.
Figure 3. (a) Network structure of StrawberryNet and (b) description of FasterNet stage.
Figure 4. Description of DWConv (a) and PConv (b).
Figure 5. Structure of DFE module, which is composed of CIR-Net and SIR-Net in parallel.
Figure 6. The loss (a) and accuracy (b) curves of our method and other comparison methods.
Figure 7. Visualized heat maps of proposed StrawberryNet. The deeper red hues in the image represent areas where the model directs greater attention.
Figure 8. Prediction results of proposed method on strawberry disease dataset.
Table 1. Comparison results with state-of-the-art methods on strawberry disease dataset.

| Methods | Accuracy (%) | Recall (%) | Precision (%) | Specificity (%) | F1-Score | Params (M) |
| MobileNet | 93.02 | 95.21 | 95.12 | 96.88 | 95.44 | 15.4 |
| RegNet | 97.04 | 96.33 | 96.44 | 98.1 | 96.54 | 15.7 |
| ShuffleNet v2 | 98.02 | 96.11 | 96.52 | 98.77 | 96.7 | 15.6 |
| MobileVit | 98.1 | 96.33 | 95.89 | 98.65 | 96.7 | 5.6 |
| Swin Transformer | 98.34 | 96.54 | 96.77 | 98.99 | 96.8 | 40 |
| L-GhostNet | 98.33 | 96.6 | 96.74 | 98.86 | 96.67 | 5.14 |
| SCSA-Transformer | 99.1 | 98.47 | 97.77 | 99.37 | 97.75 | 24.2 |
| FasterNet | 98.27 | 97 | 96.81 | 98.8 | 96.9 | 3.58 |
| StrawberryNet | 99.01 | 97.66 | 96.88 | 99.22 | 97.27 | 3.6 |
Table 2. Results of ablation experiments on strawberry disease dataset. Stage I, Stage II, Stage III, and Stage IV indicate the FasterNet stage(s) in which the proposed DFE module is added.

| Stage I | Stage II | Stage III | Stage IV | Accuracy (%) | FLOPs (G) | Image/s | Params (M) |
| | | | | 97.25 | 0.34 | 4.3 | 3.58 |
| ✓ | | | | 98.27 | 0.37 | 4 | 3.6 |
| | ✓ | | | 98.15 | 0.36 | 4.1 | 3.6 |
| | | ✓ | | 98.34 | 0.37 | 4.2 | 3.6 |
| | | | ✓ | 99.01 | 0.42 | 4.3 | 3.6 |
| | | | | 98.64 | 0.4 | 4 | 3.73 |
| | | | | 98.74 | 0.4 | 3.9 | 3.84 |
| ✓ | ✓ | ✓ | ✓ | 99.01 | 0.42 | 3.7 | 4 |
Table 3. Results of recognition efficiency, including speed, number of parameters, and FLOPs.

| Methods | Image/s | Params (M) | FLOPs (G) | Accuracy (%) |
| MobileNet v2 | 3.3 | 15.4 | 0.3 | 97.25 |
| Regnet | 4 | 15.7 | 0.33 | 98.27 |
| ShuffleNet v2 | 4.1 | 15.6 | 0.3 | 98.15 |
| MobileVit | 4.2 | 5.6 | 0.37 | 98.34 |
| Swin Transformer | 4.3 | 40 | 0.41 | 99.01 |
| L-GhostNet | 3.9 | 5.14 | 0.36 | 98.64 |
| SCSA-Transformer | 4.2 | 24.2 | 0.41 | 99.1 |
| FasterNet | 4.3 | 3.58 | 0.34 | 98.74 |
| Our methods | 4.3 | 3.6 | 0.42 | 99.01 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, X.; Jiao, L.; Liu, K.; Liu, Q.; Wang, Z. StrawberryNet: Fast and Precise Recognition of Strawberry Disease Based on Channel and Spatial Information Reconstruction. Agriculture 2025, 15, 779. https://doi.org/10.3390/agriculture15070779

AMA Style

Li X, Jiao L, Liu K, Liu Q, Wang Z. StrawberryNet: Fast and Precise Recognition of Strawberry Disease Based on Channel and Spatial Information Reconstruction. Agriculture. 2025; 15(7):779. https://doi.org/10.3390/agriculture15070779

Chicago/Turabian Style

Li, Xiang, Lin Jiao, Kang Liu, Qihuang Liu, and Ziyan Wang. 2025. "StrawberryNet: Fast and Precise Recognition of Strawberry Disease Based on Channel and Spatial Information Reconstruction" Agriculture 15, no. 7: 779. https://doi.org/10.3390/agriculture15070779

APA Style

Li, X., Jiao, L., Liu, K., Liu, Q., & Wang, Z. (2025). StrawberryNet: Fast and Precise Recognition of Strawberry Disease Based on Channel and Spatial Information Reconstruction. Agriculture, 15(7), 779. https://doi.org/10.3390/agriculture15070779
