Article

Multiscale Tea Disease Detection with Channel–Spatial Attention

by Yange Sun 1,2,*, Mingyi Jiang 1, Huaping Guo 1, Li Zhang 1, Jianfeng Yao 1, Fei Wu 1 and Gaowei Wu 3,4

1 School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China
2 Henan Key Laboratory of Tea Plant Biology, Xinyang 464000, China
3 State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
4 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(16), 6859; https://doi.org/10.3390/su16166859
Submission received: 1 July 2024 / Revised: 26 July 2024 / Accepted: 5 August 2024 / Published: 9 August 2024

Abstract

Tea disease detection is crucial for improving the agricultural circular economy. Deep learning-based methods have been widely applied to this task; their main idea is to extract multiscale coarse disease features with a backbone network and fuse these features through the neck for accurate disease detection. This paper proposes a novel tea disease detection method that enhances the feature expression of the backbone network and the feature fusion capability of the neck by (1) constructing an inverted residual self-attention module as a backbone plugin to capture the long-distance dependencies of disease spots on leaves, and (2) developing a channel–spatial attention module with residual connections in the neck network to enhance the contextual semantic information of fused features in disease images and suppress complex background noise. Specifically, the proposed channel–spatial attention module uses Residual Channel Attention (RCA) to enhance inter-channel interactions, facilitating discrimination between disease spots and normal leaf regions, and employs spatial attention (SA) to emphasize essential areas of tea diseases. Experimental results demonstrate that the proposed method achieved precision and mAP scores of 92.9% and 94.6%, respectively. In particular, it improved on the SSD model by 6.4% in precision and 6.2% in mAP.

1. Introduction

Tea is considered a highly valuable green cash crop due to its profound health benefits for humans [1], making the tea industry an important part of the agricultural circular economy. Tea diseases (such as leaf blight, rust algae disease, and red spot disease) cause a loss of approximately 20% of tea production annually, resulting in declines in yield and quality as well as resource wastage [2]. Therefore, developing a precise and timely tea disease detector is valuable for the early warning and accurate control of diseases; it can enhance control efficiency and reduce pesticide use, aligning with the agricultural circular economy principles of reducing chemical inputs and protecting the environment.
Traditional tea disease detection primarily relies on manual inspection [3,4]. However, such methods exhibit a high misdiagnosis rate and low detection efficiency. Many convolution-based deep learning methods for tea disease detection have therefore been proposed, for instance, the YOLO series [5,6,7] and other object detection methods [8,9]. However, these detectors still have limitations in integrating global information and handling long-range dependencies.
The attention mechanism can effectively overcome the limitations of convolution-based deep learning methods [10], and existing attention-based techniques can be roughly grouped into three groups: channel attention-based, spatial attention-based, and self-attention-based methods [11,12]. Channel attention-based techniques enhance the model’s focus on important features by dynamically adjusting the weight of each channel [13]. Liu et al. [14] introduced a channel attention mechanism that assigns weights to individual channels and multiplies them with the original disease image to emphasize significant features while suppressing irrelevant background noise. Zhang et al. [15] presented a multi-channel automatic directional circulation mechanism to extract and reuse multiscale features to overcome challenges associated with the similarity among tomato leaf diseases. Gao et al. [16] introduced a highly efficient crop disease recognition model to improve the accuracy of plant disease recognition while minimizing the extraction of extraneous features. Bao et al. [17] proposed an improved target detection and recognition network, AX-RetinaNet, for the automatic detection and recognition of tea leaf diseases in natural scene images; its attention module assigns adaptively optimized weights to each feature map channel, enabling the network to select more effective features and reduce the interference of redundant features. Although these methods effectively tackle redundancy and background noise at the channel level, they encounter challenges in extracting spatial position information and capturing inter-regional relationships.
Spatial attention-based methods aim to analyze and highlight particular regions and locations in an image to enhance the understanding of local structure and contextual information [18]. Wang et al. [19] utilized spatial attention to establish correlations between different regions, thereby enhancing the importance of critical regions in the feature map. Tang et al. [20] incorporated a multiscale spatial attention block into fully convolutional neural networks to address the issue of spatial information loss. Wang et al. [21] proposed an attention model that incorporates correlation spatial attention, increasing the importance of critical regions in the feature map by establishing correlations between different regions. Xie et al. [22] proposed a tea disease detection model based on YOLOv8s, integrating deformable convolutions, attention mechanisms, and an improved spatial pyramid pooling module to enhance the model’s capability to handle complex targets and challenging backgrounds by reducing interference from irrelevant factors and enabling effective multi-feature fusion. Although these methods effectively extract spatial information, they do not incorporate spatial attention for capturing multiscale features.
Self-attention can be seen as a special form of spatial attention that captures spatial relationships among individual pixels, allowing for efficient feature representation [23,24]. Therefore, introducing self-attention mechanisms can enhance multiscale feature extraction [25,26]. Stephen et al. [27] incorporated self-attention into the ResNet18 and ResNet34 architectures to enhance the process of feature selection. Zeng et al. [28] introduced a self-attention convolutional neural network (SACNN) that extracts relevant features from crop disease spots for disease identification. The above studies demonstrate that the self-attention mechanism is very effective in enhancing feature extraction in convolutional neural networks.
Based on the above analysis, integrating self-attention into convolutional neural networks for multiscale feature extraction is feasible. Furthermore, spatial and channel attention effectively reduce noise and redundancy. Therefore, this paper proposes a tea disease detection method based on channel–spatial attention (TCSA). The primary contributions are summarized as follows:
  • TCSA enhances the backbone network’s ability to capture long-range dependencies of disease spots on leaves using inverted residual self-attention modules. Additionally, TCSA improves feature fusion in the neck with channel–spatial attention modules, enriching contextual semantic information of diseased regions and eliminating complex background noise.
  • This paper designs a channel attention module, named Residual Channel Attention (RCA). RCA uses residual connections to enhance channel interactions and employs two pooling techniques to capture global information from each channel, thereby improving the distinction between diseased and healthy tea leaves.
  • This paper also designs a spatial attention (SA) module. SA optimizes the attention computation to reduce computational complexity and introduces Depth-Wise Convolution (DW-Conv) to focus on more informative disease regions while further reducing the computational load.
We organize ablation experiments and comparative analyses to demonstrate the effectiveness of the proposed method. Experimental results show that the detection accuracy of the proposed method is improved in multiscale and complex background scenes.
This paper is structured as follows: Section 2 introduces the dataset utilized in this paper. Section 3 introduces the structure of TCSA and the channel–spatial attention module. Section 4 describes the configuration of the experimental equipment, some experimental parameters, the results of the ablation and comparative experiments, and the performance of TCSA on different diseases. Section 5 concludes with recommendations and future directions.

2. Materials

2.1. Disease Dataset

The tea disease dataset used for modeling was acquired from Anhui Agricultural University [29]. It consists of 776 tea disease images with a size of 906 × 600 pixels, and each image belongs to one of six common tea disease categories, i.e., tea algae leaf spot (Als), tea cake (Tc), tea cloud leaf blight (Clb), tea exobasidium blight (Eb), tea red rust (Tr), and tea red scab (Rs), as shown in Figure 1. Because the original dataset does not include bounding boxes for the tea disease targets, we annotated the images with the LabelImg tool to assess the model’s performance on tea leaf pest and disease detection.

2.2. Data Augmentation

To enhance the model’s robustness and generalization capability, data augmentation techniques including image flipping, horizontal and vertical mirroring, and noise addition were employed, as shown in Figure 2. After augmentation, the 776 tea disease images increased to 3676, and the enhanced dataset was divided into training, validation, and test sets in a ratio of 8:1:1.
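As an illustration only, the following minimal sketch shows how the listed augmentations (mirroring and additive noise) could be generated for a single image; it assumes OpenCV and NumPy as dependencies, and in a detection setting the bounding boxes would have to be transformed together with the images.

```python
import numpy as np
import cv2  # assumed dependency; any image library would work

def augment(image: np.ndarray) -> list:
    """Return mirrored and noise-corrupted variants of a single H x W x 3 image."""
    variants = []
    variants.append(cv2.flip(image, 1))                   # horizontal mirror
    variants.append(cv2.flip(image, 0))                   # vertical mirror
    noise = np.random.normal(0.0, 10.0, image.shape)      # additive Gaussian noise
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    variants.append(noisy)
    return variants
```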

3. Methodology

To address the tea disease detection problems of varying disease scales, occlusion of tea leaves, and complex backgrounds in natural scenes, we propose a tea disease detection method with channel–spatial attention (TCSA), consisting of three parts: backbone, neck, and head, as shown in Figure 3.
The backbone composed of the multiscale convolutional block and self-attention mechanism captures features with multiple scales. Then, the neck uses a top-down/bottom-up fashion to fuse multiscale features generated from the backbone to aid in capturing both the global information and detailed features of tea leaf diseases, such as distribution patterns, small spots, and edge details. In addition, we propose a novel channel attention module and a spatial attention module, and incorporate them as plugins after each feature fusion operation to eliminate the background noise and highlight the disease features. Finally, the head consists of three parallel convolution branches to detect tea diseases with multiple sizes.

3.1. Backbone

The backbone is used to extract multiscale coarse features of diseases. Many commonly used models, such as ResNet-101 [30] and Darknet53 [31], can be selected as backbones for feature extraction. We chose CSPDarkNet53 [32], which has demonstrated excellent results in many classic deep learning-based methods [33,34], as our backbone framework for extracting multiscale features of tea diseases, owing to its low computational complexity and strong adaptability. Additionally, we introduce the Inverted Residual Mobile self-attention Block (IRMB) [35] as a plugin inserted into CSPDarkNet53 to address its limitations in integrating global information and handling long-range dependencies, as shown in Figure 3. IRMB is inserted at four feature scales of CSPDarkNet53: 160 × 160 × 128, 80 × 80 × 256, 40 × 40 × 512, and 20 × 20 × 1024. The disease features extracted by IRMB at the 80 × 80 × 256, 40 × 40 × 512, and 20 × 20 × 1024 scales are then output to the neck.
Figure 4 shows the details of the proposed Inverted Residual Mobile self-attention Block (IRMB). IRMB combines self-attention with Depth-Wise Convolution (DW-Conv). The 1 × 1 convolutions compress and expand the number of channels to optimize computational efficiency, and the 3 × 3 DW-Conv captures spatial features, while the attention mechanism captures global dependencies between features. We inserted IRMB into the CSPDarkNet53 structure, which not only optimizes computational efficiency but also enhances the model’s ability to extract multiscale features of tea diseases.
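To make this structure concrete, the following PyTorch sketch shows one way such an inverted residual self-attention block could be organized; the layer ordering, expansion ratio, and head count are our assumptions for illustration, not the exact IRMB implementation of [35].

```python
import torch
import torch.nn as nn

class IRMBSketch(nn.Module):
    """Rough sketch of an inverted residual block mixing multi-head self-attention
    with a 3x3 depth-wise convolution. Layer ordering, expansion ratio, and head
    count are assumptions for illustration, not the exact IRMB design of [35]."""
    def __init__(self, channels: int, expand_ratio: int = 4, heads: int = 4):
        super().__init__()
        hidden = channels * expand_ratio
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)       # 1x1 expand
        self.attn = nn.MultiheadAttention(hidden, heads)               # global dependencies
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)              # 3x3 DW-Conv
        self.project = nn.Conv2d(hidden, channels, kernel_size=1)      # 1x1 compress
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        y = self.expand(x)
        tokens = y.flatten(2).permute(2, 0, 1)           # (H*W, B, hidden) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention over all positions
        y = y + attended.permute(1, 2, 0).reshape(b, -1, h, w)
        y = self.act(self.dwconv(y))                     # local spatial features
        return x + self.project(y)                       # inverted residual connection
```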

3.2. Neck

The neck repeatedly fuses multiscale tea disease feature maps through a top-down/bottom-up fashion. In addition, we utilize Residual Channel Attention (RCA) and spatial attention (SA) in the neck to enhance the contextual semantics of fused features of the disease image, and to eliminate redundancy and noise from complex backgrounds, as shown in Figure 3.

3.2.1. Residual Channel Attention Module

Channel attention automatically highlights relevant feature channels while suppressing irrelevant ones [36]. Therefore, we propose a novel Residual Channel Attention (RCA) module which uses Global Average Pooling (GAP) and Global Maximum Pooling (GMP) to retain relevant channel information. Specifically, GAP computes the mean value of each feature channel, providing a generalized summary of the feature information, while GMP identifies the maximum value within each feature channel, highlighting the most prominent features. Additionally, residual connections are used to strengthen the interactions between channels, enhancing the overall feature representation.
Figure 5 shows the details of the proposed Residual Channel Attention (RCA). Given the input x, RCA applies Global Average Pooling (GAP) and Global Maximum Pooling (GMP), each followed by a shared Multi-Layer Perceptron (MLP) with a ReLU operation and a convolution with a kernel size of 1 × 1, to capture the global information of each channel, denoted by x1 and x2. RCA then aggregates x1 and x2 by element-wise addition followed by a sigmoid function to obtain channel attention weights, which are used to reweight the input feature maps. Therefore, RCA is formulated as follows:
$$\theta = \mathrm{Sigmoid}\big(\mathrm{ReLU}(\mathrm{MLP}(\mathrm{GAP}(x))) + \mathrm{ReLU}(\mathrm{MLP}(\mathrm{GMP}(x)))\big)$$
$$F_C = x \oplus (\theta \otimes x)$$
where $\otimes$ denotes element-wise multiplication, $\oplus$ denotes element-wise addition, and $F_C$ denotes the channel feature maps.
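The following PyTorch sketch illustrates the RCA computation described by the equations above; the internal width of the shared MLP is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RCASketch(nn.Module):
    """Sketch of Residual Channel Attention: GAP and GMP summarize each channel,
    a shared 1x1-conv MLP scores them, and the sigmoid gate re-weights the input,
    which is then added back through a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                        # Global Average Pooling
        self.gmp = nn.AdaptiveMaxPool2d(1)                        # Global Maximum Pooling
        self.mlp = nn.Conv2d(channels, channels, kernel_size=1)   # shared MLP (1x1 conv)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = torch.sigmoid(torch.relu(self.mlp(self.gap(x)))
                              + torch.relu(self.mlp(self.gmp(x))))
        return x + theta * x                                      # F_C = x ⊕ (θ ⊗ x)
```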

3.2.2. Spatial Attention Module

In this section, we propose a novel spatial attention (SA) module with a Self-Attention Block (SAB) as a plugin to focus on disease areas that may be occluded by other leaves and to reduce spatial dimensions to lower the computational complexity, thereby addressing the issue of extensive leaf occlusion in complex backgrounds. Note that, in visual research, the self-attention mechanism can be considered, to some extent, a form of spatial attention. Figure 6a shows the overall structure of the proposed SA, which includes the Self-Attention Block (SAB), Depth-Wise Convolution (DW-Conv), residual connections, and layer normalization. Figure 6b illustrates how SAB reduces the spatial dimensions. SAB consists of a Spatial Decrease (SD) module and Multi-Head Attention (MHA), and receives Query, Key, and Value as input. SD applies a convolution with kernel size N × N to the Key and Value, reducing their spatial dimension from H × W to HW/N². The formula is as follows:
$$SD(x) = \mathrm{Norm}\big(\mathrm{Conv}(\mathrm{Key}, \mathrm{Value}, N)\big)$$
where Conv(Key, Value, N) is a convolution operation with kernel size N × N on the input Key and Value. Norm(·) refers to layer normalization.
Furthermore, DW-Conv can be regarded as a form of local attention that makes the model attend to spatially adjacent features, ensuring attention to the overall disease area while enhancing local disease detection.
Hence, the output of spatial attention is as follows:
$$F_S = \mathrm{DWConv}(F_C') + F_C'$$
$$F_C' = \mathrm{SAB}\big(\mathrm{Norm}(F_C)\big) + \mathrm{Norm}(F_C)$$
where $F_S$ is the spatial feature map and $F_C'$ is the intermediate feature produced by the self-attention branch.
Finally, the feature map fused through the residual connection is expressed as follows:
$$F_{SC} = F_S \oplus x$$
where $F_{SC}$ is the feature map after the final fusion.
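A rough PyTorch sketch of the SA module and its SAB sub-block is given below; the spatial reduction factor, head count, and the use of GroupNorm as a channel-wise stand-in for layer normalization are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SABSketch(nn.Module):
    """Self-Attention Block sketch: Key/Value are spatially reduced by an N x N
    strided convolution (Spatial Decrease) before multi-head attention, shrinking
    the attended sequence from H*W to H*W/N^2 tokens."""
    def __init__(self, channels: int, reduce: int = 2, heads: int = 4):
        super().__init__()
        self.sd = nn.Conv2d(channels, channels, kernel_size=reduce, stride=reduce)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = x.flatten(2).permute(2, 0, 1)                 # queries: (H*W, B, C)
        kv = self.sd(x)                                   # spatial decrease of Key/Value
        kv = self.norm(kv.flatten(2).transpose(1, 2))     # (B, H*W/N^2, C) + layer norm
        kv = kv.permute(1, 0, 2)                          # (H*W/N^2, B, C)
        out, _ = self.attn(q, kv, kv)
        return out.permute(1, 2, 0).reshape(b, c, h, w)

class SASketch(nn.Module):
    """Spatial attention module sketch following the equations above:
    F_C' = SAB(Norm(F_C)) + Norm(F_C), F_S = DWConv(F_C') + F_C', F_SC = F_S + x.
    GroupNorm(1, C) stands in for layer normalization over channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.sab = SABSketch(channels)
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels)    # DW-Conv as local attention
        self.norm = nn.GroupNorm(1, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fc = self.norm(x)
        fc_prime = self.sab(fc) + fc
        fs = self.dwconv(fc_prime) + fc_prime
        return fs + x                                          # final residual fusion
```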

3.3. Head

The head aims to identify tea diseases at different scales. In this paper, we utilize three parallel convolutions to better capture the features of objects at different scales and improve the detection performance, as shown in Figure 3. Three detection heads with sampling ratios of 1/8, 1/16, and 1/32 are used, corresponding to small object detection, medium object detection, and large object detection, respectively.
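As a simple illustration of the three-branch head, the sketch below applies one convolution per scale; the per-scale channel counts and the output layout (class scores plus box regression terms) are assumptions, not the paper's exact head design.

```python
import torch.nn as nn

class DetectionHeadsSketch(nn.Module):
    """Three parallel convolutional heads for the 1/8, 1/16, and 1/32 feature maps,
    targeting small, medium, and large disease objects respectively."""
    def __init__(self, channels=(256, 512, 1024), num_classes=6, box_terms=4):
        super().__init__()
        out_ch = num_classes + box_terms
        self.heads = nn.ModuleList(
            [nn.Conv2d(c, out_ch, kernel_size=1) for c in channels]
        )

    def forward(self, feats):
        # feats: [P3 (stride 8), P4 (stride 16), P5 (stride 32)] from the neck
        return [head(f) for head, f in zip(self.heads, feats)]
```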

3.4. Loss Function

The loss function of the proposed model primarily comprises VariFocal Loss (VFL) [37], CIoU loss, and Distribution Focal Loss (DFL) [38].
The main improvement of VFL is its asymmetric weighting of positive and negative examples. The loss is formulated as follows:
$$\mathrm{VFL}(p, q) = \begin{cases} -q\big(q\log(p) + (1-q)\log(1-p)\big), & q > 0 \\ -\alpha p^{\gamma}\log(1-p), & q = 0 \end{cases}$$
where p represents the predicted IoU-Aware Classification Score (IACS), q represents the target score, and α and γ are weighting hyperparameters for negative examples.
DFL optimizes the probabilities of the two locations closest to the target label y using cross-entropy, enabling the network to quickly focus on the distribution near the target location.
$$\mathrm{DFL}(S_i, S_{i+1}) = -\big((y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\big)$$
$$S_i = \frac{y_{i+1} - y}{y_{i+1} - y_i}, \qquad S_{i+1} = \frac{y - y_i}{y_{i+1} - y_i}$$
where y represents the label, and $y_i$ and $y_{i+1}$ represent the two values nearest to y ($y_i \le y \le y_{i+1}$).
CIoU [39] loss is employed to address the issue of mutual occlusion between leaves and to achieve better bounding box regression.
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{s^2} - \alpha\upsilon$$
where α is the weight coefficient, b and $b^{gt}$ represent the center points of the predicted and ground-truth boxes, respectively, $\rho(\cdot)$ is the Euclidean distance between them, s is the diagonal length of the smallest box enclosing both boxes, and υ represents the consistency of the aspect ratio.
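The following sketch shows how the VFL and DFL terms above could be evaluated in PyTorch; the α and γ values follow common defaults and are assumptions here, and the CIoU term would be computed analogously from the box geometry.

```python
import torch

def varifocal_loss_sketch(p: torch.Tensor, q: torch.Tensor,
                          alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """VariFocal Loss sketch with asymmetric weighting: positives (q > 0) are
    weighted by the target score q, negatives by alpha * p**gamma."""
    p = p.clamp(1e-6, 1 - 1e-6)                                # numerical stability
    pos = -q * (q * torch.log(p) + (1 - q) * torch.log(1 - p))
    neg = -alpha * p.pow(gamma) * torch.log(1 - p)
    return torch.where(q > 0, pos, neg).mean()

def dfl_sketch(s_i: torch.Tensor, s_i1: torch.Tensor,
               y: torch.Tensor, y_i: torch.Tensor, y_i1: torch.Tensor) -> torch.Tensor:
    """Distribution Focal Loss sketch for a target y between bins y_i and y_i1,
    with s_i and s_i1 the predicted probabilities of those two bins."""
    return -((y_i1 - y) * torch.log(s_i) + (y - y_i) * torch.log(s_i1))
```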

4. Experiments

4.1. Dataset and Experimental Setup

To eliminate variations in experimental data due to differing conditions, all experiments in this study were conducted under the same hardware and software environments. We used NVIDIA A100 80 GB PCIe GPUs (NVIDIA, Santa Clara, CA, USA) and the Linux operating system. The programming language was Python 3.9, and the deep learning framework was PyTorch 1.8.2. The training was conducted from scratch without using any pre-trained weights. During training, the image size was resized to 640 × 640, the optimizer chosen was Adam, the batch size was set to 32, the initial learning rate was 0.001, and the number of epochs was set to 300.
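For reference, a minimal training-loop sketch matching this configuration is shown below; the model and data loader arguments are placeholders for the TCSA network and the augmented dataset, and the combined loss is assumed to be returned by the model's forward pass.

```python
import torch

def train_tcsa(model: torch.nn.Module, train_loader, epochs: int = 300, lr: float = 1e-3):
    """Training-loop sketch for the stated configuration: Adam optimizer, initial
    learning rate 0.001, 300 epochs; 640x640 inputs and batch size 32 are assumed
    to be handled by the data loader. The combined VFL + CIoU + DFL loss is assumed
    to be computed inside the model's forward pass."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            loss = model(images, targets)   # forward pass returns the total loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```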

4.2. Evaluation Metrics

In this study, the mean average precision (mAP) is employed to evaluate the proposed method and is defined as follows:
$$\mathrm{mAP} = \frac{1}{k}\sum_{i=1}^{k} AP(i)$$
where k is the number of classes in the test set, and AP(i) is the average precision of the model for the i-th class of targets.
The computation of mAP encompasses average precision (AP). In our evaluation, the AP is formulated as follows:
$$AP = \int_0^1 P(r)\,dr$$
P is the percentage of actual target instances correctly predicted by the model among all the predicted target instances. The calculation of precision is as follows:
$$P = \frac{TP}{TP + FP}$$
R denotes the percentage of recognized targets by the model among the actual target instances. R is computed using the following formula:
$$R = \frac{TP}{TP + FN}$$
where TP refers to a target instance that is correctly identified, FP represents an instance that is incorrectly identified as a target, TN denotes a background instance that is accurately identified, and FN indicates a target instance that is missed, i.e., misclassified as background.
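The sketch below shows one straightforward way to evaluate these metrics, approximating AP by numerical integration of the precision-recall curve; it is a simplified illustration rather than the exact evaluation protocol used in the experiments.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """P = TP / (TP + FP); R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return p, r

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (AP = integral of P(r) dr),
    approximated by trapezoidal integration over recall sorted in ascending order."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

def mean_average_precision(ap_per_class) -> float:
    """mAP = (1/k) * sum of AP(i) over the k disease classes."""
    return float(np.mean(ap_per_class))
```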

4.3. Ablation Study

Ablation experiments were conducted to evaluate the efficacy and contribution of the proposed blocks to overall performance. Each attention mechanism block was incrementally incorporated into the architecture based on the CSPDarkNet53 backbone.
Table 1 shows that, overall, the performance of TCSA improves as more sub-modules are added, and TCSA with all sub-modules achieves the highest accuracy. In particular, CSPDarkNet53 + IRMB, CSPDarkNet53 + RCA, CSPDarkNet53 + SA, CSPDarkNet53 + IRMB + RCA, and CSPDarkNet53 + IRMB + SA improve the mAP of CSPDarkNet53 by 0.2%, 0.4%, 0.8%, 0.6%, and 1.1%, respectively. These results indicate that the three sub-modules IRMB, RCA, and SA are all effective.

4.4. Comparative Analysis

The proposed TCSA method was compared with SSD [40], YOLOv5s [41], Tea-YOLOv8s [22], AX-RetinaNet [17], and RT-DETR [42] in terms of precision, recall, and mAP, as shown in Table 2 and Figure 7 and Figure 8. Table 2 presents the precision, recall, and mAP of the six models, while Figure 7 and Figure 8 compare TCSA with the other five models in terms of mAP and attention heatmaps.
Table 2 shows that TCSA achieved the best precision and mAP among the compared methods. In particular, TCSA outperformed SSD and YOLOv5s by 6.2% and 6.0% in mAP, respectively. These results indicate that incorporating attention mechanism plugins into deep learning frameworks effectively improves detection accuracy. TCSA also showed improvements in mAP over Tea-YOLOv8s, AX-RetinaNet, and RT-DETR of 1.3%, 1.2%, and 0.4%, respectively. These improvements can be attributed to TCSA’s integration of the IRMB as a backbone plugin, which effectively captures multiscale disease features, and to the channel–spatial attention plugin in the neck, which enhances the model’s ability to refine and fuse multiscale features. These experimental results validate the effectiveness of the proposed TCSA method in tea disease detection.
Figure 7 illustrates the mAP curves for four attention mechanism-based models: TCSA, Tea-YOLOv8s, AX-RetinaNet, and RT-DETR. From Figure 7, TCSA had certain advantages in the detection of tea disease. Although RT-DETR gradually approached TCSA in average precision after 300 epochs, TCSA still showed improvement in the final results.
Figure 8 displays the attention heatmaps for TCSA, Tea-YOLOv8s, AX-RetinaNet, RT-DETR, YOLOv5s, and SSD. In the heatmaps, the color variation indicates the degree of attention the recognition network paid to different areas of the image. The original disease images, from top to bottom in Figure 8, represent scenarios such as a single disease, multiscale disease, leaf occlusion, and disease in a complex background. In these scenarios, our method, which integrates channel and spatial attention mechanisms, covers the target areas more accurately than the other methods.

4.5. TCSA’s Performance on Different Diseases

We evaluated the performance of TCSA on various diseases using the provided dataset, as shown in Table 3 and Figure 9, where Table 3 presents the precision, recall, and mAP values of TCSA for different diseases, while Figure 9 illustrates the visual results for these diseases. As shown in Table 3, TCSA exhibited high detection accuracy for diseases such as Eb and Tr, which have a wide distribution, significant variation, and complex backgrounds. However, the mAP of Rs was relatively low at 87.1%; this value still exceeded the corresponding precision of 86.4% and recall of 82.4%, indicating that the model has excellent comprehensive performance under a variety of conditions, with good robustness and adaptability. In addition, the proposed method showed promising results in the detection of a variety of tea diseases, including Als, Tc, and Clb.
Figure 9 shows the detection results generated by TCSA, including prediction boxes and scores. In Figure 9a,e, precise and accurate detection of diseases of various scales and shapes can be observed against a simple background, demonstrating the algorithm’s ability to handle scale changes. Figure 9b,e demonstrate tea disease detection under complex background and leaf occlusion conditions, accurately identifying the affected areas and disease types on the leaves. This can be attributed to the feature extraction and fusion network’s ability to accurately detect disease regions in complex backgrounds by capturing the dependencies between input feature scales. The visualization results demonstrate that the model can effectively address challenges in tea disease detection, such as varying disease scales, leaf occlusion, and complex backgrounds.

5. Conclusions

This paper proposes the TCSA network for detecting tea diseases in complex backgrounds. The model effectively overcomes the challenges posed by complex backgrounds and variable scales. Compared with prior network models, our approach focuses more on the diseased parts of tea leaf images, improving the average detection accuracy. In this study, TCSA achieved an outstanding mAP of 94.6%, with precision and recall of 92.9% and 89.6%, respectively. The experimental results demonstrate that the proposed channel–spatial attention mechanism improves feature extraction for tea disease diagnosis under complex backgrounds. In addition, we find that the self-attention mechanism in our method helps the backbone effectively extract multiscale disease features.
Future research will focus on exploring the actual deployment and time constraints of real-time detection of tea diseases [43], strengthening the feature fusion strategy, and improving the practicability and effectiveness of the model [44]. We hope that our research will enable the accurate and timely detection of tea diseases, which will not only enhance tea production yield and quality but also contribute to achieving the agricultural circular economy and promoting sustainable agriculture.

Author Contributions

Y.S.: methodology, original draft, writing; M.J.: formal analysis, resources, writing; H.G. and G.W.: review and editing, visualization; L.Z.: review and editing, supervision; J.Y. and F.W.: review and editing, resources. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation 2030 Major S&T Projects of China (2021ZD0113600), the Science and Technology Plan Project of Henan Province (242102210092), the Henan Province Key Research and Development Project (241111212200), the Natural Science Foundation of Henan Province (222300420275, 232300421167), the Postgraduate Education Reform and Quality Improvement Project of Henan Province (YJS2023SZ23, YJS2022KC34, YJS2024AL104), the Teacher Education Curriculum Reform Projects of Henan Province (2024-JSJYZD-008), and the Nanhu Scholars Program for Young Scholars of XYNU.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author. The source data are available at https://github.com/Jeremy-54295/tea-dataset.git (accessed on 23 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fang, J.; Sureda, A.; Silva, A.S.; Khan, F.; Xu, S.; Nabavi, S.M. Trends of tea in cardiovascular health and disease: A critical review. Trends Food Sci. Technol. 2019, 88, 385–396. [Google Scholar] [CrossRef]
  2. Hu, G.; Yang, X.; Zhang, Y.; Wan, M. Identification of tea leaf diseases by using an improved deep convolutional neural network. Sustain. Comput. Inform. Syst. 2019, 24, 100353. [Google Scholar] [CrossRef]
  3. Long, Z.; Jiang, Q.; Wang, J.; Zhu, H.; Li, B. Research on method of tea flushes vision recognition and picking point localization. Microsyst. Technol. 2022, 2, 41–45. [Google Scholar]
  4. Zhang, L.; Zou, L.; Wu, C.; Chen, J.; Chen, H. Locating famous tea’s picking point based on shi-tomasi algorithm. Comput. Mater. Contin. 2021, 69, 1109–1122. [Google Scholar] [CrossRef]
  5. Cardellicchio, A.; Solimani, F.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Detection of tomato plant phenotyping traits using YOLOv5-based single stage detectors. Comput. Electron. Agric. 2023, 207, 107757. [Google Scholar] [CrossRef]
  6. Wang, D.; He, D. Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
  7. Sozzi, M.; Cantalamessa, S.; Cogato, A.; Kayad, A.; Marinello, F. Automatic bunch detection in white grape varieties using YOLOv3, YOLOv4, and YOLOv5 deep learning algorithms. Agronomy 2022, 12, 319. [Google Scholar] [CrossRef]
  8. Zhou, G.; Zhang, W.; Chen, A.; He, M.; Ma, X. Rapid detection of rice disease based on fcm-km and faster r-cnn fusion. IEEE Access 2019, 7, 143190–143206. [Google Scholar] [CrossRef]
  9. Sun, C.; Huang, C.; Zhang, H.; Chen, B.; An, F.; Wang, L.; Yun, T. Individual tree crown segmentation and crown width extraction from a height map derived from aerial laser scanning data using a deep learning framework. Front. Plant Sci. 2022, 13, 914974. [Google Scholar] [CrossRef]
  10. Dai, G.; Fan, J. An industrial-grade solution for crop disease image detection tasks. Front. Plant Sci. 2022, 13, 921057. [Google Scholar] [CrossRef]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2017; Volume 30, pp. 1–11. Available online: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 4 May 2022).
  12. Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  13. Hu, J.; Li, S.; Samuel, A.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed]
  14. Liu, L.; Wang, R.; Xie, C.; Yang, P.; Wang, F.; Sudirman, S.; Liu, W. PestNet: An end-to-end deep learning approach for large-scale multi-class pest detection and classification. IEEE Access 2019, 7, 45301–45312. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Huang, S.; Zhou, G.; Hu, Y.; Li, L. Identification of tomato leaf diseases based on multi-channel automatic orientation recurrent attention network. Comput. Electron. Agric. 2023, 205, 107605. [Google Scholar] [CrossRef]
  16. Gao, R.; Wang, R.; Feng, L.; Li, Q.; Wu, H. Dual-branch, efficient, channel attention-based crop disease identification. Comput. Electron. Agric. 2021, 190, 106410. [Google Scholar] [CrossRef]
  17. Bao, W.; Fan, T.; Hu, G.; Liang, D.; Li, H. Detection and identification of tea leaf diseases based on AX-RetinaNet. Sci. Rep. 2022, 12, 2183. [Google Scholar] [CrossRef] [PubMed]
  18. Sunil, C.; Jaidhar, C.; Patil, N. Tomato plant disease classification using multilevel feature fusion with adaptive channel spatial and pixel attention mechanism. Expert Syst. Appl. 2023, 228, 120381. [Google Scholar] [CrossRef]
  19. Wang, X.; Cao, W. Bit-plane and correlation spatial attention modules for plant disease classification. IEEE Access 2023, 11, 93852–93863. [Google Scholar] [CrossRef]
  20. Tang, Z.; Zhang, R.; Peng, Z.; Chen, J.; Lin, L. Multi-stage spatiotemporal aggregation transformer for video person re-identification. arXiv 2023. [Google Scholar] [CrossRef]
  21. Wang, F.; Wang, R.; Xie, C.; Yang, P.; Liu, L. Fusing multiscale context-aware information representation for automatic in-field pest detection and recognition. Comput. Electron. Agric. 2020, 169, 105222. [Google Scholar] [CrossRef]
  22. Xie, S.; Sun, H. Tea-YOLOv8s: A tea bud detection model based on deep learning and computer vision. Sensors 2023, 23, 6576. [Google Scholar] [CrossRef] [PubMed]
  23. Ren, B.; Liu, B.; Hou, B.; Wang, Z.; Yang, C.; Jiao, L. SwinTFNet: Dual-stream transformer with cross attention fusion for land cover classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 2501505. [Google Scholar] [CrossRef]
  24. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2023; pp. 205–218. [Google Scholar] [CrossRef]
  25. Hu, Y.; Deng, X.; Lan, Y.; Chen, X.; Long, Y.; Liu, C. Detection of rice pests based on self-attention mechanism and multiscale feature fusion. Insects 2023, 14, 280. [Google Scholar] [CrossRef] [PubMed]
  26. Sun, Y.; Wu, F.; Guo, H.; Li, R.; Yao, J.; Shen, J. Teadiseasenet: Multiscale self-attentive tea disease detection. Front. Plant Sci. 2023, 14, 1257212. [Google Scholar] [CrossRef] [PubMed]
  27. Stephen, A.; Punitha, A.; Chandrasekar, A. Designing self-attention-based resnet architecture for rice leaf disease classification. Neural Comput. Appl. 2023, 35, 6737–6751. [Google Scholar] [CrossRef]
  28. Zeng, W.; Li, M. Crop leaf disease recognition based on self-attention convolutional neural network. Comput. Electron. Agric. 2020, 172, 105341. [Google Scholar] [CrossRef]
  29. Tholkapiyan, M.; Aruna Devi, B.; Bhatt, D.; Saravana Kumar, E.; Kirubakaran, S.; Kumar, R. Performance analysis of rice plant diseases identification and classification methodology. Wireless Pers. Commun. 2023, 130, 1317–1341. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  31. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018. [Google Scholar] [CrossRef]
  32. Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of cnn. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020. [Google Scholar] [CrossRef]
  33. Roy, A.M.; Bose, R.; Bhaduri, J. A fast accurate fine-grain object detection model based on yolov4 deep neural network. Neural Comput. Appl. 2022, 34, 3895–3921. [Google Scholar] [CrossRef]
  34. Xue, Z.; Xu, R.; Bai, D.; Lin, H. Yolo-tea: A Tea Disease Detection Model Improved by YOLOv5. Forests 2023, 14, 415. [Google Scholar] [CrossRef]
  35. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar] [CrossRef]
  36. Sharma, V.; Tripathi, A.K.; Mittal, H. CLDA-Net: A novel citrus leaf disease attention network for early identification of leaf diseases. In Proceedings of the 2023 15th International Conference on Computer and Automation Engineering (ICCAE), IEEE, Sydney, Australia, 3–5 March 2023; pp. 178–182. [Google Scholar] [CrossRef]
  37. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. VarifocalNet: An iou-aware dense object detector. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar] [CrossRef]
  38. Li, X.; Lv, C.; Wang, W.; Li, G.; Yang, L.; Yang, J. Generalized Focal Loss: Towards efficient representation learning for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1–14. [Google Scholar] [CrossRef] [PubMed]
  39. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  40. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
  41. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; TaoXie; Fang, J.; Imyhxy; et al. Ultralytics/yolov5: v7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo 2022. [Google Scholar] [CrossRef]
  42. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
  43. Pandey, A.; Jain, K. A robust deep attention dense convolutional neural network for plant leaf disease identification and classification from smartphone captured real-world images. Ecol. Inform. 2022, 70, 101725. [Google Scholar] [CrossRef]
  44. Liu, Y.; Gao, G.; Zhang, Z. Crop disease recognition based on modified lightweight cnn with attention mechanism. IEEE Access 2022, 10, 112066–112075. [Google Scholar] [CrossRef]
Figure 1. Six common types of tea pests and diseases. (a) Tea algae leaf spot (Als), (b) tea cake (Tc), (c) tea cloud leaf blight (Clb), (d) tea exobasidium blight (Eb), (e) tea red rust (Tr), (f) tea red scab (Rs).
Figure 2. Comparison of original and enhanced images of tea disease.
Figure 3. The network structure of channel-spatial attention fusion for tea disease detection. The backbone is mainly composed of multiscale convolutional blocks and IRMB blocks based on the self-attention mechanism. The neck uses spatial attention modules (SA1 to SA3) and channel attention modules (RCA1 to RCA3). The head detects different features with three parallel convolutions.
Figure 4. Inverted Residual Mobile self-attention Block. (a) is the overall structure of the Inverted Residual Mobile self-attention Block (IRMB) and (b) is the depth-wise convolution (DW-Conv) in the structure.
Figure 5. Residual Channel Attention.
Figure 6. Spatial attention module. (a) is the overall architecture of the spatial attention module (SA), and (b) is the self-attention block (SAB) within that architecture.
Figure 7. The mAP variation of the six different models.
Figure 8. Identification of network heatmap visualization results. In the attention heatmaps, red areas represent higher attention, yellow areas represent medium attention, and blue areas represent lower attention.
Figure 9. Visualization test results. (a) shows the detection results for multi-scale diseases, (b) shows the detection results for complex backgrounds, (c,d,f) show the detection results for single diseases, and (e) shows the detection results for multi-scale diseases and complex backgrounds.
Table 1. Results of ablation experiments with different models.

| Backbone (CSPDarkNet53) | IRMB | RCA | SA | mAP (%) |
|---|---|---|---|---|
| ✓ |   |   |   | 93.3 |
| ✓ | ✓ |   |   | 93.5 |
| ✓ |   | ✓ |   | 93.7 |
| ✓ |   |   | ✓ | 94.1 |
| ✓ | ✓ | ✓ |   | 93.9 |
| ✓ | ✓ |   | ✓ | 94.4 |
| ✓ | ✓ | ✓ | ✓ | 94.6 |
Table 2. Performance comparison of different test models.

| Model | P (%) | R (%) | mAP (%) |
|---|---|---|---|
| SSD [40] | 86.5 | 89.1 | 88.4 |
| YOLOv5s [41] | 87.4 | 82.5 | 88.6 |
| Tea-YOLOv8s [22] | 92.7 | 89.2 | 93.3 |
| AX-RetinaNet [17] | 91.7 | 90.8 | 93.4 |
| RT-DETR [42] | 89.6 | 87.8 | 94.2 |
| TCSA | 92.9 | 89.6 | 94.6 |
Table 3. Detection results of different diseases in TCSA.

| Disease | P (%) | R (%) | mAP (%) |
|---|---|---|---|
| Als | 89.5 | 92.0 | 91.7 |
| Tc | 97.4 | 83.3 | 97.0 |
| Clb | 93.8 | 93.9 | 96.5 |
| Eb | 87.3 | 97.4 | 98.5 |
| Tr | 96.3 | 97.2 | 97.5 |
| Rs | 86.4 | 82.4 | 87.1 |
