1. Introduction
Plant diseases cause approximately 30% of annual crop losses [1], posing a significant threat to agricultural production. Tomato, as a widely cultivated and distributed crop, can suffer extensive yield reduction or even complete crop failure if diseases are not promptly addressed [2]. Rapid and early identification of plant diseases enables timely preventive measures and limits the spread of disease. Traditional diagnostic methods such as manual visual analysis and chemical testing [3] are time-consuming, labor-intensive, and expensive. With the advancement of computer technology, image-based plant disease recognition offers the advantages of being fast, resource-efficient, and low-cost. Researchers have proposed and applied many image recognition methods for plant disease identification, including the artificial bee colony algorithm [4], image segmentation [5], SVM [6], and other machine learning algorithms [7]. In recent years, deep learning approaches have shown promising results in plant leaf disease recognition [8]. However, because plant leaf disease images are randomly distributed, show diverse symptoms, and contain complex backgrounds [9], researchers have made a series of improvements to deep learning models.
Researchers propose a novel plant leaf disease identification model based on a deep convolutional neural network (Deep CNN), achieving better recognition results by setting an appropriate number of convolutional layers [10]. Because datasets are scarce and data are difficult to acquire, researchers often rely heavily on data augmentation and transfer learning. Building upon CNNs, a deep learning architecture called EfficientNet [11] is used together with data augmentation, improving generalization through transfer learning. It is tested on the public Plant Village dataset and achieves higher accuracy than other models. Attention mechanisms allow deep learning models to assign higher weights to informative features during training. In [12], the authors add attention mechanisms to a residual CNN for plant leaf disease recognition and compare it with regular CNNs and residual CNNs. By adjusting hyperparameters, designing deep learning architectures, enhancing preprocessing, and incorporating attention mechanisms, researchers have explored deep learning methods suitable for plant leaf disease identification and improved recognition accuracy.
Meanwhile, researchers have found that highlighting the typical features of plant leaf disease regions during the training of deep learning models can improve their performance. Therefore, guiding deep learning models to focus on the diseased areas of plant leaves has become an important concern for researchers. In [13], the authors explored the integration of the Convolutional Block Attention Module (CBAM) into the ResNet-34 architecture to enhance feature extraction. In addition, they implemented the Faster-RCNN framework. A noteworthy aspect of this study is that it generated annotations for suspected images in advance, which helped specify the regions of interest (RoI). By incorporating this approach, the model acquired prior knowledge about disease regions and was able to capture more profound disease-related features. Another solution to this issue is presented in [9], where researchers construct a large-scale plant leaf disease dataset called PDD271 and propose a new framework. This framework explores a multi-scale strategy and reweights both visual regions and the loss to emphasize discriminative diseased parts for plant disease recognition.
In traditional feature engineering, pixel value features are an important characteristic of plant diseases. In deep learning, appropriately regressing at the pixel level can guide the model towards the desired training direction. In [14], by establishing a pixel-level correspondence at both ends of the U-Net network, the model gains better control over geometric deformations. We draw inspiration from this idea and incorporate the pixel value distribution information of plant leaf diseases as global information into each self-attention module. This guides the self-attention module to assign greater weights to the pixel value range corresponding to the diseased region, further guiding the deep learning model to pay more attention to the diseased areas during training. We refer to this approach as GD-Attention. We use ResNet-18 as the backbone and apply GD-Attention to tomato leaf disease recognition, yielding the following contributions:
The training dataset images were enhanced using methods such as brightness adjustment, noise addition, rotation, scaling, and Gaussian filtering. Training was conducted using the augmented dataset, and compared to training with the original dataset, the accuracy of the test dataset increased by 1.2% with the same model.
The GD-Attention method was proposed and applied to plant leaf disease classification using ResNet-18 as the backbone. On the Plant Village (PV) dataset, an accuracy of 99.97% was achieved.
An ablation experiment was performed, comparing ResNet-18, ResNet-18 with self-attention, and ResNet-18 with GD-Attention to validate the effectiveness of GD-Attention.
By analyzing the attention heatmaps, self-attention and GD-Attention were compared with respect to their focus on plant leaf regions. This validated that GD-Attention guides the model towards disease areas.
Comparative experiments were conducted to compare the presented work with other methods for plant leaf disease classification, including basic model methods and state-of-the-art (SOTA) methods, in order to demonstrate the efficacy of the presented work.
2. Materials and Methods
2.1. Experimental Data Preparation
Plant Village is a public plant disease image dataset that covers 14 plant species and 26 types of diseases, including common crops such as corn, tomato, apple, and chili. The images were manually collected, classified, and annotated by plant scientists and professional horticulturists. Some of the annotations were completed by multiple experts to ensure accuracy. In this study, we selected seven types of tomato leaf diseases and one normal state from the Plant Village dataset, resulting in a total of 15,755 images. Image labels were assigned automatically from the class annotations of the original dataset. We then partitioned the dataset into training and testing sets with a ratio of 0.8:0.2.
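The 0.8:0.2 partition can be illustrated with a minimal sketch. The function name `split_dataset`, the fixed random seed, and the placeholder file names are assumptions made for illustration; the study does not state whether the split was stratified by class.

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and split them into training and testing lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle (seed is an assumption)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical identifiers standing in for the 15,755 selected images.
samples = [f"img_{i:05d}.jpg" for i in range(15755)]
train_set, test_set = split_dataset(samples)
print(len(train_set), len(test_set))
```

With 15,755 images, this yields 12,604 training images and 3,151 testing images.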
Table 1 lists the number of images used in this research for each class of tomato disease.
2.2. Model Establishment
As shown in Figure 1, the framework of the method used in this study consists of four components: input data augmentation (step 1), incorporation of global pixel value distribution information (step 2), image feature extraction with the GD-Attention mechanism (step 3), and disease classification (step 4). In step 1, we perform data augmentation on the original images, including brightness adjustment, noise introduction, rotation and scaling, and Gaussian filtering, to expand the training dataset and enhance the model’s generalization capability. In step 2, we calculate the statistics of the pixel value distribution for each image and provide them as an additional input to the GD-Attention module in step 3. In step 3, we input the augmented image together with the global pixel value distribution information into a residual network composed of GD-Attention blocks and Res blocks. Within each GD-Attention block, the global pixel value distribution information guides the attention mechanism to focus on important regions and features of the image. In step 4, we first reduce the dimensionality of the output of the last block with a pooling operation. The reduced feature maps are then fed into a fully connected layer for feature abstraction and non-linear transformation, followed by an output layer with the same number of neurons as the number of disease categories. Applying the softmax function to the outputs of this layer gives a probability score for each category, which determines the disease type to which the image belongs. In this study, we use ResNet-18 as the residual network in step 3.
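The classification stage (step 4) can be sketched as follows. This is a simplified illustration, not the model’s actual head: it collapses the fully connected and output layers into a single linear layer, and the feature map size (512, 7, 7) and the random weights are assumptions.

```python
import numpy as np

def classify(feature_map, weights, bias):
    """Global-average-pool a (C, H, W) feature map, apply a linear layer, then softmax."""
    pooled = feature_map.mean(axis=(1, 2))        # (C,) after pooling
    logits = weights @ pooled + bias              # (num_classes,)
    exp = np.exp(logits - logits.max())           # subtract the max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
num_classes = 8                                   # seven tomato diseases plus the healthy class
features = rng.standard_normal((512, 7, 7))       # assumed size of the last block's output
W = rng.standard_normal((num_classes, 512)) * 0.01
b = np.zeros(num_classes)

probs = classify(features, W, b)
predicted_class = int(np.argmax(probs))           # index of the most probable category
```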
2.3. Data Augmentation
Data augmentation is a technique in the field of image recognition that enhances a model’s performance by introducing additional data. Common image data augmentation techniques include rotation, cropping, flipping, and other operations. To enhance the model’s generalization ability, we performed the following data augmentation on the original images: brightness adjustment, noise introduction, rotation and scaling, and Gaussian filtering.
By randomly adjusting the brightness of the images, we simulated image variations under different lighting conditions, which helps the model remain robust to lighting changes. By introducing a certain level of noise into the images, we simulated background interference and other image corruption, which helps the model learn to tolerate noise. By randomly rotating and scaling the images, we simulated variations in viewing angle and size, which helps the model learn features that are invariant to rotation and scale. By applying Gaussian filtering to the images, we blurred them and reduced the noise within them, which reduces the interference of fine details and improves the model’s ability to learn overall image features.
Through these data augmentation operations, we can generate more diverse and challenging training samples, enabling the model to accurately predict and classify images under different lighting, noise, and transformation conditions. This enhances the model’s robustness and generalization ability, allowing it to better adapt to various situations in the real world.
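The four augmentation operations can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions rather than the pipeline used in the experiments: the parameter ranges are invented, rotation is restricted to multiples of 90 degrees, and scaling uses nearest-neighbour sampling. Images are assumed to be uint8 arrays of shape (H, W, 3).

```python
import numpy as np

def adjust_brightness(img, factor):
    """Scale pixel intensities; factor > 1 brightens, factor < 1 darkens."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_salt_pepper(img, amount, rng):
    """Set roughly a fraction `amount` of pixel positions to black or white."""
    out = img.copy()
    mask = rng.random(img.shape[:2])
    out[mask < amount / 2] = 0            # pepper
    out[mask > 1 - amount / 2] = 255      # salt
    return out

def rotate_and_scale(img, k, scale):
    """Rotate by k * 90 degrees, then rescale with nearest-neighbour sampling."""
    rotated = np.rot90(img, k)
    h, w = rotated.shape[:2]
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    return rotated[rows][:, cols]

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian filter applied along the height and width axes."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    h, w = img.shape[:2]
    padded = np.pad(img.astype(np.float32),
                    ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    tmp = sum(kernel[i] * padded[i:i + h, :, :] for i in range(2 * radius + 1))
    out = sum(kernel[i] * tmp[:, i:i + w, :] for i in range(2 * radius + 1))
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in for a leaf image
augmented = [
    adjust_brightness(image, rng.uniform(0.7, 1.3)),
    add_salt_pepper(image, 0.02, rng),
    rotate_and_scale(image, k=1, scale=0.8),
    gaussian_blur(image, sigma=1.0),
]
```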
2.4. Introducing Global Pixel Value Distribution Information
Combining a simple attention mechanism with a CNN has limited ability to increase the attention weight on plant leaf disease regions, especially when the images are affected by brightness changes and noise. We therefore ask whether an additional input, carrying a traditional feature representation of the disease regions on plant leaves, can be added to the attention mechanism to help it better identify the disease areas.
In image processing, we often adjust the pixel value distribution of an image to enhance contrast, improve visual effects, and adapt to different lighting conditions. Methods such as histogram equalization [15], gamma correction [16], and local adaptive techniques [17] are commonly used. In plant leaf disease classification studies based on traditional methods, researchers analyze pixel values to obtain pixel features of the disease regions on plant leaves [18,19] and use these features for disease classification.
In this study, we use the pixel value distribution of the image, a traditional feature representation, as global information. It is fed into each attention block through skip connections [20] and participates in the matrix operations within the attention mechanism.
As shown in Figure 2a, we have an image of tomato leaf disease with a size of (3, 224, 224), whose pixel values range from 0 to 255. In Figure 2b, we divide the pixel values from 0 to 255 into 16 intervals and count the number of pixels in each interval. We then normalize these pixel counts to the range [0, 1]. In this way, we obtain the pixel value distribution density for each RGB channel of the image. The distribution is represented by:

$$P_R(k) = \frac{n_R(k)}{N} \qquad (1)$$

$$P_G(k) = \frac{n_G(k)}{N} \qquad (2)$$

$$P_B(k) = \frac{n_B(k)}{N} \qquad (3)$$

where $P_R(k)$/$P_G(k)$/$P_B(k)$ is the distribution density of the $k$-th interval of the R/G/B channel, $n_R(k)$/$n_G(k)$/$n_B(k)$ is the number of pixels in the $k$-th interval, and $N$ is the total number of pixels in the image. These three distributions form a matrix with dimensions of (3, 16, 1), as shown in Figure 2c.
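The computation of the (3, 16, 1) distribution matrix can be sketched as follows. The function name `pixel_value_distribution` is hypothetical, and the random image is only a stand-in for a real leaf image.

```python
import numpy as np

def pixel_value_distribution(image, num_bins=16):
    """Return the per-channel pixel value distribution as a (3, num_bins, 1) array.

    `image` is a uint8 array of shape (3, H, W) with values in 0..255.
    """
    channels, height, width = image.shape
    total = height * width                       # total number of pixels per channel
    dist = np.zeros((channels, num_bins, 1), dtype=np.float64)
    for c in range(channels):
        # 16 equal-width intervals covering 0..255
        counts, _ = np.histogram(image[c], bins=num_bins, range=(0, 256))
        dist[c, :, 0] = counts / total           # normalize counts to [0, 1]
    return dist

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(3, 224, 224), dtype=np.uint8)
Y = pixel_value_distribution(image)
print(Y.shape)
```

Each channel’s 16 values sum to 1, so the matrix can be read as three discrete probability distributions over the pixel value intervals.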
After obtaining the pixel value distribution information, we treat it as global information and pass it to each GD-Attention block. As shown in steps 2 and 3 of Figure 1, the feature extraction module consists of a series of alternating GD-Attention blocks and Res-blocks, forming a residual network. Combining these differently structured blocks enhances the model’s feature extraction capability by leveraging their respective advantages [21]. Through skip connections, the global pixel value distribution information enters the GD-Attention module and is introduced into the feature extraction process at different depths, guiding the attention mechanism to allocate weights to different pixel value ranges.
2.5. GD-Attention
2.5.1. Self-Attention
Self-attention is a mechanism widely used in neural networks and commonly applied in natural language processing and computer vision tasks. Compared to RNN and LSTM, it has advantages such as context awareness, parallel computation, and handling long-term dependencies. Self-attention works by mapping an input sequence or matrix into three representations: query, key, and value. It then calculates the similarity between the query and other positions, applies weighted aggregation based on this similarity, and finally obtains the output representation.
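The query, key, and value computation described above can be illustrated with a minimal single-head sketch of scaled dot-product self-attention. The projection matrices are random placeholders and the dimensions are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (n, d_in) with n positions; w_q, w_k, w_v: (d_in, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity between every pair of positions
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # weighted aggregation of the values

rng = np.random.default_rng(0)
n, d_in, d_k = 6, 16, 8
x = rng.standard_normal((n, d_in))
w_q, w_k, w_v = (rng.standard_normal((d_in, d_k)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
```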
2.5.2. GD-Attention
As shown in Figure 3, we denote a feature map matrix $X$ of size $C \times H \times W$. $X$ is transformed into three matrices $Q$, $K$, and $V$ through the transformations $W_Q$, $W_K$, and $W_V$, respectively, as shown in the following equations:

$$Q = W_Q(X) \qquad (4)$$

$$K = W_K(X) \qquad (5)$$

$$V = W_V(X) \qquad (6)$$

where the transformations $W_Q$, $W_K$, and $W_V$ are implemented with convolution operations.

Before introducing the self-attention mechanism, we perform the transformation $W_G$ on the global pixel value distribution matrix $Y$ to obtain the matrix $G$, as shown in the following equation:

$$G = W_G(Y) \qquad (7)$$

where $G$ has the same dimension as the feature map in the GD-Attention convolutional layer.

Then, we use the values in matrix $G$ as the new distribution density to map the elements in matrix $Q$. Assuming the value of $q_{ij}$ falls within the $k$-th interval of $G$, the mapped value $q'_{ij}$ is computed from $q_{ij}$, $g_k$, and $q_{\max}$ (Equation (8)), where $q_{ij}$ is the value in the $i$-th row and $j$-th column of matrix $Q$, $g_k$ is the value of the $k$-th interval in matrix $G$, and $q_{\max}$ is the maximum value among all elements in matrix $Q$.

Afterwards, similar to self-attention, matrix multiplication is used to calculate the correlation between the mapped matrix $Q'$ and the matrix $K$. The result is divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the self-attention head. After a softmax operation, the result is multiplied by $V$ to obtain the module output. The calculation process is as follows:

$$\mathrm{Output} = \mathrm{softmax}\left(\frac{Q' K^{T}}{\sqrt{d_k}}\right) V \qquad (9)$$
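The data flow of GD-Attention can be sketched as below. This is a sketch under explicit assumptions, not the paper’s implementation. In particular, the mapping rule in `map_query` (scaling each $q_{ij}$ by the weight $g_k$ of the interval into which $|q_{ij}|/q_{\max}$ falls) is an assumption chosen for illustration, and the linear layer `w_g`, the flattening of $Y$, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def map_query(q, g):
    """ASSUMED mapping: weight q_ij by g_k, where k is the interval of |q_ij| / q_max."""
    num_bins = g.shape[0]
    q_max = np.abs(q).max()
    if q_max == 0:
        return q.copy()
    ratio = np.abs(q) / q_max                              # values in [0, 1]
    k = np.minimum((ratio * num_bins).astype(int), num_bins - 1)
    return q * g[k]

def gd_attention(x, y, w_q, w_k, w_v, w_g):
    """x: (n, d_in) features; y: (3, 16, 1) global pixel value distribution."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    g = w_g @ y.reshape(-1)                                # learned transform of Y (assumed linear)
    q_mapped = map_query(q, g)                             # distribution-guided query
    scores = q_mapped @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n, d_in, d_k = 49, 64, 32
x = rng.standard_normal((n, d_in))
y = rng.random((3, 16, 1))
y /= y.sum(axis=1, keepdims=True)                          # each channel sums to 1
w_q, w_k, w_v = (rng.standard_normal((d_in, d_k)) * 0.1 for _ in range(3))
w_g = rng.random((16, 48))                                 # hypothetical weights for W_G
out = gd_attention(x, y, w_q, w_k, w_v, w_g)
```

Under this assumed rule, a uniform $G$ of ones leaves $Q$ unchanged, so the module reduces to ordinary self-attention. That property makes the effect of the global information easy to isolate in an ablation.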
2.5.3. The Structure of GD-Attention Block
In this study, we conducted experiments using ResNet-18. We introduced the GD-Attention mechanism into the basic blocks of ResNet-18 and call the resulting block the GD-Attention block.
As shown in Figure 4a, the ResNet BasicBlock consists of two convolutional layers and a residual connection. Given an input matrix $X$, the output matrix $F$ is computed using the following expression:

$$F = W_2\,\sigma(W_1 X) \qquad (10)$$

where $\sigma$ represents the activation function ReLU, and $W_1$ and $W_2$ represent the weights of the first and second convolutional layers, respectively.

Then, the final output matrix $Y$ is obtained by adding $F$ to the input matrix $X$:

$$Y = F + X \qquad (11)$$

In this way, the residual block can learn the incremental changes in the input information and add them to the original input, thereby alleviating the problem of gradient vanishing and improving the training effectiveness of the network.

As shown in Figure 4b, we added the GD-Attention module between the two convolutional layers. The final output matrix $Y$ of the block can be represented by the following equation:

$$Y = W_2\,\mathrm{GDA}\big(\sigma(W_1 X)\big) + X \qquad (12)$$

where $\mathrm{GDA}(\cdot)$ denotes the GD-Attention module.
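The residual computation can be sketched with plain matrix products standing in for the convolutions. This is a simplified illustration: the linear maps, the dimensions, and the placeholder attention function are assumptions, and batch normalization is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def basic_block(x, w1, w2, attention=None):
    """Residual block with linear maps standing in for the two convolutions.

    If `attention` is given, it is applied between the two layers,
    mirroring where the GD-Attention module is inserted.
    """
    h = relu(x @ w1)            # first layer followed by ReLU
    if attention is not None:
        h = attention(h)        # optional attention module between the layers
    f = h @ w2                  # second layer
    return f + x                # shortcut: add the input back

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w1 = rng.standard_normal((16, 16)) * 0.1
w2 = rng.standard_normal((16, 16)) * 0.1

y_plain = basic_block(x, w1, w2)
# Placeholder attention that simply rescales the features (not GD-Attention itself).
y_attn = basic_block(x, w1, w2, attention=lambda h: 0.5 * h)
```

When the second layer’s weights are zero, the block returns its input unchanged, which is the identity behaviour that makes residual networks easier to train.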
3. Results and Discussion
The experiment was conducted using Python 3.7 and the PaddlePaddle 2.4.0 deep learning framework. The execution took place in a Linux environment, specifically running on a system equipped with a Tesla V100 GPU, 32 GB of video memory, a 4-core CPU, and 32 GB of RAM.
3.1. Data Augmentation
Image enhancement is used to improve the quality and quantity of original images, aiming to enhance the robustness and generalization capability of models under different lighting conditions, noise, and other variations. Common image enhancement methods include adjusting brightness, contrast, and color balance; cropping, rotating, and scaling images; and adding random perturbations and mixing, among others. By applying image enhancement, we obtain more training data and improve the performance and stability of the model. We apply data augmentation to the Plant Village (PV) dataset used for training and testing. Figure 5a displays original images from the PV dataset. Figure 5b–e demonstrate the effects of random brightness adjustment, salt-and-pepper noise addition, rotation and scaling, and Gaussian filtering, respectively. Through data augmentation, the dataset is expanded and certain fine details are enhanced, resulting in increased diversity and improved generalization capability.
3.2. Training Parameter Settings and Model Details
In the process of model training, we utilize the cross-entropy loss function to measure the difference between the predicted results and the true labels. We employ the Adam optimizer to optimize the model parameters and minimize the loss function. The batch size is set to 64, and we train the model for 100 epochs.
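The loss and optimizer can be illustrated with a minimal NumPy sketch of the cross-entropy loss and a single Adam update. The learning rate and the other Adam hyperparameters shown are the common defaults and are assumptions here, since they are not stated in the text.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy for logits of shape (batch, classes) and integer labels (batch,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and the updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for step t (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# A batch of 64 samples with uniform logits over 8 classes gives a loss of ln(8).
logits = np.zeros((64, 8))
labels = np.arange(64) % 8
loss = cross_entropy(logits, labels)

# First Adam step on a scalar parameter with gradient 2.0.
new_param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step, the bias-corrected update has magnitude close to the learning rate regardless of the gradient scale, which is one reason Adam is relatively insensitive to gradient magnitude.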
Table 2 provides details of the parameter settings for the layers in our model structure.
3.3. Model Identification Results
As shown in Figure 6, after 100 epochs of training, our model classifies the different types of plant leaf diseases with high accuracy. The model achieved an accuracy of 99.97% on the test set, with a loss of approximately 0.002. The accuracy exceeded 95% at around the 50th epoch, demonstrating good convergence and stability.
As shown in Figure 7, we used the t-SNE method to project the original data and the features extracted by our model onto a two-dimensional space. The t-SNE visualization shows that the scattered distribution of the original data becomes markedly better clustered after feature extraction by our model.
As shown in Figure 8, we plotted the confusion matrix for the classification results. The confusion matrix shows that our model achieved a classification accuracy of 100% for all categories except the “early blight” and “mosaic virus” classes, where a few samples were misclassified. The average precision, recall, F1 score, and accuracy of the proposed model are 99.94%, 99.94%, 99.94%, and 99.97%, respectively.
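The averaged metrics can be derived from a confusion matrix as sketched below. The three-class matrix is a toy example with invented numbers, not the paper’s results, and macro averaging over classes is an assumption about how the averages were computed.

```python
import numpy as np

def macro_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j.

    Assumes every class is predicted at least once and has at least one sample.
    """
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)                               # correctly classified samples per class
    precision = tp / cm.sum(axis=0)                # column sums: predicted counts
    recall = tp / cm.sum(axis=1)                   # row sums: true counts
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision.mean(), recall.mean(), f1.mean(), accuracy

# Toy confusion matrix for illustration only.
cm = [[50, 0, 0],
      [1, 48, 1],
      [0, 0, 50]]
precision, recall, f1, accuracy = macro_metrics(cm)
```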
3.4. Ablation Experiment
To verify the effectiveness of each component in our model, we conducted ablation experiments on its modules. The ablation experiments included ResNet-18 without data augmentation, ResNet-18 with data augmentation, and self-attention without global information. The experimental results are shown in Table 3.
Based on the results of the ablation experiments, we found that data augmentation made a significant contribution, increasing accuracy by 1.2%. The self-attention mechanism also had a positive effect on accuracy. Moreover, incorporating global information into the attention mechanism led to an additional improvement of 0.95% in accuracy.
We plotted Precision-Recall (PR) and ROC curves to analyze the model’s performance, as shown in Figure 9 and Figure 10. The results indicate that our proposed model maintains high precision while also achieving good recall. The ROC curves look better than the PR curves, which reflects the data imbalance among disease categories, since PR curves are more sensitive to class imbalance. A more balanced and higher-quality dataset is expected to further enhance model performance.
3.5. Attention Analysis of Leaf Disease Areas
We generate heatmaps of the feature extraction process to analyze the GD-Attention mechanism. Figure 11a shows the global pixel value distribution information used in the GD-Attention mechanism. Figure 11b shows the global pixel value distribution information after it enters the GD-Attention block and is transformed by the trained weights, which is the output of Equation (7). Figure 11c is the output of Equation (8). It shows the feature map generated by mapping the Q matrix in GD-Attention using the information from Figure 11b.
From Figure 11c, it can be observed that the feature map reflects information from various pixel value distribution ranges, indicating that the feature extraction process is guided by the global pixel value distribution information. Next, the matrix in Figure 11c is combined with the K matrix and V matrix according to Equation (9), resulting in the output features of the GD-Attention block.
We map the output feature maps of the attention block back onto the original images for observation and compare the results of self-attention and GD-Attention, as shown in Figure 12. The result of GD-Attention is more focused on the disease area, while the result of self-attention not only enhances the disease area but also emphasizes edge areas and even background features. This indicates that GD-Attention is less affected by the background, which contributes to its generalization capability.
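Mapping an attention map back onto an image can be sketched as follows. This is one simple way to produce such an overlay, assuming a low-resolution attention map whose size evenly divides the image size. The blending weight `alpha` and the use of the red channel are arbitrary illustrative choices, not the visualization procedure used for Figure 12.

```python
import numpy as np

def overlay_attention(image, attention, alpha=0.5):
    """Blend a low-resolution attention map onto an image.

    image: (H, W, 3) uint8; attention: (h, w) with H % h == 0 and W % w == 0.
    """
    H, W = image.shape[:2]
    h, w = attention.shape
    att = attention.astype(np.float32)
    span = att.max() - att.min()
    att = (att - att.min()) / span if span > 0 else np.zeros_like(att)
    # Nearest-neighbour upsampling: repeat each attention cell over a block of pixels.
    up = np.kron(att, np.ones((H // h, W // w), dtype=np.float32))
    heat = np.zeros_like(image, dtype=np.float32)
    heat[..., 0] = 255.0 * up                     # red channel encodes attention strength
    blended = (1 - alpha) * image.astype(np.float32) + alpha * heat
    return np.clip(blended, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in for a leaf image
attention = rng.random((7, 7))                                    # stand-in attention map
heatmap = overlay_attention(image, attention)
```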
3.6. Comparison
We compared our method with classic and state-of-the-art (SOTA) models for plant leaf disease classification on the Plant Village dataset. The classic models include VGG16, ResNet-50, CNN, and MobileNet. The state-of-the-art models include customized CNNs, Faster-RCNN, U-Net considering the regions of interest, and ResNet with a multi-scale strategy that reweights both visual regions and the loss.
The comparison results are shown in Table 4. Through our comparison, we have observed that classical models have achieved high recognition accuracy through techniques such as transfer learning, data augmentation, and squeeze and excitation, thanks to the continuous research conducted by scholars in recent years. State-of-the-art (SOTA) models further improve disease recognition accuracy by modifying and adjusting the structures of classical models. Notably, the papers [5,13,22] have demonstrated accuracy rates surpassing 99.9%. These papers commonly adopt ROI, multi-scale, or other algorithms to highlight plant disease regions. The distinguishing factor is that ROI marks the areas of focus prior to training, aiming to achieve specific training objectives, while multi-scale and other algorithms emphasize feature extraction of plant disease regions through structural changes in the model.
In this study, we propose the GD-Attention mechanism, which introduces pixel value distribution information to guide the model’s focus on plant disease regions during training. The results demonstrate the high accuracy of our proposed method. Furthermore, our proposed model has a size of approximately 27 MB and achieves high accuracy with a relatively small number of parameters. Small, accurate models are easy to load into automated systems, contributing to rapid automated early plant disease diagnosis. In applications, we can combine classification models with agricultural robotics systems or monitoring systems to provide automated solutions for rapid diagnosis of plant diseases [30,31].
4. Conclusions
This study focuses on the problem of excessive emphasis on edges, background, and other parts in feature extraction for plant leaf disease identification. We propose the GD-Attention mechanism, which introduces global pixel value distribution information to guide the model’s attention towards the diseased regions of plants. We use ResNet-18 as the backbone and incorporate the GD-Attention mechanism into the ResNet basic block. In addition, we design skip connection structures that incorporate global pixel value distribution information. The proposed deep learning model is trained and tested on the tomato leaf disease dataset from Plant Village.
To validate the proposed model, we conducted a series of experiments, including ablation experiments, attention visualization analysis, and comparative experiments with traditional and state-of-the-art models. The experimental results indicate that the model’s data augmentation module enhances its robustness against blurring, noise, rotation, and scaling. Additionally, the introduction of global pixel value distribution information guides the attention mechanism’s focus. We incorporated global pixel value information at various locations in the model’s feature extraction module using a skip-connection structure. This GD-Attention mechanism reinforces the model’s feature extraction for plant disease regions, leading to improved accuracy. Notably, our proposed model achieves the highest accuracy with a model size of only about 27 MB, which is expected to facilitate further research in real-world applications.
In summary, this study combines traditional image processing methods with deep learning approaches. By guiding the deep learning training process with global pixel value distribution information, the resulting model pays more attention to the diseased areas of plant leaves. This allows the model to achieve higher accuracy with a smaller number of parameters. Combined with agricultural robots and monitoring systems, this method can support the rapid detection of plant diseases. In future work, we plan to collect data from more complex scenarios and investigate the performance of this method on them, with the goal of further optimizing the model.