1. Introduction
Fruit is a vital economic crop in agriculture, but annual losses from fruit diseases can reach up to 30% [1]. Early disease prevention is crucial for plant protection, as it helps avoid significant yield losses before severe infection occurs. Most plant diseases first manifest visible symptoms on leaves, allowing for early detection through visual observation or computer vision techniques. Therefore, accurate recognition of fruit leaf diseases has become a key component in effective fruit disease management.
Vision-based fruit leaf disease recognition has attracted research attention because of its cost-effectiveness and efficiency. A range of methods has been explored, including traditional image processing techniques [2], pattern recognition [3], and artificial intelligence (AI) technologies [4]. Among these, AI, particularly deep learning, has shown promising outcomes [5,6]. However, the performance of deep learning in fruit leaf disease recognition degrades under real-world conditions such as uneven lighting, diverse viewpoints, and background interference. To address these challenges, researchers have proposed various improvement strategies.
In [7], the authors proposed a preprocessing method that combines denoising and hybrid contrast enhancement based on Convolutional Neural Networks (CNNs), which achieved a 99.4% accuracy rate when applied to fruit leaf disease classification. Reference [8] combines VGG16 with AlexNet to design a deep learning model for the classification and recognition of pepper leaf and fruit diseases, which improves performance while also accelerating computation. In [9], the author designed an improved ResNetV2 to identify apple leaf diseases. Comparisons with models such as VGG16, InceptionV3, and MobileNetV2 demonstrated the effectiveness of the improved network structure. These methods mainly aim to improve the recognition accuracy for fruit leaf disease scenarios through data augmentation, preprocessing modules, and the design of improved deep learning models.
Meanwhile, researchers have found that refined processing and screening of deep features can yield features that are even more crucial to classification results. The work in [10] demonstrates that extracting and integrating deep information from various scales provides important semantic features. Embedded attention mechanisms then suppress non-essential information and enhance important information to improve recognition accuracy. In deep information processing, strategies such as spatial and channel attention [11] are commonly employed to enhance critical information, thereby increasing the contribution of key features to classification results.
Through the analysis of these deep features, researchers have found that some key deep features share similarities with those extracted by traditional image processing methods, such as texture features [12], morphological features [13], and frequency domain features [14].
In [15], the authors propose FcaNet, which integrates channel attention into the frequency domain, combining Discrete Cosine Transform (DCT) with deep learning models. This novel approach leverages frequency domain features to introduce a new deep learning model that has demonstrated strong performance on the ImageNet and COCO datasets. Considering the importance of frequency features in fruit disease analysis, we designed the FA-attention module, which integrates Fourier transform methods with attention mechanisms. This module autonomously extracts frequency domain features from images and combines them with the U-Net network to create the FA-Unet network. This approach emphasizes the key frequency domain features of diseased regions through feature selection to enhance accuracy. The main contributions of this paper are as follows:
By combining Fourier transform with the attention mechanism, we propose the FA-attention method and integrate it with the U-Net network. Compared to the baseline without this module, the accuracy is improved by 3.18%, and the model starts to converge as early as the 20th epoch, showing faster convergence.
Considering the fusion of frequency-domain features and multi-scale convolutional features, we developed AFF4 by extending the Attentional Feature Fusion (AFF) method. This module effectively integrates frequency-domain and multi-scale features, serving as the core component of our proposed FA-UNet network. Based on this architecture, we achieve significant improvements in classification accuracy.
The generalization ability of the model was validated on a dataset of complex background images, and based on the results, the process of frequency domain feature extraction in our model and the applicability of the FA-attention mechanism in complex background images were analyzed.
2. Materials and Methods
2.1. Experimental Data Preparation
The Plant Village (PV) dataset [16] and the Plant Pathology (PP) dataset [17,18] were chosen for our study. The Plant Village dataset is a public dataset of plant disease images, covering 14 plant species and 26 disease classes, with a total of 54,000 images. It includes common crops and fruits, such as maize, tomato, and apple. The dataset was manually collected, classified, and annotated by botanists and horticulturists, and it is widely used for plant disease research and agricultural monitoring. Because the images in the PV dataset were taken in a laboratory setting, they are not sufficient for recognizing plant leaf diseases in real-world scenarios. Therefore, this study also selected the PP dataset to validate generalization capability. The PP dataset is a competition dataset from FGVC7 and FGVC8, comprising 18,632 images of apple leaf diseases with complex backgrounds. The dataset composition used in this study is shown in Table 1.
2.2. Model Establishment
We employed Unet as the backbone to validate the proposed FA-attention module. Combining the structural characteristics of the Unet network, we designed an improved model called FA-Unet. Additionally, we incorporated a data fusion mechanism to enhance the integration of frequency domain features with deep convolutional features.
As shown in Figure 1, the framework of the method adopted in this study mainly includes the following parts: data input and transformation (step 1), data augmentation (step 2), frequency information extraction (step 3), feature extraction based on FA-Unet (step 4), feature fusion (step 5), and disease classification (step 6).
In Step 1, the original images were processed using Fast Fourier Transform (FFT). These raw images are then fed into the data augmentation module, while the FFT results are utilized for frequency-domain feature extraction.
In Step 2, to enhance the model’s ability to recognize various factors, such as different lighting and shooting angles in real-world data, we performed data augmentation on the original images. Techniques such as brightness adjustment, rotation, scaling, and Gaussian filtering were employed to expand the training dataset, improve sample diversity, and enhance the model’s generalization capability.
In Step 3, frequency domain distribution information was simultaneously introduced as an additional input into the model. We performed FFT transformation on the images to extract frequency domain distribution information. This information undergoes convolution operations at different scales and is introduced into four FA-attention modules.
In Step 4, the images processed through data augmentation were input into the FA-Unet network. Building upon the Unet architecture, the FA-Unet incorporates the FA-attention module between the contracting and expansive paths by operating on the correspondingly cropped feature maps from the contracting path during concatenation. This integration enhances the model’s ability to process frequency domain attention information, effectively extracting critical frequency domain features from images and adaptively adjusting the representation capacity of image features.
In Step 5, the feature maps outputted by the Unet expansive path are fed into an AFF4 feature fusion module for feature abstraction and fusion.
In Step 6, the fused features are then passed through fully connected layers and an output layer, where the number of neurons in the output layer corresponds to the number of disease categories. Applying the softmax function to the output layer’s outputs computes the probability scores for each category, determining the disease type to which the image belongs.
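The classification step can be illustrated with a minimal sketch, assuming the fully connected layers produce one raw score per disease category. The function name `softmax` and the example scores are hypothetical and are not taken from the authors' implementation.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three disease categories.
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # index of the most probable category
```

The predicted disease type is simply the category with the largest probability, so the ranking of categories is the same before and after softmax.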
2.3. Data Augmentation
In real-world images, variations in lighting, shooting angles, and image clarity significantly influence recognition outcomes. Therefore, to enhance the model’s generalization capability, we performed data augmentation on the original images. The data augmentation methods employed in this design include random brightness adjustments, noise addition, rotation, and scaling.
2.4. Acquisition and Introduction of Frequency Domain Information
There are various frequency domain transformation methods for images, including Fourier transform, Laplace transform, and others. Among them, Fourier transform is a classical method widely used in image processing applications such as frequency domain filtering [19], frequency domain feature extraction [20], and edge detection [21]. In this study, Fourier transform was employed to convert images into the frequency domain. Subsequently, an attention mechanism was applied to automatically select frequency domain information that is more valuable for disease classification.
Fourier transform can be applied to both continuous and discrete signals. For images represented as discrete matrices, the transformation process can be described by the following formula:
$$F(u,v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, e^{-j 2\pi \left( \frac{ux}{M} + \frac{vy}{N} \right)}$$
In this formula, $f(x,y)$ represents the pixel value of the image at position $(x,y)$, $F(u,v)$ represents the transformed frequency domain representation, and M and N are the number of rows and columns of the image, respectively.
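As a worked illustration of the discrete transform, the sketch below evaluates the standard unnormalized 2D DFT, F(u,v) = Σ_x Σ_y f(x,y) e^{-j2π(ux/M + vy/N)}, directly from its definition on a tiny 2 × 2 image. The function name `dft2` and the sample image are hypothetical; a direct evaluation like this is far slower than an FFT and is shown only to make the formula concrete.

```python
import cmath

def dft2(image):
    """Direct 2D DFT of a list-of-lists image with M rows and N columns."""
    M, N = len(image), len(image[0])
    F = [[0j] * N for _ in range(M)]
    for u in range(M):
        for v in range(N):
            total = 0j
            for x in range(M):
                for y in range(N):
                    angle = -2 * cmath.pi * (u * x / M + v * y / N)
                    total += image[x][y] * cmath.exp(1j * angle)
            F[u][v] = total
    return F

# Hypothetical 2 x 2 image.
image = [[1, 2], [3, 4]]
spectrum = dft2(image)
# spectrum[0][0] is the sum of all pixels (the zero-frequency, or DC, term).
```

For this image the DC term equals 1 + 2 + 3 + 4 = 10, while the remaining coefficients measure differences between columns, rows, and diagonals.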
To achieve faster computation speed and more stable numerical performance, the DFT (Discrete Fourier Transform) is typically computed with the FFT (Fast Fourier Transform) algorithm. We denote a feature map matrix X of size $M \times N$ and perform FFT operations on it:
$$\mathcal{F}(X) = \mathrm{FFT}(X)$$
After transforming, we divide the frequency domain matrix into 4 × 4 patches, essentially partitioning it into different frequency ranges. We then compute amplitude-frequency information within each of these 16 regions to derive frequency domain distribution details. Considering the multi-scale up-sampling process in UNet, we convolve these distribution details at different stages of up-sampling to accommodate multi-scale requirements. Then, we feed the results after multi-scale convolution into FA-attention modules at different stages, as shown in Figure 2.
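The patch-wise amplitude computation can be sketched as follows with NumPy. Several details are assumptions made for illustration and are not stated in the text: the spectrum is centred with `fftshift`, the "amplitude-frequency information" of a patch is taken to be its mean amplitude, and the helper name `frequency_distribution` is hypothetical.

```python
import numpy as np

def frequency_distribution(feature_map, grid=4):
    """Mean FFT amplitude in each cell of a grid x grid partition of the spectrum."""
    # Centre the zero-frequency term so that cells map to frequency bands (assumption).
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    amplitude = np.abs(spectrum)
    h, w = amplitude.shape
    ph, pw = h // grid, w // grid  # patch height and width; any remainder is cropped
    patches = amplitude[:ph * grid, :pw * grid].reshape(grid, ph, grid, pw)
    return patches.mean(axis=(1, 3))  # shape (grid, grid): 16 values when grid=4

rng = np.random.default_rng(0)
x = rng.random((64, 64))           # stand-in for an image or feature map
dist = frequency_distribution(x)   # 4 x 4 summary of the spectrum
```

The resulting 4 × 4 array is a compact description of how energy is spread over frequency ranges, which is the kind of summary that could then be convolved and passed to later stages.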
2.5. FA-Attention
In the frequency domain, abrupt changes in pixel values in images typically manifest as high-frequency features, while regions with gradual pixel variations are characterized by low-frequency features. Therefore, common frequency domain filters such as low-pass, band-pass, and Butterworth filters are used to selectively process specific frequency regions in images. However, these methods rely on a limited number of image analyses or expert knowledge, which can be subjective and lack flexibility.
In this study, we propose an FA-attention module designed to autonomously select frequency domain data matrices, aiming to highlight key features that are more effective for classification outcomes.
As shown in Figure 3, the matrix $\mathcal{F}(X)$ obtained by applying FFT to the feature map X is multiplied by a set of self-trained weights W, thereby enhancing different frequency domain intervals to varying extents. This is represented by the following equation:
$$\tilde{F} = \mathcal{F}(X) \odot W$$
where ⊙ denotes the Hadamard product (element-wise multiplication) and $\tilde{F}$ denotes the enhanced frequency domain matrix. Each element of $\mathcal{F}(X)$ is multiplied by the corresponding element in W, achieving enhancement or suppression in the frequency domain.
Then, the enhanced matrix is subjected to inverse FFT, and a new feature map Z is obtained, as shown in the following formula:
$$Z = \mathcal{F}^{-1}\left(\mathcal{F}(X) \odot W\right)$$
Due to the enhanced frequency domain features in the output feature map, we expect that during the training process, it will increasingly resemble the frequency characteristics of fruit leaf disease areas.
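The three operations of the module (FFT, element-wise weighting, inverse FFT) can be sketched with NumPy as below. This is an illustrative sketch and not the authors' PaddlePaddle implementation: the weights `w` are fixed arrays here, whereas in the model they are learned during training, and keeping only the real part of the inverse transform is an assumption for real-valued feature maps. The function name `fa_attention` is hypothetical.

```python
import numpy as np

def fa_attention(x, w):
    """Reweight the spectrum of x element-wise by w and return to the spatial domain."""
    spectrum = np.fft.fft2(x)    # FFT of the feature map
    enhanced = spectrum * w      # Hadamard product with the weights
    z = np.fft.ifft2(enhanced)   # inverse FFT
    return z.real                # keep the real part (assumption)

rng = np.random.default_rng(0)
x = rng.random((8, 8))

# All-ones weights leave the feature map unchanged.
identity = fa_attention(x, np.ones((8, 8)))

# Keeping only the zero-frequency term yields a constant map equal to the mean of x.
dc_only = np.zeros((8, 8))
dc_only[0, 0] = 1.0
smoothed = fa_attention(x, dc_only)
```

The two fixed weight patterns show the extremes: weights of one pass every frequency through, while zeroing all but one entry acts as a hard frequency filter. Learned weights fall between these cases, enhancing some frequency intervals and suppressing others.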
2.6. Feature Fusion Based on FA-Attention
The operation of the FA-attention module on matrix frequency domain features can be designed either at the input layer of the model to extract frequency domain features from the original image or in the deep part of the model to enhance the frequency domain features of the deep features. In this design, we have conducted a frequency domain feature enhancement operation, specifically targeting the concatenation between the contracting path and the expansive path across different scales.
As shown in Figure 4, we have added a feature fusion mechanism to the expansive path of the U-Net. This mechanism performs feature fusion on the feature maps obtained from different scales in the expansive path to enhance frequency-domain features under different receptive fields.
We designed the AFF4 feature fusion module based on the AFF feature fusion method proposed in reference [22], as well as the up-sampling method in the U-Net model's expansive path.
Four different scales of feature maps are sequentially fused through the AFF4 module. In this mechanism, the feature maps from smaller scales are up-sampled to restore spatial resolution, addressing cross-scale feature propagation. These up-sampled features then pass through the AFF module for multi-scale fusion. The fused feature maps repeat this process and further integrate with higher-scale features.
By utilizing the AFF feature fusion mechanism, we optimize the original up-sampling method in the U-Net model, thereby enhancing the aggregation capability of the expansive path for features from different scales.
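The sequential up-sample-and-fuse pattern described above can be sketched as follows. The gate used here is a deliberately simplified stand-in for AFF: it blends two maps as `gate * x + (1 - gate) * y`, with the gate computed from a global average of `x + y`, and it omits the learned multi-scale channel attention of the original AFF method. Equal channel counts at every scale, nearest-neighbour up-sampling, and all function names are assumptions for illustration.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def gated_fusion(x, y):
    """Blend x and y with a per-channel gate in (0, 1); a simplified stand-in for AFF."""
    context = (x + y).mean(axis=(1, 2), keepdims=True)  # global channel context
    gate = 1.0 / (1.0 + np.exp(-context))               # sigmoid
    return gate * x + (1.0 - gate) * y

def fuse_four_scales(features):
    """Fuse four (C, H, W) maps ordered from lowest to highest resolution."""
    fused = features[0]
    for larger in features[1:]:
        # Restore spatial resolution, then fuse with the next scale.
        fused = gated_fusion(upsample2x(fused), larger)
    return fused

rng = np.random.default_rng(0)
scales = [rng.random((4, s, s)) for s in (4, 8, 16, 32)]
out = fuse_four_scales(scales)  # has the resolution of the largest input
```

Because each fusion is a convex combination of its two inputs, the output stays within the value range of the inputs, and information from the coarsest scale is carried through every later fusion step.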
3. Experiments and Results
The experiments were conducted using Python 3.7 and the PaddlePaddle 2.4.1 deep learning framework. They were executed in a Linux environment on a system equipped with a Tesla V100 GPU with 32 GB of video memory, a 4-core CPU, and 32 GB of RAM.
3.1. Data Augmentation
This paper conducted training and testing on the PV dataset and PP dataset separately, comparing the model’s performance on datasets with simple and complex backgrounds. We split the dataset into training, validation, and test sets at a ratio of 0.8:0.1:0.1.
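A 0.8:0.1:0.1 split can be sketched as below. The shuffling, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the text does not say how samples were assigned to each subset.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test

# Hypothetical dataset of 1000 sample indices.
train, val, test = split_dataset(range(1000))
```

Splitting before any augmentation keeps augmented copies of a training image out of the validation and test sets.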
In this study, we used Gaussian filtering, brightness adjustment, rotation, and scaling methods, as shown in Figure 5. Figure 5a shows an original image of an apple leaf disease from the PP dataset. Figure 5b–d display the same image after Gaussian filtering, brightness adjustment, rotation, and scaling.
We exclusively applied data augmentation to the training set. Each original training image was transformed through three operations: Gaussian blur, scaling and rotation, and brightness adjustment. Through data augmentation, the training set expanded to four times its original size.
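The fourfold expansion can be sketched with NumPy on grayscale arrays in the range [0, 1]. The specific parameters are assumptions chosen to keep the example short: a 3 × 3 binomial kernel approximates the Gaussian blur, the rotation is fixed at 90 degrees, scaling is a 2× nearest-neighbour zoom with a centre crop, and brightness is scaled by 1.3. All function names are hypothetical.

```python
import numpy as np

def gaussian_blur_3x3(img):
    """Blur with the 3x3 binomial kernel, a small approximation of a Gaussian."""
    k = np.array([1.0, 2.0, 1.0]) / 4.0
    kernel = np.outer(k, k)                  # weights sum to 1
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(kernel[i, j] * padded[i:i + h, j:j + w]
               for i in range(3) for j in range(3))

def rotate_and_scale(img):
    """Rotate by 90 degrees, zoom in 2x, and centre-crop back to the input size."""
    h, w = img.shape                         # assumes a square image
    zoomed = np.rot90(img).repeat(2, axis=0).repeat(2, axis=1)
    top, left = h // 2, w // 2
    return zoomed[top:top + h, left:left + w]

def adjust_brightness(img, factor=1.3):
    """Scale intensities and clip to the valid range."""
    return np.clip(img * factor, 0.0, 1.0)

def augment_training_set(images):
    """Return each original image followed by its three augmented variants."""
    augmented = []
    for img in images:
        augmented += [img, gaussian_blur_3x3(img),
                      rotate_and_scale(img), adjust_brightness(img)]
    return augmented

rng = np.random.default_rng(0)
train_images = [rng.random((16, 16)) for _ in range(5)]
augmented = augment_training_set(train_images)  # 5 originals become 20 images
```

Keeping the original alongside its three variants is what makes the training set four times its initial size.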
3.2. Training and Testing Parameter Settings
In our experiments, we set the batch size to 32, and we trained the model for 100 epochs. The CrossEntropyLoss function is used to assess the accuracy of predictions by measuring the discrepancy between the predicted probabilities and true labels.
We employed a combination of SGD (Stochastic Gradient Descent) and Adam optimizers to leverage their respective strengths. The training begins with the Adam optimizer using an initial learning rate of 0.001. During the training process, if the validation loss fails to decrease for 10 consecutive epochs, the optimizer switches to SGD with a fixed learning rate of 0.0001. A weight decay of 0.0005 is applied throughout the training to mitigate overfitting by penalizing large weights.
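The switching rule can be sketched independently of any deep learning framework. The class below only tracks which optimizer and learning rate should be active; it interprets "fails to decrease" as "does not improve on the best validation loss seen so far", which is an assumption. The class name `OptimizerSwitch` is hypothetical.

```python
class OptimizerSwitch:
    """Track validation loss and switch from Adam to SGD after a stall."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.stalled_epochs = 0
        self.optimizer = "adam"        # training starts with Adam
        self.learning_rate = 0.001

    def update(self, val_loss):
        """Record one epoch's validation loss and return the active optimizer."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stalled_epochs = 0
        else:
            self.stalled_epochs += 1
        if self.optimizer == "adam" and self.stalled_epochs >= self.patience:
            self.optimizer = "sgd"     # fixed learning rate from here on
            self.learning_rate = 0.0001
        return self.optimizer

# Hypothetical loss curve: five improving epochs followed by ten flat ones.
switch = OptimizerSwitch(patience=10)
losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.4] * 10
history = [switch.update(loss) for loss in losses]
```

In an actual training loop, the returned name would decide which optimizer object performs the parameter update for the next epoch.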
3.3. Training, Testing, and Generalization Ability Verification Experiments
We conducted training and testing using the PV dataset, and the results are presented in Figure 6. After 100 epochs of training, the proposed model successfully classified and identified various types of fruit leaf diseases. The model achieved an accuracy of 99.91% on the test set, with a loss of approximately 0.002. Notably, the model surpassed 99% accuracy around the 20th epoch, indicating strong convergence and stability throughout the training process.
As shown in Figure 7, we validated the generalization ability using the PP dataset, achieving a maximum accuracy of 89.59%.
As shown in Figure 8 and Figure 9, we used the t-Distributed Stochastic Neighbor Embedding (t-SNE) method to project the original data and the features extracted by our model into a two-dimensional space for both datasets. The t-SNE visualization analysis indicates that the chaotic distribution of the original signals improved significantly after feature extraction by our model, resulting in better classification performance. Similarly, on the PP dataset, the different categories of data were also well differentiated.
As shown in Figure 10, we plotted the testing results in a confusion matrix. Our model achieved 99.91% test accuracy on the simple-background PV dataset. It maintained a robust 89.59% accuracy on the significantly more challenging complex-background PP dataset. Notably, the model showed reduced effectiveness for categories with limited training samples, a limitation we plan to address in future improvements.
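The relationship between a confusion matrix, overall accuracy, and per-class performance can be made concrete with a small sketch. The labels below are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# Invented labels for three classes; classes 1 and 2 have fewer samples.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 0, 2, 1, 1]
cm = confusion_matrix(y_true, y_pred, 3)
overall = accuracy(cm)
recalls = per_class_recall(cm)
```

Per-class recall makes weak categories visible even when overall accuracy looks acceptable, which is how reduced effectiveness on small-sample categories would show up in such a matrix.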
3.4. Ablation Experiment
To validate the effectiveness of each component of our model, we designed ablation experiments for various modules of the model. The ablation experiments include Unet lacking data augmentation, Unet with data augmentation, Unet with data augmentation and FA-attention, and Unet with data augmentation, FA-attention, and feature fusion. The experimental results are shown in Table 2.
Based on the ablation experiment results, we found that data augmentation contributes significantly to accuracy improvement, especially for datasets with complex backgrounds, enhancing accuracy by 0.49% and 9.18% on the PV and PP datasets, respectively. FA-attention has a substantial effect on increasing model accuracy, achieving 2.18% and 8.19% improvements on the PV and PP datasets, respectively. After incorporating feature fusion, there was a notable performance boost on the PP dataset, reaching 1.56%, while the PV dataset only saw a 0.51% improvement.
3.5. Comparative Experiment and Analysis
We conducted comparative experiments using both classic models and state-of-the-art (SOTA) models. The classic models selected for fruit leaf disease classification include fine-tuned Darknet53 [7], a VGG16-based model [23], EfficientNet [24], and CNN [25]. The SOTA models include a YOLOv5-CA backbone [26], PDDNet-LVE [27], DINOV2 [28], and HSSNet [29].
We reproduced Darknet53, VGG16, EfficientNet, YOLOv5-CA, PDDNet-LVE, and HSSNet, conducting training and testing experiments on both the PV and PP datasets. We cite the published results of the CNN and DINOV2 models, which were only evaluated on the PV dataset.
The experimental results are shown below.
As shown in Table 3, for the PV dataset, where the images were taken in a controlled experimental environment with a uniform background, all models achieved high accuracy, with our proposed model also attaining 99.91%. On the PP dataset, which has complex backgrounds, our proposed model achieved an accuracy of 89.59%, outperforming both the classic and SOTA models. It should also be noted that on the simple-background PV dataset, HSSNet achieved a 0.02% higher accuracy than our model. However, on the complex-background PP dataset, it underperformed against our model by 1.01%. Our model exhibits superior comprehensive performance when handling complex backgrounds.
This demonstrates that our proposed model has a notable advantage in generalization capability. Based on the results of training, testing, generalization validation, and comparative experiments, our model not only achieved high accuracy on the PV dataset but also showed better generalization performance on the PP dataset. Moreover, during training, our model exhibited faster convergence. This highlights the importance of frequency domain features for expressing the key characteristics of diseases and their invariance under complex backgrounds.
4. Analysis and Discussion
4.1. Analysis of Frequency Domain Feature Invariance for Leaf Disease
We selected fruit leaf images under different rotation angles, lighting conditions, translations, and scales, and we used the proposed model to predict them. We then extracted and visualized the feature maps after FFT and the feature maps output by FA-attention.
Figure 11a shows an image from the PV dataset, which has been subjected to Gaussian noise addition, brightness adjustment, rotation, and scaling, respectively.
Figure 11b shows the results of FFT visualization obtained by performing FFT transformation on the feature map matrix. It can be observed that when the target fruit leaf undergoes translation or rotation, there is only a minor change in its frequency domain features.
Figure 11c displays the attention heatmap generated by overlaying the FA-attention output feature map matrix with the original image. The figure demonstrates that as the original image undergoes rotation, scaling, or brightness adjustment, the focus of attention does not change significantly and remains concentrated on the diseased areas of the leaf.
In fact, when an image is rotated, only the coordinates undergo a certain rotation, and this change has almost negligible impact on the frequency domain features. This enables our model to better cope with variations in factors such as rotation, scaling, and lighting in complex background images.
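The behaviour of the Fourier magnitude under simple transformations can be checked numerically. The sketch below uses idealized cases that are assumptions for illustration: circular translation, an exact 90-degree rotation of a square image, and uniform brightness scaling. Real photographs involve arbitrary angles, cropping, and interpolation, so the invariance there is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # stand-in for a grayscale leaf image
magnitude = np.abs(np.fft.fft2(image))

# Circular translation changes only the phase, so the magnitude is unchanged.
shifted = np.roll(image, shift=(5, -3), axis=(0, 1))
shifted_magnitude = np.abs(np.fft.fft2(shifted))

# A 90-degree rotation rearranges the magnitude values without altering them.
rotated = np.rot90(image)
rotated_magnitude = np.abs(np.fft.fft2(rotated))

# Uniform brightness scaling multiplies every magnitude by the same factor.
brighter_magnitude = np.abs(np.fft.fft2(1.5 * image))
```

In these idealized cases the set of magnitude values is preserved exactly or up to a constant factor, which is consistent with the observation that frequency domain features change little under such transformations.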
Therefore, we chose frequency domain transformation combined with the attention mechanism to automatically select key frequency domain features. This forms the core module of our proposed model, which can counteract factors such as rotation, translation, and the scaling of images in real-world scenarios, thus performing well even in complex backgrounds.
4.2. Attention Analysis of Disease Regions in Complex Backgrounds
We selected images with complex backgrounds, extracted deep feature maps from the model, and visualized them to analyze the attention on fruit leaf disease areas. We used the classic self-attention module from the transformer model and compared it with FA-attention, as shown in Figure 12.
Images with complex backgrounds typically have higher resolutions, and the fruit disease areas exhibit more fine-grained effects within the image. Moreover, complex background images are more susceptible to interference from background lighting, edges, and other leaves in the background, which makes the classification of such images more challenging.
In our comparison experiment, we selected four images with complex backgrounds. These images presented challenging conditions, such as lighting and shadow interference, cluttered background leaves, leaf occlusion and curling, and background light spot interference. The comparison results show that the proposed model demonstrates stronger attention toward the disease areas when dealing with complex backgrounds.
4.3. Discussion
Frequency domain features, as a key characteristic, play an important role in fields such as digital signal processing [30] and image recognition [31]. Fruit leaf disease areas often exhibit significant frequency domain specificity [32] and a certain invariance under complex scene variations (see Figure 11). If an attention mechanism can be designed to select specific frequency domain features corresponding to the disease area from the frequency domain information, it would have a positive impact on the recognition of fruit leaf diseases.
This study combines frequency domain transformation and attention mechanisms to design the FA-attention mechanism. This mechanism enhances frequency domain attention and guides the model to focus on the typical frequency domain features of fruit leaf disease areas, thereby giving more attention to the fruit disease regions (see Figure 11 and Figure 12).
To further enhance the guiding effect of this mechanism on deep learning models, we selected U-Net as the backbone network and embedded the FA-attention mechanism into the skip connection section. In the feature concatenation of the corresponding layers in the contracting and expansive paths, the extraction and fusion of frequency domain features are enhanced. Meanwhile, to address the issue of feature resolution fusion in each layer of the expansive path, we designed a feature fusion module based on AFF, fully utilizing the structural characteristics of the multi-scale features of U-Net.
We conducted training and testing experiments (see Figure 7 and Figure 10), ablation experiments (see Table 2), and comparison experiments (see Table 3) on the simple-background PV dataset and the complex-background PP dataset, further validating the effectiveness of this mechanism. At the same time, we also identified the limitations of this method when dealing with small sample categories, which will become the focus of our next research steps.
Through a series of experiments in this study, we found that frequency domain features are more effective than other features when dealing with factors such as lighting changes, angle variations, and interference from a complex background.
5. Conclusions
This study focuses on the problem of leaf disease recognition against complex backgrounds. Considering that the frequency-domain features exhibit greater invariance compared to spatial-domain features in complex scenarios, we propose an FA-attention mechanism to improve recognition accuracy. We employ UNet as the backbone network, integrate AFF4 for feature fusion, and incorporate the FA-attention mechanism to design the FA-UNet architecture.
Experimental validation utilized both the simple-background Plant Village (PV) dataset and complex-background Plant Pathology (PP) dataset, achieving 99.91% and 89.59% accuracy, respectively. To enable deep analysis, we conducted ablation experiments, comparative experiments, and frequency-domain attention visualization analysis. The experimental results demonstrate that while our model achieves high accuracy on simple-background datasets, it exhibits even stronger performance in complex-background scenarios.
Overall, this study combines Fourier transform with deep learning methods, utilizing the attention mechanism to focus on key frequency domain features, along with multi-scale feature fusion, which enables the model to perform better on complex background datasets. In future work, we will collect more complex scene data and fully consider methods to address small sample categories, further optimizing the model.