1. Introduction
The main organs of plants that carry out photosynthesis, respiration, and transpiration are their leaves, and a plant’s general health is directly reflected in their physiological and metabolic state. Diseases that infect leaves can spread and cause large financial losses. Using the Padwick technique, A Y Bandara et al. calculated the economic losses caused by 23 major soybean diseases in 28 major soybean-producing US states between 1996 and 2016, estimating the overall damage at around 95.48 billion USD [
1]. Tea anthracnose leads to a yield loss of 30% to 50%, according to research by Shi N et al. [
2]. Tea anthracnose is caused by a complex of fungal infections. Plant leaf diseases can result from a number of causes, such as abnormal light exposure, inappropriate temperatures, poor soil quality, and bacterial infections. Leaf area can also be measured quickly and precisely using automatic digital image analysis [
3]. Manual inspection requires thorough, leaf-by-leaf on-site examination, and complex outdoor farmland can be difficult for people to reach. Human inspection is also time-consuming and prone to error, particularly for subtle, early-stage leaf diseases. As a result, cameras have been used to automatically photograph susceptible leaves. Nonetheless, the automatic identification of plant leaf diseases from such photos remains an open problem, and solving it is essential for maintaining steady agricultural output and advancing agricultural growth.
Sahu AK et al. [
4] used Gaussian filtering on grayscale images for feature extraction and disease classification in medicinal plants; however, the data available to the feature classifier was quite limited. Using fine-grained image characteristics based on Local Binary Patterns (LBP), Rachmad et al. [
5] investigated the classification of maize leaf diseases. Nevertheless, this approach lacks versatility and necessitates manual parameter tweaking. These semi-automated detection techniques require significant manual parameter adjustment and classifier screening, despite the fact that they save time and lower financial losses [
6]. Moreover, they are frequently impractical for identifying minute targets such as minor texture-, color-, or dust-related disease markers. Shrestha et al. [
7] collected over 3000 damaged leaves from 15 different plant species in order to investigate the relationship between various degrees of Convolutional Neural Networks (CNNs). The method has significant feature extraction times despite achieving 88.8% accuracy. Yin et al. [
8] responded by proposing an SSD detector for jujube tree disease detection that uses texture features acquired by transfer learning in the backbone network to cut the detection time down to 0.14 s. Nevertheless, this approach increased the model depth by adding a lot more preset anchor boxes to the feature maps. To address this, Wu et al. developed a spatial collaborative attention detection model based on DETR (DS-DETR), which utilizes pre-trained Transformer structures to extract disease features [
9]. With an average accuracy gain of 9.52% over the baseline DETR model, the model evaluates disease severity by comparing the ratio of plant leaf area to lesion area.
In the realm of deep learning-based detection, semantic segmentation, image classification, and object detection have all played significant roles. Semantic segmentation provides precise boundary information for leaf lesions, making it particularly suitable for irregularly shaped disease spots. Ref. [
10] proposed a lightweight unsupervised learning framework that automatically generates apple disease cues by utilizing an improved high-frequency attention mechanism and contrastive learning. Another study constructed a pixel-level annotated leaf disease dataset, employing both supervised and weakly supervised learning for the semantic segmentation of lesions [
11], using a fusion segmentation network that combines CNN and Vision Transformer (ViT) architectures. Ref. [
12] used patch segmentation, histogram-based lesion localization, and ROF filtering noise reduction to accurately identify plant leaf diseases. A DbneAlexnet classifier was used for classification after segmenting lesions using a U-Net network optimized with a Gradient-Golden Search Optimization technique. Another method has confirmed the effectiveness of the integrated framework for disease segmentation and classification of tomato plant leaves [
13]. Semantic segmentation has advantages, but it also has drawbacks, including high annotation costs and difficulty with edge device deployment. Another method for identifying plant leaf diseases is image classification. By combining logistic regression with hue moment data and a ResNet-50 network to create bounding boxes for areas of interest (ROIs), research on infected wheat plants obtained 99.8% classification accuracy [
14,
15]. A multivariate GrabCut algorithm was used to handle image occlusion and accomplish accurate segmentation, and the algorithm created an enhanced INC-VGGN network linked with a Kohonen learning layer. However, it is challenging to use image classification directly in precision spraying settings since it cannot offer location information about the disease. On the other hand, object detection methods provide higher computing efficiency and more accurate disease localization. With the addition of unsupervised pre-training, spatially modulated co-attention, and relative position encoding, DETR, a Transformer-based detection technique, performed well in tomato leaf disease segmentation tasks [
16]. An effective method for accurately diagnosing rice diseases was developed using a Faster R-CNN model [
17]. The MEAN-SSD model, which reconstructs Inception modules, has shown robust detection performance on apples’ simple and complex leaf diseases [
18]. YOLO-JD improved the feature extraction module through spatial pyramid pooling, achieving significant progress in multi-scale feature extraction and fusion [
19]. He et al. proposed YOLOv11-RCDWD, which utilizes the RepLKNet module as its backbone and optimizes the attention mechanism, achieving an 85.4% recall rate [
20]. YOLOv11 has demonstrated improvements in detection speed and accuracy, leading to enhanced feature extraction capability, inference speed, and task versatility [
21]. For the C3K2 module, the input feature map is divided into two parts, which are subsequently concatenated to fuse features [
22], enabled by the use of depthwise separable convolutions (DWC) [
23] and model pruning for lightweight design [
24]. For the Neck, optimization of the C2PSA module allows it to excel in detection performance within complex backgrounds or small target scenarios [
25].
The aforementioned approaches are usually evaluated on diseased leaves captured under specific conditions, such as clean backgrounds and indoor settings, and they perform poorly in real-world settings. This research presents the ATD-Net network to enhance the detection performance for plant diseases across various scenarios, conditions, and categories. Our approach balances detection accuracy and model complexity by combining downsampling convolution, Gabor filters, and a dynamic boundary loss that adjusts regression boxes. The primary contributions are as follows: (1) We introduce a downsampling convolution (ADown) to feed the features of interest to the Gabor module and project them onto the Transformer; (2) We propose a dynamic boundary loss function that dynamically adjusts disease prediction boundaries during the backpropagation of multi-layer features, further improving detection accuracy; (3) We conduct experiments on multiple datasets and design ATD-Net under edge-device compatibility constraints (model size 5.2 MB, 6.5 GFLOPs, and architecture aligned with NVIDIA Jetson Xavier NX requirements), providing a lightweight solution suitable for future deployment in agricultural applications.
2. Data Acquisition
The data set is crucial for testing our methodology. In actual agricultural production, plant disease monitoring requires continuity and timeliness, which requires sufficient data and high-quality samples. However, obtaining high-quality disease sample data is quite difficult. According to agricultural standards, early diseases require more frequent monitoring (such as daily inspections). During high-risk disease seasons (such as warm and humid periods), relevant personnel should visit the fields every three to five days, take photos, and record disease data. Although automated systems installed on drones or fixed stations can perform high-frequency and large-scale monitoring, manual inspection cannot meet this frequency requirement. Therefore, our methods of obtaining data include online search and self-capture, which provide necessary assistance for model detection.
2.1. Self-Built Dataset
The self-built dataset was sourced from an outdoor wild tea field in Shanghai, China, and from online resources. We took the pictures with a standard mobile phone; all are in JPG format, have a resolution of 640 × 640, and are around 40 KB in size. We gathered 8302 photos of plant diseases, comprising 14 disease categories. Each type of disease is represented by a number of photos in the dataset, which was annotated after data cleaning. The species in the dataset are roughly uniformly distributed.
In real-world application settings, the device’s input image resolution should be at least 640 × 640 pixels, and a camera with at least 5 megapixels should be used, taking care to avoid overexposure or excessive shadows. The model handles 1080p video in real time at 30 frames per second on the NVIDIA Jetson Xavier NX. To enable real-time on-site inspection on mobile devices, an inference time below 50 ms per image is advised. Once the model has been trained, it can be used on drones or stationary monitoring stations to process the gathered photos. To maintain inference speed, GPU acceleration is advised. For the experiments, the datasets were divided into training, testing, and validation sets in proportions of 8:1:1. These datasets provide a sound foundation for evaluating whether the model can detect multiple plant diseases in real-world field scenarios, which strengthens its applicability.
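The 8:1:1 partition can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the file names, the fixed seed, and the choice to give the rounding remainder to the validation set are all assumptions.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train/test/validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle for reproducibility
    n_train = int(len(items) * ratios[0])
    n_test = int(len(items) * ratios[1])
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    val = items[n_train + n_test:]  # remainder goes to validation
    return train, test, val


# Hypothetical file names standing in for the 8302 self-built images.
image_ids = [f"img_{i:05d}.jpg" for i in range(8302)]
train, test, val = split_dataset(image_ids)
print(len(train), len(test), len(val))
```

With 8302 images this yields 6641 training, 830 testing, and 831 validation samples.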
With the assistance of Professor Guo Jianwei’s team at Yunnan Forestry University, we manually annotated the disease areas using the LabelImg(1.8.6) tool, confirmed the accuracy and completeness of disease classification, and annotated the bounding box in order to further enhance the quality and reliability of the dataset [
26]. Common plant leaf disease data examples from our self-built dataset are shown in
Figure 1, which includes several disease spots, complicated scenarios, and lighting settings. The screen display of two distributions from our own dataset is shown in
Figure 2.
2.2. PlantVillage Dataset
PlantVillage offers a wealth of standardized leaf data covering fourteen different economic crop varieties. There are 26 distinct illnesses among the 54,306 photos of plant diseases. Experts have annotated the data, which comes from the official Kaggle page. Although PlantVillage has a big scale, excellent quality, and a fair distribution of categories, it is limited by a monotonous background [
27].
2.3. PlantDoc Dataset
In contrast to PlantVillage, PlantDoc emphasizes the intricacy of real-world leaves with background noise from weeds and dirt. It includes 13 species in 18 categories, 17 of which are diseased and one of which is healthy [
28]. However, its total of 2598 images is not enough to satisfy the quantity and distribution requirements of the model.
To make it easier to describe the illness distribution across three datasets, we have gathered statistics and made a line graph (
Figure 2).
3. System Architecture
The three layers of ATD-Net—the convolutional layer, the attention mechanism layer, and the loss function layer—are proposed in this study. Enhancing the textural characteristics of plant leaf diseases is the goal.
This paper uses YOLOv11n (the nano variant) as our baseline model, selected for its lightweight design (model size 5.2 MB, 2.58 M parameters) and suitability for deployment on resource-constrained devices. The YOLO series are used to extract picture characteristics in the convolutional layer. In order to preserve YOLO’s lightweight characteristics in the fine-grained situation of plant leaf disease detection, this study reconstructs its convolution structure into an ADown convolution. The ADown convolution is a downsampling module that makes use of pooling and multiple convolutions. It is essential for adding features to the model that take the place of conventional regular convolution.
The attention mechanism layer weighs the discriminative textural characteristics of sick leaves, such as their round, irregular, water-stained, and fuzzy edges, as well as their roughness, dryness, concavity, and convexity in the disease region. After collecting the down sampled information, ATD-Net additionally takes into account the textural feature that plant leaf diseases possess in order to demonstrate the goal of multi-feature network design. To acquire local positional information, the baseline YOLOv11 uses the Transformer attention mechanism. However, it struggles to capture long-distance characteristics, multi-scale flexibility, and adequate spatial relationship modeling, particularly for fine-grained plant leaf texture disease features.
Texture characteristics taken from the two layers above are connected to the detection head in the loss function layer. The loss parameters for bounding box regression are dynamically modified to produce more precise bounding box localization. Although the baseline YOLOv11 optimizes the detection method of region proposal reclassification and uses the Intersection over Union (IoU) loss, it still has trouble dynamically adjusting bounding boxes with fixed weights, which leads to comparatively low bounding box localization accuracy. To solve this issue, a single-stage detector prediction branch is used to estimate the precise location of crop diseases and construct the focal loss through three stages: quality estimation, classification, and localization. This approach is motivated by Generalized Focal Loss (GFL). Experiments have shown that this method helps improve plant disease detection performance.
We optimized and rebuilt the original YOLOv11 model using the previously specified design philosophy. We allowed the modified convolutional layers to collect features, enhance textural features using Gabor, and then optimize the loss function while maintaining the original model’s other structures.
Figure 3 illustrates how this model optimization approach offers theoretical support for later experimental verification.
4. Texture Feature Enhancement
After optimizing and reconstructing the model, the next step is to execute feature extraction, feature fusion, and feature enhancement.
4.1. Feature Extraction
In this work, important textural characteristics of leaf diseases are extracted and preserved using adaptive downsampling (ADown) [
29]. The input picture is converted to tensor form, the input channel is set to 16, and the output is still in tensor form with the same number of output channels. This is mostly due to ADown’s dual branch parallel structure, which may alleviate the issue of texture feature loss in conventional sampling procedures and more precisely capture texture features that change to illness characteristics. ADown is displayed in
Figure 4.
ADown first applies average pooling with a 2 × 2 kernel, which performs minor smoothing and denoising in order to suppress local peak responses and creates a steady input for the two sampling branches. The same spatial data is then divided into two functionally different routes, which extract convolutional semantic features and pooled saliency features, respectively. This design helps control computational complexity. In the convolutional route of the left branch, a convolution with a stride of two performs downsampling and semantic extraction. In the pooling route of the right branch, salient areas such as borders, textures, and small spots are highlighted. Finally, SoftMax smoothing is followed by feature fusion and output.
Suppose the input is a four-dimensional tensor, denoted as $X \in \mathbb{R}^{B \times C \times H \times W}$, where B represents the batch size, C denotes the number of channels, H stands for the image height, and W signifies the image width. ⊕ denotes the summation of feature channels.
The parameter count consists of two parts: weights and bias terms. To compare ADown with standard convolution, we count the parameters of each separately.
For the standard convolution,
For the ADown convolution, the left branch and the right branch are calculated separately.
Combining the two branches gives the total parameter count.
FLOPs are another important indicator of computational cost, and comparing them against standard convolution shows how much computation is saved. The standard convolution uses a 3 × 3 convolution kernel with 64 input channels and 128 output channels. The input and output feature map sizes are 160. ADown convolution uses a 3 × 3 convolution kernel, with an input channel of 32 and an output channel of 32. The size of the input and output feature maps is 160.
Substituting these values into the equation for the standard convolution gives 3.775 GFLOPs.
Substituting these values into the equation for the ADown convolution gives 0.524 GFLOPs. Compared with the standard convolution, the GFLOPs reduction rate is 86.12%.
It is evident that ADown reduces the computational parameters of the model and greatly facilitates the extraction of important feature information since the number of ADown parameters produced via calculation is fewer than that of ordinary convolution parameters.
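The reported figures can be checked with a short calculation. The sketch below assumes that one multiply-accumulate counts as two floating-point operations, that parameters are weights plus one bias per output channel, and that the ADown cost is the sum of one 3 × 3 convolution and one 1 × 1 convolution, each with 32 input and 32 output channels on a 160 × 160 map. That decomposition is an assumption chosen because it reproduces the stated 0.524 GFLOPs; the function names are illustrative.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus one bias term per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def conv_flops(kernel, c_in, c_out, out_h, out_w):
    """FLOPs of one convolution, counting a multiply-accumulate as 2 ops."""
    return 2 * kernel * kernel * c_in * c_out * out_h * out_w


# Standard 3x3 convolution: 64 -> 128 channels on a 160x160 map.
std_flops = conv_flops(3, 64, 128, 160, 160)
std_params = conv_params(3, 64, 128)

# Assumed ADown decomposition: a 3x3 branch plus a 1x1 branch, 32 -> 32 each.
adown_flops = conv_flops(3, 32, 32, 160, 160) + conv_flops(1, 32, 32, 160, 160)
adown_params = conv_params(3, 32, 32) + conv_params(1, 32, 32)

reduction = 1 - adown_flops / std_flops

print(f"standard conv: {std_flops / 1e9:.3f} GFLOPs, {std_params} params")
print(f"ADown (assumed): {adown_flops / 1e9:.3f} GFLOPs, {adown_params} params")
print(f"FLOPs reduction: {reduction:.2%}")
```

Under these assumptions the script gives 3.775 and 0.524 GFLOPs. The unrounded reduction is 86.11%; the 86.12% quoted in the text follows when the two rounded GFLOPs values are used.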
4.2. Texture Fusion
The original YOLOv11 does not work well here because of an uneven distribution of attention weights, a loss of fine-grained texture features during model training, and the submergence of key features. Therefore, this paper employs a Gabor filtering module to enhance the texture features of leaves, which strengthens the biomimetic visual characteristics extracted from diseased leaves and better simulates the human visual system’s perception of texture [
30]. In addition, Gabor filtering has multi-scale and multi-directional analysis capabilities and can generate various combinations of filtering kernels, complementing what deep learning features lack. This paper rebuilds the Gabor algorithm and incorporates it as a separate module before the Transformer. In this way, long-distance features that YOLOv11 fails to capture can be captured, and large receptive fields can be provided [
31].
Figure 5 shows the detailed processing structures and steps.
The difference between the two branches is that branch two extracts more texture features, as it utilizes the feature extracted from the improved backbone layer for smoothing processing before passing it to the Transformer. It enhances the advantage of multi-directional texture analysis and provides dependencies for the Transformer to extract long-range features later [
32]. In order to achieve multi-directional texture, the processed features were segmented by SPPF, with some features directly passing to the Transformer for feature capture, while the other features undergo fine-grained texture feature extraction through Gabor filtering.
Suppose the sampling point is , and if then convert it to a grayscale value , the pixel coordinates is , and the core center is in the origin. Then key steps for implementing the Gabor filtering module are as follows.
Firstly, calculate the feature information in the direction of
on the
and
coordinate axes; finally, complete the coordinate transformation.
After obtaining the directional sampling points and rotational coordinates, it is necessary to apply the parameters to the spatial domain Gabor kernel. By performing a two-dimensional Fourier transform on
and centering it, the Gabor kernel can be used as convolution weights. The Gabor parameters shown in
Figure 5 were determined through grid search on the mAP50 metric on the validation set of a self-built dataset. A combination search was conducted within a reasonable range for scale σ, wavelength λ, and kernel size, and the parameter set that optimized the detection performance on the validation set was ultimately selected.
In the Gabor kernel formula, σ is the filtering scale, λ is the wavelength, and the third parameter is the phase.
Then, come to the processing of spatial grayscale transformation, multi-directional convolution, normalization, and activation function.
obtained the grayscale image of the disease image. Then,
aims to calculate the product of the grayscale image and the direction vector based on
. Finally, T’ performs batch normalization and smoothing processing using an activation function.
where * denotes convolution.
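The coordinate rotation and kernel construction described above can be sketched in plain Python. This uses the standard textbook real-valued Gabor kernel rather than the authors' exact formulation; the parameter names (`sigma`, `lam`, `theta`, `psi`, `gamma`), their values, and the four-orientation bank are illustrative assumptions, not the grid-searched settings.

```python
import math


def gabor_kernel(size, sigma, lam, theta, psi=0.0, gamma=1.0):
    """Real part of a 2-D Gabor kernel centred on the origin (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate pixel coordinates into the filter orientation theta.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope (scale sigma) times cosine carrier (wavelength lam).
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_r / lam + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


# A small bank over four orientations: 0, 45, 90, and 135 degrees.
bank = [gabor_kernel(size=5, sigma=2.0, lam=4.0, theta=k * math.pi / 4)
        for k in range(4)]
print(len(bank), len(bank[0]), len(bank[0][0]))
```

Each kernel in the bank could then serve as a fixed convolution weight, one per orientation, which is the multi-directional convolution step the text refers to.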
Secondly, the original features are fused with Gabor texture features through
convolution, and linear mapping is performed based on a multi-head Transformer:
In the aforementioned formula, $Q$, $K$, and $V$ represent the query, key, and value in the Transformer, respectively. The dimension of each attention head is , and the dimension of the key or query is , where denotes the scaling factor of the attention head.
Finally, combine the information from each attention head and output the projection results through value aggregation and positional encoding.
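For reference, the standard scaled dot-product and multi-head attention that this description appears to follow can be written as below. This is the textbook form, not a transcription of the authors' equations; the symbols $d_k$, $h$, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are assumed notation.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right), \qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
```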
Figure 6 presents the above mathematical expressions as an algorithm, making the Gabor module easier to understand.
A technique for improving texture features is used, which is based on the theoretical analysis mentioned above. In order to show the textural properties of plant disease regions, this method uses principal component analysis (PCA) and statistical analysis techniques to generate feature maps from the intermediate layers of a deep neural network.
Figure 7 shows the experimental results, which indicate that this technique can successfully capture the textural characteristics of diseased regions, offering a crucial foundation for the identification of plant diseases.
4.3. Bounding Box Regression Optimization
Inspired by prior work, we design a dynamic boundary loss module with an emphasis on dynamic bounding box adjustment [
33]. It adjusts to the detection requirements of fine-grained multi-scale plant disease characteristics based on the fused features. In order to complete the repositioning of coordinates and bounding boxes, the model must be able to dynamically modify the discrete probability distribution. The model must recalculate the regression values using temperature coefficients and weighted convolution after adjusting the probability distribution for each bounding box coordinate based on the input coordinates, including entropy calculation, adaptive EMA momentum, and the normalization of entropy.
From the perspective of theoretical design objectives, the obtained feature tensor x is split into dimensional features that conform to YOLO object detection, followed by the calculation of the basic probability distribution. To embody the ‘dynamic’ idea, we introduce a temperature coefficient during the probability distribution calculation to control the smoothness of the probability distribution,
is the probability value of the ith bin after temperature adjustment.
where
is the logit of the ith bin,
is the temperature parameter, and
. Based on the probability, it is convenient to calculate the information entropy, estimate the difference between the predicted probability and the true probability, and complete the next step of entropy normalization and power transformation:
where
controls the degree of nonlinearity influenced by entropy, and
is randomly reselected from the set
in each batch and is only used for inference. If
, then it amplifies the influence of entropy, resulting in a smaller adjustment coefficient for high-entropy samples. If
, then it diminishes the influence of entropy, leading to a relatively larger adjustment coefficient for high-entropy samples.
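The temperature-scaled distribution and the entropy-based coefficient can be illustrated with a small sketch. The softmax and Shannon entropy are standard; the specific coefficient form `(1 - normalized_entropy) ** gamma` is an assumption chosen only because it matches the behaviour described above (an exponent above 1 shrinks the coefficient for high-entropy samples, an exponent below 1 enlarges it). All names and values are hypothetical.

```python
import math


def temperature_softmax(logits, temperature):
    """Softmax over bin logits; a larger temperature gives a smoother distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(number of bins), so it lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def adjustment_coefficient(probs, gamma):
    """Assumed form: confident (low-entropy) predictions get a value near 1."""
    base = max(0.0, 1.0 - normalized_entropy(probs))  # clamp against rounding
    return base ** gamma


logits = [0.2, 2.5, 1.0, -0.5]
sharp = temperature_softmax(logits, temperature=0.5)
smooth = temperature_softmax(logits, temperature=2.0)

print(normalized_entropy(sharp), normalized_entropy(smooth))
print(adjustment_coefficient(smooth, 2.0), adjustment_coefficient(smooth, 0.5))
```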
Thus, the dynamic coefficient can be adjusted by combining the number of training rounds (t) and the average entropy of the current batch (b)
, and complete the calculation of the adaptive EMA mechanism.
Finally, adjust the probabilities again to complete the calculation of weights and regression values, and ultimately determine the coordinate output and repositioning.
The result obtained is as follows, shown in
Figure 8.
5. Disease Detection
The model training phase follows the framework creation, the investigation of the core module mechanism, and the verification of mathematical derivations. Based on the dataset we gathered, 100 iterations of iterative model training are carried out using sensible training techniques and hyperparameter tuning. In addition to predicting the test set data, the trained model is also utilized to measure important metrics including model accuracy and generalization capacity.
5.1. Experiment Environment
The ATD-Net model was trained and evaluated on the multi-class datasets under Linux Ubuntu 22.04 LTS, with a 7-core Intel(R) Xeon(R) CPU and an NVIDIA GeForce RTX 3090 (24 GB) GPU. The software environment comprised PyTorch 2.2.1, CUDA 12.1, the PyCharm IDE (2024.3.6), and Anaconda 23.5.2.
The optimizer is SGD, and the input resolution is 640 × 640. The weight decay coefficient is 0.0005, the momentum factor is 0.937, the batch size is fixed at 16, and the initial learning rate is 0.01. Repeated testing showed that ATD-Net’s prediction results on the dataset converged after about 78 rounds.
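These settings can be collected into a single configuration, sketched here as a Python dictionary. The key names follow the naming commonly used in Ultralytics-style training scripts and are an assumption, since the text does not state which training entry point was used. The value for `epochs` comes from the 100 training iterations mentioned at the start of this section.

```python
# Hypothetical configuration dictionary: key names are assumed,
# values are taken from the experiment description.
train_config = {
    "epochs": 100,           # rounds of iterative training
    "imgsz": 640,            # input resolution (640 x 640)
    "batch": 16,             # fixed batch size
    "optimizer": "SGD",
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
}

for name, value in train_config.items():
    print(f"{name}: {value}")
```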
5.2. Detection Results
The experimental data come from eight different models and three types of samples, evaluated with the trained weights (best.pt); the sample types are complex backgrounds, clean backgrounds, and small lesions. In order to systematically evaluate the true detection performance of the ATD-Net model and visually demonstrate the performance differences between the other models and the proposed ATD-Net, the YOLO prediction results are calculated based on confidence scores.
where α is the task alignment coefficient, used to balance the contribution weights of classification and localization, so that the confidence level can reflect the comprehensive quality of the prediction box better.
Figure 9 compares the detection results of various models, beginning with the left side of the first line, including Faster R-CNN, YOLOv5, YOLOv8, YOLOv10, YOLOv11, YOLOv12, DETR, and ATD-Net.
Figure 9a–c shows the detection performance of different models in complex backgrounds, clean backgrounds, and early disease detection. These bounding boxes are actually useful for forecasting the course of diseases. Every box identifies a certain illness. The prediction box with the highest degree of confidence is chosen when more than one box appears on a single leaf.
It is evident that Faster R-CNN has the lowest confidence scores and produces both false positives and false negatives; among its multiple detected bounding boxes, the maximum confidence score is just 0.66. The best-performing model in the YOLO series is YOLOv11 because of its robust backbone extraction network, which more effectively combines various feature information, whereas YOLOv5 lacks feature alignment techniques. Overall, ATD-Net outperforms YOLOv11 in terms of accuracy and recall, despite YOLOv11’s notable advancements in detection performance.
A confidence heatmap is a spatial visualization of the target confidence or category probability output, which intuitively shows the likelihood that each position in the image contains a target. Generating the heatmap consists of three steps: numerical extraction, normalization, and color mapping. Confidence scores are first extracted for each feature map location to form a matrix $M$. Then, the min-max normalization can be written as:
$$M' = \frac{M - M_{\min}}{M_{\max} - M_{\min}}$$
where $M_{\min}$ is the minimum value in matrix $M$, $M_{\max}$ is the maximum value in matrix $M$, and $M'$ is the normalized response value, which lies within the range $[0, 1]$. Finally, the normalized values are rendered through a color mapping function using the hot colormap available in Python (3.9).
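The normalization and color mapping steps can be sketched in plain Python. The min-max scaling follows the description above; `hot_like_color` is a simplified black-red-yellow-white ramp written for illustration and only approximates a library "hot" colormap. The example scores are made up.

```python
def min_max_normalize(matrix, eps=1e-12):
    """Scale every entry of a 2-D list into [0, 1] using its min and max."""
    flat = [value for row in matrix for value in row]
    lowest, highest = min(flat), max(flat)
    span = max(highest - lowest, eps)  # guard against a constant matrix
    return [[(value - lowest) / span for value in row] for row in matrix]


def hot_like_color(value):
    """Map a value in [0, 1] to an (r, g, b) tuple: black, red, yellow, white."""
    red = min(1.0, value * 3.0)
    green = min(1.0, max(0.0, value * 3.0 - 1.0))
    blue = min(1.0, max(0.0, value * 3.0 - 2.0))
    return (red, green, blue)


# Hypothetical confidence scores for a 2 x 2 feature map.
scores = [[0.10, 0.40],
          [0.85, 0.25]]
normalized = min_max_normalize(scores)
heatmap = [[hot_like_color(value) for value in row] for row in normalized]
print(normalized)
print(heatmap)
```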
In contrast to
Figure 9’s presentation,
Figure 10 compares the heatmap results of several models, using color gradients to represent data density and distribution graphically. Fine-grained lesions are often missed during the model’s feature extraction step because of their small pixel ratio, the ambiguous texture and geometric properties of the chosen sample, and background interference. Heatmaps can better show the severity of diseases through variations in color concentration while suppressing background noise. This enables agricultural professionals to promptly detect diseases and implement efficient solutions. The heatmap feature distribution of ATD-Net is more robust and concentrated than that of conventional YOLO series models such as YOLOv10 and YOLOv12n. This is because key feature information is retained through downsampling and the fine-grained texture features of the disease are enhanced, which makes localization of the disease area more dynamic and accurate. The models are shown starting from the left of the first row: Faster R-CNN, YOLOv5, YOLOv8, YOLOv10, YOLOv11, YOLOv12, DETR, and ATD-Net.
7. Conclusions
The paper proposes ATD-Net, an improved texture-aware detection network, to address the problems of fine-grained feature extraction, high computational cost, and bounding box localization accuracy in plant leaf disease detection. Three key enhancements are incorporated into the network. First, as shown by the lower parameter count and steady mAP in ablation tests, ADown is used to decrease model parameters and computational complexity while preserving the representational capacity for crucial texture aspects (
Table 1). Second, the combination of the Gabor filter and Transformers improves recall and mAP on early-stage illness samples by helping to extract and enhance edge, directional, and texture features in diseased areas (
Table 1). Third, the proposed dynamic boundary loss function adaptively adjusts the probability distribution of bounding box regression, resulting in improved localization accuracy under strict IoU thresholds (mAP75 increased by 0.97% in the full model). Experimental results demonstrate that ATD-Net achieves a favorable balance between detection accuracy and model complexity.
In the future, we intend to improve the texture enhancement technique for plant leaf diseases and optimize the Gabor and DBL algorithms. We still need to gather more plant leaf disease data in order to test the algorithm, apply it to more categories, and improve the practical quality of ATD-Net by combining the method with additional features. Additionally, the color and shape characteristics of diseased leaves still need to be modeled more completely and discriminatively.