Article

AD-DETR: A Real-Time Transformer with Multi-Scale Alignment and Spatial–Spectral Fusion for Crop Disease Detection

School of Mathematical Sciences, Harbin Normal University, Harbin 150025, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(10), 3206; https://doi.org/10.3390/s26103206
Submission received: 7 March 2026 / Revised: 12 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026
(This article belongs to the Section Smart Agriculture)

Abstract

Agriculture faces significant challenges from crop diseases, which threaten global food security and cause substantial economic losses annually. While deep learning has advanced plant disease detection, existing models often struggle with generalization across heterogeneous environments and real-time deployment constraints, hindering their practical application in diverse agricultural settings. This paper proposes AD-DETR, an enhanced real-time detection transformer framework specifically designed for agricultural scenarios. The model incorporates three key innovations to address these issues. First, the Multi-Scale Align Network (MSANet) achieves adaptive feature alignment through an Adapt Fusion Align (AFA) block, effectively preserving disease detail information across varying scales. Second, the Spatial–Spectral Attentive Feature Fusion (SSAFF) module integrates frequency-domain processing with attention mechanisms, enhancing feature representation quality by combining spatial and spectral information. Third, the IPIoUv2 loss function improves bounding-box regression accuracy through an internal perception mechanism and scale-adaptive weighting. Comprehensive experiments demonstrate that AD-DETR achieves strong performance, with 90.2% mean average precision at IoU = 0.5 on the Crop Disease dataset and 97.4% on the PlantDoc dataset. It maintains high efficiency with 16.4 million parameters, 47.2 GFLOPs computational complexity, and inference speeds of 230–242 frames per second. These results indicate that AD-DETR is robust to domain shift and suitable for resource-constrained applications, such as real-time monitoring on mobile and edge platforms.

1. Introduction

Agriculture acts as the mainstay of both economic prosperity and social progress on a national scale. Unfortunately, the rampant proliferation of crop diseases presents a formidable challenge to our food supply, the quality of agricultural goods, and the overall balance of agricultural environments [1]. The United Nations Food and Agriculture Organization (FAO) reports that between 20% and 40% of worldwide crop harvests are lost each year due to these destructive pests and diseases, translating into staggering economic damages worth hundreds of billions of dollars [2]. Conventional methods of diagnosing plant pathogens depend heavily on the seasoned judgment of agricultural specialists, an approach that suffers from steep labor expenses, sluggish processing times, and inconsistent evaluations—factors that fail to keep pace with the requirements of today’s data-driven, technologically advanced farming practices [3]. As a result, creating sophisticated and automated systems for identifying crop ailments has emerged as a critical focus area within the domain of agricultural innovation [4].
Early detection of plant diseases relied primarily on manual observation and laboratory analysis, which were inefficient and highly subjective. With technological advances, traditional image processing techniques began to be applied to disease detection, for example by using color and texture features for lesion segmentation. Loranger et al. [5] developed a Colour Analyzer tool based on the HSV and L*a*b* color models for accurately measuring the area of leaf lesions. Bhujade et al. [6] proposed the OptCFA method, which combines filters such as GABF, AmPel, and E-SGF for image denoising and outperforms traditional filters. However, such conventional approaches struggle to adapt to different plant species and to complex, variable field environments, and they generalize poorly.
Machine learning methods have further advanced the automation of plant disease detection. These approaches achieve disease identification by combining handcrafted features with classifiers. Sahu et al. [7] developed an HRF-MCSVM model that integrates spatial fuzzy C-means clustering and feature preprocessing to improve the accuracy of leaf disease detection. Selvaraj et al. [8] constructed a SqueezeNet model combined with the RCSO optimization algorithm, achieving high-sensitivity root disease classification under low-power constraints. Nevertheless, machine learning approaches’ performance heavily relies upon manual feature design quality, and they are often inadequate for handling large-scale, complex data.
During the previous decade, deep learning has become increasingly prevalent in computer vision. Convolutional neural networks (CNNs) have shown strong feature-extraction and image-recognition capabilities, creating new opportunities for intelligent crop disease diagnosis [9,10]. Scholarly inquiry in this domain has progressively moved beyond image classification to object detection and instance segmentation, which allow diseased areas to be localized and distinguished with greater accuracy. Gong and Zhang [11] developed an improved Faster R-CNN method for apple leaf disease detection, Arun et al. [12] proposed an enhanced tiny YOLO network for rice leaf disease detection, and Faisal et al. [13] used transfer learning with several CNN architectures for citrus plant disease classification. Zhang et al. [14] employed a ResNet-50-based network to evaluate tomato leaf disease severity. Related studies, including PlantVillage- and PlantDoc-based work, have also demonstrated that deep networks can learn discriminative disease features under both controlled and field-like settings [15,16,17,18].
Real-time object detection has been dominated by YOLO-series models, which prioritize inference speed but often struggle with complex multi-scale targets and domain shifts [19]. The introduction of real-time detection transformers, especially RT-DETR, marked a paradigm shift by combining the end-to-end detection capability of transformers with real-time performance [20]. RT-DETR uses a hybrid encoder with attention-based intra-scale feature interaction and cross-scale feature fusion to integrate multi-scale features. While RT-DETR achieves competitive accuracy and speed, its fixed-scale interaction mechanism lacks adaptability to the scale variations common in crop disease symptoms. Moreover, its reliance on spatial-domain processing overlooks the potential of frequency-domain information for capturing global texture patterns, a critical aspect for discriminating subtle disease lesions.
Multi-scale feature alignment is a core challenge in crop disease detection because disease symptoms exhibit high diversity across lesion sizes and developmental stages. Traditional CNNs extract hierarchical features, but they lack an explicit multi-scale alignment mechanism, resulting in semantic gaps and spatial misalignment between feature maps. Researchers have proposed methods to strengthen multi-scale representation. HRNet maintains high-resolution feature flows through parallel multi-resolution branches [21], while TridentNet uses multi-branch convolution kernels to process different scales [22]. Feature-pyramid networks have also been widely adopted for multi-scale object detection [23]. However, disease spots are often similar to healthy leaf vein textures, and cluttered backgrounds may cause fixed-scale networks to ignore small or low-contrast lesions.
Feature fusion technology integrates information at different levels or domains to improve representation quality. Spatial-domain methods have evolved from simple concatenation or element-wise addition to attention-based weighted fusion, such as channel attention and spatial attention that emphasize salient regions through adaptive weights [24]. Frequency-domain processing is also gaining attention in image analysis. Wavelet transform has been used to extract periodic patterns of leaf texture, while Fourier transform can improve global context modeling by enhancing edge and texture features [25,26]. In agricultural vision tasks, frequency-domain information can help capture disease distribution and texture patterns, but existing work is mostly limited to traditional signal processing or preprocessing and lacks deep integration with real-time detection architectures.
Despite this progress, existing crop disease detectors still face three fundamental limitations that constrain practical deployment. First, models often perform well on specific datasets but suffer performance drops when generalized to new environments, crops, or disease types. Second, agricultural applications require high inference speed for high-resolution image streams, yet many accurate models are computationally expensive. Third, lightweight models may meet speed requirements but frequently sacrifice sensitivity to small or early-stage disease lesions. These limitations motivate the development of a detector that jointly improves accuracy, generalization, localization precision, and real-time efficiency.
The objective of this study is to develop and validate a real-time crop disease detection transformer, termed AD-DETR, that can accurately localize disease symptoms in complex agricultural scenes while remaining lightweight enough for edge deployment. The working hypotheses are as follows: (H1) adaptive multi-scale alignment improves the representation of lesions with different sizes and reduces background interference; (H2) spatial–spectral fusion improves disease-feature discriminability by combining local spatial cues with global texture information in the frequency domain; (H3) an improved IoU-based loss function improves bounding-box regression and class-imbalance handling; and (H4) the integration of these components provides a better accuracy–efficiency balance than mainstream CNN- and transformer-based detectors.
Accordingly, the specific objectives of this study are:
  • To design the Multi-Scale Align Network (MSANet) with an Adapt Fusion Align block for adaptive alignment of multi-scale crop disease features;
  • To develop the Spatial–Spectral Attentive Feature Fusion module that combines spatial attention and Fourier-based frequency-domain processing for robust feature representation;
  • To introduce the IPIoUv2 loss function for improved bounding-box regression and more stable localization of irregular disease symptoms; and
  • To evaluate AD-DETR on a large Crop Disease dataset and the PlantDoc dataset, comparing it with representative two-stage, single-stage, YOLO-series, and RT-DETR-based detectors in terms of accuracy, efficiency, and generalization.

2. Method

2.1. Overall Architecture

AD-DETR is a deep learning model designed for crop disease detection. Its architecture builds on an enhanced real-time detection transformer framework and is optimized for the requirements of agricultural scenarios. As shown in Figure 1, the network uses an encoder–decoder structure with three core components: the Multi-Scale Align Network (MSANet) backbone, the Spatial–Spectral Attentive Feature Fusion (SSAFF) module, and an improved detection head. This design is intended to improve detection precision across multiple categories of crop diseases while maintaining real-time performance.
We first designed the Adapt Fusion Align (AFA) block and, based on it, developed a feature-extraction network named MSANet. The AFA block adaptively aligns multi-scale features to address the scale variation and background interference of pest and disease targets in images. Second, we designed the SSAFF module in the encoder. This module processes spatial- and frequency-domain information simultaneously: it achieves attention-weighted feature fusion through the Multi-scale Focus Integration (MFI) block and uses the Fourier Transform Downsampling (FTD) block to extract frequency-domain features. This multi-domain fusion strategy lets the model exploit the complementary information in disease images and enhances the discriminative power of the features. Finally, tailored to the characteristics of the disease detection task, we designed the Inner-Powerful-IoUv2 (IPIoUv2) loss function, which improves bounding-box regression accuracy through its internal perception mechanism.
To clarify the uniqueness of AFA, we further compare it with the Squeeze-and-Excitation (SE) mechanism and BiFPN-style weighted fusion. SE recalibrates channels by using global pooled statistics from a single feature map and therefore focuses mainly on channel dependency modeling [27]. BiFPN learns scalar or normalized weights for bidirectional pyramid fusion and is designed for repeated top–down and bottom–up multi-scale aggregation [28]. In contrast, AFA first projects heterogeneous features into a shared channel space, learns spatially varying and channel-aware gates from the concatenated cross-scale features, and then combines these gates with learnable branch-level coefficients. Thus, AFA is not only a channel attention block or a feature-pyramid weighting rule; it is a feature-alignment and fusion unit specifically designed to reduce cross-scale misalignment in crop disease detection. The conceptual differences among AFA, SE, and BiFPN are summarized in Table 1.

2.2. Multi-Scale Align Network

Traditional CNNs frequently struggle to capture multi-scale features efficiently, which directly limits overall model performance. To address this issue, MSANet introduces two modules: the HGStem [29] module for low-level feature extraction and the C2f module [30] for high-level feature fusion. Both modules enhance the representational capacity of the network, and their structures are illustrated in Figure 1. To further improve computational efficiency, we integrated Depthwise Separable Convolution (DWConv) [31]. DWConv decomposes a standard convolution into depthwise and pointwise operations, substantially lowering computational cost while retaining the ability to extract rich multi-scale features.
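The saving from the depthwise/pointwise decomposition can be illustrated with a short cost calculation. The sketch below is not taken from the AD-DETR implementation; the layer sizes are hypothetical, and it assumes stride 1, "same" padding, and no bias terms.

```python
def conv_cost(c_in, c_out, k, h, w):
    """Parameters and multiply-accumulates (MACs) of a standard k x k convolution."""
    params = k * k * c_in * c_out
    return params, params * h * w  # every output position reuses all weights


def dwconv_cost(c_in, c_out, k, h, w):
    """Same quantities for a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    params = depthwise + pointwise
    return params, params * h * w


# Hypothetical layer: 128 -> 128 channels, 3 x 3 kernel, 80 x 80 output map.
std_params, std_macs = conv_cost(128, 128, 3, 80, 80)
dw_params, dw_macs = dwconv_cost(128, 128, 3, 80, 80)
ratio = dw_params / std_params  # equals 1 / c_out + 1 / k**2
```

For this hypothetical layer the decomposed form needs roughly 12% of the parameters and MACs of the standard convolution, which is the general ratio 1/C_out + 1/k².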
The AFA block is the core component of MSANet, designed to reduce feature-alignment deviations within the backbone network. Since feature maps often differ in channel count, stride, and receptive field, simple concatenation or weighted averaging can degrade fusion quality. The AFA block addresses this by performing feature alignment, adaptive weighting, and channel-wise modulation, thereby achieving more accurate fusion and improving detection performance. Its working principle is illustrated in Figure 2.
In the AFA structure, the input features $X = \{x_1, x_2\}$ first undergo a $1 \times 1$ convolution for channel alignment, ensuring they are fused within the same feature space. Since $x_1$ and $x_2$ may originate from different backbone networks, their channel dimensions might not match. Therefore, the following transformations are applied:
$$\hat{x}_1 = W_1^{(1 \times 1)} \otimes x_1, \quad \hat{x}_2 = W_2^{(1 \times 1)} \otimes x_2$$
Here, $\otimes$ represents the convolution operation; $W_1^{(1 \times 1)}$ and $W_2^{(1 \times 1)}$ are the $1 \times 1$ convolutional weights used for channel transformation, aligning the channel dimensions of $x_1$ and $x_2$ to lay the foundation for the subsequent fusion.
After channel alignment, the aligned features $\hat{x}_1$ and $\hat{x}_2$ are concatenated. To enhance the adaptability of the fusion, an Adapt Align Weight (AAW) mechanism is introduced. Specifically, a $3 \times 3$ convolution is applied to the concatenated features to capture high-level fusion information:
$$x_{concat} = \mathrm{Concat}(\hat{x}_1, \hat{x}_2), \quad x_f = \mathrm{Conv}_{3 \times 3}(x_{concat})$$
where $\mathrm{Conv}_{3 \times 3}$ denotes the $3 \times 3$ convolutional layer used to extract fusion information from the concatenated features.
Subsequently, the weights are normalized to the interval $(0, 1)$ using the sigmoid function $\sigma(\cdot)$, enabling adaptive weighting of features from different sources:
$$W_{align} = \sigma(x_f)$$
The tensor $W_{align}$ is split along the channel dimension into two independent dynamic weight tensors $w_1$ and $w_2$:
$$w_1, w_2 = \mathrm{Split}(W_{align})$$
Finally, through this weight allocation, AFA dynamically adjusts each input feature's contribution during fusion, achieving smoother and more coordinated feature fusion:
$$x_{fused} = w_1 \odot \hat{x}_1 + w_2 \odot \hat{x}_2$$
where $\odot$ denotes element-wise multiplication, so that valid information from different sources is fused appropriately.
However, relying solely on dynamic weights might cause certain feature paths to be excessively suppressed. To address this, learnable channel weights $\lambda_1$ and $\lambda_2$ are further introduced to optimize the final fusion ratio:
$$x_{final} = \lambda_1 \cdot (w_1 \odot \hat{x}_1) + \lambda_2 \cdot (w_2 \odot \hat{x}_2)$$
Here, $\lambda_1$ and $\lambda_2$ are trainable parameters, initialized to 0.5 and optimized during training, allowing the model to learn the optimal feature fusion ratio. To prevent gradient explosion or numerical instability during training, the AFA block imposes the following constraints on the channel weights:
$$\lambda_1 + \lambda_2 = 1, \quad \lambda_1, \lambda_2 \in [0, 1]$$
This constraint keeps the channel weights within a stable range, supporting model consistency and generalization capability. Lastly, a $1 \times 1$ convolution is applied to match the features to downstream detection tasks.
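The AFA fusion steps can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, tensor shapes, and random weights are hypothetical, the two inputs are assumed to share the same spatial size, the constraint on the branch coefficients is enforced with a softmax over two logits (one possible choice, not stated in the text), and the final 1 × 1 output convolution is omitted.

```python
import numpy as np


def conv2d(x, weight):
    """'Same'-padded convolution. x: (C_in, H, W); weight: (C_out, C_in, k, k), k odd."""
    c_out, _, k, _ = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) to every output position.
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out


def afa_fuse(x1, x2, w1_proj, w2_proj, w_gate, lam_logits=(0.0, 0.0)):
    """Sketch of AFA fusion for two feature maps with equal spatial size."""
    x1_hat = conv2d(x1, w1_proj)                            # 1x1 channel alignment
    x2_hat = conv2d(x2, w2_proj)
    x_f = conv2d(np.concatenate([x1_hat, x2_hat]), w_gate)  # 3x3 conv on the concat
    w_align = 1.0 / (1.0 + np.exp(-x_f))                    # sigmoid gates in (0, 1)
    w1, w2 = np.split(w_align, 2, axis=0)                   # split along channels
    # Assumption: softmax keeps lam1 + lam2 = 1 with both in [0, 1];
    # zero logits reproduce the stated initial value of 0.5 each.
    lam = np.exp(lam_logits) / np.sum(np.exp(lam_logits))
    return lam[0] * (w1 * x1_hat) + lam[1] * (w2 * x2_hat)


# Hypothetical inputs: 32- and 64-channel maps projected to 48 shared channels.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(32, 16, 16))
x2 = rng.normal(size=(64, 16, 16))
p1 = rng.normal(size=(48, 32, 1, 1)) * 0.1
p2 = rng.normal(size=(48, 64, 1, 1)) * 0.1
gate = rng.normal(size=(96, 96, 3, 3)) * 0.05
fused = afa_fuse(x1, x2, p1, p2, gate)
```

With an all-zero gate kernel every gate equals 0.5, so the output reduces to a plain average scaled by the branch coefficients; the learned gates are what make the fusion vary across positions and channels.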

2.3. Spatial–Spectral Attentive Feature Fusion

Traditional feature fusion methods exhibit significant limitations in crop disease detection: They fail to effectively handle semantic inconsistencies among multi-scale features, leading to inaccurate feature alignment and information loss. Moreover, these methods lack the utilization of frequency-domain information, overlooking the texture and periodic patterns in disease images, while their computational inefficiency makes it difficult to meet real-time detection demands.
To address these issues, this paper designs the SSAFF module. The MFI block applies a multi-scale attention mechanism to dynamically adjust feature weights and prioritize salient disease regions. The FTD block uses the Fourier transform to map spatial features into the frequency domain for processing, enabling the convolution operation to go beyond the limits of local receptive fields and directly capture global contextual information in images. The overall architecture improves the robustness and precision of feature fusion while maintaining real-time performance, providing an efficient solution for crop disease detection.
The MFI block is a core component of Spatial–Spectral Attentive Feature Fusion. Its core innovation lies in integrating local and global attention branches, along with a hierarchical feature processing path to achieve refined fusion of input features. Such a design maintains spatial details of input features and enhances the semantic representation of global contexts via an adaptive weighting mechanism.
As shown in Figure 3, the MFI block workflow consists of the following stages. First, the input features pass through a 1 × 1 convolution for dimensionality reduction, which unifies the dimensionality of the input features and reduces the computational complexity of subsequent operations. Next, a 3 × 3 convolution is applied to the two input features for preliminary fusion, producing baseline features that provide a stable foundation for subsequent attention-based processing.
Subsequently, these features are processed through the Hierarchical Aware (HA) block. The input feature map is first divided into non-overlapping patches. Let the input feature tensor be $F \in \mathbb{R}^{H \times W \times C}$. Through a spatial operation such as unfolding, it is partitioned into patches of size $p \times p$, generating a set of patches $P = \{P_1, P_2, \ldots, P_N\}$, where $N = \frac{H \times W}{p \times p}$. This step avoids the information loss associated with uniform downsampling. After simplification, each patch $P_i$ is converted into a patch-level feature vector $t_i \in \mathbb{R}^{d}$ (where $d = p \times p$).
Task-oriented weighting is then introduced: a weight is computed by comparing the similarity between $t_i$ and a task-embedding vector $\xi \in \mathbb{R}^{C}$. A linear transformation $P \in \mathbb{R}^{C \times C}$ is applied for channel selection, formulated as:
$$\hat{t}_i = P \cdot \mathrm{sim}(t_i, \xi) \cdot t_i$$
Here, $\mathrm{sim}(t_i, \xi)$ is a cosine similarity function, outputting a value in the range $[0, 1]$ to measure the relevance of the patch to the task. Patches with high weights are enhanced, while those with low weights are suppressed, achieving adaptive selection.
The weighted patches $\hat{t}_i$ are then reassembled into a complete feature map via a feature recombination operation. Specifically, all patches are recombined and interpolated back to the original spatial dimensions, producing the enhanced feature $F_{out} \in \mathbb{R}^{H \times W \times C}$, expressed as:
$$F_{out} = \mathrm{Reshape}\left(\mathrm{Interpolate}\left(\{\hat{t}_i\}_{i=1}^{N}\right)\right)$$
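The patch partition, task-oriented weighting, and recombination can be sketched as follows. This is a simplified illustration under several assumptions that go beyond the text: each patch is summarised by its channel-wise mean so that it can be compared with the task embedding, negative cosine values are clipped to zero to obtain weights in [0, 1], the channel-selection matrix defaults to the identity, and no interpolation is needed because the patches tile the map exactly. The function name `patch_reweight` is hypothetical.

```python
import numpy as np


def patch_reweight(feat, xi, p, proj=None):
    """feat: (H, W, C) with H and W divisible by p; xi: task embedding of shape (C,)."""
    h, w, c = feat.shape
    proj = np.eye(c) if proj is None else proj
    # Partition into N = (H / p) * (W / p) non-overlapping p x p patches.
    patches = feat.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    # Assumption: a patch descriptor is the channel-wise mean over the patch.
    desc = patches.mean(axis=(2, 3))                               # (H/p, W/p, C)
    cos = desc @ xi / (np.linalg.norm(desc, axis=-1) * np.linalg.norm(xi) + 1e-8)
    sim = np.clip(cos, 0.0, 1.0)                                   # weights in [0, 1]
    # Scale every patch by its weight, then apply the channel-selection matrix.
    weighted = (patches * sim[:, :, None, None, None]) @ proj.T
    # Put the patches back in their original positions.
    return weighted.transpose(0, 2, 1, 3, 4).reshape(h, w, c), sim


# Hypothetical 4 x 4 map with two channels and 2 x 2 patches.
feat = np.zeros((4, 4, 2))
feat[:2, :2, 0] = 1.0   # top-left patch responds on channel 0
feat[:2, 2:, 1] = 1.0   # the remaining patches respond on channel 1
feat[2:, :, 1] = 1.0
xi = np.array([1.0, 0.0])  # task embedding aligned with channel 0
out, sim = patch_reweight(feat, xi, p=2)
```

In this toy case only the top-left patch is aligned with the task embedding, so it is kept while the other three patches are suppressed to zero.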
Finally, the features processed by the hierarchical perception branch are integrated with the baseline features. The fusion process consists of a 1 × 1 convolution, a reparameterized 3 × 3 convolution, and another 1 × 1 convolution. The use of reparameterized convolution significantly improves parameter efficiency and effectively reorganizes features from multiple branches, ensuring high inference efficiency.
In the feature fusion network of AD-DETR, the FTD block is a key component of the SSAFF module and performs the downsampling function. Unlike traditional downsampling methods based on pooling or strided convolutions, the FTD block reduces features in the frequency domain. By combining spectral filtering and resolution adjustment, it lowers feature map resolution while better preserving frequency-domain feature information.
The downsampling process of the FTD block is implemented through frequency-domain operations. Its core idea is to use the frequency truncation property of the Fourier transform to reduce resolution. The mathematical foundation is the convolution theorem: convolution in the spatial domain is equivalent to element-wise multiplication in the frequency domain. For an input image $x(n, m)$ of size $N \times N$, the discrete Fourier transform (DFT) is defined as:
$$X(k_1, k_2) = \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \frac{x(n, m)}{N^2} \exp\left(-\frac{2\pi j}{N}(n k_1 + m k_2)\right)$$
The convolution operation in the frequency domain is expressed as:
$$Y(k_1, k_2) = X(k_1, k_2) \odot W(k_1, k_2)$$
where $W(k_1, k_2)$ is a learnable convolution kernel in the frequency domain. The result in the spatial domain can be recovered via the inverse DFT:
$$\mathrm{IDFT}\{Y\} = x \ast w$$
This mathematical equivalence enables the FTD block to achieve receptive field coverage ranging from local ( 1 × 1 ) to global ( N × N ).
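The convolution theorem can be checked numerically. The sketch below compares the frequency-domain route with a direct circular convolution on random data; it uses NumPy's FFT convention (no normalization on the forward transform), so with a normalized forward transform the two sides would agree only up to a constant factor. The array size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=(n, n))
w = rng.normal(size=(n, n))  # a kernel as large as the image: global receptive field

# Frequency-domain route: element-wise product of spectra, then inverse DFT.
y_freq = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(w)).real

# Spatial-domain route: direct circular convolution.
y_spatial = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        # np.roll shifts w by (a, b) with wrap-around, giving w[i - a, j - b].
        y_spatial += x[a, b] * np.roll(w, shift=(a, b), axis=(0, 1))

max_err = np.abs(y_freq - y_spatial).max()
```

Note that the equivalence holds for circular (wrap-around) convolution; the direct route costs on the order of N⁴ operations here, whereas the FFT route scales as N² log N.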
The illustration of FCB is shown in Figure 4. Specifically, given an input feature map $x \in \mathbb{R}^{H \times W \times C}$, spatial features are first transformed into the frequency domain via the Fast Fourier Transform (FFT):
$$X = \mathcal{F}(x)$$
where $\mathcal{F}$ denotes the 2D FFT operation. Since the input consists of real-valued features, the output spectrum exhibits conjugate symmetry, so only half of the frequency-domain data needs to be processed for a complete representation. Frequency-domain modulation is then applied using a learnable frequency-domain filter $W \in \mathbb{C}^{H \times W \times C}$, with frequency truncation performed at the same time:
$$Y = W \odot X_{truncated}$$
where $X_{truncated}$ represents the truncated spectrum (retaining low-frequency components), and $\odot$ represents element-wise complex multiplication. The learnable filter enables the model to adaptively choose the frequency components most important for downsampling. Finally, the result is transformed back to the spatial domain via the inverse FFT (IFFT), which naturally reduces the resolution:
$$y = \mathcal{F}^{-1}(Y)$$
After the inverse transform, the output feature map $y$ has dimensions $\frac{H}{2} \times \frac{W}{2} \times C$, completing $2\times$ downsampling.
The advantages of this frequency-domain downsampling method are as follows: First, by preserving low-frequency components, it naturally achieves anti-aliasing, avoiding the spectral aliasing issues common in spatial-domain downsampling. Second, the learnable filter can optimize frequency selection for the specific task, enhancing the discriminative power of the downsampled features. Lastly, frequency-domain operations provide a global receptive field, ensuring that long-range dependencies are not lost during the downsampling process.
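The truncate-and-invert procedure can be sketched with NumPy's FFT routines. This is an illustrative sketch, not the FTD block itself: the function name is hypothetical, the optional filter is assumed to have the shape of the truncated spectrum, the full FFT is used instead of the half-spectrum form for readability, and the amplitude rescaling at the end is a choice made here so that constant inputs keep their value.

```python
import numpy as np


def fourier_downsample(x, filt=None):
    """x: real array (H, W, C) with even H and W. Returns a (H/2, W/2, C) array."""
    h, w, _ = x.shape
    h2, w2 = h // 2, w // 2
    # Move the zero-frequency (DC) term to the centre of the spectrum.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(0, 1)), axes=(0, 1))
    # Crop a centred h2 x w2 window: this keeps only the low frequencies.
    top, left = h // 2 - h2 // 2, w // 2 - w2 // 2
    low = spec[top:top + h2, left:left + w2, :]
    if filt is not None:
        low = low * filt  # element-wise complex modulation (learnable in the paper)
    y = np.fft.ifft2(np.fft.ifftshift(low, axes=(0, 1)), axes=(0, 1))
    # The inverse FFT divides by h2 * w2 instead of h * w, so rescale amplitudes.
    return y.real * (h2 * w2) / (h * w)


# A smooth pattern (one cosine period along the height) survives 2x downsampling.
rows = np.cos(2 * np.pi * np.arange(8) / 8)
x = np.tile(rows[:, None, None], (1, 8, 1))  # shape (8, 8, 1)
y = fourier_downsample(x)
```

Because everything above the retained band is discarded before the inverse transform, no high-frequency content can alias into the smaller map, which matches the anti-aliasing argument made in the text.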

2.4. Inner-Powerful-IoUv2

In crop disease detection, bounding-box regression precision directly affects the model's localization capability. Traditional IoU loss functions suffer from slow convergence, geometric misalignment, and limited generalization. To address these issues, this paper proposes IPIoUv2, which integrates the geometric alignment penalty of PowerfulIoUv2 (PIoUv2) [32] with the scaling strategy of InnerIoU [33]. By dynamically adjusting the loss weights and adapting to scale variations, the method improves bounding-box regression precision, convergence speed, and robustness. The geometric penalty term $P$ added by PIoU is calculated directly from the width and height of the target bounding box, guiding the anchor box to regress more directly toward the target center and avoiding unnecessary expansion. It is computed as:
$$P = \frac{1}{4}\left(\frac{d_{w1}}{w^{gt}} + \frac{d_{w2}}{w^{gt}} + \frac{d_{h1}}{h^{gt}} + \frac{d_{h2}}{h^{gt}}\right)$$
$$L_{PIoU} = L_{IoU} + 1 - e^{-P^2}, \quad 0 \le L_{PIoU} \le 2$$
where $d_{w1}$, $d_{w2}$, $d_{h1}$, and $d_{h2}$ measure the boundary distances between the target and predicted boxes in the horizontal and vertical directions, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground-truth box, respectively. Based on PIoU, a non-monotonic focal mechanism is then introduced to dynamically adjust the loss weight, resulting in a new loss function termed PIoUv2. It is defined as follows:
$$q = e^{-P}, \quad 0 < q \le 1$$
$$u(x) = 3x \cdot e^{-x^2}$$
$$L_{PIoUv2} = u(\lambda q) \cdot L_{PIoU} = 3 \cdot (\lambda q) \cdot e^{-(\lambda q)^2} \cdot L_{PIoU}$$
where $\lambda$ is a hyperparameter controlling the penalty strength. Based on our experiments, we set $\lambda = 1.3$.
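A minimal sketch of the PIoU and PIoUv2 losses for a single box pair is given below. Two points are assumptions rather than statements from the text: the base term is taken to be L_IoU = 1 − IoU, and the four distances are taken to be the absolute offsets between matching edges of the predicted and ground-truth boxes. Function names and the example boxes are hypothetical.

```python
import math


def iou(box, gt):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_box + area_gt - inter)


def piou_penalty(box, gt):
    """Penalty P: mean edge offset normalised by the ground-truth width and height."""
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    # Assumption: d_w1, d_w2, d_h1, d_h2 are absolute distances between matching edges.
    dw1, dw2 = abs(box[0] - gt[0]), abs(box[2] - gt[2])
    dh1, dh2 = abs(box[1] - gt[1]), abs(box[3] - gt[3])
    return 0.25 * (dw1 / w_gt + dw2 / w_gt + dh1 / h_gt + dh2 / h_gt)


def piou_loss(box, gt):
    """L_PIoU = L_IoU + 1 - exp(-P^2), with L_IoU assumed to be 1 - IoU."""
    p = piou_penalty(box, gt)
    return (1.0 - iou(box, gt)) + 1.0 - math.exp(-p ** 2)


def piou_v2_loss(box, gt, lam=1.3):
    """L_PIoUv2 = u(lam * q) * L_PIoU with q = exp(-P) and u(x) = 3x * exp(-x^2)."""
    q = math.exp(-piou_penalty(box, gt))
    u = 3.0 * (lam * q) * math.exp(-(lam * q) ** 2)
    return u * piou_loss(box, gt)


gt_box = (0.0, 0.0, 10.0, 10.0)
shifted = (5.0, 0.0, 15.0, 10.0)  # same size, shifted right by half a width
penalty = piou_penalty(shifted, gt_box)
loss_v1 = piou_loss(shifted, gt_box)
loss_v2 = piou_v2_loss(shifted, gt_box)
```

A perfectly aligned prediction gives P = 0 and a loss of zero in both variants; the focusing factor u(λq) then rescales the loss of imperfect boxes according to how large their penalty is.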
The motivation behind Inner-IoU differs from that of PIoU. It primarily addresses the insufficient generalization capability and slow convergence of the traditional IoU loss in disease detection. Its core concept involves introducing a proportionally scaled “auxiliary bounding box” for loss calculation, thereby dynamically adjusting difficulty and focus of regression. The specific calculation formula is as follows:
$$b_l^{gt} = x_c^{gt} - \frac{w^{gt} \gamma}{2}, \quad b_r^{gt} = x_c^{gt} + \frac{w^{gt} \gamma}{2}$$
$$b_t^{gt} = y_c^{gt} - \frac{h^{gt} \gamma}{2}, \quad b_b^{gt} = y_c^{gt} + \frac{h^{gt} \gamma}{2}$$
$$b_l = x_c - \frac{w \gamma}{2}, \quad b_r = x_c + \frac{w \gamma}{2}$$
$$b_t = y_c - \frac{h \gamma}{2}, \quad b_b = y_c + \frac{h \gamma}{2}$$
$$inter = \left(\min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l)\right) \cdot \left(\min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t)\right)$$
$$union = (w^{gt} h^{gt}) \gamma^2 + (w h) \gamma^2 - inter$$
$$IoU_{inner} = \frac{inter}{union}$$
where $(x_c^{gt}, y_c^{gt})$ and $(x_c, y_c)$ are the center coordinates of the ground-truth box and the predicted box, respectively; $w^{gt}$, $h^{gt}$, $w$, and $h$ are the widths and heights of the ground-truth box and the predicted box, respectively; and $\gamma$ is a scale factor, set to $\gamma = 0.7$ here. This setting imposes stricter regression criteria and can accelerate the convergence of high-quality samples. The IPIoUv2 loss can therefore be expressed as:
$$L_{IPIoUv2} = 3 \cdot (\lambda q) \cdot e^{-(\lambda q)^2} \cdot \left(2 - e^{-P^2} - IoU_{inner}\right)$$
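The auxiliary-box computation and the combined loss can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: boxes are given in centre format, the overlap is clamped at zero for disjoint auxiliary boxes (a common safeguard that the formulas do not state explicitly), and the penalty P is passed in as a number computed from the original, unscaled boxes. Function names and example values are hypothetical.

```python
import math


def inner_iou(box, gt, gamma=0.7):
    """Boxes are (xc, yc, w, h). IoU of both boxes after scaling them by gamma."""
    def edges(xc, yc, w, h):
        # Left, right, top, bottom of the auxiliary box.
        return xc - w * gamma / 2, xc + w * gamma / 2, yc - h * gamma / 2, yc + h * gamma / 2

    bl, br, bt, bb = edges(*box)
    gl, gr, gtop, gb = edges(*gt)
    # Assumption: clamp at zero so disjoint auxiliary boxes give no overlap.
    inter_w = max(0.0, min(gr, br) - max(gl, bl))
    inter_h = max(0.0, min(gb, bb) - max(gtop, bt))
    inter = inter_w * inter_h
    union = gt[2] * gt[3] * gamma ** 2 + box[2] * box[3] * gamma ** 2 - inter
    return inter / union


def ipiou_v2_loss(box, gt, penalty, lam=1.3, gamma=0.7):
    """L_IPIoUv2 = 3 (lam q) exp(-(lam q)^2) (2 - exp(-P^2) - IoU_inner), q = exp(-P)."""
    q = math.exp(-penalty)
    focus = 3.0 * (lam * q) * math.exp(-(lam * q) ** 2)
    return focus * (2.0 - math.exp(-penalty ** 2) - inner_iou(box, gt, gamma))


gt_box = (5.0, 5.0, 10.0, 10.0)
pred_box = (10.0, 5.0, 10.0, 10.0)  # same size, shifted right by half a width
strict_iou = inner_iou(pred_box, gt_box)             # gamma = 0.7
plain_iou = inner_iou(pred_box, gt_box, gamma=1.0)   # reduces to the ordinary IoU
loss = ipiou_v2_loss(pred_box, gt_box, penalty=0.25)
```

For this pair the ordinary IoU is 1/3 while the auxiliary boxes overlap with an IoU of only 1/6, which illustrates why a scale factor below 1 imposes a stricter regression criterion.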

3. Experiments

3.1. Datasets

This study uses two datasets. The first is a self-constructed Crop Disease dataset, and the second is the public PlantDoc benchmark dataset [18]. The Crop Disease dataset contains 37,714 high-quality images covering 41 fine-grained categories. It includes healthy states and common diseases such as early blight, late blight, powdery mildew, bacterial spot, and leaf blight for more than ten important crops, including corn, tomato, grape, strawberry, bell pepper, peach, cherry, potato, apple, citrus, squash, rice, and cassava. Among them, 4850 images are from the Academy of Agricultural Sciences, 7542 images are from the Institute of Intelligent Machinery of the Chinese Academy of Sciences, 10,693 images are from commercial datasets, and 14,629 images are from mainstream search engines. Domain experts used LabelImg to annotate lesion areas and their corresponding categories. After annotation, files were saved in TXT format. The dataset was split into training, test, and validation subsets using a 7:1:2 ratio.
To make the dataset composition clearer, the number of diseased and healthy images used in the experiments is reported in Table 2. In this study, an image is counted as diseased if it contains at least one disease label; otherwise, it is counted as healthy. The self-constructed Crop Disease dataset contains 33,286 diseased images and 4428 healthy images. The PlantDoc object-detection version used for external validation contains 2569 images from 13 plant species and 30 categories, including both diseased and healthy leaf categories, and has 8851 object annotations. PlantDoc was used only for testing cross-dataset generalization and is cited here according to its original dataset paper [18].
The distribution in Table 3 confirms that class imbalance exists, mainly because healthy samples and several visually similar disease categories are less frequent than common diseases such as tomato late blight, corn leaf blight, and tomato early blight. To reduce the influence of this imbalance, the train/validation/test split was performed in a stratified manner, rare categories were augmented using random scaling, flipping, color jittering, and mosaic-style composition, and the IPIoUv2 loss was introduced to improve localization stability for difficult and minority lesion samples.
Representative images from both datasets are shown in Figure 5.

3.2. Experimental Setup and Evaluation Metrics

The experiments were run on an NVIDIA GeForce RTX 3090 GPU and an Intel Xeon E5-2630 v3 processor (2.40 GHz) under 64-bit Windows 11. The model was implemented in PyTorch 1.13.1 with Python 3.8, CUDA 12.1, and cuDNN 8.9.2. The training configuration is listed in Table 4.
To ensure a fair comparison, all subsequent baseline detectors were trained and evaluated under a unified protocol. The same training/validation/test splits, input size, batch size, number of epochs, image normalization, and data augmentation pipeline were used whenever supported by the official implementation. For two-stage detectors, anchor and region-proposal settings followed the default configuration recommended by the corresponding framework, but the image size and training schedule were kept consistent with the other models. The unified training and evaluation protocol is summarized in Table 5.
To evaluate AD-DETR and compare it with other state-of-the-art object detection models, we use several metrics that quantify detection precision, computational efficiency, and inference speed.
Because this study addresses multi-class object detection rather than binary classification, precision and recall are calculated for each class and then averaged across all classes. For class c, a detection is counted as a true positive only when its predicted label is correct and its IoU with the corresponding ground-truth box is at least 0.5:
$$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}.$$
The macro-averaged precision and recall reported in this study are:
$$P = \frac{1}{N_c}\sum_{c=1}^{N_c} P_c, \qquad R = \frac{1}{N_c}\sum_{c=1}^{N_c} R_c,$$
where $N_c$ is the number of categories, and $TP_c$, $FP_c$, and $FN_c$ denote class-specific true positives, false positives, and false negatives. This formulation better reflects performance across all crop and disease categories, including minority classes.
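The macro-averaging described above can be sketched in a few lines. The per-class counts below are made-up numbers, and returning 0 for a class with an empty denominator is an assumed convention rather than something stated in the text.

```python
def macro_precision_recall(counts):
    """counts maps class name -> (TP, FP, FN); returns macro-averaged (P, R)."""
    precisions, recalls = [], []
    for tp, fp, fn in counts.values():
        # Assumed convention: a class with no predictions (or no ground truth) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical counts: one frequent class and one rare, harder class.
counts = {"common": (90, 10, 10), "rare": (5, 5, 15)}
P, R = macro_precision_recall(counts)

# Micro-averaging pools the counts first, so the frequent class dominates.
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro_P, micro_R = tp / (tp + fp), tp / (tp + fn)
```

In this toy case the macro recall (0.575) is well below the pooled recall (about 0.79), which is why macro-averaging is the more sensitive choice when minority classes matter.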
Mean Average Precision at IoU = 0.5 (mAP@50): A widely used metric in object detection, mAP@50 summarizes the precision–recall curve over all object categories. The Average Precision (AP) is computed for each class at an Intersection over Union (IoU) threshold of 0.5, and the AP values are then averaged across all categories. IoU measures the overlap between a predicted bounding box ($B_{pred}$) and its associated ground-truth bounding box ($B_{gt}$):
$$\mathrm{IoU} = \frac{\mathrm{Area}(B_{pred} \cap B_{gt})}{\mathrm{Area}(B_{pred} \cup B_{gt})}$$
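As a concrete illustration of the IoU definition, the following sketch computes it for axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions made for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pred = (0.0, 0.0, 10.0, 10.0)
gt = (5.0, 0.0, 15.0, 10.0)
iou = box_iou(pred, gt)   # intersection area 50, union area 150
```

Here the IoU is 1/3, so under the 0.5 threshold this prediction would not count as a true positive even if its class label were correct.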
The mAP@50 is defined as:
$$\mathrm{mAP@50} = \frac{1}{N_c}\sum_{c=1}^{N_c} AP_c \qquad (\mathrm{IoU} = 0.5)$$
Here $N_c$ is the number of object categories, and $AP_c$ is the Average Precision for category $c$, computed as the area under that category’s precision–recall curve at IoU = 0.5.
Mean Average Precision over IoU thresholds 0.5–0.95 (mAP@50–95): This metric gives a stricter assessment by averaging mAP over IoU thresholds from 0.5 to 0.95 in increments of 0.05. It evaluates the model’s localization accuracy across increasingly strict overlap requirements.
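The two mAP definitions can be sketched as follows. This is a simplified illustration, not the evaluation code used in the study. It assumes that each detection already carries the IoU with its matched ground-truth box, that every class has at least one ground-truth box, and that AP is integrated over all recall points using the monotone precision envelope. Evaluation toolkits differ in these details, and the detections below are invented.

```python
def average_precision(dets, num_gt, iou_thr):
    """AP for one class. dets: list of (confidence, matched_iou); num_gt > 0."""
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, matched_iou in dets:
        if matched_iou >= iou_thr:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the resulting step curve.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(per_class, iou_thr):
    """Average AP over classes; per_class maps name -> (dets, num_gt)."""
    aps = [average_precision(d, n, iou_thr) for d, n in per_class.values()]
    return sum(aps) / len(aps)

thresholds = [0.5 + 0.05 * i for i in range(10)]   # 0.50, 0.55, ..., 0.95

# Hypothetical detections: (confidence, IoU with matched ground truth).
per_class = {
    "early_blight": ([(0.9, 0.82), (0.8, 0.40), (0.7, 0.66)], 2),
    "leaf_mold": ([(0.95, 0.97), (0.6, 0.58)], 2),
}
map50 = mean_ap(per_class, 0.5)
map50_95 = sum(mean_ap(per_class, t) for t in thresholds) / len(thresholds)
```

Loosely localized detections that pass at IoU = 0.5 become false positives at stricter thresholds, so the averaged score is lower than mAP@50 and rewards tighter boxes.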
Parameters (Params): The total number of trainable parameters in the model, reported in millions (M). This metric matters because it directly determines the model’s memory footprint and storage requirements.
Floating Point Operations (FLOPs): The computational complexity of the model, reported in giga FLOPs (GFLOPs). It approximates the number of floating-point operations required for a single forward pass on one input image. Lower FLOPs typically indicate higher computational efficiency.
Frames per second (FPS): The model’s inference speed, measured as the number of images the model can process per second on the given hardware.
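A throughput measurement of the kind described for FPS might look like the sketch below. The `dummy_infer` function is a hypothetical stand-in for a detector forward pass, and the warm-up count and batch size of one are assumptions, since the exact timing procedure is not given here.

```python
import time

def measure_fps(infer, inputs, warmup=5):
    """Images processed per second by `infer`, one image at a time."""
    for x in inputs[:warmup]:          # warm-up runs are excluded from timing
        infer(x)
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed

def dummy_infer(image):
    # Hypothetical placeholder workload, not a real model.
    return sum(image)

images = [list(range(1000)) for _ in range(200)]
fps = measure_fps(dummy_infer, images)
```

When timing a GPU model, the device should be synchronized before each clock reading (for example with `torch.cuda.synchronize()` in PyTorch), because kernel launches are asynchronous and the loop could otherwise return before the work has finished.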

4. Results

4.1. Ablation Experiments

To systematically verify the effectiveness of each core component of AD-DETR, we conducted ablation experiments. As shown in Table 6, the MSANet, SSAFF, and IPIoUv2 modules were introduced step by step to analyze each component’s contribution to model performance and the interactions between components.
When MSANet is used alone, the number of parameters drops from the baseline of 19.9 M to 12.3 M, the computational cost drops from 57.0 GFLOPs to 34.4 GFLOPs, and mAP@50 rises from 84.8% to 85.0%. This improvement is mainly due to the AFA module in MSANet, which reduces parameter redundancy while maintaining multi-scale feature-extraction capability through depthwise convolution (DWConv) and AAW mechanisms. The feature-alignment strategy of the AFA module addresses the dimension mismatch that conventional convolutional neural networks encounter in cross-scale feature fusion, laying an efficient foundation for subsequent processing.
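To give a sense of why depthwise convolution reduces parameters, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1 × 1 pointwise convolution. This is generic accounting for a depthwise-separable layer. The layer shape is hypothetical, and the actual layout of the AFA block is not specified here, so the numbers do not describe MSANet itself.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution plus 1 x 1 pointwise convolution (bias omitted)."""
    return k * k * c_in + c_in * c_out

def macs(params, h_out, w_out):
    """Multiply-accumulates: every weight is applied once per output position."""
    return params * h_out * w_out

# Hypothetical layer: 256 -> 256 channels, 3 x 3 kernel, 40 x 40 output map.
c_in, c_out, k, h, w = 256, 256, 3, 40, 40
std = conv_params(c_in, c_out, k)            # 589,824 weights
sep = dw_separable_params(c_in, c_out, k)    # 67,840 weights
ratio = sep / std                            # roughly 0.115
```

For this layer the separable form needs about 11.5% of the weights and, by the same factor, about 11.5% of the multiply-accumulates, which is the kind of saving that makes lighter backbones possible.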
When the SSAFF module is introduced alone, the parameter count increases to 22.3 M and the computational cost reaches 70.1 GFLOPs, but mAP@50 rises to 87.8%. The parameter growth is mainly due to the frequency-domain processing components in SSAFF. In particular, the FTD block needs to learn complex frequency-domain filters, and the multi-scale attention mechanism in the MFI block adds computational overhead. However, combining SSAFF with MSANet produces a clear synergistic effect: the number of parameters falls to 16.4 M, below the baseline model, while mAP@50 improves further to 89.0%. This simultaneous reduction in size and gain in accuracy stems from the tight integration of the two modules: the refined feature alignment provided by MSANet lets SSAFF perform its frequency-domain operations in a compressed feature space, which lowers the dimensionality required of the frequency-domain filters in the FTD block.
Introducing the IPIoUv2 loss function alone increases mAP@50 to 86.4%. More importantly, when combined with the complete architecture, it raises mAP@50 to 90.2% while maintaining parameter efficiency. Relative to the baseline, the final AD-DETR model improves mAP@50 by 5.4 percentage points while reducing parameters by 17.6% (19.9 M to 16.4 M) and computation by 17.2% (57.0 to 47.2 GFLOPs), which verifies the synergy obtained through structural optimization and functional complementarity between components. This design embodies the concept of “fine preprocessing + intelligent fusion” and provides a favorable accuracy–efficiency balance for real-time disease detection in agricultural scenarios.

4.2. Comparison with Representative IoU-Based Losses

To further demonstrate the necessity of IPIoUv2, we replaced only the bounding-box regression loss while keeping the AD-DETR architecture, dataset split, training schedule, and augmentation pipeline unchanged. As shown in Table 7, IPIoUv2 achieved the highest mAP@50 and mAP@50–95. Compared with CIoU and DIoU, IPIoUv2 improved localization by combining a geometric penalty with scale-adaptive auxiliary boxes. Compared with WIoU and PIoUv2, it showed better recall, suggesting that the inner-box strategy helps the model learn from small and irregular disease regions.
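The auxiliary-box idea behind inner-box strategies can be illustrated with a short sketch in the spirit of Inner-IoU [33]: both boxes are rescaled about their own centers by a common ratio before the overlap is measured. This is not the IPIoUv2 loss itself, whose full formulation is not reproduced in this section. The function names, the ratios 0.8 and 1.2, and the example boxes are illustrative assumptions.

```python
def scaled_box(box, ratio):
    """Shrink or enlarge (x1, y1, x2, y2) about its center by `ratio`."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    half_w = (box[2] - box[0]) * ratio / 2
    half_h = (box[3] - box[1]) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

def iou(a, b):
    """Plain IoU of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def auxiliary_iou(pred, gt, ratio):
    """IoU between the rescaled auxiliary versions of both boxes."""
    return iou(scaled_box(pred, ratio), scaled_box(gt, ratio))

pred = (0.0, 0.0, 10.0, 10.0)
gt = (2.0, 0.0, 12.0, 10.0)           # same size, shifted 2 units right
plain = iou(pred, gt)                 # 80 / 120
inner = auxiliary_iou(pred, gt, 0.8)  # 48 / 80
outer = auxiliary_iou(pred, gt, 1.2)  # 120 / 168
```

For the same center offset, the smaller auxiliary boxes give a lower overlap (0.60 versus 0.67) and the larger ones a higher overlap (about 0.71), so the ratio controls how strictly a given misalignment is penalized.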

4.3. Comparative Analysis

To evaluate AD-DETR comprehensively on crop disease detection, we compared it with state-of-the-art object detection architectures: Faster R-CNN [40], SSD [41], RetinaNet [42], YOLOv8m [43], YOLOv10m [44], YOLOv12m [45], and the original RT-DETR variants (r18, r34, r50). The results are reported in Table 8.
As shown in Table 8, traditional two-stage detectors such as Faster R-CNN exhibit limited performance, with 42.0% precision, 35.3% recall, and 36.6% mAP@50, indicating their inadequacy for complex crop disease detection tasks. SSD shows improved precision but suffers from low recall, suggesting a large number of missed detections in agricultural scenarios.
Among single-stage detectors, RetinaNet achieves competitive results with 84.3% precision and 77.4% recall, while YOLOv8m demonstrates strong recall performance but relatively lower precision. Notably, YOLOv10m achieves the highest precision among all compared methods but shows a noticeable recall drop, highlighting the challenge of balancing precision and recall in disease detection.
The baseline RT-DETR models show consistent performance, with RT-DETR-r50 achieving 89.9% precision and 89.7% recall. However, our proposed AD-DETR outperforms all comparative methods in the critical mAP@50 metric, achieving 90.2% while maintaining excellent precision and recall.
More importantly, AD-DETR achieves this performance with the most efficient architecture among all compared methods. With only 16.4 M parameters and 47.2 GFLOPs, our model reduces computational complexity by 17.2% compared to RT-DETR-r18 while improving mAP@50 by 5.4 percentage points. The inference speed of 230 FPS further demonstrates AD-DETR’s real-time capability, making it suitable for practical agricultural applications that require fast disease diagnosis.
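The relative changes quoted for the comparison with the RT-DETR-r18 baseline can be checked from the values reported in the ablation study (19.9 M parameters, 57.0 GFLOPs, and 84.8% mAP@50 for the baseline, against 16.4 M, 47.2 GFLOPs, and 90.2% for AD-DETR). The helper name below is an illustrative choice.

```python
def relative_reduction(baseline, new):
    """Percentage reduction of `new` relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline

params_red = relative_reduction(19.9, 16.4)        # parameters, in millions
flops_red = relative_reduction(57.0, 47.2)         # GFLOPs
map_gain_points = 90.2 - 84.8                      # absolute gain, percentage points
map_gain_relative = 100.0 * (90.2 - 84.8) / 84.8   # gain relative to the baseline mAP@50
```

The computation reduction works out to about 17.2% and the parameter reduction to about 17.6%. The 5.4 figure is an absolute difference in percentage points, which corresponds to a relative gain of roughly 6.4% over the baseline mAP@50.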
The superior performance of AD-DETR can be attributed to its design: the MSANet backbone handles scale variations in disease symptoms, the SSAFF module enhances feature representation through spatial–spectral fusion, and the IPIoUv2 loss function optimizes bounding-box regression for irregular disease patterns. These components work together to address the particular challenges of crop disease detection, including small lesion sizes, complex backgrounds, and subtle symptom variations.
These findings indicate that AD-DETR achieves a favorable balance between detection accuracy and computational efficiency, making it especially suitable for deployment in resource-constrained agricultural settings where real-time performance and accuracy are equally important.

4.4. Generalization Test

To assess the generalization ability of AD-DETR across diverse agricultural environments, we performed cross-dataset validation experiments using the PlantDoc dataset as the test platform. This dataset differs substantially from the primary training dataset in plant species, disease manifestation patterns, and imaging conditions, and therefore tests the model’s robustness to domain shift.
The results in Table 9 show that AD-DETR performs strongly on the PlantDoc dataset, with key metrics surpassing those of the compared models. Specifically, AD-DETR achieves 96.4% precision, 94.2% recall, and 97.4% mAP@50. These results exceed those of traditional detectors and also outperform current mainstream deep learning methods. Furthermore, AD-DETR is efficient, with only 16.4 M parameters, a computational load of 47.2 GFLOPs, and an inference speed of 242 FPS. This efficiency makes it well suited to resource-constrained real-time agricultural applications, such as disease monitoring on drones or mobile devices.
Traditional detectors show clear limitations in this comparison. Faster R-CNN achieves only 71.8% mAP@50, and its high parameter count and computational load impede real-time deployment. While SSD shows improved precision, its low recall indicates severe missed detections in unfamiliar environments. Among contemporary methods, YOLOv10m is competitive in efficiency but shows a clear gap in mAP@50 compared to AD-DETR. Although a variant of RT-DETR achieves 95.9% mAP@50, its parameter count and computational load are substantially higher than those of AD-DETR, highlighting the latter’s lightweight advantage.

4.5. Visualization of Model Predictions

To qualitatively validate the detection performance and attention behavior of the proposed AD-DETR model, this section presents visual comparisons of detection results and Grad-CAM visualizations. These analyses provide intuitive insights into the model’s capability to precisely localize disease spots and focus on relevant regions, complementing the quantitative metrics discussed earlier.
Figure 6 compares the detection results of the original RT-DETR model and AD-DETR on representative crop disease images. In these examples, RT-DETR produces false positives and missed detections, whereas AD-DETR bounds all disease spots accurately while minimizing background interference. This visual evidence supports the effectiveness of AD-DETR in handling the diverse manifestations of diseases in complex agricultural scenarios.
Additionally, Figure 7 shows Grad-CAM visualizations used to analyze the spatial focus of the models. The heatmaps generated by RT-DETR exhibit a dispersed attention pattern, often highlighting irrelevant background areas or only partially covering disease regions, which contributes to its weaker performance. Conversely, AD-DETR’s heatmaps are concentrated on the core disease areas, indicating a more targeted and reliable feature-extraction process. This enhanced focus can be attributed to the SSAFF module, which integrates frequency-domain information to improve feature discriminability. The sharper heatmaps suggest that AD-DETR suppresses noise and prioritizes salient disease features, leading to better localization accuracy.
As shown in Figure 8, AD-DETR still makes errors when disease symptoms are similar to healthy leaf veins, under low contrast, or in the presence of severe occlusion. These cases reveal the need for further improvement in difficult field environments.

4.6. Confusion Matrix and Error Analysis

To further analyze the class-wise detection performance of AD-DETR, a normalized confusion matrix was generated for the Crop Disease validation set, as shown in Figure 9. The matrix includes all 41 crop disease/healthy categories and the background category. The horizontal axis represents the true class, while the vertical axis represents the predicted class. Predictions were matched with ground-truth boxes using an IoU threshold of 0.5. Diagonal elements indicate correct detections, whereas off-diagonal elements indicate misclassification or confusion with the background.
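The construction of such a matrix can be sketched as follows, using the same axis convention (rows are predicted classes, columns are true classes). Normalizing each true-class column to sum to 1 is an assumption, since the normalization direction is not stated, and the class names and matched pairs are invented for the example.

```python
def normalized_confusion(pairs, classes):
    """Return matrix[pred][true], with each true-class column summing to 1.

    pairs: iterable of (true_label, predicted_label) after IoU >= 0.5 matching;
    'background' stands for a missed object or a detection with no ground truth.
    """
    index = {c: i for i, c in enumerate(classes)}
    n = len(classes)
    counts = [[0] * n for _ in range(n)]
    for true, pred in pairs:
        counts[index[pred]][index[true]] += 1
    col_sums = [sum(counts[r][c] for r in range(n)) for c in range(n)]
    # Assumed normalization: divide by the number of samples of each true class.
    return [[counts[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(n)]
            for r in range(n)]

classes = ["early_blight", "late_blight", "background"]
pairs = [
    ("early_blight", "early_blight"),
    ("early_blight", "early_blight"),
    ("early_blight", "late_blight"),   # confusion between similar diseases
    ("early_blight", "background"),    # missed detection (false negative)
    ("late_blight", "late_blight"),
    ("background", "late_blight"),     # false positive on background
]
m = normalized_confusion(pairs, classes)
```

In this toy matrix the diagonal entry for early blight is 0.5, with the remaining mass split between confusion with late blight and the background row, which is how missed detections appear in the plot.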
As shown in Figure 9, most categories present strong diagonal responses, indicating that AD-DETR correctly detects the majority of crop disease classes. The off-diagonal values are generally weak, suggesting that the proposed model has good discriminative ability across multiple crops and disease types. However, some errors still occur between visually similar categories. For example, several tomato diseases, such as early blight, late blight, bacterial spot, Septoria leaf spot, and leaf mold, may be confused because they share similar necrotic spots, yellowing symptoms, and irregular lesion boundaries. Similar confusion can also be observed among diseases from the same crop, such as grape black rot, grape leaf blight, and grape esca, as well as potato early blight and potato late blight.
In addition, some elongated leaf-lesion diseases, such as corn leaf blight, rice leaf blast, and rice brown spot, may be confused when lesions are small, low-contrast, or distributed along leaf veins. Cassava-related categories also show several off-diagonal responses, mainly because mottling, chlorosis, and weak color transitions make their visual differences less obvious. False positives are mainly caused by healthy veins, shadows, or senescent leaf edges that resemble disease texture, while false negatives usually occur under small-lesion, occlusion, blurred-image, or low-contrast conditions. These findings are consistent with the qualitative error cases shown in Figure 8.
Overall, the confusion matrix confirms that AD-DETR achieves strong class-wise detection performance, with most errors concentrated in visually similar diseases and difficult background conditions. Future work should further improve the recognition of early-stage and low-contrast symptoms by using more balanced samples, hard negative mining, and higher-resolution lesion features.

5. Discussion

The objective of this study was to develop and validate a real-time transformer detector for crop disease detection that improves localization accuracy, generalization, and computational efficiency. The results support the proposed hypotheses. AD-DETR achieved 90.2% mAP@50 on the Crop Disease dataset and 97.4% mAP@50 on PlantDoc while retaining 16.4 M parameters, 47.2 GFLOPs, and real-time inference speed. These findings indicate that adaptive multi-scale alignment, spatial–spectral fusion, and improved regression loss jointly improve the accuracy–efficiency balance of crop disease detection.
The first hypothesis, that adaptive multi-scale alignment improves disease-feature representation, is supported by the ablation results. MSANet reduced parameters from 19.9 M to 12.3 M and slightly improved mAP@50 when used alone. More importantly, when MSANet was combined with SSAFF, the model reached 89.0% mAP@50 while keeping parameters at 16.4 M. This result is consistent with prior evidence that multi-scale feature representation is critical for object detection. In crop disease images, lesions vary from small early-stage spots to large necrotic areas. A model that aligns cross-scale features can therefore reduce missed detections caused by scale mismatch and background interference.
The second hypothesis, that spatial–spectral fusion improves disease discrimination, is also supported. SSAFF increased mAP@50 to 87.8% when used alone, despite the additional computational cost. This indicates that frequency-domain cues can provide complementary information to spatial features. Prior plant disease studies have shown that leaf lesions often include subtle texture, color-transition, and edge-distribution patterns. Unlike methods that use frequency-domain operations as preprocessing, SSAFF integrates Fourier-based downsampling into the detection architecture, allowing global texture information to interact with attention-based spatial features. This helps the model concentrate on disease regions, as shown by the sharper Grad-CAM responses.
The third hypothesis, that the improved IoU-based loss enhances localization, is supported by the performance of IPIoUv2. When IPIoUv2 was introduced alone, mAP@50 improved to 86.4%, and the complete model achieved the highest mAP@50 and mAP@50–95. Irregular lesion boundaries and imbalanced class distributions are common in plant disease detection, and conventional bounding-box regression may converge slowly or produce suboptimal localization. By combining geometric penalties and scale-adaptive auxiliary boxes, IPIoUv2 provides stricter localization guidance for high-quality samples while improving stability for difficult targets.
Compared with previous crop disease studies, AD-DETR addresses a different and more deployment-oriented task. Several CNN-based studies have reported high classification accuracy under curated or semi-controlled image conditions. However, classification models often identify the image-level disease category without localizing the lesion area, limiting their utility for severity assessment, targeted treatment, and field monitoring. Object detection studies such as improved Faster R-CNN and YOLO-based detectors can localize disease regions, but they may trade speed for accuracy or show reduced robustness under scale variation. In this study, AD-DETR surpassed Faster R-CNN, SSD, RetinaNet, YOLOv8m, YOLOv10m, YOLOv12m, and RT-DETR variants in mAP@50 on the Crop Disease dataset while maintaining real-time speed. These comparisons suggest that the proposed design better balances lesion localization, computational cost, and inference speed.
The generalization results on PlantDoc further demonstrate the potential of AD-DETR for real agricultural scenarios. PlantDoc contains heterogeneous images with diverse plant species, disease symptoms, and field conditions. AD-DETR achieved 97.4% mAP@50 on this dataset, outperforming RT-DETR-r50 while requiring fewer parameters and lower computational cost. This finding is particularly important because real-world deployment often involves images collected under variable lighting, crop growth stages, camera distances, and background complexity. The result supports the fourth hypothesis that the integrated architecture improves accuracy and efficiency under domain shift.
The broader implication of this work is that transformer-based detection can be adapted to agricultural edge intelligence when architectural components are designed for field-specific challenges. The lightweight and fast AD-DETR model may support real-time monitoring on unmanned aerial vehicles, mobile phones, field robots, and greenhouse cameras. Accurate lesion localization can assist precision spraying, early warning, yield-loss reduction, and disease-spread monitoring. Because the model produces bounding boxes rather than only class labels, its outputs may also support downstream disease severity estimation, treatment prioritization, and agronomic decision-making.
Nevertheless, the model still has limitations. First, performance may decline under extreme lighting, strong occlusion, or backgrounds that resemble disease texture, as shown in the error cases. Second, although the Crop Disease dataset covers many categories, rare diseases and early-stage symptoms may remain underrepresented. Third, frequency-domain downsampling improves global texture representation but may still lose some high-frequency details that are useful for very small lesions. Fourth, this study focuses on RGB images; environmental variables such as temperature, humidity, and crop growth stage were not incorporated. Future work should expand multimodal data sources, combine visual and environmental sensor information, and evaluate the model in long-term field trials.
Future research will focus on three directions. First, multimodal sensing will be introduced to improve robustness under complex field conditions. Second, adaptive lightweight strategies will be developed through dynamic network structures to reduce computation for edge devices. Third, incremental and continual learning mechanisms will be explored so that AD-DETR can adapt to new disease types without full retraining. These directions will help extend the practical value of AD-DETR in precision agriculture.

6. Conclusions

This paper proposes AD-DETR, a real-time detection framework designed for crop disease detection. Through multi-scale feature alignment, spatial–spectral fusion, and bounding-box regression optimization, it addresses key challenges in agricultural computer vision. Experimental validation shows that AD-DETR achieves the highest mAP@50 among the compared detectors on both datasets while remaining computationally efficient. Specifically, on the Crop Disease dataset, the model achieves 90.2% mAP@50, a 5.4-percentage-point improvement over the baseline. The model has only 16.4 M parameters and a computational complexity of 47.2 GFLOPs, with an inference speed of 230 FPS on the Crop Disease dataset and 242 FPS on PlantDoc. In generalization tests on the PlantDoc dataset, mAP@50 reaches 97.4%, supporting its practicality in real-world agricultural environments.
Methodologically, the core innovations of AD-DETR include: MSANet, which achieves adaptive alignment of multi-scale features through the AFA block, reducing feature bias and parameter redundancy; the SSAFF module, which integrates spatial attention and frequency-domain processing to enhance the discriminative capability of disease features; and the IPIoUv2 loss function, which combines geometric penalties and scale-adaptive strategies to optimize the precision and convergence speed of bounding-box regression. Ablation experiments validate the independent contributions and synergistic effects of each component, and the overall architecture balances accuracy and efficiency while remaining lightweight.
Although AD-DETR performs excellently in most scenarios, it still has limitations. Future research can be expanded in three aspects: first, incorporating multimodal data sources and combining environmental sensor information to improve model robustness; second, developing adaptive lightweight strategies to optimize computational efficiency through dynamic network structures; and third, exploring incremental learning mechanisms to enable the model to continuously adapt to new disease types. These directions will promote the broader application of AD-DETR in precision agriculture.
Overall, through methodological innovations and empirical validation, AD-DETR provides an efficient and reliable solution for crop disease detection and is expected to extend to broader areas of plant science and precision agriculture.

Author Contributions

Conceptualization, B.W.; Methodology, B.W.; Software, R.C.; Validation, R.C.; Formal analysis, R.C.; Investigation, R.C.; Resources, Z.W.; Data curation, Z.W.; Writing—original draft, B.W.; Writing—review & editing, H.Z.; Visualization, Z.W.; Supervision, H.Z.; Project administration, H.Z.; Funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Heilongjiang Provincial Natural Science Foundation of China, grant number PL2024F009, and the Harbin Normal University Postgraduate Innovative Research Project, grant number HSDSSCX2025-70. The APC was funded by the corresponding author, Huibo Zhou.

Data Availability Statement

Due to the nature of this research, participants in this study did not agree for their data to be shared publicly, so supporting data are not available.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Ashok, S.; Kishore, G.; Rajesh, V.; Suchitra, S.; Sophia, S.G.; Pavithra, B. Tomato leaf disease detection using deep learning techniques. In Proceedings of the 2020 5th International Conference on Communication and Electronics Systems (ICCES); IEEE: Piscataway, NJ, USA, 2020; pp. 979–983.
  2. Harakannanavar, S.S.; Rudagi, J.M.; Puranikmath, V.I.; Siddiqua, A.; Pramodhini, R. Plant leaf disease detection using computer vision and machine learning algorithms. Glob. Transit. Proc. 2022, 3, 305–310.
  3. Eunice, J.; Popescu, D.E.; Chowdary, M.K.; Hemanth, J. Deep learning-based leaf disease detection in crops using images for agricultural applications. Agronomy 2022, 12, 2395.
  4. Vasavi, P.; Punitha, A.; Rao, T.V.N. Crop leaf disease detection and classification using machine learning and deep learning algorithms by visual symptoms: A review. Int. J. Electr. Comput. Eng. 2022, 12, 2079.
  5. Loranger, M.E.W.; Yim, W.; Accomazzi, V.; Morales-Lizcano, N.; Moeder, W.; Yoshioka, K. Colour-analyzer: A new dual colour model-based imaging tool to quantify plant disease. Plant Methods 2024, 20, 60.
  6. Bhujade, V.G.; Sambhe, V.; Banerjee, B. Digital image noise removal towards soybean and cotton plant disease using image processing filters. Expert Syst. Appl. 2024, 246, 123031.
  7. Sahu, S.K.; Pandey, M. An optimal hybrid multiclass SVM for plant leaf disease detection using spatial Fuzzy C-Means model. Expert Syst. Appl. 2023, 214, 118989.
  8. Jayapalan, D.F.S.; Ananth, J.P. Root disease classification with hybrid optimization models in IoT. Expert Syst. Appl. 2023, 226, 120150.
  9. Kundu, R.; Chauhan, U.; Chauhan, S. Plant leaf disease detection using image processing. In Proceedings of the 2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM); IEEE: Piscataway, NJ, USA, 2022; Volume 2, pp. 393–396.
  10. Chowdhury, M.E.; Rahman, T.; Khandakar, A.; Ayari, M.A.; Khan, A.U.; Khan, M.S.; Al-Emadi, N.; Reaz, M.B.I.; Islam, M.T.; Ali, S.H.M. Automatic and reliable leaf disease detection using deep learning techniques. AgriEngineering 2021, 3, 294–312.
  11. Gong, X.; Zhang, S. A high-precision detection method of apple leaf diseases using improved faster R-CNN. Agriculture 2023, 13, 240.
  12. Sangaiah, A.K.; Yu, F.N.; Lin, Y.B.; Shen, W.C.; Sharma, A. UAV T-YOLO-rice: An enhanced tiny YOLO networks for rice leaves diseases detection in paddy agronomy. IEEE Trans. Netw. Sci. Eng. 2024, 11, 5201–5216.
  13. Faisal, S.; Javed, K.; Ali, S.; Alasiry, A.; Marzougui, M.; Khan, M.A.; Cha, J.H. Deep transfer learning based detection and classification of citrus plant diseases. Comput. Mater. Contin. 2023, 76, 895–914.
  14. Zhang, D.; Huang, Y.; Wu, C.; Ma, M. Detecting tomato disease types and degrees using multi-branch and destruction learning. Comput. Electron. Agric. 2023, 213, 108244.
  15. Mohanty, S.P.; Hughes, D.P.; Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 2016, 7, 1419.
  16. Hughes, D.; Salathé, M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv 2015, arXiv:1511.08060.
  17. Ferentinos, K.P. Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 2018, 145, 311–318.
  18. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A dataset for visual plant disease detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD; ACM: New York, NY, USA, 2020; pp. 249–253.
  19. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of YOLO algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
  20. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
  21. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
  22. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6054–6063.
  23. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  24. Mahmood, M.A.; Alsalem, K. Olive Leaf Disease Detection via Wavelet Transform and Feature Fusion of Pre-Trained Deep Learning Models. Comput. Mater. Contin. 2024, 78, 3431–3448.
  25. Wang, P.; Tong, L.; Gong, X.; Gao, B. A Study on Wavelet Transform-Based Inversion Method for Forest Leaf Area Index Retrieval. Forests 2025, 16, 736.
  26. Lei, H.; Hu, Y.; Wang, M.; Ding, M.; Li, Z.; Luo, G. Fast Fourier Asymmetric Context Aggregation Network: A Controllable Photo-Realistic Clothing Image Synthesis Method Using Asymmetric Context Aggregation Mechanism. Appl. Sci. 2025, 15, 3534.
  27. Deng, J.; Ma, Y.; Li, D.-a.; Zhao, J.; Liu, Y.; Zhang, H. Classification of breast density categories based on SE-Attention neural networks. Comput. Methods Programs Biomed. 2020, 193, 105489.
  28. Chen, J.; Mai, H.; Luo, L.; Chen, X.; Wu, K. Effective feature fusion network in BIFPN for small object detection. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2021; pp. 699–703.
  29. Teng, W.; Wang, C.; Fei, S. Research on coal gangue recognition algorithm based on HGTC-YOLOv8n model. J. Mine Autom. 2024, 50, 52–59.
  30. Zhu, J.; Hu, T.; Zheng, L.; Zhou, N.; Ge, H.; Hong, Z. YOLOv8-C2f-Faster-EMA: An improved underwater trash detection model based on YOLOv8. Sensors 2024, 24, 2483.
  31. Huang, S.; Wang, Q.; Zhang, S.; Yan, S.; He, X. Dynamic context correspondence network for semantic alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2010–2019.
  32. Liu, C.; Wang, K.; Li, Q.; Zhao, F.; Zhao, K.; Ma, H. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2024, 170, 276–284.
  33. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
  34. Qian, X.; Zhang, N.; Wang, W. Smooth GIoU loss for oriented object detection in remote sensing images. Remote Sens. 2023, 15, 1259.
  35. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
  36. Du, S.; Zhang, B.; Zhang, P.; Xiang, P. An improved bounding box regression loss function based on CIOU loss for multi-scale object detection. In Proceedings of the 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML); IEEE: Piscataway, NJ, USA, 2021; pp. 92–98.
  37. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
  38. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
  39. Cho, Y.J. Weighted Intersection over Union (wIoU) for evaluating image segmentation. Pattern Recognit. Lett. 2024, 185, 101–107.
  40. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
  41. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
  42. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. Automatic ship detection based on RetinaNet using multi-resolution Gaofen-3 imagery. Remote Sens. 2019, 11, 531.
  43. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS); IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar]
  44. Hussain, M.; Khanam, R. In-depth review of yolov1 to yolov10 variants for enhanced photovoltaic defect detection. Solar 2024, 4, 351–386. [Google Scholar] [CrossRef]
  45. Ma, J.; Zhou, Y.; Zhou, Z.; Zhang, Y.; He, L. Toward smart ocean monitoring: Real-time detection of marine litter using YOLOv12 in support of pollution mitigation. Mar. Pollut. Bull. 2025, 217, 118136. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of AD-DETR for crop disease detection. MSANet = Multi-Scale Align Network; AFA = Adapt Fusion Align; SSAFF = Spatial–Spectral Attentive Feature Fusion; MFI = Multi-scale Focus Integration; FTD = Fourier Transform Downsampling.
Figure 2. Structure of the AFA block used in MSANet. The two input feature maps are first channel-aligned by 1 × 1 convolution, then fused through the Adapt Align Weight (AAW) mechanism. AFA = Adapt Fusion Align; AAW = Adapt Align Weight.
Figure 3. Detailed structure of the MFI and HA blocks in the SSAFF module. MFI = Multi-scale Focus Integration; HA = Hierarchical Aware; SSAFF = Spatial–Spectral Attentive Feature Fusion; p is the patch-size parameter used in the HA block, with p = 4 and p = 2 representing global and local branches, respectively.
Figure 4. Structure of the FTD block used for frequency-domain downsampling. FTD = Fourier Transform Downsampling; FFT = Fast Fourier Transform; IFFT = Inverse Fast Fourier Transform.
Figure 5. Representative disease images from the datasets. (a) shows examples from the Crop Disease dataset, including soybean bacterial blight, strawberry leaf scorch, rice leaf blight, and leaf-spot symptoms. (b) shows examples from the PlantDoc dataset, including apple rust, tomato early blight, tomato bacterial spot, and grape leaf disease. Each panel illustrates variation in lesion size, color, texture, and background complexity.
Figure 6. Detection comparison between RT-DETR and AD-DETR on representative crop disease images. (a) shows the original images; (b) shows RT-DETR results; (c) shows AD-DETR results. The sample diseases include bell pepper bacterial spot, soybean leaf disease, apple rust, tomato bacterial spot, and tomato early blight.
Figure 7. Grad-CAM visualization for representative crop disease samples. (a) shows original images; (b) shows RT-DETR attention maps; (c) shows AD-DETR attention maps. The examples include leaf blight, early blight-like lesions, bacterial spot, and rust-like symptoms.
Figure 8. False-positive and false-negative cases of AD-DETR under challenging conditions. (a) shows original images; (b) shows AD-DETR detection results. Disease names visible in the samples include corn leaf blight, tomato yellow leaf curl virus, grape leaf disease, and tomato early blight. These cases illustrate errors caused by low contrast, occlusion, background similarity, and overlapping leaves.
Figure 9. Normalized confusion matrix for representative high-frequency crop disease classes. The matrix is row-normalized; darker diagonal cells indicate correct detections, whereas off-diagonal cells indicate misclassification or confusion between visually similar disease categories.
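The row normalization mentioned in the Figure 9 caption can be illustrated with a minimal sketch. The counts below are hypothetical and are not taken from the paper; the helper name `row_normalize` is likewise an assumption made for illustration. Each row (true class) is divided by its sum, so the diagonal entry of a row can be read as the fraction of that class that was detected correctly.

```python
def row_normalize(matrix):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against classes with no samples to avoid division by zero.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical counts: rows are true classes, columns are predicted classes.
counts = [
    [90, 10, 0],
    [5, 80, 15],
    [0, 20, 80],
]

norm = row_normalize(counts)
for row in norm:
    print(" ".join(f"{v:.2f}" for v in row))
```

With row normalization, off-diagonal values in a row show where samples of that true class were sent, which is how confusion between visually similar categories becomes visible.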
Table 1. Conceptual comparison among AFA, SE, and BiFPN. AFA is designed for adaptive cross-scale feature alignment, whereas SE mainly recalibrates channel responses and BiFPN learns pyramid-level fusion weights.
| Method | Input Condition | Weight Generation | Fusion/Alignment Behavior | Main Difference from AFA |
|---|---|---|---|---|
| SE | Single feature map | Global pooling and channel excitation | Recalibrates channel responses within one scale | Does not explicitly align two heterogeneous feature maps or model cross-scale spatial offsets |
| BiFPN | Multi-scale feature pyramid | Learnable normalized scalar weights | Repeated bidirectional pyramid fusion | Uses scale-level weights but does not perform feature-specific channel projection and spatially varying gated alignment |
| AFA (ours) | Two heterogeneous cross-scale features | Concatenation, 3 × 3 convolution, sigmoid split, and learnable branch coefficients | Channel alignment, spatial-channel gating, and adaptive branch fusion | Jointly performs feature-space alignment and adaptive cross-scale fusion for small and irregular lesions |
Table 2. Dataset statistics used in this study. The Crop Disease dataset was constructed in this work; PlantDoc is a public benchmark dataset cited from Singh et al. [18]. Images with at least one annotated disease region were counted as diseased images.
| Dataset | Source/Citation | Total Images | Crops/Species | Categories | Diseased Images | Healthy Images |
|---|---|---|---|---|---|---|
| Crop Disease | Self-constructed in this study | 37,714 | >10 crops | 41 | 33,286 | 4428 |
| PlantDoc | Public benchmark [18] | 2569 | 13 species | 30 | 2118 | 451 |
Table 3. Image distribution of the 41 categories in the self-constructed Crop Disease dataset. The distribution shows that the dataset contains class imbalance, with healthy categories generally having fewer images than major disease categories.
| Category | Images | Ratio (%) | Category | Images | Ratio (%) | Category | Images | Ratio (%) |
|---|---|---|---|---|---|---|---|---|
| apple healthy | 420 | 1.11 | bell pepper bacterial spot | 1111 | 2.95 | cassava bacterial blight | 1159 | 3.07 |
| bell pepper healthy | 390 | 1.03 | cherry powdery mildew | 1037 | 2.75 | cassava green mottle | 1098 | 2.91 |
| cherry healthy | 350 | 0.93 | corn perispore leaf spot | 1184 | 3.14 | strawberry leaf scorch | 1050 | 2.78 |
| corn healthy | 430 | 1.14 | corn common rust | 1135 | 3.01 | tomato early blight | 1196 | 3.17 |
| grape healthy | 400 | 1.06 | northern leaf blight | 1111 | 2.95 | tomato late blight | 1208 | 3.20 |
| peach healthy | 320 | 0.85 | grape black rot | 1123 | 2.98 | tomato bacterial spot | 1184 | 3.14 |
| potato healthy | 380 | 1.01 | grape leaf blight | 1086 | 2.88 | spider mites two-spotted spider mite | 1135 | 3.01 |
| rice healthy | 440 | 1.17 | grape esca | 1050 | 2.78 | cassava brown streak disease | 1074 | 2.85 |
| cassava healthy | 410 | 1.09 | peach bacterial spot | 1037 | 2.75 | tomato leaf mould | 1062 | 2.82 |
| strawberry healthy | 330 | 0.88 | potato early blight | 1111 | 2.95 | tomato septoria leaf spot | 1098 | 2.91 |
| tomato leaf healthy | 558 | 1.48 | potato late blight | 1123 | 2.98 | squash powdery mildew | 1074 | 2.85 |
| apple black rot | 1062 | 2.82 | rice hispa | 1172 | 3.11 | orange citrus greening | 1147 | 3.04 |
| cedar apple rust | 1147 | 3.04 | rice brown spot | 1086 | 2.88 | mosaic cassava | 1054 | 2.79 |
| apple scab | 1074 | 2.85 | rice leaf blast | 1098 | 2.91 | | | |
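The ratios in Table 3 and the healthy/diseased totals in Table 2 can be cross-checked with simple arithmetic. The sketch below uses only counts printed in the two tables; the variable and function names are illustrative assumptions, and the largest-to-smallest class ratio is a derived quantity that the paper does not report.

```python
TOTAL_IMAGES = 37714  # Crop Disease total from Table 2

# Healthy-category image counts copied from Table 3.
healthy = {
    "apple": 420, "bell pepper": 390, "cherry": 350, "corn": 430,
    "grape": 400, "peach": 320, "potato": 380, "rice": 440,
    "cassava": 410, "strawberry": 330, "tomato leaf": 558,
}


def ratio_percent(count, total=TOTAL_IMAGES):
    """Share of the dataset held by one category, in percent (2 decimals)."""
    return round(100.0 * count / total, 2)


healthy_total = sum(healthy.values())          # matches Table 2: 4428
diseased_total = TOTAL_IMAGES - healthy_total  # matches Table 2: 33,286

# Largest class (tomato late blight, 1208) vs. smallest (peach healthy, 320).
imbalance = 1208 / min(healthy.values())

print(healthy_total, diseased_total)
print(ratio_percent(420), ratio_percent(1208))
print(round(imbalance, 2))
```

The healthy categories sum to the Healthy Images figure in Table 2, and the largest class holds a little under four times as many images as the smallest, which quantifies the class imbalance noted in the Table 3 caption.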
Table 4. Training hardware and hyperparameter configuration used for AD-DETR experiments.
| Parameter | Configuration |
|---|---|
| Training epochs | 200 |
| Batch size | 16 |
| Workers | 4 |
| Learning rate | 0.0001 |
| Optimizer | AdamW (implemented in PyTorch v1.13.1) |
| Input image size | 640 × 640 |
Table 5. Unified training and evaluation protocol for comparison models. All models used the same dataset split, input resolution, and timed inference procedure.
| Model Group | Epochs | Input Size | Batch Size | Optimizer/Scheduler | Data Augmentation |
|---|---|---|---|---|---|
| Faster R-CNN, SSD, RetinaNet | 200 | 640 × 640 | 16 | SGD/AdamW with cosine decay | Resize, horizontal flip, color jitter, random crop |
| YOLOv8m, YOLOv10m, YOLOv12m | 200 | 640 × 640 | 16 | AdamW with cosine decay | Mosaic, random scale, flip, HSV/color jitter |
| RT-DETR variants | 200 | 640 × 640 | 16 | AdamW with cosine decay | Resize, random scale, flip, color jitter |
| AD-DETR (ours) | 200 | 640 × 640 | 16 | AdamW with cosine decay | Same as RT-DETR plus class-balanced rare-category augmentation |
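As a rough illustration of the "cosine decay" entry in Table 5 combined with the learning rate and epoch count in Table 4, the sketch below evaluates a standard cosine annealing curve. The paper does not state the minimum learning rate, whether warmup was used, or whether the schedule steps per epoch or per iteration; the values here (`min_lr = 0`, no warmup, per-epoch steps) are assumptions for illustration only.

```python
import math


def cosine_lr(epoch, base_lr=1e-4, total_epochs=200, min_lr=0.0):
    """Standard cosine annealing: base_lr at epoch 0, min_lr at the last epoch.

    base_lr and total_epochs follow Table 4; min_lr is an assumed value.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 50, 100, 150, 200):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

Under these assumptions the learning rate falls to half of its initial value at epoch 100 and reaches the assumed minimum at epoch 200.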
Table 6. Ablation analysis of MSANet, SSAFF, and IPIoUv2 on the Crop Disease dataset. The best results are shown in bold.
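| MSANet | SSAFF | IPIoUv2 | P | R | mAP@50 | mAP@50–95 | Params/M | GFLOPs/G |
|---|---|---|---|---|---|---|---|---|
| × | × | × | 0.867 | 0.854 | 0.848 | 0.820 | 19.9 | 57.0 |
| ✓ | × | × | 0.870 | 0.853 | 0.850 | 0.816 | 12.3 | 34.4 |
| × | ✓ | × | 0.893 | 0.880 | 0.878 | 0.844 | 22.3 | 70.1 |
| × | × | ✓ | 0.896 | 0.859 | 0.864 | 0.829 | 19.9 | 57.0 |
| ✓ | ✓ | × | 0.903 | 0.892 | 0.890 | 0.859 | 16.4 | 47.2 |
| × | ✓ | ✓ | 0.910 | 0.899 | 0.894 | 0.861 | 22.3 | 70.1 |
| ✓ | ✓ | ✓ | 0.898 | 0.895 | 0.902 | 0.868 | 16.4 | 47.2 |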
Table 7. Comparison of representative IoU-based losses on the Crop Disease dataset. Only the regression loss was changed; the network architecture and training protocol were unchanged. The best results are shown in bold.
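| Regression Loss | P | R | mAP@50 | mAP@50–95 |
|---|---|---|---|---|
| GIoU [34] | 0.895 | 0.887 | 0.893 | 0.858 |
| DIoU [35] | 0.893 | 0.884 | 0.890 | 0.855 |
| CIoU [36] | 0.889 | 0.878 | 0.884 | 0.848 |
| EIoU [37] | 0.891 | 0.881 | 0.887 | 0.852 |
| SIoU [38] | 0.884 | 0.872 | 0.879 | 0.841 |
| WIoU [39] | 0.896 | 0.889 | 0.895 | 0.860 |
| PIoUv2 | 0.897 | 0.891 | 0.897 | 0.862 |
| Inner-IoU | 0.894 | 0.890 | 0.896 | 0.861 |
| IPIoUv2 (ours) | 0.898 | 0.895 | 0.902 | 0.868 |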
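To make the family of losses in Table 7 more concrete, the sketch below computes plain IoU and the Inner-IoU overlap of Zhang et al. [33], in which both boxes are rescaled about their own centers by a ratio before the overlap is measured. This is not the authors' IPIoUv2: its internal perception mechanism and scale-adaptive weighting are not specified in this section. The function names, the example boxes, and the chosen ratios are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def scale_box(box, ratio):
    """Auxiliary box: same center, width and height multiplied by ratio."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    w, h = (box[2] - box[0]) * ratio, (box[3] - box[1]) * ratio
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


def inner_iou(a, b, ratio=0.75):
    """IoU measured on the rescaled auxiliary boxes (Inner-IoU overlap)."""
    return iou(scale_box(a, ratio), scale_box(b, ratio))


pred = (0.0, 0.0, 4.0, 4.0)  # hypothetical predicted box
gt = (2.0, 0.0, 6.0, 4.0)    # hypothetical ground-truth box

print(round(iou(pred, gt), 4))
print(inner_iou(pred, gt, ratio=0.5), inner_iou(pred, gt, ratio=1.5))
```

For the same pair of boxes, a ratio below 1 shrinks the auxiliary boxes and lowers the measured overlap, while a ratio above 1 enlarges them and raises it. A regression loss would typically be taken as one minus the chosen overlap.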
Table 8. Performance comparison on the Crop Disease dataset. The best results are shown in bold.
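| Network | P | R | mAP@50 | Params/M | GFLOPs/G | FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.420 | 0.353 | 0.366 | 41.4 | 134.0 | 23 |
| SSD | 0.657 | 0.386 | 0.512 | 24.8 | 217.0 | 84 |
| RetinaNet | 0.843 | 0.774 | 0.793 | 36.5 | 130.0 | 76 |
| YOLOv8m | 0.755 | 0.864 | 0.829 | 25.9 | 79.3 | 170 |
| YOLOv10m | 0.931 | 0.809 | 0.872 | 16.6 | 64.5 | 222 |
| YOLOv12m | 0.753 | 0.901 | 0.888 | 20.2 | 68.1 | 228 |
| RT-DETR-r18 | 0.867 | 0.854 | 0.848 | 19.9 | 56.9 | 211 |
| RT-DETR-r34 | 0.870 | 0.853 | 0.849 | 31.1 | 88.9 | 147 |
| RT-DETR-r50 | 0.899 | 0.897 | 0.885 | 41.9 | 129.6 | 97 |
| Ours | 0.898 | 0.895 | 0.902 | 16.4 | 47.2 | 230 |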
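The efficiency trade-off in Table 8 can be summarized by comparing AD-DETR with the RT-DETR-r18 row. The sketch below uses only values printed in the table; the helper name, the dictionary keys, and the derived quantities (relative change, per-frame latency as 1000/FPS) are illustrative and are not figures reported by the authors.

```python
def relative_change(new, old):
    """Percentage change of new relative to old (negative means a reduction)."""
    return 100.0 * (new - old) / old


# Rows copied from Table 8.
rt_detr_r18 = {"map50": 0.848, "params_m": 19.9, "gflops": 56.9, "fps": 211}
ad_detr = {"map50": 0.902, "params_m": 16.4, "gflops": 47.2, "fps": 230}

params_change = relative_change(ad_detr["params_m"], rt_detr_r18["params_m"])
gflops_change = relative_change(ad_detr["gflops"], rt_detr_r18["gflops"])
map_gain_points = 100.0 * (ad_detr["map50"] - rt_detr_r18["map50"])
latency_ms = 1000.0 / ad_detr["fps"]  # per-image time implied by the FPS value

print(f"params: {params_change:.1f}%  GFLOPs: {gflops_change:.1f}%")
print(f"mAP@50 gain: {map_gain_points:.1f} points  latency: {latency_ms:.2f} ms")
```

On these numbers, AD-DETR uses roughly 17 to 18% fewer parameters and GFLOPs than RT-DETR-r18 while gaining 5.4 mAP@50 points, and 230 FPS corresponds to about 4.35 ms per image.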
Table 9. Cross-dataset performance comparison on the PlantDoc dataset. The best results are shown in bold.
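| Network | P | R | mAP@50 | Params/M | GFLOPs/G | FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.705 | 0.672 | 0.718 | 41.4 | 134.0 | 15 |
| SSD | 0.857 | 0.763 | 0.862 | 24.8 | 217.0 | 67 |
| RetinaNet | 0.867 | 0.828 | 0.873 | 36.5 | 130.0 | 70 |
| YOLOv8m | 0.683 | 0.701 | 0.750 | 25.9 | 79.3 | 155 |
| YOLOv10m | 0.742 | 0.849 | 0.873 | 16.6 | 64.5 | 269 |
| YOLOv12m | 0.693 | 0.655 | 0.749 | 20.2 | 68.1 | 204 |
| RT-DETR-r18 | 0.861 | 0.853 | 0.903 | 19.9 | 56.9 | 201 |
| RT-DETR-r34 | 0.912 | 0.873 | 0.929 | 31.1 | 88.9 | 145 |
| RT-DETR-r50 | 0.950 | 0.912 | 0.959 | 41.9 | 129.6 | 101 |
| Ours | 0.964 | 0.942 | 0.974 | 16.4 | 47.2 | 242 |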
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Wang, B.; Zhou, H.; Wang, Z.; Chen, R. AD-DETR: A Real-Time Transformer with Multi-Scale Alignment and Spatial–Spectral Fusion for Crop Disease Detection. Sensors 2026, 26, 3206. https://doi.org/10.3390/s26103206
