2.1. Overall Architecture
AD-DETR, the model proposed in this paper, is a deep learning model designed specifically for crop disease detection. Its overall architecture builds on an enhanced real-time detection transformer framework and is optimized for the specific requirements of agricultural scenarios. As shown in Figure 1, the network uses an encoder–decoder structure composed of three core components: the Multi-Scale Align Network (MSANet) backbone, the Spatial–Spectral Attentive Feature Fusion (SSAFF) module, and an improved detection head. This design enables the model to improve detection precision across multiple categories of crop diseases while maintaining real-time performance.
We first design the Adapt Fusion Align (AFA) block and, based on it, develop a feature-extraction network named MSANet. This design adaptively aligns multi-scale features to address the scale variation and background interference that affect pest and disease targets in images. Second, we design the SSAFF module in the encoder. Its novelty lies in processing spatial- and frequency-domain information simultaneously: it achieves attention-weighted feature fusion through the Multi-scale Focus Integration (MFI) block, while the Fourier Transform Downsampling (FTD) block extracts frequency-domain features. This multi-domain fusion strategy allows the model to exploit complementary information in disease images, enhancing the discriminative power of the features. Finally, tailored to the characteristics of the disease detection task, we design the Inner-Powerful-IoUv2 (IPIoUv2) loss function, which improves bounding-box regression accuracy through its inner auxiliary-box mechanism.
To clarify what distinguishes AFA, we further compare it with the Squeeze-and-Excitation (SE) mechanism and BiFPN-style weighted fusion. SE recalibrates channels using globally pooled statistics from a single feature map and therefore focuses mainly on modeling channel dependencies [27]. BiFPN learns scalar or normalized weights for bidirectional pyramid fusion and is designed for repeated top–down and bottom–up multi-scale aggregation [28]. In contrast, AFA first projects heterogeneous features into a shared channel space, learns spatially varying and channel-aware gates from the concatenated cross-scale features, and then combines these gates with learnable branch-level coefficients. Thus, AFA is neither merely a channel-attention block nor merely a feature-pyramid weighting rule; it is a feature-alignment and fusion unit specifically designed to reduce cross-scale misalignment in crop disease detection. The conceptual differences among AFA, SE, and BiFPN are summarized in Table 1.
2.2. Multi-Scale Align Network
Traditional CNNs often struggle to capture multi-scale features efficiently, which directly limits overall model performance. To address this issue, we introduce two modules in MSANet: the HGStem module [29] for low-level feature extraction and the C2f module [30] for high-level feature fusion. Both modules enhance the representational capacity of the network, and their structures are illustrated in Figure 1. To further improve computational efficiency, we integrate Depthwise Separable Convolution (DWConv) [31]. DWConv decomposes a standard convolution into depthwise and pointwise operations, substantially lowering the computational cost while retaining the ability to extract rich multi-scale features.
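The cost reduction obtained by splitting a convolution into depthwise and pointwise parts can be checked with a short weight count. The sketch below is illustrative only: the layer sizes are hypothetical and are not taken from MSANet, and bias terms are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # the 1 x 1 conv mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 128, 256, 3
standard = conv_params(c_in, c_out, k)
separable = dw_separable_params(c_in, c_out, k)
ratio = separable / standard   # equals 1 / c_out + 1 / k**2
print(standard, separable, round(ratio, 4))
```

For these sizes the separable form needs roughly 11.5% of the weights of the standard convolution, and the ratio approaches 1/k² as the number of output channels grows.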
The AFA block is the core component of MSANet and is designed to reduce feature-alignment deviations within the backbone network. Since feature maps often differ in channel count, stride, and receptive field, simple concatenation or weighted averaging can degrade fusion quality. The AFA block addresses this issue by performing feature alignment, adaptive weighting, and channel-wise modulation, thereby achieving more accurate fusion and improving the performance of the detection network. The working principle of the AFA block is shown in Figure 2.
In the AFA structure, the two input features $F_1$ and $F_2$ first undergo a convolution for channel alignment, ensuring that they are fused within the same feature space. Since $F_1$ and $F_2$ may originate from different backbone networks, their channel dimensions might not match. Therefore, the following transformations are applied:

$$F_1' = W_1 \otimes F_1, \qquad F_2' = W_2 \otimes F_2$$

Here, ⊗ represents the convolution operation; $W_1$ and $W_2$ are the convolutional weights used for channel transformation, aligning the channel dimensions of $F_1$ and $F_2$ to lay the foundation for the subsequent fusion.

After channel alignment, the aligned features $F_1'$ and $F_2'$ are concatenated. To enhance the adaptability of the fusion, an Adapt Align Weight (AAW) mechanism is introduced. Specifically, a convolution is applied to the concatenated features to capture high-level fusion information:

$$Z = \mathrm{Conv}\big(\mathrm{Concat}(F_1', F_2')\big)$$

where $\mathrm{Conv}(\cdot)$ denotes the convolutional layer used to extract fusion information from the concatenated features.

Subsequently, the weights are normalized to the interval $(0, 1)$ using the sigmoid function $\sigma$, enabling adaptive weighting of features from different sources:

$$A = \sigma(Z)$$

The tensor $A$ is split along the channel dimension into two independent dynamic weight tensors $A_1$ and $A_2$:

$$[A_1, A_2] = \mathrm{Split}(A)$$

Finally, through this weight allocation, AFA dynamically adjusts the contribution of each input feature during fusion, achieving smoother and more coordinated feature fusion:

$$F_{\mathrm{fuse}} = A_1 \odot F_1' + A_2 \odot F_2'$$

where ⊙ denotes element-wise multiplication, so that valid information from the different sources is fused in appropriate proportions.
However, relying solely on dynamic weights might cause certain feature paths to be excessively suppressed. To address this, two learnable channel weights, one per branch, are further introduced to optimize the final fusion ratio. Both are trainable parameters, initialized to 0.5 and optimized automatically during training, allowing the model to learn the optimal feature fusion ratio. To prevent gradient explosion or numerical instability during training, the AFA block additionally imposes a constraint on these channel weights. This constraint keeps the channel weights within a stable range, supporting model consistency and generalization capability. Lastly, a convolution is applied to adapt the fused features for downstream detection tasks.
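The gated fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: every convolution is taken to be 1 × 1 (the kernel sizes are not given here), the two inputs are assumed to share the same spatial size already, the branch coefficients `alpha` and `beta` simply scale the two gated branches, and the constraint on the coefficients as well as the final output convolution are omitted. All function and variable names are hypothetical.

```python
import numpy as np


def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Apply a 1 x 1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def afa_fuse(f1, f2, w1, w2, w_gate, alpha=0.5, beta=0.5):
    """Gated fusion of two feature maps with the same spatial size.

    Assumptions (not taken from the paper): every convolution is 1 x 1,
    and alpha/beta simply scale the two gated branches.
    """
    a1 = conv1x1(f1, w1)                      # channel alignment, (C, H, W)
    a2 = conv1x1(f2, w2)
    cat = np.concatenate([a1, a2], axis=0)    # (2C, H, W)
    gates = sigmoid(conv1x1(cat, w_gate))     # (2C, H, W), values in (0, 1)
    g1, g2 = np.split(gates, 2, axis=0)       # one gate tensor per branch
    return alpha * g1 * a1 + beta * g2 * a2


rng = np.random.default_rng(0)
f1 = rng.standard_normal((32, 8, 8))    # hypothetical shallow feature
f2 = rng.standard_normal((64, 8, 8))    # hypothetical deep feature, same H x W
c = 16                                   # shared channel width (assumed)
w1 = rng.standard_normal((c, 32)) * 0.1
w2 = rng.standard_normal((c, 64)) * 0.1
w_gate = rng.standard_normal((2 * c, 2 * c)) * 0.1
fused = afa_fuse(f1, f2, w1, w2, w_gate)
print(fused.shape)  # (16, 8, 8)
```

Because the gates are produced per position and per channel, the two branches can dominate in different regions of the same feature map, which a single scalar fusion weight cannot express.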
2.3. Spatial–Spectral Attentive Feature Fusion
Traditional feature fusion methods exhibit significant limitations in crop disease detection: they fail to handle semantic inconsistencies among multi-scale features effectively, leading to inaccurate feature alignment and information loss. Moreover, these methods do not exploit frequency-domain information, overlooking the texture and periodic patterns in disease images, and their computational inefficiency makes it difficult to meet real-time detection demands.
To address these issues, we design the SSAFF module. The MFI block applies a multi-scale attention mechanism to dynamically adjust feature weights and prioritize salient disease regions. The FTD block uses the Fourier transform to map spatial features into the frequency domain for processing, enabling the convolution operation to go beyond the limits of local receptive fields and directly capture global contextual information in images. The overall architecture improves the robustness and precision of feature fusion while maintaining real-time performance, providing an efficient solution for crop disease detection.
The MFI block is a core component of the SSAFF module. Its key idea is to integrate local and global attention branches with a hierarchical feature-processing path to achieve refined fusion of the input features. This design preserves the spatial details of the input features and enhances the semantic representation of global context through an adaptive weighting mechanism.
As shown in Figure 3, the MFI block workflow consists of the following stages. First, the input features pass through a convolution for dimensionality reduction. This step unifies the dimensionality of the input features and reduces the computational complexity of subsequent operations. Next, a convolution is applied to the two input features for preliminary fusion, producing baseline features that provide a stable foundation for subsequent attention-based processing.
Subsequently, these features are processed by the Hierarchical Aware (HA) block. The input feature map is first divided, through a spatial operation such as unfolding, into a set of equally sized, non-overlapping patches. This step avoids the information loss associated with uniform downsampling. Each patch is then reduced to a patch-level feature vector.
Task-oriented weighting is then introduced: a weight is computed for each patch by measuring the similarity between its patch-level feature vector and a task-embedding vector, and a linear transformation is applied for channel selection. The similarity is the cosine similarity, which outputs a value in the range $[-1, 1]$ and measures the relevance of the patch to the task. Patches with high weights are enhanced, while those with low weights are suppressed, achieving adaptive selection.
The weighted patches are then reassembled into a complete feature map via a feature recombination operation. Specifically, all patches are recombined and interpolated back to the original spatial dimensions, producing the enhanced feature.
Finally, the features produced by the HA block are integrated with the baseline features. The fusion path consists of a convolution, a reparameterized convolution, and another convolution. The reparameterized convolution improves parameter efficiency and reorganizes features from multiple branches; because its training-time branches can be merged into a single convolution at inference, it also keeps inference efficient.
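The patch-level weighting performed inside the HA block can be illustrated with the following NumPy sketch. It is a simplified reading of the description, under assumptions that are not confirmed by the paper: each patch is summarized by average pooling, the raw cosine similarity is used directly as the patch weight, the task embedding is a fixed vector rather than a learned one, and the linear channel-selection transform is omitted. The function name `patch_weighting` and all sizes are hypothetical.

```python
import numpy as np


def patch_weighting(x: np.ndarray, task: np.ndarray, p: int) -> np.ndarray:
    """Reweight non-overlapping p x p patches of x by cosine similarity.

    x: (C, H, W) feature map with H and W divisible by p.
    task: (C,) task-embedding vector (learned in practice, fixed here).
    Assumptions: each patch is summarized by average pooling, and the
    raw cosine similarity in [-1, 1] is used directly as the patch weight.
    """
    c, h, w = x.shape
    # (C, H/p, p, W/p, p): axes 2 and 4 index positions inside a patch.
    patches = x.reshape(c, h // p, p, w // p, p)
    vecs = patches.mean(axis=(2, 4))                      # (C, H/p, W/p)
    dots = np.einsum("c,cij->ij", task, vecs)
    norms = np.linalg.norm(task) * np.linalg.norm(vecs, axis=0)
    sim = dots / np.maximum(norms, 1e-12)                 # (H/p, W/p)
    weighted = patches * sim[None, :, None, :, None]      # broadcast per patch
    return weighted.reshape(c, h, w)                      # reassemble the map


x = np.zeros((2, 4, 4))
x[0, :2, :2] = 1.0          # patch (0, 0): aligned with the task vector
x[1, :2, 2:] = 1.0          # patch (0, 1): orthogonal to it
x[:, 2:, :2] = 1.0          # patch (1, 0): partly aligned
task = np.array([1.0, 0.0])
out = patch_weighting(x, task, p=2)
```

In this toy input the aligned patch is passed through unchanged, the orthogonal patch is zeroed, and the partly aligned patch is scaled by its cosine similarity, which mirrors the enhance-or-suppress behaviour described for the HA block.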
In the feature fusion network of AD-DETR, the FTD block is a key component of the SSAFF module and performs downsampling. Unlike traditional downsampling methods based on pooling or strided convolutions, the FTD block reduces features in the frequency domain. By combining spectral filtering with resolution adjustment, it lowers the feature-map resolution while better preserving frequency-domain information.
The downsampling process of the FTD block is implemented through frequency-domain operations. Its core idea is to exploit the frequency-truncation property of the Fourier transform to reduce resolution. The mathematical foundation is the convolution theorem: (circular) convolution in the spatial domain is equivalent to element-wise multiplication in the frequency domain. For an input image $x(m, n)$ of size $H \times W$, its discrete Fourier transform (DFT) is defined as:

$$X(u, v) = \sum_{m=0}^{H-1} \sum_{n=0}^{W-1} x(m, n)\, e^{-j 2\pi \left( \frac{um}{H} + \frac{vn}{W} \right)}$$

The convolution operation in the frequency domain is expressed as:

$$Y(u, v) = X(u, v) \cdot K(u, v)$$

where $K(u, v)$ is a learnable convolution kernel in the frequency domain. The result in the spatial domain can be recovered via the inverse DFT:

$$y(m, n) = \frac{1}{HW} \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} Y(u, v)\, e^{j 2\pi \left( \frac{um}{H} + \frac{vn}{W} \right)}$$

Because every frequency coefficient depends on all spatial positions, this equivalence enables the FTD block to achieve receptive-field coverage ranging from local to global.
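The convolution theorem invoked here can be verified numerically. The following sketch compares a direct circular convolution with the product of the two spectra; the array sizes and the random 3 × 3 kernel are arbitrary choices made for illustration and do not correspond to any layer of the FTD block.

```python
import numpy as np


def circular_conv2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Direct 2D circular convolution of two arrays with the same shape."""
    h, w = x.shape
    out = np.zeros((h, w))
    for m in range(h):
        for n in range(w):
            for i in range(h):
                for j in range(w):
                    out[m, n] += x[i, j] * k[(m - i) % h, (n - j) % w]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
k = np.zeros((6, 6))
k[:3, :3] = rng.standard_normal((3, 3))   # a 3 x 3 kernel, zero-padded to 6 x 6

spatial = circular_conv2d(x, k)
spectral = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)).real
print(np.allclose(spatial, spectral))  # True
```

A kernel that is small in the spatial domain and a kernel that spans the whole image cost the same in the frequency domain, since both reduce to one element-wise product over the spectrum.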
The FCB is illustrated in Figure 4. Specifically, given an input feature map $x \in \mathbb{R}^{C \times H \times W}$, the spatial features are first transformed into the frequency domain via the Fast Fourier Transform (FFT):

$$X = \mathcal{F}(x)$$

in which $\mathcal{F}$ denotes the 2D FFT operation. Since the input consists of real-valued features, the output spectrum exhibits conjugate symmetry, meaning that only half of the frequency-domain data needs to be processed for a complete representation. Then, frequency-domain modulation is applied using a learnable frequency-domain filter $K$, while frequency truncation is performed at the same time:

$$Y = \tilde{X} \odot K$$

where $\tilde{X}$ represents the truncated spectrum (retaining low-frequency components), and ⊙ represents element-wise complex multiplication. The learnable filter enables the model to adaptively select the frequency components that matter most for downsampling. Finally, the result is transformed back to the spatial domain via the inverse FFT (IFFT), which naturally reduces the resolution:

$$y = \mathcal{F}^{-1}(Y)$$

After the inverse transform, the output feature map $y$ has the spatial dimensions of the truncated spectrum, which are smaller than $H \times W$, completing the downsampling.
The advantages of this frequency-domain downsampling method are as follows: First, by preserving low-frequency components, it naturally achieves anti-aliasing, avoiding the spectral aliasing issues common in spatial-domain downsampling. Second, the learnable filter can optimize frequency selection for the specific task, enhancing the discriminative power of the downsampled features. Lastly, frequency-domain operations provide a global receptive field, ensuring that long-range dependencies are not lost during the downsampling process.
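The frequency-truncation idea can be illustrated on a single channel with NumPy. This is a sketch under stated assumptions rather than the FTD block itself: it uses the full complex FFT instead of the half-spectrum representation for readability, the optional `filt` argument merely stands in for the learnable filter, all sizes are assumed to be even, and the function name `fft_downsample` is hypothetical.

```python
import numpy as np


def fft_downsample(x: np.ndarray, out_h: int, out_w: int, filt=None) -> np.ndarray:
    """Downsample a real (H, W) map by keeping its low-frequency spectrum.

    filt is an optional (out_h, out_w) filter applied to the centered,
    truncated spectrum; it stands in for the learnable filter.
    Assumes H, W, out_h, and out_w are all even.
    """
    h, w = x.shape
    spec = np.fft.fftshift(np.fft.fft2(x))            # DC moved to the center
    top, left = h // 2 - out_h // 2, w // 2 - out_w // 2
    low = spec[top:top + out_h, left:left + out_w]    # frequency truncation
    if filt is not None:
        low = low * filt                              # element-wise modulation
    y = np.fft.ifft2(np.fft.ifftshift(low))
    # Rescale so that mean intensity is preserved after the size change.
    return y.real * (out_h * out_w) / (h * w)


m = np.arange(8).reshape(8, 1)
x = np.cos(2 * np.pi * m / 8) * np.ones((1, 8))       # one cycle along the rows
y = fft_downsample(x, 4, 4)
print(y.shape)  # (4, 4)
```

The 8 × 8 input contains one cosine cycle along its rows, and the 4 × 4 output contains the same cycle sampled at half the rate, so a component below the new Nyquist limit survives the downsampling intact, while anything above it is discarded before it can alias.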
2.4. Inner-Powerful-IoUv2
In crop disease detection tasks, bounding-box regression precision directly affects the localization capability of the model. Traditional IoU loss functions suffer from slow convergence, geometric misalignment, and limited generalization. To address these issues, this paper proposes IPIoUv2, which integrates the geometric alignment penalty mechanism of Powerful-IoUv2 (PIoUv2) [32] with the auxiliary-box scaling strategy of Inner-IoU [33]. By dynamically adjusting the loss weights and adapting to scale variations, our method improves bounding-box regression precision, convergence speed, and robustness. The geometric penalty term $P$ introduced by PIoU is normalized by the width and height of the target bounding box, guiding the anchor box to regress more directly toward the target and avoiding unnecessary expansion. Its core calculation formula is:

$$P = \frac{1}{4}\left(\frac{d_{w1}}{w^{gt}} + \frac{d_{w2}}{w^{gt}} + \frac{d_{h1}}{h^{gt}} + \frac{d_{h2}}{h^{gt}}\right), \qquad L_{PIoU} = 1 - IoU + 1 - e^{-P^{2}}$$

where $d_{w1}$, $d_{w2}$, $d_{h1}$, and $d_{h2}$ measure the absolute distances between corresponding edges of the target and predicted boxes in the horizontal and vertical directions, and $h^{gt}$ and $w^{gt}$ are the height and width of the ground-truth box, respectively. Subsequently, building on PIoU, a non-monotonic focusing mechanism is introduced to dynamically adjust the loss weight, yielding the PIoUv2 loss:

$$q = e^{-P}, \qquad L_{PIoUv2} = 3\,(\lambda q)\, e^{-(\lambda q)^{2}}\, L_{PIoU}$$

where $\lambda$ is a hyperparameter controlling the penalty strength; its value was set experimentally.
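For concreteness, the PIoU penalty and the PIoUv2 focusing weight can be computed for a single box pair as sketched below. The code follows the PIoU and PIoUv2 definitions of [32] as we understand them; boxes are given as corner coordinates, the helper names are hypothetical, and the default `lam=1.3` is only a placeholder, not the setting used in this paper.

```python
import math


def iou(pred, gt):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_p + area_g - inter)


def piou_losses(pred, gt, lam=1.3):
    """Return (P, L_PIoU, L_PIoUv2) for one box pair.

    lam is a placeholder value; the setting used in this paper is not
    reproduced here.
    """
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    dw1, dw2 = abs(pred[0] - gt[0]), abs(pred[2] - gt[2])
    dh1, dh2 = abs(pred[1] - gt[1]), abs(pred[3] - gt[3])
    p = (dw1 / w_gt + dw2 / w_gt + dh1 / h_gt + dh2 / h_gt) / 4.0
    l_piou = 1.0 - iou(pred, gt) + 1.0 - math.exp(-p * p)
    q = math.exp(-p)                               # box quality in (0, 1]
    focus = 3.0 * (lam * q) * math.exp(-(lam * q) ** 2)
    return p, l_piou, focus * l_piou


gt_box = (0.0, 0.0, 4.0, 4.0)
pred_box = (1.0, 0.0, 5.0, 4.0)       # same size, shifted right by one unit
p, l_piou, l_piou_v2 = piou_losses(pred_box, gt_box)
print(p)  # 0.125
```

Because the penalty depends only on edge distances normalized by the target size, it shrinks monotonically as the predicted edges approach the target edges and does not reward enlarging the predicted box.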
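The motivation behind Inner-IoU differs from that of PIoU. It primarily addresses the insufficient generalization capability and slow convergence of the traditional IoU loss in disease detection. Its core idea is to compute the loss on proportionally scaled auxiliary bounding boxes, thereby adjusting the difficulty and focus of the regression. The calculation is as follows:

$$b_l^{gt} = x_c^{gt} - \frac{w^{gt} \cdot ratio}{2}, \quad b_r^{gt} = x_c^{gt} + \frac{w^{gt} \cdot ratio}{2}, \quad b_t^{gt} = y_c^{gt} - \frac{h^{gt} \cdot ratio}{2}, \quad b_b^{gt} = y_c^{gt} + \frac{h^{gt} \cdot ratio}{2}$$

$$b_l = x_c - \frac{w \cdot ratio}{2}, \quad b_r = x_c + \frac{w \cdot ratio}{2}, \quad b_t = y_c - \frac{h \cdot ratio}{2}, \quad b_b = y_c + \frac{h \cdot ratio}{2}$$

$$inter = \big(\min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l)\big) \cdot \big(\min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t)\big)$$

$$union = (w^{gt} \cdot h^{gt}) \cdot ratio^{2} + (w \cdot h) \cdot ratio^{2} - inter, \qquad IoU^{inner} = \frac{inter}{union}, \qquad L_{Inner\text{-}IoU} = 1 - IoU^{inner}$$

where $(x_c^{gt}, y_c^{gt})$ and $(x_c, y_c)$ are the center coordinates of the ground-truth box and the predicted box, respectively; $w^{gt}$, $h^{gt}$, $w$, and $h$ represent the width and height of the ground-truth box and the predicted box, respectively; and $ratio$ is the scale factor. The value chosen here imposes stricter regression criteria and accelerates the convergence of high-quality samples. IPIoUv2 therefore combines the PIoUv2 penalty and focusing mechanism with the Inner-IoU auxiliary-box scaling.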
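As an illustration of the Inner-IoU component only (not of the combined IPIoUv2 expression), the auxiliary-box overlap can be computed as sketched below. Boxes are given as center coordinates with width and height, the overlap is clamped at zero for disjoint boxes as is common in implementations, and the ratio values used in the example are hypothetical rather than the setting adopted in this paper.

```python
def inner_iou(pred, gt, ratio):
    """Inner-IoU of two boxes given as (xc, yc, w, h).

    Both boxes are rescaled around their own centers by `ratio` before the
    overlap is measured.
    """
    def edges(box):
        xc, yc, w, h = box
        return (xc - w * ratio / 2, yc - h * ratio / 2,
                xc + w * ratio / 2, yc + h * ratio / 2)

    pl, pt, pr, pb = edges(pred)
    gl, gt_top, gr, gb = edges(gt)
    iw = max(0.0, min(pr, gr) - max(pl, gl))    # clamped for disjoint boxes
    ih = max(0.0, min(pb, gb) - max(pt, gt_top))
    inter = iw * ih
    union = (pred[2] * pred[3] + gt[2] * gt[3]) * ratio ** 2 - inter
    return inter / union


gt_box = (2.0, 2.0, 4.0, 4.0)
pred_box = (3.0, 2.0, 4.0, 4.0)           # same size, shifted right by one unit
full = inner_iou(pred_box, gt_box, 1.0)   # ratio 1 reduces to the ordinary IoU
small = inner_iou(pred_box, gt_box, 0.5)  # hypothetical ratio below 1
print(round(full, 4), round(small, 4))
```

With a ratio below 1 the same one-unit offset yields a lower overlap (about 0.33 instead of 0.6), which is how smaller auxiliary boxes impose a stricter criterion on predictions that are already close to the target.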