To address the inherent limitation of RT-DETR in capturing fine-grained defect structures, the original backbone is redesigned using the EMO module. Built upon the iRMB architecture, EMO integrates local convolutional operations with an efficient attention mechanism, enabling the model to jointly preserve high-frequency structural details and global contextual information. This hybrid representation is particularly suited for detecting subtle defects such as cracks and scratches, while significantly reducing parameter redundancy and computational overhead, thereby improving deployment feasibility on UAV platforms and edge devices.
Beyond feature extraction, ESS-DETR introduces a Scale-Decoupled Loss (SDLoss) to explicitly mitigate the optimization bias caused by scale imbalance in aerial imagery. By decoupling object scale from gradient dominance during training, SDLoss suppresses the disproportionate influence of large objects and ensures that small-scale defects contribute stable and meaningful gradients. This loss formulation provides a principled mechanism for enhancing small-object detection performance, rather than relying on heuristic re-weighting strategies.
To further facilitate effective multi-scale feature interaction, the SPPELAN module is incorporated as a structured feature fusion component. By combining path enhancement with spatial selective attention, SPPELAN constructs a lightweight yet discriminative feature interaction topology that selectively emphasizes defect-relevant regions while suppressing background interference. This design ensures scale-consistent and attention-aligned feature propagation between the backbone and the detection head.
Overall, ESS-DETR forms a coherent detection framework in which feature representation, optimization strategy, and multi-scale fusion are jointly optimized. The coordinated design of EMO, SDLoss, and SPPELAN enables ESS-DETR to achieve improved detection accuracy and strong deployment adaptability, making it well suited for efficient and accurate multi-class structural defect detection using UAV imagery.
3.1. EMO-Based Lightweight Feature Extraction Backbone
In UAV-based structural defect detection, targets are predominantly small and heavily dependent on fine-grained texture and edge cues. Capturing such details typically requires strong semantic representation, which in conventional detectors is achieved through large backbones and computationally intensive global attention mechanisms, severely hindering deployment on resource-constrained UAV platforms.
To overcome this limitation, we reformulate the backbone design of RT-DETR by introducing an EMO-driven lightweight semantic encoding strategy. EMO reorganizes feature extraction by coupling efficient local mixing with selective global context modeling, enabling rich semantic representation with significantly reduced parameters and computational cost. This redesign preserves discriminative capability for small defects while establishing a more deployment-oriented detection architecture.
The Meta Mobile Block integrates an Inverted Residual Block from MobileNetV2 [27], the core MHSA [15] from Transformers, and an FFN, forming an efficient and compact structure. This design combines the lightweight advantages of CNNs with the global context modeling capability of Transformers, optimizing computational efficiency while enhancing feature representation, as illustrated in Figure 2. The MMB module effectively captures both local image details and global dependencies without compromising inference efficiency.
The computational flow of the MMB can be divided into three key stages, with the corresponding derivations as follows:
Stage 1: Channel Expansion
For a given image $X \in \mathbb{R}^{C \times H \times W}$, the MMB module expands the channel dimension of the input using an expansion $\mathrm{MLP}_{e}$ with an output-to-input ratio of $\lambda$, producing:
$$X_{e} = \mathrm{MLP}_{e}(X) \in \mathbb{R}^{\lambda C \times H \times W}$$
Here, $\lambda$ controls the dimension of the intermediate feature channels. Increasing $\lambda$ can improve feature representation but will increase FLOPs.
Stage 2: Feature Enhancement via Efficient Operator F
The intermediate operator F further enhances the image features. Depending on design choices, the intermediate operator can take various forms, such as an identity mapping, a static convolution, or a dynamic MHSA. To align with the lightweight and efficient nature of the MMB, we formalize F as an efficient operator, defined as:
$$X_{f} = \mathcal{F}(X_{e}) \in \mathbb{R}^{\lambda C \times H \times W}$$
Stage 3: Channel Shrinkage and Residual Connection
The intermediate features are then enhanced by the efficient operator F, and finally the channel dimension is reduced through a shrinkage $\mathrm{MLP}_{s}$ with an input-to-output ratio of $\lambda$, producing:
$$X_{s} = \mathrm{MLP}_{s}(X_{f}) \in \mathbb{R}^{C \times H \times W}$$
A residual connection is applied to obtain the module's output:
$$Y = X + X_{s} \in \mathbb{R}^{C \times H \times W}$$
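To make the three-stage flow concrete, the following PyTorch-style sketch outlines a minimal Meta Mobile Block, assuming 1 × 1 convolutions play the role of the expansion and shrinkage MLPs and treating the efficient operator F as a pluggable module; the class name, activation choice, and layer details are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class MetaMobileBlock(nn.Module):
    """Minimal sketch of the MMB flow: expand -> efficient operator F -> shrink -> residual."""
    def __init__(self, channels: int, expand_ratio: float, efficient_op: nn.Module):
        super().__init__()
        hidden = int(channels * expand_ratio)
        # Stage 1: channel expansion MLP_e (1x1 conv) with output-to-input ratio lambda
        self.mlp_e = nn.Sequential(nn.Conv2d(channels, hidden, 1), nn.SiLU())
        # Stage 2: efficient operator F (identity, static conv, or attention-based)
        self.op_f = efficient_op
        # Stage 3: channel shrinkage MLP_s (1x1 conv) back to the input width
        self.mlp_s = nn.Conv2d(hidden, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_e = self.mlp_e(x)      # X_e = MLP_e(X), lambda*C channels
        x_f = self.op_f(x_e)     # X_f = F(X_e)
        x_s = self.mlp_s(x_f)    # X_s = MLP_s(X_f), back to C channels
        return x + x_s           # residual connection: Y = X + X_s
```

With `efficient_op = nn.Identity()` the block reduces to a plain inverted residual bottleneck; substituting a convolution or an attention operator recovers the other instantiations discussed above.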
Building upon the Meta Mobile Block (MMB), we propose the Inverted Residual Mobile Block (iRMB), whose core innovation lies in designing the efficient operator F as a cascaded structure of multi-head self-attention (MHSA) and convolutional operations.
This cascaded design is not a simple stacking; rather, it enables functional complementarity between the two operators. Specifically, MHSA captures global context dependencies across the feature map, while the convolutional operations strengthen the extraction of local textures and edge details, providing fine-grained feature support essential for small defect detection.
To reduce computational overhead while maintaining high representational capacity, the backbone integrates window-based multi-head self-attention (W-MHSA) with depthwise separable convolutions (DW-Conv), complemented by residual connections to ensure training stability.
In W-MHSA, the feature map is partitioned into local windows, and attention is computed independently within each window. This transforms the quadratic complexity of conventional MHSA into a linear complexity with respect to spatial size, significantly reducing computational cost. Meanwhile, DW-Conv decomposes standard convolution into depthwise and pointwise operations, reducing parameter redundancy and enhancing channel-wise feature decoupling.
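Because attention is restricted to non-overlapping windows, the quadratic cost only applies within each window. The helper below is a minimal sketch of the partitioning step, assuming the feature-map size is divisible by the window size; the reshape pattern is illustrative.

```python
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split (B, C, H, W) into (B * num_windows, C, win, win) so that attention
    is computed independently inside each window."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // win, win, w // win, win)
    x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
    return x.view(-1, c, win, win)
```

Each window of win × win tokens costs on the order of win⁴ attention operations, so the total cost grows linearly with the number of windows (and hence with H × W) instead of quadratically with the full spatial size.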
In conventional W-MHSA, computing the query Q and key K involves the expanded channels, resulting in a cost that grows quadratically with the channel dimension. To improve efficiency, we propose Expanded Window MHSA (EW-MHSA), where the attention matrix is computed using the unexpanded feature $X \in \mathbb{R}^{C \times H \times W}$, while the expanded feature $X_{e} \in \mathbb{R}^{\lambda C \times H \times W}$ serves as the value $V$:
$$Q = K = X, \qquad V = X_{e}$$
The attention output is then formulated as:
$$\mathrm{EW\text{-}MHSA}(X) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V$$
where $d_{k}$ denotes the per-head key dimension. The operator F is then formulated as:
$$\mathcal{F}(\cdot) = \big(\mathrm{DW\text{-}Conv},\ \mathrm{Skip}\big)\big(\mathrm{EW\text{-}MHSA}(\cdot)\big)$$
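The sketch below illustrates how operator F could be realized as the EW-MHSA/DW-Conv cascade described above: Q and K are taken from the unexpanded C-channel feature (here via a small shared projection, added for generality), V is produced by the channel expansion itself, and a depthwise convolution with a skip connection then refines local structure. For brevity it uses a single global window and omits positional bias, so it is a simplified illustration rather than the exact implementation; the class name and head count are assumptions.

```python
import torch
import torch.nn as nn

class EWMHSAOperatorF(nn.Module):
    """Sketch of operator F: EW-MHSA (Q, K from the unexpanded feature, V from the
    expansion) followed by a depthwise conv with a skip connection."""
    def __init__(self, channels: int, expand_ratio: float = 2.0, num_heads: int = 4):
        super().__init__()
        self.heads = num_heads
        self.hidden = int(channels * expand_ratio)
        self.qk = nn.Conv2d(channels, 2 * channels, 1)   # Q, K from unexpanded X
        self.v = nn.Conv2d(channels, self.hidden, 1)     # V doubles as the expansion to lambda*C
        self.dw = nn.Conv2d(self.hidden, self.hidden, 3, padding=1, groups=self.hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k = self.qk(x).chunk(2, dim=1)
        v = self.v(x)

        def to_tokens(t: torch.Tensor, dim: int) -> torch.Tensor:
            # (B, dim, H, W) -> (B, heads, H*W, dim // heads)
            return t.view(b, self.heads, dim // self.heads, h * w).transpose(-2, -1)

        q, k, v = to_tokens(q, c), to_tokens(k, c), to_tokens(v, self.hidden)
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = attn.softmax(dim=-1) @ v                   # attention applied to the expanded V
        out = out.transpose(-2, -1).reshape(b, self.hidden, h, w)
        return out + self.dw(out)                        # DW-Conv with skip connection
```

Because V carries the channel expansion, the operator folds Stage 1 into the attention itself, which is what keeps the Q/K computation on the cheaper, unexpanded width.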
The iRMB incorporates DW-Conv and EW-MHSA, enabling an effective trade-off between lightweight structure and strong detection performance. This architecture captures both local cues and global contextual dependencies with high efficiency, making it particularly suitable for real-time UAV-based surface defect inspection (Table 1).
Downsampling in the model is achieved through stride adaptation within iRMBs rather than aggressive pooling or positional embeddings. This strategy preserves spatial continuity and reduces feature distortion, ensuring that small defects remain detectable across network stages. Meanwhile, the gradual increase in channel dimensions and expansion ratios enhances representational capacity without introducing excessive computational cost.
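A minimal sketch of this strategy, under the assumption that the stride is carried by the depthwise convolution inside the block (with the residual branch simply dropped when the spatial resolution changes), is given below.

```python
import torch.nn as nn

# Illustrative stride-based downsampling: the depthwise conv inside the block
# carries stride 2, so no pooling layer or positional embedding is required.
def strided_dwconv(channels: int) -> nn.Module:
    return nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                     padding=1, groups=channels)
```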
Based on this design, the EMO architecture is constructed as shown in Figure 3. EMO is an efficient, ResNet-inspired four-stage network built entirely from a sequence of iRMBs, without incorporating other module types. Each iRMB contains only standard convolutional layers and multi-head self-attention, avoiding additional complex operations. Downsampling is performed via stride adaptation, eliminating the need for positional embeddings. Additionally, the expansion ratios and channel dimensions progressively increase across stages, improving the network's ability to represent features while preserving computational efficiency.
From a deployment perspective, aircraft inspection tasks require low-latency and energy-efficient inference to support on-board or near-edge processing. EMO relies exclusively on standard convolutional operations and efficient self-attention, avoiding complex operators and memory-intensive designs. Consequently, the backbone achieves a favorable balance between accuracy and computational efficiency, making it suitable for real-time aircraft defect detection in resource-constrained UAV scenarios.
By aligning its architectural design with the geometric characteristics, background complexity, and operational constraints of UAV-based aircraft defect detection, our EMO-based backbone goes beyond a simple lightweight backbone replacement and provides a task-driven feature extraction framework for accurate and efficient inspection.
3.2. Dynamic Scale-Weighted Loss for UAV Surface Inspection
Surface defects on large-scale structures often vary considerably in size and morphology. In small-object detection tasks, IoU-based loss functions can exhibit high fluctuations, negatively affecting model stability and regression accuracy. Moreover, conventional losses do not fully account for scale- and position-dependent sensitivity across objects of different sizes, making small-target detection particularly unstable.
To address this issue, we propose the Scale-Decoupled Loss (SDLoss), which explicitly mitigates localization errors caused by inconsistent defect scales, as shown in Figure 4. In SDLoss, the contributions of the Scale Loss (Sloss) and the Localization Loss (Lloss) are dynamically adjusted according to object size. For small defects, the Sloss weight is reduced to alleviate scale-related errors, while the Lloss weight is increased to ensure precise localization. Conversely, for larger defects, the Sloss weight is amplified to optimize scale regression, providing balanced and robust supervision across multi-scale targets in UAV-based surface inspection tasks.
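As a rough illustration of this scheduling behavior, the sketch below produces scale- and localization-loss weights that shift with the normalized box area; the specific functional form and the `gamma` and `pivot` values are hypothetical choices for illustration, not the parameters of SDLoss.

```python
def sdloss_weights(box_area: float, img_area: float,
                   gamma: float = 2.0, pivot: float = 0.01):
    """Illustrative scale-dependent weighting: small boxes receive a larger
    localization weight, large boxes a larger scale weight.
    `gamma` (sharpness) and `pivot` (area ratio where the weights balance)
    are hypothetical tunables."""
    r = box_area / img_area                               # scale ratio of the current target
    w_scale = r ** gamma / (r ** gamma + pivot ** gamma)  # rises with object size
    w_loc = 1.0 - w_scale                                 # falls as the object gets larger
    return w_scale, w_loc
```

Under this schedule, a defect covering 0.1% of the image would receive mostly localization supervision, whereas one covering 10% would be dominated by the scale term.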
The influence coefficient for the BBox labels is computed from the scale of each target: it depends on the area of the current target box, the corresponding scale ratio, and a tunable parameter that controls how strongly the coefficient responds to changes in object scale.
Define:
$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$$
where $w, h$ and $w^{gt}, h^{gt}$ are the widths and heights of the predicted and ground-truth boxes, $v$ measures the consistency of the predicted and ground-truth boxes in terms of aspect ratio, $\rho(\cdot)$ is the Euclidean distance function used to calculate the distance between the center points of the predicted box $b$ and the ground-truth box $b^{gt}$, and $c$ denotes the diagonal length of the smallest rectangle that encloses both the predicted and ground-truth boxes.
The final scale-adaptive SDB loss is then formed by combining the scale term and the localization term under the influence coefficient defined above, so that supervision shifts smoothly between scale regression and precise localization as the target size varies.
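To tie the pieces together, the sketch below computes the aspect-ratio, center-distance, and enclosing-diagonal terms defined above and blends a scale term with a localization term under a scale-dependent weight; the particular combination (and the omission of the IoU term) is an assumption made for illustration, not the exact SDB loss.

```python
import math
import torch

def sdb_loss_sketch(pred: torch.Tensor, gt: torch.Tensor,
                    img_area: float, eps: float = 1e-7) -> torch.Tensor:
    """pred, gt: (N, 4) boxes in (x1, y1, x2, y2) format. Illustrative only."""
    pw = (pred[:, 2] - pred[:, 0]).clamp(min=eps)
    ph = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    gw = (gt[:, 2] - gt[:, 0]).clamp(min=eps)
    gh = (gt[:, 3] - gt[:, 1]).clamp(min=eps)

    # v: aspect-ratio consistency between predicted and ground-truth boxes
    v = (4 / math.pi ** 2) * (torch.atan(gw / gh) - torch.atan(pw / ph)) ** 2

    # rho^2: squared Euclidean distance between box centers
    pc = (pred[:, :2] + pred[:, 2:]) / 2
    gc = (gt[:, :2] + gt[:, 2:]) / 2
    rho2 = ((pc - gc) ** 2).sum(dim=1)

    # c^2: squared diagonal of the smallest box enclosing both boxes
    enc_wh = torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])
    c2 = (enc_wh ** 2).sum(dim=1) + eps

    scale_term = v          # penalizes scale / aspect-ratio mismatch
    loc_term = rho2 / c2    # penalizes normalized center offset

    # Scale-dependent weighting (same illustrative schedule as the earlier sketch).
    r = (gw * gh) / img_area
    w_s = r ** 2 / (r ** 2 + 0.01 ** 2)
    w_l = 1.0 - w_s
    return (w_s * scale_term + w_l * loc_term).mean()
```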
3.3. Adaptive Multi-Scale Feature Enhancement for UAV-Based Defect Detection
In deep object detection models, feature maps from different layers exhibit varying receptive fields and semantic representation capabilities. Shallow layers are more effective at capturing edge and texture details, while deeper layers provide stronger semantic discriminability. To fully leverage multi-scale features and enhance the model’s sensitivity to fine-grained defects, we adopt the SPPELAN module, which combines path enhancement and spatial positional attention mechanisms. This design strengthens the representation of salient regions corresponding to defect areas, improving the detection of small and subtle defects in UAV-based surface inspection tasks.
The module first applies a 1 × 1 convolution to the input feature map $X_{\mathrm{in}}$ to reduce the channel dimension, as shown in Figure 1, thereby lowering the computational cost. Multiple parallel convolutional paths are then constructed, each consisting of convolutional stacks of varying depth. In one of these paths, a Spatial Pyramid Pooling (SPP) structure is introduced, which employs max-pooling operations with kernels of different sizes $k_{1}, \ldots, k_{n}$ to extract multi-scale contextual features:
$$X_{\mathrm{SPP}} = \mathrm{Concat}\big( X',\ \mathrm{MaxPool}_{k_{1}}(X'),\ \mathrm{MaxPool}_{k_{2}}(X'),\ \ldots,\ \mathrm{MaxPool}_{k_{n}}(X') \big)$$
where $X'$ denotes the channel-reduced feature.
The aforementioned operations significantly enlarge the model's receptive field, enhancing its responsiveness to large-scale objects and elongated structures. During the multi-path output fusion stage, all path features are concatenated and passed through a 1 × 1 convolution to achieve channel compression and feature integration, producing the final feature representation:
$$Y = \mathrm{Conv}_{1 \times 1}\big( \mathrm{Concat}\big( F_{1}(X'),\ F_{2}(X'),\ \ldots,\ F_{n}(X') \big) \big)$$
Here, $F_{i}(\cdot)$ denotes the convolutional transformation of the i-th path, including the path that incorporates the SPP operation.
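A condensed PyTorch-style sketch of this fusion topology is shown below, assuming two plain convolutional paths plus one SPP path with three pooling kernels; the kernel sizes, path depths, and channel widths are illustrative placeholders rather than the exact configuration.

```python
import torch
import torch.nn as nn

class SPPELANSketch(nn.Module):
    """Illustrative SPPELAN-style fusion: 1x1 channel reduction, parallel conv paths
    of different depth, one SPP path with multi-kernel max pooling, then 1x1 fusion."""
    def __init__(self, c_in: int, c_out: int, c_mid: int = 64, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.reduce = nn.Conv2d(c_in, c_mid, 1)                    # initial 1x1 reduction
        self.path1 = nn.Conv2d(c_mid, c_mid, 3, padding=1)         # shallow conv path
        self.path2 = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.SiLU(),
                                   nn.Conv2d(c_mid, c_mid, 3, padding=1))  # deeper conv path
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2)
                                    for k in pool_sizes])          # SPP path
        n_feats = 1 + 2 + len(pool_sizes)                          # reduced feature + 2 paths + pools
        self.fuse = nn.Conv2d(n_feats * c_mid, c_out, 1)           # 1x1 concat fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        feats = [x, self.path1(x), self.path2(x)] + [p(x) for p in self.pools]
        return self.fuse(torch.cat(feats, dim=1))
```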
Through this design, the SPPELAN module effectively enhances the network’s feature representation and detection robustness while maintaining low parameter count and computational overhead. By combining local structural details with global contextual information, SPPELAN substantially improves the model’s capability to detect objects of varying sizes, especially small and irregularly shaped defects. Experimental results indicate that integrating SPPELAN boosts overall detection accuracy and inference efficiency without significantly increasing model complexity, establishing it as a key component for lightweight UAV-based surface defect detection models that balance precision and operational speed.