To address the inherent limitation of RT-DETR in capturing fine-grained defect structures, the original backbone is redesigned using the EMO module. Built upon the iRMB architecture, EMO integrates local convolutional operations with an efficient attention mechanism, enabling the model to jointly preserve high-frequency structural details and global contextual information. This hybrid representation is particularly suited for detecting subtle defects such as cracks and scratches, while significantly reducing parameter redundancy and computational overhead, thereby improving deployment feasibility on UAV platforms and edge devices.
Beyond feature extraction, ESS-DETR introduces a Scale-Decoupled Loss (SDLoss) to explicitly mitigate the optimization bias caused by scale imbalance in aerial imagery. By decoupling object scale from gradient dominance during training, SDLoss suppresses the disproportionate influence of large objects and ensures that small-scale defects contribute stable and meaningful gradients. This loss formulation provides a principled mechanism for enhancing small-object detection performance, rather than relying on heuristic re-weighting strategies.
To further facilitate effective multi-scale feature interaction, the SPPELAN module is incorporated as a structured feature fusion component. By combining path enhancement with spatial selective attention, SPPELAN constructs a lightweight yet discriminative feature interaction topology that selectively emphasizes defect-relevant regions while suppressing background interference. This design ensures scale-consistent and attention-aligned feature propagation between the backbone and the detection head.
Overall, ESS-DETR forms a coherent detection framework in which feature representation, optimization strategy, and multi-scale fusion are jointly optimized. The coordinated design of EMO, SDLoss, and SPPELAN enables ESS-DETR to achieve improved detection accuracy and strong deployment adaptability, making it well suited for efficient and accurate multi-class structural defect detection using UAV imagery.
3.1. EMO-Based Lightweight Feature Extraction Backbone
In UAV-based structural defect detection, targets are predominantly small and heavily dependent on fine-grained texture and edge cues. Capturing such details typically requires strong semantic representation, which in conventional detectors is achieved through large backbones and computationally intensive global attention mechanisms, severely hindering deployment on resource-constrained UAV platforms.
To overcome this limitation, we reformulate the backbone design of RT-DETR by introducing an EMO-driven lightweight semantic encoding strategy. EMO reorganizes feature extraction by coupling efficient local mixing with selective global context modeling, enabling rich semantic representation with significantly reduced parameters and computational cost. This redesign preserves discriminative capability for small defects while establishing a more deployment-oriented detection architecture.
The Meta Mobile Block integrates an Inverted Residual Block from MobileNetV2 [27], the core MHSA [15] from Transformers, and an FFN, forming an efficient and compact structure. This design combines the lightweight advantages of CNNs with the global context modeling capability of Transformers, optimizing computational efficiency while enhancing feature representation, as illustrated in Figure 2. The MMB module effectively captures both local image details and global dependencies without compromising inference efficiency.
The computational flow of the MMB can be divided into three key stages, with the corresponding derivations as follows:
Stage 1: Channel Expansion
For a given image $X \in \mathbb{R}^{C \times H \times W}$, the MMB module expands the channel dimension of the input using an expansion $\mathrm{MLP}_{e}$ with an output-to-input ratio of $\lambda$, producing:
$$X_{e} = \mathrm{MLP}_{e}(X) \in \mathbb{R}^{\lambda C \times H \times W}$$
Here, $\lambda$ controls the dimension of the intermediate feature channels. Increasing $\lambda$ can improve feature representation but will increase FLOPs.
Stage 2: Feature Enhancement via Efficient Operator F
The intermediate operator F further enhances the image features. Depending on design choices, the intermediate operator can take various forms, such as an identity mapping, a static convolution, or a dynamic MHSA. To align with the lightweight and efficient nature of the MMB, we formalize F as an efficient operator, defined as:
$$X_{f} = \mathcal{F}(X_{e}) \in \mathbb{R}^{\lambda C \times H \times W}$$
Stage 3: Channel Shrinkage and Residual Connection
The intermediate features are then enhanced by the efficient operator F, and finally the channel dimension is reduced through a shrinkage $\mathrm{MLP}_{s}$ with an input-to-output ratio of $\lambda$, producing:
$$X_{s} = \mathrm{MLP}_{s}(X_{f}) \in \mathbb{R}^{C \times H \times W}$$
A residual connection is applied to obtain the module's output:
$$Y = X + X_{s} \in \mathbb{R}^{C \times H \times W}$$
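To make the three-stage flow concrete, the following PyTorch-style sketch outlines a minimal Meta Mobile Block, assuming 1 × 1 convolutions play the role of the expansion and shrinkage MLPs and treating the efficient operator F as a pluggable module; the class name, activation choice, and layer details are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class MetaMobileBlock(nn.Module):
    """Minimal sketch of the MMB flow: expand -> efficient operator F -> shrink -> residual."""
    def __init__(self, channels: int, expand_ratio: float, efficient_op: nn.Module):
        super().__init__()
        hidden = int(channels * expand_ratio)
        # Stage 1: channel expansion MLP_e (1x1 conv) with output-to-input ratio lambda
        self.mlp_e = nn.Sequential(nn.Conv2d(channels, hidden, 1), nn.SiLU())
        # Stage 2: efficient operator F (identity, static conv, or attention-based)
        self.op_f = efficient_op
        # Stage 3: channel shrinkage MLP_s (1x1 conv) back to the input width
        self.mlp_s = nn.Conv2d(hidden, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_e = self.mlp_e(x)      # X_e = MLP_e(X), lambda*C channels
        x_f = self.op_f(x_e)     # X_f = F(X_e)
        x_s = self.mlp_s(x_f)    # X_s = MLP_s(X_f), back to C channels
        return x + x_s           # residual connection: Y = X + X_s
```

With `efficient_op = nn.Identity()` the block reduces to a plain inverted residual bottleneck; substituting a convolution or an attention operator recovers the other instantiations discussed above.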
Building upon the Meta Mobile Block (MMB), we propose the Inverted Residual Mobile Block (iRMB), whose core innovation lies in designing the efficient operator F as a cascaded structure of multi-head self-attention (MHSA) and convolutional operations.
This cascaded design is not a simple stacking; rather, it enables functional complementarity between the two operators. Specifically, MHSA captures global context dependencies across the feature map, while the convolutional operations strengthen the extraction of local textures and edge details, providing fine-grained feature support essential for small defect detection.
To reduce computational overhead while maintaining high representational capacity, the backbone integrates window-based multi-head self-attention (W-MHSA) with depthwise separable convolutions (DW-Conv), complemented by residual connections to ensure training stability.
In W-MHSA, the feature map is partitioned into local windows, and attention is computed independently within each window. This transforms the quadratic complexity of conventional MHSA into a linear complexity with respect to spatial size, significantly reducing computational cost. Meanwhile, DW-Conv decomposes standard convolution into depthwise and pointwise operations, reducing parameter redundancy and enhancing channel-wise feature decoupling.
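Because attention is restricted to non-overlapping windows, the quadratic cost only applies within each window. The helper below is a minimal sketch of the partitioning step, assuming the feature-map size is divisible by the window size; the reshape pattern is illustrative.

```python
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split (B, C, H, W) into (B * num_windows, C, win, win) so that attention
    is computed independently inside each window."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // win, win, w // win, win)
    x = x.permute(0, 2, 4, 1, 3, 5).contiguous()
    return x.view(-1, c, win, win)
```

Each window of win × win tokens costs on the order of win⁴ attention operations, so the total cost grows linearly with the number of windows (and hence with H × W) instead of quadratically with the full spatial size.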
In conventional W-MHSA, computing the query Q and key K involves the expanded channels, resulting in a cost that grows quadratically with the channel dimension. To improve efficiency, we propose Expanded Window MHSA (EW-MHSA), where the attention matrix is computed using the unexpanded feature $X \in \mathbb{R}^{C \times H \times W}$, while the expanded feature $X_{e} \in \mathbb{R}^{\lambda C \times H \times W}$ serves as the value $V$:
$$Q = K = X, \qquad V = X_{e}$$
The attention output is then formulated as:
$$\mathrm{EW\text{-}MHSA}(X) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V$$
where $d_{k}$ denotes the per-head key dimension. The operator F is then formulated as:
$$\mathcal{F}(\cdot) = \big(\mathrm{DW\text{-}Conv},\ \mathrm{Skip}\big)\big(\mathrm{EW\text{-}MHSA}(\cdot)\big)$$
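The sketch below illustrates how operator F could be realized as the EW-MHSA/DW-Conv cascade described above: Q and K are taken from the unexpanded C-channel feature (here via a small shared projection, added for generality), V is produced by the channel expansion itself, and a depthwise convolution with a skip connection then refines local structure. For brevity it uses a single global window and omits positional bias, so it is a simplified illustration rather than the exact implementation; the class name and head count are assumptions.

```python
import torch
import torch.nn as nn

class EWMHSAOperatorF(nn.Module):
    """Sketch of operator F: EW-MHSA (Q, K from the unexpanded feature, V from the
    expansion) followed by a depthwise conv with a skip connection."""
    def __init__(self, channels: int, expand_ratio: float = 2.0, num_heads: int = 4):
        super().__init__()
        self.heads = num_heads
        self.hidden = int(channels * expand_ratio)
        self.qk = nn.Conv2d(channels, 2 * channels, 1)   # Q, K from unexpanded X
        self.v = nn.Conv2d(channels, self.hidden, 1)     # V doubles as the expansion to lambda*C
        self.dw = nn.Conv2d(self.hidden, self.hidden, 3, padding=1, groups=self.hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k = self.qk(x).chunk(2, dim=1)
        v = self.v(x)

        def to_tokens(t: torch.Tensor, dim: int) -> torch.Tensor:
            # (B, dim, H, W) -> (B, heads, H*W, dim // heads)
            return t.view(b, self.heads, dim // self.heads, h * w).transpose(-2, -1)

        q, k, v = to_tokens(q, c), to_tokens(k, c), to_tokens(v, self.hidden)
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = attn.softmax(dim=-1) @ v                   # attention applied to the expanded V
        out = out.transpose(-2, -1).reshape(b, self.hidden, h, w)
        return out + self.dw(out)                        # DW-Conv with skip connection
```

Because V carries the channel expansion, the operator folds Stage 1 into the attention itself, which is what keeps the Q/K computation on the cheaper, unexpanded width.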
The iRMB incorporates DW-Conv and EW-MHSA, enabling an effective trade-off between lightweight structure and strong detection performance. This architecture captures both local cues and global contextual dependencies with high efficiency, making it particularly suitable for real-time UAV-based surface defect inspection (Table 1).
Downsampling in the model is achieved through stride adaptation within iRMBs rather than aggressive pooling or positional embeddings. This strategy preserves spatial continuity and reduces feature distortion, ensuring that small defects remain detectable across network stages. Meanwhile, the gradual increase in channel dimensions and expansion ratios enhances representational capacity without introducing excessive computational cost.
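A minimal sketch of this strategy, under the assumption that the stride is carried by the depthwise convolution inside the block (with the residual branch simply dropped when the spatial resolution changes), is given below.

```python
import torch.nn as nn

# Illustrative stride-based downsampling: the depthwise conv inside the block
# carries stride 2, so no pooling layer or positional embedding is required.
def strided_dwconv(channels: int) -> nn.Module:
    return nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                     padding=1, groups=channels)
```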
Based on this design, the EMO architecture is constructed as shown in Figure 3. EMO is an efficient, ResNet-inspired four-stage network built entirely from a sequence of iRMBs, without incorporating other module types. Each iRMB contains only standard convolutional layers and multi-head self-attention, avoiding additional complex operations. Downsampling is performed via stride adaptation, eliminating the need for positional embeddings. Additionally, the expansion ratios and channel dimensions progressively increase across stages, improving the network's ability to represent features while preserving computational efficiency.
From a deployment perspective, aircraft inspection tasks require low-latency and energy-efficient inference to support on-board or near-edge processing. EMO relies exclusively on standard convolutional operations and efficient self-attention, avoiding complex operators and memory-intensive designs. Consequently, the backbone achieves a favorable balance between accuracy and computational efficiency, making it suitable for real-time aircraft defect detection in resource-constrained UAV scenarios.
By aligning its architectural design with the geometric characteristics, background complexity, and operational constraints of UAV-based aircraft defect detection, our EMO-based backbone goes beyond a simple lightweight backbone replacement and provides a task-driven feature extraction framework for accurate and efficient inspection.
3.2. Dynamic Scale-Weighted Loss for UAV Surface Inspection
Surface defects on large-scale structures often vary considerably in size and morphology. In small-object detection tasks, IoU-based loss functions can exhibit high fluctuations, negatively affecting model stability and regression accuracy. Moreover, conventional losses do not fully account for scale- and position-dependent sensitivity across objects of different sizes, making small-target detection particularly unstable.
To address this issue, we propose the Scale-Decoupled Loss (SDLoss), which explicitly mitigates localization errors caused by inconsistent defect scales, as shown in Figure 4. In SDLoss, the contributions of the Scale Loss (Sloss) and the Localization Loss (Lloss) are dynamically adjusted according to object size. For small defects, the Sloss weight is reduced to alleviate scale-related errors, while the Lloss weight is increased to ensure precise localization. Conversely, for larger defects, the Sloss weight is amplified to optimize scale regression, providing balanced and robust supervision across multi-scale targets in UAV-based surface inspection tasks.
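As a rough illustration of this scheduling behavior, the sketch below produces scale- and localization-loss weights that shift with the normalized box area; the specific functional form and the `gamma` and `pivot` values are hypothetical choices for illustration, not the parameters of SDLoss.

```python
def sdloss_weights(box_area: float, img_area: float,
                   gamma: float = 2.0, pivot: float = 0.01):
    """Illustrative scale-dependent weighting: small boxes receive a larger
    localization weight, large boxes a larger scale weight.
    `gamma` (sharpness) and `pivot` (area ratio where the weights balance)
    are hypothetical tunables."""
    r = box_area / img_area                               # scale ratio of the current target
    w_scale = r ** gamma / (r ** gamma + pivot ** gamma)  # rises with object size
    w_loc = 1.0 - w_scale                                 # falls as the object gets larger
    return w_scale, w_loc
```

Under this schedule, a defect covering 0.1% of the image would receive mostly localization supervision, whereas one covering 10% would be dominated by the scale term.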
The influence coefficient for the BBox labels is computed from the scale of each target: it depends on the area of the current target box, the corresponding scale ratio, and a tunable parameter that controls how strongly the coefficient responds to changes in object scale.
Define:
$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$$
where $w, h$ and $w^{gt}, h^{gt}$ are the widths and heights of the predicted and ground-truth boxes, $v$ measures the consistency of the predicted and ground-truth boxes in terms of aspect ratio, $\rho(\cdot)$ is the Euclidean distance function used to calculate the distance between the center points of the predicted box $b$ and the ground-truth box $b^{gt}$, and $c$ denotes the diagonal length of the smallest rectangle that encloses both the predicted and ground-truth boxes.
The final scale-adaptive SDB loss is then formed by combining the scale term and the localization term under the influence coefficient defined above, so that supervision shifts smoothly between scale regression and precise localization as the target size varies.
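To tie the pieces together, the sketch below computes the aspect-ratio, center-distance, and enclosing-diagonal terms defined above and blends a scale term with a localization term under a scale-dependent weight; the particular combination (and the omission of the IoU term) is an assumption made for illustration, not the exact SDB loss.

```python
import math
import torch

def sdb_loss_sketch(pred: torch.Tensor, gt: torch.Tensor,
                    img_area: float, eps: float = 1e-7) -> torch.Tensor:
    """pred, gt: (N, 4) boxes in (x1, y1, x2, y2) format. Illustrative only."""
    pw = (pred[:, 2] - pred[:, 0]).clamp(min=eps)
    ph = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    gw = (gt[:, 2] - gt[:, 0]).clamp(min=eps)
    gh = (gt[:, 3] - gt[:, 1]).clamp(min=eps)

    # v: aspect-ratio consistency between predicted and ground-truth boxes
    v = (4 / math.pi ** 2) * (torch.atan(gw / gh) - torch.atan(pw / ph)) ** 2

    # rho^2: squared Euclidean distance between box centers
    pc = (pred[:, :2] + pred[:, 2:]) / 2
    gc = (gt[:, :2] + gt[:, 2:]) / 2
    rho2 = ((pc - gc) ** 2).sum(dim=1)

    # c^2: squared diagonal of the smallest box enclosing both boxes
    enc_wh = torch.max(pred[:, 2:], gt[:, 2:]) - torch.min(pred[:, :2], gt[:, :2])
    c2 = (enc_wh ** 2).sum(dim=1) + eps

    scale_term = v          # penalizes scale / aspect-ratio mismatch
    loc_term = rho2 / c2    # penalizes normalized center offset

    # Scale-dependent weighting (same illustrative schedule as the earlier sketch).
    r = (gw * gh) / img_area
    w_s = r ** 2 / (r ** 2 + 0.01 ** 2)
    w_l = 1.0 - w_s
    return (w_s * scale_term + w_l * loc_term).mean()
```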
3.3. Adaptive Multi-Scale Feature Enhancement for UAV-Based Defect Detection
In deep object detection models, feature maps from different layers exhibit varying receptive fields and semantic representation capabilities. Shallow layers are more effective at capturing edge and texture details, while deeper layers provide stronger semantic discriminability. To fully leverage multi-scale features and enhance the model’s sensitivity to fine-grained defects, we adopt the SPPELAN module, which combines path enhancement and spatial positional attention mechanisms. This design strengthens the representation of salient regions corresponding to defect areas, improving the detection of small and subtle defects in UAV-based surface inspection tasks.
The module first applies a 1 × 1 convolution to the input feature map $X_{\mathrm{in}}$ to reduce the channel dimension, as shown in Figure 1, thereby lowering the computational cost. Multiple parallel convolutional paths are then constructed, each consisting of convolutional stacks of varying depth. In one of these paths, a Spatial Pyramid Pooling (SPP) structure is introduced, which employs max-pooling operations with kernels of different sizes $k_{1}, \ldots, k_{n}$ to extract multi-scale contextual features:
$$X_{\mathrm{SPP}} = \mathrm{Concat}\big( X',\ \mathrm{MaxPool}_{k_{1}}(X'),\ \mathrm{MaxPool}_{k_{2}}(X'),\ \ldots,\ \mathrm{MaxPool}_{k_{n}}(X') \big)$$
where $X'$ denotes the channel-reduced feature.
The aforementioned operations significantly enlarge the model's receptive field, enhancing its responsiveness to large-scale objects and elongated structures. During the multi-path output fusion stage, all path features are concatenated and passed through a 1 × 1 convolution to achieve channel compression and feature integration, producing the final feature representation:
$$Y = \mathrm{Conv}_{1 \times 1}\big( \mathrm{Concat}\big( F_{1}(X'),\ F_{2}(X'),\ \ldots,\ F_{n}(X') \big) \big)$$
Here, $F_{i}(\cdot)$ denotes the convolutional transformation of the i-th path, including the path that incorporates the SPP operation.
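A condensed PyTorch-style sketch of this fusion topology is shown below, assuming two plain convolutional paths plus one SPP path with three pooling kernels; the kernel sizes, path depths, and channel widths are illustrative placeholders rather than the exact configuration.

```python
import torch
import torch.nn as nn

class SPPELANSketch(nn.Module):
    """Illustrative SPPELAN-style fusion: 1x1 channel reduction, parallel conv paths
    of different depth, one SPP path with multi-kernel max pooling, then 1x1 fusion."""
    def __init__(self, c_in: int, c_out: int, c_mid: int = 64, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.reduce = nn.Conv2d(c_in, c_mid, 1)                    # initial 1x1 reduction
        self.path1 = nn.Conv2d(c_mid, c_mid, 3, padding=1)         # shallow conv path
        self.path2 = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.SiLU(),
                                   nn.Conv2d(c_mid, c_mid, 3, padding=1))  # deeper conv path
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2)
                                    for k in pool_sizes])          # SPP path
        n_feats = 1 + 2 + len(pool_sizes)                          # reduced feature + 2 paths + pools
        self.fuse = nn.Conv2d(n_feats * c_mid, c_out, 1)           # 1x1 concat fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        feats = [x, self.path1(x), self.path2(x)] + [p(x) for p in self.pools]
        return self.fuse(torch.cat(feats, dim=1))
```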
Through this design, the SPPELAN module effectively enhances the network’s feature representation and detection robustness while maintaining low parameter count and computational overhead. By combining local structural details with global contextual information, SPPELAN substantially improves the model’s capability to detect objects of varying sizes, especially small and irregularly shaped defects. Experimental results indicate that integrating SPPELAN boosts overall detection accuracy and inference efficiency without significantly increasing model complexity, establishing it as a key component for lightweight UAV-based surface defect detection models that balance precision and operational speed.