IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing

Xu, Cheng; Liu, Xin; Wang, Jiaxin; Ding, Yun; Zheng, Chunhou

doi:10.3390/rs18030486

Open AccessArticle

IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing

by

Cheng Xu

¹,

Xin Liu

²,

Jiaxin Wang

¹,

Yun Ding

^1,*

and

Chunhou Zheng

¹

School of Artificial Intelligence, Anhui University, Hefei 230601, China

²

School of Control and Computer Engineering, North China Electric Power University, Beijing 102206, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(3), 486; https://doi.org/10.3390/rs18030486

Submission received: 16 December 2025 / Revised: 23 January 2026 / Accepted: 29 January 2026 / Published: 3 February 2026

(This article belongs to the Special Issue GeoAI and EO Big Data Driven Advances in Earth Environmental Science (Second Edition))

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

The proposed lightweight insulator defect detection model (IDD-DETR) achieves a 2.2% higher mean average precision (mAP) on the insulator small-defect dataset compared to the baseline algorithm, while reducing model parameters by 44.9% and computation by 47.1%.
IDD-DETR integrates three key components (LMS-FE, SOEP, and LMB-FF) to balance detection precision and efficiency, finally reaching a detection speed of 61.2 frames per second.

What are the implications of the main findings?

The model’s lightweight design and high small-target detection ability meet the edge deployment requirements for transmission line aerial inspection, addressing the practical challenge of on-site insulator defect detection under complex backgrounds.
The innovations provide a feasible technical solution to common issues in insulator defect detection, such as excessive model complexity and inadequate small-defect recognition.

Abstract

Aiming to address the challenges of excessive model parameters, high computational complexity, strong complex background interference, and inadequate small-target detection found in insulator defect detection when using UAV remote sensing imagery of transmission lines, we propose a lightweight multi-defect detection model—Insulator Defect Detection-DETR (IDD-DETR). Specifically, we introduce a lightweight multi-starblock feature extractor (LMS-FE) as the backbone network to enhance its feature extraction capacity. Next, in order to enhance small-defect detection performance, a multi-scale feature pyramid (SOEP) is constructed by integrating shallow high-resolution features into the neck network. Additionally, a lightweight multi-branch feature fusion module (LMB-FF) is designed to efficiently fuse spatial and semantic information of small defects, suppressing background interference while optimizing model complexity. Finally, experimental results demonstrate that IDD-DETR achieves a 2.2% improvement in mean average precision (mAP) on the insulator small-defect dataset compared with the baseline algorithm, with model parameters and computation reduced by 44.9% and 47.1%, respectively. It also reaches a detection speed of 61.2 frames per second, satisfying the lightweight and high-precision requirements for edge deployment in transmission line inspection scenarios.

Keywords:

insulator defect detection; small target detection; complex background; lightweighting; RT-DETR algorithm

Graphical Abstract

1. Introduction

Insulators are critical components in transmission lines, primarily serving the functions of insulation and mechanical suspension. However, with its advantages of flexibility and maneuverability, wide coverage, high inspection efficiency, and convenient data acquisition, UAV remote sensing technology has become the mainstream technical means for monitoring the operational status of transmission line equipment [1,2,3,4,5]. Due to prolonged exposure to natural environmental conditions, insulators are vulnerable to various external factors such as temperature and humidity fluctuations, lightning strikes, strong electric fields, and contamination erosion, which can result in breakage, self-explosion, or surface degradation, posing serious threats to the safety and stability of transmission lines [6,7]. Given that most high-voltage transmission lines are located in remote and inaccessible regions such as mountainous terrains and river valleys, traditional manual inspection methods are not only costly and inefficient but also require substantial human and financial resources. Therefore, developing efficient algorithms and deploying them on embedded AI modules for real-time, edge-side UAV-based insulator inspection has become a key solution to enhancing inspection efficiency [8].

Traditional object detection algorithms can be broadly classified into two categories: two-stage methods (e.g., R-CNN [9] and Faster R-CNN [10]), which first generate region proposals followed by classification and localization, and one-stage approaches (e.g., YOLO [11] and SSD [12]), which directly predict bounding boxes and class probabilities. Among these, YOLO has emerged as a mainstream object detection algorithm due to its real-time inference capabilities, end-to-end trainability, low memory footprint, reduced background false-positive rates, and strong generalization ability, thereby attracting extensive research attention. For instance, PPYOLOE [13] and YOLOv6 [14] improved detection performance through structural re-parameterization and task-aligned learning (TAL) strategies, while YOLOv7 [15] further optimized cross-stage partial networks using extended efficient layer aggregation networks (ELAN) and bag-of-freebies techniques. YOLOv9 [16] addressed information bottlenecks through generalized efficient layer aggregation networks (GELAN) and programmable gradient information (PGI), whereas DAMO-YOLO [17] achieved superior backbone–neck feature fusion with the aid of re-parameterized generalized FPN (RepGFPN). Gold-YOLO [18] incorporated local–global feature interaction via a gather-and-distribute (GD) mechanism, combining convolution and self-attention modules. In AE-YOLOv5 [19], an attention-enhanced YOLOv5 (AE-YOLOv5) was proposed by embedding a channel-spatial attention module (CSAttention) and a multi-scale attention module (MAttention), which significantly enhanced the model’s feature representation capability for insulator defects under complex backgrounds. In ID-YOLO [20], a cross-stage residual attention network (CSP-ResNeSt) combined with a bidirectional weighted feature pyramid network (Bi-SimAM-FPN) was designed, leveraging grouped convolution and inter-class attention mechanisms to effectively address the scale-sensitivity issues associated with multi-type insulator defects. In LiteYOLO-ID [21], a lightweight ECA-GhostNet-C2f module (EGC) and its derivative network architecture were constructed, achieving real-time and high-precision detection on embedded devices through parameter sharing strategies and attention-guided feature compression. In addition, some methods [22,23,24,25,26,27,28] utilize the strategy with the integration between spatial information and spectral information to achieve good detection performance. Meanwhile, traditional CNN-based detectors still heavily depend on threshold-based candidate screening and non-maximum suppression (NMS), which can reduce model robustness and impair detection speed to a certain extent. To overcome these limitations, the Facebook AI team introduced detection transformer (DETR) [29], a model that utilizes a global feature representation through the Transformer architecture, replaces anchor-based proposal generation with object queries, and employs a bipartite graph matching algorithm combined with a novel loss function, thereby eliminating the need for NMS and simplifying the detection pipeline. However, despite its superior detection accuracy, DETR suffers from time-consuming training processes and complex query optimization procedures, which limit its practical deployment.

To address these limitations, the Baidu team proposed a real-time end-to-end detection transformer model (RT-DETR) [30] and its variants [31,32,33,34] that significantly enhances detection performance by eliminating post-processing steps such as confidence screening and NMS. Deformable DETR [32] is proposed to focus on a small set of key sampling points to deal with slow convergence and limited feature spatial resolution due to the limitation of Transformer attention modules in processing image feature maps. MS-DETR [33] places one-to-many supervision to the object queries of the primary decoder that is used for inference. Moreover, YOLOv10 [35] aims to further advance the performance efficiency boundary of YOLOs from both the post-processing and the model architecture. YOLOv11 [36] incorporates advanced feature extraction techniques, allowing for more nuanced detail capture. Mamba YOLO [37] introduce a State Space Model (SSM) with linear complexity to address the quadratic complexity of self-attention. Compared with mainstream models like the YOLO series [35,36,37,38,39], which capture multiscale information, RT-DETR offers advantages in both detection accuracy and inference speed. Nonetheless, its large computational load and parameter count hinder its deployment on mobile and resource-constrained edge devices. Additionally, in aerial transmission line scenarios with complex backgrounds, RT-DETR still exhibits shortcomings in small target detection, leading to missed detections and false alarms, which in turn affect the stability and accuracy of insulator defect detection.

To address these challenges, this paper proposes a lightweight insulator multi-defect detection algorithm. The proposed method effectively reduces the computational complexity and parameter count of the RT-DETR model while enhancing small-target detection accuracy, thereby meeting the demands of industrial practical deployment. Specifically, the improvements are made in the following three aspects:

(1) We design a lightweight multi-starblock feature extractor (LMS-FE), which enhances feature extraction capability while significantly reducing the model’s computational complexity. This design achieves model lightweighting and makes the model suitable for edge deployment on UAV platforms with limited computing resources.

(2) We propose a multi-scale feature pyramid named Small Object Enhancement Pyramid (SOEP). The SOEP incorporates SPD convolutional module and the CSPO module. The CSPO module extends the receptive field and enhances global feature extraction capability by parallelizing full-kernel morphological convolutions. Moreover, SOEP integrates the shallow 160 × 160 feature map; this modification enhances the model’s sensitivity to small defective targets in UAV remote sensing imagery, addressing the issue of low spatial resolution of small defects in aerial photography.

(3) We propose a lightweight multi-branch gradient feedback paths feature fusion module (LMB-FF). Specifically, a small-scale receptive field sensing module is designed and embedded into LMB-FF, which effectively suppresses complex background noise in UAV imagery and improves the detection accuracy of small targets under aerial shooting conditions.

The remainder of this article is organized as follows. Section 2 covers the related work. Section 3 details the implementation of the proposed method. Section 4 presents the experimental results and analysis. Section 5 provides the conclusion and future work of this article.

2. Related Work

The RT-DETR model leverages the Transformer architecture and offers multiple variants, including R18, R34, R50, R101, L, and X configurations. In response to the challenges posed by complex backgrounds and stringent model size constraints inherent in UAV-based small object detection tasks, this paper selects RT-DETR-R18 as the baseline model due to its reduced computational overhead and superior detection accuracy. The architecture of RT-DETR-R18 is composed of three principal components: a Backbone, a Neck Hybrid Encoder, and a Transformer Decoder equipped with an auxiliary detection head.

In the backbone component, the model employs a Convolutional Neural Network (CNN) based on ResNet18 [40] for hierarchical feature extraction. The feature outputs from the last three stages, denoted as S3, S4, and S5, are utilized as inputs to the encoder. The neck hybrid encoder integrates two key modules: the Attention-based Intra-scale Feature Interaction (AIFI) module and the CNN-based Cross-scale Feature Fusion Module (CCFM). The AIFI module is designed to enhance and preserve fine-grained information by encoding features across multiple scales, thereby improving the model’s ability to capture small object characteristics. Meanwhile, the CCFM employs a bidirectional fusion mechanism, effectively aggregating multi-scale features from S3, S4, and S5 through a combination of upsampling and downsampling operations. This process incorporates both bottom-up and top-down pathways to strengthen feature representation and maintain semantic integrity. The fused features are subsequently aggregated via element-wise addition and reshaped into a sequence of image features through flattening.

In the decoder component, initial object queries are selected from the encoder output sequence using the IoU-aware Query Selection module. These queries are then refined through iterative optimization to produce final bounding box predictions and their corresponding confidence scores, enabling efficient target detection. The overall architectural design achieves a balance between computational efficiency and detection precision, meeting the practical requirements of small object detection in resource-constrained edge computing environments.

3. Methods

The proposed insulator defect detection method IDD-DETR includes three modules: a feature extraction module, a feature enhancement module, and a feature fusion module. First, IDD-DETR adopts the lightweight multi-starblock (LMS-FE) as the feature extractor. While improving the computational efficiency of the model, it realizes the lightweight deployment on UAV edge platforms, effectively addressing the issue of excessive resource consumption of traditional models on edge devices. Following this, aiming at the pain points of insufficient low-level feature representation and low detection rate of small defect targets in small target detection tasks, a small target enhancement pyramid module SOEP is designed. By fusing the shallow features of the P2 layer, it not only strengthens the low-level feature representation capability but also significantly improves the detection sensitivity for small defect targets. Finally, a lightweight multi-branch feature fusion module LMB-FF is proposed, which utilizes multi-branch gradient feedback paths to achieve efficient fusion of spatial information and semantic information. This design not only suppresses the interference of complex background noise but also further enhances the robustness of the model, ensuring the stability of detection performance. The overall architecture is depicted in Figure 1.

3.1. Lightweight Multi-Starblock Feature Extractor

To address the high computational complexity caused by traditional backbone networks, IDD-DETR proposes a lightweight multi-starblock feature extractor LMS-FE based on StarNet [41] to meet the requirements of mobile deployment, thereby further optimizing the overall performance of the model.

The architecture of feature extraction process comprises four hierarchical stages, as shown in Figure 2. Each stage of the module performs downsampling via a convolution layer and extracts features using the starblock. Specifically, the operation process of each starblock is as follows: First, the input feature map is processed with a 7 × 7 depth-wise convolution (DW-Conv); this operation efficiently captures local spatial features while significantly reducing the model’s parameters and computational complexity. Subsequently, a batch normalization layer is introduced to standardize the input distribution of each layer, thereby stabilizing the training process, accelerating model convergence, and enhancing robustness against input noise. Finally, the normalized feature map is divided into two parallel branches: one branch adopts a 1 × 1 pointwise convolution (PWConv) for cross-channel feature fusion and nonlinear transformation, while the other leverages the ReLU6 activation function to introduce nonlinear feature responses.

The starblock module leverages element-wise multiplication operations, which can be expressed by the following formula:

(W_{1}^{T} X + B_{1}) * R e L U 6 (W_{2}^{T} X + B_{2})

(1)

where

W_{1}

and

W_{2}

are distinct weight matrices used to perform linear transformations on the input feature

X

, where

X

denotes the output feature of DW-Conv module,

B_{1}

and

B_{2}

correspond to the bias terms of the two linear transformations, and

*

represents the element-wise multiplication operation, i.e., the core operator of the star operation.

This operation significantly strengthens the model’s capability to capture complex and fine-grained features.

3.2. Small Object Enhancement Pyramid

To address the structural redundancy and high computational complexity issues of traditional schemes in small target detection tasks, we propose the small object enhancement pyramid (SOEP) method to efficiently enhance the shallow feature representation required for small target detection, as illustrated in Figure 3.

To improve the perception capability of defective small targets, we introduce the 160 × 160 shallow features from the P2 layer into the SOEP structure and integrate them into the neck network. Specifically, the P2 layer leverages SPD convolution (SPDConv) [42] to generate scale-aware features that are rich in small-object information. The detailed design of SPDConv is shown in Figure 3, which can be formulated as follows:

X_{SPD} = concat (f_{0, 0}, f_{1, 0}, \dots, f_{scale - 1, scale - 1})

(2)

X_{SPDConv} = {Conv}_{k, 1} (X_{SPD})

(3)

where

f_{i, j}

is a region sampled from the inpu

X

with a stride of scale.

Subsequently, after upsampling the P5 layer, it is concatenated and fused with the P4 layer. The fused feature is then upsampled again and concatenated with the output feature of the P3 layer and

X_{S P D C o n v}

, denoted as

X^{'}

.

To alleviate the problem of information redundancy in feature fusion, we feed

X'

into CSP Omni-Kernel (CSPO) module, as illustrated in Figure 4. It consists of a global branch, a large-kernel branch, and a local branch. Specifically, the global branch extracts global contextual information using a dual-domain channel attention module (DCAM) and a frequency-based spatial attention module (FSAM). The large-kernel branch captures large-scale structural features via depth-wise convolutions with an extremely large kernel size. The local branch supplements fine-grained local details through pointwise depth-wise convolutions.

X_{conv} = {Conv}_{1 \times 1} (X^{'}),

(4)

X_{local} = {DWConv}_{1 \times 1} (X_{conv}),

(5)

\begin{matrix} X_{large} = {DWConv}_{9 \times 1} (X_{conv}) + {DWConv}_{9 \times 9} (X_{conv}) \\ + {DWConv}_{1 \times 9} (X_{conv}), \end{matrix}

(6)

X_{global} = FSAM (DCAM (X_{conv})),

(7)

where

D W C o n v

is depth-wise convolutional layer.

Finally, the outputs of these three branches are fused via a 1 × 1 convolutional layer to generate the final output:

X_{Omni} = {Conv}_{1 \times 1} (X_{local} + X_{large} + X_{global}) .

(8)

Specifically, DCAM enhances the representation of global features by reinforcing the coarse-grained dual-domain characteristics of the channel dimension. It leverages the Fast Fourier Transform (FFT) and its inverse (IFFT) combined with Frequency Channel Attention (FCA) to refine global feature representations based on the spectral convolutional theorem. Subsequently, these enhanced features are further processed by the spatial channel attention (SCA) module to capture comprehensive global information in the frequency domain. Meanwhile, the FSAM refines the spectral feature representations from the spatial dimension, thereby strengthening the characterization of fine-grained details.

By integrating global, large, and local branches, the module achieves efficient feature extraction across multiple scales, from global structures to local details, while effectively suppressing interference from complex backgrounds. As a result, the proposed SOEP structure substantially improves the detection performance of small targets by efficiently fusing multi-scale features without incurring additional computational overhead.

3.3. Lightweight Multi-Branch Feature Fusion Module

To improve the detection accuracy of small-object detection under complex backgrounds, we propose the LMB-FF method, a lightweight, efficient, and parameterized aggregation module designed to enhance feature fusion through a multi-branch gradient feedback mechanism. This design aggregates local and contextual features of small targets, improving detection performance.

The LMB-FF module, as illustrated in Figure 5, implements a feature split, multi-branch processing, concatenation fusion workflow to enhance feature representation while balancing efficiency [43]. Specifically, the input feature map first undergoes dimensionality reduction via a 1 × 1 convolutional kernel, and is then split into multiple sub-feature maps through a split operation. These sub-feature maps are processed in parallel across three distinct branches: a 3 × 3 convolutional kernel branch integrated with the NCSP structure, a standard 3 × 3 convolutional kernel branch, and a combined branch consisting of Conv3XC-NCSP and 3 × 1 convolutional kernel. Notably, the Conv3XC-NCSP itself adopts a multi-branch parallel design, which includes a 3 × 3 convolutional kernel branch, a Conv3XC and 3 × 3 convolutional kernel branch, and another 3 × 3 convolutional kernel branch. The outputs of these internal branches are fused via element-wise addition. After processing, the outputs of all branches are concatenated, and the channel dimension is adjusted using a 1 × 1 convolutional kernel to generate the final output of the LMB-FF module.

To further optimize deployment efficiency without sacrificing performance, the Conv3XC component within the module leverages a reparameterization strategy [34] that decouples training and inference structures. During the training phase, it employs a multi-branch parallel architecture composed of 1 × 1 convolutional layer, 3 × 3 convolutional layer, and 1 × 1 convolutional layer, with branch outputs fused via element-wise addition to ensure sufficient feature diversity. In the inference phase, this multi-branch structure is equivalently re-parameterized into a single 3 × 3 convolutional kernel through Fuse BN, which simplifies the model architecture and reduces computational overhead while retaining the learned feature representation capabilities.

The convolutional layer is represented by the equation in Equation (9), while the BN layer is described in Equation (10). The fusion process is illustrated in Equation (11), which effectively combines the operations of convolution and BN to reduce computational overhead while maintaining the model’s performance.

\{\begin{matrix} C o n v (x) = w * x + b \\ a v g (x) = \frac{1}{n} \sum_{i}^{n} x_{i} \end{matrix}

(9)

\{\begin{matrix} σ^{2} (x) = \frac{1}{n} \sum_{x = i}^{n} (x_{i} - a ν g (x))^{2} \\ B N (x) = γ \frac{x - a ν g (x)}{\sqrt{σ^{2} (x)}} + β \end{matrix}

(10)

Con v_{fused} (x) = \frac{w * γ}{\sqrt{σ^{2} (x)}} x + γ \frac{b - a ν g (x)}{\sqrt{σ^{2} (x)}} + β

(11)

where x is the input value, w is the convolution module weight, b is the bias, avg is the input mean,

σ

is the input variance,

γ

is the BN weight, and

β

is the weight; Equation (9) represents the expression for the weights and biases of the newly fused convolutional layer. Specifically, during the inference phase, the convolutional weights are fused with the batch normalization (BN) parameters, thereby enabling the reparameterization of model weights and biases. Equation (11) corresponds to the fused weights of the convolutional layer and the BN layer. The determination of these weights combines parameter learning during the training phase with reparameterization optimization during the inference phase, and the weights of the convolution module enhance the stability of small target feature representation and accelerate the convergence of the proposed IDD-DETR model and optimize small target detection accuracy. This process reduces computational complexity and enhances inference speed. During the model conversion process, the weights of each branch are uniformly transformed to a 3 × 3 convolutional kernel size. To achieve consistency in kernel size, appropriate padding operations are applied to the 1 × 1 convolutional layers. Ultimately, the weights and biases of each branch are weighted and summed to form the final weights and biases of the single-branch convolution, thereby completing the optimization step of weight integration.

By optimizing both the feature transfer path and the aggregation mechanism, the LMB-FF module significantly enhances the fusion and representation of multi-scale features, while simultaneously reducing the computational resource overhead. This efficient feature aggregation mechanism plays a crucial role in preserving detailed information, particularly when handling complex tasks. As a result, it considerably improves the model’s performance in small target detection, enabling more accurate and effective detection under challenging conditions.

4. Results

4.1. Experiments Settings

(1) Dataset: The dataset utilized in this study comprises three distinct parts. The first part is the CPLID dataset [44], publicly released by the State Grid on GitHub, containing 1240 images. The second part is the Insulator Defect Multiple Defects Image Dataset (IDID) [31], provided by IEEE, which includes a total of 1702 images. The third part is a transmission line insulator defect detection dataset, collected by a power grid company, which consists of 119 images. In total, these three datasets encompass 3061 images, representing a broad spectrum of glass, ceramic, and composite insulators used in transmission lines. The dataset includes images of insulators in various complex scenarios, such as forests, rivers, construction sites, wheat fields, meadows, and snow. All images have been meticulously annotated using LabelImg 1.8.6 software, with the label distribution statistics presented in Figure 6. Insulator defects are categorized into three types: self-destructive (drope), breakage (break), and flashover damage. These defect categories are characterized by small-sized targets that are easily confused with background noise, making them particularly challenging to detect. Following the annotation process, the dataset is randomly split into training, testing, and validation sets at a ratio of 8:1:1.

(2) Experiment Setup: Hyperparameters are detailed in Table 1. Experiments were run on a Windows 10 platform with an NVIDIA GeForce RTX 2080Ti GPU; full hardware specifications are listed in Table 2.

(3) Evaluation Metrics: In the domain of target detection, specific metrics are employed to assess the performance of detection models. For evaluating model complexity, this study utilizes the number of parameters (Parameters) and the number of floating-point operations (GFLOPs) as indicators of spatial and temporal complexity. Lower values for Parameters and GFLOPs signify a more lightweight model, which is indicative of reduced hard are performance requirements.

To evaluate algorithm performance, this paper adopts several widely recognized criteria: average precision (AP), mean average precision (mAP), and frames per second (FPS). Average precision is derived from the Recall and Precision metrics, which provide direct and intuitive measures of the model’s detection effectiveness for each class.

The recall rate is

R_{ec} = T_{p} / (T_{p} + F_{N}) \times 100 %

(12)

The accuracy rate is

P_{re} = T_{p} / (T_{p} + F_{p}) \times 100 %

(13)

where P(r) is the Precision–Recall curve,

\sum_{k = 0}^{c} A P

is the precision of each category, and C is the total number of categories.

4.2. Experimental Results Comparison

To evaluate the effectiveness of the proposed algorithm, a series of lightweight models with comparable parameter counts are selected for comparative experiments. These models include YOLOv5s, YOLOv9s [16], YOLOv10s [35], YOLOv11s [36], Mamba-YOLO [37], Gold-YOLO [18], and other algorithms of the same category, such as Deformable-DETR [32], MS-DETR [33], and DINO-DETR [45], alongside the benchmark model RT-DETR-R18. In these experiments, the average accuracy, measured by mAP at an Intersection over Union (IoU) threshold of 0.5, was assessed for each model. The detailed results are presented in Table 3.

Experimental results show that the proposed algorithm significantly outperforms state-of-the-art methods in mAP and AP, particularly in detecting small targets such as self-destructive, broken, and flashover-tainted insulators. This performance gain stems from targeted optimization strategies that enhance detection accuracy. Additionally, the model features a lightweight design with only 10.9M parameters and 30.1 GFLOPs, achieving high precision while reducing computational costs. Overall, the method balances detection accuracy and model efficiency, offering a practical solution for insulator defect detection in resource-constrained scenarios.

To assess robustness under complex backgrounds, five representative scenes (e.g., grass, sky, snow) were selected from the test set. As shown in Figure 7, the proposed algorithm consistently maintains higher confidence scores than other models, effectively suppressing background interference and ensuring reliable detection under challenging conditions. These results confirm the robustness and effectiveness of the proposed approach. In dry land, the localization error of IDD-DETR for insulator targets is less than 5%. Its bounding boxes can fully enclose the target contours without redundant coverage or boundary offset. At the junctions between insulators and dry soil or gravel, no background false detection boxes are generated, demonstrating excellent background suppression capability. In contrast, Mamba YOLO exhibits 2–3 instances of bounding box offset (with an offset range of 8–12 pixels) and one low-confidence false detection in the target edge regions, resulting in slightly inferior overall localization stability. Moreover, based on the detection results for “dry land” in Figure 7, the proposed IDD-DETR method achieves a detection score of approximately 0.8, while Mamba YOLO yields a score of about 0.5. This indicates that IDD-DETR outperforms Mamba YOLO in local detection performance. Although the detection results of IDD-DETR do not achieve a perfect score of 1, they still demonstrate greater advantages in extracting fine-grained local details compared to Mamba YOLO. As shown in Table 4, the base model achieves a mAP of 94.8% when using StarNet as the backbone, with improvements of 0.9%, 1.7%, and 2.0% compared to ResNet18, EfficientVit, and EfficientFormerv2, and further gains of 1.3%, 3.3%, 1.1%, 5.5%, and 3.4% over GhostNetv2, MobileNetv4, FasterNet, UnireplkNet, and RepVit, respectively. StarNet also demonstrates superior AP in detecting defective insulators, benefiting from its element-wise residual fusion of convolutional kernels, which enhances small target feature extraction. Combined with its lightweight architecture, the feature extraction module (LMS-FE) of IDD-DETR is designed based on StarNet.

To assess the performance of the SOEP structure proposed in this study, comparison experiments are conducted between the SOEP structure and the RT-DETR-R18 and YOLOv8s models, both equipped with the P2 small target detection head, as well as the TPH-YOLOv5s model, specifically designed for the detection of small targets in aerial images captured by UAVs. The experiments are carried out on the insulator defect detection dataset, with the results presented in Table 5. The findings reveal that the SOEP module demonstrates a significantly higher accuracy rate and requires considerably less computational effort compared to the RT-DETR-P2 model. These results underscore the superior accuracy and efficiency of the SOEP structure, highlighting its potential for practical deployment. The SOEP module not only achieves higher detection accuracy but also offers substantial reductions in computational cost and inference time, thereby facilitating faster deployment and showcasing the exceptional performance of the SOEP structure.

The analytical results presented in Figure 8 clearly demonstrate the challenges associated with small-scale target detection in high-resolution and background-complex images for existing models. Specifically, both the TPH-YOLOv5s and YOLOv8s-p2 models exhibit significant leakage and misdetection when tasked with detecting small-scale targets related to insulator defects. As illustrated in Figure 8a, the TPH-YOLOv5s, YOLOv8s-p2, RT-DETR-bifpn, and RT-DETR-p2 models all suffer from considerable leakage or misdetection. In Figure 8b, the TPH-YOLOv5s continues to exhibit leakage, whereas in Figure 8c,e, although no substantial leakage or misdetection is observed, the confidence levels of TPH-YOLOv5s are notably lower compared to the IDD-DETR model, which demonstrates significantly higher confidence in its detection results. Furthermore, Figure 8d highlights misdetection by the YOLOv8s-p2 model, further emphasizing the limitations of current models in small-scale target detection. In contrast, the improved model proposed in this study exhibits significant performance advantages in detecting small-scale targets such as insulators and their defects. Not only does the improved model successfully detect these small-scale targets, but it also demonstrates considerably higher confidence in its detection results, thereby validating the effectiveness of the proposed improvements.

4.3. Ablation Experiment

In this study, RT-DETR-R18 is employed as the base model, and a series of sequential ablation experiments are conducted to evaluate the performance of various model configurations. These configurations included Base, Base + LMS − FE, Base + SOEP, Base + LMB − FF, Base + LMS − FE + SOEP, Base + LMS − FE + LMB − FF, and IDD − DETR. The experimental results of these ablation studies are presented in Table 6, where “√” denotes the inclusion of a specific module and “×”indicates its exclusion.

As presented in Table 6, the mAP@0.5 values from Experiment 2 to Experiment 7 demonstrate superior performance compared to the benchmark model (Experiment 1), indicating that the proposed improvements significantly enhance the effectiveness of insulator defect detection. In Experiment 2, the incorporation of the LMS-FE not only boosts detection accuracy but also substantially reduces both the number of parameters and computational complexity, with reductions of 39.9% and 44%, respectively, relative to the baseline model. This validates the effectiveness of the proposed enhancement strategy. In Experiments 3 and 4, the introduction of the SOEP feature pyramid and LMB-FF module leads to improved detection accuracy of small defective targets, while maintaining a parameter and computational load comparable to the benchmark model. In Experiment 5, the combination of LMS-FE and SOEP reduces the number of parameters and computational demands by 37.9% and 35.5%, respectively, further improving the model’s lightweight design. Experiment 6 demonstrates that by integrating LMS-FE and LMB-FF, the model’s ability to detect small-scale defective targets is significantly enhanced, resulting in a 1.8% improvement in the AP value for insulator defects, while reducing parameters and computation by 47% and 56.2%, respectively. This underscores the performance enhancement and computational efficiency achieved by LMB-FF module. Finally, Experiment 7, which integrates all of the aforementioned modules, achieves a detection accuracy of 96.1%, surpassing the baseline model by 2.2%. Simultaneously, the number of parameters and computational load are reduced by 44.9% and 47.1%, respectively, while the detection speed reaches 61.2 fps, meeting both the accuracy requirements for transmission line insulator inspection and the lightweight deployment constraints for edge devices.

4.4. Visual Analysis of Results

Figure 9 illustrates the comparative results between the algorithm proposed in this paper and the baseline algorithm. In all test images, the confidence level of the proposed algorithm is notably higher than that of the baseline, demonstrating superior and more stable detection performance. In Figure 9b, the baseline algorithm fails to detect the defect due to the high similarity between the defect location and the complex background. In contrast, the proposed algorithm successfully identifies the target by effectively suppressing background interference. Furthermore, in Figure 9e, the algorithm demonstrates its ability to accurately recognize defects even when they are partially obscured, highlighting the robustness of the proposed method in challenging detection scenarios.

As shown in Figure 10, the improved model is evaluated against the baseline using four metrics—precision, recall, mAP@0.5, and mAP@0.5–0.95—as well as training and validation loss. The results indicate consistent performance gains across all metrics. Precision increases slightly by 0.1%, while recall improves by 2.5%, suggesting enhanced defect localization. mAP@0.5 and mAP@0.5–0.95 rise by 1.3% and 1.7%, respectively, confirming better detection performance across IoU thresholds. The improved model also converges more rapidly and stably, reaching steady-state around the 75th epoch owing to the multi-branch gradient feedback path that boosts gradient propagation efficiency. These findings demonstrate that the proposed method not only improves detection accuracy but also enhances training stability and efficiency, supporting its application in complex scenarios.

4.5. Detection Performance for Insulator Defect on CPLID and IDID

The first part of the dataset used in this study is the CPLID dataset (1240 images), which is publicly released on GitHub by the State Grid Corporation. The second part is the IDID (1702 images), which is provided by IEEE. Both datasets are widely used benchmarks in the field of insulator defect detection. The following is a summary of detection accuracies from previous studies on these two datasets, along with a comparative analysis with the proposed IDD-DETR model.

The experimental results are shown in Table 7; IDD-DETR achieves competitive mAP@0.5 performance on both CLIPD and IDID while requiring significantly fewer parameters and FLOPs than most transformer-based detectors. Most YOLO-based methods (YOLOv5s, YOLOv8s, YOLOv9s, YOLOv10s, and YOLOv11s) utilize CSPDarknet53 as a multi-scale feature extraction backbone and PANet for multi-scale feature fusion to extract features of small targets. Experimental results show that these methods achieve relatively good object detection performance. On the other hand, most DETR-based methods (Deformable-DETR, MS-DETR, DINO-DETR, and RT-DETR-R18) adopt the encoder–decoder structure of the Transformer as the backbone, performing global modeling of image features. They employ a set of learnable object queries to interact with the encoder features and output the final prediction results. Although the aforementioned methods have achieved promising results in small object detection, the proposed IDD-DETR in this manuscript further leverages starblock for hierarchical local feature extraction, as well as SOEP structure and the CSPO module for multi-scale global information extraction. This design enables the synergistic utilization of local and global features, effectively enhancing the detection performance for small objects. These results indicate that IDD-DETR offers a favorable trade-off between detection accuracy and computational efficiency, which is particularly suitable for real-time and UAV-based applications.

5. Discussion

Firstly, in terms of the model performance, it can be observed in Table 3 in that, compared with other popular object detection algorithms based on YOLO and DETR models, the proposed IDD-DETR algorithm achieves the best results in terms of mAP for small targets such as self-exploded, fractured, and flashover-contaminated insulators. While ensuring strong detection performance for small objects, IDD-DETR also has a relatively low number of parameters and relatively low FLOPs, indicating that the IDD-DETR algorithm has low complexity and high computational efficiency, making it well-suited for deployment on edge devices. Table 7 also shows that the IDD-DETR algorithm exhibits excellent small-object detection performance on both public datasets, CPLID and IDID, further demonstrating its advantages. The main reason lies in the ability of IDD-DETR to capture local–global information, thereby extracting effective semantic features while suppressing background noise interference, which results in its superior small-target detection performance.

Secondly, in terms of visualization results, as shown in Figure 7, Figure 8 and Figure 9, the proposed IDD-DETR algorithm is capable of accurately detecting small insulator targets in various scenarios such as grassland, sky, snow, dry land, and construction scenes. Whether under sunny or cloudy weather conditions, it consistently maintains precise detection of insulators. Moreover, even in scenes with significant background noise (e.g., construction scenes), it achieves strong detection performance. These detection outcomes are attributed to the synergistic interaction of every module within the IDD-DETR algorithm. The ablation studies in Table 6 and the module selection experiments for lightweight design in Table 4 demonstrate that modules such as starblock, LMS-FE, SOEP, and LMB-FE work collaboratively to enable accurate detection of small targets across diverse scenes and varying weather conditions. Furthermore, the convergence curve of the algorithm in Figure 10 shows that the loss value continuously decreases throughout training. This indicates that the internal modules of IDD-DETR interact effectively to capture meaningful semantic information, thereby contributing to its superior performance in small object detection.

6. Conclusions

Specifically designed for UAV remote sensing imagery of transmission lines, the proposed IDD-DETR effectively solves the key problems of UAV edge deployment, such as limited computing resources, small defect targets, and complex background interference. It provides a practical and efficient solution for real-time insulator defect detection in UAV power inspection. Firstly, the LMS-FE module based on StarNet is incorporated, which significantly reduces both the model parameters and computational overhead, while simultaneously enhancing detection accuracy, thereby achieving a lightweight model. Secondly, for small target detection, a multi-scale feature pyramid, named SOEP, is introduced. This approach effectively replaces the traditional method of increasing the P2 detection head by fusing shallow feature layers with the P3 detection head. This modification not only reduces computational costs but also improves feature extraction from the underlying layers. Additionally, this study introduces a lightweight, highly parameterized, and efficient feature fusion module, LMB-FF, which incorporates multi-branch gradient feedback paths. This module refines and effectively merges low-level spatial information with high-level semantic features of defective targets. It enhances the expression capability for small-scale features and suppresses complex background noise, using a small-scale, heavily parameterized receptive field design. The incorporation of this field design accelerates model convergence, further reduces parameter count, and minimizes computational demand. The experimental results demonstrate that the proposed algorithm achieves 96.1% mAP on the proprietary insulator defect dataset, reduces model parameters and computational volume by 44.9% and 47.1%, respectively, and attains a detection speed of 61.2 frames per second, meeting the real-time monitoring requirements for insulator defects on UAV platforms. Future work will focus on integrating multi-sensor data to improve defect detection performance under extreme weather conditions and exploring UAV swarm collaborative detection based on the proposed lightweight model to enhance inspection coverage and efficiency.

Author Contributions

Methodology, C.X.; software, C.X. and X.L.; validation, J.W.; formal analysis, X.L.; writing—original draft preparation, C.X. and X.L.; writing—review and editing, Y.D.; supervision, Y.D.; project administration, C.Z.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China, grant number 62433001.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Huang, M.; Zhong, S.; Ge, Y.; Lin, H.; Chang, L.; Zhu, D.; Zhang, L.; Xiao, C.; Altan, O. Evaluating the performance of SDGSAT-1 GLI data in urban built-up area extraction from the perspective of urban morphology and city scale: A case study of 15 cities in China. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 17166–17180. [Google Scholar] [CrossRef]
Gong, D.; Huang, M.; Ge, Y.; Lin, H.; Zhang, L.; Altan, O. Operationalizing SDGs through the water–energy–food nexus: Multi-level assessment of ecosystem service supply–demand patterns in China. Ecol. Indic. 2025, 179, 114235. [Google Scholar] [CrossRef]
Sun, H.; Huang, M.; Lin, H.; Ge, Y.; Zhu, D.; Gong, D.; Altan, O. Spatiotemporal dynamics of ecological environment quality in arid and sandy regions with a particular remote sensing ecological index: A study of the Beijing–Tianjin sand source region. Geo-Spat. Inf. Sci. 2025, 1–20. [Google Scholar] [CrossRef]
Deng, Y.; Huang, M.; Gong, D.; Ge, Y.; Lin, H.; Zhu, D.; Chen, Y.; Altan, O. Carbon balance dynamic evolution and simulation coupling economic development and ecological protection: A case study of Jiangxi Province at county scale from 2000 to 2030. Int. J. Digit. Earth 2025, 18, 2448572. [Google Scholar] [CrossRef]
Huang, M.; Gong, D.; Zhang, L.; Lin, H.; Chen, Y.; Zhu, D.; Xiao, C.; Altan, O. Spatiotemporal dynamics and forecasting of ecological security pattern considering habitat protection: A case study of the Poyang Lake ecoregion. Int. J. Digit. Earth 2024, 17, 2376277. [Google Scholar] [CrossRef]
Li, Y.; Zou, G.; Zou, H.; Zhou, C.; An, S. Insulators and defect detection based on the improved focal loss function. Appl. Sci. 2022, 12, 10529. [Google Scholar] [CrossRef]
Xu, J.; Liao, H.; Li, K.; Jiang, C.; Li, D. Multi-scale feature fusion transformer with hybrid attention for insulator defect detection. IEEE Trans. Instrum. Meas. 2025, 74, 3539813. [Google Scholar] [CrossRef]
Akindele, O.; Atolagbe, J. YOLO-ELA: Efficient local attention modeling for high-performance real-time insulator defect detection. arXiv 2024, arXiv:2410.11727. [Google Scholar]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2014; pp. 580–587. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar] [CrossRef]
Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Chu, X. YOLOv6 v3.0: A full-scale reloading. arXiv 2023, arXiv:2301.05586. [Google Scholar] [CrossRef]
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 7464–7475. [Google Scholar]
Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. In Advances in Neural Information Processing Systems 36; NeurIPS: San Diego, CA, USA, 2023; Volume 36. [Google Scholar]
Shen, W.; Fang, M.; Wang, Y.; Xiao, J.; Chen, H.; Zhang, W.; Li, X. AE-YOLOv5 for detection of power line insulator defects. IEEE Open J. Comput. Soc. 2024, 5, 468–479. [Google Scholar] [CrossRef]
Zhang, Q.; Zhang, J.; Li, Y.; Zhu, C.; Wang, G. ID-YOLO: A multi-module optimized algorithm for insulator defect detection in power transmission lines. IEEE Trans. Instrum. Meas. 2025, 74, 3505611. [Google Scholar] [CrossRef]
Li, D.; Lu, Y.; Gao, Q.; Li, X.; Yu, X.; Song, Y. LiteYOLO-ID: A lightweight object detection network for insulator defect detection. IEEE Trans. Instrum. Meas. 2024, 73, 5023812. [Google Scholar] [CrossRef]
Zhao, D.; Zhang, H.; Huang, K.; Zhu, X.; Arun, P.V.; Jiang, W.; Li, S.; Pei, X.; Zhou, H. SASU-Net: Hyperspectral video tracker based on spectral adaptive aggregation weighting and scale updating. Expert Syst. Appl. 2025, 272, 126721. [Google Scholar] [CrossRef]
Jiang, W.; Zhao, D.; Wang, C.; Yu, X.; Arun, P.V.; Asano, Y.; Xiang, P.; Zhou, H. Hyperspectral video object tracking with cross-modal spectral complementary and memory prompt network. Knowl.-Based Syst. 2025, 330, 114595. [Google Scholar] [CrossRef]
Zhu, X.; Zhang, H.; Hu, B.; Huang, K.; Arun, P.V.; Jia, X.; Zhao, D.; Wang, Q.; Zhou, H.; Yang, S. DSP-Net: A dynamic spectral–spatial joint perception network for hyperspectral target tracking. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5510905. [Google Scholar] [CrossRef]
Zhao, D.; Zhong, W.; Ge, M.; Jiang, W.; Zhu, X.; Arun, P.V.; Zhou, H. SiamBSI: Hyperspectral video tracker based on band correlation grouping and spatial–spectral information interaction. Infrared Phys. Technol. 2025, 151, 106063. [Google Scholar] [CrossRef]
Zhao, D.; Wang, M.; Huang, K.; Zhong, W.; Arun, P.V.; Li, Y.; Asano, Y.; Wu, L.; Zhou, H. OCSCNet-Tracker: Hyperspectral video tracker based on octave convolution and spatial–spectral capsule network. Remote Sens. 2025, 17, 693. [Google Scholar] [CrossRef]
Zhao, D.; Hu, B.; Jiang, W.; Zhong, W.; Arun, P.V.; Cheng, K.; Zhao, Z.; Zhou, H. Hyperspectral video tracker based on spectral difference matching reduction and deep spectral target perception features. Opt. Lasers Eng. 2025, 194, 109124. [Google Scholar] [CrossRef]
Zhao, D.; Yan, W.; You, M.; Zhang, J.; Arun, P.V.; Jiao, C.; Wang, Q.; Zhou, H. Hyperspectral anomaly detection based on empirical mode decomposition and local weighted contrast. IEEE Sens. J. 2024, 24, 33847–33861. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision–ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2024; pp. 16965–16974. [Google Scholar]
Lewis, D.; Kulkarni, P. Insulator defect detection. IEEE Dataport 2021. [Google Scholar] [CrossRef]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Zhao, C.; Sun, Y.; Wang, W.; Chen, Q.; Ding, E.; Yang, Y.; Wang, J. MS-DETR: Efficient DETR training with mixed supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2024; pp. 17027–17036. [Google Scholar]
Dai, Q.; Xiong, L.; Chen, L.; Li, S. Improvement of YOLOv5 for defect recognition of PDC drill composite piece. J. Electron. Meas. Instrum. 2023, 37, 164–172. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-based YOLO for object detection. arXiv 2024, arXiv:2406.05835. [Google Scholar] [CrossRef]
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection in drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 2778–2788. [Google Scholar]
Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR); IEEE: New York, NY, USA, 2006; Volume 3, pp. 850–855. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2024; pp. 5694–5703. [Google Scholar]
Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In ECML PKDD 2022; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
Wan, C.; Yu, H.; Li, Z.; Chen, Y.; Zou, Y.; Liu, Y.; Yin, X.; Zuo, K. Swift parameter-free attention network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2024; pp. 6246–6256. [Google Scholar]
Tao, X.; Zhang, D.; Wang, Z.; Liu, X.; Zhang, H.; Xu, D. Detection of power line insulator defects using aerial images analyzed with convolutional neural networks. IEEE Trans. Syst. Man Cybern. Syst. 2018, 50, 1486–1498. [Google Scholar] [CrossRef]
Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]

Figure 1. Structure of the IDD-DETR model.

Figure 2. Starblock structure.

Figure 3. SOEP structure diagram.

Figure 4. CSPO module diagram.

Figure 5. LMB-FF module.

Figure 6. Statistical chart of dataset label details. (a) Statistical distribution of class counts in the dataset. (b) Illustrates the bounding box style used for labeling defects in the dataset (corresponding to the rectangle annotation tool in LabelImg), visually demonstrating the annotation format for defective targets, which consists of multiple rectangular boxes of varying sizes. (c) Presents a scatter plot showing the normalized coordinate distribution of defective targets. The x-axis and y-axis represent the normalized coordinates of defective targets within the images, with denser point clusters indicating that defects are primarily concentrated in the central regions of the images, consistent with the typical positioning of insulators in transmission line imagery. (d) Presents a scatter plot of the normalized size distribution of defective targets, where the horizontal axis (width) and vertical axis (height) represent the normalized width and height of the defective targets.

Figure 7. Visualization results of different object detection methods under complex backgrounds, where each column corresponds to grassland, sky, snow, dry land, and construction scenes, respectively. (a) grassland; (b) sky; (c) snow; (d) dry land; (e) construction scenes.

Figure 8. Comparison of the effect of small target detection model for insulator defects at high resolution. (a) forest; (b) bare soil; (c) hills; (d) transmission tower; (e) dry land.

Figure 9. Comparison chart of experimental results. (a) forest; (b) forest; (c) dry land; (d) farmland; (e) residential buildings.

Figure 10. Convergence speed comparison graph.

Table 1. Experimental hyperparameter settings.

Parameters	Configuration
Epoch	200
Batch size	16
Input size	640 × 640
Loss function	GIou
IoU	0.7
Optimizer	AdamW
Learning rate	0.0001
Weight decay	0.0001

Table 2. Experimental platform configuration.

Name	Configuration
CPU	Intel Xeon Gold 5120
GPU	NVIDIA GeForce RTX 2080Ti
Display memory	12 G
Running memory	16 G
Operating system	Win 10
CUDA	CUDA-11.3
Name	Configuration
CPU	Intel Xeon Gold 5120

Table 3. Table of experimental results for comparison of target detection algorithm.

Model	AP%			mAP@0.5%	Parameters (M)	FLOPs (G)
Model	Drope	Broken	Flashover	mAP@0.5%	Parameters (M)	FLOPs (G)
YOLOv5s	96.2	89.8	90.2	92.1	9.1	23.8
YOLOv8s	95.1	92.4	92.0	93.1	11.1	28.4
YOLOv9s	96.0	92.9	90.4	93.1	7.1	26.7
YOLOv10s	96.0	91.7	88.7	92.1	8.0	24.5
YOLOv11s	95.5	89.1	91.6	92.1	9.4	21.3
Mamba-YOLO-T	93.3	91.7	84.3	89.7	9.0	17.2
Gold-YOLO	92.1	88.6	80.3	87.0	5.6	12.1
Deformable-DETR	97.8	90.3	91.4	93.1	40.0	173.2
MS-DETR	96.0	91.2	92.1	93.1	39.8	160.4
DINO-DETR	96.0	88.0	85.1	89.7	47.0	279.2
RT-DETR-R18	95.5	96.2	90.1	93.9	19.8	56.9
IDD-DETR	97.1	97.4	93.7	96.1	10.9	30.1

Table 4. Experimental comparison of different lightweight backbone networks.

Model	Backbone	AP%			mAP@0.5%	Parameters (M)	FLOPs (G)
Model	Backbone	Drope	Broken	Flashover	mAP@0.5%	Parameters (M)	FLOPs (G)
RT-DETR	ResNet18	95.5	96.2	90.1	93.9	19.8	56.9
	EfficientVit	94.8	96.5	87.9	93.1	10.7	27.2
	EfficientFormerv2	94.7	95.4	88.3	92.8	11.8	29.4
	GhostNetv2	96.9	94.9	88.6	93.5	9.7	24.9
	MobileNetv4	94.5	93.2	86.8	91.5	11.3	39.5
	FasterNet	95.2	97.1	88.9	93.7	10.8	28.5
	UnireplkNet	94.3	89.5	84.2	89.3	12.7	33.4
	RepVit	94.5	94.8	85.0	91.4	13.3	36.3
	StarNet	96.4	96.5	91.5	94.8	11.9	31.8

Table 5. Comparison experiment between SOEP structure and small target detection model.

Model	AP%			mAP@0.5%	Parameters (M)	FLOPs (G)	Speed (FPS)
Model	Drope	Broken	Flashover	mAP@0.5%	Parameters (M)	FLOPs (G)	Speed (FPS)
TPH-YOLOv5s	87.3	85.1	74.5	82.3	7.4	36.6	54.5
YOLOv8s-p2	97.2	94.9	91.6	94.6	10.6	36.6	59.7
RT-DETR-bifpn	95.1	93.8	89.2	92.7	20.3	64.3	56.9
RT-DETR-p2	96.2	97.1	91.4	94.9	18.5	78.1	62.2
RT-DETR-SOEP	95.8	97.2	93.9	95.6	20.4	65.2	68.4

Table 6. Comparison of results of ablation experiments.

No.	LMS-FE	SOEP	LMB-FF	AP%			mAP@0.5%	Parameters (M)	FLOPs (G)	Speed (FPS)
No.	LMS-FE	SOEP	LMB-FF	Drope	Broken	Flashover	mAP@0.5%	Parameters (M)	FLOPs (G)	Speed (FPS)
1	×	×	×	95.5	96.2	90.1	93.9	19.8	56.9	79.6
2	√	×	×	96.4	96.5	91.5	94.8	11.9	31.8	80.3
3	×	√	×	95.8	97.2	93.9	95.6	20.4	65.2	68.4
4	×	×	√	95.7	96.4	90.5	94.2	18.4	50.0	61.4
5	√	√	×	97.3	96.0	94.2	95.8	12.3	36.7	70.0
6	√	×	√	97.1	96.3	93.5	95.6	10.5	24.9	62.0
7	√	√	√	97.1	97.4	93.7	96.1	10.9	30.1	61.2

Table 7. Experimental results of different methods on CPLID and IDID.

Model	mAP@0.5%			Parameters (M)	FLOPs (G)
Model	CPLID	IDID	Avg.	Parameters (M)	FLOPs (G)
YOLOv5s	99.1	97.2	98.15	9.1	23.8
YOLOv8s	97.6	97.0	97.3	11.1	28.4
YOLOv9s	97.5	97.1	97.3	7.1	26.7
YOLOv10s	99.5	97.1	97.3	8.0	24.5
YOLOv11s	99.5	97.2	97.35	9.4	21.3
Mamba-YOLO-T	99.5	97.1	97.3	9.0	17.2
Gold-YOLO	94.6	91.7	93.15	5.6	12.1
Deformable-DETR	99.5	97.1	97.3	40.0	173.2
MS-DETR	97.5	94.2	95.85	39.8	160.4
DINO-DETR	99.2	97.3	98.4	47.0	279.2
RT-DETR-R18	99.5	97.5	98.5	19.8	56.9
IDD-DETR	99.5	98.1	98.8	10.9	30.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xu, C.; Liu, X.; Wang, J.; Ding, Y.; Zheng, C. IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing. Remote Sens. 2026, 18, 486. https://doi.org/10.3390/rs18030486

AMA Style

Xu C, Liu X, Wang J, Ding Y, Zheng C. IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing. Remote Sensing. 2026; 18(3):486. https://doi.org/10.3390/rs18030486

Chicago/Turabian Style

Xu, Cheng, Xin Liu, Jiaxin Wang, Yun Ding, and Chunhou Zheng. 2026. "IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing" Remote Sensing 18, no. 3: 486. https://doi.org/10.3390/rs18030486

APA Style

Xu, C., Liu, X., Wang, J., Ding, Y., & Zheng, C. (2026). IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing. Remote Sensing, 18(3), 486. https://doi.org/10.3390/rs18030486

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

IDD-DETR: Lightweight Multi-Defect Detection Model for Transmission Line Insulators Based on UAV Remote Sensing

Highlights

Abstract

1. Introduction

2. Related Work

3. Methods

3.1. Lightweight Multi-Starblock Feature Extractor

3.2. Small Object Enhancement Pyramid

3.3. Lightweight Multi-Branch Feature Fusion Module

4. Results

4.1. Experiments Settings

4.2. Experimental Results Comparison

4.3. Ablation Experiment

4.4. Visual Analysis of Results

4.5. Detection Performance for Insulator Defect on CPLID and IDID

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI