Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design

Zhu, Yujie; Chen, Guanhua; Zhang, Linghao; Zhou, Jiajun; Kuang, Junwei; Zhu, Jiangxiong

doi:10.3390/rs18060953

Open AccessArticle

Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design

by

Yujie Zhu

¹,

Guanhua Chen

²

,

Linghao Zhang

³,

Jiajun Zhou

^1,*

,

Junwei Kuang

³ and

Jiangxiong Zhu

⁴

¹

School of Computer Science, China University of Geosciences, Wuhan 430074, China

²

Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China

³

State Grid Sichuan Electric Power Company, Chengdu 610041, China

⁴

School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(6), 953; https://doi.org/10.3390/rs18060953

Submission received: 13 January 2026 / Revised: 12 March 2026 / Accepted: 18 March 2026 / Published: 21 March 2026

(This article belongs to the Special Issue Lightweight Artificial-Intelligence Techniques for Remote-Sensing Image Processing)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

A lightweight data–model joint optimization framework is proposed, reducing computational cost by 30.1% (GFLOPs) and parameters by 19.2%, while achieving an mAP of 92.8% for insulator defect detection.
An efficient, resource-aware visual analysis pipeline is developed by integrating an optimized-tiling-based training and inference strategy, enabling effective processing of high-resolution UAV imagery without inducing feature distortion from image resizing.

What are the implications of the main findings?

The semi-decoupled prior-driven detection head improves feature representation for tiny and sparsely distributed defects, which are common in real-world power line inspection scenarios.
The proposed framework facilitates the practical deployment of lightweight object detectors for large-scale UAV-based inspection tasks under limited computational resources.

Abstract

The inspection of minuscule insulator defects from high-resolution (HR) UAV imagery presents a significant algorithmic challenge. The severe scale mismatch between HR images and low-resolution model inputs often leads to feature distortion for sparsely distributed targets. To address these issues, this paper proposes an integrated data–model collaborative framework. At the data level, an offline label-guided optimal tiling (LGOT) strategy is introduced to alleviate scale mismatch by curating information-dense training tiles. At the model level, we design the semi-decoupled prior-driven detection head (SDPD-Head), which leverages evolutionary priors to stabilize the learning of microscopic spatial features. During inference, an online inference-time adaptive tiling (ITAT) strategy is used to match the spatial scale distribution between training and inference and to reduce feature loss caused by direct downscaling. Experiments on a real-world inspection dataset show that the proposed framework achieves an mAP@50 of 92.9% with 2.17 M parameters and 4.7 GFLOPs.

Keywords:

lightweight deep learning; UAV remote sensing; high-resolution imagery; insulator defect detection; efficient object detection

1. Introduction

High-resolution (HR) remote sensing imagery has become an indispensable data source for large-scale infrastructure inspection, environmental monitoring, and industrial condition assessment. With the rapid advancement of imaging sensors, modern remote sensing platforms are capable of capturing images with resolutions exceeding several thousand pixels per side, providing fine-grained spatial details that are critical for identifying small and subtle structural anomalies. Recent advances in deep neural networks (DNNs) combined with unmanned aerial vehicle(UAV) imagery have shown promising performance for insulator defect detection tasks [1]. However, effectively exploiting such rich information using deep learning models remains challenging, particularly when balancing high-resolution processing with algorithmic efficiency.

In large-scale visual analysis, processing overhead and memory footprints can grow rapidly. To ensure efficiency, most state-of-the-art object detectors operate on fixed low-resolution inputs (e.g., 640 × 640 pixels), which introduces a severe scale mismatch when directly applied to HR remote sensing imagery. Directly downscaling HR images inevitably leads to substantial spatial information loss, causing tiny and sparsely distributed targets to become indistinguishable from background noise. This issue is especially pronounced in industrial inspection tasks, such as insulator defect detection, where defective regions often occupy only a minute fraction of the original image area, as illustrated in Figure 1.

Recent studies have attempted to mitigate scale mismatch by enhancing multi-scale feature representations through feature pyramid networks, attention mechanisms, or advanced backbone designs [2,3,4]; while these approaches can improve feature fusion across scales, they do not fundamentally address the dominance of background pixels in HR imagery, particularly when targets are extremely small and sparsely distributed. As a result, even sophisticated model-level enhancements may struggle to extract discriminative features from heavily downsampled inputs under standard algorithmic constraints.

An alternative line of research focuses on data-level strategies, such as image tiling or sliding-window processing, to preserve fine-grained details by partitioning HR images into smaller sub-images. By enlarging the relative scale of small targets, tiling-based approaches can effectively reduce background dominance and improve detection performance without modifying the underlying network architecture. Prior studies have demonstrated the effectiveness of sliding-window tiling in aerial imagery [5], medical imaging, and other high-resolution visual analysis tasks [6]. However, conventional tiling strategies are largely task-agnostic and often rely on exhaustive scanning, leading to excessive computational redundancy and inefficient use of resources. Moreover, unguided tiling may fragment targets across tile boundaries, degrading feature integrity and limiting training effectiveness [5,6].

From a lightweight remote sensing perspective, these limitations highlight the need for content-aware and task-oriented tiling strategies that can balance detection accuracy and computational efficiency. For efficient HR image analysis, it is insufficient to simply preserve spatial details; instead, the data sampling process must actively suppress irrelevant background regions and concentrate computational effort on informative areas.

In addition to data-level challenges, model design plays a critical role in determining the upper bound of performance–efficiency trade-offs. Many modern object detectors adopt highly flexible and universal detection heads to accommodate diverse object categories and aspect ratios; while such designs are effective in generic benchmarks, they may introduce unnecessary complexity in domain-specific remote sensing tasks where target structures exhibit relatively stable geometric properties. For lightweight inspection applications, a simpler and more prior-informed detection head may offer improved feature alignment and faster convergence with reduced computational overhead.

Motivated by these observations, this work addresses the problem of tiny defect detection in high-resolution remote sensing imagery from a lightweight system-level perspective, while lightweight model optimization can also be achieved through techniques such as quantization or knowledge distillation, this work focuses on architecture-level and system-level efficiency, which avoids additional training stages and preserves training and simplicity in high-resolution remote sensing scenarios.

Rather than optimizing data sampling, model architecture, or inference strategies in isolation, we propose an integrated framework that jointly considers all three aspects. Specifically, we introduce a label-guided optimal tiling strategy for offline data preparation, an inference-time adaptive tiling mechanism for efficient HR image analysis, and a lightweight semi-decoupled prior-driven detection head tailored for structurally regular targets. UAV-based power line inspection is adopted as a representative application scenario to validate the proposed approach. By reducing model complexity and avoiding exhaustive high-resolution scanning, the framework provides a lower-complexity solution for high-resolution defect detection.

The main contributions of this work are summarized as follows:

We propose a lightweight label-guided optimal tiling (LGOT) strategy for offline data preprocessing, which leverages annotation priors to generate information-dense training samples, effectively reducing background redundancy and alleviating scale mismatch in HR remote sensing imagery.
We develop an efficient inference-time adaptive tiling (ITAT) strategy that enables direct analysis of HR images by allocating computational resources to regions of interest, avoiding exhaustive sliding-window scanning while maintaining consistency with the training data distribution.
We design a lightweight semi-decoupled prior-driven detection head (SDPD-Head) that exploits the structural regularity of insulators through evolution-generated priors, achieving accurate detection of tiny defects with low computational overhead.

The remainder of this paper is organized as follows. Section 2 reviews related work on high-resolution remote sensing detection and lightweight inspection models. Section 3 describes the proposed framework in detail. Section 4 presents experimental results and performance analysis. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Works

2.1. Tiling-Based High-Resolution Image Detection

High-resolution image detection tasks, such as insulator defect detection based on UAV imagery, pose significant challenges to deep learning models due to the limited memory and computational capacity of graphics processing units (GPUs). As a result, input images are often downsampled to fit fixed model input sizes, leading to severe loss of spatial details and degraded detection performance for small and sparsely distributed targets. To mitigate this issue, image tiling or cropping has emerged as a widely adopted preprocessing strategy across multiple application domains.

In the context of aerial imagery, Varga et al. [5] demonstrated that sliding-window-based cropping can effectively alleviate background bias in sparse target detection, significantly improving detection performance without modifying the underlying network architecture. Similar observations have been reported beyond the remote sensing domain. For instance, Shewajo and Fante [6] showed that directly downsampling high-resolution medical images causes substantial loss of spatial information, and addressed small object detection in malaria parasite screening through a systematic image slicing strategy. Image tiling has also been validated in other large-scale detection scenarios, including road damage detection from multi-source high-resolution images [7] and the detection of small and densely distributed oil palm fruits in agricultural environments [8], where notable performance gains were achieved even with lightweight detectors such as YOLOv4-tiny. Collectively, these studies establish image tiling as a fundamental technique for small object detection in high-resolution imagery.

Beyond high-resolution visual detection, AI-based and data-driven methods have also been used in measurement and navigation to model and mitigate complex, hard-to-model errors. For example, data-driven approaches have been applied to characterize Global Navigation Satellite System(GNSS) site-specific unmodeled errors under reflection and diffraction [9] and to improve multipath modeling in high-precision processing [10]. Recent studies have further shown that fusing heterogeneous measurements (e.g., camera observations with carrier-to-noise density ratio C/NO) can mitigate multipath and non-line-of-sight (NLOS) effects in challenging environments [11], and deep models can support soft-decision multipath detection and mitigation [12]. These works motivate our perspective that, when dominant errors originate from data acquisition and preprocessing (e.g., scale mismatch and downsampling-induced feature loss), incorporating data-driven priors and aligning the processing pipeline with the measurement characteristics can be as important as optimizing the network itself.

Despite their effectiveness, existing tiling-based approaches are largely task-agnostic and rely on fixed or heuristic cropping strategies. As a result, they often suffer from low computational efficiency due to exhaustive scanning and may produce suboptimal sub-image partitions that fragment targets or introduce redundant background regions. These limitations highlight the need for more content-aware and task-oriented tiling strategies that can better exploit annotation information and improve both data efficiency and detection performance.

2.2. Insulator Defect Detection

Existing studies on insulator defect detection primarily aim to improve the recognition of small and subtle defects under complex backgrounds, and can be broadly categorized into data-level enhancement and model-level architectural optimization.

At the data level, several works attempt to enhance the representation of tiny defects by improving input data quality or diversity. Zhang and Akella employed generative adversarial networks (GANs) to synthesize high-resolution defect samples, alleviating ambiguity caused by extremely small targets [13,14]. Cao et al. [15] introduced pre-cutting strategies combined with Cutout or GridMask to preserve contextual information of small defects during training. In addition, environmental factors such as rain and fog have been incorporated to enrich data diversity and improve model robustness [16,17]. Despite their effectiveness, these approaches mainly operate at the feature or appearance level, while the intrinsic challenges of UAV-captured high-resolution imagery—such as severe background clutter and the extremely low pixel proportion of tiny defects—remain insufficiently addressed. Moreover, generative methods often suffer from limited feature diversity for micro-defects, and pre-cutting strategies dependent on pretrained models may introduce additional computational overhead and efficiency constraints [15].

At the model level, existing research predominantly focuses on modifying network architectures to enhance small-target feature extraction. Representative approaches include the use of GhostBlock-enhanced modules for improved texture representation [2], attention-aware feature pyramid networks for multi-scale alignment [3], and multi-path fusion structures for capturing contextual information [4]. While these architectural optimizations can locally strengthen feature responses for small objects, they do not fundamentally resolve the core challenge of insulator defect detection, namely the insufficient feature expression caused by the extremely low proportion of defective pixels in high-resolution UAV images. In addition, many model-level enhancements increase network complexity and computational cost, which limits their applicability in large-scale inspection scenarios.

In summary, existing insulator defect detection methods tend to optimize data representation or model architecture in isolation, with limited consideration of task-specific characteristics such as sparse defect distribution, complex backgrounds, and efficiency requirements in UAV-based high-resolution inspection. This observation highlights the necessity of task-oriented solutions that jointly consider data sampling, model design, and inference efficiency for robust and practical insulator defect detection. These limitations motivate the need for a task-oriented and application-driven solution that jointly optimizes data sampling, model design, and inference efficiency for UAV-based high-resolution remote sensing inspection.

3. Method

Our framework targets tiny defect detection in large-scale HR UAV imagery. We adopt YOLO11 [18] as the baseline detector and replace its detection head with the proposed Semi-Decoupled Prior-Driven head (SDPD-Head). On the data and inference sides, we use label-guided optimal tiling (LGOT) for training data preparation and inference-time adaptive tiling (ITAT) for HR image inference.

Figure 2 provides an overview of the proposed framework, including offline LGOT-based data preparation, SDPD-Head training, and ITAT-based inference. For visualization, the number of sub-images in Figure 2 is reduced; this does not affect the description of the method.

3.1. Label-Guided Optimal Tiling Method

The original dataset consists of high-resolution images (ranging from 3648 × 5472 to 6000 × 8000), posing significant challenges for defect detection. High resolution introduces excessive irrelevant pixels, while complex backgrounds generate numerous distracting features that hinder model learning—particularly for small targets. Additionally, resizing these images during preprocessing may lead to substantial feature loss.

To address these challenges, we designed the label-guided optimal tiling (LGOT) method. Based on the original data annotations, this algorithm employs an optimal strategy called Corner Optimal Cover (COC) to cut the original image into multiple sub-images. Specifically, the algorithm aligns the four corners of each sub-image with the four corners of every annotated bounding box in the original image to find the optimal cropping solution. This approach offers several advantages:

This strategy has four practical effects: (1) it increases the relative scale of small targets within each tile; (2) it reduces resizing distortion because each tile is closer to the model input size (see Figure 2 comparison); (3) it reduces redundant background content in the training set; and (4) compared with random cropping or exhaustive sliding-window tiling, it generates tiles based on annotations, which reduces redundant crops.

The entire process comparison of the COC strategy and sliding window is shown in Figure 3. Specifically, the coverage score is calculated as the number of label boxes covered by a given sub-image. Compared with sequential sliding-window tiling, COC uses annotation-guided corner alignment to generate candidate tiles and selects those with the highest coverage score (the number of labeled boxes covered by a tile). This reduces redundant tiling for sparse targets.

To further illustrate the first benefit, we randomly selected an example showing the pixel proportion before and after cropping (Figure 4), where Figure 4a shows the original image and Figure 4b–d show cropped sub-images. As observed, the relative area of insulators and defects increases significantly after cropping. This occurs because, while the bounding boxes remain unchanged, the overall image size is reduced, thereby increasing the relative prominence of the targets.

To further inspect the effect on feature attention, we apply Grad-CAM to two representative YOLO baselines; one example is shown in Figure 5.

n \cdot D \leq min (w, h) where n, D \in N^{+}

(1)

Concerning dataset expansion, our method generates multiple sub-images from each original image. The size of each sub-image is determined by Equation (1), where D denotes the base dimension, typically set as a multiple of 32 for compatibility with network downscaling layers. However, D can be adjusted based on specific task requirements. n serves as a scaling factor, and as long as n × D remains smaller than the shorter edge of the original image, n can take any positive integer value.

3.2. Semi-Decoupled Prior-Driven Detect Head

We believe that for the specific task of detecting small defects in high-resolution images of power transmission lines, the detection head is a critical component that determines the upper limit of model performance and efficiency bottlenecks. This is because, whether in the training or inference process, all multi-scale features extracted by the backbone and neck networks ultimately enter the detection head. For small targets, even after feature extraction, the feature signals ultimately transmitted to the detection head remain extremely weak and have a low signal-to-noise ratio. Therefore, the bottleneck of the problem may not lie in whether features can be extracted, but rather in how to make accurate decisions from weak features.

As we mentioned in Section 2, numerous studies have been conducted in-depth explorations into advanced feature enhancement, extraction, and separation. However, while these works focus on enhancing the backbone and neck, they mostly use universal design detection heads. Such detection heads, lacking task-specific design, may not perform optimally in specific tasks. Taking the original detection head of YOLO11 as an example, to ensure flexibility and universality, the YOLO11 Head introduces a fully decoupled depth structure and adopts an anchor-free design along with Distribution Focal Loss (DFL). However, while this complexity enhances its upper limit in general scenarios, it also introduces significant parameter complexity and computational overhead in tasks with relatively fixed target types (e.g., insulator detection or other industrial applications), making the model harder to train and infer. Our experimental results (see Section 4) also confirm this: for domain-specific tasks, a simpler, more prior-informed design may be more effective.

We revisited the modern design of the YOLO11 detection head, retained its parallel branching approach, and significantly lightened it. Considering that the length-to-width ratio of insulators and their defects is relatively fixed, we introduced an evolutionary algorithm to generate anchor priors, making the learning process more stable and reliable. The overall architecture of the improved SDPD-YOLO model is illustrated in Figure 6.

To be more specific, as shown in Figure 7, different from the YOLO11 Detect Head, to resolve the inherent conflict between classification and regression tasks, the SDPD-Head designs parallel, dedicated feature extraction branches for localization and classification tasks. Both branches consist of a 3 × 3 Depthwise Convolution (DWConv). DWConv can extract spatial features that are more focused on contours and edges for the regression task without significantly increasing the number of parameters, while extracting semantic features that are more focused on texture and material for the classification task. After the two branches extract their respective specialized feature, we do not use two independent prediction layers as in a fully decoupled head. Instead, we concatenate these two feature maps along the channel dimension. This fused, more comprehensive feature map is ultimately fed into a unified 1 × 1 prediction convolutional layer to generate the final prediction containing all information. This process has been shown in Figure 8. This structure maintains simplicity, enabling efficient information exchange between the features of the two tasks before the final decision.

Using two DWConv branches and a shared

1 \times 1

prediction layer reduces the parameter count compared with the fully decoupled YOLO11 Head, as summarized in Equations (2) and (3).

\begin{matrix} P_{SDPD} & = 2 \times P_{{DWConv}_{3 \times 3}} + P_{{Conv}_{1 \times 1}} \\ \approx 2 \times C_{in} \cdot K^{2} + C_{in} \cdot C_{out} \end{matrix}

(2)

\begin{matrix} P_{YOLO 11 Head} & = P_{regression} + P_{classification} \\ \approx (C_{in} \cdot C_{2} \cdot K^{2} + C_{2} \cdot C_{2} \cdot K^{2} + C_{2} \cdot 4 R) \\ + [C_{in} (K^{2} + C_{3}) + C_{3} (K^{2} + C_{3}) + C_{3} \cdot N] \end{matrix}

(3)

Here, N is the number of classes, R is the maximum regression bin in DFL (

R = 16

), K is the kernel size, and

C_{2}

and

C_{3}

are the intermediate channel sizes in the regression and classification branches, respectively. Therefore, the SDPD head uses fewer parameters than the original YOLO11 Head under the same channel settings.

On the other hand, modern detection heads (e.g., YOLOv8 Head [19]) typically employ anchor-free designs for versatile size adaptation. However, in industrial scenarios like insulator defect detection, targets exhibit consistent aspect ratios. Leveraging anchor priors here accelerates model learning of target features. To achieve this, we propose an evolutionary algorithm (shown in Figure 9) to generate optimized anchor priors for initializing training.

First, the method generates an initial anchor group randomly within the width and height ranges of all target bounding boxes in the dataset. The initial group (generation

t = 0

) is defined as

A^{(0)} = {A_{1}^{(0)}, A_{2}^{(0)}, \dots, A_{M}^{(0)}}

, in which every candidate

A_{m}

contains a series of anchors:

A_{m}^{(0)} = \{[\begin{matrix} w_{1}^{(m)} \\ h_{1}^{(m)} \end{matrix}], [\begin{matrix} w_{2}^{(m)} \\ h_{2}^{(m)} \end{matrix}], \dots, [\begin{matrix} w_{k}^{(m)} \\ h_{k}^{(m)} \end{matrix}]\}

(4)

Subsequently, during the evolutionary generation phase, the algorithm employs a fitness function

F_{a n c h o r}

to evaluate the quality of each candidate anchor set

A_{m}^{(t)}

. This function calculates the average of the maximum Intersection over Union (IoU) between all N ground-truth bounding boxes

b_{i}

in the dataset and the candidate anchors

a_{j}

:

F_{a n c h o r} (A_{m}^{(t)}) = \frac{1}{N} \sum_{i = 1}^{N} {max}_{j = 1}^{k} IoU (b_{i}, a_{j})

(5)

In each evolution iteration, the method selects M candidates based on the following probability distribution:

{Probability}_{s e l e c t} (A_{m}) = \frac{F_{a n c h o r} (A_{m})}{\sum_{n = 1}^{M} F_{a n c h o r} (A_{n})}

(6)

It then executes the Crossing Operation on the M-selected A values, paring them as follows:

a_{child} = λ \cdot a_{parent 1} + (1 - λ) \cdot a_{parent 2}

(7)

in which a is an anchor for A, and

λ

is a random number uniformly distributed in the interval [0, 1]. The Mutation Operation is then perfomed on each anchor

a_{i}

of the offspring individuals obtained from the Cross Operation.

a_{i, mut} = a_{i} \cdot (1 + δ)

(8)

Among them,

δ

is a random number drawn from a normal distribution with a mean of 0 and a standard deviation of

σ

(default = 0.1). Finally, repeat the steps until the iteration reaches the maximum number and fitness convergence or meets the recall threshold.

In summary, although replacing the fully decoupled head with a semi-decoupled design significantly reduces computational overhead and parameter count, it may limit representational capacity. To mitigate this issue, we introduce a prior-driven paradigm that leverages pre-generated optimal anchor boxes to provide explicit scale guidance, enabling more stable and effective spatial feature learning, particularly for microscopic defects. Notably, the only architectural difference between the proposed SDPD-YOLO and the baseline YOLO11 lies in the detection head. Therefore, the effectiveness of this design is validated by the comparative experiments presented in Section 4.

3.3. Inference-Time Adaptive Tiling

To reduce feature loss at inference, we propose inference-time adaptive tiling (ITAT). ITAT uses a two-stage procedure: a coarse detector proposes Region of Interest(ROIs) on a downscaled image, and the fine detector performs tiling-based inference only within these ROIs at a higher resolution. This reduces redundant computation on background regions compared with full-image sliding-window inference (shown in Figure 10).

Stage 1: ROI proposal. We use a lightweight coarse detector (a downscaled SDPD-YOLO) to generate candidate ROIs on the input HR image. The goal is to identify regions that may contain insulators or defects, so that high-resolution tiling is applied only to these regions.

Stage 2: Sub-image inference within ROIs. In the ITAT engine, the coarse selection model initially predicts bounding boxes

B = (x_{m i n}, y_{m i n}, x_{m a x}, y_{m a x})

corresponding to potential defect regions in the global image, where the width and height are denoted as w and h, respectively. Because the edge features of small defects are prone to information loss under tight cropping and reliable detection requires sufficient surrounding context, an adaptive edge expansion mechanism is employed to generate the final Region of Interest (ROI). An expansion factor

α

is therefore applied to enlarge the initially predicted bounding box, and the coordinates of the resulting high-resolution ROI are computed as follows:

R O I = [max (0, x_{m i n} - α w), max (0, y_{m i n} - α h), min (W, x_{m a x} + α w), min (H, y_{m a x} + α h)]

(9)

Within each ROI, the fine detector performs sliding-window inference and maps local predictions back to the global coordinate system, followed by non-maximum suppression(NMS) to remove duplicate boxes.

[\begin{matrix} x_{global} \\ y_{global} \end{matrix}] = [\begin{matrix} x_{local} / t_{w} \\ y_{local} / t_{h} \end{matrix}] \circ [\begin{matrix} w_{roi} \\ h_{roi} \end{matrix}] + [\begin{matrix} x_{roi} \\ y_{roi} \end{matrix}]

(10)

where

x_{roi}, y_{roi}

is the origin coordinate of the upper left corner of the ROI, and

w_{roi}, h_{roi}

is the size of the ROI.

Through the above two-stage inference mechanism, ITAT effectively combines global fuzzy understanding with local precise inference, improving the detection accuracy and efficiency of small objects in high-resolution images.

4. Experiments

In this section, we evaluate the proposed method on a real-world UAV-based power line inspection dataset provided by State Grid Sichuan Electric Power Company. The experimental evaluation is systematically designed to address the following research questions:

RQ1: How do the proposed label-guided optimal tiling (LGOT) and inference-time adaptive tiling (ITAT) strategies quantitatively outperform conventional image processing methods?
RQ2: Why is a synergistic data–model collaborative framework (coupling LGOT and ITAT) strictly necessary to prevent feature distribution mismatch?
RQ3: Can the proposed SDPD-YOLO architecture effectively compensate for the representational trade-offs of semi-decoupled designs and outperform current mainstream detectors?

4.1. Experimental Dataset

Following the experimental settings adopted in previous studies [2,15,20], we construct a comprehensive dataset by combining real-world power line inspection images with additional normal (non-defective) insulator samples. The primary dataset consists of UAV-captured images collected from operational transmission lines by a power utility company. To enhance data diversity and reduce bias toward defective samples, normal glass (G-insulator) and porcelain (P-insulator) images are randomly sampled from the publicly available InsPLAD dataset [21].

The final dataset includes three defect categories—G-insulator broken, G-insulator dirty, and P-insulator dirty—as well as normal G- and P-insulator images. The distribution of defect samples before and after tiling is summarized in Table 1. As the number of sub-images generated by tiling is correlated with the number of targets per image, defect categories with higher occurrence frequency (e.g., dirty insulators) result in a larger number of cropped sub-images compared to less frequent defects such as broken insulators.

To mitigate sample imbalance caused by varying defect occurrence rates, basic image augmentation techniques, including hue, saturation, brightness adjustment, and noise injection, are applied to the G-insulator broken class, expanding this category by a factor of four. As a result, the final experimental dataset contains a total of 4717 images. The dataset is randomly divided into training, validation, and test sets with a ratio of 7:2:1.

4.2. Implementation Details

All experiments are conducted on a workstation running Windows 11(Microsoft Corporation, Redmond, WA, USA), equipped with an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel Xeon W3-2525 CPU(Intel Corporation, Santa Clara, CA, USA), and 128 GB of RAM, utilizing the PyTorch (v2.4.1) framework. Detection performance is evaluated using mean average precision at an IoU threshold of 50% (mAP@50). Computational efficiency is assessed via the number of model parameters (Params) and floating-point operations (GFLOPs).

For the training settings, models are trained for 200 epochs with a batch size of 32 and an input resolution of 640 × 640. We employ the AdamW optimizer to achieve stable convergence. To optimize data loading, 8 dataloader workers with image caching and Mosaic augmentation are utilized. Notably, to guarantee a fair comparison and isolate the structural effectiveness of the proposed data–model collaborative framework, the loss functions during the training phase are kept strictly identical to the baseline default configurations. No additional customized loss functions or weight optimizations are introduced. To systematically evaluate the effectiveness of the proposed approaches, three distinct experimental configurations are defined, as summarized in Table 2.

4.3. Analysis of the Data-Model Collaborative Strategy

To explicitly address RQ1 and RQ2, this subsection dissects the proposed framework into its data augmentation (LGOT) and inference engine (ITAT) components, evaluating their individual and synergistic effects.

4.3.1. Experiment of Label-Guided Optimal Tiling

As shown in Figure 11a, the target scale in the original dataset exhibits an imbalanced distribution, confirming the severe scale mismatch characteristic of this task. After LGOT processing (Figure 11b), this distribution is significantly optimized. By intelligently guiding the cropping strategy based on labels, the scale of small targets is enlarged relative to the input resolution, thereby reducing feature distortion.

To quantitatively demonstrate the superiority of the proposed LGOT over conventional tiling-based data preparation methods (addressing RQ1), we construct a comparative experiment based on the YOLO11 baseline. The model is trained on three distinct dataset configurations: a dataset generated by conventional sliding-window tiling, a dataset generated by random tiling, and the dataset generated by our LGOT strategy. To strictly isolate the impact of the training data and exclude any variables introduced by inference-side strategies (such as ITAT), the mAP (mAP@50) on the validation set is adopted as the evaluation metric.

As detailed in Table 3, the model trained on the LGOT-augmented dataset achieves the highest validation accuracy. This confirms that compared to blind or random cropping, the label-guided strategy provides the highest-quality and feature-aligned training samples for microscopic defect learning.

4.3.2. Ablation of the Collaborative Data-Model Strategy

To explicitly answer RQ2, an ablation study is conducted on the YOLO11 baseline to investigate the necessity of coupling the data-side LGOT and the inference-side ITAT as a collaborative closed-loop system.

As shown in Table 4, employing the tiling strategy unilaterally yields severe performance degradation. Specifically, training the model with the LGOT-augmented dataset but evaluating it using standard full-image inference causes the mAP to drop drastically from the baseline’s 83.9% to 72.7%. This substantial decline occurs because unilateral tiling introduces a severe domain shift and feature distribution mismatch (i.e., scale and receptive field misalignment) between the training and inference phases. Essentially, the model learns to detect magnified, high-resolution targets during training but struggles when confronted with the severely downscaled targets in standard inference.

Conversely, the synergistic integration of LGOT and ITAT perfectly aligns the spatial feature distributions across both phases. This end-to-end alignment effectively resolves the scale mismatch, unlocking the maximum detection potential and ultimately achieving the highest mAP of 86.0%.

4.4. Baseline Selection and Architecture Ablation

To establish a robust foundation for our proposed framework and further validate the necessity of tiling strategies, Table 5 reports the baseline performance of recent standard YOLO series models. The experimental results yield two key observations.

First, Method 2 (sliding-window tiling) consistently and significantly outperforms Method 1 (direct downscaling) across all evaluated architectures. This confirms that for high-resolution images, tiling strategies effectively preserve crucial spatial features and improve the training efficacy for tiny defect detection.

Second, among all the evaluated baseline models, YOLO11 achieves the highest detection accuracy (84.8% mAP under Method 2) while maintaining a highly competitive computational profile (2.58 M parameters and 6.3 GFLOPs). Based on its superior balance between detection precision and computational efficiency, YOLO11 is demonstrably the most suitable base model. Consequently, we select it as the foundation for our subsequent architectural improvements (i.e., SDPD-YOLO).

To validate the superiority of the SDPD-YOLO architecture, Table 6 compares it against advanced YOLO-based detectors under Method 2 and the proposed Method 3 (LGOT + ITAT). Notably, the only structural difference between the YOLO11 baseline and SDPD-YOLO lies in the detection head. As demonstrated in Table 6, while transitioning from a fully decoupled head (YOLO11) to a semi-decoupled design naturally reduces computational overhead, it inherently risks a drop in representational capacity. However, our proposed prior-driven paradigm successfully compensates for this. By retaining shared low-level spatial features before unified prediction, SDPD-YOLO achieves the highest mAP (92.9%) while maintaining the lowest GFLOPs (4.7), achieving a better accuracy–efficiency trade-off for microscopic defect detection.

4.5. Comparison with General Mainstream Detectors

To demonstrate the necessity and effectiveness of task-specific architectural optimizations for insulator defect detection (further addressing RQ3), horizontal comparisons are conducted with representative general-purpose lightweight detectors, including the RT-DETR series [27] and the NanoDet family [28].

As shown in Table 7, all models are trained and evaluated under the unified LGOT + ITAT framework (Method 3) to ensure a fair comparison. The results indicate that off-the-shelf general-purpose detectors struggle to achieve an optimal balance between detection accuracy and computational overhead for this highly specialized task. For instance, RT-DETR models achieve competitive mAP scores, but their parameter counts and GFLOPs are substantially higher under our measurement. Conversely, the NanoDet series maintains low computational costs but suffers from significantly lower accuracy.

In contrast, SDPD-YOLO achieves the highest mAP@50 with the lowest GFLOPs among the compared models in Table 7. This confirms that compared to directly deploying general-purpose vision models, our vertically optimized approach is significantly more effective and efficient for real-world power line inspections.

4.6. Qualitative Analysis

These selected samples encompass typical challenging scenarios encountered in actual hardware inspections. Specifically, Sample #1 illustrates accurate detection under extreme weather (such as foggy); Sample #2 demonstrates the model’s exceptional sensitivity by successfully identifying unannotated microscopic defects (i.e., true positive targets missed by the ground-truth labels); and Sample #3 highlights the model’s robust discriminative capability, where the fine model correctly avoids false alarms even when the coarse model generates a background-only ROI. As shown in Figure 12, the method can localize small defect regions under cluttered backgrounds in these examples.

4.7. Limitations and Discussion

Although the proposed framework demonstrates strong performance for insulator defect detection in high-resolution UAV imagery, several underlying mechanisms and potential limitations warrant further discussion.

A critical challenge in microscopic defect detection is the high susceptibility to false negatives (missed detections). The proposed data–model collaborative framework inherently incorporates a multi-level mechanism to mitigate this risk. At the data-processing level, ITAT reduces feature loss by running the fine detector on higher-resolution ROIs instead of relying on global downsampling. At the architecture level, the semi-decoupled prior-driven (SDPD) head is specifically designed to share low-level features before branching. This architectural safeguard retains the subtle morphological clues (e.g., edges and textures of tiny defects) that are often suppressed in deep, fully decoupled branches, which helps reduce missed detections of tiny defects.

Furthermore, regarding the generalization capability, the core strategies (LGOT and ITAT) are designed to resolve the universal scale mismatch issue, operating independently of the specific material properties. Therefore, the framework exhibits strong potential for application to composite insulators and other microscopic defects. Future work will expand the dataset to encompass a wider spectrum of power transmission components to fully validate this cross-category generalization.

However, several limitations remain. First, the efficiency of the proposed ITAT strategy depends on a pre-trained lightweight detector used for coarse region-of-interest screening, while this two-stage design significantly improves inference efficiency for sparse-target scenarios, it introduces additional components into the deployment pipeline, leading to increased system complexity compared with a single end-to-end detection model.

Second, while the proposed framework significantly reduces theoretical computational complexity (GFLOPs) and parameter count—providing an efficient algorithmic foundation for resource-constrained UAV systems—explicit platform-level validation has not yet been conducted. Comprehensive physical evaluations, including runtime efficiency, real-time inference latency, and power consumption profiling on actual embedded or edge devices, are crucial for industrial applications. Translating this algorithmic efficiency into physical edge-device deployment remains a high-priority direction for our future work.

5. Conclusions

This work targets two practical issues in detecting minuscule insulator defects from high-resolution UAV imagery: severe scale mismatch and feature loss caused by downsampling. To address these issues, we propose a data–model collaborative framework that jointly considers data preparation, model design, and inference.

A main contribution is the combined label-guided optimal tiling (LGOT) and inference-time adaptive tiling (ITAT) scheme. LGOT increases the effective target scale during training by constructing tilings guided by annotations, while ITAT applies a consistent spatial sampling strategy at inference to focus computation on candidate regions. Together, they reduce the feature distortion introduced by direct image downscaling. Experiments show that this data–inference pipeline brings stable gains on the current dataset across multiple mainstream detectors.

At the model level, we propose a semi-decoupled prior-driven (SDPD) detection head for microscopic defect detection. The SDPD-Head uses explicit scale priors generated by an evolutionary algorithm to provide direct scale guidance during learning, which improves training stability for tiny and sparse targets. In addition, it avoids fully decoupled head designs where unnecessary, while retaining shared low-level spatial features that are important for small-object sensitivity. This simplification also keeps the model lightweight.

By integrating SDPD-YOLO with LGOT and ITAT, the proposed method achieves state-of-the-art results on a real-world power-line inspection dataset. It reaches a peak mAP@50 of 92.9% with 2.17 M parameters and 4.7 GFLOPs, outperforming advanced YOLO variants and other mainstream detectors under the same data preparation and inference settings.

Overall, the results indicate that microscopic defect detection in remote sensing benefits more from end-to-end spatial scale alignment across training and inference than from optimizing a single module in isolation. Future work will extend the dataset to cover more transmission components (e.g., composite insulators and hardware fittings) to evaluate cross-category generalization. We also plan to conduct platform-level validation on representative edge devices, including latency and power measurements, to quantify runtime performance.

Author Contributions

Conceptualization, Y.Z.; Methodology, G.C.; Software, G.C.; Validation, Y.Z.; Formal analysis, Y.Z.; Investigation, Y.Z.; Resources, L.Z., J.Z. (Jiajun Zhou) and J.K.; Data curation, Y.Z.; Writing—original draft, G.C.; Writing—review & editing, Y.Z. and J.Z. (Jiangxiong Zhu); Visualization, G.C.; Supervision, J.Z. (Jiajun Zhou); Project administration, G.C. and J.Z. (Jiangxiong Zhu); Funding acquisition, J.Z. (Jiajun Zhou). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Data Availability Statement

Due to the privacy policies and security protocols established by the data provider (State Grid Sichuan Electric Power Company) regarding high-voltage infrastructure, the datasets analyzed in this study cannot be made publicly available.

Conflicts of Interest

Authors Linghao Zhang and Junwei Kuang were employed by the company State Grid Sichuan Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Yang, Z.; Xu, Z.; Wang, Y. Bidirection-Fusion-YOLOv3: An Improved Method for Insulator Defect Detection Using UAV Image. IEEE Trans. Instrum. Meas. 2022, 71, 3521408. [Google Scholar] [CrossRef]
He, M.; Qin, L.; Deng, X.; Liu, K. MFI-YOLO: Multi-Fault Insulator Detection Based on an Improved YOLOv8. IEEE Trans. Power Deliv. 2024, 39, 168–179. [Google Scholar] [CrossRef]
Hao, K.; Chen, G.; Zhao, L.; Li, Z.; Liu, Y.; Wang, C. An Insulator Defect Detection Model in Aerial Images Based on Multiscale Feature Pyramid Network. IEEE Trans. Instrum. Meas. 2022, 71, 3522412. [Google Scholar] [CrossRef]
Fu, Q.; Liu, J.; Zhang, X.; Zhang, Y.; Ou, Y.; Jiao, R.; Li, C.; Mazzanti, G. A Small-Sized Defect Detection Method for Overhead Transmission Lines Based on Convolutional Neural Networks. IEEE Trans. Instrum. Meas. 2023, 72, 3524612. [Google Scholar] [CrossRef]
Varga, L.A.; Zell, A. Tackling the background bias in sparse object detection via cropped windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montral, QC, Canada, 10–17 October 2021; pp. 2768–2777. [Google Scholar]
Shewajo, F.A.; Fante, K.A. Tile-based microscopic image processing for malaria screening using a deep learning approach. BMC Med. Imaging 2023, 23, 39. [Google Scholar] [CrossRef] [PubMed]
Jeong, D.; Kim, J. Road damage detection using yolo with image tiling about multi-source images. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6401–6406. [Google Scholar]
Japar, A.F.; Ramli, H.R.; Norsahperi, N.M.H.; Hasan, W.Z.W. Oil Palm Loose Fruit Detection Using YOLOv4 for an Autonomous Mobile Robot Collector. IEEE Access 2024, 12, 138582–138593. [Google Scholar] [CrossRef]
Zhang, Z.; Wang, L.; Li, X. Characterization and modeling of GNSS site-specific unmodeled errors under reflection and diffraction using a data-driven approach. Satell. Navig. 2025, 6, 8. [Google Scholar] [CrossRef]
Pan, Y.; Möller, G.; Soja, B. Machine learning-based multipath modeling in spatial domain applied to GNSS short baseline processing. GPS Solut. 2024, 28, 9. [Google Scholar] [CrossRef]
Zhang, Z.; Yu, Y.; Li, X.; He, X. Reliable GNSS positioning and navigation with simultaneous multipath and NLOS mitigation by using camera and C/N0 at deep canyons. Meas. Sci. Technol. 2025, 36, 046308. [Google Scholar] [CrossRef]
Nunes, F.; Sousa, F. Deep learning soft-decision GNSS multipath detection and mitigation. Sensors 2024, 24, 4663. [Google Scholar] [CrossRef]
Zhang, L.; Wang, L.; Yan, Z.; Jia, Z.; Wang, H.; Tang, X. Star Generative Adversarial VGG Network-Based Sample Augmentation for Insulator Defect Detection. Int. J. Comput. Intell. Syst. 2024, 17, 141. [Google Scholar] [CrossRef]
Akella, R.; Gunturi, S.K.; Sarkar, D. Enhancing Power Line Insulator Health Monitoring with a Hybrid Generative Adversarial Network and YOLO3 Solution. Tsinghua Sci. Technol. 2024, 29, 1796–1809. [Google Scholar] [CrossRef]
Cao, Y.; Xu, H.; Su, C.; Yang, Q. Accurate Glass Insulators Defect Detection in Power Transmission Grids Using Aerial Image Augmentation. IEEE Trans. Power Deliv. 2023, 38, 956–965. [Google Scholar] [CrossRef]
Zhang, Z.D.; Zhang, B.; Lan, Z.C.; Liu, H.C.; Li, D.Y.; Pei, L.; Yu, W.X. FINet: An Insulator Dataset and Detection Benchmark Based on Synthetic Fog and Improved YOLOv5. IEEE Trans. Instrum. Meas. 2022, 71, 6006508. [Google Scholar] [CrossRef]
Panigrahy, S.; Karmakar, S. Real-Time Condition Monitoring of Transmission Line Insulators Using the YOLO Object Detection Model with a UAV. IEEE Trans. Instrum. Meas. 2024, 73, 2514109. [Google Scholar] [CrossRef]
Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 January 2026).
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 January 2026).
Li, D.; Lu, Y.; Gao, Q.; Li, X.; Yu, X.; Song, Y. LiteYOLO-ID: A Lightweight Object Detection Network for Insulator Defect Detection. IEEE Trans. Instrum. Meas. 2024, 73, 5023812. [Google Scholar] [CrossRef]
Vieira e Silva, A.L.B.; de Castro Felix, H.; Simões, F.P.M.; Teichrieb, V.; dos Santos, M.; Santiago, H.; Sgotti, V.; Lott Neto, H. InsPLAD: A Dataset and Benchmark for Power Line Asset Inspection in UAV Images. Int. J. Remote Sens. 2023, 44, 7294–7320. [Google Scholar] [CrossRef]
Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Ding, L.; Rao, Z.Q.; Ding, B.; Li, S.J. Research on Defect Detection Method of Railway Transmission Line Insulators Based on GC-YOLO. IEEE Access 2023, 11, 102635–102642. [Google Scholar] [CrossRef]
Bao, W.; Du, X.; Wang, N.; Yuan, M.; Yang, X. A Defect Detection Method Based on BC-YOLO for Transmission Line Components in UAV Remote Sensing Images. Remote Sens. 2022, 14, 5176. [Google Scholar] [CrossRef]
Shen, W.; Fang, M.; Wang, Y.; Xiao, J.; Chen, H.; Zhang, W.; Li, X. AE-YOLOv5 for Detection of Power Line Insulator Defects. IEEE Open J. Comput. Soc. 2024, 5, 468–479. [Google Scholar] [CrossRef]
Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-Time Object Detection. arXiv 2023, arXiv:2304.08069. [Google Scholar]
RangiLyu. NanoDet-Plus: Super Fast and Lightweight Anchor-Free Object Detection Model. GitHub Repository. 2023. Available online: https://github.com/RangiLyu/nanodet (accessed on 12 January 2026).

Figure 1. Insulator defects raw image sample.

Figure 2. The whole framework of the working process.

Figure 3. The process of COC tiling strategy in LGOT.

Figure 4. A sample of the pixel area proportion comparison before and after cropping. (a) The original high-resolution UAV image; (b) A cropped sub-image containing an insulator string; (c) A cropped sub-image highlighting a microscopic insulator defect (red box), demonstrating a significantly increased relative pixel area of the target; (d) Another cropped sub-image showing an insulator under a complex background. The colored boxes indicate insulators (orange) and insulator defects (red).

Figure 5. An attention heat map comparison using Grad-CAM for models trained with non-cropped and cropped data. The different colors represent the intensity of the network’s attention: warmer colors (e.g., red and yellow) indicate regions with high activation and focus, corresponding to the target insulators, while cooler colors (e.g., dark blue) indicate low attention areas in the background.

Figure 6. Structure of the SDPD-YOLO. The light blue blocks represent normal operators, the grey blocks represent basic operations and the red blocks represent our SDPD-Head.

Figure 7. (a) The structure of the YOLO11 Head. (b) The structure of the SDPD-Head.

Figure 8. The topology flowchart of the SDPD-Head.

Figure 9. The evolution anchors generation process.

Figure 10. The ITAT inference process diagram.

Figure 11. The target size distribution comparison before and after applying the proposed tiling strategy. (a) The relative target size distribution in the original dataset, illustrating severe scale mismatch with targets heavily concentrated at extremely small scales; (b) The relative target size distribution after applying the label-guided optimal tiling (LGOT) processing, showing an enlarged and more balanced target scale relative to the input resolution. The color gradient represents the data density, where darker blue indicates a higher frequency of target bounding boxes at a specific relative width and height, and lighter blue indicates a lower frequency.

Figure 12. Qualitative detection results of the proposed framework on representative high-resolution UAV images. The green bounding boxes indicate the pre-annotated ground truth, the red bounding boxes denote the model’s predictions, and the yellow dashed boxes represent the Region of Interest (RoI) areas. In the Notes section at the bottom, the green text indicates missed or false detections by the model, while the red text highlights successful detection cases.

Table 1. Defect image distribution of the experimental dataset.

Defect Subject	Raw Images	Tiled Images
G-Insulator Broken	83	96
G-Insulator Dirty	383	464
P-Insulator Dirty	410	869

Table 2. Definition of data and inference strategies used in the experiments.

Method	Data Strategy	Inference Strategy
Method 1	Downscaling the original image	Direct inference on resized images
Method 2	Conventional sliding-window tiling	Sliding-window inference
Method 3	Label-guided optimal tiling (LGOT)	Inference-time adaptive tiling (ITAT)

Table 3. Validation mAP comparison of YOLO11 trained with different data tiling strategies.

Model	Training Data Strategy	Validation mAP@50 (%)
YOLO11	Conventional Sliding-Window	82.7
	Random Tiling	84.0
	LGOT (Ours)	86.5

Table 4. Ablation study of the collaborative data–model components on YOLO11.

Training Set	Inference Strategy	mAP@50 (%)
Original Dataset	Standard Inference	83.9
LGOT Dataset	Standard Inference	72.7
LGOT Dataset	ITAT Inference	86.0

Table 5. Performance comparison of baseline YOLO models under Method 1 and Method 2.

Model	Method	mAP@50 (%)	Params (M)	GFLOPs
YOLOv8 [19]	Method 1	78.9	2.69	6.8
	Method 2	84.3
YOLOv9 [22]	Method 1	75.8	1.73	6.4
	Method 2	82.7
YOLOv10 [23]	Method 1	78.8	2.70	8.2
	Method 2	83.1
YOLO11 [18]	Method 1	79.1	2.58	6.3
	Method 2	84.8

Table 6. Performance comparison of advanced YOLO models evaluating the SDPD architecture.

Model	Method	mAP@50 (%)	Params (M)	GFLOPs
YOLO11 [18]	Method 2	84.8	2.58	6.3
YOLO11 [18]	Method 3	86.3	2.58	6.3
GC-YOLO [24]	Method 2	87.5	2.16	5.6
GC-YOLO [24]	Method 3	89.3	2.16	5.6
BC-YOLO [25]	Method 2	86.9	1.79	5.6
BC-YOLO [25]	Method 3	88.7	1.79	5.6
FINet [16]	Method 2	87.7	2.21	5.9
FINet [16]	Method 3	89.4	2.21	5.9
AE-YOLO [26]	Method 2	88.4	2.25	6.0
AE-YOLO [26]	Method 3	90.0	2.25	6.0
SDPD-YOLO (Ours)	Method 2	90.1	2.17	4.7
SDPD-YOLO (Ours)	Method 3	92.9	2.17	4.7

Table 7. Performance comparison of mainstream non-YOLO detectors under the unified LGOT + ITAT framework (Method 3).

Model	Method	mAP@50 (%)	Params (M)	GFLOPs
NanoDet-1.5 × [28]	Method 3	63.3	2.07	5.8
NanoDet-plus-1.5 × [28]	Method 3	72.1	2.44	7.2
NanoDet-EfficientLite2 [28]	Method 3	69.3	4.66	11.1
RT-DETR-l [27]	Method 3	87.8	17.25	53.0
RT-DETR-Resnet50 [27]	Method 3	88.7	31.52	81.9
SDPD-YOLO (Ours)	Method 3	92.9	2.17	4.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhu, Y.; Chen, G.; Zhang, L.; Zhou, J.; Kuang, J.; Zhu, J. Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design. Remote Sens. 2026, 18, 953. https://doi.org/10.3390/rs18060953

AMA Style

Zhu Y, Chen G, Zhang L, Zhou J, Kuang J, Zhu J. Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design. Remote Sensing. 2026; 18(6):953. https://doi.org/10.3390/rs18060953

Chicago/Turabian Style

Zhu, Yujie, Guanhua Chen, Linghao Zhang, Jiajun Zhou, Junwei Kuang, and Jiangxiong Zhu. 2026. "Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design" Remote Sensing 18, no. 6: 953. https://doi.org/10.3390/rs18060953

APA Style

Zhu, Y., Chen, G., Zhang, L., Zhou, J., Kuang, J., & Zhu, J. (2026). Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design. Remote Sensing, 18(6), 953. https://doi.org/10.3390/rs18060953

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Lightweight Insulator Defect Detection in High-Resolution UAV Imagery via System-Level Co-Design

Highlights

Abstract

1. Introduction

2. Related Works

2.1. Tiling-Based High-Resolution Image Detection

2.2. Insulator Defect Detection

3. Method

3.1. Label-Guided Optimal Tiling Method

3.2. Semi-Decoupled Prior-Driven Detect Head

3.3. Inference-Time Adaptive Tiling

4. Experiments

4.1. Experimental Dataset

4.2. Implementation Details

4.3. Analysis of the Data-Model Collaborative Strategy

4.3.1. Experiment of Label-Guided Optimal Tiling

4.3.2. Ablation of the Collaborative Data-Model Strategy

4.4. Baseline Selection and Architecture Ablation

4.5. Comparison with General Mainstream Detectors

4.6. Qualitative Analysis

4.7. Limitations and Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI