YOLO-PowerLite V2: An Enhanced Lightweight Detector for Real-Time Tiny Anomaly Identification on Overhead Transmission Lines in Complex Environments

Wei, Shuangfeng; Cai, Yuhang; Zhong, Shaobo; Lv, Zheng

doi:10.3390/rs18121937


¹ School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 100044, China

² Institute of Urban Systems Engineering, Beijing Academy of Science and Technology, Beijing 100012, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1937; https://doi.org/10.3390/rs18121937

Submission received: 7 May 2026 / Revised: 4 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Advances in Artificial Intelligence (AI) and Deep Learning (DL) in UAV-Based Remote Sensing)


Highlights

What are the main findings?

YOLO-PowerLite V2, built on YOLO11n, integrates C3k2-UIB, MCA, MFM, and MBConv to achieve 0.97 M parameters, 2.8 G FLOPs, and 95.2% mAP@50 for detecting bird nests, defective insulators, and balloons.
The proposed model reduces parameters by 62.5% and FLOPs by 56.25% compared to the baseline, while maintaining detection accuracy and outperforming mainstream lightweight detectors.

What are the implications of the main findings?

The model meets the strict computing constraints of UAV edge devices, enabling real-time tiny anomaly identification for overhead transmission lines in complex environments.
It provides a scalable, lightweight design paradigm for customized object detection models in industrial UAV inspection scenarios.

Abstract

Existing object detection models struggle to balance detection accuracy against real-time inference efficiency on edge computing devices in UAV-based intelligent inspection of power transmission lines. To address this problem and the shortcomings of YOLO-PowerLite, this paper proposes YOLO-PowerLiteV2, a lightweight model for detecting anomalous targets on power transmission lines. Using YOLO11n as the baseline, the model reduces model size while preserving detection performance through four core improvements: the C3k2-UIB lightweight backbone module, the MCA (Multi-scale Cross-Axis) attention mechanism, the MBConv lightweight detection head, and the MFM (Modulation Feature Fusion) module. Experiments were conducted on a dataset of 5563 aerial images of transmission lines containing three types of targets: bird nests, defective insulators, and balloons. The results show that YOLO-PowerLiteV2 achieves an mAP@50 of 95.2% with only 0.97 M parameters and 2.8 G floating point operations (FLOPs). Compared with the baseline model, the number of parameters is reduced by 62.5% and FLOPs by 56.25%. On the NVIDIA Jetson Xavier NX edge platform, the model achieves 59.5 FPS with a latency of only 16.8 ms, a frame rate 31% higher than that of the baseline. Its overall performance exceeds that of mainstream lightweight detection models. The model is well suited to the deployment requirements of UAV edge terminals and thereby provides technical support for real-time intelligent inspection of power transmission lines.

Keywords:

object detection; YOLO-PowerLiteV2; power transmission line inspection; lightweight model; Unmanned Aerial Vehicle (UAV)

1. Introduction

The safe and stable operation of power transmission lines, exposed to diverse and often harsh environments, is critical to grid reliability but continuously threatened by various anomalous targets and defects [1,2]. Among the most common and hazardous anomalies are bird nests, defective insulators, and balloons, which can trigger short circuits or flashovers, leading to power outages and significant economic losses [3]. Traditional inspection methods relying on manual patrols or fixed sensor networks are limited by high labor intensity, poor coverage, and safety risks [4,5]. UAVs equipped with high-resolution sensors have revolutionized data acquisition [6], but generate massive visual data requiring automated analysis [7]. Thus, deep learning-based object detection has become the cornerstone of intelligent power line inspection, enabling real-time identification of defects from aerial images [8].

However, deploying these high-performance vision models directly on UAV platforms for real-time analysis remains a key challenge. UAV edge devices have strict constraints on computing power, storage, and power consumption [9]. Mainstream models such as the R-CNN series [10], SSD [11], DETR [12], and standard YOLO versions [13] have high computational demands, making them unsuitable for real-time inference on resource-constrained edge devices. This creates a critical gap between instant hazard identification requirements and limited on-board processing capabilities [14], driving the urgent need for lightweight, efficient detection models tailored for UAV edge deployment [15]. The core challenge lies in achieving an optimal balance: the model must be compact enough for real-time performance on limited hardware while maintaining high detection accuracy.

In response to this urgent demand, the research in this paper focuses on developing a lightweight object detection model specifically designed for UAV edge computing platforms in power transmission line inspection. This paper targets three critical and common threat categories: bird nests, defective insulators, and balloons. The work is based on the YOLO architecture, which is renowned for its favorable speed–accuracy trade-off, making it the preferred candidate for real-time applications. Specifically, this paper continues the research direction of the YOLO-PowerLite network model [16], and maximizes the lightweight design of the model to meet the constraints of edge computing devices while maintaining good detection accuracy. The goal of this paper is to enable the model to detect hazardous targets accurately, robustly, and efficiently under the strict constraints of UAV edge deployment. This effort aims to advance the development of more autonomous, reliable, and real-time intelligent inspection systems, thereby enhancing the safety and operational resilience of modern power grids.

The development of visual inspection for transmission lines reflects the broader shift from traditional feature engineering to deep learning, with growing focus on edge computing efficiency. Early research works were based on traditional computer vision techniques, using manually designed feature descriptors such as Scale-Invariant Feature Transform (SIFT) [17] and Histogram of Oriented Gradients (HOG) [18] to extract key points and edge information from aerial images. These hand-crafted features were then classified using machine learning algorithms such as Support Vector Machines (SVMs) [19]. While effective in constrained scenarios with clear targets and simple backgrounds, these methods degrade severely under real-world inspection challenges due to the inability of hand-crafted features to generalize across variations in illumination, scale, viewing angle, and occlusion [20]. This limitation, together with the labor-intensive feature design process, paved the way for deep learning.

The rise of deep learning, especially CNNs, has marked a paradigm shift, enabling automatic learning of hierarchical and robust feature representations from large-scale data [21]. In object detection, this has spawned two major algorithm families. Two-stage detectors, represented by the R-CNN series, first generate candidate regions (for example, with a region proposal network) and then classify and regress them. These models have achieved remarkable accuracy in aerial power component analysis, for example in bird nest detection based on Faster R-CNN [22,23] and insulator defect identification based on Mask R-CNN [24], often incorporating enhancement techniques such as ResNeXt backbones. Other works have integrated attention mechanisms into Faster R-CNN for multi-task component identification and defect assessment [25]. However, the inherent “proposal-refinement” pipeline results in high computational complexity and slow inference, making these detectors unsuitable for real-time UAV applications.

In contrast, single-stage detectors such as SSD [11] and the YOLO family [13] frame detection as a single regression problem, directly predicting bounding boxes and class probabilities for much faster inference. Through continuous iterations from YOLOv1 to YOLOv3, YOLOv5, and beyond, the YOLO series has continuously optimized the speed–accuracy trade-off, consolidating its position as the preferred real-time detection framework [26]. YOLO has been extensively applied in power line inspection for detecting components such as insulators, vibration dampers, and clamps [8]. For instance, YOLOv3 and YOLOv4 have been successfully applied to insulator and bird nest detection, showing significant speed advantages over two-stage methods [27]. With the advent of YOLOv5 and later versions, research focus has expanded to address domain-specific challenges. Common improvements include integrating attention mechanisms (e.g., Convolutional Block Attention Module, CBAM [28]) to help the model focus on salient features in cluttered backgrounds; adopting advanced data augmentation strategies to improve generalization ability with limited data; and applying improved loss functions for better bounding box regression and handling of class imbalance problems.

Recent research has continued to push the boundaries of YOLO-based models for more comprehensive and accurate inspection. The work of Liu et al. [8] proposed an enhanced YOLOv8 model specifically for multi-target detection in transmission line inspection. Its improvements include the integration of C2f-Faster, Ghost-C2f, SPD-Conv, and triplet attention modules, which together achieve a significant improvement in average precision across a range of component and defect detection tasks. Similarly, for the critical rust defect detection task of metal components, researchers developed RD-YOLO based on the YOLOv10 architecture [15]. This model incorporates a coordinate channel attention residual module to enhance spatial-channel features, a receptive field block to capture multi-scale contextual information, and an efficient convolutional block attention module, demonstrating high accuracy in identifying rust in complex backgrounds.

While accuracy has improved greatly, the computational requirements of these increasingly complex models have grown accordingly, which conflicts with deployment on resource-limited edge devices. This has given rise to a crucial parallel research direction: model lightweighting and efficiency optimization for edge AI [29]. The goal is to reduce the number of parameters, computational cost, and memory requirements of a model while preserving as much of its accuracy as possible. Common strategies include replacing heavy backbones (e.g., Darknet) with lightweight ones (e.g., MobileNet [30] or GhostNet [31]), designing efficient convolutional blocks, applying neural architecture search, and adopting techniques such as pruning and knowledge distillation. In the context of power transmission line inspection, several lightweight YOLO variants have emerged [32]. These include an adapted version of YOLOv3 using MobileNetV2 [33], an improved YOLOv5 model with layer pruning and adaptive attention [34], and YOLOv8 integrated with lightweight modules such as GSConv [35].

A particularly relevant contribution is the YOLO-PowerLite model [16], which explicitly addresses high accuracy and minimal resource consumption for UAV-based transmission line anomaly detection. Built on YOLOv8n, it introduces the C2f_AK module with deformable kernel convolution, a BiFPN-based feature fusion network, a parameter-sharing detection head, and coordinate attention to enhance focus on target regions. On a composite dataset containing bird nests and insulator defects, YOLO-PowerLite achieved mAP@0.5 comparable to the baseline YOLOv8n, while realizing significant reductions in the number of parameters, computational cost, and model size. Its practical feasibility was verified through deployment on the NVIDIA Jetson Xavier NX edge platform, where the model maintained an average inference time of 31.2 ms per frame, highlighting its strong potential for real-time on-board processing. Lightweight models for edge devices remain critical in remote sensing. Liu et al. [36] achieved 54.6% AP₅₀ with merely 4.85 M parameters on AI-TODv2, validating a balanced efficiency–rationality paradigm.

Despite these significant advances, a unified lightweight model optimized for the simultaneous, real-time detection of the three key hazards (bird nests, defective insulators, and balloons) on UAV edge platforms remains to be explored further. Balloons, a distinctive hazard with simple texture and potential occlusion, have received relatively little attention in the design of dedicated detection models. Furthermore, while YOLO-PowerLite provides an excellent lightweight foundation, there is still room to tailor and strengthen its feature representation and discriminative ability for three target categories that differ greatly in size, shape, texture, and contextual appearance. This paper is therefore positioned as an extension and in-depth study of the YOLO-PowerLite design concept [16]. It aims to improve the accuracy, robustness, and real-time performance of the simultaneous detection of bird nests, defective insulators, and balloons under the strict operational constraints of UAV on-board edge computing, and thereby to meet a specific and practical demand in intelligent inspection.

The application performance of YOLO-PowerLite was described in the authors’ previously published paper [37]. Since then, more efficient general-purpose baseline models have been released, and new modules that are lighter or more accurate have been proposed. This paper therefore continues the research direction of YOLO-PowerLite and proposes the YOLO-PowerLiteV2 model, which is built on a more efficient baseline and improved modules.

2. Materials and Methods

2.1. Review of YOLO-PowerLite

YOLO-PowerLite is a lightweight detection model designed for edge deployment in UAV inspection of power transmission lines, with YOLOv8n as the baseline. Its core design concept is to make the model lightweight by optimizing every stage of the network, so that it fits the computing power and storage constraints of edge devices while preserving the detection accuracy of anomalous targets. The model is optimized in four respects: a Coordinate Attention (CA) mechanism is introduced at the end of the backbone network to alleviate the interference of complex backgrounds with target detection [38]; the native C2f module in the feature fusion network is replaced by the C2f_AK module, which uses variable kernel convolution (AKConv) [39] to reduce the number of parameters while improving adaptability to multi-scale features; the feature fusion network is rebuilt on BiFPN [40] to improve the efficiency of cross-scale feature fusion; and the native decoupled head of YOLOv8 is made lighter by sharing parameters at the front end of the task branches, which removes redundant computation. The final model maintained accuracy comparable to the baseline in detecting two types of targets, bird nests and defective insulators, while significantly reducing the number of parameters, computational cost, and model size.

However, the model still has clear shortcomings. First, its detection categories cover only bird nests and defective insulators and exclude common floating foreign objects on lines such as balloons, so it struggles to meet the multi-target detection requirements of actual inspection. Second, its ability to capture the features of tiny and occluded targets is insufficient, and its detection robustness in complex backgrounds leaves room for improvement. Third, the model can be made lighter still, and it is not yet well suited to small UAV edge platforms with more limited computing power. Fourth, the synergy between feature fusion and the attention mechanism is insufficient and the dynamic adaptability to multi-scale targets is limited, which restricts detection performance in complex scenes.

2.2. Improved Model

To improve recognition accuracy, balance it against operating efficiency, and satisfy the requirements of deployment and operation on UAVs, this research builds on YOLO11n [41]. Continuing the research plan of the previous-generation model, a series of targeted improvements is made to it. The final improved network is called YOLO-PowerLiteV2, and its structure is shown in Figure 1.

First, in the backbone, this paper modifies the C3k2 module of YOLO11 by replacing the two 3 × 3 convolutions of its standard bottleneck structure with one 3 × 3 convolution and two 1 × 1 convolutions. This change greatly reduces the number of parameters and the computational cost of the module, which makes the model lighter and more efficient and facilitates its deployment on UAV edge computing devices for real-time detection of anomalous targets on power transmission lines. To improve accuracy while keeping the module efficient, this paper also introduces the Universal Inverted Bottleneck (UIB) module [42] at the end of the C3k2 module, which improves the recognition accuracy of the model. In addition, the Multi-Scale Cross-Axis (MCA) attention mechanism [43] replaces the C2PSA attention mechanism of YOLO11, reducing the number of parameters and the computational cost while preserving the accuracy of the original model.

Second, in the neck, this paper replaces the original concat module in the feature fusion part of YOLO11 with the Modulation Fusion Module (MFM) [44] to improve both the efficiency and the quality of feature processing. By dynamically adjusting the feature fusion weights, the MFM makes the network more sensitive to key features and improves its feature representation capability, which in turn improves the detail and structural consistency of the results.

Finally, this paper optimizes the detection head of YOLO11. The resulting detect_MBConv (Mobile Bottleneck Convolution) head reduces the number of parameters and the computational cost of the model, while its unified terminal layer, efficient Squeeze-and-Excitation (SE) module, and depthwise separable convolution allow the model to retain sufficient accuracy. With the design advantages of MBConv integrated into the YOLO11 detection head, the model identifies targets more accurately and responds quickly under limited computing resources, which improves its real-time detection capability.

2.2.1. C3k2-UIB Lightweight Backbone Module

YOLO11 replaces the earlier C2f block with the more computationally efficient C3k2 block, a custom variant of the CSP bottleneck that uses two convolutions instead of one large convolution. To streamline the model further, this paper modifies the C3k2 module of YOLO11 by replacing the two 3 × 3 convolutions of its standard bottleneck structure with a combination of one 3 × 3 convolution and two 1 × 1 convolutions. This modification greatly reduces the number of parameters and the computational cost of the module, making the model substantially lighter and more efficient and laying a foundation for its deployment on UAV edge computing devices for real-time detection of anomalous targets on power transmission lines. To further improve performance, this paper also introduces the Universal Inverted Bottleneck (UIB) module at the end of the C3k2 module, which enhances the target recognition accuracy of the model. The structures of the improved C3k2-UIB module and the UIB module are shown in Figure 2.
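The saving from swapping two 3 × 3 convolutions for one 3 × 3 and two 1 × 1 convolutions can be illustrated with a simple weight count. The sketch below is not the authors' implementation: the 1 × 1, 3 × 3, 1 × 1 ordering, the optional inner width `hidden`, and the channel counts are assumptions made for illustration, and biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution without bias."""
    return (c_in // groups) * c_out * k * k


def standard_bottleneck(c):
    """Two 3x3 convolutions at constant width c."""
    return conv_params(c, c, 3) + conv_params(c, c, 3)


def modified_bottleneck(c, hidden=None):
    """One 3x3 convolution between two 1x1 convolutions.

    `hidden` is a hypothetical inner width; None keeps it equal to c.
    """
    h = c if hidden is None else hidden
    return conv_params(c, h, 1) + conv_params(h, h, 3) + conv_params(h, c, 1)


for c in (32, 64, 128):
    base = standard_bottleneck(c)
    same = modified_bottleneck(c)
    half = modified_bottleneck(c, hidden=c // 2)
    print(
        f"c={c}: two 3x3 = {base}, 1x1/3x3/1x1 = {same} "
        f"({1 - same / base:.1%} fewer), hidden=c/2 = {half} "
        f"({1 - half / base:.1%} fewer)"
    )
```

Under these assumptions the count falls from 18c² to 11c² at constant width, and further if the 1 × 1 convolutions also narrow the channels; the actual saving in C3k2-UIB depends on the channel configuration shown in Figure 2.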

The purpose of UIB is to make the model lighter, relieve the computing and memory bottlenecks of edge computing devices, and improve efficiency while preserving accuracy. It extends the inverted bottleneck (IB) structure of MobileNetV2 with two optional depthwise (DW) convolutions, one before the expansion layer (the first 1 × 1 convolution in IB) and one between the expansion layer and the projection layer (the second 1 × 1 convolution in IB). Neural Architecture Search (NAS) selects whether each DW convolution is present and which kernel size it uses, which yields the instantiated structures shown in Figure 3. The innovation of the UIB module lies in unifying mainstream microarchitectures such as IB, ConvNext-Like, and FFN, adding an Extra DW variant, and accounting for spatial and channel mixing as well as receptive field expansion. As a result, the improved C3k2-UIB module maintains the detection capability of YOLO11 while making fuller use of computing resources and avoiding unnecessary overhead, thereby improving the efficiency of the model.
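The four UIB instantiations can be enumerated from the two optional DW convolutions. The following sketch assumes the placement described for MobileNetV4 (one DW before the expansion layer, one after it); the function names, the expansion ratio, and the channel counts are hypothetical and are not taken from the authors' code.

```python
def uib_variant(start_dw, middle_dw):
    """Name the UIB instantiation given which optional DW convolutions are present."""
    if start_dw and middle_dw:
        return "ExtraDW"
    if middle_dw:
        return "IB"
    if start_dw:
        return "ConvNext-Like"
    return "FFN"


def uib_params(c_in, c_out, expand, start_k=0, middle_k=0):
    """Approximate weight count; a kernel size of 0 means that DW conv is absent."""
    hidden = c_in * expand
    total = 0
    if start_k:
        total += c_in * start_k * start_k       # depthwise conv before expansion
    total += c_in * hidden                       # 1x1 expansion
    if middle_k:
        total += hidden * middle_k * middle_k    # depthwise conv on expanded features
    total += hidden * c_out                      # 1x1 projection
    return total


for start_k, middle_k in ((0, 0), (0, 3), (3, 0), (3, 3)):
    name = uib_variant(bool(start_k), bool(middle_k))
    print(f"{name:13s} {uib_params(64, 64, expand=4, start_k=start_k, middle_k=middle_k)}")
```

The count shows why the optional DW convolutions are cheap: for 64 channels and an expansion ratio of 4, the two 1 × 1 convolutions dominate, and each DW convolution adds only a few percent.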

Transmission line scenes often have complex backgrounds, and most anomalous targets are small objects with irregular shapes, which makes detection difficult. Because the presence and the kernel size of each DW convolution in the UIB module can be adjusted to the requirements of the input features, the module suits the characteristics of this detection scenario. Specifically, this configurable depthwise separable design allows the computation and efficiency of the model to be tuned in a targeted manner, so that the details of small anomalous targets are captured more accurately. This improvement enables the C3k2-UIB module to distinguish normal transmission line components from anomalous objects such as bird nests and damaged insulators.

2.2.2. MCA Multi-Scale Cross-Axis Attention Mechanism

C2PSA is a module introduced in YOLO11 that adds the Position-Sensitive Attention (PSA) mechanism to the CSP split-merge structure. It splits the input features into two branches through a 1 × 1 convolution: one branch is enhanced by multiple PSA blocks to capture key spatial information, the other retains the original linear features, and the two are finally concatenated and fused. The module dynamically adjusts the feature weights of different positions, accounts for both global and local information, and performs well in complex scenes and small target detection. However, it has limitations: its attention mechanism is implemented with convolutions, which limits long-range dependency modeling; its multi-scale feature fusion is not flexible enough; and it adapts poorly to targets of variable shape and size, so it struggles with detection tasks that require accurate capture of complex structures.

To address the shortcomings of C2PSA, this paper replaces it with the MCA (Multi-scale Cross-Axis Attention) module, whose structure is shown in Figure 4. Unlike the convolutional attention and serial feature processing of C2PSA, MCA uses two parallel axial attention branches, combines them with multi-scale strip convolutions to capture target features of different sizes, and then builds interaction between the branches through cross-axis attention. It retains the computational efficiency of axial attention (the complexity is reduced from O(HW × HW) to O(HW × (H + W))) while enhancing long-range dependency modeling and multi-scale fusion, which suits targets of variable shape and size. MCA has few parameters and a controllable computational cost, and its multi-scale feature encoding and global context fusion have been reported to outperform heavy models such as Swin Transformer in tasks such as medical image segmentation. After the replacement, the detection accuracy of YOLO11 for targets with variable shapes and sizes is significantly improved while the lightweight advantage is maintained, which compensates for the weaknesses of C2PSA in capturing complex structures and adapting flexibly.
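The gap between O(HW × HW) and O(HW × (H + W)) can be made concrete by counting pairwise interactions. The sketch below is only an illustration of the complexity argument; the feature-map sizes are assumptions corresponding to a 640 × 640 input at strides 8, 16, and 32, which this section does not state.

```python
def full_attention_pairs(h, w):
    """Pairwise interactions when every position attends to every position."""
    n = h * w
    return n * n


def axial_attention_pairs(h, w):
    """Pairwise interactions when each position attends along its row and its column."""
    return h * w * (h + w)


# Assumed feature-map sizes (640 x 640 input, strides 8, 16, 32).
for h, w in ((80, 80), (40, 40), (20, 20)):
    full = full_attention_pairs(h, w)
    axial = axial_attention_pairs(h, w)
    print(f"{h}x{w}: full={full:,} axial={axial:,} ratio={full / axial:.1f}x")
```

The ratio equals HW / (H + W), so the saving grows with resolution, which is where small targets are usually detected.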

2.2.3. MBConv Lightweight Detection Head

The original Detect module of YOLO11 has a concise and compact structure: a series of ordinary convolutional layers directly outputs the target category, bounding box coordinates, and confidence information through a fixed-dimensional feature mapping. Its calculation process is intuitive and its inference is fast, it needs no complex feature conversion mechanism, and it fits the end-to-end detection process of the model, in line with the lightweight, real-time design positioning of the YOLO series. However, the module has limitations. Feature extraction with ordinary convolutional layers is inefficient, and the ability to capture fine-grained features of multi-scale targets is insufficient, especially in scenes with small or overlapping targets, where feature confusion is also likely. As a result, it is difficult to balance accuracy improvement against computational cost and to fully exploit the expressive potential of deep features.

To overcome the shortcomings of the original Detect module, this research replaces it with the detect_MBConv module. Figure 5 compares the structure of the detect_MBConv module with that of the original YOLO11 Detect module. The module is a targeted optimization based on the MBConv structure of EfficientNet [45], with “bottleneck structure + depthwise separable convolution + SE attention” at its core. It first compresses the feature dimension with a 1 × 1 convolution to reduce redundant computation, then uses depthwise separable convolution instead of ordinary convolution to extract feature details efficiently at lower cost, and finally restores the feature dimension with another 1 × 1 convolution. The embedded SE attention mechanism dynamically computes an importance weight for each feature channel, strengthening the key features of the target and suppressing irrelevant background information, which compensates for the lack of attention guidance in the original module. In addition, the module retains an output layer adapted to the detection task, which keeps feature extraction correctly connected to the classification and regression tasks and maps features to detection results efficiently. The replacement has a significant effect. While the lightweight characteristics of the model are maintained or even improved, its feature expression ability is greatly improved, making it more adaptable to multi-scale targets and more accurate in complex backgrounds. Better parameter utilization allows the model to mine richer deep features at the same computational cost, and its robustness is significantly enhanced. It maintains stable detection performance across scenarios and complex environments while meeting the dual requirements of real-time performance and detection accuracy, giving a more balanced performance overall.
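The channel reweighting performed by the SE attention can be sketched in a few lines. This is a minimal pure-Python illustration of the generic squeeze-and-excitation pattern (global average pooling, a reducing and an expanding fully connected layer, a sigmoid gate), not the detect_MBConv code; the ReLU activation, the bias-free layers, and the hand-picked weights are assumptions.

```python
import math


def squeeze(feature):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def dense(vec, weights):
    """Bias-free fully connected layer; each row of `weights` yields one output."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def se_reweight(feature, w_reduce, w_expand):
    """Scale every channel by a gate in (0, 1) computed from all channels."""
    z = squeeze(feature)
    hidden = [max(0.0, h) for h in dense(z, w_reduce)]                      # ReLU
    gates = [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w_expand)]  # sigmoid
    scaled = [[[v * g for v in row] for row in ch] for ch, g in zip(feature, gates)]
    return scaled, gates


# Toy input: 2 channels of 2x2 values, with hypothetical weights.
feature = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 4.0]]]
w_reduce = [[-1.0, 1.0]]     # 2 channels -> 1 hidden unit
w_expand = [[2.0], [-2.0]]   # 1 hidden unit -> 2 gates
scaled, gates = se_reweight(feature, w_reduce, w_expand)
print([round(g, 4) for g in gates])
```

With these toy weights the first channel is emphasized and the second is suppressed; in a trained head the two small layers learn which channels carry target evidence.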

2.2.4. MFM Modulation Feature Fusion Module

The original concat block of YOLO11 simply concatenates features from different levels. It cannot distinguish differences in feature importance, which easily dilutes key information or admits interference from redundant information. Based on the concept of dynamic feature weighting, the MFM (Modulation Fusion Module) responds to the complementary roles of shallow details and deep semantics in object detection by generating an adaptive weight matrix through Global Average Pooling (GAP), a Multi-Layer Perceptron (MLP), and Softmax. It adjusts the fusion weights dynamically according to feature complementarity, links cross-level and multi-scale features, and replaces the traditional static concatenation. Its structure is shown in Figure 6. Its core advantage is dynamic modulation: it adaptively highlights key features and suppresses redundant information, and it meets the fusion requirements of boundary features and semantic information. Its structure is also concise, so it can be integrated into the neck of YOLO11 without complex modification and with a controllable computational burden. This improvement alleviates the insufficient feature fusion of the original concat, improves the effectiveness of feature representation, and enhances the network’s ability to identify target boundaries and details, so the model performs better in complex scenes and adapts better to targets of different scales. The overall detection accuracy, robustness, and scene adaptability are significantly improved.
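The GAP, MLP, and Softmax pipeline described above can be sketched as a per-channel weighted sum of two feature maps. The code is a simplified illustration under stated assumptions, not the module from [44]: the two inputs are assumed to have the same shape, the descriptor is taken from their sum, and `toy_mlp` is a hypothetical stand-in for the learned MLP.

```python
import math


def gap(feature):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def softmax(xs):
    """Numerically stable softmax over a short list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_fusion(feat_a, feat_b, mlp):
    """Fuse two same-shaped feature maps with per-channel softmax weights."""
    summed = [
        [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
        for ca, cb in zip(feat_a, feat_b)
    ]
    logits_a, logits_b = mlp(gap(summed))
    fused, weights = [], []
    for ca, cb, la, lb in zip(feat_a, feat_b, logits_a, logits_b):
        wa, wb = softmax([la, lb])   # the two branch weights sum to 1
        weights.append((wa, wb))
        fused.append([[wa * a + wb * b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)])
    return fused, weights


def toy_mlp(descriptor):
    """Hypothetical stand-in for a learned MLP: one logit per branch and channel."""
    return list(descriptor), [-d for d in descriptor]


shallow = [[[2.0, 2.0], [2.0, 2.0]]]   # one channel, 2x2
deep = [[[0.0, 0.0], [0.0, 0.0]]]
fused, weights = modulated_fusion(shallow, deep, toy_mlp)
print(weights[0], fused[0][0][0])
```

Unlike concatenation, which doubles the channel count and leaves weighting to later layers, this kind of fusion keeps the channel count fixed and makes the branch weights depend on the input.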

3. Experiments

3.1. Experimental Setup

3.1.1. Dataset

The dataset used in this research comes from four sources: first, high-voltage transmission line images obtained by on-site UAV inspection; second, the public transmission line bird nest dataset [46]; third, the public Chinese Power Line Insulator Dataset (CPLID) [47]; and fourth, the public transmission line balloon dataset. Because no public dataset covers multi-category anomalous targets on transmission lines, this paper integrates the existing data and supplements it with self-captured images. The resulting comprehensive dataset contains 5563 images and covers three common types of anomalous targets on transmission lines: bird nests, defective insulators, and foreign-object balloons. The details of the dataset are shown in Figure 7. The ratio of the three target types is approximately 5:5:1, which was chosen to reflect how frequently these risk targets occur in real scenarios. The detection targets are mainly small objects, and they are mainly located in the central area of the image.

To systematically evaluate the performance of the YOLO-PowerLiteV2 model, this paper divides the entire dataset into a training set, validation set, and test set according to the ratio of 13:2:1. This division ensures that the model can be trained on sufficiently diverse data, while leaving an independent validation set and test set for evaluating the generalization ability and practical application performance of the model.
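A 13:2:1 split of 5563 images does not divide evenly, so some rounding rule is needed. The sketch below shows one reproducible way to produce such a split; the rounding rule, the seed, and the function name are assumptions, and the resulting counts are not the subset sizes reported by the authors.

```python
import random


def split_indices(n_items, ratios=(13, 2, 1), seed=0):
    """Shuffle indices reproducibly and cut them into train/val/test parts."""
    total = sum(ratios)
    n_train = round(n_items * ratios[0] / total)
    n_val = round(n_items * ratios[1] / total)
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)   # fixed seed keeps the split repeatable
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # the remainder absorbs rounding error
    return train, val, test


train, val, test = split_indices(5563)
print(len(train), len(val), len(test))
```

Under these assumptions the split gives 4520 training, 695 validation, and 348 test images, with every image assigned to exactly one subset.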

In view of the overfitting problem common in deep learning model training, especially when the training samples are relatively limited, this paper adopts a series of data augmentation techniques to improve the robustness and generalization ability of the model. The specific data augmentation operations include horizontal flip, vertical flip, random rotation, brightness adjustment, Gaussian noise, and blurring. These techniques can not only expand the diversity of training samples, but also simulate various complex environmental conditions that may be encountered in practical applications, thereby ensuring the effectiveness and robustness of the model in real-world environments.
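Geometric augmentations such as flips must transform the bounding boxes together with the pixels. The sketch below illustrates this for horizontal and vertical flips; it assumes labels in the normalized `(class_id, x_center, y_center, width, height)` form commonly used with YOLO models, which this section does not specify, and it represents an image as a plain list of rows.

```python
def hflip_image(image):
    """Flip a row-major image (list of rows) left to right."""
    return [row[::-1] for row in image]


def vflip_image(image):
    """Flip a row-major image (list of rows) top to bottom."""
    return image[::-1]


def hflip_box(box):
    """Mirror a normalized box horizontally; only the x centre changes."""
    cls, xc, yc, w, h = box
    return (cls, 1.0 - xc, yc, w, h)


def vflip_box(box):
    """Mirror a normalized box vertically; only the y centre changes."""
    cls, xc, yc, w, h = box
    return (cls, xc, 1.0 - yc, w, h)


image = [[1, 2, 3], [4, 5, 6]]
box = (0, 0.25, 0.25, 0.1, 0.2)   # hypothetical label: class 0, upper-left region
print(hflip_image(image), hflip_box(box))
print(vflip_image(image), vflip_box(box))
```

Photometric operations such as brightness adjustment, Gaussian noise, and blurring leave the labels unchanged, while random rotation requires recomputing the enclosing box of the rotated corners.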

3.1.2. Experimental Environment and Training Strategy

The experiments were run on an AMD Ryzen 7 8845H processor and an NVIDIA GeForce RTX 4060 Laptop GPU (8 GB), using PyTorch 2.6.0, Python 3.12, and CUDA 12.6. To ensure fair and comparable results, no pre-trained weights were used for any model in the ablation or comparative experiments. The main training hyperparameters are listed in Table 1.

To suit different application scenarios and hardware, YOLO11 is provided in five scales obtained by adjusting two key parameters, width and depth: YOLO11n, YOLO11s, YOLO11m, YOLO11l, and YOLO11x. The number of parameters and the resource consumption increase from one version to the next in exchange for higher detection performance. The width, depth, and maximum number of channels of the five models are shown in Table 2.
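The sketch below illustrates how width and depth multipliers can scale a network. It assumes a common convention in which channel counts are capped at a maximum, multiplied by the width factor, and rounded up to a multiple of 8, while block repeat counts are multiplied by the depth factor. The function names and the multiplier values (width 0.25, depth 0.5, maximum 1024 channels) are hypothetical and are not taken from Table 2.

```python
import math


def scale_channels(base_channels, width, max_channels, divisor=8):
    """Scale a channel count by the width multiplier (assumed convention)."""
    scaled = min(base_channels, max_channels) * width
    return int(math.ceil(scaled / divisor) * divisor)


def scale_depth(base_repeats, depth):
    """Scale the number of repeated blocks by the depth multiplier."""
    if base_repeats <= 1:
        return base_repeats
    return max(round(base_repeats * depth), 1)


# Hypothetical multipliers for a small model variant.
width, depth, max_channels = 0.25, 0.5, 1024

print(scale_channels(256, width, max_channels))   # channels of a 256-channel layer
print(scale_channels(2048, width, max_channels))  # capped at max_channels first
print(scale_depth(2, depth))                      # repeats of a 2-block stage
```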

Because edge computing devices are usually limited in computing resources and energy, this paper selects the lightest version, YOLO11n, as the baseline for lightweight improvement and deployment.

3.1.3. Evaluation Metrics

To comprehensively evaluate the anomaly target detection model for transmission lines, this research considers both detection accuracy and the lightweight requirements of deployment on embedded devices. Precision, Recall, Average Precision (AP), mean Average Precision (mAP), number of Parameters, and Floating Point Operations (FLOPs) are therefore selected as the evaluation metrics.

Precision is the proportion of the targets predicted by the model that are correct, and it is calculated as follows. Here, true positive (TP) is the number of anomalous targets in transmission lines correctly identified by the model, false positive (FP) is the number of detections that do not correspond to an actual anomalous target, and false negative (FN) is the number of actual anomalous targets missed by the model.

\mathrm{Precision} = \frac{TP}{TP + FP}. \qquad (1)

Recall refers to the proportion of all actual targets correctly identified by the model, and the calculation formula is as follows.

\mathrm{Recall} = \frac{TP}{TP + FN}. \qquad (2)
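As a small worked example of Equations (1) and (2), the sketch below computes Precision and Recall from detection counts. The counts are hypothetical and are not results from the paper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with an empty denominator is 0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed targets.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
print(precision, recall)
```

With these counts, 90 of 100 detections are correct (Precision 0.9) and 90 of 120 actual targets are found (Recall 0.75).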

AP is equal to the area under the Precision–Recall curve, and the closer its value is to 1, the better the model performance. The calculation formula is as follows.

AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\mathrm{Recall}. \qquad (3)

mAP is the average value of AP for multiple categories, which is a commonly used evaluation metric in object detection and intuitively reflects the performance of the current model. The calculation formula is as follows. Here, N represents the number of types of the identified targets.

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}. \qquad (4)
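Equations (3) and (4) can be approximated numerically from sampled points of the Precision–Recall curve. The sketch below is one common way to do this, using all-point interpolation with a monotone precision envelope; the interpolation scheme, the function names, and the sample values are assumptions, since the paper does not specify how the integral was evaluated.

```python
import numpy as np


def average_precision(recall, precision):
    """Approximate the area under the precision-recall curve (Equation (3))."""
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([1.0], np.asarray(precision, dtype=float), [0.0]))
    # Precision envelope: make precision non-increasing as recall grows.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas over the points where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(ap_per_class):
    """Average the per-class AP values (Equation (4))."""
    return float(np.mean(ap_per_class))


# Hypothetical curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
ap = average_precision(recall=[0.5, 1.0], precision=[1.0, 0.5])
m_ap = mean_average_precision([ap, 0.5, 1.0])  # three hypothetical classes
print(ap, m_ap)
```

In the usual convention, mAP@50 counts a detection as correct when its IoU with a ground-truth box is at least 0.5, and mAP@50–95 averages the result over IoU thresholds from 0.50 to 0.95.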

FLOPs reflect the computational complexity of the model. For a single convolutional layer, they are calculated as follows, where H and W are the height and width of the input feature map, C_{in} and C_{out} are the numbers of input and output channels, respectively, and K is the size of the convolution kernel.

\mathrm{FLOPs} = 2 \times H \times W (C_{in} K^{2} + 1) C_{out}. \qquad (5)

Model size is measured by the number of Parameters. For a single convolutional layer (bias excluded), it is calculated as follows.

\mathrm{Parameters} = C_{in} \times K^{2} \times C_{out}. \qquad (6)
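To make Equations (5) and (6) concrete, the sketch below evaluates them for one convolutional layer. The layer dimensions are hypothetical and are chosen only for illustration.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Equation (5): 2 * H * W * (C_in * K^2 + 1) * C_out."""
    return 2 * h * w * (c_in * k * k + 1) * c_out


def conv_params(c_in, c_out, k):
    """Equation (6): C_in * K^2 * C_out (bias not counted)."""
    return c_in * k * k * c_out


# Hypothetical layer: 80x80 feature map, 64 -> 128 channels, 3x3 kernel.
flops = conv_flops(h=80, w=80, c_in=64, c_out=128, k=3)
params = conv_params(c_in=64, c_out=128, k=3)
print(flops / 1e9, "GFLOPs")
print(params, "parameters")
```

For this layer the formulas give about 0.95 GFLOPs and 73,728 parameters; the totals reported for a model are the sums of such terms over all layers.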

3.2. Experimental Results

3.2.1. Ablation Experiment

In the ablation experiment, the original YOLO11n served as the baseline, and each improved module was first evaluated on its own by changing one variable at a time. To quantify the synergy between modules and their incremental contributions, additional stepwise ablation experiments with two-module and three-module combinations were carried out. The results are shown in Table 3. All experiments were conducted on the self-built dataset of anomalous targets on overhead transmission lines, under the same hardware environment and training hyperparameters, and without pre-trained weights, to ensure fair and comparable results. The experiments focus on the edge deployment requirements of UAV inspection: they examine how each module affects model size and detection performance while preserving detection accuracy for the three target types (bird nests, defective insulators, and balloons). As the lightest version of the YOLO11 series, YOLO11n provides a common reference for all improvements. After the C3k2 module in the baseline backbone was replaced with the C3k2-UIB module and the channels were adapted, precision rose slightly to 95.2%, recall rose by 3 percentage points to 96.0%, and mAP remained essentially the same as the baseline, while the number of parameters fell to 1.93 M (a reduction of 25.5%) and the FLOPs fell to 4.6 G (a reduction of 28.1%), the largest lightweight gain among the single-module improvements. 
This module reduces the computational overhead by restructuring the bottleneck convolution, and the dynamic depth convolution mechanism of the UIB module adapts to the feature capture requirements of small and irregular targets in complex backgrounds, which improves parameter utilization and supports the core safety requirement of a low missed-detection rate in inspection scenarios. After the C2PSA module of the baseline was replaced with the MCA (Multi-scale Cross-Axis) attention mechanism, detection accuracy changed only negligibly relative to the baseline, while the number of parameters fell to 2.26 M (a reduction of 12.7%) and the FLOPs fell to 6.1 G. This supports the design of the module, which lowers the computational complexity of attention through parallel axial attention and multi-scale strip convolution while strengthening long-range dependency modeling and multi-scale feature fusion. It improves adaptability to targets that differ widely in size and shape, which the original module handled poorly, and compresses the model further without significant accuracy loss. After the original detection head was replaced with the Detect-MBConv module, the model improved in both size and accuracy: mAP@50 rose to 95.5% and mAP@50–95 rose by 0.9 percentage points to 63.8%, while the number of parameters fell to 2.26 M and the FLOPs fell to 5.1 G (a reduction of 20.3%). 
This module replaces the stacked ordinary convolutions of the original detection head with depthwise separable convolutions to reduce computation, and embeds SE channel attention to strengthen the extraction of key features. This compensates for the limited ability of the original head to capture fine-grained features of small and overlapping targets, and makes Detect-MBConv the only single-module improvement that both raises accuracy and reduces model size. After the static concat operation in the neck of the baseline was replaced with the MFM (Modulation Feature Fusion Module), mAP@50 matched the baseline and mAP@50–95 rose by 1.3 percentage points to 64.2%, the largest gain in this metric among the single modules, while the number of parameters and the FLOPs remained almost the same as the baseline. This indicates that replacing static concatenation with dynamic adaptive weights, which adjust the fusion of deep and shallow features, limits the dilution of key features by redundant background information and improves the model’s ability to identify target boundaries and details at almost no additional cost.
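The saving from depthwise separable convolution can be illustrated with a simple parameter count. The sketch below compares a standard K×K convolution with a depthwise convolution followed by a 1×1 pointwise convolution; the channel counts are hypothetical, bias terms are ignored, and the code does not reproduce the actual Detect-MBConv head.

```python
def standard_conv_params(c_in, c_out, k):
    """Parameters of a standard KxK convolution (bias excluded)."""
    return c_in * k * k * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Parameters of a depthwise KxK conv plus a 1x1 pointwise conv."""
    depthwise = c_in * k * k   # one KxK filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels with a 3x3 kernel.
std = standard_conv_params(64, 64, 3)
dws = depthwise_separable_params(64, 64, 3)
print(std, dws, round(dws / std, 3))
```

For this example the separable form needs roughly one eighth of the parameters of the standard convolution, which is consistent with the ratio 1/C_out + 1/K^2.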

To further examine the contribution and compatibility of each module in combination, two-module ablation experiments were carried out. All two-module combinations that include the C3k2-UIB module achieved a clear reduction in parameters and computation, larger than that of combinations without this module. The combination of C3k2-UIB and MCA reduced the number of parameters to 1.68 M, a 25.7% reduction compared with the MCA module alone, while mAP@50 recovered to 94.8%, compensating for the slight accuracy loss of the C3k2-UIB module alone, and recall rose to 93.5%, 3.2 percentage points higher than with the MCA module alone. This indicates that the two modules work well together in the feature extraction stage of the backbone network: their lightweight gains add up and their accuracy losses offset each other. The combination of C3k2-UIB and the MBConv detection head improved both aspects relative to the single modules, with the number of parameters reduced to 1.62 M, FLOPs reduced to 3.7 G, mAP@50 increased to 95.4%, and mAP@50–95 increased to 63.6%, which further highlights the value of the MBConv module in balancing accuracy and model size. The combination of C3k2-UIB and MFM raised mAP@50–95 from 62.7% for the C3k2-UIB module alone to 64.0%, with almost no increase in parameters or FLOPs, confirming that the MFM improves fine-grained detection performance without adding computational burden. On this basis, three-module ablation experiments further quantified the synergy obtained when more modules are stacked. 
The combination of the three core lightweight modules, C3k2-UIB, MCA, and MBConv, reduced the number of parameters to 1.31 M, a 49.4% reduction compared with the baseline, and the FLOPs to 3.2 G, a 50% reduction. At the same time, mAP@50 recovered to 95.1%, only 0.1 percentage points from the baseline, and the recall of 92.8% was also close to the baseline level. Compared with any two-module combination, this combination further compressed the parameters and computation, and offset the accuracy fluctuations of individual modules through the combined effect of a lighter backbone, optimized attention, and a more efficient detection head, laying the foundation for the final lightweight model. The combination of C3k2-UIB, MBConv, and MFM achieved the best detection accuracy among the combination experiments, with mAP@50 reaching 95.4% and mAP@50–95 reaching 64.5%, both higher than the baseline. Meanwhile, the number of parameters was only 1.58 M, a 39.0% reduction compared with the baseline, and the FLOPs were 3.8 G, a 40.6% reduction. This shows a positive synergy among the three modules across feature extraction, feature fusion, and result output: detection accuracy improves while the model is compressed.

After all the improved modules were integrated into the complete YOLO-PowerLiteV2 model, their effects combined well. In terms of accuracy, the core indicators matched or slightly exceeded the baseline, with a precision of 95.2%, mAP@50 of 95.2%, and mAP@50–95 of 63.0%. Only recall showed a small decline, which remains acceptable for anomalous target detection on transmission lines. In terms of lightweight performance, the number of parameters fell to 0.97 M, a reduction of 62.5% compared with the baseline, and the FLOPs fell to 2.8 G, a reduction of 56.25%, which suits the strict computing power, storage, and power constraints of UAV edge computing devices. Overall, the lightweight gains of the modules accumulated, and the accuracy loss that individual modules might cause was offset by their complementary functions. The final model meets the design goal of combining a very small model size with high detection accuracy. The ablation experiments support the effectiveness of each improved module and indicate that the model is suitable for real-time transmission line inspection on the airborne edge devices of UAVs.

3.2.2. Comparative Experiment

To verify the performance and engineering value of YOLO-PowerLiteV2, the comparative experiment uses the mainstream lightweight YOLO models YOLOv8n, YOLO10n, YOLO12n, and the newly released YOLO2026n, together with the previous YOLO-PowerLite from this research, as comparison objects. The results are shown in Table 4. All models were trained and tested on the same self-built transmission line anomaly dataset, with the same hardware environment and training hyperparameters and without pre-trained weights, to ensure fair and comparable results. The comparison covers two dimensions: detection accuracy and lightweight performance.

In terms of detection accuracy, the mAP@50 of YOLO-PowerLiteV2 reaches 95.2%, which is on the same level as YOLOv8n, YOLO12n, YOLO2026n, and the previous YOLO-PowerLite, and only slightly below YOLO2026n (95.9%) and YOLO12n (95.8%). Its Precision of 95.2% is the highest among all comparison models, indicating a lower false detection rate for the three target types: bird nests, defective insulators, and balloons. Its Recall is 91.0%; although slightly lower than that of some comparison models, it remains within the acceptable range for transmission line inspection and meets the core requirements of anomaly target detection.

In terms of lightweight performance, YOLO-PowerLiteV2 shows clear advantages. It has only 0.97 M parameters, 61.2% fewer than YOLO2026n (2.50 M), which has the fewest parameters among the general-purpose comparison models, and 37.8% fewer than the previous YOLO-PowerLite (1.56 M). Its FLOPs are only 2.8 G, 51.7% lower than YOLO2026n (5.8 G), which has the lowest computational cost among the general-purpose models, and 42.1% lower than YOLO-PowerLite (4.84 G). YOLO-PowerLiteV2 therefore has the fewest parameters and the lowest computational cost of all the comparison models.
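The percentage reductions quoted above follow directly from the reported parameter counts and FLOPs. The short check below recomputes them; only the helper name `relative_reduction` is introduced for illustration, and all numbers are those stated in the text.

```python
def relative_reduction(new, reference):
    """Percentage by which `new` is lower than `reference`."""
    return (reference - new) / reference * 100.0


# (YOLO-PowerLiteV2 value, comparison model value), as reported in the text.
comparisons = {
    "params vs YOLO2026n": (0.97, 2.50),
    "params vs YOLO-PowerLite": (0.97, 1.56),
    "FLOPs vs YOLO2026n": (2.8, 5.8),
    "FLOPs vs YOLO-PowerLite": (2.8, 4.84),
}

reductions = {
    name: round(relative_reduction(new, ref), 1)
    for name, (new, ref) in comparisons.items()
}
for name, value in reductions.items():
    print(f"{name}: {value}%")
```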

In summary, YOLO-PowerLiteV2 maintains detection accuracy on par with mainstream lightweight detection models while substantially reducing both the parameter count and the floating-point operations (FLOPs), combining competitive accuracy with the best lightweight performance among the compared models. It is therefore better suited than the compared models to the computing power, storage, and power constraints of UAV edge computing devices, and is well placed for real-time airborne deployment.

3.2.3. Model Deployment on NVIDIA Jetson Xavier NX

To evaluate the practicality and efficiency of YOLO-PowerLiteV2 in real-world scenarios, especially on resource-constrained edge devices, we use the NVIDIA Jetson Xavier NX as the test platform. Designed for edge AI computing, the Jetson Xavier NX is compact and power-efficient, which makes it well suited to demanding computation on mobile platforms such as unmanned aerial vehicles (UAVs). It is equipped with a 6-core NVIDIA Carmel ARMv8.2 CPU and a 384-core NVIDIA Volta GPU with 48 tensor cores for accelerating deep learning, and delivers up to 21 TOPS of AI computing capability at low power consumption.

Tests on the device were conducted on the validation set of our dataset under a power budget of 20 W, and the results are presented in Table 5. On the edge device, the frame rate of our model is approximately 31% and 6% higher than that of the baseline model and the previous-generation model, respectively. In a complete practical system, however, power consumption cannot be sustained at 20 W because of UAV endurance limitations, and other applications will also consume part of the available computing resources. In addition, the model will not be run only in the native PyTorch format: we adopt TensorRT for deployment and acceleration so that the final model meets real-time requirements.
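Frame rate and per-frame latency are reciprocal quantities: the 59.5 FPS reported for the model on this platform corresponds to about 16.8 ms per frame. The sketch below shows this conversion and one simple way such figures could be measured. It is not the benchmarking procedure used in the paper; `benchmark` and `dummy_infer` are hypothetical names, and the dummy workload only stands in for a real inference call.

```python
import time


def fps_to_latency_ms(fps):
    """Convert frames per second to mean latency per frame in milliseconds."""
    return 1000.0 / fps


def benchmark(infer, inputs, warmup=5):
    """Return (mean latency in ms, FPS) for a callable over a list of inputs.

    Note: when timing a GPU model, the device must be synchronized before
    reading the clock; that step is omitted in this CPU-only sketch.
    """
    for x in inputs[:warmup]:
        infer(x)                      # warm-up runs are not timed
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / len(inputs) * 1000.0
    return latency_ms, 1000.0 / latency_ms


def dummy_infer(x):
    """Stand-in workload; a real test would call the deployed model here."""
    return sum(x)


latency_ms, fps = benchmark(dummy_infer, [list(range(1000))] * 50)
print(round(fps_to_latency_ms(59.5), 1))
```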

3.2.4. Visualization Analysis

Grad-CAM++ [52] generalizes gradient-based visual explanation methods and addresses a shortcoming of Grad-CAM [53]: poor target localization when multiple objects of the same category appear in an image. It is a widely used visualization method that makes the model’s decision process more transparent by highlighting the image regions that contribute most to the prediction. This paper selects transmission line images with complex background interference as test cases. Their backgrounds are complex and varied, including wasteland, grassland, and other facilities, which may affect the performance of the detection model. Based on the outputs of YOLO11n and YOLO-PowerLiteV2, heat maps reflecting the focus of each model were generated, as shown in Figure 8.

The heat maps show that although YOLO11n detects well under normal conditions, its attention spreads to non-target areas when the image contains heavy background interference. In contrast, YOLO-PowerLiteV2 is more robust and concentrates on the actual anomalous target areas even against complex backgrounds.
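For readers unfamiliar with Grad-CAM++, the sketch below shows how a heat map can be computed from the activations of one layer and the gradients of a target score with respect to them. It follows the closed-form weighting that is commonly used in open-source implementations and that assumes an exponential output function. How the target score was defined for the detectors in this paper is not stated, so the inputs here are synthetic and the function name `grad_cam_pp` is hypothetical.

```python
import numpy as np


def grad_cam_pp(activations, gradients):
    """Grad-CAM++ heat map for one layer.

    activations, gradients: arrays of shape (C, H, W); gradients are the
    derivatives of the target score with respect to the activations.
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    g2 = gradients ** 2
    g3 = g2 * gradients
    act_sum = activations.sum(axis=(1, 2), keepdims=True)
    denom = 2.0 * g2 + act_sum * g3
    # Pixel-wise coefficients; only positive gradients contribute to the
    # channel weights, so the coefficients are evaluated only there.
    alpha = np.zeros_like(gradients)
    mask = (gradients > 0) & (denom != 0)
    np.divide(g2, denom, out=alpha, where=mask)
    # Channel weights: alpha-weighted sum of the positive gradients.
    weights = (alpha * np.maximum(gradients, 0.0)).sum(axis=(1, 2))
    # Weighted sum of activation maps, followed by ReLU.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1] for display
    return cam


rng = np.random.default_rng(0)
acts = rng.random((4, 8, 8))          # synthetic non-negative activations
grads = rng.normal(size=(4, 8, 8))    # synthetic gradients
cam = grad_cam_pp(acts, grads)
print(cam.shape)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image to produce heat maps of the kind shown in the figures.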

Furthermore, this paper visually compares YOLO-PowerLiteV2 with three other lightweight models (YOLO-PowerLite, YOLO2026n, and YOLO11n) on the transmission line foreign object detection task. The detection results are shown in Figure 9.

4. Discussion

The core bottleneck in deploying UAV-based intelligent inspection of power transmission lines is the difficulty of achieving both high-precision anomaly detection and real-time inference under the limited computing power, storage, and power budget of on-board edge devices. Existing general-purpose lightweight detectors are mostly designed for open scenarios and are not tailored to transmission line inspection, with its complex backgrounds, high proportion of small targets, and large differences in target shape, so they often struggle to balance model size and accuracy. To address dense small objects, large scale variations, and complex backgrounds in remote sensing images, Zhao et al. [54] proposed the YOLO-FSD algorithm, which introduces the Swin-CSP module and a lightweight decoupled head (DWC-head) and achieves significant accuracy improvements on the VisDrone and DOTA datasets, demonstrating the value of jointly optimizing the network structure and the detection head for remote sensing object detection. Xu et al. proposed HLSC-SSGF [55], which enhances low-contrast anomalies via local sub-block contrast (LSRMC) and suppresses background edges via spatial-spectral gradient features (SSGF), achieving >0.996 accuracy on four datasets. DFBSNet [56] combines multi-scale atrous pooling decomposition with dual frequency-domain branch fusion and selection, jointly optimizing background reconstruction and anomaly saliency for hyperspectral anomaly detection. Built on the YOLO11n baseline, the YOLO-PowerLiteV2 model in this study applies full-link lightweight optimization for three typical anomalous targets on transmission lines: bird nests, defective insulators, and balloons. 
Its ablation and comparative experiments support the effectiveness and engineering value of this customized optimization strategy.

Through single-variable control, the ablation experiments reveal the core logic of lightweight model design for industrial scenarios: a lightweight model is obtained not simply by cutting the number of parameters, but by allocating computing resources precisely and maximizing feature extraction efficiency according to the requirements of the scenario. Among the single-module improvements, the C3k2-UIB module yields the largest lightweight gain. It reduces the number of parameters and the computational cost by 25.5% and 28.1%, respectively, and increases Recall by 3 percentage points, which matches the core safety requirement of “low missed detection” in transmission line inspection. The module removes redundant computation by restructuring the bottleneck convolution, and the dynamic depth convolution mechanism of the UIB module adapts to the feature capture requirements of small and irregular targets in complex backgrounds, improving parameter utilization. The MBConv detection head is the only module that improves accuracy and reduces model size at the same time. It lowers computation through depthwise separable convolution and embeds the SE channel attention mechanism to strengthen target-related features, compensating for the weakness of the original detection head in capturing fine-grained features of overlapping targets and small defects. This matches the requirements of insulator defect detection and occluded foreign object recognition. 
The MFM replaces static concat splicing with dynamic weighted fusion at almost no additional computational cost, which mitigates the dilution of shallow detail features and improves the accuracy of target boundary localization. When all modules are integrated, the improvements reinforce one another: the number of parameters and the computational cost are reduced by 62.5% and 56.25%, respectively, compared with the baseline, while mAP@50 equals that of the baseline. This balance between model size and detection accuracy supports the full-link optimization strategy.

The comparison with mainstream lightweight YOLO models and the previous YOLO-PowerLite further highlights the engineering advantages of YOLO-PowerLiteV2. In terms of detection accuracy, its mAP@50 of 95.2% is on the same level as general-purpose models such as YOLOv8n, YOLO12n, and YOLO2026n. Its Precision of 95.2% is the highest among all comparison models, which means a lower false detection rate and therefore fewer invalid alarms and less manual review during inspection. In terms of lightweight performance, the model has only 0.97 M parameters, 38.8% of the parameter count of YOLO2026n, the smallest of the general-purpose comparison models, and its FLOPs are only 2.8 G, 51.7% lower than YOLO2026n, making it the smallest of all comparison models. The main reason for this advantage is that general-purpose lightweight models must handle dozens of target categories and therefore retain many general-purpose feature extraction units that are redundant for this task, whereas the model in this study is customized for three specific transmission line targets, so redundant structures can be removed while target-related feature extraction is strengthened, giving higher parameter utilization. 
Compared with the previous YOLO-PowerLite model, the model in this study reduces the number of parameters and the computational cost by 37.8% and 42.1%, respectively, while adding balloons as a detection category. This further gain in lightweight performance makes the model better suited to the deployment constraints of UAV edge devices: lower computational cost corresponds to lower processor power consumption, which can extend the battery life of a UAV on a single inspection flight and expand inspection coverage. Furthermore, the Hy-Tracker framework proposed by Islam et al. [57] introduces YOLOv7 into hyperspectral video object tracking, integrating hierarchical attention band selection (HAS-BS) with a GRU-based temporal network to achieve robust tracking under occlusions and scale variations, and offers a new paradigm for extending YOLO detectors to multispectral and hyperspectral remote sensing tracking tasks. HCSMP [58] achieves spectral complementarity via a cross-modal prompt network and uses a memory prompt network for temporal modeling from historical frames, enabling robust hyperspectral tracking under scale variation and occlusion.

This study still has limitations. First, the Recall of the model is 91.0%, slightly lower than that of some mainstream models, and the risk of missed detection under strong occlusion and for extremely small targets needs to be reduced further. Second, although the dataset covers three core targets, samples from harsh conditions such as rain and fog, strong backlight, and severe occlusion are still insufficient, and the generalization ability of the model in extreme scenarios needs further verification. Third, the edge evaluation was limited to tests on the NVIDIA Jetson Xavier NX under a 20 W power budget; deployment in a complete on-board system has not been completed, and engineering indicators such as sustained frame rate, memory usage, and power consumption still need to be measured and optimized. In addition, knowledge distillation, which has proven effective in lightweight remote sensing models [59], should be introduced in future work to improve the recall of small and occluded targets without increasing the inference complexity of the model.

5. Conclusions

Aiming at the core problem of balancing detection accuracy and real-time inference efficiency on resource-constrained edge devices in UAV-based intelligent inspection of transmission lines, this paper proposes YOLO-PowerLiteV2, a lightweight model built on YOLO11n with four core improvements: C3k2-UIB, MCA, MFM, and MBConv modules for full-link lightweight optimization.

Experiments are conducted on a self-made dataset of 5563 aerial images covering three typical anomalous targets—bird nests, defective insulators, and balloons. YOLO-PowerLiteV2 achieves 95.2% mAP@50 with only 0.97 M parameters and 2.8 G FLOPs, reducing parameters by 62.5% and FLOPs by 56.25% compared with the baseline YOLO11n, while maintaining accuracy comparable to mainstream lightweight detectors. On the NVIDIA Jetson Xavier NX edge platform, it attains 59.5 FPS (16.8 ms latency), representing improvements of 31% and 6% over the baseline and YOLO-PowerLite, respectively.

The proposed model adapts well to the deployment constraints of UAV on-board edge devices, meets the engineering requirements of real-time intelligent inspection of transmission lines, and provides a design reference for customized lightweight detection models in industrial scenarios.

Based on the remaining limitations discussed, future research will focus on four directions: first, introducing knowledge distillation and model quantization to improve recall for small and occluded targets while maintaining lightweight efficiency; second, expanding dataset scene richness and optimizing data augmentation to enhance robustness in extreme environments; third, completing edge platform deployment and building an end-to-end inspection solution integrated with the UAV flight control system; fourth, expanding detection categories to cover more line anomaly types and further improve engineering applicability.

Author Contributions

Conceptualization, Y.C., S.Z. and S.W.; methodology, Y.C.; software, Y.C.; validation, Y.C., S.W. and Z.L.; formal analysis, Y.C.; investigation, Y.C.; resources, S.Z.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, S.W. and S.Z.; visualization, Y.C.; supervision, S.W. and S.Z.; project administration, S.Z. and S.W.; funding acquisition, S.Z. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Beijing Academy of Science and Technology Innovation Engineering Project (26CA011-01), the National Natural Science Foundation of China (72174031), the National Key R&D Program of China (2021YFE0194700, 2021YFB2600101), and the Scientific Research Plan of the Beijing Municipal Education Commission (KM202010016010).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors would like to thank the editors and reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

UAVs	Unmanned Aerial Vehicles
SIFT	Scale-Invariant Feature Transform
HOG	Histogram of Oriented Gradients
SVMs	Support Vector Machines
CNNs	Convolutional Neural Networks
R-CNN	Region-based Convolutional Neural Network
SSD	Single Shot MultiBox Detector
YOLO	You Only Look Once
CBAM	Convolutional Block Attention Module
BiFPN	Bidirectional Feature Pyramid Network
CA	Coordinate Attention
AKConv	Variable kernel convolution
UIB	Universal Inverted Bottleneck
MCA	Multi-scale Cross-Axis
MFM	Modulation Feature Fusion Module
MBConv	Mobile Bottleneck Convolution
SE	Squeeze and Excitation
IB	Inverted Bottleneck
DW	Depthwise
NAS	Neural Architecture Search
CSP	Cross Stage Partial
PSA	Position-Sensitive Attention
CPLID	Chinese Power Line Insulator Dataset
TP	True Positive
FP	False Positive
FN	False Negative
AP	Average Precision
mAP	Mean Average Precision
FLOPs	Floating Point Operations
Grad-CAM	Gradient-weighted Class Activation Mapping
Grad-CAM++	Gradient-weighted Class Activation Mapping++

References

1. Nguyen, V.N.; Jenssen, R.; Roverso, D. Automatic autonomous vision-based power line inspection: A review of current status and the potential role of deep learning. Int. J. Electr. Power Energy Syst. 2018, 99, 107–120.
2. Liu, X.; Miao, X.; Jiang, H.; Chen, J. Data analysis in visual power line inspection: An in-depth review of deep learning for component detection and fault diagnosis. Annu. Rev. Control 2020, 50, 253–277.
3. Yuan, J.; Zheng, X.; Peng, L.; Qu, K.; Luo, H.; Wei, L.; Jin, J.; Tan, F. Identification method of typical defects in transmission lines based on YOLOv5 object detection algorithm. Energy Rep. 2023, 9, 323–332.
4. Ahmed, M.D.F.; Mohanta, J.C.; Sanyal, A.; Yadav, P.S. Path planning of unmanned aerial systems for visual inspection of power transmission lines and towers. IETE J. Res. 2024, 70, 3259–3279.
5. Liu, K.P.; Li, B.Q.; Qin, L.; Li, Q.; Zhao, F.; Wang, Q.L.; Xu, Z.P.; Yu, J.Y. Review of application research of deep learning object detection algorithms in insulator defect detection of overhead transmission lines. High Volt. Eng. 2023, 49, 3584–3595.
6. Chen, C.; Zheng, Z.; Xu, T.; Guo, S.; Feng, S.; Yao, W.; Lan, Y. YOLO-Based UAV Technology: A Review of the Research and Its Applications. Drones 2023, 7, 190.
7. Liu, Z.; Miao, X.; Chen, J.; Jiang, H. Review of visible image intelligent processing for transmission line inspection. Power Syst. Technol. 2020, 44, 1058–1069.
8. Liu, C.; Liu, J.; Wu, Y.; Sun, Z. Application of enhanced YOLOv8 in multi-object detection for autonomous inspection of transmission lines. Eng. Res. Express 2025, 7, 045240.
9. Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-Based Lightweight YOLO Network for UAV Small Object Detection. Remote Sens. 2023, 15, 4932.
10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 21–37.
12. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Springer: Cham, Switzerland, 2020; pp. 213–229.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016; pp. 779–788.
14. Li, H.; Dong, Y.; Liu, Y.; Ai, J. Design and Implementation of UAVs for Bird’s Nest Inspection on Transmission Lines Based on Deep Learning. Drones 2022, 6, 252.
15. Liu, C.; Liu, J.; Wu, Y.; Sun, Z. RD-YOLO: Towards Rust Defect Detection for Future Unmanned Transmission Lines Maintenance. IEICE Trans. Inf. Syst. 2025, E108.D, 1348–1358.
16. Liu, C.; Wei, S.; Zhong, S.; Yu, F. YOLO-PowerLite: A Lightweight YOLO Model for Transmission Line Abnormal Target Detection. IEEE Access 2024, 12, 105004–105015.
17. Lowe, D.G. Object Recognition from Local Scale-Invariant Features. In Proceedings of the Seventh IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 1999; pp. 1150–1157.
18. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05); IEEE: Piscataway, NJ, USA, 2005; pp. 886–893.
19. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support Vector Machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28.
20. Wu, X.; Sahoo, D.; Hoi, S.C.H. Recent Advances in Deep Learning for Object Detection. Neurocomputing 2020, 396, 39–64.
21. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep Learning-Based Object Detection Techniques for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 2385.
22. Lei, X.; Sui, Z. Intelligent fault detection of high voltage line based on the faster R-CNN. Measurement 2019, 138, 379–385.
23. Dai, G.; Yang, R.; Deng, Z.; Lan, R.; Zhao, F.; Xie, G.; You, K. L-FPN R-CNN: An Accurate Detector for Detecting Bird Nests in Aerial Power Tower Pictures. In Artificial Intelligence and Robotics. ISAIR 2022; Communications in Computer and Information Science; Springer: Singapore, 2022; pp. 374–387.
24. Li, F.; Xin, J.; Chen, T.; Xin, L.; Wei, Z.; Li, Y.; Zhang, Y.; Jin, H.; Tu, Y.; Zhou, X.; et al. An automatic detection method of bird’s nest on transmission line tower based on Faster R-CNN. IEEE Access 2020, 8, 164214–164221.
25. Zhang, H.; Wu, L.; Chen, Y.; Chen, R.; Kong, S.; Wang, Y.; Hu, J.; Wu, J. Attention-guided multitask convolutional neural network for power line parts detection. IEEE Trans. Instrum. Meas. 2022, 71, 5008213.
26. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
27. Yang, L.; Fan, J.; Song, S.; Liu, Y. A light defect detection algorithm of power insulators from aerial images for power inspection. Neural Comput. Appl. 2022, 34, 17951–17961.
28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
29. Jiang, H.; Hu, F.; Fu, X.; Chen, C.; Wang, C.; Tian, L.; Shi, Y. YOLOv8-Peas: A Lightweight Drought Tolerance Method for Peas Based on Seed Germination Vigor. Front. Plant Sci. 2023, 14, 1257947.
30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 4510–4520.
31. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 1580–1589.
32. Li, H.; Liu, L.; Du, J.; Jiang, F.; Guo, F.; Hu, Q.; Fan, L. An Improved YOLOv3 for Foreign Objects Detection of Transmission Lines. IEEE Access 2022, 10, 45620–45628.
33. Huang, S.; Dong, X.; Wang, Y.; Yang, L. Detection of insulator burst position of lightweight YOLOv5. In ICCAI ’22: Proceedings of the 8th International Conference on Computing and Artificial Intelligence; ACM: New York, NY, USA, 2022; pp. 573–578.
34. Chen, Y.; Liu, H.; Chen, J.; Hu, J.; Zheng, E. Insu-YOLO: An Insulator Defect Detection Algorithm Based on Multiscale Feature Fusion. Electronics 2023, 12, 3210.
35. Zhang, L.; Li, B.; Cui, Y.; Lai, Y.; Gao, J. Research on improved YOLOv8 algorithm for insulator defect detection. J. Real-Time Image Process. 2024, 21, 22.
36. Liu, D.; Zhang, J.; Qi, Y.; Xi, Y.; Jin, J. Exploring Lightweight Structures for Tiny Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5623215.
37. Wei, S.; Cai, Y.; Dong, K.; Liu, C.; Yu, F.; Zhong, S. Towards Autonomous Powerline Inspection: A Real-Time UAV-Edge Computing Framework for Early Identification of Fire-Related Hazards. Drones 2026, 10, 183.
38. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2021; pp. 13708–13717.
39. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear Deformable Convolution for Improving Convolutional Neural Networks. Image Vis. Comput. 2024, 149, 105190.
40. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2020; pp. 10778–10787.
41. Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://platform.ultralytics.com/ultralytics/yolo11 (accessed on 28 March 2026).
42. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal Models for the Mobile Ecosystem. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2024; pp. 78–96.
43. Shao, H.; Zeng, Q.; Hou, Q.; Yang, J. MCANet: Medical image segmentation with multi-scale cross-axis attention. J. Mach. Intell. Robot. Control 2025, 22, 437–451.
44. Zhang, Y.; Zhou, S.; Li, H. Depth information assisted collaborative mutual promotion network for single image dehazing. arXiv 2024, arXiv:2403.01105.
45. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946.
46. Li, J.; Yan, D.; Luan, K.; Li, Z.; Liang, H. Deep Learning-Based Bird’s Nest Detection on Transmission Lines Using UAV Imagery. Appl. Sci. 2020, 10, 6147.
47. Tao, X.; Zhang, D.; Wang, Z.; Liu, X.; Zhang, H.; Xu, D. Detection of power line insulator defects using aerial images analyzed with convolutional neural networks. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 1486–1498.
48. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://platform.ultralytics.com/ultralytics/yolov8 (accessed on 28 March 2026).
49. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
50. Tian, Y.; Ye, Q.; Doermann, D. YOLO12: Attention-centric real time object detectors. arXiv 2025, arXiv:2502.12524.
51. Jocher, G.; Qiu, J. Ultralytics YOLO26. 2026. Available online: https://www.ultralytics.com/yolo/yolo26 (accessed on 28 March 2026).
52. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2018; pp. 839–847.
53. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2017; pp. 618–626.
54. Zhao, H.; Chu, K.; Zhang, J.; Feng, C. YOLO-FSD: An Improved Target Detection Algorithm on Remote-Sensing Images. IEEE Sens. J. 2023, 23, 30751–30764.
55. Zhao, D.; Xu, X.; You, M.; Arun, P.V.; Zhao, Z.; Ren, J.; Wu, L.; Zhou, H. Local Sub-Block Contrast and Spatial–Spectral Gradient Feature Fusion for Hyperspectral Anomaly Detection. Remote Sens. 2025, 17, 695.
56. Yao, Y.; Wang, Q.; Zhao, D.; You, M.; Xiang, P.; Asano, Y.; Yu, X.; Wang, C.; Zhou, H.; Ren, J. DFBSNet: Dual Frequency-Domain Branch Fusion and Selection Network for Hyperspectral Anomaly Detection. Pattern Recognit. 2026, 180, 113967.
57. Islam, M.A.; Xing, W.; Zhou, J.; Gao, Y.; Paliwal, K.K. Hy-Tracker: A Novel Framework for Enhancing Efficiency and Accuracy of Object Tracking in Hyperspectral Videos. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5521514.
58. Jiang, W.; Zhao, D.; Wang, C.; Yu, X.; Arun, P.V.; Asano, Y.; Xiang, P.; Zhou, H. Hyperspectral Video Object Tracking with Cross-Modal Spectral Complementary and Memory Prompt Network. Knowl.-Based Syst. 2025, 330, 114595.
59. Liu, D.; Zhang, J.; Liang, X.; Qi, Y.; Song, Y.; Xi, Y.; Jin, J. RS-LLIC: A Lightweight Learned Image Compression Model With Knowledge Distillation for Onboard Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2026, 64, 5610213.

Figure 1. Structure diagram of YOLO-PowerLite V2.

Figure 2. Structure diagram of C3k2-UIB: (left) C3k2-UIB module; (right) UIB module.

Figure 3. Optional variants of the UIB block.

Figure 4. Structure of the MCA module.

Figure 5. Structure diagram of detection heads: (top) Detect_MBConv; (bottom) YOLO11 Detect.

Figure 6. Structure of the MFM.

Figure 7. Dataset information.

Figure 8. Grad-CAM++ visualization results.

Figure 9. The detection results of four lightweight algorithms in abnormal target detection of transmission lines.

Table 1. Model hyperparameter settings.

Parameters	Setup
Epoch	200
Batch size	16
Image Size	640 × 640
Initial Learning Rate	1 × 10⁻²
Final Learning Rate	1 × 10⁻²
Momentum	0.937
Weight Decay	5 × 10⁻⁴
Optimizer	Auto
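
For readers reproducing the setup, the settings in Table 1 can be written as a keyword dictionary. The sketch below assumes the Ultralytics training interface, whose argument names include `epochs`, `batch`, `imgsz`, `lr0`, `lrf`, `momentum`, `weight_decay`, and `optimizer`; the mapping of table rows to these names is an assumption, not something the table states. In that interface `lrf` is a fraction of `lr0`, so if the "Final Learning Rate" row holds the `lrf` value, the learning rate reached at the end of the schedule would be `lr0 * lrf`.

```python
# Table 1 expressed as Ultralytics-style training keyword arguments.
# Assumption: each table row maps to the argument with the same meaning.
train_args = {
    "epochs": 200,
    "batch": 16,
    "imgsz": 640,
    "lr0": 1e-2,           # initial learning rate
    "lrf": 1e-2,           # assumed to be the final-LR factor, not an absolute LR
    "momentum": 0.937,
    "weight_decay": 5e-4,
    "optimizer": "auto",
}


def effective_final_lr(args):
    """Learning rate at the end of training when lrf is a fraction of lr0."""
    return args["lr0"] * args["lrf"]


if __name__ == "__main__":
    final_lr = effective_final_lr(train_args)
    print(f"final learning rate if lrf is a factor: {final_lr:.0e}")
```

With Ultralytics installed, the dictionary could be unpacked into a training call such as `YOLO("yolo11n.yaml").train(data="<dataset>.yaml", **train_args)`, where the dataset file name is a placeholder.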

Table 2. Parameters corresponding to different sizes of YOLO11.

Model	Depth	Width	Max. Channels	Parameters (M)	FLOPs (G)
YOLO11n	0.50	0.25	1024	2.62	6.6
YOLO11s	0.50	0.50	1024	9.46	21.7
YOLO11m	0.50	1.00	512	20.11	68.5
YOLO11l	1.00	1.00	512	25.37	87.6
YOLO11x	1.00	1.50	512	56.97	196.0
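
The depth, width, and maximum-channel columns of Table 2 act as multipliers on a shared architecture definition. The sketch below illustrates one way such scaling is commonly applied, assuming the rule used by the Ultralytics model parser: the channel count is capped at the maximum, multiplied by the width factor, and rounded up to a multiple of 8, while the number of block repeats is multiplied by the depth factor. The function names and the base values of 1024 channels and 2 repeats are illustrative.

```python
import math

# (depth, width, max_channels) per variant, copied from Table 2.
SCALES = {
    "n": (0.50, 0.25, 1024),
    "s": (0.50, 0.50, 1024),
    "m": (0.50, 1.00, 512),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.50, 512),
}


def make_divisible(value, divisor=8):
    """Round value up to the nearest multiple of divisor."""
    return int(math.ceil(value / divisor) * divisor)


def scaled_channels(base_channels, width, max_channels):
    """Cap the channel count, apply the width multiplier, align to 8."""
    return make_divisible(min(base_channels, max_channels) * width)


def scaled_repeats(base_repeats, depth):
    """Apply the depth multiplier to a block repeat count (never below 1)."""
    if base_repeats <= 1:
        return base_repeats
    return max(round(base_repeats * depth), 1)


if __name__ == "__main__":
    for name, (depth, width, max_ch) in SCALES.items():
        channels = scaled_channels(1024, width, max_ch)
        repeats = scaled_repeats(2, depth)
        print(f"YOLO11{name}: widest layer {channels} channels, {repeats} repeat(s)")
```

Under this assumed rule the nano variant keeps a quarter of the nominal width, which is why it is the natural starting point for a sub-1 M parameter model.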

Table 3. Detection results after the introduction of different improvement strategies (A = C3k2-UIB; B = MCA; C = Detect-MBConv; D = MFM).

Model	Precision	Recall	mAP@50	mAP@50–95	Param. (M)	FLOPs (G)
YOLO11n	0.950	0.930	0.950	0.629	2.59	6.4
YOLO11n + A	0.952	0.960	0.943	0.627	1.93	4.6
YOLO11n + B	0.944	0.903	0.947	0.626	2.26	6.1
YOLO11n + C	0.951	0.919	0.955	0.638	2.26	5.1
YOLO11n + D	0.927	0.931	0.950	0.642	2.55	6.5
YOLO11n + A + B	0.950	0.935	0.948	0.628	1.68	4.3
YOLO11n + A + C	0.953	0.942	0.954	0.636	1.62	3.7
YOLO11n + A + D	0.949	0.951	0.945	0.640	1.90	4.7
YOLO11n + B + C	0.948	0.912	0.952	0.635	1.98	4.8
YOLO11n + A + B + C	0.951	0.928	0.951	0.633	1.31	3.2
YOLO11n + A + B + D	0.948	0.930	0.949	0.639	1.65	4.4
YOLO11n + A + C + D	0.952	0.933	0.954	0.645	1.58	3.8
Ours (YOLO11n + A + B + C + D)	0.952	0.910	0.952	0.630	0.97	2.8
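
The reductions quoted in the abstract (62.5% fewer parameters, 56.25% fewer FLOPs) follow directly from the first and last rows of Table 3. The short sketch below reproduces that arithmetic; the function and variable names are illustrative.

```python
# Baseline and final-model costs taken from the first and last rows of Table 3.
BASELINE = {"params_m": 2.59, "flops_g": 6.4}
OURS = {"params_m": 0.97, "flops_g": 2.8}


def reduction_percent(baseline, improved):
    """Relative reduction of a cost metric, in percent of the baseline."""
    return (baseline - improved) / baseline * 100.0


if __name__ == "__main__":
    params_cut = reduction_percent(BASELINE["params_m"], OURS["params_m"])
    flops_cut = reduction_percent(BASELINE["flops_g"], OURS["flops_g"])
    print(f"parameter reduction: {params_cut:.1f}%")
    print(f"FLOPs reduction: {flops_cut:.2f}%")
```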

Table 4. Results of each indicator for different models.

Model	Precision	Recall	mAP@50	mAP@50–95	Parameters (M)	FLOPs (G)
YOLOv8n [48]	0.938	0.925	0.955	0.648	2.70	6.9
YOLOv10n [49]	0.897	0.922	0.943	0.637	2.71	8.4
YOLO12n [50]	0.945	0.938	0.958	0.650	2.57	6.5
YOLO26n [51]	0.942	0.942	0.959	0.667	2.50	5.8
YOLO-PowerLite [16]	0.934	0.927	0.952	0.646	1.56	4.84
Ours	0.952	0.910	0.952	0.630	0.97	2.8

Table 5. Results of each indicator for different models on the NVIDIA Jetson Xavier NX using PyTorch.

Model	Precision	Recall	mAP@50	mAP@50–95	Latency (ms)	FPS
YOLOv8n [48]	0.938	0.925	0.955	0.646	19.5	51.3
YOLO11n [41]	0.948	0.927	0.956	0.640	22.0	45.4
YOLO-PowerLite [16]	0.934	0.928	0.952	0.644	17.9	55.8
Ours	0.954	0.912	0.955	0.623	16.8	59.5
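
The two speed columns in Table 5 are consistent with frame rate being the reciprocal of per-image latency, which the sketch below assumes (single-image inference, no batching). It recomputes FPS from the latency column and the frame-rate gain over the YOLO11n baseline; the names are illustrative, and small differences from the table come from rounding.

```python
# Latency (ms) and reported FPS per model, copied from Table 5.
LATENCY_MS = {"YOLOv8n": 19.5, "YOLO11n": 22.0, "YOLO-PowerLite": 17.9, "Ours": 16.8}
REPORTED_FPS = {"YOLOv8n": 51.3, "YOLO11n": 45.4, "YOLO-PowerLite": 55.8, "Ours": 59.5}


def fps_from_latency(latency_ms):
    """Frames per second, assuming one image is processed per latency interval."""
    return 1000.0 / latency_ms


def relative_gain_percent(new_value, base_value):
    """Relative improvement of new_value over base_value, in percent."""
    return (new_value / base_value - 1.0) * 100.0


if __name__ == "__main__":
    for model, latency in LATENCY_MS.items():
        print(f"{model}: {fps_from_latency(latency):.1f} FPS from {latency} ms")
    gain = relative_gain_percent(REPORTED_FPS["Ours"], REPORTED_FPS["YOLO11n"])
    print(f"frame-rate gain over YOLO11n: {gain:.0f}%")
```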

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wei, S.; Cai, Y.; Zhong, S.; Lv, Z. YOLO-PowerLite V2: An Enhanced Lightweight Detector for Real-Time Tiny Anomaly Identification on Overhead Transmission Lines in Complex Environments. Remote Sens. 2026, 18, 1937. https://doi.org/10.3390/rs18121937
