An Enhanced YOLOv8-Based Approach for Foreign Object Detection on Transmission Lines

Gou, Ming; Xu, Weizhong; Liu, Chunyu; Zhang, Liguang; Tang, Hao; Liu, Jiwu; Fu, WenLong

doi:10.3390/a19040264


by Ming Gou ¹, Weizhong Xu ¹, Chunyu Liu ², Liguang Zhang ¹, Hao Tang ¹,*, Jiwu Liu ³,* and WenLong Fu ⁴,*

¹ Yichang Electric Power Survey and Design Institute Co., Ltd., Yichang 443000, China
² Wuhan Huayuan Electric Power Design Institute Co., Ltd., Wuhan 430056, China
³ State Grid Hubei Electric Power Transmission & Transformation Engineering Co., Ltd., Wuhan 430063, China
⁴ College of Electrical Engineering and New Energy, China Three Gorges University, Yichang 443002, China
* Authors to whom correspondence should be addressed.

Algorithms 2026, 19(4), 264; https://doi.org/10.3390/a19040264

Submission received: 25 February 2026 / Revised: 24 March 2026 / Accepted: 27 March 2026 / Published: 1 April 2026


Abstract

To overcome the limitations of existing transmission-line inspection models, including reduced detection precision in complex environments, inadequate performance for small objects and multi-scale targets, and high model complexity, a novel foreign object detection method for transmission lines is proposed in this study, based on an enhanced YOLOv8 architecture. First, the original YOLOv8 backbone is substituted with EfficientNetV2 to achieve model lightweighting while improving detection performance. Second, a Slim-neck module is integrated into the YOLOv8 neck to promote cross-layer information propagation and improve feature perception, which in turn boosts the detection performance on small objects. Meanwhile, an efficient multi-scale attention (EMA) is incorporated to boost multi-scale target detection performance, reduce computational overhead, and strengthen feature representation robustness. Finally, the localization performance of predicted targets is further improved by adopting MPDIoU rather than the original loss function. The experimental results indicate that the proposed method attains 97.7% precision, 95.6% recall, and a 97.5% mAP50, outperforming mainstream detection algorithms in comparative evaluations. Furthermore, relative to the baseline model, the Params and GFLOPs are reduced by 32.1% and 31.6%, respectively, thereby achieving a lightweight design and demonstrating its suitability for transmission-line foreign object detection.

Keywords: transmission line; foreign objects; YOLOv8; EfficientNetV2; Slim-neck; efficient multi-scale attention; MPDIoU; visualization

1. Introduction

Driven by rapid socio-economic growth and the sustained development of power grids, transmission lines have become indispensable elements of modern power systems. These lines often traverse complex geographical environments and are exposed to natural climatic conditions over extended periods, making them highly susceptible to interference from external foreign objects [1]. Common foreign objects include lightweight wind-blown debris such as plastic films, dust-proof nets, and kite strings entangled with conductors under strong wind conditions, foreign objects thrown due to mis-operation of construction machinery, and vegetation like trees and bamboo falling from extreme weather events [2]. In residential areas, entangled kites, balloons, other easily wind-blown household waste, and nests are the most prevalent occurrences. According to statistical analyses, foreign object debris (FOD) on transmission lines contributes to more than 30% of tripping incidents in distribution networks, resulting in widespread power interruptions and substantial economic losses, while seriously endangering the reliability and operational security of the power system [3]. Accordingly, timely identification and removal of FOD on transmission lines are critical for ensuring power supply reliability.

In traditional transmission-line inspection practices, manual inspections, helicopter patrols, and fixed-position camera monitoring are the primary approaches. Among these, manual inspection is inefficient, highly hazardous, and constrained by terrain and weather conditions [4,5,6]; helicopter-based inspection is costly and difficult to deploy frequently; and fixed monitoring systems often suffer from visual blind spots. These methods predominantly rely on the empirical judgment of inspection personnel, resulting in high missed detection rates for small or semi-transparent foreign objects, and they are incapable of achieving real-time alerts and precise localization.

With the growing maturity of drone technologies in recent years, inspection systems based on unmanned aerial vehicles (UAVs) have gradually gained wider application. However, their backend analysis still relies heavily on manual visual interpretation of images, resulting in insufficient intelligence. The growing volume of image data brings substantial processing pressure, which has become a key bottleneck, and manual interpretation is no longer sufficient to satisfy the monitoring demands of contemporary smart power systems. The aforementioned challenges can be effectively addressed by deep learning approaches, among which convolutional neural network (CNN)-based object detection plays a central role. In CNN-based object detection, existing methods are commonly grouped into two paradigms, referred to as one-stage and two-stage approaches. Common one-stage object detection frameworks are the single-shot multibox detector (SSD) [7] and YOLO [8,9]. These algorithms accomplish target localization and classification through a single forward pass of the network, eliminating the need for candidate region generation, which yields fast, end-to-end detection. Representative two-stage object detection frameworks are R-CNN and Faster R-CNN [10]. Their detection pipeline first generates region proposals and then performs fine-grained classification and bounding box regression for each region. In recent years, algorithms represented by Faster R-CNN, YOLO, and SSD have achieved breakthrough progress in object detection scenarios. Specifically, Wang et al. [11] utilized a real-world dataset to conduct a comparative analysis of the deformable part model, Faster R-CNN, and SSD methods, indicating the capability of deep learning techniques to support real-time foreign object detection on transmission lines. Satheeswari et al. [12] combined VGG16 and EfficientNetB7 as feature extraction networks and employed an SSD to localize nests; however, because the dataset contained only 500 images, the resulting model exhibited weak generalization capability and robustness.

Although existing detection models have achieved certain application results in object detection, directly applying general object detection models to foreign object detection on transmission lines faces several challenges: (1) inspection images have complex backgrounds, with strong confusion between foreign object targets and foreground objects such as conductors, insulators, and tower materials, as well as backgrounds such as mountains, trees, and buildings, resulting in significant environmental interference; (2) the targets to be detected exhibit diverse scales, ranging from large hanging objects at close distances to small floating objects at far distances, and often display characteristics such as translucency and reflection, leading to weak and incomplete features; (3) positive sample data for foreign objects are scarce, while the background forms of normal line components as negative samples are extremely diverse, resulting in severe class imbalance and long-tail distribution problems. Given the practical requirements of transmission-line inspection tasks and the limitations of existing general detection models, the YOLO series algorithms demonstrate unique advantages due to their exceptional inference speed. Related research on foreign object detection for transmission lines achieves more accurate detection by modifying YOLO detection algorithms and their variants. For instance, Liu et al. [13] developed a multi-level cross-domain detection framework by combining the YOLOv11 architecture with ConvNeXt. They also employed Bayesian optimization for hyperparameter tuning of the model, which increased the convergence speed and ensured high detection precision, though it led to increased structural complexity of the model. Wang et al. [14] proposed a YOLOv8-BiFPN method, which incorporates a weighted bidirectional cross-scale feature fusion structure into the YOLOv8 detection head. Although the approach adapts better to complex environments and diverse target shapes, its detection performance on small-scale targets remains inadequate. Liu et al. [15] developed an enhanced YOLOv8n model that substitutes the conventional strided convolution with a spatial depthwise convolution module, thereby improving recognition efficiency for small and low-resolution targets. The large selective kernel attention mechanism is adopted to improve the feature extraction network, thereby strengthening feature representation. However, when applied to targets in complex environments, the model remains susceptible to missed detections and false positives. Li et al. [16] introduced an enhanced object detection framework, termed KM-YOLO, developed on the basis of the improved YOLOv5s algorithm. By integrating the GC and C3 modules to construct a C3GC attention mechanism and embedding it within the backbone, the model achieves higher detection precision, though its detection speed requires further improvement. Liu et al. [17] incorporated Swin Transformer and CBAM attention modules, along with an additional detection layer, to enhance the extraction of global context and salient visual features, which improves the recognition of tiny defects and distant objects in the scene. Gao et al. [18] introduced an enhanced YOLOv11-SDI foreign object detection framework, integrating a hierarchical spatial-channel dynamic inference (SDI) module and adopting an adaptive feature fusion strategy to strengthen multi-scale recognition capability.

However, despite the significant progress achieved in the aforementioned research, directly applying existing models to UAV-based transmission-line inspection still faces multiple coupled challenges. First, the computational power and endurance constraints of the inspection platform require the model to be extremely lightweight. Second, the vast scale variation of targets, ranging from large, close-range hanging objects to distant, pixel-sized floating debris, demands that the model possess exceptional multi-scale perception capabilities. Third, the background is extremely complex; targets are often highly confused with conductors, insulators, and mountainous backdrops, and frequently exhibit weak features such as semi-transparency and reflection. Existing improvement schemes predominantly focus on enhancing a single performance metric or involve simple module stacking, failing to construct a synergistic solution that simultaneously optimizes accuracy, speed, lightweight design, and robustness from a systems engineering perspective. This paper takes the most lightweight variant, YOLOv8n, as the baseline model and performs a systematic, modular, and synergistic improvement upon it. To address the three aforementioned challenges, we introduce EfficientNetV2, Slim-neck, the efficient multi-scale attention (EMA), and the MPDIoU loss function, aiming to build an integrated lightweight detection model for transmission-line inspection that balances high accuracy and high efficiency.

Building on the above analysis, the specific contributions of this study are as follows:

(1): The original YOLOv8 backbone network is substituted with EfficientNetV2 to better balance detection precision and model efficiency.
(2): Slim-neck is embedded into the YOLOv8 neck to facilitate cross-layer feature interaction and strengthen feature representation, resulting in better performance on small object detection.
(3): By introducing the EMA after the output of the Slim-neck module, the multi-scale object detection capability is enhanced, the computational cost is reduced, and the robustness of feature representation is strengthened.
(4): MPDIoU is adopted in place of the default loss function to further refine localization precision for target regions.

2. YOLOv8 Algorithm

The YOLO framework directly predicts bounding box coordinates and class probabilities through single-stage forward propagation, achieving efficient end-to-end object detection with distinct advantages over other algorithms. It utilizes a convolutional network to extract multi-scale features and employs a grid partitioning strategy to accomplish target localization. Since its initial release in 2015, the YOLO family of algorithms has been extensively used for single-stage object detection [19]. Among them, YOLOv8, introduced by Ultralytics in 2023, offers faster detection speed and higher detection precision compared to earlier versions in the YOLO series such as v3, v5, and v7. YOLOv8 is available in multiple scales, including YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. The differences among these versions primarily lie in model complexity, specifically in terms of network depth and parameter count. As the number of residual modules increases progressively, the models gain stronger feature extraction and fusion capabilities, and detection precision improves accordingly, but at the cost of extended processing time [20]. Among the five versions, YOLOv8n stands out as the most lightweight version with exceptional detection speed, yet its precision is relatively lower. In contrast, the series from YOLOv8s to YOLOv8x shows significant improvements in precision, but the excessive residual structures introduce additional computational burden, leading to prolonged detection cycles. In scenarios with real-time detection requirements, this time delay may pose disadvantages, affecting the response efficiency and practicality of the system. In light of this, this study targets YOLOv8n for specific optimization and proposes a detection algorithm that balances both speed and precision.

YOLOv8 employs a modular design, consisting of backbone network, neck network, and detection head modules.

YOLOv8’s backbone network enhances information processing capability by extracting features from input images and is constructed from convolutional blocks, C2f modules, and an SPPF module. The convolutional block includes standard convolution (SC), batch normalization (BN), and a SiLU activation function [21]. Among them, SC is utilized to capture local image features, thereby improving the capability to handle small target objects; BN improves information processing capability through normalization, strengthening the stability of the network structure; and the SiLU activation function strengthens the system’s generalization performance in complex detection environments through nonlinear operations. In the C2f module, the input feature map first undergoes a 1 × 1 convolution to align the channel dimension and perform preliminary feature projection, aiming to compress computational dimensions. Subsequently, the processed feature map is partitioned into two equal channel-wise branches to construct a multi-path information flow. One part of the split feature map is directly transmitted as a shortcut branch to the end of the module, while the other part serves as the main branch and is fed into several bottleneck units connected in sequence. Each bottleneck unit contains two 1 × 1 convolutional layers and one 3 × 3 depthwise convolutional layer, equipped with residual connections, and the input features of each unit in the bottleneck series are cached. The deep features processed through all bottleneck units are concatenated along the channel dimension with the original shallow features from the shortcut branch and the intermediate layer features cached in the bottleneck series, achieving dense fusion of features from shallow to deep layers. Finally, the concatenated composite feature map undergoes channel integration and dimensionality reduction through a 1 × 1 convolutional layer to produce the output. Furthermore, SPPF employs three MaxPool layers for serial computation, replacing large pooling kernels with a series of small pooling kernels, followed by residual connections and a concatenation operation. Unlike the SPP module, which introduces parallel max-pooling kernels at the network tail, SPPF preserves the ability to capture multi-scale features while lowering computational cost and accelerating training [22].
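The claim that a series of small pooling kernels can replace large ones may be checked on a toy 1-D signal. The sketch below is not the YOLOv8 implementation: it assumes stride-1 max pooling whose windows are clipped at the borders, and a kernel size of 5 (a commonly used SPPF setting that is not stated in the text above). The helper name `max_pool_1d` is illustrative.

```python
def max_pool_1d(signal, kernel):
    """Stride-1 max pooling with 'same' output length; windows are clipped at the borders."""
    half = kernel // 2
    n = len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

# Serial path (SPPF style): the same small kernel is applied three times in a row.
p1 = max_pool_1d(signal, 5)
p2 = max_pool_1d(p1, 5)
p3 = max_pool_1d(p2, 5)
sppf_outputs = [p1, p2, p3]

# Parallel path (SPP style): one pooling operation per kernel size.
spp_outputs = [max_pool_1d(signal, k) for k in (5, 9, 13)]

# Both paths produce identical feature maps under these assumptions.
print(sppf_outputs == spp_outputs)
```

Under these assumptions, two stacked 5-wide pools cover the same window as one 9-wide pool, and three cover a 13-wide window, which is why the serial arrangement keeps the multi-scale receptive fields while reusing intermediate results.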

In YOLOv8, the neck network lies between the backbone and the detection head and adopts a hybrid architecture combining a feature pyramid network (FPN) with a path aggregation network (PANet). Its primary role is to perform multi-scale feature fusion. FPN up-samples deep feature maps and fuses them with shallow feature maps, enabling shallow features to also possess strong semantic information [23]. Building upon FPN, PANet adds a bottom-up pathway to transmit shallow feature information to deep layers, enhancing the perception of location and details in deep features. By combining top-down and bottom-up information processing pathways, feature fusion is conducted for detection targets of different sizes, strengthening detection robustness across different object scales.

The detection head of YOLOv8 adopts a decoupled design and operates under an anchor-free paradigm. The decoupled head design independently handles classification and regression problems, using separate network branches to address each, which improves task-specific learning and mitigates feature interference when detecting different targets [24]. The traditional anchor box mechanism used in the YOLO series is replaced with a method that directly predicts bounding boxes, simplifying the design, reducing redundant computations, and improving performance when handling dense small targets.

3. Enhanced YOLOv8 Algorithm

Aiming at practical foreign object detection on transmission lines, YOLOv8 is improved from multiple aspects to jointly boost precision and reduce complexity, including the adoption of EfficientNetV2, Slim-neck, EMA, and MPDIoU-based loss optimization. Firstly, the backbone network is replaced by EfficientNetV2 to achieve higher precision, accelerate model training, improve detection performance, and make the model lightweight [25]. Secondly, the neck feature fusion network is restructured by introducing Slim-neck to strengthen feature extraction and fusion. Then, the EMA is introduced to enhance the detection performance on multi-scale targets. Furthermore, to ensure higher precision and stability in the bounding box regression task, MPDIoU loss is adopted in place of the original loss function. This loss directly minimizes the Euclidean distances between the corresponding top-left and bottom-right corners of the predicted and ground-truth boxes, and employs a normalized formulation, enabling more comprehensive and efficient optimization of the regression process. Figure 1 illustrates the enhanced YOLOv8 architecture.

3.1. EfficientNetV2

EfficientNetV2 is a significant upgrade by Google Brain in 2021 over EfficientNetV1. It introduces training-aware neural architecture search (NAS) and Fused-MBConv modules, and employs an improved progressive learning method. The synergistic effect of these three enhancements leads to further performance improvement. EfficientNetV1 employs a uniform scaling rule, equally increasing network depth, width, and input image size. However, research has found that uniformly scaling each stage is not an optimal strategy, as different stages contribute unequally to training speed and parameter efficiency, leading to resource wastage after uniform scaling. In contrast, EfficientNetV2 adopts a non-uniform scaling strategy. In the early training stages, it uses small images and weak regularization to enable the model to quickly learn simple features. Subsequently, it progressively increases the image size while simultaneously strengthening the regularization intensity. In the later stages of network training, more network layers are added, and the maximum input image size is constrained. This approach increases the model’s parameters while avoiding memory consumption and speed degradation caused by excessively large images [26]. Such dynamic adjustments effectively accelerate training and reduce precision loss. EfficientNetV2 is illustrated in Figure 2.
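The progressive learning idea described above, in which image size and regularization strength grow together during training, can be sketched as a simple schedule. This is only an illustration: the linear interpolation, the size range (128 to 300 pixels), the dropout range (0.1 to 0.3), and the function name `progressive_schedule` are assumptions chosen for the example, not settings reported in this study.

```python
def progressive_schedule(stage, num_stages, size_range=(128, 300), dropout_range=(0.1, 0.3)):
    """Return (image_size, dropout_rate) for a training stage via linear interpolation.

    All numeric ranges are hypothetical placeholders.
    """
    if num_stages < 2:
        raise ValueError("num_stages must be at least 2")
    t = stage / (num_stages - 1)  # 0.0 at the first stage, 1.0 at the last
    size = int(size_range[0] + t * (size_range[1] - size_range[0]))
    dropout = dropout_range[0] + t * (dropout_range[1] - dropout_range[0])
    return size, dropout


# Early stages: small images and weak regularization; later stages: larger and stronger.
schedule = [progressive_schedule(s, 4) for s in range(4)]
for stage, (size, dropout) in enumerate(schedule):
    print(f"stage {stage}: image size {size}, dropout {dropout:.2f}")
```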

To decrease model complexity while improving detection precision, this study selects EfficientNetV2-B0 as the backbone network, which strikes a good balance between detection performance, computational cost, and feasibility for engineering deployment. Its lightweight design supports accurate feature extraction for foreign objects on transmission lines while maintaining stable and efficient long-term system operation.

3.2. Slim-Neck

Although enhanced detection models improve the precision of foreign object detection on transmission lines, they also impose greater demands on computational resources. Conversely, lightweight architectures built with extensive depthwise separable convolution (DSConv) layers can improve computational speed, but their detection precision fails to meet the required standards. Accordingly, this study incorporates Slim-neck to redesign the YOLOv8 neck feature fusion network, reducing computational burden while enhancing detection precision. Slim-neck integrates GSConv with VoV-GSCSP modules. The GSConv module first applies standard convolutional downsampling to the input, reducing the output channels by half. The result is then processed through DSConv, further halving the output channels. The outputs from both steps are concatenated, followed by a channel shuffle operation [27]. The architecture of GSConv is illustrated in Figure 3. Figure 4b–d show the three design structures for VoV-GSCSP, respectively. Figure 4b is simple and allows faster inference, while Figure 4c,d have a higher feature reuse rate. Specifically, VoV-GSCSP1 employs the most straightforward single-path configuration. The input feature map is processed directly through a GS bottleneck (GSBottleneck) module constructed with GSConv, followed by feature fusion with the original input. VoV-GSCSP2 incorporates a deeper feature reuse mechanism. It adds extra convolutional layers or connections either within or around the GSBottleneck module, forming a more densely interactive feature pathway. Building upon VoV-GSCSP2, VoV-GSCSP3 adopts a multi-path aggregation strategy analogous to residual or dense connections. This constitutes the most complex structure, potentially involving multiple GSBottleneck branches arranged in parallel or series.
In conclusion, VoV-GSCSP, as a one-shot cross-stage aggregator, employs a dual-path structure to process input features: one path processes features through the GSBottleneck module composed of GSConv, while the other path applies simple Conv-based processing or directly retains the input. Finally, features from both paths are fused, effectively reducing computational load and model depth while improving feature utilization and maintaining detection precision.
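The concatenate-then-shuffle step of GSConv can be illustrated without any convolution arithmetic by tracking channel labels only. The sketch below assumes two groups of equal size (one half from the standard convolution, one half from the depthwise convolution applied to it); the function names and the `sc`/`dw` labels are illustrative and do not come from the Slim-neck code base.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: view as (groups, n), transpose, flatten."""
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


def gsconv_channel_layout(out_channels):
    """Label the output channels of a GSConv-style block (labels only, no tensors)."""
    half = out_channels // 2
    dense = [f"sc{i}" for i in range(half)]      # channels from the standard convolution
    depthwise = [f"dw{i}" for i in range(half)]  # depthwise convolution applied to the dense half
    return channel_shuffle(dense + depthwise, groups=2)


layout = gsconv_channel_layout(8)
print(layout)
# ['sc0', 'dw0', 'sc1', 'dw1', 'sc2', 'dw2', 'sc3', 'dw3']
```

After the shuffle, every neighbouring channel pair mixes information from the dense and the depthwise path, which is the purpose of the shuffle operation mentioned above.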

3.3. EMA

EMA is an efficient multi-scale attention mechanism that improves multi-scale feature awareness and strengthens its feature extraction capability through unique structural optimization, while reducing computational overhead. The core design of EMA includes multi-scale feature capture and cross-dimensional interaction. First, the input feature map is partitioned into multiple channel-wise subgroups, each maintaining spatial integrity to ensure feature diversity. This design prevents the information degradation typically caused by dimensionality reduction in conventional channel attention mechanisms, significantly improving feature utilization. Second, the grouped features are fed into two parallel branches. The global information branch extracts long-range dependencies through 1 × 1 convolutions and adaptive pooling to generate spatial attention weights that calibrate channel importance. The local detail branch captures local spatial features via 3 × 3 convolutions to enhance detail perception. The outputs of the two branches interact across dimensions through matrix multiplication, fusing global semantics with local details [28]. Then, pixel-level pairwise relationships between features from different branches are computed using matrix dot products to generate an attention weight map. This step dynamically adjusts feature responses through Softmax normalization and Sigmoid activation, highlighting key regions and achieving cross-dimensional interaction and weight generation. Finally, the original features are reweighted through element-wise multiplication with the attention weights to obtain reconstructed representations. The entire process avoids pooling or dimensionality reduction operations, preserving spatial resolution to the greatest extent. EMA captures both global context and local details simultaneously through its parallel branch structure, effectively enhancing the model’s suitability to scenes with drastic scale variations. 
Compared to traditional attention mechanisms, EMA reduces the number of parameters through channel grouping, avoids information loss from dimensionality compression, and effectively suppresses complex background interference through its cross-dimensional interaction mechanism. Figure 5 illustrates the architecture of EMA.
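The cross-dimensional interaction and reweighting steps can be sketched in plain Python for a single channel group. This is a simplified reading of the mechanism, not the authors' implementation: the 1 × 1 and 3 × 3 convolution branches are omitted and their outputs are passed in as given inputs, each channel descriptor is obtained by global averaging, and the function name `cross_branch_reweight` is hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def cross_branch_reweight(x, branch_a, branch_b):
    """x, branch_a, branch_b: lists of channels, each a flat list of pixel values."""

    def channel_weights(feat):
        # One descriptor per channel (global average), normalized with Softmax.
        return softmax([sum(ch) / len(ch) for ch in feat])

    def project(weights, feat):
        # (1 x C) times (C x N): one scalar per pixel.
        n_pixels = len(feat[0])
        return [sum(w * ch[p] for w, ch in zip(weights, feat)) for p in range(n_pixels)]

    # Each branch's channel descriptor attends over the other branch's features.
    map_ab = project(channel_weights(branch_a), branch_b)
    map_ba = project(channel_weights(branch_b), branch_a)
    attention = [sigmoid(a + b) for a, b in zip(map_ab, map_ba)]
    # Element-wise reweighting of the original features.
    reweighted = [[v * w for v, w in zip(ch, attention)] for ch in x]
    return reweighted, attention


# Two channels, four pixels (a flattened 2 x 2 map); values are arbitrary.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
branch_a = [[0.1, 0.4, 0.2, 0.3], [0.5, 0.1, 0.0, 0.2]]
branch_b = [[0.2, 0.1, 0.9, 0.4], [0.3, 0.8, 0.1, 0.5]]
reweighted, attention = cross_branch_reweight(x, branch_a, branch_b)
print(attention)
```

The output keeps the spatial size and channel count of the input, and each pixel is scaled by a weight in (0, 1) derived from both branches.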

3.4. Loss Function Improvement

The loss function for object detection generally includes two components, namely, classification loss and bounding box regression loss. The classification loss aims to determine the category of the target, addressing class imbalance by using varifocal loss and focusing on high-quality samples. The bounding box regression loss combines CIoU with distribution focal loss (DFL) to predict the target’s location and size. Specifically, CIoU computes the regression penalty by jointly accounting for the overlap of the predicted and ground-truth bounding boxes, the distance between their center points, and the consistency of their aspect ratios. DFL focuses on the distribution near the label values to improve localization precision [29]. The detailed formulation of CIoU is as follows:

\mathrm{CIoU}_{\mathrm{LOSS}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad (1)

where IoU quantifies the overlap between the predicted bounding box and the ground-truth bounding box, and ρ²(b, b^gt) denotes the squared Euclidean distance between the centers of the predicted bounding box and the ground-truth bounding box. c represents the diagonal length of the smallest enclosing rectangle covering both boxes, which normalizes the center distance to make the loss function scale-invariant. α is a dynamic weighting coefficient that balances the aspect ratio loss, and v measures the aspect ratio consistency between the predicted and ground-truth bounding boxes. The calculations for α and v are given below:

\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}, \qquad (2)

v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}, \qquad (3)

where w^gt and h^gt denote the width and height of the ground-truth bounding box, respectively, whereas w and h correspond to the width and height of the predicted bounding box. The CIoU is depicted in Figure 6.
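Equations (1)–(3) can be transcribed into a few lines of Python to make the individual terms concrete. The sketch below is a direct reading of the formulas rather than the training code used in this study; boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and the small constant `eps` is added only to avoid division by zero when the boxes coincide.

```python
import math


def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def ciou_loss(pred, gt, eps=1e-9):
    overlap = iou(pred, gt)
    # rho^2: squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # c^2: squared diagonal of the smallest rectangle enclosing both boxes.
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
       + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    v = 4 / math.pi ** 2 * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2  # Eq. (3)
    alpha = v / ((1 - overlap) + v + eps)                                    # Eq. (2)
    return 1 - overlap + rho2 / c2 + alpha * v                               # Eq. (1)


# Hypothetical boxes: same center and same aspect ratio, but half the size.
gt_box = (10.0, 10.0, 50.0, 30.0)
pred_box = (20.0, 15.0, 40.0, 25.0)
print(ciou_loss(pred_box, gt_box))  # 0.75, i.e. exactly 1 - IoU
```

In this example both the center-distance term and the aspect-ratio term vanish, so the loss reduces to 1 − IoU even though the predicted box is clearly too small.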

In the CIoU loss function, the parameter v, used to measure aspect ratio consistency, is not calculated by directly comparing the absolute differences in width and height values. Instead, it reflects their relative proportional differences by comparing the arctangent of the aspect ratios of the predicted and the ground-truth bounding box. When the predicted and ground-truth bounding boxes differ significantly in width and height but have similar aspect ratios, the value of v remains small. This can cause the model to prioritize similarity in aspect ratio during optimization, leading to larger errors in accurately regressing the actual dimensions [30]. Furthermore, during the early training stages, when the width and height of the predicted bounding boxes are small, the computed gradients can become abnormally large, leading to gradient explosion. The fundamental reason is that the CIoU loss lacks a mechanism to distinguish the difficulty of samples, causing the model to be dominated by a large number of simple samples when calculating the loss. This makes it difficult for the model to learn from complex samples, ultimately limiting the improvement of model performance. Additionally, if the center points of the predicted and ground-truth bounding boxes coincide exactly, the term ρ²(b, b^gt), representing the squared distance between the centers in the CIoU loss, equals zero. This causes the CIoU loss to lose its constraint on the center point distance, thereby affecting the model’s localization precision.

In bounding box regression, MPDIoU is employed to replace CIoU for localization loss computation, as CIoU has limited capability in differentiating predicted boxes that share the same aspect ratio but vary in scale. By directly minimizing the Euclidean distance between the corresponding top-left and bottom-right vertices of the predicted and ground-truth bounding boxes, MPDIoU effectively resolves optimization problems encountered by traditional loss functions in specific scenarios [31]. Meanwhile, by combining its normalized form with IoU to construct the loss function, the bounding box regression process can be optimized more comprehensively and efficiently, improving the precision of target localization and shape characterization. MPDIoU is depicted in Figure 7, and its computation can be expressed as follows:

d_{1}^{2} = (x_{1}^{B} - x_{1}^{A})^{2} + (y_{1}^{B} - y_{1}^{A})^{2}, \qquad (4)

d_{2}^{2} = (x_{2}^{B} - x_{2}^{A})^{2} + (y_{2}^{B} - y_{2}^{A})^{2}, \qquad (5)

\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}}, \qquad (6)

\mathrm{MPDIoU}_{\mathrm{LOSS}} = 1 - \mathrm{IoU} + \frac{d_{1}^{2}}{w^{2} + h^{2}} + \frac{d_{2}^{2}}{w^{2} + h^{2}}. \qquad (7)

Here, A and B correspond to the ground-truth bounding box and the predicted bounding box, respectively. The coordinates of the top-left and bottom-right corners of box A are (x₁^A, y₁^A) and (x₂^A, y₂^A), while those of box B are (x₁^B, y₁^B) and (x₂^B, y₂^B). Moreover, d₁ and d₂ are the distances between the top-left corners and between the bottom-right corners of boxes A and B, respectively. In Equations (6) and (7), w and h denote the width and height of the input image, as in the original MPDIoU formulation [31], rather than the box dimensions used in Equation (3).
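Equations (4)–(7) translate directly into code. The following sketch assumes that w and h in the normalization term are the width and height of the input image, following the original MPDIoU formulation, and it uses a 640 × 640 image for the example; the boxes themselves are hypothetical. It is an illustration of the formula, not the loss implementation used for training.

```python
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def mpdiou_loss(pred, gt, img_w, img_h):
    """MPDIoU loss of Eq. (7); img_w and img_h are assumed to be the input image size."""
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners, Eq. (4)
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners, Eq. (5)
    norm = img_w ** 2 + img_h ** 2
    return 1 - iou(pred, gt) + d1_sq / norm + d2_sq / norm


# Same center and aspect ratio, but the prediction is half the size of the ground truth.
gt_box = (10.0, 10.0, 50.0, 30.0)
pred_box = (20.0, 15.0, 40.0, 25.0)
loss = mpdiou_loss(pred_box, gt_box, img_w=640, img_h=640)
print(loss)  # slightly above 1 - IoU = 0.75 because both corner distances are non-zero
```

For these concentric boxes with identical aspect ratios, the corner-distance terms remain positive, so the loss still penalizes the size mismatch beyond the IoU term.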

4. The Experimental Evaluation and Analysis

4.1. Dataset Introduction

The experimental dataset comprises images of foreign objects on multiple transmission line segments, collected by a power grid company. It encompasses four categories: nests, kites, balloons, and debris. The images were captured under varying angles and lighting conditions, with a resolution of 640 × 640 and saved in JPG format. Specific samples are shown in Figure 8, Figure 9 and Figure 10.

To enhance the model’s generalization capability and mitigate overfitting due to limited data, data augmentation is performed on the collected images, including random rotation, noise addition, brightness adjustment, and other operations, as illustrated in Figure 11. After augmentation, the dataset is expanded to 4516 images. To ensure a comprehensive evaluation of the model’s generalization performance, the dataset is randomly split into training, validation, and test sets with a ratio of 8:1:1. This random partitioning ensures that the data distribution in each subset (training, validation, and test) aligns with the overall dataset distribution, thereby avoiding performance instability caused by data bias. Through random sampling, the data in the training, validation, and test sets are kept independent and non-overlapping, ensuring fairness and independence in model evaluation. Furthermore, this strategy helps each subset represent the diversity and complexity of the entire dataset, making the performance evaluation on each subset more reliable. The categories and quantity statistics of the dataset are presented in Table 1.
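A random 8:1:1 partition of this kind can be sketched as follows. The file names, the fixed seed, and the rounding rule (integer division, with the remainder assigned to the test set) are assumptions made for the example; the exact subset sizes used in this study are those reported in Table 1 and are not reproduced here.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle once and cut into non-overlapping train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


# Hypothetical image identifiers for a 4516-image dataset.
image_ids = [f"img_{i:04d}.jpg" for i in range(4516)]
train, val, test = split_dataset(image_ids)
print(len(train), len(val), len(test))
```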

The dataset was annotated using LabelImg (version windows_v1.8.1). Specifically, the four types of foreign objects were labeled as “nest”, “kite”, “balloon”, and “debris”. After the images were imported into LabelImg, each region containing a foreign object was annotated individually. The annotation process is illustrated in Figure 12.

4.2. Experimental Setup

All experiments were performed on a Windows 11 platform equipped with an NVIDIA GeForce RTX 4060Ti GPU (16 GB). The implementation was based on Python 3.10, PyTorch 2.6.0, and CUDA 12.1. The experimental environment is detailed in Table 2.

The hyperparameters were selected through multiple preliminary trials on the original YOLOv8 model and were then applied uniformly in all experiments, as shown in Table 3. During training, the batch size was set to 4, with an initial learning rate of 0.01 and a final learning rate of 0.001. The momentum parameter was 0.937 and the weight decay was 0.0005. The model was optimized using AdamW for 200 epochs. All images were resized to 640 × 640.
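For reference, the settings above can be collected in a small configuration sketch. The dictionary keys and the `linear_lr` helper are hypothetical names, and the linear decay between the initial and final learning rates is an assumption: the text gives only the two endpoint values, not the shape of the schedule.

```python
HYPERPARAMS = {
    "batch_size": 4,
    "lr_initial": 0.01,
    "lr_final": 0.001,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "optimizer": "AdamW",
    "epochs": 200,
    "image_size": (640, 640),
}


def linear_lr(epoch, hp=HYPERPARAMS):
    """Learning rate at a given epoch under an assumed linear decay."""
    progress = min(max(epoch / hp["epochs"], 0.0), 1.0)
    return hp["lr_initial"] + progress * (hp["lr_final"] - hp["lr_initial"])
```

Under this assumption the learning rate starts at 0.01, passes through 0.0055 at epoch 100, and reaches 0.001 at epoch 200.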

4.3. Experimental Evaluation Indicators

For a quantitative evaluation of model performance in the experiments, the primary evaluation metrics include precision (P), recall (R), and mean average precision (mAP). In addition, model efficiency is evaluated using the number of parameters (Params) and giga floating-point operations (GFLOPs). The definitions of the primary indicators are given below:

P = \frac{TP}{TP + FP},

(8)

R = \frac{TP}{TP + FN},

(9)

AP = \int_{0}^{1} P (R) d R,

(10)

mAP = \frac{1}{N} \sum_{i = 1}^{N} {AP}_{i},

(11)

where P denotes the fraction of predicted positives that are correct, and R denotes the fraction of actual targets in the dataset that are detected. Prediction results are divided into positive and negative samples: TP is the number of samples where the true value is positive and the predicted value is positive; FP is the number of samples where the true value is negative but the predicted value is positive; FN is the number of samples where the true value is positive but the predicted value is negative. AP is defined as the area under the precision–recall curve, whereas mAP is the average of the AP values across all N target classes. The mAP50 is the mAP value at an IoU threshold of 0.5, and mAP50:95 is the mAP averaged over IoU thresholds from 0.5 to 0.95.
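Equations (8) to (11) can be illustrated with a short Python sketch. The integral in Equation (10) is approximated here by all-point interpolation over a monotone precision envelope, which is one common convention; the text does not state which integration scheme was used, so that choice and all function names are assumptions.

```python
def precision(tp, fp):
    """Equation (8): correct detections among all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (9): correct detections among all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Equation (10): area under the P(R) curve, recalls sorted ascending."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (monotone envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall points.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Equation (11): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, a curve with precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0 gives an AP of 0.5 × 1.0 + 0.5 × 0.5 = 0.75 under this scheme.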

4.4. Results and Analysis of the Ablation Experiment

The ablation experiment is conducted to examine the effect of each enhancement to YOLOv8 as well as the combined impact of their integration. The corresponding results are listed in Table 4. All gains reported below are absolute differences (percentage points) relative to the baseline in experiment 1. Experiment 1 uses the original YOLOv8 and provides the reference for subsequent experiments. The improvements are then introduced incrementally: EfficientNetV2, Slim-neck, EMA, and the MPDIoU loss function. In experiment 2, the backbone network of YOLOv8 is replaced by EfficientNetV2, which optimizes computational parameters through NAS and Fused-MBConv modules and increases the training speed. This change improved precision by 0.4% but decreased recall by 0.3% and mAP50 by 0.1%. Building on experiment 2, experiment 3 redesigns the YOLOv8 neck by introducing Slim-neck, which enhances cross-layer information flow and feature perception and enriches the model’s feature representation capability. Precision, recall, and mAP50 each increased by 0.2%. Also building on experiment 2, experiment 4 introduces EMA to improve the model’s ability to detect targets across multiple scales, leading to a 0.6% increase in precision, a 0.3% increase in recall, and a 0.2% increase in mAP50. Experiment 5 combined EfficientNetV2, Slim-neck, and EMA, and all evaluation indicators improved: precision by 0.6%, recall by 0.4%, and mAP50 by 0.4%. Experiment 6 integrated EfficientNetV2, Slim-neck, and the MPDIoU loss function, resulting in a 0.5% increase in precision, a 0.1% increase in recall, and a 1.1% increase in mAP50. Experiment 7 combined all four improvement strategies; compared with the baseline model, precision increased by 0.9%, recall by 0.5%, and mAP50 by 1.4%. These results indicate that the proposed improvements enhance detection performance, with the highest precision, recall, and mAP50 obtained when all four are combined.
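The gains quoted above can be reproduced from Table 4 with a few lines of Python. The values are copied from the table; the dictionary layout and the function name are illustrative.

```python
# (P, R, mAP50) in percent for ablation experiments 1 to 7, from Table 4.
ABLATION = {
    1: (96.8, 95.1, 96.1),
    2: (97.2, 94.8, 96.0),
    3: (97.0, 95.3, 96.3),
    4: (97.4, 95.4, 96.3),
    5: (97.4, 95.5, 96.5),
    6: (97.3, 95.2, 97.2),
    7: (97.7, 95.6, 97.5),
}


def delta_vs_baseline(exp, table=ABLATION, baseline=1):
    """Absolute change (percentage points) of each metric vs. the baseline."""
    return tuple(round(v - b, 1) for v, b in zip(table[exp], table[baseline]))


for exp in range(2, 8):
    print(exp, delta_vs_baseline(exp))
```

Running the loop lists, for each experiment, the change in precision, recall, and mAP50 with respect to experiment 1.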

4.5. Comparative Analysis of Loss Function Strategy

The baseline model is reconfigured by alternately replacing the loss function with CIoU, DIoU, EIoU, GIoU, SIoU, and WIoUv3 to evaluate the performance of the MPDIoU loss through comprehensive comparative experiments. The corresponding results are presented in Table 5.

As indicated in Table 5, substituting CIoU with MPDIoU yields a precision of 97.7%, a recall of 95.6%, and an mAP50 of 97.5%. Relative to the CIoU-based baseline, these results correspond to gains of 0.3% in precision, 0.2% in recall, and 1.0% in mAP50. The comparative results indicate that incorporating the MPDIoU loss leads to improved performance across key evaluation metrics relative to other loss functions, highlighting the superiority of MPDIoU.

4.6. Performance Comparison of Different Detection Models

To comprehensively validate the effectiveness of the enhanced model, comparative experiments are performed under the same experimental settings using different network models, including YOLOv5, YOLOv8, YOLOv10, YOLOv11, YOLOv13, and RT-DETR. The results obtained from different models are presented in Table 6.

As indicated in Table 6, the proposed model exhibits slightly higher GFLOPs than YOLOv5 and a marginally larger parameter count than YOLOv10, whereas it achieves superior performance on all other evaluation indicators relative to other models. The proposed model achieves 97.7% precision, 95.6% recall, and a 97.5% mAP50. Compared with the baseline YOLOv8, precision is improved by 0.9%, recall by 0.5%, and mAP50 by 1.4%, while floating-point operations are reduced by 31.6% and parameters by 32.1%. Compared to RT-DETR, precision is improved by 0.5%, recall by 3.3%, and mAP50 by 3.2%, while floating-point operations and parameters are reduced by 7.6% and 7.7%, respectively. The results of comparative experiments demonstrate that the combined improvement strategies effectively enhance detection precision and reduce computational load.
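The relative reductions with respect to the baseline follow directly from the Table 6 entries, as the short sketch below shows. The helper name is illustrative, and because the table values are rounded to one decimal place, the computed percentages are themselves approximate.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is smaller than `baseline`."""
    return (baseline - proposed) / baseline * 100.0


# YOLOv8 baseline vs. the proposed model, values from Table 6.
gflops_cut = relative_reduction(39.6, 27.1)
params_cut = relative_reduction(98.8, 67.1)

print(f"GFLOPs reduced by {gflops_cut:.1f}%")
print(f"Params reduced by {params_cut:.1f}%")
```

Rounded to one decimal place, these evaluate to 31.6% for GFLOPs and 32.1% for Params, matching the figures quoted for the baseline comparison.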

4.7. Visual Analysis of Results

To qualitatively evaluate the detection performance of the enhanced model, images from the dataset under different environmental conditions are selected, and visual detection results for transmission-line foreign objects are compared among all models. The results are illustrated in Figure 13.

The enhanced model achieves higher detection precision across diverse environments and foreign object categories, as depicted in Figure 13. It not only achieves precise localization and identification of foreign object types but also maintains high detection confidence. Notably, even under low-image-contrast conditions, the proposed model accurately locates and classifies targets, with improved capability for detecting small objects, showcasing strong robustness. The visual comparison suggests that the proposed model demonstrates improved adaptability in addressing complex detection requirements.

A confusion matrix is utilized to assess detection performance across different foreign object categories by comparing the proposed model with the baseline, and the corresponding results are illustrated in Figure 14. The confusion matrix is obtained by comparing the actual labels with the predicted labels, where rows denote the actual classes and columns correspond to the predicted classes. Color intensity encodes the value of each cell, with darker shades of blue representing higher values. Correct predictions lie on the diagonal from the top-left to the bottom-right, where the color is most pronounced, whereas the off-diagonal cells, which correspond to misclassifications, are lighter.
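The row and column convention described above can be made concrete with a minimal Python sketch. The label lists are made-up examples rather than results from this study, and the background class that detection toolkits usually append to the matrix is omitted for brevity.

```python
CLASSES = ["nest", "kite", "balloon", "debris"]


def confusion_matrix(actual, predicted, classes=CLASSES):
    """Rows index the actual class, columns the predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by its row sum (0.0 for an empty row)."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


# Hypothetical labels, for illustration only.
actual = ["nest", "nest", "kite", "balloon", "debris", "debris"]
predicted = ["nest", "kite", "kite", "balloon", "debris", "nest"]
cm = confusion_matrix(actual, predicted)
```

Here one nest is predicted as a kite and one piece of debris as a nest, so those two counts fall off the diagonal and the recall for both classes drops to 0.5.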

As illustrated in Figure 14, the detection performance of the enhanced model for different categories of foreign objects is superior to that of the baseline model. For the four typical transmission-line foreign objects (nests, balloons, debris, and kites), the enhanced model achieves high detection precision, which suggests that it effectively learns the feature representations of these objects and has strong potential for practical engineering applications.

5. Conclusions

To achieve high-precision and lightweight foreign object detection on transmission lines, a multi-strategy enhanced YOLOv8-based model is proposed. First, EfficientNetV2 is employed to substitute the original backbone, thereby jointly improving detection performance and model efficiency. Second, the neck architecture is redesigned with Slim-neck to facilitate feature fusion and strengthen the recognition of small objects. Third, EMA is introduced to strengthen multi-scale target detection performance. Finally, the MPDIoU loss is employed to more accurately characterize the alignment between predicted and ground-truth bounding boxes, leading to improved localization performance. The ablation and comparative experiments show that the enhanced model achieves the best overall detection performance among the compared models. Compared with the baseline, the enhanced model preserves high inference speed while raising precision, recall, and mAP50 to 97.7%, 95.6%, and 97.5%, respectively, and it reduces Params and GFLOPs by 32.1% and 31.6%, demonstrating clear advantages for transmission-line foreign object detection. However, challenges remain in achieving high detection accuracy and avoiding missed detections, especially for small foreign objects and in complex scenarios. Future research will focus on deploying and optimizing the model for UAV and edge computing platforms, including lightweight deployment and inference acceleration, to further validate its engineering application value. Further work will also aim to improve detection accuracy for small targets and in complex environments. In addition, model performance will be assessed more systematically and reliably through multiple independent experiments and statistical analysis.

Author Contributions

Conceptualization, M.G. and L.Z.; methodology, J.L.; software, W.X.; validation, C.L.; writing—original draft preparation, M.G.; writing—review and editing, H.T. and W.F.; visualization, W.X. and H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding authors due to privacy, legal, or ethical reasons.

Conflicts of Interest

Authors Ming Gou, Weizhong Xu, Liguang Zhang, and Hao Tang were employed by the company Yichang Electric Power Survey and Design Institute Co., Ltd., Yichang 443000, China. Author Chunyu Liu was employed by the company Wuhan Huayuan Electric Power Design Institute Co., Ltd., Wuhan 430056, China. Author Jiwu Liu was employed by the company State Grid Hubei Electric Power Transmission & Transformation Engineering Co., Ltd., Wuhan 430063, China. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Figure 1. Framework of enhanced YOLOv8 algorithm.

Figure 2. Overall architecture of EfficientNetV2.

Figure 3. The schematic diagram of GSConv.

Figure 4. The structures of (a) the GS bottleneck module and (b–d) the VoV-GSCSP_1,2,3 modules.

Figure 5. The schematic diagram of EMA.

Figure 6. The schematic diagram of CIoU.

Figure 7. The schematic diagram of MPDIoU.

Figure 8. Sample of the dataset.

Figure 9. Different angles.

Figure 10. Different weather conditions.

Figure 11. Example of data augmentation: (a) original, (b) rotations, (c) variations in brightness, (d) salt-and-pepper noise, (e) occlusions.

Figure 12. Labeling process of the dataset.

Figure 13. Visual comparison of detection performance.

Figure 14. Confusion matrices of foreign object detection results for transmission lines.

Table 1. Dataset information.

Class	Original	After Augmentation
Nest	383	1532
Kite	282	1125
Balloon	232	925
Debris	234	934

Table 2. Experimental environment.

Environment	Version
GPU	NVIDIA GeForce RTX 4060Ti
Python	3.10
PyTorch	2.6.0
CUDA	12.1

Table 3. Hyperparameter configurations.

Parameter	Setup
Batch Size	4
Initial Learning Rate	0.01
Final Learning Rate	0.001
Momentum	0.937
Weight Decay	0.0005
Optimizer	AdamW
Epoch	200
Image Size	640 × 640

Table 4. The experimental results of ablation experiments.

No.	EfficientNetV2	Slim-neck	EMA	MPDIoU	P/%	R/%	mAP50/%
1					96.8	95.1	96.1
2	√				97.2	94.8	96.0
3	√	√			97.0	95.3	96.3
4	√		√		97.4	95.4	96.3
5	√	√	√		97.4	95.5	96.5
6	√	√		√	97.3	95.2	97.2
7	√	√	√	√	97.7	95.6	97.5

Table 5. Comparative results of different loss functions.

Loss Function	P/%	R/%	mAP50/%
CIoU	97.4	95.4	96.5
DIoU	96.9	95.5	96.7
EIoU	97.2	95.1	96.6
GIoU	97.5	95.2	97.0
SIoU	97.4	95.4	97.2
WIoUv3	97.5	95.3	97.3
MPDIoU	97.7	95.6	97.5

Table 6. Evaluation metrics of different models.

Model	P/%	R/%	mAP50/%	GFLOPs	Params/MB
YOLOv5	96.7	94.9	96.6	24.6	80.8
YOLOv8	96.8	95.1	96.1	39.6	98.8
YOLOv10	95.3	91.4	95.4	32.2	63.2
YOLOv11	96.9	92.6	96.8	34.2	76.7
YOLOv13	96.5	95.2	96.7	43.4	105.3
RT-DETR	97.2	92.3	94.3	29.3	72.7
Ours	97.7	95.6	97.5	27.1	67.1

