Article

An Autonomous UAV Power Inspection Framework with Vision-Based Waypoint Generation

1 School of Electronic Engineering, Nanjing Xiaozhuang University, 3601 Hongjing Avenue, Jiangning District, Nanjing 211171, China
2 Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba 263-8522, Japan
3 Fukushima Institute for Research, Education and Innovation (F-REI), 6-1 Yazawa-machi, Gongendo, Namie-Town, Futaba-County, Fukushima 979-1521, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 76; https://doi.org/10.3390/app16010076
Submission received: 18 November 2025 / Revised: 15 December 2025 / Accepted: 17 December 2025 / Published: 21 December 2025

Abstract

With the rapid development of Unmanned Aerial Vehicle (UAV) technology, UAVs play an increasingly important role in electrical power inspection. Automated approaches that generate inspection waypoints from tower features have emerged in recent years; however, these solutions commonly rely on tower coordinates, which can be difficult to obtain. To address this issue, this study presents an autonomous inspection waypoint generation method based on object detection. The main contributions are as follows: (1) After acquiring and constructing a distribution tower dataset, we propose a lightweight object detector based on You Only Look Once (YOLOv8). The model integrates the Generalized Efficient Layer Aggregation Network (GELAN) module in the backbone to reduce model parameters and incorporates Powerful Intersection over Union (PIoU) to enhance the accuracy of bounding box regression. (2) Based on the detection results, a three-stage waypoint generator is designed: Stage 1 estimates the initial tower’s coordinates and altitude; Stage 2 refines these estimates; and Stage 3 determines the positions of subsequent towers. The generator ultimately provides the target’s position and heading information, enabling the UAV to perform inspection maneuvers. Compared to classic models, the proposed model runs at 56 Frames Per Second (FPS) and achieves an approximate 2.1% improvement in mAP50:95. In addition, the proposed waypoint estimator achieves tower position estimation errors within 0.8 m and azimuth angle errors within 0.01 rad. The proposed method’s effectiveness is further validated through actual flight tests involving multiple consecutive distribution towers.

1. Introduction

The critical role of electricity in modern society is more evident than ever, as nearly all aspects of life, from industrial production to daily activities, rely on a stable power supply. Electricity is typically generated at power plants located in remote areas and transmitted to end users via the transmission system; once the transmission system deteriorates, frequent power outages can disrupt residents’ lives [1] and business operations [2], leading to substantial economic losses. Therefore, regular power inspections are crucial. Traditional manual inspections offer intuitive, flexible, and detailed observations of the site; however, they are highly labor-intensive and time-consuming [3,4]. Recently, advanced inspection robots, including climbing robots [5,6], UAVs, and hybrid robots [7], have been developed to improve the ease and efficiency of inspection tasks. Compared to manual inspections, UAV-based inspection solutions have gained attention for their compact size, accuracy, and cost-effectiveness. Existing autonomous UAV inspection technologies generally fall into Global Positioning System (GPS)-based waypoint navigation or perception-based wire tracking [8]. However, waypoint methods struggle with map inaccuracies caused by environmental or construction errors, while wire-tracking approaches often fail to provide the semantic understanding needed for precise tower and insulator inspection.
To address these limitations, this study builds on the insight that visual semantic features can provide sufficient relative positioning constraints for navigation without relying on high-precision priors. We introduce an autonomous inspection framework that integrates lightweight object detection with a vision-based waypoint generator. The novelty of this approach lies in its ability to semantically perceive tower structures and dynamically generate flight waypoints in real time. The main contributions are summarized as follows:
  • For object detection, we established a distribution tower dataset, and then incorporated GELAN and PIoU [9] modules to enhance the YOLOv8 model by reducing model parameters and improving bounding box regression accuracy. The improved model achieves a 2.1% increase in mAP50:95 and can run at 56 FPS on an RK3588-based onboard computer.
  • An inspection waypoint generator is designed, which collects the UAV’s states and detection results at specific intervals, estimates the relative distance between the tower and the UAV by analyzing their relative position and pixel variations, and estimates the tower’s geographic coordinates. The generator operates in three stages: initial tower coordinate estimation, coordinate correction, and refined tower coordinate estimation.
To illustrate this process, the remainder of this study is organized as follows: in Section 2, we review related work on object detection and its application in UAV power inspection; in Section 3, we present the system structure and describe the object detection process; Section 4 details the waypoint generation process and UAV controller; in Section 5, we validate the performance of the improved detection model and the effectiveness of the generator; a discussion of the experimental results is provided in Section 6; finally, in Section 7, we conclude our work and offer suggestions for future research.

2. Related Works

With the improvement in processor performance, deep learning technology has rapidly advanced, and as one of its applications, object detection techniques have undergone a transformation from traditional to deep learning-based approaches. Traditional object detection relies on manually extracted features, which can yield effective results for specific datasets; however, its performance is limited in complex environments and variable situations. In contrast, deep learning-based techniques can extract rich features from the target to perform detection. The successful applications of the YOLO [10] and Faster Region-based Convolutional Neural Network (R-CNN) [11] models have been particularly encouraging.
Object detection in power inspection: The detection of different targets has been extensively studied [12]. For transmission lines, ref. [13] proposed an improved YOLO model that enhances the accuracy and efficiency of small object detection in cases of transmission line breaks by adding deformable convolutions and hybrid attention modules. Ref. [14] improved the Solov2 network and pre-trained the model with transfer learning to enable better segmentation of line features. Defects are another common application target: refs. [15,16] balance model accuracy and lightweight design, achieving insulator identification and defect detection. Additionally, ref. [17] improved Faster R-CNN by adding a feature enhancement module and an auxiliary classification module, and a network that introduces a damper shape attention mechanism and a feature fusion structure further improves defect recognition accuracy and speed [18]. For foreign objects, ref. [19] combined YOLOv7 with a genetic algorithm for fast recognition, while ref. [20] modified YOLOX with multi-scale learning to enhance target focus, and ref. [21] integrated a CNN with a random forest model to optimize foreign object classification. Moreover, in complex background environments, multi-feature fusion modules are frequently incorporated into network structures [22,23,24]; however, as emphasized in ref. [25], the collection of raw tower images still has to be performed by operators.
Flight navigation in power inspection: This study focuses on generating safe navigation points around towers to enable effective data acquisition. Since the environment is typically open and outdoors, obstacle avoidance and path planning [26] are not considered. Mainstream inspection path generation solutions include waypoint-based and line detection-based methods. The waypoint-based method primarily relies on GPS for positioning, combined with prior knowledge or manually collected tower features, to customize flight paths for autonomous inspection [27]; on this basis, integrating object detection can assist in bias correction [28] and reduce misidentification [29]. The line detection-based method uses radar [30] or cameras [31] to detect power lines and follow the wires during inspections, making it suitable for transmission line inspections.
Beyond the specific inspection approaches mentioned above, general autonomous navigation has made significant progress. State-of-the-art visual-inertial SLAM systems [32] and LiDAR-based algorithms [33] achieve high-precision localization in feature-rich environments. However, applying these general-purpose frameworks to power line inspection presents unique challenges. Visual SLAM often suffers from tracking loss due to texture-less sky backgrounds, while LiDAR-based solutions impose high demands on payload capacity and costs for inspection UAVs. A qualitative comparison between the proposed method and these navigation methods is provided in Table 1.

3. System Structure and Object Detection

3.1. System Structure

To achieve autonomous inspection, a lightweight object detection model based on YOLOv8 is designed to enable real-time and accurate detection of distribution tower components. The overall system architecture follows a modular “Perception-Planning-Control” paradigm that is widely adopted in autonomous aerial robotics for its robustness and flexibility [34,35]. In addition, a waypoint generator that operates without relying on tower coordinates or structural features is developed, thereby enhancing the autonomy and intelligence of UAV inspections, as illustrated in Figure 1.
Since the classifications in public datasets do not meet the requirements of our generator, we first establish an appropriate dataset, as described in Section 3.2.1. The detection model is trained on a desktop computer, and both the trained model and the generator are deployed on the onboard computer; this high-performance embedded platform processes high-resolution image streams locally, ensuring that the navigation logic responds in a timely manner [36]. As shown in Figure 1, the onboard computer receives images transmitted from the camera, processes them to extract object pixels, and simultaneously records the UAV’s states. It then combines the object pixels with the UAV’s state information to estimate the next waypoint, which is sent to the UAV. The UAV’s cascade control system then follows these commands to perform autonomous inspection; such a hierarchical controller separates the fast dynamics of the attitude loop from the slow dynamics of the position loop, ensuring stable trajectory tracking and precise hovering during inspection maneuvers [37].
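To make the data flow in Figure 1 concrete, the sketch below outlines one iteration of the perception-planning-control loop in Python. The structure mirrors the description above, but the function and attribute names (the detector call, generator.update, autopilot.send_setpoint) and the UAVState fields are illustrative placeholders rather than the actual onboard implementation.

```python
from dataclasses import dataclass

@dataclass
class UAVState:
    lon: float   # longitude in degrees (WGS84)
    lat: float   # latitude in degrees (WGS84)
    alt: float   # altitude in meters
    yaw: float   # heading in radians

def inspection_step(frame, uav_state, detector, generator, autopilot):
    """One iteration of the perception-planning-control loop (illustrative)."""
    # Perception: run the lightweight detector on the latest camera frame.
    detections = detector(frame)            # e.g., list of (class_id, cx, cy, w, h)
    # Planning: fuse pixel observations with the UAV state to update the waypoint.
    waypoint = generator.update(detections, uav_state)
    # Control: hand the target position and heading to the cascade controller.
    if waypoint is not None:
        autopilot.send_setpoint(lon=waypoint.lon, lat=waypoint.lat,
                                alt=waypoint.alt, yaw=waypoint.yaw)
    return waypoint
```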

3.2. Tower Detection

3.2.1. Dataset Description

In the absence of specific tower coordinates, locating the tower with the camera becomes a prerequisite for inspection. In our proposed method, not only the tower itself but also several key components need to be detected to provide sufficient information. However, existing power-related datasets generally focus on detailed structural classifications [38], defect types [39], and power lines [40], making it difficult to obtain classifications suitable for our method. Therefore, we collected images containing both front and top views of distribution towers and labeled five key classes, namely the tower top, crossarm, insulator, tower body, and tower base, as shown in Figure 2.
A total of 8700 images are collected in this study and compared with public datasets. Particular emphasis is placed on two categories, namely the tower top and tower base, as shown in Table 2. Combined with the actual altitude variations during flight, the top and base pixels can be used to estimate the tower’s altitude and geographic coordinates, which are then utilized to generate the inspection waypoint. The insulator and tower body are primarily used for estimating the tower’s orientation, while the crossarm is used to identify the relevant insulator sections; the crossarm is necessary because each tower typically includes multiple insulators.
To ensure the model’s generalization capability in real-world detection scenarios, the dataset includes images captured at different times of the day, covering common frontlit and backlit conditions under overcast or cloudy skies. The image backgrounds range from clear sky to complex ground surfaces. Additionally, the captured perspectives span multiple flight altitudes (5–50 m) and camera angles (0–90°) to ensure the model recognizes tower components from diverse viewpoints.
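For concreteness, the following minimal Python sketch shows how a single annotation in the standard YOLO text format maps to the five classes above; the class index ordering and the example values are our own assumptions for illustration, not the dataset’s actual configuration.

```python
# Illustrative: parse one YOLO-format label line for the five classes in Figure 2.
# The class index ordering below is an assumption; the authors' mapping is not
# specified in the text.
CLASS_NAMES = ["top", "crossarm", "insulator", "body", "base"]

def parse_label_line(line: str, img_w: int, img_h: int):
    """Convert 'cls cx cy w h' (normalized) into (class name, pixel center/size)."""
    cls, cx, cy, w, h = line.split()
    name = CLASS_NAMES[int(cls)]
    return name, (float(cx) * img_w, float(cy) * img_h,
                  float(w) * img_w, float(h) * img_h)

# Example: a tower-top box slightly left of center in a 640x480 image.
print(parse_label_line("0 0.45 0.20 0.10 0.08", 640, 480))
```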

3.2.2. Improved YOLOv8

The fundamental model we use is YOLOv8. It adopts an efficient CSP-based backbone network and combines the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN) to optimize the parameter count, computational complexity, and feature extraction. Advanced data augmentation techniques, such as adaptive hybrid enhancement and target generation enhancement, are employed to improve the model’s generalization ability. Owing to these innovations, YOLOv8 improves inference speed and achieves higher accuracy on public datasets [41].
To improve the model’s recognition accuracy for multi-scale targets and its real-time performance, we improve YOLOv8 by (1) introducing the GELAN module into the backbone to reduce model parameters and computation costs and (2) adopting the PIoU loss function to enhance the accuracy of bounding box regression. As a result, the optimized model significantly reduces network parameters while maintaining higher accuracy.

3.2.3. Lightweight Backbone

Accurately detecting various components is crucial during inspection tasks. However, the complex background environment and multi-scale targets pose significant challenges; hence, we introduced the GELAN module into the backbone of YOLOv8, as shown in Figure 3, to further enhance its feature extraction capabilities and overall performance.
The feature extraction efficiency of GELAN is enhanced through the RepNCSPELAN4 module [42,43], which integrates the characteristics of Cross Stage Partial (CSP) connections and Spatial Pyramid Pooling (SPP), to address the issue of excessive computational and memory overhead in traditional neural networks. By implementing RepNCSP operations and CSP connections, the module facilitates feature grouping and information exchange, thereby achieving efficient feature reuse and optimized information flow. Consequently, the GELAN architecture reduces computational and memory consumption while maintaining robust detection accuracy.
With a more efficient multi-scale feature fusion mechanism, the detection of targets of varying sizes is improved. On the other hand, the GELAN module reduces model parameters and computational overhead by optimizing feature representation and improving parameter utilization. As a result, the improved model enhances detection performance and enables efficient inference on onboard devices, making it suitable for UAV-based inspection applications.
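The following PyTorch sketch illustrates the aggregation pattern described above: the channels are split, one partial group is deepened through stacked convolution stages, and all intermediate outputs are concatenated and fused. It is a simplified approximation of the RepNCSPELAN4 idea for illustration only, not the exact module used in our model.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic unit used below."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class GELANBlockSketch(nn.Module):
    """Simplified GELAN-style block: split channels, deepen one partial group
    through stacked conv stages, aggregate all intermediate outputs, and fuse
    with a 1x1 conv. An illustrative approximation of RepNCSPELAN4."""
    def __init__(self, c_in, c_out, c_hidden=None, n_stages=2):
        super().__init__()
        c_hidden = c_hidden or c_out // 2
        self.split = ConvBNAct(c_in, 2 * c_hidden, k=1)
        self.stages = nn.ModuleList(
            [nn.Sequential(ConvBNAct(c_hidden, c_hidden),
                           ConvBNAct(c_hidden, c_hidden))
             for _ in range(n_stages)]
        )
        self.fuse = ConvBNAct((2 + n_stages) * c_hidden, c_out, k=1)

    def forward(self, x):
        y = list(self.split(x).chunk(2, dim=1))   # two partial feature groups
        for stage in self.stages:
            y.append(stage(y[-1]))                # deepen one branch, keep the rest
        return self.fuse(torch.cat(y, dim=1))     # aggregate and project

# Quick shape check on a dummy feature map.
feat = torch.randn(1, 64, 80, 80)
print(GELANBlockSketch(64, 128)(feat).shape)      # torch.Size([1, 128, 80, 80])
```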

3.2.4. Bounding Box Regression

Considering that the proposed method relies on accurate detection results, only anchor frames within an appropriate range can be matched to the actual tower coordinates. In scenarios with significant differences in target sizes, the traditional IoU loss function may cause anchor frames to expand during the regression process, thereby reducing localization accuracy. Additionally, since the IoU loss function does not differentiate between anchor frames of different quality, the training process may be negatively affected. To address this problem, we introduce the PIoU loss, which combines a size-adaptive penalty factor with a gradient adjustment function. It accelerates model convergence and improves detection accuracy, even in the presence of varying target sizes and qualities.
The PIoU loss can be defined as
$L_{PIoU} = P \cdot f(x),$ (1)
where $P = 1/(1 + \alpha \cdot \mathrm{size\_diff})$ is a size-adaptive penalty factor based on the target frame size, $\alpha$ is a hyperparameter controlling the penalty sensitivity, and $\mathrm{size\_diff}$ is the size difference between the anchor and target boxes. $f(x)$ is a quality-adjusted gradient function based on the anchor box; it is defined as
$f(x) = \begin{cases} \beta \cdot g_1(x), & \text{if } x \text{ is of medium quality}; \\ \gamma \cdot g_2(x), & \text{if } x \text{ is of high quality}; \\ \delta \cdot g_3(x), & \text{if } x \text{ is of low quality}. \end{cases}$ (2)
In Equation (2), $g_1(x)$, $g_2(x)$, and $g_3(x)$ are functions that provide gradients based on the anchor box’s quality, while $\beta$, $\gamma$, and $\delta$ are factors that limit the effect of the gradient for the different quality levels.
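The snippet below sketches the loss form of Equations (1) and (2) in PyTorch. Since the exact gradient functions g1–g3 and quality thresholds are not specified above, the IoU-based quality measure, the thresholds, and the (1 − IoU) penalty term are illustrative assumptions.

```python
import torch

def piou_loss_sketch(iou, size_diff, alpha=0.5, beta=1.5, gamma=0.5, delta=1.0,
                     hi=0.75, lo=0.45):
    """Illustrative form of Eqs. (1)-(2): L = P * f(x).
    `iou` is the IoU between anchor and target boxes, used here as the quality
    measure x; `size_diff` is the relative size difference between the boxes.
    The thresholds hi/lo and the (1 - IoU) penalty standing in for g1-g3 are
    assumptions made for illustration."""
    # Size-adaptive penalty factor P = 1 / (1 + alpha * size_diff).
    p = 1.0 / (1.0 + alpha * size_diff)

    base = 1.0 - iou                                          # common penalty term
    f = torch.where(iou >= hi, gamma * base,                  # high-quality anchors
        torch.where(iou >= lo, beta * base,                   # medium-quality anchors
                    delta * base))                            # low-quality anchors
    return p * f

# Example: three anchors of low, medium, and high quality.
iou = torch.tensor([0.30, 0.60, 0.85])
size_diff = torch.tensor([0.40, 0.10, 0.05])
print(piou_loss_sketch(iou, size_diff))
```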

4. Inspection Waypoint Generator

4.1. Overview

In general, the inspection process consists of two parts: detection and navigation. The detection process is described in Section 3.2, and this section presents the method used to guide the UAV. A waypoint generator consisting of three stages is proposed, as shown in Figure 4.
In Figure 4, after an autonomous takeoff from point O, the UAV climbs vertically until all components are detected. The UAV then temporarily hovers at point A to capture images and record its current position. It continues climbing until the tower top appears at the bottom of the camera’s Field of View (FOV), at which point the UAV hovers at point B. Subsequently, the tower information is calculated, and waypoint D is generated.
Considering possible deviations in the estimation results, we designed a correction stage (Stage 2) by adding auxiliary points C and E at both ends of D to obtain more accurate tower coordinates. Finally, Stage 3 is designed to detect the subsequent towers, corresponding to the process from point D to point G, thus enabling continuous inspection of the entire transmission grid.

4.2. Stage 1: Initial Tower Estimation

The tower top is usually not visible within the camera’s FOV when the UAV is on the ground due to perspective and height limitations; therefore, the critical condition for hovering occurs when the tower body, top, and base can be simultaneously detected, denoted as point A. At this point, the UAV captures tower images and records its current position, which is located at $(lon_A, lat_A)$ with an altitude of $alt_A$ and a heading of $\psi_A$. The pixel coordinates of the critical components, including the tower top $(x_{tp}^A, y_{tp}^A)$ and the tower base $(x_{bs}^A, y_{bs}^A)$, are then extracted. It is worth noting that all pixel coordinates correspond to the centers of their respective bounding boxes.
Subsequently, the UAV maintains its heading and position while continuing to ascend. In contrast to hovering at point A, where the entire tower is visible, the main condition for hovering at point B is that the tower top is located near the bottom of the FOV, which is intended to produce significant altitude and pixel disparities, thereby minimizing measurement noise. At point B, the UAV’s altitude is $alt_B$, and the pixel coordinates of the tower top are $(x_{tp}^B, y_{tp}^B)$, as shown in Figure 5.
The ratio of altitude to pixel displacement along the Y-axis can be calculated as $\delta = |alt_B - alt_A| / |y_{tp}^B - y_{tp}^A|$. Noting the camera’s focal length as $f_c$, the tower altitude and the longitudinal distance between the UAV and the tower can be calculated as in Equations (3) and (4):
$H_t = |y_{tp}^A - y_{bs}^A| \cdot \delta,$ (3)
$d_{tl} = \delta \cdot f_c.$ (4)
At this stage, the available information is still insufficient to estimate the tower’s position; the heading information is also required. Hence, we calculate the relative lateral distance to address this problem. Noting the camera’s resolution as $Res_X \times Res_Y$, the relative lateral distance between the UAV and the tower can be calculated as
$d_{tp} = (d_{tl} \cdot f_c^{-1})(x_{tp}^B - Res_X/2) \cdot w_u,$ (5)
where $w_u$ is the camera unit’s diagonal size, while $x_{tp}^B - Res_X/2$ denotes the pixel difference between the tower and the image center. Combined with $d_{tl}$ and the relative heading deviation $\Delta\psi = \tan^{-1}(d_{tp}/d_{tl})$, and denoting the Earth’s radius as $R_E$ and the radian conversion factor as $k_{rd} = 57.3$, the initial tower location can be calculated as
$lon_t = lon_A + \dfrac{d_{tl} \sin(\psi_A + \Delta\psi) \cdot k_{rd}}{R_E \cos(k_{rd} \, lat_A)},$ (6)
$lat_t = lat_A + \dfrac{d_{tl} \cos(\psi_A + \Delta\psi) \cdot k_{rd}}{R_E}.$ (7)
Therefore, the initial tower is located at $(lon_t, lat_t)$ with an altitude of $H_t$. Considering a safety distance $H_{sf}$, the estimated waypoint D is at $(lon_t, lat_t)$, with an altitude of $H_t + H_{sf}$ and a heading of $\psi_A + \Delta\psi$. The UAV then ascends to the desired altitude, adjusts its heading, and tilts the gimbal to 90° (90° being vertically downward and 0° pointing forward) to change the camera’s FOV.
It should be noted that due to the UAV’s status and the possible errors during detection, deviations may exist between the calculated results and the actual situation.
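A minimal Python sketch of the Stage 1 computation (Equations (3)–(7)) is given below, assuming angles in radians, distances in meters, and longitude/latitude in degrees; it uses math.radians/math.degrees in place of the fixed factor $k_{rd} = 57.3$, which is numerically equivalent, and is an illustration of the formulas rather than the flight code.

```python
import math

R_E = 6_371_000.0  # mean Earth radius in meters

def stage1_waypoint(alt_A, alt_B, y_tp_A, y_bs_A, y_tp_B, x_tp_B,
                    lon_A, lat_A, psi_A, f_c, res_x, w_u, h_safe):
    """Sketch of the Stage 1 estimate (Eqs. (3)-(7)). Angles in radians,
    altitudes/distances in meters, lon/lat in degrees; f_c in pixels and
    w_u (camera unit size) in meters per pixel."""
    # Altitude-to-pixel ratio from the two hover points A and B.
    delta = abs(alt_B - alt_A) / abs(y_tp_B - y_tp_A)
    # Eq. (3): tower altitude from the top/base pixel gap observed at point A.
    H_t = abs(y_tp_A - y_bs_A) * delta
    # Eq. (4): longitudinal UAV-tower distance from the pinhole model.
    d_tl = delta * f_c
    # Eq. (5): lateral offset from the horizontal pixel offset of the tower top.
    d_tp = (d_tl / f_c) * (x_tp_B - res_x / 2) * w_u
    # Relative heading deviation toward the tower.
    d_psi = math.atan2(d_tp, d_tl)
    # Eqs. (6)-(7): project the ground distance onto longitude/latitude offsets.
    bearing = psi_A + d_psi
    lon_t = lon_A + math.degrees(d_tl * math.sin(bearing) /
                                 (R_E * math.cos(math.radians(lat_A))))
    lat_t = lat_A + math.degrees(d_tl * math.cos(bearing) / R_E)
    # Waypoint D: above the tower top by the safety margin, heading toward it.
    return {"lon": lon_t, "lat": lat_t, "alt": H_t + h_safe, "yaw": bearing}
```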

4.3. Stage 2: Initial Tower Coordinate Correction

Since the waypoint estimated in Stage 1 may contain deviations, we add auxiliary points C and E around D on the flight route. Point C is recorded when the tower and all of its components can be detected within the $[0, Res_Y/3]$ region of the FOV, and point E is recorded when the tower can be detected within $[2Res_Y/3, Res_Y]$, as shown in Figure 6.
Noting that the position of point C is $(lon_C, lat_C)$ with the tower-top pixel coordinate $(x_{tp}^C, y_{tp}^C)$, and that the position of point E is $(lon_E, lat_E)$ with the corresponding tower top at $(x_{tp}^E, y_{tp}^E)$, the distance $d_{CE}$ from point C to point E can be calculated as
$d_{CE} = 2 R_E \arcsin\sqrt{a_0 + a_1},$ (8)
where
$a_0 = \sin^2\!\big((lat_E - lat_C)/2\big),$ (9)
$a_1 = \cos(k_{rd} \, lat_C)\cos(k_{rd} \, lat_E)\sin^2\!\big((lon_E - lon_C)/2\big).$ (10)
Combined with the heading $\psi_C = \psi_A + \Delta\psi$ (where $\psi_A$ has been converted to the North–East–Down (NED) coordinate frame), $d_{CE}$ can be decomposed into NED components, with $d_{CE}^N$ in the north direction and $d_{CE}^E$ in the east direction, as shown in Equations (11) and (12):
$d_{CE}^N = d_{CE} \cos(\psi_C),$ (11)
$d_{CE}^E = d_{CE} \sin(\psi_C).$ (12)
The conversion ratios $\lambda_{lon}$ and $\lambda_{lat}$ between unit longitude/latitude and actual distance can be calculated as
$\lambda_{lon} = d_{CE}^E / (lon_E - lon_C),$ (13)
$\lambda_{lat} = d_{CE}^N / (lat_E - lat_C).$ (14)
Noting the image center as $O_P = (Res_X/2, Res_Y/2)$, the pixel vector of the tower top at point C is $P_b = (x_{tp}^C - Res_X/2,\ Res_Y/2 - y_{tp}^C)$. The angular deviation of the tower top from the Y-axis in pixels is $\psi_P = \tan^{-1}\!\big[(x_{tp}^C - Res_X/2)/(Res_Y/2 - y_{tp}^C)\big]$; then the vector corresponding to $P_b$ in NED coordinates is
$P_g = \big(d_P \sin(\psi_C + \psi_P),\ d_P \cos(\psi_C + \psi_P)\big),$ (15)
where $d_P$ is the vector length of $P_b$, and the actual distances between point C and the tower top in NED coordinates are
$d_{Ct}^N = d_P \cos(\psi_C + \psi_P) \cdot d_{CE} / a_2,$ (16)
$d_{Ct}^E = d_P \sin(\psi_C + \psi_P) \cdot d_{CE} / a_2,$ (17)
where
$a_2 = \sqrt{(x_{tp}^E - x_{tp}^C)^2 + (y_{tp}^E - y_{tp}^C)^2}.$ (18)
$d_{Ct}^N$ and $d_{Ct}^E$ denote the actual distances in the N and E directions; the corrected coordinates of the initial tower are therefore $(lon_C + d_{Ct}^E/\lambda_{lon},\ lat_C + d_{Ct}^N/\lambda_{lat})$.
In addition, the UAV’s heading should be consistent with the transmission line’s direction during inspection. At point C, the pixel coordinates of the two insulators are denoted as $(x_{i1}^C, y_{i1}^C)$ and $(x_{i2}^C, y_{i2}^C)$. The deviation angle between the UAV and the transmission line can be calculated as $\Delta\psi_t = \tan^{-1}\!\big[(y_{i2}^C - y_{i1}^C)/(x_{i2}^C - x_{i1}^C)\big]$, and the transmission line’s direction is $\psi_l = \psi_C + \Delta\psi_t$.
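The following Python sketch transcribes the Stage 2 correction (Equations (8)–(18)) and the heading alignment from the insulator pixels, again using the math module’s degree/radian conversions in place of $k_{rd} = 57.3$; it is a direct illustration of the formulas above rather than the flight code.

```python
import math

R_E = 6_371_000.0  # mean Earth radius in meters

def stage2_correction(lon_C, lat_C, lon_E, lat_E, psi_C,
                      x_tp_C, y_tp_C, x_tp_E, y_tp_E, res_x, res_y):
    """Sketch of Eqs. (8)-(18). lon/lat in degrees, psi_C in radians; pixel
    coordinates refer to the tower-top bounding box centers at points C and E."""
    # Eqs. (8)-(10): haversine distance between the hover points C and E.
    dlat = math.radians(lat_E - lat_C)
    dlon = math.radians(lon_E - lon_C)
    a0 = math.sin(dlat / 2) ** 2
    a1 = (math.cos(math.radians(lat_C)) * math.cos(math.radians(lat_E))
          * math.sin(dlon / 2) ** 2)
    d_CE = 2 * R_E * math.asin(math.sqrt(a0 + a1))
    # Eqs. (11)-(14): meters flown per degree of longitude/latitude.
    lam_lon = d_CE * math.sin(psi_C) / (lon_E - lon_C)
    lam_lat = d_CE * math.cos(psi_C) / (lat_E - lat_C)
    # Eq. (15): tower-top pixel vector at C, rotated into NED by psi_C.
    px, py = x_tp_C - res_x / 2, res_y / 2 - y_tp_C
    d_P = math.hypot(px, py)
    psi_P = math.atan2(px, py)
    # Eqs. (16)-(18): pixel-to-meter scale from the top's pixel travel C -> E.
    a2 = math.hypot(x_tp_E - x_tp_C, y_tp_E - y_tp_C)
    d_Ct_N = d_P * math.cos(psi_C + psi_P) * d_CE / a2
    d_Ct_E = d_P * math.sin(psi_C + psi_P) * d_CE / a2
    # Corrected tower coordinates.
    return lon_C + d_Ct_E / lam_lon, lat_C + d_Ct_N / lam_lat

def line_heading(psi_C, x_i1, y_i1, x_i2, y_i2):
    """Transmission-line direction from the two insulator centers at point C."""
    return psi_C + math.atan2(y_i2 - y_i1, x_i2 - x_i1)
```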

4.4. Stage 3: Subsequent Tower Positioning

In the previous stages, the UAV hovers directly above the initial tower with its heading aligned with the transmission line and the camera gimbal set to 90°. However, the accuracy of the transmission line direction estimated from the insulators should be further confirmed. Therefore, Stage 3 is designed to accomplish transmission line direction correction and the subsequent tower inspection.
In Stage 3, we first adjust the camera to 45° to facilitate detection of the subsequent tower. Then, the UAV’s heading is adjusted based on the detection result so that the tower is kept at the center of the FOV. The UAV then flies forward at a constant velocity and adjusts the camera to 90° when the tower top is detected within $[2Res_Y/3, Res_Y]$ of the FOV. The subsequent process is similar to the process from points C to E in Stage 2: when the subsequent tower falls within the $[0, Res_Y/3]$, $[Res_Y/3, 2Res_Y/3]$, and $[2Res_Y/3, Res_Y]$ ranges of the FOV, the UAV hovers to collect data and estimates the tower coordinates.
Compared with Stage 2, the UAV is already above the tower in Stage 3, and its heading is consistent with that of the transmission line. Hence, the inspection process is simpler than that of the initial tower, and no positional correction is required. Finally, after the current tower’s inspection is finished, the process of Stage 3 can be repeated to start a new round of subsequent tower inspection, thus realizing the inspection of the towers across the entire transmission grid.
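As an illustration of the band logic shared by Stages 2 and 3, the small helper below classifies the tower-top pixel row into the three FOV ranges; the band labels and boundary handling are our own choices for the sketch.

```python
def fov_band(y_top, res_y):
    """Classify the tower-top pixel row into the three FOV bands used in
    Stages 2 and 3 (illustrative helper; band names are our own labels)."""
    if y_top < res_y / 3:
        return "upper"    # [0, ResY/3): hover, record first measurement
    if y_top < 2 * res_y / 3:
        return "middle"   # [ResY/3, 2ResY/3): hover, record second measurement
    return "lower"        # [2ResY/3, ResY]: hover, record third measurement
```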

4.5. Cascade Control

To enhance the broad applicability of our proposed method, we consider a UAV system with a typical cascade control structure, consisting primarily of an outer position loop and an inner attitude loop. Our generator estimates the real-time navigation waypoint and heading; hence, we mainly focus on the outer position control, which consists of heading control, horizontal position control, and altitude control. The control method involved is the classical Proportional-Integral-Derivative (PID) controller [44]. In heading and altitude control, by combining the target heading and altitude with state feedback, the heading error $e_Y$ and altitude error $e_H$ can be computed; the yaw rate target $r_{YR}$ and thrust target $r_T$ are then updated as
$r_{YR} = K_p^{YR} e_Y + K_i^{YR} \int e_Y \, dt + K_d^{YR} \dfrac{de_Y}{dt},$ (19)
$r_T = K_p^{T} e_H + K_i^{T} \int e_H \, dt + K_d^{T} \dfrac{de_H}{dt}.$ (20)
For horizontal position control, we calculate the error between the target and the feedback coordinates and convert it to the actual distance errors in the N and E directions. By combining the current heading, the distance error is decomposed into a forward error $e_{fo}$ and a lateral error $e_{la}$, from which the target roll $r_R$ and pitch $r_P$ are calculated as
$r_R = K_p^{R} e_{la} + K_i^{R} \int e_{la} \, dt + K_d^{R} \dfrac{de_{la}}{dt},$ (21)
$r_P = K_p^{P} e_{fo} + K_i^{P} \int e_{fo} \, dt + K_d^{P} \dfrac{de_{fo}}{dt}.$ (22)
$r_R$ and $r_P$ are combined with the appropriate feedback to calculate the respective angular rate targets. Subsequently, the rate controllers for roll, pitch, and yaw determine the required torque for each rotation axis.
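A minimal discrete-time PID sketch corresponding to Equations (19)–(22) is shown below; the gains are placeholders, not the tuned values used on the platform.

```python
class PID:
    """Minimal discrete PID used to sketch the outer-loop laws in Eqs. (19)-(22)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        # Accumulate the integral term and approximate the derivative term.
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Outer-loop sketch: the gains below are illustrative placeholders.
yaw_rate_pid = PID(1.0, 0.0, 0.05)    # e_Y  -> yaw-rate target r_YR
thrust_pid   = PID(1.5, 0.1, 0.20)    # e_H  -> thrust target r_T
roll_pid     = PID(0.8, 0.0, 0.10)    # e_la -> roll target r_R
pitch_pid    = PID(0.8, 0.0, 0.10)    # e_fo -> pitch target r_P
```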

5. Experiment and Verification

5.1. Model Validation

5.1.1. Model Training

The training process is conducted on a hardware platform equipped with an Intel Core i9-14900K CPU, an NVIDIA GeForce RTX 4080 GPU, and 32 GB of RAM. The code is implemented in Python 3.11.8 on a Windows operating system, utilizing PyTorch 2.2.2 as the deep learning framework. During training, the batch size is 16, the initial learning rate is 0.01, the weight decay is 0.0005, and the input image resolution is 640 × 480.
Key performance metrics, including precision, recall, mAP50, mAP50:95, and FPS, are considered to provide a comprehensive evaluation of the model’s performance. It should be noted that, to ensure the reproducibility of the experimental results and a fair comparison between models, all training sessions were initialized with a fixed random seed. Consequently, the reported mAP values represent the deterministic performance on the fixed validation set.
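For reference, a minimal training sketch assuming the Ultralytics YOLOv8 interface is shown below. The batch size, initial learning rate, weight decay, and fixed seed follow the settings above, while the dataset configuration path, epoch count, and seed value are illustrative assumptions, and the GELAN/PIoU modifications are not reflected in this base configuration.

```python
# Minimal training sketch assuming the Ultralytics YOLOv8 interface.
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")          # base architecture before our modifications
model.train(
    data="distribution_towers.yaml",  # hypothetical dataset config path
    epochs=300,                       # illustrative value (not stated in the text)
    batch=16,                         # batch size reported in Section 5.1.1
    imgsz=640,                        # input image size
    lr0=0.01,                         # initial learning rate
    weight_decay=0.0005,              # weight decay
    seed=0,                           # fixed seed for reproducible comparisons
)
```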

5.1.2. Ablation Validation

The dataset is divided into two subsets: a training set comprising 7000 images and a validation set containing 1700 images. To evaluate the feasibility of the proposed model, ablation experiments are conducted using the validation set. The precision and recall of each model are presented in Figure 7.
In Figure 7, when the confidence ranges from 0 to 0.45, the proposed model demonstrates an improved ability to accurately identify positive samples, particularly for the ‘Top’ and ‘Cros’ classes. The model achieves precision closer to one when the confidence is larger than 0.8. This improvement is primarily attributed to the GELAN module’s RepNCSPELAN4 structure, which enhances multi-scale feature fusion and efficiently retains and utilizes shallow features (such as tower top, base, and insulators). These capabilities improve the detection of targets that are prone to being overlooked. However, RepNCSPELAN4 shows limited performance in extracting features from uniform and continuous patterns, resulting in insufficient distributional bias for larger targets. Consequently, the efficiency for the tower body and crossarm is reduced. Regarding recall, the proposed model demonstrates the highest efficiency in detecting positive samples across all classes when the confidence is between 0 and 0.8. In the ‘Inst’ and ‘Body’ classes, the proposed model achieves significantly higher recall for confidence scores below 0.45. The v8-PIoU model slightly outperforms the proposed model in the ‘Top’ and ‘Body’ classes for confidence scores between 0.45 and 0.8. Such improvement is primarily attributed to the PIoU loss function, which emphasizes medium-quality anchor frames and incorporates penalties based on the angular difference between predicted and actual bounding boxes. Table 3 presents a more detailed summary of the detection performance after enabling each module in the ablation experiment.
To assess the model’s robustness under adverse environmental conditions, we simulated rainy and foggy scenarios by adding Gaussian noise and introduced pixel blocks to mimic partial occlusion. The corresponding detection results are presented in Figure 8.
As illustrated in Figure 8b, the introduced noise compromises the structural edges of finer details. While the primary tower structure remains discernible, smaller components, such as insulators, experience sporadic missed detections due to texture degradation. In Figure 8c, which simulates a scenario where the tower top is obstructed, the detector successfully identifies the unobstructed parts but fails to locate the tower top due to the loss of semantic features. These results demonstrate that while the model retains robust detection capabilities for visible components, severe occlusion of key navigation features represents a limiting case. In scenarios where critical navigation features remain undetectable, the UAV is programmed to hover to execute redundant detection attempts. Should the target remain unverified after a predefined time threshold, the system triggers a mission abort and initiates the autonomous return procedure.
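The degradation used for this robustness check can be reproduced with a few lines of NumPy, as sketched below; the noise level and the gray fill value are our own illustrative choices rather than the exact settings used to produce Figure 8.

```python
import numpy as np

def add_gaussian_noise(img, sigma=25.0):
    """Approximate rain/fog degradation with additive Gaussian noise (illustrative)."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_occlusion(img, x, y, w, h):
    """Mask a rectangular region (e.g., over the tower top) with a gray block."""
    occluded = img.copy()
    occluded[y:y + h, x:x + w] = 128
    return occluded
```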

5.1.3. Model Comparisons

To further evaluate the models’ performance, classical models from YOLOv5 to YOLOv12 are selected for comparison, with the results shown in Figure 9. Considering deployment on an onboard computer, the smallest model from each series is chosen: YOLOv5s [45], YOLOv6n [46], YOLOv7t [47], YOLOv8n [48], YOLOv9t [49], YOLOv11n [50], and YOLOv12n [51].
In Figure 9, the proposed model demonstrates high precision across the entire confidence range and maintains superior recall in most ranges. Notably, the proposed model reduces computational burdens by approximately 40–50% compared to YOLOv5s and YOLOv7t. In Figure 9a, the proposed model achieves noticeably higher precision than the others, particularly within the confidence range of 0–0.4, indicating that it remains reliable even when detecting low-confidence targets. By contrast, YOLOv5s and YOLOv7t show limited precision improvement, while newer versions such as YOLOv8n and YOLOv9t achieve moderate gains but still fall slightly below the overall performance of the proposed model. As shown in Figure 9b, the proposed model maintains a stable and competitive recall rate within the 0–0.6 confidence range, making it comparable to YOLOv6n, YOLOv8n, and YOLOv11n. Moreover, it exhibits a smoother decline at higher confidence levels (>0.7), demonstrating greater robustness in target selection. To intuitively illustrate the differences between models, several scenarios are selected to evaluate their detection performance, as shown in Figure 10.
The practical detection performance of the models is illustrated in Figure 10, and the missed objects are highlighted with yellow circles. The YOLOv8n, YOLOv7t, and YOLOv6n models commonly fail to detect the tower base and top classes. Meanwhile, the YOLOv5s model also struggles with large targets such as the crossarm and tower body. In contrast, the proposed model demonstrates no missed detections. Detailed detection statistics for each model are provided in Table 4.
Compared to the classic YOLO series models, the proposed model exhibits outstanding performance across several critical metrics. It achieves the highest precision at 0.9189 and a recall of 0.8368, which is slightly lower than that of YOLOv5s. For mAP, the proposed model attains 0.8970 on mAP50 and 0.6143 on mAP50:95, both of which are the highest among the compared models. Furthermore, the proposed model only uses 2.01 M parameters, and its FPS reaches 56, demonstrating its capability for fast data processing on onboard computers.

5.2. Inspection Flight Cases

5.2.1. Inspection Platform

Real-world flight validation is essential to assess the effectiveness of the proposed method. For this purpose, a quadcopter UAV is selected as the inspection platform, with a 5.76 kg takeoff weight and 45 min of endurance, and it is operated with an integrated autopilot. Image capture and processing are performed using a camera and the onboard computer, while the UAV’s position is provided with a Real-Time Kinematic (RTK) system. The complete system configuration is illustrated in Figure 11.
The inspection platform comprises several key components, including the power system (a, b, c), the Controller Area Network (CAN)-to-Local Area Network (LAN) converter module (d), the RTK system (e), an RK3588-based onboard computer (f), an autopilot (g), and a camera (h). All the components are designed and developed in-house, except for the power system, camera, and converter module.
The right part in Figure 11 illustrates the platform’s communication structure. The autopilot integrates various subsystems, including the Inertial Measurement Unit (IMU), a barometer, a magnetometer, a Flight Data Recording (FDR) module, and an LED. It controls the gimbal to adjust the camera’s FOV via the CAN and interacts with the onboard computer through module d. The camera and computer handle video streaming and image transmission over the LAN. During operation, the UAV can be monitored and controlled via the ground station or remote control when necessary, and the final experimental process is available at https://youtu.be/R9EN5_aVGwY?si=EpeEJiV0FiZxNB_K (accessed on 15 December 2025).

5.2.2. Inspection Implementation

The proposed method requires validation in an actual environment; to this end, a field site with several consecutive distribution towers is selected. The validation procedure involves positioning the UAV toward the initial tower and issuing the takeoff command from the ground station, after which the UAV operates autonomously without further intervention. During the test, neither the position nor the altitude of the inspected towers is provided in advance. The inspection process is depicted in Figure 12.
During the flight, the UAV hovered for 5 s at each position, where the tower appeared in different FOVs, such as points C, D, and E, to ensure precise estimation of the tower’s position. After completing the inspection of the initial tower, the UAV adjusted its heading toward the subsequent tower and repeated the process, hovering at different FOV positions, including points F, G, and H, to estimate the subsequent tower’s position. The inspection route for further towers, including points I, J, and K, followed the same procedure.
We will provide a detailed explanation of the waypoint calculations for each stage. Given the extensive detection data and UAV status information involved in the inspection process, only the most pertinent details are highlighted.
In Stage 1, the detection results for the top, base, and body of the initial tower are shown in Figure 13a–c, while the UAV’s altitude and heading are displayed in Figure 13d,e. The UAV received the command at 15.1 s, took off, and hovered at 2 m within 7 s. Subsequently, at 26.4 s, the onboard computer activated the detection protocol, and the UAV ascended slowly. By 42 s, both the tower top and base were centered in the FOV, and the UAV hovered to capture images. The UAV then continued to rise and hovered again at 62 s. At this point, the tower base was out of the FOV, while the tower top was close to the FOV’s bottom. Following the process outlined in Section 4.2, the pixel ratio $\delta$ is calculated, and the altitude of the initial tower is derived. Combined with the UAV’s heading, the latitude and longitude of the tower are estimated to approximate its location.
The UAV rose to a safety altitude of 21.4 m at 78 s and then adjusted its heading to keep the tower centered in the FOV.
Figure 14 illustrates the correction of the initial tower, which involves detecting the tower top and the insulators, along with the UAV states. At 86 s, the insulators begin to enter the FOV, and the entire tower is detected at 93.3 s. The UAV then stops tracking the estimation result from Stage 1 and hovers. It subsequently moves forward until the tower is centered in the FOV, continuing to fly until the tower is positioned within the $[2Res_Y/3, Res_Y]$ range. By combining the tower top pixel information from three different positions with the UAV’s latitude and longitude data, a more accurate estimate of the tower’s location can be obtained.
Additionally, the pixel positions of two insulators are used to correct the UAV’s heading to align with the transmission line. For example, at 117 s, the Y pixels of insulator 1 and insulator 2 are 1922 and 1872. Based on the pixel difference, the UAV adjusts its heading at 117.4 s so that the Y pixels of both insulators 1 and 2 reach 1800.
At the end of Stage 2, the UAV hovers over the initial tower with the gimbal adjusted to 45°. At 123.6 s, as shown in Figure 15, the UAV detects the tower steadily and further corrects its heading based on the tower body’s X pixel, after which the UAV moves forward. When the entire subsequent tower appears in the $[0, Res_Y/3]$ range of the FOV, the UAV adjusts the gimbal to 90°. Due to the change in viewing angle, some of the components temporarily disappear. Once all components re-enter the $[0, Res_Y/3]$ range of the FOV, the UAV repeats the procedure from Stage 2.
During the flight tests, the estimated tower coordinates, their deviations from the ground truth, and the azimuth errors computed from the UAV’s real-time position are summarized in Table 5. All geographic coordinates follow the WGS84 reference frame.
As shown in Table 5, during Stage 1, the large distance between the UAV and the tower introduces substantial depth estimation uncertainty and amplifies pixel-level detection errors through imaging geometry. Consequently, the initial tower localization exhibits deviations of 2.64 m and 1.11 m along the N and E axes, along with an azimuth error of 0.004 rad. Although this stage yields the largest errors, its role is to obtain a coarse bearing, which aligns with the design objective.
In Stage 2, as the UAV moves closer to the tower, the impact of pixel noise on spatial inference is significantly reduced. The maximum position error decreases to 0.63 m, indicating that the local refinement strategy provides effective correction to the initial estimate.
In sub-Stages 3-1 and 3-2, the estimation results for the two downstream towers further demonstrate the stability and high accuracy of the proposed method. The maximum position error remains within 0.79 m, and the azimuth error is approximately 0.008 rad.
From the experimental results, the pronounced error reduction from Stage 1 to Stage 2 verifies the effectiveness of the intermediate correction mechanism, while the sub-meter accuracy achieved in Stage 3 highlights the applicability of the method in complex environments. Moreover, the azimuth error remains consistently low and stable throughout the entire process, indicating strong robustness of the proposed waypoint estimation framework in both positional accuracy and heading consistency.

6. Discussion

6.1. Inspection Performance

The core challenge in object detection for UAV inspections is balancing detection accuracy with onboard computational constraints. Our experimental results (Table 4 and Figure 10) demonstrate that the proposed model achieves an optimal trade-off. While some larger models exhibit marginally higher recall in specific scenarios, they incur a significantly higher parameter count and increased inference latency.
In real-world operations, challenges such as variable lighting and complex backgrounds are inevitable. Although our primary flight tests were conducted under clear conditions, supplementary offline evaluations using data augmentation, to simulate rain, fog, and partial occlusion, confirmed that the detection logic remains robust.
Another key contribution of this study is the realization of vision-based waypoint generation without reliance on prior maps. Flight tests validated the system’s capability to successfully inspect three consecutive towers. The results indicate that the proposed method maintains positional estimation errors within a safety margin of approximately 0.8 m, with azimuth estimation errors consistently kept within 0.01 rad.

6.2. Sensitivity to Detection Noise

A sensitivity analysis based on the projection model was conducted to quantify the impact of bounding box detection errors on navigation accuracy. According to Equation (4), the longitudinal distance $d_{tl}$ is determined by the altitude variation $\Delta H$ and the pixel disparity $\Delta Y$ and is expressed as $d_{tl} = \dfrac{\Delta H}{\Delta Y} f_c$.
Assuming a detection noise level of $\delta_{pix}$ in the bounding box coordinates, the resulting distance estimation error $E_d$ can be approximated using a first-order Taylor expansion:
$E_d \approx \left| \dfrac{\partial d_{tl}}{\partial (\Delta Y)} \right| \cdot \delta_{pix} = \dfrac{d_{tl}}{\Delta Y} \cdot \delta_{pix} = \dfrac{d_{tl}^2}{\Delta H \cdot f_c} \cdot \delta_{pix}.$
This derivation reveals that the estimation error is proportional to the square of the distance. For instance, given a focal length of $f_c = 1000$ pixels and an altitude change of $\Delta H = 5$ m, a detection noise of $\pm 5$ pixels yields significantly different outcomes depending on range: at a far distance (Stage 1, $d_{tl} = 40$ m), the estimated position error is approximately 1.6 m, whereas at close range (Stage 2, $d_{tl} = 5$ m), the error drastically decreases to approximately 0.025 m. Consequently, while the system exhibits sensitivity to detection noise at long ranges, the proposed multi-stage framework ensures that the final localization accuracy remains highly robust, effectively mitigating the impact of pixel-level inaccuracies.
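The two figures above can be verified with a short numeric check of the error bound:

```python
# Numeric check of E_d = d_tl**2 / (dH * f_c) * pixel_noise, reproducing the
# far-range and close-range cases discussed above.
def distance_error(d_tl, d_h, f_c, pixel_noise):
    return d_tl ** 2 / (d_h * f_c) * pixel_noise

print(distance_error(40.0, 5.0, 1000.0, 5.0))  # Stage 1, far range   -> 1.6 m
print(distance_error(5.0, 5.0, 1000.0, 5.0))   # Stage 2, close range -> 0.025 m
```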

6.3. Limitations

Numerous studies have focused on defect detection in power inspection; however, research on flight path planning based on detection results remains relatively scarce. Our findings reveal that current mainstream models still face challenges in detecting uncommon targets, such as tower tops and tower bases. Moreover, the proposed method is primarily applicable to distribution towers with relatively simple structures and is not yet suitable for more complex transmission towers. In addition, a rigorous quantitative analysis of how detection error affects waypoint generation has not been conducted; in the future, we will construct a more comprehensive error model to enable a precise quantitative analysis of system robustness. Finally, since power inspection work is typically carried out in open outdoor environments, response strategies for obstructions and harsh environments have not yet been considered.

7. Conclusions and Future Work

To overcome the reliance on prior tower information in power inspection tasks, an autonomous inspection method based on object detection was proposed. The method includes two main components: tower feature detection and navigation information estimation. Due to the lack of sufficient labels in existing tower datasets, a distribution tower dataset was created, incorporating tower top and base classes. Based on the YOLOv8 model, the GELAN module was integrated to design a lightweight backbone, and PIoU was employed for bounding box regression, enabling effective recognition of each target class during inspections. An inspection waypoint generator was also developed based on key feature detection. The inspection process was divided into two phases: the initial tower and subsequent towers. Stages 1 and 2 focus on localizing and correcting the initial tower’s position, while Stage 3 is used for subsequent tower localization. Model validation through comparisons with YOLOv5 to YOLOv12 demonstrated superior performance: the proposed model achieved minimum improvements of 1.37% in mAP50 and 2.1% in mAP50:95 while reducing the parameter count to 2.014 M. Validation flights in actual environments with multiple consecutive towers confirmed the method’s effectiveness; the UAV successfully conducted inspections of three towers without relying on pre-existing tower information or operator intervention.
As noted in Section 6, the current waypoint generator is not yet applicable to high-voltage transmission towers due to their more complex inspection requirements. In addition, performance degradation was observed during the inspection of the third tower, which resulted from structural differences. Therefore, future work will focus on collecting a larger and more diverse dataset, applying targeted data augmentation to improve the model’s generalization capability, and further optimizing the solution through the incorporation of geometric constraints.

Author Contributions

Conceptualization, Q.W. and Z.Z.; methodology, Q.W.; software, Q.W.; validation, Q.W., Z.Z., and W.W.; formal analysis, Q.W.; investigation, Q.W. and Z.Z.; resources, Q.W.; data curation, Q.W.; writing—original draft preparation, Q.W.; writing—review and editing, W.W.; supervision, Q.W.; project administration, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Newman, D.E.; Carreras, B.A.; Lynch, V.E.; Dobson, I. Exploring complex systems aspects of blackout risk and mitigation. IEEE Trans. Reliab. 2011, 60, 134–143. [Google Scholar] [CrossRef]
  2. Castillo, A. Risk analysis and management in power outage and restoration: A literature survey. Electr. Power Syst. Res. 2014, 107, 9–15. [Google Scholar] [CrossRef]
  3. Chen, D.-Q.; Guo, X.-H.; Huang, P.; Li, F.-H. Safety distance analysis of 500 kv transmission line tower uav patrol inspection. IEEE Lett. Electromagn. Compat. Pract. Appl. 2020, 2, 124–128. [Google Scholar] [CrossRef]
  4. Larrauri, J.I.; Sorrosal, G.; González, M. Automatic system for overhead power line inspection using an Unmanned Aerial Vehicle—RELIFO project. In Proceedings of the 2013 International Conference on Unmanned Aircraft Systems (ICUAS), Atlanta, GA, USA, 28–31 May 2013; IEEE: New York, NY, USA, 2013; pp. 244–252. [Google Scholar]
  5. Gao, Y.; Song, G.; Li, S.; Zhen, F.; Chen, D.; Song, A. LineSpyX: A power line inspection robot based on digital radiography. IEEE Robot. Autom. Lett. 2020, 5, 4759–4765. [Google Scholar] [CrossRef]
  6. Mendoza, N.; Nemati, H.; Haghshenas-Jaryani, M.; Dehghan-Niri, E. An Inflatable Soft Crawling Robot with Nondestructive Testing Capability for Overhead Power Line Inspection. In Proceedings of the ASME International Mechanical Engineering Congress and Exposition, Columbus, OH, USA, 30 October–3 November 2022; American Society of Mechanical Engineers: New York, NY, USA, 2022; Volume 86670, p. V005T07A021. [Google Scholar]
  7. Baba, A. A new design of a flying robot, with advanced computer vision techniques to perform self-maintenance of smart grids. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 2252–2261. [Google Scholar] [CrossRef]
  8. Zhou, G.; Yuan, J.; Yen, I.-L.; Bastani, F. Robust real-time UAV based power line detection and tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; pp. 744–748. [Google Scholar]
  9. Liu, C.; Wang, K.; Li, Q.; Zhao, F.; Zhao, K.; Ma, H. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2024, 170, 276–284. [Google Scholar] [CrossRef] [PubMed]
  10. Nazir, A.; Wani, M.A. You only look once-object detection models: A review. In Proceedings of the 2023 10th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 15–17 March 2023; IEEE: New York, NY, USA, 2023; pp. 1088–1095. [Google Scholar]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  12. Zhao, Z.; Qi, H.; Qi, Y.; Zhang, K.; Zhai, Y.; Zhao, W. Detection method based on automatic visual shape clustering for pin-missing defect in transmission lines. IEEE Trans. Instrum. Meas. 2020, 69, 6080–6091. [Google Scholar] [CrossRef]
  13. Wang, X.; Cao, Q.; Jin, S.; Chen, C.; Feng, S. Research on detection method of transmission line strand breakage based on improved YOLOv8 network model. IEEE Access 2024, 12, 168197–168212. [Google Scholar] [CrossRef]
  14. Ma, W.; Xiao, J.; Zhu, G.; Wang, J.; Zhang, D.; Fang, X.; Miao, Q. Transmission tower and Power line detection based on improved Solov2. IEEE Trans. Instrum. Meas. 2024, 73, 5015711. [Google Scholar] [CrossRef]
  15. Liang, X.; Wang, J.; Xu, P.; Kong, Q.; Han, Z. Gdipayolo: A fault detection algorithm for uav power inspection scenarios. IEEE Signal Process. Lett. 2023, 30, 1577–1581. [Google Scholar] [CrossRef]
  16. Zhang, S.; Qu, C.; Ru, C.; Wang, X.; Li, Z. Multi-objects recognition and self-explosion defect detection method for insulators based on lightweight GhostNet-YOLOV4 model deployed onboard UAV. IEEE Access 2023, 11, 39713–39725. [Google Scholar] [CrossRef]
  17. Shuang, F.; Wei, S.; Li, Y.; Gu, X.; Lu, Z. Detail R-CNN: Insulator detection based on detail feature enhancement and metric learning. IEEE Trans. Instrum. Meas. 2023, 72, 2524414. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Li, B.; Shang, J.; Huang, X.; Zhai, P.; Geng, C. DSA-Net: An Attention-Guided Network for Real-Time Defect Detection of Transmission Line Dampers Applied to UAV Inspections. IEEE Trans. Instrum. Meas. 2023, 73, 3501022. [Google Scholar] [CrossRef]
  19. Yu, C.; Liu, Y.; Zhang, W.; Zhang, X.; Zhang, Y.; Jiang, X. Foreign objects identification of transmission line based on improved YOLOv7. IEEE Access 2023, 11, 51997–52008. [Google Scholar] [CrossRef]
  20. Wu, M.; Guo, L.; Chen, R.; Du, W.; Wang, J.; Liu, M.; Kong, X.; Tang, J. Improved YOLOX foreign object detection algorithm for transmission lines. Wirel. Commun. Mob. Comput. 2022, 2022, 5835693. [Google Scholar] [CrossRef]
  21. Yu, Y.; Qiu, Z.; Liao, H.; Wei, Z.; Zhu, X.; Zhou, Z. A method based on multi-network feature fusion and random forest for foreign objects detection on transmission lines. Appl. Sci. 2022, 12, 4982. [Google Scholar] [CrossRef]
  22. Xu, C.; Li, Q.; Zhou, Q.; Zhang, S.; Yu, D.; Ma, Y. Power line-guided automatic electric transmission line inspection system. IEEE Trans. Instrum. Meas. 2022, 71, 3512118. [Google Scholar] [CrossRef]
  23. He, M.; Qin, L.; Deng, X.; Liu, K. MFI-YOLO: Multi-fault insulator detection based on an improved YOLOv8. IEEE Trans. Power Deliv. 2023, 39, 168–179. [Google Scholar] [CrossRef]
  24. Liu, C.; Wu, Y.; Liu, J.; Sun, Z. Improved YOLOv3 network for insulator detection in aerial images with diverse background interference. Electronics 2021, 10, 771. [Google Scholar] [CrossRef]
  25. Guan, H.; Sun, X.; Su, Y.; Hu, T.; Wang, H.; Wang, H.; Peng, C.; Guo, Q. UAV-lidar aids automatic intelligent powerline inspection. Int. J. Electr. Power Energy Syst. 2021, 130, 106987. [Google Scholar] [CrossRef]
  26. Xing, J.; Cioffi, G.; Hidalgo-Carrió, J.; Scaramuzza, D. Autonomous power line inspection with drones via perception-aware MPC. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; IEEE: New York, NY, USA, 2023; pp. 1086–1093. [Google Scholar]
  27. Calvo, A.; Silano, G.; Capitán, J. Mission planning and execution in heterogeneous teams of aerial robots supporting power line inspection operations. In Proceedings of the 2022 International Conference on Unmanned Aircraft Systems (ICUAS), Dubrovnik, Croatia, 21–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 1644–1649. [Google Scholar]
  28. Li, Z.; Wang, Q.; Zhang, T.; Ju, C.; Suzuki, S.; Namiki, A. UAV high-voltage power transmission line autonomous correction inspection system based on object detection. IEEE Sens. J. 2023, 23, 10215–10230. [Google Scholar] [CrossRef]
  29. Wang, Q.; Wang, W.; Li, Z.; Namiki, A.; Suzuki, S. Close-Range Transmission Line Inspection Method for Low-Cost UAV: Design and Implementation. Remote Sens. 2023, 15, 4841. [Google Scholar] [CrossRef]
  30. Schofield, O.B.; Iversen, N.; Ebeid, E. Autonomous power line detection and tracking system using UAVs. Microprocess. Microsyst. 2022, 94, 104609. [Google Scholar] [CrossRef]
  31. Jenssen, R.; Roverso, D. Automatic autonomous vision-based power line inspection: A review of current status and the potential role of deep learning. Int. J. Electr. Power Energy Syst. 2018, 99, 107–120. [Google Scholar] [CrossRef]
  32. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardos, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  33. Yang, H.; Shi, J.; Carlone, L. Teaser: Fast and certifiable point cloud registration. IEEE Trans. Robot. 2020, 37, 314–333. [Google Scholar] [CrossRef]
  34. Arafat, M.Y.; Alam, M.M.; Moh, S. Vision-based navigation techniques for unmanned aerial vehicles: Review and challenges. Drones 2023, 7, 89. [Google Scholar] [CrossRef]
  35. Javaid, S.; Khan, M.A.; Fahim, H.; He, B.; Saeed, N. Explainable AI and monocular vision for enhanced UAV navigation in smart cities: Prospects and challenges. Front. Sustain. Cities 2025, 7, 1561404. [Google Scholar] [CrossRef]
  36. Cao, Z.; Kooistra, L.; Wang, W.; Guo, L.; Valente, J. Real-time object detection based on uav remote sensing A systematic literature review. Drones 2023, 7, 620. [Google Scholar] [CrossRef]
  37. Meier, L.; Honegger, D.; Pollefeys, M. PX4: A node-based multithreaded open source robotics framework for deeply embedded platforms. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; IEEE: New York, NY, USA, 2015; pp. 6235–6240. [Google Scholar]
  38. Vieira e Silva, A.L.B.; de Castro Felix, H.; Simoes, F.P.M.; Teichrieb, V.; dos Santos, M.; Santiago, H.; Sgottib, V.; Lott Neto, H. A dataset and benchmark for power line asset inspection in uav images. Int. J. Remote. Sens. 2023, 44, 7294–7320. [Google Scholar] [CrossRef]
  39. Tao, X.; Zhang, D.; Wang, Z.; Liu, X.; Zhang, H.; Xu, D. Detection of power line insulator defects using aerial images analyzed with convolutional neural networks. IEEE Trans. Syst. Man Cybern. Syst. 2018, 50, 1486–1498. [Google Scholar] [CrossRef]
  40. Madaan, R.; Maturana, D.; Scherer, S. Wire detection using synthetic data and dilated convolutional networks for unmanned aerial vehicles. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 3487–3494. [Google Scholar]
  41. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Computer Vision, Proceedings of the ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  42. Islam, S.B.; Chowdhury, M.E.H.; Hasan-Zia, M.; Kashem, S.B.A.; Majid, M.E.; Ansaruddin Kunju, A.K.; Khandakar, A.; Ashraf, A.; Nashbat, M. VisioDECT: A novel approach to drone detection using CBAM-integrated YOLO and GELAN-E models. Neural Comput. Appl. 2025, 37, 20181–20204. [Google Scholar] [CrossRef]
  43. Wang, L.; Letchmunan, S.; Xiao, R. Gelan-SE: Squeeze and stimulus attention based target detection network for gelan architecture. IEEE Access 2024, 12, 182259–182273. [Google Scholar] [CrossRef]
  44. Quan, Q.; Du, G.-X.; Cai, K.-Y. Proportional-integral stabilizing control of a class of MIMO systems subject to nonparametric uncertainties by additive-state-decomposition dynamic inversion design. IEEE/ASME Trans. Mechatron. 2015, 21, 1092–1101. [Google Scholar] [CrossRef]
  45. Jaiswal, S.K.; Agrawal, R. A Comprehensive Review of YOLOv5: Advances in Real-Time Object Detection. Int. J. Innov. Res. Comput. Sci. Technol. 2024, 12, 75–80. [Google Scholar] [CrossRef]
  46. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  47. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  48. Sohan, M.; Sai Ram, T.; Reddy, R.; Venkata, C. A review on yolov8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 529–545. [Google Scholar]
  49. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar]
  50. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  51. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
Figure 1. System structure of the object detection-based autonomous inspection method: First, tower images are collected to construct the dataset, which is then used to train and deploy the detection model. During the inspection flight, the onboard computer receives camera images, processes them to obtain the objects' pixel coordinates, and gathers the UAV's state. The waypoint generator combines the object pixels with the UAV states to derive navigation waypoints, and the cascade control system then executes these commands to enable autonomous flight.
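The caption above describes a per-frame processing loop (image acquisition, detection, state fusion, waypoint generation, control). A minimal sketch of such a loop is given below for orientation; the class and method names (camera.read, detector.infer, autopilot.read_uav_state, waypoint_generator.update, autopilot.send_waypoint) are illustrative placeholders and not the authors' actual software interface.

```python
# Minimal sketch of the per-frame loop outlined in Figure 1 (hypothetical interfaces).
import time

def inspection_loop(camera, detector, waypoint_generator, autopilot, rate_hz=20):
    """Camera image -> detection -> fuse with UAV state -> waypoint -> controller."""
    period = 1.0 / rate_hz
    while autopilot.mission_active():
        frame = camera.read()                  # latest RGB frame from the onboard camera
        boxes = detector.infer(frame)          # pixel-space bounding boxes per class
        state = autopilot.read_uav_state()     # position, altitude, heading (e.g., from RTK)
        waypoint = waypoint_generator.update(boxes, state)
        if waypoint is not None:               # the generator emits a target only when ready
            autopilot.send_waypoint(waypoint.position, waypoint.heading)
        time.sleep(period)
```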
Figure 2. Attention classes for tower detection: in contrast to public datasets, additional attention is given to the tower top and tower base; combined with the actual altitude variation, these classes can be used to estimate the tower's altitude and coordinates.
Figure 3. Structure of the lightweight backbone based on the improved GELAN model.
Figure 4. Illustration of the object detection-based waypoint generator: The UAV takes off from O, then ascends vertically to points A and B to collect two sets of information for estimating waypoint D. Subsequently, auxiliary points C and E are added to correct D. The UAV then flies toward the subsequent tower and completes data collection at points F, G, and H.
Figure 5. Estimation of the initial tower: the conversion relationship between relative pixel variation and UAV movement is estimated by recording multiple images and their corresponding UAV states, and the geographic coordinates of the tower are then estimated from the UAV's location. The yellow line represents the pre-designed climb-and-level flight maneuver.
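As a rough illustration of the principle in this caption, the sketch below derives a metres-per-pixel scale from a known climb (the UAV's altitude change versus the tower's pixel shift) and applies it to the tower's pixel offset, rotated into the north/east frame by the UAV heading. This is a simplified reading of the caption, not the authors' Stage-1 estimator; all names, the camera-frame convention, and the geometry are assumptions.

```python
# Illustrative sketch only: scale-from-climb idea, not the paper's exact algorithm.
import math

def metres_per_pixel(delta_altitude_m, pixel_shift):
    """Scale factor inferred from a known vertical climb and the resulting pixel shift of the tower."""
    return abs(delta_altitude_m) / abs(pixel_shift)

def tower_offset_ne(scale, px_offset_xy, uav_pos_ne, uav_yaw_rad):
    """Convert the tower's pixel offset from the image centre into a rough north/east estimate.
    px_offset_xy = (lateral_px, forward_px) in a simplified camera frame (assumed convention)."""
    lateral_m = scale * px_offset_xy[0]
    forward_m = scale * px_offset_xy[1]
    # Rotate the body-frame offsets into the north/east frame using the UAV heading.
    north = uav_pos_ne[0] + forward_m * math.cos(uav_yaw_rad) - lateral_m * math.sin(uav_yaw_rad)
    east = uav_pos_ne[1] + forward_m * math.sin(uav_yaw_rad) + lateral_m * math.cos(uav_yaw_rad)
    return north, east
```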
Figure 6. Initial tower coordinate correction: Following a similar principle to Stage 1, multiple sets of positional information are recorded to correct the geographic coordinates of the tower, with the yellow line representing the pre-designed flight maneuver process.
Figure 7. Precision and recall of each class in the ablation experiments: results are shown for three models, namely the baseline YOLOv8, the PIoU-improved variant, and the proposed model.
Figure 8. Verification under simulated foggy conditions and occlusion scenarios: (a) the original image; (b) the simulated rain and fog interference scenario; (c) the simulated occlusion scene.
Figure 9. Mean precision and recall of different models: classical network models from YOLOv5 to YOLOv12 are selected for comparison.
Figure 10. Detection results of different models in three scenarios: missed objects are highlighted with yellow circles; the tower base and tower top are the classes most commonly missed by the comparison models.
Figure 11. Inspection platform and system communication architecture. The quadrotor UAV is propelled by 335 KV motors (a) with 18-inch propellers (b) and powered by a 6S LiPo battery (c). The onboard system integrates a CAN-to-LAN converter module (d), an autopilot (e) with RTK positioning (g), an RK3588-based onboard computer (f), and an RGB camera (h).
Figure 12. Data acquisition and flight trajectory during inspection: The UAV takes off toward the tower and then follows the waypoint generator to complete the inspection task. Stage 1 uses points A and B; Stage 2 uses points C, D, and E; Stage 3 covers the second tower using points F, G, and H, and is then repeated for the third tower using points I, J, and K.
Figure 13. Detection results and UAV states involved in waypoint generation in Stage 1.
Figure 14. Detection results and UAV states involved in waypoint generation in Stage 2.
Figure 15. Detection results and UAV states involved in waypoint generation in Stage 3.
Table 1. Qualitative comparison of different navigation strategies for power line inspection.

Method | Sensor Cost | Map Dependency | Manual Intervention | Operational Foundation
[26] | Medium | Low | Low | Based on a camera sensor; requires manual guidance to the inspection target.
[27] | Medium | High | None | Based on a camera sensor; requires a map of tower locations.
[28] | High | High | None | Based on high-precision positioning devices and predefined waypoints.
[29] | Medium | High | High | Based on preset waypoints, radar, and camera sensors.
[30] | Medium | Medium | Medium | Based on solid-state LiDAR and cameras; requires manual guidance for UAVs to patrol targets.
[32] | Medium | Low | None | Relies on feature points; prone to tracking failure in texture-less backgrounds.
[33] | High | High | None | Relies on geometric registration; constrained by high payload weight and power consumption.
Proposed | Low | None | None | Based on a camera; waypoints are generated from detection results without prior tower coordinates.
Table 2. Detailed statistics of the distribution tower dataset. The dataset is split into training and validation sets with a ratio of approximately 8:2.

Object Class | Training Instances | Validation Instances | Total Instances
Tower Top | 6400 | 1550 | 7950
Tower Body | 6500 | 1570 | 8070
Tower Base | 5200 | 1260 | 6460
Crossarm | 9100 | 2200 | 11,300
Insulator | 26,500 | 6450 | 32,950
Total Images | 7000 | 1700 | 8700
Table 3. Statistics of detection performance of each model.

YOLO | mAP50 | mAP50:95 | Precision | Recall
v8n | 0.8845 | 0.6025 | 0.9133 | 0.8358
v8n-GELAN | 0.8836 | 0.6028 | 0.9110 | 0.8299
v8n-PIoU | 0.8972 | 0.6132 | 0.9128 | 0.8414
Proposed | 0.8971 | 0.6144 | 0.9189 | 0.8368
Table 4. Statistics of detection performance from YOLOv5 to YOLOv12.

YOLO | mAP50 | mAP50:95 | Precision | Recall | Parameters | FPS
v5s | 0.8897 | 0.5836 | 0.9099 | 0.8653 | 7.03 M | 36
v6n | 0.8842 | 0.5983 | 0.9160 | 0.8290 | 4.24 M | 50
v7t | 0.8864 | 0.5672 | 0.9030 | 0.8417 | 6.03 M | 54
v8n | 0.8845 | 0.6025 | 0.9133 | 0.8358 | 3.01 M | 52
v9t | 0.8853 | 0.6083 | 0.9112 | 0.8303 | 2.01 M | 39
v11n | 0.8792 | 0.5992 | 0.8994 | 0.8373 | 2.59 M | 50
v12n | 0.8834 | 0.5983 | 0.9162 | 0.8096 | 2.57 M | 44
Proposed | 0.8971 | 0.6144 | 0.9189 | 0.8368 | 2.01 M | 56
Table 5. Statistics of the tower coordinates, position deviations, and azimuth deviations during inspection.

Metric | Stage 1 | Stage 2 | Stage 3-1 | Stage 3-2
Estimated Coordinate | (118.67588544, 31.99312495) | (118.67589890, 31.99315348) | (118.67628796, 31.99345934) | (118.67665632, 31.99375512)
Actual Coordinate | (118.67589719, 31.99314869) | (118.67589719, 31.99314869) | (118.67628744, 31.99345916) | (118.67666469, 31.99376036)
N Distance error [m] | 2.641 | −0.533 | 0.02 | −0.583
E Distance error [m] | 1.111 | −0.161 | 0.05 | −0.790
Azimuth error [rad] | 0.004 | 0.006 | 0.001 | 0.008
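For reference, the N/E distance errors in Table 5 can be reproduced (up to sign convention) from the coordinate pairs with a small-angle approximation that converts latitude and longitude differences into metres. The snippet below is an illustrative check, not the authors' evaluation code; the Earth radius constant and the actual-minus-estimated sign convention are assumptions.

```python
# Small-angle (equirectangular) conversion of lat/lon differences to north/east metres.
import math

EARTH_RADIUS_M = 6_378_137.0  # WGS-84 equatorial radius (assumed value)

def ne_error_m(estimated, actual):
    """Coordinates are (longitude_deg, latitude_deg) tuples, as listed in Table 5."""
    lon_e, lat_e = estimated
    lon_a, lat_a = actual
    north = math.radians(lat_a - lat_e) * EARTH_RADIUS_M
    east = math.radians(lon_a - lon_e) * EARTH_RADIUS_M * math.cos(math.radians(lat_a))
    return north, east

# Stage 1 row: yields roughly (2.64, 1.11) m, consistent with the tabulated 2.641 m / 1.111 m.
print(ne_error_m((118.67588544, 31.99312495), (118.67589719, 31.99314869)))
```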