An Optimized Composite YOLO Model for Transmission Tower Detection in Satellite Optical Remote Sensing Imagery

Leng, Runming; Zhang, Guo; Hao, Weifeng; Guo, Bingxuan; Zhu, Chunyang

doi:10.3390/rs18101499


by Runming Leng ^1,2,3, Guo Zhang ^3, Weifeng Hao ^1,2,*, Bingxuan Guo ^3 and Chunyang Zhu ^3

^1 Chinese Antarctic Center of Surveying and Mapping, Wuhan University, Wuhan 430079, China

^2 Key Laboratory of Polar Environment Monitoring and Public Governance (Wuhan University), Ministry of Education, Wuhan 430079, China

^3 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China

^* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1499; https://doi.org/10.3390/rs18101499

Submission received: 14 March 2026 / Revised: 30 April 2026 / Accepted: 8 May 2026 / Published: 10 May 2026

(This article belongs to the Special Issue Artificial Intelligence-Driven Methods for Remote Sensing Target and Object Detection (Third Edition))


Highlights

What are the main findings?

A multi-source, multi-resolution satellite-only transmission tower dataset (HRS-PTD) is constructed, and statistical analysis reveals that over 75% of targets occupy less than 3.26% of the image area, with nearly two-thirds exhibiting slender, randomly oriented bounding boxes.
An optimized composite YOLO model integrating CARAFE upsampling and a direction-aware deformable convolution module (C_DCA) achieves 92.28% mAP on HRS-PTD, improving by 5.41 percentage points over RetinaNet while achieving 102.6 FPS, demonstrating superior accuracy–efficiency trade-off over representative classical detectors.

What are the implications of the main findings?

The proposed method demonstrates practical feasibility for large-scale transmission tower detection in real satellite imagery, achieving correct detection rates of 88% on Google Earth and 76% on Gaofen-7 imagery, with particularly pronounced gains under lower-resolution and weak-feature conditions.
The complementary design of CARAFE and C_DCA offers a transferable framework for detecting small, slender, and randomly oriented objects in high-resolution satellite remote sensing imagery beyond transmission towers.

Abstract

Safe low-altitude flight requires precise perception of obstacles such as widespread transmission towers. Traditional inspection is often costly and inefficient. While satellite remote sensing enables automated detection, transmission towers exhibit small scales, slender structures, and random orientations, causing feature loss and receptive field mismatch. This study constructs HRS-PTD, a multi-source, multi-resolution satellite optical dataset, and analyzes target morphology. We then propose an optimized composite YOLO model using a streamlined three-stage baseline with C3k2 and SPPF modules. To enhance small object feature reconstruction, CARAFE is integrated into the upsampling path to provide content-aware dynamic kernels. Furthermore, a direction-aware C_DCA module, incorporating deformable convolutions, utilizes multi-directional strip branches and adaptive attention to improve slender target representation. Ablation experiments show the model achieves 92.28% mAP, with precision and recall increasing by 1.53 and 12.12 percentage points over the baseline. Comparative experiments against representative classical detectors further demonstrate that the proposed model achieves superior overall performance in both detection accuracy and inference efficiency. Tests on Google Earth and Gaofen-7 imagery yield correct detection rates of 88% and 76%, respectively, confirming real-world feasibility.

Keywords: high-resolution satellite remote sensing imagery; deep learning; transmission tower detection

1. Introduction

With the rapid expansion of the low-altitude economy, a growing number of aircraft are entering low-altitude airspace [1], increasing the demand for obstacle information to support airspace resource management and flight safety [2]. Transmission towers, as representative navigation hazards, are widely distributed and pose potential threats to aircraft operations [3,4]. However, existing navigation systems and traditional aeronautical charts lack robust mechanisms for identifying and updating such obstacles, leaving significant safety blind spots for low-altitude flights.

Traditional methods for transmission tower detection rely on manual measurement or visual interpretation, which are characterized by low efficiency and a high dependence on expert experience. Recent advancements in photogrammetry have enabled the use of ultra-high-resolution aerial imagery captured by unmanned aerial vehicles (UAVs) as a primary data source for many studies [5]. For instance, the Electric Transmission and Distribution Infrastructure Imagery Dataset developed by Duke University [6] is widely utilized. Wang et al. [7] constructed an aerial photography dataset to compare the performance of Faster R-CNN and YOLOv3 in tower detection, while other researchers have developed custom datasets blending aerial and satellite imagery [8]. Nevertheless, UAVs are constrained by operational costs and airspace regulations [9], making rapid large-scale detection difficult to achieve. In contrast, high-resolution optical satellite remote sensing imagery offers advantages such as extensive coverage, low acquisition costs, and stable revisit cycles, providing a reliable data source for the automated extraction of large-scale transmission tower information [10,11]. However, due to the substantial resolution gap between aerial and satellite imagery, models trained on aerial data exhibit limited generalization on relatively lower-resolution satellite images. Furthermore, current research lacks a systematic analysis of the specific imaging characteristics of transmission towers in satellite imagery.

Transmission towers exhibit distinct small-target characteristics in large-swath remote sensing images. Their hollow structures and weak texture information make them easily confused with complex backgrounds. To address these detection challenges, researchers have conducted studies based on various classic object detection networks. Haroun et al. [12] utilized RetinaNet to detect towers in satellite images and extracted power corridors through a virtual path algorithm. Yuan [13] addressed tower detection in wide-field optical satellite imagery using Faster R-CNN combined with sliding window cropping and NMS, achieving a detection accuracy of over 83%. Zou et al. [14] proposed a novel pylon detection model, the Dynamic Focal Network, which utilizes dynamic convolution to handle variations in pylon size and applies a focus mechanism to reduce interference from complex backgrounds. Shi et al. [8] enhanced multi-scale context through large selective kernel feature fusion, improving detection accuracy and robustness in complex environments. Hou et al. [11] introduced MD-YOLO, which combines large selective kernel networks with space-to-depth cross-stage partial networks, increasing the mAP for small target detection by 2.2%. Although existing studies have made significant progress in transmission tower detection within remote sensing imagery, deficiencies remain in small-target feature representation, the perception of unique tower structures, and generalization across medium-to-high resolutions. Consequently, these methods struggle to meet the practical requirements for large-scale satellite-based tower detection.

To address these issues, this study constructs a multi-source, multi-resolution satellite imagery transmission tower dataset (HRS-PTD) and implements targeted model improvements. A detection method based on an optimized composite YOLO network is proposed for transmission tower detection in satellite optical imagery. Considering the small scale and slender structure of transmission towers, a composite baseline architecture is constructed using C3k2 blocks and SPPF modules based on the classic three-stage YOLO framework. To prevent the loss of small-target features during multi-scale fusion, the Content-Aware ReAssembly of FEatures (CARAFE) upsampling module is integrated into the Feature Pyramid Network (FPN). Furthermore, a direction-aware feature extraction module integrated with deformable convolution (C_DCA) is designed to accommodate the slender morphology and random orientations of transmission towers. The effectiveness of each module is verified through ablation experiments, and the detection performance and practical value of the proposed method are demonstrated on real-world satellite imagery.

The main contributions of this work are as follows:

Construction of the HRS-PTD dataset, which collects transmission tower samples from multi-source and multi-resolution satellite optical imagery to enhance data diversity and model generalization.
Proposal of a detection framework based on an optimized composite YOLO network, employing an efficient baseline architecture combining C3k2 and SPPF modules while introducing CARAFE upsampling to strengthen multi-scale feature reconstruction for small targets.
Design of an orientation-aware feature extraction module, C_DCA, which incorporates deformable convolutions to adaptively adjust the shape and orientation of the receptive field, effectively representing slender and randomly oriented transmission tower targets.

The remainder of this paper is organized as follows. Section 2 describes the construction and statistical analysis of the HRS-PTD dataset, followed by a detailed presentation of the proposed optimized composite YOLO model, including the BaseYOLO baseline design, the CARAFE upsampling module, and the C_DCA direction-aware module. Section 3 presents the experimental setup, ablation studies, comparative experiments on the HRS-PTD test set, and application validation on real large-swath satellite imagery. Section 4 provides a discussion of the results and limitations. Section 5 concludes the paper and outlines future research directions.

2. Methods

This study aims to achieve the precise detection of transmission towers based on high-resolution satellite remote sensing imagery. The HRS-PTD dataset is first constructed, followed by the proposal of a transmission tower detection method for satellite remote sensing images based on an optimized composite YOLO network. Specifically, CARAFE and the C_DCA deformable orientation-aware module are integrated into a C3k2 + SPPF composite baseline to enhance the recovery of small-target features and morphological adaptation for slender objects.

2.1. HRS-PTD Dataset Construction

Currently available public datasets for transmission tower detection often mix satellite and aerial imagery. While the latter reaches centimeter-level resolution and facilitates higher detection accuracy, it is constrained by airspace regulations, operational costs, and limited coverage, making it difficult to meet the demands of large-scale inspection and low-altitude obstacle avoidance detection. Furthermore, models trained on aerial data exhibit limited generalization capabilities when applied to relatively lower-resolution satellite imagery.

Based on these considerations, this study constructs a transmission tower detection dataset composed exclusively of satellite-sourced remote sensing imagery, termed HRS-PTD. Image samples are collected from three representative high-resolution data sources: the Google Earth imagery service, which provides multispectral optical imagery aggregated from commercial satellite sensors; the Gaofen-7 (GF-7) satellite, a Chinese domestically developed sub-meter stereo mapping satellite; and the Esri World Imagery service, which provides high-resolution optical imagery tiles derived from commercial satellite acquisitions. Typical regions containing transmission towers were selected from the aforementioned image sources and cropped into image patches ranging from 256 × 256 pixels to 640 × 640 pixels. The LabelImg tool (version 1.8.6) was employed for horizontal bounding box annotation, with the labels saved in YOLO format. The dataset was partitioned into training, validation, and test sets according to an 8:1:1 ratio. The final dataset comprises 1575 images and 2577 annotated instances. The characteristics and sample statistics of these data sources are summarized in Table 1.
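The 8:1:1 partition can be illustrated with a short, seeded shuffle. This is a sketch only: the function name `split_dataset`, the seed, and the file-name pattern are assumptions made for illustration, and the text does not specify how the authors actually performed the split.

```python
import random


def split_dataset(items, parts=(8, 1, 1), seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists.

    parts holds integer proportions (8:1:1 here); integer arithmetic avoids
    floating-point rounding surprises when computing the cut points.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(parts)
    n_train = len(shuffled) * parts[0] // total
    n_val = len(shuffled) * parts[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical file names; 1575 matches the image count reported for HRS-PTD.
image_names = [f"img_{i:04d}.png" for i in range(1575)]
train, val, test = split_dataset(image_names)
```

With 1575 images, this particular rounding rule gives 1260 training, 157 validation, and 158 test images; a different rounding rule would shift one image between the last two subsets.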

To ensure diversity and representativeness, sample collection covered various geographical regions, seasonal phases, lighting conditions, and diverse land cover backgrounds—including farmland, mountainous terrain, urban areas, and forest land—as illustrated in Figure 1. This variety is intended to enhance the generalization capabilities of the model across complex scenarios.

Following the construction of the HRS-PTD dataset, a statistical analysis of the 2577 annotated transmission tower instances was conducted. The target characteristics of the towers were systematically explored across three dimensions—scale, morphology, and shadow—to identify the core challenges of the detection task and provide a data-driven basis for improvement strategies.

Scale characteristics:

Regarding scale characteristics, typical transmission towers are 8 to 15 m wide and 20 to 120 m tall, so their physical footprint is relatively small. Because satellite imaging angles and altitudes vary, the imaged size of these targets in satellite imagery also varies significantly. Figure 2 shows transmission towers of different sizes captured from the same region in Google Earth satellite imagery (0.3–0.5 m resolution), with each sample patch being 640 × 640 pixels.

To better quantify the scale characteristics of transmission tower targets in satellite imagery, this study performed a Kernel Density Estimation (KDE) analysis on the normalized area $S_{\mathrm{norm}}$ of all annotated instances in the dataset, as shown in Equation (1).

$$S_{\mathrm{norm}} = \frac{w}{W} \times \frac{h}{H} \tag{1}$$

where $w$ and $h$ represent the pixel width and height of the bounding box, respectively, and $W$ and $H$ represent the actual width and height of the corresponding image. The results are illustrated in Figure 3.

From the perspective of distribution morphology, the normalized area exhibits a strong right-skewed distribution, with the density peak occurring near the minimum values. Specifically, 75% of the targets have a normalized area of less than 3.26%. Such small targets occupy a very limited proportion of satellite imagery, making it difficult for general detectors to achieve accurate identification [15]. This represents one of the common bottlenecks in the field of remote sensing object detection. Addressing the problem of feature degradation caused by excessively small target scales [16] will be a core focus and challenge of the subsequent research in this study.
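The normalized-area statistic is straightforward to compute from the annotations. The sketch below is illustrative: the helper names and the sample boxes are assumptions, not values from HRS-PTD. It relies on the fact that YOLO-format labels store box width and height already divided by the image width and height, so the normalized area of Equation (1) is simply their product.

```python
import statistics


def normalized_area(w, h, img_w, img_h):
    """Equation (1): fraction of the image covered by a w x h pixel box."""
    return (w / img_w) * (h / img_h)


def area_from_yolo_line(line):
    """Normalized area from one YOLO label line: 'class xc yc w h' (all normalized)."""
    _cls, _xc, _yc, w_n, h_n = line.split()
    return float(w_n) * float(h_n)


# Illustrative boxes only (pixel w, pixel h, image W, image H).
boxes = [(20, 60, 640, 640), (35, 35, 640, 640), (15, 90, 512, 512), (120, 200, 640, 640)]
areas = [normalized_area(*b) for b in boxes]

# Third quartile: the normalized area below which about 75% of instances fall.
q3 = statistics.quantiles(areas, n=4)[2]
```

Applied to all annotated instances, the third quartile computed this way is the kind of threshold the text reports as 3.26%.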

Morphological characteristics:

Regarding morphological characteristics, as illustrated in Figure 4, the projected poses of the transmission towers in the samples can be summarized into four imaging angles: top-down, front, oblique, and side views. Except for the top-down perspective, where the tower body presents an approximately square bounding box, the bounding boxes in other perspectives generally exhibit distinct differences between length and width.

Statistical analysis of the bounding box morphology (Figure 5) reveals a distinct bimodal aspect ratio distribution, corresponding to vertically and horizontally oriented tower bodies respectively, with significantly elongated targets accounting for nearly two-thirds of the total. This bimodal distribution statistically confirms the slender nature and random orientation of the transmission tower targets—the uncertainty in both the power line routing and the satellite imaging angle results in the random distribution of the projection directions.

The aforementioned analysis indicates that the bounding boxes of transmission tower targets possess two prominent morphological characteristics: slenderness (high elongation ratio) and directional randomness (bimodal distribution of aspect ratios). The square receptive field of standard convolutions cannot adaptively align with these slender and randomly oriented targets, which will serve as a direction for subsequent improvements to the object detection model.
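One way to examine slenderness and orientation together is to look at the logarithm of the width-to-height ratio, which is symmetric around zero for tall and wide boxes. The sketch below is a hedged illustration: the elongation threshold of 1.5 and the sample boxes are assumptions, since the text does not state the criterion used to label a target as significantly elongated.

```python
import math


def log_aspect_ratio(w, h):
    """log(w / h): negative for tall boxes, positive for wide boxes, 0 for squares."""
    return math.log(w / h)


def elongated_fraction(boxes, ratio_threshold=1.5):
    """Share of boxes whose longer side exceeds the shorter side by ratio_threshold."""
    if not boxes:
        return 0.0
    limit = math.log(ratio_threshold)
    count = sum(1 for w, h in boxes if abs(log_aspect_ratio(w, h)) > limit)
    return count / len(boxes)


# Illustrative (w, h) pairs in pixels; not taken from the dataset.
sample = [(20, 60), (60, 20), (30, 32), (15, 90), (80, 25), (40, 40)]
fraction = elongated_fraction(sample)
```

A histogram of `log_aspect_ratio` over a dataset with both vertically and horizontally projected towers would show two peaks on either side of zero, which is one way a bimodal aspect ratio distribution can appear.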

Shadow characteristics:

Regarding shadow characteristics, the transmission tower instances in the dataset encompass three scenarios: long shadows, short shadows, and no obvious shadows, as illustrated in Figure 6. The length and direction of the shadows vary significantly with changes in the solar elevation and azimuth angles. The shadows formed by the tower body and conductors on the ground will also serve as one of the characteristic bases for the model to identify transmission tower targets.

2.2. Improvement Based on Optimized Composite YOLO Network

2.2.1. YOLO Baseline Architecture

YOLO (You Only Look Once) [17] is a representative class of single-stage object detection frameworks. The core philosophy of this framework is to unify the object detection task as a regression problem, enabling the simultaneous prediction of categories and positions for all targets through a single forward inference pass on the input image. This approach balances detection accuracy and real-time performance, making it well-suited for the application scenarios in this study. Since its inception, the YOLO series has undergone numerous iterations [18,19,20]. However, the three-stage “Backbone–Neck–Head” architecture has remained fundamentally stable since the introduction of YOLOv3 [19]. In particular, following the improvements in YOLOv8, such as the Cross Stage Partial Bottleneck with Two Convolutions (C2f) module and the anchor-free detection head, the collaborative mechanisms across multi-scale feature extraction, bidirectional feature fusion, and decoupled detection heads have been thoroughly validated in extensive research. Given the maturity of its ecosystem and its engineering reliability [21,22], this study adopts this three-stage architecture as the baseline framework.

2.2.2. Composite Baseline Network Design for Transmission Tower Detection

In response to the specific requirements of small-target transmission tower detection in satellite imagery, this study designs a baseline network following the principle of “task-driven module selection.” Considering characteristics such as the small size of transmission tower targets, key modules from the YOLO series are purposefully selected and combined to construct BaseYOLO, a composite baseline network tailored to this task.

First, the C3k2 module introduced in YOLO11 [23] was selected as the fundamental feature extraction unit. Compared to the C2f module in YOLOv8, C3k2 replaces a single bottleneck with a cascaded structure of two smaller convolutional kernels. This maintains an equivalent receptive field while reducing parameter count and computational overhead. The multi-branch feature aggregation approach of this module is also conducive to preserving shallow spatial detail information [24,25], thereby enhancing the detection capability for small transmission tower targets.
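The receptive-field and parameter argument for cascaded small kernels can be checked with a short worked calculation. It assumes stride-1 convolutions, equal input and output channel width C, and no bias terms, and it compares two stacked 3 × 3 convolutions with a single 5 × 5 convolution; these are illustrative assumptions rather than the exact layer configuration of C3k2.

```latex
% Two stacked 3x3 convolutions versus one 5x5 convolution (stride 1, C channels in and out)
\begin{aligned}
\text{receptive field of the stack: } & 3 + (3 - 1) = 5 \;\Rightarrow\; 5 \times 5 \\
\text{parameters, two } 3 \times 3 \text{ layers: } & 2 \cdot 3 \cdot 3 \cdot C^{2} = 18\,C^{2} \\
\text{parameters, one } 5 \times 5 \text{ layer: } & 5 \cdot 5 \cdot C^{2} = 25\,C^{2} \\
\text{relative saving: } & 1 - \tfrac{18}{25} = 28\%
\end{aligned}
```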

Secondly, YOLO11 introduced the C2-module with Partial Self-Attention (C2PSA) [26] after the Spatial Pyramid Pooling—Fast (SPPF) at the end of the Backbone, aiming to enhance global context awareness of deep features through a local self-attention mechanism [27]. However, C2PSA is located at the deepest P5 layer of the Backbone. After four downsampling operations, the spatial resolution of the feature map is at its lowest, and the spatial response of small transmission tower targets at this layer is inherently extremely sparse, resulting in limited gains from the self-attention mechanism. Furthermore, the additional computational overhead introduced by C2PSA offers insufficient benefits for the lightweight YOLO11n used in this study. Therefore, this paper retains only the computationally efficient SPPF module for multi-scale receptive field aggregation to maintain an effective response to small target features.

In summary, the BaseYOLO designed in this paper is illustrated in Figure 7. This network synthesizes representative YOLO versions within the classic three-stage architecture, purposefully combining the efficient C3k2 feature extraction module from YOLO11 with the concise SPPF Backbone-end design from YOLOv5, resulting in a streamlined and efficient baseline network oriented toward small target detection scenarios. Each design decision is grounded in task-specific considerations: the C3k2 module is preserved for its small-target-friendly cascaded small-kernel structure, the C2PSA module is excluded due to its limited self-attention gains at the spatially sparse P5 feature map, and the SPPF module is retained for its proven efficiency in multi-scale receptive field aggregation.

2.2.3. CARAFE: Content-Aware ReAssembly of Features

Content-Aware ReAssembly of Features (CARAFE) was first proposed by Wang et al. in 2019 [28]. The core concept of this module is to dynamically generate upsampling kernels based on the content of the input feature map, achieving content-aware feature reassembly. This addresses the issue where conventional nearest-neighbor interpolation upsampling methods tend to cause blurring and loss of target features during the cross-scale transmission process [29,30].

The workflow of the CARAFE module consists of two sub-modules, as illustrated in Figure 8.

The Kernel Prediction Module generates reassembly kernels dynamically for each upsampling position based on the content of the input feature map. Specifically, a 1 × 1 convolution first compresses the input feature map $X \in \mathbb{R}^{C \times H \times W}$ by reducing the channel count from $C$ to $C_{m}$ ($C_{m} \ll C$) to minimize computational overhead. A $k_{\mathrm{encoder}} \times k_{\mathrm{encoder}}$ convolutional layer then encodes the compressed features to produce kernel prediction features that incorporate local contextual information. Subsequently, a PixelShuffle operation increases the spatial resolution of the feature map by a factor of $\sigma$, where $\sigma$ represents the upsampling ratio, ensuring that each target position corresponds to a specific set of predicted kernel parameters. Finally, Softmax normalization is applied to these parameters to ensure the sum of reassembly weights equals one, resulting in the upsampling kernel $W \in \mathbb{R}^{k_{\mathrm{up}}^{2} \times \sigma H \times \sigma W}$, where $k_{\mathrm{up}}$ denotes the spatial dimension of the reassembly kernel.
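The shape bookkeeping of the PixelShuffle step can be sketched in plain Python. The sketch assumes that the encoder emits sigma² · k_up² channels, as in the original CARAFE design (the text here does not state the channel count), and it follows the channel ordering used by `torch.nn.PixelShuffle`. The function name and the toy sizes are illustrative.

```python
def pixel_shuffle(x, sigma):
    """Rearrange a (c * sigma^2, H, W) nested list into (c, sigma*H, sigma*W).

    Same ordering as torch.nn.PixelShuffle:
    out[ch][i*sigma + a][j*sigma + b] = x[ch*sigma^2 + a*sigma + b][i][j]
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (sigma * sigma):
        raise ValueError("channel count must be divisible by sigma^2")
    c_out = c_in // (sigma * sigma)
    out = [[[0.0] * (w * sigma) for _ in range(h * sigma)] for _ in range(c_out)]
    for ch in range(c_out):
        for a in range(sigma):
            for b in range(sigma):
                src = x[ch * sigma * sigma + a * sigma + b]
                for i in range(h):
                    for j in range(w):
                        out[ch][i * sigma + a][j * sigma + b] = src[i][j]
    return out


sigma, k_up, H, W = 2, 3, 2, 2
# Toy encoder output: sigma^2 * k_up^2 = 36 channels, each filled with its own index.
encoded = [[[float(c)] * W for _ in range(H)] for c in range(sigma ** 2 * k_up ** 2)]
kernels = pixel_shuffle(encoded, sigma)
shape = (len(kernels), len(kernels[0]), len(kernels[0][0]))
```

After the shuffle, each of the sigma·H × sigma·W output positions carries k_up² values, which are the raw scores that Softmax turns into one reassembly kernel per position.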

The Content-Aware ReAssembly Module utilizes the kernels produced by the prediction module to perform weighted reassembly on the input feature map. For each position $(i', j')$ in the upsampled feature map, the feature value is calculated by taking a weighted sum of the features within a $k_{\mathrm{up}} \times k_{\mathrm{up}}$ neighborhood centered at the corresponding source position in the input map:

$$Y(i', j') = \sum_{(m, n) \in N(i, j)} W_{(i', j'), (m, n)} \cdot X_{(m, n)} \tag{2}$$

where $N(i, j)$ represents the $k_{\mathrm{up}} \times k_{\mathrm{up}}$ neighborhood of the source position $(i, j)$, and $W_{(i', j')}$ is the dynamic reassembly kernel for that location. Since the reassembly kernel for each position is predicted adaptively based on the input content, CARAFE aggregates semantically relevant features effectively, maintaining sharp feature responses at object boundaries while ensuring smooth integration in background regions.
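The reassembly step of Equation (2) can be written out for a single channel in plain Python. This is a minimal sketch under stated assumptions: the raw kernel scores are supplied directly instead of being predicted by convolutions, the source position is taken as the integer division of the output index by the upsampling ratio (as in the original CARAFE formulation), and positions outside the map are treated as zeros. The function names are illustrative.

```python
import math


def softmax(scores):
    """Normalize raw kernel scores so the reassembly weights sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def carafe_reassemble(x, raw_kernels, sigma=2, k_up=3):
    """Equation (2) for one channel.

    x           : H x W nested list (input feature map).
    raw_kernels : (sigma*H) x (sigma*W) nested list; each entry holds
                  k_up * k_up unnormalized scores for that output position.
    """
    h, w = len(x), len(x[0])
    r = k_up // 2
    out = [[0.0] * (sigma * w) for _ in range(sigma * h)]
    for i_out in range(sigma * h):
        for j_out in range(sigma * w):
            i, j = i_out // sigma, j_out // sigma  # source position (i, j)
            weights = softmax(raw_kernels[i_out][j_out])
            acc, idx = 0.0, 0
            for m in range(i - r, i + r + 1):
                for n in range(j - r, j + r + 1):
                    if 0 <= m < h and 0 <= n < w:  # zero padding outside the map
                        acc += weights[idx] * x[m][n]
                    idx += 1
            out[i_out][j_out] = acc
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]

# Nearly all weight on the centre tap: behaves like nearest-neighbor upsampling.
center_only = [0.0] * 9
center_only[4] = 50.0
y_nearest = carafe_reassemble(x, [[center_only] * 6 for _ in range(6)])

# Equal scores: every output becomes the mean of its 3 x 3 source neighborhood.
uniform = [0.0] * 9
y_smooth = carafe_reassemble(x, [[uniform] * 6 for _ in range(6)])
```

The two toy kernels mark the extremes between which content-aware kernels can move: copying the source value exactly, or averaging over the whole neighborhood.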

In the modified model proposed in this study, the two nearest-neighbor upsampling operations in the top-down path of the BaseYOLO neck are replaced with CARAFE modules as shown in Figure 9. This replacement allows high-level semantic features to better retain spatial details during downward propagation, which specifically enhances the feature representation quality for small-scale transmission tower targets.

2.2.4. C_DCA: Direction-Aware Feature Extraction Module Integrated with Deformable Convolution

The C3k2 module retained in the BaseYOLO architecture is based on the Cross Stage Partial (CSP) network concept, which splits input features into two parts along the channel dimension. One part extracts features progressively through a series of bottleneck blocks composed of multiple small convolutional kernels ($\mathrm{kernel\_size} = 3 \times 3$), while the other part preserves original information via a direct shortcut connection. These two paths are eventually concatenated and fused to maintain sufficient feature representation capability while significantly reducing the number of parameters and computational overhead.

However, each bottleneck sub-block in C3k2 employs standard square convolutional kernels, which possess isotropic receptive fields [31]. This design is appropriate for objects with regular shapes and consistent orientations, but for transmission towers, which are slender and randomly oriented, isotropic receptive fields cannot align with the principal directions of the targets. This misalignment results in a large number of kernel parameters being used to encode irrelevant background regions, leading to suboptimal feature extraction efficiency.

To address this issue, this study proposes the Direction-Aware Convolution (DCA) module. The core idea is to decompose feature extraction into multiple orthogonal directional branches and to adaptively weight these branches for each input through an attention mechanism [32,33]. The DCA module consists of three parallel convolutional branches as shown in Figure 10. The standard branch utilizes a conventional $3 \times 3$ convolution to capture local spatial patterns and provide an isotropic feature baseline. The vertical branch employs a $k \times 1$ strip convolution (default $k = 7$) to extract continuous structural features along the vertical direction, while the horizontal branch uses a $1 \times k$ strip convolution to extract continuous structural features along the horizontal direction. These three branches share the same input feature map and generate output feature maps $f_{\mathrm{std}}$, $f_{\mathrm{ver}}$, and $f_{\mathrm{hor}}$ of identical dimensions.
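The directional selectivity of the strip branches can be illustrated on a toy feature map. In the sketch below, equal-weight averaging kernels stand in for learned weights, and the 9 × 9 input containing a one-pixel-wide vertical line is made up for illustration; neither is taken from the paper.

```python
def conv2d_same(x, kernel):
    """Zero-padded 2D cross-correlation that keeps the input size (odd kernel sizes)."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    m, n = i + a - ph, j + b - pw
                    if 0 <= m < h and 0 <= n < w:
                        acc += kernel[a][b] * x[m][n]
            out[i][j] = acc
    return out


def averaging_kernel(kh, kw):
    """Kernel with equal weights; a stand-in for learned weights in this sketch."""
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]


k = 7
std_kernel = averaging_kernel(3, 3)  # standard branch, 3 x 3
ver_kernel = averaging_kernel(k, 1)  # vertical strip branch, k x 1
hor_kernel = averaging_kernel(1, k)  # horizontal strip branch, 1 x k

# 9 x 9 map with a one-pixel-wide vertical line in the centre column.
size = 9
c = size // 2
line = [[1.0 if j == c else 0.0 for j in range(size)] for _ in range(size)]

resp_std = conv2d_same(line, std_kernel)[c][c]
resp_ver = conv2d_same(line, ver_kernel)[c][c]
resp_hor = conv2d_same(line, hor_kernel)[c][c]
```

At the centre pixel the vertical strip covers only line pixels, the 3 × 3 kernel covers three line pixels out of nine taps, and the horizontal strip covers one out of seven, so the responses are ordered vertical > standard > horizontal for this vertically oriented structure.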

The three branches provide complementary directional features, and the DCA module introduces a branch-level attention mechanism to achieve adaptive directional selection. Specifically, the outputs of the three branches are stacked along a new dimension to form $F_{\mathrm{stack}} \in \mathbb{R}^{3 \times C \times H \times W}$. Global average pooling is then applied along the spatial dimensions to obtain a global descriptor $z \in \mathbb{R}^{3C}$:

$$z = \mathrm{GAP}(F_{\mathrm{stack}}) \tag{3}$$

A two-layer bottleneck fully connected network maps $z$ to branch importance scores with reduction ratio $r = 4$:

$$s = W_{2} \cdot \delta(W_{1} \cdot z), \quad W_{1} \in \mathbb{R}^{(3C/r) \times 3C}, \quad W_{2} \in \mathbb{R}^{3 \times (3C/r)} \tag{4}$$

where $\delta(\cdot)$ denotes the ReLU activation. Softmax normalization is then applied to produce the normalized branch weights $w = [w_{d}, w_{v}, w_{h}]$:

$$[w_{d}, w_{v}, w_{h}] = \mathrm{Softmax}(s) \tag{5}$$

The branch weights are learned end-to-end through back-propagation during training rather than manually specified. The network automatically determines the relative contribution of each directional branch based on the global feature statistics of the input, adapting to the dominant orientation characteristics of the targets present in the feature map. The final fused features are obtained through a weighted sum of the three branches:

$$Y_{\mathrm{fuse}} = w_{d} \cdot f_{\mathrm{std}} + w_{v} \cdot f_{\mathrm{ver}} + w_{h} \cdot f_{\mathrm{hor}} \tag{6}$$

This attention mechanism allows the network to dynamically adjust the contribution of each directional branch based on the input content. For non-slender structures, such as tower bases or approximately square objects, the standard branch weight $w_{d}$ provides a baseline guarantee; for vertically oriented tower bodies, the vertical branch weight $w_{v}$ is expected to dominate; and for horizontally oriented towers, the horizontal branch weight $w_{h}$ is expected to dominate. The fused features undergo a 1 × 1 convolution for channel projection to produce the final output of the DCA module.
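Equations (3) to (6) can be traced end to end with small nested lists. The sketch below is illustrative only: the random matrices stand in for learned parameters, the toy sizes (C = 4, 2 × 2 maps) are assumptions, and the final 1 × 1 projection is omitted.

```python
import math
import random


def gap(branches):
    """Equation (3): spatial mean of every channel of every branch, flattened to length 3C."""
    z = []
    for feat in branches:  # feat has shape C x H x W
        for channel in feat:
            n = len(channel) * len(channel[0])
            z.append(sum(sum(row) for row in channel) / n)
    return z


def matvec(mat, vec):
    """Matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def branch_weights(z, w1, w2):
    """Equations (4)-(5): bottleneck MLP with ReLU, then softmax over the three branches."""
    hidden = [max(0.0, v) for v in matvec(w1, z)]
    s = matvec(w2, hidden)
    peak = max(s)
    exps = [math.exp(v - peak) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(branches, weights):
    """Equation (6): weighted sum of the three branch outputs."""
    c, h, w = len(branches[0]), len(branches[0][0]), len(branches[0][0][0])
    return [[[sum(wt * br[ch][i][j] for wt, br in zip(weights, branches))
              for j in range(w)] for i in range(h)] for ch in range(c)]


# Toy sizes: C = 4 channels, 2 x 2 maps, reduction ratio r = 4, hidden width 3C / r = 3.
rng = random.Random(0)
C, H, W, r = 4, 2, 2, 4
branches = [[[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
            for _ in range(3)]  # stand-ins for f_std, f_ver, f_hor
w1 = [[rng.uniform(-1, 1) for _ in range(3 * C)] for _ in range(3 * C // r)]
w2 = [[rng.uniform(-1, 1) for _ in range(3 * C // r)] for _ in range(3)]

z = gap(branches)
w_d, w_v, w_h = branch_weights(z, w1, w2)
y_fuse = fuse(branches, [w_d, w_v, w_h])
```

Because the descriptor is pooled over the whole map, a single triple of weights is shared by all spatial positions and channels of a given input; what is trained are the two matrices, while the weights themselves change with each input.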

Finally, the DCA module is integrated into the C3k2 architecture to form the C_DCA module, where the second 3 × 3 convolution in each bottleneck block is replaced by a DCA module. The two cascaded 3 × 3 convolutions in the original bottleneck are thus replaced by one 3 × 3 convolution followed by one DCA module, enabling each bottleneck to acquire direction-aware capabilities while retaining local feature extraction performance.

By combining these two improvements, CARAFE enhances the quality of multi-scale feature fusion through content-aware dynamic upsampling kernels, while C_DCA strengthens the feature extraction capability for slender, randomly oriented targets via multi-directional strip convolutions and adaptive attention. Together, these components improve the detection accuracy of transmission tower targets in satellite imagery from the perspectives of both feature fusion and feature extraction.

3. Experiments and Results

3.1. Experiment Setup and Evaluation Metrics

The experiments are conducted on an Intel Core i9-10980XE CPU and an NVIDIA GeForce RTX 3090 GPU (24 GB VRAM) using Python 3.9 and the PyTorch 2.8 deep learning framework, with CUDA 12.6. Models are trained on the HRS-PTD dataset for 150 epochs with a batch size of 16. Input images are uniformly resized to 640 × 640 pixels. During the training process, all baseline models adopt their officially recommended hyperparameter configurations and are fine-tuned under identical conditions to ensure a fair comparison.

Precision, Recall, and mean Average Precision (mAP) are employed to evaluate the detection performance of the models on the test set. Precision (P) measures the proportion of true positive samples among those predicted as positive, while Recall (R) measures the proportion of actual positive samples correctly detected by the model. mAP comprehensively reflects the overall detection performance across different confidence thresholds. These metrics are defined as follows:

$$P = \frac{TP}{TP + FP}, \tag{7}$$

$$R = \frac{TP}{TP + FN}, \tag{8}$$

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{9}$$

where $TP$, $FP$, and $FN$ represent the numbers of true positives, false positives, and false negatives, respectively, $N$ denotes the total number of categories, and $AP_{i}$ is the average precision of the $i$-th category.
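The three metrics can be computed from detection counts and ranked detections as follows. This is a hedged sketch: the text does not specify the AP interpolation scheme or the IoU threshold, so the all-point area under the monotone precision-recall envelope is used here as an assumption, and the counts and detections are made up.

```python
def precision(tp, fp):
    """Equation (7)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Equation (8)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(detections, n_gt):
    """Area under the precision-recall curve for one class.

    detections : list of (confidence, is_true_positive) pairs.
    n_gt       : number of ground-truth instances of the class.
    """
    if n_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _conf, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_gt)
    # Make precision non-increasing from left to right (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Equation (9): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Illustrative numbers only.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], n_gt=2)
m_ap = mean_average_precision([ap])
```

When a dataset contains a single category, N equals 1 and the mAP reduces to the AP of that category.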

For application testing on large-scale remote sensing images, a different evaluation framework is adopted. Unlike the patch-level P/R/mAP metrics, which are computed on cropped image tiles with complete bounding-box annotations, large-format application testing operates on full-scene imagery where exhaustive bounding-box annotations are unavailable. Three instance-level metrics are therefore adopted: Correct Detection Rate (CDR), Miss Detection Rate (MDR), and False Detection Rate (FDR). These metrics require only the total instance count, which can be reliably obtained through manual visual interpretation, and directly reflect detection completeness and false alarm rate at the object level, making them more aligned with the practical requirements of obstacle inventory applications. They are defined as follows:

CDR = \frac{N_{correct}}{N_{total}} \times 100\%, \qquad (10)

MDR = \frac{N_{miss}}{N_{total}} \times 100\%, \qquad (11)

FDR = \frac{N_{false}}{N_{correct} + N_{false}} \times 100\%, \qquad (12)

where N_{total} is the total number of actual transmission towers in the image, N_{correct} is the number of correct detections, N_{miss} is the number of missed detections, and N_{false} is the number of false detections.
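Equations (10)–(12) can be sketched in a few lines of Python. The function name and the counts in the example are hypothetical and are not taken from the result tables; the consistency check (correct plus missed equals total) is an assumption that follows from each tower being either detected or missed.

```python
def instance_metrics(n_total: int, n_correct: int, n_miss: int, n_false: int):
    """Return (CDR, MDR, FDR) in percent, following Eqs. (10)-(12)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_correct + n_miss != n_total:
        # Assumption: every real tower is either correctly detected or missed.
        raise ValueError("n_correct + n_miss must equal n_total")
    cdr = 100.0 * n_correct / n_total
    mdr = 100.0 * n_miss / n_total
    detections = n_correct + n_false
    # FDR is normalised by the number of reported detections, not by n_total.
    fdr = 100.0 * n_false / detections if detections > 0 else 0.0
    return cdr, mdr, fdr


# Hypothetical scene: 50 towers, 40 found, 10 missed, 10 false alarms.
result = instance_metrics(n_total=50, n_correct=40, n_miss=10, n_false=10)
```

Because FDR uses the detection count as its denominator, CDR and MDR always sum to 100%, whereas FDR varies independently of them.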

3.2. Performance Validation of the Optimized Composite YOLO Model on the HRS-PTD Dataset

3.2.1. Comparative Experiments of BaseYOLO

To verify that the BaseYOLO architecture designed in this study offers superior performance in remote sensing transmission tower detection scenarios, comparative experiments are conducted on the HRS-PTD dataset. The BaseYOLO baseline model is compared against YOLOv5, YOLOv8, and YOLO11. For all models, the most lightweight variants within their respective versions are selected. All experiments are performed using identical dataset partitions and training parameters, and the results are presented in Table 2.

The results show that BaseYOLO achieves the best value on all three accuracy metrics together with the lowest parameter count. From YOLOv5 to YOLO11, all accuracy metrics rise as the versions are updated, which aligns with expectations. The accuracy of BaseYOLO on the transmission tower dataset is close to that of the state-of-the-art YOLO11n model, with slight gains of 0.82, 1.68, and 0.57 percentage points in precision, recall, and mAP, while the number of parameters is reduced by a further 9.7% (from 2.59 M to 2.34 M). These results indicate that combining the efficient C3k2 feature extraction module with a streamlined SPPF structure maintains or slightly improves detection accuracy at the lowest parameter count. This provides a compact and efficient baseline architecture for the subsequent introduction of improved modules.

3.2.2. Ablation Experiments

To verify the effectiveness of each proposed improvement module, a systematic ablation study is designed by sequentially introducing the CARAFE upsampling module and the C_DCA direction-aware module into the composite BaseYOLO baseline. All experiments are conducted using identical dataset partitions and training parameters, and the results are presented in Table 3.

After introducing the CARAFE module independently, the recall of the model increased to 88.40%, a gain of 8.62 percentage points, and the mAP rose to 91.38%, an increase of 6.48 percentage points. This indicates that content-aware upsampling effectively restores detailed information of small targets lost during cross-scale feature transmission, significantly reducing the missed detection rate. However, precision decreased slightly, likely because the substantial increase in recall introduced some low-confidence detections. When the C_DCA module was introduced independently, both precision and mAP showed slight improvements, while recall increased by 5.77 percentage points. This suggests that the deformable convolution enhances the discriminability between target and background features through adaptive sampling, reducing false positives while also reducing missed detections.

When both improvement modules were integrated into the full model, the precision reached 88.40%, recall reached 91.90%, and mAP reached 92.28%, representing increases of 1.53, 12.12, and 7.38 percentage points over the baseline, respectively. Notably, the precision of the full model is not only higher than the baseline but also exceeds the 85.29% achieved when only CARAFE was used. This demonstrates that the C_DCA module effectively compensates for the precision loss introduced by CARAFE, creating a strong complementary and synergistic effect between feature visibility enhancement and feature discriminability enhancement. Although the model parameters increased by 1.49 M, the overall architecture remains lightweight and capable of meeting practical deployment requirements.
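The reported gains are absolute differences in percentage points rather than relative changes. A short arithmetic check, using the BaseYOLO row of Table 2 as the baseline and the full-model figures quoted in the text (3.83 M parameters is the value reported for the proposed model), is sketched below; the variable names are illustrative.

```python
# Baseline: BaseYOLO row of Table 2. Full model: values quoted in the text.
baseline = {"P": 86.87, "R": 79.78, "mAP": 84.90}
full_model = {"P": 88.40, "R": 91.90, "mAP": 92.28}

# Absolute gains in percentage points (rounded to remove float noise).
gains_pp = {k: round(full_model[k] - baseline[k], 2) for k in baseline}

# Parameter growth in millions: 3.83 M (full model) vs. 2.34 M (BaseYOLO).
param_increase_m = round(3.83 - 2.34, 2)

# Relative recall gain, shown only to contrast with the percentage-point figure.
relative_recall_gain = 100.0 * (full_model["R"] - baseline["R"]) / baseline["R"]
```

The percentage-point gains reproduce the 1.53, 12.12, and 7.38 figures and the 1.49 M parameter increase stated above.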

To validate the directional attention mechanism in C_DCA from an interpretability perspective, three representative transmission tower samples from the test set were selected for per-pixel spatial attention weight visualization, as shown in Figure 11. The three samples correspond to three morphological categories defined by the aspect ratio w/h (where w/h denotes the ratio of bounding box width to height): top-down near-square targets (0.75 ≤ w/h < 1.25), horizontally oriented targets (w/h ≥ 1.25), and vertically oriented targets (w/h < 0.75), consistent with the shape classification scheme introduced in Section 2.1. Heatmap colors transition from blue to red indicating increasing attention weight.
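The three morphological categories can be expressed as a simple threshold rule on w/h. The sketch below uses the thresholds stated above; the function name and the category labels are illustrative assumptions.

```python
def shape_category(width: float, height: float) -> str:
    """Classify a bounding box by aspect ratio w/h using the stated thresholds."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = width / height
    if ratio >= 1.25:
        return "horizontal"   # w/h >= 1.25
    if ratio < 0.75:
        return "vertical"     # w/h < 0.75
    return "near-square"      # 0.75 <= w/h < 1.25


# Hypothetical boxes given as (width, height) in pixels.
labels = [shape_category(w, h) for w, h in [(100, 100), (150, 100), (50, 100)]]
```

Boxes on the boundaries follow the inequalities as written: w/h = 1.25 is horizontal and w/h = 0.75 is near-square.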

For the top-down near-square sample, the standard branch (w_{d}) produces moderate in-box activation distributed across the tower footprint. This behavior is consistent with the geometric characteristics of top-down targets, which lack a dominant elongation direction and are therefore better captured by the isotropic standard branch rather than either strip branch. For the horizontally oriented sample, the horizontal branch (w_{h}) exhibits elevated warm-color responses within the detection boxes, aligning with the horizontal elongation of the tower bodies in this imaging perspective. For the vertically oriented sample shown, the vertical branch (w_{v}) produces the most distinct directional response: prominent high-activation regions with peak weights exceeding 0.6 are concentrated within the detection boxes along the vertical axis, while background regions remain consistently below 0.3, forming a clear contrast between target and background.

These results demonstrate that the three directional branches produce differentiated, orientation-dependent responses rather than degenerating into uniform activation. In particular, the dominant branch shifts systematically from w_{d} to w_{h} to w_{v} as the target aspect ratio transitions from near-square to horizontally elongated to vertically elongated, confirming that the per-location attention mechanism operates as designed and provides interpretability-level support for the quantitative ablation results in Table 3.

The results of the ablation experiments fully validate the effectiveness and complementarity of the two improvement modules proposed in this study. To further illustrate the qualitative detection performance of the proposed model, Figure 12 presents the detection results of the complete improved model on eight representative samples from the HRS-PTD test set, covering four land cover types: farmland, bare soil, forest, and built-up areas.

In the farmland scenes, the model successfully detects transmission towers under both normal illumination and low-light conditions across different seasonal appearances, though one tower with indistinct structural features under weak illumination results in a missed detection, reflecting the inherent difficulty of small-target recognition under adverse imaging conditions. In the bare soil scenes, where the sparse and uniform background provides relatively high contrast, the model produces stable and accurate detections with confidence scores consistently above 0.80. In the forest scenes, despite the dense vegetation cover that partially obscures tower structures and reduces spectral separability, the model maintains reliable detection performance. In the built-up area scenes, where rooftops and other man-made structures share visual similarity with transmission towers, one missed detection occurs, indicating that background clutter from densely packed artificial objects remains a challenge for the model. Overall, the results demonstrate that the complementary design of the CARAFE upsampling module and the C_DCA direction-aware feature extraction module enables the proposed model to achieve robust detection across diverse and complex background environments.

3.2.3. Comparative Experiments of the Improved Model

To further validate the comprehensive performance of the proposed model, comparative experiments are conducted on the HRS-PTD test set against mainstream object detection models and a domain-specific detection method, including Faster R-CNN, RetinaNet, YOLOv5n, YOLOv8n, and LSKF-YOLO [8]. LSKF-YOLO is specifically designed for transmission tower detection in satellite remote sensing imagery and is approximately reproduced in this study by integrating the LSK attention module into the YOLOv8 backbone, as the original source code is not publicly available. The results are presented in Table 4.

As shown in Table 4, the proposed model achieves the best performance across all three accuracy metrics, attaining a precision of 88.40%, recall of 91.90%, and mAP of 92.28%.

Among the general-purpose detectors, Faster R-CNN and RetinaNet demonstrate relatively high recall but achieve precision below 80%, with parameter counts of 41.30 M and 32.17 M respectively and FPS of only 20.6 and 19.9, making them inadequate for the efficiency demands of large-scale satellite imagery processing. Among the YOLO series, YOLOv5n achieves an mAP of only 77.20% with a recall of 73.22%, indicating a high missed detection rate; YOLOv8n achieves an mAP of 83.75% but a recall of only 74.86%, similarly exhibiting a pronounced tendency toward missed detections of small targets. Regarding the domain-specific comparison, LSKF-YOLO achieves competitive precision (86.47%) and mAP (88.72%) among all compared methods, and attains the lowest parameter count (2.78 M) with the highest FPS of 106.2. However, its recall of 82.57% indicates a non-negligible missed detection rate.

With a parameter count of 3.83 M, the proposed model improves mAP by 3.56 percentage points and recall by 9.33 percentage points over LSKF-YOLO, while achieving an FPS of 102.6, demonstrating a notably strong overall balance between detection accuracy and inference efficiency among all compared methods. These results confirm the effectiveness and competitiveness of the proposed improvements for transmission tower detection in satellite remote sensing imagery.

3.3. Application Validation of the Optimized Composite YOLO Model on Real Satellite Imagery

To verify the applicability of the improved model in practical application scenarios, this section selects two real large-format satellite remote sensing images from different sources for application testing.

3.3.1. Application Experimental Data

A specific area in Liaocheng, Shandong Province, China, was selected as the experimental zone. Satellite images from two different sources were obtained for comparative testing, with detailed information presented in Table 5.

Since the dimensions of remote sensing images far exceed the input size of the model, direct input would lead to severe information loss and potential memory overflow. To address this, a sliding window strategy [34] was employed to segment the original images. The sliding window size was set to 640 × 640 pixels, with a stride configured to maintain a 25% overlap rate. This ensures that transmission tower targets located at window boundaries are not truncated during segmentation. Following detection, the results from all windows are mapped back to the original image coordinate system, with NMS (IoU = 0.45) applied to eliminate redundant detections in overlapping areas.
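The tiling and merging procedure can be sketched as follows. A 25% overlap on a 640-pixel window corresponds to a stride of 480 pixels. The helper names, the handling of the last window at the image edge, and the greedy NMS variant are assumptions made for illustration and may differ from the implementation used in the experiments.

```python
def window_origins(length: int, win: int = 640, overlap: float = 0.25):
    """Top-left offsets along one axis; stride = win * (1 - overlap) = 480."""
    stride = int(win * (1 - overlap))
    if length <= win:
        return [0]
    origins = list(range(0, length - win + 1, stride))
    if origins[-1] != length - win:
        origins.append(length - win)  # assumption: last window flush with edge
    return origins


def to_global(box, x0, y0):
    """Shift a window-local [x1, y1, x2, y2] box into full-image coordinates."""
    return [box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0]


def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr: float = 0.45):
    """Greedy NMS: keep a box only if its IoU with every kept box <= iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical example: one tower seen in two overlapping windows, plus another.
global_boxes = [
    to_global([500, 100, 600, 200], 0, 0),    # detected in the window at x0 = 0
    to_global([30, 110, 130, 210], 480, 0),   # same tower, window at x0 = 480
    to_global([50, 50, 120, 150], 960, 480),  # a different tower
]
kept = nms(global_boxes, scores=[0.9, 0.8, 0.7])
```

In this example the two overlapping boxes have an IoU of about 0.68, so the lower-scoring duplicate is suppressed and only the first and third boxes remain.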

3.3.2. Application Experimental Results

1. Detection results on Google Earth imagery:

On the Google Earth imagery with a resolution of approximately 0.3 m, the detection results of the improved model are presented in Table 6.

The improved model achieves a CDR of 88% on Google Earth imagery, 12 percentage points higher than the baseline model. The MDR decreases by 12 percentage points, and the FDR falls by 17.6 percentage points. These results demonstrate that, with the combined contribution of the CARAFE and C_DCA modules, the proposed model meets the fundamental requirements for transmission tower detection in high-resolution satellite remote sensing images.

The experimental results on Google Earth satellite imagery are illustrated in Figure 13. The three missed targets of the improved method are primarily distributed in areas with complex backgrounds and weak illumination. The three false detections consist mostly of non-power tower facilities that exhibit similar morphological and optical characteristics to actual towers.

2. Detection results on Gaofen-7 imagery:

On the Gaofen-7 imagery with a resolution of approximately 0.65 m, the features of the transmission towers become increasingly blurred due to the decreased resolution, significantly raising the detection difficulty. The statistical results are presented in Table 7.

The experimental results on the Gaofen-7 satellite imagery are illustrated in Figure 14. On the Gaofen-7 imagery, the improved model achieved a CDR of 76%, a gain of 44 percentage points over the baseline model. Notably, the performance gain on the lower-resolution imagery (0.65 m) is considerably larger than that observed on the higher-resolution imagery (0.3 m). This indicates that the proposed improvement strategies, specifically the integration of CARAFE and C_DCA, provide a more pronounced advantage in difficult scenarios where target features are weak or blurred, enhancing the reliability of the model under degraded imaging conditions.

4. Discussion

This study addresses the challenges of detecting transmission towers in satellite remote sensing imagery (small target size, slender structures, and complex backgrounds) by constructing the multi-source, multi-resolution HRS-PTD dataset and proposing an optimized composite YOLO-based detection method. Using a three-stage architecture combining C3k2 and SPPF as the baseline, the method integrates the CARAFE module to enhance multi-scale feature reconstruction and the C_DCA module to adaptively represent the slender structures of the targets. Compared with the other detectors evaluated, the proposed approach achieved the highest precision, recall, and mAP on the HRS-PTD dataset, with ablation studies confirming the independent contributions and synergistic gains of each module. On real-world satellite imagery, the improved model achieved correct detection rates of 88% on Google Earth and 76% on Gaofen-7 imagery. Notably, the synergy between the two modules was most prominent on the lower-resolution Gaofen-7 imagery, yielding higher performance gains than on Google Earth imagery. This underscores the effectiveness of the improvements in addressing “small scale, weak features, and random orientation” challenges.

However, certain limitations remain. In real-world testing, some transmission towers were missed due to their extremely low pixel proportions and insufficient contrast with surrounding features. Particularly in densely vegetated or urban areas, tower structures can easily blend into background textures, indicating that the discriminative power of the model for such targets requires further refinement. Additionally, some morphologically similar structures in lower-resolution images led to false positives. Currently, the model relies solely on visual features and does not yet fully exploit the contextual auxiliary information surrounding the towers.

5. Conclusions

To tackle the issues of small target size, slender structures, complex backgrounds, and limited generalization of training data in satellite imagery, this paper constructed the multi-source, multi-resolution HRS-PTD dataset and proposed a detection method based on an optimized composite YOLO network. By building upon a baseline architecture of C3k2 and SPPF, the CARAFE module was introduced for enhanced multi-scale feature reconstruction, and the C_DCA module was designed for adaptive representation of slender structures, effectively improving detection accuracy. Experimental results validated the effectiveness of these improvements, and tests on large-format real-world satellite imagery proved the engineering utility of the proposed method.

Despite these advancements, limitations persist. Future research will focus on further refining the accuracy of the detection model by incorporating auxiliary information near the bounding boxes—such as tower shadows—to enhance detection precision. Furthermore, the plan is to expand the scope of this research by applying the model to the automated detection of other low-altitude flight obstacles, meeting the practical requirements for low-altitude aviation safety.

Author Contributions

Conceptualization, R.L. and G.Z.; methodology, R.L. and G.Z.; validation, R.L., G.Z. and W.H.; formal analysis, R.L. and W.H.; investigation, G.Z. and C.Z.; data curation, R.L., G.Z., W.H. and C.Z.; writing—original draft preparation, R.L.; writing—review and editing, G.Z., W.H. and C.Z.; visualization, R.L.; supervision, G.Z. and W.H.; project administration, B.G.; funding acquisition, B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China, “Research and Application of Key Technologies and Standards for Quantitative Assessment of Low-Altitude Flight Environmental Risks”, grant No.: 2024YFF0617401.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kim, J.; Atkins, E. Airspace Geofencing and Flight Planning for Low-Altitude, Urban, Small Unmanned Aircraft Systems. Appl. Sci. 2022, 12, 576.
2. Huang, H.; Su, J.; Wang, F.-Y. The Potential of Low-Altitude Airspace: The Future of Urban Air Transportation. IEEE Trans. Intell. Veh. 2024, 9, 5250–5254.
3. Dong, C.; Zhang, Y.; Jia, Z.; Liao, Y.; Zhang, L.; Wu, Q. Three-Dimension Collision-Free Trajectory Planning of UAVs Based on ADS-B Information in Low-Altitude Urban Airspace. Chin. J. Aeronaut. 2025, 38, 103170.
4. Xie, Y.; Yu, C.; Zang, H.; Gao, F.; Tang, W.; Huang, J.; Chen, J.; Xu, B.; Wu, Y.; Wang, Y. Multi-UAV Formation Control with Static and Dynamic Obstacle Avoidance via Reinforcement Learning. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hangzhou, China, 19–25 October 2025; pp. 20410–20417.
5. Choi, S.; Chae, C.; Lim, C.; Park, U.; Choi, M. Development of Image Analysis Algorithm for Automatic Detection of Transmission Towers and Facilities. Trans. Korean Inst. Electr. Eng. 2020, 69, 772–782.
6. Electric Transmission and Distribution Infrastructure Imagery Dataset. Available online: https://figshare.com/articles/dataset/Electric_Transmission_and_Distribution_Infrastructure_Imagery_Dataset/6931088 (accessed on 12 March 2026).
7. Wang, H.; Yang, G.; Li, E.; Tian, Y.; Zhao, M.; Liang, Z. High-Voltage Power Transmission Tower Detection Based on Faster R-CNN and YOLO-V3. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; pp. 8750–8755.
8. Shi, C.; Zheng, X.; Zhang, K.; Su, Z.; Lu, Q. LSKF-YOLO: Large Selective Kernel Feature Fusion Network for Power Tower Detection in High-Resolution Satellite Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5620116.
9. Wang, Y.; Sun, G.; Sun, Z.; Wang, J.; Li, J.; Zhao, C.; Wu, J.; Liang, S.; Yin, M.; Wang, P.; et al. Toward Realization of Low-Altitude Economy Networks: Core Architecture, Integrated Technologies, and Future Directions. IEEE Trans. Cogn. Commun. Netw. 2025, 11, 2788–2820.
10. Qiao, S.; Sun, Y.; Zhang, H. Deep Learning Based Electric Pylon Detection in Remote Sensing Images. Remote Sens. 2020, 12, 1857.
11. Hou, S.; Wang, B.; Wang, N.; Hou, J.; Pang, Y. YOLO with Multi-Scale Optimization for Overhead Line Tower Detection in Satellite Remote Sensing Images. J. Real-Time Image Process. 2026, 23, 28.
12. Haroun, F.M.E.; Deros, S.N.M.; Din, N.M. Detection and Monitoring of Power Line Corridor From Satellite Imagery Using RetinaNet and K-Mean Clustering. IEEE Access 2021, 9, 116720–116730.
13. Yuan, M.S. Target Recognition and Localization Change Analysis System Based on Remote Sensing Images. Master’s Thesis, Harbin Institute of Technology, Harbin, China, 2022.
14. Zou, C.; Yan, B.; Zhang, Y.; Liu, Z.; Feng, J.; Tang, S. Dynamic Focal Network: Advancing Real-World Electric Pylon Detection Using High-Resolution Satellite Remote Sensing Image. In Proceedings of the 2025 7th International Conference on Power and Energy Technology (ICPET), Shanghai, China, 4–7 July 2025; pp. 181–186.
15. Shi, S.; Fang, Q.; Xu, X. Dynamic Adaptive Label Assignment for Tiny Object Detection in Remote Sensing Images. CAAI Trans. Intell. Technol. 2025, 11, 428–446.
16. Li, Z.; Guo, Q.; Sun, B.; Cao, D.; Li, Y.; Sun, X. Small Object Detection Methods in Complex Background: An Overview. Int. J. Patt. Recogn. Artif. Intell. 2023, 37, 2350002.
17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
18. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
19. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
20. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
21. Hu, J.; Wei, Y.; Chen, W.; Zhi, X.; Zhang, W. CM-YOLO: Typical Object Detection Method in Remote Sensing Cloud and Mist Scene Images. Remote Sens. 2025, 17, 125.
22. Chen, Y.; Zhao, B.; Jia, X.; Ma, T. Efficient Real-Time Object Detection for Embedded and Industrial Systems: The You Only Look Once-Compact Approach. Eng. Appl. Artif. Intell. 2025, 156, 109927.
23. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
24. Xu, Y.; Wu, H.; Liu, Y.; Zhang, X. PCB Electronic Component Soldering Defect Detection Using YOLO11 Improved by Retention Block and Neck Structure. Sensors 2025, 25, 3550.
25. Chen, X.; Zhang, Y.; Lei, J.; Li, L.; Liu, L.; Zhang, D. YOLOv11-DCFNet: A Robust Dual-Modal Fusion Method for Infrared and Visible Road Crack Detection in Weak- or No-Light Illumination Environments. Remote Sens. 2025, 17, 3488.
26. Zhang, C.; Tang, B.-H.; Cai, F.; Li, M.; Fan, D. YOLOv11-SAFM: Enhancing Landslide Detection in Complex Mountainous Terrain Through Spatial Feature Adaptation. Remote Sens. 2026, 18, 24.
27. Zhao, Q.; Zhu, J. An Improved YOLOv11 Architecture with Multi-Scale Attention and Spatial Fusion for Fine-Grained Residual Detection. Results Eng. 2025, 27, 107061.
28. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3007–3016.
29. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE++: Unified Content-Aware ReAssembly of FEatures. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4674–4687.
30. Ryu, J.; Kwak, D.; Choi, S. YOLOv8 with Post-Processing for Small Object Detection Enhancement. Appl. Sci. 2025, 15, 7275.
31. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
32. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
33. Li, H.; Cai, Y.; Nie, S.; Liu, K. YOLO11-MSCAM UAV Remote Sensing-Based Detection of Illegal Rare-Earth Mining with Multi-Scale Convolution and Attention Module. Remote Sens. 2026, 18, 738.
34. Mao, J.; Zhang, X.; Ji, Y.; Zhang, Z.; Guo, Z. Improved High Precision Aircraft Target Detection Method of YOLT. J. Phys. Conf. Ser. 2021, 1955, 012028.

Figure 1. Representative transmission tower samples with diverse backgrounds (from left to right: farmland, forest land, bare soil, and built-up area).

Figure 2. Transmission tower samples of different sizes in Google Earth satellite imagery.

Figure 3. Normalized distribution of the area occupied by transmission tower targets in images.

Figure 4. Imaging angles of selected transmission tower samples (from left to right: top-down, front, oblique, and side views).

Figure 5. Morphological analysis of transmission tower bounding boxes in the HRS-PTD dataset: (a) histogram of aspect ratio distribution categorized by shape class; (b) proportional distribution of shape categories; (c) aspect ratio distribution with vertical and horizontal orientation regions delineated; (d) scatter plot of normalized bounding box dimensions colored by elongation ratio.

Figure 6. Shadow characteristics of selected transmission tower samples (from left to right: long shadow, short shadow, and no obvious shadow).

Figure 7. The workflow of BaseYOLO.

Figure 8. CARAFE framework (where ⊗ denotes the content-aware local resampling operation).

Figure 9. CARAFE modules in the neck network.

Figure 10. Schematic diagram of the receptive fields for the three parallel convolutional branches in the DCA module: (a) standard branch with a conventional 3 × 3 convolution; (b) vertical branch with a k × 1 strip convolution; (c) horizontal branch with a 1 × k strip convolution. The “+” symbol marks the center of each receptive field.

Figure 11. Visualization of per-pixel directional branch attention weights in C_DCA for three representative target orientations: (a) standard branch (w_{d}), near-square target; (b) horizontal branch (w_{h}), horizontally oriented target; (c) vertical branch (w_{v}), vertically oriented target.

Figure 12. Detection results of the proposed model on representative test samples across four land cover types (from left to right: farmland, bare soil, forest, and built-up area).

Figure 13. Comparison of detection performance in Google Earth imagery: (a) Input image; (b) Tower detection results of BaseYOLO; (c) Tower detection results of the proposed model.

Figure 14. Comparison of detection performance in Gaofen-7 imagery: (a) Input image; (b) Tower detection results of BaseYOLO; (c) Tower detection results of the proposed model.

Table 1. Characteristics and sample statistics of data sources in HRS-PTD.

Data Source	Spatial Resolution	Imagery Type	Coverage Region	Number of Image Patches
Google Earth	Approximately 0.3~0.6 m	Optical True Color	Anyang, Shenzhen, Qingdao	1023
Gaofen-7	Pan 0.65 m/Multi 2.6 m	Optical Multispectral	Anyang, Shenzhen	455
Esri World Imagery	Approximately 0.3~0.5 m	Optical True Color	Ningbo	97

Table 2. Performance comparison of representative baseline detection models on the HRS-PTD dataset.

Model	P (%)	R (%)	mAP (%)	Params (M)
YOLOv5n	78.87	73.22	77.20	9.12
YOLOv8n	84.49	74.86	83.75	3.01
YOLO11n	86.05	78.10	84.33	2.59
BaseYOLO	86.87	79.78	84.90	2.34