1. Introduction
Object detection is a core task in computer vision, aiming to accurately identify the category and location of targets in images. With the widespread use of unmanned aerial vehicles (UAVs) in military reconnaissance, disaster relief, agricultural monitoring, and urban management, onboard detection models play a critical role in intelligent perception [1]. In recent years, the YOLO series has become a mainstream approach in UAV scenarios due to its fast end-to-end prediction, high accuracy, and flexible deployment [2,3,4].
With the development of anchor-free mechanisms, decoupled detection heads, and attention modules, single-stage detectors have achieved notable improvements in small-object representation and complex background suppression, showing strong potential for UAV-based vision tasks [5]. However, most of these studies ignore the inherent geometric attributes of aerial-photography targets, especially symmetry. Ma et al. [6] proposed a sparse non-local attention mechanism to aggregate contextual information from multi-level features efficiently, while Zhang et al. [7] introduced a cross-layer feature aggregation module (CFAM) to alleviate the limitations of sequential feature propagation in feature pyramids, although its fusion capability remains limited. In UAV detection, UAV-YOLOv8 [8], UN-YOLOv5s [9], and HSP-YOLOv8 [10] enhance detection performance via multi-scale feature fusion, small-object detection strategies, or structural improvements, but challenges remain in accuracy, deployment adaptability, and inference efficiency. Meanwhile, various lightweight convolution modules and efficient network architectures have been proposed to meet practical requirements.
Symmetry is a fundamental geometric property of most natural and man-made targets in aerial scenes, such as vehicles, buildings, and aircraft. It serves as a critical cue for distinguishing targets from complex backgrounds and can effectively enhance the discriminability of incomplete or blurred target features. Mining symmetric information has proven to be an effective way to improve detection performance in complex scenarios. Unfortunately, existing YOLOv8s-based improved algorithms rarely integrate symmetry-aware mechanisms into the feature extraction and fusion processes, leaving the intrinsic features of targets underexploited. Moreover, UAV aerial images present unique challenges: they often contain many small targets, complex backgrounds, dense arrangements, and occlusions, which make general-purpose detectors prone to missed or false detections [11,12,13]. Various methods have been proposed to address these issues.
Dong Gang et al. [14] optimized small-target detection through multi-scale feature fusion, evaluation-metric improvement, super-resolution reconstruction, and lightweight modeling; Jiang Maoxiang et al. [15] introduced a small-target detection head in RT-DETR and combined SimAM attention with inverted residual modules to enhance the backbone network; Ma Junyan et al. [16] proposed a super-resolution method combining MFE-YOLOX with attention; Liang Xiuman et al. [17] added a small-target detection layer in YOLOv7 and introduced a multi-information-flow fusion attention mechanism; Zhu et al. [18], Liu Shudong et al. [19], and Shao et al. [20] strengthened feature representation via attention mechanisms and backbone improvements; Li et al. [21] and Pan Wei et al. [22] optimized feature fusion structures to balance performance and efficiency; Wang et al. [8] and Deng Tianmin et al. [23] combined loss-function design with channel optimization to improve detection accuracy and efficiency.
Despite these advances, existing YOLO-based UAV detectors still have several limitations: insufficient exploitation of spatial correlations in complex scenes, reliance on complex or redundant modules for multi-scale feature modeling, and feature detail loss during upsampling, resulting in inadequate cross-scale information fusion. Our proposed modules are explicitly designed to address these limitations: C2f_AFE enhances cross-regional feature dependencies and fine-grained analysis to improve small-target representation; CMRF efficiently integrates multi-scale receptive fields in a hierarchical manner to reduce redundancy and alleviate performance bottlenecks; SAFMN optimizes cross-scale feature fusion and preserves detail during upsampling, effectively solving these core limitations.
To address these challenges, we propose ACS-YOLOv8s, built on the YOLOv8s framework, with three innovative modules:
C2f_AFE module: Enhances cross-regional feature dependencies and fine-grained analysis, improving multi-scale feature representation in complex scenarios.
CMRF module: Efficiently mines multi-scale receptive fields through a cascading strategy, alleviating the redundancy and performance bottlenecks of traditional multi-scale modules.
SAFMN module: Combines convolution channel mixing to optimize cross-scale feature fusion and preserve fine details, mitigating feature blurring during the upsampling stage.
These modules collaboratively improve the accuracy and robustness of small-target detection in UAV aerial images while maintaining high computational efficiency. Experimental results on the VisDrone2019 dataset show that ACS-YOLOv8s achieves substantial improvements over baseline models, validating the effectiveness and practicality of the proposed method. The main contributions of this work are summarized as follows:
We propose the C2f_AFE module to enhance cross-regional feature dependencies and fine-grained analysis, improving small-target representation in complex UAV aerial images.
We design the CMRF module to efficiently integrate multi-scale receptive fields in a hierarchical manner, reducing redundancy and alleviating performance bottlenecks.
We introduce the SAFMN module to optimize cross-scale feature fusion and preserve feature details during upsampling, mitigating feature blurring.
Extensive experiments on the VisDrone2019 dataset demonstrate that ACS-YOLOv8s significantly improves small-target detection accuracy, recall, and mAP compared with baseline models, while maintaining high computational efficiency.
2. YOLOv8 Algorithm
YOLOv8 [24] was released by the Ultralytics team in 2023. As an important update of the YOLO series, its architecture retains the four-stage design of input end, backbone network, feature fusion network, and detection head. The structure of YOLOv8 is shown in Figure 1, where w (width) and r (ratio) adjust the model size to adapt to different scenarios. The backbone replaces the C3 structure of YOLOv5 with the C2f module and combines CSPDarknet and ELAN ideas to optimize gradient flow and improve feature extraction; the feature fusion end combines a feature pyramid network (FPN) [25] with a path aggregation network (PAN) [26] to achieve efficient cross-layer information interaction; the detection head separates classification and regression tasks through a decoupled structure and introduces an anchor-free mechanism to reduce label noise and hyperparameter dependence, thereby striking a balance between detection accuracy and inference efficiency. With these improvements, YOLOv8 shows strong advantages in computational load and stability.
Although YOLOv10 and YOLOv11 introduce multi-scale enhancement and attention mechanisms to improve small-target detection, the accuracy gain in aerial-photography scenes is limited (mAP50 improvement below 1.2%), their more complex structures increase inference latency and compute consumption, and their engineering adaptability is insufficient. In contrast, YOLOv8 has proven mature in many fields and is highly deployable. Cross-architecture comparison shows that two-stage methods are accurate but slow at inference, Transformer-based methods incur high computational overhead, and lightweight models lack accuracy. Taken together, YOLOv8s balances accuracy, efficiency, and resource consumption, making it well suited to UAV aerial-photography tasks with large target-scale differences, dense distributions, and strict real-time requirements; it was therefore selected as the baseline model in this article.
3. Improved ACS-YOLOv8s Algorithm
3.1. Overall Network Structure
This article proposes an improved method for UAV aerial-photography target detection based on YOLOv8s. The network architecture is shown in Figure 2. The network still consists of the backbone, neck, and head, but the feature extraction and fusion mechanisms have been systematically optimized. Specifically, the backbone builds the C2f_AFE module to enhance cross-regional dynamic feature correlation and improve the representation of small and multi-scale targets; the cascaded multi-receptive-field (CMRF) module replaces the original SPPF module, efficiently modeling multi-scale information through a multi-receptive-field cascade strategy that balances computational cost and feature expression; and the neck upsampling stage introduces a spatially adaptive feature modulation network (SAFMN) to strengthen cross-scale feature interaction and ease the tension between resolution increase and detail preservation. Together, these improvements optimize the full pipeline from feature breadth and semantic depth to detail accuracy, providing higher robustness and adaptability for target detection in complex UAV aerial-photography scenes.
3.2. Adaptive Feature Enhancement Module
With the development of computer vision, semantic segmentation and target detection have been widely used in fields such as autonomous driving, smart security, and drone monitoring. However, existing methods still have limitations in complex scenarios. Traditional CNN has difficulty capturing long-range dependencies, and its performance is limited in multi-target, occlusion-affected, and complex backgrounds; although Vision Transformer has global modeling capabilities, it is still insufficient in detail capture and semantic-context modeling; the hybrid attention model takes into account both local and global features, but has limited accuracy when dealing with cluttered backgrounds or small-scale targets. In response to these problems, this paper proposes an adaptive feature enhancement (AFE) module to achieve global semantic modeling and local detail enhancement through parallel design: it uses global attention to capture long-range dependencies to improve semantic richness, and combines local enhancement to highlight high-frequency features such as edges and textures, thereby improving the detection and segmentation capabilities of small targets in complex scenes without significantly increasing computational complexity.
The AFE block proposed in this article consists of a convolutional embedding (CE), a spatial-context module (SCM), a feature refinement module (FRM), and a convolutional multi-layer perceptron (ConvMLP), as shown in Figure 3 and Figure 4, and is embedded in the backbone as a feature enhancement network for semantic segmentation in complex backgrounds. After the input is embedded by CE, the SCM aggregates global context through large-kernel convolution, and the FRM decomposes and refines high- and low-frequency features to highlight small-target edges and textures, as shown in Figure 5; the ConvMLP then completes cross-channel modeling to enhance feature expression. In the FRM, high- and low-frequency information is extracted and fused to generate enhanced features through differencing and element-wise multiplication, achieving collaborative modeling of global and local, low- and high-frequency information while maintaining computational efficiency.
F: input feature map of the FRM module (size H × W × C, containing complete high- and low-frequency information);
P: feature map obtained from F by DWConv 3 × 3 (stride = 2) downsampling (size H/2 × W/2 × C; downsampling loses high-frequency details and retains only low-frequency contours);
Q: feature map obtained by upsampling P back to H × W × C (the low-frequency approximation of F, containing global smoothness and large-area semantic information);
R: the result of F − Q (the residual of the input F and the low-frequency feature Q, i.e., the high-frequency detail of F, including local information such as small-target edges and textures);
S: the result of F ⊙ Q (element-wise multiplication of F and Q, strengthening the spatially effective regions of the low-frequency features);
T: the fused feature map obtained by concatenating (symbol C) R and S after each is processed by DWConv 3 × 3;
Output: the result of T after Conv 1 × 1 channel integration (the final feature map of the FRM module).
Core operation: since Q is the result of downsampling then upsampling F (downsampling filters out high-frequency details), the difference F − Q naturally corresponds to the high-frequency information in F; this is the core operation for separating high- and low-frequency features.
The design intuition of the FRM module is that high- and low-frequency features need to be differentially modeled:
Traditional feature modules usually process high- and low-frequency information in a mixed manner, which can easily cause high-frequency details (edges, textures) that small targets rely on to be overwhelmed by global low-frequency information (large-area contours); FRM achieves precise separation and enhancement through the following logic:
Separate high and low frequencies: Use downsampling + upsampling to extract low-frequency features Q (downsampling filters high frequencies, upsampling restores size), and then obtain high-frequency features R through “residual subtraction”—natural splitting of high and low frequencies can be achieved without additional parameters;
Differentiation enhancement: For low-frequency features Q (corresponding to global semantics), enhance the spatial consistency through element-wise multiplication + DWConv 3 × 3; for high-frequency features R (corresponding to small-target details), retain and enhance the local response through DWConv 3 × 3;
Lightweight fusion: Use splicing + 1 × 1 convolution to integrate high- and low-frequency features. While controlling the amount of calculation, the enhanced high- and low-frequency information can complement each other, ultimately improving the expressive ability of features (especially adapted to the needs of small-target detection for high-frequency details).
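The separation logic above can be sketched numerically. The following is a minimal illustrative example (plain Python on a 1-D signal, not the paper's implementation): stride-2 average pooling stands in for the DWConv downsampling, and nearest-neighbor repetition stands in for upsampling; the residual then isolates the high-frequency component.

```python
def frm_split(f):
    """Toy high/low-frequency separation on a 1-D signal (even length).

    Stand-ins: stride-2 average pooling ~ DWConv downsampling,
    nearest-neighbor repeat ~ upsampling.
    """
    # Downsample: P loses high-frequency detail.
    p = [(f[i] + f[i + 1]) / 2 for i in range(0, len(f), 2)]
    # Upsample back: Q is the low-frequency approximation of F.
    q = [v for v in p for _ in range(2)]
    # R = F - Q: high-frequency residual (edges, textures).
    r = [a - b for a, b in zip(f, q)]
    # S = F ⊙ Q: low-frequency features gated by the input.
    s = [a * b for a, b in zip(f, q)]
    return q, r, s

# A flat signal has no high-frequency content, so R is all zeros.
q, r, s = frm_split([2.0, 2.0, 2.0, 2.0])
print(r)  # -> [0.0, 0.0, 0.0, 0.0]
```

Note that Q + R reconstructs F exactly, which is why the split adds no parameters: the high-frequency part is whatever the low-pass path discards.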
3.3. Channelwise Multi-Receptive Field Module
In image segmentation and target detection, existing models generally face a trade-off between efficiency and accuracy: compressing parameters and computation improves efficiency but weakens feature expression, degrading performance on low-resolution images and small targets, while multi-receptive-field modules improve feature modeling but carry a high computational overhead, which is unfavorable in resource-constrained scenarios. To this end, this paper proposes the cascaded multi-receptive-field (CMRF) module, whose structure is shown in Figure 6. Through the cascade design, multi-scale information is integrated, and everything from fine textures to global structures can be modeled effectively. Combined with efficient feature-mining and fusion strategies, it improves detection accuracy while controlling complexity, achieving both accuracy and real-time performance. As depicted in the left portion of Figure 6, the proposed CMRF module incorporates depthwise convolutions.
Initial feature extraction for the input feature map X:

X′ = GELU(BN(DWConv(X)))

where DWConv is a depthwise convolution (reducing the number of parameters), X′ is the output feature map, and GELU is the activation function; the design balances the amount of computation and feature capacity.

Odd- and even-channel splitting and differentiated processing: split X′ by channel index into the odd-channel subset X′_odd and the even-channel subset X′_even, and process them along two paths:

Detail enhancement path: perform element-wise addition of X′_odd and X′_even to obtain X″, which retains the fine-grained features of small targets;

Multi-receptive-field enhancement path: apply a cascade of small-kernel (size 2) DWConv–BN modules to X′_even to capture multi-scale receptive-field information, then concatenate the intermediate outputs into X‴.

Final feature fusion: after concatenating X‴ and X″, the information is integrated through a point-wise convolution block (PWConv–BN–GELU) to output the final feature map:

Y = GELU(BN(PWConv([X‴; X″])))
Lightweight cascaded multi-receptive fields: Use small-core depth convolution to replace large-core/single-scale convolution, with parameters only 75% of SSFF, covering the full scale from small targets 10 × 10 to large targets 60 × 60.
Odd- and even-channel differentiation: Split channels reserve exclusive resources for small targets, and their channel proportion increases from 15% to 45%;
Dual-path fusion: Element addition (details preserved) + branch splicing (complementing the global picture), small-target matching error ≤ 2 pixels, recall rate increased by 4.3% compared to SFCC.
Advantages of small-target detection
Cascade design: The number of small-target receptive field layers is increased from 1 to 3, and the 10 × 10 small-target AP is 7.2% higher than YOLOv8 SPPF;
Odd- and even-channel separation: Small-target feature response intensity is increased by 27%.
See
Appendix A for symbol definitions, dimensions, and notes.
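As a structural illustration only (plain Python over a flat channel list; the DWConv–BN stages and PWConv integration are replaced by simple stand-in operations and are not the paper's implementation), the odd/even split and dual-path fusion can be sketched as:

```python
def cmrf_paths(channels):
    """Toy sketch of the CMRF odd/even channel split and dual-path fusion.

    `channels` is a list of per-channel feature values; the real CMRF
    operates on feature maps with DWConv-BN stages, elided here.
    """
    odd = channels[0::2]   # odd-numbered channel subset
    even = channels[1::2]  # even-numbered channel subset
    # Detail-enhancement path: element-wise addition preserves fine detail.
    detail = [a + b for a, b in zip(odd, even)]
    # Multi-receptive-field path: cascaded stages widen the receptive field;
    # each stage just averages neighbors here as a stand-in for DWConv-BN.
    stage, cascade_out = even, []
    for _ in range(3):  # three cascaded stages
        stage = [(stage[i] + stage[min(i + 1, len(stage) - 1)]) / 2
                 for i in range(len(stage))]
        cascade_out.extend(stage)  # concatenate intermediate outputs
    # Final fusion: concatenate both paths (PWConv integration elided).
    return detail + cascade_out

out = cmrf_paths([1.0, 2.0, 3.0, 4.0])
```

The point of the cascade is that each extra small-kernel stage sees a slightly wider neighborhood of the previous stage's output, so multiple scales are covered without any large kernels.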
3.4. Spatially Adaptive Feature Modulation Network
In image super-resolution and target detection tasks, traditional methods still have limitations in feature learning, multi-scale expression, and computational overhead, which easily lead to texture blur, loss of detail, and loss of small-target features, making it difficult to reconcile global semantics with fine reconstruction. To address these problems, this paper proposes the spatially adaptive feature modulation network (SAFMN), whose overall architecture is shown in Figure 7. Its core module, the feature modulation module (FMM), consists of a spatially adaptive feature modulation (SAFM) block and a convolutional channel mixer (CCM). The input image is first mapped to the feature space through shallow convolution; the FMM then applies multi-scale feature division and transformation and combines global residuals to enhance high-frequency information, achieving deep feature extraction and high-resolution reconstruction.
Some of the divided features are extracted using 3 × 3 depthwise convolutions to capture local representations, while the others undergo multi-level pooling and upsampling to capture long-range dependencies. The resulting features are then concatenated and aggregated along the channel dimension:

X̃ = Conv₁ₓ₁([X̂₀, X̂₁, …, X̂ₙ₋₁])

An attention map is subsequently generated through a nonlinear activation function ϕ to adaptively modulate the input X:

X̂ = ϕ(X̃) ⊙ X

This design effectively enhances non-local feature modeling and multi-scale representation. To further integrate local contextual information and channel interactions, the CCM uses a compact combination of 3 × 3 and 1 × 1 convolutions: the 3 × 3 convolution captures spatial context and expands the channel dimension, while the 1 × 1 convolution restores the original scale and employs GELU activation to strengthen nonlinear representation. The update process of the FMM can then be summarized as follows:

Y = SAFM(LN(X)) + X,  Z = CCM(LN(Y)) + Y

where LN denotes layer normalization. The additional residual paths not only stabilize training but also enhance the restoration of high-frequency details.
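The modulation step X̂ = ϕ(X̃) ⊙ X can be illustrated with a minimal numeric sketch (plain Python on per-position scalars, with GELU as the activation ϕ; a toy illustration, not the paper's implementation):

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF: 0.5 * x * (1 + erf(x / sqrt(2))).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def safm_modulate(x, x_tilde):
    """X_hat = phi(X_tilde) ⊙ X: gate the input by the activated attention map."""
    return [gelu(a) * v for a, v in zip(x_tilde, x)]

# Positions with strong aggregated responses pass through almost unchanged;
# strongly negative responses are suppressed toward zero.
out = safm_modulate([1.0, 1.0, 1.0], [4.0, 0.0, -4.0])
```

Element-wise gating like this is what lets the aggregated multi-scale map X̃ emphasize spatial positions of the input adaptively rather than uniformly.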
Figure 8 shows an overview of the SAFMN architecture. The input LR image is first mapped to the feature space through a convolutional layer, deep features are then extracted through a series of FMMs, and the result is finally reconstructed by the upsampling module. Each FMM consists of the SAFM, the CCM, and two skip connections.
As shown in Figure 9, after introducing SAFMN, the responses of the upsampled features are more concentrated in the target regions, further validating the effective modulation of SAFMN during feature upsampling.
3.5. Dataset
The experiments in this study were conducted on the VisDrone2019 dataset [27] for model training and validation. VisDrone2019 was released by the AISKYEYE team at Tianjin University. It contains 8629 UAV aerial images and more than 2.6 million annotated targets covering 10 categories (pedestrians, vehicles, non-motorized vehicles, etc.), divided into 6471 training, 548 validation, and 1610 test images. The dataset is characterized by dense targets, unbalanced categories, large scale differences, severe occlusion, and complex backgrounds, posing a substantial challenge for target detection algorithms.
3.6. Experimental Environment
The experiments were built on PyTorch 2.0.1 under Windows 11. The hardware is a Lenovo Legion R7000 laptop (Lenovo Group Limited, Beijing, China) equipped with an NVIDIA RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA), using Python 3.8 with CUDA 11.8 acceleration. The model was implemented with Ultralytics YOLOv8 v8.0.200, with image preprocessing by OpenCV v4.8.1, dataset annotation by LabelImg v1.8.6, and result visualization by Matplotlib v3.7.2. The training parameters were set as follows: 200 epochs, batch size 8, 4 worker threads, and input image size 640 × 640; the optimizer used an initial learning rate of 0.01, weight decay of 0.01, and a momentum factor of 0.937 to ensure stable and efficient convergence. The batch size of 8 was dictated by GPU memory constraints given the model size and high-dimensional inputs. Small batches can lead to noisier gradient estimates, which may slightly slow convergence or cause fluctuations in the training curve; however, this noise can also help the model escape local minima and improve generalization. To ensure stable convergence despite the small batch size, we carefully adjusted the learning rate.
3.7. Performance Indicators
The performance of ACS-YOLOv8s was evaluated using the following metrics: precision (P), recall (R), average precision (AP), mean average precision (mAP), and the F1 measure for detection accuracy, while real-time performance was measured by frames per second (FPS) and parameter count (Params). Precision (P) reflects the proportion of correctly detected targets among all predictions:

P = TP / (TP + FP)

where TP is the number of correctly detected positive samples and FP is the number of false positives. Recall (R) measures the proportion of true targets successfully detected by the model:

R = TP / (TP + FN)

where FN is the number of real targets missed by the detector. Average precision (AP) is computed as the area under the precision–recall (PR) curve:

AP = ∫₀¹ P(R) dR

The mAP is the mean of AP across all classes, providing a comprehensive evaluation of multi-class detection performance:

mAP = (1/n) Σᵢ₌₁ⁿ APᵢ

where n is the total number of classes. The F1 measure is the harmonic mean of precision and recall, representing the balance between them:

F1 = 2PR / (P + R)
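As a quick numeric check of these definitions (plain Python with toy counts, not results from the paper):

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_ap(per_class_ap):
    """mAP: mean of per-class average precision values."""
    return sum(per_class_ap) / len(per_class_ap)

p, r, f1 = detection_metrics(tp=80, fp=20, fn=40)
print(round(p, 2), round(r, 4), round(f1, 4))  # -> 0.8 0.6667 0.7273
print(mean_ap([0.5, 0.25, 0.75]))              # -> 0.5
```

Because F1 is a harmonic mean, it sits closer to the smaller of P and R, penalizing models that trade one heavily against the other.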
4. Experimental Procedure
4.1. Ablation Experiments
To ensure fair comparison and reliable results, all data parameters and environment configurations were kept consistent across experiments. Ablation experiments were carried out on the YOLOv8s baseline model using the VisDrone2019 dataset to verify the independent and combined contributions of the C2f_AFE, CMRF, and SAFMN modules. The results are shown in Table 1, where A, B, and C correspond to schemes that individually introduce C2f_AFE, CMRF, and SAFMN, respectively; D and E represent combination schemes that progressively integrate CMRF or SAFMN on top of A; and F denotes the higher-order combination scheme that introduces SAFMN on top of both A and B. The ablation results in Table 1 are analyzed as follows:
Introducing AFE, CMRF, and SAFMN into YOLOv8s improves overall model performance, though each module contributes differently. AFE enhances global modeling and detail fidelity through large-kernel convolution and high/low-frequency feature separation, leading to improvements in recall and mAP50:95. CMRF enables fine-grained feature fusion through dynamic weight allocation, effectively suppressing redundant information and highlighting salient features; it enhances both precision and recall while maintaining high computational efficiency. SAFMN, by focusing on fine-grained feature transfer, improves small-target detail discrimination and reduces the information loss caused by feature compression, at a slight increase in computational cost. The modules also provide complementary benefits in combination: AFE + CMRF pairs global perception with precise feature selection, while CMRF + SAFMN forms a cascaded optimization between feature denoising and detail enhancement. Ultimately, the three-module fusion in ACS-YOLOv8s achieves the best results across all metrics (P = 52.2%, R = 40.5%, mAP50 = 41.6%, mAP50:95 = 25.0%), significantly outperforming the baseline model and validating the effectiveness of multi-module collaborative optimization.
Feature Visualization and Ablation Analysis
To verify the effect of the proposed AFE module on small-target sensitivity, we conducted both visualization and ablation experiments. As shown in Figure 10, feature-response heatmaps of small targets are presented for models without and with AFE. In the left heatmap, the response to small targets is weak and diffuse, whereas in the right heatmap, after introducing AFE, the responses in small-target regions are significantly stronger (indicated by warmer colors). The quantitative ablation results in Table 1 further confirm that including AFE improves small-target detection metrics, demonstrating that AFE effectively enhances feature responses in the target regions. Together, the visualization and ablation analysis substantiate the claimed relationship between AFE and small-target sensitivity.
4.2. Comparison Experiments
To evaluate the effectiveness of the proposed algorithm for small-target detection in UAV aerial scenes, comparative experiments were conducted on the VisDrone2019 dataset against several state-of-the-art detection algorithms. The overall comparison results are presented in Table 2, while Table 3 reports the mAP (%) for each target class.
The experimental results show that detectors such as Faster R-CNN (two-stage) and RetinaNet achieve high accuracy but incur substantial computational costs, making them unsuitable for real-time UAV aerial detection. In contrast, single-stage detectors such as SSD and the YOLO series offer superior inference efficiency but lose accuracy on small targets and in complex backgrounds. Although RT-DETR achieves leading accuracy, it comes with high computational complexity. By comparison, the proposed ACS-YOLOv8s attains mAP50 = 41.6% and mAP50:95 = 25.0%, matching or even surpassing more complex models while maintaining reasonable GFLOPs and parameter counts. These findings demonstrate that the proposed method achieves an excellent balance between accuracy and efficiency, providing robust detection with practical applicability under the demanding conditions of UAV aerial scenes.
As shown in Table 3, RetinaNet achieves the lowest overall performance (mAP = 13.9%), struggling in multi-class detection. The YOLO series generally outperforms RetinaNet; among them, YOLOX performs well on large objects (mAP = 40.3%) but has difficulty with small targets and crowded scenes. YOLOv5 and YOLOv7-tiny balance accuracy and efficiency, while YOLOv8s performs well across most classes (mAP = 38.8%) but still exhibits limitations for small or easily confused categories. In contrast, ACS-YOLOv8s achieves a comprehensive performance improvement, reaching an overall mAP of 41.6%. It shows significant gains in challenging classes such as bicycle, tricycle, van, and truck, and attains the highest accuracy in key classes such as pedestrian and bus. These results confirm that the collaborative optimization of AFE, CMRF, and SAFMN effectively enhances feature representation in breadth, purity, and fine-grained detail, thereby improving the model's robustness and generalization in complex UAV aerial environments.
As shown in Table 4, under this challenging environment, although ACS-YOLOv8s exhibits a slight decrease in precision compared with the baseline, it achieves clear improvements in recall, mAP50, and mAP50:95. Overall, the proposed model outperforms YOLOv8s in degraded conditions, indicating stronger robustness and more stable detection capability.
To further assess performance in real-world scenarios, challenging images from the VisDrone2019 test set were chosen for visualization. The qualitative comparison of detection results is presented in Figure 11, which compares the detection results of different algorithms in typical scenes. From top to bottom, the scenes depict a road, a nighttime commercial street, and a dense small-target environment; the columns show the original image and the detection results of YOLOv5s, YOLOv8s, YOLOv11, and ACS-YOLOv8s, respectively. YOLOv5s and YOLOv8s exhibit missed detections and false positives, while YOLOv11 shows moderate improvement but remains limited. In contrast, ACS-YOLOv8s accurately detects all targets, with bounding boxes closely fitting object boundaries and minimal false detections, demonstrating superior accuracy and robustness in complex conditions.
4.3. Visualization Analysis
To clearly demonstrate the performance improvements of the proposed algorithm over the original model, the mAP50 and mAP50:95 values during training were visualized for both models, providing an intuitive comparison of performance evolution and improvement trends. The comparative visualization results are presented in Figure 12a,b.
ACS-YOLOv8s still has certain limitations under extreme lighting, severe motion blur, or highly dense scenes, where detection accuracy leaves room for improvement. Moreover, the current experiments are mainly based on a specific UAV dataset, and the model's generalization to other datasets or application scenarios still requires verification. Nevertheless, the comprehensive experimental results indicate that ACS-YOLOv8s can effectively detect more small targets in complex environments, with overall detection performance significantly better than the baseline model and stronger robustness and stability. This suggests that the proposed feature enhancement and multi-scale modulation modules play a significant role in improving small-target detection and handling complex scenarios, providing an effective technical solution for UAV small-target detection.
5. Conclusions
This paper proposes a YOLOv8s-based improved algorithm to address challenges in UAV aerial photography target detection, such as large size differences, dense distribution, and blurred features. Symmetry, an inherent geometric property of most aerial-photography targets (e.g., vehicles, buildings), is a key cue for distinguishing targets from complex backgrounds. The algorithm achieves three core improvements: first, the AFE module captures symmetric contour and texture features, enhancing fine-grained geometric perception and complex scene discriminability; second, the CMRF module integrates Ghost and PConv feature reuse ideas to balance computational efficiency and feature representation; third, the SAFMN module dynamically models multi-scale and cross-channel dependencies via CCM, focusing on mining cross-scale symmetric feature correlations to optimize expression. Experimental results show the method outperforms mainstream algorithms in accuracy, with stronger robustness in dense and complex scenes, providing an efficient solution for UAV aerial-photography target detection.
Despite these achievements, the study has limitations guiding future improvements: (1) performance may decline in extreme weather (e.g., heavy rain, fog), as the feature enhancement module poorly adapts to severely degraded images; (2) detection accuracy for ultra-small targets (≤10 × 10 pixels) in complex backgrounds needs improvement, due to fine-grained feature loss during multi-scale downsampling; (3) generalization across different UAV platforms and photography scenarios (e.g., high-altitude fast flight vs. low-altitude hovering) requires further verification, as experiments rely on fixed datasets and specific UAV configurations.
Future work will address these limitations by developing weather-adaptive feature fusion strategies, exploring lightweight super-resolution preprocessing for ultra-small targets, and conducting extensive multi-platform/multi-scenario experiments to enhance practical application value.