1. Introduction
Object detection is a core task in computer vision, aiming to accurately identify the category and location of targets in images. With the widespread use of unmanned aerial vehicles (UAVs) in military reconnaissance, disaster relief, agricultural monitoring, and urban management, onboard detection models play a critical role in intelligent perception [1]. In recent years, the YOLO series has become a mainstream approach in UAV scenarios due to its fast end-to-end prediction, high accuracy, and flexible deployment [2,3,4].
With the development of anchor-free mechanisms, decoupled detection heads, and attention modules, single-stage detectors have achieved notable improvements in small-object representation and complex background suppression, showing strong potential for UAV-based vision tasks [5]. However, most of these studies ignore the inherent geometric attributes of aerial-photography targets, especially symmetry. Ma et al. [6] proposed a sparse non-local attention mechanism to aggregate contextual information from multi-level features efficiently, while Zhang et al. [7] introduced a cross-layer feature aggregation module (CFAM) to alleviate the limitations of sequential feature propagation in feature pyramids, although its fusion capability remains limited. In UAV detection, UAV-YOLOv8 [8], UN-YOLOv5s [9], and HSP-YOLOv8 [10] enhance detection performance via multi-scale feature fusion, small-object detection strategies, or structural improvements, but challenges remain in accuracy, deployment adaptability, and inference efficiency. Meanwhile, various lightweight convolution modules and efficient network architectures have been proposed to meet practical requirements.
Symmetry is a fundamental geometric property of most natural and man-made targets in aerial scenes, such as vehicles, buildings, and aircraft. It serves as a critical cue for distinguishing targets from complex backgrounds and can effectively enhance the discriminability of incomplete or blurred target features. Mining symmetric information has proven to be an effective way to improve detection performance in complex scenarios. Unfortunately, existing YOLOv8s-based improved algorithms rarely integrate symmetry-aware mechanisms into the feature extraction and fusion processes, leaving the intrinsic features of targets underexploited. Moreover, UAV aerial images present unique challenges: they often contain many small targets, complex backgrounds, dense arrangements, and occlusions, which make general-purpose detectors prone to missed or false detections [11,12,13]. Various methods have been proposed to address these issues.
Dong Gang et al. [14] optimized small-target detection through multi-scale feature fusion, evaluation-metric improvement, super-resolution reconstruction, and lightweight modeling; Jiang Maoxiang et al. [15] introduced a small-target detection head in RT-DETR and combined SimAM attention with inverted residual modules to enhance the backbone network; Ma Junyan et al. [16] proposed a super-resolution method combining MFE-YOLOX with attention; Liang Xiuman et al. [17] added a small-target detection layer in YOLOv7 and introduced a multi-information-flow fusion attention mechanism; Zhu et al. [18], Liu Shudong et al. [19], and Shao et al. [20] strengthened feature representation via attention mechanisms and backbone improvements; Li et al. [21] and Pan Wei et al. [22] optimized feature fusion structures to balance performance and efficiency; Wang et al. [8] and Deng Tianmin et al. [23] combined loss-function design with channel optimization to improve detection accuracy and efficiency.
Despite these advances, existing YOLO-based UAV detectors still have several limitations: insufficient exploitation of spatial correlations in complex scenes, reliance on complex or redundant modules for multi-scale feature modeling, and feature detail loss during upsampling, resulting in inadequate cross-scale information fusion. Our proposed modules are explicitly designed to address these limitations: C2f_AFE enhances cross-regional feature dependencies and fine-grained analysis to improve small-target representation; CMRF efficiently integrates multi-scale receptive fields in a hierarchical manner to reduce redundancy and alleviate performance bottlenecks; SAFMN optimizes cross-scale feature fusion and preserves detail during upsampling, effectively solving these core limitations.
To address these challenges, we propose ACS-YOLOv8s, built on the YOLOv8s framework, with three innovative modules:
C2f_AFE module: Enhances cross-regional feature dependencies and fine-grained analysis, improving multi-scale feature representation in complex scenarios.
CMRF module: Efficiently mines multi-scale receptive fields through a cascading strategy, alleviating the redundancy and performance bottlenecks of traditional multi-scale modules.
SAFMN module: Combines convolution channel mixing to optimize cross-scale feature fusion and preserve fine details, mitigating feature blurring during the upsampling stage.
These modules collaboratively improve the accuracy and robustness of small-target detection in UAV aerial images while maintaining high computational efficiency. Experimental results on the VisDrone2019 dataset show that ACS-YOLOv8s achieves substantial improvements over baseline models, validating the effectiveness and practicality of the proposed method. The main contributions of this work are summarized as follows:
We propose the C2f_AFE module to enhance cross-regional feature dependencies and fine-grained analysis, improving small-target representation in complex UAV aerial images.
We design the CMRF module to efficiently integrate multi-scale receptive fields in a hierarchical manner, reducing redundancy and alleviating performance bottlenecks.
We introduce the SAFMN module to optimize cross-scale feature fusion and preserve feature details during upsampling, mitigating feature blurring.
Extensive experiments on the VisDrone2019 dataset demonstrate that ACS-YOLOv8s significantly improves small-target detection accuracy, recall, and mAP compared with baseline models, while maintaining high computational efficiency.
2. YOLOv8 Algorithm
YOLOv8 [24] was released by the Ultralytics team in 2023. As an important update of the YOLO series, its architecture retains the four-stage design of input end, backbone network, feature fusion network, and detection head. The structure of YOLOv8 is shown in Figure 1, where w (width) and r (ratio) adjust the model size to adapt to different scenarios. The backbone replaces the C3 structure of YOLOv5 with the C2f module and combines CSPDarknet and ELAN ideas to optimize gradient flow and improve feature extraction; the feature fusion end combines a feature pyramid network (FPN) [25] with a path aggregation network (PAN) [26] to achieve efficient cross-layer information interaction; the detection head separates classification and regression tasks through a decoupled structure and introduces an anchor-free mechanism to reduce label noise and hyperparameter dependence, thereby striking a balance between detection accuracy and inference efficiency. With these improvements, YOLOv8 shows strong advantages in computational load and stability.
Although YOLOv10 and YOLOv11 introduce multi-scale enhancement and attention mechanisms to improve small-target detection, the accuracy gain in aerial-photography scenes is limited (mAP50 improvement below 1.2%), their more complex structures increase inference latency and compute consumption, and their engineering adaptability is insufficient. In contrast, YOLOv8 has proven mature in many fields and is highly deployable. Cross-architecture comparison shows that two-stage methods are accurate but slow at inference, Transformer-based methods incur high computational overhead, and lightweight models lack accuracy. Taken together, YOLOv8s balances accuracy, efficiency, and resource consumption, making it well suited to UAV aerial-photography tasks with large target-scale differences, dense distributions, and strict real-time requirements; it was therefore selected as the baseline model in this article.
3. Improved ACS-YOLOv8s Algorithm
3.1. Overall Network Structure
This article proposes an improved method for UAV aerial-photography target detection based on YOLOv8s. The network architecture is shown in Figure 2. The network still consists of the backbone, neck, and head, but the feature extraction and fusion mechanisms have been systematically optimized. Specifically, the backbone builds the C2f_AFE module to enhance cross-regional dynamic feature correlation and improve the representation of small and multi-scale targets; the cascaded multi-receptive-field (CMRF) module replaces the original SPPF module, efficiently modeling multi-scale information through a multi-receptive-field cascade strategy that balances computational cost and feature expression; and the neck upsampling stage introduces a spatially adaptive feature modulation network (SAFMN) to strengthen cross-scale feature interaction and ease the tension between resolution increase and detail preservation. Together, these improvements optimize the full pipeline from feature breadth and semantic depth to detail accuracy, providing higher robustness and adaptability for target detection in complex UAV aerial-photography scenes.
3.2. Adaptive Feature Enhancement Module
With the development of computer vision, semantic segmentation and target detection have been widely used in fields such as autonomous driving, smart security, and drone monitoring. However, existing methods still have limitations in complex scenarios. Traditional CNN has difficulty capturing long-range dependencies, and its performance is limited in multi-target, occlusion-affected, and complex backgrounds; although Vision Transformer has global modeling capabilities, it is still insufficient in detail capture and semantic-context modeling; the hybrid attention model takes into account both local and global features, but has limited accuracy when dealing with cluttered backgrounds or small-scale targets. In response to these problems, this paper proposes an adaptive feature enhancement (AFE) module to achieve global semantic modeling and local detail enhancement through parallel design: it uses global attention to capture long-range dependencies to improve semantic richness, and combines local enhancement to highlight high-frequency features such as edges and textures, thereby improving the detection and segmentation capabilities of small targets in complex scenes without significantly increasing computational complexity.
The AFE block proposed in this article consists of a convolutional embedding (CE), a spatial-context module (SCM), a feature refinement module (FRM), and a convolutional multi-layer perceptron (ConvMLP), as shown in Figure 3 and Figure 4, and is embedded in the backbone as a feature enhancement network for semantic segmentation in complex backgrounds. After the input is embedded by CE, the SCM aggregates global context through large-kernel convolution, and the FRM decomposes and refines high- and low-frequency features to highlight small-target edges and textures, as shown in Figure 5; the ConvMLP then completes cross-channel modeling to enhance feature expression. In the FRM, high- and low-frequency information is extracted and fused to generate enhanced features through differencing and element-wise multiplication, achieving collaborative modeling of global and local, low- and high-frequency information while maintaining computational efficiency.
F: input feature map of the FRM module (size H × W × C, containing complete high- and low-frequency information);
P: feature map obtained from F by DWConv 3 × 3 (stride = 2) downsampling (size H/2 × W/2 × C; downsampling loses high-frequency details and retains only low-frequency contours);
Q: feature map obtained by upsampling P back to H × W × C (the low-frequency approximation of F, containing global smoothness and large-area semantic information);
R: the result of F − Q (the residual of the input F and the low-frequency feature Q, i.e., the high-frequency detail of F, including local information such as small-target edges and textures);
S: the result of F ⊙ Q (element-wise multiplication of F and Q, strengthening the spatially effective regions of the low-frequency features);
T: the fused feature map obtained by concatenating (symbol C) R and S after each is processed by DWConv 3 × 3;
Output: the result of T after Conv 1 × 1 channel integration (the final feature map of the FRM module).
Core operation: since Q is the result of downsampling then upsampling F (downsampling filters out high-frequency details), the difference F − Q naturally corresponds to the high-frequency information in F; this is the core operation for separating high- and low-frequency features.
The design intuition of the FRM module is that high- and low-frequency features need to be differentially modeled:
Traditional feature modules usually process high- and low-frequency information in a mixed manner, which can easily cause high-frequency details (edges, textures) that small targets rely on to be overwhelmed by global low-frequency information (large-area contours); FRM achieves precise separation and enhancement through the following logic:
Separate high and low frequencies: Use downsampling + upsampling to extract low-frequency features Q (downsampling filters high frequencies, upsampling restores size), and then obtain high-frequency features R through “residual subtraction”—natural splitting of high and low frequencies can be achieved without additional parameters;
Differentiation enhancement: For low-frequency features Q (corresponding to global semantics), enhance the spatial consistency through element-wise multiplication + DWConv 3 × 3; for high-frequency features R (corresponding to small-target details), retain and enhance the local response through DWConv 3 × 3;
Lightweight fusion: Use splicing + 1 × 1 convolution to integrate high- and low-frequency features. While controlling the amount of calculation, the enhanced high- and low-frequency information can complement each other, ultimately improving the expressive ability of features (especially adapted to the needs of small-target detection for high-frequency details).
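The separation logic above can be sketched numerically. The following is a minimal illustrative example (plain Python on a 1-D signal, not the paper's implementation): stride-2 average pooling stands in for the DWConv downsampling, and nearest-neighbor repetition stands in for upsampling; the residual then isolates the high-frequency component.

```python
def frm_split(f):
    """Toy high/low-frequency separation on a 1-D signal (even length).

    Stand-ins: stride-2 average pooling ~ DWConv downsampling,
    nearest-neighbor repeat ~ upsampling.
    """
    # Downsample: P loses high-frequency detail.
    p = [(f[i] + f[i + 1]) / 2 for i in range(0, len(f), 2)]
    # Upsample back: Q is the low-frequency approximation of F.
    q = [v for v in p for _ in range(2)]
    # R = F - Q: high-frequency residual (edges, textures).
    r = [a - b for a, b in zip(f, q)]
    # S = F ⊙ Q: low-frequency features gated by the input.
    s = [a * b for a, b in zip(f, q)]
    return q, r, s

# A flat signal has no high-frequency content, so R is all zeros.
q, r, s = frm_split([2.0, 2.0, 2.0, 2.0])
print(r)  # -> [0.0, 0.0, 0.0, 0.0]
```

Note that Q + R reconstructs F exactly, which is why the split adds no parameters: the high-frequency part is whatever the low-pass path discards.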
3.3. Channelwise Multi-Receptive Field Module
In image segmentation and target detection, existing models generally face a trade-off between efficiency and accuracy: compressing parameters and computation improves efficiency but weakens feature expression, degrading performance on low-resolution images and small targets, while multi-receptive-field modules improve feature modeling but carry a high computational overhead, which is unfavorable in resource-constrained scenarios. To this end, this paper proposes the cascaded multi-receptive-field (CMRF) module, whose structure is shown in Figure 6. Through the cascade design, multi-scale information is integrated, and everything from fine textures to global structures can be modeled effectively. Combined with efficient feature-mining and fusion strategies, it improves detection accuracy while controlling complexity, achieving both accuracy and real-time performance. As depicted in the left portion of Figure 6, the proposed CMRF module incorporates depthwise convolutions.
Initial feature extraction for the input feature map X:

X′ = GELU(BN(DWConv(X)))

where DWConv is a depthwise convolution (reducing the number of parameters), X′ is the output feature map, and GELU is the activation function; the design balances the amount of computation and feature capacity.

Odd- and even-channel splitting and differentiated processing: split X′ by channel index into the odd-channel subset X′_odd and the even-channel subset X′_even, and process them along two paths:

Detail enhancement path: perform element-wise addition of X′_odd and X′_even to obtain X″, which retains the fine-grained features of small targets;

Multi-receptive-field enhancement path: apply a cascade of small-kernel (size 2) DWConv–BN modules to X′_even to capture multi-scale receptive-field information, then concatenate the intermediate outputs into X‴.

Final feature fusion: after concatenating X‴ and X″, the information is integrated through a point-wise convolution block (PWConv–BN–GELU) to output the final feature map:

Y = GELU(BN(PWConv([X‴; X″])))
Lightweight cascaded multi-receptive fields: Use small-core depth convolution to replace large-core/single-scale convolution, with parameters only 75% of SSFF, covering the full scale from small targets 10 × 10 to large targets 60 × 60.
Odd- and even-channel differentiation: Split channels reserve exclusive resources for small targets, and their channel proportion increases from 15% to 45%;
Dual-path fusion: Element addition (details preserved) + branch splicing (complementing the global picture), small-target matching error ≤ 2 pixels, recall rate increased by 4.3% compared to SFCC.
Advantages of small-target detection
Cascade design: The number of small-target receptive field layers is increased from 1 to 3, and the 10 × 10 small-target AP is 7.2% higher than YOLOv8 SPPF;
Odd- and even-channel separation: Small-target feature response intensity is increased by 27%.
See
Appendix A for symbol definitions, dimensions, and notes.
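As a structural illustration only (plain Python over a flat channel list; the DWConv–BN stages and PWConv integration are replaced by simple stand-in operations and are not the paper's implementation), the odd/even split and dual-path fusion can be sketched as:

```python
def cmrf_paths(channels):
    """Toy sketch of the CMRF odd/even channel split and dual-path fusion.

    `channels` is a list of per-channel feature values; the real CMRF
    operates on feature maps with DWConv-BN stages, elided here.
    """
    odd = channels[0::2]   # odd-numbered channel subset
    even = channels[1::2]  # even-numbered channel subset
    # Detail-enhancement path: element-wise addition preserves fine detail.
    detail = [a + b for a, b in zip(odd, even)]
    # Multi-receptive-field path: cascaded stages widen the receptive field;
    # each stage just averages neighbors here as a stand-in for DWConv-BN.
    stage, cascade_out = even, []
    for _ in range(3):  # three cascaded stages
        stage = [(stage[i] + stage[min(i + 1, len(stage) - 1)]) / 2
                 for i in range(len(stage))]
        cascade_out.extend(stage)  # concatenate intermediate outputs
    # Final fusion: concatenate both paths (PWConv integration elided).
    return detail + cascade_out

out = cmrf_paths([1.0, 2.0, 3.0, 4.0])
```

The point of the cascade is that each extra small-kernel stage sees a slightly wider neighborhood of the previous stage's output, so multiple scales are covered without any large kernels.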
3.4. Spatially Adaptive Feature Modulation Network
In image super-resolution and target detection tasks, traditional methods still have limitations in feature learning, multi-scale expression, and computational overhead, which easily lead to texture blur, loss of detail, and loss of small-target features, making it difficult to reconcile global semantics with fine reconstruction. To address these problems, this paper proposes the spatially adaptive feature modulation network (SAFMN), whose overall architecture is shown in Figure 7. Its core module, the feature modulation module (FMM), consists of a spatially adaptive feature modulation (SAFM) block and a convolutional channel mixer (CCM). The input image is first mapped to the feature space through shallow convolution; the FMM then applies multi-scale feature division and transformation and combines global residuals to enhance high-frequency information, achieving deep feature extraction and high-resolution reconstruction.
Some of the divided features are extracted using 3 × 3 depthwise convolutions to capture local representations, while the others undergo multi-level pooling and upsampling to capture long-range dependencies. The resulting features are then concatenated and aggregated along the channel dimension:

X̃ = Conv₁ₓ₁([X̂₀, X̂₁, …, X̂ₙ₋₁])

An attention map is subsequently generated through a nonlinear activation function ϕ to adaptively modulate the input X:

X̂ = ϕ(X̃) ⊙ X

This design effectively enhances non-local feature modeling and multi-scale representation. To further integrate local contextual information and channel interactions, the CCM uses a compact combination of 3 × 3 and 1 × 1 convolutions: the 3 × 3 convolution captures spatial context and expands the channel dimension, while the 1 × 1 convolution restores the original scale and employs GELU activation to strengthen nonlinear representation. The update process of the FMM can then be summarized as follows:

Y = SAFM(LN(X)) + X,  Z = CCM(LN(Y)) + Y

where LN denotes layer normalization. The additional residual paths not only stabilize training but also enhance the restoration of high-frequency details.
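The modulation step X̂ = ϕ(X̃) ⊙ X can be illustrated with a minimal numeric sketch (plain Python on per-position scalars, with GELU as the activation ϕ; a toy illustration, not the paper's implementation):

```python
import math

def gelu(x):
    # Exact GELU via the Gaussian CDF: 0.5 * x * (1 + erf(x / sqrt(2))).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def safm_modulate(x, x_tilde):
    """X_hat = phi(X_tilde) ⊙ X: gate the input by the activated attention map."""
    return [gelu(a) * v for a, v in zip(x_tilde, x)]

# Positions with strong aggregated responses pass through almost unchanged;
# strongly negative responses are suppressed toward zero.
out = safm_modulate([1.0, 1.0, 1.0], [4.0, 0.0, -4.0])
```

Element-wise gating like this is what lets the aggregated multi-scale map X̃ emphasize spatial positions of the input adaptively rather than uniformly.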
Figure 8 shows an overview of the SAFMN architecture. The input LR image is first mapped to the feature space through a convolutional layer, deep features are then extracted through a series of FMMs, and the result is finally reconstructed by the upsampling module. Each FMM consists of the SAFM, the CCM, and two skip connections.
As shown in Figure 9, after introducing SAFMN, the responses of the upsampled features are more concentrated in the target regions, further validating the effective modulation of SAFMN during feature upsampling.
3.5. Dataset
The experiments in this study were conducted on the VisDrone2019 dataset [27] for model training and validation. VisDrone2019 was released by the AISKYEYE team at Tianjin University. It contains 8629 UAV aerial images and more than 2.6 million annotated targets covering 10 categories (pedestrians, vehicles, non-motorized vehicles, etc.), divided into 6471 training, 548 validation, and 1610 test images. The dataset is characterized by dense targets, unbalanced categories, large scale differences, severe occlusion, and complex backgrounds, posing a substantial challenge for target detection algorithms.
3.6. Experimental Environment
The experiments were built on PyTorch 2.0.1 under Windows 11. The hardware is a Lenovo Legion R7000 laptop (Lenovo Group Limited, Beijing, China) equipped with an NVIDIA RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA), using Python 3.8 with CUDA 11.8 acceleration. The model was implemented with Ultralytics YOLOv8 v8.0.200, with image preprocessing by OpenCV v4.8.1, dataset annotation by LabelImg v1.8.6, and result visualization by Matplotlib v3.7.2. The training parameters were set as follows: 200 epochs, batch size 8, 4 worker threads, and input image size 640 × 640; the optimizer used an initial learning rate of 0.01, weight decay of 0.01, and a momentum factor of 0.937 to ensure stable and efficient convergence. The batch size of 8 was dictated by GPU memory constraints given the model size and high-dimensional inputs. Small batches can lead to noisier gradient estimates, which may slightly slow convergence or cause fluctuations in the training curve; however, this noise can also help the model escape local minima and improve generalization. To ensure stable convergence despite the small batch size, we carefully adjusted the learning rate.
3.7. Performance Indicators
The performance of ACS-YOLOv8s was evaluated using the following metrics: precision (P), recall (R), average precision (AP), mean average precision (mAP), and the F1 measure for detection accuracy, while real-time performance was measured by frames per second (FPS) and parameter count (Params). Precision (P) reflects the proportion of correctly detected targets among all predictions:

P = TP / (TP + FP)

where TP is the number of correctly detected positive samples and FP is the number of false positives. Recall (R) measures the proportion of true targets successfully detected by the model:

R = TP / (TP + FN)

where FN is the number of real targets missed by the detector. Average precision (AP) is computed as the area under the precision–recall (PR) curve:

AP = ∫₀¹ P(R) dR

The mAP is the mean of AP across all classes, providing a comprehensive evaluation of multi-class detection performance:

mAP = (1/n) Σᵢ₌₁ⁿ APᵢ

where n is the total number of classes. The F1 measure is the harmonic mean of precision and recall, representing the balance between them:

F1 = 2PR / (P + R)
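As a quick numeric check of these definitions (plain Python with toy counts, not results from the paper):

```python
def detection_metrics(tp, fp, fn):
    """Compute precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_ap(per_class_ap):
    """mAP: mean of per-class average precision values."""
    return sum(per_class_ap) / len(per_class_ap)

p, r, f1 = detection_metrics(tp=80, fp=20, fn=40)
print(round(p, 2), round(r, 4), round(f1, 4))  # -> 0.8 0.6667 0.7273
print(mean_ap([0.5, 0.25, 0.75]))              # -> 0.5
```

Because F1 is a harmonic mean, it sits closer to the smaller of P and R, penalizing models that trade one heavily against the other.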
4. Experimental Procedure
4.1. Ablation Experiments
To ensure fair comparison and reliable results, all data parameters and environment configurations were kept consistent across experiments. Ablation experiments were carried out on the YOLOv8s baseline model using the VisDrone2019 dataset to verify the independent and combined contributions of the C2f_AFE, CMRF, and SAFMN modules. The results are shown in Table 1, where A, B, and C correspond to schemes that individually introduce C2f_AFE, CMRF, and SAFMN, respectively; D and E represent combination schemes that progressively integrate CMRF or SAFMN on top of A; and F denotes the higher-order combination scheme that introduces SAFMN on top of both A and B. The ablation results in Table 1 are analyzed as follows:
Introducing AFE, CMRF, and SAFMN into YOLOv8s improves overall model performance, though each module contributes differently. AFE enhances global modeling and detail fidelity through large-kernel convolution and high/low-frequency feature separation, leading to improvements in recall and mAP50:95. CMRF enables fine-grained feature fusion through dynamic weight allocation, effectively suppressing redundant information and highlighting salient features; it enhances both precision and recall while maintaining high computational efficiency. SAFMN, by focusing on fine-grained feature transfer, improves small-target detail discrimination and reduces the information loss caused by feature compression, at a slight increase in computational cost. The modules also provide complementary benefits in combination: AFE + CMRF pairs global perception with precise feature selection, while CMRF + SAFMN forms a cascaded optimization between feature denoising and detail enhancement. Ultimately, the three-module fusion in ACS-YOLOv8s achieves the best results across all metrics (P = 52.2%, R = 40.5%, mAP50 = 41.6%, mAP50:95 = 25.0%), significantly outperforming the baseline model and validating the effectiveness of multi-module collaborative optimization.
Feature Visualization and Ablation Analysis
To verify the effect of the proposed AFE module on small-target sensitivity, we conducted both visualization and ablation experiments. As shown in Figure 10, feature-response heatmaps of small targets are presented for models without and with AFE. In the left heatmap, the response to small targets is weak and diffuse, whereas in the right heatmap, after introducing AFE, the responses in small-target regions are significantly stronger (indicated by warmer colors). The quantitative ablation results in Table 1 further confirm that including AFE improves small-target detection metrics, demonstrating that AFE effectively enhances feature responses in the target regions. Together, the visualization and ablation analysis substantiate the claimed relationship between AFE and small-target sensitivity.
4.2. Comparison Experiments
To evaluate the effectiveness of the proposed algorithm for small-target detection in UAV aerial scenes, comparative experiments were conducted on the VisDrone2019 dataset against several state-of-the-art detection algorithms. The overall comparison results are presented in Table 2, while Table 3 reports the mAP (%) for each target class.
The experimental results show that detectors such as Faster R-CNN (two-stage) and RetinaNet achieve high accuracy but incur substantial computational costs, making them unsuitable for real-time UAV aerial detection. In contrast, single-stage detectors such as SSD and the YOLO series offer superior inference efficiency but lose accuracy on small targets and in complex backgrounds. Although RT-DETR achieves leading accuracy, it comes with high computational complexity. By comparison, the proposed ACS-YOLOv8s attains mAP50 = 41.6% and mAP50:95 = 25.0%, matching or even surpassing more complex models while maintaining reasonable GFLOPs and parameter counts. These findings demonstrate that the proposed method achieves an excellent balance between accuracy and efficiency, providing robust detection with practical applicability under the demanding conditions of UAV aerial scenes.
As shown in Table 3, RetinaNet achieves the lowest overall performance (mAP = 13.9%), struggling in multi-class detection. The YOLO series generally outperforms RetinaNet; among them, YOLOX performs well on large objects (mAP = 40.3%) but has difficulty with small targets and crowded scenes. YOLOv5 and YOLOv7-tiny balance accuracy and efficiency, while YOLOv8s performs well across most classes (mAP = 38.8%) but still exhibits limitations for small or easily confused categories. In contrast, ACS-YOLOv8s achieves a comprehensive performance improvement, reaching an overall mAP of 41.6%. It shows significant gains in challenging classes such as bicycle, tricycle, van, and truck, and attains the highest accuracy in key classes such as pedestrian and bus. These results confirm that the collaborative optimization of AFE, CMRF, and SAFMN effectively enhances feature representation in breadth, purity, and fine-grained detail, thereby improving the model's robustness and generalization in complex UAV aerial environments.
As shown in Table 4, under this challenging environment, although ACS-YOLOv8s exhibits a slight decrease in precision compared with the baseline, it achieves clear improvements in recall, mAP50, and mAP50:95. Overall, the proposed model outperforms YOLOv8s in degraded conditions, indicating stronger robustness and more stable detection capability.
To further assess performance in real-world scenarios, challenging images from the VisDrone2019 test set were chosen for visualization. The qualitative comparison of detection results is presented in Figure 11, which compares the detection results of different algorithms in typical scenes. From top to bottom, the scenes depict a road, a nighttime commercial street, and a dense small-target environment; the columns show the original image and the detection results of YOLOv5s, YOLOv8s, YOLOv11, and ACS-YOLOv8s, respectively. YOLOv5s and YOLOv8s exhibit missed detections and false positives, while YOLOv11 shows moderate improvement but remains limited. In contrast, ACS-YOLOv8s accurately detects all targets, with bounding boxes closely fitting object boundaries and minimal false detections, demonstrating superior accuracy and robustness in complex conditions.
4.3. Visualization Analysis
To clearly demonstrate the performance improvements of the proposed algorithm over the original model, the mAP50 and mAP50:95 values during training were visualized for both models, providing an intuitive comparison of performance evolution and improvement trends. The comparative visualization results are presented in Figure 12a,b.
ACS-YOLOv8s still has certain limitations under extreme lighting, severe motion blur, or highly dense scenes, where detection accuracy leaves room for improvement. Moreover, the current experiments are mainly based on a specific UAV dataset, and the model's generalization to other datasets or application scenarios still requires verification. Nevertheless, the comprehensive experimental results indicate that ACS-YOLOv8s can effectively detect more small targets in complex environments, with overall detection performance significantly better than the baseline model and stronger robustness and stability. This suggests that the proposed feature enhancement and multi-scale modulation modules play a significant role in improving small-target detection and handling complex scenarios, providing an effective technical solution for UAV small-target detection.
5. Conclusions
This paper proposes a YOLOv8s-based improved algorithm to address challenges in UAV aerial photography target detection, such as large size differences, dense distribution, and blurred features. Symmetry, an inherent geometric property of most aerial-photography targets (e.g., vehicles, buildings), is a key cue for distinguishing targets from complex backgrounds. The algorithm achieves three core improvements: first, the AFE module captures symmetric contour and texture features, enhancing fine-grained geometric perception and complex scene discriminability; second, the CMRF module integrates Ghost and PConv feature reuse ideas to balance computational efficiency and feature representation; third, the SAFMN module dynamically models multi-scale and cross-channel dependencies via CCM, focusing on mining cross-scale symmetric feature correlations to optimize expression. Experimental results show the method outperforms mainstream algorithms in accuracy, with stronger robustness in dense and complex scenes, providing an efficient solution for UAV aerial-photography target detection.
Despite these achievements, the study has limitations guiding future improvements: (1) performance may decline in extreme weather (e.g., heavy rain, fog), as the feature enhancement module poorly adapts to severely degraded images; (2) detection accuracy for ultra-small targets (≤10 × 10 pixels) in complex backgrounds needs improvement, due to fine-grained feature loss during multi-scale downsampling; (3) generalization across different UAV platforms and photography scenarios (e.g., high-altitude fast flight vs. low-altitude hovering) requires further verification, as experiments rely on fixed datasets and specific UAV configurations.
Future work will address these limitations by developing weather-adaptive feature fusion strategies, exploring lightweight super-resolution preprocessing for ultra-small targets, and conducting extensive multi-platform/multi-scenario experiments to enhance practical application value.