Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network

Wang, Xing; Hong, Wen; Li, Qi; Liu, Yunqing; Zhang, Qiong; Xin, Ping

doi:10.3390/rs18020236

Open AccessArticle

Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network

by

Xing Wang

^1,2

,

Wen Hong

^1,3,*,

Qi Li

¹,

Yunqing Liu

¹,

Qiong Zhang

¹ and

Ping Xin

²

¹

College of Electrical and Information Engineering, Changchun University of Science and Technology, Changchun 130022, China

²

College of Electrical and Information Engineering, Beihua University, Jilin 132013, China

³

Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(2), 236; https://doi.org/10.3390/rs18020236

Submission received: 26 November 2025 / Revised: 25 December 2025 / Accepted: 8 January 2026 / Published: 11 January 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

We proposed the SAR-ACGAN network architecture to generate high-quality, diversified virtual aircraft target images, effectively expanding the scale of the SAR image dataset, addressing sample scarcity, and significantly improving aircraft target recognition accuracy.
We constructed the SAR-YOLOv8l-ADE network structure. Through the collaborative optimization of three modules, namely SAR-DFE, SAR-C2f, and 4SDC, it strengthens detailed feature extraction, adapts to multi-scale targets, and enhances the recognition capability of small targets.

What are the implications of the main findings?

To address the common problem of scarce target samples in the SAR field, the SAR-ACGAN network provides an efficient dataset expansion solution, laying the foundation for performance breakthroughs in similar SAR target recognition tasks.
The optimization of the SAR-YOLOv8l-ADE network for feature extraction and small-target recognition not only improves the overall performance of SAR aircraft target detection but also provides a methodological reference for other small-target recognition tasks in the SAR field, helping SAR technology be more accurately applied in practical scenarios such as aviation monitoring and target reconnaissance.

Abstract

As a core node of the air transportation network, airports rely on aircraft model identification as a key link to support the development of smart aviation. Synthetic Aperture Radar (SAR), with its strong-penetration imaging capabilities, provides high-quality data support for this task. However, the field of SAR image interpretation faces numerous challenges. To address the core challenges in SAR image-based aircraft recognition, including insufficient dataset samples, single-dimensional target features, significant variations in target sizes, and high missed-detection rates for small targets, this study proposed an improved network architecture, SAR-YOLOv8l-ADE. Four modules achieve collaborative optimization: SAR-ACGAN integrates a self-attention mechanism to expand the dataset; SAR-DFE, a parameter-learnable dual-residual module, extracts multidimensional, detailed features; SAR-C2f, a residual module with multi-receptive field fusion, adapts to multi-scale targets; and 4SDC, a four-branch module with adaptive weights, enhances small-target recognition. Experimental results on the fused dataset SAR-Aircraft-EXT show that the mAP₅₀ of the SAR-YOLOv8l-ADE network is 6.1% higher than that of the baseline network YOLOv8l, reaching 96.5%. Notably, its recognition accuracy for small aircraft targets shows a greater improvement, reaching 95.2%. The proposed network outperforms existing methods in terms of recognition accuracy and generalization under complex scenarios, providing technical support for airport management and control, as well as for emergency rescue in smart aviation.

Keywords:

synthetic aperture radar; aircraft target recognition; data augmentation; attention mechanism; multi-scale feature extraction

1. Introduction

Synthetic Aperture Radar (SAR) overcomes the limitations of a single-aperture radar by using virtual-aperture synthesis, enabling high-resolution imaging. It boasts all-time, all-weather, and high-penetration imaging capabilities, providing abundant target information, including structure, shape, and scattering characteristics, and thus playing a crucial role in both civil and military monitoring. Aircraft targets are important surveillance objects, and accurately and efficiently recognizing aircraft types in SAR images is a key research topic in target recognition. This technology not only meets the urgent need to improve aviation intelligence and ensure safe, efficient operations, but also defines the core direction of this research.

The development of SAR image recognition technology lags significantly behind that of optical imaging. SAR images are not only severely disturbed by speckle noise but also exhibit complex geometric features and radiation characteristics formed by microwave scattering. These images differ drastically from optical images in visual appearance, leading to an order-of-magnitude increase in the difficulty of automated recognition. With the rapid advancement of deep learning, breakthroughs in computer vision have enabled novel solutions for SAR image target recognition. The deep learning algorithms automatically learn abstract image features through hierarchical network structures, ranging from shallow edge and texture information to deep semantic features, markedly improving target recognition performance in complex scenarios. This enables an end-to-end object detection mode that integrates traditional detection, identification, and classification into a unified process. Object detection algorithms are divided into two-stage and one-stage types. Classic two-stage algorithms first generate candidate boxes, then perform regression and classification, resulting in low recognition efficiency [1,2,3]. One-stage algorithms, such as YOLO [4,5,6,7,8,9,10], have emerged accordingly. Without the intermediate candidate-box prediction step, these algorithms obtain results directly from images, ensuring recognition accuracy while reducing computational complexity and improving detection efficiency.

In the early stage of research, the academic community mostly attempted to directly transfer classical convolutional neural networks from the field of optical images to SAR image target recognition tasks. However, due to the inherent characteristics of SAR images, such as coherent speckle noise interference, single-channel information limitations, complex background interference, and target discreteness, direct transfer faces numerous challenges, prompting researchers to make targeted improvements to classical frameworks. Xiao et al. [11] proposed the PFF-ADN network framework. Peak features of targets are extracted using a corner detector, which yields geometrically invariant features. A deformable convolution structure is integrated into the backbone network, enabling effective adaptation to the attitude sensitivity of aircraft targets and the deformation characteristics of aircraft components. Zhao et al. [12] proposed the SFSA network based on the spatial structure of scattering features. Their team designed a spatial electromagnetic scattering feature-extraction module to mine potential structural features and high-level semantic features of aircraft targets, and an image-domain feature-extraction module to extract global information about the discrete appearance of aircraft targets. Both modules provide more discriminative features for aircraft model recognition. Zhu et al. [13] designed the FEMSFNet network structure. It integrated the attention mechanism [14,15] and residual connections [16] into the feature-extraction backbone network, fused local and global features via pooling operations with multiple receptive fields, accelerated model convergence, and improved model performance. Guo et al. [17] proposed the SAR-NTV-YOLOv8 network structure. It incorporated a detection branch for high-resolution small aircraft targets to mitigate the loss of detailed features, and presented a multi-scale feature-adaptive fusion module based on the attention mechanism to establish short- and long-term dependencies between feature grouping and multi-scale structures. Huang et al. [18] developed the SEFFNet network structure. They designed a scattering information extraction and enhancement module to suppress clutter and avoid false alarms effectively, and introduced a multi-scale feature fusion pyramid structure to adaptively assign weights to different feature maps, achieving efficient fusion.

Despite the remarkable progress achieved in aircraft target recognition technology for SAR images, the field still faces numerous urgent challenges and issues due to the complexity of targets and backgrounds, and the inadequacy of algorithm adaptability [19]. Firstly, the high costs of image acquisition and data annotation have led to an insufficient number of samples in current public datasets, severely restricting further improvement in deep learning model performance. Secondly, SAR images contain only single-channel grayscale information, with limited features, lacking the multi-dimensional characteristics of optical images, such as color and texture. This makes it difficult for models to capture the differential features between targets and backgrounds. Thirdly, affected by imaging distance, resolution, and observation angle, the pixel sizes of targets in images show significant variations, requiring feature extraction networks to have the ability to dynamically adjust receptive fields [20,21]. Finally, small aircraft targets are depicted with very few pixels in SAR images. They lack complete semantic features, which makes it challenging to extract practical features for accurate recognition [22]. The paper proposed the SAR-YOLOv8l-ADE (SAR-YOLOv8l-Airplane Detection Enhancement), tailored to the characteristics of SAR images across four dimensions: dataset expansion, detailed feature extraction, adaptive adjustment of multi-scale targets, and enhancement of small-target recognition performance. The main innovative achievements are as follows:

(1): A SAR image sample generation network (SAR-Auxiliary Classifier Generative Adversarial Network, SAR-ACGAN) is designed. Based on the ACGAN architecture, the model integrates a self-attention mechanism to enhance adaptability and robustness in noisy environments, enabling the generation of high-quality, diverse virtual aircraft target images. It effectively expands the dataset scale and further improves the recognition accuracy of aircraft targets.
(2): A SAR image detail feature extraction module (SAR-Detail Feature Extraction, SAR-DFE) is designed as a parameter-learnable, adaptive deep learning component. Integrating central difference convolution and adaptive filtering algorithms enhances the extraction of edge and structural features while suppressing speckle noise. The dual residual structure avoids information loss, and the three-branch concatenation expands single-channel images into three-channel images for deeper feature extraction.
(3): A multi-scale target detection module (SAR-Coarse-to-Fine, SAR-C2f) is designed. Based on the original C2f structure of YOLOv8, it integrates a residual structure with multi-receptive-field adaptive fusion (Multi-Scale Adaptive Fusion Resnet, SAR-MAFR), which dynamically adapts receptive fields. This effectively improves the detection performance of multi-scale targets and is particularly conducive to the extraction of small-target features.
(4): A four-branch adaptive fusion detection head (Four-scale Detectors with Coordinate Attention, 4SDC) is designed. It adds a small-target detection branch and dynamically allocates branch weights via an attention mechanism, thereby enhancing the fusion of shallow detail features and deep semantic features, thereby effectively improving the recognition accuracy of small targets.

The subsequent sections are organized as follows: Section 2 elaborates on our proposed method in thorough detail. Section 3 describes the dataset, experimental environment, and evaluation metrics, and reports the experimental results along with their analysis. Finally, the discussion and conclusions are presented.

2. Methods

This paper designs a SAR-YOLOv8l-ADE network structure with YOLOv8l as the baseline model, and the specific improvement measures are as follows: Firstly, the SAR-ACGAN algorithm is adopted to realize dataset augmentation. The SAR-ACGAN operates independently of the main detection network. Its core function is to augment the dataset by generating high-quality synthetic aircraft images, which are then integrated into the training data of the main detection network. Secondly, the SAR-DFE model is used to extract richer features from SAR images. Thirdly, the SAR-C2f module structure is designed to extract multi-scale aircraft target features from images effectively, and the Ghost-CBS operation [23] is integrated to improve computational efficiency and streamline model parameters. Finally, in the detection head, the designed 4SDC structure can effectively aggregate contextual information from different feature layers, and automatically assign weights to different detection branches according to the size of the input aircraft target, which helps to enhance the recognition ability of small targets. The structure of the improved model is shown in Figure 1.

2.1. Dataset Sample Augmentation Network SAR-ACGAN

Sufficient and high-quality data serve as the cornerstone for advancing theoretical and methodological innovations. However, acquiring authentic SAR images via satellites is exceptionally costly and difficult to annotate, leading to a severe shortage of samples that significantly restricts the performance improvement of deep learning-based recognition algorithms. This paper proposes the SAR-ACGAN, a network that generates new images via adversarial training between the generator and discriminator. The synthetic images not only exhibit high quality, rich diversity, and strong consistency but also better meet the target recognition algorithm’s requirements for data distribution and feature distributions, endowing the model with stronger generalization ability.

The classic ACGAN [24,25] model is built with convolutional layers. It excels at handling local information relationships but struggles with long-range dependencies in images. Aircraft targets in SAR images have quite complex structures. A certain distance spatially separates components such as the fuselage, wings, and empennage, yet they share close structural and semantic connections. Meanwhile, some background information around the aircraft also has specific associations with the aircraft target itself. These non-local spatial correlations form the crucial long-range dependencies of aircraft targets. In this paper, the SAR-ACGAN network is designed by integrating the self-attention mechanism [26,27] into the classic ACGAN framework. This integration enables the network to focus more on relatively straightforward and representative parts and features of images, effectively capture feature relationships across spatially separated regions, and generate SAR aircraft images with rich details closer to reality.

To match the ACGAN network’s structure, the self-attention module is designed as shown in Figure 2. By means of the similarity between query

Q

and key

K

, the weight of value

V

is determined. It calculates the correlation between each pixel and all other pixels, capturing semantic and spatial relationships across distant regions of the image.

The input feature map

X

is convolved with three parallel (1 × 1) kernels to obtain

Q

,

K,

and

V

matrices, respectively. Then, matrix multiplication is performed between

Q

and the transposed matrix of

K

to obtain matrix

S

. The formula is as follows:

S = \frac{Q \times K^{T}}{\sqrt{d}}

(1)

where d denotes the channel dimension of

Q, K,

and

V

matrices, dividing by

\sqrt{d}

scales the attention scores to an appropriate range, alleviates gradient vanishing, and stabilizes training. Subsequently, the Softmax activation function [28] is applied to normalize matrix

S

, yielding the attention weight matrix

W

. Then

W

is multiplied by matrix

V

, followed by convolution with a (1 × 1) kernel to adjust the number of channels of the output feature map, yielding feature map

O

. In the generator, the channel dimension is compressed with a compression ratio of 0.5; in the discriminator, it is expanded with an expansion ratio of 2. Additionally, the output of the attention layer is multiplied by a scaling parameter

λ

. It is a learnable parameter that adjusts the contribution of the attention output to the final result through iterative learning. Initialized to 0,

λ

allows the model first to learn simple tasks and then gradually increase task complexity. The introduction of this parameter avoids convergence difficulties caused by tackling complex tasks at the outset. Meanwhile, a residual structure is integrated to ensure that basic and critical information is not lost and to maintain feature stability. The formula of the

Y

is as follows:

Y = λ O + R

(2)

The structure of the SAR-ACGAN network is illustrated in Figure 3, and its network parameters are presented in Table 1.

To support the training and validation of the ACGAN and SAR-ACGAN network models, we construct a novel dataset named SAR-Aircraft-Gen, with detailed information presented in Section 3.1.2.

2.2. Detailed Feature Extraction Module SAR-DFE

SAR images are single-channel grayscale data with inherent insufficient feature information. Most mainstream deep learning algorithms are designed for three-channel color images, and their inherent feature-extraction mechanisms cannot adapt to single-channel SAR data, posing significant challenges for extracting discriminative features between targets and the background. Additionally, speckle noise appears as random granular textures that cover target regions, blur the edge contours and internal structures of aircraft targets, and increase the difficulty of feature extraction.

This study proposed a novel convolutional architecture that integrates two core functions: structural feature extraction of aircraft targets and denoising of SAR images. This component seamlessly integrates with deep learning frameworks, combining the advantages of traditional feature extraction with neural networks’ powerful feature-learning capabilities. As a parameter-learnable and adaptive module, it enhances the model’s sensitivity to local structural features. Experimental results demonstrate that the SAR-DFE module can effectively extract aircraft targets from high-noise SAR images, and the module’s architecture is shown in Figure 4.

The SAR-DFE structure consists of three parallel convolutional branches, including one standard convolutional module and two image-processing convolutional modules, which form a dual residual structure via skip connections [29,30]. Specifically, the structural feature extraction module uses the central difference convolution (CDC) module [31]. Meanwhile, the SAR image denoising module is designed using an adaptive LEE (A-LEE) filtering algorithm with a convolutional structure [32]. Both modules are equipped with trainable parameters to better adapt to variations in scattering characteristics in SAR images. Subsequently, the three branches undergo a non-linear transformation via the ReLU activation function and are fused with the original input image. This residual structure helps preserve the original image information. Then, each of the three branches performs further feature extraction and transformation through standard convolution operations, followed by another fusion with the original input image. This dual residual structure can effectively prevent information loss during training and promote gradient backpropagation. Finally, the three branches are concatenated and converted into a three-channel image, which better meets the input requirements of deep learning algorithms. This module can provide more detailed features for subsequent aircraft recognition tasks.

CDC is a distinctive convolutional operation, whose core advantage lies in enhancing the ability to extract detailed local features through differential operations. This operation ingeniously integrates the characteristics of traditional differential operators (e.g., Sobel operator [33,34]) that are sensitive to edges and gradients, while introducing trainable parameters to enable adaptive adjustment for different data characteristics. The CDC operational logic diagram is shown in Figure 5.

The calculation formula of CDC is as follows:

Y_{C D C} (i, j) = \sum_{m, n = - 2, - 2}^{2, 2} W (m, n) \cdot [X (i + m, j + n) - X (i, j)]

(3)

where

(m, n)

is the position index of the convolution kernel;

W (m, n)

is the trainable weight parameter at the corresponding neighborhood position

(m, n)

. The kernel size of

W

is set to

(5, 5)

. CDC computes the difference between the central pixel and its neighbors, then aggregates the difference using trainable weights

W

. CDC enables the model to capture more high-contrast information, such as strong scattering point features. The comparative advantages of CDC and the traditional Sobel operator are shown in Table 2.

CDC combines the edge-sensitivity of traditional differential operators with the adaptability of deep learning, providing more powerful structural feature-extraction capability. It demonstrates stronger adaptability in complex scenarios. In SAR image analysis, it can effectively enhance sensitivity to aircraft structures, making it a key component in building high-performance SAR target recognition models.

The speckle noise in SAR images is multiplicative, causing uneven brightness variations and severely interfering with the extraction of detailed features of aircraft targets. The fundamental purpose of adopting a filtering algorithm is to eliminate the speckle noise in the image while preserving as much detailed information of aircraft targets as possible. Based on the traditional LEE filtering algorithm, we construct the A-LEE structure. First, the local mean

{\bar{I}}_{u}

and local variance

{σ_{u}}^{2}

are calculated. Then, the weight coefficient

W

is calculated according to the following formula:

W = 1 - \frac{C_{u}^{2}}{C_{I}^{2}}

(4)

C_{u} = \frac{σ_{u}}{{\bar{I}}_{u}}, C_{I} = \frac{σ_{I}}{{\bar{I}}_{I}}

(5)

where

C_{u}

is the local standard deviation coefficient of the image;

C_{I}

is the global standard deviation coefficient of the original image;

σ_{I}

is the global standard deviation of the original image;

{\bar{I}}_{I}

is the global mean of the original image. On this basis, a weight attention correction coefficient is incorporated, transforming a traditional filter with fixed algorithm parameters into a learnable deep learning component that can better cope with complex, changing noise environments. The module structure is shown in Figure 6.

The formula for the estimated value

\hat{R}

of the denoised image is as follows:

\hat{R} = \bar{I} + A t t e n \cdot W \cdot (I - \bar{I})

(6)

A weight attention module is designed, as shown in the pink box in Figure 6. First, the two features, local mean and local variance, are concatenated to integrate information. Then, a 3 × 3 convolution kernel is used to extract features, producing a 16-channel feature map. Next, a nonlinear factor is introduced through the ReLU activation function to enhance the model’s ability to express complex features. Furthermore, a 1 × 1 convolution kernel is used to compress the channel dimension to a single channel. Finally, through the Sigmoid activation function [35], the output result is mapped to the interval (0, 1) to generate the attention correction coefficient matrix Atten.

The A-LEE denoising module effectively suppresses speckle noise while preserving structural details. This module combines the physical significance of traditional algorithms and the data-driven advantages of deep learning, providing a more flexible and effective solution for SAR image denoising.

2.3. Multi-Scale Feature Extraction Module SAR-C2f

Scale diversity is a core challenge in aircraft target recognition for SAR images. Aircraft of the same model exhibit a large span and a scattered distribution of pixel scales, which severely impairs algorithm stability. Taking the A220 model as an example: the maximum pixel size is approximately 150 × 100 pixels, where the wing contour and fuselage structure can be distinguished through the distribution of strong and weak scattering regions; the minimum pixel size is only 35 × 35 pixels, with a highly abstract overall shape. Fine-grained structures are submerged by speckle noise and background clutter, and the model can only be associated with them based on the spatial relationships of local strong-scattering points, combined with prior knowledge, leading to a drastic increase in recognition difficulty. Additionally, target attitude variations exacerbate image distortion, further highlighting the necessity for deep learning algorithms to exhibit cross-scale feature robustness and generalization, especially for the recognition of small aircraft targets.

This paper proposed the SAR-MAFR structure, which integrates the Inception structure [36], the SENet attention mechanism, and the Residual structure. By paralleling convolutional operations with different kernel sizes, the Inception can capture feature information from multiple scales simultaneously. Instead of simply increasing network depth, this structure expands network width, broadens feature extraction pathways, and enriches feature representations. Incorporating the SENet attention mechanism into this structure enables the module to consider contributions from different receptive fields and adaptively adjust their weights based on input SAR image information, thereby improving the detection accuracy of multi-scale aircraft targets. Additionally, to address the gradient vanishing problem caused by increased depth, a Residual connection is introduced in this module, enabling rapid gradient propagation. This avoids information loss during multi-layer transmission and optimizes the efficiency of feature propagation and learning. The structure of the SAR-MAFR is illustrated in Figure 7.

The input feature map

X \in R^{w, h, c}

is convolved with kernels of sizes (3, 3), (5, 5), and (7, 7), respectively, to obtain

U 1

,

U 2

, and

U 3

. To reduce the number of parameters and improve computational efficiency, we adopted the SE-Ghost-Bottleneck convolution structure [37]. Then,

U 1

,

U 2

, and

U 3

are summed pixel-wise to obtain

U \in R^{w, h, c}

, which fuses information from different receptive fields.

U

is fed into the SENet to calculate channel-wise weight vectors

a

,

b

, and

c

for predicting the weights of different receptive fields. Finally, a short connection is used to form a residual connection, and the calculation formula for the output feature map

V

is as follows:

V = a U 1 + b U 2 + c U 3 + X

(7)

Based on the C2f module in the YOLOv8l architecture, this paper proposes the SAR-C2f, whose structure is shown in Figure 8.

The Split operation establishes distinct feature processing pathways, increasing the number of branches for gradient propagation and facilitating training and convergence. Subsequently, after passing through N feature-extraction modules, each module directly extracts a feature map with C/2 channels and feeds it into the subsequent concatenation operation via a skip connection, effectively preserving the feature information from intermediate layers. The N value is designed to strictly adhere to the structural configuration of the original YOLOv8l network. The N values of C2f modules follow a unified “shallow-deep-deep-shallow” principle to balance the requirements of fine-grained feature preservation and semantic feature extraction. Specifically, the N values of the four C2f modules in the backbone adopted are sequentially configured with (3, 6, 6, 3). The SAR-C2f effectively captures multi-scale characteristics by fusing features extracted from different receptive fields, enhancing the algorithm’s adaptability to scale variations in aircraft targets and particularly improving the recognition accuracy of small aircraft targets.

2.4. Four-Scale Adaptive Fusion Detection Head 4SDC

Small-target samples exhibit sparse features, manifesting only as aggregates of a few scattering points in images without complete semantic features. Their signal intensity is often comparable to background clutter, making it difficult for traditional detection algorithms to achieve effective differentiation. Experimental data indicate that, in typical airport scenarios, the missed-detection rate for small aircraft targets is significantly higher than the detection error for medium and large-sized targets. The recognition of small targets in SAR images has become an intractable technical bottleneck.

This paper proposed the 4SDC structure [38]. To improve the detection accuracy of small targets, a fourth detection branch is added to the C2 layer of the YOLOv8l baseline. The C2 layer corresponds to the shallow feature extraction stage, which retains richer fine-grained spatial details that are critical for small-target detection. By adding the detection branch at the C2 layer, we can effectively capture the low-level fine features of small targets in SAR images, thus compensating for the insufficient small-target representation capability of the original multi-scale detection framework. To resolve the problem of insufficient feature utilization caused by traditional equal-weight fusion, the Coordinate Attention (CA) [39] is merged into the detection head. By modeling feature correlations across spatial and channel dimensions, this mechanism adaptively allocates contribution weights to the four detection branches, thereby achieving efficient fusion of multi-branch features. The structure of the 4SDC is illustrated in Figure 9.

Simple equal-weight fusion degrades the recognition performance; thus, the CA mechanism is adopted to adjust the weights of each branch adaptively. Due to differences in resolution and the number of channels across the feature maps of each detection branch, it is necessary to align these feature maps to a standard size using up-sampling, down-sampling, and convolution operations. Subsequently, concatenation is performed to obtain the feature map

X

∈ (4C × H × W). Through the CA mechanism, the feature map

X

is transformed into the weighted feature map

U

∈ (4C × H × W). The channel dimension of

U

is compressed via convolution operations to obtain

W

∈ (4 × H × W).

W

consists of weight scalar maps

λ_{a}

,

λ_{b}

,

λ_{c}

, and

λ_{d}

. Finally, the Softmax activation function is applied to

W

to compute four normalized weight matrices a, b, c, and d. The formulas are as follows:

a_{i j} = \frac{e^{λ_{a_{i j}}}}{e^{λ_{a_{i j}}} + e^{λ_{b_{i j}}} + e^{λ_{c_{i j}}} + e^{λ_{d_{i j}}}}

(8)

b_{i j} = \frac{e^{λ_{b_{i j}}}}{e^{λ_{a_{i j}}} + e^{λ_{b_{i j}}} + e^{λ_{c_{i j}}} + e^{λ_{d_{i j}}}}

(9)

c_{i j} = \frac{e^{λ_{c_{i j}}}}{e^{λ_{a_{i j}}} + e^{λ_{b_{i j}}} + e^{λ_{c_{i j}}} + e^{λ_{d_{i j}}}}

(10)

d_{i j} = \frac{e^{λ_{d_{i j}}}}{e^{λ_{a_{i j}}} + e^{λ_{b_{i j}}} + e^{λ_{c_{i j}}} + e^{λ_{d_{i j}}}}

(11)

where

λ_{a_{i j}}

,

λ_{b_{i j}}

,

λ_{c_{i j}}

, and

λ_{d_{i j}}

represent the pixel values at position

(i, j)

in the weight feature map

W

(

λ_{a}

,

λ_{b}

,

λ_{c}

,

λ_{d}

).

a_{i j}

,

b_{i j}

,

c_{i j}

, and

d_{i j}

represent the pixel values at position

(i, j)

in the weight matrices (a, b, c, d). By multiplying the feature map of each layer by the corresponding weight matrix and then fusing these weighted feature maps to obtain

V

. The formula is as follows:

V_{l} = a X_{2 \to l} + b X_{3 \to l} + c X_{4 \to l} + d X_{5 \to l}

(12)

The weight matrices a, b, c, and d are iteratively updated through the backpropagation mechanism. They perform strategic weight allocation among different detection branches, aggregate contextual information from feature layers with different receptive fields, generate fused features, and effectively enhance recognition of small targets.

3. Results

In this section, the recognition performance of the proposed method is evaluated via experiments. We elaborate on the dataset construction, experimental environment, and evaluation metrics, and then present a meticulous analysis of the experimental results.

3.1. Dataset Construction

3.1.1. SAR-Aircraft and SAR-Aircraft-EXT Dataset

The datasets in this paper are based on two public datasets, namely ISPRS-SAR-Aircraft [40] and SAR-AIRcraft-1.0 [41]. The datasets include images of multiple civil airports captured at different periods, covering seven aircraft categories as illustrated in Figure 10: Boeing787 (B787), A330, Boeing737-800 (B737), A320/321, ARJ21, A220, and others.

In this paper, two public datasets are fused to construct a new dataset, SAR-Aircraft, containing 6368 images and 23,019 aircraft instances: 4868 images for the training set, 500 images for the validation set, and 1000 images for the test set. The SAR-Aircraft is the base dataset. Subsequently, the SAR-ACGAN network is used to augment the dataset, with 2500 additional images and 8000 additional aircraft instances added to the training set of the SAR-Aircraft dataset. The augmented dataset is then renamed SAR-Aircraft-EXT. The extended dataset comprises 8868 images, 31,019 aircraft instances, and 7855 small-target instances (defined as aircraft with a pixel size of less than 60 × 60). Its training set consists of 7368 images, with the validation and test sets consisting of 500 and 1000 images. The final reported mAP₅₀ is entirely evaluated on the independent test set, which has no overlap with the augmented training data. The SAR-Aircraft-EXT is the primary dataset used for training and evaluating the SAR-YOLOv8l-ADE network.

3.1.2. SAR-Aircraft-Gen Dataset

The SAR-Aircraft-Gen dataset is constructed by cleaning, filtering, and cropping the base SAR-Aircraft dataset into a standardized 128 × 128-pixel format. It contains 1600 samples covering seven aircraft types, with its sample distribution detailed in Table 3. This dataset is used exclusively for training and validation by both the SAR-ACGAN and ACGAN networks.

First, the original SAR-Aircraft training set is cleaned and filtered. Due to diverse data sources, some images are redundant from repeated acquisitions, and some samples are low quality, with blurred targets or unclear edges. Such samples hinder network convergence and should be removed. The final dataset is refined to 500 images, containing 1600 aircraft targets.

Second, SAR images contain substantial interference. With a maximum resolution of 2048 × 2048, noise and complex background interference are further amplified. This increases the difficulty of model convergence, often leading to blurred generated images, missing details, and incorrect features. Therefore, full-image training was abandoned in the experiments. Instead, images were cropped to a uniform size of 128 × 128 pixels centered on the target aircraft, using the bounding box center in the labels as the cropping center. The selection of this size is first based on the size distribution law of aircraft targets in the dataset. Among all aircraft targets, 21,065 targets are smaller than 128 × 128 pixels, accounting for 91.5% of the total; in addition, 128 × 128 is also the standard training image size for the ACGAN network. Therefore, adopting this uniform size not only matches the scale characteristics of the vast majority of targets in the dataset but also significantly reduces the complexity of data processing and effectively mitigates the problems of gradient vanishing or explosion during training.

Finally, the SAR-ACGAN model generates 128 × 128-pixel single-target fake images. By copying these synthetic images back to the corresponding positions in the original cropped images, sample augmentation is achieved. The dataset with augmented samples is called SAR-Aircraft-EXT.

3.2. Experimental Environment

The experiments were conducted on a computer running 64-bit Windows 10. The hardware configuration: CPU is an Intel(R) Xeon(R) E5-2680 v3 @ 2.5 GHz, GPU is an NVIDIA GeForce RTX 4060 Ti with 16 GB memory, and the training was accelerated using CUDA 12.4. The software configuration: the network architecture was built based on PyTorch 2.4, with Python 3.10 adopted as the programming language. The training parameter configuration: input image resolution is 640 × 640 pixels, number of training epochs is 500, batch size is 16, and learning rate is 0.01. The SGD optimizer was used, and Mosaic data augmentation was turned off during the last 10 training epochs. To ensure fairness and statistical consistency, each comparative experiment used a fixed random seed (set to 110) to initialize the parameters; each experiment was independently repeated five times under consistent experimental conditions to calculate the mean.

3.3. Evaluation Metrics

3.3.1. Evaluation Metrics for Generated Image Quality

In the paper, the Fréchet Inception Distance (FID) is used to evaluate the similarity between generated images and authentic images [42]. We use the weights of the pre-trained Inception-V3 model [43] to extract a 2048-dimensional feature from the fully connected layer before the final classification. Both real and generated images extract features through this model, and the calculation formula of FID is as follows:

F I D = {| | µ_{r} - µ_{f} | |}_{2}^{2} + T_{r} (\sum r + \sum f - 2 \sqrt{\sum r \times \sum f})

(13)

where

{| | µ_{r} - µ_{f} | |}_{2}^{2}

represents the Euclidean distance between the mean value

µ_{r}

of the real image feature vectors and the mean value

µ_{f}

of the generated image feature vectors, which measures the degree of difference between the two distributions in terms of their means;

\sum r

and

\sum f

are the covariance matrices of the real image feature vectors and the generated image feature vectors, respectively, which describe the correlation between the various dimensions of the feature vectors;

T_{r}

denotes the trace of a matrix, which is the sum of the elements on the main diagonal of the matrix.

T_{r} (\sum r + \sum f - 2 \sqrt{\sum r \times \sum f})

measures the difference between the two distributions in terms of the correlation of feature dimensions. In summary, the smaller the FID value, the closer the generated sample distribution is to the real sample distribution, indicating better image quality.

3.3.2. Evaluation Metrics for Aircraft Target Recognition Capability

In the paper, indicators such as average precision (AP), mean average precision (mAP), precision (P), recall (R), parameters (Params), and giga floating-point operations per second (GFLOPs) are used to evaluate the recognition performance of aircraft targets.

AP is a metric for evaluating the recognition performance of single-type aircraft targets. The higher its value, the better the algorithm’s performance. mAP is used to measure the average recognition precision of seven aircraft types. The higher its value, the better the algorithm’s overall recognition performance. This indicator includes two sub-indicators: mAP₅₀ and mAP_50–95. The former represents the recognition precision at an IoU threshold of 50%. The latter represents the average recognition precision across all 10 IoU thresholds from 5% to 95% with a step size of 5. Params represent the number of model parameters; GFLOPs represent the computational complexity.

3.4. Analysis of Experimental Results

3.4.1. Analysis of Experimental Results for the SAR-ACGAN Network

We conducted experiments and validation on the SAR-Aircraft-Gen dataset, incorporating traditional data augmentation methods (e.g., size adjustment, random rotation, and random flipping). Figure 11 shows the images generated at different Epochs.

To intuitively analyze the performance differences between the ACGAN and SAR-ACGAN networks, two images were selected for each aircraft type from the models’ outputs for comparison, as shown in Figure 12 and Figure 13. The quality of images generated by ACGAN is significantly inferior to that of SAR-ACGAN. Specific manifestations are as follows: in the green-marked regions, there is a lack of tail-scattering features; in the pink-marked regions, some unrealistic highlight points appear; in the yellow-marked regions, aircraft target misclassification occurs with images of incorrect aircraft types generated; additionally, the aircraft targets in ACGAN-generated images are blurred. Based solely on visual effect analysis, the performance of samples generated by the SAR-ACGAN has improved substantially compared with that of the ACGAN.

To accurately evaluate the quality of generated images, the FID metric is adopted. Table 4 presents FID scores of aircraft target images generated by different network models. The average FID score of aircraft target images generated based on the ACGAN network is 76.25, while that of images generated by the SAR-ACGAN network proposed in this study is 42.67. This represents a 33.58% improvement in the FID metric. From the detailed data, among all aircraft types, the A330 type achieves the most remarkable FID improvement, with a reduction of 43.18. As this aircraft type has the smallest sample size in the dataset, the SAR-ACGAN structure shows a more pronounced improvement in few-shot scenarios. These results highlight the distinct advantages of the SAR-ACGAN architecture designed in this study.

To further improve image quality, the FID score for each image is calculated. Images with FID values below the average are selected as high-quality for subsequent use. Based on 1600 single-aircraft target images from the SAR-Aircraft-Gen dataset, after performing 1×, 2×, 3×, 4×, and 5× quantity augmentation, the augmented images are copied to the exact positions of the corresponding aircraft models in the original cropped images. Finally, these images are integrated into the SAR-Aircraft-EXT training set to conduct a refined comparative analysis.

Three target recognition algorithms are employed, with recognition accuracy mAP₅₀ presented as curves in Figure 14. Both the ACGAN and SAR-ACGAN data augmentation methods improve aircraft target recognition accuracy. The proposed SAR-ACGAN method demonstrates superior performance: when the dataset is augmented by 5×, the mAP₅₀ values of YOLOv5s, YOLOv8s, and Faster R-CNN reach 90.2%, 91%, and 88.2%, respectively. Compared with the ACGAN method, these represent improvements of 0.6%, 0.6%, and 0.8%, while achieving increases of 2.1%, 1.6%, and 1.5% relative to the baseline model. Further analysis of the curves shows that when the number of augmented images exceeds 3×, the ACGAN method’s aircraft target recognition accuracy declines, indicating suboptimal image quality. The integration of low-quality images into the original dataset negatively affects the final aircraft target recognition accuracy. In stark contrast, when the SAR-ACGAN method is adopted, the aircraft target recognition accuracy remains stable even with a large number of generated images, verifying the effectiveness of the proposed method. Finally, a 5× image augmentation strategy is employed, resulting in a total of 2500 full-scene images. These images are integrated into the original training set to construct the SAR-Aircraft-EXT dataset.

3.4.2. Analysis of Experimental Results for the SAR-DFE Module

Based on the SAR-Aircraft-EXT dataset, a systematic analysis and summary of the experiments are conducted.

First, using YOLOv8s and YOLOv8l as the baselines, Table 5 presents the AP₅₀ and mAP₅₀ values for aircraft target recognition across different algorithms. The data show that the mAP₅₀ values of YOLOv8s and YOLOv8l, when integrated with the SAR-DFE, increased by 2.2% and 2.6%, respectively, compared with the baseline models. This further confirms that the proposed module is conducive to improving the recognition accuracy of aircraft targets. Figure 15 presents the confusion matrix and PR curve using the YOLOv8l model integrated with the SAR-DFE. This indicates that the improved algorithm correctly recognizes the vast majority of aircraft categories, achieving an mAP₅₀ of 94.6%.

Second, the SAR-DFE structure comprises three parallel branches: a conventional convolution (Conv) branch, a structural feature extraction (SFE) branch, and a SAR image denoising (SAR-ID) branch. The experiment uses the control variable method to analyze the effects of the CDC and A-LEE modules proposed in this paper on aircraft recognition accuracy. Moreover, these two improved modules are replaced by the traditional Sobel operator and LEE filtering algorithm for comparative analysis, and the experimental data are shown in Table 6. The experiment uses YOLOv8l as the baseline. When the CDC module is adopted independently, the mAP₅₀ increases by 1.2% relative to the baseline model and by 0.5% relative to the traditional Sobel operator, demonstrating that the CDC module possesses superior structural feature extraction capability. When used alone, the A-LEE module increases mAP₅₀ by 1.6% relative to the baseline model and by 0.7% compared with the traditional LEE algorithm; it effectively suppresses speckle noise and preserves the detailed information of aircraft targets through its attention mechanism. When both the CDC and A-LEE modules are employed simultaneously, without additional parameter consumption or computational overhead, the mAP₅₀ improves by 1.3% compared with the combined use of the Sobel operator and LEE algorithm. This indicates that the synergistic effect of the CDC and A-LEE modules is practical, enabling enhanced feature extraction capability for aircraft targets in complex scenarios.

Finally, to further verify the feature extraction capability of the SAR-DFE structure in complex scenarios, 200 SAR images with complex backgrounds and poor quality were selected from the SAR-Aircraft-EXT test set to construct the SAR-Aircraft-200-LQ sub-test set. The comparative experimental analysis is shown in Table 7, with the YOLOv8l network model as the baseline. The mAP₅₀ of the baseline model on the SAR-Aircraft-200-LQ test set is 83.2%, which is 8.8% lower than its performance on the SAR-Aircraft-EXT test set. This indicates that the baseline model has significant limitations in feature extraction and aircraft target recognition capabilities when dealing with SAR images with complex backgrounds and poor quality. In contrast, the SAR-DFE (CDC + A-LEE) module achieves an mAP₅₀ of 89.8% on the SAR-Aircraft-200-LQ test set, which is 6.6% higher than that of the baseline model and 2.9% higher than that of the traditional SAR-DFE (Sobel + LEE) module. To more intuitively analyze the recognition performance, three representative test samples were selected from the dataset for visual comparison, as shown in Figure 16. The images in the upper row exhibit a prominent granular scatter distribution, an inherent characteristic of speckle noise in SAR images. The images in the middle row are of poor quality, contain multiple target categories, and thus exhibit low feature discriminability. The images in the lower row suffer from severe background interference, which significantly reduces the distinguishability between targets and the background. Figure 16b shows the recognition result of the baseline algorithm, where yellow circles indicate missed aircraft targets and green circles denote false-alarm aircraft targets, demonstrating poor recognition performance. Figure 16c presents the recognition result of the improved algorithm integrated with the SAR-DFE (CDC + A-LEE) module. All aircraft targets are correctly identified, which further verifies the effectiveness of the SAR-DFE module.

3.4.3. Analysis of Experimental Results for the SAR-C2f and 4SDC Module

To clarify the synergistic mechanism of the SAR-C2f and 4SDC modules and verify their improvement effect in the recognition accuracy of small aircraft targets, this study performs a systematic analysis and verification on the SAR-Aircraft-EXT dataset.

First, this paper designs a dedicated experiment to assess small-target recognition accuracy. Standard target recognition evaluation metrics insufficiently focus on small targets, making it difficult to intuitively reflect the model’s performance on small targets. The detection shortcomings of small targets are masked by the excellent performance of normal-sized targets. In this paper, targets smaller than 60 × 60 pixels are defined as small aircraft targets. By retrieving the bounding box sizes in the annotation files of the SAR-Aircraft-EXT test set, 200 images containing small targets are selected to construct the SAR-Aircraft-200-ST sub-test set, and 200 images containing normal-size targets are selected to construct the SAR-Aircraft-200-NT sub-test set. The test accuracy is shown in Table 8, where four algorithms—Faster R-CNN, SSD [4], YOLOv8s, and YOLOv8l—are compared. Experimental results indicate that the mAP₅₀ of the aforementioned algorithms on the SAR-Aircraft-200-ST is lower than that on the SAR-Aircraft-EXT (SAR-Aircraft-NT) by 6.2% (7.4%), 5.1% (6.1%), 3.6% (4.3%), and 3.3% (3.9%), respectively. This verifies the general rule that recognition of small targets is more challenging than that of normal-sized targets.

Second, to improve the model’s recognition accuracy for small aircraft targets, the SAR-C2f and 4SDC modules are designed. The synergy performance of SAR-C2f and 4SDC modules is shown in Table 9. In this experiment, the YOLOv8l algorithm, integrated with the SAR-DFE module, serves as the baseline. The SAR-C2f, 4SDC, and SAR-C2f + 4SDC modules are incorporated. For the SAR-Aircraft-EXT dataset, the recognition accuracy improves by 0.7%, 1.2%, and 1.9%, respectively; for the SAR-Aircraft-200-ST dataset, the recognition accuracy increases by 1.6%, 2.6%, and 3.9%, respectively. In addition, we calculated the FPS value to verify the detection speed for different combinations: 20.4, 21.2, 19.8, and 20.3, respectively. The experimental data show that the synergy of SAR-C2f and 4SDC modules effectively improves the accuracy of aircraft target recognition without affecting detection speed. For the small-target dataset SAR-Aircraft-200-ST, the improvement in recognition accuracy is more significant. Figure 17 presents the confusion matrix and PR curve of the target recognition model integrated with the SAR-C2f and 4SDC modules. Ultimately, the mAP₅₀ for aircraft recognition on the SAR-Aircraft-EXT reaches 96.5%.

Finally, to further verify the proposed method’s improvement in the recognition performance of small aircraft targets, the experiment examines the model’s attention to target regions and its recognition performance using heatmap visualization. The YOLOv8l algorithm, integrated with the SAR-DFE module, serves as the baseline. The feature response heatmap of the module “model.22.cv3.0.2” is extracted. The spatial distribution of feature values can intuitively reflect the model’s attention intensity to small targets. Three small-target images are sampled from the dataset, and the heatmaps of the baseline and the improved network are presented in Figure 18. According to the color-matching rules, the redder a region in the figure, the greater the algorithm’s attention to that region. A comparative analysis reveals that the baseline model exhibits notable shortcomings: it pays insufficient attention to the target area for small aircraft, leading to missed detections. In the heatmap of the improved network, the feature response to small-target regions is significantly enhanced, and the dark red regions are entirely consistent with the positions of aircraft targets in the labeled original images. This further confirms that the improved algorithm can improve the recognition performance of small aircraft targets.

3.4.4. Ablation Experiment

Using the original YOLOv8l network as the baseline, ablation experiments are designed as shown in Table 10. Both the SAR-Aircraft and SAR-Aircraft-EXT datasets are utilized in these experiments, as detailed also in Table 10. By activating each improved module individually, the independent and synergistic contributions of SAR-ACGAN, SAR-DFE, SAR-C2f, and 4SDC are evaluated to systematically quantify each module’s performance gains in the recognition task.

When the SAR-ACGAN is used alone to generate the expanded dataset SAR-Aircraft-EXT, the recognition accuracy of aircraft targets reaches 92%, outperforming the baseline model by 1.6%. When SAR-ACGAN is combined with SAR-DFE, SAR-C2f, and 4SDC, respectively, the recognition accuracy reaches 94.6%, 92.7%, and 93.2%, increasing by 2.6%, 0.7%, and 1.2% compared with the scenario where SAR-ACGAN is used alone. Furthermore, SAR-ACGAN exhibits stronger complementarity when combined with other modules, and the improvement in recognition accuracy is more pronounced than with SAR-DFE, SAR-C2f, and 4SDC used individually. When SAR-DFE is used alone, the recognition accuracy of aircraft targets reaches 92.5%, outperforming the baseline model by 2.1%. This module focuses on denoising and feature enhancement of SAR images. It incurs low parameter consumption and computational complexity, yet achieves a notable improvement in recognition accuracy with a reasonably controlled overhead. When SAR-C2f is used alone, the recognition accuracy reaches 91.0%, outperforming the baseline model by 0.6%. By capturing multi-scale features with parallel convolutional kernels of different sizes and optimizing channel weight allocation via the attention mechanism, this module yields limited accuracy gains when applied independently, yet provides critical support for multi-scale target recognition. When 4SDC is used alone, recognition accuracy reaches 91.4%, outperforming the baseline model by 1.0%. This module adds a small-target detection branch and effectively aggregates shallow-layer detailed features and deep-layer semantic features via an attention mechanism, thereby specifically addressing the problem of missed small-target detection. When SAR-C2f and 4SDC are combined, recognition accuracy reaches 92.4%, outperforming the baseline model by 2.0% and exceeding the sum of the accuracy gains achieved by each module alone. This indicates a significant synergistic effect between the two modules: SAR-C2f optimizes the efficiency of multi-scale feature extraction and provides richer feature inputs for 4SDC, while 4SDC maximizes the utilization of features at different scales through precise branch weight allocation, and particularly enhances the response intensity of small-target features, enabling efficient integration between the two modules.

When the SAR-YOLOv8l-ADE network is constructed by the synergistic interaction of four modules, the recognition accuracy of aircraft targets reaches 96.5%, representing a substantial improvement of 6.1% compared with the baseline model. Meanwhile, the number of parameters increases by only 1.6M, and the computational complexity rises by only 1.0 G, achieving an optimal balance between accuracy and overhead. This result fully demonstrates that the four improvement strategies have a clear division of labor and strong complementarity, collectively constructing an efficient recognition framework tailored to the characteristics of SAR images.

3.4.5. Analysis of Experimental Results for Comparisons Among Different Models

To demonstrate the strengths of the proposed SAR-YOLOv8l-ADE model for aircraft target recognition, we selected several representative target recognition algorithms currently available for comparative experiments, including Faster R-CNN [2], Retina-Net [5], SSD [4], YOLOv5s [7], YOLOv8l [9], Swin Transformer [44], SADRN [45], YOLOv11L [10], ResNeXt-101 [31], SAR-NTV-YOLOv8 [17], YOLO-SAD [46], SA-Net [24], and SFSA [12]. The recognition accuracy is shown in Table 11. The experimental results demonstrate that the improved SAR-YOLOv8l-ADE model achieves the highest mAP₅₀ of 96.5% across all aircraft types. Moreover, it shows significant advantages in AP₅₀, precision (P), and recall (R) across all categories. The comprehensive advantages are remarkable, which verifies the improved model’s effectiveness for aircraft recognition in SAR images.

3.4.6. Experimental Effect Diagrams

To intuitively verify the recognition capability of the improved model for aircraft targets, four samples covering multiple aircraft models, different scale distributions, and typical complex airport backgrounds were selected for visual comparison. As shown in Figure 19, the experiment used the control-variable method to gradually integrate improved modules, including SAR-ACGAN, SAR-DFE, and SAR-C2f + 4SDC, verifying the effectiveness of each module and their synergistic effects one by one.

Figure 19a displays the original image with bounding boxes. First, YOLOv8l was adopted as the baseline for interpreting the samples, and the effects are illustrated in Figure 19b. Evidently, the interpretation performance exhibits notable limitations: a total of 12 aircraft targets were missed across the four images, including 6 small-scale misses, 4 medium-to-large-scale misses under complex background superposition, and 2 medium-scale target misses with weak features. This result indicates that the baseline model lacks the ability to extract features of multi-scale targets adaptively and struggles to effectively distinguish targets from background clutter, thereby limiting its ability to recognize some aircraft targets. After introducing the SAR-ACGAN network into the baseline model for training set augmentation, the recognition performance is illustrated in Figure 19c: the number of missed detections was reduced from 12 to 5, representing a significant decrease in the miss rate, with a particularly improvement in the recognition accuracy of medium-to-large-scale targets such as A220 and other types of aircraft. However, this module still has limitations: slight feature distortion in the generated samples, leading the model to erroneously learn spurious features and trigger 2 false alarms, where strong scattering areas of ground buildings are misclassified as aircraft targets. On this basis, after further integrating the SAR-DFE, the recognition performance is shown in Figure 19d: the number of missed detections decreased by an additional 2, with the remaining 3 missed detections all being small-scale targets, and the previous 2 false alarms were eliminated. This result fully validates the core role of the SAR-DFE module: Through the synergistic effect of noise suppression and feature enhancement, this module not only compensates for the feature distortion introduced by generated samples but also strengthens the model’s capability to identify blurred and low-contrast targets. Finally, after integrating the SAR-C2f and 4SDC on the aforementioned basis, the recognition performance is presented in Figure 19e: the number of missed detections across the four images was reduced to 0, with all aircraft targets accurately identified and no new false alarms generated. Through their synergistic effect, the two modules fully address the shortcomings of multi-scale target recognition, enabling accurate and efficient recognition of all types of aircraft.

4. Discussion

4.1. Analysis of Cross-Dataset Application

To validate the generalization performance of the proposed approach in SAR image target recognition, we adopt the MSAR-1.0 dataset to facilitate subsequent research. The samples in this dataset are sourced from SAR images captured by the Haisi-1 and Gaofen-3 satellites. It also includes a large number of small-target samples, providing a rich, diverse testing environment for verifying the method’s universality.

Taking SAR-YOLOv8l as the baseline, the improved SAR-YOLOv8l-ADE (SAR-DFE + SAR-C2f + 4SDC) model is applied to the MSAR-1.0 for comparative analysis of detection performance, and its effect diagrams are shown in Figure 20; the target recognition accuracy mAP₅₀ is listed in Table 12. Experimental results indicate that the SAR-YOLOv8l baseline exhibits significant missed detection of small targets, whereas the improved method effectively alleviates this issue and significantly improves target recognition accuracy. This fully demonstrates that the improved method has strong universality for SAR image interpretation tasks, can be adapted across target types, and provides a generalized network architecture for multi-class target recognition in SAR images.

4.2. Future Research Directions

In airport scenarios, the strong scattering characteristics of interfering targets such as terminal buildings and aircraft tractors, along with the scattering differences and angular sensitivity of aircraft’s complex structural components, significantly increase the difficulty of aircraft target recognition. Single-source SAR signals have encountered a bottleneck in improving recognition accuracy. Future research will explore fusion strategies for optical and SAR images, leveraging the color, texture, and other detailed information of optical images to supplement scene context, facilitate more accurate aircraft model recognition and target detection, and thereby enhance target recognition performance in complex scenarios.

Current algorithms focus primarily on SAR amplitude images, yet SAR echoes actually contain diverse and rich information, including phase information reflecting the target distance relationship, polarization information characterizing the target scattering properties, and texture information capturing the roughness of the target surface. Future research needs to explore an adaptive fusion mechanism for multi-dimensional information: on the one hand, a multi-channel feature extraction module can be constructed to perform adaptive weighted fusion of phase, polarization, and texture information under the attention mechanism, before feeding the fused features into the detection network; on the other hand, an end-to-end processing architecture for echo information can be designed to directly extract features from raw echo signals, breaking through the information limitations of single amplitude images. Through the above approaches, the network detection and recognition accuracy for aircraft targets in complex scenarios can be further improved, and the full potential of SAR data can be exploited.

In the future, efforts will be devoted to exploring and designing handcrafted features with stronger generality and robustness to effectively address issues such as target rotation, deformation, and noise interference; meanwhile, such features should possess favorable interpretability and low computational complexity, thereby complementing deep learning models. By exploring more efficient feature fusion mechanisms, handcrafted features and automatically learned deep features from convolutional neural networks (CNNs) are organically combined, thereby further improving the model’s recognition accuracy and generalization for aircraft targets in complex scenarios.

In this paper, 200 images with complex backgrounds and severe interference were manually selected from the 1000-image test set to construct the SAR-Aircraft-200-LQ test subset. This manual selection process may inevitably introduce subjective biases: different researchers may have inconsistent judgments on the “complexity of backgrounds” and “severity of noise interference”, leading to potential deviations in the composition of the subset. From a statistical perspective, manual screening cannot fully guarantee that the selected 200 images are representative of all complex background and high-noise samples in the entire test set. In subsequent work, we will develop an objective, quantifiable method to construct an SAR-Aircraft-200-LQ subset further to improve the statistical reliability and generalizability of the conclusions.

5. Conclusions

This paper proposes an improved SAR-YOLOv8l-ADE network architecture that explicitly addresses the core issues of insufficient dataset samples, single-dimensional target features, large target size spans, and a high miss-detection rate for small targets in aircraft target recognition from SAR images. The proposed SAR-ACGAN network integrates the self-attention mechanism, captures the long-range dependencies of aircraft targets, and generates high-quality, diverse virtual images, effectively addressing deficiencies in sample quantity and diversity. The designed SAR-DFE detailed feature extraction module achieves dual goals of feature enhancement and noise suppression through the synergistic effect of the CDC and A-LEE modules, converting single-channel images into three-channel images, and significantly improves performance in complex scenarios, outperforming the combination of traditional operators and filtering schemes. The collaborative design of the SAR-C2f and 4SDC modules addresses the bottlenecks of large-scale spans and the difficulty of recognizing small targets, enhancing the model’s feature response to small-target areas. The four modules in the SAR-YOLOv8l-ADE network are complementary, achieving an optimal balance between accuracy and computational overhead.

Author Contributions

Conceptualization, W.H. and Y.L.; methodology, W.H. and X.W.; software, Q.L.; validation, X.W., Q.L. and Y.L.; formal analysis, X.W.; investigation, P.X.; resources, Y.L.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, X.W.; visualization, P.X.; supervision, W.H. and Q.L.; project administration, P.X.; funding acquisition, W.H., Q.Z., Q.L. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by National Natural Science Foundation of China (42204144), the Noise Mechanism and Suppression Methods of High-precision Phase Measurement in the Low-frequency Band (No. 2022YFC2203901). Higher Education Society of Jilin Province Higher Education Research Project (JGJX24D0157).

Data Availability Statement

The ISPRS-SAR-Aircraft dataset is provided by the 2021 Gaofen challenge on Automated High-Resolution Earth Observation Image Interpretation. Available online: https://www.grss-ieee.org/publications/call-for-papers/2021-gaofen-challenge-on-automated-high-resolution-earth-observation-image-interpretatio/ (accessed on 1 October 2021). The SAR-Aircraft-1.0 dataset is provided by the paper “SAR-Aircraft-1.0: High-Resolution SAR Aircraft Detection and Recognition Dataset” published in the 4th issue of 2023 of the Journal of Radars. Available online: https://radars.ac.cn/article/doi/10.12000/JR23043?viewType=HTML (accessed on 15 October 2023).

Acknowledgments

We sincerely appreciate all reviewers for their insightful comments and helpful suggestions, which have greatly enhanced the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

SAR	Synthetic Aperture Radar
SAR-ACGAN	SAR-Auxiliary Classifier Generative Adversarial Network
SAR-DFE	SAR-Detail Feature Extraction
SAR-C2f	SAR-Coarse-to-Fine
4SDC	Four-Scale Detectors with Coordinate Attention
CDC	Central Difference Convolution
A-LEE	Adaptive LEE
FID	Fréchet Inception Distance
SFE	Structural Feature Extraction
SAR-ID	SAR Image Denoising

References

Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
Zhou, L.; Ran, H.; Xiong, R.; Tan, R. NWD-YOLOv5: A YOLOv5 Model for Small Target Detection Based on NWD Loss. In Proceedings of the IEEE International Conference on Robotics, Intelligent Control and Artificial Intelligence, Nanjing, China, 6–8 December 2024. [Google Scholar] [CrossRef]
Zhang, H.; Xiong, A.; Lai, L.; Chen, C.; Liang, J. AMME-YOLOv7: Improved YOLOv7 Based on Attention Mechanism and Multiscale Expansion for Electric Vehicle Driver and Passenger Helmet Wearing Detection. In Proceedings of the IEEE International Conference on Smart Internet of Things, Xining, China, 25–27 August 2023. [Google Scholar] [CrossRef]
Jiang, X.N.; Niu, X.Q.; Wu, F.L.; Fu, Y.; Bao, H.; Fan, Y.C.; Zhang, Y.; Pei, J.Y. A Fine-Grained Aircraft Target Recognition Algorithm for Remote Sensing Images Based on YOLOV8. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 4060–4073. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Xiao, X.; Jia, H.; Xiao, P.; Wang, H. Aircraft Detection in SAR Images Based on Peak Feature Fusion and Adaptive Deformable Network. Remote Sens. 2022, 14, 6077. [Google Scholar] [CrossRef]
Zhao, C.; Zhang, S.; Luo, R.; Feng, S.; Kuang, G. Scattering features spatial-structural association network for aircraft recognition in SAR images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4006505. [Google Scholar] [CrossRef]
Zhu, W.; Zhang, L.; Lu, C.; Fan, G.; Song, Y.; Sun, J.; Lv, X. FEMSFNet: Feature Enhancement and Multi-Scales Fusion Network for SAR Aircraft Detection. Remote Sens. 2024, 16, 1589. [Google Scholar] [CrossRef]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
Yasir, M.; Liu, S.; Xu, M.; Wan, J.; Nazir, S.; Islam, Q.U.; Dang, K.B. SwinYOLOv7: Robust Ship Detection in Complex Synthetic Aperture Radar Images. Appl. Soft Comput. 2024, 160, 111704. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Guo, X.; Xu, B. SAR-NTV-YOLOv8: A Neural Network Aircraft Detection Method in SAR Images Based on Despeckling Preprocessing. Remote Sens. 2024, 16, 3420. [Google Scholar] [CrossRef]
Huang, B.; Zhang, T.; Quan, S.; Wang, W.; Guo, W.; Zhang, Z. Scattering Enhancement and Feature Fusion Network for Aircraft Detection in SAR Images. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 1936–1950. [Google Scholar] [CrossRef]
Zhao, Y.; Zhao, L.; Liu, Z.; Hu, D.; Kuang, G.; Liu, L. Attentional Feature Refinement and Alignment Network for Aircraft Detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5220616. [Google Scholar] [CrossRef]
Chen, L.; Luo, R.; Xing, J.; Li, Z.; Xing, X.; Yuan, Z.; Tan, S.; Cai, X. Geospatial transformer is what you need for aircraft detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5225715. [Google Scholar] [CrossRef]
Nie, Y.; Bian, C.; Li, L.; Chen, H.; Chen, S. LFC-SSD: Multiscale aircraft detection based on local feature correlation. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6510505. [Google Scholar] [CrossRef]
Kang, Y.; Wang, Z.; Fu, J.; Sun, X.; Fu, K. SFR-Net: Scattering Feature Relation Network for Aircraft Detection in Complex SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5218317. [Google Scholar] [CrossRef]
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar] [CrossRef]
Zou, L.; Zhang, H.; Wang, C.; Wu, F.; Gu, F. MW-ACGAN: Generating Multiscale High-Resolution SAR Images for Ship Detection. Sensors 2020, 20, 6673. [Google Scholar] [CrossRef] [PubMed]
Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 2642–2651. [Google Scholar] [CrossRef]
Sun, X.; Li, W.; Huang, C.; Fu, J.; Feng, J. Multi-step-ahead Prediction of Grenade Trajectory Based on CNN-LSTM Enhanced by Deep Learning and Self-attention Mechanism. Acta Armamentarii 2024, 44, 240659. [Google Scholar]
Li, Z.; Zhang, S.; Qiao, Y.; Wang, Q.; Jiang, Y.; Zhang, F. Maneuvering trajectory prediction of air combat targets based on self-attention mechanism and CNN-LSTM. J. Ordnance Equip. Eng. 2023, 44, 209–216. [Google Scholar] [CrossRef]
Geng, J.; Jiang, W.; Deng, X. Multi-scale deep feature learning network with bilateral filtering for SAR image classification. ISPRS J. Photogramm. Remote Sens. 2020, 167, 201–213. [Google Scholar] [CrossRef]
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar] [CrossRef]
Veit, A.; Wilber, M.J.; Belongie, S. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 550–558. [Google Scholar] [CrossRef]
Yu, Z.; Zhao, C.; Wang, Z.; Qin, Y.; Su, Z.; Li, X.; Zhou, F.; Zhao, G. Searching central difference convolutional networks for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5295–5305. [Google Scholar] [CrossRef]
Lee, S.; Kim, S.-W. Recognition of Targets in SAR Images Based on a WVV Feature Using a Subset of Scattering Centers. Sensors 2022, 22, 8528. [Google Scholar] [CrossRef]
Zhang, L.; Wu, J.; Fan, Y.; Gao, H.; Shao, Y. An Efficient Building Extraction Method from High Spatial Resolution Remote Sensing Images Based on Improved Mask R-CNN. Sensors 2020, 20, 1465. [Google Scholar] [CrossRef] [PubMed]
Hoorfar, H.; Puche, A.; Merchenthaler, I. Thermal image edge detection for AI-powered medical research imaging. J. Supercomput. 2025, 81, 629. [Google Scholar] [CrossRef]
Rumelhart, D.; Hinton, G.; Williams, R. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
Szegedy, C.; Liu, W.; Jia, Y.Q.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.; Liu, W.; et al. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015. [Google Scholar] [CrossRef]
Wang, X.; Hong, W.; Liu, Y.; Hu, D.; Xin, P. SAR Image Aircraft Target Recognition Based on Improved YOLOv5. Appl. Sci. 2023, 13, 6160. [Google Scholar] [CrossRef]
Wang, X.; Hong, W.; Liu, Y.; Yan, G.; Hu, D.; Jing, Q. Improved YOLOv8 Network of Aircraft Target Recognition Based on Synthetic Aperture Radar Imaging Feature. Sensors 2025, 25, 3231. [Google Scholar] [CrossRef]
Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021. [Google Scholar] [CrossRef]
IEEE-GRSS. 2021 Gaofen Challenge on Automated High-Resolution Earth Observation Image Interpretation; IEEE-GRSS: Piscataway, NJ, USA, 2021; Available online: https://www.grss-ieee.org/publications/call-for-papers/2021-gaofen-challenge-on-automated-high-resolution-earth-observation-image-interpretatio/ (accessed on 1 October 2021).
Wang, Z.; Kang, Y.; Zeng, X.; Wang, Y.; Zhang, D.; Sun, X. SAR-AIRcraft-1.0: High-resolution SAR Aircraft Detection and Recognition Dataset. J. Radars 2023, 12, 906–922. [Google Scholar] [CrossRef]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Gayathri, R.; Lincy, R. Transfer learning based handwritten character recognition of tamil script using inception-V3 Model. J. Intell. Fuzzy Syst. 2022, 42, 6102. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
Ye, X.; Du, C. Integrated Multi-Scale Aircraft Detection and Recognition with Scattering Point Intensity Adaptiveness in Complex Background Clutter SAR Images. Remote Sens. 2024, 16, 2471. [Google Scholar] [CrossRef]
Chen, J.; Shen, Y.; Liang, Y.; Wang, Z.; Zhang, Q. YOLO-SAD: An Efficient SAR Aircraft Detection Network. Appl. Sci. 2024, 14, 3025. [Google Scholar] [CrossRef]

Figure 1. The overall architecture of our improved SAR-YOLOv8l-ADE network. The colored parts represent the optimization measures proposed in this paper. Among them, SAR-ACGAN, SAR-DFE, SAR-C2f, and 4SDC are structures designed in this paper. The Ghost-CBS structure is cited from Reference [23]. The SAR-ACGAN is an independent network dedicated to dataset augmentation.

Figure 2. Self-Attention module structure.

Figure 3. SAR-ACGAN network structure. Class1, Class2, …, Class7 correspond to seven aircraft types. The binary cross-entropy loss is used to determine whether an image is real or fake, and the cross-entropy loss is used to classify the image category.

Figure 4. SAR-DFE module structure. CDC is the structural feature extraction module, and A-LEE is the SAR image denoising module.

Figure 5. The CDC operational logic diagram. The coordinates of the red point in

X

are

(i, j)

, and the coordinates of the blue point in

Y_{C D C}

are

(i, j)

, they are in the same position. Different colors in matrix

W

represent the distinct weight coefficients.

Figure 5. The CDC operational logic diagram. The coordinates of the red point in

X

are

(i, j)

, and the coordinates of the blue point in

Y_{C D C}

are

(i, j)

, they are in the same position. Different colors in matrix

W

represent the distinct weight coefficients.

Figure 6. A-LEE denoising structure. By adaptively fusing the

σ^{2}

and

\bar{I}

to learn an attention correction coefficient (Atten), it is used to enhance the focus on important features.

Figure 6. A-LEE denoising structure. By adaptively fusing the

σ^{2}

and

\bar{I}

to learn an attention correction coefficient (Atten), it is used to enhance the focus on important features.

Figure 7. SAR-MAFR module structure.

Figure 8. SAR-C2f module structure. When N = 1, the first feature extraction layer uses the SAR-MAFR structure; when N

\neq

1, all other feature extraction layers use the SE-Ghost-Bottleneck structure [37]. In our improved SAR-YOLOv8l-ADE, the Shortcut configuration of the backbone is set to True, while that of the neck is set to False. C denotes the number of channels.

Figure 8. SAR-C2f module structure. When N = 1, the first feature extraction layer uses the SAR-MAFR structure; when N

\neq

1, all other feature extraction layers use the SE-Ghost-Bottleneck structure [37]. In our improved SAR-YOLOv8l-ADE, the Shortcut configuration of the backbone is set to True, while that of the neck is set to False. C denotes the number of channels.

Figure 9. 4SDC module structure. P2, P3, P4, and P5 are the four detection branches derived from the YOLOv8l baseline. X_2→l, X_3→l, X_4→l, and X_5→l represent that different inputs are adjusted to a unified size at the l layer. V2, V3, V4, and V5 are the feature maps after adaptive fusion at the l layer.

Figure 10. Seven types of aircraft SAR imaging samples. The upper row shows optical images, and the lower row shows SAR images.

Figure 11. The generated SAR images of single aircraft targets based on the SAR-ACGAN network. As the Epoch increases, the generated SAR images evolve from being initially noisy and lacking distinct target features to gradually exhibiting clear structural features of aircraft. The different colors in the images at Epoch 0 represent the initial noisy pixel distribution of the generated SAR image.

Figure 12. The generated aircraft target samples based on the ACGAN network. Green-marked regions: lack of tail-scattering features; pink-marked regions: unrealistic highlight points; yellow-marked regions: aircraft target misclassification.

Figure 13. The generated aircraft target samples based on the SAR-ACGAN network.

Figure 14. The mAP₅₀ of ACGAN and SAR-ACGAN under different sample augmentation scales.

Figure 15. Confusion matrix and PR curve of the YOLOv8l model integrating SAR-DFE. (a) Confusion matrix. (b) PR curve.

Figure 16. Comparative analysis of aircraft target interpretation effects. (a) Original images with labels. (b) Recognition effects of the baseline. (c) Recognition effects of the improved algorithm. Definitions of rectangular boxes with different colors: dark-orange denotes A220, light-orange denotes ARJ21, light-green denotes A320/321, dark-yellow denotes A330, and dark-green denotes other.

Figure 17. Confusion matrix and PR curve of the YOLOv8l model integrating SAR-C2f and 4SDC. (a) Confusion matrix. (b) PR curve.

Figure 18. Heatmap visual comparison of the improved model integrating SAR-C2f and 4SDC. (a) Original images with labels. (b) Heatmap of the Baseline. (c) Heatmap of the improved model. In (a), dark-orange denotes A220, light-orange denotes ARJ21, and dark-green denotes other. In (b), black denotes the missed detection cases. In (b,c), redder regions indicate greater algorithm attention.

Figure 19. Recognition effect diagrams of the improved algorithm. In the figure, missed-detection targets are marked with yellow ellipses, and false-alarm targets are marked with pink ellipses. Definitions of rectangular boxes with different colors: dark-green denotes other, light-green denotes A320/321, dark-orange denotes A220, light-orange denotes ARJ21, dark-yellow denotes A330, red denotes Boeing787, and pink denotes Boeing737.

Figure 20. Comparative analysis of target recognition effect diagrams in the MSAR-1.0 dataset. (a) Original images with labels. (b) Recognition effects of the baseline. (c) Recognition effects of the improved algorithm. The orange rectangular boxes indicate ships, and the yellow ellipses mark missed detection targets.

Table 1. SAR-ACGAN network parameters.

	Layer	Algorithm	Input Size	Output Size
Generator	Layer1	ConvT + BN + ReLU	(768, 1, 1)	(384, 4, 4)
	Layer2		(384, 4, 4)	(256, 8, 8)
	Layer3		(256, 8, 8)	(128, 16, 16)
	Layer4		(128, 16, 16)	(64, 32, 32)
	Layer5	Self-Attention (ratio = 0.5)	(64, 32, 32)	(32, 32, 32)
	Layer6	ConvT + BN + ReLU	(32, 32, 32)	(16, 64, 64)
	Layer7	ConvT + Tanh	(16, 64, 64)	(1, 128, 128)
Discriminator	Layer1	Conv + BN + LeakyReLU + Dropout	(1, 128, 128)	(1, 128, 128)
	Layer2		(1, 128, 128)	(16, 64, 64)
	Layer3		(16, 64, 64)	(32, 32, 32)
	Layer4	Self-Attention (ratio = 2)	(32, 32, 32)	(64, 32, 32)
	Layer5	Conv + BN + LeakyReLU + Dropout	(64, 32, 32)	(128, 16, 16)
	Layer6		(128, 16, 16)	(256, 8, 8)
	Layer7		(256, 8, 8)	(512, 4, 4)

ConvT stands for transposed convolution operation. The probability of random dropout is set to 0.5.

Table 2. Comparison of CDC and Sobel characteristics.

Characteristics	CDC	Sobel
Parameter	Trainable	Fixed
Difference directions	8	4
Nonlinear capability	Yes	No
Adaptability	Yes	No

Table 3. Distribution of aircraft types in the SAR-Aircraft-Gen dataset.

A220	A320/321	A330	ARJ21	B737	B787	Other
307	207	65	201	300	278	242

Table 4. The FID scores of generated images by different network.

	A220	A320/321	A330	ARJ21	B737	B787	Other	Mean
ACGAN	62.23	83.36	103.56	82.82	56.87	70.18	75.63	76.25
SAR-ACGAN	29.42	49.61	60.38	51.37	26.29	37.64	43.97	42.67

Table 5. Impact analysis of the SAR-DFE on the performance of baseline.

	SAR-DFE	A220 AP₅₀	A320/321 AP₅₀	A330 AP₅₀	ARJ21 AP₅₀	B737 AP₅₀	B787 AP₅₀	Other AP₅₀	Mean mAP₅₀
YOLOv8s	×	90.8	89.3	86.3	94.5	94.6	91.8	89.4	91.0
YOLOv8s	√	94.8	90.9	88.7	96.2	95.9	94.7	91.2	93.2 (+2.2)
YOLOv8l	×	91.9	90.3	87.4	95.6	95.7	92.9	90.5	92.0
YOLOv8l	√	94.1	96.2	92.4	96.0	96.8	95.3	91.4	94.6 (+2.6)

“×” indicates not used, while “√” indicates used. Bold indicates accuracy improvement.

Table 6. Performance analysis of CDC and A-LEE module.

SAR-DFE			Params	GFLOPs	Mean
Conv	SFE	SAR-ID	(M)	(G)	mAP₅₀ (%)	mAP_50–95 (%)
×	×	×	43.6	165.4	92.0	70.6
√	Sobel	×	43.9	165.6	92.7 (+0.7)	72.1 (+1.5)
√	CDC	×	43.9	165.6	93.2 (+1.2)	73.8 (+3.2)
√	×	LEE	45.6	165.8	92.9 (+0.9)	73.1 (+2.5)
√	×	A-LEE	45.9	166.0	93.6 (+1.6)	74.8 (+4.2)
√	Sobel	LEE	45.9	166.0	93.3 (+1.3)	74.2 (+3.6)
√	CDC	A-LEE	46.2	166.2	94.6 (+2.6)	77.2 (+6.6)

“×” indicates not used, while “√” indicates used. The first row (×|×|×) indicates the experimental results with YOLOv8l as the baseline. Bold indicates accuracy improvement.

Table 7. Performance analysis of the SAR-DFE module on the SAR-Aircraft-200-LQ test set.

SAR-DFE		SAR-Aircraft-EXT		SAR-Aircraft-200-LQ
(CDC + A-LEE)	(Sobel + LEE)	mAP₅₀ (%)	mAP_50–95 (%)	mAP₅₀ (%)	mAP_50–95 (%)
×	×	92.0	70.6	83.2	59.3
×	√	93.3 (+1.3)	74.2 (+3.6)	86.9 (+3.7)	64.9 (+5.6)
√	×	94.6 (+2.6)	77.2 (+6.6)	89.8 (+6.6)	69.8 (+10.5)

“×” indicates not used, while “√” indicates used. The first row (×|×) indicates the experimental results with YOLOv8l as the baseline. Bold indicates accuracy improvement.

Table 8. Performance analysis of four target recognition algorithms on different datasets.

Model	SAR-DFE	SAR-Aircraft-EXT		SAR-Aircraft-200-NT		SAR-Aircraft-200-ST
Model	SAR-DFE	mAP₅₀ (%)	mAP_50–95 (%)	mAP₅₀ (%)	mAP_50–95 (%)	mAP₅₀ (%)	mAP_50–95 (%)
Faster R-CNN	×	88.9	66.3	90.1 (+1.2)	68.1 (+1.8)	82.7 (−6.2)	56.1 (−10.2)
SSD	×	86.5	63.6	87.5 (+1.0)	65.2 (+1.6)	81.4 (−5.1)	52.6 (−11.0)
YOLOv8s	√	93.2	74.0	93.9 (+0.7)	75.1 (+1.1)	89.6 (−3.6)	67.2 (−6.8)
YOLOv8l	√	94.6	77.2	95.2 (+0.6)	78.5 (+1.3)	91.3 (−3.3)	69.1 (−8.1)

“×” indicates not used, while “√” indicates used. Bold indicates accuracy decline.

Table 9. Performance analysis of SAR-C2f and 4SDC modules on SAR-Aircraft-200-ST.

SAR-DFE	SAR-C2f	4SDC	SAR-Aircraft-EXT		SAR-Aircraft-200-ST		FPS
SAR-DFE	SAR-C2f	4SDC	mAP₅₀ (%)	mAP_50–95 (%)	mAP₅₀ (%)	mAP_50–95 (%)	FPS
√	×	×	94.6	77.2	91.3	69.1	20.4
√	√	×	95.3 (+0.7)	78.9 (+1.7)	92.9 (+1.6)	73.0 (+3.9)	21.2
√	×	√	95.8 (+1.2)	80.0 (+2.8)	93.9 (+2.6)	75.6 (+6.5)	19.8
√	√	√	96.5 (+1.9)	81.7 (+4.5)	95.2 (+3.9)	79.5 (+10.4)	20.3

“×” indicates not used, while “√” indicates used. Bold indicates accuracy improvement.

Table 10. Ablation experiments on four improved strategies.

SAR-ACGAN	SAR-DFE	SAR-C2f	4SDC	Params (M)	GFLOPs (G)	SAR-Aircraft-EXT mAP₅₀ (%)	SAR-Aircraft mAP₅₀ (%)
×	×	×	×	43.6	165.4	-	90.4 (Baseline)
√	×	×	×	43.6 (+0.0)	165.4 (+0.0)	92.0 (+1.6)	-
×	√	×	×	46.2 (+2.6)	166.2 (+0.8)	-	92.5 (+2.1)
×	×	√	×	41.5 (−2.1)	165.2 (−0.2)	-	91.0 (+0.6)
×	×	×	√	44.7 (+1.1)	165.8 (+0.4)	-	91.4 (+1.0)
×	×	√	√	42.6 (−1.0)	165.6 (+0.2)	-	92.4 (+2.0)
×	√	√	√	45.2 (+1.6)	166.4 (+1.0)	-	94.4 (+4.0)
√	√	×	×	46.2 (+2.6)	166.2 (+0.8)	94.6 (+4.2)	-
√	×	√	×	41.5 (−2.1)	165.2 (−0.2)	92.7 (+2.3)	-
√	×	×	√	44.7 (+1.1)	165.8 (+0.4)	93.2 (+2.8)	-
√	×	√	√	42.6 (−1.0)	165.6 (+0.2)	94.2 (+3.8)	-
√	√	√	√	45.2 (+1.6)	166.4 (+1.0)	96.5 (+6.1)	-

“×” indicates not used, while “√” indicates used. Bold indicates the magnitude of change.

Table 11. Comparison experimental results of different models.

Model	A220 AP₅₀ (%)	A320/321 AP₅₀ (%)	A330 AP₅₀ (%)	ARJ21 AP₅₀ (%)	B737 AP₅₀ (%)	B787 AP₅₀ (%)	Other AP₅₀ (%)	P (%)	R (%)	Mean mAP₅₀ (%)
Faster R-CNN	85.7	86.6	80.9	88.4	86.0	91.9	87.5	84.2	89.5	86.7
Retina-Net	81.9	80.6	76.8	85.3	82.5	86.0	82.1	83.4	77.1	82.2
SSD	83.1	76.4	91.0	73.3	85.6	88.7	81.2	84.4	79.2	82.8
YOLOv5s	90.9	81.8	91.8	84.4	86.1	93.3	88.3	89.7	83.4	88.1
YOLOv8l	93.3	84.2	93.3	86.8	88.5	95.7	90.7	91.3	87.5	90.4
Swin Transformer	80.9	100	77.4	74.6	73.8	86.1	84.8	-	-	82.5
SADRN	95.3	96.4	96.4	94.8	95.7	95.8	94.1	85.6	93.0	95.0
YOLOv11L	93.2	88.5	94.0	90.8	89.0	94.5	90.2	93.0	88.3	91.5
ResNeXt-101	80.9	100	87.1	74.9	71.1	83.9	87.7	-	-	83.7
SAR-NTV-YOLOv8	-	-	-	-	-	-	-	93.5	92.2	83.4
YOLO-SAD	93.5	87.9	92.7	90.1	88.2	94.3	90.9	89.3	87.9	90.8
SA-Net	80.3	94.3	88.6	78.6	59.7	70.8	71.3	-	-	77.7
SFSA	90.0	96.0	97.0	99.0	95.0	89.0	81.0	-	-	92.4
SAR-YOLOv8l-ADE	96.3	96.3	97.1	95.9	96.7	98.0	95.1	93.9	91.1	96.5

“-” indicates that the corresponding data was not reported in the original papers.

Table 12. Comparative analysis of target recognition accuracy in the MSAR-1.0 dataset.

Dataset	Model	Aircraft	Oil Tanks	Bridges	Ships	Mean
Dataset	Model	AP₅₀ (%)	AP₅₀ (%)	AP₅₀ (%)	AP₅₀ (%)	mAP₅₀ (%)
MSAR-1.0	Baseline	73.9	93.9	90.9	96.1	88.7
MSAR-1.0	SAR-YOLOv8l-ADE	78.4	94.8	91.6	97.6	90.6 (+1.9)

Bold indicates accuracy improvement.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, X.; Hong, W.; Li, Q.; Liu, Y.; Zhang, Q.; Xin, P. Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network. Remote Sens. 2026, 18, 236. https://doi.org/10.3390/rs18020236

AMA Style

Wang X, Hong W, Li Q, Liu Y, Zhang Q, Xin P. Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network. Remote Sensing. 2026; 18(2):236. https://doi.org/10.3390/rs18020236

Chicago/Turabian Style

Wang, Xing, Wen Hong, Qi Li, Yunqing Liu, Qiong Zhang, and Ping Xin. 2026. "Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network" Remote Sensing 18, no. 2: 236. https://doi.org/10.3390/rs18020236

APA Style

Wang, X., Hong, W., Li, Q., Liu, Y., Zhang, Q., & Xin, P. (2026). Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network. Remote Sensing, 18(2), 236. https://doi.org/10.3390/rs18020236

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Module Collaborative Optimization for SAR Image Aircraft Recognition: The SAR-YOLOv8l-ADE Network

Highlights

Abstract

1. Introduction

2. Methods

2.1. Dataset Sample Augmentation Network SAR-ACGAN

2.2. Detailed Feature Extraction Module SAR-DFE

2.3. Multi-Scale Feature Extraction Module SAR-C2f

2.4. Four-Scale Adaptive Fusion Detection Head 4SDC

3. Results

3.1. Dataset Construction

3.1.1. SAR-Aircraft and SAR-Aircraft-EXT Dataset

3.1.2. SAR-Aircraft-Gen Dataset

3.2. Experimental Environment

3.3. Evaluation Metrics

3.3.1. Evaluation Metrics for Generated Image Quality

3.3.2. Evaluation Metrics for Aircraft Target Recognition Capability

3.4. Analysis of Experimental Results

3.4.1. Analysis of Experimental Results for the SAR-ACGAN Network

3.4.2. Analysis of Experimental Results for the SAR-DFE Module

3.4.3. Analysis of Experimental Results for the SAR-C2f and 4SDC Module

3.4.4. Ablation Experiment

3.4.5. Analysis of Experimental Results for Comparisons Among Different Models

3.4.6. Experimental Effect Diagrams

4. Discussion

4.1. Analysis of Cross-Dataset Application

4.2. Future Research Directions

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI