SLD-YOLO: A Lightweight Satellite Component Detection Algorithm Based on Multi-Scale Feature Fusion and Attention Mechanism

Li, Yonghao; Yang, Hang; Lü, Bo; Wu, Xiaotian

doi:10.3390/rs17172950


¹ School of Physics, Northeast Normal University, Changchun 130024, China

² Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Remote Sens. 2025, 17(17), 2950; https://doi.org/10.3390/rs17172950

Submission received: 24 June 2025 / Revised: 28 July 2025 / Accepted: 22 August 2025 / Published: 25 August 2025

(This article belongs to the Special Issue Advances in Remote Sensing Image Target Detection and Recognition)


Abstract

Space-based on-orbit servicing missions impose stringent requirements for precise identification and localization of satellite components, while existing detection algorithms face dual challenges of insufficient accuracy and excessive computational resource consumption. This paper proposes SLD-YOLO, a lightweight satellite component detection model based on improved YOLO11, balancing accuracy and efficiency through structural optimization and lightweight design. First, we design RLNet, a lightweight backbone network that employs reparameterization mechanisms and hierarchical feature fusion strategies to reduce model complexity by 19.72% while maintaining detection accuracy. Second, we propose the CSP-HSF multi-scale feature fusion module, used in conjunction with PSConv downsampling, to effectively improve the model’s perception of multi-scale objects. Finally, we introduce SimAM, a parameter-free attention mechanism in the detection head to further improve feature representation capability. Experiments on the UESD dataset demonstrate that SLD-YOLO achieves measurable improvements compared to the baseline YOLO11s model across five satellite component detection categories: mAP50 increases by 2.22% to 87.44%, mAP50:95 improves by 1.72% to 63.25%, while computational complexity decreases by 19.72%, parameter count reduces by 25.93%, model file size compresses by 24.59%, and inference speed reaches 90.4 FPS. Validation experiments on the UESD_edition2 dataset further confirm the model’s robustness. This research provides an effective solution for target detection tasks in resource-constrained space environments, demonstrating practical engineering application value.

Keywords: satellite component detection; lightweight network; YOLO11; on-orbit servicing

1. Introduction

With the rapid advancement of space technology, satellites have become essential means for humanity to acquire Earth information and support socio-economic development. Various satellite systems are extensively applied in multiple domains including communications, navigation, meteorological observation, and Earth resource monitoring [1,2,3,4]. In recent years, space activities have become increasingly frequent, and the rapid deployment of low Earth orbit satellite constellations has led to a dramatic surge in near-Earth orbit satellite populations. Predictions indicate that, by 2030, the number of satellites in near-Earth orbit may exceed 100,000 [5]. However, alongside the growth in satellite numbers, problems caused by defunct satellites are becoming increasingly prominent. These non-functional satellites not only occupy limited orbital resources but also increase collision risks with other spacecraft, presenting major challenges that the global aerospace community must address.

To tackle this problem, the international community is actively exploring on-orbit servicing robot technology [6], extending satellite lifespans through cargo resupply, fuel refueling, maintenance, and capture missions (Figure 1). The U.S. DARPA’s FREND project has successfully completed on-orbit experiments with robotic arm technology, demonstrating its potential in space service missions [7]. However, one of the core challenges in these missions is precisely acquiring position and attitude information of target satellite components [8], where accurate identification of satellite components is crucial for subsequent operations.

Prior to the widespread adoption of deep learning technologies, satellite component identification primarily relied on traditional image processing methods. These approaches included edge detection and specific shape matching algorithms [9,10]. For instance, Cai et al. proposed a triangular detection method based on line extraction, specifically designed for feature recognition of satellite main bodies and solar panels in tethered space robots [11]. However, this method has limitations when applied to complex structures, operating effectively only under specific scenarios. Although traditional detection algorithms can meet requirements to some extent, their high dependence on specific types of satellite images leads to poor generalization capability and processing efficiency, resulting in frequent false detections and missed detections.

With the proliferation of deep learning technology, particularly the emergence of R-CNN [12] and its improved successor Faster R-CNN [13], the target recognition field has undergone revolutionary changes. Chen et al. employed an improved R-CNN-based algorithm for satellite component detection, achieving significant results compared to mainstream algorithms at the time, though the high complexity of their model made it unsuitable for resource-constrained space environment applications [14]. Cao et al. utilized an improved Faster R-CNN algorithm for satellite component recognition, demonstrating the significant advantages of deep learning methods over traditional approaches, though their research focused mainly on single satellite models and lacked broad applicability [15]. With continuous advancement in Convolutional Neural Network (CNN) technology, especially the breakthrough development of the YOLO (You Only Look Once) [16,17,18,19,20,21,22,23,24] family of one-stage object detection algorithms, detection speed has improved significantly while accuracy has reached levels comparable to Faster R-CNN. As a result, subsequent research has increasingly focused on the application and development of one-stage object detectors.

In 2021, Mahendrakar et al. utilized YOLOv5 to achieve real-time satellite component identification on small sample datasets, preliminarily validating the feasibility of practical applications [25]. However, these studies were still limited by detection accuracy and insufficient datasets. Liu et al. improved satellite component detection performance by introducing attention mechanisms into YOLOv5 models and employing data augmentation techniques and Generative Adversarial Networks (GANs) to expand datasets. Despite this progress, their work remained deficient in dataset diversity and lacked comparisons with new-generation algorithms based on Transformer architectures [26]. Tang et al. developed a lightweight improved model based on YOLOv8 that demonstrated competitive performance in satellite component detection tasks, with detection accuracy comparable to mainstream algorithms. However, the satellite dataset used had limited annotated category coverage, making it difficult to meet the recognition requirements for diverse components such as novel satellite optical devices [27].

Addressing the aforementioned issues, this paper optimizes the YOLO11s network structure to enhance detection performance for satellite components while reducing model computation, parameter scale, and weight file size.

The main contributions of this paper are as follows:

Targeting the practical requirements of satellite component detection in space environments, this paper conducted grayscale preprocessing and fine-grained annotation for five target categories based on UESD [28], currently the largest and most comprehensive public dataset in this field, further improving the dataset’s adaptability and practicality in real application scenarios.
To enhance model operational efficiency, this paper optimized the backbone network structure of YOLO11s, proposing a novel lightweight backbone network, RLNet (Reparameterization Lightweight Network). This network improves detection accuracy while reducing model computational complexity.
Considering the characteristic of large target scale variations in satellite images, this paper innovatively proposes a multi-scale feature fusion module CSP-HSF (Cross Stage Partial - Hybrid Scale Fusion). This module effectively enhances the model’s perception capability for multi-scale targets through channel division and multi-scale convolution fusion strategies. Additionally, we employ PSConv (Pinwheel-Shaped Convolution) [29] as the downsampling operation to further compress model scale while ensuring feature extraction quality.
We introduce a lightweight attention mechanism SimAM [30] in the detection head, which can enhance key feature representation capabilities without additional computational costs, improving the model’s overall detection performance.

2. Methods

2.1. The YOLO11 Baseline Framework Introduction

YOLO11 [24] represents Ultralytics’ most recent advancement in their YOLO family of real-time object detection algorithms. Unlike two-stage detectors that separate region proposal and classification, YOLO adopts a unified single-stage approach, directly predicting bounding boxes and class probabilities from full images in one evaluation. The architecture comprises three interconnected components: backbone, neck, and head networks.

Feature extraction primarily occurs within the backbone network, which progressively downsamples input images while extracting hierarchical feature representations. YOLO11 integrates SPPF (Spatial Pyramid Pooling - Fast) and C2PSA (Channel-wise and Spatial Pyramid Attention) modules to strengthen object feature representation capabilities across multiple receptive field sizes. The structural details are illustrated in Figure 2.

Serving as an intermediary between backbone and head components, the neck network facilitates multi-scale feature integration through feature pyramid structures. This component processes features extracted at different resolutions from the backbone, enabling detection of objects across various scales within the same image. The refined features are subsequently delivered to support the head network’s detection tasks.

The head network architecture mirrors that of YOLOv8 [21], executing the final detection operations including bounding box regression and classification probability estimation. Each grid cell in the output feature maps is responsible for predicting objects whose centers fall within that cell. These predictions are generated based on the multi-scale feature maps propagated through the backbone and neck stages, enabling efficient end-to-end object detection.
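The grid-cell rule described above can be illustrated with a little arithmetic. The sketch below is only an illustration: the strides 8, 16, and 32 are an assumption (they are the usual strides of the three detection scales in YOLO-family models, but the text does not state them), and `responsible_cell` and `grid_sizes` are hypothetical helper names.

```python
IMG_SIZE = 640          # input resolution used for training in this paper
STRIDES = (8, 16, 32)   # assumed strides of the three detection scales


def grid_sizes(img_size=IMG_SIZE, strides=STRIDES):
    """Number of grid cells per side of the output feature map at each stride."""
    return {s: img_size // s for s in strides}


def responsible_cell(cx, cy, stride):
    """(column, row) of the grid cell that contains the box center (cx, cy) in pixels."""
    return int(cx // stride), int(cy // stride)


if __name__ == "__main__":
    print(grid_sizes())
    print(responsible_cell(100.0, 250.0, 16))
```

Under these assumptions, a 640 × 640 input yields 80 × 80, 40 × 40, and 20 × 20 grids, so small components are mostly resolved on the finest grid.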

2.2. The Improved SLD-YOLO Network Design

In this study, to achieve high-precision detection of satellite components in complex space environments while considering model lightweight design and inference efficiency, we select YOLO11s as the base network architecture and propose SLD-YOLO (Satellite Lightweight Detection—You Only Look Once), a lightweight network structure based on improved YOLO11s. The enhanced network architecture is illustrated in Figure 3. The proposed SLD-YOLO incorporates several key modifications:

RLNet is adopted as the backbone network to reduce model complexity.
CSP-HSF modules are integrated into the neck for enhanced feature fusion.
PSConv replaces traditional downsampling operations to better preserve features.
SimAM attention is applied in the detection head to boost feature discrimination.

These components work collectively to achieve improved detection performance while maintaining computational efficiency.

2.2.1. Improvement of the Backbone Network

Traditional YOLO11 backbone networks have several limitations when processing satellite component identification tasks: the dense connectivity of C3k2 modules leads to computational redundancy and inefficient feature extraction; conventional convolution structures lack adaptability when processing multi-scale features in satellite images, making it difficult to effectively capture satellite component detail features under different poses and lighting conditions; existing network architectures have large parameter counts and high computational complexity, which is disadvantageous for deployment in resource-constrained space environments. To address these problems, this paper proposes a lightweight backbone network called RLNet, whose structure is shown in Figure 4. This network reduces model complexity by 19.72% while maintaining 85.86% mAP50 accuracy through the introduction of reparameterization mechanisms and hierarchical feature fusion strategies, improving the robustness and accuracy of satellite component identification and providing more efficient solutions for space target recognition tasks.

RLNet employs StemBlock [31] modules to replace traditional convolution structures, achieving richer shallow feature representation through padding and multi-path feature fusion, with its structure shown in Figure 5.

StemBlock serves as the feature extraction frontend of the network, employing dual-branch parallel processing mechanisms to enhance shallow feature representation capabilities. This module first preprocesses inputs through initial convolution, then extracts different types of feature information through max pooling paths and dual convolution paths, finally fusing the two-path features. Its mathematical expression can be described as:

$$X_{st} = \phi_{st4}\left(\phi_{st3}\left(\mathrm{Concat}\left(P_{max}(X_{st1}),\ \phi_{st2}\left(\phi_{st2}(X_{st1})\right)\right)\right)\right) \tag{1}$$

where $X_{st1} = \phi_{st1}(X_{in})$ represents initial feature extraction, $P_{max}(\cdot)$ denotes the max pooling operation, and $\phi_{st2}$, $\phi_{st3}$, and $\phi_{st4}$ represent the corresponding convolution transformation functions.
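To make the data flow of Equation (1) concrete, the following sketch traces tensor shapes through a StemBlock-like module. Everything numeric here is an assumption for illustration: the kernel sizes, strides, and channel widths are hypothetical (the text does not list them), and the two applications of $\phi_{st2}$ are modelled as a 1 × 1 convolution followed by a strided 3 × 3 convolution.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1


def stem_block_shape(h, w, c_mid=32, c_out=64):
    """Trace (channels, height, width) through Eq. (1) under assumed layer settings."""
    # phi_st1: assumed 3x3 convolution, stride 2, padding 1
    h1, w1 = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    # Pooling path P_max: assumed 2x2 max pooling, stride 2
    pool = (c_mid, conv_out(h1, 2, 2, 0), conv_out(w1, 2, 2, 0))
    # Dual-convolution path: assumed 1x1 (stride 1) then 3x3 (stride 2, padding 1)
    h2, w2 = conv_out(h1, 1, 1, 0), conv_out(w1, 1, 1, 0)
    conv = (c_mid, conv_out(h2, 3, 2, 1), conv_out(w2, 3, 2, 1))
    # Concat requires both paths to share the same spatial size
    assert pool[1:] == conv[1:], "paths must match spatially before Concat"
    concat = (pool[0] + conv[0], pool[1], pool[2])
    # phi_st3 (assumed 3x3, stride 1, padding 1) and phi_st4 (assumed 1x1) keep the size
    return (c_out, concat[1], concat[2])


if __name__ == "__main__":
    print(stem_block_shape(640, 640))
```

With these assumed settings a 640 × 640 input leaves the stem at one quarter of the input resolution; other stride choices would change that figure.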

Meanwhile, we design the DARB (Dense Adaptive Reparameterization Block) module as the basic building unit. It achieves progressive feature extraction through a dense connectivity mechanism, and its structure is shown in Figure 6. The module employs RLConv lightweight convolution units, whose reparameterized convolutions keep a multi-branch structure for expressiveness during training and are fused into a single convolution at inference. This reduces parameter count and inference cost while maintaining feature extraction effectiveness, balancing accuracy and efficiency. Its workflow can be expressed as:

$$Y_{i} = \begin{cases} X_{in}, & i = 0 \\ F_{RL}^{(i)}(Y_{i-1}), & i = 1, 2, \dots, n \end{cases} \tag{2}$$

$$Z_{fused} = \phi_{e}\left(\phi_{s}\left(\mathrm{Concat}(Y_{0}, Y_{1}, \dots, Y_{n})\right)\right) \tag{3}$$

where $\phi_{s}$ and $\phi_{e}$ represent squeeze and excitation convolution operations, respectively, achieving feature compression and activation in the channel dimension.

$$X_{output} = \begin{cases} Z_{fused} + X_{in}, & \text{if } C_{i} = C_{out} \\ Z_{fused}, & \text{otherwise} \end{cases} \tag{4}$$

Residual connections are used when the input and output channel counts ($C_{i}$ and $C_{out}$) are identical, thereby maintaining network depth trainability while ensuring feature tensor dimensional consistency constraints are satisfied.
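The control flow of Equations (2) to (4) can be sketched in a few lines. This is a toy illustration rather than the authors' implementation: features are plain Python lists standing in for channel vectors, and the layer, squeeze, and excitation functions passed in are hypothetical placeholders for the RLConv units and the $\phi_{s}$, $\phi_{e}$ convolutions.

```python
def darb_forward(x_in, rl_layers, squeeze, excite):
    """Dense block: chain the layers, concatenate all outputs, squeeze, excite, add residual."""
    ys = [x_in]                              # Y_0 = X_in
    for layer in rl_layers:                  # Y_i = F_RL^(i)(Y_{i-1}), Eq. (2)
        ys.append(layer(ys[-1]))
    concat = [v for y in ys for v in y]      # Concat(Y_0, ..., Y_n) along channels
    z = excite(squeeze(concat))              # Eq. (3)
    if len(z) == len(x_in):                  # Eq. (4): residual only if channels match
        return [a + b for a, b in zip(z, x_in)]
    return z


# Hypothetical stand-ins for the learned operations
def double(y):
    return [2.0 * v for v in y]


def squeeze_to_two(c):
    return [sum(c[0::2]), sum(c[1::2])]


def squeeze_to_three(c):
    return [sum(c[0::3]), sum(c[1::3]), sum(c[2::3])]


def halve(z):
    return [0.5 * v for v in z]


if __name__ == "__main__":
    print(darb_forward([1.0, 2.0], [double, double], squeeze_to_two, halve))
    print(darb_forward([1.0, 2.0], [double, double], squeeze_to_three, halve))
```

The first call keeps two channels, so the input is added back; the second changes the channel count, so the fused features are returned without the residual term.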

The RLNet backbone network constructs a feature extraction architecture that combines high accuracy and high efficiency through innovative architectural design and optimization strategies. Through the organic combination of reparameterization technology and hierarchical feature fusion strategies, this network achieves model lightweight while maintaining high recognition accuracy. The reparameterization design enables the network to fully learn feature representations during the training phase, while the hierarchical feature aggregation mechanism enhances the network’s perception capability for detail features, providing strong technical support for precise identification and localization of satellite components, making the improved model more suitable for practical space mission application scenarios.
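The idea behind reparameterization can be shown on a single channel. The sketch below assumes a RepVGG-style layout with a 3 × 3 branch, a 1 × 1 branch, and an identity branch; the text does not specify the branches inside RLConv, so this layout is an assumption, and batch normalization (which is also folded in practice) is omitted. Because convolution is linear, the three branches collapse into one 3 × 3 kernel that gives the same output.

```python
def conv2d_same(img, kernel):
    """Single-channel 2D cross-correlation with zero padding (output size equals input size)."""
    k = len(kernel)
    r = k // 2
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[di][dj] * img[y][x]
    return out


def multi_branch(img, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch (scalar k1) + identity branch."""
    conv3 = conv2d_same(img, k3)
    return [[conv3[i][j] + k1 * v + v for j, v in enumerate(row)]
            for i, row in enumerate(img)]


def fuse_kernel(k3, k1):
    """Inference-time form: fold the 1x1 and identity branches into the 3x3 kernel center."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


# Toy values chosen only for illustration
IMG = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
K3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
K1 = 2.0

if __name__ == "__main__":
    print(multi_branch(IMG, K3, K1) == conv2d_same(IMG, fuse_kernel(K3, K1)))
```

The fused kernel costs one convolution at inference instead of three branches, which is the source of the efficiency gain described above.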

2.2.2. Improvement of the C3K2 Module

Traditional C3K2 structures have significant limitations when processing satellite component identification tasks. First, C3K2 structures employ fixed convolution kernel sizes for feature extraction, unable to effectively capture multi-scale feature variations of satellite components at different observation distances and angles, leading to insufficient detection accuracy for small target components. Second, this structure lacks effective channel attention mechanisms, unable to perform adaptive feature weight allocation based on the importance of different components, easily causing false and missed detections in complex space backgrounds. Additionally, the feature fusion approach of C3K2 structures is relatively simple, making it difficult to fully utilize the complementarity between shallow geometric detail information and deep semantic information, limiting the model’s precise perception capability for satellite component boundaries and texture features. To address the above issues, this paper proposes a novel structure called CSP_HSF (Cross Stage Partial with Hierarchical Spatial Fusion), whose structure is shown in Figure 7. This structure improves satellite component detection accuracy and robustness by introducing multi-scale progressive feature fusion mechanisms and efficient channel attention modules.

The core innovation of the CSP_HSF module lies in combining hierarchical spatial fusion mechanisms with cross-stage partial connection strategies to construct an efficient structure capable of adaptively extracting multi-scale features. This design first captures feature information within different receptive field ranges through multi-branch convolutions, then integrates the multi-scale features using a progressive feature fusion strategy. Compared to traditional CSP structures, CSP_HSF introduces PCFF (Progressive Channel Feature Fusion) modules as core components, whose structure is shown in Figure 8. This module can simultaneously model local detail features and global context information of satellite components while maintaining computational efficiency. Additionally, the PCFF module integrates the ECA (Efficient Channel Attention) mechanism [32], which achieves efficient inter-channel information interaction through one-dimensional convolution and dynamically adjusts the importance weights of feature channels, thereby enhancing the model’s perception of key features and its anti-interference performance.
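The channel-attention step can be sketched as follows: each channel is reduced to one number by global average pooling, a one-dimensional convolution mixes neighbouring channel descriptors, and a sigmoid turns the result into per-channel gates. This is a minimal sketch of the general ECA idea on nested lists; the kernel weights are hypothetical (in a trained network they are learned), and the function name `eca` is chosen here for illustration.

```python
import math


def eca(feature, weights):
    """Apply ECA-style gating.

    feature: list of C channels, each an H x W nested list of floats.
    weights: odd-length 1D kernel applied across the channel dimension (zero padded).
    """
    # Global average pooling: one descriptor per channel
    desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    r = len(weights) // 2
    gates = []
    for i in range(len(desc)):
        acc = 0.0
        for t, wt in enumerate(weights):
            j = i + t - r
            if 0 <= j < len(desc):
                acc += wt * desc[j]
        gates.append(1.0 / (1.0 + math.exp(-acc)))   # sigmoid gate in (0, 1)
    # Rescale every element of a channel by that channel's gate
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature, gates)]


if __name__ == "__main__":
    toy = [[[0.0]], [[2.0]], [[4.0]]]        # three channels of size 1 x 1
    print(eca(toy, [0.0, 1.0, 0.0]))
```

The only learnable parameters are the few kernel weights, which is why this kind of attention adds very little to the parameter count.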

The workflow of the CSP_HSF module can be summarized in three stages: feature splitting, multi-scale processing, and adaptive fusion. First, the input feature map $X_{in} \in \mathbb{R}^{B \times C \times H \times W}$ undergoes channel splitting through a $1 \times 1$ convolution, generating the original branch $F_{split1}$ and the feature branch $F_{split2}$ for subsequent processing. Next, the feature branch undergoes multi-scale feature extraction and fusion through PCFF modules. Finally, the PCFF outputs are concatenated with the original branch and passed through a convolution to obtain the final output:

$$Y_{output} = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}(F_{split1}, F_{PCFF1}, F_{PCFF2}, \dots, F_{PCFFn})\right) \tag{5}$$

where $F_{PCFFi}$ represents the output features of the $i$-th PCFF module, and $X_{residual}$ is the residual connection term. PCFF modules employ progressive multi-scale feature fusion strategies, achieving effective extraction of multi-level features through cascaded combinations of different-sized convolution kernels. The workflow includes three key steps:

First, input features undergo preliminary feature extraction through a $3 \times 3$ convolution and are then split along the channel dimension into two sub-feature maps, $F_{1a}$ and $F_{1b}$. Second, the smaller feature map $F_{1a}$ undergoes medium-scale feature extraction through a $5 \times 5$ grouped convolution and is further split into two branches, $F_{2a}$ and $F_{2b}$. Finally, the smallest feature map $F_{2a}$ undergoes large-scale context extraction through a $7 \times 7$ grouped convolution and feature enhancement through the ECA attention mechanism:

$$F_{3} = \mathrm{ECA}\left(\mathrm{DWConv}_{7 \times 7}(F_{2a})\right) \tag{6}$$

$$Y_{PCFF} = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}(F_{3}, F_{2b}, F_{1b})\right) + X_{in} \tag{7}$$

This design can capture geometric structures and texture features of satellite components at different levels through progressive multi-scale feature extraction, while using attention mechanisms to highlight important feature channels, improving feature representation discriminability and robustness.
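A quick channel count shows why the progressive splitting stays cheap and why the residual addition in Equation (7) is shape-compatible. The sketch below is bookkeeping only: the split fraction `ratio` is a hypothetical parameter (the text does not give the split sizes), and the assumption that PCFF modules are chained inside CSP_HSF is likewise illustrative.

```python
def pcff_channels(c, ratio=0.5):
    """Channel counts through the three PCFF steps for an input with c channels.

    ratio: assumed fraction of channels sent on to the next, larger-kernel stage.
    """
    c1a = int(c * ratio)          # F_1a: continues to the 5x5 stage
    c1b = c - c1a                 # F_1b: kept as is
    c2a = int(c1a * ratio)        # F_2a: continues to the 7x7 stage
    c2b = c1a - c2a               # F_2b: kept as is
    c3 = c2a                      # F_3: 7x7 convolution + ECA keep the channel count
    return {"F1a": c1a, "F1b": c1b, "F2a": c2a, "F2b": c2b,
            "F3": c3, "concat": c3 + c2b + c1b}


def csp_hsf_concat_channels(c_split1, c_pcff, n):
    """Channels entering the final 1x1 convolution of Eq. (5) with n PCFF outputs."""
    return c_split1 + n * c_pcff


if __name__ == "__main__":
    print(pcff_channels(64))
    print(csp_hsf_concat_channels(32, 32, 2))
```

Whatever split fraction is chosen, the concatenation in Equation (7) recovers the original channel count, while only a fraction of the channels ever pass through the larger 5 × 5 and 7 × 7 kernels.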

The introduction of CSP_HSF modules improves model performance in satellite component identification tasks. Through the organic combination of multi-scale progressive feature fusion and efficient channel attention mechanisms, this module effectively addresses the limitations of traditional network structures when processing multi-scale targets and complex backgrounds. The progressive design strategy of PCFF modules enables the network to simultaneously capture local detail features and global semantic information of satellite components, improving detection accuracy for small target components. The integration of ECA attention mechanisms further enhances the network’s perception capability for key features, improving model detection stability under different lighting conditions and observation angles. Overall, CSP_HSF modules provide more accurate and robust feature representation capabilities for satellite component identification tasks while maintaining efficient computational performance.

2.2.3. Modified Downsampling Operation

In practical application scenarios of satellite component identification, models often need to be deployed in environments with limited computational resources, such as space-based computing platforms or edge devices. Although traditional downsampling operations can effectively reduce feature map resolution, they typically involve high computational overhead and parameter redundancy, limiting model deployment efficiency in resource-constrained environments. Existing pooling operations and standard convolution downsampling methods often require substantial computational resources to maintain detection accuracy when processing complex satellite images, conflicting with practical application requirements for model lightweight design. To address this issue, this paper introduces PSConv (Pinwheel-Shaped Convolution) modules as lightweight downsampling strategies [29], whose structure is shown in Figure 9. This module reduces model computational complexity and parameter count while maintaining detection performance through structured convolution decomposition and efficient feature extraction mechanisms.

The core innovation of PSConv modules lies in employing four different asymmetric padding patterns to enhance the directional perception capability of convolution operations. Specifically, this module defines four asymmetric padding configurations: $(k, 0, 1, 0)$, $(0, k, 0, 1)$, $(0, 1, k, 0)$, and $(1, 0, 0, k)$, where $k$ is set to 3. These four padding patterns correspond to extended padding in four main directions: left, right, up, and down, enabling the same convolution kernel to perceive feature information in different spatial contexts.

PSConv modules employ an efficient separable convolution design, processing horizontal and vertical directional feature information through $1 \times k$ and $k \times 1$ convolution kernels, respectively. This design decomposes a standard $k \times k$ convolution into two one-dimensional convolution operations, reducing parameter count and computational complexity. Each convolution kernel is applied to two different padding configurations, forming four parallel feature extraction branches. The separable convolution feature extraction process can be expressed as:

$$F_{w0} = \mathrm{Conv}_{1 \times k}(\mathrm{Pad}_{0}(X)), \quad F_{w1} = \mathrm{Conv}_{1 \times k}(\mathrm{Pad}_{1}(X)) \tag{8}$$

$$F_{h0} = \mathrm{Conv}_{k \times 1}(\mathrm{Pad}_{2}(X)), \quad F_{h1} = \mathrm{Conv}_{k \times 1}(\mathrm{Pad}_{3}(X)) \tag{9}$$

To maintain parameter efficiency, each branch’s output channel count is set to one-quarter of the total output channels, i.e., $C_{out} / 4$. Finally, the outputs of the four branches are fused through channel concatenation, and downsampling and feature integration are achieved through a $2 \times 2$ convolution:

$$Y_{output} = \mathrm{Conv}_{2 \times 2}\left(\mathrm{Concat}(F_{w0}, F_{w1}, F_{h0}, F_{h1})\right) \tag{10}$$
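The shape arithmetic behind Equations (8) to (10) can be checked with a short sketch. Two things are assumptions here: the padding tuples are read in (left, right, top, bottom) order, and the four branch convolutions use stride 1 while the final $2 \times 2$ convolution uses stride 2, following the statement that downsampling happens in that last convolution. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output size of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def psconv_shapes(h, w, k=3, final_stride=2):
    """Return the four branch sizes and the final (height, width) after the 2x2 convolution."""
    # Pad_0 .. Pad_3, assumed to be (left, right, top, bottom)
    pads = [(k, 0, 1, 0), (0, k, 0, 1), (0, 1, k, 0), (1, 0, 0, k)]
    # (kernel_height, kernel_width): 1 x k for Eq. (8), k x 1 for Eq. (9)
    kernels = [(1, k), (1, k), (k, 1), (k, 1)]
    branches = []
    for (left, right, top, bottom), (kh, kw) in zip(pads, kernels):
        bh = conv_out(h + top + bottom, kh)
        bw = conv_out(w + left + right, kw)
        branches.append((bh, bw))
    # All four branches must agree spatially before channel concatenation
    assert len(set(branches)) == 1, "branch shapes differ"
    bh, bw = branches[0]
    final = (conv_out(bh, 2, final_stride), conv_out(bw, 2, final_stride))
    return branches, final


if __name__ == "__main__":
    print(psconv_shapes(640, 640))
```

Under these assumptions every branch comes out one pixel larger than the input in each dimension, and the stride-2 $2 \times 2$ convolution then halves the resolution exactly for even input sizes.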

This paper integrates PSConv modules into key downsampling positions of the YOLO11 network, replacing the standard convolution downsampling modules. This integration strategy exploits PSConv’s parameter efficiency: through separable convolution and channel grouping, the network reduces memory usage and computational latency when processing high-resolution satellite images while improving detection accuracy, which enhances deployment feasibility in resource-constrained environments.
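As a rough check on the parameter argument, the weights of a standard $k \times k$ downsampling convolution can be compared with those of the four one-dimensional branches plus the $2 \times 2$ fusion convolution. This is a back-of-the-envelope sketch under stated assumptions: biases and normalization layers are ignored, and the fusion convolution is assumed to map $C_{out}$ channels to $C_{out}$ channels, which the text does not state.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k


def psconv_params(c_in, c_out, k=3):
    """Weights in four 1D branches (c_out / 4 channels each) plus the 2x2 fusion convolution."""
    branch = c_in * (c_out // 4) * k      # one 1 x k or k x 1 branch
    fuse = c_out * c_out * 2 * 2          # assumed c_out -> c_out 2x2 convolution
    return 4 * branch + fuse


if __name__ == "__main__":
    for c_in, c_out in [(128, 128), (256, 256)]:
        print(c_in, c_out, standard_conv_params(c_in, c_out), psconv_params(c_in, c_out))
```

With equal input and output widths and $k = 3$, this count gives $7C^2$ weights against $9C^2$ for the standard convolution. The ratio depends on the channel configuration, so the actual saving in a given layer should be read from the measured model rather than from this estimate.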

2.2.4. Enhanced Detection Head with SimAM Attention

YOLO11’s detection head faces the problem of insufficient feature discrimination capability when processing satellite component identification tasks. Since component targets in satellite images typically have similar geometric shapes and texture features, traditional detection heads struggle to effectively distinguish different types of components. Additionally, traditional detection heads lack adaptive adjustment mechanisms for the importance of different spatial positions and channels in feature maps, unable to dynamically allocate feature weights based on target significance, causing key component features to be interfered with by background noise, affecting detection accuracy and robustness. To tackle this issue, the paper incorporates the SimAM (Simple, Parameter-Free Attention Module) attention mechanisms in the detection head [30], whose structure is shown in Figure 10. This mechanism can effectively improve the network’s perception capability for key features of satellite components through parameter-free adaptive feature recalibration strategies, enhancing feature representation discriminability and improving detection head performance in complex scenarios.

The key innovation of SimAM attention mechanisms lies in proposing a parameter-free attention calculation method based on statistical principles, inspired by neuron inhibition mechanisms in human visual systems. SimAM evaluates the importance of each position by analyzing statistical differences between each position in feature maps and its surrounding regions, achieving efficient attention weight calculation without additional learnable parameters. The theoretical foundation of this mechanism is built on energy function minimization principles, implementing automatic identification and enhancement of salient features by constructing mathematical mapping relationships between feature variances and attention weights.

SimAM’s energy function is defined as the normalized difference between neuron activation values and their local means, with its mathematical expression described as:

$$E_{i,j} = \frac{(X_{i,j} - \mu)^{2}}{4(\sigma^{2} + \lambda)} + 0.5 \tag{11}$$

where $X_{i,j}$ represents the activation value at position $(i, j)$ in the feature map, $\mu$ and $\sigma^{2}$ represent the spatial mean and variance of the feature map, respectively, and $\lambda$ is a smoothing term for numerical stability. This energy function ensures consistent processing of features at different scales through normalization operations.

Based on the computed energy function, SimAM generates adaptive attention weights and applies them to original feature maps. The complete attention enhancement process can be expressed as:

$$Y_{SimAM} = X \odot \sigma(E) = X \odot \sigma\left(\frac{(X - \mu)^{2}}{4(\sigma^{2} + \lambda)} + 0.5\right) \tag{12}$$

where $\sigma(\cdot)$ represents the Sigmoid activation function (not to be confused with the variance $\sigma^{2}$), $\odot$ denotes element-wise multiplication, and $Y_{SimAM}$ is the feature output after attention enhancement. This mechanism enables feature regions with higher discriminability to obtain larger weights through adaptive weight allocation, while background noise and redundant information are correspondingly suppressed.
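Equations (11) and (12) translate almost directly into code. The sketch below applies them to one channel stored as a nested list; it assumes the population variance over the spatial positions and a smoothing value of 1e-4 for $\lambda$, neither of which is specified in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simam_weights(feature, lam=1e-4):
    """Per-position attention weights sigmoid(E) for one H x W channel, following Eq. (11)."""
    vals = [v for row in feature for v in row]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)   # assumed population variance
    return [[sigmoid((v - mu) ** 2 / (4.0 * (var + lam)) + 0.5) for v in row]
            for row in feature]


def simam(feature, lam=1e-4):
    """Attention-enhanced output X * sigmoid(E), following Eq. (12)."""
    weights = simam_weights(feature, lam)
    return [[v * w for v, w in zip(row, wrow)] for row, wrow in zip(feature, weights)]


if __name__ == "__main__":
    channel = [[0.0, 0.0], [0.0, 4.0]]   # one position stands out from the rest
    print(simam_weights(channel))
```

Positions that deviate most from the channel mean receive the largest weights, and no learnable parameters are involved, which matches the parameter-free property described above.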

By incorporating SimAM attention mechanisms into multiple scale branches of the detection head, the proposed method enhances the network’s ability to capture key features of satellite components. The parameter-free design of SimAM not only avoids redundancy common in traditional attention mechanisms but also ensures computational efficiency, making it well-suited for resource-constrained space applications. This attention mechanism improves small target detection by highlighting discriminative regions in feature maps and reducing background interference, ultimately boosting the accuracy and reliability of satellite component identification.

3. Experimental Details

3.1. Dataset Preparation

In previous research on satellite components, most researchers used satellite datasets created with Satellite Tool Kit (STK) software, which had low resolution and were poorly suited to deep learning model training. This study employs the UESD (Unreal Engine Satellite Dataset) developed by Zhao et al. [28] with Unreal Engine 4, which is currently the largest available satellite simulation dataset. UESD simulates over 30 different types of satellites against background environments that include Earth, the Sun, and other stellar systems. It covers different satellite poses and observation angles, as well as lighting scenarios ranging from high exposure to low illumination. Figure 11 shows images of the same satellite with different backgrounds, pose angles, and lighting conditions. Such diversity makes UESD highly suitable for model training and validation, helping improve model generalization capability in various practical applications.

3.2. Dataset Processing

The UESD dataset contains a total of 10,000 images, which were first converted to grayscale; the processed satellite images are shown in Figure 12. Grayscale conversion was chosen because, in space exploration missions, especially those requiring high-precision measurements, grayscale cameras have significant advantages over color cameras, including higher signal-to-noise ratios and sensitivity, wider dynamic ranges to adapt to extreme lighting conditions, lower data transmission volumes, and higher processing efficiency. Additionally, grayscale cameras demonstrate superior performance in radiation resistance and system reliability, making them ideal choices for tasks such as satellite component detection, space target identification, and deep space scientific research. By using these grayscale-processed images, we aim to further improve model performance in real space missions.
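The conversion step can be sketched as a weighted sum of the color channels. The text does not say which conversion was used, so the ITU-R BT.601 luma weights below are an assumption, and the function names are hypothetical.

```python
def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel with 0-255 values to a single gray level."""
    r, g, b = pixel
    # Assumed BT.601 luma weights; other weightings are possible
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def image_to_gray(img):
    """Convert an image stored as rows of (R, G, B) tuples to rows of gray levels."""
    return [[rgb_to_gray(p) for p in row] for row in img]


if __name__ == "__main__":
    tiny = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
    print(image_to_gray(tiny))
```

Each pixel shrinks from three values to one, which is the reduction in data volume mentioned above.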

Based on experimental requirements, we used the LabelImg software to annotate the dataset images with five component categories: (1) panel, (2) antenna, (3) instrument, (4) thruster, and (5) opticpayload, which satisfy the requirements of space satellite component identification tasks. The label quantities are shown in Table 1, and the label distribution scatter plot is shown in Figure 13. The data were divided into training, validation, and test sets at a ratio of 7:1:2. Annotation example images are shown in Figure 14.
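A 7:1:2 division can be reproduced with a seeded shuffle. The sketch below is one possible way to do it: the random seed and the use of a uniform random shuffle are assumptions, since the text gives only the ratio.

```python
import random


def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items reproducibly and split them into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)       # seeded so the split can be repeated
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]           # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(10000))
    print(len(train), len(val), len(test))
```

For the 10,000 images in the dataset this yields 7000 training, 1000 validation, and 2000 test images.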

3.3. Experimental Environment

The experimental environment is listed in Table 2.

Training parameter settings: The model was trained for 300 epochs with a batch size of 64 and an input resolution of 640 × 640. Optimization used Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, combined with Automatic Mixed Precision (AMP) to improve training efficiency.
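As a sketch of how these settings might be expressed, the dictionary below uses the argument names of the Ultralytics training interface. This assumes the experiments were run with the Ultralytics package, which the paper cites for YOLO11 but does not confirm as its training tool. The file names `yolo11s.yaml` and `uesd.yaml` are placeholders, and the grayscale preprocessing is not shown.

```python
# Hyperparameters reported in Section 3.3, keyed by Ultralytics argument names.
TRAIN_ARGS = {
    "epochs": 300,       # training epochs
    "batch": 64,         # images per batch
    "imgsz": 640,        # input resolution (640 x 640)
    "optimizer": "SGD",  # Stochastic Gradient Descent
    "lr0": 0.01,         # initial learning rate
    "amp": True,         # Automatic Mixed Precision
}


def train(model_cfg="yolo11s.yaml", data_cfg="uesd.yaml"):
    """Launch training; both file names are hypothetical placeholders."""
    from ultralytics import YOLO  # third-party package, imported lazily

    model = YOLO(model_cfg)
    return model.train(data=data_cfg, **TRAIN_ARGS)
```

Calling `train()` requires the Ultralytics package and a dataset description file. All other hyperparameters would fall back to library defaults, which may differ from the authors' setup.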

3.4. Experimental Evaluation Criteria

To evaluate space satellite component identification, we selected evaluation metrics commonly used in object detection, namely Precision (P), Recall (R), and mean Average Precision (mAP). To assess how lightweight a model is, we also report the parameter count, floating-point operations (GFLOPs), and the size of the best weight file.

$$ P = \frac{TP}{TP + FP} \tag{13} $$

Precision (P) measures the proportion of correctly identified samples within those predicted by the model as belonging to a certain category. TP refers to the number of correctly identified targets, while FP indicates the number of background or other category objects misclassified as targets.

$$ R = \frac{TP}{TP + FN} \tag{14} $$

Recall (R) represents the proportion of samples correctly identified by the model among all samples actually belonging to a certain category. FN is the number of actual targets that the model failed to identify.
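Equations (13) and (14) translate directly into code. The sketch below uses hypothetical counts; returning 0.0 for an empty denominator is a convention chosen here and is not taken from the paper.

```python
def precision(tp, fp):
    """Eq. (13): fraction of predicted targets that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (14): fraction of actual targets that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for one category.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)  # 80 / 100
r = recall(tp, fn)     # 80 / 90
print(f"P = {p:.3f}, R = {r:.3f}")
```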

$$ AP = \int_{0}^{1} P(R) \, dR \tag{15} $$

AP integrates precision at different recall levels, typically calculated by plotting Precision–Recall curves. The area under this curve is the AP value, providing a comprehensive evaluation metric for classifier performance.

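$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i} \tag{16} $$

mAP is obtained by averaging the AP values of all N categories, where AP_i is the AP of the i-th category. It provides a single numerical value to measure overall model performance, serving as the most important comprehensive evaluation metric. Following common convention, mAP50 is the mAP at an intersection-over-union (IoU) threshold of 0.5, and mAP50:95 is the mean of the mAP values over IoU thresholds from 0.5 to 0.95 in steps of 0.05.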
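In practice, the integral in Equation (15) is approximated from a finite set of precision and recall points. The sketch below uses all-point interpolation in the style of the Pascal VOC evaluation, which is one common convention and may not match the exact procedure of the authors' toolkit. The sample points are hypothetical.

```python
def average_precision(recalls, precisions):
    """Approximate Eq. (15); recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall,
    # which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangular areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Eq. (16): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical precision-recall points for one category.
ap = average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.8, 0.6, 0.5])
print(round(ap, 4))  # 0.58
print(round(mean_average_precision([0.5, 0.7]), 4))  # 0.6
```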

4. Results and Analysis

To comprehensively validate the effectiveness of the SLD-YOLO algorithm for satellite component identification in space environments, this study designed five systematic experiments: (1) ablation experiments analyzing the contribution of each improved module; (2) backbone network comparison experiments validating the optimization effects of the improved backbone; (3) multi-model comparison experiments evaluating performance differences between the improved algorithm and other mainstream object detection algorithms; (4) detection visualization experiments demonstrating recognition capability in complex space scenarios; and (5) robustness validation experiments verifying learning and generalization capability in small-sample scenarios with other satellite types. Together, these experiments evaluate the performance improvements and engineering application potential of the improved algorithm from both quantitative and qualitative perspectives.

4.1. Ablation Experiment Analysis

To systematically evaluate the impact of each improved module on model performance, this paper uses YOLO11s as the baseline model and introduces the designed modules step by step while observing changes in mAP50, mAP50:95, computational complexity (GFLOPs), parameter count, and model file size. This ablation experiment quantifies the contribution of each module and provides evidence for model optimization. Experimental results are shown in Table 3.

From Table 3, we can observe that each improved module, introduced individually (Groups 2–5), yields gains in both mAP50 and mAP50:95 over the baseline, validating the effectiveness of the proposed improvements. Specifically, after replacing the backbone network with RLNet (Group 2), computational complexity decreases by 19.72%, parameter count by 22.74%, and model size by 21.86%, while mAP50 improves to 85.86%, validating RLNet’s effectiveness in both lightweight design and performance improvement. When RLNet is combined with the other enhancement modules (Groups 6–8), accuracy improves further in every combination, confirming RLNet’s compatibility within the overall framework. Adding the proposed CSP-HSF module in the neck on top of RLNet (Group 8) raises mAP50 to 86.76% and mAP50:95 to 62.95%, indicating this module’s advantages in multi-scale feature fusion for satellite components with large size variations. Integrating PSConv for downsampling (Group 9) further reduces the parameter count to 6.97 M and the model size to 13.8 MB while improving detection performance. Finally, introducing SimAM in the head (Group 10) raises mAP50 to 87.44% and mAP50:95 to 63.25% without increasing computational complexity or parameter count, which suggests good adaptability to the lighting and scale variations in satellite imagery.

After integrating all modules, mAP50 increases by 2.22 percentage points and mAP50:95 by 1.72 percentage points over the baseline, while computational complexity decreases by 19.72%, parameter count by 25.93%, and model size by 24.59%. These results show that the proposed multi-module optimization strategy improves accuracy and reduces model size at the same time.
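The headline figures can be reproduced from Groups 1 and 10 of Table 3. The short sketch below separates the two kinds of change: the efficiency figures are relative reductions, whereas the mAP figures are absolute differences. The variable and function names are illustrative.

```python
def relative_reduction(baseline, improved):
    """Percentage decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Group 1 (YOLO11s baseline) and Group 10 (full SLD-YOLO) from Table 3.
baseline = {"GFLOPs": 21.3, "Parameters/M": 9.41, "Size/MB": 18.3}
sld_yolo = {"GFLOPs": 17.1, "Parameters/M": 6.97, "Size/MB": 13.8}

reductions = {
    key: round(relative_reduction(baseline[key], sld_yolo[key]), 2)
    for key in baseline
}
print(reductions)  # {'GFLOPs': 19.72, 'Parameters/M': 25.93, 'Size/MB': 24.59}

# Accuracy gains are absolute differences between two percentages.
map50_gain = round(87.44 - 85.22, 2)
map50_95_gain = round(63.25 - 61.53, 2)
print(map50_gain, map50_95_gain)  # 2.22 1.72
```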

4.2. Backbone Network Comparison Analysis

This study systematically compares the performance of five backbone networks for object detection tasks: the baseline YOLO11s backbone and four alternative architectures, namely a Transformer-based architecture (EfficientViT [33]), a lightweight CNN architecture (FasterNet [34]), and two hybrid CNN architectures (PPHGNetV2 (High Performance GPU Network V2 by PaddlePaddle) [31] and RLNet). All experiments were conducted under identical training conditions to establish a fair comparison for the lightweight improvement scheme proposed in this paper. The experimental results are presented in Table 4.

The backbone network is the primary element for feature extraction, so its design has a direct impact on both model performance and computational efficiency. The results in Table 4 show significant performance differences among backbone networks in satellite component identification tasks. The Transformer-based EfficientViT is the lightest in terms of computation but shows an obvious decline in detection metrics. FasterNet, as a representative lightweight CNN, achieves accuracy between that of EfficientViT and the conventional CNN baseline, indicating that optimizing convolution operations and network structure can effectively reduce computational resource consumption. PPHGNetV2 employs a hybrid architecture and achieves detection performance close to the baseline while maintaining relatively low computational cost.

The RLNet backbone network proposed in this paper performs well in the overall evaluation, with mAP50 improving by 0.64 percentage points over the baseline while computational complexity decreases by 19.72% and parameter count by 22.74%. These results show that RLNet achieves the best balance between accuracy and efficiency among the compared backbones, making it particularly suitable for application scenarios that combine high accuracy requirements with resource constraints.
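One way to make the accuracy-efficiency trade-off in Table 4 concrete is to check which backbones are Pareto-optimal in terms of mAP50 and GFLOPs. This analysis is an illustration added here and is not part of the authors' method. A backbone is dominated if another one is at least as accurate and at least as cheap, and strictly better on one of the two axes.

```python
# (mAP50 in %, GFLOPs) for each backbone, taken from Table 4.
backbones = {
    "Baseline (CSPDarknet)": (85.22, 21.3),
    "EfficientViT": (79.14, 14.3),
    "FasterNet": (81.55, 15.6),
    "PPHGNetV2": (84.99, 18.3),
    "RLNet": (85.86, 17.1),
}


def dominates(a, b):
    """True if point a is no worse than b on both axes and better on one."""
    (map_a, flops_a), (map_b, flops_b) = a, b
    no_worse = map_a >= map_b and flops_a <= flops_b
    strictly_better = map_a > map_b or flops_a < flops_b
    return no_worse and strictly_better


pareto_front = [
    name
    for name, point in backbones.items()
    if not any(dominates(other, point) for other in backbones.values())
]
print(pareto_front)  # ['EfficientViT', 'FasterNet', 'RLNet']
```

Under this criterion RLNet dominates both the baseline and PPHGNetV2, while EfficientViT and FasterNet remain on the front only because they trade accuracy for lower computation.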

4.3. Mainstream Object Detection Model Performance Comparison Analysis

To comprehensively evaluate the performance advantages of SLD-YOLO, this paper conducted comparative experiments with current mainstream and advanced object detection models. The comparison models include the original baseline model YOLO11s [24], other variants of the YOLO series (YOLOv8s [21], YOLOv9s [22], YOLOv10s [23]), the Transformer-based DETR (Detection Transformer) [35] and RT-DETR (Real-Time Detection Transformer) [31], and the anchor-free FCOS [36]. Experimental comparison results are shown in Table 5.

According to Table 5, the SLD-YOLO algorithm proposed in this paper performs well in both accuracy and lightweight metrics. Compared to the YOLO series, SLD-YOLO achieves the highest detection accuracy while having the smallest computational load, parameter count, and storage footprint. Compared to other mainstream detection paradigms, its Recall is 0.13 percentage points lower than that of DETR, but DETR lags behind SLD-YOLO on all other metrics, especially the lightweight metrics, demonstrating SLD-YOLO’s overall advantage in space satellite component identification tasks.

Figure 15 specifically presents lightweight metrics comparison among YOLO series models (YOLOv8s, YOLOv9s, YOLOv10s, YOLO11s, and our SLD-YOLO), demonstrating SLD-YOLO’s superior efficiency within this architectural family. The 3D visualization clearly shows SLD-YOLO achieving the lowest values across all three metrics: computational load, parameter count, and model size.

Extending the comparison to include all evaluated models from Table 5, SLD-YOLO significantly outperforms not only the YOLO series but also Transformer-based models (DETR, RT-DETR) and fully convolutional detection models (FCOS) across the three lightweight metrics. This comprehensive comparison demonstrates SLD-YOLO’s significant lightweight advantages and optimal balance between efficient detection and model compression, making it highly suitable for target detection tasks in space scenarios with limited computational resources and storage space.
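The comparison can be checked mechanically by finding the best model for each column of Table 5. The sketch below copies the table values; the column keys and variable names are illustrative.

```python
columns = ["Precision", "Recall", "mAP50", "mAP50:95", "GFLOPs", "Params_M", "Size_MB"]
lower_is_better = {"GFLOPs", "Params_M", "Size_MB"}

# Rows copied from Table 5, in the column order given above.
rows = {
    "YOLOv8s": (89.30, 80.23, 85.75, 61.80, 28.4, 11.13, 21.5),
    "YOLOv9s": (89.40, 80.70, 86.10, 62.60, 38.7, 9.60, 20.3),
    "YOLOv10s": (88.86, 79.04, 84.53, 61.20, 24.5, 7.22, 16.6),
    "YOLO11s": (88.15, 79.51, 85.22, 61.53, 21.3, 9.41, 18.3),
    "DETR-ResNet50": (88.50, 82.10, 86.20, 55.50, 95.1, 41.56, 170.0),
    "RT-DETR-L": (74.89, 66.74, 72.44, 50.90, 100.6, 28.45, 56.4),
    "FCOS": (84.10, 72.90, 80.00, 55.40, 50.8, 32.12, 128.4),
    "SLD-YOLO": (89.46, 81.97, 87.44, 63.25, 17.1, 6.97, 13.8),
}

best = {}
for idx, col in enumerate(columns):
    pick = min if col in lower_is_better else max
    best[col] = pick(rows, key=lambda name: rows[name][idx])

for col, name in best.items():
    print(f"{col}: {name}")
```

SLD-YOLO is selected for every column except Recall, where DETR-ResNet50 is higher by 0.13 percentage points.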

4.4. Detection Effect Visualization Analysis

To more intuitively compare the performance of the original model with the improved model, this paper designed visualization experiments including confusion matrices, test set detection results, and heatmap comparisons.

To systematically compare the improved model (SLD-YOLO) and the original model (YOLO11s) in satellite component category identification, this study employed confusion matrices. A confusion matrix is a classic classification evaluation tool that reflects recognition accuracy, false detections, and missed detections for each category, which makes it well suited to satellite component identification tasks with multiple categories and complex scenarios. The confusion matrices are shown in Figure 16, where the horizontal axis represents true category labels and the vertical axis represents predicted category labels. Diagonal values give the number of correct predictions for each category, while off-diagonal values give the instances in which the model confuses categories, falsely detects background, or misses a target. A statistical analysis of the confusion matrices shows that SLD-YOLO achieved 6274 true positives (TP), compared to 6209 for YOLO11s. More importantly, SLD-YOLO reduced both false positives (FP) and false negatives (FN): false positives decreased from 1839 to 1519, and false negatives decreased from 1814 to 928. The confusion matrix analysis confirms that SLD-YOLO not only maintains high detection accuracy but also substantially reduces missed detections, demonstrating the effectiveness of the proposed architectural improvements for multi-class satellite component recognition.
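A minimal sketch of how TP, FP, and FN totals can be read from a detection confusion matrix follows. It assumes the layout described above (rows are predicted labels, columns are true labels) plus an extra background row and column, as commonly used for detectors. The paper does not state how its totals were aggregated, so this counting rule and the toy matrix are assumptions.

```python
def detection_counts(matrix):
    """Return (TP, FP, FN) from matrix[predicted][true]; last index is background."""
    n = len(matrix) - 1  # number of real categories
    # Correct detections sit on the diagonal.
    tp = sum(matrix[c][c] for c in range(n))
    # A prediction of category p is false if the truth is another category
    # or background.
    fp = sum(matrix[p][t] for p in range(n) for t in range(n + 1) if p != t)
    # A true object of category t is missed if it is predicted as another
    # category or as background.
    fn = sum(matrix[p][t] for t in range(n) for p in range(n + 1) if p != t)
    return tp, fp, fn


# Hypothetical matrix with two categories plus background.
toy_matrix = [
    [50, 3, 4],  # predicted category 0
    [2, 40, 6],  # predicted category 1
    [8, 7, 0],   # predicted background (missed objects)
]
print(detection_counts(toy_matrix))  # (90, 15, 20)
```

Under this rule a confusion between two categories counts once as a false positive and once as a false negative.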

Figure 17 shows partial detection results of both models on the test set. YOLO11s performs poorly on complex backgrounds (such as cloud interference and background-target edge adhesion) and on small targets, and is prone to three issues: (1) false detection (misidentifying background noise as targets), (2) missed detection (small targets not identified), and (3) duplicate detection (generating multiple redundant boxes for the same target). This indicates that the robustness of YOLO11s for small target identification needs improvement. In contrast, SLD-YOLO demonstrates higher accuracy and stability in these challenging scenarios through its multi-scale feature fusion. Additionally, SLD-YOLO reaches 90.4 FPS (Frames Per Second), ensuring real-time processing capability. These characteristics give SLD-YOLO better adaptability and practicality in space satellite component detection tasks.
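For context, 90.4 FPS corresponds to roughly 11 ms per frame. The sketch below shows one generic way to measure throughput. The paper does not describe its timing protocol (batch size, warm-up, or whether pre- and post-processing are included), so `measure_fps` and the dummy inference function are hypothetical stand-ins.

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Time `infer` over all frames and return frames processed per second."""
    for frame in frames[:warmup]:
        infer(frame)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_infer(frame):
    """Stand-in for a detector forward pass (hypothetical)."""
    return sum(frame)


fps = measure_fps(dummy_infer, [[0] * 1000 for _ in range(50)])
latency_ms = 1000.0 / 90.4  # per-frame latency implied by the reported 90.4 FPS
print(f"measured: {fps:.1f} FPS, reported model latency: {latency_ms:.2f} ms")
```

On a GPU, a fair measurement would also need to synchronize the device before reading the clock, which is omitted here.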

To validate the effectiveness of our proposed SLD-YOLO algorithm, we conducted heatmap visualization analysis on the test set, as shown in Figure 18. By comparing heatmaps of YOLO11s and SLD-YOLO on identical test samples, we can intuitively demonstrate the differences in feature extraction and localization capabilities between the two algorithms in object detection tasks.

Heatmap results demonstrate that, compared to the YOLO11s algorithm, SLD-YOLO exhibits more concentrated and precise feature responses in target regions. From heatmaps, we can observe that SLD-YOLO can more accurately focus on key regions of target objects, with more concentrated thermal responses, significantly reduced background noise, and more precise target boundary localization. Particularly in complex scenarios, SLD-YOLO shows stronger target discrimination capabilities, with red high-response regions in heatmaps having higher overlap with actual target regions. These visualization results fully validate the superior performance of our proposed SLD-YOLO algorithm in feature extraction and target localization.

4.5. Robustness Analysis

To further validate the generalization capability and robustness of the proposed SLD-YOLO model, this study conducted additional validation experiments on part of the UESD_edition2 dataset, selecting five different types of satellite targets with 100–160 images per type, for a total of 700 annotated images. The dataset was divided into training, validation, and test sets at a 7:1.5:1.5 ratio, with the experimental environment and parameter settings consistent with Section 3.3. Table 6 shows the quantitative comparison between SLD-YOLO and the baseline model YOLO11s on the UESD_edition2 dataset. SLD-YOLO outperforms YOLO11s on all evaluation metrics, with precision improving by 4.01 percentage points, recall by 3.88, mAP50 by 2.97, and mAP50:95 by 0.41. These results indicate that SLD-YOLO retains good detection performance and generalization capability in small-sample scenarios with different types of satellites.

Figure 19 shows heatmap comparisons between YOLO11s and SLD-YOLO models on identical satellite targets. From heatmaps, we can observe that, compared to the YOLO11s model (Figure 19b), the SLD-YOLO model (Figure 19c) can more precisely focus on key feature regions of satellite targets, showing clearer target boundaries and less background noise interference. This indicates that the improved SLD-YOLO model has strong learning capabilities for satellite target features and can achieve more precise target localization and identification performance.

5. Discussion

In space on-orbit servicing missions, high-precision classification and localization of satellite target components is a key technical prerequisite for autonomous operations. It presents a dual challenge for detection algorithms: ensuring sufficient detection accuracy while meeting the stringent constraints on computational resources in space-based systems. The improved model based on YOLO11s (SLD-YOLO) proposed in this paper balances detection accuracy and model efficiency through module-level structural optimization and lightweight design.

Depending on whether target satellites have accessible prior information, they can be classified into cooperative targets and non-cooperative targets. For cooperative targets, since their geometric models and equipment information are known, high-precision component identification and localization can be achieved through data generation and training strategies based on specific models. For non-cooperative targets, although such targets lack precise prior modeling information, different satellite platforms share common characteristics in structural design, and universal detection models built on various satellite images can achieve effective generalization identification of non-cooperative targets to some extent. As space mission complexity increases and the variety of orbital platforms continues expanding, while basic satellite panel and antenna structures remain relatively stable, future research should focus on the following directions when facing more diverse spacecraft configurations: enhancing model perception capabilities for fine-grained features through refined component classification annotation; developing more adaptive and scalable detection algorithms to achieve rapid and accurate identification of novel aerospace systems. These technological developments will not only help improve on-orbit servicing efficiency and reliability but also provide important technical support for addressing new challenges in space exploration.

6. Conclusions

Targeting the critical task of space satellite component identification, this study proposes the SLD-YOLO lightweight detection model, which is based on the YOLO11s architecture and achieves an effective balance between accuracy and efficiency. The model incorporates four key technical contributions: (1) the RLNet backbone network with reparameterization mechanisms for lightweight feature extraction, (2) CSP-HSF modules replacing C3K2 structures for enhanced multi-scale feature fusion, (3) PSConv downsampling operations for computational efficiency, and (4) SimAM attention mechanisms in the detection head for improved accuracy without parameter overhead. The experimental dataset (UESD) contains 10,000 images covering over 30 types of satellites in simulated space environments. We annotated five component categories and converted the images to grayscale to better match actual on-orbit servicing mission requirements. Experiments show that SLD-YOLO achieves the highest detection accuracy among the compared models on the UESD dataset, with mAP50 improving by 2.22 percentage points compared to YOLO11s while model parameters and computational complexity are reduced by 25.93% and 19.72%, respectively. With a weight size of 13.8 MB and an inference speed of 90.4 FPS, SLD-YOLO supports both deployment in resource-constrained space environments and real-time processing. Compared to current mainstream object detection models, it exhibits competitive lightweight characteristics while maintaining detection accuracy.

The above achievements prove the effectiveness and efficiency of this method when performing component detection tasks in resource-constrained space environments, demonstrating its tremendous potential in practical applications. As technology advances and application scenarios expand, this method is expected to become one of the key tools for on-orbit servicing.

Author Contributions

Methodology, Y.L., B.L., H.Y. and X.W.; Software, Y.L., B.L., H.Y. and X.W.; Data curation, Y.L. and X.W.; Writing: original draft, Y.L. and X.W.; Writing: review and editing, B.L. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Changchun Higher Education Institutions Pilot Project Selection Program (Grant No.24GXYSZZ29).

Data Availability Statement

The images used in this study are available at UESD (https://github.com/zhaoyunpeng57/BUAA-UESD33), accessed on 24 June 2025, while the annotation data can be obtained from the corresponding author upon reasonable request, as they are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Giordani, M.; Zorzi, M. Non-terrestrial networks in the 6G era: Challenges and opportunities. IEEE Netw. 2020, 35, 244–251.
2. Montenbruck, O.; Steigenberger, P.; Prange, L.; Deng, Z.; Zhao, Q.; Perosanz, F.; Romero, I.; Noll, C.; Stürze, A.; Weber, G.; et al. The Multi-GNSS Experiment (MGEX) of the International GNSS Service (IGS)—Achievements, prospects and challenges. Adv. Space Res. 2017, 59, 1671–1697.
3. Schmit, T.; Griffith, P.; Gunshor, M.; Daniels, J.; Goodman, S.; Lebair, W. A closer look at the ABI on the goes-r series. Bull. Am. Meteorol. Soc. 2017, 98, 681–698.
4. Zhu, Z.; Wulder, M.A.; Roy, D.P.; Woodcock, C.E.; Hansen, M.C.; Radeloff, V.C.; Healey, S.P.; Schaaf, C.; Hostert, P.; Strobl, P.; et al. Benefits of the free and open Landsat data policy. Remote Sens. Environ. 2019, 224, 382–385.
5. Venkatesan, A.; Lowenthal, J.; Prem, P.; Vidaurri, M. The impact of satellite constellations on space as an ancestral global commons. Nat. Astron. 2020, 4, 1043–1048.
6. Davis, J.P.; Mayberry, J.P.; Penn, J.P. On-orbit servicing: Inspection repair refuel upgrade and assembly of satellites in space. In The Aerospace Corporation, Report; The Aerospace Corporation: Chantilly, VA, USA, 2019; Volume 25.
7. Buckelew, R.; Catalanello, E.; Scacchioli, A. Control of Satellites with Onboard Robotic Manipulators. Aresty Rutgers Undergrad. Res. J. 2021, 1.
8. Peng, J.; Xu, W.; Yuan, H. An efficient pose measurement method of a space non-cooperative target based on stereo vision. IEEE Access 2017, 5, 22344–22362.
9. Canny, J. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 6, 679–698.
10. Ballard, D.H. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognit. 1981, 13, 111–122.
11. Cai, J.; Huang, P.; Chen, L.; Zhang, B. A fast detection method of arbitrary triangles for Tethered Space Robot. In Proceedings of the 2015 IEEE International Conference on Robotics and Biomimetics (ROBIO), Zhuhai, China, 6–9 December 2015; pp. 120–125.
12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
14. Chen, Y.; Gao, J.; Zhang, K. R-CNN-Based Satellite Components Detection in Optical Images. Int. J. Aerosp. Eng. 2020, 2020, 8816187.
15. Cao, Y.; Cheng, X.; Mu, J.; Li, D.; Han, F. Detection method based on image enhancement and an improved faster R-CNN for failed satellite components. IEEE Trans. Instrum. Meas. 2023, 72, 5005213.
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. arXiv 2016, arXiv:1612.08242.
18. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
19. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
20. Jocher, G. ultralytics/yolov5: v7.0; YOLOv5 by Ultralytics; Zenodo: Geneva, Switzerland, 2020.
21. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, Version 8.0.0; Github: San Francisco, CA, USA, 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 August 2025).
22. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer: Milan, Italy, 2024; pp. 1–21.
23. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
24. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, Version 8.3.9; Github: San Francisco, CA, USA, 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 August 2025).
25. Mahendrakar, T.; White, R.T.; Wilde, M.; Kish, B.; Silver, I. Real-time satellite component recognition with YOLO-V5. In Proceedings of the Small Satellite Conference, Virtual, 7–12 August 2021; Volume 4558.
26. Li, C.; Zhao, G.; Gu, D.; Wang, Z. Improved lightweight YOLOv5 using attention mechanism for satellite components recognition. IEEE Sens. J. 2022, 23, 514–526.
27. Tang, Z.; Zhang, W.; Li, J.; Liu, R.; Xu, Y.; Chen, S.; Fang, Z.; Zhao, F. LTSCD-YOLO: A Lightweight Algorithm for Detecting Typical Satellite Components Based on Improved YOLOv8. Remote Sens. 2024, 16, 3101.
28. Zhao, Y.; Zhong, R.; Cui, L. Intelligent recognition of spacecraft components from photorealistic images based on Unreal Engine 4. Adv. Space Res. 2023, 71, 3761–3774.
29. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 9202–9210.
30. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 11863–11874.
31. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
32. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
33. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 14420–14430.
34. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
35. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229.
36. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.

Figure 1. On-orbit servicing project images.

Figure 2. SPPF and C2PSA structure diagram.

Figure 3. SLD-YOLO network structure diagram.

Figure 4. RLNet network structure diagram.

Figure 5. StemBlock network structure diagram.

Figure 6. DARB network structure diagram.

Figure 7. CSP_HSF network structure diagram.

Figure 8. PCFF network structure diagram.

Figure 9. PSConv network structure diagram.

Figure 10. Detection head network structure diagram.

Figure 11. Illustration of satellite image variations: backgrounds, poses, and lighting conditions.

Figure 12. Dataset grayscale processing demonstration.

Figure 13. Label distribution scatter plot.

Figure 14. Dataset annotation examples.

Figure 15. Lightweight metrics 3D bar chart comparison within YOLO series: Other models are not included due to excessive scale differences; refer to Table 5 for complete comparison details.

Figure 16. Confusion matrix comparison results: (a) YOLO11s and (b) SLD-YOLO.

Figure 17. Test set detection result comparison: (a) Original images; (b) YOLO11s; and (c) SLD-YOLO.

Figure 18. Test set heatmap comparison results: (a) Original images; (b) YOLO11s; and (c) SLD-YOLO.

Figure 19. Test set heatmap comparison results: (a) Original images; (b) YOLO11s; and (c) SLD-YOLO.

Table 1. Label quantity statistics.

Label	Panel	Antenna	Instrument	Thruster	Opticpayload
Quantity	16,040	6105	9549	2858	1460

Table 2. Server and environment parameters.

Parameter	Environment Configuration
Operating System	Ubuntu 22.04.3
CPU	AMD EPYC 7K62 48-Core Processor
GPU	RTX 4090 24 GB
Memory	100 GB
Environment Version	PyTorch 2.2.2
IDE	VS Code 1.90.2

Table 3. Ablation experiment results.

Group	RLNet	CSP-HSF	PSConv	SimAM	mAP50/%	mAP50:95/%	GFLOPs/G	Parameters/M	Size/MB
1	-	-	-	-	85.22	61.53	21.3	9.41	18.3
2	√	-	-	-	85.86	61.91	17.1	7.27	14.3
3	-	√	-	-	86.19	62.62	21.5	9.40	18.3
4	-	-	√	-	86.09	62.31	21.1	9.13	17.8
5	-	-	-	√	85.50	62.05	21.3	9.41	18.3
6	√	-	√	-	86.64	62.73	16.8	6.85	13.6
7	√	-	-	√	86.51	62.41	17.0	7.13	14.1
8	√	√	-	-	86.76	62.95	17.3	7.26	14.3
9	√	√	√	-	86.92	63.02	17.1	6.97	13.8
10	√	√	√	√	87.44	63.25	17.1	6.97	13.8

Table 4. Backbone network comparison experiment results.

Group	Precision/%	Recall/%	mAP50/%	mAP50:95/%	GFLOPs/G	Parameters/M	Size/MB
Baseline (CSPDarknet)	88.15	79.51	85.22	61.53	21.3	9.41	18.3
EfficientViT	85.59	73.12	79.14	56.10	14.3	7.28	14.7
FasterNet	87.45	74.71	81.55	58.07	15.6	7.52	14.7
PPHGNetV2	88.07	79.81	84.99	61.24	18.3	7.60	14.9
Ours (RLNet)	88.93	80.20	85.86	61.91	17.1	7.27	14.3

Table 5. Different model comparison experiment results.

Group	Precision/%	Recall/%	mAP50/%	mAP50:95/%	GFLOPs/G	Parameters/M	Size/MB
YOLOv8s	89.30	80.23	85.75	61.80	28.4	11.13	21.5
YOLOv9s	89.40	80.70	86.10	62.60	38.7	9.60	20.3
YOLOv10s	88.86	79.04	84.53	61.20	24.5	7.22	16.6
YOLO11s	88.15	79.51	85.22	61.53	21.3	9.41	18.3
DETR-ResNet50	88.50	82.10	86.20	55.50	95.1	41.56	170
RT-DETR-L	74.89	66.74	72.44	50.90	100.6	28.45	56.4
FCOS	84.10	72.90	80.00	55.40	50.8	32.12	128.4
Ours (SLD-YOLO)	89.46	81.97	87.44	63.25	17.1	6.97	13.8

Table 6. Validation dataset model comparison experiment results.

Group	Precision/%	Recall/%	mAP50/%	mAP50:95/%
YOLO11s	65.11	50.49	55.01	32.52
SLD-YOLO	69.12	54.37	57.98	32.93

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
