Article

M4MLF-YOLO: A Lightweight Semantic Segmentation Framework for Spacecraft Component Recognition

1 Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
2 Innovation Academy for Microsatellites of Chinese Academy of Sciences, Shanghai 201304, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(18), 3144; https://doi.org/10.3390/rs17183144
Submission received: 29 July 2025 / Revised: 28 August 2025 / Accepted: 6 September 2025 / Published: 10 September 2025

Abstract

With the continuous advancement of on-orbit services and space intelligence sensing technologies, the efficient and accurate identification of spacecraft components has become increasingly critical. However, complex lighting conditions, background interference, and limited onboard computing resources present significant challenges to existing segmentation algorithms. To address these challenges, this paper proposes a lightweight spacecraft component segmentation framework for on-orbit applications, termed M4MLF-YOLO. Based on the YOLOv5 architecture, we propose a refined lightweight design strategy that aims to balance segmentation accuracy and resource consumption in satellite-based scenarios. MobileNetV4 is adopted as the backbone network to minimize computational overhead. Additionally, a Multi-Scale Fourier Adaptive Calibration Module (MFAC) is designed to enhance multi-scale feature modeling and boundary discrimination capabilities in the frequency domain. We also introduce a Linear Deformable Convolution (LDConv) to explicitly control the spatial sampling span and distribution of the convolution kernel, thereby linearly adjusting the receptive field coverage range to improve feature extraction capabilities while effectively reducing computational costs. Furthermore, the efficient C3-Faster module is integrated to enhance channel interaction and feature fusion efficiency. A high-quality spacecraft image dataset, comprising both real and synthetic images, was constructed, covering various backgrounds and component types, including solar panels, antennas, payload instruments, thrusters, and optical payloads. Environment-aware preprocessing and enhancement strategies were applied to improve model robustness. Experimental results demonstrate that M4MLF-YOLO achieves excellent segmentation performance while maintaining low model complexity, with precision reaching 95.1% and recall reaching 88.3%, representing improvements of 1.9% and 3.9% over YOLOv5s, respectively. The mAP@0.5 also reached 93.4%. In terms of lightweight design, the model parameter count and computational complexity were reduced by 36.5% and 24.6%, respectively. These results validate that the proposed method significantly enhances deployment efficiency while preserving segmentation accuracy, showcasing promising potential for satellite-based visual perception applications.

1. Introduction

Since the onset of space exploration in the 20th century, space technology has advanced rapidly, resulting in increasingly complex spacecraft structures and a growing number of such vehicles. Numerous spacecraft have been launched into space and operate long-term in harsh space environments, where their critical components are susceptible to aging and damage, significantly affecting the reliability of on-orbit missions [1]. On-orbit services targeting space objects, such as satellites, are becoming a key focus in the development of space technology. A primary prerequisite for these services is the precise identification and localization of target spacecraft components. Satellite component identification serves as the foundation for achieving spacecraft rendezvous and docking [2], on-orbit capture, and maintenance tasks [3]. Satellite components, including thrusters, solar panels, and antennas, must be accurately identified to enhance mission success rates. Spacecraft image segmentation, as a crucial technology in spatial visual perception, aims to achieve precise differentiation between spacecraft and their background environments through pixel-level segmentation. Compared to target detection, image segmentation not only offers higher positioning accuracy but also restores the complete geometric structure of the spacecraft, making it more suitable for high-precision tasks such as on-orbit servicing. However, its technical implementation is more challenging. The unique characteristics of the on-orbit operational environment present the following challenges: (1) Extreme lighting conditions (such as low light and high light overexposure) cause significant degradation of target edge features. (2) Complex and variable backgrounds, including deep space backgrounds, space debris interference, and sensor noise, significantly impact segmentation robustness. (3) Spacecraft structures are complex, with components varying greatly in size, further increasing the difficulty of feature extraction and component segmentation. Additionally, limited onboard computational resources require segmentation algorithms to maintain high computational efficiency while ensuring accuracy. Therefore, spacecraft image segmentation algorithms for on-orbit applications must overcome the limitations of traditional methods, maintain stable performance under complex imaging conditions, and efficiently perform key tasks such as multi-scale feature fusion, noise suppression, and lighting adaptation to achieve efficient, accurate, and robust component identification under resource-constrained conditions.
Currently, spacecraft component segmentation methods can be broadly categorized into two main types: traditional algorithms and deep learning-based algorithms. Traditional algorithms typically rely on manually designed feature extraction or mathematical modeling tailored to specific targets. While these methods can achieve certain results in controlled environments, their limitations become increasingly evident in complex or dynamic scenarios. Especially when faced with changes in lighting, pose, or complex background interference, the heavy reliance of traditional algorithms on manually designed features results in insufficient robustness and adaptability. For example, geometric shape extraction techniques such as edge detection [4] and Hough transform [5,6] perform well in simple scenes but are often affected by noise interference in high-resolution or complex environments, leading to unstable segmentation results. Additionally, these methods often require extensive parameter tuning, resulting in high computational overhead, making them difficult to meet the real-time and generalization requirements of practical applications. Traditional algorithms are particularly limited in handling dynamic backgrounds (such as Earth surface textures, cloud changes, and lighting fluctuations), further restricting their engineering application value.
In contrast, segmentation methods based on deep learning learn complex features through large-scale data, exhibiting stronger robustness and generalization capabilities, and performing well in variable spatial environments. Deep learning not only avoids the cumbersome process of manually designing features but also adapts to different types of spacecraft components and imaging conditions through end-to-end training, effectively improving segmentation accuracy under varying lighting conditions, attitude changes, and complex background interference. As a result, such methods have gradually become the mainstream technical solution for spacecraft component identification and segmentation.
However, deep learning algorithms also face specific constraints in the space environment, such as limited storage space and scarce computing resources. This makes it difficult to deploy deep learning models with a large number of parameters and high computing requirements in practical applications, especially in scenarios that require real-time performance. High-precision, high-complexity network models cannot meet the strict requirements of space missions for computing efficiency and resource utilization. This poses new challenges for deep learning-based spacecraft component segmentation algorithms, requiring models to ensure high accuracy and high-speed segmentation performance while featuring lightweight structural designs that significantly reduce parameter scale and computational costs to adapt to the limited hardware conditions in space scenarios. This demand has driven research into efficient, compact algorithms tailored for spacecraft applications, becoming an important direction for the continued development of deep learning technology in the aerospace field.
To address the aforementioned issues, this paper proposes an efficient and lightweight spacecraft component segmentation network framework aimed at improving real-time performance without compromising segmentation accuracy while reducing the number of model parameters and computational complexity. The main contributions include the following:
1.
A lightweight backbone network based on MobileNetV4, which reduces parameters and computational complexity through structural optimization, making it suitable for resource-constrained space environments while maintaining good segmentation performance.
2.
Linear Deformable Convolution, which can adaptively adjust sampling point density and distribution based on regional characteristics, enhancing feature extraction capabilities in complex regions while reducing redundant computations in flat backgrounds.
3.
Multi-Scale Fourier Adaptive Calibration Module (MFAC), which leverages frequency domain characteristics and multi-scale fusion strategies to improve boundary segmentation accuracy and background suppression capabilities for spacecraft components.
4.
C3-Faster module, which performs lightweight reconstruction of the original C3 structure, introduces efficient residual units based on FasterNet, and adopts a partial convolution strategy to perform spatial modeling on only some channels, effectively reducing the number of parameters and computational load. This module significantly improves inference efficiency and edge deployment adaptability while retaining channel interaction and multi-scale fusion capabilities.

2. Related Work

2.1. Semantic Segmentation Based on Deep Learning

Semantic segmentation is a prominent research topic in the field of computer vision, with widespread applications in areas such as space object docking, space debris removal, and on-orbit servicing. Segmenting spacecraft components not only achieves precise differentiation of the target from the background but also requires fine-grained segmentation and identification of key components within the target structure. Traditional methods for satellite component segmentation primarily depend on classical image processing and feature extraction techniques. For instance, the Canny operator [7], Sobel operator [8], and Laplacian operator [9] are common edge detection methods used to extract target contours and enhance boundary information, while the Hough transform [10] is widely applied to detect components with regular geometric structures, such as the linear configuration of solar panels or the circular contour of antennas. These methods achieve preliminary segmentation of the target area by analyzing edges, shapes, and texture features of the image, and they are effective in scenarios where the prior conditions are clear and background interference is minimal. However, spatial remote sensing images often contain significant noise interference, large variations in target scale, and complex backgrounds, which can easily cause traditional methods to produce mis-segmentation and missed segmentation when dealing with weak textures, blurred edges, or low-contrast targets. Additionally, these methods lack adaptive modeling capabilities for varying task scenarios and target characteristics, often requiring manual parameter settings and experience-based tuning, complicating their universal deployment across multiple satellite platforms and complex operational conditions. This significantly limits the practicality and scalability of traditional algorithms in real-world on-orbit applications. With the advancement of computer vision, deep learning has been integrated into spatial scenes in recent years, and deep learning-based object segmentation algorithms have made significant progress. These algorithms are widely applied in satellite identification, pose estimation, and component segmentation. Based on whether a preprocessing step for selecting candidate regions is present, such algorithms can be categorized into two-stage and one-stage object segmentation algorithms. Two-stage network algorithms first generate candidate regions, followed by object classification and location regression. Representative algorithms in this category include R-CNN [11], SPPNet [12], Fast R-CNN [13], and Cascade R-CNN [14]. Although these algorithms demonstrate excellent accuracy, their relatively slow processing speed limits their real-time segmentation capabilities on devices. In contrast, one-stage network algorithms directly extract features within the network to predict object categories and locations. Notable networks in this category include SSD [15] and the YOLO series [16,17,18,19,20,21], which have achieved commendable performance on natural image datasets such as MS COCO [22], ADE20K [23], and Cityscapes [24]. Both categories of methods have their own advantages and have been widely applied in various practical scenarios. Among them, YOLO, as the earliest one-stage object segmentation algorithm with practical application value, has gained widespread adoption. The current YOLOv5 [25] achieves a favorable balance between accuracy and speed.
However, when applied to spacecraft component segmentation tasks with complex backgrounds, its performance still encounters challenges. Chen et al. [26] proposed a lightweight instance segmentation model based on Mask R-CNN for identifying components such as the main body, solar panels, and antennas in satellite images. This method reduces model size and enhances computational efficiency by introducing depthwise separable convolutions and simplifying the residual structure. Nevertheless, it still adopts a two-stage segmentation framework, resulting in a complex overall structure that fails to meet the real-time and deployment efficiency requirements of space-borne scenarios. Wang et al. [27] proposed a Faster R-CNN spacecraft component detection algorithm based on the RegNet backbone network to enhance the detection capability of small targets in low Earth orbit spacecraft images. This method significantly improved the retention of critical information during sampling, resulting in an mAP improvement of over 26% across multiple datasets. However, it still utilizes a two-stage segmentation structure, leading to high computational overhead and limited real-time performance. Liu et al. [28] proposed a spacecraft image segmentation network based on DeepLabv3+, integrating dilated convolutions, channel attention mechanisms, and atrous spatial pyramid pooling (ASPP) to effectively enhance context modeling capabilities and improve the completeness and segmentation accuracy of spacecraft masks. However, this method relies heavily on network structure and has high training complexity, making it unsuitable for scenarios with stringent real-time and deployment requirements. Guo et al. [29] released a semi-real spacecraft payload segmentation dataset (SSP), constructed by fusing real spacecraft model images with NASA space images, and proposed a segmentation network SPSNet based on an anti-pyramid-structured decoder, achieving performance superior to existing methods on this dataset. This method demonstrates outstanding segmentation accuracy, but still faces issues such as limited image channel diversity and constrained skip connections, indicating room for improvement in scalability. Zhao et al. [30] constructed the Unreal Engine Satellite Dataset (UESD) by building a space simulation environment based on Unreal Engine 4, generating high-quality image data containing various attitudes and Earth backgrounds for spacecraft component recognition tasks. This dataset exhibits high realism and practicality; however, the types of satellites covered are limited, and the generalization ability of the models requires further enhancement. Cao et al. [31] collected real images using a hardware-in-the-loop system and constructed the Spacecraft-DS dataset. It encompasses complex factors in real space environments and is suitable for various visual tasks such as detection and segmentation. Nevertheless, the proposed benchmark method still has limitations, including a large model size and deployment challenges. Recent studies have shown that incorporating self-attention mechanisms can significantly enhance model robustness and generalization. For example, the FCIHMRT network [32] integrates a dual-branch feature extraction structure based on Res2Net and Vision Transformer, along with a cross-layer interactive attention mechanism, to achieve efficient fusion of multi-scale features.
This approach has demonstrated superior classification accuracy on several public datasets, indicating that self-attention-based feature interaction strategies can provide valuable insights for improving spacecraft component segmentation and recognition in complex scenarios. Liu et al. [33] proposed a multi-scale adaptive spatial feature fusion network (MASFFN), achieving significant progress in edge modeling and feature extraction of spacecraft components, thereby improving recognition accuracy, and performing excellently on the UESD and URSO [34] datasets. However, the method’s adaptability in complex space environments with strong light interference remains limited, making it challenging to meet the requirements of real on-orbit applications. This presents research opportunities for further enhancing robustness and recognition capabilities under complex conditions.

2.2. Model Lightweighting Technology

Current lightweight neural network technologies can be broadly classified into two primary types: model compression and compact network design. Model compression methods aim to reduce model size and computational load while maintaining model accuracy as much as possible. Common techniques include pruning, quantization, and knowledge distillation. Pruning simplifies network structure by removing redundant weights, channels, or convolutional kernels; quantization replaces floating-point parameters with low-bit-width representations, such as 8-bit or binary formats, significantly reducing storage and computational costs; knowledge distillation transfers knowledge from a large teacher model to a smaller student model, thereby enhancing the latter’s performance while decreasing training and deployment expenses. In contrast, compact network architectures concentrate on the model structure itself, designing more efficient computational units and connection methods to achieve end-to-end lightweight design. Notable lightweight models such as the MobileNet series [35,36,37] introduce depthwise separable convolutions and inverted residual structures, effectively reducing model complexity. ShuffleNet [38,39] enhances cross-group information exchange capabilities through group convolutions and channel reordering operations; EfficientNet [40,41] systematically optimizes model width, depth, and resolution based on composite scaling strategies. Other architectures, such as ShiftNet [42] and AdderNet [43], further compress computational resource consumption by replacing multiplication with feature shifting and addition, respectively. It is crucial to note that the number of parameters and FLOPs metrics do not always accurately reflect a model’s actual inference speed. Some lightweight models, despite having lower theoretical computational requirements, still encounter efficiency bottlenecks during hardware deployment due to issues such as frequent memory access. Consequently, recent research has also focused on deployment-friendly structural reparameterization strategies and hardware-aware neural architecture search (NAS) methods, aiming to achieve a better balance between model performance, resource utilization, and platform adaptability. Overall, the continuous evolution of lightweight technologies has established a solid foundation for the efficient deployment of neural networks in edge computing and real-time scenarios.

3. Methods

3.1. M4MLF-YOLO Network Architecture

This paper presents a lightweight semantic segmentation network architecture designed for spacecraft component recognition tasks (as illustrated in Figure 1). The architecture comprises three main components: a newly designed backbone network, a novel feature fusion enhancement module (referred to as the Neck), and a multi-scale prediction branch. The primary objective of this architecture is to achieve accurate segmentation and real-time recognition of spacecraft components in complex backgrounds with limited computational resources. (1) To simultaneously address the requirements for high precision and lightweight design in space missions, this paper constructs an efficient backbone network. The front end of the network consists of multiple ConvBN layers that extract image visual features progressively. Additionally, a lightweight inverted residual module, termed UIB (Universal Inverted Bottleneck), is introduced to significantly reduce the number of parameters while maintaining feature modeling capabilities. The tail end of the backbone network connects to the proposed MFAC (Multi-scale Fourier Adaptive Calibration Module), which enhances edge response and scale perception through a frequency domain enhancement mechanism, thereby improving the network’s ability to distinguish component edges in complex images. This backbone network completely replaces the original CSPDarkNet architecture in YOLOv5, significantly enhancing structural robustness and semantic expression while reducing model size and inference costs. (2) The Neck network employs a top-down and bottom-up multi-scale fusion strategy, undergoing structural lightweight modifications to enhance cross-scale feature expression capabilities. This paper introduces the C3-Faster module and the LDConv (Linear Deformable Convolution) structure in this section. The C3-Faster module improves feature aggregation efficiency while reducing computational burden through residual channel interaction mechanisms. LDConv, as a key module replacing traditional convolutions, supports arbitrary numbers of sampling points and non-regular sampling shapes, with linear parameter scalability, enabling better adaptation to spacecraft components with large scale variations, blurred boundaries, or irregular shapes, such as antennas and solar panels. While maintaining model lightweightness, LDConv enhances the network’s ability to capture geometrically varying targets and spatial feature adaptability, resulting in more accurate and stable performance when segmenting fine-grained structures. (3) Given that the main contributions of this study focus on improvements to the feature extraction and fusion modules, this paper retains the original segmentation head structure of YOLOv5 to ensure that the evaluation of the improved components remains unaffected. This structure has been validated to provide good real-time segmentation performance and can efficiently generate multi-scale semantic segmentation results. Consequently, during the model prediction stage, the original multi-scale mask coefficient prediction mechanism of YOLOv5 is employed for component-level mask restoration. This configuration ensures the integrity and fairness of the network’s end-to-end segmentation capabilities without introducing additional complexity. The final output is a segmentation map used for fine-grained semantic segmentation of critical components, such as antennas and solar panels on spacecraft. 
This structure not only enhances the model’s segmentation accuracy for targets of varying scales but also ensures real-time performance when deployed on edge devices.

3.2. Backbone Network Improvement

The YOLOv5 backbone network extracts multi-scale information through various convolutional and pooling operations. While it effectively manages multiple targets and complex backgrounds, its deep network structure incurs high computational overhead, which limits its deployment on embedded or low-computational-capability devices. In contrast, MobileNetV4 is an efficient, lightweight neural network architecture specifically designed for devices with limited computational resources, effectively reducing computational load while maintaining high feature extraction capabilities. MobileNetV4 achieves efficient inference and feature extraction across different hardware platforms by integrating the UIB (Universal Inverted Bottleneck) module and the Mobile MQA (Multi-Query Attention) mechanism. The UIB module reduces computational load and enhances computational efficiency by combining depthwise separable convolutions with pointwise convolutions. Additionally, the two optional depthwise convolutions introduced in the inverted bottleneck structure further improve its flexibility and efficiency, adapting to the requirements of low-power devices. As illustrated in Figure 2, this module structure effectively retains the ability to express key information while ensuring a lightweight design.
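As a rough illustration of the UIB structure described above, the PyTorch sketch below builds an inverted bottleneck with two optional depthwise convolutions placed before and after the channel expansion. The expansion ratio, activation choice, and flag names are assumptions for demonstration, not MobileNetV4’s exact configuration.

```python
import torch
import torch.nn as nn

class UIBBlock(nn.Module):
    """Sketch of a Universal Inverted Bottleneck (UIB)-style block: an inverted
    bottleneck with two *optional* depthwise convolutions. Hyperparameters and
    naming are illustrative, not MobileNetV4's exact design."""
    def __init__(self, c_in, c_out, expand=4, dw_before=True, dw_after=True):
        super().__init__()
        c_mid = c_in * expand
        layers = []
        if dw_before:   # optional depthwise conv on the input features
            layers += [nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),
                       nn.BatchNorm2d(c_in)]
        layers += [nn.Conv2d(c_in, c_mid, 1, bias=False),        # pointwise expansion
                   nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True)]
        if dw_after:    # optional depthwise conv in the expanded space
            layers += [nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid, bias=False),
                       nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True)]
        layers += [nn.Conv2d(c_mid, c_out, 1, bias=False),       # linear projection
                   nn.BatchNorm2d(c_out)]
        self.block = nn.Sequential(*layers)
        self.use_res = c_in == c_out

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y

# usage: UIBBlock(64, 64)(torch.randn(1, 64, 80, 80))
```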
To enhance attention modeling capabilities, MobileNetV4 integrates the Mobile MQA (Mobile-friendly Multi-Query Attention) mechanism. This module significantly reduces computational load and memory access overhead by sharing keys and values across all attention heads while preserving critical spatial resolution information through asymmetric spatial subsampling. This approach improved inference efficiency while maintaining high performance. Compared to traditional multi-head self-attention [44] (MHSA), Mobile MQA achieves significant acceleration without compromising modeling capabilities. The core calculation formula of Mobile MQA can be expressed as follows:
$$\mathrm{MobileMQA}(X) = \mathrm{Concat}\left(\mathrm{Attention}_{1}, \ldots, \mathrm{Attention}_{n}\right) W^{O} \tag{1}$$
$$\mathrm{Attention}_{j} = \mathrm{Softmax}\!\left(\frac{X W^{Q_{j}} \left(\mathrm{SR}(X)\, W^{K}\right)^{T}}{\sqrt{d_{k}}}\right) \mathrm{SR}(X)\, W^{V} \tag{2}$$
In this context, SR refers to the spatial downsampling operation, which in our design is implemented as a depthwise separable convolution (DW) with a stride of 2. The matrices $W^{Q_{j}}$, $W^{K}$, and $W^{V}$ are the query, key, and value weight matrices, respectively, and $d_{k}$ is the dimension of the key vectors.
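To make the shared key/value and spatial-reduction idea concrete, the following PyTorch sketch implements a Mobile-MQA-style block in which all query heads attend to a single key/value pair computed from a stride-2 depthwise-downsampled feature map. Head counts, dimensions, and layer names are illustrative assumptions rather than MobileNetV4’s exact implementation.

```python
import torch
import torch.nn as nn

class MobileMQA(nn.Module):
    """Sketch of mobile-friendly Multi-Query Attention: every head has its own
    query projection, but key and value are shared, and the K/V path is
    spatially reduced (SR) with a stride-2 depthwise conv."""
    def __init__(self, dim, num_heads=4, head_dim=32):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q = nn.Conv2d(dim, num_heads * head_dim, 1, bias=False)
        self.kv = nn.Conv2d(dim, 2 * head_dim, 1, bias=False)        # one shared K and V
        self.sr = nn.Sequential(                                      # spatial reduction (SR)
            nn.Conv2d(dim, dim, 3, stride=2, padding=1, groups=dim, bias=False),
            nn.BatchNorm2d(dim))
        self.proj = nn.Conv2d(num_heads * head_dim, dim, 1, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).reshape(b, self.num_heads, self.head_dim, h * w)   # B,n,d,HW
        k, v = self.kv(self.sr(x)).chunk(2, dim=1)                        # B,d,H/2,W/2 each
        k = k.flatten(2)                                                  # B,d,M
        v = v.flatten(2).transpose(1, 2)                                  # B,M,d
        attn = torch.einsum('bndl,bdm->bnlm', q, k) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bnlm,bmd->bnld', attn, v)                     # B,n,HW,d
        out = out.permute(0, 1, 3, 2).reshape(b, -1, h, w)
        return self.proj(out)

# usage: MobileMQA(96)(torch.randn(1, 96, 40, 40))
```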

3.3. MFACM

In the task of spacecraft recognition, traditional Spatial Pyramid Pooling (SPP) and its accelerated version SPPF, while effective in various computer vision applications, exhibit significant limitations when processing spacecraft images. Figure 3 shows the module structures of the two. SPPF primarily relies on multi-scale region pooling to generate fixed-length feature representations, addressing the issue of inconsistent input sizes. However, spacecraft images often suffer from edge blurriness, noise interference, and geometric diversity of components, which hampers coarse-grained pooling’s ability to effectively capture local detail features, resulting in decreased recognition accuracy. Furthermore, SPPF demonstrates limited performance in multi-scale feature fusion, struggling to achieve an effective balance between global consistency and local refinement, both of which are crucial for the precise segmentation of spacecraft components.
To address these shortcomings, we explored a key direction in architectural design: multi-dimensional feature modeling. Recent research has shown that attention mechanisms can selectively emphasize task-relevant features while suppressing redundant information in the spatial domain. Additionally, frequency domain modeling has proven advantageous in tasks such as image restoration by supplementing global high-frequency information. In particular, frequency domain features have garnered significant attention in the fields of deep learning and image processing. Fast Fourier Convolution (FFC) [45] first demonstrated the potential for global feature modeling based on the Fast Fourier Transform. Qiu et al. [46] extended self-attention to the spatiotemporal frequency domain to enhance video super-resolution performance, while Kong et al. [47] proposed a frequency domain-based self-attention method for image deblurring, and Zhou et al. [48] utilized Fourier transforms as image degradation priors to construct new global modeling networks. Sun et al. [49] proposed FCB for MRI reconstruction, integrating depthwise separable design and re-parameterization to effectively enlarge the receptive field and improve reconstruction quality. Suvorov et al. [50] introduced LaMa, a one-stage inpainting framework that leverages Fast Fourier Convolution (FFC) to achieve a global receptive field, combined with a high-receptive-field perceptual loss and an aggressive mask generation strategy, significantly enhancing robustness to large missing regions while reducing parameters and inference costs. These studies indicate that frequency domain features possess unique advantages in capturing global structural information and high-frequency details. However, existing research has primarily focused on image restoration tasks, with a notable lack of systematic exploration in image segmentation, particularly in complex scenarios such as spacecraft component segmentation.
Based on these observations, we propose a Multi-scale Fourier Adaptive Calibration (MFAC) module. This module integrates two complementary feature modeling strategies: in the spatial domain branch, we employ Multi-Scale Spatial Pyramid Attention (MSPA) [51] to simultaneously model global structural regularization and local structural information through a dual pooling strategy of AAP(global) and AAP(local), thereby achieving cross-scale dependencies and long-range channel interactions with low overhead. In the frequency domain branch, we introduce Frequency Domain Features (FDFs) [52], utilizing Fourier transforms to supplement global high-frequency information and enhance boundary modeling. Finally, in the calibration stage, MFAC adaptively fuses spatial and frequency features to achieve a dynamic balance between multi-scale spatial features and global frequency domain features. The overall architecture of the MFAC module is illustrated in Figure 4. Through this design, MFAC provides more robust and detailed feature representations under challenging conditions such as noise interference, edge blurriness, and geometric diversity, offering a novel and effective solution for spacecraft component segmentation.
The MSPA module primarily emphasizes feature highlighting in the spatial dimension. Utilizing a multi-scale pyramid structure, MSPA dynamically adjusts the weights of features at various scales based on the local and global semantic information of the image. This capability enables the model to concentrate on different regions of the image, particularly distinguishing between complex spacecraft components and their backgrounds. MSPA effectively enhances key information regions while suppressing interference from irrelevant backgrounds. Specifically, the MSPA module facilitates the network’s ability to learn meaningful spatial information at each scale by applying weights across different spatial levels. For instance, when spacecraft components, such as solar panels and antennas, are positioned at varying spatial locations, MSPA enhances the model’s capacity to focus on these critical local regions by weighting features at different scales. Through this spatial attention mechanism, MSPA captures fine-grained local details and integrates them with global structural information, thereby improving robustness and adaptability in segmentation tasks.
Unlike the Multi-Scale Pyramid Attention (MSPA) module, which emphasizes the spatial dimension, the Frequency Domain Features (FDF) module focuses on feature enhancement in the frequency dimension. FDF employs the Fourier transform to convert images from the spatial domain to the frequency domain, processing both low-frequency and high-frequency information. Low-frequency information captures the global shape of the target in the image, while high-frequency information emphasizes the details and edges, ensuring clear target boundaries. The primary advantage of the FDF module lies in its ability to effectively integrate the complementary nature of low-frequency and high-frequency information. Low-frequency information provides the overall structure of the image, whereas high-frequency information enhances the details and edge features. Particularly when addressing blurred boundaries or noise interference, FDF improves edge clarity while maintaining the consistency of the global structure. Through this frequency domain optimization, FDF significantly enhances the precise segmentation capability of spacecraft components, demonstrating notable advantages when processing images with complex details and high background noise.
By combining the MSPA and FDF modules, the Multi-Scale Fourier Adaptive Calibration (MFAC) module optimizes feature extraction in both the spatial and frequency dimensions. In the spatial dimension, MSPA employs multi-scale weighting to help the model better capture local details and reduce background noise interference. In the frequency dimension, FDF enhances the complementary interaction between low-frequency and high-frequency information, improving edge detection and global structure extraction capabilities. The two modules complement each other, jointly enhancing the segmentation accuracy of spacecraft images in complex backgrounds. By leveraging the strengths of both the MSPA and FDF modules, the MFAC module significantly improves the accuracy and robustness of image feature extraction. In spacecraft image segmentation tasks, MFAC not only addresses the limitations of SPPF in detail capture, noise processing, and geometric diversity but also enhances computational efficiency and task adaptability, making it an effective and precise solution.
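The sketch below illustrates how such a spatial-plus-frequency design can be wired up in PyTorch: a pooled-attention spatial branch, a Fourier-domain filtering branch, and a learned calibration gate that balances the two. It is a minimal illustration of the idea under stated assumptions (pooling sizes, 1 × 1 filters, gating form), not the authors’ exact MFAC module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFACSketch(nn.Module):
    """Rough sketch of the MFAC idea: MSPA-style global/local pooled attention
    in the spatial branch, FDF-style Fourier-domain filtering in the frequency
    branch, and an adaptive calibration gate fusing both."""
    def __init__(self, channels, local_size=4):
        super().__init__()
        self.local_size = local_size
        # spatial branch: global + local pooled descriptors -> channel weights
        self.fc = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        # frequency branch: 1x1 conv over real/imag parts of the spectrum
        self.freq = nn.Conv2d(2 * channels, 2 * channels, 1)
        # calibration gate fusing the two branches
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        # --- spatial branch (multi-scale pooling attention) ---
        g = F.adaptive_avg_pool2d(x, 1)                                    # global context
        l = F.adaptive_avg_pool2d(x, self.local_size).mean((2, 3), keepdim=True)  # local context
        spat = x * self.fc(torch.cat((g, l), dim=1))
        # --- frequency branch (Fourier-domain filtering) ---
        spec = torch.fft.rfft2(x, norm='ortho')
        spec = self.freq(torch.cat((spec.real, spec.imag), dim=1))
        re, im = spec.chunk(2, dim=1)
        freq = torch.fft.irfft2(torch.complex(re, im), s=(h, w), norm='ortho')
        # --- adaptive calibration of the two feature streams ---
        a = self.gate(torch.cat((spat, freq), dim=1))
        return a * spat + (1 - a) * freq

# usage: MFACSketch(128)(torch.randn(1, 128, 20, 20))
```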

3.4. LDConv

In spacecraft component identification tasks, traditional modeling methods based on standard convolution encounter numerous challenges due to significant variations in component shapes, extensive scale ranges, and frequent changes in orientation. Standard convolution operations typically rely on fixed sampling positions using odd-numbered square kernels (such as 1 × 1, 3 × 3, 5 × 5, and 7 × 7), which have limited receptive fields confined to local windows and cannot dynamically adjust their sampling structures according to target shapes. This limitation hinders the ability of convolutions to fully capture the edges and geometric features of complex or asymmetric components, such as slender antennas or tilted solar panels. Furthermore, increasing the size of the convolution kernel to enhance the receptive field results in a quadratic increase in the number of parameters and computational complexity, severely restricting the model’s operational efficiency and feasibility for space-based deployment. Although deformable convolution (DCN) [53] partially addresses the issue of fixed sampling positions by introducing offsets to improve adaptability to complex deformations, its parameters and computational complexity also grow quadratically, making it challenging to satisfy the dual requirements of lightweight design and real-time performance on resource-constrained platforms.
To address these challenges, this paper introduces a structurally flexible and computationally efficient linear deformable convolution [54] (LDConv) to replace the standard convolution structure in the neck network. LDConv features a flexible kernel design that allows for arbitrary parameter counts and custom sampling patterns, enabling the model to balance computational overhead and feature expression capabilities according to specific task requirements. With appropriate parameter selection, LDConv can effectively reduce the model’s parameter size while maintaining computational efficiency. Its core concept involves generating initial sampling coordinates based on rule structures or prior knowledge, predicting offsets through lightweight convolution branches, and dynamically adjusting each sampling position. This approach allows the convolution kernel to adaptively cover target regions of varying shapes and scales. Consequently, LDConv transforms the quadratic growth trend of parameters in standard and deformable convolutions into a linear growth pattern, thereby reducing the computational cost of convolution operations. This results in enhanced receptive fields and structural adaptability without significantly increasing computational costs. The structure of LDConv is illustrated in Figure 5.
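A minimal sketch of the linear deformable sampling idea is given below: an arbitrary number of sampling points, a fixed initial layout, a lightweight branch that predicts per-pixel offsets, bilinear gathering of the shifted samples, and a 1 × 1 fusion whose cost grows linearly with the point count. The initial layout and fusion details are illustrative assumptions, not the published LDConv implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDConvSketch(nn.Module):
    """Illustration of linear deformable sampling: N sampling points (any
    integer, not k*k), offsets predicted by a light conv branch, bilinear
    gathering, and a 1x1 fusion with linearly growing parameters."""
    def __init__(self, c_in, c_out, num_points=5):
        super().__init__()
        self.num_points = num_points
        self.offset = nn.Conv2d(c_in, 2 * num_points, 3, padding=1)   # (dx, dy) per point
        self.fuse = nn.Conv2d(c_in * num_points, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        # simple initial layout: points spread along a short horizontal line
        self.register_buffer('init_x', torch.linspace(-1.0, 1.0, num_points))

    def forward(self, x):
        b, c, h, w = x.shape
        off = self.offset(x).view(b, self.num_points, 2, h, w)
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                                torch.arange(w, device=x.device), indexing='ij')
        base = torch.stack((xs, ys), dim=0).float()                    # 2,H,W
        samples = []
        for p in range(self.num_points):
            px = base[0] + self.init_x[p] + off[:, p, 0]               # B,H,W
            py = base[1] + off[:, p, 1]
            grid = torch.stack((2 * px / (w - 1) - 1, 2 * py / (h - 1) - 1), dim=-1)
            samples.append(F.grid_sample(x, grid, align_corners=True)) # bilinear gather
        return F.relu(self.bn(self.fuse(torch.cat(samples, dim=1))))

# usage: LDConvSketch(64, 128)(torch.randn(1, 64, 40, 40))
```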
The experimental results indicate that the introduction of LDConv significantly enhances the model’s segmentation accuracy for spacecraft components with complex structures or varying scales, such as solar panels and antennas. This improvement in accuracy is attributed to LDConv’s capacity to flexibly adjust sampling positions and quantities, allowing the model to more accurately extract key geometric features of irregularly shaped, large-scale, or blurred-edge targets while maintaining stronger spatial adaptability with lower computational overhead. These findings validate LDConv’s practical application and deployment potential in real-world space vision tasks.

3.5. C3-Faster

In YOLOv5, the C3 module is a crucial component of both the backbone network and the neck structure. Its core utilizes a Bottleneck architecture that integrates input features with results extracted from the main path through skip connections. This design enhances cross-layer information transmission and gradient propagation, effectively alleviating the vanishing gradient problem and improving the expressive power of deep features. The structure typically consists of multiple stacked 1 × 1 and 3 × 3 convolutions, providing strong feature modeling capabilities. However, in scenarios involving high-resolution inputs or deeper networks, the multiple standard convolutional operations within the Bottleneck architecture lead to a significant increase in the number of parameters and computational complexity, resulting in reduced inference speed and posing challenges for model deployment on edge devices. This has become the primary bottleneck constraining the inference efficiency and adaptability of YOLOv5 for edge deployment. To address these issues, this paper proposes a structural modification to the C3 module, introducing a design inspired by the efficient residual units in FasterNet [55], resulting in a lightweight alternative module named C3-Faster. This module employs FasterBlock to replace the traditional Bottleneck, facilitating spatial feature modeling and feature fusion with lower computational complexity. It enhances the overall operational efficiency of the network and improves deployment feasibility without significantly affecting accuracy. Its structure is illustrated in Figure 6. The core design of FasterBlock is based on the Partial Convolution (PConv) strategy. Unlike standard convolution, where all input channels participate in the computation, PConv performs spatial convolution operations on only a subset of the input feature map channels, while the remaining channels remain unchanged and directly contribute to the residual connection. This approach to structural sparsification achieves a balance between spatial modeling capabilities and channel retention efficiency, effectively compressing redundant computations. Specifically, let the number of input channels be c, the number of output channels be c′, and the convolution kernel size be k × k. Then, the number of parameters in standard convolution is:
$$\mathrm{Params}_{\mathrm{Conv}} = k^{2} \cdot c \cdot c' \tag{3}$$
In PConv, if only r·c channels participate in convolution (where r ∈ (0, 1], for example r = 1/4), the number of parameters is reduced to:
$$\mathrm{Params}_{\mathrm{PConv}} = k^{2} \cdot (r c) \cdot c' \tag{4}$$
Similarly, assuming that the spatial dimensions of the input feature map are h × w, the FLOPs can be calculated using the following formula:
$$\mathrm{FLOPs}_{\mathrm{Conv}} = h \cdot w \cdot k^{2} \cdot c \cdot c' \tag{5}$$
$$\mathrm{FLOPs}_{\mathrm{PConv}} = h \cdot w \cdot k^{2} \cdot (r c) \cdot c' \tag{6}$$
Therefore, when r = 1/4, the number of parameters and the computational complexity of PConv can be reduced to approximately 25% of those associated with standard convolution, providing a significant advantage in computational efficiency.
Additionally, the sparse nature of PConv reduces memory access pressure. In traditional convolutions, all channels require data read/write operations, whereas PConv accesses only a subset of the channels, effectively reducing bandwidth usage and on-chip cache pressure. This characteristic makes it particularly suitable for I/O-constrained or power-sensitive embedded platforms and spaceflight computing environments. The C3-Faster module structurally inherits the dual-branch design of the original C3. One path uses 1 × 1 convolution for channel compression and transformation of the input, while the other path consists of stacked FasterBlocks to extract local spatial features; subsequently, the outputs from both paths are concatenated with the input features from the skip connections, fusing multi-level information. The resulting feature map is then unified in channel dimension via a 1 × 1 convolution, generating an output feature map with rich semantic representation. This design not only enhances channel-to-channel interaction and multi-scale feature fusion capabilities but also exhibits excellent structural flexibility. The C3-Faster module supports flexible configuration of the number of stacked FasterBlocks, allowing developers to dynamically adjust the module depth and channel count based on specific task requirements, accuracy goals, and platform resource constraints to achieve a balance between performance and efficiency. Furthermore, since the module’s input and output interfaces are fully compatible with the original C3, seamless replacement and integration can be achieved without altering the overall network topology. Experiments have demonstrated that this module significantly reduces model complexity while maintaining excellent feature representation capabilities across multiple downstream tasks, making it particularly suitable for computationally sensitive scenarios such as edge deployment and satellite-based vision.
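The following PyTorch sketch summarizes how PConv, a FasterNet-style residual unit, and the two-branch C3-Faster wrapper fit together. Channel splits, expansion ratios, and normalization choices are illustrative assumptions rather than the exact module used in M4MLF-YOLO.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 spatial conv on only a fraction r of the
    channels; the remaining channels pass through untouched (r = 1/4 here,
    matching the setting discussed above)."""
    def __init__(self, channels, r=0.25):
        super().__init__()
        self.cp = max(1, int(channels * r))                     # channels that get convolved
        self.conv = nn.Conv2d(self.cp, self.cp, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterBlock(nn.Module):
    """FasterNet-style residual unit: PConv for spatial mixing followed by two
    1x1 convs (pointwise MLP) with a residual connection. Details are illustrative."""
    def __init__(self, channels, expand=2):
        super().__init__()
        mid = channels * expand
        self.pconv = PConv(channels)
        self.mlp = nn.Sequential(nn.Conv2d(channels, mid, 1, bias=False),
                                 nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                                 nn.Conv2d(mid, channels, 1, bias=False))

    def forward(self, x):
        return x + self.mlp(self.pconv(x))

class C3Faster(nn.Module):
    """C3-style two-branch block with stacked FasterBlocks replacing the Bottlenecks."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_half, 1, bias=False)       # main path: compress then model
        self.cv2 = nn.Conv2d(c_in, c_half, 1, bias=False)       # skip path: compress only
        self.m = nn.Sequential(*[FasterBlock(c_half) for _ in range(n)])
        self.cv3 = nn.Conv2d(2 * c_half, c_out, 1, bias=False)  # fuse both paths

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

# usage: C3Faster(128, 128, n=2)(torch.randn(1, 128, 40, 40))
```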

4. Experimental Results and Analyses

4.1. Dataset Construction

To address the demand for high-quality samples in spacecraft component segmentation tasks, a comprehensive and meticulously annotated satellite image dataset has been constructed. The data sources include the synthetic dataset UESD published by Zhao et al., the semi-realistic dataset SSP proposed by Guo et al., and several real satellite images collected from public channels. All images were annotated at the pixel level using the LabelMe 5.5 tool to ensure data quality and consistency in annotations. The dataset comprises 10,000 images, annotated across five categories of spacecraft components: solar panels, communication antennas, thruster nozzles, optical remote sensing payloads, and instruments. Each component category includes clear boundary information and precise instance-level segmentation. The dataset is divided into training, validation, and test sets in a 7:1:2 ratio to ensure stable training performance and objective evaluation results.
As shown in Figure 7, several example images from the dataset are presented. To address common challenges in satellite imaging, such as low light, noise, varying illumination, and complex backgrounds, the dataset construction involved environment-aware preprocessing and data augmentation. Specifically, spacecraft images were seamlessly fused with high-resolution Earth backgrounds provided by NASA using binary masks to ensure precise boundary alignment, and a 3 × 3 Gaussian filter (σ = 0.5) was applied to smooth transitions between objects and backgrounds. Furthermore, illumination adjustments were applied to simulate diverse lighting conditions, including low-light scenarios and strong glare effects, enabling the dataset to better represent realistic space environments. In addition, scaling and rotation operations were applied within a reasonable range to maintain geometric consistency while enriching sample diversity. Random spatial positioning was introduced to simulate various observation angles, and all augmented images were resized to a unified resolution to ensure stable image quality. Through these steps, the dataset effectively approximates real on-orbit imaging conditions, enhances diversity, and improves the robustness and adaptability of the model in complex scenarios.
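The compositing and augmentation steps above can be sketched as follows: the function pastes a masked spacecraft crop onto an Earth background, smooths the seam with the stated 3 × 3 Gaussian (σ = 0.5), and applies a simple illumination gain. Placement logic, I/O, and parameter ranges are illustrative assumptions, not the authors’ exact pipeline; the foreground is assumed to fit inside the background.

```python
import cv2
import numpy as np

def composite_spacecraft(fg_bgr, mask, earth_bgr, scale=1.0, angle=0.0, gain=1.0):
    """Sketch of environment-aware compositing: binary-mask fusion onto an Earth
    background, 3x3 Gaussian seam smoothing (sigma = 0.5), illumination gain,
    random placement, and resizing to a unified resolution."""
    h, w = earth_bgr.shape[:2]
    # geometric augmentation: scale + rotate the foreground and its mask together
    M = cv2.getRotationMatrix2D((fg_bgr.shape[1] / 2, fg_bgr.shape[0] / 2), angle, scale)
    fg = cv2.warpAffine(fg_bgr, M, (fg_bgr.shape[1], fg_bgr.shape[0]))
    m = cv2.warpAffine(mask, M, (mask.shape[1], mask.shape[0]))
    # random spatial position (top-left corner) inside the background
    y0 = np.random.randint(0, max(1, h - fg.shape[0]))
    x0 = np.random.randint(0, max(1, w - fg.shape[1]))
    # soften the binary mask at the boundary to avoid a hard cut-out seam
    alpha = cv2.GaussianBlur(m.astype(np.float32), (3, 3), 0.5)[..., None] / 255.0
    # illumination adjustment (low light: gain < 1, glare: gain > 1)
    fg_lit = np.clip(fg.astype(np.float32) * gain, 0, 255)
    out = earth_bgr.astype(np.float32).copy()
    roi = out[y0:y0 + fg.shape[0], x0:x0 + fg.shape[1]]
    out[y0:y0 + fg.shape[0], x0:x0 + fg.shape[1]] = alpha * fg_lit + (1 - alpha) * roi
    return cv2.resize(out.astype(np.uint8), (640, 640))     # unified output resolution
```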

4.2. Experimental Environment and Metrics

All experiments were conducted under consistent conditions for training, validation, and testing, as detailed in Table 1. The model was trained from scratch without using pre-trained weights. The hyperparameters were configured as follows: the training period consisted of 250 epochs, the batch size was set to 64, and the image size was set to 640 × 640. The AdamW optimizer was employed for parameter optimization, with an initial learning rate of $1 \times 10^{-2}$ and a momentum coefficient of 0.937. To mitigate overfitting, weight decay was applied with a decay value of $5 \times 10^{-4}$. Additionally, a cosine annealing schedule was employed to adjust the learning rate throughout the training process.
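For reference, a minimal PyTorch sketch of this optimizer and scheduler configuration is shown below. Mapping the reported momentum of 0.937 to AdamW’s β₁ follows the YOLOv5 convention and, like the stand-in model and dummy data, is an assumption rather than the authors’ training script.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS, BATCH_SIZE, IMG_SIZE = 250, 64, 640

model = nn.Conv2d(3, 8, 3, padding=1)          # stand-in for M4MLF-YOLO; any nn.Module works
optimizer = AdamW(model.parameters(), lr=1e-2,
                  betas=(0.937, 0.999),         # beta1 plays the role of the reported momentum
                  weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)   # cosine learning-rate annealing

for epoch in range(EPOCHS):
    # one dummy optimization step per epoch; a real loop iterates 640x640 training batches
    x = torch.randn(2, 3, IMG_SIZE, IMG_SIZE)
    loss = model(x).pow(2).mean()               # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```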
To accurately assess the model’s segmentation performance on satellite components in a space environment, we selected the mean intersection over union (mIoU), precision (P), recall (R), number of model parameters (Params), computational overhead (GFLOPs), and inference speed (FPS) as evaluation metrics. These metrics comprehensively measure the model’s performance in terms of accuracy, computational efficiency, and practical applicability, thereby ensuring its adaptability in complex space environments.
Precision reflects the reliability of the prediction results and is defined as the proportion of correctly predicted positive samples to all predicted positive samples. It can be expressed by Formula (7) as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
In the formula, TP (true positive) denotes positive samples that are correctly identified, while FP (false positive) refers to negative samples that are incorrectly identified as positive. This metric employs an Intersection over Union (IoU) confidence threshold, typically set at 0.5, to classify samples. When the IoU value between the predicted bounding box and the ground truth bounding box exceeds this threshold, the prediction is deemed valid.
Recall characterizes the model’s ability to identify positive samples and is calculated using the formula presented in Equation (8):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
In this context, FN (false negative) refers to true positive samples that were not detected. The Precision-Recall (P-R) curve, formed by these two metrics, is intuitively represented in a coordinate system with sensitivity on the x-axis and precision on the y-axis, illustrating the model’s performance at various confidence levels. The area under this curve is referred to as the average precision (AP), which quantifies the model’s overall performance.
$$AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall}) \tag{9}$$
In multi-category segmentation systems, mean average precision (mAP) serves as the primary evaluation metric, and its calculation method is presented in Equation (10):
$$mAP = \frac{1}{N} \sum_{n=1}^{N} AP_{n} \tag{10}$$
This metric averages the AP values for N segmentation categories under a predefined IoU threshold (typically 0.5), comprehensively reflecting the model’s accuracy and comprehensiveness in multi-object recognition.
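A compact sketch of how AP can be computed from ranked predictions, following Equations (7)–(9), is given below. The input format (per-prediction confidence scores plus TP flags at the 0.5 IoU threshold) and the all-point interpolation step are implementation assumptions.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Rank predictions by confidence, accumulate TP/FP counts, derive the
    precision-recall curve, and integrate precision over recall (Eq. 9).
    mAP (Eq. 10) is then the mean of this value over all classes."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=np.float64)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    # all-point interpolation: make precision monotonically non-increasing
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))

# example: three predictions, two ground-truth instances of the class
print(average_precision(scores=[0.9, 0.8, 0.3], is_tp=[1, 0, 1], num_gt=2))
```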
In terms of real-time performance evaluation, processing speed is measured in frames per second (FPS), which indicates the number of images that can be processed per unit of time. Model complexity is evaluated using two metrics: the number of parameters (Params) and computational complexity (GFLOPs). The number of parameters reflects the model’s storage requirements, while computational complexity denotes the number of floating-point operations required for a single inference. Collectively, these three metrics determine the model’s deployability in resource-constrained environments, such as edge devices.
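The deployability metrics can be measured with a short utility like the sketch below, which counts learnable parameters and times repeated forward passes to estimate FPS; GFLOPs additionally require a FLOP-counting profiler and are omitted here. The stand-in model, warm-up count, and run count are assumptions.

```python
import time
import torch
import torch.nn as nn

def params_and_fps(model: nn.Module, img_size=640, runs=100, device='cpu'):
    """Count learnable parameters and estimate inference FPS on a dummy
    640x640 input; intended as a measurement sketch, not a benchmark."""
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(10):                    # warm-up iterations
            model(x)
        if device != 'cpu':
            torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(runs):
            model(x)
        if device != 'cpu':
            torch.cuda.synchronize()
    fps = runs / (time.time() - t0)
    return n_params, fps

print(params_and_fps(nn.Conv2d(3, 16, 3, padding=1)))   # stand-in model
```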

4.3. Comparison of Lightweight Backbone Network Effects

To identify an efficient feature extraction network suitable for spacecraft component segmentation in space scenarios, this paper optimized the backbone of the improved YOLOv5 segmentation framework by substituting several existing lightweight feature extraction networks and conducting comparative experiments on the candidate backbones. The results are presented in Table 2.
The backbone network, as a core component of the segmentation model, directly influences the model’s feature extraction capabilities, computational efficiency, and parameter scale, making it a critical factor in achieving a lightweight model design for spacecraft mission scenarios. In selecting the backbone network, this paper emphasizes multiple metrics, including network structure lightweightness, computational resource consumption, and segmentation accuracy, to comprehensively evaluate its suitability for spacecraft component segmentation tasks. Specifically, ShuffleNetv2, GhostNetv2, MobileNetv4, EfficientFormer v2, and MobileViTv3-S were chosen as candidate networks for comparison. These networks were integrated into the backbone portion of YOLOv5 while maintaining consistent Neck and Head structures to ensure fairness in the comparison.
As shown in Table 2, MobileNetv4 demonstrated the most balanced performance across multiple metrics, achieving high segmentation accuracy while significantly reducing the number of parameters and computational costs. It exhibits excellent inference speed and resource adaptability, making it particularly suitable for deployment on resource-constrained space mission platforms. EfficientFormer v2 excels in parameter compression but has slightly weaker feature representation capabilities for spacecraft images. MobileViTv3-S is competitive in certain metrics but has a complex overall structure and a heavy computational burden, rendering it unsuitable for embedded deployment. GhostNetv2 and ShuffleNetv2 have lightweight advantages but exhibit slightly lower segmentation accuracy than MobileNetv4 in complex backgrounds.
In summary, MobileNetv4 combines segmentation accuracy, model lightweightness, and inference efficiency, making it the optimal backbone network selected in this paper. It effectively meets the dual requirements of computational resources and response speed for space missions while ensuring segmentation performance.

4.4. Ablation Experiment

To validate the impact of various improvement modules on model performance, we designed ablation experiments by incrementally introducing these modules into the baseline model. The modules include (1) replacing the backbone with MobileNetV4 to enhance the model’s lightweight nature and computational efficiency; (2) improving the Neck with LDConv to optimize feature extraction and information transmission; (3) substituting the SPPF module with the proposed MFAC module to enhance target recognition capabilities through multi-scale feature fusion; and (4) enhancing the C3 module with C3-Faster to improve adaptability and accuracy for different target scales. The ablation results are summarized in Table 3.
After replacing the original model backbone with MobileNetV4, the number of model parameters and GFLOPs were reduced by 46.04% and 27.08%, respectively, resulting in a more lightweight model overall. The mean average precision at IoU 0.5 (mAP50) decreased slightly to 91.5%, but the overall segmentation accuracy remained satisfactory. Following the introduction of the LDConv module, mAP50 improved by 0.3%, and recall increased by 1.1%, indicating that LDConv effectively enhances feature extraction performance. The introduction of the MFAC module led to a significant improvement in recall by 4.4% and an increase in mAP50 by 1.0%, validating MFAC’s advantages in complex feature modeling. After replacing the C3 module with C3-Faster, the number of parameters further decreased, mAP50 remained stable, and model efficiency was optimized. Finally, after integrating MobileNetV4, LDConv, MFAC, and C3-Faster, precision improved to 95.1%, recall increased to 88.3%, and mAP50 rose to 93.1%. Meanwhile, the number of parameters decreased by 6.6%, and GFLOPs increased slightly; however, overall performance was significantly enhanced, demonstrating the excellent performance and practical value of the improved module combination proposed in this paper.

4.5. Visualization Experiment

To comprehensively evaluate the performance of the proposed M4MLF-YOLO model in spacecraft component recognition tasks, this paper conducts visualization experiments from multiple perspectives and introduces evaluation tools such as Precision–Recall (PR) curves and Grad-CAM activation heatmaps to facilitate both quantitative and qualitative analyses.
The PR curve serves as a crucial metric for assessing model performance in semantic or instance segmentation tasks, illustrating the trade-off between precision and recall at varying confidence thresholds. Precision refers to the proportion of pixels predicted by the model to belong to a specific category that are indeed part of that category, while recall indicates the percentage of all true pixels in that category that are accurately identified. Ideally, the PR curve should approach the top-right corner of the graph, signifying that the model achieves high accuracy while maintaining strong coverage capabilities. As illustrated in Figure 8, M4MLF-YOLO exhibits superior curve trends compared to the baseline model (YOLOv5s) across multiple critical spacecraft components, particularly sustaining high precision even at low confidence thresholds. This demonstrates its enhanced robustness and effective control of false positives, making it suitable for space-based applications where false positives are particularly sensitive.
To further investigate the spatial distribution of the regions the model attends to, this paper introduces the Grad-CAM++ [56] method for visualizing and interpreting the modules. Grad-CAM++ (Gradient-weighted Class Activation Mapping++) is an extension of Grad-CAM [57], designed to enhance the interpretability of decision-making mechanisms in deep neural networks. Unlike Grad-CAM, Grad-CAM++ incorporates second-order gradient information when calculating channel weights, enabling a more accurate quantification of the contributions of different positions in the feature map to the final prediction. This significantly enhances the model’s ability to respond to multiple targets or fine-grained regions. This mechanism is particularly suitable for spacecraft images, which often contain multiple small targets or intricate components, helping researchers determine whether the model is focusing on semantically critical regions. As illustrated in Figure 9, the attention regions of the baseline model YOLOv5s are diffuse, with some non-target regions activated, potentially leading to false detections. In contrast, the Grad-CAM++ activation regions generated by M4MLF-YOLO are more concentrated around the contours of typical components, particularly for targets with indistinct boundaries or small scales, such as antennas and solar panels, where the alignment of the hotspots is significantly improved. This further validates the effectiveness of the multi-scale feature fusion module and adaptive convolution mechanism proposed in this paper for enhancing spatial feature expression capabilities.
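For reference, the sketch below computes Grad-CAM++ channel weights from a layer’s activations and gradients using the standard closed-form approximation based on squared and cubed first-order gradients; it illustrates the weighting scheme rather than the exact visualization code used to produce Figure 9, and the hook machinery for capturing activations and gradients is omitted.

```python
import torch
import torch.nn.functional as F

def gradcam_pp(activations, gradients):
    """Grad-CAM++ heatmap from one layer's feature map A and gradient dY/dA
    (both B x C x H x W), using the common closed-form alpha approximation."""
    grads2 = gradients.pow(2)
    grads3 = gradients.pow(3)
    denom = 2.0 * grads2 + (activations * grads3).sum(dim=(2, 3), keepdim=True)
    alpha = grads2 / (denom + 1e-8)                        # per-pixel alpha coefficients
    weights = (alpha * F.relu(gradients)).sum(dim=(2, 3))  # channel weights
    cam = F.relu((weights[..., None, None] * activations).sum(dim=1))
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return cam                                             # B x H x W, normalized to [0, 1]
```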
To intuitively demonstrate the segmentation performance of the models, this paper selects spacecraft image samples for visualizing segmentation results and compares the recognition performance of different models. As illustrated in Figure 10, the segmentation results of YOLOv5s exhibit both false negatives and false positives, particularly in scenarios with blurry edges, overlapping objects, or complex backgrounds, where recognition accuracy becomes unstable. In contrast, M4MLF-YOLO demonstrates more stable and precise target localization capabilities under the same conditions, featuring tighter bounding boxes and maintaining robust recognition even in the presence of significant scale variations. This highlights its potential for application in complex remote sensing environments.

4.6. Network Comparative Experiment

To evaluate the effectiveness of the proposed M4MLF-YOLO framework, we conducted comparative experiments under identical training environments and parameter settings against two representative lightweight networks, MobileNetV4 and EfficientNetV2. The experimental results are presented in Figure 11 and Table 4.
As shown in Table 4, M4MLF-YOLO consistently outperforms the comparison models in precision, recall, and mAP while maintaining a lower parameter count and computational cost, thereby striking a better balance between accuracy and efficiency. Compared with MobileNetV4, our model achieves an improvement of approximately 2.1% in precision, 1.5% in recall, and 1.0% in mAP. These improvements translate into smoother boundary segmentation and more robust preservation of structural detail, particularly under challenging conditions such as extreme lighting, deep-space backgrounds, and complex spacecraft geometries. Compared with EfficientNetV2, M4MLF-YOLO delivers a 0.2% gain in mAP with 17.7% fewer parameters while maintaining competitive GFLOPs. This highlights its favorable balance between lightweight design and high-precision segmentation, making it well suited to deployment in resource-constrained space environments.
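The margins quoted above follow directly from the Table 4 entries; as a quick consistency check, they can be reproduced with a few lines (the values below are simply copied from Table 4, with parameters in millions):

```python
# Metrics copied from Table 4: precision, recall, mAP (%), and parameters (millions).
ours = {"p": 95.1, "r": 88.3, "map": 93.4, "params_m": 4.71}
mobilenet_v4 = {"p": 93.0, "r": 86.8, "map": 92.4, "params_m": 6.34}
efficientnet_v2 = {"p": 94.3, "r": 87.7, "map": 93.2, "params_m": 5.72}

print(round(ours["p"] - mobilenet_v4["p"], 1))        # ~2.1 points of precision over MobileNetV4
print(round(ours["r"] - mobilenet_v4["r"], 1))        # ~1.5 points of recall
print(round(ours["map"] - mobilenet_v4["map"], 1))    # ~1.0 point of mAP
print(round(ours["map"] - efficientnet_v2["map"], 1)) # ~0.2 points of mAP over EfficientNetV2
print(round(100 * (1 - ours["params_m"] / efficientnet_v2["params_m"]), 1))  # ~17.7% fewer parameters
```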
Figure 11 presents the prediction results of the compared methods on typical test samples. The rows, from top to bottom, show the original images, EfficientNetV2, MobileNetV4, the proposed method, and the corresponding ground-truth labels. Under challenging conditions such as strong illumination interference, deep-space backgrounds, and complex structural layouts, the existing methods exhibit noticeable boundary ambiguity and missing details. In contrast, M4MLF-YOLO reconstructs spacecraft contours more completely, preserves fine structural details, and produces segmentation results with greater consistency and clarity. These findings demonstrate the superior adaptability and robustness of the proposed framework in the presence of noise, blur, and substantial geometric variation. Overall, M4MLF-YOLO achieves a well-optimized balance among segmentation accuracy, lightweight deployment, and inference speed, highlighting its strong potential for practical applications in resource-constrained space environments.

5. Conclusions

In response to the limited computational resources and stringent real-time requirements of space-based scenarios, this paper proposes a lightweight spacecraft component segmentation algorithm, M4MLF-YOLO (MobileNetV4 + LDConv + MFAC + C3-Faster), for the automatic identification of typical components such as satellite bodies, solar panels, and antennas. The algorithm is built upon the YOLOv5 framework and incorporates several structural optimizations: the lightweight MobileNetV4 backbone, the Linear Deformable Convolution (LDConv) module, the Multi-Scale Fourier Adaptive Calibration (MFAC) module, and the improved feature fusion module C3-Faster. Together, these modifications yield a segmentation framework with strong engineering adaptability.
Compared to YOLOv5s, the M4MLF-YOLO model improves mean average precision (mAP@0.5) by 0.7%, reaching 93.4%, with precision and recall improving by 1.9% and 3.9%, respectively. In addition, the number of model parameters was reduced by 36.5% and the computational complexity by 24.6%, substantially compressing the model size and computational load while maintaining segmentation accuracy. The algorithm’s generalization capability was further verified on real spacecraft image data, demonstrating that M4MLF-YOLO can rapidly and accurately identify spacecraft components and confirming its effectiveness and deployment feasibility in satellite-based visual perception tasks.
Future research should focus on the dual optimization of model segmentation accuracy and satellite platform deployment adaptability. On one hand, model compression strategies such as knowledge distillation and model pruning can be introduced to significantly reduce model complexity and inference latency while ensuring segmentation performance, thereby improving deployment efficiency on actual satellite hardware. On the other hand, for small object segmentation challenges in complex spatial scenarios, such as occlusion and dense distributions, more efficient feature enhancement mechanisms and context modeling methods can be explored to further improve the model’s robustness and generalization capabilities in complex backgrounds.

Author Contributions

Conceptualization, W.Y. and Z.Z.; methodology, W.Y.; software, W.Y.; validation, Z.Z.; formal analysis, Z.Z.; investigation, W.Y.; resources, W.Y.; data curation, W.Y.; writing—original draft preparation, W.Y.; writing—review and editing, W.Y., Z.Z. and L.C.; visualization, W.Y.; supervision, L.C.; project administration, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Henshaw, C. The DARPA Phoenix Spacecraft Servicing Program: Overview and Plans for Risk Reduction. In Proceedings of the International Symposium on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS), Montreal, QC, Canada, 17–19 June 2014.
  2. Volpe, R.; Circi, C. Optical-Aided, Autonomous and Optimal Space Rendezvous with a Non-Cooperative Target. Acta Astronaut. 2019, 157, 528–540.
  3. Reed, B.B.; Smith, R.C.; Naasz, B.J.; Pellegrino, J.F.; Bacon, C.E. The Restore-L Servicing Mission. In Proceedings of the AIAA SPACE 2016, Long Beach, CA, USA, 13–16 September 2016; American Institute of Aeronautics and Astronautics: Reston, VA, USA, 2016.
  4. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
  5. Sharma, S.; Beierle, C.; D’Amico, S. Pose Estimation for Non-Cooperative Spacecraft Rendezvous Using Convolutional Neural Networks. In Proceedings of the 2018 IEEE Aerospace Conference, Big Sky, MT, USA, 3–10 March 2018; pp. 1–12.
  6. Ballard, D.H. Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognit. 1981, 13, 111–122.
  7. Ding, L.; Goshtasby, A. On the Canny Edge Detector. Pattern Recognit. 2001, 34, 721–725.
  8. Wang, Y.; Yin, T.; Chen, X.; Hauwa, A.S.; Deng, B.; Zhu, Y.; Gao, S.; Zang, H.; Zhao, H. A Steel Defect Detection Method Based on Edge Feature Extraction via the Sobel Operator. Sci. Rep. 2024, 14, 27694.
  9. Liu, T.; Liu, Y.; Yang, J.; Li, B.; Wang, Y.; An, W. Graph Laplacian Regularization for Fast Infrared Small Target Detection. Pattern Recognit. 2025, 158, 111077.
  10. Wang, M.; Yuan, S.; Pan, J. Building Detection in High Resolution Satellite Urban Image Using Segmentation, Corner Detection Combined with Adaptive Windowed Hough Transform. In Proceedings of the 2013 IEEE International Geoscience and Remote Sensing Symposium—IGARSS, Melbourne, VIC, Australia, 21–26 July 2013; pp. 508–511.
  11. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Proceedings of the 13th European Conference on Computer Vision, ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Volume 8691, pp. 346–361.
  13. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
  14. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498.
  15. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision, ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37.
  16. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
  17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  18. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
  19. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
  20. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024.
  21. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
  22. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312.
  23. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene Parsing Through ADE20K Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 633–641.
  24. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
  25. Ultralytics/Yolov5 at v6.1. Available online: https://github.com/ultralytics/yolov5 (accessed on 30 June 2025).
  26. Chen, J.; Wei, L.; Zhao, G. An Improved Lightweight Model Based on Mask R-CNN for Satellite Component Recognition. In Proceedings of the 2020 2nd International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–25 October 2020; pp. 1–6.
  27. Wang, Z.; Cao, Y.; Li, J. A Detection Algorithm Based on Improved Faster R-CNN for Spacecraft Components. In Proceedings of the 2023 IEEE International Conference on Image Processing and Computer Applications (ICIPCA), Changchun, China, 11–13 August 2023; pp. 1–5.
  28. Liu, Y.; Zhu, M.; Wang, J.; Guo, X.; Yang, Y.; Wang, J. Multi-Scale Deep Neural Network Based on Dilated Convolution for Spacecraft Image Segmentation. Sensors 2022, 22, 4222.
  29. Guo, Y.; Feng, Z.; Song, B.; Li, X. SSP: A Large-Scale Semi-Real Dataset for Semantic Segmentation of Spacecraft Payloads. In Proceedings of the 2023 8th International Conference on Image, Vision and Computing (ICIVC), Dalian, China, 27–29 July 2023; pp. 831–836.
  30. Zhao, Y.; Zhong, R.; Cui, L. Intelligent Recognition of Spacecraft Components from Photorealistic Images Based on Unreal Engine 4. Adv. Space Res. 2023, 71, 3761–3774.
  31. Cao, Y.; Mu, J.; Cheng, X.; Liu, F. Spacecraft-DS: A Spacecraft Dataset for Key Components Detection and Segmentation via Hardware-in-the-Loop Capture. IEEE Sens. J. 2024, 24, 5347–5358.
  32. Huo, Y.; Gang, S.; Guan, C. FCIHMRT: Feature Cross-Layer Interaction Hybrid Method Based on Res2Net and Transformer for Remote Sensing Scene Classification. Electronics 2023, 12, 4362.
  33. Liu, X.; Wang, H.; Wang, Z.; Chen, X.; Chen, W.; Xie, Z. Filtering and Regret Network for Spacecraft Component Segmentation Based on Gray Images and Depth Maps. Chin. J. Aeronaut. 2024, 37, 439–449.
  34. Proença, P.F.; Gao, Y. Deep Learning for Spacecraft Pose Estimation from Photorealistic Rendering. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6007–6013.
  35. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
  36. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
  37. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4—Universal Models for the Mobile Ecosystem. In Proceedings of the ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024.
  38. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
  39. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
  40. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019.
  41. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning, PMLR 139, Virtual, 18–24 July 2021.
  42. Yan, Z.; Li, X.; Li, M.; Zuo, W.; Shan, S. Shift-Net: Image Inpainting via Deep Feature Rearrangement. In Proceedings of the ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018.
  43. Chen, H.; Wang, Y.; Xu, C.; Shi, B.; Xu, C.; Tian, Q.; Xu, C. AdderNet: Do We Really Need Multiplications in Deep Learning? arXiv 2021, arXiv:1912.13200.
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
  45. Chi, L.; Jiang, B.; Mu, Y. Fast Fourier Convolution. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 4479–4488.
  46. Qiu, Z.; Yang, H.; Fu, J.; Fu, D. Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 257–273.
  47. Kong, L.; Dong, J.; Ge, J.; Li, M.; Pan, J. Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5886–5895.
  48. Zhou, M.; Huang, J.; Guo, C.-L.; Li, C. Fourmer: An Efficient Global Modeling Paradigm for Image Restoration. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 42589–42601.
  49. Sun, H.; Li, Y.; Li, Z.; Yang, R.; Xu, Z.; Dou, J.; Qi, H.; Chen, H. Fourier Convolution Block with Global Receptive Field for MRI Reconstruction. Med. Image Anal. 2025, 99, 103349.
  50. Suvorov, R.; Logacheva, E.; Mashikhin, A.; Remizova, A.; Ashukha, A.; Silvestrov, A.; Kong, N.; Goka, H.; Park, K.; Lempitsky, V. Resolution-Robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 3172–3182.
  51. Yu, Y.; Zhang, Y.; Cheng, Z.; Song, Z.; Tang, C. Multi-Scale Spatial Pyramid Attention Mechanism for Image Recognition: An Effective Approach. Eng. Appl. Artif. Intell. 2024, 133, 108261.
  52. Gao, N.; Jiang, X.; Zhang, X.; Deng, Y. Efficient Frequency-Domain Image Deraining with Contrastive Regularization. In Proceedings of the Computer Vision—ECCV 2024, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2025; Volume 15099, pp. 240–257, ISBN 978-3-031-72939-3.
  53. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
  54. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear Deformable Convolution for Improving Convolutional Neural Networks. Image Vis. Comput. 2024, 149, 105190.
  55. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
  56. Chattopadhyay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Improved Visual Explanations for Deep Convolutional Networks. arXiv 2018, arXiv:1710.11063.
  57. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
Figure 1. M4MLF-YOLO network architecture.
Figure 2. Universal Inverted Bottleneck (UIB) block in MobileNetV4.
Figure 3. Structure of SPP and SPPF; c indicates the concatenation operation.
Figure 4. Structure of the MFAC module.
Figure 5. Structure of the LDConv.
Figure 6. Structure of the C3-Faster.
Figure 7. Diagram of the dataset.
Figure 8. Comparison of segmentation results: (a) output of YOLOv5s and (b) output of M4MLF-YOLO.
Figure 9. Grad-CAM++-based visualization of classification.
Figure 10. Visual comparison of segmentation results.
Figure 11. Experimental results and comparison with other methods.
Table 1. Experimental environment configuration.

| Configuration Name | Environment Parameters |
| --- | --- |
| Operating system | Windows 10 |
| CPU | Intel(R) Xeon(R) w5-3425 |
| GPU | 2 × NVIDIA RTX 3090 |
| Memory | 128 GB |
| Programming language | Python 3.9.21 |
| Framework | PyTorch 2.12 + CUDA 11.8 |
| IDE | VSCode |
Table 2. Backbone network comparison experiment results.

| Backbone Network | Precision (%) | Recall (%) | mAP (%) | GFLOPs | Parameters |
| --- | --- | --- | --- | --- | --- |
| ShuffleNetv2 | 91.2 | 85.3 | 89.3 | 16.0 | 5.94 M |
| GhostNetv2 | 92.1 | 86.4 | 91.0 | 19.1 | 8.24 M |
| EfficientFormer v2 | 93.7 | 87.4 | 92.7 | 26.0 | 7.42 M |
| MobileViTv3-S | 92.8 | 86.7 | 92.6 | 22.4 | 7.71 M |
| MobileNetv4 | 93.5 | 87.6 | 93.0 | 23.1 | 6.01 M |
Table 3. Ablation study of key lightweight modules on spacecraft component segmentation performance.

| MobileNetV4 | LDConv | MFAC | C3-Faster | Precision (%) | Recall (%) | mAP (%) | GFLOPs | Parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  | 91.9 | 79.1 | 91.5 | 18.4 | 5,834,531 |
|  |  |  |  | 93.4 | 85.2 | 92.8 | 22.5 | 5,943,707 |
|  |  |  |  | 94.8 | 83.5 | 93.1 | 25.8 | 6,802,430 |
|  |  |  |  | 93.7 | 87.5 | 92.6 | 23.1 | 6,013,745 |
|  |  |  |  | 94.4 | 85.3 | 93.2 | 20.1 | 5,818,481 |
|  |  |  |  | 94.6 | 87.4 | 92.9 | 20.8 | 4,551,241 |
| ✓ | ✓ | ✓ | ✓ | 95.1 | 88.3 | 93.4 | 19.6 | 4,713,179 |
| – | – | – | – | 93.2 | 84.4 | 92.7 | 26.0 | 7,421,699 |
Table 4. Comparative experimental results of different models.

| Backbone Network | Precision (%) | Recall (%) | mAP (%) | GFLOPs | Parameters |
| --- | --- | --- | --- | --- | --- |
| EfficientNetV2 | 94.3 | 87.7 | 93.2 | 15.1 | 5.72 M |
| MobileNetV4 | 93.0 | 86.8 | 92.4 | 18.3 | 6.34 M |
| Ours | 95.1 | 88.3 | 93.4 | 19.6 | 4.71 M |