Article

Interactive Maintenance of Space Station Devices Using Scene Semantic Segmentation

1 Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 Jiuquan Satellite Launch Center, Jiuquan 732750, China
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(6), 542; https://doi.org/10.3390/aerospace12060542
Submission received: 6 May 2025 / Revised: 11 June 2025 / Accepted: 11 June 2025 / Published: 15 June 2025
(This article belongs to the Section Astronautics & Space Science)

Abstract

A novel interactive maintenance method for space station in-orbit devices using scene semantic segmentation technology is proposed. First, a wearable and handheld system is designed to capture images of the astronaut's forward-view scene in the space station and display these images on a handheld terminal in real time. Second, the proposed system quantitatively evaluates the environmental lighting in the scene by calculating image quality evaluation parameters; if the lighting is inadequate, a prompt message reminds the astronaut to adjust the environmental illumination. Third, our system adopts an improved DeepLabV3+ network for semantic segmentation of the astronaut's forward-view scene images. In the improved network, the original backbone is replaced with a lightweight convolutional neural network, i.e., MobileNetV2, which has a smaller model scale and lower computational complexity; the convolutional block attention module (CBAM) is introduced to improve the network's feature perception ability; and the atrous spatial pyramid pooling (ASPP) module is also considered to enable accurate encoding of multi-scale information. Extensive simulation experiments indicate that the accuracy, precision, and mean intersection over union of the proposed algorithm exceed 95.0%, 96.0%, and 89.0%, respectively. Ground application experiments have also shown that the proposed technique can effectively shorten the user's working time.

1. Introduction

With the rapid development and deployment of China's space station technologies, the daily workload of astronauts in orbit [1] is increasing steadily. As on the International Space Station (ISS), astronauts must, on the one hand, accomplish routine maintenance tasks inside the station, such as cabin cleaning, handling of living materials, and safety inspections [2]. On the other hand, the space station hosts a large number of scientific experiment cabinets [3] for research in materials science, biological science, physical science, etc.; maintaining and operating these cabinets therefore also occupies much of the astronauts' daily time and energy (Figure 1). Practical experience has shown that engaging in monotonous and repetitive work for long periods increases astronauts' mental fatigue, and small operational errors or decision mistakes in orbit may lead to significant flight safety accidents [4]. Therefore, to reduce the daily workload of astronauts and improve the reliability of in-orbit tasks, it is necessary to develop a device that uses artificial intelligence technology [5] to assist them in carrying out scientific research and daily maintenance tasks [6].
During the operation of the ISS, artificial intelligence technologies have long been applied to assist astronauts in their daily work and life schedules [7]. Since 2019, collaborative space robots have been used to support daily operations on the ISS. A small cubic flying robot called "Astrobee" [8] was designed to help astronauts complete daily management tasks, participate in various scientific experiments, and monitor the safety of astronauts' daily behaviors. To further improve the human-friendliness of collaborative robots, much research [9,10] has been carried out on the design of interactive operation processes, scene understanding, and human–computer interface display methods. For example, in [11], scholars developed an intelligent Polipo pressure sensing system for assessing injury inside the space suit, and a sensitivity model was used to analyze the sensing data. In [12], a wide-range stable motion control method for robot astronauts in space stations based on human dynamics was reported, and a viscoelastic humanoid dynamic model of parking under microgravity was established. In [13], virtual reality techniques were considered for improving the psychological and psychosocial health of astronauts, and exergaming was proposed for use during long-duration flight missions.
In the past, a large number of studies on semantic segmentation [14] have been conducted to alleviate people's workloads. Semantic segmentation aims to accurately classify each pixel of an image into its corresponding category, thereby achieving a refined understanding of image content. Its core feature is pixel-level classification: each pixel is assigned to a predefined semantic category, but different instances of the same category are not distinguished. It is widely applied in autonomous driving [15] and medical image processing [16]. Common models include U-Net [17], the DeepLab networks [18], and SegNet [19]. The encoder and decoder of U-Net are closely connected through skip connections, allowing the decoder to fully utilize feature information from different encoder levels when restoring image resolution. The DeepLab series introduces dilated convolution [20], which expands the receptive field of the convolution kernel without increasing computational complexity; it can better capture contextual information and has advantages in segmenting large-scale objects and complex scenes [21]. SegNet adopts an encoder–decoder structure in which the decoder uses the pooling indices of the encoder for upsampling, reducing the number of model parameters and improving running efficiency. Although the above networks have shown good computational performance in their specific applications, it is still worth researching how to develop a real-time, high-accuracy scene understanding method for space station applications [22].
From the perspective of network training, semantic segmentation can be classified into fully supervised, semi-supervised, and unsupervised approaches [23]. A fully supervised network requires all training data to be labeled before training. A semi-supervised network needs only a small amount of fully labeled data, supplemented by a large amount of unlabeled data. An unsupervised network aims to achieve accurate image segmentation without relying on annotated data, or with only a small amount of it. For example, semi-supervised methods include the generative adversarial network (GAN)-based method [24], the multi-network architecture-based method [25], the multi-stage architecture-based method [26], and the single-stage end-to-end architecture-based method [27], while unsupervised methods include the adaptive adversarial learning-based method [28], the adaptive image segmentation transfer-based method [29], and the adaptive self-training-based method [30]. In comparison, the fully supervised network can accurately learn the mapping between input and output data, so its calculation accuracy is typically the highest. In addition, designing a semi-supervised or unsupervised network is more difficult than designing a fully supervised one. In this article, because the working scenes in the space station are all known and the ambient light is controllable, the fully supervised network is selected for semantic segmentation.
Semantic segmentation technology can also be classified into traditional and real-time computational networks according to processing speed. In general, a real-time semantic segmentation network refers to a model that achieves a frame rate of 30 frames per second on a specific device. Traditional non-real-time methods [31] rely on fine imaging details and rich spatial context information, and suffer from a large amount of computation and a complex network structure. In contrast, through reasonable improvements of the network structure, a real-time semantic segmentation network can retain richer spatial information at a lower computational cost and capture more effective multi-scale context, thereby balancing calculation speed and accuracy. According to the network structure, real-time semantic segmentation networks can be roughly classified into single-branch networks [32], double-branch networks [33], multi-branch networks [34], U-shaped networks [35], and neural architecture search networks [36]. The main disadvantage of real-time methods is their limited fine segmentation ability. Considering that the device maintenance guidance application in this paper places higher demands on segmentation quality than on real-time computation, a non-real-time network design is adopted to reduce the difficulty of network design.
For fully supervised, non-real-time semantic segmentation networks, common methods include encoder–decoder models, dilated convolution models, and multi-task learning models. The encoder–decoder model converts the original data into low-dimensional representations, captures their key features, and then converts them back to the original data space; common networks include RefineNet [37], U-Net, and SegNet. The dilated convolution model expands the receptive field of the convolution kernel by introducing dilated convolution; its typical representative is the DeepLab series of networks. The multi-task learning model combines the network framework with complementary tasks, so that parameters can be shared during training and the maintenance cost of different tasks is reduced; common methods include the multi-scale convolutional neural network (MSCNN) [38] and the prediction-and-distillation network (PAD-Net) [39]. In comparison, the DeepLab network has a fine calculation effect and fast processing speed, which is more suitable for the application in this article. The DeepLab series is derived from VGG16 [40]. DeepLabV1 replaces the last fully connected layer with a fully convolutional network, removes the last two pooling layers, and expands the receptive field using dilated convolution. DeepLabV2 uses ResNet-101 [41] as the basic network and proposes the pyramid pooling structure. DeepLabV3 further improves the atrous spatial pyramid pooling (ASPP) structure to solve the low computational efficiency of dilated convolution. DeepLabV3+ upgrades ResNet-101 to Xception, which further reduces the number of network parameters.
In this paper, an interactive maintenance technique for complex space station devices using scene semantic segmentation is proposed. First, a glasses-style wearable device and a handheld panel are designed to help astronauts collect and display the scene imaging information; in a microgravity environment, the handheld panel can be secured to an astronaut with a restraint strap. Second, image quality evaluation parameters [42] are used to evaluate the environmental lighting. Since the working environment inside a space station cabin is an entirely artificial system, the environmental lighting can easily be controlled by astronauts. Third, a semantic segmentation network, i.e., an improved DeepLabV3+ network [43], is proposed for scene understanding: MobileNetV2 [44] is employed as the backbone network, a convolutional block attention module (CBAM) [45] participates in the feature computations, and an ASPP-type module [46] is also used. The main contributions of this paper are as follows: (1) a new interactive task assistance system is proposed for astronauts working in orbit during space station missions, which effectively reduces the complexity of astronauts' work and improves mission reliability [47]; (2) an effective computational model for scene perception and understanding is proposed, which has the advantages of lightweight deployment and high computational accuracy.
In the following sections, the key techniques of the proposed system and method are presented in Section 2, the evaluation experiments are illustrated in Section 3, further data analyses and discussions are given in Section 4, and conclusions are drawn in Section 5.

2. Proposed Methods

2.1. Computational Flow Chart Overview

Figure 2 presents the application illustration and computational flowchart of the proposed scene perception system. After the astronaut puts on the glasses-based environmental perception system, takes the handheld display system, and enters the working area of the space station, the following steps are carried out. First, our system evaluates the ambient lighting intensity in front of the astronaut. If the ambient lighting intensity is too low (for example, because the main control computer has reduced the lighting output in the working space to save energy), the system prompts the astronaut to adjust the cabin lighting source to reach a normal working brightness. Second, the system collects the astronaut's forward-view scene images and uses our improved DeepLabV3+ network to perform semantic segmentation of the complex cabinet components in the scene. Third, after the segmentation is completed, the astronaut can select the components of interest on the handheld display system according to the repair outline of the scientific experiment cabinet, based on the segmentation results. Finally, according to the astronaut's selections, the system presents 3D models, audio and video introductions, and textual descriptions of the selected component in the software interface, thereby helping the astronaut understand its maintenance principles.
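To make the interaction flow above concrete, the following minimal Python sketch outlines the top-level control loop under stated assumptions: the objects camera, panel, segmenter, and lighting_estimator and all of their methods are hypothetical placeholders for illustration only and do not correspond to the actual flight software API.

```python
# Hypothetical top-level loop of the proposed guidance system (all names are placeholders).

def maintenance_assist_loop(camera, panel, segmenter, lighting_estimator):
    """One interaction cycle: lighting check -> segmentation -> interactive guidance."""
    frame = camera.capture()                     # forward-view image from the glasses camera

    # Step 1: evaluate ambient lighting from image quality indices (Section 2.2).
    if not lighting_estimator.is_adequate(frame):
        panel.prompt("Lighting too dim: please increase the cabin illumination.")
        return

    # Step 2: semantic segmentation of the cabinet components (Section 2.3).
    mask = segmenter.predict(frame)              # per-pixel class labels

    # Step 3: the astronaut selects a component of interest on the handheld panel.
    component_id = panel.select_component(frame, mask)

    # Step 4: display the 3D model, audio/video, and text for the selected component.
    panel.show_component_guidance(component_id)
```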

2.2. Environment Lighting Perception

A robust environment lighting perception method is used in this paper [48]. First, the original input image is transformed from the RGB color space into the Lab color space [49]. Second, the mean subtracted and contrast normalized (MSCN) coefficients [50] are computed. The MSCN coefficient is obtained by subtracting the local mean from the image and normalizing the contrast; it extracts statistical features that can be used to predict the distortion type and perceived quality of an image. Third, the information entropy indices are calculated. The image information entropy is computed from the image histogram and reflects the average information content of an image: the larger the entropy value, the more disordered the image. The gray-level histogram statistics reflect the pixel distribution and brightness changes in an image. Therefore, this paper uses the information entropy to measure the overall brightness of an image, which has the advantages of fast calculation and high processing accuracy. Finally, a support vector machine (SVM) [51] is used for environment lighting perception. Equations (1)–(4) show the computation of the information entropy, and a three-input, one-output SVM is used for lighting state evaluation. A binary classification problem is considered, dividing the ambient lighting into normal (ordinary and strong lighting conditions) and abnormal (dim lighting conditions). Table 1 presents the definitions of the imaging luminance degrees and their roughly corresponding information entropy distributions. Figure 3 shows data samples of complex cabinet components captured under different ambient lighting conditions. As indicated in Table 1, the brightness division can be continuously optimized and adjusted as data accumulate, so estimating the lighting brightness with an SVM rather than fixed thresholds is more meaningful at the current stage.
\hat{I}_X(i,j) = \frac{I_X(i,j) - \mu_X(i,j)}{\sigma_X(i,j) + C}, (1)
\mu_X(i,j) = \sum_{k=-K}^{K} \sum_{l=-L}^{L} w_{k,l}\, I_X(i+k, j+l), (2)
\sigma_X(i,j) = \sqrt{ \sum_{k=-K}^{K} \sum_{l=-L}^{L} w_{k,l} \left[ I_X(i+k, j+l) - \mu_X(i,j) \right]^2 }, (3)
H_X = - \sum_{m=0}^{255} P_X(m) \log_2 P_X(m), (4)
where I_X(i, j) is the image intensity of the X component of the Lab color space at point (i, j), i = 1, 2, …, M, j = 1, 2, …, N; M and N are the height and width of the input image; \hat{I}_X(i, j) is the MSCN coefficient of I_X(i, j); μ_X(i, j) is the local mean of I_X(i, j); σ_X(i, j) is the local standard deviation of I_X(i, j); C is a constant, C = 1.0 in this paper; X ∈ {L, a, b}; w_{k,l} is a Gaussian filter; k and l index the discrete Gaussian filter, k = −K, …, K, l = −L, …, L, with K = 2 and L = 2 in this paper; H_X is the information entropy of the X component; and P_X(m) is the normalized histogram value of I_X(i, j) at gray intensity m.
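For illustration, the NumPy/OpenCV sketch below computes the MSCN coefficients and information entropy of Equations (1)–(4). It assumes OpenCV's 8-bit Lab conversion; the 5 × 5 Gaussian window (i.e., K = L = 2) and C = 1.0 follow the values stated above, while the Gaussian standard deviation is an assumption.

```python
import cv2
import numpy as np

def mscn_and_entropy(bgr_image, C=1.0, ksize=5, sigma=1.0):
    """Compute MSCN coefficients (Eqs. (1)-(3)) and information entropy (Eq. (4))
    for each channel of the Lab color space. ksize=5 corresponds to K = L = 2."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float64)
    mscn_maps, entropies = [], []
    for ch in range(3):                                    # X in {L, a, b}
        I = lab[:, :, ch]
        mu = cv2.GaussianBlur(I, (ksize, ksize), sigma)    # Eq. (2): Gaussian-weighted local mean
        var = cv2.GaussianBlur(I * I, (ksize, ksize), sigma) - mu * mu
        sigma_map = np.sqrt(np.abs(var))                   # Eq. (3): local standard deviation
        mscn = (I - mu) / (sigma_map + C)                  # Eq. (1): MSCN coefficients
        hist, _ = np.histogram(I, bins=256, range=(0, 256))
        p = hist / hist.sum()
        p = p[p > 0]
        H = -np.sum(p * np.log2(p))                        # Eq. (4): information entropy
        mscn_maps.append(mscn)
        entropies.append(H)
    return mscn_maps, entropies                            # the entropy triple feeds the SVM
```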

2.3. Improved DeepLabV3+

The overall architecture of the classic DeepLabV3+ is shown in Figure 4. The core idea of DeepLabV3+ is an encoder–decoder architecture. The main body of its encoder is a deep convolutional neural network (DCNN) [52] with atrous convolutions, which can be built on commonly used classification networks. The atrous convolution inserts holes into the standard convolution kernel to enlarge the receptive field. By setting different dilation rates, receptive fields of different sizes can be obtained and multi-scale information can be estimated. For example, a normal convolution is a dilated convolution with a dilation rate of 1, whereas a 3 × 3 convolution kernel with a dilation rate of 2 has the same receptive field as a 5 × 5 kernel while requiring only 9 parameters. At the end of the encoder module, DeepLabV3+ introduces the ASPP module with dilated convolutions to improve its segmentation of multi-scale objects. ASPP applies a 1 × 1 convolution, 3 × 3 convolutions with dilation rates of 6, 12, and 18, and a global average pooling operation to the input feature maps. After fusing the resulting feature maps and applying a 1 × 1 convolution, the number of channels is compressed to 256. In this way, ASPP can extract and distinguish feature information of targets at different scales, achieving good segmentation of multi-scale targets; the goal of this module is to capture contextual information at different scales to better understand the semantics of the whole image. The decoder module of DeepLabV3+ performs the necessary upsampling operations and further fuses low-level features with high-level features. During feature map restoration, low-level features are fused to recover the boundary information of the target, and bilinear interpolation is used for upsampling, ultimately improving the segmentation accuracy of the network.
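The receptive-field argument above can be checked with a minimal PyTorch snippet (PyTorch is assumed here; the paper does not state its deep learning framework): a 3 × 3 convolution with dilation 2 keeps only 9 weights while matching the spatial coverage of a 5 × 5 kernel.

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation 2 covers a 5x5 area while keeping only 9 weights.
conv_dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
conv_plain5 = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)

print(sum(p.numel() for p in conv_dilated.parameters()))  # 9 parameters
print(sum(p.numel() for p in conv_plain5.parameters()))   # 25 parameters

x = torch.randn(1, 1, 64, 64)
print(conv_dilated(x).shape, conv_plain5(x).shape)         # identical spatial size
```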
In this paper, a series of improvements is made to the DeepLabV3+ network to meet the requirements of the scene understanding mission in the space station. Figure 5 illustrates the structure of our improved DeepLabV3+. The original backbone network is replaced with the lightweight MobileNetV2, which has a smaller model scale and lower computational complexity. The CBAM is introduced to improve the network's feature perception ability. And a densely connected ASPP (DenseASPP) module is adopted to enable accurate encoding of multi-scale information.
  • Improvement of backbone network
The classic DeepLabV3+ uses Xception [53] as the backbone network. The Xception model has a relatively complex architecture and strong expressive ability, making it suitable for complex tasks and large-scale datasets. However, due to its relatively large model size, it requires more computing resources, more storage space, and longer training time, making it less suitable for portable devices with limited resources. In this paper, our improved DeepLabV3+ network (Figure 5) replaces the classic backbone with a lightweight convolutional neural network, i.e., MobileNetV2. Compared with Xception, MobileNetV2 has a smaller model size and lower computational complexity, making it suitable for mobile devices and embedded systems; owing to its lightweight design, it is also fast in the inference phase. It replaces ordinary convolution with depthwise separable convolution and introduces two hyperparameters, the width multiplier and the resolution multiplier, to flexibly control the size of the network model. With this optimized structure, its accuracy exceeds that of most neural networks while it uses fewer parameters and less computation. MobileNetV2 has two important design elements: linear bottlenecks between network layers and residual connections between bottleneck blocks. The bottleneck blocks effectively encode the model's feature information at the input and output ends, while the inner layers encapsulate the transformation from lower-level information to higher-level abstract representations.
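As a rough illustration of the backbone replacement, the sketch below exposes a low-level and a high-level feature map from torchvision's MobileNetV2, as a DeepLabV3+-style encoder expects. The split index and the printed feature-map shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2Backbone(nn.Module):
    """Expose a low-level and a high-level feature map for a DeepLabV3+-style encoder.
    The split point (index 4) is an illustrative assumption."""
    def __init__(self):
        super().__init__()
        features = mobilenet_v2().features
        self.low_level = features[:4]    # early layers: fine spatial detail (24 channels)
        self.high_level = features[4:]   # deeper layers: semantic features (1280 channels)

    def forward(self, x):
        low = self.low_level(x)
        high = self.high_level(low)
        return low, high

backbone = MobileNetV2Backbone()
low, high = backbone(torch.randn(1, 3, 512, 512))
print(low.shape, high.shape)  # e.g., (1, 24, 128, 128) and (1, 1280, 16, 16)
```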
  • Combination of CBAM
CBAM (Figure 6) is one of the most widely used attention mechanisms in deep learning networks, aiming to improve the performance of the network model by focusing on important parts of the image. CBAM processes both channel and spatial information: the channel attention module determines which channels are important, and the spatial attention module then determines where the informative regions are. This dual attention mechanism enables it to comprehensively capture key information in features. CBAM has the following main characteristics. First, it has a dual attention mechanism, simultaneously modeling the dependencies between channels and the importance of spatial positions, which enhances the model's ability to capture key features. Second, it has a lightweight design, requiring only a small number of additional parameters to significantly improve performance. Third, it is convenient to plug and play: it can be flexibly embedded into any convolutional block of a deep learning network and is suitable for tasks such as classification, detection, and segmentation.
The channel attention mechanism obtains two different descriptors for each channel of the input feature map, an average and a maximum, through global average pooling and global maximum pooling. These descriptors are processed by a multi-layer perceptron with shared weights, the outputs are summed element-wise, and the Sigmoid function is applied to generate a channel attention map Mc whose length equals the number of input channels; its computation is shown in Equation (5). The spatial attention module applies average pooling and max pooling along the channel dimension to generate two 2D feature maps, which are stacked in the channel dimension and passed through a 7 × 7 convolutional layer; a spatial attention map Ms is then generated using the Sigmoid function, as expressed in Equation (6). The output feature map F′ (Equation (7)) is obtained by element-wise multiplication, combining the input feature map F with the channel attention map Mc and the spatial attention map Ms. Through the structure above, CBAM effectively enhances the network's response to important detailed features in images, improving the model's performance in various visual tasks, especially image recognition and semantic segmentation in complex application scenes.
M_c = \sigma\left( \mathrm{MLP}\left( \mathrm{AvgPool}(F) \right) + \mathrm{MLP}\left( \mathrm{MaxPool}(F) \right) \right), (5)
M_s = \sigma\left( \mathrm{Conv}_{7\times7}\left( \mathrm{Concat}\left( \mathrm{AvgPool}(F), \mathrm{MaxPool}(F) \right) \right) \right), (6)
F' = F \otimes M_c \otimes M_s, (7)
where MLP(·) is the multi-layer perceptron; AvgPool(·) and MaxPool(·) are the average pooling and maximum pooling operations; F is the input feature map; σ(·) is the Sigmoid activation function; Conv7×7(·) denotes the 7 × 7 convolution operation; Concat(·) is the feature concatenation operation; and ⊗ denotes element-wise multiplication.
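A compact PyTorch sketch of Equations (5)–(7) is given below; the reduction ratio of 16 in the shared MLP is a common default and an assumption here, since the paper does not specify it.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel and spatial attention following Eqs. (5)-(7); reduction=16 is an assumption."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                      # shared MLP for Eq. (5)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel,
                                      padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        # Eq. (5): channel attention from global average- and max-pooled descriptors.
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        mc = torch.sigmoid(avg + mx)
        # Eq. (6): spatial attention from channel-wise average and max maps.
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.spatial_conv(pooled))
        # Eq. (7): element-wise reweighting of the input feature map.
        return f * mc * ms

cbam = CBAM(channels=64)
print(cbam(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```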
  • Replacement of ASPP by DenseASPP
After feature extraction in the backbone network, a preliminary effective feature layer that has undergone four downsampling operations is obtained, and multi-scale feature capture is then performed on this layer. After the multi-scale capture is completed, the multiple feature layers are fused and passed through a 1 × 1 convolution to obtain the output feature map. The traditional ASPP extracts features with parallel atrous convolutions at different rates, capturing information at multiple scales through convolutions with different receptive fields (see Figure 4). However, as the dilation rate increases (especially beyond 24), the effective weights of the dilated convolutions decrease, and so does the feature extraction ability. Therefore, this paper adopts the DenseASPP [54] module (Figure 7) instead of the traditional ASPP module, which can better capture semantic information in images. DenseASPP adopts a densely connected, cascaded dilated convolutional layer design. The output of each dilated convolutional layer (Equation (8)) is not only passed to the next layer but also connected to all subsequent layers, forming a dense feature transfer path. The dilation rate increases layer by layer, and the final output is a feature map generated by multi-scale dilated convolutions with multiple dilation rates.
y_l = H_{k, d_l}\left( \left[ y_{l-1}, y_{l-2}, \ldots, y_0 \right] \right), (8)
where d_l represents the dilation rate of the lth layer of the network; H_{k,d_l}(·) is the dilated convolution operation with kernel size k; and [y_{l−1}, y_{l−2}, …, y_0] is the feature map formed by concatenating the outputs of all previous layers. DenseASPP not only accumulates multi-scale features through dense connections but also enlarges the receptive field through multiple consecutive dilated convolutional layers, avoiding the performance degradation caused by a single high dilation rate.
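The dense connection pattern of Equation (8) can be sketched in PyTorch as follows; the dilation rates (3, 6, 12, 18, 24) and channel widths are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class DenseASPP(nn.Module):
    """Cascaded dilated convolutions with dense connections (Eq. (8)).
    Dilation rates and channel widths are illustrative assumptions."""
    def __init__(self, in_channels, mid_channels=64, rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        for d in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, mid_channels, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(mid_channels),
                nn.ReLU(inplace=True)))
            channels += mid_channels          # each output feeds all later layers

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            y = branch(torch.cat(feats, dim=1))  # [y_{l-1}, ..., y_0] as in Eq. (8)
            feats.append(y)
        return torch.cat(feats, dim=1)

head = DenseASPP(in_channels=320)
print(head(torch.randn(1, 320, 16, 16)).shape)  # (1, 320 + 5*64, 16, 16)
```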

3. Experiment Results and Evaluations

A series of simulation tests is carried out on a PC (Intel Xeon Gold 6133 CPU (Intel Corporation, Santa Clara, CA, USA), 2.50 GHz, 40 GB RAM, and a Tesla V100-SXM2-32 GB GPU) using Python 3.8.20 to verify the correctness of the proposed algorithm. Three comparative experiments are used to evaluate the proposed system and method: the environmental lighting evaluation experiment, the improved DeepLabV3+ network comparison experiment, and the system application evaluation experiment. Due to the limitations of space station applications, all evaluations are currently conducted with ground imitation experiments.

3.1. Experiment Data and Organizations

Because we currently cannot directly obtain massive imaging observation data of the actual experimental devices in the space station, we design a 3D model of a kind of complex device and use 3D printing to produce a physical object that imitates the appearance of the corresponding experimental devices in the space station, in order to verify the correctness of the proposed system and method. A Groudchat wearable camera with an image resolution of 3840 × 2160 is used to collect the experimental data. Samples of the resulting image dataset are shown in Figure 8. The components include multiple types of jars or boxes, round buttons, and smaller pipelines and valves, which are similar in appearance to the experimental devices or the daily environmental control and life support equipment (e.g., the urine recycler) in the space station. To increase data diversity, we collect images of this device from different photographic angles (deviations from the central front-view direction of about −60° to 60°) and distances (about 20.0 to 50.0 cm from the device). Meanwhile, considering that environmental light control is often required within the space station to save electricity, we also collect data under different lighting conditions to support testing the brightness evaluation indicators proposed in our method. Currently, we have accumulated more than 1500 images in our dataset; 750 images are collected under ordinary lighting conditions, while the remaining 750 images are captured under strong lighting (Figure 8). The LabelMe software [55] (Windows version) is used to label the original data. Regarding the image annotation, we currently classify the scene into three kinds of devices, comprising five tanks, three valves, and eight pipelines. Occlusion processing is also considered: for partially occluded parts, the rear target is annotated by tracing around the occluding object, and for truncated parts, double-line segmentation is employed. In addition, we also consider the color specification: the colors are kept strictly consistent with the pre-designed specification, and LabelMe automatically assigns colors according to the category order.

3.2. Evaluation Experiments of Environment Lighting

This paper uses image quality evaluation parameters to evaluate the lighting brightness of the imaging environment. On the one hand, we calculate the information entropy in the Lab color space to quantitatively describe the illumination imaging features; compared with the RGB color space, the Lab color space is closer to the human eye's perception of color, making it suitable for objective calculation of environmental brightness. On the other hand, we use an SVM for binary classification of the imaging environment brightness. The SVM classifier has fast calculation speed, good classification performance, and low training data requirements, making it particularly suitable for lightweight implementations. Table 2 compares the classification accuracy of the SVM with different kernel functions [56]; the training set contains 600 samples, and the test set contains 56 samples. As shown in Table 2, using the radial basis function as the SVM kernel achieves the best classification accuracy in this application, so this kernel is also used in our subsequent experiments. Table 3 shows the information entropy values corresponding to all the data in Figure 8 and Figure 9. From Figure 8, Figure 9, and Table 3, it can be seen that the proposed imaging illumination intensity calculation method can correctly evaluate the illumination environment and meet our system application requirements. Finally, according to statistical calculations, the average calculation time of the above method is only about 0.2 s; this fast processing speed ensures that our method can be applied to practical space station scene understanding.
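A minimal scikit-learn sketch of the brightness classifier is shown below; the three entropy features per image correspond to the L, a, and b channels described in Section 2.2, the RBF kernel follows Table 2, and the randomly generated arrays are placeholders for the real entropy features and labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder data: each row is the (H_L, H_a, H_b) entropy triple of one training image,
# and the label is 1 for normal lighting and 0 for dim lighting.
X_train = np.random.rand(600, 3) * 8.0
y_train = np.random.randint(0, 2, size=600)
X_test = np.random.rand(56, 3) * 8.0

# Radial basis function kernel, as selected in Table 2.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)   # 1 = adequate lighting, 0 = prompt the astronaut
```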

3.3. Evaluation Experiments of Improved DeepLabV3+

An ablation experiment is performed to verify the effectiveness of the proposed improved DeepLabV3+ network, and Table 4 presents the corresponding results. In this experiment, the training set has 1170 images and the test set has 130 images. Three metrics are used to evaluate the segmentation performance: accuracy, precision, and MIoU. Accuracy is defined as the ratio of correctly classified pixels to the total number of pixels; the higher the accuracy, the better the segmentation performance. Precision is the proportion of correctly predicted (true positive) samples among all samples predicted as positive; in semantic segmentation, it reflects the model's ability to recognize positive samples. MIoU measures the degree of overlap between the predicted and ground-truth regions; the higher the MIoU, the better the segmentation. In Table 4, all models use MobileNetV2 as the backbone, and in addition to the CBAM and DenseASPP modules, a cascaded feature fusion (CFF) module [57] is also evaluated. The CFF module is a deep learning component designed for multi-scale feature optimization; its core idea is to improve semantic segmentation performance in complex scenes through a progressive feature integration mechanism. Because CFF often appears in deep learning network designs, it is included in our ablation comparison. From Table 4, although our method does not achieve the best score on every evaluation parameter, the model with the CBAM and DenseASPP modules achieves the best overall results.
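For reference, the three metrics can be computed from a confusion matrix as in the NumPy sketch below; macro-averaging over classes is assumed for precision and MIoU, since the paper does not state its averaging convention.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes):
    """Pixel accuracy, macro precision, and MIoU from predicted/ground-truth label maps."""
    pred, target = pred.ravel(), target.ravel()
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (target, pred), 1)               # confusion matrix: rows = truth, cols = prediction

    accuracy = np.trace(cm) / cm.sum()             # correctly classified pixels / all pixels
    tp = np.diag(cm).astype(np.float64)
    precision = np.mean(tp / np.maximum(cm.sum(axis=0), 1))   # per-class precision, then averaged
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    miou = np.mean(tp / np.maximum(union, 1))      # mean intersection over union
    return accuracy, precision, miou

# Toy example with 4 classes (background + tank + valve + pipeline).
pred = np.random.randint(0, 4, size=(64, 64))
target = np.random.randint(0, 4, size=(64, 64))
print(segmentation_metrics(pred, target, num_classes=4))
```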
To further verify the effectiveness of replacing the backbone network, adding the attention mechanism, and applying the ASPP-type module, eight comparison experiments are designed. In this experiment, the training set has 1100 images and the test set has 120 images. In experiment 1, we use the classic DeepLabV3+ algorithm. In experiment 2, we use the classic DeepLabV3+ and replace its backbone with MobileNetV2 while keeping the other parts unchanged. In experiment 3, we replace the backbone with MobileNetV3 while keeping the other parts unchanged. In experiment 4, we use the MobileNetV2 backbone and introduce the CBAM attention mechanism. In experiment 5, we use the MobileNetV2 backbone and introduce the squeeze-and-excitation (SE) attention module [58]. In experiment 6, we use the MobileNetV2 backbone and the context anchor attention (CAA) module [59]. In experiment 7, we use the MobileNetV2 backbone with both CBAM and the CFF module. In experiment 8, we use the MobileNetV2 backbone with the CBAM and DenseASPP modules. The accuracy, precision, and MIoU metrics are again used to evaluate the processing effect, and the corresponding results are shown in Table 5. From Table 5, it can be seen that our proposed model with MobileNetV2, CBAM, and DenseASPP achieves the best overall performance.
A series of comparison experiments among common semantic segmentation networks is also performed, comparing the SegFormer network [60], UNet, SegNet, the pyramid scene parsing network (PSPNet) [61], the classic DeepLabV3+, and our improved DeepLabV3+. In this experiment, the training set has 1168 images and the test set has 131 images. SegFormer is a Transformer-based semantic segmentation model [62] that adopts a hierarchical Transformer encoder and a lightweight multilayer perceptron decoder; the encoder fuses features of different scales through overlapping patch merging, while the decoder uses lightweight multilayer perceptrons to gradually upsample the feature map to the original resolution, which improves efficiency and reduces the computational load. UNet combines encoder and decoder structures with a skip connection mechanism and is particularly suitable for scenarios with small sample sizes and high-precision segmentation requirements; its main features are small-sample adaptability, fine segmentation ability, and a lightweight design. SegNet improves segmentation accuracy and reduces computational overhead through its pooling index upsampling mechanism; its main features are efficient parameter design, pooling index upsampling, and lightweight real-time performance. PSPNet is a deep convolutional semantic segmentation model whose core idea is to use pyramid pooling modules to capture contextual information at different scales; it can effectively utilize contextual information while integrating multi-scale information. The corresponding experimental results are presented in Table 6 and Figure 10. As can be seen, the method proposed in this paper performs the best on all indicators, which demonstrates its effectiveness to some extent. The average processing times of all the above algorithms are below about 1.0 s, which meets the requirements of human–computer interactive operation in the corresponding in-orbit maintenance guidance applications. When astronauts apply this technology in practice, they need to observe the handheld panel carefully for a long time, so a few seconds of processing time has little impact on the mission.

3.4. Evaluation Experiments of System Application

An application comparison experiment of the designed system and method is performed in the ground environment. First, an equipment inspection and disassembly operation is performed without our system and method: the users directly employ a traditional handheld panel, which contains a complete device introduction video and text, to learn the device's working principles and try to disassemble and reassemble it. The average operation time, the average longest operation time, and the average shortest operation time are recorded. Second, the same equipment is re-examined and disassembled using our designed method, and the same three statistics are calculated again. The corresponding results are recorded in Table 7. As shown in Table 7, the average operation time without our method is more than 140 s longer than with our intelligent interactive method, and the average maximum and minimum operation times are also significantly longer. In addition, we observe that without the intelligent method, the user occasionally makes mistakes during the mission, which delays task execution and reduces task reliability. Figure 11 presents an example of the application process of the proposed system and method. Users can select objects of interest based on the semantic segmentation results returned on the handheld panel, and the system comprehensively uses 3D exploded-view models, audio and video files, text, and other methods to provide maintenance prompts.

4. Discussion

The importance of space station technology as a frontier base for human space exploration is self-evident. However, space station in-orbit devices face many serious challenges. The enclosed and narrow space of the station makes its internal air circulation dependent on the ventilation system. Meanwhile, astronauts' breathing and sweating, as well as the water vapor generated by equipment operation, make the devices in the station prone to surface contamination and mold, seriously threatening the performance and service life of the space station and thus affecting the safe and stable operation of space missions as well as the health and work efficiency of astronauts. To ensure the long-term reliable operation of the station, astronauts need to maintain the equipment regularly. Given their heavy workload, wearable glasses and a handheld panel based on semantic segmentation can be designed to identify the devices within an astronaut's line of sight, allowing the astronaut to interact directly with the device models and view their working principles. The performance parameters of the system include the semantic segmentation accuracy of scene understanding, the system response speed, the system mean time between failures (MTBF), etc. Clearly, this kind of interactive artificial intelligence technology can significantly improve the safety of space flight missions, reduce astronauts' workload, and avoid many potential risks.
Numerous experiments have shown that ambient light affects the computational performance of the semantic segmentation network proposed in this paper. Therefore, we further evaluate the computational performance of our improved DeepLabV3+ on images with different ambient light ratios. In these experiments, images collected under different proportions of strong and ordinary lighting are used for network training, and data are then randomly extracted from the image dataset (covering all lighting conditions) to test the trained networks. A total of 5 experiments are conducted, with the numbers of strong lighting and ordinary lighting training images being 600 and 0 (experiment 1), 400 and 200 (experiment 2), 300 and 300 (experiment 3), 200 and 400 (experiment 4), and 0 and 600 (experiment 5), respectively. The test set contains 60 images, 30 from strong lighting environments and 30 from ordinary lighting conditions. The corresponding results are shown in Table 8 and Figure 12. From Table 8, it can be seen that training with data collected under ordinary lighting conditions yields the best computational performance. Strong (i.e., overly bright) lighting may not produce the best segmentation results; a possible reason is that the device surfaces segmented in this paper are mostly metallic and prone to producing bright spots or specular reflections, which affect the subsequent calculations. Clearly, the final lighting output control also needs to consider the astronauts' subjective visual perception and cannot be determined solely from the imaging semantic segmentation results.
Generally speaking, device maintenance for space stations in orbit has gone through three stages of development. In the first stage, paper operation manuals were universally used to guide astronauts through the details of their work. Clearly, paper manuals accumulate in ever-increasing numbers over long-term space station missions; they are also not conducive to information storage and electronic processing, nor to the efficient analysis of complex information. In the second stage, fixed electronic manuals were used to assist astronauts with in-orbit operations. Conventional electronic manuals only provide some system introductions, typical schematic diagrams, photo playback, and similar functions, and cannot perform online interactive analysis of device status in actual scenarios, so their application effectiveness is also relatively limited. In the third stage, interactive scene understanding techniques based on imaging analysis have emerged [63]. Astronauts can perform online interactive operations anytime and anywhere based on the information they are actually capturing, which significantly enhances their engagement and reduces their workload [64]. As a result, the application of these techniques will significantly simplify astronauts' in-orbit operations and improve mission reliability.
The system and method proposed in this paper have at least the following advantages. First, the network structure of our improved DeepLabV3+ is lightweight, computationally efficient, and fast, which can meet the near-real-time scene understanding needs in space stations; the experimental results show that our method processes each frame in less than 1.0 s. Second, the designed method has high reliability and can be used independently, making it suitable for practical applications in space stations; based on the imaging lighting range inside the space station, we conduct a series of semantic segmentation calculations under different lighting environments, and the experimental results consistently show the effectiveness of the proposed method. Third, the designed method has strong scalability and can easily be combined with other deep learning methods or replaced with networks that have more powerful computing capabilities. Of course, our method also has certain shortcomings. For example, it currently does not further process the segmentation of the same object under occlusion, and the semantic segmentation of small targets remains a challenge in this field. In the future, we will further optimize the design of the semantic segmentation network, combining pre-training, transfer learning [65], and reinforcement learning techniques [66] to improve its computing performance. The proposed system and method are also expected to be applied to the maintenance of the space station's life support system.

5. Conclusions

A design scheme for an interactive application system and its related algorithms is proposed for the development of space station scene understanding technology. First, a wearable, glasses-based device is designed to acquire images of the astronaut's forward-view scene, and the image processing and result display are carried out on a handheld panel. Second, an information entropy image quality assessment parameter is used to estimate the brightness of the imaging environment; when the brightness is too low, the system prompts the astronaut to increase the ambient lighting output. Third, an improved design of the DeepLabV3+ network is proposed, which introduces MobileNetV2, the attention mechanism, the DenseASPP module, etc., achieving a lightweight and efficient scene semantic segmentation calculation. Fourth, an interactive astronaut mission assistance display system is designed, which comprehensively uses audio, video, text, and other media to guide and assist astronauts in maintaining their in-orbit equipment cabinets. A large number of ground experiments have demonstrated the correctness and effectiveness of the proposed method. In the future, the designed system will be applied to real in-orbit missions, further improving the safety and reliability of space flight missions.

Author Contributions

Conceptualization, H.L. (Haoting Liu), C.L., Z.T., M.W., H.L. (Haiguang Li), X.L. (Xiaofei Lu), Z.G. and Q.L.; data curation, H.L. (Haoting Liu), C.L. and X.L. (Xikang Li); formal analysis, H.L. (Haoting Liu); funding acquisition, H.L. (Haoting Liu); investigation, H.L. (Haoting Liu) and C.L.; methodology, H.L. (Haoting Liu), C.L., X.L. (Xikang Li), Z.T., M.W., H.L. (Haiguang Li), X.L. (Xiaofei Lu), Z.G. and Q.L.; project administration, H.L. (Haoting Liu), H.L. (Haiguang Li), X.L. (Xiaofei Lu), Z.G. and Q.L.; resources, H.L. (Haoting Liu); software, C.L., X.L. (Xikang Li), Z.T. and M.W.; supervision, H.L. (Haoting Liu); validation, H.L. (Haoting Liu), C.L., X.L. (Xikang Li), Z.T. and M.W.; writing—original draft, H.L. (Haoting Liu); writing—review and editing, H.L. (Haoting Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Foundation of National Key Laboratory of Human Factors Engineering under Grant No. HFNKL2023WW11, the National Natural Science Foundation of China under Grant 62373042, the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515010275, and the Fundamental Research Fund for the China Central Universities of USTB under Grant FRF-BD-19-002A.

Data Availability Statement

The data presented in this study are available on request from the corresponding author, Haoting Liu.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shen, M.; Huang, X.; Zhao, Y.; Wang, Y.; Li, H.; Jiang, Z. Human-like acceleration and deceleration control of a robot astronaut floating in a space station. ISA Trans. 2024, 148, 397–411. [Google Scholar] [CrossRef]
  2. Ge, X.; Zhou, Q.; Liu, Z. Assessment of space station on-orbit maintenance task complexity. Reliab. Eng. Syst. Saf. 2020, 193, 106661. [Google Scholar] [CrossRef]
  3. Wang, F.; Zhang, L.; Xu, Y.; Wang, K.; Qiao, Z.; Guo, D.; Wang, J. Development of the on-orbit maintenance and manipulation workbench (MMW) for the Chinese space station. Acta Astronaut. 2024, 214, 366–379. [Google Scholar] [CrossRef]
  4. Schmitz, J.; Komorowski, M.; Russomano, T.; Ullrich, O.; Hinkelbein, J. Sixty years of manned spaceflight—Incidents and accidents involving astronauts between launch and landing. Aerospace 2022, 9, 675. [Google Scholar] [CrossRef]
  5. Zhang, R.; Zhang, Y.; Zhang, X. Tracking in-cabin astronauts using deep learning and head motion clues. IEEE Access 2021, 9, 2680–2693. [Google Scholar] [CrossRef]
  6. Ulusoy, U.; Reisman, G.E. What kind of support do astronauts need for maintenance tasks in space habitats? A survey study. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 2–9 March 2024; pp. 1–7. [Google Scholar]
  7. Lin, P.P.; Jules, K. An intelligent system for monitoring the microgravity environment quality on-board the international space station. IEEE Trans. Instrum. Meas. 2002, 51, 1002–1009. [Google Scholar] [CrossRef]
  8. Kwok-Choon, S.T.; Romano, M.; Hudson, J. Orbital hopping maneuvers with Astrobee on-board the international space station. Acta Astronaut. 2023, 207, 62–76. [Google Scholar] [CrossRef]
  9. Yu, J.; Hylan, M.F. Interpretable state-space model of urban dynamics for human-machine collaborative transportation planning. Transp. Res. Part B Methodol. 2025, 192, 103134. [Google Scholar] [CrossRef]
  10. Reig, S.; Fong, T.; Forlizzi, J.; Steinfeld, A. Theory and design considerations for the user experience of smart environments. IEEE Trans. Hum. Mach. Syst. 2022, 52, 522–535. [Google Scholar] [CrossRef]
  11. Anderson, A.; Menguc, Y.; Wood, R.J.; Newman, D. Development of the Polipo pressure sensing system for dynamic space-suited motion. IEEE Sens. J. 2015, 15, 6229–6237. [Google Scholar] [CrossRef]
  12. Jiang, Z.; Xu, J.; Li, H.; Huang, Q. Stable parking control of a robot astronaut in a space station based on human dynamics. IEEE Trans. Robot. 2020, 36, 399–413. [Google Scholar] [CrossRef]
  13. Ciocca, G.; Tschan, H. The enjoyability of physical exercise: Exergames and virtual reality as new ways to boost psychological and psychosocial health in astronauts. A prospective and perspective view. IEEE Open J. Eng. Med. Biol. 2023, 4, 173–179. [Google Scholar] [CrossRef]
  14. Sohail, A.; Nawaz, N.A.; Shah, A.A.; Rasheed, S.; Ilyas, S.; Ehsan, M.K. A systematic literature review on machine learning and deep learning methods for semantic segmentation. IEEE Access 2022, 10, 134557–134570. [Google Scholar] [CrossRef]
  15. Elhassan, M.A.M.; Zhou, C.; Zhu, D.; Adam, A.B.M.; Benabid, A.; Khan, A.; Mehmood, A.; Zhang, J.; Jin, H.; Jeon, S.-W. CSNet: Cross-stage subtraction network for real-time semantic segmentation in autonomous driving. IEEE Trans. Intell. Transp. Syst. 2025, 26, 4093–4108. [Google Scholar] [CrossRef]
  16. Xie, J.; Zhang, Q.; Cui, Z.; Ma, C.; Zhou, Y.; Wang, W.; Shen, D. Integrating eye tracking with grouped fusion networks for semantic segmentation on mammogram images. IEEE Trans. Med. Imaging 2025, 44, 868–879. [Google Scholar] [CrossRef] [PubMed]
  17. Zannah, R.; Bashar, M.; Mushfiq, R.B.; Chakrabarty, A.; Hossain, S.; Jung, Y.J. Semantic segmentation on panoramic dental X-ray images using U-Net architectures. IEEE Access 2024, 12, 44598–44612. [Google Scholar] [CrossRef]
  18. Hedrich, K.; Hinz, L.; Reithmeier, E. Damage segmentation on high-resolution coating images using a novel two-stage network pipeline. Aerospace 2023, 10, 245. [Google Scholar] [CrossRef]
  19. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  20. Ma, Y.-J.; Shuai, H.-H.; Cheng, W.-H. Spatiotemporal dilated convolution with uncertain matching for video-based crowd estimation. IEEE Trans. Multimed. 2022, 24, 261–273. [Google Scholar] [CrossRef]
  21. Sussi; Husni, E.; Yusuf, R.; Harto, A.B.; Suwardhi, D.; Siburian, A. Utilization of improved annotations from object-based image analysis as training data for DeepLab V3+ model: A focus on road extraction in very high-resolution orthophotos. IEEE Access 2024, 12, 67910–67923. [Google Scholar] [CrossRef]
  22. Jiang, A.; Gong, Y.; Yao, X.; Foing, B.; Allen, R.; Westland, S.; Hemingray, C.; Zhu, Y. Short-term virtual reality simulation of the effects of space station colour and microgravity and lunar gravity on cognitive task performance and emotion. Build. Environ. 2023, 227, 109789. [Google Scholar] [CrossRef]
  23. Zhou, T.; Porikli, F.; Crandall, D.J.; Gool, L.V.; Wang, W. A survey on deep learning technique for video segmentation. IEEE Trans. Pattern Anal. 2023, 45, 7099–7122. [Google Scholar] [CrossRef] [PubMed]
  24. Souly, N.; Spampinato, C.; Shah, M. Semi supervised semantic segmentation using generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5689–5697. [Google Scholar]
  25. Zhou, Y.; Xu, H.; Zhang, W.; Gao, B.; Heng, P.-A. C3-SemiSeg: Contrastive semi-supervised segmentation via cross-set learning and dynamic class-balancing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7016–7025. [Google Scholar]
  26. Ke, R.; Aviles-Rivero, A.I.; Pandey, S.; Reddy, S.; Schonlieb, C.-B. A three-stage self-training framework for semi-supervised semantic segmentation. IEEE Trans. Image Process. 2022, 31, 1805–1815. [Google Scholar] [CrossRef] [PubMed]
  27. Zhong, Y.; Yuan, B.; Wu, H.; Yuan, Z.; Peng, J.; Wang, Y.-X. Pixel contrastive-consistent semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7253–7262. [Google Scholar]
  28. Chen, Y.; Li, W.; Gool, L.V. ROAD: Reality oriented adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7892–7901. [Google Scholar]
  29. Li, Y.; Yuan, L.; Vasconcelos, N. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6929–6938. [Google Scholar]
  30. Hoyer, L.; Dai, D.; Gool, L.V. DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9914–9925. [Google Scholar]
  31. Carunta, C.; Carunta, A.; Popa, C.-A. Heavy and lightweight deep learning models for semantic segmentation: A survey. IEEE Access 2025, 13, 1145–1165. [Google Scholar] [CrossRef]
  32. Paluri, K.V.; Gupta, A.; Nain, G. DABNet: Hybrid acne classification approach with attention-based transfer learning using Bayesian optimization. In Proceedings of the International Conference on Emerging Systems and Intelligent Computing, Bhubaneswar, India, 8–9 February 2025; pp. 676–681. [Google Scholar]
  33. Yan, L.; Fan, J.; Liu, T.; Wang, H.; Zhao, Z. Automatic labeling of in-situ crystal images based on lightweight network Bisenetv2. In Proceedings of the Chinese Control Conference, Kunming, China, 28–31 July 2024; pp. 7888–7893. [Google Scholar]
  34. Luo, J.-H.; Wang, S.-H.; Hsia, S.-C. Based on ICNet image matching system for gold powder distribution on the surface of the ankle ring. In Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems, Hualien City, Taiwan, 16–19 November 2021; pp. 1–2. [Google Scholar]
  35. Wang, H.; Jiang, X.; Ren, H.; Hu, Y.; Bai, S. SwiftNet: Real-time video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1296–1305. [Google Scholar]
  36. Lee, M.; Kim, M.; Jeong, C.Y. Real-time semantic segmentation on edge devices: A performance comparison of segmentation models. In Proceedings of the International Conference on Information and Communication Technology Convergence, Jeju Island, Republic of Korea, 19–21 October 2022; pp. 383–388. [Google Scholar]
  37. Li, Z.; Tao, R.; Wu, Q.; Li, B. DA-RefineNet: Dual-inputs attention RefineNet for whole slide image segmentation. In Proceedings of the International Conference on Pattern Recognition, Milan, Italy, 10–15 January 2021; pp. 1918–1925. [Google Scholar]
  38. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar]
  39. Xu, D.; Ouyang, W.; Wang, X.; Sebe, N. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 675–684. [Google Scholar]
  40. Yi, W.; Ma, S.; Zhang, H.; Ma, B. Classification and improvement of multi-label image based on VGG16 network. In Proceedings of the International Conference on Information Science, Parallel and Distributed Systems, Guangzhou, China, 22–24 July 2022; pp. 243–246. [Google Scholar]
  41. Jusman, Y. Comparison of prostate cell image classification using CNN: ResNet-101 and VGG-19. In Proceedings of the IEEE International Conference on Control System, Computing and Engineering, Penang, Malaysia, 25–26 August 2023; pp. 74–78. [Google Scholar]
  42. Su, C.-C.; Cormack, L.K.; Bovik, A.C. Oriented correlation models of distorted natural images with application to natural stereopair quality evaluation. IEEE Trans. Image Process. 2015, 24, 1685–1699. [Google Scholar] [CrossRef]
  43. Cheng, L.; Xiong, R.; Wu, J.; Yan, X.; Yang, C.; Zhang, Y.; He, Y. Fast segmentation algorithm of USV accessible area based on attention fast Deeplabv3. IEEE Sens. J. 2024, 24, 24168–24177. [Google Scholar] [CrossRef]
  44. Hou, X.; Zeng, H.; Jia, L.; Peng, J.; Wang, W. MobGSim-YOLO: Mobile device terminal-based crack hole detection model for aero-engine blades. Aerospace 2024, 11, 676. [Google Scholar] [CrossRef]
  45. Veni, P.K.; Gupta, A. Revolutionizing acne diagnosis with hybrid deep learning model integrating CBAM, and capsule network. IEEE Access 2024, 12, 82867–82879. [Google Scholar] [CrossRef]
  46. Marjani, M.; Mahdianpari, M.; Ahmadi, S.A.; Hemmati, E.; Mohammadimanesh, F.; Mesgari, M.S. Application of explainable artificial intelligence in predicting wildfire spread: An ASPP-enabled CNN approach. IEEE Geosci. Remote Sens. 2024, 21, 2504005. [Google Scholar] [CrossRef]
  47. Saint-Guillain, M.; Vanderdonckt, J.; Burny, N.; Pletser, V.; Vaquero, T.; Chien, S.; Karl, A.; Marquez, J.; Wain, C.; Comein, A.; et al. Enabling operator in space station self-scheduling using a robust advanced modelling and scheduling system: An assessment during a Mars analogue mission. Adv. Space Res. 2023, 71, 1378–1398. [Google Scholar] [CrossRef]
  48. Liu, H.; Chen, S.; Zheng, N.; Wang, Y.; Ge, J.; Ding, K.; Guo, Z.; Li, W.; Lan, J. Ground pedestrian and vehicle detections using imaging environment perception mechanisms and deep learning networks. Electronics 2022, 11, 1873. [Google Scholar] [CrossRef]
  49. Zhang, Y.; Dong, Z.; Zhang, K.; Shu, S.; Lu, F.; Chen, J. Illumination variation-resistant video-based heart rate monitoring using LAB color space. Opt. Lasers Eng. 2021, 136, 106328. [Google Scholar] [CrossRef]
  50. Dendi, S.V.R.; Channappayya, S.S. No-reference video quality assessment using natural spatiotemporal scene statistics. IEEE Trans. Image Process. 2020, 29, 5612–5624. [Google Scholar] [CrossRef]
  51. Mathur, A.; Foody, G.M. Multiclass and binary SVM classification implications for training and classification users. IEEE Geosci. Remote Sens. Lett. 2008, 5, 241–245. [Google Scholar] [CrossRef]
  52. Hassanzadeh, T.; Essam, D.; Sarker, R. 2D to 3D evolutionary deep convolutional neural networks for medical image segmentation. IEEE Trans. Med. Imaging 2021, 40, 712–721. [Google Scholar] [CrossRef]
  53. Tani, T.A.; Tesic, J. Advancing retinal vessel segmentation with diversified deep convolutional neural networks. IEEE Access 2024, 12, 141280–141290. [Google Scholar] [CrossRef]
  54. Hu, P.; Li, X.; Tian, Y.; Tang, T.; Zhou, T.; Bai, X.; Zhu, S.; Liang, T.; Li, J. Automatic pancreas segmentation in CT images with distance-based saliency-aware DenseASPP network. IEEE J. Biomed. Health 2021, 25, 1601–1611. [Google Scholar] [CrossRef]
  55. Ullah, R.; Jaafar, J.; Md Said, A.B. Semantic annotation model for objects classification. In Proceedings of the IEEE Student Conference on Research and Development, Kuala Lumpur, Malaysia, 13–14 December 2015; pp. 87–92. [Google Scholar]
  56. Lee, S.W.; Bien, Z. Representation of a fisher criterion function in a kernel feature space. IEEE Trans. Neural Netw. 2010, 21, 333–339. [Google Scholar] [PubMed]
  57. Du, Z.; Liang, Y. Research on image semantic segmentation based on hybrid cascade feature fusion and detailed attention mechanism. IEEE Access 2024, 12, 62365–62377. [Google Scholar] [CrossRef]
  58. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  59. Zeng, H.; Fu, L.; Li, J.; Li, X.; He, X. CAAVM-TransUNet: Integrating context anchor attention with transformer U-Net for single image dehazing. In Proceedings of the International Conference on Computer and Communications, Chengdu, China, 13–16 December 2024; pp. 719–724. [Google Scholar]
  60. Mahboob, Z.; Khan, M.A.; Lodhi, E.; Nawaz, T.; Khan, U.S. Using SegFormer for effective semantic cell segmentation for fault detection in photovoltaic arrays. IEEE J. Photovolt. 2025, 15, 320–331. [Google Scholar] [CrossRef]
  61. Zhang, R.; Chen, J.; Feng, L.; Li, S.; Yang, W.; Guo, D. A refined pyramid scene parsing network for polarimetric SAR image semantic segmentation in agricultural areas. IEEE Geosci. Remote Sens. Lett. 2022, 19, 4014805. [Google Scholar] [CrossRef]
  62. Jun, E.; Jeong, S.; Heo, D.-W.; Suk, H.-I. Medical transformer: Universal encoder for 3-D brain MRI analysis. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 17779–17789. [Google Scholar] [CrossRef] [PubMed]
  63. Sun, Q.; Chao, J.; Lin, W.; Wang, D.; Chen, W.; Xu, Z.; Xie, S. Pixel-wise and class-wise semantic cues for few-shot segmentation in operator in space station working scenes. Aerospace 2024, 11, 496. [Google Scholar] [CrossRef]
  64. Kimura, S.; Yamauchi, M.; Ozawa, Y. Magnetically jointed module manipulators: New concept for safe intravehicular activity in space vehicles. IEEE Trans. Aerosp. Electron. Syst. 2011, 47, 2247–2253. [Google Scholar] [CrossRef]
  65. Mehta, Y.; Baz, A.; Patel, S.K. Semantic segmentation of optical satellite images for the illegal construction detection using transfer learning. Results Eng. 2024, 24, 103383. [Google Scholar] [CrossRef]
  66. Wang, J.; Sun, H.; Zhu, C. Vision-based autonomous driving: A hierarchical reinforcement learning approach. IEEE Trans. Veh. Technol. 2023, 72, 11213–11226. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of astronauts' in-orbit daily work missions in the space station (images sourced from Baidu web search).
Figure 2. Application illustration and computational flowchart of the proposed scene perception system.
Figure 3. Data samples of complex device components captured under different ambient lighting conditions. (a) Data samples captured under dim ambient lighting conditions. (b) Data samples captured under ordinary ambient lighting conditions. (c) Data samples captured under strong ambient lighting conditions.
Figure 4. Schematic diagram of the network structure of classic DeepLabV3+.
Figure 5. Schematic diagram of the network structure of our improved DeepLabV3+.
Figure 6. Schematic diagram of CBAM.
Figure 7. Schematic diagram of DenseASPP: (a) structure of feature extraction network and (b) structure of the DenseASPP module.
Figure 8. Image samples of the experiment dataset. (a) Wearable camera and its application examples. (b) Image samples captured under ordinary ambient lighting conditions. (c) Image samples captured under strong ambient lighting conditions. Numbers 1–8 in (a,b) are the sequence numbers of the images.
Figure 9. Image samples of dim (abnormal) ambient lighting conditions. Numbers 1–8 are the sequence numbers of the images.
Figure 10. Examples of semantic segmentation results using different semantic segmentation networks. Images (a)-1 and (a)-2 are the original images collected under normal lighting conditions. Images (b)-1 and (b)-2 are the manual annotation (ground truth) results for (a)-1 and (a)-2, respectively. Images (c)-1 and (c)-2 show the calculation results of the SegFormer network, (d)-1 and (d)-2 those of UNet, (e)-1 and (e)-2 those of SegNet, (f)-1 and (f)-2 those of PSPNet, (g)-1 and (g)-2 those of the classic DeepLabV3+, and (h)-1 and (h)-2 those of the improved DeepLabV3+ (ours), each for (a)-1 and (a)-2, respectively.
Figure 11. Practical application example of the proposed system and method.
Figure 12. Examples of semantic segmentation results using different training data combinations. Images (a)-1 and (a)-2 are the original images collected under strong and ordinary lighting conditions, respectively. Images (b)-1 and (b)-2 show the calculation results of experiment 1, (c)-1 and (c)-2 those of experiment 2, (d)-1 and (d)-2 those of experiment 3, (e)-1 and (e)-2 those of experiment 4, and (f)-1 and (f)-2 those of experiment 5 (ours), each for (a)-1 and (a)-2, respectively.
Table 1. Definition of imaging luminance degrees.

| Imaging Luminance Condition | Dim | Ordinary | Strong |
|---|---|---|---|
| Ambient lighting condition | ~<50.0 lx | ~>50.0 lx and ~<160.0 lx | ~≥160.0 lx |
| Information entropy | 4.5 > HL > 2.5, 5.2 > Ha > 5.0, and 2.5 > Hb > 2.1 | 5.0 > HL > 4.0, 5.0 > Ha > 4.5, and 2.5 > Hb > 2.1 | 19.0 > HL > 11.5, 5.4 > Ha > 5.2, and 3.3 > Hb > 2.8 |
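For readers who want to reproduce the screening step behind Table 1, the sketch below (not the authors' code) computes per-channel Shannon entropies of the CIELAB L, a, and b components with OpenCV and NumPy and tests a frame against the "dim" band; the file name is hypothetical, and the paper's exact entropy definition and normalization (note the larger HL values reported under strong lighting) may differ.

```python
import cv2
import numpy as np

def channel_entropy(channel: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of one 8-bit image channel."""
    hist, _ = np.histogram(channel, bins=bins, range=(0, 256))
    p = hist.astype(np.float64) / max(hist.sum(), 1)
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def lab_entropies(bgr_image: np.ndarray):
    """Return (HL, Ha, Hb) for the L, a, and b channels of a BGR frame."""
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    L, a, b = cv2.split(lab)
    return channel_entropy(L), channel_entropy(a), channel_entropy(b)

if __name__ == "__main__":
    frame = cv2.imread("forward_view_frame.png")   # hypothetical file name
    h_L, h_a, h_b = lab_entropies(frame)
    # "Dim" band from Table 1; the other two bands can be checked the same way.
    is_dim = (2.5 < h_L < 4.5) and (5.0 < h_a < 5.2) and (2.1 < h_b < 2.5)
    print(f"HL={h_L:.2f}, Ha={h_a:.2f}, Hb={h_b:.2f}, dim={is_dim}")
```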
Table 2. Classification accuracy comparison of environment lighting conditions using different SVM kernel functions.

| Kernel Function | Polynomial Function | Linear Function | Radial Basis Function |
|---|---|---|---|
| Classification accuracy | 0.75 | 0.875 | 0.8929 |
Table 3. Subjective and objective imaging brightness evaluation results of Figure 8 and Figure 9.

| Image Name | HL | Ha | Hb | SVM Output * | Subjective Imaging Luminance Evaluation Result |
|---|---|---|---|---|---|
| Figure 8a-1 | 4.32 | 4.85 | 2.29 | P | Normal |
| Figure 8a-2 | 4.24 | 4.52 | 2.12 | P | Normal |
| Figure 8a-3 | 4.15 | 4.61 | 2.16 | P | Normal |
| Figure 8a-4 | 4.36 | 4.88 | 2.29 | P | Normal |
| Figure 8a-5 | 4.30 | 4.85 | 2.29 | P | Normal |
| Figure 8a-6 | 4.40 | 4.58 | 2.17 | P | Normal |
| Figure 8a-7 | 4.09 | 4.63 | 2.16 | P | Normal |
| Figure 8a-8 | 4.01 | 4.92 | 2.47 | P | Normal |
| Figure 8b-1 | 17.19 | 5.32 | 3.27 | P | Normal |
| Figure 8b-2 | 18.04 | 5.32 | 3.28 | P | Normal |
| Figure 8b-3 | 18.53 | 5.34 | 3.28 | P | Normal |
| Figure 8b-4 | 14.41 | 5.28 | 2.88 | P | Normal |
| Figure 8b-5 | 16.20 | 5.29 | 2.99 | P | Normal |
| Figure 8b-6 | 13.13 | 5.32 | 2.82 | P | Normal |
| Figure 8b-7 | 13.35 | 5.33 | 2.85 | P | Normal |
| Figure 8b-8 | 11.75 | 5.26 | 2.83 | P | Normal |
| Figure 9-1 | 3.25 | 5.08 | 2.43 | N | Abnormal |
| Figure 9-2 | 3.83 | 5.05 | 2.40 | N | Abnormal |
| Figure 9-3 | 3.26 | 5.08 | 2.45 | N | Abnormal |
| Figure 9-4 | 3.64 | 5.06 | 2.30 | N | Abnormal |
| Figure 9-5 | 3.69 | 5.07 | 2.20 | N | Abnormal |
| Figure 9-6 | 3.82 | 5.07 | 2.14 | N | Abnormal |
| Figure 9-7 | 2.87 | 5.02 | 2.12 | N | Abnormal |
| Figure 9-8 | 3.87 | 5.14 | 2.48 | N | Abnormal |
* In this table, the symbol ‘P’ means the output of SVM is positive, i.e., the imaging quality is normal; the symbol ‘N’ indicates the output of SVM is negative, i.e., the imaging quality is abnormal.
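As a hedged illustration of how the P/N decision in Table 3 could be produced, the following scikit-learn sketch trains an RBF-kernel SVM (the best-performing kernel in Table 2) on (HL, Ha, Hb) features; the few training rows are borrowed from Table 3 for demonstration only and are not the authors' full training set.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Tiny demonstration set: (HL, Ha, Hb) rows taken from Table 3,
# labelled +1 for normal ("P") and -1 for abnormal/dim ("N") lighting.
X = np.array([
    [4.32, 4.85, 2.29],    # Figure 8a-1, normal
    [4.01, 4.92, 2.47],    # Figure 8a-8, normal
    [17.19, 5.32, 3.27],   # Figure 8b-1, normal
    [11.75, 5.26, 2.83],   # Figure 8b-8, normal
    [3.25, 5.08, 2.43],    # Figure 9-1, abnormal
    [2.87, 5.02, 2.12],    # Figure 9-7, abnormal
])
y = np.array([1, 1, 1, 1, -1, -1])

# The RBF kernel scored best in Table 2 (0.8929 vs. 0.875 linear and 0.75 polynomial).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale", C=1.0))
clf.fit(X, y)

sample = np.array([[3.83, 5.05, 2.40]])    # features of a candidate frame (Figure 9-2)
print("P (normal)" if clf.predict(sample)[0] == 1 else "N (abnormal)")
```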
Table 4. The ablation experiment results of the proposed DeepLabV3+ network.

| Experiment ID | Network Module (CFF, CBAM, DenseASPP) | Accuracy/% | Precision/% | MIoU/% |
|---|---|---|---|---|
| 1 | × × × | 97.69 | 96.57 | 91.4 |
| 2 | × × | 97.68 | 96.64 | 91.34 |
| 3 | × × | 97.76 | 96.74 | 91.62 |
| 4 | × × | 97.71 | 96.7 | 91.45 |
| 5 | × | 97.71 | 96.78 | 91.48 |
| 6 | × | 97.72 | 96.7 | 91.51 |
| 7 | | 97.73 | 96.68 | 91.53 |
| 8 (Ours) | × | 97.76 | 96.7 | 91.66 |
Table 5. Semantic segmentation comparison experiments using different computational modules.

| Experiment Type | Accuracy/% | Precision/% | MIoU/% | Time Consumption |
|---|---|---|---|---|
| Experiment 1 | Overfitting | Overfitting | Overfitting | 1.63 s |
| Experiment 2 | 94.96 | 97.19 | 89.45 | 1.52 s |
| Experiment 3 | 95.02 | 97.11 | 89.19 | 1.65 s |
| Experiment 4 | 95.19 | 97.23 | 89.52 | 1.60 s |
| Experiment 5 | 94.88 | 97.19 | 89.39 | 1.58 s |
| Experiment 6 | 94.94 | 97.21 | 89.47 | 1.64 s |
| Experiment 7 | 95.13 | 97.21 | 89.51 | 1.68 s |
| Experiment 8 (Ours) | 95.05 | 97.24 | 89.59 | 1.74 s |
Table 6. Comparison experiment results of different semantic segmentation networks.

| Experiment Type | Accuracy/% | Precision/% | MIoU/% | Time Consumption |
|---|---|---|---|---|
| SegFormer | 93.95 | 80.61 | 73.14 | 1.76 s |
| UNet | 95.97 | 93.04 | 83.94 | 1.84 s |
| SegNet | 94.65 | 93.67 | 87.43 | 2.46 s |
| PSPNet | 93.17 | 78.36 | 70.83 | 1.67 s |
| Classic DeepLabV3+ | 97.03 | 96.35 | 88.73 | 1.63 s |
| Improved DeepLabV3+ (Ours) | 97.27 | 96.57 | 89.86 | 1.74 s |
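For reference, the accuracy, precision, and MIoU columns in Tables 4–6 follow the usual pixel-level definitions; the helper below is a plain sketch of those standard formulas (the authors' exact averaging and class handling may differ), applied to a hypothetical 3-class confusion matrix.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Pixel accuracy, mean per-class precision, and mean IoU from a confusion
    matrix where conf[i, j] counts pixels of class i predicted as class j."""
    conf = conf.astype(np.float64)
    tp = np.diag(conf)
    pred_totals = conf.sum(axis=0)   # column sums: predicted pixels per class
    true_totals = conf.sum(axis=1)   # row sums: ground-truth pixels per class
    accuracy = tp.sum() / conf.sum()
    precision = np.mean(tp / np.maximum(pred_totals, 1e-12))
    miou = np.mean(tp / np.maximum(true_totals + pred_totals - tp, 1e-12))
    return accuracy, precision, miou

# Hypothetical 3-class confusion matrix, for illustration only.
conf = np.array([[900, 30, 10],
                 [20, 850, 40],
                 [15, 25, 700]])
acc, prec, miou = segmentation_metrics(conf)
print(f"Accuracy={acc:.4f}, Precision={prec:.4f}, MIoU={miou:.4f}")
```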
Table 7. Comparison experiment results using different equipment maintenance guidance methods.

| Experiment Type | Average Operation Time/s | Longest Operation Time/s | Shortest Operation Time/s |
|---|---|---|---|
| Experiment without the application of our proposed system and method | ~140.0 | ~164.0 | ~123.0 |
| Experiment with the application of our proposed system and method | ~25.0 | ~32.0 | ~21.0 |
Table 8. The semantic segmentation results using different ratios of environmental light intensity data.

| Experiment Type | Accuracy/% | Precision/% | MIoU/% |
|---|---|---|---|
| Experiment 1 | 97.87 | 96.72 | 91.71 |
| Experiment 2 | 98.04 | 96.76 | 91.56 |
| Experiment 3 | 97.86 | 96.36 | 91.33 |
| Experiment 4 | 97.92 | 96.71 | 91.65 |
| Experiment 5 | 98.07 | 97.05 | 92.11 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
