Article

Automatic Assembly Inspection of Satellite Payload Module Based on Text Detection and Recognition

1 China Academy of Space Technology (Xi’an), Xi’an 710100, China
2 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(12), 2423; https://doi.org/10.3390/electronics14122423
Submission received: 11 April 2025 / Revised: 29 May 2025 / Accepted: 11 June 2025 / Published: 13 June 2025
(This article belongs to the Special Issue Real-Time Computer Vision)

Abstract

The payload module of a high-throughput satellite involves the complex assembly of various components, which plays a vital role in maintaining the satellite’s structural and functional integrity. Inspections during the assembly process are therefore essential for minimizing human error, reducing inspection time, and ensuring adherence to design specifications. However, the current inspection process is entirely manual: it requires substantial manpower and time and is prone to errors such as missed or false detections, which compromise the overall effectiveness of the inspection. To enhance the inspection efficiency and accuracy of the payload module in high-throughput satellites, this paper proposes a framework for text detection and recognition targeting diamond labels, R-hole labels, and interface labels within payload module images. Detecting and recognizing the text labels on products in the high-throughput satellite payload module provides a means to determine each product’s assembly state and the correctness of its connection relationships with the waveguides/cables. The framework consists of two key components: a copy–paste data augmentation method, which generates synthetic images by overlaying foreground images onto background images, and a text detection and recognition model incorporating a dual decoder. The detection accuracy on the simulated payload module data reached 87.42%, while the operational efficiency improved significantly, reducing the inspection time from 5 days to just 1 day.

1. Introduction

The high-throughput satellite payload module [1,2,3] contains thousands of payload products, each with multiple interfaces. These products are connected through waveguides and cables that convey signals and power. The assembly [4,5] of these payload products is directly related to the safety and normal operation of the satellite. Due to the intricate connection relationships and high layout density, conducting efficient and accurate inspections is essential to ensure the overall reliability of the system [6]. In addition, the development of high-throughput satellites and their digital payload technologies plays a critical role in supporting the future 6G network infrastructure, enabling wide-area, low-latency, and high-capacity communications from space. With the increasing capacity of satellite communications and the rapid advancement of mass-produced models like SatNet, traditional manual inspection methods are no longer sufficient to meet the demands of inspecting satellite payload modules. These high-throughput modules comprise numerous individual products and intricate connections, resulting in an enormous final assembly inspection workload that demands significant manpower and time. Moreover, manual inspections are prone to errors, such as missed or false detections, which compromise the overall reliability and effectiveness of the inspection process.
A labeling system is used on the products within the high-throughput satellite payload module. Using the textual information on these labels, automated inspection of the payload module assembly can be achieved. To improve inspection efficiency and accuracy, computer vision technology [7,8,9] can locate these labels rapidly and precisely in the captured images.
Although some deep learning-based text label recognition algorithms [10,11,12] are available, the existing datasets [13] are not specific to satellite payload module labels. In addition, the dataset of product labels is very limited: the images captured of the payload module fall far short of the tens of thousands of images typically required by detection and recognition algorithms. Furthermore, the text characters in the captured images may be deformed by image distortion, rotation, or perspective transformation [14,15,16,17]. In this case, it is difficult for conventional text detection algorithms to achieve effective detection, which greatly affects the effectiveness of the inspection work.
To overcome these challenges, this paper proposes an automatic assembly inspection framework based on text information and employs a copy–paste augmentation method, in which a foreground image block is pasted onto a random position within an unlabeled image to create a composite image. This approach generates a large amount of simulated data for training, addressing the challenge of small sample sizes. The trained model is then fine-tuned using a small amount of actually collected data to enable intelligent inspection of the satellite payload module assembly. Our proposed method greatly improves detection accuracy and efficiency: the inspection achieved an accuracy rate of 87.42% on the relevant payload module data, and the inspection time was reduced from 5 days to 1 day.
The contributions of our work are as follows:
  • We proposed an automatic assembly inspection framework for the high-throughput satellite payload module based on text information.
  • We developed a copy–paste data augmentation method that generates synthetic images by overlaying foreground images onto background images. Combined with an end-to-end text detection and recognition model, the proposed method enables highly accurate detection performance.
  • We conducted experiments on images captured within the specified payload module, achieving an accuracy rate of 87.42% and surpassing the existing methods.

2. Related Work

Section 2.1 introduces how assembly inspection of the high-throughput satellite payload module is currently carried out. Section 2.2 reviews related approaches to text detection and recognition.

2.1. Inspection of the Assembly of the High-Throughput Satellite Payload Module

Inspection of the assembly of individual products and connected waveguides and cables in the satellite payload module is a crucial task that includes the following two aspects. One is to inspect whether the assembly position and orientation of the individual product are correct. The other is to inspect whether the connection relationships between the individual products and connected waveguides and cables are correct.
There is a label system mainly consisting of diamond labels, R-hole labels, and interface labels on the products of the high-throughput satellite payload module, as shown in Figure 1. With the help of the text information on these labels, automatic assembly inspection of payload module products via these labels is feasible. Among them, the diamond label indicates the identity information of the individual product, the R-hole label indicates the position of the reference hole and the assembly orientation of the individual product, and the interface label indicates the waveguide/cable connected to the individual product interface.
Currently, inspectors mainly use the following methods to inspect the position of the individual product and its connection relationships with the waveguides and cables. During the individual product inspection process, the inspector will carefully check each product against the CAD model to ensure that it is installed correctly in the desired position. As shown in Figure 2, by comparing the labels on the individual products with the markings in the Computer-Aided Design (CAD) model, the correctness of the individual products can be determined.
During the cable/waveguide connection relationship inspection process, inspectors mainly complete the inspection by comparing the interface labels of the individual product with the interface identification labels [18] on the cables/waveguides indicating the current connection relationships. To effectively record the inspection process, the following measures need to be taken. First, photos of each product and its connected waveguides and cables are taken at different inspection stages, and the images are classified and organized for retention. Second, the correctness of the assembly position of the individual product and the correctness of the connection relationships between the waveguides and the cables are judged. Once a problem with the assembly or connection is found, it should be handled in time, recorded, and reported.
However, the entire inspection process currently depends on manual visual inspection and judgment. Given the large number of individual products contained in the high-throughput satellite payload module and the complex connection relationships, the inspection workload is huge. In this study, we introduce artificial intelligence technology to achieve automatic and intelligent inspection of satellite payload module assembly, reducing the requirement for manual photography and identification.

2.2. Text Detection and Recognition

Considering that text labels carry substantial information, we adopt text detection and recognition technology for automatic assembly inspection. With this technology, we can quickly and accurately locate and read the diamond labels, R-hole labels, and interface labels in the images and automatically compare them with information from the CAD model to verify the correctness of assembly positions and connection relationships. This not only greatly reduces the time and error rate of manual inspection but also significantly improves overall inspection efficiency and accuracy.
Text detection and recognition are machine vision tasks [19] whose goal is to identify and locate text areas in images or videos. Text detection and recognition mainly comprise two types of approaches: two-stage methods and end-to-end methods. Two-stage methods first perform text detection to localize text regions, and then apply a separate recognition model to recognize the detected text. This modular pipeline allows each task to be optimized independently but often leads to increased computational cost and suboptimal overall performance due to error propagation between stages.
Unlike general objects, however, text often appears in irregular shapes with varying aspect ratios, which poses additional challenges. To tackle these challenges, methods such as TextBoxes [10] modified convolutional kernels and anchor boxes to better capture diverse text shapes, while DMPNet [11] introduced quadrilateral sliding windows to improve localization. More recently, the Rotation-Sensitive Regression Detector (RSDD) [12] actively rotates convolutional filters to fully exploit rotation-invariant features. Despite their strong performance in certain scenarios, these methods typically focus on detection alone and do not jointly perform text recognition, which can limit both efficiency and accuracy.
To overcome these limitations, end-to-end methods have been developed that integrate text detection and recognition into a unified framework. By enabling joint optimization, these approaches improve both efficiency and accuracy, directly producing recognized text sequences along with their locations. Our proposed method falls into this end-to-end category. The core idea of the method is to integrate the detection and recognition tasks into one model so that the entire process from input to output is integrated without manually splitting it into independent detection and recognition steps. Models like ABINet [14] and Text Perceptron E2E [15] integrate popular detection and recognition methods into a unified end-to-end framework. SwinTextSpotter [16] utilizes a unified model to approach the recognition task as a semantic segmentation problem. Training with an integrated recognition module helps the text detector become more robust against text-like background clutter.
Although these methods outperform conventional approaches, they are primarily designed for typical text images and demand a large number of training samples. For high-throughput satellite payload modules, the amount of captured image data is small compared to the tens of thousands of images required by conventional text datasets, so existing end-to-end methods cannot directly achieve good performance. In addition, the captured images are often deformed, making it difficult to apply existing end-to-end methods to the inspection of high-throughput satellite payload modules.

3. Materials and Methods

Our goal is to develop an automated assembly inspection framework. To address the challenges of limited data and text deformation, we have designed the following framework. Our framework consists of two parts. First, we design the copy–paste enhancement method on simulated data to solve the problem of scarce training samples. Then, we design a text detection and recognition method, train the model with the generated large amount of simulated data, and fine-tune the trained model based on a small amount of actual collected data to reduce the impact of data distribution differences on model performance. Overall, this section will be divided into two parts, which will discuss in detail the data augmentation method used to generate data and briefly introduce the text detection and recognition method used for training. The overall flow chart is shown in Figure 3.

3.1. Copy–Paste Enhancement Method

The limited number of training samples hinders effective model training. Inspired by the work of Ghiasi et al. [20], we construct a copy–paste enhancement method in which a foreground image block is pasted onto a random position within an unlabeled image to create a composite image. The method calculates the valid pasting area for the foreground image, generates a random pasting position, and overlays the foreground image onto the background image to produce a composite image. The flow chart of our method is shown in Figure 4.
We begin by randomly selecting a background image. Next, we resize the foreground image to create multiple versions of varying sizes, which enhances the diversity of the dataset: each foreground image can be scaled to a different size to simulate different foreground object sizes. The formula is as follows:
$(w_f', h_f') = (w_f \times S,\ h_f \times S)$
where $w_f'$ and $h_f'$ are the width and height of the resized foreground image, $w_f$ and $h_f$ are the width and height of the original foreground image, and $S$ is a randomly varying scaling factor between 1.25 and 3, chosen based on camera distance and empirical validation. Given that the captured text is already small due to the camera’s distance from the payload, further downscaling would harm recognition accuracy. Thus, we focus on enlarging the text regions.
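As a minimal illustration of this resizing step, the sketch below assumes OpenCV and NumPy are available; the function name and interpolation choice are ours, not taken from the paper's implementation:

```python
import random
import cv2

def random_upscale(foreground, s_min=1.25, s_max=3.0):
    """Resize a foreground crop by a random factor S in [s_min, s_max]."""
    h_f, w_f = foreground.shape[:2]
    s = random.uniform(s_min, s_max)
    new_size = (int(w_f * s), int(h_f * s))  # cv2.resize expects (width, height)
    return cv2.resize(foreground, new_size, interpolation=cv2.INTER_LINEAR), s
```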
We then apply a perspective transformation to the foreground image to change its geometry. Through a random selection process, we identify four transformed points corresponding to the four vertices of the polygon in the foreground image. Utilizing these four points, we compute the perspective transformation matrix H. This matrix H is then used to transform the points in the foreground image to their new positions, effectively changing the geometry of the image. The formula is as follows:
$Q = H \cdot P$
where $Q$ is the transformed coordinate, $H$ is the perspective transformation matrix, and $P$ is the original coordinate (both coordinates expressed in homogeneous form).
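The following sketch shows one way to realize this step with OpenCV's perspective utilities; the jitter magnitude and corner-sampling scheme are illustrative assumptions rather than the exact procedure used in our implementation:

```python
import numpy as np
import cv2

def random_perspective(foreground, max_jitter=0.15):
    """Warp the foreground by a random perspective transform H.

    The four source corners are mapped to randomly jittered destination
    points; max_jitter is a fraction of the image size (an assumed value).
    """
    h, w = foreground.shape[:2]
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    jitter = np.random.uniform(-max_jitter, max_jitter, size=(4, 2)) * [w, h]
    dst = np.float32(np.clip(src + jitter, [0, 0], [w - 1, h - 1]))
    H = cv2.getPerspectiveTransform(src, dst)           # 3x3 homography
    warped = cv2.warpPerspective(foreground, H, (w, h))
    return warped, H, dst  # dst serves as the transformed polygon (Poly)
```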
At the same time, we generate various annotation contents by creating random letters, numbers, and special characters. The annotation positions are adjusted based on the different sizes of the foreground image. These annotations are then added to the foreground image, and the image is updated according to the annotation details. Additionally, we use the new set of points obtained from the perspective transformation to create the polygon boundary vertex array (Poly) for the transformed foreground image.
To ensure that the foreground image fits within the background image, we first calculate the area where the foreground image can be placed using Poly. Specifically, we determine the coordinate range of the bounding box of the foreground image after the perspective transformation. By calculating the width and height of this bounding box, we ensure that the entire foreground image is contained within the boundaries of the background image. The formulas are as follows:
$t_x = \mathrm{random.randint}\big(0,\ \min(w_b - w_t,\ h_b - h_t)\big)$
$t_y = \mathrm{random.randint}\big(0,\ \min(w_b - w_t,\ h_b - h_t)\big)$
where $w_t$ and $h_t$ are the width and height of the foreground image after the perspective transformation, and $w_b$ and $h_b$ are the width and height of the background image. The function $\mathrm{random.randint}(a, b)$ generates a random integer between $a$ and $b$ inclusive. In practice, we ensure that the background image is sufficiently larger than the transformed foreground image, so both $(w_b - w_t)$ and $(h_b - h_t)$ are non-negative and $\min(w_b - w_t,\ h_b - h_t)$ never yields a negative upper bound.
Finally, a random paste position $(t_x, t_y)$ is generated according to the size range of the background image, and the foreground image is pasted onto the background image. The formula is as follows:
$B'(x, y) = \begin{cases} F\big(H^{-1}(x - t_x,\ y - t_y)\big) & \text{if } (x, y) \in \mathrm{Poly} \\ B(x, y) & \text{otherwise} \end{cases}$
where $B'(x, y)$ is the pixel in the composited background image, $B(x, y)$ is the pixel in the original background image, $F(\cdot)$ samples the original (untransformed) foreground image, and $H^{-1}$ is the inverse of the perspective transformation matrix, which maps the background pixel coordinate $(x, y)$ (after subtracting the paste offset $(t_x, t_y)$) back to the corresponding coordinate in the original foreground image. Poly denotes the polygon region occupied by the transformed foreground image after pasting, and $(t_x, t_y)$ is the random pasting position at which the transformed foreground image is placed in the background. Subtracting $(t_x, t_y)$ aligns the background pixel coordinate with the local coordinate system of the perspective-transformed foreground image, enabling correct inverse mapping and pixel sampling.
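A compact sketch of the placement and compositing step described above is given below; it assumes the warped foreground patch and its polygon come from the previous step, and for simplicity it copies pixels from the already-warped patch rather than resampling through $H^{-1}$ (the result is equivalent):

```python
import random
import numpy as np
import cv2

def paste_foreground(background, warped_fg, poly):
    """Paste a perspective-warped foreground at a random valid position.

    poly: (4, 2) array of the warped foreground's polygon vertices,
    given in the warped patch's local coordinates.
    """
    h_b, w_b = background.shape[:2]
    h_t, w_t = warped_fg.shape[:2]
    # Random offset so the warped patch stays inside the background
    # (mirrors the min(w_b - w_t, h_b - h_t) bound used in the text).
    limit = min(w_b - w_t, h_b - h_t)
    t_x = random.randint(0, limit)
    t_y = random.randint(0, limit)
    # Binary mask of the polygon region inside the warped patch.
    mask = np.zeros((h_t, w_t), dtype=np.uint8)
    cv2.fillPoly(mask, [np.int32(poly)], 255)
    out = background.copy()
    roi = out[t_y:t_y + h_t, t_x:t_x + w_t]
    roi[mask > 0] = warped_fg[mask > 0]            # copy only pixels inside Poly
    shifted_poly = np.asarray(poly) + [t_x, t_y]   # annotation in background coords
    return out, shifted_poly
```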
Through the data augmentation method we constructed, we can obtain synthetic data in various styles, so that the model can obtain more training samples of different styles, thereby improving the generalization ability of the model.

3.2. Text Detection and Recognition Method

Since this study is based on the high-throughput satellite payload module, all experiments are carried out in a designated simulated payload module environment to meet confidentiality requirements. To handle text characters that may be deformed by image distortion, rotation, or perspective transformation, we use the model architecture depicted in Figure 3. The architecture comprises three primary components: a ResNet-50 backbone [21], a Transformer Encoder [22], and a dual decoder [23]. The ResNet-50 backbone serves as the initial feature extractor, capturing essential features from the input image. Building upon these feature maps, the Transformer Encoder models global relationships within the data, enhancing the model’s capacity to recognize complex textual structures. The dual decoder comprises two specialized components, a Location Decoder and a Character Decoder, as depicted in Figure 5. By assigning a specific task to each decoder, this design improves the overall performance of the model. These modules are introduced below, and a simplified implementation sketch follows the list.
  • ResNet-50: ResNet-50 is a convolutional neural network that serves as the backbone for feature extraction and is designed to capture high-level image features. Its residual structure enables the network to handle deeper architectures effectively, making it well-suited for complex image recognition tasks. In our implementation, ResNet-50 is applied to the input image to extract key features that are essential for subsequent text detection and recognition tasks.
  • Transformer Encoder: Following the feature extraction by ResNet-50, a Transformer Encoder is utilized. Transformers excel at capturing long-range dependencies in sequential data, which enables the modeling of relationships between different parts of the text within the image. The Transformer Encoder processes the features extracted by ResNet-50, preparing them for the subsequent decoders. This integration enhances the understanding of both location and character information, thereby improving the overall performance of text detection and recognition.
  • Location Decoder: The Location Decoder is tasked with accurately localizing text within the image. It employs queries tailored to each text instance and predicts control points, such as the corners of bounding boxes or vertices of polygons. This method enables precise identification of text regions, even when faced with deformations or distortions, thereby ensuring robust text localization. Specifically, the initial control point queries are input into the Location Decoder. After undergoing multi-layer decoding, the refined control point queries are processed by two heads: a classification head that predicts the confidence score, and a 2-channel regression head that outputs the normalized coordinates for each control point. The predicted control points can represent either the N vertices of a polygon or the control points for Bezier curves [24]. For polygon vertices, we adopt a sequence that begins at the top-left corner and proceeds in a clockwise direction. For Bezier control points, Bernstein Polynomials [25] are employed to construct the parametric curves. In this context, we utilize two cubic Bezier curves [24] for each text instance, corresponding to the two potentially curved sides of the text.
  • Character Decoder: While the Location Decoder focuses on the spatial localization of text, the Character Decoder is dedicated to the actual recognition of the text content. It predicts the characters within each localized text region by leveraging character queries that are learned and aligned with the location queries. This alignment allows the model to handle both text detection and recognition in parallel, ensuring seamless integration of spatial and semantic information. It largely follows the structure of the Location Decoder, with the key difference being that control point queries are replaced by character queries. The initial character queries consist of a learnable query embedding combined with 1D sine positional encoding, and these queries are shared across different text instances. Importantly, the character query and control point query with the same index correspond to the same text instance.
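To make the architecture concrete, the following PyTorch sketch mirrors the components described above (ResNet-50 backbone, Transformer encoder, and the two decoders with their classification, regression, and character heads). Dimensions follow Section 4.3 where stated; the number of control points, maximum text length, and character-set size are illustrative assumptions, and standard attention is used in place of the deformable attention implied by the sampling-point setting.

```python
import torch
import torch.nn as nn
import torchvision

class DualDecoderTextSpotter(nn.Module):
    """Structural sketch: ResNet-50 features -> Transformer encoder ->
    Location Decoder (control points) and Character Decoder (text)."""

    def __init__(self, d_model=256, nhead=8, num_layers=6, num_queries=100,
                 num_ctrl_points=16, max_chars=25, charset_size=97):
        super().__init__()
        resnet = torchvision.models.resnet50()
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024,
                                       dropout=0.1, batch_first=True), num_layers)
        self.loc_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024,
                                       dropout=0.1, batch_first=True), num_layers)
        self.char_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024,
                                       dropout=0.1, batch_first=True), num_layers)
        self.ctrl_queries = nn.Embedding(num_queries, d_model)     # one query per text instance
        self.char_queries = nn.Embedding(max_chars, d_model)       # shared across instances
        self.cls_head = nn.Linear(d_model, 1)                      # instance confidence
        self.point_head = nn.Linear(d_model, num_ctrl_points * 2)  # normalized (x, y) per point
        self.char_head = nn.Linear(d_model, charset_size)          # per-position character logits

    def forward(self, images):
        feat = self.input_proj(self.backbone(images))              # (B, 256, h, w)
        memory = self.encoder(feat.flatten(2).transpose(1, 2))     # (B, h*w, 256)
        B = memory.size(0)
        loc = self.loc_decoder(self.ctrl_queries.weight.expand(B, -1, -1), memory)
        scores = self.cls_head(loc)                                # (B, num_queries, 1)
        points = self.point_head(loc).sigmoid()                    # control points in [0, 1]
        # Simplification: in the full model the character queries are tied to
        # each instance query; here they attend to the shared image memory only.
        chars = self.char_head(
            self.char_decoder(self.char_queries.weight.expand(B, -1, -1), memory))
        return scores, points, chars
```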
Figure 5. Text detection and recognition method framework.

3.3. Training Losses

  • Instance classification loss: We use Focal Loss [26] as the loss function for instance classification. It is used to predict whether a text instance exists. Each query corresponds to a text instance, and the loss is used to classify whether each query correctly identifies a text instance. For the t-th query, the loss is defined as follows:
    $\mathcal{L}_{\mathrm{cls}} = -\alpha \cdot (1 - p_t)^{\gamma} \cdot \log(p_t)$
    where $\alpha$ is a balancing factor used to adjust the importance of positive and negative examples, $p_t$ is the predicted probability of the correct class for the t-th query, and $\gamma$ is the focusing parameter, which down-weights easy examples and focuses the model on hard examples.
  • Control Point Loss: We use Smooth L1 Loss [27] as the control point loss. It is used to regress the control point coordinates of each text instance. The control points define the shape of the text area, usually the four corner points of the text bounding box. The loss is defined as follows:
    $\mathcal{L}_{\mathrm{control}} = \sum_{i} L_{s\_L1}(t_i - v_i)$
    where $t_i$ is the predicted coordinate of the i-th control point, $v_i$ is the ground-truth coordinate of the i-th control point, and $L_{s\_L1}$ is the Smooth L1 loss function, which combines L1 and L2 loss and is less sensitive to large errors than L2 loss.
  • Character classification loss: We use Cross-Entropy Loss [28] as the loss function for character classification. It is used to classify each character in the detected text instance to identify the text content. The loss is defined as follows:
    $\mathcal{L}_{\mathrm{char}} = -\sum_{c} y_c \cdot \log(p_c)$
    where $y_c$ is the true label for the c-th character in the text instance and $p_c$ is the predicted probability of the c-th character. The summation runs over all possible characters in the detected text instance.
These three loss functions collaboratively supervise the model from different perspectives. Specifically, the instance classification loss enhances the model’s ability to detect valid text regions by focusing on harder samples; the control point loss preserves the spatial structure of the detected text by refining geometric alignment; and the character classification loss ensures accurate recognition by enforcing consistency with the ground truth content. Together, these complementary losses facilitate end-to-end optimization for robust text spotting, as validated by previous studies such as [23].
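A minimal sketch of how these three losses could be computed and combined is shown below, assuming matched query-target pairs are already available; the loss weights and target layout are illustrative assumptions, not values reported in the paper:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Instance classification loss (Focal Loss) over matched queries.
    logits: (N,) raw scores; targets: (N,) float labels in {0., 1.}."""
    p = logits.sigmoid()
    p_t = p * targets + (1 - p) * (1 - targets)           # probability of the correct class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def control_point_loss(pred_pts, gt_pts):
    """Smooth L1 regression of normalized control-point coordinates."""
    return F.smooth_l1_loss(pred_pts, gt_pts, reduction="mean")

def character_loss(char_logits, gt_chars, ignore_index=-100):
    """Cross-entropy over the character vocabulary at each text position."""
    return F.cross_entropy(char_logits.flatten(0, -2), gt_chars.flatten(),
                           ignore_index=ignore_index)

def total_loss(outputs, targets, w_cls=1.0, w_ctrl=5.0, w_char=1.0):
    # Loss weights and dictionary keys are illustrative assumptions.
    return (w_cls * focal_loss(outputs["scores"], targets["labels"])
            + w_ctrl * control_point_loss(outputs["points"], targets["points"])
            + w_char * character_loss(outputs["chars"], targets["chars"]))
```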

4. Results and Discussion

4.1. Dataset and Evaluation Metric

We used high-resolution cameras mounted on pre-designed guide rails to capture images of all visible surfaces of the products from appropriate angles. Since this study is based on the high-throughput satellite payload module, all experiments were carried out in a designated simulated module environment to meet confidentiality requirements, and images taken of this simulated satellite payload module were used to verify the effectiveness of the proposed framework. For training, we used these images as the base and applied our copy–paste enhancement method to amplify the training samples, yielding a total of 5000 randomly generated images with corresponding labels. Since the assembly status of an individual product can be determined by identifying its text information, verification of the automated assembly is based on the success rate of text recognition for the individual product. For evaluation, we used 159 images of the simulated module, each tested at four angles of 0, 90, 180, and 270 degrees. If the required text is detected at any one of the four angles, the detection is considered successful and can be used for further analysis. The number of successful images is counted, the success rate is calculated, and performance is evaluated by the success rate, as in [29].
We adopt the success rate as the primary evaluation metric to assess the effectiveness of our text spotting model. This metric is defined as the proportion of correctly processed images among all test samples and offers a straightforward and intuitive measure of system performance. Given that the objective of our study is to develop an automated solution for detecting text labels in satellite payload modules, the success rate directly reflects practical performance in inspection scenarios.
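The evaluation protocol can be summarized by the following sketch, where spot_text stands in for the trained model's inference routine (a hypothetical placeholder, not an API from the paper):

```python
def success_rate(images_by_angle, expected_labels, spot_text):
    """Count a test sample as a success if the expected label text is
    read at any of the four capture angles (0/90/180/270 degrees).

    images_by_angle: list of dicts mapping angle -> image for each sample.
    expected_labels: list of the label strings that must be read.
    spot_text: callable returning the set of recognized strings in an image.
    """
    successes = 0
    for views, expected in zip(images_by_angle, expected_labels):
        hit = any(expected in spot_text(img) for img in views.values())
        successes += int(hit)
    return successes / len(images_by_angle)

# e.g. 139 successes out of 159 test samples -> 139 / 159 ≈ 0.8742 (87.42%)
```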

4.2. Comparing Methods

We compare our method with other text recognition and detection methods, including ABINet [14], Text Perceptron E2E [15], SwinTextSpotter [16], and Paddle OCR [30].

4.3. Implementation Details

In this study, we use ResNet-50 as the backbone of our transformer-based detection and recognition network. The model consists of six encoder layers and six decoder layers, each equipped with eight-head multi-head attention. The hidden dimension is set to 256, and the feed-forward network dimension is 1024. We apply a dropout rate of 0.1 across attention and feed-forward layers. The number of object queries is 100, and both the encoder and decoder use four sampling points per head.
All transformations utilize OpenCV’s perspective warp with configurable crop boundaries. For annotation generation, we combine randomized alphanumeric characters (A–Z/0–9) with five template-driven layouts, dynamically scaled via ImageFont (base 33 pt × scale) and spatially transformed.
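A simplified sketch of this annotation-generation step using Pillow is shown below; the label template, character set, and font path are illustrative assumptions rather than the exact templates used in the implementation:

```python
import random
import string
from PIL import Image, ImageDraw, ImageFont

CHARS = string.ascii_uppercase + string.digits  # A-Z / 0-9

def random_label_text():
    """Generate label-like text, e.g. '11108-X02G' (template is illustrative)."""
    return (f"{random.randint(11000, 11999)}-"
            f"{random.choice(string.ascii_uppercase)}"
            f"{random.randint(0, 9)}{random.randint(0, 9)}"
            f"{random.choice(string.ascii_uppercase)}")

def draw_annotation(foreground, scale=1.0, font_path="DejaVuSans.ttf"):
    """Draw a random label onto a PIL foreground image at base 33 pt x scale."""
    text = random_label_text()
    font = ImageFont.truetype(font_path, size=int(33 * scale))
    draw = ImageDraw.Draw(foreground)
    draw.text((5, 5), text, font=font, fill=(0, 0, 0))
    return foreground, text
```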
The model is implemented in PyTorch (version 1.10.1+cu111) and trained on a single NVIDIA RTX 3090 GPU using the AdamW optimizer with a base learning rate of 2 × 10 5 , and a backbone learning rate of 2 × 10 6 . For layers such as reference_points and sampling_offsets, a lower learning rate is applied with a scaling factor of 0.1. Gradient clipping is enabled with a maximum norm of 0.1 to stabilize training. During training, input images are randomly resized between 480 and 896 pixels in height, and cropped with a crop size of [0.1, 0.1]. For testing, the input size is resized to a minimum of 1280 and a maximum of 2080 pixels. All comparison methods follow the same experimental setup as reported in their original papers to ensure fairness.
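The optimizer configuration described above can be set up as in the following sketch; parameter-group matching by name and the weight-decay value are assumptions, while the learning rates, scaling factor, and clipping norm follow the text:

```python
import torch

def build_optimizer(model, base_lr=2e-5, backbone_lr=2e-6, special_lr_scale=0.1):
    """AdamW with separate learning rates for the backbone and for
    reference_points / sampling_offsets layers, as in Section 4.3."""
    special = ("reference_points", "sampling_offsets")
    backbone_params, special_params, other_params = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "backbone" in name:
            backbone_params.append(p)
        elif any(k in name for k in special):
            special_params.append(p)
        else:
            other_params.append(p)
    return torch.optim.AdamW([
        {"params": other_params, "lr": base_lr},
        {"params": backbone_params, "lr": backbone_lr},
        {"params": special_params, "lr": base_lr * special_lr_scale},
    ], lr=base_lr, weight_decay=1e-4)  # weight-decay value is an assumption

# Per-step gradient clipping during training:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
```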

4.4. Results and Analysis

4.4.1. Evaluation of the Proposed Method

The experimental results are summarized in Table 1. The ABINet-PP model achieved a success rate of 54.71%, correctly processing 87 out of 159 images. SwinTextSpotter performed worse, with a success rate of 42.14%, successfully processing 67 out of 159 images. Text Perceptron E2E showed some improvement, with a success rate of 57.23%, handling 91 out of 159 images accurately. PaddleOCR, which is widely used in industrial applications due to its efficiency and robustness in natural scene text recognition, achieved the lowest success rate of 30.19%, correctly processing only 48 images. This result indicates that PaddleOCR struggles with the unique challenges presented by our satellite payload module scenario, such as the dense component layout and varying text orientations.
In comparison, our method significantly outperforms these approaches, achieving a success rate of 87.42% by accurately processing 139 out of 159 images. This demonstrates that the integration of our proposed techniques within the model framework leads to substantial improvements in text detection and recognition performance, particularly in complex satellite payload module environments.
The relevant detection results are shown in Figure 6. A comparison reveals that our method not only detects more text but also demonstrates higher reliability.
Figure 6 presents qualitative comparisons of five methods on five representative samples from the satellite payload module dataset: 11108-X02G, MNC11609-A1, MNC11409-A1, 11102-X01G, and MNC11701-A1 (from top to bottom).
In 11108-X02G, ABINet and Swin fail to recognize several rotated or occluded text regions, while PaddleOCR exhibits obvious false positives. Our method achieves accurate recognition across all instances with correct alignment.
For MNC11609-A1, the dense layout and small text size result in missed detections for ABINet, Swin, and OCR. PaddleOCR incorrectly merges neighboring text regions. Our method successfully separates and identifies each text field.
In MNC11409-A1, the curved and slanted text poses challenges for most methods. OCR and Paddle fail to preserve spatial structure, while our model maintains both integrity and precision in recognition.
In 11102-X01G, occlusion and metallic reflections lead to misreads in ABINet and Paddle. Swin partially succeeds but misses key regions. Our approach handles challenging lighting and preserves complete text content.
Finally, MNC11701-A1 involves diagonal and reflective surface labels. Paddle and ABINet produce misaligned results. Our method shows robustness to orientation and background clutter, achieving full and accurate recognition.
These results are consistent with the quantitative analysis in Table 1, further confirming the effectiveness of our method in complex and cluttered industrial scenarios.
To further understand the limitations of our method, we compare the predicted results with ground truth label names in several failure cases, as shown in Figure 7. In 11104-X02G, the method mistakenly recognizes the label 11104-X02G as BDX-113-2-1, showing that similar-looking characters and confusing background patterns can lead to incorrect recognition rather than simple omission. In 11206-X02G, the detection fails to localize the correct label positions, resulting in missed detection of labels such as 11206-X02G, despite their clear presence. This suggests that densely arranged labels with similar patterns challenge the model’s spatial localization capabilities. In MNC11103-A1, no text is recognized in the entire image, indicating that strong specular reflections and possibly other image quality issues cause the model to completely fail in extracting text features.
These errors highlight three key factors that most frequently lead to recognition failures:
(1) Background confusion, where complex textures or patterns and similar character shapes lead to misrecognition;
(2) Localization errors, typically caused by inaccurate detection in densely packed or compactly arranged text regions;
(3) Reflection issues, where strong lighting reflections or low contrast hinder the visibility of textual features, resulting in complete recognition failure.
Addressing these issues may require enhanced spatial disentanglement, improved localization strategies, and robust reflection normalization methods in future model design.

4.4.2. Ablation Studies on Key Components

We conducted an ablation experiment under the same experimental conditions as before to verify the effectiveness of the proposed data augmentation method. Specifically, we omitted data augmentation and used only the original data collected from the payload module for training. We trained and tested the same model used in our method to ensure consistency. The success rate was used as the evaluation metric to assess the effectiveness of our approach.
The ablation results are summarized in Table 2. Training directly on the original data achieved a success rate of only 58.49%, correctly processing 93 out of 159 images, whereas our method improved the success rate to 87.42%. This significant difference underscores the effectiveness of the proposed data augmentation technique in enhancing model performance.

4.4.3. Improved Assembly Inspection Efficiency

The significant reduction in inspection time—from five days to one day in our work—is based on a systematic experimental evaluation conducted on a payload module comprising 159 individual components. We performed comparative tests between traditional manual inspection and the proposed method. The manual process typically requires approximately five days to complete, whereas the automated method reduces this duration to about one day. This corresponds to an approximate 400% improvement in inspection efficiency for a payload with this number of components.
The time savings were calculated by recording the total inspection duration for both manual and automated processes under controlled conditions. Given that the component count may vary between different payload modules, the reported efficiency gain is specifically representative of modules containing around 159 components.
This substantial time reduction highlights the effectiveness of the automated inspection framework in accelerating the quality assurance process, thereby facilitating more timely and reliable payload module verification.

5. Conclusions

To enhance the inspection efficiency and accuracy of high-throughput satellite payload modules, this paper presents a framework for text detection and recognition of diamond labels, R-hole labels, and interface labels in images captured within the payload module.
To overcome the scarcity of training data and the lack of pre-existing datasets, we propose a copy–paste data augmentation method that generates a large volume of simulated data. This approach effectively enriches the training set, enabling the text detection and recognition algorithms to learn robustly despite limited original samples.
Furthermore, to mitigate the impact of deformed text and product occlusion often present in payload module images, we integrate a text detection model enhanced by a dual decoder architecture. This design improves the model’s ability to accurately localize and recognize distorted or occluded text instances, thus significantly boosting both detection accuracy and operational efficiency.
Tests conducted on 159 images taken within the simulated module demonstrate the effectiveness of our approach. As a result, the accuracy of detection on relevant payload module data reached 87.42%, and the operational efficiency improved from the original 5 days to just 1 day. Since the entire process is digitally implemented, photos can be renamed and managed simultaneously, facilitating subsequent identification and retrieval while enhancing digital management efficiency.
Since our copy–paste data augmentation method can generate a large amount of simulated training data based on the labels in images collected from the simulated module, and considering the regularity of the labeling system defined for the payload module, we hope to extend the core algorithm to the text recognition process of other payload modules in the future.

Author Contributions

Conceptualization, J.L. and W.W.; methodology, J.L. and J.D.; software, J.K.; validation, J.L., J.D. and W.W.; formal analysis, J.L.; investigation, J.L.; resources, J.L.; data curation, J.D.; writing—original draft preparation, J.L.; writing—review and editing, W.W.; visualization, J.K.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by Xi’an’s Key Industrial Chain Core Technology Breakthrough Project under Grant 24ZDCYJSGG0003.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to confidentiality agreements.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tan, X.; Li, X.; Li, Y.; Sun, G.; Niyato, D. Exploration and research on satellite payload technology in boosting the low-altitude economy. Space Electron. Technol. 2025, 22, 1–10. [Google Scholar]
  2. Baghal, L.A. Assembly, Integration, and Test Methods for Operationally Responsive Space Satellites. Master’s Thesis, Air Force Institute of Technology, Wright-Patterson Air Force Base, Dayton, OH, USA, 2010. [Google Scholar]
  3. Reyneri, L.M.; Sansoè, C.; Passerone, C.; Speretta, S.; Tranchero, M.; Borri, M.; Corso, D.D. Design solutions for modular satellite architectures. In Aerospace Technologies Advancements; IntechOpen: London, UK, 2010. [Google Scholar]
  4. Zu, J.; Xiao, P.; Shi, H.; Zhang, X.; Hao, G. Design of Spacecraft Configuration and Assembly. In Spacecraft System Design; CRC Press: Boca Raton, FL, USA, 2023; pp. 221–264. [Google Scholar]
  5. Rong, Y.; Bingyi, T.; Wei, H.; Yu, L.; Xiting, Q.; Cheng, L. Research on Lean and Accurate Method for High Throughput Satellite Payload Modules’ Assembly Process. In Proceedings of the 2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 29–31 August 2024; pp. 48–52. [Google Scholar]
  6. Stanczak, M. Optimisation of the Waveguide Routing for a Telecommunication Satellite. Ph.D. Thesis, ISAE-Institut Supérieur de l’Aéronautique et de l’Espace, Toulouse, France, 2022. [Google Scholar]
  7. Liang, S.; Jin, R.; Li, Y.; Dou, H. A convolutional neural network based method for repairing offshore bright temperature error of radiometer array. Space Electron. Technol. 2024, 21, 39–49. [Google Scholar]
  8. Starodubov, D.; Danishvar, S.; Abu Ebayyeh, A.A.R.M.; Mousavi, A. Advancements in PCB Components Recognition Using WaferCaps: A Data Fusion and Deep Learning Approach. Electronics 2024, 13, 1863. [Google Scholar] [CrossRef]
  9. Cao, W.; Chen, Z.; Wu, C.; Li, T. A Layered Framework for Universal Extraction and Recognition of Electrical Diagrams. Electronics 2025, 14, 833. [Google Scholar] [CrossRef]
  10. Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. Textboxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2017; Volume 31, pp. 4161–4167. [Google Scholar]
  11. Liu, Y.; Jin, L. Deep Matching Prior Network: Toward Tighter Multi-Oriented Text Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1962–1969. [Google Scholar]
  12. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918. [Google Scholar]
  13. Peng, H.; Yu, J.; Nie, Y. Efficient Neural Network for Text Recognition in Natural Scenes Based on End-to-End Multi-Scale Attention Mechanism. Electronics 2023, 12, 1395. [Google Scholar] [CrossRef]
  14. Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7098–7107. [Google Scholar]
  15. Qiao, L.; Tang, S.; Cheng, Z.; Xu, Y.; Niu, Y.; Pu, S.; Wu, F. Text Perceptron: Towards End-to-End Arbitrary-Shaped Text Spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11899–11907. [Google Scholar]
  16. Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. SwinTextSpotter: Scene text spotting via better synergy between text detection and text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4593–4603. [Google Scholar]
  17. Zhou, B.; Wang, X.; Zhou, W.; Li, L. Trademark Text Recognition Combining SwinTransformer and Feature-Query Mechanisms. Electronics 2024, 13, 2814. [Google Scholar] [CrossRef]
  18. Kortmann, M.; Zeis, C.; Meinert, T.; Schröder, K.; Dueck, A. Design and qualification of a multifunctional interface for modular satellite systems. In Proceedings of the 69th International Astronautical Congress, Bremen, Germany, 1–5 October 2018; pp. 1–5. [Google Scholar]
  19. Liu, Y.; Li, Y.; Sun, B.; Zheng, Y.; Ye, Z. Multi-scale heterogeneous and cross-modal attention for remote sensing image classification. Space Electron. Technol. 2024, 21, 57–65. [Google Scholar]
  20. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
  21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  23. Zhang, X.; Su, Y.; Tripathi, S.; Tu, Z. Text Spotting Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9519–9528. [Google Scholar]
  24. Liu, Y.; Chen, H.; Shen, C.; He, T.; Jin, L.; Wang, L. ABCNet: Real-Time Scene Text Spotting with Adaptive Bezier-Curve Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9809–9818. [Google Scholar]
  25. Lorentz, G.G. Bernstein Polynomials; AMS Chelsea Publishing, American Mathematical Society: New York, NY, USA, 2012; Volume 323. [Google Scholar]
  26. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  27. Janocha, K.; Czarnecki, W.M. On Loss Functions for Deep Neural Networks in Classification. arXiv 2017, arXiv:1702.05659. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Sabuncu, M.R. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31, pp. 8778–8788. [Google Scholar]
  29. Bogdanović, M.; Frtunić Gligorijević, M.; Kocić, J.; Stoimenov, L. Improving Text Recognition Accuracy for Serbian Legal Documents Using BERT. Appl. Sci. 2025, 15, 615. [Google Scholar] [CrossRef]
  30. Du, Y.; Li, C.; Guo, R.; Yin, X.; Liu, W.; Zhou, J.; Bai, Y.; Yu, Z.; Yang, Y.; Dang, Q.; et al. Pp-ocr: A practical ultra lightweight ocr system. arXiv 2020, arXiv:2009.09941. [Google Scholar]
Figure 1. Typical satellite products and their labels.
Figure 2. Satellite payload block diagram and its assembly and inspection process.
Figure 3. The proposed framework.
Figure 4. Data augmentation method framework.
Figure 6. Comparisons of five methods on five representative samples from the satellite payload module dataset.
Figure 7. Failure cases on three samples from the satellite payload module dataset.
Table 1. Experiment results.

Method                            Success Images    All Images    Success Rate [%]
ABINet-PP [14]                    87                159           54.71
SwinTextSpotter [16]              67                159           42.14
Text Perceptron E2E (OCR) [15]    91                159           57.23
Paddle OCR [30]                   48                159           30.19
Proposed                          139               159           87.42

Note: Bold font indicates the final or best-performing method.
Table 2. Ablation experiment results.

Method                       Success Images    All Images    Success Rate [%]
Without Data Augmentation    93                159           58.49
Proposed                     139               159           87.42

Note: Bold font indicates the final or best-performing method.