Article

A Relative Orbital Motion-Guided Framework for Generating Multimodal Visual Data of Spacecraft

Graduate School, Space Engineering University, Beijing 101416, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(8), 1177; https://doi.org/10.3390/rs18081177
Submission received: 17 March 2026 / Revised: 10 April 2026 / Accepted: 13 April 2026 / Published: 15 April 2026

Highlights

What are the main findings?
  • A novel, physically grounded framework is proposed to generate large-scale synthetic visual datasets for non-cooperative spacecraft. It incorporates relative orbital motion simulation and end-to-end imaging degradation to produce realistic multimodal data (RGB, mask, depth, normal) with precise 6-DoF pose annotations for four representative spacecraft types.
  • The generated dataset, comprising 8000 high-fidelity samples, effectively bridges the domain gap between synthetic and real on-orbit imagery. It specifically addresses the critical data scarcity issue in spacecraft visual perception, which arises from limited and inaccessible real-world data.
What are the implications of the main findings?
  • This work provides a crucial, open-access data foundation and simulation tool for the development, benchmarking, and validation of data-driven algorithms (e.g., for pose estimation, segmentation, tracking) in on-orbit servicing and space debris removal, potentially accelerating related research and technology maturation.
  • The proposed integrated pipeline establishes a new paradigm for generating physically accurate and task-specific synthetic data in remote sensing by integrating motion physics, multimodal ground truth, and sensor degradation. This paradigm can be adapted to other space and Earth observation applications facing similar data scarcity challenges.

Abstract

The advancement of on-orbit servicing and space debris removal missions has established high-precision visual perception for non-cooperative spacecraft as a key research focus. However, the availability of high-quality, diverse spacecraft image datasets is severely limited due to extreme on-orbit imaging conditions, data confidentiality, and morphological diversity of targets, significantly constraining the advancement of data-driven algorithms in this domain. To address this challenge, we propose a relative orbital motion-guided framework for generating multimodal visual data of spacecraft. The proposed method integrates an orbital dynamics model into the synthetic data generation pipeline to simulate typical relative motion patterns between the camera and the target in a realistic orbital environment, thereby generating image sequences characterized by continuous spatiotemporal evolution. Targeting four representative spacecraft—Tiangong, Spacedragon, ICESat, and Cassini—this work simultaneously generates a dataset comprising 8000 samples, each containing four strictly aligned modalities: RGB images, instance segmentation masks, depth maps, and surface normal maps, along with precise 6-degree-of-freedom (6-DoF) pose ground truth. Furthermore, an end-to-end physical image degradation model is developed to accurately simulate the complete imaging chain—from optical diffraction and aberrations to sensor sampling and noise—thereby effectively narrowing the domain gap between synthetic and real data. By addressing three key aspects—physical motion modeling, synchronous multimodal ground truth, and imaging degradation simulation—this work provides a crucial data foundation for training, testing, and validating data-driven on-orbit perception algorithms.

1. Introduction

With the increasing prevalence of space missions such as on-orbit servicing and space debris removal [1,2,3], 3D reconstruction [4,5,6,7] and robust pose estimation [8,9,10] for non-cooperative spacecraft have become critically important. Accurately determining a target spacecraft’s six-degree-of-freedom (6-DoF) pose is a fundamental prerequisite for autonomous rendezvous and docking, on-orbit maintenance, and active debris removal, directly impacting mission safety and success. However, acquiring visual data of real spacecraft faces multiple challenges: first, extreme on-orbit imaging conditions characterized by high-contrast illumination, rapid relative motion, and complex background interference; second, the typically classified and non-public nature of space mission data; finally, the morphological diversity and structural complexity of non-cooperative targets, often without prior model support. These factors collectively result in a severe scarcity of publicly available, high-quality, and diverse spacecraft image datasets [11], which significantly constrains the development and application of data-driven visual perception algorithms in this domain.
To address these challenges, this paper proposes a relative orbital motion-guided framework for generating multimodal visual data of spacecraft. Our method integrates orbital dynamics models [12,13,14] into the image generation stage to simulate the relative motion between the camera and the target spacecraft in a realistic orbital environment, producing image sequences that exhibit continuous temporal evolution across multiple relative distances and modalities. Focusing on four representative spacecraft targets—Tiangong, Spacedragon, ICESat, and Cassini—our selection is designed to maximize morphological diversity, thereby establishing a benchmark dataset capable of effectively evaluating the generalization ability of perception algorithms: Tiangong represents large, symmetric modular space stations, characterized by structural complexity and severe self-occlusion; Spacedragon represents symmetric, compact crew capsules, which pose challenges due to limited surface texture and the inherent ambiguity in pose estimation caused by symmetry; ICESat represents science satellites equipped with large, flat solar panels, challenging due to their non-centroidal structure and extensive low-texture regions; Cassini represents deep-space probes with large parabolic antennas and deployable booms, whose extreme geometric complexity tests an algorithm’s robustness to irregular structures and protruding components.
Our framework generates four types of strictly synchronized multimodal data for each target, with 500 samples per modality. The resulting dataset comprises 8000 images in total (4 targets × 4 modalities × 500 samples), with all modalities precisely aligned spatially and temporally. By grounding the generation process in physically consistent relative motion patterns, our method ensures that the synthetic data reflects a variety of real mission scenarios, ranging from approach tracking to docking, while simultaneously providing rich multimodal ground truth to support algorithms for pose estimation, instance segmentation, and 3D reconstruction.
The main contributions of this work are as follows:
(1)
Relative motion modeling with orbital constraints: Orbital dynamics equations are integrated into the data generation pipeline to construct physically plausible relative motion trajectories, ensuring the kinematic realism of simulated scenarios.
(2)
Multi-source spacecraft ground truth data generation: RGB images, instance segmentation masks, depth maps, and surface normal maps are rendered synchronously, accompanied by precise 6-DoF pose ground truth, establishing a benchmark for related algorithm research.
(3)
Physics-driven image degradation model: An end-to-end imaging chain model incorporating optical degradation and sensor noise is developed, effectively narrowing the domain gap between synthetic and real satellite imagery, thereby enhancing the generalization capability of models trained on synthetic data for real-world tasks.

2. Related Works

For space manipulation missions, academia and industry have released various spacecraft image datasets. A classic benchmark for spacecraft pose estimation is the Spacecraft Pose Estimation Dataset (SPEED) [15], which was provided for a satellite pose estimation challenge jointly organized by the European Space Agency and Stanford University. This dataset aims to support space object pose estimation tasks and comprises 15,300 images of the Tango satellite. Of these, 15,000 images are synthetic, generated using OpenGL-based optical camera simulation software, while the remaining 300 are real images captured by the Testbed for Rendezvous and Optical Navigation (TRON) at the Space Rendezvous Laboratory (SLAB). SPEED provides images with a resolution of 1920 × 1200, annotated with quaternions and 3D depth information. In 2021, SPEED+ [16] was introduced as the next-generation dataset to address the cross-domain gap in spacecraft pose estimation. It contains 59,960 synthetic images generated by simulation software, split into training and validation sets in an 8:2 ratio. Additionally, it provides test set images captured by the TRON facility under different illumination sources: 6740 images under lightbox illumination and 2791 images under solar lamp illumination. Park et al. introduced the SHIRT dataset [17], which extends SPEED+ by providing continuous image sequences of a target mock satellite in simulated rendezvous scenarios. The dataset encompasses both synthetic images generated with OpenGL and hardware-in-the-loop (HIL) images captured from the TRON facility’s lightbox, sharing the same pose labels for two representative rendezvous orbital element (ROE) sets: ROE1 and ROE2. Subsequently, the authors proposed the SP3ER dataset [18], which contains over 60 distinct types of space objects. For each object, the dataset provides 1000 RGB images along with their corresponding binary masks.
Huang et al. constructed a synthetic Wide Depth Illumination-Composite Dataset (WDICD) [19] for the spacecraft segmentation task. They proposed a pixel-level automatic semantic annotation method and, based on this, developed a structure-aware semantic segmentation framework for spacecraft components. Sam et al. introduced the SwiM spacecraft segmentation dataset [20], which comprises two versions: a baseline version (28,917 images) and an enhanced version (63,917 images), covering a variety of spacecraft configurations ranging from classic satellites to deep-space probes. The dataset’s unique value is reflected in three aspects: it pioneered the simulation of optoelectronic distortion and noise characteristics in space imaging; it achieved comprehensive coverage of illumination conditions, spacecraft poses, and environmental backgrounds by blending real and synthetic data; and all annotations are provided in a unified YOLO-format polygon contour representation, ensuring consistency in algorithm evaluation. Cao et al. generated the Spacecraft-DS dataset [21] through hardware-in-the-loop acquisition for key component detection and segmentation. This dataset encompasses two types of spacecraft, nine key components, three illumination types, and two motion states: approach and hover.
Proença et al. generated the URSO dataset [22] comprising images of the Soyuz and Dragon spacecraft using the Unreal Engine 4 simulator, with an image resolution of 1080 × 960. However, both the SPEED and URSO datasets primarily focus on pose estimation and do not provide any segmentation annotations. To address this, Dung et al. proposed a novel space imagery dataset [23] containing 3117 annotated space-based images of satellites and space stations at a resolution of 1280 × 720. The dataset covers 10,350 components from 3667 spacecraft, with the size of spacecraft in the images varying from as small as 100 pixels to nearly occupying the entire frame. To standardize benchmarking for segmentation methods, the dataset is partitioned into 2516 training images and 600 test images, and is intended for tasks such as spacecraft object detection, segmentation, and component recognition. Hematulin et al. created a synthetic dataset for spacecraft pose estimation [24] using the Unreal Engine 5 simulator. The work provides a detailed algorithm for pose sampling and image generation. The dataset includes images captured under harsh lighting conditions and against high-resolution backgrounds, featuring models of the Dragon, Soyuz, Tianzhou, and the Chang’E-6 ascent vehicle, totaling 40,000 high-resolution images. Each image is annotated with a pose vector representing the spacecraft’s relative position and orientation with respect to the camera.
Hu et al. introduced the SwissCube dataset [25], generated using physics-based rendering. This dataset accounts for the 3D model of the satellite, as well as the influences of star background, the Sun, and the Earth. SwissCube is placed in its actual orbit at approximately 700 km above the Earth’s surface, and an “observer” is positioned in a slightly higher orbit to render image sequences with varying relative velocities, distances, and viewing angles. The experiments generated 500 scenes, with each scene containing a sequence of 100 frames, resulting in a total of 50 k images with a resolution of 1024 × 1024, split into training and test sets in a 4:1 ratio. Bechini et al. proposed a tool for generating spaceborne image datasets [26], capable of producing noise-free synthetic images using ray tracing. Gallet et al. constructed and made publicly available a large-scale dataset, named the RAPTOR dataset [27], comprising 120,000 images. Each image not only provides appearance information but is also accompanied by rich ground truth annotations, including masks, distance maps, celestial positions, and precise camera parameters. The data generation process ensures fine statistical balancing to facilitate the training and evaluation of pose estimation solutions. The method employs a physics-based camera model and meticulously simulates space-specific lighting and radiation characteristics—such as solar flux and spectrum, Earth albedo, and a vacuum environment without atmospheric effects—as well as texture scattering, to produce highly realistic images.
During the 2021 IEEE International Conference on Image Processing, the Interdisciplinary Center for Security, Reliability, and Trust organized the first Spacecraft Recognition competition leveraging space environment knowledge, and provided the SPARK dataset [28] for tasks such as space object classification and detection. The SPARK dataset utilizes realistic 3D models of 10 distinct satellites and spacecraft (along with 5 debris models) to simulate images of space objects. The dataset enhances the fidelity of the simulated data by compositing the rendered targets onto high-resolution Earth background images. Yang et al. [29] constructed the NASA 3D dataset, which includes point clouds, watertight meshes, occupancy fields, and corresponding points, utilizing 3D models available on the official NASA website.
Existing public datasets have made considerable progress in advancing tasks such as object detection [30,31,32], pose estimation [33,34], segmentation [35,36], and relative navigation [37,38]. These datasets offer valuable training resources, yet their generation frameworks are limited by two main shortcomings in theory and practice. First, in terms of motion modeling, current methods primarily depend on static scenes or overly simplified motion trajectories that lack physical realism. They do not incorporate orbital dynamics models and therefore cannot reproduce the kinematically and dynamically consistent relative motion observed in actual on-orbit missions. Second, regarding data completeness, available resources are mostly restricted to RGB images. The absence of synchronized, pixel-level ground-truth data for geometry (depth, normals) and segmentation hinders the development and evaluation of data-driven algorithms in tasks such as 3D reconstruction, accurate pose estimation, and model-based tracking, all of which rely on rich multimodal annotations.

3. Method

This work establishes high-fidelity 3D models of the Tiangong, Spacedragon, ICESat, and Cassini spacecraft [39] based on publicly available 3D model resources. A novel method for generating multimodal spacecraft visual data guided by orbital relative motion is proposed. This method departs from traditional random pose sampling by incorporating an orbital dynamics model, systematically defining four typical relative motion scenarios: elliptical co-orbit, spiral, teardrop, and drift. This approach generates continuous relative motion trajectories that are physically plausible. To achieve high-fidelity imaging simulation, a parameterized camera model is further developed, and stringent physical constraints for space-based imaging are analyzed, including the geometric and optical visibility of the target within the camera’s field of view. Building on this, an automated rendering pipeline is constructed within the Blender platform using its Cycles physically based rendering engine. By configuring physically based shaders and multi-source lighting models, the pipeline synchronously generates per-frame RGB images, instance segmentation masks, depth maps, surface normal maps, and 6-degree-of-freedom (6-DoF) pose ground truth data. The resulting spacecraft image dataset covers a wide range of distances, viewing angles, lighting conditions, and motion patterns, providing a robust data foundation for algorithm training and evaluation in spacecraft visual tasks such as pose estimation and 3D reconstruction.

3.1. Relative Orbital Motion Pattern Modeling

This section establishes a mathematical model of the relative orbital motion between the target spacecraft and the observation spacecraft, which serves as the core foundation for the subsequent analysis of observation constraints and optimization of imaging strategies. The C-W linearized equations used in this paper to simulate spacecraft relative motion apply under the following constraints: the reference orbit must be near-circular, to avoid nonlinear effects arising on highly elliptical orbits; the relative distance between the two spacecraft must be much smaller than the reference orbit radius, so that the relative motion remains in the linear range; and only two-body Keplerian dynamics is considered, ignoring orbital perturbations, which makes the model suitable for short-duration, close-range on-orbit relative motion simulation. To lay a solid basis for the subsequent modeling work, we first describe the observation scenario and define the reference coordinate systems used to characterize the orbital motion states of the two spacecraft. Three key parameters are then defined to describe the trajectory characteristics. Finally, a dimensionless parameter is introduced as the criterion for classifying typical relative motions, and schematic diagrams of the relative motion trajectories are presented.

3.1.1. Scene Description

To characterize the orbital motion states of the target spacecraft and the observation spacecraft, the Earth-centered inertial (ECI) frame is employed. Its origin is located at the center of the Earth, as shown in Figure 1. The $X$-axis points toward the vernal equinox, along the intersection of the Earth’s equatorial and ecliptic planes. The $Z$-axis points toward the North Pole. The $Y$-axis is perpendicular to the $X$-axis, completing the right-handed frame.
The relative motion is described in the Local Vertical, Local Horizontal (LVLH) frame attached to the target spacecraft, with its origin at the target’s center of mass. The $x$-axis is aligned with the geocentric radius vector $r_1$, pointing from the Earth’s center to the spacecraft. The $y$-axis lies within the orbital plane, perpendicular to the $x$-axis, and points in the direction of motion. The $z$-axis is aligned with the orbital angular momentum vector. The mean angular velocity of the reference orbit is $n = \sqrt{\mu/a^3}$, where $\mu = 3.986 \times 10^{14}\ \mathrm{m^3/s^2}$ is Earth’s gravitational parameter and $a$ denotes the orbital semi-major axis.
Furthermore, since the relative distance between the target and observer spacecraft is significantly smaller than the target’s geocentric distance, the relative kinematic equation for the observer spacecraft, namely the Clohessy–Wiltshire (CW) equations [40], can be derived.
$$\ddot{x} - 2n\dot{y} - 3n^2 x = 0, \qquad \ddot{y} + 2n\dot{x} = 0, \qquad \ddot{z} + n^2 z = 0 \quad (1)$$
Here, $(x, y, z)$ and $(\dot{x}, \dot{y}, \dot{z})$ represent the position and velocity coordinates, respectively, of the observer spacecraft in the LVLH frame. The solution to Equation (1) can be obtained by solving the ordinary differential equations, resulting in the following expression:
$$\begin{aligned} x(t) &= \frac{\dot{x}_0}{n}\sin nt - \left(\frac{2\dot{y}_0}{n} + 3x_0\right)\cos nt + 2\left(2x_0 + \frac{\dot{y}_0}{n}\right) \\ y(t) &= 2\left(\frac{2\dot{y}_0}{n} + 3x_0\right)\sin nt + \frac{2\dot{x}_0}{n}\cos nt - 3\left(\dot{y}_0 + 2nx_0\right)t + y_0 - \frac{2\dot{x}_0}{n} \\ z(t) &= \frac{\dot{z}_0}{n}\sin nt + z_0\cos nt \end{aligned} \quad (2)$$
where $(x_0, y_0, z_0, \dot{x}_0, \dot{y}_0, \dot{z}_0)$ denotes the relative position and velocity at the initial time $t_0$. For ease of analysis, the initial time is set to 0.
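The closed-form solution of Equation (2) can be propagated directly. The following is a minimal NumPy sketch; the 700 km reference orbit altitude and the initial state are hypothetical values chosen for illustration, not parameters from the dataset pipeline:

```python
import numpy as np

MU = 3.986e14  # Earth's gravitational parameter, m^3/s^2

def cw_state(t, s0, n):
    """Closed-form Clohessy-Wiltshire solution, Eq. (2).

    t  : time since the initial epoch, s
    s0 : initial relative state (x0, y0, z0, vx0, vy0, vz0) in the LVLH frame
    n  : mean motion of the reference orbit, rad/s
    Returns the relative position (x, y, z) at time t.
    """
    x0, y0, z0, vx0, vy0, vz0 = s0
    s, c = np.sin(n * t), np.cos(n * t)
    x = (vx0 / n) * s - (2 * vy0 / n + 3 * x0) * c + 2 * (2 * x0 + vy0 / n)
    y = (2 * (2 * vy0 / n + 3 * x0) * s + (2 * vx0 / n) * c
         - 3 * (vy0 + 2 * n * x0) * t + y0 - 2 * vx0 / n)
    z = (vz0 / n) * s + z0 * c
    return np.array([x, y, z])

# Mean motion for a hypothetical 700 km circular reference orbit
a = 6371e3 + 700e3
n = np.sqrt(MU / a**3)

# Zero drift rate (vy0 = -2*n*x0) yields a closed 2:1 relative ellipse
x0 = 100.0
s0 = (x0, 0.0, 0.0, 0.0, -2 * n * x0, 0.0)
period = 2 * np.pi / n
# After one orbital period the trajectory returns to its initial point
print(cw_state(period, s0, n))
```

The trajectory generator in our pipeline samples such initial states; the zero-drift condition used above anticipates the closed elliptical pattern discussed in Section 3.1.3.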

3.1.2. Feature Parameterization Formulation

The in-plane (xy-plane) and out-of-plane ($z$-axis) components of the relative motion are separated. The in-plane relative orbital motion pattern (ROMP) of the spacecraft is characterized by four parameters: $x_0$, $y_0$, $\dot{x}_0$, $\dot{y}_0$. To simplify the analysis, these four parameters are processed as follows. According to Equation (2), there exists a time $t_p$ such that $\dot{x}(t_p) = 0$, at which point $x(t_p)$ reaches its maximum value. The state at this moment is defined as the characteristic point [41]: $\dot{x}(t_p) = 0$, $x_p = x_{\max}$.
Here, $x_{\max}$ represents the maximum value of the radial ($x$-axis) motion. Thus, the solution to the C-W equations can be simplified as:
$$x(t) = x_c(t) + x_l(t), \qquad y(t) = y_c(t) + y_l(t), \qquad z(t) = \frac{\dot{z}_p}{n}\sin nt + z_p\cos nt \quad (3)$$
Taking the state of the characteristic point P as new parameters, the solution to the C-W equations can be decomposed into the superposition of an elliptical periodic motion and a linear drift motion.
$$x_c(t) = -\frac{\alpha}{2}\cos nt, \qquad y_c(t) = \alpha\sin nt, \qquad z_c(t) = \frac{\dot{z}_p}{n}\sin nt + z_p\cos nt \quad (4)$$
$$x_l(t) = -\frac{2\beta}{3n}, \qquad y_l(t) = \beta t + \gamma, \qquad z_l(t) = 0 \quad (5)$$
Three key parameters are defined: $\alpha = 2\left(\frac{2\dot{y}_p}{n} + 3x_p\right)$, $\beta = -3\left(\dot{y}_p + 2nx_p\right)$, and $\gamma = y_p$. The parameters $\alpha$, $\beta$, and $\gamma$ are uniformly sampled within the physical ranges permitted by their definitions. Here, $\alpha$ represents the elliptical size, determining the amplitude of the periodic motion; $\beta$ is the drift rate, determining whether the trajectory is closed and its direction of motion: when $\beta = 0$, the trajectory is a closed ellipse, and when $\beta \neq 0$, the trajectory is one of several types of open curves; $\gamma$ is the initial trajectory offset.
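The decomposition of Equations (4) and (5) can be sketched as follows. This is an illustrative NumPy implementation; the characteristic-point values in the usage example are hypothetical, and the signs of $\alpha$, $\beta$, and the periodic terms follow the reconstruction above:

```python
import numpy as np

def decompose(xp, yp, vyp, n):
    """Characteristic-point parameters of the in-plane C-W solution.

    At the characteristic point P the radial velocity vanishes (x_dot = 0),
    so the in-plane state reduces to (xp, yp, vyp).
    """
    alpha = 2 * (2 * vyp / n + 3 * xp)  # elliptical size (along-track amplitude)
    beta = -3 * (vyp + 2 * n * xp)      # along-track drift rate
    gamma = yp                          # initial along-track offset
    return alpha, beta, gamma

def trajectory(t, alpha, beta, gamma, n):
    """Superposition of the periodic ellipse and the linear drift, Eqs. (4)-(5)."""
    xc, yc = -(alpha / 2) * np.cos(n * t), alpha * np.sin(n * t)
    xl, yl = -2 * beta / (3 * n), beta * t + gamma
    return xc + xl, yc + yl

# Hypothetical characteristic-point state
n = 1e-3  # mean motion, rad/s
alpha, beta, gamma = decompose(xp=50.0, yp=10.0, vyp=0.0, n=n)
print(alpha, beta, gamma)  # beta != 0 here, so this trajectory drifts
```

At $t = 0$ the superposition reproduces the characteristic-point position $(x_p, y_p)$, which is a convenient sanity check when sampling $\alpha$, $\beta$, and $\gamma$ for trajectory generation.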

3.1.3. Typical Relative Motion Classification

Based on the formula for $\beta$, a dimensionless parameter $\sigma \triangleq \dot{y}_p/(n x_p)$ can be defined [41], which essentially represents the ratio of the along-track velocity to the radial position at the characteristic point. Different values of $\sigma$ correspond to four typical scenarios, namely the elliptical, spiral, teardrop, and drift trajectories, covering the main categories of relative trajectories. The classification criteria and corresponding dynamic characteristics are summarized in Table 1.
Based on the aforementioned classification, four types of relative motion can be constructed. The three-dimensional and two-dimensional schematic diagrams of their trajectories are shown in Figure 2 below.
The elliptical trajectory forms a closed ellipse: the observation spacecraft moves periodically around the target spacecraft, tracing a standard closed elliptical shape within the orbital plane. The spiral trajectory is a superposition of drift and periodic motion, forming a smooth, non-self-intersecting spiral that drifts along-track either away from or toward the target. The teardrop trajectory forms a waterdrop-like shape in the drift direction, combining periodicity with slight drift and thus exhibiting a distinctive quasi-periodic morphology. The drift trajectory is an approximately linear or gently curved path with minimal oscillation amplitude, sweeping unidirectionally past the target in the along-track direction; it represents the simplest form of relative motion.
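The role of $\sigma$ as a classification criterion can be illustrated through the zero-drift condition that follows from the definitions of $\beta$ and $\sigma$; the specific $\sigma$ intervals separating the spiral, teardrop, and drift cases are those of Table 1 and are not reproduced here. A minimal sketch under these definitions:

```python
def drift_rate(sigma, n, xp):
    """Along-track drift rate expressed through sigma.

    Since beta = -3*(vy_p + 2*n*x_p) and sigma = vy_p / (n * x_p),
    it follows that beta = -3 * n * x_p * (sigma + 2).
    """
    return -3.0 * n * xp * (sigma + 2.0)

def is_closed_ellipse(sigma, tol=1e-12):
    """A closed elliptical trajectory requires zero drift (beta = 0),
    i.e. sigma = -2; any other value yields an open, drifting trajectory."""
    return abs(sigma + 2.0) < tol
```

This makes the boundary case explicit: the closed elliptical pattern occupies a single point of the $\sigma$ axis, while the open patterns (spiral, teardrop, drift) partition the remaining values.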

3.2. Imaging Geometry Principles

This section introduces the camera imaging model, which provides the geometric foundation for pose estimation and visibility analysis. First, the camera imaging model is elaborated; it describes the projective mapping from 3D world points to 2D image coordinates. Second, two common attitude representations used in spacecraft vision are presented: Euler angles and quaternions. Finally, target visibility conditions are examined from two perspectives: geometric visibility determines whether the target is within the camera’s line of sight and not occluded, while optical visibility depends on whether the target surface is sufficiently illuminated to be observed. Both are governed by the relative positions of the target, the camera, and the Sun.

3.2.1. Camera Imaging Model

The camera imaging model is the foundation for target position and pose estimation. The imaging process of a target is illustrated in Figure 3; it can be summarized as the projection of a three-dimensional (3D) point in space onto a two-dimensional (2D) point on the image plane. Before describing the camera imaging model, it is necessary to define four coordinate systems: the world coordinate system $O_W X_W Y_W Z_W$, the camera coordinate system $O_C X_C Y_C Z_C$, the image coordinate system on the physical imaging plane $O_R xy$, and the pixel coordinate system $O_I uv$. In the attitude estimation problem, the rotation matrix $R_c$ is the quantity to be determined.
Typically, the origin $O_C$ of the camera coordinate system is the camera’s optical center. The $X_C$-axis points to the right, the $Y_C$-axis points downward, and the $Z_C$-axis points along the camera’s optical axis. The origin $O_R$ of the image coordinate system is defined as the intersection of the optical axis with the image plane, typically the center of the image plane; the $x$-axis points to the right within the image plane, and the $y$-axis points downward. The origin $O_I$ of the pixel coordinate system is the top-left corner of the image; the $u$-axis points to the right, parallel to the $x$-axis, while the $v$-axis points downward, parallel to the $y$-axis.
Consider a spatial point $P$ whose coordinates in the camera coordinate system are $P_C = (x_C, y_C, z_C)^T$. According to the perspective projection principle, the corresponding image point $p$ on the image plane can be expressed as:
$$x = f\frac{x_C}{z_C}, \qquad y = f\frac{y_C}{z_C} \quad (6)$$
where f is the focal length, defined as the distance from the optical center to the physical image plane. Coordinates in the image plane are expressed in physical dimensions (e.g., meters).
Since an image is composed of discrete pixel elements, the projection on the physical image plane must undergo sampling and quantization, i.e., a transformation from the image coordinate system to the pixel coordinate system. The coordinates of the image point p in the pixel coordinate system are:
$$u = f_x x + u_0, \qquad v = f_y y + v_0 \quad (7)$$
Here, $f_x = 1/d_x$ and $f_y = 1/d_y$, where $d_x$ and $d_y$ are the physical dimensions of a pixel along the $x$ and $y$ axes of the image plane; $f_x$ and $f_y$ are thus the number of pixels per unit physical length, with units of pixel/m, and $(u_0, v_0)$ are the pixel coordinates of the principal point. Equation (7) can be written in matrix form using homogeneous coordinates as:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \quad (8)$$
Therefore, the transformation from the camera coordinate system to the pixel coordinate system can be expressed as:
$$z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_C \\ y_C \\ z_C \end{bmatrix} \quad (9)$$
The middle matrix in Equation (9) is defined as the camera’s intrinsic matrix, denoted $K$. Note that in Equation (9) the focal length is absorbed into the matrix entries, i.e., $f_x = f/d_x$ and $f_y = f/d_y$, so that the focal length is expressed in pixels. The intrinsic matrix is typically provided by the camera manufacturer but can also be obtained through calibration.
In the projection of a spatial point onto an image, the point’s coordinates in the camera coordinate system are used. However, as the camera moves, the coordinates of point $P$ in the camera coordinate system change. Thus, the coordinates of point $P$ in the camera coordinate system are obtained by transforming its coordinates from the world coordinate system $O_W X_W Y_W Z_W$ according to the current camera pose $T$. The camera pose describes the transformation between the camera and world coordinate systems and is represented by a rotation matrix $R$ and a translation vector $t$. In summary, the transformation of a 3D spatial point $P$ from the world coordinate system to the camera coordinate system can be expressed in homogeneous coordinates as:
$$\begin{bmatrix} x_C \\ y_C \\ z_C \\ 1 \end{bmatrix} = T \begin{bmatrix} x_W \\ y_W \\ z_W \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \begin{bmatrix} x_W \\ y_W \\ z_W \\ 1 \end{bmatrix} \quad (10)$$
$T$ is also called the camera’s extrinsic parameter matrix. Then, the process of projecting a 3D spatial point from the world coordinate system to the pixel coordinate system can be expressed as:
$$z_C \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R & t \end{bmatrix} \begin{bmatrix} x_W \\ y_W \\ z_W \\ 1 \end{bmatrix} = K \begin{bmatrix} R & t \end{bmatrix} P_W \quad (11)$$
The transformation between the camera coordinate system and the world coordinate system defines the relative pose and position of the target (in the world coordinate system) with respect to the camera, represented by the rotation matrix R and the translation vector t .
It should be noted that the aforementioned camera imaging model represents an ideal configuration. In practice, however, the manufacturing and assembly processes of camera lenses—along with the handling of light transmission by optical elements and the integration of camera components—cannot achieve perfect precision as assumed in the ideal case. Furthermore, influences from the space environment and the operational lifespan of the camera can introduce deviations from the ideal optical state, leading to lens distortion in the imaging system. Lens distortion causes image points to deviate from their ideal positions, resulting in geometric deformation of the captured imagery. Under such conditions, the ideal imaging model described earlier can no longer accurately characterize the projective geometry. Therefore, distortion correction is required.
Distortion correction can generally be implemented in two ways: (1) applying a global distortion correction to the entire image first, and then conducting subsequent research based on the corrected image; (2) correcting only the specific image points of interest according to the distortion model. In the field of computer vision, approach (1) is more commonly adopted. Once the images have been corrected for distortion, the pinhole camera model can be applied to establish the projection relationship from 3D points in space to the 2D pixel coordinate system. Since this study focuses on the investigation of vision-based estimation algorithms, all images discussed in the subsequent sections are assumed to have undergone distortion correction.
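The projection chain of Equations (6)–(11) can be sketched in a few lines. The following is a minimal NumPy illustration assuming an ideal, already distortion-corrected pinhole camera as in the text; the intrinsic and extrinsic values are hypothetical:

```python
import numpy as np

def project(P_w, K, R, t):
    """Project a 3-D world point to pixel coordinates, Eq. (11):
    z_C * [u, v, 1]^T = K (R @ P_w + t)."""
    P_c = R @ P_w + t        # world -> camera frame (extrinsics)
    if P_c[2] <= 0:
        return None          # point behind the camera, not imaged
    uvw = K @ P_c            # camera frame -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]  # perspective division by z_C

# Hypothetical intrinsics: fx = fy = 1000 px, principal point (640, 512)
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 512.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                  # camera axes aligned with the world frame
t = np.array([0.0, 0.0, 0.0])  # camera at the world origin

# A point 50 m ahead on the optical axis projects to the principal point
print(project(np.array([0.0, 0.0, 50.0]), K, R, t))  # -> [640. 512.]
```

In our pipeline this mapping is what ties the rendered pixels to the 6-DoF pose ground truth: given the pose $(R, t)$ of each frame, any model point can be re-projected and checked against the rendered image.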

3.2.2. Attitude Representation

In spacecraft visual perception, the accurate representation of attitude serves as a critical bridge connecting geometric observation with dynamic control. Attitude describes the rotational state of a rigid body (e.g., the spacecraft body) in space, with common representations including Euler angles, quaternions, rotation matrices, etc. [42]. This section systematically introduces the two most commonly used methods—Euler angles and quaternions.
(1)
Euler Angles
In the aerospace field, a spacecraft's attitude is typically represented by Euler angles obtained by rotating the target's body frame in a zyx sequence, usually describing the attitude of the target's body frame relative to the orbital coordinate system. The corresponding Euler angles are the yaw angle ψ, roll angle φ, and pitch angle θ. When a spatial target is rotated in the zyx sequence, the corresponding rotation matrix R is:
$$
\mathbf{R} =
\begin{bmatrix}
\cos\theta & 0 & \sin\theta \\
0 & 1 & 0 \\
-\sin\theta & 0 & \cos\theta
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 \\
0 & \cos\varphi & -\sin\varphi \\
0 & \sin\varphi & \cos\varphi
\end{bmatrix}
\begin{bmatrix}
\cos\psi & -\sin\psi & 0 \\
\sin\psi & \cos\psi & 0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
\cos\theta\cos\psi + \sin\varphi\sin\theta\sin\psi & -\cos\theta\sin\psi + \cos\psi\sin\varphi\sin\theta & \cos\varphi\sin\theta \\
\cos\varphi\sin\psi & \cos\varphi\cos\psi & -\sin\varphi \\
-\cos\psi\sin\theta + \cos\theta\sin\varphi\sin\psi & \sin\theta\sin\psi + \cos\theta\cos\psi\sin\varphi & \cos\varphi\cos\theta
\end{bmatrix}
$$
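As a numerical cross-check of the composition above, the following minimal NumPy sketch builds the matrix in the same order (the function names are ours, not from the paper); the result can be verified to be orthonormal with unit determinant:

```python
import numpy as np

def rot_z(psi):
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def euler_zyx_matrix(yaw, pitch, roll):
    # Matches the composition used in the text: R = Ry(theta) Rx(phi) Rz(psi)
    return rot_y(pitch) @ rot_x(roll) @ rot_z(yaw)
```

For any angles, `euler_zyx_matrix` reproduces the expanded matrix entry by entry (e.g., the (2,1) entry equals cos φ sin ψ).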
(2)
Quaternions
An attitude quaternion is defined as $q = q_0 + q_1 i + q_2 j + q_3 k$, subject to the unit-norm constraint $q_0^2 + q_1^2 + q_2^2 + q_3^2 = 1$. Given a unit rotation axis vector $\mathbf{n} = (n_x, n_y, n_z)$ and a rotation angle $\theta$ about that axis, the corresponding quaternion is:
$$
\mathbf{q} = \begin{bmatrix} \cos\dfrac{\theta}{2} & n_x\sin\dfrac{\theta}{2} & n_y\sin\dfrac{\theta}{2} & n_z\sin\dfrac{\theta}{2} \end{bmatrix}^{\mathrm{T}}
$$
Therefore, the rotation matrix corresponding to the quaternion can be written as:
$$
\mathbf{R} =
\begin{bmatrix}
1 - 2q_2^2 - 2q_3^2 & 2(q_1 q_2 + q_3 q_0) & 2(q_1 q_3 - q_2 q_0) \\
2(q_1 q_2 - q_3 q_0) & 1 - 2q_1^2 - 2q_3^2 & 2(q_2 q_3 + q_1 q_0) \\
2(q_1 q_3 + q_2 q_0) & 2(q_2 q_3 - q_1 q_0) & 1 - 2q_1^2 - 2q_2^2
\end{bmatrix}
$$
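A small sketch of these two formulas, written exactly in the sign convention printed above (note that this is the direction-cosine, i.e., transposed, form of the more common active rotation matrix):

```python
import numpy as np

def axis_angle_quat(axis, angle):
    """Quaternion (q0, q1, q2, q3) for a rotation of `angle` about `axis`."""
    axis = np.asarray(axis, float)
    axis = axis / np.linalg.norm(axis)
    h = angle / 2.0
    return (np.cos(h), *(np.sin(h) * axis))

def quat_to_dcm(q0, q1, q2, q3):
    """Rotation matrix in the convention used in the text."""
    return np.array([
        [1 - 2*q2**2 - 2*q3**2, 2*(q1*q2 + q3*q0),     2*(q1*q3 - q2*q0)],
        [2*(q1*q2 - q3*q0),     1 - 2*q1**2 - 2*q3**2, 2*(q2*q3 + q1*q0)],
        [2*(q1*q3 + q2*q0),     2*(q2*q3 - q1*q0),     1 - 2*q1**2 - 2*q2**2],
    ])
```

For a rotation of θ about the z-axis, this yields cos θ on the diagonal and +sin θ in the (1,2) entry, i.e., the transpose of rot-z, which is consistent with the passive (frame-transformation) convention common in attitude work.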
Euler angles are intuitive and well suited to static coordinate transformations, but they suffer from the gimbal-lock singularity; quaternions are singularity-free and better suited to dynamic attitude control. Accordingly, mainstream spacecraft visual datasets predominantly adopt quaternions as the standard representation for attitude annotation, as they more effectively capture the continuous nature of attitude changes.

3.3. Visibility Analysis

The prerequisite for characterizing a space object based on space-based visible light imagery is the acquisition of valid space-based images. Considering the unique characteristics of the space environment and the orbital motion patterns of space objects, it is essential to determine the target’s visibility before a space-based imaging device can capture an image. This means verifying whether, at the moment of imaging, the lighting conditions and the geometric positions of the space-based imager and the target satisfy the imaging constraints. Therefore, this section will briefly discuss the visibility conditions for space-based observation of objects from two aspects: geometric visibility and optical visibility.

3.3.1. Geometric Visibility

Geometric visibility refers to the condition that the space-based observation platform and the space target are mutually visible, i.e., the line of sight between them is not occluded by the Earth. A schematic diagram of the observation geometry between the platform and the targets is shown in Figure 4.
As can be clearly seen from the figure, Target A is visible, Target B is in the critical region, and Target C is not visible. Let r C be the geocentric radius vector of the observation platform (camera), r T be the geocentric radius vector of the observation target, r T C be the distance vector from the observation target to the camera, θ be the angle between the line from the camera to the Earth’s center and the line from the camera to the target spacecraft (ranging from 0° to 180°), and R E be the Earth’s radius. The condition for geometric visibility between the observation platform and the target is:
$$
\cos\theta = \frac{\mathbf{r}_C \cdot \mathbf{r}_{TC}}{\left\|\mathbf{r}_C\right\| \left\|\mathbf{r}_{TC}\right\|}, \qquad \cos\theta < \frac{\sqrt{r_C^2 - R_E^2}}{r_C}
$$
Therefore,
$$
\mathbf{r}_C \cdot \mathbf{r}_{TC} < \sqrt{r_C^2 - R_E^2}\,\left\|\mathbf{r}_{TC}\right\|
$$
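The inequality above translates directly into a line-of-sight test; a minimal sketch follows, where the Earth radius value and function name are our assumptions for illustration (positions in kilometres):

```python
import numpy as np

R_EARTH = 6371.0  # km, assumed mean Earth radius

def geometrically_visible(r_cam, r_tgt, R_e=R_EARTH):
    """Geometric visibility test from the text:
    r_C . r_TC < sqrt(r_C^2 - R_E^2) * |r_TC|,
    where r_TC is the target-to-camera vector."""
    r_cam = np.asarray(r_cam, float)
    r_tc = r_cam - np.asarray(r_tgt, float)      # target -> camera
    rc = np.linalg.norm(r_cam)
    horizon = np.sqrt(rc**2 - R_e**2)            # cos(theta_horizon) * rc
    return float(np.dot(r_cam, r_tc)) < horizon * np.linalg.norm(r_tc)
```

A nearby co-orbiting target passes the test, while a target on the far side of the Earth fails it, matching Targets A and C in Figure 4.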

3.3.2. Optical Visibility

Optical visibility can be concisely summarized as the geometric visibility condition among the Sun, the observation platform, and the observation target. Optical visibility must satisfy two conditions: (1) during observation, the Sun does not directly shine into the platform’s lens; (2) the space target is not within the Earth’s shadow (umbra/penumbra) region.
Regarding condition (1), if sunlight directly enters the lens, the field of view will be flooded with a high-brightness background, making imaging of the target impossible and potentially damaging the camera. Figure 5 below shows the schematic diagram of the conditions for avoiding direct sunlight.
Regarding condition (2), whether the platform carries a CCD or CMOS camera, the imaging principle relies on receiving sunlight reflected from the target. The photosensitive elements convert this light signal into an electrical signal to form an image. If the target is within Earth’s shadow, it cannot reflect sunlight, and thus the platform cannot image it. Therefore, during target observation and imaging, it is essential to ensure the target is outside Earth’s shadow region. Figure 6 below shows the schematic diagram of the Earth shadow conditions that need to be satisfied.
Let the distance vector from the observation platform to the Sun be r C S , the distance vector from the camera to the observation target be r C T , the sum of the Sun’s apparent radius and the light scattering angle be α , and the angle between r C S and r C T be β . Then the condition to avoid direct solar illumination can be expressed as:
$$
0 < \alpha < \beta < \pi
$$
The above formula can be written as:
$$
\cos\alpha > \frac{\mathbf{r}_{CS} \cdot \mathbf{r}_{CT}}{\left\|\mathbf{r}_{CS}\right\| \left\|\mathbf{r}_{CT}\right\|} = \cos\beta
$$
Meanwhile, optical visibility also requires that the space target is outside the Earth's shadow region, as illustrated in Figure 6, where the position indicated by the blue arrow marks the critical boundary of the shadow region. Let the central angle between the target's geocentric vector r T and the Sun's geocentric vector r S be λ. Then, for the target to be outside the Earth's shadow, the following condition must be met:
$$
\cos\lambda = \frac{\mathbf{r}_T \cdot \mathbf{r}_S}{\left\|\mathbf{r}_T\right\| \left\|\mathbf{r}_S\right\|} > -\frac{\sqrt{r_T^2 - R_E^2}}{r_T}
$$
The above formula is equivalent to:
$$
\mathbf{r}_T \cdot \mathbf{r}_S > -\sqrt{r_T^2 - R_E^2}\,\left\|\mathbf{r}_S\right\|
$$
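Both optical visibility conditions can be sketched as simple vector tests. The shadow test below uses the cylindrical-umbra simplification implied by the formula; the Earth radius constant and function names are our illustrative assumptions (positions in kilometres, angles in radians):

```python
import numpy as np

R_EARTH = 6371.0  # km, assumed mean Earth radius

def no_direct_sunlight(r_cam, r_tgt, r_sun, alpha):
    """Condition (1): the angle beta between the camera->Sun and
    camera->target directions must exceed the exclusion half-angle alpha,
    i.e. cos(beta) < cos(alpha)."""
    u = np.asarray(r_sun, float) - np.asarray(r_cam, float)  # camera -> Sun
    v = np.asarray(r_tgt, float) - np.asarray(r_cam, float)  # camera -> target
    cos_beta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return cos_beta < np.cos(alpha)

def outside_earth_shadow(r_tgt, r_sun, R_e=R_EARTH):
    """Condition (2): r_T . r_S > -sqrt(r_T^2 - R_E^2) * |r_S|."""
    r_tgt = np.asarray(r_tgt, float)
    r_sun = np.asarray(r_sun, float)
    rt = np.linalg.norm(r_tgt)
    return float(np.dot(r_tgt, r_sun)) > -np.sqrt(rt**2 - R_e**2) * np.linalg.norm(r_sun)
```

A sunlit-side target passes the shadow test; a target directly behind the Earth (anti-Sun direction) fails it, and a target lying along the Sun vector fails the direct-sunlight test.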
The aforementioned visibility analysis establishes theoretical support and a basis for evaluating dataset quality. In the future, it can serve as a key module for automatic filtering and grading of multimodal images, enabling efficient filtering of low-quality, low-information samples, thereby enhancing the overall signal-to-noise ratio of the training data and improving model convergence stability.

3.4. Multimodal Ground Truth Rendering and Imaging Simulation

This section systematically presents the multimodal ground truth rendering and space imaging simulation framework constructed in this paper. It elaborates on the simulation environment configuration, multimodal image generation pipeline, and space image degradation modeling method, which can provide sufficient, high-quality, and well-annotated datasets for the training, testing, and performance evaluation of subsequent pose estimation algorithms. The output of the orbital dynamics model serves as the precise input for the rendering pipeline. Specifically, the output of the dynamics model at time t is a 6-DoF relative pose R ( t ) , T ( t ) , which is directly utilized as the extrinsic parameters for the virtual camera in the 3D rendering engine.
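Packing the 6-DoF pose R(t), T(t) into the camera extrinsics can be sketched generically as below; this is a standard homogeneous-matrix construction, not the authors' actual Blender code, and the function names are ours:

```python
import numpy as np

def pose_to_extrinsic(R, T):
    """Stack a 6-DoF pose (R in SO(3), T in R^3) into the 4x4 homogeneous
    extrinsic matrix consumed by a virtual camera."""
    E = np.eye(4)
    E[:3, :3] = np.asarray(R, float)
    E[:3, 3] = np.asarray(T, float)
    return E

def invert_extrinsic(E):
    """Closed-form inverse of a rigid transform: [R^T | -R^T t]."""
    R, t = E[:3, :3], E[:3, 3]
    Einv = np.eye(4)
    Einv[:3, :3] = R.T
    Einv[:3, 3] = -R.T @ t
    return Einv
```

The closed-form inverse is useful when switching between target-to-camera and camera-to-target conventions at each rendered frame.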

3.4.1. Simulation Environment and Configuration

In the synthetic data generation experiments conducted for this study, the hardware and software environment was configured as follows: The experimental platform was equipped with an Intel Core i7-12700H processor and an NVIDIA GeForce RTX 3060 Laptop GPU, running a 64-bit Windows operating system. All 3D scene construction, material configuration, camera parameter setup, and image rendering were performed within Blender 4.0, utilizing the Cycles rendering engine. The pipeline outputs four data types: RGB images, depth maps, mask maps, and surface normal maps. The ground truth annotations for the 6-DoF pose were generated using Blender's built-in Python scripting interface.

3.4.2. Multimodal Image Rendering Pipeline

The flowchart of the designed simulation experiment is shown in Figure 7, which is divided into four core stages. First, in the orbital parameters and relative motion setup stage, the 3D trajectory of the observer spacecraft relative to the target spacecraft is defined in the LVLH frame, along with the attitude angles of the target spacecraft relative to the camera. This establishes the relative motion constraints for constructing the subsequent 3D scenes. Given the initial state, the relative orbital dynamics model generates the 6-DoF relative pose $P(t) = \{R(t), T(t)\}$ of the observer spacecraft with respect to the target spacecraft at any time $t$, where $R(t) \in SO(3)$ is the rotation matrix and $T(t) \in \mathbb{R}^3$ is the translation vector. This pose is directly used as the extrinsic matrix of the virtual camera in the subsequent rendering pipeline.
Second, in the high-fidelity 3D scene and material configuration stage, high-fidelity 3D models containing textures and materials are imported. The space background and lighting environment are configured to enhance the physical realism of the simulation scene. The third stage is camera parameters and rendering setup. Here, the camera’s intrinsic parameters, including focal length and sensor size, are configured. The Cycles rendering engine is selected to output images with a resolution of 1280 × 960, containing four modalities: RGB images, normal maps, depth maps, and mask maps. This ensures the rendering process conforms to the imaging characteristics of a real camera. Finally, in the Image and Annotation Generation stage, time-series image sequences are output, and 6D pose ground truth annotation files are synchronously generated. This provides a high-quality, labeled dataset for subsequent 3D reconstruction and pose estimation algorithms. The entire pipeline implements an end-to-end generation process from orbital dynamics parameters to synthetic imagery and ground truth annotations.

3.4.3. Image Physical Degradation Simulation

The objective of image degradation modeling for space target images is to construct an end-to-end imaging degradation pipeline that simulates the degradation effects introduced by the optical system and sensor characteristics of a real spacecraft camera. By faithfully reproducing the distortion, blurring, noise interference, and color distortion observed in complex on-orbit observation scenarios, the pipeline generates high-fidelity simulated space target image data consistent with real detection conditions, providing reliable and realistic training and test samples for subsequent space target detection, recognition, feature extraction, and image restoration algorithms. The pipeline encompasses two major categories of degradation factors. The first is optical system degradation, which mainly includes the optical diffraction limit, aberrations, defocus, and motion blur. These degradations can be described by corresponding Point Spread Functions (PSFs) and simulated by convolving them with the ideal image. The second is sensor sampling and readout degradation, which simulates the degradation introduced when the sensor converts a continuous optical signal into a discrete digital signal. Most spacecraft cameras employ a single-chip sensor with a Bayer color filter array to acquire color images. Therefore, the rendered high-resolution RGB image is first downsampled into a single-channel RAW image according to the Bayer pattern, and then reconstructed into an RGB image through a demosaicing algorithm (such as bilinear or edge-adaptive interpolation), a process that introduces color artifacts and high-frequency detail loss.
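The Bayer sampling and demosaicing step described above can be sketched as follows; the RGGB layout and the bilinear (normalized-convolution) reconstruction are our illustrative choices, since the paper does not specify the pattern:

```python
import numpy as np

def _conv2(img, k):
    """Same-size 2D convolution with zero padding (no SciPy dependency)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def bayer_mosaic(rgb):
    """Sample an RGB image onto a single-channel RGGB Bayer RAW image."""
    h, w, _ = rgb.shape
    raw = np.zeros((h, w), float)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B
    return raw

def demosaic_bilinear(raw):
    """Bilinear demosaicing: each channel is interpolated by a normalized
    convolution over the pixel sites where that channel was sampled."""
    h, w = raw.shape
    masks = np.zeros((h, w, 3), bool)
    masks[0::2, 0::2, 0] = True
    masks[0::2, 1::2, 1] = True
    masks[1::2, 0::2, 1] = True
    masks[1::2, 1::2, 2] = True
    kernel = np.array([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]])
    out = np.zeros((h, w, 3))
    for c in range(3):
        vals = np.where(masks[:, :, c], raw, 0.0)
        num = _conv2(vals, kernel)
        den = _conv2(masks[:, :, c].astype(float), kernel)
        out[:, :, c] = num / np.maximum(den, 1e-12)
    return out
```

On textured images the mosaic/demosaic round trip produces exactly the color artifacts and high-frequency loss described in the text; on a constant image it is lossless, which makes a convenient sanity check.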
Additionally, the sensor introduces various types of noise during signal conversion, such as shot noise and read noise, which can be modeled as the sum of Poisson noise and additive Gaussian noise.
$$
I_{\text{noisy}} = \mathrm{Poisson}\big(I_{\text{clean}} \otimes \mathrm{PSF}\big) + \mathcal{N}\big(0, \sigma^2\big)
$$
where $I_{\text{noisy}}$ is the degraded image, $I_{\text{clean}}$ is the original clear image, $\otimes$ denotes two-dimensional convolution with the system PSF, the Poisson term models signal-dependent shot noise, and $\mathcal{N}(0, \sigma^2)$ is additive Gaussian read noise with variance $\sigma^2$.
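One possible reading of this degradation chain (blur, then shot noise, then read noise) is sketched below; the photon peak value, noise levels, and PSF rasterization are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def _conv2(img, k):
    """Same-size 2D convolution with zero padding (no SciPy dependency)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def motion_psf(length=15, angle_deg=30.0):
    """Line PSF approximating uniform linear motion blur (length in pixels)."""
    size = length if length % 2 == 1 else length + 1
    psf = np.zeros((size, size))
    c = size // 2
    a = np.deg2rad(angle_deg)
    for t in np.linspace(-length / 2.0, length / 2.0, 4 * length):
        x = int(round(c + t * np.cos(a)))
        y = int(round(c + t * np.sin(a)))
        if 0 <= x < size and 0 <= y < size:
            psf[y, x] = 1.0
    return psf / psf.sum()

def degrade(img, psf, peak=1000.0, sigma=0.01):
    """Blur -> signal-dependent Poisson shot noise (scaled by a hypothetical
    full-scale photon count `peak`) -> additive Gaussian read noise."""
    blurred = _conv2(img, psf)
    shot = rng.poisson(np.clip(blurred, 0.0, None) * peak) / peak
    return np.clip(shot + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
```

Because the Poisson variance scales with the blurred intensity, bright regions receive proportionally stronger shot noise, matching the brightness-dependent behavior described later in Section 4.1.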

4. Results

4.1. Synthetic Image Results

In a simulation scenario, it is essential to construct a complete imaging system model that incorporates the environment, the target, and camera motion. The International System of Units (SI) is uniformly adopted, with the “meter” serving as the scale baseline, ensuring consistency between the simulated environment and the real physical world. A directional light source simulating solar illumination is introduced into the scene, allowing for flexible configuration of its intensity and incident angle to replicate various lighting conditions in space.
We systematically analyzed the pose parameters, distance distribution, and lighting conditions of the dataset, with the results summarized in Table 2. The dataset covers a variety of typical scenarios, in which: the position range represents the spatial distribution of the camera relative to the target object in the Cartesian coordinate system; the average angular variation reflects the average magnitude of pose changes between consecutive frames (the frame sequence is uniformly sampled in time); the total illumination energy characterizes the total radiant flux density of all light sources in the scene; and the ambient light intensity describes the strength of global ambient light, simulating indirect lighting effects. Table 2 presents the key parameter statistics for four space target scenarios (Tiangong, Spacedragon, ICESat, and Cassini). In terms of camera poses, the positional variation ranges along the X and Y axes generally span from ±15 m to ±40 m, while the Z-axis remains relatively stable. The average angular variation ranges from 0.20° to 1.52°, indicating notable differences in trajectory smoothness. Regarding illumination conditions, the total illumination energy covers a wide spectrum from 40.08 W/m² to 357.29 W/m², whereas the ambient light intensity is maintained at a relatively low level between 0.80 W/m² and 1.40 W/m², simulating the weak contribution of indirect light in real scenarios. In summary, the dataset exhibits sufficient representational capacity in terms of camera pose variation, spatial coverage, and lighting diversity. It comprehensively covers typical ranges of motion and illumination variation, thereby ensuring the diversity and representativeness of the data. This makes it suitable for providing a multi-dimensional, highly robust test benchmark for vision-based tasks such as visual localization and 3D reconstruction.
This study has constructed a large-scale, high-fidelity multimodal synthetic dataset to support vision-based tasks in complex space scenarios. For each target object, a systematic data acquisition pipeline was employed to generate synchronized multimodal imagery under diverse viewing geometries, illumination conditions, and scene backgrounds. Specifically, four complementary image modalities were rendered in parallel for every scenario, including RGB, depth, semantic segmentation masks, and surface normal maps, with 500 samples generated per modality to ensure statistical sufficiency. The detailed data composition and scale are summarized in Table 3. Importantly, all modal images are strictly aligned both temporally and spatially, thereby forming four sets of perfectly matched multimodal data streams that enable pixel-wise correspondence across modalities. This alignment facilitates joint learning and cross-modal validation in downstream tasks such as 3D reconstruction, pose estimation, and scene understanding.
Figure 8 illustrates the synthetic rendering results of four typical space targets: Tiangong, Spacedragon, ICESat, and Cassini. For each target, four types of complementary data are provided: high-resolution RGB images, corresponding binary segmentation masks, grayscale depth maps, and surface normal maps. Among these, the RGB image faithfully reproduces the visual appearance of the target under space-based lighting and background conditions, providing intuitive photometric information for visual perception tasks. The segmentation mask clearly isolates the target spacecraft from the space background, serving as ground truth annotations for semantic segmentation and target detection algorithms. The depth map encodes the relative distance from the camera to each point on the target surface, offering critical geometric cues for 3D reconstruction and pose estimation. The surface normal map records the orientation of each point on the target surface, providing normal vector information for surface feature analysis and illumination simulation. Collectively, these multimodal data form a comprehensive dataset to support a variety of space target perception and analysis tasks.
During the image degradation modeling stage, a Point Spread Function (PSF) is employed to simulate motion blur resulting from the relative motion between the camera and the target, with the motion blur length set to 15 pixels and the motion direction set to 30 degrees. Poisson noise and Gaussian noise are subsequently added to the blurred image. Poisson noise models the statistical fluctuations of photons arriving at the sensor, with its intensity determined by the image brightness. Concurrently, Gaussian noise is introduced to simulate electronic sensor noise, with a mean of 0 and a variance of 0.01. The results of the image degradation are illustrated in Figure 9.

4.2. Ablation Study

To quantitatively assess the contribution of each degradation module within the proposed image degradation model and systematically analyze its impact on image quality, this study designs a progressive ablation experiment. Using the original high-definition synthetic images as the reference benchmark, three core degradation modules—motion blur, Poisson noise, and additive Gaussian noise—are sequentially introduced to construct four progressively degraded experimental groups: D1 includes motion blur only; D2 combines motion blur and Poisson noise; D3 combines motion blur and Gaussian noise; and D4 represents the complete degradation model, incorporating all three modules—motion blur, Poisson noise, and Gaussian noise—thereby forming a comprehensive end-to-end physical imaging pipeline.
For each experimental group, three objective image quality assessment metrics were computed between its output images and the original reference images: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), and the Learned Perceptual Image Patch Similarity (LPIPS). PSNR reflects pixel-level error, SSIM gauges the preservation of structural information, while LPIPS assesses perceptual similarity in a deep feature space. Together, this triad of metrics provides a comprehensive evaluation of the degradation in image fidelity, structural integrity, and visual realism across different stages of the applied degradations. For each object, the three metrics were computed for every degradation group (D1–D4), and the results were then averaged across objects. The ablation study results are shown in Table 4.
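The two reference-based metrics can be sketched in a few lines; the SSIM below is a simplified single-window variant using global statistics rather than the sliding Gaussian window of the standard definition, which is sufficient to rank the D1–D4 degradation levels (LPIPS is omitted here because it requires a pretrained network, e.g., the `lpips` Python package):

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """Peak Signal-to-Noise Ratio in dB."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(data_range**2 / mse)

def global_ssim(ref, test, data_range=1.0):
    """Simplified SSIM computed from whole-image mean/variance/covariance."""
    x = np.asarray(ref, float)
    y = np.asarray(test, float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))
```

As expected, an identical image pair yields infinite PSNR and SSIM equal to 1, and both metrics fall monotonically as stronger degradations are applied.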
The experimental results are presented in Table 4. As the degradation modules are progressively added (from D1 to D4), the PSNR and SSIM metrics exhibit a monotonic decrease, while the LPIPS metric shows a monotonic increase, which systematically validates the physical soundness of the proposed degradation model. Detailed analysis indicates that the introduction of Poisson noise (D1 → D2) leads to the most drastic changes in SSIM and LPIPS, demonstrating its strongest disruptive effect on structural information and perceptual quality. In contrast, the introduction of Gaussian noise (D2 → D3) causes the most significant attenuation in PSNR, highlighting its predominant impact on pixel-level fidelity. The complete model (D4) exhibits the most severe degradation across all metrics, confirming the necessity of jointly modeling multiple physical effects for realistic image degradation simulation. In summary, the experimental results quantitatively reveal the contribution of each degradation module and provide a reliable degradation benchmark for training and testing algorithms based on synthetic data.

5. Discussion

In this paper, a multimodal synthetic dataset for spacecraft visual perception is constructed, and an end-to-end physical degradation model is designed. The core contribution is to fill the gap in existing spacecraft visual perception benchmarks with respect to multimodality and multi-scenario coverage. The dataset incorporates a variety of relative motion orbits and provides four image modalities, offering a multi-dimensional test benchmark for vision-based space missions while alleviating the difficulty and high cost of acquiring real on-orbit data, thereby providing reliable support for subsequent visual perception algorithms. Meanwhile, through systematic ablation experiments, we quantitatively analyze the mechanism of each degradation module, clarify its contribution to the overall degradation effect, and verify the rationality and effectiveness of the proposed physical degradation model, ensuring that it faithfully reflects the degradation characteristics of on-orbit visual imaging.
Considering the current state of research in spacecraft visual perception, the innovations of this work align closely with industrial demands; nevertheless, several limitations remain, and they mark the key directions for follow-up research. First, although the dataset covers four typical on-orbit servicing scenarios, the relative motion modes are established under ideal orbital conditions and do not fully account for motion deviations caused by complex orbital perturbations such as Earth's oblateness and atmospheric drag. This restricts the applicability of the synthetic data to non-ideal orbital scenarios and introduces inconsistencies with the real relative motion characteristics of on-orbit spacecraft. Second, the illumination simulation is limited: it considers only basic static illumination and does not accurately model the complex illumination variations encountered on orbit, so the robustness of visual perception algorithms to illumination changes cannot be verified. Finally, validation is conducted only through full digital simulation, without semi-physical simulation or hardware-in-the-loop testing. As a result, the practicality of the synthetic data and the performance of algorithms in real imaging chains have not been fully verified, which hinders the migration of algorithms from simulation to engineering application and remains a critical bottleneck for engineering implementation.
Future work will focus on improving the physical fidelity and validation completeness of the simulation framework. Three main directions are proposed.
First, high-fidelity orbital dynamics will be integrated. This includes modeling key perturbations such as Earth’s oblateness (J2), atmospheric drag, and third-body gravitational effects. These additions will allow a more accurate simulation of six-degree-of-freedom relative motion in complex space environments. The framework will be extended to cover challenging scenarios like highly elliptical orbits and long-duration missions.
Second, a dynamic, high-fidelity illumination model will be developed. This model will account for time-varying solar irradiation, Earth albedo, planetary illumination, and deep-space thermal radiation. It will realistically simulate the complex lighting conditions encountered in orbit, including scenes with high dynamic range, strong shadows, and extreme contrast. This will provide a more reliable test environment for evaluating the robustness of visual algorithms under real illumination variations.
Third, the validation pipeline will be extended to include hardware-in-the-loop (HIL) testing. Building on the current fully digital simulation data, real optical hardware will be integrated with synthetic scenes. This step will further verify the utility of the synthetic data and assess the performance of perception algorithms within a near-realistic imaging chain. The goal is to establish a closed-loop “digital–HIL” validation framework.
Through these extensions, a higher-fidelity and more complete simulation and verification platform will be established. This platform will support the development and testing of algorithms for vision-based navigation and on-orbit servicing in a wider range of demanding space scenarios.

6. Conclusions

This paper addresses the scarcity of high-quality real-world data in spacecraft visual perception by proposing a multimodal spacecraft data generation framework based on orbital relative motion constraints. The method embeds the laws of orbital relative motion into the synthetic data generation process, simulating four typical relative motion patterns (elliptical, spiral, teardrop, and drift trajectories) to produce kinematically plausible image sequences. The constructed dataset contains 8000 images, each accompanied by rich annotations, including RGB images, masks, depth maps, surface normals, and 6-DoF poses. Covering a wide range of positions, angular variations, and light intensities, it provides dense supervision for tasks such as pose estimation and 3D reconstruction. In addition, we design an end-to-end physical degradation model that simulates real imaging artifacts from the optical system to the sensor, effectively narrowing the gap between simulation and reality. The paper also summarizes the shortcomings and limitations of the research and outlines future research directions. In summary, this study systematically improves the quality of spacecraft synthetic data in three key respects: motion physical authenticity, annotation completeness, and visual realism, laying a solid data foundation for the training, testing, and validation of spacecraft perception algorithms.

Author Contributions

Conceptualization, W.L. and Y.H.; methodology, Q.Z.; validation, Y.F. and Y.Z.; formal analysis, Y.L. and Q.Z.; resources, Y.H.; writing—original draft preparation, W.L.; writing—review and editing, Y.L. and Q.Z.; supervision, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the editors of Remote Sensing and the anonymous reviewers for their patience, helpful remarks, and useful feedback.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Geller, D.K. Orbital rendezvous: When is autonomy required? J. Guid. Control Dyn. 2007, 30, 974–981. [Google Scholar] [CrossRef]
  2. Li, W.-J.; Cheng, D.-Y.; Liu, X.-G.; Wang, Y.-B.; Shi, W.-H.; Tang, Z.-X.; Gao, F.; Zeng, F.-M.; Chai, H.-Y.; Luo, W.-B.; et al. On-orbit service (OOS) of spacecraft: A review of engineering developments. Prog. Aerosp. Sci. 2019, 108, 32–120. [Google Scholar] [CrossRef]
  3. Flores-Abad, A.; Ma, O.; Pham, K.; Ulrich, S. A review of space robotics technologies for on-orbit servicing. Prog. Aerosp. Sci. 2014, 68, 1–26. [Google Scholar] [CrossRef]
  4. Xu, Q.; Hu, M.; Fang, Y.; Zhang, X. A neural radiance fields method for 3D reconstruction of space target. Adv. Space Res. 2025, 75, 6924–6943. [Google Scholar] [CrossRef]
  5. Feng, J.; Yang, Z.; Zhang, Z.; Chen, W.; Yu, J.; Chang, L. Point Cloud Model Reconstruction and Pose Estimation of Non-Cooperative Spacecraft Without Feature Extraction. Acta Opt. Sin. 2025, 45, 0712004. [Google Scholar]
  6. Li, Y.; Hu, Q.; Liang, T.; Li, D.; Ouyang, Z. Active 2DGS for 3D Reconstruction of Space Targets Under Orbital Constraints. IEEE Trans. Circuits Syst. Video Technol. 2025, 36, 4971–4983. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Yi, J.; Pan, Y.; Chen, L. 3D reconstruction of non-cooperative space targets of poor lighting based on 3D gaussian splatting. Signal Image Video Process. 2025, 19, 509. [Google Scholar] [CrossRef]
  8. Chen, Z.; Gui, H.; Zhong, R. Neural network-based navigation filter for monocular pose and motion tracking of noncooperative spacecraft. Adv. Space Res. 2025, 75, 2908–2928. [Google Scholar] [CrossRef]
  9. Feng, Q.; Zhu, Z.H.; Pan, Q.; Liu, Y. Pose and motion estimation of unknown tumbling spacecraft using stereoscopic vision. Adv. Space Res. 2018, 62, 359–369. [Google Scholar] [CrossRef]
  10. Yin, H.; Ren, X.; Jiang, L.; Wang, C.; Xiong, Q.; Wang, Z. Robust Semantic Feature Extraction and Attitude Estimation of Unseen Noncooperative On-Orbit Spacecraft. IEEE Sens. J. 2025, 25, 31858–31873. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Wang, J.; Chen, J.; Shi, D.; Chen, X. A Space Non-Cooperative Target Recognition Method for Multi-Satellite Cooperative Observation Systems. Remote Sens. 2024, 16, 3368. [Google Scholar] [CrossRef]
  12. Li, X.; Zhang, L.; Li, Z.; He, X. Application of the Relative Orbit in an On-Orbit Service Mission. Electronics 2023, 12, 3034. [Google Scholar] [CrossRef]
  13. Dai, C.; Qiang, H.; Zhang, D.; Hu, S.; Gong, B. Relative Orbit Determination Algorithm of Space Targets with Passive Observation. J. Syst. Eng. Electron. 2024, 35, 793–804. [Google Scholar] [CrossRef]
  14. Doshi, M.J.; Pathak, N.M.; Abouelmagd, E.I. Periodic orbits of the perturbed relative motion. Adv. Space Res. 2023, 72, 2020–2038. [Google Scholar] [CrossRef]
  15. Kisantal, M.; Sharma, S.; Park, T.H.; Izzo, D.; Martens, M.; D’Amico, S. Satellite pose estimation challenge: Dataset, competition design, and results. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4083–4098. [Google Scholar] [CrossRef]
  16. Park, T.H.; Märtens, M.; Lecuyer, G.; Izzo, D.; Amico, S.D. SPEED+: Next-Generation Dataset for Spacecraft Pose Estimation across Domain Gap. In Proceedings of the 2022 IEEE Aerospace Conference, Big Sky, MT, USA, 5–12 March 2022; IEEE: New York, NY, USA, 2022; pp. 1–15. [Google Scholar]
  17. Park, T.H.; D’amico, S. Adaptive Neural-Network-Based Unscented Kalman Filter for Robust Pose Tracking of Noncooperative Spacecraft. In Proceedings of the AIAA Science and Technology Forum and Exposition (AIAA SciTech Forum), San Diego, CA, USA, 3–7 January 2022; pp. 1671–1688. [Google Scholar]
  18. Park, T.H.; D’amico, S. Rapid Abstraction of Spacecraft 3D Structure from Single 2D Image. In Proceedings of the AIAA SCITECH 2024 Forum, Orlando, FL, USA, 8–12 January 2024; American Institute of Aeronautics and Astronautics: Reston, VA, USA, 2024. [Google Scholar]
  19. Huang, K.; Zhang, Y.; Ma, F.; Chen, J.; Tan, Z.; Qi, Y. WDICD: A novel simulated dataset and structure-aware framework for semantic segmentation of spacecraft component. Acta Astronaut. 2024, 225, 1–15. [Google Scholar] [CrossRef]
  20. Sam, J.J.; Sathe, J.; Chigali, N.; Gupta, N.; Ruparel, R.; Jiang, Y.; Singh, J.; Berck, J.W.; Barman, A. A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Computers. arXiv 2025, arXiv:2507.10775. [Google Scholar]
  21. Cao, Y.; Mu, J.; Cheng, X.; Liu, F. Spacecraft-DS: A Spacecraft Dataset for Key Components Detection and Segmentation via Hardware-in-the-Loop Capture. IEEE Sens. J. 2024, 24, 5347–5358. [Google Scholar] [CrossRef]
  22. Proença, P.F.; Gao, Y. Deep Learning for Spacecraft Pose Estimation from Photorealistic Rendering. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation, Paris, France, 31 May–31 August 2020; IEEE: New York, NY, USA, 2020; pp. 6007–6013. [Google Scholar]
  23. Dung, H.A.; Chen, B.; Chin, T.J. A Spacecraft Dataset for Detection, Segmentation and Parts Recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 2012–2019. [Google Scholar]
  24. Hematulin, W.; Kamsing, P.; Phisannupawong, T.; Panyalert, T.; Manuthasna, S.; Torteeka, P.; Boonsrimuang, P. Generating Large-Scale Datasets for Spacecraft Pose Estimation via a High-Resolution Synthetic Image Renderer. Aerospace 2025, 12, 334. [Google Scholar] [CrossRef]
  25. Hu, Y.; Speierer, S.; Jakob, W.; Fua, P.; Salzmann, M. Wide-Depth-Range 6D Object Pose Estimation in Space. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 15865–15874. [Google Scholar]
  26. Bechini, M.; Lavagna, M.; Lunghi, P. Dataset generation and validation for spacecraft pose estimation via monocular images processing. Acta Astronaut. 2023, 204, 358–369. [Google Scholar] [CrossRef]
  27. Gallet, F.; Marabotto, C.; Chambon, T. Exploring AI-Based Satellite Pose Estimation: From Novel Synthetic Dataset to In-Depth Performance Evaluation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; pp. 6770–6778. [Google Scholar]
  28. Musallam, M.A.; Gaudillière, V.; Ghorbel, E.; Al Ismaeil, K.; Perez, M.D.; Poucet, M.; Aouada, D. Spacecraft Recognition Leveraging Knowledge of Space Environment: Simulator, Dataset, Competition Design and Analysis. In Proceedings of the 2021 IEEE International Conference on Image Processing Challenges (ICIPC), Anchorage, AK, USA, 19–22 September 2021; pp. 11–15. [Google Scholar]
  29. Yang, X.; Cao, M.; Li, C.; Zhao, H.; Yang, D. Learning Implicit Neural Representation for Satellite Object Mesh Reconstruction. Remote Sens. 2023, 15, 4163. [Google Scholar] [CrossRef]
  30. Liu, X.; Li, D.; Dong, N.; Ip, W.H.; Yung, K.L. Noncooperative Target Detection of Spacecraft Objects Based on Artificial Bee Colony Algorithm. Tianjin Univ. Hong Kong Polytech. Univ. 2019, 34, 3–15. [Google Scholar] [CrossRef]
  31. Yuan, Y.; Bai, H.; Wu, P.; Guo, H.; Deng, T.; Qin, W. An Intelligent Detection Method for Small and Weak Objects in Space. Remote Sens. 2023, 15, 3169. [Google Scholar] [CrossRef]
  32. Liu, Y.; Zhou, R.; Yao, Z.; She, J.; Qi, N. STDSD: A Spacecraft Target Detection Framework Considering Similarity and Diversity. IEEE Trans. Autom. Sci. Eng. 2025, 22, 11038–11049. [Google Scholar] [CrossRef]
  33. Zhu, Q.; Lu, Y.; Li, P.; Li, J.; Li, W.; Zhang, Y. SO-PEN: Strong transformers enable a pan-dimensional equilibrium network for non-controlled space object pose estimation. Adv. Space Res. 2026, 77, 7387–7405. [Google Scholar] [CrossRef]
  34. Zhao, M.; Xu, L. Robust Pose Estimation for Noncooperative Spacecraft Under Rapid Inter-Frame Motion: A Two-Stage Point Cloud Registration Approach. Remote Sens. 2025, 17, 1944. [Google Scholar] [CrossRef]
  35. Yi, W.; Zhang, Z.; Chang, L. M4MLF-YOLO: A Lightweight Semantic Segmentation Framework for Spacecraft Component Recognition. Remote Sens. 2025, 17, 3144. [Google Scholar] [CrossRef]
  36. Li, F.; Zhang, Z.; Wang, X.; Xu, Y. Exploiting Diffusion Priors for Generalizable Few-Shot Satellite Image Semantic Segmentation. Remote Sens. 2025, 17, 3706. [Google Scholar] [CrossRef]
  37. Zhou, X.; Qiao, D.; Macdonald, M.; Li, X. Optical Orbit Determination for Asteroid Approach with Vision-Based Range Information. J. Guid. Control Dyn. 2025, 48, 1492–1510. [Google Scholar] [CrossRef]
  38. Piccolo, F.; Pugliatti, M.; Mcmahon, J.W.; Topputo, F. Autonomous Vision-Based Navigation at Small Bodies Combining Centroiding and Visual Odometry. J. Spacecr. Rocket. 2026, 63, 308–326. [Google Scholar] [CrossRef]
  39. NASA’s Gateway Program Office. Available online: https://nasa3d.arc.nasa.gov/model (accessed on 27 March 2026).
  40. Jiao, B.; Sun, Q.; Han, H.; Dang, Z. A parametric design method of nanosatellite close-range formation for on-orbit target inspection. Chin. J. Aeronaut. 2023, 36, 194–209. [Google Scholar] [CrossRef]
  41. Sun, Q.; Zhao, L.; Dang, Z. Comprehensive classification of non-periodic relative motion styles of spacecraft: A geometric approach and applications. Astrodynamics 2025, 9, 969–992. [Google Scholar] [CrossRef]
  42. Pauly, L.; Rharbaoui, W.; Shneider, C.; Rathinam, A.; Gaudillière, V.; Aouada, D. A survey on deep learning-based monocular spacecraft pose estimation: Current state, limitations and prospects. Acta Astronaut. 2023, 212, 339–360. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the ECI and LVLH coordinate frames.
Figure 2. Schematic diagrams of typical relative motion orbits.
Figure 3. Schematic of the target imaging process.
Figure 4. Schematic of observation geometry between camera and targets.
Figure 5. Schematic diagram of the condition for avoiding direct sunlight.
Figure 6. Schematic diagram of the Earth’s shadow condition.
Figure 7. Software simulation flowchart.
Figure 8. Synthetic rendering results of four typical space targets.
Figure 9. Simulation results after image degradation.
Table 1. Classification of relative orbital motion patterns based on the normalized drift parameter (σ).

Trajectory Classification | σ Range | Dynamic Characteristics
Elliptical trajectory | σ = 2 | β = 0; the trajectory forms a closed ellipse, and the observer spacecraft moves periodically around the target spacecraft.
Spiral trajectory | 1.8877 < σ < 2.2039, σ ≠ 2 | Drift and periodic motion combine into a smooth, non-intersecting spiral, drifting along-track away from or toward the target.
Drip-drop trajectory | 1.75 < σ < 1.8877 or σ > 2.2039 | The trajectory shows periodic loops in the drift direction, resembling a water droplet, with two subcategories based on y-axis crossing.
Drift trajectory | σ ≤ 1.75 | With small periodic oscillation, the trajectory is nearly linear or gently curved, moving unidirectionally away from the target along-track; this is the simplest relative motion.
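The threshold structure of Table 1 can be sketched as a simple classifier. This is an illustrative helper (the function name and string labels are ours, not from the paper); behavior exactly on the boundary values 1.75, 1.8877, and 2.2039 is not specified in the table, so this sketch assigns them to the drift class by default.

```python
def classify_trajectory(sigma: float) -> str:
    """Classify relative orbital motion by the normalized drift parameter sigma,
    following the threshold ranges reported in Table 1."""
    if sigma == 2.0:
        return "elliptical"   # beta = 0: closed ellipse around the target
    if 1.8877 < sigma < 2.2039:
        return "spiral"       # smooth, non-intersecting spiral with along-track drift
    if 1.75 < sigma < 1.8877 or sigma > 2.2039:
        return "drip-drop"    # periodic droplet-shaped loops in the drift direction
    return "drift"            # sigma <= 1.75: near-linear unidirectional drift
```

For example, `classify_trajectory(2.0)` yields the closed elliptical case, while any σ well below 1.75 falls into the drift class.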
Table 2. Dataset scenario parameter analysis.

Scene | Position Range (m) | Average Angular Variation (°) | Total Illumination Energy (W/m²) | Ambient Light Intensity (W/m²)
Tiangong | X ∈ [−19.80, 19.80], Y ∈ [−25.38, 25.38], Z ∈ [4.26, 4.26] | 0.70 | 264.06 | 1.40
Spacedragon | X ∈ [−13.91, 13.15], Y ∈ [−13.08, 12.67], Z ∈ [7.39, 7.39] | 0.73 | 40.08 | 1.00
ICESat | X ∈ [−41.26, 41.51], Y ∈ [−39.08, 20.47], Z ∈ [−19.05, 40.19] | 1.52 | 100.54 | 0.80
Cassini | X ∈ [−30.06, 29.98], Y ∈ [−38.97, −24.30], Z ∈ [13.28, 13.29] | 0.20 | 357.29 | 1.00
Table 3. Statistics of the synthetic multimodal dataset.

Data Modality | Description | Samples per Object | Total Samples | Primary Use
RGB Image | 24-bit true-color image | 500 | 2000 | Network input
Segmentation Mask | Binary pixel-level annotation | 500 | 2000 | Segmentation tasks
Depth Map | Z-depth in camera space | 500 | 2000 | 3D reconstruction; depth estimation
Normal Map | Surface normal vectors | 500 | 2000 | Normal estimation; lighting analysis
Table 4. Ablation experimental results of the degradation models (↑ means better performance for higher values, ↓ means better performance for lower values).

Experiment Group | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓
D1 | 29.60 ± 2.59 | 0.90 ± 0.06 | 0.13 ± 0.05
D2 | 25.24 ± 2.60 | 0.58 ± 0.19 | 0.54 ± 0.26
D3 | 21.85 ± 0.60 | 0.11 ± 0.05 | 0.89 ± 0.08
D4 | 20.13 ± 1.23 | 0.06 ± 0.02 | 0.95 ± 0.05
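To make the PSNR column of Table 4 concrete, the metric can be computed as below. This is a generic NumPy sketch of the standard PSNR definition, not the authors' evaluation code; the function name and the 8-bit peak value of 255 are our assumptions.

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between a reference image and a
    degraded image: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no degradation
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Under this definition, heavier degradation raises the mean squared error and lowers the PSNR, which matches the monotonic drop from group D1 to D4 in the table.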
Li, W.; Huo, Y.; Zhu, Q.; Lu, Y.; Fang, Y.; Zhang, Y. A Relative Orbital Motion-Guided Framework for Generating Multimodal Visual Data of Spacecraft. Remote Sens. 2026, 18, 1177. https://doi.org/10.3390/rs18081177
