Intelligent QLFEKF Integrated Navigation for the SSBE Cruise Phase Based on X-Ray Pulsar/Solar and Target Planetary Doppler Information Fusion

Wenjian Tao; Jinxiu Zhang; Jianing Song; Qin Lin; Zebin Chen; Hui Wang; Jikun Yang; Jihe Wang

doi:10.3390/rs16234465

Abstract

The Solar System Boundary Exploration (SSBE) mission is the focal point for future far-reaching space exploration. Due to the SSBE having many scientific difficulties that need to be studied, such as a super long space exploratory distance, a super long flight time in orbit, and a significant communication data delay between the ground and the probe, the probe must have sufficient intelligence to realize intelligent autonomous navigation. Traditional navigation schemes have been unable to provide high-accuracy autonomous intelligent navigation for the probe independent of the ground. Therefore, high-accuracy intelligent astronomical integrated navigation would provide new methods and technologies for the navigation of the SSBE probe. The probe of the SSBE is disturbed by multiple sources of solar light pressure and a complex, unknown environment during its long cruise operation while in orbit. In order to ensure the high-accuracy position state and velocity state error estimation for the probe in the cruise phase, an autonomous intelligent integrated navigation scheme based on the X-ray pulsar/solar and target planetary Doppler velocity measurements is proposed. The reinforcement Q-learning method is introduced, and the reward mechanism is designed for trial-and-error tuning of state and observation noise error covariance parameters. The federated extended Kalman filter (FEKF) based on the Q-learning (QLFEKF) navigation algorithm is proposed to achieve high-accuracy state estimations of the autonomous intelligence navigation system for the SSBE probe cruise phase. The main advantage of the QLFEKF is that Q-learning combined with the conventional federated filtering method could optimize the state parameters in real-time and obtain high position and velocity state estimation (PVSE) accuracy. Compared with the conventional FEKF integrated navigation algorithm, the PVSE navigation accuracy of the federated filter integrated based the Q-learning navigation algorithm is improved by 55.84% and 37.04%, respectively, demonstrating the higher accuracy and greater capability of the raised autonomous intelligent integrated navigation algorithm. The simulation results show that the intelligent integrated navigation algorithm based on QLFEKF has higher navigation accuracy and is able to satisfy the demands of autonomous high accuracy for the SSBE cruise phase.

Keywords:

solar system boundary exploration; X-ray pulsar; solar/target planetary Doppler; Q-learning FEKF; intelligent integrated navigation; cruise phase

1. Introduction

The exploration of the extremely distant, extremely dark, and extremely cold regions at the solar system boundary is a hot point in deep space exploration. After the human exploration of the Moon and Mars, which has progressively expanded to the solar system boundary, further space exploration could continuously open new windows for humans’ understanding of the universe [1]. There are only a limited number of deep space missions that continue to explore the solar system boundary after completing the scheduled exploration mission, such as Pioneer 10–11 [2,3,4], Voyager 1–2 [5,6], and New Horizons [7,8]. With the continuous extension of the field of deep space exploration, China has put forward the SSBE program, which could build the capability of the probe to reach the whole area of the solar system and realize super long-distance detection from internal celestial bodies to interstellar space, providing us with the capability to independently explore interplanetary far-reaching space [1,9]. Due to the characteristics of unknown and varied exploration environments, super long space exploratory distances, super long orbital flight times, and significant communication data delays between the SSBE and the ground, the implementation of the exploration mission is very difficult [10]. Because the probe is so far from the ground, it cannot rely on traditional navigation methods, such as the radio station aeronautical ground, very long baseline interferometry (VLBI), and tracking telemetry and command (TT&C) communication networks to provide real-time and high-accuracy navigation information, which presents new challenges to ultra-long-distance autonomous high-accuracy navigation [11,12].

Regarding the solar system boundary probe, reliance on local ground-based tracking navigation would no longer be applicable and could not be used for autonomous real-time and high-accuracy navigation of the probe at great distances from the ground [12]. In order to improve the autonomous operation capability of the probe in orbit, autonomous navigation technology that is independent of ground navigation measurements has been actively studied in depth. X-ray pulsar navigation (XNAV) is a new kind of autonomous celestial navigation technology that can be applied to both near-Earth and deep space [13,14,15]. Based on the three X-ray pulsars, the autonomous position determination of the deep space probe could be realized. By collecting the X-ray radiation signals from the X-ray pulsar with the X-ray detector, the pulse time-of-arrival (TOA) can be obtained. The arrival time of the pulse reaching the detector and the time to reach the solar system barycenter (SSB) are predicted, which can determine the position of the probe by characterizing its position direction along the X-ray pulsar [16,17]. Reichley et al. [18] and Downs [19] stated that the position of the spacecraft is determined through the measurement of the phase of a periodic signal, and they speculated for the first time that the X-ray pulsar could be used to determine the time and position of the spacecraft in orbit [15]. Runnels et al. [20] proposed a method to estimate the comprehensive six DOF position navigation and time solutions for the deep space exploration spacecraft, which relied on the measurement of the arrival time and angles of the X-ray photons from the X-ray pulsars and other bright stars. Gao et al. [21] proposed differential XNAV based on the time difference of the pulse arrival, which analyzed the impact of the ephemeris errors on correcting the photon TOA and observation equations, and three Mars orbits with different altitudes were simulated to validate the efficacy of the navigation scheme and its suitability for the different heights from the Martian surface. When the detector moves relative to stellar objects in far-reaching space exploration, the measured stellar objects spectrum also undergoes Doppler frequency offset, which reflects the velocity of the probe relative to the stellar objects [22]. The radial velocity of the probe relative to the navigation stellar objects can be calculated based on the Doppler frequency offset of the probe with respect to the stellar object, and then the velocity state of the probe can be estimated. Yim et al. [23] adopted Doppler measurements generated by the relative motion of the spacecraft and Sun as the key measurement, which realized the autonomous navigation of the spacecraft for deep space exploration, but the positioning accuracy is not particularly high. Ning et al. [24] introduced the celestial navigation algorithm assisted by the differential Doppler measurements, which used the difference between Doppler radial velocity measured in two adjacent observation periods as the measured value to mitigate the effects of solar spectrum frequency instability and enhance autonomous celestial navigation accuracy.

Since deep space probes operate in long-endurance, highly dynamic, and extreme environments, it is difficult for the single navigation pattern to satisfy the navigation demands of excellent accuracy and great dependability for the deep space probe. The integrated navigation formed by the combination of multiple navigation modes is an irresistible trend for deep space exploration navigation systems in the future. Liu et al. [25] proposed an autonomous integrated navigation scheme based on Doppler/XNAV, and the Doppler autonomous navigation provides high-accuracy velocity error estimation information by measuring the radial velocity information relative to the Sun. The complementary of the two navigation methods can make up for the problem of low-accuracy navigation caused by shrinking the pulse signal receiving area size for the X-ray pulsar detection sensor. Pan et al. [26] used the stability of the solar Doppler differential navigation to put forward the solar time difference of arrival (TDOA)/Doppler difference joint observation and integrate it with Mars angle navigation. This approach allows fully observable Mars angle and solar light joint observation navigation, providing higher accuracy and stability of the navigation. Cui et al. [27] proposed the integrated navigation scheme based on the X-ray pulsars and Doppler to advance the error estimation precision of the probe during the final approach phase entry state of Mars exploration, and the navigation scheme adopted the X-ray pulsar observations and Doppler velocity measurements to complement each other to eliminate the problem of the reduced velocity state estimation performance of the XNAV for this phase.

During the design of the autonomous navigation system for the SSBE cruise phase, compared with other filtering estimators [28,29], the extended Kalman filter (EKF) is the preferred state estimator for the cruise phase navigation system because of the nonlinearity of the orbital dynamics model [30]. The EKF is widely used in various scenarios involving navigation systems and has a very stable navigation performance [31,32]. This state estimation algorithm has also been favored by researchers from the field of astronomical autonomous navigation. In addition, for the navigation system, the measurement noise is usually based on the measurement error of the sensor used in actual tasks. The environment disturbances mainly come from micrometeoroids, cosmic rays, and solar wind in the space environment, which all have an impact on the navigation accuracy of the deep space probe, and it is necessary to set the noise according to the scene in the different phases during the simulation analysis. Kang et al. [33] combined Doppler velocity measurements and the XNAV and proposed the integrated navigation method of the Doppler velocity measurements based on the double measurement model and X-ray pulsar. The EKF was used to achieve continuous high-accuracy state estimation, which effectively suppressed the accumulative state errors in the Doppler velocity measurements and improved the PVSE precision of the autonomous navigation system for the deep space exploration spacecraft. To enhance the autonomous navigation capability of the deep space probes, many FEKF integrated navigation methods based on multiple measurement models have been proposed [34]. Yang et al. [35] applied the XNAV and ultraviolet sensor to measure the position, velocity, and attitude of the deep space probe. The researchers also designed the FEKF integrated navigation system based on the X-ray pulsar and ultraviolet sensor, and the system could perform clock time prediction, attitude determination, and orbit determination concurrently during the deep space probe in-orbit flight. It could supply high-precision error state estimations. In order to eliminate the influence of the unstable solar spectra on Doppler velocity measurements and given that Doppler difference navigation might accumulate position errors over time, Liu et al. [36] combined the solar Doppler difference velocity measurement and the XNAV method to propose integrated X-ray pulsar/Doppler differential velocity measurement navigation based on FEKF for deep space exploration. This method has high robustness for Doppler velocity measurement deviation brought by the solar spectral instability and also ensures high-accuracy state estimation of the deep space probe in orbit.

The intelligent autonomous navigation algorithm combined with the reinforcement Q-learning algorithm has been favored by researchers at present [37]. In particular, it has a very good application in estimating the accuracy of in-orbit spacecraft, providing high precision for spacecraft navigation systems [38]. Xiong et al. [39] proposed the state error estimation algorithm that combines the reinforcement Q-learning algorithm and the EKF and tried to use reinforcement learning (RL) to restrain the influence of the uncertain noise error covariance matrix on the estimation performance of the EKF navigation algorithm, in which the covariance adaptive adjustment strategy based on Q-learning was designed to enhance the state estimation performance of the autonomous nonlinear navigation systems for the spacecraft. In addition, they also adopted the astronomical integrated navigation based on an ultraviolet earth sensor and optical interferometer and introduced QLEKF to integrate the information from the two measurements to estimate the position, velocity, and attitude state error of the spacecraft [40]. They also assessed the optical path delay bias in the optical interferometer, and the proposed QLEKF demonstrates superior estimation performance compared to the conventional EKF. Nemati et al. [41] adopted RL to enhance the position estimation of the spacecraft and proposed the state estimation algorithm based on the combination of RL and the Kalman filter to acquire the optimal solution of the state and observation noise covariance, which improved the PVSE accuracy for the spacecraft navigation. Xiong et al. [42] recently developed an improved Q-learning EKF optical autonomous navigation based on their own previous research by measuring the line of sight (LOS) direction of the non-cooperative spacecraft targets, which improved the PVSE precision of the spacecraft navigation based on the space target LOS by fine-tunning the process noise error covariance matrix filtering parameters.

Based on the above research conducted by others, this paper presents a combined navigation scheme based on the X-ray pulsar and two-dimensional Doppler velocity measurements to improve the high accuracy of autonomous integrated navigation for the SSBE cruise phase, which can make up for the accumulated position estimation error in the Doppler velocity measurements. The FEKF algorithm is designed to correct the position information obtained from the two-dimensional Doppler velocity-measuring in real-time by the position information estimation of the XNAV, and eliminate the accumulated position errors caused by Doppler velocity-measuring at the same time. The Q-learning-based FEKF integrated navigation algorithm was developed, enabling real-time parameter tuning and fine-tuning the state and observation noise error parameters for this navigation system according to the operational state environment of the probe during the cruising phase. The research is different from the X-ray pulsar and the Doppler velocity dual measurement model integrated navigation given by [33] in that the federation filter composed of the primary filter and sub-filter is adopted to realize the information fusion of multiple navigation modes and guarantee the estimation performance of autonomous integrated navigation. Compared with the above research [39,40,41,42], the intelligent Q-learning algorithm used to tune the parameters of the state and observation noise covariance is also improved accordingly. The reward mechanism and Q-table of the reinforcement Q-learning is designed according to the sub-filter characteristics of the federated filter, and the iterative period of the Q-learning is taken as the algorithm to evaluate the cumulative reward. The optimal state and observation noise covariance parameters are obtained by adjusting the grid size of the Q-table and the learning parameters. The proposed intelligent integrated navigation method based on QLFEKF has the potential to realize high precision and autonomy.

The main research work for this paper involves in-depth research on the high accuracy and feasibility of the intelligent integrated navigation for the cruise phase of the SSBE based on the X-ray pulsar/solar and target planetary Doppler velocity measurements of the QLFEKF algorithm. This paper is divided into six sections. After the introduction of the first section, the orbital dynamics model and the integrated navigation measurement model based on the X-ray pulsar and two-dimensional Doppler velocity measurements of the cruise phase are modeled in Section 2, and the state and measurement equations intelligent navigation for the cruise phase are given. In Section 3, the navigation information obtained from the fusion state estimation algorithm based on QLFEKF is developed with the combination of reinforcement Q-learning and the conventional Kalman filter. In Section 4, the performance of the QLFEKF-based intelligent integrated navigation using X-ray pulsar/solar and target planetary Doppler velocity measurements is simulated and analyzed. The results are compared with the EKF and the FEKF integrated navigation algorithms. The high accuracy performance of the suggested intelligent integrated navigation algorithm is confirmed with simulation and analysis. Finally, the conclusions are presented in Section 5.

2. System Models of Integrated Navigation

2.1. Integrated Navigation Dynamics Model

The heliocentric inertial coordinate system is chosen [43]. The orbit dynamics model [44] of the navigation system during the cruise phase for the SSBE with the Sun as the center and the large planet as the gravitational perturbations [45] is established as follows:

\{\begin{array}{l} \dot{r} = v \\ \dot{v} = - μ_{s} \frac{r_{p s}}{r_{p s}^{3}} - \sum_{i}^{N_{p}} μ_{i} [\frac{r_{p i}}{r_{p i}^{3}} - \frac{r_{s i}}{r_{s i}^{3}}] + α \end{array}

(1)

where

r = {[\begin{matrix} x & y & z \end{matrix}]}^{T}

and

v = {[\begin{matrix} \dot{x} & \dot{y} & \dot{z} \end{matrix}]}^{T}

are the vector of the position and velocity for the SSBE probe in the system of heliocentric inertial coordinate, respectively. µ_s is the constant of heliocentric attraction. r_ps is the vector of the relative position between the heliocentric and the SSBE probe. The gravitational constant is denoted as µ_i for the ith perturbation planet object. The vector of the relative position between the ith perturbation planet object and the probe is denoted by r_pi. The vector of the relative position between the ith perturbation planet and the heliocentric in the inertial system is denoted by r_si,

r_{s i} = r_{p i} + r

. N_p represents the number of planetary targets for perturbing planets. α represents a variety of disturbances in the cruise phase environment that are not modeled and can be treated as process noise.

The orbit dynamic model for the cruise phase of the Earth-to-Jupiter is expressed as follows [46].

\{\begin{array}{l} \dot{x} = v_{x} \\ \dot{y} = v_{y} \\ \dot{z} = v_{z} \\ {\dot{v}}_{x} = - μ_{s} \frac{x}{r_{p s}^{3}} - μ_{j} [\frac{x - x_{j}}{r_{p j}^{3}} + \frac{x_{j}}{r_{s j}^{3}}] - μ_{m} [\frac{x - x_{m}}{r_{p m}^{3}} + \frac{x_{m}}{r_{s m}^{3}}] - μ_{e} [\frac{x - x_{e}}{r_{p e}^{3}} + \frac{x_{e}}{r_{s e}^{3}}] + w_{x} \\ {\dot{v}}_{y} = - μ_{s} \frac{y}{r_{p s}^{3}} - μ_{j} [\frac{y - y_{j}}{r_{p j}^{3}} + \frac{y_{j}}{r_{s j}^{3}}] - μ_{m} [\frac{y - y_{m}}{r_{p m}^{3}} + \frac{y_{m}}{r_{s m}^{3}}] - μ_{e} [\frac{y - y_{e}}{r_{p e}^{3}} + \frac{y_{e}}{r_{s e}^{3}}] + w_{y} \\ {\dot{v}}_{z} = - μ_{s} \frac{z}{r_{p s}^{3}} - μ_{j} [\frac{z - z_{j}}{r_{p j}^{3}} + \frac{z_{j}}{r_{s j}^{3}}] - μ_{m} [\frac{z - z_{m}}{r_{p m}^{3}} + \frac{z_{m}}{r_{s m}^{3}}] - μ_{e} [\frac{z - z_{e}}{r_{p e}^{3}} + \frac{z_{e}}{r_{s e}^{3}}] + w_{z} \end{array}

(2)

Here, (x_j, y_j, z_j), (x_m, y_m, z_m), and (x_e, y_e, z_e) are the position vectors of Jupiter, Mars, and Earth, respectively. The relative positions between the center of mass of Jupiter, Mars, and Earth and the probe are denoted as r_pj, r_pm, and r_pe, respectively. The relative positions between the center of mass of Jupiter, Mars, and Earth and the heliocentric are denoted as r_sj, r_sm, and r_se, respectively.

The state variable

X = {[\begin{matrix} r^{T} & v^{T} \end{matrix}]}^{T}

is selected. The above orbital dynamics model is converted into the equation of state of the system as follows:

X (t) = f [X (t), t] + ω (t)

(3)

where ω(t) is the perturbation force that is not modeled considering solar radiation pressure and other planetary bodies, ω(t) = [0, 0, 0, ω_x, ω_y, ω_z]^T. Compared to the perturbations of Sun, Jupiter, Mars, and Earth, ω_x, ω_y, and ω_z are very small and can be treated as the error noise of the system state in the dynamic model of the cruise phase. It is simplified to the state error covariance

{\hat{Q}}_{k}

with the zero mean white noise,

{\hat{Q}}_{k} = {[{\hat{Q}}_{r k}, {\hat{Q}}_{v k}]}^{T}

.

2.2. Integrated Navigation Measurement Model

2.2.1. Navigation Measurement Model Based on X-Ray Pulsar

The X-ray pulsar navigation measurement uses the pulse TOA as the navigation measurement information, which is suitable for the probe on the SSBE mission that is an in-orbit operation for a long period of time [47]. The X-ray pulsars provide a unique pulse signal that can be detected by the X-ray sensor mounted on the probe. Figure 1 presents the fundamental principle of the X-ray pulsar measurement of the pulse TOA. The X-ray pulsar measurement is the difference between the pulse TOA observed by the X-ray sensor on the probe and the pulse TOA predicted by the pulse arrival time model on the SSB [48].

t_{p s}^{k}

is the pulse TOA of the kth X-ray pulsar measured from the probe, and

t_{S S B}^{k}

is the time it takes the pulse of the kth X-ray pulsar to reach the SSB.

{\hat{n}}^{k}

is the line of sight direction vector of the kth pulsar along the SSB, it can be considered constant throughout the solar system [49]. r_pb is the position vector of the probe with respect to the SSB, r_pb = r_ps − r_SSB. Here, r_SSB is the position vector of the SSB relative to the center of Sun, which could be provided by the JPL DE441 ephemeris tables. c is the speed at which light travels. The offset projection reflected on

t_{S S B}^{k} - t_{p s}^{k}

could be used to estimate the position of the probe in the following expression [27].

t_{S S B}^{k} - t_{p s}^{k} = \frac{{\hat{n}}^{k} \cdot r_{p b}}{c}

(4)

Figure 1. The fundamental principle of the X-ray pulsar measurement pulse TOA.

The TDOA between the pulse of the X-ray pulsar and the SSB could be used as the observation value. Considering the influences of the general relativistic effect, the time conversion model from the probe to the SSB is obtained as follows [49].

\begin{array}{l} t_{S S B}^{k} - t_{p s}^{k} = & \frac{1}{c} {\hat{n}}^{k} \cdot r_{p b} + \frac{1}{2 c D_{0}^{k}} [{({\hat{n}}^{k} \cdot r_{p b})}^{2} + 2 ({\hat{n}}^{k} \cdot b) ({\hat{n}}^{k} \cdot r_{p b}) \\ - 2 (b \cdot r_{p b}) - r_{p b}^{2}] + \frac{2 μ_{S}}{c^{3}} \ln |\frac{{\hat{n}}^{k} \cdot r_{p b} + ‖r_{p b}‖}{{\hat{n}}^{k} \cdot b + ‖b‖} + 1| \end{array}

(5)

Here,

D_{0}^{k}

is the range from the kth X-ray pulsar to the SSB. b is the position vector from the SSB to the heliocentric, b = r_SSB. k is the number of pulsars applied for navigation estimates, k = 1, 2, …, m. When three X-ray pulsars are observed simultaneously, the high-precision three-dimensional position estimation information could be obtained.

Assume the XNAV measurement value is

Z^{X} (t)

, which can be expressed as follows:

Z^{X} (t) = {[c (t_{S S B}^{1} - t_{p s}^{1}), c (t_{S S B}^{2} - t_{p s}^{2}), \dots, c (t_{S S B}^{k} - t_{p s}^{k})]}^{T}

(6)

The observed quantity of the X-ray pulsar measurement can be expressed as follows:

Z^{X} (t) = h^{X} [X (t), t] + v (t)

(7)

where

v (t)

is measurement noise.

h^{X} [X (t), t]

is the XNAV measurement model, and the measurement model for multi-pulsar navigation [36] can be expressed as follows:

h^{X} [X (t), t] = {[h_{1} [X (t), t], \dots, h_{i} [X (t), t], \dots, h_{k} [X (t), t]]}^{T}

(8)

where

h_{k} [X (t), t]

is the XNAV measurement equation for the k-th X-ray pulsar.

\begin{array}{l} h^{X} [X (t), t] & = c \cdot (t_{S S B} - t_{S C}) = {\hat{n}}^{k} \cdot r_{p b} + \frac{1}{2 D_{0}^{k}} [{({\hat{n}}^{k} \cdot r_{p b})}^{2} + 2 ({\hat{n}}^{k} \cdot b) ({\hat{n}}^{k} \cdot r_{p b}) \\ - 2 (b \cdot r_{p b}) - r_{p b}^{2}] + \frac{2 μ_{S}}{c^{2}} \ln |\frac{{\hat{n}}^{k} \cdot r_{p b} + ‖r_{p b}‖}{{\hat{n}}^{k} \cdot b + ‖b‖} + 1| \end{array}

(9)

The corresponding measurement matrix can be expressed as follows:

H^{X} = \frac{\partial h^{X} [X (t), t]}{\partial X} = [{\hat{n}}^{1}, \dots, {\hat{n}}^{i}, \dots {\hat{n}}^{k}]

(10)

2.2.2. Navigation Measurements Model Based on Two-Dimensional Doppler Velocity

Two-dimensional Doppler velocity measurement means that the two spectrometers on the probe separately measure two radial velocities from the probe relative to Sun and the planetary target object, which are used as the velocity measurement information for the SSBE cruise phase. We adopt the two-dimensional Doppler velocity measurement based on the Sun and the target planetary bodies to compensate for the defect of the velocity estimation accuracy of the XNAV and enhance the state estimation precision for the cruise phase navigation.

The basic principle of Doppler velocity measurements based on the solar/target planetary object is shown in Figure 2. Assume that two solar photons leave the Sun at t₀ and are captured by the spectrometer on board the probe through two paths. One is received by spectrometer 1 facing the Sun at time t₁, and the position of the probe at this time is r_ps (t₁). The other one arrives at Jupiter at time t₂, where the position of Jupiter is r_sj(t₂). After it is reflected, it is received by the spectrometer 2 facing Jupiter at time t [33]. The times t₁, t₂, and t can be obtained by numerical calculations, corresponding to the velocity v_ps (t₁), v_ps (t₂), and v_ps (t) of the probe.

Figure 2. The basic principle of the solar/target planetary object Doppler velocity measurement.

In the cruise phase orbit, the positions r_ps (t₀) and r_ps (t₁) of the probe at time t₀ and t₁ can be approximated as follows:

r_{p s} (t_{1}) = r_{p s} (t_{0}) + v_{p s} (t_{0}) (t_{1} - t_{0}) + \frac{1}{2} a_{p s} {(t_{1} - t_{0})}^{2}

(11)

where the acceleration vector for the probe is denoted by a_ps. In the temporal interval t₀ ~ t₁, the distance traveled by the solar photon after departure from Sun is derived as follows:

c (t_{1} - t_{0}) = |r_{p s} (t_{1})| = |r_{p s} (t_{0}) + v_{p s} (t_{0}) (t_{1} - t_{0}) + \frac{1}{2} a_{p s} {(t_{1} - t_{0})}^{2}|

(12)

Similarly, the time t and t₂ can be calculated as follows:

\begin{array}{l} c (t_{1} - t_{0}) & = |r_{s j} (t_{2})| \\ c (t - t_{0}) & = |r_{s j} (t_{2})| + |r_{p s} (t) - r_{s j} (t_{2})| \\ = |r_{s j} (t_{2})| + |r_{p s} (t_{0}) + v_{p s} (t_{0}) (t - t_{0}) + \frac{1}{2} a_{p s} {(t_{1} - t_{0})}^{2} - r_{s j} (t_{2})| \end{array}

(13)

(1): Solar Doppler radial velocity measurement

The radial velocity relative to the Sun and the target planetary object measured by the two spectrometers on board the probe is based on the Doppler shift principle, which can be directly used as navigation measurement information [33,36]. According to the position vector r_ps and velocity vector v_ps of the probe, its radial velocity

v_{p s}^{s}

relative to the Sun can be determined as follows:

v_{p s}^{s} = \frac{r_{p s} (t) \cdot v_{p s} (t)}{|r_{p s} (t)|} + ω_{1} (t)

(14)

where

ω_{1} (t)

is the Doppler measurement noise of spectrometer 1 facing the Sun. Based on the above analysis, the Doppler radial velocity measurement model of the probe relative to Sun is described as follows

Z_{D}^{S} (t) = h_{D}^{S} [X (t), t] + ω_{1} (t)

(15)

The measurement equation of the Doppler radial velocity from the probe relative to the Sun is represented as follows:

h_{D}^{S} [X (t), t] = \frac{r_{p s x} \cdot v_{p s x} + r_{p s y} \cdot v_{p s y} + r_{p s z} \cdot v_{p s z}}{\sqrt{r_{p s x}^{2} + r_{p s y}^{2} + r_{p s z}^{2}}}

(16)

The following expression can be used to represent the corresponding matrix of the observed values.

H_{D}^{S} = \frac{\partial h_{D}^{S} [X (t), t]}{\partial X} = [\begin{matrix} \frac{v_{p s}}{|r_{p s}|} - \frac{r_{p s} \cdot v_{p s}}{{|r_{p s}|}^{3}} r_{p s} \\ \frac{r_{p s}}{|r_{p s}|} \end{matrix}]

(17)

(2): Target planetary object Doppler radial velocity measurement

The radial velocity of the probe with respect to Jupiter can be measured by the spectrometer 2 facing Jupiter on the probe, for which the radial velocity measurement of the probe with respect to Jupiter can be measured.

The radial velocity measurement model measured by the probe relative to Jupiter is expressed as follows:

Z_{D}^{S} (t) = h_{J}^{D} [X (t), t] + ω_{2} (t)

(18)

where

ω_{2} (t)

is the Doppler measurement noise of the spectrometer 2 oriented toward Jupiter. The corresponding velocity observation equation is expressed as follows:

h_{J}^{D} [X (t), t] = \frac{(r_{p s} - r_{j s}) \cdot (v_{p s} - v_{j s})}{|r_{p s} - r_{j s}|}

(19)

where r_js and v_js represent the position and velocity vectors of the probe, respectively, when the Sun photon arrives at Jupiter.

This measurement model corresponds to the observation matrix is expressed as follows:

H_{J}^{D} = \frac{\partial h_{J}^{D} [X (t), t]}{\partial X} = [\begin{matrix} \frac{v_{p s} - v_{j s}}{|r r_{p s} - r_{j s}|} - \frac{(r_{p s} - r_{j s}) \cdot (v_{p s} - v_{j s})}{{|r_{p s} - r_{j s}|}^{3}} (r_{p s} - r_{j s}) \\ \frac{r_{p s} - r_{j s}}{|r_{p s} - r_{j s}|} \end{matrix}]

(20)

3. Intelligent Integrated Navigation Q-Learning FEKF Algorithm

3.1. Navigation Information Fusion with the Q-Learning

In the integrated navigation scheme of the SSBE based on pulsar and solar/target planetary objects presented in the previous section, the state equation and the observational equation of the system are both nonlinear. The orbital dynamics model of the cruise phase for the SSBE shown as Equation (2) is adopted as the state transition model to forecast the next state of the probe, and it can be seen that both the X-ray pulse and solar/target planetary object Doppler velocity measurements are adopted as the measurement model for the navigation system. The EKF algorithm has a good performance for optimal state estimation for nonlinear navigation systems. The FEKF based on RL is adopted to evaluate the position and velocity for the navigation system integrating X-ray pulsar and two-dimensional Doppler velocity measurements in this research. However, the state estimation performance of the conventional federated EKF depends on the suitable state and observation noise error covariance matrices. Based on this, we attempt to combine the reinforcement Q-learning with the federated EKF, which searches for the best state and observation noise error covariance through continuous interactive trial-and-error with the environment and optimizes the error of state estimation by the QLFEKF algorithm. It can improve the navigation error precision for the probe cruise phase during long orbit run.

One of the RL methods that is most frequently applied is Q-learning [50,51,52]. With the development of intelligent autonomous navigation research, Q-learning has been applied more and more in the navigation system design [39,41]. Since the Q-learning is essential for adaptive trial-and-error optimization of the state and observation noise error covariance matrices, we briefly describe the fusion of enhanced Q-learning and navigation information. The intelligent information interaction with the operating environment for the probe agent (PA) is presented in Figure 3. The real-time in-orbit flight environment status for the probe is characterized by the state s. The various measuring instruments have their own state in the flight environment. The environment-based state of the pulsar detector is s₁(t), and the environment-based state of Doppler velocimetry spectrometer is s₂(t). The on-orbit state prediction of the PAs is characterized by the Doppler velocity spectrometer and the X-ray pulsar detector. The PA moves from the current states s₁(t) and s₂(t) to the next state and takes the corresponding actions a₁(t) and a₂(t), while receiving the corresponding rewards R₁(t) and R₂(t) from the environmental information as feedback for performing the corresponding actions in that state. In the current operational environment state in-orbit, the reward represents the good or bad action performed by the PA to choose the next state.

Figure 3. Intelligent information interaction with the flight environment for the PA.

The PA updates its state, selects the best action in real-time according to the ε-greedy strategy, and calculates the cumulative reward. The state and observation noise matrices Q_k (t₁) and R_k (t₁), respectively, are used as the observed state s_k (t₁) for the integrated navigation sub-filter EKF in the SSBE cruise phase. When the PA enters the next operational environment state (performing action a_k (t₂)), the Q_k (t₁) and R_k (t₁) are transferred to Q_k (t₂) and R_k (t₂) (current state s_k (t₂)). The reward could be generated during the process of state transition, evaluating the PVSE errors during the cruise phase for the probe. The positive reward could reinforce the action to choose the more proper observation noise covariance error matrix. The cumulative update rules for the reward R(s_k, a_k) are as follows:

R (s_{k}, a_{k}) = R (s_{k - 1}, a_{k - 1}) + \frac{1}{T t C y} \{{[{({\tilde{y}}_{p s, k}^{T r})}^{T} {({\hat{R}}_{k})}^{- 1} {\tilde{y}}_{p s, k}^{E x}]}^{- 1} - R (s_{k - 1}, a_{k - 1})\}

(21)

where

R (s_{k - 1}, a_{k - 1})

is the cumulative reward after the selection action performed in the previous state. ItCy is the learning reward iteration cycle.

{\hat{R}}_{k}

is the observation noise error covariance matrix corresponding to the current state.

{({\tilde{y}}_{p s, k}^{T r})}^{T}

and

{\tilde{y}}_{p s, k}^{E x}

are the deviation (measurement residual) between the actual observed value and observed update value obtained by the stateful estimation calculation for the conventional EKF and searching EKF, respectively, and are closely associated with the observation noise error covariance adopted for the QLFEKF.

The accumulation of rewards is assessed using the Q-value. The reward is updated in each iteration cycle in the Q-learning algorithm, and the Q-value also is updated accordingly. The updating rules of the Q-value are expressed as follows:

Q (s_{k}, a_{k}) = (1 - α) Q (s_{k}, a_{k}) + α [R (s_{k}, a_{k}) + γ \max_{a_{k + 1}} Q (s_{k + 1}, a_{k + 1})]

(22)

where Q (s_k, a_k) is the performance of performing action a_k based on flight environment state s_k for the agent. α is the learning rate,

α \in (0, 1]

. γ is the discount factor,

γ \in (0, 1]

. Each time the state s changes, the corresponding action a is selected to obtain the corresponding cumulative reward R, and a corresponding Q-value is calculated. The Q-value represents the superiority of choosing action a in state s. When the Q-value corresponding to action a is greater than the Q-value corresponding to other actions that can be selected in this state, action a is the optimum selection for the current state.

The Q-table is introduced to combine Q-learning with the traditional navigation filtering method. By constructing collections for state S and collections for actions A, the Q-learning table with the grid specification M × N is established. The Q-table is utilized to determine the appropriate noise covariance matrix to enhance the efficiency of the intelligent integrated navigation system during the cruise phase. The PA tries to search the optimal state and observation noise covariance based on the Q-value calculated by the cumulative reward to optimize the federated filter. The set of states and corresponding action sets for the QLFEKF are shown in Figure 4. The process noise error covariance matrix is denoted as

Q_{k}^{(i)} = \{Q_{r k}^{(i)}, Q_{v k}^{(i)}\}

, i = 1, 2, …, m.

Q_{r k}^{(i)}

and

Q_{v k}^{(i)}

represent the position and velocity state noise covariance information, respectively. The measurement noise error covariance matrix is denoted as

R_{k}^{(j)} = \{R_{p k}^{(j)}, R_{d k}^{(j)}\}

j = 1, 2, …, n.

R_{p k}^{(j)}

and

R_{d k}^{(j)}

represent the measurement error covariance information of the pulsar detectors and Doppler spectrometers, respectively. The state set of the Q-table is represented as

S^{(i j)} = \{Q_{r k}^{(i)}, Q_{v k}^{(i)}, R_{p k}^{(j)}, R_{d k}^{(j)}\}

, which is distributed in a grid of M rows and N columns. These are divided into three state types, including corner regions (such as S⁽¹¹⁾, S^(m1), S⁽¹ⁿ⁾, and S^(mn)), edge regions (such as S⁽ⁱ¹⁾, S^(1j), S⁽ⁱⁿ⁾, and S^(mj), i ≠ 1, m − 1, j ≠ 1, n − 1) and the central region (S^(ij), i ≠ 1, m − 1 and j ≠ 1, n − 1). The process of state transition defines the actions A set with up direction (↑), down direction (↓), left direction (←), and right direction (→), which is represented by A = {0 ↑, 1 ↓, 2 ←, 3 →}. The actions that could be selected for the corner regions (top left, bottom left, top right, bottom right) are A_corner_{_tl} = {1, 3}, A_corner_{_bl} = {0, 3}, A_{corner_tr} = {1, 2}, and A_corner_{_br} = {0, 2}; the edge regions (up, down, left, right) are A_edge_{_u} = {1, 2, 3}, A_edge_{_d} = {0, 2, 3}, A_edge_{_l} = {0, 1, 3}, and A_edge_{_r} = {0, 1, 2}; and the central region is A_center = {0, 1, 2, 3}.

Figure 4. Collections of states and corresponding collections for actions for the QLFEKF. The shaded areas denote various combinations of the state and observation noise error covariance matrices Q_k and R_k_. The arrows represent the transitions between different states, and it means choosing different actions.

3.2. Design of the Q-Learning-Based FKF for Intelligent Integrated Navigation

The Q-learning-based FEKF navigation algorithm is designed based on the dynamics and measurement model established in Section 2.1 and Section 2.2, and the updated cumulative reward criteria of the Q-learning shown in Equation (21). The Q-learning and federated EKF are deeply integrated, and the X-ray pulsar and Doppler velocity measurement sub-filter parameters are trial-and-error and fine-tuned to improve the PVSE precision of the intelligent integrated navigation for the cruise phase. The state process and observation noise error covariance matrices related to the dynamic and measurement model are both crucial filtering parameters that affect the state estimation precision and performance for the intelligent integrated FEKF navigation system. Thus, the QLFEKF algorithm is adopted to adjust the state and observation noise error covariance matrices in real-time to accurately assess the PVSE precision of the in-orbit operation probe during the cruise phase.

According to the above description, the implementation process of the QLFEKF is given in Algorithm 1.

Algorithm 1: The Q-learning-based FEKF algorithm of the intelligent integrated navigation system
Input:	Initial state estimation ${\hat{X}}_{p s, 0}$ , error covariance matrix P_ps_,0, predetermined set of the noise covariance matrix error initial values ${\hat{Q}}_{0}, {\hat{Q}}_{p 0}, {\hat{Q}}_{d 0}$ , and learning parameter α, γ, ε
Step 1:	Initialize variables, ${\hat{X}}_{p s, 0}^{T r (p)} = {\hat{X}}_{p s, 0}^{E x (p)} = {\hat{X}}_{p s, 0}^{S e (p)} \leftarrow {\hat{X}}_{p s, 0}, {\hat{X}}_{p s, 0}^{T r (d)} = {\hat{X}}_{p s, 0}^{E x (d)} = {\hat{X}}_{p s, 0}^{S e (d)} \leftarrow {\hat{X}}_{p s, 0}, {\hat{X}}_{p s, 0}^{g} \leftarrow {\hat{X}}_{p s, 0},$ $P_{p s, 0}^{T r} = P_{p s, 0}^{E x} = P_{p s, 0}^{S e} \leftarrow P_{p s, 0}, P_{p s, 0}^{g} \leftarrow P_{p s, 0}$
Step 2:	Time index initialization $k \leftarrow 0$
Step 3:	For the specified time of the navigation
Step 4:	Set state and action, for all construct state sets $s \in S$ , the action set $a \in A$
Step 5:	Environment state initialization $s_{k} \leftarrow s_{0}$ , Q-value $Q^{(p)} (s_{k}, a_{k}) \leftarrow 0, R^{(p)} (s_{k}, a_{k}) \leftarrow 0$ , and reward initialization $Q^{(d)} (s_{k}, a_{k}) \leftarrow 0, R^{(d)} (s_{k}, a_{k}) \leftarrow 0$
Step 6:	Set the initial noise error covariance matrix ${\hat{Q}}_{r k} \leftarrow {\hat{Q}}_{r 0}, {\hat{Q}}_{v k} \leftarrow {\hat{Q}}_{v 0}$ , and ${\hat{Q}}_{k} \leftarrow [{\hat{Q}}_{r k}, {\hat{Q}}_{v k}]$ ; ${\hat{R}}_{p k} \leftarrow {\hat{R}}_{p 0}, {\hat{R}}_{d k} \leftarrow {\hat{R}}_{d 0}$
Step 7:	Generated randomly the noise error covariance matrix ${\hat{Q}}_{r k}^{(i)} \in [{\underline{q}}_{r k} * {\hat{Q}}_{r 0}, {\bar{q}}_{r k} * {\hat{Q}}_{r 0}]$ , ${\hat{Q}}_{v k}^{(i)} \in [{\underline{q}}_{v k} * {\hat{Q}}_{v 0}, {\bar{q}}_{v k} * {\hat{Q}}_{v 0}], {\hat{R}}_{p k} \in [{\underline{r}}_{p k} * {\hat{R}}_{p 0}, {\bar{r}}_{p k} * {\hat{R}}_{p 0}], {\hat{R}}_{d k} \in [{\underline{r}}_{d k} * {\hat{R}}_{d 0}, {\bar{r}}_{d k} * {\hat{R}}_{d 0}]$
Step 8:	ε-greedy strategy a_k ← ε-greedy (s_k, A, Q (s_k, a_k), ε) choose action a_k for state s_k
Step 9:	Execute action a_k and observe arrived the next states ${s^{'}}_{k}$
Step 10:	For each time step ItCy = 1, 2, …, T in one iteration, do
Step 11:	k ← k + 1
Step 12:	$[{\hat{X}}_{p s, k}^{T r (p)}, P_{p s, k}^{T r (p)}, {\tilde{y}}_{p s, k}^{T r (p)}] \leftarrow E K F ({\hat{X}}_{p s, k \| k - 1}^{T r (p)}, P_{p s, k \| k - 1}^{T r (p)}, y_{p s, p k}, {\hat{Q}}_{k}, {\hat{R}}_{p k}) ⊳$ XP benchmark $[{\hat{X}}_{p s, k}^{T r (d)}, P_{p s, k}^{T r (d)}, {\tilde{y}}_{p s, k}^{T r (d)}] \leftarrow E K F ({\hat{X}}_{p s, k \| k - 1}^{T r (d)}, P_{p s, k \| k - 1}^{T r (d)}, y_{p s, d k}, {\hat{Q}}_{k}, {\hat{R}}_{d k}) ⊳$ STD benchmark
Step 13:	${\hat{Q}}_{r k}^{(i)}, {\hat{Q}}_{v k}^{(i)}, {\hat{R}}_{p k}^{(j)}$ and ${\hat{R}}_{d k}^{(j)}$ are determined according to the current state s_k
Step 14:	$[{\hat{X}}_{p s, k}^{E x (p)}, P_{p s, k}^{E x (p)}, {\tilde{y}}_{p s, k}^{E x (p)}] \leftarrow E K F ({\hat{X}}_{p s, k \| k - 1}^{E x (p)}, P_{p s, k \| k - 1}^{E x (p)}, y_{p s, p k}, {\hat{Q}}_{r k}^{(i)}, {\hat{Q}}_{v k}^{(i)}, {\hat{R}}_{p k}^{(j)}) ⊳$ XP searching $[{\hat{X}}_{p s, k}^{E x (d)}, P_{p s, k}^{E x (d)}, {\tilde{y}}_{p s, k}^{E x (d)}] \leftarrow E K F ({\hat{X}}_{p s, k \| k - 1}^{E x (d)}, P_{p s, k \| k - 1}^{E x (d)}, y_{p s, d k}, {\hat{Q}}_{r k}^{(i)}, {\hat{Q}}_{v k}^{(i)}, {\hat{R}}_{d k}^{(j)}) ⊳$ STD searching
Step 15:	$R^{(p)} (s_{k}, a_{k}) \leftarrow R^{(p)} (s_{k - 1}, a_{k - 1}) + \frac{1}{I t C y} \{{[{({\tilde{y}}_{p s, k}^{T r (p)})}^{T} {({\hat{R}}_{p k}^{(j)})}^{- 1} {\tilde{y}}_{p s, k}^{E x (p)}]}^{- 1} - R^{(p)} (s_{k - 1}, a_{k - 1})\} ⊳$ XP rewards $R^{(d)} (s_{k}, a_{k}) \leftarrow R^{(d)} (s_{k - 1}, a_{k - 1}) + \frac{1}{I t C y} \{{[{({\tilde{y}}_{p s, k}^{T r (d)})}^{T} {({\hat{R}}_{d k}^{(j)})}^{- 1} {\tilde{y}}_{p s, k}^{E x (d)}]}^{- 1} - R^{(d)} (s_{k - 1}, a_{k - 1})\} ⊳$ STD rewards
Step 16:	$\begin{array}{l} P_{p s, k - 1}^{S e (p)} \leftarrow β_{p}^{- 1} P_{p s, k - 1}^{g}, {\hat{Q}}_{r p k}^{(i)} \leftarrow β_{p}^{- 1} {\hat{Q}}_{r k}^{(i)}, {\hat{Q}}_{v p k}^{(i)} \leftarrow β_{p}^{- 1} {\hat{Q}}_{v k}^{(i)} \\ P_{p s, k - 1}^{S e (d)} \leftarrow β_{d}^{- 1} P_{p s, k - 1}^{g}, {\hat{Q}}_{r d k}^{(i)} \leftarrow β_{d}^{- 1} {\hat{Q}}_{r k}^{(i)}, {\hat{Q}}_{v d k}^{(i)} \leftarrow β_{d}^{- 1} {\hat{Q}}_{v k}^{(i)} \\ P_{p s, k - 1}^{m} \leftarrow β_{m}^{- 1} P_{p s, k - 1}^{g}, {\hat{Q}}_{k}^{m} \leftarrow β_{m}^{- 1} {\hat{Q}}_{k} \\ {\hat{X}}_{p s, k - 1}^{S e (p)} \leftarrow {\hat{X}}_{p s, k - 1}^{g}, {\hat{X}}_{p s, k - 1}^{S e (d)} \leftarrow {\hat{X}}_{p s, k - 1}^{g}, {\hat{X}}_{p s, k - 1}^{m} \leftarrow {\hat{X}}_{p s, k - 1}^{g} \end{array}$ $⊳$ Information distribution weight
Step 17:	$\begin{array}{l} {\hat{X}}_{p s, k \| k - 1}^{S e (p)} \leftarrow Φ_{p s, k \| k - 1}^{S e (p)} {\hat{X}}_{p s, k - 1}^{S e (p)}, P_{p s, k \| k - 1}^{S e (p)} \leftarrow Φ_{p s, k \| k - 1}^{S e (p)} P_{p s, k - 1}^{S e (p)} {(Φ_{p s, k \| k - 1}^{S e (p)})}^{T} + {\hat{Q}}_{p k}^{S e (i)} \\ {\hat{X}}_{p s, k \| k - 1}^{S e (d)} \leftarrow Φ_{p s, k \| k - 1}^{S e (d)} {\hat{X}}_{p s, k - 1}^{S e (d)}, P_{p s, k \| k - 1}^{S e (d)} \leftarrow Φ_{p s, k \| k - 1}^{S e (d)} P_{p s, k - 1}^{S e (d)} {(Φ_{p s, k \| k - 1}^{S e (d)})}^{T} + {\hat{Q}}_{d k}^{S e (i)} \\ {\hat{X}}_{p s, k \| k - 1}^{m} \leftarrow Φ_{p s, k \| k - 1}^{m} {\hat{X}}_{p s, k - 1}^{m}, P_{p s, k \| k - 1}^{m} \leftarrow Φ_{p s, k \| k - 1}^{m} P_{p s, k - 1}^{m} {(Φ_{p s, k \| k - 1}^{m})}^{T} + {\hat{Q}}_{k}^{m} \end{array}$ $⊳$ Time update
Step 18:	$[{\hat{X}}_{p s, k}^{S e (p)}, P_{p s, k}^{S e (p)}, {\tilde{y}}_{p s, k}^{S e (p)}] \leftarrow E K F ({\hat{X}}_{p s, k \| k - 1}^{S e (p)}, P_{p s, k \| k - 1}^{S e (p)}, y_{p s, p k}, {\hat{Q}}_{r p k}^{(i)}, {\hat{Q}}_{v p k}^{(i)}, {\hat{R}}_{p k}^{(j)}) ⊳$ XP measurement update $[{\hat{X}}_{p s, k}^{S e (d)}, P_{p s, k}^{S e (d)}, {\tilde{y}}_{p s, k}^{S e (d)}] \leftarrow E K F ({\hat{X}}_{p s, k \| k - 1}^{S e (d)}, P_{p s, k \| k - 1}^{S e (d)}, y_{p s, d k}, {\hat{Q}}_{r d k}^{(i)}, {\hat{Q}}_{v d k}^{(i)}, {\hat{R}}_{d k}^{(j)}) ⊳$ STD measurement update
Step 19:	$\begin{array}{l} P_{p s, k}^{g} \leftarrow {[{(P_{p s, k}^{S e (p)})}^{- 1} + {(P_{p s, k}^{S e (d)})}^{- 1} + {(P_{p s, k}^{m})}^{- 1}]}^{- 1} \\ {\hat{X}}_{p s, k}^{g} \leftarrow P_{p s, k}^{g} [{(P_{p s, k}^{S e (p)})}^{- 1} {\hat{X}}_{p s, k}^{S e (p)} + {(P_{p s, k}^{S e (d)})}^{- 1} {\hat{X}}_{p s, k}^{S e (d)} + {(P_{p s, k}^{m})}^{- 1} {\hat{X}}_{p s, k}^{m}] \end{array}$ $⊳$ Information fusion
Step 20:	End for
Step 21:	$Q^{(p)} (s_{k}, a_{k}) \leftarrow (1 - α) Q^{(p)} (s_{k}, a_{k}) + α [R^{(p)} (s_{k}, a_{k}) + γ \underset{a_{k + 1}}{m a x} Q^{(p)} (s_{k + 1}, a_{k + 1})] ⊳$ XP Q-value $Q^{(d)} (s_{k}, a_{k}) \leftarrow (1 - α) Q^{(d)} (s_{k}, a_{k}) + α [R^{(d)} (s_{k}, a_{k}) + γ \underset{a_{k + 1}}{m a x} Q^{(d)} (s_{k + 1}, a_{k + 1})] ⊳$ STD Q-value
Step 22:	Set ${s^{'}}_{k}$ as the current state $s_{k} \leftarrow {s^{'}}_{k}$
Step 23:	$R^{(p)} (s_{k}, a_{k}) \leftarrow 0, R^{(d)} (s_{k}, a_{k}) \leftarrow 0 ⊳$ Reset reward
Step 24:	${\hat{X}}_{p s, k}^{E x (p)} \leftarrow {\hat{X}}_{p s, k}^{T r (p)}, P_{p s, k}^{E x (p)} \leftarrow P_{p s, k}^{T r (p)}, {\hat{X}}_{p s, k}^{E x (d)} \leftarrow {\hat{X}}_{p s, k}^{T r (d)}, P_{p s, k}^{E x (d)} \leftarrow P_{p s, k}^{T r (d)}$ Reset searching EKF
Step 25:	End for
Output:	Return as state estimate ${\hat{X}}_{p s, k} \leftarrow {\hat{X}}_{p s, k}^{g}$ and $P_{p s, k} \leftarrow P_{p s, k}^{g}$

In Algorithm 1, the initial information includes the estimation error covariance matrix

{\hat{X}}_{p s, 0}

, the state noise error covariance matrix

{\hat{Q}}_{0} = ({\hat{Q}}_{r 0}, {\hat{Q}}_{v 0})

of the system, and the observation noise error covariance matrix

({\hat{R}}_{p 0}, {\hat{R}}_{d 0})

at the initial time. α represents the learning rate of the Q-learning, γ represents the learning rate of the Q-learning, and ε represents the action selection probability of the Q-learning, which can be applied to regulate the performance of the learning for the reinforcement Q-learning. The specific time of the navigation in step 3 refers to the navigation time for the cruise phase. In step 7, the state process (including PVSE) and observation (including the X-ray pulsar detector and two Doppler velocity-measuring spectrometers) noise error covariance matrix are randomly generated within the set range corresponding to the initial values of the noise error covariance matrix.

{\underline{q}}_{r k}

and

{\bar{q}}_{r k}

are the upper edge and lower edge bounds for the position state noise covariance matrix of the system error, respectively.

{\underline{q}}_{v k}

and

{\bar{q}}_{v k}

are the upper edge and lower edge bounds for the velocity state noise covariance matrix of the system error, respectively. The ε-greed strategy is involved in step 8, which is determined by the cumulative rewards [R^(p)(s_k, a_k), R^(d)(s_k, a_k)] and Q-values [Q^(p)(s_k, a_k), Q^(d)(s_k, a_k)] calculated from different state s_k and action a_k. The PA selects the random action according to the probability of choice action ε and selects the action to maximize the Q-value with the probability of the (1 − ε). The learning strategy has good stability for the PA to select the most suitable noise covariance matrix. ItCy is the period of time for the calculation of cumulative rewards. If ItCy is small, the weight and cumulative rewards for each update will be large, so the design of the ItCy should not be too small.

In step 12,

{\hat{X}}_{p s, k}^{T r (p)}

and

{\hat{X}}_{p s, k}^{T r (d)}

represent the state estimate values corresponding the sub-filter in the benchmark EKF for the X-ray pulsar and Doppler, respectively.

P_{p s, k}^{T r (p)}

and

P_{p s, k}^{T r (d)}

represent the state transition matrix corresponding to the benchmark EKF, respectively.

{\tilde{y}}_{p s, k}^{T r (p)}

and

{\tilde{y}}_{p s, k}^{T r (d)}

represent the measured innovation information corresponding to the benchmark EKF, respectively. Innovation information in the benchmark EKF and search EKF are important parameters of the Q-learning reward evaluation.

{\hat{Q}}_{r k}^{(i)}

,

{\hat{Q}}_{v k}^{(i)}

,

{\hat{R}}_{p k}^{(j)}

, and

{\hat{R}}_{d k}^{(j)}

are determined according to the current state s_k. In step 14,

{\hat{X}}_{p s, k}^{E x (p)}

and

{\hat{X}}_{p s, k}^{E x (d)}

are state estimates for the searching EKF in the X-ray pulsars sub-filters and Doppler sub-filters, respectively.

P_{p s, k}^{E x (p)}

and

P_{p s, k}^{E x (d)}

represent the state transition matrix corresponding to the searching EKF, respectively.

{\tilde{y}}_{p s, k}^{E x (p)}

and

{\tilde{y}}_{p s, k}^{E x (d)}

represent the measurement innovation information corresponding to the searching EKF, respectively. In step 15, the cumulative reward of the QLEKF based on the X-ray pulsar and Doppler during the iteration period is calculated, and the Q-value is determined according to the sizes of the reward

R^{(p)} (s_{k}, a_{k})

and

R^{(d)} (s_{k}, a_{k})

. A positive reward is received for suitable

{\hat{Q}}_{r k}^{(i)}

,

{\hat{Q}}_{v k}^{(i)}

,

{\hat{R}}_{p k}^{(j)}

, and

{\hat{R}}_{d k}^{(j)}

, and a negative reward is received for unsuitable

{\hat{Q}}_{r k}^{(i)}

,

{\hat{Q}}_{v k}^{(i)}

,

{\hat{R}}_{p k}^{(j)}

, and

{\hat{R}}_{d k}^{(j)}

. In step 16, the weights of the X-ray pulsar, Doppler sub-filter, and the main filter are designed and distributed information. In steps 17 and 18, the corresponding time information update and measurement information update of federated filtering are mainly carried out.

{\hat{X}}_{p s, k}^{S e (p)}

and

{\hat{X}}_{p s, k}^{S e (d)}

are the state estimates for the measurement updated EKF for the X-ray pulsars sub-filters and Doppler sub-filters, respectively.

P_{p s, k}^{S e (p)}

and

P_{p s, k}^{S e (d)}

are the state transition matrices corresponding to the measurement update EKF, respectively.

{\tilde{y}}_{p s, k}^{S e (p)}

and

{\tilde{y}}_{p s, k}^{S e (d)}

are the measurement innovation information corresponding to the measurement update EKF, respectively. In step 19, information fusion is performed on the state estimation of the sub-filter based on the X-ray pulsar, the sub-filter based on Doppler and the main filter. The detailed design process of the federated filtering algorithm in steps 16~19 are given below. The cumulative rewards are gained at the end of the iteration cycle. The value of the Q-value is calculated according to the cumulative reward from the innovation information of filtering, and the corresponding Q-value for each set of the state-action pair (s_k, a_k) is stored in the Q-table. According to the Q-values corresponding to different actions in the current state, the best action is selected to reach the next best state, and the corresponding selection locates the state and observation noise variance matrices.

The structure diagram of the Q-learning-based FEKF intelligent integrated navigation system is shown in Figure 5. The observation information of the X-ray pulsar and Doppler velocity-measuring is fused by the QLFEKF. The reinforcement Q-learning-based EKF is used as the sub-filter for the X-ray pulsar and Doppler velocity-measuring navigation system, and each sub-filter is composed of the benchmark filter, searching filter, and state estimation learning filter. The two sub-filters work in parallel, the output locally optimal state estimates

{\hat{X}}_{p s, k}^{S e (p)}

and

{\hat{X}}_{p s, k}^{S e (d)}

as well as the error covariance matrices

P_{p s, k}^{S e (p)}

and

P_{p s, k}^{S e (d)}

, which are used as inputs to the global master-filter for information fusion periodically. Finally, the global optimal state estimation

{\hat{X}}_{p s, k}^{g}

and error covariance

P_{p s, k}^{g}

are obtained. The global primary filter system performs fusion estimation according to the estimation information of the sub-filter system and assigns the fusion estimation information to the sub-filter according to the design weight by using the information distribution principle. The single-step intelligent federated filtering process is realized to obtain the current state estimation information.

Figure 5. Structure diagram of the Q-learning-based FEKF intelligent integrated navigation.

The information distribution process is that the global main filter and two sub-filters allocate state estimation information according to a certain weight. The information distribution principle between the main filter and each sub-filter is as follows:

\{\begin{cases} {\hat{Q}}_{n k}^{(i)} = β_{n}^{- 1} {\hat{Q}}_{k}^{(i)}, {\hat{Q}}_{k}^{m} = β_{m}^{- 1} {\hat{Q}}_{k} \\ P_{p s, k}^{S e (n)} = β_{n}^{- 1} P_{p s, k}^{g}, P_{p s, k}^{m} = β_{m}^{- 1} P_{p s, k}^{g} \\ {\hat{X}}_{p s, k}^{S e (n)} = {\hat{X}}_{p s, k}^{g}, {\hat{X}}_{p s, k}^{m} = {\hat{X}}_{p s, k}^{g} \end{cases}, n = p, d

(23)

where

β_{m}

and

β_{n}

are the information distribution coefficients of the parallel sub-filter and primary filter, respectively,

β_{m}, β_{n} \in (0, 1], n = 1, 2, \dots

. Here the principle of conservation of the information distribution is satisfied,

\sum^{n} β_{m} + β_{n} = 1

.

The QLFEKF information update includes time update and measurement update, in which the update of time is updated autonomously in the parallel sub-filter and primary filter, the specific update process is as follows:

\{\begin{cases} {\hat{X}}_{p s, k | k - 1}^{S e (n)} = Φ_{p s, k | k - 1}^{S e (n)} {\hat{X}}_{p s, k - 1}^{S e (p)} \\ P_{p s, k | k - 1}^{S e (n)} = Φ_{p s, k | k - 1}^{S e (n)} P_{p s, k - 1}^{S e (n)} {(Φ_{p s, k | k - 1}^{S e (n)})}^{T} + {\hat{Q}}_{n k}^{(i)} \\ {\hat{X}}_{p s, k | k - 1}^{m} = Φ_{p s, k | k - 1}^{m} {\hat{X}}_{p s, k - 1}^{m} \\ P_{p s, k | k - 1}^{m} = Φ_{p s, k | k - 1}^{m} P_{p s, k - 1}^{m} {(Φ_{p s, k | k - 1}^{m})}^{T} + {\hat{Q}}_{k}^{m} \end{cases}

(24)

Since there is not measurement information in the global main filter, the measurement information is located in the parallel sub-filter and is only carried out in the state estimation learning filter in the sub-filter. The specific update process is described as follows:

\{\begin{cases} K_{k}^{S e (n)} = P_{p s, k}^{S e (n)} {(H_{p s, k}^{S e (n)})}^{T} {[H_{p s, k}^{S e (n)} P_{p s, k | k - 1}^{S e (n)} {(H_{p s, k}^{S e (n)})}^{T} + {\hat{R}}_{p k}^{n}]}^{- 1} \\ {\hat{X}}_{p s, k}^{S e (n)} = {\hat{X}}_{p s, k | k - 1}^{S e (n)} + K_{k}^{S e (n)} y_{p s, n k} \\ P_{p s, k}^{S e (n)} = [I - K_{k}^{S e (n)} H_{p s, k}^{S e (n)}] P_{p s, k | k - 1}^{S e (n)} {[I - K_{k}^{S e (n)} H_{p s, k}^{S e (n)}]}^{T} + K_{k}^{S e (n)} {\hat{R}}_{p k}^{n} {(K_{k}^{S e (n)})}^{T} \end{cases}

(25)

The federated EKF fuses the information of the state estimates

{\hat{X}}_{p s, k}^{S e (p)}

and

{\hat{X}}_{p s, k}^{S e (d)}

as well as the error covariance matrices

P_{p s, k}^{S e (p)}

and

P_{p s, k}^{S e (d)}

of the two sub-filters and the main filter. The optimal global state estimation is obtained as follows:

\{\begin{cases} {\hat{X}}_{p s, k}^{g} = {[\sum_{n = p, d}^{n} {(P_{p s, k}^{S e (n)})}^{- 1} + {(P_{p s, k}^{m})}^{- 1}]}^{- 1} \\ P_{p s, k}^{g} = P_{p s, k}^{g} [\sum_{n = p, d}^{n} [{(P_{p s, k}^{S e (n)})}^{- 1} {\hat{X}}_{p s, k}^{S e (n)}] + {(P_{p s, k}^{m})}^{- 1} {\hat{X}}_{p s, k}^{m}] \end{cases}

(26)

The output result of the QLFEKF algorithm is the global state estimate, including the PVSE

{\hat{X}}_{p s, k}

and corresponding state error estimate covariance

P_{p s, k}

.

\{\begin{cases} {\hat{X}}_{p s, k} = {\hat{X}}_{p s, k}^{g} \\ P_{p s, k} = P_{p s, k}^{g} \end{cases}

(27)

4. Simulation and Results Analysis

The high-accuracy, effectiveness, and feasibility of the autonomous intelligent integrated navigation method is an important index to improve the autonomous operation ability of the cruise phase for the SSB probe. In order to further analyze the performance for this presented method, which is compared with the integrated navigation FEKF algorithm to verify the high accuracy of the state estimation of the X-ray pulsar/solar and target planetary Doppler velocity-measuring intelligent integrated navigation algorithm based on the QLFEKF.

4.1. Simulation Initial Conditions

This section presents the initial position state and velocity sate parameters of the probe cruise phase for the solution of the orbital dynamics model at the beginning. The initial simulation conditions are shown in Table 1 as follows. The simulation time is from 1 September 2026 12:00:00:00.000 UTCG to 7 September 2026 12:00:00:00.000 UTCG. Thus, the total simulation time is 6 days.

Table 1. Initial state orbit parameters for the probe.

Three X-ray pulsars are considered as one of the measurement models for intelligent integrated navigation, namely PSR B1937+21, PSR B0531+21, and PSR J1821-24. Their specific parameters are shown in Table 2. The X-ray pulsar detector has an effective area of 1 m². The vector of the position for the SSB relative to the solar center of mass is calculated by the ephemeris DE441 from the Jet Propulsion Laboratory (JPL). According to the approximate calculation formula of the pulsar observation noise [53,54], the standard deviations of the measurement noise for three X-ray pulsars are p₁ = 0.42060 km, p₂ = 0.14070 km, and p₃ = 0.44488 km, respectively.

Table 2. The corresponding parameters of the selected X-ray pulsars.

The initial conditions of the relevant parameters for the X-ray pulsars and solar/target planetary Doppler measurement model that would be adopted in the simulation are listed in Table 3 and mainly include the relevant initial measurement parameter characteristics of the X-ray pulsar detectors and Doppler spectrometers.

Table 3. The corresponding parameters of the measurement models.

In the QLFEKF algorithm, reinforcement Q-learning is adopted to tune the parameters of the state and observation noise error covariance matrices and to obtain the optimum state and observation noise error covariance matrices. The parameters tuned by the reinforcement Q-learning include α, γ, and ε. Set the size of the appropriate Q-learning grid to M × N = 20 × 20. In the case of the specific range of the

{\hat{Q}}_{r k}^{(i)}

,

{\hat{Q}}_{v k}^{(i)}

,

{\hat{R}}_{p k}^{(j)}

, and

{\hat{R}}_{d k}^{(j)}

, the larger scale grid could search more fully and cover more area. The root mean squared error (RMSE) [54] is adopted as the evaluation indicator to implement precision analysis of the PVSE for the probe’s cruise phase. The relevant design parameters for the QLFEKF algorithm are summarized in Table 4 as follows.

Table 4. The design parameters of the QLFEKF algorithm.

4.2. Simulation and Analysis

The X-ray pulsar/solar and target planetary Doppler velocity measurement based the Q-learning federation EKF (STD/XP-QLFEKF) is compared with the navigation state estimates of precision of the X-ray pulsar/solar and target planetary Doppler velocity measurement-based EKF (STD/XP-EKF) and the X-ray pulsar/solar and target planetary Doppler velocity measurement-based federated EKF (STD/XP-FEKF). The RMSE of the PVES of the cruise phase for the probe is used as the evaluation indicator of the navigation accuracy.

Figure 6 and Figure 7 show the RMSE curves of the position estimation error and the three-axis position estimation error of the three integrated navigation algorithms. It can be seen from the figure that the QLFEKF algorithm shows high accuracy position estimation performance with an error accuracy of 10² m. This shows that QLFEKF can adaptively choose the appropriate process states and observation (the X-ray pulsar detector measurement errors and spectrometer measurement errors) noise error covariance matrix by using the cumulative reward mechanism of the Q-learning through its interaction with the flight environment.

Figure 6. Comparison of the position estimate RMSEs between STD/XP-QLFEKF and other integrated navigation algorithms.

Figure 7. Comparison of the position estimate RMSEs for three axes between STD/XP-QLFEKF and other integrated navigation algorithms.

The stability of the position for the three integrated navigation algorithms has corresponding fluctuations, which are caused by eliminating the accumulated position errors caused by Doppler velocity measurements. However, compared with the position state estimation of the STD/XP-EKF navigation algorithm and STD/XP-FEKF navigation algorithm, the stability of the STD/XP-QLFEKF position state estimation is better. This result shows that the Q-learning algorithm has a good convergence effect in the process of parameter tuning, which ensures that the probe can operate continuously and stably in the cruise phase with high precision.

From Figure 8 and Figure 9, it is seen that the velocity estimation precision of the STD/XP-QLFEKF algorithm is about 10⁻¹ m/s, which is much higher than that of the STD/XP-EKF algorithm and the STD/XP-FEKF algorithm. It can be indicated that the state error estimation precision of the FEKF is better than that of the EKF, which reflects the superiority of the federated filter in the fusion of various measurement information. The FEKF combined with the Q-learning algorithm can distribute the position, velocity, and state transition information from different measurement models according to the weighted proportions, and it ensures that the information fusion of various navigation measurement models can obtain high-accuracy position state and velocity state error estimations based on the time update and measurement update. In addition, we could see that the precision convergence of the intelligent navigation system based on X-ray pulsar/solar and target planetary Doppler velocity measurements is very stable.

Figure 8. Comparison of the velocity estimate RMSEs between STD/XP-QLFEKF and other integrated navigation algorithms.

Figure 9. Comparison of the velocity estimate RMSEs based on three axes between STD/XP-QLFEKF and other integrated navigation algorithms.

Based on the simulated analysis mentioned above, it verifies that the proposed STD/XP-QLFEKF intelligent integrated navigation is very appropriate for the considered intelligent integrated navigation for the cruise phase of the SSBE. Table 5 below shows the comparison of the position state and velocity state estimation RMSE errors (3σ) for the above three integrated navigation algorithms.

Table 5. The PVSE RMSE errors (3σ) of the integrated navigation system.

From Table 5 above, it can be seen that the position estimation RMSE error (3σ) accuracy is 443.560 m and the velocity state RMSE (3σ) precision is 0.630 m/s for the STD/XP-EKF. The PVES RMSE (3σ) precision of the STD/XP-FEKF are 351.803 m and 0.386 m/s, respectively. Based on the proposed STD/XP-QLFEKF algorithm in the paper, the PVSE RMSE error (3σ) accuracy of the cruise phase for the probe is 152.101 m and 0.243 m/s, respectively. By comparing the accuracy of the PVSE errors (3σ) between the STD/XP-QLFEKF and STD/XP-FEKF algorithms, the accuracy of the STD/XP-QLFEKF algorithm has been greatly improved. According to statistical analysis, the PVSE precision improvement capability of the STD/XP-QLFEKF algorithm is 55.84% and 37.04%, respectively. This is because the Q-learning algorithm can dynamically adjust the process and observation noise error covariance by periodically iterating according to the real-time in-orbit operating state after receiving different measurement state information. The cumulative reward is obtained through continuous iterations in each iteration cycle (ItCy). The Q-value of the corresponding state (

{\hat{Q}}_{r k}^{(i)}

,

{\hat{Q}}_{v k}^{(i)}

,

{\hat{R}}_{p k}^{(j)}

, and

{\hat{R}}_{d k}^{(j)}

) position in the Q-table is calculated according to the cumulative reward value, and the appropriate information parameters are selected to the maximum extent through continuous trial-and-error to guarantee the performance superiority of the STD/XP-QLFEKF algorithm.

Next, the range of the related parameters (including α, γ, and ε) for the reinforcement Q-learning is set to evaluate the effects on the related learning parameters on the precision of the PVES error of the STD/XP-QLFEKF. The following sets the range of the Q-learning parameters involved above when the three learning parameters are changed in the range of

α \in [0.05 0.14]

,

γ \in [0.90 0.99]

, and

ε \in [0.05 0.14]

, and the RMSE errors (3σ) for position state and velocity state estimation of the STD/XP-QLFEKF algorithm are shown in Figure 10, Figure 11 and Figure 12, respectively.

Figure 10. The PVSE RMSE errors (3σ) of the cruise phase as a function of the learning rate.

Figure 11. The PVSE RMSE errors (3σ) of the cruise phase as a function of the discount factor.

Figure 12. The PVSE RMSE errors (3σ) for the cruise phase as s function of the action selection probability.

From Figure 10, Figure 11 and Figure 12, it can be seen that the change of the Q-learning design parameters has relatively little influence on the PVES error precision for the STD/XP-QLFEKF algorithm. The simulation results show that the STD/XP-QLFEKF is insensitive to the changes in the navigation PVSE errors within the range of the above α, γ, and ε, and the small changes of navigation accuracy estimation errors caused by the changes of the Q-learning parameters can be ignored in the process of intelligent navigation algorithm design.

In addition, in the process of the STD/XP-QLFEKF algorithm reward mechanism design, we take the iteration period as a function of reward accumulation, and the setting of the iteration period has a certain influence on the calculation results of the reward accumulation. The following is the simulation and analysis of the PVES error corresponding different iteration cycles of the Q-learning of the QLFEKF algorithm. The influence of the different iteration cycles of reinforcement Q-learning on the accuracy of PVSE errors in the probe cruise phase is shown in Figure 13.

Figure 13. The influence of different iteration cycles of the reinforcement Q-learning on the precision of the PVSE errors in the probe’s cruise phase.

The simulation results show that when the iteration cycle of the Q-learning in STD/XP-QLFEKF algorithm is 7200 s, the position state and velocity state estimation precision for the intelligent integrated navigation is higher. Therefore, the cumulative reward calculation of the Q-learning algorithm is carried out according to the iteration cycle (ItCy = 2.0 h) in this research to ensure the high precision of the state error estimation of the intelligent integrated navigation system for the cruise phase of the SSBE.

5. Conclusions

Intelligent, high-accuracy integrated navigation for the cruise phase of the SSBE is deeply studied in this paper. The single navigation mode could not meet the high-precision and high-reliability requirements of the navigation system for a probe operating in long-endurance, highly dynamic, and extreme environments. In order to improve the navigation accuracy of the PVSE for the cruise phase of the probe, the QLFEKF based on the X-ray pulsar/solar and target planetary Doppler velocity measurement intelligent integrated navigation algorithm is proposed. The integrated navigation algorithm can compensate for the accumulated position estimation error of Doppler velocity measurements. The navigation system adopts the federation filter composed of the sub-filter and main filter to realize the information fusion of the various navigation modes and ensure the estimation performance of the intelligent integrated navigation system. At the same time, the cumulative reward mechanism of the reinforcement Q-learning algorithm is designed according to the sub-filter characteristics of the federated filter, and the iterative period of the Q-learning algorithm is used as an index to judge the cumulative reward. The process and measurement noise covariance matrices are carried out by tuning the parameters by traversing the Q-table. The simulation and analysis show that compare with the STD/XP-EKF and the STD/XP-FEKF, the STD/XP-QLFEKF algorithm has higher navigation accuracy, and its PVSE RMSE error (3σ) accuracy can reach 152.101 m and 0.243 m/s, respectively. The different values of the three learning parameters (α, γ, and ε) for the QLFEKF algorithm are insensitive to the PVSE errors, and slight changes in the estimation errors of the navigation accuracy caused by changes in the Q-learning parameters can be ignored. Of course, it is necessary to consider the iterative period parameters related to the cumulative rewards of Q-learning in the QLFEKF algorithm. As a function of the reward accumulation, the iterative period has a certain influence on reward accumulation and thus affects the PVSE accuracy of the navigation system. It is important to choose the appropriate cumulative reward iteration cycle. Therefore, the QLEKF based on the intelligent X-ray pulsar/solar and target planetary Doppler velocity measurement integrated navigation system has high accuracy, which can fully fulfill the high accuracy demands for the cruise phase of the SSBE probe.

In future work, the combination of reinforcement learning and traditional navigation methods will be further studied, and deep neural networks will be added to reinforcement learning to provide interpretability for intelligent integrated navigation methods and improve the flexibility of navigation system parameter setting. The fusion of multisource intelligent information of the integrated navigation measurement models would be simultaneously carried out to further improve the information fusion accuracy and efficiency of information obtained from various navigation measurements. It provides theoretical and technical support for the interpretability of the multistage, intelligent, and highly accurate integrated navigation of the SSBE.

Author Contributions

W.T. put forward the algorithm, carried out the simulation and experiment, and prepared this original manuscript; J.Z., J.S. and J.W. pointed out the research directions, gave theoretical guidance, and revised the manuscript; Q.L., H.W., Z.C. and J.Y. revised and checked the manuscript. The authors would like to thank the reviewers and the editor for their comments and constructive suggestions that helped to improve the paper significantly. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation Project, China (grant number 12150007) and the Guangdong Major Project of Basic and Applied Basic Research, China (grant number 2019B030302001).

Data Availability Statement

The data generated in this study are not publicly available due to [their use in an ongoing study by the authors] but can be made available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank Jiaqi Min, Sihuan Wu, Maosen Shao, and Sifan Wu of our research team for their valuable help.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

SSBE	Solar System Boundary Exploration
FEKF	Federated Extended Kalman Filter
QLFEKF	Federated Extended Kalman Filter Based on Q-learning
PVSE	Position and Velocity State Estimation
VLBI	Very Long Baseline Interferometry
TT&C	Tracking Telemetry and Command
XNAV	X-ray Pulsar Navigation
TOA	Time-of-Arrival
SSB	Solar System Barycenter
TDOA	Time Difference of Arrival
EKF	Extended Kalman Filter
RL	Reinforcement Learning
LOS	Line of Sight
PA	Probe Agent
RMSE	Root Mean Squared Error
STD/XP-QLFEKF	X-ray Pulsar/Solar and Target Planetary Doppler Velocity Measurement Based on the Q-learning Federation EKF
STD/XP-EKF	X-ray Pulsar/Solar and Target Planetary Doppler Velocity Measurement Based on EKF
STD/XP-FEKF	X-ray Pulsar/Solar and Target Planetary Doppler Velocity Measurement Based on Federated EKF

References

Wu, W.; Yu, D.; Huang, J.; Zong, Q.; Wang, C.; Yu, G.; Hao, R.; Wang, Q.; Kang, Y.; Meng, L.; et al. Exploring the solar system boundary. Sci. Sin. Inform. 2019, 49, 1–16. [Google Scholar] [CrossRef]
Hall, C.F. Pioneer 10 and pioneer 11. Science 1975, 4187, 445–446. [Google Scholar] [CrossRef] [PubMed]
Courty, J.M.; Levy, A.; Christophe, B.; Reynaud, S. Simulation of ambiguity effects in Doppler tracking of Pioneer probes. Space Sci. Rev. 2010, 151, 93–103. [Google Scholar] [CrossRef]
Capova, K.A. Introducing Humans to the Extraterrestrials: The Pioneering Missions of the Pioneer and Voyager Probes. Front. Hum. Dyn. 2021, 3, 714616. [Google Scholar] [CrossRef]
Burlaga, L.F.; Ness, N.F.; Stone, E.C. Magnetic field observations as Voyager 1 entered the heliosheath depletion region. Science 2013, 341, 147–150. [Google Scholar] [CrossRef]
Stone, E.C.; Cummings, A.C.; Heikkila, B.C.; Lal, N. Cosmic ray measurements from Voyager 2 as it crossed into interstellar space. Nat. Astron. 2019, 3, 1013–1018. [Google Scholar] [CrossRef]
Fountain, G.H.; Kusnierkiewicz, D.; Hersman, C.; Herder, T.; Coughlin, T.; Gibson, W.; Clancy, D.; DeBoy, C.; Hill, T.; Kinnison, J.; et al. The new horizons spacecraft. Space Sci. Rev. 2008, 140, 23–47. [Google Scholar] [CrossRef]
Guo, Y.P.; Farquhar, R.W. New Horizons mission design. Space Sci. Rev. 2008, 140, 49–74. [Google Scholar] [CrossRef]
Liu, X.; Wang, J.; Li, X.; Chen, M.; Yu, Y. Laser communication proposal for solar system boundary exploration. J. Telem. Track. Command 2022, 43, 62–69. [Google Scholar] [CrossRef]
Song, Y.; Wu, W.; Hu, H.; Lin, M.; Wang, H.; Zhang, J. Gravity assist space pruning and global optimization of spacecraft trajectories for solar system boundary exploration. Complex Intell. Syst. 2023, 10, 323–341. [Google Scholar] [CrossRef]
Zheng, W.; Wang, Y. X-ray Pulsar–Based Navigation: Theory and Applications; Springer: Singapore, 2020; pp. 1–20. [Google Scholar] [CrossRef]
Wang, Y.; Wang, Y.; Zheng, W.; Song, M. X-Ray Pulsar-based Navigation Scheme for Solar System Boundary Exploration. J. Phys. Conf. Ser. 2020, 2224, 012127. [Google Scholar] [CrossRef]
Sheikh, S.I.; Pines, D.J.; Ray, P.S.; Wood, K.S.; Lovellette, M.N.; Wolff, M.T. Spacecraft Navigation Using X-Ray Pulsars. J. Guid. Control Dyn. 2006, 29, 49–63. [Google Scholar] [CrossRef]
Emadzadeh, A.A.; Speyer, J.L. Relative navigation between two spacecraft using X-ray pulsars. IEEE Trans. Control Syst. Technol. 2010, 19, 1021–1035. [Google Scholar] [CrossRef]
Shemar, S.; Fraser, G.; Heil, L.; Hindley, D.; Martindale, A.; Molyneux, P.; Pye, J.; Warwick, R.; Lamb, A. Towards practical autonomous deep-space navigation using X-Ray pulsar timing. Exp. Astron. 2016, 42, 101–138. [Google Scholar] [CrossRef]
Wang, Y.; Zheng, W. Pulse phase estimation of X-ray pulsar with the aid of vehicle orbital dynamics. J. Navig. 2016, 69, 414–432. [Google Scholar] [CrossRef]
Rinauro, S.; Colonnese, S.; Scarano, G. Fast near-maximum likelihood phase estimation of X-ray pulsars. Signal Process. 2013, 93, 326–331. [Google Scholar] [CrossRef]
Reichley, P.E.; Downs, G.S.; Morris, G.A. Time-of-Arrival Observations of Eleven Pulsars. Astrophys. J. 1970, 159, L35–L40. [Google Scholar] [CrossRef]
Downs, G.S. Interplanetary Navigation Using Pulsating Radio Sources. NASA Technical Report N74-34150. 1974; pp. 1–12. Available online: https://ntrs.nasa.gov/api/citations/19740026037/downloads/19740026037.pdf (accessed on 12 June 2024).
Runnels, J.T.; Demoz, G.E. Estimator for deep-space position and attitude using X-ray pulsars. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 2149–2166. [Google Scholar] [CrossRef]
Gao, J.; Fang, H.; Su, J. Differential X-Ray Pulsar Navigation Method Based on Pulse Arrival Time Difference. In China Satellite Navigation Conference (CSNC 2022) Proceedings: Volume III; Springer Nature: Singapore, 2022; pp. 552–562. [Google Scholar] [CrossRef]
Huang, S.; Kang, Z.; Liu, J.; Ma, X. Accuracy analysis of spectral velocimetry for the solar Doppler difference navigation. IEEE Access 2021, 9, 78075–78082. [Google Scholar] [CrossRef]
Yim, J.R.; Crassidis, J.L.; Junkins, J.L. Autonomous orbit navigation of interplanetary spacecraft. In Proceedings of the Astrodynamics Specialist Conference, Denver, CO, USA, 14–17 August 2000; pp. 53–61. [Google Scholar] [CrossRef]
Ning, X.; Gui, M.; Fang, J.; Liu, G.; Dai, Y. A novel differential Doppler measurement-aided autonomous celestial navigation method for spacecraft during approach phase. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 587–597. [Google Scholar] [CrossRef]
Liu, J.; Kang, Z.; Wuite, P.; Ma, J.; Tian, J. Doppler/XNAV–integrated navigation system using small-area X-ray sensor. IET Radar Sonar Navig. 2011, 5, 1010–1017. [Google Scholar] [CrossRef]
Pan, C.; Liu, J.; Kang, Z.; Chen, X. Solar TDOA/Doppler difference joint observation navigation for the approach phase of mars exploration. Int. J. Aeronaut. Space Sci. 2020, 9, 836–844. [Google Scholar] [CrossRef]
Cui, P.; Wang, S.; Gao, A.; Yu, Z. X-ray pulsars/Doppler integrated navigation for Mars final approach. Adv. Space Res. 2016, 57, 1889–1900. [Google Scholar] [CrossRef]
Xu, Y.; Cao, J.; Shmaliy, Y.S.; Zhuang, Y. Distributed Kalman filter for UWB/INS integrated pedestrian localization under colored measurement noise. Satell. Navig. 2021, 2, 22. [Google Scholar] [CrossRef]
Wang, X.; Yang, Y.; Wang, B.; Lin, Y.; Han, C. Resilient timekeeping algorithm with multi-observation fusion Kalman filter. Satell. Navig. 2023, 4, 25. [Google Scholar] [CrossRef]
Julier, S.; Uhlmann, J.; Durrant-Whyte, H.F. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Trans. Autom. Control. 2000, 45, 477–482. [Google Scholar] [CrossRef]
Xin, S.J.; Wang, X.M.; Zhang, J.L.; Zhou, K.; Chen, Y.F. A Comparative Study of Factor Graph Optimization-Based and Extended Kalman Filter-Based PPP-B2b/INS Integrated Navigation. Remote Sens. 2023, 15, 5144. [Google Scholar] [CrossRef]
Yin, Z.H.; Yang, J.C.; Ma, Y.; Wang, S.L.; Chai, D.S.; Cui, H.N. A Robust Adaptive Extended Kalman Filter Based on an Improved Measurement Noise Covariance Matrix for the Monitoring and Isolation of Abnormal Disturbances in GNSS/INS Vehicle Navigation. Remote Sens. 2023, 15, 4125. [Google Scholar] [CrossRef]
Kang, Z.; Xu, X.; Liu, J.; Li, N. Doppler velocity measurement based on double measurement model and its integrated navigation. J. Astronaut. 2017, 38, 964–970. [Google Scholar] [CrossRef]
Qiao, L.; Liu, J.; Zheng, G.; Xiong, Z. Augmentation of XNAV system to an ultraviolet sensor-based satellite navigation system. IEEE J. Sel. Top. Signal Process. 2009, 3, 777–785. [Google Scholar] [CrossRef]
Yang, C.; Zheng, J.; Gao, D. Autonomous orbit and attitude determination including time prediction based on XNAV and ultraviolet sensor. Chin. J. Space Sci. 2013, 33, 194–199. [Google Scholar] [CrossRef]
Liu, J.; Fang, J.; Yang, Z.; Kang, Z.; Wu, J. X-ray pulsar/Doppler difference integrated navigation for deep space exploration with unstable solar spectrum. Aerosp. Sci. Technol. 2015, 41, 144–150. [Google Scholar] [CrossRef]
Ou, J.J.; Guo, X.; Lou, W.J.; Zhu, M. Quadrotor autonomous navigation in semi-known environments based on deep reinforcement learning. Remote Sens. 2021, 13, 4330. [Google Scholar] [CrossRef]
Xiong, K.; Zhao, Q.; Yuan, L. Calibration Method for Relativistic Navigation System Using Parallel Q-Learning Extended Kalman Filter. Sensors 2024, 24, 6186. [Google Scholar] [CrossRef] [PubMed]
Xiong, K.; Wei, C.; Zhang, H. Q-learning for noise covariance adaptation in extended KALMAN filter. Asian J. Control 2021, 23, 1803–1816. [Google Scholar] [CrossRef]
Xiong, K.; Wei, C. Integrated celestial navigation for spacecraft using interferometer and Earth sensor. Proc. Inst. Mech. Eng. Part G J. Aerosp. Eng. 2020, 234, 2248–2262. [Google Scholar] [CrossRef]
Nemati, M.H.; Kankashvar, M.R.; Bolandi, H. Unscented Kalman Filter adaptive noise covariance selection for satellite formation flying with Q_leaming. In Proceedings of the 2022 30th International Conference on Electrical Engineering (ICEE), IEEE, Tehran, Iran, 17–19 May 2022; pp. 362–367. [Google Scholar] [CrossRef]
Xiong, K.; Peng, Z.; Wei, C. Spacecraft autonomous navigation using line-of-sight directions of non-cooperative targets by improved Q-learning based extended Kalman filter. Proc. Inst. Mech. Eng. Part G J. Aerosp. Eng. 2024, 238, 182–197. [Google Scholar] [CrossRef]
Chang, X.; Cui, P.; Cui, H. Research on Autonomous Navigation Method of Deep Space Cruise Phase Based on the Sun Observation. Yuhang Xuebao/J. Astronaut. 2010, 31, 1017–1023. [Google Scholar] [CrossRef]
Chen, X.; You, W.; Huang, Q. Research on Celestial Navigation for Mars Missions during the Interplanetary Cruising. J. Deep Space Explor. 2016, 3, 214–218. [Google Scholar] [CrossRef]
Song, M.; Yuan, Y. Research on Autonomous Navigation Method for the Cruise Phase of Mars Exploration. Geomat. Inf. Sci. Wuhan Univ. 2016, 41, 952–957. [Google Scholar] [CrossRef]
Liu, J.; Fang, J.; Kang, Z.; Wu, J.; Ning, X. Novel algorithm for X-ray pulsar navigation against Doppler effects. IEEE Trans. Aerosp. Electron. Syst. 2015, 51, 228–241. [Google Scholar] [CrossRef]
Emadzadeh, A.A.; Speyer, J.L. On modeling and pulse phase estimation of X-ray pulsars. IEEE Trans. Signal Process. 2010, 58, 4484–4495. [Google Scholar] [CrossRef]
Sheikh, S.I.; Hanson, J.E.; Graven, P.H.; Pines, D.J. Spacecraft navigation and timing using X-ray pulsars. Navigation 2011, 58, 165–186. [Google Scholar] [CrossRef]
Liu, J.; Ma, J.; Tian, J. Pulsar/CNS integrated navigation based on federated UKF. J. Syst. Eng. Electron. 2010, 21, 675–681. [Google Scholar] [CrossRef]
Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
AlMahamid, F.; Grolinger, K. Autonomous unmanned aerial vehicle navigation using reinforcement learning: A systematic review. Eng. Appl. Artif. Intell. 2022, 115, 105321. [Google Scholar] [CrossRef]
Wong, A.; Bäck, T.; Kononova, A.V.; Plaat, A. Deep multiagent reinforcement learning: Challenges and directions. Artif. Intell. Rev. 2023, 56, 5023–5056. [Google Scholar] [CrossRef]
Wang, Y.; Zheng, W.; Sun, S. X-ray pulsar-based navigation system/Sun measurement integrated navigation method for deep space explorer. Proc. Inst. Mech. Eng. Part G J. Aerosp. Eng. 2015, 229, 1843–1852. [Google Scholar] [CrossRef]
Xu, Q.; Fan, X.; Zhao, A.; Cui, H.; Xu, L.; Liu, N.; Ding, B. Pre-correction X-ray pulsar navigation algorithm based on asynchronous overlapping observation method. Adv. Space Res. 2021, 67, 583–596. [Google Scholar] [CrossRef]

Figure 1. The fundamental principle of the X-ray pulsar measurement pulse TOA.

Figure 2. The basic principle of the solar/target planetary object Doppler velocity measurement.

Figure 3. Intelligent information interaction with the flight environment for the PA.

Figure 4. Collections of states and corresponding collections for actions for the QLFEKF. The shaded areas denote various combinations of the state and observation noise error covariance matrices Q_k and R_k_. The arrows represent the transitions between different states, and it means choosing different actions.

Figure 5. Structure diagram of the Q-learning-based FEKF intelligent integrated navigation.

Figure 6. Comparison of the position estimate RMSEs between STD/XP-QLFEKF and other integrated navigation algorithms.

Figure 7. Comparison of the position estimate RMSEs for three axes between STD/XP-QLFEKF and other integrated navigation algorithms.

Figure 8. Comparison of the velocity estimate RMSEs between STD/XP-QLFEKF and other integrated navigation algorithms.

Figure 9. Comparison of the velocity estimate RMSEs based on three axes between STD/XP-QLFEKF and other integrated navigation algorithms.

Figure 10. The PVSE RMSE errors (3σ) of the cruise phase as a function of the learning rate.

Figure 11. The PVSE RMSE errors (3σ) of the cruise phase as a function of the discount factor.

Figure 12. The PVSE RMSE errors (3σ) for the cruise phase as s function of the action selection probability.

Figure 13. The influence of different iteration cycles of the reinforcement Q-learning on the precision of the PVSE errors in the probe’s cruise phase.

Table 1. Initial state orbit parameters for the probe.

Simulation Conditions	Parameters Value
Initial position values	[53,107,871.00559; 137,626,318.36608; −10,143.24541] km
Initial velocity values	[−14.01219; 35.86413; 0.42128] km/s
Initial state errors	δX₀ = [δr_x, δr_y, δr_z, δv_x, δv_y, δv_z]^T δr_x = δr_y = δr_z = 1.0 × 10² km, δv_x = δv_y = δv_z = 2.0 × 10⁻³ km/s
Initial state estimation covariance	$P_{0} = d i a g ([δ r_{x}^{2}, δ r_{y}^{2}, δ r_{z}^{2}, δ v_{x}^{2}, δ v_{y}^{2}, δ v_{z}^{2}])$
Initial state process noise covariance	$Q_{0} = d i a g ([q_{1}^{2}, q_{1}^{2}, q_{1}^{2}, q_{2}^{2}, q_{2}^{2}, q_{2}^{2}])$ , q₁ = 2.0 × 10⁻³ km; q₂ = 3.0 × 10⁻⁶ km/s

Table 2. The corresponding parameters of the selected X-ray pulsars.

Pulsars Name	RAD (°)	DEC (°)	D₀ (kpc)	P (10⁻³ s)	W (10⁻³ s)	Fx (ph/cm²/s)	Pf (%)
PSR B1937+21	294.92	21.580	3.6	1.56	0.021	4.99 × 10⁻⁵	86
PSR B0531+21	83.63	22.014	2.0	33.5	1.670	1.54	70
PSR J1821-24	276.13	−24.870	5.5	3.05	0.055	1.93 × 10⁻⁴	98

Table 3. The corresponding parameters of the measurement models.

Initial Condition	Parameters Name	Parameters Value
XNAV Measurement	Number of detectors	3
	Effective area of the detectors	1 m²
	Measurement noise variance matrix	${\hat{R}}_{p 0} = d i a g ([p_{1}^{2}, p_{2}^{2}, p_{3}^{2}]) {km}^{2}$
Solar/target planetary Doppler measurement	Number of the spectrometers	2
	Accuracy of the spectrometers	d₁ = d₂ = 10⁻⁶ km/s
	Velocity measurement navigation measurement noise variance matrix	${\hat{R}}_{d 0} = d i a g ([d_{1}^{2}, d_{2}^{2}]) {(km / s)}^{2}$

Table 4. The design parameters of the QLFEKF algorithm.

Parameter Name		Value
Learning rate, discount factor, and greedy values	α	0.1
	γ	0.9
	ε	0.1
The upper edge and lower edge bounds of ${\hat{Q}}_{r k}^{(i)}$ , ${\hat{Q}}_{v k}^{(i)}$ , ${\hat{R}}_{p k}^{(j)}$ , and ${\hat{R}}_{d k}^{(j)}$	$[\begin{matrix} {\underline{q}}_{r k} & {\bar{q}}_{r k} \end{matrix}]$	[1/20² 1/10²]
	$[\begin{matrix} {\underline{q}}_{v k} & {\bar{q}}_{v k} \end{matrix}]$	[1/10² 10²]
	$[\begin{matrix} {\underline{r}}_{p k} & {\bar{r}}_{p k} \end{matrix}]$	[1/10² 1]
	$[\begin{matrix} {\underline{r}}_{d k} & {\bar{r}}_{d k} \end{matrix}]$	[1/102 1/101]

Table 5. The PVSE RMSE errors (3σ) of the integrated navigation system.

Filter Algorithm	Position Estimation Accuracy (m)				Velocity Estimation Accuracy (m/s)
Filter Algorithm	r_x	r_y	r_z	r	v_x	v_y	v_z	v
STD/XP-EKF	414.797	174.586	380.772	443.560	0.196	0.089	0.610	0.630
STD/XP-FEKF	335.289	132.138	279.342	351.803	0.169	0.158	0.289	0.386
STD/XP-QLFEKF	148.075	65.095	78.936	152.101	0.086	0.104	0.150	0.243
Navigation accuracy improvement rate	55.84%	50.73%	71.74%	56.76%	49.11%	34.18%	48.10%	37.04%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.