Trigger-Based K-Band Microwave Ranging System Thermal Control with Model-Free Learning Process

Wang, Xiaoliang; Zhu, Hongxu; Shen, Qiang; Wu, Shufan; Wang, Nan; Liu, Xuan; Wang, Dengfeng; Zhong, Xingwang; Zhu, Zhu; Damaren, Christopher

doi:10.3390/electronics11142173


by Xiaoliang Wang ¹,†,‡, Hongxu Zhu ¹, Qiang Shen ¹, Shufan Wu ¹,*, Nan Wang ², Xuan Liu ³, Dengfeng Wang ³,‡, Xingwang Zhong ³, Zhu Zhu ⁴ and Christopher Damaren ⁵

¹ School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China

² University of Michigan-Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, Shanghai 200240, China

³ Institute of Space Radio Technology, Xi’an 710100, China

⁴ Shanghai Institute of Satellite Engineering, Shanghai 200240, China

⁵ Institute for Aerospace Studies, University of Toronto, Toronto, ON M1C 1A4, Canada

* Author to whom correspondence should be addressed.

† Current address: East Dongchuan Rd. No. 800, Shanghai 200241, China.

‡ These authors contributed equally to this work.

Electronics 2022, 11(14), 2173; https://doi.org/10.3390/electronics11142173

Submission received: 28 May 2022 / Revised: 2 July 2022 / Accepted: 4 July 2022 / Published: 11 July 2022

(This article belongs to the Topic Artificial Intelligence in Sensors)


Abstract:

Micron-level K-band microwave ranging in space relies on stable on-board thermal control of the payload; however, the large number of thermal sensors and heating devices around the deployed instruments consumes the scarce internal communication resources of the central computer. A further problem is that the payload thermal protection environment can deteriorate gradually over years of operation. In this paper, a new trigger-based thermal system controller design is proposed, with consideration of spaceborne communication burden reduction and actuator saturation, which keeps the temperature fluctuations of microwave payloads stable in space missions. The controller combines a nominal constant-sampling PID inner loop and a trigger-based outer loop under heating device saturation constraints. Moreover, an iterative model-free reinforcement learning process is adopted that can estimate the thermal dynamic modeling uncertainty online. Extensive experiments in a laboratory environment verify the performance of the proposed trigger thermal control, which shows smaller temperature fluctuations than the nominal control and a clear gain in communication efficiency. The online learning algorithm is also tested under deliberate thermal conditions that deviate from the original system; the results quickly converge to normal when the thermal disturbance is removed. Finally, the ranging accuracy is tested for the whole system, and a 25% (RMS) performance improvement, about 2.2 µm, can be realized by using the trigger-based control strategy compared to the nominal control method.

Keywords:

K-band ranging system; event trigger; saturation control; reinforcement learning; actor/critic policy

1. Introduction

K-band microwave ranging (MWR) technology can provide micron-level ranging measurements between spacecraft, which has potential applications in Earth elevation surveying, gravity field detection, and other space missions [1,2]. The ranging accuracy (time delay) of the MWR system is mainly affected by the payload thermal condition in space. A state-of-the-art payload thermal controller should therefore keep temperature fluctuations very small during orbiting; however, the spacecraft thermal control system is a large-scale system that involves hundreds, even thousands, of temperature sensors and patch heaters around instruments, which increases the communication load on the on-board central computer in future diversified space missions [3,4,5]. At the same time, the thermal control system itself can deteriorate and drift away from the original dynamic model during multi-year space missions, so adaptive approaches should be adopted to deal with this situation. To address these problems, this paper proposes a new thermal control strategy based on triggered sampling and a model-free learning process.

The event-triggered control (ETC), in contrast to the traditional time-triggered control with a fixed sampling period, updates its samples at a variable period [6,7,8]. Designers can impose thresholds on performance indexes of the system according to actual needs. The control signals are transmitted and updated inside the system only when the states exceed the threshold conditions [9,10]. Zhang [11] embedded ETC into a linear system model to address predictive control problems, updating the predictive sampling step-time through a fixed-threshold event-triggering mechanism. In [12], the ETC design of a multi-variable linear industrial process with time delay and quantization error is studied; the controller parameters are calculated by a linear matrix inequality (LMI), and a proof of closed-loop asymptotic stability using Lyapunov theory is provided. Azimi [13] proposed an ETC design that considers transmission delay and packet loss during signal transmission in a chemical process, which can track the set values of the system under the signal transmission constraint. For linear chemical control processes whose model parameters vary with time across working environments, Li [14] described the uncertain system using Markov random theory and designed an event-triggered sliding mode controller with finite-time convergence by combining homogeneous theory with an event-triggered mechanism. The thermal control system considered in this paper includes a saturated actuator process, which makes it a type of nonlinear system; scholars have also focused on the application of ETC to nonlinear systems. References [15,16] analyzed the nonlinear strict feedback system and the uncertain strict feedback nonlinear system according to adaptive control theory. The results show that a control system with an event-triggering mechanism can improve the transmission efficiency of the signal and can also reduce energy consumption and cost. Abhinav [17] designed an adaptive event-triggered sliding mode controller by combining sliding mode control with ETC, and the results show that the designed controller has good regulating performance in the presence of external disturbances and model uncertainty. Moreover, the unknown nonlinear characteristics inside the dynamics model can also be approximated by advanced control methods, such as fuzzy control [18], neural networks [19,20,21], and adaptive dynamic programming (ADP) [22]. References [23,24] considered the ADP triggering problem with a saturated actuator. Seuret [25] adopted the linear quadratic optimal control method. Reference [26] provided a stability analysis of the system with disturbance. Reference [27] introduced inequality to analyze the trigger system.

A thermal system using an ETC design relies on the temperature sensors as state-sampling hardware, and it can diverge from the tracking trajectory once a measurement malfunctions. Some scholars have proposed the idea of self-triggered control (STC) in recent years [28,29,30,31,32]. The principle of STC is to actively predict the next triggering time from the previously received data and the system dynamics. Compared with ETC, STC reduces the number of feedback signal transmissions, effectively reduces the on-board data transmission burden, and improves the control efficiency. Wang [33] provided the self-triggering conditions of general linear systems based on the Lyapunov method. Almeida [34] studied the self-triggering of linear systems with bounded disturbance state feedback to ensure the asymptotic stability of the system. The application of STC to a network control system is proposed in [35].

The triggered control strategy may improve the on-board communication efficiency most of the time, provided that the system model remains valid; however, the observation platform is critical for long-term space missions, and the spacecraft payload thermodynamics will gradually deteriorate over time and clearly deviate from the originally designed model. As a typical intelligent agent in space with sensing and action functions, the spacecraft platform should detect malfunctions and calibrate itself on-orbit, and methods such as optimal control and dynamic programming should be adopted to handle this model uncertainty.

Recently, efficient approximation techniques have been proposed to solve the problem described above, known as approximate dynamic programming (ApDP) or reinforcement learning (RL), including value-based RL (Q-learning methods) [36,37], policy-based RL (policy gradient methods) [38,39,40], and value-policy combined RL (actor-critic methods) [41]. For systems with disturbed dynamics, LQR has been widely used for learning-based controller design [42,43,44]. Lee [45] developed a Q-learning framework for LQR control based on an alternative optimization formulation of the problem. The proposed framework is then used to design a model-free Q-learning algorithm based on primal-dual updates. Policy gradient methods repeatedly calculate the gradient of the agent's cumulative reward with respect to the policy parameters under the current policy in an end-to-end approach, until the policy converges to the optimum [38]. The actor-critic methods include two parts: the actor is responsible for interacting with the environment and selecting actions based on the policy function; the critic is based on the value function and is responsible for evaluating the actor and guiding its next action. In the actor-critic algorithm, it is necessary to approximate the policy function and the value function independently [46]. Basically, the critic calculates the optimal state value, and the actor uses it to iteratively update the parameters of the policy function and select an action, so as to obtain the immediate reward and move to the next state. The critic uses the reward and the new state to update the parameters of the value function.

According to the authors’ literature review, the application of RL to spacecraft thermodynamic systems has rarely been reported in recent years. Lee [47] introduced an RL-based model-free predictive control structure for chiller plants. Qiu [48] provided an optimal operation solution for chillers by combining the RL technique with expert knowledge, aiming for a balance of power and indoor comfort. Inspired by the literature, this paper focuses on a trigger-based MWR payload thermal control system design with an online learning iteration, aiming at micron-level ranging performance in space missions [49,50]. The proposed approach benefits from the following noteworthy features:

Feasible triggered thermal system control design with obvious communication burden reduction;
No original thermodynamic information required when faced with disturbed system model uncertainty;
Suitable for real autonomous management of space platforms with long-term mission life;
Thermal control strategies can be selected from nominal control, triggered control, and the model-free learning process, according to the orbiting period.

The rest of the paper is organized as follows: Section 2 provides the thermal system modeling for the micron-level K-band ranging system with efficient saturation trigger feedback controller design. The model-free learning method is shown in Section 3 for the case of thermal system uncertainty. Finally, the experimental results in a laboratory environment are provided, which demonstrate the effectiveness of the proposed method.

2. Thermal System Design for Precise Ranging System

2.1. Thermal Structure of Deployed Satellite

The main sources of heat for a satellite orbiting in space include sunlight, thruster ignition, and the power consumption of on-board electric devices. The fluctuation of satellite temperature has the following characteristics: on the one hand, the temperature changes periodically, both daily and seasonally, as the Earth rotates and revolves around the sun; on the other hand, on-board electric devices also cause temperature changes due to work status changes, failures, or other uncertain factors, and the performance of the on-board thermal insulation layer decreases with satellite aging. In order to obtain proper heat transfer and conversion and keep the temperature within the normal working range, thermal control technology should be adopted that includes specially designed active and passive approaches.

The structure of the whole satellite is shown in Figure 1 below, in which “b” indicates the installed position of the MWR system, which includes four parts: K-band transceiving antenna, waveguide switch network, signal process unit, and ultra-stable crystal oscillator (USO), which constitutes the micron-level microwave ranging system as a whole.

It can be seen from Figure 1 that the whole ranging system is installed on the +X and +Z panels of the satellite. A three-level thermal protection structure is designed in order to prevent the payload from being affected by space radiation during operation. First, the platform outer shield, with bare carbon fiber reinforced plastic (CFRP) with 10-layer multi-layer insulation (MLI) pack, is used that can block any heat from leaking in from the outer solar panel. Second, the payload cabin, with CFRP with single layer Kapton foil, is used, which can minimize radiative heat from the inner platform environment to the MWR payload enclosure. Moreover, there is thermal insulation between the satellite platform and the payload, which is made of low conductive material to reduce the effect of temperature variations at the structure interface. The internal thermal conduction balance treatment is carried out within deployed cabins for the differential temperature caused by on-orbit illumination directions on the surface of the satellite. Third, active thermal control, with dedicated heaters and condensation heat pipes, is used for critical payloads that have more stringent temperature limits than the rest of the spacecraft—mainly the four parts of the MWR system. The temperatures of the MWR equipment are monitored using sensors and maintained within the desired limits by several patch heaters and phase-changed heat pipes that are controlled by the spacecraft central computer during the stages of science operations.

According to extensive data from previous tests, the ranging error of the MWR equipment is mainly due to the microwave measurement signal chain, which includes: (1) the temperature fluctuation of the USO, which reduces the clock frequency stability; (2) the thermal deformation of the horn antenna, which may lead to changes in the antenna phase center; (3) temperature changes of the phase-locked frequency doubling equipment, microwave network, quadrature mixer, intermediate frequency amplifier, and low-pass filter, which may induce errors in the ranging measurement result. As a result, it is necessary to improve the accuracy and stability of temperature control for these devices, which should be held within ±0.15 K/orbit. In addition, the digital signal processing unit of the MWR also needs high-precision temperature control within ±0.1 K/orbit, since it contains the A/D converter of the low-pass filtered signal, FPGA, and DSP components on the same circuit board, which are highly sensitive to temperature fluctuations.

2.2. Payload Thermal Dynamic Modeling and Nominal PID Control

The temperature of the deployed payloads is directly related to the surrounding environment of the satellite platform in space and to the internal thermal exchange among the payloads themselves. With a proper structure design featuring three levels of thermal protection, the payloads can perform well within finite temperature fluctuations inside the cabin. In this paper, we consider the thermal coupling relationship between the payload components; the thermal analysis adopts the node network method to establish the thermal balance equation of any node on the satellite as follows [51]:

$$c_{i} m_{i} \frac{d T_{i}}{d t} + \epsilon_{i} a_{i} (T_{i} - T_{T i}) = q_{i}$$

(1)

where the subscript $i$ denotes the thermal nodes, $c_{i}$ denotes the specific heat capacity of the payload metal alloy block, $m_{i}$ denotes the block mass, $T_{i}$ denotes the transient thermal temperature, $T_{T i}$ is the target temperature, $\epsilon_{i}$ denotes the average heat transfer coefficient of each node, $a_{i}$ is the area of each heat patch/pipe surface, and $q_{i}$ is the heating/cooling power of each node around the payload. The thermal control is realized by several heating and condensation patch nodes with an adiabatic section around each payload component, providing the heating and cooling actions from the control command. Typical electric heating is used when the temperature is below the target, and the state-of-the-art micro-electro-mechanical system (MEMS)-based pulse width modulation (PWM) high-speed on-off valve is used to cool the liquid flow inside the phase change pipe during high temperature stages.
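As a rough illustration of how Equation (1) behaves for a single node, the following sketch integrates the thermal balance with a forward-Euler step. The function name `simulate_node` and all numerical parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def simulate_node(c, m, eps, a, t_target, q, t_init, dt, steps):
    """Integrate c*m*dT/dt + eps*a*(T - t_target) = q with forward Euler."""
    temperature = t_init
    history = [temperature]
    for _ in range(steps):
        # Equation (1) rearranged: dT/dt = (q - eps*a*(T - t_target)) / (c*m)
        d_temp = (q - eps * a * (temperature - t_target)) / (c * m)
        temperature += dt * d_temp
        history.append(temperature)
    return history


# Hypothetical node parameters (illustrative only, not from the paper).
trace = simulate_node(c=900.0, m=2.0, eps=5.0, a=0.04, t_target=293.15,
                      q=0.1, t_init=292.65, dt=1.0, steps=100_000)

# Setting dT/dt = 0 in Equation (1) gives T = t_target + q / (eps * a).
steady_state = 293.15 + 0.1 / (5.0 * 0.04)
```

With constant heating power the node settles at an offset of $q_{i} / (\epsilon_{i} a_{i})$ above the target temperature, which is why a feedback law is needed on top of this open-loop balance.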

For the purpose of fast and stable internal temperature control of payloads during space missions, especially during precise ranging stages, two cycles of a closed-loop active thermal controller are designed: one is a high-power electric heating/cooling controller with temperature sensors patched around payloads using PID control; the other one is a precise low-power heating controller using optimal control with triggering saturation constraints.

A typical PID control algorithm is used as a nominal scheme given as [52,53,54]:

$$u (t) = K_{p} \left[ e (t) + \frac{1}{K_{I}} \sum_{\tau = 0}^{t} e (\tau) \cdot T_{s} + K_{D} \frac{e (t) - e (t - 1)}{T_{s}} \right]$$

(2)

where $K_{p}$, $K_{I}$, and $K_{D}$ are the proportional, integral, and derivative coefficients, respectively; $T_{s}$ is the sampling time; $e (t)$ is the difference between the measured temperature and the target temperature; and $u (t)$ is the heating/cooling power consumption for active thermal control.
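A minimal sketch of the discrete PID law in Equation (2) is given below, assuming the integral term is divided by $K_{I}$ exactly as written and that the error before the first sample is zero. The class name `DiscretePID` and the gain values are hypothetical.

```python
class DiscretePID:
    """Discrete PID following the structure of Equation (2)."""

    def __init__(self, kp, ki, kd, ts):
        self.kp, self.ki, self.kd, self.ts = kp, ki, kd, ts
        self.error_sum = 0.0   # running value of sum(e * Ts)
        self.prev_error = 0.0  # assumption: e = 0 before the first sample

    def update(self, error):
        # Accumulate the rectangular-rule integral, including the current sample.
        self.error_sum += error * self.ts
        # Backward difference for the derivative term.
        derivative = (error - self.prev_error) / self.ts
        self.prev_error = error
        return self.kp * (error + self.error_sum / self.ki + self.kd * derivative)


# Hypothetical gains, for illustration only.
pid = DiscretePID(kp=2.0, ki=10.0, kd=0.5, ts=1.0)
u_first = pid.update(1.0)
u_second = pid.update(1.0)
```

For a constant error the derivative term contributes only on the first call, while the integral term keeps growing; in a real heater loop the output would additionally be clipped to the available power.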

2.3. Trigger-Based Precise Optimal Thermal Control with Saturation Constraint

The thermal dynamic model and PID algorithm design above aim to provide the nominal thermal active control during the space orbiting period. For the purpose of the high-precision microwave ranging system that is functional during the science observing phase in space, each payload component is patched around several heating/cooling nodes and measurement sensors in order to fully realize thermal control and temperature monitoring.

Here, we want the thermal control system to track the desired temperatures during the space mission with minimal control burden. First, consider the nominal states $x_{\mathrm{nom}} = {[\begin{matrix} T_{a} & T_{w} & T_{m} & T_{u} \end{matrix}]}^{T}$, where the subscripts $a, w, m, u$ denote the four MWR components: the K-band antenna, waveguide switch network, microwave signal process unit, and USO. Define the new states as the difference between the current states $x_{\mathrm{nom}} (t)$ and $x_{\mathrm{nom}} (t_{f})$, namely, $x (t) \overset{\Delta}{=} x_{\mathrm{nom}} (t) - x_{\mathrm{nom}} (t_{f})$ with $x = {[\begin{matrix} \delta T_{a} & \delta T_{w} & \delta T_{m} & \delta T_{u} \end{matrix}]}^{T}$. Similarly, we can define the heating/cooling control variable $u (t) \overset{\Delta}{=} u_{\mathrm{nom}} (t) - u_{\mathrm{nom}} (t_{f})$ with $u = {[\begin{matrix} \delta q_{a} & \delta q_{w} & \delta q_{m} & \delta q_{u} \end{matrix}]}^{T}$ according to the deviation between the actual and the nominal thermal control input. On the premise of a given nominal temperature state sequence, the time series of the nominal control input can be obtained directly through the PID algorithm; so, after the control input $u = {[\begin{matrix} \delta q_{a} & \delta q_{w} & \delta q_{m} & \delta q_{u} \end{matrix}]}^{T}$ of the deviation dynamics is solved, the actual temperature tracking control can be obtained through the summation of $u_{\mathrm{nom}}$ and $u$.

The precise thermal control is realized through several accurately calibrated patch heaters and coolers with limited power consumption. Here, we consider the deviation thermal system as a saturated linear dynamic equation of the form:

$$\begin{aligned} \dot{x} (t) &= A x (t) + B \sigma (u (t)), \quad x (t_{0}) = x_{0}, \quad t \geq 0 \\ u (t) &= K x (t_{k}), \quad t \in [t_{k}, t_{k + 1}) \end{aligned}$$

(3)

where $A \in R^{n \times n}$ is the state matrix, $B \in R^{n \times m}$ is the control matrix, and $K = - R^{- 1} B^{T} P \in R^{m \times n}$ denotes the feedback gain matrix obtained through optimal control. Here, we assume that all states are observable and that the system is controllable. Because of the unmodeled dynamics and external thermal disturbances during on-orbit mission flight, $A$ and $B$ are both disturbed matrices.

Note that we define the saturation function as $\sigma (u_{i}) \overset{\Delta}{=} \mathrm{sign} (u_{i}) \cdot \min \{{\bar{u}}_{i}, |u_{i}|\}$, where $u_{i}$ is the i-th control input signal and ${\bar{u}}_{i}$ is the maximum amplitude of the i-th control actuator, i.e., the heating/cooling power. The time-tag $t_{k}$ is generated by the event trigger: the control signal is updated at time $t_{k}$ and held constant during the period $t \in [t_{k}, t_{k + 1})$; this can greatly save signal transmission bandwidth, reducing the communication burden for the whole on-board system.
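The saturation function and the held control signal of Equation (3) can be sketched as follows. The helper names `saturate` and `held_control`, the gain matrix, and the limits are hypothetical and only illustrate the element-wise clipping and the use of the last triggered state $x (t_{k})$.

```python
import math


def saturate(u, u_max):
    """Element-wise sigma(u_i) = sign(u_i) * min(u_max_i, |u_i|)."""
    return [math.copysign(min(limit, abs(value)), value)
            for value, limit in zip(u, u_max)]


def held_control(gain, x_tk, u_max):
    """Saturated input sigma(K x(t_k)), held until the next trigger time."""
    raw = [sum(k * x for k, x in zip(row, x_tk)) for row in gain]
    return saturate(raw, u_max)


# Hypothetical 2-state example (not the paper's 4-state payload model).
clipped = saturate([0.3, -2.0, 0.0], [1.0, 1.5, 1.0])
u_held = held_control([[-1.0, 0.0], [0.0, -4.0]], [0.5, 0.5], [1.0, 1.0])
```

Between two trigger times the controller would keep applying `u_held` without requesting new sensor data, which is where the communication saving comes from.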

Next, we give a brief introduction to the optimal control used in this paper. The tracking performance in terms of energy cost is written as:

$$J (x, u, t_{0}, t_{f}) = \frac{1}{2} x^{T} (t_{f}) P (t_{f}) x (t_{f}) + \frac{1}{2} \int_{t_{0}}^{t_{f}} (x^{T} (\tau) M x (\tau) + u^{T} (\tau) R u (\tau)) d \tau$$

(4)

where $P (t_{f}) \in R^{n \times n}$ is the solution of the Riccati equation at time $t_{f}$, $M \in R^{n \times n}$ with $M \succeq 0$, and $R \in R^{m \times m}$ with $R \succ 0$. Suppose that the matrix pair $(A, B)$ is controllable and $(\sqrt{M}, A)$ is observable. Clearly, the best performance is obtained with the minimal value of $J$, and the final objective of the control system is to find the optimal value $V^{*}$ under dynamic modeling disturbances:

$$V^{*} (x, t_{0}, t_{f}) = \min_{u} [J (x, u, t_{0}, t_{f})]$$

(5)

The finite horizon optimal control problem can be solved as follows: define the Hamiltonian function

$$H = \frac{1}{2} (x^{T} (t) M x (t) + u^{T} (t) R u (t)) + \frac{\partial V^{* T}}{\partial x} (A x (t) + B u (t))$$

(6)

With proper derivation, the following HJB equation can be obtained [55]:

$$- \frac{\partial V^{*}}{\partial t} = \frac{1}{2} (x^{T} (t) M x (t) + u^{* T} (t) R u^{*} (t)) + \frac{\partial V^{* T}}{\partial x} (A x (t) + B u^{*} (t))$$

(7)

Then, we have the optimal control

$$u^{*} (x, t) = - R^{- 1} B^{T} P (t) x (t) = K x (t)$$

(8)

where $P (t)$ is found from the Riccati equation

$$M + P (t) A + A^{T} P (t) - P (t) B R^{- 1} B^{T} P (t) = 0$$

(9)
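To make Equations (8) and (9) concrete, the sketch below solves the scalar special case, where the Riccati equation reduces to a quadratic in $P$ with a closed-form positive root. This is only an illustration under the assumption of a one-state, one-input system; the function name `scalar_lqr` and the numbers are hypothetical, and the paper's four-state payload model would require a matrix Riccati solver instead.

```python
import math


def scalar_lqr(a, b, m_weight, r_weight):
    """Positive root of m + 2*a*p - (b**2 / r) * p**2 = 0 and the gain k."""
    p = r_weight * (a + math.sqrt(a * a + b * b * m_weight / r_weight)) / (b * b)
    k = -b * p / r_weight  # scalar form of K = -R^{-1} B^T P
    return p, k


# Hypothetical scalar plant dx/dt = a*x + b*u (illustrative values only).
a, b, m_weight, r_weight = -0.5, 1.0, 3.0, 1.0
p, k = scalar_lqr(a, b, m_weight, r_weight)

# Residual of the scalar version of Equation (9); should vanish.
residual = m_weight + 2.0 * a * p - (b * p) ** 2 / r_weight
# Closed-loop pole a + b*k equals -sqrt(a^2 + b^2 * m / r), hence stable.
closed_loop_pole = a + b * k
```

In the scalar case the closed-loop pole is always negative for positive weights, which mirrors the stability statement that follows.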

Theorem 1.

Suppose we have $P (t)$ from Equation (9), with the terminal condition $P (t_{f})$; then the optimal control of Equation (8) attains the minimum in Equation (5). Moreover, the origin is a globally uniformly asymptotically stable equilibrium point of the closed-loop system.

Proof of Theorem 1.

The proof can be found in ref. [55] and is not repeated here. □

2.4. Trigger Condition Analysis

We define $e (t) = x (t_{k}) - x (t)$, the difference between the state at the previous trigger time and the current state. The sampled signal is sent to the controller through feedback only when the selected trigger condition is met. Here, we design the trigger mechanism as:

$$t_{k + 1} = t_{k} + \min \{\tau \mid Z_{1} \land Z_{2}, \tau > 0\}$$

(10)

where $Z_{1} = e^{T} (t_{k} + \tau) S e (t_{k} + \tau) \geq \gamma x^{T} (t_{k}) D x (t_{k}) + \delta$, $Z_{2} = {\|\sigma (u (t))\|}^{2} < {\bar{u}}^{2}$, $\gamma, \delta$ are given positive constants, $S, D$ are positive definite matrices, $t_{k}$ denotes the triggered time of event $k$, and the minimizing value $\tau_{k}$ is the signal transmission period since $t_{k}$. Equation (10) can be explained as follows: suppose the first trigger happened at time $t_{0}$ in a real system operation; after that, no trigger happens while the control signal is saturated, i.e., while condition $Z_{2}$ is violated, even if condition $Z_{1}$ is satisfied. This greatly improves the communication resource utilization of the system.

Let $\delta = 0$ and $S = D = P$; we have the event-trigger condition

$$\{e^{T} (t) P e (t) \geq \gamma x^{T} (t_{k}) P x (t_{k})\} \land \{{\|\sigma (u (t))\|}^{2} < {\bar{u}}^{2}\}, \quad t > t_{k}$$

(11)
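The event-trigger test of Equation (11) can be sketched as a simple predicate evaluated at each hardware sample. The function names, the identity weighting matrix, and the threshold values below are hypothetical; the saturation bound is treated as a single scalar, as in the equation.

```python
def quad_form(v, mat):
    """Return v^T * mat * v for a list vector and a list-of-rows matrix."""
    n = len(v)
    return sum(v[i] * mat[i][j] * v[j] for i in range(n) for j in range(n))


def event_triggered(x_now, x_last, u_sat, p_mat, gamma, u_bar):
    """Evaluate Z1 and Z2 of Equation (11); True means 'transmit now'."""
    # e(t) = x(t_k) - x(t): drift since the last transmitted state.
    e = [xl - xn for xl, xn in zip(x_last, x_now)]
    z1 = quad_form(e, p_mat) >= gamma * quad_form(x_last, p_mat)
    # Z2: the applied (saturated) input is strictly inside its limit.
    z2 = sum(ui * ui for ui in u_sat) < u_bar ** 2
    return z1 and z2


# Hypothetical 2-state example with P = I and gamma = 0.04.
identity = [[1.0, 0.0], [0.0, 1.0]]
small_drift = event_triggered([0.9, 0.0], [1.0, 0.0], [0.5], identity, 0.04, 1.0)
large_drift = event_triggered([0.7, 0.0], [1.0, 0.0], [0.5], identity, 0.04, 1.0)
saturated = event_triggered([0.7, 0.0], [1.0, 0.0], [1.0], identity, 0.04, 1.0)
```

Only the second case requests a transmission: the first has too little drift, and the third is blocked because the actuator is already at its limit.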

The event-trigger transmission condition is realized by hardware sampling and trigger condition judgment. Similarly, the self-trigger is implemented through previous signals and state predictions. Here, we explain the self-trigger conditions as follows: consider time $t \in [t_{k}, t_{k + 1})$; by using the system model of Equation (3), we have

$$\begin{aligned} \dot{e} (t) &= - \dot{x} (t) = A e (t) - [A x (t_{k}) + B \sigma (K x (t_{k}))] \overset{\Delta}{=} A e (t) - X \\ e (t_{k}) &= 0, \quad t > t_{k} \end{aligned}$$

(12)

The analytical solution of Equation (12) is given as

$$e (t) = - \int_{t_{k}}^{t} e^{A (t - \tau)} d \tau \cdot X = - \int_{0}^{t - t_{k}} e^{A s} d s \cdot X$$

(13)

With proper derivation, we have

$$\begin{aligned} e^{T} (t) P e (t) &= {\left\{\int_{0}^{t - t_{k}} e^{A s} d s \cdot X\right\}}^{T} P \left\{\int_{0}^{t - t_{k}} e^{A s} d s \cdot X\right\} \\ &\leq {\left\{\int_{0}^{t - t_{k}} e^{\lambda_{\max} (A) s} d s\right\}}^{2} \cdot X^{T} P X \end{aligned}$$

(14)

Let $\xi (x (t_{k})) = X^{T} P X$ and $\theta (x (t_{k})) = \lambda_{\min} (P) {\|x (t_{k})\|}^{2}$. According to Equation (11), we can obtain the self-trigger time-tag $t_{k + 1}$ through

$$\left\{{\left[\int_{0}^{t_{k + 1} - t_{k}} e^{\lambda_{\max} (A) s} d s\right]}^{2} \xi (x (t_{k})) = \gamma \theta (x (t_{k}))\right\} \land \{{\|\sigma (u (t))\|}^{2} < {\bar{u}}^{2}\}$$

(15)

Clearly, Equation (15) is obtained from the event-trigger condition of Equation (11) and yields a shorter trigger period. Finally, we can obtain the self-trigger condition

$$t_{k + 1} = t_{k} + h (\gamma, x (t_{k}))$$

(16)

The self-trigger time-tag $t_{k + 1}$ is the next sampling point; whether the system states are fed back at $t_{k + 1}$ depends on the saturation condition ${\|\sigma (u (t))\|}^{2} < {\bar{u}}^{2}$. Moreover, the trigger period $h (\gamma, x (t_{k}))$ relies on matrix $A$, namely:

(1) if $\lambda_{\max} (A) = 0$, then we have $\int_{0}^{t_{k + 1} - t_{k}} e^{\lambda_{\max} (A) s} d s = \int_{0}^{t_{k + 1} - t_{k}} d s$, and

$$h (\gamma, x (t_{k})) = {\left[\frac{\gamma \theta (x (t_{k}))}{\xi (x (t_{k}))}\right]}^{1 / 2}$$

(17)

(2) if $\lambda_{\max} (A) \neq 0$, then we have

$$\begin{aligned} \int_{0}^{t_{k + 1} - t_{k}} e^{\lambda_{\max} (A) s} d s &= \frac{1}{\lambda_{\max} (A)} \int_{0}^{t_{k + 1} - t_{k}} e^{\lambda_{\max} (A) s} d (\lambda_{\max} (A) s) \\ &= \frac{1}{\lambda_{\max} (A)} [e^{\lambda_{\max} (A) (t_{k + 1} - t_{k})} - 1] \end{aligned}$$

(18)

and

$$h (\gamma, x (t_{k})) = \frac{1}{\lambda_{\max} (A)} \ln \left\{1 + \lambda_{\max} (A) {\left[\frac{\gamma \theta (x (t_{k}))}{\xi (x (t_{k}))}\right]}^{1 / 2}\right\}$$

(19)
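The two cases of the self-trigger period in Equations (17) and (19) can be sketched in one function. The name `self_trigger_interval` and the test values are hypothetical. Returning infinity when the logarithm argument is non-positive is an added assumption for strongly negative $\lambda_{\max} (A)$, where the bound in Equation (15) is never reached; the paper does not discuss this case.

```python
import math


def self_trigger_interval(gamma, theta, xi, lam_max):
    """Trigger period h from Equation (17) (lam_max = 0) or (19) (otherwise)."""
    ratio = math.sqrt(gamma * theta / xi)
    if lam_max == 0.0:
        return ratio
    arg = 1.0 + lam_max * ratio
    if arg <= 0.0:
        # Assumption: the integral bound never reaches the threshold,
        # so condition Z1 alone never forces a new sample.
        return math.inf
    return math.log(arg) / lam_max


# Hypothetical values of gamma, theta(x(t_k)) and xi(x(t_k)).
h_zero = self_trigger_interval(0.04, 1.0, 4.0, 0.0)
h_pos = self_trigger_interval(0.04, 1.0, 4.0, 0.5)
h_never = self_trigger_interval(0.04, 1.0, 4.0, -20.0)
```

Substituting `h_pos` back into Equation (18) reproduces the same ratio as in the $\lambda_{\max} (A) = 0$ case, which is a quick consistency check on the closed form.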

Finally, here we provide the schematic diagram of the proposed trigger-based thermal control system as in Figure 2.

2.5. Stability Analysis of Trigger Control

For a given stable open-loop system, the global stability of the whole system can be guaranteed by selecting appropriate event-triggering conditions when the actuator of the system is saturated. Here, we analyze the trigger conditions that can guarantee the global input-to-state stability of the system. First, considering the system model in Equation (3), we introduce Lemma 1 and Lemma 2:

Lemma 1 ([26]). (Input-to-state stability) Consider a system

$$\dot{x} (t) = f (t, x, u)$$

(20)

where $f : [0, \infty) \times R^{n} \times R^{m} \to R^{n}$ is continuous in $t$ and locally Lipschitz in $x, u$. Let $V : [0, \infty) \times R^{n} \to R$ be a continuously differentiable function that satisfies

$$\underline{\alpha} (\|x\|) \leq V (t, x) \leq \bar{\alpha} (\|x\|)$$

(21)

$$\frac{\partial V}{\partial x} f (t, x, u) \leq - \alpha (\|x\|) + \beta (\|u\|)$$

(22)

where $\underline{\alpha}, \bar{\alpha}, \alpha, \beta$ are class $K_{\infty}$ functions; then system (20) is input-to-state stable.

Lemma 2 ([26]). For any $v, w \in R^{m}$, if $v, w$ belong to the linear region $L (v - w, \bar{u})$, then we have

$$\phi^{T} (v) T (\phi (v) + w) \leq 0$$

(23)

for any positive definite matrix $T \in R^{m \times m}$, where $\phi (v (t)) = \sigma (v (t)) - v (t)$ is the dead-zone nonlinear function arising from the saturation control of Equation (3).

Then, the global input-to-state stability conditions of the triggered system are given in Theorem 2 below:

Theorem 2.

Choose the trigger condition $\|e (t)\| \geq \frac{\lambda_{\min} (M_{0})}{4 \lambda_{m}} \|x (t)\|$ for the system model of Equation (3). If $M$ satisfies $P B R^{- 1} B^{T} P < M$, then the event-trigger system (3) is globally input-to-state stable, with $\lambda_{m} = \lambda_{\max} (P B R^{- 1} B^{T} P)$ and $M_{0} = M - P B R^{- 1} B^{T} P$.

Proof of Theorem 2.

Construct the Lyapunov function $V (t, x (t)) = x^{T} (t) P x (t)$, where $P > 0$ is obtained from the Riccati equation of Equation (9); clearly, the function $V (t, x (t))$ satisfies the condition of Lemma 1. If the inequality $P B R^{- 1} B^{T} P < M$ holds, then matrix $A$ is a Hurwitz matrix, and the derivative of the Lyapunov function is

$$\begin{aligned} \dot{V} (t, x (t)) &= {(A x (t) + B \sigma (u (t)))}^{T} P x (t) + x^{T} (t) P (A x (t) + B \sigma (u (t))) \\ &= x^{T} (t) (A^{T} P + P A) x (t) - 2 x^{T} (t) P B \sigma (R^{- 1} B^{T} P x (t_{k})) \\ &= - x^{T} (t) (M - P B R^{- 1} B^{T} P) x (t) - 2 x^{T} (t) P B R^{- 1} B^{T} P x (t_{k}) - 2 x^{T} (t) P B \phi (R^{- 1} B^{T} P x (t_{k})) \\ &\leq - x^{T} (t) (M - P B R^{- 1} B^{T} P) x (t) - 2 x^{T} (t) P B R^{- 1} B^{T} P e (t) + 2 \|x^{T} (t) P B\| (\|R^{- 1} B^{T} P x (t)\|) \\ &\leq - x^{T} (t) (M - P B R^{- 1} B^{T} P) x (t) + 4 \lambda_{\max} (P B R^{- 1} B^{T} P) \|x (t)\| \|e (t)\| \\ &\leq 4 \lambda_{m} \|x (t)\| \|e (t)\| - \lambda_{\min} (M_{0}) {\|x (t)\|}^{2} \end{aligned}$$

(24)

Because a new sample is taken as soon as the trigger condition $\|e (t)\| \geq \frac{\lambda_{\min} (M_{0})}{4 \lambda_{m}} \|x (t)\|$ is met, $\|e (t)\|$ stays below this threshold between trigger times, and we have $\dot{V} (t, x (t)) < 0$; referring to Lemma 1, we can finally guarantee that system (3) is globally input-to-state stable. □

3. Model-Free Reinforcement Learning Formulation

3.1. Reinforcement Learning Structure

The triggered optimal control design performed stably during the test, as described in Section 4; however, the problem we are faced with is that the real thermodynamic system will gradually deteriorate over time during a long-term space mission, which will clearly deviate from the original designed model, and proper online estimation/update process should be adopted for this situation. Vamvoudakis [37] provided a learning-based approach that deals with an uncertain dynamic environment by using an up-to-date adaptive mechanism process. Similarly, here we use a value-based Q-learning algorithm to find an optimal action-selection policy from the information of thermal actuator and temperature state sensors with dynamic disturbances. The learning algorithm is in the form of an actor/critic structure, which uses an actor to select the control policies to improve the value and the critic to assess the actor’s decisions.

3.2. Reinforcement Learning Structure

Combining the optimal value function of Equation (5) and the Hamiltonian function, we obtain the Q function as

\begin{matrix} Q (x, u, t) \overset{Δ}{=} V^{*} (x, u, t) + \frac{1}{2} x^{T} M x + \frac{1}{2} u^{T} R u + x^{T} P (t) (A x + B u) \end{matrix}

(25)

where

Q (x, u, t)

is the action value function.

We define the generalized state

U \overset{Δ}{=} {(\begin{matrix} x^{T} & u^{T} \end{matrix})}^{T} \in R^{(n + m) \times 1}

, and the Q function rewritten as:

Q (x, u, t) \overset{Δ}{=} \frac{1}{2} U^{T} [\begin{matrix} Q_{x x} & Q_{x u} \\ Q_{u x} & Q_{u u} \end{matrix}] U \overset{Δ}{=} \frac{1}{2} U^{T} QU

(26)

with

\begin{matrix} Q_{x x} = P (t) + M + P (t) A + A^{T} P (t) \\ Q_{x u} = Q_{u x} = P (t) B, Q \in R^{(n + m) \times (n + m)} \\ Q_{u u} = R \end{matrix}

(27)

By using the stationarity condition

\partial Q (x, u, t) / \partial u = 0

, we can obtain the optimal control for the model-free system as

u^{*} (x, t) = arg min_{u} Q (x, u, t) = - Q_{u u}^{- 1} Q_{u x} (t) x

(28)
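Equations (27) and (28) can be checked numerically with a small sketch. The matrices A, B, M, R, and P below are hypothetical two-state, one-input placeholders (they are not the thermal model), and Q_ux is taken as the transpose of Q_xu, i.e., B^T P for symmetric P, so that the dimensions in Equation (28) agree.

```python
import numpy as np

def q_blocks(A, B, M, R, P):
    """Blocks of the Q matrix following Equation (27)."""
    Q_xx = P + M + P @ A + A.T @ P
    Q_xu = P @ B
    Q_ux = Q_xu.T      # B^T P for symmetric P (assumed for dimensional consistency)
    Q_uu = R
    return Q_xx, Q_xu, Q_ux, Q_uu

def optimal_control(Q_uu, Q_ux, x):
    """u* = -Q_uu^{-1} Q_ux x, Equation (28)."""
    return -np.linalg.solve(Q_uu, Q_ux @ x)

# Hypothetical placeholder system (n = 2 states, m = 1 input).
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
M = np.eye(2)
R = np.array([[1.0]])
P = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -1.0])

Q_xx, Q_xu, Q_ux, Q_uu = q_blocks(A, B, M, R, P)
u_star = optimal_control(Q_uu, Q_ux, x)

# Gradient of (1/2) U^T Q U with respect to u, which should vanish at u*.
grad_u = Q_ux @ x + Q_uu @ u_star
```

Because Q_uu = R and Q_ux = B^T P, the result coincides with the familiar state feedback −R^{-1} B^T P x.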

3.3. Critic/Actor Structure

In this paper, the critic/actor structure is used to solve the problem of online learning with a disturbed model. We use the critic approximator for the Q function and the actor approximator for the triggered optimal control. The critic of the Q function is given as:

Q^{*} (x, u^{*}, t) = \frac{1}{2} U^{T} QU \overset{Δ}{=} \frac{1}{2} vech {(Q)}^{T} (U \otimes U)

(29)

where vech(·) denotes the half vectorization operation with

vech (Q) \in R^{(1 / 2) (n + m) (n + m + 1)}

, in which the off-diagonal elements are taken as

2 Q_{i j}

, and ⊗ denotes the Kronecker product.
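The identity behind Equation (29) can be verified with a short sketch. Since vech(Q) has only (n + m)(n + m + 1)/2 entries, the product with U ⊗ U is read here as a product with the non-redundant quadratic monomials U_i U_j (i ≤ j); this reading, and the helper names `vech_doubled` and `quad_basis`, are assumptions made for illustration.

```python
import numpy as np

def vech_doubled(Q):
    """Half-vectorization of a symmetric matrix with off-diagonal entries doubled."""
    n = Q.shape[0]
    return np.array([Q[i, j] if i == j else 2.0 * Q[i, j]
                     for i in range(n) for j in range(i, n)])

def quad_basis(U):
    """Unique monomials U_i * U_j with i <= j (non-redundant part of U kron U)."""
    n = U.shape[0]
    return np.array([U[i] * U[j] for i in range(n) for j in range(i, n)])

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 3))
Q = 0.5 * (S + S.T)              # random symmetric test matrix (hypothetical)
U = rng.normal(size=3)           # generalized state [x; u]

lhs = 0.5 * U @ Q @ U                            # (1/2) U^T Q U
rhs = 0.5 * vech_doubled(Q) @ quad_basis(U)      # (1/2) vech(Q)^T basis(U)
```

Doubling the off-diagonal entries is what lets each symmetric pair Q_ij, Q_ji be represented by a single monomial.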

Rewrite

Q^{*} (x, u^{*}, t) = W_{c}^{T} (U \otimes U)

(30)

with

W_{c}^{} \overset{Δ}{=} (1 / 2) vech (Q)

, then

W_{c}^{}

can be regarded as the ideal weight vector of the quadratic polynomial that approximates

Q^{*} (x, u^{*}, t)

. In practice, the ideal weight is unknown, so we use the estimate

{\hat{W}}_{c}^{} \overset{Δ}{=} (1 / 2) vech (\hat{Q}) \in R^{(1 / 2) (n + m) (n + m + 1)}

, which gives the critic approximator

\hat{Q} (x, u, t) = {\hat{W}}_{c}^{T} (U \otimes U)

(31)

The actor approximator is

\hat{u} (x, t) = {\hat{W}}_{a}^{T} x

(32)

with weight estimation

{\hat{W}}_{a}^{} \in R^{n \times m}

.

To determine the tuning laws for

{\hat{W}}_{c}^{}

and

{\hat{W}}_{a}^{}

, it is necessary to define suitable approximation errors for the critic and the actor. Here, we divide the time axis into short intervals with fixed step

T_{s}

; using the integral reinforcement learning method, we then have:

Q^{*} (x (t), t) = Q^{*} (x (t - T_{s}), t - T_{s}) - \frac{1}{2} \int_{t - T_{s}}^{t} (x^{T} M x + u^{T} R u) d t

(33)

The critic approximation errors are defined such that the critic weights converge to their ideal values when the errors converge to zero:

\begin{matrix} e_{c 1} & \overset{Δ}{=} & \hat{Q} (x (t), u (t), t) - \hat{Q} (x (κ), u (κ), κ) + \frac{1}{2} \int_{κ}^{t} (x^{T} M x + {\hat{u}}^{T} R \hat{u}) d t \\ = & {\hat{W}}_{c}^{T} (\hat{U} (t) \otimes \hat{U} (t) - \hat{U} (κ) \otimes \hat{U} (κ)) + \frac{1}{2} \int_{κ}^{t} (x^{T} M x + {\hat{u}}^{T} R \hat{u}) d t \end{matrix}

(34)

\begin{matrix} e_{c 2} \overset{Δ}{=} \frac{1}{2} x^{T} (t_{f}) P (t_{f}) x (t_{f}) - {\hat{W}}_{c}^{T} (\hat{U} (t_{f}) \otimes \hat{U} (t_{f})) \end{matrix}

(35)

where

κ \overset{Δ}{=} t - T_{s}

and

\hat{U} (t) = {[x^{T} {\hat{u}}^{T}]}^{T}

denotes the state/control signals from the observer. Similarly, we define the actor approximation error as

e_{a} \overset{Δ}{=} {\hat{W}}_{a}^{T} x + {\hat{Q}}_{u u}^{- 1} {\hat{Q}}_{u x} x

(36)

where

{\hat{Q}}_{u u}^{- 1}, {\hat{Q}}_{u x}

can be obtained from weights

{\hat{W}}_{c}^{}

.

With the critic/actor approximation errors defined, the next step is to find a learning algorithm that drives

e_{c 1}, e_{c 2}, e_{a}

to zero by updating the weight matrices

{\hat{W}}_{c}^{}, {\hat{W}}_{a}^{}

.

3.4. Learning Process

First, define the approximation error as

K_{c} = \frac{1}{2} {∥e_{c 1}∥}^{2} + \frac{1}{2} {∥e_{c 2}∥}^{2}, K_{a} = \frac{1}{2} {∥e_{a}∥}^{2}

(37)

and use the gradient descent method to update the weight matrices

{\hat{W}}_{c}^{}, {\hat{W}}_{a}^{}

so that they converge to their ideal values. This requires the gradients of the critic/actor error functions

K_{c}, K_{a}

with respect to the weight matrices

{\hat{W}}_{c}^{}, {\hat{W}}_{a}^{}

. Similar to [37], from the chain rule and normalization, we can obtain:

\begin{matrix} {\dot{\hat{W}}}_{c} = - α_{c} \frac{\partial K_{c}}{\partial {\hat{W}}_{c}} = - α_{c} (\frac{1}{{(1 + σ^{T} σ)}^{2}} σ e_{c 1} + \frac{1}{{(1 + σ_{t_{f}}^{T} σ_{t_{f}}^{})}^{2}} σ_{t_{f}}^{} e_{c 2}) \\ {\dot{\hat{W}}}_{a} = - α_{a} \frac{\partial K_{a}}{\partial {\hat{W}}_{a}} = - α_{a} x e_{a}^{T} \end{matrix}

(38)

where

σ (t) \overset{Δ}{=} (\hat{U} (t) \otimes \hat{U} (t) - \hat{U} (κ) \otimes \hat{U} (κ))

,

σ_{t_{f}} \overset{Δ}{=} (\hat{U} (t_{f}) \otimes \hat{U} (t_{f}))

,

α_{c}, α_{a} \in R^{+}

are constant gains that determine the convergence rate; the gradient descent algorithm of Equation (38) guarantees convergence.
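The actor part of Equation (38) can be illustrated with an Euler-discretized sketch. Here the gain matrix K stands in for the term Q̂_uu^{-1} Q̂_ux of Equation (36) and is held fixed at hypothetical values, whereas in the paper it comes from the critic estimate; the step size dt and the sinusoidal state signal are also assumptions chosen only to make the toy example persistently exciting.

```python
import numpy as np

def actor_step(W_a, x, K, alpha_a, dt):
    """One Euler step of dW_a/dt = -alpha_a * x * e_a^T (actor law in Equation (38))."""
    e_a = W_a.T @ x + K @ x                  # actor error, Equation (36)
    return W_a - dt * alpha_a * np.outer(x, e_a), e_a

K = np.array([[0.5, 1.0]])    # hypothetical Q_uu^{-1} Q_ux, shape (m, n) = (1, 2)
W_a = np.zeros((2, 1))        # initial actor weights, shape (n, m)
alpha_a, dt = 0.1, 0.05       # alpha_a as quoted in Section 4.4; dt is an assumption

for k in range(6000):         # 300 s of simulated time
    t = k * dt
    x = np.array([np.sin(t), np.cos(t)])    # hypothetical exciting state trajectory
    W_a, e_a = actor_step(W_a, x, K, alpha_a, dt)

u_hat = W_a.T @ x             # actor output, Equation (32)
```

In this toy setting the weights approach −K^T, so the actor output Ŵ_a^T x approaches −K x, which is the form of the optimal control in Equation (28).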

Next, we define the weight estimation errors

{\tilde{W}}_{c} \overset{Δ}{=} W_{c} - {\hat{W}}_{c}

,

{\tilde{W}}_{a} \overset{Δ}{=} W_{a} - {\hat{W}}_{a}

, which yield the critic estimation error dynamics

{\dot{\tilde{W}}}_{c} = {\dot{W}}_{c} - {\dot{\hat{W}}}_{c} = - α_{c} (\frac{σ σ^{T}}{{(1 + σ^{T} σ)}^{2}} + \frac{σ_{t_{f}} σ_{t_{f}}^{T}}{{(1 + σ_{t_{f}}^{T} σ_{t_{f}})}^{2}}) {\tilde{W}}_{c}

(39)

Similarly, we obtain the actor weight estimation error dynamics

{\dot{\tilde{W}}}_{a} = - α_{a} x x^{T} {\tilde{W}}_{a} - α_{a} x x^{T} {\tilde{Q}}_{x u} R^{- 1}

(40)

where

{\tilde{Q}}_{x u}

denotes the corresponding matrix block from Equation (27). The stability analysis of the learning process is given in Lemma 3:

Lemma 3.

According to the critic approximator tuning law of Equation (38) for any given control input, the critic error dynamics converge exponentially to the equilibrium point with

∥{\tilde{W}}_{c}∥ \leq ∥{\tilde{W}}_{c} (τ_{0})∥ μ_{1} exp (- μ_{2} (τ - τ_{0}))

(41)

where

μ_{1}, μ_{2} \in R^{+}

,

τ > τ_{0} \geq 0

, and the signal Δ should be persistently exciting (PE) within interval

[τ, τ + τ_{PE}]

, i.e.,

\int_{τ}^{τ + τ_{PE}} Δ Δ^{T} d τ \geq β I

, with

β \in R^{+}

and

Δ Δ^{T} \overset{Δ}{=} \frac{σ σ^{T}}{{(1 + σ^{T} σ)}^{2}} + \frac{σ_{t_{f}} σ_{t_{f}}^{T}}{{(1 + σ_{t_{f}}^{T} σ_{t_{f}})}^{2}}

(42)

The stability proof of this learning method can be found in [37] and is omitted here.
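The role of the persistence of excitation condition in Lemma 3 can be seen in a toy simulation of the error dynamics of Equation (39). This is only a sketch under strong assumptions: a two-dimensional regressor σ(t) = [cos ωt, sin ωt], the terminal term switched off (zero terminal regressor), a hypothetical initial weight error, and an illustrative gain α_c = 8 rather than the experimental value.

```python
import numpy as np

def delta_delta_T(sigma, sigma_f):
    """Normalized regressor matrix of Equation (42)."""
    term = np.outer(sigma, sigma) / (1.0 + sigma @ sigma) ** 2
    term_f = np.outer(sigma_f, sigma_f) / (1.0 + sigma_f @ sigma_f) ** 2
    return term + term_f

def simulate_error(omega, alpha_c=8.0, dt=0.01, steps=2000):
    """Euler integration of Equation (39); returns the history of the error norm."""
    w_tilde = np.array([1.0, 1.0])      # hypothetical initial critic weight error
    sigma_f = np.zeros(2)               # terminal term omitted in this toy case
    norms = [np.linalg.norm(w_tilde)]
    for k in range(steps):
        t = k * dt
        sigma = np.array([np.cos(omega * t), np.sin(omega * t)])
        w_tilde = w_tilde - dt * alpha_c * delta_delta_T(sigma, sigma_f) @ w_tilde
        norms.append(np.linalg.norm(w_tilde))
    return np.array(norms)

norms_pe = simulate_error(omega=1.0)     # rotating regressor: persistently exciting
norms_flat = simulate_error(omega=0.0)   # constant regressor: not persistently exciting
```

With the rotating regressor the error norm decays towards zero, whereas with the constant regressor the error component orthogonal to σ never changes, so the norm stalls near 1.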

The learning process used here has to calculate the weight matrices of Equation (38) iteratively over time, which increases the controller complexity compared with traditional learning [56]; however, no dynamic model information is needed when the system deviates from the nominal one, which makes the approach applicable for space use. Finally, the whole structure of the proposed event-trigger control system with the learning process is shown in Figure 3.

4. Experiment Test and Simulation

4.1. Laboratory Experiment Environment

The proposed thermal control system was extensively tested in a laboratory environment on the ground before launch. The relevant metal materials and heating parameters of the MWR thermal test are given in Table 1.

Parameters for the nominal PID control in Equation (2) include

K_{p} = 40

,

K_{I} = 0.5

,

K_{D} = 5

, and sampling time is

T_{s} = 5 s

. Based on extensive experiments, the thermal dynamic model is:

\dot{x} (t) = [\begin{matrix} A_{11} & A_{12} & A_{13} & A_{14} \\ 0 & A_{22} & A_{23} & A_{24} \\ 0 & 0 & A_{33} & A_{34} \\ 0 & 0 & 0 & A_{44} \end{matrix}] x (t) + [\begin{matrix} B_{11} & B_{12} & B_{13} & B_{14} \\ 0 & B_{22} & B_{23} & B_{24} \\ 0 & 0 & B_{33} & B_{34} \\ 0 & 0 & 0 & B_{44} \end{matrix}] σ (u (t))

(43)

Detailed values of the submatrices can be found in Appendix A.
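For readers who wish to reproduce the model numerically, the block matrices of Equation (43) can be assembled from the Appendix A values as in the following sketch. The saturation σ(u) is modeled here as a symmetric clip at ±ū, which is an assumption; the helper names `diag_block` and `x_dot` are illustrative only.

```python
import numpy as np

def diag_block(d, scale):
    """4x4 block from Appendix A: d on the diagonal, 0.1 above it, all times scale."""
    return (d * np.eye(4) + 0.1 * np.triu(np.ones((4, 4)), k=1)) * scale

I4, Z4 = np.eye(4), np.zeros((4, 4))

A = np.block([
    [diag_block(9.7318, 1e-5), 0.01 * I4, Z4, Z4],
    [Z4, diag_block(1.5350, 1e-5), Z4, Z4],
    [Z4, Z4, diag_block(6.4996, 1e-5), 0.01 * I4],
    [Z4, Z4, Z4, diag_block(1.5570, 1e-5)],
])
B = np.block([
    [diag_block(7.6628, 1e-4), 0.01 * I4, Z4, Z4],
    [Z4, diag_block(3.3, 1e-3), Z4, Z4],
    [Z4, Z4, diag_block(5.3101, 1e-4), 0.01 * I4],
    [Z4, Z4, Z4, diag_block(1.7, 1e-3)],
])

def x_dot(x, u, u_bar):
    """Right-hand side of Equation (43) with sigma(u) assumed to be a symmetric clip."""
    return A @ x + B @ np.clip(u, -u_bar, u_bar)
```

Both matrices are 16 × 16 and upper triangular, matching the four payloads with four sensor states each.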

The overall thermal control experiment system for the MWR time-delay test was carefully designed to provide a temperature control environment with precision, high stability, and a wide range of output, with a vibration isolation environment that had a vibration amplitude of less than 1 µm. The schematic diagram of the whole test system is given in Figure 4 below.

The whole thermal control system includes precisely controllable temperature and humidity thermal vacuum equipment, an ultra-high-precision composite vibration isolation platform, the MWR payload, an MWR data sampling and processing system, and a power supply, as shown in Figure 4. MWR microwave ranging systems A and B were placed on the vibration isolation platform with a thermal insulation structure, which reduces the influence of the isolation platform temperature on the measured MWR equipment. The data acquisition and processing system, power supply, etc., were placed on the laboratory desktop to prevent other heat sources and vibration from affecting the tested payload.

4.2. Performance of Passive and Nominal Thermal Control

To fully simulate the on-orbit thermal condition of the internal satellite compartment, we provide the baseline follow-on formation mission as: Chief spacecraft orbit altitude: 500 km; inclination: 89.2 deg; argument of perigee: 0 deg; RAAN: 0 deg; true anomaly: 0 deg; the deputy spacecraft followed the in-line flight relative to the chief spacecraft, with a distance of about 180 km in-track [57].
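Since the results below are reported per orbit, it may help to convert the 500 km altitude into an orbital period. The sketch assumes a circular two-body orbit and uses standard values for Earth's gravitational parameter and equatorial radius; neither assumption is stated in the mission description above.

```python
import math

MU_EARTH_KM3_S2 = 398600.4418   # Earth's gravitational parameter (standard value)
R_EARTH_KM = 6378.137           # Earth's equatorial radius (standard value)

def orbital_period_s(altitude_km: float) -> float:
    """Kepler period T = 2*pi*sqrt(a^3/mu) for a circular orbit (assumed)."""
    a = R_EARTH_KM + altitude_km
    return 2.0 * math.pi * math.sqrt(a ** 3 / MU_EARTH_KM3_S2)

period_s = orbital_period_s(500.0)
period_min = period_s / 60.0     # roughly 94 to 95 minutes
```

Under these assumptions one orbit on the figure axes corresponds to a little over an hour and a half of thermal cycling.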

The bold blue lines in Figure 5 show the measured temperature of the ranging payloads inside the thermal vacuum equipment without active control, from sampled data during the experiment. Note that there are four sensors around each payload; the results in Figure 5 demonstrate the average temperature of those four sensors for each payload. For the sake of clarity, here we just provide the thermal states after temperature convergence from the fifth to tenth orbit. The results represent the satellite internal ranging payload thermal condition during the formation scene on-orbit, which mostly exhibit periodic fluctuations.

Clearly, the passive thermal protection method performed stably during the on-orbit mission time, with fluctuations of about ±0.15 °C after convergence, which verified our model for practical use in a real space mission. However, for more precise ranging performance, active thermal control is necessary to further reduce temperature fluctuations. The thermal experiment was conducted again with the same parameters. The green lines in Figure 5 are the target temperatures of each payload, obtained through online curve fitting, and the bold dotted red lines show the results of the nominal thermal PID control introduced in Section 2.2. The results show a reduced temperature fluctuation of about ±0.1 °C, compared with passive thermal control.

4.3. Performance of Trigger Control

The performance of the proposed trigger control was tested and is described in this section. Here, we set the parameters

γ = 0.5, δ = 0.02

, which define the trigger threshold applied to the measured states. The saturation control

\bar{u}

complied with the data from Table 1 for each payload, and

S = D = d i a g [0.15 I_{4 \times 4} 0.1 I_{4 \times 4} 0.15 I_{4 \times 4} 0.1 I_{4 \times 4}]

in Equation (11).

Figure 6 provides the experimental results of the self-triggered control for the signal process unit payload. We obtained smaller temperature fluctuations (bold solid black line) than with the nominal control. This improvement is due to the adopted optimal control, which minimizes the difference between the nominal thermal trajectory and the target one. The other payloads showed thermal control performance similar to that of the signal process unit; these results are not shown here for the sake of brevity.

It is also instructive to examine the internal states under triggered control; Figure 7 shows the triggered sampling periods of the signal process unit payload. Clearly, the triggered control design adjusts the sampling period according to the error state perturbations, in contrast to the fixed sampling time of the nominal PID control. Moreover, during the unsaturated actuator stages of our experiments, the event-trigger (red triangles) produced longer sampling periods than the self-trigger (blue stars): about 60–120 s for the event-trigger and 20–35 s for the self-trigger. The reason is that the event-trigger uses external thermal sensors to obtain state updates, whereas the self-trigger relies on model-predicted state information, which leads to more frequent sampling than in the event-trigger structure. The upper data in Figure 7 show that the trigger points occur at the beginning/end of the control saturation stages and at the local minimum/maximum temperatures.
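The communication saving implied by these sampling periods can be estimated with simple arithmetic. The sketch below compares transmissions per hour for the 5 s fixed sampling of the nominal PID loop with the quoted trigger periods; it assumes the quoted ranges held for the whole hour, which is an idealization because they were observed only during the unsaturated actuator stages.

```python
def transmissions_per_hour(period_s: float) -> float:
    """Number of state transmissions in one hour for a given sampling period."""
    return 3600.0 / period_s

fixed = transmissions_per_hour(5.0)                                           # nominal PID
event = (transmissions_per_hour(120.0), transmissions_per_hour(60.0))         # 60-120 s
self_trig = (transmissions_per_hour(35.0), transmissions_per_hour(20.0))      # 20-35 s

def reduction(n: float) -> float:
    """Fractional reduction in transmissions relative to fixed 5 s sampling."""
    return 1.0 - n / fixed

event_reduction = (reduction(event[1]), reduction(event[0]))
self_reduction = (reduction(self_trig[1]), reduction(self_trig[0]))
```

Under this idealization the event-trigger would cut transmissions by roughly 92–96% and the self-trigger by roughly 75–86% relative to fixed sampling.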

4.4. Performance of Learning Process

The thermal control system modeling may differ from the original design as a space mission lasts for years. Here, we use the proposed online learning method for the thermal system with the following parameters: positive semidefinite matrix

M

and positive definite matrix

R

M = diag [\begin{matrix} 10^{- 8} & 10^{- 8} & 10^{- 9} & 10^{- 9} \end{matrix}], R = diag [\begin{matrix} 10^{- 5} & 10^{- 5} & 10^{- 5} & 10^{- 5} \end{matrix}]

the actor/critic approximator constant gain

α_{a} = 0.1

,

α_{c} = 50

in Equations (39) and (40). Moreover, a 0.5 kg aluminum alloy block was attached tightly to the signal process unit until the end of the 6th orbit period, simulating thermal system model uncertainty in space. For the experiments during orbits No. 0–6, exploration noise was added to the nominal control input to ensure persistence of excitation and state exploration.

Figure 8 shows the performance of the thermal control system with the learning process for the signal process unit payload. The results of the previous and current experiments are marked as thin and bold lines, respectively, for comparison with Figure 6. The thermal system gradually recovered to its normal state from the beginning of orbit No. 6, as the aluminum alloy block was separated from the payload and the nominal/trigger control trajectories went through a tracking process, marked as bold red/black lines. The results illustrate the efficacy of the proposed learning algorithm when the system model is disturbed by unknown information: the system converges adaptively to a stable state. Moreover, it is worth noting that the internal thermal condition can be improved if a large metal alloy block is used.

4.5. Time-Delay Performance of MWR Ranging System

The time-delay (ranging error) fluctuation of the microwave ranging system is closely related to the on-orbit thermal stability of each payload. The active microwave payloads used in the experiment were tested separately in advance on a clean platform with precisely controlled constant temperature and humidity, which revealed time-delay coefficients (TDC, meaning the time-delay value per degree Celsius) of 40 µm/K and 48 µm/K for the K-band antenna and waveguide switch network, and 19 µm/K and 25 µm/K for the signal process unit and USO, respectively [57]. Moreover, the time delay of the whole K-band ranging system can be kept below 5 µm if the payload thermal system is well protected within about 0.1 °C fluctuation.
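The link between temperature stability and ranging error can be made concrete with the time-delay coefficients quoted above. The sketch multiplies each coefficient by a 0.1 K temperature excursion; how the per-payload contributions combine into the system-level figure is not specified in the text, so no combination is attempted here.

```python
# Time-delay coefficients in micrometres per kelvin, as quoted in the text.
TDC_UM_PER_K = {
    "K-band antenna": 40.0,
    "waveguide switch network": 48.0,
    "signal process unit": 19.0,
    "USO": 25.0,
}

def delay_change_um(tdc_um_per_k: float, delta_t_k: float) -> float:
    """Time-delay change (in micrometres) for a temperature excursion delta_t_k."""
    return tdc_um_per_k * delta_t_k

# Per-payload delay change for a 0.1 K excursion.
per_payload = {name: delay_change_um(tdc, 0.1) for name, tdc in TDC_UM_PER_K.items()}
```

For a 0.1 K excursion this gives 4.0, 4.8, 1.9, and 2.5 µm for the four payloads, each below the 5 µm level mentioned for the whole system.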

The final microwave ranging error using the proposed trigger-based thermal control structure is shown in Figure 9 with a black line. Clearly, the thermal control system performed stably after convergence with the optimal online triggered control structure. The ranging error is less than approximately 6 µm (max-min) in orbit No. 10 with passive thermal control and less than 3.5 µm with nominal thermal control. A 25% (RMS) accuracy improvement can be realized by using the trigger-based control strategy, about 2.2 µm from the test, compared to the nominal control method.

5. Discussion

Aiming to improve the autonomy and accuracy of MWR ranging in real space missions, this paper proposed a trigger-based payload thermal control system design with an online learning process. The control structure can be selected through an option switch according to actual needs. Basically, a nominal controller is used for coarse control in a new thermal environment and can be switched to a trigger-based controller during the mission, minimizing the communication resources required from the spacecraft platform. RL-based control is suitable for long-term missions in space, particularly in cases where thermodynamic conditions deteriorate. The computational complexity increases because the design combines the nominal control, the trigger-based control (trigger condition computation), and the RL-based control (actor/critic approximation iteration steps). The performance of the proposed trigger thermal control system was verified in a laboratory, which demonstrated its communication reduction and temperature stability for practical use. The effectiveness of the learning process was also validated under conditions of thermal dynamic modeling uncertainty. Finally, the ranging accuracy was tested for the whole payload system; we found that a 25% (RMS) performance improvement can be realized by using a trigger-based control strategy, about 2.2 µm compared to the nominal control method.

Author Contributions

Conceptualization, X.W. and H.Z.; methodology, X.W. and H.Z.; software, X.W. and N.W.; validation, X.W., X.L., D.W. and X.Z.; formal analysis, Q.S. and Z.Z.; investigation, X.W., X.L. and Z.Z.; resources, X.W., S.W. and C.D.; data curation, X.W. and X.L.; writing—original draft preparation, X.W.; writing—review and editing, X.W. and Q.S.; visualization, X.W.; supervision, D.W., S.W. and C.D.; project administration, X.W., D.W. and C.D.; funding acquisition, X.W., D.W. and C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Nature Science Fund under contract No. 19ZR1426800; Shanghai Jiao Tong University Global Strategic Partnership Fund (2019 SJTU-UoT), WF610561702; National Key R&D Program of China, No. 2020YFC2200800; Natural Science Foundation of China, No. U20B2054, No. U20B2056.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RMS	Root mean square
MWR	Microwave ranging
ETC	Event-triggered control
STC	Self-triggered control
LMI	Linear matrix inequality
ADP	Adaptive dynamic programming
ApDP	Approximate dynamic programming
RL	Reinforcement learning
USO	Ultra-stable crystal oscillator
CFRP	Carbon fiber reinforced plastic
MLI	Multi-layer insulation
A/D converter	Analog-to-digital converter
FPGA	Field Programmable Gate Array
DSP	Digital Signal Processing
MEMS	Micro-Electro-Mechanical System
PWM	Pulse-width modulation
PID	Proportional-integral-derivative
HJB function	Hamilton–Jacobi–Bellman function
PE	Persistently exciting
RAAN	Right Ascension of Ascending Node
TDC	Time-delay coefficient (time delay per degree Celsius)

Symbols

Crucial symbols in trigger-based control include:

$σ (u)$	saturation control
$γ, δ$	positive constants for triggering error
$t_{k}$	the triggered time of event k
$τ_{k}$	signal transmitting period since $t_{k}$
$θ (\cdot)$	function of $λ_{min} (P) {∥\cdot∥}^{2}$
$h (\cdot)$	trigger period of $h (γ, x (t_{k}))$
$\underset{̲}{α} (\cdot), \bar{α} (\cdot), α (\cdot), β (\cdot)$	$K_{\infty}$ function

Crucial symbols in learning-based process include:

${\hat{W}}_{c}, {\tilde{W}}_{c}$	estimation and error of critic approximate weight
${\hat{W}}_{a}, {\tilde{W}}_{a}$	estimation and error of actor approximate weight
$e_{c 1}, e_{c 2}$	critic approximation error
$e_{a}$	actor approximation error
$K_{c}, K_{a}$	critic/actor approximation error function
$σ (t)$	user defined function of generalized states $U$
$α_{c}, α_{a}$	constant gain of convergence rate
$μ_{1}, μ_{2}$	constants of the exponential convergence bound

Appendix A

A_{11} = [\begin{matrix} 9.7318 & 0.1 & 0.1 & 0.1 \\ 0 & 9.7318 & 0.1 & 0.1 \\ 0 & 0 & 9.7318 & 0.1 \\ 0 & 0 & 0 & 9.7318 \end{matrix}] \times 10^{- 5}

A_{22} = [\begin{matrix} 1.5350 & 0.1 & 0.1 & 0.1 \\ 0 & 1.5350 & 0.1 & 0.1 \\ 0 & 0 & 1.5350 & 0.1 \\ 0 & 0 & 0 & 1.5350 \end{matrix}] \times 10^{- 5}

A_{33} = [\begin{matrix} 6.4996 & 0.1 & 0.1 & 0.1 \\ 0 & 6.4996 & 0.1 & 0.1 \\ 0 & 0 & 6.4996 & 0.1 \\ 0 & 0 & 0 & 6.4996 \end{matrix}] \times 10^{- 5}

A_{44} = [\begin{matrix} 1.5570 & 0.1 & 0.1 & 0.1 \\ 0 & 1.5570 & 0.1 & 0.1 \\ 0 & 0 & 1.5570 & 0.1 \\ 0 & 0 & 0 & 1.5570 \end{matrix}] \times 10^{- 5}

A_{12} = A_{34} = 0.01 \times I_{4 \times 4}

A_{13} = A_{14} = A_{23} = A_{24} = 0_{4 \times 4}

B_{11} = [\begin{matrix} 7.6628 & 0.1 & 0.1 & 0.1 \\ 0 & 7.6628 & 0.1 & 0.1 \\ 0 & 0 & 7.6628 & 0.1 \\ 0 & 0 & 0 & 7.6628 \end{matrix}] \times 10^{- 4}

B_{22} = [\begin{matrix} 3.3 & 0.1 & 0.1 & 0.1 \\ 0 & 3.3 & 0.1 & 0.1 \\ 0 & 0 & 3.3 & 0.1 \\ 0 & 0 & 0 & 3.3 \end{matrix}] \times 10^{- 3}

B_{33} = [\begin{matrix} 5.3101 & 0.1 & 0.1 & 0.1 \\ 0 & 5.3101 & 0.1 & 0.1 \\ 0 & 0 & 5.3101 & 0.1 \\ 0 & 0 & 0 & 5.3101 \end{matrix}] \times 10^{- 4}

B_{44} = [\begin{matrix} 1.7 & 0.1 & 0.1 & 0.1 \\ 0 & 1.7 & 0.1 & 0.1 \\ 0 & 0 & 1.7 & 0.1 \\ 0 & 0 & 0 & 1.7 \end{matrix}] \times 10^{- 3}

B_{12} = B_{34} = 0.01 \times I_{4 \times 4}

B_{13} = B_{14} = B_{23} = B_{24} = 0_{4 \times 4}

References

Landerer, F.W.; Flechtner, F.M.; Save, H.; Webb, F.H.; Bandikova, T.; Bertiger, W.I.; Bettadpur, S.V.; Byun, S.H.; Dahle, C.; Dobslaw, H.; et al. Extending the global mass change data record: GRACE Follow-On instrument and science data performance. Geophys. Res. Lett. 2020, 47, e2020GL088306.
Bryant, R.; Moran, M.S.; McElroy, S.A.; Holifield, C.; Thome, K.J.; Miura, T.; Biggar, S.F. Data continuity of Earth Observing 1 (EO-1) Advanced Land Imager (ALI) and Landsat TM and ETM+. IEEE Trans. Geosci. Remote Sens. 2003, 41, 1204–1214.
Totani, T.; Ogawa, H.; Inoue, R.; Das, T.K.; Wakita, M.; Nagata, H. Thermal design procedure for micro- and nanosatellite pointing to earth. J. Thermophys. Heat Transf. 2014, 28, 524–533.
Reiss, P.; Hager, P.; Bewick, C. New methodologies for the thermal modeling of CubeSats. In Proceedings of the 26th Annual AIAA/USU Conference on Small Satellites, Logan, UT, USA, 13–16 August 2012; pp. 1–12.
Jiang, X.; Han, Q.L.; Liu, S.; Xue, A. A New H_∞ Stabilization Criterion for Networked Control Systems. IEEE Trans. Autom. Control 2008, 53, 1025–1032.
Astrom, K.J.; Bernhardsson, B.M. Comparison of Riemann and Lebesgue sampling for first order stochastic systems. In Proceedings of the 41st IEEE Conference on Decision and Control, Las Vegas, NV, USA, 10–13 December 2002; Volume 2, pp. 2011–2016.
Pan, H.; Chang, X.; Zhang, D. Event-triggered adaptive control for uncertain constrained nonlinear systems with its application. IEEE Trans. Ind. Inform. 2019, 16, 3818–3827.
Liu, W.; Huang, J. Event-triggered global robust output regulation for a class of nonlinear systems. IEEE Trans. Autom. Control 2017, 62, 5923–5930.
Xing, L.; Wen, C.; Liu, Z.; Su, H.; Cai, J. Event-Triggered Output Feedback Control for a Class of Uncertain Nonlinear Systems. IEEE Trans. Autom. Control 2018, 64, 290–297.
Wang, R.; Si, C.; Ma, H.; Hao, C. Global event-triggered inner-outer loop stabilization of under-actuated surface vessels. Ocean Eng. 2020, 218, 108228.
Zhang, J.; Liu, S.; Liu, J.F. Economic model predictive control with triggered evaluations: State and output feedback. J. Process Control 2014, 24, 1197–1206.
Shahid, M.I.; Ling, Q. Event-triggered distributed dynamic output-feedback dissipative control of multi-weighted and multi-delayed large-scale systems. ISA Trans. 2020, 96, 116–131.
Azimi, M.M.; Afzalian, A.A.; Ghaderi, R. Decentralized stabilization of a class of large scale networked control systems based on modified event-triggered scheme. Int. J. Dyn. Control 2021, 9, 149–159.
Li, F.; Cao, X.; Zhou, C.; Yang, C. Event-triggered asynchronous sliding mode control of CSTR based on Markov Model. J. Frankl. Inst. 2021, 358, 4688–4704.
Wang, W.; Tong, S. Distributed adaptive fuzzy event-triggered containment control of nonlinear strict-feedback systems. IEEE Trans. Cybern. 2019, 50, 3973–3983.
Su, X.; Liu, Z.; Lai, G.; Zhang, Y.; Chen, C.P. Event-triggered adaptive fuzzy control for uncertain strict-feedback nonlinear systems with guaranteed transient performance. IEEE Trans. Fuzzy Syst. 2019, 27, 2327–2337.
Abhinav, S.; Rajiv, K.M. Control of a nonlinear continuous stirred tank reactor via event triggered sliding modes. Chem. Eng. Sci. 2018, 187, 52–59.
Tang, X.T.; Deng, L. Multi-step output feedback predictive control for uncertain discrete-time T-S fuzzy system via event-triggered scheme. Automatica 2019, 107, 362–370.
Li, S.; Ahn, C.K.; Guo, J.; Xiang, Z. Neural-Network Approximation-Based Adaptive Periodic Event-Triggered Output-Feedback Control of Switched Nonlinear Systems. IEEE Trans. Cybern. 2020, 51, 4011–4020.
Liu, D.; Yang, G.H. Neural Network-Based Event-Triggered MFAC for Nonlinear Discrete-Time Processes. Neurocomputing 2018, 272, 356–364.
Xing, X.; Liu, J. Event-triggered neural network control for a class of uncertain nonlinear systems with input quantization. Neurocomputing 2021, 440, 240–250.
Yang, X.; Wei, Q.L. Adaptive Critic Designs for Optimal Event-Driven Control of a CSTR System. IEEE Trans. Ind. Inform. 2020, 17, 484–493.
Yang, X.; He, H. Event-Driven H∞-Constrained Control Using Adaptive Critic Learning. IEEE Trans. Cybern. 2020, 51, 4860–4872.
Yang, X.; Zhu, Y.; Dong, N.; Wei, Q.L. Decentralized Event-Driven Constrained Control Using Adaptive Critic Designs. IEEE Trans. Neural Netw. Learn. Syst. 2021; 1–15, Early Access.
Seuret, A.; Prieur, C.; Tarbouriech, S.; Zaccarian, L. Event-triggered control with LQ optimality guarantees for saturated linear systems. IFAC Proc. Vol. 2013, 46, 341–346.
Tarbouriech, S.; Garcia, G.; da Silva, J.M.G., Jr.; Queinnec, I. Stability and Stabilization of Linear Systems with Saturating Actuators; Springer Science & Business Media: Berlin, Germany, 2011.
Wu, W.; Reimann, S.; Liu, S. Event-triggered control for linear systems subject to actuator saturation. IFAC Proc. Vol. 2014, 47, 9492–9497.
Årzén, K.E. A simple event-based PID controller. IFAC Proc. Vol. 1999, 32, 8687–8692.
Heemels, W.P.; Gorter, R.J.; Van Zijl, A.; Van den Bosch, P.P.; Weiland, S.; Hendrix, W.H.; Vonder, M.R. Asynchronous measurement and control: A case study on motor synchronization. Control Eng. Pract. 1999, 7, 1467–1482.
Velasco, M.; Fuertes, J.; Marti, P. The self triggered task model for real-time control systems. In Proceedings of the Work-in-Progress Session of the 24th IEEE Real-Time Systems Symposium (RTSS03), Cancun, Mexico, 3–5 December 2003; Volume 384, pp. 67–70.
Heemels, W.; Johansson, K.H.; Tabuada, P. An introduction to event-triggered and self-triggered control. In Proceedings of the 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), Maui, HI, USA, 10–13 December 2012; pp. 3270–3285.
Yi, X.; Liu, K.; Dimarogonas, D.V.; Johansson, K.H. Dynamic event-triggered and self-triggered control for multi-agent systems. IEEE Trans. Autom. Control 2018, 64, 3300–3307.
Wang, X.; Lemmon, M.D. Self-Triggered Feedback Control Systems with Finite-Gain L₂ Stability. IEEE Trans. Autom. Control 2009, 54, 452–467.
Almeida, J.; Silvestre, C.; Pascoal, A.M. Self-triggered state-feedback control of linear plants under bounded disturbances. Int. J. Robust Nonlinear Control 2015, 25, 1230–1246.
Peng, C.; Han, Q.L. On designing a novel self-triggered sampling scheme for networked control systems with data losses and communication delays. IEEE Trans. Ind. Electron. 2015, 63, 1239–1248.
Buşoniu, L.; de Bruin, T.; Tolić, D.; Kober, J.; Palunko, I. Reinforcement learning for control: Performance, stability, and deep approximators. Annu. Rev. Control 2018, 46, 8–28.
Vamvoudakis, K.G. Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach. Syst. Control Lett. 2017, 100, 14–20.
Fortunato, M.; Azar, M.G.; Piot, B.; Menick, J.; Osband, I.; Graves, A.; Mnih, V.; Munos, R.; Hassabis, D.; Pietquin, O.; et al. Noisy networks for exploration. arXiv 2017, arXiv:1706.10295.
Asadi, K.; Littman, M.L. An alternative softmax operator for reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; PMLR 2017. pp. 243–252. [Google Scholar]
Engel, Y.; Mannor, S.; Meir, R. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 201–208. [Google Scholar]
Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 2000, 12, 1057–1063. [Google Scholar]
Jha, S.K.; Roy, S.B.; Bhasin, S. Direct adaptive optimal control for uncertain continuous-time LTI systems without persistence of excitation. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 1993–1997. [Google Scholar] [CrossRef]
Tu, S.; Recht, B. Least-squares temporal difference learning for the linear quadratic regulator. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR 2018. pp. 5005–5014. [Google Scholar]
Umenberger, J.; Schön, T.B. Learning convex bounds for linear quadratic control policy synthesis. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper/2018/hash/f610a13de080fb8df6cf972fc01ad93f-Abstract.html (accessed on 3 July 2022).
Lee, D.; Hu, J. Primal-dual Q-learning framework for LQR design. IEEE Trans. Autom. Control 2018, 64, 3756–3763. [Google Scholar] [CrossRef]
Konda, V.R.; Tsitsiklis, J.N. On actor-critic algorithms. SIAM J. Control Optim. 2003, 42, 1143–1166. [Google Scholar] [CrossRef]
Lee, D.; Lin, C.J.; Lai, C.W.; Huang, T. Smart-valve-assisted model-free predictive control system for chiller plants. Energy Build. 2021, 234, 110708. [Google Scholar] [CrossRef]
Qiu, S.; Li, Z.; Fan, D.; He, R.; Dai, X.; Li, Z. Chilled water temperature resetting using model-free reinforcement learning: Engineering application. Energy Build. 2022, 255, 111694. [Google Scholar] [CrossRef]
Wang, X.; Gong, D.; Jiang, Y.; Mo, Q.; Kang, Z.; Shen, Q.; Wu, S.; Wang, D. A Submillimeter-Level Relative Navigation Technology for Spacecraft Formation Flying in Highly Elliptical Orbit. Sensors 2020, 20, 6524. [Google Scholar] [CrossRef] [PubMed]
Wang, X.; Wu, S.; Gong, D.; Shen, Q.; Wang, D.; Damaren, C. Evaluation of Precise Microwave Ranging Technology for Low Earth Orbit Formation Missions with Beidou Time-Synchronize Receiver. Sensors 2021, 21, 4883. [Google Scholar] [CrossRef]
Min, G. Satellite Thermal Control Technology; China Astronautics Press: Beijing, China, 1991; Volume 249. (In Chinese) [Google Scholar]
Choi, M. Thermal assessment of swift instrument module thermal control system and mini heater controllers after 5+ Years in Flight. In Proceedings of the 40th International Conference on Environmental Systems, Barcelona, Spain, 11–15 July 2010. AAAA 2010-6003. [Google Scholar]
Choi, M. Thermal Evaluation of NASA/Goddard Heater Controllers on Swift BAT, Optical Bench and ACS. In Proceedings of the 3rd International Energy Conversion Engineering Conference, San Francisco, CA, USA, 15–18 August 2005. AAAA 2005-5607. [Google Scholar]
Granger, J.; Franklin, B.; Michalik, M.; Yates, P.; Peterson, E.; Borders, J. Fault-Tolerant, Multiple-Zone Temperature Control; NASA Tech Briefs: New York, NY, USA, 1 September 2008; No. NPO-45230.
Lewis, F.L.; Syrmos, V. Optimal Control; Wiley: New York, NY, USA, 1995.
Bradtke, S.J.; Barto, A.G. Linear least-squares algorithms for temporal difference learning. Mach. Learn. 1996, 22, 33–57.
Jiao, Z.; Wang, D.; Liu, X.; Ren, S.; Yang, S.; Zhong, X. Test and research on time delay stability of micron microwave ranging system. Space Electron. Technol. 2021, 18, 58–63. (In Chinese)

Figure 1. Structure of the satellite: (a) K-band antenna phase center; (b) K-band microwave ranging system payload; (c) centroid adjustment; (d) accelerometer for gravity field detection; (e) star sensor; (f) power supply; (g) fuel tank; (h) thruster.

Figure 2. Structure of the trigger-based thermal control system.

Figure 3. Structure of the triggered thermal control system with learning process.

Figure 4. Schematic diagram of the MWR thermal control system in the laboratory environment.

Figure 5. Thermal states of the K-band MWR payloads during the test: (a) K-band antenna; (b) waveguide switch network; (c) signal process unit; (d) USO.

Figure 6. Comparison of different thermal control algorithms for signal process unit.

Figure 7. Sample period vs. orbit number for self-/event-triggered control.

Figure 8. Comparison of different thermal control algorithms with learning processes for the signal process unit.

Figure 9. Micron-level ranging error under the different thermal control processes.

Table 1. Metal materials and heating/cooling parameters of MWR thermal test.

Payload	Materials ¹	Specific Heat Capacity (J/(kg·K))	Block Mass (kg)	Thermal Conductivity (W/(m·K))	Heating Patch Surface Area (mm × mm)	Nominal Heat/Cool Power (W)	Saturation Heat/Cool Power (W)
antenna	magaluma 5086	9.00 × 10²	1.45	127	50 × 20	8	3.5
waveguide	nickel alloy GH4169	6.15 × 10²	0.50	23.6	20 × 10	4	3.5
signal process	aluminum alloy AZ91D	8.80 × 10²	2.14	51	80 × 30	8	3.5
USO	aluminum alloy AZ91D	8.80 × 10²	0.67	51	60 × 30	5	3.5

¹ Major metal material used in manufacturing each payload.
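
To give a feel for the magnitudes in Table 1, the following is a minimal sketch that treats each payload as a single isothermal block (a lumped-capacitance assumption made here for illustration, not necessarily the thermal model used in the paper). It computes the total heat capacity C = m·c from the mass and heat capacity columns, and the loss-free upper bound on the warm-up rate, P/C, at the nominal power. The function names and the dictionary layout are hypothetical.

```python
# Lumped-capacitance sketch built only from the Table 1 values.
# Assumption: each payload is one isothermal block with no heat losses,
# so the rates below are upper bounds, not predictions.
PAYLOADS = {
    # name: (specific heat J/(kg*K), block mass kg, nominal power W)
    "antenna": (9.00e2, 1.45, 8.0),
    "waveguide": (6.15e2, 0.50, 4.0),
    "signal process": (8.80e2, 2.14, 8.0),
    "USO": (8.80e2, 0.67, 5.0),
}


def lumped_capacity(specific_heat, mass):
    """Total heat capacity C = m * c of one block, in J/K."""
    return specific_heat * mass


def adiabatic_rate(power, capacity):
    """Upper bound on dT/dt in K/s when all power heats the block."""
    return power / capacity


def summarize(payloads=PAYLOADS):
    """Return {name: (capacity in J/K, adiabatic rate in K/s)}."""
    rows = {}
    for name, (c, m, p) in payloads.items():
        cap = lumped_capacity(c, m)
        rows[name] = (cap, adiabatic_rate(p, cap))
    return rows


if __name__ == "__main__":
    for name, (cap, rate) in summarize().items():
        print(f"{name:15s} C = {cap:8.1f} J/K  dT/dt <= {rate * 60:.3f} K/min")
```

Under these assumptions the signal process unit has the largest capacity and the waveguide the smallest, so the same heater power changes the waveguide temperature much faster; real rates are lower once conduction and radiation losses are included.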

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
