Scheduling Strategy Design Framework for Cyber–Physical System with Non-Negligible Propagation Delay

Cyber–physical systems (CPS) have been widely employed as wireless control networks. There is a special type of CPS which is developed from the wireless networked control systems (WNCS). They usually include two communication links: Uplink transmission and downlink transmission. Those two links form a closed-loop. When such CPS are deployed for time-sensitive applications such as remote control, the uplink and downlink propagation delay are non-negligible. However, existing studies on CPS/WNCS usually ignore the propagation delay of the uplink and downlink channels. In order to achieve the best balance between uplink and downlink transmissions under such circumstances, we propose a heuristic framework to obtain the optimal scheduling strategy that can minimize the long-term average control cost. We model the optimization problem as a Markov decision process (MDP), and then give the sufficient conditions for the existence of the optimal scheduling strategy. We propose the semi-predictive framework to eliminate the impact of the coupling characteristic between the uplink and downlink data packets. Then we obtain the lookup table-based optimal offline strategy and the neural network-based suboptimal online strategy. Numerical simulation shows that the scheduling strategies obtained by this framework can bring significant performance improvements over the existing strategies.


Introduction
In the recent past, applications of the wireless control networks have become more and more extensive, such as drone formations, autonomous vehicles, automatic factories, etc. Some of those scenarios implicate new requirements for remote control technology, which is a sub-topic of communication control co-design. Remote control technology originates from wireless control systems with long propagation delay such as far-sea monitoring and high-efficiency satellite IoT. The main cause of long propagation delay is the large-scale geographic distance. This feature makes it extremely challenging to design CPS under this scenario. In order to meet the need of remote control with propagation delay, that is, to maintain stable closed-loop control and reduce control costs, we propose a new framework to design uplink and downlink scheduling strategies.
As shown in Figure 1, a typical CPS deployed under the single closed-loop control scenario contains a control system and a communication system. In the rest of this article, we use single-loop CPS to refer to this specific type of CPS. The communication process of a typical single-loop CPS can be divided into two parts: uplink sensor transmission and downlink controller transmission. The uplink transmission is initiated by the sensor and sends the state update packet from the plant to the controller. The controller first uses this data to obtain a more accurate estimate of the plant state. Then the downlink transmission is initiated to send command information from the controller to the actuator located at the plant. The actuator acts on the plant to maintain the plant's stability.
Taking into account the characteristics of a control system, the command can only be generated with an accurate estimation, which means the downlink transmission must occur after a successful uplink transmission . Because of this fixed timing relationship, CPS has to work in half-duplex in most cases: namely, only one of the uplink sensor transmission and the downlink controller transmission can be activated to send a data packet in the same time slot. That means there is a problem of how to design a scheduling strategy between those two transmissions. Note that the uplink and downlink channels here are not just a single wireless channel, but a simplified modeling of a fixed routing link with multiple relays. This scenario is for some special remote control systems that use satellites as relays. Therefore, the propagation delay in our paper is essentially a collection of various delays contained in the entire relay link, including processing delay, transmission delay, propagation delay, etc. This unified modeling is used because the link characteristics of a fixed routing multi-relay link can be described by an equivalent link with a specific code error rate and propagation delay.
There are many related works about WNCS and CPS [1][2][3][4]. Focusing on the conflict between the accuracy requirements of control systems and the limited quantization level, [5] proposed the application of dynamic quantization technology in communication control co-design. Some works designed CPS under the limitations of the wireless coding process, such as code length allocation [6,7], code length design [8,9] and adaptive code length adjustment [10]. Considering the fading characteristics of transmission channels, adaptive transmit power adjustment based on predicting the fast or slow fading of the channels was studied in [11,12]. Some of the above studies include the idea of designing CPS for time-sensitive applications. Nowadays, the most widely used measure of timeliness is the Age of Information (AoI) [13], which is defined as the time elapsed since a certain data packet was generated:
$$\Delta(t) = t - t_g,$$
where $t$ represents the current time and $t_g$ represents the time when the packet was generated. It used to be very difficult to express the control performance measure, that is, the system state mean square error (MSE) [14], when the control system and the communication system are combined. The introduction of AoI changed this situation. For example, the system state MSE of a linear time-invariant (LTI) system can be simply expressed as a function of AoI. This improvement greatly reduces the difficulty of describing the overall system performance in the communication control co-design scenario [15,16]. Many related studies build on AoI, such as the application of the HARQ mechanism in single-loop CPS to improve the overall timeliness [17,18], and the scheduling strategy aiming to minimize the long-term average MSE for single-loop CPS without transmission delay [19]. Some studies on multi-loop scheduling strategy design aiming at optimizing timeliness have also been proposed. Reference [20] focuses on the design of the data inter-arrival rate and code length allocation strategy. References [21,22] proposed an uplink scheduling strategy for multi-loop WNCS under the assumption of ideal downlink transmission. Furthermore, the authors of [23,24] discuss the application of data packet transmission result prediction technology in WNCS design.
The scenarios studied above mainly concern the short-distance Industrial Internet of Things (IIoT), so the impact of uplink and downlink propagation delay on the closed-loop control performance of a CPS is generally ignored. Besides, the above studies only consider one of the two code error rates of the uplink and the downlink transmission. Under the remote control scenario, the code error rates and propagation delays of both links are not only non-negligible, but also have a huge impact on the overall performance of the single-loop CPS. Some works have studied the design of WNCS optimal control strategies under time-delay scenarios [25][26][27]. However, they do not consider the impact of the code error rate and the scheduling strategy, both of which cannot be ignored in the design of communication systems. To this end, we propose a new framework to obtain the optimal scheduling strategy while considering both the code error rates and the propagation delay. This strategy can minimize the long-term average control cost. Firstly, we model the single-loop CPS as an MDP problem and give the sufficient conditions for the stability of the CPS. Secondly, we propose a heuristic semi-predictive framework to eliminate the impact of the coupling characteristic between the uplink and downlink data packets. Finally, we obtain the lookup table-based optimal offline strategy and the neural network-based suboptimal online strategy for the single-loop CPS. The whole process can be extended according to actual deployment requirements with any fixed propagation delay, as long as the sufficient condition is satisfied.
The rest of this paper is organized as follows: In Section 2, we provide the system model and formulate the optimization problem. In Section 3, we introduce the semi-predictive framework and transform the optimization problem into an MDP problem. In Section 4, we obtain the optimal offline strategy and the suboptimal online strategy. In Section 5, we show the numerical simulation results. We conclude this work in Section 6.

The Plant of the Single-Loop CPS
First, we model the plant in the single-loop CPS as a discrete-time LTI system:
$$X_{k+1} = A X_k + B U_k + Z_k,$$
where $k$ denotes the $k$-th time slot, $X_k \in \mathbb{R}$ is the state of the plant at time slot $k$, $U_k \in \mathbb{R}$ is the executed control command, and $Z_k \in \mathbb{R}$ is the normally distributed plant noise whose mean and variance are $\bar{z}$ and $R$, respectively. $A \in \mathbb{R}$ is the state transition coefficient and $B \in \mathbb{R}$ is the command control coefficient. We assume that the plant state remains unchanged within a single time slot. The goal of the CPS is to maintain $X$ around 0.
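The plant recursion can be simulated in a few lines. The sketch below is only illustrative: it reads the LTI model as $X_{k+1} = A X_k + B U_k + Z_k$, and the function name `simulate_plant`, the parameter values, and the seeded random generator are assumptions made for this example, not part of the original model description.

```python
import random


def simulate_plant(A, B, commands, noise_std, x0=0.0, noise_mean=0.0, seed=0):
    """Iterate X_{k+1} = A*X_k + B*U_k + Z_k and return [X_0, ..., X_n].

    commands  : executed control commands U_0, ..., U_{n-1}
    noise_std : standard deviation of Z_k (square root of the variance R)
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    states = [x0]
    for u in commands:
        z = rng.gauss(noise_mean, noise_std)
        states.append(A * states[-1] + B * u + z)
    return states


if __name__ == "__main__":
    # Illustrative values only: an unstable plant (|A| > 1) left uncontrolled.
    trajectory = simulate_plant(A=1.2, B=-1.2, commands=[0.0] * 10, noise_std=1.0)
    print([round(x, 2) for x in trajectory])
```

With no control input and $|A| > 1$, the state drifts away from 0, which is why the freshness of the command that finally reaches the actuator matters.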

The Communication Process of the Single-Loop CPS
In the Introduction, we explained that the entire single-loop CPS works in half-duplex mode. We now describe its communication process. The system adopts a centralized scheduling scheme, which is better suited to a single-loop CPS: the scheduling decision between uplink and downlink transmission is made entirely by the remote controller. We use $a_k$ to denote the scheduling decision made by the controller in time slot $k$. If the controller schedules uplink transmission in slot $k$, $a_k = 1$; if it schedules downlink transmission, $a_k = 2$. We assume that the code error rates of the uplink and downlink transmission channels are $p_s, p_c \in (0, 1)$, respectively. Both code error rates are constant, so an uplink transmission fails with probability $p_s$ and a downlink transmission fails with probability $p_c$ in any time slot. We use $\delta_k$ to denote the transmission result of the packet sent in time slot $k$: no matter which transmission is scheduled, $\delta_k = 1$ if it succeeds and $\delta_k = 0$ otherwise. Since the processing procedures of most practical CPS are digital, a packet that has experienced a certain delay starts to be processed in the next processing cycle after it is received. We therefore model the propagation delays of the uplink and downlink channels as integer numbers of time slots, $d_{up}, d_{down} \in \mathbb{N}^+$, respectively. To simplify the analysis, we assume that the transmission of scheduling instructions and feedback information is ideal.
In addition to the variables described above, we define the following two parameters to describe the status of each part in a single-loop CPS: (1) State Estimation Age τ k : This is defined as the age of the latest valid uplink state update packet successfully received by the controller at the end of the time slot k. τ k reflects the accuracy of the estimation maintained by the remote controller. Because of the uplink propagation delay, the minimum value of state estimation age is d up . When the specific time slot is not considered, it is abbreviated as τ. Its update rule is as follows: where j = k − d up + 1.
(2) State Control Age ϕ k : This is defined as the age of the uplink packet used to generate the latest successfully received downlink packet by the actuator at the end of the time slot k. This parameter represents the total time it takes for the entire CPS to complete a closed-loop control process. It reflects the degree of divergence of the plant's state. Because of the uplink and downlink propagation delay, the minimum value of the state control age is d up + d down . When the specific time slot is not considered, it is abbreviated as ϕ. Its update rule is as follows: where q = k − d down + 1. The abbreviations j and q will be used in the rest of this paper. Note that we set the initial values of τ 0 and ϕ 0 to be 2. These values can be arbitrarily selected within a reasonable range. This is because the long-term average cost we focus on is not affected by those initial values.
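The two update rules did not survive in this text, so the following sketch is a reading of the prose definitions rather than the authors' equations. It assumes that $\tau$ resets to $d_{up}$ when the uplink packet sent in slot $j = k - d_{up} + 1$ is delivered, that $\phi$ resets to $\tau_q + d_{down}$ when the downlink packet sent in slot $q = k - d_{down} + 1$ is delivered, and that both ages otherwise grow by one per slot. It deliberately ignores outdated packets. The name `simulate_ages` is hypothetical.

```python
UPLINK, DOWNLINK = 1, 2


def simulate_ages(actions, results, d_up=1, d_down=1, tau0=2, phi0=2):
    """Return the age sequences (tau, phi), each of length len(actions) + 1.

    actions[k] is a_k in {1, 2}; results[k] is delta_k in {0, 1}.
    Assumed rules (not taken verbatim from the paper):
      tau_{k+1} = d_up            if a_j = 1 and delta_j = 1, else tau_k + 1
      phi_{k+1} = tau_q + d_down  if a_q = 2 and delta_q = 1, else phi_k + 1
    """
    tau, phi = [tau0], [phi0]
    for k in range(len(actions)):
        j = k - d_up + 1    # slot in which the uplink packet arriving now was sent
        q = k - d_down + 1  # slot in which the downlink packet arriving now was sent
        if j >= 0 and actions[j] == UPLINK and results[j] == 1:
            tau.append(d_up)
        else:
            tau.append(tau[k] + 1)
        if q >= 0 and actions[q] == DOWNLINK and results[q] == 1:
            # The command was generated from an estimate that was tau_q slots old.
            phi.append(tau[q] + d_down)
        else:
            phi.append(phi[k] + 1)
    return tau, phi


if __name__ == "__main__":
    tau, phi = simulate_ages(actions=[1, 2, 1, 2], results=[1, 1, 0, 0])
    print("tau:", tau)
    print("phi:", phi)
```

Under these assumed rules the smallest reachable values are $d_{up}$ for $\tau$ and $d_{up} + d_{down}$ for $\phi$, matching the minima stated above. The initial value 2 for both ages follows the text.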

The Control Process of the Single-Loop CPS
In this subsection, we will explain the control process of the single-loop CPS in detail, which is mainly completed by the remote controller and the actuator. The task of the remote controller can be divided into three parts: Maintaining state estimation, generating control commands, and scheduling uplink and downlink transmissions, while the actuator has only one task: Executing the received control commands.
(1) Maintaining State Estimation: We assume that the sensor can sample the state of the plant without distortion. The uplink transmission cannot be scheduled in every time slot. Moreover, a scheduled transmission can fail because of code errors occurring during its propagation, so the remote controller cannot receive a new state update packet in every time slot. Under these circumstances, the remote controller has to update the estimate $\hat{X}_k$ of the plant state $X_k$ through the following process:
$$\hat{X}_{k+1} = \begin{cases} g^{d_{up}}(X_j, k), & a_j = 1,\ \delta_j = 1, \\ g(\hat{X}_k, k), & \text{otherwise}, \end{cases}$$
where $g(X, k) = AX + BU_k$, $g^n(X, k) = g(g^{n-1}(X, k-1), k)$ for all $n > 1$, and $g^1(X, k) = g(X, k)$. In this scenario, this estimation method has been proven to be optimal [28]. When a certain uplink transmission is successful, the remote controller can use the plant state $X_{k-d_{up}+1}$, which is the exact value from $d_{up} - 1$ time slots earlier, to obtain the state estimate $\hat{X}_{k+1}$ of the next time slot. When the current time slot has no successful uplink transmission, the controller can only update $\hat{X}_{k+1}$ from $\hat{X}_k$. According to this process, we can derive the state estimation MSE of the remote controller as $\hat{Q}_k$: Note that the state estimation error of the remote controller is entirely caused by the noise $Z_k$. By using the state estimation age $\tau_k$, we can rewrite the state estimation MSE as a recursive function of the noise variance $R$: Equation (6) uses the definition of AoI to derive the MSE of the estimation. This representation greatly reduces the difficulty of calculation. In the following part, we will use the same idea to derive the single-loop CPS control performance metrics.
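To make the age-to-MSE mapping concrete, the sketch below uses the standard open-loop prediction error of a scalar LTI system. After `age` slots without a fresh sample the error is a sum of `age` noise terms scaled by powers of $A$, so its variance is $R \sum_{i=0}^{age-1} A^{2i}$. That this is the form of the paper's age-based expression is an assumption, and it presumes an unbiased estimator (zero-mean noise or a compensated mean). The function names are hypothetical.

```python
def age_to_mse(A, R, age):
    """Closed form: R * sum_{i=0}^{age-1} A^(2i)."""
    return R * sum(A ** (2 * i) for i in range(age))


def age_to_mse_recursive(A, R, age):
    """Same quantity via the recursion Q(n) = A^2 * Q(n-1) + R with Q(0) = 0."""
    q = 0.0
    for _ in range(age):
        q = A * A * q + R
    return q


if __name__ == "__main__":
    # Illustrative values only: the error grows geometrically when |A| > 1.
    for age in range(1, 6):
        print(age, age_to_mse(1.2, 1.0, age), age_to_mse_recursive(1.2, 1.0, age))
```

The recursive form shows why an age counter is enough to track the error: each extra slot multiplies the previous error variance by $A^2$ and adds one fresh noise term.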
(2) Control Command Generation and Execution: In each time slot, while the remote controller maintains the state estimate, it also uses the estimate to generate a control command $\tilde{U}_k$ with the command generation coefficient $K$. The goal of this control process is to maintain the state around 0. Since the downlink transmission has a propagation delay of $d_{down}$ time slots, we must ensure $BK = -A^{d_{down}}$. To simplify the analysis, we set $B = -A^{d_{down}}$, $K = 1$. Due to the code error rate and scheduling decisions, not every control command $\tilde{U}$ can be received by the actuator. Only those scheduled and successfully transmitted can be used by the actuator. Therefore, the control command executed by the actuator is $U_{k+1}$, where $q = k - d_{down} + 1$. The control method shown by (8) and (9) is called single-step control, which is a common form in the field of classic cybernetics. Using this method, when a control command is successfully delivered to the actuator, the actual state value returns to a value as close to 0 as possible in a single step. Such a process maximizes the effect of a single instruction.
(3) Single-Loop CPS Control Performance Metrics: Consistent with the estimation performance metric, the control performance metric is defined as the state MSE of the plant $Q_k$: Similar to $\hat{Q}_k$, we can rewrite $Q_k$ as a function of the noise variance $R$ and the state control age $\phi$: According to the control cost given by Equation (11), we can obtain the long-term average control cost, that is, the long-term average plant state MSE: Equation (12) reflects the state deviation in the field of classic cybernetics, which is the core cost metric we care about. Note that this quantity used to be very difficult to quantify before the introduction of AoI. Under certain conditions, the limit contained in Equation (12) may not exist, and the problem is unsolvable. To rule out such situations, the sufficient condition for the stability of WNCS with propagation delay will be given later, namely Equation (19). In this paper, the scheduling strategy is designed on the premise that Equation (19) is satisfied.
(4) Uplink and Downlink Scheduling Process: Above, we introduced the control performance measure of a single-loop CPS. Now we describe the scheduling process in detail. As explained, a single-loop CPS has two communication scenarios, the uplink transmission and the downlink transmission, and only one of them can be chosen in each time slot under half-duplex mode. According to the previous definition, the scheduling decision of time slot $k$ is denoted $a_k$. The set of scheduling decisions of all time slots is called a scheduling strategy: where $\Pi$ represents the set of all scheduling strategies. Different scheduling strategies can significantly affect the control performance of a single-loop CPS. Every scheduling strategy $\pi$ has its corresponding long-term average control cost $J_\pi$. Among all scheduling strategies, there is an optimal strategy $\pi^* \in \Pi$, which satisfies: Therefore, we can construct the following optimization problem. The goal of this problem is to minimize the long-term average plant state MSE, and thereby obtain the optimal scheduling strategy, while taking the propagation delay and code error rates of the two wireless channels into account, namely

Semi-Predictive Framework and MDP Modeling
In this section, we introduce the coupling characteristic between the uplink and downlink data packets, which is caused by their propagation delay. In the rest of this paper, we simply call it the coupling characteristic to save space. We propose a semi-predictive framework to eliminate the effect of the coupling characteristic on the solution of optimization problem (15). Based on this framework, we remodel the optimization problem as an MDP problem. Note that the proposed semi-predictive framework is suitable for any value of the uplink and downlink propagation delay. Without loss of generality, we use $d_{up} = d_{down} = 1$ as an example to illustrate the scheduling strategy design process. In practical applications with different propagation delays, we only need to modify the values of $d_{up}$ and $d_{down}$ and adjust some parameters in the following modeling steps to meet specific design requirements.

The Packet Outdate Problem
Section 2 introduced the control mechanism of a single-loop CPS. Through the above analysis, it is easy to see that state update packets and control command packets have strong coupling characteristic for single-step control methods. Actually, such a coupling characteristic exists in any closed-loop control scenario as long as there exists propagation delay. This characteristic will cause some successfully delivered packets to become outdated. As shown in Figure 2, the green and red arrows represent state update packets up 1 (left green arrow), up 2 (left red arrow) and the control command packets down 1 (right green arrow), down 2 (right red arrow), respectively. The command down 1 is generated by the controller using up 1 , while down 2 is generated by the controller using up 2 . During the period from the slot up 2 sent to the slot down 2 executed, if down 1 is executed successfully, both up 2 and down 2 become invalid. In time slot 4, down 1 is executed; the result is that the real state of the plant was returned to a value around 0. This process causes an interruption in the state estimation process which means the estimation updated by up 2 is no longer accurate, so up 2 is outdated. Since up 2 is outdated, the control command down 2 which was generated from it is also outdated. This is the main effect of the coupling characteristic and we named it the packet outdate problem.
As we can see, this problem is mainly caused by the discontinuity in the dynamic process of the plant. The discontinuity only occurs when a downlink control command is executed, which means the uplink state update packet will not cause this problem. When this happens, the outdated uplink and downlink data packets require different processing methods. For an outdated downlink packet, it only needs to be discarded. However, for an outdated uplink packet, we have to backtrack the state estimation before this outdated packet is used. We show the evolution of the state estimation age and state control age in Figure 2. It can be seen that the state estimation age has been backtracked by changing from τ(3) = 2 to τ(4) = 4. The state control age will not be updated like this.

Main Idea of the Semi-Predictive Framework
In the previous subsection, we explained that the packet outdate problem has an impact on the update of the state estimation age, but this problem does not affect the update of the state control age. Therefore, when we try to construct a theoretical analysis framework, as long as the state control age is correct, the final analysis result can be guaranteed to be correct. In other words, the state estimation age of some time slots is allowed to deviate from the actual physical process. As long as it can be ensured that the state estimation age is accurate when the downlink data packet arrives at the actuator, the correct theoretical analysis can be guaranteed. It can be seen that it is possible to skip the state estimation age backtracking process in the theoretical analysis by using this feature. This is the main idea of the semi-predictive framework.
In the normal communication process, the decoding result of a data packet can only be determined after it arrives at its destination. For an uplink data packet, whether it can be successfully decoded is known only after it arrives at the controller; for a downlink packet, this is known only after it arrives at the actuator. However, under the semi-predictive framework, we assume that the transmission result of a downlink packet is known as soon as the downlink packet is sent. Note that we do not predict the result of an uplink packet. This is because the execution of the downlink command is the root cause of the packet outdate problem.
Take the case of Figure 2 as an example again: if we can foresee that the downlink control command packet down 1 can be successfully decoded and is not outdated, then during the period from its sending to its arrival, any packets sent or arriving can be directly discarded since they will be outdated by down 1 . Through this process, the impact of the packet outdate problem is eliminated and state estimation age backtracking is avoided.
While the update process of the state estimation age under the semi-predictive framework is different from the actual physical process, the scheduling strategy obtained based on this framework can still be directly applied to an actual physical process. In the actual physical process, if a downlink data packet arrives at the actuator successfully and is not outdated, then the uplink and downlink transmissions scheduled during its transmission must be outdated. In other words, no matter what scheduling decisions the controller made, the packets sent during this period will be outdated, so those decisions can be arbitrary since they do not affect the final result. Assuming that the downlink control command packet down 1 in Figure 2 can be successfully decoded and is not outdated, we explain the age update processes under both the semi-predictive framework and the actual physical process in detail.
(1) Semi-Predictive Framework: If down 1 can be successfully decoded and is not outdated, then the controller knows that it does not matter whether it chooses uplink or downlink during the transmission of down 1 because those scheduled packets will be outdated anyway. Under these circumstances, a reasonable scheduling strategy is to regularly schedule one of the uplink and downlink transmissions during this period to consume time.
(2) Non-Predictive Framework (Actual Physical Process): In the actual physical process, during the transmission of down 1 , the controller continues to schedule uplink or downlink transmissions according to a certain strategy. However, when down 1 is received and decoded successfully, the previous scheduled transmissions of the controller are all outdated. So in the end, the scheduled transmissions during this period only consume time and have no practical effect.
It can be observed that, under the semi-predictive framework and the actual non-predictive scheduling, the single-loop CPS transmission results are identical; that is, it is accurate to use the semi-predictive framework in the theoretical design and directly apply the results to real applications. This subsection qualitatively analyzes the consistency between the semi-predictive framework and the actual physical process. In the next subsection, we will quantitatively illustrate how this framework corresponds to actual physical processes through MDP modeling.

MDP Modeling of the Semi-Predictive Framework
Based on the semi-predictive framework, we model the single-loop CPS with uplink and downlink propagation delay as an MDP process with the following four elements: (1) State Space: The state space of this MDP is where d max = max{d up , d down }, D(n) ∈ {0, 1, · · ·, d down + 1}. a(n) represents the scheduling decision made in the time slot n. D(n) represents the time interval between the time slot when the latest valid downlink command packet (successfully transmitted and not outdated) in the time slot n was generated and the current time slot n. τ(n) and ϕ(n) represent the state estimation age and the state control age at the time slot n, respectively. The time slot n is based on the current time slot: The time slot for which scheduling decisions are being made. Taking a (−1) as an example: It represents the transmission action taken in the previous time slot of the current time slot. We set both the uplink and downlink propagation delay to be 1 for illustration in the rest of this paper, so the corresponding state space is: S {a (0), D(0), τ(0), ϕ(0)}. In the subsequent sections of this paper, the state space is abbreviated as S {a , D, τ, ϕ} to save space.
(2) Action Space: The action space is $\mathcal{A} \triangleq \{1, 2\}$. This action space corresponds to the scheduling action $a_k$. If the controller schedules uplink transmission in slot $k$, $a_k = 1$. If the controller schedules downlink transmission in slot $k$, $a_k = 2$.
(3) State Transition Probability Matrix: The transition matrix is $P(s' \mid s, a)$. The state transition probability is the probability that the next state is $s'$ when action $a$ is taken in the current state $s$. The transition probability is determined by the channel code error rates. According to the parameter pair $(a', D)$ in the state $s$, the state transition matrix can be divided into five parts: $(a', D) \in \{(1, 1), (1, 2), (2, 0), (2, 1), (2, 2)\}$. The complete construction rules are given in Appendix A.
(4) Cost Function: It can be seen from (4) and (11) that the cost function in a specific state is independent of the action. The cost function can be expressed as a function of the state control age $\phi_k$: In the MDP modeling of the semi-predictive framework, the core parameter is $D(n)$. We limit its maximum value to $d_{down} + 1$ because we only need to track the downlink transmissions in the past $d_{down}$ time slots to ensure that we do not miss any possible packet outdate problem. Besides, this limit helps to reduce the scale of the state space. The update rule of $D(n)$ is as follows: This update process reflects the main idea of the semi-predictive framework and guarantees that it will not cause any differences between the state control ages of the theoretical analysis and the actual physical processes. In the next section, we will use the semi-predictive framework to design the optimal scheduling strategy.

Online and Offline Scheduling Strategies
In this section, we first give the sufficient condition for the existence of the optimal scheduling strategy. Then we use the relative value iteration algorithm to obtain the lookup table-based optimal offline strategy. Aiming at reducing the space complexity of the algorithm and saving space for storing the optimal offline strategy, we further propose a neural network-based suboptimal online strategy. For different uplink and downlink propagation delay, the acquisition process of both strategies is universal, which means that the semi-predictive framework has high practical application value.

Sufficient Conditions for the Strategies' Existence
then there must exist a stationary deterministic scheduling strategy that can stabilize the multi-loop CPS. This stability holds as long as the uplink and downlink propagation delays are fixed, but the long-term control performance metric converges to a larger value as the propagation delay increases. When $K = 1$ and $L = 1$, the above multi-loop CPS reduces to a single-loop CPS. The proof is given in Appendix B.
The essence of this sufficient condition is to link the instability of the control system with the reliability of the communication system. When the reliability of the communication system is higher than the instability of the control system, an optimal scheduling strategy can be found for the communication system to meet the needs of the control system. This condition can effectively guide the design of single- and multi-loop CPS.

Lookup Table-Based Optimal Offline Strategy
Since there is no theoretical upper limit on the state estimation age and the state control age, the MDP state space is infinite, so it must be truncated before solving. We use $N = \max\{\tau, \phi\}$ as the truncation bound, and use the relative value iteration algorithm to solve the MDP problem. When the value of $N$ is appropriate, this truncation has no effect on the control performance. Such a suitable $N$ can be obtained by conducting Monte Carlo experiments. In this section, we take $N = 10$ as an example and show the resulting scheduling strategy in Figure 3. In Figure 3, the red squares indicate that the controller schedules uplink transmission in the corresponding state, and the yellow squares indicate that the controller schedules downlink transmission in the corresponding state. As shown in Figure 3a,c,d, if $D \in \{0, 1\}$, no matter which transmission is scheduled, the related packet will be outdated. So under this circumstance, the scheduling strategy can choose any action arbitrarily. With the relative value iteration algorithm, the strategy we obtained uses uplink transmission to fill these unnecessary transmissions. Note that this part corresponds to the description in Section 3. We take down 1 as an example again: In the actual physical process, it is not known after down 1 is sent that the next two transmissions are unnecessary. The controller does not know that $D \in \{0, 1\}$. Instead, it assumes that $D$ is still equal to 2 in those time slots. Therefore, the controller continues to schedule according to the scheduling strategy. However, down 1 makes those two packets outdated when it is executed, while for the states with $D = 2$, the controller makes its scheduling decision with the correct state information. The entire process ensures that the actual process is consistent with the theoretical process.
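For readers unfamiliar with relative value iteration, the following is a minimal generic sketch for an average-cost MDP with action-independent state costs. It is not the authors' implementation, and `toy_mdp` is a hypothetical three-state example (a capped age that one action resets with probability 0.9), not the transition matrix of Appendix A.

```python
def relative_value_iteration(P, cost, ref=0, tol=1e-9, max_iter=10_000):
    """Solve g + h(s) = min_a [cost(s) + sum_t P[a][s][t] * h(t)] with h(ref) = 0.

    P[a][s][t] : probability of moving from state s to state t under action a
    cost[s]    : per-slot cost of state s (independent of the action)
    Returns (average_cost, policy), where policy[s] is the chosen action index.
    """
    n_actions, n_states = len(P), len(cost)
    h = [0.0] * n_states
    for _ in range(max_iter):
        q = [[cost[s] + sum(P[a][s][t] * h[t] for t in range(n_states))
              for a in range(n_actions)] for s in range(n_states)]
        v = [min(row) for row in q]
        g = v[ref]                      # current estimate of the average cost
        h_new = [x - g for x in v]      # subtract the reference value
        converged = max(abs(x - y) for x, y in zip(h_new, h)) < tol
        h = h_new
        if converged:
            break
    policy = [min(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return g, policy


def toy_mdp():
    """Hypothetical example: states are capped ages 1..3 with cost equal to the age."""
    reset = [[0.9, 0.1, 0.0], [0.9, 0.0, 0.1], [0.9, 0.0, 0.1]]  # action 0
    wait = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]   # action 1
    return [reset, wait], [1.0, 2.0, 3.0]


if __name__ == "__main__":
    P, cost = toy_mdp()
    print(relative_value_iteration(P, cost))
```

Subtracting the value of a reference state in every sweep keeps the iterates bounded, which is what distinguishes the relative variant from plain value iteration in the average-cost setting. The returned policy is what would be stored as the lookup table.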
After this scheduling strategy is obtained, the controller stores it as a lookup table; using it requires no extra computation from the controller, so we call it an offline strategy. However, the iterative algorithm is model-based, and the size of the state space in the MDP model, $N_S = 2 \cdot 3 \cdot N \cdot N = 6N^2$, grows quadratically with N. This leads to a sharp increase in the space complexity of the solving process, and the lookup table may become too large to store. To address these problems, we propose an improved scheme based on a neural network in the next subsection.
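To give a feeling for these sizes, the short Python sketch below evaluates the state-space formula for a few values of N. The count of transition-table entries assumes two actions (uplink and downlink) and dense storage of one matrix per action; that storage layout is an assumption made here for illustration, not a statement about the authors' implementation.

```python
def state_space_size(n):
    """Number of states in the truncated MDP: 2 * 3 * N * N."""
    return 2 * 3 * n * n


def dense_transition_entries(n, n_actions=2):
    """Entries needed if one dense N_S x N_S matrix is stored per action
    (assumed storage layout)."""
    n_s = state_space_size(n)
    return n_actions * n_s * n_s


for n in (10, 20, 40):
    print(n, state_space_size(n), dense_transition_entries(n))
```

Under these assumptions, doubling N multiplies the lookup table by four and the dense transition tables by sixteen.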

Neural Network-Based Suboptimal Online Strategy
In Section 3, we remodeled the optimization problem as an MDP problem, and in the previous subsection we solved it to obtain the optimal offline strategy. The optimal offline scheduling strategy based on the lookup table has two obvious shortcomings: the size of the lookup table increases linearly with the total number of states in the state space, and the space complexity of the solving process grows superlinearly with the number of states (a dense transition matrix alone has one entry per pair of states). When the optimal offline strategy is actually deployed, there is no guarantee that the central controller has enough storage space for the entire lookup table. It may even be impossible to perform the calculation because the state space is too large. Therefore, we design a new suboptimal online scheduling strategy based on a neural network. The idea is to replace the lookup table of the previous strategy with a neural network to save storage space. A neural network is a well-suited function approximator for the lookup table; in theory, it can approximate the table without error. This means that, in the theory of reinforcement learning, this strategy can achieve the performance of the optimal strategy. In the next section, we show that the performance of this suboptimal online strategy is very close to that of the optimal offline strategy.
In order to obtain this neural network, we use a model-free algorithm called Deep Q Network (DQN). The algorithm continuously learns the hidden laws of the MDP problem by interacting with the environment and continuously trains the neural network to obtain better performance. We show the detailed process of the algorithm in Algorithm 1.
Algorithm 1
  Initialize the environment;
  Set the origin state $s_1 \in S$ randomly;
  for t = 1 : 1000 do
    Choose a random $a_t$ with probability $1 - \varepsilon$;
    Otherwise choose $a_t = \arg\max_a Q(s_t, a; \theta)$;
    Execute $a_t$ in the environment;
    Observe reward $r_t$ and get new state $s_{t+1}$;
    Store transition data $(s_t, a_t, r_t, s_{t+1})$ in M;
    Sample a random mini-batch of transitions $(s_j, a_j, r_j, s_{j+1})$ from M;
    Set $y_j = \begin{cases} r_j, & \text{if the episode ends at step } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; \theta^-), & \text{otherwise} \end{cases}$;
    Perform an RMSprop step on $(y_j - Q(s_j, a_j; \theta))^2$ with respect to $\theta$;
    Every 100 steps, set $\hat{Q} = Q$
  end
The structure of the neural network we obtained is shown in Figure 4: four neurons in the input layer, fifty neurons in the hidden layer, and two neurons in the output layer. This neural network-based scheduling strategy is an online strategy, which means that, to use it, the current state s must first be input to the neural network. The controller then runs real-time calculations to obtain the action values A(s, a) of the different actions in the current state. The action value represents how much reward can be obtained by taking the action, so the scheduling strategy selects the action with the largest A(s, a) among all actions.
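The online decision step can be sketched in a few lines of plain Python: the state is passed through a 4-50-2 network and the action with the largest output is selected. The weights below are random placeholders rather than trained values, the ReLU hidden activation is an assumption (the activation function is not specified here), and the function names and the mapping of output indices to uplink and downlink are hypothetical.

```python
import random


def init_network(n_in=4, n_hidden=50, n_out=2, seed=0):
    """Create a 4-50-2 network with small random placeholder weights."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]

    return {"W1": layer(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": layer(n_out, n_hidden), "b2": [0.0] * n_out}


def action_values(net, state):
    """Forward pass: one action value per output neuron (ReLU assumed)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, state)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(net["W2"], net["b2"])]


def select_action(net, state):
    """Index of the action with the largest action value."""
    values = action_values(net, state)
    return max(range(len(values)), key=values.__getitem__)


def parameter_count(net):
    """Total number of weights and biases the controller must store."""
    return (sum(len(r) for r in net["W1"]) + len(net["b1"])
            + sum(len(r) for r in net["W2"]) + len(net["b2"]))


net = init_network()
state = (1, 1, 2, 2)          # a state of the form s = (a, D, tau, phi)
values = action_values(net, state)
action = select_action(net, state)
print(values, action, parameter_count(net))
```

A network of this shape has 352 parameters regardless of the truncation level N, which is the source of the storage saving compared with a lookup table that holds one entry per state.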
DQN is a relatively mature reinforcement learning algorithm, so we only give its parameter settings and briefly introduce its training process. We run E = 2000 episodes of 1000 steps each; a new episode is initialized automatically every 1000 steps. In each step, the algorithm executes the greedy strategy with probability ε = 0.7 and the random strategy with probability 1 − ε = 0.3. After each step, one state transition datum is stored in the data set. The data set holds M = 2048 transitions and is overwritten cyclically, so that new data replace the oldest data once it is full. Training is performed every T = 256 steps, and each time the algorithm selects B = 512 data from the data set. The optimizer is Root Mean Square Propagation (RMSprop).
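The bookkeeping implied by these settings can be sketched in Python as follows. The sketch only covers the cyclic data set, the action selection rule, and the training trigger; the action values and transitions are placeholders, no network is trained, episode boundaries are ignored, and capping the batch at the current data set size before it is full is an assumption.

```python
import random
from collections import deque

MEMORY_SIZE = 2048   # M: capacity of the data set
BATCH_SIZE = 512     # B: transitions sampled per training call
TRAIN_EVERY = 256    # T: steps between training calls
EPSILON = 0.7        # probability of taking the greedy action

# A deque with maxlen drops the oldest item when a new one is appended
# to a full buffer, which gives the cyclic overwrite behaviour.
memory = deque(maxlen=MEMORY_SIZE)


def choose_action(q_values, rng, epsilon=EPSILON):
    """Greedy action with probability epsilon, random action otherwise."""
    if rng.random() < epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return rng.randrange(len(q_values))


rng = random.Random(0)
train_calls = 0
batch = []
for step in range(1, 3001):
    q_values = [0.0, 1.0]                       # placeholder action values
    action = choose_action(q_values, rng)
    transition = (step, action, 0.0, step + 1)  # placeholder (s, a, r, s')
    memory.append(transition)
    if step % TRAIN_EVERY == 0:
        # Assumption: use a smaller batch while the data set is still filling.
        batch = rng.sample(list(memory), min(BATCH_SIZE, len(memory)))
        train_calls += 1                        # a gradient step would go here

print(len(memory), memory[0][0], train_calls, len(batch))
```

After 3000 placeholder steps the data set holds only the most recent 2048 transitions, and training has been triggered 11 times.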
With the help of the DQN algorithm, we can obtain the neural network-based suboptimal online strategy. The controller only needs to store the parameters of this network and then calculates the action values in real time from the current state in each time slot. In other words, this strategy saves a large amount of storage space at the cost of a small amount of the controller's computing ability. This advantage makes the strategy attractive in practical applications.

Numerical Simulation
In this section, we run numerical simulations of the proposed strategies and of some existing strategies, and illustrate the advantages of the proposed strategies through comparison. First we introduce two benchmark strategies. The first is the switch scheduling strategy, which alternates between uplink and downlink transmission in every time slot. The second is the insist scheduling strategy, which keeps scheduling uplink or downlink transmission until it succeeds and then switches to the other transmission.
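The two benchmark rules are simple enough to state as code. The Python sketch below is an illustration only: it assumes that the outcome of each transmission is known before the next slot is scheduled, which ignores the propagation delay treated in this paper, and the function names and the success sequence are hypothetical.

```python
UPLINK, DOWNLINK = "uplink", "downlink"


def other(link):
    """Return the opposite transmission direction."""
    return DOWNLINK if link == UPLINK else UPLINK


def switch_strategy(previous_link, previous_success):
    """Alternate uplink and downlink in every time slot."""
    return other(previous_link)


def insist_strategy(previous_link, previous_success):
    """Repeat the same transmission until it succeeds, then switch."""
    return other(previous_link) if previous_success else previous_link


def run(strategy, outcomes, first_link=UPLINK):
    """Return the link scheduled in each slot for a given outcome sequence."""
    schedule, link = [], first_link
    for success in outcomes:
        schedule.append(link)
        link = strategy(link, success)
    return schedule


# Placeholder per-slot transmission outcomes (True means success).
outcomes = [False, True, True, False, False, True]
switch_schedule = run(switch_strategy, outcomes)
insist_schedule = run(insist_strategy, outcomes)
print(switch_schedule)
print(insist_schedule)
```

Neither rule uses the ages τ and ϕ, which is the information the MDP-based strategies exploit.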
The parameter settings in the numerical simulation are as follows. The state transition coefficient is A = [1.1, 1.3], and the code error rates of the uplink and downlink channels are $p_s = p_c = [0.1, 0.2]$; the specific values are marked on the curves obtained from the simulation. The initial state of the plant is $X_0 = 1$. The plant noise follows the normal distribution N(z = 0, R = 1). The command control coefficient is B = −A. The initial state control variable is $s_0 = (a_0, D_0, \tau_0, \phi_0) = (1, 1, 2, 2)$. The corresponding initial scheduling action is $a_0 = 1$. The initial state of the controller estimation is $\hat{X}_0 = 1$. The range of the truncated state space is N = max{τ, ϕ} = 20. Each strategy runs 500 episodes with 10,000 time slots per episode. The final long-term average plant state MSE is the average of the results of the 500 episodes.
Figure 5 shows the long-term average MSE of the four strategies with A = 1.3 and $p_s = p_c = [0.1, 0.2]$. It can be seen that the MDP strategy, that is, the optimal offline strategy, has the best performance among all strategies, which is also the best performance that any scheduling strategy can achieve. Although the performance of the neural network-based online strategy is slightly lower, it is still significantly ahead of the existing strategies, and the performance gap between the optimal offline strategy and the suboptimal online strategy is very small. This gap can be eliminated in theory, but due to the limitations of deep reinforcement learning technology, it is currently difficult to fully achieve the optimal performance, whereas it is relatively simple to obtain a suboptimal strategy with very close performance.
Figure 6 shows the performance comparison between the optimal offline strategy and the two existing strategies under different state transition coefficients A. The suboptimal online strategy is not shown because, as explained above, it can theoretically approach the optimal strategy. The state transition coefficient and the channel code error rates reflect, respectively, the instability of the control system and the reliability of the communication system in Equation (19). Combined with Figure 5, it can be seen that they influence the CPS in the same way. A larger state transition coefficient or a higher channel code error rate leads to an increase in the long-term average plant state MSE, and when they exceed a certain limit and no longer satisfy Equation (19), the long-term average MSE of the CPS no longer converges, which means that the single-loop CPS is unstable.

Conclusions
We proposed the semi-predictive framework to design scheduling strategies for single-loop CPS with uplink and downlink propagation delay. This framework yields the optimal offline strategy, which is the upper bound on the performance of all strategies, and a suboptimal online strategy with more practical application value. By adjusting its parameters, the semi-predictive framework can be adapted to different practical applications. We introduced the complete process of designing scheduling strategies under this framework by taking a specific situation as an example. The numerical simulation showed that the obtained strategies effectively improve on the performance of the existing strategies.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Construction Rules of the State Transition Probability Matrix
Here we give the complete construction rules of the state transition matrix. Firstly, we give all the possible new states after a state transition as follows: Secondly, we use R and R to mark the transmission results. R represents the result of the downlink transmission scheduled in the next time slot. R represents the result of the uplink transmission arrived in the next time slot. Note that R is known by prediction while R is known by normal communication process. These abbreviations can help to simplify the expression of the rules.
We will give the construction rules in the form of $P[s' \mid s, c] = p$, which means that when the condition c is satisfied, the previous state s transfers to the new state $s'$ with a probability of p.
When s = (1, 1, τ, ϕ): When s = (2, 2, τ, ϕ): When s = (1, 2, τ, ϕ):
To prove the sufficient conditions, we only need to prove that there exists a stationary deterministic strategy that keeps the multi-loop CPS stable. Here we prove that the round-robin insist scheduling strategy can keep the system stable. We first prove the case of L = 1. Round-robin means that in every K time slots, the controller schedules each subsystem once in turn, and the scheduling sequence is fixed from i = 1 to i = K. Insist means that, when scheduling each subsystem, the controller keeps scheduling uplink or downlink transmission until it succeeds and then switches to the other transmission. Therefore, the actions of a single subsystem under the round-robin insist scheduling strategy can be given in the form of the following time axis: The time axis between two consecutive successful downlink transmissions is recorded as a control loop. It can be seen that the AoI evolution process of each control loop of one subsystem is:
(1) The initial estimation age is equal to the control age: $n'K$;
(2) The current subsystem waits for the scheduling of the other subsystems to complete, that is, it stays silent for (K − 1) time slots, and then schedules the uplink transmission when it is scheduled again. If the uplink transmission fails, the subsystem waits another (K − 1) time slots and tries again until the uplink transmission succeeds. This step takes mK time slots. At the end of this step, the estimation age is 0 and the control age is $(n' + m)K$;
(3) After the current subsystem stays silent for (K − 1) time slots, it switches to scheduling downlink transmission and repeats it until it succeeds. This step takes nK time slots. At the end of this step, the estimation age is equal to the control age: nK. This completes one closed control loop.
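The duration of one control loop described in steps (1) to (3) can be checked with a small Monte Carlo sketch in Python. It assumes that every uplink attempt fails independently with probability p_s and every downlink attempt with probability p_c, so that the numbers of attempts m and n are geometrically distributed; the values K = 3, p_s = 0.1 and p_c = 0.2 and the function names are placeholders chosen for illustration.

```python
import random


def attempts_until_success(p_fail, rng):
    """Number of attempts up to and including the first success."""
    count = 1
    while rng.random() < p_fail:
        count += 1
    return count


def control_loop_slots(K, p_s, p_c, rng):
    """Length of one control loop: mK uplink slots plus nK downlink slots."""
    m = attempts_until_success(p_s, rng)   # uplink attempts
    n = attempts_until_success(p_c, rng)   # downlink attempts
    return (m + n) * K


rng = random.Random(1)
K, p_s, p_c = 3, 0.1, 0.2
samples = [control_loop_slots(K, p_s, p_c, rng) for _ in range(20000)]
mean_slots = sum(samples) / len(samples)

# Mean of a geometric number of attempts with success probability (1 - p)
# is 1 / (1 - p), so the expected loop length is K * (1/(1-p_s) + 1/(1-p_c)).
expected_slots = K * (1 / (1 - p_s) + 1 / (1 - p_c))
print(mean_slots, expected_slots)
```

The empirical mean loop length should be close to the closed-form value, and every sampled loop lasts at least 2K slots because each direction needs at least one attempt.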
Note that the time slots included in a complete control loop are the time slots marked in red on the coordinate axis in Figure A1, that is, the control age ranges from $n'K$ to $(n' + m + n)K$. Each control loop is repeatable, so we only need to prove that the long-term average cost within the range of one control loop converges. According to the channel error probability, the M uplink transmissions and N downlink transmissions in each control loop can be modeled as geometric distributions with success probabilities $(1 - p_s)$ and $(1 - p_c)$ respectively. M and N are different in each control loop, and $N'$ represents the number of downlink transmissions in the loop preceding the current control loop. $(n', m, n)$ are their specific observations. $C_i$ and $T_i$ represent the total cost and total time of the i-th control loop of the current subsystem respectively: Next, we can express the long-term average cost as: Choosing $p_{\max} = \max\{p_s, p_c\}$, we can derive that: Since $f(\cdot)$ is a strictly increasing function and $(n', m, n)$ are all greater than 0, we can derive that: We abbreviate $n' + m + n$ as i, that is, $i = n' + m + n$. Considering $i \geq 3$, and when $i = n' + m + n$ is a fixed value, the possible combinations of $(n', m, n) \geq 1$ satisfy the mathematical relationship of ∑ We can derive that: Since there always exist $p > p_{\max}$ and $n < \infty$ satisfying $i^4 p_{\max}^i < p^i$, $\forall i > n$, we have: For $\sum_{i}^{\infty} f(iK) \cdot p^i$, we have: