An Improved DDPG and Its Application in Spacecraft Fault Knowledge Graph

We construct a spacecraft performance-fault relationship graph of the control system, which can help space robots locate and repair spacecraft faults quickly. To improve the performance-fault relationship graph, we improve the Deep Deterministic Policy Gradient (DDPG) algorithm and propose a relationship prediction method that combines representation learning reasoning with deep reinforcement learning reasoning. We take the spacecraft performance-fault relationship graph as the agent's learning environment and adopt reinforcement learning to realize optimal interaction between the agent and the environment. Meanwhile, our model uses deep neural networks to construct complex value and strategy functions, which give the agent excellent perceptual decision-making ability and accurate value judgment ability. We evaluate our model on a performance-fault relationship graph of the control system. The experimental results show that our model has high prediction speed and accuracy and can infer the optimal relationship path between entities to complete the spacecraft performance-fault relationship graph.


Introduction
In recent years, space robots used for spacecraft fault maintenance have attracted wide attention. They need accurate and fast fault location ability, which is one of the difficulties in their application. At present, the mapping relationship between the performance and faults of a spacecraft control system is generally expressed in the form of failure mode and effects analysis (FMEA) or a fault tree. When the spacecraft control system behaves abnormally, ground experts locate the fault source by manual inquiry, so it is difficult to diagnose in real time, accurately locate complex faults, or visualize the fault diagnosis. The knowledge graph [1][2][3][4][5][6][7][8][9], as an effective tool for describing massive knowledge, knowledge attributes, and knowledge relationships, provides a new means for fault diagnosis. We can manually or automatically construct a knowledge graph of the correspondence between the performance and faults of the spacecraft control system by combining various model knowledge, expert knowledge, and data, which we call the spacecraft performance-fault relationship graph. However, there is no systematic and comprehensive spacecraft fault knowledge base in the aerospace field, so it is difficult to construct a large-scale spacecraft performance-fault relationship graph for spacecraft fault diagnosis. Taking the spacecraft control system as an example, its working environment is harsh. Because of its closed-loop characteristics, few fault samples accumulate during long-term in-orbit operation, and the fault mechanisms cannot be traversed. Meanwhile, its structure is complex, and its components are closely related. Therefore, it is difficult to establish an accurate and complete performance-fault relationship graph of a spacecraft control system, which affects the accuracy of fault diagnosis results. It is therefore necessary to use relational reasoning to predict missing relationships and complete the graph.
The commonly used relationship reasoning methods at present include representation learning reasoning, path reasoning, reinforcement learning reasoning, and graph convolution network reasoning. However, the spacecraft performance-fault relationship graph has complex relationships, various entities, and rich physical meanings. These methods are difficult to directly apply to a spacecraft performance-fault relationship graph.
The TransE [10] model is a classic model of representation learning reasoning. It maps entities and relationships into a continuous vector space and operates on the entity vectors to predict relationships. However, this method is only suitable for one-to-one correspondences between entities and is not suitable for spacecraft performance-fault relationship graphs with complex and diverse relationships. A typical path reasoning model is the Path Ranking Algorithm (PRA) [11], which extracts path feature values from the knowledge graph and judges relationships according to path features. COR-PRA (Constant and Reversed Path Ranking Algorithm) [12] adds a constant path search mechanism on the basis of PRA, which makes relationship reasoning more comprehensive. However, path reasoning pays too much attention to graph structure information, ignoring the semantic attributes of entities and relationships, and its reasoning accuracy is not high. Reference [13] first applied reinforcement learning to relational reasoning and put forward DeepPath, a reinforcement learning framework for learning multi-hop relational paths. Based on this, reference [14] added long short-term memory networks and a graph attention mechanism as memory components. Reference [15] presented the agent MINERVA, which has the built-in flexibility to take paths of variable length and learns to perform query answering by walking on a knowledge graph conditioned on an input query. References [16][17][18][19] improved reward functions and strategies for different datasets based on reinforcement learning reasoning. However, reinforcement learning reasoning requires carefully designed reward and value functions. For the spacecraft performance-fault relationship graph, with its various states and complex physical meanings, it is difficult to set an accurate reward function and value function that achieve a good path-finding effect.
The graph convolution network [19][20][21][22][23][24][25][26][27][28][29] directly aggregates the adjacent nodes of the graph, which can better reflect the structural relationship of graph data and is more suitable for processing directed graph and large graph data. However, its model is complex and lacks interpretability, so it is not suitable for the complicated spacecraft performance-fault relationship graph.
To solve the above problems, we improve the DDPG algorithm [30] by combining representation learning reasoning with deep reinforcement learning reasoning. We propose a Semantic relation and Position Deep Deterministic Policy Gradient reasoning model (SPDDPG) and apply it to the spacecraft performance-fault relationship graph to realize relational reasoning. First, this model extracts the semantic and location features of the embedded vectors of entities and relations. The high-dimensional semantic embedded vectors of entities and relations are obtained by using the TransE model. However, their excessively high dimension and scattered features would affect the reasoning accuracy. To solve this problem, Principal Component Analysis (PCA) [31] is adopted to reduce the dimension and obtain the key semantic feature vectors. Meanwhile, according to the position of the entity in the spacecraft performance-fault relationship graph, the position vector is obtained by Boolean vector conversion. Then, we splice the semantic vector and the position vector into the embedded vector of the entity and input it into DDPG for processing. DDPG takes the spacecraft performance-fault relationship graph as the environment, in which we set different corresponding states and rewards for the complex entity types in the spacecraft performance-fault relationship graph. The strategy network of DDPG outputs the optimal action vector according to the embedded vector of the current state, and the value network calculates the action value according to the state vector and the action vector; the network parameters are then updated through backpropagation. Both the strategy network and the value network are deep neural networks, which can fit complex strategy and value functions. Thus, they are suitable for the spacecraft performance-fault relationship graph with its intricate relationships and complex physical meanings.
In short, this model can achieve accurate reasoning of the spacecraft performance-fault relationship graph due to two innovations: 1. It extracts the semantic information and position information from the spacecraft performance-fault relationship graph. 2. It has a reasonable network and state setting to realize fast optimal action prediction.

Fault Diagnosis Process Based on Spacecraft Performance-Fault Relationship Graph
The fault diagnosis process based on the spacecraft performance-fault relationship graph is divided into two parts. First, data processing and feature extraction are carried out on spacecraft on-orbit data, and the obtained results are matched with fault symptom entities in the performance-fault relationship graph. The fault symptom entity corresponding to the current on-orbit data is then determined. Second, in the performance-fault relationship graph, reasoning starts from the fault symptom, and the fault cause entity or fault mode entity corresponding to the current fault symptom is obtained through an entity-relationship-entity triple search. There may be many fault causes or modes, so the final fault diagnosis result should be determined by fusing the fault occurrence probability with expert experience. As a tool for fault diagnosis, the spacecraft performance-fault relationship graph must be complete and accurate. We propose a relational reasoning model, SPDDPG, for the performance-fault relationship graph and use it to complete the graph, which provides reliable and effective support for subsequent fault diagnosis.

SPDDPG Model and Process
The steps of relationship reasoning with the SPDDPG model are as follows: 1. Build the SPDDPG framework based on the spacecraft performance-fault relationship graph and transform the components in the relationship graph into the basic elements of the SPDDPG framework. 2. Take the entity in the performance-fault relationship graph as the current state and select the optimal action by using the strategy network, which we call the actor network, to obtain the entity corresponding to the next state. 3. Use the value network, which we call the critic network, to fit the action value function and update the parameters of the strategy network and the value network. The reasoning process of this model is shown in Figure 1.

Figure 1. The reasoning process of SPDDPG.

Construct SPDDPG Framework
We use a deep reinforcement learning algorithm for relational reasoning, so we need to construct the reinforcement learning framework first. We configure the basic elements of the framework, including environment, state, action, and reward, according to the performance-fault relationship graph of the spacecraft control system.

Environment. We use the performance-fault relationship graph of the spacecraft control system as the environment that interacts with the agent. Because the performance-fault relationship graph is difficult for agents to calculate and process directly, we transform it into an n × n (where n is the number of entities) environment matrix E. The u relationship types between entities are numbered in sequence. If there is a relationship q ∈ {1, 2, …, u} between entity i and entity j, then E_ij is set to q; otherwise, it is set to 0. The resulting environment matrix E makes the content of the performance-fault relationship graph convenient for a computer to calculate and process.
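As a minimal sketch of this construction, the environment matrix E can be built from a triple list; the entity and relation names below are illustrative stand-ins, not the paper's actual graph:

```python
import numpy as np

# Hypothetical sketch: build the n x n environment matrix E from triples.
entities = ["omega_1", "h_x", "phi", "phi_h"]            # n = 4 entities
relations = ["proportional_eq", "derivative_eq", "add"]  # u = 3 relation types

ent_idx = {e: i for i, e in enumerate(entities)}
# Relations are numbered 1..u so that 0 can mean "no relation".
rel_idx = {r: q for q, r in enumerate(relations, start=1)}

triples = [("omega_1", "proportional_eq", "h_x"),
           ("h_x", "derivative_eq", "phi"),
           ("phi", "add", "phi_h")]

n = len(entities)
E = np.zeros((n, n), dtype=int)
for head, rel, tail in triples:
    # E_ij holds the relation number q between entity i and entity j.
    E[ent_idx[head], ent_idx[tail]] = rel_idx[rel]

print(E)
```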

State. The relationship paths of the performance-fault relationship graph are complex and involve many states, so it is difficult to achieve ideal experimental results by setting only the normal state and the termination state. We divide the states into the normal state, the termination state, and the imminent termination state. The imminent termination state refers to a state first-order adjacent to the termination state; when the agent reaches the termination state, it completes the path finding of relational reasoning. Our innovative setting of the imminent termination state helps the agent find paths toward the termination state quickly. The entities in the performance-fault relationship graph are taken as the states. However, they need to be converted into a vector form convenient for computer identification and calculation, which is called the state vector in this paper. The state vector includes two parts: semantic features and location features.
The first part is semantic feature extraction. We use the TransE model to process triples in the performance-fault relationship graph, extracting the semantic high-dimensional vectors of entities and relationships. Then, we use PCA to reduce the dimension and extract the principal components of the semantic features of entities and relationships, which we call the semantic vector.
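The PCA step can be sketched with a plain SVD-based projection; here random data stands in for the 100-dimensional TransE output, and the dimensions are illustrative:

```python
import numpy as np

# Random stand-in for TransE output: 40 entity/relation embeddings, 100 dims.
rng = np.random.default_rng(0)
high_dim = rng.normal(size=(40, 100))

def pca_reduce(X, k):
    """Project X onto its first k principal components."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

semantic_vectors = pca_reduce(high_dim, k=20)
print(semantic_vectors.shape)  # (40, 20)
```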
The second part is location feature extraction. We arrange all entities in the order (e_1, e_2, …, e_n) and adopt Boolean coding to obtain the position vectors of all entities. Specifically, we define an n-dimensional zero vector as the position vector of every entity e_i. If entity e_i has a relationship with entity e_j, the jth element of the vector is set to 1. The position vector contains the global position characteristics of the entity in the performance-fault relationship graph.
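Under the assumption that connectivity is read from the environment matrix E described above, the position vector and the spliced state vector can be sketched as follows (the small E and the 20-dimensional semantic stand-in are illustrative):

```python
import numpy as np

# Illustrative environment matrix (nonzero entries are relation numbers).
E = np.array([[0, 1, 0, 0],
              [0, 0, 2, 0],
              [0, 0, 0, 3],
              [0, 0, 0, 0]])

def position_vector(E, i):
    """n-dim Boolean vector: 1 where entity e_i is connected to e_j."""
    return (E[i] != 0).astype(float)

def state_vector(semantic_vec, E, i):
    """Splice semantic features and position features into one state vector."""
    return np.concatenate([semantic_vec, position_vector(E, i)])

sem = np.ones(20)               # stand-in 20-dim semantic vector
s = state_vector(sem, E, 0)
print(s.shape)                  # 20 semantic dims + 4 position dims
```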
Action. In the case of entities as states, the relationships between entities act as the action to connect states. The agent outputs the predicted action vector according to the current state to choose the action to reach the next state. The action vector is represented by the relational semantic vector obtained from the TransE model.
Reward. As the basis for judging the current state, the reward should be set according to the path distance and path type between the current state and the termination state. Because we use the critic network to fit the complex action value function, the setting of the state reward can follow a simple and effective principle. The normal state occupies most of the relational reasoning path and should not be weighted too heavily, so its reward is set to 0. The imminent termination state should guide the agent toward the terminal state, so its reward is set to 1, and the termination state reward is set to 2. The formula is as follows:

R_t = 2, if S_t = END;  R_t = 1, if S_t ∈ N_1(END);  R_t = 0, otherwise,

where t is the number of steps of agent path finding, S_t is the current state, R_t is the reward corresponding to the current state, END is the termination state, and N_1(END) is the first-order neighborhood of the termination state. The distinctive reward of the terminal state helps the agent approach the terminal state quickly among many complicated paths and enhances the training effect.
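The three-level reward can be sketched directly from this definition; the edge list below is an illustrative stand-in for the relationship graph:

```python
# Sketch of the three-level reward, assuming the graph is given as an edge
# list so that the first-order neighborhood of END can be computed.
def first_order_neighborhood(edges, end):
    """All entities one hop away from the end state (either direction)."""
    return {h for h, t in edges if t == end} | {t for h, t in edges if h == end}

def reward(state, end, edges):
    if state == end:
        return 2            # termination state
    if state in first_order_neighborhood(edges, end):
        return 1            # imminent termination state
    return 0                # normal state

edges = [("a", "b"), ("b", "c"), ("c", "end")]
print([reward(s, "end", edges) for s in ("a", "b", "c", "end")])  # [0, 0, 1, 2]
```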

DDPG Model
The DDPG model consists of two kinds of neural networks, the actor network and the critic network. The actor network is responsible for predicting actions according to the state, and the critic network is responsible for predicting action values according to the state and actions. Each type of network has its own current network and target network. The current network calculates the estimated value according to the current state, while the target network calculates the target value according to the next state. The parameters of the neural networks are updated according to the estimated value and the target value. Next, we elaborate action selection based on the actor network and parameter update based on the critic network in detail. Figure 2 shows the structure of the DDPG model.

Figure 2. The structure of the DDPG model.


Action Selection Based on Actor Network
We use the semantic relation and position multilayer perceptron (SPMLP) model as the actor network. We take the state vector as the input of the SPMLP model and output the action prediction vector. The model structure is shown in Figure 3. We take the entity vector X and reward R corresponding to the current state S. The entity vector X is used as the input of the SPMLP input layer. Each element of X is multiplied by the weight θ_1 of the first hidden layer and summed; after adding the offset b_1, the output h_1 of this layer is obtained through the sigmoid activation function f_1. The formula is as follows:

h_1 = f_1(θ_1 X + b_1)    (1)

Taking the output of each hidden layer as the input of the next hidden layer and repeating Formula (1), the output of the last hidden layer is h_t, with weight θ_y and offset b_y. The activation function f_o of the output layer is the softmax function. The final output action prediction vector A is as follows:

A = f_o(θ_y h_t + b_y)    (2)

Each element position of the prediction vector A corresponds to a relationship in the performance-fault relationship graph. The element A_i with the highest probability in A is the selected optimal action a. The next state S′ corresponding to the optimal action is found from the environment matrix E and taken as the current state, and the above steps are repeated until the end state is reached.
Finally, the agent outputs the relationship prediction path.
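The forward pass just described can be sketched in plain numpy; the layer sizes follow the experimental settings reported later (42 → 80 → 80 → 80 → 7), and the random weights are placeholders for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

# 42-dim state vector in, three sigmoid hidden layers, 7-dim softmax out.
sizes = [42, 80, 80, 80, 7]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def actor_forward(x):
    h = x
    for theta, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(theta @ h + b)                  # Formula (1), repeated
    return softmax(weights[-1] @ h + biases[-1])    # Formula (2)

A = actor_forward(rng.normal(size=42))
a = int(np.argmax(A))   # index of the selected optimal action
print(A.shape, a)
```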

Parameters Updating Based on Value Network
The multilayer perceptron (MLP) model is adopted as the critic network, which takes the action vector and state vector as input and outputs the action value. The hidden layer structure of the model is the same as that of SPMLP, and the calculation process is similar to that in the previous section. The differences are that the action value output by the critic network is a real number, the activation function of the output layer is the ReLU function, and the network parameters are denoted w.
We set up the actor target network and the critic target network, whose structures are consistent with those of the actor network and the critic network. The actor target network selects the next optimal action a′ according to the next state S′, and its network parameter θ′ is periodically copied from θ. The critic target network calculates the action value Q′(S′, a′, w′) of the next state, and its network parameter w′ is copied from w regularly. Put the quintuple {S, A, R, S′, A′} of each cycle into the experience replay set D; take p samples {S_j, A_j, R_j, S′_j, A′_j} from D, where j = 1, 2, …, p; and calculate the current target Q value Q_target:

Q_target = R_j + γ Q′(S′_j, A′_j, w′)

where γ is the discount factor, γ ∈ (0, 1), and it is 0.9 in this paper.
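The replay sampling and target-value computation can be sketched as follows; the critic target network is replaced by a stub, and the buffer contents are illustrative:

```python
import random

gamma = 0.9

def critic_target(state, action):
    # Stub for Q'(S', A', w'); a trained target network would be evaluated here.
    return 1.0

replay_buffer = [
    # (S, A, R, S_next, A_next) quintuples
    ("s0", 2, 0.0, "s1", 3),
    ("s1", 3, 1.0, "s2", 0),
    ("s2", 0, 2.0, "END", None),
]

batch = random.sample(replay_buffer, 2)
q_targets = []
for s, a, r, s_next, a_next in batch:
    # Terminal transitions have no bootstrap term.
    bootstrap = 0.0 if a_next is None else gamma * critic_target(s_next, a_next)
    q_targets.append(r + bootstrap)
print(q_targets)
```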
Parameters w of the current value network are updated by gradient back propagation of the neural network, and MSE (Mean Square Error) is used as the loss function of the value network.
Parameters θ of the current strategy network are updated by gradient back propagation of the neural network, and the loss function of the strategy network is:

J(θ) = −(1/p) Σ_{j=1}^{p} Q(S_j, A_j, w)

Set the update frequency c and soft-update the parameters of the actor target network and the critic target network every c cycles:

θ′ ← τθ + (1 − τ)θ′
w′ ← τw + (1 − τ)w′

where τ is the renewal coefficient, which is generally small and is 0.1 in this paper.
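The critic loss and the soft update can be sketched directly from these formulas; the small parameter arrays stand in for the real network weights:

```python
import numpy as np

def mse_loss(q_estimate, q_target):
    """Critic loss: mean square error between estimated and target Q values."""
    return np.mean((q_estimate - q_target) ** 2)

def soft_update(target_params, current_params, tau=0.1):
    """theta' <- tau * theta + (1 - tau) * theta', applied every c cycles."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, current_params)]

current = [np.ones((2, 2))]
target = [np.zeros((2, 2))]
target = soft_update(target, current, tau=0.1)
print(target[0])   # every element moves 10% of the way toward the current value
```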

Performance-Fault Relationship Graph of Spacecraft Control System
We apply the SPDDPG algorithm to the performance-fault relationship graph of the spacecraft control system that we constructed. First, based on the models of each part of the spacecraft control system, the attitude dynamics equation, and the attitude kinematics equation, we construct the performance-fault relationship graph of the spacecraft control system. Taking the system composed of three orthogonal gyroscopes, rolling-axis infrared earth sensors, and three orthogonal momentum wheels as the object, the attitude dynamics equation of the spacecraft is as follows: where I_x, I_y, and I_z are the three-axis moments of inertia; ω_x, ω_y, and ω_z are the components of the spacecraft's angular velocity along the principal inertia axes; and h_x, h_y, and h_z are the three-axis angular momenta of the momentum wheels.
The attitude kinematics equation of the spacecraft is as follows: where ϕ, θ, and ψ are the Euler angles, and ω_0 is the orbital angular velocity of the spacecraft rotating around the central gravitational body.

Infrared earth sensor model: where ϕ_h is the roll-axis earth sensor measurement output, b_ϕ is the roll-axis earth sensor constant error, N_ϕh is the roll-axis earth sensor measurement noise, f_ϕ is the roll-axis earth sensor fault, θ_h is the pitch-axis earth sensor measurement output, b_θ is the pitch-axis earth sensor constant error, N_θh is the pitch-axis earth sensor measurement noise, and f_θ is the pitch-axis earth sensor fault.

Gyro measurement model: where g_i is the gyro measurement output, d_i is the gyro exponential drift term, b_i is the gyro constant drift term, f_g,i is the gyro fault, and i = 1, 2, 3.
Momentum wheel model: where J_1, J_2, and J_3 are the rotary inertias of the momentum wheels; ω_1, ω_2, and ω_3 are the rotation speeds of the momentum wheels; u_x, u_y, and u_z are the expected output torques of the momentum wheels; and f_ω,1, f_ω,2, and f_ω,3 are the momentum wheel faults.
Based on the models of each part of the spacecraft control system, the entities and relationships of the performance-fault relationship graph are extracted according to the following principles:

1. For each part of the spacecraft control system model, the known quantities and unknown quantities are regarded as "known quantity entities" and "unknown quantity entities", respectively.
2. The purpose of this experiment is to infer whether relationships exist among variables, so errors, noises, constant parameters, and faults are not considered when constructing the graph.
3. When there are multiple variables on one side of an equation, an "AND entity" is added to represent them.
4. The relationships between variables in the equations are divided into proportional equivalence, derivative equivalence, equivalence, addition, subtraction, and multiplication.
According to the above principles, the extracted entities are shown in Table 1.

Table 1. Triple set of the spacecraft control system model.


Experimental Results of SPDDPG
We apply SPDDPG to the performance-fault relationship graph of the spacecraft control system for relational reasoning and relational completion, with parameters set as follows. The output vector dimension of the TransE model is 100, and the dimension of the semantic vector after PCA dimension reduction is set to 20. The input of the actor network is a 42-dimensional state vector, and the output is a 7-dimensional predicted action vector. The numbers of neurons in the three hidden layers are set to (80, 80, 80), and the learning rate is lr = 0.0005. The input of the critic network is a 40-dimensional vector, the output is a one-dimensional action value, the single hidden layer contains 60 neurons, and the learning rate is lr = 0.0005. The number of training iterations is set to 60,000, and each training iteration performs 100 random path findings. The longest relation path of the control system performance-fault relationship graph is 7, so training is considered complete when the iteration limit is reached or the average number of path-finding steps per hundred trainings is within seven. In practice, training reaches the termination condition at 54,200 iterations, and the average path-finding step curve of the agent is shown in Figure 5.

We take the momentum wheel speed ω_1 as an example: it is taken as the initial state, and the infrared earth sensor measurement output ϕ_h is taken as the final state. The agent finds the path in 5 steps, which costs 0.08 s, and the path-finding process can be visualized step by step.
The computer directly predicts the existence of the relationship between entities, and manually determines the path of entities that are difficult for the computer to directly determine. We use SPPDPG to discover 51 new relationships, and the completed spacecraft performance-fault relationship graph is shown in Figure 8.


Experimental Results of Various Models
To further verify the advantages of SPDDPG, we compare the performance of SPDDPG with other published models. Due to the practical application requirements, a relationship reasoning model applied to the spacecraft performance-fault relationship graph not only needs to predict the existence of relationships between entities, but must also output a visual relationship path between entities for manual judgment. Therefore, we focus on the reinforcement learning methods that meet these requirements, DeepPath [13] and MINERVA [15], and compare the experimental results in terms of path-finding accuracy and average path-finding steps. In order to verify the improvement of the DDPG algorithm, we add two DDPG models, DDPG(TransE) and DDPG(State). DDPG(TransE) only obtains the embedded vectors of entities and relationships through TransE, and DDPG(State) only sets the normal state and the termination state; the other parts of the two models are the same as SPDDPG. The experimental results are shown in Table 2, and the data are the best results obtained after many experiments. The experimental results show that the SPDDPG algorithm has a good inference effect on the performance-fault relationship graph of the spacecraft control system. After the target entity is given, the accuracy of the agent finding the relationship path between entities reaches 100%, and the multi-hop relationship path between entities can be successfully obtained. In the performance-fault relationship graph, the longest relationship path is 7 steps, and the minimum average path-finding step number of the agent is 6.91, which keeps the number of path-finding steps low and gives high reasoning efficiency. In addition, compared with the experimental results of the two unimproved DDPG models, two conclusions can be drawn:

1. The embedded vector extracted from semantic information and location information contains more distinctive entity features and improves the accuracy of relational reasoning.
2. Adding the imminent termination state to the normal state and the termination state makes the purpose of the agent's path finding clearer, reduces the path-finding steps, and improves the efficiency of relational reasoning.

Therefore, this algorithm provides an interpretable reasoning means for completing the relationships of the spacecraft performance-fault relationship graph and has application significance.

Conclusions
The spacecraft performance-fault relationship graph has complex physical meanings, numerous entities, and coupled relationships. We propose a relationship reasoning model, SPDDPG, to solve these problems, which combines a representation learning model and a deep reinforcement learning model. Its advantages are as follows: (1) The representation learning model is used to extract the semantic features of entities and relationships, and the global location features of entities are obtained through Boolean vector conversion. PCA is used to reduce the dimensions of the entity vectors, retaining the principal features of entities and avoiding overfitting. This helps overcome the difficulty of distinguishing numerous entities, which is beneficial to the training of the neural networks. (2) The actor network replaces the traditional action selection strategy, and the critic network fits the complex and uncertain value function. The deep neural networks can distinguish the complex physical meanings of the spacecraft performance-fault relationship graph and improve the efficiency of relationship reasoning.
We apply SPDDPG to the performance-fault relationship graph of the spacecraft control system, and the experimental results show that the algorithm has a good relationship reasoning effect on a system-level spacecraft knowledge graph. In future work, we will build a relationship prediction model based on SPDDPG's relationship reasoning paths, so that it has both predictive ability and interpretability. We will also apply the model to larger spacecraft performance-fault relationship graphs and improve its reasoning efficiency.