A Satellite Incipient Fault Detection Method Based on Decomposed Kullback–Leibler Divergence

Detection of faults at the incipient stage is critical to improving the availability and continuity of satellite services. The application of a local optimum projection vector and the Kullback–Leibler (KL) divergence can improve the detection rate of incipient faults. However, this suffers from the problem of high time complexity. We propose decomposing the KL divergence in the original optimization model and applying the property of the generalized Rayleigh quotient to reduce time complexity. Additionally, we establish two distribution models for subfunctions F1(w) and F3(w) to detect the slight anomalous behavior of the mean and covariance. The effectiveness of the proposed method was verified through a numerical simulation case and a real satellite fault case. The results demonstrate the advantages of low computational complexity and high sensitivity to incipient faults.


Introduction
Due to the vigorous development of the space industry, the number of satellites in orbit has increased to meet various needs, such as navigation [1], communication [2], meteorology [3], and earth observation [4]. However, satellites face the risk of abnormalities or experience failure because of high-energy particles in space, electrostatic discharge, and cycle temperature [5][6][7]. Because serious faults may occur due to the continuous deterioration of incipient faults [8], timely and accurate detection of incipient faults can reserve sufficient processing time for satellite operation and maintenance system, which is of great significance to guarantee the availability and continuity of satellite services [9].
During the past three decades, the problem of satellite fault detection has been extensively studied in various studies [10][11][12][13]. In traditional satellite fault detection methods, such as threshold-based methods [14,15] and model-based methods [16,17], the thresholds or the models required for fault detection necessitate manual setting. Therefore, the performance of these fault detection methods heavily relies on the experience of experts [18]. In recent years, data-driven fault detection methods have eliminated this heavy dependence on expert experience and become a popular research field [19][20][21][22]. These methods establish normal models based on satellite normal historical data, and then compare the online data with the normal models to assess whether the online data is faulty. However, the methods proposed in the existing literature are mainly applied to serious faults, and an extremely small amount of research and application relates to incipient faults of satellites. The amplitudes of incipient faults are small compared to system signals, usually ranging from 1% to 10% [23], which are easily masked by normal system variations [24]. Therefore, satellite incipient fault detection is a daunting task [25].

1.
We analyzed the necessity and feasibility of decomposing the KL divergence in the optimization model.
The effectiveness of the proposed method was verified through a numerical case and a real satellite fault case.
This paper is organized as follows. The generalized Rayleigh quotient (GRQ) and original optimization model are introduced in Section 2. The fault detection method based on the decomposed KL divergence is presented in detail in Section 3. In Section 4, the proposed method is illustrated and analyzed through two cases. Finally, conclusions are given in Section 5.

Preliminary
In this section, we introduce the definition and property of the generalized Rayleigh quotient and note the problem of original optimization model.

Generalized Rayleigh Quotient (GRQ)
The GRQ is defined as follows [31]: where x is a non-zero vector, A is a symmetric matrix, and B is a positive definite symmetric matrix. The GRQ has a critical property that the maximum value of R(A, B, x) is equal to the maximum eigenvalue of matrix B −1 A [32]; that is, R(A, B, x) ≤ λ max , where λ max is the maximum eigenvalue of the matrix B −1 A. In addition, the optimum vector x which maximizes R(A, B, x) is the eigenvector corresponding to the maximum eigenvalue [32]. The sum of two GRQs is defined as follows [33]: where x is a non-zero vector, both A 1 and A 2 are symmetric matrices, and both B 1 and B 2 are positive definite symmetric matrices. Because iteration is not required, the maximum value of a single GRQ can be quickly obtained by directly applying the property of the GRQ. Regarding the maximum value of the sum of two GRQs, according to Reference [33], the time complexity of maximizing the sum of two GRQs is NP-hard. Prominently, accurate algorithms cannot solve large instances of such a problem, and approximate algorithms are necessary.

Original Optimization Model
Under the assumption that the data obey a multidimensional Gaussian distribution, and using the KL divergence to detect incipient faults, the problem of finding the optimum projection vector (PV) is modeled as follows [30]: In Equations (3) and (4), w is a PV, h(w) is the KL divergence of the projections of normal historical data X and online data Y. Both the normal historical data and the online data obey m dimensional joint Gaussian distributions, can be expressed as the sum of two GRQs, as shown in Equation (5): where According to the property of the covariance matrix, both Σ x and Σ y in Equation (5) are non-negative symmetric matrices. This paper considers only the case that both of the matrices are positive definite symmetric matrices to satisfy the condition of the GRQ. If the influence of the coefficient 0.5 and the constant −1 is ignored, Equation (3) can be equally expressed as the maximization of the sum of two GRQs: As stated in Section 2.1, the time complexity of solving the optimization problem in Equation (6) is NP-hard. Similarly, the optimization problem in Equation (3) is NP-hard. In Reference [30], a ready-made optimization solution tool (the fmincon function in MATLAB) is used to solve the optimization problem. However, this method can only obtain the local optimum PV, rather than the global optimum PV. Additionally, with the gradual increase in the number of variables to be monitored, the time complexity of iteration becomes more prominent. Therefore, this study aimed to determine an approximate algorithm with lower time complexity.

Incipient Fault-Detection Method Based on Decomposed KL Divergence
In this section, we propose the idea of decomposing the KL divergence and built two distribution models to detect incipient faults.

Decomposed KL Divergence
As stated in Section 2.1, the maximum value of a single GRQ can be quickly obtained by applying the property of the GRQ. Therefore, this paper attempts to decompose h(w) to reduce time complexity. Specifically, we attempt to decompose h(w) into the sum of multiple GRQs, and then calculate the maximum value and the optimum PV of each GRQ. Under the guidance of this idea, the KL divergence h(w) can be decomposed into the sum of four GRQs, as expressed in Equations (8)- (11): where F 1 (w), F 2 (w), F 3 (w), and F 4 (w) are collectively referred to as the subfunctions of h(w). In Equations (8)- (11), both Σ x and Σ y are positive definite symmetric matrices, so that each subfunction of h(w) satisfies the form of the GRQ. Therefore, we can obtain the maximum value and optimum PV of each subfunction using the property of the GRQ. Clearly, the maximization of each subfunction may not maximize the original function. For instance, we can find a PV w 1 that maximizes F 1 (w), but w 1 does not necessarily maximize h(w). In this case, what is the point of decomposing h(w)? According to reference [30], the ultimate goal of maximizing h(w) is to determine the PV w that is most sensitive to the incipient fault; that is, our ultimate goal is to detect the incipient fault. From the aspect of fault detection, although the PV obtained by maximizing the subfunction may not be optimal for the original function, the PV has its own value if it can detect the fault and be obtained in a fast manner.
Which subfunctions of h(w) are effective and can be solved quickly? After analysis, two subfunctions F 1 (w) and F 3 (w) are selected. According to Equation (8) and the property of GRQ, the maximum value of F 1 (w) is the maximum eigenvalue of matrix Σ −1 x Σ y . Furthermore, the optimum PV of F 1 (w) is the eigenvector corresponding to the maximum eigenvalue. Similarly, According to Equation (10) and the property of the GRQ, the optimum PV of F 3 (w) is the eigenvector corresponding to the maximum eigenvalue of matrix Σ −1 x ∆µ∆µ T . Let the optimum PVs of F 1 (w) and F 3 (w) be w F1 and w F3 , respectively.

Construction of Fault Detection Models
The optimum PVs w F1 and w F3 only provide two optimal perspectives of observation which the on-line data and the normal historical data are the easiest to distinguish that can be most easily distinguished by the online data and the normal historical data. We still lack some measurement indices to test whether a fault has occurred in the on-line data Y. This section uses F 1 (w) and F 3 (w) as the deviation measurement indices. Due to noise, both F 1 (w) and F 3 (w) fluctuate in their normal ranges when there is no fault in Y. However, F 1 (w) or F 3 (w) are outside of the normal ranges when a fault occurs in Y.
The normal ranges of F 1 (w) or F 3 (w) are the key to fault detection. To obtain them, we assume that the normal historical data X and the online data Y obey two m-dimensional joint Gaussian distributions, X ∼ N(µ x , Σ x ) and Y ∼ N µ y , Σ y , respectively. Denote the projections of X and Y onto the vector w F1 as p F1 and q F1 , respectively. According to the property of m-dimensional joint Gaussian distribution, p F1 and q F1 obey one-dimensional Gaussian distributions p F1 ∼ N w F1 T µ x , w F1 T Σ x w F1 and q F1 ∼ N w F1 T µ y , w F1 T Σ y w F1 , respectively [34]. The relationship of F 1 (w), w F1 , Σ x , and Σ y is presented in Equation (12): Denote the projections of X and Y onto the vector w F3 as p F3 and q F3 , respectively. Similarly, according to the property of m-dimensional joint Gaussian distribution, p F3 and q F3 obey one-dimensional Gaussian distributions p F3 ∼ N w F3 T µ x , w F3 T Σ x w F3 and q F3 ∼ N w F3 T µ y , w F3 T Σ y w F3 , respectively [34]. The relationship of F 3 (w), w F3 , Σ x and ∆µ is presented in Equation (13): Because the normal historical data X are obtained before fault detection, and the optimum PVs w F1 and w F3 are obtainable from Section 3.1, it can be considered that Σ x , µ x , w F1 , and w F3 in Equations (12) and (13) are known and invariable. Furthermore, the mean offset vector ∆µ and the covariance matrix Σ y related to Y are unknown and variable.
Because Σ x , w F1 , and w F3 are known, we can assume w T F1 Σ x w F1 = c F1 and w T F3 Σ x w F3 = c F3 , where both c F1 and c F3 are constants. Hence, Equations (14) and (15) can be obtained: To obtain the normal ranges of F 1 (w) or F 3 (w), it is supposed that the fault-free online data Y are obtained by sampling the joint Gaussian distribution obeyed by X. Because p F1 and q F1 are the projections of X and Y onto the vector w F1 , respectively, we can consider that q F1 is obtained by sampling the one-dimensional Gaussian distribution obeyed by p F1 . Similarly, we can consider that q F3 is obtained by sampling the one-dimensional Gaussian distribution obeyed by p F3 .
Assume that f obeys a one-dimensional Gaussian distribution N µ, σ 2 . Let g denote the sample set of f , µ denote the sample mean of g, S 2 denote the sample variance of g, and n 1 denote the sample number of g. Thus, µ satisfies [35]: Let f = p F1 and g = q F1 , then the variances of p F1 and q F1 are substituted into Equation (17). We can obtain: Because w T F1 Σ x w F1 = c F1 , we can obtain: Comparing Equation (14) with Equation (19), we can obtain: Therefore, the subfunction F 1 (w) multiplied by a constant n 1 − 1 obeys a chi-square distribution with n 1 − 1 degrees of freedom when there is no fault in Y. Let f = p F3 and g = q F3 , then the mean and variance of q F3 and the mean of p F3 are substituted into Equation (16). We can obtain: Because µ x , Σ x , and w F3 are all known, we can suppose w T F3 µ x = c 3 , where c 3 is a constant. According to the property of the one-dimensional Gaussian distribution, w T F3 µ y − c 3 still obeys the one-dimensional Gaussian distribution, as shown in Equation (22): Since ∆µ = µ y − µ x and w T F3 Σ x w F3 = c F3 , we can obtain: Normalize w T F3 ∆µ k and we can obtain: Furthermore, we can obtain Equation (25) from the relationship between the standard normal distribution and the chi-square distribution: Comparing Equation (15) with Equation (25), we can obtain: Therefore, the subfunction F 3 (w) multiplied by a constant n 1 obeys a chi-square distribution with one degree of freedom when there is no fault in Y.
In summary, (n 1 − 1)F 1 (w) and (n 1 )F 3 (w) obey chi-square distributions with n 1 − 1 and one degree of freedom, respectively. Thus, the chi-square test is applicable to verify whether a fault occurs in Y. Given a significance level α, the fault detection thresholds of (n 1 − 1)F 1 (w) and n 1 F 3 (w) are obtainable from the chi-square test. Denote the fault detection thresholds of (n 1 − 1)F 1 (w) and n 1 F 3 (w) as ε F1 and ε F3 , respectively. In this case, two fault detection models are established as follows: The reason for selecting the subfunctions F 1 (w) or F 3 (w) is the coverage of detectable faults. It can be seen from Equation (4) that h(w) is a function of w, Σ x , Σ y , and ∆µ. Because the normal historical data X and the PV w are determined, both w and Σ x are known, whereas ∆µ and ∆Σ, which are related to the online data, are unknown. Thus, h(w) is a function of ∆µ and ∆Σ.
Due to noise, both ∆µ and ∆Σ fluctuate within their normal ranges. However, ∆µ or ∆Σ are outside of the acceptable range when the online data is faulty. Because h(w) is a function of ∆µ and ∆Σ, the abnormal change in ∆µ or ∆Σ will further position h(w) outside of the acceptable range. Therefore, the abnormal change in ∆µ or ∆Σ can be detected by h(w). It can be seen from Equation (8) and ∆Σ = Σ y − Σ x that F 1 (w) is a function of ∆Σ; thus, the fault caused by the abnormal change in ∆Σ can be detected by F 1 (w). Similarly, the fault caused by the abnormal change in ∆µ can be detected by F 3 (w) from Equation (10). Therefore, the combination of F 1 (w) and F 3 (w) can cover the majority of faults that can be detected by h(w).
Why are the other two subfunctions F 2 (w) and F 4 (w) not chosen to detect faults? Comparing Equation (8) with Equation (10), F 1 (w) and F 2 (w) are reciprocal to each other. Therefore, we can detect the abnormal change in ∆Σ by taking either of them. The expressions of F 3 (w) and F 4 (w) differ only in the denominator. After experimental verification, the fault detection ability of F 4 (w) is similar to that of F 3 (w). Thus, only one of F 3 (w) and F 4 (w) needs to be selected to detect the abnormal change in ∆µ

Overall Fault Detection Process
We intend to use sliding windows to extract and monitor the online data in real time. Let the online data extracted by the kth sliding window be Y k . The pseudocode and the flow chart of the proposed method are shown as follows: 1.
Z-score normalization is performed for each parameter of the normal historical data X, and X is obtained.

2.
The online data Y k are extracted by a sliding window with the length of n 1 .

3.
The on-line data Y k are normalized by Z-score to obtain Y k . 4.
Two optimum PVs w F1 and w F3 between X and Y k are obtained by using the property of the GRQ, as stated in Section 3.1.

5.
Two fault detection thresholds ε F1 and ε F3 are set by using the chi-square test with a significance level α. 6.
Equations (12) and (13) are used to calculate the actual values F 1 (w) and F 3 (w) of X and Y k . 7.
The potential existence of a fault in Y k is tested according to Equations (27) and (28). If at least one of two fault detection models detect fault, the online data Y k can be considered to be faulty. Otherwise, Y k is normal. Let k = k + 1; the online data of the next sliding window Y k is tested from steps 2 to 7.
As can be seen from Figure 1, for each sliding window Y k , we can use the property of the GRQ to obtain the optimum PVs w F1 and w F3 between X and Y k . Because the online data Y k may vary from different windows, w F1 and w F3 may not be the same for each window; that is, the optimum PVs adjust the online data in real time, which makes the proposed method more adaptable to potential faults.
We suppose that the system model includes n monitored variables and the length of sliding windows is n 1 . The computation cost of Z-score normalization for Y k is O(nn 1 ). The computation cost of obtaining the mean vector and the covariance matrix of Y k is O(n 1 ) and O n 2 n 1 , respectively. The computation cost of obtaining the inverse matrix of Σ x is O n 3 . The computation cost of obtaining The computation cost of obtaining both the maximum eigenvalue and the eigenvector of Σ −1 x Σ y and Σ −1 Combining all the computation cost parts above, we can get the overall computation cost of obtaining two optimum projection vectors for each window as O n 3 . Step 2 Step 1 Step 3 Step 4 Step 4 Step 5 Step 5 Step 6,7 We suppose that the system model includes n monitored variables and the length of sliding windows is 1 n . The computation cost of Z-score normalization for k Y is O n n , respectively. The computation cost of obtaining the inverse

Results and Analysis
In this section, we use a numerical case and a real satellite fault case to assess the effectiveness of the proposed method.

Numerical Case
In this subsection, a numerical simulation case, which includes three incipient faults, is provided to verify the correctness and effectiveness of the proposed method. The system model is as shown in Equation (29) In Equation (29),    

Results and Analysis
In this section, we use a numerical case and a real satellite fault case to assess the effectiveness of the proposed method.

Numerical Case
In this subsection, a numerical simulation case, which includes three incipient faults, is provided to verify the correctness and effectiveness of the proposed method. The system model is as shown in Equation (29): In Equation (29) Both the length and interval of sliding windows were 300 for all data in the experiment. A total of 200 windows were obtained from the online data after using sliding windows. The first 100 of the 200 windows were normal windows, whereas the last 100 were fault windows. The default signal-to-noise ratio (SNR) was set as 20 dB [30]. The simulation hardware platform was a desktop computer (CPU: Intel core i5 − 10400, RAM: DDR4/2666/16G) and the software was MATLAB 2019b.
The compared fault-detection methods included using PCA and the T 2 statistic [36] (PCA + T 2 ), PCA and the squared prediction error statistic [36] (PCA + SPE), PCA and the KL divergence [23] (PCA + KLD), and the method based on the local optimum PV and the KL divergence [30] (LOPVKLD). Because of the poor effect of directly monitoring the original variables, the methods of PCA + T 2 and PCA + SPE in this experiment monitored the means and variances of the original variables. The principal subspace was selected with a cumulative variance contribution of more than 90%. The confidence levels for the PCA + T 2 method and the PCA + SPE method were both set at 0.95. The significance levels for the PCA + KLD method and the LOPVKLD method were 0.05 and 0.01, respectively. The significance levels of the subfunctions F 1 (w) and F 3 (w) proposed in this paper were 0.0005 and 0.001, respectively. Three evaluation indexes-fault detection rate (FDR), false alarm rate (FAR), and the time consumption of finding the optimum PV for each window (time consumption)-were chosen as the indexes for evaluating the fault detection results. For the purpose of conciseness, only the fault detection result of the PCA + KLD method of the principal component that was most sensitive to the fault is presented, whereas the other, relatively poor results are not displayed.
The detection results of five fault-detection methods for the incipient fault f 1 are shown in Figure 2. As can be seen from Figure 2, both the PCA + T 2 method and the PCA + SPE method failed to detect f 1 because most of the fault windows were still within the detection threshold. Conversely, both the PCA + KLD method and the LOPVKLD method successfully detected f 1 . As stated in Section 3.2, the subfunctions F 1 (w) and F 3 (w) can detect the fault that causes the abnormal change in ∆Σ and ∆µ, respectively. Because f 1 is the offset fault that can cause the abnormal change in ∆µ, the fault f 1 can be successfully detected by the subfunction F 3 (w) rather than the subfunction F 1 (w).   The detection results of five fault-detection methods for the incipient fault f 2 are presented in Figure 3. As shown, the PCA + SPE method still fails to detect f 2 . Both the PCA + T 2 method and the PCA + KLD method have relatively poor detection results for f 2 . Due to the application of the local optimum PV, the LOPVKLD method has a better detection result for f 2 . Because f 2 is the gain fault which can cause the abnormal change in ∆Σ, f 2 can be successfully detected by the subfunction F 1 (w) rather than the subfunction F 3 (w). Entropy 2021, 23, x FOR PEER REVIEW 12 of 22  As can be seen from Figure 4, three fault-detection methods-PCA + T 2 , PCA + SPE and PCA + KLD-are ineffective in detecting the fault f 3 , because most of the result values of these methods are still under the detection threshold. It can be seen from Figure 3d,f that the LOPVKLD method and the subfunction F 1 (w) are effective at detecting f 3 . As f 3 is the gain fault, the subfunction F 3 (w) fails to detect f 3 .
Considering the randomness of the signal sources and the noise sources in the numerical case, we simulated the three incipient faults 100 times and then derived the average of the fault detection results, as presented in Table 1.
It can be seen from Reference [30] that the PCA + T 2 and the PCA + SPE methods are ineffective in detecting incipient faults when the original variables are monitored. As can be seen from Table 1, the fault detection rates of these two methods increase, particularly the fault detection rate for f 2 . The reason for the improvement in these two methods is that the extraction of the means and variances of the variables can be considered as smoothing the variables. Although the means and variances of the variables are monitored, the detection results of these two methods are inferior to those of the PCA + KLD method. Due to the usage of constant PVs, the PCA + KLD method is effective at detecting f 1 and f 2 , but has poor detection results for f 3 .
Because of the application of the local optimum PV, the LOPVKLD method is sensitive to all three incipient faults. However, as stated in Section 2.2, the LOPVKLD method has the disadvantage of high computation complexity. As can be seen from Table 1, the LOPVKLD method requires a long duration (about 70 ms) to obtain the optimum PV. By contrast, the duration to obtain the optimum PV for each subfunction is less than 25 µs, three orders of magnitude faster than the LOPVKLD method. Because finding the optimum PV is not required, the PCA + T 2 , PCA + SPE, and PCA + KLD methods have lower computation complexity than the proposed method. However, the detection results of these methods are not as good as those of the proposed method, particularly the detection result for f 3 . Because the subfunctions F 1 (w) and F 3 (w) can detect the faults caused by the abnormal change in ∆Σ and ∆µ, respectively, the three faults can be successfully detected by F 3 (w), F 1 (w), and F 1 (w), respectively.   The reason for the sensitivity of the proposed method to incipient faults, from the perspective of optimum PV, is explained in this paper. The projection process can be regarded as a weighted sum process, as presented in Equation (30): In Equation (30), w is an optimum PV and can be considered to be a weight coefficient vector and X is the vector which includes five monitored variables. For the purpose of presentation, all the optimum PVs in the numerical case were normalized (the moduli of the vectors were set to 1) and the absolute value was taken. The optimum PVs obtained using the LOPVKLD method, the subfunction F 1 (w), and the subfunction F 3 (w) before and after insertion of the faults f 1 and f 3 are shown in Figure 5a-f, respectively. In each subfigure of Figure 5, the first 100 windows were the normal windows, whereas the last 100 windows were the fault windows. F w before and after insertion of the faults 1 f and 3 f are shown in Figure 5a-f, respectively. In each subfigure of Figure 5, the first 100 windows were the normal windows, whereas the last 100 windows were the fault windows. Due to the enlargement of the faulty variables, the fault is easier to expose and the detection ability is improved. It can be seen from Equation (29) that the fault 1 f was added to the variable 1 x . As can be seen from Figure 5a x after the fault 3 f occurred. In addition, because iteration is not needed, the computation complexity of the proposed method is less than that of the LOPVKLD method. In summary, the proposed method not only retains the advantage of being more sensitive to possible incipient faults, but also alleviates the disadvantage of high computational complexity.

Real Satellite Fault Case
On 16 March 2021, key telemetry parameters of a satellite payload abnormally fluctuated. Figure 6 presents the phenomena of a telemetry parameter fluctuation related to the fault. In this case, the development of the fault experienced three stages. In the first stage, the variance of the telemetry parameter increased slightly and lasted around 50 days. With the further deterioration of the fault, the mean and variance of the telemetry parameter significantly fluctuated in the second stage. The fault lasted around 70 days in this stage. As the fault developed to the third stage, the mean and variance of the telemetry parameter seriously deviated from the normal fluctuation range. Because the current fault detection system adopts the method based on a threshold, the system cannot detect the Due to the enlargement of the faulty variables, the fault is easier to expose and the detection ability is improved. It can be seen from Equation (29) that the fault f 1 was added to the variable x 1 . As can be seen from Figure 5a-c, both the LOPVKLD method and the subfunction F 3 (w) enlarged the weight of faulty variable x 1 after the fault f 1 occurred. As shown in Figure 5d-f, because the fault variable of the fault f 3 is x 3 , both the LOPVKLD method and the subfunction F 1 (w) enlarged the weight of faulty variable x 3 after the fault f 3 occurred. In addition, because iteration is not needed, the computation complexity of the proposed method is less than that of the LOPVKLD method. In summary, the proposed method not only retains the advantage of being more sensitive to possible incipient faults, but also alleviates the disadvantage of high computational complexity.

Real Satellite Fault Case
On 16 March 2021, key telemetry parameters of a satellite payload abnormally fluctuated. Figure 6 presents the phenomena of a telemetry parameter fluctuation related to the fault. In this case, the development of the fault experienced three stages. In the first stage, the variance of the telemetry parameter increased slightly and lasted around 50 days. With the further deterioration of the fault, the mean and variance of the telemetry parameter significantly fluctuated in the second stage. The fault lasted around 70 days in this stage. As the fault developed to the third stage, the mean and variance of the telemetry parameter seriously deviated from the normal fluctuation range. Because the current fault detection system adopts the method based on a threshold, the system cannot detect the fault until it develops to the third stage. If the fault was successfully detected at the beginning of the first stage, it could be found about four months earlier. Thus, the research objective of this paper is to detect the incipient fault from the first stage. fault until it develops to the third stage. If the fault was successfully detected at the beginning of the first stage, it could be found about four months earlier. Thus, the research objective of this paper is to detect the incipient fault from the first stage. In this study, a total of 13,066,123 samples were collected and arranged from the satellite measurement and control system from 7:35:34 on 15 November 2020 to 16:27:52 on 16 May 2021. Two telemetry parameters related to the fault were selected, as presented in Figure 7. For the reason of confidentiality, the true telemetry parameter names are hidden. The sampling rate of the telemetry data in Figure 7 was 1 Hz. Due to the constraints of the satellite's visible arc and the ground station measurement and control resources, some telemetry data were not transmitted; that is, the telemetry data were discontinuous in time. In this study, a total of 13,066,123 samples were collected and arranged from the satellite measurement and control system from 7:35:34 on 15 November 2020 to 16:27:52 on 16 May 2021. Two telemetry parameters related to the fault were selected, as presented in Figure 7. For the reason of confidentiality, the true telemetry parameter names are hidden. The sampling rate of the telemetry data in Figure 7 was 1 Hz. Due to the constraints of the satellite's visible arc and the ground station measurement and control resources, some telemetry data were not transmitted; that is, the telemetry data were discontinuous in time.  As indicated in Figure 7b, the parameters show a periodicity, and the period is consistent with the satellite orbital period (46, 468 s). For this reason, in this study, we took the satellite orbital period as the length of the sliding window, set the interval of the sliding window as 10,000, and retained the sliding windows comprising more than 40,000 samples as effective windows. A total of 524 effective windows were obtained from the first 6,246,451 samples after being extracted by sliding windows. The samples of the first 100 effective windows were selected as the normal historical data. The last 424 effective windows were selected as the online data for testing. Among the 424 windows for testing, the first 72 windows were normal windows, whereas the last 354 windows were fault windows.
Furthermore, it can be seen from Figure 7b that the telemetry parameters do not obey Gaussian distributions; thus, the fault detection threshold set by the chi-square test may not be appropriate, and the normal historical data must be used to assist in setting the threshold. As stated in Section 3.2, the subfunction F 1 (w) multiplied by the constant n 1 − 1 obeys a chi-square distribution with n 1 − 1 degrees of freedom. In this case, the length of the sliding window n 1 was 46,468. The degrees of freedom were sufficiently high that the subfunction F 1 (w) could be considered to obey a normal distribution; that is, the 3σ method could be used in this case to test whether there is a fault in F 1 (w).
Let X be the normal historical data, which include the date of 100 normal windows. We assume that the ith normal window data is X i . We set X i as the online data and then use the property of the GRQ to obtain the optimum PV w F1_i between X and X i . Let Y = X i , w = w F1_i ; we can obtain the value of F 1_i (w) from Equation (8). Furthermore, we can obtain a vector F 1_X (w) from 100 normal windows. The process of obtaining the vector F 1_X (w) is shown in Figure 8.
the satellite orbital period as the length of the sliding window, set the interval of the sliding window as 10,000, and retained the sliding windows comprising more than 40,000 samples as effective windows. A total of 524 effective windows were obtained from the first 6,246,451 samples after being extracted by sliding windows. The samples of the first 100 effective windows were selected as the normal historical data. The last 424 effective windows were selected as the online data for testing. Among the 424 windows for testing, the first 72 windows were normal windows, whereas the last 354 windows were fault windows.
Furthermore, it can be seen from Figure 7b that the telemetry parameters do not obey Gaussian distributions; thus, the fault detection threshold set by the chi-square test may not be appropriate, and the normal historical data must be used to assist in setting the threshold. As stated in Section 3.2, the subfunction ( ) Let X be the normal historical data, which include the date of 100 normal windows.
We assume that the ith normal window data is i X . We set i X as the online data and then use the property of the GRQ to obtain the optimum PV  Let M 1 and S 1 be the mean and the standard deviation of the vector F 1_X (w), respectively. The fault-detection method of the subfunction F 1 (w) is presented as follows: It can be seen from Section 3.2 that the subfunction F 3 (w) multiplied by the constant n 1 obeys a chi-square distribution with one degree of freedom. Therefore, we refer to the method in Reference [30] to set the threshold. Let F 3_X (w) be the set of 100 F 3 (w) values of 100 normal windows. The process of obtaining the vector F 3_X (w) is similar to that of the vector F 1_X (w). The difference between these two processes is that we use the property of the GRQ to obtain the optimum PV w F3_i and then obtain the value of F 3_i (w) from Equation (10). Let M 3 be the mean of the vector F 3_X (w). The fault-detection method of the subfunction F 3 (w) is presented as follows: where χ 2 α (1) is the threshold of the chi-square distribution with one degree of freedom with a given significance level α.
In this real satellite fault case, both the PCA + T 2 and the PCA + SPE methods still monitored the means and variances of the telemetry parameters. The experimental parameters of these two methods were the same as those presented in Section 4.1. The significance levels of the PCA + KLD method were set to 0.05 and 0.01, respectively. The significance levels of the LOPVKLD method were set to 0.05 and 0.01, respectively. The threshold of F 1 (w) was set by the 3δ method, and the significance level of F 3 (w) was 0.01. The detection results and evaluation indexes of these five methods for the real satellite fault are shown in Figure 9 and Table 2, respectively.   It can be seen from Figure 9a that the PCA + T 2 method has a poor detection result for the real satellite fault, particularly the fault windows between Nos. 100 and 200. Compared to Figure 9a, the detection result of the PCA + SPE method in Figure 9b is significantly improved. However, some fault windows around Nos. 250 to 300 are below the fault detection threshold. Figure 9c,d presents the detection results of the two principal components of the PCA + KLD method for the real satellite fault. In Figure 9c,d, the detection thresholds of significance levels of 0.05 and 0.01 are represented by the black dashed line and the magenta dashed line, respectively. Figure 8e,f illustrates the fault detection results of the LOPVKLD method with the significance levels of 0.05 and 0.01, respectively. According to Figure 9c,f, the fault detection rates of the PCA + KLD and the LOPVKLD methods are higher than 95% with the significance level of 0.05. However, the false alarm rates of both these methods are higher than 25% at this significance level. At significance levels of 0.01, the false alarm rates of these two methods are around 12%, but the fault detection rates decrease by around 10%. As a comparison, the fault detection and false alarm rates of the subfunction F 1 (w) are 100% and 0%, respectively. The false alarm of the proposed method comes from the subfunction F 3 (w). It can be seen from Figure 9 and Table 2 that the false alarm rate of the proposed method is 13.89%. The effectiveness and superiority of the proposed method is further verified by the real satellite case.

Conclusions
In this paper, we propose a new and fast method to detect incipient faults of satellites. We decompose the KL divergence and use the property of the generalized Rayleigh quotient to obtain the optimum projection vector. Under the assumption that the variables obey a multidimensional Gaussian distribution, the distributions of the subfunctions F 1 (w) and F 3 (w) are presented and verified. To address non-Gaussian satellite telemetry parameters, we use the normal historical data to assist in setting the threshold. The proposed method is a linear method. Future work may focus on developing a nonlinear fault-detection method. Funding: This research was supported by Beidou Navigation In-orbit Support System (grant number JKBDZGDH01) and National special support plan for high-level talents (grant number WRJH19DH01).

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: