A Novel Framework for Anomaly Detection for Satellite Momentum Wheel Based on Optimized SVM and Huffman-Multi-Scale Entropy

The health status of the momentum wheel is vital for a satellite, and research on satellite anomaly detection has recently become increasingly extensive. Previous research has mostly required simulation models of key components. However, such physical models are difficult to construct, and the simulation data do not match the telemetry data in engineering applications. To overcome this problem, this paper proposes a new anomaly detection framework based on real telemetry data. First, the time-domain and frequency-domain features of the preprocessed telemetry signal are calculated, and the effective features are selected through evaluation. Second, a new Huffman-multi-scale entropy (HMSE) method is proposed, which can effectively improve the discrimination between different data types. Third, this paper adopts a multi-class SVM model based on the directed acyclic graph (DAG) principle and proposes an improved adaptive particle swarm optimization (APSO) method to train the SVM model. The proposed method is applied to anomaly detection on satellite momentum wheel voltage telemetry data. Its recognition accuracy and detection rate reach 99.60% and 99.87%, respectively. Compared with other methods, the proposed method effectively improves the recognition accuracy and detection rate, and it also effectively reduces the false alarm rate and the missed alarm rate.


Introduction
As artificial satellites are important spacecraft, the study of their reliability is currently a hot topic. Generally, an artificial satellite consists of a structural system, a temperature control system, an attitude control system, a measurement and control system, and a power supply system. The mission of the attitude control system is to help the satellite achieve attitude stability or attitude maneuvers, and thereby to guarantee the normal operation of the satellite platform and the normal work of the payload.
Satellites have high requirements for attitude accuracy, which makes the task of attitude control systems very heavy. Health state and reliability are the basic guarantee for the normal operation of satellites [1]. Therefore, research on the theory and technology of automatic fault diagnosis and anomaly detection of satellite attitude control systems will further ensure the safe and reliable operation of on-orbit aircraft, reducing the possibility of space accidents.
In recent years, many scholars have conducted research on fault diagnosis or health management technology. This research can be roughly divided into three main aspects. First, when there is a specific research object, a feasible solution is to construct a simulation model of the object by analyzing its working mechanism and failure modes. The data generated by the simulation model serve as theoretical prediction values, and a judgment criterion is then designed to complete the detection task. Luo et al. proposed an improved phenomenological model based on meshing vibration to generate fault simulation data [2]. Li et al. established an INS/ADS fault detection model based on kinematic equations and combined an unscented Kalman filter (UKF) with the Runge-Kutta method to deal with the nonlinearity and discretization problems [3]. Second, some research aims at extracting fault features by constructing more effective signal processing methods, such as entropy-based feature extraction [4,5], feature extraction based on spectral kurtosis (SK) [6], and frequency-domain feature extraction [7]. To fully exploit the features of the momentum wheel telemetry signal, this paper uses a combination of time-domain, frequency-domain, and complexity features for feature extraction. Considering that, compared with permutation, dispersion, hierarchical, and similar entropies, sample entropy has better consistency across different parameters, this paper chooses a complexity quantification method based on sample entropy. Third, for the fault recognition process, various pattern recognition methods are used to learn the mapping relationship between features and failure modes, so as to realize automatic fault recognition [8].
Due to the extremely complex structure and working principle of the spacecraft itself, and the strong coupling between its sub-systems, it is very difficult to construct an accurate simulation model of the spacecraft or its components [9,10]. As the spacecraft is affected by the special space environment during its orbital operation, it is extremely prone to unpredictable failures, for example, circuit signal disturbance caused by the electromagnetic background [11] or sudden attitude changes caused by the impact of space debris [12]. In addition, while the spacecraft downloads telemetry data to the ground-based measurement and control station, data jumps and even partial loss can occur [13]. Therefore, data generated by a simulation model often fail to reproduce the actual telemetry data of the spacecraft, and it becomes very difficult to apply spacecraft anomaly detection methods based on physical simulation models in practice.
Data-driven fault diagnosis methods, in contrast, do not require prior knowledge of the object or system (such as mathematical models or expert experience); examples include health assessment methods based on the artificial neural network (ANN), support vector machine (SVM), and Bayesian network (BN).
ANN is a method that is widely used in fault identification problems. Multilayer Perceptron (MLP) is the most typical type of feedforward neural network model, which usually uses a BP algorithm to learn the parameters of the model. Kumar et al. proposed a method based on principal component analysis (PCA) and MLP to detect and classify the three-phase current signals online [14]. In addition, probabilistic neural network (PNN) [15], RBF neural network [16], extension neural network (ENN) [17] and recurrent neural network (RNN) [18][19][20] have also been applied to fault detection and diagnosis problems.
For high-dimensional identification problems in fault diagnosis, the SVM method, based on the principle of structural risk minimization, has been widely used in recent years [21][22][23]. In contrast to the ANN method, which is based on the principle of empirical risk minimization, the learning goal of the SVM is to find the optimal classification hyperplane in the feature space. The ANN can deal with pattern recognition problems, but it requires a large sample size, and adjusting the network structure parameters takes a long time. Bayesian decision-making performs well when the prior probability is available, but good accuracy relies on a prior model with appropriate assumptions. Compared with the above methods, the SVM needs only a small number of training samples and has better generalization ability. Therefore, this paper chooses the SVM as the means of pattern recognition. In the field of fault diagnosis, research on the SVM mainly focuses on two ways of improving recognition accuracy: optimizing the hyperparameters of the model and constructing new kernel functions. For specific recognition tasks, many optimization methods have been applied to tune model hyperparameters for better recognition performance [24][25][26]. Liu et al. proposed a novel small-sample missing-data filling method based on support vector regression (SVR) and a genetic algorithm (GA) to improve the equipment health diagnosis effect [25]. Particle swarm optimization (PSO) is a hyperparameter optimization algorithm, used by Cuong-Le et al. for damage identification [26]. In terms of constructing new kernel functions, Wang et al. proposed a kernel function selection mechanism under sparse representation, and the superiority of this mechanism was demonstrated in simulations and engineering experiments on high-speed bearing fault diagnosis [27].
Although both GA and PSO can solve high-dimensional complex optimization problems well, in the iterative process of PSO, the particles can retain the memory of the good solution, but the GA cannot, so PSO can often converge to a better solution more quickly. Based on the above analysis, this paper uses PSO to optimize the multi-class SVM.
From the above analysis, it can be seen that the following problems still exist in the direct application of existing anomaly detection or fault diagnosis methods to the anomaly detection problem of the satellite momentum wheel.
(1) Due to the complex structure and control law of the satellite momentum wheel itself, it is very difficult to construct an accurate simulation model, so model-based anomaly detection methods often fail to achieve satellite momentum wheel anomaly detection.
(2) Satellite telemetry data often contain outliers (data with very large deviations introduced by the telemetry process). Such points alone cannot characterize the health of the spacecraft, but they are easily flagged as abnormal values by existing methods. At the same time, some segments of the telemetry data are lost while downloading the data from the satellite to the ground. Therefore, reasonable preprocessing of telemetry data is required.
(3) The sampling frequency of telemetry data collected by an on-orbit satellite is often less than 1 Hz, and the data itself has a long change period, so traditional anomaly detection methods based on time-frequency-domain analysis are difficult to apply to telemetry data.
Therefore, in response to the above problems, this article proposes a new method based on multi-type feature fusion and an improved SVM to handle the problem of satellite anomaly detection. The main contributions of the proposed framework can be summarized as follows: (1) We design a new anomaly detection framework for satellites, which includes a telemetry data preprocessing part, a telemetry data multi-type feature extraction part, and a data-driven anomaly detection part. (2) We propose a new method to construct the fusion-feature sequence HMSE-T/F. The HMSE-T/F is based on the Huffman-multi-scale entropy and the selected time/frequency-domain features. The Huffman-multi-scale entropy is a new method based on the Huffman coding principle and sample entropy. (3) We build a multi-class SVM model based on the directed acyclic graph (DAG) principle and propose an improved adaptive particle swarm optimization (APSO) to train it. Compared with other methods, the proposed method achieves excellent anomaly detection performance.
The rest of this paper is organized as follows. Section 2 presents the scheme of the proposed anomaly detection framework. The construction method of multi-type feature sequence HMSE-T/F is provided in Section 3. In Section 4, the anomaly detection method based on multi-class SVM model and the improved adaptive particle swarm optimization (APSO) are stated. In Section 5, the performance of the proposed method is evaluated from different aspects. Finally, in Section 6, a comprehensive summary of this paper and prospects for future work are given.

Description of Difficulties in Spacecraft Anomaly Detection
In fact, since satellites are in normal working condition most of the time in orbit, the proportion of normal data in the telemetry data collected on the ground is very high. For most detection methods that rely on plenty of training data, satellite telemetry data provide very few abnormal or fault samples, and there are very few effective samples that can be used for classification model training. Therefore, some adaptive improvements are needed when using a classification model to detect anomalies in spacecraft. Figure 1a shows the momentum wheel voltage change of a certain type of satellite within 10 days, with a sampling frequency of 0.125 Hz. Figure 1b shows a sudden voltage change in a certain type of satellite. Figure 1c is the frequency spectrum of the telemetry signal in Figure 1a, and Figure 1d is a partially enlarged view of Figure 1c. According to Figure 1a-d, apart from containing little abnormal data, satellite telemetry data also exhibit an extremely low sampling frequency, slow change over long periods of time, and many sudden abnormalities. Therefore, anomaly detection methods that rely on time-domain and frequency-domain feature extraction often find it difficult to distinguish the health status reflected in the telemetry data.

Figure 1. A satellite's momentum wheel voltage telemetry data: (a) 10-day data sampled at a frequency of 0.125 Hz, (b) sample with sudden change, (c) frequency spectrum, (d) partially enlarged view of (c).


The Proposed Anomaly Detection Framework
To effectively solve the problem of satellite momentum wheel anomaly detection, a new anomaly detection framework based on multi-type feature extraction and fusion is proposed in this paper. The overall procedure of the proposed anomaly detection framework is shown in Figure 2. Specifically, the descriptions of each Step are detailed as follows.  Figure 2. The overall procedure of the proposed anomaly detection framework.
Step 1: Telemetry data collection. When the satellite is in orbit, to obtain its internal operating status and further provide real-time data for the remote-control object, the sensors in the satellite telemetry system need to measure the operating status of each key component and convert it into electrical signals. After the signals of each channel are combined according to a certain system, they are transmitted to the ground telemetry equipment (including receiver, antenna and splitter demodulator) using radio communication technology, and the ground terminal equipment restores and stores the original parameter information of each channel through signal demodulation technology.
Step 2: Data preprocessing. The collection process of telemetry data is subject to interference from sensors, converters, and wireless transmission. The data obtained at the ground receiving end therefore often contain abnormal jump points. Data points that deviate from the change law of the measured signal in this way are usually called abnormal outliers. Abnormal outliers in the telemetry data provide wrong information and affect the processing and analysis of the telemetry signal. Outlier elimination is thus an important part of telemetry data preprocessing. By eliminating random measurement values with large errors, the authenticity of telemetry data can be guaranteed to a certain extent, and the reliability of data analysis can be improved. Commonly used methods to eliminate outliers include visual inspection, the mean square method, point discrimination, the Letts criterion, etc. Different outlier elimination methods should be used for different types of telemetry data. Considering that this article mainly analyzes the telemetry data of the satellite momentum wheel, the outlier elimination method based on the Letts criterion is adopted.
The premise of the Letts criterion is that the distribution of the measured data is close to the normal distribution. Under this assumption, with a confidence probability of 99.7%, three times the standard deviation of the measured quantity is taken as the threshold, and any measurement value exceeding this limit is judged to be an outlier. For a given telemetry sequence x = {x_i}, i = 1, · · · , N, the specific process of the method is as follows.
(1) Calculate the mean of the series: x̄ = (1/N) ∑_{i=1}^{N} x_i.
(2) Calculate the standard deviation of the series: σ = [ (1/(N−1)) ∑_{i=1}^{N} (x_i − x̄)² ]^{1/2}.
(3) Eliminate outliers: any x_i satisfying |x_i − x̄| > 3σ is judged to be an outlier and removed.
In addition to the problem of outliers, the transmission of satellite telemetry data to the ground is affected by the ionosphere, and data may be lost during the signal decoding process. A telemetry sequence with many data problems should be discarded and not used as training data, but a missing value at a single point in the sequence can be handled by a filling method. According to their distribution, missing values can be divided into missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). MCAR means that the pattern of missing values in the data is completely random and does not affect the unbiasedness of the overall sample. MAR means that the missing-data mechanism is not completely random: such missing data depend on other variables, and this type is relatively rare in telemetry data. MNAR means that the missingness is related to the value of the variable itself.
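As a sketch, the three elimination steps above might look like the following (illustrative Python; the function name and the toy test signal are ours, not the paper's):

```python
import numpy as np

def letts_filter(x, k=3.0):
    """Eliminate outliers by the Letts (3-sigma) criterion.

    Assumes the measurements are approximately normally distributed,
    so about 99.7% of valid samples fall within 3 standard deviations
    of the mean.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    sigma = x.std(ddof=1)                # sample standard deviation
    keep = np.abs(x - mean) <= k * sigma
    return x[keep], np.flatnonzero(~keep)

# Toy example: a slow periodic trace with two injected jump points
np.random.seed(0)
t = np.linspace(0, 10, 1000)
signal = np.sin(t) + 0.01 * np.random.randn(1000)
signal[100] += 10.0                      # simulated telemetry jumps
signal[500] -= 10.0
clean, outlier_idx = letts_filter(signal)
```

In practice the criterion can also be applied iteratively, re-estimating the mean and standard deviation after each elimination round.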
The missing values in satellite telemetry data are generally MCAR, so this paper uses an interpolation method based on the two short sequences before and after the missing point to fill in the missing values. Suppose the data sequence to be filled is y = {y_j}, j = 1, · · · , M, and the missing value to be filled is y*. The auxiliary variable used to construct the regression equation is x = {x_j}, j = 1, · · · , M. The auxiliary variable value corresponding to the missing value is x*, and x* is known. Using x and y, construct the regression equation y = f(x), where f needs to be chosen as a different regression model according to the telemetry data. The missing value is then filled as y* = f(x*).
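A minimal sketch of this filling strategy, using the sample index as the auxiliary variable and assuming a first-order polynomial for f (the paper leaves the regression model open):

```python
import numpy as np

def fill_missing(y, idx, window=5):
    """Fill the single missing value y[idx] from the short sequences
    before and after it, via a local regression on the sample index.

    Illustrative only: a first-order polynomial stands in for the
    regression model f, which the text says should be chosen per signal.
    """
    y = np.asarray(y, dtype=float)
    lo, hi = max(0, idx - window), min(len(y), idx + window + 1)
    xs = np.array([j for j in range(lo, hi) if j != idx])
    a, b = np.polyfit(xs, y[xs], deg=1)   # local fit y ~= a*x + b
    return a * idx + b

# Toy example: a linear trend with one lost sample
y = np.arange(20, dtype=float)
y[7] = np.nan                             # simulated missing point
y[7] = fill_missing(y, 7)
```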
Step 3: Features extraction. Considering the difficulty of using satellite telemetry data for anomaly detection as mentioned above, this paper adopts a time-frequency domain feature extraction and selection method based on feature quality evaluation. At the same time, a complexity feature extraction method based on Hoffman multi-scale entropy is proposed, which enriches signal feature types and provides effective feature learning samples for training satellite telemetry data anomaly detection models. The specific method of feature extraction is described in detail in Section 3.
Step 4: Obtaining the anomaly detection model. This paper takes the support vector machine (SVM) as the basic unit and uses the directed acyclic graph (DAG) principle to construct a satellite momentum wheel anomaly detection model. This model can effectively solve the multi-classification problem when some categories are difficult to distinguish. In addition, to improve the classification accuracy of the anomaly detection model, an improved particle swarm optimization (PSO) algorithm is proposed to train the SVMs. The specific method of obtaining the anomaly detection model is described in detail in Section 4.
The time-domain signal is a time series in which time is the independent variable describing the change of a certain physical quantity, and it is the most basic and most intuitive form of expression of a signal. The time-domain signal reflects the correspondence between real physical information and time. The filtering, amplification, statistical feature calculation, and correlation analysis of signals in the time domain are collectively referred to as time-domain analysis.
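Returning to the optimizer of Step 4: the improved APSO of Section 4 builds on standard particle swarm optimization. A plain, non-adaptive PSO baseline can be sketched as follows (the toy quadratic objective stands in for the real SVM validation error; all parameter values are illustrative):

```python
import numpy as np

def pso_minimize(f, bounds, n_particles=30, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Plain box-bounded particle swarm optimization.

    Non-adaptive baseline only: the paper's APSO adds adaptive
    parameter control on top of this scheme (detailed in Section 4).
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    x = rng.uniform(lo, hi, (n_particles, len(lo)))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_val)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2,) + x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # velocity update
        x = np.clip(x + v, lo, hi)                             # keep particles in bounds
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, pbest_val.min()

# Toy objective standing in for hyperparameter tuning
best, val = pso_minimize(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2,
                         bounds=[(-10, 10), (-10, 10)])
```

Unlike a genetic algorithm, each particle here keeps a memory of its personal best (`pbest`), which is the property the paper cites as the reason PSO often converges faster than the GA.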

Multi-Type Feature Sequence HMSE-T/F Construction
When a device fails, its spectrum distribution may change. Like the statistical analysis of time-domain signals, this type of change can be described by statistical analysis of the signal's frequency spectrum.
Given a period of time-domain signal x(t), let its frequency spectrum be y(k), k = 1, · · · , K, where f_k is the frequency of the k-th spectral line. The time-domain and frequency-domain statistical features of x(t) are shown in Table 1 [28].

Table 1. Time-domain and frequency-domain statistical features (e.g., the waveform index X_wi = X_rms / X_am).
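A few of these statistics can be computed as follows (an illustrative subset using standard definitions; the exact feature set of Table 1 may differ):

```python
import numpy as np

def basic_features(x, fs=0.125):
    """Compute a small subset of time/frequency-domain statistics.

    fs defaults to the 0.125 Hz telemetry sampling rate mentioned in
    the text; all formulas are standard definitions.
    """
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    abs_mean = np.mean(np.abs(x))
    feats = {
        "mean": x.mean(),
        "rms": rms,
        "kurtosis": np.mean((x - x.mean()) ** 4) / np.var(x) ** 2,
        "waveform_index": rms / abs_mean,      # X_wi = X_rms / X_am
    }
    # Frequency domain: amplitude spectrum and its centroid frequency
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    feats["freq_centroid"] = np.sum(freqs * spec) / np.sum(spec)
    return feats

# Example: a pure tone with 20 full periods over 1000 samples
feats = basic_features(np.sin(2 * np.pi * np.arange(1000) / 50))
```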

Feature Evaluation and Selection
In this paper, two commonly used feature evaluation methods, the Laplacian Score (LS) [29] and the Relief-F Score (RFS) [30], are used to evaluate the effectiveness of the time-domain and frequency-domain features of the satellite momentum wheel telemetry signal. Feature selection is based on the two different feature evaluation results, and the features with higher evaluation scores are taken as the effective features in the time/frequency domain.
(1) Laplacian Score (LS). In practical problems, data of the same type are generally close to each other. Under this premise, evaluating the importance of a feature can be transformed into evaluating how well it preserves the local structure of the data. The Laplacian Score is based on this idea. Let the data set be X ∈ R^{m×n}, let L_r denote the LS of the r-th feature, and let f_ri be the r-th feature of the i-th sample. L_r can be calculated as follows.
Step 1: Construct a neighbor graph G containing n nodes, the i-th node corresponds to the i-th sample x i , if x i and x j are close to each other, that is, x i is within the k-neighbor range of x j , then an edge is constructed between nodes x i and x j . When the data labels are known, edges can be constructed directly between samples of the same type.
Step 2: If nodes x_i and x_j are connected, set S_ij = exp(−‖x_i − x_j‖²/t), where t is a suitable constant; otherwise, set S_ij = 0. The weight matrix S of the graph models the local structure of the data space.
Step 3: For the r-th feature, let f_r = [f_r1, f_r2, · · · , f_rn]^T, D = diag(S·1) and L = D − S, where 1 = [1, · · · , 1]^T. Define the centered feature f̃_r = f_r − (f_r^T D 1 / 1^T D 1) 1.
Step 4: Compute the LS of the r-th feature as L_r = (f̃_r^T L f̃_r) / (f̃_r^T D f̃_r).
(2) Relief-F Score (RFS). The Relief-F Score method is a multi-class variant of the Relief method. The Relief method designs a correlation statistic to measure the importance of features. The statistic is a vector, each component of which corresponds to an initial feature, and the importance of a feature subset is determined by the sum of the statistic components corresponding to the features in the subset. For each x_i in the data set X ∈ R^{m×n}, first find its nearest neighbor x_{i,nh} among the samples of the same type as x_i, called the near-hit, and then find its nearest neighbor x_{i,nm} among the samples of different types, called the near-miss. The component of the correlation statistic corresponding to the r-th feature is
δ_r = ∑_i [ −diff(x_i^(r), x_{i,nh}^(r))² + diff(x_i^(r), x_{i,nm}^(r))² ],
where diff(·,·) depends on the type of the r-th feature: if the r-th feature is discrete, diff(a, b) = 0 when a = b and 1 otherwise; if it is continuous, diff(a, b) = |a − b| with values normalized to [0, 1]. Relief is designed for two-class problems, while Relief-F can handle multi-class problems. For a sample x_i belonging to the k-th class, the Relief-F method first finds its nearest neighbor x_{i,nh} among the k-th class samples, and then finds a nearest neighbor x_{i,l,nm}, l ≠ k, of x_i in each class other than the k-th as a near-miss, so the component of the correlation statistic corresponding to the r-th feature is
δ_r = ∑_i [ −diff(x_i^(r), x_{i,nh}^(r))² + ∑_{l≠k} p_l diff(x_i^(r), x_{i,l,nm}^(r))² ],
where p_l is the proportion of the l-th class samples in the data set X.
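The Laplacian Score steps can be sketched as follows (unsupervised variant with a heat-kernel k-nearest-neighbour graph; k, t, and the toy data are illustrative choices of ours):

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score of each feature (lower = better locality preservation).

    S_ij uses a heat kernel on a symmetrized k-nearest-neighbour graph,
    the unsupervised construction described in Steps 1-2.
    """
    n, m = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]                     # skip self at position 0
    S = np.zeros((n, n))
    for i in range(n):
        S[i, nn[i]] = np.exp(-d2[i, nn[i]] / t)
    S = np.maximum(S, S.T)                                      # symmetrize the graph
    D = np.diag(S.sum(axis=1))
    L = D - S                                                   # graph Laplacian
    one = np.ones(n)
    scores = np.empty(m)
    for r in range(m):
        f = X[:, r]
        f_t = f - (f @ D @ one) / (one @ D @ one)               # remove the trivial component
        scores[r] = (f_t @ L @ f_t) / (f_t @ D @ f_t)
    return scores

# Example: feature 0 follows a two-cluster structure, feature 1 is pure noise,
# so feature 0 should receive the lower (better) score
np.random.seed(1)
labels = np.r_[np.zeros(50), np.ones(50)]
X = np.column_stack([labels + 0.01 * np.random.randn(100), np.random.rand(100)])
scores = laplacian_score(X)
```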

Complexity Features Based on Huffman-Multi-Scale Entropy (HMSE)
Sample entropy (SampEn) is a time-series complexity measure proposed by Richman et al. [31]. Sample entropy improves on approximate entropy; both measure the complexity of a time series through the probability that new patterns are generated as the embedding dimension changes. The greater the probability of generating a new pattern, the more complex the sequence and the higher the entropy value. Compared with other nonlinear dynamic methods such as the Lyapunov exponent, information entropy, and correlation dimension, sample entropy has the advantages of working with short data, strong anti-noise and anti-interference ability, and good consistency over a wide range of parameters. Therefore, it has attracted the attention of many scholars and has frequently been used in mechanical signal analysis and fault diagnosis in recent years.

Traditional Multi-Scale Sample Entropy (MSE)
Suppose a time series of length N is X = {x_1, x_2, · · · , x_{N−1}, x_N}. The sample entropy is calculated as follows.
Step 1: Construct m-dimensional vectors from X: X(i) = {x_i, x_{i+1}, · · · , x_{i+m−1}}, i = 1, · · · , N − m + 1.
Step 2: Define the distance between X(i) and X(j), d[X(i), X(j)] (i ≠ j), as the largest difference between the two vectors' corresponding elements: d[X(i), X(j)] = max_{k=0,···,m−1} |x_{i+k} − x_{j+k}|.
Step 3: Given a threshold r > 0, count the number of j ≠ i with d[X(i), X(j)] < r and calculate its ratio to the total number of other vectors: B_i^m(r) = (number of d[X(i), X(j)] < r) / (N − m).
Step 4: Average all the results obtained in Step 3: B^m(r) = (1/(N − m + 1)) ∑_{i=1}^{N−m+1} B_i^m(r).
Step 5: Set m = m + 1 and repeat Step 1-Step 4 to obtain B^{m+1}(r).
Step 6: Theoretically, the sample entropy of this sequence is SampEn(m, r) = lim_{N→∞} {−ln[B^{m+1}(r)/B^m(r)]}. However, N cannot be infinite in practice; for finite N, the estimated value of sample entropy is SampEn(m, r, N) = −ln[B^{m+1}(r)/B^m(r)].
The calculation of sample entropy does not include comparing a data segment with itself, which not only improves the calculation accuracy and saves calculation time, but also makes the calculation less dependent on the data length. In addition, sample entropy has better consistency: if one sequence has a higher SampEn than another, then when the parameters m and r are changed, the first sequence still has a relatively high SampEn value. However, the disadvantage of sample entropy is that it does not consider the different time scales that may exist in the time series.
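Steps 1-6 can be sketched as a straightforward O(N²) routine (here r is given as a fraction of the series' standard deviation, a common convention the text does not state explicitly):

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """SampEn(m, r, N) of a 1-D series.

    Plain implementation of Steps 1-6; r is interpreted as a fraction
    of the standard deviation of x.
    """
    x = np.asarray(x, dtype=float)
    tol = r * np.std(x)
    N = len(x)

    def match_count(dim):
        # all template vectors of length `dim`
        tmpl = np.array([x[i:i + dim] for i in range(N - dim + 1)])
        total = 0
        for i in range(len(tmpl)):
            d = np.max(np.abs(tmpl - tmpl[i]), axis=1)  # Chebyshev distance
            total += np.sum(d < tol) - 1                # exclude the self-match
        return total

    return -np.log(match_count(m + 1) / match_count(m))

# A regular signal should score much lower than white noise
np.random.seed(0)
se_sine = sample_entropy(np.sin(np.linspace(0, 20 * np.pi, 1000)))
se_noise = sample_entropy(np.random.randn(1000))
```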
To calculate the complexity of the signal at different time scales, Costa et al. proposed multi-scale entropy [32], which aims to extend the sample entropy to multiple time scales to provide additional observation perspectives when the time scale is uncertain. Like other entropy measurement methods, the goal of multi-scale entropy is to evaluate the complexity of time series. One of the main reasons for using multi-scale entropy is that the relevant time scale in the time series is not known. For example, when analyzing a speech signal, it is more effective to count the complexity of the signal under the word time scale than the complexity of the entire speech segment. However, the actual situation is that we often cannot know how many words a certain speech segment contains, or know what time scale should be used to obtain more useful information from the original signal. Therefore, analyzing the problem through multiple time scales will obtain more effective information.
The basic principle of multi-scale entropy (MSE) is to coarse-grain or downsample the time series so that it can be analyzed at increasingly coarse time resolutions. Given a time series X = {x_1, x_2, · · · , x_{N−1}, x_N} of length N, set the coarse-grained scale to s; the original series is then split into i consecutive non-overlapping segments, where i = floor(N/s) and floor(∗) means taking the largest integer not exceeding ∗. The original sequence is transformed into a new sequence by averaging each segment:
y_j = (1/s) ∑_{t=(j−1)s+1}^{js} x_t, 1 ≤ j ≤ i. (15)
The MSE of the original sequence is then obtained by computing the sample entropy of the new sequence Y = {y_1, y_2, · · · , y_i} for different values of s. The process of coarse-graining the time series is shown in Figure 3.
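The coarse-graining of Equation (15) is a one-liner in practice (sketch; trailing samples that do not fill a complete window are dropped):

```python
import numpy as np

def coarse_grain(x, s):
    """Coarse-grain a series at scale s by non-overlapping averaging,
    as in Equation (15)."""
    x = np.asarray(x, dtype=float)
    n = len(x) // s                       # i = floor(N / s) segments
    return x[:n * s].reshape(n, s).mean(axis=1)

# Scale 3 turns [0, 1, ..., 9] into the three window means
y = coarse_grain(np.arange(10), 3)
```

Applying a sample-entropy routine to `coarse_grain(x, s)` for a range of scales s then yields the MSE curve.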


The Huffman-Multi-Scale Entropy (HMSE)
According to the calculation process of multi-scale sample entropy, the core of this method is to coarse-grain the original time series on different time scales by averaging. Figure 4a shows a satellite momentum wheel voltage telemetry signal with a length of 10,000. This signal is coarse-grained by averaging on the time scales s = 10, s = 50, and s = 100, respectively. The results are shown in Figure 4b-d. As can be seen from Figure 4, the waveform of the new signal obtained by averaging the original signal at different time scales is almost the same.
It can be seen from Figure 4 that the state of the signal changes at about the 4000th sample point in the original signal. However, the use of different coarse-grained scales cannot reflect the difference in signal changes. Therefore, this paper proposes a new improved multi-scale entropy calculation method based on the Huffman mean model. The main innovation of this method is that when the original data is coarse-grained on different time scales, the average value is not taken, but the Huffman average value is taken. This section will introduce the Huffman mean model and the improved multi-scale entropy calculation method based on the Huffman mean model in detail.
(1) Huffman Coding. In 1952, Huffman proposed an optimum method of coding an ensemble of messages consisting of a finite number of members [33]. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized. The process of Huffman coding is as follows.
Step 1: Given a sequence containing n kinds of symbols. Suppose the set of symbol types is S^0 = {s_1^0, s_2^0, ..., s_i^0, ..., s_n^0}, i = 1, 2, ..., n. The probability of each symbol appearing is P^0 = {p_1^0, p_2^0, ..., p_i^0, ..., p_n^0}, with the probabilities summing to 1.
Step 2: Set the iteration parameter to t; the maximum value of t is n − 1 and the initial value of t is 0. The symbol sequence and the corresponding probabilities at the beginning of the k-th iteration are S^{k−1} and P^{k−1}; at the end of the k-th iteration they are S^k and P^k.
Step 3: When t = k, arrange the symbols of S^{k−1} in ascending order of probability P^{k−1} as S^{k−1} = {s_1^{k−1}, s_2^{k−1}, ..., s_i^{k−1}, ..., s_{n−k+1}^{k−1}}. Then the probability P^{k−1} is rearranged accordingly as P^{k−1} = {p_1^{k−1}, p_2^{k−1}, ..., p_i^{k−1}, ..., p_{n−k+1}^{k−1}}.
Step 4: Merge the two least probable symbols s_1^{k−1} and s_2^{k−1} into a new symbol s* with probability p* = p_1^{k−1} + p_2^{k−1}.
Step 5: Delete s_1^{k−1} and s_2^{k−1} from S^{k−1}, and add s* into S^{k−1}. Then S^{k−1} turns into S^k, and the size of S^k is n − k. Delete p_1^{k−1} and p_2^{k−1} from P^{k−1}, and add p* into P^{k−1}. Then P^{k−1} turns into P^k, and the size of P^k is also n − k.
Step 6: Repeat Step 3 to Step 5 until t = n − 1. Then the symbol sequence will be S^{n−1} = {s_1^{n−1}} and the probability will be P^{n−1} = {p_1^{n−1}}, p_1^{n−1} = 1. It can be seen from the above coding process that a symbol with a lower probability in the original signal has a longer Huffman code, and conversely, a symbol with a higher probability has a shorter Huffman code. The complexity of the probability distribution of the signal can therefore be described by solving the average Huffman code length of the original signal. Based on the above Huffman coding process, the method to further calculate the average Huffman coding length is as follows.
Backtrack from the symbol s_1^{n−1} to the original symbols to obtain the Huffman code c_i of each symbol s_i^0. The average Huffman coding length L* can then be calculated from the length of c_i and the corresponding probability p_i^0 as Equation (16):
L* = Σ_{i=1}^{n} p_i^0 · L(c_i), (16)
where L(c_i) is the length of c_i.
For a set of source symbols S^0 = {s_1, s_2, s_3, s_4, s_5, s_6} with probabilities P^0 = {0.35, 0.28, 0.14, 0.13, 0.07, 0.03}, the process of Huffman coding is shown in Table 2. The average Huffman coding length of S^0 can be calculated as 2.33.
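The coding steps and Equation (16) can be checked with a small sketch that tracks only code lengths (which is enough to compute L*). The heap-based `huffman_code_lengths` below is an illustrative implementation, verified against the worked example above:

```python
import heapq
import itertools

def huffman_code_lengths(probs):
    """Huffman code length of each symbol, following Steps 1-6:
    repeatedly merge the two least probable nodes; every original
    symbol under a merged node gains one more code digit."""
    counter = itertools.count()  # tie-breaker so equal probabilities compare cleanly
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:        # every leaf under the merged node
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), ids1 + ids2))
    return lengths

def average_code_length(probs):
    """L* of Equation (16): sum of p_i * L(c_i)."""
    return sum(p * l for p, l in zip(probs, huffman_code_lengths(probs)))

P0 = [0.35, 0.28, 0.14, 0.13, 0.07, 0.03]
print(huffman_code_lengths(P0))           # [2, 2, 2, 3, 4, 4]
print(round(average_code_length(P0), 2))  # 2.33, matching the worked example
```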
(2) Huffman Mean Model. The basic principle of Huffman coding and the calculation method for the average Huffman coding length were introduced above. In this paper, a new Huffman mean model based on Huffman coding is proposed for the problem of satellite anomaly detection. For a sequence T = {t_1, t_2, ..., t_i, ..., t_n}, the expression of the Huffman mean model is shown in Equation (17).
HM(T) = sum(T · (L / sum(L))), (17)
where HM(T) is the Huffman mean value of T, T' = T/sum(T) converts the original time series into a probability series, C = Huffman_coding(T') represents the Huffman coding result of the probability series T', C = {c_1, c_2, ..., c_i, ..., c_n}, i = 1, 2, ..., n, with c_i the Huffman code corresponding to t_i in the original sequence, L = L(C) means calculating the length of each c_i, and sum(T · (L / sum(L))) represents the Huffman mean value of the sequence T considering the length weights of the Huffman codes.
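A sketch of Equation (17) under the interpretation above: the segment values are weighted by their normalized Huffman code lengths. `huffman_mean` and its helper are illustrative names, and the sketch assumes the segment values are positive so they can be normalized into a probability series:

```python
import heapq
import itertools

def huffman_code_lengths(probs):
    # build the Huffman tree by repeatedly merging the two least
    # probable nodes, counting each original symbol's depth
    counter = itertools.count()
    heap = [(p, next(counter), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, next(counter), ids1 + ids2))
    return lengths

def huffman_mean(t):
    """Huffman mean of Equation (17): the values of t weighted by the
    normalized Huffman code lengths of the probability series T' = T/sum(T).
    Assumes t has at least two positive elements."""
    total = sum(t)
    probs = [v / total for v in t]         # T'
    lengths = huffman_code_lengths(probs)  # L(c_i)
    lsum = sum(lengths)
    return sum(v * l / lsum for v, l in zip(t, lengths))
```

Because rarer (lower-probability) values receive longer codes, they get larger weights, so the Huffman mean is pulled toward atypical values, unlike the plain segment mean.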
(3) The Improved Method of Huffman-multi-scale Entropy. Figure 5 shows the calculation process of Huffman-multi-scale entropy. The inputs of both methods are the original signal X = {x_1, x_2, ..., x_{N−1}, x_N}, the scale sequence Scale = {s_1, s_2, ..., s_p} and the parameter set θ = {m, r}, usually 0.1std(X) < r < 0.2std(X). Compared with the classic MSE, the Huffman-multi-scale entropy method proposed in this paper adopts a coarse-graining method based on the Huffman mean model.


Multi-Class SVM
A support vector machine (SVM) is widely used in classification problems. The basic idea is to find a hyperplane such that the sample points of the positive and negative classes are as far as possible from the plane, so that points far enough from the plane can be correctly classified. Therefore, if the points closest to the hyperplane are kept as far away from it as possible, a better classification effect can be achieved.
This article uses the maximum-margin classifier to realize the two-class SVM, and then uses the directed acyclic graph (DAG) method to build the multi-class SVM from two-class SVMs.
Set the dataset as {(x_i, y_i) | i = 1, 2, ..., N}, x_i ∈ R^n, y_i ∈ {−1, +1}, and let the hyperplane be w^T x + b = 0. The support vectors satisfy w^T x + b = y, so the interval between the two classes can be written as Equation (18):
margin = 2 / ||w||. (18)
The SVM model keeps all the points on both sides of the support vectors of their respective categories, away from the hyperplane. It can be seen from Equation (18) that when ||w||^2 is the smallest, the interval is the largest. Introducing the penalty parameter λ for misclassification and the relaxation factor ξ that allows misclassification, the objective function can be written as Equation (19):
min_{w,b,ξ} (1/2)||w||^2 + λ Σ_{i=1}^{N} ξ_i, s.t. y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0. (19)
According to Lagrange duality, the optimization objective can be converted into an equivalent dual problem, so Equation (19) can be transformed into Equation (20):
max_α Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j), s.t. Σ_{i=1}^{N} α_i y_i = 0, 0 ≤ α_i ≤ λ, (20)
where K(x_i, x) is the kernel function. The radial basis function is used as the kernel function in this paper, K(x_i, x) = exp(−||x_i − x||^2 / (2σ^2)), where σ is the kernel function parameter. Then the decision function is Equation (21):
f(x) = sign(Σ_{i=1}^{N} α_i y_i K(x_i, x) + b). (21)
Among multi-class SVM methods, one is the direct solution method, but this method has high time complexity, is difficult to implement, and is not suitable for a large amount of data. The other is to combine multiple two-class SVM models into a multi-class SVM model. In this paper, a directed acyclic graph method is used to construct a multi-class SVM.
The DAG method uses the "competition" rule. For n types, the height of the decision tree is n − 1. Put the classes that are easy to distinguish on the upper layer, and the classes that are difficult to distinguish on the lower layer. The schematic diagram of DAG method for five-class SVM is shown in Figure 7.
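As an illustration of the DAG "competition" rule, the sketch below trains one binary classifier per class pair (scikit-learn's `SVC` with an RBF kernel stands in for the paper's APSO-trained SVMs) and predicts by eliminating one candidate class per comparison, so n classes need only n − 1 tests per sample. `DagSVM` is a hypothetical name:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

class DagSVM:
    """One binary SVM per class pair; prediction walks the DAG,
    discarding one candidate class per comparison (n-1 tests)."""

    def __init__(self, **svm_kwargs):
        self.svm_kwargs = svm_kwargs
        self.models = {}

    def fit(self, X, y):
        self.classes = np.unique(y)
        for a, b in combinations(self.classes, 2):
            mask = (y == a) | (y == b)
            clf = SVC(kernel="rbf", **self.svm_kwargs)
            self.models[(a, b)] = clf.fit(X[mask], y[mask])
        return self

    def predict_one(self, x):
        alive = list(self.classes)           # candidate classes, kept sorted
        while len(alive) > 1:
            a, b = alive[0], alive[-1]       # "competition" between the ends
            winner = self.models[(a, b)].predict(x.reshape(1, -1))[0]
            alive.remove(b if winner == a else a)   # drop the loser
        return alive[0]

    def predict(self, X):
        return np.array([self.predict_one(x) for x in np.asarray(X)])
```

Placing easy pairs near the root, as the paper suggests, only changes which pair is tested first, not the n − 1 cost.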


Improved Adaptive Particle Swarm Optimization (APSO)
Kennedy and Eberhart first proposed particle swarm optimization (PSO) in 1995 [34]. PSO is an algorithm for finding the optimal solution inspired by the foraging behavior of bird flocks. In the PSO algorithm, each particle represents a feasible solution of a function to be optimized, and the movement of the particle is restricted by two aspects: velocity and position. The velocity constrains the distance of particle movement, while the position constrains the direction of particle movement. Each particle's movement is given a fitness function to evaluate the particle's location. Under the control of the constraint conditions and the evaluation function, the particles search for a better area in the process of moving. After many iterations, they gather near the optimal solution. The particle velocity and position update formulas are as follows:
v_id^{k+1} = ω v_id^k + c_1 r_1 (p_id − x_id^k) + c_2 r_2 (p_gd − x_id^k), (22)
x_id^{k+1} = x_id^k + v_id^{k+1}, (23)
where v_id^k is the current velocity of the d-th component of the i-th particle, v_id^{k+1} is its velocity at the next step, ω is the inertia weight, ω ≥ 0, c_1 and c_2 are the acceleration constants of the particle, r_1 and r_2 are random numbers between 0 and 1, r_1, r_2 = random(0, 1), p_id represents the best position of the d-th component of the i-th particle, p_gd represents the best position of the d-th component among all particles, x_id^k is the current position of the d-th component of the i-th particle, and x_id^{k+1} is its position at the next step.
PSO has the advantages of few parameters and fast convergence, but it also has shortcomings such as premature convergence and falling into local optima. It can be seen from Equations (22) and (23) that the inertia weight ω determines the relationship between the next flight distance and the current flight distance, which further affects the position after the flight. The larger ω is, the stronger the particle's flying ability in the solution space, which is conducive to searching the global scope. The smaller ω is, the smaller the flight length and the stronger the particles' search ability in a local area, which is conducive to the convergence of the algorithm. However, if ω is too large, the algorithm will easily skip the optimal solution or oscillate near it, leading to premature convergence; if ω is too small, the algorithm will easily fall into a local optimum.
The inertia weight ω should be a larger value at the beginning of the iteration to ensure a strong global search ability and the ability to jump out of the local optimum. However, in the later stage of the algorithm iteration, smaller ω should be used to ensure strong local search capabilities, which is conducive to the convergence of the algorithm.
In response to the above problem, this paper proposes a strategy for adaptively changing ω according to the number of iterations and the current fitness value, as shown in Equation (24), where k is the current number of iterations, k + 1 is the next iteration, K is the maximum number of iterations, ω_id^{k+1} is the inertia weight of the d-th component of the i-th particle for the next iteration, ω_0 is the initial value of ω (ω_0 = 0.5 in this paper), f_id^k is the fitness value of the d-th component of the i-th particle obtained in the k-th iteration, f_max^k is the maximum fitness value in the k-th iteration, f_min^k is the minimum fitness value in the k-th iteration, and f_avg^k is the average fitness value in the k-th iteration. At the beginning of the iteration, the weight of the particle changes greatly, and as the number of iterations increases, the weight change decreases. At the same time, the weight change is determined by the fitness function. When the particle fitness is less than or equal to the average fitness, that is, when the accuracy of the classification model is greater than or equal to the average accuracy, the inertia weight decreases; when the particle fitness is greater than the average fitness, the accuracy of the classification model is lower than the average accuracy, and the inertia weight increases.
The increase or decrease of the inertia weight is determined by the number of iterations. At the beginning of the iteration, the increase or decrease of the weight is large, which is convenient for searching in the global and optimal solution neighborhood. At the later stage of the iteration, the increase or decrease of the weight is small. The increase of the weight can avoid falling into the local optimal solution for random search, and the decrease of the weight facilitates the local fine search.
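The behavior described above can be sketched as follows. Equations (22) and (23) are implemented directly, but since the paper's exact adaptive rule (Equation (24)) is not reproduced here, the weight update below is an assumed stand-in: a step that decays with the iteration count, decreasing ω for better-than-average particles and increasing it otherwise. `apso_minimize` and all parameter defaults are illustrative:

```python
import numpy as np

def apso_minimize(f, bounds, n_particles=20, K=100, c1=2.0, c2=2.0,
                  w0=0.5, seed=0):
    """PSO with an iteration- and fitness-dependent inertia weight (sketch)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    d = len(lo)
    x = rng.uniform(lo, hi, size=(n_particles, d))
    v = np.zeros((n_particles, d))
    w = np.full(n_particles, w0)            # per-particle inertia weight
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()    # global best position
    for k in range(K):
        r1, r2 = rng.random((2, n_particles, d))
        v = w[:, None] * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # Eq. (22)
        x = np.clip(x + v, lo, hi)                                      # Eq. (23)
        fx = np.array([f(p) for p in x])
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        g = pbest[np.argmin(pbest_f)].copy()
        # assumed adaptive rule: large changes early, small changes late;
        # better-than-average particles (lower fitness) get a smaller weight
        step = 0.1 * w0 * (1 - k / K)
        w = np.clip(np.where(fx <= fx.mean(), w - step, w + step), 0.1, 0.9)
    return g, pbest_f.min()
```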

The Algorithm of the Proposed APSO-SVM
In this paper, the improved Adaptive Particle Swarm Optimization (APSO) is used to optimize the penalty factor λ and the kernel function parameter σ in the SVM. The specific steps of APSO-SVM are as follows.
Step 1: Input the dataset with labels.
Step 2: Divide the dataset into a training set and a test set, then normalize both sets.
Step 3: Population initialization. Set the number of particles in the initial population as n. Set the range of penalty factor λ to [λ min , λ max ]. Set the range of kernel function parameter σ to [σ min , σ max ]. Initialize the parameter set θ = {ω 0 , c 1 , c 2 , K}. Initialize the position x 0 i , the speed v 0 i , the optimal position p id of i-th particle and the global optimal position p gd . Set fitness error ε.
Step 5: The expression of f ( * ) is shown in Equation (21).
Step 6: If k < K, repeat Step 4 to Step 6. If k ≥ K, end the APSO.
Step 7: Use the optimal solution (λ * , σ * ) to create the SVMs model and use this model for classification.
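The steps above can be sketched end-to-end on synthetic data. In this stand-in (hypothetical names `fitness` and `pso_svm`), candidate (λ, σ) pairs are scored by 3-fold cross-validation error with scikit-learn's `SVC`, a fixed inertia weight replaces the adaptive rule for brevity, and the search ranges for λ and σ are assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(params, X, y):
    """Classification error to minimize; C is the penalty factor λ and
    gamma = 1/(2σ²) matches the paper's RBF kernel."""
    lam, sigma = params
    clf = SVC(C=lam, gamma=1.0 / (2.0 * sigma ** 2))
    return 1.0 - cross_val_score(clf, X, y, cv=3).mean()

def pso_svm(X, y, n=10, K=15, w=0.5, c1=1.5, c2=1.5, seed=0):
    """Steps 3-7 with a plain (fixed-weight) PSO over (λ, σ)."""
    rng = np.random.default_rng(seed)
    lo = np.array([0.1, 0.1])      # [λ_min, σ_min] (assumed ranges)
    hi = np.array([100.0, 10.0])   # [λ_max, σ_max]
    x = rng.uniform(lo, hi, size=(n, 2))
    v = np.zeros((n, 2))
    pbest = x.copy()
    pbest_f = np.array([fitness(p, X, y) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()
    for _ in range(K):
        r1, r2 = rng.random((2, n, 2))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)   # Eq. (22)
        x = np.clip(x + v, lo, hi)                              # Eq. (23)
        fx = np.array([fitness(p, X, y) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[np.argmin(pbest_f)].copy()
    lam, sigma = g
    model = SVC(C=lam, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)
    return model, pbest_f.min()
```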
The flow chart of APSO-SVM is shown in Figure 8.


Data Description
The data set used in this article is from a satellite's telemetry voltage value of its momentum wheel. In this data set, five types of sample with different health statuses are screened out. Stable Change (large) indicates that the momentum wheel voltage value continuously and steadily changes with a large change amplitude. Stable Change (small) indicates that the momentum wheel voltage value continuously and steadily changes with a small change amplitude. Large to Small indicates that the amplitude of the momentum wheel voltage change smoothly transitions from large to small. The above three types of sample all represent that the momentum wheel is in a normal state. Irregular Change indicates that the voltage of the momentum wheel changes irregularly. Sudden Change indicates that the voltage of the momentum wheel has a sudden change, such as the voltage suddenly jumping to 0. Irregular Change and Sudden Change represent that the momentum wheel is in an abnormal state. The time-domain waveforms of the different types of data are shown in Figure 9. To verify the effectiveness of the method proposed in this article, the training set and test set used are shown in Table 3.

Table 3. Label description of the momentum wheel voltage telemetry dataset.

Table 3 lists, for each class label, the corresponding health status and the sizes of the training and test sets. According to the time-domain and frequency-domain feature calculation methods shown in Table 1, the time-frequency feature values of the five types of momentum wheel voltage telemetry data are calculated. The time-domain features are shown in Figure 10, and the frequency-domain features are shown in Figure 11.
According to the time-frequency statistical feature distributions of the different types of data in Figures 10 and 11, the time-domain feature distribution of Sudden Change (SC) is very scattered, but its frequency-domain feature distribution is relatively more concentrated. Intuitively, the peak and peak-to-peak features in the time domain can distinguish the five types of data to a certain extent, and F3 and F4 in the frequency domain can also distinguish the five types of sample to a certain extent.
In order to quantify the ability of different feature values to distinguish samples, we use the feature evaluation method (Laplacian Score and Relief-F Score) in Section 3.1.2 to score the above 25 types of time-domain features and frequency-domain features. The evaluation results are shown in Figure 12. Comprehensively considering the evaluation results of LS and RFS, this paper chooses nine features (peak, peak-to-peak, skewness, kurtosis, F3, F4, F8, F10 and F11), which have higher scores in two evaluation methods, as part of the feature sequence. These high-scoring features describe the amplitude characteristics, fluctuation characteristics and spectral density characteristics of the voltage telemetry signals.

Complexity Feature Analysis
To verify the effectiveness of the proposed complexity feature extraction method of Huffman-multi-scale entropy, this paper analyzes the sample entropy under different sample lengths and different scales.
Taking the normal data type Stable Change (large) as an example to study the impact of sample length on the complexity characteristics, the sample length is taken from 5000 to 25,000 at intervals of 2000. Figure 13 shows the results of calculating the multi-scale entropy and Huffman-multi-scale entropy with each sample length. The scale is from 10 to 300 at intervals of 10. It can be found that when the sample length is 9000 to 15,000, both methods achieve higher sample entropy. Therefore, this paper selects the sample length as 10,000.

At the same time, this paper also calculates the multi-scale entropy and Huffman-multi-scale entropy of the different types of momentum wheel voltage telemetry signals when the sample length is 10,000. The scale ranges from 10 to 300, with an interval of 10. The calculation result is shown in Figure 14. It can be seen from Figure 14 that, for normal data, the results of multi-scale entropy and Huffman-multi-scale entropy are close. For Irregular Change data, the value of Huffman-multi-scale entropy is significantly lower than that of multi-scale entropy. This shows that Huffman-multi-scale entropy has a better distinguishing ability for data with high complexity. It is worth noting that, for abnormal data of the Sudden Change type, the data itself has pulse characteristics, which conceals the fluctuation characteristics of the data before and after the sudden change to a certain extent. Therefore, both multi-scale entropy and Huffman-multi-scale entropy can well describe the increased complexity caused by sudden and large changes in the data.

Based on the above analysis, the feature sequence (HMSE + T/F) is composed of 30 complexity features (scale from 10 to 300 at intervals of 10) and the nine selected time/frequency-domain features.

Anomaly Detection Results and Discussion
This paper uses a five-class SVM model based on the DAG method, and uses the proposed APSO to train the classification model. To verify the effectiveness of the proposed method on the spacecraft anomaly detection problem, this paper not only calculates the recognition accuracy (RA) of the classification model for each category, but also calculates the detection rate (DR), false alarm rate (FAR) and missed alarm rate (MAR). The calculation methods of RA, DR, FAR and MAR are shown in Equations (25)-(28).

RA = Num(predicted = true) / Num(true) × 100%, (25)
DR = Num(NN + FF) / Num(true) × 100%, (26)
FAR = Num(NF) / Num(N) × 100%, (27)
MAR = Num(FN) / Num(F) × 100%, (28)
where Num(predicted = true) is the total number of category predictions that are exactly the same as the true value, Num(true) is the total number of test samples, Num(NN + FF) is the total number of real normal data predicted as normal data plus real abnormal data predicted as abnormal data, Num(NF) is the total number of real normal data predicted as abnormal data, Num(N) is the total number of real normal data, Num(FN) is the total number of real abnormal data predicted as normal data, and Num(F) is the total number of real abnormal data. Figure 15 shows the parts of the confusion matrix corresponding to the false alarm rate and the missed alarm rate. C(large) is the Stable Change (large), C(small) is the Stable Change (small), L-S is the Large to Small, IC is the Irregular Change, and SC is the Sudden Change.
Meanwhile, the probability of MSE-T/F-PSO-SVM identifying Irregular Change as Stable Change (large) and Stable Change (small) reaches 16.67% and 7.67%, respectively. The distinction between Stable Change (large), Stable Change (small) and Irregular Change can be effectively improved by calculating the Huffman-multi-scale entropy. This conclusion is also consistent with the result in Figure 14.
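Equations (25)-(28) can be computed directly from label vectors. The label strings and the `NORMAL` set below follow the abbreviations used for the confusion matrix, and `detection_metrics` is an illustrative name:

```python
import numpy as np

# normal classes per the data description: Stable Change (large/small), Large to Small
NORMAL = {"C(large)", "C(small)", "L-S"}

def detection_metrics(y_true, y_pred):
    """RA, DR, FAR and MAR of Equations (25)-(28), in percent.
    Assumes both normal and abnormal samples are present."""
    t, p = np.asarray(y_true), np.asarray(y_pred)
    t_norm = np.isin(t, list(NORMAL))   # real normal?
    p_norm = np.isin(p, list(NORMAL))   # predicted normal?
    ra = 100.0 * np.mean(t == p)                # (25) exact class match
    dr = 100.0 * np.mean(t_norm == p_norm)      # (26) NN + FF over all samples
    far = 100.0 * np.mean(~p_norm[t_norm])      # (27) NF / N
    mar = 100.0 * np.mean(p_norm[~t_norm])      # (28) FN / F
    return ra, dr, far, mar
```

Note that DR only requires the normal/abnormal split to be correct, while RA requires the exact class, so DR ≥ RA for any prediction.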
In addition, the recognition accuracy of Large to Small and Sudden Change of the three methods has reached 100%, which shows that the feature sequence and anomaly detection model selected in this paper have strong sensitivity to signals with a definite change rule.
To further verify that the method proposed in this paper can effectively improve the accuracy of spacecraft anomaly detection and reduce the rates of false alarms and missed alarms, this paper compares the proposed method with other methods and calculates the anomaly detection indicators under different processing methods. Principal Component Analysis (PCA), Random Forest (RF), Logistic Regression (LR), K-Nearest Neighbor (KNN) and Multilayer Perceptron (MLP) are used in this paper. The results of the recognition accuracy, false alarm rate and missed alarm rate of the different methods are shown in Table 4.

From the results in Table 4, it can be seen that: First, the recognition accuracy and detection rate of the method proposed in this paper reach 99.60% and 99.87%, which are higher than those of the other methods listed in the table; the false alarm rate is reduced to 0, and the missed alarm rate is reduced to 0.34%, which are lower than those of the other methods. Second, the detection methods based on the feature sequence (HMSE + T/F) have a higher recognition accuracy and detection rate as well as a lower false alarm rate and missed alarm rate than the detection methods based on the original data. Third, by comparing the standard deviations of the various indicators, it can be found that the feature sequence based on Huffman-multi-scale entropy and time-frequency-domain features proposed in this paper can effectively improve the stability of the detection method.

Conclusions
In this research, we propose a new detection framework for anomaly detection based on spacecraft telemetry data. Due to the very low frequency characteristics of telemetry data, most frequency analysis methods are not suitable for spacecraft anomaly detection. Therefore, this paper first proposes a feature sequence construction method based on time-domain and frequency-domain feature screening and complexity feature fusion. On this basis, a new method of Huffman-multi-scale entropy (HMSE) based on the Huffman coding principle is proposed. To improve the classification accuracy of SVM, this paper adopts a multi-class SVM model based on the DAG principle, and proposes an improved adaptive particle swarm optimization (APSO) to train the SVM model. Then we apply the proposed method to the voltage telemetry data set of the satellite momentum wheel. Compared with other methods, the results show that the proposed method has a good performance in improving the recognition accuracy and detection rate, and it can also effectively reduce the false alarm rate and the missed alarm rate. Therefore, the method proposed in this paper has a good development prospect in the field of anomaly detection of spacecraft.
In the future work, more real-world datasets will be applied to verify the effectiveness of the detection ability of the proposed method. In addition, more methods based on artificial neural networks will be studied to further improve the versatility of anomaly detection methods.