Optimal Compensation of MEMS Gyroscope Noise Kalman Filter Based on Conv-DAE and MultiTCN-Attention Model in Static Base Environment

Errors in microelectromechanical systems (MEMS) inertial measurement units (IMUs) are large, complex, nonlinear, and time varying, so traditional model-based noise reduction and compensation methods are not applicable. This paper proposes a noise reduction method based on multi-layer combined deep learning for the MEMS gyroscope in the static base state. In this method, a combined model of the MEMS gyroscope is constructed from a Convolutional Denoising Auto-Encoder (Conv-DAE) and a Multi-layer Temporal Convolutional Network with Attention Mechanism (MultiTCN-Attention). Based on the robust data-processing capability of deep learning, noise features are extracted from past gyroscope data, and optimizing the parameters of the Kalman filter (KF) with the Particle Swarm Optimization (PSO) algorithm significantly improves the filtering and noise reduction accuracy. The experimental results show that, compared with the original data, the combined model proposed in this paper reduces the noise standard deviation by 77.81% and 76.44% on the x and y axes, respectively; compared with the existing MEMS gyroscope noise compensation method based on the Autoregressive Moving Average model with Kalman filter (ARMA-KF), it reduces the noise standard deviation by 44.00% and 46.66% on the x and y axes, respectively, reducing the noise impact by nearly three times.


Introduction
MEMS gyroscopes are characterized by small size, low power consumption, low cost, and high cost performance [1]. They readily serve as actuators or as key inertial navigation nodes in small systems, such as drone remote sensing gimbals [2], aviation pods [3,4], and navigation terminals [5,6], where they play an important role. High-precision MEMS gyroscopes can already meet the needs of practical engineering projects, so reducing the noise of MEMS gyroscopes and improving measurement accuracy has become a hot issue.

Methods
This section clarifies the methods and principles proposed in the article and provides corresponding theoretical support for the subsequent experimental verification.

Data Reconstruction Based on Convolutional Denoising Auto-Encoder
The convolutional denoising autoencoder (Conv-DAE) model consists of an encoder and a decoder: the encoder quickly compresses the original signal and maps it to a feature representation in a low-dimensional feature space, while the decoder reconstructs this feature representation and restores it to the original signal; the basic structure is shown in Figure 1 [26]. The Conv-DAE model enables efficient and accurate feature extraction of the original signal in feature space by minimizing the error between the noisy or corrupted original signal and the reconstructed original signal [27]. Compared to conventional DAEs, Conv-DAE has the same basic encoder-decoder structure but replaces the fully connected layers with convolutional layers. As deep-structured convolutional neural networks (CNNs) are easy to train, Conv-DAE, as a particular type of CNN, can improve the reconstruction capability by using a deep structure [28,29].
The Conv-DAE model structure proposed in this paper is shown in Figure 2 below. The model has a symmetric encoder-decoder structure, where the encoder consists of two convolutional layers and two max-pooling layers, and the decoder consists of three transposed convolutional layers and two upsampling layers. Each convolutional layer in the encoder uses a 1 × 5 filter to extract the various feature vectors, and each transposed convolutional layer in the decoder also uses a 1 × 5 filter to reduce and aggregate the feature vectors. Details of the structure are shown in Table 1.
The convolutional layer, the max-pooling layer, the transposed convolutional layer, and the upsampling layer are the main structures for feature extraction in the Conv-DAE model proposed in this paper, with the following operational equations:

x_j^l = f( ∑_{i∈M_j} x_i^{l−1} ∗ ω_{ij}^l + b_j^l )  (1)

x_j^k = max_{n×n}( x_j^{k−1}(n) )  (2)

In Equation (1), x_j^l is the current convolutional layer output features, x_i^{l−1} is the previous layer output features, f(·) is the activation function, ω_{ij}^l is the current convolutional layer convolutional kernel, ∗ denotes convolution, M_j is the connection between x_j^l and the output features of the previous layer, and b_j^l is the current convolutional layer's corresponding bias. In Equation (2), x_j^k(n) is the jth convolutional kernel of the kth layer, n is the edge length of the convolutional kernel size, and max is the maximum function. In addition, the transposed convolutional layer in the decoder can be regarded as the inverse process of the convolutional layer in the encoder [30].
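As a concrete illustration of the operations in Equations (1) and (2), the sketch below applies a 1-D convolution and a non-overlapping max-pooling to a short sequence in plain Python; the kernel values, identity activation, and window sizes are illustrative assumptions, not the paper's trained parameters.

```python
def conv1d(x, kernel, bias=0.0, activation=lambda v: v):
    """Valid 1-D convolution, Eq. (1): one output per full kernel placement."""
    n = len(kernel)
    return [activation(sum(x[i + j] * kernel[j] for j in range(n)) + bias)
            for i in range(len(x) - n + 1)]

def max_pool1d(x, size):
    """Non-overlapping max-pooling, Eq. (2), with window `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0]
features = conv1d(signal, kernel=[0.5, 0.5])   # moving-average kernel (illustrative)
pooled = max_pool1d(features, size=2)
```

In the real model these operations are stacked (two conv + two pooling layers in the encoder) and the kernels are learned by minimizing the reconstruction error.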

Deep Neural Networks with Temporal Convolutional Neural Layers
The temporal convolutional network (TCN) [31] is primarily a temporal model based on convolutional neural networks. Unlike standard convolutional neural networks, TCN employs causal convolution for processing time series data and uses dilated convolution to cope with the long-distance dependency problem common in time series models. The basic structure of a temporal convolutional network consists of causal convolution, dilated convolution, and residual connections, as shown in Figure 3.


(a) Causal Convolution
Causal convolution is a fundamental architecture of temporal convolutional networks, and Figure 3 shows the structure of a causal convolution stack. For a one-dimensional time series input X = (x_0, x_1, ..., x_t, ..., x_T), the output y_t at time t depends only on the current input x_t and past inputs (i.e., x_{t−1}, x_{t−2}, x_{t−3}, ..., x_0), not on any future input (i.e., x_{t+1}, x_{t+2}, x_{t+3}, ..., x_T). Therefore, the output information of the temporal convolutional network is only affected by past input information, avoiding any "leakage" of future information into the past. In addition, causal convolution is susceptible to the limitations of the receptive field, i.e., the output can only be predicted by receiving information from a shorter history size [32].

(b) Dilated Convolution
The traditional convolution operation process involves convolving the sequence once and then pooling it to reduce the sequence's size and expand the receptive field. One of its main disadvantages is that some sequence information is lost during the pooling process. In contrast, dilated convolutions involve no pooling process but gradually increase the perceptual length through a series of dilated convolutions, so that the output of each convolution contains rich information for long-term tracking [33]. Therefore, dilated convolutions are well suited to long-dependency sequence problems, such as speech and signal processing, weather forecasting, etc. For a one-dimensional time series input X = (x_0, x_1, ..., x_t, ..., x_T) and a filter F: {0, 1, 2, ..., m − 1}, the dilated convolution H(·) at sequence element T is defined as follows:

H(T) = (X ∗_d F)(T) = ∑_{i=0}^{m−1} F(i) · x_{T−d·i}

where m denotes the filter size, d denotes the dilation factor, ∗ denotes convolution, and T − d · i indexes the past direction.
The dilation operation can be thought of as introducing a fixed step between every two adjacent filter taps. Each layer consists of a set of dilated convolutions with rate parameter d_l, a non-linear activation f_nl(·), and a residual connection combining the input and convolution signals of the layer. d_l increases with the number of consecutive layers within the block, calculated by d_l = 2^l. The convolution operation only works between the two timestamps t and t − d_l. Specifically, the filters can be parameterized by a weight matrix W = [W_0, W_1] and a bias vector b, where W_i ∈ R^{F_w×F_w}, b ∈ R^{F_w}, and F_w represents the number of filters. Ẑ_t^{(j,l)} and Z_t^{(j,l)} are the results of the dilated convolution and of adding the residual connection at time t, respectively, denoted as

Ẑ_t^{(j,l)} = f_nl( W_0 Z_{t−d_l}^{(j,l−1)} + W_1 Z_t^{(j,l−1)} + b )

Z_t^{(j,l)} = Z_t^{(j,l−1)} + V Ẑ_t^{(j,l)} + e

where V ∈ R^{F_w×F_w} denotes the weight matrix, and e ∈ R^{F_w} denotes the bias vector of the residual connection [34].
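The dilated causal convolution H(T) defined above can be sketched in a few lines of plain Python; the zero padding of the pre-sequence past and the filter values are illustrative assumptions. Doubling d shows how the same two-tap filter reaches twice as far into the history.

```python
def dilated_causal(x, filt, d):
    """H(t) = sum_i filt[i] * x[t - d*i]; out-of-range past treated as zero."""
    m = len(filt)
    out = []
    for t in range(len(x)):
        s = 0.0
        for i in range(m):
            k = t - d * i          # step back d samples per filter tap
            if k >= 0:
                s += filt[i] * x[k]
        out.append(s)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y1 = dilated_causal(x, [1.0, 1.0], d=1)   # sums of adjacent samples
y2 = dilated_causal(x, [1.0, 1.0], d=2)   # sums of samples two steps apart
```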

(c) Residual Connections
Residual connections have proven to be an effective method to train deep networks, which allow the network to pass information across layers [31,33]. In addition, the receptive field size of TCN can be enlarged by changing the number of hidden layers in residual connections, and the problem of vanishing gradients in the process of training neural networks can be avoided.
One branch of the residual block performs the transformation operation G(·) on the input X^(h−1), and another branch applies a straightforward (identity) transform to keep the number of feature maps in parallel with the existing branch. The output X^(h) of the hth residual block can be expressed as:

X^(h) = δ( X^(h−1) + G(X^(h−1)) )

where δ(·) indicates the activation operation and G(·) is a series of transformation operations. As shown in the right half of Figure 3, the residual connection structure includes dilated causal convolutional layers, WeightNorm layers, activation layers, and dropout layers. Among them, the dilated causal convolution layer is composed of the aforementioned causal convolution and dilated convolution and is used to extract hidden features from the input; the WeightNorm layer improves training speed by limiting the weight range; the activation layer adopts the well-converging Rectified Linear Unit (ReLU); and the dropout layer provides regularization to counter the overfitting problem of deep networks. Therefore, in contrast to long short-term memory and gated recurrent neural networks: (1) TCN can perform convolutions in parallel due to its parallelism; (2) TCN can adjust the receptive field size through the number of layers, the dilation factor, and the filter size, which allows control over the memory size of the model for different domain requirements; (3) in the depth direction of the network, since TCN uses residual connections, the gradient in TCN is more stable when the input length is very long. Based on the above characteristics, the temporal convolutional network can effectively avoid the vanishing or exploding gradient problems of recurrent neural networks.
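A minimal sketch of the residual computation described above, with ReLU as the activation δ(·); the transform G used here is a simple scaling that stands in, purely for illustration, for the dilated-causal-convolution / WeightNorm / ReLU / dropout stack of Figure 3.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def residual_block(x, transform):
    """Output of one residual block: activation of input plus transformed input.
    The skip path carries x through unchanged, which is what stabilizes gradients."""
    g = transform(x)
    return relu([a + b for a, b in zip(x, g)])

# Hypothetical transform G: a 0.5x scaling standing in for the conv stack.
out = residual_block([1.0, -2.0, 3.0], lambda v: [0.5 * a for a in v])
```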

Attention Mechanism
The attention mechanism is a simulation of how the human brain assigns attention, and its essence is to change the weight of features in the hidden layer [35]. The attention mechanism can reasonably filter out a small number of critical features from a large number of features and assign more weight to them, reducing the weight of non-key features to highlight the impact of the critical ones. Fusing attention mechanisms with temporal convolutional networks can highlight key features and improve prediction accuracy. The structural principle of the attention mechanism is shown in Figure 4.
In Figure 4, x_t (t ∈ [0, n]) is the input to the deep neural network, h_t (t ∈ [0, n]) is the corresponding hidden layer output obtained by passing each input through the deep neural network, a_t (t ∈ [0, n]) is the attention weight assigned by the attention mechanism to the hidden layer output, and y_t (t ∈ [0, n]) is the output value of the introduced attention mechanism. The calculation of the attention weight coefficients can be expressed as:

e_t = u · tanh( w h_t + b )

a_t = exp(e_t) / ∑_j exp(e_j)

y_t = ∑_t a_t h_t

where e_t represents the attention weight determined by the output layer vector h_t of the deep neural network at time t, u and w are the weight coefficients, b is the bias coefficient, and y_t is the output of the attention layer at time t. The attention mechanism automatically calculates the corresponding weight assignments for the deep features and merges them into a new vector.
The input to this layer is the output vector of the deep neural network; the Permute layer rearranges the dimensions of the input according to a given pattern, and the Multiply layer multiplies the attention output with the output of the deep neural network bit by bit, achieving a dynamic weighting of the hidden layer units and thus highlighting the impact of critical features on the final result [36].
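The attention weighting described above can be sketched as follows, with scalar u, w, and b as illustrative stand-ins for the learned parameters: each hidden output is scored with a tanh, the scores are softmax-normalized into weights, and the hidden outputs are combined by those weights.

```python
import math

def attention(h, u=1.0, w=1.0, b=0.0):
    """Score each hidden output, softmax-normalize, return weights and weighted sum.
    u, w, b are scalar stand-ins for the learned attention parameters."""
    e = [u * math.tanh(w * ht + b) for ht in h]     # e_t = u * tanh(w * h_t + b)
    m = max(e)                                      # shift for numerical stability
    exp_e = [math.exp(v - m) for v in e]
    z = sum(exp_e)
    a = [v / z for v in exp_e]                      # softmax weights a_t
    y = sum(at * ht for at, ht in zip(a, h))        # weighted combination
    return a, y

a, y = attention([0.1, 2.0, 0.3])   # the largest hidden output gets the most weight
```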

Multi-Layer Deep Learning Network Combination Model
In order to further improve the prediction performance of the MultiTCN-Attention model, this article proposes a method based on the combination of the convolutional denoising autoencoder and the MultiTCN-Attention model. After data reconstruction is carried out through the convolutional denoising autoencoder model, the output is used as the input of the MultiTCN-Attention model for prediction processing. The specific structure and parameter configuration of the convolutional denoising autoencoder model are described in Section 2.1 of this paper. When the MEMS gyroscope is sampled for a long time, packet loss will occur due to the limitations of the communication between the MEMS gyroscope and the host computer equipment. Therefore, to imitate this phenomenon, 5% of the original MEMS gyroscope data are randomly damaged and reset to serve as the input data of the convolutional denoising autoencoder, and are compared against the original data. The data reconstruction operation is performed as shown in Figure 5. The reconstructed data output by the convolutional denoising autoencoder are used as the input to the next model. The MultiTCN-Attention model was chosen to build the multi-layer TCN, and the addition of an attention mechanism layer makes the multi-layer TCN focus more on what is beneficial to the outcome. The output layer is a fully connected layer that accepts the output vector from the attention-weighted processing and processes it into the predicted value of the MEMS gyroscope. The detailed parameter configuration of the MultiTCN-Attention model is described in a later section.
As can be seen, the input vector starts at the input layer and is processed by several TCN layers before entering the attention mechanism, which calculates the attention weight vector based on the current input vector and merges the two to obtain a new vector; this is fed into the fully connected layer to output the predicted value.


Particle Swarm Optimization Algorithm for Optimal Kalman Filter and Others
The Kalman filter is a recursive filter (autoregressive filter) capable of estimating the state of a dynamic system from a series of incomplete and noise-containing measurements by considering the joint distribution at each time based on the values of each measurement at different times, thus producing an estimate of the unknown variables [37]. Kalman filtering mainly includes two parts: the prediction process and the update process. It is assumed that the state-space model of the system (state equation and measurement equation) is as follows:

x_k = Φ_{k/k−1} x_{k−1} + B_{k−1} w_{k−1}

y_k = H_k x_k + v_k

where x_k is the system state vector, Φ_{k/k−1} is the system state transition matrix, B_{k−1} is the system noise driving matrix, and w_{k−1} is the state excitation noise or system noise; y_k is the measurement vector, H_k is the measurement matrix, and v_k is the measurement noise. Moreover, assume that w_k and v_k are mutually uncorrelated Gaussian white noise sequences with zero mean, satisfying:

E[w_k] = 0, E[w_k w_j^T] = Q_k δ_{kj}

E[v_k] = 0, E[v_k v_j^T] = R_k δ_{kj}

E[w_k v_j^T] = 0

In the prediction process, the current system state vector is predicted from the previous moment's system state vector such that:

x̂_{k/k−1} = Φ_{k/k−1} x̂_{k−1}

P_{k/k−1} = Φ_{k/k−1} P_{k−1} Φ_{k/k−1}^T + B_{k−1} Q_{k−1} B_{k−1}^T

where x̂_{k/k−1} is the predicted value of the system state vector, and P_{k/k−1} is the predicted covariance matrix of the system state vector.
In the update process of the Kalman filter, the current system state vector is updated with the measurement vector such that:

K_k = P_{k/k−1} H_k^T ( H_k P_{k/k−1} H_k^T + R_k )^{−1}

x̂_k = x̂_{k/k−1} + K_k ( y_k − H_k x̂_{k/k−1} )

P_k = ( I − K_k H_k ) P_{k/k−1}

where K_k is the Kalman filter gain matrix, and P_k is the updated covariance matrix of the system state vector.
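For the scalar case the prediction and update steps above collapse to a few lines. The sketch below runs one predict/update cycle per measurement; the Φ, H, Q, and R values are illustrative assumptions, not the tuned parameters of this paper.

```python
def kalman_step(x_hat, P, y, phi=1.0, H=1.0, Q=1e-4, R=0.01):
    """One scalar Kalman predict/update cycle (illustrative Q and R)."""
    # Prediction
    x_pred = phi * x_hat
    P_pred = phi * P * phi + Q
    # Update
    K = P_pred * H / (H * P_pred * H + R)       # Kalman gain
    x_new = x_pred + K * (y - H * x_pred)       # correct with the innovation
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new

# Filtering a constant signal observed with small perturbations:
x_hat, P = 0.0, 1.0
for y in [0.9, 1.1, 1.05, 0.95, 1.0]:
    x_hat, P = kalman_step(x_hat, P, y)
```

After a handful of measurements the estimate settles near the true constant and the covariance P shrinks, which is the behavior the PSO-tuned filter exploits.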

Kalman Filter Based on ARMA Model
The Autoregressive Moving Average (ARMA) model is obtained by regressing the dependent variable on its lagged values as well as the present and lagged values of the random error term [38]. Moreover, it is one of the standard methods used in time series analysis. The ARMA model can be expressed as follows:

x_k = ∑_{i=1}^{p} ϕ_i x_{k−i} + ε_k + ∑_{j=1}^{q} θ_j ε_{k−j}

That is, the autoregressive moving average model ARMA(p, q). p and q are the orders of the autoregressive (AR) and moving average (MA) parts, respectively. In addition, p expresses the number of lags of the time series data itself that are used, and q represents the number of forecast-error lags used in the forecast model; they are determined by the nature of the time series data. x_k is the observed time series data; ε_k represents a discrete white noise sequence with mean 0 and variance σ². |ϕ_i| < 1 (i = 1, 2, ..., p) is the autoregressive coefficient, and |θ_j| < 1 (j = 1, 2, ..., q) is the moving average coefficient.
The steps for designing the Kalman filter using the ARMA model are as follows [39][40][41]: (1) data pre-processing, including the removal of wild values, removal of constant components and extraction of trend terms, and data testing; (2) determination of the model type based on the autocorrelation function and partial autocorrelation function; (3) determination of the order based on the Akaike Information Criterion; and (4) adaptive testing of the designed model.
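Once p, q, and the coefficients have been identified by the steps above, one-step-ahead ARMA prediction is a simple recursion over past observations and past residuals. The sketch below makes that recursion explicit in plain Python; the coefficients are illustrative, not values identified from the gyroscope data.

```python
def arma_predict(xs, phi, theta):
    """One-step-ahead ARMA(p, q) predictions with residual feedback.
    phi (AR) and theta (MA) are assumed already identified and estimated."""
    p, q = len(phi), len(theta)
    eps = []      # running residual estimates, standing in for the noise terms
    preds = []
    for k in range(len(xs)):
        ar = sum(phi[i] * xs[k - 1 - i] for i in range(p) if k - 1 - i >= 0)
        ma = sum(theta[j] * eps[k - 1 - j] for j in range(q) if k - 1 - j >= 0)
        pred = ar + ma
        preds.append(pred)
        eps.append(xs[k] - pred)   # residual used by the MA part next step
    return preds

# A noiseless AR(1) series with phi = 0.5 is predicted exactly after the first step:
preds = arma_predict([1.0, 0.5, 0.25, 0.125], phi=[0.5], theta=[0.0])
```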

Particle Swarm Optimization Algorithm for Optimal Kalman Filter
In order to further improve the accuracy of the Kalman filter, in addition to using the traditional ARMA time series modeling, this paper chooses to optimize the parameters of the Kalman filter using the particle swarm optimization algorithm. Particle swarm optimization has attracted more researchers because of its flexibility and robustness, especially for problems in dynamic environments. PSO is a swarm-based stochastic optimization technique inspired by social behaviors such as bird flocking or fish flocking [42].
As shown in Figure 6a, suppose a flock of birds is randomly searching for food. Additionally, suppose a piece of food that is known to be in a particular area, but none of the birds know exactly where it is. However, they can use their own experience (optimal individual choice) and group experience (optimal global choice) to predict how far away the current location is from the food to find the location of the food quickly [43]. This bird predation mechanism inspires the particle swarm optimization algorithm, so the basis of PSO is the group sharing of information.
The particle swarm optimization algorithm consists of a large swarm of particles in which n particles fly in the D-dimensional space. Each particle maintains its position x_i, the direction and speed of its movement v_i, and the fitness value of the best position it has searched p_i in the D-dimensional space. The flying speed, position, and inertia weight of particle i are adjusted according to Equations (21)-(23):

v_id^{k+1} = ω v_id^k + c_1 · rand(·) · ( p_ib − x_id^k ) + c_2 · rand(·) · ( p_gb − x_id^k )  (21)

x_id^{k+1} = x_id^k + v_id^{k+1}  (22)

ω = ω_max − ( ω_max − ω_min ) · iter / iter_max  (23)
In the formula, d and k represent the dimension and the number of iterations, respectively; b represents the bth generation. p ib represents the best position of particle i, p gb represents the current best position; c 1 and c 2 represent the individual learning factor and group learning factor, respectively; rand(·) is used to obtain random values in the range of [0, 1]. ω is the inertia weight used to balance the global search ability and local search ability, which can be updated iteratively by using Equation (23); ω max , ω min are the maximum and minimum inertia weights, respectively; iter, iter max are the current and maximum number of iterations, respectively.
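The update rules of Equations (21)-(23) can be sketched as a compact PSO loop. The swarm size, iteration count, learning factors, inertia bounds, and search bounds below are illustrative choices, and the sphere function stands in for the innovation-variance objective used in this paper.

```python
import random

def pso(objective, dim=2, n=20, iter_max=60,
        c1=1.5, c2=1.5, w_max=0.9, w_min=0.4, lo=-5.0, hi=5.0):
    """Minimal PSO following Eqs. (21)-(23); all hyperparameters illustrative."""
    rng = random.Random(0)
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    vs = [[0.0] * dim for _ in range(n)]
    pbest = [x[:] for x in xs]                     # personal best positions
    pbest_f = [objective(x) for x in xs]
    g = pbest[pbest_f.index(min(pbest_f))][:]      # global best position
    for it in range(iter_max):
        w = w_max - (w_max - w_min) * it / iter_max        # Eq. (23)
        for i in range(n):
            for d in range(dim):
                vs[i][d] = (w * vs[i][d]
                            + c1 * rng.random() * (pbest[i][d] - xs[i][d])
                            + c2 * rng.random() * (g[d] - xs[i][d]))  # Eq. (21)
                xs[i][d] += vs[i][d]                                   # Eq. (22)
            f = objective(xs[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = xs[i][:], f
                if f < objective(g):
                    g = xs[i][:]
    return g

best = pso(lambda x: sum(v * v for v in x))   # minimize the sphere function
```

In the filter-tuning application, each particle position encodes candidate Q_k, R_k, Φ, and H_k values and the objective is the innovation variance of Equation (24).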
As shown in Figure 7, when the four parameters Q_k, R_k, Φ, and H_k of the Kalman filter are optimized using the particle swarm optimization algorithm, according to Formula (16) in the update process of the Kalman filter, and in order to avoid premature convergence of the optimization process so that the global optimum can be obtained, the actual variance of the innovation is selected here as the objective function, with its value minimized as the optimization objective. The specific PSO process is shown in Figure 6b. Define the objective function as shown in Equations (24) and (25):

h = (1/N) ∑_{k=1}^{N} ŷ_{k/k−1} ŷ_{k/k−1}^T  (24)

ŷ_{k/k−1} = y_k − H_k x̂_{k/k−1}  (25)

Among them, h is the actual variance of the innovation, and ŷ_{k/k−1} is the innovation sequence generated by the Kalman filter [44].


Validation of the Proposed Method
In this section, the method proposed in the article was tested, the corresponding experimental design and result analysis were given, and the method's validity was verified.

Acquisition of Test Data
This article used the STIM300 IMU (Safran Sensing Technologies, Horten, Norway) as the measured object, composed of a three-axis MEMS gyroscope, a three-axis MEMS accelerometer, and a three-axis MEMS inclinometer. The physical drawing and gyroscope specifications of the STIM300 are shown in Figure 8a and Table 2, respectively. The STIM300 was fixed to a high-precision static base stage, as shown in Figure 8b. The data acquisition flow of the STIM300 is shown in Figure 8c. The data from the STIM300 were sent to the xPC via the RS422 communication interface at a baud rate of 921,600 bps. The xPC decoded the gyroscope data and sent them to the host computer via the network cable. The STIM300 gyroscope was first powered up and then preheated for 20 min at room temperature. Lastly, static test experiments were performed. In order to match the application scenario of the STIM300 gyroscope, the platform to which the gyroscope equipment was adapted was mainly used to measure the pitch angular velocity and yaw angular velocity of the photoelectric stabilization platform. As shown in Figure 8c, the pitch angle was obtained by rotating the plane YOZ about the y-axis, and the yaw angle was obtained by rotating the plane XOZ about the x-axis. Therefore, we mainly studied the x-axis and y-axis angular velocities. The static raw data obtained from the measurement are shown in Figure 9.

Comparison of Training Based on Convolutional Denoising Auto-Encoders
In order to further apply the deep learning model and to construct the ARMA model, this paper used the pre-processing method of the ARMA model on the raw data, mainly including the removal of outliers (wild values), the removal of the constant component, and the extraction of the trend term [39][40][41]. To balance model generality and accuracy, we took the first 80% of the processed x-axis and y-axis data as the training set and the last 20% as the test set.
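The preprocessing steps above (outlier rejection, constant-component removal, trend-term extraction, and the 80/20 split) can be sketched as follows. This is a minimal sketch under stated assumptions: the 3σ outlier threshold and the linear trend model are our choices for illustration, since the exact criteria are those of the methods cited in [39][40][41]:

```python
import numpy as np

def preprocess(raw, train_frac=0.8):
    """Remove outliers, the constant component, and a trend term, then split."""
    x = raw.astype(float).copy()

    # 1) Eliminate wild values: replace samples beyond 3 sigma with the mean
    #    (the 3-sigma threshold is an assumption for this sketch).
    mu, sigma = x.mean(), x.std()
    x[np.abs(x - mu) > 3.0 * sigma] = mu

    # 2) Remove the constant component (the mean of the cleaned series).
    x -= x.mean()

    # 3) Extract and subtract a trend term (assumed linear here).
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    x -= slope * t + intercept

    # 4) First 80% for training, last 20% for testing.
    split = int(train_frac * len(x))
    return x[:split], x[split:]

# 30,000 synthetic samples standing in for one gyroscope axis.
train, test = preprocess(np.random.default_rng(0).normal(0.1, 0.02, 30000))
```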
The deep learning algorithms proposed in this paper were implemented with Tensorflow 2.3.0 (Google, Mountain View, CA, USA) and Keras 2.4.3 (Google, USA) running on the Ubuntu 16.04-LTS x86 64-bit operating system (Canonical Ltd., London, UK). The computer platform was equipped with an Intel i7-4770 CPU (Intel, Santa Clara, CA, USA), 16 GB memory (SK hynix, Icheon-si, Korea), a 2 TB SSD (Samsung, Seoul, Korea), and a GeForce RTX-2080Ti GPU (NVIDIA, Santa Clara, CA, USA). To demonstrate the advantage of convolutional denoising autoencoders, this paper compared the normal denoising autoencoder (Normal-DAE) listed in Table 3 with the convolutional denoising autoencoder listed in Table 1 above. Both adopted the Adam optimization algorithm for updating network parameters, using mean squared error (MSE) as the loss function. The preprocessed x-axis and y-axis data, 30,000 samples each, were used as the input of the denoising autoencoder, and the randomly damaged data were set to account for 5% of the total data volume. The batch_size was set to 200, the number of epochs was set to 100, and input_size was set to (20, 1) for deep learning training.
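The random damage applied to build the denoising autoencoder's training pairs can be sketched as follows. The `corrupt` helper is hypothetical, and zeroing the selected samples is an assumption: the paper only states that 5% of the data were randomly damaged, not how:

```python
import numpy as np

def corrupt(clean, frac=0.05, rng=None):
    """Randomly damage a fraction of samples to form (noisy, clean) DAE pairs.

    Zeroing the selected samples is an assumption for this sketch; the paper
    specifies only that 5% of the data were randomly damaged.
    """
    rng = rng or np.random.default_rng(0)
    noisy = clean.copy()
    idx = rng.choice(len(clean), size=int(frac * len(clean)), replace=False)
    noisy[idx] = 0.0                # damaged samples; the DAE learns to repair them
    return noisy, idx

signal = np.sin(np.linspace(0, 20, 30000))   # stand-in for a gyroscope sequence
damaged, idx = corrupt(signal)
```

The DAE is then trained to map `damaged` windows of shape (20, 1) back to the corresponding `signal` windows.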
The results of the convolutional denoising autoencoder are shown in Table 4 and Figure 10. The noise standard deviation of the MEMS gyroscope signals from the x-axis and y-axis decreased by approximately 23.41% and 28.72%, respectively, after processing by the normal denoising autoencoder, while it decreased by approximately 44.63% and 38.44%, respectively, after processing by the convolutional denoising autoencoder proposed in this paper. These results show that the proposed convolutional denoising autoencoder outperformed the normal denoising autoencoder in terms of noise reduction and signal reconstruction of MEMS gyroscope signals. The signals reconstructed by the convolutional denoising autoencoder were then used as the input for the subsequent processing stages.

The Training Based on Combinatorial Model Compared with Other Neural Networks
To validate the performance of the MultiTCN-Attention model for gyroscope error compensation in the static base environment, this paper used the data reconstructed by the convolutional denoising autoencoder as the input values for deep learning. The MultiTCN model was first explored using the x-axis test set to find appropriate values for the input data step size, number of hidden units, number of hidden layers, and dilation list, with the base settings shown in Table 5; it used the Adam optimization algorithm and the mean squared error (MSE) loss function. Training was subsequently carried out using the determined values. The MultiTCN-Attention network results were compared with the MultiTCN and LSTM networks using the x-axis and y-axis test sets, respectively. As shown in Tables 6-9, the larger the input data stride and the number of hidden layers, the longer the training time per epoch, so a trade-off had to be made between accuracy and computational cost. According to the comparisons, the best results were obtained when the input data stride was 20, the number of hidden units was 128, and the number of hidden layers was 4. While these were not necessarily the globally optimal parameters for the network, they were appropriate values given the computational resources.
For the MultiTCN-Attention model, we set the parameters according to the above conclusions, as shown in Table 10. The attention layer was set to the same length as the input, and the results are shown in Figure 11 and Tables 11 and 12. Figure 11a shows the training losses within 100 epochs; convergence was achieved for all networks. Figure 11b shows the weights of the sequence output values in the total sequence as calculated by the attention mechanism; as shown in the figure, the distribution of attention differs between the x and y axes, being more even on the x-axis and more focused on the front of the input sequence for the y-axis.
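The attention weighting over the sequence outputs (the weights plotted in Figure 11b) can be illustrated with a minimal numpy sketch. The score function here, a learned linear projection followed by a softmax, is an assumption about the specific attention variant; the paper does not spell out the scoring form in this section:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

def attention_pool(h, w):
    """Weight each time step of a hidden-state sequence and pool it.

    h : (T, D) sequence of TCN outputs; w : (D,) scoring vector (assumed
    learned). Returns the per-step attention weights and the context vector.
    """
    scores = h @ w              # one scalar score per time step
    alpha = softmax(scores)     # attention weights over the T steps, sum to 1
    return alpha, alpha @ h     # weighted sum of the hidden states

rng = np.random.default_rng(0)
h = rng.normal(size=(20, 128))  # input length 20, 128 hidden units, per Tables 6-9
w = rng.normal(size=128)
alpha, context = attention_pool(h, w)
```

An even `alpha` corresponds to the distributed attention seen on the x-axis, while a front-loaded `alpha` corresponds to the y-axis behavior described above.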
Tables 11 and 12 show that the MultiTCN-Attention model resulted in a 58.15% and 57.89% reduction in the standard deviation of noise in the x and y axes, respectively, compared to the raw data, proving that the application of the MultiTCN-Attention model in MEMS gyroscope error compensation studies was feasible. In addition, compared with the results of the LSTM and the MultiTCN, the noise standard deviation values of the MultiTCN-Attention model results on the x-axis were reduced by 11.68% and 9.46%, respectively, and the deviation values on the y-axis were reduced by 17.05% and 9.52%, respectively. This indicated that the MultiTCN-Attention model outperformed both networks regarding error compensation.

Optimization of Kalman Filter Parameters Based on Particle Swarm Optimization Algorithm and Others
In this section, the raw data and MultiTCN-Attention combined model results on the x-axis and y-axis were used as measurements, respectively. The parameters of the Kalman filter were estimated by the ARMA model and particle swarm optimization algorithm, and the filtering results were compared.
In order to make the experimental data more extensive and adaptable, the data of the MultiTCN-Attention combined model were no longer analyzed using the ARMA model method in this paper, only the particle swarm optimization algorithm was used to optimize the parameters of the Kalman filter, and the raw data were analyzed using the PSO-KF method and the ARMA-KF method.

Determination of Kalman Filter Parameters Based on ARMA Model
In this paper, the Akaike Information Criterion (AIC) was used to determine the order of the ARMA(p, q) model. As the order increases, the identified model becomes more realistic, but the computational difficulty also grows with the order [45]. Therefore, the maximum order was set to 3, i.e., the maximum value of both p and q was set to 3. The results were as follows: for the raw x-axis data, the identified model was ARMA(3,2); for the raw y-axis data, the identified model was ARMA(2,2), where x_k is the output of the ARMA model and ε_k is the driving white noise with mean 0 and variance δ_ε². The Kalman filter parameters are shown in Table 13, where R is the covariance of the measurement noise. The initial values of the Kalman filter were set as follows: x_1 = [0; 0; 0; 0], and P_1 was the fourth-order identity matrix.
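For the y-axis ARMA(2,2) model, one four-dimensional state-space construction consistent with the initial state x_1 = [0; 0; 0; 0] and the fourth-order P_1 takes the state as [x_k, x_{k-1}, ε_k, ε_{k-1}]. The sketch below uses placeholder coefficients, not the identified ones (those appear in the equations referenced above), and the exact formulation used in the paper may differ:

```python
import numpy as np

# Placeholder ARMA(2,2) coefficients for illustration only:
#   x_k = a1*x_{k-1} + a2*x_{k-2} + e_k + b1*e_{k-1} + b2*e_{k-2}
a1, a2 = 0.6, -0.2
b1, b2 = 0.3, 0.1
var_e = 1e-4          # variance of the driving white noise (delta_e^2)

# State s_k = [x_k, x_{k-1}, e_k, e_{k-1}]^T, driven by e_{k+1}:
#   s_{k+1} = Phi @ s_k + G * e_{k+1}
Phi = np.array([[a1,  a2,  b1,  b2],
                [1.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
G = np.array([1.0, 0.0, 1.0, 0.0])    # noise input vector
Q = var_e * np.outer(G, G)            # process noise covariance
H = np.array([[1.0, 0.0, 0.0, 0.0]])  # the measurement observes x_k
# R (measurement covariance) comes from Table 13 and is not reproduced here.
x1 = np.zeros(4)                      # initial state, as in the paper
P1 = np.eye(4)                        # fourth-order identity matrix
```

With these matrices, the standard Kalman predict/update recursion runs directly on the gyroscope measurements.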

Optimization of Kalman Filter Parameters Based on Particle Swarm Optimization Algorithm
In this paper, the particle swarm optimization algorithm was used to optimize the Kalman filter parameters on both the output of the MultiTCN-Attention combined model and the original data. The optimization process was as follows (see Algorithm 1): the initial values of the Kalman filter were set to x_1 = 0 and P_1 = 1, and the initial parameters were set to Φ_1 = 1, H_1 = 1, Q_1 = 1, and R_1 = 1. The initial parameters of the particle swarm optimization algorithm were set to N = 50, iter_max = 500, ω_min = 0.3, ω_max = 0.4, c_1 = 0.5, and c_2 = 0.6. The iterative process of the particle swarm optimization algorithm is shown in Figure 12, and the parameter estimation results are shown in Table 14.

Algorithm 1: Kalman Filtering optimal solution
Input: A numeric sequence of sensor data.
Begin:
(1) Initialize a population of particles (population size N), including random positions, weights, and velocities.
(2) Evaluate the fitness of each particle according to Equations (24) and (25).
(3) For each particle, compare its fitness value with the best position p_ib it has passed through; if better, take it as the current best position p_ib.
(4) For each particle, compare its fitness value with the global best position p_gb it has passed through; if better, take it as the global best position p_gb.
(5) Adjust the particle velocity and position according to Equations (21)-(23).
(6) Return to step (2) if the end condition is not reached.
Output: The optimized parameters Φ_opt, H_opt, Q_opt, and R_opt, and the filtered sequence.
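The loop of Algorithm 1, wrapped around a scalar Kalman filter, can be sketched as follows. This is a minimal sketch under stated assumptions: Equations (21)-(25) are not reproduced in this section, so the fitness below uses the filtered output's standard deviation as a stand-in (appropriate only for static data; the paper's actual fitness presumably also guards against over-smoothing), and the search bounds, swarm size, and iteration count are illustrative, not the paper's N = 50 and iter_max = 500:

```python
import numpy as np

def kalman_1d(z, phi, h, q, r):
    """Scalar Kalman filter with x_1 = 0 and P_1 = 1, as in the paper."""
    x, p, out = 0.0, 1.0, np.empty_like(z)
    for k, zk in enumerate(z):
        x, p = phi * x, phi * phi * p + q                # predict
        kg = p * h / (h * h * p + r)                     # Kalman gain
        x, p = x + kg * (zk - h * x), (1.0 - kg * h) * p  # update
        out[k] = h * x
    return out

def pso_kf(z, n=20, iters=30, w=0.35, c1=0.5, c2=0.6, seed=0):
    """PSO over (phi, h, q, r); fitness = std of the filtered output (assumed)."""
    rng = np.random.default_rng(seed)
    lo = np.array([0.5, 0.5, 1e-6, 1e-4])   # assumed search bounds
    hi = np.array([1.0, 1.5, 1e-2, 1.0])
    pos = rng.uniform(lo, hi, (n, 4))
    vel = np.zeros((n, 4))
    fit = np.array([np.std(kalman_1d(z, *p)) for p in pos])
    pbest, pfit = pos.copy(), fit.copy()
    gbest = pos[fit.argmin()].copy()
    for _ in range(iters):                   # steps (2)-(6) of Algorithm 1
        r1, r2 = rng.random((n, 4)), rng.random((n, 4))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        fit = np.array([np.std(kalman_1d(z, *p)) for p in pos])
        better = fit < pfit
        pbest[better], pfit[better] = pos[better], fit[better]
        gbest = pbest[pfit.argmin()].copy()
    return gbest, kalman_1d(z, *gbest)

rng = np.random.default_rng(1)
raw = 0.05 + 0.02 * rng.normal(size=1000)   # synthetic static-base rate signal
params, filtered = pso_kf(raw)
```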

Comparison of Kalman Filter Results
The Kalman filtering results are shown in Tables 15 and 16. On the x-axis, compared with the original data, the noise standard deviation of the Kalman filter based on the particle swarm optimization algorithm was reduced by 59.65%, and that of the MultiTCN-Attention-PSO-KF model was reduced by 77.81%; relative to the traditional ARMA-KF noise reduction process, these are further reductions of 25.84% and 44.71%, respectively. On the y-axis, the noise standard deviation of the Kalman filter based on the particle swarm optimization algorithm was reduced by 59.66%, and that of the MultiTCN-Attention-PSO-KF model was reduced by 76.44%; relative to the traditional ARMA-KF process, these are further reductions of 29.88% and 46.66%, respectively. It can be seen that the combined algorithm proposed in this paper can effectively compensate for MEMS gyroscope noise. At the same time, as Figure 13 shows, the filtering result of the combined algorithm was smoother, with only slight fluctuation of the MEMS gyroscope signal, and was closer to the actual value measured on the static base.

Conclusions
This paper proposed a combined method that integrates multiple neural networks and Kalman filtering for MEMS gyroscope error compensation in the static base environment. By comparing the results, the following conclusions were drawn: (1) This paper verified the feasibility of the convolutional denoising autoencoder for recovering and reconstructing the signal when the sensor data are damaged, providing a new idea for signal repair. (2) It was verified that the TCN network with the added attention mechanism outperformed both the standard TCN network and the LSTM network, providing a new way to compensate for MEMS gyroscope error; it was also verified that the compensation of the TCN network was more reasonable than that of the LSTM network. By adding the attention mechanism, the proposed model can distribute its attention over the temporal data rather than concentrating on only part of the sequence. (3) Estimating the Kalman filter parameters with the particle swarm optimization algorithm reduced the noise standard deviation more satisfactorily than the ordinary ARMA model; the calculation process was also more straightforward, and the curve fluctuations were relatively small. Compared to the original data, the noise standard deviation of the filtering result of the combined model proposed in this paper decreased by 77.81% and 76.44% on the x and y axes, respectively. Additionally, the combined model reduced the noise effect by nearly three times compared to the traditional ARMA-KF filtering model, making the sensor output more stable and effective.
In subsequent experiments, we shall perform dynamic experiments to obtain the MEMS gyroscope output, write the trained neural network model into the xPC module of the host computer for online real-time filtering, and build a platform to validate its practical engineering applications.