Multi-Task Data Imputation for Time-Series Forecasting in Turbomachinery Health Prognostics

Abstract: Time-series forecasting is the core of the prognostics and health management (PHM) of turbomachinery. However, missing data often occur for several reasons, such as sensor failure. These partially missing and irregular data greatly affect the quality of time-series modeling and prediction, as most time-series models assume that the data are sampled uniformly over time. Meanwhile, the training process of models requires a large number of samples and much time. For various reasons, it is difficult to obtain a significant amount of monitoring data, and the PHM of turbomachinery has high timeliness and accuracy requirements. To fix these problems, we propose a multi-task Gaussian process (MTGP)-based data imputation method that leverages knowledge transfer across multiple sensors and even equipment. Thereafter, we adopt long short-term memory (LSTM) neural networks to build time-series forecasting models based on the imputed data. In addition, the model integrates denoising and dimensionality reduction methods. The superiority of this integrated time-series forecasting framework, termed MT-LSTM, has been verified in various data imputation scenarios on a synthetic dataset and a real turbomachinery case.


Introduction
Turbomachinery, such as steam turbines, turbocompressors, and gas turbines, is the key equipment type in the electric power, metallurgical, petrochemical, and aviation industries, supporting the national economic lifeline. With the development of technology, turbomachinery is heading toward high levels of speed, integration, precision, and intelligence. High availability and reliability standards must be satisfied [1,2]. However, due to their complex structures, special operating environments, and long service times, the failure rate of turbine units is high, which creates serious damage. Therefore, the prognostics and health management (PHM) of turbomachinery, which involves health monitoring, feature extraction, fault diagnosis, and fault forecasting, has emerged as a crucial issue that must be resolved [3,4].
The main research object of PHM is multivariate time-series data. The acquired original data are distinguished by high accuracy, wide coverage, variable patterns, different correlations, and rapid evolution due to the widespread use of inexpensive sensors and the growth of the Internet of Things. Tools for robust multivariate time-series modeling and analysis are urgently needed. The two main kinds of existing approaches are data-driven methodologies and physics-model-based methodologies. The physics-model-based approaches must provide a thorough mathematical model to represent the system's physical properties and fault mechanisms [5,6]. However, building a physical model of complicated mechanical systems is challenging, and these approaches rely on domain knowledge, which limits their viability and adaptability. Data-driven approaches ease the challenge of fault diagnosis in complicated mechanical systems and have therefore gained widespread attention. Data-driven models have experienced two stages [7]. The first is traditional intelligent models based on machine learning (ML). This paradigm extracts statistical features of signals in the time domain, frequency domain, or time-frequency domain and feeds them into ML models such as support vector machines (SVM) [8], the k-nearest-neighbor algorithm [9], or naive Bayes [10] to identify the status of equipment. In traditional ML methods, many efforts have aimed at the manual design of feature extraction algorithms, as traditional models cannot extract representative features from raw signals. Such feature-designing processes require abundant professional knowledge, experience, and a significant amount of human labor, which limits the application of these methods. The second is modern intelligent models based on deep learning (DL). In contrast, this end-to-end paradigm directly feeds the raw signals (time domain) or pre-processed signals (frequency domain and time-frequency domain) into a deep neural network model to learn the mappings between data and status [11].
Deep learning (DL) is the most popular branch of machine learning and has made significant progress in a wide range of areas [12]. Usually, DL-based methods construct a multi-layer structure to extract deeper and more essential features. Recently, various DL-based methods, for example, restricted Boltzmann machines (RBM) [13,14], convolutional neural networks (CNN) [15,16], recurrent neural networks (RNN) [17], auto-encoders (AE) [18], long short-term memory (LSTM) networks [19], and their variants [20,21], have been widely investigated. They can adaptively obtain effective features from time-domain, frequency-domain, and time-frequency-domain data. Then, the system's operating state can be identified by feeding the features into a regressor or a classifier. However, the training of DL models requires many samples, which is difficult to achieve in practice due to data scarcity. Moreover, the training process requires a lot of time, as each layer needs a huge number of training epochs [22,23].
To resolve the first drawback, Lim et al. [24] proposed an expedient assessment framework for LED reliability. Multi-output Gaussian process regression (MOGPR) was performed on sparse reference data to establish an acceptable threshold for the degradation signature. The framework achieved a reduction in the validation time from 6000 h to 100 h. Jia et al. [25] presented a method named deep normalized convolutional neural network (DNCNN) with normalized layers and a weight normalization strategy. This approach effectively increases fault diagnosis accuracy with a limited number of fault samples. Ensemble deep auto-encoders (EDAEs), a framework that includes several auto-encoders with various properties, was proposed by Shao et al. [26]. The accuracy of this approach can reach 93.5% with 12 bearing health statuses and 100 fault samples. Based on the generative adversarial network (GAN), Mao et al. [27] proposed a fault diagnosis approach in which the model's generalization ability was improved by generating synthetic samples for the minority fault class. However, the above methods are all founded on deep neural networks, which require significant training time and epochs and thus impede the model's ability to update quickly.
To resolve the second drawback, Guo et al. [23] proposed a CNN-LSTM deep learning model-based method for fault diagnosis on the hydraulic system of the blanket transfer device Mover in the Chinese Fusion Engineering Test Reactor (CFETR). In the fault diagnosis experiment, the model had the highest accuracy of 98.56% on the test set and good computational efficiency compared to the other three models. Wang et al. [22] obtained a stable distribution of activation values in training by putting batch normalization (BN) in each layer of SAEs. This method can reduce the training time and epochs. Qin et al. [28] proposed a new activation function, Isigmoid, which combines the traditional Sigmoid with ReLU. The results showed that the approach solves the vanishing gradient problem in the backpropagation process of deep networks and has advantages in convergence. Zhuang et al. [29] presented a new method called multi-scale deep convolutional neural network (MS-DCNN). This intelligent method extracts characteristics of various sizes in parallel using convolution kernels of different sizes, which yields a better learning effect, stronger robustness, and reduced training time. Unfortunately, none of the approaches mentioned above can handle few-shot learning or missing data.
To overcome the above disadvantages, we design a multi-task time-series data analysis model for the health management of turbomachinery. To account for the scarce-data scenario in the PHM of turbomachinery, we perform data imputation of the target task by leveraging knowledge from auxiliary data (tasks) through a multi-task Gaussian process (MTGP), which is a non-parametric multi-task model. Thereafter, we employ long short-term memory (LSTM) networks to build a sequence model of the target data upon the imputed time series to make predictions of future behavior. Meanwhile, we integrate denoising and dimensionality reduction techniques. In comparison to solely modeling the scarce sequence data of the target task, the proposed multi-task LSTM (MT-LSTM) provides more accurate and efficient modeling and predictions, which benefits downstream fault diagnosis, remaining useful life estimation, etc.
The remainder of this paper is structured as follows. In Section 2, we expound on the theory and ingredients of the proposed multi-task time-series data analysis model. Section 3 conducts empirical research to verify the superiority of the proposed model. Finally, Section 4 summarizes the conclusions.

Methods
For the time-series forecasting of turbomachinery performance, missing data are commonly encountered due to sensor failure or human error, and they greatly affect the quality of time-series modeling and prediction. Therefore, it is important to fill in the missing time-series data before forecasting. The turbomachinery operation datasets, meanwhile, often have the following characteristics: time series with noise, non-significant periodicity, and high and correlated dimensionality (multi-task). Hence, we need to integrate multiple methods, including data denoising, dimensionality reduction, normalization, imputation, and prediction. The flowchart in Figure 1 presents the process for time-series forecasting using datasets collected from turbomachinery. Suppose that we collected scarce time-series data (the target task) regarding the performance indicator of interest of turbomachinery, and that some related and sufficient time-series datasets (auxiliary tasks) collected from other performance indicators are available. The proposed MT-LSTM first performs data denoising using discrete wavelet packet transform (DWPT), followed by a reduction in dimensionality via probabilistic principal component analysis (PPCA). Thereafter, the MT-LSTM conducts the data imputation of the target task through MTGP. Finally, the LSTM model is built upon the imputed data to predict the future values of the target task. The following sections elaborate on the key ingredients of the proposed MT-LSTM model.


Data Denoising
In the complex working environment of turbomachinery, sensor time-series data usually comprise a wide range of defects, for example, noise, error, and incoherence. Hence, it is necessary to remove the noise or uncertainty characteristics of the data before modeling. In past decades, scholars have proposed many methods to remove noise from sensor data. However, few studies have evaluated multiple defects simultaneously. In this paper, we propose Bayesian wavelet denoising and use multiscale wavelet analysis to analyze the contexts in the signal.
Specifically, we adopt the discrete wavelet packet transform (DWPT) to denoise and multiscale-analyze the original time series. In DWPT, a given one-dimensional time-series signal, f(t_i) (i = 1, 2, . . ., n), with n historical data points, is simultaneously decomposed into a sequence of scaling coefficients, s_j(k), and wavelet coefficients, w_j(k). Then, we can use the inverse wavelet transform and the DWPT coefficients to reconstruct the time series as follows:

f(t_i) = ∑_j ∑_k [s_j(k) φ_{j,k}(t_i) + w_j(k) ψ_{j,k}(t_i)],  (1)

where f(t_i) is the reconstructed time series; k and j are the time and frequency indices, respectively; φ_{j,k}(t_i) is the scaling function; ψ_{j,k}(t_i) is the wavelet function; and the double summation means that the wavelet subspace and scaling subspace are simultaneously decomposed into second-level subspaces, which reveal the time and frequency resolution of the signal.
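As an illustrative sketch of the decomposition step (our own toy example, not the paper's code), the following uses a one-level Haar wavelet for simplicity: it splits a signal into scaling (approximation) and wavelet (detail) coefficients and reconstructs it exactly.

```python
import numpy as np

def haar_step(signal):
    """One level of a Haar wavelet transform: split an even-length
    signal into scaling and wavelet coefficients."""
    x = np.asarray(signal, dtype=float)
    s = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # scaling coefficients s_j(k)
    w = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # wavelet coefficients w_j(k)
    return s, w

def haar_inverse(s, w):
    """Inverse transform: reconstruct the signal from (s, w)."""
    x = np.empty(2 * len(s))
    x[0::2] = (s + w) / np.sqrt(2.0)
    x[1::2] = (s - w) / np.sqrt(2.0)
    return x

f = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
s, w = haar_step(f)
f_rec = haar_inverse(s, w)
assert np.allclose(f_rec, f)  # perfect reconstruction before any thresholding
```

A full wavelet packet transform recursively applies such a split to both the scaling and wavelet subspaces, producing the multi-level coefficients that are thresholded next.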
For the benefit of universality, we let d̂_jk denote the k-th decomposed coefficient at the j-th level (j = j_0, . . ., J − 1; k = 0, 1, . . ., 2^j − 1). According to the Bayesian rule, the posterior distribution of the cleaned coefficients, d_jk, can be derived from the model

d̂_jk | d_jk, σ_j² ∼ N(d_jk, σ_j²),  d_jk | γ_jk, τ_j² ∼ γ_jk N(0, τ_j²) + (1 − γ_jk) δ_0,  (2)

where the symbol ∼ represents "is distributed as"; N(a, b) denotes a normal distribution, where a is the mean and b is the variance; d̂_jk denotes the observed decomposed coefficient; γ_jk is a binary random variable with an independent Bernoulli distribution π_j, i.e., P(γ_jk = 1) = 1 − P(γ_jk = 0) = π_j, which decides whether d_jk is zero (γ_jk = 0) or nonzero (γ_jk = 1); and the variance τ_j² denotes the magnitude of d_jk at the j-th decomposition level. All the coefficients in the j-th level may be assigned the same τ_j² and π_j; σ_j is the noise standard deviation, which is estimated from the wavelet coefficients of the j-th-level DWPT decomposition by dividing the median of the wavelet coefficients by a factor.
Then, a Bayesian hypothesis test is applied to threshold the decomposed coefficients by determining whether to accept the null hypothesis, H_0: d_jk = 0. The thresholding rule is defined as follows:

d_jk = I(η_jk < 1) d̂_jk,  (3)

where η_jk is the posterior odds ratio of γ_jk = 0 versus γ_jk = 1, and I(·) denotes an indicator function. When η_jk < 1, I(·) is equal to unity; thus, H_0 is rejected, and d̂_jk estimates the coefficient d_jk. If not, I(·) is equal to zero, and the coefficient is removed. The cleaned time series f(t_i) is then reconstructed based on Equation (1) using the coefficients d_jk acquired from Equation (3). In summary, we can decompose the time-series data into various levels of wavelet coefficients by the DWPT analysis. Then, a Bayesian hypothesis test is carried out for each level of wavelet coefficients to eliminate possible defects. The posterior-odds-ratio Bayesian method offers a simple way to determine whether the decomposed coefficients have any defects, which helps to avoid over-denoising. Meanwhile, the DWPT method provides more coefficients denoting additional subtle details of the time-series signal, thus avoiding under-denoising.
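The posterior-odds test can be sketched numerically as follows. This is our own hedged illustration of a spike-and-slab odds ratio for wavelet coefficients; the values of σ_j², τ_j², and π_j below are arbitrary toy choices, not estimates from any dataset.

```python
import numpy as np

def gauss_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def bayes_threshold(d_hat, sigma2, tau2, pi_j):
    """Posterior odds eta of H0: d_jk = 0 (gamma = 0) versus gamma = 1.
    Under H0 the observed coefficient is pure noise, N(0, sigma2);
    otherwise its marginal is N(0, sigma2 + tau2). Coefficients with
    eta >= 1 are set to zero, the rest are kept."""
    eta = ((1 - pi_j) * gauss_pdf(d_hat, sigma2)) / (pi_j * gauss_pdf(d_hat, sigma2 + tau2))
    return np.where(eta < 1, d_hat, 0.0), eta

d_hat = np.array([0.05, -0.1, 3.2, -2.7])   # noisy detail coefficients
cleaned, eta = bayes_threshold(d_hat, sigma2=0.25, tau2=4.0, pi_j=0.5)
# the two small coefficients are attributed to noise and removed;
# the two large ones survive the test
```

Note that the odds ratio shrinks toward keeping a coefficient as its magnitude grows relative to the noise scale, which is exactly the behavior the hypothesis test in Equation (3) encodes.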

Dimensionality Reduction
In the analysis of multi-variable time-series data of turbomachinery, due to the correlation between variables and the inherent uncertainty of the data, too many variable dimensions increase the computational burden and the analysis complexity. After cleaning the multivariate time-series data, the PPCA method is mainly used to (1) reduce the data dimensionality, (2) decouple the multivariate correlation, and (3) filter the data uncertainty. The method plays an important role in improving the accuracy of subsequent data analysis and reducing the calculation time.
Let y_i = g(t_i) ∈ R^q represent a q-dimensional real vector and let Y = [y_1, . . ., y_n]^T represent the n × q data matrix: there are q variables, and each variable contains n cleaned time-series data points, g(t_i). Let Φ = [θ_1, . . ., θ_n]^T represent the n × d data matrix, where θ_i ∈ R^d (d ≤ q) represents d unobservable latent variables (factors), each containing n corresponding positions in the latent space. All θ_i are usually defined as independently distributed Gaussian variables with zero mean and unit variance, i.e., θ_i ∼ N(0, I). The latent variable model combines the measurable variable y_i in the relevant data matrix Y with the corresponding uncorrelated latent variable matrix Φ. Then, the model takes the form of a Gaussian distribution:

y_i | θ_i ∼ N(Wθ_i + µ, ψ),  (4)

where ψ = σ²I is the isotropic noise covariance; the q × d weight matrix W represents the relation between y_i and θ_i; and the q mean values acquired from Y form the parameter vector µ, i.e., µ = (1/n) ∑_{i=1}^n y_i. In the lower-dimensional latent space, the projected data can be expressed as

θ̂_i = M_ML^{-1} W_ML^T (y_i − µ),  (5)

where M_ML = σ²_ML I + W_ML^T W_ML, and σ²_ML and W_ML are the maximum-likelihood estimates of σ² and W. Without mean-centering, the projection in Equation (5) has mean M_ML^{-1} W_ML^T µ. The d-dimensional principal component (PC) matrix Φ*, which collects these projections, is used in the following data-driven modeling.
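A minimal numpy sketch of closed-form maximum-likelihood PPCA follows (the standard Tipping-Bishop solution; the data, dimensions, and seed here are made up for illustration, not the paper's settings).

```python
import numpy as np

def ppca_fit(Y, d):
    """Closed-form ML PPCA. Y: n x q data matrix, d: latent dim (d <= q)."""
    n, q = Y.shape
    mu = Y.mean(axis=0)
    C = np.cov(Y - mu, rowvar=False)              # q x q sample covariance
    eigval, eigvec = np.linalg.eigh(C)
    order = np.argsort(eigval)[::-1]              # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    sigma2 = eigval[d:].mean() if d < q else 0.0  # ML noise variance: mean of
                                                  # the discarded eigenvalues
    W = eigvec[:, :d] * np.sqrt(np.maximum(eigval[:d] - sigma2, 0.0))
    return W, mu, sigma2

def ppca_transform(Y, W, mu, sigma2):
    """Posterior mean of the latent variables: M^{-1} W^T (y - mu)."""
    d = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(d)
    return (Y - mu) @ W @ np.linalg.inv(M)

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated 5-D data
W, mu, sigma2 = ppca_fit(Y, d=2)
Phi_star = ppca_transform(Y, W, mu, sigma2)   # n x d principal component matrix
```

The transform mean-centers the data, so the projections have zero mean; skipping the centering would shift them by the amount noted after Equation (5).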

MTGP Model
In the data analysis of turbomachinery, missing data are often caused by damaged sensors, insufficient equipment maintenance, power failures, the aging of hardware and software, communication interruptions, and new equipment recently put into operation. Data imputation and augmentation are crucial to the performance of deep learning models for time-series data, as they can effectively improve the size and quality of the training data. We explore a method for time-series data imputation using the multi-task GP model.
We assume a given dataset T = {x, y}, where x = {x_i | i = 1, . . ., n} is specified as the index set, such as the observed times, and y = {y_i | i = 1, . . ., n} contains the corresponding values of the observed training data. The purpose is to learn a regression model, y = f(x) + ε, where f(x) is a latent function and ε ∼ N(0, σ²) is a noise term. Given the test points x* = {x*_i | i = 1, . . ., p}, we can predict the uncharted test observations y* = {y*_i | i = 1, . . ., p}. Gaussian processes (GP) are useful for time-series modeling [30] because they can accommodate irregular and uncertain observations and provide probabilistic predictions. Gaussian process models suppose that f(x) can be expressed as a probability distribution over functions:

f(x) ∼ GP(m(x), k(x, x′)),  (6)

where m(x) is the mean function and k(x, x′) is a covariance function describing the coupling between the inputs. The prior knowledge of the function behavior we want to model can be encoded by modifying the covariance function. For the vector x ∈ R^n, the covariance matrix K(x, x) has size n × n, and the covariance function k(x_i, x_j) provides the element K_ij. The covariance function usually satisfies Mercer's theorem; that is, K(x, x) must be symmetric and positive semidefinite. Affine transformations of basic covariance functions can be used to construct complex covariance functions.
For a given set T, we can obtain the predictions at the test indices x* by calculating the conditional distribution p(y* | x*, x, y), which is a Gaussian distribution:

p(y* | x*, x, y) = N(m(y*), σ²(y*)),  (7)

where m(y*) is the mean and σ²(y*) is the variance. Without loss of generality, it is usually assumed that the mean function m(x) is zero. Thus, m(y*) and σ²(y*) can be expressed as

m(y*) = K(x*, x) [K(x, x) + σ²I]^{-1} y,  (8)
σ²(y*) = K(x*, x*) − K(x*, x) [K(x, x) + σ²I]^{-1} K(x, x*).  (9)

Furthermore, we assume that X = {x_i^j | j = 1, . . ., m, i = 1, . . ., n_j} are the training indices and Y = {y_i^j | j = 1, . . ., m, i = 1, . . ., n_j} are the observations for m tasks, where task j has n_j training points. In an actual situation, we may encounter tasks whose number of training indices n_j is far lower than that of the other tasks, which poses a challenge for the modeling problem.
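The GP posterior mean and variance above can be computed directly in numpy. The following sketch uses a squared-exponential kernel with illustrative hyperparameters of our own choosing, not the paper's settings.

```python
import numpy as np

def rbf(a, b, length=1.0, amp=1.0):
    """Squared-exponential covariance k(x, x')."""
    d = a[:, None] - b[None, :]
    return amp * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x, y, x_star, noise=1e-2):
    """GP posterior mean and variance at test indices x_star,
    assuming a zero mean function."""
    K = rbf(x, x) + noise * np.eye(len(x))       # K(x, x) + sigma^2 I
    K_s = rbf(x, x_star)                         # K(x, x*)
    K_ss = rbf(x_star, x_star)                   # K(x*, x*)
    alpha = np.linalg.solve(K, y)
    mean = K_s.T @ alpha                         # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s) # posterior covariance
    return mean, np.diag(cov)

x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x)
x_star = np.array([np.pi / 2])
mean, var = gp_predict(x, y, x_star)
# the posterior mean at pi/2 is close to sin(pi/2) = 1, with small variance
```

In practice one would solve with a Cholesky factorization rather than repeated `solve` calls, but the direct form mirrors the written equations most closely.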
The problem of modeling m tasks simultaneously (e.g., multiple time series) motivates the extension to the MTGP model, where all models share the same index set x. Training an STGP (single-task GP) model for each task separately is a straightforward method, as shown in Figure 2a. For the modeling of m tasks via MTGP, to reveal the affiliation of index x_i^j and observation y_i^j to task j, we have to add a label l_j to the model as an additional input, with l_j = j, as illustrated in Figure 2b.
In MTGP, two independent covariance functions, k_c and k_t, can be assumed:

k({x, l}, {x′, l′}) = k_c(l, l′) × k_t(x, x′),  (10)

where k_c denotes the correlation between tasks and depends only on the labels l, and k_t denotes the temporal covariance function within a task and depends only on the indices x.
Let n_j = n for j = 1, . . ., m; the covariance matrix K_MTGP can then be expressed as

K_MTGP(x, l, θ) = K_c(l, θ_c) ⊗ K_t(x, θ_t),  (11)

where l = {j | j = 1, . . ., m}; ⊗ is the Kronecker product; θ_c and θ_t are vectors including the hyperparameters of K_c and K_t; K_c has a size of m × m; and K_t has a size of n × n. Thus, the matrix K_MTGP has a size of mn × mn. We name K_c the correlation matrix. In practice, to adapt the model to task-specific training data, we can relax the assumption n_j = n for j = 1, . . ., m.
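The Kronecker structure of the MTGP covariance can be illustrated with a small numpy example; the task correlation matrix and temporal kernel below are made-up toy values.

```python
import numpy as np

def mtgp_covariance(K_c, K_t):
    """K_MTGP = K_c (x) K_t: a task correlation matrix (m x m) combined
    with a temporal covariance (n x n) yields an mn x mn matrix."""
    return np.kron(K_c, K_t)

m, n = 3, 4
K_c = np.array([[1.0, 0.8, 0.2],
                [0.8, 1.0, 0.5],
                [0.2, 0.5, 1.0]])                      # correlation between tasks
x = np.linspace(0, 1, n)
K_t = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # temporal covariance
K_mtgp = mtgp_covariance(K_c, K_t)
assert K_mtgp.shape == (m * n, m * n)
# symmetric and positive semidefinite, as Mercer's theorem requires
assert np.allclose(K_mtgp, K_mtgp.T)
assert np.linalg.eigvalsh(K_mtgp).min() > -1e-9
```

Because the Kronecker product of two positive semidefinite matrices is positive semidefinite, learning K_c and K_t separately keeps the joint covariance valid.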
This model has several useful features: we may have task-specific training indices n_j (i.e., training data may be observed at task-specific times); the correlation between the tasks is learned automatically when fitting the covariance function in (11); and the framework assumes that the tasks share analogous temporal characteristics, described by the hyperparameters θ_t.
In summary, as MTGP is a non-parametric approach, the number of parameters can grow as more observation data are collected. The advantage of the MTGP model over comparable methods, such as support vector regression (SVR), is that it can easily express prior knowledge of function behavior (e.g., periodicity or smoothness). Given multiple time series, the goal is to increase the overall modeling accuracy by employing a single MTGP model rather than several STGPs. MTGPs have been used in financial time series [31], environmental sensor networks [32], and the classification of ECG signals [33]. It is important to note that MTGPs can include nonuniformly sampled time series in a single model without further downsampling or interpolation. In addition, it is easy to include prior knowledge between time series, such as time shifts or assumptions about similar temporal behavior. Thus, the method has many applications, such as prediction, correlation analysis, the estimation of time shifts between signals, and the compensation of missing data.

LSTM Model
It is very important to predict the operation trend for turbomachinery reliability. Recurrent neural networks (RNN) can predict trends quickly and accurately. However, the traditional RNN has the problem of gradient vanishing or explosion and cannot capture long-term dependence, which affects its practical application [34]. To solve this problem, we established a prediction model based on LSTM (long short-term memory). The LSTM is an improvement on the RNN [35] and is composed of multiple recursively connected recurrent units. Three gate structures, i.e., the forgetting gate, input gate, and output gate, constitute the recurrent unit of LSTM [23]. The advantages of LSTM are mainly embodied in the following aspects: (1) the three gating structures of LSTM can learn long-term memory information and provide a solution to long-term dependence; and (2) the sigmoid function and tanh function are combined to form the activation functions in LSTM, which keep the gradient almost unchanged during backpropagation, avoiding the vanishing or explosion of the gradient and greatly accelerating the convergence of the model.
The LSTM neural network accepts a variety of input forms. Here, we consider univariate time-series data. We suppose that there are n uniformly distributed discrete time points, and y = {y_t | t = 1, . . ., n} is the time variable observed at each time point. These random variables serve as the input of the LSTM neural network.
According to Figure 3, at time t, the LSTM receives three inputs: the network's present input y_t, the previous output h_{t−1}, and the previous cell state C_{t−1}. At the same time, the LSTM has two outputs: the output h_t and the cell state C_t. Note that y_t, h_t, and C_t are all vectors. The LSTM has three gate-like structures that filter the information [36,37].
The forgetting gate controls how much information of the previous cell state C_{t−1} is passed to C_t at the present time:

f_t = σ(W_f · [h_{t−1}, y_t] + b_f),

where σ(·) is the sigmoid activation function, W_f is the weight matrix of the forgetting gate, and b_f is the bias. The input gate controls how much information of the input y_t is passed to C_t at the present time (i.e., it is used to update the cell state):

i_t = σ(W_i · [h_{t−1}, y_t] + b_i),
C̃_t = tanh(W_c · [h_{t−1}, y_t] + b_c),
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,

where i_t is the input gate, W_i is the weight matrix of the input gate, and b_i is the bias; C̃_t is the input cell state, W_c is the weight matrix of the cell state, and b_c is the bias of the cell state.
The output gate determines how much information of C_t is passed to the present output h_t:

o_t = σ(W_o · [h_{t−1}, y_t] + b_o),
h_t = o_t ⊙ tanh(C_t),

where o_t is the output gate, W_o is the weight matrix of the output gate, and b_o is the bias. The final output of the LSTM is determined by the output gate and the cell state.
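The gate equations can be collected into a single recurrent-unit step. The following numpy sketch is our own toy implementation with small random weights (not a trained model); it runs one LSTM cell over a short univariate sequence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y_t, h_prev, C_prev, W, b):
    """One LSTM recurrent-unit step: forget gate f_t, input gate i_t,
    candidate cell state, output gate o_t."""
    z = np.concatenate([h_prev, y_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # how much of C_{t-1} to keep
    i_t = sigmoid(W["i"] @ z + b["i"])       # how much of the input to add
    C_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate (input) cell state
    C_t = f_t * C_prev + i_t * C_tilde       # updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden state / output
    return h_t, C_t

hidden, inp = 4, 1                            # univariate input
rng = np.random.default_rng(1)
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + inp)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, C = np.zeros(hidden), np.zeros(hidden)
for y_t in np.sin(np.linspace(0, 1, 10)):     # run over a short sequence
    h, C = lstm_step(np.array([y_t]), h, C, W, b)
```

Since o_t lies in (0, 1) and tanh(C_t) lies in (−1, 1), every component of h_t is bounded in magnitude by 1, which is part of what keeps the recurrence stable.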

Experiments and Results
In the following sections, we first illustrate the performance of the proposed MT-LSTM on an optical marker (OM) dataset. Thereafter, the superiority of MT-LSTM is further verified on a real turbine operation dataset.

OM Dataset
We first investigated whether the proposed MT-LSTM could augment scarce time-series data among multiple tasks and improve the prediction accuracy. We used multivariate biomedical data from the literature [38]. These data consisted of three optical marker (OM) time series located at the midline of the body: the thorax (OM1), the lower breastbone (OM2), and near the navel (OM3). The three-dimensional positions of all sensors were simplified to their first principal components. In addition, a fourth task corresponded to a respiration belt (RB) placed around the torso near OM2. The monitoring time and sampling frequency (2.6 Hz) were the same for all tasks. Each task included 200 data points, as shown in Figure 4.
We investigated two situations, a normal situation and an extreme situation, to prove the method's validity. The normal situation included a few missing data (named S1) and different sampling frequencies (S2). We named the extreme situation S3. It is important to note that, since the OM dataset was clean and low-dimensional, we did not conduct data pre-processing such as data denoising and dimensionality reduction (Figure 1). In this paper, we used several indicators to evaluate the prediction accuracy. The predicted test observations are denoted as ŷ* = {ŷ*_i | i = 1, . . ., n}. We used the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE), defined as

MAE = (1/n) ∑_{i=1}^n |y*_i − ŷ*_i|,
RMSE = [(1/n) ∑_{i=1}^n (y*_i − ŷ*_i)²]^{1/2},
MAPE = (100%/n) ∑_{i=1}^n |(y*_i − ŷ*_i)/y*_i|.

Lower MAE, RMSE, and MAPE indicate higher prediction accuracy.
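The three accuracy indicators can be implemented in a few lines; the toy arrays below are only for illustration.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root mean square error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat):
    """Mean absolute percentage error, in percent (y must be nonzero)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

y = np.array([2.0, 4.0, 5.0])
y_hat = np.array([2.5, 3.5, 5.0])
# mae  = (0.5 + 0.5 + 0) / 3        ≈ 0.333
# rmse = sqrt((0.25 + 0.25 + 0)/3)  ≈ 0.408
# mape = 100 * (0.25 + 0.125 + 0)/3 = 12.5 (%)
```

MAPE is undefined when any true value is zero, which is one reason to report it alongside MAE and RMSE rather than alone.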

Multi-Task Data Imputation
(1) S1-Few data missing. For the S1 case, we assumed that a few data were missing for each of the four tasks (OM1-3, RB). Specifically, each task picked the first 150 data points and had 30 data points removed at different time points. Figure 5 and Table 1 show the imputation and error results of each task using MTGP and STGP. The MTGP model clearly fit the original data of each task well by transferring knowledge across tasks. The results were very accurate, and the related MAE and RMSE were close to 0. In contrast, for the results of STGP, it was impossible to observe the trend of the tasks, and the corresponding MAPE was much larger than that of MTGP.
(2) S2-Different sampling frequencies. For the S2 case, we assumed that the sampling frequencies of the tasks were different. In practice, the different sampling frequencies of various sensors lead to different data volumes, which introduces challenges to the subsequent modeling. We set the sampling frequency of Task 3 to be half that of the other tasks, i.e., Task 3 had only 75 data points. Tasks 1-2 and Task 4 had 30 data points removed (same as S1). As shown in Figure 6 and Table 2, MTGP was found to effectively model these tasks with different sampling frequencies and to give accurate predictions.

Time-Series Forecasting via Multi-Task Data Imputation
For the S3 case, we considered an extreme scenario: some tasks had very scarce data at finite early time points. For Task 1, we kept the first 13 points and removed the remaining 137 data points (91.3% data missing), which eliminated the periodicity of the data itself. The other tasks were kept unchanged. We expected that this few-shot scenario would undoubtedly introduce great challenges to data augmentation and prediction.
First, we performed data imputation for Task 1 with STGP. As shown in Figure 7, the predictions quickly failed. Next, we used the MTGP model and observed significantly improved predictions (Figure 7). This was also verified in Table 3, which showed that, in comparison to STGP, the RMSE of MTGP decreased sharply from 2.128 to 0.648. Consequently, the prediction results of the MT-LSTM in Figure 8 and Table 4 were also more accurate than those of the ST-LSTM based on STGP. Moreover, when the incomplete data (13 data points) were directly used for LSTM training, the predictions were extremely inaccurate (see Figure 8 and Table 4). Hence, the validity of the proposed MT-LSTM was preliminarily verified on the OM dataset.

Real-World Datasets
The performance of our method was verified on a set of actual turbine operating data. In a large centrifugal compressor manufacturing enterprise, more than 10 compressors suffered accidents during operation over several years; many were new units recently put into operation, resulting in huge losses. Meanwhile, some accidents were accompanied by sensor failures. To analyze these accidents, we studied one of the compressor units. After the new compressor was brought into service, the vibration condition was stable, with the amplitude at about 20 µm, which was much lower than the alarm value. However, the vibration value rose and exceeded the alarm value after more than 10 days of operation. Clearly, the scarcity of available data poses significant challenges to this research. The original time-series data were obtained from various sensors positioned on the centrifugal compressor; the deployment of the sensors is shown in Figure 9. The collected data contained various operating parameters, for example, inlet and exhaust pressure, exhaust flow, bearing temperature, and axis vibration. The sensors measured the operating data every hour, and a total of 11 performance variables were collected over the same time period. Hence, there were 11 tasks, each with only 200 data points, and these data contained noise. Table 5 shows the details of the tasks. We then considered an extreme case analogous to S3 in the OM dataset, in which some tasks had very scarce data collected at the early time points. Specifically, each task took the first 150 data points. For Task 10, we kept the first 20 data points and removed the remaining 130 (86.7% missing). The other tasks were unchanged.

Results without Data Pre-Processing
The proposed MT-LSTM method was applied to this dataset. First, we used the MTGP model directly, without data pre-processing. The calculation procedure was the same as that for the S3 case of the OM dataset. Figure 10 and Table 6 show the results. The imputation results of STGP failed quickly, and the prediction variance was high. As expected, the predictions of MTGP improved significantly over STGP. Consequently, the prediction of MT-LSTM achieved the best results (Figure 11 and Table 7).
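Before the LSTM stage, the imputed series must be sliced into supervised (input window, target) pairs. A minimal sketch of this step; the `lookback` and `horizon` values are assumed hyperparameters, not the paper's settings:

```python
import numpy as np

def make_windows(series, lookback=10, horizon=1):
    """Slice a 1-D imputed series into (window, target) pairs for LSTM training."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])          # input window
        y.append(series[i + lookback + horizon - 1])  # value to forecast
    return np.array(X), np.array(y)

imputed = np.sin(np.arange(200) / 10.0)   # stand-in for an MTGP-imputed series
X, y = make_windows(imputed, lookback=10)
```

Each row of `X` is one training window and the corresponding entry of `y` is the value the network learns to predict.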

Results with Data Pre-Processing
To investigate the impact of data pre-processing, we first normalized each task before MTGP modeling. The aim was to lessen the influence of the different value scales, as normalization allows multiple time-series variables to be compared at the same time; it prevents large-valued variables from dominating small-valued ones, which improves accuracy. As shown in Figures 12 and 13 and Tables 8 and 9, the MTGP and MT-LSTM results were improved compared to those without data pre-processing.
Thereafter, we performed further data pre-processing, such as data denoising and dimensionality reduction; the first five principal components are shown in Figure 14. We found that a few principal components accounted for most of the information in the original data: the first principal component comprised 73.1%, the second 12.8%, the third 4%, the fourth 3.2%, and the fifth 2.8%. We normalized the first and second principal components (together accounting for 85.9% of the information in the original data) and integrated them into the MT-LSTM model. The results are shown in Figures 15 and 16 and Tables 10 and 11. The accuracy was significantly high, and the calculation time was about 5 min.
According to the experimental results, the vanilla LSTM performed poorly in the scarce-data scenario. In the proposed MT-LSTM model, we performed data imputation of the target task by leveraging knowledge from auxiliary tasks through MTGP modeling and then employed the LSTM to build a sequence model on the imputed time-series data to predict future values. The knowledge transfer across tasks helps the proposed MT-LSTM perform significantly better than the vanilla LSTM.
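The normalization and principal-component steps described above can be sketched with plain NumPy. The 200 × 11 random matrix is a placeholder for the compressor variables (the 73.1%/12.8%/... figures come from the real data and will not reproduce here), and the 85.9% retention threshold follows the text:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 11))            # placeholder for the 11 compressor variables

# z-score normalization per variable, as applied before MTGP modeling
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the normalized data; explained-variance ratio per component
U, s, Vt = np.linalg.svd(Xn, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)

# keep the leading components that together cover 85.9% of the variance
k = int(np.searchsorted(np.cumsum(explained), 0.859) + 1)
scores = Xn @ Vt[:k].T                    # reduced data passed on to MT-LSTM
```

On the real compressor data, the first two components already exceed this threshold, so `k = 2` there.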

Conclusions and Discussion
Health management for turbomachinery is a challenging task. Deep learning has gradually entered this industry over the past few years. However, in practical applications, the training process of deep learning models requires a large number of samples and a long time. For various reasons, it is difficult to obtain a significant amount of monitoring data, and the health management of turbomachinery has high timeliness and accuracy requirements. To account for the scarce-data scenario in the PHM of turbomachinery, this paper proposes integrating multi-task Gaussian processes, long short-term memory, Bayesian wavelet denoising, and probabilistic principal component analysis to achieve rapid automatic data imputation and time-series modeling, alleviating the need for big data as well as improving predictions with scarce data. Meanwhile, the framework can optimize the cleaning of the original multivariate time-series monitoring data, capture the possible correlations and uncertainty in the multivariate data, and reduce the data dimensionality.
Furthermore, we assume that $X = \{x_{ij} \mid i = 1, \ldots, m,\ j = 1, \ldots, n_i\}$ are the training indices and $Y = \{y_{ij} \mid i = 1, \ldots, m,\ j = 1, \ldots, n_i\}$ are the observations for $m$ tasks, where task $i$ has $n_i$ training points. In practice, we may encounter a specific task whose number of training indices $n_i$ is far lower than that of the other tasks, which makes the problem challenging. The need to model $m$ tasks simultaneously (e.g., multiple time series) drives the extension of the GP model to the MTGP model, in which all tasks share the same index set $x$. A straightforward approach is to train an STGP (single-task GP) model for each task separately, as shown in Figure 2a. To model the $m$ tasks jointly via MTGP, the affiliation of an index $x_{ij}$ and observation $y_{ij}$ with task $i$ must be made explicit; we therefore add a label $l$ to the model as an additional input, with $l = i$, as illustrated in Figure 2b.
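One common way to realize this shared-index, task-labeled formulation is to append the label $l$ as an extra input column and use a separable kernel $k_x(x, x') \cdot K_{task}[l, l']$. The sketch below does this with plain NumPy; the RBF kernel, its lengthscale, and the inter-task correlation value are illustrative assumptions, not the paper's exact covariance:

```python
import numpy as np

def mt_kernel(XA, XB, ls=8.0, task_cov=None):
    """Separable multi-task kernel: k((x,l),(x',l')) = RBF(x,x') * K_task[l,l']."""
    xa, la = XA[:, 0:1], XA[:, 1].astype(int)
    xb, lb = XB[:, 0:1], XB[:, 1].astype(int)
    k_x = np.exp(-0.5 * (xa - xb.T) ** 2 / ls ** 2)
    return k_x * task_cov[np.ix_(la, lb)]

# two toy tasks on a shared index set; task 1 is observed only at its first 10 points
x = np.arange(50, dtype=float)
f0, f1 = np.sin(x / 5), 0.9 * np.sin(x / 5) + 0.1
X_train = np.vstack([np.c_[x, np.zeros(50)],        # task 0: fully observed
                     np.c_[x[:10], np.ones(10)]])   # task 1: scarce
y_train = np.concatenate([f0, f1[:10]])

task_cov = np.array([[1.0, 0.9], [0.9, 1.0]])       # assumed inter-task correlation
K = mt_kernel(X_train, X_train, task_cov=task_cov) + 1e-4 * np.eye(len(y_train))
X_star = np.c_[x, np.ones(50)]                      # impute task 1 everywhere
y_star = mt_kernel(X_star, X_train, task_cov=task_cov) @ np.linalg.solve(K, y_train)
```

Because the two tasks are strongly correlated, the posterior mean for the scarce task tracks its true shape even where that task has no observations, which is exactly the transfer effect exploited in the experiments above.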

Figure 2. Box diagram of (a) multiple STGPs and (b) MTGP for multi-task modeling.

Figure 4. Time-series signals of the 4 tasks from the OM dataset. x-axis: n is the number of data points.
Lower values of MAE, RMSE, and MAPE indicate higher prediction accuracy.

Figure 9. Diagram of sensors used to acquire time-series data from a centrifugal compressor.

Figure 10. Imputation results of MTGP and STGP for the extreme situation.

Figure 12. Imputation results of MTGP and STGP for the extreme situation (after normalization).

Machines 2023, 11, 18

Figure 14. The first 5 principal components after data pre-processing from the compressor data.

Figure 15. Imputation results of MTGP and STGP for the extreme situation (with data pre-processing).

Table 1. Error results of MTGP and STGP for S1.

Table 2. Error results of MTGP and STGP for S2.

Table 3. Error results of STGP and MTGP for S3.

Table 5. Details of tasks acquired from the centrifugal compressor.

Table 6. Error results of STGP and MTGP for the extreme situation.

Table 7. Error results of MT-LSTM, ST-LSTM, and LSTM for the extreme situation.

Table 10. Error results of STGP and MTGP (with data pre-processing).
