A Time-Series Data Generation Method to Predict Remaining Useful Life

Abstract: Accurate prediction of the remaining useful life (RUL) of equipment using machine learning (ML) or deep learning (DL) models trained on data collected until the equipment fails is crucial for maintenance scheduling. Because such data are unavailable until the equipment fails, collecting sufficient data to train a model without overfitting can be challenging. Here, we propose a method of generating time-series data for RUL models to resolve the problems posed by insufficient data. The proposed method converts every training time series into a sequence of alphabetical strings by symbolic aggregate approximation and identifies occurrence patterns in the converted sequences. The method then generates a new sequence and inversely transforms it into a new time series. Experiments with various RUL prediction datasets and ML/DL models verified that the proposed data-generation method can help avoid overfitting in RUL prediction models.


Introduction
Prognostics and health management (PHM) is an important contributor to maintaining manufacturing productivity and has rapidly evolved from corrective maintenance to condition-based maintenance (CBM). Also known as predictive or data-driven maintenance, CBM bases its decisions on the analysis of data gathered from equipment and sensors [1].
The established applications of CBM include monitoring, fault diagnosis, maintenance scheduling, and remaining useful life (RUL) prediction [2]. Among them, RUL prediction is the subject of growing interest because of its effectiveness in maintenance scheduling. The RUL can be defined as the difference between the current time and the time at which failure occurs, and RUL prediction is the task of predicting the RUL (or failure time) at the current time based on collected signals, conditions, and so forth [3]. RUL prediction techniques can be categorized into physical model-driven and data-driven techniques [4]. Zhu and Yang [5] computed the thermo-elasto-plastic stress and strain fields of a turbine blade using finite element methods and predicted its fatigue life through the Basquin and Manson-Coffin formulae. They performed the thermal stress field analysis using Ansys software, diagnosed the thermal-mechanical stress at selected nodes, and then predicted the fatigue life from the maximum and mean stress. The remaining lifetime was predicted by extracting the formula's parameters from controlled stress test data. While they considered the influence of limited external factors, such as thermal and mechanical stress, on the remaining life, we treat the equipment state as a time series and predict the remaining life probabilistically in a continuous setting. Taheri and Taheri [6] studied the feasibility and technical design of implementing a combined heat and power system for the Mahshad Power plant. They estimated the remaining life of gas turbines on the basis of available data using an approximate approach, whose prediction is derived from predetermined elements or formulae. In addition, they used operational data reflecting the current status for prediction, whereas our method adopts a probabilistic approach and uses obtainable data rather than deriving a closed-form formula.
Even though a physical model is easy to understand and very accurate, it is almost impossible to formulate one for a modern industrial system. In contrast, data-driven techniques using machine learning (ML) and deep learning (DL) can model a very complex system with data collected from it. For this reason, this paper focuses on data-driven RUL prediction.
Recently, ML and DL models have been frequently applied to predict RUL after training them with run-to-failure data [7]. These models learn the degradation patterns and relationships between the pattern and the RUL from the data collected until the end of the life of the components and predict the RUL of target components. In many recent studies, ML and DL models have shown superior performance for RUL problems [8][9][10][11][12][13][14][15].
Data insufficiency is a major challenge when training ML- or DL-based RUL models. Because a time-series sample is completed only after a component fails, collecting training data tends to be time-consuming and expensive. A model trained with insufficient data can be overfitted [16] and fail to accurately predict the RUL of a new component. Collecting more data can mitigate overfitting but may not be an option because of the required time and cost.
In this paper, we propose a time-series data-generation method that helps avoid overfitting when training an RUL model. The proposed method generates a sample in a probabilistic manner on the basis of two existing samples, called parents. The generated samples are added to the training data; in other words, an RUL prediction model is trained with the union of the original and generated samples, which contributes to avoiding overfitting and increasing prediction performance. This approach is similar to SMOTE (Synthetic Minority Over-sampling Technique) [17], one of the most widely used oversampling methods, which generates minority-class samples between two randomly selected samples. The generated sample may not resemble the original samples in terms of the relationships among features, yet it still improves classification performance measures such as recall and F1-score.
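For intuition, SMOTE's interpolation step can be sketched as follows (a minimal illustration of the idea, not the oversampler used in this paper; all names are ours):

```python
import random

def smote_point(x1, x2, rng=random.Random(42)):
    """Generate a synthetic sample on the line segment between two
    feature vectors, as SMOTE does between a minority sample and a neighbor."""
    r = rng.random()  # random interpolation weight in (0, 1)
    return [a + r * (b - a) for a, b in zip(x1, x2)]
```

The synthetic point lies between the two originals feature-wise, which is exactly why it may not preserve the joint relationships among features.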
As far as our literature survey is concerned, ours is the first attempt to address the data insufficiency problem when training an RUL prediction model. In this paper, we propose a method to transform a time series into an alphabetical sequence and to inversely transform such a sequence, which can also be applied to time-series classification and clustering. The proposed data-generation method is not only efficient (i.e., inexpensive in terms of computational time) but also effective (i.e., the generated time series can help avoid overfitting).
The remainder of this paper is organized as follows. Section 2 introduces background theory and related works of RUL prediction and symbolic aggregate approximation. Section 3 proposes a time-series data-generation method for RUL prediction in detail, and Section 4 illustrates the process of the proposed method using a small example. Section 5 experimentally verifies the performance of the proposed method, and Section 6 draws conclusions and suggests future research directions.

Remaining Useful Life Prediction
The training and usage process of an RUL prediction model proceeds as follows. First, a set X = {x^(1), x^(2), ..., x^(n)} of training time-series samples x^(i), i = 1, 2, ..., n, is transformed by extracting features and the RUL such that

D(X) = {D(x^(i)) | i = 1, 2, ..., n},  D(x^(i)) = {(φ(x^(i)_{1:t}), y_{i,t}) | t = 1, 2, ..., T_i},

where D(X) is the set of transformed time series in X, D(x^(i)) is a set of (feature, RUL) pairs, φ is a vector of feature functions, x^(i)_{1:t} is the part of the time series of sample i collected from time 1 to t, y_{i,t} is the RUL of sample i measured at t, and T_i is the length of x^(i). Second, a model f is trained using D(X), and finally, the RUL of a new sample n + 1 at time t is predicted as f(φ(x^(n+1)_{1:t})). Many researchers have developed RUL prediction models based on ML or DL. For example, Ali et al. [18] introduced a root mean square entropy estimator (RMSEE), the entropy of the root mean square (RMS) of the j-th window, to capture bearing degradation and used it as a feature. They also converted the RUL prediction problem into a classification problem with seven classes according to the degradation rate (that is, class 1: under 10%, class 2: 10%-25%, class 3: 25%-40%, class 4: 40%-55%, class 5: 55%-70%, class 6: 70%-85%, and class 7: 85%-100%). Finally, they used a multi-layered perceptron (MLP) combined with a simplified fuzzy adaptive resonance theory map as the prediction model. Zheng et al. [19] pointed out that a traditional regression model using features from a window is not appropriate for RUL prediction because it does not fully consider sequence information, and that other sequence learning models also have flaws (e.g., hidden Markov models and recurrent neural networks do not consider long-term dependency among nodes). They proposed a deep long short-term memory (LSTM) network consisting of four layers: an input layer, a multi-layer LSTM, a multi-layer perceptron (MLP), and an output layer.
In their experiment, MLP, support vector regression (SVR), relevance vector regression, and a convolutional neural network were compared in terms of root mean squared error (RMSE), with the deep LSTM exhibiting the smallest RMSE on four datasets. Table 1 summarizes previous research that developed RUL prediction models using ML and DL, including convolutional neural networks (CNN), in terms of domain, features, and base model. As seen in Table 1, statistics such as RMS, kurtosis, and skewness are frequently used as features, and MLP, LSTM, and SVR are used primarily as models. Table 1. Previous studies on RUL prediction.
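The dataset construction D(x^(i)) described above, pairing the features of x^(i)_{1:t} with the remaining life T_i − t, can be sketched as follows (`phi` stands for any feature function; the names are illustrative, not the authors' code):

```python
def build_training_pairs(series, phi):
    """Build (feature, RUL) pairs from one run-to-failure time series:
    the features of x[1:t] are labeled with the remaining life T - t."""
    T = len(series)
    return [(phi(series[:t]), T - t) for t in range(1, T + 1)]

# Example with a trivial feature function (the running sum):
pairs = build_training_pairs([1, 2, 3, 4], phi=sum)
```

Each run-to-failure series of length T thus contributes T supervised examples, with the label reaching 0 at the failure time.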

Symbolic Aggregate Approximation
Pattern extraction is very important for time-series analysis tasks such as classification and clustering, but it is hard to extract patterns directly from a time series because of the huge search space [25]. Thus, representation methods such as symbolic aggregate approximation (SAX), discrete cosine transform (DCT), and discrete wavelet transform (DWT) are usually used to represent a time series as a sequence before pattern extraction. In this paper, each time series is discretized into an alphabetical sequence using SAX, patterns are extracted from the sequences, and a new sequence is generated considering the extracted patterns.
SAX can convert a time series x into an alphabetical sequence S for efficient time-series data mining in the following manner [26]. First, the t-th element x_t of x is normalized as x̃_t = (x_t − µ)/σ, where µ and σ denote the mean and standard deviation of x, respectively. Second, x̃ is split into w windows and r = (r_1, r_2, ..., r_w) is calculated, where r_j, 1 ≤ j ≤ w, is a representative value such as the mean of the values in the j-th window; the standard deviation can serve as an alternative representative value. Third, break points β_k for k = 1, 2, ..., l − 1 are computed, where l is the number of alphabetical strings defined by the user, satisfying Pr(β_k ≤ r_j < β_{k+1}) = 1/l. Finally, an alphabet is assigned to each window on the basis of the break points; that is, if β_k ≤ r_j < β_{k+1}, then the k-th alphabetical string is assigned to the j-th window. Figure 1 illustrates the SAX application process when the time series is assumed to be already normalized. The time series is split into w = 6 windows, and the mean values r_1, r_2, ..., r_6 of the windows are calculated. Three (l = 3) alphabetical strings A, B, and C are introduced, and the corresponding break points β_{µ,1} and β_{µ,2} for the means are obtained. The alphabet A is assigned to those means less than β_{µ,1}, C to those greater than β_{µ,2}, and B to those between β_{µ,1} and β_{µ,2}. We thus obtain the sequence S = C-B-A-C-B-C of alphabetical strings converted from the time series.
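The four SAX steps above can be sketched as follows (a simplified, self-contained illustration, not the authors' implementation; here the break points are taken as empirical equal-frequency quantiles of the window means, and all names are ours):

```python
import statistics

def sax(series, w, alphabet):
    """Convert a time series into an alphabetical sequence (simplified SAX)."""
    # 1) z-normalize the series
    mu, sd = statistics.fmean(series), statistics.pstdev(series)
    z = [(x - mu) / sd for x in series]
    # 2) split into w windows and take each window's mean as its representative
    n = len(z)
    means = [statistics.fmean(z[j * n // w:(j + 1) * n // w]) for j in range(w)]
    # 3) equal-frequency break points over the window means
    l, srt = len(alphabet), sorted(means)
    breaks = [srt[q * w // l] for q in range(1, l)]
    # 4) assign an alphabet to each window by its position among the break points
    return "".join(alphabet[sum(m >= b for b in breaks)] for m in means)
```

For instance, `sax(list(range(18)), 6, "ABC")` returns "AABBCC": the monotonically increasing series yields increasing window means, so the lowest third maps to A and the highest to C.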
SAX has been used to extract features from time series for various tasks such as classification and clustering. For example, Georgoulas et al. [27] extracted alphabetical features to represent the vibration of bearings and used them to detect bearing faults with various classifiers. Park and Jung [28] proposed a method to reveal rules from multivariate time series: it transforms each time series into an alphabetical sequence through SAX and identifies frequent patterns from the sequences using association rule mining. Notaristefano et al. [29] grouped electrical load patterns after reducing the data size with SAX.

Proposed Data Generation Method
This section explains the proposed three-phase data-generation method: preprocessing, generating an alphabetical sequence, and generating time-series values. In the first phase, every time-series sample is transformed into a pair of vectors, one of window means and another of window standard deviations, and then each vector is transformed into an alphabetical sequence. In the second phase, two arbitrarily selected pairs of alphabetical sequences form a new sequence pair, with a pattern similar to those of the originally selected sequences. In the third phase, time-series values for each window are generated from the generated pair. Table 2 presents the mathematical notations used in this paper. Table 2. Mathematical notations used in this paper.
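The three phases can be summarized schematically as follows; the helper functions stand in for Phases 1-3 and all names are placeholders of ours, not the paper's code:

```python
import random

def generate_sample(X, to_sequences, mix_sequences, to_values, rng=random.Random(0)):
    """High-level flow of the proposed method: pick two parent samples,
    convert them to alphabet sequences, mix them, then emit values."""
    a, b = rng.sample(range(len(X)), 2)   # two randomly chosen parent samples
    M_a, S_a = to_sequences(X[a])         # Phase 1: SAX-style preprocessing
    M_b, S_b = to_sequences(X[b])
    M_k = mix_sequences(M_a, M_b)         # Phase 2: generate child sequences
    S_k = mix_sequences(S_a, S_b)
    return to_values(M_k, S_k)            # Phase 3: generate time-series values
```

The following subsections fill in each of the three placeholders in turn.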

Notation: Meaning
φ: Vector of feature functions used to extract features from x^(i) to train an RUL prediction model.
D(x^(i)): Dataset generated by transforming x^(i) for RUL prediction, where the bottom 30% and top 30% are trimmed for stability.
f: Supervised model for RUL prediction.
x̄^(i), s^(i): Mean and standard deviation of x^(i).
x̃^(i): x^(i) normalized with mean x̄^(i) and standard deviation s^(i).
w: Number of windows.

Preprocessing
The objective of the preprocessing phase is to express each sample x^(i) as a pair (M^(i), S^(i)), where M^(i) and S^(i) are the alphabetical sequences representing the window means and window standard deviations of sample i, respectively. The preprocessing phase consists of four steps: z-normalization, segmentation, calculation of break points, and conversion into an alphabetical sequence, as illustrated in Figure 2.
In the first step, x^(i) for i = 1, 2, ..., n is normalized to x̃^(i) with its mean x̄^(i) and standard deviation s^(i) as:

x̃^(i)_t = (x^(i)_t − x̄^(i)) / s^(i).

In the second step, x̃^(i) for i = 1, 2, ..., n is split into w windows, where w is the number of windows set by the user, and the mean and standard deviation of each window are calculated by:

µ^(i)_j = (1 / n_{i,j}) Σ_t x̃^(i)_t,   σ^(i)_j = sqrt((1 / n_{i,j}) Σ_t (x̃^(i)_t − µ^(i)_j)^2),

where the sums run over the elements of the j-th window, and n_{i,j} is the number of elements in the j-th (j = 1, 2, ..., w) window of x̃^(i), which equals ⌈T_i / w⌉ when j ≠ w. Then x̃^(i) is expressed as a pair of vectors: the window mean vector µ^(i) = (µ^(i)_1, µ^(i)_2, ..., µ^(i)_w) and the window standard deviation vector σ^(i) = (σ^(i)_1, σ^(i)_2, ..., σ^(i)_w).

In the third step, break points β_µ and β_σ for µ^(i)_j and σ^(i)_j (i = 1, ..., n, j = 1, ..., w), respectively, are obtained according to the size of the set of alphabetical strings, |A|, which is also a user parameter. As explained in Section 2.2, the break points β_µ and β_σ are used as the criteria to convert the mean and standard deviation of each window into alphabets. β_µ and β_σ are calculated over all samples rather than for individual samples so that each sample's scale is taken into account when generating a new sample.
The break points are obtained from Pr(β_q ≤ x < β_{q+1}) = 1/|A|, which implies that each interval [β_q, β_{q+1}), q = 1, 2, ..., |A| − 1, contains the same number of values; β_q is therefore the (q/|A| × 100)th percentile of the pooled values. In the fourth step, µ^(i) and σ^(i) are expressed as alphabetical sequences M^(i) and S^(i), respectively: the j-th window receives A_{µ,q} if β_{µ,q−1} ≤ µ^(i)_j < β_{µ,q} (and analogously for the standard deviation), where β_{µ,0} and β_{σ,0} are negative infinity, and A_{µ,q} and A_{σ,q} indicate the q-th predefined alphabet (e.g., A_{µ,1} = A) for the window mean and standard deviation, respectively. The first phase is summarized in Algorithm A1 in Appendix A.
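The percentile-based break points and the alphabet assignment can be sketched as follows (window statistics pooled from all samples; a minimal illustration with our own names):

```python
def break_points(pooled, n_alphabets):
    """Equal-frequency break points: the q-th break point is roughly
    the (q / n_alphabets * 100)-th percentile of the pooled values."""
    srt = sorted(pooled)
    n = len(srt)
    return [srt[q * n // n_alphabets] for q in range(1, n_alphabets)]

def to_alphabet(value, breaks, alphabet):
    """Assign the alphabet whose interval between break points contains the value."""
    return alphabet[sum(value >= b for b in breaks)]
```

Because the break points come from the pooled statistics of all samples, a window mean that is large relative to the whole dataset maps to a late alphabet even if it is typical within its own sample.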

Generating Alphabetical Sequences
The second phase generates artificial sequences of alphabets M^(k) and S^(k) on the basis of two randomly selected parental samples (M^(a), S^(a)) and (M^(b), S^(b)).
Figure 3 illustrates an example of the process of generating an alphabetical sequence M^(k) from two parent mean sequences M^(a) and M^(b). Note that the generating process of S^(k) is the same as that of M^(k).
The occurrence probabilities used in this phase are defined with an indicator function I(condition), which returns 1 if the condition is satisfied and 0 otherwise, and a Laplace smoothing parameter α ≥ 0 that prevents any occurrence probability from becoming zero (Equations (8) and (9)). These probabilities are normalized by dividing each by their sum, as presented in Equations (10) and (11). In Equations (14) and (15), L is a parameter that restricts the search space by determining the number of preceding alphabets that must match M^(k). The parent samples are selected at random, and to produce each subsequent alphabet we adopt a Markov process for randomness and variability; under the Markov assumption, additional variability can be introduced by properly setting the value of L, which represents the size of the search space (Equations (14) and (15)). These probabilities are normalized in the same way. An algorithm to generate the alphabetical sequences of an artificial sample's window means and standard deviations is presented as Algorithm A2; it is based on sampling from a categorical distribution to select either a or b at each position (for example, the first element of M^(k) follows a categorical distribution over the parents' first alphabets).

Generating Time-Series Values
In the third phase, the normalized value z̃^(k)_j of the j-th window follows a normal distribution with mean µ^(k)_j and standard deviation σ^(k)_j, where µ^(k)_j and σ^(k)_j are uniformly distributed in [β_{µ,r1−1}, β_{µ,r1}] and [β_{σ,r2−1}, β_{σ,r2}], respectively, when r_1 and r_2 are not 1; here r_1 and r_2 are the alphabet indices of M^(k)_j and S^(k)_j.
When r_1 is 1, we set µ^(k)_j = β_{µ,r1}, and likewise σ^(k)_j = β_{σ,r2} when r_2 is 1. We also assume that the length of z^(k) follows a uniform distribution between the minimum and maximum lengths of its parents. After z̃^(k)_j for every value of j is generated, it is inversely transformed to z^(k)_j using the sample mean z̄^(k) and standard deviation s^(k). We set these as weighted averages of x̄^(a) and x̄^(b), and of s^(a) and s^(b), respectively, as presented in Equations (18) and (19):

z̄^(k) = r x̄^(a) + (1 − r) x̄^(b),    (18)
s^(k) = r s^(a) + (1 − r) s^(b),    (19)

where 0 < r < 1 is randomly chosen. That is, Equation (18) means z̄^(k) is randomly selected between x̄^(a) and x̄^(b), and Equation (19) means s^(k) is randomly selected between s^(a) and s^(b).

Phase 1. Preprocessing
(1) z-normalization
Each sample in Table 3 is normalized according to its mean and standard deviation as follows.
(2) Segmentation
Each normalized sample is split into w = 6 windows, and the mean and standard deviation of each window are calculated. For example, the first window of x̃^(2) is (1.15, 0.16, 0.16), and its mean and standard deviation are computed from these three values.

Phase 2. Generating alphabetical sequences
The second mean alphabets of samples 1 and 3 are A and C, and their second standard deviation alphabets are read from the converted sequences in the same way. Therefore, Pr(A|A), Pr(C|A), and the analogous standard deviation probabilities are calculated and normalized using Equations (10)-(15). For convenience, we set L to 1, and accordingly M^(k)_2 is sampled from a categorical distribution with probabilities (1/2, 1/2); as a result, A and its standard deviation counterpart are selected. This process repeats until j reaches w = 6.
From this phase, we obtain M^(k) = AABCCC together with the corresponding standard deviation sequence S^(k).
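The per-position sampling in this example (L = 1 with equal occurrence counts, hence probabilities 1/2 each) reduces to a uniform choice between the two parents; a sketch, with our own names and ignoring the general occurrence-probability weighting of Equations (8)-(17):

```python
import random

def child_sequence(seq_a, seq_b, rng=random.Random(0)):
    """Build a child alphabetical sequence by choosing each position's
    alphabet from one of the two parent sequences (uniform sketch)."""
    return "".join(rng.choice(pair) for pair in zip(seq_a, seq_b))
```

Every position of the child therefore agrees with at least one parent, which is what keeps the generated sequence close to the parents' patterns.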

Phase 3. Generating time-series values
In this phase, time-series values z^(k) are generated from the alphabetical sequences. Figure 4 shows the generated sample and its parents (samples 1 and 3) in the example. The dashed and dotted lines denote samples 1 and 3 in Table 3, respectively, and the solid line denotes the sample generated from these parents. The Y-axis denotes the sensor value and the X-axis denotes time; thus, the horizontal length of a line indicates the whole life. As explained before, the length (i.e., whole life) of every sample in X is 18, and thus the length of the generated sample is also 18. To be more precise, the length of the generated sample follows a uniform distribution over [minimum length of parents, maximum length of parents].
As seen in this graph, the generated sample follows a pattern similar to those of samples 1 and 3, which implies that it retains the characteristics of existing samples. At the same time, however, the generated sample should not be too close to the existing samples, in order to ensure variability. The proposed method selects two parent samples at random, and all alphabetical sequences are created on the basis of Equations (12)-(17), which ensure enough randomness and variability in the generated samples when, e.g., selecting the time-series size for each alphabet, selecting the first alphabet, and so forth.
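The value generation and inverse transform of this phase can be sketched as follows (per the window distributions of Section 3 and Equations (18)-(19); the names are ours):

```python
import random

def window_values(mu_j, sigma_j, length, rng=random.Random(1)):
    """Draw normalized values for one window from N(mu_j, sigma_j)."""
    return [rng.gauss(mu_j, sigma_j) for _ in range(length)]

def inverse_transform(z_norm, mean_k, std_k):
    """Undo z-normalization: z = mean_k + std_k * z_tilde, where mean_k and
    std_k are random weighted averages of the parents' mean and std."""
    return [mean_k + std_k * z for z in z_norm]
```

The inverse transform is what restores the generated sample to the parents' scale, since all pattern generation happens on normalized values.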

Experiment and Results
In this section, we describe an experiment to verify that the samples generated by the proposed method contribute to training an RUL model without overfitting. Two RUL prediction models were compared in terms of mean absolute percent error (MAPE), one with an original dataset X and the other with a dataset X ∪ Z, where Z is an artificially generated dataset. Section 5.1 explains the procedure of the experiment, Section 5.2 introduces the datasets and hyperparameters used in the experiment, and Section 5.3 shows the results.

Procedure
First, an original sample i, x^(i), is reserved for the test, and the others (i.e., X^(−i) ≡ X − {x^(i)}) are used for training. Second, an RUL prediction model f_1 is trained with X^(−i), to which the transformation for RUL prediction is applied, ∪_{x∈X^(−i)} D(x). Third, the MAPE of the model for D(x^(i)), MAPE_{1,i}, is calculated. Fourth, we repeat the following Q = 100 times: generate Z = {z^(k) | k = 1, 2, ..., n − 1} using the proposed algorithm for X^(−i) under hyperparameters w, L, |A_µ|, and |A_σ|; train f_2 with ∪_{x∈X^(−i)∪Z} D(x); and then calculate MAPE^(q)_{2,i}. Finally, MAPE_{1,i} and the mean of MAPE^(q)_{2,i} are compared. This procedure is repeated for all possible values of i, w, L, |A_µ|, |A_σ|, and the models.
The specific procedure is described in Algorithm A3, and the flowchart illustrating the calculation of MAPE_{1,i} and MAPE_{2,i} is presented in Figure 5.
In Step 6 and Step 7 of this algorithm, MAPE is calculated as the mean of |y_t − ŷ_t| / y_t × 100% over all prediction points. Figure 5 illustrates an example of the procedure to calculate MAPE_{1,4} and MAPE_{2,4}. The specific process illustrated in this figure is as follows: (1) x^(1), x^(2), x^(3), and x^(4) are transformed into D(x^(1)), D(x^(2)), D(x^(3)), and D(x^(4)), respectively, by applying the feature functions. (2) f_1 is trained with D(x^(1)) ∪ D(x^(2)) ∪ D(x^(3)).
(3) D(x^(4)) is used to validate f_1. That is, ŷ^(4)_t = f_1(φ(x^(4)_{1:t})) is obtained for all t, and the prediction results are used to calculate MAPE_{1,4}.
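The MAPE used in these steps can be computed as follows (the standard definition, assuming nonzero true RUL values at the evaluated points):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error over all prediction points."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `mape([10, 20], [9, 22])` evaluates to 10.0, since both points are off by 10% of their true values.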

Experiment Setting
Datasets were obtained from the prognostics data repository of the U.S. National Aeronautics and Space Administration. Table 4 shows information on the datasets. All three datasets are well known and have been widely used in the literature to verify the performance of newly developed machine learning methods. Each sample in the first and second datasets is a time series of the capacity of a lithium-ion battery until it is dead. Discharge was carried out at 24 °C, and each battery was regarded as dead when its capacity fell to 30%. Each sample of the third dataset is a signal from a vibration sensor attached to a bearing. The operating condition of the bearing was 1800 rpm and 4000 N, and the sampling frequency of the sensor was 25.6 kHz.
Hyperparameters for each experiment are given in Table 5. Battery: Battery #1 and #6; bearing: FEMTO Bearing Set #1; MLP(h_1, h_2): multi-layered perceptron with two hidden layers of h_1 and h_2 nodes, respectively; LSTM(h, s, b, e): long short-term memory with h neurons, s timestamps, batch size b, and e epochs; SVR(C, ε, κ): support vector regression with regularization parameter C, epsilon ε, and kernel function κ.

Results
From the results presented in Figures 6-8, we found the following. First, when we included the generated samples for training, the MAPEs were smaller than those of the cases with only original samples, except for the LSTM with battery #1 shown in Figure 6. In the case of the MLP with the bearing dataset, shown in Figure 8, MAPE decreased by 20.6%, the largest improvement. This shows that the proposed method of artificially generating training samples is effective and can be used to improve model performance. Second, the MAPE of a model trained using only the original samples (excluding the test sample) could be very high. For example, the MAPEs of MLP and SVR for the bearing dataset were 35.82% and 37.21%, respectively. This may be because the features of some test samples (i.e., cumulative root mean square and kurtosis) are quite different from those of the other samples; in other words, the relationships between the feature vectors and the label (i.e., RUL) differ markedly across samples. In this case, the proposed method effectively decreased the MAPE, as in the case of the MLP for the bearing dataset.
Third, the proposed method showed a bigger MAPE when using LSTM for battery #1 (Figure 6), contrary to the other results. In essence, the LSTM considers previous feature values (i.e., cumulative RMS and kurtosis at time t − 1) to predict the current label (i.e., RUL at t), but the proposed method does not consider the relationship between two consecutive values. Instead, it takes the relationship between two consecutive windows into consideration, and the values within a window are aggregated to a single value, either the mean or the standard deviation. We think this may sometimes worsen model performance when more samples are used for training, which in turn results in a larger MAPE. However, we obtained smaller MAPEs for battery #6 and the bearing dataset, as shown in Figures 7 and 8. From the experiment, we verified that the proposed method can mitigate the data insufficiency problem that is common in RUL prediction and often leads to overfitting. In other words, the RUL prediction model trained with both original and generated samples generalizes better than the one trained with original samples only.

Conclusions
Due to time and cost, it is often difficult to collect sufficient run-to-failure data to train ML- and DL-based RUL models. This data insufficiency problem can result in overfitting and undermine a model's performance. In this paper, we proposed a time-series data-generation method that identifies patterns from alphabetical sequences converted from the original time-series samples using SAX and generates new sequences on the basis of these patterns. Finally, it generates time-series values from each alphabet in the generated sequence. In an experiment using three benchmark datasets, we found that the samples generated by the proposed method effectively increased the performance of the RUL prediction model. Future efforts to improve the proposed method should take into account the relationship between consecutive values when generating time-series values. In addition, the proposed method was designed for univariate time series and may not be appropriate for multivariate time series, which are common in datasets used for RUL prediction; it should therefore be extended to handle multivariate time series. Finally, the proposed method has many parameters, such as the numbers of windows, alphabets, and generated samples, which would affect the prediction performance of the RUL model. In future research, a sensitivity analysis of these parameters should be conducted, and a method to choose proper parameter values should be developed.

Step 0 Initialize i as 1.
Step 1 Normalize x^(i) to x̃^(i) with its mean and standard deviation.
Step 2 Split x̃^(i) into w windows, and convert x̃^(i) into µ^(i) and σ^(i) by calculating the mean and standard deviation of each window.
Step 3 Find break points β_µ and β_σ according to |A_µ| and |A_σ|.
Step 4 Convert µ^(i) and σ^(i) into alphabetical sequences M^(i) and S^(i) on the basis of the break points.
Step 5 If i equals n, terminate the algorithm. Otherwise, increase i by 1 and go to Step 1.
Step 6 Generate a set of new time-series samples, Z = {z^(k) | k = 1, 2, ..., n − 1}, by applying the proposed method to X − {x^(i)} with parameters w, L, |A_µ|, and |A_σ|.
Step 7 Transform each z^(k) into D(z^(k)).
Step 8 Train f_2 with ∪_{x∈X^(−i)∪Z} D(x).
Step 9 Calculate MAPE^(q)_{2,i} using D(x^(i)).
Step 10 If q does not equal Q, increase q by 1 and go to Step 6. Otherwise, calculate the mean of MAPE^(q)_{2,i} over all q.
Step 11 If i equals n, terminate this algorithm. Otherwise, increase i by 1 and go to Step 3.