Overview of Machine Learning Methods for Lithium-Ion Battery Remaining Useful Lifetime Prediction

Abstract: Lithium-ion batteries play an indispensable role in applications ranging from portable electronic devices to electric vehicles and home storage systems. Even though they offer superior performance compared with most other storage technologies, their lifetime is limited and has to be predicted to ensure the economic viability of the battery application. Furthermore, to ensure optimal battery system operation, remaining useful lifetime (RUL) prediction has become an essential feature of modern battery management systems (BMSs). Thus, the prediction of the RUL of Lithium-ion batteries has become a hot topic for both industry and academia. The purpose of this work is to review, classify, and compare different machine learning (ML)-based methods for the prediction of the RUL of Lithium-ion batteries. First, this article summarizes and classifies various Lithium-ion battery RUL estimation methods that have been proposed in recent years. Secondly, an innovative method was selected for evaluation and compared in terms of accuracy and complexity. The deep neural network (DNN) is more suitable for RUL prediction due to its strong independent learning ability and generalization ability. In addition, the challenges and prospects of BMS and RUL prediction research are also put forward. Finally, the development of various methods is summarized.


Introduction
With the development of the electrification era, the vigorous advancement of new energy vehicles, and the Internet of Things, the importance of energy storage system performance has become prominent. Lithium-ion batteries stand out among various energy storage solutions due to their high energy density, high power capability, and low self-discharge rate [1,2]. At the same time, this also puts forward higher requirements and challenges for the development of battery management technology. A comprehensive battery management system (BMS) should include the following functions: battery data collection, battery status determination and prediction, charge and discharge control, safety protection, thermal management, balance control, and communication [3].
The accuracy of state estimation is an important criterion for evaluating the performance of BMS. A high-performance BMS can make the energy storage system operate reliably and extend the battery lifetime [4]. The optimized BMS should provide multi-task processing capabilities, which can make various tasks work together. At the same time, a real-time operating system is introduced to monitor system parameters and status in real-time so that the system can be adjusted in time.
However, the lifetime of everything is limited, and Lithium-ion batteries are no exception. The price and aging of Lithium-ion batteries are the two main factors that hinder their acceptance in a wider range of applications [5]. The performance of Lithium-ion batteries will decrease with calendar aging and cycle aging, due to various aging mechanisms.
Existing reviews mainly focus on how to apply model-based methods to SOH and RUL prediction, and on how to apply data-driven methods to the joint prediction of SOH and RUL. Few reviews have focused on the application of ML methods in Lithium-ion battery RUL prediction. To fill this research gap, the authors consulted 216 papers, selected 75 papers based on the ML classification for RUL prediction, and reviewed the support vector machine, Gaussian process regression, extreme learning machine, deep neural network, and recurrent neural network. A new standard is proposed to evaluate and compare the methods proposed in the literature, starting from the two general directions of accuracy and robustness, to emphasize the detailed information, advantages, and limitations of these methods.
The remainder of this article is organized as follows. Section 2 introduces the basic principles of the support vector machine (SVM), Gaussian process regression (GPR), extreme learning machine (ELM), deep neural network (DNN), and recurrent neural network (RNN) and their application in Lithium-ion battery RUL prediction. Section 3 compares the aforementioned ML methods from the perspective of accuracy and algorithm parameters. Section 4 presents the challenges and prospects for RUL prediction. Section 5 concludes this work.

Machine Learning for RUL Prediction
The ML method is the preferred method for predicting RUL when historical life cycle data are available [33,34]. Figure 1 shows the basic workflow of introducing ML in the process of predicting RUL. First, raw data that can be measured directly on the battery, such as operating temperature (T), charge/discharge current (I), and operating voltage (V), are collected as inputs for the ML model. Secondly, preprocessing operations such as denoising are performed on the original data, and the feature vector representing the aging behavior is extracted. The feature extraction step strongly affects the RUL estimation performance. Finally, the trained ML model captures the relationship between the feature values and the battery RUL and realizes the prediction of the RUL.
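The three workflow steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline of any reviewed paper: the function names, the moving-average denoiser, the coulomb-counting feature, and the end-of-life threshold of 80% of rated capacity are all assumptions made for the example, and the capacity values are made up.

```python
from statistics import mean

def moving_average(values, window=3):
    """Denoise a series with a centered moving average (shorter window at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(mean(values[lo:hi]))
    return smoothed

def discharge_capacity_ah(currents_a, dt_s):
    """Feature: capacity of one discharge by coulomb counting (A*s converted to Ah)."""
    return sum(abs(i) for i in currents_a) * dt_s / 3600.0

def remaining_useful_life(capacities_ah, rated_capacity_ah, eol_fraction=0.8):
    """RUL label per cycle: cycles left until capacity first drops below the EOL threshold."""
    threshold = eol_fraction * rated_capacity_ah  # assumed end-of-life criterion
    eol_cycle = next(
        (k for k, c in enumerate(capacities_ah) if c < threshold),
        len(capacities_ah),
    )
    return [max(eol_cycle - k, 0) for k in range(len(capacities_ah))]

# Hypothetical per-cycle capacities in Ah (illustrative values only).
raw_capacity = [2.00, 1.97, 1.99, 1.90, 1.85, 1.70, 1.62, 1.55]
features = moving_average(raw_capacity, window=3)                 # step 2: denoise
labels = remaining_useful_life(features, rated_capacity_ah=2.0)   # training targets
```

The pairs of `features` and `labels` would then be passed to whichever regression model is chosen in the following subsections.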

Support Vector Machine
Support vector machines (SVM) have received widespread attention due to their strong advantages in processing small training data sets. SVM is a kernel-based non-parametric ML technique. When the size of the training data set increases, the number of support vectors will increase accordingly. For a complex system, the model can be adapted to the characteristics of the system when sufficient data are available, which makes the method highly flexible [40].
SVM uses two parallel hyperplanes to separate linearly separable data sets. Equation (1) is the decision boundary between the two parallel hyperplanes, where w is the weight vector and b is the bias parameter. The distance between the decision boundary and each hyperplane is defined for the standardized data set. For data sets that are not linearly separable, SVM introduces the hinge loss function based on the hyperplane to reduce the classification error.
The problem then becomes the minimization of the function in Equation (2), where n represents the number of samples and λ represents the regularization parameter. SVM can also use a kernel function to transform the low-dimensional input vector into a high-dimensional feature space and then use a hyperplane to separate the data, thereby applying the kernel method (the decision boundary is then described by Equations (3) and (4), where φ is the mapping function). The structure of the support vector machine algorithm is shown in Figure 2.
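For reference, the quantities named in this paragraph (w, b, n, λ, φ) appear in the common textbook form of the soft-margin SVM as written below. This is the standard formulation and is given under the assumption that it matches the intent of Equations (1) to (4); the class labels $y_i \in \{-1, +1\}$ and the coefficients $c_i$ are symbols introduced here and are not defined in the text.

```latex
\begin{gather*}
\mathbf{w}^{\mathsf{T}}\mathbf{x} - b = 0
  \qquad \text{(decision boundary)} \\
\min_{\mathbf{w},\,b}\;
  \frac{1}{n}\sum_{i=1}^{n}
  \max\bigl(0,\; 1 - y_i(\mathbf{w}^{\mathsf{T}}\mathbf{x}_i - b)\bigr)
  + \lambda \lVert \mathbf{w} \rVert^{2}
  \qquad \text{(hinge loss with regularization)} \\
k(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)
  \qquad \text{(kernel function)} \\
\mathbf{w} = \sum_{i=1}^{n} c_i\, y_i\, \varphi(\mathbf{x}_i)
  \qquad \text{(weights in feature space)}
\end{gather*}
```

The two parallel hyperplanes are then $\mathbf{w}^{\mathsf{T}}\mathbf{x} - b = \pm 1$, and the hinge term is zero for every sample that lies on the correct side of its hyperplane.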

Figure 2. The structure of the support vector machine algorithm (Electronics 2021, 10).
When SVM is used as a continuous value regression tool, it is called support vector regression (SVR). Like SVM, SVR needs to find a hyperplane. The difference is that SVM looks for the hyperplane with the largest margin, whereas in SVR a threshold ε is defined: as shown in Equation (5), deviations inside the strip of width ε around the regression function incur no loss, and only the points outside this strip contribute to the loss that is minimized.
Regression is achieved by searching for the smallest marginal fit. SVR is one of the most commonly used regression methods at this stage. In the regression model, the solution of the convex SVR problem can be obtained by constructing the Lagrangian function. At the same time, the mapping function can be used to convert a low-dimensional nonlinear input space into a high-dimensional linear feature space, turning a nonlinear regression problem into a linear one. SVR is a non-parametric regression method that can be updated through model retraining. Because of its ability to describe the nonlinear correlation between input and output data, SVR is suitable for health diagnosis and prediction tasks.
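The ε-threshold and the kernel mapping described above can be illustrated with a short Python sketch. It assumes the usual ε-insensitive loss and a radial basis function kernel; the function names, the parameter `gamma`, and the capacity values are chosen for this example only and do not come from the reviewed papers.

```python
import math

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Per-sample loss: zero inside the eps-wide strip, linear in the excess outside it."""
    return [max(0.0, abs(t - p) - eps) for t, p in zip(y_true, y_pred)]

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical measured and predicted capacities in Ah.
capacity_true = [1.90, 1.85, 1.80]
capacity_pred = [1.95, 1.85, 1.50]
losses = eps_insensitive_loss(capacity_true, capacity_pred, eps=0.1)
```

Only the third prediction lies outside the strip, so it is the only one with a nonzero loss; the first two predictions are treated as exact by the regression objective.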
To enhance the stability and robustness of SVR, decremental and incremental strategies are used to integrate large-scale training samples into SVR training, and at the same time, uncorrelated data are denoised. However, while the performance of the model is enhanced, the calculation time and complexity increase. Patil et al. [41] used feature vectors extracted from the voltage and temperature curves as the input data set for RUL prediction and built a prediction model based on SVR. The root mean square error (RMSE) of the model is 0.357%. At a 95% confidence level, the upper and lower errors are 7.87% and 10.75%, respectively. Zhao et al. [42] calculated the battery capacitance by varying the time intervals of the same voltage difference during the charging and discharging process, and combined it with the method of processing the data set when the feature vector is selected to improve the accuracy of SVR. The maximum RMSE of this method is 1%. Du et al. [43] established an SVR-based RUL prediction model for Lithium-ion batteries using six sets of coupled stress experimental data; the relative error of the RUL prediction for 600 cycles is below 5%.
When the SVM method performs RUL prediction, the activation function is usually selected as the radial basis function (RBF) kernel, the training algorithm is the logistic regression function margin, and the hyperparameters to be tuned are the regularization factor, the SVM regression type, and the kernel parameter.

Gaussian Process Regression
Gaussian process regression (GPR) is a kernel-based ML method in which the prediction is combined with prior knowledge within the Bayesian framework. GPR uses the variance around its mean prediction to describe the associated uncertainty. The structure of the GPR algorithm is shown in Figure 3. The GPR model is flexible, non-parametric, and probabilistic. It can be updated through online retraining and has been widely used in prognostic analysis. GPR was initially used to predict the decay trend of battery internal resistance, which is the main factor in the decrease of battery capacity [44]. Therefore, GPR is gradually being applied to the prediction of battery RUL.
Capacity fade is a very complicated nonlinear process. It is affected by many uncertain environmental factors and working conditions, and improper operation will also lead to rapid capacity fade. Thus, a single covariance function will lead to unreliable predictions for nonlinear mappings with multidimensional input variables. Therefore, an anisotropic kernel with a high-level structure should be constructed, and the training of GPR should start by obtaining the training data set and then initializing the hyperparameters. GPR uses the conjugate gradient method to determine the optimal values of the hyperparameters by decreasing the negative marginal log-likelihood function. Finally, RUL is estimated by Equations (6) and (7), which can be expressed as:
$\bar{f}_* = k_*^{T} (K + \sigma_n^2 I)^{-1} y$ (6)
$\mathrm{cov}(f_*) = k_{**} - k_*^{T} (K + \sigma_n^2 I)^{-1} k_*$ (7)
where $\bar{f}_*$ is the RUL estimation; $k_* = [k(x_1, x_*), k(x_2, x_*), \ldots, k(x_n, x_*)]^{T}$; $k_{**} = k(x_*, x_*)$; the kernel matrix is denoted as K, the output of the training data set is y, and I is the identity matrix. The matrix to be inverted is $K + \sigma_n^2 I$, and the hyperparameters are determined from the marginal log-likelihood function and its gradient.
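The predictive mean and variance of a zero-mean Gaussian process can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: a squared exponential kernel with scalar inputs, a small fixed noise variance, and hyperparameters that are set by hand rather than optimized. The helper names (`sq_exp_kernel`, `solve`, `gpr_predict`) and the capacity data are hypothetical.

```python
import math

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gpr_predict(x_train, y_train, x_star, noise_var=1e-4):
    """Posterior mean and variance at x_star for a zero-mean GP prior."""
    n = len(x_train)
    # K + sigma_n^2 I
    K_y = [[sq_exp_kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    k_star = [sq_exp_kernel(x, x_star) for x in x_train]
    alpha = solve(K_y, y_train)   # (K + sigma_n^2 I)^{-1} y
    v = solve(K_y, k_star)        # (K + sigma_n^2 I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = sq_exp_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Hypothetical capacities (Ah) at 0, 100, 200, 300 cycles, with x in units of 100 cycles.
x_train = [0.0, 1.0, 2.0, 3.0]
y_raw = [2.00, 1.92, 1.85, 1.74]
y_mean = sum(y_raw) / len(y_raw)
y_train = [y - y_mean for y in y_raw]   # center the data to match the zero-mean prior
mean_star, var_star = gpr_predict(x_train, y_train, 3.5)
capacity_estimate = mean_star + y_mean
```

The variance returned next to the mean is what allows GPR to report a confidence band around the predicted capacity or RUL; it grows as the query point moves away from the training cycles.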


Although GPR can predict nonlinear systems, as the complexity of prediction increases, the accuracy of GPR will drop rapidly [45]. To solve this problem, new covariance and mean functions have been introduced on the basis of the zero mean function and the diagonal squared exponential covariance function [46,47]. The performance of GPR is highly sensitive to the covariance function, so proper kernel selection and hyperparameter optimization can avoid the problem of excessive sensitivity. One method for improving GPR is to minimize the negative log marginal likelihood [48,49]. Training GPR usually requires inverting the covariance matrix, which increases the calculation time, the complexity of the algorithm, and the memory requirements [50]. To address the long calculation time and high complexity, various sparse methods based on the use of a subset of the training samples have been developed.
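For reference, the negative log marginal likelihood minimized during GPR training has the standard form below. The notation $K_y = K + \sigma_n^2 I$, the hyperparameter vector $\theta$, and $\alpha = K_y^{-1} y$ are introduced here for brevity and are not symbols used by the reviewed papers.

```latex
\begin{align*}
-\log p(y \mid X, \theta)
  &= \tfrac{1}{2}\, y^{\mathsf{T}} K_y^{-1} y
   + \tfrac{1}{2} \log \lvert K_y \rvert
   + \tfrac{n}{2} \log 2\pi , \\
\frac{\partial}{\partial \theta_j}\bigl(-\log p(y \mid X, \theta)\bigr)
  &= \tfrac{1}{2}\, \operatorname{tr}\!\Bigl(
     \bigl(K_y^{-1} - \alpha \alpha^{\mathsf{T}}\bigr)
     \frac{\partial K_y}{\partial \theta_j}\Bigr).
\end{align*}
```

Both expressions require $K_y^{-1}$, whose cost grows cubically with the number of training samples n. This is the origin of the calculation time and memory burden mentioned above, and it is the motivation for sparse approximations built on a subset of the samples.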
GPR is suitable for processing complex regression problems with high dimensionality, small sample size, and nonlinearity [48]. Yu et al. [51] improved the mixed Gaussian process function regression method, combined it with wavelet denoising for data processing, and used the improved method to predict the RUL of Lithium-ion batteries. The accuracy of this method can reach 2.2%. Compared with the original method, the accuracy is increased by 4.5%. The relative prediction errors of this method are all less than 7%. Li et al. [52] analyzed in depth the changing trend of the incremental capacity and extracted four key feature vectors based on the relationship between capacity change and battery aging. The extracted feature vectors are used as the input data of the Gaussian process regression, and a multi-time scale short-term battery aging model is constructed using GPR with kernel correction. The mean absolute error (MAE) and RMSE of this method are both less than 26 cycles. Li et al. [53] combined the characteristics of the equivalent voltage change and the corresponding capacity change with a dual Gaussian process regression model to predict the battery health status. When this method is used to estimate the long-term health status of four batteries, the predicted RUL error is less than 23 cycles.
When selecting the GPR method for RUL prediction, the activation function is usually a kernel function, the training algorithm is a squared exponential kernel or marginal log-likelihood function, and the hyperparameters to be tuned are the input dimension length scale and the latent function values.

Extreme Learning Machine
With the development of battery RUL prediction technology, extreme learning machines (ELM) have also been applied in this field. ELM selects the hidden layer unit settings randomly. Once the single hidden layer feedforward neural network (SLFN) is determined, ELM can determine the output weights of the SLFN analytically. Because of its fast learning speed and high prediction accuracy, ELM has been widely used in single-step and multi-step prediction algorithms. In the process of state estimation for nonlinear complex systems, ELM has strong flexibility, scalability, and high learning performance, which allows it to approach the real value quickly. The structure of ELM is usually divided into three layers, namely the input layer, hidden layer, and output layer, as shown in Figure 4.
During the data input process, ELM randomly assigns the input weights. When the data are transmitted between the input layer and the hidden layer, the hidden layer bias is also set randomly, and neither the input weights nor the hidden layer biases need to be adjusted after setting. When the data pass through the hidden layer and enter the output layer, the connection weights are determined by solving the equation once. Since the connection weights do not need to be adjusted iteratively, ELM converges quickly. In Figure 4, $x_i$ represents the input layer, and $y_i$ represents the output layer. In the mathematical expression for the output of the hidden layer, $x = [x_{i1}, x_{i2}, \ldots, x_{iN}]^{T}$ is the input vector, $b_i$ is the hidden layer bias, N is the number of hidden neurons, and $a_i = [a_{i1}, a_{i2}, \ldots, a_{iN}]^{T}$ is the weight vector between the i-th hidden node and the input nodes; the output weights of the output layer neurons complete the model.
Zhu et al. [54] developed and optimized the ELM, integrated the gray wolf optimization (GWO) into the ELM algorithm, and improved the weight and threshold of the ELM to form a new DGWO-ELM algorithm. The minimum RMSE of this algorithm can reach 0.43%. Fan et al. [55] also focused on the combination of the hybrid gray wolf optimizer (HGWO) algorithm and ELM and added an attention mechanism to optimize the forgotten online sequential extreme learning machine (FOS-ELM). The RMSE of the improved hybrid method can reach 0.0121.
Guo et al. [56] combined the random vector functional link (RVFL) network and ELM to obtain a new hybrid data-driven SOH and RUL joint state estimation model. To quantitatively evaluate the RUL prediction interval, the authors developed an uncertainty management method based on bootstrap to improve the accuracy of prediction. Compared with the latest learning algorithm, this method improves the robustness of the model and reduces the prediction error.
When ELM is used as the RUL prediction method, the activation function is usually selected as sigmoid, the training algorithm is a linear system function, and the hyperparameters are adjusted through hidden neurons.
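The ELM training procedure described in this subsection (random, fixed input weights and biases, sigmoid activation, output weights obtained from a single linear solve) can be sketched as follows. This is a minimal pure-Python illustration: the small ridge term added to the normal equations, the uniform weight range, and all function names and data are assumptions made for numerical stability and readability, not details taken from the reviewed papers.

```python
import math
import random

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def hidden_output(x, A, b):
    """Sigmoid activations of all hidden neurons for one input vector x."""
    return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(a_i, x)) + b_i)))
            for a_i, b_i in zip(A, b)]

def elm_train(X, y, n_hidden=10, ridge=1e-6, seed=0):
    """Draw random input weights once, then solve for the output weights beta."""
    rng = random.Random(seed)
    n_in = len(X[0])
    A = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = [hidden_output(x, A, b) for x in X]
    # Regularized normal equations: (H^T H + ridge * I) beta = H^T y
    G = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
          for j in range(n_hidden)] for i in range(n_hidden)]
    r = [sum(h[i] * t for h, t in zip(H, y)) for i in range(n_hidden)]
    beta = solve(G, r)
    return A, b, beta

def elm_predict(x, model):
    A, b, beta = model
    return sum(bt * h for bt, h in zip(beta, hidden_output(x, A, b)))

# Hypothetical linear capacity fade over a normalized cycle index in [0, 1].
X = [[k / 10.0] for k in range(11)]
y = [2.0 - 0.5 * x[0] for x in X]
model = elm_train(X, y, n_hidden=10, seed=0)
fitted = [elm_predict(x, model) for x in X]
```

Because no iterative weight update is involved, the whole training cost is one linear solve of size `n_hidden`, which is why the number of hidden neurons is the main hyperparameter.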

Deep Neural Network
Unlike the single-layer feedforward neural network (SLFNN) structure of the standard ANN model, the DNN model contains multiple hidden layers. In the DNN algorithm, a functional relationship is established between the input vector and the output vector through nonlinear calculations. In the calculation process, the function parameters are calculated by a certain method. The DNN contains multiple hidden layers, as shown in Figure 5.
SLFNN is shown in Equation (10), where the input data are $x$, the output data are $y$, the activation function is denoted by $f_a$, and the weight and bias are W and b, respectively. In the training process, the real value is approached by continuous iterative updating of W. The He method is usually used for initialization. DNN can be described by Equation (11).
In [57], Ma et al. introduced a transfer learning method based on the DNN method. To select the battery whose performance is most similar to that of the target battery as a reference, the average Euclidean distance-based (AED) method with transferable measurement characteristics is used to search the historical database. Then, the data are used as the input vector to train the prediction model based on the stacked denoising autoencoder (SDA), and finally the RUL of the target battery is obtained. The improved method can increase the prediction speed by nearly 30%. Hong et al. [58] proposed a new DNN prediction model for the long prediction period of Lithium-ion battery RUL. The model uses an end-to-end deep learning framework to achieve the goal of completing RUL predictions through short-term measurements. The average absolute error rate of this method reaches 10.6%. In [59], the authors applied DNN to predict the RUL of Lithium-ion batteries in the field of electric vehicles. The capacity was predicted using 11 extracted features, and two DNNs were trained. One DNN performed statistical analysis on the capacity attenuation of impedance attenuation as the degree of deterioration increased, and the other DNN obtained the probability prediction based on the capacity attenuation trend to improve the predictive accuracy of the remaining service life. The RMSE of this method is approximately 3.59%.
When DNN is used as an RUL prediction method, the activation function is usually selected as sigmoid or ReLU, the training algorithm is gradient descent, backpropagation through time, and the hyperparameters are adjusted through hidden layers.
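The ingredients named in this subsection (several stacked hidden layers, a ReLU activation, He initialization of W) can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the layer sizes, the three input features, and the function names are hypothetical, the output layer is assumed to be linear for regression, and no training loop is shown.

```python
import math
import random

def he_init(n_in, n_out, rng):
    """He initialization: weights drawn from N(0, 2 / n_in), biases set to zero."""
    std = math.sqrt(2.0 / n_in)
    W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def relu(v):
    return [max(0.0, z) for z in v]

def dense(W, b, x):
    """Affine map W x + b for one layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def dnn_forward(layers, x):
    """Apply y = f_a(W x + b) layer by layer; the last layer is kept linear."""
    h = x
    for W, b in layers[:-1]:
        h = relu(dense(W, b, h))
    W, b = layers[-1]
    return dense(W, b, h)

rng = random.Random(42)
sizes = [3, 8, 8, 1]  # hypothetical: 3 input features, two hidden layers, 1 output
layers = [he_init(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
rul_estimate = dnn_forward(layers, [0.25, 3.7, 1.2])  # untrained, so the value is arbitrary
```

Adding or removing entries in `sizes` changes the number of hidden layers, which is the hyperparameter highlighted in the summary above.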

Recurrent Neural Network
Recurrent neural networks are widely used to process time-series data because of their time series memory. RNN is an SLFNN, with a classic three-layer model structure.
According to time changes, the time series variables at each moment are used as the input of the RNN. Through training, the RNN can predict the changing trend of the input variables [60]. Among the many RNN architectures, the long short-term memory (LSTM) algorithm is the most representative [44,61]. LSTM has a forget gate that can filter low-correlation inputs and enhance strong-correlation inputs. In this way, the problem of vanishing and exploding gradients can be alleviated [62]. Compared with the traditional RNN algorithm, LSTM is more suitable for scenarios that require long-term prediction and has better robustness and accuracy [63].
Figure 6 shows the typical structure of an RNN. For an input sample $x = (x_1, x_2, \ldots, x_N)$, where N denotes the sequence length, the RNN calculates the hidden state vector sequence $h = (h_1, h_2, \ldots, h_N)$ and outputs the sequence $y = (y_1, y_2, \ldots, y_N)$ through iteration of Equations (12) and (13), from t = 1 to N:
$h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$ (12)
$y_t = W_{hy} h_t + b_y$ (13)
where the weight and bias are W and b, respectively. The weight between the input layer and the hidden layer is represented by $W_{xh}$, and the bias vector and the nonlinear activation function of the hidden layer are $b_h$ and $H$, respectively.
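The iteration over t = 1 to N can be sketched in plain Python as below. The sketch assumes the common recurrence in which the hidden state depends on the current input and the previous hidden state, with tanh chosen for the activation H; the recurrent and output weights (`W_hh`, `W_hy`, `b_y`) and the toy values are assumptions made for the example.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run the recurrence over a sequence and return all hidden states and outputs."""
    h = [0.0] * len(b_h)  # initial hidden state h_0
    hs, ys = [], []
    for x_t in xs:
        pre = [a + c + b for a, c, b in zip(matvec(W_xh, x_t), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]                          # hidden state update
        y_t = [o + b for o, b in zip(matvec(W_hy, h), b_y)]      # output at step t
        hs.append(h)
        ys.append(y_t)
    return hs, ys

# Toy one-dimensional example: an input pulse at t = 1 followed by zeros.
xs = [[1.0], [0.0], [0.0]]
hs, ys = rnn_forward(xs, W_xh=[[1.0]], W_hh=[[0.5]], b_h=[0.0],
                     W_hy=[[2.0]], b_y=[0.0])
```

Even though the input is zero after the first step, the hidden state stays nonzero because it is fed back through `W_hh`; this feedback is the time series memory referred to above.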
Based on the RNN structure, an LSTM architecture is used. The RNN algorithm uses the backpropagation method for training, but this method usually brings about the problem of gradient explosion or disappearance. LSTM uses memory cells instead of hidden nodes to solve this problem. Figure 7 shows the structure of a single LSTM memory cell. At each time step, the storage unit is accessed, updated, and cleared by multiple gates. The input vector of the LSTM unit at time t is , the hidden state is expressed as , is the unit memory, the weight matrix and bias parameters are W and b, respectively, the activation function of the input gate is , the activation function of the forgetting gate is Wu et al. [19] applied the bat particle filter (Bat-PF) to optimize the neural network algorithm. The formed NN+Bat-PF model uses Bat-PF to recursively update the model parameters. The error of predicting RUL is two cycles in 500 prediction cycles, and the width of the probability density function (PDF) is 35 cycles. She et al. [7], based on the radial basis function NN model, used the incremental capacity analysis method to analyze the battery capacity aging trend, and the RUL was predicted based on the relationship between the capacity and the remaining service lifetime of the battery. The prediction accuracy of this method and MAE are 90% and 4.00%, respectively. To realize the online estimation of the RUL of Lithium-ion batteries, Wu et al. [64] used the importance sampling (IS) method to process historical data sets. The feature vector is selected as the input of the feedforward neural network (FFNN), and 40 hidden layer neurons are used for training. This improved online estimation method has an error of less than 5% in actual operation. For online RUL estimation, Zhang et al. [65] combined the incremental capacity analysis method while simplifying the ANN model. There are only two neurons in the input layer of the simplified ANN model. 
The maximum MAE of this method is four cycles, and the maximum RMSE is six cycles.
Based on the RNN structure, an LSTM architecture is used. The RNN is trained with backpropagation, which often suffers from exploding or vanishing gradients. LSTM uses memory cells instead of hidden nodes to solve this problem. Figure 7 shows the structure of a single LSTM memory cell. At each time step, the memory cell is accessed, updated, and cleared by multiple gates. The input vector of the LSTM unit at time t is x_t, the hidden state is h_t, c_t is the cell memory, W and b are the weight matrices and bias parameters, respectively, and i_t, f_t, and o_t are the activations of the input gate, the forget gate, and the output gate, respectively. When new input data are fed into the cell, the information is accumulated in the memory cell if the input gate i_t is activated. The previous cell state c_(t−1) can be "forgotten" if the forget gate f_t is on. The output gate o_t determines whether the newest cell output c_t can be propagated to the final state h_t. H is implemented by the gate equations, in which σ represents the logistic sigmoid function, W_hi denotes the hidden-input gate matrix, and W_xo is the input-output gate matrix. The LSTM algorithm structure is shown in Figure 7.
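The gate mechanism described above can be sketched as a single time step. This is a minimal illustration that assumes the widely used LSTM formulation without peephole connections; the gate equations of the original article are not reproduced in this section and may differ. The parameter names follow the W_xi / W_hi pattern of the text, while `init_params` and the dictionary layout are hypothetical helpers.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid, the gate activation written as sigma in the text."""
    return 1.0 / (1.0 + math.exp(-z))


def gate_input(W_x, x_t, W_h, h_prev, b):
    """Compute W_x x_t + W_h h_(t-1) + b row by row for plain lists."""
    return [
        sum(w * x for w, x in zip(wx_row, x_t))
        + sum(w * h for w, h in zip(wh_row, h_prev))
        + b_k
        for wx_row, wh_row, b_k in zip(W_x, W_h, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step (assumed standard form, no peephole connections)."""
    i_t = [sigmoid(z) for z in gate_input(p["W_xi"], x_t, p["W_hi"], h_prev, p["b_i"])]
    f_t = [sigmoid(z) for z in gate_input(p["W_xf"], x_t, p["W_hf"], h_prev, p["b_f"])]
    o_t = [sigmoid(z) for z in gate_input(p["W_xo"], x_t, p["W_ho"], h_prev, p["b_o"])]
    # Candidate cell content computed from the current input and previous state.
    g_t = [math.tanh(z) for z in gate_input(p["W_xc"], x_t, p["W_hc"], h_prev, p["b_c"])]
    # Forget gate scales the old memory; input gate admits new information.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Output gate decides how much of the cell memory reaches the hidden state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(n_in, n_hidden, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for g in "ifoc":
        p["W_x" + g] = mat(n_hidden, n_in)
        p["W_h" + g] = mat(n_hidden, n_hidden)
        p["b_" + g] = [0.0] * n_hidden
    return p
```

The cell state is updated additively (f_t · c_(t−1) + i_t · g_t) rather than being passed through a squashing function at every step, which is what lets gradients survive over long sequences.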
Li et al. [66] proposed an Elman-LSTM method. This method combines the temporal memory of LSTM with the advantages of the Elman neural network and introduces the empirical mode decomposition algorithm. The relative prediction errors of this Elman-LSTM method are 3.3% and 3.21%, respectively. Qu et al. [67] combined a simple and easy-to-implement particle swarm optimization algorithm with LSTM training and further introduced an attention mechanism to achieve joint state estimation of SOH and RUL. The average error of this method is −3 and the RMSE is 0.0362. Cui et al. [68] introduced the unscented Kalman filter (UKF) algorithm based on the neural network framework of LSTM and NN, forming a new data-driven hybrid model. The average error of this method is 5. Yang et al. [69] combined the optimized bidirectional long short-term memory network (Bi-LSTM) with the convolutional neural network, which is likewise a neural network algorithm. The minimum error of this hybrid neural network algorithm is 1.04%. Chinomona et al. [70] proposed a forward feature selection algorithm that uses a combination of RNN and LSTM to select the best feature set. Using partial charge/discharge data, the RMSE and MAE of this method are 0.00286 and 0.00222, respectively.
Ma et al. [71] combined the convolutional neural network with the LSTM method and merged the resulting hybrid method with the false nearest neighbor (FNN) method. The accuracy of this method is 98.21%. Qiao et al. [72] combined the empirical mode decomposition method, which is suitable for processing nonlinear non-stationary signals, with a DNN, which is well suited to nonlinear system prediction, and an LSTM, which has temporal memory characteristics, to predict RUL. Compared with traditional methods, the algorithm's MAE and RMSE decrease significantly (75% and 90.8%, respectively). The standard deviation of this method is 1.36626. Li et al. [73] designed a variant of LSTM called AST-LSTM NN, which has many-to-one and one-to-one mapping structures. The absolute error of the RUL predicted by this method is 0.0831. Liu et al. [74] combined the advantages of LSTM and GPR: LSTM accurately predicts the long-term dynamic trend of capacity degradation, while GPR captures the prediction deviation caused by capacity regeneration. The RMSE and maximum error of the LSTM+GPR model are 0.0032 and 0.6%, respectively. Parker et al. [75] proposed a many-to-one framework based on LSTM to adapt to various input types. The mean absolute percentage error (MAPE) of the proposed model is 63.7% higher than that of the traditional method.
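The error measures quoted in this survey (MAE, RMSE, and MAPE) follow their usual definitions, which can be sketched as below. The function names and the example values in the usage note are illustrative assumptions. The reviewed studies report these measures in different units (cycles, capacity, or percent), so the quoted figures are not directly comparable across papers.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error; penalizes large deviations more than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent.

    Assumes that no true value is zero.
    """
    return 100.0 * sum(
        abs((t - p) / t) for t, p in zip(y_true, y_pred)
    ) / len(y_true)
```

For example, with true RUL values of 100, 200, and 400 cycles and predictions of 110, 190, and 400 cycles, `mae` gives about 6.67 cycles, `rmse` about 8.16 cycles, and `mape` 5%.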
When an RNN is used for RUL prediction, the activation function is usually the sigmoid, training uses gradient descent with backpropagation through time, and potential overfitting is addressed through hyperparameter adjustment. For LSTM, the activation functions are usually sigmoid and tanh, training likewise uses gradient descent with backpropagation through time, and the hyperparameter that is adjusted is the number of hidden neurons.

Comparison
Even under constant working conditions, Lithium-ion batteries do not necessarily degrade linearly (e.g., in capacity fade, resistance increase, or power decrease) during their life. Therefore, the ideal RUL prediction method should be able to capture these nonlinear behaviors. If the prediction method focuses only on minimizing the error on the training data, it may overfit. The accuracy of data-driven methods depends on the correct adjustment of hyperparameters. The training data can also contain measurement noise; therefore, the prediction method should account for uncertainty. The performance of various RUL prediction methods can be evaluated from the following aspects: (1) activation function; (2) training algorithm; (3) hyperparameter adjustment; (4) uncertainty management; (5) robustness. These aspects are shown in Table 2.
Choosing an appropriate amount of data is essential to obtain a satisfactory RUL estimation result. In actual operation, online learning is more practical. In this case, the size of the training set grows over time, and a large data set may cause a huge computational burden. Given limited memory and computing power, it is necessary to know the complexity of the input and output vectors and the algorithm structure of each method. The accuracy of the RUL estimation also depends greatly on collecting the relevant data. Normally, the original data are normalized to shorten the training time and improve the performance of the algorithm. The following factors can be considered to evaluate the performance of various RUL prediction methods: (1) input features and output; (2) structure; (3) data calculation. These factors are shown in Table 3.

Table 3. Summaries of the different criteria for computational complexity evaluation defined in this section.

Based on the previous summary, the advantages and disadvantages of the proposed methods are compared, as shown in Table 4. Faced with computational challenges, sparsity may become a key property for handling excessive input data. SVR is a sparse algorithm because of its ε-insensitive loss function. SVM has satisfactory performance in nonlinear and high-dimensional models, can deal with local minima and small sample sizes, and has a short calculation time. However, it does not express uncertainty, and its kernel and regularization parameters are difficult to determine. GPR is not a sparse model, but different data processing methods can be used to reduce the training data size. Owing to its non-parametric and probabilistic nature, GPR has better robustness, computational efficiency, and prediction capability. The covariance provided by GPR gives it excellent uncertainty management capabilities, and it is flexible and adaptable when dealing with high-dimensional and small-sample data sets. However, when it is applied to high-dimensional spaces, its efficiency is reduced, the choice of kernel function strongly affects performance, and the amount of calculation is large.
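The sparsity of SVR mentioned above comes from its ε-insensitive loss, which ignores residuals smaller than ε. The sketch below illustrates only this loss, not a full SVR solver; the function names and the default ε = 0.1 are assumptions chosen for illustration.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Loss is zero inside the eps-tube and grows linearly outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def points_outside_tube(ys_true, ys_pred, eps=0.1):
    """Indices of samples with non-zero loss.

    Only samples on or outside the tube can become support vectors;
    this helper returns those strictly outside it.
    """
    return [
        k for k, (t, p) in enumerate(zip(ys_true, ys_pred))
        if eps_insensitive_loss(t, p, eps) > 0.0
    ]
```

Samples inside the tube contribute nothing to the objective, so the fitted model depends on only a subset of the training data, which is what keeps the memory footprint small when the data set grows.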

ELM has better scalability and generalization performance, a simple structure, and low computational complexity. Its accuracy depends on the number of hidden neurons. DNN has a strong independent learning ability, strong generalization ability, and high accuracy, and is suitable for nonlinear and complex systems. Its performance depends on the number of hidden-layer neurons and the amount of historical input data; it needs enough training data, its structure is complex, and its memory consumption is large. RNN has high prediction accuracy, is suitable for nonlinear and complex systems, and has strong long-term RUL prediction capabilities. However, its uncertainty management ability is poor and it is prone to overfitting. LSTM gives satisfactory results for long-term dependencies, and the computational intensity of its online phase is low. However, it has a lengthy and complicated training process and requires expensive equipment to accelerate training.

Challenges and Prospects
ML is the preferred approach for predicting future degradation trends from historical cycling data. Among the ML methods, DNN has a strong independent learning ability and generalization ability, which makes it more suitable for RUL prediction. Owing to its excellent adaptability, ML is suitable for strongly nonlinear systems and fits the true trajectory of the system by automatically optimizing the model parameters. However, its accuracy relies on a large amount of historical data to train the algorithm, which is an inevitable limitation. In actual operation, a large amount of training data increases the calculation time and computational complexity, and can also easily lead to overfitting. A balance is needed between the prediction accuracy gained from ML algorithms and their computational complexity.
As more battery state estimation methods emerge and are applied in actual operating systems, online state estimation will become the trend of future development. BMS will also be upgraded from a traditional offline system to an online management system. In terms of the types of battery state estimation methods, single-state estimation will be upgraded to joint state estimation. In actual operation, the battery states are coupled, so joint state estimation is more practical and more accurate. It can be expected that multi-state collaborative real-time online management solutions based on artificial intelligence will become the future development direction.
ML algorithms are consistent with the latest developments in artificial intelligence. The future direction of data-driven Lithium-ion battery RUL prediction will focus on developing hybrid ML models that are widely applicable to multiple types of prediction data. Real-time online ML battery management solutions based on big data and cloud computing platforms are expected to become the main method for future Lithium-ion battery RUL predictions.

Conclusions
This paper reviews the ML-based RUL prediction methods for Lithium-ion batteries proposed in the literature. A set of criteria is defined to evaluate the accuracy and computational cost of the RUL prediction methods. From the perspective of computational complexity, SVM, GPR, and ELM have a simple structure and a small calculation amount, but they are more suitable for problems with small sample sizes. From the perspective of prediction accuracy, DNN, RNN, and LSTM all perform well and are suitable for nonlinear complex systems, such as Lithium-ion batteries. Among them, RNN has a relatively poor uncertainty management ability, and LSTM has a long and complicated training process and requires expensive equipment to accelerate training. In summary, DNN has a strong independent learning ability and generalization ability, making it more suitable for RUL prediction.
Author Contributions: S.J. wrote the paper; S.J., X.S., D.-I.S. and R.T. designed the structure of the paper; X.H., X.S. and S.W. reviewed the paper; S.J. and D.-I.S. edited the paper. All authors have read and agreed to the published version of the manuscript.