Unified Evaluation Framework for Stochastic Algorithms Applied to Remaining Useful Life Prognosis Problems

Abstract: A unified evaluation framework for stochastic tools is developed in this paper. Firstly, we provide a set of already existing quantitative and qualitative metrics that rate the relevant aspects of the performance of a stochastic prognosis algorithm. Secondly, we provide innovative guidelines to detect and minimize the effect of side aspects that affect the algorithms' performance. Those aspects are related to the input uncertainty (the uncertainty in the data and the prior knowledge), the parametrization method and the uncertainty propagation method. The proposed evaluation framework is contextualized on a Lithium-ion battery Remaining Useful Life prognosis problem. As an example, a Particle Filter is evaluated. In this example, two different data sets taken from aged NCA batteries and two semi-empirical aging models available in the literature feed the Particle Filter under evaluation. The obtained results show that the proposed framework gives enough detail to take decisions about the viability of the chosen algorithm.


Introduction
Nowadays, industry has increased the demand for prognostic solutions to optimize as much as possible the operation efficiency and the return on investment. RUL prognosis is becoming popular in the design of solution approaches and in decision-making applications. To this end, data mining algorithms are the methods gaining the most popularity, and among them, the ones that quantify the uncertainty, such as the stochastic algorithms, are the most popular [1]. However, there are many kinds of stochastic algorithms available, and the selection of the optimal algorithm for the desired application becomes non-trivial.
In order to find the optimal algorithm, there are many studies that compare different kinds of stochastic algorithms [2,3] or that present improved algorithms which overcome deficiencies of the original ones [4,5]. For that, every author uses different evaluation metrics. The metrics that appear most often in this kind of study quantify the estimation error, such as the Absolute Error (AE) [6-8], the Mean Absolute Error (MAE) [9-11], the Mean Squared Error (MSE) [12], the Mean Absolute Percentage Error (MAPE) [13-15], the Relative Predicting Error (RPE) [4], the Root Mean Squared Error (RMSE) [5,8,16], the Relative Accuracy (RA) [17,18], the Score [19], the Skill Score (SS) [20], the Bayesian Cramér-Rao Lower Bounds (BCRLB) [21] and their variations considering their mean value [5], maximum value [16], normalized value [22] or variance [5,23]. In addition to this, metrics that represent the precision in estimating the RUL are also used, such as the probability density function width (PDF width) [24] or the probability of estimating the real RUL [25]. Some other studies also evaluate the timeliness of the algorithms based on more sophisticated metrics such as the Prognosis Horizon (PH) [17,26], the α-λ performance [23], the Convergence of the Relative Accuracy (CRA) [18] or the convergence time [27].
This paper proposes to measure and quantify the key attributes of a stochastic algorithm through the calculation of a specific set of metrics (quantitative evaluation) along with the display of a set of graphs (qualitative evaluation). The proposal can be used as a standardized language with which technology developers and users can share their findings [33]. The context of the proposed evaluation method in this paper is a RUL prognosis problem (description of the prediction performance), with the goal of developing and implementing robust performance assessment algorithms with the desired performance levels as well as implementing the prognostics system within user specifications.

Quantitative Method
The proposed quantitative method quantifies three key performance attributes in RUL prognosis problems: the correctness, the timeliness and the computational burden. The correctness is the main performance characteristic evaluated in all algorithms; it refers to the accuracy and precision of the predictions at a specific prediction-time-instant. However, the algorithm is expected to predict a more general case under uncertainty, and its performance level should be invariant to the prediction-time-instant. Therefore, the timeliness is also evaluated; it refers to the time aspects related to accurate predictions. Additionally, algorithms are often rejected due to final-user requirements, which the prognosis algorithm needs to meet. Therefore, the computational burden is also evaluated. To cover all these aspects, we present three sets of unified evaluation metrics (Tables 1-3).

Table 1. Set of metrics that quantifies the correctness.

RA
Relative Accuracy (RA) of the predicted RUL with respect to the real RUL [18].

P_value
The probability of estimating the ground truth (from a normalized Probability Density Function (PDF)) [36].

P_width
The relative width of the probability distribution with a 68% confidence range with respect to the real RUL [24].

Table 2. Set of metrics that quantifies the timeliness.

PH
The Prognosis Horizon (PH) defines the time-instant of interest from which the accuracy of the algorithm reaches a certain threshold β (the minimum acceptable probability mass) [18].

CRA
The convergence of the relative accuracy (CRA) quantifies the rate at which the RA improves with time [18].

Table 3. Set of metrics that quantifies the computational performance.

FLOP counts
The number of operations executed by an algorithm, counted as Floating-point operations (FLOP) [3].
In addition to these three key performance attributes, there is another key performance attribute that is not considered in the proposed quantitative method: the confidence. The confidence refers to the level of trust that can be placed in a prediction method's output [33]. The confidence quantification has been discarded from the proposed quantitative method because the trust in the output depends more on the trust in the data and the prior knowledge of the system (the inputs) than on the algorithm itself [25]. The confidence remains almost the same when the same inputs are used.

Correctness
When searching for an optimal algorithm, correctness is the first attribute that comes to mind. Correctness refers to the accuracy and precision of the predicted distributions. The metrics that evaluate the correctness measure the deviation of a prediction output from the ground truth and the spread of the distribution at any given time-instant that may be of interest to a particular application [33]. For this aim, the proposed metrics to measure the correctness of the obtained output with respect to its desired specification are, in terms of accuracy, the Root Mean Squared Error (RMSE) of the prediction [5], the Relative Accuracy (RA) of the RUL value [17] and the probability of predicting the ground truth [25]; and, in terms of precision, the probability distribution width [24] (Table 1).
The RMSE is one of the most used metrics in evaluation and comparison studies [5,8,38]. In this paper, the prediction RMSE is evaluated as an accuracy indicator (Equation (1)). This metric shows the average difference between the ground truth data (y) and the predictions (ŷ), quantifying the accuracy of the tracking of the system's behavior (the trend under the noise). The RMSE is useful to show if there is something off in the tracking of the system's behavior (accurate predictions usually have low prediction errors), and consequently, in the prediction accuracy of the algorithm. However, the RMSE also quantifies the noise of the data with respect to the model. Therefore, the accuracy evaluation should not be based on this metric alone (the noise in the data set also affects its value).
Therefore, in order to have more information about the accuracy of the prediction algorithm, another metric is proposed: the Relative Accuracy (RA). This metric quantifies the accuracy in predicting the most probable value of the desired event (the end-of-life event). It is calculated by Equation (2). The range of values for RA is (−∞, 1], where the perfect score is 1 [33].
To complete the accuracy evaluation, the accuracy of the predicted distribution is also quantified. This paper proposes to calculate the probability of predicting the real RUL (P_value), see Equation (3). The probability value is taken from the normalized probability distribution of the predicted RUL. Thanks to this metric, we can also determine whether the uncertainty has been underestimated: when the probability of predicting the real RUL is near 0, the uncertainty can be considered underestimated.
Once the accuracy of the algorithm has been properly quantified, the precision (the spread of the predicted distribution) needs to be addressed. For this, this paper proposes to use the relative probability distribution width (PDF width, P_width [24]), also called confidence interval [17], shown in Equation (4).
This fourth metric quantifies the relative number of time-instants between the two time-instants that delimit a certain central probability mass of the estimated RUL distribution.
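As an illustration, the following Python sketch gives one possible reading of the four correctness metrics. Equations (1)-(4) are not reproduced in this section, so the exact formulas below are our interpretation of the cited definitions [5,17,24,25]:

```python
import numpy as np

def prediction_rmse(y, y_hat):
    """Prediction RMSE (Equation (1)): average deviation between ground truth
    y and the algorithm's response y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def relative_accuracy(rul_true, rul_pred):
    """RA (Equation (2)); range (-inf, 1], where 1 is the perfect score."""
    return 1.0 - abs(rul_true - rul_pred) / rul_true

def p_value(rul_pdf, rul_true):
    """P_value (Equation (3)): mass the normalized RUL PDF assigns to the real
    RUL; rul_pdf maps candidate RUL values (time-instants) to probabilities."""
    total = sum(rul_pdf.values())
    return rul_pdf.get(rul_true, 0.0) / total

def p_width(rul_samples, rul_true, mass=0.68):
    """P_width (Equation (4)): width of the central 68% interval of the
    predicted RUL distribution, relative to the real RUL."""
    lo, hi = np.percentile(rul_samples, [50 * (1 - mass), 50 * (1 + mass)])
    return (hi - lo) / rul_true
```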

Timeliness
Timeliness refers to the time aspects related to the availability and usability of predictions. The metrics that evaluate this attribute measure how quickly a prediction algorithm produces its outputs, in comparison to the effects it is mitigating [33]. For this aim, the proposed metrics to measure the timeliness are the Prognosis Horizon (PH) and the Convergence of the Relative Accuracy (CRA) (Table 2).
The PH defines the first time at which the prediction satisfies a certain criterion (generally defined by a predefined β threshold, Equation (5)) and uses this time to calculate the period between this event and the event to be predicted (the end-of-life (EOL) event). Thanks to this, the availability aspect of the timeliness attribute is put under evaluation: the greater the PH, the better the timeliness performance of the algorithm (faster availability). To standardize this metric, a relative version of it is proposed in this paper, see Equation (6).
Here, β is the minimum acceptable probability mass of the prediction PDF within the bounds given by α. Thanks to the PH, the most important aspect of the timeliness (the availability of predictions) is properly described. However, another aspect of the timeliness is also of interest when evaluating and comparing prognostic algorithms: the improvement rate of the accuracy and precision metrics with time (the convergence). The CRA measures how quickly the relative error of the predictions is reduced. For that, the centroid of the area under the RA curve is calculated in the same way as in [18], see Equations (7) and (9).
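Since Equations (5)-(9) are not reproduced here, the following minimal Python sketch gives our hedged reading of both timeliness quantities: a relative PH (normalized by the full horizon, so that 1 corresponds to the earliest possible prediction-time-instant) and the centroid of the area under the RA(t) curve from which the CRA of [18] is derived:

```python
import numpy as np

def relative_ph(times, meets_beta, t_eol):
    """Relative Prognosis Horizon: meets_beta[i] flags whether the prediction
    at times[i] places at least a beta probability mass inside the alpha
    bounds. Normalizing by the full horizon is our assumption for Equation (6)."""
    hits = np.flatnonzero(meets_beta)
    if hits.size == 0:
        return 0.0                       # the criterion is never satisfied
    return (t_eol - times[hits[0]]) / (t_eol - times[0])

def ra_centroid(times, ra):
    """Centroid (x_c, y_c) of the area under the RA(t) curve, read piecewise-
    constant as in [18]; an earlier centroid means faster convergence."""
    t = np.asarray(times, float)
    m = np.asarray(ra, float)[:-1]
    dt = np.diff(t)
    area = np.sum(dt * m)
    x_c = 0.5 * np.sum((t[1:] ** 2 - t[:-1] ** 2) * m) / area
    y_c = 0.5 * np.sum(dt * m ** 2) / area
    return x_c, y_c
```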

Computational Performance
Computational performance quantification is required to meet the resource constraints of the user application [33]. For this aim, the proposed metric to measure the computational performance is the count of floating-point operations (FLOP) (Table 3). This metric represents the number of numerical operations (both basic and complex mathematical operations translated into basic numerical operations) that the algorithm executes. Further details on how to calculate this metric can be found in [3].

Qualitative Method
The proposed qualitative method focuses on describing the correctness and timeliness attributes of the prognosis algorithm in a more visual way. For this, a figure that combines the α-λ accuracy [18] and the Prognosis Horizon boundary fulfilment is displayed (Figure 1). This figure can represent both attributes in a synthesized manner and give enough clues when discussing the obtained metrics. It is a modified version of the schematic representation of the RA designed in [18].
In case the discussion needs extra information, a second graphical aid is proposed: the graphical representation of the estimations made by the algorithm for a specific trial at a specific evaluation time (called the trial-instant figure in this paper) [25]. This figure shows every detail of the performance of the algorithm at a concrete prediction-time-instant. However, its lack of synthesis leads us to propose the use of this graphical aid only when there is a major doubt in the understanding of the metrics. It is the same figure used by other authors [24,39] when evaluating their prognosis algorithms.

PH and α-λ Accuracy
The PH is already defined as a metric in itself, which quantifies the timeliness of the prediction algorithm. However, the illustration of the PH boundary fulfilment also gives information about the correctness of the prediction algorithm. The fulfilment of the PH boundary determines that a β mass of the estimated probability distribution is inside the defined precision boundaries α+1 and α−1 (Equation (5)). As for the α-λ accuracy boundary fulfilment, the precision boundaries that delimit the acceptable β mass of the estimated probability distribution are α+ and α− (Equation (10)) [18].

Figure 1. Qualitative response. The black line represents the ground truth; the red lines represent the α boundaries of the PH (α+1 and α−1); the blue lines represent the α boundaries of the α-λ accuracy (α+ and α−); the empty circle represents a prediction with a probability mass within α+ and α− lower than β; the blue point represents a prediction with a probability mass within α+ and α− equal to or greater than β; the red point represents a prediction with a probability mass within α+ and α− less than β but with a probability mass within α+1 and α−1 equal to or greater than β.
Taking advantage of the fact that the output of these two criteria is binary, a color code is set on the PH and α-λ accuracy figure (empty circles when both are 0, blue circles when both are 1, and red circles when the PH criterion is 1 and the α-λ accuracy criterion is 0).
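As an illustrative aid (not part of the original framework description), a minimal matplotlib sketch of this color code could look as follows; the helper name and inputs are hypothetical:

```python
import matplotlib.pyplot as plt

def plot_ph_alpha_lambda(t, rul_true, rul_pred, ph_ok, al_ok):
    """Hypothetical plotting helper: ph_ok / al_ok are the binary outcomes of
    the PH and alpha-lambda accuracy criteria at each prediction-time-instant
    (the alpha bound lines themselves are omitted for brevity)."""
    plt.plot(t, rul_true, 'k-', label='ground truth')
    for ti, ri, ph, al in zip(t, rul_pred, ph_ok, al_ok):
        if ph and al:
            plt.plot(ti, ri, 'o', color='tab:blue')      # both criteria met
        elif ph:
            plt.plot(ti, ri, 'o', color='tab:red')       # only the PH bound met
        else:
            plt.plot(ti, ri, 'o', mfc='none', mec='k')   # neither criterion met
    plt.xlabel('prediction-time-instant')
    plt.ylabel('RUL')
    plt.legend()
    plt.show()
```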

Trial-Instant Figure
The trial-instant figure shows the response of the algorithm on the whole test data set (training and prediction). This figure displays the training, prediction and EOL thresholds, the inputs (data and prior knowledge) and the output (the response of the algorithm on the whole data set and the RUL distribution). Thanks to this, cases where the metrics take unexpected values can be further evaluated, which enriches the discussion; see Figure 2.

Figure 2. Trial-instant figure. The prior knowledge is composed of a linear model (yellow line) and data from the battery B0007 (blue dots). The time-instant of interest is defined by the EOL threshold (the horizontal red line), which is the capacity value at sample 146. The prediction probability density is displayed by the green area labelled PDF. The algorithm response on the learning data set is the orange line. The purple line represents the algorithm response on the rest of the data set.

Reference Features
This paper develops a standardized language using a unified set of metrics that describe the key attributes of the algorithm, but in order to find the optimal one, the attributes of study need to be ranked. For this, some reference values for the proposed metrics are gathered in Table 4.

Table 4. Reference values of the unified set of metrics.

Metric            "Worst"    "Best"
Prediction RMSE   ∞          0
RA                −∞         1


Trial Matrix Design Methodology
The design of the trials is a key aspect when evaluating the features of a stochastic algorithm. It is important to keep in mind that many of the sources of uncertainty in the RUL estimation are "inputs" to the prognostic algorithm [36]. This uncertainty, given in the form of an input, may penalize the algorithm in case the input is not correct; it would not be fair to penalize or accept an algorithm according to how well its prediction fits the ground truth data when the algorithm did not have access to accurate prior knowledge and/or an accurate statement of future conditions. These are the reasons why it is mandatory to develop a rigorous evaluation approach that separates the evaluation of the correctness of these "inputs" from the evaluation of the prognostic algorithm itself.
The objective of the proposal is to manage the uncertainty of the "inputs" (data and prior knowledge) through the design of the trials. This allows evaluating the prognosis algorithm itself. The idea is to apply cases with different levels of uncertainty, providing an approximate "control" of this uncertainty. Firstly, the trial matrix considers data of at least two systems that have different levels of noise contribution; secondly, the trial matrix considers two prior models describing the behavior of the system with different levels of accuracy. In this fashion, the correctness of the "inputs" for each algorithm can be discriminated to a certain degree and the evaluation of the prognostic algorithms themselves is improved.
This paper proposes a trial matrix with a minimum of four cases, shown in Table 5.

Universal Parametrization Criteria
The performance level of an algorithm is determined by many factors, such as the parameterization. The chosen parameters define the algorithm's performance level in a concrete context and for a concrete goal. However, each algorithm usually comes with its own integrated parametrization method, which does not take into account the context and goal in which the algorithm needs to work. This leads to evaluating algorithms that are optimized to work in a context different from the one of interest. This paper proposes a universal (applicable to any algorithm) parameterization criterion that corresponds to the context and goals of interest: a RUL prognostic problem within user specifications, with the goal of developing and implementing robust performance assessment algorithms with the desired performance levels.
Firstly, the key aspect of the context of interest needs to be defined, which would be "predict a future unknown event". Next, a parameterization criterion that shares the same key aspect needs to be set. With a literal interpretation of the key aspect, the parameterization would have to focus on the future event itself, but this would go against the purpose of the algorithm, since that future event needs to be predicted (it needs to be unknown).
In this scenario, some assumptions need to be made. It is assumed that accurate predictions come when the algorithm accurately tracks the future behavior of the system. The key aspect of the context of interest is thus reformulated as "the tracking of the future behavior of the system". Thanks to this, the focus of the key aspect that the algorithm needs to fulfil changes from a concrete future event data point to future data in general.
This paper proposes a parametrization criterion based on quantifying the accuracy on the future behavior of the system by cross-validating a part of the training data set. Since a cross-validation is implemented, the part designated for cross-validation needs to be removed from the training data set, reducing the amount of data available for training the algorithm. However, thanks to this criterion, a universal parameterization related to the context is achieved.
The validity of the proposed universal parametrization criterion is supported by the fact that the parameterization depends on the performance of the algorithm when making predictions (the key aspect of the context is prediction) and by the fact that this criterion can be applied to every algorithm designed for RUL prediction.

Uncertainty Propagation Method
The uncertainty management is embedded in the stochastic algorithms, which means that each stochastic algorithm has its own way of taking the uncertainty into account [40]. This intrinsic part of the algorithm impacts the precision level that the algorithm achieves.
The proposed evaluation framework has been built in the context of predicting the RUL of the system. This means that the precision evaluation described in Section 2 is quantified from the precision in predicting the RUL. Consequently, the uncertainty managed in the training section needs to be propagated to unknown estimations (to the RUL prediction). However, there are stochastic algorithms that do not have, as an intrinsic characteristic, a way of quantifying the uncertainty of unknown estimations (interpolation or extrapolation), despite being stochastic algorithms (such as the Kalman family of stochastic filters). In this scenario, an uncertainty propagation method is proposed in order to cover this deficiency and to standardize the uncertainty propagation under the proposed evaluation framework.
Among the huge variety of uncertainty propagation methods available in the literature [17,21,33], the proposed one is a sampling-based method called Monte Carlo prediction [33]. In Monte Carlo prediction, samples are drawn randomly from the input distributions and simulated until the desired event is reached (the EOL event), thus predicting the RUL of each sample. The resultant predicted RUL values are weighted according to the prior probability that each sample had in the input distribution, generating a statistical distribution of the predicted RUL. The pseudo-code is available in Algorithm 1 [33].
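The following Python sketch illustrates our reading of Algorithm 1; the `step` and `observe` callables are hypothetical placeholders for the state transition model and the capacity readout:

```python
import numpy as np

def monte_carlo_rul(samples, weights, step, observe, eol_threshold, max_steps=10000):
    """Sketch of a Monte Carlo RUL prediction in the spirit of Algorithm 1 [33].
    Each input sample is simulated forward until the EOL event is reached; the
    crossing times, weighted by each sample's prior probability, approximate
    the predicted RUL distribution."""
    rul_values, rul_weights = [], []
    for x, w in zip(samples, weights):
        state = x
        for k in range(1, max_steps + 1):
            state = step(state)                   # one-step state propagation
            if observe(state) <= eol_threshold:   # EOL event reached
                rul_values.append(k)
                rul_weights.append(w)
                break
    rul_weights = np.asarray(rul_weights, float)
    rul_weights /= rul_weights.sum()              # normalize to a distribution
    return np.asarray(rul_values), rul_weights    # weighted RUL samples ~ RUL PDF
```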
Monte Carlo predictions become exact as the number of samples tends to infinity. This means that the higher the number of samples, the better the approximation, but also the higher the computational burden [33]. Therefore, a trade-off between accuracy and computational burden needs to be considered when selecting the number of samples.

Example of Use: Definition
The proposed evaluation framework is tested with a stochastic algorithm typically applied to Lithium-ion battery RUL prediction problems: the Particle Filter (PF) [4,8,17,24,41]. For that, commonly used inputs are gathered in this example of use. On the one hand, the input data is taken from NASA's data repository [42]; on the other hand, the prior knowledge is obtained from previous studies [4,8,43].

Stochastic Algorithm
The chosen stochastic algorithm in this example is the Particle Filter (PF). The PF is a sequential Monte Carlo method that solves the integral operation in the Bayes estimators based on the idea of the Monte Carlo method: the state Probability Density Function (PDF) is estimated from a set of particles and their associated weights [5,44]. Those weights adjust the state PDF towards its most likely form. As a result, an appropriate management of the inherent estimation uncertainty is performed [44]. This provides a non-linear projection in forecasting [45].
The particles are inferred recursively in two alternating phases. The first phase is the prediction: the values of the particles at the next step are estimated using the values of the particles at the previous step; no measurement or observation is involved in this phase. The second phase is the update: the value of each particle estimated in the prediction phase is compared with the measurements, and the value of the particles is updated accordingly [44]. As an initialization step (k = 1), the proposed PF draws the particles from a Gaussian distribution (N(0, σ_ini)) and calculates the weights from a uniform distribution (U(0, 1/ρ)).
However, the PF has two main problems: particle degeneracy and sample impoverishment. To lessen the impact of particle degeneracy, an importance resampling of the particles is carried out in the iterations that do not reach the pre-set resampling threshold. This helps in maintaining the track of the state vector even when disruptive effects such as un-modelled operational conditions appear [39]. The pseudo-code of the PF is shown in Algorithm 2.
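A minimal SIR-style particle filter sketch, in the spirit of Algorithm 2, is given below; the tuning values and function names are illustrative rather than the paper's, and `likelihood` is assumed to be vectorized over the particle array:

```python
import numpy as np

def particle_filter(y, transition, likelihood, rho=500, sigma_ini=0.1, n_t=None):
    """Illustrative PF loop: transition(x) draws x_k from p(x_k | x_{k-1});
    likelihood(y_k, x) evaluates p(y_k | x_k) for every particle."""
    if n_t is None:
        n_t = rho / 2                            # resampling threshold N_T
    x = np.random.normal(0.0, sigma_ini, rho)    # initialization (k = 1)
    w = np.full(rho, 1.0 / rho)                  # uniform initial weights
    for y_k in y:
        x = transition(x)                        # prediction phase
        w = w * likelihood(y_k, x)               # update phase
        w = w / np.sum(w)
        n_eff = 1.0 / np.sum(w ** 2)             # effective sample size N_eff
        if n_eff < n_t:                          # degeneracy detected -> resample
            x = systematic_resample(x, w)        # see the Algorithm 3 sketch below
            w = np.full(rho, 1.0 / rho)
    return x, w
```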
Among the possible importance resampling methods, the "systematic resampling method", the "stratified resampling method" and the "multinomial resampling method" are the most typical ones [47]. Among them, the chosen one is the "systematic resampling method", because it is the most common one [46] and because there is no significant difference in the results obtained with any of these typical methods under the proposed evaluation framework [25]. The resampling is triggered when the effective sample size N_eff falls below the threshold N_T. The pseudo-code of the "systematic resampling method" is shown in Algorithm 3.
In Algorithm 3, x^i_k is the state of the ith particle at time-instant k; w^i_k is the weight of the ith particle at time-instant k; ρ is the number of particles; c_i is the cumulative sum of the weights of the particles up to the ith particle at time-instant k; and u is the step reference for selecting the particles that will be kept in the resampling, with u_j = u_1 + (j − 1)/ρ for j = 1, ..., ρ.
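Under the reconstruction above, a minimal Python sketch of the systematic resampling method could be:

```python
import numpy as np

def systematic_resample(x, w):
    """Systematic resampling: one random offset u_1 ~ U(0, 1/rho), then evenly
    spaced pointers u_j = u_1 + (j - 1)/rho are matched against the cumulative
    weight sums c_i; particles with large weights are replicated."""
    rho = len(w)
    c = np.cumsum(w)                     # c_i: cumulative weight sums
    u1 = np.random.uniform(0.0, 1.0 / rho)
    i = 0
    out = np.empty_like(x)
    for j in range(rho):
        u = u1 + j / rho                 # step reference u_j
        while u > c[i]:
            i += 1
        out[j] = x[i]                    # keep (replicate) the ith particle
    return out
```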

Input
In this proposal, the input of the algorithm refers to the information about the system under evaluation: the information about the lithium-ion battery that is introduced into the stochastic algorithm to predict the event of interest. In this case, the event of interest is the EOL. The EOL is commonly defined as a specific dischargeable capacity value, whose behavior is commonly described by an aging model. Consequently, the dischargeable capacity data as well as the aging model that describes the evolution observed in that data are the inputs of the algorithm.

Data
The dischargeable capacity values are taken from NASA's data repository [42], a repeatedly used public data source for evaluating Lithium-ion battery RUL prognosis algorithms. The selected NASA data sets consist of rechargeable 18650 Gen 2 Li-ion cells with a rated capacity of 2 Ah. The experiment description is taken literally from [49] as follows: "It was conducted through three different operational profiles (charge, discharge and impedance) at room temperature. Charging was performed in a constant current at 1.5 A until the battery voltage reached 4.2 V and continued in constant voltage mode until the charge current dropped to 20 mA. The discharge runs were stopped at 2.7 V. The experiments were conducted until the capacity decreased to the specified EOL criteria of 1.4 Ah".
Among the different data sets available in the selected NASA data repository, the proposed evaluation framework requires using two data sets with different uncertainty levels in the data itself. To this end, it is considered that the uncertainty in the input data comes from the effects that the models proposed in the prior knowledge section (next section) cannot completely describe, such as the capacity recovery events. Based on this, the selected data set with high uncertainty is "B0018" and the selected data set with low uncertainty is "B0007".

Prior Knowledge
The behavioral aspect that needs to be modelled by the prior knowledge is the dischargeable capacity evolution of the battery until, at least, the EOL event. In lithium-ion batteries, the dischargeable capacity exhibits a decreasing evolution, or decay. For that, the selected prior knowledge describes two aspects of the behavior of the selected lithium-ion battery: the capacity decay (also referred to as the aging model) and the state transition model of that capacity decay.
For the capacity decay, this paper proposes to use two semi-empirical capacity decrease models that have different uncertainty levels on describing the system's behavior. In the literature, it has been proved that the capacity decay is non-linear [43] and that a double exponential expression is able to describe the capacity decay behavior with a high level of confidence [4,8]. Therefore, a double exponential model is taken as the prior knowledge with low uncertainty (Equation (11)), and a linear model is taken as the prior knowledge with high uncertainty (Equation (12)).
For the state transition model, a linear stochastic process is proposed as the state transition model of the parameters for both capacity decrease models (Equation (13)). In this case, the uncertainty of this part of the prior knowledge is quantified by the parametrization algorithm; therefore, uncertainty evaluation of this part of the input is not required and thus, a second model is not used.
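Since Equations (11)-(13) are not reproduced here, the following is a hedged sketch of the prior knowledge; the double exponential form C_k = a·exp(b·k) + c·exp(d·k) is assumed from the models commonly used in [4,8], and the Gaussian noise term of the state transition is illustrative:

```python
import numpy as np

def capacity_exponential(k, a, b, c, d):
    """Low-uncertainty prior (our reading of Equation (11)):
    double exponential capacity decay over cycle index k."""
    return a * np.exp(b * k) + c * np.exp(d * k)

def capacity_linear(k, a, b):
    """High-uncertainty prior (our reading of Equation (12)):
    linear capacity decay over cycle index k."""
    return a * k + b

def transition(theta, sigma):
    """Linear stochastic state transition of the model parameters (our reading
    of Equation (13)): theta_k = theta_{k-1} + nu, with nu ~ N(0, sigma)."""
    return theta + np.random.normal(0.0, sigma, size=np.shape(theta))
```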

Trial Matrix
The whole trial matrix that follows the proposed cases is shown in Table 6.
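Since Table 6 is not reproduced here, the following sketch reconstructs the four trials as implied by the results discussion (tests 1 and 3 run on B0007, tests 2 and 4 on B0018; tests 1 and 2 use the linear prior, tests 3 and 4 the exponential one):

```python
# Hedged reconstruction of the trial matrix (Table 6), inferred from the
# results discussion rather than copied from the original table.
TRIAL_MATRIX = {
    1: {"data": "B0007", "model": "linear"},       # low-noise data, high-uncertainty prior
    2: {"data": "B0018", "model": "linear"},       # high-noise data, high-uncertainty prior
    3: {"data": "B0007", "model": "exponential"},  # low-noise data, low-uncertainty prior
    4: {"data": "B0018", "model": "exponential"},  # high-noise data, low-uncertainty prior
}
```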

Parametrization
Firstly, a parameter definition step has been carried out based on experience and on the literature [23,26,50]. The defined parameters are summarized in Table 7. Then, the parameterization of the hyper-parameters of the algorithm is performed (Algorithm 4). It starts by initializing the inner states at each time-instant. For that, a least squares optimization of the training data and the capacity decay model is conducted using MATLAB's "lsqcurvefit" function. Once this is done, the algorithm's hyper-parameters at each time step (Table 8), such as the variance of the initial Gaussian distribution of the particles (σ_ini), are obtained by grid search optimization.
The grid for each hyper-parameter is designed following the rule of delimiting the tested values as much as possible without losing interesting options. For that, a sensitivity analysis would be required, but in this case the design of the grid is based on previous engineering experience. The obtained results have been used to define the limits of the grids as well as the grid delta values. The created grids are shown in Table 9. The results obtained with every configuration of the grid are evaluated to find the optimal hyper-parameters. This is done by running each grid configuration C_V times and finding the case with the lowest variation among the run cases with the lowest RMSE values on the cross-validation data set.
In Algorithm 4, the grid search loops over the candidate values of each hyper-parameter (ϑ_u, ϑ_v and ϑ_ini). There, ϑ^λ_optimum denotes the optimum hyper-parameters at the λth prediction-time-instant; t_0:t_λ are the time-instants from the initial time t_0 to the prediction time t_λ; E_λ is the number of prediction-time-instants under evaluation; C_V is the number of repetitions of each configuration of the grid; N_λ is the number of learning points at the λth prediction-time-instant; and RMSE_{j,s,r,q} is the Root Mean Squared Error of each iteration of the grid search optimization.
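A hedged Python sketch of this grid search follows; the `run_algorithm` wrapper is hypothetical, and the mean-plus-deviation scoring rule is our illustrative reading of "the lowest variation among the run cases with the lowest RMSE values":

```python
import itertools
import numpy as np

def parametrize(grids, train, cv, run_algorithm, c_v=5):
    """Sketch of Algorithm 4's grid search: every hyper-parameter combination
    is run c_v times and scored by its RMSE on the held-out cross-validation
    split. run_algorithm(train, cv, cfg) is a hypothetical wrapper around the
    prognosis algorithm that returns its cross-validation RMSE."""
    best_cfg, best_score = None, np.inf
    for values in itertools.product(*grids.values()):
        cfg = dict(zip(grids.keys(), values))
        rmses = [run_algorithm(train, cv, cfg) for _ in range(c_v)]
        # favor low mean RMSE with low variation among the c_v repeated runs
        score = np.mean(rmses) + np.std(rmses)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```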

Example of Use: Results and Discussion
The selected stochastic algorithm (Algorithm 2), along with the proposed uncertainty propagation method (Algorithm 1), is applied to all the described trials (Table 6). The achieved values of the quantitative metrics described in Tables 2 and 3 are shown in Table 10. In addition, the achieved qualitative prognosis results are shown in four figures that display the PH and α-λ accuracy (Figures 5-8). These results allow the overall evaluation of the algorithm.
The highest PH values are obtained in tests 2 and 4, with a "perfect" score of one. This means that the correct aging trend underneath the data set B0018 can be tracked from the earliest selected prediction-time-instant. Tests 1 and 3 show lower PH values than tests 2 and 4; however, the obtained scores are above 0.8, which are high values in themselves. In this case, test 3 shows a higher value than test 1, which means that the algorithm with the exponential model fits the model response better at early states.
The CRA values are much higher in tests 1 and 3 than in tests 2 and 4. This means that the achievable RA improves at a much higher rate on the data set B0007 than on B0018 (as expected, since the data set B0018 has higher uncertainty than the data set B0007). The highest CRA value is achieved in test 3, which means that the algorithm works best on the data set B0007 with the exponential model. In contrast, the lowest CRA value is achieved in test 4, which means that the prediction algorithm works worst on the data set B0018 with the exponential model. In other words, the exponential model fits the data set B0007 better than the linear model, whereas for the data set B0018 the linear model fits better.
In terms of computational burden, tests 3 and 4 have a higher number of FLOPs than tests 1 and 2, which means that the exponential model needs more computational resources than the linear model. The length of the data sets also has an influence, with the data set with more data points requiring more FLOPs.
Once the overall evaluation of the algorithm has been done, some prediction-time-instants (t_λ) of high interest have been analysed in more detail; in particular, the tests run on the data set B0007 at the 4, 11, 17, 65 and 118 prediction-time-instants. For that, the quantitative metrics described in Table 1 are used. The results of those prediction-time-instants are shown in Tables 11-14. Furthermore, some of the trial-instant figures have been displayed for further interpretation of the obtained results.
• t_11: The prediction RMSE, the RA, the P_value and the P_width are better in test 1. In this case, the values of these four metrics at t_11 go against the results obtained from the PH evaluation. The training data set is still small and there is an event that the model is not able to describe (a capacity recovery event). The data set under evaluation here could be labelled as a data set with high uncertainty, since the chosen models are not able to describe the behavior of the validation data set using the behavior tracked on the training data set. As a result, the simplest model (the linear model) works better than the exponential one, see Figures 11 and 12.
• t_17: The prediction RMSE and the RA of test 3 are better than those of test 1. However, the P_width in test 3 is 1.78, which exceeds the upper limit of the ideal range by 78%. A P_width higher than 1 means that the uncertainty is higher than the real RUL at that prediction-time-instant, which is a sign of overestimating the uncertainty; therefore, the prediction is not precise. On the other hand, test 1's P_value is 0 (even though its P_width is within the ideal range), which means that, in this case, the uncertainty has been underestimated. In both cases the prediction is not precise; see Figures 13 and 14.
• t_118: The RA and the P_width are the same in both cases, but the prediction RMSE and the P_value are worse in test 1. This means that test 3 has a higher probability of predicting the real RUL and tracks the most recent trend better. The exponential model describes the trend underneath the data better.
The chosen cases in the tests run on the data set B0018 are the results obtained at the 1, 38, 55, 78 and 102 prediction-time-instants:
• t_1: The prediction RMSE, the RA, the P_value and the P_width are close to the ideal cases defined in Table 4.
• t_38: The prediction RMSE and the RA are better in test 4, but the P_value in both tests is equal to 0. The RA values in both cases are too low, which leads to low probabilities of predicting the RUL. There is a huge capacity recovery event in the validation data set used for the parametrization, which cannot be described by the selected models (the behavior of the validation data set cannot be described using the behavior tracked on the training data set). As a result, we expected a better response with the simplest model, as at the t_11 prediction-time-instant of tests 1 and 3 (Figures 11 and 12). However, in this case the inability to describe the underlying trend of the validation data set using the tracked underlying trend of the training data set has led to an unstable response of the algorithm due to a wrong parametrization, see Figures 21 and 22.
• t_55: The prediction RMSE, the RA, the P_value and the P_width are better in test 4. In both cases the RA and the P_value are far from the perfect score. This means that neither model tracks the trend underneath the data properly, most likely due to the high uncertainty level of the data set seen by the algorithm (the training data set).
• t_78: In both tests, the prediction RMSE and the RA are far from the optimal values described in Table 4. The training data set at t_78 comprises more than half of the data set, but instead of improving, the results get worse. The results are in line with the conclusions drawn from the CRA evaluation; the prediction does not improve with an increasing training data set, unlike in tests 1 and 3.
• t_102: In both tests, the prediction RMSE is close to the "best" cases defined in Table 4, but the other three metrics are far from the optimal values, even though the prediction-time-instant is close to the event-time-instant of interest. This supports the large differences between the CRA values of tests 1 and 3 and tests 2 and 4. It also supports the initial assumption of labelling the data set B0018 as highly uncertain and B0007 as lowly uncertain. See Figures 27 and 28.
Evaluating the qualitative graphs, we can confirm the conclusions drawn from the quantitative evaluation in a more accessible way.
It can be seen that tests 2 and 4 have, at the first time-instant, a RUL prediction inside the α bounds, generating a high PH value even though most of the remaining RUL predictions are outside the α boundaries. There are many more cases that fulfil the α boundaries of the PH and α-λ accuracy in tests 1 and 3, and the response in these two tests improves when the number of training data points increases. In contrast, the results of tests 2 and 4 show a kind of offset in the obtained accuracy, which does not seem to improve even when increasing the training data set.
Comparing within the same data set, the results of test 3 are better than those of test 1, and the results of test 2 are better than those of test 4. This can be interpreted as the linear model being a better prior knowledge for highly uncertain data sets, even though it is known that the behavior of the trend underneath the data is exponential [39,43].

Conclusions
This paper presents an algorithm evaluation framework that captures the performance level of any stochastic algorithm. It combines seven existing quantitative metrics and two qualitative diagrams. The evaluation framework is contextualized in a concrete application environment: a prognosis problem of a lithium-ion battery. The approach is characterized in detail, where the effects of usually forgotten variables are analyzed and dealt with: the uncertainty contribution of the inputs to the estimations, the parametrization method used, and the uncertainty propagation method. The analysis of these three variables constitutes the main innovative contribution of this paper. Finally, the framework has been tested on a Particle Filter (PF) applied to a lithium-ion battery prognosis problem.
The proposed evaluation method has shown that both the correctness and the timeliness need to be analyzed in order to completely capture the performance level of a prognosis algorithm. The evaluation of the PF has shown that the correctness results at a specific prediction-time-instant do not correctly represent the obtainable correctness at other prediction-time-instants. Therefore, the timeliness analysis is necessary. However, the timeliness cannot describe the algorithm's performance level by itself. The results have shown that the PH or the CRA are not able to determine the accuracy or precision with which the algorithm can predict. Therefore, the evaluation of both the correctness and the timeliness of the algorithm is required.
Additionally, we have seen that the proposed qualitative graphs help considerably in the interpretation of the obtained results. The use of these graphs is not strictly necessary, since the quantitative metrics alone are able to capture and quantify the performance level of the algorithm. Nonetheless, the use of these graphical aids is highly recommendable.
The results acquired from the evaluation of the PF have shown that the data set greatly influences the response of the selected algorithm (on average, the results on the B0018 data set are considerably worse than the results on the B0007 data set). The characteristics of the data set have a relevant effect on the obtainable results of the prognosis algorithm. Therefore, the use of data sets with different levels of uncertainty is essential when evaluating the performance level of a prognosis algorithm.
The adopted parametrization criterion allows us to determine the hyper-parameters of any prognosis algorithm at every prediction-time-instant. Nonetheless, it shows that, depending on the characteristics of the data set at the different evaluated prediction-time-instants, the hyper-parameters cannot always be obtained. This can be used to reduce the evaluated prediction-time-instants. Thanks to this, completely random or inaccurate predictions due to the intrinsic characteristics of the training data set will be avoided. Furthermore, we are able to identify the time-instants with accurate predictions, which is really interesting for on-board applications.
The proposed uncertainty propagation method based on Monte Carlo simulations generates the probability distributions from random RUL estimations. This method allows us to use the same approach to propagate the uncertainty in any prognosis algorithm, thus standardizing the uncertainty propagation method. Besides, it provides an uncertainty propagation method to those stochastic algorithms that do not have a proper one of their own, such as the Kalman family of stochastic filters.