The Sliding Window and SHAP Theory—An Improved System with a Long Short-Term Memory Network Model for State of Charge Prediction in Electric Vehicle Application

The state of charge (SOC) prediction for an electric vehicle battery pack is critical to ensure the reliability, efficiency, and life of the battery pack. Various techniques and statistical systems have been proposed in the past to improve the prediction accuracy, reduce complexity, and increase adaptability. Machine learning techniques have been vigorously introduced in recent years, to be incorporated into the existing prediction algorithms, or as stand-alone systems, using large amounts of recorded past data to interpret the battery characteristics and predict present and future behavior. This paper presents an overview of the machine learning techniques, followed by a proposed pre-processing technique employed as the input to the long short-term memory network (LSTM) algorithm. The proposed pre-processing technique is based on the time-based sliding window algorithm (SW) and the Shapley additive explanation theory (SHAP). The proposed technique showed improvement in accuracy, adaptability, and reliability of SOC prediction when compared to other conventional machine learning models. All the data employed in this investigation were extracted from the actual driving cycles of five different electric vehicles driven by different drivers throughout a year. The computed prediction error, compared to the original SOC data extracted from the vehicle, was less than 2%. The proposed enhanced technique also demonstrated the feasibility and robustness of the prediction results through consistent computed output from a random selection of the data sets, consisting of different driving profiles and ambient conditions.


Introduction
Battery electric vehicles (BEVs) and hybrid electric vehicles (HEVs) have greater advantages over internal combustion engine vehicles (ICEVs), in regard to environmental protection and cost reduction, by making use of clean renewable electricity sources [1,2]. However, "range anxiety" is nowadays considered a potential obstacle to the extensive usage of electric vehicles (EVs), as a result of the limited driving range due to the limited cell energy density and recharging capacity. Apart from material limitations, one common problem is the inaccurate estimation of the cell's state of charge (SOC) [3]. The SOC value in a battery-powered electric vehicle is equivalent to the fuel gauge of a conventional fuel-powered vehicle. An accurate and reliable SOC estimation is critical to the overall protection and operation of an electric vehicle. It is also an important part of the battery management system (BMS), which consists of integrated electronic circuitry to monitor, communicate, and signal to all other working components in the power train system [4][5][6][7][8].
In this study, the training data come from five electric vehicles driven on actual roads over the period of a year. We proposed a set of SOC prediction processes, including feature extension based on the sliding window method and feature selection based on LightGBM and SHAP. Finally, we used the LSTM algorithm with multiple inputs and a single output (a many-to-one LSTM chain) to learn the temporal features of the fragments and predict the current SOC. The performance on SOC prediction was then evaluated and compared with the KNN, RFR, and LightGBM methods in regard to tracking accuracy. In addition, the model was verified for its adaptability to different vehicles, and to vehicles driving in different seasons, through data segregation and selection.

Analysis and Processing of Vehicle Driving Data
In this study, data were extracted from five vehicles (car0, car1, car2, car3, and car4) of the same model and size [17]. The data represent actual on-road driving conditions over the period of a year; however, only four months (January, April, July, and November) were available for investigation. The total mileage traveled by each vehicle was between 30,000 and 80,000 km. The driving data of each electric vehicle contained both the charging process and the discharging process, collected at a 10 Hz sampling frequency.
Each set of data contained nine parameters illustrating the vehicle's performance over time, as listed in Table 1. The total sizes of the collected data sets are shown in Figure 1a.

The data sets were fragmented in accordance with the vehicle's SOC, in the range of 100% to 25%, and grouped into the respective months. This was done to ensure consistency in the prediction process and to accurately evaluate the strengths and weaknesses of the prediction methodology. Figure 1b depicts a total of 60 driving fragments distributed in accordance with the vehicles and months.
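As an illustration of this fragmentation step, a minimal Python sketch is given below; the recharge-detection threshold and the function name are assumptions for illustration, not the authors' implementation.

```python
def split_fragments(soc_trace, upper=100.0, lower=25.0):
    """Return index ranges [(start, end), ...] where the SOC falls from
    `upper` down to `lower` without an intervening recharge."""
    fragments = []
    start = None
    for i, soc in enumerate(soc_trace):
        if start is None:
            if soc >= upper:                      # fully charged: a fragment may begin
                start = i
        else:
            if soc <= lower:                      # fragment complete
                fragments.append((start, i))
                start = None
            elif soc > soc_trace[i - 1] + 1.0:    # recharge detected: restart
                start = i if soc >= upper else None
    return fragments

# One synthetic discharge from 100% down to 20% SOC:
trace = [100 - 0.5 * k for k in range(161)]       # 100, 99.5, ..., 20
print(split_fragments(trace))                     # [(0, 150)] — ends when SOC hits 25
```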

Sliding Window Method
The collected original data sets suffered from various data corruption problems, such as inconsistency, lost data segments, invalid ranges, and abnormal patterns. Data preprocessing is therefore crucial prior to any analysis or adoption, to ensure that every adopted data set is healthy and consequential. Corrupted data were removed, and the corresponding gaps were subsequently filled by linear interpolation.
To analyze the original features of the data, the Pearson correlation coefficient was employed. By definition, the Pearson correlation coefficient measures the linear correlation between two features [18,19]. It is defined as the covariance of the two variables divided by the product of their standard deviations; the value is thus a normalized measure of the covariance, always lying between −1 and 1. The formula is defined as follows:

r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\sqrt{\sum_{i}(y_i - \bar{y})^2}} \quad (1)

where r is the correlation coefficient, x_i are the values of the x-variable in a sample, \bar{x} is the mean of the x-values, y_i are the values of the y-variable in a sample, and \bar{y} is the mean of the y-values.
Figure 2 demonstrates the weak correlation between the original features and the prediction target SOC. Only two features, "total voltage" and "motor voltage", have a relatively high correlation with SOC compared to the others; their values are, on average, 0.65 and 0.61, respectively. To further increase the correlation, we proposed the sliding time window (SW) method to extend the original features.
Generally, the SW method consists of a fixed-point sliding window and a dynamic sliding window. The fixed-point SW is a variable-length interval sampling method with a fixed starting point and an ending point that slides along time; its principle is illustrated in Figure 3a. The dynamic SW, on the other hand, is a sampling method that uses fixed-length temporal windows that shift to create instances; each window position produces a fixed segment that is used to isolate data for later processing [20,21]. Figure 3b illustrates this principle.
By using the fixed-point sliding window method, the following extended features are created:

•
Total energy consumption index (TE_con): The total energy consumption measurement index of the whole vehicle battery module is defined as follows:

TE_{con} = \int_{0}^{t} v_t \, c_t \, d\tau \quad (2)

where t is the length of the time window at the current time, v_t is the total voltage, and c_t is the total current.

•
Motor energy consumption index (ME_con): The motor energy consumption measurement index of the motor module is calculated as follows:

ME_{con} = \int_{0}^{t} v_m \, c_m \, d\tau \quad (3)

where v_m is the motor voltage and c_m is the motor current. The total energy consumption index (TE_con) and the motor energy consumption index (ME_con) are both time vectors, represented by the integrals in Equations (2) and (3) over the voltage and current. The current in these expressions has both magnitude and sign: a negative current indicates the vehicle is in regenerative (charging) mode, while a positive current indicates discharging mode.

•
Mileage driven (mile): The driving distance of the vehicle in the current time window is defined as follows:

mile = \int_{0}^{t} v_{speed} \, d\tau \quad (4)

where v_speed is the vehicle speed.
• Cruise ratio (C_r): The proportion of driving-segment length in the current time window, used to measure the driving efficiency of the segments, is defined as follows:

C_r = \frac{Len_{drive}}{Len_{T\_step}} \quad (5)

where Len_T_step is the length of the time window at the current time and Len_drive is the length of the driving segments within it.
In addition to the above, the dynamic SW allows us to obtain the mean values of some features as extended features, such as the mean speed (speed_m) and the mean total voltage (v_t_m) within the dynamic time window. The overall extended features are summarized in Table 2. The original data have a large degree of dispersion and asynchrony (time delays) in data collection. By applying the fixed-point and dynamic SW methods, the whole duration of the collected data can be captured and observed, and average or cumulative values are extracted as new features, which significantly decrease the effects of the large instantaneous data dispersion and asynchronous data collection. Figure 4 outlines the correlation of the original and extended features. After employing the SW method, the extended features have higher correlation with SOC than the original features. As illustrated, the features obtained by the fixed-point SW method have higher correlation compared to Figure 2, with the maximum correlation value reaching 0.98. Furthermore, the dynamic SW method improves the correlation of the features "voltage" and "speed" to 0.1 and 0.78, respectively, which previously were only 0.063 and 0.65.
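The two window types can be sketched as follows: the fixed-point window accumulates TE_con as a cumulative trapezoidal integral of voltage × current at the 10 Hz sampling interval, while the dynamic window yields rolling means such as speed_m. The function names and values are illustrative, not the authors' implementation.

```python
def total_energy_index(voltage, current, dt=0.1):
    """Fixed-point window: cumulative integral of v(t)*c(t) from the window
    start up to each sample (trapezoidal rule), giving TE_con as a time vector."""
    te, acc = [0.0], 0.0
    for k in range(1, len(voltage)):
        acc += 0.5 * (voltage[k - 1] * current[k - 1]
                      + voltage[k] * current[k]) * dt
        te.append(acc)
    return te

def rolling_mean(series, window):
    """Dynamic window: a fixed-length window sliding over the series,
    yielding mean-value features such as speed_m and v_t_m."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

v = [400.0, 399.0, 398.0]
c = [10.0, 12.0, 11.0]           # positive = discharging, negative = regen
print(total_energy_index(v, c))  # cumulative energy at each sample
print(rolling_mean([10, 20, 30, 40], 2))   # [15.0, 25.0, 35.0]
```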

Machine Learning Algorithms and SHAP
In this section, three common traditional machine-learning algorithms were employed to learn the mapping relationship between the highly correlated features obtained in the previous section and the prediction target SOC. The three machine learning models are K-nearest neighbor algorithm (KNN), random forest (RFR) algorithm, and light gradient boosting machine algorithm (LightGBM).
The KNN algorithm can also be used effectively for regression problems [22]. KNN regression predicts the value of the output variable using a local average, while KNN classification predicts the class of the output variable by computing local probabilities. In implementing KNN, the regression technique only requires one additional step compared to the classifier: calculating the average value of the neighbouring data points. In this study, we used the KNeighborsRegressor from the scikit-learn machine-learning library with its default parameters to train the model.
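The averaging step that distinguishes KNN regression from classification can be sketched in a few lines. This is a stdlib sketch for illustration; the study itself used scikit-learn's KNeighborsRegressor.

```python
import math

def knn_predict(X_train, y_train, query, k=3):
    """Predict by averaging the targets of the k nearest training points
    (the classifier would instead take a majority vote)."""
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(X_train, y_train)
    )
    return sum(y for _, y in dists[:k]) / k

# Illustrative 1-D feature with SOC-like targets:
X = [(0.0,), (1.0,), (2.0,), (10.0,)]
y = [100.0, 90.0, 80.0, 20.0]
print(knn_predict(X, y, (1.5,), k=3))   # mean of the 3 nearest targets: 90.0
```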
To evaluate the performances and perform comparison studies, three common statistical indicators were used: the coefficient of determination (R 2 score), mean absolute errors (MAE), and root mean square error (RMSE).
The indicators are defined as:

R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (6)

MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \quad (7)

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \quad (8)

where y_1, y_2, \dots, y_n are the actual values, \hat{y}_1, \hat{y}_2, \dots, \hat{y}_n are the predicted values, and \bar{y} is the mean of y_i.
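The three indicators can be computed directly from their definitions; a minimal stdlib sketch with illustrative values:

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((yt - yp) ** 2
                         for yt, yp in zip(y_true, y_pred)) / len(y_true))

actual    = [90.0, 80.0, 70.0, 60.0]
predicted = [89.0, 81.0, 70.0, 58.0]
print(mae(actual, predicted))    # (1 + 1 + 0 + 2) / 4 = 1.0
print(rmse(actual, predicted))   # sqrt(6/4) ≈ 1.2247
print(r2_score(actual, predicted))
```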

The accuracy of the machine-learning models based on the extended features is significantly improved, with both the RMSE and MAE indicators reduced by at least a factor of three. Further investigation also found that the LightGBM algorithm has the best learning performance [23]. The LightGBM model used in this study has strong fitting capabilities due to its complex structure [24]. However, it is often regarded as a black-box model due to its large number of parameters, complex working mechanisms, and low model transparency.
The Shapley additive explanation (SHAP) method was used to improve the interpretability of the SOC prediction model and demonstrate the prediction of an instance x by computing the contribution of each feature to the prediction [25,26]. The SHAP value represents the contribution of each feature to the variation in the model output.
Here, the min-max normalization method is used to eliminate the influence of numerical differences on the prediction performance of the regression models. The original features and the extended features are then applied to these machine-learning models, respectively. The results are outlined in Figure 5, using both the original features (Figure 5a) and the extended features (Figure 5b) as the input to the models. The comparison of the three models through the statistical indicators is listed in Table 3.
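Min-max normalization maps each feature column onto [0, 1] via (x − min) / (max − min); a minimal sketch with illustrative voltages:

```python
def min_max_scale(column):
    """Scale a feature column onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

voltages = [392.0, 396.0, 400.0]
print(min_max_scale(voltages))   # [0.0, 0.5, 1.0]
```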
Based on the LightGBM model trained above, the impact of each feature on the model output is analyzed from a global perspective. In Figure 6, the blue indicator represents SHAP values in direct proportion to positive feedback on the output value, and likewise the red indicator represents negative feedback on the output value. The two features with a relatively strong influence on the model output are "total energy" and "mile", followed by "Temp max mean", as illustrated in the inset of Figure 6.
Figure 6. Global interpretation of output SOC by SHAP value of other extended features.
Through the SHAP explanatory analysis of the machine-learning model, we obtained the ranking of the degree of influence of each feature on the model output, shown in Figure 7. The seven top-ranked features by SHAP value, "total energy", "mile", "temp max mean", "cruise ratio", "total voltage mean", "temp min mean", and "motor energy mean", clearly dominate the other features and were used as the input to the LSTM model to learn time-series features and predict SOC. Moreover, the SHAP value distributions of the two features most strongly correlated with SOC, "total energy" and "mile", were further analyzed.
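The feature-selection step, ranking features by mean absolute SHAP value and keeping the top-ranked ones, can be sketched as follows. The SHAP matrix and feature names below are illustrative placeholders, not the study's actual values; in practice the matrix comes from a SHAP explainer run on the trained LightGBM model.

```python
# Hypothetical SHAP value matrix: rows = samples, columns = features.
shap_values = [
    [ 0.9, -0.7, 0.2, -0.05],
    [-1.1,  0.6, 0.1,  0.02],
    [ 0.8, -0.8, 0.3, -0.01],
]
features = ["total energy", "mile", "cruise ratio", "motor current"]

def rank_by_mean_abs_shap(names, values, top_n=2):
    """Rank features by mean absolute SHAP value and keep the top n."""
    n_samples = len(values)
    importance = {
        name: sum(abs(row[j]) for row in values) / n_samples
        for j, name in enumerate(names)
    }
    return sorted(importance, key=importance.get, reverse=True)[:top_n]

print(rank_by_mean_abs_shap(features, shap_values))   # ['total energy', 'mile']
```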

SOC Prediction with LSTM Model
In this section, we used the LSTM algorithm to predict SOC. The inputs to the LSTM model were the extended features processed by the SW and SHAP methods as described in previous sections. The long short-term memory network (LSTM) algorithm, which is an improved recurrent neural network (RNN) algorithm, is capable of learning long-term dependencies.
An RNN is a kind of neural network used to process sequence data. Its goal is to give the network a memory function, so that the current features can absorb features from previous states, improving the prediction accuracy on time-series problems [27][28][29]. All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, the repeating module has a simple structure, such as a single tanh layer. An LSTM also contains such a chain, but the repeating module has a different, interacting structure instead of a single neural network layer [30][31][32][33].


The LSTM unit consists of three gates (input gate i, forget gate f, and output gate o) and several state memories: the update step g, the cell memory state C, and the hidden state H. The input gate i feeds in the data of the current time step of the sequence and updates the cell state. It combines the hidden state of the previous cell, h_{t-1}, with the current input x_t and passes them through the sigmoid function:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)

where h_{t-1} is the hidden-state output of the previous neuron, \sigma is the sigmoid activation function, W_{xi} is the input-to-hidden weight matrix, W_{hi} is the hidden-to-hidden weight matrix, x_t is the input of the current neuron, and b_i is a bias updated during training (the annotations of the following formulas are similar). The forget gate f determines the level of importance of a particular piece of information and decides whether to discard or utilize it. Its input is likewise the hidden state h_{t-1} of the previous unit and the current input x_t, passed through the sigmoid function:

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)

The cell memory state vector C is updated through the candidate state g_t:

g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)
c_t = f_t \odot c_{t-1} + i_t \odot g_t

where c_t is the cell memory state vector at time t and \odot is the Hadamard product. The output gate o and the hidden state H are given by:

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
h_t = o_t \odot \tanh(c_t)

The schematic diagram of the LSTM network chain structure is shown in Figure 9.
The input of our LSTM model is the value of the seven extended features over the latest previous time steps. The time step is consistent with the length of the dynamic sliding time window, and the output is the current SOC. The proposed LSTM model is a chain structure with multiple inputs and a single output, as shown in Figure 10. All data are input to the LSTM model in units of the time step.
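The gate equations and the many-to-one chain can be traced with a scalar toy cell. All weights are set to 0.5 purely for illustration; a real model uses learned weight matrices and vector-valued states.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step: gates i, f, g, o from (x_t, h_{t-1}),
    then c_t = f*c_{t-1} + i*g and h_t = o*tanh(c_t)."""
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev + w["bf"])    # forget gate
    g = math.tanh(w["xg"] * x + w["hg"] * h_prev + w["bg"])  # candidate update
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["bo"])    # output gate
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Many-to-one chain: feed the whole window, keep only the final hidden state.
weights = {k: 0.5 for k in
           ("xi", "hi", "bi", "xf", "hf", "bf",
            "xg", "hg", "bg", "xo", "ho", "bo")}
h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:          # one (toy) input per time step
    h, c = lstm_step(x, h, c, weights)
print(round(h, 4))                 # single output after the last step
```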
At each time step, the seven features are fed into the LSTM model simultaneously, and the predicted SOC is output after processing.

Results
In order to verify the accuracy and stability of the proposed SW-SHAP-LSTM method for SOC prediction, we split the data into training and test sets. Approximately 90% of the fragments of each vehicle were randomly selected as the training set, and the remaining fragments were used as the test set. Table 4 provides the results of the random data set split. An initial comparison was made between the original and extended features on the LSTM model to evaluate their respective performances. The plot in Figure 11 shows the prediction results for the fourth fragment of car0: the green curve represents the original features, and the red one represents the extended features. As illustrated, prediction with the extended features, after incorporating the SW and SHAP methods, is significantly more accurate; this is depicted in the overlap between the red and blue curves.
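The random 90/10 fragment split can be sketched as follows; the seed and helper name are illustrative (the paper does not state its random seed).

```python
import random

def split_fragments_train_test(fragment_ids, train_ratio=0.9, seed=42):
    """Randomly assign ~90% of a vehicle's fragments to training,
    the rest to test."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    ids = list(fragment_ids)
    rng.shuffle(ids)
    cut = round(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

fragments = [f"car0_F{j}" for j in range(10)]
train, test = split_fragments_train_test(fragments)
print(len(train), len(test))   # 9 1
```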
The prediction accuracy of different models with the features extended by the SW and SHAP methods was evaluated against the LSTM proposed in this paper. The models selected for comparison are the widely used random forest regression (RFR) algorithm, the light gradient boosting machine (LightGBM), and the K-nearest neighbor (KNN) algorithm. The results on the test sets are listed in Table 5, with the different statistical indicators detailed in Section 2. The proposed LSTM model returned the lowest error values, denoting higher accuracy compared to the other three models. The notation R^2 in Table 5 is the coefficient of determination used to evaluate the performance of the regression model; its value is directly proportional to the accuracy of the model prediction, as analytically shown in Equation (6). In a similar comparison, Figure 12 plots the percentage error resulting from each model's prediction against the actual measured SOC. The proposed LSTM model has a maximum error of 2.835%, which is the lowest among all the models employed for comparison.
Apart from the prediction accuracy studies of the proposed model, the stability or reliability of the model is also critical to ensure repeatability and adaptability. Hence, the model was further verified on driving fragments from different vehicles and different durations in the case months.
The previously split training-set data were employed, with car1, car3, and car4 forming training and test groups drawn from different months, as seen in Table 4. The results of the SOC prediction accuracy for fragments from different sources and durations are listed in Table 6. The notation 'car i Fj' represents the j-th fragment of the i-th car. The 'Source' column gives the data source of the test segment: 'Homologous' denotes a test segment from the same vehicle, and 'Heterogeneous' one from a different vehicle.

Conclusions and Discussion
The SW-SHAP-LSTM method was proposed to predict the SOC of electric vehicles. The following are the investigation's outcomes:

1.
Data preprocessing is crucial and necessary for machine learning. Data segregation and filtering will significantly improve the accuracy of models. The results from this investigation have shown that the extended features processed by the SW and SHAP methods can significantly reduce the prediction error and, hence, improve the accuracy.

2.
LSTM has considerable advantages over other prediction models. The computed errors are within 2%, which is much lower than those of RFR, KNN, and LightGBM.

3.
The method proposed is shown to have good stability and adaptability, evidenced by the computed error on the prediction results when tested on the different vehicles and driving seasons.
Nevertheless, there is room for improvement in this study. For instance, the range of the SOC fragments could be increased to more than 80%, compared to the 75% used in this study. Moreover, the machine-learning models deployed here could be further improved through algorithmic optimization techniques, since the LSTM method is susceptible to overfitting, high memory consumption during training, and sensitivity to different random weight initializations. An improved algorithm will focus on these shortcomings and incorporate the extended features with a filtering algorithm to estimate the initial SOC. Improving the distribution of the training data set is also crucial for prediction accuracy. The distribution method used in this article was random distribution, but many other methods exist, such as cross-validation, which can combine measures of fitness in prediction to derive a more accurate estimate of model prediction performance.