A Virtual Sensor for Electric Vehicles’ State of Charge Estimation

The estimation of the state of charge is a critical function in the operation of electric vehicles. The battery management system must provide accurate information about the battery state, even in the presence of failures in the vehicle sensors. This article presents a new methodology for state of charge (SOC) estimation in electric vehicles without the use of a battery current sensor, relying on a virtual sensor based on other available vehicle measurements, such as speed, battery voltage and acceleration pedal position. The estimator was derived from experimental data, employing support vector regression (SVR), principal component analysis (PCA) and a dual polarization (DP) battery model (BM). It is shown that the obtained model is able to predict the state of charge of the battery with acceptable precision in the case of a failure of the current sensor.


Introduction
The increasing diffusion of electric vehicles (EVs) has not been accompanied by a correspondingly solid tradition of data collection. The phenomenon is relatively new, and the gap is particularly significant for data related to battery observation.
Often the models for determining the state of charge of a vehicle are obtained in the laboratory and do not take into account the variability of driving styles; the use of auxiliaries, such as air conditioning; and the environmental conditions in which vehicles operate. This leads to incorrect estimates of the state of charge (SOC) of the batteries in the vehicle and to a misperception of the remaining range by drivers, a phenomenon known in the literature as "range anxiety" [1][2][3][4][5].
Since the state of charge of a battery is not a directly observable quantity, the methods used for its estimation are strongly dependent on assumptions and model simplifications [6][7][8]. In addition, some methods require data measured in laboratory conditions that cannot be directly collected during the normal operation of a vehicle, making them unsuitable for real-world usage. Moreover, the models often depend on parameters that have to be calibrated manually with specific tests and are not appropriate for on-the-run analysis.
A critical factor in any SOC estimator is the quality of the information provided by the EV sensory system, e.g., battery current and voltage measurements. A fault in any of these sensors can therefore lead to a wrong SOC estimate and possible misuse of the battery [9]. The use of automatic learning techniques can bring about significant improvements, especially if combined with traditional techniques for estimating the battery model and state, which can then be improved by the data collected over time.
In this paper, we propose methods for estimating the SOC of the battery in an EV, starting from the dual polarization (DP) equivalent circuit battery model [10][11][12] and relying on a virtual sensor for the battery current measurement, derived from experimental data using principal component analysis (PCA) and support vector regression (SVR). The virtual sensor operates on measurements available on the EV, such as the position of the accelerator pedal and the battery bank voltage, in order to reconstruct the current and, from it, the state of charge of the battery. The methodology is general but, due to data availability, in this paper we focus on a LiFePO4 battery.
In this framework, PCA is used to analyze the original data and reduce their dimension, while SVR, a non-parametric machine learning method, is used to estimate the battery current with two families of kernel functions: Gaussian kernel support vector regression (GK-SVR) and polynomial kernel support vector regression (PK-SVR).
The main contributions of the paper are the following:

• Starting from experimental data on vehicle speed, accelerator pedal position and battery bank voltage from an electric vehicle (EV) in a real driving environment, a virtual sensor for the battery current measurement was derived using support vector regression algorithms.

• The proposed virtual sensor is independent of the battery's chemical operation and can be applied to different battery types. The proposed methodology takes into account only experimental measurements of the dynamics of the vehicle.

• A comparative study of the performance between the proposed methods and traditional techniques that use original data as input for GK-SVR and PK-SVR (for second and sixth order polynomials) is presented.
The paper is organized as follows: in Section 2 the current virtual sensor design algorithm is presented and discussed; in Section 3 the SOC estimation methodology based on a DP model is presented; in Section 4 the numerical results are presented and discussed. Finally, the conclusions of the work are presented in Section 5.

Current Virtual Sensor
In this study, a small (two-passenger) electric vehicle was employed. It relies on a lithium iron phosphate (LiFePO4) battery with a capacity of 150 Ah and a maximum voltage of 72 V. For more details, see [13].
The available measurements are the battery voltage (V), battery current (A), battery SOC (%), pedal position (% of angle) and vehicle speed (km/h). All the variables are obtained from the controller area network (CAN) bus of the EV.
The CAN bus messages are not generated at uniform time intervals. Therefore, the CAN data are pre-processed in order to obtain samples at uniform time intervals for all the variables of interest and to remove any spurious data that, in some cases, are erroneously logged. The data used in this paper for training and testing are available at [14].
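The pre-processing described above can be illustrated with a minimal sketch. The paper does not state which resampling or outlier-removal scheme was used, so linear interpolation and a simple range check are assumptions made here for illustration; the function names and thresholds are hypothetical.

```python
from bisect import bisect_right

def drop_spurious(times, values, low, high):
    """Discard samples whose value lies outside a plausible range [low, high]."""
    kept = [(t, v) for t, v in zip(times, values) if low <= v <= high]
    return [t for t, _ in kept], [v for _, v in kept]

def resample_uniform(times, values, dt):
    """Linearly interpolate an irregularly sampled signal onto a uniform grid.

    `times` must be strictly increasing; the grid starts at times[0].
    """
    n = int((times[-1] - times[0]) / dt) + 1
    grid = [times[0] + k * dt for k in range(n)]
    out = []
    for t in grid:
        j = bisect_right(times, t)  # first index with times[j] > t
        if j >= len(times):         # t coincides with the last timestamp
            out.append(values[-1])
            continue
        t0, t1 = times[j - 1], times[j]
        frac = (t - t0) / (t1 - t0)
        out.append(values[j - 1] + frac * (values[j] - values[j - 1]))
    return grid, out

# Hypothetical irregular CAN timestamps (s) and a signal equal to 10 * t.
can_times = [0.0, 0.4, 1.0, 2.0]
can_values = [0.0, 4.0, 10.0, 20.0]
grid, uniform = resample_uniform(can_times, can_values, dt=0.5)
```

Each CAN variable would be passed through the same two functions with a common `dt`, so that all inputs of the virtual sensor share one time base.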
Four different itineraries have been considered; their characteristics are briefly summarized in Table 1, where "lt" stands for "light traffic", "ht" for "heavy traffic" and "activity" represents the percentage of time the EV is moving at a speed higher than 2 km/h. Route 3 has initial and final sections in "lt" urban conditions and an out-of-town middle section.
For the training step, 900 s of data were extracted from each route from random initial times, and testing was performed on the whole route data file.
The structure of the current virtual sensor is shown in Figure 1. The battery voltage is obtained by adding the voltages of the 24 battery cells, whose data are available on the CAN bus, rather than using the standard total battery voltage also available on the bus. The acceleration data are obtained by numerical differentiation of the speed measurements. The virtual sensor consists of two stages. First, a dimension reduction procedure based on PCA is applied to the inputs; then, the resulting signal is fed to an SVR model that generates the current estimates. Principal component analysis (PCA) is used to reduce the dimension of the input variable space [15]. This methodology allows one to find a new and reduced set of variables (features) as linear combinations of the original variables.
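As an illustration of the acceleration input mentioned above, the following sketch differentiates a uniformly sampled speed signal. The differencing scheme is not specified in the paper; central differences in the interior and one-sided differences at the ends are assumed here, as is the conversion from km/h to m/s.

```python
def acceleration_from_speed(speed_kmh, dt):
    """Estimate acceleration (m/s^2) from uniformly sampled speed (km/h).

    Assumes at least two samples spaced dt seconds apart.
    """
    v = [s / 3.6 for s in speed_kmh]  # km/h -> m/s
    n = len(v)
    acc = []
    for k in range(n):
        if k == 0:
            acc.append((v[1] - v[0]) / dt)            # forward difference
        elif k == n - 1:
            acc.append((v[-1] - v[-2]) / dt)          # backward difference
        else:
            acc.append((v[k + 1] - v[k - 1]) / (2.0 * dt))  # central difference
    return acc

# Hypothetical speed ramp of 3.6 km/h per second, i.e., 1 m/s^2.
acc = acceleration_from_speed([0.0, 3.6, 7.2, 10.8], dt=1.0)
```

Numerical differentiation amplifies measurement noise, so in practice the speed signal may need smoothing before this step.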
The expression of the principal components (PCs) can be written as:

$$PC_i = A Z_i \tag{1}$$

where, considering the i-th sample of input data, $PC_i \in \mathbb{R}^p$ is the vector of the p principal components, $Z_i \in \mathbb{R}^q$ is the vector of q input variables and $A \in \mathbb{R}^{p \times q}$ is the PCA matrix. Each row $a_k$ of matrix A is the eigenvector of the input covariance matrix corresponding to the k-th principal component. If we consider N samples of data (in our case, time samples), we can write for all data $PC = AZ$, with $PC \in \mathbb{R}^{p \times N}$ and $Z \in \mathbb{R}^{q \times N}$. In principle p can be as large as q, but the main idea of PCA is to have p < q.

Before being fed to the PCA algorithm, the four input variables (i.e., voltage, pedal position, speed and acceleration) are normalized in order to obtain the PCA input data vectors $Z_i$. Each variable is re-scaled as

$$Z_i = \frac{z_i - z_{\min}}{z_{\max} + z_{\min}} \tag{2}$$

where $z_{\max}$ and $z_{\min}$ are, respectively, the maximum and minimum values of the original data samples. When applied to the data in Table 1, the PCA method finds that three principal components are sufficient to describe 99.75%, 99.78%, 99.58% and 99.66% of the total variance of the input variables for the four routes considered, respectively. While this means that the feasible order reduction is only one dimension, this reduction, as shown later, allows a substantially better performance of the SVR algorithm. A geometric visualization of the four vectors representing the coefficient values that transform each input variable into the corresponding PC for Route 1 is shown in Figure 2.

Support vector regression (SVR) derives from the support vector machine, which was first designed to solve nonlinear two-class classification problems [16,17]. The SVR method is a non-parametric function approximation technique because it relies on kernel functions [18].
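The PCA step described above can be sketched in a few lines. This is only an illustration: the eigenvectors are obtained here by power iteration with deflation, the data are centered before projection (a common convention that the text does not spell out), and the four-variable data set is synthetic, with the fourth variable an exact linear combination of the first two so that three components capture essentially all the variance.

```python
import math

def pca_fit(samples, p, iters=500):
    """Return (mean, A, explained): A has p rows, each an eigenvector of the covariance."""
    n, q = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(q)]
    centered = [[s[j] - mean[j] for j in range(q)] for s in samples]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(q)]
           for a in range(q)]
    total_var = sum(cov[j][j] for j in range(q))
    work = [row[:] for row in cov]
    A, eigvals = [], []
    for _ in range(p):
        v = [1.0 + j for j in range(q)]               # arbitrary start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _it in range(iters):                      # power iteration
            w = [sum(work[a][b] * v[b] for b in range(q)) for a in range(q)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(work[a][b] * v[b] for b in range(q)) for a in range(q))
        A.append(v)
        eigvals.append(lam)
        # Deflation: remove the component just found.
        work = [[work[a][b] - lam * v[a] * v[b] for b in range(q)] for a in range(q)]
    return mean, A, sum(eigvals) / total_var

def pca_project(sample, mean, A):
    """Compute PC_i = A (Z_i - mean) for one sample."""
    return [sum(a * (z - mu) for a, z, mu in zip(row, sample, mean)) for row in A]

# Synthetic stand-in for (voltage, pedal, speed, acceleration) samples.
samples = []
for i in range(200):
    x1 = 3.0 * math.sin(0.1 * i)
    x2 = 2.0 * math.cos(0.17 * i)
    x3 = math.sin(0.03 * i + 1.0)
    samples.append([x1, x2, x3, 0.5 * x1 - 0.2 * x2])

mean, A, explained = pca_fit(samples, p=3)
pcs = [pca_project(z, mean, A) for z in samples]
```

Here `explained` plays the role of the fraction of total variance quoted in the text for three principal components.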
The relationship between the independent and dependent variables is represented by a deterministic function, defined as:

$$f(x) = w^{T} \varphi(x) + b \tag{3}$$

where $x \in \mathbb{R}^n$ is the independent component of the data and the corresponding dependent value is $y \in \mathbb{R}$, so that each sample vector $x_i$ corresponds to a scalar $y_i$. The vector $w \in \mathbb{R}^m$, with m the dimension of the "feature" space, controls the flatness of the model; $\varphi(\cdot)$ is a nonlinear mapping function from the input space $\mathbb{R}^n$ to the feature space $\mathbb{R}^m$; and $b \in \mathbb{R}$ is a bias term.
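To maximize flatness, a vector w with a small norm is desired; for this reason the coefficients of w are estimated by minimizing:

$$\min_{w,\, b,\, \xi,\, \xi^*} \; \frac{1}{2}\|w\|^2 + C \sum_{i} (\xi_i + \xi_i^*) \tag{4}$$

subject to

$$y_i - w^{T}\varphi(x_i) - b \le \varepsilon + \xi_i, \qquad w^{T}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i,\ \xi_i^* \ge 0 \tag{5}$$

where ε is the acceptable output error. C is a positive constant that determines the penalty applied when a training error larger than ε occurs. $\xi_i$ and $\xi_i^*$ are non-negative slack variables specifying the upper and lower "additional" training errors with respect to the allowed error tolerance ε.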
The dual optimization problem is solved using Lagrange multipliers: the optimum is a saddle point of the Lagrangian in Equation (6), subject to Equation (7), where N is the number of training samples.
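For reference, the dual of the ε-insensitive SVR problem is commonly written in the following textbook form. This is a sketch under the assumption that the inner product in feature space is replaced by a kernel $G(\cdot,\cdot)$ and that $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers of the two sets of inequality constraints; the exact notation of the Lagrangian and of its constraints referred to in the text may differ.

```latex
\[
\begin{aligned}
\max_{\alpha,\,\alpha^*}\quad
  & -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
      (\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)\,G(x_i,x_j)
    - \varepsilon\sum_{i=1}^{N}(\alpha_i+\alpha_i^*)
    + \sum_{i=1}^{N} y_i\,(\alpha_i-\alpha_i^*) \\
\text{subject to}\quad
  & \sum_{i=1}^{N}(\alpha_i-\alpha_i^*) = 0,
    \qquad 0 \le \alpha_i,\ \alpha_i^* \le C .
\end{aligned}
\]
```

Only the samples with $\alpha_i - \alpha_i^* \neq 0$ (the support vectors) contribute to the resulting approximation function.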
The approximation function f(x) is defined as:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, G(x_i, x) + b \tag{8}$$

where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers. The inner product $\langle \varphi(x_i), \varphi(x) \rangle$ is replaced by the kernel function $G(x_i, x)$ [19][20][21]. Different kernel functions can be employed. In this work we consider the following:

(a) Gaussian kernel (GK):

$$G(x_i, x) = \exp\left(-\frac{\|x_i - x\|^2}{2\sigma^2}\right) \tag{9}$$

(b) Polynomial kernel of order p (PKp):

$$G(x_i, x) = \left(x_i^{T} x + c\right)^{p} \tag{10}$$

where $\sigma^2$ denotes the variance for GK, p is the order of the kernel and c is a constant that allows one to trade off the influence of the higher and lower order terms for PKp. One of the aims of the paper is to show how the choice of the kernel is a key point in the proposed methodology. In the following, the SVR is based on the Gaussian kernel (GK) and the second and sixth order polynomial kernels (PK2 and PK6, respectively).

According to Equation (8), the virtual sensor design problem can now be transformed into the following system of linear equations,

$$\hat{I}(x) = \sum_{k=1}^{N} (\alpha_k - \alpha_k^*)\, G(x_k, x) + b \tag{11}$$

where $\hat{I}(x)$ is the estimated battery current and $G(\cdot, \cdot)$ is the chosen kernel function, i.e., Gaussian or polynomial kernel. The k-th input variable sample $x_k$ used for training is set to $x_k = [PC1_k, PC2_k, PC3_k]$ in the PCA case, or to $x_k = [V_k, P_k, S_k, A_k]$, i.e., the scaled voltage ($V_k$), pedal position ($P_k$), vehicle speed ($S_k$) and acceleration ($A_k$), when the SVR model is trained without resorting to PCA.
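To make the role of the kernel concrete, the sketch below evaluates a kernel expansion of the estimated current for a given input. It assumes the common forms of the two kernels, $\exp(-\|x_i - x\|^2 / (2\sigma^2))$ for GK and $(x_i^T x + c)^p$ for PKp; the support vectors, dual coefficients and bias are hypothetical numbers, not values identified in the paper.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian kernel, assuming the exp(-||x - z||^2 / (2 sigma^2)) form."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, order=2, c=1.0):
    """Polynomial kernel of the given order, (x . z + c)^order."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** order

def predict_current(x, support_vectors, dual_coefs, bias, kernel):
    """Kernel expansion: sum_k (alpha_k - alpha_k*) G(x_k, x) + b.

    dual_coefs[k] holds the difference alpha_k - alpha_k*.
    """
    return sum(coef * kernel(sv, x)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias

# Hypothetical trained model acting on three principal components.
support_vectors = [[0.1, 0.2, 0.3], [0.4, 0.1, 0.0]]
dual_coefs = [2.0, -1.0]
bias = 0.5
i_hat_gk = predict_current([0.1, 0.2, 0.3], support_vectors, dual_coefs,
                           bias, gaussian_kernel)
i_hat_pk2 = predict_current([0.1, 0.2, 0.3], support_vectors, dual_coefs,
                            bias, polynomial_kernel)
```

Switching between GK, PK2 and PK6 only changes the `kernel` argument (and its `order`), which is why the kernel choice can be compared on otherwise identical inputs.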
The performance of the virtual sensor obtained from the SVR procedure, in terms of RMSE (root mean square error) and MAE (mean absolute error), is presented in Tables 2 and 3 for the GK and PK2 models, in which PCA was not used, and for the PCA-based PCA + GK, PCA + PK2 and PCA + PK6 models. In those experiments, different routes have been used for training the models, and each model was then tested on the same set used for its training.
The PCA + GK SVR method offers the best performance on the training sets, with the lowest RMSE and MAE values. The PCA + PK6 method also yields good results, but is in general more expensive in terms of computational requirements. Note also that the route data used for training have a relevant effect on the final quality of the model. In all cases, the use of PCA to pre-process the input variables yields much better results with both the polynomial and the Gaussian kernels. Figure 3 shows the result for each method, except PCA + PK6, over a portion of the training data, in this case extracted from Route 1.
The MAE and RMSE indices in Tables 2 and 3 only show the average error in model operation and do not give any information about the error distribution. To overcome this problem we propose to use the developed discrepancy ratio (DDR) index, proposed in the literature for evaluating prediction models; see, e.g., [22][23][24][25].
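DDR is defined as:

$$DDR_k = \frac{\hat{I}_k}{I_k} - 1 \tag{12}$$

where $\hat{I}_k$ and $I_k$ are the estimated and measured battery currents for sample k, and the optimal result is zero for every sample k.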
The error distribution can be visually described by drawing the histogram of the DDR for each one of the four different approaches considered. These results (see Figure 4) show that during the training stage, the DDR values for the PCA + GK method vary between −0.9 and 1; those for the PCA + PK2 method between −1 and 2; those for the GK method between −1 and 2; and those for the PK2 model between −5 and 5. Moreover, the distribution for PCA + GK and, to a lesser degree, for PCA + PK2 has a smaller variance around the optimal zero value, and is thus more reliable than that of the methods that do not rely on a preliminary PCA of the input variables.
According to the results shown in Tables 2 and 3, and to the distribution of DDR values shown in Figure 4, only the methods based on PCA "preprocessing" deserve further consideration.
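A minimal sketch of the DDR histogram follows. It assumes the definition of the developed discrepancy ratio commonly found in the literature, the ratio of predicted to observed value minus one, and it skips samples whose measured current is close to zero, a practical choice that is not taken from the paper. The currents are hypothetical.

```python
def ddr(estimated, measured, min_abs=1e-6):
    """DDR per sample, assumed as estimated / measured - 1 (optimum is 0).

    Samples with |measured| <= min_abs are skipped to avoid division by zero.
    """
    return [e / m - 1.0 for e, m in zip(estimated, measured) if abs(m) > min_abs]

def histogram(values, low, high, bins):
    """Count values falling in `bins` equal-width intervals over [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            counts[min(int((v - low) / width), bins - 1)] += 1
    return counts

# Hypothetical measured and estimated currents (A).
measured = [10.0, 20.0, 0.0, -5.0]
estimated = [11.0, 18.0, 1.0, -5.0]
ratios = ddr(estimated, measured)
counts = histogram(ratios, -1.0, 1.0, bins=4)
```

A histogram concentrated in the bins around zero corresponds to the narrow distributions reported for the PCA-based models.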
As noted previously, training sets extracted from each one of the four routes considered result in virtual sensors of different quality. Using, for example, data extracted from Route 1 (900 s) to train the model, the battery current estimations obtained using the data from Routes 2, 3 and 4 are shown, respectively, in Figures 5-7. As expected, the Gaussian kernel based model PCA + GK yields the best results, even if the PCA + PK6 model still gives accurate results.
The RMSEs and MAEs of the three models considered, i.e., PCA + GK, PCA + PK2 and PCA + PK6, trained using a 15 min (900 s) section of each route in the data set and then tested on the complete data of the four routes, are reported in Tables 4 and 5. In each table the column "score" represents the row-wise average of the RMSE and MAE, respectively. Note that the best results were obtained by using the PCA + GK method trained using Route 1, followed by the PCA + PK6 method trained using Route 2. This shows that the routes used for training the virtual sensors should be carefully selected to obtain a good representation of the behavior in different situations.
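The error indices and the "score" column can be computed as in the following sketch. The numbers are hypothetical placeholders, not values from Tables 4 and 5.

```python
import math

def rmse(estimated, measured):
    """Root mean square error between two equally long sequences."""
    n = len(measured)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)

def mae(estimated, measured):
    """Mean absolute error between two equally long sequences."""
    n = len(measured)
    return sum(abs(e - m) for e, m in zip(estimated, measured)) / n

def score(row):
    """Row-wise average of an index over the routes used for testing."""
    return sum(row) / len(row)

# Hypothetical RMSE values of one trained model tested on four routes.
rmse_per_route = [1.0, 2.0, 3.0, 4.0]
row_score = score(rmse_per_route)
```

Ranking the trained models by this score gives a single figure for how well a training route generalizes to the others.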

Battery Model and SOC Estimation
In order to estimate the SOC, the first step is to develop a reliable battery model. In this work a model based on the dual polarization (DP) equivalent circuit model is employed to simulate the behavior of the battery cells [10,11]. The DP model is composed of three parts, as shown in Figure 8:

• An ideal voltage source representing the open circuit voltage of the battery, $V_{oc}$; this voltage has a nonlinear relation with the state of charge of the battery. This relation depends on the type of battery, but also on its temperature and age.

• Internal resistors, specifically the "ohmic" resistance represented by $R$ and the polarization resistances $R_1$ and $R_2$.

• Capacitors that, in combination with the polarization resistances, are used to characterize the transient response during the transfer of power, represented by $C_1$ and $C_2$.

Assuming the current through the battery as an independent variable, i.e., the battery model in Figure 8 is connected to an independent current source of value $i(t)$, modeling battery discharge or charge, the battery terminal voltage can be expressed in terms of the state equation and output relation in Equation (13):

$$\begin{aligned} \dot{V}_1(t) &= -\frac{V_1(t)}{R_1 C_1} + \frac{i(t)}{C_1} \\ \dot{V}_2(t) &= -\frac{V_2(t)}{R_2 C_2} + \frac{i(t)}{C_2} \\ V(t) &= V_{oc} - V_1(t) - V_2(t) - R\, i(t) \end{aligned} \tag{13}$$

where $V_1$ and $V_2$ are the voltages across $C_1$ and $C_2$, respectively; $V(t)$ is the voltage at the battery terminal; and $i(t)$ is the current in the battery. The state of the battery is represented by $V_1$ and $V_2$ and by the state of charge SOC that defines the open circuit voltage $V_{oc}$. The SOC is the ratio of the remaining capacity to the nominal capacity of the battery cell and is obtained from the current $i(t)$ as:

$$SOC(t) = SOC(t_0) - \frac{\eta}{Q} \int_{t_0}^{t} i(\tau)\, d\tau \tag{14}$$

where $i(\tau)$ is the instantaneous current of the battery, considered as positive for discharge and negative for charge; η represents the Coulomb efficiency [26]; and Q is the nominal capacity measured in Ah, whose value for the EV used in this work is 150 Ah.
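The DP model and the Coulomb-counting relation can be simulated with a short forward-Euler sketch. It assumes the usual DP state equations, $\dot{V}_k = -V_k/(R_k C_k) + i/C_k$ for k = 1, 2, with terminal voltage $V = V_{oc} - V_1 - V_2 - R\,i$ and discharge current taken as positive. The parameter values and the linear $V_{oc}$(SOC) curve are hypothetical placeholders, not the identified values of the paper.

```python
def voc_from_soc(soc):
    """Hypothetical linear placeholder for the V_oc(SOC) curve of one cell (V)."""
    return 3.0 + 0.4 * soc

def simulate_dp(current, dt, params, soc0=1.0, q_ah=150.0, eta=1.0):
    """Forward-Euler simulation of the DP model driven by a current sequence.

    current: samples of i(t) in A (positive = discharge), spaced dt seconds.
    params:  (R, R1, C1, R2, C2) in ohm and farad.
    Returns the SOC and terminal voltage at each sample.
    """
    r, r1, c1, r2, c2 = params
    v1 = v2 = 0.0
    soc = soc0
    socs, volts = [], []
    for i in current:
        socs.append(soc)
        volts.append(voc_from_soc(soc) - v1 - v2 - r * i)  # output relation
        v1 += dt * (-v1 / (r1 * c1) + i / c1)              # first RC branch
        v2 += dt * (-v2 / (r2 * c2) + i / c2)              # second RC branch
        soc -= eta * i * dt / (q_ah * 3600.0)              # Coulomb counting (Ah -> As)
    return socs, volts

# Hypothetical parameters: time constants of 10 s and 150 s.
params = (0.001, 0.002, 5000.0, 0.003, 50000.0)
socs, volts = simulate_dp([150.0] * 3600, 1.0, params)   # 1 h at 150 A
rest_socs, rest_volts = simulate_dp([0.0] * 10, 1.0, params)
```

In the virtual-sensor setting, `current` would be the estimated current rather than a measured one; the time step must stay well below the smallest RC time constant for the explicit scheme to remain stable.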

Parameter Estimation
In this section a nonlinear least squares (NLS) adaptive algorithm is used to estimate the parameters of the battery. Other methods could in principle be used, but since NLS converges quickly and with low computational effort in our problem, it has been preferred. In order to minimize the squared error between the measured and calculated voltage, we use as the minimization target function the error criterion known as chi square, defined as follows:

$$\chi^2(\psi) = \sum_{i=1}^{M} \left[ \frac{V(t_i) - \hat{V}(\psi, t_i)}{\sigma_{V_i}} \right]^2 \tag{15}$$

where $\hat{V}(\psi, t)$ is the estimated value of $V(t)$ defined in the output relation of (13) and based on the parameter vector $\psi = [R\ R_1\ C_1\ R_2\ C_2\ V_{oc}]$; M is the number of data samples used; and $\sigma_{V_i}$ is the expected measurement error for the i-th sample $V(t_i)$.
By collecting all calculated and measured voltages $\hat{V}(t_i)$ and $V(t_i)$ in the M × 1 vectors $\hat{V}$ and $V$, respectively, and collecting the reciprocals of all $\sigma_{V_i}^2$ in the M × M diagonal matrix W, Equation (15) reduces to the quadratic form:

$$\chi^2(\psi) = (V - \hat{V})^{T}\, W\, (V - \hat{V}) \tag{16}$$

The minimum of the chi square error is searched by repeated use of the Levenberg-Marquardt algorithm, which we briefly recall in the following, on small partitions of m samples each taken from the M available data samples.
In our setting, the Levenberg-Marquardt algorithm is used to update the parameter vector ψ iteratively by solving the nonlinear optimization problem described in (15). The algorithm adaptively updates the parameter estimates by combining the gradient descent update and the Gauss-Newton update [27] through the tuning of a damping parameter λ. The Levenberg-Marquardt update equation is given by:

$$\left[ J^{T} W J + \lambda\, \mathrm{diag}(J^{T} W J) \right] h = -J^{T} W (V - \hat{V})$$

where h is the parameter update vector; diag(·) is an operator that extracts the diagonal from a matrix; J is the Jacobian matrix of $V - \hat{V}$ with respect to ψ (hence the minus sign on the right-hand side); and λ is the damping parameter acting on the diagonal of $J^{T} W J$. The damping parameter is initially chosen to be large, so that the first iterations take small steps in the steepest descent direction. The Jacobian can be quickly updated using the Broyden formula [28].
The damping factor λ is adjusted by checking the values obtained with the new parameter set against the previous values. One possible way to do this is by using a ρ factor [27,29,30], defined in (19).
The step is accepted if ρ is larger than a user-specified threshold; otherwise the step is rejected and λ is increased.
Different convergence criteria may be used based on limit values for the gradient, for the chi square error or for the norm of the update vector; or simply, by the number of iterations.
An adaptive algorithm was developed based on the above criterion. Given M time samples of current and voltage data i(t) and V(t), the sequence is first split into sub-sequences of length m. For each sub-sequence, the algorithm minimizes the χ 2 cost function (16) by estimating the electric parameter vector ψ of the battery and then calculating the voltage from the electrical parameters in ψ. The operation is repeated for each following sub-sequence, using the values obtained in the previous iteration as the initial condition for ψ. This process updates and adjusts the electrical parameters of the battery.
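The windowed, warm-started identification described above can be sketched as follows. Several simplifications are assumptions of this sketch and not the authors' implementation: the Jacobian is obtained by finite differences rather than by the Broyden update, a step is accepted simply when the chi square decreases (instead of the ρ test), all samples share one σ, and the fitted model is a hypothetical single-exponential relaxation rather than the full DP model.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def chi2(model, psi, t, y, sigma):
    return sum(((yi - model(psi, ti)) / sigma) ** 2 for ti, yi in zip(t, y))

def lm_fit(model, psi, t, y, sigma=1.0, lam=10.0, iters=100):
    """Levenberg-Marquardt on one window, with damping on diag(J^T W J)."""
    psi, n, w = list(psi), len(psi), 1.0 / sigma ** 2
    cost = chi2(model, psi, t, y, sigma)
    for _ in range(iters):
        base = [model(psi, ti) for ti in t]
        res = [yi - bi for yi, bi in zip(y, base)]
        jac = []  # jac[j][i] = d(res_i)/d(psi_j), by forward differences
        for j in range(n):
            step = 1e-6 * max(1.0, abs(psi[j]))
            pert = psi[:]
            pert[j] += step
            jac.append([-(model(pert, ti) - bi) / step for ti, bi in zip(t, base)])
        jtwj = [[w * sum(p * q for p, q in zip(jac[r], jac[c])) for c in range(n)]
                for r in range(n)]
        rhs = [-w * sum(p * q for p, q in zip(jac[r], res)) for r in range(n)]
        lhs = [[jtwj[r][c] + (lam * jtwj[r][r] if r == c else 0.0) for c in range(n)]
               for r in range(n)]
        h = solve(lhs, rhs)
        if max(abs(d) for d in h) < 1e-12:   # update-norm convergence criterion
            break
        trial = [p + d for p, d in zip(psi, h)]
        trial_cost = chi2(model, trial, t, y, sigma)
        if trial_cost < cost:                # accept: move toward Gauss-Newton
            psi, cost, lam = trial, trial_cost, lam / 10.0
        else:                                # reject: move toward gradient descent
            lam *= 10.0
    return psi

def adaptive_fit(model, psi0, t, y, m):
    """Fit consecutive windows of m samples, warm-starting from the previous result."""
    psi, history = list(psi0), []
    for start in range(0, len(t), m):
        psi = lm_fit(model, psi, t[start:start + m], y[start:start + m])
        history.append(psi)
    return history

def relaxation(psi, t):
    """Hypothetical stand-in model: amplitude * exp(-t / tau)."""
    return psi[0] * math.exp(-t / psi[1])

t = [float(k) for k in range(60)]
y = [relaxation([0.3, 30.0], tk) for tk in t]
history = adaptive_fit(relaxation, [0.2, 20.0], t, y, m=20)
```

Each entry of `history` is the parameter estimate after one sub-sequence, which shows how the estimates are carried over and refined from window to window.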
The identification results of electric parameters for the DP battery model are shown in Table 6. Finally, the relation between SOC and V oc is determined. The relationship between the V oc and the SOC can be described through polynomial data fitting. The fitted curve for the relationship between V oc and SOC is shown in Figure 9, where the V oc from the estimated data is fitted to the SOC value using the 4th order regression polynomial in (20).

Analysis of Results and Discussion
In this section, the performance of the SOC estimation using the current virtual sensor based on the PCA + GK and PCA + PK2 methods and the DP battery model is evaluated.
The voltage, the pedal position and the speed were measured and, together with the acceleration derived from the speed, were used as input data to the model. PCA was applied, reducing the dimension of the inputs from $\mathbb{R}^4$ to $\mathbb{R}^3$. Then, the data were fed to the SVR model to estimate the current. Finally, the current provided by the virtual sensor was used as input to the DP battery model to determine the SOC.
The estimation methods were validated with two routes, and the results can be seen in Figures 10 and 11, where the performance of the estimation methods is compared with the data reported by the existing BMS. The FIT index was used to evaluate the quality of the proposed SOC estimation algorithm. This index is defined as

$$FIT = 100 \left( 1 - \frac{\| SOC_m - SOC_e \|}{\| SOC_m - \overline{SOC}_m \|} \right) \tag{21}$$

where $\|\cdot\|$ is the Euclidean norm of the argument, $SOC_e$ is the SOC obtained with the proposed estimation method, $SOC_m$ is the measurement provided by the BMS and $\overline{SOC}_m$ is the average of $SOC_m$ during the experiment.
Table 7 shows a comparison of the methods according to the FIT, RMSE and MAE indices for Route 1. The obtained indices make it clear that the estimation using the PCA + GK method shows higher prediction performance than the PCA + PK2 model, with FIT = 91.8%. For Route 2, results are shown in Table 8. Once more, the PCA + GK model shows better performance than the PCA + PK2 model, with FIT = 87.49%. Note that both models, PCA + GK and PCA + PK2, provide adequate virtual current measurements that allow the estimation of the battery SOC from the battery voltage, accelerator pedal position and vehicle speed.
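A minimal sketch of the FIT computation follows, assuming the normalized form commonly used in system identification, 100 · (1 − ‖SOC_m − SOC_e‖ / ‖SOC_m − mean(SOC_m)‖), with the Euclidean norm. The SOC traces are hypothetical.

```python
import math

def fit_index(soc_est, soc_meas):
    """FIT (%) between an estimated and a reference SOC trace.

    100 means a perfect match; 0 means no better than the mean of the reference.
    """
    mean = sum(soc_meas) / len(soc_meas)
    num = math.sqrt(sum((m - e) ** 2 for m, e in zip(soc_meas, soc_est)))
    den = math.sqrt(sum((m - mean) ** 2 for m in soc_meas))
    return 100.0 * (1.0 - num / den)

# Hypothetical BMS reference and estimated SOC (%).
soc_bms = [90.0, 80.0, 70.0]
soc_virtual = [89.0, 80.5, 71.0]
fit = fit_index(soc_virtual, soc_bms)
```

The index is undefined for a constant reference trace, since the denominator is then zero.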

Conclusions
We have presented a solution for state of charge estimation in electric vehicle applications. The proposed strategy makes use of virtual sensors for the battery current estimation, replacing the physical sensor in case of failure. The models are derived from experimental data captured from the CAN bus of an actual electric vehicle and use the battery voltage, vehicle speed and acceleration pedal position to recover the current signal if the actual measurement is not available. Support vector regressions and principal component analysis have been employed to build the virtual sensors. Gaussian and polynomial functions have been employed as kernel functions, and it was observed that the Gaussian kernel offers better performance on the available data sets. A principal component analysis allows one to reduce by one the dimension of the input to the virtual sensor, significantly increasing the final performance.
The estimated current signal is used as input to a dual polarization equivalent circuit model of the battery to estimate the state of charge and open circuit voltage during the vehicle's operation. The parameters of the equivalent circuit have been obtained through a non-linear least squares adaptive algorithm, using experimental data from the vehicle.
The joint operation of the virtual sensor and the battery model allows one to estimate the state of charge with a fit higher than 87% when evaluated on fresh data not employed for the model adjustment.
The methods herein proposed are scalable and can integrate knowledge from other sensors, such as temperature and torque, and can be combined with other machine learning methodologies.
One of the limitations that we noticed in the method is related to specific properties of the driving segments. Analyzing the entire route can lead to incorrect patterns, because the relations between the quantities considered by the virtual sensor change along the route. For example, accelerations and currents have a very different relation if the vehicle is traveling uphill, on flat terrain or downhill. The next extension of our work is to create a mixed method of classification and machine learning, to recognize specific peculiarities of the driving segment and select the correct model to apply.