Learning to Calibrate Battery Models in Real-Time with Deep Reinforcement Learning

Abstract: Lithium-ion (Li-I) batteries have recently become pervasive and are used in many physical assets. For the effective management of the batteries, reliable predictions of the end-of-discharge (EOD) and end-of-life (EOL) are essential. Many detailed electrochemical models have been developed for the batteries. Their parameters are calibrated before they are taken into operation and are typically not re-calibrated during operation. However, the degradation of batteries increases the reality gap between the computational models and the physical systems and leads to inaccurate predictions of EOD/EOL. The current calibration approaches are either computationally expensive (model-based calibration) or require large amounts of ground truth data for degradation parameters (supervised data-driven calibration), which is often infeasible in practical applications. In this paper, we introduce a reinforcement learning-based framework for reliably inferring calibration parameters of battery models in real time. Most importantly, the proposed methodology does not need any labeled data samples of observations and the ground truth parameters. The experimental results demonstrate that our framework is capable of inferring the model parameters in real time with better accuracy compared to approaches based on unscented Kalman filters. Furthermore, our results show better generalizability than supervised learning approaches even though our methodology does not rely on ground truth information during training.


Introduction
Recent advancements in lithium-ion (Li-I) battery technology have increased their usage in various applications ranging from electric vehicles to drones [1], smart grids [2], and space exploration [3]. Particularly for autonomous systems, it is essential to plan the missions reliably, which requires an accurate prediction of the end-of-discharge (EOD) time for the batteries. Several battery models have been introduced to model the discharge process of the batteries for accurate prediction of EOD [4][5][6]. However, most of these models suffer from an increasing uncertainty in their EOD predictions over time [7]. This is because the batteries degrade with aging and computational models suffer from a reality gap between the physical process and the simulated one. The relationship between battery age and degradation parameters is complex and requires sophisticated modeling techniques to estimate battery degradation parameters that are part of the EOD time prediction [8]. Estimating degradation parameters of the battery model is also known as "model calibration" [9]. Hence, we use these terms interchangeably in this manuscript.
Previous research studies on battery model calibration have mainly focused on understanding and modeling the electrochemical aging processes [8,10,11]. Such methods are known as prognostics and health management (PHM) models, which assume an underlying model for the aging process. However, the calibration problem can also be modeled as a parameter tracking and inference problem, in which the model parameters are inferred from empirical observations. Previous works have focused on traditional variants of Kalman filters, such as the extended Kalman filter (EKF) and unscented Kalman filter (UKF) [12,13], or Bayesian filters such as particle filters [14], for tracking degradation parameters of the batteries. Parameter tracking approaches do not require an underlying degradation model. However, they suffer from a high computational burden and parameter divergence problems. Several data-driven methods based on empirical learning models have also been proposed for battery end-of-life (EOL) or state of health (SOH) prediction [12,15]. These supervised data-driven methods depend strongly on labeled data: each training sample must come with its ground truth calibration parameters. However, measuring the ground truth values of the degradation parameters during operation is not practical in many scenarios. These shortcomings limit the applicability of the supervised data-driven approaches in real-world problems.
Reinforcement learning (RL) provides an alternative to parameter tracking approaches by formulating real-time calibration as a Markov decision process (MDP). Combined with powerful function approximators, such as deep neural networks, RL methods can work with complex, large state spaces. RL methods have been applied to various control problems in robotics [16][17][18], water systems management [19], computational biology [20], and AutoML [21]. RL methods have multiple advantages over traditional methods: (1) RL agents can learn to solve tasks without any knowledge of the underlying model. Such methods are known as model-free methods that directly learn by sampling interactions from the environment [22]. (2) The policies learned via RL are robust to model uncertainty [23]. (3) RL methods provide almost real-time performance since they only require evaluating the learned policies. These characteristics make reinforcement learning a compelling alternative to other data-driven methods for battery model calibration.
In this paper, we adopt a reinforcement learning framework [24] to solve the battery model calibration process, which can work in real time and does not require an underlying degradation model. Specifically, we define the battery model calibration problem as a tracking problem using MDP and solve it with the Lyapunov-based maximum entropy reinforcement learning algorithm [24]. We use the battery model from the NASA prognostic model library [25,26] to simulate the RL environment. It is important to emphasize that the applied simulation method models the physical process of the discharge but does not explicitly model the battery aging process, which is the main focus of this paper. To the best of our knowledge, the framework proposed here is the first method applying reinforcement learning to battery model calibration.
The remainder of this paper is structured as follows. In Section 2, we discuss related work. In Section 3, we present the battery discharge model and our reinforcement learning-based calibration framework. In Section 4, we present the datasets, model design, and comparison methods. We discuss our findings in Section 5.

Related Work
In this section, we provide a brief overview of three primary methods for battery model calibration: (1) methods based on Bayesian tracking principle; (2) model-based prognostics based on an explicit aging model; (3) direct estimation from observations. Firstly, methods based on filtering approaches model the parameters of the battery as an internal state and try to track these parameters by external observations. Variants of Kalman filters, such as the unscented Kalman filter (UKF) [13] and extended Kalman filter (EKF) [12] have been used to calibrate battery models. Particle filtering (PF) approaches are similar to UKF. PF-based approaches try to approximate the probability density function of the battery parameters using particles [14]. However, particle filters suffer from particle degeneracy, which results in large estimation errors. The authors of [27] proposed inheritance-based particle filtering to tackle this problem. Such tracking algorithms provide model-agnostic parameter estimations. However, these methods are computationally expensive at the application time and suffer from a drift in parameter tracking.
Secondly, model-based prognostic methods assume an underlying degradation model for the aging parameters as a function of battery usage. The authors of [8] used system identification techniques to estimate the parameters of the degradation model. Furthermore, the authors of [10,11] used electrochemical process knowledge to model battery degradation. These techniques provide accurate estimates as long as the physical degradation process follows the assumed model.
Thirdly, in the direct estimation methods, observations are used to learn the mapping from the battery outputs to the degradation parameters. For example, the authors of [15] used support vector machines (SVM) to learn this mapping. In another study, structured neural networks (SNN) were used to exploit knowledge of the degradation process [12]. Such approaches show promising results in certain scenarios where it is possible to obtain a representative set of labeled samples comprising observations and degradation parameters covering all relevant operating conditions.
Reinforcement learning provides an alternative solution to these three types of approaches while overcoming some of their limitations. In particular, in scenarios where labeled data are not available, RL can learn from the observations and infer the model parameters. Furthermore, RL is typically computationally very efficient in real-time applications compared to the model-based Kalman filter and its variants. Previous works have highlighted the importance of reinforcement learning in SOH estimation and battery scheduling operations. The authors of [28] used reinforcement learning to estimate the parameters of an EKF, which in turn was used to estimate SOH for the batteries. Our method goes one step beyond this: we estimate the parameters directly from the observations and thereby remove the dependence on the model-based EKF approach. The authors of [29] also highlighted the effectiveness of reinforcement learning in designing an optimal control policy to reduce transmission losses.
With deep function approximators and sophisticated exploration techniques, RL methods have recently made significant progress. In our work, we focus on model-free RL methods based on the actor-critic (AC) approach [22,30]. Model-free methods can learn the policy without knowing the underlying model. Actor-critic methods provide a framework for generalized policy iteration algorithms in which two networks (actor and critic) are updated continuously. In particular, maximum entropy-based RL formulations such as the soft actor-critic (SAC) algorithm [31,32] have shown good performance in different applications [33,34]. Tian et al. [24] proposed a variant of the maximum entropy-based RL algorithms for the model calibration of turbofan engines. In that work, the authors proposed to use the Lyapunov-based actor-critic (LAC) approach, which has been proven to provide guaranteed stable control [23]. We adapt the proposed approach to the battery calibration problem.

Materials and Methods
As discussed earlier, to solve calibration using reinforcement learning, we need to define the environment for our RL agent. We integrate the battery discharge model described below in OpenAI gym [35] to build the RL environment. We also discuss our RL framework and propose to use a Lyapunov actor-critic [23] algorithm for battery model calibration.

Battery Discharge Model
In this research, we apply the Li-I battery model from the NASA prognostic model library [25,26]. It captures the significant electrochemical processes of the discharge. The effect of aging is included in the model through the corresponding degradation parameters. However, the degradation is not modeled explicitly: the model assumes that the degradation parameters are provided. These parameters are essential for an accurate estimation of the EOD time.
The battery state is modeled by seven parameters, as given in Equation (1). The state changes over time as a function of input load and degradation parameters. In the following, we just denote the state mathematically and refer the readers to the original paper for more details on the battery model [25].
$$x = [q_{s,p},\ q_{b,p},\ q_{b,n},\ q_{s,n},\ V_o,\ V_{\eta,p},\ V_{\eta,n}] \quad (1)$$
In the first four parameters in Equation (1), $q$ represents the amount of charge, subscript $p$ (or $n$) represents the positive (or negative) electrode, and subscript $s$ (or $b$) represents the surface (or bulk) volume of a particular electrode. For example, $q_{s,p}$ is the amount of surface charge in the positive electrode. $V_{\eta,n}$ and $V_{\eta,p}$ are the voltage drops due to surface overpotential on the negative and positive electrodes, respectively. $V_o$ is the total voltage drop.
There are two main degradation parameters: (a) $Q_{\max}$ captures the decrease in available lithium ions, and (b) $R_o$ captures the increase in the internal resistance. These parameters enter the model dynamics defined in Equation (2), where $u_t$ represents the input load at time $t$, $f$ and $g$ are the functions for the system dynamics and output measurements, respectively, and the model predicts the battery voltage $y_t = V$. Without any knowledge of the battery age, the degradation parameters are initialized to the "perfect battery" condition values, which are $Q_{\max}$ = 7600 C and $R_o$ = 0.117215 Ω [25]. Using these parameters, the model can estimate the initial state $x_0$. As the battery ages, $Q_{\max}$ decreases while $R_o$ increases. We learn to infer these parameters by solving the state-tracking problem using RL. In this research, we used the NASA prognostic battery model [25] as our reinforcement learning simulation environment. However, in cases where such a model is difficult to obtain, it can be replaced by surrogate models.

Markov Decision Process and Reinforcement Learning
In this paper, we focus on the battery state tracking task, which we propose to model as a Markov decision process (MDP). An MDP can be described as a tuple $\langle S, A, C, T, \rho \rangle$, where $S$ is the set of states, $A$ is the set of actions, $C(s, a, s') \in [0, \infty)$ is the cost function, $T(s, a, s') = p(s'|s, a)$ is the transition probability function, and $\rho(s)$ is the initial probability distribution over the states. The policy $\pi_\theta(a|s)$ denotes the probability of selecting action $a$ in state $s$, and it is parameterized by the parameters $\theta$. The state of the MDP at time $t$ is defined as $s_t \in S \subseteq \mathbb{R}^n$, where $S$ denotes the state space. For the proposed tracking strategy, we define the state at time $t$ as $s_t = [\hat{x}_t, x_{t+1}, u_{t+1}]$, where $\hat{x}_t$ is the battery's internal state produced by the battery discharge model at time $t$, $x_{t+1}$ is the real (or simulated) battery state we want to achieve at time $t+1$, and $u_{t+1}$ is the input load condition at time $t+1$. The agent (calibrator) then controls the system's degradation parameters as an action $a_t \in A \subseteq \mathbb{R}^m$ (e.g., $a_t = Q_{\max}$ or $R_o$ or $[Q_{\max}, R_o]$) according to the policy $\pi(a_t|s_t)$. Based on the internal state $\hat{x}_t$ and the predicted action $a_t$, we simulate the next internal state $\hat{x}_{t+1} = f(\hat{x}_t, a_t, u_{t+1})$, where $f$ is the battery discharge dynamics described in Equation (2). Hence, the cost function $c(s_t, a_t) = \|\hat{x}_{t+1} - x_{t+1}\|$ denotes the quality of action $a_t$ in going from $\hat{x}_t$ to $x_{t+1}$ at load condition $u_{t+1}$. During the entire learning process, the agent never observes the true degradation parameters. The agent learns to control the degradation parameters by minimizing the cost formulated using the observations. After one complete transition, the next state of the MDP is $s_{t+1} = [\hat{x}_{t+1}, x_{t+2}, u_{t+2}]$. This complete process is demonstrated in part (1) of Figure 1.
Once the policy network is trained, it can work as a calibrator, where it observes the state from the real system and outputs its parameters for the computer model (part (2) of Figure 1).
Figure 1. Part (1): the policy network is trained by interacting with the system model. Part (2): the policy network acts as a calibrator at test time.
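The tracking loop described above can be sketched in a few lines of Python. The snippet is illustrative only: `toy_dynamics` is a hypothetical one-dimensional stand-in for the battery discharge model, and the class and method names (`TrackingEnv`, `reset`, `step`) are assumptions made for this sketch, not the interface of the published code.

```python
def toy_dynamics(x, theta, u):
    """Hypothetical surrogate for f: a scalar 'charge' state that drains
    faster for a higher load u and a smaller capacity-like parameter theta."""
    return x - u / theta


class TrackingEnv:
    """Minimal tracking MDP with observation s_t = [x_hat_t, x_{t+1}, u_{t+1}]."""

    def __init__(self, true_theta, loads, x0=1.0):
        self.true_theta = true_theta  # hidden from the agent
        self.loads = loads
        self.x0 = x0

    def reset(self):
        self.t = 0
        self.x_hat = self.x0   # state of the model being calibrated
        self.x_real = self.x0  # state of the reference ("real") system
        return self._observe()

    def _observe(self):
        u_next = self.loads[self.t]
        # Reference state at t+1, produced with the hidden true parameter.
        self.x_target = toy_dynamics(self.x_real, self.true_theta, u_next)
        return [self.x_hat, self.x_target, u_next]

    def step(self, action):
        u_next = self.loads[self.t]
        # x_hat_{t+1} = f(x_hat_t, a_t, u_{t+1})
        self.x_hat = toy_dynamics(self.x_hat, action, u_next)
        # Cost is the distance between simulated and reference next state.
        cost = abs(self.x_hat - self.x_target)
        self.x_real = self.x_target
        self.t += 1
        done = self.t >= len(self.loads)
        obs = None if done else self._observe()
        return obs, cost, done
```

In this sketch, an action equal to the hidden parameter yields zero cost, and any other action yields a positive cost, so the agent can learn from the cost alone without ever observing `true_theta`.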

Lyapunov-Based Actor-Critic
Since we target the state tracking task, we adopted the Lyapunov-based actor-critic (LAC) approach as proposed in [23]. LAC was designed to improve the stability of reference trajectory tracking by incorporating the Lyapunov energy-decreasing constraint of Equation (3) into the policy objective. In that constraint, $L(\cdot)$ is the Lyapunov value function, $\alpha_3$ is a positive constant, and the other notation is the same as described before. Hence, our policy network is trained to minimize the energy-decreasing Lyapunov objective $J_c(\pi)$, where $N$ is the number of steps for a single state tracking iteration.
Based on the actor-critic framework, LAC uses the Lyapunov function $L_c$ as a critic in the policy gradient formulation. Similar to value function learning, the Lyapunov function is parameterized by a neural network with parameters $\phi$. This network is trained to minimize an objective defined over $L_{target}$, the approximation target related to the chosen Lyapunov candidate, and $\mathcal{D}$, the set of collected transition pairs. LAC is based on the maximum entropy-based actor-critic framework [31], which can enhance the exploration of the policy and has been shown to substantially improve the robustness of the learned policy. Hence, our actor network ensures stable and robust control of the degradation parameters. In the full objective for the policy network, $\pi_\theta$ is the policy parameterized by a neural network $f_\theta$, and $\epsilon$ is an input vector consisting of Gaussian noise. $\mathcal{D} \doteq \{(s, a, s', c)\}$ is the replay buffer that stores the MDP tuples. In this objective, $\beta$ and $\lambda$ are positive Lagrange multipliers that control the relative importance of policy entropy versus the stability guarantee, and $\alpha_3$ is a constant for the Lyapunov energy-decreasing objective. Similarly to the approach applied in [31], the entropy of the policy is expected to remain above the target entropy $\mathcal{H}_t$. The values of $\beta$ and $\lambda$ are learned through the gradient method by maximizing the objectives given in Equations (7) and (8).
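For orientation, the quantities named above can be arranged as follows. This is a sketch written from the definitions in this section and from the general form of the LAC formulation in [23]; it is an assumption, not a reproduction of the numbered equations, and the exact expectations, weights, and indexing used by the authors may differ.

```latex
% Energy-decreasing (Lyapunov) condition with positive constant \alpha_3
\mathbb{E}_{s \sim \mu}\Big[\mathbb{E}_{s' \sim P_\pi} L(s') - L(s)\Big]
  \le -\alpha_3\, \mathbb{E}_{s \sim \mu}\, c_\pi(s)

% Critic: regression of L_c toward the approximation target
J(L_c) = \mathbb{E}_{(s,a) \sim \mathcal{D}}
  \Big[\tfrac{1}{2}\big(L_c(s,a) - L_{target}(s,a)\big)^2\Big]

% Policy: entropy term weighted by \beta, stability term weighted by \lambda
J(\pi) = \mathbb{E}_{(s,a,s',c) \sim \mathcal{D}}
  \Big[\beta\big(\log \pi_\theta(f_\theta(\epsilon, s) \mid s) + \mathcal{H}_t\big)
  + \lambda\big(L_c(s', f_\theta(\epsilon, s')) - L_c(s,a) + \alpha_3 c\big)\Big]
```

Read this way, the $\lambda$-weighted term penalizes transitions in which the Lyapunov value fails to decrease by at least $\alpha_3 c$, while the $\beta$-weighted term keeps the policy entropy near the target $\mathcal{H}_t$.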

Experiment Datasets and Models
In this section, we discuss simulated data generation using the battery discharge model described in Section 3.1. We also discuss the unscented Kalman filter (UKF) and the direct mapping methods against which we compare our approach.

Dataset Generation
As mentioned above, we used the battery model from the NASA prognostic model library [26] to generate simulated data for the training process. As discussed earlier, we have two degradation parameters to calibrate in a battery model. We propose two different experiments for the tracking: (1) Only varying a single parameter at a time. Hence, while varying $Q_{\max}$, $R_o$ was kept constant (at 0.117215 Ω), and while varying $R_o$, we kept $Q_{\max}$ constant (at 7600 C, C = Coulomb). (2) Varying both parameters simultaneously. For the first experiment, we generated trajectories by varying $Q_{\max}$ between 4000 and 7000 C on a grid of 501 equally spaced values while keeping $R_o$ constant at 0.117215 Ω. We also generated discharge trajectories by varying $R_o$ between 0.1 and 0.2 Ω with 501 grid values while keeping $Q_{\max}$ constant at 7600 C. For the second experiment, we varied both parameters simultaneously, i.e., $Q_{\max}$ between 4000 and 7000 C and $R_o$ between 0.1 and 0.2 Ω. Following the approach of [8], we kept the degradation parameters constant for a given discharge cycle. Furthermore, we generated each discharge trajectory for 11 different input load ($u$) conditions between 8 and 16 W. For each trajectory, the battery state defined in Equation (1) was initialized based on the degradation parameter values for that particular trajectory; namely, the voltage drops ($V_o$, $V_{\eta,p}$, $V_{\eta,n}$) were initialized with 0 and the charges ($q_{s,p}$, $q_{b,p}$, $q_{b,n}$, $q_{s,n}$) were initialized proportional to $Q_{\max}$ following [25]. Hence, the trajectories with different degradation parameter values went through different discharge cycles. Each trajectory was simulated until the output voltage reached the EOD threshold (3 V). The generation of the simulated datasets is explained in more detail in Appendix A.

Hyperparameters of the RL Framework
We adopted the same neural network architecture as applied in [24]. We used a fully connected neural network as a function approximator for our actor, $f_\theta$, and Lyapunov critic, $L_c$. Both networks had three fully connected layers with 256 neurons each and LeakyReLU [36] activation functions. For the policy network, we predicted two values, namely the mean and the standard deviation for each action. After this step, we used the squashed Gaussian policy [31] to sample from the distribution. To ensure that the Lyapunov values are positive, we used the sum-of-squares of the final layer activations of the Lyapunov network as Lyapunov values. We used $\alpha_3 = 1$ for the energy-decreasing condition described in Equation (3). The parameters $\beta$ and $\lambda$ were also updated using the loss defined in Equations (7) and (8). We used an Adam optimizer with a learning rate of $5 \times 10^{-4}$.
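The two output conventions mentioned above can be illustrated with a small standard-library sketch. It is not the authors' network code: the function names, the direct use of a standard deviation as input, and the rescaling of the squashed sample to a parameter range `[low, high]` are assumptions made for illustration.

```python
import math
import random


def lyapunov_value(final_activations):
    """Sum of squares of the last-layer outputs, so the value is never negative."""
    return sum(v * v for v in final_activations)


def sample_squashed_gaussian(mean, std, low, high, rng=random):
    """Draw from N(mean, std^2), squash with tanh, then rescale to [low, high]."""
    raw = mean + std * rng.gauss(0.0, 1.0)  # reparameterized Gaussian sample
    squashed = math.tanh(raw)               # bounded to [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)
```

For example, `sample_squashed_gaussian(0.0, 1.0, 4000.0, 7000.0)` always returns a value inside the $Q_{\max}$ range used in the experiments, whatever the raw Gaussian draw is.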

Compared Methods
We compared the proposed model calibration methodology to two alternative methods: a method based on Bayesian tracking principles, namely the unscented Kalman filter, and a supervised data-driven direct estimation.

Unscented Kalman Filter (UKF)
We compared our RL approach to the traditional unscented Kalman filter (UKF). Here, we used the UKF approach proposed in [37]. A UKF models the degradation parameters as a hidden state and the battery model state as an observation. In particular, the hidden state for the UKF was $z \subseteq \{Q_{\max}, R_o\}$ (with $z \in \mathbb{R}^2$ in the multi-parameter case) and the observation was the battery state defined in Equation (1) ($x \in \mathbb{R}^7$). The UKF starts with a distribution over the initial state, and this state distribution is continuously modified through unscented transformations to generate the distribution over the hidden state at each time step. Since we kept the degradation parameters constant throughout one discharge cycle, the UKF state update equation and observation equation were defined accordingly, with $f$ denoting the observation function for the UKF, which was obtained from the battery model introduced in [25], and $u$ denoting the input load. The initial state $\hat{z}_0$ was initialized as a multivariate standard normal distribution. At the start of each new trajectory, the UKF restarts its tracking (i.e., the state is reinitialized with $\hat{z}_0$). Without a restart, the UKF might diverge since there is no connection between two different discharge cycles. Furthermore, the UKF parameters were fine-tuned for the battery discharge datasets described in Section 4.1.
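To make the unscented transformation step concrete, the following sketch propagates a scalar Gaussian through a nonlinear function using three sigma points. It shows the basic transform with a single spread parameter `kappa` (an assumed default), not the specific UKF configuration of [37], and the function name is hypothetical.

```python
import math


def unscented_transform_1d(mean, var, fn, kappa=2.0):
    """Propagate N(mean, var) through fn using 2n + 1 = 3 sigma points."""
    n = 1
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]
    w0 = kappa / (n + kappa)
    wi = 1.0 / (2.0 * (n + kappa))
    weights = [w0, wi, wi]
    ys = [fn(p) for p in points]  # one model evaluation per sigma point
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var
```

Each call evaluates `fn` once per sigma point, which is why a filter built on this transform needs several battery model evaluations at every time step.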

Direct Mapping
We also considered a fully connected neural network to learn a direct mapping from state s t to the degradation parameters (corresponding to a t in the RL setting). It is important to emphasize that direct mapping is a much simpler problem compared to inferring the calibration parameters via the tracking problem without any access to the ground-truth calibration parameters. In direct mapping, the algorithm learns from the labeled pairs of "states" and "degradation parameters". This set of representative labeled samples might not be easy to obtain in real-world scenarios. For each state of the asset, the underlying degradation parameters need to be measured manually, which is considerably time-consuming. Furthermore, the training datasets are required to be representative and cover all the different combinations in all relevant operating conditions to enable a reliable machine learning (ML) model. Hence, the results obtained with this supervised learning setup can be considered as an upper bound for the proposed RL framework performance for the cases where the training and the testing datasets come from the same distribution.
For the direct mapping experiment, we used the same architecture as the policy network described in Section 4.2, with the difference being that only one output per action was learned since the standard deviation was not required. We used the same optimizer and hyperparameters as described in Section 4.2.

Results
We divided the generated discharge trajectories into 70% training and 30% testing datasets. The input load conditions represented in the training and testing datasets did not overlap. Hence, the results presented here are suitable to assess the generalization capability of our method. We trained our RL model for one million steps, which resulted in reward convergence. For direct mapping, we trained the model until the L2-loss between predicted parameters and ground truth values converged.
We compared the inference accuracy of our RL-based approach to the UKF method and the direct mapping approach. Furthermore, as described in Section 4.1, we conducted single- and multi-parameter evaluation experiments. We report the normalized root mean squared error (RMSE) between the ground truth parameters and predicted parameters in Table 1. Parameters were scaled between 0 and 1 for the RMSE calculation, and the numbers represent % RMSE (i.e., normalized RMSE × 100). The proposed RL-LAC reduced the % RMSE by more than 50% compared to the traditional UKF tracking approach. Even in multi-parameter tracking, RL-LAC consistently outperformed UKF. As discussed earlier, the direct mapping method can work better than RL when the test data come from the same distribution as the training data. In our case, the testing trajectories had different load conditions than the training trajectories. Hence, for single-parameter $R_o$ tracking, we observed a training error of 0.3%, while the test error increased to 10.2%. This shows the limitation of the direct mapping approach and highlights the fact that it can suffer from a generalization gap if not trained on data that are representative of the application. The direct mapping method had a negligible error for parameter $Q_{\max}$ in both experiments, since $Q_{\max}$ can be derived exactly from our state formulation. This has also been highlighted in the battery discharge model [25]. However, as discussed earlier, direct mapping requires ground truth degradation parameters, which are difficult to obtain in real-world applications.
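The error metric can be written out as a short function. The sketch assumes that "scaled between 0 and 1" means min-max scaling with the bounds of the parameter grid; the function name and the example values are hypothetical.

```python
import math


def percent_rmse(true_vals, pred_vals, lo, hi):
    """RMSE on min-max scaled parameters, multiplied by 100."""
    span = hi - lo
    # Scaling both series by (v - lo) / span and subtracting equals (p - t) / span.
    sq = [((p - t) / span) ** 2 for t, p in zip(true_vals, pred_vals)]
    return 100.0 * math.sqrt(sum(sq) / len(sq))
```

With this definition, errors on $Q_{\max}$ (thousands of coulombs) and $R_o$ (tenths of an ohm) end up on the same scale, so they can be compared within one table.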
To further investigate the performance of the proposed framework, we show inference results of the degradation parameters for single parameter tracking experiments (in Figure 2), and for a multi-parameter tracking experiment (in Figure 3). In Figure 2, each trajectory represents a different load condition. It is important to point out here that our RL-based method works independently on each discharge cycle, and hence, the order of the parameters does not matter. Additionally, this implies that the calibration errors across discharge cycles are not self-correlated. For both experiments, even though there was some variance in the inference of the parameter Q max , we can see that most of the points were close to the true parameters, whereas in the case of R o , tracking accuracy was better than that of Q max . Interestingly, our tracking never diverged too much from the ground truth parameters, which shows the effectiveness of using the Lyapunov-based stability guarantee in our RL framework.
In Table 2, we present the inference times for all three methods. The time has been calculated by averaging five different runs of 2000 random transitions on a single-core CPU.
As discussed earlier, model-based methods (such as UKF) require multiple battery model evaluations at each step, and hence they have the highest inference times. On the other hand, inference time for our RL method depends on the complexity of the policy network, and it was more than twice as fast as the applied UKF. Furthermore, with increasingly complex battery models, inference time for the UKF will increase proportionally, whereas for the RL, it will remain similar. Direct mapping methods were found to run much faster at deployment time than any other methods as expected.
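A timing protocol of this kind (several runs over a fixed number of random transitions, then an average) could be scripted as below. The helper name, its arguments, and the use of `time.perf_counter` are assumptions for illustration; the measurement code used for Table 2 is not described in the text.

```python
import random
import time


def mean_inference_time(predict, make_input, runs=5, transitions=2000, seed=0):
    """Average wall-clock time (seconds) of one run over `transitions` inputs."""
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        # Build the inputs first so only the predictions are timed.
        inputs = [make_input(rng) for _ in range(transitions)]
        start = time.perf_counter()
        for s in inputs:
            predict(s)
        totals.append(time.perf_counter() - start)
    return sum(totals) / len(totals)
```

Here `predict` would wrap a single policy evaluation, UKF update, or direct-mapping forward pass, and `make_input` would draw one random transition.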

Discussion
In summary, the performance of the RL method was consistently better than traditional tracking methods such as UKF, while being able to perform stable, real-time tracking of the parameters. In addition, the reinforcement learning agent can generalize to out-of-distribution load conditions and is able to accurately track parameters for the test load conditions, whereas the direct mapping method suffers from a lack of generalization. This competitive performance is achieved while purely learning from the interactions and without any access to the ground truth.

Limitations
Our RL-based method enables accurate calibration of the battery model. However, our method has only been tested on simulated data. Sim-to-real transfer of the RL agent is an active research topic [38], and our proposed algorithm needs to be further tested on the real degradation process data. Furthermore, along with the point estimate of the degradation parameters, the confidence interval of the predictions can help in the maintenance scheduling of the batteries. Incorporating uncertainty into RL agents' decisions is also an actively studied topic [39], and the research in this field can be incorporated with our method to enhance the reliability of the proposed algorithm.

Conclusions
In this paper, we presented a new approach for battery model calibration formulated as a tracking problem. We solved this tracking problem using a Lyapunov-based maximum entropy reinforcement learning framework and showed that the inference of this model provides accurate estimates of the model parameters. The performance of the RL framework presents an improvement over UKF, shows a better generalization than the supervised learning approach, and works in real time. The performance of the proposed framework is comparable to or better than that of the supervised learning algorithm, which requires labeled pairs of state observations and degradation parameters. The indirect inference as performed by the RL algorithm is a much harder learning problem compared to direct mapping. Hence, we proposed a valid alternative for the scenarios where labeled training data are either limited or the representativeness of the training data cannot be assured.
In future research, this method can be extended to scenarios where the internal state of the model is not easy to obtain. For such cases, we can formulate the problem as a problem of tracking the output voltage. This is a much harder problem compared to the one analyzed here, since RL has to learn the internal discharge model along with the degradation process purely from the observed rewards.

Conflicts of Interest:
The authors declare no conflict of interest.
Code for the Experiments: The code to reproduce our results is available here: https://github.com/aunagar/RL-Battery-Calibration.

Appendix A. Dataset Generation
As explained in previous sections, we had two degradation parameters to calibrate, $Q_{\max}$ and $R_o$. We performed three different experiments: (1) We varied $Q_{\max}$ from 4000 to 7000 C and kept $R_o$ constant at 0.117215 Ω. We divided the range of 4000 to 7000 C (both inclusive) into 501 equally spaced grid values (i.e., 4000, 4006, 4012, ..., 7000). Furthermore, for each $Q_{\max}$ value, we varied load conditions between 8 and 16 W (both inclusive) with 11 grid values (i.e., 8, 8.8, 9.6, ..., 16). This gave us a total of 501 × 11 = 5511 discharge trajectories. (Results for this experiment are displayed in Figure 2a.) Each trajectory of $Q_{\max}$ represents a different load condition. As explained in Section 4, we had 30% test data. Hence, we demonstrated the results for the test load conditions (i.e., load = 13.6, 14.4, 15.2, and 16 W).
(2) The second experiment was very similar to (1). The main difference was that here, we kept $Q_{\max}$ constant at 7600 C and varied $R_o$ between 0.1 and 0.2 Ω. We divided this range, as before, into 501 equally spaced grid values. The results for this experiment are displayed in Figure 2b. Each trajectory is a different load condition as explained above.
The first two experiments showed the effectiveness of the method when tracking a single parameter at a time. However, in a realistic scenario both parameters degrade together. Hence, we performed a third experiment.
(3) In the third experiment, we varied both $Q_{\max}$ and $R_o$ ($Q_{\max}$ between 4000 and 7000 C and $R_o$ between 0.1 and 0.2 Ω) at the same time. The trajectories were generated as follows: We took 101 grid values of $Q_{\max}$ (i.e., 4000, 4030, ..., 7000). For each $Q_{\max}$, we varied $R_o$ between 0.1 and 0.2 Ω in five equally spaced grid values. For each ($Q_{\max}$, $R_o$) combination, we applied nine different load conditions (i.e., 8, 9, 10, ..., 16 W). The results are displayed in Figure 3. The figure can be interpreted in the following way: Take the first point of the first trajectory in $Q_{\max}$ (Figure 3a at t = 0), which corresponds to five different values of $R_o$ (Figure 3b at t = 0), and each of the ($Q_{\max}$, $R_o$) pairs is a single discharge cycle. Here as well, each trajectory of $Q_{\max}$ represents a different load condition.
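The three parameter grids described in this appendix can be enumerated with a short script. This is only a sketch of the grid construction under the ranges and counts stated above; the helper `linspace` and the variable names are hypothetical, and the discharge simulation itself is not included.

```python
def linspace(start, stop, num, ndigits=6):
    """Return `num` equally spaced values from start to stop (both inclusive)."""
    step = (stop - start) / (num - 1)
    return [round(start + i * step, ndigits) for i in range(num)]


q_grid = linspace(4000, 7000, 501)  # 4000, 4006, ..., 7000 C
r_grid = linspace(0.1, 0.2, 501)    # 501 values of R_o in ohms
loads_11 = linspace(8, 16, 11)      # 8, 8.8, 9.6, ..., 16 W

# Each entry is one discharge trajectory: (Q_max, R_o, load).
exp1 = [(q, 0.117215, u) for q in q_grid for u in loads_11]
exp2 = [(7600, r, u) for r in r_grid for u in loads_11]
exp3 = [(q, r, u)
        for q in linspace(4000, 7000, 101)
        for r in linspace(0.1, 0.2, 5)
        for u in linspace(8, 16, 9)]

# Split experiment (1) by load so that test loads never appear in training.
test_loads = {13.6, 14.4, 15.2, 16.0}
exp1_test = [c for c in exp1 if c[2] in test_loads]
exp1_train = [c for c in exp1 if c[2] not in test_loads]
```

Splitting by load condition rather than by random trajectory is what makes the test set out-of-distribution with respect to the input load.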