Deep Reinforcement Learning-Based Battery Management Algorithm for Retired Electric Vehicle Batteries with a Heterogeneous State of Health in BESSs

In this paper, we propose a battery management algorithm to optimize the lifetimes of retired lithium batteries with heterogeneous states of health in a battery energy storage system under dynamic power demand. A battery energy storage system allows for the use of retired lithium batteries for applications such as backup power in homes, data centers, etc. In a battery energy storage system, a battery pack consists of several retired batteries connected in parallel or in series to fulfill the required power demand. Owing to the retired batteries’ different capacity levels, i


Introduction
Lithium-ion batteries have become an essential component of modern life, powering everything from smartphones to electric vehicles (EVs) [1]. Compared to other battery types, they offer high energy efficiency, minimal memory effects, extended lifespans, and low self-discharge rates, and they are now widely used [2]. However, lithium-ion batteries for EVs have a limited lifespan and, for safety, eventually require replacement when their capacity drops to 80% or lower [3]. The total capacity of retired battery packs is forecast to reach 120 GWh globally by 2030 [4]. This will create a significant amount of waste and a financial burden, particularly as the demand for lithium-ion batteries continues to grow. As a result, identifying ways to reuse retired lithium-ion batteries is becoming increasingly urgent.
One promising application for retired lithium-ion batteries is in battery energy storage systems (BESSs), which can be used for backup power in homes, EV charging stations, or telecommunication and data center systems [5]. BESSs have the potential to significantly reduce the demand for new batteries and can help reduce the environmental impact of battery production. A BESS has a battery pack in which multiple batteries are connected in parallel or in series to increase the capacity or voltage of the battery pack. A switch is added to each battery cell to connect it to or disconnect it from the battery pack [6]. Cells in a battery pack have different capacity levels (i.e., heterogeneous states of health), which hinders the effective utilization of the batteries and, consequently, degrades the performance of the battery pack. A scheduling policy is therefore required to control the cell switches so as to prolong the lifetime of the battery pack in the BESS and reduce the capacity imbalance among battery cells.
For an optimal scheduling policy in a BESS, correct identification of battery characteristics, including the state of charge (SOC) and state of health (SOH), is important. The SOC of a battery is the level of charge relative to the battery's capacity, whereas the SOH is the ratio of the maximum battery charge to its rated capacity. The relationships between SOC and SOH are illustrated in Figure 1. Information on SOC and SOH in a scheduling policy protects the battery cells by preventing them from overcharging or discharging excessively and increases the capacity of the BESS. SOC and SOH cannot be measured directly from a battery cell. Instead, they are estimated from measurable parameters such as voltage, current, and cell temperature. Based on the states of the battery cells in a pack (including the SOC and SOH), the ON/OFF switches for the batteries in the pack are scheduled so that the states of health of all batteries are balanced. As a result, the lifetime of the battery pack is extended. Therefore, the correct estimation of SOC and SOH, along with the scheduling of battery cell switches, is necessary to optimize the performance of a BESS.
Battery state estimation approaches have been explored in the literature. The Coulomb counting method [7,8] calculates the SOC of cells by counting the amount of charge that enters or exits a cell. However, the Coulomb counting method is unable to measure the SOC of cells in an online parallel-connected battery pack following a sharp drop in SOH. An algorithm was proposed for the online estimation of SOC using deep learning [9], but this algorithm ignores the estimation of SOH. A neural network was used to estimate SOH using experimental datasets [10]. However, the authors did not consider SOC estimation. Information on both the SOCs and SOHs of battery cells is required for efficient scheduling of a battery pack to optimize BESS performance. The authors of [11] proposed joint lithium battery SOC and SOH estimation using a data-driven method. This approach requires a large dataset to train the model before operation on site. Kalman filter-based approaches can estimate SOC and SOH levels [12] but depend on the correctness of electrochemical impedance spectroscopy (EIS) parameters, including a resistor and one or more resistor-capacitor (RC) pairs. However, the effect of SOH reduction on EIS parameters is ignored in Kalman filter approaches [13]. In [14], SOH reduction was considered to reidentify EIS parameters, but the SOH was updated only offline.
Several researchers have studied the problem of cell scheduling in a parallel-connected battery pack. The authors of [15] utilized a fuzzy logic control strategy to adjust the number of cells in a circuit in accordance with the load demand for the purpose of reducing loop current, which leads to battery inconsistency. In [16], battery resistance degradation was monitored to detect weak cells and disconnect them from the battery pack. This approach solved the issue of mismatched characteristics but requires a complex measuring system or incurs a high computational burden. In [17], the weighted-k round-robin (kRR) scheduling framework was proposed to extend the lifetime of a battery pack by considering load demand and SOH reduction. However, kRR-based scheduling can be implemented only for a fixed model, i.e., the number of cells in the battery pack or the battery models inside the pack cannot change. In [6], a multi-actor-critic method was proposed to solve battery scheduling problems. This approach prolonged the lifetime of the battery pack and reduced the imbalance between the batteries but ignored dynamic power demand. In [18], a strategy for a battery management system was proposed, including SOC estimation using an extended Kalman filter algorithm and a scheduler to reduce the difference between the SOCs of battery cells. However, SOH and power demand were not considered in that approach.
The main challenge is to determine the accurate state (i.e., SOC and SOH) of each battery cell in a battery pack and then schedule the turning ON/OFF of battery cells based on their current states such that the imbalance in the SOHs of cells is reduced.
The main contributions of our work are as follows:
• A scheduling algorithm is proposed to maximize the lifetime of a battery pack consisting of parallel-connected battery cells with heterogeneous states of health in a BESS.
• We formulate the battery lifetime maximization problem as the minimization of the SOH reduction of a battery pack, which can be achieved by reducing the imbalance in the SOHs of the battery cells in the pack.
• A deep reinforcement learning (DRL) framework is implemented in the scheduling algorithm that uses battery cells' states to set their ON/OFF status and balance the SOHs.
• To measure the battery cells' states to schedule their ON/OFF status, an extended Kalman filter (EKF)-based algorithm is proposed to estimate SOC and SOH.
• A dataset of real measurements is used to determine the accuracy of the proposed estimation algorithm. The proposed algorithm achieves the lowest error among the compared methods from other works. Simulation results show that the proposed algorithm outperforms previous studies by extending the lifetime of a battery pack under constant and dynamic power demands.
The remainder of this paper is organized as follows. Section 2 discusses the proposed parallel-connected battery model and the scheduling issues. Section 3 presents the framework of the proposed combined algorithm, which includes EKF-based and DRL-based algorithms. Section 4 describes the simulation and presents the results and impacts of the algorithm. Finally, we conclude this work in Section 5.
For ease of presentation, the key notations listed in Table 1 are used throughout this paper.
In this paper, we consider a parallel-connected BESS [19,20] with a power supply and a load, as shown in Figure 2. The BESS comprises a battery pack and a battery management system (BMS) connected to a power supply and a load. We consider a discrete-time model, where the working time ($\mathcal{W}$) is divided into $w$ time slots such that $\mathcal{W} = \{t_k \mid k = 1, 2, \ldots, w\}$, each with duration $\Delta t = t_k - t_{k-1}$. The battery pack consists of $N$ lithium battery cells connected in parallel. A first-order Thévenin equivalent model is considered for the cells [21]. Cell $i \in \mathcal{N} = \{1, 2, \ldots, N\}$ has EIS parameters including an open-circuit voltage ($V_{Oi}$); an internal resistance ($R_{si}$); and an RC pair, which includes a resistor ($R_{pi}$) and a capacitor ($C_{pi}$) connected in parallel. The terminal voltage of cell $i$ at time $t_k$, $V_i(t_k)$, is computed from these parameters and from $V_{pi}(t_k)$, the polarization voltage applied to the parallel RC network, which is calculated as in [14]. There are $N$ switches corresponding to the $N$ cells, linking them to the battery circuit. $X_i(t_k)$ is a binary variable that indicates whether the switch of cell $i$ is connected to the battery circuit or not, and the set $\mathcal{X}(t_k)$ consists of the switch states of all cells at time $t_k$. Similarly, the sets $\mathcal{V}(t_k)$, $\mathcal{I}(t_k)$, and $\mathcal{T}(t_k)$ consist of the terminal voltages, currents, and temperatures of all cells at time $t_k$, respectively.
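As an illustration of the cell model described above, the following minimal Python sketch simulates a first-order Thévenin cell over a few time slots. It assumes the standard textbook form of the model (terminal voltage equals the open-circuit voltage minus the ohmic drop and the polarization voltage, with discharge current taken as positive) and a zero-order-hold discretization of the RC pair. The function names and all numeric parameter values are hypothetical and are not taken from the paper.

```python
import math

def polarization_voltage(v_p_prev, current, r_p, c_p, dt):
    """RC-pair voltage after one slot of length dt with constant current."""
    decay = math.exp(-dt / (r_p * c_p))
    return v_p_prev * decay + r_p * current * (1.0 - decay)

def terminal_voltage(v_oc, current, r_s, v_p):
    """Terminal voltage; discharge current is positive (assumed convention)."""
    return v_oc - r_s * current - v_p

# Hypothetical parameter values, for illustration only.
v_p = 0.0
for _ in range(3):
    v_p = polarization_voltage(v_p, current=2.0, r_p=0.02, c_p=2000.0, dt=1.0)
v_t = terminal_voltage(v_oc=3.7, current=2.0, r_s=0.05, v_p=v_p)
print(round(v_p, 5), round(v_t, 5))
```

Under a constant discharge current, the polarization voltage in this sketch rises toward its steady-state value of R_p times the current, so the terminal voltage sags gradually rather than instantaneously.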
A BMS monitors the states of the battery pack and estimates both the SOC and the SOH of cells in order to schedule the switches in the battery pack. The SOC of cell $i$ at time $t_k$, $SOC_i(t_k)$, is defined in terms of $M_i(t_k)$, the capacity level of cell $i$ at time $t_k$, and $\eta$, the Coulombic efficiency of the discharging or charging process. Similarly, the SOH of cell $i$ at time $t_k$ is defined as
$$SOH_i(t_k) = \frac{M_i(t_k)}{M_{new}},$$
where $M_{new}$ is the initial capacity of a new cell. The sets $\mathcal{C}(t_k)$ and $\mathcal{H}(t_k)$ consist of the SOCs and SOHs of all the cells at time $t_k$, respectively. The SOH of the battery pack is denoted by $SOH_P(t_k)$.
The power supply and the load are used for charging and discharging the battery pack. The battery pack current at time $t_k$, $I_P(t_k)$, has a positive value when discharging and a negative value when charging. The battery pack fulfills the load demand when discharging, then recharges to recover the corresponding amount of power. The process of complete charging and discharging of a battery pack is referred to as a cycle. During the working time ($\mathcal{W}$), an arbitrary cycle $j$ spans multiple time slots, depending on the power demand. If time slot $t_k$ belongs to cycle $j$, we let $l_D(t_k^j)$ and $l_C(t_k^j)$ denote the amounts of power load when discharging and charging, respectively, in cycle $j$ up to time slot $t_k$. They are calculated using (7) and (8), where $\kappa$ represents the slot number at which cycle $j$ starts, i.e., cycle $j$ starts at time $t_\kappa$.
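The two cell-level quantities above can be sketched in a few lines of Python. The SOH function follows the ratio definition given in the text. The SOC update is an assumed Coulomb-counting form (charge removed divided by the current capacity, scaled by the Coulombic efficiency) and may differ from the paper's exact expression. Function names and the sample current and slot length are hypothetical.

```python
def soh(capacity_ah, new_capacity_ah):
    """SOH as the ratio of the present capacity to the capacity of a new cell."""
    return capacity_ah / new_capacity_ah

def soc_update(soc_prev, current_a, dt_h, capacity_ah, eta=1.0):
    """Assumed Coulomb-counting update; positive current discharges the cell."""
    return soc_prev - eta * current_a * dt_h / capacity_ah

NEW_CAPACITY_AH = 2.2                  # nominal capacity from the simulation setup
capacity = 0.9001 * NEW_CAPACITY_AH    # cell 1 starts at 90.01% SOH

# Hypothetical slot: 2 A discharge for 0.1 h starting from a full cell.
soc = soc_update(1.0, current_a=2.0, dt_h=0.1, capacity_ah=capacity)
print(round(soh(capacity, NEW_CAPACITY_AH), 4), round(soc, 4))
```

Because the SOC is normalized by the present capacity rather than the rated capacity, the same current drains an aged cell faster than a new one, which is why both quantities are needed for scheduling.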

Problem Formulation
The objective of this paper is to prolong the lifetime of a battery pack by reducing the rate of aging in cells. To that end, the problem is formulated to minimize the SOH reduction of the battery pack during working time ($\mathcal{W}$); we refer to this optimization problem as (9). In the formulation, $\Delta SOH_P(t_k)$ represents the SOH reduction of the battery pack at time slot $t_k$; $I^{+}_{max}$ and $I^{-}_{min}$ represent the discharge current and charge current thresholds, respectively; $SOC_{min}$ and $SOC_{max}$ indicate the lower and upper bounds of the SOC, respectively, which are required to prevent excessive discharging and charging; $l_D(t_k^j)$ and $l_C(t_k^j)$ represent the power load in cycle $j$ up until time slot $t_k$ when discharging and charging, respectively; and $d(t_k^j)$ indicates the power demand at time slot $t_k$ in cycle $j$. The SOH reduction of the battery pack at time $t_k$ is defined as
$$\Delta SOH_P(t_k) = SOH_P(t_{k-1}) - SOH_P(t_k),$$
where $SOH_P(t_{k-1})$ and $SOH_P(t_k)$ denote the SOH of the battery pack at time slots $t_{k-1}$ and $t_k$, respectively. Since $\Delta SOH_P(t_k)$ is a non-increasing function, an additional constraint is imposed on it.
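To make the objective concrete, the sketch below accumulates the per-slot SOH reduction over a pack SOH trajectory and checks per-slot limits. The box constraints on cell current and SOC are an assumed reading of the thresholds named above, not the paper's exact constraint set. The helper names, the trajectory, and the limit values are hypothetical.

```python
def total_soh_reduction(soh_pack):
    """Sum over slots of SOH_P(t_{k-1}) - SOH_P(t_k)."""
    return sum(prev - cur for prev, cur in zip(soh_pack, soh_pack[1:]))

def within_limits(cell_currents, cell_socs, i_min, i_max, soc_min, soc_max):
    """Check assumed box constraints for the connected cells in one slot."""
    currents_ok = all(i_min <= i <= i_max for i in cell_currents)
    socs_ok = all(soc_min <= s <= soc_max for s in cell_socs)
    return currents_ok and socs_ok

# Hypothetical trajectory and limits, for illustration only.
trajectory = [0.8477, 0.8475, 0.8472, 0.8470]
reduction = total_soh_reduction(trajectory)
feasible = within_limits([2.0, 2.0, 4.0], [0.5, 0.6, 0.7],
                         i_min=-4.0, i_max=4.0, soc_min=0.1, soc_max=0.9)
print(round(reduction, 4), feasible)
```

The sum telescopes to the first SOH value minus the last, so minimizing the accumulated reduction over the working time is the same as keeping the final pack SOH as high as possible.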

The Proposed Algorithm
To tackle the optimization problem (9), we propose a battery-scheduling algorithm that is run by the BMS. In each time slot, the algorithm first collects measurement data that include the terminal voltage, current, and temperature of each cell, then estimates the SOC and the SOH (Algorithm 1) and controls the charging or discharging process of the BESS based on the load demand (Algorithm 2). Algorithms 1 and 2 return a state vector consisting of the set of SOC values of cells ($\mathcal{C}(t_k)$), the set of SOH values of cells ($\mathcal{H}(t_k)$), the battery pack current ($I_P(t_k)$), and the power demand ($d(t_k)$), which triggers the DRL-based battery-scheduling algorithm (Algorithm 3). The overall flow of the proposed algorithm is shown in Figure 3. Each part of the proposed algorithm is discussed in detail in the subsections below.

EKF-Based SOC and SOH Estimation
The algorithm uses a third-order EKF to estimate the SOC and SOH of each cell in the battery pack and thereby observe the states of the battery cells. To obtain the SOC and SOH of battery cell $i$ at $t_k$, the algorithm first estimates the state vector $\hat{x}_i(t_k)$ and the error covariance $\hat{P}_i(t_k)$ using (12) and (13). These estimates are computed from $x_i(t_{k-1})$, the state vector of cell $i$ at time $t_{k-1}$, together with $A_i(t_{k-1})$ and $B_i(t_{k-1})$, which denote the transition matrix and the input matrix, respectively. The open-circuit voltage $V_{Oi}(t_k)$ is identified by exploiting look-up tables, in which $y_b$ is a real number that changes when $SOH_i$ decreases. The algorithm then calculates the Kalman gain $G_i(t_k)$, which accounts for the error between the measured value and the estimated value, using the error covariance from (13).
Based on the estimated terminal voltage ($\hat{V}_i(t_k)$), the estimated state vector ($\hat{x}_i(t_k)$), the Kalman gain ($G_i(t_k)$), and the measured terminal voltage ($V_i(t_k)$), the algorithm obtains the corrected state vector $x_i(t_k)$ using (21). Similarly, the algorithm corrects the error covariance $P_i(t_k)$ using (22). From the corrected state vector $x_i(t_k)$, the proposed algorithm obtains $SOC_i(t_k)$ and $M_i(t_k)$.
The algorithm updates the SOH of cell $i$ only after one cycle (complete charging and discharging of the battery pack), since the SOH does not decrease appreciably after one or several time slots [23]. If cycle $j$ is completed at time slot $t_k$, the SOH of cell $i$ is updated from $M_i^j(t_k)$, the effective current capacity (on average) of cell $i$ in cycle $j$, which has $(k - \kappa + 1)$ time slots. This average is taken over the time slots of cycle $j$, which starts at $t_\kappa$ and ends at $t_k$. Algorithm 1 summarizes the EKF-based estimation of the SOH and SOC of cells.
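The predict, gain, and correct sequence described above can be illustrated with a scalar Kalman filter step. This is a one-dimensional linear analogue chosen for brevity, not the paper's third-order EKF: the state is a single number, the coefficients a, b, and c stand in for the transition, input, and output (Jacobian) matrices, and the noise variances q and r are assumed to be known. All names and numbers are hypothetical.

```python
def kalman_step(x_prev, p_prev, u, z, a, b, c, q, r):
    """One predict/correct step of a scalar Kalman filter.

    x_prev, p_prev: previous state estimate and error covariance
    u, z: input and measurement for this slot
    a, b, c: transition, input, and output coefficients
    q, r: process and measurement noise variances (assumed known)
    """
    # Prediction of the state and error covariance.
    x_pred = a * x_prev + b * u
    p_pred = a * p_prev * a + q
    # Kalman gain from the predicted covariance.
    gain = p_pred * c / (c * p_pred * c + r)
    # Correction with the measurement residual.
    x_new = x_pred + gain * (z - c * x_pred)
    p_new = (1.0 - gain * c) * p_pred
    return x_new, p_new, gain

# Hypothetical numbers: SOC as the single state, pack current as the input.
x, p, g = kalman_step(x_prev=0.80, p_prev=0.01, u=2.0, z=0.79,
                      a=1.0, b=-0.01, c=1.0, q=1e-5, r=1e-3)
print(round(x, 4), round(p, 6), round(g, 3))
```

In the multi-state case the same three stages apply with matrix products and a matrix inverse in the gain, and the output coefficient becomes the Jacobian of the terminal-voltage equation with respect to the state.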

The Charge/Discharge Control Algorithm
To control the process of charging and discharging the battery pack in the BESS, the algorithm first identifies the process that is underway. If the current of the battery pack is positive, i.e., $I_P(t_k) > 0$, we calculate the amount of electric power discharged in cycle $j$, $l_D(t_k^j)$, using (7). If $l_D(t_k^j)$ reaches the electrical demand $d(t_k^j)$, the BMS switches the BESS process from discharging to charging. Otherwise, the discharge process continues.
If $I_P(t_k)$ is negative, the algorithm determines $l_C(t_k^j)$ (the amount of electrical power charged in cycle $j$) using (8) and compares it with the electrical demand $d(t_k^j)$. If $l_C(t_k^j) \ge d(t_k^j)$, the algorithm switches the BESS process from charging to discharging for a new cycle ($j + 1$); otherwise, it continues charging. The process of charging and discharging the battery pack is summarized in Algorithm 2.
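The control rule described in this subsection can be sketched as a small decision function. The function name, the string labels, and the "idle" branch for zero pack current are illustrative assumptions. The accumulated loads are passed in directly, since their defining equations are not reproduced here.

```python
def next_process(pack_current, l_d, l_c, demand):
    """Return (next process, whether a new cycle starts) for one time slot.

    pack_current > 0 means discharging and < 0 means charging, as in the text.
    l_d and l_c are the loads accumulated so far in the current cycle.
    """
    if pack_current > 0:                      # discharging
        if l_d >= demand:
            return "charge", False            # demand met, start recharging
        return "discharge", False
    if pack_current < 0:                      # charging
        if l_c >= demand:
            return "discharge", True          # recharge done, cycle j + 1 begins
        return "charge", False
    return "idle", False                      # assumption: no current, no change

# Hypothetical values: demand of 13.13 Wh, partially discharged.
print(next_process(8.0, l_d=5.0, l_c=0.0, demand=13.13))
```

A caller would invoke this once per time slot after updating the accumulated loads and would reset both accumulators whenever the second element of the result is true.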

Deep Reinforcement Learning-Based Scheduling Algorithm
A deep Q network (DQN) scheduling algorithm is proposed for the ON/OFF cell switches in the battery pack. The scheduling algorithm has three elements: the state $s(t_k)$, which represents the current state of the BESS; the action $\mathcal{X}(t_k)$, which indicates which cell switches are ON or OFF; and the reward function $r(t_k)$, which is based on action $\mathcal{X}(t_k)$. The algorithm interacts with the environment, i.e., the BESS, to perceive the state of the battery pack ($s(t_k)$) and selects action $\mathcal{X}(t_k)$ to maximize the cumulative reward ($r(t_k)$), i.e., to minimize the SOH reduction of the battery pack. To choose an optimal schedule $\mathcal{X}(t_k)$ for state $s(t_k)$, the algorithm utilizes and updates acquired knowledge ($K$) using deep reinforcement learning. That knowledge includes a switch-scheduling policy for the given battery states and the corresponding scheduling rewards. The DQN-based scheduling algorithm is summarized in Algorithm 3.
The algorithm first observes the current environmental state of the battery pack and obtains the state vector
$$s(t_k) = [\mathcal{C}(t_k), \mathcal{H}(t_k), I_P(t_k), d(t_k)],$$
where $\mathcal{C}(t_k)$ and $\mathcal{H}(t_k)$ are the sets of the SOCs and SOHs of the $N$ cells, respectively; $I_P(t_k)$ is the load current of the battery pack; and $d(t_k)$ is the load demand. Then, the algorithm initializes knowledge ($K$), which includes replay experience ($\mathcal{E}$) with samples $\langle s(t_{k-1}), \mathcal{X}(t_{k-1}), r(t_{k-1}), s(t_k) \rangle$, a main network ($Q$), and a target network ($\hat{Q}$) with random weights. Neural networks $Q$ and $\hat{Q}$ have the same structure. The algorithm explores actions based on past experiences to update the acquired knowledge that leads to a long-term benefit. The DQN updates acquired knowledge ($K$) by minimizing the loss function $L(\phi(t_k))$ using gradient descent, where $\phi(t_k)$ is the DQN network parameter (the weight of the main network), which is updated with learning factor $\alpha \in (0, 1]$. $Q(t_{k-1})$ is the expected discounted cumulative reward after time slot $t_{k-1}$ in main network $Q$, and $\hat{Q}(t_{k-1})$ is the target action value of the target network ($\hat{Q}$), which represents the maximum cumulative reward, i.e., the minimum SOH reduction for the battery pack. Both are calculated with $\gamma \in (0, 1]$, the discount factor indicating the degree of emphasis on future rewards, and $\phi = \{\phi(t_1), \phi(t_2), \ldots, \phi(t_k)\}$ and $\varphi = \{\varphi(t_1), \varphi(t_2), \ldots, \varphi(t_k)\}$ represent the weights of networks $Q$ and $\hat{Q}$, respectively. After determining the loss based on an action, the target network ($\hat{Q}$) copies the weights of the main network ($Q$), i.e., $\varphi = \phi$.
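The replay experience and the target-network copy described above can be sketched as follows. The class and function names are hypothetical, the buffer is a generic fixed-capacity queue with uniform sampling (a common DQN choice that is assumed here), and the default period of 10 steps mirrors the target-update period stated in the simulation setup.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) samples."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)   # oldest samples are dropped first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self._items.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample without replacement (assumed sampling scheme)."""
        size = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), size)

    def __len__(self):
        return len(self._items)

def should_sync_target(step, period=10):
    """True when the target network should copy the main network's weights."""
    return step > 0 and step % period == 0

# Hypothetical usage with integer stand-ins for states.
buffer = ReplayBuffer(capacity=3, seed=0)
for k in range(5):
    buffer.add(state=k, action=0, reward=0.0, next_state=k + 1)
batch = buffer.sample(2)
print(len(buffer), len(batch), should_sync_target(10))
```

Sampling from a buffer rather than training on consecutive slots breaks the correlation between successive battery states, and delaying the target copy keeps the regression target stable between updates.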
To utilize past experience in the DQN-based scheduling algorithm, the proposed algorithm checks the acquired knowledge ($K$) to determine whether state $s(t_k)$ is in $K$ or not. If state $s(t_k)$ is in $K$, the algorithm chooses action $\mathcal{X}(t_k)$ based on an $\epsilon$-greedy policy [24], i.e., it chooses a random action with probability $p = \epsilon$ or, with probability $p = 1 - \epsilon$, the action that has the largest value of $Q(s(t_k), \mathcal{X}(t_k))$. If state $s(t_k)$ is not in $K$, scheduling action $\mathcal{X}(t_k)$ is chosen at random. After taking action $\mathcal{X}(t_k)$ based on observed state $s(t_k)$, the algorithm evaluates the immediate reward. Then, the algorithm determines the cumulative reward ($r(t_k)$) by interacting with the environment and looks for an optimal policy to maximize $r(t_k)$. The algorithm minimizes the loss function $L(\phi(t_k))$ so that action value $Q(t_{k-1})$ approaches target action value $\hat{Q}(t_{k-1})$, which also means that the SOH of the battery pack is optimized. The DQN-based scheduling algorithm is summarized in Algorithm 3, and the DQN training process is shown in Figure 4.
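The action-selection rule above, together with a conventional DQN target and loss, can be sketched as follows. The $\epsilon$-greedy function follows the description in the text. The target (reward plus discounted maximum target-network value) and the squared-error loss are the standard DQN forms and are assumptions here, since the paper's own expressions are not reproduced. All names and numbers are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action index with probability epsilon, else the argmax of Q."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(reward, next_q_target, gamma):
    """Standard DQN target: r + gamma * max over actions of the target network."""
    return reward + gamma * max(next_q_target)

def squared_td_loss(q_value, target):
    """Squared error between the main-network value and the target (assumed form)."""
    return (target - q_value) ** 2

# Hypothetical Q-values for three schedule actions.
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0)
target = td_target(reward=-0.001, next_q_target=[0.2, 0.5], gamma=0.99)
loss = squared_td_loss(q_value=0.45, target=target)
print(greedy_action, round(target, 3), round(loss, 6))
```

With a reward that penalizes SOH reduction, as in this hypothetical example, driving the loss toward zero makes the main network's value for the chosen schedule agree with the best achievable future SOH outcome predicted by the target network.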

Performance Evaluation

Simulation Environment
The simulation was conducted using a lithium-ion battery model and was implemented in MATLAB and Simulink R2022a. To evaluate the performance of the proposed algorithm, we consider a parallel-connected battery pack including four lithium 3.7 V/2.2 Ah batteries with heterogeneous states of health (90.01%, 86.77%, 84.13%, and 78.15% for cells 1 to 4, respectively). MOSFETs with low ON resistance and low power are installed to connect and disconnect the battery cells from the battery pack. We consider different power demand conditions to evaluate the effectiveness of the algorithm. Based on the maximum capacity of a battery pack with new battery cells, we obtain a dynamic power demand profile by generating values from a uniform distribution over 20% to 60% of the maximum energy of a battery pack (i.e., between 6.51 Wh and 19.54 Wh). For the constant power demand, we use the mean value of the dynamic power demand profile, i.e., the average of the power demand $d(t_k)$ over the $w$ time slots of the working time ($\mathcal{W} = \{t_k \mid k = 1, 2, \ldots, w\}$). Figure 5 shows the dynamic and constant power demand profiles. The constant power demand is equal to 13.13 Wh (i.e., 40.32% of the maximum energy of a new battery pack). We set the load current of the battery pack when discharging and charging to 8 A.
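The demand figures quoted above follow directly from the pack specification, as the short sketch below shows. The seed, the number of slots, and the variable names are hypothetical. The sample mean printed here will differ slightly from the paper's 13.13 Wh, which is the mean of the authors' own sampled profile.

```python
import random

CELLS, VOLTAGE_V, CAPACITY_AH = 4, 3.7, 2.2
max_energy_wh = CELLS * VOLTAGE_V * CAPACITY_AH    # energy of a new pack
low, high = 0.2 * max_energy_wh, 0.6 * max_energy_wh

rng = random.Random(0)    # fixed seed, chosen only for reproducibility
dynamic_demand = [rng.uniform(low, high) for _ in range(1000)]
constant_demand = sum(dynamic_demand) / len(dynamic_demand)

print(round(max_energy_wh, 2), round(low, 2), round(high, 2))
print(round(constant_demand, 2))
```

The bounds reproduce the 6.51 Wh and 19.54 Wh limits stated in the text. The expected mean of this uniform distribution is 40% of the pack energy, which is consistent with the reported constant demand of 40.32%.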
A dataset compiled by NASA [22] was used to model a first-order Thévenin equivalent battery model with a reduction in SOH. We also use the dataset to obtain actual SOC and SOH values, which are compared with the estimated values. The dataset includes 28 lithium cobalt oxide 18650 cells with a nominal capacity of 2.2 Ah, with in-cycle measurements of terminal voltage, current, and cell temperature. The dataset also includes measurements of discharging capacity and EIS impedance readings. We identify the EIS parameters, which include $V_{Oi}$, $R_{si}$, $R_{pi}$, and $C_{pi}$, in the 90% to 60% SOH range using the dataset.
The neural networks consist of one 10-dimension input layer, two 256-dimension hidden layers, one 256-dimension LSTM layer, and one 16-dimension output layer. The input layer takes the 10 elements of the battery state ($s(t)$), since there are four battery cells in a battery pack (four SOCs, four SOHs, the pack current, and the power demand). Of the 16 possible ON/OFF combinations at the output layer, 11 are valid cases of schedule action $\mathcal{X}(t)$: at least two batteries must be ON at the same time, since we consider an 8 A current during discharging and the maximum output current of one battery is 4 A. We set the learning rate ($\alpha$) to 0.001, the $\epsilon$-greedy value to 0.9, and the discount factor ($\gamma$) to 0.99. The period of the target network update is 10 time steps. Other simulation parameters are summarized in Table 2.
For the performance evaluation, we first verify the accuracy of the estimation algorithm by determining the error between estimated and actual values. Then, we investigate the effect of the proposed algorithm on the lifetime of a battery pack and the SOHs of the cells under dynamic and constant loads. To validate the performance of the proposed algorithm, we compare it with methods proposed in previous works, including an enhanced Coulomb counting method [7], a hybrid statistical data-driven estimation method [11], and a multi-actor-critic scheduling algorithm [6]. For comparison, we combine the scheduling and estimation algorithms and obtain the BESS performance. We also compare the proposed estimation algorithm with the enhanced Coulomb counting method and the hybrid statistical data-driven estimation method. For the sake of simplicity, we denote the proposed third-order extended Kalman filter (EKF) estimation algorithm as EKFest, the proposed deep Q network scheduling algorithm as DQNsch, the multi-actor-critic scheduling algorithm as MACsch, the hybrid statistical data-driven estimation method as DDest, the enhanced Coulomb counting method as ECest, and simulations without any scheduling algorithm as Non-Schedule.
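The size of the action space described in the simulation setup can be checked by enumeration. The sketch below lists every ON/OFF combination for four cells and keeps those with at least two cells ON, the minimum implied by an 8 A pack current and a 4 A per-cell limit. The variable names are hypothetical, and the ordering of actions is arbitrary rather than the paper's.

```python
from itertools import product

N_CELLS = 4
MIN_CELLS_ON = 2    # 8 A pack current divided by the 4 A per-cell maximum

# Each action is a tuple of switch states, 1 for ON and 0 for OFF.
all_actions = list(product((0, 1), repeat=N_CELLS))
valid_actions = [a for a in all_actions if sum(a) >= MIN_CELLS_ON]

print(len(all_actions), len(valid_actions))
```

The enumeration gives 16 combinations in total, which matches the 16-dimension output layer, and 11 combinations with two or more cells ON, which matches the 11 cases mentioned in the text.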

State Estimation Verification
To evaluate the performance of the proposed algorithm in estimating the SOC and SOH for each cell, we first show the estimated terminal voltage of each cell in a battery pack. Figure 6 shows the root mean square error (RMSE) between the measured terminal voltage and the estimated terminal voltage. The RMSE between the measured and estimated values of the terminal voltage for each cell is close to 0.01 V and remains small over time. The small difference between measured and estimated terminal voltages shows that the proposed algorithm accurately models terminal voltage, which leads to a more accurate estimation of the SOC and SOH of a cell.
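For reference, the two error metrics used in this subsection can be computed as in the sketch below, which uses their standard definitions. The voltage samples are hypothetical values chosen only to illustrate an error of about 0.01 V and are not taken from the dataset.

```python
import math

def rmse(actual, estimated):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - e) ** 2 for a, e in zip(actual, estimated)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, estimated):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

# Hypothetical measured and estimated terminal voltages (V).
measured = [3.70, 3.68, 3.65]
estimated = [3.71, 3.67, 3.66]
print(round(rmse(measured, estimated), 4), round(mae(measured, estimated), 4))
```

RMSE weights large deviations more heavily than MAE, so the two metrics coincide only when every sample has the same absolute error, as in this example.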
The performance of the proposed algorithm in estimating the SOC and SOH for each cell, in terms of RMSE and mean absolute error (MAE), respectively, is shown in Figure 7. The proposed estimation algorithm has the lowest RMSE among the compared methods in estimating the SOCs of cells, as shown in Figure 7a. The RMSE between the actual and estimated values of the SOC for each cell is close to 1% under the proposed algorithm. The error of the proposed algorithm in estimating the SOHs of the cells is shown in Figure 7b. The proposed algorithm has an error of less than 0.2% for SOH, which is 50% less than the other estimation algorithms. ECest shows the worst performance, degrading over time. Note that the performance of the proposed estimation algorithm becomes more stable over time. Estimating the SOC and SOH of the cells with low error is of great significance in order to obtain optimal ON/OFF cell scheduling that extends the lifetime of a battery pack.
The SOH of the battery pack reaches 60% (the end of its second life (EoL)) after a working time of 1767 h under constant power demand, which represents a 13.9% increase in battery pack lifetime compared to previous work (DDest + MACsch). Under dynamic power demand, battery pack lifetime increases by 20.6% under the proposed algorithm compared to previous work. In addition, the difference in the performance of the proposed algorithm under constant and dynamic power demand is quite small, whereas the performance of methods proposed in previous work degrades under dynamic power demand. Hence, the proposed algorithm can efficiently schedule the ON/OFF switching of battery cells to adapt to dynamic power demand.
Compared to DDest + MACsch, the lifetime of the battery pack is higher under EKFest + MACsch and DDest + DQNsch. This shows that both the proposed estimation algorithm and the proposed scheduling algorithm have an impact in extending the lifetime of a battery pack. DDest + DQNsch achieves better performance than EKFest + MACsch, which means optimal scheduling is the more dominant factor in prolonging battery pack lifetime. MACsch achieves worse performance, since it does not consider SOC while scheduling the ON/OFF cell switches to meet power demand. Without scheduling (Non-Schedule), the lifetime of the battery pack reduces rapidly because the weakest cell, i.e., the cell with the lowest SOH, operates continuously.

Impact of the Proposed Algorithm on Capacity Balancing
The effectiveness of the proposed algorithm in balancing the SOH of cells under constant and dynamic load demands is shown in Figures 9 and 10, respectively. Without a scheduling algorithm (Non-Schedule), all the cells in the battery pack are utilized all the time, irrespective of their SOC and SOH, resulting in imbalanced states of health and increasing SOH reduction in the battery pack, irrespective of load demand conditions, as shown in Figures 9a and 10a.
All the algorithms balance the SOH of cells in the battery pack under constant and dynamic load demands, as shown in Figure 9b-e and Figure 10b-e, respectively. Even though the methods proposed in other works achieve SOH balancing among battery pack cells, battery lifetime (the SOH of each cell) decreases rapidly under the other algorithms compared to the proposed algorithm. This means that with heterogeneous states of health for cells in a battery pack, the proposed algorithm offers better performance than other algorithms by extending the second life of battery cells. All the algorithms achieve SOH standard deviations close to zero by balancing the capacity of each cell over time under constant power demand, which can be seen in Figure 9f.
Under dynamic load demand, EKFest + DQNsch achieves more even SOH balancing and reduces the standard deviation of the cells' SOHs to zero, while other algorithms fail to balance the SOHs of cells, except for DDest + DQNsch, which achieves the second-best performance, as shown in Figure 10b-f. The SOH of the weakest cell (cell 4, which has the lowest initial SOH) reaches 60%, while other cells have SOHs of more than 60% under algorithms proposed in other works, resulting in higher standard deviations and an earlier end of second life for the battery pack. DDest + DQNsch reduces the standard deviation of SOHs and extends battery life compared to other scheduling algorithms. This shows the effectiveness of the proposed scheduling algorithm in managing a parallel-connected BESS, even with a less accurate estimation algorithm. The superior performance of the proposed algorithm under the different load demand conditions shows the robustness of the algorithm to load demands.

Impact of Numbers of Batteries on the Proposed Algorithm
We study the impact of the number of parallel-connected batteries in the BESS on the proposed algorithm under dynamic load demand according to the SOH profiles shown in Table 3. The SOH profiles of batteries have the same SOH average (84.77%) and standard deviation (5.02%). The performance of the proposed algorithm under different battery conditions in terms of the operational working time and standard deviation in SOHs is shown in Figure 11. The proposed algorithm (EKFest + DQNsch) achieves a longer operational time (i.e., extends the second life of a battery pack) compared to other algorithms, as can be seen in Figure 11a. The proposed algorithm minimizes the SOH reduction of the battery pack in each time slot by balancing the SOHs of battery cells, thereby extending the battery pack's lifetime. The proposed algorithm achieves the lowest standard deviation with different numbers of batteries in a battery pack, as shown in Figure 11b. The standard deviation in SOHs increases by a minimal amount under the proposed algorithm with an increase in the number of batteries compared to other algorithms. The combinations of the proposed estimation and the proposed scheduling algorithms with the algorithms proposed in previous works (EKFest + MACsch and DDest + DQNsch) increase the lifetime of a battery pack and achieve a more uniform SOH balance compared to the combination of previously proposed algorithms (i.e., DDest + MACsch). This shows the effectiveness of both parts of the proposed algorithm in the optimization of BESSs. Figure 11 shows that the proposed algorithm is robust to the number of battery cells in a battery pack in a BESS.
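As a quick check on the profile statistics quoted above, the sketch below computes the mean and standard deviation of the four-cell SOH profile given in the simulation setup (Table 3 itself is not reproduced here). The variable names are hypothetical. The observation that the sample estimator, rather than the population estimator, matches the quoted 5.02% is an inference from the arithmetic and is not stated in the paper.

```python
from statistics import mean, pstdev, stdev

# Initial SOHs (%) of cells 1 to 4 from the simulation setup.
soh_profile = [90.01, 86.77, 84.13, 78.15]

avg = mean(soh_profile)
sample_sd = stdev(soh_profile)         # n - 1 denominator
population_sd = pstdev(soh_profile)    # n denominator

print(f"mean={avg:.3f} sample_sd={sample_sd:.2f} population_sd={population_sd:.2f}")
```

Holding the mean and standard deviation fixed across pack sizes means that differences in Figure 11 can be attributed to the number of cells rather than to a more or less heterogeneous starting condition.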

Conclusions and Future Work
In this paper, we proposed a DRL-based battery management algorithm to optimize battery lifetime for retired batteries with heterogeneous SOHs in a parallel-connected BESS. The proposed algorithm (i) estimated the SOCs and SOHs of all battery cells using an EKF; (ii) used the estimated SOCs and SOHs to represent the state of a BESS for DRL-based scheduling; and (iii) controlled the ON/OFF switches of battery cells inside the battery pack utilizing deep Q network knowledge.
Via simulation, we showed that the proposed algorithm outperformed previously proposed algorithms by achieving lower estimation errors for battery cell states and extending the battery pack's second life. The proposed algorithm extended the operation time of the battery pack by 13.9% and 20.6% compared to other algorithms under constant and dynamic power demand, respectively.
Regarding future work, we will consider a BESS in which multiple battery packs are connected in series and each battery pack has parallel-connected battery cells. Such a configuration leads to a high-dimensional state space. Furthermore, the deployment of smart-grid technologies that include energy storage systems [25] requires hundreds of battery cells connected in parallel or in series in a BESS. In such systems, DRL-based battery management algorithms may achieve only limited performance due to the high-dimensional state space. We will investigate a distributed reinforcement learning approach to counter the limitations of centralized approaches for large-scale energy storage systems. Additionally, an experimental setup will be considered to observe the impact of the battery management algorithm on real systems.

Figure 1. The relationships between state of charge and state of health.

Figure 2. Implementation of a parallel-connected BESS.

Figure 3. Overall flow chart of the proposed algorithm.

Figure 4. The training process in the DQN.

Figure 6. Root mean square error between actual and estimated terminal voltage using the proposed algorithm.

Figure 7. State estimation evaluation: (a) root mean square error of SOC estimation and (b) mean absolute error of SOH estimation.

Impact of the Proposed Algorithms on Battery Pack Lifetime
The impact of the proposed algorithm on battery pack lifetime in terms of SOH reduction under constant and dynamic power demands is evaluated and shown in Figure 8. The proposed algorithm achieves better performance under both constant and dynamic power demands compared to other algorithms. The proposed algorithm reduces the SOH decay in the battery pack by efficiently scheduling the ON/OFF switching of the cells based on accurate estimation of SOHs and SOCs, resulting in an increase in battery pack lifetime.

Figure 8. SOH reduction of the battery pack under (a) constant power demand and (b) dynamic power demand.

Figure 11. Performance of the scheduling algorithms with different numbers of batteries under dynamic power demand: (a) operation time of the battery pack until the SOH reaches 60% and (b) standard deviation of SOHs among the batteries.

Table 1. Summary of notations.
Algorithm 1 EKF-based SOC and SOH estimation
1: Input: Measurement data $\mathcal{V}(t_k)$, $\mathcal{I}(t_k)$, $\mathcal{T}(t_k)$; data tables
2: Output: $\mathcal{C}(t_k)$, $\mathcal{H}(t_k)$
3: Estimate state vector $\hat{x}_i(t_k)$ and error covariance $\hat{P}_i(t_k)$ using (12) and (13)
4: Estimate terminal voltage $\hat{V}_i(t_k)$ and compute Kalman gain $G_i(t_k)$ using (17) and (20)
5: Update $x_i(t_k)$ and $P_i(t_k)$ using (21) and (22)
6: Update $SOC_i(t_k)$ and $M_i(t_k)$
7: if cycle is completed then
…

Algorithm 2 Charge/discharge control
…
Output: Discharge or Charge
3: if $I_P(t_k) > 0$ and $t_k \in$ cycle $j$ then
…
9: else if $I_P(t_k) < 0$ and $t_k \in$ cycle $j$ then ▷ Charging
10: if $l_C(t_k^j) \ge d(t_k^j)$ then
…

Algorithm 3 DQN-based scheduling
1: Input: state vector $s(t_k)$
2: Output: Optimal schedule action $\mathcal{X}(t_k)$
3: Initialize replay experience $\mathcal{E}$ with capacity $M$
4: Add $\langle s(t_{k-1}), \mathcal{X}(t_{k-1}), r(t_{k-1}), s(t_k) \rangle$ into $\mathcal{E}$
5: Construct main network $Q$ and target network $\hat{Q}$
6: Initialize $Q$ and $\hat{Q}$ with random weights
7: Perform a gradient descent to minimize loss function $L(\phi(t_k))$
8: if $s(t) \in K$ then
…

…, and $C_{pi}(t_{k-1})$ are functions of $SOC_i$, $SOH_i$, and $T_i$, which are obtained from two-dimensional look-up tables (a dataset [22] is used to construct the look-up tables, where $R_{si}(t_{k-1})$, $R_{pi}(t_{k-1})$, and $C_{pi}(t_{k-1})$ are exponential functions of $SOC_i(t_{k-1})$, such as $x_1 \exp(x_2\, SOC_i(t_{k-1})) + x_3$, and $x_1$, $x_2$, and $x_3$ are real numbers; these real numbers change when $SOH_i$ decreases). Then, the algorithm estimates the terminal voltage ($\hat{V}_i(t_k)$) using $\hat{x}_i(t_k)$ and Jacobian matrices $C_i(t_k)$ and $D_i(t_k)$ as …