A Run-Time Dynamic Reconfigurable Computing System for Lithium-Ion Battery Prognosis

Shaojun Wang 1,2, Datong Liu 1,*, Jianbao Zhou 1, Bin Zhang 3 and Yu Peng 1
1 Department of Automatic Test and Control, Harbin Institute of Technology, Harbin 150080, China; wangsj@hit.edu.cn (S.W.); zhoujianbao@163.com (J.Z.); pengyu@hit.edu.cn (Y.P.)
2 Department of Computing, Imperial College London, London SW7 2BZ, UK
3 College of Engineering and Computing, University of South Carolina, Columbia, SC 29208, USA; zhangbin@cec.sc.edu
* Correspondence: liudatong@hit.edu.cn; Tel.: +86-451-8641-3533 (ext. 514)


Introduction
Lithium-ion batteries are widely used in electric vehicles (EVs), consumer electronics, communications, and aerospace technologies [1], due to their advantages of high energy density, long cycle life, and low self-discharge rate [2]. As safety- and reliability-critical components, lithium-ion batteries and their management, including diagnosis and prognosis of state-of-charge (SOC) and state-of-health (SOH), have attracted more and more research effort in the past decades [3-5]. In particular, research on battery capacity degradation and remaining useful life (RUL) estimation is of great interest to battery management systems (BMSs), prognostics and health management (PHM), reliability engineering, and system design, among many other related areas [3]. From a system design point of view, the estimation of the remaining cycle life and assessment of the health state can be used for control reconfiguration and mission replanning to minimize mission failure risk, improve system availability, and reduce life-cycle cost [6]. With this understanding, the diagnosis (identification of the current battery SOC and SOH) and prognosis (estimation of the RUL in terms of operation time) are essential capabilities of an intelligent BMS.

Section 4 describes the RC system for RVM implementation. Section 5 presents experimental results to demonstrate the effectiveness of the proposed FPGA-based RC platform, which is followed by concluding remarks and future work in Section 6.

BMS Based on Dynamic Reconfigurable Computing System
In future intelligent BMSs, diagnosis and prognosis of Li-ion batteries are required to extend battery life and ensure the safety and reliability of the systems. However, accurate diagnosis and prognosis both involve very complicated algorithms, e.g., statistical machine learning and stochastic process models. Moreover, these algorithms must be implemented on embedded computing systems with strict area and power consumption constraints in actual industrial applications, such as EVs and satellites. To meet these requirements, a BMS architecture built around an FPGA-based dynamic reconfigurable computing system is described, as shown in Figure 1. Besides the sampling and control circuit units, the FPGA is the main control and computing unit in the BMS. Traditional embedded processors, such as microprocessors, cannot run these complex models and algorithms efficiently or are unable to meet the performance requirements. To make the BMS more compact and power efficient, complicated diagnosis and prognosis computations, such as SOC/SOH identification and estimation, are realized in time-division multiplexing mode in the dynamic reconfigurable partition of the FPGA. To reach high performance under limited hardware resources, a customized computing architecture for each diagnosis and prognosis algorithm should be designed.

Li-ion Batteries
In this work, prognosis, which usually involves RUL estimation based on SOH, is chosen as an example to demonstrate how the algorithms are realized in the BMS shown in Figure 1. The aforementioned RVM algorithm, with its capability for uncertainty management and its promising results in prognosis, is selected to show how its customized computing architecture is built. The basic principle of the RVM algorithm, as well as its implementation for RUL estimation, is introduced in Section 3. The key technology for supporting multiple algorithms and complicated calculations on a compact FPGA chip is the design of the architecture and reconfigurable framework, which is described in Section 4.

RVM Algorithm
Given a battery monitoring dataset $\{x_i, t_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^l$, $t_i \in \mathbb{R}$, the regression model is

$t_i = y(x_i) + \varepsilon$ (1)

where $x = \{x_i\}_{i=1}^{N}$ represents the battery capacity, $N$ is the number of samples, $t$ represents the objective health indicator describing the health state, such as the number of remaining rechargeable cycles, $y(\cdot)$ is a nonlinear function, and $\varepsilon$ is an independent additive noise term with $\varepsilon \sim N(0, \sigma^2)$ [14]. The RVM model is

$y = \Phi w$ (2)

where $\Phi = [\phi(x_1), \phi(x_2), \cdots, \phi(x_N)]^T$ is the kernel function matrix, in which $\phi(x_i) = [1, K(x_i, x_1), \cdots, K(x_i, x_N)]$ and $K(\cdot)$ is the kernel function, and $w = (\omega_0, \cdots, \omega_N)^T$ is the weight vector of the model. Following Bayesian inference, $p(t_i|x)$ satisfies $p(t_i|x) = N(t_i\,|\,y(x), \sigma^2)$, a normal distribution of $t_i$ with mean $y(x)$ and variance $\sigma^2$. Under the independence assumption on $t_i$, the likelihood of the complete dataset can be written as

$p(t\,|\,w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\!\left(-\frac{\|t - \Phi w\|^2}{2\sigma^2}\right)$ (3)

A zero-mean Gaussian prior is defined over $w$:

$p(w\,|\,\alpha) = \prod_{i=0}^{N} N(\omega_i\,|\,0, \alpha_i^{-1})$ (4)

with $\alpha = \{\alpha_0, \alpha_1, \cdots, \alpha_N\}$ being a vector of $N+1$ hyper-parameters.
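As a concrete illustration of the model structure above, the design matrix $\Phi$ can be sketched in a few lines of NumPy. The function names and the choice of a Gaussian kernel here are illustrative, not the paper's hardware implementation:

```python
import numpy as np

def gaussian_kernel(xi, xj, gamma):
    # K(x_i, x_j) = exp(-||x_i - x_j||^2 / gamma^2)
    return np.exp(-np.sum((xi - xj) ** 2) / gamma ** 2)

def design_matrix(X, gamma):
    # phi(x_i) = [1, K(x_i, x_1), ..., K(x_i, x_N)] stacked row-wise,
    # giving the N x (N+1) matrix Phi used in y = Phi w.
    N = X.shape[0]
    Phi = np.ones((N, N + 1))
    for i in range(N):
        for j in range(N):
            Phi[i, j + 1] = gaussian_kernel(X[i], X[j], gamma)
    return Phi
```

Note that the first column of ones corresponds to the bias weight $\omega_0$, which is why $w$ has $N+1$ entries.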
In practice, the posterior distributions of many weights are sharply (indeed infinitely) peaked around zero. The training vectors associated with the remaining non-zero weights are called Relevance Vectors (RVs).
To maximize the posterior probability given the data, the following distribution of the hyper-parameters needs to be maximized:

$p(t\,|\,\alpha, \sigma^2) = N(t\,|\,0, C)$ (8)

where $C = \sigma^2 I + \Phi A^{-1} \Phi^T$. Setting the derivative of Equation (8) to zero leads to [15]

$\alpha_i^{new} = \gamma_i / \mu_i^2$

where $\mu_i$ is the $i$-th posterior mean weight and $\gamma_i = 1 - \alpha_i \Sigma_{ii}$, with $\Sigma_{ii}$ being the $i$-th diagonal element of the posterior weight covariance computed with the current values of $\alpha$ and $\sigma^2$.
Similarly, the noise variance $\sigma^2$ can be updated as

$(\sigma^2)^{new} = \frac{\|t - \Phi\mu\|^2}{N - \sum_i \gamma_i}$

Given a new test point $x_*$, predictions are made for the corresponding target $t_*$. Specifically, the distribution of $t_*$ is $p(t_*\,|\,t) = N(\mu_*, \sigma_*^2)$, where $\mu_* = \mu^T \phi(x_*)$ and $\sigma_*^2 = \sigma_{MP}^2 + \phi(x_*)^T \Sigma\, \phi(x_*)$ are the predicted mean and variance for $x_*$, respectively, in which $\sigma_{MP}^2$ is obtained by maximizing the marginal distribution.
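The prediction step is just two small linear-algebra formulas; a minimal sketch of the two expressions above, with illustrative function and variable names:

```python
import numpy as np

def rvm_predict(phi_star, mu, Sigma, sigma2_mp):
    # mu_* = mu^T phi(x_*)
    mean = float(mu @ phi_star)
    # sigma_*^2 = sigma_MP^2 + phi(x_*)^T Sigma phi(x_*)
    var = float(sigma2_mp + phi_star @ Sigma @ phi_star)
    return mean, var
```

The predicted variance combines the learned noise floor with the model uncertainty at the query point, which is what makes the RVM attractive for uncertainty management in RUL prognosis.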

Expectation Maximization (EM) Algorithm
In this work, the EM algorithm is used for RVM training [14,15] because it avoids the direct computation of the hyper-parameters. The EM iteration consists of an E-Step (expectation calculation) and an M-Step (maximization), as follows:

Step 1: Initialization. Initialize the weight $w^{(1)}$ and variance $(\sigma^2)^{(1)}$.

Step 2: E-Step. With the current variance $(\sigma^2)^{(k)}$, compute the posterior weight covariance $\Psi^{(k)}$ and the updated weight $w^{(k+1)}$ (Equation (11)).

Step 3: M-Step. With $w^{(k+1)}$, compute the variance $(\sigma^2)^{(k+1)}$ (Equation (12)), where trace(·) is the trace of a matrix. Repeat Steps 2 and 3 until convergence.
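The iteration can be sketched in software as follows. This is a minimal NumPy sketch assuming the standard EM-style RVM updates (posterior mean and covariance in the E-Step, hyper-parameter and noise-variance updates in the M-Step, following Tipping's formulation); it is a behavioral reference, not the hardware implementation, and the variable names are illustrative:

```python
import numpy as np

def rvm_em_train(Phi, t, n_iter=50, tol=1e-6):
    # EM-style RVM training sketch: alternate the posterior computation
    # (E-Step) with hyper-parameter re-estimation (M-Step).
    N, M = Phi.shape
    alpha = np.ones(M)                     # weight precisions (hyper-parameters)
    sigma2 = max(float(np.var(t)), 1e-3)   # initial noise variance
    w = np.zeros(M)
    for _ in range(n_iter):
        A = np.diag(alpha)
        # E-Step: posterior covariance Psi and posterior mean weights
        Psi = np.linalg.inv(A + Phi.T @ Phi / sigma2)
        w_new = Psi @ Phi.T @ t / sigma2
        # M-Step: alpha_i = 1 / E[w_i^2], sigma^2 from the trace formula
        Eww = Psi + np.outer(w_new, w_new)               # E[w w^T]
        alpha = 1.0 / np.maximum(np.diag(Eww), 1e-12)
        sigma2 = (t @ t - 2.0 * (t @ Phi @ w_new)
                  + np.trace(Phi @ Eww @ Phi.T)) / N
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w, sigma2
```

The matrix products appearing in this loop ($\Phi^T t$, $\Phi\Psi$, the inversion, the trace term) are exactly the operations whose hardware mapping is analyzed in Section 4.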

Run-Time Dynamic Reconfigurable Computing Platform for Embedded RVM Implementation
Due to the battery's varying operating conditions and the nonlinearity of degradation, dynamic training is always required to improve the RUL prediction performance, at the cost of more computation. In this section, a run-time RC computing architecture is proposed by analyzing the computation processes and features of the RVM with the EM algorithm. The partition of computing tasks is realized by a multi-objective optimization based on the dynamic reconfigurable RVM. Parallel computing, pipelining, and time-multiplexing techniques are employed to optimize the hardware resources and improve the computation efficiency for embedded systems.

Computing Process Analysis of RVM Algorithm
The computing process of the RVM algorithm is mainly composed of Equations (11) and (12). In Equation (11), $\Phi$ is the $n$-dimensional kernel function matrix and $\Phi^T t$ is a matrix-vector multiplication, so $\Phi$ and $\Phi^T t$ are determined by the training samples and remain constant during the iteration; $\Phi\Psi^{(k)}$ involves a matrix multiplication in each iteration, and $\Psi^{(k)}\Phi^T$ is the transpose of $\Phi\Psi^{(k)}$; $\Sigma$ and $\Phi^T \Sigma^{-1} \Phi$ involve several matrix multiplications because a matrix inversion is needed in the calculation of $\Sigma$; and $ww^T$ is an $n \times n$ matrix. In Equation (12), $t^T t$ is an $n$-dimensional vector inner product determined by the training samples and unchanged across iterations; $t^T \Phi$ is the transpose of $\Phi^T t$, which can be obtained from the E-Step directly; $\mathrm{trace}[\Phi\, E(ww^T)\, \Phi^T]$ involves matrix multiplications; and $t^T \Phi (w^{(k+1)})^T$ is an $n$-dimensional vector inner product.
In the prediction step, $\Phi(x_*)$ is an $n$-dimensional kernel function vector and $w^T \Phi(x_*)$ is a vector inner product. Equation (10) is mainly composed of matrix-vector multiplications.
The above analysis of the computation load shows that:

• The entire computation process is composed of matrix multiplications, matrix inversions, and kernel function calculations, which involve a large amount of computation in each iteration. This indicates that the RVM algorithm is highly complex and computation-intensive.

• The training efficiency is determined by the number of iterations, which depends on the training samples and the convergence conditions. Therefore, a well-designed computing architecture is imperative to maximize the utilization of hardware resources and the efficiency of the RVM.

Dynamic Reconfigurable RVM
RC has two modes: static RC and dynamic RC [24]. Static RC configures the FPGA beforehand and keeps this configuration unchanged during operation. Clearly, more FPGA chips would be needed to accommodate the high complexity of the RVM algorithm in static RC mode, which would increase volume, weight, and power consumption. Moreover, the static mode is not suitable for the fusion of multiple algorithms or for dynamic algorithm updates.
Dynamic RC allows parts of an FPGA to be reconfigured while the other parts keep working. With this time-multiplexing feature, large and complex computation tasks can be divided into multiple simple sub-tasks and executed with limited resources, which is more suitable for compact embedded BMS applications. In this work, dynamic RC is therefore selected to implement the RVM-based prediction algorithm.
In dynamic RC, task partitioning is the key to ensuring load balance and computing efficiency. In the dynamic reconfigurable RVM, the simplest way is to manually partition the task into an E-Step part and an M-Step part according to Figure 1. However, this partition requires reconfiguring the FPGA twice in every iteration of the E-Step and M-Step, which reduces the computation efficiency. In addition, the E-Step and M-Step involve different sub-tasks that require different hardware resources, which leads to an unbalanced partition and wasted hardware resources if these two different tasks are accommodated in one reconfigurable partition. For this reason, an optimal computing task partition is critical and needs to be developed.

Computing Task Partition of Dynamic Reconfigurable RVM
An optimal dynamic RC task partition should take into consideration the factors that affect the computing efficiency, such as execution time, number of reconfigurations, size of reconfigurable partitions, and sub-task parallelism. Execution time depends on the computing delay of the task with the given computing resources. The number of reconfigurations must be minimized to increase efficiency. Each sub-task should occupy approximately equal hardware resources to keep the reconfigurable partitions balanced and decrease resource consumption. Parallel computing and pipelining are utilized to further improve the computational efficiency. In this work, a multi-objective optimization is employed to realize the dynamic RC task partition, in which the following issues are taken into consideration:

Objective task model: The objective task model G is defined as <W, V, D>, in which W is the computing time in each reconfigurable partition, V is the reconfiguration time, and D is the hardware resources in the reconfigurable partition.

RVM computation: For an RVM computing task divided into N sub-tasks denoted by $P = \{P_1, P_2, \cdots, P_N\}$, each sub-task is executed by a reconfigurable partition in time-multiplexing mode. The RVM computation terminates when all sub-tasks have been executed.
Sub-task execution time: The execution time $X_i$ of sub-task $P_i$ is given by

$X_i = m_i^{add}\, rd_{add} + m_i^{sub}\, rd_{sub} + m_i^{mul}\, rd_{mul} + m_i^{div}\, rd_{div}$

where $m_i^{add}$, $m_i^{sub}$, $m_i^{mul}$, and $m_i^{div}$ are the numbers of adders, subtractors, multipliers, and dividers in task $P_i$, and $rd_{add}$, $rd_{sub}$, $rd_{mul}$, and $rd_{div}$ are their respective computing delays. The number of each operator is the sum of the corresponding numbers of operators in each configuration; these numbers can be obtained by analyzing the computations in the task. For a specific FPGA and the corresponding intellectual property (IP) cores, $rd_{add}$, $rd_{sub}$, $rd_{mul}$, and $rd_{div}$ are known before partitioning.
Configuration time: The configuration time of each task depends on the amount of hardware resources in the reconfigurable partition. In other words, the configuration time $Y_i$ for task $P_i$ is proportional to the hardware scale of the designated reconfigurable partition $R(P)$ and the number of reconfigurations $m_i$ for task $P_i$:

$Y_i = k \cdot m_i \cdot R(P)$

where $k$ is a gain.

Optimization constraints: The following constraints need to be considered:

(a) The dynamic reconfigurable partition $R(P)$ must not exceed the actual FPGA resources, i.e., $R(P) \le R_{total}$, where $R_{total}$ is the actual FPGA resources. In our applications, 50% of the FPGA hardware resources are reserved for algorithm fusion and parallel computing. With this consideration, the dynamic reconfigurable resources are limited by

$R(P) \le 0.5\, R_{total}$

(b) The resources for each sub-task must not exceed the resources of the dynamic reconfigurable partition, namely

$R(P_i) \le R(P)$

(c) Because different tasks may share the same reconfigurable partition in time-multiplexed mode, the resource occupation of each task needs to be balanced to avoid wasting hardware resources. These resources include look-up tables (LUTs), block RAM (BRAM), connection resources, and DSP48E resources. The DSP48E is the processing unit for a variety of arithmetic computations in the FPGA, such as addition, subtraction, multiplication, and division. Since DSP48E resources are the key factor affecting the computing capability of the FPGA, to simplify the problem, we balance the DSP48E resources as follows:

$|P_i^E - P_j^E| \le \Delta$

where $P_i^E$ and $P_j^E$ denote the DSP48E resources for tasks $P_i$ and $P_j$, respectively, and $\Delta$ is the balance scale determined by the specific task. Once the FPGA and the IP cores used for the adder, subtractor, multiplier, and divider are determined, the numbers of DSP48Es are also determined. Assuming the numbers of DSP48Es for the adder, subtractor, multiplier, and divider are $d_{add}$, $d_{sub}$, $d_{mul}$, and $d_{div}$, respectively, $R(P_i^E)$ can be expressed as

$R(P_i^E) = m_i^{add}\, d_{add} + m_i^{sub}\, d_{sub} + m_i^{mul}\, d_{mul} + m_i^{div}\, d_{div}$
Optimization objective: The goal of the optimization is to search for a variable vector $\Omega = \{N, m_i^{add}, m_i^{sub}, m_i^{mul}, m_i^{div}, m_i, k\}$ that minimizes the execution time of the entire task, which is the sum of the execution time and reconfiguration time of all sub-tasks:

$\min \sum_{i=1}^{N} (X_i + Y_i)$ (19)

Since the variables and optimization objectives in Equation (19) are integers, the optimization problem can be solved by commercial integer linear programming tools such as LINGO.
In this work, a Virtex-5 series FPGA is selected as the target hardware platform. The constraints are set according to the Virtex-5 XC5VFX130T FPGA device manufactured by Xilinx (San Jose, CA, USA). Some parameters are given as $\Delta = 2$, N = 2, and $m_i = 1$.
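Under the stated constraints, the partition search can be prototyped in software. The sketch below brute-forces all two-way splits of a set of computing modules, using hypothetical operator delays and DSP48E counts (the real values come from the chosen FPGA IP cores); it is an illustrative stand-in for the integer-programming solution, not the authors' tool flow:

```python
from itertools import combinations

# Hypothetical per-operator DSP48E usage and computing delays (illustrative).
DSP = {"add": 2, "sub": 2, "mul": 3, "div": 14}
DELAY = {"add": 5, "sub": 5, "mul": 8, "div": 28}

def merge(ops_list):
    # Sum the operator counts of several modules into one sub-task.
    out = {"add": 0, "sub": 0, "mul": 0, "div": 0}
    for ops in ops_list:
        for o in ops:
            out[o] += ops[o]
    return out

def dsp_usage(ops):
    # R(P_i^E) = sum over operators of m_i^op * d_op
    return sum(ops[o] * DSP[o] for o in ops)

def exec_time(ops):
    # X_i = sum over operators of m_i^op * rd_op
    return sum(ops[o] * DELAY[o] for o in ops)

def best_two_way_partition(modules, delta, dsp_cap, k=0.01):
    # Enumerate all 2-way splits, keep splits whose DSP48E usages are
    # balanced within delta and fit the partition, and minimise the total
    # execution + reconfiguration cost (one configuration per sub-task).
    names = list(modules)
    best = None
    for r in range(1, len(names)):
        for group in combinations(names, r):
            A = merge([modules[n] for n in group])
            B = merge([modules[n] for n in names if n not in group])
            pe_a, pe_b = dsp_usage(A), dsp_usage(B)
            if abs(pe_a - pe_b) > delta or max(pe_a, pe_b) > dsp_cap:
                continue
            cost = exec_time(A) + exec_time(B) + k * (pe_a + pe_b)
            if best is None or cost < best[0]:
                best = (cost, set(group))
    return best
```

For the small module counts that arise here (N = 2 sub-tasks), exhaustive enumeration is cheap; a real design flow would hand the same model to an integer-programming solver.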

System Framework of Dynamic Reconfigurable RVM
The RVM-based RC task is divided into two sub-tasks, denoted reconfiguration units A and B. The two sub-tasks are executed in time-multiplexing mode in the same reconfigurable partition. Each reconfigurable unit is configured only once throughout the calculation process.
Figure 2 shows the flowchart of the proposed dynamic reconfigurable RVM on the RC platform. The RVM on this platform has a training phase and a prediction phase, which compose a typical online retraining process. The FPGA is initially configured as unit A in the training phase and stores the intermediate results of unit A. Then, the FPGA is reconfigured as unit B and stores the calculated results of unit B. In the prediction phase, unit A is reconfigured to predict the fault growth and estimate the RUL. From this figure, it is obvious that the training phase involves one reconfiguration from unit A to unit B and the prediction phase involves one reconfiguration from unit B to unit A. The online retraining process of the RVM therefore needs two dynamic reconfigurations for each cycle of computation.

Figure 3 shows the general framework of the dynamic RC system on the Virtex-5 FPGA [24]. The dynamic RC system consists of the FPGA, off-chip memory, and configuration memory. The FPGA is the core functional unit and comprises a static logic partition and a dynamic reconfigurable partition. The static logic partition includes the embedded processors, the on-chip bus, and the peripheral modules connected to the bus. The dynamic reconfigurable partition executes reconfiguration units A and B in time-multiplexing mode for RVM training and prediction. The embedded processors access the dynamic reconfigurable partition through the on-chip bus to control the computing process.
The off-chip memory stores the input data and intermediate results for RUL prediction. Data exchange between the dynamic reconfigurable partition and the off-chip memory is implemented in a direct memory access (DMA) manner to improve data transmission efficiency. Decoupling IPs are used to avoid disturbing the functions of the static logic partition during dynamic reconfiguration. The configuration memory is used to store the FPGA configuration files.

Architecture of Dynamic Reconfigurable Partition
The internal structures of the reconfiguration units A and B are designed as shown in Figures 4 and 5.The internal architecture of the dynamic reconfigurable partition is a key factor to ensure the efficient task execution.The modular design is utilized to divide reconfiguration units A and B into several computing modules.To balance the resources of reconfigurable units and leverage the parallel computing features of FPGA, four processing elements (PE) of kernel functions are instantiated to synchronously calculate the Gaussian kernel.

In Figure 4, reconfiguration unit A includes six main calculation modules, in which modules 1-4 are parallel PEs of the kernel function, module 5 is a matrix-vector multiplication module, and module 6 is a vector inner product module. Reconfiguration unit B, shown in Figure 5, consists of five modules, in which module 1 is for matrix multiplication, module 2 is for the improved Cholesky decomposition, module 3 is for matrix inversion, module 4 is for matrix-vector multiplication, and module 5 is for vector inner product.

The interface between the reconfigurable partition and the static logic partition must be consistent. It includes the on-chip bus interface and the control interface for data and command transmission. The calculation control module ensures that the computing tasks are executed in sequence with the scheduled timing. First-in-first-out (FIFO) buffers hold the calculation results, and RAM buffers the input data and intermediate variables. The inputs and outputs of the FIFO and RAM are controlled by the calculation control module according to the specific computing task.

Design of Key Modules
The kernel function computation involves the calculation of 2-norms and exponential functions. The 2-norm calculation has low computational efficiency since it cannot be executed directly in a pipelined way, while the exponential function is a transcendental function that cannot be implemented directly with multipliers and adders. To achieve high efficiency, a multi-level pipelined calculation method based on piecewise linear approximation is proposed. In addition, an improved Cholesky decomposition method is proposed to deal with the large amount of computation in matrix inversion with lower-upper (LU) decomposition and the instability caused by round-off error in the Cholesky decomposition.



• Multi-level pipelined calculation method of piecewise linear approximation for the kernel function

For the Gaussian kernel function $K(x, x_i) = \exp(-\|x - x_i\|^2 / \gamma^2)$, where $\gamma$ is the hyper-parameter determined before training, the kernel matrix $\Phi$ is a symmetric positive definite matrix whose diagonal elements are all 1. Therefore, only the lower triangular elements (Equation (20)) need to be calculated. Assuming that the dimension of the training samples is $l$, the squared 2-norm for vectors $x_i$ and $x_j$ is

$\|x_i - x_j\|^2 = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{il} - x_{jl})^2$ (21)

A multi-level pipelined method is introduced to calculate the 2-norm. Taking the first column in Equation (20) as an example, the computing elements are shown in Table 1.
Table 1. Computing components for the 2-norm computation.

Vector | Step 1 | Step 2 | ... | Step l
$x_2$ | $a_1 = (x_{21} - x_{11})^2$ | $a_n = (x_{22} - x_{12})^2$ | ... | $a_{(l-1)(n-1)+1} = (x_{2l} - x_{1l})^2$
... | ... | ... | ... | ...
$x_n$ | $a_{n-1} = (x_{n1} - x_{11})^2$ | $a_{2(n-1)} = (x_{n2} - x_{12})^2$ | ... | $a_{l(n-1)} = (x_{nl} - x_{1l})^2$

The 2-norm computing process is executed by rows: $\|x_2 - x_1\|^2$ is computed first, then $\|x_3 - x_1\|^2$, and so forth. In this process, a fully pipelined accumulator is designed with a fully pipelined adder and a FIFO, as shown in Figure 6.

Figure 7 shows the computing flowchart of the fully pipelined accumulator. In the first stage, the data $a_1, a_2, \ldots, a_{n-1}$ and "0" are input to the adder, and the outputs of the adder are stored in the FIFO. In the second stage, the data $a_n, a_{n+1}, \ldots, a_{2(n-1)}$ and the FIFO-buffered data from the first stage are input to the adder. The outputs of the adder are again stored in the FIFO. This process is repeated until the $l$-th pipeline cycle. The 2-norm calculations of the other columns can be implemented similarly.
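In software, this FIFO-based accumulation scheme can be modeled as follows; a deque plays the role of the hardware FIFO, and the function is a behavioral sketch rather than the RTL design (the terms must arrive stage-major, n_lanes values per stage):

```python
from collections import deque

def pipelined_accumulate(terms, n_lanes):
    # Behavioral model of the FIFO-based fully pipelined accumulator:
    # each incoming squared-difference term is added to the partial sum
    # popped from the FIFO (zero on the first stage) and pushed back.
    # After all stages the FIFO holds the n_lanes accumulated 2-norms.
    # len(terms) must be a multiple of n_lanes (l stages of n_lanes terms).
    fifo = deque([0.0] * n_lanes)
    for a in terms:
        fifo.append(fifo.popleft() + a)
    return list(fifo)
```

Because partial sums for different rows are interleaved through the FIFO, the adder never waits for its own result, which is what makes the accumulation fully pipelined.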

(b) Exponential Function Calculation
Since the exponential function cannot be implemented directly by adders and multipliers in the FPGA, a piecewise linear approximation method combining a linear polynomial function and a LUT is employed [25]. In this method, the linear polynomial parameters are stored in the LUT, while the linear polynomial itself is calculated by an adder and a multiplier in the FPGA.
The principle of the piecewise linear approximation is as follows: for a nonlinear function $f(x)$ defined on $x \in [L, U]$, the interval is evenly divided into $m$ subintervals $[L_i, U_i]$, with $m = (U - L)/(U_i - L_i)$. For $x \in [L_i, U_i]$, $f(x)$ is approximated as $f(x) \approx k_i x + b_i$, where $k_i$ and $b_i$ are the linear polynomial parameters stored in the LUT. Note that $m$ introduces a trade-off between accuracy and hardware occupation and should be selected properly.
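A software model of this LUT scheme is sketched below with illustrative interval counts and ranges; in hardware, $k_i$ and $b_i$ live in kRAM/bRAM and the index mapping is done by a fixed-point conversion and rounding:

```python
import numpy as np

def build_lut(f, L, U, m):
    # Precompute slopes k_i and intercepts b_i so that f(x) ~ k_i*x + b_i
    # on each of the m equal subintervals [L_i, U_i] of [L, U].
    edges = np.linspace(L, U, m + 1)
    k = (f(edges[1:]) - f(edges[:-1])) / (edges[1:] - edges[:-1])
    b = f(edges[:-1]) - k * edges[:-1]
    return k, b

def lut_eval(x, L, U, m, k, b):
    # Map x to its subinterval index (hardware: fixed-point + rounding),
    # then evaluate the stored linear polynomial.
    i = min(int((x - L) / (U - L) * m), m - 1)
    return k[i] * x + b[i]
```

Doubling $m$ roughly quarters the interpolation error (it scales with the squared interval width) but doubles the LUT storage, which is the accuracy/hardware trade-off noted above.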
Figure 8 shows the data path of the Gaussian kernel function calculation. In this figure, the upper part is for the 2-norm calculation while the lower part is for the exponential function calculation. In the lower part, bRAM and kRAM store the parameters b and k of the piecewise linear approximation, respectively.
The corresponding values of k and b are obtained by accessing the addresses of these two RAMs; the kernel function value is then calculated with subtractor 2 and multiplier 4. To obtain the right addresses of bRAM and kRAM, the output of the 2-norm calculation is mapped to a RAM address by the multiplication, floating-point to fixed-point conversion, and rounding units. Since the computing delay of multiplier 4 is one clock cycle larger than that of subtractor 2, FIFO4 is used to synchronize them by introducing one clock cycle of delay in accessing the bRAM. FIFO3 is used to store the output of the 2-norm calculation, and its depth should be larger than the delay of multiplier 4, which is 8 clock cycles in this design. The multipliers and adders are implemented with the Xilinx floating-point arithmetic IP core, and the data are in single-precision floating-point format.

• Improved Cholesky decomposition for matrix inversion based on multiply-subtraction
To make the computation affordable and to address the instability caused by rounding errors in the matrix inversion process, an improved Cholesky decomposition [26] is introduced for the inversion of the symmetric positive definite Gaussian kernel matrix Σ. The improved Cholesky decomposition factorizes Σ as

Σ = L D L^T

where L is a lower triangular matrix whose diagonal elements are all 1 and D is a diagonal matrix whose diagonal elements are nonzero. The matrices L and D can be obtained as

d_r = h_{rr} − ∑_{k=1}^{r−1} l_{rk}^2 d_k
l_{ir} = (h_{ir} − ∑_{k=1}^{r−1} l_{ik} l_{rk} d_k) / d_r

where r = 1, ..., n, i = r + 1, ..., n, and h_{ij} denotes an element of Σ. Letting P = L^{−1}, we have Σ^{−1} = P^T D^{−1} P, where P is a lower triangular matrix with elements

p_{ii} = 1,  p_{ij} = −∑_{k=j}^{i−1} l_{ik} p_{kj}    (24)

where i = 1, 2, ..., n and j = 1, 2, ..., i − 1. The implementation of the improved Cholesky decomposition and matrix inversion is shown in Figure 9. The elements of the diagonal matrix D^{−1} are the reciprocals of the corresponding elements of D, which only requires division. The computation of the matrix P, i.e., the inverse of L, is given by Equation (24), in which p_{ii} = 1 and p_{ij} is the negative of the result from a multiply accumulator.
In Figure 9, the calculation of D^{−1} is performed by divider 1 and the results are stored in FIFO1. Multiplier 1, subtractor 1, and FIFO2 are used to calculate p_{ij}; FIFO2 buffers the results from subtractor 1 and is initialized to 0. The memory depth of FIFO1 and FIFO2 is n. With D^{−1} and P, Σ^{−1} can be calculated as Σ^{−1} = P^T D^{−1} P.
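The decomposition and inversion steps can be sketched in Python as follows. This is a reference model of the arithmetic (NumPy floating point), not the fixed FPGA pipeline; H stands for the kernel matrix Σ. Note that only multiply-subtract operations and plain divisions appear, with no square roots:

```python
import numpy as np

def ldl_inverse(H):
    """Invert a symmetric positive definite matrix via the square-root-free
    (improved) Cholesky decomposition H = L D L^T, then
    H^{-1} = P^T D^{-1} P with P = L^{-1}."""
    n = H.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for r in range(n):
        # d_r = h_rr - sum_{k<r} l_rk^2 d_k
        d[r] = H[r, r] - sum(L[r, k] ** 2 * d[k] for k in range(r))
        for i in range(r + 1, n):
            # l_ir = (h_ir - sum_{k<r} l_ik l_rk d_k) / d_r
            L[i, r] = (H[i, r] - sum(L[i, k] * L[r, k] * d[k]
                                     for k in range(r))) / d[r]
    # P = L^{-1}: p_ii = 1, p_ij = -(multiply-accumulate result), Equation (24)
    P = np.eye(n)
    for i in range(n):
        for j in range(i):
            P[i, j] = -sum(L[i, k] * P[k, j] for k in range(j, i))
    return P.T @ np.diag(1.0 / d) @ P
```

Because d_r can be computed without a square root, the hardware needs only multipliers, subtractors, and a divider, matching the datapath of Figure 9.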

Hardware Platform
The run-time dynamic RC system must be implemented on reconfigurable devices for onboard and distributed applications. To verify the proposed run-time dynamic RC system for RVM-based lithium-ion battery RUL estimation, the Xilinx ML510 development board is used. In the experiments, the accuracy, computational efficiency, and hardware resource consumption of the reconfigurable RVM-based RUL prediction system are analyzed and discussed.
The main specifications of the ML510 development board are as follows:
• FPGA: Virtex XC5VFX130T;
• DDR2 SDRAM: 512 MB, 72-bit, 2 chips;
• CompactFlash (CF): 512 MB.
A run-time RC RVM system is constructed as shown in Figure 10, in which the embedded PowerPC440 processor works at 400 MHz, the on-chip PLB bus works at 100 MHz, the configuration port is the ICAP interface embedded in the XC5VFX130T, and the floating-point arithmetic is implemented with the single-precision floating-point IP core based on the IEEE 754 standard [27] provided by Xilinx ISE 13.2 (San Jose, CA, USA). For comparison purposes, a PC platform with a 2.53 GHz Core 2 Duo CPU and 2 GB DDR2 memory is used to execute the same algorithm.


Lithium-Ion Battery Data Set
Two lithium-ion battery data sets are used to verify the proposed framework. One is from the data repository of the NASA Ames Prognostics Center of Excellence (PCoE) and the other is from the Center for Advanced Life Cycle Engineering (CALCE) at the University of Maryland.
The first data set was sampled from a battery prognostics test bed at NASA comprising commercially available Li-ion 18650 rechargeable batteries [28]. The lithium-ion batteries were run through three different operational profiles (charge, discharge, and impedance) at room temperature. Charging was carried out in a constant current mode at 1.5 A until the battery voltage reached 4.2 V and then continued in a constant voltage mode until the charge current dropped to 20 mA. For battery B5, discharge was carried out at a constant current level of 2 A until the battery voltage fell to 2.7 V. Repeated charge and discharge cycles resulted in accelerated aging of the batteries. The experiments were stopped when the batteries reached the end-of-life criterion, which was a 30% fade in rated capacity. In the CALCE data set, the cycling of battery CS2-33 was carried out with the Arbin BT2000 battery testing system (made by Arbin Instruments, College Station, TX, USA) at room temperature. Batteries with a rated capacity of 1.1 Ah were used in the experiment, with a discharge rate of 0.5 C [15,29].

Prediction Precision of RUL Estimation
In this section, the battery data sets are used to verify the performance of the run-time dynamic RC system for RVM, and the results are compared with those from the PC platform. The quantitative prediction results for the B5 and CS2-33 batteries are summarized in Table 2. The prediction starting points are selected at the 80th cycle for the B5 battery and the 278th cycle for the CS2-33 battery, respectively. The failure thresholds for the two batteries are set to 1.38 Ah (about 30% fade) and 0.88 Ah (about 20% fade). In this table, the performances are compared by the predicted RUL (RULp) values. For the prediction part, the speed-up ratios for the battery B5 (relatively small prediction steps) and the battery CS2-33 (relatively large prediction steps) are 1.61 and 2.68, respectively. For the training part, with its high computational complexity, the speed-up ratios for the battery B5 and the battery CS2-33 are 6.35 and 11.99, respectively. As for the overall efficiency, the speed-up ratios for the two batteries are 4.54 and 10.31, respectively. This demonstrates that the proposed RC-based RVM on the FPGA platform has great potential to improve computational efficiency, especially for complex tasks.
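The RULp value follows from the first cycle at which the predicted capacity falls below the failure threshold, counted from the prediction starting point. A minimal sketch of this counting rule, using a hypothetical linear capacity fade rather than the paper's RVM capacity predictions:

```python
def rul_from_capacity(capacity, start_cycle, threshold):
    """Predicted RUL: number of cycles from the prediction starting point
    until capacity first falls below the failure threshold.
    `capacity` is a list of per-cycle capacity values (Ah), cycle 1 first."""
    for cycle, c in enumerate(capacity, start=1):
        if cycle >= start_cycle and c < threshold:
            return cycle - start_cycle
    return None  # threshold not reached within the predicted horizon

# Hypothetical fade curve: 2.0 Ah nominal, losing 1/256 Ah per cycle
# (illustrative numbers only, not the B5 or CS2-33 data)
caps = [2.0 - k / 256 for k in range(1, 201)]
rul = rul_from_capacity(caps, start_cycle=80, threshold=1.38)
```

With a real predictor, `caps` would be the RVM's extrapolated capacity sequence, and the prediction error is the difference between RULp and the true threshold-crossing cycle.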
The FPGA power is about 4.94 W according to the Xilinx Power Estimator, and the CPU power is about 25 W according to its datasheet, so the FPGA consumes 0.68 J and 3.10 J of energy for the B5 and CS2-33 RUL predictions, which is 22.97× and 52.17× less than the CPU, respectively. This further demonstrates that the FPGA is more suitable for embedded and power-aware applications.
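These energy figures are consistent with E = P·t: since both platforms execute the same workload, the CPU-to-FPGA energy ratio equals (P_CPU/P_FPGA) multiplied by the overall speed-up ratio. A quick arithmetic check with the numbers above:

```python
# Energy ratio check: E = P * t, and t_cpu = speedup * t_fpga, so
# E_cpu / E_fpga = (P_cpu / P_fpga) * speedup.
P_fpga, P_cpu = 4.94, 25.0                 # watts (power estimator / datasheet)
for speedup, reported in [(4.54, 22.97), (10.31, 52.17)]:
    ratio = (P_cpu / P_fpga) * speedup
    assert abs(ratio - reported) < 0.05    # matches the reported 22.97x, 52.17x
```

The same identity gives the FPGA run times: t = E/P, i.e., about 0.68/4.94 ≈ 0.14 s for B5 and 3.10/4.94 ≈ 0.63 s for CS2-33.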

Analysis of Hardware Resource Consumptions
To verify the adaptability and applicability of the proposed method, the hardware resources, including the utilization of the static and dynamic partitions throughout the FPGA, are analyzed. The analysis also shows the improvement in resource utilization of dynamic RC over static reconfiguration.
The hardware resources required by the RC RVM approach and the available FPGA resources are shown in Table 4. The utilization ratio of DSP48E and BRAM is larger than that of LUTs (the logic resources) in the proposed method. This matches the characteristics of computing with FPGA IP cores, in which more DSP48E and BRAM resources are consumed. Note that more than 50.00% of the BRAM and DSP48E resources in the FPGA are reserved, which demonstrates that the proposed method has great potential to support more complex computation tasks with limited computing resources. Moreover, a smaller FPGA device, such as the XC5VFX70T, could accommodate the whole design, which better fits the requirements of compact embedded BMS systems.
The improvement in resource utilization by the dynamic RC algorithm can be further illustrated by comparing the resource occupation of dynamic RC and static reconfiguration. The resource consumption of the dynamic area is obtained with Xilinx's PlanAhead design tool, and the results are compared in Table 5. In Table 5, the resource consumption of the static reconfigurable algorithm is the sum of reconfigurable units A and B, without considering the increased connectivity resource consumption; for dynamic RC, the resource consumption is that of the dynamic partition. Table 5 shows that the dynamic resource savings include 9408 LUTs (39.95%), 15 BRAMs (27.27%), and 62 DSP48Es (34.07%). This result demonstrates that the proposed method has higher FPGA resource utilization and is more suitable for computing-resource-constrained applications. The difference in DSP48E between the two reconfigurable units in static reconfiguration is 2, which accounts for 2.17% of the maximum required DSP48E resources (92). This shows that the two reconfigurable units achieve a balanced consumption of DSP48E, and further proves the effectiveness of the proposed multi-objective optimization for reconfigurable task partitioning, in which the balance of DSP48E resources between different partitions is the main constraint.
The static reconfigurable implementation can be realized by combining the logic of reconfigurable unit A and reconfigurable unit B in the DPR version. The training and forecasting processes then consume the same time as in the DPR version, but the two reconfiguration overheads (2 × 16.99 ms) are avoided. From Table 3, the reconfiguration overheads account for about 24.6% and 5.4% of the total consumed time of the B5 and CS2-33 batteries' prognoses, respectively. By avoiding these overheads, the static reconfigurable implementation achieves 1.33× and 1.06× speed-ups for the B5 and CS2-33 batteries, respectively, compared to the corresponding dynamic reconfigurable implementation. Thus, the static reconfigurable implementation slightly improves computational performance, but at the cost of more hardware consumption, as shown in Table 5 (39.95% more LUTs, 27.27% more BRAM, 34.07% more DSP48E). Given our motivation to build a more compact BMS, the dynamic reconfigurable system is the better choice.

Conclusions
In this work, a novel run-time dynamic RC-based RVM prognostic algorithm and its implementation on a reconfigurable FPGA platform are developed for RUL estimation. The experimental results demonstrate that the embedded computing platform can significantly improve computing efficiency and flexibility. The contributions of this work include: a balance of resource utilization and computing efficiency achieved by a multi-objective optimization method for computing task partitioning; multi-level pipelined and parallel computing developed for kernel function calculation to further improve computational efficiency; an improved Cholesky decomposition for matrix inversion that reduces computing resource consumption and decreases computational delay; and a thorough analysis and comparison of different embedded system configurations in terms of accuracy, resource utilization, and computational efficiency. Experimental results on two lithium-ion battery RUL estimation examples show that the FPGA-based run-time dynamic RC improves hardware resource utilization and is more than four times faster than a PC-based approach without sacrificing performance. This work provides a novel solution for the implementation of machine-learning-based prognostic prediction in embedded computing systems. In the future, more algorithms, such as SOC estimation and prognosis under varying operating conditions, will be implemented in the proposed system.

Figure 1 .
Figure 1. BMS based on dynamic reconfigurable computing system.


Figure 3 .
Figure 3. Framework of dynamic RC system based on Virtex-5 FPGA platform.


Figure 4 .
Figure 4. Internal structure of reconfiguration unit A.


Figure 5 .
Figure 5. Internal structure of reconfiguration unit B.



Figure 8 .
Figure 8. The computing circuits for the Gaussian kernel function.


Figure 9 .
Figure 9. The data path for matrix inversion.

Figure 10 .
Figure 10. The run-time RC RVM system with ML510 development board.


Table 5 .
Table 5. The increase of hardware utilization.