Functional Safety BMS Design Methodology for Automotive Lithium-Based Batteries

Abstract: The increasing use of lithium batteries and the necessary integration of battery management systems (BMS) has led international standards to demand functional safety in electromobility applications, with a special focus on electric vehicles. This work covers the complete design of an enhanced automotive BMS with functional safety from the concept phase to verification activities. Firstly, a detailed analysis of the intrinsic hazards of lithium-based batteries is performed. Secondly, a hazard and risk assessment of an automotive lithium-based battery is carried out to address the specific risks deriving from the automotive application and the safety goals to be fulfilled to keep it under control. Safety goals lead to the technical safety requirements for the next hardware design and prototyping of a BMS Slave. Finally, the failure rate of the BMS Slave is assessed to verify the compliance of the developed enhanced BMS Slave with the functional safety Automotive Safety Integrity Level (ASIL) C. This paper contributes the design methodology of a BMS complying with ISO 26262 functional safety standard requirements for automotive lithium-based batteries.


Introduction
The electric vehicle market has grown over recent years, with the electric vehicle stock doubling every two years [1]. This growth has also driven the lithium-based battery market, since lithium has settled as one of the preferred technologies for storing energy in automotive applications. In line with this rise, research on lithium-based batteries has focused on improving their power and energy densities and capability, as well as their reliability and safety, to answer market demand.
Indeed, lithium-based batteries have several failure mechanisms that can take place during their entire life cycle. Accordingly, special provisions shall be implemented during battery pack design, manufacturing, commissioning, operation and decommissioning to ensure safety. Among them, the battery management system (BMS) is the electronic control unit responsible for the continuous monitoring and protection of the battery during operation to avoid any electrical and thermal misuse. The BMS must be reliable and safe, although this has not always been the case [2][3][4]. Consequently, market and international standards have lately demanded enhancements to BMS design to support higher battery safety. Currently, state of the art BMS design must be upgraded to contribute to improving battery safety and to move towards a more trustworthy technology.
Battery management systems are protection systems and, therefore, they shall follow a safety-oriented design. In this framework, a review of the safety requirements and the methods applied in BMS design is proposed, which are supported by functional safety standards together with battery safety standards. This paper contributes a V-shaped design methodology (Figure 1) of an enhanced BMS complying with functional safety requirements for automotive lithium-based batteries.
Among the several design and development methodologies, safety-related systems follow a methodology that prioritises integrity and safety over development time and cost. V-shaped methodology, as well as the waterfall model, is a sequential process that requires the completion of each activity to advance to the next task. However, unlike the waterfall model, V-shaped methodology is an iterative process, which is necessary to ensure that no errors are committed during the design. The V-shaped methodology consists of five stages: the concept phase, specifications, development, validation, and production and operation. Although it is not considered as a stage, safety management activity must be present in all stages, and it is crucial to supervise the product life cycle. Additionally, the development phase is decomposed into hardware and software developments, each with its own V-shaped methodology consisting of specification, development, and verification. At the lowermost tip of the V, the prototype or release is realised. The V-shaped methodology is iterative because if any fault occurs during the verification, validation, or production phases, the project can return to the hardware/software specifications, system level specification or concept phase, respectively, to amend the fault. The objective of the proposed methodology is to cover the existing gap in the design of most advanced BMS, also considering safety in all its aspects.
As a first step of the methodology, Section 2 comprises the concept phase and introduces a safety assessment of lithium-based batteries in automotive battery packs, which leads to the allocation of safety goals to be fulfilled by the BMS. As the second and third steps of the methodology, Section 3 gathers the technical safety requirements derived from these safety goals and the resulting design of the enhanced BMS Slave (as part of the overall BMS). As the fourth and final step of the methodology, Section 4 describes the verification of the enhanced BMS Slave by means of a Failure Modes, Effects and Diagnostics Analysis. Finally, Section 5 summarises the obtained conclusions.

Concept Phase: Hazard and Risk Assessment, and Safety Goals Allocation
As a general definition, safety goals are top level objectives that the BMS must fulfil to ensure the safety of the lithium-based battery under control. They are derived from a hazard analysis and risk assessment of the specific automotive application under study and must be consistent to control the risk down to an acceptable level. The Automotive Safety Integrity Level (ASIL) is the risk classification defined by the ISO 26262 standard (functional safety standard for automotive industry). It is an adaptation of the Safety Integrity Level (SIL) used in IEC 61508 standard (functional safety standard for general applications). This classification helps in defining the previously cited necessary risk reduction. The ASIL is established by looking at the likelihood and the consequences of a hazard in the hazard and risk assessment.
The standard classifies the necessary risk reduction as: ASIL A, ASIL B, ASIL C, ASIL D and QM (Quality Management). ASIL D dictates the highest safety requirements on the function integration, achieving the greatest risk reduction, and ASIL A the lowest. Risks classified as QM must undergo a regular quality management design process.
In the next subsections, a hazard analysis and risk assessment of the lithium-based battery for automotive application is carried out. For this purpose, the application scenario is briefly described and the safety issues concerning lithium-based batteries are considered. This assessment classifies the identified risks and infers the required ASIL. Finally, the safety goals are deduced and allocated.

Hazard Analysis
A lithium-based battery is the main energy source of a battery electric vehicle. It is part of the vehicle traction system and there are several devices connected to it, including the BMS in charge of controlling it. A block diagram of a battery electric vehicle traction system is depicted in Figure 2, which integrates a high voltage (HV) lithium-based battery and a low voltage (LV) lead-acid battery. The BMS, in turn, comprises BMS Slaves (one at each module), a BMS Master and a power monitoring and disconnection unit (PMDU). BMS Slaves gather monitoring data from the cells and transmit them to the BMS Master. In addition, BMS Slaves include cell balancing circuits. Cell imbalances occur if cells in a battery are not homogeneously charged, when their maximum capacities differ from one another, or if their internal or external circuit leakage currents differ. Cell imbalances promote cell overcharge and overdischarge, and also prevent the application from exploiting the full battery capacity. Consequently, cell balancing circuits and balancing algorithms need to be allocated to the BMS to overcome this issue. The BMS Master processes the data sent by the BMS Slaves and controls them to coordinate the measurements and to execute the balancing algorithm. The BMS Master also controls the PMDU, which holds the contactors for battery connection and disconnection to the application as well as fuses and a pre-charge circuit. During battery operation, the BMS connects the battery to the traction system and communicates battery relevant data. In the case of any safety concern, the BMS must interrupt the battery current.
Furthermore, the BMS must include relevant battery parameter estimation algorithms [5], such as the State of Charge (SoC) [6] to determine the remaining capacity in the battery, the State of Health (SoH) [7] to estimate the capacity fade from the beginning of life, and the State of Power (SoP) [8] for an accurate power capability estimation. These advanced estimation algorithms are necessary to operate the battery and contribute to battery safety.
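The balancing decision allocated to the BMS Master and executed by the BMS Slaves can be illustrated with a minimal sketch. It assumes a passive (bleed-resistor) scheme and a voltage-based criterion, neither of which is specified in the text; the function name `cells_to_balance` and the 10 mV threshold are hypothetical.

```python
from typing import List


def cells_to_balance(cell_voltages_mv: List[int], threshold_mv: int = 10) -> List[int]:
    """Return the indices of the cells that should be bled.

    Assumption: a cell is balanced when its voltage exceeds the lowest
    cell voltage in the module by more than threshold_mv (placeholder value).
    """
    if not cell_voltages_mv:
        return []
    lowest = min(cell_voltages_mv)
    return [i for i, v in enumerate(cell_voltages_mv) if v - lowest > threshold_mv]
```

For example, `cells_to_balance([3650, 3662, 3648, 3671])` selects cells 1 and 3, which sit 14 mV and 23 mV above the lowest cell. A production algorithm would typically also consider SoC estimates and the balancing circuit temperature.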
Lithium-based batteries have intrinsic safety concerns. The main parameters that can compromise their safety are temperature, voltage, current, mechanical damage, manufacturing pollution and even the number of serialised cells. These parameters delimit the so-called Safe Operation Area (SOA), whose thresholds vary depending on the battery constituents. When batteries are operated out of the SOA, secondary reactions begin which can quickly degrade cells or even start a fire.
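As an illustration of how the Safe Operation Area can be checked on monitored values, the following sketch flags each exceeded threshold. All limits are placeholders chosen for the example and are not taken from this work or from any cell datasheet; the names `SafeOperationArea` and `soa_violations` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SafeOperationArea:
    # Placeholder limits for illustration only; real values depend on the cell chemistry.
    v_min: float = 2.5    # V, minimum cell voltage
    v_max: float = 4.2    # V, maximum cell voltage
    t_min: float = -20.0  # °C, minimum cell temperature
    t_max: float = 60.0   # °C, maximum cell temperature
    i_max: float = 100.0  # A, maximum absolute current


def soa_violations(soa: SafeOperationArea, voltage: float,
                   temperature: float, current: float) -> List[str]:
    """List every SOA limit that the measured operating point exceeds."""
    violations = []
    if voltage > soa.v_max:
        violations.append("overvoltage")
    if voltage < soa.v_min:
        violations.append("undervoltage")
    if temperature > soa.t_max:
        violations.append("overtemperature")
    if temperature < soa.t_min:
        violations.append("undertemperature")
    if abs(current) > soa.i_max:
        violations.append("overcurrent")
    return violations
```

An empty list means the operating point lies inside the assumed SOA. Mechanical damage and manufacturing contamination cannot be checked this way, since they are not measured quantities.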
The electrical and chemical behaviour of batteries is highly influenced by temperature [9]. When a battery is exposed to temperatures above 60 °C, hazardous secondary reactions begin. When a cell temperature is increased above the maximum temperature and the dissipation rate is enough to cool it down, it can endure the overtemperature event. However, if the temperature keeps increasing, a thermal runaway might be triggered.
A thermal runaway is a process where the heat generated by an exothermic reaction accelerates the reaction rate which, in turn, increases the heat generation rate. The thermal runaway is a temperature positive feedback that collapses the cell [10]. It begins by decomposing the solid electrolyte interphase layer. Then, the anode reacts with the electrolyte, the separator is melted, the electrodes are decomposed, and lastly, the electrolyte is decomposed. During a thermal runaway process, generated gases build up the internal pressure of the cell [11]. Internal pressure can cause the cell to rupture, liberating noxious gases, fire, and deflagrations. When several cells are grouped in a module, a thermal runaway can be propagated by heat transfer [12]. Moreover, a thermal runaway can also generate internal short-circuits in the affected cell; therefore, it can also be propagated to electrically parallel-connected cells. The onset temperature of a thermal runaway decreases at higher cell voltages [13] and when lithium deposits are present in the cell anode [14]. The thermal runaway event ends when the reaction constituents are consumed.
On the other hand, chemical reaction kinetics of lithium-based batteries are reduced at low temperatures [15]. During cell charge, slow lithium intercalation and diffusion in the anode causes lithium plating [16]. When batteries are further misused at lower temperature ends, they can grow lithium deposits in the form of dendrites that eventually penetrate the separator and internally short-circuit the electrodes (except for lithium titanate (LTO) type batteries). Internal short-circuits can greatly increase battery heat generation and lead the affected cells to a thermal runaway. Battery power capabilities are decreased at low temperatures with the lowest temperature limit being the electrolyte freezing temperature, normally below −20 °C, at which the cell cannot be cycled [17].
The largest contributor to heat generation of a lithium-based battery is the current. Additionally, the current is also the second largest contributor to cell ageing [9,18]. An external short-circuit is the most extreme case of battery overcurrent [10]. Moreover, in the case of batteries made of lithium metal anodes, such as Li-O2 and Li-sulfur batteries [19], at high current rates lithium plating deposits are generated in the anode [20].
Regarding voltage, lithium-based batteries are overcharged if their terminal voltage is higher than the cell's maximum voltage. Battery overcharge can either be caused by an excessive charge presence inside the cell, or by an excessive charge current. When a battery is being charged, lithium-ions and electrons are moving from the cathode to the anode. The overcharge occurs when the lithium-ions of the cathode are depleted [21]. The cathode becomes unstable after permanent crystallographic changes caused by the high oxidation potentials, leading the cathode to release oxygen that decomposes the electrolyte. Additionally, excessive intercalation of lithium-ions in the anode can cause lithium plating. During the overcharge process, side reactions release gas and heat, which can promote cell venting and fire [22]. The battery response to overcharges is related to the overcharge voltage and current, as well as the environmental conditions and battery constituents [23], among which the cathode is the most influential constituent. When thermal runaway occurs due to a battery overcharge, the event becomes more hazardous because of the additional energy stored. On the other hand, small overcharges at low charging rates can be overcome without fire, but overcharging the cell slightly and repeatedly will eventually trigger a thermal runaway [24].
In an overdischarge process of a lithium-based battery, graphite anodes are delithiated [25], which decomposes the solid electrolyte interphase layer and releases CO and CO2 gases. Additionally, the copper collector becomes electronically incompatible with high anode potentials causing copper dissolution. The dissolved copper is deposited in the anode surface, growing dendrites that can occasionally short-circuit the cell. When a cell is overdischarged, the internal short-circuits are not very hazardous because of the low energy in the cell. Nevertheless, when the cell is recharged, the internal short-circuit behaves like a low value resistor [26] increasing the temperature of the cell, which can lead to a thermal runaway. If no short-circuit happens during the overdischarge, the cell is still functional and can be recharged. Further discharging the cell or consecutively overdischarging it increases the chances of an internal short-circuit. After the overdischarge, if the cell is recharged, the solid electrolyte interphase may be regenerated [25] but the anode resistance is also increased. Moreover, lithium-ions are consumed in the regeneration of the solid electrolyte interphase layer, permanently reducing the cell capacity.
Mechanical stresses on batteries can also cause internal short-circuits, either due to the penetration of the cell casing [27] or heavy forces applied to the battery in the form of vibrations or impacts [28]. In both cases, the electrodes are electrically connected, short-circuiting the cell and causing a huge energy release in a short time. Additionally, the electrodes are continuously expanding and contracting [29] during battery operation. After several electrical and thermal cycles, the electrodes can be displaced, which can cause them to come into contact and be short-circuited if they are not properly designed and manufactured. Furthermore, during cell manufacturing, any pollutant agent can contaminate the cell [10]. Pollutant agents may be deposited in the electrodes and grow in the form of dendrites, leading to an internal short-circuit.
As a conclusion, "fire, deflagration, and gases" are the intrinsic hazards related to lithium-based batteries. In addition to them, "electric shock" and "vehicle accident caused by loss of functionality" are other potential hazards when applying a high voltage battery in an electric vehicle application.
Most of the misuse events regarding the operation of a battery out of the SOA lead to hazards related to fire, deflagration, and gas emissions. The excessive heating of local components, such as cables or PCBs, can also cause them to ignite. Moreover, the liquid electrolytes typically used in lithium-based cells contain a mixture of organic solvents [29] (e.g., Ethylene Carbonate, Dimethyl Carbonate, Ethyl Methyl Carbonate, etc.) and a lithium salt, such as LiPF6, LiBF4 or LiClO4. If a cell leaks electrolyte, it can react with the ambient moisture [30] and generate hydrofluoric acid (HF), which is highly irritating [31]. Electrolyte gases can also form explosive mixtures when mixed with air.
Serialised cells increase the voltage of the resultant battery. High voltage batteries present electrical risk, and any insulation fault between their live parts and accessible parts can lead to electric shock and arc formation.
Finally, the battery pack fulfils essential functions in an electric vehicle. The loss of any battery pack functionality during a critical scenario, such as driving on a highway at high speeds, can lead to accidents. The essential functions fulfilled by the battery are not limited to power sourcing and storing, but also data communication and coordination with the connected devices and, therefore, they must be ensured.

Risk Assessment
The risk assessment consists of listing the identified hazards and their causes, planning the measures that can be applied to prevent or mitigate the hazards and assessing the risk to identify the necessary risk reduction. The risk is assessed combining three individual parameters as recommended in the ISO 26262 standard: severity (S), exposure (E), and controllability (C). Each parameter has different levels to qualitatively classify its contribution to the risk of the assessed hazard cause. The severity of the hazard is classified in four levels, from S0 (no injuries) to S3 (life-threatening injuries). The exposure to the hazard is classified in five levels, from E0 (incredible) to E4 (very probable). Finally, the controllability of the hazard is also classified in four levels, from C0 (controllable in general) to C3 (hard to control or uncontrollable).
The risk level is inferred from the severity, exposure, and controllability parameters by means of the risk graph matrix in Table 1. The risk graph matrix relates the tolerable risk and the necessary risk reduction, and the outcome is the ASIL applicable to the cause under assessment. Any combination that includes S0, E0 or C0 results in a QM requirement. When a requirement of QM applies, the risk is generally low, but must not be neglected. The design and development of the function to prevent or control the risk cause must, at a minimum, comply with a quality management system such as the one obtained from ISO 9001 or ISO/TS 16949 standard guidelines. The quality management system helps in preventing mistakes and controlling the design processes. As for the ASIL A to D requirements, in addition to the quality management system, ISO 26262 demands the inclusion of extensive analysis, processes and documentation to demonstrate that the designed function has the appropriate measures to reduce the risk to the required extent.
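The lookup performed with the risk graph matrix can be sketched in a few lines. The sketch below uses the commonly quoted shortcut of summing the S, E and C class indices, which is assumed here to reproduce the ISO 26262 risk graph; the risk graph matrix itself remains the reference, and the function name `determine_asil` is hypothetical.

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S (0-3), E (0-4) and C (0-3) class indices to an ASIL label.

    Assumption: the sum shortcut (7 -> A, 8 -> B, 9 -> C, 10 -> D, below 7 -> QM)
    is used instead of a literal copy of the risk graph matrix.
    """
    if not (0 <= severity <= 3 and 0 <= exposure <= 4 and 0 <= controllability <= 3):
        raise ValueError("S must be 0-3, E must be 0-4 and C must be 0-3")
    # Any class 0 parameter means that no ASIL is assigned.
    if 0 in (severity, exposure, controllability):
        return "QM"
    total = severity + exposure + controllability
    return {7: "ASIL A", 8: "ASIL B", 9: "ASIL C", 10: "ASIL D"}.get(total, "QM")
```

As a usage example, `determine_asil(3, 3, 3)` returns `"ASIL C"` and `determine_asil(3, 1, 3)` returns `"ASIL A"`.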

[Table 1: risk graph matrix giving the ASIL for each combination of severity, exposure and controllability.]

Regarding the risk assessment of automotive lithium-based batteries, Table 2 lists all the identified hazards and some of their causes. The main causes are disaggregated down to the causes at the component level, including as much detail as possible to correctly identify the roots of the hazard. All the causes and details can be derived by induction or deduction. The induction approach requires identifying every possible cause of failure and evaluating if it leads to a given hazard. The deduction approach, in turn, is achieved by thinking about how a given hazard can be caused and decomposing the cause into failure modes. Then, preventive and mitigation measures not related with the E/E/PE systems are planned. Preventive measures aim to reduce the hazard exposure, whereas mitigation measures try to reduce the severity when the hazard occurs.

Table 2 (excerpt). Hazard causes and planned non-E/E/PE measures.
Cause: [...]. Measure: Validation of screw, crimped and welded connections.
Cause: Overheating by an external heat source. Detail: Fire at the vicinity of the battery. Measure: Cells tested against thermal runaway propagation.
Cause: Charging at low temperatures. Detail: Dendrite growth after several cold charges. Measure: Battery pre-heat before charging.

After the hazard and risk analysis, Table 3 assesses the risk of each hazard cause. The severity, exposure, and controllability parameters are assigned along with their rationale considering the already planned preventive and mitigation measures. The assignment is qualitative and corresponds to the authors' judgement, along with the given rationale, given that there is no public quantitative data to calculate the necessary risk reduction.

Table 3 (excerpt). Risk assessment of the hazard causes.

Cause: Overcharging.
Severity: S3. It can create violent deflagrations.
Exposure: E3. It is considered likely that the voltage is not properly monitored when the battery is charging or during regenerative braking.
Controllability: C3. The driver cannot avoid the hazard.
ASIL: [...]. Safety goals: [...].

Cause: [...].
Severity: [...].
Exposure: E3. It is considered likely that the temperature is not properly monitored after heavy vehicle accelerations.
Controllability: C3. The driver cannot avoid the hazard.
ASIL: C. Safety goals: SG1.

Cause: Cell-internal short-circuit caused by lithium plating formed by charging after an overdischarge.
Severity: S3. It can lead to cell swelling, and in some cases to fire.
Exposure: E2. It is considered likely that the voltage is not properly monitored after a long trip or a long period without charging.
Controllability: C3. The driver cannot avoid the hazard.
ASIL: B. Safety goals: SG5, SG6.

Cause: Cell-internal short-circuit caused by lithium plating formed by overcurrents.
Severity: S3. It can lead to cell swelling, and in some cases to fire.
Exposure: E1. It is considered that there is low chance that the current is not properly monitored during heavy vehicle accelerations or braking. However, overcurrent has a minor contribution to internal short-circuit formation.
Controllability: C3. The driver cannot avoid the hazard.
ASIL: A. Safety goals: SG7.

Cause: Cell-internal short-circuit caused by lithium plating formed by charging at low temperatures (T < 0 °C).
Severity: S3. It can lead to cell swelling, and in some cases to fire.
Exposure: E2. It is considered likely that the battery is heavily used after start-up in the winter.
Controllability: C3. The driver cannot avoid the hazard.
ASIL: [...]. Safety goals: [...].


Safety Goals Definition
If the allocated non-E/E/PE preventive and mitigation measures are not sufficient, safety goals are planned for the E/E/PE safety-related system, i.e., the BMS in the present case. Safety goals (SG) are presented in Table 4 and allocated to the causes in Table 3. The safety goals are then refined into safety requirements, a safety architecture, and the diagnostics. The extent of the specifications and the diagnostics depends on the assigned ASIL. For example, if the ASIL of a safety goal is underrated, the design might not achieve the true necessary risk reduction, whereas overrating the ASIL could greatly increase development costs.

Development Phase: Functional/Technical Requirements and Design
The overall BMS needs to fulfil the safety goals of Table 4. It shall integrate and communicate with up to 16 enhanced BMS Slaves. As part of a top-level specification, the overall BMS must follow the safety strategy described below to fulfil the established safety goals at all times. If a safety goal is about to be violated, the BMS must lead the battery to the safe state. The safe state is the operational state where hazards cannot happen. In general, the battery is in the safe state when no current flows through it. This can be achieved either by disconnecting the battery from the traction system, or by setting the charge and discharge power of the traction system to zero. However, the safe state should not be activated when the vehicle is at high speed. Unless the situation is critical, i.e., fire is imminent or has been detected, the vehicle must enter an emergency state until it can be safely stopped. For this purpose, three error categories are proposed: tolerable errors, severe errors and fatal errors. Tolerable errors must only be reported to initiate the call for maintenance, severe errors must activate the emergency state, and fatal errors must directly activate the safe state, without going first to the emergency state.
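The three error categories and their reactions can be read as a small state machine. The sketch below is one possible interpretation under stated assumptions: the safe state is latched once reached, the emergency state ends in the safe state as soon as the vehicle is stopped, and the names `ErrorCategory`, `BmsState` and `next_state` are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple


class ErrorCategory(Enum):
    TOLERABLE = 1
    SEVERE = 2
    FATAL = 3


class BmsState(Enum):
    NORMAL = "normal"
    EMERGENCY = "emergency"
    SAFE = "safe"


def next_state(state: BmsState, error: Optional[ErrorCategory],
               vehicle_stopped: bool) -> Tuple[BmsState, bool]:
    """Return the next BMS state and whether maintenance must be requested."""
    if state is BmsState.SAFE:
        # Assumption: the safe state is latched until an external reset.
        return BmsState.SAFE, False
    if error is ErrorCategory.FATAL:
        # Fatal errors go straight to the safe state.
        return BmsState.SAFE, False
    if error is ErrorCategory.SEVERE or state is BmsState.EMERGENCY:
        # Stay in the emergency state until the vehicle can be stopped safely.
        return (BmsState.SAFE if vehicle_stopped else BmsState.EMERGENCY), False
    if error is ErrorCategory.TOLERABLE:
        # Tolerable errors only trigger a maintenance request.
        return BmsState.NORMAL, True
    return BmsState.NORMAL, False
```

For instance, a severe error while driving yields `(BmsState.EMERGENCY, False)`, and the following call with `vehicle_stopped=True` moves the BMS to `BmsState.SAFE`.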
Consequently, the BMS must detect the hazardous event and reach the safe state within the fault tolerant time interval. The fault tolerant time interval of the enhanced BMS is one second. This relatively long time has been chosen considering there is no evidence of thermal runaway caused by short, single abusive events [10]. In parallel, the BMS must scan its resources periodically to find faults that in combination with other faults can violate a safety goal. The considered period for multiple failure diagnosis is either eight hours or every time prior to a vehicle charge.
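The two timing constraints can be expressed as simple checks. The one-second fault tolerant time interval and the eight-hour diagnosis period come from the text; splitting the interval into a detection time and a reaction time, as well as the function names, are assumptions made for this sketch.

```python
FTTI_S = 1.0                       # fault tolerant time interval stated in the text
LATENT_SCAN_PERIOD_S = 8 * 3600.0  # multiple failure diagnosis period stated in the text


def meets_ftti(detection_time_s: float, reaction_time_s: float,
               ftti_s: float = FTTI_S) -> bool:
    """Check that fault detection plus the transition to the safe state fits in the FTTI."""
    return detection_time_s + reaction_time_s <= ftti_s


def latent_scan_due(seconds_since_last_scan: float, charge_requested: bool,
                    period_s: float = LATENT_SCAN_PERIOD_S) -> bool:
    """A scan is due before every charge or once the diagnosis period has elapsed."""
    return charge_requested or seconds_since_last_scan >= period_s
```

With these assumptions, a fault detected in 0.4 s followed by a contactor opening that takes 0.3 s satisfies `meets_ftti(0.4, 0.3)`, whereas 0.8 s plus 0.3 s does not.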
Regarding the safety architecture, Figure 3 shows the simplified architecture defined for the overall BMS. As depicted, the BMS Slaves are supplied directly from the cells, whereas the BMS Master is supplied from the 12/24 V low voltage (LV) battery. The BMS Master performs the current sensing, measures the total battery voltage, monitors the insulation impedance of the high voltage circuit, takes part in the high voltage interlock circuit (HVIL), controls redundant power switches for battery isolation and protection and, finally, communicates with external electronic control units (ECUs). Advanced estimation algorithms, such as SoC, SoH and SoP [6][7][8], or internal short-circuit detection algorithms [32], are highly desirable diagnostics for battery misuse prevention as long as they are accurately tuned [33] for the cell in use. Consequently, they should be allocated to either the BMS Master or an external ECU, such as an energy management system (EMS), or shared between both. If advanced estimation algorithms are integrated in the BMS Master, special attention should be given to the software failure modes (e.g., soft or hard errors, out-of-bounds conditions, and stack overflow), and their integration in a separate non-safety-critical processor should be considered.
The enhanced BMS Slave safety architecture is presented in Figure 4. It is composed of four main circuits: cell measurement and balancing [34], temperature measurement, communications, and the Application Specific Integrated Circuit (ASIC). The ASIC is a state-of-the-art, off-the-shelf battery management IC designed for functional safety-oriented applications. All the interfaces include high frequency (HF) filters to comply with EMC standards. The cell measurement circuit includes an on-board thermistor to monitor the temperature of the cell balancing circuit. The temperature measurement circuit comprises NTC or PTC type sensors, HF and LF filters, and conditioning circuits that convert the resistor value into a voltage signal. The temperature measurement is configurable: single-ended redundant or differential modes can be selected for each pair of temperature inputs. If the environmental noise does not allow safe operation, a combination of both modes, i.e., single-ended redundant and differential, can be used to ensure precision as well as safety. The communication circuit provides galvanic isolation and includes termination resistors to match the characteristic impedance. Finally, the ASIC must have the recommended power supply and peripheral components for the hardware configuration and its operation. All the elements in the BMS Slave are safety-relevant to comply with the safety goals.

The safety requirements are aligned to the architecture in Figure 4 and describe the detailed functionalities and the technical aspects needed to realise a feasible safety concept of the element under development. The most relevant technical safety requirements of the enhanced BMS Slave are collected in Table 5.

Table 5. Technical safety requirements of the enhanced BMS Slave.

| ASIL | Description |
|------|-------------|
| C | The number of cells must be configurable and must be in the range [6,18]. |
| C | There must be a total of 8 temperature sensor channels in a single-ended redundant topology, or 4 temperature sensor channels in differential topology. |
| QM | The temperature measurements must have a nominal accuracy of ±0.2 °C or better in the range [15,30] °C. |
| C | The voltage measurement circuit must be protected against hot plugs and shortages. |
| C | The temperature sensors must be NTC or PTC type. |
| C | Single component faults in the cell measurement interface must not cause a hazard. |
| C | The temperature measurements must be selectable between single-ended redundant or differential modes. |
| QM | The cell balancing circuit is a controlled dissipative type. |
| C | The redundant sensor must be of the same type but from a different manufacturer. |
| QM | The cell balancing circuits must be able to handle balancing currents up to 150 mA. |
| C | The communication speed must be at least 1 Mbps. |
| C | The power supply must work in a range of [16,90] V. |
| C | The communication must be differential, isolated, and reversible. |
| C | The power supply must withstand hot plugs and shortages. |
| C | The BMS Slave must go to sleep mode before FTTI/MPFTI when single/latent faults are detected. |
| C | Single component short-circuits in the power supply must not cause a hazard. |
| QM | Comply with standards ISO 6469 and IEC 60664 regarding electrical safety. |
| C | The power must be sourced by independent wires to avoid IR interferences in measurement wires. |
| QM | Comply with UNECE R10 directive and OEM specific guidelines regarding electromagnetic compatibility. |
| C | The configuration hardware and parameters must be checked prior to use. |
| QM | Components must be compliant with RoHS and AEC-Q series standards. |
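The requirement that the configuration parameters be checked prior to use can be illustrated with a minimal Python sketch. The numeric limits are those listed in Table 5; the class, field and function names are hypothetical, and the mixed temperature mode (single-ended redundant combined with differential) is deliberately left out for brevity.

```python
from dataclasses import dataclass

# Limits taken from Table 5; all identifiers below are illustrative assumptions.
MIN_CELLS, MAX_CELLS = 6, 18
MIN_SUPPLY_V, MAX_SUPPLY_V = 16.0, 90.0
MIN_COMM_SPEED_BPS = 1_000_000
CHANNELS_BY_MODE = {"single_ended_redundant": 8, "differential": 4}


@dataclass(frozen=True)
class SlaveConfig:
    n_cells: int
    temp_mode: str
    n_temp_channels: int
    supply_voltage_v: float
    comm_speed_bps: int


def check_config(cfg: SlaveConfig) -> list[str]:
    """Return the list of violated Table 5 limits (empty if none)."""
    errors = []
    if not MIN_CELLS <= cfg.n_cells <= MAX_CELLS:
        errors.append("cell count outside [6, 18]")
    expected = CHANNELS_BY_MODE.get(cfg.temp_mode)
    if expected is None:
        errors.append("unknown temperature measurement mode")
    elif cfg.n_temp_channels != expected:
        errors.append(f"{cfg.temp_mode} mode requires {expected} channels")
    if not MIN_SUPPLY_V <= cfg.supply_voltage_v <= MAX_SUPPLY_V:
        errors.append("supply voltage outside [16, 90] V")
    if cfg.comm_speed_bps < MIN_COMM_SPEED_BPS:
        errors.append("communication speed below 1 Mbps")
    return errors
```

A configuration would only be accepted when `check_config` returns an empty list; any other result would keep the Slave from starting its measurement cycle.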
Finally, the applicable safety analyses and measures are defined. To this end, safety mechanisms are derived from a Failure Modes and Effects Analysis (FMEA). Safety mechanisms are elements or functions intended to prevent or detect failures in the hardware or the software of the element under design. An FMEA is an inductive safety analysis that identifies and describes the failures that can occur in the system. For each failure, it must then be argued how it is avoided or detected by the defined safety mechanisms. Table 6 presents the most relevant entries of the FMEA carried out for the enhanced BMS Slave according to the technical safety architecture and technical safety requirements.
Consequently, Table 7 summarises the most relevant safety mechanisms employed. State-of-the-art ASICs also include several internal safety mechanisms that enable their integration in functional safety-oriented applications.

Table 6. Failure Modes and Effects Analysis (FMEA) of the enhanced BMS Slave.

| Failure Mode | Failure Effect | Failure Causes | Coverage Rationale |
|---|---|---|---|
| The temperature sensor is not properly connected. | The module temperature cannot be measured. | Connector broken or loose. Broken wire. | The open circuit can be detected either by an open-wire detection algorithm or by detecting a false over/under temperature. |
| Two adjacent temperature measurement pins are shorted. | False reading of the temperature. | Soldering defect or mechanical damage. | Any short-circuit between adjacent pins can be detected with redundant measurements or by using some ports as analog inputs and adjacent ports as digital outputs. |
| The cell is not properly connected. | The cell voltage cannot be correctly measured. | Connector broken or loose. Cell incorrectly welded or mechanically damaged. | The open circuit is detected with an open-wire detection algorithm. |
| There is a drift in the measurement circuit causing the cell voltage to be over or underestimated. | Cell over/undervoltage can go undetected. | Wire or component resistance changes caused by overheating or ageing. EMI. | The ADC reference voltage is regularly checked to detect deviations in the voltage measurement accuracy. The ADCs are regularly calibrated. The comparison with the independent module voltage measurement supports the detection of heavy deviations in one or two ADCs. |
| Leakage currents in any low frequency filter introduce a drift in the measurement. | Overvoltage is promoted by cell balancing and cannot be correctly detected. | Ageing effects. Manufacturing defect. | A safety mechanism is established to prevent leakage currents. Additionally, the sum of the individual cell voltages must be very close to the independent module voltage. |
| Two adjacent cell voltage measurement pins are shorted. | False reading of the voltage. | | The ASIC is hardware protected, and the fault is detected because one of the cell measurements returns 0 V. |
| A balancing circuit is permanently activated or cannot be deactivated. | Cell overdischarge is unavoidable. | Soldering or mechanical defect. Electrical or chemical damage. Ageing effects. EMI. | The balancing circuit is diagnosed continuously to detect open-circuits or short-circuits in the balancing components. |
| Communication message is corrupted or lost. | The ASIC executes an incorrect command. | EMI. Loose connector. | The BMS Master and the ASIC check the CRC of every message. The BMS Master verifies that the ASIC is correctly checking the CRC by sending an incorrect CRC. |
| The ASIC does not wake up or cannot be powered. | Commands are not executed, and cell parameters cannot be retrieved. | | If the ASIC does not wake up, the communications will fail. This will be detected by the BMS Master by means of the CRC. |
| Failure of a component of the power supply. | Deterioration of the board, heavy overcurrents, and overdischarge of a module. | Manufacturing defect. Electrical, mechanical or chemical damage. | A fuse protects against short circuits. If the ASIC is unpowered, the BMS Master will detect any power shortage with the CRC. The power supply is low-pass filtered to prevent repeated fast disconnections from powering off the ASIC. The ASIC supply is regularly measured to detect leakages. |
| The ASIC memory gets corrupted (soft and hard errors). | Incorrect data is retrieved. | EMI. Cosmic rays. | The registers are cleared before every measurement to detect soft errors. The measurement is repeated several times to avoid memory corruption between measurements. Hard errors can be detected by comparison to module voltage and redundant temperature measurements. Register checks are run to verify that the registers can be written correctly. |
| The ASIC or balancing circuit gets overheated. | ASIC can start to malfunction. Shorted balancing circuit components. | Environmental temperature too elevated. | The ASIC internal die temperature is monitored. The balancing circuit is monitored with an in-circuit thermistor. |

Table 7. Safety mechanisms of the enhanced BMS Slave.

| Safety Mechanisms |
|---|
| Check that the ASIC internal voltage reference is in a valid range at least once every second. |
| Check that cell measurements are in the valid voltage range at least once every second. |
| Initiate the ASIC internal measurement circuit calibration and diagnosis at least once every second. |
| Send commands and data with an incorrect CRC at least once every second. |
| Check that the power supply voltage is in a valid range at least once every second. |
| Check that the temperature measurements are in the valid temperature range at least once every second. |
| Prevent or detect voltage measurement errors from component leakage by substituting or doubling leaking components. |
| Wrong or corrupted communication messages must be detected by means of a CRC. |
| Before executing any measurement, clear the registers of the ASIC. |
| Check that the registers dedicated to the measurement values can be written using a predefined pattern. |
| Use the ASIC internal open-wire detection circuit to detect open-circuits in the voltage and temperature measurement circuits at least once every 8 h or before charging. |
| Confirm that die temperature and balancing circuit temperatures are in the valid temperature range at least once every second. |
| Verify that there is no short-circuit between the top-most cell and the power supply. |
| Measure the module voltage and compare it with the sum of voltage measurements at least once every second. |
| Check that the open-wire detection circuit is not stuck by comparing the measured voltages before and after activating the circuit. |
| Measure the cell voltage difference before and after the balancing circuit has been activated at least once every second. |
| Follow a sequence of self-test by clearing registers and then reading the registers to verify that they can be written every 8 h or before charging. |
| When differential measurements are used, check that there is no short-circuit to an adjacent pin by using the adjacent pin as digital output at least once every second. |
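Two of the periodic mechanisms in Table 7, the cell voltage range check and the comparison between the module voltage and the sum of cell voltages, can be sketched in a few lines of Python. This is only an illustration: the voltage window (2.5 V to 4.2 V), the 50 mV tolerance and all function names are assumptions chosen for the example, not values from the design.

```python
def cells_in_valid_range(cell_voltages_v, v_min=2.5, v_max=4.2):
    """True if every cell reading lies inside the assumed valid window."""
    return all(v_min <= v <= v_max for v in cell_voltages_v)


def module_voltage_plausible(cell_voltages_v, module_voltage_v, tolerance_v=0.05):
    """True if the cell-voltage sum agrees with the independent module reading."""
    return abs(sum(cell_voltages_v) - module_voltage_v) <= tolerance_v


def periodic_voltage_check(cell_voltages_v, module_voltage_v):
    """Return the names of the checks that failed in this cycle."""
    failed = []
    if not cells_in_valid_range(cell_voltages_v):
        failed.append("cell_voltage_range")
    if not module_voltage_plausible(cell_voltages_v, module_voltage_v):
        failed.append("module_voltage_sum")
    return failed
```

In this sketch a drifting ADC channel shows up as a `module_voltage_sum` failure even while each individual reading still looks valid, which is the case the independent module measurement is meant to cover.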
As a result of the described requirements, architecture and safety mechanisms, the design of the enhanced BMS Slave was carried out. A prototype of the enhanced BMS Slave is presented in Figure 5.

Figure 5. Developed enhanced battery management system slave prototype complying with functional safety ASIL C requirements.

Verification by Failure Modes, Effects and Diagnostics Analysis
The verification process of a safety-related system does not only consist of testing activities [35]. In this work, the verification by means of a Failure Modes, Effects and Diagnostics Analysis (FMEDA) is described. The FMEDA is an inductive safety analysis which consists of analysing the failure modes of every hardware component and checking the suitability of the design according to random failures. This safety analysis is carried out to highlight the circuit vulnerabilities and calculate the random hardware failure rate. The procedure followed to carry out the FMEDA is depicted in Figure 6. The procedure begins by listing all the components in the design and the planned safety mechanisms. The failure modes, failure rates and failure rate distributions are derived for every component. Furthermore, the diagnostic coverage of the safety mechanisms is also assessed. Finally, the effects of every component failure mode are analysed. The following paragraphs describe the details of the followed procedure for completing the FMEDA.
Random hardware failure rates are obtained either from reliability data sources [35][36][37] or measured by testing. However, measuring the failure rates requires accelerated testing of a substantial number of components [38]. Because such measurements are very costly for component manufacturers, failure rates are mostly taken from the reliability data sources. Failure rates are expressed in Failures in Time (FIT), which represent the number of failures expected in 10^9 h. The failure rate of every component depends heavily on the operating temperature and the load profile, so the mission profile must be specified first. Higher temperatures penalise the failure rate of components. For this analysis, an operating temperature of 40 °C was considered, aiming to cover a very demanding scenario without overestimating the failure rate.
The failure modes of safety-related components must be analysed to identify whether they would violate a safety goal. Hardware failure modes can be found in standards [35][39][40][41] or can be estimated through analysis. The former is the recommended route, but the latter may be necessary for most integrated circuits and other non-standard components. For this analysis, the FMD-91 and EN 50129 standards were used, which detail the failure modes and failure rate distribution of the most common components. Although these standards are not usually applied in automotive applications, they were used because they provide a very accurate failure mode disaggregation.
The diagnostic coverage expresses the effectiveness of the diagnostics: it is the percentage of the failure modes of a given component, or set of components, that can be detected, relative to all its failure modes. Safety mechanisms reduce the failure rate of the detected dangerous failures in proportion to their diagnostic coverage. A rationale must be given for the diagnostic coverage selected for every safety mechanism.

The next step is to analyse every failure mode of each component. It must be assessed whether the failure mode is dangerous, and whether it may violate the safety goal by itself (single-point faults) or in combination with other component faults (multiple-point faults). For this purpose, the failure mode effects must be described. Where a safety mechanism detects the failure mode, a rationale for the detection must also be provided, and the corresponding diagnostic coverage must be considered in the failure rate calculations. After running through the procedure, the failure rate of each failure mode is disaggregated according to this classification.

The dangerous failure rate is controlled by establishing a maximum failure rate and fault metrics. The fault metrics indicate the percentage of the single-point and multiple-point dangerous failure rates over the total failure rate.
The single-point fault metric (SPFM) is described by (1), and it is an indicator of the suitability of the diagnostics and measures included to detect or prevent direct dangerous faults. On the other hand, the latent fault metric (LFM) is calculated with (2), and similarly, represents the extent of the avoided dangerous faults when one or more components have already failed. According to the ISO 26262 standard, the target SPFM for the ASIL C safety goals is 97%, whereas the target LFM is 80%.
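As an illustration of how the two metrics follow from the FMEDA rows, the sketch below uses the hardware architectural metric definitions of ISO 26262-5, assumed here to correspond to (1) and (2): SPFM = 1 - sum(λ_SPF + λ_RF) / sum(λ) and LFM = 1 - sum(λ_MPF,latent) / sum(λ - λ_SPF - λ_RF), with the sums taken over the safety-related hardware elements. The component names and all failure rates are invented for the example, and `residual_rate` is a hypothetical helper that applies the diagnostic coverage to a dangerous failure rate.

```python
from dataclasses import dataclass

SPFM_TARGET_ASIL_C = 0.97
LFM_TARGET_ASIL_C = 0.80


@dataclass(frozen=True)
class FmedaRow:
    name: str
    rate_fit: float        # total failure rate of the element, in FIT
    spf_rf_fit: float      # single-point plus residual fault rate, in FIT
    mpf_latent_fit: float  # latent multiple-point fault rate, in FIT


def residual_rate(dangerous_rate_fit: float, diagnostic_coverage: float) -> float:
    """Dangerous failure rate left undetected by a safety mechanism."""
    return dangerous_rate_fit * (1.0 - diagnostic_coverage)


def fault_metrics(rows):
    """Return (SPFM, LFM) following the ISO 26262-5 definitions."""
    total = sum(r.rate_fit for r in rows)
    spf_rf = sum(r.spf_rf_fit for r in rows)
    latent = sum(r.mpf_latent_fit for r in rows)
    spfm = 1.0 - spf_rf / total
    lfm = 1.0 - latent / (total - spf_rf)
    return spfm, lfm


# Hypothetical rows: the values are placeholders, not results from Table 8.
example_rows = [
    FmedaRow("filter resistor", 10.0, residual_rate(4.0, 0.99), 0.5),
    FmedaRow("filter capacitor", 20.0, residual_rate(10.0, 0.99), 1.0),
]
```

Calling `fault_metrics(example_rows)` and comparing the result against `SPFM_TARGET_ASIL_C` and `LFM_TARGET_ASIL_C` reproduces, for these placeholder numbers, the kind of pass/fail decision made for each safety goal.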
The probabilistic metric of hardware failures (PMHF) is an estimation of the average probability of failure per hour of the components that fulfil a given safety goal. The PMHF is calculated according to (3), where T_lifetime is the overall application lifetime in hours. The PMHF must be evaluated for the full safety goal and must be below the threshold given by the ISO 26262 standard. The target PMHF of the enhanced BMS Slaves has been established as 65% of the overall PMHF required for an ASIL C safety goal (100 FIT), i.e., 65 FIT. This fraction of the overall PMHF must, in turn, be shared among all the serialised enhanced BMS Slaves. As a total of 16 serialised enhanced BMS Slaves are considered, each enhanced BMS Slave shall have a PMHF under 4 FIT. This PMHF objective is arbitrary, but it has been chosen so that the PMHF remaining for the BMS Master and PMDU is also achievable when completing each safety goal.
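The budget arithmetic can be checked with a short Python sketch. The 100 FIT target, the 65% share and the 16 Slaves come from the text; the function names are illustrative, and the 3.94 FIT used in the usage note is the per-Slave result reported in the Conclusions.

```python
ASIL_C_PMHF_FIT = 100.0   # overall ASIL C target quoted in the text
SLAVE_SHARE = 0.65        # fraction allocated to the BMS Slaves
N_SLAVES = 16             # serialised enhanced BMS Slaves
FIT_TO_PER_HOUR = 1e-9    # 1 FIT = 1 failure per 10^9 h


def per_slave_budget_fit() -> float:
    """PMHF allocation for a single Slave, in FIT."""
    return ASIL_C_PMHF_FIT * SLAVE_SHARE / N_SLAVES


def slave_chain_margin(slave_pmhf_fit: float):
    """Return (PMHF of all Slaves in FIT, relative margin to their allocation)."""
    allocated = ASIL_C_PMHF_FIT * SLAVE_SHARE
    total = slave_pmhf_fit * N_SLAVES
    return total, (allocated - total) / allocated
```

`per_slave_budget_fit()` gives 4.0625 FIT, which the design rounds down to the 4 FIT objective. With the reported 3.94 FIT per Slave, `slave_chain_margin(3.94)` yields 63.04 FIT for the 16 Slaves and a margin of about 3% with respect to the 65 FIT allocation.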
The obtained failure rates, fault metrics and PMHF of the voltage and temperature monitoring are presented in Table 8. Although the FMEDA can be completed for each individual safety goal, a conservative approach has been taken: since most sub-circuits are common, the presented results encompass all the safety goals applicable to the BMS Slave. According to the obtained results, the enhanced BMS Slave is suitable for accomplishing the ASIL C safety goals.

Conclusions
This paper contributes an ISO 26262-compliant, safety-oriented design and verification methodology for battery management systems (BMS). The safety concerns of lithium-based batteries were analysed to show the short and long-term hazards of using batteries in automotive applications.
A hazard and risk assessment has shown that an automotive BMS for traction batteries should satisfy at least an ASIL C rating in order to achieve the necessary risk reduction for safe battery operation. A safety architecture was realised for the overall system which considers a three-subsystem topology, composed of 16 BMS Slaves to monitor high voltage batteries (up to 1000 V), a BMS Master and a power monitoring and disconnection unit (PMDU). The most relevant requirements, preventive and mitigation measures, and safety mechanisms needed to comply with ASIL C requirements for enhanced BMS Slaves were presented.
Finally, the presented BMS Slave was verified by assessing its failure rate and fault metrics by means of a Failure Modes, Effects and Diagnostics Analysis (FMEDA). A maximum probabilistic metric of hardware failures (PMHF) of 65 Failures in Time (FIT) was allocated to the 16 BMS Slaves, whilst a maximum of 35 FIT was allocated to the BMS Master and PMDU to fulfil the ASIL C safety goals. Additionally, the single-point fault metric (SPFM) and the latent fault metric (LFM) must be above 97% and 80%, respectively. The FMEDA methodology was described and applied to the enhanced BMS Slave, resulting in a PMHF of 3.94 FIT for a single BMS Slave, an SPFM of 99.1% and an LFM of 88%. Consequently, the 16 BMS Slaves have a PMHF of 63.04 FIT, which is about 3% below the 65 FIT threshold, demonstrating the ASIL C capability and the suitability of the design based on the presented methodology.