1. Introduction
Lithium-ion batteries have emerged as the core energy storage component for electric vehicles and large-scale energy storage systems owing to advantages such as high energy density, cost-effectiveness, and long cycle life [1]. To meet the capacity requirements of megawatt-level (MW-level) battery energy storage systems (BESS), hundreds to thousands of individual cells are typically connected in series and parallel configurations to form battery modules, which are further integrated into battery clusters. However, under long-term operation, the inherent issue of cell-to-cell inconsistency becomes increasingly prominent [2,3]. This inconsistency not only significantly degrades the overall performance and cycle life of the system but may also trigger cascading failures such as thermal runaway [4,5], posing serious challenges to the safety and reliability of BESS [6]. Frequent recent safety incidents highlight the inadequacy of existing safety monitoring and early warning mechanisms [7]. Effective early warning must not only detect a fault but also accurately identify its location so that targeted containment can be carried out. Consequently, developing efficient and accurate safety early warning technologies for energy storage systems, with particular emphasis on accurate localization methods, has become a key research focus in this field.
Existing studies have shown that the thermal runaway process of lithium-ion batteries involves the evolution of multiple physical signals such as electrical, thermal, acoustic, and gaseous ones, providing multidimensional monitoring means for early fault warning [8]. Among these, acoustic signals have garnered significant attention due to their potential for remote sensing and localization [9,10,11,12]. Abusive conditions such as overcharging, over-discharging, or overheating can induce internal short circuits in batteries, leading to a rapid temperature rise and heat generation from side reactions [13]. When the rate of internal heat accumulation far exceeds the heat dissipation capability, thermal decomposition of the electrolyte occurs, producing a large amount of gas. Once the internal pressure exceeds the limit, the battery casing ruptures and releases the gas, generating acoustic signals [14]. In energy storage power stations, multiple battery cells are often enclosed within sealed battery packs, which are also equipped with pressure relief valves [15]. These valves open when thermal runaway occurs inside the pack, producing distinctive acoustic signals. Although detecting the valve opening indicates the existence of a fault event, it is important to determine the specific three-dimensional location of the exhaust source to identify the faulty battery pack. If the fault is not suppressed in time, the released high-temperature flammable gas can easily ignite adjacent battery packs, triggering a chain reaction and ultimately threatening the safety of the entire energy storage system. Conversely, if the opening of the pressure relief valve can be detected promptly, pack-level fire suppression measures, such as aerosol flooding, can be immediately activated to suppress combustion and prevent the propagation of thermal runaway.
In response to this need, researchers have proposed a novel acoustic signal-based method for battery fault warning and localization [16]. This method requires an acoustic sensor array deployed only at the four corners of the energy storage container. By capturing and analyzing the venting acoustic signature generated during a single-cell fault, the method enables three-dimensional spatial localization of the fault source.
Sound source localization technology has developed over a long period, forming a relatively comprehensive theoretical framework. Traditional methods primarily include beamforming based on Steered Response Power (SRP), subspace methods based on high-resolution spectral estimation, and localization algorithms based on Time Difference of Arrival (TDOA) [17,18,19]. Due to the varying distances between each microphone in an array and the sound source, the sound signals arrive at each microphone at slightly different times. TDOA-based localization algorithms estimate the coordinates of the sound source by calculating the time delay differences of the signals received by the microphone array, in conjunction with the speed of sound propagation in air and the known geometric configuration of the array [20].
A TDOA-based localization algorithm primarily involves two steps. The first step is time delay estimation. Due to the varying distances between the sound source and each microphone in the array, the signals received by the microphones exhibit temporal differences. The core of time delay estimation is to calculate the signal arrival time differences between different pairs of microphones within the array using specialized algorithms [21]. In the second step, the time delay differences calculated in the first step are combined with the known geometric configuration of the microphone array to solve mathematically for the coordinates of the sound source. Commonly used position estimation algorithms can be broadly categorized into geometric analytical methods and spatial search methods based on objective function optimization [22].
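The first step, time delay estimation, can be illustrated with a minimal sketch. It assumes the PHAT-weighted variant of generalized cross-correlation, a 100 kHz sampling rate, and a synthetic noise burst as the test signal; none of these are stated in the text, and the function name `gcc_phat` is hypothetical.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (seconds) with GCC-PHAT.

    A positive result means `sig` arrives later than `ref`.
    """
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)           # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs

# Synthetic check: the second channel is the first one delayed by 25 samples.
rng = np.random.default_rng(0)
fs = 100_000                                # assumed sampling rate (Hz)
ref = rng.standard_normal(4096)
delay_samples = 25
sig = np.concatenate((np.zeros(delay_samples), ref))[:4096]
tau = gcc_phat(sig, ref, fs)
print(f"estimated delay: {tau * 1e6:.0f} microseconds")
```

In a reverberant cabin the correlation peak may correspond to a reflection rather than the direct path, which is the kind of repeatable delay error discussed later in the text.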
Traditional position estimation algorithms, such as the geometric localization method, calculate the three-dimensional coordinates of the sound source by solving a system of equations derived from the three sets of time delay differences among four microphones. The localization accuracy of this method is directly dependent on the measurement precision of these three time delays. Under ideal, interference-free conditions, achieving high time delay measurement accuracy is relatively straightforward. However, in the battery cabin environment of an energy storage power station, challenges such as strong power frequency noise, enclosed space, severe reflections and reverberation, and the broad frequency distribution of venting valve acoustic signatures across 0–50 kHz all contribute to difficulties in accurately determining time delays. When time delay measurements contain errors, traditional geometric localization methods can produce unacceptably large errors—even misidentifying an internal event as an external anomaly—thus failing to meet the reliability requirements for safety monitoring.
Fortunately, extensive repeated experimental results indicate that the time delay measurement errors induced by the environment are not entirely random but rather exhibit repeatable patterns. This is attributed to the fixed internal structure of the cabin, where reflections and reverberations of venting acoustic waves from different locations maintain time-invariant characteristics. This provides an opportunity for data-driven approaches. Unlike TDOA-based methods, which rely on precise analytical models, neural networks, as data-driven modeling tools, can directly learn an end-to-end mapping relationship from input signals to the sound source location. Their key feature lies in their ability to implicitly learn and compensate for both array geometry and environmental acoustic characteristics solely by utilizing training datasets with position labels [23]. In other words, as long as the time delay error remains consistent, neural network algorithms can establish accurate positional mappings through extensive data ingestion, thereby circumventing the dependency on highly precise time delay signals required by traditional geometric localization methods.
Moreover, practical engineering requirements also determine the method selection. The core task of safety monitoring in energy storage systems is not global continuous localization, but rather the precise identification of a limited number of discrete risk points. Specifically, a standard 2.5 MWh energy storage battery cabin is typically arranged in an array configuration of 2 stacks, each comprising 7 clusters, with each cluster containing 8 battery packs. Each battery pack is equipped with only one pressure relief valve as the key monitoring point, resulting in a total of 112 discrete monitoring points within the container, as illustrated in Figure 1. Thus, the localization requirement essentially involves accurately mapping the sound source to a specific battery pack, rather than pursuing centimeter-level continuous coordinate accuracy. In such a classification-type task, the advantage of TDOA-based methods in achieving global continuous localization cannot be fully leveraged, whereas the classification capability of Back Propagation (BP) neural networks aligns well with this demand.
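The idea of mapping a coordinate estimate onto one of the 112 discrete monitoring points can be sketched as a nearest-neighbour lookup. The valve spacings below are placeholders chosen only for illustration, and the helper names `valve_grid` and `nearest_pack` are hypothetical; the actual valve coordinates would come from the cabin layout.

```python
import numpy as np

N_STACKS, N_CLUSTERS, N_PACKS = 2, 7, 8     # from the text: 2 x 7 x 8 = 112

def valve_grid(dx=0.6, dy=1.5, dz=0.3):
    """Return hypothetical valve coordinates (m) and their (stack, cluster, pack) ids."""
    s, c, p = np.meshgrid(np.arange(N_STACKS), np.arange(N_CLUSTERS),
                          np.arange(N_PACKS), indexing="ij")
    xyz = np.stack((c * dx, s * dy, p * dz), axis=-1).reshape(-1, 3).astype(float)
    ids = np.stack((s, c, p), axis=-1).reshape(-1, 3)
    return xyz, ids

def nearest_pack(pred_xyz, xyz, ids):
    """Snap a predicted 3D coordinate to the closest valve position."""
    dist = np.linalg.norm(xyz - np.asarray(pred_xyz, dtype=float), axis=1)
    k = int(np.argmin(dist))
    return tuple(int(v) for v in ids[k]), float(dist[k])

xyz, ids = valve_grid()
pack, miss = nearest_pack([1.25, 1.4, 0.95], xyz, ids)
print(len(xyz), "monitoring points; nearest pack (stack, cluster, pack):", pack)
```

Under this view, a coordinate error matters only when it is large enough to move the estimate closer to a neighbouring valve than to the true one.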
On this basis, this study proposes a BP neural network-based method for venting acoustic source localization. It employs a robust time delay estimation algorithm to extract TDOA as the core feature and uses a BP neural network to perform the final sound source position calculation. The method aims to combine the physical interpretability of signal processing with the strong nonlinear fitting capability of neural networks. Through training, the BP network learns to tolerate the inherent, repeatable errors within the TDOA features. As long as the error pattern remains consistent, the network can still map the features to the correct battery pack location. This makes the overall system insensitive to time delay calculation errors, with the goal of achieving more robust and easier-to-deploy thermal runaway acoustic source localization in complex environments.
4. Sound Source Localization Based on BP Neural Network
In this study, the input consists of three TDOA values obtained from four microphones through generalized cross-correlation, and the output is the three-dimensional coordinates of the sound source. Both the input and output dimensions are low, and the dataset is small. For a training task with such limited data, deep neural networks such as CNNs and LSTMs involve a large amount of computation and are prone to overfitting. A conventional shallow BP neural network reduces computation and run time while maintaining accuracy, which makes it better suited to the practical problem addressed in this paper and helps meet the real-time and low-cost requirements that the fire protection system places on localization.
4.1. BP Neural Network Model
Before training the BP neural network-based sound source localization model, the network architecture, such as the number of hidden layers and the number of nodes per layer, must be determined. An insufficient number of hidden layers or nodes may lead to inadequate mapping capability, while an excessive number can result in increased computational time and overfitting. The selection of these model parameters is determined through training with simulated sound sources. The reverberation time and background noise level of the simulation experiment are calibrated based on the values of the actual battery compartment to ensure the reliability of the simulation. As shown in Figure 8b, a rectangular prism space measuring 10 m × 10 m × 5 m is established. Four microphones are simulated at spatial coordinates A(1, 1, 1), B(9, 1, 1), C(1, 9, 1), and D(1, 1, 4). The total volume of this space is 500 m³. This volume is subdivided into small 1 m³ cubes, and a sound source is simulated at the center point of each small cube. The four sets of sound signals recorded by the microphones are used to calculate three sets of time differences, as given by Formula (9).
where Δt represents the time difference between two microphones, and t denotes the time at which the sound arrives at a microphone.
where $x_i$, $y_i$, and $z_i$ represent the three-dimensional coordinate values of the simulated sound source. The microphone array selected in this section is orthogonal. When the microphone at the origin is taken as the reference, the four microphones can obtain three sets of time delay differences. Therefore, this study sets the number of nodes in the input layer to 3. The output consists of the three coordinate components in the spatial Cartesian coordinate system, so the number of nodes in the output layer is also set to 3.
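The simulated dataset described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' simulation: it uses free-field propagation only (no reverberation or noise), a speed of sound of 343 m/s, microphone A as the reference, and the sign convention "other microphone minus A". The function names are hypothetical.

```python
import numpy as np

C_SOUND = 343.0   # m/s, assumed speed of sound in air

# Microphone coordinates A, B, C, D as given in the text (metres).
MICS = np.array([[1, 1, 1], [9, 1, 1], [1, 9, 1], [1, 1, 4]], dtype=float)

def tdoa_features(sources, mics=MICS, c=C_SOUND):
    """Arrival-time differences (B-A, C-A, D-A) in seconds for each source."""
    sources = np.atleast_2d(np.asarray(sources, dtype=float))
    dist = np.linalg.norm(sources[:, None, :] - mics[None, :, :], axis=2)
    t = dist / c                      # direct-path arrival time at each microphone
    return t[:, 1:] - t[:, :1]        # subtract the reference microphone A

def cube_centres(size=(10, 10, 5), step=1.0):
    """Centre points of the 1 m cubes that tile the 10 m x 10 m x 5 m space."""
    axes = [np.arange(step / 2, s, step) for s in size]
    grid = np.meshgrid(*axes, indexing="ij")
    return np.stack(grid, axis=-1).reshape(-1, 3)

sources = cube_centres()              # network targets: 500 x 3 coordinates
features = tdoa_features(sources)     # network inputs: 500 x 3 time differences
print(sources.shape, features.shape)
```

Each row of `features` paired with the matching row of `sources` forms one training sample with three inputs and three outputs.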
The number of hidden layers and the number of nodes in each layer are critical to the accuracy and computational speed of the algorithm for different network models. Research indicates that the selection of the number of hidden layer nodes is related to factors such as the number of input nodes, the size of the training dataset, the complexity of the network, and error tolerance. If the number of nodes is too small, the network’s fault tolerance becomes poor and its effectiveness may be limited; conversely, too many nodes can easily lead to local optima and prolonged training times. Extensive practical experience suggests that the node count should be kept as small as possible while still meeting the required accuracy.
After the BP neural network has been trained on these 500 sets of training data, localization testing is conducted. A random spatial coordinate is generated by the computer, and the time differences at which each microphone first receives the sound are calculated from the speed of sound. This set of time differences is used as the input data for localization testing. The trained BP neural network is then used to predict the estimated coordinates. Finally, the error between the predicted location and the true location is analyzed.
Through extensive experimentation and repeated adjustment, this study determined that a BP neural network with two hidden layers of 10 nodes each yields the most accurate localization results, and this structure was adopted as the final model. Figure 8a presents the fitting regression results of this model, where a value of R closer to 1 indicates better fitting performance and higher model accuracy. As visualized in Figure 8b, for randomly generated sound source locations, the average prediction error of the BP neural network was below 0.1 m, demonstrating its feasibility for three-dimensional sound source localization.
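A 3-10-10-3 network of the kind described in this section can be sketched in plain NumPy. The text does not state the activation function, optimizer, or input scaling, so the tanh hidden units, the Adam update, the standardization, and the 343 m/s speed of sound are all assumptions, and the sketch is not expected to reproduce the reported accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
C_SOUND = 343.0                                   # assumed speed of sound (m/s)
MICS = np.array([[1, 1, 1], [9, 1, 1], [1, 9, 1], [1, 1, 4]], dtype=float)

def tdoa(src):
    """Free-field arrival-time differences relative to microphone A."""
    t = np.linalg.norm(src[:, None, :] - MICS[None], axis=2) / C_SOUND
    return t[:, 1:] - t[:, :1]

# 500 training sources at the centres of the 1 m cubes.
axes = [np.arange(0.5, s, 1.0) for s in (10, 10, 5)]
Y = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
X = tdoa(Y)
x_mu, x_sd, y_mu, y_sd = X.mean(0), X.std(0), Y.mean(0), Y.std(0)
Xn, Yn = (X - x_mu) / x_sd, (Y - y_mu) / y_sd     # standardized inputs and targets

sizes = [3, 10, 10, 3]                            # input, two hidden layers, output
W = [rng.normal(0, np.sqrt(1 / m), (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Return the activations of every layer (tanh hidden, linear output)."""
    acts = [x]
    for i, (w, bi) in enumerate(zip(W, b)):
        z = acts[-1] @ w + bi
        acts.append(np.tanh(z) if i < len(W) - 1 else z)
    return acts

def train(epochs=3000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Full-batch backpropagation with Adam; returns the MSE per epoch."""
    params = W + b                                # same array objects, updated in place
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    losses = []
    for step in range(1, epochs + 1):
        acts = forward(Xn)
        err = acts[-1] - Yn
        losses.append(float(np.mean(err ** 2)))
        delta = 2 * err / err.size                # gradient of the MSE w.r.t. the output
        gW, gb = [None] * 3, [None] * 3
        for i in reversed(range(3)):              # propagate the error backwards
            gW[i] = acts[i].T @ delta
            gb[i] = delta.sum(0)
            if i > 0:
                delta = (delta @ W[i].T) * (1 - acts[i] ** 2)
        for k, g in enumerate(gW + gb):
            m[k] = beta1 * m[k] + (1 - beta1) * g
            v[k] = beta2 * v[k] + (1 - beta2) * g * g
            m_hat = m[k] / (1 - beta1 ** step)
            v_hat = v[k] / (1 - beta2 ** step)
            params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return losses

def predict(tdoas):
    """Map time differences (seconds) to estimated coordinates (metres)."""
    return forward((tdoas - x_mu) / x_sd)[-1] * y_sd + y_mu

losses = train()
test_src = rng.uniform([0, 0, 0], [10, 10, 5], size=(200, 3))
test_err = np.linalg.norm(predict(tdoa(test_src)) - test_src, axis=1)
print(f"training MSE (standardized): {losses[0]:.3f} -> {losses[-1]:.4f}")
print(f"mean error on random test sources: {test_err.mean():.2f} m")
```

The test step mirrors the procedure in the text: random coordinates are generated, their time differences are computed from the speed of sound, and the prediction is compared with the true position.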
4.2. Results
For the finalized BP neural network, this study randomly selected 90% of the actually recorded sound source data, i.e., 810 sets, as training samples, while the remaining 10% were used as test samples to evaluate the localization performance. For a multi-input multi-output network structure, a substantial number of training samples is essential to prevent significant errors between predicted and actual values. To evaluate the model’s performance, the Euclidean distance is adopted as the metric for single-point localization error. Its calculation formula is as follows:
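$$e = \sqrt{(x_{\mathrm{pre}} - x_{\mathrm{real}})^2 + (y_{\mathrm{pre}} - y_{\mathrm{real}})^2 + (z_{\mathrm{pre}} - z_{\mathrm{real}})^2}$$
where $x_{\mathrm{pre}}$, $y_{\mathrm{pre}}$, $z_{\mathrm{pre}}$ are the predicted coordinates, and $x_{\mathrm{real}}$, $y_{\mathrm{real}}$, $z_{\mathrm{real}}$ are the actual coordinates. Additionally, the average localization error across all test samples is calculated to assess the model’s overall accuracy throughout the entire test space.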
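The single-point and average error metrics can be computed as in the short sketch below. The coordinates are made-up illustrative values, not data from the experiment, and the function name `localization_errors` is hypothetical.

```python
import numpy as np

def localization_errors(pred, real):
    """Euclidean distance between predicted and actual coordinates, per sample."""
    pred = np.asarray(pred, dtype=float)
    real = np.asarray(real, dtype=float)
    return np.linalg.norm(pred - real, axis=1)

# Illustrative values only (metres).
pred = [[1.0, 2.0, 2.0], [3.0, 0.0, 4.0]]
real = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
errors = localization_errors(pred, real)    # 3.0 m and 5.0 m
mean_error = errors.mean()                  # 4.0 m
print(errors, mean_error)
```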
The actual and predicted coordinates of the sound source and their error values are shown in Table 2. To visualize the performance, three-dimensional scatter plots are used to compare the true and predicted positions of sound sources. In Figure 9, the spatial overlap between true and predicted locations, as well as the length of error lines connecting corresponding points, provides an intuitive assessment of the model’s localization effectiveness.
Experimental results demonstrate that the proposed BP neural network model can effectively achieve three-dimensional localization of the sound source, with an average localization error of 0.46 m, which is below 0.5 m. This value differs from the accuracy reported in Section 4.1 because the simulation does not model the complex reflections and time-varying background noise of the actual environment, and because the sound may not reach the microphones immediately after it is emitted, which is difficult to simulate well. To evaluate the positioning accuracy quantitatively, a Gaussian distribution model is fitted to the error. The results show that the positioning error follows a normal distribution with μ = 0.37 m and σ = 0.18 m, and the probability that the error is less than 0.56 m is 84.13%. Considering that the energy storage battery pack measures 0.8 × 0.5 × 0.3 m, the error is significantly less than the length, width and height of the module, which is enough to meet the effective identification requirements of the battery pack. Another significant advantage of the model is that it does not require explicit input of microphone position information; the model can infer microphone locations autonomously from the training data. This is particularly important for energy storage systems, where internal layouts are complex and vary with the arrangement of electrical equipment, which can lead to installation errors or even prevent microphones from being placed at predetermined positions. The proposed model bypasses the need for precise microphone positioning, enabling direct training and localization. This simplifies the application process and gives the method broad adaptability.
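The probability quoted for the fitted normal distribution can be checked with the standard normal cumulative distribution function. The sketch below uses only the rounded parameters given in the text. The 84.13% figure is the cumulative probability at exactly one standard deviation above the mean, which for the rounded values is 0.55 m, so the 0.56 m threshold presumably reflects unrounded fitted parameters; that reading is an assumption.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

MU, SIGMA = 0.37, 0.18                              # fitted parameters from the text (m)
p_one_sigma = normal_cdf(MU + SIGMA, MU, SIGMA)     # threshold at mu + sigma
p_056 = normal_cdf(0.56, MU, SIGMA)                 # threshold quoted in the text
print(f"P(error < mu + sigma) = {p_one_sigma:.4f}")
print(f"P(error < 0.56 m)     = {p_056:.4f}")
```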
4.3. Validation Experiment
To verify the practical applicability of the model, a validation experiment was carried out in a real liquid-cooled energy storage container. Figure 10 shows the interior and exterior of the container and the experimental site.
The internal space of the container is 3.4 m × 2.3 m × 2.6 m. The experiment uses the same microphones as the previous experiment, installed in four prominent corners, which is more conducive to capturing sound. The same pressure relief valve as in the original experiment is used as the sound source, and 10 × 10 = 100 groups of data are generated. Based on these data, the BP neural network from the original experiment was trained, and the verification results are shown in Figure 11 and Table 3. Points 3 and 4 were set up so that their signals are captured after being occluded and reflected. The results show that the overall positioning accuracy is higher, with an average positioning error below 0.3 m, but the accuracy drops sharply when the sound source is severely disturbed. We hypothesize that the higher overall accuracy arises because the actual experimental site is smaller than the original open container and the space is more solidly enclosed, which makes both the first-arrival sound and the reflected sound clearer and less volatile. This experiment demonstrates that the BP neural network-based sound source localization algorithm retains strong applicability and robustness in an actual energy storage power station, provided that the sound field between the sound source and the microphones is free of large-area occlusion or reflection effects, which would otherwise cause localization deviation. Addressing this limitation is a key direction for future work.
4.4. Discussion
By conducting an in-depth analysis of the error distribution across various test points, we observed a notable phenomenon: the model demonstrates higher localization accuracy in the central regions where training data is densely concentrated, with errors generally below 0.3 m. In contrast, in regions with relatively sparse training data and in corner areas farther from the microphones, localization errors increase significantly, with some points even exceeding 0.5 m. This phenomenon reveals the model’s inherent strength in interpolation but relative weakness in extrapolation.
Inspired by this, a feasible way to further improve positioning accuracy in future research is to increase the number of microphones in the rectangular cabin, for example by deploying 6 to 8 microphones. Redundant microphones would improve the interpolation performance of the neural network model. With more input parameters, the offset in the overall positioning result caused by the error of a single microphone is suppressed, which improves the positioning accuracy of the network. For neural networks, more data contributes to fault-tolerant training and enhances mapping performance, which helps minimize localization deviations. Given the low cost of microphones, increasing their number has an acceptable impact on deployment costs.