A Lightweight Convolutional Spiking Neural Network for Fires Detection Based on Acoustics

Li, Xiaohuan; Liu, Yi; Zheng, Libo; Zhang, Wenqiong

doi:10.3390/electronics13152948

¹ College of Integrated Circuit Science and Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

² Beijing AcousticSpectrum Tech Co., Ltd., Beijing 100142, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(15), 2948; https://doi.org/10.3390/electronics13152948

Submission received: 30 June 2024 / Revised: 22 July 2024 / Accepted: 22 July 2024 / Published: 26 July 2024

Abstract

As urbanization accelerates, the prevalence of fire incidents leads to significant hazards. Enhancing the accuracy of remote fire detection systems while reducing computational complexity and power consumption in edge hardware is crucial. Therefore, this paper investigates an innovative lightweight Convolutional Spiking Neural Network (CSNN) method for fire detection based on acoustics. In this model, Poisson encoder and convolution encoder strategies are considered and compared. Additionally, the study investigates the impact of observation time steps, surrogate gradient functions, and the threshold and decay rate of membrane potential on network performance. A comparison is made between the classification metrics of the traditional Convolutional Neural Network (CNN) approaches and the proposed lightweight CSNN method. To assess the generalization performance of the proposed lightweight method, publicly available datasets are merged with our experimental data for training, which results in a high accuracy of 99.02%, a precision of 99.37%, a recall of 98.75%, and an $F_1$ score of 99.06% on the test datasets.

Keywords: fire detection; acoustic sensing; lightweight method; CSNN; convolution encoder

1. Introduction

Fire incidents are significant hazards, resulting in personal suffering, property damage, and economic disruption [1]. The existing fire detection systems deployed in industrial, commercial, and public institutions often suffer from high false alarm rates; studies have shown that false alarm rates are as high as 87% in Germany and 97% in certain applications in Sweden [2,3].

To address these issues, multi-sensor fusion systems have been developed to improve the reliability of fire detection and to reduce false alarms and missed detection [4,5,6]. These systems typically merge data from traditional sensors like temperature, vision, and gas [7,8]. Although data fusion technology has enhanced detection reliability compared to single-sensor systems, it still does not incorporate acoustic sensing well and needs improvements in accuracy. Especially in densely wooded and thickly vegetated forests and other adverse environments, acoustic sensing emerges as a pivotal method for wildfire detection owing to its ability to penetrate visual obstructions and operate effectively in noisy forest settings [9,10,11].

During combustion, energy is released as light and heat, causing air molecules to expand and vibrate rapidly. The burning materials also undergo structural changes and vibrations, which can generate acoustic waves [12,13,14,15]. It has been suggested that fire acoustic signals may be detected earlier than visual or other signals in some cases and are less obstructed by physical barriers such as walls [16].

In the evolution of fire detection based on acoustics, foundational research by Thomas et al. [12] found that a monopole flame source can be regarded as a simple acoustic source, with its strength being defined as the rate of volume variation. This seminal work sparked the interest of researchers in extracting fire acoustic signal features by digital signal processing. Moving into the 1990s, Grosshandler et al. [13] first proposed a proof of concept for using acoustics for fire detection but did not address background noise issues. Bedard et al. [15] contributed to the field by analyzing the relationship between acoustic signal frequency and combustion area sizes, although no specific recognition method was proposed. In 2016, Khamukhin et al. [10,17] studied an innovative method utilizing wireless sensor networks to analyze the acoustic emission spectrum of wildfires, enabling the differentiation between crown and surface fires by identifying distinctive noise patterns; however, the problem of false alarms was not considered in the papers. Zhang et al. [18] studied the trend line amplitude of the acoustic signal spectrum and achieved a 70% recognition rate in distinguishing ground fires from crown fires.

With the advancements in computer audition over recent years, researchers have been using machine learning methods for fire acoustic signal recognition [19]. In 2022, Huang et al. [20] developed an audio-based wildfire detection system leveraging a support vector machine classifier that achieved 90.9% accuracy on their test dataset but was significantly influenced by background noise. Martinsson et al. [2] utilized a CNN14 architecture with a large parameter count and substantial computational requirements, achieving 97.3% accuracy on their datasets, which lacked realistic background noise. More recently, Peruzzi et al. [21] developed a low-power fire detection system that combines image and audio data, achieving 95.375% accuracy using the NN#2 method based on audio alone. Lee et al. [22] developed an acoustic fire detection system for underground tunnels using a 2D CNN to recognize electric spark sounds, achieving an accuracy of 96.31% and an F1 score of 94.36% for early fire alerts; however, the system remains constrained by diverse noises, and additional research is needed to improve its adaptability.

Spiking Neural Networks (SNNs) are models that are more closely aligned with the operational mechanisms of biological neurons. They have garnered widespread interest in relation to certain tasks such as event-driven processing or spatiotemporal pattern recognition [23,24]. Many studies have shown that SNNs outperform many classic Artificial Neural Networks (ANNs) in computer vision on some commonly used datasets [25,26]. However, very few works demonstrate their effectiveness in acoustic signal recognition. The authors have yet to find recent papers that discuss the use of an SNN architecture in detecting fires based on acoustics.

Instead of using the numerical values communicated in traditional artificial neurons, spiking neurons operate based on discrete spikes or pulses of activity [27]. This allows SNNs to filter out irrelevant background noise and focus on the distinctive features of fire sounds, enhancing the overall robustness of the model in the presence of noise [28,29,30,31]. It is especially beneficial in real-world fire detection scenarios where CNN methods often face challenges.

In addition, SNNs operate on an event-driven basis and naturally produce sparse representations of data [32]. This means that only a subset of neurons “spike” in response to specific features in the input data, and computational resources are consumed only when there is a spike. Compared with CNN computation, SNN-based methods are therefore event-driven, sparse, and highly energy efficient [33].

The contributions of this study can be summarized as follows:

The proposed CSNN method adeptly merges the inherent sensitivity to temporal dynamics with the robust spatial feature extraction capabilities characteristic of convolutional operations. This integration notably enhances the accuracy of fire detection based on acoustics in real-world noisy environments.
The study introduces a specialized convolution encoder within the CSNN framework capable of converting acoustic inputs into spike-coded representations through learnable parameters. This encoding mechanism provides a more robust and adaptive solution for fire detection based on acoustics.
The study presents a spike-based computing method notable for its lightweight design, low computational time complexity, and high energy efficiency. It is well suited for fire detection in the edge hardware of remote surveillance systems.

To the knowledge of the authors, this study is one of the first to demonstrate that a well-trained lightweight CSNN can achieve a performance comparable to classic CNNs on public datasets in fire detection while using only one-tenth of the parameter count of the CNNs. Moreover, the proposed lightweight method exhibits a significant advantage in terms of inference time.

The rest of the paper is organized as follows: Section 2 describes the lightweight CSNN method for fire detection. Section 3 describes the source of the datasets, experimental configuration, and ablation study. Section 4 compares the differences in classification metrics between classic CNNs and the proposed method and introduces the design scheme for hardware implementation. Finally, concluding remarks are provided in Section 5.

2. Methodology

In the paper, the proposed novel hybrid approach based on an SNN mimics the behavior of biological neural systems through spiking mechanisms while leveraging convolutional modules to capture local spatial features. As illustrated in Figure 1, the proposed approach consists of four functional blocks: the preprocess block, encoding block, convolutional block, and full-connect block. The details of each block will be explained in the following sections.

2.1. Preprocess Block

Inspired by the human auditory system, filter-bank-based Mel-scaled processing is applied to analyze the spectral content of incoming acoustic signals. In this study, the input to the model is a 5 s audio clip sampled at 16 kHz. A Hanning window of size 1024 is moved over the waveform with a hop length of 320, which reduces spectral leakage and increases spectrum smoothness. A short-time Fourier transform is then applied to each windowed segment to produce a power spectrogram. The power spectrogram is processed by 64 triangular Mel filters and log-transformed, resulting in a matrix in $\mathbb{R}^{f \times t}$, where f represents the number of frequency bins in the spectrum and t represents the number of time frames. In this paper, it is a matrix in $\mathbb{R}^{64 \times 251}$. All samples are converted into Mel Spectrograms using fixed filter bank processing, as outlined in Table 1.
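The frame count of 251 follows from the clip length and the hop length. The sketch below reproduces that arithmetic. It assumes centered framing (the signal padded by half a window on each side), which the text does not state but which is the convention that yields 251 frames; the function name is illustrative only.

```python
# Framing arithmetic for the Mel Spectrogram shape (a sketch, not the authors' code).
SAMPLE_RATE = 16_000   # Hz, from the text
CLIP_SECONDS = 5       # s, from the text
WIN_LENGTH = 1024      # window size, from the text
HOP_LENGTH = 320       # hop length, from the text
N_MELS = 64            # number of Mel filters, from the text


def num_frames(n_samples, win_length, hop_length, center=True):
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        # Assumption: the signal is padded by win_length // 2 on both sides,
        # so one frame is centred on every multiple of hop_length.
        return 1 + n_samples // hop_length
    # Without padding, only windows that fit entirely inside the signal count.
    return 1 + (n_samples - win_length) // hop_length


n_samples = SAMPLE_RATE * CLIP_SECONDS
mel_shape = (N_MELS, num_frames(n_samples, WIN_LENGTH, HOP_LENGTH))
print(mel_shape)
```

Without centered padding the same clip would give 247 frames, so the reported 64 × 251 shape is consistent with the centered convention.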

2.2. Encoding Block

Typical encoding schemes in SNNs include rate coding and temporal coding [34]. The most widely used rate coding method converts the input into a Poisson-distributed spike train, which encodes the information through the neuron firing rates [35], as shown in Figure 2. As the observation time step T increases, the aggregation of spike trains progressively approximates the original input Mel Spectrogram. This allows for a more detailed interpretation of the inherent characteristics of the input Mel Spectrogram.
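To illustrate why the aggregated spike train approaches the input as T grows, the following minimal sketch draws one Bernoulli spike per value and per time step, with firing probability equal to the normalized input value. This is a common discrete-time stand-in for a Poisson spike train, not necessarily the authors' encoder; the input values and function names are hypothetical.

```python
import random


def poisson_encode(intensities, num_steps, rng):
    """Return num_steps binary frames; each value fires with probability equal to its intensity."""
    return [[1 if rng.random() < p else 0 for p in intensities]
            for _ in range(num_steps)]


def mean_rate(spike_frames):
    """Average firing rate of each input position over all time steps."""
    steps = len(spike_frames)
    return [sum(column) / steps for column in zip(*spike_frames)]


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
intensities = [0.0, 0.25, 0.5, 1.0]    # hypothetical normalized spectrogram values
frames = poisson_encode(intensities, 15, rng)
rates = mean_rate(frames)
print(rates)
```

Averaging over more time steps reduces the sampling noise of each rate, which is the sense in which a longer observation window reproduces the input more faithfully.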

In contrast to the Poisson encoder, the convolution encoder used in the method consists of a normalization layer, the first convolution layer, and the Leaky Integrate and Fire (LIF) neuron model. It converts the input Mel Spectrogram into a spiking sequence and serves as an auto-encoder with learnable parameters. As shown in Figure 3 and Table 2, the first convolution layer employs a 5 × 5 kernel without padding, generating 60 × 247 × 8 feature maps that are input to the spiking neurons, which repeatedly process the output within the observation time step T.

To enhance the stability and reliability of classification in SNNs, the output is usually represented by the average firing rate of the output layer within the observation time step T. This method minimizes the influence of binary output variability and results in a more consistent classification performance.
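A minimal sketch of this rate-based readout is given below. It assumes two output neurons, with index 0 standing for non-fire and index 1 for fire; the spike pattern is invented for illustration.

```python
def decode_by_rate(output_spikes):
    """output_spikes: one [non_fire_spike, fire_spike] pair per time step."""
    num_steps = len(output_spikes)
    rates = [sum(step[i] for step in output_spikes) / num_steps for i in range(2)]
    # The class whose neuron fired most often over the window is the prediction.
    predicted = max(range(2), key=lambda i: rates[i])
    return rates, predicted


# Hypothetical output over T = 15 steps: the fire neuron spikes 12 times,
# the non-fire neuron 3 times.
spikes = [[0, 1]] * 12 + [[1, 0]] * 3
rates, predicted = decode_by_rate(spikes)
print(rates, predicted)
```

Because the decision depends on a count over the whole window, a single spurious spike at one time step does not flip the predicted class.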

2.3. Convolutional Block

The spiking output from the encoder accumulates over time and is fed into a max pooling layer to down-sample the feature maps. The max pooling layer has a size of 2 × 2, which reduces the feature maps to 30 × 123 × 8. After two further passes through the ‘convolution–max pooling–spiking activation’ process, the resulting feature maps are 4 × 27 × 32. As shown in Figure 4, these two convolutional layers extract 16 and 32 features, respectively, using a 5 × 5 kernel, and a max pooling layer down-samples the features after each convolutional layer.
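The feature-map sizes quoted above can be checked with standard convolution and pooling arithmetic. The sketch below assumes stride 1 and no padding for every 5 × 5 convolution and floor division for the 2 × 2 pooling; the helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    """Output length of non-overlapping pooling along one axis (floor)."""
    return size // window


h, w = 64, 251            # Mel Spectrogram shape from the preprocess block
shapes = []
for channels in (8, 16, 32):
    h, w = conv_out(h, 5), conv_out(w, 5)
    shapes.append(("conv", h, w, channels))
    h, w = pool_out(h), pool_out(w)
    shapes.append(("pool", h, w, channels))

flattened = h * w * 32    # number of features handed to the full-connect block
for shape in shapes:
    print(shape)
print(flattened)
```

Under these assumptions the trace passes through 60 × 247 × 8, 30 × 123 × 8, and ends at 4 × 27 × 32, matching the sizes stated in the text.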

2.4. Full-Connect Block

As shown in Figure 4, the feature maps are flattened into one dimension. These features are then mapped by fully connected layers to the target space, which consists of 2 classes in this research. The layers are followed by two spiking neurons, which correspond to the fire and non-fire events, respectively. A fire event is labeled 1, while a non-fire event is labeled 0. Although there are 442,880 multiplications in this block, the feature values are either 0 or 1 because of the nature of the LIF neuron model, which enhances the computational efficiency.

2.5. Leaky Integrate and Fire Neuron Model

SNNs transmit and process information in a pulsatile manner, resembling the pulse signal propagation mechanism in neurobiology [36,37]. The illustration in Figure 5a shows a conceptual model where spiking neurons mimic the dynamic firing patterns observed in biological neurons. The LIF neuron is a kind of spiking neuron that simplifies the process of action potential generation by simultaneously incorporating the critical features of leakage, integration, and firing, thus significantly reducing computational complexity.

The dynamics of the LIF neuron’s membrane potential can be modeled as an RC circuit, as shown in Figure 5b. A time-varying input $I_{in}(t)$ across the resistor and capacitor forms a voltage named $U_{mem}(t)$, which can be thought of as the time-integrated potential of the neuron.

$$I_{in}(t) = I_{R}(t) + I_{C}(t) = \frac{U_{mem}(t)}{R} + C \frac{dU_{mem}(t)}{dt} \tag{1}$$

$$\tau \frac{dU_{mem}(t)}{dt} = -U_{mem}(t) + R I_{in}(t), \quad \text{where } \tau = RC \tag{2}$$

For a step input, the general solution to the differential equation can be represented as follows, where $c_{1}$ is a constant of integration:

$$U_{mem}(t) = I_{in}(t) R + c_{1} e^{-\frac{t}{\tau}} \tag{3}$$

Assuming the membrane potential $U_{mem}(t)$ at $t = 0$ is $U_{0}$, the solution satisfying this initial condition is as follows:

$$U_{mem}(t) = I_{in}(t) R + \left(U_{0} - I_{in}(t) R\right) e^{-\frac{t}{\tau}} \tag{4}$$

From the equation above, the membrane potential $U_{mem}(t)$ decays over time with a time constant $\tau$, and step inputs $I_{in}(t)$ are accumulated. If the neuron is sufficiently excited by the weighted sum of the step inputs that its membrane potential $U_{mem}(t)$ reaches a threshold $U_{thr}$, a spike $S_{t}$ will be generated:

$$S_{t} = \begin{cases} 1, & \text{if } U_{mem}(t) \geq U_{thr}; \\ 0, & \text{if } U_{mem}(t) < U_{thr}. \end{cases} \tag{5}$$

Then, the value of the membrane potential will be reset.
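The leak, integrate, fire, and reset steps can be sketched in discrete time as follows. The update rule U[t] = βU[t−1] + I[t], with β playing the role of the per-step decay factor, is a common discretization of Equation (2) rather than something the text specifies, and the reset to zero after a spike is likewise an assumption. The values β = 0.5 and a threshold of 1.0 mirror the settings selected later in the ablation study; the input currents are invented.

```python
def lif_run(inputs, beta=0.5, threshold=1.0):
    """Simulate one LIF neuron; returns the spike train and the membrane trace."""
    u = 0.0
    spikes, trace = [], []
    for current in inputs:
        u = beta * u + current             # leak the old potential, then integrate the input
        spike = 1 if u >= threshold else 0  # fire when the threshold is reached, as in Eq. (5)
        spikes.append(spike)
        trace.append(u)
        if spike:
            u = 0.0                         # assumption: hard reset to zero after a spike
    return spikes, trace


spikes, trace = lif_run([0.6] * 6)          # hypothetical constant input current
print(spikes)
```

With a constant input that is too weak (for example 0.4, whose steady-state potential 0.4 / (1 − β) = 0.8 stays below the threshold) the neuron never fires, which illustrates how the leak suppresses weak, sustained background input.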

3. Experiments

To validate the proposed lightweight method, four sets of experiments were conducted. The first evaluated the impact of the two encoding strategies and of different observation time steps T on the model’s accuracy. The second aimed to find the best surrogate gradient function for our research. The third sought to understand how the performance of the proposed method is affected by the membrane potential threshold and decay rate. Lastly, the impact of the number of neurons used in the full-connect block was analyzed.

3.1. Datasets

As shown in Table 3, the datasets included fire sound mixed with many kinds of ambient noises like weather noise, animal sounds, and incidental noises such as those caused by the passage of an airplane or car, along with similar ambient noises without fire sounds [21]. The spectrograms of three kinds of audio clips are shown in Figure 6.

A portion of the samples in the datasets was retrieved from Test Video [38], the openly available FSD50-K datasets [39], ESC-50 datasets [40], and audio data recorded by the authors of paper [21], while the others were specially recorded during our combustion experiment, as shown in Figure 7.

For coherence with the open datasets, the experimental audio data that we recorded were resampled at 16 kHz and standardized to 5 s. The datasets contained a total of 3060 samples, which were a mix of open audio samples and our recorded audio samples. Each sample belonged to exactly one class. The datasets were split randomly into training, validation, and test sets: 70% of all data samples were used for training, 10% were used as validation data to monitor the model’s performance during training and to guide hyperparameter tuning, and the remaining 20% were reserved for testing the model’s generalization performance. The training, validation, and test sets all exhibited a mild class imbalance, with non-fire events comprising 45% to 47% and fire events comprising 53% to 55% of the samples.
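A random 70/10/20 split of 3060 samples can be sketched as below. The seed and function name are hypothetical, and the sketch does not stratify by class, since the text does not say whether the authors did.

```python
import random


def split_indices(n_samples, train_pct=70, val_pct=10, seed=0):
    """Shuffle sample indices and cut them into train / validation / test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(3060)
print(len(train_idx), len(val_idx), len(test_idx))
```

For 3060 samples this gives 2142 training, 306 validation, and 612 test clips.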

3.2. Experimental Configuration

All experiments were conducted on a computing platform equipped with a W-2223 processor, 32 GB RAM, and an RTX 3090 GPU (Nvidia, Santa Clara, CA, USA). The software environment included PyTorch 2.4 and the snnTorch 0.9.1 framework [41], together with the necessary data processing and model training libraries.

During the training stage, the batch size was set to 64, and the learning rate was configured at 0.001. The Adam optimizer and Mean Square Error Spike Count Loss were employed as the optimizing algorithm and loss function, respectively. To ensure optimal model performance, the validation loss value was continuously monitored. Whenever the current validation loss was lower than the previous best, the model’s parameters were updated to reflect this improvement, and the best model was saved. The training process continued until either the validation loss failed to improve for 100 consecutive epochs or the maximum of 2000 epochs was reached. Once the stopping criterion was met, the training stage concluded, guaranteeing that the model had either converged or completed the maximum number of training epochs.
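The stopping rule described above (keep the best checkpoint, stop after 100 epochs without improvement or at 2000 epochs) can be sketched independently of any training framework. In the sketch, a precomputed list of validation losses stands in for the real training loop, and the function name is hypothetical.

```python
def run_early_stopping(val_losses, patience=100, max_epochs=2000):
    """Return (best_epoch, best_loss, last_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    stale = 0          # consecutive epochs without improvement
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        last_epoch = epoch
        if loss < best_loss:
            # In a real run, the model checkpoint would be saved here.
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_loss, last_epoch


# Hypothetical loss curve: improves for three epochs, then plateaus.
losses = [1.0, 0.5, 0.4] + [0.6] * 300
print(run_early_stopping(losses))
```

In this example the best epoch is 3 and training stops at epoch 103, after 100 consecutive epochs without a lower validation loss.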

3.3. Training and Ablation Study

The principle of spike rate encoding is to record the average value of the spikes delivered by a neuron within a certain time step T. The average value representing the frequency of spike transmission is used as the classification basis [42]. Therefore, it was necessary to consider the impact of observation time step T on the performance of the two encoders: the Poisson encoder and the convolution encoder.

As depicted in Figure 8, the convolution encoder consistently outperformed the Poisson encoder in terms of validation and test accuracies across various time steps. This indicates that the convolution encoder excelled in capturing and encoding essential features from the datasets, making it more effective in the proposed CSNN framework. A larger number of observation time steps T, however, leads to more time being spent in the inference stage. The observation time step T was therefore set to 15, which struck a good balance between accuracy and inference time in the proposed CSNN.

Due to the non-differentiable spike output, surrogate gradients were employed for backpropagation [43]. Four cases of surrogate gradient functions were analyzed to assess their impact on classification performance, as illustrated in Figure 9. These cases were the Fast Sigmoid (FS), Atan, Spike Rate Escape (SRE), and Straight Through Estimator (STE). The results indicate that the FS function achieved the highest validation accuracy, while the STE function resulted in the poorest performance.
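As an illustration of the surrogate-gradient idea, the sketch below pairs a Heaviside step in the forward pass with the derivative of the fast sigmoid x / (1 + k|x|), namely 1 / (1 + k|x|)², in the backward pass, where x is the membrane potential minus the threshold. The slope k = 25 is a hypothetical choice, not a value taken from the paper.

```python
def heaviside(x):
    """Forward pass: emit a spike when the shifted membrane potential is non-negative."""
    return 1.0 if x >= 0 else 0.0


def fast_sigmoid(x, slope=25.0):
    """Smooth stand-in for the step function, used only to define a gradient."""
    return x / (1.0 + slope * abs(x))


def fast_sigmoid_grad(x, slope=25.0):
    """Backward pass: derivative of fast_sigmoid, used in place of the step's derivative."""
    return 1.0 / (1.0 + slope * abs(x)) ** 2


for x in (-0.2, 0.0, 0.2):
    print(x, heaviside(x), fast_sigmoid_grad(x))
```

The surrogate gradient peaks at the threshold and falls off symmetrically on both sides, so neurons whose membrane potential is close to firing receive the largest weight updates.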

The membrane potential decay rate parameter $\beta$ defines the rate at which the membrane potential decays between time steps [44]. To explore the impact of $\beta$ on the performance of the proposed CSNN method, the decay rate $\beta$ was systematically varied from 0.1 to 0.9. As shown in Figure 10, a decay rate of 0.5 effectively balanced the trade-off between retaining previously integrated input and responding to new input, resulting in stable convergence and enhanced generalization performance.

To investigate the impact of the varying membrane threshold parameter $U_{thr}$ on the performance of the proposed CSNN, experiments that adjusted $U_{thr}$ from 0.2 to 1.0 were conducted, and the corresponding training loss, validation loss, and validation accuracy were recorded over epochs, as illustrated in Figure 11. The training loss and validation loss curves indicate that the algorithm failed to converge at $U_{thr} = 0.2$, suggesting that this threshold might be too low for effective learning. From a $U_{thr}$ of 0.4 onwards, a gradual increase in classification accuracy was observed, accompanied by a decreasing trend in the loss curve. Finally, the validation accuracy curve reached its peak when $U_{thr} = 1.0$, indicating that this threshold yielded the best performance in terms of classification accuracy.

To investigate the impact of spiking neurons in the full-connect block, a comparison was made between the performance of two different full-connect block structures: the Direct Fully Connected Layer (DFC) and the Two Fully Connected Layers (2FC). The DFC model flattens the vectors of post-convolution and directly connects them to a fully connected layer for classification, while the 2FC model first flattens the vector and then passes it through two fully connected layers for classification.

As depicted in Figure 12a,b, the 2FC model achieved lower training and validation losses than the DFC model during training. Figure 12c demonstrates that the 2FC model consistently outperformed the DFC model in terms of validation accuracy. This suggests that the additional fully connected layer in the 2FC model allows for better learning and representation of features in the data and hence fewer classification errors.

4. Results

To facilitate a more intuitive analysis of the performance of our proposed lightweight method, several classic CNN algorithms used for audio classification were also trained and tested on our datasets. The comparison results in Table 4 substantiate that the proposed method exhibited a classification performance equivalent to that of classic CNNs while employing a parameter count that is only one-tenth of that of the CNN6/CNN10 methods and approximately one-twentieth of the CNN14 method. Compared to traditional CNNs, the input feature values through our convolutional block are restricted to 0 and 1, and the calculation of the hidden convolution layers is based on 0 and 1, which significantly enhances computational efficiency.

The detailed training curves and results of the proposed lightweight CSNN method are shown in Figure 13. After ten training epochs, the validation accuracy surpassed 95%. Furthermore, by the 231st epoch, the model attained its peak validation accuracy of 99.07%. Figure 13b shows the Receiver Operating Characteristic (ROC) curve as a complement, which is a suitable metric for imbalanced datasets. The Area Under the Curve (AUC) calculated from the ROC curve is 0.9904. Figure 13c shows the accuracy of the method on fire events (lower right) and non-fire events (upper left), as well as the false-positive (upper right) and false-negative (lower left) rates.
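The reported classification metrics are related by standard formulas, which the sketch below writes out. The confusion-matrix counts in the second example are hypothetical and are not the paper's data; only the precision (99.37%) and recall (98.75%) in the first example come from the text.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, f1_from(precision, recall)


# Consistency check on the reported test-set precision and recall.
f1_reported = f1_from(0.9937, 0.9875)
print(round(f1_reported * 100, 2))

# Hypothetical counts, purely to show how the function is called.
print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

The harmonic mean of the reported precision and recall rounds to 99.06%, which agrees with the F1 score stated for the test datasets.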

The higher score of the proposed CSNN method can be attributed to several factors. Traditional CNN models take the Mel Spectrogram as the network input and extract features through convolution layers, as do the Audio NN1 and NN2 algorithms [21]. Such models perform well in clean conditions but are not robust in the presence of noise. In the proposed method, the Mel Spectrogram passes through the convolution encoder with learnable parameters, which encodes useful temporal information with a high anti-noise property. This is analogous to the auricle receiving outside stimuli and the auditory system extracting features in the form of spikes. These spikes carry the most robust and discriminative information.

Another reason can be drawn from the dynamics of the LIF neuron model, which are described by the following differential equation:

$$\tau \frac{dU_{mem}(t)}{dt} = -\left(U_{mem}(t) - U_{rest}\right) + I_{inj}(t) + I_{syn}(t) \tag{6}$$

where $U_{mem}(t)$ represents the membrane potential at time t, $\tau$ denotes the membrane time constant, $U_{rest}$ is the resting membrane potential, $I_{inj}(t)$ signifies the injected current (external input) at time t, and $I_{syn}(t)$ is the synaptic current at time t, which can be expressed as

$$I_{syn}(t) = \sum_{i} \omega_{i} s_{i}(t - \delta t_{i}) \tag{7}$$

where $\omega_{i}$ is the synaptic weight indicative of the influence of synapse i on the membrane potential, and $s_{i}(t - \delta t_{i})$ represents the spike signal emitted by the presynaptic neuron i at a prior time $t - \delta t_{i}$, with $\delta t_{i}$ being the spike transmission delay.

The accumulation of the membrane potential reflects a temporal integration of input signals indicative of sustained attention to input features. The firing of an action potential upon threshold crossing is analogous to the accumulative attention and response to significant information in attention mechanisms. In addition, the variation in synaptic weights can simulate differential attention to input signals. Synapses with higher weights may represent a “focus” on specific input features akin to the enhancing of salient features in attention mechanisms. It can be considered that the CSNN has an intrinsic attention mechanism where particular features are emphasized based on the precise timing of the spikes.

Next, the complexity of the proposed CSNN method and a suitable hardware platform for its deployment are considered. In the proposed CSNN method, except for the encoder block, which necessitates floating point computations, the calculations within the remaining convolution block and full-connect block are based on 0 and 1, simplifying the operational complexity compared to the CNN methods.

The time complexity can be estimated as the number of operations performed in the inference phase as follows:

$$O(CSNN) = \left(\sum_{l = 1}^{n} M_{l} N_{l} K_{l}^{2} C_{l - 1} C_{l}\right) T \tag{8}$$

where n is the total number of the convolutional layers, and $M_{l}$ and $N_{l}$ correspond to the width and length of the feature map in the lth layer, respectively. $K_{l}$ is the size of the convolutional kernel used in the lth layer. $C_{l - 1}$ and $C_{l}$ denote the number of channels in the preceding and current layer, respectively. T is the observation time step.
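As a worked instance of Equation (8), the sketch below evaluates the sum for the three convolutional layers described in Section 2. Several inputs are assumptions: the feature-map sizes are taken to be each layer's output size, the encoder's convolution is taken to have one input channel, and all three layers are multiplied by T = 15 even though the encoder convolution may in practice be computed only once.

```python
# Each tuple is (M_l, N_l, K_l, C_{l-1}, C_l) for one convolutional layer.
# Output feature-map sizes follow from 5 x 5 kernels without padding and 2 x 2 pooling.
LAYERS = [
    (60, 247, 5, 1, 8),     # encoder convolution (assumed single input channel)
    (26, 119, 5, 8, 16),    # first convolution of the convolutional block
    (9, 55, 5, 16, 32),     # second convolution of the convolutional block
]
TIME_STEPS = 15             # observation time step T chosen in the ablation study


def conv_operations(layers, time_steps):
    """Evaluate Equation (8): summed per-layer operation count times T."""
    per_step = sum(m * n * k * k * c_in * c_out for m, n, k, c_in, c_out in layers)
    return per_step, per_step * time_steps


per_step_ops, total_ops = conv_operations(LAYERS, TIME_STEPS)
print(per_step_ops, total_ops)
```

The count grows linearly with T, which is the trade-off between accuracy and inference time noted in the ablation study.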

However, it is difficult to directly compare the time complexity and Floating Point Operations Per Second (FLOPS) of these methods through Equation (8). For comparison, the inference latency of each method was recorded using the time.time() function within our experimental program. Specifically, each batch consisted of 64 audio clips from the test datasets. As illustrated in Table 5, the proposed CSNN method consumes the least time during the inference phase.

Regarding space complexity, deploying the proposed CSNN model on edge hardware with weights represented in FP32 format requires a storage capacity of a mere 1.836 MB for weight storage. By comparison, the CNN6 model demands 19.35 MB, the CNN10 model necessitates 20.877 MB, and the CNN14 model requires a considerably larger 200 MB of storage. The superior memory efficiency and spiking computation mechanism of the proposed CSNN make it particularly suitable for resource-constrained edge computing environments.
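The FP32 storage figure can be related to a parameter count at 4 bytes per weight. The sketch below counts weights and biases for the three 5 × 5 convolutions and two fully connected layers. The hidden width of 128 is not stated in the text; it is a hypothetical value chosen here because it reproduces the reported 1.836 MB, and any parameters in the normalization layer are ignored.

```python
BYTES_PER_FP32 = 4
HIDDEN_UNITS = 128          # assumption: not given in the text
FLATTENED = 4 * 27 * 32     # features entering the full-connect block


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


total_params = (
    conv_params(5, 1, 8)
    + conv_params(5, 8, 16)
    + conv_params(5, 16, 32)
    + fc_params(FLATTENED, HIDDEN_UNITS)
    + fc_params(HIDDEN_UNITS, 2)
)
storage_mb = total_params * BYTES_PER_FP32 / 1e6   # decimal megabytes
print(total_params, round(storage_mb, 3))
```

Under these assumptions nearly all of the parameters sit in the first fully connected layer, so the width of that layer is the main lever on the model's memory footprint.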

For the hardware implementation of the proposed CSNN method, a Field-Programmable Gate Array (FPGA) System on Chip (SoC), specifically the ZYNQ series, is deemed appropriate because it integrates an ARM Cortex-based processing system (PS) with programmable logic (PL). The PS handles the audio preprocessing tasks, namely receiving PCM-formatted digital audio and generating the STFT and Mel Spectrograms using its floating point computational unit and C/C++ libraries. The generated Mel Spectrograms are then handed over to the PL through an AXI bus for real-time processing.

Within the PL domain, a pipelined architecture enables concurrent operation of the encoding, convolution, and full-connect blocks, optimizing FPGA resource usage and reducing latency, which is critical for real-time fire detection. The method’s compact weight parameters fit directly into the SoC’s Block RAM (BRAM), streamlining data access. Because the features within the convolution and full-connect blocks are either 0 or 1, the convolution operations reduce to additions and are executed through convolution kernels integrated with shift registers. At each clock cycle, the accumulated membrane potential is compared against a predefined threshold $U_{thr}$ using a comparator, and the number of spikes is tallied by a counter. The accumulator and comparator circuits compute and output the final classification probability values.
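The reduction of multiply-accumulate to plain accumulation for binary inputs, followed by the comparator and spike counter, can be modeled in software as below. This is a behavioral sketch, not a description of the authors' circuit: the integer weights, the threshold, and the use of a one-bit right shift to realize a decay of 0.5 are all hypothetical.

```python
def spike_driven_accumulate(weights, spikes):
    """Add the weights whose input spiked; no multiplication is needed."""
    acc = 0
    for weight, spike in zip(weights, spikes):
        if spike:
            acc += weight
    return acc


def count_output_spikes(weights, spike_frames, threshold):
    """Accumulate, compare against the threshold, and count spikes over all time steps."""
    membrane = 0
    count = 0
    for frame in spike_frames:
        # Assumption: a right shift by one bit halves the potential (decay of 0.5).
        membrane = (membrane >> 1) + spike_driven_accumulate(weights, frame)
        if membrane >= threshold:      # comparator
            count += 1                 # spike counter
            membrane = 0               # reset after firing
    return count


weights = [3, -2, 5, 1]                # hypothetical fixed-point weights
frames = [[1, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 0, 1, 1]]
spike_count = count_output_spikes(weights, frames, threshold=8)
print(spike_count)
```

The accumulate step gives the same result as an ordinary dot product with the 0/1 inputs, which is why the hardware can drop the multipliers.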

5. Conclusions

In this paper, the acoustic signal is utilized as a sensing means for fire detection systems, and a lightweight CSNN method is proposed. To the knowledge of the authors, this is one of the first attempts at fire acoustic signal recognition based on SNNs. The openly available datasets were merged with our combustion experimental data for training, resulting in a high accuracy of 99.02%, a precision of 99.37%, a recall of 98.75%, and an F1 score of 99.06% on the test datasets. The results are very close to the performance of the CNN6 method while requiring only one-tenth of the parameter count. Our study identified the encoder scheme, observation time step T, surrogate gradient function, membrane potential decay rate, membrane potential threshold, and FC block structure as pivotal parameters that significantly impact the performance of the CSNN architecture.

The proposed lightweight CSNN method leverages the biological inspiration, energy efficiency, sparse representation, and noise robustness of SNNs to enhance the accuracy of fire detection systems. This makes the CSNN a promising approach for improving the reliability of remote fire detection systems. In future work, efforts will be directed toward amassing more real-world fire audio data to expand our fire audio datasets. Additionally, the incorporation of data augmentation is envisioned to further optimize the algorithm framework for enhanced robustness and generalization in practical scenarios. The proposed CSNN method is slated for deployment on an FPGA SoC hardware platform to leverage its inherent parallel computing capabilities while reducing computational overhead and enhancing energy efficiency.

Author Contributions

Conceptualization, X.L. and Y.L.; methodology, X.L.; software, X.L.; validation, X.L. and W.Z.; formal analysis, X.L. and W.Z.; data curation, X.L. and L.Z.; writing—original draft preparation, X.L.; writing—review and editing, Y.L. and X.L.; visualization, X.L.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

Jiangsu Provincial Team of Innovation and Entrepreneurship (JSSCTD202351).

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflicts of interest. Author Wenqiong Zhang is employed by Beijing AcousticSpectrum Tech Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Khan, F.; Xu, Z.; Sun, J.; Khan, F.M.; Ahmed, A.; Zhao, Y. Recent advances in sensors for fire detection. Sensors 2022, 22, 3310.
2. Martinsson, J.; Runefors, M.; Frantzich, H.; Glebe, D.; McNamee, M.; Mogren, O. A novel method for smart fire detection using acoustic measurements and machine learning: Proof of concept. Fire Technol. 2022, 58, 3385–3403.
3. Festag, S. False alarm ratio of fire detection and fire alarm systems in Germany: A meta analysis. Fire Saf. J. 2016, 79, 119–126. Available online: https://www.sciencedirect.com/science/article/pii/S0379711215300369 (accessed on 3 May 2023).
4. Ding, Q.; Peng, Z.; Liu, T.; Tong, Q. Building fire alarm system with multi-sensor and information fusion technology based on d-s evidence theory. In Proceedings of the 2014 International Symposium on Computer, Consumer and Control, Taichung, Taiwan, 10–12 June 2014; pp. 906–909.
5. Zhang, W. Electric fire early warning system of gymnasium building based on multi-sensor data fusion technology. In Proceedings of the 2021 International Conference on Machine Learning and Intelligent Systems Engineering (MLISE), Chongqing, China, 9–11 July 2021; pp. 339–343.
6. Wu, L.; Chen, L.; Hao, X. Multi-sensor data fusion algorithm for indoor fire early warning based on bp neural network. Information 2021, 12, 59.
7. Liu, P.; Xiang, P.; Lu, D. A new multi-sensor fire detection method based on lstm networks with environmental information fusion. Neural Comput. Appl. 2023, 35, 25275–25289.
8. Li, J.; Ai, F.; Cai, C.; Xiong, H.; Li, W.; Jiang, X.; Liu, Z. Fire Detecting for Dense Bus Ducts Based on Data Fusion. Energy Rep. 2023, 9, 361–369. Available online: https://www.sciencedirect.com/science/article/pii/S2352484723008995 (accessed on 12 March 2024).
9. Viegas, D.X.; Pita, L.P.; Nielsen, F.; Haddad, K.; Tassini, C.C.; D’Altrui, G.; Quaranta, V.; Dimino, I.; Tsangaris, H. Acoustic characterization of a forest fire event. In Proceedings of the SPIE - The International Society for Optical Engineering, Incheon, Republic of Korea, 13–14 October 2008; Volume 119, pp. 374–385.
10. Khamukhin, A.A.; Bertoldo, S. Spectral analysis of forest fire noise for early detection using wireless sensor networks. In Proceedings of the 2016 International Siberian Conference on Control and Communications (SIBCON), Moscow, Russia, 12–14 May 2016.
11. Chwalek, P.; Chen, H.; Dutta, P.; Dimon, J.; Singh, S.; Chiang, C.; Azwell, T. Downwind fire and smoke detection during a controlled burn: Analyzing the feasibility and robustness of several downwind wildfire sensing modalities through real world applications. Fire 2023, 6, 356.
12. Thomas, A.; Williams, G.T. Flame noise: Sound emission from spark-ignited bubbles of combustible gas. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1966, 294, 449–466.
13. Grosshandler, W.; Jackson, M. Acoustic emission of structural materials exposed to open flames. Fire Saf. J. 1994, 22, 209–228.
14. Wang, M.; Wu, J.B.; Li, C.H.; Luo, W.; Zhang, L.W. Transformer fire identification method based on multi-neural network and evidence theory. In Proceedings of the 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 11–13 December 2020.
15. Bedard, A.J.; Nishiyama, R.T. Infrasound generation by large fires: Experimental results and a review of an analytical model predicting dominant frequencies. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Toronto, ON, Canada, 24–28 June 2002.
16. Sonkin, M.A.; Khamukhin, A.A.; Pogrebnoy, A.V.; Marinov, P.; Atanassova, V.; Roeva, O.; Atanassov, K.; Alexandrov, A. Intercriteria Analysis as Tool for Acoustic Monitoring of Forest for Early Detection Fires; Atanassov, K.T., Atanassova, V., Kacprzyk, J., Kaluszko, A., Krawczak, M., Owsinski, J.W., Sotirov, S., Sotirova, E., Szmidt, E., Zadrozny, S., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 205–213.
17. Khamukhin, A.A.; Demin, A.Y.; Sonkin, D.M.; Bertoldo, S.; Perona, G.; Kretova, V. An algorithm of the wildfire classification by its acoustic emission spectrum using wireless sensor networks. J. Phys. Conf. Ser. 2017, 803, 012067.
18. Zhang, S.; Gao, D.; Lin, H.; Sun, Q. Wildfire detection using sound spectrum analysis based on the internet of things. Sensors 2019, 19, 5093.
19. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894.
20. Huang, H.-T.; Downey, A.R.J.; Bakos, J.D. Audio-based wildfire detection on embedded systems. Electronics 2022, 11, 1417.
21. Peruzzi, G.; Pozzebon, A.; Van Der Meer, M. Fight fire with fire: Detecting forest fires with embedded machine learning models dealing with audio and images on low power iot devices. Sensors 2023, 23, 783.
22. Lee, B.-J.; Lee, M.-S.; Jung, W.-S. Acoustic based fire event detection system in underground utility tunnels. Fire 2023, 6, 211.
23. Zeng, Y.; Zhao, D.; Zhao, F.; Shen, G.; Dong, Y.; Lu, E.; Zhang, Q.; Sun, Y.; Liang, Q.; Zhao, Y. Braincog: A spiking neural network based, brain-inspired cognitive intelligence engine for brain-inspired ai and brain simulation. Patterns 2023, 4, 100789.
24. Yu, Q.; Yan, R.; Tang, H.; Tan, K.C.; Li, H. A spiking neural network system for robust sequence recognition. IEEE Trans. Neural Netw. Learn. Syst. 2015, 27, 621–635.
25. Agarwal, R.; Ghosal, P.; Murmu, N.; Nandi, D. Spiking Neural Network in Computer Vision: Techniques, Tools and Trends; Borah, S., Gandhi, T.K., Piuri, V., Eds.; Springer Nature: Singapore, 2023; pp. 201–209.
26. Lv, C.; Xu, J.; Zheng, X. Spiking convolutional neural networks for text classification. arXiv 2024, arXiv:2406.19230.
27. Xu, Q.; Qi, Y.; Yu, H.; Shen, J.; Tang, H.; Pan, G. Csnn: An augmented spiking based framework with perceptron-inception. IJCAI 2018, 1646, 1–7.
28. Xiao, R.; Yan, R.; Tang, H.; Tan, K.C. A spiking neural network model for sound recognition. In Cognitive Systems and Signal Processing; Sun, F., Liu, H., Hu, D., Eds.; Springer: Singapore, 2017; pp. 584–594.
29. Zeng, Y.; Zhang, T.; Xu, B. Improving multi-layer spiking neural networks by incorporating brain-inspired rules. Sci. China Inf. Sci. 2017, 60, 052201.
30. Zhang, T.; Zeng, Y.; Zhao, D.; Wang, L.; Xu, B. Hmsnn: Hippocampus inspired memory spiking neural network. In Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Budapest, Hungary, 9–12 October 2016.
31. Cheng, X.; Hao, Y.; Xu, J.; Xu, B. Lisnn: Improving spiking neural networks with lateral interactions for robust object recognition. In Proceedings of the International Joint Conferences on Artificial Intelligence Organization, Online, 7–8 January 2020.
32. Wang, Z.; Wang, Z.; Li, H.; Qin, L.; Jiang, R.; Ma, D.; Tang, H. Eas-snn: End-to-end adaptive sampling and representation for event-based detection with recurrent spiking neural networks. arXiv 2024, arXiv:2403.12574.
33. Han, B.; Roy, K. Deep spiking neural network: Energy efficiency through time based coding. In Computer Vision – ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 388–404.
34. Yamazaki, K.; Vo-Ho, V.-K.; Bulsara, D.; Le, N. Spiking neural networks and their applications: A review. Brain Sci. 2022, 12, 863.
35. Garg, I.; Chowdhury, S.S.; Roy, K. Dct-snn: Using dct to distribute spatial information over time for low-latency spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 4671–4680.
36. Rathi, N.; Roy, K. Diet-snn: A low-latency spiking neural network with direct input encoding and leakage and threshold optimization. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3174–3182.
37. Wu, Z.; Zhang, H.; Lin, Y.; Li, G.; Wang, M.; Tang, Y. Liaf-net: Leaky integrate and analog fire network for lightweight and efficient spatiotemporal information processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6249–6262.
38. Peruzzi, G.; Pozzebon, A.; Van Der Meer, M. Test Video. 2022. Available online: https://drive.google.com/file/d/1Hi2gs4mkrFibULaHfVDzgJZgVaVUYf6L/view?usp=share_link (accessed on 1 May 2024).
39. Fonseca, E.; Favory, X.; Pons, J.; Font, F.; Serra, X. Fsd50k: An open dataset of human-labeled sound events. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 829–852.
40. Piczak, K.J. Esc: Dataset for environmental sound classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Ser. MM ’15, Brisbane, Australia, 26–30 October 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1015–1018.
41. Eshraghian, J.K.; Ward, M.; Neftci, E.; Wang, X.; Lenz, G.; Dwivedi, G.; Bennamoun, M.; Jeong, D.S.; Lu, W.D. Training spiking neural networks using lessons from deep learning. Proc. IEEE 2023, 111, 1016–1054.
42. Jiang, X.; Xie, H.; Lu, Z.; Hu, J. Energy-efficient and high-performance ship classification strategy based on siamese spiking neural network in dual-polarized sar images. Remote Sens. 2023, 15, 4966.
43. Rathi, N.; Srinivasan, G.; Panda, P.; Roy, K. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. arXiv 2020, arXiv:2005.01807.
44. Dasbach, S.; Tetzlaff, T.; Diesmann, M.; Senk, J. Dynamical Characteristics of Recurrent Neuronal Networks Are Robust against Low Synaptic Weight Resolution. Front. Neurosci. 2021, 15, 757790. Available online: https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2021.757790 (accessed on 16 April 2024).

Figure 1. CSNN architecture for fire detection based on acoustics. The preprocess block converts the input acoustic signal into the Mel-Frequency Coefficients spectrum, the encoding block converts the input Mel Spectrogram into spikes, the convolutional block learns features from the encoded spikes, and the full-connect block acts as the classifier that predicts the labels of the input data.

Figure 2. Poisson encoder for Mel Spectrogram of input fire acoustic signals. (a) The input fire acoustic signal Mel Spectrogram. (b) The normalized grayscale spectrogram. (c–j) The accumulative output of the Poisson encoder at observation time step T = 0, 5, 10, 15, 20, 25, 30, and 35, respectively.
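The rate coding illustrated in Figure 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each normalized pixel intensity in [0, 1] is used as the per-step firing probability of an independent Bernoulli draw, and the function names `poisson_encode` and `accumulate` are hypothetical.

```python
import random


def poisson_encode(image, num_steps, seed=0):
    """Rate-code a normalized 2D array into num_steps binary spike frames.

    Assumption: each intensity p in [0, 1] is the probability that the
    corresponding pixel emits a spike at any given time step.
    """
    rng = random.Random(seed)
    frames = []
    for _ in range(num_steps):
        frame = [[1 if rng.random() < p else 0 for p in row] for row in image]
        frames.append(frame)
    return frames


def accumulate(frames):
    """Sum the spike frames over time, like the accumulated panels (c-j)."""
    rows, cols = len(frames[0]), len(frames[0][0])
    total = [[0] * cols for _ in range(rows)]
    for frame in frames:
        for i in range(rows):
            for j in range(cols):
                total[i][j] += frame[i][j]
    return total


# Toy 2x2 "spectrogram" with normalized intensities (illustrative values only).
spectrogram = [[0.0, 0.25], [0.75, 1.0]]
spike_frames = poisson_encode(spectrogram, num_steps=35)
spike_counts = accumulate(spike_frames)
```

With more time steps, the accumulated spike counts approach the original intensities scaled by T, which is why the later panels of Figure 2 resemble the grayscale spectrogram more closely.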

Figure 3. Convolution encoder for Mel Spectrogram of input fire acoustic signals. (a) The input fire acoustic signal Mel Spectrogram. (b) The normalized grayscale spectrogram. (c–j) The accumulated output features of the 8 channels of the convolution encoder at observation time step T = 15.

Figure 4. Convolutional and full-connect block.

Figure 5. (a) Spiking neurons mimic the spiking behavior of biological neurons. The dendrites integrate incoming spike trains, and, upon surpassing a threshold, the soma generates an output spike, reflecting the all-or-none principle of action potential propagation in biological systems. (b) The simplest model of a passive membrane is a Resistor–Capacitor (RC) circuit.
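The integrate-and-fire behavior described in Figure 5 can be sketched with a discrete-time leaky integrate-and-fire update. This is a hedged illustration only: the decay rate β = 0.5 is the value highlighted in Figure 10, while the threshold of 1.0, the reset-by-subtraction rule, and the function name `lif_run` are assumptions made for the example.

```python
def lif_run(inputs, beta=0.5, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a list of input currents.

    Assumed update rule: U[t] = beta * U[t-1] + I[t]; the neuron emits a
    spike when U[t] exceeds the threshold, then U[t] is reduced by the
    threshold (soft reset).
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane = beta * membrane + current  # leaky integration (RC decay)
        if membrane > threshold:
            spikes.append(1)                  # all-or-none output spike
            membrane -= threshold             # soft reset after firing
        else:
            spikes.append(0)
    return spikes


# Three sub-threshold inputs accumulate until the neuron fires once.
spike_train = lif_run([0.6, 0.6, 0.6, 0.0])
```

A smaller β makes the membrane forget past inputs faster, so more closely spaced input is needed to reach the threshold.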

Figure 6. The time–frequency spectrogram of the clean fire audio clip, fire with bird sounds audio clip, and pure bird noise audio clip, respectively. (a) shows a pure fire sound audio clip concentrated in the low-to-high frequency range. (b) shows that the addition of bird sounds results in a spectral expansion in the middle frequency band, compromising the purity of the fire acoustic signal. (c) indicates severe noise contamination across the entire frequency spectrum.

Figure 7. The combustion experiment was conducted outdoors in rainy conditions at 10:30 a.m. on 5 June 2023. The audio data were recorded using the NI-9234 audio card connected to an MNP32-type microphone. The sample rate was set to 22.5 kHz, and the resolution was 16 bits.

Figure 8. Accuracy versus observation time step T for the two encoders. Increasing the number of observation time steps T usually improves validation accuracy up to a certain point. Beyond this point, further increases bring no additional improvement and may even degrade performance, which suggests possible overfitting.

Figure 9. The impact of the surrogate gradient function in the proposed CSNN. (a) Training loss, (b) validation loss, (c) validation accuracy, and (d) validation loss versus validation accuracy for four surrogate gradient functions.

Figure 10. The impact of the decay rate β in the proposed CSNN. (a) Training loss, (b) validation loss, (c) validation accuracy, and (d) validation loss versus the decay rate β. When the decay rate β = 0.5, the method exhibited significantly improved performance, characterized by the lowest validation loss value and the highest validation accuracy.

Figure 11. The impact of membrane potential threshold U_{thr} in the proposed CSNN. (a) Training loss, (b) validation loss, (c) validation accuracy, and (d) validation loss versus the membrane threshold.

Figure 12. The impact of spiking neurons in the full-connect block. (a) Training loss, (b) validation loss, and (c) validation accuracy versus epochs under two full-connect block structures.

Figure 13. Training and test results using our proposed lightweight CSNN method. (a) The training and validation curve of the CSNN on our datasets, (b) ROC curve of the CSNN method, (c) the normalized confusion matrix.

Table 1. Parameters for the Mel Spectrogram.

Window length	1024
Hop length	320
Window function	Hanning
Mel bins	64
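As a rough consistency check, the hop length in Table 1 can be related to the 251 time frames listed as the input shape in Table 2. The sketch below assumes 5 s clips (Table 3 lists 277 clean-fire clips totalling 1385 s), a 16 kHz sample rate, and centered framing. The sample rate and framing convention are not stated here; they are assumptions chosen because they reproduce the 251 frames. The helper `num_frames` is hypothetical.

```python
def num_frames(num_samples, hop_length=320, window_length=1024, centered=True):
    """Number of STFT frames for a signal of num_samples samples.

    centered=True assumes the signal is padded so that frames are centered
    on multiples of hop_length; centered=False assumes no padding.
    """
    if centered:
        return 1 + num_samples // hop_length
    return 1 + (num_samples - window_length) // hop_length


sample_rate = 16_000  # assumption: not stated in the tables
clip_seconds = 5      # from Table 3: 1385 s / 277 clips
frames = num_frames(sample_rate * clip_seconds)
mel_bins = 64         # from Table 1
input_shape = (frames, mel_bins, 1)
```

Under these assumptions each clip becomes a 251 × 64 single-channel Mel Spectrogram.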

Table 2. The parameters of the proposed lightweight CSNN.

Layer	Type	Output Shape	Learnable Parameters
InputLayer	-	[batch,251,64,1]	0
BatchNormLayer	Normalization	[batch,251,64,1]	0
Conv_L1	Convolutional	[batch,247,60,8]	208
Neuron Node	LIF	[T,batch,247,60,8]	0
MaxPool_L1	Max Pooling	[batch,123,30,8]	0
Conv_L2	Convolutional	[batch,119,26,16]	3216
Neuron Node	LIF	[T,batch,119,26,16]	0
MaxPool_L2	Max Pooling	[batch,59,13,16]	0
Conv_L3	Convolutional	[batch,55,9,32]	12,832
Neuron Node	LIF	[T,batch,55,9,32]	0
MaxPool_L3	Max Pooling	[batch,27,4,32]	0
FlattenLayer	Flatten	[batch,1,1,3456]	0
FC_L1	Fully Connected	[batch,1,1,128]	442,496
Neuron Node	LIF	[T,batch,1,1,128]	0
FC_L2	Fully Connected	[batch,1,1,2]	258
Neuron Node	LIF	[T,batch,1,1,2]	0
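The output shapes and parameter counts in Table 2 can be reproduced with simple arithmetic. The sketch below assumes 5 × 5 convolution kernels with stride 1 and no padding, and 2 × 2 max pooling with floor rounding. These settings are inferred from the shapes and parameter counts in the table rather than stated in it, and the helper names are hypothetical.

```python
def conv_out(size, kernel=5):
    """Output length of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size, window=2):
    """Output length of non-overlapping max pooling (floor rounding)."""
    return size // window


def conv_params(in_ch, out_ch, kernel=5):
    """Weights plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


height, width, channels = 251, 64, 1   # input shape from Table 2
shapes = []
total_params = 0
for out_ch in (8, 16, 32):             # Conv_L1, Conv_L2, Conv_L3
    total_params += conv_params(channels, out_ch)
    height, width, channels = conv_out(height), conv_out(width), out_ch
    shapes.append((height, width, channels))   # after convolution
    height, width = pool_out(height), pool_out(width)
    shapes.append((height, width, channels))   # after max pooling

flat = height * width * channels       # FlattenLayer size
total_params += fc_params(flat, 128) + fc_params(128, 2)
```

Under these assumptions the first pooling layer produces a width of 30, which is the value that leads to the width of 26 listed for Conv_L2, and the layer-wise counts sum to the 459,010 parameters reported in Table 4.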

Table 3. Audio dataset description.

Class	Audio Type	Number of Samples	Total Time (s)
Fire	Clean fire	277	1385
	Fire with bird sounds	271	1355
	Fire with kinds of noise	546	2730
	Recordings by other researchers	71	355
	Fire with wind sounds	267	1335
	Recordings by the authors	200	1000
No Fire	Noise—bird	276	1380
	Noise—crick	200	1000
	Unknown noise	400	2000
	Noise—rain	345	1725
	Noise—wind	207	1035

Table 4. Performance metrics and number of parameters in different methods.

Architecture	Parameters	Accuracy	Precision	Recall	F1 Score
CNN6 [19]	4,837,455	99.02%	99.07%	99.07%	99.07%
CNN10 [19]	5,219,279	99.35%	99.07%	99.69%	99.38%
CNN14 [2,19]	80,753,615	99.51%	100%	99.07%	99.53%
Proposed CSNN	459,010	99.02%	99.37%	98.75%	99.06%
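The F1 scores in Table 4 follow from the listed precision and recall via F1 = 2PR / (P + R), and the parameter counts give the size ratio relative to the proposed CSNN. The short sketch below recomputes both from the table values; the dictionary layout and function name are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (parameters, precision %, recall %) copied from Table 4.
results = {
    "CNN6": (4_837_455, 99.07, 99.07),
    "CNN10": (5_219_279, 99.07, 99.69),
    "CNN14": (80_753_615, 100.0, 99.07),
    "Proposed CSNN": (459_010, 99.37, 98.75),
}

csnn_params = results["Proposed CSNN"][0]
for name, (params, precision, recall) in results.items():
    f1 = f1_score(precision, recall)
    ratio = params / csnn_params
    print(f"{name}: F1 = {f1:.2f}%, parameters = {ratio:.1f}x the CSNN")
```

The ratio for CNN6 comes out at about 10.5, which is the basis for the "one-tenth of the parameter count" statement in the conclusions.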

Table 5. The inference consumption time comparison for different methods.

	CNN6	CNN10	CNN14	Proposed CSNN
Batch1 inference time (s)	0.6459	0.7247	0.7914	0.0452
Batch2 inference time (s)	0.6482	0.7276	0.7953	0.0456
Batch3 inference time (s)	0.6505	0.7306	0.7993	0.0463
Batch4 inference time (s)	0.6528	0.7336	0.8030	0.0461
Batch5 inference time (s)	0.6563	0.7367	0.8067	0.0464
Batch6 inference time (s)	0.6586	0.7397	0.8107	0.047
Batch7 inference time (s)	0.6611	0.7426	0.8146	0.0493
Batch8 inference time (s)	0.6634	0.7456	0.8185	0.0477
Batch9 inference time (s)	0.6660	0.7486	0.8223	0.0498
Batch10 inference time (s)	0.7021	0.8398	0.9261	0.0759
Average inference time (s) per audio clip	0.0103	0.012	0.0132	0.0007
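To put the batch timings in Table 5 into perspective, the mean batch inference time of each method can be compared with that of the proposed CSNN. The sketch below uses only the ten batch times listed in the table; the number of clips per batch is not stated, so the comparison is made per batch rather than per clip.

```python
from statistics import mean

# Batch inference times in seconds, copied from Table 5 (Batch1 to Batch10).
batch_times = {
    "CNN6": [0.6459, 0.6482, 0.6505, 0.6528, 0.6563,
             0.6586, 0.6611, 0.6634, 0.6660, 0.7021],
    "CNN10": [0.7247, 0.7276, 0.7306, 0.7336, 0.7367,
              0.7397, 0.7426, 0.7456, 0.7486, 0.8398],
    "CNN14": [0.7914, 0.7953, 0.7993, 0.8030, 0.8067,
              0.8107, 0.8146, 0.8185, 0.8223, 0.9261],
    "Proposed CSNN": [0.0452, 0.0456, 0.0463, 0.0461, 0.0464,
                      0.047, 0.0493, 0.0477, 0.0498, 0.0759],
}

mean_times = {name: mean(times) for name, times in batch_times.items()}
csnn_mean = mean_times["Proposed CSNN"]

# How many times slower each method is than the CSNN, per batch.
slowdown = {name: t / csnn_mean for name, t in mean_times.items()}
for name, factor in slowdown.items():
    print(f"{name}: mean batch time {mean_times[name]:.4f} s, {factor:.1f}x the CSNN")
```

On these numbers the CNN baselines take roughly 13 to 16 times longer per batch than the proposed CSNN.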

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
