1. Introduction
The biological brain provides important inspiration for the development of artificial intelligence. As one of the most representative models in brain-inspired computing, the spiking neural network (SNN) exhibits natural advantages in processing temporal information because computation is carried out through dynamic interactions among spikes, neurons, and synapses [1, 2]. Compared with conventional artificial neural networks, SNNs encode and transmit information in an event-driven manner, which makes them especially suitable for time-varying signals such as speech [1, 3]. Nevertheless, although many existing SNNs attempt to mimic neuronal dynamics, their network topologies are still mainly generated by algorithmic rules [4], such as regular, random, small-world, or scale-free structures [5, 6, 7]. This means that the structural organization of these models often remains insufficiently constrained by biological brain connectivity. Recent advances in neuroimaging provide a new opportunity to address this limitation. Functional magnetic resonance imaging (fMRI) makes it possible to estimate large-scale functional brain networks (FBNs) in the human brain [8, 9]. In fMRI, the measured signal is the blood oxygenation level-dependent (BOLD) response, which reflects hemodynamic changes associated with neural activity [10, 11]. From the perspective of brain-inspired modeling, FBNs derived from fMRI offer biologically grounded macro-scale structural priors that may guide the construction of artificial neural systems. Such priors are particularly attractive because they capture stable communication patterns among brain regions and can potentially serve as an architectural scaffold for neural computation. This perspective motivates the integration of brain-network topology into the design of SNNs.
Among SNN models for temporal processing, the liquid state machine (LSM) is especially suitable for speech recognition [12]. The reservoir in an LSM projects temporal inputs into a rich high-dimensional dynamic space, and the quality of this transformation depends strongly on the internal network topology. Previous studies have shown that the topology of the reservoir significantly influences classification performance, generalization ability, and information transmission efficiency in temporal tasks [13, 14, 15]. These observations suggest that constraining the topology of an SNN with FBNs may provide a promising route toward both higher bio-plausibility and improved performance. On this basis, our previous study constructed an fMRISNN constrained by the topology of human FBNs, demonstrating that an LSM based on fMRISNN can achieve favorable speech recognition performance while exhibiting advantages in robustness and information transmission [16, 17, 18]. In this speech recognition framework, speech preprocessing plays a critical role because it determines the temporal statistics, sparsity, and discriminability of the spike sequences delivered to the reservoir, thereby directly affecting SNN-based recognition performance. In conventional speech processing, the Lyon ear model is often adopted as a biologically inspired auditory front-end. By cascading asymmetric resonators, half-wave rectification, and automatic gain control, it simulates cochlear frequency selectivity, compression, and adaptation, thereby preserving important spectro-temporal information in speech signals [19, 20]. Building on this front-end, Ben's Spiker Algorithm (BSA) is commonly used to convert continuous auditory representations into spike trains [21]. Accordingly, our previous framework employed the Lyon ear model and the BSA to encode speech signals into spike sequences [16]. Although this biologically inspired pipeline has been widely used, it still presents several limitations. Its encoding stage depends on the joint design of finite impulse response (FIR) filters and threshold parameters, making parameter selection relatively complex. Moreover, previous studies have shown that BSA is less effective in representing abrupt temporal changes, such as step-like transitions [22, 23]. These limitations suggest that the conventional Lyon ear model–BSA pipeline still leaves room for improvement in terms of computational efficiency, spike sparsity, and adaptability to temporal variation.
To address this issue, this paper proposes an enhanced Mel-frequency cepstral coefficient (MFCC)-driven sparse spike encoding method for speech recognition. Specifically, the proposed method integrates enhanced MFCC-based dynamic feature extraction with sparse spike conversion, temporal alignment, and normalization. In addition, the present study explicitly compares the proposed preprocessing pipeline with the conventional Lyon ear model + BSA pipeline under the same fMRISNN-based speech recognition framework. This comparison is carried out both at the encoding level and at the downstream recognition level. The main contributions of this study can be summarized as follows:
An enhanced MFCC-driven sparse spike encoding method is introduced as a preprocessing extension to the existing fMRISNN-based speech recognition framework.
A controlled comparison is conducted between the proposed preprocessing pipeline and the conventional Lyon ear model + BSA pipeline at both the encoding level and the downstream recognition level under the same fMRISNN architecture.
Experimental results show that the proposed method reduces spike number, increases spike sparsity, and shortens encoding time while maintaining nearly the same speech recognition accuracy.
The fMRISNN reservoir remains competitive with baseline models using algorithmically generated topologies. Although the deep convolutional neural network (CNN) still achieves higher absolute recognition accuracy, the fMRISNN exhibits clear advantages in terms of model parameter size and theoretical energy efficiency.
The rest of this paper is organized as follows. Section 2 describes the construction of the fMRISNN. Section 3 presents the enhanced MFCC-driven sparse spike encoding method. Section 4 introduces the fMRISNN-based speech recognition framework and presents the experimental results. Section 5 discusses the speech recognition mechanism of the fMRISNN. Finally, Section 6 concludes the paper.
3. Enhanced MFCC-Driven Sparse Spike Encoding
Raw speech data, as an analog signal, must be converted into spike trains to be effectively processed by neuron models. In traditional methods, the Lyon ear model is widely employed due to its bio-plausibility and capability to capture spectral features [33]. Combined with the BSA, it enables efficient spike encoding [21]. However, this approach has limitations, including relatively high computational cost and insufficient adaptability to temporal variations, especially in complex acoustic environments [22]. To address these limitations, this study proposes an enhanced MFCC-driven sparse spike encoding method. This method incorporates auditory-inspired mechanisms including the Mel-frequency scale [34], non-uniform filter bank analysis [35], and logarithmic compression [36] during the feature extraction process. Compared with the conventional Lyon ear model + BSA pipeline, the proposed method is designed to improve speech input representation in terms of spike number, sparsity, encoding time, and recognition accuracy within the same fMRISNN-based speech recognition framework. The speech preprocessing pipeline based on the enhanced MFCC-driven sparse spike encoding method is illustrated in Figure 5.
3.1. Improved Time–Frequency Feature Extraction
An enhanced MFCC feature extraction scheme is employed, which augments the standard MFCC by incorporating first-order (Δ) and second-order (ΔΔ) dynamic features estimated using a central difference derivative estimator. This combined feature set captures both the static spectral characteristics and the dynamic variation patterns of the speech signal, resulting in a 39-dimensional feature vector (13-dimensional static MFCC + 13-dimensional Δ + 13-dimensional ΔΔ). The feature extraction process can be formally described as follows:
a. Pre-emphasis, framing, and windowing are applied to the speech signal.
b. The Fast Fourier Transform (FFT) magnitude spectrum is computed for each frame.
c. Filtering is performed using a Mel filter bank to obtain the Mel spectrum.
d. The logarithm is taken of the Mel spectrum, followed by Discrete Cosine Transform (DCT).
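e. The first-order and second-order dynamic features are computed using a central difference derivative estimator:
$$\Delta c_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2}, \qquad \Delta\Delta c_t = \frac{\sum_{n=1}^{N} n\,(\Delta c_{t+n} - \Delta c_{t-n})}{2\sum_{n=1}^{N} n^2}$$
where $c_t$ represents the MFCCs of the $t$-th frame, $\Delta c_t$ represents the first-order dynamic feature, $\Delta\Delta c_t$ represents the second-order dynamic feature, and $N$ is the context window size, typically set to 2.
f. The feature vectors are concatenated:
$$f_t = [\,c_t,\ \Delta c_t,\ \Delta\Delta c_t\,]$$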
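Steps e and f can be illustrated with a minimal sketch. It assumes the widely used regression form of the central difference estimator over ±N frames and repeats the edge frames at the sequence boundaries; both choices, as well as the function names `delta` and `enhanced_mfcc`, are assumptions made for illustration and are not taken from the original implementation. Plain Python lists are used for portability.

```python
def delta(frames, N=2):
    """Regression-style central difference over +/-N frames.

    frames: list of T frames, each a list of coefficients.
    Edge frames are repeated at the boundaries (an assumption).
    """
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for k in range(len(frames[t])):
            num = 0.0
            for n in range(1, N + 1):
                ahead = frames[min(t + n, T - 1)][k]
                behind = frames[max(t - n, 0)][k]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def enhanced_mfcc(static):
    """Concatenate static, first-order, and second-order features per frame."""
    d1 = delta(static)   # first-order dynamic features
    d2 = delta(d1)       # second-order dynamic features
    return [c + a + b for c, a, b in zip(static, d1, d2)]


# Toy input: 10 frames x 13 coefficients, each coefficient a ramp in time.
static = [[float(t * (k + 1)) for k in range(13)] for t in range(10)]
features = enhanced_mfcc(static)
print(len(features), len(features[0]))
```

On this ramp input, each interior first-order feature equals the slope of its coefficient and the second-order feature at the central frames is zero, which is a convenient sanity check for the estimator.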
3.2. Temporal Alignment and Normalization
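The dynamic time warping (DTW) algorithm is introduced to handle speech signals of variable lengths. It aligns sequences of different lengths by finding a minimum-cost path. The alignment distance is defined as
$$D(i, j) = d(i, j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\}$$
where $D(i, j)$ is the accumulated alignment distance up to frames $i$ and $j$, and $d(i, j)$ is the feature distance between frame $i$ and frame $j$.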
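The minimum-cost path search can be sketched with the standard DTW dynamic-programming recurrence. This is an illustrative sketch only: the Euclidean frame distance and the function names `frame_distance` and `dtw_distance` are assumptions, since the text does not specify which feature distance or step pattern is used.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the minimum-cost alignment path."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # D[i][j] is the best accumulated cost aligning the first i frames of
    # seq_a with the first j frames of seq_b.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


short = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(short, stretched))
```

Because `stretched` only repeats a frame of `short`, the two sequences align at zero cost even though their lengths differ, which is the property that makes DTW useful for utterances of variable duration.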
The aligned feature sequences are then normalized using the layer normalization method:
$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the features, respectively, and $\epsilon$ is a small constant to prevent division by zero.
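The normalization step can be sketched as follows, assuming the common layer-normalization form in which the statistics are computed over the feature dimension of each frame and the small constant is added to the variance under the square root. The function name `layer_norm` and the default `eps=1e-5` are illustrative assumptions.

```python
import math


def layer_norm(frame, eps=1e-5):
    """Normalize one frame to zero mean and (approximately) unit variance."""
    mu = sum(frame) / len(frame)
    var = sum((x - mu) ** 2 for x in frame) / len(frame)
    return [(x - mu) / math.sqrt(var + eps) for x in frame]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

After normalization the frame has zero mean, and its variance is slightly below one because of the added constant.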
For the spoken digit “five”, the preprocessed time–frequency features are shown in Figure 6.
3.3. Spike Encoding Based on Sigma–Delta Modulation
An improved Sigma–Delta modulator is employed to convert continuous feature values into binary spike sequences. This modulator utilizes an integrate–quantize–feedback mechanism to generate sparse spike trains while maintaining a high signal-to-noise ratio. The modulation process can be described by the following difference equation:
where
is the input feature sequence,
is the output spike sequence,
is the error signal,
is the integrator state, and
is the threshold parameter. The transfer function of this modulator is
where
is the feedback coefficient, which controls the stability of the modulator.
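The integrate–quantize–feedback mechanism can be illustrated with a generic first-order Sigma–Delta loop. This sketch is not the improved modulator described above: it omits the feedback coefficient, assumes unit feedback and inputs scaled to the range [0, 1], and uses a hypothetical function name `sigma_delta_encode` with an illustrative default threshold.

```python
def sigma_delta_encode(x, theta=0.5):
    """Generic first-order Sigma-Delta encoder (illustrative assumption).

    x: input samples, assumed to lie in [0, 1].
    theta: firing threshold applied to the integrator state.
    Returns a list of 0/1 spikes, one per input sample.
    """
    v = 0.0      # integrator state
    y_prev = 0   # previous output spike, fed back into the loop
    spikes = []
    for sample in x:
        e = sample - y_prev            # error between input and feedback
        v += e                         # integrate the error
        y = 1 if v >= theta else 0     # quantize against the threshold
        spikes.append(y)
        y_prev = y
    return spikes


spikes = sigma_delta_encode([0.25] * 400)
print(sum(spikes) / len(spikes))
```

For a constant input, the average firing rate of such a loop tracks the input level, so low-amplitude features produce few spikes. How the improved modulator in this work departs from this baseline is not specified here.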
To illustrate the output of the enhanced MFCC-driven sparse spike encoding method, the spike encoding result for the spoken digit “five” is shown in Figure 7.
As Figure 7 shows, the time–frequency features are converted into sparse binary spikes after spike encoding. Spikes are primarily distributed in time segments with relatively significant feature variations, while the number of spikes is notably reduced in regions with gentle changes. This indicates that the proposed enhanced MFCC-driven sparse spike encoding method can preserve the key time–frequency dynamic information of speech while suppressing redundant responses, thereby improving the sparsity and discriminability of the input representation. It thus provides subsequent brain-inspired models with an input form that aligns with biological neural encoding mechanisms. The results of converting the spoken digits “zero” to “nine” into spike sequences are shown in Figure 8.
Figure 8 shows the spike sequences obtained by Sigma–Delta modulation encoding of the 39-dimensional time–frequency features of different spoken digits. Each white point represents a spike firing event, reflecting the spike response patterns of the spoken digits at different time steps and across different time–frequency feature channels. This encoding method converts continuous analog speech features into sparse spike-timing representations, making the input form more compatible with the information processing mechanisms of SNNs and providing a biologically plausible input for their subsequent temporal computation. Different spoken digits show clear differences in spike-timing distribution, response across time–frequency feature channels, and firing density. This indicates that the proposed method preserves the time–frequency structure and dynamic variation information of the original spoken digits while mapping continuous features to spike representations, providing input representations for SNN-based spoken-digit recognition.
3.4. Comparative Evaluation of the Two Preprocessing Methods
To verify the effectiveness of the proposed preprocessing method, we compared the proposed enhanced MFCC-driven sparse spike encoding pipeline with the conventional Lyon ear model + BSA pipeline at the encoding level. Specifically, the comparison was conducted from three aspects: the average number of spikes generated per utterance, the sparsity ratio of the encoded spike representation, and the encoding time required for each utterance. We first compare the average number of spikes generated per utterance for the two preprocessing methods, as shown in Table 3.
Table 3 summarizes the average number of spikes per utterance produced by the two preprocessing methods for the ten spoken digits. The proposed method generates far fewer spikes than the conventional Lyon + BSA pipeline for all digit categories. For example, for the digit “zero”, the average number of spikes decreases from 5824.64 ± 648.10 to 309.75 ± 30.77, while for “one” it decreases from 5152.31 ± 721.35 to 303.96 ± 28.71. Similar trends are observed for all remaining digits. Overall, the spike count produced by the proposed method is reduced by an order of magnitude compared with the conventional method. This result indicates that the proposed preprocessing strategy can represent spoken-digit signals using a much more compact spike-based code, thereby substantially reducing redundant neural events at the input stage.
We next compare the sparsity ratio of the spike representations produced by the two preprocessing methods, as shown in Table 4.
The advantage of the proposed method is further confirmed by the sparsity ratio comparison reported in Table 4. For all ten spoken digits, the proposed method consistently achieves a markedly higher sparsity ratio than the Lyon + BSA pipeline. The sparsity ratios obtained by the conventional method range approximately from 0.874 to 0.932, whereas those of the proposed method are stably concentrated around 0.985–0.987. For instance, for the digit “zero”, the sparsity ratio increases from 0.874 ± 0.0140 to 0.986 ± 0.0013, and for “six”, it increases from 0.932 ± 0.0196 to 0.987 ± 0.0013. These results demonstrate that the proposed method not only reduces the absolute number of spikes but also produces a substantially sparser representation overall. From the perspective of neuromorphic computing, such a representation is highly desirable because sparse spike trains can reduce computational burden and improve energy efficiency in spike-driven systems.
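The sparsity ratio is not defined explicitly in this section. A minimal sketch under one common assumption, namely that it is the fraction of channel–time slots in which no spike occurs, is given below; the function name `sparsity_ratio` and the toy spike matrix are purely illustrative.

```python
def sparsity_ratio(spike_matrix):
    """Fraction of silent slots in a binary spike matrix (assumed definition).

    spike_matrix: list of channels, each a list of 0/1 values over time.
    """
    total = sum(len(row) for row in spike_matrix)
    fired = sum(sum(row) for row in spike_matrix)
    return 1.0 - fired / total


toy = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(sparsity_ratio(toy))
```

Under this definition, a value closer to 1 means fewer active slots, which matches the direction of improvement described for Table 4.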
We further compare the encoding time required by the two preprocessing methods, as shown in Table 5.
Table 5 shows that the proposed preprocessing method is consistently faster than the conventional Lyon + BSA pipeline across all digit categories. For example, for the digit “zero”, the encoding time decreases from 31.3 ± 8 ms to 5.4 ± 1 ms, while for “one”, it decreases from 26.5 ± 4 ms to 3.4 ± 0.6 ms. Similar reductions are observed for the remaining digits, with the proposed method typically requiring only about 3.4–5.4 ms per utterance, compared with approximately 25.5–31.3 ms for the conventional method. This substantial reduction in encoding time indicates that the proposed preprocessing strategy provides a clear computational advantage and is therefore more suitable for efficient spike-based speech-processing systems.
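The size of these reductions can be made explicit with simple arithmetic on the mean values quoted above for the digits “zero” and “one”. The sketch below uses only those reported means and ignores the ± spreads; the dictionary layout and the function name `reduction_factors` are illustrative.

```python
# Mean values quoted in the text: (Lyon + BSA, proposed method).
reported = {
    "zero": {"spikes": (5824.64, 309.75), "time_ms": (31.3, 5.4)},
    "one": {"spikes": (5152.31, 303.96), "time_ms": (26.5, 3.4)},
}


def reduction_factors(table):
    """Ratio of the baseline value to the proposed value for each metric."""
    return {
        digit: {
            metric: baseline / proposed
            for metric, (baseline, proposed) in metrics.items()
        }
        for digit, metrics in table.items()
    }


factors = reduction_factors(reported)
for digit, f in factors.items():
    print(f"{digit}: {f['spikes']:.1f}x fewer spikes, "
          f"{f['time_ms']:.1f}x shorter encoding time")
```

For these two digits, the quoted means correspond to roughly 17 to 19 times fewer spikes and roughly 6 to 8 times shorter encoding time.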