Robust Cochlear-Model-Based Speech Recognition

Abstract: Accurate speech recognition can provide a natural interface for human-computer interaction. Recognition rates of modern speech recognition systems depend strongly on background noise levels, and the choice of acoustic feature extraction method can have a significant impact on system performance. This paper presents a robust speech recognition system based on a front-end motivated by human cochlear processing of audio signals. In the proposed front-end, cochlear behavior is first emulated by the filtering operations of the gammatone filterbank and subsequently by the Inner Hair Cell (IHC) processing stage. In experiments using a continuous density Hidden Markov Model (HMM) recognizer, recognition rates obtained with the proposed Gammatone Hair Cell (GHC) coefficients are lower than those of the standard Mel-Frequency Cepstral Coefficients (MFCC) baseline for clean speech, but significantly higher in noisy conditions.


Introduction
Speech is the most important means of human communication, and enabling computers and other smart devices to communicate via speech would be a significant step forward in their interaction with humans. Speech perception and recognition have intrigued scientists from the early works of Fletcher [1] and the first speech recognition systems at Bell Labs [2] to the present day, and yet machine recognition is still outperformed by humans.
In quiet environments, high recognition accuracy can be achieved. However, in noisy environments, the performance of a typical speech recognizer degrades significantly, e.g., 50% in a cafeteria environment and 30% in a car traveling at 90 km/h [3]. The influence of the environment and other factors on speech recognition is investigated in [4]. As the technology advances, speech recognition will be deployed on more devices that are used in everyday life, where environmental factors play an important role, e.g., speech recognition applications for mobile phones [5], cars [6], automated access-control and information systems [7], emotion recognition systems [8], monitoring applications [9], assistance for handicapped people [10], and smart homes [11]. Besides speech, many applications of acoustics are also important in various engineering problems [12][13][14][15][16][17][18]. To improve performance in real-world noisy environments, a noise reduction technique can be used [19][20][21][22].
Comparisons using many speech corpora demonstrate that word error rates of machines are often more than an order of magnitude greater than those of humans for quiet, wideband, read speech. Machine performance degrades further below that of humans in noise, with channel variability, and for spontaneous speech [23]. Until the performance of automatic speech recognition (ASR) surpasses human performance in accuracy and robustness, we stand to gain by understanding the basic principles behind human speech recognition (HSR) [24].
Despite the progress in understanding auditory processing mechanisms, only a few aspects of sound processing in the auditory periphery are modeled and simulated in common front-ends for ASR systems [25]. For example, popular parameterizations such as MFCC employ auditory features like a variable-bandwidth filter bank and magnitude compression. Perceptual Linear Prediction (PLP) coefficients are obtained by applying perceptual processing (critical-band resolution curves, equal-loudness scaling, and the cube-root power law of hearing) to linear prediction coefficients (LPC) [26]. One example of an auditory-motivated improvement of speech representation is synaptic adaptation. In [27], a simplified model of synaptic adaptation was derived and integrated into conventional MFCC feature extraction. Results showed a significant improvement in speech recognition performance.
In [28], new Power Normalized Cepstral Coefficients (PNCC) based on auditory processing were proposed. The new features include the use of a power-law nonlinearity, a noise-suppression algorithm based on asymmetric filtering, and temporal masking. Experimental results demonstrated improved recognition accuracy compared to MFCC and PLP processing. Another approach to feature extraction is based on deep neural networks (DNN); the noise robustness of DNN-based acoustic models was evaluated in [29]. In [30], Recurrent Neural Networks (RNN) were introduced to clean distorted input features (MFCCs). The model was trained to predict clean features when presented with a noisy input. To handle highly non-stationary additive noise, the use of LSTM-RNNs was proposed in [31]. A detailed overview of deep learning for robust speech recognition can be found in [32].
To better simulate the human auditory periphery, standard MFC or PLP coefficients could be replaced with coefficients based on a cochlear model. In [33], auditory front-ends based on the models of Seneff [34] and Ghitza [35] were evaluated on clean and noisy speech and compared with a control mel filter bank (MFB) based cepstral front-end. Results showed that front-ends based on the human auditory system perform comparably to, and can slightly reduce the error rate of, an MFB cepstral based speech recognition system for isolated words under noise and some spectral variability conditions.
In this paper, we propose a front-end based on acoustic features obtained by gammatone filterbank analysis followed by the IHC processing stage. Gammatone filtering models the cochlea by a bank of overlapping bandpass filters mimicking the structure of the peripheral auditory processing stage. Its performance as a speech recognition front-end has been investigated in several papers, and improvement over the MFC baseline was demonstrated [36][37][38][39]. Our idea is to further improve the model by adding the IHC processing stage. IHC modeling transforms the basilar membrane displacements into an auditory nerve firing pattern. We add the hair cell model to the back-end of a gammatone filterbank to further mimic the human auditory periphery and form a more complete cochlear model. Based on this model, new GHC coefficients are proposed. To evaluate the robustness of the proposed front-end, we have developed a continuous speech HMM recognizer for Croatian speech.

Cochlear-Based Processing for ASR
Incoming sound pressure is transformed by the cochlea into vibrations of the basilar membrane, which are then transformed into a series of neural impulses. The cochlea can be seen as a system designed to analyze frequency components in complex sounds: it acts as a frequency analyzer in which each position along the basilar membrane corresponds to a particular frequency.
The cochlea is shaped as a small tube, about 3.5 cm long and 1 cm wide. The main structural element within the cochlea is a flexible basilar membrane, which varies in width and stiffness along the cochlea and separates two liquid-filled tubes. It contains the organ of Corti, a very sophisticated structure that responds to basilar membrane vibrations and allows for transduction into nerve impulses (Figure 1). Positioned along the organ of Corti are three rows of outer hair cells (OHCs) and one row of inner hair cells (IHCs). The IHCs are the actual sensory receptors; through mechanotransduction, hair cells detect movement in their environment and generate neural impulses. At the limits of human hearing, hair cells can faithfully detect movements of atomic dimensions and respond within tens of microseconds. Furthermore, hair cells can adapt rapidly to constant stimuli, thus allowing the listener to extract signals from a noisy background [41].

Gammatone Filterbank
In auditory modeling, the filterbank is one of the most common concepts used to represent the characteristics of the basilar membrane (BM). Since each position of the basilar membrane responds to a particular frequency contained in the speech signal, each bandpass filter models the frequency characteristics of the BM at one position.
The gammatone filterbank contains non-uniform overlapping bandpass filters designed to mimic the basilar membrane characteristics. It was first introduced by Johannesma [42]. A gammatone filter impulse response is defined in the time domain as the product of a gamma distribution and a tone. The gammatone function is defined as

$$g(t) = a\,t^{n-1} e^{-2\pi b t} \cos(2\pi f_c t + \phi) \qquad (1)$$

where $n$ is the order of the filter (affects the slope of the filter skirts), $b$ is the bandwidth of the filter (affects the duration of the impulse response), $a$ defines the output gain, $f_c$ is the filter center frequency, and $\phi$ is the phase. For filter orders in the range 3-5, Patterson [43] showed that the magnitude response of the gammatone filter is very similar to that of the roex(p) filter commonly used to represent the magnitude characteristic of the human auditory filter [44].
The equivalent rectangular bandwidth (ERB) of the filter is given by the equation [45]

$$\mathrm{ERB} = 24.7\,(4.37 f_c / 1000 + 1). \qquad (2)$$

When the order is 4, the bandwidth $b$ of the gammatone filter is 1.019 ERB. Figure 2a shows the gammatone impulse response, obtained from Equation (1), of a single filter centered at 1000 Hz. It can be regarded as a measure of the BM displacement at a particular position.
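The impulse response described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: it assumes the standard gammatone form $a\,t^{n-1} e^{-2\pi b t}\cos(2\pi f_c t + \phi)$, and the function names `erb` and `gammatone_ir`, the 25 ms duration, and the default gain and phase are hypothetical choices made for the example.

```python
import numpy as np

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz, as in Equation (2)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz=16000, duration_s=0.025, order=4,
                 gain=1.0, phase=0.0):
    """Sampled gammatone impulse response (assumed standard form).

    The bandwidth b = 1.019 * ERB is the value quoted in the text for a
    4th-order filter; duration, gain and phase are illustrative defaults.
    """
    t = np.arange(int(duration_s * fs_hz)) / fs_hz
    b = 1.019 * erb(fc_hz)
    envelope = gain * t ** (order - 1) * np.exp(-2.0 * np.pi * b * t)
    return t, envelope * np.cos(2.0 * np.pi * fc_hz * t + phase)

# Impulse response of a single filter centered at 1000 Hz
t, g = gammatone_ir(1000.0)
peak_time_ms = 1000.0 * t[np.argmax(np.abs(g))]
```

The gamma envelope rises from zero, peaks after a few milliseconds, and then decays, which is why a larger bandwidth $b$ gives a shorter impulse response.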
These filters are then combined to form an auditory filterbank used to simulate the motion of the basilar membrane. The output of each filter models the response of the basilar membrane at a single place (Figure 2a). Filter center frequencies are equally distributed on the ERB scale [45].
Frequency domain responses of a gammatone filterbank with 20 filters whose center frequencies are equally spaced between 100 Hz and 8 kHz on the ERB scale are shown in Figure 2b. Unlike a traditional spectrogram, which has a constant bandwidth across all frequency channels, the gammatone model yields a representation similar to the cochlea's frequency subbands, which get wider at higher frequencies. In this work, we used Slaney's implementation of a gammatone filterbank [46], with the default 64 filters spaced from 50 Hz to 8 kHz (speech is sampled at 16 kHz).
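Equal spacing on the ERB scale can be illustrated by integrating $1/\mathrm{ERB}(f)$ from Equation (2), which gives an ERB-rate value that grows logarithmically with frequency. The sketch below is an assumption-laden illustration rather than the spacing routine of the toolbox used here: the function names are hypothetical, and placing the first and last centers exactly on the band edges is a choice made for the example.

```python
import math

# Scale factor obtained by integrating 1 / ERB(f) with ERB from Equation (2)
ERB_RATE_SCALE = 1000.0 / (24.7 * 4.37)

def hz_to_erb_rate(f_hz):
    """Number of ERBs below frequency f_hz."""
    return ERB_RATE_SCALE * math.log(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (math.exp(e / ERB_RATE_SCALE) - 1.0) * 1000.0 / 4.37

def erb_spaced_centers(f_low_hz, f_high_hz, n_filters):
    """Center frequencies equally spaced on the ERB-rate scale.

    Assumption: both band edges are used as center frequencies.
    """
    lo = hz_to_erb_rate(f_low_hz)
    hi = hz_to_erb_rate(f_high_hz)
    step = (hi - lo) / (n_filters - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_filters)]

# 20 filters between 100 Hz and 8 kHz, as in the Figure 2b configuration
centers = erb_spaced_centers(100.0, 8000.0, 20)
```

Because the mapping back to Hz is exponential, neighboring centers are close together at low frequencies and progressively further apart at high frequencies.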

IHC Model
To further mimic the human auditory periphery and form a more complete auditory model, we add the IHC model to the back-end of the gammatone filterbank. Our proposed front-end for ASR is thus constructed by processing the output of each gammatone filter with the IHC model. We used Meddis' model of hair cell transduction [47].
Each gammatone filter output is converted by the hair cell model into a probabilistic representation of firing activity in the auditory nerve, incorporating well-known effects such as saturation and adaptation.
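IHC function is characterized in the Meddis model by describing the dynamics of the neurotransmitter at the hair cell synapse [48]. Transmitter is transferred between three reservoirs in a reuptake and re-synthesis process loop (see Figure 3 and Equations (3)-(7)). The equations representing the model are

$$k(t) = \begin{cases} \dfrac{g\,(s(t) + A)}{s(t) + A + B}, & s(t) + A > 0 \\ 0, & s(t) + A \le 0 \end{cases} \qquad (3)$$

$$\frac{dq}{dt} = y\,(1 - q(t)) + x\,w(t) - k(t)\,q(t) \qquad (4)$$

$$\frac{dc}{dt} = k(t)\,q(t) - l\,c(t) - r\,c(t) \qquad (5)$$

$$\frac{dw}{dt} = r\,c(t) - x\,w(t) \qquad (6)$$

$$\mathrm{prob(event)} = h\,c(t)\,dt. \qquad (7)$$

The permeability of the cell membrane is represented by $k(t)$; $A$, $B$, and $g$ are the model constants; $s(t)$ is the instantaneous amplitude; $q(t)$ is the level of available transmitter in the pool; $y$ is the replenishment rate factor (from the factory); $c(t)$ is the transmitter content of the synaptic cleft; $l$ is a loss factor; $r$ is a return factor from the cleft; $w(t)$ is the content of the reprocessing store; and $x$ is the rate at which reprocessed transmitter is returned to the pool.
The probability of the afferent nerve firing (Equation (7)) is assumed to be proportional to the remaining level of transmitter in the cleft. The constant $h$ is the proportionality factor used to scale the output for comparison with empirical data.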
Figure 4 shows the synaptic cleft contents obtained when the IHC model is applied to a sequence of 1 kHz tone bursts, each 0.25 s long and ranging in amplitude from 40 dB to 85 dB in 5 dB steps. The model replicates many well-known nerve responses such as rectification, compression, spontaneous firing, and adaptation [49].
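To make the transmitter dynamics concrete, the following sketch integrates the commonly published form of the Meddis hair cell equations (free pool $q$, cleft $c$, and a reprocessing store $w$ returned at rate $x$) with a simple forward Euler step. It is an illustration under stated assumptions, not the code used for the experiments: the function name `meddis_ihc`, the Euler scheme, the arbitrary input amplitude units, and the default parameter values (taken from values commonly quoted for this model) are all assumptions.

```python
import numpy as np

def meddis_ihc(s, fs_hz, A=5.0, B=300.0, g=2000.0, y=5.05,
               l=2500.0, r=6580.0, x=66.31, h=50000.0):
    """Firing probability per sample for one filterbank channel.

    s is the instantaneous amplitude (arbitrary units, an assumption).
    Parameter defaults are commonly quoted values, not taken from the text.
    """
    dt = 1.0 / fs_hz
    # Start from the resting (spontaneous) steady state for s = 0
    k0 = g * A / (A + B)
    c = y * k0 / (l * k0 + y * (l + r))   # cleft contents
    q = c * (l + r) / k0                  # free transmitter pool
    w = c * r / x                         # reprocessing store
    prob = np.empty(len(s))
    for i, si in enumerate(s):
        # Membrane permeability, rectified at s + A <= 0
        k = g * (si + A) / (si + A + B) if si + A > 0 else 0.0
        dq = y * (1.0 - q) + x * w - k * q
        dc = k * q - l * c - r * c
        dw = r * c - x * w
        q += dq * dt
        c += dc * dt
        w += dw * dt
        prob[i] = h * c * dt              # prob(event) = h c(t) dt
    return prob

# One 0.25 s, 1 kHz tone burst sampled at 16 kHz
fs = 16000
t = np.arange(int(0.25 * fs)) / fs
p = meddis_ihc(1000.0 * np.sin(2.0 * np.pi * 1000.0 * t), fs)
```

With these settings the output is largest at the burst onset and settles to a lower level as the free pool is depleted, which is the adaptation effect mentioned above; with a silent input the output stays at its spontaneous level.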

Speech Recognition Experiments
Our speech recognition system is based on continuous density HMM models and was developed using the HTK toolkit [50]. To evaluate the proposed cochlear based front-end, we constructed one system based on the standard MFCC front-end and one based on the cochlear based front-end (including gammatone filtering and IHC processing).
Our speech database is based on the texts of short weather forecasts for the Adriatic coast. It was recorded in a quiet office by 12 male speakers and contains 673 sentences in Croatian (5731 words) sampled at 16 kHz with 16 bits. The vocabulary size is 362 words. The data and the speakers were divided into two sets: one for training and one for testing.
Acoustic modeling started with simple monophone continuous Gaussian density HMMs with three states (left-right topology) and a diagonal covariance matrix for each of the 30 Croatian phonemes. Models were trained with feature vectors of 39 elements (13 static + 13 velocity + 13 acceleration coefficients) representing 25 ms segments of speech, computed every 10 ms. We used a bigram language model.
In the next step, context-dependent triphone models were constructed from the monophone HMMs. Context-dependent models provide better modeling accuracy, but the number of models increases significantly and the problem of insufficient training data arises. To handle this problem, a state tying strategy was applied, according to Croatian phonetic rules. Not only does this procedure ensure enough acoustic material to train all context-dependent HMMs, but it also enables modeling of acoustic units not present in the training data (simply by passing them down the phonetic decision tree). The system was further refined by converting the single Gaussian HMMs to multiple mixture components. We used six mixture components per state.
Besides the standard MFCC based front-end, we have also developed a cochlear based front-end in which the speech signal is first processed by the gammatone filterbank and the output of each gammatone filter is then processed by the IHC model. To obtain the standard speech recognition segmentation, the output of the model is temporally integrated over 25 ms segments (every 10 ms) and the discrete cosine transform (DCT) is applied. Similarly, in MFCC calculation the DCT is applied to the "auditory" spectrum obtained after mel-warping of the frequency axis and logarithm calculation. The number of coefficients is chosen to be 13, the same as for the standard MFCC baseline. We will refer to these new feature vectors as Gammatone Hair Cell (GHC) coefficients. Besides these static spectral feature vectors, standard practice is to also use dynamic feature vectors (velocity and acceleration) to better model the spectral dynamics. These vectors are concatenated with the static vector to form a combined feature vector.
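The feature construction just described (temporal integration, DCT, then velocity and acceleration) can be sketched as follows. This is a rough illustration only: the function names are hypothetical, averaging is assumed as the form of temporal integration, an unnormalized DCT-II is assumed, and the regression-style delta formula with a window of two frames is a common convention that the text does not specify.

```python
import numpy as np

def dct2(frames, n_coef):
    """Unnormalized DCT-II across channels; frames is (n_frames, n_channels)."""
    n = frames.shape[1]
    k = np.arange(n_coef)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))   # (n_coef, n_channels)
    return frames @ basis.T

def ghc_static(ihc_out, fs_hz=16000, win_s=0.025, hop_s=0.010, n_coef=13):
    """Static coefficients from IHC output of shape (n_channels, n_samples)."""
    win = int(round(win_s * fs_hz))
    hop = int(round(hop_s * fs_hz))
    n_frames = 1 + (ihc_out.shape[1] - win) // hop
    # Temporal integration: mean of each channel over each 25 ms window
    frames = np.stack([ihc_out[:, i * hop:i * hop + win].mean(axis=1)
                       for i in range(n_frames)])
    return dct2(frames, n_coef)

def deltas(feat, width=2):
    """Regression-based dynamic coefficients (assumed convention)."""
    n = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    num = sum(th * (padded[width + th:width + th + n]
                    - padded[width - th:width - th + n])
              for th in range(1, width + 1))
    return num / (2.0 * sum(th * th for th in range(1, width + 1)))

# Placeholder IHC output: 64 channels, 1 s at 16 kHz (random, for shape only)
ihc_out = np.abs(np.random.default_rng(0).normal(size=(64, 16000)))
static = ghc_static(ihc_out)
vel = deltas(static)
features = np.hstack([static, vel, deltas(vel)])   # 39 values per frame
```

Each row of `features` is one 39-element vector (13 static, 13 velocity, 13 acceleration), matching the dimensionality used for HMM training.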
A block diagram of the ASR system with the proposed cochlear based front-end is given in Figure 5. Depending on the number of coefficients used, the auditory spectrum is approximated in more or less detail. Although a higher number of coefficients means a better approximation, it does not necessarily mean better speech recognition performance. Ideally, coefficients should be as different as possible for different phonemes but, at the same time, as similar as possible for the same phoneme uttered in different words, with different intonations, by different speakers, and in different recording conditions (noise). Figure 6 shows the effect of white noise on the standard MFC and our GHC based speech representations. The GHC coefficients are visibly more robust to noise than the standard MFCCs.

Recognition results are reported in Table 1 as correctness and accuracy, where S, D, and I represent the numbers of substitution, deletion, and insertion errors and N is the total number of words. Maximum recognition rates for each condition are shown in bold. Besides the comparison to standard MFCCs and PLP coefficients, we also included PNC coefficients, which are based on auditory processing and include the use of a power-law nonlinearity, a noise-suppression algorithm based on asymmetric filtering, and temporal masking [28,51]. In addition, we evaluated the average performance across all conditions. The statistical significance of the performance improvement over the MFC baseline was assessed using the difference of proportions significance test [52]. An accuracy comparison is also given in Figure 7.

In clean speech conditions, the recognition performance of the GHC based system is lower than that of the other approaches. It should be noted that the same number of coefficients was used in the GHC based system as in the MFC based system, and it is possible that a different number of coefficients would better fit the GHC based system. As the noise level increases, the recognition rates of the cochlear model based front-end become higher than those of the standard MFCC based front-end and the PLP front-end. The difference is statistically significant for all conditions (p < 0.05). Our results are comparable with the state-of-the-art PNCC front-end, which is also based on auditory processing; a significance test between their average performances shows no statistically significant difference (p > 0.05). When the SNR is below 10 dB, recognition rates are close to 0% for MFC and PLP, and around 50% for the GHC based system. The average GHC based performance is ≈20 percentage points higher than the standard MFC baseline (p < 0.05).
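The evaluation measures and the significance test can be illustrated with a short sketch. Correctness and accuracy follow the definitions given with Table 1, $(N - D - S)/N$ and $(N - D - S - I)/N$. The test shown is a pooled two-proportion z-test with a two-sided p-value, which is one common form of a difference of proportions test and is assumed here rather than taken from the text; the function names and all numbers in the example are hypothetical.

```python
import math

def correctness(n, d, s):
    """Percent correct: (N - D - S) / N."""
    return 100.0 * (n - d - s) / n

def accuracy(n, d, s, i):
    """Percent accuracy: (N - D - S - I) / N."""
    return 100.0 * (n - d - s - i) / n

def two_proportion_test(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z-test (assumed form); returns (z, p_value)."""
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("pooled proportion is 0 or 1; test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    return z, math.erfc(abs(z) / math.sqrt(2.0))

# Made-up counts for illustration only (not results from this paper)
acc = accuracy(n=100, d=5, s=10, i=5)
z, p = two_proportion_test(hits_a=700, n_a=1000, hits_b=500, n_b=1000)
significant = p < 0.05
```

Note that accuracy can become negative when the recognizer produces many insertions, whereas correctness cannot, which is why both measures are usually reported together.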

Conclusions
In this paper, we have proposed a speech recognition front-end motivated by cochlear processing of audio signals. Cochlear behavior is first emulated by the filtering operations of the gammatone filterbank and subsequently by the IHC processing stage. Experimental results using a continuous density HMM recognizer clearly demonstrate the robustness of the proposed system compared to standard MFC and PLP front-ends. Although the recognition rates are lower for clean speech, they are greatly improved in noisy conditions. We have also compared our system with the PNCC front-end, which is also based on auditory processing, and achieved comparable results. Our future work will include a more detailed analysis of the proposed approach with possible improvements (especially in clean speech conditions), as well as an analysis with various types of additive noise on a well-known database, comparison with other state-of-the-art front-ends, and a computational efficiency analysis.

Figure 1. Cross-section of the cochlea with enlarged organ of Corti [40].
Figure 2b. Frequency response of 20 gammatone filters.

Figure 5. HMM recognizer based on the GHC front-end. First, gammatone filtering is applied to the speech signal; then, the output of each gammatone filter is processed by the IHC model, resulting in an auditory representation of the speech signal from which GHC coefficients are constructed and used in the HMM recognizer.

Figure 6. Comparison of auditory spectrograms obtained from 13 MFCCs (top panels) and 13 GHCs (bottom panels) for the same test sentence in clean conditions (left column) and white noise conditions (middle column: signal-to-noise ratio (SNR) = 10 dB; right column: SNR = 0 dB).
Table 1 shows the recognition results of the baseline MFCC, PLP and PNCC based front-ends and our cochlear model (GHC coefficients) based front-end in clean and white noise conditions in terms of correctness, $\mathrm{Corr} = (N - D - S)/N$, and accuracy, $\mathrm{Acc} = (N - D - S - I)/N$.
52. Pagano, M.; Gauvreau, K. Principles of Biostatistics; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Table 1. Recognition rates (%) with significance levels p (in parentheses) against the MFC baseline.