Automatic speech recognition (ASR) applications are now widespread in daily life; examples include voice-based command control of robots, speech recognition on mobile devices, and speech-driven web search. However, the performance of an ASR system usually degrades significantly when it is deployed in an environment with interferences such as additive noise and channel distortion. In recent decades, researchers have focused on developing noise-robust methods to compensate for these interference effects and improve ASR performance, and these methods can be roughly classified into two categories: feature-based and model-based.
Generally speaking, most feature-based methods are developed to improve the noise robustness of existing and widely used speech features such as perceptual linear prediction (PLP) coefficients [1], mel-frequency cepstral coefficients (MFCC) [2], logarithmic mel-filter-bank coefficients (FBANK) [2], and Gammatone frequency cepstral coefficients [3], and some of them operate at intermediate stages of the feature extraction process. For example, spectral subtraction (SS) [4], Wiener filtering [6], and MMSE-based short-time spectral amplitude estimation [7] process the frame-wise acoustic spectra, which are then converted to the final features for speech recognition. In addition, a variety of statistical moment normalization methods operate on the intermediate and final stages of feature extraction and give significant improvements in recognition accuracy under noise-corrupted conditions; examples include mean normalization (MN) [9], mean and variance normalization (MVN) [10], and histogram equalization (HEQ) [11], to name but a few.
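As a minimal illustration of such moment normalization (a generic sketch, not the exact procedure of any cited work), MN subtracts the per-dimension temporal mean of an utterance, and MVN additionally scales by the standard deviation:

```python
import numpy as np

def mvn(features, eps=1e-8):
    """Mean and variance normalization of a (dim x frames) feature matrix:
    each feature dimension is standardized across the frames of the utterance."""
    mu = features.mean(axis=1, keepdims=True)     # per-dimension temporal mean (MN)
    sigma = features.std(axis=1, keepdims=True)   # per-dimension temporal deviation
    return (features - mu) / (sigma + eps)        # MN alone would return features - mu
```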
Furthermore, because the aforementioned statistical moments are evaluated directly from the temporal series of speech features, these moment normalization methods implicitly enhance the temporal characteristics of the features. By contrast, another line of feature enhancement directly and explicitly processes the temporal series of speech features; the respective methods include, but are not limited to, RASTA [13], temporal structure normalization (TSN) [14], and MVN plus ARMA filtering (MVA) [15]. Additionally, spectral histogram equalization (SHE) [16], modulation spectrum replacement/filtering (MSR/MSF) [17], and nonnegative matrix factorization (NMF)-based modulation spectrum enhancement [18] directly modify the modulation spectra, which here refer specifically to the Fourier transform of the feature time sequence.
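For reference, temporal filtering of this kind can be as simple as running an IIR band-pass filter along each feature trajectory. The sketch below uses the classic RASTA coefficients from Hermansky and Morgan; note that the pole value (0.98 here, as in the original formulation) varies slightly across implementations, so this is illustrative rather than a definitive replication of [13]:

```python
import numpy as np
from scipy.signal import lfilter

def rasta(features):
    """Band-pass filter each row (feature trajectory) of a (dim x frames)
    matrix with the classic RASTA transfer function
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator (band-pass)
    a = np.array([1.0, -0.98])                       # IIR denominator (leaky integrator)
    return lfilter(b, a, features, axis=1)           # filter along the time axis
```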
On the other hand, the model-based methods attempt to adapt existing acoustic models with noise information to make them better suited to the application environment. According to [19], these methods can be further split into two schools: general adaptation and noise-specific compensation. General-adaptation methods use a generic transformation to convert the acoustic model parameters; representative methods include maximum-likelihood linear regression (MLLR) [20], maximum likelihood linear transform (MLLT) [21], minimum classification error-based linear regression (MCELR) [22], and discriminative mapping transforms [24]. By contrast, noise-specific compensation methods update the acoustic model parameters by explicitly exploiting the characteristics of the noise present in the application environment; examples include parallel model combination (PMC) [25] and model-based vector Taylor series (VTS) [26]. Interested readers are referred to [19] for comprehensive coverage of recent noise-robust techniques for automatic speech recognition.
In this paper, we propose a feature-based method that uses robust principal component analysis (RPCA) [28] to extract noise-robust speech features. RPCA is a relatively new data analysis technique that has been widely used in speech enhancement and robust speech representation algorithms; some well-known examples are briefly described here. In [30], RPCA is applied to the spectrogram of speech signals, and the resulting sparse component is shown to contain less noise and thus to be noise-robust. Another speech enhancement method, proposed in [31], first decomposes a speech signal into sub-bands via a wavelet transform and then uses RPCA to extract the low-rank component of the matrix formed by the overlapped frames of each sub-band signal; the final output is the inverse wavelet transform of the low-rank sub-band signals. The method in [32] integrates RPCA and exemplar-based sparse representation in an SNR-dependent manner to process the spectrogram of a noise-corrupted signal, i.e., RPCA is used in low-SNR cases and exemplar-based sparse representation in high-SNR cases. In [33], RPCA is also used to decompose the spectrogram of a noise-corrupted signal, while the respective sparse component is further constrained to be a nonnegative weighted sum of pre-learned basis spectra. Briefly speaking, the algorithm in [30] applies RPCA in the spectro-temporal (spectrogram) domain of speech signals, whereas in [31], RPCA operates in the time domain of speech signals within each sub-band produced by a wavelet transform.
The newly proposed method differs from the aforementioned RPCA-based algorithms mainly in that it applies RPCA to the temporal series of FBANK/MFCC speech features, which are used directly in automatic speech recognition. Notably, FBANK/MFCC features are a nonlinear transform of the spectrogram of a time-domain speech signal due to the logarithmic operation. In the proposed scenario, each signal in the training and testing sets for an ASR system is converted to FBANK/MFCC features. Then, the feature time sequence, expressed in matrix form, is decomposed by RPCA into a sparse matrix and a low-rank matrix. Finally, the obtained sparse matrix is treated as the new feature representation for the subsequent training or testing. Compared with the matrix containing the original features, the sparse matrix is shown to highlight the relatively fast-varying component, which very probably corresponds to the speech-dominant elements and thus benefits speech recognition. In comparison, the associated low-rank matrix reveals more static characteristics that are likely related to the embedded noise. We evaluate the proposed RPCA-based feature extraction method on the Aurora-4 benchmark task [35], which consists of a medium-to-large vocabulary database and a recognition task based on the Wall Street Journal (WSJ) corpus [36]. In addition, a state-of-the-art deep neural network (DNN) architecture is used for acoustic modeling in the experiments. The evaluation results show that the proposed RPCA-based method significantly improves the recognition accuracy of the original features, with a relative word error rate reduction of up to 43%. We also show that this new method can be combined with the prevalent mean normalization (MN) and relative spectral (RASTA) methods to further improve recognition performance. These evaluation results indicate that the newly proposed method is quite promising for enhancing ASR and can broaden the corresponding applications in real environments.
This paper is organized as follows. In Section 2, we briefly introduce the RPCA algorithm. Section 3 presents the proposed RPCA-based feature extraction method. The experimental setup is described in Section 4, and the experimental results as well as the corresponding discussions and analyses are given in Section 5. Finally, Section 6 contains concluding remarks and suggestions for future work.
3. Proposed Method
In the presented method, the RPCA algorithm is applied to the matrix formed by the temporal sequence of speech features in order to extract the embedded noise-robust component. The procedure consists of the following two steps:
Step 1: Create the baseline features, FBANK and MFCC:
Each time-domain utterance in the training and test sets is first passed through a high-pass pre-emphasis filter, and the operations of framing and windowing are then performed in turn. Each windowed frame signal is converted to the acoustic frequency domain via the short-time Fourier transform (STFT) to create the corresponding acoustic spectrum. Next, each frame-wise acoustic spectrum is converted to FBANK or MFCC features: the magnitude spectrum of each frame is weighted by a mel-frequency filter bank and then processed with the logarithmic operation to produce the FBANK features, and the MFCC features are derived by applying the discrete cosine transform (DCT) to the FBANK features.
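For concreteness, the following Python sketch outlines Step 1 under assumed settings: the librosa library, 25 ms Hamming windows with a 10 ms shift, a 0.97 pre-emphasis coefficient, and the 8 kHz sampling rate used later in the experiments. These parameter values are illustrative rather than prescribed by the method.

```python
import numpy as np
import librosa

def fbank_mfcc(wav_path, sr=8000, n_mels=40, n_mfcc=13):
    """Compute FBANK (log mel-filter-bank) and MFCC feature matrices,
    each of shape (dim x frames), for one utterance."""
    y, _ = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])        # high-pass pre-emphasis
    mel = librosa.feature.melspectrogram(             # |STFT| -> mel filter bank
        y=y, sr=sr, n_fft=256, win_length=200,        # 25 ms Hamming window,
        hop_length=80, window='hamming',              # 10 ms frame shift
        n_mels=n_mels, power=1.0)                     # magnitude (not power) spectrum
    fbank = np.log(mel + 1e-10)                       # FBANK features
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)  # DCT of FBANK -> MFCC
    return fbank, mfcc
```

In practice, the static MFCCs would be further extended with their first- and second-order time derivatives to form the 39-dimensional vectors used in the experiments of Section 4.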
Step 2: Use RPCA to extract the sparse part of the FBANK/MFCC matrix
Let $\mathbf{o}_n$ denote the frame time series of FBANK or MFCC vectors obtained from Step 1 for a time-domain signal, where $n$ ($1 \le n \le N$) is the frame index and $N$ is the total number of frames. Then a matrix $\mathbf{O}$ is created by assigning its $n$-th column vector to be $\mathbf{o}_n$, and $\mathbf{O}$ is termed the feature matrix of the time-domain signal.
Next, the RPCA algorithm stated in Section 2 is used to decompose the feature matrix as $\mathbf{O} = \mathbf{L} + \mathbf{S}$, where $\mathbf{L}$ and $\mathbf{S}$ denote the low-rank and sparse component matrices of $\mathbf{O}$, respectively. Finally, we discard the low-rank part $\mathbf{L}$, while the sparse part $\mathbf{S}$ is preserved and treated as the new feature matrix for the subsequent processing in training and testing.
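As a reference for Step 2, RPCA here corresponds to the principal component pursuit problem $\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_* + \lambda\|\mathbf{S}\|_1$ subject to $\mathbf{O} = \mathbf{L} + \mathbf{S}$. The sketch below solves it with the widely used inexact augmented Lagrange multiplier (ALM) iteration; the weight $\lambda = 1/\sqrt{\max(m,n)}$, the penalty schedule, and the stopping rule are standard choices from the RPCA literature, not necessarily the exact settings of Section 2.

```python
import numpy as np

def rpca_ialm(O, lam=None, tol=1e-7, max_iter=500):
    """Split O into a low-rank part L and a sparse part S (O = L + S) via
    principal component pursuit, solved by inexact ALM iterations."""
    m, n = O.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))                # standard PCP weight
    norm_O = np.linalg.norm(O, 'fro')
    spec = np.linalg.norm(O, 2)                       # largest singular value
    Y = O / max(spec, np.abs(O).max() / lam)          # dual variable initialization
    mu, rho = 1.25 / spec, 1.5                        # penalty weight and its growth
    S = np.zeros_like(O)
    for _ in range(max_iter):
        # Low-rank update: singular-value thresholding of (O - S + Y/mu)
        U, sig, Vt = np.linalg.svd(O - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: entry-wise soft thresholding of (O - L + Y/mu)
        R = O - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual ascent and convergence check on the residual
        Z = O - L - S
        Y += mu * Z
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(Z, 'fro') <= tol * norm_O:
            break
    return L, S
```

Given the FBANK or MFCC matrix from Step 1, RPCA-SPC for one utterance then amounts to `_, S = rpca_ialm(fbank)` followed by using `S` in place of the original feature matrix.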
Because the main idea of the aforementioned method is to extract the sparse component of the feature matrix via RPCA, we will use the notation “RPCA-SPC” to denote this new method hereafter for ease of discussion.
The primary idea behind RPCA-SPC is as follows: for a noise-corrupted utterance and the corresponding speech feature sequence, the embedded noise part often varies more slowly with time than the clean-speech part. In other words, clean speech is likely to be more non-stationary than noise. This phenomenon is easily observed in the spectrogram of an utterance: the spectral structure of pure noise is usually fixed or slowly varying, while the speech component changes quickly with time. This assumption implies that the noise part tends to be of low rank, while the clean-speech part is sparse. Therefore, extracting the sparse component of the speech feature matrix tends to enhance the speech and alleviate the noise.
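As an illustrative (purely hypothetical) sanity check of this low-rank/sparse intuition, the toy example below builds a rank-one "stationary noise" matrix plus a brief burst of fast-varying "speech" and verifies that `rpca_ialm` from the sketch above concentrates the burst in the sparse part; the shapes and amplitudes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-one "noise": one fixed spectral shape repeated over all 200 frames
noise = np.outer(rng.standard_normal(40), np.ones(200))
# "Speech": a short, fast-varying active region (frames 80-119 only)
speech = np.zeros((40, 200))
speech[:, 80:120] = 3.0 * rng.standard_normal((40, 40))

L, S = rpca_ialm(noise + speech)
# The sparse part S should carry far more energy inside the active region
print(np.abs(S[:, 80:120]).mean() > 10 * np.abs(S[:, :80]).mean())  # expect True
```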
Here we provide two examples to reveal that RPCA tends to highlight the clean-speech part in the noise-corrupted data:
First, Figure 1a depicts the spectrogram of a noisy utterance, and Figure 1b,c depict the RPCA-derived sparse and low-rank partitions, respectively, of the spectrogram shown in Figure 1a. From these figures, it is obvious that the sparse part contains rich speech cues, while the low-rank counterpart carries relatively little speech information and more of the noise.
Next, Figure 2a–c and Figure 3a–c, respectively, show the time series of the sixth and eighth MFCC features of a noisy utterance together with the corresponding sparse and low-rank components. Likewise, Figure 4a–c and Figure 5a–c correspond to the original, sparse, and low-rank versions of the sixth and eighth FBANK coefficients of a noisy utterance. From these figures, it is clearly observed that the sparse component stays close to the original feature stream and exhibits some degree of synchronicity with it along the time axis, while the low-rank component behaves like irrelevant noise.
4. Experimental Setup
The Aurora-4 database [35] is used to evaluate the noise robustness of the features created via the proposed RPCA-SPC. Aurora-4 is a medium-vocabulary task (5000 words) derived from the Wall Street Journal (WSJ) corpus [36] at 8 kHz and 16 kHz sampling rates. In Aurora-4, 7138 noise-free clean utterances recorded with a primary microphone form the clean-training set; they are further contaminated, with or without secondary-channel distortion and any of six types of additive noise at SNRs ranging from 10 to 20 dB, to form the multi-training set. The testing data are split into 14 test sets (Sets 1–14), each containing 330 utterances. The utterances in Sets 1–7 are recorded with a single microphone, while different microphones are used to record the utterances in Sets 8–14, which accordingly contain channel distortions relative to Sets 1–7. In addition, Sets 2–7 and Sets 9–14 are further contaminated by the six types of additive noise at SNRs from 5 to 15 dB. In our experiments, we adopt the clean-condition training mode to prepare the acoustic models, and all of the utterances used are sampled at 8 kHz.
Regarding the speech features, 39-dimensional MFCCs (13 static components plus their first- and second-order time derivatives) and 40-dimensional FBANK features serve as the baselines, and they are further processed by any of mean normalization (MN), mean and variance normalization (MVN), relative spectral (RASTA), and the presented RPCA-SPC. The clean-condition training data are converted to speech features, which are then used to train context-dependent (CD) acoustic models of two different structures, GMM-HMM and DNN-HMM, where GMM, DNN, and HMM refer to the Gaussian mixture model, deep neural network, and hidden Markov model, respectively. More specifically, GMM-HMM and DNN-HMM use a GMM and a DNN, respectively, to represent each state of the hidden Markov model. For the GMM-HMMs, each tri-phone is characterized by a three-state HMM with eight Gaussian mixtures per state, and silence by a three-state HMM with 16 mixtures per state. On the other hand, the DNN in the DNN-HMMs for tri-phones and silence has seven layers in total, including five hidden layers of 2048 nodes each. A set of trigram language models is created from the reference transcriptions of the training utterances. Finally, the evaluation results are reported in terms of word error rate (WER).
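To make the DNN-HMM topology concrete, the following is a minimal PyTorch sketch of a seven-layer network of the kind described above; the input context window, sigmoid activations, and output dimensionality are illustrative assumptions, since the text specifies only the layer count and hidden width.

```python
import torch.nn as nn

def make_dnn(feat_dim=40, context=11, n_senones=2000, hidden=2048, n_hidden=5):
    """Seven-layer DNN acoustic model: an input layer over spliced frames,
    five 2048-node hidden layers, and an output layer over tied HMM states.
    feat_dim, context, and n_senones are assumed values for illustration."""
    dims = [feat_dim * context] + [hidden] * n_hidden
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    layers.append(nn.Linear(hidden, n_senones))  # softmax is folded into the loss
    return nn.Sequential(*layers)

# e.g., model = make_dnn(); its outputs provide state posteriors for the HMM
```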