An Investigation of a Feature-Level Fusion for Noisy Speech Emotion Recognition

Because the choice of an effective feature representation is one of the key issues in improving the performance of Speech Emotion Recognition (SER) systems, most research has focused on feature-level fusion of large feature sets. In this study, we propose a relatively low-dimensional feature set that combines three feature types: baseline Mel Frequency Cepstral Coefficients (MFCCs), MFCCs derived from Discrete Wavelet Transform (DWT) sub-band coefficients, denoted DMFCC, and pitch based features. The performance of the proposed feature extraction method is evaluated both in clean conditions and in the presence of several real-world noises. Furthermore, conventional Machine Learning (ML) and Deep Learning (DL) classifiers are employed for comparison. The proposal is tested in speaker independent experiments on utterances from the Berlin German Emotional Database (EMO-DB) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database. Experimental results show improvement in speech emotion detection over baselines.


Introduction
Speech signals encompass a large amount of information, ranging from lexical content to the speaker's emotional state and traits. Decoding such information benefits a number of different speech processing tasks such as speech and speaker recognition, as well as Speech Emotion Recognition (SER). After long-term research, both speech and speaker recognition have been addressed quite successfully [1][2][3][4][5], while SER remains difficult, especially in the presence of noise.
Although SER is a widely studied research topic, most existing systems are developed under ideal acoustic conditions. Noise is a major factor that affects the performance of speech related tasks [6], but it is still not well studied in the emotional context. Hence, one of the main purposes of this work is to propose an implementation of an SER system in a more realistic setting: detecting speakers' emotional states under acoustic corruption caused by real-world noises.
A major issue of SER in adverse environments resides in the fact that background noise and emotion are intermingled. This implies that whenever we want to extract information about the speaker's emotional state, noise contributes an uncertainty that results in an acoustic mismatch between training and testing conditions in real-life scenarios. In order to reduce this mismatch, extensive research has been conducted, intervening at different stages of the recognition process. The developed methods fall into two main categories. The first one includes speech enhancement or noise reduction techniques, either by means of speech sample reconstruction [7], noise compensation using histogram equalization [8], adaptive thresholding in the wavelet domain for noise cancellation [9], or spectral subtraction [10]. However, the downside of such methods is that they strongly rely on prior knowledge of noise, speech, or both, which limits their implementation. The second category includes methods that utilize no prior knowledge of the background noise. They concentrate on finding noise robust features, either by proposing new ones such as Log Frequency Power Ratio (LFPR) [11], Rate Scale (RS) [12], Teager Energy based Mel Frequency Cepstral Coefficients (TEMFCCs) [13], and Power Normalized Cepstral Coefficients (PNCC) [14], or by constructing large feature sets that combine different features.
Since most of the available works concentrate on only a particular type of noise, the contribution of this paper is twofold: on the one hand, it represents a new effort to approach real-life conditions and presents an SER system that is subjected to different types of noise; on the other hand, it proposes a multi-feature fusion representation that is based on a combination of conventional features and wavelet analysis and shows that a relatively low-dimensional feature set is viable for SER even in unconstrained conditions. The organization of the paper is as follows: In Section 2, we provide an extensive overview of various SER works that are based on feature-level fusion. Next, we introduce in Section 3 the proposed framework and emotion recognition system. In Section 4, we evaluate the performance of the proposed work using two corpora. Finally, Section 5 highlights the contributions of the present work and provides potential future research directions.

Review of Feature Level Fusion Based SER
One of the most important considerations for SER is the extraction of suitable features, which convey the emotion of the speaker even in challenging environments. Globally, there are two feature extraction approaches. The first one consists of using or finding single features of high performance [13,14]. The second approach is combinatorial in nature and makes use of different features to form large feature sets. This is known as early or feature-level fusion. Table 1 provides a non-exhaustive list of works related to this topic. In terms of speech emotional feature extraction, various features based on feature-level fusion have been investigated. Among these works, the authors in [15] proposed a feature set that combines, among others, Harmonics-to-Noise Ratio (HNR), formants, Mel Frequency Cepstral Coefficients (MFCCs), and 19 channel filter bank (VOC19) features. In [16], acoustic features were combined with lexical Bag-Of-Words (BOW) features extracted from word transcripts obtained using an Automatic Speech Recognition (ASR) system. They achieved a four emotion recognition accuracy of 65.7% using Support Vector Machines (SVM). In [17], 286 features, including formants, pitch, energy, MFCCs, Perceptual Linear Prediction (PLP), and Linear Predictive Coding (LPC), were extracted from speech signals exposed to babble noise at different signal-to-noise ratio levels and were used for emotion classification by means of Naive Bayes (NB), K-Nearest Neighbors (KNN), Gaussian Mixture Model (GMM), Artificial Neural Network (ANN), and SVM classifiers. They showed that the NB and SVM classifiers provided the best results. In [18], 988 statistical functionals and regression coefficients were extracted from eight kinds of low-level features: intensity, loudness, MFCC, Line Spectral Pairs (LSP), Zero Crossing Rate (ZCR), probability of voicing, fundamental frequency F0, and F0 envelope.
The proposed set provided a good detection rate of 83.10% for Speaker Independent (SI) SER on the Berlin Database. In [19], sub-band spectral centroid Weighted Wavelet Packet Cepstral Coefficients (W-WPCC) were proposed and fused with Wavelet Packet Cepstral Coefficients (WPCC) and prosodic and voice quality features to deal with white Gaussian noise. In [20], Linear Predictive Cepstral Coefficients (LPCC) and MFCCs were derived from wavelet sub-bands and fused with baseline LPCCs and MFCCs. The resulting feature dimension was reduced using vector quantization, and the obtained feature vector was used as input to a Radial Basis Function Neural Network (RBFNN) classifier. Recently, in [21], the authors proposed a combination of Empirical Mode Decomposition (EMD) with the Teager-Kaiser Energy Operator (TKEO). They proposed novel features named Modulation Spectral (MS) features and Modulation Frequency Features (MFF) based on the AM-FM modulation model and combined them with cepstral features.

Methodology
As introduced earlier, the idea of the present work is to investigate a feature driven approach to SER. The considered system is composed of three main stages, which are: feature extraction, dimensionality reduction, and classification.

Feature Extraction
Generally, the most commonly used features in SER are divided into three categories, namely: vocal tract, prosodic and excitation source features. Vocal tract features are obtained by analyzing the characteristics of the vocal tract, which is well reflected in the frequency domain. Prosodic features represent the overall quality of the speech and are extracted from longer speech segments like syllables, words, and sentences. Excitation source features are those used to represent glottal activity, mainly the vibration of vocal folds.
In this paper, we sought a set of features that have physical meaning. Since voice is produced and perceived by human beings, we combined feature families that model the mechanisms of both speech production and perception. More precisely, the considered feature set draws on three different feature families: pitch, related to the vocal folds; MFCCs, belonging to the perceptual speech family; and a multi-resolution based feature describing the spectral content of speech. MFCC was used as it is based on the human speech perception mechanism and employs a bank of Mel spaced triangular filters to model the human auditory system, which perceives sound on a nonlinear frequency scale. The speech production system includes the vocal cords and the vocal tract. The vocal cords generate sound waves by vibrating against each other, and the vocal tract modulates the resulting sound. However, the vocal tract is influenced by several factors such as articulation and emotion and shows greater variability. Hence, only production features related to the vocal cords were used. All of the considered features are described in the following.

Pitch
Pitch, also known as the fundamental frequency F0, corresponds to the frequency of vibration of the vocal folds. There are many algorithms for computing the fundamental frequency of speech signals, generally referred to as pitch detection algorithms. They can operate either in the time or the frequency domain. Frequency domain pitch estimators usually use the Fast Fourier Transform (FFT) to convert the signal to a frequency spectrum, while time domain approaches are typically less computationally expensive. Most pitch estimation algorithms can be decomposed into two major steps: the first finds potential F0 candidates for each window, and the second selects the best ones. In this work, pitch was estimated using the Robust Algorithm for Pitch Tracking (RAPT) [26]. The algorithm first computes the Normalized Cross-Correlation Function (NCCF) of a low sample rate version of the signal and records the locations of the local maxima in this first pass NCCF. The NCCF [27] is defined for a window of K samples at lag k within an analysis frame x(n), 0 ≤ n ≤ N − 1, as:

φ(k) = (Σ_{n=0}^{K−1} x(n) x(n + k)) / √(e(0) e(k)), with e(j) = Σ_{n=j}^{j+K−1} x²(n),

where N is the frame length and k is the lag number. Next, the NCCF is computed on the high sample rate signal in the vicinity of the peaks found in the first pass. This generates a list of several F0 candidates for the input frame. Finally, dynamic programming is used for the final pitch estimate selection.
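The first pass of this procedure can be sketched as follows; the sampling rate, lag range, and window length are illustrative choices, not RAPT's exact settings, and the down-sampling and dynamic programming stages are omitted:

```python
import numpy as np

def nccf(x, k_min, k_max, K):
    """First-pass NCCF sketch: normalized cross-correlation of a
    K-sample window of frame x over candidate lags k_min..k_max."""
    e = lambda j: np.dot(x[j:j + K], x[j:j + K])      # window energy at offset j
    phi = []
    for k in range(k_min, k_max + 1):
        num = np.dot(x[:K], x[k:k + K])               # cross-correlation at lag k
        phi.append(num / np.sqrt(e(0) * e(k) + 1e-12))
    return np.array(phi)

fs = 8000
t = np.arange(0, 0.05, 1 / fs)
x = np.sin(2 * np.pi * 200 * t)                       # synthetic "voiced" frame at 200 Hz
phi = nccf(x, k_min=20, k_max=200, K=160)
# Local maxima of phi are the F0 candidates; for this pure tone they sit
# at multiples of the 40-sample period, and RAPT's dynamic programming
# stage would resolve the octave ambiguity among them.
best_lag = 20 + int(np.argmax(phi))
f0 = fs / best_lag
```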

MFCC
MFCC [28] is an audio feature extraction technique that is extensively used in many speech related tasks; it was developed to mimic human auditory perception. The MFCC feature extraction process is as follows:
1. Pre-emphasize the speech signal. The pre-emphasis filter is a special kind of Finite Impulse Response (FIR) filter whose transfer function is H(z) = 1 − αz⁻¹, where α is the parameter that controls the slope of the filter and is usually chosen between 0.4 and 1 [29]. In this paper, its value was set to 0.97.
2. Divide the speech signal into a sequence of frames that are N samples long. An overlap between frames is allowed to avoid abrupt changes between adjacent frames. Windowing is then applied over each frame to reduce the spectral leakage effect at its beginning and end. Here, the Hamming window was applied, defined as w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1.
3. Compute the magnitude spectrum of each windowed frame by applying the FFT.
4. Compute the Mel spectrum by passing the resulting frequency spectrum through a Mel filter bank with triangular bandpass frequency responses.
5. Apply the Discrete Cosine Transform (DCT) to the log Mel spectrum to derive the desired MFCCs.
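The five steps above can be sketched end to end; the frame length, hop size, FFT size, and filter count below are illustrative defaults, not the paper's exact configuration:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs, n_fft=512, frame_len=400, hop=160,
         n_filters=26, n_ceps=13, alpha=0.97):
    """Minimal MFCC sketch following steps 1-5."""
    # 1. pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    y = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. framing + Hamming window
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)
    # 3. magnitude spectrum
    mag = np.abs(np.fft.rfft(frames, n_fft))
    # 4. triangular Mel filter bank, equally spaced on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = np.log(mag @ fbank.T + 1e-10)
    # 5. DCT of the log Mel spectrum, keeping the first n_ceps coefficients
    return dct(mel_spec, type=2, axis=1, norm='ortho')[:, :n_ceps]

fs = 16000
x = np.random.randn(fs)            # one second of noise as a stand-in signal
feats = mfcc(x, fs)                # shape: (n_frames, n_ceps)
```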

Wavelet Based Feature Extraction Method
Discrete Wavelet Transform (DWT) has been adopted for a huge variety of applications, from speaker recognition [4] to Parkinson's disease detection [30]. Briefly, DWT is a time-scale representation technique that iteratively transforms the input signal into multi-resolution subsets of coefficients through high-pass and low-pass filters and decimation operators. Practically, DWT is performed by an algorithm known as sub-band coding or Mallat's algorithm. According to [31], a discrete signal x(n) can be decomposed as:

x(n) = Σ_k a_{j0,k} φ_{j0,k}(n) + Σ_{j=j0}^{J} Σ_k d_{j,k} ψ_{j,k}(n),

where φ_{j0,k}(n) = 2^{j0/2} φ(2^{j0} n − k) is the scaling function at a scale of 2^{j0} shifted by k, ψ_{j,k}(n) = 2^{j/2} ψ(2^{j} n − k) is the mother wavelet at a scale of 2^{j} shifted by k, a_{j0,k} are the approximation coefficients at a scale of 2^{j0}, and d_{j,k} are the detail coefficients at a scale of 2^{j}. In this paper, MFCCs were extracted from DWT sub-band coefficients to produce DMFCC features. Figure 1 summarizes the DMFCC feature extraction process. After feature extraction, the sequence of features of each utterance was mapped into a global descriptor representative of the entire utterance. The most widely used mapping approach in emotion analysis is the Statistics Based Mapping Algorithm (SBMA), which consists of computing several statistics over the entire sequence. The resulting feature vector is used for analysis, learning, and classification. The statistics computed from each of the considered features are shown in Figure 2. For DMFCC, the statistics were computed over each of the MFCC matrices extracted from the wavelet sub-band coefficients.
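A minimal sketch of the sub-band coding step and the statistics based mapping; the Haar filter pair, the number of levels, and the chosen statistics are illustrative stand-ins, since the text does not fix them here:

```python
import numpy as np

def haar_dwt(x):
    """One level of Mallat's sub-band coding with the Haar filter pair
    (an illustrative stand-in for the paper's unspecified wavelet)."""
    x = x[: len(x) // 2 * 2]
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation coefficients
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
    return lo, hi

def dwt_subbands(x, levels=3):
    """Iteratively split the approximation branch, collecting each
    detail sub-band plus the final approximation."""
    bands, a = [], x
    for _ in range(levels):
        a, d = haar_dwt(a)
        bands.append(d)
    bands.append(a)
    return bands

def sbma(feature_seq):
    """Statistics Based Mapping sketch: collapse a per-frame feature
    sequence into one fixed-length utterance descriptor."""
    return np.concatenate([feature_seq.mean(axis=0),
                           feature_seq.std(axis=0),
                           feature_seq.min(axis=0),
                           feature_seq.max(axis=0)])

x = np.random.randn(8000)
bands = dwt_subbands(x, levels=3)     # 3 detail sub-bands + 1 approximation
# DMFCC would now apply the MFCC pipeline to each band; here we only
# illustrate the mapping stage on a toy per-frame feature matrix.
desc = sbma(np.random.randn(98, 13))  # 98 frames x 13 coefficients -> 52 values
```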
Normalization plays an important role in classification methods. Feature normalization guarantees that all features have the same scale [32]. Moreover, it is used to remove speaker variability while preserving the discrimination between emotional classes. In this context, all features were subject to z-score normalization hereafter.
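The z-score step can be sketched as:

```python
import numpy as np

def zscore(features):
    """z-score normalization: zero mean, unit variance per feature dimension."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12   # guard against constant features
    return (features - mu) / sigma

X = np.random.randn(100, 63) * 5.0 + 3.0   # toy utterance descriptors
Z = zscore(X)
```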

Dimensionality Reduction Using LDA
There are many techniques for reducing the dimensionality of the extracted feature vector while preserving the discriminability of the different emotion categories in the reduced space. The most established techniques are Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA) [33]. Both reduce the dimensionality of features by projecting the original feature vector into a new subspace through a transformation. PCA optimizes the transformation by finding the largest variations in the original feature space, while LDA pursues the largest ratio of between-class variation to within-class variation when projecting the original features to a subspace. In this paper, the dimensionality of the feature vector obtained through SBMA was further reduced using LDA. The main objective of LDA is to find a projection matrix W_lda that maximizes the so-called Fisher criterion:

J(W_lda) = |W_lda^T S_b W_lda| / |W_lda^T S_w W_lda|,

where S_b and S_w are the between-class and within-class scatter matrices, respectively, defined as:

S_b = Σ_{i=1}^{g} β_i (µ_i − µ)(µ_i − µ)^T,
S_w = Σ_{i=1}^{g} Σ_{j=1}^{β_i} (υ_{ij} − µ_i)(υ_{ij} − µ_i)^T,

where g is the number of classes, µ_i and µ are the class mean and overall mean, respectively, υ_{ij} are the samples from class C_i, and β_i is the number of samples in class C_i.
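A compact sketch of this procedure: the scatter matrices S_w and S_b are accumulated per class, and the projection is taken from the leading eigenvectors of S_w⁻¹ S_b (toy two-class data for illustration):

```python
import numpy as np

def lda_projection(X, y, n_components):
    """LDA sketch: build S_w and S_b and take the top eigenvectors of
    pinv(S_w) @ S_b as the projection matrix W_lda."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)      # within-class scatter
        diff = (mu_c - mu)[:, None]
        S_b += len(Xc) * (diff @ diff.T)        # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_components]].real

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = lda_projection(X, y, n_components=1)  # at most g-1 useful directions
Xp = X @ W                                # reduced representation
```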

Classification
NB and SVM are typical generative and discriminative classification models, respectively. In this paper, we compare the two classifiers to examine their reliability for SER.

Naive Bayes Classifier
The NB classifier [34] is based on Bayes' theorem of conditional probability. The theorem states that the conditional probability that an event belongs to a class can be calculated from the conditional probabilities of finding particular events in each class and the unconditional probability of the event. In other words, let X be the input data and H the hypothesis that X belongs to class C. The conditional probability that X belongs to class C can be calculated using:

p(H|X) = p(X|H) p(H) / p(X),

where p(H) and p(X) are the prior probabilities of H and X, respectively, and p(H|X) is the posterior probability of H conditioned on X. Under the naive assumption that the feature dimensions x_1, ..., x_d of X are conditionally independent given the class, p(X|H) can be estimated as:

p(X|H) = Π_{i=1}^{d} p(x_i|H).

From this, the NB classifier is defined as:

Ĉ = argmax_C p(C) Π_{i=1}^{d} p(x_i|C).
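Assuming Gaussian per-feature likelihoods (a common choice for continuous acoustic features, though not stated explicitly here), the classifier can be sketched as:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes: per-class priors p(H) and
    independent per-feature Gaussian likelihoods for p(X|H)."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior, self.mu, self.var = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.prior[c] = len(Xc) / len(X)
            self.mu[c] = Xc.mean(axis=0)
            self.var[c] = Xc.var(axis=0) + 1e-9   # variance floor
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # log p(H) + sum_i log p(x_i | H), Gaussian per feature
            ll = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                               + (X - self.mu[c]) ** 2 / self.var[c], axis=1)
            scores.append(np.log(self.prior[c]) + ll)
        return self.classes[np.argmax(np.stack(scores), axis=0)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(4, 1, (40, 3))])
y = np.array([0] * 40 + [1] * 40)
pred = GaussianNB().fit(X, y).predict(X)   # well-separated toy classes
```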

SVM Classifier
SVM [35] is a binary classifier that separates input data into classes by fitting an optimal separating hyperplane to the training data in the feature space. It is extended to multi-class problems using one of two strategies: One-Versus-One (OVO) and One-Versus-All (OVA). In the OVA approach, each class is separated from the remaining ones. Thus, the number of SVMs that are trained equals the number of classes, and the final classification is determined by the highest score. The OVO approach, also known as pairwise classification, pairs the classes and trains an SVM for each pair. Each binary classifier is trained on only two classes; thus, the method constructs g(g − 1)/2 binary classifiers, where g is the number of classes. For each test sample, the confidence level of each class is the majority voting result of these binary classifiers, and the class with the most votes is selected as the final prediction. In our work, the OVO strategy was adopted. To obtain optimal performance from the SVM classifier, selection of a proper kernel function is essential. In this work, the linear kernel was used, since it requires no additional kernel parameters.
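The OVO voting scheme can be sketched as follows; to keep the example self-contained, a least-squares linear classifier stands in for each binary linear-kernel SVM:

```python
import itertools
import numpy as np

def train_ovo(X, y):
    """One-Versus-One sketch: g(g-1)/2 binary linear classifiers
    (least-squares stand-ins for linear-kernel SVMs)."""
    models = {}
    for a, b in itertools.combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        Xa = np.hstack([X[mask], np.ones((mask.sum(), 1))])  # bias column
        t = np.where(y[mask] == a, 1.0, -1.0)
        w, *_ = np.linalg.lstsq(Xa, t, rcond=None)
        models[(a, b)] = w
    return models

def predict_ovo(models, X, classes):
    """Each pairwise classifier casts one vote; the class with the
    most votes wins."""
    votes = np.zeros((len(X), len(classes)))
    Xa = np.hstack([X, np.ones((len(X), 1))])
    for (a, b), w in models.items():
        pred = Xa @ w
        votes[:, list(classes).index(a)] += pred > 0
        votes[:, list(classes).index(b)] += pred <= 0
    return np.array(classes)[votes.argmax(axis=1)]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)
models = train_ovo(X, y)                 # g = 3 -> 3 binary classifiers
pred = predict_ovo(models, X, classes=[0, 1, 2])
```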

Results
We conducted SER experiments to evaluate the proposed approach, which was implemented using the algorithms described in Section 3. The evaluation was focused on its performance with and without noise incorporation. For bilingual emotion recognition, two databases were used in the experiments, one in English and the other in German.

Experimental Data and Parameters
In SER, the used datasets are categorized into three types. These are acted, authentic, and elicited emotional corpora [36].
• Acted: where the emotional speech is acted by subjects in a professional manner. The actor is asked to read a transcript with a predefined emotion.
• Authentic: where emotional speech is collected from recorded conversations in real-life situations. Such situations include customer service calls and audio from video recordings in public places or from TV programs.
• Elicited: where the emotional speech is collected in an implicit way, in which the emotion is the natural reaction to a film or a guided conversation. The emotions are provoked, and experts label the utterances. The elicited speech is neither authentic nor acted.
In this paper, we aimed to investigate the performance of the proposed emotion recognition system with two types of corpora: IEMOCAP and EMO-DB, which are elicited and acted datasets, respectively.

IEMOCAP
IEMOCAP [37] is a multi-speaker and multimodal database collected at the Speech Analysis and Interpretation Laboratory (SAIL) of the University of Southern California. It contains approximately twelve hours of audio-visual data from ten actors (five males and five females) and was recorded in five sessions. Each session had one male and one female performing improvisations or scripted scenarios designed to elicit particular emotions. The database was annotated by three annotators into several categorical labels, such as anger, happiness, excited, and so on. Only utterances with at least two agreeing emotion labels were used in our experiments. Specifically, the categorical tags considered in the present work are: anger, excited, neutral, happiness, sadness, fear, surprise, and frustration. We merged excited into happiness, making the final dataset contain 7214 utterances (1031 anger, 1585 happiness, 1684 neutral, 1018 sadness, 33 fear, 90 surprise, and 1773 frustration).

EMO-DB
The German Emotional Speech Database (EMO-DB) [38] includes seven emotional states: anger, boredom, disgust, fear, happiness, and sadness, in addition to the neutral state. The utterances were produced by ten professional German actors (five females and five males) uttering ten sentences with emotionally neutral content but expressed with the seven different emotions. The total number of utterances is 535, divided among the seven emotional states as follows: 127 anger, 81 boredom, 46 disgust, 69 fear, 71 happiness, 79 neutral, and 62 sadness.

Speaker Independent Experimental Results
Here, we present a series of experiments for SI analysis to understand the practical utility of the SER system in a real-world scenario. SI means that the speaker of the classified utterances was not included in the training database. In this context, we used Leave-One-Subject-Out (LOSO) cross-validation. That is, for both databases, the system was trained ten times, each time leaving one speaker out of the training set and testing the performance on the speaker left out.
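The LOSO protocol can be sketched as:

```python
import numpy as np

def loso_splits(speaker_ids):
    """Leave-One-Subject-Out: one fold per speaker; the held-out
    speaker never appears in the training set."""
    for spk in np.unique(speaker_ids):
        test = speaker_ids == spk
        yield np.where(~test)[0], np.where(test)[0]

speakers = np.array([1, 1, 2, 2, 3, 3])   # toy utterance-to-speaker map
folds = list(loso_splits(speakers))       # 3 speakers -> 3 folds
```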
SI systems offer several advantages. They can efficiently handle unknown speakers; thus, no training by individual users is required. They also show a better generalization ability than Speaker Dependent (SD) ones, since they avoid overfitting. In addition, the experimental protocol in SI systems is deterministic, in that the exact configuration is known. In contrast, in SD systems where cross-validation is employed, the random partitioning does not allow an exact reproduction of the configuration, making the results not directly comparable between research works.
The performance was evaluated in terms of accuracy, defined as:

Accuracy = (number of correctly classified utterances / total number of utterances) × 100.

First of all, we conducted experiments in a clean environment. The motivation of this first experiment was twofold: first, to compare the performance of the SER system with individual feature families and with fused features, respectively; second, to compare the performance of the two used classifiers, NB and SVM. Figure 3 reports the obtained results.
The emotion recognition rates using EMO-DB for NB with pitch, MFCC, and DMFCC features were 46.23%, 64.79%, and 68.01%, respectively. Feature fusion brought an improvement, giving the best recognition rate of 82.32% when the three features were combined. For the IEMOCAP database, the accuracies of NB and SVM were 42.09% and 41.01%, respectively. By comparing these, we found that NB had the best recognition accuracy with the combination of MFCC and DMFCC; however, it was nearly equal to that obtained when fusing the three considered features. In order to further analyze the classification distribution of each emotion, the confusion matrices averaging the results of the ten separate experiments for SI SER are shown in Tables 2 and 3. The values on the main diagonal give the average recognition accuracy of each emotion. As shown, with the IEMOCAP database, the emotions globally showed a high number of confusions. With EMO-DB, the emotions were relatively easily recognized, reaching the highest average accuracy of 92.10% for anger. This can be explained by the fact that EMO-DB is an acted database. Acted emotion expressions are generally more acoustically exaggerated than spontaneous ones [39], thus rendering them more easily differentiated. Moreover, taking fear classification on IEMOCAP as an example, a closer analysis revealed that fear was usually confused with frustration and happiness when the annotators labeled the data. This behavior is in agreement with the one observed experimentally.
In real-world applications, SER nearly always involves capturing speech with background noise. The impact of such noise on speech signals and on SER performance may increase with the amount of traffic, the number of vehicles, their speed, and so on. The present experiment provides an analysis of the effects of such acoustic disturbances on SER. The considered background noises were based on the ones constructed for the AURORA noisy speech evaluation [40] and included: airport, train, babble, street, car, exhibition, and restaurant noise. Tables 4 and 5 provide the performance measurements in the presence of noise for EMO-DB and IEMOCAP, respectively, at different Signal-to-Noise Ratio (SNR) levels. The results reported in Tables 4 and 5 for the three combined features again show a better performance of NB than SVM for EMO-DB. However, with IEMOCAP, SVM outperformed NB from 0 dB to 10 dB and remained nearly equal to NB at 15 dB and beyond. It was also noticed that NB was faster than SVM in terms of computational speed, which makes it more suitable for real-world applications.
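Corrupting a clean utterance with a noise recording at a target SNR can be sketched as follows (the scaling rule only; the AURORA noise files themselves are not reproduced here):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio
    equals the requested SNR in dB, then mix."""
    noise = np.resize(noise, len(speech))           # loop/trim to length
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.normal(0, 1, 16000)                     # stand-in noise recording
noisy = add_noise_at_snr(clean, noise, snr_db=10)
```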

Comparison with Previous Related Work
As stated in Section 2, there are a number of works that have implemented an SER system based on feature-level fusion. However, research has rarely been devoted to SI SER, especially in the presence of noise. In [23], the authors performed a six class emotion recognition on EMO-DB. They used a feature set that combined pitch based features, energy, ZCR, duration, formants, and harmony features. The obtained recognition rates for each emotion were as follows: 52.7% for happiness with 33.9% misclassified as anger, 84.8% for boredom, 52.9% for neutral, 87.6% for sadness, 86.1% for anger, and 76.9% for fear. In our approach, by contrast, we performed a seven class emotion recognition, reaching recognition rates of 79.54% for happiness with only 3.41% misclassified as anger, 92.10% for anger, 78.39% for boredom, 74.22% for neutral, 79.89% for sadness, and 80.10% for fear. In [22], all of the seven emotions of EMO-DB were used. The extracted feature set included MFCC, LPCC, ZCR, spectral roll-off, and spectral centroid, and an overall accuracy of 78.8% was achieved. In particular, the recognition rates for each emotion were: 92.91% for anger, 74.68% for boredom, 68.42% for disgust, 70.91% for fear, 50% for happiness, 85.90% for neutral, and 90.57% for sadness. Sadness was more easily recognized there than in our approach (79.89%), which was not the case for happiness, which we correctly recognized at an average rate of 79.54%. In [18], 988 statistical functionals and regression coefficients were extracted from eight kinds of low-level features. The proposed set provided an average recognition rate of 83.10% on EMO-DB. However, it was tested on a subset of only 494 speech samples out of 535. As demonstrated in Figure 3, we achieved a good accuracy using only 63 features when testing the performance on all of the 535 utterances of EMO-DB. Table 6 summarizes the obtained results for the SI task on EMO-DB together with those of literature works.
Comparison with a Deep Learning Approach
In recent years, DL architectures have been increasingly applied to SER, including Convolutional Neural Networks (CNNs) [42,43] and Long Short Term Memory (LSTM) networks [44]. In this work, a CNN architecture was proposed for emotion recognition. Feature extraction was applied first to the input speech signals, as described in Section 3, before they were fed to the network. A conventional CNN is a multi-layer stacked neural network, which is built by stacking the following layers:

• Convolutional layer: utilizes a set of convolutional kernels (filters) to convert the input into feature maps.
• Non-linearity: between convolutional layers, an activation function is applied to the feature maps to introduce nonlinearities into the network. Without this function, the network would essentially be a linear regression model and would struggle with complex data. In this paper, we used the most common activation function, the Rectified Linear Unit (ReLU), defined as ReLU(x) = max(0, x).
• Pooling layer: its function is to decrease the feature map size progressively to reduce the number of parameters and the amount of computation in the network, and hence also to control overfitting. Two common methods are average pooling and max pooling. In average pooling, the output is the average value of the feature map in a region determined by the kernel. Similarly, max pooling outputs the maximum value over a region of the feature maps. In this paper, max pooling was adopted due to the property that the max operation preserves the largest activations in the feature maps.
• Softmax layer: softmax regression is often implemented at the neural network's final layer for multi-class classification and gives the probabilities pertaining to the different classes. Assuming that there are g classes, the output probability for the jth class is p_j = exp(z_j) / Σ_{i=1}^{g} exp(z_i), where z_j is the jth input to the softmax layer.
The overall architecture of the used CNN model is illustrated in Figure 4. The proposed model was built by stacking two convolutional layers, one fully connected layer, and a softmax layer. The numbers of filters for the two convolutional layers were 64 and 128, respectively. Max pooling was added after each convolution with a pooling length of two. The number of nodes in the fully connected layer was set to 128. A dropout layer was also added at the end of each layer to avoid overfitting. Details of the CNN parameters are shown in Table 7.
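A numpy forward-pass sketch of this layer stack (one convolutional block instead of two, random weights, and toy sizes, purely to illustrate the operations, not the trained model):

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1-D convolution of x with each filter in kernels (n_filters, k)."""
    return np.stack([np.convolve(x, w[::-1], mode='valid') for w in kernels])

def relu(x):
    return np.maximum(0.0, x)                 # ReLU(x) = max(0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling along the time axis."""
    L = x.shape[1] // size * size
    return x[:, :L].reshape(x.shape[0], -1, size).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())                   # numerically stable
    return e / e.sum()

rng = np.random.default_rng(4)
x = rng.normal(size=63)                       # utterance-level feature vector
k1 = rng.normal(size=(4, 3))                  # 4 filters, kernel length 3
h = max_pool(relu(conv1d(x, k1)))             # conv -> ReLU -> max pool
W = rng.normal(size=(7, h.size))              # dense layer to g = 7 classes
p = softmax(W @ h.flatten())                  # class probabilities
```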

Figure 4. Architecture of the used CNN model: input vector → convolution → max pooling → dropout → convolution → max pooling → dropout → flatten → dense → dropout → softmax → emotion.
The CNN parameters were optimized by RMSprop with a learning rate of 0.00005. We used 100 epochs with a batch size of 16, meaning that the model saw the whole training data 100 times while updating its weights. Figure 5 illustrates the results of emotion classification using the CNN and compares them to the results of conventional SVM and NB. It can be seen that the CNN model achieved a relatively good overall performance on IEMOCAP and improved the performance by 2.12%. However, the DL model was less effective on EMO-DB, achieving an average accuracy of 81.55% compared to 82.32% for NB. These findings may be explained by the small size of EMO-DB: ML can still achieve high classification accuracy in cases where the available data are small, which demonstrates its capacity for pattern recognition problems.

Conclusions
In this paper, the use of feature fusion for SER was investigated in noisy environments. Different feature combinations were attempted and showed improvements in classification compared to individual features. Among the combinations explored in this work, the fusion of MFCC, MFCC derived from wavelets (DMFCC), and pitch based features remained the most reliable, followed by the combination of MFCC and DMFCC. Comparison with state-of-the-art works showed the possibility of using a relatively low-dimensional feature set with good accuracy. Since emotions do not have clear-cut boundaries (even people are often confused when recognizing other people's emotions), there is a need to explore and develop classification methods that can handle this vague boundary problem. In this context, conventional ML algorithms were studied and compared with a DL technique. Experimental results demonstrated that when implementing DL for SER, one of the major challenges is the lack of large datasets. To deal with this issue, one possibility for future research would be to consider data augmentation by either collecting or creating more data.