Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database

Abstract: We propose a speech-emotion recognition (SER) model with an “attention-long short-term memory (LSTM)-attention” component to combine IS09, a commonly used feature for SER, and the mel spectrogram, and we analyze the reliability problem of the interactive emotional dyadic motion capture (IEMOCAP) database. The attention mechanism of the model focuses on the emotion-related elements of the IS09 and mel spectrogram features and on the emotion-related durations along the time axis of the features. Thus, the model extracts emotion information from a given speech signal. The proposed model for the baseline study achieved a weighted accuracy (WA) of 68% on the improvised dataset of IEMOCAP. However, neither the proposed model of the main study nor the modified models could achieve a WA above 68% on the improvised dataset. We attribute this to the reliability limit of the IEMOCAP dataset. A more reliable dataset is required for a more accurate evaluation of the model’s performance. Therefore, in this study, we reconstructed a more reliable dataset based on the labeling results provided by IEMOCAP. The experimental results of the model on the more reliable dataset confirmed a WA of 73%.


Introduction
The emotional state of a person influences their modes of interaction, such as facial expressions, speech characteristics, and the content of communication. Since speech is one of the main modes of expression, a human-machine interface must recognize, understand, and respond to the emotional stimuli contained in human speech. Emotions affect both vocal and verbal content. In this study, we aim to develop a mechanism that can recognize emotions from the acoustic features of utterances [1].
Several studies on speech-emotion recognition have aimed to identify features that enable the discrimination of emotions [2,3]. Various short-term and long-term features have been proposed [4], but it is still unclear which features contain more information about emotions. The most common method of emotion recognition is to extract a large number of statistical features from an utterance, reduce the dimensions using a dimension reduction technique, and classify the features using machine learning algorithms [5-7].
Recently, researchers have developed deep learning models with low-level descriptors (LLDs) as inputs [1-3]. These models recognized emotions well. An LLD is an acoustic feature extracted within a short time frame, typically 20-50 ms long. Such models mainly consist of a recurrent neural network and an attention mechanism, which are suitable for analyzing sequential data. The LLD most commonly used for emotion recognition is the IS09 feature, which is extracted using the openSMILE toolkit [8]. The IS09 feature is a 32-dimensional feature vector consisting of the fundamental frequency, the voicing probability, the frame energy, the zero-crossing rate, and 12 mel frequency cepstral coefficients (MFCCs), together with their first-order derivatives. It uses a 25-ms-long window. IS09 is considered better than other features [1]; therefore, it has been used in various studies on SER [9,10].
Additionally, attention mechanisms have been used in various fields, such as neural-machine translation [11], image-caption generation [12], and time-series prediction [13]. This mechanism generates an output vector by weighting the input vectors in order of importance and adding them. The simplest form of an attention mechanism consists of a single query vector, $q \in \mathbb{R}^N$. In this attention mechanism, attention for a given input matrix is calculated by the dot product to identify how similar each vector of the matrix is to the query vector. The query vector, q, is applied to a given input matrix $H = (h_1, h_2, \ldots, h_T)$ of shape T × N, where T is the length and N is the dimension of H. The weight, $\alpha_t$, for each feature vector in the feature matrix is generated by the softmax function using the query vector as follows:
$$\alpha_t = \frac{\exp(q^\top h_t)}{\sum_{\tau=1}^{T} \exp(q^\top h_\tau)} \quad (1)$$
The final output of the attention mechanism, c, is generated by multiplying the weights by the input vectors as follows:
$$c = \sum_{t=1}^{T} \alpha_t h_t \quad (2)$$
In this case, c is a context vector created by concentrating the important information in the matrix, based on the similarity between the query vector and the feature matrix. According to [1,6], this technique for generating the context vector performs better than other techniques, such as mean pooling.
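The query-based attention described in this paragraph (a softmax over dot products between the query and each input vector, followed by a weighted sum) can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `attention_pool` and the toy values of `H` and `q` are assumptions made for illustration and do not come from the paper's implementation.

```python
import math

def attention_pool(H, q):
    """Return (alpha, c): softmax weights over the rows of H and the context vector."""
    # Similarity of each input vector h_t to the query q (dot product).
    scores = [sum(qi * hi for qi, hi in zip(q, h)) for h in H]
    # Softmax over time; subtracting the maximum keeps exp() numerically stable.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector: weighted sum of the input vectors.
    c = [sum(a * h[i] for a, h in zip(alpha, H)) for i in range(len(q))]
    return alpha, c

# Toy input: T = 3 time steps, N = 2 dimensions (illustrative values only).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [1.0, 1.0]
alpha, c = attention_pool(H, q)
```

In this toy input the third vector has the largest dot product with the query, so it receives the largest weight, whereas mean pooling would weight all three time steps equally.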
In this study, we use the ability of deep learning models to extract features from LLDs and attempt to obtain better features than when using only IS09 features. For this, we leverage the attention mechanism's ability to combine IS09 and the mel spectrogram. We propose a feature-combined attention mechanism structure that integrates IS09 and the mel spectrogram. The integrated feature is then used to develop a model that predicts emotions using long short-term memory (LSTM), a dense layer, and an attention mechanism.
This attention mechanism computes the IS09-mel-spectrogram feature by selectively weighting the components of IS09 and the mel spectrogram that have a significant influence on the predicted emotion. By using the feature-combined attention mechanism, we can mitigate the curse of dimensionality [14], or complexity, that accompanies high-dimensional features.
Additionally, we performed an experiment to analyze the interactive emotional dyadic motion capture (IEMOCAP) [15] dataset, which is mainly used in SER research. The dataset consists of five sessions, and each session is recorded by a group of evaluators. It is composed of an improvised dataset on topics that can elicit specific emotions and a scripted dataset. For each utterance in the dataset, each evaluator judged anger, sadness, happiness, disgust, fear, surprise, frustration, excitement, and contentment, and the utterance was then labeled with the most selected emotion. The label of a sample obtained through this process may be less reliable. Therefore, we reconstructed a dataset with utterances whose labels were chosen by three or more evaluators for higher reliability. A more reliable dataset such as this helps derive model-evaluation results with higher accuracy. Thus, we studied both the IS09-mel-spectrogram attention mechanism and the reliability of the IEMOCAP dataset.
The contributions of this paper can be summarized as follows:
• the addition of the deep-learning-based LLD feature extraction capability for enhanced and expanded IS09 features,
• an IS09-mel-spectrogram attention mechanism that focuses on the important parts, and
• an accurate-experimentation method with a highly reliable dataset.
To compare and analyze the performance of the proposed model and the IEMOCAP dataset, we used five-fold cross-validation for speaker-independent evaluation and four emotions (happiness, sadness, anger, and neutral) for consistency with existing research. The proposed model achieved results similar to those of the baseline research [9]; however, with a more reliable dataset, the model achieved higher performance.
The rest of the paper is organized as follows. Section 2 describes the related works. Section 3 describes the proposed feature-combined attention model. Section 4 describes the IEMOCAP dataset. Sections 5 and 6 detail three sets of experiments. The first experiment compares the performance of the models. The second visualizes the attention weights of the feature-combined attention mechanism. The third obtains a more accurate performance estimate using a more reliable dataset. In Section 7, we discuss the problem of more reliable samples. Section 8 concludes the work.

Related Works
Distinguishing features are important for recognizing speech emotion using traditional classification techniques [16]. For example, spectrum features like MFCCs and linear frequency cepstral coefficients (LFCCs), or paralinguistic features like F0, can be used [17]. In Reference [18], the authors presented a method based on the Gaussian-mixture model classifier with MFCCs as features for emotion recognition. In Reference [19], the authors presented a method based on the hidden Markov model and support vector machine (SVM) for emotion recognition using MFCCs and LFCCs. In Reference [20], the authors proposed a feature combining MFCCs and the residual-phase feature for music-emotion recognition using autoassociative neural networks, support vector machines, and radial-basis function neural networks.
With numerous successful applications of deep neural networks (DNNs), more researchers began to focus on DNNs for emotion recognition. In Reference [21], the authors proposed a generalized discriminant analysis based on DNNs to learn low-dimensional discriminative features, optimized for fast classification, from a large set of acoustic features for emotion recognition. They showed a highly significant improvement over the SVM. In Reference [22], the authors used CNNs for face and CNNs for voice to recognize the emotion of a given video. They proposed a strong model for facial-expression emotion recognition, achieving state-of-the-art performance. In Reference [23], the authors proposed a deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. Their model encodes the information from text and audio sequences using RNNs. In Reference [24], the authors proposed a model consisting of a CNN and an RNN that automatically learns the best representation of the speech signal directly from the raw time representation. In Reference [10], the authors used the bidirectional long short-term memory (BLSTM) model to capture important information from the speech signal.
Also, with numerous successful applications of the attention mechanism, more researchers began to focus on DNNs with attention mechanisms. In Reference [25], the authors proposed a three-dimensional attention-based convolutional recurrent neural network to learn discriminative features for SER, where the mel spectrogram with deltas and delta-deltas was used as the input. They assumed that calculating the deltas and delta-deltas for personalized features preserves the effective emotional information and reduces the influence of emotion-irrelevant factors. In Reference [26], the authors proposed a deep recurrent neural network model with the attention mechanism for SER. Their model is based on the intuition that it is beneficial to emphasize the expressive part of the speech signal for emotion recognition. In Reference [27], the authors used an architecture involving both convolutional layers, for extracting high-level features from raw spectrograms, and recurrent layers, for aggregating long-term dependencies. In Reference [28], the authors used the RNN model for extracting statistical functionals over speech segments. In Reference [29], the authors proposed an attentive convolutional neural network consisting of a CNN layer, a max-pooling layer, and an attention layer. In Reference [9], the authors used the BLSTM model with an attention layer. These studies show the effectiveness of the attention mechanism for SER.
Our work differs from the works mentioned above. Our attention-LSTM-attention model uses two attention layers. The first attention layer uses an IS09-mel-spectrogram-combined attention mechanism that focuses on the important parts of the model input, and the second attention layer uses a temporal attention mechanism that emphasizes the temporally important parts of a given audio signal by weighting the LSTM output. To the best of our knowledge, this is the first attempt to use an attention mechanism to combine multiple features for use as the input of an SER model.

Problem Definition
The emotion recognition task is as follows. The input of the model is an N-dimensional feature sequence of length T, $X = (X_1, X_2, \ldots, X_T) \in \mathbb{R}^{N \times T}$. Here, $X_t = (x_t^1, x_t^2, \ldots, x_t^N)$ is the N-dimensional feature vector at time t, and $X^n = (x_1^n, x_2^n, \ldots, x_T^n)$ is the n-th feature series of length T.
The purpose of the model is to take X as the input and predict the label y. The true label y for X is one of anger, neutral, sadness, and happiness. Finally, the model can be defined by Equation (3) and is shown in Figure 1.
Electronics 2020, 9, x FOR PEER REVIEW 4 of 12

Feature-Combined Attention Mechanism
The feature-combined attention mechanism applies weights along both the feature axis and the time axis, whereas the conventional attention mechanism applies weights along the time axis only. The aim is to focus on the important parts of the feature axis when combining features. A representation of the mechanism is shown in Figure 2.
The purpose of the attention mechanism is to generate new features by combining two input features, IS09 and the mel spectrogram. The computation of the feature-combined attention mechanism consists of three weight calculations.
The first calculation is made along the time axis to ignore unimportant time steps. As a result, the feature vectors obtained from the silent segments of the audio can be ignored when weighting the feature axis. For this, the weight $c_t$ for the t-th feature vector on the time axis, $X_t = (x_t^1, x_t^2, \ldots, x_t^N)$, is calculated by Equation (4), where $W_c$ is the weight parameter used in the feed-forward network. Each weight is multiplied by the corresponding input feature vector to produce a feature sequence in which the silent segments are suppressed.

𝑐 = 𝑒
The second calculation gives weights according to the variation of features. For each n-th feature, $X^n = (x_1^n, x_2^n, \ldots, x_T^n) \in \mathbb{R}^T$, the weight $\alpha^n \in \mathbb{R}$ is calculated by Equation (5), where $W_\alpha$ is a weight parameter used in the feed-forward network.
The third weight calculation uses a weight parameter, $P = (P_1, P_2, \ldots, P_T) \in \mathbb{R}^{N \times T}$, to render weights according to the position of the feature. P passes through a dense layer with the tanh function as the activation function, and the weight β is calculated from the transformed P by Equation (6).
After the weights are computed, the final output for the n-th feature is calculated using $\alpha^n$ and β, as in Equation (7).
Therefore, the output of the feature-combined attention mechanism is created by absolute weights according to the positions of the features, weights according to their variation, and weights according to temporal importance.

IEMOCAP Database
The dataset used to compare the performance of the model is the improvised dataset included in IEMOCAP (Figure 3). The improvised dataset is part of the whole dataset. Unlike the scripted dataset, which contains a script for speakers to read, the improvised dataset is a collection of improvised statements. This dataset consists of five sessions, each of which contains speech samples from two speakers. We divided the audio signals into four emotional categories, happiness, sadness, neutral, and anger, for consistency with existing studies. To evaluate the performance of the emotion-recognition model independently of the speaker, the model was trained on four sessions and evaluated on the remaining session. Moreover, for audio signals 5 s or longer, only the first 5 s were used, and audio signals shorter than 5 s were zero-padded to a length of 5 s.
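The speaker-independent protocol described above, training on four sessions and evaluating on the held-out one, can be sketched as a leave-one-session-out split. This is a minimal illustration; the helper name and the session identifiers are hypothetical placeholders, not names taken from the IEMOCAP release or from the paper's code.

```python
def leave_one_session_out(sessions):
    """Yield (train_sessions, test_session) pairs, holding out each session once."""
    for held_out in sessions:
        train = [s for s in sessions if s != held_out]
        yield train, held_out

# Hypothetical session identifiers, used only for illustration.
sessions = ["session_1", "session_2", "session_3", "session_4", "session_5"]
folds = list(leave_one_session_out(sessions))
```

Because each session contains its own pair of speakers, holding out a whole session keeps the evaluation speakers out of the training data.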
The features extracted from the audio signal are IS09 and the mel spectrogram. The mel spectrogram is an 80-dimensional feature vector extracted by shifting a 20-ms-long window by 10 ms. Therefore, both feature vectors are generated at a rate of 100 frames/second (fps). As a result, a total of 500 frames of feature vectors were generated per utterance, and they were used separately or concatenated into 112-dimensional feature vectors.
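The frame arithmetic above (100 fps over 5 s gives 500 frames, and 32 + 80 gives 112 dimensions) can be checked with a short sketch. One simplifying assumption is made and should be kept in mind: the paper truncates or zero-pads the audio signal itself, whereas this sketch truncates or pads at the frame level with all-zero frames. The helper names and the constant feature values are hypothetical.

```python
FRAME_RATE = 100                       # frames per second (10 ms shift)
MAX_SECONDS = 5
MAX_FRAMES = FRAME_RATE * MAX_SECONDS  # 500 frames per utterance
IS09_DIM, MEL_DIM = 32, 80

def fix_length(frames, dim, max_frames=MAX_FRAMES):
    """Keep the first max_frames frames, or pad with all-zero frames (assumption)."""
    fixed = [list(f) for f in frames[:max_frames]]
    fixed += [[0.0] * dim for _ in range(max_frames - len(fixed))]
    return fixed

def concat_features(is09, mel):
    """Concatenate the two per-frame vectors along the feature axis."""
    return [a + b for a, b in zip(is09, mel)]

# A 3.2 s utterance (320 frames) with placeholder feature values.
short_is09 = [[0.1] * IS09_DIM for _ in range(320)]
short_mel = [[0.2] * MEL_DIM for _ in range(320)]
combined = concat_features(fix_length(short_is09, IS09_DIM),
                           fix_length(short_mel, MEL_DIM))

# A 6.2 s utterance (620 frames) is cut to the first 500 frames.
long_mel = fix_length([[0.2] * MEL_DIM for _ in range(620)], MEL_DIM)
```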
The IEMOCAP dataset contains less reliable samples, i.e., samples for which the evaluators' choices diverge and the emotions are not obvious. To determine whether these less reliable samples adversely affect the model, we separated the less reliable and more reliable samples to construct new datasets and used them for training and evaluation. In this way, we analyzed the impact of less reliable samples on the model. Information on the newly generated datasets is listed in Table 1. We define a less reliable sample as a sample in which D, the evaluators' judgment, is less than two, because the judgments are then mostly inconsistent, and a more reliable sample as a sample in which D is mostly consistent. A more reliable sample may therefore be considered to have a correct label and a clear characteristic of the emotion it contains. Consequently, we expect the model to perform better when more reliable samples are used for training and evaluation.
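The separation into more reliable and less reliable samples can be sketched as a count of how many evaluators chose the majority label. The agreement threshold is left as a parameter, `min_agree`; the value 3 in the usage lines follows the introduction's "three or more evaluators" and is otherwise an assumption. The function names, utterance IDs, and votes below are hypothetical and do not come from the IEMOCAP annotation files.

```python
from collections import Counter

def agreement(votes):
    """Return (majority_label, number_of_evaluators_who_chose_it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count

def split_by_reliability(samples, min_agree):
    """Split utterances into (more_reliable, less_reliable) lists of (id, label)."""
    reliable, unreliable = [], []
    for utt_id, votes in samples.items():
        label, count = agreement(votes)
        target = reliable if count >= min_agree else unreliable
        target.append((utt_id, label))
    return reliable, unreliable

# Hypothetical evaluator votes for three utterances.
samples = {
    "utt_a": ["anger", "anger", "anger"],
    "utt_b": ["happiness", "neutral", "sadness"],
    "utt_c": ["sadness", "sadness", "neutral"],
}
reliable, unreliable = split_by_reliability(samples, min_agree=3)
```

With `min_agree=3`, only the unanimously labeled utterance is treated as more reliable; lowering the threshold to 2 would also admit the utterance with a two-to-one vote.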

Experiment
The model used in the experiment adds the feature-combined attention mechanism on top of the existing model. The model in (1) uses rectified linear unit (ReLU) dense layers and BLSTM recurrent layers. Weighted pooling by an attention mechanism is used to produce the final output. We named this model the "LSTM-attention model (LA)". Similarly, the model that includes the feature-combined attention mechanism in front of the dense layer of the LA model is named the "attention-LSTM-attention model (ALA)". We used three types of features: IS09, mel spectrogram, and IS09 + mel spectrogram. The experiments are accordingly classified as listed in Table 2. In each experiment, the five sessions were used for training and evaluation with five-fold cross-validation, and the overall accuracy (weighted accuracy, WA) and the average recall over the different emotional categories (unweighted accuracy, UA) were then averaged over the folds and used as the performance metrics of the model, as described in [1]. We implemented the model in TensorFlow [30] and trained it with the Adam optimizer for 80 epochs with a minibatch size of 128 and a learning rate of 0.001. To overcome the imbalance of label categories, we weighted the loss function of each category c with $W_c = N_{total} / N_c$, where $N_c$ is the number of samples from category c, and $N_{total}$ is the total number of samples.
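The class weights and the two metrics defined in this paragraph can be sketched as follows: the weight of a category is the total number of samples divided by the number of samples in that category, WA is the overall accuracy, and UA is the mean of the per-category recalls. The function names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def class_weights(labels):
    """Loss weight per category: total sample count divided by category count."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / n for c, n in counts.items()}

def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """UA: recall computed per category, then averaged over categories."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy, imbalanced example (illustrative labels only).
y_true = ["neu", "neu", "neu", "neu", "sad", "sad", "ang", "hap"]
y_pred = ["neu", "neu", "neu", "sad", "sad", "sad", "neu", "hap"]
weights = class_weights(y_true)
wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
```

In this toy example six of eight samples are correct, so WA is 0.75, while the missed "ang" sample pulls its category recall to zero and UA down to 0.6875. This is why both metrics are reported for imbalanced data.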
The experimental results in Table 3 reveal that the ALA model using IS09 and the mel spectrogram shows the best performance. This result suggests that the feature-combined attention mechanism improves the performance of the deep learning model by reducing complexity, focusing only on the important parts of the various features. However, when only IS09 was used as a feature, the ALA model did not perform well, which may be the result of dividing the IS09 feature into important and unnecessary parts even though it was designed specifically for SER and contains no unnecessary parts. In contrast, with the mel spectrogram, in which important and unimportant parts coexist, the ALA model performs better. In addition, by visualizing the attention weights to examine the behavior of the feature-combined attention mechanism, we confirmed that the attention mechanism actually gave less weight to the features that were not necessary for the recognition of emotion and more weight to the necessary ones. The visualization results are shown in Figure 4, which shows the attention weights when the feature vectors are a concatenation of IS09 and the mel spectrogram. The figure shows that the weights are applied differently to the mel spectrogram and IS09 according to the input.

Analysis of IEMOCAP Database
Table 4 compares the performance of the proposed method with some state-of-the-art methods proposed in other studies. All of these studies conducted their experiments using the improvised dataset of IEMOCAP.

| Experiment | Weighted Accuracy (%) | Unweighted Accuracy (%) |
|---|---|---|
| Etienne et al. [28] | 64.5 | 61.7 |
| Tzinis et al. [29] | 64.2 | 60.0 |
| Lee et al. [10] | 62.9 | 63.9 |
| Neumann et al. [30] | 62.1 | - |
| Ramet et al. [9] | 68.8 | 63.7 |
| ISMS_ALA (proposed model) | 67.66 ± 3.4 | 65.08 ± 4.5 |

From Table 4, the model proposed in this study shows almost the same performance as the model proposed in [9]. Changing the parameters several times and altering the experimental environment also produced similar results, which we consider a limitation of the dataset. The problem is the reliability of the IEMOCAP dataset. To confirm this, we experimented with the improvised (I), more reliable (R), and less reliable (U) datasets described in Section 4 and trained and evaluated the model for each dataset.
The experiment was performed using the ALA model. First, as shown in Table 1 of Section 4, the I, R, and U datasets were used for training and evaluation. For the training, the training data of I (TI), R (TR), and U (TU) were used. For the evaluation, the evaluation data of I (EI), R (ER), and U (EU) were used. The parameter settings and the environment were the same as in the previous experiment. The results of the experiment are shown in Table 5. The experimental results reveal that the model trained with TI performed better than the models trained with TR and TU in the EI and EU tests, since I contains both R and U and the model can therefore be optimized for EI and EU. All models exhibited their best performance on ER, confirming the high accuracy of the labels of the more reliable data. Therefore, evaluation on ER should be used for an accurate performance comparison. On ER, the proposed model shows a higher performance, with a WA of 73.2% and a UA of 68.37%.

Discussion
However, it is not confirmed that the more reliable data are in fact more reliable for evaluation, because more reliable samples may simply be easier samples, which are probably positioned farther from the emotional boundaries in the feature space. In this paper, we took the view that comparing evaluations on the more reliable dataset is a more accurate comparison: if the more reliable samples are positioned farther from the reliable samples of other emotional categories, classification models should be better at mapping inputs to categories in the feature space. Even so, we believe more research is required on the more reliable samples that classification models assign to the wrong emotion, in order to find the reason.

Conclusions
In this study, we attempted to improve the performance of the SER model by combining the IS09 features, which are mainly used in SER, and the mel spectrogram, an LLD, and using them as inputs. For this purpose, after the concatenation of IS09 and the mel spectrogram, the attention mechanism was used to combine the features by weighting the appropriate parts. After that, the same model with the dense layer and bidirectional LSTM was used. The model was tested using the IEMOCAP dataset. Experimental results show that the ALA model for SER improved by approximately 3% over the LA model in terms of weighted accuracy and unweighted accuracy.
However, we also found that the limit on further performance improvement was a reliability issue stemming from the IEMOCAP dataset labeling method. We isolated the reliability problem of the IEMOCAP dataset through additional experiments. For this, we took the view that evaluation on a dataset of samples for which two or more people agreed on the emotion can show accurate results. On this dataset, the model achieved a WA of 73% and a UA of 68%.

Figure 1. Architecture of our proposed research.


Figure 4. Feature-selected attention-weight visualization. (a) Mel spectrogram feature, as weighted by the attention mechanism. (b) IS09 feature, as weighted by the attention mechanism.

Table 1. Specifications of the improvised dataset, more reliable dataset, and less reliable dataset.

Table 2. Specification of each experiment.

Table 3. Experimental results in the form of mean accuracies ± standard deviations.

Table 4. Comparison of performance with other studies.

Table 5. Results of the ALA model for improvised (I), more reliable (R), and less reliable (U) datasets.