A Study on a Speech Emotion Recognition System with Effective Acoustic Features Using Deep Learning Algorithms

Abstract: The goal of the human interface is to recognize the user's emotional state precisely. In speech emotion recognition, the most important issue is the effective parallel use of two components: the extraction of proper speech features and an appropriate classification engine. Well-defined speech databases are also needed to accurately recognize and analyze emotions from speech signals. In this work, we constructed a Korean emotional speech database for speech emotion analysis and proposed a feature combination that can improve emotion recognition performance using a recurrent neural network model. To investigate the acoustic features that can reflect distinct momentary changes in emotional expression, we extracted F0, Mel-frequency cepstrum coefficients, spectral features, harmonic features, and others. Statistical analysis was performed to select an optimal combination of acoustic features that affect the emotion in speech. We used a recurrent neural network model to classify emotions from speech. The results show that the proposed system performs more accurately than those of previous studies.


Introduction
Recently, with technological development in the information society, high-performance personal computers have rapidly become widespread. Consequently, interaction between computers and humans has been evolving into a bidirectional interface. Therefore, there is a need to better understand human emotions, which could improve interaction systems between humans and machines [1,2]. For these reasons, emotion recognition has become an attractive research topic in signal processing [3]. The goal of the human interface is to recognize the user's emotional state precisely and to provide personalized media according to the user's emotions.
Emotion refers to a conscious mental reaction that one experiences subjectively; in other words, it is a strong feeling generally accompanied by physiological and behavioral changes in the body [4]. To recognize a user's emotional state, many studies have applied diverse forms of input, such as facial expression, speech, text, video, and others [5][6][7][8][9][10][11]. Among emotion recognition modalities, the speech signal is one of the most natural channels of human communication. It contains linguistic content and implicit paralinguistic information, including the speaker's emotions. Several studies have reported that acoustic features, speech-quality features, and prosodic features carry abundant emotional significance [12]. In speech emotion recognition, the most important issue is the effective parallel use of the extraction of proper speech features and an appropriate classification engine. These features include formant, energy, and pitch features [13][14][15]. Moreover, many studies of speech emotion recognition representatively utilize the Mel-frequency cepstrum coefficients (MFCC) feature [16,17]. However, because there is no overt and deterministic mapping between features and emotional state [18], speech emotion recognition still has a lower recognition rate than other emotion-recognition methods, such as facial expression recognition (FER).

Emotional Speech Database
The Berlin Emotional Speech database (EmoDB) is a database of emotional speech recorded with German utterances [19]. The database was constructed using everyday sentences, so that priority was given to the naturalness of speech. Ten actors, five female and five male, recorded speech data according to specified emotions, producing 10 different sentences for seven kinds of emotion: anger, boredom, disgust, fear, happiness, sadness, and neutral. The special feature of EmoDB is that all of its utterances are composed of emotionally neutral words, which makes EmoDB helpful for identifying proper emotion features because it excludes the influence of the emotional content of the words.
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database consists of about 12 h of recorded data [21]. Speech, facial, and landmark data were captured during each recording session. Each session was a sequence of conversations involving men and women. A total of 10 actors split into five pairs took part in the recording. All data were recorded in a professional cinema studio to maximize the sound quality. Actors were seated across from each other at a "social" distance of three meters. This setup enables realistic communication. An evaluation test was carried out with seven subjects to label each utterance based on both audio and video streams. The data were categorized using 10 labels: neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited, or other. Speech data not recognized by the subjects as exhibiting the appropriate emotion were not included in the database. Data were included only when more than half of the answers were consistent. This process provided evidence for the credibility of human recognition.
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a video database of emotional speech and songs in North American English, classified into eight emotions, including neutral, calm, happy, sad, angry, fearful, disgusted, and surprised [20]. The database comprises information from 24 professional actors, and each actor has 60 audio-visual (AV) items and 44 song items, for a total of 104 data points. Each recorded production of an actor was available in three modality formats: AV, video only, and audio only.

Speech Emotion Recognition
Speech signals are among the most natural media of human communication, and they have the merit of simple real-time measurement. They contain implicit paralinguistic information and linguistic content, including the speaker's emotions. In designing a speech emotion recognition system, identifying and extracting different emotion-related speech features is a challenging task. Since proper selection of the features affects the classification performance, it is critical to combine appropriate audio features in speech emotion recognition. There have been many approaches to recognizing emotion from speech, and each study used different speech features. Linear Prediction Cepstrum Coefficients, MFCCs, and F0 have been widely used for speech emotion recognition. However, the question of whether these features are effective for classifying emotions is still under discussion in many studies [23]. A few studies extracted features from speech and recognized emotions utilizing Gaussian Mixture Models (GMMs) [24] and Hidden Markov Models (HMMs) [25]. Recently, with the rapid development of deep-learning algorithms, RNNs have been applied to various fields of speech analysis [26][27][28]. The main focus of the approach in [26] is to classify the emotions in an utterance, rather than classifying the emotions using the frames of the utterance. To extract global features, the authors fed 32-dimensional frame features directly into an RNN. Then, the global features were fed into an extreme learning machine to classify the emotions. This approach recognizes emotions from an utterance, rather than from the frames of the utterance, and therefore requires considerable computational power to train the network. In another approach, Wieman et al. [29] found that binary decision trees can determine the features most relevant to emotions. However, their experiment was performed on a small dataset.
A study [30] assumed that the characteristics of speech vary from person to person, and that emotions are affected by the age, gender, and acoustic features of the speaker. The authors focused on speech emotion recognition by grouping speech data by age and gender. They proposed a hierarchical gender- and age-based model and utilized different feature vectors from OpenSmile [31] and eGeMAPS [32]. The results indicated that building a separate classifier for each gender and age group produces better performance than having one model for all genders and ages. Chernykh et al. [33] represented the total set of acoustic features as a sequence of 34-dimensional vectors for each utterance.
An RNN with a Connectionist Temporal Classification approach [34] was used to classify emotions from speech. Zhao et al. proposed two CNN LSTM networks created by stacking local feature learning blocks and other layers, to extract emotional features [9]. The 1D CNN LSTM network was intended to recognize speech emotion and extract deep features from a raw signal. The 2D CNN LSTM network focused mainly on learning global contextual information from a log-mel spectrogram. The CNN LSTM networks could recognize speech emotion effectively, without using hand-crafted features.
Many studies have demonstrated the correlation between emotional voices and acoustic features [35][36][37][38][39][40]. However, because there is no overt and deterministic mapping between features and emotional state, speech emotion recognition still has a lower recognition rate than other emotion-recognition methods, such as FER. For this reason, finding the appropriate feature combination is a critical task in speech-based emotion recognition.

Korean Emotional Speech Database
The study of emotion recognition has rapidly developed over the last decade, with broad interest from researchers in neuroscience, computer science, psychiatry, psychology, and audiology. The key to these studies is securing validated and reliable expressions of emotion. Most emotion data sets include either facial expressions or speech recordings; among them, few contain audio-visual recordings of speakers in Korean. This study constructs a Korean emotional speech database (K-EmoDB) and reports the validity and reliability of the data based on ratings from participants. The database was recorded with Korean utterances from professional actors. All data were recorded in a professional studio, preserving the sound quality by eliminating any background noise. The database was recorded using two different methods: (1) recording emotional speech using 120 emotion-inducing scenarios (20 for each emotion); (2) recording 20 emotionless sentences, which did not affect the emotions, for each of the six emotions.
The database was recorded with Korean utterances from 20 professional actors (M = 28.1, SD = 3.41, age range = 23-35, 11 males and 9 females). To be eligible, actors needed to have Korean as their first language, to speak with a neutral Seoul accent, and to not possess any distinctive features. Participants were also required to identify text presented at a distance of 1.5 m. Figure 1a shows an actual actor during recording.

We chose six emotions for the experiment: neutral, anger, happiness, sadness, fear, and excitement.
Here, "excitement" refers to a more aroused, delighted emotion than "happiness". Neutrality was selected as the baseline emotion, and the remaining states consisted of the set of five basic or fundamental emotions that are thought to be culturally universal. The concept of primary emotions has a long history in science and philosophy. While the discrete model of emotion has been criticized [41][42][43], it is a practical choice for the creation and labelling of emotion sets [21]. Therefore, the six emotions can be found in most existing sets [44][45][46][47][48][49][50][51][52][53][54].
K-EmoDB was created using the following procedure. Actors were recruited through online casting, and 20 actors were selected by audition. All data were recorded in a professional studio, in order to optimize the sound quality of the data by eliminating any background noise. The Neumann TLM 103, Neumann U87 Ai, and Oktava MK-319 microphones and the Universal Audio LA-610 MK1, WARM AUDIO WA76, and MPAC-01 mixers were used for recording, as shown in Figure 1b. Each person recorded 120 sentences. Each speech data point is approximately three to five seconds long. All speech emotion data were recorded at 48 kHz and downsampled to 16 kHz in a PCM signed 16-bit format.
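The 48 kHz to 16 kHz conversion described above can be sketched as a factor-of-3 decimation followed by 16-bit PCM quantization. This is a minimal illustrative sketch, not the recording pipeline actually used: the moving-average filter stands in for a proper anti-aliasing low-pass filter, and the test tone is hypothetical.

```python
import numpy as np

# Naive sketch of the 48 kHz -> 16 kHz downsampling step (factor of 3).
# A real pipeline would apply a proper anti-aliasing low-pass filter before
# decimating; a simple moving average stands in for that filter here.

fs_in, fs_out = 48000, 16000
factor = fs_in // fs_out                     # 3

t = np.arange(fs_in) / fs_in
x = np.sin(2 * np.pi * 440.0 * t)            # one second of a hypothetical 440 Hz tone

kernel = np.ones(factor) / factor            # crude low-pass: moving average
x_lp = np.convolve(x, kernel, mode="same")
y = x_lp[::factor]                           # keep every 3rd sample

# 16-bit PCM quantization, matching the stored database format
pcm16 = np.clip(np.round(y * 32767), -32768, 32767).astype(np.int16)

print(len(y), pcm16.dtype)
```

In practice a polyphase resampler with a designed low-pass filter would be preferred over simple decimation.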
To ensure the emotional quality and naturalness of the data, an evaluation test was carried out with seven subjects. The subjects listened to each sample and had to assign an emotion score from 1 to 5. If the subjects felt that the emotion was clearly contained in the speech data, they gave it a score of 5; if they felt that the speech data did not convey the emotion, they gave it a score of 1. Based on the evaluation results, 150 emotion data points in each category were chosen to construct the final K-EmoDB database. The final data consist of male and female speech files in a 50:50 ratio.
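The rating-based selection step can be sketched as follows. This is an illustrative sketch only: the scores are randomly generated stand-ins for the seven subjects' 1-to-5 ratings, and the exact selection rule used for K-EmoDB is not specified beyond choosing 150 items per category.

```python
import numpy as np

# Hedged sketch of the rating-based selection: seven subjects score each
# utterance from 1 (emotion absent) to 5 (emotion clear), and the 150
# best-rated utterances per emotion category enter the final database.

rng = np.random.default_rng(0)
n_utterances = 400                                    # hypothetical candidate pool
scores = rng.integers(1, 6, size=(n_utterances, 7))   # 7 subjects, scores 1..5

mean_score = scores.mean(axis=1)
# rank utterances by mean rating and keep the top 150 for this category
keep = np.argsort(mean_score)[::-1][:150]

print(len(keep))
```

A mean-rating cutoff is one plausible reading; a majority-agreement rule (as used for IEMOCAP) would work similarly.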

Speech Emotion Recognition
Speech is not only the most natural and universal means of communication, but it also contains paralinguistic information such as emotion and tone. Several previous studies have used prosodic and acoustic speech features to recognize emotion, and the current study investigated acoustic features employed in previous voice recognition studies using the emotion model shown in Figure 2.

Figure 2. Two-dimensional emotion model [55].

Figure 2 shows the two-dimensional emotion model based on discrete emotion theory developed in previous studies. The horizontal and vertical axes refer to valence and arousal, respectively, where arousal represents the excitement level (high or low), and valence represents positive versus negative emotion. For example, anger involves high excitement and negative emotion, and hence has high arousal and negative valence.
Most previous speech emotion studies mainly employed anger, disgust, fear, happiness, sadness, and surprise, whereas the present study added excitement to the basic emotions of anger, happiness, neutrality, sadness, and fear, providing six emotions describing speech purpose that are reasonably evenly distributed over the arousal and valence scales. Generally, arousal is relatively easy to distinguish, whereas valence can be somewhat difficult. For example, voice pitch differs between happiness and sadness, making them relatively easy to distinguish, whereas voice pitch is high for both happiness and anger, making them more difficult to distinguish. Many studies have considered methods to discriminate valence in speech emotion recognition.

Feature Selection
Since speech acoustic features are important for emotion recognition, it is essential to select and analyze appropriate features. Lindstrom (2010) found that harmonic structures expressed positive and peaceful emotions, such as happy, cheerful, comfortable, and elegant, whereas dissonant intervals were closely related to negative and sad emotions, such as agitated, tense, and angry. Harmonic and dissonant speech intervals are closely related to the harmonic structure of the speech spectrum [56]. Table 1 lists the harmonic features used to quantitatively analyze the harmonic structure. We extracted the harmonic features using Essentia, an open-source library and set of tools for audio and music analysis [57].

Table 1. Harmonic features.
1 Harmonic energy
2 Noise energy
3 Noiseness
4 F0
5 Inharmonicity
6 Tristimulus
7 Harmonic spectral deviation
8 Odd to even harmonic ratio

Given the frequencies and magnitudes of the harmonic peaks, the harmonic energy feature is the total energy of the frequency peaks in the harmonic structure. Noise refers to the part of the spectrum outside the harmonic structure, and the noise energy is the harmonic energy subtracted from the entire spectrum energy. Noiseness represents the ratio of noise in a spectrum: the noise energy divided by the total spectrum energy. The F0 feature is the fundamental frequency of the speech. The inharmonicity feature shows the degree to which a spectrum departs from a harmonic structure, and it is calculated as a weighted sum over the spectrum values closest to multiples of the fundamental frequency. The tristimulus feature has three components: the first is the relative weight of the first harmonic; the second is the relative weight of the second, third, and fourth harmonics; and the third is the relative weight of all the remaining harmonics. A detailed description of the harmonic features can be found in [57].
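Three of the harmonic features described above can be sketched directly from harmonic-peak frequencies and magnitudes. This is a simplified sketch using common definitions; Essentia's exact formulas differ in detail, and the peak values below are hypothetical.

```python
import numpy as np

# Hedged sketch of noiseness, inharmonicity, and tristimulus from harmonic
# peaks. `freqs`/`mags` are hypothetical peak frequencies (Hz) and magnitudes.

f0 = 200.0
freqs = np.array([200.0, 401.0, 603.0, 802.0, 1005.0])   # near-multiples of f0
mags  = np.array([1.0, 0.6, 0.4, 0.25, 0.1])

harmonic_energy = np.sum(mags ** 2)
total_energy = harmonic_energy + 0.05        # assumed residual noise energy
noise_energy = total_energy - harmonic_energy
noiseness = noise_energy / total_energy      # ratio of noise in the spectrum

# Inharmonicity: energy-weighted deviation of each peak from the ideal
# multiple of f0, normalized by f0
n = np.arange(1, len(freqs) + 1)
inharmonicity = np.sum(np.abs(freqs - n * f0) * mags ** 2) / (f0 * np.sum(mags ** 2))

# Tristimulus: relative weight of harmonic 1, harmonics 2-4, and the rest
t1 = mags[0] / mags.sum()
t2 = mags[1:4].sum() / mags.sum()
t3 = mags[4:].sum() / mags.sum()

print(round(noiseness, 3), round(inharmonicity, 4), (t1, t2, t3))
```

By construction the three tristimulus components sum to one, and a perfectly harmonic spectrum (peaks at exact multiples of F0) gives zero inharmonicity.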
Table 2 shows the test results for distinguishing negative valence from positive valence at the same arousal level. We measured the classification accuracy of each feature combination using medium Gaussian support vector machines. The negative and positive emotions with high arousal were anger and happiness, respectively; combining the harmonic features noise energy, noiseness, inharmonicity, and tristimulus yielded 65.5% accuracy when classifying these two emotions. The negative and positive emotions with low arousal were sadness and neutrality, respectively; using the harmonic features noiseness, F0, inharmonicity, and tristimulus yielded 76% accuracy. These results clarify that negative and positive valence can be distinguished at the same arousal level when inharmonicity, tristimulus, harmonic energy, noise energy, and noiseness are used in combination.
Features widely used for speech emotion recognition in previous studies were also analyzed, including MFCC, LPC, zero-crossing rate, signal energy, FFT spectrum (magnitude), Mel spectrum, cepstral features, pitch, harmonic-to-noise ratio, spectral features, chroma, and others. Through the individual classification performance of each feature, we carried out an experiment to identify the features specialized for emotion recognition. Among individual features, MFCC generally showed the highest accuracy (approximately 60%). We then redesigned feature sets by combining the features in order of their individual accuracy, starting with the highest; comparing the results with the individual features, we confirmed that the combined feature sets improved the result by 10% on average. The feature-vector combination with the highest accuracy, the so-called "basic feature combination", is shown below.

Speech Segment Extraction
The need to determine whether a given section of a speech signal is speech or silence arises in many speech analysis systems. When non-speech sections are included in the learning or testing process, they provide unnecessary information and become an obstacle. Since the signal energy of a speech segment is larger than that of a non-speech segment, an absolute integral value (IAV) reflecting the energy is used. The IAV value is computed by Equation (1):

IAV = Σ_(i=1)^N |X_i|, (1)

where X is the recorded signal, N is the number of samples, and i is the sample index. The process of extracting speech segments is shown in Figure 3. The process of selecting the IAV threshold value is as follows. First, the IAV feature vector is extracted from the interval of the signal. Then, the maximum and minimum values are calculated, and the threshold is set to 10% of the difference between these two values. An example of determining the threshold is shown in Figure 4. The signal threshold, used to find the starting point of the signal within the window, is computed as well as the IAV threshold. The signal threshold is the IAV threshold divided by the frame size. As the IAV is the absolute integral of all signal values in the window, dividing it by the window size gives the average signal value of the critical section.
The process of extracting a speech interval includes a point at which the window is larger than the IAV value, and it determines a point at which the window is larger than the signal threshold value as a starting point. If an extracted IAV value is smaller than the IAV threshold, the end point is determined.
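The IAV-based extraction above can be sketched as follows. This is a minimal sketch under assumptions: the 10 ms window size and the synthetic "speech burst" are hypothetical, while the 10%-of-range threshold rule follows the text.

```python
import numpy as np

# Minimal sketch of IAV-based speech segment extraction.

def iav(window):
    # absolute integral value: sum of absolute sample values in the window
    return np.sum(np.abs(window))

fs = 16000
win = 160                                   # 10 ms windows (assumed)
t = np.arange(fs) / fs
signal = np.zeros(fs)
signal[4000:12000] = 0.5 * np.sin(2 * np.pi * 220 * t[4000:12000])  # "speech" burst

windows = signal[: len(signal) // win * win].reshape(-1, win)
iavs = np.array([iav(w) for w in windows])

# IAV threshold: minimum plus 10% of the (max - min) range
thr = iavs.min() + 0.1 * (iavs.max() - iavs.min())

active = iavs > thr
start = int(np.argmax(active)) * win                        # first window above threshold
end = (len(active) - int(np.argmax(active[::-1]))) * win    # one past the last active window

print(start, end)
```

The start point is the first window whose IAV exceeds the threshold, and the end point is detected when the IAV falls back below it, matching the description above.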

Feature Scaling
If the ranges of the attribute values in the learning data differ greatly, learning does not work efficiently. For example, if property A ranges from 1 to 1000 and property B ranges from 1 to 10, A is reflected as if it had a significant impact on the neural network, while B acts as if it barely affected the network. Thus, transforming each property value into the same range is necessary before learning, and this process is referred to as "feature scaling". Among the various feature scaling methods, min-max scaling, which maps each feature into the range 0 to 1 based on its maximum and minimum values, is most commonly used. However, simply normalizing the range to between 0 and 1 is not suitable for the features used in this study, because the differences between values decrease excessively when the range is compressed to 0-1. In this study, we normalize the features using the standard-score method, which considers the variation of the values as well as their range. The formula for this scaling method is:

x' = (x − x̄) / σ, (2)

where x is the input feature vector, x' is the normalized feature vector, x̄ is the average of the input vector, and σ is the standard deviation of the input.
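Equation (2) applied per feature column can be sketched as follows; the feature values are hypothetical stand-ins with deliberately mismatched ranges, as in the example above.

```python
import numpy as np

# Standard-score (z-score) scaling of Equation (2), applied per feature column.
features = np.array([[980.0, 3.0],
                     [1005.0, 7.0],
                     [1020.0, 5.0]])        # hypothetical features with very different ranges

mean = features.mean(axis=0)
std = features.std(axis=0)
scaled = (features - mean) / std            # x' = (x - mean) / sigma

print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every feature column has zero mean and unit variance, so no single feature dominates the network purely because of its range.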

Emotion Recognition Model
This study used an LSTM model to recognize emotional state from speech. LSTM is a modified RNN structure proposed by Hochreiter et al. [22], where the hidden layer nodes are replaced by blocks. An LSTM block comprises memory cells with a cyclic structure and input, forget, and output gates. LSTM blocks differ from conventional RNN models in that they control the flow of information by passing or blocking it with their gates, depending on the specific circumstances. The LSTM calculates the i-th value of each node at time t as

c_i(t) = tanh(Σ_j U_ij x_j(t) + Σ_j W_ij h_j(t−1) + b_i), (3)

where x_j(t) is the j-th input variable and c_i(t) is the i-th input state, calculated by applying the weights U, W, and b to the input at time t and the hidden layer h(t−1). The same form as Equation (3), with a sigmoid in place of tanh, is applied to the input, forget, and output gates g, f, and q, respectively, so that the weighted sum with the constant term is converted into the range 0 to 1. A gate output of 1 means that all information is maintained, whereas a gate output of 0 means that all information is deleted. The cell state s_i(t) at time t is calculated from c_i(t) and the gates g and f as s_i(t) = f_i(t) s_i(t−1) + g_i(t) c_i(t), and the i-th hidden node is h_i(t) = q_i(t) tanh(s_i(t)). Thus, when f = 1 and g = 0, the current input cannot influence the cell state, so the cell state never changes and long-term memory is maintained while short-term information is discarded; the opposite holds when f = 0 and g = 1.
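A single timestep of the gate equations above can be sketched in NumPy. This is an illustrative sketch, not the trained network: the weights are random placeholders, and the input and hidden dimensions (4 and 3) are arbitrary.

```python
import numpy as np

# Minimal single-timestep LSTM cell following the gate equations above.
rng = np.random.default_rng(1)
nx, nh = 4, 3                                             # assumed input/hidden sizes
U = {k: rng.normal(0, 0.1, (nh, nx)) for k in "cgfq"}     # input weights per gate
W = {k: rng.normal(0, 0.1, (nh, nh)) for k in "cgfq"}     # recurrent weights per gate
b = {k: np.zeros(nh) for k in "cgfq"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev):
    c = np.tanh(U["c"] @ x + W["c"] @ h_prev + b["c"])    # candidate input state, Eq. (3)
    g = sigmoid(U["g"] @ x + W["g"] @ h_prev + b["g"])    # input gate
    f = sigmoid(U["f"] @ x + W["f"] @ h_prev + b["f"])    # forget gate
    q = sigmoid(U["q"] @ x + W["q"] @ h_prev + b["q"])    # output gate
    s = f * s_prev + g * c                                # new cell state
    h = q * np.tanh(s)                                    # new hidden state
    return h, s

h, s = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh))
print(h.shape, s.shape)
```

In practice a framework implementation (e.g., a bidirectional LSTM layer) would be used for training; the step above only illustrates how the gates combine the cell state and the current input.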
In most time-series models, data from previous times influence the current data. Therefore, RNN models predict the status at the current time t using data from previous times. However, the current status can often be predicted more accurately using future data. For example, prediction accuracy may improve for Korean-language objects if the verb, which appears after the object, is utilized rather than the subject. Thus, inferring in the backward as well as the forward direction may produce more meaningful results.
Equation (4) shows that a conventional LSTM uses the data from past time t−1 to predict the data for the current time t. No backward connection exists, so the current time cannot be predicted using future data. Therefore, bi-directional LSTM models are commonly employed to resolve this drawback, as shown in Figure 5. This study used the feature vectors discussed in Section 4.1 and a bi-directional LSTM to recognize emotions from speech. Figure 6 shows the bi-directional LSTM model employed.

Figure 6. Proposed emotion recognition network structure.

Generally, emotion can be recognized when we hear an entire spoken sentence or phrase, but it is very difficult to recognize emotion from a short segment of speech (e.g., one second). Thus, emotion recognition accuracy generally improves for longer rather than shorter intervals.
However, LSTM suffers from the vanishing gradient problem as the network becomes deeper (that is, when speech is recognized over a longer interval). This study therefore set the LSTM interval as short as possible and then calculated the emotion probability over each section to decide the final emotion label, avoiding the gradient loss problem while still using longer speech intervals to recognize emotion.
Input speech data were divided into speech sections through the pre-processing procedure that searches for speech sections, as described in Section 4.2.1. The combined feature vectors discussed in Section 4.1 were extracted from the divided speech sections in chronological order, using approximately 15 ms frames with an overlap ratio of 0.3. The extracted feature vectors were reconfigured into 78-dimensional vectors, the emotion for each part was computed approximately every second in chronological order, and each frame was classified using the trained model. A voting mechanism over the classified frames then determines the final emotion label of the speech signal.
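The framing and voting steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the signal is a dummy array rather than real speech, and the frame length is a toy value in samples (the paper's actual frames are approximately 15 ms with an overlap ratio of 0.3). Only the overlap logic and the majority vote over per-section labels are demonstrated.

```python
import numpy as np
from collections import Counter

def frame_signal(signal, frame_len, overlap=0.3):
    """Slice a 1-D signal into frames that overlap by the given fraction."""
    hop = int(frame_len * (1.0 - overlap))            # step between frame starts
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.stack([signal[s:s + frame_len] for s in starts])

def vote(labels):
    """Majority vote over per-section emotion labels -> final emotion."""
    return Counter(labels).most_common(1)[0][0]

sig = np.arange(100.0)                                # dummy 100-sample signal
frames = frame_signal(sig, frame_len=20, overlap=0.3) # hop = 14 samples
print(frames.shape)
print(vote(["anger", "neutral", "anger", "anger", "sadness"]))
```

In the actual pipeline, each frame would be mapped to a 78-dimensional feature vector and classified by the trained bi-directional LSTM before voting; the sketch stops at the segmentation and voting mechanics.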

Experiments and Results
Two experiments were conducted to evaluate the performance of the proposed method. The first experiment used the K-EmoDB database developed in Section 3. We chose six emotions for the experiment: neutral, anger, happiness, sadness, fear, and excitement. Randomly selected records (630 of 900, i.e., 70%) were used as training data, and the remaining 270 records as test data. We used 5-fold cross-validation, i.e., repeated the sample selection five times. The second experiment compared the proposed method with previous emotion recognition studies using various international databases; we used the databases employed in those studies and compared accuracy both directly and indirectly. Table 3 shows the emotion recognition results for K-EmoDB. Chernykh [33] achieved approximately 67.9% accuracy using 34 acoustic features and an LSTM. Shaqr [30] employed eGeMAPS features and a multi-layer perceptron model to achieve 67.14% accuracy. Zamil [58] used only MFCC as the acoustic feature and achieved 61.32% with a logistic model tree. George [59] and Jianfeng [9] proposed emotion recognition methods based on end-to-end learning with convolution layers: recognition using a 1D convolution layer and LSTM with time-series speech signals as input achieved 65.89% accuracy, and recognition using an LSTM layer combining 1D and 2D convolution layers with a Mel-spectrogram input achieved 62.63%. When the basic feature combination proposed in this study was used, the accuracy was 70.51%. When emotion recognition was performed by combining the basic feature combination with harmonic features, the accuracy was 75.46%, an improvement of around 5% over not using the harmonic features. Moreover, when the final emotion of the speech data was produced by determining the label of each section at a shorter interval, a high accuracy of 83.81% was obtained.
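The repeated 70/30 split described above can be sketched in a few lines. This is a hypothetical reconstruction of the sampling procedure, not the paper's actual code: it draws five independent random 630/270 partitions of 900 record indices (whether the paper used disjoint folds or independent resamplings is not specified, so independent resampling is assumed here).

```python
import numpy as np

def repeated_splits(n_records, train_frac=0.7, repeats=5, seed=0):
    """Yield (train_idx, test_idx) index pairs for repeated random splits."""
    rng = np.random.default_rng(seed)
    n_train = int(n_records * train_frac)
    for _ in range(repeats):
        perm = rng.permutation(n_records)   # fresh shuffle each repetition
        yield perm[:n_train], perm[n_train:]

sizes = [(len(tr), len(te)) for tr, te in repeated_splits(900)]
print(sizes)  # five 630/270 train/test partitions
```

Per-split accuracies would then be averaged to obtain the figures reported in Table 3.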

International DB
We also compared emotion recognition using emotional speech databases widely used in previous studies. Table 4 compares the prediction accuracy reported in the respective papers. Although the databases and emotions differed between studies, accuracies were approximately 60-70%. The proposed method, however, achieved 85.79% accuracy on EmoDB and 87.11% on the RAVDESS database. For the RAVDESS result, we used two deep neural networks to classify emotions from speech and images, respectively. The networks reflected temporal representations from sequential image and speech data. We fine-tuned the softmax layers of the pre-trained networks, considering the characteristics of each input, to maximize the ability of the networks. Consequently, the proposed method showed more accurate results than the other models. Thus, higher accuracy can be acquired by combining the proposed acoustic features with a facial emotion recognition (FER) model.

Conclusions
In this paper, we constructed a Korean emotional speech database for speech emotion analysis and proposed a feature combination that improves emotion recognition performance using an RNN model. We carried out speech emotion recognition using this emotional speech database. To investigate whether harmonic and dissonant speech intervals affect positive and negative emotions, respectively, we extracted and analyzed harmonic features closely related to harmonic and dissonant speech. The experiments showed that harmonic features help distinguish valence, increasing accuracy by around 5% compared to not using the harmonic features. This study set the LSTM interval as short as possible and then calculated the emotion probability over each section to decide the final emotion label, avoiding the gradient loss problem while still using longer speech intervals to recognize emotion. It was shown experimentally that recognizing emotions after dividing speech into short intervals produces better recognition accuracy (83.81%) than recognizing speech emotion with one long LSTM model (75.51%).
In future work, a considerable amount of data is required to train deep learning-based models accurately. Studies on speech emotion recognition use far less data than studies based on videos or texts, so high-quality databases must be built to improve performance and generalization. To recognize emotions from speech, it is also necessary to find an accurate mapping between acoustic features and the intensity of emotions using large databases. Although studies have used a variety of acoustic features, they still cannot determine what emotional intensity is manifested when a certain feature appears at a certain level. These problems must be solved to recognize emotions and their intensities more accurately.

Data Availability Statement: Data sharing is not applicable to this article.