Speech Emotion Recognition Using a Dual-Channel Complementary Spectrogram and the CNN-SSAE Neural Network

Featured Application: Emotion recognition is the computer's automatic recognition of the emotional state of input speech. It is a hot research field, resulting from the mutual infiltration and interweaving of phonetics, psychology, digital signal processing, pattern recognition, and artificial intelligence. At present, speech emotion recognition has been widely used in the fields of intelligent signal processing, smart medical care, business intelligence, assistant lie detection, criminal investigation, the service industry, self-driving cars, voice assistants of smartphones, and human psychoanalysis.

Abstract: Against the background of artificial intelligence, the realization of smooth communication between people and machines has become a widely pursued goal. The Mel spectrogram is a common representation used in speech emotion recognition, focusing on the low-frequency part of speech. In contrast, the inverse Mel (IMel) spectrogram, which focuses on the high-frequency part, is proposed so that emotions can be analyzed comprehensively. Because the convolutional neural network-stacked sparse autoencoder (CNN-SSAE) can extract deep optimized features, the Mel-IMel dual-channel complementary structure is proposed. In the first channel, a CNN is used to extract the low-frequency information of the Mel spectrogram. The other channel extracts the high-frequency information of the IMel spectrogram. This information is passed to an SSAE to reduce the number of dimensions and obtain the optimized information. Experimental results show that the highest recognition rates achieved on the EMO-DB, SAVEE, and RAVDESS datasets were 94.79%, 88.96%, and 83.18%, respectively. The recognition rate obtained with the two spectrograms combined was higher than that of either single spectrogram, which proves that the two spectrograms are complementary. The SSAE followed the CNN to obtain the optimized information, and the recognition rate was further improved, which proves the effectiveness of the CNN-SSAE network.


Introduction
Emotion recognition has become an important research field in neuroscience, computer science, cognitive science, and medicine [1,2]. Although facial expressions [3], body movements [4], gestures [5], and EEG signals [6] can provide a good representation of human emotions, many researchers extract raw emotions from speech signals because speech is a fast and natural means of communication between humans [7]. Speech emotion recognition (SER) has wide application prospects in modern intelligent systems, such as self-driving cars, voice assistants of smartphones, human psychoanalysis, and medical services [8,9]. Despite the many applications, SER is a challenging task because emotions are not objective: there is no consensus on how to classify or measure them [10].
SER usually consists of four parts: preprocessing, feature extraction, feature optimization or selection, and classification [11]. Feature extraction is very important for SER. Speech features include acoustic features and spectrographic features [12].

The structure of this paper is as follows. Section 2 introduces the related research. In Section 3, the architecture of SER is described. Section 4 describes the generation of the IMel spectrogram and provides theoretical analysis. Section 5 presents the CNN-SSAE neural network for extracting deep optimized features. Section 6 introduces the datasets, features, experimental settings, and evaluation scenarios. Section 7 presents the conclusions and discusses future research directions for SER.

Development of Acoustic Features
Mel-frequency cepstral coefficients (MFCCs) are the most widely used perceptual incentive features. Their improved forms [29], such as log frequency power coefficients [30] and linear prediction cepstral coefficients [31], have been used in SER. In addition to these basic features, researchers have constructed many new acoustic features for SER. In reference [32], inspired by the AM-FM modulation model, the features of modulation spectra and modulation frequency were extracted and combined with cepstral features. In reference [33], three features were selected: the MFCC feature of the typical human auditory model, the Zero Crossings with Maximal Teager Energy Operator (ZCMT) feature of the uncompensated human auditory model, and the Glottal Compensation to Zero Crossings with Maximal Teager Energy Operator (GCZCMT) feature. In reference [34], a new set of harmonic features was proposed. Reference [35] proposed weighted spectral features based on Local Hu moments (HuWSF). However, these features do not reflect the relationship between the frequency domain and time domain at the same time, and therefore many researchers have turned their attention to spectrograms.

Development of the Spectrogram
Previous studies of the spectrogram can be divided into three situations: (1) Improving the recognition network. Reference [36] inputs the standard spectrogram into a deep CNN with a rectangular kernel for recognition, reference [37] inputs the standard spectrogram into a CNN with an improved pooling strategy for recognition, and reference [38] inputs the Mel spectrogram into a visual attention CNN for recognition. (2) Processing the spectrogram to improve recognition. In references [39,40], log-Mels, deltas, and delta-deltas were used as the inputs of the hybrid system. In reference [41], after 88 viterbilt symbols were extracted from each frame signal, K-means clustering was applied, and the standard spectrograms of the key frames were sent to a three-dimensional CNN. In reference [42], the multiscale standard spectrogram was processed by a Chirplet filter, and the local direction, time information, intensity, and frequency comparison features were sent to a support vector machine (SVM). (3) Combining acoustic features with the spectrogram. Reference [43] combined the IS09 feature in openSMILE [44] with the Mel spectrogram and sent them to an attention-LSTM-attention model. Reference [45] proposed combining the acoustic features and spectrogram features to improve the recognition rate.
To summarize, these studies mainly used the standard spectrogram or Mel spectrogram to construct SER models. To represent emotional information in a more comprehensive manner, we propose the IMel spectrogram, which pays more attention to high-frequency information.

Development of Feature Optimization
Overall, high dimensionality has a substantial impact on classification accuracy and efficiency. To ensure higher precision and shorter calculation time, the number of feature dimensions should be reduced. In reference [46], three dimensionality-reduction methods (LDA, PCA, and PCA + LDA) were used to reduce the number of dimensions of the features. In reference [34], the sequential floating forward selection algorithm, an iterative method for finding features that are close to the optimal features, was proposed. In reference [47], a feature selection method, particle swarm optimization-assisted biogeography-based optimization, was proposed. In reference [48], an optimal discriminative dimensionality-reduction matrix was used in the QPSO algorithm to reduce the number of dimensions.
In recent years, in deep learning, the SAE has also been applied to feature-dimension research. The multilevel structure of the SAE is suitable for the fusion and compression of information in input data and can extract the necessary features. The SSAE, which consists of several SAE units, is used to extract the key information of the features. The following results prove the feasibility of using the SSAE to optimize features. In reference [49], to identify distinguishing features of nuclei, an SSAE was used to learn high-level features from pixel intensities. In reference [50], an SSAE was used to extract high-level features from fast and noncell samples in the training set. In reference [51], a method of feature representation for multicell Pap smear images using an SSAE was proposed. Reference [52] used a trained SSAE to extract high-level features from image patches. Therefore, an SSAE can capture advanced features by learning from low-level features, and such high-level features can differentiate between various emotions.
To combine the advantages of the CNN and SSAE to extract deep optimized features, a CNN-SSAE neural network method is proposed in this paper.

Architecture of Speech Emotion Recognition
The dual-channel structure for SER proposed in this paper is shown in Figure 1. The method comprises the following steps: (1) Preprocessing, which mainly includes endpoint detection, framing, and windowing, is performed. (2) The Mel spectrogram and IMel spectrogram, which are complementary, are generated. (3) The CNN-SSAE deep optimized features from the Mel spectrograms and IMel spectrograms are extracted and spliced. (4) The spliced features are sent to the softmax layer for identification.

Figure 1. Recognition network. The first channel extracts optimized features from the Mel spectrogram to highlight the low-frequency information, and the other channel extracts optimized features from the IMel spectrogram to highlight high-frequency information.


IMel Spectrogram
This section describes the IMel spectrogram and its application to SER. First, the generation process and basic theory of the IMel spectrogram are introduced. Second, the performance of the IMel spectrogram is analyzed and compared with that of the Mel spectrogram.

Generation of the Mel Spectrogram
The commonly used Mel spectrogram contains the features of human auditory perception. It is based on the Mel frequency domain, in which linear frequencies are mapped to the Mel frequency. The equation below shows the relation:

\[ f_{Mel} = 2595 \log_{10}\!\left(1 + \frac{f_{lin}}{700}\right) \]

where f_lin is the frequency expressed on a linear scale, in Hz, and f_Mel is the frequency on the Mel scale, in Mel. As shown in Figure 2a, the relation between f_Mel and f_lin is logarithmic, and the resulting Mel filter banks are shown in Figure 2b. The resolution at low frequencies is higher than that at high frequencies, which is consistent with the characteristics of human hearing perception, and the Mel spectrogram based on the Mel filter banks has the same property. Therefore, the Mel spectrogram has been widely used and has achieved good results in the field of speech recognition. However, when applied to SER, it has certain limitations.

Figure 3 shows Russell's two-dimensional expression of emotional space [53]. The abscissa is valence and the ordinate is arousal. Arousal indicates the intensity of emotions: a higher numerical value corresponds to a higher degree of emotional stimulation. The speech signals for high-arousal emotions, such as happiness and anger, include more quasi-periodic and low-frequency parts. The speech signals for low-arousal emotions, such as sadness and tiredness, include more nonperiodic high-frequency parts and tail parts. The Mel spectrogram has a high resolution in the low-frequency part and is therefore suitable for high-arousal emotions. However, for some emotions with low arousal, more attention should be given to the high-frequency parts to achieve better SER. Therefore, in this paper, the IMel frequency domain is proposed and the IMel spectrogram, which is complementary to the Mel spectrogram and is generated by IMel filter banks, is obtained.
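The Mel mapping discussed above, and its inverse, can be sketched in a few lines (a hedged sketch using the common 2595·log10 constants; the function names are ours, not the paper's):

```python
import numpy as np

def hz_to_mel(f_lin):
    """Map linear frequency (Hz) to the Mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_lin, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, from the Mel scale back to linear frequency (Hz)."""
    return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)

# Equal steps on the Mel scale become ever-wider steps in Hz,
# i.e. fine resolution at low frequencies and coarse at high ones.
hz_edges = mel_to_hz(np.linspace(0.0, hz_to_mel(8000.0), 10))
```

Placing triangular filters at equal Mel intervals therefore concentrates them at low linear frequencies, which is exactly the low-frequency emphasis the text attributes to the Mel spectrogram.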


Generation of the IMel Spectrogram
The IMel spectrogram is generated by using an improved IMel filter bank, and the transformation relation is as follows:

\[ f_{IMel} = 700 \times \left( 10^{\,f_{lin}/2595} - 1 \right) \]

where f_IMel is the frequency expressed in the IMel frequency domain, which is opposite to the Mel frequency domain. Figure 4a shows the exponential relationship between f_lin and f_IMel in the IMel frequency domain. As shown in Figure 4b, the high-frequency part of the IMel filter banks has a narrow bandwidth and high resolution, thus enhancing the influence of high-frequency signals. Therefore, the IMel spectrograms generated using the IMel filter banks have the same characteristics.

Figure 5a,b show the same person saying the same sentence "The dog is sitting by the door" with different emotions. In Figure 5a, anger is intense and quasi-periodic, with higher energy and more low-frequency parts. The speech in Figure 5a has the Mel spectrogram shown in Figure 5c, with clear stripes, highlighting the emotional information of the high-arousal signal. In Figure 5b, the speech is sad, weak, and low in energy; the range of the ordinate is [−0.1, 0.1], which is ten times smaller than the range of the angry speech. The unvoiced segment "S" in Figure 5b, marked by the ellipse, is the key point of detection. The Mel spectrogram in Figure 5d and the IMel spectrogram in Figure 5f corresponding to "S" are displayed in the ellipses. The IMel spectrogram amplifies the role of the high-frequency part and highlights the emotional information of the low-arousal signal. This paper combines the Mel spectrogram, highlighting the low-frequency part, with the IMel spectrogram, highlighting the high-frequency part, to apply their complementary effects to SER.

As shown in Figure 6, the speech signal is preprocessed, and then the energy obtained after a fast Fourier transform (FFT) is calculated. After passing through the Mel filter banks and the IMel filter banks separately, the logarithm is calculated, and finally the Mel spectrogram and the IMel spectrogram are obtained. The steps performed to generate the IMel spectrogram are as follows:

(1) The speech signal x_s(n) = [s_1(n), s_2(n), ..., s_i(n), ..., s_l(n)] is framed, where s_i represents each frame and l denotes the number of frames.
(2) Each frame is windowed: s_i'(n) = s_i(n) × w(n), where the window function w(n) is a Hamming window and N is the window length.
(3) The FFT of each frame of the signal is computed, where N_FFT is the number of points of the FFT.
(4) f_l and f_h are converted from the linear scale to the IMel scale, where f_l and f_h are the lowest frequency and the highest frequency of the filter banks, respectively.
(5) The center frequency of each filter in the filter banks is calculated by spacing the filters uniformly on the IMel scale, where j = 1, 2, ..., M and M is the number of filters.
(6) The calculated center frequency of each filter is converted from the IMel scale back to the linear scale. (7) The calculated center frequency on the linear scale is rounded to the nearest point of the FFT, where N_FFT represents the scale of the FFT decomposition and F_s represents the sampling frequency.
(8) The log-energy output LO_j of each filter is computed, where H_j(m) represents the transfer function of the j-th IMel filter, J is the number of filters, and m indexes the samples of each filter. Finally, the J groups of LO_j are arranged vertically to form the IMel feature vector of each frame, represented as SF_i. (9) The features of all frames SF_i are combined horizontally, and the IMel spectrogram is obtained.
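Steps (4)-(8) above can be sketched as follows. This is a hedged sketch, not the paper's exact implementation: it assumes the IMel scale is the functional inverse of the Mel formula (consistent with the exponential relation the text attributes to Figure 4a), and all parameter defaults are illustrative.

```python
import numpy as np

def imel_scale(f_lin):
    # Assumed IMel mapping: inverse of the Mel formula, exponential in f_lin.
    return 700.0 * (10.0 ** (np.asarray(f_lin, dtype=float) / 2595.0) - 1.0)

def imel_to_hz(f_imel):
    # Back from the IMel scale to linear frequency (the Mel-style log form).
    return 2595.0 * np.log10(1.0 + np.asarray(f_imel, dtype=float) / 700.0)

def imel_filter_banks(M=60, n_fft=512, fs=16000, f_l=0.0, f_h=8000.0):
    """Triangular filters spaced uniformly on the assumed IMel scale.

    Returns an (M, n_fft//2 + 1) matrix H; row j holds the transfer
    function H_j(m) of the j-th filter (steps (4)-(7) in the text).
    """
    lo, hi = imel_scale(f_l), imel_scale(f_h)            # (4) edges -> IMel
    pts = np.linspace(lo, hi, M + 2)                     # (5) uniform centres
    bins = np.floor((n_fft + 1) * imel_to_hz(pts) / fs).astype(int)  # (6)+(7)
    H = np.zeros((M, n_fft // 2 + 1))
    for j in range(1, M + 1):                            # (8) triangular shapes
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        for m in range(l, c):
            H[j - 1, m] = (m - l) / (c - l)
        for m in range(c, r):
            H[j - 1, m] = (r - m) / (r - c)
    return H
```

Multiplying a frame's FFT power spectrum by H and taking the logarithm yields one column SF_i of the IMel spectrogram; because the filter centres crowd toward f_h, the fine resolution lands in the high-frequency region.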

CNN-SSAE Neural Network
As shown in Figure 1, the CNN-SSAE neural network proposed in this paper consists of a CNN and an SSAE. First, the spectrogram is taken as input, and the deep features are obtained by the CNN. Second, the deep optimized features are obtained by SSAE dimension reduction. Finally, the recognition results are produced by the softmax layer.

Convolutional Neural Network
The network structure of the CNN is shown in Figure 7, and its parameters are presented in Table 1. The CNN can transform the two-dimensional spectrogram into one-dimensional deep features to capture the emotional information of the speech. The characteristics of the CNN proposed in this paper are as follows: (1) A batch normalization (BN) layer is added after each of the convolutional layers Con1, Con2, Con3, and Con4; this regularizes the whole network to prevent overfitting, allows a higher learning rate, greatly improves the training efficiency, and mitigates the vanishing gradient problem. (2) A Global Average Pooling (GAP) layer is used instead of the fully connected layer. In conventional average pooling, a sub-region of each feature map is averaged, the sub-region is slid across the map, and finally all the average values are connected in series. In contrast, with GAP, each feature map yields a single average value; this preserves the emotional information while reducing the number of parameters, realizing model compression, and improving efficiency.
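The difference between conventional sliding average pooling and GAP can be made concrete with a small sketch (shapes are illustrative, not the exact Table 1 configuration):

```python
import numpy as np

def local_average_pool(fmaps, k=2):
    """Conventional average pooling: a k x k window slid with stride k."""
    H, W, C = fmaps.shape
    x = fmaps[:H - H % k, :W - W % k, :]
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def global_average_pool(fmaps):
    """GAP: each feature map collapses to a single average value."""
    return fmaps.mean(axis=(0, 1))

fmaps = np.random.rand(8, 8, 128)     # output of the last conv block
pooled = local_average_pool(fmaps)    # (4, 4, 128): 2048 values to flatten
gap = global_average_pool(fmaps)      # 128 values: far fewer parameters
```

Feeding the 128 GAP values, rather than 2048 flattened pooled values, into the classifier is what realizes the model compression described above.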


The deep features of the spectrogram obtained by the CNN may be redundant. Therefore, an SSAE is adopted to further optimize the deep features, reduce the number of dimensions, and improve the recognition rate.

Stacked Sparse Autoencoder
Multiple SAEs are stacked together to form an SSAE. It is suitable for the fusion and compression of information in input data and can extract necessary features. The SAE is usually used for data compression and fusion. The name refers to the sparse restrictions that are added to each hidden layer. The previous layer contains more hidden cells than the latter layer, and the hidden features learned in the latter layer are more abstract. This structure is similar to the working state of the human brain.
The SAE1 shown in Figure 8 is composed of an input layer, a hidden layer, and an output layer. The input data x1 are mapped to hidden layer1; this is called the encoding process. Hidden layer1 is remapped to the reconstructed data y1; this is called the decoding process. This process should minimize the error function E from x1 to y1:

\[ E = \frac{1}{n} \sum_{i=1}^{n} \left\| x_1^{(i)} - y_1^{(i)} \right\|^2 + \beta \sum_{j} \mathrm{KL}\!\left(\rho \,\|\, \hat{\rho}_j\right) \]

where n denotes the number of samples, j indexes the hidden units, ρ denotes the target value of the average activation degree, and ρ̂_j denotes the average activation degree of the j-th hidden unit. KL(ρ‖ρ̂_j) denotes the KL distance (Kullback-Leibler divergence):

\[ \mathrm{KL}\!\left(\rho \,\|\, \hat{\rho}_j\right) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \]

When training the network, the parameters should be constantly adjusted to minimize E, in which β weights the sparsity penalty. A value of ρ closer to 0 corresponds to a smaller average activation degree ρ̂_j of the middle layer. In addition, it is necessary to constantly adjust the connection weight W_j and bias b_j to minimize the error function E.
In the SSAE pre-training shown in Figure 8, the weights and biases of each layer are W_j and b_j, the corresponding adjustments are δ_j^W and δ_j^b, and the weights and biases after fine-tuning are W*_j = W_j + δ_j^W and b*_j = b_j + δ_j^b. The SSAE is formed by stacking multiple layers. First, the 128-dimensional deep features of the spectrogram are used as the input layer of SAE1, whose hidden layer1 output is the 118-dimensional vector h(1). Second, hidden layer1 is regarded as the input of SAE2, whose hidden layer2 output is the 108-dimensional vector h(2), where h(2) = f(W*_2 h(1) + b*_2). Third, hidden layer2 is regarded as the input of SAE3, whose hidden layer3 output is the 98-dimensional vector h(3), where h(3) = f(W*_3 h(2) + b*_3). Finally, the deep optimized features h(3) are fed into the softmax layer for recognition.
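The sparsity penalty and the layer-by-layer dimension schedule above can be sketched as follows (ρ = 0.05, β = 3.0, the sigmoid activation, and the random initialization are illustrative choices of ours, not values reported by the paper; biases are omitted in the encoder for brevity):

```python
import numpy as np

def kl_sparsity(rho, rho_hat, eps=1e-8):
    """Sum of KL(rho || rho_hat_j) over hidden units: the sparsity penalty."""
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))

def sae_loss(x, y, rho_hat, rho=0.05, beta=3.0):
    """Reconstruction error E plus the beta-weighted sparsity penalty."""
    recon = np.mean(np.sum((x - y) ** 2, axis=1))
    return recon + beta * kl_sparsity(rho, rho_hat)

# Dimension schedule of the stacked encoder: 128 -> 118 -> 108 -> 98.
sizes = [128, 118, 108, 98]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(sizes[1:], sizes[:-1])]

def encode(x):
    h = x
    for W in weights:          # h(k) = f(W_k h(k-1)), f = sigmoid, no bias
        h = 1.0 / (1.0 + np.exp(-(W @ h)))
    return h
```

A perfect reconstruction with hidden activations exactly at the target ρ drives both loss terms to zero, which is the fixed point the training procedure pushes toward.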

Experimental Datasets
To verify the effectiveness of our proposed spectrogram and dual-channel complementary model, we tested it on three widely used datasets: the Berlin dataset of German emotional speech (EMO-DB) [54], the Surrey Audio-Visual Expressed Emotion dataset (SAVEE) [55], and the Ryerson audiovisual dataset of emotional speech and song (RAVDESS) [56].
The EMO-DB database was created by researchers at the University of Berlin and is a German emotional database. It contains 535 utterances produced by 10 professional actors (5 males and 5 females), who produced 49, 58, 43, 38, 55, 35, 61, 69, 56, and 71 utterances, respectively. The data cover seven different emotions: happiness, neutral, anger, boredom, disgust, sadness, and fear. The sampling rate is 48 kHz (compressed to 16 kHz), and 16-bit quantization is adopted. The database is of the acted, discrete type; the access type is open, and the modality is audio.
The SAVEE database records the data of four native English-speaking males (DC, JE, JK, KL), graduate students and researchers from Surrey University, aged from 27 to 31. It contains a total of 480 utterances; each speaker produced 120 utterances. The average length of each utterance is 4 s, with a sampling rate of 44.1 kHz. This database covers seven different emotions: happiness, sadness, anger, disgust, fear, neutral, and surprise. The database is of the acted, discrete type; the access type is free, and the modalities are audio/visual.
The RAVDESS database is an English emotional dataset consisting of 1440 utterances produced by 24 actors (12 males and 12 females), who expressed eight different emotions: happiness, sadness, surprise, anger, calmness, disgust, fearfulness, and neutral. The database is widely used for emotional song and speech recognition, with a sampling rate of 48 kHz. The database is of the evoked, discrete type; the access type is free, and the modalities are audio/visual.
In the above databases, emotion is expressed in different languages and cultures; for example, the pronunciation characteristics of German and English differ. However, they share some common characteristics. Low-arousal emotions, such as sadness and disgust, have low energy, and the speed of speech is generally slow. High-arousal emotions, such as happiness and anger, have high energy, and the emotions are expressed strongly. For all datasets, the Mel spectrogram and IMel spectrogram remain complementary.

Experimental Features
To verify that the Mel spectrograms and the IMel spectrograms are complementary and to assess the effectiveness of the CNN-SSAE, six experimental features were designed.

Parameter Settings
We chose a Hamming window with a length of 25 ms and a shift of 10 ms, and set the FFT length to 512. In the CNN-SSAE models, the batch size, learning rate of RMSProp, and dropout rate were set to 32, 0.001, and 0.1-0.2, respectively. To ensure the comprehensiveness of the experiment, fivefold cross-validation was adopted, and the average of the five test results was taken to ensure the correctness of the experimental results.
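For concreteness, the framing parameters above translate into sample counts as follows (a sketch assuming 16 kHz audio, as used for EMO-DB after compression; the helper name is ours):

```python
import numpy as np

fs = 16000                # sampling rate (Hz); assumed for illustration
win_len = int(0.025 * fs) # 25 ms Hamming window -> 400 samples
hop = int(0.010 * fs)     # 10 ms shift -> 160 samples
n_fft = 512               # FFT length: next power of two above 400

def frame_signal(x, win_len, hop):
    """Slice a 1-D signal into overlapping, Hamming-windowed frames."""
    n_frames = 1 + (len(x) - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win_len)

frames = frame_signal(np.random.randn(fs), win_len, hop)  # 1 s of audio
```

One second of audio thus yields 98 frames of 400 samples each, zero-padded by the FFT to 512 points before the filter banks are applied.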

Experimental Scenarios
To ensure that the experiment was comprehensive, we employed three experimental scenarios to analyze the results: speaker independent (SI), speaker-dependent (SD), and gender-dependent (GD) [57,58].
(1) SD: The samples were randomly divided into two groups: the training set, containing 80%, and the test set, containing the remaining 20%. (2) SI: The samples were divided into two groups according to the subjects. The test set was composed of all the samples spoken by one subject, and the training set contained samples spoken by the remaining subjects.
(3) GD: GD consists of two scenarios, GD-male and GD-female, according to gender. In the GD-male scenario, the samples were divided into two groups, with male data as the training set and female data as the test set. Conversely, in the GD-female scenario, female data were used as the training set, and male data were used as the test set.

Evaluation Indexes
Two evaluation indexes are used to measure the SER performance in this paper. The unweighted accuracy (UA) is the average value of recall for each class, and the weighted accuracy (WA) is the number of samples correctly classified divided by the total number of samples.
\[ UA = \frac{1}{K} \sum_{k=1}^{K} \frac{true\text{-}positives_k}{total\text{-}positives_k}, \qquad WA = \frac{\sum_{k=1}^{K} true\text{-}positives_k}{\sum_{k=1}^{K} total\text{-}positives_k} \]

where the number of emotional categories is represented as K, the actual number of samples with emotion k is expressed as total-positives_k, and true-positives_k is the number of samples correctly classified for emotion category k.
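Both indexes can be computed directly from a confusion matrix; a minimal sketch:

```python
import numpy as np

def ua_wa(conf):
    """UA and WA from a K x K confusion matrix.

    conf[i, j] = number of samples of true class i predicted as class j,
    so the diagonal holds true-positives_k and row sums total-positives_k.
    """
    conf = np.asarray(conf, dtype=float)
    true_positives = np.diag(conf)
    total_positives = conf.sum(axis=1)
    ua = float(np.mean(true_positives / total_positives))  # mean per-class recall
    wa = float(true_positives.sum() / conf.sum())          # overall accuracy
    return ua, wa
```

UA and WA coincide only when the classes are balanced; on imbalanced test sets UA is the stricter index, since every emotion category contributes equally.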
Experimental Results and Analysis

Analysis of the IMel Spectrogram
(1) Factors influencing the spectrogram
Obtaining the most suitable spectrogram is an implementation challenge, because many parameters affect the result: in the spectrogram implementation, the FFT length is 256, the window length is 256, and the window shift is 128, and other parameters, such as the number M of filters and the frequency range R, also have an impact.
The number M of filters in the filter bank has an important influence on the display of the spectrogram. Figure 9a,b show filter banks with different numbers of filters (Figure 9a shows the filter banks with 20 filters).

The vertical axis of the spectrogram indicates frequency, and the horizontal axis indicates time. In this paper, R represents the frequency range of the spectrogram and R_max represents the maximum frequency. Both R and R_max have an important impact on recognition. According to the Nyquist principle, the maximum frequency f_m is equal to half of the sampling frequency f_s. For example, the sampling frequency of the SAVEE dataset is 44,100 Hz, so the maximum frequency is 22,050 Hz. As shown in Figure 10a,c, the frequency ranges of the Mel filter banks and IMel filter banks are 0-4000 Hz, and Figure 10b,d show the corresponding Mel spectrogram and IMel spectrogram. In Figure 10e,g, the frequency ranges of the Mel filter and IMel filter are both 0-8000 Hz, and Figure 10f,h show the corresponding Mel spectrogram and IMel spectrogram. In Figure 10i,k, the frequency ranges of the Mel filter and IMel filter are both 0-21,000 Hz, and Figure 10j,l show the corresponding Mel spectrogram and IMel spectrogram. It can be observed that the low-frequency decomposition of the Mel spectrogram in Figure 10j is not sufficiently detailed, whereas the attention in Figure 10l falls mostly in an invalid region. Therefore, in the later sections of this paper, the frequency range is set to 0-4000 Hz or 0-8000 Hz.
(2) Complementarity analysis of the Mel spectrogram and IMel spectrogram

To prove that the Mel spectrogram and IMel spectrogram are complementary, the correlation analysis method is used in this paper. For two related variables, the correlation coefficient (whose value lies between −1 and 1) indicates the degree of correlation. A greater absolute value of the correlation coefficient indicates more obvious emotional information and a brighter region in the image. As shown in Figure 11a-c, a correlation analysis of the Mel deep features, IMel deep features, and Mel + IMel deep features of 480 sentences in the SAVEE database was conducted and expressed in the form of images.
The abscissa and ordinate represent the 480 samples. The color of anger is bright in Figure 11a, dark in Figure 11b, and brighter in Figure 11c. The color of sadness is dark in Figure 11a, bright in Figure 11b, and brighter in Figure 11c. This shows that the Mel deep features and the IMel deep features are complementary. Moreover, the Mel + IMel deep features maintain the advantage of each and enhance the overall advantage, thereby enhancing the emotional information.
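The correlation images of Figure 11 can be reproduced schematically as below; the feature matrix here is random stand-in data, whereas the real inputs would be the deep feature vectors of the 480 SAVEE sentences.

```python
import numpy as np

def correlation_image(features):
    """Pearson correlation between every pair of samples.

    `features` has shape (n_samples, n_dims); the result is an
    (n_samples, n_samples) matrix with values in [-1, 1] that can be
    displayed as an image, as in Figure 11.
    """
    return np.corrcoef(features)

rng = np.random.default_rng(0)
deep_feats = rng.standard_normal((480, 128))  # stand-in for CNN deep features
img = correlation_image(deep_feats)
print(img.shape)  # (480, 480)
```

Bright off-diagonal blocks correspond to sample pairs whose features are strongly correlated, which is how the emotional structure is read off the image.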
Table 2 shows that when Rmax = 8000 Hz, M = 60, and E = WA, the highest recognition rate on the EMO-DB database was 94.79 ± 1.77%, that on the SAVEE database was 88.96 ± 3.56%, and that on the RAVDESS database was 83.18 ± 2.31%. This indicates that the frequency range of 0-8000 Hz reflects more comprehensive information. When the number M of filters was equal to 60, the decomposition of the spectrogram was more appropriate. Therefore, the following experiments for SI, GD, and the various types of emotion analysis were conducted with Rmax = 8000 Hz and M = 60.
As seen from Tables 2-5:

(1) The recognition accuracy of the Mel feature was higher than that of the IMel feature; the effect of using IMel alone was not as good as that of using Mel alone.

(2) MelSSAE, IMelSSAE, and (Mel + IMel)SSAE achieved higher recognition accuracy than Mel, IMel, and Mel + IMel, respectively, thereby proving the effectiveness of the SSAE in dimension reduction.

(3) The recognition accuracy of Mel + IMel was not necessarily higher than that of Mel or IMel, because directly splicing the two may introduce redundancy and affect the recognition accuracy. However, after SSAE optimization, (Mel + IMel)SSAE improved on the recognition accuracy of Mel, IMel, and Mel + IMel, thereby proving that Mel and IMel are complementary in their deep features.

(4) The standard deviation describes the average distance of each recognition result from the mean. The values in the SI environment (Table 2) were smaller than those in the SD and GD environments (Tables 3-5). This is because SI took all samples of one subject as the test set and the other subjects' samples as the training set, reflecting the similarity between an individual and the group. In contrast, GD-male, for example, took the samples of all male subjects as the training set and the samples of one female subject as the test set; because the differences between genders are large, the standard deviations of the recognition results were greater in the SD and GD environments.

Although the experimental standards differ, the experimental results of SI, SD, and GD show the similar trends mentioned above; the results therefore demonstrate the general applicability of the algorithm in this paper.
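The "mean ± standard deviation" entries in Tables 2-5 (e.g. 94.79 ± 1.77%) can be computed from per-fold recognition rates as below; the accuracy values are made up for illustration, and the use of the sample standard deviation (ddof = 1) is an assumption.

```python
import numpy as np

def summarize_folds(accuracies):
    """Mean and sample standard deviation of per-fold recognition rates (%)."""
    a = np.asarray(accuracies, dtype=float)
    return a.mean(), a.std(ddof=1)

# Hypothetical per-fold accuracies for one configuration
mean, std = summarize_folds([93.2, 96.4, 94.1, 95.8, 94.9])
print(f"{mean:.2f} ± {std:.2f}%")  # 94.88 ± 1.28%
```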
(2) Recognition accuracy of various emotions

Table 6 presents a comparison between the emotion recognition rate of the Mel feature and that of the IMel feature when Rmax = 8000 Hz and M = 60 on the three datasets. Figure 11 is a line chart corresponding to the recognition accuracy shown in Table 6. It can be observed that for high-arousal emotions, such as anger and happiness, the recognition accuracy of the Mel features was higher than that of the IMel features. For anger, the recognition accuracies of the Mel features were 94.83%, 80.17%, and 80.82%, respectively, and those of the IMel features were 88.71%, 71.67%, and 68.79%, respectively. These results indicate that the Mel spectrogram is more suitable for high-arousal emotions than the IMel spectrogram. For low-arousal emotions, such as sadness and disgust, the recognition accuracy of the IMel features was higher than that of the Mel features. For sadness, the recognition accuracies of the Mel features were 75.87%, 70.54%, and 66.87%, respectively, and those of the IMel features were 84.68%, 79.35%, and 70.96%, respectively. These results indicate that the IMel spectrogram is more suitable for low-arousal, high-frequency emotions than the Mel spectrogram.

(3) ROC curves

In Figure 12, the horizontal coordinate is the False Positive Rate and the vertical coordinate is the True Positive Rate. AUC refers to the area under the ROC curve. Figure 12a presents the ROC curves for 'anger', 'anxiety', 'boredom', 'disgust', 'happiness', 'neutral', and 'sadness' obtained by our proposed approach on the EMO-DB dataset; the red dashed line is the mean ROC curve over the emotions. As can be seen from the figure, the system in this paper performs well.
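The ROC/AUC computation behind Figure 12 can be sketched in one-vs-rest form as below; the scores and labels are hypothetical, and the real curves would come from the classifier's per-emotion outputs.

```python
import numpy as np

def roc_auc(scores, labels):
    """One-vs-rest ROC curve and AUC for a single emotion class.

    `scores`: classifier scores for the class; `labels`: 1 for the
    class, 0 otherwise. Returns (fpr, tpr, auc).
    """
    order = np.argsort(-np.asarray(scores))        # descending score order
    y = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoidal area
    return fpr, tpr, auc

# Perfectly separated hypothetical scores give AUC = 1.0
_, _, auc = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
print(auc)  # 1.0
```

The mean curve in Figure 12a would then be obtained by averaging the per-emotion TPR values on a common FPR grid.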
(4) The performance of the proposed system

Although the algorithm proposed in this paper uses two CNN channels to extract the deep features, which doubles the number of parameters in terms of complexity, the Mel-spectrogram and IMel-spectrogram channels can be run as two simultaneous projects in PyCharm, and therefore the running time does not increase. The dimension of a single channel is 128 and that of the double channel is 256; whether the input is 128 or 256 dimensions has little effect on the running time of the SSAE.
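The dimensions mentioned above can be checked with a one-line fusion sketch; the zero arrays are stand-ins for the two CNN outputs.

```python
import numpy as np

mel_deep = np.zeros((1, 128))    # stand-in for Mel-channel CNN deep features
imel_deep = np.zeros((1, 128))   # stand-in for IMel-channel CNN deep features
fused = np.concatenate([mel_deep, imel_deep], axis=1)  # input to the SSAE
print(fused.shape)  # (1, 256)
```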
(5) Comparison with related papers

Table 7 lists the results of the comparison between this paper and related papers published in recent years. The listed papers all use spectrograms to detect emotion. It can be observed that our method of combining the Mel spectrogram and the IMel spectrogram is better than methods that use only a Mel spectrogram or a standard spectrogram.

Conclusions
In this paper, the IMel frequency scale was proposed, and an IMel spectrogram that highlights the high-frequency part was formed, which makes up for the shortcoming of the Mel spectrogram, which highlights only the low-frequency part. Then, a CNN-SSAE deep optimization network was presented. In this network, the two-dimensional spectrogram was sent to the CNN to obtain one-dimensional deep features, and the internal redundant features were then compressed and optimized by the SSAE. The EMO-DB, SAVEE, and RAVDESS speech databases were used for experiments to verify the complementarity of the Mel spectrogram and IMel spectrogram and the effectiveness of the CNN-SSAE in optimizing deep features. Although the proposed method obtained better recognition results, it also has limitations. The IMel filter is only the inverse process of the Mel filter, highlighting the high-frequency part. In future work, the wavelet transform, the Hilbert-Huang transform, and similar methods can be used to highlight different frequency components of the layered signal, thereby realizing new spectrograms that yield better recognition results. In addition, current research tends toward multimodality, combining spectrogram results with EEG features or kinematic features; better features can also be extracted through a deeper optimized network to obtain better recognition results.