A Deep Learning Method Using Gender-Specific Features for Emotion Recognition

Speech reflects people’s mental state, and using a microphone sensor is a potential method for human–computer interaction. Recognizing emotion in speech captured by this sensor is conducive to the diagnosis of mental illnesses. Gender differences between speakers affect speech emotion recognition based on specific acoustic features, reducing recognition accuracy. We therefore believe that the accuracy of speech emotion recognition can be effectively improved by selecting different speech features for emotion recognition according to the speech representations of different genders. In this paper, we propose a speech emotion recognition method based on gender classification. First, we use a multilayer perceptron (MLP) to classify the original speech by gender. Second, based on the different acoustic features of male and female speech, we analyze the influence weights of multiple speech emotion features in male and female speech, and establish the optimal feature sets for male and female emotion recognition, respectively. Finally, we train and test a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network using the male and the female speech emotion feature sets. The results show that the proposed emotion recognition models have an advantage in average recognition accuracy over gender-mixed recognition models.


Introduction
Speech emotion recognition is a computer simulation of the human emotion perception and understanding process. It extracts the acoustic features of emotion from speech signals collected with a microphone sensor and identifies the mapping relationship between these acoustic features and human emotion. Speech emotion recognition is widely used in the field of human–computer interaction [1][2][3][4]. In the field of medicine, the effective recognition of emotion in speech can be used to improve the intelligibility of speech for people with speech disabilities and help listeners better understand the information expressed by the speaker [5]. In the field of education, the emotional states of students who learn online can be analyzed to improve the quality of teaching [6]. In the field of criminal investigation, the automatic recognition of emotion in speech helps to reveal the real emotional state of criminal suspects and their attempts to hide their true emotions, thus assisting in lie detection [7]. Currently, due to the impact of COVID-19, about 15.5% of the global population suffers from some kind of mental illness [8], and speech emotion recognition systems are gradually being applied to the field of mental health counseling. Research on the automatic recognition of speech emotion not only promotes the development of computer technology, but also supports the efficient diagnosis of mental illnesses. Through the early detection, intervention, and treatment of mental illness, people's quality of life can be improved.
In recent years, a lot of work has been carried out to automatically recognize emotional information in speech [9][10][11][12], but recognition accuracy has not improved significantly. Among the studies that take gender into account, Sun [24] proposed a recognition network (R-CNN) with a gender information block. Utilizing a deep learning algorithm, the network automatically selects important information from the original speech signal for the classification layer to complete the emotion recognition. The results show that this algorithm achieves recognition rates of 84.6%, 90.3%, and 71.5% on the CASIA, EMODB, and IEMOCAP datasets, respectively. Although the above studies consider the influence of gender on speech emotion recognition, they only input gender as a characteristic parameter, or manually distinguish the genders before feeding the recognition model, and the recognition results do not improve significantly.
In this study, based on the differences in physiological and acoustic characteristic parameters between genders, front-end and back-end models are designed to automatically complete gender classification and speech emotion recognition. Our main contributions are summarized in three phases: First, the front-end model uses MFCC mean and spectrum contrast to extract the features from the original speech, and automatically distinguishes male and female speech through a multilayer perceptron neural network (MLP).
Second, the acoustic speech emotion features are traversed, and one feature parameter is extracted each time. The recognition results are compared using the support vector machine (SVM) to analyze the weight difference of various speech features in the speech recognition of different genders.
Finally, combinations of the feature parameters of different genders are, respectively, input into a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) speech emotion recognition models established in the back-end to realize the emotion recognition of different genders and improve its accuracy.
The following parts of this paper are organized into five sections. In Section 2, the differences in the physiological and acoustic characteristics between male and female voices are introduced. Section 3 presents related research work, gender classification methods, the extraction of the speech feature parameters of different genders, and the structure of the speech emotion recognition models. Section 4 presents the experimental design and experimental results. Section 5 presents the discussion of the proposed technique. Finally, the summarization of the study is presented in Section 6.

Background
The main problem of speech emotion recognition is that the acoustic performance of speech signals is affected not only by emotional factors, but also by many other factors. Among them, the discrepancy of acoustic characteristics caused by the difference in the physiological characteristics of speakers is the main influencing factor that reduces the accuracy of speech emotion recognition.
The vocal tract is the main organ that produces the voice. Its core function is to adjust the timbre of the sound produced by the vibration of the vocal cords. Speech is produced according to the shape of the vocal tract, which changes over time. The shape of the vocal tract depends on the shape and size of the vocal organs, and different genders inevitably show individual differences. Previous studies show that the ratio of the total vocal tract length of adult females to that of adult males is about 0.87 [25]. Males and females differ in the thickness of the larynx, the angle of the thyroid cartilage, and the shape of the glottis [26]. The vocal cords are part of the vocal tract. The tension or relaxation of the vocal cords determines the pitch of the voice. People of different genders have different vocal cord structures; thus, the pitch of the voice is also different. Adult male and female vocal cords differ in length, thickness, tension, and other anatomical structures, as well as in the physiological functions of the respiratory and resonance organs, resulting in different vocalization and voice quality characteristics between adult males and females. The main factor affecting the rate of vocal cord vibration is the length of the vocal cords, and long vocal cords vibrate more slowly than short ones. Previous studies show that the ratio of female vocal cord length to male vocal cord length is about 0.8 [27]; therefore, men's voices are lower than women's in most cases. In addition, female voices have a higher fundamental frequency, or pitch, than male voices. The average formant frequency of female speakers is higher than that of male speakers, and the slope of the female spectrum is steeper than that of the male spectrum.
In conclusion, there are general physiological differences between genders; thus, men and women may express the same emotion in completely different ways. Therefore, it is necessary to classify emotions according to gender and carry out emotion recognition according to the different acoustic characteristics of males and females to improve the overall accuracy of the recognition system.

Methods
In this section, we establish the overall framework of speech emotion recognition. The overall framework shown in Figure 1 is divided into two models: the front-end model and the back-end model. The front-end model completes gender recognition and the classification of speech data. The back-end model extracts emotional features from the speech data of males and females, respectively, and provides the emotional recognition results of different genders.

Extraction of Speech Feature Parameters
MFCC is based on the characteristics of the human ear, cochlea, and basilar membrane, and has a nonlinear corresponding relationship with the actual frequency, which makes its cepstrum more similar to the nonlinear human auditory system. Previous studies show that MFCC and F0 classify gender well [28]. Spectral contrast takes into account the peak value, valley value, and their difference in each sub-band of the spectrum to show the relative characteristics of the spectrum. The spectral contrast feature can roughly reflect the distribution of harmonics across the sub-bands of the spectrum, retains more spectral information, and better distinguishes different kinds of speech. Therefore, MFCC mean, fundamental frequency F0, and spectral contrast are adopted in this study to classify the speech of different genders more accurately.
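As a minimal sketch of how these gender-related features could be extracted (the paper does not state which toolkit is used; librosa, the frame settings, and the value ranges below are assumptions):

```python
import numpy as np
import librosa

def gender_features(wav_path):
    """Extract MFCC mean, fundamental frequency F0, and spectral contrast
    from one utterance (illustrative settings, not the paper's exact ones)."""
    y, sr = librosa.load(wav_path, sr=None)

    # MFCC mean: average each of the 13 cepstral coefficients over time
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_mean = mfcc.mean(axis=1)

    # Fundamental frequency F0 estimated with the pYIN tracker,
    # averaged over voiced frames
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50, fmax=500, sr=sr)
    f0_mean = np.nanmean(f0)

    # Spectral contrast: peak-valley difference per sub-band, averaged over time
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    contrast_mean = contrast.mean(axis=1)

    return np.concatenate([mfcc_mean, [f0_mean], contrast_mean])
```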

Gender Recognition Model
In this study, MLP is used for the automatic gender recognition of the detected speech. MLP consists of an input layer, several hidden layers in the middle, and an output layer [29]. It has high parallelism, strong nonlinear mapping ability, and good fault tolerance and associative memory. The structure of MLP is shown in Figure 2. Let the sample be X ∈ R^(n×d), where n denotes the batch size and d represents the number of inputs. Suppose the multilayer perceptron has just one hidden layer, where the number of hidden units is h, and the output of the hidden layer is H ∈ R^(n×h). Since the hidden layer and the output layer are fully connected layers, the weight and bias parameters of the hidden layer can be set as W_h ∈ R^(d×h) and b_h ∈ R^(1×h), and the weight and bias parameters of the output layer are W_o ∈ R^(h×q) and b_o ∈ R^(1×q), where q is the number of outputs. Then, the output O ∈ R^(n×q) of a multilayer perceptron with a single hidden layer is calculated as H = XW_h + b_h and O = HW_o + b_o. If the output of the hidden layer is directly taken as the input of the output layer, the subsequent equation can be obtained: O = (XW_h + b_h)W_o + b_o = XW_hW_o + b_hW_o + b_o. In this study, MFCC mean, fundamental frequency F0, and spectral contrast are used to extract the emotional features from the speech of different genders.
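As a minimal sketch of this gender classification step (assuming scikit-learn's MLPClassifier and the gender_features() helper sketched above; the hidden layer size, wav_paths, gender_labels, and other settings are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed inputs: wav_paths (list of audio files) and gender_labels (0 = male, 1 = female)
X = np.vstack([gender_features(p) for p in wav_paths])
y = np.array(gender_labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Single-hidden-layer MLP matching the H = XW_h + b_h, O = HW_o + b_o structure,
# with a ReLU nonlinearity applied to the hidden layer in practice
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                  solver="adam", max_iter=500, random_state=0))
clf.fit(X_train, y_train)
print("Gender classification accuracy:", clf.score(X_test, y_test))
```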

Extraction of Speech Emotion Feature Parameters
For all speech emotion recognition systems, a key problem lies in how to select the best feature set to represent speech emotion signals. This section explores the differences in speech emotion features among speakers of different genders. The rhythm features, quality features, and spectral features are traversed. One feature parameter is extracted each time and input into the SVM model for individual emotion recognition. The recognition results are used to judge the weight of the different feature parameters in male and female speech emotion recognition. The ranking of male and female speech emotion feature parameters is shown in Table 1 (see Appendix A for details). According to the proportion of weight, the feature parameters with a recognition rate greater than 50% are displayed in order from high to low. As can be seen from Table 1, the weights of male and female speech emotion feature parameters in emotion recognition differ significantly: for male speech, MFCC itself is the most effective single feature for emotion recognition, followed by its maximum and mean values; for female speech, the maximum of MFCC is most effective, followed by its mean and MFCC itself. For male voices, sound pressure level (SPL) is more important, followed by voice energy; for female voices, voice energy is more important, followed by the Mel spectrum.
The emotional feature parameters of male speech include shimmer abs (absolute amplitude perturbation) and SPL, neither of which appears in the female set. The female feature parameters include zero-crossing rate (ZCR) and short-time energy, which do not appear in the male set. Amplitude perturbation describes the variation of acoustic amplitude between adjacent periods, which mainly reflects hoarseness. The male voice is rougher and hoarser than the female voice; thus, the amplitude perturbation parameters have a greater effect on the male voice. SPL is defined as the root mean square value of the instantaneous pressure generated by a sound wave at a point. In general, women's vocalizations have more high-frequency components and men's vocalizations have more low-frequency components, and women's speech sounds louder than men's at the same SPL. SPL fluctuates more in men and less in women during mood swings. ZCR refers to the number of times a speech signal passes through the zero point (from positive to negative or from negative to positive) in each frame. The ZCR of unvoiced sounds is the highest, while that of voiced sounds is the lowest, and short-time energy can distinguish between unvoiced and voiced sounds. Voiced sounds are produced by the vibration of the vocal cords, contain most of the energy in the sound, and show obvious periodicity. Female vocal cords are shorter than male vocal cords; therefore, female voices are crisper and male voices are deeper.
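A minimal sketch of the single-feature weighting procedure described above (training an SVM on one feature at a time and ranking the features by cross-validated accuracy), assuming each candidate feature has already been extracted into its own array; the feature names and variables in the example call are hypothetical:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def rank_single_features(feature_dict, emotion_labels):
    """Train an SVM on one feature at a time and rank the features by
    cross-validated emotion recognition accuracy (illustrative settings)."""
    scores = {}
    for name, values in feature_dict.items():   # values: array of shape (n_samples, n_dims)
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        scores[name] = cross_val_score(clf, values, emotion_labels, cv=5).mean()
    # Keep only features whose accuracy exceeds 50%, ordered from high to low
    ranked = [(n, a) for n, a in scores.items() if a > 0.5]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# e.g. rank_single_features({"mfcc": mfcc_feats, "zcr": zcr_feats}, y_emotion)
# where the arrays and labels come from one gender's subset of the data
```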

Speech Emotion Recognition Models
The gender-based emotion detection method, which incorporates gender information into the process of emotion recognition, has proved to be robust and successful. Studies have shown that gender-mixed emotion recognition systems have a lower success rate than gender-specific systems [24]. To further exclude the possible influence of classifiers on emotion recognition results, CNN and BiLSTM emotion recognition models are established in this study.
CNN and BiLSTM are representative deep learning algorithms that have achieved great success in speech and natural language processing tasks such as speech recognition and machine translation, and have shown excellent performance in speech emotion recognition. In this experiment, these two classifiers are used for a comparative analysis.
In [30], the authors proposed an improved CNN model that greatly enhances CNNs' capability of modeling geometric transformations. The convolution layer of a CNN can be formulated as follows:

Z^(l+1)(i, j) = [Z^l ⊗ w^(l+1)](i, j) + b = Σ_{k=1..K} Σ_{x=1..f} Σ_{y=1..f} [Z_k^l(s_0·i + x, s_0·j + y) · w_k^(l+1)(x, y)] + b,
(i, j) ∈ {0, 1, ..., L^(l+1)},  L^(l+1) = (L^l + 2p − f)/s_0 + 1,

where b is the offset; Z^l and Z^(l+1) represent the convolution input and output of the (l + 1)-th layer, which are known as feature graphs; L^(l+1) denotes the size of Z^(l+1); Z(i, j) corresponds to a pixel of the feature graph; w_k^(l+1)(x, y) represents the k-th sub-convolution kernel matrix of the convolution kernel; k is the number of channels of the feature graph; f is the convolution kernel size; s_0 is the convolution step size; and p is the amount of padding.
The CNN is composed of two convolution layers and a fully connected layer. The window length of the convolution kernel is 5, the convolution step is 1, and the activation function is ReLU. For each convolution layer, the output of the convolution is flattened to one dimension for batch normalization, and the prediction results are obtained after the softmax activation layer. Batch normalization reduces the internal covariate shift in the feature graph by normalizing the output of the previous layer, and the resulting regularization effect helps reduce overfitting. The structure of the CNN is shown in Figure 3. The optimizer of the CNN is Adam, and the loss function is cross-entropy. To prevent overfitting, input neurons are randomly dropped with a probability of 0.3 each time the parameters are updated during training. In the process of training and testing, five-fold cross-validation is adopted, with 80% of the dataset used as training data and 20% as test data.
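A minimal Keras sketch of this CNN configuration, assuming the gender-specific feature vector is fed as a one-dimensional sequence with a single channel (the number of filters, input length, and class count are illustrative assumptions; the paper does not report them):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_len, num_classes):
    """Two Conv1D layers (kernel size 5, stride 1, ReLU) with batch
    normalization, dropout of 0.3, and a softmax output (illustrative sizes)."""
    model = keras.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(64, kernel_size=5, strides=1, activation="relu"),
        layers.BatchNormalization(),
        layers.Conv1D(64, kernel_size=5, strides=1, activation="relu"),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g. model = build_cnn(input_len=180, num_classes=8)  # 8 emotion classes in RAVDESS
```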
BiLSTM uses one bidirectional LSTM layer to extract hidden-layer features, and the 256-dimensional feature outputs of the hidden layer are directly batch-normalized [31]. By normalizing the output of the previous layer, the internal covariate shift in the feature graph is reduced, and the resulting regularization effect helps reduce overfitting. Then, a fully connected layer is used to downsample the input features and reduce the dimension of the feature space. Finally, the prediction results are obtained after the softmax activation layer. The optimizer is Adam, the fixed learning rate is set to 0.001, and the loss function is cross-entropy. To prevent overfitting, neurons are randomly dropped with a probability of 0.5 each time the parameters are updated during training. In the process of training and testing, five-fold cross-validation is adopted, with 80% of the dataset used as training data and 20% as test data. The structure of BiLSTM is shown in Figure 4.
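A corresponding Keras sketch of the BiLSTM model (128 units per direction give the 256-dimensional hidden output described above; the input shape and fully connected layer width are illustrative assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(time_steps, feat_dim, num_classes):
    """One bidirectional LSTM layer (2 x 128 = 256-dim output), batch
    normalization, dropout of 0.5, and a softmax output (illustrative sizes)."""
    model = keras.Sequential([
        layers.Input(shape=(time_steps, feat_dim)),
        layers.Bidirectional(layers.LSTM(128)),   # 256-dimensional hidden features
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(64, activation="relu"),      # fully connected downsampling layer
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```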

Emotional Speech Dataset
In this paper, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the CASIA Chinese emotional speech dataset are used for testing. Using two datasets prevents the model from being tuned to a single dataset, so its effectiveness can be validated more fully. In the process of training and testing, five-fold cross-validation is adopted, with 80% of the dataset used as training data and 20% as test data.
The RAVDESS is a multimodal dataset, which consists of 24 young and middle-aged actors (12 males and 12 females) uttering sentences that have matching words in neutral North American accents [32]. According to the analysis of speech features and the requirements of emotion recognition, the experiment selected 1440 audio recordings of emotional speech as test samples, which contain eight kinds of emotion: happiness, sadness, anger, fear, calm, disgust, surprise, and neutral. The sampling rate of the audio is 48 kHz at 16 bit, and the files are saved in an uncompressed waveform format. A total of 1440 experimental samples, including 720 male and 720 female utterances, are used without gender classification.
The CASIA Chinese emotional speech dataset, released by the Institute of Automation, Chinese Academy of Sciences, is composed of 1200 sentences vocalized by four young and middle-aged professional orthoepists (two males and two females). Our research chooses six emotions: happiness, sadness, anger, fear, surprise, and neutral. The sampling rate of the audio is 16 kHz at 16 bit, and the files are saved in an uncompressed waveform format. A total of 1200 test samples are used, consisting of 600 male and 600 female utterances. See Table 2 for details. The experiment is carried out on Windows 7, where the computer hardware is configured with an Intel i7 CPU at 2.80 GHz and 16 GB of memory. The programming language is Python 3.8. The program relies mainly on the CPU for calculation; a GPU is not used. The run-time is about 10 min.
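As a small illustration of the evaluation protocol used throughout (five folds, so each fold trains on 80% of the samples and tests on the remaining 20%), assuming X and y hold the features and integer emotion labels of one gender subset and build_cnn() is the sketch given earlier; the epoch and batch-size values are assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

# Assumed inputs: X (n_samples x feature_len) and y (integer emotion labels)
num_classes = len(np.unique(y))
Y = to_categorical(y, num_classes)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_acc = []
for train_idx, test_idx in skf.split(X, y):
    model = build_cnn(input_len=X.shape[1], num_classes=num_classes)
    # Add a channel dimension for Conv1D and train on the 80% split
    model.fit(X[train_idx, :, None], Y[train_idx], epochs=50, batch_size=32, verbose=0)
    _, acc = model.evaluate(X[test_idx, :, None], Y[test_idx], verbose=0)
    fold_acc.append(acc)
print("Mean five-fold accuracy:", np.mean(fold_acc))
```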

Gender Classification Results
The front-end model of this experiment automatically classifies the original speech according to gender. MFCC mean and spectral contrast are used to extract gender features, and the speech gender classification is completed through MLP. Through the test, the accuracy of gender classification in the RAVDESS dataset reaches 100%, while the accuracy in the CASIA dataset is 99.5%.

Speech Emotion Recognition Results
The back-end model of this experiment automatically completes the speech emotion recognition process according to gender. First, the male and female voices input from the RAVDESS dataset are preprocessed. Acoustic speech feature parameters are extracted and concatenated into overall feature vectors according to their respective dimensions, which are then sent into CNN and BiLSTM for speech emotion recognition. The comprehensive recognition rate in Table 3 for the whole dataset is obtained by combining the recognition rates of the identified male and female speech. After the RAVDESS dataset is tested, the procedure is repeated on the CASIA dataset. The final recognition results are shown in Table 3.
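The paper does not spell out the exact combination formula; one plausible reading is a sample-weighted average of the per-gender accuracies, as in this illustrative snippet (the function name and the example numbers are hypothetical):

```python
def comprehensive_rate(acc_male, n_male, acc_female, n_female):
    """Sample-weighted combination of male and female recognition accuracies
    (assumed formula; n_* are the numbers of test utterances per gender)."""
    return (acc_male * n_male + acc_female * n_female) / (n_male + n_female)

# e.g. comprehensive_rate(0.85, 720, 0.82, 720) -> 0.835 for a gender-balanced dataset
```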
As can be seen from Table 3, on the RAVDESS dataset, the comprehensive recognition rate of CNN with gender classification is 9.72% higher than that of CNN without gender classification, while that of BiLSTM increases by 5.20%. On the CASIA dataset, the comprehensive recognition rate of CNN with gender classification is 7.83% higher than that of CNN without gender classification, while BiLSTM improves by 5.41%. The accuracy is higher when CNN and BiLSTM recognize male and female speech separately, while recognizing the speech of both genders together yields lower accuracy. Moreover, the average recognition accuracy for male speech is significantly better than that for female speech. The recognition accuracy of CNN for male speech is 1.39% and 5.83% higher than that for female speech on RAVDESS and CASIA, respectively, while that of BiLSTM for male speech is 3.47% and 9.17% higher. This result indicates that the emotional characteristics of male speech are more significant than those of female speech and are easier for the models to recognize.
We also find that CNN shows a better performance in speech emotion recognition than BiLSTM. This is because CNN mainly focuses on local features and has a high degree of correlation with the maximum and minimum of speech features, pitch, and other parameters. Local features of speech are more obvious than the overall features, which is conducive to CNN, while BiLSTM focuses on timing features, and is relatively less sensitive to local features.
The confusion matrix of CNN recognition results and the confusion matrix of BiLSTM recognition results are shown in Figures 5 and 6, respectively.

Discussion
Gender-based classification and the analysis of the influence weights of multiple speech emotion features in speech emotion recognition across genders are the main contributions of this paper. The MLP model is used to recognize the gender of the original speech and separate the male and female speech data. The differences between male and female speech in terms of acoustics are analyzed. We further analyze the influence weights of multiple speech emotion features in male and female speech emotion recognition and establish weight rankings. Then, we build the optimal feature sets for male and female emotion recognition, respectively. Finally, we feed the speech emotion feature sets into the CNN and BiLSTM models for training and testing. The value of this approach is that differentiating between genders reduces the difficulty of model recognition, and establishing more targeted feature sets helps to further improve the recognition accuracy.

Table 4 compares the results of speech emotion recognition in previous studies [24][33][34][35] and in this study. Sun, T.W. [24] proposed a novel emotion recognition algorithm that does not rely on any speech acoustic features and combines speaker gender information with the emerging R-CNN structure. Kwon, S. et al. [33] proposed an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to improve performance. Sajjad, M. et al. [34] proposed a new strategy for SER that selects and extracts sequences via a nonlinear RBFN-based method to find a similarity level in clustering, and then uses a multilayer deep BiLSTM to learn and recognize long-term sequences in audio data for emotion recognition; the accuracy achieved on the RAVDESS dataset is 77.02%. None of the abovementioned studies examine the weighting of speech emotion features for different genders, nor do they create separate feature sets for different genders; thus, none of the recognition accuracies are satisfactory. It can be seen from Table 4 that the method adopted in this study significantly improves the accuracy of speech emotion recognition, especially on the CASIA dataset, where the result is improved by 3.31% compared with other algorithms.

Comparison of Accuracy with Algorithm Based on Gender Information

Table 5 shows a comparison of the proposed work with [36], in which S. Kanwal et al. describe a feature optimization approach that uses a clustering-based genetic algorithm. They use gender information as an independent feature map fed into a CNN to train the whole recognition network, obtaining recognition rates of 82.59% for general speakers, 75.49% for male speakers, and 91.12% for female speakers on RAVDESS in speaker-dependent experiments. In this study, the influence weights of speech emotion features in male and female speech emotion recognition are analyzed first. On this basis, the optimal feature sets for male and female emotion recognition are established and input into the CNN and BiLSTM models, respectively. Ultimately, our recognition rate for male speakers is 9.93% higher, our recognition rate for female speakers is 7.09% lower, and the overall recognition rate is 2.13% higher.

Although the proposed algorithm shows excellent recognition performance, it cannot yet operate in real time. In addition, its computational cost is relatively high, so it cannot be integrated into mobile devices, which narrows the application scenarios.
In future work, we aim to add more modalities, such as images and text, to improve accuracy. In addition, we intend to train and experiment under more conditions, including speaker age and identity.

Conclusions
This paper proposes a speech emotion analysis method based on gender recognition. First, gender recognition and automatic gender classification are performed on the speech dataset. Then, based on the different physiological characteristics of males and females, the weights of the acoustic emotion feature parameters in male and female speech emotion are analyzed. Finally, CNN and BiLSTM are used to establish speech emotion feature analysis models to recognize the emotion of male and female speech in the RAVDESS and CASIA datasets, respectively. The results show that the speech emotion recognition model based on gender classification proposed in this paper effectively improves the accuracy of speech emotion recognition. Moreover, the proposed model performs better on the emotion recognition of male speech than on that of female speech.