Gender-Driven English Speech Emotion Recognition with Genetic Algorithm

Speech emotion recognition based on gender holds great importance for achieving more accurate, personalized, and empathetic interactions in technology, healthcare, psychology, and social sciences. In this paper, we present a novel gender–emotion model. First, gender and emotion features were extracted from voice signals to lay the foundation for our recognition model. Second, a genetic algorithm (GA) processed high-dimensional features, and the Fisher score was used for evaluation. Third, features were ranked by their importance, and the GA was improved through novel crossover and mutation methods based on feature importance, to improve the recognition accuracy. Finally, the proposed algorithm was compared with state-of-the-art algorithms on four common English datasets using support vector machines (SVM), and it demonstrated superior performance in accuracy, precision, recall, F1-score, the number of selected features, and running time. The proposed algorithm faced challenges in distinguishing between neutral, sad, and fearful emotions, due to subtle vocal differences, overlapping pitch and tone variability, and similar prosodic features. Notably, the primary features for gender-based differentiation mainly involved mel frequency cepstral coefficients (MFCC) and log MFCC.


Introduction
Emotions are considered an integral and important part of human life, and they are a way to express one's opinions and inform others about one's physical and mental health [1]. Speech signals have emerged as a valuable source of information about a speaker's emotions, and they have the advantage of easy recording compared to other physiological signals that require special equipment [2]. As a result, speech emotion recognition (SER) has received increasing attention.
SER systems aim to discern the potential emotional states of speakers from speech signals [3]. These systems find applications in various fields, from human-computer interaction to automated supervision and control of safety systems [4]. For instance, in remote call centers, they can automatically detect negative or positive customer experiences, and facilitate the evaluation of company services or employee attitudes. In crime investigation, they can be utilized to determine the psychological state of suspects and judge their truthfulness. In vehicles, drivers' emotional information is extracted to improve safety. Additionally, identifying the emotional states of students in academic environments can assist teachers or intelligent virtual agents in providing appropriate responses, thereby improving the quality of education.
SER is essential for improving device intelligence, promoting personalized voice services, and achieving natural and harmonious human-computer interactions [5]. It is difficult to predict emotions accurately, due to the complex nature of human emotions [6]. Gender is a significant contributor, because of differences in acoustic characteristics: the pitch distinction between males and females is substantial, while variations within the same gender are comparatively small. The main contributions of this paper are as follows:
1. Propose a gender-emotion model for speech emotion recognition.
2. Extract various features from speech for gender and emotion recognition.
3. Utilize a genetic algorithm for high-dimensional feature selection for fast emotion recognition, in which the algorithm is improved through feature evaluation, the selection of parents, crossover, and mutation.
4. Validate the performance of the proposed algorithm on four English datasets.
The structure of this paper is arranged as follows: Section 2 explains the related works on SER, and the proposed model is presented in Section 3. Section 4 contains the experimental results and discussions, while Section 5 provides the conclusions.

Related Works
Gender-based SER systems are designed to detect and analyze emotional states in spoken language, with a focus on distinguishing between male and female speakers. These systems aim to identify emotional signals, such as tone, pitch, intensity, and other acoustic features, and associate them with particular emotions.
Bisio et al. recognized individuals' emotional states by registering audio signals, and their system had two functions: gender recognition (GR) and emotion recognition (ER) [27]. GR was implemented through a pitch frequency estimation approach, while ER used an SVM classifier based on correctly selected audio features. The performance analysis revealed that the emotion recognition system achieved high recognition rates and accurately identified emotional content. Bhattacharya et al. studied the influence of multimodal emotional features produced by facial expressions, voice, and text [28]. By analyzing a substantial dataset comprising 2176 manually annotated YouTube videos, they noticed that the performance of multimodal features consistently surpassed that of bimodal and unimodal features. This performance variation was caused by different emotional contexts, gender factors, and video durations. In particular, male speakers exhibited strong suitability for multimodal features in identifying most emotions. Zaman et al. presented a system for recognizing age, gender, and emotion from audio speech [29]. In this system, all audio files were converted into 20 statistical features, and the transformed digital datasets were employed to develop various prediction models, such as artificial neural networks, CatBoost, XGBoost, AdaBoost, gradient boosting, KNN, random forest, decision tree, naive Bayes, and support vector machine (SVM) models. CatBoost demonstrated the highest performance among all the prediction models.
Speaker recognition systems often experience a decline in performance in the presence of emotional or stressful conditions. Verma et al. examined the intonation and stress patterns of speech in Hindi (Indo-Aryan) and evaluated the impact of gender on the accuracy of SER [30]. This study suggested a system that recognized both gender and emotion by obtaining fundamental prosodic and spectral speech features, and then compared three classification algorithms. Experiments on the Hindi emotion corpus revealed that the SVM achieved 78% accuracy in recognizing speech emotions. Bandela et al. proposed a novel feature extraction method using the teager energy operator (TEO) to detect stress emotions. TEO was specifically designed to increase the energy of accented speech signals [31]. For gender-dependent and speaker-independent conditions, a KNN classifier was used for emotion classification in the EMA (English), EMOVO (Italian), EMO-DB (German), and IITKGP (Telugu) databases. Rituerto et al. introduced a speaker recognition system designed for personalized wearable devices to address issues related to gender-based violence [32]. The objectives included measuring the impact of stress on speech and seeking ways to mitigate its effects on speaker recognition tasks. Given the scarcity of data resources for such scenarios, the system employed data augmentation techniques customized for this purpose.
These studies provided valuable insights into SER and related fields.However, they did not consider the variations in emotional features between male and female speakers.To address this, we employed a GA to efficiently recognize corresponding features and complete classification.

Emotional Databases
In this study, we utilized four datasets: CREMA-D [3], EmergencyCalls [33], IEMOCAP-S1 [34], and RAVDESS [35], for gender-based emotion recognition. These databases are valuable for research in emotion recognition, speech analysis, and related fields. They provide a range of emotional expressions, and they are useful resources for developing and testing emotion recognition algorithms and models.

CREMA-D
CREMA-D is a dataset comprising 7442 original video clips performed by 91 actors. Among these actors, 48 are male and 43 are female, with ages ranging from 20 to 74. The actors delivered a series of 12 sentences that expressed one of six distinct emotions: angry, disgust, fearful, happy, sad, and neutral. These emotions are depicted at four different intensity levels: low, medium, high, and unspecified.

EmergencyCalls
The 18 speakers were instructed to record their voices portraying four different emotions: angry, drunk, painful, and stressful. After labeling, the audio files were cleaned to eliminate any background noise. Subsequently, the 338 recordings were trimmed to a uniform length of approximately 3 s each. Additionally, synthetic audio files were generated by adjusting the pitch of recordings from four selected speakers.

IEMOCAP-S1
The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, and multispeaker database collected at the SAIL lab at USC. It contains approximately 12 h of data, including text, speech, video, and facial transcriptions. Actors perform both scripted and spontaneous dialogues, expressing a range of emotions, such as angry, happy, sad, and neutral, as well as dimensional labels such as activation, valence, and dominance. The IEMOCAP dataset is highly regarded for its detailed annotation of emotional expressions and its multimodal nature. IEMOCAP-S1 is the S1 session of IEMOCAP, and it contains 1819 video clips.

RAVDESS
RAVDESS is a database that contains audiovisual recordings of individuals expressing a wide range of emotional states through speech and song. RAVDESS comprises 7356 audio and video clips, and each clip lasts approximately 3 to 5 s. The database features 24 professional actors (12 male and 12 female). The RAVDESS database covers various emotions, including angry, happy, sad, surprised, fearful, and disgust.
The datasets contain an equal number of male and female speakers. CREMA-D, EmergencyCalls, and RAVDESS provide enough emotional samples to balance the numbers. The sample data in IEMOCAP-S1 are unevenly distributed, especially for the fearful, other, and disgust emotions, which make up only 3% of the samples.

Feature Extraction
We extracted pitch and acoustic features from the databases introduced in Section 3.1. Pitch features were used to identify gender, and the OpenSmile toolkit was employed to obtain acoustic features for emotion recognition.
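Pitch-based gender identification can be illustrated with a minimal sketch. The autocorrelation pitch estimator and the 165 Hz decision threshold below are illustrative assumptions for a single frame, not the paper's exact method, which extracts 11 pitch features and feeds them to the recognition model:

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame via autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search lags inside the F0 range
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

def predict_gender(pitch_hz, threshold=165.0):
    """Typical male F0 lies below the threshold, female above (illustrative rule)."""
    return "male" if pitch_hz < threshold else "female"

# A synthetic 120 Hz tone stands in for a voiced male speech frame.
sr = 16000
t = np.arange(int(0.05 * sr)) / sr
frame = np.sin(2 * np.pi * 120.0 * t)
f0 = estimate_pitch(frame, sr)
```

In practice the paper's pitch features summarize many such frame-level estimates over a whole utterance.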

Improved Genetic Algorithm
Since a large number of features were extracted from the speech signals, we employed a GA to perform feature selection, to reduce the dimensionality of the features and improve the recognition accuracy. GAs have a good global search ability, and they are widely used to solve optimization problems. The classical GA performs well in the early stages. However, as the population approaches the global optimum, the diversity within the population gradually diminishes. GAs are prone to prematurity and may become stuck in local optima too early. An improved GA (IGA), based on feature evaluation, the selection of parents, crossover, and mutation, was proposed, to increase the diversity of the population and balance global search and local convergence, as shown in Figure 2.

Feature Evaluation
Fisher score is a statistical measure used in feature selection to evaluate the discriminatory ability of features in distinguishing between different classes in a dataset [36,37]. The Fisher score of a feature is given as follows:

FisherScore = Between-ClassVariance / Within-ClassVariance, (1)

where Between-ClassVariance is the variance among the means of different classes, while Within-ClassVariance is the average variance within each class. A higher Fisher score indicates better discrimination between classes. In feature selection, features with higher Fisher scores are considered more relevant, and they are more likely to contribute significantly to the separation of classes. Based on the Fisher scores of the features, we divided the feature space into three subspaces: the top-ranking features (FS1) accounted for 10%; the lowest-ranking features (FS3) accounted for 10%; and the remaining features (FS2) accounted for 80%, as shown in Figure 3.
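The Fisher score and the 10%/80%/10% subspace split can be sketched as follows; the toy data and the helper names (`fisher_scores`, `split_subspaces`) are illustrative, not the paper's implementation:

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score per feature: between-class variance of the class means
    divided by the average within-class variance."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        p = len(Xc) / len(X)                      # class proportion as weight
        between += p * (Xc.mean(axis=0) - overall_mean) ** 2
        within += p * Xc.var(axis=0)
    return between / (within + 1e-12)             # guard against zero variance

def split_subspaces(scores, top=0.10, bottom=0.10):
    """Rank features by score and split into FS1 (top 10%), FS2 (middle 80%),
    and FS3 (bottom 10%)."""
    order = np.argsort(scores)[::-1]               # best features first
    n = len(scores)
    k_top, k_bot = int(n * top), int(n * bottom)
    return order[:k_top], order[k_top:n - k_bot], order[n - k_bot:]

# Toy data: feature 0 carries a strong class shift, the other nine are noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(0, 1, (50, 10))])
X[:50, 0] += 5.0
y = np.array([0] * 50 + [1] * 50)
scores = fisher_scores(X, y)
fs1, fs2, fs3 = split_subspaces(scores)
```

The discriminative feature lands in FS1, which is exactly the behavior the subspace split relies on.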
Figure 3. The feature space divided into the top-ranking (FS1), remaining (FS2), and lowest-ranking (FS3) subspaces.

The Selection of Parents
The parents in a GA are typically chosen by roulette wheel selection based on their performance. The population converges quickly if parents are generated near the optimal value, but this convergence speed and search ability cannot be maintained. Additionally, as the search approaches the optimal solution, the algorithm's ability to explore new solutions decreases. When producing the next population, one should select excellent parents as much as possible, and then adjust the selection of parents according to the offspring during the search process, to expand the search scope. Algorithm 1 displays the new parent selection scheme.

Algorithm 1: The selection of parents

To enhance the convergence speed, we changed the algorithm so that only the top-performing half of the parents participates in generating offspring during each cycle. As illustrated in line 5 of the algorithm, if a newly generated offspring is excellent, its parents will continue to be used in the next cycle. However, if it does not meet this criterion, the algorithm employs previous excellent solutions that were not involved in the current cycle as parents.
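A minimal sketch of this parent selection scheme, assuming a fitness to be maximized and hypothetical helper names (`select_parents`, `update_parents`) not taken from the paper:

```python
import random

def select_parents(population, fitness):
    """Keep only the top-performing half as the parent pool (higher fitness is better)."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in order[: len(population) // 2]]

def update_parents(offspring_fitness, parent_pair, worst_fitness, history):
    """If the offspring beats the worst individual, reuse its parents next cycle;
    otherwise draw two previously excellent solutions from the history instead."""
    if offspring_fitness > worst_fitness:
        return parent_pair
    return tuple(random.sample(history, 2))

# Six individuals; only the best three ('e', 'b', 'f') enter the parent pool.
parents = select_parents(list("abcdef"), [1, 5, 3, 2, 6, 4])
```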

Crossover and Mutation
In high-dimensional feature selection, features from FS1 have a high probability of being selected in the mutation of offspring, while features from FS3 have a low probability of being selected. We proposed a new crossover and mutation method that accounts for gender differences. Since it is not known in advance which features affect emotion recognition in males and females, their initial values remained the same, to reduce bias. As the algorithm executed, it gradually identified gender-specific features. The dimensions were divided into two parts: one half is dedicated to detecting male emotional features, while the other half focuses on identifying female emotional features. As a result, the dimensionality of the population is twice the number of features in the datasets.
Crossover mimics the genetic recombination found in nature, and it aims to create new solutions by fusing the characteristics of promising parents. As depicted in Figure 4, features from FS1 are more likely to be selected, whereas features from FS3 are typically not chosen. Because their crossover does not significantly enhance genetic diversity, FS1 and FS3 are excluded from the crossover operation. The number of crossover points for FS2 is randomly generated within the range of 1 to 5.
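The importance-based crossover can be sketched as follows. Each chromosome is a binary feature mask; the helper below is an illustrative reading in which FS1 features stay selected, FS3 features stay unselected, and only the FS2 positions are recombined with 1 to 5 random cut points:

```python
import random

def importance_crossover(p1, p2, fs1, fs2, fs3):
    """Recombine two binary feature masks: FS1 genes stay selected, FS3 genes
    stay unselected, and only FS2 positions undergo multi-point crossover."""
    c1, c2 = p1[:], p2[:]
    for i in fs1:                                   # top-ranked features always kept
        c1[i] = c2[i] = 1
    for i in fs3:                                   # lowest-ranked features never chosen
        c1[i] = c2[i] = 0
    n_points = random.randint(1, 5)                 # 1 to 5 cut points on FS2
    cuts = sorted(random.sample(range(1, len(fs2)), min(n_points, len(fs2) - 1)))
    swap = False
    for k, i in enumerate(sorted(fs2)):             # swap alternating FS2 segments
        if cuts and k == cuts[0]:
            swap = not swap
            cuts.pop(0)
        if swap:
            c1[i], c2[i] = p2[i], p1[i]
    return c1, c2

random.seed(0)
c1, c2 = importance_crossover([0] * 10, [1] * 10,
                              fs1=[0, 1], fs2=[2, 3, 4, 5, 6, 7], fs3=[8, 9])
```

Whatever the cut points, each FS2 gene pair is only exchanged, never created or lost, so genetic material is preserved.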
Mutation introduces new genetic material into the population by randomly changing the values of several genes in the chromosomes, thereby maintaining genetic diversity.These changes enable the GA to explore areas of the solution space that would otherwise be inaccessible.During the mutation process, it is crucial to consider gender and feature differences.We developed a new mutation method that takes these factors into account, as illustrated in Algorithm 2.
The features in FS1 and FS3 are ranked as the most and least important, respectively. Lines 1-25 indicate that if male or female features from FS1 and FS3 are selected/unselected, then the corresponding features for the other gender should also be selected/unselected. Additionally, the algorithm's diversity is increased through mutation. Lines 26-33 utilize the GA's mutation scheme to search for gender-differentiated features.
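A sketch of the gender-aware mutation on the doubled chromosome, under one plausible reading of Algorithm 2: the FS1/FS3 states are mirrored between the male and female halves, while the remaining genes flip independently per gender. The function name and mutation rate are illustrative assumptions:

```python
import random

def gender_mutation(chrom, n_features, fs1, fs3, rate=0.05):
    """Mutate a doubled chromosome whose first half encodes male features and
    whose second half encodes female features. For the most/least important
    features (FS1/FS3), the selected/unselected state is mirrored across the
    two genders; all other genes flip independently per gender."""
    c = chrom[:]
    tied = set(fs1) | set(fs3)
    for i in range(n_features):
        if i in tied:
            if random.random() < rate:
                c[i + n_features] = c[i]          # keep both genders consistent
        else:
            if random.random() < rate:            # independent bit-flip, male half
                c[i] = 1 - c[i]
            if random.random() < rate:            # independent bit-flip, female half
                c[i + n_features] = 1 - c[i + n_features]
    return c

# rate=1.0 makes every branch fire, so the outcome is deterministic.
child = gender_mutation([1, 0, 1, 0, 0, 0, 0, 0], n_features=4,
                        fs1=[0], fs3=[1], rate=1.0)
```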

Experimental Results and Analysis
To assess the effectiveness of our proposed IGA, we compared its performance with the GA [38], BBO_PSO [39], and MA [40]. We used the default parameter settings for all algorithms to ensure a fair comparison, and Table 2 presents their main parameter settings.
It is worth noting that BBO_PSO does not incorporate gender information, while the GA and IGA employ the gender-emotion model illustrated in Figure 1. The population size of the algorithms was 20, and they were run 20 times. The maximum number of evaluations for the IGA, GA, BBO_PSO, and MA was 2000. To evaluate the statistical significance of the experimental results, we employed the Wilcoxon rank sum test and the Friedman test. A significance level of 0.05 was chosen to examine whether there were any noteworthy differences in the obtained results.
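Both significance tests are available in SciPy and can be illustrated on hypothetical run results; the accuracy values below are synthetic, for illustration only, not the paper's measurements:

```python
import numpy as np
from scipy.stats import ranksums, friedmanchisquare

rng = np.random.default_rng(1)
# Synthetic accuracies of the four algorithms over 20 independent runs.
iga = rng.normal(0.75, 0.02, 20)
ga = rng.normal(0.71, 0.02, 20)
bbo_pso = rng.normal(0.69, 0.02, 20)
ma = rng.normal(0.72, 0.02, 20)

# Pairwise Wilcoxon rank sum test: does the IGA differ from the GA?
stat, p_pair = ranksums(iga, ga)

# Friedman test: do the four algorithms differ overall across the runs?
chi2, p_all = friedmanchisquare(iga, ga, bbo_pso, ma)
```

A p-value below 0.05 in either test indicates a noteworthy difference at the chosen significance level.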

Objective Function
The primary metric of SER is classification accuracy. Therefore, we utilized it as the objective function in our experiments, as depicted in Equation (2):

Accuracy = (TP + TN) / (TP + TN + FP + FN), (2)

where TP stands for the number of instances that were actually positive and correctly predicted as positive, TN represents the number of instances that were truly negative and accurately predicted as negative, FP signifies the number of instances that were actually negative but incorrectly predicted as positive, and FN indicates the number of instances that were truly positive but wrongly predicted as negative. Additionally, we assessed the algorithms in terms of precision, recall, F1-score, the number of selected features, and the execution time.
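These counts yield the accuracy of Equation (2) and the companion metrics directly; a small worked example with assumed counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy (Equation (2)), precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Assumed counts for a single binary decision, for illustration.
acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```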

Experimental Analysis
We employed an SVM and 10-fold cross-validation to evaluate the performance of the SER algorithms.
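The evaluation protocol can be sketched as follows. A nearest-centroid classifier stands in for the paper's SVM purely to keep the example self-contained; only the 10-fold splitting logic is the point here:

```python
import numpy as np

def ten_fold_accuracy(X, y, n_folds=10, seed=0):
    """Average accuracy over a shuffled 10-fold cross-validation, using a
    nearest-centroid classifier as a stand-in for the SVM."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        centroids = {c: X[train][y[train] == c].mean(axis=0)
                     for c in np.unique(y[train])}
        labels = np.array(sorted(centroids))
        dists = np.stack([np.linalg.norm(X[test] - centroids[c], axis=1)
                          for c in labels])
        pred = labels[np.argmin(dists, axis=0)]    # closest class centroid wins
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))

# Two well-separated Gaussian blobs should be classified almost perfectly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(6, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
acc = ten_fold_accuracy(X, y)
```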
Table 3 displays the average recognition accuracy of the algorithms. The improvement strategies of the IGA appeared to be effective, as it achieved better classification accuracy than the GA on CREMA-D, EmergencyCalls, IEMOCAP-S1, and RAVDESS. The IGA outperformed the MA, BBO_PSO, and GA on EmergencyCalls, IEMOCAP-S1, and RAVDESS; on EmergencyCalls in particular, it exhibited significant superiority. The MA surpassed the compared algorithms only on CREMA-D. Compared to the two current emotion recognition algorithms, BBO_PSO and MA, the IGA improved the accuracy by 10% on EmergencyCalls, 3% on IEMOCAP-S1, and 2% on RAVDESS. The algorithms performed better on CREMA-D, EmergencyCalls, and RAVDESS than on IEMOCAP-S1. These three datasets have fewer emotions than IEMOCAP-S1, and they have a more uniform data distribution, which facilitates constructing efficient SER models. The results of the Friedman test demonstrated that the IGA achieved the best performance on three out of the four datasets, and the IGA was inferior to the MA only on CREMA-D. BBO_PSO performed the worst. Furthermore, it is noteworthy that the IGA, MA, and GA utilized gender features to recognize speech emotions, which implies that gender information can improve the accuracy of SER. The Wilcoxon rank sum test indicated that the MA, BBO_PSO, GA, and IGA performed well on 1, 0, 0, and 3 emotional datasets, respectively. The MA and IGA could not be distinguished on RAVDESS at the 5% significance level. Therefore, the IGA was more effective than the other algorithms.
Figure 5 shows the precision, recall, and F1-score of the algorithms. Similar to the observations in Table 3, the algorithms demonstrated superior performance on EmergencyCalls and RAVDESS compared to CREMA-D. The algorithms identified the fearful, other, and disgust emotions as other emotions, but they did not mistakenly classify other emotions as these specific emotions, so their precision and F1-score values were NaN. BBO_PSO was inferior to the MA, GA, and IGA. These algorithms had better precision than recall, and they accurately recognized correct emotional samples. Table 4 presents the number of selected features and the running time of the algorithms. Even though the GA acquired more features than the MA and BBO_PSO, the IGA and GA managed to search twice as much feature space as the MA and BBO_PSO. Remarkably, the IGA used the lowest number of features for recognition. The IGA divided the feature space through the Fisher score and avoided wasting searches on less important features. It was possible to find supplementary features (from FS2) that were suitable for classification through importance-based crossover and mutation. The number of extracted features from these datasets was equal, so the algorithms did not achieve a significant difference in the number of selected features. The execution time of the feature selection algorithms was mainly affected by their objective function. The SVM's time complexity ranges from O(N × D²) to O(N × D³), where N is the feature size and D is the number of samples, so the algorithms are more efficient when fewer features are used for a given dataset. Consequently, the execution time of the IGA was shorter than that of the other algorithms. Additionally, it is important to highlight that CREMA-D, with its larger sample size, tended to have the longest execution time, while EmergencyCalls, with fewer samples, led to a shorter execution time for the algorithms. Based on the above analysis, the IGA demonstrated outstanding performance in various evaluation metrics, and it is suitable for sentiment analysis of English speech.

Discussion
Figure 6 displays the confusion matrices of the IGA. On CREMA-D, the IGA faced challenges in distinguishing between the neutral, sad, and disgust emotions. It struggled to classify fearful accurately, and the presence of the other five emotions could interfere with its recognition. On EmergencyCalls, the IGA performed exceptionally well in recognizing the angry, drunk, and painful emotions, but these emotions also impacted its ability to judge stressful effectively. On IEMOCAP-S1, the neutral, frustrated, and exciting emotions negatively affected the IGA's recognition of other emotions. The algorithm also tended to misclassify sad as fearful and other. On RAVDESS, sad influenced the classification of neutral, and emotions were confused with surprised, particularly happy, sad, and fearful. On CREMA-D, the GA, BBO_PSO, and MA showed similar results to the IGA. Identifying fearful was easier for the GA than for the IGA, while BBO_PSO outperformed the IGA in recognizing happy. The MA achieved better accuracy than the IGA in recognizing the disgust, fearful, happy, and neutral emotions. On EmergencyCalls, painful significantly affected the MA's ability to recognize stressful, and the GA and BBO_PSO often misclassified stressful as angry and painful. On RAVDESS, BBO_PSO performed exceptionally well in recognizing surprised, with an accuracy of 73.7%. On IEMOCAP-S1, the GA, BBO_PSO, and MA could not identify the neutral, frustrated, and exciting emotions, while BBO_PSO misclassified surprised as exciting.

Conclusions
Research has shown that people of different genders often express various emotions through their speech patterns, tones, and intonations. In our work, we built a speech emotion model based on gender to acquire a high accuracy, and this model involves extracting gender and emotion features, as well as feature selection. We employed higher-order statistics, spectral features, and temporal dynamics that might better capture the nuances between the neutral, sad, and fearful emotions. The GA implements feature selection to improve the model's ability to distinguish between similar emotions. In experiments on four English datasets, the accuracies of the proposed algorithm were 0.6565, 0.7509, 0.5859, and 0.6756 on the CREMA-D, EmergencyCalls, IEMOCAP-S1, and RAVDESS datasets, respectively. In terms of precision, recall, and F1-score, it was superior to the compared algorithms. Additionally, our research revealed that the neutral, sad, and fearful emotions affect recognition accuracy, and that the MFCC and log MFCC are the main gender-differentiated emotional features. To improve the classification of challenging emotions such as stressful, fearful, and disgust, we could employ several strategies in our proposed model. We will use filter methods to comprehensively evaluate features and fine-tune the crossover and mutation parameters of the GA. We also intend to test different classifiers to evaluate their recognition performance. Additionally, data preprocessing will be used to enhance the quality and representativeness of the training set. This work could be extended by integrating facial and voice features from the same individual in various environments, which would improve recognition accuracy and the machine's comprehension of human emotions.

Figure 1
Figure 1 depicts the proposed model, which includes emotional databases, feature extraction, and feature selection using an improved GA. This model establishes SER for males and females based on gender. When predicting emotions, it initially determines the gender of the voices, and subsequently employs the relevant emotion model based on that gender prediction.
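The two-stage gender-then-emotion inference can be sketched with stub models; all names and the pitch threshold below are illustrative, and the real models come from the trained SER system:

```python
def predict_emotion(voice, gender_model, male_emotion_model, female_emotion_model):
    """Two-stage inference: first predict the speaker's gender, then route the
    voice to the matching gender-specific emotion model. All three models are
    assumed to be callables returning a label."""
    gender = gender_model(voice)
    if gender == "male":
        return male_emotion_model(voice)
    return female_emotion_model(voice)

# Stub models illustrate the routing only.
pred = predict_emotion(
    voice={"pitch": 120.0},
    gender_model=lambda v: "male" if v["pitch"] < 165 else "female",
    male_emotion_model=lambda v: "neutral",
    female_emotion_model=lambda v: "happy",
)
```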

Figure 1.
Flowchart of the proposed model.

Figure 5.
The precision, recall, and F1-score of the algorithms.

Table 1.
Summary of gender and emotion features. Gender contains 11 features, and emotion contains 485 features.

Algorithm 1: The selection of parents
1: Sort in ascending order by the objective function values;
2: Use the best half of individuals as parents;
3: Randomly select parents to generate offspring;
4: Crossover and mutation;
5: for i ∈ offspring do
6:     if i is better than the worst individual then
7:         Use the parents that generated it in the next cycle;
8:     else select two from historical parents in the next cycle;

Table 2.
The main parameter settings.

Table 3.
The classification accuracy of the algorithms.

Table 4.
The number of selected features and running time of the algorithms.