Genetic Algorithm for High-Dimensional Emotion Recognition from Speech Signals

Abstract: Feature selection plays a crucial role in establishing an effective speech emotion recognition system. To improve recognition accuracy, researchers typically extract as many features as possible from speech signals. However, this may reduce efficiency. We propose a hybrid filter-wrapper feature selection method based on a genetic algorithm specifically designed for high-dimensional speech emotion recognition (HGA). The algorithm first utilizes Fisher Score and information gain to comprehensively rank acoustic features, and then these features are assigned probabilities for inclusion in subsequent operations according to their ranking. HGA improves population diversity and local search ability by modifying the initial population generation method of the genetic algorithm (GA) and introducing adaptive crossover and a new mutation strategy. The proposed algorithm clearly reduces the number of selected features on four common English speech emotion datasets. Experiments with K-nearest neighbor and random forest classifiers confirm that it is superior to state-of-the-art algorithms in accuracy, precision, recall, and F1-Score.


Introduction
Human communication relies heavily on speech signals [1]. Therefore, speech emotion recognition (SER) poses a compelling challenge in human-computer interaction due to its multifaceted nature, particularly when recognition is based solely on speech signals [2,3]. Speech carries linguistic information related to emotions, as well as implicit knowledge that can be extracted through speech processing methods [1,4].
SER primarily analyzes audio features without linguistic information to judge a person's emotional state [5,6]. In acoustics, speech processing techniques provide valuable information, which mainly comes from prosodic and spectral features.
Feature fusion improves the classification accuracy of SER systems; nevertheless, it increases the computational cost of classifiers. The reason is that certain features have a significant impact, while others may be completely useless for emotion recognition. Feature selection methods simplify the task of selecting the most relevant features for classification algorithms [7,8]. These methods mitigate the information loss and overfitting problems caused by the curse of dimensionality and improve a model's generalization. Feature selection is an effective way to enhance the accuracy of SER systems while decreasing their computation time and memory requirements.
Feature selection reduces the number of features by removing irrelevant and redundant ones [9,10]. However, searching the entire feature space is computationally difficult and NP-hard. Metaheuristic algorithms provide a robust and flexible approach to solving complex optimization problems [11][12][13]. Due to their global search ability, adaptability, and potential for parallelization, they are a powerful tool for finding near-optimal solutions in feature selection [14,15]. The genetic algorithm (GA) is a popular optimization technique inspired by the processes of natural selection and genetics, and it finds workable solutions to complex problems by mimicking biological evolution [16,17]. A chromosome represents a potential solution to an optimization problem, and it is encoded as a binary string. Feature selection is a binary optimization problem, so GA is particularly well suited to this task.
In GA, crossover and mutation play pivotal roles, as they determine how the next generation is produced. Researchers create various crossover and mutation operators that are designed to work with specific chromosome representations for optimization problems. Guan et al. introduced the crossover elitist preservation mechanism, in which elite solutions are preserved during crossover to promote the retention of valuable genetic material [18]. Faraji and Naji utilized a newly developed crossover architecture that enables parallel crossover operations across multiple individuals within the population [19]. Kaya explained the significant role of crossover operators in GAs and emphasized their importance in facilitating the exploration and exploitation of the solution space [20]. Zhang et al. introduced a new crossover mechanism tailored specifically for the Steiner tree problem [21]. This mechanism exploits the problem's characteristics and improves the convergence speed and solution quality of GA. Duan and Zhang proposed a precise mutation strategy that is integrated into GA and particle swarm optimization (PSO) [22]. This mutation enhances exploration and exploitation in computationally expensive scenarios. Wang et al. divided the population into two groups, and each group was then subjected to a mutation operator with unique properties [23]. They employed the advantages of these mutation operators to enhance the search process's effectiveness and efficiency.
Based on the above analysis, we observe that existing GA variants primarily improve their performance through crossover and mutation. However, when dealing with high-dimensional features, they become time-consuming and ineffective. Consequently, we utilize a hybrid filter-wrapper model to address these issues, and the main contributions of this paper are summarized as follows:
1. We propose a novel feature selection method based on filter and wrapper approaches.
2. We introduce an improved GA with adaptive crossover and novel initialization.
3. We validate the proposed algorithm on four English emotional speech datasets.
The structure of this paper is organized as follows. Section 2 introduces the related works of SER. Section 3 describes the proposed algorithm. Section 4 includes experimental results and discussions, and Section 5 provides the conclusions.

Related Works
Human speech often blends a person's emotions with sentence structure and meaning. SER categorizes speakers' emotions by studying their recorded speech. In this section, we provide an overview of the primary research in the field of SER.
Sun et al. proposed an SER model based on GA [24]. To fully express emotional information, acoustic features are extracted from speech signals. The Fisher Score selects high-ranking features and removes unnecessary information. In the feature fusion stage, GA adaptively searches for the best feature weights. Finally, a decision tree (DT) and the fused features establish an SER model. Mao et al. proposed a hierarchical SER method based on an improved support vector machine (SVM) and DT [25]. The DT is established according to the confusion among emotions. In addition, a neural network filters the original features. GA is utilized to select the remaining features for each classification in the DT while synchronously optimizing the SVM's parameters. Kanwal and Asghar proposed a feature optimization method using a cluster-based GA for SER [26]. Instead of randomly selecting a new generation, clustering is utilized during fitness evaluation to detect outliers and exclude them from the next generation. In [27], mel-frequency cepstral coefficient (MFCC) features improve the performance of SER systems. Several effective feature subsets are determined through fast correlation-based filter feature selection. Finally, a fuzzy ARTMAP neural network (FAMNN) recognizes emotions from speech, while GA determines the optimal values of the choice and vigilance parameters and the learning rate of the FAMNN. Shahin et al. proposed an automatic SER method based on the grey wolf optimizer (GWO) [28]. Speech signal data are processed by feature extraction and then passed to GWO to remove irrelevant and redundant features. Emotion recognition systems that are both accurate and robust can be achieved by employing correlated and meaningful acoustic features, because GWO effectively explores the feature space and finds the optimal feature set with rich sentiment classification patterns.
Huang and Epps investigated the effectiveness of partitioning speech signals into small segments and extracting acoustic features from each segment [29]. These features capture specific emotional cues present in different parts of speech and provide a more detailed representation of emotional dynamics in continuous speech. Özseven proposed a novel feature selection method that identifies the most informative and discriminative features from speech signals [30], enhancing the accuracy and efficiency of SER systems. Dey et al. combined golden ratio optimization (GRO) and equilibrium optimization (EO) to design a new hybrid metaheuristic feature selection algorithm for SER in audio signals [31]. They use the sequential one-point flip (SOPF) technique to search for the nearest neighbors of the final candidate solutions. To address the issue of high-dimensional emotional features in SER, Ding et al. utilized the characteristics of the biogeography-based optimization (BBO) algorithm and SVM to obtain features with rich emotional information [32].
Drawing from previous research, we employ a metaheuristic algorithm to choose relevant features in speech emotion recognition. The algorithm aims to enhance classification accuracy by reducing the number of features. The distinct advantages and robust search ability of GA (as explained in Section 1) motivate our adoption of it as a feature selection method in SER.

Materials and Methods
The proposed system involves emotional databases, feature extraction, and feature selection.

Emotional Databases
This study employs eNTERFACE05, the Ryerson audio-visual database of emotional speech and song (RAVDESS), the Surrey audio-visual expressed emotion (SAVEE) database, and the Toronto emotional speech set (TESS) to implement emotion recognition. eNTERFACE05 offers multimodal emotional speech and facial expressions, and RAVDESS provides a diverse set of acted emotions in speech and song. SAVEE contains English speakers expressing various emotions, while TESS focuses on North American English speakers' emotional speech. Together, these databases enable comprehensive assessments across emotions, patterns, and cultural backgrounds.

eNTERFACE05
eNTERFACE05 contains audio recordings of speakers uttering scripted sentences or engaging in spontaneous conversations while expressing different emotions. These emotional states include happiness, sadness, anger, and neutral, among others.

RAVDESS
The database consists of 24 skilled actors who deliver statements with the same words but varying emotions, including anger, fear, calm, happiness, sadness, surprise, and disgust. There are two different levels of intensity for each emotion, as well as an additional neutral expression.

SAVEE
SAVEE contains recordings of emotional speech from actors portraying anger, happiness, sadness, fear, disgust, surprise, and neutral expressions. Each emotion is recorded at different intensity levels. The dataset contains recordings of four male native English speakers, one of whom is a postgraduate student, while the rest are researchers at the University of Surrey.

TESS
TESS contains a diverse range of emotional expressions, including but not limited to happiness, sadness, anger, fear, disgust, and surprise. Each recording is labeled with the corresponding emotional category, so it is suitable for supervised machine learning tasks.

Feature Extraction
Pre-emphasis, framing, and windowing are fundamental signal processing techniques commonly used in speech and audio analysis. Speech signals are split into frames of 25 ms with an overlap of 10 ms, each frame is weighted by a Hamming window, and the pre-emphasis filter coefficient is set to 0.97. These techniques work together to enhance the quality of audio signals and provide a more suitable representation for subsequent processing steps. Specifically, pre-emphasis boosts the higher frequencies, framing segments the signal for analysis, and windowing minimizes spectral leakage.
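As an illustrative sketch (not the authors' code), the three preprocessing steps can be chained as follows; here the stated 10 ms overlap is interpreted as a 15 ms frame shift for 25 ms frames, which is an assumption:

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, overlap_ms=10, alpha=0.97):
    """Pre-emphasize, frame, and Hamming-window a speech signal."""
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)          # 25 ms -> 400 samples at 16 kHz
    hop = frame_len - int(sr * overlap_ms / 1000)  # 10 ms overlap -> 15 ms shift
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)                 # tapering reduces spectral leakage
    return np.stack([emphasized[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

sr = 16000
t = np.arange(sr)                                  # 1 s of synthetic audio
frames = preprocess(np.sin(2 * np.pi * 440 * t / sr), sr)
```

Each row of `frames` is then ready for per-frame spectral feature extraction.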
In this study, we employ the openSMILE toolkit to extract acoustic features from the standard set used in the computational paralinguistics challenge at INTERSPEECH 2010. A total of 1582 features are extracted with this toolbox, and Table 1 provides further details. In the proposed algorithm, as shown in Figure 1, we first rank acoustic features using Fisher Score and information gain, and then the feature space is divided according to this ranking. Finally, the genetic algorithm for high dimensionality (HGA) implements feature selection on classifiers. Fisher Score, also known as Fisher discriminant analysis or Fisher's linear discriminant, is a statistical method in machine learning and pattern recognition. It finds a linear combination of features in a dataset that maximizes the separation among different classes.
Suppose n_i is the number of samples contained in the i-th class, and u_i^r and (σ_i^r)^2 are the mean and variance of the r-th feature of the i-th class. The Fisher Score of the r-th feature is calculated as follows:

F(r) = Σ_i n_i (u_i^r − u^r)^2 / Σ_i n_i (σ_i^r)^2, (1)

where u^r is the mean of the r-th feature over all samples. It is important to minimize the variation in a feature when considering data samples from the same category. On the other hand, we aim to maximize the variation when comparing a feature across different categories.
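A minimal implementation of the Fisher Score defined above, written against the same symbols (class sizes n_i, per-class means and variances):

```python
import numpy as np

def fisher_score(X, y):
    """Fisher Score per feature: between-class scatter over within-class scatter,
    F(r) = sum_i n_i (u_i^r - u^r)^2 / sum_i n_i (sigma_i^r)^2."""
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)   # epsilon guards constant features

X = np.array([[0.0, 5.0], [0.1, 5.0], [1.0, 5.0], [1.1, 5.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)  # feature 0 separates the classes, feature 1 is constant
```

A discriminative feature (column 0) receives a much higher score than a constant one (column 1).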

Information Gain
Information gain calculates the reduction in uncertainty, or entropy, achieved by considering a particular feature in a dataset. The information entropy of a random variable X is computed in the following manner:

H(X) = −Σ_x p(x) log p(x), (2)

where p(x) represents the probability distribution of X. The joint entropy H(X, Y) of two random variables, X and Y, is defined as follows:

H(X, Y) = −Σ_x Σ_y p(x, y) log p(x, y), (3)

where p(x, y) represents the joint probability distribution of X and Y. Equation (4) is the conditional entropy H(X|Y):

H(X|Y) = H(X, Y) − H(Y) = −Σ_x Σ_y p(x, y) log p(x|y). (4)
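The entropies above translate directly into code; a small sketch for discrete features and labels, where information gain is H(labels) minus the conditional entropy:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x), estimated from empirical counts."""
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """IG(feature; labels) = H(labels) - H(labels | feature)."""
    n = len(labels)
    cond = 0.0
    for v, c in Counter(feature).items():
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += (c / n) * entropy(subset)   # weighted conditional entropy
    return entropy(labels) - cond

ig_perfect = information_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1])  # determines label
ig_useless = information_gain(['a', 'b', 'a', 'b'], [0, 0, 1, 1])  # uninformative
```

A feature that fully determines the label recovers the full label entropy (1 bit here), while an uninformative feature yields zero gain.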
Features with higher information gain are more influential in separating and categorizing data. Fisher Score and information gain each produce a feature ranking. Subsequently, the TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) method is used for a comprehensive evaluation of each feature.
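TOPSIS itself is standard; a minimal sketch that combines the two filter scores (treated as benefit criteria with equal weights, an assumption) into a single closeness ranking:

```python
import numpy as np

def topsis(scores):
    """TOPSIS closeness for alternatives (rows) over benefit criteria (columns)."""
    norm = scores / np.linalg.norm(scores, axis=0)   # vector-normalize each criterion
    ideal, anti = norm.max(axis=0), norm.min(axis=0)
    d_best = np.linalg.norm(norm - ideal, axis=1)    # distance to ideal solution
    d_worst = np.linalg.norm(norm - anti, axis=1)    # distance to anti-ideal solution
    return d_worst / (d_best + d_worst)

# Columns: Fisher Score and information gain of three hypothetical features.
closeness = topsis(np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]))
```

Features are then ranked by decreasing closeness; the feature dominating both criteria gets closeness 1.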

Feature Space Division
According to the filter ranking of features, the whole feature space is divided into four parts, as shown in Figure 2. The top-ranking features (FS1) make up 1% of the space. These features are the most important, so they have the highest probability of being selected. The probabilities of features in the other subspaces gradually decrease, as defined in Equation (5).
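Equation (5)'s exact values are not reproduced here, so in the sketch below the partition boundaries beyond FS1's 1% and all probability values are illustrative assumptions; the point is only the mechanism of rank-dependent selection probabilities:

```python
import numpy as np

# FS1 is the top 1% (as stated in the text); the remaining boundaries and the
# probability values are assumed stand-ins for Equation (5).
BOUNDS = [0.01, 0.10, 0.50, 1.00]   # cumulative ranking fractions for FS1..FS4
PROBS = [0.9, 0.5, 0.2, 0.05]       # inclusion probability per partition

def selection_probabilities(ranked_indices):
    """Assign each feature an inclusion probability based on its filter rank."""
    n = len(ranked_indices)
    probs = np.empty(n)
    for rank, idx in enumerate(ranked_indices):
        frac = (rank + 1) / n                          # position in the ranking
        part = next(i for i, b in enumerate(BOUNDS) if frac <= b)
        probs[idx] = PROBS[part]
    return probs

p = selection_probabilities(np.arange(1582))   # already-sorted feature indices
```

Top-ranked features get the highest inclusion probability, and the probability steps down through FS2, FS3, and FS4.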

Improved Genetic Algorithm
Crossover and mutation play important roles in GA, as they involve mixing and matching genes to create new generations of potential solutions. They expand the scope and explore various possibilities for solving a problem. While many studies have been conducted in this area, it is still a challenge to determine which crossover and mutation methods to use in a given situation. In this study, our objective is to provide insights into selecting the appropriate method for emotion recognition problems, and we improve GA in terms of initialization, crossover, and mutation.

Initialization
When randomizing the initial population, features with high importance are more likely to be selected, while those with a low ranking have a low probability of being chosen. The initialization of HGA relies on the feature ranking, as depicted in Equation (6), where X_i^j represents the position of chromosome i in the j-th dimension.
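A sketch of this rank-biased initialization, assuming p_j is the per-feature selection probability assigned by Equation (5):

```python
import numpy as np

rng = np.random.default_rng(0)

def initialize_population(pop_size, probs):
    """Rank-biased initialization: gene X_i^j is set to 1 with probability p_j."""
    return (rng.random((pop_size, len(probs))) < probs).astype(int)

# Features with p = 1.0 are always selected, p = 0.0 never, p = 0.5 at random.
population = initialize_population(10, np.array([1.0, 0.0, 0.5]))
```

This biases the initial chromosomes toward high-ranking features while still leaving low-ranking features reachable.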

Crossover
Crossover is an evolutionary method that utilizes two or more solutions to generate a new one. Original solutions are called parents, and new individuals are known as children. One of the simplest crossover methods is the 1-point crossover. It selects a cut-off point along the parents' genetic representation and swaps the segments before and after this point to produce two children. Each child inherits at least one element from each parent. Multiple cut-off points can be used in the same manner. Instead of choosing cut-off points, it is also possible to specify a probability of swapping individual elements of the parents.
Individuals in the population are randomly paired, and excellent individuals act as parents to generate offspring through crossover. An adaptive crossover operator is utilized in HGA. Features in FS1 have a high probability of being selected, while those in FS3 and FS4 have low probabilities. FS2 is the area that requires exploration; therefore, a multi-point crossover strategy is adopted there to expand the search range and increase population diversity. As indicated in line 3 of Algorithm 1, the crossover rate of FS2 decreases as the algorithm executes. To illustrate this concept, consider the example shown in Figure 3. In FS1 and FS3, no genes are crossed between parent1 and parent2. In FS2 and FS4, the parents adopt 2-point and 1-point crossovers, respectively.
Figure 3. The example of adaptive crossover.
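Following the Figure 3 example (no crossover in FS1/FS3, 2-point crossover in FS2, 1-point in FS4), here is a hedged sketch of the adaptive crossover; the segment boundaries are assumed, and the iteration-dependent decay of FS2's crossover rate (Algorithm 1) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

def k_point_crossover(a, b, k):
    """Classic k-point crossover on two equal-length gene segments."""
    cuts = sorted(rng.choice(np.arange(1, len(a)), size=min(k, len(a) - 1),
                             replace=False))
    c1, c2 = a.copy(), b.copy()
    swap, prev = False, 0
    for cut in cuts + [len(a)]:
        if swap:                      # swap every other segment between cuts
            c1[prev:cut], c2[prev:cut] = b[prev:cut].copy(), a[prev:cut].copy()
        swap, prev = not swap, cut
    return c1, c2

def adaptive_crossover(p1, p2, segments):
    """Partition-aware crossover: FS1/FS3 untouched, FS2 2-point, FS4 1-point.
    `segments` maps partition name -> (start, end); this layout is assumed."""
    c1, c2 = p1.copy(), p2.copy()
    for name, k in (("FS2", 2), ("FS4", 1)):
        s, e = segments[name]
        c1[s:e], c2[s:e] = k_point_crossover(p1[s:e], p2[s:e], k)
    return c1, c2

segments = {"FS1": (0, 2), "FS2": (2, 6), "FS3": (6, 8), "FS4": (8, 12)}
p1, p2 = np.zeros(12, dtype=int), np.ones(12, dtype=int)
c1, c2 = adaptive_crossover(p1, p2, segments)
```

Genes in FS1 and FS3 pass through unchanged, while each position's gene pair is merely redistributed between the two children elsewhere.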

Mutation
Mutation changes a single individual in a population. It is frequently utilized for local search or for escaping local optima by further improving a promising solution. Minor adjustments to candidate solutions produce individuals that are slightly different from their parents. The number of mutations is controlled by Equation (5). Features in FS1 have a large probability of being selected, and those in FS3 and FS4 have small probabilities, resulting in small mutations in these subspaces. For features in FS2, mutation occurs at the same rate as in the original GA. Algorithm 2 describes the detailed process of the mutation.
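A sketch of the partition-dependent bit-flip mutation; the per-partition rates below are assumptions, with FS2 kept at a typical baseline GA rate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed per-partition mutation rates: small where selection probabilities are
# extreme (FS1, FS3, FS4), baseline-GA rate in the exploratory FS2 region.
MUT_RATES = {"FS1": 0.001, "FS2": 0.01, "FS3": 0.001, "FS4": 0.001}

def mutate(chrom, segments):
    """Bit-flip mutation whose rate depends on the feature partition."""
    out = chrom.copy()
    for name, (s, e) in segments.items():
        mask = rng.random(e - s) < MUT_RATES[name]
        out[s:e] = np.where(mask, 1 - out[s:e], out[s:e])  # flip selected bits
    return out

segments = {"FS1": (0, 2), "FS2": (2, 6), "FS3": (6, 8), "FS4": (8, 12)}
child = mutate(np.zeros(12, dtype=int), segments)
```

Keeping the mutation pressure concentrated in FS2 preserves the filter ranking's guidance while still allowing local search across the whole chromosome.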

Experimental Analysis
In this research, K-nearest neighbor (KNN) and random forest (RF) classifiers are utilized to establish classification models, with K set to 5 and RF constructed from 20 decision trees. Ten-fold cross-validation is adopted to evaluate the performance of the models.
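This evaluation setup (KNN with K = 5, RF with 20 trees, 10-fold cross-validation) can be reproduced with scikit-learn; the data below are a synthetic stand-in for the selected acoustic features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a selected subset of acoustic features (4 emotions).
X, y = make_classification(n_samples=200, n_features=50, n_informative=10,
                           n_classes=4, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)                     # K = 5
rf = RandomForestClassifier(n_estimators=20, random_state=0)  # 20 decision trees

knn_acc = cross_val_score(knn, X, y, cv=10).mean()  # 10-fold cross-validation
rf_acc = cross_val_score(rf, X, y, cv=10).mean()
```

In a feature selection wrapper, the mean cross-validated accuracy of a candidate feature subset would serve as (part of) the fitness value.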

Simulation Results on the KNN Classifier
Table 3 presents the recognition accuracy (Accuracy) and the number of selected features (Length) using the KNN classifier. In terms of recognition accuracy, HGA outperforms GA on eNTERFACE05, RAVDESS, and TESS, but it is inferior to GA on SAVEE. The Wilcoxon rank sum test shows that the two algorithms produce statistically similar results on eNTERFACE05. The experimental results illustrate that HGA is superior to GA, and that the adaptive crossover and improved mutation help to advance speech emotion recognition. GWO and JAYA show no statistically similar results to HGA, and they perform worse than HGA. Although JAYA uses filter methods, it discards low-ranking features. The experimental results indicate that low-ranking features also play an important part in recognition, and that the strategy of comprehensively considering features proposed in this paper is effective. The Friedman test reveals that the average ranks of GA, GWO, JAYA, and HGA are 1.75, 3.25, 3.75, and 1.25, respectively, with p-values less than 0.05. The Wilcoxon rank sum test and the Friedman test validate the superiority of HGA. JAYA only selects features from the top 10%, so it has the minimum number of features. In contrast, GA and GWO perform selection on all features, and they have the highest numbers of features. Although HGA also operates on all features, it avoids selecting many low-ranking features. HGA effectively balances emotion recognition accuracy and the number of selected features.
Figure 4 displays the precision, recall, and F1-Score of the algorithms. All algorithms achieve their best values on TESS. GA, GWO, and HGA have the worst performance on eNTERFACE05, and JAYA performs the worst on RAVDESS. On SAVEE, GA outperforms the other algorithms, while HGA has obvious advantages on eNTERFACE05, SAVEE, and TESS. HGA offers flexibility and robustness in SER and decreases the feature space. Since JAYA uses the minimum number of features, it has the highest computational efficiency. HGA has better efficiency than GA and GWO. TESS contains a large number of samples, so the algorithms run on it for the longest time. Based on the previous discussion, it is evident that HGA presents exceptional performance in terms of classification accuracy, precision, recall, F1-Score, and the number of selected features. Therefore, HGA is a highly appropriate choice for speech emotion recognition.

Simulation Results on the RF Classifier
Table 5 displays the recognition accuracy and the number of selected features using the RF classifier. The accuracy achieved with RF surpasses that obtained by KNN. The results in Table 5 highlight the superior performance of HGA on eNTERFACE05, RAVDESS, SAVEE, and TESS, demonstrating its superiority over GA, GWO, and JAYA. According to the Wilcoxon rank sum test, GA, GWO, JAYA, and HGA excel on three, three, zero, and four datasets, respectively. GA and GWO have statistical results similar to HGA on eNTERFACE05, RAVDESS, and SAVEE. JAYA employs the minimum number of features to complete classification, followed by HGA, GA, and GWO. The results of the Friedman test indicate that the average ranks of GA, GWO, JAYA, and HGA are two, three, four, and one, respectively, with a p-value of 7.38 × 10^-3. This analysis is further supported by the data in Table 5, which clearly reveals the superior performance of HGA in speech emotion recognition. In terms of precision, recall, and F1-Score, the algorithms achieve their highest scores on TESS, followed by SAVEE, RAVDESS, and eNTERFACE05. Notably, HGA excels on RAVDESS and SAVEE, while GWO outperforms the other algorithms on eNTERFACE05 and TESS. Figure 5 effectively illustrates that HGA is adept at selecting the most pertinent features from the speech emotion datasets, and it achieves a harmonious balance between precision and recall. The execution time of the algorithms with the RF classifier is significantly longer than with KNN, primarily due to the higher time complexity of RF. JAYA demonstrates the quickest execution, followed by HGA, GA, and GWO. RAVDESS and TESS display longer running times, while eNTERFACE05 and SAVEE require relatively less time.

HGA employs a few features for classification. The adaptive crossover strategy enhances population diversity and retains the potential to discover the optimal solution. The empirical evidence derived from HGA confirms its suitability for emotion recognition.

Conclusions
Emotion-related features are always extracted from speech signals. Uncertain about which features are effective for classification, people attempt to use more features for recognition, causing high-dimensional problems. We investigate feature selection based on filter and wrapper methods. Fisher Score and information gain rank the features; however, unlike traditional filter methods, low-ranking features also enter the wrapper stage. An improved GA is proposed to effectively search for the optimal solution, and the performance of the algorithm is tested on four different datasets using KNN and RF classifiers. HGA is superior to the compared algorithms in terms of accuracy, precision, recall, and F1-Score. It achieves accuracies of 0.4634, 0.5123, 0.5364, and 0.9748 on eNTERFACE05, RAVDESS, SAVEE, and TESS with KNN, and accuracies of 0.5172, 0.5612, 0.6575, and 0.9931 with RF. Future research on the proposed algorithm involves exploring its adaptability to real-world scenarios, enhancing its robustness in diverse cultural and linguistic contexts, and integrating it into practical applications such as human-computer interaction, mental health monitoring, and personalized services.

Figure 4. The precision, recall, and F1-Score of the algorithms based on KNN.


Figure 5. The precision, recall, and F1-Score of the algorithms based on RF.

Table 1. Summary of acoustic features.

Table 3. The experimental results of the compared algorithms based on KNN.

Table 4 presents the running time of the algorithms. The execution time of feature selection mainly depends on the classifier. The maximum time complexity of KNN is O(D × N × N), where D represents the number of samples and N is the number of features used.

Table 4. The average running time of the compared algorithms (in seconds) based on KNN.

Table 5. The experimental results of the compared algorithms based on RF.

Table 6 depicts the running time of the algorithms. The maximum time complexity of RF is O(M × (D × N × log D)), where M is the number of decision trees.

Table 6. The average running time of the compared algorithms (in seconds) based on RF.