Article

Metaheuristic Approaches to Enhance Voice-Based Gender Identification Using Machine Learning Methods

by Şahin Yıldırım 1,* and Mehmet Safa Bingöl 2

1 Department of Mechatronics Engineering, Erciyes University, 38039 Kayseri, Turkey
2 Department of Mechatronics Engineering, Nigde Omer Halisdemir University, 51240 Nigde, Turkey
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12815; https://doi.org/10.3390/app152312815
Submission received: 30 October 2025 / Revised: 19 November 2025 / Accepted: 24 November 2025 / Published: 3 December 2025

Abstract

The classification of a person’s gender by analyzing characteristics of their voice is generally called voice-based gender identification. This paper presents a systematic investigation of metaheuristic optimization algorithms for tuning machine learning methods in voice-based gender identification. Four machine learning methods—Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), and Artificial Neural Network (ANN)—are employed. The dataset is prepared in three variants: raw data and data normalized with the z-score and min–max methods. Six hyperparameter optimization approaches are then used to tune the machine learning methods: four metaheuristic algorithms (Artificial Bee Colony (ABC), Particle Swarm Optimization (PSO), Grey Wolf Optimizer (GWO), and Artificial Fish Swarm Algorithm (AFSA)), random search, and the Tree-structured Parzen Estimator (TPE). A rigorous 5 × 10-fold cross-validation strategy is implemented to ensure robust model evaluation and minimize overfitting. A comprehensive evaluation was conducted over 72 model combinations, assessed through accuracy, precision, recall, and F1-score metrics. The statistical significance of performance differences among models was assessed through paired t-tests and, for multiple-group comparisons, ANOVA. In addition, external validation was performed by introducing noise into the dataset to assess model robustness under real-world noisy conditions. The results show that metaheuristic optimization significantly outperforms traditional manual hyperparameter tuning and improves the accuracy of voice-based gender identification systems. The optimal model, combining min–max normalization with RF optimized via the PSO algorithm, achieved an accuracy of 98.68% and an F1-score of 0.9869, which is competitive with the existing literature. This study provides insights into metaheuristic optimization for voice-based gender identification and presents a deployable model for forensic science, biometric security, and human–computer interaction.

1. Introduction

The human voice is one of the most significant physiological characteristics employed in biometric identification systems, carrying rich information about an individual’s gender, age, and emotional state. Voice-based gender identification has therefore emerged as an increasingly important research field. Voice signals have distinctive characteristics such as frequency, tone, timbre, and other acoustic and spectral parameters, and these characteristics can serve as key indicators of gender. Voice-based gender identification is the process of determining a person’s gender by analyzing their voice characteristics, and it plays a critical role in numerous contemporary application areas [1,2,3,4]. Beyond gender classification, voice-related research has encompassed speaker identification enhancement, acoustic evaluation prediction for students, and millidecade spectral analysis for ocean sound identification and wind speed quantification [5,6,7].
Methods employed in voice-based gender classification fall into three primary approaches. The first encompasses traditional machine learning algorithms, including Support Vector Machine (SVM) and K-Nearest Neighbor (KNN), which are particularly suited to smaller datasets with simpler structure. The second comprises deep learning architectures, such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models, which can handle more complex data such as images, audio, and time series. The third involves ensemble models incorporating algorithms such as AdaBoost and Gradient Boosting, which leverage the collective predictive power of multiple models to enhance performance.
Ertam proposed a new deeper LSTM model for gender classification using the Voice Gender dataset [8]. The proposed method achieved 98.4% accuracy in gender classification. Tursunov et al. presented a new CNN model for age and gender classification from voice data [9]. The model was trained and evaluated on two datasets, Common Voice and a locally developed Korean speech corpus, yielding accuracies ranging from 73% to 97%. Jorrin-Coz et al. performed age and gender classification using CNN and LSTM models on three different datasets [10]. No dataset-specific hyperparameter optimization was performed; all three datasets were tested with similar models. The model accuracies obtained ranged from 98.86% to 99.86%.
Kwasny and Hemmerling investigated different embedder architectures, such as deep neural network-based x-vector and d-vector, for gender classification and age estimation from speech signals [11]. After pretraining these models on a speaker identification task using the VoxCeleb1 dataset, they tuned them to age estimation and gender classification via transfer learning. The results set a new level of accuracy in gender classification (99.60%) and in age estimation on the TIMIT dataset. Livieris et al. proposed a new semi-supervised self-labeled algorithm for voice-based gender classification [12]. They used two different datasets, the Voice Gender dataset and the Deterding dataset, on which the proposed method achieved accuracies of 98.42% and 93.94%, respectively.
Yücesoy focused on ensemble models to automatically identify age and gender [13]. He applied five different machine learning algorithms (SVM, E-TREE, Random Forest (RF), KNN, and Logistic Regression (LR)) to the Mozilla Common Voice dataset; the highest accuracy achieved was 97.41%. In another study, Yücesoy again performed gender identification based on voice features [14]. Four different algorithms were used: KNN, LDA, CNN, and MLP. As data features, 4 main features and 12 hybrid features derived from them were used. In gender identification experiments using hybrid features on the Turkish subset of the Common Voice dataset, accuracy was increased by between 0.3% and 1.73%.
Barkana and Zhou introduced a new feature set based on pitch range (PR) for age and gender classification [15]. The proposed PR features were evaluated on the aGender database using KNN and SVM classifiers, and they provided the highest accuracy rates, particularly with the SVM classifier. Alkhammash et al. proposed a stacked ensemble learning model that uses four base classifiers (KNN, SVM, SGD, and LR) with Linear Discriminant Analysis (LDA) as the top classifier for gender identification [16]. The model was trained on the Voice Gender dataset. The results revealed that the proposed stacked model outperformed traditional machine learning models, achieving an accuracy of 99.64%. A summary of the reviewed literature on voice-based gender identification using the Voice Gender dataset is presented in Table 1.
Despite the promising results achieved in recent studies, a significant gap remains in the systematic optimization of model hyperparameters: most existing approaches rely on manual tuning, which is time-consuming. This study addresses that gap by systematically integrating four metaheuristic optimization algorithms (ABC, PSO, GWO, AFSA) with machine learning methods to automate hyperparameter selection and enhance both model performance and consistency in voice-based gender identification.
This simulation-based study focuses on developing a high-performance and robust model by integrating metaheuristic optimization techniques with machine learning methods for the voice-based gender identification problem. For this purpose, a comprehensive simulation study is performed using the Voice Gender dataset. Z-score and min–max normalization are applied during the data preprocessing stage, followed by the creation of classification models using four machine learning algorithms (SVM, RF, KNN, and ANN), whose hyperparameters are systematically optimized with the metaheuristic algorithms ABC, PSO, GWO, and AFSA. The results obtained are evaluated in detail using accuracy, precision, recall, and F1-score metrics.

2. Materials and Methods

In recent years, machine learning methods have been applied to many kinds of identification problems. In this study, four different machine learning models were used to identify voice gender. This section first provides information about the dataset and then explains the normalization methods, machine learning methods, and optimization algorithms used. For clarity, a graphical overview of the proposed voice gender identification scheme is outlined in Figure 1.
The Voice Gender dataset used in this study was created by Kory Becker in 2016. It contains a total of 3168 samples, balanced across two classes, female and male. When creating the dataset, voice samples collected from four different sources were preprocessed for acoustic analysis, and 20 features were obtained for each voice sample [19,20]. The acoustic features and their value ranges are given in Table 2.
Normalization techniques address scale disparities in data. This study employs Z-score and min–max normalization as preprocessing techniques [21]. Z-score normalization serves as a critical preprocessing component, transforming data to achieve a mean of zero and a standard deviation of one. Z-score normalization is computed using Equation (1).
x′ = (x − μ)/σ
Here, x′ denotes the normalized value, x the original data point, μ the mean, and σ the standard deviation.
Min–max normalization is a common normalization method that transforms the values in a dataset into a specific range, typically the [0, 1] range. Min–max normalization is computed using Equation (2).
x′ = (x − xmin)/(xmax − xmin)
Here, x′ represents the normalized value, x represents the original data, xmin represents the minimum value, and xmax represents the maximum value.
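Both transformations are straightforward to implement; the following is a minimal NumPy sketch of Equations (1) and (2), using illustrative values rather than the actual dataset:

```python
import numpy as np

def zscore(X):
    # Equation (1): zero mean, unit standard deviation per feature column
    return (X - X.mean(axis=0)) / X.std(axis=0)

def minmax(X):
    # Equation (2): rescale each feature column to the [0, 1] range
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

# Two hypothetical acoustic features on very different scales
X = np.array([[0.18, 120.0],
              [0.20, 180.0],
              [0.22, 240.0]])
print(zscore(X))
print(minmax(X))
```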
This study employs four fundamental machine learning algorithms commonly used for classification: SVM, RF, KNN, and ANN [22,23]. SVM, developed by Cortes and Vapnik [24], is applicable to both regression and classification tasks. In classification, SVMs are based on the principle of finding an optimal hyperplane that maximizes the margin between classes, and their kernel functions allow them to solve nonlinear classification problems effectively [25].
The kernel method augments raw data dimensionality, transforming nonlinear problems into linearly separable ones in higher-dimensional space. Denoted as K(x, y) for n-dimensional input vectors x and y, the kernel function relates to the mapping function f that projects inputs from n to m dimensions [26]. The mathematical formulation of support vectors is expressed in Equation (3).
yᵢ(wᵀxᵢ + b) ≥ 1
The objective is to determine parameters w and b such that the hyperplane separates the classes while maximizing the margin 2/||w|| (equivalently, minimizing ||w||²/2). Here, w denotes the weight vector defining the hyperplane, b the bias term, xᵢ the training samples, and yᵢ the class labels (±1). RF was developed by Breiman [27]. RF is a method that uses an ensemble learning approach in classification problems and achieves high accuracy rates by combining the predictions of multiple decision trees [28]. When making a classification prediction, an observation passes through each decision tree in the forest, and the final RF prediction is made by majority voting over all decision tree results. The fundamental RF formulation is presented in Equation (4).
Ŷ = mode{h₁(x), h₂(x), …, hₙ(x)}
Here, Ŷ represents the final predicted class, hᵢ(x) the prediction made by the i-th decision tree for the input vector x, n the total number of trees in the forest, and mode the most frequently occurring value (the majority vote).
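As a sketch of Equation (4), the majority vote over hypothetical per-tree predictions reduces to a mode computation:

```python
from collections import Counter

# Hypothetical predictions from seven trees for one sample (1 = male, 0 = female)
tree_votes = [1, 1, 0, 1, 0, 1, 1]
# Equation (4): the forest predicts the most frequent class label
y_hat = Counter(tree_votes).most_common(1)[0][0]
print(y_hat)  # -> 1
```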
KNN is a widely employed nonparametric classification algorithm, valued for its implementation simplicity. KNN operates on the proximity principle: similar data points tend to be close to each other in feature space. Predictions for new data samples are made based on their proximity to the nearest neighbors in the training set [29]. KNN performance depends critically on the distance metric and the value of K [30]. The algorithm’s foundation rests on distance computation and class assignment. Equations (5) and (6) present the KNN formulation using the Euclidean distance metric.
d(x, xᵢ) = √( Σⱼ₌₁ᵖ (xⱼ − xᵢⱼ)² )
Ŷ = argmax_c Σ_{i ∈ Nₖ(x)} I(yᵢ = c)
In Equation (5), d(x, xᵢ) denotes the Euclidean distance between the test point x and the training point xᵢ, p represents the feature dimensionality, xⱼ the j-th feature of the test point, and xᵢⱼ the j-th feature of the i-th training point. In Equation (6), Ŷ represents the predicted class, c the class label, Nₖ(x) the set of k nearest neighbors of the test point x, yᵢ the class label of the i-th neighbor, and I the indicator function (1 if the condition is true, 0 otherwise).
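A compact sketch of Equations (5) and (6), assuming NumPy arrays for the training data and hypothetical two-feature samples:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=5):
    # Equation (5): Euclidean distance from x to every training point
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Equation (6): majority vote among the k nearest neighbors
    nearest = [y_train[i] for i in np.argsort(d)[:k]]
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[0.1, 0.2], [0.9, 0.8], [0.15, 0.25], [0.85, 0.9], [0.2, 0.1]])
y_train = ["female", "male", "female", "male", "female"]
print(knn_predict(np.array([0.12, 0.22]), X_train, y_train, k=3))  # -> "female"
```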
ANN is a machine learning method developed based on biological nerve cells and the human brain [31]. It demonstrates high success in solving complex classification problems. ANNs perform well in identifying patterns or trends in data. Therefore, they are frequently used in classification problems [32]. ANNs comprise an input layer, one or more hidden layers, and an output layer. Equation (7) presents the fundamental neuron formulation.
z = Σᵢ₌₁ⁿ wᵢxᵢ + b
Here, xᵢ denotes the input values, wᵢ the corresponding weights, n the number of inputs, and b the bias term. This weighted sum is subsequently passed through an activation function, as shown in Equation (8).
y = f(z)
Here, f denotes the activation function and y the neuron output, which is propagated to the subsequent layer.
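Equations (7) and (8) amount to a dot product followed by a nonlinearity; a one-neuron sketch with a logistic activation (one of the activation choices listed in Table 3), using hypothetical inputs and weights:

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b              # Equation (7): weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))   # Equation (8): logistic (logsig) activation

print(neuron(np.array([0.2, 0.5]), np.array([0.8, -0.3]), 0.1))
```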
Random search is a method that was systematically studied and popularized by Bergstra and Bengio in 2012 for hyperparameter optimization. Random search evaluates each combination by randomly sampling from the hyperparameter space and selects the parameters that yield the best results. A significant advantage over grid search is that it explores the hyperparameter space more efficiently. Especially in cases where some hyperparameters have a much greater impact on model performance than others, random search can experiment with these important parameters more extensively. It can deliver much faster and more effective results than grid search in high-dimensional hyperparameter spaces. It is widely used in hyperparameter optimization for machine learning models, especially when computational resources are limited and a large number of hyperparameters need to be adjusted [33,34,35].
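The study’s SVM search space (Table 3) maps directly onto scikit-learn’s randomized search; the following is a sketch under the assumption of log-uniform sampling for C and gamma and stand-in data (the original work appears to use MATLAB-style names such as BoxConstraint and KernelScale):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in data

# Search space mirroring Table 3; 'polynomial' is spelled 'poly' in scikit-learn
param_dist = {"C": loguniform(0.1, 1000),
              "gamma": loguniform(0.001, 10),
              "kernel": ["linear", "rbf", "poly"]}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=50, cv=10,
                            scoring="accuracy", random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```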
The Tree-structured Parzen Estimator (TPE) is a sophisticated Bayesian optimization algorithm developed by Bergstra et al. in 2011 [34]. Unlike traditional Bayesian optimization, TPE models the p(x|y) distribution. That is, it separately estimates the probability of hyperparameters yielding good and bad results. The algorithm divides previously tested hyperparameters into two groups: those that yield good results and those that yield poor results. Using these two distributions, it selects the most promising hyperparameters for the next trial. It balances exploration and exploitation using the Expected Improvement (EI) criterion. TPE’s tree-like structure allows it to effectively handle categorical and conditional hyperparameters. It can find optimal or near-optimal hyperparameters with significantly fewer trials compared to Random Search and Grid Search. It is particularly preferred in hyperparameter optimization for deep learning models where costly evaluations are involved and in auto machine learning applications [36,37].
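A minimal TPE sketch using Optuna’s TPESampler (an assumption; the paper does not name its TPE implementation) over the same SVM search space and stand-in data:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in data

def objective(trial):
    # TPE proposes new candidates from the distribution of past good trials
    C = trial.suggest_float("C", 0.1, 1000, log=True)
    gamma = trial.suggest_float("gamma", 0.001, 10, log=True)
    kernel = trial.suggest_categorical("kernel", ["linear", "rbf", "poly"])
    return cross_val_score(SVC(C=C, gamma=gamma, kernel=kernel),
                           X, y, cv=10, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```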
The metaheuristic algorithms employed—ABC, PSO, GWO, and AFSA—are bioinspired optimization methods. ABC, developed by Karaboğa in 2005 [38], is a swarm intelligence optimization algorithm inspired by honeybee foraging behavior. The algorithm simulates three bee roles—employed, onlooker, and scout—to explore the search space for optimal solutions. Its simplicity, minimal parameter requirements, and robust global optimization performance have led to widespread adoption in machine learning and engineering applications [39].
PSO, developed by Kennedy and Eberhart in 1995, is a swarm intelligence optimization technique inspired by the collective behavior of bird flocks and fish schools [40]. Each particle represents a candidate solution, updating its trajectory based on personal best and global best experiences. PSO’s popularity stems from its fast convergence, straightforward mathematical formulation, and inherent parallelizability [41].
GWO, proposed by Mirjalili et al. in 2014, is a metaheuristic algorithm inspired by gray wolf hunting behavior and social hierarchy [42]. The algorithm models a four-tier leadership structure (alpha, beta, delta, omega) and hunting phases (encircling, hunting, attacking prey) to navigate the search space toward optimal solutions. GWO’s local optima avoidance capability has driven its widespread adoption across engineering domains.
AFSA, developed by Li et al. in 2002, is a swarm intelligence algorithm inspired by fish schooling behaviors, including foraging, swarming, and following [43]. Artificial fish navigate the search space as autonomous agents, exhibiting prey-searching, swarming, following, and random behaviors to locate global optima while avoiding local traps. AFSA’s strengths include parallel search capabilities, local optima avoidance, and adaptability to complex optimization landscapes.
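To illustrate how such swarm optimizers drive a search, the following is a minimal PSO loop using the study’s settings (w = 0.7, c1 = c2 = 1.5, 30 particles, 50 iterations); the objective here is a toy function standing in for cross-validated accuracy, not the actual tuning pipeline:

```python
import numpy as np

def pso(f, bounds, n_particles=30, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    # Minimal PSO maximizing f over box bounds: a list of (low, high) per dimension
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    x = rng.uniform(lo, hi, size=(n_particles, len(bounds)))  # positions
    v = np.zeros_like(x)                                      # velocities
    pbest = x.copy()                                          # personal bests
    pbest_val = np.array([f(p) for p in x])
    g = pbest[pbest_val.argmax()].copy()                      # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, *x.shape))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmax()].copy()
    return g, pbest_val.max()

# Toy objective standing in for accuracy; maximum at p = (3, -1)
best, score = pso(lambda p: -(p[0] - 3) ** 2 - (p[1] + 1) ** 2, [(-5, 5), (-5, 5)])
print(best, score)
```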
The t-test is a parametric hypothesis test assessing the statistical significance of mean differences between groups. Three variants exist: independent-sample, paired-sample, and one-sample t-tests. In machine learning research, it facilitates model performance comparisons, feature set effect evaluation, and algorithm effectiveness analysis across datasets. The test distinguishes chance variation from meaningful differences through p-value computation [44].
ANOVA is a statistical test used to compare the means of three or more groups simultaneously. ANOVA calculates the F-statistic by dividing the between-group variance by the within-group variance and determines whether there is a significant difference between groups based on this value. The test is divided into types such as one-way ANOVA, two-way ANOVA, and repeated-measure ANOVA. Unlike the t-test, which can only compare two groups, ANOVA analyzes multiple groups simultaneously, keeping the false-positive error under control. In machine learning research, it is used to evaluate the effect of different hyperparameter combinations, compare the performance of multiple algorithms, or analyze the results of different feature selection methods [45].
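Both tests are one-liners in SciPy; a sketch over hypothetical per-fold accuracies (paired by fold; values are illustrative only):

```python
from scipy import stats

# Hypothetical per-fold accuracies for the same folds
raw = [81.2, 80.5, 81.9, 80.1, 81.4]
zsc = [98.0, 98.3, 97.9, 98.1, 98.2]
mmx = [98.3, 98.5, 98.2, 98.4, 98.3]

t, p = stats.ttest_rel(raw, zsc)            # paired t-test: raw vs. z-score
F, p_anova = stats.f_oneway(raw, zsc, mmx)  # one-way ANOVA across three groups
print(f"t = {t:.2f}, p = {p:.4g}; F = {F:.2f}, p = {p_anova:.4g}")
```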
This study employs four performance metrics: accuracy, precision, recall, and F1-score. Accuracy measures the proportion of correct classifications, indicating overall performance. Precision quantifies the ratio of true positives among predicted positives, reflecting false-positive control. Recall measures the proportion of actual positives correctly identified, indicating the model’s sensitivity. The F1-score, defined as the harmonic mean of precision and recall, balances these complementary metrics. This composite measure is particularly valuable for assessing model performance on imbalanced datasets. Equations (9)–(12) present the formulations for accuracy, precision, recall, and F1-score, respectively [46,47].
Accuracy = (TP + TN)/(TP + TN + FN + FP)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
F1-score = 2TP/(2TP + FP + FN)
Here, TP represents true positive, FP represents false positive, TN represents true negative, and FN represents false negative.
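Equations (9)–(12) translate directly from confusion-matrix counts; a sketch with hypothetical counts:

```python
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # Equation (9)
    precision = tp / (tp + fp)                  # Equation (10)
    recall = tp / (tp + fn)                     # Equation (11)
    f1 = 2 * tp / (2 * tp + fp + fn)            # Equation (12)
    return accuracy, precision, recall, f1

# Hypothetical confusion-matrix counts for a balanced 500-sample test set
print(metrics(tp=245, tn=241, fp=9, fn=5))
```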

3. Results

This study evaluated 48 model combinations pairing four machine learning algorithms (SVM, RF, KNN, ANN) with four metaheuristic optimizers (ABC, PSO, GWO, AFSA) for voice-based gender identification. The Voice Gender dataset developed by Becker was used, consisting of 3168 samples and 20 features with a balanced distribution of female and male classes. Experiments were conducted using three different data preprocessing methods: raw data, z-score normalization, and min–max normalization. K-fold cross-validation is a method used to reliably evaluate the accuracy of a model: it divides the data into k folds, trains and validates the model once per fold, and combines the per-fold results to evaluate overall performance. Mean aggregation refers to averaging the results of these repeated experiments. This study implemented 5 × 10-fold cross-validation with mean aggregation to enhance generalizability, mitigate overfitting, and improve evaluation robustness [48].
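The 5 × 10-fold protocol corresponds to repeated stratified k-fold cross-validation; a scikit-learn sketch with stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in data

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)  # 5 x 10-fold
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())  # mean aggregation over all 50 folds
```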
The study was conducted on a computer with an i7-7700 CPU (3.6 GHz), 16 GB RAM, an AMD Radeon R7 450 4 GB graphics card, and Windows 10 Pro (22H2). In the metaheuristic optimization runs, the population size was set to 30 and the number of iterations to 50. The Maximum Function Evaluations (MFE) metric indicates how many objective-function evaluations an optimization algorithm performs while searching the solution space and is calculated by multiplying the population size by the maximum iteration count; the MFE in this study was therefore 1500. Table 3 presents the hyperparameter search spaces for each algorithm.
In the ABC algorithm, the population size is set to 30, the maximum number of iterations to 50, and the limit value to 20. Increasing the limit value in ABC favors exploitation and slows convergence; decreasing it favors exploration and faster diversification.
For PSO, population size and iterations were set to 30 and 50, respectively. The inertia weight (w) was set to 0.7 to balance exploration (higher w) and exploitation (lower w). Cognitive and social coefficients (c1, c2) were both set to 1.5, balancing individual experience reliance and swarm influence to achieve effective exploration–exploitation trade-off. For GWO, population size and iterations were 30 and 50, respectively. The convergence parameter (a) decreases linearly from 2 to 0 across iterations, governing the search agent coefficient A ∈ [−a, a]. The coefficient C ∈ [0, 2] provides random emphasis on prey position. The condition |A| < 1 indicates exploitation phase, while |A| ≥ 1 denotes exploration, with C preventing local optima entrapment.
In the AFSA, the population size was selected as 30 and the maximum number of iterations as 50. The visual field value, which represents the fish’s vision radius or search area, was set to 0.5, and the step size, which represents the movement distance, was set to 0.3. As the visual field value increases, exploration increases, while as it decreases, exploitation increases. Therefore, a visual field value of 0.5 was chosen to provide a balanced search. As step size increases, movement speed increases, while a decrease provides slow and precise movement. Table 4 summarizes the optimization algorithm parameters.
Table 5 presents Random Search results using 5 × 10-fold cross-validation.
The best model obtained with the random search algorithm was achieved with the SVM and min–max data combination. The parameters of this model are BoxConstraint (C) = 22.74, KernelScale (gamma) = 0.36, and an RBF kernel. Table 6 presents TPE results using 5 × 10-fold cross-validation.
The best model obtained with the TPE algorithm was achieved with the SVM and z-score data combination. The parameters of this model are BoxConstraint (C) = 16.2819, KernelScale (gamma) = 3.4208, and an RBF kernel.
In experiments conducted on raw data, different classification algorithms and optimization methods have demonstrated varying levels of performance. Table 7 details all results obtained with raw data.
When raw data was used, RF and ANN algorithms showed high performance at around 93%, while the KNN algorithm only achieved an accuracy rate of 81.21%. The SVM algorithm was generally successful and showed stable performance in the range of 92.38–92.55%. The highest performance in raw data was achieved with the RF+AFSA combination, with an accuracy rate of 93.21%. Table 8 details all results obtained using z-score normalized data.
Z-score normalization yielded substantial improvements across all algorithms. The most notable improvement was seen in the KNN algorithm, which reached 98.09% compared to the raw data, recording an increase of approximately 17%. The SVM algorithm also showed significant improvement, performing in the range of 98.44–98.49%. The RF and ANN algorithms showed stable performance in the range of 98.35% to 98.62%. The highest performance for Z-score normalization was achieved with the RF+PSO combination, with an accuracy rate of 98.62%. Table 9 presents min–max normalization results.
Both SVM and RF algorithms achieved high accuracy rates in the range of 98.40–98.68% with min–max normalization. The RF+PSO combination achieved the highest accuracy rate (98.68%). The KNN algorithm showed stable performance in the range of 98.34–98.36%, while ANN yielded results in the range of 98.40–98.46%. The highest performance for min–max normalization was achieved with the RF+PSO combination, with an accuracy rate of 98.68%. Table 10 and Table 11 present algorithm performance statistics and normalization effects, respectively.
The KNN algorithm benefited the most from normalization. KNN, which showed an average accuracy of only 81.07% on raw data, increased to 98.05% with z-score normalization and to 98.35% with min–max normalization, recording a total performance increase of 17.28%. This is due to the distance-based nature of KNN. When calculating distances between examples, the KNN algorithm creates unbalanced weights in calculations due to features with different scales (e.g., a feature ranging from 0 to 1 versus one ranging from 0 to 1000). Normalization eliminated this problem by bringing all features to the same scale, revealing KNN’s true potential.
The ANN algorithm was also significantly affected by normalization, with the average accuracy increasing from 88.40% in the raw data to 98.38% with z-score and 98.43% with min–max normalization, representing a 10.03% increase. This substantial improvement demonstrates that ANN’s internal activation functions and weight update mechanisms benefit greatly from normalized input features. The gradient descent optimization process converges more efficiently when all features are on the same scale, allowing the network to learn more effectively and avoid issues related to vanishing or exploding gradients that can occur with unnormalized data.
The SVM algorithm also showed significant improvement with normalization, with the average accuracy increasing from 92.47% in the raw data to 98.47% with both z-score and min–max normalization, representing a 6.00% increase. Since SVM’s kernel functions and decision boundary calculations are also distance- and scale-based, normalization has facilitated the algorithm’s process of finding optimal hyperparameters. The consistency of performance across both normalization methods (98.47%) indicates that SVM benefits primarily from having features on a comparable scale rather than from any specific normalization technique.
The RF algorithm, on the other hand, showed a modest increase of 5.51% (from 93.14% to 98.65%). While this improvement is more substantial than initially expected, it remains the smallest among all algorithms. Since the tree-based decision-making structure divides features based on threshold values, feature scales have less impact on the split points of decision trees compared to distance-based algorithms. The improvement observed suggests that even though RF is relatively robust to feature scaling, normalization still provides benefits, possibly by helping the algorithm make more balanced feature selections during the tree construction process. Therefore, while RF can perform well even with raw data, normalization enhances its performance and should not be overlooked. The general performance statistics of optimization algorithms are given in Table 12.
The optimization algorithms demonstrated remarkably similar performance levels, with all four algorithms achieving nearly identical results. The PSO algorithm showed the highest average accuracy of 94.56% with a standard deviation of 4.84, and achieved the best overall result (98.68%) with the RF algorithm under min–max normalization. The AFSA performed almost identically with an average accuracy of 94.51% and a standard deviation of 4.88, demonstrating stable performance across all combinations. The ABC algorithm achieved an average accuracy of 94.49% with a standard deviation of 4.85, proving its reliability with consistent results. The GWO algorithm showed comparable performance with an average accuracy of 94.46% and a standard deviation of 4.91, indicating that its swarm hierarchy-based search strategy is equally effective.
The minimal differences between optimization algorithms (maximum 0.10% difference in average accuracy) suggest that the choice of optimization algorithm has limited impact on final model performance. All four algorithms demonstrated similar standard deviations (4.84–4.91), indicating consistent behavior across different classification algorithms and normalization methods. This consistency contrasts sharply with the significant impact of normalization methods, which could improve performance by up to 17.28% for distance-based algorithms like KNN.
Statistical significance tests were performed using paired t-tests and one-way ANOVA with a significance threshold of α = 0.05 to validate the observed performance differences. T-test results are presented in Table 13.
Paired t-tests comparing raw data against normalized data revealed highly significant differences (p < 0.001) for all algorithms. The KNN algorithm showed the most dramatic improvement with t-statistics of −310.60 (z-score) and −330.52 (min–max), corresponding to mean improvements of +16.98% and +17.28%, respectively. ANN demonstrated substantial improvement (t ≈ −187, mean difference > +10%), while SVM and RF showed t-statistics of approximately −145 with mean improvements of +6.00% and +5.51%. Comparison between z-score and min–max normalization showed that min–max performs slightly better for KNN (p < 0.001, +0.30%), RF (p = 0.012, +0.07%), and ANN (p = 0.044, +0.05%), while SVM showed no significant difference (p = 1.000). ANOVA results are presented in Table 14.
One-way ANOVA tests revealed no statistically significant differences among the four optimization algorithms (ABC, PSO, GWO, AFSA) across all preprocessing methods (F-values < 0.095, all p > 0.96). Similarly, comparison of classification algorithms (SVM, RF, KNN, ANN) showed no significant differences (F = 1.6497, p = 0.192). These findings indicate that when properly preprocessed and optimized, algorithm selection has minimal impact on final performance.
In order to test the robustness of the proposed model, salt–pepper noise (5%) was added to an external test set consisting of 250 female and 250 male samples randomly selected from the dataset. The test accuracy and F1-score results for this noisy dataset are presented in Table 15. The confusion matrices obtained from the test results are shown in Figure 2.
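The paper does not specify exactly how salt–pepper noise is applied to tabular acoustic features; one plausible reading, sketched below as an assumption, is to overwrite a random 5% of entries with the corresponding feature’s minimum (“pepper”) or maximum (“salt”):

```python
import numpy as np

def salt_pepper(X, rate=0.05, seed=0):
    # Assumption: corrupt a `rate` fraction of all entries, half to the
    # feature maximum (salt) and half to the feature minimum (pepper)
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    mask = rng.random(X.shape) < rate
    salt = rng.random(X.shape) < 0.5
    mn = np.broadcast_to(X.min(axis=0), X.shape)
    mx = np.broadcast_to(X.max(axis=0), X.shape)
    Xn[mask & salt] = mx[mask & salt]
    Xn[mask & ~salt] = mn[mask & ~salt]
    return Xn
```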
The raw data approach achieved near-optimal classification with only 21 misclassifications out of 500 samples, demonstrating balanced performance across both genders (12 FN for males, 9 FP for females). Z-score normalization exhibited moderate degradation with 43 total errors, maintaining reasonable specificity for males (236/250 correct) but showing increased sensitivity to female misclassification (29 FP). Most strikingly, min–max normalization produced a catastrophic confusion pattern where 241 male samples were misclassified as female, resulting in a severely imbalanced predictor that defaulted to female classification in 98% of cases. This asymmetric failure mode reveals that salt–pepper noise, when combined with min–max scaling, disproportionately distorts male voice features by compressing their discriminative characteristics into the noise-dominated normalization range.

4. Conclusions and Discussion

This study investigated the effectiveness of metaheuristic optimization algorithms for tuning the hyperparameters of machine learning methods in voice-based gender identification. Four machine learning algorithms were combined with four metaheuristic optimizers, all outlined in Section 2. A rigorous 5 × 10-fold cross-validation protocol with mean aggregation was implemented to strengthen evaluation and minimize overfitting. The KNN algorithm exhibited the most substantial benefit, with accuracy surging from 80.75% on raw data to 98.37% with normalization, a remarkable 17.5% improvement that underscores the algorithm’s sensitivity to feature scaling. This pronounced enhancement stems from KNN’s distance-based decision mechanism, where unnormalized features with disparate scales create biased distance calculations. SVM demonstrated a 6.00% improvement (92.47% to 98.47%), while ANN showed a 10.03% gain (88.40% to 98.43%), confirming that gradient-based optimization and kernel functions benefit substantially from standardized input distributions. Conversely, RF exhibited the smallest improvement at 5.51% (93.14% to 98.65%), consistent with its tree-based architecture’s inherent robustness to feature scaling, though normalization still provided measurable benefits through more balanced feature selection during tree construction.
Among metaheuristic algorithms, performance differences proved minimal and statistically insignificant. ABC achieved 94.49% average accuracy with σ = 4.85, PSO reached 94.56% (σ = 4.84), GWO attained 94.46% (σ = 4.91), and AFSA obtained 94.51% (σ = 4.88). One-way ANOVA confirmed no significant differences among optimizers (F < 0.095, p > 0.96 across all preprocessing methods), indicating that metaheuristic choice exerts a negligible impact when proper data preprocessing is applied. Paired t-tests revealed highly significant differences between raw and normalized data (p < 0.001 for all algorithms).
External validation using salt–pepper noise (5% corruption probability) on a balanced test set (250 males, 250 females) revealed critical insights into preprocessing robustness. The catastrophic failure of min–max normalization under salt–pepper noise (50.8% accuracy vs. 95.8% for raw data) is attributed to its unbounded sensitivity to outliers: extreme noise values corrupt xmin and xmax, shifting the entire normalized distribution and invalidating decision boundaries learned during training. This effect disproportionately impacted male voices (241/250 misclassified as female), whose narrower frequency ranges were compressed into female classification regions. Z-score normalization maintained moderate robustness (91.4%) through bounded outlier influence, while raw data surprisingly achieved the best noise resistance via the random forest’s bootstrap aggregation and majority voting mechanisms.
This research made the following methodological contributions: systematic multi-algorithm comparison under identical conditions, rigorous statistical validation via t-tests and ANOVA, external noise robustness assessment, comprehensive evaluation of six hyperparameter optimization strategies, and a production-ready model for practical deployment in forensic science, biometric security, voice assistants, call centers, and human–computer interaction systems. Future research will investigate robust normalization techniques (median-based, robust scaler) for noisy environments, deep learning architectures, cross-dataset generalization, and real-time processing optimization for edge deployment scenarios.

Author Contributions

Conceptualization, Ş.Y. and M.S.B.; methodology, Ş.Y. and M.S.B.; software, M.S.B.; validation, Ş.Y.; formal analysis, Ş.Y. and M.S.B.; investigation, Ş.Y. and M.S.B.; resources, Ş.Y. and M.S.B.; data curation, M.S.B.; writing—original draft preparation, Ş.Y. and M.S.B.; writing—review and editing, Ş.Y. and M.S.B.; visualization, Ş.Y. and M.S.B.; supervision, Ş.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset can be accessed using the references within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABC: Artificial Bee Colony
AFSA: Artificial Fish Swarm Algorithm
ANN: Artificial Neural Network
CNN: Convolutional Neural Network
GWO: Grey Wolf Optimizer
KNN: K-Nearest Neighbor
LSTM: Long Short-Term Memory
PSO: Particle Swarm Optimization
RF: Random Forest
ROC: Receiver Operating Characteristic
SVM: Support Vector Machine

References

  1. Garain, A.; Ray, B.; Giampaolo, F.; Velasquez, J.D.; Singh, P.K.; Sarkar, R. GRaNN: Feature selection with golden ratio-aided neural network for emotion, gender and speaker identification from voice signals. Neural Comput. Appl. 2022, 34, 14463–14486. [Google Scholar] [CrossRef]
  2. Kannapiran, P.; Sindha, M.M.R. Voice-based gender recognition model using FRT and light GBM. Teh. Vjesn. 2023, 30, 282–291. [Google Scholar] [CrossRef]
  3. Poornima, S.; Sripriya, N.; Preethi, S.; Harish, S. Classification of Gender from Face Images and Voice. In Intelligence in Big Data Technologies—Beyond the Hype: Proceedings of ICBDCC 2019; Springer: Singapore, 2020; pp. 115–124. [Google Scholar] [CrossRef]
  4. Al-Khawaldeh, N.N.; Banikalef, A.A.; Rababah, L.M.; Khawaldeh, A.F. Ideological representations of women in Jordanian folk proverbs from the perspective of cultural semiotics. Humanit. Soc. Sci. Commun. 2024, 11, 125. [Google Scholar] [CrossRef]
  5. Chauhan, N.; Isshiki, T.; Li, D. Enhancing speaker recognition models with noise-resilient feature optimization strategies. Acoustics 2024, 6, 439–469. [Google Scholar] [CrossRef]
  6. Zhang, D.; Mui, K.W.; Masullo, M.; Wong, L.T. Application of machine learning techniques for predicting students’ acoustic evaluation in a university library. Acoustics 2024, 6, 681–697. [Google Scholar] [CrossRef]
  7. Mirzaei Hotkani, M.; Martin, B.; Bousquet, J.F.; Delarue, J. Real-Time Analysis of Millidecade Spectra for Ocean Sound Identification and Wind Speed Quantification. Acoustics 2025, 7, 44. [Google Scholar] [CrossRef]
  8. Ertam, F. An effective gender recognition approach using voice data via deeper LSTM networks. Appl. Acoust. 2019, 156, 351–358. [Google Scholar] [CrossRef]
  9. Tursunov, A.; Mustaqeem; Choeh, J.Y.; Kwon, S. Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms. Sensors 2021, 21, 5892. [Google Scholar] [CrossRef]
  10. Jorrin-Coz, J.; Nakano, M.; Perez-Meana, H.; Hernandez-Gonzalez, L. Multi-Corpus Benchmarking of CNN and LSTM Models for Speaker Gender and Age Profiling. Computation 2025, 13, 177. [Google Scholar] [CrossRef]
  11. Kwasny, D.; Hemmerling, D. Gender and age estimation methods based on speech using deep neural networks. Sensors 2021, 21, 4785. [Google Scholar] [CrossRef]
  12. Livieris, I.E.; Pintelas, E.; Pintelas, P. Gender recognition by voice using an improved self-labeled algorithm. Mach. Learn. Knowl. Extr. 2019, 1, 492–503. [Google Scholar] [CrossRef]
  13. Yücesoy, E. Automatic age and gender recognition using ensemble learning. Appl. Sci. 2024, 14, 6868. [Google Scholar] [CrossRef]
  14. Yücesoy, E. Gender Recognition based on the stacking of different acoustic features. Appl. Sci. 2024, 14, 6564. [Google Scholar] [CrossRef]
  15. Barkana, B.D.; Zhou, J. A new pitch-range based feature set for a speaker’s age and gender classification. Appl. Acoust. 2015, 98, 52–61. [Google Scholar] [CrossRef]
  16. Alkhammash, E.H.; Hadjouni, M.; Elshewey, A.M. A hybrid ensemble stacking model for gender voice recognition approach. Electronics 2022, 11, 1750. [Google Scholar] [CrossRef]
  17. Jasuja, L.; Rasool, A.; Hajela, G. Voice gender recognizer recognition of gender from voice using deep neural networks. In Proceedings of the 2020 International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 10–12 September 2020; pp. 319–324. [Google Scholar] [CrossRef]
  18. Srivastava, S.; Sharma, H.; Garg, D. Comparative study of machine learning algorithms for voice based gender identification. In Proceedings of the 2022 International Conference on Edge Computing and Applications (ICECAA), Tashkent, Uzbekistan, 26–28 October 2022; pp. 1136–1141. [Google Scholar] [CrossRef]
  19. Identifying the Gender of a Voice Using Machine Learning. Available online: https://www.primaryobjects.com/2016/06/22/identifying-the-gender-of-a-voice-using-machine-learning/ (accessed on 27 October 2025).
  20. Voice Gender GitHub Repository. Available online: https://github.com/primaryobjects/voice-gender (accessed on 27 October 2025).
  21. Patro, S. Normalization: A preprocessing stage. arXiv 2015, arXiv:1503.06462. [Google Scholar] [CrossRef]
  22. Boz, C.; Zhou, J. Segmented vs. non-segmented heart sound classification: Impact of feature extraction and machine learning models. Appl. Sci. 2025, 15, 11047. [Google Scholar] [CrossRef]
  23. Arab, O.; Mekouar, S.; Mastere, M.; Cabieces, R.; Collantes, D.R. Improved liquefaction hazard assessment via deep feature extraction and stacked ensemble learning on microtremor data. Appl. Sci. 2025, 15, 6614. [Google Scholar] [CrossRef]
  24. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  25. Guido, R.; Ferrisi, S.; Lofaro, D.; Conforti, D. An overview on the advancements of support vector machine models in healthcare applications: A review. Information 2024, 15, 235. [Google Scholar] [CrossRef]
  26. Huang, S.; Cai, N.; Pacheco, P.P.; Narrandes, S.; Wang, Y.; Xu, W. Applications of support vector machine (SVM) learning in cancer genomics. Cancer Genom. Proteom. 2018, 15, 41–51. [Google Scholar] [CrossRef]
  27. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  28. Hu, J.; Szymczak, S. A review on longitudinal data analysis with random forest. Brief. Bioinform. 2023, 24, bbad002. [Google Scholar] [CrossRef]
  29. Halder, R.K.; Uddin, M.N.; Uddin, M.A.; Aryal, S.; Khraisat, A. Enhancing K-nearest neighbor algorithm: A comprehensive review and performance analysis of modifications. J. Big Data 2024, 11, 113. [Google Scholar] [CrossRef]
  30. Uddin, S.; Haque, I.; Lu, H.; Moni, M.A.; Gide, E. Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci. Rep. 2022, 12, 6256. [Google Scholar] [CrossRef]
  31. Sekban, D.M.; Yaylacı, E.U.; Özdemir, M.E.; Yaylacı, M.; Tounsi, A. Investigating formability behavior of friction stir-welded high-strength shipbuilding steel using experimental, finite element, and artificial neural network methods. J. Mater. Eng. Perform. 2025, 34, 4942–4950. [Google Scholar] [CrossRef]
  32. Lazarenko, I.; Sitnikova, E. A novel ANN-based Classification of Spike-Wave Activity in 24-hours EEG recordings in Rats using Spectrograms: Spike-Wave Discharge Artificial Neural Network (SWAN). J. Neurosci. Methods 2025, 415, 110555. [Google Scholar] [CrossRef] [PubMed]
  33. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  34. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2011; Volume 24. [Google Scholar]
  35. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059. [Google Scholar] [CrossRef]
  36. Bergstra, J.; Yamins, D.; Cox, D.D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 115–123. [Google Scholar]
  37. Watanabe, S. Tree-structured Parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXiv 2023, arXiv:2304.11127. [Google Scholar] [CrossRef]
  38. Karaboga, D. An Idea Based on Honey Bee Swarm for Numerical Optimization; Technical Report-TR06; Erciyes University: Kayseri, Turkey, 2005. [Google Scholar]
  39. Karaboga, D.; Basturk, B. On the performance of artificial bee colony (ABC) algorithm. Appl. Soft Comput. 2008, 8, 687–697. [Google Scholar] [CrossRef]
  40. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar] [CrossRef]
  41. Du, B.; Feng, H.; Zhang, Z.; Liu, Q.; Zhu, H.; Liu, G.; Li, S. China’s Chrome Demand Forecast from 2025 to 2040: Based on Sectoral Predictions and PSO-BP Neural Network. Sustainability 2025, 17, 9115. [Google Scholar] [CrossRef]
  42. Mirjalili, S.; Mirjalili, S.M.; Lewis, A. Grey wolf optimizer. Adv. Eng. Softw. 2014, 69, 46–61. [Google Scholar] [CrossRef]
  43. Li, X.L. An optimizing method based on autonomous animats: Fish-swarm algorithm. Syst. Eng. Theory Pract. 2002, 22, 32–38. [Google Scholar]
  44. Kim, T.K. T test as a parametric statistic. Korean J. Anesthesiol. 2015, 68, 540–546. [Google Scholar] [CrossRef]
  45. Kim, T.K. Understanding one-way ANOVA using conceptual figures. Korean J. Anesthesiol. 2017, 70, 22–26. [Google Scholar] [CrossRef]
  46. Abed, S.H.; Abbas, N.A. Classification of Voice Gender Based on Stacking Ensemble Model and Metaheuristics Methods. In Proceedings of the 2022 3rd Information Technology to Enhance e-Learning and Other Application (IT-ELA), Baghdad, Iraq, 20–21 December 2022; pp. 140–146. [Google Scholar] [CrossRef]
  47. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2021, 17, 168–192. [Google Scholar] [CrossRef]
  48. Phan, T.D.; Nguyen, D.L. Data-driven approaches for predicting and optimizing the compressive strength of self-compacting concrete. Mater. Today Commun. 2025, 42, 111298. [Google Scholar] [CrossRef]
Figure 1. Schematic representation of the proposed method with flowchart.
Figure 2. Confusion matrices for raw data, z-score and min–max normalization.
Table 1. Literature comparison of machine learning methods on the Voice Gender dataset.

| Reference | Authors | Year | Method | Accuracy (%) |
|---|---|---|---|---|
| [12] | Livieris et al. | 2019 | Semi-supervised self-labeled algorithm | 98.42 |
| [8] | Ertam | 2019 | Deeper LSTM | 98.4 |
| [12] | Livieris et al. | 2019 | Semi-supervised self-labeled algorithm | 98.23 |
| [17] | Jasuja et al. | 2020 | Deep Neural Network | 96.0 |
| [16] | Alkhammash et al. | 2022 | Stacked ensemble (KNN, SVM, SGD, LR + LDA) | 99.64 |
| [18] | Srivastava et al. | 2022 | SVM, RF, KNN, DT, GB | 98.48 |
| [1] | Garain et al. | 2022 | Golden Ratio-aided Neural Network (GRaNN) | 95.68 |
Table 2. The acoustic features and value ranges for the dataset.

| Feature | Description | Min. Value | Max. Value |
|---|---|---|---|
| meanfreq | mean frequency (in kHz) | 0.039363 | 0.251124 |
| sd | standard deviation of frequency | 0.018363 | 0.115273 |
| median | median frequency (in kHz) | 0.010975 | 0.261224 |
| Q25 | first quantile (in kHz) | 0.000229 | 0.247347 |
| Q75 | third quantile (in kHz) | 0.042946 | 0.273469 |
| IQR | interquartile range (in kHz) | 0.014558 | 0.252225 |
| skew | skewness | 0.141735 | 34.725453 |
| kurt | kurtosis | 2.068455 | 1309.612887 |
| sp.ent | spectral entropy | 0.738651 | 0.981997 |
| sfm | spectral flatness | 0.036876 | 0.842936 |
| mode | mode frequency | 0.000000 | 0.28 |
| centroid | frequency centroid | 0.039363 | 0.251124 |
| meanfun | average fundamental frequency measured across the acoustic signal | 0.055565 | 0.237636 |
| minfun | minimum fundamental frequency measured across the acoustic signal | 0.009775 | 0.204082 |
| maxfun | maximum fundamental frequency measured across the acoustic signal | 0.103093 | 0.279114 |
| meandom | average dominant frequency measured across the acoustic signal | 0.007812 | 2.957682 |
| mindom | minimum dominant frequency measured across the acoustic signal | 0.004883 | 0.458984 |
| maxdom | maximum dominant frequency measured across the acoustic signal | 0.007812 | 21.867188 |
| dfrange | range of dominant frequency measured across the acoustic signal | 0 | 21.843750 |
| modindx | modulation index | 0 | 0.932374 |
Table 3. Optimized hyperparameters and search space of the machine learning models.

| Algorithm | Parameter | Value |
|---|---|---|
| SVM | C (BoxConstraint) | 0.1–1000 |
| | gamma (KernelScale) | 0.001–10 |
| | Kernel Function | ‘linear’, ‘rbf’, ‘polynomial’ |
| RF | Number of Trees | 50–500 |
| | Num Predictors to Sample | 2–20 |
| | Min Leaf Size | 1–20 |
| KNN | k (Number of Neighbors) | 3–21 |
| | Distance | ‘euclidean’, ‘cityblock’, ‘cosine’, ‘correlation’ |
| | Distance Weight | ‘equal’, ‘inverse’, ‘squaredinverse’ |
| ANN | Number of Hidden Layers | 1–4 |
| | Number of Neurons | 5–20 |
| | Activation Function | ‘tansig’, ‘logsig’, ‘sigmoid’, ‘purelin’ |
Table 4. Parameters of the optimization algorithms.

| Algorithm | Parameter | Value |
|---|---|---|
| ABC | limit | 20 |
| PSO | w | 0.7 |
| | c1 | 1.5 |
| | c2 | 1.5 |
| GWO | a | 2→0 (linear) |
| | A | [−a, a] |
| | C | [0, 2] |
| AFSA | visual field | 0.5 |
| | step size | 0.3 |
Table 5. Results obtained from the random search algorithm.

| Algorithm | Normalization | Accuracy (%) | F1-Score | Time (min) |
|---|---|---|---|---|
| SVM | Raw Data | 88.21 ± 2.00 | 0.8819 ± 0.0202 | 2.8 |
| SVM | Z-Score | 98.22 ± 0.85 | 0.9823 ± 0.0083 | 1.8 |
| SVM | Min–Max | 98.38 ± 0.67 | 0.9838 ± 0.0067 | 1.3 |
| RF | Raw Data | 98.01 ± 0.59 | 0.9801 ± 0.0058 | 10.9 |
| RF | Z-Score | 97.98 ± 0.74 | 0.9799 ± 0.0073 | 10.5 |
| RF | Min–Max | 97.99 ± 0.77 | 0.9799 ± 0.0077 | 10.5 |
| KNN | Raw Data | 80.29 ± 2.89 | 0.8002 ± 0.0309 | 0.2 |
| KNN | Z-Score | 97.83 ± 0.69 | 0.9782 ± 0.0070 | 0.2 |
| KNN | Min–Max | 97.88 ± 0.75 | 0.9787 ± 0.0076 | 0.2 |
| ANN | Raw Data | 94.56 ± 2.25 | 0.9447 ± 0.0233 | 27.8 |
| ANN | Z-Score | 97.71 ± 0.93 | 0.9771 ± 0.0093 | 1.9 |
| ANN | Min–Max | 97.56 ± 0.79 | 0.9756 ± 0.0079 | 16.1 |
Table 6. Results obtained from the TPE algorithm.

| Algorithm | Normalization | Accuracy (%) | F1-Score | Time (min) |
|---|---|---|---|---|
| SVM | Raw Data | 89.88 ± 1.90 | 0.8941 ± 0.0209 | 2.5 |
| SVM | Z-Score | 98.37 ± 0.54 | 0.9837 ± 0.0054 | 0.9 |
| SVM | Min–Max | 98.34 ± 0.51 | 0.9835 ± 0.0051 | 1.0 |
| RF | Raw Data | 97.99 ± 0.59 | 0.9800 ± 0.0059 | 18.3 |
| RF | Z-Score | 97.93 ± 0.83 | 0.9794 ± 0.0083 | 12.5 |
| RF | Min–Max | 97.98 ± 0.75 | 0.9798 ± 0.0075 | 8.9 |
| KNN | Raw Data | 80.84 ± 2.26 | 0.8061 ± 0.0233 | 0.2 |
| KNN | Z-Score | 98.10 ± 0.80 | 0.9810 ± 0.0081 | 0.1 |
| KNN | Min–Max | 98.34 ± 0.76 | 0.9834 ± 0.0076 | 0.2 |
| ANN | Raw Data | 94.60 ± 2.07 | 0.9454 ± 0.0212 | 15.8 |
| ANN | Z-Score | 97.82 ± 0.68 | 0.9782 ± 0.0068 | 1.3 |
| ANN | Min–Max | 97.50 ± 0.82 | 0.9749 ± 0.0083 | 16.9 |
Table 7. Results obtained from the raw data set.

| Algorithm | Optimization | Accuracy (%) | Precision | Recall | F1-Score | Time (min) |
|---|---|---|---|---|---|---|
| SVM | ABC | 92.51 ± 1.35 [89.25, 95.11] | 0.9339 ± 0.0164 | 0.9154 ± 0.0252 | 0.9243 ± 0.0142 [0.8858, 0.9515] | 61.73 |
| SVM | PSO | 92.42 ± 1.51 [89.90, 95.10] | 0.9341 ± 0.0196 | 0.9132 ± 0.0226 | 0.9233 ± 0.0154 [0.8963, 0.9511] | 66.07 |
| SVM | GWO | 92.38 ± 1.25 [88.93, 95.44] | 0.9343 ± 0.0196 | 0.9122 ± 0.0220 | 0.9229 ± 0.0128 [0.8903, 0.9533] | 60.50 |
| SVM | AFSA | 92.55 ± 1.57 [88.56, 96.09] | 0.9347 ± 0.0208 | 0.9154 ± 0.0228 | 0.9247 ± 0.0158 [0.8896, 0.9613] | 128.73 |
| RF | ABC | 93.12 ± 1.28 [90.45, 95.78] | 0.9385 ± 0.0158 | 0.9231 ± 0.0218 | 0.9307 ± 0.0135 [0.9028, 0.9587] | 87.45 |
| RF | PSO | 93.18 ± 1.42 [89.87, 96.12] | 0.9391 ± 0.0172 | 0.9238 ± 0.0245 | 0.9313 ± 0.0148 [0.8971, 0.9621] | 92.18 |
| RF | GWO | 93.05 ± 1.35 [90.12, 95.89] | 0.9378 ± 0.0181 | 0.9225 ± 0.0233 | 0.9300 ± 0.0142 [0.8995, 0.9598] | 85.67 |
| RF | AFSA | 93.21 ± 1.51 [89.78, 96.34] | 0.9394 ± 0.0189 | 0.9241 ± 0.0251 | 0.9316 ± 0.0155 [0.8961, 0.9643] | 156.22 |
| KNN | ABC | 81.04 ± 1.90 [75.24, 85.62] | 0.8157 ± 0.0241 | 0.8029 ± 0.0283 | 0.8089 ± 0.0194 [0.7548, 0.8581] | 1.77 |
| KNN | PSO | 81.21 ± 2.10 [77.20, 86.64] | 0.8181 ± 0.0252 | 0.8037 ± 0.0324 | 0.8104 ± 0.0218 [0.7692, 0.8698] | 1.35 |
| KNN | GWO | 81.07 ± 2.28 [76.22, 85.67] | 0.8175 ± 0.0287 | 0.8012 ± 0.0309 | 0.8088 ± 0.0230 [0.7607, 0.8523] | 1.33 |
| KNN | AFSA | 80.96 ± 2.01 [75.90, 87.62] | 0.8176 ± 0.0275 | 0.7987 ± 0.0308 | 0.8075 ± 0.0204 [0.7566, 0.8766] | 2.67 |
| ANN | ABC | 88.34 ± 1.85 [84.52, 92.45] | 0.8912 ± 0.0221 | 0.8745 ± 0.0298 | 0.8827 ± 0.0189 [0.8421, 0.9261] | 45.28 |
| ANN | PSO | 88.51 ± 1.78 [85.10, 92.67] | 0.8925 ± 0.0208 | 0.8762 ± 0.0285 | 0.8842 ± 0.0181 [0.8485, 0.9284] | 42.15 |
| ANN | GWO | 88.28 ± 1.92 [84.21, 91.98] | 0.8908 ± 0.0234 | 0.8738 ± 0.0301 | 0.8822 ± 0.0195 [0.8391, 0.9215] | 39.87 |
| ANN | AFSA | 88.45 ± 2.01 [84.67, 92.78] | 0.8918 ± 0.0245 | 0.8751 ± 0.0315 | 0.8833 ± 0.0203 [0.8437, 0.9295] | 78.56 |
Table 8. Results obtained from the z-score normalized data.

| Algorithm | Optimization | Accuracy (%) | Precision | Recall | F1-Score | Time (min) |
|---|---|---|---|---|---|---|
| SVM | ABC | 98.46 ± 0.75 [96.73, 100.0] | 0.9827 ± 0.0105 | 0.9867 ± 0.0096 | 0.9847 ± 0.0075 [0.9671, 1.0000] | 10.05 |
| SVM | PSO | 98.47 ± 0.57 [97.07, 99.35] | 0.9816 ± 0.0093 | 0.9880 ± 0.0077 | 0.9847 ± 0.0056 [0.9707, 0.9935] | 8.29 |
| SVM | GWO | 98.49 ± 0.63 [96.74, 99.67] | 0.9841 ± 0.0084 | 0.9859 ± 0.0103 | 0.9849 ± 0.0064 [0.9671, 0.9967] | 7.48 |
| SVM | AFSA | 98.44 ± 0.81 [96.73, 99.67] | 0.9836 ± 0.0106 | 0.9853 ± 0.0100 | 0.9844 ± 0.0081 [0.9673, 0.9967] | 19.57 |
| RF | ABC | 98.58 ± 0.68 [96.74, 100.0] | 0.9849 ± 0.0098 | 0.9870 ± 0.0091 | 0.9859 ± 0.0068 [0.9677, 1.0000] | 14.73 |
| RF | PSO | 98.62 ± 0.61 [97.39, 99.67] | 0.9853 ± 0.0089 | 0.9873 ± 0.0084 | 0.9863 ± 0.0061 [0.9744, 0.9967] | 13.52 |
| RF | GWO | 98.55 ± 0.71 [97.07, 100.0] | 0.9846 ± 0.0102 | 0.9867 ± 0.0095 | 0.9856 ± 0.0071 [0.9711, 1.0000] | 12.85 |
| RF | AFSA | 98.59 ± 0.77 [96.41, 99.67] | 0.9851 ± 0.0108 | 0.9869 ± 0.0102 | 0.9860 ± 0.0077 [0.9642, 0.9967] | 28.34 |
| KNN | ABC | 98.02 ± 0.54 [96.73, 99.02] | 0.9815 ± 0.0094 | 0.9789 ± 0.0085 | 0.9802 ± 0.0054 [0.9675, 0.9902] | 1.51 |
| KNN | PSO | 98.03 ± 0.75 [96.42, 99.67] | 0.9814 ± 0.0102 | 0.9793 ± 0.0104 | 0.9803 ± 0.0075 [0.9644, 0.9967] | 1.30 |
| KNN | GWO | 98.07 ± 0.68 [96.74, 99.67] | 0.9818 ± 0.0088 | 0.9797 ± 0.0101 | 0.9807 ± 0.0068 [0.9671, 0.9967] | 1.31 |
| KNN | AFSA | 98.09 ± 0.73 [96.09, 99.35] | 0.9824 ± 0.0091 | 0.9794 ± 0.0118 | 0.9809 ± 0.0074 [0.9605, 0.9935] | 2.59 |
| ANN | ABC | 98.38 ± 0.82 [96.42, 99.67] | 0.9831 ± 0.0112 | 0.9847 ± 0.0108 | 0.9839 ± 0.0082 [0.9644, 0.9967] | 8.94 |
| ANN | PSO | 98.41 ± 0.71 [96.74, 99.35] | 0.9835 ± 0.0103 | 0.9850 ± 0.0095 | 0.9842 ± 0.0071 [0.9677, 0.9935] | 7.85 |
| ANN | GWO | 98.35 ± 0.78 [96.41, 99.67] | 0.9828 ± 0.0108 | 0.9843 ± 0.0102 | 0.9836 ± 0.0078 [0.9642, 0.9967] | 7.22 |
| ANN | AFSA | 98.39 ± 0.85 [96.09, 99.35] | 0.9832 ± 0.0115 | 0.9846 ± 0.0110 | 0.9839 ± 0.0085 [0.9605, 0.9935] | 15.68 |
Table 9. Results obtained from the min–max normalized data.

| Algorithm | Optimization | Accuracy (%) | Precision | Recall | F1-Score | Time (min) |
|---|---|---|---|---|---|---|
| SVM | ABC | 98.46 ± 0.72 [97.07, 100.0] | 0.9823 ± 0.0115 | 0.9872 ± 0.0098 | 0.9847 ± 0.0072 [0.9711, 1.0000] | 9.19 |
| SVM | PSO | 98.51 ± 0.64 [96.74, 99.67] | 0.9831 ± 0.0093 | 0.9872 ± 0.0084 | 0.9851 ± 0.0063 [0.9677, 0.9967] | 8.03 |
| SVM | GWO | 98.44 ± 0.56 [96.74, 100.0] | 0.9824 ± 0.0091 | 0.9866 ± 0.0090 | 0.9845 ± 0.0056 [0.9677, 1.0000] | 6.92 |
| SVM | AFSA | 98.45 ± 0.73 [96.41, 100.0] | 0.9832 ± 0.0096 | 0.9859 ± 0.0097 | 0.9845 ± 0.0073 [0.9642, 1.0000] | 17.75 |
| RF | ABC | 98.65 ± 0.65 [97.07, 100.0] | 0.9856 ± 0.0095 | 0.9876 ± 0.0089 | 0.9866 ± 0.0065 [0.9711, 1.0000] | 13.88 |
| RF | PSO | 98.68 ± 0.59 [97.39, 99.67] | 0.9861 ± 0.0087 | 0.9878 ± 0.0081 | 0.9869 ± 0.0059 [0.9744, 0.9967] | 12.74 |
| RF | GWO | 98.62 ± 0.63 [97.07, 100.0] | 0.9854 ± 0.0092 | 0.9873 ± 0.0086 | 0.9863 ± 0.0063 [0.9711, 1.0000] | 11.96 |
| RF | AFSA | 98.66 ± 0.70 [96.74, 100.0] | 0.9858 ± 0.0099 | 0.9875 ± 0.0094 | 0.9867 ± 0.0070 [0.9677, 1.0000] | 26.45 |
| KNN | ABC | 98.34 ± 0.62 [97.07, 99.67] | 0.9841 ± 0.0098 | 0.9829 ± 0.0099 | 0.9834 ± 0.0062 [0.9705, 0.9967] | 1.41 |
| KNN | PSO | 98.35 ± 0.79 [96.41, 100.0] | 0.9848 ± 0.0108 | 0.9823 ± 0.0112 | 0.9835 ± 0.0079 [0.9642, 1.0000] | 1.31 |
| KNN | GWO | 98.35 ± 0.70 [96.42, 99.67] | 0.9851 ± 0.0081 | 0.9819 ± 0.0105 | 0.9835 ± 0.0070 [0.9642, 0.9967] | 1.29 |
| KNN | AFSA | 98.36 ± 0.66 [96.42, 99.35] | 0.9845 ± 0.0092 | 0.9828 ± 0.0093 | 0.9836 ± 0.0066 [0.9644, 0.9935] | 2.74 |
| ANN | ABC | 98.42 ± 0.76 [96.74, 99.67] | 0.9837 ± 0.0105 | 0.9850 ± 0.0101 | 0.9843 ± 0.0076 [0.9677, 0.9967] | 8.52 |
| ANN | PSO | 98.46 ± 0.68 [97.07, 99.67] | 0.9842 ± 0.0097 | 0.9853 ± 0.0093 | 0.9847 ± 0.0068 [0.9711, 0.9967] | 7.48 |
| ANN | GWO | 98.40 ± 0.72 [96.74, 100.0] | 0.9835 ± 0.0101 | 0.9847 ± 0.0098 | 0.9841 ± 0.0072 [0.9677, 1.0000] | 6.87 |
| ANN | AFSA | 98.43 ± 0.80 [96.41, 99.35] | 0.9839 ± 0.0108 | 0.9849 ± 0.0105 | 0.9844 ± 0.0080 [0.9642, 0.9935] | 14.92 |
Table 10. General performance statistics of the classification algorithms.

| Algorithm | Average Accuracy (%) | Median Accuracy (%) | Standard Deviation | Lowest Accuracy (%) | Highest Accuracy (%) |
|---|---|---|---|---|---|
| SVM | 96.01 | 98.46 | 2.89 | 92.38 | 98.51 |
| RF | 96.62 | 98.59 | 2.46 | 93.05 | 98.68 |
| KNN | 91.43 | 98.07 | 7.82 | 80.96 | 98.36 |
| ANN | 94.56 | 98.40 | 4.56 | 88.28 | 98.46 |
Table 11. The effect of normalization methods on the algorithms.

| Algorithm | Raw Data Average (%) | Z-Score Average (%) | Min–Max Average (%) | Performance Increase (%) |
|---|---|---|---|---|
| SVM | 92.47 | 98.47 | 98.47 | +6.00 |
| RF | 93.14 | 98.59 | 98.65 | +5.51 |
| KNN | 81.07 | 98.05 | 98.35 | +17.28 |
| ANN | 88.40 | 98.38 | 98.43 | +10.03 |
Table 12. General performance statistics of the optimization algorithms.

| Algorithm | Average Accuracy (%) | Median Accuracy (%) | Standard Deviation | Lowest Accuracy (%) | Highest Accuracy (%) |
|---|---|---|---|---|---|
| ABC | 94.49 | 98.42 | 4.85 | 81.04 | 98.65 |
| PSO | 94.56 | 98.46 | 4.84 | 81.21 | 98.68 |
| GWO | 94.46 | 98.40 | 4.91 | 81.07 | 98.62 |
| AFSA | 94.51 | 98.43 | 4.88 | 80.96 | 98.66 |
Table 13. Paired t-test results.

| Comparison | Algorithm | t-Statistic | p-Value | Mean Diff (%) |
|---|---|---|---|---|
| Raw vs. Z-Score | SVM | −147.71 | <0.001 | +6.00 |
| Raw vs. Z-Score | RF | −142.58 | <0.001 | +5.45 |
| Raw vs. Z-Score | KNN | −310.60 | <0.001 | +16.98 |
| Raw vs. Z-Score | ANN | −186.61 | <0.001 | +9.99 |
| Raw vs. Min–Max | SVM | −142.08 | <0.001 | +6.00 |
| Raw vs. Min–Max | RF | −147.00 | <0.001 | +5.51 |
| Raw vs. Min–Max | KNN | −330.52 | <0.001 | +17.28 |
| Raw vs. Min–Max | ANN | −187.45 | <0.001 | +10.03 |
| Z-Score vs. Min–Max | SVM | 0.00 | 1.000 | −0.00 |
| Z-Score vs. Min–Max | RF | −3.54 | 0.012 | +0.07 |
| Z-Score vs. Min–Max | KNN | −17.48 | <0.001 | +0.30 |
| Z-Score vs. Min–Max | ANN | −2.55 | 0.044 | +0.05 |
Table 14. One-way ANOVA results.

| Comparison | F-Statistic | p-Value |
|---|---|---|
| ABC, PSO, GWO, AFSA (Raw Data) | 0.0004 | 0.999 |
| ABC, PSO, GWO, AFSA (Z-Score) | 0.0084 | 0.998 |
| ABC, PSO, GWO, AFSA (Min–Max) | 0.0945 | 0.962 |
| ABC, PSO, GWO, AFSA (All Data) | 1.6497 | 0.192 |
Table 15. Test accuracy and F1-score results for the noisy external test set.

| Normalization | Accuracy (%) | F1-Score |
|---|---|---|
| Raw Data | 95.8 | 0.9583 |
| Z-Score | 91.4 | 0.9113 |
| Min–Max | 50.8 | 0.6658 |