1. Introduction
The human voice is one of the most significant physiological characteristics employed in biometric identification systems, carrying rich information about an individual’s gender, age, and emotional state. Voice-based gender identification has therefore emerged as an increasingly important research field. Voice signals have distinctive characteristics, such as frequency, tone, timbre, acoustic properties, and spectral parameters, that serve as key indicators of gender. Voice-based gender identification is the process of determining a person’s gender by analyzing these voice characteristics, and it plays a critical role in numerous contemporary application areas [
1,
2,
3,
4]. Beyond gender classification, voice-related research has encompassed speaker identification enhancement, acoustic evaluation prediction for students, and millidecade spectral analysis for ocean sound identification and wind speed quantification [
5,
6,
7].
Methods employed in voice-based gender classification can be categorized into three primary approaches. The first approach encompasses traditional machine learning algorithms, including Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). Traditional machine learning methods are particularly suited to tasks with smaller datasets and simpler data structures. The second approach comprises deep learning architectures, such as Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). Deep learning models can handle more complex data such as images, audio, and time series. The third approach involves ensemble models that incorporate algorithms such as AdaBoost and Gradient Boosting. These methods leverage the collective predictive power of multiple models to enhance performance.
Ertam proposed a new Deeper LSTM model for gender classification using the Voice Gender dataset [
8]. The proposed method achieved 98.4% accuracy in gender classification. Tursunov et al. presented a new CNN model for age and gender classification from voice data in their study [
9]. The model was trained and evaluated on two datasets, Common Voice and a locally developed Korean speech corpus, yielding accuracies ranging from 73% to 97%. Jorrin-Coz et al. performed age and gender classification using CNN and LSTM [
10]. They used three different datasets for this purpose. No dataset-specific hyperparameter optimization was performed; all three datasets were tested with similar models. The model accuracies obtained ranged from 98.86% to 99.86%.
Kwasny and Hemmerling investigated different embedder architectures such as deep neural network-based x-vector and d-vector for gender classification and age estimation systems using speech signals [
11]. Furthermore, by pretraining these models on the speaker identification task using the VoxCeleb1 dataset and then fine-tuning them for age estimation and gender classification via transfer learning, the authors achieved state-of-the-art results in gender classification (99.60% accuracy) and in age estimation on the TIMIT dataset. Ioannis et al. proposed a new semi-supervised self-labeled algorithm for voice-based gender classification in their study [
12]. They used two different datasets: the Voice Gender dataset and the Deterding dataset. The proposed method achieved accuracies of 98.42% and 93.94% on these datasets, respectively.
Yücesoy focused on ensemble models to automatically identify age and gender [
13]. He used five different machine learning algorithms: SVM, E-TREE, Random Forest (RF), KNN, and Logistic Regression (LR), applied to the Mozilla Common Voice dataset. The highest accuracy rate he achieved was 97.41%. In another study, Yücesoy again performed gender identification based on voice features [
14]. Four different algorithms were used: KNN, LDA, CNN, and MLP, with 4 main features and 12 hybrid features derived from them. In gender identification experiments using the hybrid features on the Turkish subset of the Common Voice dataset, accuracy was increased by 0.3% to 1.73%.
Buket and Jingcheng introduced a new feature set based on pitch-range (PR) for age and gender classification [
15]. The proposed PR features were evaluated on the aGender database using KNN and SVM classifiers. The results obtained showed that the PR features provided the highest accuracy rates, particularly with the SVM classifier. Eman et al. proposed a stacked ensemble learning model that uses four base classifiers (KNN, SVM, SGD, and LR) for gender identification and Linear Discriminant Analysis (LDA) as a top classifier [
16]. The model was trained on the Voice Gender dataset. The results revealed that the proposed stacked model outperformed traditional machine learning models, achieving an accuracy rate of 99.64%. A summary of the reviewed literature on voice-based gender identification using the Voice Gender dataset is presented in
Table 1.
Despite the promising results achieved in recent studies, a significant gap exists in the systematic optimization of model hyperparameters. Most existing approaches rely on manual tuning, which is time-consuming. This study addresses this critical gap by systematically integrating four metaheuristic optimization algorithms (ABC, PSO, GWO, AFSA) with machine learning methods to automate hyperparameter selection and enhance both model performance and consistency in voice-based gender identification.
This simulation-based study focuses on developing a high-performance and robust model by integrating metaheuristic optimization techniques with machine learning methods for the voice-based gender identification problem. For this purpose, a comprehensive simulation study is performed on the Voice Gender dataset. Z-score and min–max normalization techniques are applied during the data preprocessing stage, followed by the creation of classification models using four machine learning algorithms—SVM, RF, KNN, and ANN—whose hyperparameters are systematically optimized by the metaheuristic algorithms ABC, PSO, GWO, and AFSA. The results are evaluated in detail using accuracy, precision, recall (sensitivity), and F1-score metrics.
2. Materials and Methods
In recent years, machine learning methods have been applied to many types of identification problems. In this study, four different machine learning models were used to identify gender from voice. This section first provides information about the dataset, then explains the normalization methods, machine learning methods, and optimization algorithms used. For clarity, a graphical abstract of the proposed scheme for identifying voice gender is outlined in
Figure 1.
The Voice Gender dataset used in this study was created by Kory Becker in 2016. It contains a total of 3168 samples, balanced across two classes: female and male. When creating the dataset, voice samples collected from four different sources were preprocessed for acoustic analysis, and 20 features were obtained for each voice sample [
19,
20]. The acoustic features and their value ranges for the dataset are given in
Table 2.
Normalization techniques address scale disparities in data. This study employs Z-score and min–max normalization as preprocessing techniques [
21]. Z-score normalization serves as a critical preprocessing component, transforming data to achieve a mean of zero and a standard deviation of one. Z-score normalization is computed using Equation (1).
Here, x′ denotes the normalized value, x the original data point, μ the mean, and σ the standard deviation.
Min–max normalization is a common normalization method that transforms the values in a dataset into a specific range, typically the [0, 1] range. Min–max normalization is computed using Equation (2).
Here, x′ represents the normalized value, x represents the original data, xmin represents the minimum value, and xmax represents the maximum value.
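As a concrete illustration of Equations (1) and (2), the two normalization schemes can be sketched in plain Python (a minimal sketch; the function names are illustrative, and a library such as scikit-learn would typically be used in practice):

```python
def z_score(values):
    # Equation (1): x' = (x - mu) / sigma, giving mean 0 and std 1
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    return [(x - mu) / sigma for x in values]

def min_max(values):
    # Equation (2): x' = (x - x_min) / (x_max - x_min), mapping into [0, 1]
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]
```

Note that min–max normalization depends directly on the observed extremes, which is why it is sensitive to outliers, a point that becomes important in the noise-robustness experiments reported later.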
This study employs four fundamental machine learning algorithms commonly used for classification: SVM, RF, KNN, and ANN [
22,
23]. SVM, developed by Vapnik and Cortes [
24], is applicable to both regression and classification tasks. SVMs used in classification problems are based on the principle of finding an optimal hyperplane that maximizes the margin between classes. SVMs can effectively solve nonlinear classification problems thanks to their kernel functions [
SVMs aim to find the optimal hyperplane that best separates the points in the dataset.
The kernel method increases the dimensionality of the raw data, transforming nonlinear problems into linearly separable ones in a higher-dimensional space. Denoted as K(x, y) for n-dimensional input vectors x and y, the kernel function corresponds to a mapping function f that projects inputs from n to m dimensions [
26]. The mathematical formulation of support vectors is expressed in Equation (3).
The objective is to determine parameters w and b such that the hyperplane separates the classes while maximizing the margin 1/‖w‖₂. Here, w denotes the weight vector defining the hyperplane, b the bias term, xi the training samples, and yi the class labels (±1). RF was developed by Breiman [
27]. RF is a method that uses an ensemble learning approach in classification problems and achieves high accuracy rates by combining the predictions of multiple decision trees [
28]. When making predictions for classification, an observation passes through each decision tree in the forest, and the final prediction from RF is made based on majority voting of all decision tree results. The fundamental RF formulation is presented in Equation (4).
Here, Ŷ represents the final predicted class, hi(x) represents the prediction made by the i-th decision tree for the x input vector, n represents the total number of trees in the forest, and mode represents the most frequently occurring value, the majority vote.
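The majority-vote rule in Equation (4) amounts to taking the mode of the individual tree predictions; assuming the per-tree predictions are already available, it can be sketched as:

```python
from collections import Counter

def rf_predict(tree_predictions):
    # Equation (4): Y_hat = mode(h_1(x), ..., h_n(x))
    # tree_predictions: list of class labels, one per decision tree
    return Counter(tree_predictions).most_common(1)[0][0]
```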
KNN is a widely employed nonparametric classification algorithm, valued for its implementation simplicity. KNN operates on the principle of proximity principle and assumes that similar data points tend to be close to each other in space. Predictions for new data samples are made based on their proximity to the nearest neighbors in the training set [
29]. KNN performance depends critically on the distance metric and K value [
30]. The algorithm’s foundation rests on distance computation and class assignment. Equations (5) and (6) present the KNN formulation using the Euclidean distance metric.
In Equation (5), d(x, xi) denotes the Euclidean distance between the test point x and the training point xi, p represents the feature dimensionality, xj represents the j-th feature of the test point, and xij represents the j-th feature of the i-th training point. In Equation (6), Ŷ represents the predicted class, c represents the class label, Nk(x) represents the set of k nearest neighbors of the test point x, yi represents the class label of the i-th neighbor, and I represents the indicator function (1 if the condition is true, 0 otherwise).
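Equations (5) and (6) can be sketched together as a minimal plain-Python KNN classifier (illustrative only; in practice a library implementation such as scikit-learn's `KNeighborsClassifier` would be used):

```python
from collections import Counter

def euclidean(x, xi):
    # Equation (5): d(x, xi) = sqrt(sum_j (x_j - x_ij)^2)
    return sum((a - b) ** 2 for a, b in zip(x, xi)) ** 0.5

def knn_predict(x, train_X, train_y, k=3):
    # Equation (6): majority class among the k nearest neighbors of x
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

Because every term inside the distance sum is squared raw feature differences, a feature spanning [0, 1000] dominates one spanning [0, 1], which is exactly why KNN benefits so strongly from normalization in the results reported later.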
ANN is a machine learning method developed based on biological nerve cells and the human brain [
31]. It demonstrates high success in solving complex classification problems. ANNs perform well in identifying patterns or trends in data. Therefore, they are frequently used in classification problems [
32]. ANNs comprise an input layer, one or more hidden layers, and an output layer. Equation (7) presents the fundamental neuron formulation.
Here, xi denotes the input values, wi the corresponding weights, n the number of inputs, and b the bias term. This weighted sum is subsequently passed through an activation function, as shown in Equation (8).
Here, f denotes the activation function and y the neuron output, which is propagated to the subsequent layer.
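Equations (7) and (8) describe a single neuron; a minimal sketch follows, using a sigmoid activation (the choice of activation function is an assumption for illustration, as the text does not specify one):

```python
import math

def neuron(x, w, b):
    # Equation (7): weighted sum z = sum_i w_i * x_i + b
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Equation (8): y = f(z); here f is the sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))
```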
Random search is a method that was systematically studied and popularized by Bergstra and Bengio in 2012 for hyperparameter optimization. Random search evaluates each combination by randomly sampling from the hyperparameter space and selects the parameters that yield the best results. A significant advantage over grid search is that it explores the hyperparameter space more efficiently. Especially in cases where some hyperparameters have a much greater impact on model performance than others, random search can experiment with these important parameters more extensively. It can deliver much faster and more effective results than grid search in high-dimensional hyperparameter spaces. It is widely used in hyperparameter optimization for machine learning models, especially when computational resources are limited and a large number of hyperparameters need to be adjusted [
33,
34,
35].
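The random search procedure described above can be sketched as follows (the search space bounds and the objective function are toy placeholders, not the actual search spaces used in this study):

```python
import random

def random_search(objective, space, n_trials=50, seed=42):
    # Sample hyperparameter combinations uniformly at random; keep the best.
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective with a peak at C = 10, gamma = 0.5 (illustrative only).
space = {"C": (0.1, 100.0), "gamma": (0.001, 1.0)}
best, score = random_search(
    lambda p: -((p["C"] - 10) ** 2) - ((p["gamma"] - 0.5) ** 2), space
)
```

In a real pipeline, `objective` would train a model with the sampled hyperparameters and return its cross-validation accuracy.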
The Tree-structured Parzen Estimator (TPE) is a sophisticated Bayesian optimization algorithm developed by Bergstra et al. in 2011 [
34]. Unlike traditional Bayesian optimization, TPE models the p(x|y) distribution. That is, it separately estimates the probability of hyperparameters yielding good and bad results. The algorithm divides previously tested hyperparameters into two groups: those that yield good results and those that yield poor results. Using these two distributions, it selects the most promising hyperparameters for the next trial. It balances exploration and exploitation using the Expected Improvement (EI) criterion. TPE’s tree-like structure allows it to effectively handle categorical and conditional hyperparameters. It can find optimal or near-optimal hyperparameters with significantly fewer trials compared to Random Search and Grid Search. It is particularly preferred in hyperparameter optimization for deep learning models where costly evaluations are involved and in auto machine learning applications [
36,
37].
The metaheuristic algorithms employed—ABC, PSO, GWO, and AFSA—are bioinspired optimization methods. ABC, developed by Karaboğa in 2005 [
38], is a swarm intelligence optimization algorithm inspired by honeybee foraging behavior. The algorithm simulates three bee roles—employed, onlooker, and scout—to explore the search space for optimal solutions. Its simplicity, minimal parameter requirements, and robust global optimization performance have led to widespread adoption in machine learning and engineering applications [
39].
PSO, developed by Kennedy and Eberhart in 1995, is a swarm intelligence optimization technique inspired by the collective behavior of bird flocks and fish schools [
40]. Each particle represents a candidate solution, updating its trajectory based on personal best and global best experiences. PSO’s popularity stems from its fast convergence, straightforward mathematical formulation, and inherent parallelizability [
41].
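The PSO update rule (inertia plus personal-best and global-best attraction) can be sketched on a toy one-dimensional objective; the coefficient values follow those reported later in the paper (w = 0.7, c1 = c2 = 1.5), while the objective itself is a placeholder:

```python
import random

def pso_minimize(f, lo, hi, n_particles=30, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, seed=1):
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n_particles)]  # positions
    vs = [0.0] * n_particles                                # velocities
    pbest = xs[:]                                           # personal bests
    gbest = min(xs, key=f)                                  # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            # Velocity update: inertia + cognitive pull + social pull
            vs[i] = (w * vs[i]
                     + c1 * rng.random() * (pbest[i] - xs[i])
                     + c2 * rng.random() * (gbest - xs[i]))
            xs[i] = min(max(xs[i] + vs[i], lo), hi)         # clamp to bounds
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
        gbest = min(pbest, key=f)
    return gbest
```

In hyperparameter optimization, each particle position encodes a hyperparameter vector and `f` returns the (negated) cross-validation accuracy.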
GWO, proposed by Mirjalili et al. in 2014, is a metaheuristic algorithm inspired by gray wolf hunting behavior and social hierarchy [
42]. The algorithm models a four-tier leadership structure (alpha, beta, delta, omega) and hunting phases (encircling, hunting, attacking prey) to navigate the search space toward optimal solutions. GWO’s local optima avoidance capability has driven its widespread adoption across engineering domains.
AFSA, developed by Li et al. in 2002, is a swarm intelligence algorithm inspired by fish schooling behaviors, including foraging, swarming, and following [
43]. Artificial fish navigate the search space as autonomous agents, exhibiting prey-searching, swarming, following, and random behaviors to locate global optima while avoiding local traps. AFSA’s strengths include parallel search capabilities, local optima avoidance, and adaptability to complex optimization landscapes.
The t-test is a parametric hypothesis test assessing the statistical significance of mean differences between groups. Three variants exist: independent-sample, paired-sample, and one-sample t-tests. In machine learning research, it facilitates model performance comparisons, feature set effect evaluation, and algorithm effectiveness analysis across datasets. The test distinguishes chance variation from meaningful differences through p-value computation [44].
ANOVA is a statistical test used to compare the means of three or more groups simultaneously. ANOVA calculates the F-statistic by dividing the between-group variance by the within-group variance and determines whether there is a significant difference between groups based on this value. The test is divided into types such as one-way ANOVA, two-way ANOVA, and repeated-measures ANOVA. Unlike the t-test, which can only compare two groups, ANOVA analyzes multiple groups simultaneously, keeping the false-positive error under control. In machine learning research, it is used to evaluate the effect of different hyperparameter combinations, compare the performance of multiple algorithms, or analyze the results of different feature selection methods [45].
This study employs four performance metrics: accuracy, precision, recall, and F1-score. Accuracy measures the proportion of correct classifications, indicating overall performance. Precision quantifies the ratio of true positives among predicted positives, reflecting false-positive control. Recall measures the proportion of actual positives correctly identified, indicating the model’s sensitivity. The F1-score, defined as the harmonic mean of precision and recall, balances these complementary metrics. This composite measure is particularly valuable for assessing model performance on imbalanced datasets. Equations (9)–(12) present the formulations for accuracy, precision, recall, and F1-score, respectively [
46,
47].
Here, TP represents true positive, FP represents false positive, TN represents true negative, and FN represents false negative.
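Equations (9)–(12) can be written directly in terms of the confusion-matrix counts:

```python
def metrics(tp, fp, tn, fn):
    # Equations (9)-(12): accuracy, precision, recall, F1-score
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```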
3. Results
This study evaluated 48 model combinations comprising four machine learning algorithms (SVM, RF, KNN, ANN) and four metaheuristic optimizers (ABC, PSO, GWO, AFSA) for voice-based gender identification. The Voice Gender dataset developed by Becker was used, consisting of 3168 samples and 20 features with a balanced distribution of female and male classes. Experiments were conducted using three different data preprocessing methods: raw data, z-score normalization, and min–max normalization. K-fold cross-validation is a method used to reliably evaluate the accuracy of a model: the data are divided into k folds, training and validation are performed for each fold, and the results from all folds are combined to evaluate overall performance. Mean aggregation refers to averaging the results of multiple runs. This study implemented 5 × 10-fold cross-validation with mean aggregation to enhance generalizability, mitigate overfitting, and improve evaluation robustness [
48].
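The 5 × 10-fold protocol repeats a 10-fold split five times and averages the fold scores; the index bookkeeping can be sketched as follows (the shuffling scheme and the `evaluate` callback are illustrative placeholders):

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    # Shuffle sample indices, then deal them into k roughly equal folds.
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(evaluate, n_samples, k=10, repeats=5):
    # evaluate(train_idx, test_idx) -> score for one fold
    scores = []
    for r in range(repeats):
        for test_idx in kfold_indices(n_samples, k, seed=r):
            test_set = set(test_idx)
            train_idx = [i for i in range(n_samples) if i not in test_set]
            scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)  # mean aggregation over k * repeats folds
```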
The study was conducted on a computer with an i7-7700 CPU (3.6 GHz), 16 GB RAM, an AMD Radeon R7 450 4 GB graphics card, and Windows 10 Pro (22H2). In the metaheuristic optimization studies, the population size was set to 30 and the number of iterations to 50. Maximum Function Evaluations (MFE) is a metric, common in optimization, indicating how many objective function evaluations an algorithm performs while searching the solution space. MFE is calculated by multiplying the population size by the maximum iteration count; the MFE value in this study was therefore 30 × 50 = 1500.
Table 3 presents the hyperparameter search spaces for each algorithm.
In the ABC algorithm, the population size is set to 30 and the maximum number of iterations is set to 50. The limit value is set to 20. Increasing the limit value in the ABC algorithm results in a greater exploitation search and slower convergence; decreasing the limit value results in more exploration and faster diversification.
For PSO, population size and iterations were set to 30 and 50, respectively. The inertia weight (w) was set to 0.7 to balance exploration (higher w) and exploitation (lower w). Cognitive and social coefficients (c1, c2) were both set to 1.5, balancing individual experience reliance and swarm influence to achieve effective exploration–exploitation trade-off. For GWO, population size and iterations were 30 and 50, respectively. The convergence parameter (a) decreases linearly from 2 to 0 across iterations, governing the search agent coefficient A ∈ [−a, a]. The coefficient C ∈ [0, 2] provides random emphasis on prey position. The condition |A| < 1 indicates exploitation phase, while |A| ≥ 1 denotes exploration, with C preventing local optima entrapment.
In the AFSA, the population size was selected as 30 and the maximum number of iterations as 50. The visual field value, which represents the fish’s vision radius or search area, was set to 0.5, and the step size, which represents the movement distance, was set to 0.3. As the visual field value increases, exploration increases, while as it decreases, exploitation increases. Therefore, a visual field value of 0.5 was chosen to provide a balanced search. As step size increases, movement speed increases, while a decrease provides slow and precise movement.
Table 4 summarizes the optimization algorithm parameters.
Table 5 presents Random Search results using 5 × 10-fold cross-validation.
The best model obtained in the random search algorithm was achieved with the SVM and min–max data combination. The parameters for this superior model are BoxConstraint (C) = 22.74, KernelScale (gamma) = 0.36, and kernel function rbf.
Table 6 presents TPE results using 5 × 10-fold cross-validation.
The best model obtained in the TPE algorithm was achieved with the SVM and z-score data combination. The parameters for this superior model are BoxConstraint (C) = 16.2819, KernelScale (gamma) = 3.4208, and kernel function rbf.
In experiments conducted on raw data, different classification algorithms and optimization methods have demonstrated varying levels of performance.
Table 7 details all results obtained with raw data.
When raw data was used, RF and ANN algorithms showed high performance at around 93%, while the KNN algorithm only achieved an accuracy rate of 81.21%. The SVM algorithm was generally successful and showed stable performance in the range of 92.38–92.55%. The highest performance in raw data was achieved with the RF+AFSA combination, with an accuracy rate of 93.21%.
Table 8 details all results obtained using z-score normalized data.
Z-score normalization yielded substantial improvements across all algorithms. The most notable improvement was seen in the KNN algorithm, which reached 98.09% compared to the raw data, recording an increase of approximately 17%. The SVM algorithm also showed significant improvement, performing in the range of 98.44–98.49%. The RF and ANN algorithms showed stable performance in the range of 98.35% to 98.62%. The highest performance for Z-score normalization was achieved with the RF+PSO combination, with an accuracy rate of 98.62%.
Table 9 presents min–max normalization results.
Both SVM and RF algorithms achieved high accuracy rates in the range of 98.40–98.68% with min–max normalization. The RF+PSO combination achieved the highest accuracy rate (98.68%). The KNN algorithm showed stable performance in the range of 98.34–98.36%, while ANN yielded results in the range of 98.40–98.46%. The highest performance for min–max normalization was achieved with the RF+PSO combination, with an accuracy rate of 98.68%.
Table 10 and
Table 11 present algorithm performance statistics and normalization effects, respectively.
The KNN algorithm benefited the most from normalization. KNN, which showed an average accuracy of only 81.07% on raw data, increased to 98.05% with z-score normalization and to 98.35% with min–max normalization, recording a total performance increase of 17.28%. This is due to the distance-based nature of KNN. When calculating distances between examples, the KNN algorithm creates unbalanced weights in calculations due to features with different scales (e.g., a feature ranging from 0 to 1 versus one ranging from 0 to 1000). Normalization eliminated this problem by bringing all features to the same scale, revealing KNN’s true potential.
The ANN algorithm was also significantly affected by normalization, with the average accuracy increasing from 88.40% in the raw data to 98.38% with z-score and 98.43% with min–max normalization, representing a 10.03% increase. This substantial improvement demonstrates that ANN’s internal activation functions and weight update mechanisms benefit greatly from normalized input features. The gradient descent optimization process converges more efficiently when all features are on the same scale, allowing the network to learn more effectively and avoid issues related to vanishing or exploding gradients that can occur with unnormalized data.
The SVM algorithm also showed significant improvement with normalization, with the average accuracy increasing from 92.47% in the raw data to 98.47% with both z-score and min–max normalization, representing a 6.00% increase. Since SVM’s kernel functions and decision boundary calculations are also distance- and scale-based, normalization has facilitated the algorithm’s process of finding optimal hyperparameters. The consistency of performance across both normalization methods (98.47%) indicates that SVM benefits primarily from having features on a comparable scale rather than from any specific normalization technique.
The RF algorithm, on the other hand, showed a modest increase of 5.51% (from 93.14% to 98.65%). While this improvement is more substantial than initially expected, it remains the smallest among all algorithms. Since the tree-based decision-making structure divides features based on threshold values, feature scales have less impact on the split points of decision trees compared to distance-based algorithms. The improvement observed suggests that even though RF is relatively robust to feature scaling, normalization still provides benefits, possibly by helping the algorithm make more balanced feature selections during the tree construction process. Therefore, while RF can perform well even with raw data, normalization enhances its performance and should not be overlooked. The general performance statistics of optimization algorithms are given in
Table 12.
The optimization algorithms demonstrated remarkably similar performance levels, with all four algorithms achieving nearly identical results. The PSO algorithm showed the highest average accuracy of 94.56% with a standard deviation of 4.84, and achieved the best overall result (98.68%) with the RF algorithm under min–max normalization. The AFSA performed almost identically with an average accuracy of 94.51% and a standard deviation of 4.88, demonstrating stable performance across all combinations. The ABC algorithm achieved an average accuracy of 94.49% with a standard deviation of 4.85, proving its reliability with consistent results. The GWO algorithm showed comparable performance with an average accuracy of 94.46% and a standard deviation of 4.91, indicating that its swarm hierarchy-based search strategy is equally effective.
The minimal differences between optimization algorithms (maximum 0.10% difference in average accuracy) suggest that the choice of optimization algorithm has limited impact on final model performance. All four algorithms demonstrated similar standard deviations (4.84–4.91), indicating consistent behavior across different classification algorithms and normalization methods. This consistency contrasts sharply with the significant impact of normalization methods, which could improve performance by up to 17.28% for distance-based algorithms like KNN.
Statistical significance tests were performed using paired t-tests and one-way ANOVA with a significance threshold of α = 0.05 to validate the observed performance differences. T-test results are presented in
Table 13.
Paired t-tests comparing raw data against normalized data revealed highly significant differences (
p < 0.001) for all algorithms. The KNN algorithm showed the most dramatic improvement with t-statistics of −310.60 (z-score) and −330.52 (min–max), corresponding to mean improvements of +16.98% and +17.28%, respectively. ANN demonstrated substantial improvement (t ≈ −187, mean difference > +10%), while SVM and RF showed t-statistics of approximately −145 with mean improvements of +6.00% and +5.51%. Comparison between z-score and min–max normalization showed that min–max performs slightly better for KNN (
p < 0.001, +0.30%), RF (
p = 0.012, +0.07%), and ANN (
p = 0.044, +0.05%), while SVM showed no significant difference (
p = 1.000). ANOVA results are presented in
Table 14.
One-way ANOVA tests revealed no statistically significant differences among the four optimization algorithms (ABC, PSO, GWO, AFSA) across all preprocessing methods (F-values < 0.095, all p > 0.96). Similarly, comparison of classification algorithms (SVM, RF, KNN, ANN) showed no significant differences (F = 1.6497, p = 0.192). These findings indicate that when properly preprocessed and optimized, algorithm selection has minimal impact on final performance.
In order to test the robustness of the proposed model, salt–pepper noise (5%) was added to an external test set consisting of 250 women and 250 men randomly selected from the dataset. The test accuracy and F1-score results for this noisy dataset are presented in
Table 15. The confusion matrices obtained from the test results are shown in
Figure 2.
The raw data approach achieved near-optimal classification with only 21 misclassifications out of 500 samples, demonstrating balanced performance across both genders (12 FN for males, 9 FP for females). Z-score normalization exhibited moderate degradation with 43 total errors, maintaining reasonable specificity for males (236/250 correct) but showing increased sensitivity to female misclassification (29 FP). Most strikingly, min–max normalization produced a catastrophic confusion pattern where 241 male samples were misclassified as female, resulting in a severely imbalanced predictor that defaulted to female classification in 98% of cases. This asymmetric failure mode reveals that salt–pepper noise, when combined with min–max scaling, disproportionately distorts male voice features by compressing their discriminative characteristics into the noise-dominated normalization range.
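The salt–pepper corruption applied in this robustness test can be sketched as follows (a 5% corruption probability per feature; using each feature's minimum as the "pepper" value and maximum as the "salt" value is an assumption for illustration, since the exact corruption scheme is not specified in the text):

```python
import random

def salt_pepper(features, low, high, p=0.05, seed=0):
    # With probability p, replace a feature with its minimum ("pepper")
    # or maximum ("salt") bound, chosen with equal probability.
    rng = random.Random(seed)
    noisy = []
    for x, lo, hi in zip(features, low, high):
        if rng.random() < p:
            noisy.append(lo if rng.random() < 0.5 else hi)
        else:
            noisy.append(x)
    return noisy
```

Because such noise injects extreme values, it directly corrupts the xmin and xmax statistics that min–max normalization depends on, consistent with the failure mode observed above.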
4. Conclusions and Discussion
This study investigated the effectiveness of metaheuristic optimization algorithms for tuning the hyperparameters of machine learning methods in voice-based gender identification. Four machine learning algorithms were combined with four metaheuristic optimizers; these algorithms and optimizers were outlined in
Section 2. Moreover, a rigorous 5 × 10-fold cross-validation protocol with mean aggregation was implemented to ensure robust evaluation and minimize the risk of overfitting. Furthermore, the KNN algorithm exhibited the most substantial benefit, with accuracy surging from 80.75% on raw data to 98.37% with normalization, an improvement of approximately 17.5% that underscores the algorithm’s sensitivity to feature scaling. This pronounced enhancement stems from KNN’s distance-based decision mechanism, where unnormalized features with disparate scales create biased distance calculations. SVM demonstrated a 6.00% improvement (92.47% to 98.47%), while ANN showed a 10.03% gain (88.40% to 98.43%), confirming that gradient-based optimization and kernel functions benefit substantially from standardized input distributions. Conversely, RF exhibited the smallest improvement at 5.51% (93.14% to 98.65%), consistent with its tree-based architecture’s inherent robustness to feature scaling, though normalization still provided measurable benefits through more balanced feature selection during tree construction.
Among the metaheuristic algorithms, performance differences proved minimal and statistically insignificant. ABC achieved 94.49% average accuracy with σ = 4.85, PSO reached 94.56% (σ = 4.84), GWO attained 94.46% (σ = 4.91), and AFSA obtained 94.51% (σ = 4.88). One-way ANOVA confirmed no significant differences among optimizers (F < 0.095, p > 0.96 across all preprocessing methods), indicating that metaheuristic choice exerts a negligible impact when proper data preprocessing is applied. Paired t-tests revealed highly significant differences between raw and normalized data (p < 0.001 for all algorithms).
External validation using salt–pepper noise (5% corruption probability) on a balanced test set (250 males, 250 females) revealed critical insights into preprocessing robustness. The catastrophic failure of min–max normalization under salt–pepper noise (50.8% accuracy vs. 95.8% for raw data) is attributed to its unbounded sensitivity to outliers, where extreme noise values corrupt xmin and xmax, shifting the entire normalized distribution and invalidating training learned boundaries. This effect disproportionately impacted male voices (241/250 misclassified as female), whose narrower frequency ranges were compressed into female classification regions. Z-score normalization maintained moderate robustness (91.4%) through bounded outlier influence, while raw data surprisingly achieved optimal noise resistance via random forest’s bootstrap aggregation and majority voting mechanisms.
This research makes the following methodological contributions: systematic multi-algorithm comparison under identical conditions; rigorous statistical validation via t-tests and ANOVA; external noise robustness assessment; comprehensive evaluation of six hyperparameter optimization strategies; and a production-ready model for practical deployment in forensic science, biometric security, voice assistants, call centers, and human–computer interaction systems. Future research will investigate robust normalization techniques (e.g., median-based robust scalers) for noisy environments, deep learning architectures, cross-dataset generalization, and real-time processing optimization for edge deployment scenarios.