Multi-Corpus Benchmarking of CNN and LSTM Models for Speaker Gender and Age Profiling
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Strengths:
+ Gender and Age Classification from Voice Signals has research significance and a wide range of application scenarios, which is well explained in the introduction.
+ The paper achieved good experimental results, with accuracy of up to 99.86%.
Major Weakness:
-The main concern is the lack of innovation—this reads more like a technical report than a research paper. The manuscript should emphasize its novel contributions more clearly.
-The motivation for stacking modules (e.g., CNN+RNN in Figure 1) remains unclear. For example, the authors should explicitly justify why both architectures are necessary, rather than using CNN or RNN alone.
-The current ablation study design—limited to substituting pre-trained CNN/RNN backbones—is insufficient to validate the core innovation.
-Additional implementation details - such as the input/output dimensions of the CNN and RNN components in Figure 2 and their fusion methodology - would significantly improve reproducibility. Given the critical importance of reproducibility for the research community, we recommend either releasing the source code or providing comprehensive pseudocode to facilitate replication.
Minor Weakness:
-The phrase "Deep Learning Approaches" in the title is too broad and should specify which particular model or optimization strategy is being used.
-Technical terms should be properly cited, such as "K-Nearest Neighbors" and "Label Propagation" in Lines 93-94.
-Section 3.1 Dataset should be placed in the experimental section.
-Figure 1 is not a vector image.
-Please align Table 1 with the document's standard margin.
All in all, this is more like a good technical report, but if it is a research paper, my main concern is that it lacks innovation.
Author Response
Answer to Reviewer 1
Dear reviewer 1:
Thank you for reviewing our paper. We have revised and improved our paper, according to your requirements as follows:
Question 1. The main concern is the lack of innovation—this reads more like a technical report than a research paper. The manuscript should emphasize its novel contributions more clearly.
Response: Thank you for your observation; we agree with it. The objective of this paper is to benchmark several Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks for speaker profiling tasks based on speaker gender and age. To this end, we used three publicly available speech corpora (TIMIT, Mozilla Common Voice, and VoxCeleb1), which represent distinct recording conditions, to determine the most suitable Deep Neural Network (DNN) model for each dataset. The TIMIT dataset provides studio-quality speech data; however, the amount of speech data is limited and exhibits an imbalance across age groups. Mozilla Common Voice offers linguistic and demographic diversity, and VoxCeleb1 offers diversity in recording conditions; these latter two datasets contain enough speech data for training. In this paper, we demonstrated that if adequate preprocessing steps are applied to the speech signals in an appropriate order, suitable features are selected for each dataset, and suitable CNN or LSTM configurations are fine-tuned, state-of-the-art performance can be obtained. We believe this is the main contribution of this paper.
Question 2. The motivation for stacking modules (e.g., CNN+RNN in Figure 1) remains unclear. For example, the authors should explicitly justify why both architectures are necessary, rather than using CNN or RNN alone.
Response. Thank you for your question. We did not use a CNN+RNN configuration for age and gender classification; instead, we evaluated several CNNs and LSTMs independently to establish a benchmark for the speaker profiling task and determine the most suitable CNN and LSTM configurations. The explanation and the figure were not clear in the original version. To clarify our objective and the processes carried out, we modified Figure 1, the block diagram of the proposed framework, and in the text of the revised version we explain each process that comprises the framework.
Question 3. The current ablation study design—limited to substituting pre-trained CNN/RNN backbones—is insufficient to validate the core innovation.
Response. Thank you for your observations. To achieve our objective, which is to benchmark several CNN and LSTM configurations for the speaker profiling task using three speech corpora with distinct characteristics, we evaluated the preprocessing schemes for the speech signal, the feature extraction methods, and various CNN and LSTM configurations. We showed that by selecting an adequate method at each stage (preprocessing, feature extraction, and DNN), accuracy can reach state-of-the-art performance on three widely used speech corpora. To clarify the objective and contribution of our work, we modified almost all sections of the paper. We hope that the revised version is both straightforward and comprehensive.
Question 4. Additional implementation details - such as the input/output dimensions of the CNN and RNN components in Figure 2 and their fusion methodology - would significantly improve reproducibility. Given the critical importance of reproducibility for the research community, we recommend either releasing the source code or providing comprehensive pseudocode to facilitate replication.
Response. Thank you for your comments. We modified Figure 2 to include the input/output dimensions of the CNN and RNN components; in the revised version it becomes Figure 7. We also include pseudocode in Appendix A.
Question 5. Technical terms should be properly cited, such as "K-Nearest Neighbors" and "Label Propagation " in Line 93-94.
Response. Thank you for your observation. In the revised version of the paper, we added references, which are as follows:
KNN: Cover, T. M.; Hart, P. E. "Nearest neighbor pattern classification", IEEE Transactions on Information Theory, 13(1), pp. 21-27, 1967, doi:10.1109/TIT.1967.1053964.
Label Propagation: Raghavan, U. N.; Albert, R.; Kumara, S. "Near linear time algorithm to detect community structures in large-scale networks", Physical Review E, 76, 036106, 2007.
Question 6. Section 3.1 Dataset should be placed in the experimental section.
Response. Thank you for your recommendation. Accordingly, in the revised version the dataset analysis and configuration was moved from Section 3.1 to Section 4.1 in the Results section.
Question 7: Figure 1 is not a vector image.
Response: Thank you for your observation. In the revised version of the paper, we modified Figure 1 to enhance its quality.
Question 8. Please align Table 1 with the document's standard margin.
Response. Thank you for your comment. However, the journal format permits tables and figures to be wider than the text margin.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Comments on the manuscript “Deep Learning Approaches for Gender and Age Classification from Voice Signals” (Ref: computation-3696886)
In this work, the authors have used the voice data in neural networks to evaluate the speaker characteristics, i.e., gender and age. They have used different voice databases to check the accuracy of the model. The reviewer has the following comments.
1. The work presented in this paper is important from the point of accurate identification of some of the characteristics of the speaker. However, the presentation of the method and the results needs to be improved substantially. It may be noted that details of the methodology are not discussed. The authors have mentioned that FFT of the voice signal is carried out after using some noise filter. However, no typical example of the input signal, data processing method and output data is provided. It is suggested that the authors include some typical example to illustrate the procedure.
2. In Table-1, it has been mentioned that different sampling frequency has been used for different datasets. Please provide the reason for the choice. Moreover, it is not clear why 16 kHz is chosen for sampling frequency? Usually for FFT, the sampling frequency should be much higher so that the accuracy can be better and high frequency data can be picked. Please explain and elaborate.
3. For Mozilla Common Voice data, it is mentioned that 20.50 kHz is used for sampling frequency. However, in Page-4, the same is mentioned as 25.05 kHz. Please check for the inconsistency.
4. The current title seems to be generalized in nature and it doesn’t reflect the specific content of this work. It is suggested that the title be modified to include some specific aspects of the work presented.
5. The abstract needs to be shortened and modified. The novelty of this work is not clear from the abstract. There are many neural network models in the literature which are regularly used for voice recognition and other such tasks. Please specify what specific results you are getting from these networks that differ from previous works by others. Moreover, the statement regarding “robustness and versatility” of the approach needs to be substantiated with specific comparisons to previous such studies.
6. Section 2 needs to be shortened. The authors seem to discuss the content of just 5 papers, i.e. Refs. [8] to [12] in this section and their discussion is linear. It is suggested that the literature survey be improved to include some latest works in this area. Moreover, the details of conclusions from these works including numbers, e.g., “F1-score of 0.982947” need not be reproduced here. The idea behind the literature should be mentioned in brief and only the statement relevant to the current problems may be included. Please revise the section accordingly.
7. Section 3 needs substantial revision. It is suggested that the features of each dataset, such as age grouping, linguistic classification, accent of speech classification, number of speech data, time duration etc. should be provided in a Tabular form. A brief description for the unique features of each dataset should be provided. Moreover, it is not clear why these 3 datasets are chosen for the analysis. Is it the availability of data or some other reason? Please mention. The authors may have recorded some data themselves and used these in their method before using the dataset from literature. Please add a discussion to this effect.
8. Eq. (1). What is ‘t’ here? Is it time in seconds? Please elaborate. Moreover, it is not clear why x(t-1) is subtracted from x(t). It seems you are using the differential data for the analysis. Please explain and elaborate.
9. It is not clear what type of data is analyzed. Is it the pressure signal from the voice? Usually, the sound is measured as a pressure signal and converted to decibels. No such details are provided. What are the frequency ranges of these signals? This shall define the noise cutoff frequency and sampling frequency.
10. Please present a typical speech signal x(t) as a function of time. Show the mathematical operations that are carried out on the raw signal, including low pass filtering and sampling etc. Show the output after FFT. It is suggested that the authors elaborate methodology better.
11. Page-5: It is written that “Mozilla Common Voice dataset, with age split into six groups (teens: 10–19 years, twenties: 20–29 years, thirties: 30–39 years, forties: 40–49 years, fifties: 50–59 years, sixties: 60+ years)”. Please show a Table with number of dataset and its features for each age group.
12. Page-5: Please check the sentence “.. low-pass filtering to remove high-frequency noise that could degrade signal quality, cutting at 4000 Hz”. Usually, noise cutoff is fixed at 1 kHz in literature. You have chosen a higher value of 4 kHz. Is there any specific reason for the same? Please elaborate. Moreover, it is suggested that the authors carry out a sensitivity study by varying this cutoff frequency and report if any further error is introduced by changing this cutoff frequency.
13. Eq. (2). The silence function is a step function whose value is 1 after a threshold and zero below it. There are two aspects of this issue. The first is the threshold energy. What is its unit here. You have chosen it to be 0.05. What are the energy ranges of a typical voice signal. The selected threshold may be dependent on this range. Please elaborate.
14. It is not clear where the silence function s(t) is used. Is it multiplied with the input signal? Please explain the details and its relevance. Does changing the threshold from 0.05 affect the results? Please comment.
15. You have mentioned that FFT has been carried out for the voice signal in Fig. 1. Is it pressure vs time or other parameter. Please mention and show some sample data and its frequency content.
16. It is not clear where energy data is being used in the calculation as mentioned in context of silence function s(t)? Are you normalizing the signal with respect to total energy? It is not very clear from the description of the methodology. It is suggested to show one example, where energy calculation and normalization etc. are demonstrated.
17. Table-1: The number of Mel and MFCC coefficients are mentioned. Please provide the description of the expression where these coefficients are used. You may provide appropriate references. Moreover, it is mentioned in Page-6 that “Mel spectrograms were converted into 224×224 RGB images”. Please show one sample example, where the input raw signals and its spectrogram is shown. Please highlight the dominant or characterizing frequencies in such spectrograms, which are used further in the analysis to evaluate the coefficients.
18. Table-2: You have mentioned the freeze and retained layers and mentioned functions such as “conv 1-1 and 1-2, conv 2-1 MBconv6”. However, these functions are not described in the text. Moreover, this Table does not add much information to the discussion and hence, may be removed. The details of each convolution function, used if any, may be shifted to suitable appendices.
19. The results and discussion section needs to be improved by providing data in graphical form. Please compare the predicted results from each algorithm with respect to actual data to show the error. The overall error % is one information, however, the efficacy of the algorithm needs to be demonstrated with specific examples.
20. The training process of the neural network needs to be elaborated. Please show the variation of error for different datasets of training through appropriate graphs. It is not clear if you are using same dataset for training and prediction. Please clarify. Moreover, please elaborate on the variability of data (age, gender, linguistic classification etc.) those are used in training and prediction.
21. Please elaborate on the algorithm which takes care of accent of voice and language. What mathematical feature is used to identify such information? Please add some discussion in the context of your work and those from literature.
22. The number of data points for TIMIT as presented in Fig. 5 are very small when compared to other 2 datasets. Is it due to non-availability of data? Will it affect the accuracy of the results as training may be insufficient? Provide full form of TIMIT in the text.
23. Table-9 shows that the error of the current method is less for Mozilla Common Voice and VoxCeleb1 database, whereas it is more for TIMIT database. What is the reason for the same? Please elaborate. Does the error depend upon the type of voice used as input, i.e., it may be better for certain class of voice. Please elaborate on this on the basis of voice signal features such as frequency content, energy etc. Is there a way through which the error can be reduced for TIMIT database? Please comment.
24. Conclusion: This section needs to be modified. Currently, this is just a summary. It is suggested that few important conclusions may be included in a bulleted manner and the reason for such conclusions may be specified.
25. Please include more current literature in the references section. The details of sources for Ref. [14], [15] and [27] are not provided. Please check all the references for completeness.
26. Page-3: Please check the sentence “we propose exploring domain adaptation techniques and data augmentation strategies to enhance model robustness”. Please elaborate on these techniques and how these proposed methods shall improve the accuracy.
Author Response
Answer to Reviewer 2
Dear reviewer 2:
Thank you for reviewing our paper. We have revised and improved our paper, according to your requirements as follows:
Question 1. The work presented in this paper is important from the point of accurate identification of some of the characteristics of the speaker. However, the presentation of the method and the results need to be improved substantially. It may be noted that details of the methodology are not discussed. The authors have mentioned that FFT of the voice signal is carried out after using some noise filter. However, no typical example of the input signal, data processing method and output data is provided. It is suggested that the authors include some typical examples to illustrate the procedure.
Response. We thank you very much for your important observation; it is crucial for refining our investigation and the final version of the manuscript. In the revised version we modified Figure 1 and included a better version that shows the details of the preprocessing stage and methodology. Each block of the figure is explained in detail in Sections 3.1-3.3, and examples were added for better understanding.
Question 2. In Table-1, it has been mentioned that different sampling frequency has been used for different datasets. Please provide the reason for the choice. Moreover, it is not clear why 16 kHz is chosen for sampling frequency? Usually for FFT, the sampling frequency should be much higher so that the accuracy can be better and high frequency data can be picked. Please explain and elaborate.
Response. We really appreciate your feedback and comments; they are very important to us. Regarding the different sampling frequencies used, we added a brief explanation of our selection, based on reference [45], as follows:
Audio from the VoxCeleb1 and TIMIT corpora was processed at 16 kHz because this rate, native to both datasets, captures the 0-8 kHz speech band while reducing storage and computation cost. Common Voice recordings, originally distributed at 44.1 kHz, were down-sampled to 22.05 kHz, exactly half of the 44.1 kHz consumer-audio family, to preserve spectral content up to 11.025 kHz. Moreover, empirical studies show that 22.05 kHz yields negligible intelligibility loss relative to higher rates while halving bandwidth, making it a pragmatic choice for wide-band speech pipelines. Consequently, the combined use of 16 kHz for recognition and 22.05 kHz for synthesis offers a widely endorsed balance between perceptual fidelity and resource efficiency in modern speech-technology research [45].
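For illustration, a minimal sketch of this loading step, assuming the librosa library (not named in the manuscript) and placeholder file names:

```python
import librosa

# VoxCeleb1 / TIMIT: keep the native 16 kHz rate (captures the 0-8 kHz speech band).
x_vox, sr_vox = librosa.load("voxceleb1_utterance.wav", sr=16000)

# Common Voice: down-sample the original 44.1 kHz audio to 22.05 kHz
# (preserves spectral content up to the 11.025 kHz Nyquist frequency).
x_cv, sr_cv = librosa.load("common_voice_clip.mp3", sr=22050)
```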
Question 3. For Mozilla Common Voice data, it is mentioned that 20.50 kHz is used for sampling frequency. However, in Page-4, the same is mentioned as 25.05kHz. Please check for the inconsistency.
Response. Thank you for this observation. In the revised version, the sampling frequency for Mozilla Common Voice was corrected; the correct value is 22.05 kHz.
Question 4. The current title seems to be generalized in nature, and it doesn’t reflect the specific content of this work. It is suggested that the title be modified to include some specific aspects of the work presented.
Response: We appreciate your comments in this revision. We consider your advice; in the revised version of the manuscript, we employ a new title more related to the technology and methodology of our approach.
Question 5. The abstract needs to be shortened and modified. The novelty of this work is not clear from the abstract. There are many neural network models in the literature which are regularly used for voice recognition and other such tasks. Please specify what specific results you are getting from these networks that differ from previous works by others. Moreover, the statement regarding “robustness and versatility” of the approach needs to be substantiated with specific comparisons to previous such studies.
Response. We appreciate the reviewer’s insightful observations. This study offers a rigorous, dataset-specific evaluation of two standard architectures, Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, by training and testing each model independently on three representative speech corpora: VoxCeleb1 (noisy, in-the-wild recordings), Mozilla Common Voice (large-scale, multi-accent material) and TIMIT (a compact corpus with eight dialects and age labels). Although each model is trained and evaluated within its respective dataset, the unified experimental protocol enables a coherent comparison of model behavior under markedly different acoustic conditions, a perspective that, to our knowledge, has not been documented in previous work.
As reported in Table 10, our configurations surpass the current state of the art for gender and age classification and achieve competitive age-estimation accuracy with a lightweight implementation. These results substantiate the claimed “robustness and versatility” of the approach across diverse recording environments and speaker characteristics.
We shortened the abstract accordingly, explicitly highlighting these contributions and empirical gains.
Question 6. Section 2 needs to be shortened. The authors seem to discuss the content of just 5 papers, i.e. Refs. [8] to [12], in this section, and their discussion is linear. It is suggested that the literature survey be improved to include some latest works in this area. Moreover, the details of conclusions from these works, including numbers, e.g., “F1-score of 0.982947”, need not be reproduced here. The idea behind the literature should be mentioned in brief and only the statements relevant to the current problems may be included. Please revise the section accordingly.
Response. Thank you for your helpful comments and recommendations. According to them, we modified and restructured Section 2 considerably. In the revised version we aggregated references [14-19] and shortened the mentions of these studies from the state of the art, eliminating excess information and highlighting only the techniques used and the problems addressed.
Question 7. Section 3 needs substantial revision. It is suggested that the features of each dataset, such as age grouping, linguistic classification, accent of speech classification, number of speech data, time duration etc. should be provided in a Tabular form. A brief description for the unique features of each dataset should be provided. Moreover, it is not clear why these 3 datasets are chosen for the analysis. Is it the availability of data or some other reason? Please mention. The authors may have recorded some data themselves and used these in their method before using the dataset from literature. Please add a discussion to this effect.
Response. Thank you for your comment. Considering your requirements, we modified the revised version to better explain the characteristics of each dataset, adding Table 2 to summarize them; details are given in Section 4.1. These datasets were selected taking into account their diversity in recording conditions, speaker demographics, and linguistic coverage, enabling a comprehensive evaluation of the proposed models across varied scenarios.
- VoxCeleb1: its “in the wild” recording conditions are the motivation for using it in this approach.
- Mozilla Common Voice: the diversity of accents and the amount of available data make this dataset ideal for the tasks we conducted.
- TIMIT: its use in many studies, its availability, and its age-labeled data make this dataset a useful complement for our study.
Question 8. Eq. (1). What is ‘t’ here? Is it time in seconds? Please elaborate. Moreover, it is not clear why x(t-1) is subtracted from x(t). It seems you are using the differential data for the analysis. Please explain and elaborate.
Response: We appreciate your useful feedback and comments; we made the necessary modifications for better understanding. The variable t refers to the time step. The time interval in seconds between t-1 and t varies depending on the sampling frequency. If the sampling frequency is 16 kHz (VoxCeleb1 and TIMIT), the time interval between t-1 and t is 1/16000 second.
In the revised version we explain Equation 1 as follows.
where x(t) is the sample of the signal at time step t, x(t-1) is the sample at the previous time step, y(t) is the pre-emphasized signal, and α is the pre-emphasis coefficient, typically 0.97 [19, 20]. This filter subtracts an attenuated version of the previous sample, enhancing rapid changes in the signal such as high-frequency components. Figure 4 illustrates the pre-emphasis process. Equation (4) expresses this operation as the transfer function of a first-order high-pass filter.
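As an illustration of this step, a minimal NumPy sketch of the pre-emphasis filter described above (the function name is ours, not the manuscript's):

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass (pre-emphasis) filter: y(t) = x(t) - alpha * x(t-1)."""
    y = np.empty_like(x)
    y[0] = x[0]                      # no previous sample at t = 0
    y[1:] = x[1:] - alpha * x[:-1]   # subtract an attenuated copy of the previous sample
    return y
```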
Question 9. It is not clear what type of data is analyzed. Is it pressure signal form the voice? Usually, the sound is measured in pressure signal and converted to decibel. No such details are provided. What are the frequency ranges of these signals. This shall define the noise cutoff frequency and sampling frequency.
Response. Thank you for your question. Following your observation, we explain this point as follows. The original voice signal is a variation in air pressure produced by the vocal cords, mouth, and tongue, which is captured by a microphone, converted into a voltage, and stored as a time series. Thus, the datasets used for training and testing our system, like those of any other recognition system, consist of time series denoted as x(t). All public datasets of digitized voice signals store them as series of numbers expressed as x(t), where x(t) denotes the amplitude of the original speech signal at time step t.
Question 10. Please present a typical speech signal x(t) as a function of time. Show the mathematical operations that are carried out on the raw signal, including low pass filtering and sampling etc. Show the output after FFT. It is suggested that the authors elaborate methodology better.
Response. Thank you for your helpful suggestion. Accordingly, we added mathematical expressions for each process applied to the raw speech signal x(t) in Section 3.1, Preprocessing, and Section 3.2, Feature Extraction. In Section 3.1, we applied silence removal, pre-emphasis filtering, low-pass filtering, and energy normalization as preprocessing, in this order. Additionally, in this section, we display all processed signals derived from the raw speech signal, in both the time domain and the frequency domain. In Section 3.2, we show the spectrogram, Mel spectrogram, and MFCC as outputs of the Short-Time FFT (STFT).
Question 11. Page-5: It is written that “Mozilla Common Voice dataset, with age split into six groups (teens:10-19 years, twenties: 20-29 years, thirties: 30–39 years, forties: 40–49 years, fifties: 50–59 years, sixties: 60+ years)”. Please show a Table with number of each dataset and its features for each age group.
Response. Thank you for your comments. According to your helpful suggestion, we have added a table listing the principal characteristics of each dataset used in the paper. Additionally, we added a bar graph for easy understanding of the age distribution.
Question 12. Page-5: Please check the sentence “.. low-pass filtering to remove high-frequency noise that could degrade signal quality, cutting at 4000 Hz”. Usually, noise cutoff is fixed at 1 kHz in literature. You have chosen a higher value of 4 kHz. Is there any specific reason for the same? Please elaborate. Moreover, it is suggested that the authors carry out a sensitivity study by varying this cutoff frequency and report if any further error is introduced by changing this cutoff frequency.
Response. Thank you for your insightful comments. To remove high-frequency noise such as electronic device interference, electrical hiss, or friction noise, the cutoff frequency depends on the speech signal quality. For telephone-quality speech, the cutoff frequency is approximately 3.5 kHz. In contrast, the cutoff frequency for high-quality speech signals must be higher than 3.5 kHz, according to “Digital Processing of Speech Signals” by Rabiner (1978). In the speech signals of the datasets used, high-frequency noise such as electronic device interference is not common; therefore, the effect of low-pass filtering with different cutoff frequencies is not observed.
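A minimal sketch of the 4 kHz low-pass step, assuming a zero-phase Butterworth filter from SciPy (the filter order and implementation are our assumptions; the manuscript specifies only the cutoff frequency):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def low_pass(x: np.ndarray, sr: int, cutoff_hz: float = 4000.0, order: int = 5) -> np.ndarray:
    """Zero-phase Butterworth low-pass filter with the 4 kHz cutoff discussed above."""
    b, a = butter(order, cutoff_hz / (sr / 2.0), btype="low")  # normalized cutoff
    return filtfilt(b, a, x)
```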
Question 13. Eq. (2). The silence function is a step function whose value is 1 after a threshold and zero below it. There are two aspects of this issue. The first is the threshold energy. What is its unit here. You have chosen it to be 0.05. What are the energy ranges of a typical voice signal. The selected threshold may be dependent on this range. Please elaborate.
Response. Thank you for your insightful question. As you mentioned, the silence function is a step function whose values (0 for silence, 1 for speech) are determined by a threshold. Based on reference [19], we used the Mahalanobis-type distance between the speech sample at each time step x(t) and the mean amplitude of the speech signal within a window of N time steps. To determine the threshold value, we calculate the power spectral densities (PSDs) of the original and the silence-removed speech signals. If the silence-removed signal keeps the same quality as the original speech signal, the condition P_x(f) ≈ P_s(f) must be satisfied, where P_x(f) is the PSD of the original signal and P_s(f) is the PSD of the silence-removed signal. To obtain an adequate threshold value, we generate several silence-removed signals using different threshold values and calculate the difference between the PSD of the original signal and that of each silence-removed signal generated with a given threshold value q, as shown in Equation (3). Figure 3 of the revised version shows the threshold value versus the mean difference given by (3). We have rewritten Section 3.1.1, Silence Removal, in the revised version as follows:
3.1.1. Silence removal.
Speech signals often include silent segments, which can occur at the start, between words, or at the end of an utterance. Removing these non-speech portions helps to reduce both computational load and processing complexity. To enable more efficient analysis, these silent parts are typically discarded early in the processing pipeline. This is achieved by classifying each audio sample as either speech or silence based on the statistical characteristics of background noise.
First, the mean μ and variance σ of the voice signal x(t) are calculated by (1) and (2).
For each audio sample, if |x(t) − μ|/σ > q, the sample is classified as speech; otherwise, it is considered a silence segment [22], where q is the threshold, whose value is determined so that the power spectral densities (PSDs) of the original speech signal and the silence-removed speech signal are almost the same, i.e. P_x(f) ≈ P_s(f), where f is frequency. Figure 2 shows an example of the silence removal process, displaying the original speech signal and the silence-removed signal in the time domain. We selected the threshold value based on (3).
We determined the threshold experimentally; Figure 3 shows the results of these experiments. The optimal q is obtained when the mean PSD difference given by (3) is minimal but not zero. The experiments found that the optimal value lies between 0.05 and 0.1; thus we chose q = 0.05.
- Jahangir, R., Teh, Y. W., Nweke, H. F., Mujtaba, G., Al-Garadi, M. A., & Ali, I. (2021). Speaker identification through artificial intelligence techniques: A comprehensive review and research challenges. Expert Systems with Applications, 171, 114591.
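To make the procedure concrete, a minimal sketch of the threshold-based silence removal and the PSD comparison used to pick q (the function names, the Welch PSD estimator, and its parameters are our assumptions, not the paper's exact implementation):

```python
import numpy as np
from scipy.signal import welch

def remove_silence(x: np.ndarray, q: float) -> np.ndarray:
    """Keep samples whose normalized deviation from the mean exceeds the threshold q."""
    mu, sigma = np.mean(x), np.std(x)
    return x[np.abs(x - mu) / (sigma + 1e-12) > q]

def psd_difference(x: np.ndarray, sr: int, q: float) -> float:
    """Mean absolute difference between PSDs of the original and silence-removed signals."""
    _, p_orig = welch(x, fs=sr, nperseg=1024)
    _, p_clean = welch(remove_silence(x, q), fs=sr, nperseg=1024)
    return float(np.mean(np.abs(p_orig - p_clean)))

# Sweep candidate thresholds and keep the one with the smallest non-zero PSD
# difference; in our experiments the optimum fell between 0.05 and 0.1.
# best_q = min(np.arange(0.01, 0.20, 0.01), key=lambda q: psd_difference(x, 16000, q))
```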
Question 14. It is not clear where the silence function s(t) is used. Is it multiplied to the input signal? Please explain the details and its relevance. Is changing the threshold from 0.05 affect the results? Please comment.
Response. Thank you for your valuable question. In the original paper, the explanation of the silence removal process was not clear. In the revised version, we added a more understandable explanation. We also used a formula and a graph (Figure 3) to show how to calculate the threshold.
Question 15. You have mentioned that FFT has been carried out for the voice signal in Fig. 1. Is it pressure vs time or other parameter. Please mention and show some sample data and its frequency content.
Response: Thank you for your comment. We have modified Figure 1, which is the global diagram of the proposed system; we believe this diagram effectively illustrates the system. In Section 3.2, Feature Extraction, we have added a detailed explanation of the feature extraction process, and Figure 6 shows an example of each extracted feature: a spectrogram, a Mel spectrogram, and an MFCC computed from a speech signal of the TIMIT dataset.
Question 16. It is not clear where energy data is being used in the calculation as mentioned in context of silence function s(t)? Are you normalizing the signal with respect to total energy? It is not very clear from the description of the methodology. It is suggested to show one example, where energy calculation and normalization etc. are demonstrated.
Response. Thank you for your helpful comments. According to them, we added a detailed explanation and an example of a speech signal before and after each operation. All preprocessing operations, namely silence removal, pre-emphasis filtering, low-pass filtering, and energy normalization, applied in this order, are described in detail with examples. We use x(t) for the original signal and distinct variables for the signal after silence removal, after pre-emphasis filtering, after low-pass filtering, and, finally, for the energy-normalized signal after low-pass filtering.
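For completeness, a minimal sketch of the final energy-normalization step, assuming root-mean-square (RMS) normalization (the exact normalization used in the paper may differ):

```python
import numpy as np

def energy_normalize(x: np.ndarray) -> np.ndarray:
    """Scale the signal to unit RMS energy (one common choice, assumed here)."""
    rms = np.sqrt(np.mean(x ** 2)) + 1e-12
    return x / rms
```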
Question 17. Table-1: The number of Mel and MFCC coefficients are mentioned. Please provide the description of the expression where these coefficients are used. You may provide appropriate references. Moreover, it is mentioned in Page-6 that “Mel spectrograms were converted into 224×224 RGB images”. Please show one sample example, where the input raw signals and its spectrogram is shown. Please highlight the dominant or characterizing frequencies in such spectrograms, which are used further in the analysis to evaluate the coefficients.
Response: Thank you for your helpful comments. According to your comments, we added a brief explanation about the selection of number of coefficients of Mel spectrogram and MFCC for each of the three datasets, based on a reference [27], as follows.
The dimensionality of the Mel spectrogram was adapted to each corpus. VoxCeleb1 recordings, which contain background noise and speaker overlap, were encoded with 224 Mel coefficients [27] to preserve fine-grained spectral detail. Mozilla Common Voice, whose utterances are cleaner, required only 128 coefficients, while the comparatively noise-free TIMIT corpus was adequately represented with 64 coefficients. Table 1 summarizes the number of coefficients used for the different datasets. The Mel coefficients were used to obtain the RGB images fed into the CNN models, and the MFCCs were used for the LSTM models.
[27] Meng, H., Yan, T., Yuan, F., & Wei, H. (2019). Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE access, 7, 125868-125881.
Furthermore, we added figures of RGB images, of size 224 x 224, representing a spectrogram, a Mel spectrogram, and an MFCC in Figure 6. From the graphs, we can observe the fundamental frequency (F0) [80-300 Hz], the first formant (F1) [300-900 Hz], the second formant (F2) [900-1800 Hz], and the third formant (F3) [1800-3000 Hz]. Depending on the speaker's gender and age, larger or smaller spectral energy is observed in different ranges. In the MFCC, large values are found in the lowest coefficients for male speakers and in the highest coefficients for female and younger speakers. These dominant frequency ranges are clearly visible in the graphs.
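The following sketch illustrates how a log-Mel spectrogram can be rendered as a 224 x 224 RGB image for the CNN branch and how MFCC trajectories can be extracted for the LSTM branch. It assumes librosa, Matplotlib's viridis colormap, and Pillow; the colormap choice and the number of MFCC coefficients shown are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import librosa
import matplotlib.cm as cm
from PIL import Image

def mel_rgb_image(x: np.ndarray, sr: int, n_mels: int = 224) -> Image.Image:
    """Log-Mel spectrogram rendered as a 224x224 RGB image (CNN input)."""
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-12)
    rgb = (cm.viridis(norm)[..., :3] * 255).astype(np.uint8)   # colormap -> RGB channels
    return Image.fromarray(rgb).resize((224, 224))

def mfcc_features(x: np.ndarray, sr: int, n_mfcc: int = 40) -> np.ndarray:
    """MFCC trajectory with shape (time steps, coefficients) for the LSTM input."""
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc).T
```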
Question 18. Table-2: You have mentioned the freeze and retained layers and mentioned functions such as “conv 1-1 and 1-2, conv 2-1 MBconv6”. However, these functions are not described in the text. Moreover, this Table does not add much information to the discussion and hence, may be removed. The details of each convolution function, used if any, may be shifted to suitable appendices.
Response: Thank you for your comments. Accordingly, we moved Table 2, “Fine-Tuning CNN Models,” into Appendix A and added an explanation of the layer names that appear in the table. Appendix A in the revised version is as follows:
Appendix A
Table A1. Fine-Tuning CNN Models.
Model | Frozen layers (blocks) | Retrained layers (blocks)
VGG16 | Blocks 1–3 (conv1_1 – conv3_3) | Blocks 4 & 5 (conv4_1 – conv5_3)
ResNet50 | Stem conv1 + Layer1 (conv2_x) + Layer2 (conv3_x) | Layer3 (conv4_x) & Layer4 (conv5_x)
ResNet18 | Stem conv1 + Layer1 (conv2_x) + Layer2 (conv3_x) | Layer3 (conv4_x) & Layer4 (conv5_x)
MobileNetV2 | Conv2dNormActivation (block 0) + InvertedResidual blocks 1–15 | InvertedResidual blocks 16 & 17 + Classification head
EfficientNet-B0 | Stem conv + Stages 1–5 (MBConv1 / MBConv6) | Stages 6 & 7 (MBConv6) + Head
Table A1 summarizes the fine-tuning strategy for each convolutional backbone. For all networks the lower-level feature extractors were kept intact, while the deeper semantic blocks and the final classifier were updated. Concretely, VGG-16 was fine-tuned from block 4 onward; both ResNet variants from conv4_x onward; MobileNetV2 from block 16 onward; and EfficientNet-B0 from stage 6 onward. This balanced scheme preserves generic visual features, shortens training time, and mitigates over-fitting while still giving the model capacity to adapt to the target domain.
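As an illustration of the freezing scheme in Table A1, a minimal PyTorch/torchvision sketch for ResNet50 (the two-class head and the pretrained-weight choice are placeholders; the layer split follows the table):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the stem conv1 (with its batch norm), Layer1 (conv2_x) and Layer2 (conv3_x).
for block in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in block.parameters():
        p.requires_grad = False

# Layer3 (conv4_x), Layer4 (conv5_x) and the new classification head remain trainable.
model.fc = nn.Linear(model.fc.in_features, 2)   # e.g. two classes for gender
```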
Question 19. The results and discussion section needs to be improved by providing data in graphical form. Please compare the predicted results from each algorithm with respect to actual data to show the error. The overall error % is one information, however, the efficacy of the algorithm needs to be demonstrated with specific examples.
Response. We thank you very much for your important observation to improve our work. In the revised version of the paper, we added graphical comparisons of the predicted results against the actual data for each model, as requested.
Question 20. The training process of the neural network needs to be elaborated. Please show the variation of error for different datasets of training through appropriate graphs. It is not clear if you are using same dataset for training and prediction. Please clarify. Moreover, please elaborate on the variability of data (age, gender, linguistic classification etc.) those are used in training and prediction.
Response. Thank you for your insightful observation. The training processes of the three CNN models on each dataset are shown in graphs (Figures 9-11). Furthermore, we added the hyperparameters of the training process, including the data split into training set (80%) and test set (20%), in Subsection 3.3.2 as follows:
Each corpus was processed independently using a stratified 80%–20% train–test split. Optimization employed Adam [28] with a Reduce-on-Plateau [29] scheduler (initial learning rate = 1 × 10⁻³, minimum = 1 × 10⁻⁶, patience = 3, reduction factor = 0.5). Early stopping [30] (patience = 15) and best-checkpoint saving were enabled. Generalization was promoted through L2 weight decay [31] (1 × 10⁻⁴) and dropout [32] (p = 0.5).
The class distribution of the original database is maintained in the training and test sets. The details of the three datasets used are described in Table 2 and Figure 12.
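A minimal PyTorch sketch of the training setup just described (Adam, Reduce-on-Plateau, early stopping, L2 weight decay, best-checkpoint saving); `train_one_epoch` and `evaluate` are assumed helper functions and `max_epochs` is an illustrative value:

```python
import torch

def train(model, train_loader, val_loader, max_epochs: int = 100):
    """Training loop sketch following the hyperparameters listed above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3, min_lr=1e-6)

    best_val, patience, wait = float("inf"), 15, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)   # assumed helper
        val_loss = evaluate(model, val_loader)            # assumed helper
        scheduler.step(val_loss)                          # reduce LR on plateau
        if val_loss < best_val:
            best_val, wait = val_loss, 0
            torch.save(model.state_dict(), "best_checkpoint.pt")  # best checkpoint
        else:
            wait += 1
            if wait >= patience:                          # early stopping (patience = 15)
                break
```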
Question 21. Please elaborate on the algorithm which takes care of accent of voice and language. What mathematical feature is used to identify such information? Please add some discussion in the context of your work and those from literature.
Response: Thank you for your observation. Accent identification is used to differentiate native speakers from non-native speakers, classify dialects within the same language, or categorize the social and educational level of speakers. Therefore, accent identification is relevant to automatic speaker recognition tasks; however, our purpose is to recognize speakers' gender and age. Recent research [1] has shown that including accent information in gender and age recognition tasks reduces recognition performance. On the other hand, spoken language identification has several important applications, such as multilingual voice assistants and speech translation systems. However, our proposed framework is designed for the English language; therefore, language identification is not currently considered. Undoubtedly, incorporating accent and spoken language into the speaker profiling system could enable the applications mentioned above.
For both the accent identification and language identification, the same features, such as MFCCs, Log-Mel spectrogram, i-vector, and x-vector, are used, and several CNN and/or RNN architectures are trained to identify the accent and/or language.
Because the objective of the proposed framework is gender and age recognition in the English language, accent and language identification are not used; however, including these characteristics would enrich the speaker profiling system. We therefore added the inclusion of these characteristics as future work in the Conclusions section of the revised version of the paper.
[1] R. Wang, “TIMIT Speaker Profiling: A Comparison of Multitask learning and Single-task learning approaches”, arXiv:2404.12077v1 [cs.SD] 18 Apr 2024
Question 22. The number of data points for TIMIT as presented in Fig. 5 are very small when compared to other 2 datasets. Is it due to non- availability of data? Will it affect the accuracy of the results as training may be insufficient? Provide full form of TIMIT in the text.
Response: As you pointed out, TIMIT is the smallest of the three datasets. Of course, a small dataset makes training more challenging due to the risk of overfitting. Table 2 and Figure 12 of the revised version of the paper describe the full composition of the TIMIT dataset.
Question 23. Table-9 shows that the error of the current method is less for the Mozilla Common Voice and VoxCeleb1 databases, whereas it is more for the TIMIT database. What is the reason for the same? Please elaborate. Does the error depend upon the type of voice used as input, i.e., may it be better for a certain class of voice? Please elaborate on this on the basis of voice signal features such as frequency content, energy etc. Is there a way through which the error can be reduced for the TIMIT database? Please comment.
Response. Thank you for your insightful question. One of the reasons for the lower performance on the TIMIT dataset is its considerably smaller amount of data compared with the other two datasets, which may lead to overfitting. Another reason is phonetic variation: even native speakers pronouncing the same sentences have different intonations and accents across the eight regional dialects of the United States. One technique to reduce the error rate is data augmentation, which includes methods such as additive noise, speed perturbation, pitch shifting, and time stretching. Another strategy is multi-task learning, in which the gender of the speaker is first estimated and then, using the estimated gender information, the age is estimated. In the Conclusions section of the revised version, we added this analysis and the possible techniques for reducing the error rate.
Question 24. Conclusion: This section needs to be modified. Currently, this is just a summary. It is suggested that few important conclusions may be included in a bulleted manner and the reason for such conclusions may be specified.
Response: Thank you for your insightful comments about the conclusions. In the Conclusions section of the revised version, we included several important conclusions in bulleted form, as follows:
- CNN/LSTM pipeline generalizes across diverse corpora without hyper-parameter retuning. Using an identical training recipe on VoxCeleb1, Common Voice and TIMIT, all models retained ≥ 98% gender and age-group accuracy or ≤ 4-year MAE. This shows that network design, not corpus-specific tweaking, is the principal driver of performance.
- Model capacity should match corpus scale. EfficientNet and ResNet50 delivered the best accuracies on the two large-scale datasets (VoxCeleb1, Common Voice), whereas the lightweight MobileNetV2 achieved the lowest MAE on the small TIMIT corpus. This confirms that over-parameterized models can overfit limited data, while compact architectures excel in low-resource settings.
- Front-end choice must fit the architecture. A set of comparative tests showed that CNNs improve by +1.3 to +1.8 percentage points when they take Mel spectrograms instead of MFCC vectors; this margin was consistent across all three corpora. In contrast, LSTMs gained only a modest ≈ +0.4 pp from using MFCCs, whose cepstral coefficients compress the spectral envelope and favor recurrent temporal modelling, an outcome aligned with standard practice in the literature. Although not an exhaustive ablation, these results support the rule of thumb: use high-resolution 2-D spectral images for convolutional networks and compact MFCC trajectories for recurrent networks, aligning spectral resolution with the receptive-field size and shape of the model.
- Performance improvements over prior art are statistically significant. The proposed pipeline surpasses the previous state of the art by +1.25 pp (gender, MCV), +2.86 pp (age, MCV) and +0.57 pp (gender, VoxCeleb1).
Question 25. Please include more current literature in the references section. The details of sources for Ref. [14], [15] and [27] are not provided. Please check all the references for completeness.
Response: Thank you for your comments. We completed information about the references [14], [15] and [27] of the original version ([44], [45] and [41] in the revised version), as follows:
[41] Pan, S. J.; Yang, Q. "A survey on transfer learning", IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359, 2010.
[44] Ardila R., Branson M., Davis K., Henretty M., Kohler M., Meyer J., Morais R., Saunders L., Tyers F. M., Weber G. (2020) “Common voice: A massively-multilingual speech corpus”, Proc. of 12th Int. Conf. on Language Resource and Evaluation, 4218-4222.
[45] Zue, V., Seneff S., Glass, J. (1990), “Speech database development at MIT: Timit and beyond”, Speech Communication, 9(4), 351-356, https://doi.org/10.1016/0167-6393(90)90010-7.
Question 26. Page-3: Please check the sentence “we propose exploring domain adaptation techniques and data augmentation strategies to enhance model robustness”. Please elaborate on these techniques and how these proposed methods shall improve the accuracy.
Response: Thank you for your comments. Considering your comments, we completed the sentence “we propose exploring domain adaptation techniques and data augmentation strategies to enhance the model robustness” as follows:
As future work, we propose exploring domain adaptation techniques and data augmentation strategies to enhance model robustness. In domain adaptation, we consider that the proposed system, trained for the English language, will be adapted to Mexican Spanish or Cuban Spanish. The primary challenge for this adaptation is the lack of sufficient labeled data for training. As domain adaptation techniques, pseudo-labeling and teacher-student methods can be applied to generate labels for unlabeled data; both techniques produce a dataset large enough for training from a small labeled one. Data augmentation involves artificially generating data by applying several audio processing techniques to real data, such as speed perturbation, pitch shifting, and time stretching. These techniques can also be applied to a small dataset, such as TIMIT, to enhance classification and estimation accuracy.
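To make the augmentation strategies concrete, a minimal librosa sketch of the waveform-level transforms mentioned above (the noise level, pitch step, and stretch rate are illustrative values, not the paper's settings):

```python
import numpy as np
import librosa

def augment(x: np.ndarray, sr: int) -> list:
    """Return simple augmented copies of a speech signal x sampled at sr."""
    return [
        x + 0.005 * np.random.randn(len(x)),                  # additive noise
        librosa.effects.pitch_shift(y=x, sr=sr, n_steps=2),   # pitch shifting (+2 semitones)
        librosa.effects.time_stretch(y=x, rate=1.1),          # time stretching (10% faster)
    ]
```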
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Acceptance of the manuscript was recommended.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed all the comments of the reviewer and have revised the manuscript accordingly. The revised version of the manuscript may be accepted for publication.