Gender and Age Estimation Methods Based on Speech Using Deep Neural Networks

The speech signal contains a vast spectrum of information about the speaker, such as the speaker's gender, age, accent, or health state. In this paper, we explore different approaches to automatic gender classification and age estimation from speech signals. We apply various Deep Neural Network-based embedder architectures, such as x-vector and d-vector, to the age estimation and gender classification tasks. Furthermore, we apply a transfer learning-based training scheme, pre-training the embedder network for a speaker recognition task on the VoxCeleb1 dataset and then fine-tuning it for the joint age estimation and gender classification task. The best performing system achieves new state-of-the-art results on the age estimation task on the popular TIMIT dataset, with a mean absolute error (MAE) of 5.12 years for male and 5.29 years for female speakers, a root-mean-square error (RMSE) of 7.24 and 8.12 years for male and female speakers, respectively, and an overall gender recognition accuracy of 99.60%.


Introduction
Speech is a multidimensional phenomenon whose production involves the movements of many anatomical structures that influence the overall speech quality and voice characteristics. Speech is the main and the easiest source of communication. Besides the linguistic information, it also carries speaker-dependent para-linguistic data such as the speaker's identity, emotional state, health state, age, or gender [1,2]. Systems for the automatic extraction of this information from speech can be very useful in numerous applications, such as personal identification in banking systems; customer care applications such as call centers; voice bots; and interactive, intelligent voice assistants. In industry, there are already global and local companies offering such speech processing services, like Google, Amazon, and Techmo on the Polish market [3]. Extracting information about the age and gender of the speaker may be used by an interactive voice response (IVR) system to redirect the speaker to an appropriate consultant [4] or to play background music suitable for a given gender/age group [5]. For voice-bot systems, extraction of para-linguistic information may be applied to alter the behaviour of the bot. In the case of voice assistants, such knowledge may be used to target suitable advertisements or to select search results that better fit a given age/gender group. All combined, exploiting the para-linguistic content can lead to an improved user experience, which in turn may generate revenue for a company that decides to use such systems.

Related Works
The extraction of this para-linguistic content has been explored intensively in recent years and across many languages. Earlier works used methods based on extracting acoustic parameters such as mel-frequency cepstral coefficients (MFCCs) and perceptual linear prediction coefficients, which were averaged across the sequence to form features for a classification algorithm such as a support vector machine, evaluated for example on the NIST SRE 2010 dataset [20]. Another research group [21] developed a method based on x-vectors, which achieved a mean absolute error (MAE) of 4.92 years; an i-vector system implemented on the same dataset yielded an MAE of 5.82 years, which is noticeably worse.
The work in [22] describes a DNN implementation for a joint height and age estimation system. Their age estimation results in terms of root mean square error (RMSE) are 7.60 and 8.63 years for male and female speakers, respectively, on the TIMIT dataset [23]. Finally, in the latest paper from 2020 [24], the authors propose a feature-engineering-based support vector regression system and achieve state-of-the-art results on the TIMIT dataset, with a mean absolute error (MAE) of 5.2 years for male and 5.6 years for female speakers.
In this paper, we present a DNN-based approach, enhanced with a multi-stage transfer learning scheme, to detect the gender and estimate the age of a speaker.

Data
To conduct the experiments we used three popular open-source speech datasets: VoxCeleb1, Common Voice, and TIMIT. The first was used to pre-train the speaker embedder; the remaining two were used for the age estimation and gender classification experiments. Regardless of the original format and sampling frequency, all data in each dataset were converted to 16-bit PCM wav format and downsampled to 16 kHz.
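As a rough illustration of this unification step, the resampling and 16-bit quantization could be sketched as below. This is a simplified linear-interpolation resampler standing in for the proper filter-based resampling done by tools like sox or ffmpeg; the function names are ours, not from the original pipeline.

```python
import numpy as np

def resample_linear(x: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Resample a mono waveform via linear interpolation.
    Note: a real pipeline would use a polyphase/windowed-sinc resampler
    to avoid aliasing; this is only a sketch of the idea."""
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, x)

def to_pcm16(x: np.ndarray) -> np.ndarray:
    """Convert a float waveform in [-1, 1] to 16-bit PCM samples."""
    x = np.clip(x, -1.0, 1.0)
    return (x * 32767.0).astype(np.int16)

# Example: a 1-second 440 Hz tone recorded at 44.1 kHz, unified to 16 kHz PCM.
sr_in, sr_out = 44_100, 16_000
tone = np.sin(2 * np.pi * 440 * np.arange(sr_in) / sr_in)
pcm = to_pcm16(resample_linear(tone, sr_in, sr_out))
```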

VoxCeleb1
VoxCeleb1 [25] is a large-scale audio-visual dataset published in 2017. It contains over 100,000 recordings from 1251 celebrities which correspond to over 300 h of data. The dataset was created primarily for the purpose of accelerating research in the field of speaker identification and verification. The database is gender-balanced. This dataset was used in transfer learning schemes to pre-train the speaker embedder network. It comes as 16-bit PCM wav files with a sampling rate of 16 kHz.

Common Voice
The Common Voice dataset is the largest open-source, multilingual speech dataset. It contains more than 2500 h of transcribed speech in 40 languages [26]. On top of the audio recordings and the corresponding transcriptions, it also contains voluntary metadata about the speaker, such as age group (teens, twenties, ..., eighties), gender (female, male, other), and accent. For the sake of this project we decided to use a subset of the English part of the Common Voice dataset [27]. To conduct the experiments we included only those recordings that contain metadata about both the gender and the age of the speaker. Recordings with the label other for gender were excluded from this experiment. In total, we used approximately 80 h of data in the train set, and 1.5 h of data in each of the validation and test sets. In terms of the number of recordings, there are 54,593 male and 18,099 female recordings in the training set, 1120 and 391 male/female recordings in the validation set, and 1133/390 male/female recordings in the test set. As there is no information about speaker labels in the version of the dataset we used, it is not possible to determine the exact number of speakers in the two gender groups. The original recordings come in mp3 format with a sampling rate of 44,100 Hz. Figures 1-3 present the distribution of recordings in different age groups in the training, validation, and test set, respectively.
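The metadata filtering described above can be sketched as follows. The TSV excerpt and its column names are hypothetical (the actual Common Voice release ships several metadata TSV files whose exact schema varies by version); only the filtering logic mirrors the selection criteria used in the experiments.

```python
import csv
import io

# Hypothetical excerpt of a Common Voice metadata TSV (column names assumed).
TSV = """client_id\tpath\tage\tgender
a1\tclip1.mp3\ttwenties\tmale
a2\tclip2.mp3\t\tfemale
a3\tclip3.mp3\tthirties\tother
a4\tclip4.mp3\tfourties\tfemale
"""

def usable_rows(tsv_text: str) -> list:
    """Keep only recordings with both age and gender metadata present,
    dropping the 'other' gender label as in the experiments."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [r for r in reader
            if r["age"] and r["gender"] in ("male", "female")]

rows = usable_rows(TSV)  # clip1.mp3 and clip4.mp3 survive the filter
```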

TIMIT
The DARPA-TIMIT dataset [28] contains recordings of 630 speakers from 8 different English dialect regions. It contains rich metadata about each speaker, including gender (male/female), exact age (exact birth date and recording date), and accent. For training and validation, we applied a random TRAIN-TEST split on the default TRAIN subset of the data, which yielded 3.5 h of train data and 0.5 h of validation data. The test set was used for the final evaluation; it contained 1500 utterances corresponding to 1.5 h of recordings.
In terms of the exact number of recordings, there are 2938 male and 1211 female recordings in the training set, 322 and 139 male/female recordings in the validation set, and 1120/560 male/female recordings in the test set. The recordings are stored in wav format and were recorded with a sampling rate of 16,000 Hz.

Methods
Figure 7 presents a high-level representation of the proposed system. Three different approaches are explored. Every system uses similar front-end classification/regression modules, which differ in size depending on the embedder architecture used and, consequently, the embedding size. These are described in more detail in Section 3.2. Figure 7. High-level representation of the system for joint age estimation and gender classification.
The first implemented system is the baseline x-vector solution introduced in [21], adapted for joint age and gender prediction. The adaptation involves replacing layers 7-8 from the original paper with a more modular configuration of two separate classifiers and a regressor to allow for joint age/gender prediction. It features Voice Activity Detection (VAD) and a 5 s long random crop in the waveform preprocessing stage, with 23-dimensional MFCCs with Cepstral Mean Normalization over a sliding window of 3 s as input features. For the embedder, a vanilla TDNN is used, which we describe in more detail in Section 3.1.1 [8].
The second system is our proposal for extending the baseline. Instead of 23-dimensional MFCCs, it uses 30-dimensional MFCC features. Instead of Cepstral Mean Normalization, which experimentally has been shown to perform poorly for the short utterances in the TIMIT data, we apply decibel-relative-to-full-scale (dBFS) normalization to the level of −30 dB on the cropped waveforms before feature extraction. The idea to use this normalization method comes from [10], where the authors used it in a d-vector based speaker verification system. Another major difference is the embedder architecture. It has been shown that deeper and more modern architectures such as TDNN-F or ResNet outperform the shallow TDNN-based embedder in the task of speaker verification [7]. Inspired by those results, we applied an embedder that consists of a deep, residual convolutional architecture, described in more detail in Section 3.1.2.
The third explored system differs heavily from the first two. First of all, instead of the x-vector embedder, it uses the d-vector architecture introduced in [10]. The d-vector approach differs from the x-vector in the way the utterance embedding is generated: instead of a deep convolutional network, a multilayer-LSTM RNN is used, and the output of the last hidden unit serves as the embedding. We present more details on this embedder architecture in Section 3.1.3. The second difference comes from the way the whole system is trained. While the first two systems train the embedder in an end-to-end manner alongside the front-end modules, the third system uses a pretrained embedder shared by the authors (trained for speaker verification on the LibriSpeech, VoxCeleb1, and VoxCeleb2 datasets) [29] and freezes its weights, which means that in this system only the front-end modules are trained.
The motivation for such an approach is the vast amount of computational power that would be required to train such a system, together with the fundamental differences in the data processing pipeline and training scheme between that system and the remaining two.
The effects of subsequent preprocessing stages mentioned earlier are shown in Figure 8.
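As an illustration of one of these preprocessing stages, the dBFS normalization to −30 dB can be sketched as follows. The sketch normalizes the RMS level of the waveform; the paper does not spell out the exact definition used, so treat this as an assumption.

```python
import numpy as np

def normalize_dbfs(x: np.ndarray, target_db: float = -30.0) -> np.ndarray:
    """Scale a float waveform (full scale = 1.0) so that its RMS level
    sits at `target_db` dBFS. RMS-based definition assumed."""
    rms = np.sqrt(np.mean(x ** 2))
    if rms == 0:
        return x  # silent input: nothing to scale
    gain = 10.0 ** (target_db / 20.0) / rms
    return x * gain

# Example: a half-amplitude sine normalized to -30 dBFS.
x = 0.5 * np.sin(2 * np.pi * 440 * np.arange(16_000) / 16_000)
y = normalize_dbfs(x)
level_db = 20 * np.log10(np.sqrt(np.mean(y ** 2)))  # ~ -30.0
```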

Time Delay Neural Network Based X-Vector Embedder
The first explored embedder consists of a stack of time-delay neural network (TDNN) layers that are responsible for capturing local context, followed by a statistics pooling layer, which aggregates the outputs of the TDNN layers [8,21,30]. The statistics vector is then passed through a fully connected layer to form the final embedding. The TDNN layer has been implemented as a 1-D dilated convolution [31]. The detailed network architecture is shown in Table 1.
Table 1. Baseline architecture summary [21].
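The two building blocks described above can be sketched in a few lines of NumPy. This is a toy, loop-based illustration with random weights, not the trained model: a TDNN layer realized as a 1-D dilated convolution with ReLU, followed by statistics pooling (mean and standard deviation over time).

```python
import numpy as np

def tdnn_layer(x: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """One TDNN layer as a 1-D dilated convolution with valid padding.
    x: (feat_in, T), w: (feat_out, feat_in, kernel) -> (feat_out, T')."""
    feat_out, feat_in, k = w.shape
    span = (k - 1) * dilation              # temporal context covered
    t_out = x.shape[1] - span
    out = np.zeros((feat_out, t_out))
    for t in range(t_out):
        ctx = x[:, t : t + span + 1 : dilation]            # (feat_in, k)
        out[:, t] = np.maximum(                            # ReLU
            np.tensordot(w, ctx, axes=([1, 2], [0, 1])), 0.0)
    return out

def stats_pooling(x: np.ndarray) -> np.ndarray:
    """Aggregate frame-level outputs into mean and std over time."""
    return np.concatenate([x.mean(axis=1), x.std(axis=1)])

frames = np.random.randn(23, 300)   # e.g. 23-dim MFCCs, 300 frames
h = tdnn_layer(frames, np.random.randn(64, 23, 5), dilation=2)
emb_stats = stats_pooling(h)        # fixed-size vector regardless of T
```

In the real embedder, `emb_stats` would then pass through a fully connected layer to produce the x-vector.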

QuartzNet X-Vector Embedder
The second flavor of the x-vector embedder we propose differs from the one described in Section 3.1.1 mainly in the architecture of the layers that precede the statistics pooling layer. The QuartzNet x-vector embedder uses the QuartzNet architecture designed for the task of end-to-end speech recognition [32]. The network is composed of several blocks with residual connections between them. Each block is composed of one or more 1-D convolution, batch normalization, and ReLU layers. The exact architecture used throughout this work is shown in Table 2.

D-Vector Embedder
The d-vector architecture used in this work was introduced in [10] for the task of speaker verification. Although the goal of the d-vector based embedder is the same as that of the x-vector architecture, that is, to embed a variable-length utterance into a fixed-size vector, the principle is quite different. Instead of statistical pooling layers that aggregate information across the global context, the d-vector uses a multilayer-LSTM network [33] with a linear projection at the final layer to summarize the utterance. What is unique about the contribution of the authors of [10] is the training procedure, especially the introduction of the so-called Generalized End-to-End training. The network architecture itself is shown in Table 3.

Front-End Modules
As shown already in Figure 7, there are three additional neural networks on top of the embedder: a binary classifier for gender classification, a multiclass classifier for the helper age-group classification task, and a regressor for age estimation. Due to the modular design of the presented system, the embedder network can be trained separately from the classifiers/regressor; this feature proved very useful in the experiment where the embedder was separately pretrained on the VoxCeleb dataset and then jointly fine-tuned on Common Voice and TIMIT for the gender/age classification task. The exact architecture of these front-end modules is presented in Table 4.
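The arrangement of the three heads on a shared utterance embedding can be sketched as follows. The layer sizes here are illustrative assumptions (the actual sizes are given in Table 4 and depend on the embedder used); the point is only the multi-task shape of the front end.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_head(x, w1, b1, w2, b2):
    """A tiny one-hidden-layer MLP head on top of the utterance embedding."""
    h = np.maximum(x @ w1 + b1, 0.0)   # ReLU hidden layer
    return h @ w2 + b2

emb_dim, hidden = 512, 128             # sizes assumed for illustration
emb = rng.standard_normal(emb_dim)     # stand-in for an x-/d-vector

def make_head(n_out):
    return (rng.standard_normal((emb_dim, hidden)), np.zeros(hidden),
            rng.standard_normal((hidden, n_out)), np.zeros(n_out))

gender_logits = mlp_head(emb, *make_head(2))   # male/female
agegrp_logits = mlp_head(emb, *make_head(8))   # helper age groups
age_estimate  = mlp_head(emb, *make_head(1))   # scalar age regression
```

Because the heads only consume the embedding, they can be trained with the embedder frozen (as in the d-vector system) or end-to-end (as in the x-vector systems).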

Results
We have performed experiments with three different methods of data processing and network architectures, introduced in the previous sections. Each system presented in this section has been trained in three different ways:
• only on the TIMIT train dataset,
• only on the Common Voice train dataset, or
• pretrained on the Common Voice train dataset and then fine-tuned on the TIMIT train dataset.
The reasoning behind such a training scheme is the attempt to leverage transfer learning. The Common Voice dataset used in this work is much larger than the TIMIT train dataset, with more than 80 h of data compared to approximately 3.5 h in the case of TIMIT. We present the results of these experiments in Sections 4.1-4.3. We also present one additional experiment, where an additional pretraining of the embedder on a speaker identification task on the VoxCeleb dataset is performed. The results for this are shown in Section 4.4.
One important fact to note about the Common Voice dataset is that it does not contain speaker labels. In other words, it is entirely possible, and perhaps highly likely, that speakers overlap between the train, validation, and test sets. As such, the classification results on the Common Voice dataset are presented mostly to show the performance of the system trained solely on the TIMIT dataset, as well as to illustrate one of the negative aspects of transfer learning: so-called catastrophic forgetting [34], a phenomenon where a fine-tuned system no longer performs equally well on the original dataset.
The results are presented according to the following metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Accuracy, and F1 score. The waveform processing and feature extraction details for the baseline and QuartzNet-based systems are shown in Tables 5 and 6. As the d-vector-based system uses a neural network as a feature extractor, the processing details for that system are not presented here.
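For reference, the two regression metrics used throughout the tables can be computed as follows (standard definitions; the toy ages below are illustrative, not data from the paper):

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    """Mean Absolute Error: average of |y_true - y_pred|."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred) -> float:
    """Root Mean Square Error: sqrt of the mean squared residual.
    Penalizes large errors more heavily than MAE."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

ages_true = [25, 40, 33, 58]   # made-up ground-truth ages
ages_pred = [28, 37, 30, 60]   # made-up predictions
print(mae(ages_true, ages_pred))    # → 2.75
print(rmse(ages_true, ages_pred))   # ≈ 2.78
```

Note that RMSE ≥ MAE always holds, which is why the RMSE figures in the tables exceed the corresponding MAE figures.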

Baseline TDNN X-Vector System
The results of the baseline system achieved on the TIMIT test dataset are shown in Table 7. The hyperparameters of the baseline system strictly follow the baseline implementation proposed in [21]. The performance on the gender classification task on the Common Voice dataset with the baseline x-vector embedder is presented in Table 8. The age estimation RMSE of the presented approach, 8.44 and 7.96 years for female and male speakers respectively, is comparable to the results reported by the authors of [22], with 8.63 and 7.60 years female/male. However, the network does not respond well to attempts at transfer learning: a system pre-trained on Common Voice actually yields worse results than the system with no pre-training. One possible reason for such behaviour may be the usage of Cepstral Mean Normalization over a sliding window: the difference in recording lengths between the Common Voice and TIMIT datasets, and the fact that the recordings in the TIMIT dataset are overall quite short (below 3 s on average), may lead to a mismatch in the estimates and consequently to a degradation of results. In fact, we also found experimentally that using Cepstral Mean Normalization in the second proposed architecture led to worse results than other normalization techniques.

QuartzNet-Based X-Vector System
The results of the QuartzNet-based system are shown in Tables 9 and 13. As we found experimentally that the usage of CMVN led to worse results, we decided on a different normalization technique for this system. In particular, we made use of the dBFS normalization applied in the d-vector system [10] and applied it to the QuartzNet-based x-vector system. The QuartzNet-based system performs worse than the baseline when the amount of data is low (trained only on TIMIT), which should not be a surprise given the size of the network it uses. However, it achieves a significant performance gain in the transfer learning scenario, clearly outperforming the baseline system when pretrained on the Common Voice dataset. On the TIMIT test set it achieves an RMSE of 7.91 and 7.37 years for female/male speakers and an MAE of 5.2 and 5.37 years female/male. The results in terms of MAE are comparable with the state-of-the-art age estimation results of 4.23 and 5.78 years female/male on the NIST SRE08 dataset published in [21]. Note, however, that the utterances in that dataset are much longer than those in the TIMIT dataset, and the authors report much worse results on shorter test segments. These results are also on par with those recently published in [24] (5.6 and 5.2 years MAE female/male), without using any hand-engineered features and relying solely on the low-level signal representation. In terms of gender classification, a similar pattern as with age estimation can be seen. While on average the accuracy of gender classification is comparable to that achieved with the baseline system, the performance tends to be worse when the amount of training data is limited, which, again, is expected due to the size of the network. This phenomenon can be seen by comparing the accuracy of systems trained only on the TIMIT train dataset, shown in Tables 8 and 13.

D-Vector Embedder-Based System
The d-vector embedder-based system uses the embedder network as a feature extractor and does not update its weights during training. This means that the only trainable parameters are located in the classifier/regressor MLPs, and the overall number of weights that need to be optimized is an order of magnitude lower than in the case of the end-to-end pipelines of the baseline and QuartzNet systems. The complete pipeline of this approach follows [10], while the results on the TIMIT and Common Voice datasets are shown in Tables 10 and 11. This system offers the best and most robust results in terms of RMSE for age estimation and accuracy for gender classification. The performance of the system when trained only on the TIMIT train dataset is already comparable to that of the baseline x-vector system in terms of age estimation and better in terms of gender classification accuracy. Moreover, this system benefits from pre-training the classifiers and regressor on the Common Voice dataset, as it yields an improvement of 0.56 years RMSE compared to the system trained without pre-training. Furthermore, its performance on the gender classification task is the best of the proposed solutions, with 96.80% accuracy on the Common Voice test set in a scenario without any Common Voice data in training, 99.40% on the TIMIT test set with no TIMIT data in training, and 99.60% when trained on the TIMIT train set or fine-tuned on it.

Pre-Training the X-Vector Embedder on VoxCeleb1
Encouraged by the results achieved with the QuartzNet architecture and given how well it responded to pretraining on the Common Voice dataset, we decided to experiment with one extra pretraining step. In this pipeline, the QuartzNet-based embedder network (but only the embedder) receives an extra training step: it is pretrained on the VoxCeleb1 dataset with the goal of speaker identification. The pretraining is performed using a MultiClassifier network as a frontend module, with the number of output classes equal to the number of speakers in the VoxCeleb1 dataset, which is 1211. After training on VoxCeleb, the frontend module is discarded and only the pretrained embedder is used in further training steps. The hypothesis behind this step is that the speaker's identity also carries information about both the speaker's gender and age. The results obtained with this pipeline are shown in Table 12.
The results seem to confirm this assumption: in every scenario, the age estimation results improved with respect to the same system without VoxCeleb pre-training (shown in Tables 9 and 13) when the MAE metric is considered (the RMSE metric also improved or remained the same). On top of that, the gender classification accuracy is competitive with the results achieved by the d-vector system, shown in Table 10. These results are also the best in terms of MAE out of all the proposed solutions, and better than the current state-of-the-art results shown in [24] by 0.31 and 0.08 years MAE for female and male speakers, respectively.

Conclusions
In this paper, we explored different neural network architectures, transfer learning schemes, and the use of multitask learning in the context of age estimation and gender classification from speech signals. In particular, we implemented baseline and extended x-vector based utterance embedders as well as a d-vector based system. The proposed transfer learning schemes, including pretraining systems on the Common Voice dataset as well as an additional embedder pre-training on the VoxCeleb dataset, yielded consecutive performance gains in all scenarios except for the baseline system. The results presented in this work confirm that these deep learning approaches are effective at estimating a speaker's age and gender. In terms of age estimation, the proposed system with a two-staged transfer learning scheme and a QuartzNet embedder achieved a new state-of-the-art result on the TIMIT dataset, with an MAE of 5.12 years for male and 5.29 years for female speakers, and an RMSE of 7.24 and 8.12 years for male and female speakers, respectively. Compared to the results already published in the literature (Table 14), our algorithm outperforms the existing solutions. In terms of gender classification, the d-vector-based system achieved robustly high performance, with accuracies varying from 96.8% to 99.6% depending on the training and testing datasets. The highest result was achieved when the Common Voice dataset was used for training and the algorithm was further fine-tuned on the TIMIT dataset, which enabled a classification accuracy of 99.6% for gender recognition. The female accuracy with the proposed preprocessing and classification methods was slightly (about 0.1%) worse than the male recognition accuracy. This result also outperforms the existing methods published in the literature so far (Table 15).
An additional appeal of the d-vector system is that it only contains trainable weights in the front-end modules, reducing the amount of data needed to achieve satisfactory performance, the risk of overfitting, and the training time. Overall, the detection accuracy for female speakers was lower for all the methods presented in this paper. There are some limitations to this study that should be kept in mind when interpreting the findings. First, the transfer learning approach we used builds on information that already exists in pre-trained models. Secondly, the databases used for the purpose of this research are not meant to be fully representative of all the potential diversity in human voices. In the end, the aim of this research was to capture signal features that were diverse enough to make meaningful comparisons about the way these types of systems learn gender recognition and age estimation. This research provided insights into the nature and limitations of the implemented types of machine learning models. In general, it is important to train the models on data that captures the diversity of human characteristics they will encounter in real-world contexts.
Table 15. Comparison of proposed system with existing ones for gender recognition.

Data Availability Statement:
The data presented in this study are openly available in VoxCeleb1 at [37], the Common Voice dataset at [27] and The DARPA-TIMIT dataset [28].

Conflicts of Interest:
The authors declare no conflict of interest.