Classification between Elderly Voices and Young Voices Using an Efficient Combination of Deep Learning Classifiers and Various Parameters

Abstract: The objective of this research was to develop deep learning classifiers and various parameters that provide an accurate and objective system for classifying elderly and young voice signals. This work focused on deep learning methods, such as the feedforward neural network (FNN) and convolutional neural network (CNN), for the detection of elderly voice signals using mel-frequency cepstral coefficients (MFCCs), linear prediction cepstrum coefficients (LPCCs), skewness, and kurtosis parameters. In total, 126 subjects (63 elderly and 63 young) were obtained from the Saarbruecken voice database. The highest performance of 93.75% appeared when the skewness was added to the MFCC and MFCC delta parameters, although the fusion of the skewness and kurtosis parameters also had a positive effect on the overall accuracy of the classification. The results of this study revealed that the performance of FNN was higher than that of CNN. In terms of gender, most parameters estimated from male data samples demonstrated good performance. Rather than using mixed female and male data, this work recommends the development of separate systems that achieve the best performance through each optimized parameter using data from independent male and female samples.


Introduction
The human voice represents a complex biological signal resulting from the dynamic interaction between adduction/vibration of the vocal folds and pulmonary air emission and flow through the resonant structures [1]. Physiologic aging leads to specific changes in the anatomy and physiology of all structures involved in the production and modulation of the human voice [2-4]. The aging of laryngeal tissue changes the movement of the vocal cords, their vibration, and their opening and closing processes [5]. Hence, a possible approach to evaluate the effect of physiological aging in humans could include the analysis of voice [1]. Voice characteristics are measured by the frequency of vocal cord oscillations per second, that is, the fundamental frequency (F0), as well as jitter, shimmer, excitation source components, etc. [6-9]. The elderly voice can also be distinguished from the voices of children, adolescents, and middle-aged people, as each group has distinct acoustic features measured as different voice characteristics [5]. Therefore, in order to create a system for recognizing the voice of the elderly, it is necessary to understand the characteristics of changes in vocal cord tissue due to anatomical or physiological aging [10], and various welfare systems using dedicated voice databases of the elderly should be implemented.
More recent studies have widely used ensemble-based machine learning classifiers, such as classification and regression tree (CART), random forest, XGBoost, and the gradient boosting algorithm (GBM), to objectively distinguish presbyphonia from neurological voice disorders [10]. The key finding of another study was that cepstral peak prominence (CPP), one of the cepstral indices, was the most important predictor for identifying hypokinetic dysarthria in the developed prediction model, which included sociodemographic factors, health behavior, and acoustic-phonetic indices [26]. In other words, these studies had the advantage of acoustically analyzing elderly voices that reflect the speech characteristics of actual subjects, using sustained vowel phonation tasks to analyze voice characteristics [27]. Lee et al. proposed higher-order statistical (HOS) parameters for the classification of young and elderly voice signals and identified gender- and age-related differences through HOS analysis [5]. The acoustic parameters were extracted from young and elderly voice signals using Praat and a time-frequency analysis program (TF32). Additionally, their study investigated the gender- and age-related differences in HOS parameters. Young and elderly voice signals differed significantly in normalized skewness (p = 0.005) in females and normalized kurtosis (p = 0.011) in males. Therefore, normalized skewness is a useful parameter for distinguishing between young and elderly female voices, and normalized kurtosis is essential for distinguishing between young and elderly male voices.
Until now, most acoustic studies that analyzed elderly voices using deep learning methods and acoustic parameters have evaluated the sustained phonation of single vowels [13,19-25], and only a few studies have examined the recognition of elderly voices [11-18]. Furthermore, in terms of methodology, even the previous studies [10] that analyzed elderly voices using deep learning approaches merely identified the differences between elderly voices and hypokinetic dysarthria based on acoustic-phonetic analysis using both the cepstrum and spectrum by collecting the voice data of subjects.
On the other hand, deep learning methods, such as the feedforward neural network (FNN) and convolutional neural network (CNN), have been widely used as data mining algorithms for developing disease prediction models in recent years [1,28]. No study has explored the classification of elderly versus young voices using deep learning algorithms applied to the Saarbruecken voice database (SVD). Therefore, the ultimate goal of this research was to develop deep learning classifiers and various parameters that provide an accurate and objective system for classifying elderly and young voices. This work focused on deep learning methods, such as FNN and CNN, for the detection of elderly voice signals using mel-frequency cepstral coefficients (MFCCs), linear prediction cepstrum coefficients (LPCCs), skewness, and kurtosis parameters. Voice data were obtained from the publicly available SVD. Finally, this work investigated classification performance using deep learning methods with various combinations of parameters. Therefore, the originality of this work can be found in its proposal of a new parameter set and a novel deep learning method that combines HOS, MFCC, and LPCC parameters in the /a/ vowel voice signals of elderly and young individuals. The contribution of this paper can be summarized with the following points:
• This paper introduces an intelligent elderly voice classification system based on deep learning classifiers and various parameters.

• The suggested combinations of various parameters and deep learning methods can effectively distinguish young from elderly voices.
• Numerous experimental tests are performed to confirm the effectiveness of the elderly voice classification system using the Saarbruecken voice database.
• The experimental results emphasize the superiority of the proposed elderly voice classification system, which integrates deep learning methods and various parameters to monitor elderly voices in an effective and reliable system.

Database
This work used the Saarbruecken voice database (SVD) recorded by the Phonetics Research Institute at Saarland University, Germany [29]. This database is a collection of voice recordings from more than 2000 persons, consisting of the vowels /a/, /i/, and /u/ produced at normal, high, low, and low-high-low pitch and the sentence "Guten Morgen, wie geht es Ihnen?". In addition, the electroglottogram (EGG) signal is also stored for each case in a separate file. The length of the files with sustained vowels is between 1 and 3 s. All recordings are sampled at 50 kHz with 16-bit resolution. Table 1 shows descriptive statistics of the voice samples used in the study. The study used recordings of the sustained vowel sound /a/ from 126 normal speakers (63 elderly and 63 young) at neutral pitch, including 36 voice samples of "vox senilis". The voice signals were classified into two different groups, namely those from young and elderly subjects, based on recent publications [2,5]. This paper also analyzed voice differences by comparing young and elderly subjects according to gender. Thus, this work subdivided the voice signals into four subcategories. Group 1 consisted of 24 young men between the ages of 22 and 59 years (mean age = 39.13). Group 2 consisted of 24 elderly men between the ages of 60 and 89 years (mean age = 72.2). Group 3 included 39 young women between the ages of 20 and 58 years (mean age = 39.12). Finally, Group 4 consisted of 39 elderly women between the ages of 60 and 87 years (mean age = 70.7). According to the gender innovation analysis methods [19,20], in order to match the number of elderly and young voice signals, the experiments were conducted using the dataset extracted from the SVD. The number of elderly voices used in this study was the maximum number of elderly voices available in the SVD.
Although the author admits that the amount of data is small for deep learning algorithms, other papers [9,30,31] have also experimented with deep learning algorithms using small amounts of data.

Feature Parameters
The mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstrum coefficients (LPCCs) are well known in speech signal processing, and thus, they will not be discussed in detail here [32,33]. More information about their application and derivation can be found in a number of previous studies [14,15,28]. In this work, 20-dimensional MFCCs and 20-dimensional LPCCs were extracted from a 40 ms window signal using a 20 ms frameshift [28].
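The 40 ms window / 20 ms frameshift setup above can be sketched as a simple framing step. The sketch below uses pure NumPy and a placeholder signal; the MFCC computation itself would typically come from a toolkit such as librosa (e.g., `librosa.feature.mfcc` with `n_mfcc=20`), which is an assumed choice, not something the paper specifies.

```python
import numpy as np

def frame_signal(x, sr=50000, win_ms=40, shift_ms=20):
    """Slice a 1-D signal into overlapping analysis frames.

    With the paper's settings (50 kHz sampling, 40 ms window, 20 ms
    frameshift), each frame holds 2000 samples and consecutive frames
    overlap by half.
    """
    win = int(sr * win_ms / 1000)    # 2000 samples per frame
    hop = int(sr * shift_ms / 1000)  # 1000-sample frameshift
    n_frames = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

# One second of placeholder audio at 50 kHz (stands in for a real /a/ vowel).
signal = np.random.default_rng(0).standard_normal(50000)
frames = frame_signal(signal)
print(frames.shape)  # (49, 2000): 49 frames of 2000 samples each
```

Each of these frames would then yield one 20-dimensional MFCC (or LPCC) vector.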
Among the various higher-order statistics (HOS), the third- and fourth-order cumulants were used as characteristic parameters in this study. These parameters are called normalized skewness (γ3) and normalized kurtosis (γ4), and they are defined as shown in Equation (1):

γ3 = (1/N) Σ_{n=1}^{N} ((x_n − μ)/σ)^3,  γ4 = (1/N) Σ_{n=1}^{N} ((x_n − μ)/σ)^4 − 3,  (1)

where x_n is the n-th sample value, N is the number of samples, and μ and σ represent the mean and standard deviation, respectively. In this paper, kurtosis is, to be exact, excess kurtosis. Skewness and kurtosis in the time domain have shown massive potential as classification indices for pathological signals [34,35], and they worked well in the classification between elderly and young voices, as shown in a previous study [5]. Skewness indicates the symmetry of the probability density function (PDF) of the amplitude of a time series. A normal distribution has a skewness of zero, which indicates a time series with an equal number of large and small amplitude values. If the skewness value is between −0.5 and 0.5, the data are fairly symmetrical. If the skewness value is between −1 and −0.5 or between 0.5 and 1, the data are moderately skewed. If the skewness value is positive, it indicates a time series with many small values and few large values that is positively skewed (right tail). On the contrary, a time series with many large values and few small values is negatively skewed (left tail), and its skewness value is negative. Kurtosis measures the tailedness of the PDF of a time series. The kurtosis value is often compared with that of a normal distribution, which is equal to zero in Equation (1). If the kurtosis value is greater than zero, then the time series has relatively sharp peaks and heavier tails than a normal distribution. On the contrary, if the kurtosis is less than zero, then the time series has relatively flat peaks and lighter tails than a normal distribution [28,36].
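The normalized skewness and excess-kurtosis definitions in Equation (1) translate directly into NumPy; a minimal sketch:

```python
import numpy as np

def normalized_skewness(x):
    # gamma_3 = mean of ((x - mu) / sigma)^3; zero for a symmetric PDF.
    mu, sigma = x.mean(), x.std()
    return np.mean(((x - mu) / sigma) ** 3)

def normalized_kurtosis(x):
    # gamma_4 = mean of ((x - mu) / sigma)^4 - 3 (excess kurtosis,
    # zero for a normal distribution).
    mu, sigma = x.mean(), x.std()
    return np.mean(((x - mu) / sigma) ** 4) - 3

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # symmetric about its mean of 3
print(normalized_skewness(x))  # 0.0: symmetric, no tail to either side
print(normalized_kurtosis(x))  # about -1.3: flatter than a normal PDF
```

`scipy.stats.skew` and `scipy.stats.kurtosis` compute the same quantities and can serve as a cross-check.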
In this study, the 20-dimensional MFCCs and the 20-dimensional LPCCs were extracted from a 40 ms window signal using a 20 ms frameshift. The same settings have been applied in many previous studies [28,37]. Parameters such as MFCC + skewness and LPCC + kurtosis were created by appending one parameter to the original 20-dimensional MFCCs and LPCCs, thus producing 21 dimensions. MFCC + skewness + kurtosis and LPCC + skewness + kurtosis had 22 dimensions. Finally, the MFCC + delta + skewness + kurtosis parameter, which is made up of the 20-dimensional MFCCs, the 20-dimensional MFCC deltas, skewness, and kurtosis, had 42 dimensions. Therefore, the total number of dimensions ranged from 20 to 42. All voice data were grouped into training (70% of the data) and test (30%) sets to implement all methods. A five-fold cross-validation scheme was used, with each fold randomly selected from the data [35,38,39]. The whole process was repeated 10 times and the results were averaged. The results are reported as the mean and standard deviation.
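The concatenation into the largest 42-dimensional parameter set can be sketched as below. The delta computation here uses `np.gradient` as a stand-in, since the paper does not specify how the deltas were derived, and the random arrays are placeholders for real per-frame features.

```python
import numpy as np

def build_feature_matrix(mfcc, skew, kurt):
    """Assemble the MFCC + delta + skewness + kurtosis parameter set.

    mfcc : (n_frames, 20) MFCC matrix.
    skew, kurt : (n_frames,) per-frame skewness and excess kurtosis.
    Returns (n_frames, 42): 20 MFCC + 20 MFCC delta + 2 HOS values.
    """
    delta = np.gradient(mfcc, axis=0)  # assumed frame-to-frame delta
    return np.hstack([mfcc, delta, skew[:, None], kurt[:, None]])

rng = np.random.default_rng(1)
n_frames = 49
features = build_feature_matrix(rng.standard_normal((n_frames, 20)),
                                rng.standard_normal(n_frames),
                                rng.standard_normal(n_frames))
print(features.shape)  # (49, 42): the paper's largest parameter dimension
```

Dropping columns from this matrix reproduces the smaller 20-, 21-, and 22-dimensional variants described above.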

Deep Learning Classifiers
FNN and CNN are well-known deep learning classifiers and, thus, will not be discussed in detail here. More information about their application and derivation can be found in previous studies [18,28]. This work used an FNN with two feed-forward layers to solve the binary classification problem of distinguishing healthy young and elderly voices. The first layer was followed by rectified linear unit (ReLU) activation, and the last layer was followed by softmax activation. The selected FNN parameter values are shown in Table 2. For the CNN, this work used two convolutional layers and two feed-forward layers. Dropout with a probability of 0.05 and batch normalization were applied to all layers except the feed-forward layers. Max-pooling and average-pooling operations were performed between the convolutional layers to downsample the intermediate representations in time and to add some time invariance in the process. This study also used a filter size of eight frames and a stride length of one in every convolutional layer, with max-pooling at a pool size of eight and a stride length of eight between the first and second convolutional layers, and average pooling at a pool size of eight and a stride length of two between the other layers. The output of the last fully connected layer, with a size of 64, was fed to a two-way softmax classifier, which classifies a patch into one of the two classes. The parameter details used in the CNN are shown in Table 3.
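The FNN described above (two feed-forward layers, ReLU after the first, softmax after the last) can be sketched as a plain NumPy forward pass. The hidden size of 64 and the random weights are illustrative assumptions, not the paper's trained values (Table 2 holds those).

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def fnn_forward(x, W1, b1, W2, b2):
    """Two feed-forward layers: ReLU after the first, softmax after the
    last, yielding class probabilities for elderly vs. young."""
    h = relu(x @ W1 + b1)
    return softmax(h @ W2 + b2)

rng = np.random.default_rng(2)
dim_in, hidden, classes = 42, 64, 2  # 42-dim input; hidden size assumed
W1, b1 = rng.standard_normal((dim_in, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, classes)) * 0.1, np.zeros(classes)

probs = fnn_forward(rng.standard_normal((5, dim_in)), W1, b1, W2, b2)
print(probs.shape)        # (5, 2): one probability pair per input vector
print(probs.sum(axis=1))  # each row sums to 1 because of the softmax
```

In practice the weights would be learned with a framework such as TensorFlow or PyTorch; this sketch only shows the layer structure.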

Experimental Results and Discussion
A histogram is an approximate representation of the distribution of numerical data. To construct a histogram, the first step is to define the value range of each "bin". That is, the entire range of values is divided into a series of intervals, and the number of values contained in each interval is then counted [36]. A kernel density estimate can be thought of as a smoothed histogram in which the frequencies of the bins are smoothed using kernels. This usually results in a smoother probability density function that more accurately reflects the distribution of the underlying variable. The density estimate can be displayed instead of a histogram, and it is usually drawn as a curve rather than a set of boxes [36]. Figure 1 shows the histograms and curves of MFCC data extracted from elderly and young voices to provide a better visualization of elderly and young voice signals from men and women for the /a/ vowel. In Figure 1a, overall similar tendencies are observed in both the histograms and curves of MFCC for elderly and young female /a/ vowel samples. As shown in Figure 1b,c, the histograms and curves of young voice signals are more concentrated near zero than those of elderly voice signals for male and mixed /a/ vowel samples. Figure 2 shows the histograms and curves of LPCC data extracted from elderly and young voice signals of men and women for the /a/ vowel. In Figure 2a, the histogram and curve estimated for elderly voice signals tend to show a pattern similar to those extracted from young voice signals. In contrast, the histogram and curve extracted from elderly voice signals tend to be slightly more distributed near zero than those extracted from young voice signals, as shown in Figure 2b. A similar tendency towards symmetry is observed for both elderly and young voice signals in the histograms and curves of LPCC for mixed female and male /a/ samples.
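The histogram-versus-density idea above can be made concrete with a short sketch: bin counts first, then a simple Gaussian kernel density estimate over the same samples. The bandwidth and bin count are assumed illustrative choices.

```python
import numpy as np

# Histogram: divide the value range into bins, then count values per bin.
rng = np.random.default_rng(3)
samples = rng.standard_normal(1000)  # placeholder for MFCC/LPCC values
counts, edges = np.histogram(samples, bins=20)

# A simple Gaussian kernel density estimate smooths the binned frequencies
# into a continuous curve that better reflects the underlying PDF.
grid = np.linspace(-4.0, 4.0, 200)
bandwidth = 0.3  # kernel width; an assumed smoothing choice
kernels = np.exp(-0.5 * ((grid[:, None] - samples[None, :]) / bandwidth) ** 2)
density = kernels.sum(axis=1) / (samples.size * bandwidth * np.sqrt(2 * np.pi))

print(counts.sum())                           # 1000: every sample lands in a bin
print((density * (grid[1] - grid[0])).sum())  # close to 1: density integrates to one
```

Plotting `density` against `grid` gives the smooth curves shown in Figures 1-4, in place of the boxy histogram bars.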
In Figures 1 and 2, the distributions of MFCC and LPCC data for female, male, and mixed samples show similar patterns. Figure 3 shows the histograms and curves of skewness extracted from elderly and young voice signals of men and women for the /a/ vowel. In female /a/ vowel samples (Figure 3a), the skewness extracted from elderly voice signals tends to be distributed above zero; there is a tendency to be positively skewed with moderate skewness. However, the skewness extracted from young voice signals tends to be distributed near and below zero compared with elderly voice signals. That is, the distribution appears fairly symmetrical to negatively skewed, with a left tail. In male /a/ vowel samples (Figure 3b), similar tendencies can be observed for both elderly and young voice signals in both the histograms and curves, but the histogram and curve of skewness extracted from young voice signals have a longer distribution and larger values. Both elderly and young voice signals appear fairly symmetrical to moderately skewed. Overall, in Figure 3a,b, it can be seen that the skewness extracted from female voice signals is distributed above zero compared with the skewness extracted from male voice signals. Clearly, the skewness distributions in the two figures show different patterns. In the mixed female and male /a/ vowel samples (Figure 3c), the histogram and curve patterns are also distinctly different from those of the female and male /a/ vowel samples. In female /a/ vowel samples (Figure 4a), the histograms and curves of young voice signals tend to have a longer kurtosis distribution, and the kurtosis estimated for elderly voice signals tends to be shorter and wider than that of young voice signals. However, in terms of kurtosis, the histograms and curves for both voice signals tend to indicate relatively flat peaks and lighter tails than a normal distribution.
On the other hand, in the case of male /a/ vowel samples (Figure 4b), the kurtosis estimated in elderly voice signals tends to be longer and wider than that of young voice signals. Because the histograms of the kurtosis estimated for elderly voice signals tend to be less than zero, these distributions have relatively flat peaks and lighter tails than a normal distribution. In the case of the histograms of the kurtosis extracted from young voice signals, some of them tend to have values greater than zero. Therefore, it can be seen that some of them have sharp peaks and heavier tails than a normal distribution, while most of them have relatively flat peaks and lighter tails. In the mixed female and male /a/ vowel samples (Figure 4c), the histogram and curve have a different pattern from those in Figure 4a for both elderly and young voice signals. Table 4 shows the classification results obtained by the MFCC, skewness, and kurtosis parameters and by the deep learning methods (FNN and CNN). In Table 4, for the FNN classifier, when using male /a/ vowel samples, the combination of MFCC, skewness, and kurtosis showed the best performance (87.50%) for classifying elderly and young voices. In the mixed data samples from women and men, an average accuracy of 87.50% was also obtained by using MFCC + skewness. In addition, the average performance was 75.00% for female data samples with the MFCC parameter. For the CNN classifier, the best accuracy (87.50%) for classifying elderly and young voices was obtained with the combination of MFCC and skewness in the case of male data samples; in mixed and female data samples, 78.95% and 75.00% accuracy were obtained with the MFCC + skewness + kurtosis and MFCC + skewness parameters, respectively.
Table 5 shows the classification results obtained by the MFCC delta, skewness, and kurtosis parameters for FNN and CNN. These results were based on the MFCC delta parameter. In Table 5, utilization of the FNN classifier with the MFCC + delta + skewness parameters obtained the highest accuracy (93.75%) in male /a/ vowel samples compared with the other classifier and parameter combinations. Additionally, 78.95% and 79.17% accuracy were obtained, respectively, by the MFCC + delta + skewness + kurtosis and MFCC + delta parameters with the FNN classifier for mixed and female data samples.
For the CNN classifier, 78.95%, 75.00%, and 81.25% accuracy was achieved with MFCC + delta, MFCC + delta + kurtosis, and MFCC + delta + skewness for the mixed, female, and male data samples, respectively. Table 6 shows the classification results obtained by the LPCC, skewness, and kurtosis parameters for FNN and CNN. These results were based on the LPCC parameter. For the FNN classifier, good accuracy (83.33%) was achieved for classifying elderly and young voices with the LPCC parameter in the case of female data samples. In mixed and male data samples, 78.95% and 81.25% accuracy were achieved by the LPCC and LPCC + kurtosis parameters, respectively. For the CNN classifier, similar results (75.00%) were shown for female and male data samples with LPCC + kurtosis and LPCC + skewness + kurtosis, respectively; 63.16% was achieved with LPCC + kurtosis in mixed female and male data samples. For male /a/ vowel data samples, utilization of the FNN classifier with the MFCC + delta + skewness and MFCC + delta + kurtosis parameters obtained the highest accuracy (93.75%) compared with the other combinations of CNN, FNN, and various parameters. The best accuracy (87.50% and 83.33%) was obtained with the MFCC + skewness and LPCC parameters for the FNN classifier in mixed and female data samples, respectively. Another finding is that the fusion of the skewness and kurtosis parameters had a positive effect on the overall accuracy of the classification. Most results were better with FNN than CNN across the mixed, female, and male data samples. Figure 5 summarizes the performance trends in terms of network, gender, and various parameters. In this figure, it can be seen that FNN performed well in most cases. In terms of gender, the use of male data samples demonstrated good performance with most parameters.
Therefore, rather than experimenting with mixed female and male data samples, the author recommends experimenting with and constructing a system that divides the data into female and male individuals. Figure 5c shows that the highest performance was 93.75%, obtained by using the parameters that added skewness and kurtosis to the MFCC and MFCC delta parameters. Of these, the addition of skewness showed the best performance. The figure also shows the better performance of using MFCC rather than LPCC. In conclusion, it can be seen that it is appropriate to use the MFCC and skewness parameters based on FNNs to achieve good classification of elderly and young voice signals.


Conclusions
It is important to objectively distinguish between elderly and young voices because physiological aging leads to specific changes in the anatomy and physiology of all structures involved in the production and modulation of the human voice [2,4-6,10,13]. Currently, most systems use all data at once, from children's voices to elderly voices, so when older people use a voice-activated system, their voices are often not recognized well [40-42]. Therefore, in order to develop a system for recognizing the voice of the elderly, it is necessary to understand the characteristics of changes in vocal cord tissue due to anatomical or physiological aging [3,15,16,18].
This study developed a classifier that distinguishes elderly voices from young voices by using deep learning algorithms and MFCC, LPCC, skewness, and kurtosis parameters. This study also compared the prediction performance of FNN and CNN based on combinations of various parameters to find out which combinations are best for classifying elderly and young voices. The results of this study revealed that the performance of FNN was higher than that of CNN. Future studies should therefore compare predictive performance on large datasets of elderly and young voice signals using diverse evaluation metrics suitable for the analysis objective, such as sensitivity and specificity, rather than a single performance index such as accuracy.
Another key finding of this study was that skewness, one of the higher-order statistics (HOS), was the most important predictor for identifying elderly voice signals in the deep learning models (FNN and CNN). Although the fusion of the skewness and kurtosis parameters had a positive effect on the overall accuracy of the classification, the highest performance (93.75%) appeared when the skewness was added to the MFCC and MFCC delta parameters. Specifically, the study found that skewness was more effective in combination with MFCC and MFCC delta parameters, and kurtosis was more effective in combination with LPCC parameters.
The last important discovery of this study is as follows. Most parameters estimated from male data samples demonstrated good performance. Rather than using mixed female and male data, this study recommends the development of separate systems that represent the best performance obtained for each optimized parameter by using independent data for males and females.
In conclusion, the purpose of this study was to provide HOS parameters and deep learning methods for the classification of young and elderly voice signals; clarify the network-, gender-, and parameter-related differences regarding classification performance; and find the best combination for achieving good classification of elderly and young voice signals. The results of this study will be the basis for preparing a voice interface optimized for the elderly, such as a smart mirror system and a conversation recognition system for the elderly. Future studies will be based on the spread of elderly speech processing, along with the parameters and deep learning algorithms that reflect important information about elderly voice signals to achieve high-quality classification performance of younger and older voice signals using large datasets.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. 2017R1A2B4011373). The sponsor had no involvement in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the article for publication.

Conflicts of Interest:
The author declares no conflict of interest.