Gender Classification Based on the Non-Lexical Cues of Emergency Calls with Recurrent Neural Networks (RNN)

Automatic gender classification in speech is a challenging research field with a wide range of applications in HCI (human-computer interaction). A couple of decades of research have shown promising results, but there is still a need for improvement. Until now, gender classification has been made using differences in the spectral characteristics of males and females. We assumed that a neutral margin exists between the male and female spectral range. This margin causes misclassification of gender. To address this limitation, we studied three non-lexical speech features (fillers, overlapping, and lengthening). From the statistical analysis, we found that overlapping and lengthening are effective in gender classification. Next, we performed gender classification using overlapping, lengthening, and the baseline acoustic feature, Mel Frequency Cepstral Coefficient (MFCC). We have tried to achieve the best results by using various combinations of features at the same time or sequentially. We used two types of machine-learning methods, support vector machine (SVM) and recurrent neural networks (RNN), to classify the gender. We achieved 89.61% with RNN using a feature set including MFCC, overlapping, and lengthening at the same time. Also, we have reclassified using non-lexical features with only data belonging to the neutral margin which was empirically selected based on the result of gender classification with only MFCC. As a result, we determined that the accuracy of classification with RNN using lengthening was 1.83% better than when MFCC alone was used. We concluded that new speech features could be effective in improving gender classification through a behavioral approach, notably including emergency calls.


Introduction
It is difficult to identify the age, intention, emotion, and gender of a speaker from telephone calls [1]. However, this information is considered essential for automatic speech recognition (ASR) since the cues can guide human-computer interaction systems to understand the needs of users [2,3]. Gender classification is especially useful in the field of ASR because gender-specific acoustic models can then be applied, which has been reported to improve performance [4]. Furthermore, it can be used in many fields, such as categorizing calls by gender (e.g., for surveys) [4,5]. Systems for gender classification have recently been developed [6,7]. During emergency calls, in particular, it is useful to detect the gender of the caller from the beginning of the call so that the call can be routed to the appropriate receiver according to the caller's gender in order to calm the caller if necessary.
Symmetry 2019, 11, 525 2 of 14
Moreover, emotional speech recognition systems that considered gender information proved to be more accurate than systems that did not [8]. Therefore, gender classification in emergency calls can be applied ahead of other recognition systems (e.g., emotion or age) to improve the quality of the interaction. However, gender classification in emergency calls has rarely been studied.
In ASR, gender can be classified accurately using acoustic information such as fundamental frequency (F0) and Mel Frequency Cepstral Coefficient (MFCC) [9][10][11]. Gender classification has frequently been based on pitch as a feature, as presented in References [9,12]. Males have a lower F0 than females during spontaneous speech, vocalization, and text recitation [13,14]. Additionally, males have greater fluctuations in vocal intensity compared to females during spontaneous speech. These results indicate that the F0 of speech is affected by the condition for males whereas females displayed a change in vocal intensity according to the condition. Research on determining differences in verbal expression through jitter, shimmer, MFCC, and other voice features is still ongoing [4,6,15]. However, the above studies have been conducted with voice data from ordinary situations rather than emergency situations. If callers in an emergency are excited or afraid, they may show different pitch, F0, or MFCC patterns than usual. Further investigation is needed to see whether these features show the same performance in emergency situations.
So far, MFCC has shown good results in gender classification. However, it has been limited in differentiating between a male with a female spectral band and a female with a male spectral band. To overcome this, we attempted to find non-lexical speech features that affect gender classification using data from emergency calls. The non-lexical speech utterances in emergency calls that may enable gender classification are fillers [16][17][18][19][20], overlapping [16][21][22][23][24], and lengthening [24,25].
Fillers are the most common non-lexical speech utterance in spontaneous speech [18]. They are similar to exclamations, which are very situation dependent, yet they lack any emotional connotations. A previous study reported that females frequently use '아' (uh) while males use '그' (ku) as fillers [16]. Another study revealed that males more often use '아' (ah), while females showed a tendency to use '어' (uh) as fillers. The sound '어' (ah) used by males has an assertive and extroverted speech utterance characteristic whereas females tend to use '어' (uh) to present a more modest and passive tone in their speech style [20]. Additionally, females use words such as '아' (uh), '음' (um), and '아' (ah) 1.5 to 1.7 times more than males for emphasis [19].
Overlapping occurs when the speaker interrupts another person mid-sentence to express his or her thoughts [23]. In a telephone conversation, males overlap more often than females; when talking to the same gender, females are more likely to overlap with positive expressions for confirmation or agreement. However, males are more likely to overlap with the intent to express criticism or make a negative remark. In other words, females tend to show a listening attitude, while males tend to present a selfish attitude when overlapping with another speaker [24]. For this reason, overlapping is a valuable factor for gender classification.
Lastly, lengthening is the formation of a long sound from the last syllable between clauses or phrases during speech and is highly situational. As an agglutinative language, Korean frequently displays combined morphemes in endings as a phrase or sentence (e.g., 내가 집에 가고 있는데, 'when I am going home'). Lengthening is usually presented as an extension of these morphemes because of the agglutinative nature of Korean. In particular, lengthening frequently appears in spontaneous speech and reflects the non-fluent speech style of Korean. When communicating in an emergency, lengthening appears to indicate hesitation about the urgency of the current state. In other words, the speaker hesitates because they cannot immediately explain the present situation. It is similar to a kind of blackout that can be caused by excessive emotional expression. In spontaneous speech analysis, females tend to extend the last syllable of the last word in their speech. The average duration of the last syllable is 300 ms, which is quite long [25]. In speech from travel agents' conversations with customers, there is also hesitation in the emphasized part of the word, and this can be classified as lengthening [24].
Most of the above papers related to non-lexical features were in the psychological, linguistic, and social fields. Their results depended only on statistical measures for verification, with speech data limited to normal situations, not emergencies. In addition, they did not attempt any computational approach, nor did they suggest a direction for performing the feature extraction automatically. It is therefore necessary to verify the extraction of non-lexical speech utterances more objectively, especially in an emergency.
Recently, most studies analyzing gender classification have focused on machine learning, especially deep learning, using MFCC. In Reference [26], the authors attempted to distinguish gender using the AMI meeting corpus. They obtained an accuracy of 90.99% with recurrent neural networks (RNN) with long short-term memory (LSTM). References [27,28] conducted gender classification using the FAU Aibo Emotion Corpus with RNN. They achieved 74.41% in the case of RNN and 76.03% in the case of LSTM-RNN. Additionally, Reference [29] described recognizing gender using 22 acoustic parameters on acoustic signals. They achieved 96.74% accuracy on the test dataset.
Further, many studies performed speech recognition using deep learning methods such as artificial neural networks (ANN) and deep neural networks (DNN) for detecting speaker-related information like emotion [30,31], age [27,28], and phoneme [32] across various fields. Recent trends in gender classification have applied multimodal signal processing combined with a variety of data (e.g., text and facial expressions), and these methods have yielded considerable results [33,34].
The experiments carried out in these papers aimed only to classify using machine learning or deep learning methods. They did not take into consideration a range of specific conditions. Some of the papers cited above used a public database, but these did not include emergency situations. The AMI meeting corpus [35] consists of recorded real meetings that include scenario-driven situations designed to elicit a range of realistic behaviors. It did not consider specific conditions such as an emergency situation. The FAU Aibo emotion corpus [36] likewise consisted of communication with a robot named Aibo. Similarly, there was a study that handled telephone speech for annotating voices from normal situations [37]. It was also applied under normal conditions, not an emergency.
We investigated non-lexical speech utterances in emergency calls. This study was aimed at identifying new features that can help with the gender classification of emergency calls. In previous studies, it was difficult to classify gender from the speaker's voice in an unexpected situation (i.e., an emergency). In this regard, we need to investigate additional features that can accurately classify gender from voices in unexpected situations.
We subsequently identified these features and presented the differences between our findings and prior research on non-lexical speech utterances (i.e., fillers, overlapping, lengthening) in emergency calls.
This paper is organized as follows: Section 2 describes the database and the experiment method for gender classification. The results and discussions are presented in Sections 3 and 4, and the conclusions are in Section 5.

Materials
The emergency call datasets were taken from the call center of the National Emergency Management Agency (NEIA) of Northern Gyeonggi province. In February 2015, we signed a memorandum of understanding (MOU) with the NEIA of Northern Gyeonggi province to cooperate on the development of technology for social security based on emergency calls.
The voice data were recorded as follows. First, a Zi-Log recording device (VOIP) was installed on four telephones at an emergency call center. This method allowed for the automatic separation of the speaker and the receiver during calls. The voice data were automatically saved in wave file format. Once the data were recorded, all personal information such as names and phone numbers was deleted. Finally, the data were compressed using the Encryto tool before being saved onto a secure hard drive (WD Drive 1Tb) and delivered to us. The data in this study were collected with a sampling rate of 8000 Hz, which means the spectral analysis was limited to the range of 0 to 4000 Hz.
We analyzed 335 datasets (male: 225, female: 117) consisting of 342 anonymous Korean native speakers. The speech data length was 22,201 sec (males: 15,360 sec, females: 6841 sec). Seven sets of voice data were excluded. In one set, the first speaker handed the call to a second person, so there were two speakers on the call. In another set, the speaker was a non-Korean who spoke Korean as a second language. In the other two sets of excluded data, no speech was present (Supplementary Materials).
All personal information in the data had been anonymized by the operator before being transferred to the authors, so that no individual could be contacted, in order to safeguard their privacy. In this regard, ethical approval for this study was waived by the Institutional Review Board (IRB) of Sejong University. According to the IRB of Sejong University, research involving the collection or study of existing data, documents, and records can be exempt from IRB review if these sources are publicly available or if the information is recorded by the investigator in such a manner that the subjects cannot be identified. All the data are anonymous, with no protected personal information included. Therefore, this study meets the requirements for exemption from IRB review.

Methods
Figure 1 shows an overview of our framework. The input is 2-channel speech consisting of two speakers. The data were pre-processed to highlight non-lexical speech features, which were then verified by statistical analysis. The effective features generated by feature extraction and selection were MFCC, overlapping, and lengthening. Machine learning techniques (SVM and RNN) were used to determine the speakers' genders and to produce the experimental results. The classification system also used majority voting to separate the classified voices from ambiguous voices by a neutral margin. The final output was the percentage likelihood that the voice belonged to a male or female speaker.

Non-Lexical Speech Utterances for Gender Classification
We investigated the occurrence frequency and the number of occurrences of the three non-lexical speech utterances (fillers, overlapping, lengthening) for each call. To minimize errors in manual gender classification by a group of listeners, we conducted crosschecks for each call. The listeners also practiced with training data before the analysis.
We investigated differences in non-lexical speech utterances by gender, with a focus on the following: the occurrence frequency, the number of occurrences (if occurred), and the results of correlation among non-lexical speech features based on statistical analysis. Based on previous research [16,18,20], the five most common filler words were identified as follows: '어' (ah), '그' (ku), '저' (ceo), '아' (uh), and '음' (um) (Table 1).
Overlapping was defined as the speaker starting to speak before the receiver had finished talking. Every instance in which the speaker interrupted the receiver was counted as one occurrence.
Lengthening was defined as the phenomenon of prolonged sounds between clauses and phrases uttered by the speaker, usually occurring at the end of words or phrases. Based on prior research [25], lengthening was identified when spoken syllables were prolonged for more than 300 ms.
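The two definitions above can be expressed as simple counting rules. The following is a minimal sketch only: in this study the utterances were annotated by listeners, so the time-stamped turn segments, the syllable durations, and all function names here are hypothetical inputs and helpers introduced for illustration. Only the 300 ms criterion and the per-minute normalization come from the text.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one speaking turn

# 300 ms criterion for lengthening, taken from the definition in the text.
LENGTHENING_THRESHOLD_SEC = 0.300


def count_overlaps(caller: List[Segment], receiver: List[Segment]) -> int:
    """Count caller turns that begin while a receiver turn is still in progress."""
    count = 0
    for c_start, _ in caller:
        # One occurrence per caller turn that starts inside any receiver turn.
        if any(r_start < c_start < r_end for r_start, r_end in receiver):
            count += 1
    return count


def count_lengthenings(syllable_durations_sec: List[float]) -> int:
    """Count syllables prolonged for more than the 300 ms threshold."""
    return sum(1 for d in syllable_durations_sec if d > LENGTHENING_THRESHOLD_SEC)


def per_minute(count: int, call_duration_sec: float) -> float:
    """Normalize a raw count to a frequency per minute of call time."""
    return count / (call_duration_sec / 60.0)


# Hypothetical toy call: turn boundaries and syllable durations are made up.
caller_turns = [(0.0, 2.0), (4.5, 6.0), (9.0, 10.0)]
receiver_turns = [(2.2, 5.0), (6.5, 8.5)]
n_overlap = count_overlaps(caller_turns, receiver_turns)
n_length = count_lengthenings([0.12, 0.35, 0.30, 0.41])
```

Counting one occurrence per interrupting turn follows the rule that every interruption is counted once; a real pipeline would also need voice activity detection and syllable segmentation, which are outside this sketch.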

Statistical Analysis
We used the following methods to find the differences in non-lexical speech utterances for gender classification using statistical analysis: Pearson's Chi-Square test, the independent samples t-test, and Pearson's correlation analysis. It was essential to validate the occurrence frequency and the number of occurrences for feature selection. The significance level was set at p < 0.05, and the analysis was conducted with SPSS 21.0 (IBM).
We created a decision tree to identify useful factors for gender classification. The decision tree is a type of data mining tool used to create a model that will detect correlations and patterns within data [38]. We created classification and regression trees (CRT) in an attempt to maximize within-node homogeneity (Figure 2). CRT is a method used for finding the frequency weights or influential variables that best delineate the dependent variables. We also performed 10-fold cross-validation to assess how well our tree structure can be generalized to a larger population. The data used for this process were randomly selected. The trees were generated in sequence, excluding the data from each subsample in turn. The cross-validation produced a single final model. The cross-validated risk estimate for the final tree was calculated as the average of the risks for all of the trees [38].
Table 1 (only two rows are recoverable from the source).
음 (um): (Neyga um Mani Appayo.) = I am um very sick.
그 (ku): 그 여기가 그 (ku Yeogiga ku) = ku here is ku
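Two of the tests named in this section can be sketched in a few lines of standard-library Python. This is an illustration, not a reproduction of the SPSS analysis: the 2x2 table combines the overlapping occurrence counts given in the results (146 males, 45 females) with the group sizes given in Materials (225 and 117), and treating those group sizes as the denominators is an assumption. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Rows: male, female. Columns: overlapping occurred, did not occur.
# ASSUMPTION: 225 and 117 are used as the per-gender totals.
stat, p = chi_square_2x2(146, 225 - 146, 45, 117 - 45)
significant = p < 0.05  # significance level used in the text
```

The same `pearson_r` helper could be applied to per-call frequencies of two utterance types (for example overlapping and lengthening) to obtain the kind of coefficient reported in the correlation analysis.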

Machine-Learning-Based Gender Classification
We present a method for gender classification of speech from emergency calls using the machine learning technique known as the support vector machine (SVM). SVM is employed for solving two-group classification problems [39] and is used to construct the optimal hyperplane with the largest margin for separating data between two groups. This classifier is widely used for pattern recognition and data analysis. We used the Python-based Scikit-learn toolbox [40].
We also utilized RNN, which is a powerful model for processing time-sequential data [41]. RNN was recently demonstrated to outperform feed-forward networks in speech recognition [42]. The RNN was implemented in TensorFlow using the basic RNN cell model with 128 dimensions, followed by one fully connected layer with two dimensions. The loss function was sigmoid cross-entropy with logits, and the learning rate was set to 0.0001. Each training epoch consisted of 500 instances. We used the RMSProp optimizer for optimization and applied dropout with a rate of 0.5.
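To make the recurrent model concrete, the sketch below writes out the forward pass of a basic (tanh) RNN cell followed by a fully connected output layer, together with the numerically stable form of sigmoid cross-entropy with logits. It is plain Python for illustration and is not the TensorFlow code used in the study; training, dropout, and the RMSProp update are omitted, and all function names and the tiny dimensions in the usage lines are assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(x: Vector, h_prev: Vector, w_xh: Matrix, w_hh: Matrix, b_h: Vector) -> Vector:
    """One basic RNN cell update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    a = matvec(w_xh, x)
    r = matvec(w_hh, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b_h)]


def classify_logits(frames: List[Vector], w_xh: Matrix, w_hh: Matrix, b_h: Vector,
                    w_out: Matrix, b_out: Vector) -> Vector:
    """Run the cell over all frames, then apply the fully connected output layer."""
    h = [0.0] * len(b_h)  # initial hidden state
    for x in frames:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
    return [o + b for o, b in zip(matvec(w_out, h), b_out)]


def sigmoid_cross_entropy_with_logits(logit: float, label: float) -> float:
    """Stable form of -z*log(s(x)) - (1-z)*log(1-s(x)): max(x,0) - x*z + log(1+exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Tiny hypothetical dimensions (3 inputs, 2 hidden units) instead of 16 and 128.
zeros_xh = [[0.0] * 3 for _ in range(2)]
zeros_hh = [[0.0] * 2 for _ in range(2)]
logits = classify_logits([[1.0, 2.0, 3.0]] * 2, zeros_xh, zeros_hh, [0.0, 0.0],
                         [[0.0, 0.0], [0.0, 0.0]], [0.5, -0.5])
```

In the configuration described above, the input frames would be feature vectors per time step, the hidden state would have 128 dimensions, and the output layer would have two dimensions.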
We conducted gender classification using three features. The first was the commonly used MFCC feature, which is useful for gender classification. We extracted the initial 16 coefficients produced with 19 filters in the Mel-filter bank. We hypothesized that a neutral margin exists between male and female voice data and that this neutral margin causes the misclassification of gender. Hence, the other two features were the non-lexical speech utterances (i.e., overlapping and lengthening) described in Section 3. Five-fold cross validation was used with 340 sets to test the gender classification results. We proceeded in two steps to complement the existing method, which used only MFCC.
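As an illustration of the filter bank mentioned above, the sketch below places 19 triangular filters equally spaced on the mel scale between 0 and 4000 Hz (the band available at an 8000 Hz sampling rate). The text does not state which mel formula or filter layout was used, so the common 2595 * log10(1 + f/700) conversion and the function names are assumptions.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mel (ASSUMED formula: 2595 * log10(1 + f/700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters: int = 19, f_min: float = 0.0,
                       f_max: float = 4000.0) -> List[float]:
    """Centre frequencies (Hz) of n_filters triangular filters equally spaced in mel."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_filters + 2 equally spaced edge points; the interior points are the centres.
    step = (mel_max - mel_min) / (n_filters + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(1, n_filters + 1)]


centers = mel_filter_centers()
```

The centres are denser at low frequencies and sparser toward 4000 Hz. MFCCs are then obtained by taking the log filter-bank energies and applying a discrete cosine transform, of which the first 16 coefficients would be kept in the setup described above.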
First, we performed gender classification with SVM and RNN using MFCC and non-lexical speech features (overlapping and lengthening) at the same time. The speech data length used for machine learning was 13,676.58 sec. Next, we sequentially reclassified genders using non-lexical speech features to clarify ambiguous results from the previous gender classification using MFCC alone. To determine the neutral margin between the two genders, we applied an empirical method, which resulted in a range of 10% above or below the boundary of the probability distribution functions of the two genders. The speech data length included in the neutral margin was 1831.14 sec for SVM and 1590.84 sec for RNN.
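The sequential step can be read as a two-stage decision rule. The sketch below is one possible interpretation, not the authors' implementation: it assumes the first-stage (MFCC-only) classifier outputs a probability of "male", that the boundary sits at 0.5, and that "10% above or below" means a band of 0.10 on that probability scale. The second-stage classifier is passed in as a hypothetical callable.

```python
from typing import Callable, List, Sequence

BOUNDARY = 0.5             # ASSUMED decision boundary on the probability scale
NEUTRAL_HALF_WIDTH = 0.10  # "10% above or below the boundary", as interpreted here


def in_neutral_margin(p_male: float) -> bool:
    """True if the first-stage probability falls inside the neutral margin."""
    return abs(p_male - BOUNDARY) <= NEUTRAL_HALF_WIDTH


def two_stage_classify(p_male_mfcc: Sequence[float],
                       second_stage: Callable[[int], str]) -> List[str]:
    """Keep confident MFCC decisions; reclassify samples inside the neutral margin.

    second_stage(i) is a hypothetical classifier that labels sample i from
    non-lexical features (e.g., lengthening).
    """
    labels = []
    for i, p in enumerate(p_male_mfcc):
        if in_neutral_margin(p):
            labels.append(second_stage(i))
        else:
            labels.append("male" if p > BOUNDARY else "female")
    return labels


# Toy usage with made-up probabilities and a stand-in second-stage classifier.
stand_in = {1: "male", 3: "female"}
labels = two_stage_classify([0.90, 0.55, 0.20, 0.45], lambda i: stand_in[i])
```

Only samples 1 and 3 in the toy input fall inside the margin, so only those are passed to the second stage; the confident decisions from the first stage are left unchanged.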

Descriptive Analysis: Non-Lexical Speech Utterances
Table 2 shows the occurrence frequencies of the three non-lexical speech utterances (fillers, overlapping, lengthening). The occurrence frequencies of overlapping and lengthening were useful for gender classification. Males had a higher occurrence frequency for overlapping than females (146 > 45), but the result was the opposite for lengthening. The difference was significant by gender (p < 0.001).
Table 3 shows the mean frequency per minute for the three non-lexical speech utterances. The results were significant for overlapping and lengthening. Overlapping was used more frequently by males than females, but lengthening was the reverse. However, the use of fillers was not significantly different because the usage was high for both genders.
Table 4 shows Pearson's correlation coefficients for the identified features. Overlapping had a significant correlation with fillers and lengthening, though the coefficient for lengthening (r = 0.454, p = 0.000) was much higher. The results for lengthening were similar to those for overlapping. However, with fillers, the coefficient for overlapping for females (r = 0.483, p = 0.006) was higher than that for males (r = 0.249, p = 0.003). Additionally, fillers did not have a significant correlation with lengthening (r = 0.098, p = 0.173). We selected valuable factors for gender classification based on the coefficient values and p-values. We decided on two non-lexical speech utterances, overlapping and lengthening. Next, we created a decision tree, which is a tool for data mining analysis, for gender classification.

Decision-Making Tree: CRT Analysis
We performed 10-fold cross validation using CRT analysis for gender classification based on overlapping and lengthening in emergency calls. Figure 3 shows one example of the structure models developed using CRT analysis. As shown in Figure 3, the cut-off point is vital for gender classification. The node was divided by lengthening; females were above 3.86 and males were below. The node was split again by a cut-off point for overlapping, specifically a frequency per minute of 1.60.
Table 5 shows the gain chart for the nodes used for gender classification with mean values. For males, the ratio of node 4 in the training set was 44.7% (N = 104) of the total (N = 233), and the percentage ratio of responses was 56.9% (N = 85) for male classification (N = 150). Males had the highest mean value at node 4 while females had a higher mean value at node 2. According to the training datasets, node 4 (N = 85, 82.8%) was the best node for classifying males. Females were placed at node 2 (N = 40, 66.8%) with the training set.
As shown in Table 6, the classification accuracy was 72.9% in the training set and 69.8% in the test set. This value was a result of 69.8% accuracy in the test set because of the training set. In summary, the predicted classification accuracy was approximately 72.9% in the training set and 69.2% in the test set. This means that the predicted accuracy was relatively high. Based on these results, the summary for gender classification is as follows: (1) The occurrence frequency of overlapping and lengthening is critical for gender classification. (2) The "cut-off points" for overlapping and lengthening [38] are different for each gender and (3) the accuracy of gender classification using overlapping and lengthening is 69.2%.
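The two cut-off points can be read as a small rule set. The sketch below is a hedged reading of the example tree, not a reproduction of Figure 3: the text gives the cut-offs (3.86 for lengthening, 1.60 per minute for overlapping), states that females were above the lengthening cut-off, and names node 2 and node 4 as the best nodes for females and males. The assignment of node 4 to the high-overlapping branch, the existence of a third leaf numbered 3, and the unit of the lengthening value are assumptions.

```python
LENGTHENING_CUTOFF = 3.86  # cut-off reported for lengthening
OVERLAPPING_CUTOFF = 1.60  # cut-off reported for overlapping (frequency per minute)

# Majority labels stated in the text for two of the leaves.
LEAF_LABEL = {2: "female", 4: "male"}


def crt_leaf(lengthening: float, overlapping_per_min: float) -> int:
    """Return the ASSUMED leaf node (2, 3 or 4) that one call falls into."""
    if lengthening > LENGTHENING_CUTOFF:
        return 2  # text: females were above the lengthening cut-off
    if overlapping_per_min > OVERLAPPING_CUTOFF:
        return 4  # assumed: high-overlapping branch, reported as best for males
    return 3      # assumed remaining leaf; its majority class is not stated


leaf = crt_leaf(lengthening=1.0, overlapping_per_min=2.0)
label = LEAF_LABEL.get(leaf, "undetermined")
```

Returning the leaf index rather than a label keeps the unstated leaf explicit, so a caller can decide how to treat calls that fall outside the two nodes discussed in the text.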

Gender Classification: SVM and RNN
Figure 4 shows the gender classification results using the baseline acoustic feature (MFCC) and non-lexical speech features (overlapping and lengthening). We achieved the highest accuracy when overlapping and lengthening were combined with MFCC, compared to using only MFCC or the other combinations of features. The accuracy of classification based on RNN was slightly higher than that of SVM for each feature set. Notably, the highest accuracy was 86.58% for SVM and 89.61% for RNN when all of the features (MFCC, overlapping, and lengthening) were used simultaneously.
Furthermore, we experimented with the neutral margin, defined as above, to investigate which feature set is the most effective and how to use it to improve gender classification. First, we empirically found the neutral margin based on the prior result, which was obtained with only MFCC. Then, we reclassified data within the neutral margin. Within the neutral margin, we achieved a higher accuracy with lengthening than with the other non-lexical speech features and with MFCC only. For SVM, it gained about 0.6% better accuracy over MFCC only (Figure 5). For RNN, the accuracy of non-lexical features was significantly improved, especially lengthening (Figure 6), compared to the other non-lexical speech features as well as MFCC only.
The best result was obtained using all of the features at the same time with RNN. Our experiments showed that with all of the data, the combination of overlapping and lengthening achieved a better performance than both MFCC only and overlapping or lengthening separately. However, with the neutral margin, the combination of overlapping and lengthening features gained a higher performance than overlapping and a slightly worse performance than lengthening.

Discussion
This experiment was conducted to determine whether the experimental method could accurately and automatically classify the gender of the speaker making a call to an emergency response center.The database targeted for this study should have reflected the actual situation.Over the last few decades, researchers have been trying to find new ways to classify speaker gender.MFCC has commonly been used as the baseline feature because it is based on the typical spectral differences of male and female voices.However, it has difficulty distinguishing the gender of speakers whose voices have intermediate spectral characteristics.
This study was conducted to overcome the shortcomings of existing methods in identifying speaker gender due to the various emotional states expressed in emergencies. This study was expected to be challenging because the situational factors examined in this study were different from those examined in other studies. If situational factors can be considered, higher accuracy would be obtained by combining non-lexical speech utterances that reflect situational conditions such as emergencies. Thus, to overcome the limitations of existing methods, this study analyzed non-verbal, non-lexical behavioral cues that have not been used for gender classification before. Statistical analysis of emergency calls verified that overlapping and lengthening are useful non-lexical speech utterance factors for gender classification.
We also conducted gender classification using two machine-learning techniques, SVM and RNN, with several features (MFCC, overlapping, and lengthening). As expected based on previous analyses, the combination of the proposed non-lexical features and the more traditional MFCC produced more accurate gender classification results than the use of MFCC alone (Figure 4). This result indicated that the proposed features can be used to overcome the shortcomings of existing gender classification methods.
This study was conducted to determine whether the proposed method could overcome the limitations of existing gender identification methods using only MFCC. It was hypothesized that it would be difficult to determine a speaker's gender within the neutral margin. This study investigated the effectiveness of using non-lexical speech features in determining the gender of speakers within the neutral margin. The results showed that non-lexical features were more effective than commonly used spectral features for determining the gender of speakers in the neutral margin. The proposed method, which used MFCC with overlapping and lengthening to classify the gender of speakers in the neutral margin, was more accurate than using only MFCC. In conclusion, the proposed gender classification method can overcome the weaknesses of existing gender classification methods (Figures 5 and 6). It can be considered both challenging and novel.
Even though gender classification through speech in emergency situations involves more obstacles, such as noise and emotional vocal traits, than in normal ones, the method proposed in this study performed better than standard methods. However, this study had some limitations that future research should account for. It was difficult to use both non-lexical speech features and MFCC over time because the non-lexical speech features occurred with a different frequency than MFCC over time. For example, it was possible to confirm overlapping by determining when a speaker finished an utterance, but lengthening sometimes appeared while speech was continuing. MFCC, by contrast, varies on a much shorter timescale than non-lexical speech features, so it is difficult to evaluate them when they are combined. If the difference in timescales is more effectively reflected in the experimental setup, we can expect a higher experimental accuracy. In addition, we have focused only on the speaker's gender without considering the receiver's gender in this study. If we try to classify the gender of both a caller and a receiver in emergency calls, we will likely identify additional speech features and statistical outcomes such as cut-off points or predictability.
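
One possible way to reconcile the two timescales, given here only as a hedged sketch and not as the procedure used in this study, is to pool the frame-level MFCC vectors of an utterance into a single vector and append per-minute rates of the non-lexical events. The function names, the choice of mean pooling, and the feature ordering are assumptions made for illustration.

```python
from statistics import fmean
from typing import List, Sequence


def pool_frames(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level vectors into one utterance-level vector.

    frames: one coefficient vector per short analysis frame (e.g., MFCC frames).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n_coeffs = len(frames[0])
    # Mean over time for each coefficient index.
    return [fmean(frame[k] for frame in frames) for k in range(n_coeffs)]


def utterance_vector(
    frames: Sequence[Sequence[float]],
    overlap_count: int,
    lengthening_count: int,
    duration_s: float,
) -> List[float]:
    """Build one fixed-length vector per utterance from both feature types.

    The non-lexical events are sparse, so they are expressed as
    occurrences per minute rather than per frame.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    minutes = duration_s / 60.0
    rates = [overlap_count / minutes, lengthening_count / minutes]
    # Pooled acoustic coefficients first, then the two event rates.
    return pool_frames(frames) + rates
```

Mean pooling discards temporal detail that a sequence model such as an RNN could otherwise exploit, so this sketch trades resolution for a common timescale; other alignments (for example, repeating the event rate on every frame) are equally conceivable.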

Conclusions
We investigated the usability of non-lexical speech utterances for gender classification with emergency calls to address the limitation of the baseline feature, MFCC. This study is of interest because it examines non-lexical speech utterances in special situations rather than normal ones. Furthermore, the non-lexical speech utterances proposed in this study may be utilized as supporting materials for speech recognition and speech processing.
We also confirmed that overlapping and lengthening had complementary effects when combined with MFCC, indicating that gender classification can be improved by using both types of features (acoustic and non-lexical speech features). Furthermore, if we combine both types of features, we can expect to achieve higher accuracy using machine learning (e.g., SVM, RNN). In summary, the results may assist in classifying the gender of a speaker when analyzing voices in unexpected situations, such as emergencies.
We expect that the results will substantially contribute to the improvement of gender classification in integrated, efficient intelligent support systems that can be used for emergency rescues. Furthermore, our findings can help to advance the use of non-lexical speech utterance features extracted from a speaker's voice for situation awareness.

Figure 1. Framework for gender classification using emergency calls.

Figure 2. Example of the classification and regression tree (CRT) model.

Figure 3. Example of one sample structure model in gender classification using CRT (10-fold cross-validation); left: training set, right: test set.

Figure 5. Gender classification using SVM with neutral margin.

Figure 6. Gender classification using RNN with neutral margin.

Table 1. Examples of fillers.

Table 2. Occurrence frequency per minute by non-lexical speech utterance.
Numbers represent the number of subjects (percentages); p < 0.05.

Table 3. Mean frequency of the non-lexical speech utterances by gender.

Table 4. Correlation between the non-lexical speech utterances (fillers, overlapping, and lengthening).

Table 5. Gain chart in nodes to predict gender classification using mean values.

Table 6. Mean accuracy for gender.