Using Data Augmentation and Time-Scale Modification to Improve ASR of Children’s Speech in Noisy Environments

Abstract: Current ASR systems show poor performance in recognition of children’s speech in noisy environments because recognizers are typically trained with clean adults’ speech. There are therefore two mismatches between the training and testing phases: clean speech in training vs. noisy speech in testing, and adult speech in training vs. child speech in testing. This article studies methods to tackle the effects of these two mismatches in recognition of noisy children’s speech by investigating two techniques: data augmentation and time-scale modification. In the former, clean training data of adult speakers are corrupted with additive noise in order to obtain training data that better correspond to the noisy testing conditions. In the latter, the fundamental frequency (F0) and speaking rate of children’s speech are modified in the testing phase in order to reduce differences in the prosodic characteristics between the testing data of child speakers and the training data of adult speakers. A standard ASR system based on DNN–HMM was built, and the effects of data augmentation, F0 modification, and speaking rate modification on word error rate (WER) were evaluated first separately and then by combining all three techniques. The experiments were conducted using children’s speech corrupted with additive noise of four different noise types in four different signal-to-noise ratio (SNR) categories. The results show that the combination of all three techniques yielded the best ASR performance. As an example, the WER value averaged over all four noise types in the SNR category of 5 dB dropped from 32.30% to 12.09% when the baseline system, in which no data augmentation or time-scale modification was used, was replaced with a recognizer that was built using a combination of all three techniques.
In summary, in recognizing noisy children’s speech with ASR systems trained with clean adult speech, considerable improvements in the recognition performance can be achieved by combining data augmentation based on noise addition in the system training phase and time-scale modification based on modifying F0 and speaking rate of children’s speech in the testing phase.


Introduction
Automatic speech recognition (ASR) has many potential applications for children in areas such as education (learning new languages and other skills), games, and entertainment. Building ASR systems for child users is, however, challenging for several reasons. First, ASR applications are typically used by children in noisy environments, and the data collection to cover different noise conditions is particularly difficult. Second, due to general problems in recording child speech (e.g., it is difficult to control recording conditions for children, and talkers are not always collaborative), it is difficult to collect enough training data to build ASR systems for children. Therefore, the performance of ASR systems in recognition of children's speech degrades due to the mismatch caused by training and testing under different noise conditions and due to the mismatch caused by training the system with adults' speech and testing with children's speech.
The current study addresses two challenges: (1) recognition of children's speech in noisy conditions using an ASR system trained with clean speech, and (2) recognition of children's speech using an ASR system trained with adult speech due to the lack of training data from child speakers. In order to address these two challenges, the study combines data augmentation and time-scale modification. In the former, a straightforward data augmentation method is used by corrupting the training data of the ASR system with additive noise to obtain new speech data that correspond better with the testing data in noisy conditions. In the latter, two modification methods (F0 modification and speaking rate modification) are used to modify the prosodic characteristics of the children's speech in the testing phase towards the prosodic characteristics of the adult speech that is used in the system training phase. F0 and speaking rate were selected as the prosodic features to be modified because their modification is easy to implement and has shown promising results in previous studies [33,34,39]. In addition, these factors can be modified more robustly from noisy speech than factors such as formants, whose estimation deteriorates in noisy environments. Figure 1 shows a flow diagram demonstrating how data augmentation and time-scale modification are used in the current investigation. Note that it would also be possible to modify the prosodic structure of adult speech in the training phase using time-scale modification, but this would have to be done separately for each noise condition and is therefore not feasible.
Data augmentation and time-scale modification have been investigated separately in previous ASR studies (e.g., [40-42] for the former and [33,34,39,43] for the latter). However, the effect of combining these techniques has not been investigated in recognition of noisy children's speech before. Therefore, the main contribution of the current study is to investigate how the performance of a children's speech ASR system that suffers from the two challenges described in the beginning of this section is affected when using data augmentation, F0 modification, and speaking rate modification either separately or by combining these techniques one by one. The study shows encouraging results, indicating that while none of the three previously studied methods alone gives an adequate improvement in the recognition performance, the combination of the three approaches as implemented in the current study results in a considerable improvement in recognition of children's speech in noisy conditions.

Figure 1. A flow diagram describing the ASR scenario studied in the current article to recognize noisy children's speech using training based on clean adults' speech. The two techniques studied, data augmentation and time-scale modification, are marked with red and blue color, respectively.

The remainder of the paper is organized as follows. The two main techniques studied, data augmentation and time-scale modification, are first described in Sections 2 and 3, respectively. Section 4 describes the speech databases and the ASR system used in the study. The results of the ASR experiments are reported in Section 5 by describing in separate sub-sections how data augmentation, F0 modification, speaking rate modification, and finally the combination of the three affect the recognition performance. The results are discussed in Section 6, and the conclusions of the study are drawn in Section 7. The list of abbreviations used in this study is given in Table 1.

Data Augmentation
In this study, we used noise addition as the data augmentation strategy. The main motivation for this type of data augmentation is to capture more acoustic variability of the data to improve the ASR system performance in noisy environments. The proposed approach is demonstrated in the block diagram shown in Figure 2. The input to the data augmentation procedure is clean adult speech taken from an existing large database, WSJCAM0 [44], which will be described in Section 4. Augmentation is conducted by corrupting the clean input signal with additive noise by varying the SNR from 0 dB to 15 dB with a step size of 5 dB and by using four different types of noise (babble, white, factory, and volvo) extracted from the NOISEX-92 database [45]. It should be noted that the data augmentation approach is used in this study to generate new, noise-corrupted data for training, but the original clean adult speech taken from the WSJCAM0 database is not included in the generated training data.
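The noise-addition step described above can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the function name `add_noise_at_snr`, the random choice of the noise segment, and the synthetic test signals are assumptions made here, and the SNR is computed over the whole utterance without any speech activity detection.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng=None):
    """Mix `noise` into `speech` so that the mixture has the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    # Repeat the noise if it is shorter than the speech (assumed policy).
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    # Pick a random noise segment of the same length as the speech.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(segment ** 2)  # assumed to be non-zero
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * segment

# SNR categories from 0 dB to 15 dB with a step size of 5 dB.
SNR_LEVELS_DB = (0, 5, 10, 15)

# Synthetic stand-ins for a clean utterance and a noise recording.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)
noise = rng.standard_normal(8000)
noisy = {snr: add_noise_at_snr(clean, noise, snr, rng) for snr in SNR_LEVELS_DB}
```

In an actual augmentation run, `clean` would be an utterance from the adult corpus and `noise` a recording of one of the four noise types.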
Data augmentation is performed in the ASR experiments of the current study using three scenarios: the "same" scenario, the "different" scenario, and the "all" scenario. The "same" scenario refers to testing the ASR system in circumstances where the test speech is corrupted by one type of noise (e.g., babble), and system training is based on using the same noise type in data augmentation. The "different" scenario refers to corrupting the test speech with one noise type and using the other three types in augmenting the training data (e.g., babble noise in testing, and factory, volvo as well as white noise in augmentation). The "all" scenario refers to testing with speech corrupted by one noise type (e.g., white) but using all four noise types in data augmentation.
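The three scenarios differ only in which noise types are used to corrupt the training data for a given test noise. The sketch below encodes that rule; the function name `augmentation_noises` is a hypothetical helper introduced here for illustration.

```python
NOISE_TYPES = ("babble", "white", "factory", "volvo")

def augmentation_noises(test_noise, scenario):
    """Return the noise types used in augmentation for a given test noise."""
    if test_noise not in NOISE_TYPES:
        raise ValueError(f"unknown noise type: {test_noise}")
    if scenario == "same":
        # Training data are corrupted with the noise type seen in testing.
        return [test_noise]
    if scenario == "different":
        # Training data are corrupted with the other three noise types.
        return [n for n in NOISE_TYPES if n != test_noise]
    if scenario == "all":
        # Training data are corrupted with all four noise types.
        return list(NOISE_TYPES)
    raise ValueError(f"unknown scenario: {scenario}")
```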

Time-Scale Modification
In order to address the second challenge of ASR for children's speech described in Section 1 (the mismatch caused by the lack of adequate training data from child speakers), the present study investigates modifying the time-scale structure of children's speech in the system testing phase. The goal of the time-scale modification is to make the testing data of child speakers more similar to the training data of adult speakers by modifying two prosodic features of speech, F0 and speaking rate. Both the F0 modification and the speaking rate modification were conducted using the same method, the real-time iterative spectrogram inversion with look-ahead (RTISI-LA) algorithm [35,46,47]. RTISI-LA was originally developed as a method for estimating time-domain signals from overlapping magnitude spectra that have been computed frame by frame. RTISI-LA is a phase recovery technique similar to the Griffin-Lim algorithm [46,48]. It is, however, much faster than the Griffin-Lim algorithm and is therefore well suited to applications like the current study, where large amounts of speech data need to be processed. The RTISI-LA method consists of the following steps.

• The speech signal is processed in frames of L samples by computing the short-time Fourier transform magnitude (STFTM) spectrum using the FFT with the Hamming window. The frame shift (S) is selected as S = L/4 so that each frame overlaps with three previous and three following frames. In the following, the frame index is denoted by m and the window function by w(n).
• To reconstruct the speech signal from its STFTM, an iterative frame-by-frame signal estimation process is applied. Let us suppose that the first m − 1 frames of the speech signal have already been reconstructed from the STFTM, and let us denote this signal by x_{m−1}(n). The task is to synthesize x_m(n).
• In order to estimate the mth frame, a partial analysis frame is created using overlap-adding (OLA) for the (m − 1)th, (m − 2)th, and (m − 3)th frames of x(n), considering an overlap of 75%. The fourth quarter of this partially filled frame is filled with zeros. Let the partial frame be denoted by x_{m−1}(n)w(n − mS). In RTISI-LA, the future k frames influence the reconstruction of the mth frame. After the mth frame is generated, it is kept uncommitted until the (m + k)th frame is generated.
• Next, the Fourier transform of the partial frame is computed using a scaled Hamming window.
• The phase information computed from the Fourier transform of the partial frame is then combined with the STFTM for the mth frame.
• The inverse Fourier transform of the derived frequency-domain signal produces a new estimate for the mth frame. In each iteration, the estimation of x(n) is updated.
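Two building blocks of the steps above can be sketched in a few lines: the STFTM computation with a Hamming window and a frame shift of L/4, and a single update in which the phase of the partial frame is combined with the target magnitude. This is a deliberately simplified sketch under stated assumptions and not an implementation of RTISI-LA: the look-ahead over k future frames, the repeated iterations, the overlap-add bookkeeping, and the window scaling are all omitted, and the function names `stftm` and `estimate_frame` are introduced here for illustration only.

```python
import numpy as np

def stftm(signal, frame_len):
    """Magnitude spectrogram with a Hamming window and a frame shift of L/4."""
    shift = frame_len // 4
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    frames = np.stack([signal[m * shift:m * shift + frame_len] * window
                       for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def estimate_frame(partial_frame, target_magnitude):
    """One phase-borrowing update for a single frame.

    partial_frame: windowed time-domain frame built from earlier frames.
    target_magnitude: STFTM of the frame that is to be synthesized.
    """
    # Take the phase from the partially filled frame ...
    phase = np.angle(np.fft.rfft(partial_frame))
    # ... combine it with the known magnitude of the target frame ...
    spectrum = target_magnitude * np.exp(1j * phase)
    # ... and return to the time domain to obtain a new frame estimate.
    return np.fft.irfft(spectrum, n=len(partial_frame))
```

By construction, the returned frame has exactly the target magnitude spectrum, and if the partial frame already equals the true windowed frame, the update returns it unchanged.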
As described in [46], the RTISI-LA algorithm can be used to modify both the F0 and speaking rate of speech signals. By using the notations of [46], these two modifications were conducted in the current study as follows by using child speech as input.
F0 modification. The F0 modification was computed by first re-sampling the input speech signal in the time domain. To modify the F0 of the input speech signal downwards by factor q, where 0 < q < 1, the input frame of L samples was re-sampled to obtain a longer frame of L′ samples (i.e., L = qL′), which was then used in the STFTM computation. We used simple linear interpolation in re-sampling because it is computationally inexpensive and was reported in [46] to provide reasonable sound quality. The value of L was fixed to 160 samples. The STFTMs of the overlapping frames (i.e., the magnitude spectrogram) were then processed using the RTISI-LA algorithm (steps 1-6 above) to obtain the F0-modified time-domain output signal. The value of the factor q was selected by conducting ASR experiments and by searching for the value of q that yielded the lowest WER. More details about this will be given in Section 5.2.
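The re-sampling step can be illustrated with linear interpolation as follows. This is a sketch under assumptions rather than the code used in the study: the helper name `stretch_frame` is hypothetical, the 160-sample length is taken to be the input frame length, and the output length is rounded to the nearest integer.

```python
import numpy as np

def stretch_frame(frame, q):
    """Re-sample a frame to about len(frame) / q samples by linear interpolation."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must satisfy 0 < q < 1 to lower F0")
    n_in = len(frame)
    n_out = int(round(n_in / q))  # longer frame (rounding is an assumption)
    # Sample the input frame at n_out evenly spaced fractional positions.
    positions = np.linspace(0.0, n_in - 1, n_out)
    return np.interp(positions, np.arange(n_in), frame)

frame = np.hanning(160)              # stand-in for a 160-sample speech frame
longer = stretch_frame(frame, 0.80)  # 200 samples
```

Because the stretched frame spans more samples per pitch period at the same sampling rate, its frequency content is scaled by q; as a purely arithmetic illustration, a 250 Hz component would map to 200 Hz for q = 0.80.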
Speaking rate modification. The modification of the speaking rate with RTISI-LA is based on the frequency-domain approach proposed in [46]. In this approach, the STFTM of the input signal is computed in the analysis stage using a frame shift of S_a, and the signal is transformed to the time domain in the synthesis stage with RTISI-LA using a different value of the frame shift (denoted by S_s). The amount of modification is controlled by the factor α, which is defined such that S_a = S_s/α. By using 0 < α < 1.0, the analysis frame shift becomes larger than the synthesis frame shift, the output signal becomes shorter than the input, and the speaking rate is thereby increased. In our experiments, the speaking rate modification was conducted using a frame size of L = 256 samples and by fixing S_s = L/4. The value of α was selected by searching for the value that yielded the lowest WER, as will be explained in Section 5.3.
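The relation between the two frame shifts can be made concrete with a short calculation. The sketch below only evaluates S_a = S_s/α and the resulting approximate output length; the helper names are hypothetical, and how a non-integer analysis shift is handled in practice (rounding or interpolation) is not specified in the text and is left open here.

```python
def frame_shifts(frame_len=256, alpha=0.74):
    """Return (analysis shift S_a, synthesis shift S_s) in samples."""
    s_s = frame_len / 4   # synthesis shift fixed to L/4
    s_a = s_s / alpha     # analysis shift from S_a = S_s / alpha
    return s_a, s_s

def output_length(n_samples, alpha):
    """Approximate output length: every S_a input samples yield S_s output samples."""
    return n_samples * alpha

s_a, s_s = frame_shifts(256, 0.74)
print(f"S_s = {s_s:.0f} samples, S_a = {s_a:.2f} samples")
print(f"16000 input samples -> about {output_length(16000, 0.74):.0f} output samples")
```

With α < 1 the analysis stage advances through the input faster than the synthesis stage writes the output, which is why the speech is shortened and the speaking rate rises by roughly a factor of 1/α.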

Speech Databases and the ASR System
The ASR experiments of the study were conducted using two openly available speech databases. The adults' speech data used in training were taken from the WSJCAM0 British English speech corpus [44]. The children's speech data used for testing were taken from the PF-STAR British English speech corpus [49]. Detailed information about the two databases is given in Table 2. The children's speech data were split into two parts: the validation set and the test set. The former consisted of 2.5 h of speech from 62 speakers with an age range of 6-14 years. The latter consisted of 1.1 h of speech from 60 speakers with an age range of 4-13 years.

To build the ASR systems, the Kaldi toolkit [50] was used. Deep neural network (DNN)-based context-dependent hidden Markov models (HMMs) were used for acoustic modeling of the cross-word triphones. Decision-tree-based state tying was performed with the maximum number of tied states (senones) fixed at 2000. Prior to learning the parameters of the DNN-HMM-based ASR system, the fMLLR-normalized feature vectors were time-spliced considering a context size of 9 frames. The number of hidden layers in the DNN was set to 5, with 1024 hidden units in each layer. The non-linearity in the hidden layers was modeled using the tanh function. The initial learning rate was set to 0.005 and was reduced to 0.0005 over 15 epochs of training the DNN-HMM system. A minibatch size of 512 was used in the DNN training. In decoding the test set of children's speech, a 1.5 k domain-specific bigram language model (LM) was used. This bigram LM was trained on the transcripts of the speech data in PF-STAR after excluding the test set. In total, 1969 words were used, including pronunciation variations, in the lexicon for decoding the children's test set.

Results
As the first step in the series of experiments, we trained a baseline ASR system using original clean adult speech from the WSJCAM0 database and tested the system with original clean children's speech from the PF-STAR database (i.e., no data augmentation or time-scale modification was used). As expected, this system gave a poor performance (WER = 19.58%) due to the mismatch induced by having adults' speech in training and children's speech in testing. We then also included the other mismatch type discussed in the introduction by testing the same system with noise-corrupted children's speech. The results indicated, as expected, that the system performance deteriorated severely: in the noise condition with SNR = 5 dB, for example, the WER value rose to 82.67%, 87.40%, 92.32%, and 46.12% for babble, white, factory, and volvo noise, respectively. As also reported in previous studies [22,51], these experiments indicated that there is considerable room for improvement in recognition of children's speech in noisy conditions. In the following sub-sections, we report the results of the experiments that were conducted to improve the system performance step by step by first using data augmentation, then time-scale modification based on F0 modification, then time-scale modification based on speaking rate modification, and finally all three methods combined.
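For readers less familiar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference and the recognized word sequence, divided by the number of reference words. The sketch below illustrates that standard definition; it is not the scoring tool used in the study, and the function name `word_error_rate` is introduced here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent between two whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions.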

Results Obtained by Using Data Augmentation
The effect of augmenting the training data (of adult speech) with the method described in Section 2 was studied by testing the ASR system with noise-corrupted children's speech. To generate noisy test data, the children's speech signals of the PF-STAR database were corrupted with additive noise using four noise types (babble, white, factory, volvo) and four SNR categories (between 0 dB and 15 dB). For each individual noise condition, we built a baseline ASR system, which was trained using solely the noise-corrupted adult speech of the corresponding condition (i.e., no data augmentation was used in the training of the baseline systems). The results of the ASR experiments are reported in Table 3.
From Table 3, the following observations can be made. (1) The data augmentation based on the "same" condition improved the ASR performance compared to the baseline, and the "different" condition worsened the performance in all noise types and SNR categories. (2) The data augmentation based on the "all" condition improved the performance compared to the baseline for babble and factory noise but decreased the performance for white and volvo noise. (3) Despite the fact that the data augmentation improved the performance compared to the baseline systems in many of the scenarios studied, overall the WER values were still unacceptably high.

Table 3. WERs obtained using data augmentation in recognizing noisy children's speech of four noise types in four SNR categories. Results obtained with the baseline system, which was trained without data augmentation using solely the noise-corrupted speech of the corresponding condition, are shown in the third column. WERs obtained using the three data augmentation scenarios described in Section 2 are shown in the fourth, fifth, and sixth columns.

Results Obtained by Using F0 Modification
Compared to adult speakers, children typically use higher F0 values, and they also vary the F0 of their speech over a larger range [52,53]. This phenomenon is illustrated in Figure 3, which shows F0 histograms computed from adults' speech signals taken from the WSJCAM0 database (panel (a)) and from children's speech signals taken from the PF-STAR database (panel (b), gray color). For this analysis, 100,000 frames of speech were collected from adults' and children's speech. The figure shows that for the adults' speech signals, there are two peaks close to 100 Hz and 200 Hz, corresponding to the average F0 of male and female speakers, respectively. The peak in the children's F0 histogram is located higher (around 250 Hz), and the histogram also shows a clearly larger spread of F0 values compared to the adults' F0 values. When the children's speech signals were processed with the RTISI-LA method described in Section 3 using q = 0.80, the F0 histogram shown in orange in panel (b) was obtained. By comparing panels (a) and (b), it can be clearly seen that RTISI-LA succeeded in bringing the original F0 histogram of children's speech much closer to that of adults' speech.

To evaluate the effect of F0 modification on the ASR performance, the RTISI-LA method was used to modify all the validation data of children's speech that were noise-corrupted with four different noises with varying SNR values. In these tests, we used the same value of q for all the data in all the noise conditions. This q value was determined by varying the parameter between 0.95 and 0.75 (in steps of 0.05) for the different noise types and SNR categories and by searching for the q value that yielded the lowest WER value in each case. This optimization method was selected because it is straightforward and has a low computational cost. These experiments, demonstrated in Figure 4, indicated that the lowest WER value was achieved in most cases with q = 0.80. Therefore, we fixed the F0 modification parameter to q = 0.80 and processed the noisy children's test data using this parameter value with the RTISI-LA method. The WER values of these experiments are reported in Table 4. It can be seen that the F0 modification improved the recognition performance in all the noise types and SNR categories studied.

Figure 4. WERs as a function of the F0 modification factor q for each of the four noise types and four SNR categories using the validation set. Note that the smaller the value of q (i.e., towards the right end of the x-axis), the larger the strength of the F0 modification.

Table 4. WERs obtained in recognizing noisy children's speech of four noise types in four SNR categories using F0 modification. The baseline system is the same as in Table 3.
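The selection of a single q value from the validation experiments can be sketched as a small grid search: the best q is found for each noise condition, and the value that wins in most conditions is kept. The helper name `select_shared_factor` is hypothetical, and the WER numbers in the example are made-up placeholders that only exercise the code; they are not results from this study.

```python
from collections import Counter

Q_GRID = (0.95, 0.90, 0.85, 0.80, 0.75)  # 0.95 to 0.75 in steps of 0.05

def select_shared_factor(wer_by_condition):
    """Pick the factor that gives the lowest WER in the most conditions.

    wer_by_condition maps a condition, e.g. (noise type, SNR in dB),
    to a dict {factor: validation WER}.
    """
    winners = [min(wers, key=wers.get) for wers in wer_by_condition.values()]
    return Counter(winners).most_common(1)[0][0]

# Placeholder validation WERs (illustrative only, NOT from the paper).
toy_wers = {
    ("babble", 5): {0.95: 40.0, 0.90: 38.0, 0.85: 36.5, 0.80: 35.0, 0.75: 36.0},
    ("white", 5): {0.95: 45.0, 0.90: 43.5, 0.85: 42.0, 0.80: 41.0, 0.75: 41.5},
    ("volvo", 5): {0.95: 20.0, 0.90: 19.0, 0.85: 18.0, 0.80: 18.5, 0.75: 19.5},
}
shared_q = select_shared_factor(toy_wers)
```

The same routine would apply to any other scalar modification factor that is tuned per condition on the validation set and then fixed.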

Results Obtained by Using Speaking Rate Modification
In addition to F0, speaking rate is another major factor in the mismatch between children's and adults' speech. Children typically speak at slower rates compared to adults [35,52,53]. The effect of the speaking rate modification algorithm described in Section 3 is demonstrated in Figure 5, which shows the histograms of the number of phonemes per second. For this analysis, 500 utterances of adult speech and 500 utterances of child speech were taken from the WSJCAM0 and PF-STAR databases, respectively, and the speaking rate of the latter was modified with the RTISI-LA algorithm described in Section 3 by using α = 0.74. The difference in the histograms between the adults and children can be noted easily in Figure 5: the children's utterances show clearly smaller numbers of phonemes per time unit compared to the adults' utterances, that is, the children speak at a lower rate. The lower panel of Figure 5 shows the effect of the RTISI-LA algorithm when the speaking rate of the children was modified. By comparing the histograms shown in this panel with the histogram shown in the upper panel of Figure 5, it can be observed that the histogram of the rate-modified child utterances is closer to that of the adults' utterances than the histogram computed from the original children's utterances.

To evaluate the effect of the speaking rate modification on the ASR performance, the RTISI-LA algorithm was applied to the noise-corrupted children's speech of the entire validation set. We first varied the speaking rate modification factor α from 0.83 to 0.68 (in steps of 0.03) in the same noise conditions and evaluated the corresponding WER value for all the data. As demonstrated in Figure 6, the lowest WER value was obtained using α = 0.74 for most of the noise types and SNR categories. Therefore, we fixed α = 0.74 and modified the speaking rate of the children's speech test data in all noise conditions. The results of these ASR experiments are shown in Table 5. From Table 5, it can be noted that the speaking rate modification algorithm improved the system performance over the baseline system for all the noise types and SNR levels.

Table 5. WERs obtained in recognizing noisy children's speech of four noise types in four SNR categories using speaking rate modification. The baseline system is the same as in Table 3.

Results Obtained by Using the Combined System
As the last step, we evaluated the recognition of noisy children's speech by combining the data augmentation scheme described in Section 2 with the two time-scale modification methods described in Section 3. The evaluation was computed in a similar manner to that in Section 5.1 by using the four noise types and the four SNR categories and by including the three different data augmentation scenarios ("same", "different", and "all"). In each of the three scenarios, the recognizer trained in the corresponding data augmentation scenario was tested with noisy children's speech that was time-scale-processed using F0 modification, speaking rate modification, or both. The WER results obtained are reported in Table 6 by referring to data augmentation, F0 modification, and speaking rate modification as DA, F0M, and SRM, respectively. From this table, the following main observations can be made. First, the system that combined all three studied components (i.e., DA + F0M + SRM) performed considerably better than the baseline in all the cases studied. In addition, combining both time-scale modification methods with DA yielded the best combined system in all cases. Second, by comparing the best combined systems (i.e., DA + F0M + SRM) between the three data augmentation scenarios, it can be seen that the WER values were best in the "all" scenario for all the noise types and SNR categories.

Figure 6. WERs as a function of the speaking rate modification factor α for each of the four noise types and four SNR categories using the validation set. Note that the smaller the value of α (i.e., towards the right end of the x-axis), the larger the strength of the speaking rate modification.

Table 6. WERs obtained in recognizing noisy children's speech of four noise types in four SNR categories using different combinations of data augmentation (DA), F0 modification (F0M), and speaking rate modification (SRM). The baseline system is the same as in Table 3. WERs are reported separately for each of the three data augmentation scenarios described in Section 2.

Discussion
Achieving high accuracy in recognition of children's speech is difficult using state-of-the-art ASR systems because of two types of mismatch between system training and testing. First, children typically use ASR applications in noisy environments, such as when playing games and when taking part in education with other children. Therefore, when the system is trained with clean speech, there is a mismatch between the testing and training stages. Second, due to practical problems in recording young child speakers, little training data is available from child speakers, and current ASR systems are mostly trained using adults' speech only. Therefore, when these ASR systems are used to recognize children's speech, another mismatch arises between the system training and testing stages. The severity of these two mismatches was first demonstrated in the current study using a standard ASR system: a poor WER value (of about 20%) was obtained in recognizing clean children's speech using the system trained with adults' speech. Furthermore, when the children's speech was contaminated with noise in order to simulate the use of children's ASR in realistic environments, the recognition performance deteriorated severely to WER values larger than 80% in some noise conditions.
In order to tackle the effects caused by the two mismatches described above in recognition of children's speech in noisy conditions, the current study investigated the utilization of data augmentation and time-scale modification. The time-scale modification technique consisted of two parts, modification of F0 and modification of speaking rate, which were used to bring the prosodic structure of the children's speech test data closer to that of the adults' speech used in the system training. The experiments of the study were planned in order to first investigate how the recognition performance was affected when each of the modification techniques was utilized alone in building the recognizer, after which all the studied techniques were combined, aiming at the best system. The experiments were conducted using an existing deep neural network (DNN)-based recognizer. Although CNNs are increasingly used in ASR, they call for larger amounts of training data compared to DNN-based systems. Therefore, choosing a DNN-based architecture was justified for the current investigation, and we leave the verification of the studied approach with large and complex systems as future work.

The data augmentation involved corrupting the original clean training data of adults' speech using additive noise of different types and SNR categories. Three augmentation scenarios ("same", "different", "all") were generated, and these scenarios differ in the way the noise type in testing is seen by the data augmentation procedure. The recognition experiments indicated that data augmentation yielded a consistent improvement in WER only when the noise type was the same in the augmentation and testing. When the noise type in testing was different from that used in data augmentation, the performance decreased considerably compared to the baseline, and this happened in all the noise conditions studied. Hence, the utilization of the straightforward data augmentation approach based on noise-corrupting the adult speech in the system training stage did not give an adequate improvement in recognition of noisy children's speech.

As the next steps, ASR experiments were conducted by time-scale modifying the prosodic structure of the children's speech in the test stage. The experiments showed that both F0 modification and speaking rate modification improved WER values compared to the baseline system and that this happened for all the noise conditions studied. Of the two modification methods, F0 modification yielded smaller WER values in all noise conditions. As the final step in our experiments, we combined data augmentation with F0 modification, with speaking rate modification, and with both of them. The results indicated that combining data augmentation with both of the time-scale modification methods yielded the lowest WER in all noise conditions studied. For this combination, the WER values obtained in the three data augmentation scenarios were lowest in "all", second lowest in "same", and highest in "different".

Conclusions
Poor accuracy is obtained by standard DNN-HMM-based ASR systems in recognition of children's speech in noisy conditions due to two mismatches between the system training and testing. The study indicated that an effective way to tackle the effects of these mismatches is to combine data augmentation (adding different types of noise to adults' speech in the training phase) with the modification of both the F0 and speaking rate structure of children's test speech, which brings the children's speech in testing closer to the adults' speech in training. The study showed that compared to the baseline ASR system, the recognizer that combined data augmentation and time-scale modification yielded substantial improvements in WER in all the noise conditions studied. For example, for the most severe noise type (factory noise), the WER values obtained using the baseline system were very poor in all SNR categories (the average WER = 41.29%), but the WER values obtained by the combined system in the "all" scenario dropped to clearly lower levels (the average WER = 14.88%). For the least severe noise type (volvo noise), the corresponding improvement in the averaged WER given by the combined system was from 19.93% to 9.14%. Despite the fact that the study showed promising results in recognition of noisy children's speech, new research is needed to understand, for example, how the reported results are affected when the amount of adult speech is increased in training. Moreover, studying how the proposed ASR system functions in the presence of adversarial attacks is another topic for future investigation.

Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: http://www.thespeechark.com/pf-star-page.html and https://catalog.ldc.upenn.edu/LDC95S24 (accessed on 10 June 2021).