Multimodal Unsupervised Speech Translation for Recognizing and Evaluating Second Language Speech

This paper addresses automatic proficiency evaluation and speech recognition for second language (L2) speech. The proposed method recognizes the speech uttered by an L2 speaker, measures a variety of fluency scores, and evaluates the proficiency of the speaker's spoken English. Stress and rhythm scores are among the important factors used to evaluate fluency in spoken English and are computed by comparing the stress patterns and rhythm distributions to those of native speakers. To compute the stress and rhythm scores even when the phonemic sequence of the L2 speaker's English sentence differs from the native speaker's, we align the phonemic sequences using a dynamic time-warping approach. We also improve the performance of the speech recognition system for non-native speakers, and thereby compute fluency features more accurately, by augmenting the non-native training dataset and training an acoustic model on the augmented dataset. In this work, we augment the non-native speech by converting some speech signal characteristics (style) while preserving the linguistic information. The proposed variational autoencoder (VAE)-based speech conversion network trains the conversion model by decomposing the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor, allowing it to estimate diverse and robust speech styles. Experimental results show that the proposed method effectively measures the fluency scores and generates diverse output signals. In the proficiency evaluation and speech recognition tests, the proposed method also improves the proficiency score performance and speech recognition accuracy in all proficiency areas compared to a method employing conventional acoustic models.


Introduction
As the demand for contactless ("untact") technology in various fields increases and machine learning technologies advance, the need for computer-assisted second language (L2) learning content has grown [1][2][3][4]. A widely used method for learning a second language is to practice listening, repeating, and speaking the language. GenieTutor, one such second language (currently English) learning system, plays the role of a language tutor by asking questions to learners, recognizing their speech, which is answered in the second language, checking grammatical errors, evaluating the learners' spoken English proficiency, and providing feedback to help L2 learners practice their English. The system comprises several topics, and learners can select a topic to communicate with the system based on role-play scenarios. After the learner finishes speaking each sentence, the system measures various fluency factors, such as pronunciation score, word score, grammar error, stress pattern, and intonation curve, and provides learners with feedback comparing them against the fluency factors of native speakers [5][6][7].
The stress and rhythm scores are among the important factors for fluency evaluation in English speaking, and they are computed by comparing the learner's stress patterns and rhythm distributions with those of native speakers. The proposed VAE-based conversion network decomposes the spectral features of speech into a speaker-invariant content factor and a speaker-specific style factor, which allows the encoder models to be trained robustly. The encoded content factor is fed into a decoder with a target style factor to generate converted spectral features. By sampling different style factors, the proposed model is able to generate diverse and multimodal outputs. In addition, we train our speech conversion model from nonparallel data, because parallel data of the source and target speakers are not available in most practical applications and are difficult to collect. By transferring some speech characteristics and converting the speech, we generate additional training data from nonparallel data and train the acoustic model (AM) with the augmented training dataset.
We evaluated the proposed method on a corpus of English read speech for spoken English proficiency assessment [48]. In our experiments, we evaluated the fluency scoring ability of the proposed method by measuring fluency scores and comparing them with those of native speakers; the results demonstrate that the proposed DTW-based fluency scoring method can compute stress patterns and measure stress and rhythm scores effectively even when there are pronunciation errors in the learner's utterances. The spectral-feature outputs demonstrate that the proposed conversion model can efficiently generate diverse signals while preserving the linguistic information of the original signal. Results of the proficiency evaluation and speech recognition tests with and without the augmented speech dataset also show that data augmentation with the proposed speech conversion model improves speech recognition accuracy and proficiency evaluation performance compared to a method employing conventional AMs.
The remainder of this paper is organized as follows. Section 2 briefly describes the second language learning system used in this work. Section 3 describes the proposed DTW-based fluency scoring and VAE-based nonparallel speech conversion methods. In Section 4, experimental results are reported. Finally, Section 5 concludes the paper.

Previous Work
GenieTutor is a computer-assisted second language (English at present) learning system. In order to help learners practice their English proficiency, the system recognizes the learners' spoken English responses for given questions, checks content properness, automatically checks and corrects grammatical errors, evaluates spoken English proficiency, and provides educational feedback to learners. Figure 1 shows the schematic diagram of the system [7].

The system comprises two learning stages: Think&Talk and Look&Talk. The Think&Talk stage has various subjects, and each subject comprises several fixed role-play dialogues. In this stage, an English learner can select a study topic and a preferred scenario and then talk with the system based on the selected role-play scenario. After the learner's spoken English response to each given question is completed, the system computes an intonation curve, a sentence stress pattern, and word pronunciation scores. The learner's and a native speaker's intonation curve patterns are plotted as a graph, and the stress patterns of the learner and native speaker are plotted as circles of different sizes below the corresponding word to represent the intensity of each word at the sentence stress level. In the Look&Talk stage, the English learner can select a picture and then describe the selected picture to the system. Once the learner has finished all conversations on the selected subject or all descriptions of the selected picture, the system semantically and grammatically evaluates the responses and provides overall feedback. Figure 2 shows an example of a role-play scenario and the overall educational feedback provided by the system.

Proposed Fluency Scoring and Automatic Proficiency Evaluation Method
Proficiency evaluation with the proposed method consists of fluency feature extraction for scoring each proficiency area, proficiency evaluation model training with the fluency features, and automatic evaluation of pronunciation proficiency. The proposed method computes various acoustic features, such as speech rate, intonation, and segmental features, from spoken English uttered by non-native speakers according to a rubric designed for evaluating pronunciation proficiency. To compute the fluency features, speech signals are recognized using the automatic speech recognition system, and time-aligned sequences of words and phonemes are computed using a forced-alignment algorithm. Each time-aligned sequence contains the start and end times and acoustic scores for each word and phoneme. Using the time-aligned sequences, the fluency features are extracted for various aspects of each word and sentence. Proficiency evaluation models are trained using the extracted fluency features and scores from human expert raters, and proficiency scores are computed using the fluency features and the scoring models. Figure 3 shows a block diagram of the proficiency evaluation model training and evaluation system for automatic proficiency evaluation.
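As a rough sketch of the scoring stage described above, the snippet below fits a linear proficiency model to fluency features by least squares. The feature values, the column meanings, and the rater scores are illustrative stand-ins, not the paper's data or model:

```python
import numpy as np

# Toy fluency-feature matrix: one row per utterance.
# Columns (illustrative): speech rate, rhythm score.
X = np.array([
    [3.1, 0.55],
    [4.2, 0.74],
    [2.5, 0.38],
    [4.8, 0.88],
    [3.6, 0.66],
])
# Holistic proficiency scores (scale 1-5) from human expert raters.
y = np.array([2.0, 4.0, 1.5, 5.0, 3.0])

# Fit a linear scoring model by least squares (bias term appended).
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def proficiency_score(features):
    """Predict a proficiency score from a fluency-feature vector."""
    return float(np.append(features, 1.0) @ w)

print(round(proficiency_score([4.0, 0.70]), 2))
```

In practice, any regression model trained on the extracted fluency features against the human raters' scores can play this role; the linear fit is only the simplest instance.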


DTW-Based Feature Extraction for Fluency Scoring
Most language learning systems evaluate and score a learner's spoken English by comparison with a native speaker's. In realistic speaking situations, however, a learner's English pronunciation often differs from that of native speakers. For example, learners may pronounce given words incorrectly or pronounce words that differ from the reference. In such cases, some fluency features, especially stress and rhythm scores, cannot be measured with previous pattern-comparison methods. To solve this problem and measure more meaningful scores, the proposed method aligns the phonemic sequence of the sentence uttered by the learner with the native speaker's phonemic sequence through dynamic time-warping (DTW) alignment and computes the stress patterns, stress scores, and rhythm scores from the aligned phonemic sequences. Figure 4 shows a block diagram of the proposed DTW-based stress and rhythm scoring method.

DTW-Based Phoneme Alignment
Dynamic time-warping is a well-known technique for finding an optimal alignment between two time-dependent sequences by comparing them [15]. To compare and align the two phonemic sequences uttered by the learner and the native speaker (reference), we compute a local cost matrix for the two sequences using the Euclidean distance. Typically, if the learner's phoneme is similar to the native speaker's, the local cost is small; otherwise, it is large. The total cost of an alignment path between the learner's and native speaker's phonemic sequences is obtained by summing the local cost values for each pair of aligned elements. The optimal alignment path is the one with minimal total cost among all possible alignment paths, and the goal is to find this path and align the two phonemic sequences at minimal overall cost. Figure 5 shows an example of DTW-based phonemic sequence alignment and the computed stress patterns.
Error phonemes caused by phoneme mismatch are marked in red, and the stress value of an error phoneme is set to 3, which is not a standard stress value (0 = no stress, 1 = secondary stress, 2 = primary stress), in order to deduct from the stress score according to the presence of phonemic errors. By tagging error phonemes, the proposed method evaluates the learner's utterance more accurately and helps L2 learners practice their English pronunciation.
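The alignment step above can be sketched as follows. The phoneme symbols and the 0/1 substitution cost are illustrative stand-ins for the Euclidean distance on phoneme features used in the paper:

```python
import numpy as np

def dtw_align(ref, hyp, dist):
    """Align two phoneme sequences with dynamic time-warping.

    ref, hyp: sequences of phonemes (symbols or feature vectors);
    dist: local cost function between two elements.
    Returns the optimal path as (ref_index, hyp_index) pairs and its total cost.
    """
    n, m = len(ref), len(hyp)
    # Accumulated-cost matrix, borders initialized to infinity.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(ref[i - 1], hyp[j - 1])
            # Allowed steps: match/substitute, deletion, insertion.
            D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # Backtrack from (n, m) to recover the optimal alignment path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1], D[n, m]

# Symbolic example: the learner inserts a phoneme and substitutes AH -> AA.
native = ["HH", "AH", "L", "OW"]
learner = ["HH", "AA", "L", "L", "OW"]
path, cost = dtw_align(native, learner, lambda a, b: 0.0 if a == b else 1.0)
print(path, cost)
```

The mismatched pair on the path (AH aligned to AA) is exactly the kind of error phoneme that would be marked in red and assigned the non-standard stress value 3.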

Stress and Rhythm Scoring
Given the aligned phonemic sequences, the proposed method computes the word stress score, sentence stress score, and rhythm scores. In order to measure the word and sentence stress scores, word stress patterns are computed for each content word in the given sentence, and sentence stress patterns are computed for the entire sentence. Then, the word and sentence stress scores are measured by computing the similarity between the learner's stress patterns and the native speaker's stress patterns.
The rhythm scores are measured by computing the mean and standard deviation of the time intervals between the stressed phonemes. An example of computing the rhythm score in the sentence "I am still very sick. I need to take some pills." is as follows:


•	Compute the stress patterns from the aligned phonemic sequences. Table 1 shows an example of a sentence stress pattern. The start times of the stressed phonemes, including the start and end times of the sentence, are highlighted (bold in Table 1) to compute the rhythm features.
•	Select the stressed phonemes (highlighted points in Table 1) and compute the mean and standard deviation of the time intervals between them.

Table 2 shows an example of the mean time interval and the standard deviation of the time intervals for each pronunciation proficiency level evaluated by human raters. Proficiency scores 1, 2, 3, 4, and 5 indicate very poor, poor, acceptable, good, and perfect, respectively. As shown in Table 2, the lower the proficiency level, the greater the mean and standard deviation of the time intervals between the stressed phonemes. The two stress scores and the rhythm scores are used together with other features for spoken English proficiency evaluation.
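The rhythm features in the steps above can be computed directly from the stressed-phoneme start times; the time values below are illustrative:

```python
import numpy as np

# Start times (seconds) of the stressed phonemes in the aligned sentence,
# including the sentence start and end times (values are illustrative).
stressed_times = [0.00, 0.42, 0.95, 1.38, 1.90, 2.47, 3.05]

# Rhythm features: mean and standard deviation of the time intervals
# between consecutive stressed phonemes.
intervals = np.diff(stressed_times)
mean_interval = float(np.mean(intervals))
std_interval = float(np.std(intervals))
print(round(mean_interval, 3), round(std_interval, 3))
```

A learner with less native-like rhythm would show larger values for both features, matching the trend reported in Table 2.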

Automatic Proficiency Evaluation with Data Augmentation
The speech recognition system is optimized for non-native speakers as well as native speakers for educational purposes and smooth interaction. Speech features for computing fluency scores are extracted and decoded into time-aligned sequences by forced alignment using the non-native acoustic model (AM). In addition, multiple AM scores are used to evaluate proficiency. To improve speech recognition accuracy and time-alignment performance, and to compute AM scores more accurately and meaningfully, we augment the training speech dataset and train the non-native AM on the augmented training dataset.
In this work, we convert some speech characteristics (style) to generate speech data for augmentation. In the proposed speech conversion model, we assume that each spectral feature of the speech signal can be decomposed, in a latent space, into a speaker-independent content factor that we want to preserve and a speaker-specific style factor that we want to change. After extracting the content factor from the source speech signal, the proposed conversion model converts the source speech to the desired speech style by extracting the style factor of the target speech and recombining it with the extracted content factor. By simply choosing the style factor for this recombination to be the source style factor or the target style factor, the conversion model can reconstruct or convert speech:

x̂_{s→s} = D(E_c(x_s), E_s^s(x_s)),  x̂_{s→t} = D(E_c(x_s), E_s^t(x_t)),

where x̂_{s→s} and x̂_{s→t} are the reconstructed and converted spectra, x_s and x_t are the source and target speech spectra, D is the decoder, and E_c, E_s^s, and E_s^t denote the content encoder, source style encoder, and target style encoder, respectively. The content encoder network is shared across both speakers, and the style encoder networks are domain-specific networks for the individual speakers. Figure 6 shows a block diagram of the proposed speech conversion method.

Figure 6. Flow of the proposed variational autoencoder (VAE)-based nonparallel speech conversion method. The decoder network is shared. Solid arrows indicate the flow related to the source spectra or common elements (e.g., source style factor, content factor), and dashed arrows indicate the flow belonging to the target spectra (e.g., target spectra, target style factor).
As shown in Figure 6, the content encoder network extracts the content factor and is shared across all domains. All convolutional layers of the content encoder are followed by instance normalization (IN) to remove the speech style information and learn domain-independent content information (the phonemes in the speech). The style encoder network computes the domain-specific style factor for each domain and is composed of multiple separate style encoders (the source and target style encoders in Figure 6) for the individual domains. In the style encoders, IN is not used, because it removes the speech style information.
We jointly train the encoders and the decoder with multiple losses. To keep the encoder and decoder as inverse operations and to ensure that the proposed system can reconstruct the input spectral features after encoding and decoding, we use a reconstruction loss:

L_recon = E[ ‖D(E_c(x_s), E_s^s(x_s)) − x_s‖_1 ].

For the content and style factors, we apply a semi-cycle loss in the latent variable → speech spectra → latent variable coding direction, as the latent space is partially shared. Here, a content reconstruction loss encourages the translated content latent factor to preserve the semantic content information of the input spectral features, and a style reconstruction loss encourages the style latent factors to extract and change the speaker-specific speaking style information. The two semi-cycle losses for the source speech are computed as follows:

L_c = E[ ‖E_c(D(c, s_t)) − c‖_1 ],
L_s = E[ ‖E_s^t(D(c, s_t)) − s_t‖_1 ],

where c denotes the content factor, and s_t and s_s denote the target style factor and source style factor, respectively. The losses for the target speech are computed similarly. The full loss of the proposed speech conversion method is the weighted sum of all losses, defined as follows:

L_total = λ_1 L_recon + λ_2 L_c + λ_3 L_s,

where λ_1, λ_2, and λ_3 control the weights of the components.
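To make the roles of these losses concrete, the following toy sketch treats the utterance-level mean of the spectra as the "style" and the mean-removed residual as the "content" (an instance-normalization-like split). The functions E_c, E_s, and D here are illustrative stand-ins, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the networks: "style" is the utterance-level mean of the
# spectra, and "content" is the mean-removed residual.
def E_c(x):            # content encoder (shared across speakers)
    return x - x.mean()

def E_s(x):            # style encoder (one per speaker in the real model)
    return x.mean()

def D(c, s):           # shared decoder
    return c + s

x_s = rng.normal(loc=2.0, size=80)    # source speech spectra (illustrative)
x_t = rng.normal(loc=-1.0, size=80)   # target speech spectra (illustrative)

# Reconstruction: decode the source content with the source style.
x_ss = D(E_c(x_s), E_s(x_s))
L_recon = np.abs(x_ss - x_s).mean()

# Conversion: decode the source content with the target style, then apply the
# semi-cycle losses by re-encoding the converted spectra.
c, s_t = E_c(x_s), E_s(x_t)
x_st = D(c, s_t)
L_content = np.abs(E_c(x_st) - c).mean()   # content should survive conversion
L_style = np.abs(E_s(x_st) - s_t)          # style should match the target

print(L_recon, L_content, L_style)
```

In this idealized split the three losses are essentially zero, because the toy encoders and decoder are exact inverses; the actual model instead learns E_c, E_s^s, E_s^t, and D jointly by minimizing the weighted sum of these losses over nonparallel data.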

Dialogue Exercise Result
To validate the effectiveness of the proposed method, we performed computer-assisted fluency scoring experiments with spoken English sentences collected in dialogue scenarios of the GenieTutor system. Figure 7 shows an example of a role-play scenario and the fluency score feedback produced by the proposed method. Once the learner completes a sentence utterance, the system computes several aspects of pronunciation evaluation and displays them in diagram form. Learners can check their fluency scores by selecting the sentences they want to review, and they are provided with overall feedback after finishing all conversations. As shown in Figure 7, the proposed method can efficiently compute the intonation curves and stress patterns of the sentences uttered by the learner even when pronunciation errors occur. In addition, error words are marked in red so that the learner can see where the errors are.


Speech Database
We also performed the proficiency evaluation test using the rhythm and stress scores together with other fluency features. The speech dataset was selected from an English read speech dataset read by non-native and native speakers for spoken proficiency assessment. The dataset is a corpus of English speech spoken by Koreans and 7 American English native speakers (references) for experimental phonetics, phonology, and English education; it is designed to examine Korean speakers' intonation and rhythmic patterns in English connected speech and the segmental pronunciation errors that Korean speakers are apt to make. Each utterance was scored by human expert raters on a scale of 1 to 5. In this study, gender and spoken language proficiency levels were evenly distributed among the speakers. Table 3 shows sample scripts. The speech dataset comprised 100 non-native speakers; for each speaker, 80 sentences were used for training and another 20 sentences, not included in the training dataset, were used for testing. For speech conversion and augmentation, an additional 7 American English native speakers (3 males and 4 females) were used, with 100 sentences per speaker; frame alignment of this dataset was not performed. We used the WORLD package [49] for speech analysis. The sampling rate of all speech signals reported in this paper was 16 kHz, the frame shift was 5 ms, and the number of fast Fourier transform (FFT) points was 1024. From each extracted spectral sequence, 80 Mel-cepstral coefficients (MCEPs) were derived.

Human Expert Rater
Each spoken English sentence uttered by the non-native learners was annotated by four human expert raters who have English teaching experience or are currently English teachers. Each non-native utterance was rated on five proficiency area scores: holistic impression of proficiency, intonation, stress and rhythm, speech rate and pause, and segmental accuracy. Each proficiency score was measured on a fluency scale of 1-5. A holistic score for each utterance is calculated as the average of all proficiency scores and is used for proficiency evaluation in this paper. Table 4 shows the mean correlation between the human expert raters' holistic scores.
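The mean inter-rater correlation reported in Table 4 can be computed as in the following sketch; the rater scores here are illustrative, not the paper's annotations:

```python
import numpy as np

# Holistic scores (scale 1-5) from four human expert raters (columns) for
# six utterances (rows); the values are illustrative.
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
    [4, 4, 4, 5],
    [3, 3, 2, 3],
], dtype=float)

# Pairwise Pearson correlations between raters, then the mean over the
# distinct rater pairs (upper triangle of the correlation matrix).
R = np.corrcoef(scores, rowvar=False)
iu = np.triu_indices_from(R, k=1)
mean_corr = float(R[iu].mean())
print(round(mean_corr, 3))
```

A high mean correlation indicates that the raters agree on the holistic scores, which justifies averaging them into a single reference score for training the evaluation models.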

Data Augmentation
The proposed VAE-based speech conversion model consisted of a content encoder, a style encoder, and a joint decoder. The content encoder comprised two dilated convolutional layers and a recurrent layer of gated recurrent units (GRUs). In order to remove the speech style information, all convolutional layers were followed by instance normalization (IN) [50]. The style encoder comprised a global average pooling layer, a 3-layer multi-layer perceptron (MLP), and a fully connected layer. In the style encoder, IN was not used because it removes the original feature mean and variance, which represent the speech style information. The content and style factors were then fed into the decoder to reconstruct or convert the speech. The decoder comprised two dilated convolutional layers and a GRU-based recurrent layer. Each convolutional layer in the decoder was followed by an adaptive instance normalization (AdaIN) layer whose parameters were generated by the MLP from the style factor [50].
The AdaIN operation can be written as

AdaIN(z) = γ · (z − μ(z)) / σ(z) + β,

where z is the activation of the previous convolutional layer, μ(·) and σ(·) denote the channel-wise mean and standard deviation, and γ and β are the affine parameters generated by the MLP from the style factor. Figure 8 shows an example of Mel-spectrograms obtained by the proposed method. Comparing the decoding results, we confirmed that the proposed method reconstructs and converts the spectral features effectively.
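For illustration, the per-channel AdaIN step can be sketched in plain Python (a minimal sketch under our own naming; the actual model operates on spectral feature maps, not lists):

```python
import math

def adain(channel, gamma, beta, eps=1e-5):
    """Adaptive instance normalization for one feature channel:
    normalize the channel to zero mean / unit variance, then apply
    the style-derived scale (gamma) and shift (beta)."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    std = math.sqrt(var + eps)  # eps guards against a constant channel
    return [gamma * (x - mean) / std + beta for x in channel]

# Content activations are renormalized toward the style statistics:
# the output channel has mean ~beta and standard deviation ~gamma.
out = adain([1.0, 2.0, 3.0], gamma=2.0, beta=0.5)
```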

We performed a perception test to compare the sound quality and speaker similarity of converted speech between the proposed VAE-based speech conversion method and the conventional conditional VAE-based speech conversion (CVAE-SC) method [29], which is one of the most common speech conversion methods. We conducted an AB test and an ABX test. "A" and "B" were outputs from the proposed method and the CVAE-SC, and "X" was a real speech sample. To eliminate order bias, "A" and "B" were presented in random order. In the AB test, each listener was presented with the "A" and "B" audios at a time and was asked to select "A", "B", or "fair" by considering both speech naturalness and intelligibility. In the ABX test, each listener was presented with two audios and a reference audio "X", and was then asked to select the audio closer to the reference, or "fair". We used 24 utterance pairs for the AB test and another 24 utterance pairs, not included in the AB test, for the ABX test. The number of listeners was 20. Figure 9 shows the results, which confirm that the proposed method outperforms the baseline in terms of both sound quality and speaker similarity.

We also performed a speech recognition test using the English read speech dataset to validate that the spectral features were converted meaningfully. We used ESPnet [51] as an end-to-end ASR system. We trained the AM using only the training dataset ("Train database only" in Table 5) and evaluated the test dataset, and we compared the recognition results to those obtained by evaluating the same test dataset with the AM trained on the augmented dataset ("Augmentation" in Table 5). Table 5 shows the word error rate (WER) results. For comparison, SpecAugment [21], the speed perturbation method [20], and CVAE-SC were used as references. As shown in Table 5, data augmentation with the proposed method improves the speech recognition accuracy for all proficiency score levels compared to a method employing a conventional AM and the other augmentation methods. By sampling different style factors, the proposed speech conversion method is able to generate diverse outputs, although its computational complexity is higher than that of the other methods.

All features for proficiency scoring are computed based on the time-aligned phone sequence and its time information [11,12,14]. Table 6 shows the proficiency scoring feature list used to train the automatic proficiency scoring models in this work.
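The WER used in Table 5 is the standard word-level edit distance normalized by the reference length; a minimal sketch (our own implementation, not the ESPnet scorer):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference
    length, computed via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```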
We used two modeling methods to train scoring models with high agreement with the human expert raters: (1) multiple linear regression (MLR) and (2) a deep neural network. MLR is simple and has long been used for automatic proficiency scoring. Based on the MLR scoring model, the proficiency score s is computed as follows:

s = Σ_i α_i f_i + β,

where i is the index of each feature, α_i is the weight associated with each scoring feature f_i, and β is a constant intercept. We also used a neural network to train the proficiency scoring model nonlinearly and more accurately. The neural network comprised a convolutional layer, a hidden layer with 3 hidden units, and a fully connected layer. Given the 41 features, the neural network trains the proficiency scoring model.
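The MLR scoring step amounts to a weighted sum of the features; a minimal sketch with placeholder weights (the fitted α and β values are not given here):

```python
def mlr_score(features, weights, intercept):
    """Proficiency score from multiple linear regression:
    score = sum_i alpha_i * f_i + beta."""
    assert len(features) == len(weights)
    return sum(a * f for a, f in zip(weights, features)) + intercept

# Hypothetical example using three of the 41 scoring features
score = mlr_score([0.8, 0.5, 0.9], weights=[1.2, 0.7, 2.0], intercept=0.3)
```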

Proficiency Evaluation Results
In order to validate that the proposed automatic proficiency evaluation system measures the proficiency scores effectively and meaningfully, we computed and compared the Pearson correlation coefficient between the proficiency scores of the proposed system and those of the human raters. The Pearson correlation coefficient is a commonly used metric for evaluating the performance of proficiency assessment methods [52][53][54]. Tables 7 and 8 show the proficiency evaluation results obtained by the proposed method with and without data augmentation. For comparison, the range of correlation coefficients of the inter-rater scores ("Human" in Tables 7 and 8) was used as a reference. As shown in Tables 7 and 8, the proposed automatic proficiency evaluation method measures proficiency scores effectively for all proficiency area scores. In addition, data augmentation for AM training with the proposed speech conversion method improves the averaged correlation performance for all proficiency area scores compared to the method employing a conventional AM trained without data augmentation. By automatically evaluating the proficiency of the L2 speaker's utterance, the proposed proficiency scoring system is able to perform fast and consistent evaluation in various environments.

Table 7. Correlation between the human raters and the proposed proficiency scoring system without data augmentation.

Conclusions and Future Work
We proposed an automatic proficiency evaluation method for L2 learners of spoken English. In the proposed method, we augmented the training dataset using the VAE-based speech conversion model and trained the acoustic model (AM) with the augmented training dataset to improve the speech recognition accuracy and time-alignment performance for non-native speakers. After recognizing the speech uttered by the learner, the proposed method measured various fluency features and evaluated the proficiency. In order to compute the stress and rhythm scores even when phonemic sequence errors occur in the learner's speech, the proposed method aligned the phonemic sequences of the spoken English sentences using DTW and then computed the error-tagged stress patterns and the stress and rhythm scores. In experiments with the English read speech dataset, we showed that the proposed method effectively computed the error-tagged stress patterns, stress scores, and rhythm scores. Moreover, we showed that the proposed method efficiently measured proficiency scores and improved the averaged correlation between the human expert raters and the proposed method for all proficiency areas compared to the method employing a conventional AM trained without data augmentation.
The proposed method can also be applied to other signal processing and generation problems, such as sound conversion between instruments or generation of various images. However, the current style conversion framework has the limitation that the conversion model learns domain-level style factors rather than the diverse pronunciation styles of the multiple speakers included in each domain. In order to learn more meaningful and diverse style factors and to perform many-to-many speech conversion, we plan to address automatic speaker label estimation and the extension to speaker-specific style encoders in future work.
Author Contributions: Conceptualization, methodology, validation, formal analysis, writing-original draft preparation, and writing-review and editing, Y.K.L.; supervision and project administration, J.G.P. All authors have read and agreed to the published version of the manuscript.