KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition

This paper introduces a large-scale spontaneous speech corpus of Korean, named KsponSpeech. This corpus contains 969 h of general open-domain dialog utterances, spoken by about 2000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. The transcription provides a dual transcription consisting of orthography and pronunciation, and disfluency tags for spontaneity of speech, such as filler words, repeated words, and word fragments. This paper also presents the baseline performance of an end-to-end speech recognition model trained with KsponSpeech. In addition, we investigated the performance of standard end-to-end architectures and the number of sub-word units suitable for Korean, and examined issues that should be considered in Korean spontaneous speech recognition. KsponSpeech is publicly available on an open data hub site of the Korean government.


Introduction
In artificial intelligence (AI) services, speech recognition systems are used in various applications, such as AI assistants, dialog robots, simultaneous interpretation, and AI tutors. The performance of automatic speech recognition (ASR) systems has been markedly improved by applying deep learning algorithms and collecting large speech databases [1]. Most conventional ASR systems [2,3] consist of various modules, such as the acoustic model, language model, and pronunciation dictionary, which are trained separately. In recent years, end-to-end ASR systems [4][5][6][7], which can be directly trained to maximize the probability of a word sequence given an acoustic feature sequence, have been the research focus. Many researchers [7,8] reported that end-to-end ASR systems can significantly simplify the speech recognition pipelines and outperform the conventional ASR systems on several representative speech datasets.
For a natural conversation between humans and machines, spontaneous speech recognition is an essential technology. This involves complex spontaneous speech phenomena, such as unwanted pauses, word fragments, elongated segments, filler words, self-corrections, and repeated words. Thus, spontaneous speech recognition is still challenging and worth investigating [9,10]. In this situation, the end-to-end approach provides one solution for dealing with various spontaneous speech phenomena without intermediate modules carefully designed by human knowledge [11,12]. To build a high-quality end-to-end ASR system [7,13], a large-scale spontaneous speech corpus must be collected to handle a variety of spontaneous speech phenomena.

After recording the voices of the two speakers on a stereo channel, we separated them into mono channels for editing and transcription. The obtained data were edited under the following conditions: (1) when saving the speech signals as a file, the speech signals should not be cut off in the middle; (2) long speech signals were divided based on long silences, because it is difficult to divide spontaneous speech into sentence-level segments; and (3) speech signals were stored in a 16 kHz, 16-bit linear, little-endian PCM format.

Transcription Rules
This section describes the KsponSpeech transcription process. All transcriptions were performed according to our pre-defined transcription rules and saved in EUC-KR format. Figure 1 depicts a transcription example. Here, special symbols, such as /, (, ), *, and +, in the transcription are used only for the purpose of representing dual transcription, disfluent words, and non-speech events. When the special symbols were actually spoken, they were transcribed in the form they were pronounced. The edited and transcribed data were thoroughly checked to see if the speech and transcription were identical and were written according to the transcription rules.

Dual Transcription
When deviating from the standard pronunciation, or when two or more pronunciations were possible for the same transcription, we used both orthographic transcription and phonetic transcription in parallel. We call this dual transcription. The orthographic transcription is written according to Korean standard orthographic rules; the phonetic transcription is written as close to the original sound as possible. They were initially prepared for training the acoustic model and the language model of the conventional ASR system. In the end-to-end ASR system, both notations can be used depending on the purpose. When using the dual transcription, parentheses are only used to indicate the range of the dual transcription. Figure 1a shows an example of dual transcription.
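As a concrete illustration, selecting one side of a dual transcription can be done with a small script. The `(orthographic)/(phonetic)` span format and the example sentence below are assumptions based on the rules above (Figure 1a), not an exact specification of the corpus markup:

```python
import re

# A dual-transcribed span is assumed to look like "(orthographic)/(phonetic)",
# e.g. "(3시)/(세 시)". The pattern is illustrative; the exact markup follows
# the corpus transcription rules.
DUAL = re.compile(r"\(([^)]*)\)/\(([^)]*)\)")

def select_transcription(text: str, form: str = "orthographic") -> str:
    """Keep either the orthographic or the phonetic side of each dual span."""
    group = 1 if form == "orthographic" else 2
    return DUAL.sub(lambda m: m.group(group), text)
```

The orthographic side would typically feed a language model, while the phonetic side is closer to what an acoustic model sees.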

Disfluent Speech Transcription
Disfluent speech is marked with forward slash (/) and plus (+) symbols in the transcription. Here, the / symbol mainly indicates filler words, which generally contain little to no lexical content, such as "uh" and "um" in English. Figure 1b shows an example of a filler word. The + symbol mainly indicates repeated words, word fragments, and self-corrections. Figure 1c shows an example of a repeated word.

Ambiguous Pronunciation
Words that were difficult to understand or whose pronunciation was ambiguous were transcribed with an asterisk (*) at the end. This symbol is attached to words that could not be recognized on their own but could be estimated from the surrounding context. Figure 1d shows an example of a word with ambiguous pronunciation. The example looks well-formed in text, but the actual sounds are abbreviated. The * symbol was not added if the word was pronounced clearly. Words that could not be estimated even from the surrounding context are marked with u/ as unknown words.

Non-Speech and Noise Notation
For non-speech event notation, we use the b/, l/, o/, and n/ symbols, which represent breath sounds, laughter, an utterance overlapping with the other participant's speech (written at the beginning of the transcription), and other noises, respectively. Figure 1e shows an example of non-speech event notations.

Numeric Notation
Numbers are doubly transcribed to reflect their pronunciation: each entry consists of the numbers with their units and the corresponding pronunciation. Unlike English, Korean uses a word-phrase unit, which consists of one or more words, as its basic unit. Therefore, we needed a criterion for whether to connect or insert a space between numbers and units. For this reason, we separated numbers from units such as "year", "month", "day", and "minute". The pronunciations of numbers are written with spaces in decimal units. Figure 1f shows an example of numeric notation.

Abbreviation Notation
When abbreviations and foreign language words are pronounced differently from the standard pronunciation, we double transcribed the abbreviations and the actual pronunciation. Frequently used abbreviations, such as "KBS" for Korean Broadcasting System or "FIFA" for the International Federation of Association Football, and foreign language words were indicated as they are if they were spoken in general pronunciation.

Word Spacing and Punctuation
As mentioned earlier, Korean uses word-phrase units, and the spaces between them are sometimes ambiguous. Therefore, we tried to ensure that the spaces between words were written correctly according to the Korean standard orthographic rules. If we could not clearly determine whether to use a space even after following the rules, we added a space between the words. Punctuation marks such as periods, question marks, and exclamation marks were placed at the end of sentences; commas were used in the middle of sentences to mark phrase breaks.

Corpus Partitions for Speech Recognition
Distributing a large-scale corpus as a single large archive may be impractical and inconvenient. Thus, KsponSpeech was compressed into five archives for training and one archive for evaluation. The training portion consists of a total of 622,545 utterances. Of these, 620,000 utterances are split into five compressed files of 124,000 utterances each, and the remaining 2545 utterances are additionally stored in the last (fifth) compressed file. The first 620,000 utterances were designated as "Train", and the last 2545 utterances were designated as "Dev". Dev is used to find the optimal parameters of an end-to-end model; its speakers are included in Train.
The evaluation data were stored as an archive containing 6000 utterances. This was built by selecting 100 utterances per person from conversations of 60 speakers who were not included in the training data. Subsequently, the evaluation data were divided into two subsets according to perplexity [31]. To do this, we first built a three-gram language model [31] with the transcription of the training data. Then, the utterances of each speaker were ranked according to perplexity and divided into roughly 50 utterances each, with the lower-perplexity half designated as "eval-clean" and the higher-perplexity half as "eval-other". In the evaluation data, eval-clean has a perplexity of 444 and eval-other has a perplexity of 3106. Eval-clean has a context similar to the training data and consists of utterances that are shorter and easier to recognize than eval-other. Note that perplexity is measured in word-phrase units, not in word units. Table 2 shows the data subsets in KsponSpeech.
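The per-speaker split described above can be sketched as follows, assuming the 3-gram perplexity of each utterance has already been computed with an external language-model toolkit (the function and argument names are ours, not the authors'):

```python
def split_eval_sets(utts_by_speaker, perplexity):
    """Split evaluation utterances into eval-clean / eval-other.

    utts_by_speaker: {speaker_id: [utterance_id, ...]}
    perplexity:      {utterance_id: float}  (3-gram LM perplexity)
    For each speaker, utterances are ranked by perplexity and split roughly
    in half: the lower half goes to eval-clean, the upper to eval-other.
    """
    clean, other = [], []
    for spk in sorted(utts_by_speaker):
        ranked = sorted(utts_by_speaker[spk], key=lambda u: perplexity[u])
        half = len(ranked) // 2
        clean.extend(ranked[:half])
        other.extend(ranked[half:])
    return clean, other
```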

Speech Recognition Results
To validate the effectiveness of KsponSpeech, we first demonstrate the performance of the two end-to-end models of the RNN [24] and the Transformer [26] used as the standard in the ESPnet toolkit [23]. We then investigate the number of sub-word units [30,32] suitable for Korean and present the performance of the large Transformer architecture. We finally explore the methods for generating clean transcriptions in spontaneous speech.

Experimental Setups
All speech recognition experiments were performed using the ESPnet toolkit [23]. Most of the hyper-parameters followed the default settings provided by the toolkit, in particular the LibriSpeech recipe [33,34]. Each model was jointly trained with the CTC objective function as an auxiliary task [24,26]. We used four 1080Ti GPUs for training and did not use external language models for decoding. All experiments employed 83-dimensional features per frame: 80-dimensional log-Mel filterbank coefficients plus 3-dimensional pitch features. The features were normalized by the mean and variance of the training set. For GPU memory efficiency, we removed utterances having more than 3000 frames or more than 400 syllables.
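The length-filtering step can be sketched as below; counting each non-space character as one syllable is our approximation of the paper's syllable count (in Korean, each Hangul character is one syllable):

```python
def keep_utterance(num_frames: int, transcript: str,
                   max_frames: int = 3000, max_syllables: int = 400) -> bool:
    """Drop utterances that are too long for GPU memory.

    Syllables are approximated as non-space characters, which matches
    Korean text where each Hangul character is a syllable.
    """
    num_syllables = sum(1 for ch in transcript if not ch.isspace())
    return num_frames <= max_frames and num_syllables <= max_syllables
```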
The RNN and Transformer were mostly configured by following the settings previously provided [27,34]. For RNN, the encoder was composed of 2 VGG blocks [27] followed by 5 layers of bidirectional long short-term memory (BLSTM) with 1024 cells in each layer and direction. The attention layer used a location-aware attention mechanism [35] with 1024 units. The decoder was 2 layers of unidirectional long short-term memory (LSTM) with 1024 cells. The training was performed using AdaDelta [36] with early stopping and dropout regularization [27]. For the decoding, we used a beam search algorithm with a beam size of 20 and CTC weight of 0.5.
For Transformer, the encoder used 12 self-attention blocks stacked on the 2 VGG blocks, and the decoder used 6 self-attention blocks. For every Transformer layer, we used 2048-dimensional feedforward networks. For multi-head attention, we employed two configurations: 4 attention heads with 256 dimensions ("small Transformer") and 8 attention heads with 512 dimensions ("large Transformer"). The training was performed using Noam [6] without early stopping. For regularization, we also adopted warmup-steps, label smoothing, gradient clipping, and accumulating gradients described previously [27]. For the decoding, we used a beam search algorithm with beam sizes of 10 and 60 and CTC weights of 0.5 and 0.4 for small and large Transformer, respectively. More detailed configurations are described in Appendix A.
Transcriptions for training and evaluation were generated by the following text processing. First, we used the orthographic transcription from the original transcript, which consists of the dual transcription. Next, all punctuation marks, such as periods and question marks, and all non-speech symbols, such as overlap, breath, and laughter, were removed from the transcription, except for unknown words. Finally, we generated the transcription by removing the "/" symbol for filler words and the "+" symbol for repeated words and self-corrections.
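A minimal sketch of this text processing is shown below. The symbol placements (e.g. a trailing `/` on filler words, a trailing `+` on repeats, `(orth)/(phon)` dual spans) are assumptions drawn from the transcription rules above, so the patterns may need adjusting to the actual markup:

```python
import re

def clean_transcript(text: str) -> str:
    """Generate a training transcription following the three steps above."""
    # 1. keep the orthographic side of each dual-transcribed span
    text = re.sub(r"\(([^)]*)\)/\(([^)]*)\)", r"\1", text)
    # 2. remove non-speech symbols (b/, l/, o/, n/) and punctuation marks;
    #    the unknown-word tag u/ is kept, following the paper
    text = re.sub(r"\b[blon]/", "", text)
    text = re.sub(r"[.?!,*]", "", text)
    # 3. strip the '/' filler tag and the '+' repeat/self-correction tag
    #    (a '/' preceded by 'u' belongs to the unknown-word tag and is kept)
    text = re.sub(r"(?<!u)/", "", text)
    text = text.replace("+", "")
    # collapse any whitespace left behind by the removed symbols
    return " ".join(text.split())
```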

Evaluation Metrics
As evaluation metrics, we used character error rate (CER), word error rate (WER), and space-normalized WER (sWER), which are described below. All metrics were measured using the Score Lite toolkit [37], which is a tool for scoring and evaluating the output of speech recognition systems. This toolkit compares the hypothesis text (HYP) output by the speech recognizer and the reference text (REF). Here, the hypothesis text and the reference text are composed of three units: (1) characters (syllables in Korean), (2) words (word-phrases in Korean), and (3) space-normalized words. Figure 2 provides examples of calculations for metrics, and Appendix B shows their pseudocode.

Figure 2. Examples of metric calculations. In the evaluation results (Eval), C, S, I, and D denote correct, substituted, inserted, and deleted words, respectively. A yellow box denotes space-normalized words, an underscore denotes whitespace, bold text denotes incorrect words, and an asterisk denotes inserted words.
The CER was measured by converting the character or sub-word units predicted by the end-to-end model into character units, as shown in Figure 2b. Here, the CER was calculated over both character units and spaces. The WER was measured after restoring the sub-word units predicted by the end-to-end model to the original word units, as shown in Figure 2c. This is the evaluation metric commonly used in speech recognition, but it may misrepresent performance in Korean depending on how spaces are used, as described in Appendix B.
Finally, we propose sWER as a new evaluation metric. In Korean, space rules are flexible; inconsistent spacing is frequently seen in spontaneous speech transcriptions, like in KsponSpeech. However, this causes a problem in the evaluation of speech recognition because correct results are classified as errors due to this spacing variation. Thus, we used sWER, which gives a more valid word error rate by excluding the effects of inconsistent spaces. This metric was measured from space-normalized texts, which was performed only on the hypothesis text, based on spaces in the reference text, as shown in Figure 2d. The detailed algorithm is described in Appendix B. In all evaluation processes, words with the same definition but different forms such as "2" and "two", were still regarded as incorrect.
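The space normalization of Figure 2d can be sketched in a few lines. The paper's Algorithm A2 uses a modified Levenshtein alignment; this sketch substitutes Python's `SequenceMatcher` for that alignment step, so it is an approximation rather than the exact algorithm:

```python
from difflib import SequenceMatcher

def space_normalize(ref: str, hyp: str) -> str:
    """Re-space the hypothesis according to the reference's spacing.

    The two character streams are aligned ignoring spaces, and a space is
    inserted into the hypothesis after every character that aligns to a
    reference character followed by a space.
    """
    ref_chars = [c for c in ref if c != " "]
    hyp_chars = [c for c in hyp if c != " "]
    # indices of reference characters that are followed by a space
    followed_by_space, idx = set(), -1
    for c in ref:
        if c == " ":
            followed_by_space.add(idx)
        else:
            idx += 1
    # map matching hypothesis characters onto reference positions
    aligned = {}
    matcher = SequenceMatcher(a=ref_chars, b=hyp_chars, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned[block.b + k] = block.a + k
    out = []
    for j, c in enumerate(hyp_chars):
        out.append(c)
        if aligned.get(j) in followed_by_space:
            out.append(" ")
    return "".join(out).strip()
```

The sWER is then simply the ordinary WER computed between the reference and the space-normalized hypothesis.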

Comparison of RNN and Transformer Architectures
We demonstrate the performance differences between the RNN and Transformer models and then examine the effect of applying SpecAugment [29], a data augmentation method. The Transformer experiments used the configuration of the small Transformer. Each model used the 2306 Korean syllables (including a space symbol) observed in the training data as output nodes. Table 3 shows the performance of the RNN and Transformer models with and without SpecAugment; bold numbers indicate the best performance.
We first compared the performance of each model on CER. The evaluation data consisted of the three datasets, Dev, Eval-clean, and Eval-other, summarized in Table 2. Here, the Transformer model outperformed the RNN model in all evaluation datasets. Among each dataset, Dev showed the lowest error rate of 7.2% because it consists of the speakers included in the training data, unlike the other evaluation datasets. Eval-clean and Eval-other showed CERs of 8.7% and 9.7%, respectively.
We also observed a large performance gap between WER and sWER. WER indicated a large difference in performance between the evaluation datasets, whereas sWER indicated a smaller difference. When using sWER, mismatched words that had been classified as incorrect due to spacing alone were counted as matches. Most of the spacing differences arose from the inconsistent use of spacing in the training data. Note that both metrics used the same reference text, and the hypothesis texts differed only in spaces. Thus, spacing is an important issue in evaluation for Korean, and sWER is a better metric than WER.
In terms of data augmentation, SpecAugment improved the performance of both models. The Transformer model with data augmentation achieved relative character error rate reductions of 9.7%, 8.0%, and 7.2% for Dev, Eval-clean, and Eval-other, respectively. As a result, we confirmed that the Transformer model using SpecAugment performs best, and we used this model in all subsequent experiments.
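The syllable-based output vocabulary used by these models can be derived directly from the training transcriptions. A sketch is below; the `<space>` token name is our placeholder for however the space symbol is actually encoded:

```python
def build_output_units(transcripts):
    """Collect every distinct syllable (character) in the training text as
    one output node; for KsponSpeech this yields 2306 units including the
    space symbol, represented here by a placeholder token."""
    units = {"<space>" if ch == " " else ch
             for line in transcripts
             for ch in line}
    return sorted(units)
```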

The Number of Sub-Word Units
We investigated the number of sub-word units suitable for Korean. A sub-word unit is formed by concatenating several letters into a new unit [24]. Here, we connected letters using the unigram algorithm [30] used by default in the ESPnet toolkit. Table 4 describes the performance according to the number of sub-word units. In the table, the experiment with 2306 sub-word units is the same model that uses 2306 Korean syllables, which performed best in the previous experiments. Additionally, the CER was measured by converting sub-word units predicted from the end-to-end model into character units with whitespace, as shown in Figure 2b. We observed that performance declined as the number of units was increased beyond the 2306 syllable units. In our preliminary experiments, LibriSpeech [33], an English speech corpus of similar size to KsponSpeech, showed improved performance when using sub-word units built with the same algorithm. The reason for the different results despite the similarly sized datasets is probably that the basic units of Korean and English differ: Korean uses syllables and English uses characters as the basic unit. Similar results have been observed in Mandarin, which, like Korean, uses syllables as the basic unit; some researchers [38] reported that character units outperform sub-word units in Mandarin. In our experiments, syllable units may already be sufficiently concatenated, unlike English character units. Moreover, syllable units may be unsuitable for expansion into sub-word units because they are already concatenated units of two or more Korean alphabet letters [39]. We will conduct an experiment starting from the Korean alphabet in future work.
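For intuition about how letters are concatenated into larger units, a toy merge-based procedure is sketched below. Note that this is a BPE-style frequency merge, not the unigram algorithm [30] actually used in the experiments; it only illustrates how syllable units grow into sub-word units:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE-style sub-word construction (illustrative only; the paper
    uses the unigram algorithm). Starting from syllable units, repeatedly
    merge the most frequent adjacent pair of units."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # apply the merge everywhere it occurs
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, seqs
```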

Size of Transformer Model
We compared the performance difference between the small and large Transformer architectures.
Here, the small Transformer consists of four attention heads with 256 dimensions, and the large Transformer consists of eight attention heads with 512 dimensions. The detailed configuration of both models is described in Appendix A. In addition, both used the 2306 syllable units that produced the best performance in the previous experiments. Table 5 displays the performance of the two Transformer models. We observed that the large Transformer architecture performed better than the small one, showing relative sWER reductions of 6.3%, 5.6%, and 6.7% over the small Transformer for Dev, Eval-clean, and Eval-other, respectively. The actual sizes of the two models were 116 and 297 MB, respectively; that is, the large Transformer model is about 2.6 times larger than the small one. We present the performance of the large Transformer model in Table 5 as the baseline performance of the KsponSpeech corpus.

Clean Transcription Generation
KsponSpeech can contribute to the task of generating clean transcriptions because the disfluency tags provided by the corpus can be used to remove disfluent words and produce a clean transcription. In spontaneous speech, disfluent words, such as filler words, self-corrections, and repeated words, reduce the readability of automatically generated transcriptions. Therefore, we demonstrated the feasibility of approaches that generate fluent transcriptions directly from disfluent speech using the end-to-end ASR model.
We can attempt two possible approaches. First, fluent transcription can be obtained by detecting all disfluent words and then removing them. In this case, the end-to-end model should be trained to output disfluent words along with their disfluency tag. Then, the disfluency-tagged words are removed through post-processing. Second, fluent transcription can be obtained directly from the end-to-end model. In this case, the model should be trained to generate fluent transcription from disfluent speech.
To perform this experiment, we used end-to-end models trained with two types of transcriptions, as shown in Table 6: (1) disfluent transcription with disfluency tags ("disfluent w/ tag") and (2) fluent transcription ("fluent"). The disfluent w/ tag type is a transcription containing the disfluency tags, and the fluent type is a transcription with the disfluent words completely removed. Table 7 shows the performance for each transcription type. Here, we used the fluent-type transcription as the reference text in the evaluation. For the disfluent w/ tag model, we first removed all disfluent words from the recognition results and then measured the sWER. For the fluent model, we calculated the sWER without additional processing. Table 7 shows that both models performed similarly. For the fluent model, we observed that many disfluent words were automatically removed in the hypothesis text. However, the predicted clean transcriptions also contained words that were incorrectly inserted or deleted. A possible reason is that annotators may have missed some disfluency tags on disfluent words in the process of building such a large corpus. As a result, we demonstrated the feasibility of generating fluent transcriptions from disfluent speech using the end-to-end ASR model; KsponSpeech can thus contribute to clean transcription generation. We believe this task will be helpful for a variety of applications that require readability and clarity, such as automatic minutes generation and machine translation.
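The post-processing step for the "disfluent w/ tag" model can be sketched as below, assuming tagged words carry a `/` or `+` symbol as in the transcription rules (handling of the unknown-word tag is omitted for simplicity):

```python
import re

def remove_disfluent_words(hyp: str) -> str:
    """Drop every word carrying a disfluency tag ('/' for fillers, '+' for
    repeats and self-corrections) before scoring against the fluent
    reference transcription."""
    kept = [w for w in hyp.split() if not re.search(r"[/+]", w)]
    return " ".join(kept)
```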

Conclusions
This paper introduced a large-scale Korean spontaneous speech corpus for speech recognition, named KsponSpeech, which contains 969 h of speech in 622,545 utterances spoken by 2000 native Korean speakers. We presented the baseline performance of an end-to-end model trained with KsponSpeech. Here, we proposed a new evaluation metric, the space-normalized word error rate, which handles Korean word-spacing variation. To validate the effectiveness of our corpus, we investigated the performance of standard end-to-end models, the effectiveness of the data augmentation technique, and the number of sub-word units suitable for Korean. As a result, we confirmed that the syllable-based Transformer model trained with data augmentation showed the best performance and presented it as the baseline. We also explored approaches to generating clean transcriptions from disfluent speech and confirmed that an end-to-end model trained with KsponSpeech can be used to generate them.
We are releasing the KsponSpeech corpus on the AIHub open data hub site. This corpus can contribute to building a high-quality end-to-end ASR model for Korean spontaneous speech, which can be applied to various AI fields, such as AI assistants, dialog robots, and AI tutors. We expect that KsponSpeech will be widely used as a benchmark corpus for Korean speech recognition.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Table A1 shows experimental configurations for the RNN and Transformer models. Most of the hyper-parameters follow the default settings provided by the ESPnet toolkit [23]. In Table A1, the Transformer has two architectures according to the number of attention heads and their dimensions.

Appendix B
Algorithm A1 shows the pseudocode for the character error rate (CER), word error rate (WER), and space-normalized word error rate (sWER); sWER is the new evaluation metric we propose for Korean. In Korean, space rules are flexible, which means that, even when hearing the same utterance, different transcribers may use spaces differently. This causes a problem in the evaluation of speech recognition: a hypothesis text with inconsistent spaces poses no problem for human readers, but it is counted as erroneous words by the evaluation metric. For this reason, we generated a space-normalized hypothesis text and then calculated the evaluation metric from it. Algorithm A2 provides the pseudocode for generating the space-normalized, word-unit-based hypothesis text. Here, Algorithm A2 was created by modifying the Levenshtein distance implementation provided by the Kaldi toolkit [40]; we did not consider memory efficiency or code simplification.