Automatic Word Spacing of Korean Using Syllable and Morpheme

In Korean, word spacing strongly affects the readability and meaning of sentences. Moreover, in Korean natural language processing, a sentence with incorrect spacing changes the perceived structure of the sentence, which degrades performance. Previous studies corrected spacing errors using n-gram-based statistical methods and morphological analyzers, and recently many studies using deep learning have been conducted. In this study, we address the spacing error correction problem using both syllable-level and morpheme-level features. The proposed model combines a convolutional neural network layer, which learns syllable and morpheme pattern information in sentences, with a bidirectional long short-term memory layer, which learns forward and backward sequence information. When evaluating the proposed model, accuracy was measured at the syllable level, and precision, recall, and F1 score were measured at the word level. The experimental results confirmed that the proposed model improves on previous studies.


Introduction
Word spacing marks the boundaries between the words that construct a sentence. Text data with spacing errors can degrade performance in various natural language processing (NLP) tasks. For example, the two sentences "abeoji-ga bang-e deuleoga-sin-da" (Father enters the room) and "abeoji gabang-e deuleoga-sin-da" (Father enters the bag) differ only in word spacing, but their semantics are completely different. Therefore, it is important to reduce semantic ambiguity by restoring correct spaces before performing NLP tasks. Spacing errors are also frequent in speech-to-text output; a more complete sentence can be generated by applying spacing correction as a speech-to-text post-processing step.
Research on Korean word spacing correction has evolved from rule-based, statistics-based, and probability-based methods [1][2][3][4] to deep neural network methods [5][6][7][8][9][10][11][12][13]. One study constructed a word dictionary, searched it by moving over the sentence at the syllable level, and generated spacing results based on a word score; it further improved performance by applying heuristic algorithms [1]. Another study corrected word spacing by using bi-gram probability weights and a voting method to determine where to insert spaces [2]. One study proposed a word spacing model using a structural support vector machine (SVM) after attaching syllable-based part-of-speech (POS) tags to sentences with no spacing [3]. The word spacing correction problem has also been defined as a sequence labeling problem, and a conditional random field (CRF), which performs well on sequence labeling, was applied to it [4]. In another study, morpheme-level data were converted to syllable-level POS tags, and syllable and noun n-grams and a POS distribution vector were added as features; a method of correcting word spacing errors was then proposed through an architecture combining bidirectional LSTM (Bi-LSTM) and CRF [6]. There was also a study that corrected word spacing errors using a sequence-to-sequence model built by stacking LSTMs, which is specialized for processing complex and large sequence data [9].
Most previous studies construct the word spacing system using only one of the syllable, word, or morpheme features of sentences. In addition, the models of previous studies were built from convolutional neural networks (CNN), recurrent neural networks (RNN), or CRF used individually. Some studies show that combining CNN and LSTM performs better than using either model alone [10,11]. Furthermore, one study showed that a combined CNN-LSTM structure outperforms a combined LSTM-CRF structure on a sequence tagging task that detects metaphors in sentences [12]. Therefore, we constructed a model with an architecture that combines CNN and Bi-LSTM. We extract local features of syllables and morphemes using a multiple filter CNN and extract order information through a Bi-LSTM. The model concatenates the LSTM outputs of the syllable and morpheme information, passes them through a fully connected layer, and finally outputs the space tag.
The rest of this paper is organized as follows. Section 2 describes the characteristics, collection, and preprocessing of the Korean text dataset. Section 3 describes the model architecture used for word spacing correction. Section 4 compares the experimental results of the proposed model with those of existing studies. Finally, Section 5 presents the conclusion and future work of this study.


Data
The text data are Korean sentences, which can be divided into three levels: syllable, morpheme, and word. Table 1 shows an example of dividing the sentence "abeoji-ga bang-e deuleoga-sin-da" (Father enters the room) into syllables, morphemes, and words; "/" denotes the delimiter separating units at each level. A syllable is a unit of speech that the speaker and listener perceive as a bundle; it is larger than a phoneme and smaller than a word (morpheme). In Korean, syllables consist of consonants and vowels or a single vowel. In the Table 1 example sentence, the syllables are [a, beo, ji, -ga, bang, -e, deul, eo, ga, sin, -da]. A morpheme is the smallest unit of speech that carries meaning, and each separate morpheme has a meaning. The morphemes are [abeoji, -ga, bang, -e, deuleoga, sin-da]. Words (usually called eo-jeol in Korean) usually coincide with the unit of spacing and may be formed by attaching a josa to a che-on (a noun, pronoun, or numeral) or by attaching an ending to a stem. The words are [abeoji-ga, bang-e, deuleoga-sin-da]. The syllable and morpheme levels are used as input features. In addition, since the word level is the spacing unit, it is used to evaluate whether word spacing has been corrected.
Most word spacing correction studies use the Sejong corpus. The Sejong corpus is provided by the National Institute of Korean Language and has two categories, written and spoken language [14]. The written language corpus consists of newspapers and magazines, whose spelling and word spacing follow the rules more consistently than the spoken language corpus; that is why the Sejong written language corpus is used in this study. In addition, we crawled and collected news articles with good spacing.
The collected Sejong corpus and news articles contain HTML tags, special characters, etc., which are unnecessary for word spacing processing. We removed the HTML tags and special characters but kept frequently used characters such as quotes, commas, and periods. The same preprocessing was applied to both collections. After preprocessing, about 3 million sentences from the Sejong corpus and about 7 million sentences from the collected news articles were combined into a total of about 10 million sentences, containing about 12 million words and about 46 million syllables. Table 2 shows statistics of the number of syllables and words in the preprocessed sentences. The maximum number of syllables in a sentence is 350, which differs greatly from the average of 39.183. If all sentences were padded to the maximum syllable length of 350 for model input, padding would occupy a large proportion of each sentence, which may cause a gradient vanishing problem. Therefore, we reconstructed the sentences so that they have similar lengths. To decide whether a space is needed, not all words in a sentence are required, only the 2-3 words before and after. Based on this idea, we reconstructed the sentences to contain 6 to 13 words each, slicing 6 to 13 words continuously from the beginning of each document. The maximum number of syllables in the reconstructed sentences is 76. When the reconstructed sentences are padded to the same length, the proportion of padding is smaller than before, which partially mitigates the gradient vanishing problem. In this way, unnecessary elements were removed from the sentences and sentences of similar length were sampled. Finally, the number of sentences used in this study is 13 million.
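The reconstruction step described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, the random chunk-length choice, and the fixed seed are assumptions, with only the 6-to-13-word range taken from the text.

```python
import random

def reconstruct_sentences(words, min_len=6, max_len=13, seed=0):
    """Slice a token stream into pseudo-sentences of min_len..max_len
    words, taken continuously from the beginning of the document."""
    rng = random.Random(seed)
    chunks, i = [], 0
    while i < len(words):
        n = rng.randint(min_len, max_len)
        chunks.append(words[i:i + n])  # non-empty because i < len(words)
        i += n
    return chunks

words = [f"w{k}" for k in range(40)]  # stand-in for a tokenized document
chunks = reconstruct_sentences(words)
```

Every produced chunk has at most 13 words, and only the final remainder chunk may fall below 6 words, so padded batches contain little padding.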

Word Spacing Correction Model
In this Section, the input/output process of the model used for word spacing correction training is explained, and the overall architecture of the model is shown in Figure 1. We have shared the model structure used in this study in a GitHub repository (https://github.com/JeongMyeong/KoAutoSpacing-KAS). Detailed model parameters are described in the experiment Section.

Integer Encoding
This study defined the problem of correcting Korean word spacing as a sequence labeling problem that sequentially attaches spacing tags to the syllables in a sentence. We used two types of input sequence to train the word spacing correction model. The first is a sequence in which the syllables of a sentence are encoded as integers, and the second is a sequence in which the morphemes are encoded as integers. For example, for the sentence "abeoji-ga bang-e deuleoga-sin-da." ("Father enters the room"), Tables 3 and 4 show how the sentence is encoded using syllable and morpheme encoding values. The maximum integer encoding value for syllables is defined as the number of unique syllables appearing in the training data. In contrast, since the set of morphemes is limited by the morphological analyzer used, the number of unique integers for morphemes is limited. We represent the morpheme sequence as integers marking the start, middle, and end of each morpheme: if the start and middle of a morpheme are encoded as an integer N, its end is encoded as N + 1. The start and end of each morpheme can therefore be recovered from the encoded integers.
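The two encodings above can be sketched as follows. The function names, the OOV index 0, and the toy vocabulary values are assumptions for illustration; in the paper the integer values come from the training data and the analyzer's tag set (with base values spaced by 2 so that N + 1 is free for morpheme ends).

```python
def encode_syllables(syllables, syllable_vocab):
    """Map each syllable to its integer; 0 is an assumed OOV index."""
    return [syllable_vocab.get(s, 0) for s in syllables]

def encode_morphemes(morphemes, tag_base):
    """Per-syllable morpheme codes: syllables at the start or middle of
    a morpheme get the base integer N for its tag; the last syllable of
    the morpheme gets N + 1, so morpheme boundaries are recoverable."""
    seq = []
    for syllables, tag in morphemes:  # morpheme = (its syllables, POS tag)
        n = tag_base[tag]
        seq.extend([n] * (len(syllables) - 1))
        seq.append(n + 1)
    return seq

# Toy vocabularies for "abeoji-ga" (noun + subject josa).
syllable_vocab = {"a": 1, "beo": 2, "ji": 3, "-ga": 4}
tag_base = {"NNG": 2, "JKS": 4}
syl_seq = encode_syllables(["a", "beo", "ji", "-ga"], syllable_vocab)
mor_seq = encode_morphemes([(["a", "beo", "ji"], "NNG"), (["-ga"], "JKS")], tag_base)
```

Here "abeoji" (NNG) yields [2, 2, 3] (start, middle, end) and "-ga" (JKS) yields [5], so both sequences have one entry per syllable.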

Embedding
Methods of embedding elements in a vector space include language models [15,16], word2vec [17,18], dependency-based contexts [19], and global vectors for words [20], i.e., word representations learned through artificial neural networks. In this study, syllables and morphemes were embedded in a vector space using representations learned through artificial neural networks. The embedding layer places elements in the vector space, and as the model is trained, elements with similar roles move closer together in that space. Figure 2 shows the conversion of an integer sequence to a vector sequence through the embedding layer. When the integer sequence [w_1, w_2, ..., w_{n-1}, w_n] is input to the embedding layer, each value w_n is converted into an m-dimensional vector x_n, yielding the vector sequence [x_1, x_2, ..., x_{n-1}, x_n]. Here w denotes the integer value of a syllable or morpheme. After each integer value is converted to an m-dimensional vector, an m x n sequence of vectors is obtained.
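The embedding lookup can be sketched with a plain matrix, assuming a random (untrained) embedding table; in the actual model this matrix is a trainable layer, and the sizes here are only the syllable settings reported later (128 dimensions).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, m = 1000, 128             # m-dimensional embeddings (128 for syllables)
E = rng.normal(size=(vocab_size, m))  # embedding matrix, trained jointly with the model

w = np.array([17, 42, 5, 99])         # integer sequence [w_1 .. w_n]
X = E[w]                              # vector sequence [x_1 .. x_n], shape (n, m)
```

Row indexing with the integer sequence is exactly the embedding-layer operation: every integer selects its m-dimensional vector, producing the m x n vector sequence fed to the CNN.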

Multiple Filter 1-dimensional Convolutional Neural Networks
CNN is known to have excellent performance not only in image processing but also in NLP. The convolution layer in NLP is characterized by a 1-dimensional operation on text data [21][22][23][24]. Figure 3 describes the process of extracting local features from a vector sequence through a multiple filter 1-dimensional CNN (1D-CNN). The vector sequence of syllables and morphemes is expressed as follows:

x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n, (1)

where x is a syllable or morpheme vector value, n is the position of the element, and ⊕ denotes concatenation; x_{1:n} is the concatenation of elements 1 to n. The convolution operation for each syllable or morpheme is expressed as follows:

c_i = f(w · x_{i:i+h-1} + b), (2)

where f is a non-linear function such as ReLU, w is the weight value, and h and b are the filter window size and bias. The feature maps extracted by the CNN are expressed as follows:

c_h = [c_1, c_2, ..., c_{n-h+1}], (3)
C = [c_a, c_b, ..., c_k], (4)

where c_h is the feature map extracted by a filter window of size h, and C is the concatenation of feature maps extracted with multiple filter sizes {a, b, ..., k}. In this study, multiple 1D CNNs extracted local features from the syllable and morpheme sequences. These values were then concatenated and passed to the next layer.
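Equations (1)-(4) can be sketched numerically as follows. This is a minimal sketch with one filter per window size and ReLU as the non-linearity f; the paper's model uses 64 filter units per size, and the toy dimensions here are assumptions.

```python
import numpy as np

def conv1d(X, W, b):
    """Valid 1D convolution over an (n, m) vector sequence with an
    (h, m) filter: c_i = f(w . x_{i:i+h-1} + b) with f = ReLU."""
    n, m = X.shape
    h = W.shape[0]
    out = np.array([np.sum(X[i:i + h] * W) + b for i in range(n - h + 1)])
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))       # n = 10 elements, m = 8 embedding dims
feature_maps = []
for h in (2, 3, 4, 5):             # the four filter sizes used in the model
    W = rng.normal(size=(h, 8))
    feature_maps.append(conv1d(X, W, b=0.1))
# C: concatenation of the feature maps from the multiple filter sizes
C = np.concatenate(feature_maps)
```

Each window size h yields a feature map of length n - h + 1 (here 9, 8, 7, and 6), and C concatenates them as in Equation (4).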

Bidirectional Long Short-Term Memory
Long Short-Term Memory (LSTM) is a network that mitigates the gradient vanishing and long-term dependency problems of RNNs. Figure 4 describes the Bi-LSTM structure with forward and backward LSTM operations. A Bi-LSTM has the advantage of capturing sentence information by training on both the forward and backward information of the sequence [25]. In the model proposed in this study, C, the output of the multiple filter 1D-CNN, was used as the input of the Bi-LSTM. The forward sequence C_a, C_b, ..., C_k was used as the input of the forward LSTM, and the reversed sequence C_k, ..., C_b, C_a was used as the input of the backward LSTM. h_f and h_b denote the forward and backward LSTM outputs, respectively, and the final Bi-LSTM output is their concatenation [h_f, h_b]. Equations (5)-(10) derive the output of the LSTM, where x_t, h_t, c_t, f_t, i_t, and o_t denote the input, output, cell state, forget gate, input gate, and output gate, respectively:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f) (5)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i) (6)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c) (7)
c_t = f_t ∗ c_{t-1} + i_t ∗ c̃_t (8)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o) (9)
h_t = o_t ∗ tanh(c_t) (10)

Equation (5) determines which information is discarded from the cell state, with the weight determined through a sigmoid layer. Equation (6) derives the input gate, which selects new information. The candidate state is derived through Equation (7), the cell state through Equation (8), and the output gate through Equation (9). Finally, the output is derived from Equation (10).
The concatenated Bi-LSTM output passes through the fully connected layer and is finally emitted through the last output layer. The activation function of the output layer is softmax, producing probability values over three classes: spacing, non-spacing, and padding.
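One LSTM step following Equations (5)-(10) can be sketched as follows, using the standard formulation with randomly initialized weights; the dimensions and initialization are illustrative assumptions. A backward LSTM runs the same steps over the reversed sequence, and the two final outputs are concatenated as [h_f, h_b].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step per Equations (5)-(10)."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])     # Eq. (5): forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])     # Eq. (6): input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])   # Eq. (7): candidate state
    c_t = f_t * c_prev + i_t * c_hat       # Eq. (8): cell state
    o_t = sigmoid(W["o"] @ z + b["o"])     # Eq. (9): output gate
    h_t = o_t * np.tanh(c_t)               # Eq. (10): output
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for k in "fico"}
b = {k: np.zeros(d_h) for k in "fico"}
h, c = np.zeros(d_h), np.zeros(d_h)
X = rng.normal(size=(5, d_in))             # a short input sequence
for x_t in X:                              # forward pass over the sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

Since o_t lies in (0, 1) and tanh(c_t) in (-1, 1), every component of the output h stays within [-1, 1].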


Labeling
Label sequences were created under the assumption that the spacing rules of the collected data were well followed. The rules for tagging spacing were as follows. If the spacing was required after the current syllable, it was tagged as 1, and all others were tagged as 2. Furthermore, the padding value to match all sentences with the same length was tagged as 0.
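The tagging rule can be sketched as follows. The function name is an assumption, characters stand in for Korean syllables, and the choice to tag the final syllable of a sentence as 2 (no space required after it) follows the rule that everything other than a required space is tagged 2.

```python
def make_labels(sentence, max_len):
    """Tag each syllable: 1 if a space is required after it, 2 otherwise;
    pad the sequence to max_len with 0. Each character of a word stands
    in for one Korean syllable in this sketch."""
    labels = []
    words = sentence.split()
    for k, word in enumerate(words):
        labels.extend([2] * (len(word) - 1))       # non-final syllables
        labels.append(1 if k < len(words) - 1 else 2)
    labels += [0] * (max_len - len(labels))        # padding tag
    return labels

labels = make_labels("ab cd", 8)
```

For "ab cd" the syllable "b" precedes a space and is tagged 1; all other syllables are tagged 2 and the remainder is padding.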

Data Feed
To train the spacing correction model, approximately 13 million preprocessed sentences from the Sejong corpus and the crawled news articles were used. A sample of 1000 Sejong corpus sentences was set aside as test sentences to measure the final performance. The remaining sentences were split at an 8:2 ratio into about 10.5 million training sentences and about 2.5 million validation sentences. Since previous studies measured performance on the Sejong corpus, this study likewise drew the final test sentences from it. To measure the final word spacing correction performance accurately, the spacing of the test sentences was reviewed before use.

Parameters
Syllable and morpheme-level sequences were padded to the same length for training. Among the reconstructed sentences, the maximum syllable length was 76; therefore, we set the length of all sentences to 100, slightly larger than 76. The dimensions of the embedding layers that convert integer-encoded syllable and morpheme sequences into vector sequences were set to 128 and 64, respectively; since there are fewer morpheme types than syllable types, the morpheme embedding was given the smaller dimension. The number of CNNs in the multiple filter 1D-CNN was 4, with filter sizes 2, 3, 4, and 5 and 64 filter units each. In the Bi-LSTM, the number of LSTM units was set to 128 so that the concatenated forward and backward outputs have size 256. The dropout of the LSTM was set to 0.5. The activation function of the multiple filter 1D-CNNs was ELU, and that of the LSTM was tanh. Adam was used as the optimizer, with the learning rate starting at 1e-3 and gradually decreasing to 1e-6 following a polynomial decay schedule.

Train
Three spacing correction experiments were conducted: using only the syllable-level sequence, using only the morpheme-level sequence, and using both the syllable- and morpheme-level sequences. When using both levels, we trained the model structure in Figure 1. The syllable-only and morpheme-only models had no concatenation after the Bi-LSTM in Figure 1.

Metric
To evaluate the performance of the spacing correction model, we measure tag accuracy and word-level precision, recall, and F1 as follows:

Accuracy_tag = (predicted correct tags / actual entire tags) × 100 (11)
Precision_word = (predicted correct words / predicted entire words) × 100 (12)
Recall_word = (predicted correct words / actual entire words) × 100 (13)
F1_word = 2 × Precision_word × Recall_word / (Precision_word + Recall_word) (14)

Equation (11) measures whether the tag classification is correct. Equations (12)-(14) evaluate at the word level whether a sentence is correctly completed when the predicted tags are inserted into it. Table 5 shows the spacing correction performance of the proposed model. The experimental results show performance improving in the order morpheme-level only, syllable-level only, and both syllable and morpheme levels; using both combined their advantages and achieved the best performance. Table 6 shows the performance of previous spacing correction studies and of the proposed model. Previous studies [1,8,9,13] used the Sejong corpus as test data to measure performance, and this study did likewise. Reference [1] trained a spacing correction model using word-frequency and syllable-frequency dictionaries and a morphological analyzer, achieving an F1 score of 93.2%. Reference [13] used a combination of unigrams, bigrams, trigrams, and a noun dictionary.
As a result of training a model combining a gated recurrent unit (GRU) and CRF, an F1 score of 92.32% was achieved. Reference [9] constructed an encoder and a decoder using an LSTM-based sequence-to-sequence model; an attention mechanism was applied to the decoder, and the model was trained on sentences limited to a maximum of 10 words, achieving an F1 score of 93.99%. Reference [8] created feature vectors through a Bi-LSTM encoder and constructed a model using a linear-chain CRF, achieving an F1 score of 94.26%. In this study, a model was constructed with an architecture combining a multiple filter 1D-CNN and Bi-LSTM, and the spacing correction model was trained on two input types, syllable-level and morpheme-level sequences. The resulting F1 score of the proposed spacing correction model was 96.06%, about 1.8% higher than the previous best.
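The word-level metrics of Equations (12)-(14) can be sketched as follows. A predicted word counts as correct when it spans exactly the same syllables as a word in the reference sentence; the span-based comparison and function names are assumptions for illustration.

```python
def word_spans(sentence):
    """Each word as a (start, end) offset over the unspaced text."""
    spans, pos = set(), 0
    for word in sentence.split():
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans

def word_f1(pred, gold):
    """Word-level precision, recall, and F1 per Equations (12)-(14)."""
    p_spans, g_spans = word_spans(pred), word_spans(gold)
    correct = len(p_spans & g_spans)
    precision = 100.0 * correct / len(p_spans)           # Eq. (12)
    recall = 100.0 * correct / len(g_spans)              # Eq. (13)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (14)
    return precision, recall, f1
```

For example, comparing the prediction "ab cdef" against the reference "ab cd ef", only "ab" matches: precision is 50% (1 of 2 predicted words), recall is 33.3% (1 of 3 reference words), and F1 is 40%.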

Conclusions
This study proposed using both the syllable level and the morpheme level of Korean. A model combining a multiple filter 1D-CNN and Bi-LSTM is used, and the syllable-level and morpheme-level information is combined in the second half of the model. Spacing correction performance is evaluated on the Sejong corpus. The evaluation shows that using both the syllable and morpheme levels achieves better performance than using either level alone; combining the two types of information in the second half of the model benefits spacing correction. As future work, we will compare and analyze how much performance changes before and after spacing correction in various NLP tasks.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: https://ithub.korean.go.kr.