Using Multiple Monolingual Models for Efficiently Embedding Korean and English Conversational Sentences

Abstract: This paper presents a novel approach for finding the most semantically similar conversational sentences in Korean and English. Our method involves training separate embedding models for each language and using a hybrid algorithm that selects the appropriate model based on the language of the query. For the Korean model, we fine-tuned the KLUE-RoBERTa-small model using publicly available semantic textual similarity datasets and used Principal Component Analysis (PCA) to reduce the dimensionality of the resulting embedding vectors. We also selected a highly performing English embedding model from the available SBERT models. We compared our approach to existing multilingual models using both human-generated and large language model-generated conversational datasets. Our experimental results demonstrate that our hybrid approach outperforms state-of-the-art multilingual models in terms of accuracy, elapsed time for sentence embedding, and elapsed time for finding the nearest neighbor, regardless of whether a GPU is used. These findings highlight the potential benefits of training separate embedding models for different languages, particularly for tasks involving finding the most semantically similar conversational sentences. We expect that our approach will be used in diverse natural language processing-related fields, including machine learning education.


Introduction
Sentence-BERT (SBERT) [1] has emerged as one of the most efficient BERT-based [2] sentence embedding approaches. SBERT not only outperforms other approaches such as GloVe [3], InferSent [4], and Universal Sentence Encoder [5] in semantic textual similarity tasks but also solves the problem of slow inference time of cross-encoder (BERT)-based sentence embedding approaches. SBERT typically involves using pre-trained models such as BERT-base, BERT-large, RoBERTa-base, or RoBERTa-large [6], which are fine-tuned on datasets specific to downstream tasks such as semantic textual similarity.
While most SBERT models focus on the English language, there is also a need to embed both English and Korean sentences. To the best of our knowledge, one of the best approaches for embedding both Korean and English sentences with SBERT is to use the "Knowledge Distillation" [7] method. This method involves a teacher model and a student model, where the student model is trained to imitate the teacher model using translated sentences. Surprisingly, the authors of this approach reported that it outperforms even state-of-the-art Korean monolingual models on Korean semantic textual similarity tasks.
In this paper, we present a novel and efficient approach for embedding both Korean and English sentences. Instead of using a single multilingual model that maps sentences from both languages into the same embedding space, our main idea is to prepare a small and efficient monolingual model for each language (Korean and English) by selecting appropriate pre-trained models and training datasets. During inference, we select the appropriate monolingual model based on the language of the query sentence. To create an efficient Korean model, we use the KLUE-RoBERTa-small [8] pre-trained model and fine-tune it on public semantic textual similarity datasets. We then use PCA to reduce the dimensionality of the resulting vectors for efficient inference. For the efficient English model, we select the model that is best suited for our purposes. Figure 1 shows the overview of our approach. Note that our research focuses on developing a model for identifying the most semantically similar "conversational" sentences, such as questions or commands. To verify the effectiveness of our approach, we construct new test sets composed of conversational sentences. Based on these test sets, we demonstrate that our multiple monolingual models can identify the most similar sentences more accurately and quickly than existing multilingual models. We anticipate that our approach will fulfill the needs of various artificial intelligence applications, including chatbot development, natural language processing education [9][10][11], and others.
The main contributions of our paper are as follows:
• First, we introduce two types of new conversational test sets to evaluate sentence similarity methods. These sets capture a wider range of conversational contexts (Section 3).
• Second, we propose a sentence embedding model specifically designed for Korean conversational sentences. To develop this model, we select a pre-trained model and fine-tune it on a Korean or Korean-translated corpus (Section 4).
• Third, we compare existing public SBERT models to identify the most efficient English sentence embedding model for this task (Section 5).
• Finally, we present a hybrid approach for embedding both Korean and English sentences, which outperforms existing multilingual approaches in terms of accuracy, elapsed time, and model size (Section 6).

Related Work
Efficient semantic sentence embedding is crucial for many natural language processing tasks. While several approaches have been proposed, Sentence BERT (SBERT)-based methods [1] have gained popularity due to their superior accuracy compared with existing methods such as GloVe [3], InferSent [4], and Universal Sentence Encoder [5]. Additionally, SBERT-based approaches are much faster than BERT-based cross-encoder approaches, making them a promising solution for large-scale NLP applications.
SBERT is a BERT [2]-based approach that uses two siamese BERTs and pooling layers to generate sentence embeddings. The network is fine-tuned to ensure that similar sentences have close embedding vectors. Intuitively, the Semantic Textual Similarity (STS) dataset is one of the appropriate datasets for fine-tuning SBERT, as it consists of sentence pairs with their semantic similarity scores. One of the most widely used STS datasets for fine-tuning and testing is the SemEval STS dataset [12][13][14][15][16][17]. Note that there are many different versions of BERT, such as BERT-large, RoBERTa-base [6], etc. The performance of SBERT greatly depends not only on the datasets used but also on the pre-trained model used.
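As a concrete illustration, the following is a minimal sketch of how an SBERT-style bi-encoder can be fine-tuned on STS-style sentence pairs with the sentence-transformers library. The pre-trained model name, data file path, and hyperparameters are illustrative assumptions, not the exact configuration used in this paper.

```python
# A minimal sketch (not this paper's exact setup): fine-tuning an SBERT-style
# bi-encoder on STS-like sentence pairs with the sentence-transformers library.
# The pre-trained model name, file path, and hyperparameters are illustrative.
import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Siamese setup: one shared transformer encoder followed by mean pooling.
word_embedding = models.Transformer("klue/roberta-base", max_seq_length=128)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(),
                         pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# STS pairs carry similarity scores in [0, 5]; normalize them to [0, 1] for the loss.
train_examples = []
with open("sts_train.tsv", encoding="utf-8") as f:  # hypothetical TSV file
    for row in csv.DictReader(f, delimiter="\t"):
        train_examples.append(
            InputExample(texts=[row["sentence1"], row["sentence2"]],
                         label=float(row["score"]) / 5.0))

train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)  # pushes similar pairs closer

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=100)
model.save("sbert-sts-finetuned")
```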
While most sentence embedding techniques are designed for English, there are also widely used approaches for Korean, such as KLUE [8]. KLUE offers pre-trained Korean models and datasets, including models based on BERT and RoBERTa, which are trained on extensive Korean corpora. The Korean corpora, which have been tokenized using a morpheme analyzer and a BPE-based tokenizer, amount to approximately 62.65 GB. This size is larger than the data used to train the original BERT (16 GB) but smaller than the data used to train RoBERTa (160 GB). Additionally, KLUE-STS provides training and development sets for STS tasks. KorSTS [18] is also a popular Korean STS dataset, which provides training, development, and test datasets for Korean STS tasks by translating the STSb [12] English dataset.
There are also approaches to create a single multilingual model that can map sentences from multiple languages into the same embedding space, such as Knowledge Distillation [7]. The Knowledge Distillation approach first prepares an English model (teacher model) and then trains the multilingual model in such a way that the translated sentences are mapped to the same embedding space as the original sentences. Surprisingly, it has been found that this multilingual approach outperforms even the Korean monolingual model for the STS task.

Training Data
Fine-tuning pre-trained models using Semantic Textual Similarity (STS) datasets is a popular method for training Sentence BERT models. These datasets typically contain pairs of sentences, along with their corresponding semantic similarity scores ranging from 0 to 5. Table 1 provides a summary of the datasets that were utilized to train our proposed models. To the best of our knowledge, among these datasets, the STSb (STS Benchmark) [12], STS 2012-2017 [12-17], and SICK-R datasets are the most popular STS datasets. Thus, we consider them for training English sentence embedding models. For training Korean sentence embedding models, we consider three types of datasets: (1) KLUE-STS [8], which is one of the most popular Korean STS datasets, consisting of Policy News (articles produced by ministries, etc.), ParaKQC (conversational sentences about smart home devices), and Airbnb Reviews; (2) KorSTS [18], a translated version of STSb in which the translated sentences were refined; and (3) KSTS and KSICKR, the translated versions of STS (STS 2012-2017) and SICK-R, respectively, produced using gpt-3.5-turbo. When translating with gpt-3.5-turbo, we excluded some of the sentences that were not correctly translated for a number of reasons. Note that we use the notation "SICKR" instead of "SICK-R" and "KLUE" instead of "KLUE-STS" to express the dataset names concisely.
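The paper does not publish the exact prompt or filtering code used to build KSTS and KSICKR. Below is a hedged sketch of how such a translation step could look with the OpenAI Python API of that period (legacy ChatCompletion interface); the prompt wording and the simple length-based sanity filter are assumptions, not the authors' actual procedure.

```python
# Hedged sketch of translating English STS sentences into Korean with gpt-3.5-turbo.
# The prompt text and the length-based sanity filter are illustrative assumptions;
# the paper only states that incorrectly translated sentences were excluded.
from typing import Optional

import openai  # legacy (pre-1.0) OpenAI Python API

openai.api_key = "YOUR_API_KEY"  # placeholder

def translate_to_korean(sentence: str) -> Optional[str]:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's sentence into natural Korean. "
                        "Reply with the translation only."},
            {"role": "user", "content": sentence},
        ],
        temperature=0.0,
    )
    translation = response["choices"][0]["message"]["content"].strip()
    # Drop obviously broken outputs, mirroring the paper's exclusion of sentences
    # that were not correctly translated.
    if not translation or len(translation) > 4 * len(sentence) + 40:
        return None
    return translation
```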
Note that, in the experiments described in the following sections, we used all of the training, development, and test sets for training.For instance, although KorSTS includes separate datasets for training, development, and testing, we trained our models on all of the KorSTS datasets (including the training, development, and test sets) when reporting our results using the KorSTS dataset.

Test Data
We propose two types of test datasets to evaluate the effectiveness of Korean and English SBERT models: one made by AI and another made by humans. Our conversational dataset, called "GPT-ko", was generated by ChatGPT (using GPT-3.5 and GPT-4). The other dataset, called "Paraph-ko", consists of 2000 conversational sentences extracted from ParaKQC, a human-generated dataset of 10,000 sentences. Although ParaKQC is also used for KLUE-STS, "Paraph-ko" and KLUE-STS do not have common sentence pairs. "GPT-en" and "Paraph-en" are the English versions of "GPT-ko" and "Paraph-ko", respectively, translated by the gpt-3.5-turbo model. Table 2 summarizes the test data used in our experiments.
To construct the GPT-ko dataset, we followed the algorithm below:
1. We asked ChatGPT to recommend appropriate topics for generating datasets. Based on its response, we selected 10 topics (Culture, Travel, Science, Sports, Education, Food, Health, Technology, History, Humanities) and 38 subtopics.
2. We manually selected 34 subtopics from the 38, eliminating the very similar ones.
3. For each of the remaining 34 subtopics, we asked ChatGPT to randomly generate conversational sentences in Korean.
4. For each conversational sentence, we asked ChatGPT to generate one paraphrase sentence in Korean that differed in syntax as much as possible.
5. Although the generated sentences were mostly of high quality, for some subtopics, ChatGPT did not generate enough diverse sentences. In these cases, we tried switching from GPT version 3.5 to 4.0. However, generating diverse sentences was still difficult for some subtopics. For those subtopics, we stopped generating and moved on to the next ones.
We have created the GPT-ko dataset based on this algorithm, which includes 2000 conversational sentences generated in Korean across 34 subtopics. Furthermore, we have used gpt-3.5-turbo to translate the GPT-ko dataset into English, resulting in the GPT-en dataset. Table 3 displays the statistics and examples of the generated sentences. Due to limitations in table space, we present only one example sentence per topic in this paper. However, the complete dataset is available on our GitHub page (https://github.com/tooeeworld/multiple-monolingual-model, accessed on 2 May 2023).
Unlike the GPT-ko and GPT-en datasets, the Paraph-ko and Paraph-en datasets are generated by humans. To create these datasets, we extracted 2000 sentences from the existing ParaKQC paraphrase dataset, which contains 10,000 sentences organized into 1000 groups, with each group consisting of 10 semantically identical sentences. We randomly selected two sentences from each group to obtain the 2000 sentences in the Paraph-ko dataset (a sketch of this sampling step is given below). The Paraph-en dataset is the English-translated version of the Paraph-ko dataset, which was translated using gpt-3.5-turbo.
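The sampling step is simple; a hedged sketch is shown below. The file layout (one sentence per line, with 10 consecutive lines forming one paraphrase group) is an assumption about the distributed ParaKQC format, and the file name is a placeholder.

```python
# Hedged sketch: sampling two sentences from each ParaKQC paraphrase group to build
# the Paraph-ko test set. The one-sentence-per-line layout with 10 consecutive lines
# per group is an assumption about the ParaKQC file format; the path is a placeholder.
import random

random.seed(0)  # fixed seed so the sampled test set is reproducible

with open("parakqc_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

assert len(sentences) == 10_000
groups = [sentences[i:i + 10] for i in range(0, len(sentences), 10)]  # 1000 groups

paraph_ko = []
for group in groups:
    paraph_ko.extend(random.sample(group, 2))  # two semantically identical sentences

print(len(paraph_ko))  # 2000 sentences; each sentence's gold match is its group partner
```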

An Efficient Korean Embedding Model

In this section, we conduct experiments to find the best model for conversational sentence embeddings in Korean. For each test dataset, we evaluate how accurately a model identifies the most semantically similar sentence for each of 1000 query sentences out of a total of 2000 sentences (a sketch of this accuracy measure follows this paragraph). In these experiments, we used three popular Korean embedding models as comparators (the first three models in Table 4). For a fair comparison, we did not use any development sets or other datasets related to the test sets. We trained each of our approaches for only one epoch.
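A minimal sketch of this accuracy measure is given below. It assumes that paraphrase pairs are stored as consecutive sentences (sentence 2i is a query and sentence 2i+1 its gold match); this pairing convention is our assumption about the data layout rather than something stated in the paper.

```python
# Hedged sketch of the accuracy metric: embed all 2000 test sentences, take the
# first sentence of each pair as a query, and count a hit when its nearest neighbor
# (by cosine similarity, excluding itself) is its paraphrase partner. The convention
# that sentences 2i and 2i+1 form a pair is an assumption about the file layout.
from sentence_transformers import SentenceTransformer, util

def pair_accuracy(model: SentenceTransformer, sentences: list) -> float:
    emb = model.encode(sentences, convert_to_tensor=True, normalize_embeddings=True)
    sim = util.cos_sim(emb, emb)      # (2000, 2000) pairwise cosine similarities
    sim.fill_diagonal_(-1.0)          # a sentence must not match itself
    nearest = sim.argmax(dim=1)       # index of the most similar other sentence

    hits = sum(1 for i in range(0, len(sentences), 2) if nearest[i].item() == i + 1)
    return hits / (len(sentences) // 2)   # accuracy over the 1000 queries
```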
The experimental results indicate that the highest average accuracy is achieved when using KLUE-RoBERTa-base as the pre-trained model and fine-tuning with three datasets: KLUE, KSTS, and KorSTS. Additionally, the best results for the Korean test sets are obtained when only the KLUE and KSTS datasets are used. Based on these findings, we have identified three important factors. Firstly, using an appropriate pre-trained model is crucial; the accuracy difference between the KR-SBERT-V40K-based model and those based on KLUE-RoBERTa-base is significant. Secondly, dataset quality is more important than dataset quantity; unfortunately, some popular datasets, especially KSICKR, do not lead to improved performance and can even degrade it. Finally, although high accuracy is achieved on the Korean test sets, the results on the English test sets are relatively lower. In the following sections, we describe novel ways to address these limitations.

An Efficient English Embedding Model
In this section, we conduct experiments to find the best model for conversational sentence embeddings in English. Our approach follows the same experimental setup as described in Section 4, but with English datasets instead of Korean datasets. Additionally, to compare our approach with existing methods, we use all of the selected English models available on sbert.net (https://sbert.net/docs/pretrained_models.html, accessed on 2 May 2023). The experimental results are presented in Table 5. In these results, one of our approaches (fine-tuning on the STSb and STS English datasets) achieves the best average accuracy. Interestingly, the Korean pre-trained model KLUE-RoBERTa-base performed well for English sentence embedding. Our hypothesis is that, during its pre-training phase, the Korean datasets used to train the model may have contained some English expressions, which could have enabled the pre-trained model to also learn how to process English expressions effectively. However, although training on English datasets helps our approaches, state-of-the-art SBERT models still significantly outperform them on the English test datasets. Surprisingly, the best-performing English models are very small models: all-MiniLM-L6-v2 performs best on the GPT-en dataset, and all-MiniLM-L12-v2 performs best on the Paraph-en dataset and on average across the English test datasets.
Our experiments showed that, although the English models achieved exceptional performance on the English datasets, they performed poorly on the Korean datasets. This suggests that using two separate monolingual models, one for English and one for Korean, in collaboration may lead to better results. In the next section, we will explore this possibility further.

A Hybrid Approach
In the previous sections, we conducted an analysis to identify the best-performing embedding models for each language in our conversational datasets. Based on our findings, we aim to answer the following research questions:

• Can we achieve better accuracy in our test datasets by using our Korean and English sentence embedding models together, compared with using state-of-the-art multilingual models?
• If so, can we achieve both higher accuracy and faster processing times in our test datasets by using even smaller versions of the two monolingual models in combination, compared with using state-of-the-art multilingual models?
In order to answer these questions, we have configured our hybrid approach as follows:
• We utilized the KLUE-RoBERTa-small pre-trained model for developing a Korean monolingual model. It is worth noting that this model is smaller than the KLUE-RoBERTa-base utilized in Section 4; hence, it may achieve lower accuracy but perform faster. To fine-tune our model, we employed the KLUE+KSTS datasets, which demonstrated the best performance on the Korean test datasets, as introduced in Section 4.
• To speed up the inference time, we applied PCA (Principal Component Analysis) to reduce the dimensionality of the resulting vectors. We followed a similar approach to the one used in sbert.net: we randomly extracted 20,000 sentences from the KLUE-NLI dataset (which is not relevant to our test datasets) to generate sample embeddings for PCA. The original dimensionality of the Korean model was 768, but the resulting vectors, after applying PCA, had a dimensionality of 384. We added the "PCA layer" to our model, and the PCA process ran very quickly (see the first sketch after this list).
• For the English monolingual model, we used the all-MiniLM-L12-v2 model, which was the best-performing model on average in the English test sets. Because this model was already very small, we did not need to make it smaller. Additionally, the output of the model is also small (384-dimensional vectors), so we did not apply PCA to reduce the dimensionality.
• For combining these two monolingual models, we used a very simple hybrid rule: if a query sentence has at least one Korean letter, then the Korean model produces the corresponding 384-dimensional vector; otherwise, the English model produces it (see the second sketch after this list).
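The first sketch below shows the PCA step, following the dimensionality-reduction recipe published on sbert.net: fit PCA on sample embeddings and append the projection as a Dense layer. The fine-tuned model name and the KLUE-NLI sample file path are placeholders, not artifacts released with this paper.

```python
# Hedged sketch of adding a "PCA layer" (768 -> 384 dims) to the Korean model,
# following the dimensionality-reduction recipe from sbert.net. The fine-tuned
# model name and the KLUE-NLI sample file path are placeholders.
import torch
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, models

model = SentenceTransformer("our-klue-roberta-small-sts")  # hypothetical fine-tuned model

# ~20,000 KLUE-NLI sentences, unrelated to the test sets, used only to fit PCA.
with open("klue_nli_sample.txt", encoding="utf-8") as f:   # placeholder file
    pca_sentences = [line.strip() for line in f if line.strip()][:20_000]

sample_embeddings = model.encode(pca_sentences, convert_to_numpy=True)
pca = PCA(n_components=384)
pca.fit(sample_embeddings)

# Wrap the learned projection as a Dense module appended to the embedding pipeline.
dense = models.Dense(in_features=768, out_features=384, bias=False,
                     activation_function=torch.nn.Identity())
dense.linear.weight = torch.nn.Parameter(
    torch.tensor(pca.components_, dtype=torch.float32))
model.add_module("pca", dense)

print(model.encode("불 좀 켜 줘").shape)  # (384,) after the added projection layer
```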
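The second sketch illustrates the routing rule. Treating the Hangul jamo and syllable Unicode blocks as "Korean letters" is our assumption about what counts as a Korean letter, and the Korean model name is a placeholder; the English model name matches the public all-MiniLM-L12-v2 checkpoint.

```python
# Hedged sketch of the hybrid routing rule: a query containing at least one Korean
# letter goes to the Korean model, anything else to the English model. Treating the
# Hangul jamo and syllable Unicode blocks as "Korean letters" is an assumption, and
# the Korean model name is a placeholder.
import re
from sentence_transformers import SentenceTransformer

KOREAN_CHAR = re.compile(r"[\u1100-\u11FF\u3130-\u318F\uAC00-\uD7A3]")

korean_model = SentenceTransformer("our-klue-roberta-small-sts-pca")  # hypothetical
english_model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")

def embed(query: str):
    """Return a 384-dimensional embedding from the language-appropriate model."""
    if KOREAN_CHAR.search(query):
        return korean_model.encode(query)
    return english_model.encode(query)

print(embed("불 좀 켜 줘").shape)          # routed to the Korean model -> (384,)
print(embed("Turn on the light.").shape)   # routed to the English model -> (384,)
```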
Table 6 displays the experimental results of the existing multilingual models and our approaches. The comparators include two types of models. Firstly, the first five models in this table are selected multilingual models from sbert.net (https://sbert.net/docs/pretrained_models.html, accessed on 2 May 2023), which are based on English sentence embedding models and were generated using Knowledge Distillation. Secondly, the next three models in this table are our own models that use both Korean and English training datasets to fine-tune the KLUE-RoBERTa-base model. Finally, the last model is generated by our hybrid approach, which combines two small monolingual models. The experimental results show that our hybrid approach demonstrates consistent performance across the four test datasets and outperforms the other approaches on average. Based on our previous experiments, only two existing models had an average accuracy of 75% or higher: "Huffon/sentence-klue-roberta-base" and "paraphrase-multilingual-mpnet-base-v2". Therefore, we compare our approach with these models in detail. We compare not only accuracy but also model size, encoding time (CPU/GPU), and time for finding the most similar sentences (CPU/GPU). These elapsed times are crucial in many AI-related applications, particularly in AI educational programming environments. Table 7 shows that our hybrid approach outperforms the existing state-of-the-art approaches in terms of accuracy, encoding time, and the time for finding the most similar sentences, regardless of whether a GPU is used.
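The paper does not list its timing code; the sketch below shows one plausible way such wall-clock measurements could be taken for both encoding and nearest-neighbor search on CPU or GPU. The batch size and the absence of warm-up runs are simplifications of a real benchmark.

```python
# Hedged sketch of the timing measurements: wall-clock time to encode the 2000 test
# sentences and to find every sentence's nearest neighbor, on CPU or GPU. The batch
# size and the lack of warm-up runs are simplifications of a real benchmark.
import time
import torch
from sentence_transformers import SentenceTransformer, util

def benchmark(model_name: str, sentences: list, device: str):
    model = SentenceTransformer(model_name, device=device)

    start = time.perf_counter()
    emb = model.encode(sentences, batch_size=32, convert_to_tensor=True, device=device)
    if device == "cuda":
        torch.cuda.synchronize()          # wait for GPU work before reading the clock
    encode_time = time.perf_counter() - start

    start = time.perf_counter()
    sim = util.cos_sim(emb, emb)
    sim.fill_diagonal_(-1.0)
    _ = sim.argmax(dim=1)                 # nearest neighbor of every sentence
    if device == "cuda":
        torch.cuda.synchronize()
    search_time = time.perf_counter() - start
    return encode_time, search_time

device = "cuda" if torch.cuda.is_available() else "cpu"
```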

Figure 1. Overview of our approach based on multiple monolingual models.

Table 1. Summary of the data used for training our proposed models.

Table 2. Summary of the test data used in our experiments.

Table 4. Comparison of existing Korean sentence embedding models (the first three models) and our embedding models trained with different datasets.

Table 5. Comparison of existing English sentence embedding models (the first nine models) and our embedding models trained with different datasets.

Table 6. Comparison of existing multilingual embedding models (the first five models) and our embedding models trained with different datasets. Note that our hybrid approach uses two small monolingual models.