Comparative Study of Multiclass Text Classification in Research Proposals Using Pretrained Language Models

Abstract: Recently, transformer-based pretrained language models have demonstrated stellar performance in natural language understanding (NLU) tasks. For example, bidirectional encoder representations from transformers (BERT) have achieved outstanding performance through masked self-supervised pretraining and transformer-based modeling. However, the original BERT may only be effective for English-based NLU tasks, whereas its effectiveness for other languages such as Korean is limited. Thus, the applicability of BERT-based language models pretrained in languages other than English to NLU tasks based on those languages must be investigated. In this study, we comparatively evaluated seven BERT-based pretrained language models and their expected applicability to Korean NLU tasks. We used the climate technology dataset, which is a Korean-based large text classification dataset, in research proposals involving 45 classes. We found that the BERT-based model pretrained on the most recent Korean corpus performed the best in terms of Korean-based multiclass text classification. This suggests the necessity of optimal pretraining for specific NLU tasks, particularly those in languages other than English.


Introduction
Transformer language modeling with pretraining and word representations combined with transfer learning has significantly improved many natural language understanding (NLU) tasks, such as text classification and natural language inference [1][2][3][4][5][6]. Deep contextualized language models pretrained through masked language modeling, such as bidirectional encoder representations from transformers (BERT) [3], have demonstrated state-of-the-art performance in NLU tasks. However, these BERT-based models are primarily tested on text data in English; hence, they may not perform well on text written in other languages. In particular, the Korean language, which is an agglutinative language that requires sophisticated processing, may not be managed effectively by pretrained language models (PLMs) developed for the English language. Korean-specific PLMs such as KoBERT [7], KLUE-BERT [8], and KR-BERT [9], pretrained on numerous novel Korean-centric datasets, have shown significant improvements in Korean-based NLU tasks. This indicates that PLMs are sensitive to the language of the text data used for pretraining. However, most studies on pretraining with Korean-specific corpora use datasets collected under different criteria, and their performance is not comprehensively verified on each NLU benchmark, rendering the selection of pretraining specifications difficult.
In this study, we investigated the manner in which current PLMs classify climate technology in research proposals written in the Korean language, which is a multiclass text classification task. We fine-tuned seven PLMs using the dataset and compared the classification performance across PLMs while matching all hyperparameters. The main contributions and findings of this study are as follows:

• We compared the performance of seven fine-tuned BERT-based models pretrained with different pretraining corpora.

• We evaluated the performance of each model on Korean-based multiclass text classification using a climate technology classification dataset containing more than 200,000 research proposals in Korean spanning 45 different categories.

• The model pretrained on the most recent novel Korean corpora showed up to a 7% performance improvement compared with models pretrained on other Korean corpora.

Transfer Learning and Pretrained Language Modeling
Transfer learning improves performance by sharing the parameters of a model that has been trained on similar task data in advance, unlike conventional machine learning, which performs isolated, single-task learning. Furthermore, in transfer learning, knowledge from previously learned tasks is reused when learning a new task; therefore, it is more accurate, requires less training data, and learns faster than isolated machine learning.
Transfer learning was first presented and actively applied in image classification, where pretrained network models such as VGG [10], Inception [11], and ResNet [12] were used. In natural language processing (NLP), PLMs such as GPT [13], ELMo [1], and BERT [3] were released in 2018 and achieved state-of-the-art performance in most NLP tasks. In particular, BERT and ELMo overcame a critical constraint of previous language models such as GPT: the unidirectional learning method. Unidirectional learning does not consider the "context", which is one of the most important factors of language models for a wide range of NLP tasks; therefore, it restricts the capability of the pretrained representations. To address this issue, well-contextualized bidirectional learning methods were proposed in BERT and ELMo. Unlike ELMo, which is based on long short-term memory (LSTM) and applies a pretrained model to downstream tasks using a feature-based method, BERT avoids the vanishing gradient problem caused by recurrent neural network layers by replacing them with transformer-based layers. BERT can be applied to downstream tasks by fine-tuning all pretrained parameters, as shown in Figure 1.

BERT
BERT is a deep bidirectional pretrained encoder model that operates by stacking transformers in multiple layers for language understanding. BERT is available in two sizes: BERT-BASE (containing 12 transformer layers and a hidden size of 768) and BERT-LARGE (containing 24 transformer layers and a hidden size of 1024). These two models were trained on the same datasets: BookCorpus (800M words) [14] and English Wikipedia (2500M words). BERT can be applied to downstream tasks by fine-tuning all pretrained parameters, as shown in Figure 1. As such, BERT can be easily applied to several NLP tasks by adding output layers. BERT performed the best in 11 NLP tasks, achieving state-of-the-art performance in 8 tasks of the general language understanding evaluation (GLUE) benchmark [15], in SQuAD v1.1 [16] and v2.0 [17] (question answering tasks), and in SWAG [18] (a common-sense inference task). BERT is designed for training using self-supervised learning, wherein the model learns the context of a sentence during training through a masked language model. The masked language model randomly masks tokens in the input text, and the model predicts the original vocabulary of each masked word based only on its context. This method solves the unidirectionality constraint, which fails to consider the context of a sentence comprehensively, and allows BERT to capture changes in the meaning of words based on context during training, thereby enabling the BERT language model to achieve performance comparable to that of humans on language understanding tasks. In addition, the next sentence prediction task allows the relationships between sentences to be determined based on text-pair representations. By applying next sentence prediction, BERT can be utilized in many downstream tasks, such as question answering and natural language inference, through its understanding of the relationship between two sentences.
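As a rough illustration of the masked-language-model objective described above, the following sketch selects tokens at random and records them as prediction targets. This is a simplification: actual BERT additionally leaves some selected tokens unchanged or replaces them with random tokens (the 80/10/10 rule), which is omitted here.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Simplified BERT-style masking: select each token with probability
    mask_prob, replace it with the mask token, and record the original
    token as a prediction target for the language model."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # original token to be predicted
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets
```

During pretraining, the model is trained to recover each entry of `targets` from the surrounding unmasked context, which is what makes the learned representations bidirectional.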

Multilingual Pretraining Using BERT
A few models, such as XLNet [19] and T5 [20], have demonstrated excellent performance on NLU tasks; however, BERT is still adopted as the baseline model in many studies because it can be easily fine-tuned and its performance is comparable to that of other, newer models. In addition, RoBERTa, which affords better performance by addressing the weaknesses of BERT, has been widely applied alongside BERT in many studies. However, most of these BERT-based models are English-centric by design; consequently, researchers have attempted to develop models based on languages other than English.
Multilingual BERT [19] retains the model structure of BERT but replaces the English-centric pretraining corpus with one covering more than 100 languages, resulting in significant performance improvements over the original BERT in natural language comprehension tasks in those languages. However, processing more than 100 languages inflates the vocabulary size inefficiently, which restricts memory efficiency.
Cross-lingual modeling [4] trains a model via unsupervised pretraining in which learning in English and in other languages is applied simultaneously. BERT pretrained through the cross-lingual method improved NLU accuracy in multilingual tasks compared with the original multilingual BERT pretrained on a dataset containing 100 languages. This method significantly improves performance in multilingual tasks other than English ones. However, cross-lingual modeling does not match well-preprocessed English models in terms of accuracy.
Meanwhile, researchers have achieved performances comparable to those of English-based models by completely replacing the pretraining corpus with Korean and excluding English [8,9]. Because these studies were conducted independently using different methods in recent years, the pretraining corpus was determined separately for each study, and the studies differ only in the composition of their datasets, which renders performance prediction difficult. For these Korean-based models, experiments must be conducted on new NLU benchmark datasets instead of only substantially standard benchmarks such as NSMC [21] (text classification) and KorQuAD [22] (question answering). Benchmark evaluation of Korean text classification is primarily conducted on NSMC, which involves the classification of only two labels; in fact, it is difficult to predict performance on other tasks using only these results.

Robust Language Models in Korean
We selected a total of seven pretrained language models based on BERT: three in Korean, three multilingual, and one in English. This section discusses the criteria for selecting the PLMs and provides a brief overview of the seven PLMs. We selected PLMs that satisfied the following criteria for our task: (1) the target PLMs must be based on BERT or RoBERTa; (2) their pretraining is expected to be robust to Korean text data, i.e., they are Korean-pretrained or multilingual-pretrained (the pretraining details are shown in Table 1); and (3) the original, English-pretrained RoBERTa, which is not directly related to Korean, is included to demonstrate the differences in performance between models pretrained in Korean and in English. An overview of these models is presented below, and some of the pretraining details are shown in Table 1.

Table 1. Summary of the seven pretrained language models.

KoBERT: Owing to the limitations of BERT-base-multilingual-cased (BERT-M-cased) in Korean NLP tasks, KoBERT (a Korean-pretrained BERT) was released [7], pretrained only on a Korean corpus with the BERT architecture. Its architecture is the same as that of BERT-BASE (12 transformer encoder layers and a hidden size of 768), and its pretraining dataset is primarily based on the Korean Wikipedia (54M words, 5M sentences).

English-Pretrained Baseline
RoBERTa: RoBERTa addresses BERT's undertraining through the following four tuning procedures: using more pretraining data, applying dynamic masking, removing the next sentence prediction (NSP) loss, and training on longer sequences. RoBERTa achieved state-of-the-art results on all nine GLUE tasks and outperformed recently proposed architectures such as XLNet [19]. We used the original RoBERTa, which considered pretraining only on an English corpus, as the baseline.
XLM-RoBERTa: XLM-RoBERTa is a transformer PLM that is typically used in multilingual tasks [23]. It is pretrained as a masked language model on 100 languages and 2.5 TB of filtered common crawl data. XLM-RoBERTa is a multilingual version of RoBERTa, where XLM is an acronym for cross-lingual language model. RoBERTa is an unsupervised model that relies only on monolingual data, whereas the XLM is a supervised model that leverages parallel data with a new XLM model objective [4].

Evaluations
We evaluated the performance of the seven models on Korean natural-language-based climate technology classification as a benchmark task. The PLMs can be broadly classified into English-pretrained, multilingual-pretrained, and Korean-pretrained models, and they are expected to demonstrate different levels of classification performance depending on their pretraining data properties.
We implemented a neural network model that performs the text classification downstream task by adopting a BERT-based pretrained encoder and adding an output layer for multiclass classification. Subsequently, we conducted a comparative study in which we analyzed the process by which BERT performs downstream tasks and compared the performance of each model on our main task (Korean text-based multiclass classification). The seven models described in the previous section were compared under identical hyperparameter settings.
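The encoder-plus-output-layer design described above can be sketched as follows. This is a minimal plain-Python illustration with hypothetical weights; in practice, the pooled [CLS] vector comes from the pretrained BERT encoder and the linear layer is learned during fine-tuning.

```python
import math

def softmax(logits):
    # Numerically stable softmax over the class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, biases):
    """Apply a linear output layer to the pooled [CLS] representation
    and return the index of the most probable class (one row of weights
    and one bias per class; 45 classes in our task)."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

During fine-tuning, all encoder parameters and this output layer are updated jointly by minimizing the cross-entropy between the softmax output and the class label.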

Climate Technology Classification Dataset
We used the natural-language-based climate technology classification corpus provided by the Green Technology Center (Seoul, Korea) as a multiclass text classification task to evaluate the seven pretrained language models and their expected robustness to Korean-based tasks achieved through pretraining in Korean.
The dataset was organized to classify research proposals, written in Korean, into 45 classes based on the climate technology categories of the National Science & Technology Information Service. The class labels were defined based on the Climate Technology Information System (CTIS) of the Green Technology Center (Table A1); the index number (1 to 45) and the name in the rightmost column (labeled "Section") were used as labels. The dataset included approximately 170,000 training examples and 43,000 test examples in the form of research proposals in Korean. Each example comprised 13 features of the research proposal, such as the title, aims, descriptions, keywords, and other minor information. After inspecting all features manually, we selected three features from the dataset (the title, the aim, and the keywords) because they contain the information most relevant to the classes. In addition, the resulting sequence length was appropriate for the BERT-based language models when the three features were concatenated. A detailed description of an example from the dataset is presented in Table 2.

Table 2. Example from the dataset (descriptions translated from Korean into English).

Title: To develop the power distribution control technology of high-fuel-efficiency PHEV using road information.

Aim: (Goals for second year) o Comparative evaluation of real-time driver tendency judgment technology; technology analysis for determining driver tendencies. o Speed profile prediction technology considering driver propensity and road conditions. To develop speed profile prediction technology using the established driving information DB. To develop global speed profile prediction technology using driver propensity, GIS information, and traffic light situation in the absence of a driving route DB. o Establishing a foundation for the development and verification of power distribution control algorithms. To develop a path-based real-time optimal power distribution control algorithm. To develop engine operation minimization control technology based on real-time vehicle external information. To construct a human-in-the-loop simulation environment.

Keywords: Plug-in hybrid vehicles, optimal power distribution control, driving information prediction, driving propensity of driver, and human-in-the-loop simulation.

Label: Transport efficiency
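Building the model input from these three columns can be sketched as follows. The dictionary keys here are hypothetical placeholders for the dataset's actual column names.

```python
def build_input_text(example, sep=" "):
    """Concatenate the three most informative columns (title, aim,
    keywords) into a single text input; the remaining ten columns
    of the research proposal are dropped."""
    return sep.join((example["title"], example["aim"], example["keywords"]))
```

The concatenated string is then tokenized and truncated to the fixed sequence length before being fed to the encoder.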

Preprocessing
We used the Mecab morphological analyzer [24] to perform part-of-speech tagging on the text data, after removing certain stop characters such as "{ }," "( )," and "[ ]". We extracted only nouns from the text data because the dataset primarily comprised text summarized from complete research proposals; moreover, noun extraction is a widely accepted method for Korean, which is an agglutinative language in which various morphemes attach to base nouns [25][26][27]. An example of preprocessing performed using the morphological analyzer is shown in Figure 2.
In this section, we describe our preprocessing results based on data from the dataset. Although the dataset comprised primarily Korean text, we provide not only the Korean processing results but also the English translation of the example data in Table 2. We extracted nouns using the Mecab morphological analyzer, using the same example data as in Table 2. First, we appended 3 of the 13 columns to the training set and used the result as the main text data; the other columns were removed. Second, noun filtering was performed using the Mecab analyzer. As shown in Figure 2, several unique characters, numbers, and English characters from the original text remained after the columns were appended. These characters were removed through text filtering, and only nouns were extracted. Consequently, only nouns separated by spaces were used as input data. We applied the same preprocessing to the training and test datasets.
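The filtering steps above can be sketched as follows. Here `tagged_tokens` stands in for the (surface, part-of-speech) pairs a Mecab wrapper would return, and the noun-tag prefix is an assumption based on the Mecab-ko tag set, in which common and proper noun tags begin with "NN".

```python
import re

def strip_special_characters(text):
    # Remove the bracket-style stop characters mentioned above.
    return re.sub(r"[{}\[\]()]", " ", text)

def extract_nouns(tagged_tokens):
    """Keep only tokens whose part-of-speech tag marks a noun;
    postpositions, verbs, and other morphemes are discarded."""
    return [tok for tok, pos in tagged_tokens if pos.startswith("NN")]
```

The surviving nouns are then joined with spaces to form the input text, matching the space-separated noun sequences described above.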

Evaluation Metric
Evaluation metric: The Macro-F1 score was used as the evaluation metric in this study. The Macro-F1 score is widely used in multiclass classification and is calculated by summing and averaging the F1 scores computed for each label. The F1 and Macro-F1 scores are defined in Equations (1) and (2):

F1 = 2 × (Precision × Recall) / (Precision + Recall) (1)

Macro-F1 = (1/N) × Σ F1_i, (2)

where N is the number of classes and F1_i is the F1 score of class i. Furthermore, we evaluated the validation and testing performances. The validation performance was evaluated using a portion of the test dataset, whereas the test performance was calculated using the remainder of the test dataset to determine whether overfitting occurred.

Figure 2. Example of morphological filtering using Mecab as the preprocessor. The morphological analyzer filtered not only unique characters such as "[" and "]" and numeric data such as "2," but also postpositions in Korean, such as "을" and "를," which do not exist in English. The output is presented at the bottom of the figure.
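The Macro-F1 metric described in this section can be implemented directly; a minimal plain-Python sketch:

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-F1: compute precision, recall, and F1 per class,
    then average the per-class F1 scores with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true label differs
            fn[t] += 1  # true label t was missed
    f1_scores = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally regardless of its frequency, Macro-F1 penalizes models that neglect rare classes, which matters for a 45-class dataset with uneven class sizes.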

Hyperparameters: We fixed all the training conditions (loss function, optimizer, epochs, validation split rate, and sequence length) when evaluating the performances of the BERT-based encoder models. We applied Adam [28] as the optimizer and sparse categorical cross-entropy as the loss function, which converts the categorical cross-entropy (CCE) target from a one-hot vector to an integer. The CCE loss is widely accepted in multiclass classification tasks and is defined as follows:

CCE = −Σ_i y_i log(ŷ_i),

where ŷ_i is the predicted model output and y_i is the target value; both are described as one-hot vectors. We trained our models for 10 epochs with a validation split rate of 20%. The sequence length for BERT was fixed across all pretrained models. The hyperparameters are summarized in Table 3. Table 3. Hyperparameters in fine-tuning.

Loss Function Optimizer Epochs Validation Split Rate Sequence Length
Categorical Cross Entropy Adam 10 20% 256
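The difference between the two target formats mentioned above can be illustrated with a minimal single-example sketch, assuming the model outputs a probability vector over the classes:

```python
import math

def categorical_cross_entropy(y_true_onehot, y_pred_probs):
    # Standard CCE: the target is a one-hot vector over the classes.
    return -sum(t * math.log(p)
                for t, p in zip(y_true_onehot, y_pred_probs) if t)

def sparse_categorical_cross_entropy(y_true_index, y_pred_probs):
    # Sparse CCE: the target is stored as an integer class index
    # instead of a one-hot vector; the loss value is identical.
    return -math.log(y_pred_probs[y_true_index])
```

With 45 classes, the sparse form avoids materializing a one-hot vector per example while producing exactly the same loss.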

Evaluation on Climate Technology Classification
To investigate the performance of the PLMs in classifying climate technology in research proposals, we calculated the classification performance of each PLM fine-tuned on the preprocessed text (Table 4). RoBERTa is the only model pretrained on English text alone; therefore, it can be used as a baseline to estimate the improvement in the classification performance of the other models. It achieved classification performances of 0.59 and 0.56 on the validation and test datasets, respectively. Meanwhile, XLM-RoBERTa, which is a cross-lingual pretrained model, achieved classification performances of 0.63 and 0.61 on the validation and test datasets, respectively. Compared with RoBERTa, XLM-RoBERTa demonstrated a performance improvement of approximately 4%, indicating that cross-lingual pretraining achieved better classification performance than English-only pretraining. Meanwhile, BERT-M-cased, pretrained with the multilingual method, achieved classification performances of 0.65 and 0.64 on the validation and test datasets, respectively, indicating that the multilingual model performed better than the cross-lingual one. To investigate the effect of pretraining on uncased text in the multilingual setting, we calculated the classification performance of BERT-M-uncased, which obtained 0.70 and 0.68 on the validation and test datasets, respectively; the multilingual model pretrained on uncased characters thus outperformed the cased model in this classification task. Additionally, we calculated the classification performance of the models pretrained on Korean text to investigate language-specific effects of pretraining. The classification performance of KoBERT, which was pretrained using the Korean Wikipedia only, was 0.67 for both the validation and test datasets.
Meanwhile, KLUE-RoBERTa, which was pretrained with RoBERTa using the KLUE dataset, demonstrated classification performances of 0.72 and 0.70 on the validation and test datasets, respectively. The differences in classification performance were attributable to the size and refinement of the dataset used for pretraining. Finally, KLUE-BERT achieved the highest classification performance (0.74 and 0.72 on the validation and test datasets, respectively). These findings imply that matching the language, quality, and size of the pretraining text is key in multiclass classification. In addition, the gap between the lowest- and highest-performing models was approximately 0.15.

Conclusions
In this study, we investigated the applicability of PLMs to Korean text datasets and evaluated them through multiclass classification. We compared the performances of BERT-based encoder models and discovered that KLUE-BERT outperformed the other models in terms of classification; moreover, the gap between the lowest- and highest-performing models was up to 0.15 in terms of the Macro-F1 score, underscoring the importance of matching the pretraining language to the classification task.
We aimed to show the degree of performance difference caused by encoder models pretrained in Korean on a multiclass dataset with 45 labels, which is more challenging than existing text classification benchmarks. Our findings suggest that the size and refinement of the corpus used for pretraining could be crucial factors, and that applying the most appropriate pretrained language model to an NLP task is fundamental.
As with any scientific study, this study has limitations. First, we tested the PLMs on a classification task only. Other NLP tasks, such as natural language inference or question answering, should be tested because these tasks may be more sensitive to language-specific models. Second, we selected BERT-based pretrained language models only for the dataset. Other pretrained language models, such as ELECTRA, should be considered to investigate model-specific effects. Third, we extracted only nouns from the dataset because the dataset consisted of summarized text in Korean. Noun extraction has been regarded as an effective approach to understanding Korean, but other part-of-speech tags could be used in a future study. Thus, future research should consider other NLP tasks and state-of-the-art language models to verify our findings.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Category Labels of Climate Technology Classification Dataset
In this section, we provide the raw data for the 45 categories used as class labels in the climate technology dataset. The labels correspond to research areas of climate technology such as greenhouse gas mitigation, agriculture, water management, and climate change forecasting. According to the CTIS of the Green Technology Center and the training data of the dataset, 45 class labels existed in addition to a default label "0"; therefore, 46 labels were configured for the multiclass classification task.

Table A1. Descriptions of the Climate Technology Information System of the Green Technology Center. Index numbers of the rightmost column (Section) were used as labels for our classification task.