Search Results (13)

Search Parameters:
Keywords = Korean machine translation

14 pages, 2439 KiB  
Article
A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT
by Hanjo Jeong
Mathematics 2025, 13(5), 864; https://doi.org/10.3390/math13050864 - 5 Mar 2025
Viewed by 901
Abstract
Word sense disambiguation (WSD) plays a crucial role in various natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and information retrieval. Because of the complex morphological structure and polysemy of Korean, the meaning of a word can change with context, making WSD challenging. Since a single word can have multiple meanings, distinguishing between them accurately is essential for improving the performance of NLP models. Recently, large-scale pre-trained models such as BERT and GPT, based on transfer learning, have shown promising results on this problem. However, for morphologically complex languages such as Korean, the tokenization mismatch between a pre-trained model and its fine-tuning data prevents the rich contextual and lexical information learned during pre-training from being fully utilized in downstream tasks. This paper proposes a novel method for resolving the tokenization mismatch during fine-tuning for Korean WSD, leveraging BERT-based pre-trained models and the expert-annotated Sejong corpus. Experimental results with various BERT-based pre-trained models and Sejong corpus datasets show that the proposed method improves performance by approximately 3–5% over existing approaches.
(This article belongs to the Section E1: Mathematics and Computer Science)
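A minimal sketch of one common way to reconcile word-level sense annotations with BERT's subword tokenization: every subword inherits the sense label of the word it came from. This illustrates the general alignment problem the paper addresses, not the authors' method; the model name and sense IDs below are placeholders.

```python
# Propagate word-level sense labels onto subword tokens with a fast
# Hugging Face tokenizer. Model name and labels are illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")  # any Korean BERT

words = ["배를", "먹었다"]         # "배" is polysemous: pear / ship / belly
word_senses = ["배__03", "O"]      # hypothetical Sejong-style sense IDs

enc = tokenizer(words, is_split_into_words=True)
token_senses = []
for word_idx in enc.word_ids(batch_index=0):
    if word_idx is None:           # special tokens [CLS] / [SEP]
        token_senses.append("IGNORE")
    else:                          # each subword inherits its word's sense
        token_senses.append(word_senses[word_idx])

print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), token_senses)))
```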

16 pages, 1092 KiB  
Article
Preprocessing for Keypoint-Based Sign Language Translation without Glosses
by Youngmin Kim and Hyeongboo Baek
Sensors 2023, 23(6), 3231; https://doi.org/10.3390/s23063231 - 17 Mar 2023
Cited by 15 | Viewed by 3980
Abstract
While machine translation for spoken language has advanced significantly, research on sign language translation (SLT) for deaf individuals remains limited. Obtaining annotations such as glosses can be expensive and time-consuming. To address these challenges, we propose a new sign language video-processing method for SLT that requires no gloss annotations. Our approach leverages the signer's skeleton points to identify their movements and helps build a model robust to background noise. We also introduce a keypoint normalization process that preserves the signer's movements while accounting for variations in body length. Furthermore, we propose a stochastic frame-selection technique that prioritizes frames to minimize the loss of video information. Built on an attention-based model, our approach demonstrates its effectiveness through quantitative experiments on various metrics using gloss-free German and Korean sign language datasets.
(This article belongs to the Section Intelligent Sensors)
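A minimal sketch, not the authors' implementation, of the keypoint normalization idea: center each frame's skeleton on a reference joint and scale by shoulder width, so signers of different body sizes and camera positions become comparable. The joint indices follow a hypothetical OpenPose-style layout.

```python
import numpy as np

def normalize_keypoints(frames, neck=1, r_sh=2, l_sh=5):
    """frames: (T, K, 2) array of (x, y) keypoints per frame.
    Translate so the neck sits at the origin, then scale each frame
    by the shoulder distance to remove body-size variation."""
    out = frames - frames[:, neck:neck + 1, :]              # neck -> origin
    shoulder = np.linalg.norm(out[:, l_sh, :] - out[:, r_sh, :], axis=-1)
    return out / np.maximum(shoulder, 1e-6)[:, None, None]  # per-frame scale

frames = np.random.rand(30, 18, 2)       # 30 frames, 18 keypoints
print(normalize_keypoints(frames).shape)  # (30, 18, 2)
```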

32 pages, 2220 KiB  
Article
Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC
by Chanjun Park, Midan Shim, Sugyeong Eo, Seolhwa Lee, Jaehyung Seo, Hyeonseok Moon and Heuiseok Lim
Appl. Sci. 2022, 12(11), 5545; https://doi.org/10.3390/app12115545 - 30 May 2022
Cited by 6 | Viewed by 4665
Abstract
A machine translation (MT) system aims to translate a source language into a target language. Recent studies of MT systems focus mainly on neural machine translation (NMT). One factor that significantly affects NMT performance is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared with those for high-resource languages such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora using Linguistic Inquiry and Word Count (LIWC) and several related experiments. LIWC is a dictionary-based word-counting program that can analyze corpora in multiple ways and extract linguistic features. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our analysis of the correlation between LIWC features and NMT performance suggests directions for further research toward higher-quality parallel corpora.
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)
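LIWC itself is proprietary software, but its core operation is dictionary-based word counting. A toy sketch of that idea, with made-up category lexicons standing in for LIWC's dictionaries:

```python
from collections import Counter

# Illustrative stand-ins for LIWC's (proprietary) category lexicons.
CATEGORIES = {
    "negation": {"not", "no", "never"},
    "cognitive": {"think", "know", "because"},
}

def category_profile(sentences):
    """Return, per category, the fraction of tokens matching its lexicon."""
    tokens = [t.lower() for s in sentences for t in s.split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {cat: sum(counts[w] for w in words) / total
            for cat, words in CATEGORIES.items()}

print(category_profile(["I do not think so", "Because I know it"]))
```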

12 pages, 507 KiB  
Article
AI Student: A Machine Reading Comprehension System for the Korean College Scholastic Ability Test
by Gyeongmin Kim, Soomin Lee, Chanjun Park and Jaechoon Jo
Mathematics 2022, 10(9), 1486; https://doi.org/10.3390/math10091486 - 29 Apr 2022
Cited by 4 | Viewed by 4146
Abstract
Machine reading comprehension is a question-answering mechanism in which a machine reads, understands, and answers questions about a given text. These reasoning skills can be grafted onto the Korean College Scholastic Ability Test (CSAT) to bring about new scientific and educational advances. In this paper, we propose a novel Korean CSAT Question and Answering (KCQA) model and effectively utilize four easy data augmentation strategies, together with round-trip translation, to augment the insufficient training data. To evaluate the effectiveness of KCQA, 30 students took the test under the same conditions as the proposed model. Our qualitative and quantitative analyses, along with the experimental results, revealed that KCQA outperformed the human test-takers, achieving an F1 score 3.86 points higher.
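A sketch of the round-trip translation used for augmentation: translate a sentence into a pivot language and back to obtain a paraphrase. The `translate` function here is a placeholder, not a real API; plug in any MT system or service.

```python
# Round-trip translation paraphrases scarce training text. `translate`
# is a hypothetical stub standing in for an actual MT system.
def translate(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("plug in an MT system or API here")

def round_trip(text: str, pivot: str = "en") -> str:
    """Korean -> pivot -> Korean; the result paraphrases the input."""
    return translate(translate(text, src="ko", tgt=pivot), src=pivot, tgt="ko")

# e.g. round_trip("시험 문제를 푸는 인공지능") might yield a reworded variant
# that is added to the training set alongside the original.
```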

10 pages, 567 KiB  
Article
Text Data Augmentation for the Korean Language
by Dang Thanh Vu, Gwanghyun Yu, Chilwoo Lee and Jinyoung Kim
Appl. Sci. 2022, 12(7), 3425; https://doi.org/10.3390/app12073425 - 28 Mar 2022
Cited by 12 | Viewed by 4144
Abstract
Data augmentation (DA) is a universal technique for reducing overfitting and improving the robustness of machine learning models by increasing the quantity and variety of the training data. Although data augmentation is essential in vision tasks, it is rarely applied to text datasets because it is less straightforward there. Some studies have addressed text data augmentation, but most target majority languages such as English or French; only a few have studied data augmentation for less-resourced languages such as Korean. This study fills that gap by evaluating several common data augmentation methods on Korean corpora with pre-trained language models. In short, we evaluate two text data augmentation approaches, text transformation and back translation, across Korean corpora on four downstream tasks: semantic textual similarity (STS), natural language inference (NLI), question duplication verification (QDV), and sentiment classification (STC). Compared with no augmentation, text data augmentation yields performance gains of 2.24%, 2.19%, 0.66%, and 0.08% on the STS, NLI, QDV, and STC tasks, respectively.
(This article belongs to the Topic Machine and Deep Learning)
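A minimal sketch of simple text-transformation augmentations (random deletion and random swap, in the style of EDA); the paper's exact operations may differ.

```python
import random

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p (keep at least one token)."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

def random_swap(tokens, n=1):
    """Swap n random pairs of token positions."""
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sent = "데이터 증강 은 과적합 을 줄인다".split()
print(random_deletion(sent))
print(random_swap(sent))
```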

13 pages, 1632 KiB  
Article
Improving the Performance of Vietnamese–Korean Neural Machine Translation with Contextual Embedding
by Van-Hai Vu, Quang-Phuoc Nguyen, Ebipatei Victoria Tunyan and Cheol-Young Ock
Appl. Sci. 2021, 11(23), 11119; https://doi.org/10.3390/app112311119 - 23 Nov 2021
Cited by 4 | Viewed by 3145
Abstract
With the recent evolution of deep learning, machine translation (MT) models and systems are steadily improving. However, research on MT for low-resource languages such as Vietnamese and Korean is still very limited. In recent years, a state-of-the-art context-based embedding model introduced by Google, Bidirectional Encoder Representations from Transformers (BERT), has been incorporated into neural MT (NMT) models in various ways to enhance their accuracy. The BERT model for Vietnamese has been developed and has significantly improved natural language processing (NLP) tasks such as part-of-speech (POS) tagging, named-entity recognition, dependency parsing, and natural language inference. In our research, we experimented with applying the Vietnamese BERT model to provide POS tagging and morphological analysis (MA) for the Vietnamese sentences, and word-sense disambiguation (WSD) for the Korean sentences, in our Vietnamese–Korean bilingual corpus. In the Vietnamese–Korean NMT system with contextual embedding, the BERT model for Vietnamese is concurrently connected to both the encoder and decoder layers of the NMT model. Experimental results assessed with the BLEU, METEOR, and TER metrics show that contextual embedding significantly improves the quality of Vietnamese–Korean NMT.
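A hedged sketch of scoring translations with two of the reported metrics using the sacrebleu package (METEOR would need another library, e.g., nltk). The sentences are toy data, not the paper's test set.

```python
# Corpus-level BLEU and TER with sacrebleu; `refs` holds one reference
# stream, aligned index-by-index with the hypotheses.
import sacrebleu

hyps = ["the translation quality improved significantly"]
refs = [["the quality of the translation improved significantly"]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
ter = sacrebleu.corpus_ter(hyps, refs)
print(f"BLEU = {bleu.score:.2f}, TER = {ter.score:.2f}")
```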

13 pages, 598 KiB  
Article
Factors Behind the Effectiveness of an Unsupervised Neural Machine Translation System between Korean and Japanese
by Yong-Seok Choi, Yo-Han Park, Seung Yun, Sang-Hun Kim and Kong-Joo Lee
Appl. Sci. 2021, 11(16), 7662; https://doi.org/10.3390/app11167662 - 21 Aug 2021
Cited by 5 | Viewed by 2075
Abstract
Korean and Japanese have different writing scripts but share the same Subject-Object-Verb (SOV) word order. In this study, we pre-train a language-generation model with the Masked Sequence-to-Sequence (MASS) pre-training method on Korean and Japanese monolingual corpora, allowing only a minimal shared vocabulary between the two languages. We then build an unsupervised Neural Machine Translation (NMT) system between Korean and Japanese on top of the pre-trained generation model. Despite the different writing scripts and small shared vocabulary, the unsupervised NMT system performs well compared with other language pairs. Our interest is in the common characteristics of the two languages that make unsupervised NMT perform so well. We propose a new method that analyzes cross-attentions between a source and target language to estimate language differences from the perspective of machine translation. We compute cross-attention measurements for the Korean–Japanese and Korean–English pairs and compare their performance and characteristics. The Korean–Japanese pair differs little in word order and morphological system, so unsupervised NMT between Korean and Japanese can be trained well even without parallel sentences or a shared vocabulary.
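The abstract does not spell out the cross-attention measurements, but one plausible instance is the entropy of each target token's attention distribution: language pairs with similar word order tend to produce more peaked (lower-entropy) alignments. A toy sketch under that assumption:

```python
import numpy as np

def attention_entropy(attn):
    """Mean entropy of each target token's cross-attention row.
    attn: (tgt_len, src_len), rows sum to 1. Lower values indicate
    more peaked, monotonic-looking alignments."""
    eps = 1e-12
    return float(np.mean(-np.sum(attn * np.log(attn + eps), axis=-1)))

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(8), size=6)  # toy (6 x 8) attention matrix
print(attention_entropy(attn))
```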

21 pages, 2719 KiB  
Article
Named Entity Correction in Neural Machine Translation Using the Attention Alignment Map
by Jangwon Lee, Jungi Lee, Minho Lee and Gil-Jin Jang
Appl. Sci. 2021, 11(15), 7026; https://doi.org/10.3390/app11157026 - 29 Jul 2021
Cited by 4 | Viewed by 4591
Abstract
Neural machine translation (NMT) methods based on various artificial neural network models have shown remarkable performance in diverse tasks and have become the current mainstream for machine translation. Despite the recent successes of NMT applications, a predefined vocabulary is still required, so NMT cannot cope with out-of-vocabulary (OOV) or rarely occurring words. In this paper, we propose a postprocessing method that corrects machine translation outputs using a named entity recognition (NER) model to overcome the OOV problem in NMT tasks. We use an attention alignment map (AAM) between the named entities of the input and output sentences, and mistranslated named entities are corrected through word look-up tables. The proposed method corrects named entities only, so it requires no retraining of existing NMT models. We carried out translation experiments on a Chinese-to-Korean task for Korean historical documents, and the evaluation results show that the proposed method improves the bilingual evaluation understudy (BLEU) score by 3.70 points over the baseline.
(This article belongs to the Special Issue Current Approaches and Applications in Natural Language Processing)
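A minimal sketch of the correction idea: for each source named entity, use the attention matrix to find the output position it aligns to most strongly, then substitute the dictionary translation. The look-up table and alignment direction are illustrative, not the paper's exact procedure.

```python
import numpy as np

# Toy bilingual named-entity look-up table (Chinese -> Korean).
NE_TABLE = {"世宗": "세종", "李舜臣": "이순신"}

def correct_named_entities(src_tokens, out_tokens, attn, src_ne_positions):
    """attn: (out_len, src_len) attention weights. For each source NE,
    replace the most strongly aligned output token with its dictionary
    translation, leaving all other tokens untouched."""
    out = out_tokens[:]
    for i in src_ne_positions:
        j = int(np.argmax(attn[:, i]))  # output position aligned to NE i
        out[j] = NE_TABLE.get(src_tokens[i], out[j])
    return out

src = ["李舜臣", "将军"]
hyp = ["리순신", "장군"]              # mistranslated NE at position 0
attn = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
print(correct_named_entities(src, hyp, attn, src_ne_positions=[0]))
# ['이순신', '장군']
```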

19 pages, 1310 KiB  
Article
Context-Aware Neural Machine Translation for Korean Honorific Expressions
by Yongkeun Hwang, Yanghoon Kim and Kyomin Jung
Electronics 2021, 10(13), 1589; https://doi.org/10.3390/electronics10131589 - 30 Jun 2021
Cited by 4 | Viewed by 5599
Abstract
Neural machine translation (NMT) is a text generation task that has improved significantly with the rise of deep neural networks. However, language-specific problems, such as translating honorifics, have received little attention. In this paper, we propose a context-aware NMT model that improves the translation of Korean honorifics. By exploiting information from the surrounding sentences, such as the relationship between speakers, our proposed model effectively manages the use of honorific expressions. Specifically, we utilize a novel encoder architecture that represents the contextual information of the given input sentences. Furthermore, a context-aware post-editing (CAPE) technique is adopted to refine inconsistent sentence-level honorific translations. Since honorific-labeled test data is required to demonstrate the efficacy of the proposed method, we also design a heuristic that labels Korean sentences as honorific or non-honorific in style. Experimental results show that our method outperforms sentence-level NMT baselines both in overall translation quality and in honorific translation.
(This article belongs to the Section Artificial Intelligence)
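The abstract mentions a heuristic for labeling honorific style. One simple version, an assumption rather than the authors' rule, checks for polite sentence-final endings such as -습니다/-ㅂ니다 or the -요 family:

```python
import re

# Polite/deferential endings: -니다 covers 습니다/ㅂ니다 forms; the rest
# are common -요 endings. Coverage is deliberately rough.
HONORIFIC_ENDINGS = re.compile(r"(니다|세요|어요|아요|에요|예요)\s*[.!?]*\s*$")

def is_honorific(sentence: str) -> bool:
    return bool(HONORIFIC_ENDINGS.search(sentence.strip()))

print(is_honorific("안녕히 가세요."))  # True  (polite)
print(is_honorific("잘 가."))          # False (plain)
```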

12 pages, 995 KiB  
Article
An Empirical Study of Korean Sentence Representation with Various Tokenizations
by Danbi Cho, Hyunyoung Lee and Seungshik Kang
Electronics 2021, 10(7), 845; https://doi.org/10.3390/electronics10070845 - 1 Apr 2021
Cited by 4 | Viewed by 3700
Abstract
How the token unit of a sentence is defined matters in natural language processing tasks such as text classification, machine translation, and generation. Many recent studies utilize subword tokenization in language models such as BERT, KoBERT, and ALBERT. Although these language models achieve state-of-the-art results in various NLP tasks, it is not clear whether the subword is the best token unit for Korean sentence embedding. We therefore carried out sentence embedding based on word, morpheme, subword, and submorpheme units, respectively, on Korean sentiment analysis. We explored two sentence representation methods: one that considers the order of tokens in a sentence and one that does not. Feeding sentences decomposed by each token unit into both representation methods, we constructed sentence embeddings under various tokenizations to find the most effective token unit for Korean sentence embedding. Our experiments confirmed the robustness of the subword unit to out-of-vocabulary (OOV) problems compared with other token units, the disadvantage of replacing whitespace with a special symbol in the sentiment analysis task, and that the optimal vocabulary size is 16K for subword and submorpheme tokenization. Empirically, subwords from a 16K vocabulary without whitespace replacement were the most effective for sentence embedding on the Korean sentiment analysis task.
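A sketch of producing the 16K-vocabulary subword tokenization the study found most effective, using the standard sentencepiece package; the corpus file name and the BPE model type are assumptions.

```python
# Train a 16K subword vocabulary and tokenize a sentence with it.
# "korean_corpus.txt" (one sentence per line) is a placeholder path.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",
    model_prefix="ko_subword_16k",
    vocab_size=16000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="ko_subword_16k.model")
print(sp.encode("한국어 문장 임베딩", out_type=str))  # subword pieces
```
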
15 pages, 1525 KiB  
Article
Decoding Strategies for Improving Low-Resource Machine Translation
by Chanjun Park, Yeongwook Yang, Kinam Park and Heuiseok Lim
Electronics 2020, 9(10), 1562; https://doi.org/10.3390/electronics9101562 - 24 Sep 2020
Cited by 24 | Viewed by 6058
Abstract
Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering to keep only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post-editing and the application of various decoding strategies during translation. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties grow when PFA is used for low-resource languages, as PFA requires large amounts of data and the data for low-resource languages are often insufficient. Building on the premise that NMT performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean–English NMT, a low-resource language pair. Through comparative experiments, we show that translation performance can be enhanced without changes to the model. We examined how performance responded to changes in beam size and to n-gram blocking, and whether applying a length penalty enhanced performance. The results show that these decoding strategies improve performance and compare well with previous Korean–English NMT approaches. The proposed methodology can therefore improve the performance of NMT models without PFA, presenting a new perspective on improving machine translation performance.
(This article belongs to the Special Issue Smart Processing for Systems under Uncertainty or Perturbation)
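Two of the decoding strategies examined, sketched in isolation: the widely used GNMT-style length penalty and n-gram blocking. These are standard formulations, not necessarily the paper's exact settings.

```python
def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style length penalty: ((5 + |Y|) / 6) ** alpha. Dividing a
    hypothesis's log-probability by this keeps beam search from
    systematically favoring short translations."""
    return ((5 + length) / 6) ** alpha

def blocks_repeat_ngram(tokens, candidate, n=3):
    """n-gram blocking: reject `candidate` if appending it to `tokens`
    would repeat an n-gram already present in the hypothesis."""
    ngram = tuple(tokens[-(n - 1):] + [candidate])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return ngram in seen

log_prob = -7.2                      # toy cumulative log-probability
for n in (5, 10, 20):
    print(n, log_prob / length_penalty(n))  # longer -> less penalized
```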

18 pages, 2531 KiB  
Article
UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study
by Van-Hai Vu, Quang-Phuoc Nguyen, Joon-Choul Shin and Cheol-Young Ock
Appl. Sci. 2020, 10(11), 3904; https://doi.org/10.3390/app10113904 - 4 Jun 2020
Cited by 5 | Viewed by 4184
Abstract
Machine translation (MT) has recently attracted much research on various advanced techniques (statistical and deep learning based) and achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora: a Korean-English dataset with over 969 thousand sentence pairs and a Korean-Vietnamese dataset with over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word ambiguity issues in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of subword conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep learning based MT systems. The experimental results demonstrate that high-quality MT systems (in terms of Bilingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores) can be built using UPC. Both UPC and UTagger are freely available for download and use.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)
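UTagger itself is a separate downloadable system; the following toy sketch only illustrates the output format of word-sense annotation, where homographs receive sense codes so the MT model sees distinct tokens for distinct senses. The sense codes and dictionary are hypothetical.

```python
# Map (surface form, sense hint) pairs from a WSD step to sense-coded
# tokens; unannotated tokens pass through unchanged.
SENSE_DICT = {("배", "fruit"): "배__05", ("배", "vessel"): "배__04"}

def annotate(tokens_with_senses):
    """tokens_with_senses: list of (surface, sense-hint) pairs."""
    return [SENSE_DICT.get(pair, pair[0]) for pair in tokens_with_senses]

print(annotate([("배", "vessel"), ("를", None), ("탔다", None)]))
# ['배__04', '를', '탔다']
```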

9 pages, 735 KiB  
Article
Improving Neural Machine Translation by Filtering Synthetic Parallel Data
by Guanghao Xu, Youngjoong Ko and Jungyun Seo
Entropy 2019, 21(12), 1213; https://doi.org/10.3390/e21121213 - 11 Dec 2019
Cited by 7 | Viewed by 4781
Abstract
Synthetic data has been shown to be effective for training state-of-the-art neural machine translation (NMT) systems. Because synthetic data is often generated by back-translating monolingual data from the target language into the source language, it can contain substantial noise: weakly paired sentences or translation errors. In this paper, we propose a novel approach to filtering this noise from synthetic data. For each sentence pair in the synthetic data, we compute a semantic similarity score using bilingual word embeddings, and we select sentence pairs according to these scores to obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that, despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points on tst2016 and tst2017, respectively.
(This article belongs to the Section Multidisciplinary Applications)
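A minimal sketch of the filtering idea: embed both sides of a synthetic pair in a shared bilingual word-embedding space, score them by the cosine similarity of averaged word vectors, and keep only pairs above a threshold. The embedding source, vector size, and threshold are assumptions.

```python
import numpy as np

def sentence_vec(tokens, emb, dim=300):
    """Average of word vectors in a shared bilingual embedding space;
    `emb` maps token -> vector, and OOV tokens are skipped."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity(src_tokens, tgt_tokens, emb):
    """Cosine similarity between averaged source and target vectors."""
    a, b = sentence_vec(src_tokens, emb), sentence_vec(tgt_tokens, emb)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Keep back-translated pairs whose score clears a chosen threshold, e.g.:
# filtered = [(s, t) for s, t in pairs if similarity(s, t, emb) > 0.5]
```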