Search Results (7)

Search Parameters:
Keywords = Korean sentence embedding

15 pages, 3561 KiB  
Data Descriptor
Acoustic Data on Vowel Nasalization Across Prosodic Conditions in L1 Korean and L2 English by Native Korean Speakers
by Jiyoung Jang, Sahyang Kim and Taehong Cho
Data 2025, 10(6), 82; https://doi.org/10.3390/data10060082 - 23 May 2025
Viewed by 697
Abstract
This article presents acoustic data on coarticulatory vowel nasalization from the productions of twelve L1 Korean speakers and fourteen Korean learners of L2 English. The dataset includes eight monosyllabic target words embedded in eight carrier sentences, each repeated four times per speaker. Half of the words contain a nasal coda, such as p*am in Korean and bomb in English, and the other half a nasal onset, such as mat in Korean and mob in English. These were produced under varied prosodic conditions, including three phrase positions and two focus conditions, enabling analysis of prosodic effects on vowel nasalization across languages, along with individual speaker variation. The accompanying CSV files provide acoustic measurements such as nasal consonant duration, A1-P0, and normalized A1-P0 at multiple timepoints within the vowel. While theoretical implications have been discussed in two published studies, the full dataset is published here. By making these data publicly available, we aim to promote broad reuse and encourage further research at the intersection of prosody, phonetics, and second language acquisition, ultimately advancing our understanding of how phonetic patterns emerge, transfer, and vary across languages and learners.
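
The accompanying measurements lend themselves to quick exploration with pandas. Below is a minimal sketch; the file name and every column name are hypothetical placeholders, not the dataset's documented schema.

```python
# Minimal sketch: summarize the acoustic measurements by prosodic condition.
# "acoustic_measurements.csv" and all column names are hypothetical.
import pandas as pd

df = pd.read_csv("acoustic_measurements.csv")

# Mean normalized A1-P0 per language, phrase position, and focus condition.
summary = (
    df.groupby(["language", "phrase_position", "focus_condition"])
      ["a1_p0_normalized"]
      .mean()
)
print(summary)
```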

25 pages, 692 KiB  
Article
Attention-Based 1D CNN-BiLSTM Hybrid Model Enhanced with FastText Word Embedding for Korean Voice Phishing Detection
by Milandu Keith Moussavou Boussougou and Dong-Joo Park
Mathematics 2023, 11(14), 3217; https://doi.org/10.3390/math11143217 - 21 Jul 2023
Cited by 22 | Viewed by 6665
Abstract
In the increasingly complex domain of Korean voice phishing attacks, advanced detection techniques are paramount. Traditional methods have achieved some success, but they often fail to detect sophisticated voice phishing attacks, highlighting an urgent need for enhanced approaches. Addressing this, we designed and implemented a novel artificial neural network (ANN) architecture that combines data-centric and model-centric AI methodologies for detecting Korean voice phishing attacks. This paper presents our hybrid architecture, consisting of a 1-dimensional Convolutional Neural Network (1D CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and Hierarchical Attention Networks (HANs). Our evaluations on the real-world KorCCVi v2 dataset demonstrate that the proposed architecture effectively leverages the strengths of the CNN and BiLSTM to extract and learn contextually rich features from word embedding vectors. Additionally, the word- and sentence-level attention mechanisms from HANs focus the model on crucial features, considerably improving detection performance. Achieving an accuracy of 99.32% and an F1 score of 99.31%, our model surpasses all baseline models we trained, outperforms several existing solutions, and performs comparably to others. These findings underscore the potential of hybrid neural network architectures for voice phishing detection in the Korean language and pave the way for future research, such as refining this model to tackle increasingly sophisticated voice phishing strategies or training it on larger datasets.
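
For a concrete picture of such a hybrid, the sketch below assembles a comparable model in Keras. It is a minimal illustration under assumptions, not the authors' implementation: the hyperparameters are invented, the embedding layer would be initialized with FastText vectors in the paper's setup, and a single additive self-attention layer stands in for the word- and sentence-level attention of HANs.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 30000, 300, 200  # assumed values

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
# Embedding layer; the paper's setup would initialize it with FastText vectors.
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
# 1D CNN extracts local n-gram features from the embeddings.
x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)
x = layers.MaxPooling1D(pool_size=2)(x)
# BiLSTM learns contextual dependencies in both directions.
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
# Additive self-attention pools the timesteps into one weighted summary.
scores = layers.Dense(1, activation="tanh")(x)
weights = layers.Softmax(axis=1)(scores)
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])
outputs = layers.Dense(1, activation="sigmoid")(context)  # phishing vs. normal

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```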

11 pages, 305 KiB  
Article
Using Multiple Monolingual Models for Efficiently Embedding Korean and English Conversational Sentences
by Youngki Park and Youhyun Shin
Appl. Sci. 2023, 13(9), 5771; https://doi.org/10.3390/app13095771 - 7 May 2023
Cited by 1 | Viewed by 3377
Abstract
This paper presents a novel approach to finding the most semantically similar conversational sentences in Korean and English. Our method trains a separate embedding model for each language and uses a hybrid algorithm that selects the appropriate model based on the language of the query. For the Korean model, we fine-tuned the KLUE-RoBERTa-small model on publicly available semantic textual similarity datasets and applied Principal Component Analysis (PCA) to reduce the dimensionality of the resulting embedding vectors. For English, we selected a high-performing embedding model from the available SBERT models. We compared our approach against existing multilingual models on both human-generated and large-language-model-generated conversational datasets. Our experimental results demonstrate that the hybrid approach outperforms state-of-the-art multilingual models in accuracy, sentence-embedding time, and nearest-neighbor search time, regardless of whether a GPU is used. These findings highlight the potential benefits of training separate embedding models for different languages, particularly for tasks that involve finding the most semantically similar conversational sentences. We expect our approach to be useful across diverse natural language processing fields, including machine learning education.
(This article belongs to the Section Computing and Artificial Intelligence)
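
The routing idea can be sketched in a few lines, assuming the sentence-transformers and scikit-learn packages. The checkpoints, the Hangul-based language check, the toy corpus, and the PCA dimensionality are all illustrative stand-ins, not the authors' exact choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Assumed checkpoints: klue/roberta-small gets mean pooling added by the
# library; all-MiniLM-L6-v2 stands in for the chosen English SBERT model.
ko_model = SentenceTransformer("klue/roberta-small")
en_model = SentenceTransformer("all-MiniLM-L6-v2")

def is_korean(text: str) -> bool:
    # Crude routing rule: any Hangul syllable selects the Korean model.
    return any("\uac00" <= ch <= "\ud7a3" for ch in text)

# Fit PCA on a reference corpus of Korean embeddings to shrink the vectors;
# a real setup would fit on a large corpus rather than this toy one.
ko_corpus = ["안녕하세요", "오늘 날씨 어때요?", "점심 뭐 먹을까요?"] * 50
pca = PCA(n_components=128).fit(ko_model.encode(ko_corpus))

def embed(sentence: str) -> np.ndarray:
    # Nearest-neighbor search runs per language, so the two branches
    # need not share an embedding dimensionality.
    if is_korean(sentence):
        return pca.transform(ko_model.encode([sentence]))[0]
    return en_model.encode([sentence])[0]
```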

13 pages, 1632 KiB  
Article
Improving the Performance of Vietnamese–Korean Neural Machine Translation with Contextual Embedding
by Van-Hai Vu, Quang-Phuoc Nguyen, Ebipatei Victoria Tunyan and Cheol-Young Ock
Appl. Sci. 2021, 11(23), 11119; https://doi.org/10.3390/app112311119 - 23 Nov 2021
Cited by 4 | Viewed by 3202
Abstract
With the recent evolution of deep learning, machine translation (MT) models and systems are steadily improving. However, research on MT for low-resource languages such as Vietnamese and Korean is still very limited. In recent years, a state-of-the-art context-based embedding model introduced by Google, Bidirectional Encoder Representations from Transformers (BERT), has begun to appear in neural MT (NMT) models in different ways to enhance their accuracy. The BERT model for Vietnamese has been developed and has significantly improved natural language processing (NLP) tasks such as part-of-speech (POS) tagging, named-entity recognition, dependency parsing, and natural language inference. In our research, we applied the Vietnamese BERT model to provide POS tagging and morphological analysis (MA) for the Vietnamese sentences, and word-sense disambiguation (WSD) for the Korean sentences, in our Vietnamese–Korean bilingual corpus. In our Vietnamese–Korean NMT system with contextual embedding, the BERT model for Vietnamese is connected concurrently to both the encoder and decoder layers of the NMT model. Experimental results assessed with the BLEU, METEOR, and TER metrics show that contextual embedding significantly improves the quality of Vietnamese–Korean NMT.
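
The core idea, using contextual BERT states as the source-side representation of an NMT model, can be sketched as follows. This is an illustration under assumptions, not the paper's system: vinai/phobert-base stands in for the Vietnamese BERT model, and a tiny randomly initialized Transformer decoder stands in for the NMT decoder.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("vinai/phobert-base")   # assumed checkpoint
bert = AutoModel.from_pretrained("vinai/phobert-base")

src = tok("Tôi yêu Hà Nội .", return_tensors="pt")
with torch.no_grad():
    # Contextual embeddings replace fixed word vectors on the source side.
    memory = bert(**src).last_hidden_state                  # (1, src_len, 768)

# Toy decoder consuming the BERT states as encoder memory.
layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
tgt = torch.randn(1, 5, 768)    # stand-in for Korean target-token embeddings
out = decoder(tgt, memory)      # (1, 5, 768) decoder states
```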

12 pages, 995 KiB  
Article
An Empirical Study of Korean Sentence Representation with Various Tokenizations
by Danbi Cho, Hyunyoung Lee and Seungshik Kang
Electronics 2021, 10(7), 845; https://doi.org/10.3390/electronics10070845 - 1 Apr 2021
Cited by 4 | Viewed by 3820
Abstract
How the token unit is defined within a sentence matters in natural language processing tasks such as text classification, machine translation, and generation. Many recent studies adopt subword tokenization in language models such as BERT, KoBERT, and ALBERT. Although these language models achieve state-of-the-art results in various NLP tasks, it is not clear whether the subword is the best token unit for Korean sentence embedding. We therefore built sentence embeddings based on word, morpheme, subword, and submorpheme units for Korean sentiment analysis. We explored two sentence-representation methods: one that considers the order of tokens in a sentence and one that does not. By feeding sentences decomposed by each token unit into both representation methods, we constructed sentence embeddings under various tokenizations to find the most effective token unit for Korean sentence embedding. Our experiments confirmed the robustness of the subword unit against out-of-vocabulary (OOV) problems compared with other token units, the disadvantage of replacing whitespace with a particular symbol in the sentiment analysis task, and an optimal vocabulary size of 16K for subword and submorpheme tokenization. Empirically, subword tokenization with a vocabulary size of 16K and no whitespace replacement was the most effective for sentence embedding on the Korean sentiment analysis task.
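
The token units compared in the study are easy to illustrate, assuming the KoNLPy (morphological analysis) and transformers (subword tokenization) packages. The Okt analyzer and the klue/bert-base checkpoint are stand-ins for the authors' exact tools, and the submorpheme unit is omitted for brevity.

```python
from konlpy.tag import Okt              # morphological analyzer (needs Java)
from transformers import AutoTokenizer

sentence = "자연어 처리는 재미있다"

# Word unit: plain whitespace segmentation.
words = sentence.split()

# Morpheme unit: an analyzer splits words into morphemes.
morphemes = Okt().morphs(sentence)

# Subword unit: a learned subword vocabulary (16K was optimal in the paper).
subwords = AutoTokenizer.from_pretrained("klue/bert-base").tokenize(sentence)

print(words)      # ['자연어', '처리는', '재미있다']
print(morphemes)  # e.g. ['자연어', '처리', '는', '재미있다']
print(subwords)   # e.g. ['자연', '##어', '처리', '##는', '재미', '##있', '##다']
```
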
20 pages, 1381 KiB  
Article
Syntactic Comprehension of Relative Clauses and Center Embedding Using Pseudowords
by Kyung-Hwan Cheon, Youngjoo Kim, Hee-Dong Yoon, Ki-Chun Nam, Sun-Young Lee and Hyeon-Ae Jeon
Brain Sci. 2020, 10(4), 202; https://doi.org/10.3390/brainsci10040202 - 31 Mar 2020
Cited by 3 | Viewed by 5577
Abstract
Relative clause (RC) formation and center embedding (CE) are two primary syntactic operations fundamental to creating and understanding complex sentences. Ample evidence from previous cross-linguistic studies has revealed several similarities and differences between RC and CE. However, it is difficult to investigate the effect of purely syntactic constraints on RC and CE without interference from semantic and pragmatic interactions. Here, we show how readers process CE and RC using a self-paced reading task in Korean. Crucially, we adopted a novel self-paced pseudoword reading task to isolate the syntactic operations of RC and CE, eliminating semantic and pragmatic interference in sentence comprehension. Our results showed that the main effects of RC and CE conform to previous studies. Furthermore, we found a facilitation effect on sentence comprehension when an RC and CE were combined in a complex sentence. Our study provides valuable insight into how purely syntactic processing of RC and CE assists the comprehension of complex sentences.
(This article belongs to the Special Issue Behavioral and Cognitive Neurodynamics)

9 pages, 735 KiB  
Article
Improving Neural Machine Translation by Filtering Synthetic Parallel Data
by Guanghao Xu, Youngjoong Ko and Jungyun Seo
Entropy 2019, 21(12), 1213; https://doi.org/10.3390/e21121213 - 11 Dec 2019
Cited by 7 | Viewed by 4820
Abstract
Synthetic data has been shown to be effective in training state-of-the-art neural machine translation (NMT) systems. Because synthetic data is often generated by back-translating monolingual data from the target language into the source language, it potentially contains considerable noise, such as weakly paired sentences and translation errors. In this paper, we propose a novel approach to filtering this noise from synthetic data. For each sentence pair in the synthetic data, we compute a semantic similarity score using bilingual word embeddings, and by selecting sentence pairs according to these scores we obtain better synthetic parallel data. Experimental results on the IWSLT 2017 Korean→English translation task show that, despite using much less data, our method outperforms the baseline NMT system with back-translation by up to 0.72 and 0.62 BLEU points on tst2016 and tst2017, respectively.
(This article belongs to the Section Multidisciplinary Applications)
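
The filtering step can be sketched as follows, assuming pretrained bilingual word embeddings are available as two word-to-vector dictionaries in a shared space (for example, from MUSE-style alignment); the function names and the similarity threshold are illustrative.

```python
import numpy as np

def sentence_vector(tokens, word_vecs, dim=300):
    # Average the word vectors of in-vocabulary tokens.
    vecs = [word_vecs[w] for w in tokens if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity(src_tokens, tgt_tokens, src_vecs, tgt_vecs):
    # Cosine similarity between averaged source and target vectors.
    u = sentence_vector(src_tokens, src_vecs)
    v = sentence_vector(tgt_tokens, tgt_vecs)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def filter_synthetic(pairs, src_vecs, tgt_vecs, threshold=0.5):
    # Keep only back-translated pairs whose cross-lingual similarity is high.
    return [(s, t) for s, t in pairs
            if similarity(s, t, src_vecs, tgt_vecs) >= threshold]
```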