Article

Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

1 Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
2 Upstage, Yongin 17006, Korea
3 Department of Computer Science, Yonsei University, Seoul 03722, Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(11), 5545; https://doi.org/10.3390/app12115545
Submission received: 11 April 2022 / Revised: 23 May 2022 / Accepted: 26 May 2022 / Published: 30 May 2022
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)

Abstract
Machine translation (MT) systems aim to translate a source language into a target language. Recent MT research has focused mainly on neural machine translation (NMT). One factor that significantly affects NMT performance is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features based on a dictionary. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our correlation analysis between LIWC features and NMT performance suggests directions for further research toward building higher-quality parallel corpora.

1. Introduction

In the past few years, the demand for machine translation (MT) systems has continuously increased, and their importance is growing, especially for industrial services [1,2]. Companies such as Google, Facebook, Microsoft, Amazon, and Unbabel continue to conduct research and formulate plans to commercialize MT-related applications.
From the late 1950s, before the advent of deep learning technology, numerous MT projects focused mainly on rule-based and statistical approaches. As deep learning-based neural machine translation (NMT) was proposed and adopted in several studies, it gradually became clear that superior performance can be achieved with the NMT approach [3,4,5,6].
Following the adoption of deep learning-based techniques, improvements in computing power (e.g., GPUs) and the corresponding advances in parallel processing accelerated the development of NMT. More recently, the release of open-source frameworks, such as PyTorch [7], and easier access to big data have further facilitated vigorous and diverse research.
However, several issues concerning the enhancement of NMT systems still remain. Most notably, ensuring the quality of training data is an unresolved issue. As previous studies have shown, the quality of the training data is deeply related to NMT performance [8,9]. The major problem is that building a high-quality parallel corpus is time-consuming and expensive, and it is especially difficult for low-resource languages, such as Korean. Although data-augmentation techniques, such as back translation [10] and copied translation [11], have been introduced, the quality of such pseudo-generated parallel corpora cannot be guaranteed because human supervision is generally minimized or excluded from the data generation process [12,13]. This restricts pseudo-generated parallel corpora to complementing human-labeled gold parallel corpora rather than substituting for them [14].
To alleviate the above limitations, numerous studies on the collection of high-quality training data have been conducted, such as parallel corpus filtering (PCF) research and the Data Dam project. PCF is a research field that aims to filter out low-quality, noisy data (i.e., sentence pairs) residing in a parallel corpus and thereby improve the overall quality of the corpus. PCF is currently being applied to various NMT studies and has contributed to the advancement of NMT systems [15,16]. Whereas the amount of training data had a significant impact on statistical MT approaches, data quality is generally treated as more important than data quantity in deep learning-based MT approaches [17,18]. Moreover, Data Dam (http://www.data-alliance.kr/default/, accessed on 25 May 2022) projects for building high-quality parallel corpora are in progress at the national level. In the Republic of Korea, a large number of parallel corpora are open to the public through AI Hub (http://aihub.or.kr/, accessed on 25 May 2022), which is organized by the National Information Society Agency (NIA) [19].
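To illustrate the kind of rule-based heuristics PCF typically applies, the following is a minimal sketch (not the pipeline of the cited studies); the length bounds, ratio threshold, and toy sentence pairs are assumptions for demonstration only.

```python
def filter_parallel_corpus(pairs, min_len=1, max_len=200, max_ratio=3.0):
    """Keep only sentence pairs that pass simple quality heuristics.

    pairs: iterable of (source, target) sentence strings.
    """
    seen = set()
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        # Drop empty or overly long sentences.
        if not (min_len <= len(src_tokens) <= max_len):
            continue
        if not (min_len <= len(tgt_tokens) <= max_len):
            continue
        # Drop pairs whose length ratio suggests misalignment.
        ratio = max(len(src_tokens), len(tgt_tokens)) / min(len(src_tokens), len(tgt_tokens))
        if ratio > max_ratio:
            continue
        # Drop exact duplicates and trivially identical pairs.
        if (src, tgt) in seen or src == tgt:
            continue
        seen.add((src, tgt))
        yield src, tgt

# Toy usage (hypothetical sentence pairs):
corpus = [("나는 사과를 먹었다", "I ate an apple"),
          ("나는 사과를 먹었다", "I ate an apple"),   # duplicate -> removed
          ("짧음", "a very long mistranslated sentence " * 10)]  # ratio -> removed
print(list(filter_parallel_corpus(corpus)))
```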
Following these research trends, in which quality is treated as more important than quantity in the data construction process, we analyzed the Korean parallel corpora distributed by AI Hub. Despite their sufficient volume, the quality of these corpora has not been clearly confirmed. This may restrict their unconstrained use in NMT models, as low-quality data may degrade overall performance. In this study, we conducted several quality verification experiments, including analysis with Linguistic Inquiry and Word Count (LIWC) [20,21], and clarified the quality and characteristics of these corpora. By analyzing various factors that can affect NMT performance, we propose directions that future research can take based on the analysis results.
LIWC is a text-analysis tool that automatically counts the words in a sentence and classifies words with similar meanings and sentimental characteristics. LIWC extracts various interpersonal variables related to clinical, social, physiological, cognitive, psychological, and developmental contexts that cannot be detected using previous text-analysis programs. Additionally, LIWC comprises a variety of features for analyzing text. LIWC is generally used to recognize linguistic markers in mental health studies, such as detecting narcissism [22], schizophrenia [23], and bipolar disorder [24]. However, because LIWC provides various linguistic features, such as word counts and gender-related word usage, it can also be used for many other analyses. In this study, we use LIWC to analyze parallel corpora based on diverse properties. To the best of our knowledge, this is also the first time a parallel corpus has been analyzed using LIWC.
In addition, we conduct baseline translation experiments by training the Transformer-base model architecture [4] on all the parallel corpora provided by AI Hub. By analyzing the MT performance of the corresponding models, we propose further research directions for Korean MT. The contributions of this study are as follows:
  • For the first time, we conduct a deep data analysis of AI Hub data. To the best of our knowledge, this is the first time LIWC has been used to analyze parallel corpora. This study can serve as a milestone for further NMT studies on the Korean language.
  • We conduct baseline translation experiments on all the AI Hub parallel corpora. Our experiments provide a foundation for further research on Korean-based NMT.
  • We discovered several factors that may degrade model performance, and our correlation analysis between LIWC features and model performance indicates how these factors could be filtered out.

2. Related Works and Background

2.1. Machine Translation

Machine translation refers to a computer system that translates source sentences into target sentences; it has achieved significant performance improvements with the advent of deep learning. In 1951, Yehoshua Bar-Hillel first started research on MT at MIT [25], and the field has since developed through rule-based, statistical, and deep learning-based approaches, in that order.
Rule-based Machine Translation. Rule-based MT (RBMT) [26,27] is a translation method based on linguistic rules established by linguists, together with traditional natural language processing steps such as lexical, syntactic, and semantic analysis. For example, Korean–English RBMT transfers a Korean sentence into English grammatical structure based on morphological and syntactic analysis, translating the Korean source sentence into an English target sentence. This method has the advantage of producing ideal translations for sentences that conform to the rules, but extracting grammatical rules is difficult and requires extensive linguistic knowledge. It is also difficult to extend to new language pairs, and numerous rules must be considered.
Statistical Machine Translation. Statistical MT (SMT) [28,29] translates using statistical prior knowledge learned from a large-scale parallel corpus. This method utilizes alignment and co-occurrence statistics between words in the corpus.
SMT contains a translation model, a reordering model, and a language model. It extracts the alignment between source and target sentences through the translation model and predicts the probability of the target sentence through the language model. Unlike RBMT, this methodology can be developed without linguistic knowledge, and higher performance can generally be obtained by increasing the amount of data. However, building large amounts of data is challenging, and context is difficult to capture because translation is carried out at the word or phrase level.
In the case of SMT, the methodology has changed according to the unit of translation. Early work translated word by word. In 2003, however, a method that translates bundles of words (i.e., phrase units) was proposed and showed better performance than word-level translation. Introducing variables within phrases is referred to as "Hierarchical Phrase-Based SMT": instead of a fully lexicalized phrase such as "eat bread", the phrase is expressed with a variable X, as in "eat X". The advantage of this approach is that the variable X can accommodate a variety of substitute words, such as "apple" and "pineapple". Pre-reordering-based SMT changes the word order before translation. Korean word order is Subject–Object–Verb (SOV), whereas English is Subject–Verb–Object (SVO); when the word orders differ, this methodology reorders the source sentence to match the word order of the target language before translation proceeds. Syntax-based SMT is a technique that refines "eat X" in Hierarchical Phrase-Based SMT to "eat NP (Noun Phrase)". In other words, not every phrase can fill the variable slot; only noun phrases can, which eliminates unnecessary translation candidates in advance [28,29].
Neural Machine Translation. NMT applies deep neural networks to the translation task. Based on the sequence-to-sequence model, the source language is vectorized through an encoder, and the latent vector is decoded through a decoder to generate the target language. It is a method of using a deep neural network to discover the most appropriate representations and translation results from paired input and output sentences alone. For text-to-text sequential modeling [30], an NMT model generally comprises an encoder-decoder structure that takes an input sequence and generates the output sequence auto-regressively. NMT architectures have developed from Recurrent Neural Networks (RNN) [3,31] and Convolutional Neural Networks (CNN) [32,33] to the Transformer [4], which outperforms the earlier methods. Furthermore, fine-tuning approaches for pre-trained language models have recently shown the best performance, including Cross-lingual Language Model Pre-training (XLM) [5], Masked Sequence to Sequence Pre-training for Language Generation (MASS) [6], and Multilingual BART (mBART) [34]. However, the parameter counts and model sizes of these pre-trained language models are too large for real-world industrial deployment. Considering overall factors such as model performance, speed, and memory reported in recently published papers, we regard the Transformer as the optimal model for deployment and conduct our experiments based on it.

2.2. AI Hub

With the advent of the fourth industrial revolution [35], inter-language exchange of information has rapidly increased, accelerating the demand for advanced translation systems. Despite the development of automatic translation systems accompanying the growth of Information Technology (IT), several difficulties remain in providing industrial MT services. Building early translation solutions involves cost and time barriers, and it is difficult to obtain quality data. Moreover, there are challenges in maintaining NMT output quality, obstacles to obtaining domain-specific language pairs, and difficulties in providing domain-specific NMT solutions. In other words, most of the difficulties stem from the lack of translation data, namely parallel corpora. Additionally, intellectual property concerns complicate securing data, and collection is costly, which is a major challenge for start-ups in the artificial intelligence industry or companies preparing for innovation [9].
In general, a monolingual corpus is relatively uncomplicated to obtain, and a sufficient amount can be secured, but a parallel corpus is much harder to acquire. Furthermore, constructing a parallel corpus requires a number of high-level techniques for refining and pre-processing the original raw corpus, and translating a monolingual corpus into the desired other language demands considerable expense.
To mitigate these limitations, AI Hub constructs and continuously distributes public data nationally. AI Hub is a platform that integrates AI infrastructure such as AI data, AI software, algorithms, and computing resources that are essential for developing AI technologies, products, and services. It is also releasing data related to image recognition as well as data related to machine reading, machine translation, and voice recognition. This platform contributes to the creation of an intelligence information society and an artificial intelligence industrial ecosystem including medium-sized venture enterprises, research institutes, and individuals in Korea by disclosing high-quality and high-capacity artificial intelligence data.
AI Hub has released several datasets for MT, including the high-quality Korean–English corpora released in 2019 and 2021. Subsequently, the construction of parallel corpora including Korean–Japanese, Korean–Chinese, and other Korean-language pairs has been actively pursued. However, these data have not been closely verified, and we seek to confirm their quality through LIWC analysis and by building real-world NMT models.

2.3. Parallel Corpus Quality Assessment

Accompanied by the increase in publicly released parallel corpora, such as FLORES-101 [36] and AI Hub, evaluating and improving parallel corpus quality has become more important. Especially in the data construction process, assessing the quality of the corpus is regarded as essential. For example, in the case of AI Hub, all publicly available corpora were constructed through a multi-phase process of machine translation followed by human examination. In the examination process, semantic coherence and sentence alignment are mainly inspected. This can be viewed as checking the corpus's suitability for the intended purpose of the data construction.
However, as data acquisition becomes more accessible and the amount of training data increases, examining every instance with human labor incurs considerable cost. For instance, as the total amount of parallel data released by AI Hub is approximately 7M sentence pairs, examining the whole corpus would require tremendous time and cost.
To alleviate this limitation, corpus evaluation studies have focused mainly on minimizing direct human examination. Representative methods include the use of translation rules established in advance [37], the Gale–Church algorithm [38], which evaluates the overall alignment of sentences, and the Bilingual Sentence Aligner [39]. The validity of these corpus evaluation methodologies is generally judged by the performance of the MT system trained on the corresponding corpus. In particular, evaluation criteria such as sentence alignment were confirmed to be effective corpus evaluation metrics through performance verification of SMT models trained on the corpus [40]. With the development of deep neural modeling, these methodologies are now evaluated by the performance of NMT models [37].
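As an illustration, the length-based cost at the heart of the Gale–Church algorithm can be sketched as follows. This is a simplified one-to-one alignment cost using the model constants from the original paper, not a full implementation of the dynamic-programming aligner.

```python
import math

def gale_church_cost(len_src: int, len_tgt: int, c: float = 1.0, s2: float = 6.8) -> float:
    """Simplified 1-1 alignment cost from the Gale-Church length model.

    c: expected target characters per source character; s2: variance per
    source character. Lower cost means a more plausible sentence pair.
    """
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / c) / 2
    delta = (len_tgt - len_src * c) / math.sqrt(mean * s2)
    # Two-sided tail probability of a standard normal.
    tail = 1 - math.erf(abs(delta) / math.sqrt(2))
    return -math.log(max(tail, 1e-12))

print(gale_church_cost(40, 42))   # similar lengths -> low cost
print(gale_church_cost(40, 120))  # mismatched lengths -> high cost
```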
However, most of these studies aim to improve the performance of the MT system itself, as trained on the corresponding parallel corpus. This has often led to inconsistent results: data that was not considered noisy when training an SMT system considerably deteriorated the performance of an NMT system when used in its training [17]. This shows that such corpus evaluation criteria may not be consistent enough to be directly related to actual quality assessment. In this study, we analyze the corpora using LIWC, a sentence analysis tool that has not previously been utilized for parallel corpus inspection or as an objective evaluation index of corpus quality. In addition, following previous studies, we check the performance of NMT systems trained on each parallel corpus and analyze the characteristics and quality of the corpus based on the results.

2.4. Korean Neural Machine Translation Research

Recently, various MT services have been provided in South Korea alongside MT-related research. Along with the Papago translation service [41] operated by Naver Corporation, MT services are provided by many companies and laboratories, including the Electronics and Telecommunications Research Institute (ETRI), Kakao, SYSTRAN, and GenieTalk by Hancom Interfree.
Research on NMT data pre-processing is actively conducted at Korea University. Related studies include Onepiece, which proposes a sub-word tokenization specialized for Korean [42], and the first application of PCF to Korean–English NMT [43]. They also propose a methodology for training with a relative ratio when composing batches, rather than simply applying back translation or copied translation for data augmentation [44]; this yields higher performance than back translation alone. In addition, based on machine translation, they have developed various applications, such as a Korean spelling corrector [45], an English grammar corrector [46], and cross-lingual transfer learning [47]. In summary, there are various experiments and studies on the importance of pre-processing and data augmentation, as well as research on NMT models themselves.

3. Analyzing the AI Hub Corpus Using LIWC

3.1. Linguistic Inquiry and Word Count (LIWC)

LIWC is natural-language analysis software that allows for the investigation of various emotional, cognitive, and structural components of given sentences [48]. LIWC offers corpus analysis by referring to a dictionary comprising 93 features, which are listed in Table 1. Every feature can be classified into 14 categories: summary language variables, linguistic dimensions, grammar, affect process, cognitive process, social process, perceptual process, biological process, drives, time orientations, relativity, personal concerns, informal language markers, and punctuation. This differs from the classification presented in the LIWC manual, which uses 16 categories; we consolidated them into new categories for a more intuitive analysis and to avoid confusion. We merged the auxiliary verbs, common adverbs, conjunctions, and negations of the function words in grammar with the "Other Grammar" features, such as conjunctions and adjectives, defined in the manual's initial categories. Pronouns, articles, and prepositions, which are function words, were joined with the linguistic dimensions, thereby helping in the understanding of text through the rules of sentence structure. Grammar represents the grammatical components of a sentence and comprises several parts of speech. The summary language variables summarize all the linguistic features, representing the overall characteristics of a sentence. The affect process quantifies emotions and feelings. The biological process category represents biological topics in text, such as body, health, and ingestion. The drives category represents the motivations and needs that appear in text. The time-orientation category helps in understanding the tense used in text, because LIWC captures both verb tenses and general time orientations. The relativity category represents relatively trivial topics, and the personal concerns category captures the literal meanings of the topics of concern in text. The punctuation and informal language marker categories play similar roles. In this study, we conducted an in-depth analysis with respect to the following five aspects.
First, morphological analysis can be conducted by referring to morphological features, such as grammar and linguistic dimensions. Second, investigating the summary language variables, general descriptors, time orientation, punctuation, and informal language marker categories enables the analysis of sentence syntax. Third, semantic analysis can be implemented through the inspection of various topics, including the cognitive, social, perceptual, and biological processes, as well as relativity and personal concerns. Fourth, we can conduct sentiment analysis through the affect process, which involves positive and negative emotions. Finally, the social category, which contains male and female referents, enables the analysis of gender bias. Notably, in the field of NMT, numerous studies have been conducted to reduce the prevalence of gender bias [49,50]. In the future, this approach can be used to inspect the performance of MT systems.
LIWC has mainly been leveraged in the field of psychology, especially in investigating the linguistic characteristics revealed in the writings of psychiatric patients [21,51]. Furthermore, LIWC has recently been utilized in numerous studies on natural language processing (NLP), and its effectiveness and relevance in NLP have been demonstrated. For instance, the performance of misinformation detection [52], sentiment analysis, and plagiarism detection [53] can be improved by applying LIWC, and the effectiveness of LIWC can be evaluated through comparison with BERT [54]. Following these trends, we investigate all the Korean parallel corpora released by AI Hub from the morphological, semantic, sentence-syntactic, sentimental, and gender-bias perspectives. In Table 2, Table 3 and Table 4, the largest value of each category is shown in bold.
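Since LIWC's scoring is dictionary-based, its core computation can be illustrated compactly. The following is a minimal sketch of how a LIWC-style tool turns a word-to-category dictionary into per-category percentages of total words; the tiny dictionary and category labels here are hypothetical stand-ins, not LIWC's actual (proprietary) lexicon.

```python
from collections import Counter
import re

# Hypothetical mini-dictionary mapping words to LIWC-style categories.
# The real LIWC 2015 dictionary is proprietary and far larger.
CATEGORY_DICT = {
    "i": ["ppron", "i"], "we": ["ppron", "we"], "it": ["ipron"],
    "happy": ["posemo", "affect"], "sad": ["negemo", "affect"],
    "is": ["auxverb"], "eat": ["verb", "bio", "ingest"],
}

def liwc_percentages(text):
    """Return each category's share of total words, as LIWC reports."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for w in words:
        for cat in CATEGORY_DICT.get(w, []):
            counts[cat] += 1
    total = len(words) or 1
    return {cat: 100.0 * n / total for cat, n in counts.items()}

print(liwc_percentages("I am happy because we eat together. It is fine."))
```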

3.2. Korean–English Parallel Corpus

Corpus Description. The Korean–English parallel corpus (https://aihub.or.kr/aidata/87, accessed on 25 May 2022) is a parallel corpus from AI Hub that was released in 2019. The corresponding corpus was built through the cooperation of Saltlux partners (http://saltlux.com, accessed on 25 May 2022), Flitto (https://www.flitto.com, accessed on 25 May 2022), and Evertran (http://www.evertran.com, accessed on 25 May 2022). The corpus contains 1.6M sentence pairs in total, comprising 800K pairs from news articles, 100K instances of website content from the government, 100K instances of by-law data, 100K instances of Korean cultural content, 400K instances of colloquial-style data, and 100K instances of dialogic data. The ratio of each domain to the entire corpus is shown in Figure 1. It can be considered the most representative Korean–English parallel corpus, and many Korean MT studies have been conducted based on it [55].
For a more thorough data analysis, we conducted an in-depth investigation of this corpus with respect to various features, such as morphemes, syntax information, and the characteristics of the corpus. We analyzed such features using LIWC, and the results are shown in Table 2.
Corpus Analysis. We conducted a morphological analysis using the linguistic dimensions and grammar features obtained from LIWC. In the linguistic dimensions category, 'prepositions (prep)' and 'articles (article)' show high frequencies among all parts of speech, at 15.1% and 10.2%, respectively. Additionally, 'auxiliary verbs (auxverb)' in the grammar category appear at 6.75%, a higher prevalence than in the other corpora, while 'common verbs (verb)' show the highest frequency in the category at 11.36%. This result indicates the continuous prevalence of be-verbs (am, are, is, was, and were) and perfect tenses; characteristic English features, such as diverse tenses and conjugations, are reflected in the data rather than only base verb forms.
In the personal pronoun analysis, which covers {'1st pers singular (i)', '1st pers plural (we)', '2nd person (you)', '3rd pers singular (shehe)', '3rd pers plural (they)', and 'impersonal pronouns (ipron)'}, the frequency of impersonal pronouns (ipron) is similar to that of 'personal pronouns (ppron)', and '1st pers singular (i)' has the highest frequency among the personal pronouns. The '2nd person' and '3rd pers singular' pronouns come next in order of prevalence. Unlike other corpora, in which impersonal pronouns are predominant, this corpus includes both colloquial and dialogic sentences because it comprises interactive conversations between first-person and other-person perspectives.
Syntactic analysis is an analytic approach that informs us about the grammatical structure of sentences or their parts. This approach reveals the type of tone and atmosphere used in sentences through the summary language variables category. We also obtain sentence length from 'word count (WC)' and 'words per sentence (WPS)', and the count of long words via 'Sixltr'. We determine whether sentences are statements, questions, or quotations using the punctuation category. We use the time orientations category to understand the point of view, and we investigate 'assents (assent)', 'fillers (filler)', and 'swear words (swear)' in the informal language markers category. Together, these features help us understand the syntactic information of the corpus.
The 'analytical thinking (analytic)', 'clout (clout)' (i.e., the expression of confidence), 'authenticity (authentic)' (i.e., the expression of sincerity), and 'emotional tone (tone)' values are 94.16%, 65.39%, 27.76%, and 54.48%, respectively. Compared with the other corpora, the prevalence of 'analytic' is relatively low, whereas 'clout', 'authentic', and 'tone' are relatively high. This reflects the corpus's composition of various descriptive styles (i.e., colloquial and literary), rather than a focus on conveying and explaining specific domain knowledge. The variety of descriptive styles can also be seen in the punctuation category, which shows a relatively high rate of 'question marks (QMark)'.
From the linguistic dimensions results, the average word count per sentence is 25.39, and 'Sixltr' accounts for 25% of the total word count. The 'article' and 'prep' categories also account for large proportions, at 10.4% and 15.5%, respectively, and the time orientations analysis shows high ratios in the order of present focus, past focus, and future focus. Additionally, in the grammar category, words that directly represent 'number (number)' are used approximately twice as often as 'quantifiers (quant)' representing quantities.
In the informal language markers category, the values are higher than those of the other corpora. It is noteworthy that 'swear words', such as 'damn' and 'shit', and 'fillers', such as 'you know' and 'I mean', which are used as interludes in conversations, are close to zero in the other corpora's results. This appears to be because the corpus contains written, colloquial, and dialogic language, unlike general data consisting only of written or spoken language.
Semantics is the study of meaning at the level of texts, sentences, and phrases. In our semantic analysis, we attempt to understand the corpus in depth by checking which of the various topic categories, such as drives and biological process, have high ratios.
In this corpus, the relativity category, corresponding to relatively trivial topics, was found at the highest level, 14.29%, with 'space (space)' accounting for the highest prevalence at 7.83%. The drives, cognitive process, personal concerns, and social process categories follow in that order. Specifically, in the personal concerns category, 'work (work)' occupied more than half. This is because, unlike other corpora, this corpus includes colloquial words and dialogues. For this reason, relatively trivial conversation topics and individuals' senses of purpose, thoughts, and interests are relatively clearly revealed.
Sentiment analysis involves analyzing the degree of positivity and negativity appearing in text. LIWC supports analyzing the scale of 'positive emotion (posemo)' and 'negative emotion (negemo)', the latter including 'anger', 'anxiety', and 'sadness'.
'posemo' occurs twice as much as 'negemo' in this corpus. Moreover, emotion words were used the most here compared with the other corpora, revealing the characteristics of a corpus that includes dialogues and spoken language.
Gender bias is an important factor in determining MT quality. The 'female referents (female)' and 'male referents (male)' results in the social process category represent the rates at which each gender is referred to in the text. 'Male' appears at a frequency of 0.63%, twice that of 'female' at 0.34%. Although the gender balance of the corpus was not effectively achieved, these rates match the numbers of male and female referents reported in the LIWC average analysis of various corpora, such as blogs, novels, and Twitter, in the LIWC 2015 manual.

3.3. Korean–English Domain-Specialized Parallel Corpus

Corpus Description. The Korean–English Domain-Specialized Parallel Corpus (https://aihub.or.kr/aidata/7974, accessed on 25 May 2022) provides various parallel corpora specializing in several domains. This corpus was released in 2021, and three companies cooperated in its construction: Saltlux partners, Flitto, and Evertran.
The corresponding corpus consists of 1.5M sentence pairs, comprising 250K instances of medical/health data, 200K instances of financial/stock market data, 100K instances of parent-notice data, 200K instances of international sport event data, 100K instances of IT technology data, 200K instances of festival event content, 150K judicial precedents, and 200K instances of traditional culture/food data. The percentage of each domain within the entire corpus is shown in Figure 2.
Corpus Analysis. This corpus was released as separate training and validation datasets, and Table 2 shows the LIWC results for each. The differences in linguistic features between the training and validation datasets are generally less than 0.1%. Although there are exceptional differences of approximately 1% in some summary language variables, such as 'clout' (confidence) and 'authentic' (authenticity), these differences are minor. Therefore, we can conclude that the training and validation datasets are well balanced.
The morphological analysis yields results generally similar to those in Section 3.2. However, impersonal pronouns are used twice as much as personal pronouns. Although personal pronoun use is low across the board, 'i' and 'you' show their lowest values here compared with the other corpora.
Focusing on the syntactic category, 'analytic', which represents analytical thinking in text, is the highest among the four summary language variables. The sentences are the longest, and 'comma' is the most frequently used, of all the Korean–English corpora we analyzed. This is because the corpus contains many long sentences with several phrases explaining domain-specialized concepts. Time orientation is still highest in the present and lowest in the future. However, the difference between the present and past tenses is the smallest among all corpora because the data contain past cases, such as judicial precedents and financial/stock market data.
Additionally, in semantics, the biological process category has the highest prevalence across the entire corpus. Within this category, 'health/illness (health)' is especially high because the corpus contains several domains including medical and international sport data. The results for 'money (money)' and 'leisure activities (leisure)' in the personal concerns category confirm the presence of international sports data and financial/stock market data in this corpus.
In the sentiment analysis results, 'posemo' appears twice as much as 'negemo'. Additionally, the use of sentiment words is 7% lower than in Section 3.2. Finally, the male and female referent results of this corpus reveal gender bias, as male referents appear twice as often as female referents, unlike in other corpora.

3.4. Korean–English Parallel Corpus (Technology)

Corpus Description. AI Hub released the technology-science domain-specialized Korean–English translation corpus (https://aihub.or.kr/aidata/30719, accessed on 25 May 2022) in 2021 through the cooperation of Twigfarm (https://twigfarm.net/, accessed on 25 May 2022), Lexcode (https://lexcode.co.kr/, accessed on 25 May 2022), Naver (https://www.naver.com, accessed on 25 May 2022), the Korean Telecommunications Technology Association (TTA) (https://www.tta.or.kr/, accessed on 25 May 2022), and the Fun & Joy company (FNJ) (http://www.fnj.or.kr/home/index.html, accessed on 25 May 2022). The corresponding corpus was constructed to support ICT companies in the translation of technical documents and product localization.
The entire corpus contains 1.5M sentence pairs across five domains, as shown in Figure 3: 350K instances of ICT data, 150K instances of electricity data, 150K instances of electronics data, 350K instances of mechanical data, and 500K instances of medical data. To construct the high-quality corpus, expert-level revision by ICT professionals and several professors in translation fields was conducted after the initial corpus was compiled using a computer system. In this study, because not all of the data has been released yet, we leveraged the 788K sentence pairs available (ICT (35.2%), mechanical (31.2%), electricity (13.8%), electronics (11.2%), and medical (8.6%)).
Corpus Analysis. The entire corpus comprises training and validation datasets, and the LIWC results are shown in Table 3.
The LIWC results show small differences between the training and validation datasets in each feature value. We can also establish that the corresponding corpus contains more adverbs than the other parallel corpora, and personal pronouns are so rare that impersonal pronouns ('ipron') are used approximately 10 times more often. This identifies the corpus as one describing technology and phenomena, in contrast to the corpora of other fields that deal with people and culture.
Because the 'Sixltr' rate is relatively high whereas 'WPS' is low, we can infer that the corpus mainly contains short sentences consisting of long words. Furthermore, the corpus shows low 'authentic' and emotional tone ('tone') rates. Considering the characteristics of the technology domain, this indicates that sentences are represented concisely. The present tense appears more frequently than the past or future tenses, supporting the tendency of technology-domain text to describe current technology, future complementary points, and expectations.
Compared to the other corpora, the prevalence of the drives category, which represents the motivation of a sentence, is relatively low, whereas the biological process category remains the most frequent. These results are contrary to those of the social domain corpus, reflecting the characteristics of the corresponding corpus.
The ratios of the cognitive process and perceptual process categories were also the highest across all the corpora. Therefore, it can be confirmed that articles in the technology domain are written in a manner that describes perceptions and cognitive processes, such as 'insight', 'causation', and 'certainty', about technology. One notable characteristic is the absence of gender bias. The ratios of the domains can vary because the full 1.5M sentences have not yet been completely constructed. Therefore, we can obtain the linguistic features of each domain by comparing the completed corpus with the present corpus in later experiments.

3.5. Korean–English Parallel Corpus (Social Science)

Corpus Description. Similar to the technology-science specialized corpus, the social science specialized Korean–English parallel corpus (https://aihub.or.kr/aidata/30720, accessed on 25 May 2022) was also published in 2021 through the cooperation of Twigfarm, Lexcode, Naver, TTA, and FNJ.
The entire corpus comprises 1.5M sentence pairs, including 300K instances of economic data, 90K instances of cultural content, 100K instances of tourism content, 400K instances of education data, 500K instances of law data, and 110K instances of art content. The ratio of each domain to the corresponding corpus is shown in Figure 4. The data was revised by domain specialists and translation experts. In this study, because not all of the data has been released yet, we leveraged the 537K sentence pairs available (law (37.2%), economy (24.2%), education (24.1%), tourism (5.9%), culture (4.5%), art (4%), and medical (0.08%)).
Corpus Analysis. The data was also split into training and validation datasets, and the results of running LIWC on the training, validation, and entire datasets are shown in Table 3. The difference between each linguistic feature of the training and validation datasets is mostly within 0.1%; exceptionally, some features, such as 'authentic' and 'emotional tone', show relatively sizable dissimilarities. This results from the incidental blending of datasets across various domains, such as law, culture, and economy.
First, the morphological analysis results are homogeneous with those presented in Section 3.2, but negations ('negate'), such as 'not' and 'never', are used the most among all the corpora. This result is contrary to the outcome for the technical science corpus discussed in Section 3.4, suggesting that the frequency of plain and negated statements varies depending on the domain. In the case of pronouns, impersonal pronouns ('ipron') are used four times more than personal pronouns ('ppron'), as shown in the results of Section 3.2, which is caused by the characteristics of a corpus in which written sentences mostly describe objects.
Considering syntactic characteristics, the analytic score is relatively high, indicating that this corpus logically describes domains such as economics, law, and education. The ratio of 'Sixltr' is higher than in other corpora, demonstrating the increased use of long words on average. The low 'WPS' also suggests that short sentences have been used. We also find that future focus ('focusfuture') accounts for the smallest percentage in the time orientations category. This is because the social and cultural corpora contain present-state-oriented explanations rather than future predictions.
From the semantic perspective, the biological process category shows the lowest score compared to the other corpora. Specifically, 'body (body)' figures are within 0.2% owing to the nature of the social science domain, which is far from biology-related topics. Additionally, the cognitive process category, reflecting human thinking, is the highest among the corpora, with 1.5 to 1.8 times higher 'insight' and 'cause'. This confirms that 'insight (insight)' and 'cause (cause)' are attributes of written sentences in the social science domain, in contrast to the technical science domain and other specialized fields. The prevalence of 'posemo' is twice as high as that of 'negemo'. Regarding gender bias, male-related pronouns appear approximately twice as often as female-related pronouns, similar to the other corpora.
Similarly, because this corpus currently has a different domain ratio from the dataset that will be completed at 1.5M, as described in Section 3.4, we can infer and analyze the linguistic features of each domain by comparing these results with those of the dataset once it is updated to 1.5M.

3.6. Korean–Chinese Parallel Corpus (Technology)

Corpus Description. AI Hub also provides the technology-domain specialized Korean–Chinese parallel corpus (https://aihub.or.kr/aidata/30722, accessed on 25 May 2022). This corpus is the first publicly released Korean–Chinese parallel corpus. Six companies cooperated to build it: Saltlux partners, Flitto, Evertran, Onasia (https://on-asialang.com/, accessed on 25 May 2022), Yoon's information development company, and dmtlabs (http://dmtlabs.co.kr/, accessed on 25 May 2022).
The entire corpus comprises 1.3M sentence pairs, including 250K instances of medical/health data, 150K instances of patent/technology data, 300K instances of car/traffic/material data, and 600K instances of IT/computer/mobile content. Figure 5 shows the ratio of each domain. The corpus is subdivided into training and validation datasets, and Table 4 shows the LIWC analysis results for both.
Owing to the characteristics of the Chinese language, there are differences between the training and validation datasets in count-based analyses, such as 'WPS' and 'Sixltr', but they are subtle. We inspect the linguistic characteristics of English and Chinese by comparison with the technology-specialized Korean–English parallel corpus analyzed in Section 3.4.
Corpus Analysis. In the morphological analysis, unlike Section 3.4, where 'common verb' appears 1.7 times more often than 'auxverb', this corpus shows the smallest difference between the two features, at 1.45 times. Additionally, the results indicate that negative representations, 'quant' and 'negate', are rarely used. The corpus shows notable differences in pronoun analysis: unlike Section 3.4, where personal pronouns were rarely used, personal pronouns here are used approximately three times as often as impersonal pronouns; among them, '1st pers plural (we)' is the most common, and third-person pronouns (i.e., 3rd pers singular and plural) are rarely used.
Due to the nature of data in the technical field, declarative texts take up a large portion of the whole dataset; thereby, 'semicolons (SemiC)', 'colons (Colon)', 'dashes (Dash)', and 'QMark' were rarely used, as demonstrated in Section 3.4.
In terms of syntactic analysis, 'analytic' and the confidence measure, 'clout', were the highest relative to all the Korean–English corpora we analyzed. As the primary purpose of technology-domain data is to convey previously proposed information, the present tense is less prominent than the past and future tenses.
Notably, 'analytic' was 5.9% lower than in Section 3.4, while 'WPS' and 'Sixltr' were much higher. These results show that the sentences and words used in Chinese are longer than those used in English. Additionally, unlike in most Korean–English parallel corpora, including Section 3.4's, 'article' and 'prep' are scarcely used, and the use of all tenses in the time orientations category with equal weight is also a characteristic of Chinese.
In the punctuation category, the usage frequency of 'colon' is similar to that in Section 3.4's results, a characteristic that indicates the presence of multiple contents in one sentence. Additionally, 'number', which directly represents a number, was higher than 'quant', which represents quantitative description, whereas informal language markers were rarely used. It is noteworthy that 'quote', which was hardly used in the Korean–English parallel corpora, accounted for 12.7%. The presence of many quotations in this corpus marks a difference between the English and Chinese corpora.
Considering the semantic aspects, 'work' and 'leisure' have the highest ratios in the personal concerns category. In addition, the perceptual process, biological process, and cognitive process categories were higher than in the other Korean–Chinese corpus, which is similar to the results presented in Section 3.4.
Through qualitative inspection, we conclude that corpora covering similar domains are semantically alike, apart from the morphological and syntactic distinctions of each respective language.
Overall, in the sentiment analysis, all outcomes in the affective process category, including emotional tone in the summary language variables category, are low. This is because, as discussed in Section 3.4, the corpus mainly consists of sentences that describe knowledge and phenomena. An unusual point is that 'posemo' appeared approximately six times more often than 'negemo', which is similar to the results presented in Section 3.4.
From the viewpoint of gender bias, most of the Korean–English parallel corpora analyzed so far exhibited gender bias, but none of the Chinese corpora did.

3.7. Korean–Chinese Parallel Corpus (Social Science)

Corpus Description. Along with the technology-domain specialized corpus, AI Hub also released a social science-domain specialized Korean–Chinese parallel corpus (https://aihub.or.kr/aidata/30721, accessed on 25 May 2022). To build this corpus, six companies, including Saltlux partners, Flitto, Evertran, Onasia, Yoon’s information development company, and dmtlabs, cooperated.
The total amount of sentence pairs in the corpus is 1.3M, including 200K instances of financial/stock market contents, 200K instances of social/welfare domain data, 100K instances of education data, 150K instances of cultural heritage/local/K-food content, 250K by-law texts, 250K instances of political/administration data, and 200K instances of K-POP/culture content. The ratio of each domain to the entire corpus is shown in Figure 6.
Corpus Analysis. As shown in Table 4, the overall characteristics of the training and validation datasets are almost identical. The overall analysis results are generally similar to those presented in Section 3.6, except for a few aspects. The corresponding corpus showed similar ratios of 'conjunctions (conj)', 'negations (negate)', 'comparisons (compare)', and 'interrogatives (interrog)' in the grammar category. Through syntactic analysis, we established that 'preposition', 'comma', 'question mark (QMark)', and 'quote' are relatively frequent in each sentence. This shows that the sentences are quite short, and the proportions of questions and quotes are relatively high. We can infer that descriptive methods that sequentially list various types of information have been commonly used.
We can point out a feature in common with Section 3.6, namely that personal pronouns are used three times as much as impersonal pronouns. However, the corresponding corpus has a distinguishing feature: first and second person singular pronouns are used more frequently than first person plural pronouns. These results reflect the domain attributes of the corpus, whose descriptions of society, culture, and politics mainly focus on "I" and "You".
Furthermore, in this corpus, the prevalence of 'posemo' is higher than that of 'negemo', and gender bias rarely exists. In the future, by analyzing written and colloquial Chinese corpora in various fields, we will verify whether this is a linguistic characteristic of Chinese or a special case occurring in descriptions of specialized fields.

3.8. Korean–Japanese Parallel Corpus

AI Hub released the first public Korean–Japanese parallel corpus in Korea (https://aihub.or.kr/aidata/30723, accessed on 25 May 2022). Each sentence pair in the corpus was generated by translating Korean sentences from various domains into Japanese using MT systems, after which the output was revised by human experts. This corpus is not biased toward a specific industrial domain and was constructed from raw data sources; therefore, it is free from copyright problems. These attributes enable the corpus to be widely utilized for NLP industrial services that deal with various domains. Six companies cooperated to build this corpus: Saltlux partners, Flitto, Evertran, Onasia, Yoon's information development company, and dmtlabs.
The entire corpus comprises 1.3M sentence pairs, including 150K instances of cultural heritage/local/K-food content, 200K instances of K-POP/culture content, 200K instances of IT/computer/mobile domain data, 200K instances of finance/stock market content, 200K instances of social/welfare data, 100K instances of education data, 150K instances of patent/technology domain data, 100K instances of medical/health content, and 200K instances of car-related data. The ratio of each domain to the entire corpus is shown in Figure 7. As proper LIWC software for Japanese has not been publicly released, we skipped the LIWC analysis for this corpus.

4. Experiments and Results

4.1. Dataset Details

In this study, we utilize the seven types of Korean parallel corpora released by AI Hub as training data for the experiments. We measure the total number of sentences in each corpus and the minimum, maximum, and average lengths at the word and character levels. Statistics for the seven newly released AI Hub parallel corpora are listed in Table 5. In the case of the social science and technology fields of the Korean–English parallel corpus, only 470K and 690K instances were released owing to unintentional circumstances of the organizers. We leverage the official training and validation datasets for training and evaluation. Before training, to test the performance of the MT systems, we set aside 3K instances of the training set as a test set. The performance of each NMT model in our experiments is measured by the BLEU score [56], the common metric in the NMT field; for precise evaluation, we adopted Jieba (https://github.com/fxsjy/jieba, accessed on 25 May 2022) and MeCab (https://github.com/taku910/mecab, accessed on 25 May 2022) as tokenizers for Chinese and Japanese output sequences, respectively.
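To make this evaluation setup concrete, the sketch below pre-tokenizes Chinese and Japanese outputs before BLEU scoring. It uses the jieba and mecab-python3 packages and sacrebleu as a stand-in for the multi-bleu.perl script used in the paper, so exact scores may differ; the sample sentences are illustrative.

```python
import jieba               # pip install jieba
import MeCab               # pip install mecab-python3 unidic-lite
import sacrebleu           # pip install sacrebleu

mecab = MeCab.Tagger("-Owakati")  # wakati mode: space-separated tokens

def tokenize(text, lang):
    """Segment CJK output into space-delimited tokens before BLEU."""
    if lang == "zh":
        return " ".join(jieba.cut(text))
    if lang == "ja":
        return mecab.parse(text).strip()
    return text  # English: keep whitespace tokens as-is

hyp = tokenize("我喜欢机器翻译", "zh")
ref = tokenize("我喜欢机器翻译", "zh")
# 'none' disables sacrebleu's own tokenizer since we pre-tokenized.
score = sacrebleu.corpus_bleu([hyp], [[ref]], tokenize="none")
print(f"BLEU = {score.score:.2f}")
```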

4.2. Model Details

To verify the quality of the datasets provided by AI Hub, we constructed a Transformer-based NMT model [4] trained on each dataset. The Transformer is an auto-regressive model comprising an encoder-decoder architecture; it is widely utilized in many NLP research fields, including NMT, and achieves SOTA performance. The model abandons recurrence and builds its encoder-decoder architecture mainly from attention structures. This considerably reduces the required training time by allowing significantly more parallelization during training. The attention-based structure of the Transformer also relieves the long-term dependency problem of RNNs and LSTMs [57]. The output of the attention structure can be described as Equation (1).
$$\mathrm{Attention}(q, k, v) = \mathrm{Softmax}\!\left(\frac{(W_q\,q)(W_k\,k)^{T}}{\sqrt{d_k}}\right)(W_v\,v) \qquad (1)$$
In Equation (1), $W_q$, $W_k$, and $W_v$ refer to trainable parameters. The attention structure takes three inputs, query, key, and value, denoted as $q$, $k$, and $v$, respectively. Through this structure, the Transformer obtains the relational information between the input sentence and the sentence being generated. In the decoder's cross-attention, the embeddings obtained from the input sentence are fed to the attention structure as $k$ and $v$, and the embedding from the sentence being generated is used as $q$. The attention structure is also leveraged to obtain the contextual information of the input sentence and the generated sentence through the self-attention mechanism, which takes an identical embedding value as $q$, $k$, and $v$ simultaneously.
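A minimal PyTorch sketch of Equation (1) follows, implementing a single attention head with trainable projections $W_q$, $W_k$, and $W_v$; the dimensions and batch shapes are illustrative, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Scaled dot-product attention with learned projections, as in Eq. (1)."""

    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)  # W_q
        self.w_k = nn.Linear(d_model, d_k, bias=False)  # W_k
        self.w_v = nn.Linear(d_model, d_k, bias=False)  # W_v
        self.d_k = d_k

    def forward(self, q, k, v):
        # scores: (batch, len_q, len_k)
        scores = self.w_q(q) @ self.w_k(k).transpose(-2, -1) / math.sqrt(self.d_k)
        return torch.softmax(scores, dim=-1) @ self.w_v(v)

# Self-attention uses the same tensor for q, k, and v.
attn = SingleHeadAttention(d_model=512, d_k=64)
x = torch.randn(2, 10, 512)          # (batch, sequence length, d_model)
print(attn(x, x, x).shape)           # torch.Size([2, 10, 64])
```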
We construct a Transformer NMT model trained on each AI Hub dataset. We regard the performance of the NMT model as a proxy for the quality of its parallel corpus by keeping all training conditions identical across experiments except the training dataset. The training objective of a Transformer-based NMT model $\theta$ trained on a parallel corpus $D$ can be described as Equation (2).
$$\max_{\theta} \; \frac{1}{|D|} \sum_{(X,Y) \in D} \log \prod_{i=1}^{n} P\left(y_i \mid X, y_{<i}, \theta\right) \qquad (2)$$
The overall process is similar to the training of a sequence-to-sequence [30] MT model. In Equation (2), $X$ and $Y$ indicate the source and target sentences in $D$, respectively. The target sentence $Y$ comprises $n$ tokens in total, denoted as $\{y_i\}_{i=1}^{n}$, and through this training process the model learns to generate $Y$ auto-regressively.
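The objective in Equation (2) corresponds to minimizing the per-token cross-entropy under teacher forcing. A minimal sketch follows, using PyTorch's built-in Transformer as a stand-in for the paper's implementation; the vocabulary size, padding index, and toy batch are assumptions.

```python
import torch
import torch.nn as nn

PAD = 0
vocab_size = 32000
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6,
                       num_decoder_layers=6, batch_first=True)
embed = nn.Embedding(vocab_size, 512, padding_idx=PAD)
project = nn.Linear(512, vocab_size)
# Cross-entropy over target tokens == the negative log-likelihood in Eq. (2).
criterion = nn.CrossEntropyLoss(ignore_index=PAD)

def training_step(src_ids, tgt_ids):
    """One NLL step: predict y_i from X and y_<i (teacher forcing)."""
    tgt_in, tgt_out = tgt_ids[:, :-1], tgt_ids[:, 1:]
    causal = model.generate_square_subsequent_mask(tgt_in.size(1))
    hidden = model(embed(src_ids), embed(tgt_in), tgt_mask=causal)
    logits = project(hidden)                       # (batch, len, vocab)
    return criterion(logits.reshape(-1, vocab_size), tgt_out.reshape(-1))

src = torch.randint(1, vocab_size, (4, 12))   # toy batch of token ids
tgt = torch.randint(1, vocab_size, (4, 14))
print(training_step(src, tgt))                # scalar loss to minimize
```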
In our training process, we used the Adam optimizer with Noam decay, and the batch size was set to 4096 for all experiments. The Transformer NMT model in our experiments consists of six encoder and six decoder layers with eight attention heads each, and the model dimensionality and embedding size are 512.
For the pre-processing of our training data, we utilized the SentencePiece [58] subword tokenization method with a vocabulary size of 32,000. We randomly extracted 5000 and 3000 samples from the training data for the validation and test sets, respectively. The performance evaluation of all translation results is conducted with the BLEU score using the multi-bleu.perl script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl, accessed on 25 May 2022) provided by Moses.
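For concreteness, a minimal sketch of this preprocessing and learning-rate schedule is shown below; the corpus file name and the warmup step count are assumptions, as the paper does not report them.

```python
import sentencepiece as spm
import torch

# Train a 32,000-piece subword vocabulary (file name is hypothetical).
spm.SentencePieceTrainer.train(
    input="train.ko-en.txt", model_prefix="spm_koen", vocab_size=32000)
sp = spm.SentencePieceProcessor(model_file="spm_koen.model")
print(sp.encode("기계 번역은 재미있다", out_type=str))

# Noam decay from the Transformer paper: lr ~ d_model^-0.5 *
# min(step^-0.5, step * warmup^-1.5). A warmup of 4000 is an assumption.
d_model, warmup = 512, 4000
def noam(step: int) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.Adam(params, lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam)
```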

4.3. Main Results

Performance analysis. The baseline results for the seven AI Hub parallel corpora are listed in Table 6. The Korean–English NMT model trained on the Korean–English parallel corpus achieved a BLEU score of 28.36. The NMT model trained on the Korean–English parallel corpus (technology) achieved a BLEU score of over 50, which shows that words and expressions in a specific domain recur quite frequently. The models for the other domain-specific corpora, the Korean–English parallel corpus (social science) and the Korean–English domain-specialized parallel corpus, also demonstrated high performance, at 45.64 and 51.88, respectively, in similar contexts.
Considering the significant performance gap between the domain-specific and general corpora, we can point out a probable limitation of corpus construction. Although the performance on all domain corpora is overwhelmingly higher than on general corpora, this result does not guarantee that the NMT models operate well, because we randomly extracted the test sets from within the training sets. Corpora built around a particular domain typically contain sentences that overlap substantially with one another, yet many different expressions and words still occur in the field. This can cause difficulties in translating other diverse expressions. Therefore, our experimental results suggest that corpus builders should include much more diverse expressions, especially in specific domains, given that a well-constructed corpus makes a model smarter.
Language direction analysis. Comparing Korean-to-English translation with the opposite direction on the four Korean–English parallel corpora, the results for translating Korean to English exceed those of the opposite case by significant margins, ranging from 14.83 to 29.89 BLEU points.
This can be interpreted in terms of data construction. With a parallel corpus built by translating sentences from one language into another, translation results can be awkward when the model is trained in the opposite direction. A reasonable construction process for training direction-robust NMT models is therefore to build the corpus with about half of the sentences originating in the source language and the other half originating in the target language, translating each half accordingly. In other words, given the significant performance differences when the translation direction is reversed, it is highly likely that these corpora were built by translating monolingual text in a single source language, without considering the opposite direction. Similarly, for the Korean–Japanese and Korean–Chinese models, the performance in the opposite direction was significantly reduced. These aspects should be considered when building parallel datasets in the future.
In this paper, such a problem is defined as "data imbalance" [9,59], and it must be addressed when constructing data in the future. For high-quality data, it is ultimately important that the various elements be built in a balanced manner, and we conducted further analysis in this respect.
Correlation analysis between LIWC and BLEU score. We analyzed the correlation between LIWC features and BLEU scores to observe the connection between them. For this, we employed the BLEU scores of the Korean–English corpora reported in Table 6 and only the LIWC features of the Korean–English cases. As shown in Figure 8, we calculated the Pearson correlation [60] over all the LIWC features, the BLEU score (KR-EN), and the BLEU score (EN-KR). Numerous correlations appear between pairs of linguistic features, such as the strong negative correlation between positive emotion and negative emotion, shown in red following the color scheme of Figure 8.
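For reference, such a correlation can be computed as in the sketch below, using the LIWC totals from Tables 2 and 3 and the KR-EN BLEU scores from Table 6 for the four Korean–English corpora; the two-feature DataFrame layout is illustrative.

```python
import pandas as pd
from scipy.stats import pearsonr

# Rows: KR-EN Parallel, Social Science, Technology, Domain-Specialized.
df = pd.DataFrame({
    "Analytic": [94.16, 97.38, 99.00, 97.16],  # LIWC totals (Tables 2, 3)
    "Tone":     [54.48, 47.69, 34.02, 49.13],
    "BLEU":     [28.36, 45.64, 63.88, 51.88],  # KR-EN scores (Table 6)
})

for feature in ("Analytic", "Tone"):
    r, p = pearsonr(df[feature], df["BLEU"])
    print(f"{feature}: r = {r:+.2f} (p = {p:.3f})")
```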
We can infer the following from Figure 8. First, the overall tendency of the correlations within the LIWC features largely agrees with the analysis in Section 3. For example, Analytic and the sentiment-related features show a negative correlation, a consistent result given that Analytic indicates whether emotions are excluded and the tone of the text is logical. In other words, these results lend validity to the LIWC analysis.
Second, we show that several correlations between LIWC features and the BLEU score are strongly negative. Training data can be considered good when it is balanced in tone, length, gender, and so on; however, many aspects of the AI Hub corpora still need improvement, such as word count, punctuation usage, and sentiment, because such imbalances exert numerous negative effects. This suggests the direction in which data should be built and indicates that performance can be improved through data cleaning such as parallel corpus filtering (PCF) [61,62].
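As an illustration of such cleaning, a simple length-based rule in the spirit of PCF might look as follows; the thresholds are illustrative assumptions, and real PCF pipelines [61,62] combine many additional signals (language identification, alignment scores, model-based scores).

```python
def length_filter(pairs, max_len=175, max_ratio=2.0):
    """Keep sentence pairs with plausible lengths and length ratios."""
    kept = []
    for src, tgt in pairs:
        s, t = len(src.split()), len(tgt.split())
        if s == 0 or t == 0 or s > max_len or t > max_len:
            continue  # drop empty or overly long sides
        if max(s, t) / min(s, t) > max_ratio:
            continue  # extreme length mismatch suggests misalignment
        kept.append((src, tgt))
    return kept
```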
Additionally, in Figure 9 we distilled the features showing a statistically significant negative correlation with the BLEU score. The figure presents the correlation between the MT BLEU scores and linguistic characteristics. It indicates that a negative correlation can arise merely by swapping the source and target languages, and that English–Korean translation exhibits stronger negative correlations with the BLEU score than the reverse direction; that is, these features affect the BLEU score of English–Korean translation more negatively. We can infer that more features need to be filtered than in Korean–English translation to compensate for their correlation with performance. This result motivates further research on which factors should be removed during the data filtering process. In addition, our findings are supported by the BLEU scores in Table 6, which are lower for EN-KR than for KR-EN, in line with the data imbalance discussed above.
Finally, this paper identified the association between LIWC features and the BLEU score in terms of data filtering, which may assist in establishing guidelines for building datasets in the future.

5. Discussion and Positive Impact of This Study

This paper conducted in-depth analyses of the various parallel corpora published by AI Hub. The structural components that directly determine the quality of each corpus were closely investigated through LIWC, and the practical usability of each corpus was quantitatively evaluated through an NMT model trained on the corresponding corpus. Through these analyses, we aim to have a positive impact on machine translation research and to identify a desirable direction for data construction. Specifically, the main contributions of our paper can be described as follows:
First, to the best of our knowledge, we are the first to perform quantitative and qualitative in-depth analyses of the AI Hub data. We adopted LIWC as an investigation tool for parallel corpora and derived various meaningful information (e.g., the quality of a parallel corpus) by newly interpreting each component obtained from LIWC. Although LIWC has generally been used in psychological research, it enables corpus analysis from various angles, such as morphological and syntactic analysis. We confirmed that the results matched the characteristics of each corpus in most cases. For example, in Section 3.2, informal language markers such as swear words and fillers appear, although they are rarely used in the other corpora, because that corpus includes dialogues and spoken language. Additionally, the word count and comma results in Section 3.3 showed that the Domain-Specialized Parallel Corpus tends to explain terminology in long sentences containing commas. In Section 3.4, the emotional tone of the technology corpus was relatively low because it concisely explicates terminology rather than describing emotions. Since many texts in Section 3.5 deal with the economy, the money category among the personal-concern topics shows the highest rate of all the corpora. Furthermore, the words per sentence in Section 3.6 and Section 3.7 were the longest owing to the characteristics of Chinese, which rarely uses spaces. Moving away from model-centric machine translation studies, this paper instead encourages data-centric research, which can have a positive impact on the NMT research field by presenting a new perspective.
Second, we pointed out problems in the data construction process by revealing a significant performance discrepancy between the English–Korean and Korean–English NMT models trained on the identical parallel corpus. It can be inferred that this is caused by an improper construction strategy. When constructing a parallel corpus comprising two languages (e.g., Korean and English), it is desirable to build a balanced corpus by translating half of the sentences from the first language into the second, and the remaining half from the second language into the first. Through our empirical analysis, we point out that this aspect may have been underestimated.
Last, we revealed several important factors that determine the quality of a corpus. From Section 3.5, we can infer that domain uniformity was neglected, as the social science corpus contains medical text. Gender bias, which has a major influence on corpus quality, was also overlooked in several corpora; in Section 3.3 and Section 3.5 in particular, a twofold gender bias was observed. Additionally, we proposed that subject omission and coreference resolution problems should be further considered to ensure high-quality data.
Ultimately, this paper clearly analyzed the strengths and weaknesses of the existing AI Hub data and provided insight into the future direction of data construction.
In general, data filtering takes mathematical and modeling approaches [61,63], which can also be reflected in our future direction of data construction. There are also LIWC-based analyses of text [64,65], and such previous research could be one option for enhancing our approach; in particular, a topic-related approach is useful for filtering out unrelated topics via LIWC analysis [65].
There are still limitations in the languages supported by LIWC. LIWC covers diverse languages, including Arabic, Chinese, Dutch, English, French, German, Italian, Portuguese, Russian, Serbian, Spanish, and Turkish, and these versions are used in psychological and linguistic research in various countries [66,67]. However, LIWC does not support certain languages, such as Korean and Japanese, since open dictionaries for them have not yet been created; this is natural, because LIWC works only with access to a dictionary for the specific language. In the case of Korean, K-LIWC [68] was once available, and there are some studies using it [69,70]. Nevertheless, for parallel corpora involving Korean, such as KR-EN, we can analyze only the non-Korean side with LIWC, because the K-LIWC dictionary is currently closed.

6. Conclusions

In this work, we conducted a quality evaluation of all the Korean-related parallel corpora released by AI Hub. For model-centric performance validation, we constructed a Transformer-based NMT model trained on each parallel corpus. Through quantitative and qualitative analysis of these NMT models, we pointed out some probable limitations in corpus construction. First, for an NMT model to learn a specific field well, domain corpora should contain diverse words and expressions, in view of the excessive performance difference between the domain and general corpora. Second, given the significant performance gap across language directions, half of the parallel data should be composed with one language as the source and the other half with the other language as the source, each half then being translated accordingly.
Beyond the model-centric analysis, we encouraged data-centric research through LIWC analysis. We identified the association between LIWC features and model performance in terms of data filtering, and through this analysis we suggested directions for further work to improve model performance. A national-level re-examination of the various standards and building processes should be undertaken to encourage research on AI data construction. In the future, we plan to investigate efficient beam search strategies and new decoding methods utilizing these AI Hub data. In addition, to measure model performance more accurately, we plan to build an official Korean–English test set.

Author Contributions

Investigation, C.P.; methodology, C.P.; conceptualization, C.P.; software, S.E.; validation, S.E. and M.S.; formal analysis, H.M. and J.S.; writing—review and editing, C.P. and S.L.; supervision, H.L.; project administration, C.P. and H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2018-0-01405) supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP); by an IITP grant funded by the Korean government (MSIT) (No. 2020-0-00368, A Neural-Symbolic Model for Knowledge Acquisition and Inference Techniques); and by the MSIT, Korea, under the ICT Creative Consilience program (IITP-2021-2020-0-01819) supervised by the IITP.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study.

Acknowledgments

Thanks to AI Hub for creating a great dataset.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vieira, L.N.; O’Hagan, M.; O’Sullivan, C. Understanding the societal impacts of machine translation: A critical review of the literature on medical and legal use cases. Inf. Commun. Soc. 2021, 24, 1515–1532. [Google Scholar] [CrossRef]
  2. Zheng, W.; Wang, W.; Liu, D.; Zhang, C.; Zeng, Q.; Deng, Y.; Yang, W.; He, P.; Xie, T. Testing untestable neural machine translation: An industrial case. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), Montreal, QC, Canada, 25–31 May 2019; pp. 314–315. [Google Scholar]
  3. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  5. Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar]
  6. Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. Mass: Masked sequence to sequence pre-training for language generation. arXiv 2019, arXiv:1905.02450. [Google Scholar]
  7. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  8. Park, C.; Oh, Y.; Choi, J.; Kim, D.; Lim, H. Toward High Quality Parallel Corpus Using Monolingual Corpus. In Proceedings of the 10th International Conference on Convergence Technology (ICCT 2020), Jeju Island, Korea, 21–23 October 2020; Volume 10, pp. 146–147. [Google Scholar]
  9. Park, C.; Park, K.; Moon, H.; Eo, S.; Lim, H. A study on performance improvement considering the balance between corpus in Neural Machine Translation. J. Korea Converg. Soc. 2021, 12, 23–29. [Google Scholar]
  10. Edunov, S.; Ott, M.; Auli, M.; Grangier, D. Understanding back-translation at scale. arXiv 2018, arXiv:1808.09381. [Google Scholar]
  11. Currey, A.; Miceli-Barone, A.V.; Heafield, K. Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, 7–8 September 2017; pp. 148–156. [Google Scholar]
  12. Burlot, F.; Yvon, F. Using monolingual data in neural machine translation: A systematic study. arXiv 2019, arXiv:1903.11437. [Google Scholar]
  13. Epaliyana, K.; Ranathunga, S.; Jayasena, S. Improving Back-Translation with Iterative Filtering and Data Selection for Sinhala-English NMT. In Proceedings of the 2021 Moratuwa Engineering Research Conference (MERCon), Moratuwa, Sri Lanka, 27–29 July 2021; pp. 438–443. [Google Scholar]
  14. Imankulova, A.; Sato, T.; Komachi, M. Improving low-resource neural machine translation with filtered pseudo-parallel corpus. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), Taipei, Taiwan, 27 November–1 December 2017; pp. 70–78. [Google Scholar]
  15. Koehn, P.; Guzmán, F.; Chaudhary, V.; Pino, J. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), Florence, Italy, 1–2 August 2019; pp. 54–72. [Google Scholar]
  16. Park, C.; Lee, Y.; Lee, C.; Lim, H. Quality, not quantity?: Effect of parallel corpus quantity and quality on neural machine translation. In Proceedings of the 32nd Annual Conference on Human Cognitive Language Technology (HCLT2020), Online, 15–16 October 2020; pp. 363–368. [Google Scholar]
  17. Khayrallah, H.; Koehn, P. On the impact of various types of noise on neural machine translation. arXiv 2018, arXiv:1805.12282. [Google Scholar]
  18. Koehn, P.; Chaudhary, V.; El-Kishky, A.; Goyal, N.; Chen, P.J.; Guzmán, F. Findings of the WMT 2020 Shared Task on Parallel Corpus Filtering and Alignment. In Proceedings of the Fifth Conference on Machine Translation, Association for Computational Linguistics, Online, 19–20 November 2020; pp. 726–742. [Google Scholar]
  19. Park, C.; Lim, H. A Study on the Performance Improvement of Machine Translation Using Public Korean–English Parallel Corpus. J. Digit. Converg. 2020, 18, 271–277. [Google Scholar]
  20. Pennebaker, J.W.; Francis, M.E.; Booth, R.J. Linguistic inquiry and word count: LIWC 2001. Mahway Lawrence Erlbaum Assoc. 2001, 71, 2001. [Google Scholar]
  21. Tausczik, Y.R.; Pennebaker, J.W. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 2010, 29, 24–54. [Google Scholar] [CrossRef]
  22. Holtzman, N.S.; Tackman, A.M.; Carey, A.L.; Brucks, M.S.; Küfner, A.C.; Deters, F.G.; Back, M.D.; Donnellan, M.B.; Pennebaker, J.W.; Sherman, R.A.; et al. Linguistic markers of grandiose narcissism: A LIWC analysis of 15 samples. J. Lang. Soc. Psychol. 2019, 38, 773–786. [Google Scholar] [CrossRef] [Green Version]
  23. Bae, Y.J.; Shim, M.; Lee, W.H. Schizophrenia Detection Using Machine Learning Approach from Social Media Content. Sensors 2021, 21, 5924. [Google Scholar] [CrossRef]
  24. Sekulić, I.; Gjurković, M.; Šnajder, J. Not Just Depressed: Bipolar Disorder Prediction on Reddit. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, 31 October 2018; pp. 72–78. [Google Scholar]
  25. Kasher, A. Language in Focus: Foundations, Methods and Systems: Essays in Memory of Yehoshua Bar-Hillel; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 43. [Google Scholar]
  26. Dugast, L.; Senellart, J.; Koehn, P. Statistical Post-Editing on SYSTRAN’s Rule-Based Translation System. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, 23 June 2007; pp. 220–223. [Google Scholar]
  27. Forcada, M.L.; Ginestí-Rosell, M.; Nordfalk, J.; O’Regan, J.; Ortiz-Rojas, S.; Pérez-Ortiz, J.A.; Sánchez-Martínez, F.; Ramírez-Sánchez, G.; Tyers, F.M. Apertium: A free/open-source platform for rule-based machine translation. Mach. Transl. 2011, 25, 127–144. [Google Scholar] [CrossRef]
  28. Zens, R.; Och, F.J.; Ney, H. Phrase-based statistical machine translation. In Proceedings of the Annual Conference on Artificial Intelligence, Aachen, Germany, 16–20 September 2002; pp. 18–32. [Google Scholar]
  29. Koehn, P. Statistical Machine Translation; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
  30. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
  31. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  32. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1243–1252. [Google Scholar]
  33. Wu, F.; Fan, A.; Baevski, A.; Dauphin, Y.N.; Auli, M. Pay less attention with lightweight and dynamic convolutions. arXiv 2019, arXiv:1901.10430. [Google Scholar]
  34. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  35. Schwab, K. The Fourth Industrial Revolution. Currency. 2017. Available online: https://www.weforum.org/about/the-fourth-industrial-revolution-by-klaus-schwab (accessed on 25 May 2022).
  36. Goyal, N.; Gao, C.; Chaudhary, V.; Chen, P.J.; Wenzek, G.; Ju, D.; Krishnan, S.; Ranzato, M.; Guzman, F.; Fan, A. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. arXiv 2021, arXiv:2106.03193. [Google Scholar] [CrossRef]
  37. Esplà-Gomis, M.; Forcada, M.L.; Ramírez-Sánchez, G.; Hoang, H. ParaCrawl: Web-scale parallel corpora for the languages of the EU. In Proceedings of the Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks, Dublin, Ireland, 19–23 August 2019; pp. 118–119. [Google Scholar]
  38. Gale, W.A.; Church, K. A program for aligning sentences in bilingual corpora. Comput. Linguist. 1993, 19, 75–102. [Google Scholar]
  39. Simard, M.; Plamondon, P. Bilingual sentence alignment: Balancing robustness and accuracy. Mach. Transl. 1998, 13, 59–80. [Google Scholar] [CrossRef]
  40. Abdul-Rauf, S.; Fishel, M.; Lambert, P.; Noubours, S.; Sennrich, R. Extrinsic evaluation of sentence alignment systems. In Proceedings of the Workshop on Creating Cross-language Resources for Disconnected Languages and Styles, Istanbul, Turkey, 27 May 2012. [Google Scholar]
  41. Lee, H.G.; Kim, J.S.; Shin, J.H.; Lee, J.; Quan, Y.X.; Jeong, Y.S. papago: A machine translation service with word sense disambiguation and currency conversion. In Proceedings of the COLING 2016, 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, 11–16 December 2016; pp. 185–188. [Google Scholar]
  42. Park, C.; Eo, S.; Moon, H.; Lim, H.S. Should we find another model? Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 97–104. [Google Scholar]
  43. Park, C.; Kim, G.; Lim, H. Parallel Corpus Filtering and Korean-Optimized Subword Tokenization for Machine Translation. In Proceedings of the 31st Annual Conference on Human & Cognitive Language Technology, Daejeon, Korea, 11–12 October 2019. [Google Scholar]
  44. Park, C.; Kim, K.; Lim, H. Optimization of Data Augmentation Techniques in Neural Machine Translation. In Proceedings of the 31st Annual Conference on Human & Cognitive Language Technology, Daejeon, Korea, 11–12 October 2019. [Google Scholar]
  45. Park, C.; Kim, K.; Yang, Y.; Kang, M.; Lim, H. Neural spelling correction: Translating incorrect sentences to correct sentences for multimedia. Multimed. Tools Appl. 2020, 80, 34591–34608. [Google Scholar] [CrossRef]
  46. Park, C.; Lee, C.; Yang, Y.; Lim, H. Ancient Korean Neural Machine Translation. IEEE Access 2020, 8, 116617–116625. [Google Scholar] [CrossRef]
  47. Lee, C.; Yang, K.; Whang, T.; Park, C.; Matteson, A.; Lim, H. Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models. Appl. Sci. 2021, 11, 1974. [Google Scholar] [CrossRef]
  48. Pennebaker, J.W.; Boyd, R.L.; Jordan, K.; Blackburn, K. The Development and Psychometric Properties of LIWC2015; Technical Report; University of Texas Libraries: Austin, TX, USA, 2015. [Google Scholar]
  49. Prates, M.O.; Avelar, P.H.; Lamb, L. Assessing gender bias in machine translation–a case study with Google translate. arXiv 2018, arXiv:1809.02208. [Google Scholar] [CrossRef] [Green Version]
  50. Saunders, D.; Byrne, B. Reducing gender bias in neural machine translation as a domain adaptation problem. arXiv 2020, arXiv:2004.04498. [Google Scholar]
  51. Coppersmith, G.; Dredze, M.; Harman, C. Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Baltimore, MD, USA, 22–27 June 2014; pp. 51–60. [Google Scholar]
  52. Su, Q.; Wan, M.; Liu, X.; Huang, C.R. Motivations, methods and metrics of misinformation detection: An NLP perspective. Nat. Lang. Process. Res. 2020, 1, 1–13. [Google Scholar] [CrossRef]
  53. Garcıa-Dıaz, J.A. Using Linguistic Features for Improving Automatic Text Classification Tasks in Spanish. In Proceedings of the Doctoral Symposium on Natural Language Processing from the PLN.net Network (PLNnet-DS-2020), Jaén, Spain, 16 December 2020. [Google Scholar]
  54. Biggiogera, J.; Boateng, G.; Hilpert, P.; Vowels, M.; Bodenmann, G.; Neysari, M.; Nussbeck, F.; Kowatsch, T. BERT meets LIWC: Exploring State-of-the-Art Language Models for Predicting Communication Behavior in Couples’ Conflict Interactions. arXiv 2021, arXiv:2106.01536. [Google Scholar]
  55. Moon, H.; Park, C.; Eo, S.; Park, J.; Lim, H. Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering. J. Korea Converg. Soc. 2021, 12, 1–7. [Google Scholar]
  56. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  57. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  58. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
  59. Cai, L.; Zhu, Y. The challenges of data quality and data quality assessment in the big data era. Data Sci. J. 2015, 14, 2. [Google Scholar] [CrossRef]
  60. Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4. [Google Scholar]
  61. Koehn, P.; Khayrallah, H.; Heafield, K.; Forcada, M.L. Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium, 31 October–1 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 726–739. [Google Scholar]
  62. Park, C.; Seo, J.; Lee, S.; Lee, C.; Moon, H.; Eo, S.; Lim, H. BTS: Back TranScription for Speech-to-Text Post-Processor using Text-to-Speech-to-Text. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), Association for Computational Linguistics, Online, 5–6 August 2021; pp. 106–116. [Google Scholar] [CrossRef]
  63. Zhang, B.; Nagesh, A.; Knight, K. Parallel corpus filtering via pre-trained language models. arXiv 2020, arXiv:2005.06166. [Google Scholar]
  64. Pope, D.; Griffith, J. An Analysis of Online Twitter Sentiment Surrounding the European Refugee Crisis. In Proceedings of the KDIR, Porto, Portugal, 9–11 November 2016; pp. 299–306. [Google Scholar]
  65. Fast, E.; Chen, B.; Bernstein, M.S. Empath: Understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May 2016; pp. 4647–4657. [Google Scholar]
  66. Garzón-Velandia, D.C.; Barreto, I.; Medina-Arboleda, I.F. Validación de un diccionario de LIWC para identificar emociones intergrupales. Rev. Latinoam. Psicol. 2020, 52, 149–159. [Google Scholar] [CrossRef]
  67. Paixao, M.; Lima, R.; Espinasse, B. Fake News Classification and Topic Modeling in Brazilian Portuguese. In Proceedings of the 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Melbourne, Australia, 14–17 December 2020; pp. 427–432. [Google Scholar]
  68. Lee Chang-hwan, S.J.m.; Ae-sun, Y. The Review about the Development of Korean Linguistic Inquiry and Word Count. Korean Psychol. Assoc. 2004, 2004, 295–296. [Google Scholar]
  69. Lee, J.W.; Oh, J.H.; Jung, J.S.; Lee, C.H. Counselor-Client Language Analysis Using the K-LIWC Program. J. Korean Data Anal. Soc. 2007, 9, 2545–2567. [Google Scholar]
  70. Kim Youngil, K.Y.; Kyungil, K. Detecting a deceptive attitude in non-pressure situations using K-LIWC. Korean Soc. Cogn. Sci. 2016, 27, 247–273. [Google Scholar]
Figure 1. Data-domain statistics of the Korean–English Parallel Corpus.
Figure 2. Data-domain statistics of the Korean–English Domain-Specialized Parallel Corpus.
Figure 3. Data-domain statistics of the Korean–English Parallel Corpus (Technology).
Figure 4. Data-domain statistics of the Korean–English Parallel Corpus (Social Science).
Figure 5. Data-domain statistics of the Korean–Chinese Parallel Corpus (Technology).
Figure 6. Data-domain statistics of the Korean–Chinese Parallel Corpus (Social Science).
Figure 7. Data-domain statistics of the Korean–Japanese Parallel Corpus.
Figure 8. Results of the correlation between LIWC features and the BLEU score (KR-EN). Blue indicates a positive correlation, while red indicates a negative correlation.
Figure 9. Negative correlation (r < 0) results for the important factors between the BLEU score (KR-EN) and LIWC features. The empty (white) cells are cut off because their values are positive. Note that these results are statistically significant (p < 0.05).
Table 1. Overview of features in LIWC.
Category | Features (Label)
Summary language variables | Analytical thinking (Analytic), Clout (Clout), Authenticity (Authentic), Emotional tone (Tone)
Linguistic dimension | Words per sentence (WPS), Percent of target words captured by the dictionary (Dic), Percent of words in the text that are longer than six letters (Sixltr), Word count (WC), Articles (article), Prepositions (prep), Total pronouns (pronoun), Personal pronouns (ppron), 1st pers singular (i), 1st pers plural (we), 2nd person (you), 3rd pers singular (shehe), 3rd pers plural (they), Impersonal pronouns (ipron)
Grammars | Auxiliary verbs (auxverb), Common verbs (verb), Common adverbs (adverb), Conjunctions (conj), Negations (negate), Common adjectives (adj), Comparisons (compare), Interrogatives (interrog), Number (number), Quantifiers (quant)
Affect process | Total affect process (affect), Positive emotion (posemo), Negative emotion (negemo), Anxiety (anx), Anger (anger), Sadness (sad)
Cognitive process | Total cognitive process (cogproc), Insight (insight), Cause (cause), Discrepancies (discrep), Tentativeness (tentat), Certainty (certain), Differentiation (differ)
Social process | Total social process (social), Family (family), Friends (friend), Female referents (female), Male referents (male)
Perceptual process | Total perceptual process (percept), Seeing (see), Hearing (hear), Feeling (feel)
Biological process | Total biological process (bio), Body (body), Health/Illness (health), Sexuality (sexual), Ingesting (ingest)
Drives | Total drives (drives), Affiliation (affiliation), Achievement (achieve), Power (power), Reward focus (reward), Risk focus (risk)
Time orientations | Past focus (focuspast), Present focus (focuspresent), Future focus (focusfuture)
Relativity | Total relativity (relativ), Motion (motion), Space (space), Time (time)
Personal concerns | Work (work), Home (home), Money (money), Leisure activities (leisure), Religion (relig), Death (death)
Informal language markers | Total informal language markers (Informal), Assents (assent), Fillers (filler), Swear words (swear), Netspeak (netspeak), Nonfluencies (nonfl)
Punctuations | Total punctuation (Allpunc), Semicolons (SemiC), Commas (Comma), Colons (Colon), Parentheses (Parenth), Question marks (QMark), Exclamation marks (Exclam), Periods (Period), Apostrophes (Apostro), Quotation marks (Quote), Dashes (Dash), Other punctuation (OtherP)
Table 2. LIWC results of the Korean–English Domain-Specialized Parallel Corpus and the Korean–English Parallel Corpus.
Categories | Features | KR-EN Domain-Specialized Parallel Corpus (Train, Valid, Total) | KR-EN Parallel Corpus (Total)
summary language variablesAnalytic97.1497.2697.1694.16
Clout61.161.1761.1165.39
Authentic25.725.3725.6627.76
Tone49.1449.0849.1354.48
Linguistic dimensionsWC33,750,8164,222,19137,973,00738,481,936
WPS27.3327.2927.3325.39
Sixltr26.526.4926.525.69
Dic76.3376.376.3279.25
function43.7843.7843.7845.24
pronoun4.94.874.896.73
ppron1.541.541.543.08
i0.210.210.211.02
we0.230.230.230.38
you0.30.30.30.56
shehe0.420.420.420.67
they0.380.380.380.45
ipron3.363.333.353.64
article10.410.4210.410.2
prep15.5115.5215.5115.1
grammarauxverb6.076.056.066.75
adverb2.462.472.462.55
conj5.925.925.925.43
negate0.630.620.630.7
verb9.949.919.9411.36
adj4.494.464.494.47
compare2.352.342.352.25
interrog1.211.21.211.3
number3.273.283.272.53
quant1.361.361.361.4
affective processaffect3.783.753.774.03
posemo2.492.482.492.75
negemo1.241.231.241.23
anx0.170.170.170.19
anger0.210.20.210.29
sad0.30.30.30.28
social processsocial5.075.045.066.7
family0.230.230.230.25
friend0.120.120.120.15
female0.180.170.180.34
male0.460.470.460.63
cognitive processcogproc6.756.736.747.48
insight1.561.561.561.75
cause1.691.681.691.73
discrep0.770.770.770.99
tentat1.141.141.141.41
certain0.650.650.650.7
differ1.921.911.921.96
perceptual processpercept1.531.531.531.82
see0.580.580.580.68
hear0.480.480.480.65
feel0.320.320.320.35
Biological processbio2.312.312.311.63
body0.460.460.460.46
health1.351.361.350.67
sexual0.040.040.040.05
ingest0.510.510.510.46
drivesdrives7.397.47.48.26
affiliation1.521.521.521.77
achieve1.921.921.921.96
power3.333.333.333.94
reward1.021.021.021.05
risk0.790.780.790.63
time-orientationsfocuspast3.273.253.263.4
focuspresent5.795.785.796.63
focusfuture0.970.960.961.49
relativivityrelativ14.5314.5514.5314.29
motion1.721.721.721.81
space8.318.328.317.83
time4.594.64.594.73
personal concernswork5.645.645.646.37
leisure1.521.521.521.43
home0.520.530.520.5
money2.192.182.191.87
relig0.230.230.230.29
death0.140.140.140.14
informal languageinformal0.230.230.230.26
swear0000.01
netspeak0.10.110.10.1
assent0.060.060.060.09
nonflu0.080.080.080.09
filler0000
punctuationsAllPunc14.7214.7214.7214.68
Period3.813.813.814.33
Comma6.236.236.235.24
Colon0.030.030.030.08
SemiC0.010.010.010.01
QMark0.010.010.010.22
Exclam0000
Dash1.881.891.891.6
Quote1.111.11.111.16
Apostro0.850.850.851.11
Parenth0.560.570.560.7
OtherP0.220.230.220.23
Table 3. LIWC results of Korean–English Parallel (Social Science) and Korean–English Parallel (Technology) Corpus.
Categories | Features | KR-EN Parallel Corpus (Social Science) (Train, Valid, Total) | KR-EN Parallel Corpus (Technology) (Train, Valid, Total)
summary language variablesAnalytic97.3897.3897.38999999
Clout53.8653.8753.8649.1349.1749.13
Authentic23.9924.2624.0216.6916.4916.67
Tone47.6647.9147.6934.0233.9934.02
Linguistic dimensionsWC9,991,4081,250,11511,241,52313,621,2091,702,72215,323,931
WPS20.7720.7920.7717.8717.8717.87
Sixltr30.6530.6130.6529.0229.0329.02
Dic80.7180.7180.7167.1967.1767.19
function46.8146.8146.8142.1542.1942.15
pronoun5.325.315.322.042.062.04
ppron0.960.960.960.140.150.14
i0.140.140.140.030.030.03
we0.220.220.220.020.020.02
you0.10.10.10.020.010.02
shehe0.160.170.160.010.020.01
they0.340.340.340.060.070.06
ipron4.354.344.351.91.911.9
article11.4211.4211.4214.9514.9514.95
prep16.1416.1316.1413.213.2313.2
grammarauxverb7.237.217.237.687.77.68
adverb2.442.422.441.351.341.35
conj5.415.445.413.773.763.77
negate0.850.860.850.30.30.3
verb10.310.2510.299.519.539.51
adj4.734.734.733.323.323.32
compare2.62.572.62.152.152.15
interrog0.880.880.880.540.540.54
number1.891.871.896.286.276.28
quant1.581.581.581.631.621.63
affective processaffect3.783.793.781.731.731.73
posemo2.452.462.451.091.091.09
negemo1.271.261.270.620.620.62
anx0.210.20.210.150.150.15
anger0.240.250.240.050.050.05
sad0.220.220.220.230.230.23
social processsocial4.274.284.271.831.841.83
family0.120.130.120.050.050.05
friend0.060.070.060.090.090.09
female0.10.090.10.040.040.04
male0.180.190.180.030.030.03
cognitive processcogproc11.1411.1311.149.69.599.6
insight3.343.343.342.092.072.09
cause2.782.762.782.292.32.29
discrep1.061.061.060.350.340.35
tentat1.891.891.893.713.723.71
certain1.091.091.090.450.440.45
differ2.632.642.631.721.721.72
perceptual processpercept1.161.171.162.172.172.17
see0.560.560.561.371.371.37
hear0.260.250.260.20.20.2
feel0.180.190.180.420.420.42
Biological processbio0.80.810.81.421.421.42
body0.180.190.180.410.410.41
health0.450.440.450.80.790.8
sexual0.030.030.030.030.030.03
ingest0.150.150.150.220.220.22
drivesdrives7.958.027.964.594.584.59
affiliation1.31.321.30.660.660.66
achieve1.871.91.871.461.461.46
power3.873.913.872.082.082.08
reward0.830.840.830.370.370.37
risk0.830.820.830.340.340.34
time-orientationsfocuspast2.662.662.661.341.361.34
focuspresent6.986.926.976.156.146.15
focusfuture0.760.760.762.852.852.85
relativityrelativ11.8811.9311.8911.7211.6911.72
motion1.491.521.491.431.421.43
space7.347.367.347.297.297.29
time33.0133.163.153.16
personal concernswork8.428.418.422.252.262.25
leisure0.760.760.760.440.420.44
home0.320.320.320.280.280.28
money3.23.173.20.380.380.38
relig0.130.120.130.020.020.02
death0.080.080.080.040.050.04
informal languageinformal0.130.130.130.180.180.18
swear0.010.010.010.030.030.03
netspeak0.060.060.060.120.120.12
assent0.020.020.020.010.010.01
nonflu0.050.040.050.040.040.04
filler000000
punctuationsAllPunc11.7511.7611.7511.1711.1611.17
Period4.814.814.815.655.655.65
Comma4.664.664.663.543.543.54
Colon0.010.010.010.010.010.01
SemiC00.010000
QMark0.040.040.04000
Exclam000000
Dash0.830.840.830.820.80.82
Quote0.120.120.120.020.020.02
Apostro0.630.650.630.120.120.12
Parenth0.370.370.370.690.70.69
OtherP0.260.260.260.310.310.31
Table 4. LIWC results of Korean–Chinese(Zh) Parallel (Social Science) and Korean–Chinese(Zh) Parallel (Technology) Corpus.
Categories | Features | Ko-Zh Parallel Corpus (Social Science) (Train, Valid, Total) | Ko-Zh Parallel Corpus (Technology) (Train, Valid, Total)
summary language variablesAnalytic93.2593.2493.2593.1593.0993.14
Clout50.4450.4950.4551.7352.1451.78
Authentic111111
Tone26.9727.0526.9829.1829.7529.25
Linguistic dimensionsWC3,907,897482,4414,390,3383,954,651512,8734,467,524
WPS598.73724.39612.542569.622947.552613.01
Sixltr66.8966.6866.8766.3564.8466.18
Dic1.021.071.033.273.813.33
function0.210.210.210.550.670.56
pronoun0.070.080.070.270.350.28
ppron0.060.060.060.20.250.21
i0.030.030.030.050.060.05
we0.010.010.010.010.010.01
you0.020.020.020.150.190.15
shehe000000
they000000
ipron0.020.020.020.070.090.07
article0.060.050.060.080.090.08
prep0.050.050.050.150.170.15
grammarauxverb0.010.020.010.020.030.02
adverb0.010.010.010.020.030.02
conj0.010.010.010.040.040.04
negate0.010.010.01000
verb0.080.080.080.160.170.16
adj0.070.080.070.270.30.27
compare0.010.010.010.020.020.02
interrog0.010.010.010.020.020.02
number1.451.411.451.331.161.31
quant0.010.010.010.040.040.04
affective processaffect0.120.130.120.290.350.3
posemo0.10.10.10.250.290.25
negemo0.020.030.020.040.060.04
anx00000.010
anger00.0100.020.020.02
sad0.010.010.010.010.010.01
social processsocial0.130.140.130.340.410.35
family0.010.010.010.010.010.01
friend0.010.020.010.040.040.04
female0.020.010.020.020.030.02
male0.020.020.020.020.020.02
cognitive processcogproc0.050.060.050.20.240.2
insight0.020.020.020.090.110.09
cause0.020.020.020.080.110.08
discrep00.01000.010
tentat0.010.010.010.010.010.01
certain0.010.010.010.030.030.03
differ0000.010.010.01
perceptual processpercept0.090.090.090.220.250.22
see0.040.040.040.110.130.11
hear0.030.030.030.050.060.05
feel0.010.010.010.030.030.03
Biological processbio0.060.070.060.180.210.18
body0.010.010.010.040.050.04
health0.020.020.020.080.10.08
sexual0000.010.010.01
ingest0.020.020.020.060.060.06
drivesdrives0.160.180.160.570.710.59
affiliation0.060.060.060.220.270.23
achieve0.030.030.030.110.140.11
power0.070.080.070.20.260.21
reward0.020.020.020.060.070.06
risk0000.030.040.03
time-orientationsfocuspast0.010.010.010.020.030.02
focuspresent0.060.070.060.140.160.14
focusfuture0.010.010.010.020.020.02
relativivityrelativ0.170.180.170.650.720.66
motion0.030.040.030.130.150.13
space0.090.080.090.350.380.35
time0.050.060.050.170.190.17
personal concernswork0.110.120.110.520.560.52
leisure0.110.110.110.350.490.37
home0.010.010.010.060.060.06
money0.060.060.060.260.240.26
relig0.010.010.010.020.020.02
death0000.010.020.01
informal languageinformal0.090.090.090.320.410.33
swear00000.010
netspeak0.070.080.070.310.40.32
assent0.030.030.030.030.030.03
nonflu0.010.010.010.010.010.01
filler000000
punctuationsAllPunc66.0163.9765.7963.1963.2463.2
Period0.150.150.150.10.120.1
Comma42.341.2642.1938.937.3738.72
Colon1.090.941.071.030.821.01
SemiC0.010.010.010.010.010.01
QMark0.180.130.170.050.050.05
Exclam0.050.060.050.020.020.02
Dash0.520.520.521.141.151.14
Quote15.5214.3115.3912.6812.8912.7
Apostro0.520.460.510.570.50.56
Parenth3.794.283.846.658.536.87
OtherP1.891.861.892.051.792.02
Table 5. Summary of overall AI Hub datasets. For statistics on tokens, we denote NA because there are no spaces in Japanese and Chinese.
Corpus | Split | Language | # of Sents | # of Min Toks | # of Max Toks | Avg Toks per Sent | # of Min Chars | # of Max Chars | Avg Chars per Sent
Korean–English Parallel CorpusTrainKR1,399,11617812.97435955.07
EN1,399,116218022.5710999135.24
ValidKR200,30223217.46922075.36
EN200,302311030.7314706187.89
TestKR30007259.022511335.71
EN300051210.463110166.50
Korean–English Parallel Corpus (Social Science)TrainKR474,967213812.571162952.94
EN474,967227120.7191617128.78
ValidKR59,746614412.582163652.96
EN59,746628020.73291550128.88
TestKR300064512.602223653.00
EN300076920.8635451129.33
Korean–English Parallel Corpus (Technology)TrainKR697,66543712.252118051.97
EN697,66515219.239311115.88
ValidKR87,58343512.252315551.95
EN87,58354819.2437310115.86
TestKR300063412.232616151.98
EN300064219.1745294115.75
Korean–English Domain-Specialized Parallel CorpusTrainKR1,197,00033515.301130466.18
EN1,197,000114527.5311001167.23
ValidKR150,00053215.301419266.21
EN150,00049727.5526691167.34
TestKR300073015.502714767.79
EN300076928.5646433175.57
Korean–Japanese Parallel CorpusTrainKR1,197,00033515.651121667.56
JP1,197,000NANANA925061.63
ValidKR150,00043115.701824367.87
JP150,000NANANA1324161.69
TestKR300043014.152016861.99
JP3000NANANA1618658.87
Korean–Chinese Parallel Corpus (Social Science)TrainKR1,037,00037815.951235969.03
ZH1,037,000NANANA525946.73
ValidKR130,00045215.661228368.04
ZH130,000NANANA720046.29
TestKR300063014.352515162.12
ZH3000NANANA1111737.77
Korean–Chinese Parallel Corpus (Technology)TrainKR1,037,00023515.821023669.01
ZH1,037,000NANANA729648.22
ValidKR130,00033115.931721369.71
ZH130,000NANANA919949.07
TestKR300043015.072216365.75
ZH3000NANANA1418145.91
Table 6. Experimental results for seven datasets and three language pairs published by AI Hub.
Corpus | Language | BLEU
Korean–English Parallel corpus | KR-EN | 28.36
Korean–English Parallel corpus | EN-KR | 13.53
Korean–English Parallel corpus (Social Science) | KR-EN | 45.64
Korean–English Parallel corpus (Social Science) | EN-KR | 17.71
Korean–English Parallel corpus (Technology) | KR-EN | 63.88
Korean–English Parallel corpus (Technology) | EN-KR | 39.17
Korean–English Domain-specialized Parallel corpus | KR-EN | 51.88
Korean–English Domain-specialized Parallel corpus | EN-KR | 21.99
Korean–Japanese Parallel corpus | KR-JA | 68.88
Korean–Japanese Parallel corpus | JA-KR | 49.05
Korean–Chinese Parallel corpus (Social Science) | KR-ZH | 48.74
Korean–Chinese Parallel corpus (Social Science) | ZH-KR | 25.16
Korean–Chinese Parallel corpus (Technology) | KR-ZH | 46.70
Korean–Chinese Parallel corpus (Technology) | ZH-KR | 25.75
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
