AI Student: A Machine Reading Comprehension System for the Korean College Scholastic Ability Test

Abstract: Machine reading comprehension is a question answering task in which a machine reads, understands, and answers questions about a given text. These reasoning skills can be sufficiently grafted onto the Korean College Scholastic Ability Test (CSAT) to bring about new scientific and educational advances. In this paper, we propose a novel Korean CSAT Question and Answering (KCQA) model and effectively utilize four easy data augmentation strategies with round-trip translation to augment the insufficient training dataset. To evaluate the effectiveness of KCQA, 30 students took the test under conditions identical to those of the proposed model. Our qualitative and quantitative analyses, along with the experimental results, revealed that KCQA performed better than the human test takers, achieving an F1 score 3.86 points higher.


Introduction
Machine Reading Comprehension (MRC) aims to teach machines to read and answer questions after understanding given text passages, which is a fundamental goal of natural language understanding [1,2]. This sequential process of an MRC model resembles that of a human solving reading comprehension questions in the Korean language subject of the College Scholastic Ability Test (CSAT). The CSAT is a standardized test in Korea designed to evaluate the scholastic ability required of students for a college education. The CSAT focuses on high-order thinking skills, based on each student's prior knowledge of the subject and how they answer the exam. For the Korean language subject of the CSAT, an MRC model must classify each answer candidate as True (T) or False (F) given a passage of approximately 1700 characters and, on average, 15 multiple-choice questions. However, research on analyzing the CSAT using Natural Language Processing (NLP) remains limited.
Over the past few years, there has been a significant change in how NLP models are trained. Pre-trained Language Models (PLMs) [3-5], which are pre-trained on large-scale text corpora through unsupervised objectives such as masked language modeling (e.g., BERT, GPT, RoBERTa, and T5), are adapted to various downstream tasks by fine-tuning their parameters with task-specific objective functions [3-7]. Recently, there have been several attempts to adopt PLM-based MRC models in other domains, such as biomedicine and cybersecurity [8-11]. However, to the best of our knowledge, there has been little research exploiting MRC techniques in the domain of Korean language reading in the CSAT. It is crucial to utilize PLMs to cast given passages and questions as a classification problem; the intrinsic reasoning skills that PLMs bring to test exams depend on their prior knowledge, similar to humans. Therefore, in this paper, we propose AI Student, a novel Korean CSAT Question and Answering (KCQA) model that assesses scholastic reading ability on the Korean language section of the CSAT. KCQA evaluates the effectiveness of language models on given questions and passages under conditions identical to those experienced by students. In addition, we apply Round-Trip Translation (RTT), a method mainly used to alleviate insufficient training data in neural machine translation, and simultaneously obtain practical knowledge by exploiting the Easy Data Augmentation (EDA) method to effectively augment the CSAT corpus with WordNet, which encodes synonym and antonym relationships between words [12-15]. This approach suggests the possibility of examining the contribution of prior knowledge of certain subjects to the understanding of related passages by adopting language models, which was previously impossible to quantify in the field of education.
The contributions are summarized as follows:
• We obtained insights into the meaningful effects of our KCQA models, which assessed scholastic reading ability.
• To this end, we demonstrated the effectiveness of using various Korean and multilingual language models with four data augmentation strategies for practical learning to alleviate the problem of insufficient training data.
• For human performance, we employed 30 test students preparing for the CSAT under conditions identical to those of the PLMs. The results proved the superior performance of the proposed KCQA.
• We comprehensively conducted qualitative and quantitative analyses by deriving concrete experimental results from the perspectives of both educational assessment and deep learning.

Background
The CSAT, a high-stakes assessment that serves as a decisive factor for college admission, has been developed and administered by the Korea Institute for Curriculum & Evaluation (https://www.kice.re.kr/ (accessed on 5 April 2022)) since its introduction in 1993 as the most important standardized tool. It is a compulsory exam for students aiming to enter college after 12 years of regular education courses [16,17]. The subjects comprise Korean language, subdivided into reading and literature, as well as English, Mathematics, Social Studies, Science, Vocational Studies, Foreign Languages, and Chinese Classics.
Regardless of language, reading is highly valued in education; this is especially true for Korean language education. The reading section of the CSAT aims at "cultivating the ability to accurately understand a certain passage". When evaluating reading comprehension skills, each student's prior knowledge plays a crucial role in understanding the grammatical and literary knowledge for a given passage and question [18]. In this process, reading requires students to classify a given question as T or F with respect to the given passage.

Pre-Trained Language Models
Many recent studies based on transformer models consist of a pre-training and fine-tuning stage and have achieved good performance in various downstream tasks, such as named entity recognition, question answering, text generation, and other NLP tasks. It is common practice in the NLP community to fine-tune for various tasks instead of learning models from scratch [19-24]. BERT showed superiority over human performance on the Stanford Question Answering Dataset, the most representative question answering dataset [25]. PLMs apply various techniques, such as masked language modeling, next sentence prediction, and Replaced Token Detection (RTD), to provide better quality context-sensitive information within the passage for each downstream task. Multilingual BERT (mBERT), used for conducting research in Korean, includes various forms of linguistic information, but it is limited in that its Korean-specific data are insufficient.
Recently, various Korean PLMs have been released in the Korean research community. They achieved performance better than that of multilingual models by exploiting large-scale Korean corpora and vocabularies that reflect the philological characteristics of the Korean language, which is an underlying factor in the performance of PLMs. KoBERT (https://github.com/SKTBrain/KoBERT (accessed on 5 April 2022)) is pre-trained on 5 million Korean sentences and optimized for the Korean language. KoELECTRA is pre-trained on 34 GB of Korean sentences from news articles, Wikidata, and the National Institute of Korean Language, the institution that establishes the norms of Korean linguistics (https://corpus.korean.go.kr/ (accessed on 5 April 2022)). The model adopts RTD, which masks a certain ratio of tokens and lets a generator produce suitable tokens to fill in the masked positions [26]. In this process, a discriminator is trained to determine which tokens have been replaced, based on the generator's output. Moreover, KcELECTRA is an ELECTRA-based model, but it differs significantly in the nature and scale of its training data: it is optimized for news comments and movie reviews, whose tokens have colloquial features, including numerous neologisms and informal expressions such as typos. Approximately 17 GB of news comments were used for pre-training.
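The RTD objective described above can be illustrated with a minimal, didactic sketch: a generator fills masked positions, and the discriminator's training labels mark which tokens were actually replaced. This is not ELECTRA itself; the stand-in generator (sampling from the sentence's own vocabulary) is an assumption for illustration only.

```python
import random

def make_rtd_example(tokens, mask_ratio=0.15, generator=None, seed=0):
    # Build one corrupted sequence and its per-token RTD labels.
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i in masked_idx:
            # Generator proposes a token for the masked slot; the default
            # here is a toy stand-in that samples from the sentence itself.
            fill = (generator or (lambda t: rng.choice(tokens)))(tok)
            corrupted.append(fill)
            labels.append(int(fill != tok))  # 1 = replaced
        else:
            corrupted.append(tok)
            labels.append(0)                 # 0 = original
    return corrupted, labels
```

The discriminator is then trained as a binary classifier over these labels, which is the pre-training signal KoELECTRA and KcELECTRA rely on.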

Adoption of Deep Learning in CSAT
A previous study extracted keywords, which account for 11% of the Korean language section of the CSAT, visualized them as a word cloud, and analyzed the language network with a term-document matrix method [27]. However, because it considered only keyword frequency, it could not describe contextualized feature representations or capture the meaning of linguistic expressions. Furthermore, a study on the English section of the CSAT compared vocabulary complexity, using a vocabulary profile, with the reading comprehension section of the ETS SAT2, and grammatical complexity using the L2 syntactic complexity analyzer [28]. However, this approach also was not grafted onto the PLMs that are mainly dealt with in NLP. Although the various subjects of the CSAT have been used to evaluate the ability to understand a given passage, no such PLM-based research has been conducted.

Data Augmentation Strategy
Enhancing text data to overcome the lack of a training corpus in a relatively low-resource language has been suggested in various studies as an effective way to increase the contextual understanding of a language model [29-31]. For data augmentation, the original training corpus must be transformed into a suitable form within a range that does not damage the original meaning. Compared with other fields, applying data augmentation to NLP presents several challenges; however, this presents an opportunity to introduce EDA to effectively augment text-based datasets. EDA augments given sentences according to four operations: Synonym Replacement (SR), Random Insertion (RI), Random Swap (RS), and Random Deletion (RD). SR replaces specific tokens with synonym tokens retrieved from WordNet. RI inserts a random synonym token at a specific location. RS swaps the positions of two random tokens. Finally, RD deletes a random token from a sentence. In the process of data augmentation for text classification, even if the size of the dataset is increased effectively, the existing correct labels must not be contaminated. For example, when sentences labeled T are augmented, the augmentation fails if many of the new sentences should actually be labeled F. EDA, however, prevents labels from being flipped from T to F and vice versa.
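The four EDA operations can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy `SYNONYMS` table stands in for WordNet, and the tokenization is a plain whitespace split.

```python
import random

# Toy synonym table standing in for WordNet (assumption: a real system
# would query a lexical database such as a Korean WordNet).
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(tokens, n=1):
    # SR: replace up to n tokens that have known synonyms.
    out = tokens[:]
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(tokens, n=1):
    # RI: insert a synonym of a random token at a random position.
    out = tokens[:]
    candidates = [t for t in out if t in SYNONYMS]
    for _ in range(n):
        if not candidates:
            break
        syn = random.choice(SYNONYMS[random.choice(candidates)])
        out.insert(random.randrange(len(out) + 1), syn)
    return out

def random_swap(tokens, n=1):
    # RS: swap two random token positions n times (needs >= 2 tokens).
    out = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1):
    # RD: drop each token with probability p; always keep at least one.
    out = [t for t in tokens if random.random() > p]
    return out if out else [random.choice(tokens)]
```

None of the four operations negates the sentence, which is why the T/F label survives augmentation.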

Reading Form in CSAT
Reading passages cover philosophy, social studies, science, technology, and art; convergent passages deal with two different fields, such as philosophy and science. Each field includes three or four passages, and each passage includes four or five questions. Each question includes five answer candidates, and each candidate must be classified as T or F. The test format is illustrated in Figure 1. A passage and question from an actual CSAT example are shown in Table 1.

Passage
a semiconductor substrate board, which is similar to the process of making an engraving.Just as countless engravings can be made from an original plate on paper, in the case of photolithography, a single plate called a mask is made, and then the pattern of the exact same shape is repeatedly copied on the substrate using a laser to make a large number of patterns.Compared to making the original plate using an engraving knife, in the case of photolithography, the size of the mask pattern is very small, so it is made using a laser.

Question
The size of the pattern engraved on the mask is smaller than the size of the pattern created on the substrate board.
To determine the correct answer to the question, students must be capable of understanding not only the superficial information but also the semantic and contextual information. In Table 1, a student who pays attention only to the superficial information "the size of the mask pattern is very small" can easily misclassify the given question as T. However, a student who attends to the figurative semantic information implied in "Just as countless engravings can be made from an original plate on paper" can correctly classify the given question as F.

Reading Section of Korean Language
A total of 285 passages were used, and each passage was given an index according to the admission year and area of the reading section of the CSAT. The index comprises the academic areas of philosophy, social studies, art, science, technology, convergent, and language; the language area has not appeared in the CSAT since 2012. Table 2 details the average number of questions in the augmented training and test datasets for each academic area. The number of passages in each category is indicated in parentheses. For example, the 58 passages in the field of philosophy have, on average, 120.86 sentences in the augmented training dataset, and the average numbers of questions labeled T and F in the test dataset are 6.16 and 5.71, respectively. This is closely related to recent claims in NLP about whether an MRC model truly understands a given passage. Extracting the answer span from a given text tends to show increasingly degraded performance in the presence of distractors unrelated to the passage [32]. This suggests that models rely on superficial information and do not understand the passage in context. Our CSAT corpus can be classified correctly only when the models fully understand the passages and questions. Therefore, by capturing the CSAT in the form of a dataset to which NLP can be applied, it is possible to present methods for solving the MRC task.

Strategy for CSAT Corpus Augmentation
We used the passages and their corresponding questions from the CSAT as training data. Although the average number of sentences per passage increased to 30.76 after the 2015 revised curriculum, the amount of CSAT corpus remains insufficient for practical training. Furthermore, since the test aims to evaluate the reading ability of students within a limited time, naively extending the length of the passages is not an appropriate solution. To overcome this, we applied the four training data augmentation strategies proposed by EDA [14], which are effective even when the dataset is relatively small. We applied SR, RI, RS, and RD, each at a 10% ratio and independently, to augment the sentences fivefold.
WordNet was consulted to replace tokens in a given sentence with synonym tokens or to insert random tokens into the middle of sentences. Because a Korean WordNet is not as universally and publicly available as the English one, RTT is an effective alternative: the training data are split into sentences, translated from Korean to English, and then translated back from English to Korean. The overall data preprocessing procedure is shown in Figure 2.
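The RTT step above can be sketched as follows. The `translate` function here is a tiny dictionary-backed stand-in for a real Ko-En machine translation system (an assumption for illustration; the paper does not name a specific MT engine).

```python
# Stand-in bilingual lookup; a real implementation would call an NMT model.
FAKE_MT = {
    ("ko", "en"): {"안녕하세요": "hello"},
    ("en", "ko"): {"hello": "안녕"},
}

def translate(text, src, tgt):
    # Placeholder MT call: look up the text, or return it unchanged.
    return FAKE_MT[(src, tgt)].get(text, text)

def round_trip(sentence):
    # Ko -> En -> Ko paraphrase that should preserve the sentence's
    # meaning, and therefore its T/F label.
    return translate(translate(sentence, "ko", "en"), "en", "ko")

def augment(sentences):
    # Keep the originals and append round-trip variants that actually differ.
    out = list(sentences)
    out.extend(v for s in sentences if (v := round_trip(s)) != s)
    return out
```

Because RTT paraphrases a whole sentence rather than editing tokens, it complements the four token-level EDA operations.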

KCQA System
We fine-tuned our KCQA model by leveraging Korean PLMs with the optimization process. Figure 3 illustrates the overall model architecture. From the entire training corpus, we fed augmented input sentences into the model. The model is trained so that its classification ([CLS]) token correctly yields the final prediction for the input. In preparation for this process, we randomly composed sentences from multiple passages so that the same number of true-labeled and false-labeled sentences was input during training for each passage. To let the model effectively learn false-labeled sentences, however, sentences with negative expressions were artificially generated based on the passage and were preferentially included in the input. This method was motivated by the fact that most negative expressions in Korean sentences appear in the word-final ending of a token, and that, in the Korean language section of the CSAT, the endings of questions tend to be consistent, which makes it practical to generate negative expressions consistently [33]. Detailed examples are shown in Table 3.

Negation Rule | Original True Sentence | Sentence Negation
V → don't V | (Personal motives appear to conflict with public ones.) | (Personal motives don't appear to conflict with public ones.)
can → can't | (This intersecting mark was introduced in 'process 1' and can be delivered to 'b') | (This intersecting mark was introduced in 'process 1' and can't be delivered to 'b')
does → doesn't | (Aristoxenus does determine the beauty of music based on the sound perceived by the ear) | (Aristoxenus doesn't determine the beauty of music based on the sound perceived by the ear)
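Using the English glosses of Table 3, the rule-based negation can be sketched as below. Note that the actual system operates on Korean word-final endings; the auxiliary rules and the toy verb lexicon here are illustrative assumptions, not the paper's exact rules.

```python
# English glosses of the negation rules in Table 3:
# can -> can't, does -> doesn't, and otherwise V -> don't V.
AUX_RULES = {"can": "can't", "does": "doesn't"}
VERBS = {"appear"}  # toy verb lexicon (assumption for illustration)

def negate(sentence):
    # Apply the first matching rule, scanning left to right.
    words = sentence.split()
    for i, w in enumerate(words):
        if w in AUX_RULES:            # can -> can't, does -> doesn't
            words[i] = AUX_RULES[w]
            return " ".join(words)
        if w in VERBS:                # V -> don't V
            words[i:i + 1] = ["don't", w]
            return " ".join(words)
    return sentence
```

Each generated negation flips a true-labeled sentence into a false-labeled one, supplying the balanced F examples described above.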

Experiments 5.1. Dataset
The 285 passages were split into training and test datasets. To ensure the quality of the trained model, we evaluated using ten-fold cross-validation, i.e., each model was run 10 times, with 90% of the data provided as training input and the remaining 10% as the test set. On average, the original training corpus consisted of 24.30 sentences per passage without augmentation and 119.55 sentences with the augmentation strategies. A more detailed description of the training and test sets is given in Table 4.
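The ten-fold protocol can be sketched as follows, splitting at the passage level (an assumption; the paper does not state the split granularity).

```python
import random

def ten_fold_splits(passages, seed=42):
    # Shuffle once, partition into 10 folds, and yield one
    # (train, test) pair of index lists per fold.
    idx = list(range(len(passages)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for f in folds if f is not folds[k] for i in f]
        yield train, test
```

Each of the 285 passages thus appears in the test set exactly once across the 10 runs.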

Metrics
To evaluate the effectiveness of our proposed method, we utilize the F1 score, which is the harmonic mean of Precision (P) and Recall (R), as a standard indicator to compensate for the weakness of using accuracy alone for evaluation. The F1 score is calculated by Equation (1):

P = |C| / |E_p|,  R = |C| / |E_r|,  F1 = 2 × P × R / (P + R),  (1)

where E_p represents the set of predicted correct answers, E_r denotes the ground-truth answer collection, and C = E_p ∩ E_r is the set of correct answers.
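A worked sketch of this metric over the two answer sets:

```python
def f1_score(predicted: set, ground_truth: set) -> float:
    # C = E_p ∩ E_r: predictions that match the ground truth.
    correct = predicted & ground_truth
    if not correct:
        return 0.0
    p = len(correct) / len(predicted)      # Precision
    r = len(correct) / len(ground_truth)   # Recall
    return 2 * p * r / (p + r)             # harmonic mean of P and R
```

For example, with 3 correct answers out of 4 predictions against 4 ground-truth answers, both P and R are 0.75, so F1 is 0.75.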

Experimental Results
In the experiments, we leveraged KoBERT, KoELECTRA, and KcELECTRA as representative Korean models, and mBERT and XLM-RoBERTa as multilingual models for the baseline architectures. For an accurate performance evaluation, we set consistent fine-tuning hyperparameters: a batch size of 128, a maximum sequence length of 128, a maximum of 40 epochs, a learning rate of 1 × 10^-5, and a weight decay of 0.1. Table 5 presents our quantitative experimental results. Human performance was measured on 30 test students preparing for the CSAT. They were given the CSAT corpus in the same form as the PLMs: they read the passages and determined whether the given information was appropriate or not.
Table 5. Performance of humans and the various data-augmented language models. We utilized five Korean and multilingual models as baseline architectures. All of the Korean models with data augmentation strategies achieved higher F1 scores than the humans. Of the two multilingual augmented PLMs, mBERT and XLM-RoBERTa performed less favorably than the humans; specifically, XLM-RoBERTa's average F1 score was 9.76 points lower. The language models based on the ELECTRA architecture, pre-trained on large amounts of formal and informal written text, showed the most effective philological understanding among the compared models, as verified by their performance. In addition, the results imply that our models can correctly assess questions with respect to the given passages, showing that the educational goal of evaluating reading comprehension in the Korean language section can be achieved.

Analysis
In this section, we provide a detailed analysis of our augmented KCQA model through additional experiments. To further analyze how our data-augmented representations influence model performance, we scrutinized the differences between the P and R of language models trained with and without the augmentation strategies. We addressed the CSAT by exploiting the same model architectures described in the corresponding sections above.
As shown in Table 6, all of the PLMs with data augmentation enable stable learning, as the difference between P and R is reduced from 41.14 to 7.33 on average. In terms of the T values predicted by the model in Equation (1), P (the fraction of predicted T answers that are truly T) and R (the fraction of true T answers that the model predicts as T) complementarily indicate better performance when both are higher. Where augmentation strategies were not applied, the differences between P and R were relatively large because both T and F sentences were predicted as T whenever they were superficially similar to expressions in the given passage. Although overall performance declined after applying augmentation, the reduced gap between P and R indicates a better understanding of the given text. Thus, we demonstrate the effectiveness of augmentation by showing that the gap between the two indicators is mitigated, and we confirm that the augmented PLMs can understand the given questions and passages of the CSAT corpus beyond superficial knowledge.
Figure 4 visualizes, as confusion matrices, the performance of the data-augmented language models on each field constituting the CSAT corpus. Considering that the numerical differences across fields reflect how much the augmentation process deepened the understanding of each passage, it can be interpreted that the degree of inference and understanding required varies by field.
Furthermore, it can be assumed that the philosophy and art areas showed relatively poor performance because those passages require adopting the perspectives of specific artists and philosophers. These highly abstract passages showed only minor performance improvements after data augmentation. Table 7 lists the top and bottom ten passages by F1 score. The two fields of philosophy and technology account for 70% of the bottom ten passages; in the science field, the applicable content was covered in detail, but the improvement was small. The most probable cause is that domain-specific additional knowledge matters, regardless of the degree of reasoning required by the passage itself. According to the experimental results, the multiple-choice questions should be answerable from the given passages alone. In practice, however, more detailed professional knowledge is required, and the limits of fully understanding the context using only the knowledge within the language models are obvious. This supports the argument that students should endeavor to learn related subject knowledge for some fields when studying for the CSAT.

Discussion
In this study, we applied data augmentation to improve the ability of language models to truly understand a given passage. As shown in Figure 4 and discussed in the previous sections, certain deviations exist depending on the domain to which a passage belongs. Furthermore, as shown in Table 7, the different degrees of inference required by each passage led to performance differences across passages. Combining these results shows that current language models, which already achieve outstanding performance on most MRC datasets, including a Korean question answering dataset, have not reached a human level of understanding of a given passage [34].

Conclusions
In this study, we proposed a KCQA model called AI Student, a Korean reading comprehension system, to compare the intrinsic reasoning skills of PLMs on the CSAT with those of humans. We performed data augmentation with four EDA-based strategies and, by exploiting various Korean and multilingual PLMs, verified from a pedagogical perspective whether a question could be determined from the given passages of the CSAT corpus. The results demonstrated that the proposed KCQA could determine the appropriateness of the given questions based on the passages, and that the EDA-based data augmentation process had the potential to improve passage understanding when access to domain-specific knowledge was allowed for a particular subject. Although the non-augmented model exhibited better raw performance, it was difficult to judge whether it truly understood a given passage. Finally, we expect that the performance of models can be evaluated on a high-order educational CSAT corpus from the perspective of pedagogy in the future.

Figure 1. Description of the Korean CSAT. To solve the given questions, a student must be able to determine which of the multiple-choice answer candidates are T or F.

Figure 3. PLM architecture. In this process, augmented sentences are split into positive and negative sentences using the negation rules before being fed into the model. Finally, the prediction value is classified according to whether the answer is T or F.

Figure 4. Visualization of the augmented language models as confusion matrices for each field.

Table 1. Example of a passage and question in the Korean CSAT.

Table 2. Main categories and their average numbers of sentences (marked # in parentheses) in reading.

Table 3. Examples of sentences according to the negation rules. There are three negative expressions: Verb (V) → don't V, can → can't, and does → doesn't. The underlined words in Korean correspond to the bold words in English.

Table 6. Differences between recall and precision; the gaps for our data-augmented representations are relatively insignificant.

Table 7. Comparison of the best and worst scores among passages. The passage index in the left column is organized as year-month-domain, and A and B (e.g., 9A, 6B) indicate that the test was administered twice in that month.