Article

A Hybrid System for Automated Assessment of Korean L2 Writing: Integrating Linguistic Features with LLM

by
Wonjin Hur
1,2 and
Bongjun Ji
3,*
1
Department of Korean Language as a Foreign Language, Pusan National University, Busan 46241, Republic of Korea
2
Department of Korean, Cyber Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea
3
Graduate School of Data Science, Pusan National University, Busan 46241, Republic of Korea
*
Author to whom correspondence should be addressed.
Systems 2025, 13(10), 851; https://doi.org/10.3390/systems13100851
Submission received: 21 August 2025 / Revised: 20 September 2025 / Accepted: 26 September 2025 / Published: 28 September 2025

Abstract

The global expansion of Korean language education has created an urgent need for scalable, objective, and consistent methods for assessing the writing skills of non-native (L2) learners. Traditional manual grading is resource-intensive and prone to subjectivity, while existing Automated Essay Scoring (AES) systems often struggle with the linguistic nuances of Korean and the specific error patterns of L2 writers. This paper introduces a novel hybrid AES system designed specifically for Korean L2 writing. The system integrates two complementary feature sets: (1) a comprehensive suite of conventional linguistic features capturing lexical diversity, syntactic complexity, and readability to assess writing form and (2) a novel semantic relevance feature that evaluates writing content. This semantic feature is derived by calculating the cosine similarity between a student’s essay and an ideal, high-proficiency reference answer generated by a Large Language Model (LLM). Various machine learning models are trained on the Korean Language Learner Corpus from the National Institute of the Korean Language to predict a holistic score on the 6-level Test of Proficiency in Korean (TOPIK) scale. The proposed hybrid system demonstrates superior performance compared to baseline models that rely on either linguistic or semantic features alone. The integration of the LLM-based semantic feature provides a significant improvement in scoring accuracy, more closely aligning the automated assessment with human expert judgments. By systematically combining measures of linguistic form and semantic content, this hybrid approach provides a more holistic and accurate assessment of Korean L2 writing proficiency. The system represents a practical and effective tool for supporting large-scale language education and assessment, aligning with the need for advanced AI-driven educational technology systems.

1. Introduction

The global diffusion of Korean culture and South Korea’s expanding economic integration have substantially increased demand for Korean language education [1,2,3]. This growth has significantly increased the workload of educational institutions and organizations responsible for evaluating language proficiency [2,3,4]. Writing assessment, a fundamental component of assessing linguistic competence, presents a particularly significant challenge [5]. The manual grading of essays is a notoriously labor-intensive, time-consuming, and expensive process [6,7]. Furthermore, it is susceptible to significant variability, including inconsistencies between different raters (inter-rater reliability) and even inconsistencies from the same rater over time (intra-rater reliability) [8,9]. These challenges are magnified in the context of high-stakes, large-scale examinations such as the Test of Proficiency in Korean (TOPIK), where fairness, consistency, and efficiency are paramount [10,11].
To address these systemic bottlenecks, the field of educational technology has increasingly turned to Automated Essay Scoring (AES). AES systems leverage computational methods to analyze and score written text, offering a powerful technological intervention to augment or automate the grading process [12]. The primary promise of AES lies in its potential to deliver immediate, consistent, and scalable feedback, thereby decreasing the burden on human educators and providing learners with timely evaluations to guide their development [13]. As demand continues to rise, robust and reliable AES systems have become essential to the sustainable expansion of high-quality language education [14].
The field of AES has undergone a significant technological evolution since its emergence. The earliest pioneering systems, such as the Project Essay Grader (PEG) developed in the 1960s, operated on the principle of feature engineering [15]. These systems analyzed writing based on a set of hand-crafted, quantifiable proxies for writing quality, often referred to as “proxes” and “trins.” These included surface-level linguistic features such as word and sentence counts, grammatical error frequencies, vocabulary diversity metrics, and readability scores like the Flesch–Kincaid grade level [16]. While effective to a degree, these models were limited in their ability to comprehend the deeper semantic content or logical structure of an essay [17].
The advent of deep learning marked a paradigm shift in AES [18,19,20]. Architectures like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and, more recently, Transformers (e.g., BERT) made manual feature engineering unnecessary [19,21]. These models could learn complex, high-dimensional representations of syntax and semantics directly from raw text, leading to substantial improvements in scoring accuracy [21,22,23]. However, this shift often came at the cost of interpretability, creating “black box” models whose decision-making processes lacked transparency [23,24].
The state of the art has shifted to an integrative paradigm enabled by large language models (LLMs) [25]. AES development has thus followed a U-shaped trajectory: from transparent, engineered features, through a period of end-to-end deep learning with limited interpretability, and now toward a principled synthesis of both [19,25]. LLMs function as modules within hybrid systems for zero-shot evaluation, the generation of explanatory feedback, and, in this study, the construction of high-quality reference texts for fine-grained semantic comparison [26]. This work adopts that paradigm, combining the reliability of engineered linguistic features with the semantic modeling capacity of LLMs. The closest prior hybrid approach, that of Atkinson and Palma [25], utilizes an LLM (GPT-3) primarily as a feature encoder: it generates neural context embeddings from existing, human-graded essays to compute sophisticated features, such as the local semantic coherence between adjacent sentences. In that model, the “gold standard” for discourse is derived from a pre-existing, human-annotated corpus. The challenge of creating reliable benchmarks is particularly acute in the Korean context, where the field of LLM evaluation is still developing. While several benchmarks such as the Open Ko-LLM Leaderboard and HAE-RAE Bench have been introduced, much of the existing evaluation infrastructure relies on direct translation of English benchmarks, an approach that may not fully capture the unique linguistic and cultural nuances of Korean. This highlights a critical need for native evaluation methodologies and resources [27,28]. Our work contributes to this area by proposing and validating a method for generating high-quality, Korean-specific reference texts for assessment, a necessary step toward building more reliable evaluation tools.
While AES for English has been extensively researched, its application to other languages, particularly for non-native (L2) learners, presents a distinct set of challenges [29]. L2 writing is characterized by unique linguistic phenomena that can confound systems trained primarily on native-speaker text [30]. These include linguistic transfer, in which L1 grammar and writing conventions influence L2 production, together with more frequent error types that standard AES models may misinterpret [31].
These general L2 challenges are compounded by the specific linguistic characteristics of the Korean language [32]. Korean encodes grammatical relations primarily via case-marking particles (조사, josa) and verbal inflections, creating difficulties both for learners and for automated analysis [32,33]. Indeed, analyses of learner corpora show that errors in particle usage are among the most frequent mistakes made by Korean L2 learners. Furthermore, the relative scarcity of large-scale, publicly available, and well-annotated corpora of Korean L2 writing has historically hindered research and development in this area, creating a significant gap compared to the resources available for English [34,35]. An entire subfield of Korean NLP has emerged to address this challenge, developing sophisticated methods for the annotation, detection, and correction of these specific error types. A successful AES system for Korean L2 writing must therefore be designed to be robust to these specific linguistic features and error patterns [33,36,37].
This paper proposes a novel AES system designed to address the diverse and interrelated challenges of evaluating Korean L2 writing. The proposed solution is conceived not as a monolithic algorithm but as a holistic, multi-component socio-technical system for application in the educational domain. The system’s design is motivated by the central hypothesis that a more accurate, reliable, and human-like assessment can be achieved by systematically integrating two complementary sets of features, mirroring the dual focus of human graders on both form and content. While conceptually similar in its hybrid nature to previous work [25], our work diverges from this approach in its fundamental application of the LLM. Rather than using the LLM as an encoder of existing texts, we employ it as a generator of the benchmark itself. Our system prompts a state-of-the-art LLM to create a diverse set of twenty new, high-quality “ideal” reference essays for each topic. The resulting semantic feature, therefore, measures global topical relevance against this generated benchmark. This is a distinct methodological choice aimed at evaluating content alignment, particularly for under-resourced languages like Korean, where large, expertly graded L2 corpora are less available.
The first component of our system consists of a suite of traditional linguistic features. These quantifiable metrics, such as lexical diversity, syntactic complexity, and readability, provide a robust and interpretable measure of a writer’s foundational linguistic proficiency. They effectively assess how the learner constructs language.
The second, and novel, component is an LLM-powered semantic feature. This feature directly evaluates the substance of the writing, such as its coherence, topical relevance, and content quality. It is calculated by measuring the semantic similarity between the learner’s essay and an “ideal” reference answer generated by an LLM. This approach uses the LLM as a proxy for the world knowledge and topical understanding that a human expert brings to the grading process. This feature assesses what the learner has written.
By integrating these two distinct feature sets, the proposed hybrid architecture leverages the respective strengths of both established and emerging paradigms: the precision and interpretability of engineered linguistic features and the deep semantic comprehension of LLMs. This systemic approach aims to capture a more complete and nuanced picture of a learner’s writing ability than could be achieved by either approach in isolation. Furthermore, the system is designed for practical application within real-world educational frameworks by utilizing the official Korean Language Learner Corpus from the National Institute of the Korean Language (NIKL) and aligning its 6-point scoring output with the established TOPIK proficiency levels.
The primary contributions of this research are fourfold:
  • The design and implementation of a novel hybrid AES system specifically for assessing the Korean writing of non-native speakers.
  • The introduction of a new feature for semantic evaluation based on calculating the similarity between a student’s essay and a high-quality reference answer generated by an LLM.
  • A comprehensive evaluation of the proposed system using the official Korean Language Learner Corpus from the National Institute of the Korean Language, with performance benchmarked against the 6-level TOPIK proficiency scale.
  • An empirical demonstration that the proposed hybrid, systemic approach significantly outperforms models based on either traditional linguistic or semantic features alone.
The remainder of this paper is organized as follows: Section 2 details the materials and methods, including the dataset, system architecture, and feature engineering processes. Section 3 describes the experimental setup and evaluation protocol. Section 4 presents and analyzes the results of our experiments. Section 5 discusses the interpretation and implications of these findings, as well as the limitations of the study. Finally, Section 6 provides concluding remarks.

2. Materials and Methods

This section details the methodology employed to develop and validate the hybrid AES system. We first describe the dataset used for training and evaluation. Next, we present the overall system architecture, followed by a detailed explanation of the two distinct feature engineering pipelines, one for traditional linguistic features and another for the novel LLM-based semantic feature. Finally, we describe the predictive model used to generate the final proficiency scores.

2.1. Dataset: The Korean Language Learner Corpus

The foundation of this study is the Korean Language Learner Corpus (한국어 학습자 말뭉치), a large-scale linguistic resource compiled and maintained by the NIKL of the Republic of Korea. This corpus is specifically designed for research in second language acquisition and Korean language education, collecting texts produced by non-native learners of Korean. The first phase of its construction, completed in 2022, gathered approximately 6.2 million eojeols (a Korean spacing unit, roughly corresponding to a word or phrase) from learners representing 143 countries and 99 native language backgrounds.
For this study, we utilized a subset of the corpus containing written essays. Each essay in the dataset is accompanied by metadata, including the learner’s proficiency level, which is graded on the 6-level scale corresponding to the TOPIK framework (Levels 1–6) (Table 1).
This expert-assigned proficiency level serves as the ground-truth label and the target variable for our predictive model. The corpus provides a rich and diverse collection of authentic L2 writing, encompassing the typical error patterns, grammatical structures, and lexical choices characteristic of learners at different stages of proficiency [38]. Access to the corpus for research purposes is managed through the NIKL’s “Korean Language Learner Corpus Sharing Center”.
To ensure sufficient coverage and comparability across prompts, we developed and evaluated the AES system using essays drawn from the 20 most frequently occurring topics in the corpus (Table 2). Topic frequencies were computed from the corpus metadata, and essays were filtered accordingly. Focusing on the highest-frequency topics thus yields a dataset that remains representative of common communicative tasks while enabling rigorous, topic-aware analysis of model performance.

2.2. System Architecture

The proposed AES system employs a hybrid, feature-based architecture designed to conduct a holistic assessment of writing quality by evaluating both linguistic form and semantic content. Figure 1 presents an end-to-end architecture that converts raw learner essays into TOPIK-aligned proficiency predictions. The pipeline begins with topic identification from corpus metadata. As mentioned in Section 2.1, essays associated with the 20 most frequent prompts are retained to ensure adequate sample sizes per prompt and comparability across prompts.
For each retained topic, a large language model generates a set of high-quality reference responses subject to length and style constraints. These references function as semantic anchors rather than as training labels. Essays are then normalized through script cleanup, sentence segmentation, tokenization, and metadata validation. Two parallel feature branches are constructed. In the semantic branch, each essay and its topic-specific references are embedded with a sentence-level encoder, and similarity statistics are computed. We use top-k cosine similarities and aggregate descriptors such as mean, maximum, variance, and coverage indicators to summarize content alignment at the topic level. In the conventional branch, interpretable descriptors of form are extracted, including character/word level length measures, lexical diversity indices, syntactic complexity indicators, discourse and cohesion cues, and surface error rates. Feature quality control removes degenerate vectors, caps extreme values, and records missingness flags.
The branches are concatenated after scaling and type handling to form a unified feature vector. Supervised learning is applied to the concatenated representation to predict proficiency levels 1–6. To prevent leakage, topic identifiers are excluded from the model inputs, text normalization is performed prior to any feature computation, and cross-validation is conducted with topic-aware folds so that prompts do not span training and validation partitions. Model selection relies on cross-validation with a bounded hyperparameter search and class-balanced sampling where appropriate. Outputs are evaluated using Quadratic Weighted Kappa (QWK), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). Confusion matrices provide label-space diagnostics, and ablations quantify the marginal contribution of the semantic and conventional branches. This architecture integrates interpretable linguistic evidence with topic-aware semantic evidence derived from LLM-generated references, enabling reliable, content- and form-sensitive assessment while maintaining reproducibility and control over prompt effects.
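The branch-merging step can be illustrated with a minimal sketch in Python; the array names and shapes are illustrative and do not come from the authors’ implementation:
import numpy as np
from sklearn.preprocessing import StandardScaler

def build_feature_matrix(conventional, semantic, scale_for_linear_models=True):
    # conventional: array of shape (n_essays, n_linguistic_features)
    # semantic:     array of shape (n_essays, n_similarity_statistics)
    X = np.hstack([conventional, semantic])        # unified feature vector per essay
    if scale_for_linear_models:
        # In practice the scaler is fit on the training fold only to avoid leakage.
        X = StandardScaler().fit_transform(X)      # zero mean, unit variance (Ridge/SVR)
    return X                                       # tree-based models use the raw scale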

2.3. Feature Engineering 1: Linguistic Form Analysis

This pipeline extracts a set of established linguistic features that serve as proxies for writing proficiency. These features are designed to be robust and interpretable, capturing foundational aspects of a learner’s command of the Korean language; a minimal computation sketch follows the feature list below.
  • Lexical Features: These features measure the richness and diversity of the vocabulary used [5,39].
  • Word and Morpheme Counts: Total number of words (eojeols), unique words, morphemes, and unique morphemes [40,41].
  • Type-Token Ratio (TTR): Calculated as the number of unique morphemes (types) divided by the total number of morphemes (tokens). A higher TTR generally indicates greater lexical diversity [42].
  • Syntactic Complexity Features: These features assess the sophistication of the grammatical structures employed by the learner [43].
  • Sentence-Level Metrics: Total number of sentences and average sentence length (in words and morphemes). Longer, more complex sentences are often characteristic of higher proficiency levels [44].
  • Subordination and Clause Complexity: Metrics derived from syntactic parsing, such as the average depth of the parse tree and the ratio of subordinate clauses to main clauses, are used to quantify grammatical complexity [45].
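A minimal sketch of how several of these features could be computed is given below. It assumes KoNLPy’s Okt analyzer for morpheme segmentation and a simple regex-based sentence splitter; the paper does not specify which morphological analyzer or parser was actually used, and the parse-tree-based complexity metrics are omitted here for brevity.
import re
from konlpy.tag import Okt  # assumed analyzer; any Korean morphological analyzer works similarly

okt = Okt()

def form_features(text):
    # Sentence segmentation (simplified; production code would use a dedicated splitter)
    sentences = [s for s in re.split(r"[.!?\n]+", text) if s.strip()]
    eojeols = text.split()                    # eojeol = whitespace-delimited unit
    morphemes = okt.morphs(text)              # morpheme tokens
    return {
        "n_sentences": len(sentences),
        "n_eojeols": len(eojeols),
        "n_unique_eojeols": len(set(eojeols)),
        "n_morphemes": len(morphemes),
        "n_unique_morphemes": len(set(morphemes)),
        "ttr": len(set(morphemes)) / max(len(morphemes), 1),           # type-token ratio
        "avg_sent_len_eojeols": len(eojeols) / max(len(sentences), 1),
        "avg_sent_len_morphemes": len(morphemes) / max(len(sentences), 1),
    }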

2.4. Feature Engineering II: LLM-Based Semantic Content Analysis

This pipeline introduces a novel feature to evaluate the quality and relevance of the essay’s content, addressing a common limitation of traditional AES systems that focus primarily on form. The process involves two steps: (1) generating benchmark answers and (2) calculating similarity.
First, for each essay prompt in the dataset, a Large Language Model (GPT-4o) is prompted to generate exemplary essays [46]. The prompt instructs the LLM to write well-structured, coherent, and topically rich responses representative of TOPIK Level 6 performance. These LLM-generated texts serve as “golden” or ideal answers, encapsulating the key concepts and logical structure expected of a high-scoring response on that topic. This approach leverages LLMs’ ability to produce high-quality, contextually relevant text that can serve as a reliable benchmark for comparison.
Prompt 1: ideal answer generation
System: You are an expert Korean writing instructor and TOPIK rater.
User: Generate exemplar responses under the following constraints.
[Input]
- Topic: {TOPIC}
[Task]
- Produce 20 distinct Korean essays representative of TOPIK Level 6 for the given topic.
- Ensure clear diversity across essays, minimize lexical and phrasal overlap.
[Form and language]
- Register: formal written Korean suitable for an exam response
- Structure: introduction, development, and conclusion
- Length: ~580 characters on average; ~152 eojeols (±10) on average
- Sentences: preferably 10–14 per essay
- Linguistic requirements: correct spacing and punctuation; appropriate case particles and verbal endings; natural cohesion
- Prohibitions: no bullet lists, no mention of the instructions, no external quotations or sources, no code-switching

[Output format]
- Return a single JSON object whose sole key is the topic string and whose value is an array of 20 essay strings.
- Example: {"{TOPIC}": ["Essay 1", "Essay 2", …, "Essay 20"]}
- Output valid UTF-8 JSON only, with no additional commentary
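The following minimal sketch shows how Prompt 1 might be issued programmatically. It assumes the OpenAI Python client; the decoding parameters and helper names are illustrative rather than taken from the authors’ implementation.
import json
from openai import OpenAI  # assumed client; any chat-completion API would work similarly

client = OpenAI()

SYSTEM_MSG = "You are an expert Korean writing instructor and TOPIK rater."

def generate_reference_essays(topic, user_prompt_template):
    # user_prompt_template holds the [Input]/[Task]/[Form and language]/[Output format] text above
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_MSG},
            {"role": "user", "content": user_prompt_template.replace("{TOPIC}", topic)},
        ],
        response_format={"type": "json_object"},  # request valid JSON output
        temperature=1.0,                          # encourage diversity across the 20 essays
    )
    payload = json.loads(response.choices[0].message.content)
    return payload[topic]  # list of 20 reference essay strings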
After answer generation, semantic similarities are calculated to quantify how well a student’s essay aligns with the ideal answer. Figure 2 illustrates the semantic branch of the hybrid AES. For each topic, both the student essay and 20 reference essays produced by an LLM are embedded with the same sentence-transformer encoder and L2-normalized. Pairwise cosine similarities are then calculated between the student vector and all reference vectors. The resulting similarity distribution is summarized by the top-k values (k = 5 in our main experiments) together with aggregate descriptors such as the maximum, mean, and variance. These statistics quantify topic-level content alignment and are concatenated with conventional linguistic features for supervised learning.
We measure the semantic similarity between the two texts. Both the student’s essay and the LLM-generated reference answer are converted into high-dimensional vector representations (embeddings) using a pre-trained sentence-transformer model (Sentence-BERT) [47]. These models are adept at capturing the semantic meaning of text, going beyond simple keyword matching. The cosine similarity between the two vectors is then calculated as below [48].
$\text{cosine similarity}(x, y) = \dfrac{x^{T} y}{\lVert x \rVert_{2}\,\lVert y \rVert_{2}}$
which equals the cosine of the angle between $x$ and $y$. The resulting score, ranging from −1 to 1 (but typically 0 to 1 for this task), serves as the semantic feature. A score closer to 1 indicates a high degree of semantic overlap and topical relevance between the student’s writing and the high-proficiency benchmark.
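A minimal sketch of this semantic branch, assuming the sentence-transformers library (the multilingual checkpoint named here is illustrative; the paper specifies Sentence-BERT but not a particular model):
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative checkpoint

def semantic_features(student_essay, reference_essays, k=5):
    # Embed the essay and its topic-specific references with the same encoder, L2-normalized.
    vecs = encoder.encode([student_essay] + list(reference_essays), normalize_embeddings=True)
    student, refs = vecs[0], vecs[1:]
    sims = refs @ student                      # cosine similarities (vectors are unit-norm)
    top_k = np.sort(sims)[::-1][:k]
    return {
        "sim_max": float(sims.max()),
        "sim_mean": float(sims.mean()),
        "sim_var": float(sims.var()),
        **{f"sim_top{i + 1}": float(v) for i, v in enumerate(top_k)},
    }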

2.5. Predictive Modeling

We formulate automated essay scoring as an ordinal prediction task with target levels 1–6 aligned to the TOPIK scale. Three feature spaces are considered. The first is a conventional set capturing surface form, lexical diversity, syntactic complexity, cohesion cues, and error indicators. The second is a semantic set derived from similarity statistics between each student essay and a bank of twenty topic-specific reference responses. The third is a combined set obtained by concatenating the two after scaling and type handling.
Five supervised learners are evaluated to represent linear, kernel, and tree-based families, which include L2-regularized linear regression (Ridge), support vector regression with an RBF kernel (SVR), random forest, histogram-based gradient boosting (HistGB), and gradient-boosted decision trees (XGBoost). Models are trained in regression mode to respect the ordinal structure of the labels.
Ridge penalizes the squared magnitude of coefficients to stabilize estimation and reduce variance. It also mitigates multicollinearity and yields well-conditioned solutions in high-dimensional feature spaces [49].
SVR optimizes an ε-insensitive loss with margin maximization, producing sparse solutions based on support vectors. Nonlinear relations are modeled via kernels, with C, ε, and γ governing regularization and function complexity [50,51].
Random Forest constructs an ensemble of decision trees using bootstrap sampling and random feature subsetting at each split. Aggregating decorrelated trees captures nonlinear interactions and provides robustness to overfitting [52].
HistGB discretizes continuous features into histogram bins and fits additive decision trees via gradient boosting. Generalization and efficiency are controlled by learning rate, tree depth, regularization, and early stopping [53,54].
XGBoost implements a regularized gradient-boosting framework with second-order optimization and sparsity-aware split finding. Shrinkage and column/row subsampling, together with L1/L2 penalties, provide effective control of overfitting and computational cost [55].
At inference, continuous outputs are clipped to [1,6] and rounded to the nearest integer for label-based diagnostics. For Ridge and SVR, numeric predictors are standardized to zero mean and unit variance, and tree-based models are fit on the raw scale. Feature quality control removes degenerate vectors, caps extreme values, and records missingness indicators; median imputation is applied where required.
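The inference-time post-processing described above can be summarized in a short sketch (Python; the XGBoost hyperparameters shown are placeholders, not the values selected by the tuning procedure described next):
import numpy as np
from xgboost import XGBRegressor  # one of the five learners evaluated

def fit_and_predict_levels(X_train, y_train, X_test):
    # Hyperparameters are illustrative; actual values come from the nested search below.
    model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6,
                         subsample=0.8, colsample_bytree=0.8, random_state=42)
    model.fit(X_train, y_train)               # y_train holds integer levels 1-6
    raw = model.predict(X_test)               # continuous regression outputs
    levels = np.clip(np.rint(raw), 1, 6).astype(int)  # clip to [1, 6], round to nearest level
    return raw, levels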
Model selection uses nested, topic-aware cross-validation to prevent prompt leakage. The outer evaluation employs grouped folds with topics as grouping units and stratification on proficiency level within each fold. The inner loop performs hyperparameter tuning by grid or randomized search restricted to bounded ranges. Early stopping is used for gradient-boosting models based on a validation split drawn from the training fold. Where supported, class weights inversely proportional to label frequencies are enabled as a robustness check. Random seeds are fixed for reproducibility.

2.6. Performance Measurement

We employed a 5-fold topic-aware grouped cross-validation to ensure that essays written for the same prompt did not appear in both the training and test sets of any fold. Within each training fold, a further 90/10 split was used to create a validation set for hyperparameter tuning and early stopping in the gradient boosting models. We evaluate the model’s performance using three standard metrics in AES research, which together provide a comprehensive view of prediction performance. Because the target is an ordinal six-level proficiency grade, performance is quantified using (i) Quadratic Weighted Kappa (QWK), a chance-corrected coefficient of agreement adjusted for ordinal categories, (ii) MAE, and (iii) RMSE.
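A minimal sketch of this evaluation protocol, assuming scikit-learn’s GroupKFold with essay topics as the grouping variable (the within-fold stratification on proficiency level described in Section 2.5 is omitted here for brevity):
from sklearn.model_selection import GroupKFold, train_test_split

def topic_aware_splits(X, y, topics, n_splits=5, seed=42):
    # Topics are the grouping units, so no prompt appears in both training and test sets.
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(X, y, groups=topics):
        # Further 90/10 split of the training fold for tuning and early stopping.
        tr_idx, val_idx = train_test_split(train_idx, test_size=0.10, random_state=seed)
        yield tr_idx, val_idx, test_idx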

2.6.1. Quadratic Weighted Kappa (QWK)

Quadratic Weighted Kappa (QWK) is a statistical measure that evaluates the level of agreement between two raters (e.g., a human and an AES model) on an ordinal scale [56]. It is a variant of Cohen’s Kappa that is specifically designed for ordered categories, making it ideal for essay scoring. QWK not only corrects for agreement that could occur by chance but also penalizes larger disagreements more heavily than smaller ones. For example, a model predicting a ‘1’ for an essay that a human scored as a ‘6’ is penalized more than a model predicting a ‘5’ for the same essay. The formula for QWK is defined as below.
$\mathrm{QWK} = 1 - \dfrac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}$
where $O_{i,j}$ is the observed number of essays that received score $i$ from the human rater and score $j$ from the model, $E_{i,j}$ is the expected number of such agreements by chance, calculated from the marginal totals of the human and model scores, and $w_{i,j}$ is the weight matrix that defines the penalty for disagreement. For QWK, the weights are calculated quadratically based on the distance between the scores:
$w_{i,j} = \dfrac{(i - j)^{2}}{(N - 1)^{2}}$
where $N$ is the number of possible score categories. A QWK score of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.
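In practice QWK can be computed directly from the paired label vectors; a minimal sketch using scikit-learn (the library choice is ours and is not specified in the paper):
from sklearn.metrics import cohen_kappa_score

def qwk(y_human, y_model):
    # Quadratic weights penalize large disagreements more heavily than small ones.
    return cohen_kappa_score(y_human, y_model, weights="quadratic")

# Example: qwk([1, 2, 5, 6], [1, 2, 5, 6]) -> 1.0 (perfect agreement)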

2.6.2. Mean Absolute Error (MAE)

Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction [57]. It is calculated as the average of the absolute differences between the predicted scores and the actual scores. MAE is easily interpretable because it is in the same units as the essay scores. For instance, an MAE of 0.5 means the model’s predictions are, on average, half a point away from the human scores. Because it does not square the errors, MAE is less sensitive to large, infrequent errors (outliers) than RMSE. The formula for MAE is:
$\mathrm{MAE} = \dfrac{1}{N} \sum_{n=1}^{N} \lvert \hat{y}_{n} - y_{n} \rvert,$
where $N$ is the total number of essays, $y_{n}$ is the true human-assigned score for the $n$-th essay, and $\hat{y}_{n}$ is the score predicted by the model for the $n$-th essay.

2.6.3. Root Mean Squared Error (RMSE)

The RMSE is the square root of the average of the squared differences between the predicted and actual scores [58]. Like MAE, RMSE measures the average error in the same units as the scores. However, the key difference is that RMSE squares the errors before averaging them. Therefore, large errors are given relatively high weight. A four-point error has 16 times more impact on the overall error than a one-point error. Therefore, RMSE is particularly sensitive to outliers and is a useful metric for penalizing models that produce large and severe scoring errors. The formula for RMSE is:
$\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{n=1}^{N} \left( \hat{y}_{n} - y_{n} \right)^{2}}$
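Both error metrics reduce to a few lines of array arithmetic; a minimal sketch with NumPy:
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

# e.g., predictions that are one level off for half the essays and exact otherwise
# give MAE = 0.5, matching the interpretation above.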

2.6.4. Statistical Significance Testing

To determine whether the observed differences in model performance were statistically significant, we conducted pairwise comparisons of the Quadratic Weighted Kappa (QWK) scores. We employed the DeLong test, a non-parametric method for comparing the Area Under the Curve (AUC) of two correlated Receiver Operating Characteristic (ROC) curves, which can be adapted for ordinal agreement metrics like QWK. The null hypothesis was that the two models have equal performance, and we used a significance level of α = 0.05 to evaluate our claims of model superiority.

3. Results

This section presents the empirical results of our experiments, evaluating the performance of various regression models using three distinct feature sets: conventional linguistic features, semantic similarity scores, and a combination of both. The models were assessed using QWK, MAE, and RMSE to provide a comprehensive view of their scoring accuracy.

3.1. Overall Model Performance

The primary experiment involved training and evaluating five different prediction models (Ridge, SVR, RandomForest, HistGB, and XGBoost) on the three feature configurations. The overall performance of each model across the entire test set is summarized in Table 3. The results clearly demonstrate significant performance differences based on both the selection of model and the feature set employed.
The results clearly indicate that the hybrid approach yields superior performance. The models trained exclusively on the semantic similarity feature performed poorly, with QWK scores near zero, suggesting that content similarity alone is insufficient for nuanced scoring. The models using conventional linguistic features established a strong baseline, with the XGBoost model achieving a QWK of 0.782. However, the highest accuracy was consistently achieved with the combined feature set (Figure 3).
Among the algorithms, the tree-based ensemble models (RandomForest, HistGB, and XGBoost) outperformed the Ridge and SVR models. The best result was obtained by the XGBoost model using the combined feature set, which achieved a QWK of 0.801, an MAE of 0.501, and an RMSE of 0.839. This result validates our central hypothesis that integrating traditional linguistic analysis with LLM-based semantic content evaluation provides a more robust and accurate scoring system. Crucially, pairwise statistical tests confirmed that the QWK score of the hybrid XGBoost model was significantly superior to all other model configurations and baselines (DeLong’s test, p < 0.05). To aid interpretability, we also report label-space accuracies for the best configuration (XGB + combined). Exact-match accuracy was 0.587, while 92.4% of predictions fell within ±1 level of the human score on the 6-point TOPIK scale. Class-wise exact-match rates decreased monotonically with proficiency level: 0.754, 0.587, 0.432, 0.431, 0.337, 0.241 (Figure 4), a pattern consistent with the increasing lexical and syntactic variability of higher-level responses. Pearson and Spearman correlations between predicted and reference scores were 0.807 and 0.773, respectively.
The model demonstrates high accuracy for essays at the lower end of the score scale. For essays with a true score of 1, the model is correct in approximately 75% of cases (2154 out of 2858). Performance remains strong for score 2, with an accuracy of around 59% (1265 out of 2156). At these levels, writing quality is often distinguished by clear, surface-level linguistic markers such as high frequencies of grammatical errors, simple syntactic structures, and limited lexical diversity. The conventional features in our model are well-suited to detect these strong negative signals, allowing for reliable classification.
The model’s predictive power diminishes significantly for mid-range and high-end scores. Accuracy drops to approximately 43% for scores 3 and 4, further to 33% for score 5, and to just 24% for score 6. This decline can be explained by three primary factors:
As writers become more proficient, the utility of surface-level features as differentiators decreases. Essays scored 5 and 6 are both likely to exhibit low error rates, high lexical diversity, and complex syntax. The distinction between these scores lies in more abstract qualities such as rhetorical sophistication, argumentative nuance, and originality of thought. The current feature set is not designed to capture these subtle, higher-order attributes, resulting in a “ceiling effect” where the model can identify an essay as “good” but cannot reliably distinguish it from an “excellent” one.
For highly proficient writers, the semantic similarity score may become a confounding factor. An exceptional essay (score 6) might present a more creative, complex, or unconventional argument that deviates semantically from the standardized, LLM-generated reference answer. In contrast, a competent but less exceptional essay (score 4 or 5) might adhere more closely to a conventional line of reasoning, thereby achieving a higher similarity score. This can paradoxically cause the model to penalize originality and underestimate the quality of the most advanced essays.
The confusion matrix reveals a significant imbalance in the dataset. There are far more examples of low-scoring essays (2858 instances of score 1) than high-scoring ones (300 instances of score 5 and 328 of score 6). This skew means the model has substantially less data from which to learn the defining characteristics of top-tier writing. Consequently, the model is biased towards the more heavily represented lower and middle scores and exhibits a strong tendency to underestimate the quality of high-performing essays, as seen by the large number of true score 6 essays being misclassified as 5 or 4.

3.2. Comparison with Direct LLM Scoring Baselines

A critical question for any hybrid system is whether its architectural complexity is justified when compared to direct scoring by a state-of-the-art Large Language Model. To address this, we benchmarked our system against direct scoring by GPT-4o under both zero-shot and few-shot conditions. In the zero-shot setting, the model was provided only with the essay and the official TOPIK scoring rubric. In the few-shot setting, the prompt was augmented with five representative essay examples for each of the six TOPIK levels.
The results, presented in Table 4, are striking and decisively demonstrate the necessity of our hybrid architecture. Direct scoring by GPT-4o, even with few-shot prompting, performed very poorly, achieving a QWK of just 0.24. The high error rates (MAE of 1.28 and RMSE of 1.62) indicate that the model’s predictions were, on average, incorrect by more than a full proficiency level. This performance is substantially worse than even our baseline models using only conventional linguistic features.
In contrast, our hybrid XGBoost model achieved a QWK of 0.80, with MAE and RMSE values of 0.50 and 0.84, respectively. This represents a performance gap of more than 230% on the QWK metric. These findings provide clear validation for our hybrid approach. They suggest that while a powerful LLM possesses general language understanding, it struggles to reliably apply the nuanced, multi-level criteria of the TOPIK rubric in a zero- or few-shot setting. The integration of explicit linguistic features acts as an essential structural guardrail, allowing the model to ground its semantic understanding in the formal properties of the text. The hybrid architecture is therefore not merely beneficial; it is indispensable for achieving accurate and reliable scoring.

3.3. Model Performance by Topic

To understand the model’s robustness and limitations, we analyzed the performance of our best-performing configuration (XGBoost with combined features) on a per-topic basis. Table 5 presents the evaluation metrics for each of the 20 distinct essay topics in the test set, revealing a significant variance in performance that correlates strongly with the nature of the prompt.
These performance differences show a clear pattern. The model excels on concrete, descriptive, and narrative topics, but struggles significantly on abstract, argumentative, or highly personal topics.
The model achieved the highest accuracy on topics such as “Self-Introduction” (QWK 0.797) and “Comprehensive Narrative” (QWK 0.660). These topics typically lead to responses that use predictable structures and a relatively limited range of topics and vocabulary. For example, self-introductions are likely to discuss family, hobbies, and personal goals. This thematic consistency makes both existing linguistic features and semantic similarity scores highly effective. Linguistic features can reliably model fluency and complexity in these familiar contexts, while semantic similarity scores accurately measure the relevance of content to the generated reference responses, which capture the common elements of the essays.
Conversely, the model’s performance deteriorated sharply on abstract and argumentative topics, with QWK scores near zero for “Using the Internet Correctly” (0.007) and “What Matters Most in Life” (0.029). This failure can be attributed to two key limitations of the feature set when applied to these prompts:
Failure of the Semantic Similarity Feature: For abstract topics, there is no single “right” answer. A high-quality essay can present one of several valid and logical arguments. The reference answers generated by the LLM represent only twenty possible perspectives. Consequently, a student’s creative, nuanced, and unconventional argument, while potentially excellent, may be semantically disconnected from the reference texts. In this case, the semantic similarity score becomes a misleading feature, penalizing originality and rewarding adherence to a single, arbitrary criterion. The model incorrectly assumes that proximity to the LLM-generated answers is a measure of quality, which is not true for complex, open-ended questions.
Limitations of Conventional Linguistic Features: The quality of an argumentative or philosophical essay is determined by factors such as logical coherence, depth of insight, and strength of reasoning. The conventional linguistic features used in this model (e.g., sentence length, lexical diversity, error rate) fail to adequately reflect this deeper structure. An essay may be grammatically perfect and syntactically complex, yet logically flawed and superficial. Conversely, a profound argument may be expressed in simple and clear language. Because the model relies on these surface features, it fails to distinguish between well-structured and poorly structured arguments, and its scores for topics where argumentation is the primary criterion of quality are unreliable.

4. Discussion

This section interprets the empirical results presented in Section 3, connecting them to the working hypotheses and situating them within the broader landscape of AES research. The performance of the hybrid model is discussed, its strengths and weaknesses across proficiency levels and topic types are analyzed, and the wider implications for Korean L2 writing assessment and pedagogy are discussed.

4.1. Validation of the Hybrid System

The central hypothesis of this study, that integrating linguistic form with semantic content would yield a more accurate assessment, is strongly validated by our findings. The results clearly show that the hybrid model significantly outperforms models trained exclusively on either conventional or semantic features. This outcome empirically demonstrates that neither form nor content is sufficient in isolation; rather, their synergistic combination is essential for a robust evaluation that approximates human judgment.
This finding aligns with the “U-shaped” trajectory of AES development, where the field is returning to a sophisticated integration of engineered features with LLMs after a period dominated by opaque, end-to-end deep learning models. Our work contributes to a growing body of evidence showing that hybrid models, which combine the strengths of interpretable linguistic features with the deep semantic understanding of modern neural architectures, represent the current state of the art.
The catastrophic failure of the semantic-only model is a profoundly important diagnostic result. It reveals that while an LLM can generate high-quality reference texts, raw semantic similarity is an inadequate proxy for writing quality. It effectively measures topical relevance but fails to capture the quality of the discourse. A student could simply list relevant keywords in grammatically incoherent sentences and still achieve a high similarity score, showing that the semantic feature in isolation is blind to linguistic proficiency. The success of the hybrid model, therefore, stems from the linguistic features acting as a crucial “gatekeeper.” They establish a baseline of formal competence, after which the semantic feature can contribute a meaningful signal about content relevance.

4.2. Interpreting Model Performance Across Proficiency Levels

As detailed in the Results Section, our model exhibits clear limitations, with a performance gradient across proficiency. This pattern is consistent with the ceiling effect reported in AES research. Lower proficiency is characterized by frequent, easily detectable errors (e.g., particle misuse, simple syntax, restricted vocabulary), which our conventional linguistic features capture effectively. As proficiency increases, these negative cues become sparse. Distinguishing a very good essay (Level 5) from an excellent one (Level 6) depends more on higher-order aspects such as discourse organization, argument development, originality, and creativity, which are not fully represented in the current feature set.
Severe class imbalance in the NIKL corpus (2858 Level 1 essays vs. 328 Level 6 essays) is not merely a technical inconvenience. It is a primary causal factor that compounds the ceiling effect. Supervised models learn by discovering patterns associated with labels, and the scarcity of Level 6 examples prevents the learner from forming a reliable statistical profile of top-tier writing. In the current feature space, Level 5 and Level 6 essays tend to appear similar because both show low error rates and high lexical diversity, while the true discriminators lie in dimensions not captured by our features. The combination of a limited feature representation for advanced skills and a sparse sample of high-level essays produces a compounding limitation. The model lacks both the tools (features) and the experience (data) needed to make fine-grained distinctions at the top of the scale. This issue reflects a broader challenge in L2 AES, namely, the difficulty of assembling large, well-annotated corpora of high-proficiency learner writing, and it exemplifies the imbalanced learning problem in which models are biased toward the majority class.

4.3. Implications for Korean L2 Writing Assessment and Pedagogy

This study makes a significant contribution by developing and validating one of the first hybrid AES systems specifically for Korean L2 writing. It addresses a notable gap in a field historically dominated by English and tackles the unique challenges posed by Korean’s agglutinative morphology. By training on the official NIKL corpus and aligning with the TOPIK scale, our system provides a valuable and relevant benchmark for future research in this area.
The system’s topic-dependent performance profile dictates its appropriate pedagogical use. It is a highly promising tool for providing formative feedback to beginner and intermediate learners (Levels 1–4) engaged in descriptive and narrative writing tasks. This reframes the system’s contribution more precisely. It serves as a strong baseline for foundational writing where topical relevance is a key signal of quality. The immediate, consistent feedback on linguistic form can help learners notice errors, facilitate revision, and promote autonomy, thereby alleviating some of the grading burden on instructors. Conversely, the system is not suitable for the high-stakes evaluation of advanced, open-ended argumentative writing, where its reliance on topical similarity can penalize originality.
However, the system’s limitations on abstract tasks indicate a necessary reframing for the field. The objective should not be to construct a single all-purpose AES that replaces human graders across contexts. A more viable direction is a toolkit of specialized systems. Our model constitutes a successful prototype within this toolkit, functioning as a “foundational writing tutor” for Korean L2 learners. It performs strongly within this scope but is not appropriate for high-stakes evaluation of advanced argumentative writing. Such targeted deployment supports responsible integration of AI in education and positions these systems as aids rather than substitutes for human educators.

5. Limitations and Future Work

This study demonstrates the promise of a hybrid approach to AES for Korean L2 writing, but it also has limitations that motivate a clear agenda for future work. This section transparently discusses the current model’s constraints and outlines several promising avenues for subsequent research.

5.1. Limitations

While our hybrid system represents a significant advance, its performance and methodology are subject to several important limitations that must be acknowledged.
A primary limitation, evident from our results, is the model’s diminished accuracy when evaluating high-level writing (TOPIK Levels 5–6). This “ceiling effect” arises because the current feature set, while effective at identifying foundational linguistic competence, is not equipped to capture the higher-order qualities that distinguish “very good” writing from “excellent” writing. These qualities such as rhetorical sophistication, argumentative nuance, logical depth, and originality are not easily quantified by metrics like sentence length or lexical diversity. The model can reliably identify well-formed prose but struggles to assess the quality of the ideas within it.
The use of cosine similarity with a finite set of LLM-generated reference essays has inherent flaws. Although we generate twenty distinct essays per prompt to create a broad semantic space, this approach primarily measures topical relevance, the degree of overlap in vocabulary and concepts, rather than the logical coherence or quality of an argument. Consequently, an exceptionally creative or unconventional essay that is well-argued but deviates from the semantic space of the reference texts may be unfairly penalized. The semantic feature, in its current form, is a proxy for topicality, not a true measure of content quality.
The performance of the model is constrained by the skewed data distribution within the NIKL corpus, which contains far fewer examples of high-proficiency essays compared to beginner-level ones. This severe class imbalance means the model has insufficient data from which to learn the defining statistical characteristics of top-tier writing, thereby exacerbating the ceiling effect and biasing its predictions toward the more heavily represented lower and middle proficiency levels.
Finally, the current study does not include a fairness audit to assess whether the model exhibits performance disparities based on learners’ L1 backgrounds. Given that the NIKL corpus includes writers from 99 native language backgrounds, it is possible that linguistic transfer from different native languages could influence the model’s predictions. This represents an unexamined potential for bias that must be acknowledged as a limitation.

5.2. Future Works

The limitations identified above provide a clear and ambitious roadmap for future research. We have identified four key areas for development.
The most critical next step is to move beyond simple semantic similarity. A promising direction is the integration of features derived from dynamic entity embeddings and knowledge graphs to model the logical structure of an essay by representing its claims and evidence as a graph. This could be complemented by incorporating argument-mining techniques to detect, classify, and evaluate claims, premises, and their logical relations. Such a framework would represent a significant leap from measuring what an essay is about to how well it argues its point.
To better capture the specific challenges faced by Korean L2 learners, the linguistic feature set should be expanded to include more sophisticated, error-specific features. Given that particle errors are among the most frequent and persistent mistakes for learners of Korean, integrating a dedicated module for Korean particle error detection would be a particularly high-impact addition. Additionally, incorporating computational measures of discourse coherence, such as entity-grid models, would help capture the higher-order skills that define advanced writing.
To mitigate the problem of data imbalance at higher proficiency levels, future research should explore advanced data augmentation techniques. This could include methods such as back-translation or the controlled, LLM-based synthesis of new, high-quality L2 essays that exhibit the characteristics of TOPIK Levels 5 and 6. Furthermore, adapting feature-space synthesizers such as SMOTE or ADASYN to the hybrid representation could help create more balanced training sets.
A responsible AES system must be both fair and transparent. A crucial direction for future work is to conduct a fairness audit to investigate whether the model exhibits performance disparities across learners from different L1 backgrounds, ensuring the system is equitable for all users. Additionally, to enhance the system’s pedagogical value, deeper interpretability analyses using methods like SHAP or LIME should be conducted. Generating instance-level explanations that link a score to specific textual evidence would move the system from a simple grading tool to a powerful diagnostic and feedback generation engine for learners.

6. Conclusions

The rapid global growth of Korean language education has created an urgent demand for scalable, objective, and consistent writing assessment tools. This study responds by designing, implementing, and evaluating a hybrid Automated Essay Scoring system for Korean L2 learners. The approach integrates two complementary feature sets, a robust set of conventional linguistic indicators that capture writing form and a semantic feature derived from the cosine similarity between a student essay and an LLM-generated reference response that evaluates content. Experiments on the official NIKL Korean Language Learner Corpus yield four principal findings. First, the proposed hybrid system achieves QWK = 0.80 and significantly outperforms baselines that use only linguistic or only semantic features, supporting the hypothesis that accurate assessment requires joint evidence on form and content. Second, the system attains high accuracy for beginner and intermediate levels but shows a ceiling effect at the highest level, a limitation exacerbated by severe class imbalance. Third, performance is highly sensitive to the essay prompt and is stronger on concrete, descriptive topics than on abstract, argumentative topics, where reliance on a single reference response can penalize originality and creativity. Finally, the evidence supports the use of the system for formative assessment in foundational L2 writing contexts.
This study makes four contributions. It presents a novel hybrid AES architecture for Korean L2 writers, introduces the use of large language models to generate a semantic benchmark, provides a comprehensive evaluation against the official TOPIK proficiency scale, and empirically demonstrates the advantage of a hybrid approach. Although the current model has limitations, particularly in assessing argumentation and addressing equity, the work constitutes a practical advance and establishes a foundation for future research on more nuanced and educationally useful AI-based tools for Korean-language learners worldwide.

Author Contributions

Conceptualization, W.H.; methodology, W.H.; software, B.J.; validation, W.H. and B.J.; data curation, W.H.; writing—original draft preparation, W.H.; writing—review and editing, B.J.; visualization, B.J.; supervision, B.J.; project administration, B.J.; funding acquisition, W.H. and B.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00242528).

Data Availability Statement

This study utilized publicly available data provided by the National Institute of the Korean Language (국립국어원). The dataset, known as the Learner Corpus (학습자 말뭉치), can be accessed through the official repository of the National Institute of the Korean Language.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, H.S.; Kelly, K. Investigating Korean language learning motivations: Trend and implications. Korean Lang. Am. 2023, 27, 152–181.
  2. Ding, X.; Wu, Y. Determinants of international Korean language promotion: A cross-country analysis. Heliyon 2023, 9, e21078.
  3. Han, Y.; Dewaele, J.M.; Kiaer, J. Does the attractiveness of K-culture shape the enjoyment of foreign language learners of Korean? Int. J. Appl. Linguist. 2025, 35, 486–502.
  4. Shin, D.; Park, S.; Cho, E. A review study on discourse-analytical approaches to language testing policy in the South Korean context. Lang. Test. Asia 2023, 13, 44.
  5. Weigle, S.C. Assessing Writing; Ernst Klett Sprachen: Stuttgart, Germany, 2002.
  6. Shermis, M.D.; Burstein, J. (Eds.) Handbook of Automated Essay Evaluation: Current Applications and New Directions; Routledge: New York, NY, USA, 2013.
  7. Li, S.; Ng, V. Automated essay scoring: A reflection on the state of the art. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 17876–17888.
  8. Kayapınar, U. Measuring essay assessment: Intra-rater and inter-rater reliability. Eurasian J. Educ. Res. 2014, 57, 113–136.
  9. Eckes, T. Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Lang. Assess. Q. 2005, 2, 197–221.
  10. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing; American Psychological Association: Washington, DC, USA, 1985.
  11. Im, G.H.; Shin, D.; Park, S. Suggesting a policy-driven approach to validation in the context of the Test of Proficiency in Korean (TOPIK). Curr. Issues Lang. Plan. 2022, 23, 214–232.
  12. Lim, C.T.; Bong, C.H.; Wong, W.S.; Lee, N.K. A comprehensive review of automated essay scoring (AES) research and development. Pertanika J. Sci. Technol. 2021, 29, 1875–1899.
  13. Bai, J.Y.; Zawacki-Richter, O.; Bozkurt, A.; Lee, K.; Fanguy, M.; Cefa Sari, B.; Marín, V.I. Automated essay scoring (AES) systems: Opportunities and challenges for open and distance education. In Proceedings of the Tenth Pan-Commonwealth Forum on Open Learning (PCF10), Calgary, AB, Canada, 14–16 September 2022.
  14. Watson, C. How Is an Academic English Skills (AES) Module, Delivered in a UK Higher Education (HE) Setting, Understood, and Experienced by Chinese Students? Ph.D. Thesis, University of Glasgow, Glasgow, UK, 2024.
  15. Page, E.B. The imminence of… grading essays by computer. Phi Delta Kappan 1966, 47, 238–243.
  16. Katznelson, J. A Computer Program for Assessing Readability; HEL Technical Memorandum HELTM480; U.S. Army Human Engineering Laboratory: Aberdeen Proving Ground, MD, USA, 1980.
  17. Barrett, C.M. Automated Essay Evaluation and the Computational Paradigm: Machine Scoring Enters the Classroom. Ph.D. Thesis, University of Rhode Island, Kingston, RI, USA, 2015.
  18. Ke, Z.; Ng, V. Automated essay scoring: A survey of the state of the art. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 6300–6308.
  19. Taghipour, K.; Ng, H.T. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1882–1891.
  20. Alikaniotis, D.; Yannakoudakis, H.; Rei, M. Automatic text scoring using neural networks. arXiv 2016, arXiv:1606.04289.
  21. Wang, Y.; Wang, C.; Li, R.; Lin, H. On the use of BERT for automated essay scoring: Joint learning of multi-scale essay representation. arXiv 2022, arXiv:2205.03835.
  22. Rodriguez, P.U.; Jafari, A.; Ormerod, C.M. Language models and automated essay scoring. arXiv 2019, arXiv:1909.09482.
  23. Kumar, V.; Boulanger, D. Explainable automated essay scoring: Deep learning really has pedagogical value. Front. Educ. 2020, 5, 572367.
  24. Misgna, H.; On, B.W.; Lee, I.; Choi, G.S. A survey on deep learning-based automated essay scoring and feedback generation. Artif. Intell. Rev. 2024, 58, 36.
  25. Atkinson, J.; Palma, D. An LLM-based hybrid approach for enhanced automated essay scoring. Sci. Rep. 2025, 15, 14551.
  26. Mansour, W.; Albatarni, S.; Eltanbouly, S.; Elsayed, T. Can large language models automatically score proficiency of written essays? arXiv 2024, arXiv:2403.06149.
  27. Park, C.; Kim, H.; Kim, D.; Cho, S.; Kim, S.; Lee, S.; Lee, H. Open Ko-LLM Leaderboard: Evaluating large language models in Korean with Ko-H5 benchmark. arXiv 2024, arXiv:2405.20574.
  28. Kim, H.; Kim, D.; Kim, J.; Lee, S.; Kim, Y.; Park, C. Open Ko-LLM Leaderboard2: Bridging foundational and practical evaluation for Korean LLMs. arXiv 2024, arXiv:2410.12445.
  29. Lim, K.-T.; Song, J.; Carbonell, J.; Poibeau, T. Neural automated writing evaluation for Korean L2 writing. Nat. Lang. Eng. 2023, 29, 1341–1363.
  30. Silva, T. Toward an understanding of the distinct nature of L2 writing: The ESL research and its implications. TESOL Q. 1993, 27, 657–677.
  31. Odlin, T. Cross-linguistic influence. In The Handbook of Second Language Acquisition; Doughty, C.J., Long, M.H., Eds.; Blackwell: Malden, MA, USA, 2003; pp. 436–486.
  32. Yeon, J.; Brown, L. Korean: A Comprehensive Grammar; Routledge: London, UK, 2013.
  33. Lee, S.H.; Dickinson, M.; Israel, R. Developing learner corpus annotation for Korean particle errors. In Proceedings of the Sixth Linguistic Annotation Workshop, Jeju, Republic of Korea, 12–13 July 2012; pp. 129–133.
  34. Chun, J.; Kim, M.H. Corpus-informed application based on Korean Learners’ Corpus: Substitution errors of topic and nominative markers. Asian-Pac. J. Second Foreign Lang. Educ. 2021, 6, 13.
  35. Yoon, S.; Park, S.; Kim, G.; Cho, J.; Park, K.; Kim, G.; Seo, M.; Oh, A. Towards standardizing Korean grammatical error correction: Datasets and annotation. arXiv 2022, arXiv:2210.14389.
  36. Israel, R.; Dickinson, M.; Lee, S.H. Detecting and correcting learner Korean particle omission errors. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–18 October 2013; pp. 1419–1427.
  37. Kim, T.; Jeong, S.; Song, Y. KoGEC: Korean grammatical error correction with pre-trained translation models. arXiv 2025, arXiv:2506.11432.
  38. Hur, W. A Study on the Development of AI-Based Model for Automatic Assessment of Korean Learners’ Speaking Proficiency. Ph.D. Thesis, Hankuk University of Foreign Studies, Seoul, Republic of Korea, 2024.
  39. Malvern, D.; Richards, B.; Chipere, N.; Durán, P. Lexical Diversity and Language Development; Palgrave Macmillan: London, UK, 2004; pp. 16–30.
  40. Song, H.J.; Park, S.B. Korean morphological analysis with tied sequence-to-sequence multi-task model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1436–1441.
  41. Verhoeven, L.; Perfetti, C. (Eds.) Learning to Read Across Languages and Writing Systems; Cambridge University Press: Cambridge, UK, 2017.
  42. Treffers-Daller, J.; Parslow, P.; Williams, S. Back to basics: How measures of lexical diversity can help discriminate between CEFR levels. Appl. Linguist. 2018, 39, 302–327.
  43. Kyle, K.; Crossley, S.A. Measuring syntactic complexity in L2 writing using fine-grained clausal and phrasal indices. Mod. Lang. J. 2018, 102, 333–349.
  44. Casal, J.E.; Lee, J.J. Syntactic complexity and writing quality in assessed first-year L2 writing. J. Second Lang. Writ. 2019, 44, 51–62.
  45. Lu, X. Automatic analysis of syntactic complexity in second language writing. Int. J. Corpus Linguist. 2010, 15, 474–496.
  46. OpenAI. Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 21 August 2025).
  47. Salton, G.; Wong, A.; Yang, C.S. A vector space model for automatic indexing. Commun. ACM 1975, 18, 613–620.
  48. Reimers, N.; Gurevych, I. Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv 2020, arXiv:2004.09813.
  49. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
  50. Drucker, H.; Burges, C.J.; Kaufman, L.; Smola, A.; Vapnik, V. Support vector regression machines. In Advances in Neural Information Processing Systems 9 (NIPS 1996); MIT Press: Cambridge, MA, USA, 1996.
  51. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222.
  52. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  53. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30 (NIPS 2017); MIT Press: Cambridge, MA, USA, 2017.
  54. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  55. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  56. Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213–220.
  57. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688.
  58. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
Figure 1. End-to-end architecture of the proposed Korean L2 AES.
Figure 2. Construction of topic-aware semantic similarity features.
Figure 3. Performance comparison of five machine learning models using three different feature sets. The models are evaluated based on (a) QWK, (b) MAE, and (c) RMSE.
Figure 4. Confusion matrix for the 6-level Korean proficiency classification.
Table 1. Descriptions of the 6-Level TOPIK Proficiency Framework.

Level | Title | Key Competencies | Estimated Vocabulary
1 | Beginner | Able to carry out basic conversations for survival (greetings, purchasing), understand personal topics (family, hobbies), and create simple sentences. | ~800 words
2 | Beginner | Able to conduct simple daily routines (phone calls, favors), use public facilities, and distinguish between formal and informal situations. | 1500–2000 words
3 | Intermediate | Able to maintain social relationships and use public facilities without significant difficulty. Can express opinions on familiar social topics and understand basic characteristics of written vs. spoken language. | ~3000 words
4 | Intermediate | Able to understand news articles and general social issues. Can comprehend and use common idioms and understand representative aspects of Korean culture. Sufficient for some general work tasks. | ~5000 words
5 | Advanced | Able to perform linguistic functions required for professional work or research. Can understand and discuss unfamiliar topics in fields like politics, economics, and culture. | ~7000+ words
6 | Advanced | Able to perform linguistic functions for professional work and research fluently and accurately. Can understand and express oneself on specialized subjects without difficulty, achieving near-native proficiency. | ~10,000+ words
Table 2. Selected topics for AES development (top 20 topics with the most data).

Topic (in Korean) | Topic (in English) | Number of Essays in Data
한국 생활 | Life in Korea | 910
자기소개 | Self-Introduction | 728
주말 이야기 | Weekend Stories | 465
여행 경험 | Travel Experiences | 447
취미 | Hobbies | 443
좋아하는 계절 | Favorite Season | 437
종합 내러티브 | Comprehensive Narrative | 422
선물 | Gifts | 364
자주 가는 장소 | Places I Visit Frequently | 356
잊지 못할 추억 | Unforgettable Memories | 344
주말 생활 | Weekend Life | 298
존경하는 사람 | People I Respect | 286
나의 성격 | My Personality | 273
여행 계획 | Travel Plans | 265
인생에서 가장 중요한 것 | The Most Important Thing in Life | 256
친구 소개 | Friend Introductions | 255
10년 후의 나의 계획 | My Plan for 10 Years Later | 251
올바른 인터넷 사용 태도 | The Right Way to Use the Internet | 238
고치고 싶은 나의 생활 습관 | My Lifestyle Habits I Want to Change | 237
방학 계획 | Vacation Plan | 235
Table 3. Overall performance of all models across the three feature sets. The best performance within each feature set is highlighted in bold. * Performance is statistically significantly different from the best model (Combined XGB) based on DeLong’s test (p < 0.05).

Feature Set | Model | QWK | MAE | RMSE
Conventional | Ridge | 0.67 * | 0.64 | 1.00
Conventional | SVR | 0.77 * | 0.54 | 0.90
Conventional | Random Forest | 0.78 * | 0.54 | 0.88
Conventional | Hist GB | 0.78 * | 0.53 | 0.88
Conventional | XGB | 0.78 * | 0.53 | 0.88
Semantic | Ridge | 0.06 * | 1.06 | 1.41
Semantic | SVR | 0.07 * | 1.02 | 1.41
Semantic | Random Forest | 0.14 * | 1.08 | 1.42
Semantic | Hist GB | 0.12 * | 1.07 | 1.41
Semantic | XGB | 0.13 * | 1.08 | 1.44
Combined | Ridge | 0.68 * | 0.64 | 0.99
Combined | SVR | 0.78 * | 0.51 | 0.87
Combined | Random Forest | 0.79 * | 0.52 | 0.85
Combined | Hist GB | 0.79 * | 0.51 | 0.85
Combined | XGB | 0.80 | 0.50 | 0.84
Table 4. Performance comparison with direct LLM scoring baselines.

Model | QWK | MAE | RMSE
GPT-4o (Zero-Shot) | 0.21 | 1.42 | 1.70
GPT-4o (Few-Shot) | 0.24 | 1.28 | 1.62
Our Hybrid XGBoost (Combined) | 0.80 | 0.50 | 0.84
Table 5. Performance of the best model broken down by essay topic.

Topic | QWK | MAE | RMSE
Self-introduction | 0.797 | 0.454 | 0.820
Comprehensive Narrative | 0.660 | 0.856 | 1.151
Life in Korea | 0.499 | 0.393 | 0.703
Places I Visit Frequently | 0.491 | 0.357 | 0.656
Hobbies | 0.483 | 0.543 | 0.846
Vacation Plans | 0.474 | 0.362 | 0.629
Favorite Season | 0.420 | 0.291 | 0.580
Weekend Story | 0.413 | 0.269 | 0.596
Travel Plans | 0.389 | 0.532 | 0.849
Travel Experience | 0.385 | 0.445 | 0.722
Friend Introduction | 0.391 | 0.424 | 0.714
People I Respect | 0.295 | 1.157 | 1.534
Gifts | 0.255 | 0.234 | 0.516
My Plan for 10 Years Later | 0.203 | 0.673 | 0.921
Weekend Life | 0.142 | 0.191 | 0.452
My Personality | 0.112 | 0.612 | 0.904
Unforgettable Memory | 0.091 | 0.605 | 0.863
My Lifestyle Habits I Want to Change | 0.088 | 0.574 | 0.832
Most Important Thing in Life | 0.029 | 0.656 | 0.944
The Right Way to Use the Internet | 0.007 | 1.013 | 1.367