Article

Symmetry-Aware Dual-Encoder Architecture for Context-Aware Grammatical Error Correction in Chinese Learner English: Toward a Spaced-Repetition Instructional Structure Sensitive to Individual Differences

School of Foreign Languages, Shanxi University, Taiyuan 030006, China
Symmetry 2026, 18(4), 579; https://doi.org/10.3390/sym18040579
Submission received: 25 February 2026 / Revised: 15 March 2026 / Accepted: 18 March 2026 / Published: 28 March 2026
(This article belongs to the Special Issue Symmetry and Asymmetry in Natural Language Processing)

Abstract

Grammatical error correction (GEC) for Chinese learner English is still dominated by sentence-level modeling, which limits discourse-level consistency and weakens adaptation to learner-specific error profiles. From an instructional perspective, these limitations also reduce the value of automated feedback as a basis for spaced-repetition instructional structures sensitive to individual differences. This study proposes a symmetry-aware dual-encoder architecture for context-aware GEC in Chinese learner English. A context encoder captures preceding-sentence information, while a source encoder integrates BERT-based semantic representations with Bi-GRU-based syntactic features for the current sentence. A gated decoder performs asymmetric fusion of local and contextual evidence. To better reflect corpus-level tendencies in Chinese learner English, a CLEC-informed augmentation strategy generates synthetic errors using empirical category frequencies as a coarse sampling prior. Experiments on CoNLL-2014, JFLEG, and CLEC show consistent improvements over strong neural baselines in F0.5 and GLEU under the current desktop-oriented implementation setting. Nevertheless, the integration of BERT, dual encoders, and gated decoding introduces non-negligible computational overhead, and the present system is therefore better suited to desktop writing-support scenarios than to strict real-time or large-scale online deployment. The proposed framework thus provides a practical technical basis for personalized grammar feedback and for future spaced-repetition instructional designs in ESL writing support.

1. Introduction

Grammatical error correction (GEC) for English writing can be formulated as a conditional sequence-to-sequence learning problem [1,2]. Given an input sentence that may contain grammatical errors, a model is required to generate a corrected sentence that is syntactically well formed and semantically consistent with the input [3]. Modern neural GEC systems typically implement this mapping through encoder–decoder architectures, in which an encoder transforms the token sequence into a continuous representation and a decoder produces the corrected sequence token by token under a parameterized probability distribution [4,5]. From an engineering perspective, GEC is therefore a constrained text-to-text transformation task with strict requirements on latency, memory footprint, and robustness to noisy learner input [6].
In real-world applications such as automatic essay scoring back-ends, online writing assistants, and batch correction pipelines, GEC components must process large volumes of learner text in near real time [7]. This environment imposes hard limits on model size and decoding complexity. In particular, the self-attention layers used in many state-of-the-art models have quadratic time and memory complexity with respect to sequence length, which restricts feasible sentence length and batch size on commodity GPUs [8,9]. At the same time, correction models are expected to achieve high precision across diverse error types, since false corrections are often more harmful than missed errors in practical systems [10]. Designing architectures that satisfy these computational constraints while maintaining strong correction performance therefore remains a central engineering challenge [11].
Most existing neural GEC architectures operate at the sentence level and process each sentence independently of its surrounding context [12]. This design choice simplifies the computational graph and reduces memory usage, but it discards document-level information [13]. In coherent texts, many grammatical phenomena exhibit cross-sentence constraints, including tense consistency across narrative segments, agreement between pronouns and antecedents realized in earlier sentences, and regularities in the use of connectives and function words [14,15]. Sentence-level models that ignore these dependencies can produce edits that are locally plausible but globally inconsistent [16]. Typical failure cases include mixing past and present tenses within the same episode, breaking coreference chains, and introducing connective words that conflict with the preceding discourse [17].
These limitations are especially consequential in learner-oriented writing support scenarios. In practical English-as-a-second-language (ESL) learning environments, automated correction is valuable not only as a one-shot post-editing tool, but also as a possible technical basis for repeated revision and later review. From this perspective, if correction outputs are to support a spaced-repetition instructional structure sensitive to individual differences, the feedback must remain stable, interpretable, and sufficiently aligned with recurrent learner-specific error patterns rather than merely optimizing average benchmark performance. This application-oriented motivation does not change the technical nature of GEC, but it raises stricter requirements for discourse consistency and learner-specific adaptability.
From the perspective of symmetry and asymmetry in natural language processing, standard sentence-level GEC models implicitly assume an asymmetric information flow: each source sentence is mapped independently to a corrected sentence, and the model does not exploit the structurally related information available in multi-sentence inputs where adjacent sentences share topics, entities, and temporal structure [18,19]. There is no mechanism to preserve symmetric treatment of related inputs at the representation level, nor to model controlled asymmetry in situations where contextual evidence should dominate the correction decision [20]. As a result, the information contained in multi-sentence inputs is underutilized, and the model’s ability to enforce global consistency constraints remains limited.
However, document-level context does not automatically improve grammatical error correction. Existing context modeling strategies differ in how they introduce and control contextual information. Direct concatenation of surrounding sentences preserves raw discourse cues but can easily introduce irrelevant tokens and increase the computational burden of attention. Hierarchical or pooled context encoders reduce sequence length and improve efficiency, but they may lose token-level details needed for fine-grained edits. Multi-encoder architectures with cross-attention or gated fusion keep sentence-level and context-level representations separate, allowing the decoder to selectively exploit discourse cues when they are useful. Therefore, the key issue is not simply whether document context is used, but how context is represented, filtered, and integrated with local correction evidence.
Another technical limitation arises from the mismatch between generic GEC training corpora and the error distribution of specific learner populations [21]. Chinese learners of English exhibit structured and high-frequency error patterns driven by differences between the source and target language systems. Common categories include misuse of tense and aspect, omission or overuse of articles and determiners, noun number errors, preposition substitution, subject-verb agreement errors, and word-order-related errors stemming from differences between Chinese and English clause structure [22]. Public GEC corpora such as NUCLE and FCE pool essays from heterogeneous learner groups, and the resulting error distributions do not faithfully capture the statistics of Chinese learner errors [23,24]. A model trained solely on such corpora may therefore allocate insufficient capacity to the error types that dominate Chinese EFL writing.
This issue is also relevant from an instructional viewpoint. Learners do not make errors in identical proportions, and the distribution of recurrent grammatical problems may vary across populations and individuals. If a correction model underrepresents the error categories most salient in a target learner group, the resulting feedback becomes less useful for personalized revision and less suitable for future integration into an individual-difference-sensitive spaced-repetition instructional structure. Accordingly, improving adaptation to Chinese learner error distributions is not only a data-engineering problem, but also a prerequisite for more learner-aware writing support.
These modeling and data limitations translate into concrete failure modes in deployed systems [25]. Tense and agreement errors that depend on cross-sentence context are frequently left uncorrected or corrected inconsistently because the model lacks access to the required document-level information [26,27]. Function-word and morphology errors that are prevalent in Chinese learner essays may be under-corrected because the training data underrepresent these patterns [28]. Addressing these issues requires a GEC architecture that (1) explicitly incorporates a symmetric treatment of sentence-level and context-level information, (2) introduces controlled asymmetry in the decoding process to allow context-sensitive weighting of information sources, and (3) is trained on data whose error distribution is tailored to Chinese learner English, all under realistic computational constraints suitable for deployment [29,30].
Recent studies across artificial intelligence, data-driven prediction, intelligent sensing, and complex system optimization further reinforce the importance of structured representation learning, context-sensitive modeling, multimodal information fusion, and distribution-aware optimization in challenging predictive tasks. Evidence from spatio-temporal learning, trajectory-based intelligent prediction, machine learning for structured interactions, geometric deep learning, built-environment analytics, and imbalance-aware graph modeling suggests that robust performance increasingly depends on exploiting relational structure, heterogeneous contextual cues, and task-specific data distributions rather than processing inputs as isolated instances [31,32,33,34,35,36]. Related advances in intelligent sensing, combinatorial structure processing, user-oriented computational environments, and function-structure-integrated system design also highlight the value of coordinated multi-source information and architecture-level integration in complex systems [37,38,39,40,41,42]. More broadly, cross-domain studies on interface regulation, deformation and transport mechanisms, coupled field modulation, thermal and reactive processes, metabolic regulation, ecological integration, and multi-factor performance enhancement indicate that effective system design often relies on domain-informed optimization under heterogeneous operating conditions [43,44,45,46,47,48,49,50,51]. Although these studies arise from different application areas, they collectively support the broader design rationale of the present work, namely that grammatical error correction for Chinese learner English should combine context-aware encoding, structured dual-stream representation, and learner-distribution-oriented training to achieve reliable and deployable correction performance, thereby providing a stronger technical basis for learner-oriented feedback and future spaced-repetition instructional support.
Motivated by these considerations, the objective of this work is not to assume that document-level context is uniformly beneficial for grammatical error correction, but to investigate a context modeling strategy that can exploit cross-sentence cues while limiting contextual noise. Specifically, the study develops an English GEC system that integrates document-level context and sentence-level information through a symmetry-aware dual-encoder architecture and uses gated decoding to regulate their relative contributions during generation. Concretely, the model introduces two structurally similar encoders: a context encoder that processes preceding sentences in the document and a source encoder that processes the current sentence. This symmetric encoder design is coupled with an explicitly asymmetric decoder that applies gated fusion to assign dynamic weights to context and source representations at each decoding step. In this way, the model exposes a symmetric input structure while allowing asymmetry to emerge in the information flow, thereby aligning the architecture with the theme of symmetry and asymmetry in natural language processing.
On the data side, manually annotated learner corpora are combined with synthetic error sentences generated under a CLEC-informed category prior. This yields a training set that covers both generic GEC patterns and high-frequency error types specific to Chinese EFL writing, while not claiming full reproduction of authentic learner error diversity. The resulting model is evaluated on standard GEC benchmarks and on Chinese learner essays using precision, recall, F0.5, and GLEU, together with an error-type-wise analysis that quantifies performance across different grammatical categories. Finally, the trained network is integrated into a desktop application for automatic grammar checking and scoring in an interactive desktop workflow, although further optimization is still required for stricter real-time deployment scenarios.
It should be noted that the present study does not directly evaluate the pedagogical effectiveness of a spaced-repetition instructional structure or conduct a controlled analysis of psychological individual differences through longitudinal intervention data. Instead, it addresses a prerequisite technical problem: how to build a context-aware and learner-sensitive GEC framework for Chinese learner English that can provide more reliable feedback for future personalized and repeated-review scenarios. In this sense, the proposed framework is positioned as a technical basis for writing support systems that may later be organized into a spaced-repetition instructional structure sensitive to individual differences.
The remainder of the paper is organized as follows. Section 2 reviews related neural architectures and technical primitives. Section 3 details the proposed dual-encoder–decoder model and its symmetry-aware design. Section 4 describes the experimental setup. Section 5 reports quantitative and qualitative results together with system-level behavior. Section 6 concludes the paper and discusses possible extensions.

2. Related Work and Technical Background

2.1. Neural Grammatical Error Correction: From Sentence-Level to Document-Level

Grammatical error correction (GEC) is commonly formulated as a conditional generation problem: given an input token sequence that may contain grammatical errors, a model produces a corrected sequence that is fluent and well-formed under a target language distribution. From an algorithmic viewpoint, GEC can be implemented either as (i) a structured prediction problem over edit operations (insert, delete, substitute, reorder), or (ii) a sequence-to-sequence mapping that directly models the conditional likelihood of a corrected sentence given an erroneous sentence. The latter formulation dominates recent research because it enables end-to-end training, integrates naturally with pre-trained language models, and supports efficient decoding with standard neural inference pipelines.
Early GEC systems were rule-driven and relied on manually constructed grammars, error patterns, and linguistic constraints. These methods typically couple a detection stage (pattern matching, constraint checking, finite-state rules) with a correction stage that selects a candidate rewrite from a small rule-induced hypothesis set. Rule-based systems can be precise on the phenomena they explicitly cover and can enforce hard constraints, but the overall system behavior is brittle: coverage is limited by the rule inventory, rules interact in complex ways, and domain shifts rapidly degrade recall. From an engineering perspective, scaling such systems requires continuous rule maintenance and yields a poor accuracy–cost tradeoff as the supported error space expands.
Phrase-based statistical machine translation (SMT) reframed GEC as monolingual translation from “bad” to “good” text. In SMT-based GEC, parallel data of erroneous sentences and corrected references are used to learn phrase translation probabilities, distortion models, and language models. The core advantage is that error correction emerges from data-driven phrase substitutions and reorderings, reducing the need for handcrafted rules. However, SMT pipelines depend on many components (alignment, phrase table, n-gram language model, tuning) and struggle with long-range dependencies. Their search space is also constrained by phrase segmentation, which makes it difficult to model global grammatical constraints and subtle agreement phenomena.
Neural approaches replaced component-heavy pipelines with parameterized encoders and decoders trained by maximizing likelihood on parallel corpora. Recurrent encoder–decoder models with attention improved over SMT by allowing soft token-level alignment and by learning distributed representations that generalize across lexical patterns. Subsequent variants introduced convolutional encoders, character-level modeling to reduce out-of-vocabulary errors, and copy-aware decoding for conservative edits. A key practical driver of progress has been synthetic error generation: large-scale pseudo-parallel corpora are created by injecting plausible errors into grammatically correct sentences, enabling pre-training of high-capacity models before fine-tuning on gold annotations.
Transformer architectures further improved GEC by replacing recurrence with self-attention, enabling direct modeling of arbitrary token-to-token interactions in a single layer. Self-attention can be interpreted as constructing a dense weighted graph over tokens, where edge weights are learned similarity scores. This design yields strong parallelism on modern hardware and supports deep stacks with stable optimization. In GEC benchmarks, Transformer-based systems became the de facto standard, especially when combined with pre-training and large synthetic corpora. Pre-trained masked language models such as BERT provide contextualized token embeddings that encode rich syntactic and semantic regularities; integrating such representations into encoder–decoder pipelines improves correction quality under limited labeled data.
Document-level GEC addresses an intrinsic limitation of sentence-independent correction: many grammatical choices depend on broader discourse context, including antecedent resolution, global tense consistency, entity naming consistency, and cross-sentence coherence. Cross-sentence GEC models augment the source sentence with surrounding sentences or document representations and learn context-conditioned corrections. The main technical challenge is to exploit context without amplifying noise: context may be irrelevant for many edits, and naïve concatenation increases sequence length and the quadratic cost of attention. Different context modeling strategies therefore involve different trade-offs. Controlled context windows preserve local discourse cues with relatively simple implementation, but their representational range is limited. Hierarchical encoders compress longer context more efficiently, but they may discard token-level details needed for fine-grained correction. Multi-encoder or gated-fusion architectures keep the source sentence and document context in separate streams and provide stronger controllability over their interaction, at the cost of greater architectural complexity. No single strategy is uniformly superior across all error types, so the choice of context model must be justified empirically.
Taken together, existing GEC approaches exhibit complementary strengths and clear limitations. Sentence-level encoder–decoder models are relatively simple and efficient, but they often miss discourse-dependent corrections such as cross-sentence tense consistency and pronoun-related constraints. BERT-enhanced sentence-level models improve semantic representation and local correction quality, yet they still treat each sentence largely in isolation and therefore cannot fully exploit document context. Document-level multi-encoder models introduce surrounding sentences into the correction process, but their effectiveness depends strongly on how context is filtered and fused; without sufficient control, contextual information may increase noise and computational burden rather than improve correction quality. In addition, most general-purpose GEC systems are trained on heterogeneous learner corpora and do not explicitly adapt to the error profile of Chinese learners. These observations motivate the present work, which combines a dual-encoder context modeling scheme, gated source–context fusion, and learner-distribution-aware augmentation to address these limitations in an integrated manner.
Shared tasks and community benchmarks have standardized evaluation and accelerated iteration. CoNLL-style evaluations emphasize precision and use F-score variants (notably F0.5) and the M2 scorer to reward high-precision corrections, while later benchmarks introduce more diverse error types and emphasize fluency through reference-based metrics and fine-grained error analysis tools. These benchmarks motivate models that balance correction aggressiveness, computational efficiency, and robustness to heterogeneous error distributions.
Recent GEC research has further expanded beyond conventional encoder–decoder and edit-based formulations. Studies from 2024–2025 have examined the role of large language models in GEC more systematically, including zero-shot and fine-tuned LLM correction, explanation-guided in-context demonstration retrieval, and LLM-assisted reranking or post-correction [52,53,54,55,56,57,58]. At the same time, data-centric and efficiency-oriented directions have also advanced, including language-model-based automatic error generation, mixture-of-experts architectures for more parameter-efficient correction, and selective augmentation strategies for low-resource GEC. These developments indicate that the current frontier of GEC is no longer limited to stronger backbone architectures alone, but increasingly concerns controllable correction, realistic pseudo-error generation, efficient scaling, and flexible integration of LLMs into training and inference pipelines.

2.2. Symmetry and Asymmetry in Context-Aware GEC Models

Context-aware GEC models can be viewed as multi-source conditional generation systems in which a decoder attends to two information streams: the current sentence and its document context. The two streams are structurally similar (both are token sequences) yet semantically asymmetric (the current sentence directly determines the output tokens, while context provides auxiliary constraints). This creates a natural symmetry–asymmetry tension that can be exploited at the architecture level.
A symmetric design typically assigns the same operator family to both streams. For example, both the source sentence and the context window can be encoded with identical stacks of self-attention and recurrent layers to produce token-level representations with comparable geometric properties (dimensionality, normalization, and temporal coverage). In graph terms, each encoder constructs a weighted token-interaction graph; using the same encoder template yields two homologous graphs whose node semantics differ only by the input segment they represent. Such symmetry simplifies implementation, stabilizes training by aligning representation scales, and makes fusion operators easier to parameterize because both inputs share compatible statistics.
Asymmetry enters through information routing and contribution control. When a decoder consumes two encoded streams, the model must decide whether a particular correction should be driven by local evidence (within-sentence syntax and lexical cues) or by contextual evidence (cross-sentence constraints). This distinction is important because many grammatical edits, such as local morphology, article usage, short-range agreement, and function-word insertion or deletion, can often be resolved from the current sentence alone, and surrounding sentences may even introduce irrelevant signals. Gated fusion provides an explicit mechanism for this decision: a learnable gate interpolates between the two attention-derived context vectors at each decoding step, allowing token-level and step-level adaptivity. As a result, the decoder can suppress document-context contributions when the correction is primarily sentence-local and increase contextual reliance only when broader discourse cues are genuinely informative. This gate breaks symmetry in a controlled, measurable way: the fusion operator is symmetric in form (it treats both inputs through the same algebra), yet the learned gate weights introduce asymmetric contributions conditioned on the decoding state.
Compared with naïve concatenation of context and source, symmetric dual encoders with gated fusion have favorable complexity and controllability. Concatenation increases the effective sequence length and directly inflates the $O(L^2)$ cost of attention, whereas two encoders keep each stream length bounded and allow selective cross-attention. The gate further mitigates context noise by attenuating irrelevant context contributions, which is critical for document-level GEC where many edits remain sentence-local. This symmetry-aware viewpoint also clarifies what is being optimized: the model learns two comparable representations and then learns a lightweight asymmetric controller that modulates their influence during generation.

2.3. Neural Components: GRU, Bi-GRU, Attention, Transformer, and BERT

A sequence of tokens is represented as embeddings and processed by a composition of differentiable operators. The operators summarized here are used as building blocks for encoders, cross-attention modules, and fusion mechanisms in later sections. Emphasis is placed on the algebraic form of each operator and on the computational implications relevant to GEC.
Feed-forward networks (FFNs) implement point-wise nonlinear projection and are widely used for gating, feature mixing, and output mapping. For an input feature vector, an affine transformation produces a pre-activation value and a nonlinear function yields an output activation as:
$$z = \sum_{i=1}^{d} \omega_i x_i + b = \mathbf{w}^{\top}\mathbf{x} + b$$
$$a = \sigma(z)$$
In (1), the weight parameters are collected in a vector and the bias term shifts the activation; (2) shows a typical squashing nonlinearity used to parameterize gates and probabilities. FFNs have linear cost in sequence length and quadratic cost in hidden dimension when implemented as matrix multiplication, making them efficient for token-wise transformations.
Gated recurrent units (GRUs) introduce multiplicative gates to regulate information flow over a temporal chain, improving optimization stability on long sequences. For token embedding input at time step t and a hidden state from the previous step, the reset gate and update gate are computed as:
$$\mathbf{r}_t = \sigma\!\left(\mathbf{W}_r[\mathbf{h}_{t-1}, \mathbf{x}_t]\right)$$
$$\mathbf{z}_t = \sigma\!\left(\mathbf{W}_z[\mathbf{h}_{t-1}, \mathbf{x}_t]\right)$$
$$\tilde{\mathbf{h}}_t = \tanh\!\left(\mathbf{W}_h[\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t]\right)$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
Equations (3)–(6) show the gate-controlled recurrence where σ(·) and tanh(·) are element-wise nonlinearities and ⊙ denotes the Hadamard product. The update gate controls how much of the candidate state is written into the memory, which is useful for preserving grammatical cues over long distances such as agreement and subcategorization.
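To make the gate-controlled recurrence concrete, the following minimal NumPy sketch implements one step of Equations (3)–(6); the weight shapes, the omission of bias terms, and the toy dimensions are illustrative assumptions rather than the configuration used in the reported system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step following Equations (3)-(6).

    x_t: input embedding at time t, shape (d_in,)
    h_prev: previous hidden state, shape (d_h,)
    W_r, W_z, W_h: gate weights of shape (d_h, d_h + d_in); biases omitted for brevity.
    """
    concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                      # reset gate, Eq. (3)
    z_t = sigmoid(W_z @ concat)                      # update gate, Eq. (4)
    concat_reset = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(W_h @ concat_reset)            # candidate state, Eq. (5)
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # interpolated update, Eq. (6)

# Toy usage with random parameters over a 5-token sequence.
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W_r, W_z, W_h = (rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h = gru_step(x, h, W_r, W_z, W_h)
print(h.shape)  # (16,)
```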
Bidirectional GRUs (Bi-GRUs) compute representations from both left-to-right and right-to-left passes and concatenate them, providing richer local context for each token:
$$\mathbf{h}_t = \left[\overrightarrow{\mathbf{h}}_t ; \overleftarrow{\mathbf{h}}_t\right]$$
The Bi-GRU sequence graph is a pair of directed chains with opposite orientations; concatenation yields token representations that condition on both preceding and succeeding context, which is beneficial for correcting errors that depend on right context (e.g., missing auxiliaries or determiners) as well as left context (e.g., verb agreement).
Attention mechanisms compute content-based interactions between a query and a set of key–value pairs, enabling dynamic alignment without imposing a fixed locality bias. For a query vector and a key vector at position i, a similarity score can be defined as an inner product:
$$\mathrm{Sim}(Q, K_i) = Q^{\top} K_i$$
$$a_i = \mathrm{softmax}(S_i) = \frac{\exp(S_i)}{\sum_{j=1}^{L_x}\exp(S_j)}, \qquad S_i = \mathrm{Sim}(Q, K_i)$$
$$\mathrm{Attention}(Q, K, V) = \sum_{i=1}^{L_x} a_i V_i$$
Equations (8)–(10) define a generic attention operator in which normalized weights form a probability simplex over source positions. From a graph perspective, attention constructs a complete bipartite graph from query nodes to source nodes with edge weights a_i. Its cost scales linearly in the number of key–value pairs per query and is the dominant term in many encoder–decoder models.
Multi-head attention improves expressivity by running multiple attention subspaces in parallel and concatenating their outputs. A compact form is:
$$\mathrm{attn}\big((K, V), Q\big) = \mathrm{attn}\big((K, V), q_1\big) \oplus \cdots \oplus \mathrm{attn}\big((K, V), q_n\big)$$
Scaled dot-product attention is a specific instantiation that supports efficient matrix implementations and stable gradients by scaling the dot product:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
In (12), the matrix product QKᵀ computes all pairwise similarities between query and key positions, yielding an attention weight matrix that defines a dense directed graph over tokens. For a sequence of length L, the attention matrix has L × L entries, giving quadratic time and memory complexity with respect to L; this is the primary computational bottleneck of Transformer-style encoders on long inputs.
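The following short NumPy sketch illustrates Equation (12) and the L × L attention matrix that drives the quadratic cost; the dimensions are toy values chosen only for demonstration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, Eq. (12).

    Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v).
    Returns (L_q, d_v) outputs and the (L_q, L_k) attention weight matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # all pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights

# Self-attention (Q = K = V): the weight matrix has L * L entries, hence quadratic cost.
rng = np.random.default_rng(0)
L, d = 6, 4
x = rng.normal(size=(L, d))
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)   # (6, 4) (6, 6)
```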
Transformer encoders are built by alternating self-attention with position-wise FFNs, coupled with residual connections and layer normalization to stabilize deep stacks. The FFN sublayer can be written as:
$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
The self-attention sublayer is permutation-invariant, so positional information is injected via positional encodings. A widely used sinusoidal encoding defines even and odd dimensions as:
$$\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
$$\mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
Equations (14) and (15) map a position index to a deterministic vector that allows the model to recover relative order information through linear operations. In later model designs, Transformer layers are used to encode sentence-internal and context-window token graphs, while Bi-GRU layers provide an additional sequential inductive bias at modest computational cost.
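A compact NumPy sketch of the sinusoidal encoding in Equations (14) and (15) is given below; the maximum length and model dimension are arbitrary example values.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, Eqs. (14)-(15): sine on even dims, cosine on odd dims."""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```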
BERT is a transformer-based masked language model pre-trained on large corpora using token masking and sentence-level objectives. The key engineering advantage for GEC is that BERT provides strong contextualized token representations that already encode syntactic constraints and lexical selection preferences. In encoder–decoder correction systems, BERT is frequently used as a feature generator that supplies semantic embeddings; these embeddings can be fused with task-specific encoder features and then consumed by a decoder with cross-attention, yielding improved correction accuracy without requiring prohibitively large labeled GEC corpora.

2.4. Learner Corpora and Benchmark Datasets

Benchmark datasets for GEC provide parallel sentence pairs and, in some cases, fine-grained error annotations. Each corpus imposes distinct distributional properties (prompt type, essay length, proficiency level, annotation policy), so combining multiple corpora is a practical strategy to cover diverse error patterns while retaining standardized evaluation on community test sets.
NUCLE is a widely used annotated corpus of learner English that provides error spans and error categories. Its error taxonomy supports category-level evaluation and is the basis for CoNLL-style shared tasks, where system outputs are scored against multiple references. NUCLE is commonly paired with additional corpora for training or augmentation because its gold annotations are relatively expensive and thus limited in scale.
Lang-8 is a large-scale collection of learner sentences and crowd-sourced corrections obtained from an online language-learning platform. The corpus contains over 100,000 articles and more than ten million words, providing broad lexical coverage and a diverse error distribution. Because its corrections are crowd-sourced, the data is noisier than expert-annotated corpora, and it is therefore often used for pre-training or for synthetic data generation pipelines where robustness to annotation noise is required.
JFLEG focuses on fluency-oriented correction: the target is not only grammaticality but also naturalness of expression, and multiple references are provided for each sentence. Table 1 reports the scale of the JFLEG development and test splits in terms of sentence count and token count, which is useful for planning fine-tuning and evaluation workloads.
The JFLEG splits are small enough to support repeated development cycles while still being large enough to reveal differences in fluency and over-correction behavior. The availability of multiple references reduces metric variance and better reflects the non-uniqueness of valid rewrites.
FCE (First Certificate in English) provides exam-related learner texts with corrections and is frequently used for training and validation because it contains a substantial amount of parallel data with relatively consistent annotation. Table 2 summarizes the train/dev/test splits used in this work.
The FCE training split provides a stable foundation for learning common grammatical patterns, while the development and test splits support model selection without leaking evaluation data. In practice, FCE is often combined with other corpora to improve error coverage, especially for error types underrepresented in exam essays.
CoNLL-2013 and CoNLL-2014 shared tasks provide standardized development and test sets derived from NUCLE essays. Each dataset contains 50 essays and approximately thirty thousand words, and the CoNLL-2014 test set covers a broad set of error categories and remains a primary benchmark for precision-oriented evaluation with the M2 scorer and F0.5.
CLEC (Chinese Learner English Corpus) provides compositions written by Chinese learners with structured error coding. Its coding scheme captures error types that are characteristic of Chinese native-language transfer, enabling evaluation of domain-specific robustness. Table 3 reports statistics of the CLEC subsets ST2–ST6 used in this work.
The CLEC subsets provide a large-scale source of learner English with explicit error codes, which is useful for both error-type-aware augmentation and fine-grained evaluation. Because CLEC compositions are longer and exhibit discourse-level phenomena, they also stress-test context-aware models beyond sentence-local corrections.
Across corpora, three dataset roles are distinguished: (i) training corpora provide large parallel data for maximum-likelihood training and for synthetic error generation; (ii) development corpora support hyperparameter tuning and early stopping under standardized metrics; (iii) test corpora provide held-out evaluation on community benchmarks (CoNLL, JFLEG) and domain-specific benchmarks (CLEC). This separation is essential for reproducible comparisons and for avoiding implicit overfitting to a single error distribution.

2.5. Error Taxonomy for Chinese EFL GEC

Error labels serve two engineering purposes: they define the rule space for controlled error injection during data augmentation and they define evaluation slices for error-type-wise performance reporting. Two complementary labeling systems are leveraged: the NUCLE error categories used in CoNLL-style evaluation and the structured CLEC error codes designed for Chinese learner English.
NUCLE provides category tags such as ArtOrDet (article or determiner), Prep (preposition), SVA (subject–verb agreement), Nn (noun number), Trans (word order), Pref (pronoun reference), Vform (verb form), Vt (verb tense), Mec (mechanical or orthographic issues), Wform (word form), Wci (word choice), WOinc (word order incorrect), and others. These tags allow aggregation of edits into interpretable groups and support category-level measurement under precision/recall tradeoffs.
CLEC uses a hierarchical coding scheme in which errors are grouped by the syntactic region they affect (e.g., noun phrase codes starting with np, verb phrase codes starting with vp, prepositional phrase codes starting with pp, and sentence pattern or word-order codes starting with sp). For cross-corpus consistency, a mapping is defined between representative NUCLE tags and CLEC codes. For example, ArtOrDet aligns with CLEC noun-phrase determiner/article errors (np3/np7/np9), and SVA aligns with verb-phrase agreement errors (vp3). Prep aligns with CLEC prepositional phrase errors (pp*), while Trans aligns with CLEC word-order related codes (sp*).
During augmentation, the taxonomy determines which transformation rules are permitted and how frequently each rule is sampled. This makes the synthetic error distribution controllable and allows injection of Chinese-learner-specific error patterns even when the base clean text originates from generic English corpora. During evaluation on CLEC compositions, the same taxonomy enables reporting of precision, recall, and F0.5 for each error group, isolating which algorithmic components benefit which error families.
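The cross-corpus alignment described above can be encoded as a simple lookup table. The sketch below covers only the representative alignments explicitly named in this subsection and assumes a wildcard convention for whole code families (pp*, sp*); the function names are illustrative.

```python
# Representative NUCLE-to-CLEC alignments from Section 2.5; "pp*" / "sp*" stand for
# any CLEC code with that prefix.
NUCLE_TO_CLEC = {
    "ArtOrDet": ["np3", "np7", "np9"],   # article/determiner errors in noun phrases
    "SVA":      ["vp3"],                 # subject-verb agreement in verb phrases
    "Prep":     ["pp*"],                 # prepositional-phrase errors
    "Trans":    ["sp*"],                 # word-order / sentence-pattern errors
}

def clec_codes_for(nucle_tag):
    """Return the CLEC codes (or prefix patterns) aligned with a NUCLE tag, if any."""
    return NUCLE_TO_CLEC.get(nucle_tag, [])

def matches(clec_code, nucle_tag):
    """Check whether a concrete CLEC code falls under the group aligned with a NUCLE tag."""
    for pattern in clec_codes_for(nucle_tag):
        if pattern.endswith("*") and clec_code.startswith(pattern[:-1]):
            return True
        if clec_code == pattern:
            return True
    return False

print(matches("pp2", "Prep"))   # True: any pp* code aligns with Prep
print(matches("vp3", "SVA"))    # True
```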

3. Symmetry-Aware Dual-Encoder Grammar Correction Model

The symmetry-aware dual-encoder grammar correction model forms the core of the system. As discussed in Section 2, existing GEC approaches face three closely related limitations: sentence-level models underuse discourse context, context-aware models do not always control contextual noise effectively, and general-purpose training data do not fully reflect the error distribution of Chinese learners. The proposed model is designed to address these issues in a unified way. It adopts two structurally symmetric encoders, in which a document-level context encoder processes preceding sentences and a source-sentence encoder focuses on the current sentence to be corrected, so that sentence-level and context-level information can be represented in parallel rather than being naively concatenated. Their outputs are then fused in a gated Transformer-style decoder that dynamically regulates the relative contribution of local and contextual evidence during generation. In addition, the model is trained on a combination of learner corpora and CLEC-informed synthetic data so as to strengthen adaptation to frequent Chinese learner error categories. All components are optimized jointly, and the architecture is intended to improve context-sensitive correction performance while keeping the system usable in a practical writing-assistance setting.

3.1. Overall Architecture and Processing Pipeline

The complete processing pipeline operates at essay level. Given an input essay consisting of multiple sentences, the system first applies sentence segmentation and tokenization, followed by part-of-speech (POS) tagging and vectorization of lexical and syntactic features. For each target sentence, the immediately preceding sentences are grouped as document context, and the pair (context, target) is passed to a dual-encoder network. The encoders transform the variable-length token sequences into fixed-dimensional continuous representations. A gated decoder then conditions on both representations and generates a corrected version of the target sentence. Finally, a beam-search-based result selection module ranks candidate corrections and outputs the top-scoring hypothesis as the system prediction.
The four main modules of the system are therefore: (1) a preprocessing and data augmentation module, (2) a context encoder for document-level information, (3) a source encoder that combines syntactic features and BERT-based semantic features, and (4) a Transformer-style decoder with masked self-attention and gated fusion of context and source. The two encoders share the same architectural template, combining multi-head self-attention with bidirectional gated recurrent layers, so that context and source information are processed in a structurally symmetric way. Asymmetric behavior arises only inside the decoder, where a learnable gate assigns different weights to the two encoder outputs at each decoding step. During training the pipeline is executed in teacher-forcing mode; during inference, the same modules are reused with beam search to approximate the most probable corrected sentence.
Raw essay text is converted into a sequence of sentences, followed by tokenization, POS tagging, and embedding lookup. The context encoder consumes the tokens in the preceding sentences and outputs a context representation, while the source encoder consumes the tokens in the target sentence and outputs a source representation. The decoder reads the previous output tokens together with the two encoder representations and produces a distribution over the next token at each time step. After decoding, the result selection module filters invalid candidates, computes a normalized beam score, and returns the highest-scoring corrected sentence.
For implementation clarity, the context encoder is instantiated as two Transformer encoder layers with hidden size 512 and six attention heads, followed by two bidirectional GRU layers with 256 hidden units in each direction. The source encoder follows the same backbone structure but additionally incorporates token-level contextualized representations from a bert-base-cased model, which provides 768-dimensional semantic embeddings before fusion with the task-specific source-side features. The decoder is composed of six Transformer decoder layers, each including masked multi-head self-attention over previously generated tokens, cross-attention to the source encoder, cross-attention to the context encoder, gated fusion of the two attended streams, and a position-wise feed-forward projection. The feed-forward sublayer uses an inner dimension of 2048, and inference is performed with beam search using beam width 5 and length penalty factor α = 0.6. These settings specify the principal layer configurations and representation sizes of the proposed system and provide the implementation-level details required for reproducibility.
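For reference, the layer configuration listed above can be collected into a single configuration object. The sketch below uses illustrative field names; only the numeric values and the model identifier are taken from the text.

```python
from dataclasses import dataclass

@dataclass
class DualEncoderGECConfig:
    """Principal layer sizes of the dual-encoder model as enumerated in Section 3.1.
    Field names are illustrative; only the values come from the text."""
    # Context and source encoders (shared backbone template)
    encoder_transformer_layers: int = 2
    encoder_hidden_size: int = 512
    encoder_attention_heads: int = 6
    encoder_bigru_layers: int = 2
    encoder_bigru_hidden_per_direction: int = 256
    # Source-side semantic features
    bert_model_name: str = "bert-base-cased"
    bert_hidden_size: int = 768
    # Decoder
    decoder_layers: int = 6
    decoder_ffn_inner_size: int = 2048
    # Inference
    beam_width: int = 5
    length_penalty_alpha: float = 0.6

config = DualEncoderGECConfig()
print(config.encoder_hidden_size, config.beam_width)
```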
From a computational perspective, the dominant cost of the model in both training and inference comes from the self-attention and BERT components. For a target sentence of length $T_s$ and a context window of length $T_c$, the encoders require quadratic time in $T_s$ and $T_c$ for multi-head self-attention and linear time in these lengths for the recurrent blocks. The decoder performs masked self-attention and two cross-attentions at each generation step, so the overall complexity grows linearly with the length of the generated sentence and the beam size. In practice, a small context window and moderate hidden dimension help control latency in the current desktop-oriented implementation. However, because the architecture still relies on self-attention, BERT-based encoding, and autoregressive decoding, the overall computational cost remains non-trivial and may be restrictive for strict real-time or resource-constrained deployment scenarios.

3.2. Text Preprocessing and Data Augmentation

Preprocessing converts raw essay text into numerical tensors suitable for neural processing. Sentence segmentation and tokenization are implemented with the NLTK toolkit, which yields a sequence of word tokens for each sentence. Each token is passed to the Stanford POS Tagger to obtain a POS tag. For every token position t, the model constructs a composite feature vector by concatenating a word embedding and a POS embedding. The word embedding is looked up from a trainable matrix initialized with pre-trained word2vec parameters, while the POS embedding is learned from scratch. The resulting embedding sequence simultaneously encodes lexical and syntactic information and is shared by both the context encoder and the source encoder.
The grammatical error correction task is constrained by the limited size of parallel learner corpora. To mitigate overfitting and to improve coverage of frequent error categories observed in Chinese learner English, the training data are augmented using a hybrid strategy that combines rule-based perturbations and distribution-guided noise injection. Here, the CLEC-derived statistics are used only as a corpus-informed prior for sampling broad error categories, not as an assumption that synthetic data fully capture the diversity, contextual dependence, or individual variability of authentic learner errors. The authentic learner corpora therefore remain the primary source of supervision, while the synthetic pairs are used only as auxiliary training data to expose the model to frequent learner-like categories that are otherwise underrepresented. The starting point is a collection of grammatically correct sentences from the NUCLE and FCE corpora. For each such sentence, one synthetic erroneous sentence is generated to form a sentence pair used for supervised training.
Let a correct sentence be represented as a sequence of tokens $S_{\mathrm{source}} = (w_0, w_1, \ldots, w_n)$ of length $n + 1$. Two rule-based operations are applied at the word level: insertion and deletion. An insertion operation selects a position $i$ and inserts a noise token $w^{*}$ at that position to obtain an erroneous sentence,
$$S_{\mathrm{error}}^{(\mathrm{ins})} = \mathrm{Insert}(S_{\mathrm{source}}, i, w^{*}),$$
which simulates overuse errors and unnecessary function words. A deletion operation selects a position $i$, removes the token $w_i$, and yields an erroneous sentence
$$S_{\mathrm{error}}^{(\mathrm{del})} = \mathrm{Delete}(S_{\mathrm{source}}, i),$$
which simulates missing words, typical in essays where learners omit articles or auxiliary verbs.
The distribution-guided strategy uses statistics from the CLEC corpus to estimate coarse category-level tendencies in Chinese learner English, rather than the true real-world error distribution of Chinese learners. Six major grammatical error categories are considered: verb-related errors, syntactic structure errors, noun-related errors, pronoun errors, preposition errors, and modifier errors. For each category $k$, the number of errors $f_k$ is computed from CLEC subsets and the category sampling prior is obtained by smoothed, temperature-scaled normalization,
$$q_k = \frac{(f_k + \varepsilon)^{\tau}}{\sum_{j=1}^{K}(f_j + \varepsilon)^{\tau}}, \qquad k = 1, \ldots, K,$$
so that $\sum_{k=1}^{K} q_k = 1$. In the analyzed corpus, verb and syntactic structure errors dominate the distribution, together accounting for more than seventy percent of all grammatical errors, while noun, pronoun, preposition, and modifier errors occupy progressively smaller shares. These normalized probabilities are used to bias category sampling during synthetic error generation.
For each error category, a confusion set is defined to guide the concrete token-level perturbation. For example, the structural category includes conjunctions and relative pronouns such as “when”, “that”, “which”, “where”, and “why”; the verb category includes different inflected forms such as “do”, “did”, “does”, and “doing”; and the preposition category includes alternatives like “to”, “for”, and “with”. When an error category is sampled according to the normalized probabilities, a target position and replacement word are drawn from the corresponding confusion set to construct the erroneous sentence. This mechanism encourages the synthetic data to cover frequent learner-like patterns at the category level, but it should be regarded as a coarse augmentation procedure rather than as a realistic simulator of the full range of authentic learner errors or their discourse-specific realizations.
The hybrid augmentation strategy introduces a scalar threshold μ in the interval (0, 1) that controls the proportion of rule-based versus distribution-based noise. For each sentence, a uniform random value is drawn from this interval. If the value is smaller than the threshold, the sentence is perturbed by a rule-based insertion or deletion, which mainly simulates missing or superfluous words. Otherwise, the sentence is perturbed according to the CLEC-based category prior. In the experiments, the threshold is set to 0.20, which means that approximately twenty percent of synthetic pairs correspond to insertion or deletion errors and the remaining eighty percent correspond to category-specific errors drawn from the confusion sets.
From an algorithmic standpoint, the augmentation procedure runs in linear time with respect to the sentence length, since each sentence is traversed once to select candidate positions for insertion or deletion, and the confusion-set operations are constant time. The overall complexity is therefore proportional to the product of the number of source sentences and the average sentence length. Because augmentation is performed offline before training, it does not affect inference latency but significantly improves the robustness of the dual-encoder model on learner essays.
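The following Python sketch outlines one possible implementation of the hybrid augmentation procedure with the threshold μ = 0.20 and the smoothed, temperature-scaled category prior defined above; the confusion sets, category counts, and noise token shown here are illustrative placeholders rather than the corpus-derived inventories used in the experiments.

```python
import random

# Illustrative confusion sets and category counts; the actual inventories and CLEC
# frequencies used in the paper are larger and corpus-derived.
CONFUSION_SETS = {
    "verb":        ["do", "did", "does", "doing"],
    "structure":   ["when", "that", "which", "where", "why"],
    "preposition": ["to", "for", "with"],
}
CATEGORY_COUNTS = {"verb": 5000, "structure": 4000, "noun": 1500,
                   "pronoun": 800, "preposition": 700, "modifier": 500}

def category_prior(counts, epsilon=1.0, tau=1.0):
    """Smoothed, temperature-scaled normalization of category frequencies."""
    scaled = {k: (f + epsilon) ** tau for k, f in counts.items()}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}

def perturb(tokens, prior, mu=0.20, noise_token="the"):
    """Generate one synthetic erroneous sentence from a correct token list."""
    tokens = list(tokens)
    i = random.randrange(len(tokens))
    if random.random() < mu:
        # Rule-based branch: insertion or deletion (superfluous / missing words).
        if random.random() < 0.5:
            tokens.insert(i, noise_token)
        elif len(tokens) > 1:
            del tokens[i]
    else:
        # Distribution-guided branch: sample a category, then a replacement word.
        category = random.choices(list(prior), weights=list(prior.values()))[0]
        candidates = CONFUSION_SETS.get(category)
        if candidates:                      # categories without a set are skipped here
            tokens[i] = random.choice(candidates)
    return tokens

prior = category_prior(CATEGORY_COUNTS)
print(perturb("she wants to go home".split(), prior))
```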

3.3. Context Encoder: Document-Level Representation

The context encoder operates on the sentences that precede the current target sentence within the same essay. The previous sentences are concatenated to form a context sequence, and after preprocessing this context is represented as a sequence of composite embeddings $(\mathbf{x}_1^{\,c}, \mathbf{x}_2^{\,c}, \ldots, \mathbf{x}_{T_c}^{\,c})$ that combine word and POS information. The goal of the context encoder is to transform this variable-length sequence into a fixed-dimensional representation that summarizes document-level information relevant to the current correction decision.
The encoder architecture follows a Transformer-augmented bidirectional recurrent design. First, a stack of self-attention layers applies multi-head attention to all positions in the context sequence, which allows the model to capture long-range syntactic dependencies such as subject–verb agreement and the scope of subordinate clauses. The outputs of the self-attention layers are then passed through a bidirectional gated recurrent unit (Bi-GRU), which aggregates information in both left-to-right and right-to-left directions and yields a sequence of hidden states $(\mathbf{h}_1^{\,c}, \mathbf{h}_2^{\,c}, \ldots, \mathbf{h}_{T_c}^{\,c})$.
At each position t the Bi-GRU computes a hidden state as a nonlinear function of the current input embedding and the previous hidden states in both directions, which can be summarized as
$$\overrightarrow{\mathbf{h}}_t^{\,c} = \mathrm{GRU}_f\!\left(\tilde{\mathbf{x}}_t^{\,c}, \overrightarrow{\mathbf{h}}_{t-1}^{\,c}\right), \qquad \overleftarrow{\mathbf{h}}_t^{\,c} = \mathrm{GRU}_b\!\left(\tilde{\mathbf{x}}_t^{\,c}, \overleftarrow{\mathbf{h}}_{t+1}^{\,c}\right), \qquad \mathbf{h}_t^{\,c} = \mathbf{W}_h\!\left[\overrightarrow{\mathbf{h}}_t^{\,c}; \overleftarrow{\mathbf{h}}_t^{\,c}\right] + \mathbf{b}_h$$
To obtain a single context vector that will be consumed by the decoder, an attention pooling layer computes a scalar importance weight $\alpha_t^{\,c}$ for each hidden state $\mathbf{h}_t^{\,c}$ and forms a weighted sum. The attention weights are produced by a small feed-forward network followed by a softmax normalization so that $\sum_{t} \alpha_t^{\,c} = 1$. The resulting vector
$$e_t^{\,c} = \mathbf{v}_c^{\top}\tanh\!\left(\mathbf{W}_c \mathbf{h}_t^{\,c} + \mathbf{b}_c\right), \qquad \alpha_t^{\,c} = \frac{\exp(e_t^{\,c})}{\sum_{i=1}^{T_c}\exp(e_i^{\,c})}, \qquad \mathbf{c}^{\,c} = \sum_{t=1}^{T_c}\alpha_t^{\,c}\,\mathbf{h}_t^{\,c}$$
compacts the variable-length document context into a fixed-dimensional representation whose dimension matches the decoder hidden size.
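A minimal NumPy sketch of this attention-pooling operator is shown below; the attention dimension and the random parameters are illustrative.

```python
import numpy as np

def attention_pool(H, W_c, b_c, v_c):
    """Attention pooling over context hidden states.

    H: (T_c, d) Bi-GRU hidden states; returns a single (d,) context vector.
    """
    e = np.tanh(H @ W_c.T + b_c) @ v_c          # scalar score per position
    e -= e.max()                                 # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()          # softmax over positions, sums to 1
    return alpha @ H                             # weighted sum of hidden states

rng = np.random.default_rng(0)
T_c, d, d_a = 12, 512, 128
H = rng.normal(size=(T_c, d))
W_c = rng.normal(scale=0.05, size=(d_a, d))
b_c = np.zeros(d_a)
v_c = rng.normal(scale=0.05, size=d_a)
context_vector = attention_pool(H, W_c, b_c, v_c)
print(context_vector.shape)   # (512,)
```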
Because the same self-attention and Bi-GRU blocks are also used in the source encoder, the two branches share a structurally symmetric encoder, which simplifies implementation and parameter sharing. Let $T_c$ denote the number of context tokens and $d$ the hidden dimension. The self-attention layers require time proportional to $T_c^2 d$, whereas the Bi-GRU stack requires time proportional to $T_c d^2$. In practice, the context window and the hidden dimension are limited to moderate values, so the additional cost of context encoding compared with sentence-level models is moderate, while the gain in capturing cross-sentence dependencies is substantial.

3.4. Source Sentence Encoder with BERT Fusion

The source sentence encoder processes the current sentence to be corrected. After tokenization and POS tagging, the sentence is represented as a sequence of composite embeddings $(\mathbf{x}_1^{\,s}, \mathbf{x}_2^{\,s}, \ldots, \mathbf{x}_{T_s}^{\,s})$ similar to those used in the context encoder. In addition, a pre-trained BERT model is applied to the original token sequence to produce a contextualized embedding $\mathbf{B}_t$ for each token position $t$. These embeddings capture deep semantic information, such as lexical semantics and subtle contextual nuances that cannot be derived from POS tags alone.
The source sentence encoder reuses the same self-attention and Bi-GRU blocks as the context encoder to ensure structural symmetry. The composite input representation at each position is obtained by projecting the BERT embedding $\mathbf{B}_t$ and the basic embedding $\mathbf{x}_t^{\,s}$ into a common space and combining them through a learnable gate, yielding a single fused vector
$$\tilde{\mathbf{x}}_t^{\,s} = \mathbf{W}_x \mathbf{x}_t^{\,s} + \mathbf{b}_x, \qquad \tilde{\mathbf{B}}_t = \mathbf{W}_B \mathbf{B}_t + \mathbf{b}_B, \qquad \mathbf{g}_t^{\,b} = \sigma\!\left(\mathbf{W}_g^{\,b}\!\left[\tilde{\mathbf{B}}_t; \tilde{\mathbf{x}}_t^{\,s}\right] + \mathbf{b}_g^{\,b}\right), \qquad \mathbf{z}_t^{\,s} = \mathbf{g}_t^{\,b} \odot \tilde{\mathbf{B}}_t + \left(1 - \mathbf{g}_t^{\,b}\right) \odot \tilde{\mathbf{x}}_t^{\,s}$$
The resulting sequence $(\mathbf{z}_1^{\,s}, \ldots, \mathbf{z}_{T_s}^{\,s})$ is then fed into the self-attention stack and Bi-GRU to produce hidden states
$$\hat{\mathbf{H}}^{\,s} = \mathrm{LN}\!\left(\mathrm{MHA}(\mathbf{Z}^{\,s}) + \mathbf{Z}^{\,s}\right), \qquad \mathbf{H}^{\,s} = \mathrm{LN}\!\left(\mathrm{BiGRU}(\hat{\mathbf{H}}^{\,s}) + \hat{\mathbf{H}}^{\,s}\right)$$
These states are used in the decoder’s cross-attention over the source sentence. The presence of BERT allows the encoder to model semantic anomalies such as inappropriate word choice or semantically implausible statements even when the local syntax is grammatically correct.
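The gated combination of BERT and task-specific features can be sketched as follows; the basic embedding dimension (here 300) and the weight initialization are assumptions, while the 768-dimensional BERT features and the 512-dimensional model size follow Section 3.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_bert_and_base(B_t, x_t, W_B, W_x, W_g, b_B, b_x, b_g):
    """Gated fusion of a BERT embedding B_t and a basic word+POS embedding x_t.

    Both inputs are projected to the model dimension; a sigmoid gate computed from their
    concatenation interpolates per dimension between the two projected views.
    """
    B_proj = W_B @ B_t + b_B                      # project BERT features to model size
    x_proj = W_x @ x_t + b_x                      # project task-specific features
    g = sigmoid(W_g @ np.concatenate([B_proj, x_proj]) + b_g)
    return g * B_proj + (1.0 - g) * x_proj

rng = np.random.default_rng(0)
d_bert, d_base, d_model = 768, 300, 512           # d_base is an illustrative assumption
B_t = rng.normal(size=d_bert)
x_t = rng.normal(size=d_base)
W_B = rng.normal(scale=0.02, size=(d_model, d_bert)); b_B = np.zeros(d_model)
W_x = rng.normal(scale=0.02, size=(d_model, d_base)); b_x = np.zeros(d_model)
W_g = rng.normal(scale=0.02, size=(d_model, 2 * d_model)); b_g = np.zeros(d_model)
z_t = fuse_bert_and_base(B_t, x_t, W_B, W_x, W_g, b_B, b_x, b_g)
print(z_t.shape)  # (512,)
```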
Compared with the context encoder, the additional computational cost arises from running the BERT encoder over the source sentence. With sentence length $T_s$ and BERT hidden dimension $d_{\mathrm{BERT}}$, this step has complexity $O(T_s^2\, d_{\mathrm{BERT}})$. Because the model operates on relatively short learner sentences and uses a compact BERT-based representation in practice, the overhead remains manageable in the current setting, but it is still higher than that of lighter sentence-level architectures without pretrained contextual encoders. The self-attention and Bi-GRU layers over the fused representations maintain the same $O(T_s^2 d + T_s d^2)$ complexity order as in the context encoder.

3.5. Decoder with Masked Multi-Head Attention and Gated Fusion

The decoder is a stack of several layers that follow the standard Transformer decoder pattern with two key modifications: the use of a causal mask in the self-attention sublayer and the introduction of a gated fusion mechanism that combines attention over the source sentence with attention over the document context. At decoding step $t$, the decoder receives the embeddings of the previously generated target tokens $(y_1, \ldots, y_{t-1})$, applies masked multi-head self-attention to model dependencies among them, and then applies two parallel cross-attention operations: one over the source encoder hidden states $(\mathbf{h}_1^{\,s}, \ldots, \mathbf{h}_{T_s}^{\,s})$ and one over the context encoder hidden states $(\mathbf{h}_1^{\,c}, \ldots, \mathbf{h}_{T_c}^{\,c})$.
The causal mask enforces the autoregressive constraint that the prediction at position $t$ may depend only on positions before $t$, but not on future positions. Internally, the decoder constructs a lower-triangular mask matrix $M \in \mathbb{R}^{T_y \times T_y}$, where $T_y$ is the length of the target sequence. Entries $M_{i,j}$ are set to zero when $j \le i$ and to a large negative value when $j > i$. Before the softmax operation in the self-attention mechanism, the attention scores are added elementwise to $M$, which effectively prevents any attention weight from flowing to future positions.
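A one-line NumPy construction of this mask is shown below; the choice of −10⁹ as the "large negative value" is a common convention rather than a value specified in the text.

```python
import numpy as np

def causal_mask(T_y, neg_inf=-1e9):
    """Causal mask added to attention scores before the softmax: entries are 0 where
    j <= i (attention allowed) and a large negative value where j > i (blocked)."""
    return np.triu(np.full((T_y, T_y), neg_inf), k=1)

mask = causal_mask(4)
print(mask)   # upper triangle above the diagonal is -1e9, diagonal and below are 0
```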
Let d t s r c denote the context vector produced by attending over the source encoder states at step t , and let d t c t x denote the context vector produced by attending over the document context encoder states. A learnable gate g t ( 0,1 ) d is then computed to modulate the relative contributions of the two vectors. The gate is defined by a sigmoid activation applied to an affine transformation of the concatenation [ d t s r c   ;   d t c t x ] :
$g_t = \sigma\!\left(W_g\,[h_t^{d}; c_t^{s}; c_t^{c}; E(y_{t-1})] + b_g\right)$
$\tilde{c}_t = \mathrm{LN}\!\left(h_t^{d} + g_t \odot c_t^{s} + (1 - g_t) \odot c_t^{c}\right)$
where $\odot$ denotes elementwise multiplication. When $g_t$ is close to one on a given dimension, the decoder relies primarily on the source sentence along that dimension; when $g_t$ is close to zero, the decoder instead relies on document-level context. This behavior is particularly important because many grammatical corrections are determined mainly by sentence-internal syntax rather than broader discourse context. In such cases, the gating mechanism can attenuate or suppress irrelevant contextual information from surrounding sentences and preserve reliance on the local source representation. This design therefore realizes a symmetric treatment of the two encoder branches at the structural level while permitting asymmetric, data-driven weighting during decoding.
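The two equations above translate into a small PyTorch module, sketched below under illustrative tensor sizes; the module name and the assumption that all vectors share the decoder hidden dimension are simplifications for exposition.

```python
import torch
import torch.nn as nn

class GatedContextFusion(nn.Module):
    """Sketch of the decoder gate that weighs the source context vector c_t^s
    against the document context vector c_t^c, following the equations above."""
    def __init__(self, d_model: int = 512, d_emb: int = 512):
        super().__init__()
        self.gate = nn.Linear(3 * d_model + d_emb, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h_dec, c_src, c_ctx, y_prev_emb):
        # g_t = sigmoid(W_g [h_t^d; c_t^s; c_t^c; E(y_{t-1})] + b_g)
        g = torch.sigmoid(self.gate(torch.cat([h_dec, c_src, c_ctx, y_prev_emb], dim=-1)))
        # c~_t = LN(h_t^d + g * c_t^s + (1 - g) * c_t^c)
        return self.norm(h_dec + g * c_src + (1.0 - g) * c_ctx)

fusion = GatedContextFusion()
c_tilde = fusion(torch.randn(2, 512), torch.randn(2, 512),
                 torch.randn(2, 512), torch.randn(2, 512))
print(c_tilde.shape)  # torch.Size([2, 512])
```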
The fused vector $\tilde{c}_t$ is passed through a position-wise feed-forward network and a linear projection to the vocabulary space, yielding a logit vector $o_t$ for the next-token distribution. A softmax transformation converts $o_t$ into a probability distribution over the vocabulary,
$P\!\left(y_t = w \mid y_{<t}, X, C\right) = \dfrac{\exp(o_{t,w})}{\sum_{w' \in V} \exp(o_{t,w'})}, \quad w \in V,$
where $X$ is the source sentence, $C$ is the document context, and $V$ is the output vocabulary,
from which the next token is either sampled during training (for example, under scheduled sampling) or selected by beam search during inference. The beam-search procedure maintains the best partial hypotheses, expands each by one token at a time, and scores them using the sum of log probabilities divided by a length-penalty term to avoid a bias toward short sequences. This decoding strategy approximates the globally most probable corrected sentence while keeping computation linear in the beam size.
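The length-normalized hypothesis score used for beam ranking can be sketched as follows. The GNMT-style penalty form shown here is an assumption made for illustration; the text only specifies a sum of log probabilities divided by a length-penalty term, with the penalty factor given later in Section 4.3.

```python
import math

def length_penalty(length: int, alpha: float = 0.6) -> float:
    """GNMT-style length penalty; alpha = 0 disables length normalization."""
    return ((5.0 + length) / 6.0) ** alpha

def hypothesis_score(token_logprobs: list[float], alpha: float = 0.6) -> float:
    """Sum of token log-probabilities divided by the length penalty, so that
    longer hypotheses are not unfairly dominated by shorter ones."""
    return sum(token_logprobs) / length_penalty(len(token_logprobs), alpha)

short = [math.log(0.9)] * 3   # confident but short hypothesis
long_ = [math.log(0.8)] * 8   # slightly less confident, longer hypothesis
print(hypothesis_score(short), hypothesis_score(long_))
```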
The decoder complexity per time step is dominated by the self-attention and cross-attention operations. For target length $T_y$, source length $T_s$, context length $T_c$, hidden dimension $d$, and beam size $b$, a single forward pass through the decoder has time complexity on the order of $O\!\left(b\,(T_y^{2} d + T_y T_s d + T_y T_c d + T_y d^{2})\right)$. In the present implementation, the sequence lengths $T_y$, $T_s$, and $T_c$ are bounded by a few dozen tokens and the beam size is kept small, which helps maintain usable response time in a desktop writing-assistance scenario. However, because the model still combines BERT-enhanced encoding, dual-stream attention, and autoregressive decoding, its efficiency may be insufficient for strict real-time applications or large-scale latency-sensitive services without further optimization. The symmetric dual-encoder design, coupled with the gated decoder, should therefore be viewed as a performance-oriented architecture with acceptable efficiency in the current desktop setting rather than as a deployment-optimized real-time solution.

4. Experimental Setup

Training and evaluation of the symmetry-aware dual-encoder grammatical error correction model follow a controlled experimental protocol that specifies the hardware and software environment, dataset partition, hyperparameters, evaluation metrics, and baseline systems.

4.1. Hardware and Software Environment

Training and inference are performed on a GPU server and a desktop workstation. The GPU server is used for model training and hyperparameter search, whereas the workstation represents a typical client-side platform for running the grammar correction application. The detailed hardware configuration of both platforms is summarized in Table 4.
The GPU server is equipped with a high-performance multi-core CPU and a modern high-end NVIDIA GPU with large onboard memory and a large number of CUDA cores, enabling efficient large-batch training and rapid experimentation during model development. The desktop workstation, by contrast, uses a mid-range consumer GPU with more limited memory resources to approximate a more constrained inference environment. This setup allows us to evaluate the practicality of deploying the trained model outside the training server while maintaining interactive response times. All experiments are executed on 64-bit operating systems to ensure compatibility with contemporary deep-learning frameworks and GPU acceleration libraries.
The implementation is written in Python 3.10 and relies on a mainstream deep learning framework with CUDA-enabled GPU acceleration. The 64-bit environments ensure stable large-batch training and effective utilization of GPU memory on both platforms. It should be noted, however, that these hardware settings correspond to development and desktop deployment conditions and do not by themselves establish suitability for strict real-time or edge-side deployment.

4.2. Training, Development, and Test Sets

The experimental data are drawn from a combination of widely used learner corpora and public grammatical error correction benchmarks. All corpora are partitioned into a training set, a development set, and a test set. The training set combines manually annotated learner corpora with synthetic error data generated under a CLEC-informed category prior designed to supplement authentic learner corpora rather than to fully replicate real-world learner errors. The development and test sets contain both standard public benchmarks and a subset of Chinese EFL compositions, so that the model can be evaluated on generic English GEC tasks and on domain-specific learner writing.
The training set (TrainSet) includes NUCLE, FCE, CLEC (excluding the 800 compositions reserved for testing), and eight million sentences of synthetic data. The development set (DevSet) consists of the CoNLL-2013 test set and the JFLEG development set. The CoNLL-2013 test set is used to monitor precision-oriented GEC performance, whereas the JFLEG development set is used to monitor fluency via GLEU. The test set (TestSet) includes the CoNLL-2014 test set, the JFLEG test set, and 800 CLEC compositions sampled from the SET3 and SET4 subsets. The detailed statistics are given in Table 5.
During training, all sentences from NUCLE, FCE, CLEC and the synthetic corpus are merged into a single training pool. The CoNLL-2013 test set and the JFLEG development set jointly serve as the development set for model selection and hyperparameter tuning. Final performance is reported on the CoNLL-2014 test set, the JFLEG test set, and the 800 CLEC compositions, providing both benchmark-level and practical-application evaluation.

4.3. Hyperparameters and Training Strategy

The context encoder consists of two Transformer encoder layers followed by two bidirectional GRU layers. The Transformer encoder uses a hidden size of 512 and six attention heads in each multi-head attention block. Each bidirectional GRU layer has 256 hidden units per direction with tanh activation, so that the output contextual representation at each position has dimension 512.
The source sentence encoder reuses the same context encoder structure and stacks a bert-base-cased BERT model on top. The BERT module outputs 768-dimensional token representations, which are concatenated with the Bi-GRU outputs to form the final source-side representation. This design fuses pre-trained contextual embeddings with task-specific contextual features, but it also increases parameterization and training cost relative to lighter sentence-level architectures.
The decoder is built from six layers of masked multi-head self-attention, gated fusion, and position-wise feed-forward networks. The gating module is implemented as a single-layer GRU that takes a 768-dimensional input vector at each decoding step and produces a scalar gate for combining the context encoder and source encoder features. The feed-forward network has 2048 hidden units with a sigmoid activation function. Beam search is used during inference, with a beam width of 5 and a length penalty factor α set to 0.6 to balance short and long hypotheses. Taken together, the principal implementation-scale dimensions of the model are 512 for the encoder-side hidden representations, 768 for the BERT semantic vectors and gate input features, and 2048 for the decoder feed-forward inner layer.
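For reference, the hyperparameters listed above can be consolidated into a single configuration object, as in the illustrative sketch below; the key names are ad hoc and do not come from the actual training scripts.

```python
# Illustrative consolidation of the hyperparameters described in Section 4.3.
config = {
    "context_encoder": {"transformer_layers": 2, "hidden_size": 512,
                        "attention_heads": 6, "bigru_layers": 2,
                        "bigru_units_per_direction": 256},
    "source_encoder": {"bert_model": "bert-base-cased", "bert_dim": 768},
    "decoder": {"layers": 6, "ffn_hidden": 2048},
    "inference": {"beam_width": 5, "length_penalty_alpha": 0.6},
}
print(config["inference"])
```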
From an efficiency perspective, the use of a BERT-enhanced source encoder, a separate context encoder, Bi-GRU layers, multi-head attention, gated fusion, and an autoregressive decoder inevitably increases parameter count, memory consumption, and both training and inference complexity relative to simpler sentence-level or edit-tagging architectures. This larger model capacity improves context sensitivity and correction quality, but it may also increase overfitting risk when manually annotated learner corpora are limited. In the present study, this risk is only partially mitigated by combining multiple learner corpora with large-scale CLEC-informed offline augmentation, rather than being fully eliminated. The current system should therefore be understood as a performance-oriented desktop writing-support model, and dedicated simplification and stronger regularization remain necessary for lighter and more scalable deployment.

4.4. Evaluation Metrics

Evaluation of grammatical error correction quality relies on token-level classification into four categories.
True positives (TP) denote tokens or edits correctly predicted as part of an error and corrected by the system.
False positives (FP) denote tokens predicted as erroneous when no error is annotated in the reference.
True negatives (TN) correspond to tokens correctly left unchanged.
False negatives (FN) are real errors that the system fails to correct.
These four cases define the confusion matrix underlying the precision and recall metrics. Table 6 gives a simple example from the evaluation set to illustrate the relationship between the source sentence, the model hypothesis, and the reference sentence.
Based on TP, FP, TN, and FN, precision P and recall R are defined as
$P = \dfrac{TP}{TP + FP + \varepsilon}$
$R = \dfrac{TP}{TP + FN + \varepsilon}$
The F-score Fβ combines precision and recall with a weight factor β as
$F_{\beta} = \dfrac{(1 + \beta^{2})\,P\,R}{\beta^{2} P + R + \varepsilon}$
In grammatical error correction, precision is often considered more important than recall, so F0.5 is commonly used. It is obtained by setting β = 0.5 in the above expression:
$F_{0.5} = \dfrac{1.25\,P\,R}{0.25\,P + R + \varepsilon}$
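These definitions map directly to code. The following minimal sketch computes precision, recall, and F0.5 from toy TP/FP/FN counts, using the same smoothing constant $\varepsilon$ that appears in the formulas above; the counts in the example are invented for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5, eps: float = 1e-12):
    """Precision, recall, and F_beta with a small smoothing constant eps
    to avoid division by zero, as in the formulas above."""
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r + eps)
    return p, r, f

# toy counts: 60 correct edits, 15 spurious edits, 25 missed errors
p, r, f05 = f_beta(60, 15, 25)
print(round(p, 3), round(r, 3), round(f05, 3))  # 0.8 0.706 0.779
```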
To evaluate both grammatical correctness and fluency, the experiments also report the GLEU metric. Let C denote the set of model-generated candidate sentences, R denote the set of reference sentences, and S denote the set of source sentences. GLEU is defined as
$\mathrm{GLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} \frac{1}{N} \log p_n'\right)$
Here $N$ is the maximum n-gram order, and each order receives the equal weight $w_n = 1/N$. The brevity penalty BP penalizes overly short hypotheses and is defined as
$\mathrm{BP} = \min\!\left(1, \exp\!\left(1 - \frac{r}{c}\right)\right)$
where $c$ is the total length of the candidate sentences and $r$ is the total length of the reference sentences.
The modified n-gram precision $p_n'$ is computed as
$p_n' = \dfrac{\sum_{g \in G_n(C)} \min\{\mathrm{count}_C(g), \mathrm{count}_R(g)\}}{\sum_{g \in G_n(C)} \mathrm{count}_C(g)}$
Here $G_n(C)$ denotes the set of n-grams of order $n$ occurring in the candidate sentences, and $\mathrm{count}_B(g)$ counts the occurrences of an n-gram $g$ in a sentence set $B$:
$\mathrm{count}_B(g) = \sum_{g' \in G_n(B)} d(g, g')$
where the indicator $d(g, g')$ equals 1 when the two n-grams are identical and 0 otherwise:
$d(g, g') = \begin{cases} 1, & g = g' \\ 0, & \text{otherwise} \end{cases}$
These definitions follow the standard formulation of GLEU and allow direct comparison with published systems. Unless otherwise stated, all reported P, R, F0.5, and GLEU values in this study are point estimates computed on fixed evaluation sets. The present work does not report confidence intervals or formal statistical significance tests. This is partly because several baseline results used for comparison are taken from published studies under comparable settings rather than from paired system outputs generated under identical reruns. Therefore, the quantitative tables in Section 5 should be interpreted as descriptive benchmark comparisons, and small numerical differences should not be over-interpreted as statistically validated improvements.
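As a worked illustration of the clipped n-gram precision and brevity penalty defined above, the following simplified sketch scores a single sentence pair. It is not the official GLEU scorer: it omits source-sentence handling and corpus-level aggregation, and the smoothing constant is an assumption for numerical safety.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_gleu(candidate, reference, max_n=4):
    """Clipped n-gram precision combined with a brevity penalty, following
    the formulas above; a simplified sketch, not the official scorer."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        match = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        log_p += math.log((match + 1e-12) / (total + 1e-12)) / max_n
    bp = min(1.0, math.exp(1.0 - len(reference) / len(candidate)))
    return bp * math.exp(log_p)

cand = "he likes to play football on weekends".split()
ref = "he likes to play football on weekends".split()
print(round(simple_gleu(cand, ref), 3))  # 1.0 for an exact match
```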

4.5. Baseline Systems

To place the proposed model in context, experiments compare it against several strong baseline systems that represent different architectural choices for grammatical error correction. The first baseline is the MLConv model with edit operations (MLConv + EO) proposed by Chollampatt et al., which uses a convolutional encoder–decoder architecture and explicitly models local context and edit operations to improve correction quality. The second baseline is the BERT-fuse GED + R2L model of Kaneko et al., which fine-tunes BERT on GEC corpora and injects its token-level representations as additional features into a left-to-right and right-to-left encoder–decoder architecture.
The third baseline is the MultiEnc-dec model of Yuan et al., which introduces a multi-encoder–decoder architecture to incorporate document-level context. It processes the current sentence and surrounding sentences with separate encoders and uses cross-attention to exploit discourse-level information in the decoder. The fourth baseline is the ERRANT + Transformer system of Stahlberg et al., which combines a Transformer-based neural machine translation model with explicit error-type tags derived from the ERRANT scorer to better align model corrections with annotated error categories.
All baseline systems are evaluated on the same CoNLL-2014 and JFLEG benchmarks with the F0.5 and GLEU metrics. The published scores of these systems under comparable training and evaluation settings are used in Section 5 for a detailed quantitative comparison with the proposed dual-encoder model.
In addition to these neural baselines, recent large-language-model-based GEC results are also considered in the comparison discussion of Section 5.2. In particular, prompt-based GPT systems have recently shown strong performance on fluency-oriented benchmarks such as JFLEG, whereas their behavior on minimal-edit benchmarks is more variable because they often produce broader sentence revisions rather than strictly conservative reference-aligned corrections. For this reason, recent LLM systems are included here as literature-based comparison points rather than as fully matched baseline reruns under the same training and decoding conditions [59,60].

5. Results, Analysis, and System Implementation

5.1. Ablation Study on Data Augmentation

English grammatical error correction training data are augmented by combining the original NUCLE, FCE, and CLEC corpora with synthetic error sentences generated under a CLEC-informed category prior. These synthetic pairs are intended to supplement authentic learner data and increase exposure to frequent Chinese-learner-like categories, rather than to reproduce real-world learner grammatical errors in a fully realistic manner. The present section reports the only direct within-model ablation study included in this work, namely the effect of synthetic augmentation scale. More fine-grained ablations on BERT fusion, context encoding, and gated decoding are not included in the current manuscript and are therefore not claimed here as experimentally isolated module contributions. To quantify the impact of synthetic data size, the model is trained with 0, 2, 4, 6, and 8 million synthetic sentences in addition to the original corpora and evaluated on the CoNLL-2014 test set using the development sets described in Section 4. The evaluation metrics are precision P, recall R, and F0.5, which weights precision more strongly than recall.
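To illustrate how a category-frequency prior can drive synthetic error generation, the sketch below samples an error category and applies a toy corruption rule to produce a (noisy, clean) training pair. The category names, probabilities, and corruption rules are hypothetical and far simpler than the augmentation pipeline used in this work.

```python
import random

# Hypothetical category prior loosely shaped like corpus frequency tendencies;
# the numbers below are illustrative only.
CATEGORY_PRIOR = {"noun_number": 0.30, "sv_agreement": 0.25,
                  "article": 0.20, "preposition": 0.15, "verb_tense": 0.10}

def corrupt(sentence: str, rng: random.Random) -> tuple[str, str]:
    """Sample an error category from the prior and apply a toy corruption,
    returning a (noisy, clean) sentence pair for training."""
    cats, weights = zip(*CATEGORY_PRIOR.items())
    cat = rng.choices(cats, weights=weights, k=1)[0]
    tokens = sentence.split()
    if cat == "article" and "the" in tokens:
        tokens.remove("the")                      # drop an article
    elif cat == "noun_number" and tokens[-1].endswith("s"):
        tokens[-1] = tokens[-1][:-1]              # strip a plural marker
    return " ".join(tokens), sentence

rng = random.Random(0)
print(corrupt("she bought the books", rng))
```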
Table 7 presents the performance of the proposed model under different synthetic data scales on the CoNLL-2014 test set. Compared with the setting using only the original training data, incorporating 2 million synthetic sentences improves precision from 72.5 to 74.6, recall from 57.6 to 58.5, and F0.5 from 69.02 to 70.76. As the synthetic corpus is further expanded to 4 million, 6 million, and 8 million sentences, the performance continues to improve, with F0.5 reaching 71.53, 71.84, and 71.97, respectively. Overall, the gains are monotonic with respect to synthetic data scale, but the improvement gradually saturates beyond 6 million sentences, where the additional increase in F0.5 becomes relatively limited.
Table 8 presents the effect of synthetic data size on model performance in terms of precision, recall, and F0.5. Relative to the baseline setting, the introduction of 2 million augmented sentences increases precision from 72.1 to 73.9, recall from 58.1 to 58.7, and F0.5 from 68.6 to 70.2. When the augmented corpus is further expanded to 4 million, 6 million, and 8 million sentences, performance continues to improve steadily, with F0.5 reaching 70.9, 71.4, and 71.6, respectively. Among the three metrics, precision shows the largest overall gain, whereas recall improves more moderately. This trend suggests that the current augmentation strategy is particularly effective at strengthening the model’s confidence on frequent learner-like error categories. Nevertheless, the gains gradually saturate as the amount of augmented data increases, and the observed improvements should be understood as improved category coverage under the current augmentation mechanism rather than as proof that the synthetic corruption process fully captures authentic learner production.

5.2. Comparison with State-of-the-Art GEC Systems

The proposed dual-encoder grammatical error correction model is compared with several strong neural baselines on the CoNLL-2014 test set. These baselines include a convolutional encoder–decoder model with ensemble and edit operations, a BERT-fused encoder–decoder model, a multi-encoder document-level GEC model, and a large Transformer-based system trained on synthetic data constructed with ERRANT. All systems are evaluated using the standard precision, recall, and F0.5 metrics.
Table 9 reports the comparison results on the CoNLL-2014 test set. The proposed model achieves precision 75.8, recall 60.1, and F0.5 71.8, outperforming all listed baselines. Relative to the Transformer-based system trained with ERRANT-generated synthetic data, which obtains precision 75.3, recall 49.7, and F0.5 68.0, the proposed model improves precision by 0.5 points, recall by 10.4 points, and F0.5 by 3.8 points. Compared with the BERT-fused encoder–decoder model, which achieves precision 72.3, recall 46.7, and F0.5 65.0, the proposed model improves precision by 3.5 points, recall by 13.4 points, and F0.5 by 6.8 points. Compared with the document-aware MultiEnc-dec baseline, which achieves precision 74.0, recall 39.2, and F0.5 62.5, the proposed model yields gains of 1.8 points in precision, 20.9 points in recall, and 9.3 points in F0.5. These results indicate that the proposed architecture yields consistent but moderate gains over strong sentence-level and document-aware baselines. Given the maturity of the CoNLL-2014 benchmark and the strength of the compared systems, the improvement should be interpreted as evidence that the proposed context-integration strategy refines correction quality rather than radically changing benchmark behavior. These gains also come at the cost of increased model complexity and inference overhead.
The architectural justification of these gains can be understood as a division of labor among the main modules. The BERT-enhanced source encoder strengthens local semantic and lexical representations for sentence-internal corrections, especially when grammatical decisions depend on contextualized word usage rather than surface form alone. The separate context encoder provides cross-sentence cues for discourse-sensitive corrections such as tense consistency, pronoun interpretation, and context-compatible function-word selection. The gated decoder then regulates these two information streams so that document context is used selectively rather than uniformly, thereby reducing the risk that irrelevant surrounding sentences will interfere with corrections that are mainly determined by local syntax. In this sense, the reported gains are consistent with the intended design of the architecture: the model improves correction quality by combining stronger local representation with controlled discourse integration, rather than by relying on broader context for every edit.
To further assess generalization, the systems are also evaluated on the JFLEG test set using the GLEU metric, which jointly considers fluency and grammaticality. Table 10 lists the GLEU scores. In addition to earlier neural baselines, the table also includes recent literature-reported results for prompt-based GPT-3.5 and GPT-4 systems. The proposed model achieves GLEU 63.18, which is higher than the convolutional and multi-encoder baselines, slightly below GPT-3.5 (63.25), and 1.70 points below GPT-4 (64.88). This comparison indicates that recent generative LLMs are highly competitive on fluency-oriented GEC benchmarks. However, published analyses also show that GPT-style systems tend to perform much better on fluency-edit datasets such as JFLEG than on minimal-edit benchmarks, partly because they often prefer broader sentence revision over conservative edit-level correction. In this sense, the proposed model should be understood as a task-specific context-aware GEC architecture that remains competitive on JFLEG while targeting controlled learner-oriented correction under tighter computational and deployment constraints than general-purpose LLM inference [59,60].
This literature-based comparison should not be interpreted as a fully controlled same-environment evaluation. Recent LLM-based GEC studies often use prompt-based or API-based settings, and directly reproducible outputs on CoNLL-2014 under identical decoding and runtime constraints are not uniformly available. Moreover, manual analyses have pointed out that GPT-style models may over-correct and still struggle with punctuation, tense, article or preposition usage, and syntactic dependency errors, even when their fluency-oriented scores are strong. Accordingly, the present comparison is intended to situate the proposed model with respect to the current LLM-based GEC landscape, rather than to claim strict system-level superiority over all modern generative models [59,61].
At the same time, these comparisons should not be interpreted as full component-wise ablation results. The contrast with the BERT-fuse baseline provides only indirect evidence that the proposed architecture benefits from combining pretrained semantic features with the present dual-stream design, while the contrast with the MultiEnc-dec baseline provides only indirect evidence that the current context-integration strategy and gated fusion mechanism are effective under the adopted benchmark setting. Because the current study does not report same-backbone variants with BERT fusion removed, context encoding removed, or gated decoding replaced by a simpler fusion scheme, the individual contribution of each module is not yet fully isolated experimentally.
It should also be noted that the values reported in Table 9 and Table 10 are benchmark-level point estimates. Because the compared baselines are reproduced from published results and paired sentence-level outputs are not uniformly available, no additional significance testing is reported here. The comparison is therefore intended to show the overall performance position of the proposed model on standard benchmarks rather than to claim statistically confirmed superiority for every small metric difference.

5.3. Qualitative Analysis of Correction Examples

Quantitative metrics provide a global view of model performance but do not reveal how document-level context and the dual-encoder structure influence individual corrections. To illustrate these effects, typical examples are examined in which cross-sentence information is required to produce correct tense or agreement. Table 11 shows one representative case.
In this example, the second sentence should be in the past tense to agree with the temporal context established by the first sentence. The baseline model processes each sentence independently and therefore fails to change stay to stayed. The proposed model encodes the document-level context through the preceding-sentence encoder and feeds it to the decoder via gated attention, enabling it to enforce cross-sentence tense consistency. This case also helps explain why the aggregate gains reported in Table 8 and Table 9 are consistent but not dramatic: only a subset of grammatical errors truly depends on broader discourse context, whereas many other edits remain sentence-local and are already handled reasonably well by strong baselines. The role of the proposed architecture is therefore to improve those cases that benefit from complementary local semantics and document context, without forcing contextual information into every correction decision. Qualitative inspection of further examples shows similar advantages on subject–verb agreement and article usage when the relevant cues appear in previous sentences.

5.4. Performance on Chinese Learner Essays

To evaluate domain-specific performance on Chinese learner English, the model is applied to 800 essays sampled from the SET3 and SET4 subcorpora of CLEC. Grammar errors in these essays are annotated following the NUCLE error taxonomy, and system outputs are evaluated with the same precision, recall, and F0.5 metrics as used on CoNLL-2014. Across all error types, the model attains precision 84.85 percent, recall 73.20 percent, and F0.5 80.89, indicating strong overall correction ability on learner essays. Table 12 summarizes these aggregate metrics for the 800-essay test set.
Table 13 provides a detailed breakdown of performance by error type. For each NUCLE error category, the table lists the corresponding CLEC tags, the number of annotated errors, the number of errors the system attempts to correct, the number of correct corrections, and the resulting precision, recall, and F0.5 values.
Table 13 shows a clear long-tail pattern across error categories. The model performs best on high-frequency categories such as noun number errors (Nn), subject–verb agreement (SVA), and article or determiner errors (ArtOrDet), where precision and F0.5 exceed 0.85. These categories are strongly represented in the CLEC-informed augmentation rules and in the manually annotated learner corpora, so the model observes many training instances and learns relatively stable correction patterns. Medium performance is observed for preposition errors (Prep), adverb word-order errors (WOadv), and verb morphology errors (Vm and Vt), which remain challenging because the same abstract error type may have diverse surface realizations. By contrast, the lowest scores occur for Trans, Wci, and WOinc, and their weakness cannot be explained by low frequency alone. The current augmentation strategy mainly introduces high-frequency learner-like errors through insertion/deletion rules and category-level confusion-set substitution. This mechanism is effective for frequent morphology- and function-word-related errors, but it does not model the structure of rare phrase-order and context-sensitive lexical errors in a sufficiently realistic way.
More specifically, Trans and WOinc usually involve phrase-level reordering, omission inside a syntactic construction, or incomplete constituent sequencing, whereas the present augmentation procedure mainly injects local token-level perturbations. As a result, the model sees relatively few synthetic examples that preserve the global sentence meaning while changing constituent order in a learner-like manner. Wci is even more difficult because correct prediction depends not only on grammatical form but also on semantics, collocation, and discourse appropriateness; these properties are poorly approximated by small predefined confusion sets. In addition, the annotated counts for Trans (4), Wci (5), and WOinc (17) are extremely small, so the reported F0.5 values are numerically unstable and can change substantially with only one or two additional correct corrections. Therefore, the current results indicate both a genuine modeling limitation and a long-tail data problem. Practically, these categories are unlikely to improve substantially under the present augmentation mechanism unless rare-class handling is made more targeted. Promising directions include category-balanced oversampling, loss reweighting or curriculum scheduling for long-tail classes, dependency- or constituency-aware reordering templates for Trans and WOinc, and contextual lexical substitution or revision-log-based pseudo-error generation for Wci. The values in Table 13 are corpus-level point estimates computed on the fixed 800-essay CLEC test split; confidence intervals are not reported in the present study, and uncertainty quantification through document-level bootstrap resampling is left for future work.
Table 14 ranks all error types by precision, highlighting the long tail of low-frequency categories with modest precision values. Error types Nn and SVA appear at the top of the ranking, confirming that the model handles frequent morphological and agreement errors reliably. Table 15 ranks error types by recall and shows a similar pattern, with recall dropping more sharply on rare categories. These rankings emphasize the importance of increasing annotated coverage or designing targeted augmentation rules for low-frequency error classes.
To assess scoring consistency, the grammar scores assigned by the model were compared with those assigned by human teachers on the same set of 800 essays. The average difference between model and human scores was approximately 0.33 points. The agreement was particularly strong for higher-scoring essays, whereas relatively larger deviations were observed in the mid- and low-score ranges. These deviations may be attributed to subjectivity in human rating standards as well as the limited coverage of lexical- and discourse-level errors in the current model. Overall, the relatively small average discrepancy indicates that the model produces stable and consistent grammar scores and can serve as a reliable automatic indicator of grammatical quality.

5.5. System Implementation and Runtime Behavior

The trained grammatical error correction model is currently demonstrated through a Java-based desktop application for automatic essay grammar checking. This desktop program should be regarded as a proof-of-concept client for local user interaction rather than as the intended final deployment paradigm. In the current prototype, the application is implemented in Java with a Swing-based user interface, while the back end loads the trained model parameters and exposes an inference interface that accepts raw essay text, applies the same preprocessing pipeline as in training, and returns corrected sentences together with sentence-level and essay-level grammar scores. From a software-engineering perspective, the same inference pipeline can also be encapsulated as a remote model service and accessed by browser-based clients, mobile clients, or institutional writing platforms through standard API calls.
The software architecture is organized into four logical layers. The client layer manages text input, progress messages, and result display. The request layer performs input validation, session management, and essay segmentation, and then forwards normalized sentence batches to the model service. The inference layer runs the dual-encoder and decoder network on GPU-equipped workers, reconstructs corrected sentences from the decoder outputs, and aggregates token-level decisions into sentence-level and document-level scores. The response layer merges corrected sentences, highlights modified tokens, and returns both corrected output and grammar scores to the calling client. In the current desktop prototype, these layers are packaged within a single local application. However, the same logical decomposition is directly compatible with a modern client–server architecture in which lightweight front-end clients communicate with back-end services through RESTful or gRPC-style APIs, while preprocessing, GPU inference, authentication, logging, and result post-processing are deployed as decoupled service modules.
From a scalability perspective, the deployment unit of the current system is not the whole essay as a monolithic sequence, but a set of sentence-level correction jobs accompanied by short context windows. This makes essay processing naturally compatible with queue-based batching and with parallel handling of multiple user submissions. In a cloud-oriented deployment scenario, front-end requests can be routed through an API gateway, transformed into sentence batches by a preprocessing service, dispatched to one or more GPU inference workers, and then merged back into essay-level feedback by a post-processing service. Such a client–server design would support horizontal scaling at the request-routing level and controlled scaling at the GPU-inference level, while also enabling authentication, monitoring, request logging, and failure recovery through standard cloud-service components. In addition, user actions after correction—such as accepting, rejecting, or manually rewriting a proposed edit—can be stored as feedback events and later used to construct revision logs for quality monitoring, error analysis, and future continuous-learning pipelines. Nevertheless, the present study does not yet report formal throughput benchmarks under concurrent multi-user load, cloud-based autoscaling experiments, or production-level latency measurements across a distributed environment. Accordingly, the current implementation should be interpreted as demonstrating desktop-assisted and small-batch feasibility rather than full real-world cloud deployment readiness.
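The sentence-level deployment unit described above can be illustrated with a minimal sketch that splits an essay into correction jobs carrying short preceding-sentence context windows; the class name, function name, and window size are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class CorrectionJob:
    """One sentence-level correction request plus its short context window."""
    essay_id: str
    sentence_index: int
    source: str
    context: list[str]

def make_jobs(essay_id: str, sentences: list[str], window: int = 2) -> list[CorrectionJob]:
    """Split an essay into independent sentence jobs, each carrying the
    preceding sentences as document context (a simplified sketch of the
    batching unit described above)."""
    return [CorrectionJob(essay_id, i, s, sentences[max(0, i - window):i])
            for i, s in enumerate(sentences)]

jobs = make_jobs("essay-001", ["I was tired yesterday.", "I stay at home."])
print(jobs[1].context)  # ['I was tired yesterday.']
```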
Runtime observations on the desktop configuration described in Section 4.1 indicate that end-to-end correction of typical essays is feasible within an interactive desktop workflow and fits within the available GPU memory. Nevertheless, the integration of BERT, dual encoders, attention modules, and autoregressive decoding still introduces non-trivial computational overhead. In a real-world client–server deployment, the main scalability bottlenecks would arise from concurrent request volume, GPU scheduling efficiency, sentence-length variation, beam-search decoding cost, API-service latency, and the coordination cost between preprocessing, inference, and post-processing components rather than from desktop execution alone. For this reason, the present implementation should be regarded as a validated proof-of-concept client plus local inference workflow, while larger-scale online deployment would still require cloud orchestration, dynamic batching, worker-level load balancing, API hardening, observability support, and additional latency profiling under realistic traffic conditions.

6. Conclusions and Future Work

6.1. Summary of the Work

This work designed and implemented a context-aware grammatical error correction model targeted at English essays written by Chinese learners. The model adopts a symmetry-aware dual-encoder–decoder architecture in which a context encoder captures document-level information from preceding sentences, and a source encoder fuses BERT-based semantic representations with Bi-GRU-based syntactic features of the current sentence. A gated attention decoder aggregates the two encoder streams in an asymmetric manner, enabling the model to exploit document context while preserving local sentence accuracy.
On the data side, a rule-based and distribution-aware augmentation strategy was constructed on top of CLEC and other learner corpora. Synthetic error sentences were generated under a CLEC-informed category prior, covering tense, article, preposition, noun number, subject–verb agreement, and related categories at a coarse distributional level rather than as an exact reproduction of authentic learner errors. Combining these synthetic pairs with NUCLE, FCE, and other corpora yielded a larger and better balanced training set for the proposed model.
Extensive experiments on the CoNLL-2014 and JFLEG benchmarks demonstrated that the proposed model improves both F0.5 and GLEU compared with strong neural baselines, including convolutional seq2seq models, BERT-fused encoder–decoder models, and Transformer-based systems with synthetic pretraining. Additional evaluation on CLEC essays showed high precision and competitive F0.5 across major error types such as Nn, SVA, and ArtOrDet, confirming the effectiveness of the context-aware dual-encoder design for Chinese learner English. Finally, the trained model was integrated into a Java-based desktop application that provides automatic grammar checking and scoring for English essays in an interactive desktop workflow on standard hardware, although further optimization is still required for stricter real-time deployment scenarios.

6.2. Future Directions

Several limitations of the current work suggest directions for further research. First, although the dual-encoder architecture captures short document context using the two preceding sentences, it does not model long-range discourse phenomena such as paragraph-level topic shifts or global coherence. Extending the context encoder with hierarchical or memory-augmented structures, or replacing it with lightweight long-context Transformers, may further improve corrections that depend on global document information.
Second, the data augmentation strategy is tightly coupled to the error taxonomy extracted from CLEC and primarily focuses on a fixed set of high-frequency error types. Because the synthetic corruption process is guided by predefined category frequencies and confusion sets, it cannot fully capture the diversity, contextual dependence, and individual variability of real-world learner errors. This limitation is particularly evident for long-tail categories such as transposition, incomplete word order, and context-sensitive lexical choice, where token-level corruption is often insufficient. Future work should therefore move beyond coarse confusion-set substitution toward more targeted rare-error augmentation, including dependency- or constituency-aware phrase reordering templates, category-balanced oversampling and loss reweighting for long-tail classes, contextual lexical substitution guided by masked language models or large language models, and pseudo-error construction derived from authentic learner revision logs. More realistic pseudo-error generation, semi-supervised learning with unlabeled learner corpora, and multi-task objectives that jointly optimize GEC with language modeling or sentence-level quality estimation are also promising directions for improving rare but pedagogically important error categories.
Third, the current system is demonstrated through a standalone desktop application and performs inference in a resource-supported workstation setting, but this should be viewed as a prototype interaction client rather than as the target deployment architecture. Because the model integrates a BERT-enhanced source encoder, a separate context encoder, Bi-GRU layers, attention-based fusion, and autoregressive decoding, its memory usage, training cost, and inference complexity remain higher than those of lighter GEC architectures. The relatively large parameterization may also increase overfitting risk when annotated learner corpora are limited, even though the multi-corpus training pool and CLEC-informed augmentation partly alleviate this problem. In addition, the present work does not yet provide formal scalability evaluation under concurrent multi-user load, distributed inference settings, or cloud-service deployment. For large-scale deployment in online writing platforms or strict real-time scenarios, future work should therefore investigate a cloud-oriented client–server architecture with API gateways, decoupled request routing, queue-based sentence batching, multiple GPU inference workers, result-merging services, user-session management, and monitoring/logging support. A more complete feedback loop should also be established so that accepted, rejected, and manually revised corrections can be stored as structured usage signals for quality auditing, user adaptation, and subsequent continual-learning or federated-training pipelines. These system-level improvements should be combined with model distillation, parameter sharing, pruning, quantization, lighter contextual encoders, and stronger regularization strategies to reduce latency, hardware cost, and model redundancy.
Fourth, although the current experiments compare the proposed model with representative BERT-enhanced and document-level baselines, they do not yet include a full same-backbone ablation suite that separately removes BERT fusion, document-context encoding, or gated source-context fusion. As a result, the present study can show the effectiveness of the overall architecture, but cannot completely disentangle the quantitative contribution of each module in isolation. Future work should therefore implement systematic component-wise ablations, including encoder simplification, removal of contextual fusion, replacement of the gate with fixed fusion rules, and depth variation in the context encoder, so as to obtain a more fine-grained understanding of which architectural choices contribute most strongly to correction quality.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The author would like to thank Shanxi University for providing general academic and institutional support during the preparation of this study.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Leacock, C.; Chodorow, M.; Gamon, M.; Tetreault, J. Automated Grammatical Error Detection for Language Learners; Morgan & Claypool Publishers: San Rafael, CA, USA, 2014. [Google Scholar]
  2. Bryant, C.; Yuan, Z.; Qorib, M.R.; Cao, H.; Ng, H.T.; Briscoe, T. Grammatical error correction: A survey of the state of the art. Comput. Linguist. 2023, 49, 643–701. [Google Scholar] [CrossRef]
  3. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112. [Google Scholar]
  4. Omelianchuk, K.; Atrasevych, V.; Chernodub, A.; Skurzhanskyi, O. GECToR–grammatical error correction: Tag, not rewrite. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, WA, USA, 10 July 2020; pp. 163–170. [Google Scholar]
  5. Sorokin, A. Improved grammatical error correction by ranking elementary edits. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 11416–11429. [Google Scholar]
  6. Long, J. A grammatical error correction model for English essay words in colleges using natural language processing. Mob. Inf. Syst. 2022, 2022, 1881369. [Google Scholar] [CrossRef]
  7. Ng, H.T.; Wu, S.M.; Briscoe, T.; Hadiwinoto, C.; Susanto, R.H.; Bryant, C. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, Baltimore, MD, USA, 26–27 June 2014; pp. 1–14. [Google Scholar]
  8. Gamon, M.; Chodorow, M.; Leacock, C.; Tetreault, J. Grammatical error detection in automatic essay scoring and feedback. In Handbook of Automated Essay Evaluation; Routledge: Milton Park, UK, 2013; pp. 251–266. [Google Scholar]
  9. Brockett, C.; Dolan, W.B.; Gamon, M. Correcting ESL errors using phrasal SMT techniques. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 20 July 2006; pp. 249–256. [Google Scholar]
  10. Dahlmeier, D.; Ng, H.T. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal, QC, Canada, 3–8 June 2012; pp. 568–572. [Google Scholar]
  11. Bryant, C.; Felice, M.; Andersen, Ø.E.; Briscoe, T. The BEA-2019 Shared Task on Grammatical Error Correction. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications, Florence, Italy, 2 August 2019. [Google Scholar]
  12. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  14. Chollampatt, S.; Ng, H.T. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  15. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  16. Grundkiewicz, R.; Junczys-Dowmunt, M.; Heafield, K. Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative use of NLP for Building Educational Applications, Florence, Italy, 2 August 2019; pp. 252–263. [Google Scholar]
  17. Kiyono, S.; Suzuki, J.; Mita, M.; Mizumoto, T.; Inui, K. An empirical study of incorporating pseudo data into grammatical error correction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1236–1242. [Google Scholar]
  18. Stahlberg, F.; Kumar, S. Synthetic data generation for grammatical error correction with tagged corruption models. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, Online, 20 April 2021; pp. 37–47. [Google Scholar]
  19. Rothe, S.; Mallinson, J.; Malmi, E.; Krause, S.; Severyn, A. A simple recipe for multilingual grammatical error correction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 1–6 August 2021; pp. 702–707. [Google Scholar]
  20. Koyama, S.; Takamura, H.; Okazaki, N. Various errors improve neural grammatical error correction. In Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation, Shanghai, China, 5–7 November 2021; pp. 251–261. [Google Scholar]
  21. Dahlmeier, D.; Ng, H.T.; Wu, S.M. Building a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, GA, USA, 13 June 2013; pp. 22–31. [Google Scholar]
  22. Mizumoto, T.; Komachi, M.; Nagata, M.; Matsumoto, Y. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of the 5th International Joint Conference on Natural Language Processing, Chiang Mai, Thailand, 8–13 November 2011; pp. 147–155. [Google Scholar]
  23. Liu, Z.; Yi, X.; Sun, M.; Yang, L.; Chua, T.-S. Neural quality estimation with multiple hypotheses for grammatical error correction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5441–5452. [Google Scholar]
  24. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  25. Yannakoudakis, H.; Briscoe, T.; Medlock, B. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA, 19–24 June 2011; pp. 180–189. [Google Scholar]
  26. Rei, M.; Yannakoudakis, H. Compositional sequence labeling models for error detection in learner writing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 1181–1191. [Google Scholar]
  27. Kaneko, M.; Mita, M.; Kiyono, S.; Suzuki, J.; Inui, K. Encoder-decoder models can benefit from pre-trained masked language models in grammatical error correction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020; pp. 4248–4254. [Google Scholar]
  28. Katsumata, S.; Komachi, M. Stronger baselines for grammatical error correction using a pretrained encoder-decoder model. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; pp. 827–832. [Google Scholar]
  29. Lichtarge, J.; Alberti, C.; Kumar, S.; Shazeer, N.; Parmar, N.; Tong, S. Corpora generation for grammatical error correction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 3291–3301. [Google Scholar]
  30. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Interspeech, Makuhari, Japan, 26–30 September 2010; pp. 1045–1048. [Google Scholar]
  31. Zhang, K.; Luo, W.; Zhong, Y.; Ma, L.; Liu, W.; Li, H. Adversarial spatio-temporal learning for video deblurring. IEEE Trans. Image Process. 2018, 28, 291–301. [Google Scholar] [CrossRef] [PubMed]
  32. Zhang, J.; Zhang, Y.; Yao, E. A New Framework for Traffic Conflict Identification by Incorporating Predicted High-Resolution Trajectory and Vehicle Profiles in a CAV Context. Transp. Res. Rec. J. Transp. Res. Board 2025, 2679, 445–462. [Google Scholar] [CrossRef]
  33. Sun, T.; Xia, W.; Shu, J.; Sang, C.; Feng, M.; Xu, X. Advances and Challenges in Machine Learning for RNA-Small Molecule Interaction Modeling: Review. J. Chem. Theory Comput. 2025, 21, 8615–8633. [Google Scholar] [CrossRef]
  34. Xia, W.; Shu, J.; Sang, C.; Wang, K.; Wang, Y.; Sun, T.; Xu, X. The prediction of RNA-small-molecule ligand binding affinity based on geometric deep learning. Comput. Biol. Chem. 2025, 115, 108367. [Google Scholar] [CrossRef] [PubMed]
  35. Huang, X.; Zeng, L.; Wang, Y.; Liang, H.; Xu, X.; White, M. From neighborhoods to streetscapes: Pandemic-era shifts in built-environment effects on pedestrian mobility. Cities 2025, 170, 106685. [Google Scholar] [CrossRef]
  36. Chen, Y.; Zhu, F.; Zheng, Z.; Ma, J.; Zhou, B. Guardnet: An imbalance-aware graph neural network for fraud detection. Data Min. Knowl. Discov. 2026, 40, 14. [Google Scholar] [CrossRef]
  37. Huang, J.; He, X.; Zou, S.; Ling, K.; Zhu, H.; Jiang, Q.; Zhang, Y.; Feng, Z.; Wang, P.; Duan, X.; et al. A Flexible Electrochemical Sensor Based on Porous Ceria Hollow Microspheres Nanozyme for Sensitive Detection of H2O2. Biosensors 2025, 15, 664. [Google Scholar] [CrossRef]
  38. Yang, Y.; Jin, B.-B.; Sun, X.; Zhang, X.-D.; Li, B.; Zhao, K.; Wang, H. Exact counting of subtrees with diameter no more than d in trees: A generating function approach. Inf. Comput. 2025, 307, 105353. [Google Scholar] [CrossRef]
  39. Zhao, C.; Meng, Z.; Yi, J.; Chen, C.Q. Auxetic metamaterials with double re-entrant configuration. Int. J. Mech. Sci. 2025, 301, 110505. [Google Scholar] [CrossRef]
  40. Huang, X.; Liang, H.; Wang, Y.; Wang, Y.; Li, D.; Zhang, B. Narrative as cognitive infrastructure reduces semantic opacity in virtual industrial heritage. npj Herit. Sci. 2026, 14, 126. [Google Scholar] [CrossRef]
  41. Mao, B.; Xiang, Y.; Zhang, Y.; Huang, Y.; Chen, P.; Cui, G.; Qu, J. A Function-Structure-Integrated Optical Fingertip with Rigid-Soft Coupling Enabling Self-Decoupled Multimodal Underwater Sensing. Adv. Funct. Mater. 2026, 2026, e22722. [Google Scholar] [CrossRef]
  42. Zhao, C.; Liu, Y.; Zhang, J.; Xu, C.; Ying, C.; Wang, Q.; Meng, Z. Programmable Multifunctional Auxetic Metamaterials Via Concave Rib Architecture. Eng. Struct. 2026, 357, 122501. [Google Scholar] [CrossRef]
  43. Liu, J.; Du, S.; Huang, Z.; Liu, N.; Shao, Z.; Qin, N.; Wang, Y.; Wang, H.; Ni, Z.; Yang, L. Enhanced reduction of nitrate to ammonia at the Co-N heteroatomic interface in MOF-derived porous carbon. Materials 2025, 18, 2976. [Google Scholar] [CrossRef]
  44. Zhang, L.T.; Duan, Y.J.; Li, B.W.; Qiao, J.C. Developments in the Homogeneous Deformation of Metallic Glasses: A Brief Review. Rare Metal Mater. Eng. 2026. [Google Scholar] [CrossRef]
  45. Zhang, D.; Li, R.; Luo, H.; Meng, Z.; Yao, J.; Liu, H.; Huang, Y.; Li, S.; Yu, P.; Yang, J.; et al. Break-through amplified spontaneous emission with ultra-low threshold in perovskite via synergetic moisture and BHT dual strategies. Light. Sci. Appl. 2026, 15, 99. [Google Scholar] [CrossRef]
  46. Ren, H.; Miao, Z.; Feng, X.; Labidi, A.; Zhao, Y.; Wang, C. Modulating the built-in electric field of S-scheme heterojunction via oxygen vacancies for boosting photocatalytic ciprofloxacin degradation. Chin. Chem. Lett. 2026, 2026, 112557. [Google Scholar] [CrossRef]
  47. He, Z.; Wu, J.; Cheng, Y.; Hu, Z.; Peng, Q.; Chen, Y. Experimental study on high-temperature thermal oxidation and laser ignition of aged boron powder. Acta Astronaut. 2025, 233, 30–41. [Google Scholar] [CrossRef]
  48. Xu, C.; Ding, J.; Luo, K.; Yang, X.; Liu, L.; Yao, L.; Chen, Q.; Zhang, Y.; Ding, Y.; Wang, B.; et al. Dielectric–magnetic synergized pores modulation engineering in polymer aerogels for integrated electromagnetic wave absorption and infrared stealth. Compos. Sci. Technol. 2026, 277, 111552. [Google Scholar] [CrossRef]
  49. Jiang, L.; Zhang, W.; Ma, X.; Wang, Y.; Chen, X.; Hu, Z.; Jing, J.; Yu, H. Tetradecylamine/MXene/porous sorghum straw biochar composite with high latent heat for thermal energy storage and efficient photo-electro-thermal conversion. J. Energy Storage 2026, 154, 121343. [Google Scholar] [CrossRef]
  50. Meng, Z.; Tan, Y.; Duan, Y.; Li, M. Metabolic modulation of TCA cycle by S-nitrosylation in Monascus spp. Synth. Syst. Biotechnol. 2026, 12, 490–501. [Google Scholar] [CrossRef]
  51. Shih, Y.-C. Ecosystem Integration of Marine Conservation and Coastal Development in Taiwan. Front. Environ. Sci. 2026, 14, 1764841. [Google Scholar] [CrossRef]
  52. Omelianchuk, K.; Liubonko, A.; Skurzhanskyi, O.; Chernodub, A.; Korniienko, O.; Samokhin, I. Pillars of grammatical error correction: Comprehensive inspection of contemporary approaches in the era of large language models. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024), Mexico City, Mexico, 20 June 2024; pp. 17–33. [Google Scholar]
  53. Katinskaia, A.; Yangarber, R. GPT-3.5 for grammatical error correction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 20–25 May 2024; pp. 7831–7843. [Google Scholar]
  54. Li, W.; Luo, W.; Peng, G.; Wang, H. Explanation based in-context demonstrations retrieval for multilingual grammatical error correction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 4881–4897. [Google Scholar]
  55. Luhtaru, A.; Purason, T.; Vainikko, M.; Del, M.; Fishel, M. To err is human, but llamas can learn it too. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 12466–12481. [Google Scholar]
  56. Qorib, M.R.; Aji, A.F.; Ng, H.T. Efficient and interpretable grammatical error correction with mixture of experts. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 17127–17138. [Google Scholar]
  57. Gomez, F.P.; Rozovskaya, A. Low-Resource Grammatical Error Correction: Selective Data Augmentation with Round-Trip Machine Translation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 25749–25770. [Google Scholar]
  58. Park, T.; Do, H.; Lee, G. Leveraging What’s Overfixed: Post-Correction via LLM Grammatical Error Overcorrection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 5–9 November 2025; pp. 28183–28195. [Google Scholar]
  59. Coyne, S.; Sakaguchi, K.; Galvan-Sosa, D.; Zock, M.; Inui, K. Analyzing the performance of gpt-3.5 and gpt-4 in grammatical error correction. arXiv 2023, arXiv:2303.14342. [Google Scholar] [CrossRef]
  60. Calzolari, N.; Kan, M.-Y.; Hoste, V.; Lenci, A.; Sakti, S.; Xue, N. (Eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024); ELRA and ICCL: Turin, Italy, 2024. [Google Scholar]
  61. Staruch, R.; Gralinski, F.; Dzienisiewicz, D. Adapting LLMs for Minimal-edit Grammatical Error Correction. arXiv 2025, arXiv:2506.13148. [Google Scholar] [CrossRef]
Table 1. Statistics of the JFLEG corpus.
Corpus | Sentences | Tokens
JFLEG Dev Set | 768 | 13,860
JFLEG Test Set | 739 | 144,260
Table 2. Statistics of the FCE corpus.
Corpus | Sentences | Tokens
FCE Train Set | 27,850 | 453,120
FCE Dev Set | 2120 | 34,780
FCE Test Set | 2680 | 41,860
Table 3. Statistics of CLEC subsets ST2–ST6 used in this work.
Corpus | Sentences | Tokens
CLEC ST2 | 120,750 | 1,263,880
CLEC ST3 | 111,820 | 1,093,610
CLEC ST4 | 109,720 | 1,127,950
CLEC ST5 | 111,680 | 1,227,090
CLEC ST6 | 111,850 | 1,423,940
Table 4. Hardware and software environment for training and inference.
Device | Resource | Configuration
GPU server | CPU | Intel Core i9-14900K @ 3.20 GHz
GPU server | GPU | NVIDIA GeForce RTX 5090
GPU server | GPU memory | 32 GB
GPU server | CUDA cores | 18,432
GPU server | System memory | 128 GB
GPU server | Operating system | Ubuntu 22.04 LTS (64-bit)
Desktop workstation | CPU | Intel Core i7-13700H @ 2.40 GHz
Desktop workstation | GPU | NVIDIA GeForce RTX 4070
Desktop workstation | GPU memory | 8 GB
Desktop workstation | CUDA cores | 4608
Desktop workstation | System memory | 64 GB
Desktop workstation | Operating system | Windows 11 Pro (64-bit)
Table 5. Datasets used for training, development, and testing.
Set | Corpus | Error Types | Sentences | Tokens | Source
TrainSet | Synthetic augmented data | 29 | 7.8 M | — | —
TrainSet | NUCLE | 27 | 568,500 | 3,820,450 | Student essays
TrainSet | FCE | 70 | 319,200 | 452,860 | Examination scripts
TrainSet | CLEC | 62 | 1,065,840 | 1,265,862 | Exam essays
DevSet | CoNLL-2013 Test Set | 27 | 1375 | 29,110 | Student essays
DevSet | JFLEG Dev Set | — | 760 | 13,960 | Examination scripts
TestSet | CoNLL-2014 Test Set | 27 | 1306 | 300,120 | Student essays
TestSet | JFLEG Test Set | — | 738 | 14,060 | Examination scripts
TestSet | CLEC (SET3, SET4) | 62 | 4320 | 418,750 | Examination scripts
Note: "—" indicates that no official fine-grained error-type taxonomy is available for JFLEG or that the corresponding statistic was not separately reported.
Table 6. Example from the evaluation set illustrating TP, TN, FP, and FN for grammatical error correction.
Sample Type | Sentence
Original | He like to play football on weekend.
Hypothesis | He likes to play the football on weekend.
Reference | He likes to play football on weekends.
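To relate the example in Table 6 to the scores reported in Tables 7–10, the following minimal Python sketch shows how edit-level true positives, false positives, and false negatives translate into P, R, and F0.5. The difflib-based edit extraction is a simplified stand-in for the standard GEC scorers used for the reported results and is intended only to illustrate the counting logic.

```python
# Minimal illustration of edit-level scoring for the Table 6 example.
# The difflib-based edit extraction is a simplified stand-in for standard
# GEC scorers; it only shows how TP/FP/FN yield precision, recall, and F0.5.
import difflib

def extract_edits(source: str, target: str):
    """Return the set of (position, source span, target span) edits."""
    src, tgt = source.split(), target.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    edits = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.add((i1, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits

source     = "He like to play football on weekend ."
hypothesis = "He likes to play the football on weekend ."
reference  = "He likes to play football on weekends ."

sys_edits  = extract_edits(source, hypothesis)  # edits proposed by the system
gold_edits = extract_edits(source, reference)   # edits required by the reference

tp = len(sys_edits & gold_edits)  # proposed and required ("likes")
fp = len(sys_edits - gold_edits)  # proposed but not required ("the" insertion)
fn = len(gold_edits - sys_edits)  # required but missed ("weekends")

p = tp / (tp + fp) if tp + fp else 0.0
r = tp / (tp + fn) if tp + fn else 0.0
beta = 0.5
f05 = (1 + beta**2) * p * r / (beta**2 * p + r) if p and r else 0.0
print(f"TP={tp} FP={fp} FN={fn}  P={p:.2f} R={r:.2f} F0.5={f05:.2f}")
# -> TP=1 FP=1 FN=1  P=0.50 R=0.50 F0.5=0.50
```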
Table 7. Results of data augmentation with different synthetic data sizes on the CoNLL-2014 test set.
Training Data | P | R | F0.5
Original training data only | 72.5 | 57.6 | 69.02
Original data + 2M synthetic sentences | 74.6 | 58.5 | 70.76
Original data + 4M synthetic sentences | 75.4 | 59.2 | 71.53
Original data + 6M synthetic sentences | 75.7 | 59.5 | 71.84
Original data + 8M synthetic sentences | 75.9 | 59.7 | 71.97
Table 8. Effect of synthetic data size on model performance.
Method | P | R | F0.5
Baseline | 72.1 | 58.1 | 68.6
Baseline + 2M augmented data | 73.9 | 58.7 | 70.2
Baseline + 4M augmented data | 74.8 | 59.4 | 70.9
Baseline + 6M augmented data | 75.3 | 59.9 | 71.4
Baseline + 8M augmented data | 75.6 | 60.2 | 71.6
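The augmentation results in Tables 7 and 8 rely on synthetic errors sampled according to corpus-level category frequencies. The sketch below outlines one way such frequency-weighted noising can be implemented; the category names, weights, and corruption rules are illustrative placeholders rather than the actual CLEC-derived statistics or the exact rules used to build the synthetic training data.

```python
# Minimal sketch of frequency-weighted synthetic error generation.
# Category names, relative weights, and noising rules are illustrative
# placeholders, not the actual CLEC statistics or the exact corruption rules
# behind the synthetic data reported in Tables 7 and 8.
import random

ERROR_PRIOR = {                 # coarse sampling prior over error categories
    "article_deletion": 0.35,   # drop "the" / "a" / "an"
    "s_drop":           0.40,   # drop a word-final "-s" (agreement / plural)
    "prep_swap":        0.25,   # swap a common preposition
}
PREPS = ["in", "on", "at", "for", "to"]

def corrupt(tokens, rng):
    """Apply one sampled corruption to a clean token list; may be a no-op."""
    cats = list(ERROR_PRIOR)
    cat = rng.choices(cats, weights=[ERROR_PRIOR[c] for c in cats], k=1)[0]
    tokens = list(tokens)
    if cat == "article_deletion":
        idx = [i for i, t in enumerate(tokens) if t.lower() in {"the", "a", "an"}]
        if idx:
            del tokens[rng.choice(idx)]
    elif cat == "s_drop":
        idx = [i for i, t in enumerate(tokens) if t.endswith("s") and len(t) > 3]
        if idx:
            i = rng.choice(idx)
            tokens[i] = tokens[i][:-1]
    else:  # prep_swap
        idx = [i for i, t in enumerate(tokens) if t.lower() in PREPS]
        if idx:
            i = rng.choice(idx)
            tokens[i] = rng.choice([p for p in PREPS if p != tokens[i].lower()])
    return tokens

rng = random.Random(0)
clean = "He likes to play football on weekends .".split()
noisy = corrupt(clean, rng)
print(" ".join(noisy), "->", " ".join(clean))  # (noisy source, clean target) pair
```

In practice, each clean sentence drawn from a large monolingual source would be paired with its corrupted counterpart to form additional (erroneous, corrected) training pairs, which are then mixed with the original corpora at the sizes listed above.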
Table 9. Comparison with baseline systems on the CoNLL-2014 test set.
Model | P | R | F0.5
Chollampatt et al., MLConv (4 ens.) + EO | 65.2 | 30.4 | 52.9
Kaneko et al., BERT-fuse GED + R2L | 72.3 | 46.7 | 65.0
Yuan et al., MultiEnc-dec * | 74.0 | 39.2 | 62.5
Stahlberg et al., ERRANT + Transformer * | 75.3 | 49.7 | 68.0
Proposed Bi-GRU + Transformer + Attention | 75.8 | 60.1 | 71.8
Note: Models marked with * are Transformer-based baselines.
Table 10. Model performance on the JFLEG test set.
Model | GLEU
Chollampatt et al., MLConv (4 ens.) + EO | 57.25
Kaneko et al., BERT-fuse GED + R2L | 61.75
Yuan et al., MultiEnc-dec * | 58.20
Stahlberg et al., ERRANT + Transformer * | 64.32
GPT-3.5 (text-davinci-003, literature) | 63.25
GPT-4 (gpt-4-0314, literature) | 64.88
Human performance | 66.75
Proposed Bi-GRU + Transformer + Attention | 63.18
Note: Models marked with * are Transformer-based baselines.
Table 11. Example 1 of grammatical error correction using the proposed model.
Type | Sentence
Source | Last year, I travel to the Paris. I visit many famous place around city.
Baseline output | Last year, I travel to the Paris. I visit many famous place around city.
Proposed model output | Last year, I traveled to Paris. I visited many famous places around the city.
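The tense corrections in Table 11 depend on evidence from the preceding sentence ("Last year"), which is exactly the kind of signal a context encoder can supply. The sketch below shows one generic way of gating two encoder outputs so that local and contextual evidence are blended position by position; the dimensions, naming, and wiring are illustrative assumptions, not the exact dual-encoder architecture of this paper.

```python
# Minimal PyTorch sketch of gating two encoder outputs (illustrative only;
# not the paper's exact dual-encoder / gated-decoder design). For simplicity
# the context states are assumed to be already aligned to source positions,
# e.g., via an attention step that is omitted here.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, source_state, context_state):
        # source_state, context_state: (batch, seq_len, d_model)
        g = torch.sigmoid(self.gate(torch.cat([source_state, context_state], dim=-1)))
        # Position-wise, asymmetric blend: the gate decides how much context to admit.
        return g * source_state + (1 - g) * context_state

fusion = GatedFusion(d_model=512)
src = torch.randn(2, 20, 512)   # encoder states of the current sentence
ctx = torch.randn(2, 20, 512)   # context states aligned to source positions
print(fusion(src, ctx).shape)   # torch.Size([2, 20, 512])
```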
Table 12. Metrics over 800 Chinese learner essays.
Metric | Score
P | 84.85
R | 73.20
F0.5 | 80.89
Table 13. Error-type-wise correction performance on 800 CLEC essays.
NUCLE Type | CLEC Tags | Annotated Errors | System Corrections | Correct Corrections | P | R | F0.5
V0 | wd4 | 88 | 80 | 67.2 | 0.8400 | 0.7636 | 0.7309
Vform | vp2/vp4/vp5/vp7/vp8 | 67.2 | 57.6 | 51.2 | 0.8889 | 0.7619 | 0.7150
Vm | vp9 | 38.4 | 32 | 27.2 | 0.8500 | 0.7083 | 0.6874
Vt | vp6 | 49.6 | 46.4 | 40 | 0.8621 | 0.8065 | 0.7576
ArtOrDet | np3/np7/np9 | 56 | 51.2 | 43.2 | 0.8438 | 0.7714 | 0.7803
Ssub | sn1/sn2/sn6/sn7 | 216 | 188.8 | 155.2 | 0.8220 | 0.7185 | 0.7352
Rloc- | wd5/wd6 | 84.8 | 78.4 | 60.8 | 0.7755 | 0.7169 | 0.7154
Trans | pr5 | 8 | 6.4 | 1.6 | 0.2500 | 0.2000 | 0.2381
Pref | pr1/pr2/pr6 | 38.4 | 32 | 24 | 0.7500 | 0.6250 | 0.6800
Pform | pr3/pr4 | 22.4 | 17.6 | 12.8 | 0.7273 | 0.5714 | 0.6449
Prep | pp1/pp2 | 40 | 36.8 | 30.4 | 0.8261 | 0.7600 | 0.7470
Wci | np2/aj2/cj2 | 9.6 | 8 | 3.2 | 0.4000 | 0.3333 | 0.3820
WOadv | aj1/aj3/ad1 | 20.8 | 17.6 | 8 | 0.4545 | 0.3846 | 0.4149
WOinc | np1/wd1 | 25.6 | 16 | 4.8 | 0.3000 | 0.1875 | 0.2564
Wform | aj4/aj5/ad2 | 8 | 4.8 | 1.6 | 0.3333 | 0.2000 | 0.2941
Npos | np4 | 6.4 | 4.8 | 3.2 | 0.6667 | 0.5000 | 0.6250
Nn | np5/np6/np8 | 104 | 91.2 | 84.8 | 0.9298 | 0.8154 | 0.8825
SVA | vp3 | 88 | 80 | 73.6 | 0.9200 | 0.8364 | 0.8694
Others | — | 132.8 | 126.4 | 96 | 0.7595 | 0.7229 | 0.7519
Total | — | 1100.8 | 948.8 | 788.8 | 0.8314 | 0.7166 | 0.7890
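As a quick sanity check on Table 13, per-type precision and recall follow directly from the three count columns: precision is the share of system corrections that are correct, and recall is the share of annotated errors that receive a correct correction (the fractional counts are used exactly as reported in the table). The short check below reproduces two rows.

```python
# Per-type precision and recall from the count columns of Table 13:
# P = correct corrections / system corrections, R = correct corrections / annotated errors.
def precision_recall(annotated, system, correct):
    return round(correct / system, 4), round(correct / annotated, 4)

# SVA row: 88 annotated errors, 80 system corrections, 73.6 correct corrections.
print(precision_recall(88, 80, 73.6))     # -> (0.92, 0.8364)
# Nn row: 104 annotated errors, 91.2 system corrections, 84.8 correct corrections.
print(precision_recall(104, 91.2, 84.8))  # -> (0.9298, 0.8154)
```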
Table 14. Error types sorted by precision.
Category | Score
SVA | 0.94
Nn | 0.93
ArtOrDet | 0.91
Vform | 0.89
Vt | 0.85
Vm | 0.86
Others | 0.85
V0 | 0.85
Prep | 0.83
Ssub | 0.82
Rloc- | 0.78
Pform | 0.78
Pref | 0.76
WOadv | 0.71
Npos | 0.66
Wform | 0.50
Wci | 0.50
WOinc | 0.36
Trans | 0.33
Table 15. Error types sorted by recall.
Category | Score
SVA | 0.87
Nn | 0.85
ArtOrDet | 0.85
Vform | 0.80
Vt | 0.80
Vm | 0.77
Others | 0.77
V0 | 0.74
Prep | 0.72
Ssub | 0.72
Rloc- | 0.71
Pform | 0.64
Pref | 0.54
WOadv | 0.50
Npos | 0.41
Wform | 0.25
Wci | 0.25
WOinc | 0.24
Trans | 0.20