Article

PSIF: Phonetic–Semantic and Long–Short Information Fusion for Chinese Spelling Correction

by Lei Zhang 1,2,3,4, Zhetan Liu 3,4,†, Qianxi Yan 5,† and Xiaodong Liu 2,3,4,*
1 School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 Nanjing Institute of InforSuperBahn, Nanjing 211100, China
3 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
4 University of Chinese Academy of Sciences, Nanjing 211135, China
5 School of Software, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2026, 16(5), 2440; https://doi.org/10.3390/app16052440
Submission received: 19 January 2026 / Revised: 22 February 2026 / Accepted: 25 February 2026 / Published: 3 March 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Chinese Spelling Correction (CSC) aims to identify and correct character-level errors in Chinese text, where mistakes are predominantly caused by phonetic similarity and complex semantic ambiguity. Existing CSC approaches typically model phonetic and semantic information separately, which limits their ability to resolve errors requiring joint reasoning over pronunciation, tone, and global sentence meaning. In this paper, we propose a Phonetic–Semantic and Long–Short Information Fusion (PSIF) framework that explicitly integrates transliteration knowledge with sentence-level semantic representations. By incorporating tone-aware pinyin embeddings and fusing short-range phonetic features with long-range contextual semantics, PSIF effectively captures both local and global cues necessary for accurate correction. Extensive experiments on multiple CSC benchmarks demonstrate that the proposed method consistently outperforms state-of-the-art approaches, particularly on homophonic and context-sensitive errors. Furthermore, to investigate CSC under noisy input conditions in large language models (LLMs), we introduce UCMMLU, a novel benchmark constructed by injecting erroneous Chinese characters into CMMLU questions. Results show that applying PSIF as a preprocessing module significantly enhances LLM robustness and question-answering performance in zero-shot settings. These findings suggest that phonetic–semantic fusion not only advances CSC accuracy but also provides an effective pathway for improving the reliability of language models when handling misspelled or noisy Chinese text.

1. Introduction

Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese text [1], particularly those caused by characters with similar pronunciation or visual form. Given an input sentence, a CSC system identifies incorrect characters and replaces them with the correct ones while preserving the original sentence structure.
The CSC task is subject to strict character-level constraints. First, the length constraint requires that the predicted sentence contain exactly the same number of characters as the source sentence. Second, the phonetic constraint reflects the observation that the majority of Chinese spelling errors are related to pronunciation. Specifically, approximately 83% of spelling errors are phonetically similar to their correct forms [2]. These constraints distinguish CSC from general text generation [3] or sequence-to-sequence correction tasks [4].
Spelling errors not only reduce reading efficiency, but may also cause semantic ambiguity or misunderstanding. Even minor spelling errors can cast doubt on the credibility or attitude of the author. From a computational perspective, spelling errors significantly undermine the performance of downstream natural language processing models that rely on clean and accurate input text.
Figure 1 illustrates two examples of Chinese Spelling Correction, comparing direct correction with transliteration-augmented correction. The examples involve a pair of homophones pronounced as “ji shui”, which correspond to different meanings (“water accumulation” versus “calculation”). As shown in the figure, direct correction methods struggle to distinguish between these phonetically identical but semantically distinct characters. In contrast, incorporating transliteration information provides additional phonetic cues that help disambiguate homophones, leading to more accurate correction results.
CSC has garnered increasing attention in recent years, with state-of-the-art methods typically defining CSC as a sequence tagging task and fine-tuning BERT-based models on sentence pairs [5]. Further improvements have been achieved by injecting phonological and morphological features into the tagging process [6]. In contrast to these encoder-based approaches, ref. [7] recently explored applying large language models (LLMs) with decoder-only architectures, such as GPT-4, to CSC using few-shot prompting. Their analysis demonstrated that decoder-only LLMs such as GPT-4 lag behind BERT-style models in maintaining structural consistency and phonetic similarity. For example, 10% of GPT-4 outputs did not match the source sentence in character count, an issue not observed in BERT-based models. In addition, 35% of the predicted characters were phonetically dissimilar to the source, and non-homophone prediction errors accounted for about 70% of all mistakes [8].
One underlying reason for these limitations is that decoder-only LLMs are prone to tokenization errors at the very beginning of processing, often segmenting Chinese text into inappropriate subwords or tokens. This fundamental issue at the tokenizer stage makes it difficult for LLMs to ensure strict character-level alignment and phonological accuracy, ultimately constraining their effectiveness on tasks like CSC that require fine-grained control over output structure.
Although token-based models represented by BERT have achieved significant progress in Chinese spelling correction tasks, their core mechanism still primarily relies on modeling local context and directly memorizing correction pairs from training data. Such models often reduce the correction of spelling errors to pattern matching and mapping, paying limited attention to the holistic semantic expression at the sentence level [9]. As a result, the correction behavior is overly dependent on individual erroneous words themselves. In contrast, humans, in the process of spelling correction, tend to first grasp the overall meaning of a sentence and then make targeted corrections to local errors based on this understanding. This process is highly dependent on long-range global semantic information, enabling context-aware and fine-grained correction.
Additionally, Chinese characters are inherently characterized by a close association between phonology and semantics, providing abundant short-range cues for spelling correction tasks, and current token-based models possess a natural advantage in this respect. However, to further enhance the model’s human-like correction capabilities, reliance solely on local features is manifestly insufficient. There is an urgent need for models to integrate a deep understanding of global semantics, achieving effective synergy between local information and the overall context, thereby attaining more accurate and contextually appropriate correction performance.
Conventional Chinese spelling correction approaches often rely on Pinyin transcriptions without tonal information, such as “jishui,” which fails to distinguish between distinct expressions like “ji1 shui3” and “ji4 shui4.” As shown in Figure 1, this limitation makes it challenging to identify and correct errors where tonal distinctions are critical for disambiguation. While these cases are easily differentiated by tone, traditional methods lack the representational capacity to leverage this information, resulting in frequent miscorrections, especially for homophonic substitutions.
To overcome this challenge, we propose a framework that explicitly integrates Pinyin with tone annotations and fuses sentence-level semantic cues with fine-grained phonetic information via a long–short information fusion mechanism. This design allows the model to capture both global context and subtle phonological distinctions. Our approach enables accurate correction of errors that would otherwise remain unresolved using tone-agnostic representations, thereby significantly improving the robustness of Chinese spelling correction.
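The effect of tone-agnostic versus tone-aware pinyin representations can be illustrated with a small pure-Python sketch. The character pairs and their tone-annotated syllables below are illustrative (a real system would derive them with a context-aware pinyin converter, not a hand-written table):

```python
import re

# Illustrative tone-annotated pinyin for two words that are identical
# once tones are stripped (hypothetical entries for demonstration only).
TONED = {
    "积水": ["ji1", "shui3"],   # "water accumulation"
    "计税": ["ji4", "shui4"],   # "tax calculation"
}

def tone_agnostic(syllables):
    """Strip tone digits, as in conventional toneless pinyin transcription."""
    return [re.sub(r"\d", "", s) for s in syllables]

# Tone-agnostic keys collide: both words map to the same "ji shui".
keys = {word: " ".join(tone_agnostic(s)) for word, s in TONED.items()}
assert keys["积水"] == keys["计税"] == "ji shui"

# Tone-aware syllable-tone tokens keep the two words distinct.
assert TONED["积水"] != TONED["计税"]
```

The collision in the tone-agnostic keys is exactly the ambiguity that tone-aware embeddings are designed to remove.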
Beyond standalone CSC performance, we further explore how spelling correction can benefit downstream large language model applications. We build upon PSIF to develop a plugin framework designed to enhance the robustness of large language models (LLMs) against Chinese spelling errors. To systematically evaluate this enhancement, we introduce UCMMLU, a novel benchmark dataset derived from CMMLU by randomly injecting erroneous Chinese characters into the original questions [10]. This setup provides a rigorous testbed for assessing LLMs’ generalization and error-tolerance capabilities in handling noisy input. Experimental results show that integrating PSIF as a pre-processing plugin consistently improves the performance of large language models, especially in zero-shot question answering scenarios, thereby significantly strengthening their ability to manage misspelled inputs.

2. Related Work

Chinese Spelling Correction (CSC) aims to automatically detect and correct misspelled characters in Chinese sentences. In practice, CSC is a key component of writing assistants and post-processing modules for ASR/OCR, where errors are often induced by pinyin-based input, homophones, and visually similar characters. Compared with alphabetic spelling correction, CSC is particularly challenging due to the logographic nature of Chinese characters, the abundance of phonologically/graphically confusable pairs, and the need to preserve semantics while avoiding over-correction.

2.1. Rule-Based and Statistical Approaches

Traditional CSC systems are commonly grouped into rule-based and statistical paradigms. Early rule-based systems [11,12,13] relied on handcrafted linguistic rules, confusion sets, and syntactic preferences curated by experts. While these approaches can be effective in narrow domains, they typically suffer from limited coverage, high maintenance cost, and weak generalization to unseen error patterns. To reduce manual engineering, statistical approaches adopted probabilistic models such as CRF [14] and HMM [11], learning error distributions from annotated corpora. These methods improved scalability but still struggled to capture deeper semantics and long-range dependencies, which are crucial when multiple plausible substitutions exist for a given context.

2.2. Neural CSC and Pre-Trained Language Models

With the rise of deep learning, CSC increasingly shifted to neural architectures, and more recently to pre-trained language models (PLMs). Many PLM-based methods formulate CSC as sequence tagging or masked prediction, where an error detector identifies suspicious positions and a corrector predicts substitutions. Representative systems include FASpell [15], which generates candidates with a BERT-style backbone and selects corrections using a confidence-similarity decoding strategy, and soft-masked BERT [16], which explicitly mitigates BERT’s tendency to blur error signals by soft-masking suspected characters.
Beyond these canonical frameworks, recent journal work further improves how models incorporate external knowledge and how they balance detection versus correction. For example, one study revisits the long-standing use of confusion sets and proposes an adapter-based fusion strategy (ABC-Fusion) [17]. By placing lightweight adapters between BERT layers, their method allows the semantics of confusing candidates to deeply interact with the surrounding context during the encoding phase. However, because such methods continuously couple local lexical features with global context across hidden layers, they risk tangling the representations, potentially allowing local noise to disrupt the holistic sentence semantics. Our work addresses this by explicitly decoupling these streams.

Broader Advances in Deep Networks and Transformer-Based Fusion

Beyond CSC, recent progress in deep network architectures and transformer-style fusion mechanisms has provided useful structural insights for modeling complex contextual interactions. A comprehensive review [18] summarizes representative deep network designs for affective/emotion recognition, offering a consolidated perspective on how hierarchical representations and attention modules evolve in modern neural systems. Meanwhile, transformer-based multimodal modeling has been validated in diverse settings: ref. [19] investigate a joint multi-scale multimodal transformer that captures complementary cues across granularities, and [20] propose a hybrid attention transformer to better model complex spatial–temporal dependencies. Although these works target different tasks, they highlight effective design patterns for integrating heterogeneous information and attention across multiple scales. Motivated by these architectural innovations, our PSIF framework further explores how to integrate global sentence semantics with fine-grained phonetic cues via explicit long–short decoupling and gated fusion, improving robustness while mitigating over-correction.

2.3. Mitigating Over-Correction via Representation Learning

A practical difficulty in CSC is over-correction; models may change already-correct characters if the context admits fluent but semantically different alternatives. Recent studies address this issue by improving the geometry of representations and decision boundaries for confusable characters. Ref. [21] propose a metric-learning view of CSC, learning a correct representation space to better separate true errors from contextually plausible but incorrect substitutions. Complementarily, contrastive-learning-based methods explicitly optimize discrimination among confusable candidates: ref. [22] leverage phonological and visual knowledge in a contrastive learning framework, while [23] introduce reverse contrastive learning that intentionally reduces agreement among phonetically/visually confusable characters, improving robustness across models.
Crucially, while these multimodal approaches and adapter-based methods have proven effective, they generally rely on early or continuous coupling of auxiliary features within the base encoder. This can lead to representations where local multimodal cues overshadow the broader sentence context. In contrast, our proposed PSIF framework is distinct in its use of explicit long–short decoupling. By separating global semantic representations (long) from fine-grained, tone-aware phonetic features (short), PSIF prevents early interference. These decoupled streams are then combined using a specialized gated integration and feature concatenation mechanism, allowing the model to dynamically balance global understanding with local phonetic precision.

2.4. Multimodal and Knowledge-Enhanced CSC

Because Chinese spelling mistakes are dominated by phonetic and glyph confusability, integrating multimodal information (phonology, glyph/shape, and semantics) has become an important direction. Earlier work incorporated phonological cues via pre-training/fine-tuning [24] or combined pinyin and visual representations [6]. Recent journal contributions further systematize this trend. Ref. [25] propose MISpeller, a multimodal language-model-based corrector built upon ChineseBERT-style encodings, and emphasize efficient multimodal fusion and mechanisms to alleviate over-correction. Ref. [26] propose PSDSpell, which introduces a self-distillation-based pretraining strategy with confusion-set-driven synthetic errors and a single-channel masking mechanism to reduce noise from erroneous context while preserving phonetic/glyph channels. In addition to multimodal cues, external structured knowledge has also been explored. Ref. [27] inject world knowledge from knowledge graphs and definition knowledge from dictionaries, strengthening contextual understanding and supporting corrections that require factual or definitional grounding.

2.5. Datasets and Evaluation Under Realistic Scenarios

Benchmarking is another crucial factor shaping CSC progress. SIGHAN bake-off datasets have long been the standard testbed, yet they may not fully represent real-world distributions (e.g., mixture of spelling and grammatical errors, or skewed error types). To bridge this gap, ref. [28] propose YACSC, a more realistic evaluation benchmark containing annotations for both spelling and grammatical errors, and analyze CSC behavior in real-world scenarios. Moreover, domain-sensitive settings highlight new challenges: ref. [29] construct an entity-focused CSC dataset and approach, emphasizing that named entities and domain-specific terms require specialized modeling and evaluation beyond generic benchmarks.

2.6. From Character-Level Correction to Sentence-Level Generation

Finally, CSC is also evolving from character-level labeling to more holistic sentence-level rewriting. Ref. [9] reformulate CSC as a paraphrasing/rephrasing task, where models generate corrected sentences via masked filling rather than strictly performing per-character replacement. Such a paradigm can improve fluency and global consistency, but it also increases the need for controlling faithfulness and preventing unnecessary edits—making robust evaluation and over-correction control even more important.

3. Materials and Methods

3.1. Task Formulation

The Chinese Spelling Correction (CSC) task can be formulated as a conditional generation problem. Given a Chinese sentence of length n:
X = (x_1, x_2, \dots, x_n),
where some characters may be incorrect, the goal is to generate a corrected sentence of the same length:
Y = (y_1, y_2, \dots, y_n).
The challenge in CSC lies not only in maintaining the sentence’s length but also in ensuring that the corrected sentence preserves global semantic coherence. Most spelling errors arise from homophones or visually similar characters, which requires the model to utilize both global semantic understanding and local phonetic information to infer the correct characters. Correcting such errors is particularly difficult because the underlying causes are uncertain, and it is unclear which phonetic or semantic factors lead to their occurrence.
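The character-level constraints above can be sketched in a few lines of Python. The helper name `char_edits` is our own illustration, not part of any CSC toolkit: a candidate correction is valid only if it preserves the sentence length, and the edits are exactly the positions where source and prediction differ:

```python
def char_edits(source: str, prediction: str):
    """Return (position, wrong_char, corrected_char) triples for a
    same-length CSC prediction; raise if the length constraint is violated."""
    if len(source) != len(prediction):
        raise ValueError("CSC predictions must match the source length")
    return [(i, s, p) for i, (s, p) in enumerate(zip(source, prediction)) if s != p]

# A correct sentence yields no edits; a single homophone error yields one.
assert char_edits("他明天回家", "他明天回家") == []
assert char_edits("他明天会家", "他明天回家") == [(3, "会", "回")]
```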

3.2. Analysis of CSC

In CSC, errors frequently arise from homophones or visually similar characters. Many traditional approaches rely primarily on local information, such as phonetic or glyph-level similarity, to identify and correct these errors. However, such locally focused strategies are often insufficient to preserve the semantic integrity of the entire sentence. By emphasizing only short-range cues, these methods may overlook long-range semantic dependencies, resulting in corrections that are locally plausible but globally inconsistent.
For instance, approaches that depend heavily on phonetic similarity may replace an incorrect character with another phonetically similar one, even if the substitution alters the overall meaning of the sentence. Similarly, human-in-the-loop correction strategies often infer sentence-level semantics first and then reconstruct individual characters accordingly. While effective in general text editing, this process may introduce near-synonym substitutions that violate the strict one-to-one correspondence required in CSC, where the goal is to recover the original intended character rather than a semantically similar alternative.
Therefore, effective CSC requires modeling both local cues and long-range dependencies to ensure semantic coherence at the sentence level. Even when local features are correctly captured, the corrected character must remain consistent with the global meaning of the sentence. Neglecting such long-range semantic constraints can lead to corrections that are syntactically acceptable yet semantically erroneous, underscoring the necessity of a more holistic correction framework.
Justification of Design Choices. We further clarify the theoretical and linguistic motivations behind three key architectural choices. (i) Multi-kernel Conv1D vs. local self-attention. In CSC, many errors are governed by local and relatively rigid n-gram collocations, rather than flexible long-range dependencies. Compared to local self-attention, Conv1D provides a stronger inductive bias toward modeling local, translation-invariant patterns, enabling more direct and efficient extraction of such collocations with reduced computational overhead and a lower tendency to overfit local noise. (ii) Kernel sizes 2, 3, 4. The selected kernel sizes align with prevalent granularities of Chinese lexical units: size 2 targets frequent bi-character words, size 3 captures tri-grams and common three-character phrases, and size 4 is tailored for four-character idioms (Chengyu) with fixed semantics. This multi-scale design allows the model to capture semantic units from short words to fixed idiomatic expressions. (iii) Phonetics as Queries with semantics as Keys/Values. CSC primarily relies on phonological cues to correct semantically inappropriate characters. We thus employ phonetic features as queries to realize a phonetically constrained retrieval mechanism: pronunciation-driven queries retrieve semantic context compatible with the observed sound, effectively constraining the correction space by pronunciation. Reversing the roles (semantics as queries) would resemble retrieving sounds from meaning, which is less aligned with the CSC objective of using phonetic evidence to guide correction.

3.3. Motivation for PSIF Model

To address the multiple challenges present in Chinese Spelling Correction, we propose a novel model named PSIF (Phonetic–Semantic and Long–Short Information Fusion). The overall architecture of PSIF is illustrated in Figure 2. The core motivation behind PSIF is to develop a system that can simultaneously leverage phonetic, semantic, and contextual cues to accurately correct errors in Chinese text. Unlike previous approaches that primarily focus on a single type of information, PSIF introduces two key innovations. First, it incorporates a long-short information fusion mechanism that effectively captures and combines both local (short-range) and global (long-range) contextual information. This design significantly enhances the model’s robustness to a wide range of error types. Second, by deeply integrating phonetic, semantic, and contextual features, PSIF achieves a more comprehensive understanding of phonetic similarity and semantic appropriateness in Chinese, leading to improved correction accuracy. Experimental results on several widely-used Chinese spelling correction benchmarks demonstrate that PSIF outperforms existing methods, highlighting its strengths in information fusion and spelling error correction.
1.
Phonetic–Semantic Fusion: By jointly utilizing character-level semantic information and phonetic cues, PSIF is able to address homophone and visually similar character errors. The model combines phonetic features with global semantic context to ensure that the corrected sentence maintains both local accuracy and overall semantic coherence.
2.
Long–Short Contextual Features: PSIF integrates long-range global information (captured through a RoBERTa network) with short-range local context (captured through convolution). This allows the model to learn both global sentence dependencies and local character–phonetic relationships, which are essential for detecting and correcting errors effectively.

3.4. Model Architecture

We propose PSIF, a phonetic–semantic integrated framework for Chinese spelling correction. Given an input sentence X = (x_1, \dots, x_n), PSIF jointly models character-level semantics and phonetic information, and captures contextual dependencies at both short and long ranges.
Character–Pinyin Dual Embedding. For each character x_i, we construct two aligned representations. The character embedding e_i^c \in \mathbb{R}^d encodes orthographic and semantic information. To explicitly capture fine-grained phonetic and tonal cues, we encode the phonological information in a holistic syllable–tone format (e.g., the character pronounced “shui” (water) is represented by a single token containing its complete base syllable and tone number, such as “shui3”). During preprocessing, we utilize a context-aware pinyin converter to handle polyphonic characters, dynamically assigning the most probable pronunciation based on the surrounding text. This discrete syllable–tone token is then mapped to the pinyin embedding e_i^p \in \mathbb{R}^{d_p} via a trainable lookup table. The two embeddings are concatenated and projected to a unified space:
X_c = f_{\mathrm{emb}}^{c}(X) \in \mathbb{R}^{n \times d},
X_p = f_{\mathrm{emb}}^{p}(X) \in \mathbb{R}^{n \times d_p},
Z = \phi([X_c; X_p]) \in \mathbb{R}^{n \times d},
where \phi(\cdot) is a linear projection for dimensional alignment.
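The dual-embedding step can be sketched in pure Python with toy dimensions and hypothetical two-entry vocabularies (the real model learns full character and syllable–tone inventories; the random weights here simply stand in for trainable parameters):

```python
import random

random.seed(0)
D_CHAR, D_PINYIN, D_MODEL = 4, 3, 5   # toy dimensions

# Hypothetical vocabularies for illustration only.
char_vocab = {"水": 0, "税": 1}
pinyin_vocab = {"shui3": 0, "shui4": 1}

char_emb = [[random.gauss(0, 1) for _ in range(D_CHAR)] for _ in char_vocab]
pinyin_emb = [[random.gauss(0, 1) for _ in range(D_PINYIN)] for _ in pinyin_vocab]
# Projection phi: (D_CHAR + D_PINYIN) -> D_MODEL, randomly initialized here.
W_phi = [[random.gauss(0, 1) for _ in range(D_CHAR + D_PINYIN)]
         for _ in range(D_MODEL)]

def fuse(char: str, pinyin: str):
    """Concatenate the character and tone-aware pinyin embeddings ([X_c; X_p]),
    then apply the linear projection phi to reach the unified model space."""
    u = char_emb[char_vocab[char]] + pinyin_emb[pinyin_vocab[pinyin]]
    return [sum(w * x for w, x in zip(row, u)) for row in W_phi]

z = fuse("水", "shui3")
assert len(z) == D_MODEL
```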
Contextual Encoding. To capture sentence-level semantics and long-range dependencies, we adopt RoBERTa-wwm as the backbone encoder. The fused representations are first combined with positional embeddings and then passed through L Transformer layers:
H_0 = Z + E_{\mathrm{pos}},
H_l = \mathrm{Transformer}_l(H_{l-1}), \quad l = 1, \dots, L,
C = H_L \in \mathbb{R}^{n \times d}.
The whole-word masking pre-training strategy of RoBERTa-wwm provides stronger Chinese representations, which is beneficial for resolving homophones and visually similar characters.
Short-Range Feature Extraction. To model local contextual patterns and short-distance phonetic–semantic interactions, we employ a multi-kernel Conv1D module with kernel sizes k \in \{2, 3, 4\}:
F_k = \mathrm{GELU}(\mathrm{RMSNorm}(\mathrm{Conv1D}_k(C))),
F = \mathrm{Concat}(F_2, F_3, F_4),
which enables the model to capture neighborhood information at different granularities.
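The multi-kernel idea can be sketched with scalar sequences in pure Python. This is a single-channel illustration with toy averaging kernels, not the paper's trained Conv1D module; right padding keeps the output aligned with the input length:

```python
def conv1d(seq, kernel):
    """1-D convolution of a scalar sequence with one kernel, padded on the
    right with zeros so the output keeps the input length."""
    k = len(kernel)
    padded = seq + [0.0] * (k - 1)
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(seq))]

seq = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = {2: [0.5, 0.5], 3: [1 / 3] * 3, 4: [0.25] * 4}  # toy kernels

# One feature stream per kernel size, concatenated position-wise,
# mirroring F = Concat(F_2, F_3, F_4).
features = {k: conv1d(seq, w) for k, w in kernels.items()}
fused = [[features[k][i] for k in (2, 3, 4)] for i in range(len(seq))]
assert len(fused) == len(seq) and len(fused[0]) == 3
assert features[2][0] == 1.5   # mean of the bigram window (1, 2)
```

Each kernel size corresponds to a different neighborhood granularity: bigrams, trigrams, and four-character windows, matching the lexical units discussed in Section 3.2.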
Long-Range Information Aggregation. In parallel, PSIF constructs a long-range information stream via an attention-based aggregation mechanism. Phonetic representations are used to form queries, while contextual semantic representations serve as keys and values:
Q = X_p W_q, \quad K = C W_k, \quad V = C W_v,
C_{\mathrm{long}} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V.
This design allows phonetic cues to guide the aggregation of globally relevant semantic information.
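A minimal pure-Python sketch of this scaled dot-product attention, with phonetic rows as queries and toy 2-dimensional keys/values standing in for the encoder outputs (illustrative only, not the trained projection matrices):

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(queries, keys, values, d):
    """Scaled dot-product attention: each phonetic query retrieves a
    weighted mixture of the semantic values."""
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        assert abs(sum(w) - 1.0) < 1e-9   # each attention row sums to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]   # toy phonetic queries
K = [[1.0, 0.0], [0.0, 1.0]]   # toy semantic keys
V = [[2.0, 0.0], [0.0, 2.0]]   # toy semantic values
C_long = attend(Q, K, V, d=2)
assert len(C_long) == 2 and len(C_long[0]) == 2
```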
Gated Fusion and Decoding. The short-range and long-range features are adaptively fused through a gated mechanism. Let U = [F; C_{\mathrm{long}}] denote the concatenated features. The fused representation is computed as:
g = \sigma(W_g U),
H = g \odot \tanh(W_h U),
where \sigma(\cdot) is the sigmoid function and \odot denotes element-wise multiplication. Finally, a character-level decoder predicts the corrected character at each position:
p(y_i \mid X) = \mathrm{softmax}(W_o h_i).
The model is trained end-to-end using the negative log-likelihood loss:
\mathcal{L} = -\sum_{i=1}^{n} \log p(y_i \mid X).
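The gate and the loss can be sketched with toy one-row weight matrices (illustrative only; the actual W_g and W_h are learned). Because tanh is bounded in (-1, 1) and the sigmoid gate in (0, 1), the fused activations are likewise bounded:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(u, w_g, w_h):
    """H = sigmoid(W_g U) * tanh(W_h U), element-wise, for one position."""
    g = [sigmoid(sum(wi * ui for wi, ui in zip(row, u))) for row in w_g]
    h = [math.tanh(sum(wi * ui for wi, ui in zip(row, u))) for row in w_h]
    return [gi * hi for gi, hi in zip(g, h)]

def nll(probs_per_position, gold_ids):
    """Negative log-likelihood of the gold characters across positions."""
    return -sum(math.log(p[y]) for p, y in zip(probs_per_position, gold_ids))

u = [1.0, -1.0]                  # toy concatenated [F; C_long]
fused = gated_fuse(u, [[0.5, 0.5]], [[1.0, 0.0]])
assert len(fused) == 1 and -1.0 < fused[0] < 1.0

# Perfect per-position predictions give zero loss.
assert nll([[0.0, 1.0], [1.0, 0.0]], [1, 0]) == 0.0
```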
The integration of phonetic and semantic knowledge provides inherent robustness. When the erroneous character is masked or incorrect, the model can still rely on the phonetic embedding to provide local acoustic hints and use surrounding context to impose semantic constraints, thus inferring the most likely correct character. In other words, even when the local input is missing or noisy, as long as the phonetic and semantic information is available, the model can successfully recover the correct character. This mechanism significantly improves error tolerance and generalization in realistic scenarios with random typos or incomplete inputs.

3.5. Datasets

We conduct experiments on several publicly available CSC datasets with standard preprocessing, which are used for model training and evaluation.

3.5.1. SIGHAN

The SIGHAN Chinese Spelling Check (CSC) datasets are released through a series of shared tasks organized by the Special Interest Group on Chinese Language Processing (SIGHAN) [30]. In this work, we use all available annotated data released for the SIGHAN 2013, 2014, and 2015 shared tasks. Statistics of the datasets are summarized in Table 1.

3.5.2. NACGEC

The NLPCC 2018 Grammatical Error Correction dataset (NACGEC) [31] is collected from real-world user-generated texts. Compared to SIGHAN, NACGEC contains a broader variety of error types, including misspellings, homophones, visually similar characters, and compound errors. This dataset better reflects error distributions in practical applications and is commonly used for robustness evaluation.

3.5.3. ECSpell

ECSpell [32] introduces three domain-specific Chinese spelling correction (CSC) evaluation datasets covering the domains of law (Law), medical consultation (Med), and official document writing (Odw) (see Table 2). In contrast to SIGHAN-style benchmarks, which are mainly constructed through controlled homophone- or shape-similar character substitutions, ECSpell targets realistic domain-specific scenarios where specialized and low-frequency terminology is common. In such cases, general-purpose spell checkers often either fail to identify errors or mistakenly over-correct domain-specific terms. Moreover, most issues in ECSpell originate from the CSC task itself; for consistency, we retain only the data that conforms to the CSC task definition.

3.5.4. LEMON

LEMON [33] is a large-scale multi-domain Chinese spelling correction dataset constructed from real-world user-generated content. It contains naturally occurring spelling errors across seven different domains, covering a wide range of error types and domain-specific language phenomena. The test set comprises 22,252 sentences, and the dataset is typically used to assess the open-domain generalizability of CSC models in a zero-shot setting. In our experiments, to enlarge the training data, we do not evaluate on LEMON separately; instead, both the original training and test sets are merged and used for model training.

3.5.5. Evaluation Settings

Compared with traditional CSC approaches that rely on task-specific architectures, PSIF adopts a fully data-driven framework built upon a RoBERTa-wwm pretrained encoder. As a result, PSIF is more sensitive to the scale and diversity of training data than conventional CSC models. Training on a single dataset is often insufficient to achieve stable convergence.
To facilitate effective training of PSIF, we construct a larger and more diverse training corpus by integrating multiple publicly available CSC datasets. Before merging, all datasets are calibrated into a unified format, including consistent input–output representations and annotation conventions.
After format normalization, we further perform extensive data cleaning to improve data quality. Specifically, we remove samples containing traditional Chinese characters, garbled or corrupted text, as well as instances that do not correspond to valid CSC errors. These noisy or inconsistent samples may negatively affect model optimization.
Through dataset unification and rigorous cleaning, we obtain a high-quality training set that enables stable optimization of PSIF and ensures fair and consistent training across different data sources.

3.5.6. Data Deduplication and Leakage Prevention

Given that our training corpus aggregates multiple datasets (including the repurposing of both the training and test splits of the LEMON dataset), preventing data leakage is of paramount importance to ensure the integrity of our evaluations. To guarantee a strictly fair, unseen evaluation setting, we implemented a rigorous deduplication pipeline before training the PSIF model.
Specifically, we extracted the source texts of all sentences in the target evaluation test sets (SIGHAN 13, 14, 15, and the three domains of ECSpell). We then performed an exact-match overlap check (ignoring punctuation and whitespace variations) against every sentence in the aggregated training pool. Any training instance that exhibited an exact match with a test sample was strictly purged from the training corpus. This protocol was identically applied to the merged LEMON data, ensuring that no cross-dataset contamination occurred.
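The overlap check described above reduces to normalizing every sentence (stripping punctuation and whitespace, as stated) and purging exact matches. The following sketch uses our own function names; only the normalization rule comes from the text.

```python
import string

# Punctuation to ignore during the exact-match comparison: ASCII plus
# common full-width Chinese punctuation.
_PUNCT = set(string.punctuation) | set("，。！？；：、“”‘’（）《》—…")

def normalize(sentence: str) -> str:
    """Canonical form used for the overlap check."""
    return "".join(ch for ch in sentence
                   if ch not in _PUNCT and not ch.isspace())

def purge_overlaps(train_pairs, test_sources):
    """Drop any training pair whose source matches a test sentence."""
    banned = {normalize(s) for s in test_sources}
    return [(src, tgt) for src, tgt in train_pairs
            if normalize(src) not in banned]

train = [("今天天气很好。", "今天天气很好。"),
         ("他明天去北经", "他明天去北京")]
test_sources = ["今天 天气 很好"]   # same sentence, different spacing
deduped = purge_overlaps(train, test_sources)
```

Note that the first training pair is purged even though its punctuation and spacing differ from the test sentence, which is exactly the "ignoring punctuation and whitespace variations" behavior described.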

4. Results

4.1. PSIF Model and Comparative Experiments

In this section, we present the main experimental results of PSIF on the ECSpell benchmark and the SIGHAN shared task datasets. We also report detection-level and correction-level results in Table 3 to show that PSIF maintains a minimal gap between detection and correction. On ECSpell, we evaluate PSIF across three real-world domains (LAW, MED, and ODW) and compare it with representative tagging-based and rephrasing-based baselines, as well as in-context prompting with ChatGPT. The overall results are summarized in Table 4.
Following prior work on Chinese Spelling Correction (CSC), we evaluate models using Precision (Prec.), Recall (Rec.), and F1-score (F1) at the character level. In CSC, a prediction is considered correct only when the model both detects an erroneous character and generates the correct replacement. Accordingly, a true positive (TP) corresponds to a character that is erroneous in the source sentence and is correctly corrected by the model. A false positive (FP) denotes a character that is predicted as erroneous and modified by the model but is actually correct in the source text or corrected to an incorrect character. A false negative (FN) refers to an erroneous character that the model fails to correct.
Based on these definitions, Precision measures the accuracy of the model’s correction actions, indicating how often a performed correction is correct, while Recall reflects the model’s ability to identify and correct all existing spelling errors. The metrics are formally defined as:
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN},$$
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
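These definitions translate directly into a small helper (a straightforward sketch; the function name and zero-division handling are ours):

```python
def csc_metrics(tp: int, fp: int, fn: int):
    """Correction-level Precision, Recall, and F1 from the counts above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 8 correct corrections, 2 spurious edits, 2 missed errors
p, r, f = csc_metrics(tp=8, fp=2, fn=2)   # -> (0.8, 0.8, 0.8)
```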
Table 4 summarizes correction performance across the LAW, MED, and ODW domains. Overall, PSIF consistently delivers the strongest results in every domain, highlighting its robustness to domain shift and real-world noisy inputs. In particular, PSIF achieves the best F1 scores across these domains, surpassing all baseline methods by a clear margin.
Relative to the strongest baseline, ReLM, PSIF shows substantial absolute improvements across the evaluated settings. These gains suggest that PSIF is more effective at jointly modeling error detection and correction under domain-specific distributions and error patterns.
From the perspective of precision–recall trade-off, PSIF maintains a well-balanced correction behavior across all domains, with both Precision and Recall remaining at high levels. In contrast, rephrasing-based approaches tend to achieve higher Recall by aggressively modifying characters, which often leads to unnecessary or incorrect corrections and thus lower Precision. Tagging-based methods exhibit relatively conservative behavior and suffer from lower Recall, particularly in the MED and ODW domains, where error patterns and terminology differ significantly from the training data. Overall, the results demonstrate that PSIF strikes a more favorable balance between correction accuracy and coverage in CSC.
We further evaluate PSIF on the SIGHAN benchmarks to assess its generalization across different evaluation settings. As illustrated in Figure 3, PSIF achieves strong performance on the SIGHAN 13 dataset. The results on SIGHAN 14 and SIGHAN 15, shown in Figure 4 and Figure 5, respectively, exhibit a similar trend, demonstrating that PSIF maintains stable and robust performance across multiple years of SIGHAN shared tasks.

4.2. Fine-Grained Error Type Analysis

To explicitly validate the benefits of our phonetic–semantic fusion and the tone-aware design, we conducted a fine-grained analysis based on specific error types. Since traditional rule-based pinyin conversion often misclassifies polyphonic characters when ignoring linguistic context, we employed a Large Language Model (LLM) as an advanced context-aware annotator. The LLM analyzed the gold-standard errors in the SIGHAN 15 test set within their full sentence contexts and categorized them into three subsets: (1) Strict Homophones (characters sharing identical pinyin and tone), (2) Tone-Sensitive errors (characters sharing the identical base pinyin but differing in tone), and (3) Morphological/Other errors (visually similar characters or semantic misuses).
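The three-way taxonomy reduces to comparing the base syllable and tone of the erroneous and gold characters. The sketch below assumes tone-numbered pinyin strings (e.g., "shi4") are already available for each character; in the paper an LLM resolves polyphonic characters in context, which this toy rule-based version does not attempt.

```python
def categorize(err_pinyin: str, gold_pinyin: str) -> str:
    """Classify an error by comparing tone-numbered pinyin strings."""
    err_base = err_pinyin.rstrip("12345")    # syllable without tone digit
    gold_base = gold_pinyin.rstrip("12345")
    if err_pinyin == gold_pinyin:
        return "strict_homophone"      # identical pinyin and tone
    if err_base == gold_base:
        return "tone_sensitive"        # same syllable, different tone
    return "morphological_other"       # visual / semantic confusion

# 事 vs 是 (shi4/shi4) -> strict homophone
# 买 vs 卖 (mai3/mai4) -> tone-sensitive
# 未 vs 末 (wei4/mo4)  -> morphological/other
```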
As shown in Table 5, PSIF exhibits substantial improvements across all categories compared to the strongest baseline. Most notably, in the highly ambiguous Tone-Sensitive subset, PSIF achieves a significant absolute accuracy improvement. This targeted gain directly corroborates our core hypothesis: the holistic syllable-tone encoding provides critical discriminative cues. When effectively fused with long-range semantics via our proposed mechanism, it successfully resolves tone-level ambiguities that tone-agnostic architectures frequently misclassify.

4.3. Ablation Study

To quantify the contribution of each component in PSIF, we conduct an ablation study by removing either the long-context fusion branch (Long Info only) or the short-range phonetic–semantic branch (Short Info only). Table 6 reports the results on three domains. We further isolate the contributions of cross-attention and gating in Table 7.
When only Long Info is used, the model achieves reasonably strong and consistent performance across all datasets, indicating that long-range semantic context provides a solid foundation for error detection and correction. This setting is particularly effective in maintaining precision, as global contextual constraints help suppress incorrect corrections caused by local ambiguities. However, the absence of fine-grained local cues limits the model’s ability to fully capture certain error patterns, resulting in suboptimal overall performance.
In contrast, relying solely on Short Info leads to unstable behavior across domains. Although local phonetic and semantic signals enable the model to detect a larger portion of potential errors, the lack of global contextual awareness causes a substantial increase in false positives, which severely degrades precision and overall effectiveness. This confirms that short-range cues alone are insufficient to support reliable correction in complex, domain-specific texts.
By jointly modeling long-range contextual semantics and short-range phonetic–semantic information, PSIF consistently delivers the best performance across all datasets. The fusion mechanism effectively balances precision and recall by leveraging complementary strengths of the two branches: long-context information provides global semantic consistency, while short-range cues enhance sensitivity to local error patterns. These results demonstrate that neither source alone is sufficient, and their integration is crucial for achieving robust and well-balanced correction performance in real-world scenarios.
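The complementary-branch behavior described above can be illustrated with a minimal gated fusion: a sigmoid gate decides, per position, how much to trust the long-context representation versus the short-range one. This is a sketch under our own assumptions (a scalar gate from a linear projection); PSIF's actual parameterization may differ.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_long, h_short, w_long, w_short, bias):
    """fused = g * h_long + (1 - g) * h_short, with a learned scalar gate."""
    # Scalar gate score from a linear projection of both branch vectors.
    score = sum(w * h for w, h in zip(w_long, h_long)) \
          + sum(w * h for w, h in zip(w_short, h_short)) + bias
    g = sigmoid(score)
    return [g * a + (1.0 - g) * b for a, b in zip(h_long, h_short)]

h_long, h_short = [1.0, 0.0], [0.0, 1.0]
# With zero weights the gate is sigmoid(0) = 0.5: an even mix of branches.
fused = gated_fusion(h_long, h_short, [0.0, 0.0], [0.0, 0.0], 0.0)
```

As the gate parameters are trained, the model learns when global semantics should dominate (suppressing over-correction) and when local phonetic cues should (recovering confusable errors).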

5. Plugin for LLM

In real-world applications, user inputs often contain various types of noise, such as misspellings, homophone substitutions, and visually similar character errors. These imperfections can significantly degrade the performance of large language models (LLMs) on question-answering (QA) tasks, as even minor character-level perturbations may lead to semantic misinterpretation and incorrect answers. Traditional Chinese spelling correction (CSC) methods primarily focus on detecting and correcting erroneous characters, which improves text quality but provides limited benefits for enhancing the reasoning and comprehension abilities of LLMs. To address this issue, we design PSIF as a preprocessing step before the LLM to correct noisy input data, thereby improving the robustness and accuracy of downstream inference.
This module leverages our proposed PSIF model for character-level error detection and correction before the input reaches the LLM. The MCP is designed with three key principles: modularity, enabling seamless integration without modifying the internal architecture of LLMs; efficiency, ensuring that the PSIF-based correction model can operate even in edge-side or low-resource environments; and robustness enhancement, ensuring that the LLM receives semantically clean input, which significantly improves downstream QA accuracy under noisy input conditions.
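The plugin arrangement amounts to a thin wrapper: correct the question with PSIF, then pass the cleaned text to the LLM. In this sketch, `psif_correct` and the LLM callable are placeholders for the actual model calls (here a tiny lookup table and an identity function, for demonstration only).

```python
def psif_correct(text: str) -> str:
    """Stand-in for the PSIF model; a toy substitution table for illustration."""
    fixes = {"中果": "中国"}   # hypothetical correction pair
    for wrong, right in fixes.items():
        text = text.replace(wrong, right)
    return text

def answer_with_preprocessing(question: str, llm_answer) -> str:
    """Run PSIF before the LLM tokenizer ever sees the noisy text."""
    return llm_answer(psif_correct(question))

echo = lambda q: q   # dummy LLM that simply returns its (cleaned) input
```

Because correction happens before tokenization, a single corrupted character cannot shift token boundaries inside the LLM, which is the failure mode the plugin targets.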
CMMLU is a widely adopted Chinese multi-task language understanding benchmark designed to evaluate large language models across diverse domains and reasoning abilities. It consists of multiple-choice questions spanning humanities, social sciences, natural sciences, engineering, and professional knowledge, with all questions written in Chinese. The benchmark requires precise semantic understanding and multi-step reasoning, making it a strong baseline for evaluating both language comprehension and reasoning performance.
Based on CMMLU, we construct UCMMLU (Uncertain-Character CMMLU) to assess the robustness of large language models under realistic character-level noise. Specifically, we inject controlled perturbations into the question stems only, while keeping the answer options and ground-truth labels unchanged. For each question, a fixed proportion of characters is randomly selected and replaced with erroneous characters following three common noise patterns: random typographical errors, homophone substitutions, and visually similar character substitutions. This process introduces local semantic ambiguity at the character level without altering the underlying task intent or answer distribution.
To ensure that UCMMLU accurately simulates real-world noise, the injection process adheres to an empirically derived error distribution. Specifically, we applied a fixed character-level corruption rate of 10% exclusively within the question stems, ensuring the text becomes noisy while remaining fundamentally comprehensible, and preserving the integrity of the answer options and ground-truth labels. The injected noise distribution is strictly controlled: approximately 70% are phonological substitutions (homophones or near-homophones), 20% are morphological substitutions (visually similar characters), and the remaining 10% are random character substitutions (typographical noise).
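The injection procedure can be sketched as follows, using the stated rates (10% of stem characters corrupted; 70% phonological, 20% morphological, 10% random substitutions). The confusion tables here are tiny stand-ins for the real ones, and the sampling details are our own illustration.

```python
import random

HOMOPHONES = {"是": "事", "在": "再"}   # toy phonological confusion table
VISUAL = {"未": "末", "天": "夫"}       # toy visually-similar table

def corrupt_stem(stem: str, rate: float = 0.10, seed: int = 0) -> str:
    """Corrupt ~`rate` of the question stem's characters (sketch)."""
    rng = random.Random(seed)
    chars = list(stem)
    n = max(1, round(rate * len(chars)))          # at least one corruption
    for i in rng.sample(range(len(chars)), n):    # positions to corrupt
        r = rng.random()
        ch = chars[i]
        if r < 0.70:                  # phonological substitution (70%)
            chars[i] = HOMOPHONES.get(ch, ch)
        elif r < 0.90:                # visually similar substitution (20%)
            chars[i] = VISUAL.get(ch, ch)
        else:                         # random typographical noise (10%)
            chars[i] = rng.choice("的了和是在")
    return "".join(chars)
```

Only the stem is passed through `corrupt_stem`; answer options and gold labels stay untouched, preserving the original evaluation protocol.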
By preserving the original evaluation protocol and semantic targets of CMMLU, UCMMLU isolates the impact of character-level noise on model reasoning and comprehension. The resulting benchmark intentionally disrupts the input surface form to challenge large language models’ robustness, providing a realistic and controlled testbed for analyzing how preprocessing modules such as PSIF can mitigate noise-induced reasoning degradation.
In our experiments, we compare the QA accuracy of multiple models—including Meta-Llama-3.1-8B, Qwen2.5, glm-4-9b, Mistral, gemma-2-27b, and DeepSeek-V3—under two conditions: directly using erroneous inputs (Origin) and using inputs preprocessed by the PSIF module.
Results demonstrate that all evaluated models benefit from PSIF preprocessing. For example, DeepSeek-V3 achieves 83.34% accuracy on raw noisy inputs, which rises to 86.21% after PSIF preprocessing in the 0-shot setting; Qwen2.5-32B-Instruct improves from 79.86% to 83.71%. Similar improvements are consistently observed across both 0-shot and 5-shot settings, with relative gains ranging from 2% to over 10% depending on the model and configuration.
These findings indicate that the PSIF module not only enhances LLM robustness under character-level noise but also provides a practical pathway for deploying LLM-based QA systems in real-world scenarios, especially in low-latency edge environments. Furthermore, the modular and lightweight design of PSIF facilitates extension to multimodal or speech-to-text scenarios, where input errors are common, offering a general-purpose enhancement strategy for improving the reliability and generalization of large language models.

6. Discussion

This work investigates Chinese Spelling Correction (CSC) from a unified phonetic–semantic perspective and further evaluates how CSC can serve as a practical robustness layer for downstream large language models (LLMs) under noisy inputs. The proposed PSIF framework explicitly integrates tone-aware pinyin representations with sentence-level semantics and employs a long–short information fusion mechanism to balance local confusability cues and global meaning consistency. In this section, we discuss (i) why PSIF performs strongly across real-world domains, (ii) how each design choice contributes to the precision–recall trade-off, (iii) what PSIF implies for LLM robustness under character-level noise, and (iv) current limitations and future directions.

6.1. Why Phonetic–Semantic Long–Short Fusion Matters in Real-World CSC

A key observation in CSC is that many errors are locally plausible (e.g., homophones and visually similar characters) yet globally inconsistent with sentence meaning. Prior PLM-based CSC systems often focus on character-level tagging or masked prediction [5,15,16], and many multimodal enhancements inject phonological/glyph cues to better handle confusable pairs [6,25]. However, in domain-rich settings such as LAW and MED, local plausibility alone can lead to incorrect but fluent substitutions, i.e., the model may “correct” a token into another high-frequency candidate that matches the local context while drifting from the intended semantics.
Our results on ECSpell (Table 4) show that PSIF achieves consistently high F1 across LAW/MED/ODW and simultaneously maintains high precision and recall. This behavior aligns with the central motivation of PSIF: short-range phonetic-sensitive cues help recover from confusable errors, while long-range sentence semantics constrain the correction space to preserve overall meaning. Compared with rephrasing-based baselines, which tend to trade precision for recall due to over-editing and paraphrastic drift, PSIF keeps a stricter correction behavior while still correcting difficult confusable cases. This is particularly important in practical writing assistance and post-processing pipelines where over-correction is often more harmful than under-correction (e.g., altering correct domain terms, names, or legal/medical entities).

6.2. Interpreting the Ablation: Precision from Global Semantics, Recall from Local Cues

The ablation study (Table 6) provides a clear interpretation of how information sources shape CSC behavior. The Long Info variant attains strong but suboptimal performance, suggesting that global contextual encoding can already support many corrections, especially when the error is semantically incompatible with the sentence. Yet, without explicit short-range phonetic modeling, the model may fail to select the correct character among multiple semantically plausible candidates—an issue frequently seen in homophone-heavy confusion sets.
In contrast, the Short Info variant exhibits high recall but substantially lower precision. This indicates that local phonetic–semantic patterns are powerful for triggering corrections (especially for common homophone/near-homophone substitutions), but without global semantic constraints the model becomes more willing to change characters, increasing false positives. This precision drop is consistent with typical over-correction phenomena in CSC [21,22]. PSIF’s gated fusion helps reconcile these behaviors: it learns when to trust long-range semantics to avoid unnecessary edits and when to emphasize local phonetic cues to recover genuinely confusable errors. Therefore, the fusion mechanism is not merely additive; it is essential for achieving a stable precision–recall balance across domains.

6.3. The Role of Tone-Aware Pinyin: Disambiguating Homophones Beyond Coarse Phonology

Chinese homophones pose a distinctive challenge for CSC: multiple characters may share identical segmental pinyin while differing only in tone, and such tonal contrasts can be crucial for lexical selection in context. Although prior CSC pipelines incorporate phonological features, coarse or tone-agnostic representations tend to collapse tone-differentiated candidates into the same phonetic bucket, weakening disambiguation under ambiguity [6]. In PSIF, tone-aware pinyin embeddings provide a finer-grained phonetic signal that complements semantic evidence. When semantics alone cannot confidently separate competing candidates (e.g., multiple substitutions yield similarly fluent contexts), explicit tone markers introduce an additional constraint, reducing confusion among otherwise indistinguishable homophones.
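The difference between tone-aware and tone-agnostic phonetic vocabularies is easy to make concrete: with tones, "mai3" (买) and "mai4" (卖) receive distinct embedding ids, while a tone-agnostic vocabulary collapses both to "mai". The vocabulary-building helper below is purely illustrative.

```python
def build_vocab(pinyins, tone_aware: bool):
    """Map (tone-numbered) pinyin strings to embedding-table indices."""
    keys = sorted({p if tone_aware else p.rstrip("12345") for p in pinyins})
    return {k: i for i, k in enumerate(keys)}

pys = ["mai3", "mai4", "shi4"]
tone_vocab = build_vocab(pys, tone_aware=True)    # 3 distinct entries
flat_vocab = build_vocab(pys, tone_aware=False)   # "mai3"/"mai4" collapse
```

Under the tone-agnostic vocabulary, the embedding layer literally cannot distinguish the two candidates; any disambiguation must come from semantics alone, which is the failure mode the tone-aware design avoids.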
To strictly quantify the contribution of tone awareness, we further perform an ablation experiment that replaces tone-aware pinyin with a tone-agnostic variant while keeping all other components unchanged. As shown in Table 8, removing tone information causes consistent degradations across all datasets in terms of P/R/F1. This confirms that tone-aware pinyin provides additional discriminative cues beyond coarse phonology, helping PSIF better distinguish tone-sensitive homophone candidates when semantic evidence alone is insufficient, and thus yielding more reliable corrections in ambiguous contexts.

6.4. Comparison with Prior CSC Paradigms: Tagging, Rephrasing, and Knowledge-Enhanced Fusion

From a methodological perspective, PSIF sits between classic tagging-based CSC and more flexible sentence rewriting. Tagging-based methods enforce length and alignment constraints naturally, but they may underutilize sentence-level meaning when corrections require longer-range reasoning [15,16]. Rephrasing-based methods can leverage global semantics more freely, yet they often struggle with faithfulness and strict character-level constraints, resulting in over-editing and structural inconsistencies.
Recent work has explored richer fusion strategies, e.g., injecting candidate information via adapters [17], contrastive objectives to separate confusable representations [22,23], or knowledge-enhanced CSC with dictionaries/knowledge graphs [27]. PSIF complements these directions by emphasizing multi-scale representation learning: convolutional local modeling for short-range dependencies and attention-based aggregation for long-range constraints, then using gating to dynamically balance them. This design offers a practical path to retain the controllability of tagging-style correction while improving context sensitivity in realistic domains.

6.5. Implications for LLM Robustness: Correcting Before Tokenization and Reasoning

Beyond CSC, we demonstrate that spelling correction can serve as a robustness layer for LLM-based QA under character-level noise. Prior observations suggest that decode-only LLMs may be sensitive to Chinese tokenization and early segmentation errors, which can propagate to downstream reasoning and lead to unstable outputs [7]. Our plugin setting addresses this issue pragmatically: PSIF corrects the input before it is processed by the LLM tokenizer, thereby reducing the chance that a single corrupted character shifts token boundaries or distorts key entities.
The results on UCMMLU (Table 9) show consistent gains across multiple LLM families in both 0-shot and 5-shot settings. Two takeaways are noteworthy. First, improvements appear across model scales and architectures, indicating that character-level noise is a shared vulnerability and that preprocessing can offer architecture-agnostic robustness benefits. Second, the gains are present even in few-shot prompting, suggesting that demonstrations alone do not fully compensate for corrupted inputs; improving input quality remains valuable even when prompting is optimized.
Practically, this plugin-style design has attractive deployment properties: PSIF can be integrated without modifying the LLM, and it can be applied selectively when input is suspected to be noisy (e.g., from user typing, OCR, or ASR). This also provides a controllable interface to balance latency and robustness in edge or low-resource environments.

7. Conclusions

In this work, we proposed PSIF, a Phonetic–Semantic and Long–Short Information Fusion model for Chinese Spelling Correction (CSC). By jointly leveraging phonetic embeddings with tonal information, semantic context, and a dual-scale long-short feature encoder, PSIF effectively corrects homophone and visually similar character errors while maintaining global semantic consistency. Extensive experiments on multiple CSC benchmarks demonstrate that PSIF achieves state-of-the-art performance, with significant gains in both detection and correction metrics.
Beyond traditional CSC tasks, we introduced the UCMMLU dataset to evaluate the robustness of large language models (LLMs) under noisy inputs with character-level perturbations. By deploying PSIF as a modular front-end correction plugin, we significantly enhanced the QA accuracy of models such as DeepSeek-V3 and Qwen2.5, illustrating the practical value of integrating lightweight error correction modules in real-world applications.
In the future, we plan to extend PSIF toward multilingual spelling correction and explore its integration with speech-to-text systems, where phonetic cues are naturally abundant. We also aim to optimize the model for low-latency environments, making it a general-purpose enhancement tool to improve the robustness, reliability, and user experience of LLM-based systems.

8. Limitations

Our study is primarily grounded in the linguistic characteristics of Chinese, where pronunciation and semantics are tightly coupled, and the pronunciation of a single character often implicitly conveys semantic information. As a result, the generalizability of the proposed model may be limited for languages that rely more heavily on orthographic cues or explicit grammatical structures to encode meaning. In addition, when deployed as a prompt optimization module for large language models (LLMs), our method inevitably introduces additional computational overhead, leading to increased inference latency and runtime costs.
Despite strong empirical performance, Chinese spelling correction (CSC) remains challenging under several realistic conditions. First, named entities and domain-specific terms pose significant difficulties: many such expressions are low-frequency, newly coined, or transliterated, and overly aggressive corrections may inadvertently corrupt them. Second, sentences containing multiple interacting errors often require joint inference across positions; locally optimal corrections may alter the semantic context and negatively affect subsequent decisions. Third, out-of-distribution noise, such as mixed scripts, informal internet slang, or OCR artifacts, is not fully covered by existing training corpora, resulting in increased uncertainty and potential miscorrections.
These challenges suggest several promising directions for future work. One direction is to incorporate entity-aware constraints or lexicon-based protection mechanisms (e.g., freezing spans identified as named entities) to reduce harmful edits in specialized domains. Another direction is to attach calibrated uncertainty estimates to correction outputs, enabling selective editing or user-in-the-loop verification in high-stakes applications such as medical or legal text processing. Finally, in the LLM plugin setting, it may be beneficial to expose both the original and corrected inputs to the downstream model (e.g., via dual-input prompting), allowing the LLM to cross-check alternatives and mitigate rare but high-impact miscorrections.

Author Contributions

Conceptualization, L.Z. and X.L.; methodology, L.Z. and Z.L.; software, Z.L. and Q.Y.; validation, L.Z., Z.L. and Q.Y.; formal analysis, L.Z.; investigation, L.Z., Z.L. and Q.Y.; resources, X.L.; data curation, Z.L. and Q.Y.; writing—original draft preparation, L.Z., Z.L. and Q.Y.; writing—review and editing, L.Z. and X.L.; visualization, Q.Y.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Smart-Grid National Science and Technology Major Project (2025ZD0804800).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets and experimental scripts used in this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the Zhongke Nanjing Information High-Speed Rail Research Institute for its administrative coordination and technical assistance.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yu, J.; Li, Z. Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape. In Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing; Sun, L., Zong, C., Zhang, M., Levow, G.A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 220–223. [Google Scholar] [CrossRef]
  2. Liu, C.L.; Lai, M.H.; Tien, K.W.; Chuang, Y.H.; Wu, S.H.; Lee, C.Y. Visually and Phonologically Similar Characters in Incorrect Chinese Words: Analyses, Identification, and Applications. ACM Trans. Asian Lang. Inf. Process. 2011, 10, 1–39. [Google Scholar] [CrossRef]
  3. Zhang, B.; Ma, H.; Li, D.; Ding, J.; Wang, J.; Xu, B.; Lin, H. Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation. Trans. Assoc. Comput. Linguist. 2025, 13, 1007–1031. [Google Scholar] [CrossRef]
  4. Fang, H.; Zhu, X.; Gurevych, I. Preemptive Detection and Correction of Misaligned Actions in LLM Agents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 222–244. [Google Scholar] [CrossRef]
  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  6. Huang, L.; Li, J.; Jiang, W.; Zhang, Z.; Chen, M.; Wang, S.; Xiao, J. PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5958–5967. [Google Scholar] [CrossRef]
  7. Li, K.; Hu, Y.; He, L.; Meng, F.; Zhou, J. C-LLM: Learn to Check Chinese Spelling Errors Character by Character. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 5944–5957. [Google Scholar]
  8. Xu, Z.; Zhao, Z.; Zhang, Z.; Liu, Y.; Shen, Q.; Liu, F.; Kuang, Y.; He, J.; Liu, C. Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3839–3853. [Google Scholar] [CrossRef]
  9. Liu, L.; Wu, H.; Zhao, H. Chinese spelling correction as rephrasing language model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 18662–18670. [Google Scholar]
  10. Li, H.; Zhang, Y.; Koto, F.; Yang, Y.; Zhao, H.; Gong, Y.; Duan, N.; Baldwin, T. CMMLU: Measuring massive multitask language understanding in Chinese. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11260–11285. [Google Scholar] [CrossRef]
  11. Chu, W.C.; Lin, C.J. NTOU Chinese Spelling Check System in Sighan-8 Bake-off. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, Beijing, China, 30–31 July 2015; Yu, L.C., Sui, Z., Zhang, Y., Ng, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 137–143. [Google Scholar] [CrossRef]
  12. Jiang, Y.; Wang, T.; Lin, T.; Wang, F.; Cheng, W.; Liu, X.; Wang, C.; Zhang, W. A rule based Chinese spelling and grammar detection system utility. In Proceedings of the 2012 International Conference on System Science and Engineering (ICSSE), Dalian, China, 30 June–2 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 437–440. [Google Scholar]
  13. Chang, T.H.; Chen, H.C.; Yang, C.H. Introduction to a proofreading tool for Chinese spelling check task of SIGHAN-8. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, Beijing, China, 30–31 July 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 50–55. [Google Scholar]
14. Wang, Y.R.; Liao, Y.F. Word Vector/Conditional Random Field-based Chinese Spelling Error Detection for SIGHAN-2015 Evaluation. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, Beijing, China, 30–31 July 2015; Yu, L.C., Sui, Z., Zhang, Y., Ng, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 46–49.
15. Hong, Y.; Yu, X.; He, N.; Liu, N.; Liu, J. FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm. In Proceedings of the 5th Workshop on Noisy User-Generated Text (W-NUT 2019), Hong Kong, China, 4 November 2019; Xu, W., Ritter, A., Baldwin, T., Rahimi, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 160–169.
16. Zhang, S.; Huang, H.; Liu, J.; Li, H. Spelling Error Correction with Soft-Masked BERT. arXiv 2020, arXiv:2005.07421.
17. Xie, J.; Dang, K.; Liu, J.; Liang, E. ABC-Fusion: Adapter-based BERT-level confusion set fusion approach for Chinese spelling correction. Comput. Speech Lang. 2024, 83, 101540.
18. Mustaqeem, M.; Kwon, S. Speech Emotion Recognition Based on Deep Networks: A Review. In Proceedings of the Annual Conference of KIPS, Online, 14–15 May 2021; Korea Information Processing Society: Seoul, Republic of Korea, 2021.
19. Khan, M.; Ahmad, J.; Gueaieb, W.; De Masi, G.; Karray, F.; El Saddik, A. Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices. IEEE Trans. Consum. Electron. 2025, 71, 1092–1101.
20. Khan, M.; Ahmad, J.; El Saddik, A.; Gueaieb, W.; De Masi, G.; Karray, F. Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4713–4722.
21. Li, C.; Zhang, M.; Zhang, X.; Yan, Y. MCRSpell: A metric learning of correct representation for Chinese spelling correction. Expert Syst. Appl. 2024, 237, 121513.
22. Mao, X.; Shan, Y.; Li, F.; Chen, X.; Zhang, S. CLSpell: Contrastive learning with phonological and visual knowledge for Chinese spelling check. Neurocomputing 2023, 554, 126468.
23. Lin, N.K.; Wu, H.Y.; Fu, S.H.; Jiang, S.Y.; Yang, A.M. A Chinese Spelling Check Method Based on Reverse Contrastive Learning. J. Comput. Sci. Technol. 2025, 40, 821–834.
24. Zhang, R.; Pang, C.; Zhang, C.; Wang, S.; He, Z.; Sun, Y.; Wu, H.; Wang, H. Correcting Chinese Spelling Errors with Phonetic Pre-training. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2250–2261.
25. Li, J.; Duan, J.; Wang, H.; He, L.; Zhang, Q. MISpeller: Multimodal Information Enhancement for Chinese Spelling Correction. IEICE Trans. Inf. Syst. 2024, E107.D, 1342–1352.
26. He, L.; Zhang, X.; Duan, J.; Wang, H.; Li, X.; Zhao, L. PSDSpell: Pre-Training with Self-Distillation Learning for Chinese Spelling Correction. IEICE Trans. Inf. Syst. 2024, E107.D, 495–504.
27. Wang, H.; Ma, Y.; Duan, J.; He, L.; Li, X. Chinese Spelling Correction Based on Knowledge Enhancement and Contrastive Learning. IEICE Trans. Inf. Syst. 2024, E107.D, 1264–1273.
28. Yang, L.; Liu, X.; Liao, T.; Liu, Z.; Wang, M.; Fang, X.; Yang, E. Is Chinese Spelling Check Ready? Understanding the Correction Behavior in Real-World Scenarios. AI Open 2023, 4, 183–192.
29. Liu, X.; Zhu, S.; Li, Y.; Chen, X.; Yu, Z. Entity-focused Chinese Spelling Correction: Dataset and Approach. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2025, 24, 1–17.
30. Tseng, Y.H.; Lee, L.H.; Chang, L.P.; Chen, H.H. Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, Beijing, China, 30–31 July 2015; Yu, L.C., Sui, Z., Zhang, Y., Ng, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 32–37.
31. Zhang, Y.; Zhang, B.; Jiang, H.; Li, Z.; Li, C.; Huang, F.; Zhang, M. NaSGEC: A Multi-Domain Chinese Grammatical Error Correction Dataset from Native Speaker Texts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 9935–9951.
32. Lv, Q.; Cao, Z.; Geng, L.; Ai, C.; Yan, X.; Fu, G. General and Domain-adaptive Chinese Spelling Check with Error-consistent Pretraining. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–18.
33. Wu, H.; Zhang, S.; Zhang, Y.; Zhao, H. Rethinking Masked Language Modeling for Chinese Spelling Correction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 10743–10756.
Figure 1. Examples of Chinese Spelling Correction (CSC) comparing direct correction and transliteration-augmented correction. The examples involve a pair of homophones pronounced as “ji shui”, corresponding to different meanings. Transliteration information helps distinguish pronunciation-consistent but semantically different candidates.
Figure 2. Overall architecture of PSIF, consisting of multi-view embedding, multi-scale convolution for local n-gram features, cross-attention for global phonetic–semantic alignment, gated fusion, and decoding into corrected output Y.
Figure 3. Experimental results on the SIGHAN 13 dataset.
Figure 4. Experimental results on the SIGHAN 14 dataset.
Figure 5. Experimental results on the SIGHAN 15 dataset.
Table 1. Statistics of the SIGHAN Chinese Spelling Check (CSC) datasets. The statistics include the number of sentences, average sentence length (in characters), and the total number of annotated spelling errors.
Dataset | #Sentences | Avg. Length | #Errors
SIGHAN13 (Train) | 700 | 41.8 | 343
SIGHAN13 (Test) | 1000 | 74.3 | 1224
SIGHAN14 (Train) | 3437 | 49.6 | 5122
SIGHAN14 (Test) | 1062 | 50.0 | 771
SIGHAN15 (Train) | 2339 | 31.3 | 3037
SIGHAN15 (Test) | 1100 | 30.7 | 542
Table 2. Statistics of ECSpell domain-specific datasets (Law/Med/Odw) and the SIGHAN15 test set for reference [32].
Statistic | Law | Med | Odw | SIGHAN15 (Test)
# Error sents/Sents | 1314/2460 | 1699/3500 | 1259/2220 | 542/1100
Min. Len | 12 | 11 | 9 | 5
Max. Len | 120 | 127 | 161 | 108
Avg. Len | 30.5 | 50.1 | 41.2 | 30.7
# Continuous error sents | 229 | 253 | 265 | 51
# Annotators per sent | 5 | 5 | 5 | 1
Table 3. Detection-level and correction-level performance on the SIGHAN 15 test set. The results demonstrate that PSIF maintains a minimal performance gap between error detection and correction.
Method | Detection Level (Prec. / Rec. / F1, %) | Correction Level (Prec. / Rec. / F1, %)
BERTTagging | 74.2 / 78.0 / 76.0 | 71.6 / 75.3 / 73.4
ReLM | 84.5 / 85.2 / 84.8 | 82.1 / 82.8 / 82.4
PSIF (Ours) | 85.3 / 84.2 / 84.1 | 84.0 / 83.0 / 83.5
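As a quick consistency check on Table 3, F1 is the harmonic mean of precision and recall, so the reported F1 scores can be reproduced from the precision/recall columns. The snippet below is an illustrative sketch (the helper name `f1` is ours, not part of the paper's code):

```python
def f1(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall (both in %)."""
    return 2 * precision * recall / (precision + recall)

# PSIF's correction-level scores from Table 3: P = 84.0, R = 83.0.
print(round(f1(84.0, 83.0), 1))  # 83.5, matching the reported correction F1
# ReLM's detection-level scores from Table 3: P = 84.5, R = 85.2.
print(round(f1(84.5, 85.2), 1))  # 84.8, matching the reported detection F1
```

The same check applies to every precision/recall/F1 triple in Tables 4 and 6–8.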
Table 4. Comparison of precision, recall, and F1 scores for various methods on the ECSpell dataset across three domains (LAW, MED, ODW). The results demonstrate that our proposed PSIF model achieves state-of-the-art performance in all domains, significantly outperforming both tagging-based and rephrasing-based baselines.
Domain | Method | Prec. | Rec. | F1
LAW | GPT2Tagging | 37.7 | 32.5 | 34.9
LAW | BERTTagging | 43.3 | 36.9 | 39.8
LAW | GPT2Rephrasing | 61.6 | 84.3 | 71.2
LAW | BERTTagging-MFT | 73.2 | 79.2 | 76.1
LAW | MDCSpellTagging-MFT | 77.5 | 83.9 | 80.6
LAW | ReLM | 89.9 | 94.5 | 92.1
LAW | Baichuan2Rephrasing | 85.1 | 87.1 | 86.1
LAW | PSIF | 96.6 | 96.4 | 96.5
LAW | ChatGPT (10-shot) | 46.7 | 50.1 | 48.3
MED | GPT2Tagging | 23.1 | 16.7 | 19.4
MED | BERTTagging | 25.3 | 20.0 | 22.3
MED | GPT2Rephrasing | 29.6 | 44.7 | 35.6
MED | BERTTagging-MFT | 57.9 | 58.1 | 58.0
MED | MDCSpellTagging-MFT | 69.9 | 69.3 | 69.6
MED | ReLM | 79.2 | 85.9 | 82.4
MED | Baichuan2Rephrasing | 72.6 | 73.9 | 73.2
MED | PSIF | 83.9 | 89.5 | 86.6
MED | ChatGPT (10-shot) | 21.9 | 31.9 | 26.0
ODW | GPT2Tagging | 26.8 | 19.8 | 22.8
ODW | BERTTagging | 30.1 | 21.3 | 24.9
ODW | GPT2Rephrasing | 46.2 | 64.3 | 53.8
ODW | BERTTagging-MFT | 59.7 | 58.8 | 59.2
ODW | MDCSpellTagging-MFT | 65.7 | 68.2 | 66.9
ODW | ReLM | 82.4 | 84.8 | 83.6
ODW | Baichuan2Rephrasing | 86.1 | 79.3 | 82.6
ODW | PSIF | 87.1 | 88.1 | 87.6
ODW | ChatGPT (10-shot) | 56.5 | 57.1 | 56.8
Table 5. Correction accuracy (%) broken down by specific error types on the SIGHAN 15 test set, as annotated by a context-aware LLM. PSIF shows pronounced improvements in the Tone-Sensitive subset.
Method | Strict Homophones | Tone-Sensitive | Morphological/Other
Baseline (ReLM) | 82.2 | 75.3 | 84.0
PSIF (Ours) | 83.8 | 82.5 | 84.5
Table 6. Ablation study results: performance comparison across three experimental models.
Dataset | Metric | Long Info | Short Info | PSIF
MED | P | 72.6 | 53.1 | 83.9
MED | R | 82.7 | 68.7 | 89.5
MED | F1 | 77.3 | 59.9 | 86.6
LAW | P | 92.8 | 47.5 | 96.6
LAW | R | 92.7 | 71.7 | 96.4
LAW | F1 | 92.8 | 57.2 | 96.5
ODW | P | 92.1 | 60.1 | 87.1
ODW | R | 82.7 | 82.0 | 88.1
ODW | F1 | 87.2 | 69.3 | 87.6
Table 7. Ablation study isolating the contributions of cross-attention and the gating mechanism.
Dataset | Metric | w/o Cross-Attn | w/o Gating | PSIF
MED | P | 80.8 | 82.5 | 83.9
MED | R | 86.8 | 88.3 | 89.5
MED | F1 | 83.7 | 85.3 | 86.6
LAW | P | 95.6 | 96.1 | 96.6
LAW | R | 95.4 | 96.0 | 96.4
LAW | F1 | 95.5 | 96.0 | 96.5
ODW | P | 84.8 | 85.9 | 87.1
ODW | R | 86.9 | 87.3 | 88.1
ODW | F1 | 85.8 | 86.6 | 87.6
Table 8. Ablation study on tone awareness: PSIF with tone-aware pinyin vs. a tone-agnostic variant.
Dataset | Metric | Tone-Aware | Tone-Agnostic
MED | P | 83.9 | 81.6
MED | R | 89.5 | 87.0
MED | F1 | 86.6 | 84.2
LAW | P | 96.6 | 95.2
LAW | R | 96.4 | 95.0
LAW | F1 | 96.5 | 95.1
ODW | P | 87.1 | 85.4
ODW | R | 88.1 | 86.2
ODW | F1 | 87.6 | 85.8
Table 9. Downstream UCMMLU accuracy (%) under four pre-tokenization correction settings: No Correction, Dictionary-based, Generic BERT (semantic-only), and PSIF (phonetic-constrained fusion).
Setting | Model | No Corr. | Dict. | Generic BERT | PSIF
0-shot | Meta-Llama-3.1-8B-Instruct | 48.21 | 48.45 | 49.55 | 50.38
0-shot | Qwen2.5-7B-Instruct | 73.40 | 73.78 | 75.05 | 77.31
0-shot | glm-4-9b-chat | 68.65 | 68.96 | 70.10 | 71.94
0-shot | Mistral-7B-Instruct-v0.3 | 41.51 | 41.70 | 42.10 | 42.71
0-shot | Mistral-Nemo-Instruct-2407 | 52.87 | 53.15 | 54.20 | 55.31
0-shot | gemma-2-27b-it | 57.40 | 57.65 | 58.45 | 59.23
0-shot | Qwen2.5-32B-Instruct | 79.86 | 80.20 | 81.65 | 83.71
0-shot | DeepSeek-V3 | 83.34 | 83.62 | 84.95 | 86.21
5-shot | Meta-Llama-3.1-8B-Instruct | 39.70 | 40.05 | 41.95 | 43.89
5-shot | Qwen2.5-7B-Instruct | 74.50 | 74.85 | 76.10 | 77.85
5-shot | glm-4-9b-chat | 65.33 | 65.70 | 67.05 | 68.79
5-shot | Mistral-7B-Instruct-v0.3 | 37.33 | 37.55 | 37.90 | 38.13
5-shot | Mistral-Nemo-Instruct-2407 | 54.08 | 54.35 | 55.35 | 56.33
5-shot | gemma-2-27b-it | 59.73 | 60.05 | 61.15 | 62.66
5-shot | Qwen2.5-32B-Instruct | 81.20 | 81.55 | 82.85 | 84.54
5-shot | DeepSeek-V3 | 84.92 | 85.20 | 86.40 | 87.67
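To summarize the zero-shot effect in Table 9, the per-model gain of PSIF preprocessing over no correction can be averaged directly from the reported accuracies. The sketch below simply transcribes the zero-shot "No Corr." and "PSIF" columns and computes the mean gain; the variable names are ours:

```python
# Zero-shot UCMMLU accuracy (%) per model, transcribed from Table 9:
# model -> (no correction, PSIF preprocessing)
zero_shot = {
    "Meta-Llama-3.1-8B-Instruct": (48.21, 50.38),
    "Qwen2.5-7B-Instruct": (73.40, 77.31),
    "glm-4-9b-chat": (68.65, 71.94),
    "Mistral-7B-Instruct-v0.3": (41.51, 42.71),
    "Mistral-Nemo-Instruct-2407": (52.87, 55.31),
    "gemma-2-27b-it": (57.40, 59.23),
    "Qwen2.5-32B-Instruct": (79.86, 83.71),
    "DeepSeek-V3": (83.34, 86.21),
}

# Accuracy gain from PSIF preprocessing, per model.
gains = [psif - base for base, psif in zero_shot.values()]
mean_gain = sum(gains) / len(gains)  # roughly 2.7 percentage points
print(f"mean zero-shot gain: {mean_gain:.2f} points")
```

Every model in the table improves under PSIF preprocessing, so the positive mean is not driven by a single outlier.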