Visual and Phonological Feature Enhanced Siamese BERT for Chinese Spelling Error Correction

: Chinese Spelling Check (CSC) aims to detect and correct spelling errors in Chinese. Most CSC models rely on human-deﬁned confusion sets to narrow the search space, failing to resolve errors outside the confusion set. However, most spelling errors in current benchmark datasets are character pairs in similar pronunciations. Errors in similar shapes and errors which are visually and phonologically irrelevant are not considered. Furthermore, widely-used automatically generated training data in CSC tasks leads to label leakage and unfair comparison between different methods. In this work, we propose a feature (visual and phonological) enhanced siamese BERT to (1) correct spelling errors without using confusion sets; (2) integrate phonological and visual features for CSC by a glyph graph; (3) improve performance for unseen spelling errors. To evaluate CSC methods fairly and comprehensively, we build a large-scale CSC dataset in which the number of samples in different error types is the same. The experimental results show that the proposed approach achieves better performance compared with previous state-of-the-art methods on three benchmark datasets and the new error-type balanced dataset.


Introduction
Chinese spelling errors are common in daily life, due to the similarity between characters. In Chinese, there are many characters similar in phonology and visual shape, but different in semantics. These spelling errors are typically caused by careless human writing, automatic speech recognition, or optical character recognition systems. Besides, misspellings are intentionally used by perpetrators to avoid automatic detection in current social platforms, such as spam broadcasting, malicious advertising, etc. Therefore, detecting and correcting such misuse of the Chinese language are important tasks in real-world applications. An effective method for CSC is not only useful in natural language tasks such as speech recognition, word recognition, and grammatical error correction but also has potential value in anti-spam problems such as fraud detection, advertisement detection, etc.
Unlike English, Chinese spelling error correction is a challenging task due to the characteristics of Chinese characters. Chinese texts consist of many pictographic characters without word delimiters. The semantic meaning of each character changes dramatically when the context changes. Besides, the pronunciation of a character depends on the context. In this sense, a CSC model is required to both understand the semantics and integrate the surrounding information (i.e., pronunciation, character structure).
Traditional methods have been employed for CSC. For example, previous works [1][2][3] employed the traditional machine learning [4,5] and other deep learning models [6]. Also, sequence-to-sequence models [7] have been proposed for spelling error correction by transforming an input sentence into a new sentence with spelling errors corrected. Recently, several methods have been introduced to exploit external information of character similarity, which rely on a human-defined confusion set [7,8]. The confusion set is a dictionary containing similar character pairs. In ACL 2019, confusion set-guided Pointer Networks [7] used a pointer network to copy similar characters from the confusion set. In ACL 2020 [8] a spelling check convolutional graph network was presented by constructing similarity graphs using the confusion set. These methods attempt to model the relationship between characters based on the confusion set. However, the similarity information extracted from the confusion set is limited. The confusion set only provides information about whether two characters are similar, but does not present phonological and visual features of Chinese characters. Furthermore, a human-defined confusion set only covers a small subset of Chinese characters due to the human annotation, which can hardly correct the misspellings of unknown characters.
Addressing the problems of human annotation and unknown characters, we propose Feature Enhanced BERT (FE-BERT) to integrate phonological and visual similarity for CSC without using a confusion set. Specifically, we construct a glyph graph over Chinese characters for capturing the shape similarity between characters, by leveraging the intrinsic component structure of Chinese characters. Instead of similar character pairs set by human annotation to capture the shape similarity, our glyph graph requires no human annotation. Furthermore, the glyph graph utilizes the decomposition structure of Chinese characters, which can cover more characters compared to the confusion set. The glyph graph is pre-trained to generate a vector representation for each character as its shape feature. Combining the shape and pronunciation features with BERT [9], FE-BERT can adequately leverage the similarity knowledge and generate the right corrections accordingly. Considering spelling errors which are visually and phonologically irrelevant, we adopt a siamese structure to combine a vanilla BERT and a FE-BERT, namely FES-BERT. As depicted in Table 1, FES-BERT can completely correct the spelling errors led by both the pronunciation similarity and the shape similarity. Table 1. A CSC data sample from SIGHAN 2015 [10] with ID A2-1515-1, the incorrect/correct characters are in orange/blue. A BERT model modifies the text into a sentence that is semantically reasonable but dissimilar in pronunciation. By incorporating both phonological and visual similarities, our new method FES-BERT can generate a sentence that is both semantically sensible and phonically similar to the original sentence. The sentence output from FES-BERT means "Can you tell me, does this store have Chinese books?", while the sentence Output from BERT means "Can you tell me, does this store have a Chinese department?".

Unsupervised Methods
The CSC task attracts much attention from the community in recent years. Earlier works in CSC mainly focused on unsupervised methods [1,13]. Character-level n-gram language models were used to detect potential misspelled characters with low probabilities below some pre-defined threshold. To be specific, [1] performs a four-stages procedure to handle spelling check, including character-level n-gram language model, pronunciation and shape-based candidate set generation, candidate corrections filter, and find the best candidate. In [13], the spelling correction process includes two cascaded components: spelling error detection and spelling error correction. In spelling error detection phase, characters with a low score under the language model are regarded as potential error characters. Independent characters after automatic word segmentation are also collected as potential error characters. In the spelling error correction phase, a candidate set will be generated for each error character. The candidate set generation is based on similar pronunciation or shape dictionary (confusion set). The generated candidate set will be filtered by testing if the candidate character can construct a legal word with neighbor characters. After filtering, the character with the highest score calculated by the language model will be selected as the best candidate. However, such unsupervised methods cannot model the semantic context, which is important information in CSC tasks.

Supervised Methods
Several supervised methods have been proposed to tackle this problem. They regard the CSC task as a sequence labeling problem [14][15][16], in which RNN based models are adopted. Confusion set guided Pointer Networks [7] used a pointer network to copy similar characters from the confusion set directly. A chunk-based framework was proposed in [17] to correct single-character and multi-character word errors uniformly.
Recently, BERT [9], the masked language representation model, has been successfully applied in CSC task. FASPell [11] fine-tunes a pre-trained BERT to produce character candidates and then use a confidence-similarity decoder to filter the candidates. Candidate characters with high contextual confidence and similarity from the original character are more likely to be the best. Confusionset-guided Pointer Networks [7] proposed a Seq2Seq model to copy an input sentence to a corrected new sentence through a pointer network. Soft-Masked BERT [12] consists of a network for error detection based on GRU [18] and a network for error correction based on BERT, with the former being connected to the latter with a soft-masking technique. The character embeddings of potential error characters are masked via a soft pointer to help BERT select the best candidate.
Graph-based methods have also been proposed to utilize external information of characters for CSC task. SpellGCN [8] with graph convolutional network (GCN) aims to capture similarity information between characters and it is able to achieve the SOTA results on benchmarks. Specifically, two graphs are built for pronunciation and shape similarities correspondingly. An attentive graph is used to combine node representations of the two graphs. For characters in a given input text, SpellGCN uses BERT to extract context representations. Then, similarities between these context representations and node representations extracted by GCN are measured to find the correct answer. SpellGCN relies on the confusion set, which requires human annotation and only covers a subset of Chinese characters.

Approach
In this section, we describe the CSC task and introduce the Feature-enhanced Siamese BERT (FES-BERT) which is able to jointly learn contextualized representations of characters' shape, pronunciation, and semantic features.

Problem Formulation
Chinese Spelling Check can be formalized as a generation problem by modeling the probability P(Y|X). Given a text sequence X = {x 1 , x 2 , . . . , x l } of l characters containing misspelled characters, the goal of our task is to transform the input sentence into a correct text sequence Y = {y 1 , y 2 , . . . , y l } in which the wrong characters are recognized and corrected.

Model Overview
The framework of the proposed method is depicted in Figure 1. It consists of two components, i.e., a Feature-enhanced BERT (FE-BERT) for visual and phonological errors and a vanilla BERT for visually and phonologically irrelevant typos. The prediction results of the BERT and the FE-BERT are summed up through a dynamic soft pointer to reach a balance between semantic constraints and external feature constraints. The details are shown in Algorithm 1. More specifically, shape embeddings for visual features and pronunciation embeddings for phonologically features of each character are provided for the Feature-enhanced BERT as clues for selecting the best candidate. Shape embeddings of all Chinese characters are retrieved from a Chinese character glyph graph. In the glyph graph, Characters with the same components in the structure are connected and the edges are given different weights in terms of the number of character strokes and node neighbors. The pronunciation embeddings are initialized according to Chinese Pinyin, which serves as a romanization system providing phonetic-based information for Chinese characters.
In the following subsections, we describe (1) the Feature-enhanced BERT(FE-BERT) for errors similar in pronunciation or shape. (2) the combination of BERT and FE-BERT.

. Shape Embeddings for Visual Features
We retrieve shape embeddings for Chinese characters from a glyph graph. As illustrated in Figure 2, Chinese characters can be repeatedly disassembled until the structure can not be subdivided. For example, character "您" can be divided into "你" and "心", while "你" can be split into "亻" and "尔". The same is true for character "茨". Although the shapes of Chinese characters are different from others, they share common basic components which are also Chinese characters. The structure information of characters are labeled by the Kanji Database Project (http://kanji-database.sourceforge.net/, accessed on 12 September 2021). The shape information of Chinese characters is represented by unicode standard-Ideographic Description Sequence (IDS). For example, character "您" is represented as "U+60A8 您你心", while character "你" is represented as "U+4F60 你亻尔". As illustrated in Figure 3, we build the glyph graph according to IDS labels of Chinese characters by connecting characters and their sub-components. Table 2 shows some statistics of the glyph graph.  Considering that different character pairs have different similarity degrees and need to be distinguished, we make the glyph graph directed and weighted in terms of the number of strokes and the neighbors of graph nodes. Figure 4 shows an extreme situation that a frequently-used radical "亻" connects to 1758 characters. This indicates why the number of neighbors should be considered when setting up the edge weights. These characters sharing the same radical "亻" are similar, but their similarities should be different from each other. If every node is connected equally without weights, the characters connected by "亻" will be similar to each other with the same degree, which is not true. Besides, the number of character strokes is also important. Connected characters with closer stroke numbers may look more similar than others, such as character pair "仁" and "仨" are more similar compared to pair "亻" and "仁".

Shape Embeddings
Node2vec [19] algorithm is adopted to retrieve shape representations for characters in the glyph graph. Node2vec is an extension of the Word2vec [20] algorithm, which can generate vector representations of nodes on a graph. It follows the intuition that random walks through a graph can be treated like sentences in a corpus. A random walk is treated as a sentence and each node in the walk is treated as a word. By applying Word2vec algorithms such as skip-gram or continuous bag of words model on these "sentences", the algorithm generates shape embeddings for all nodes in the glyph graph. Visualization results of shape embeddings will be shown in Section 4.6.
It should be emphasized that the purpose of the glyph graph is not to directly learn the similarity between characters, but to learn the structure information of characters. Therefore models using these shape embeddings can learn structure relation from error pairs similar in shape and make corrections under the guidance of visual features.

Pronunciation Embeddings Phonological Features
Pinyin is an official romanization system for Standard Chinese, which provides phonetic-based information for Chinese characters. The word "Pinyin" literally means "spelled sounds". In the Pinyin system, each character has one syllable, which consists of three components: an initial (consonant), a final (vowel), and a tone [21]. There are thousands of Chinese characters sharing hundreds (402) of Pinyin codes with different tones. Following [22], we ignore the tones here because the pronunciation of characters can already be regarded as similar if the initial (consonant) and final (vowel) of Pinyin are the same. Pinyin itself is already a mature representation of phonological features of Chinese characters. Different from shape embeddings retrieved by graph-based algorithm, the pronunciation embeddings are initialized randomly according to the Pinyin codes of characters and updated during the training phase. Case analysis results show that pronunciation embeddings can deliver phonological information and help the model to make correct predictions in phonological constraints.

Feature-Enhanced BERT
We use a Feature-enhanced BERT (FE-BERT) to deal with spelling errors with similar pronunciations or shape. Compared to original BERT, FE-BERT requires two more embeddings as input: shape embeddings for visual features and pronunciation embeddings for phonetic features. Let E p = {e p1 , e p2 , . . . , e pM } denote the Pinyin embeddings, E s = {e s0 , e s1 , . . . , e sN } denote the glyph embeddings. "M" is the number of Pinyin codes for Chinese characters' pronunciation, "N" is the number of Chinese characters.
The input sentence is converted to a sequence of embeddings that are constructed by summing up token, segment, position, shape, and pronunciation embeddings. For characters in the text sequence X = {x 1 , x 2 , . . . , x l }, BERT is used to retrieve the corresponding output representations H FE = {h f e1 , h f e2 , . . . , h f el }. A visualization of this construction is shown in Figure 1.

Feature-Enhanced Siamese BERT
On one hand, the existence of external features can be the guidance to correct visually and phonologically related errors. On the other hand, the external features may be useless or even harmful when it comes to visually and phonologically irrelevant spelling errors. To reach a balance between semantic constraints and external feature constraints, and to fully leverage the reasoning capacity of the pre-trained BERT, we combine the prediction results of an original BERT and the FE-BERT through a dynamic pointer p sem .
For character x i in the input sequence, there is a purely semantic vector representation h i from BERT and another vector representation h f ei mixed with external features from FE-BERT. The dynamic pointer p sem to combine h i and h f ei is dependent on h i , h f ei and h cls , where h cls is the representation for token "[CLS]" from BERT. The final probability for each candidate is defined as where W p , W 1 , W 2 and b p , b 1 are trainable parameters. The predicted character y i at position i is the character with the max probability y i = arg max P i . Finally, the learning objective is to maximize the log likelihood of target characters:  [7]. Statistics of the SIGHAN datasets are listed in Table 2.
Error Type Balanced Dataset. SIGHAN are real-world datasets generated by the modification of international students' compositions, they reflect the real situation in practice. However, they might not be fully capable to evaluate CSC models accurately and comprehensively because: (1) The scale of SIGHAN datasets is too small. There are only 1k sentences in the SIGHAN15 test dataset. (2) The errors in SIGHAN datasets are mainly spelling errors in similar pronunciations. Errors in similar shapes and errors which are visually and phonologically irrelevant are not included. (3) Most spelling errors in current benchmark datasets are leaked to CSC models through widely-used auto-generated datasets in training process. The ability to handle unseen new errors of CSC models is never evaluated. To handle these issues, we construct a relatively large-scale Error Type Balanced Dataset.
We repartition the auto-generated 271K samples dataset and modify the spelling errors to build an error-type balanced dataset. There are four kinds of errors defined by character's similarity: similar only in shape, similar only in pronunciation, similar in both shape and pronunciation, and not similar in shape nor pronunciation. Besides, it is also necessary to evaluate errors that are not included in the confusion set. If character x i is misspelled as x j in training materials, the error pair < x i → x j > is defined as "seen" by CSC models in training phase. In the test dataset, there are spelling errors not seen by CSC models in the training materials. This is to evaluate the performance on new errors. Statistics of the Error Type Balanced Dataset are listed in Table 3. Table 3. Statistics information of the Error Type Balanced Dataset. "Seen Errors" means the errors concluded in the training dataset. "New Errors" means the spelling errors in sentences that are not in the training dataset, which are new to CSC models. Besides, each type of error is divided into two situations according to whether the error pair appears in the confusion set. "Confusion Set Coverage" means the portion of error pairs covered by the confusion set.

Seen Errors
Error

Baselines
We compare our method with several strong baselines.
• PN [7]: This method copies candidate characters from a confusion set by Pointer Network [25]. • FASpell [11]: FASPell is a new paradigm for CSC which consists of a denoising autoencoder (DAE) and a decoder. Candidate characters are retrieved by a pre-trained masked language model and a specialized decoder utilizing the salient feature of Chinese character similarity is used to select the best candidate. • BERT-Embed [9]: The word embeddings are used as the softmax layer on the top of BERT for the CSC task. This method served as a baseline in [9]. • BERT-Linear [9]: The original BERT without shape or phonological features. A pretrained linear layer is used to make predictions over hidden states of the last layer of BERT. • Spell-GCN [8]: Phonological and visual similarity knowledge is integrated into language models for CSC via a specialized graph convolutional network (SpellGCN). • Soft-Masked BERT [12]: This model consists of a network for error detection and a network for error correction based on BERT, with the former being connected to the latter with a "soft-masking" technique.
When training shape embeddings using Node2vec algorithm on the glyph graph, the return parameter p and the in-out parameter q are 1 and 8, respectively; the length and number of walk per source are 3 and 300; the context size for optimization is 2. We force the Node2vec algorithm to concentrate on neighbors which are characters with the same components by setting the context size and walk length small and the in-out parameter q large. Table 4 shows experimental results of the above methods on three SIGHAN datasets. The FES-BERT equipped with shape and phonic features achieves better performance against vanilla BERT and SpellGCN on most metrics of SIGHAN datasets. In terms of sentence-level F1 score metric in the correction subtask, i.e., C-F score in the last column, the improvements against previous best results (Spell-GCN) are 3.3%, 1.5%, and 0.2% points respectively. In terms of character-level F1 score metric in the correction subtask, the improvements against SpellGCN are 2.0%, 0.1%, and −0.1%, respectively. This demonstrates the effectiveness of our proposed method on benchmark datasets. Table 5 exhibits model performance on different kinds of spelling errors, which are respectively about pronunciation, shape, and others. There are four situations for a CSC model to correct a spelling error according to whether the error pairs are in the confusion set and whether the error pairs have been seen in training materials. For errors in both train and test datasets, SpellGCN and FES-BERT reach close scores on the sentence level F1 metrics while the latter has a small advantage. For new errors that only exist in the test dataset, the FES-BERT achieves a large advantage on the sentence level F1 metrics. This demonstrates that FES-BERT has a better generalization performance than SpellGCN and BERT. For errors that are new but included in the confusion set, SpellGCN achieves an obvious advantage over vanilla BERT. This indicates that it is helpful to collect error characters into a confusion set. Table 4. The performance of our method and baseline models (%). D, C denote the detection, correction, respectively. P, R, F denote the precision, recall and F1 score, respectively. Best results are in bold. Following experiment setting of [8], We performed additional fine-tuning on SIGHAN13 for 3 epochs as the data distribution in SIGHAN13 differs from other datasets, e.g., "的得地" are rarely distinguished. Dataset. The results with ‡ are reproduced by our own implementation.

Character-Level
Sentence-Level    Table 5. Sentence level F1 scores on Error Type Balanced Dataset. "In" means in the confusion set, "out" means out of the confusion set, "p" means phonic, "s" means shape. "New errors" are error pairs never seen by CSC models in the training phase while "seen errors" are error pairs appeared in training materials; "all" in last column means sentence level F1 scores on all error types together.

Ablation Studies
In this section, we analyze the effect of shape embeddings and pronunciation embeddings on SIGHAN15. We trained FES-BERT with only pronunciation embeddings or shape embeddings as an external feature. The experimental results on SIGHAN15 can be seen in Table 6. Figure 5 shows the test curves of FES-BERT-S, FES-BERT-P, FES-BERT, BERT-Linear, FE-BERT, and Soft-Masked BERT.
Compared to vanilla BERT, the FES-BERT with additional features converges rapidly in less than four epochs. In terms of sentence-level F1 score metric in the correction subtask at epoch 3, the improvements of FES-BERT-S, FES-BERT-P, and FE-BERT against the original BERT (BERT-Linear) are 2.9%, 3.3%, and 1.8%, respectively. In terms of character-level F1 score metric in the detection subtask, the improvements of FES-BERT-S, FES-BERT-P, and FE-BERT against the original BERT are 3.9%, 4.7%, and 1.8%, respectively. All three models achieve better performance on all metrics than to the original BERT. This indicates that both pronunciation and shape features are necessary to correct Chinese spelling errors. Models with the siamese structure achieve higher scores than FE-BERT, which means the siamese structure is also important for correcting spelling errors. Table 6. Experimental results of ablation study. FES-BERT-P means FES-BERT with only pronunciation features, FES-BERT-S means FES-BERT with only shape features. FE-BERT is a Feature-enhanced BERT with shape and pronunciation features but without the siamese original BERT.

Character-Level
Sentence-Level

Case Study
We show several spelling error cases either similar in shape or similar in pronunciation in Table 7. There are more than one semantically appropriate candidate characters for these cases, which means information of pronunciation and shape is necessary to make further judgments.
For sentences with error characters in similar shapes, such as "他们计划舂(chong)天 去爬山。", the corresponding correct sentence is "他们计划春(chun)天去爬山。(They plan to climb mountains in spring.)". The Chinese character " 春(chun)" is misspelled as " 舂(chong)" in a similar shapes. There are several semantically appropriate candidate characters such as "秋(qiu)", "春(chun)" and "明(ming)". It is reasonable in semantics for BERT to correct the sentence to "他们计划明(ming)天去爬山。(They plan to climb mountains tomorrow.)" or "他们计划秋(qiu)天去爬山。(They plan to climb and mountains in autumn.)". However, both 明(ming) and 秋(qiu) are not consistent with the shape constraints of 舂(chong). The errors are easy to detect but difficult to correct for vanilla BERT with only semantic information. However, the FES-BERT can find the best candidates "春(chun)" according to the shape constraints from shape embeddings of character "舂(chong)". The same is true for character pair "愁(chou)" and "秋(qiu)". SpellGCN also makes correct predictions in the case of "舂" and "春" which are in the character pairs of the confusion set. However, SpellGCN fails to correct "轻哽(geng)卡车" to "轻便(bian)卡 车" because character pair "哽(geng)" and "便(bian)" are not in the confusion set, which means "哽(geng)" and "便(bian)" are considered similar by Spell-GCN.
The Soft-Masked BERT corrects both "舂(chong)天" and "愁(chou)天" into "秋(qiu)天". The reason may be that the soft masks avoids the influence of original error characters. Besides, it is worth mentioning that most spelling errors can be corrected by BERT according to semantic information without other features. If the error pair in input sentence has appeared in the training materials, BERT can make correct predictions easily.

Characters with Similar Pronunciations
Visualization of Embeddings Figure 6 shows visualization results of embedddings by t-SNE [29]. Characters with Pinyin "lin", "ning", or "ta" are presented in the figure. Figure 6a shows visualization results of pronunciation embeddings. Characters with different pronunciations are separated easily. The reason is that characters with similar pronunciations in the graph share the same Pinyin codes (without tone), so their phonological features are represented by the same pronunciation embedding. Figure 6b shows visualization results of shape embeddings, in which the similar characters in terms of shape are placed together. For example, characters "宁柠拧泞咛" with common components "宁" are located closer to each other than other characters. This indicates the shape embeddings retrieved from the glyph graph do contain structure information of Chinese characters and can be used to measure the similarity of characters in shape. Figure 6c shows visualization results of BERT token embeddings. Characters in Figure 6c do not gather together either by shape or sound. Embeddings of original BERT failed to capture similarity information in terms of shape or pronunciation. Figure 6d shows visualization results of embeddings which are the sum of shape, pronunciation, and BERT char embeddings. In Figure 6d, characters with similar shapes and similar pronunciations exhibit cluster patterns in an obvious trend. The characters with similar shapes such as "遴磷麟鳞膦" or "他她地" are placed closely from each other. Characters with similar pronunciations exhibit the same phenomenon, are also closely. Characters with both similar shapes and similar pronunciations are placed closer than those with only one similar feature. For example, "遢榻拓" are characters with the same pronunciation, so these three characters are closer than other words. Meanwhile, "遢榻" looks more similar than "塌拓", so "遢榻" are closer than "塌拓" in Figure 6d. Due to this property, the feature-enhanced model may tend to recognize the similarity between characters and is able to search for answers with shape or pronunciation constraints.

Conclusions
We incorporate shape and pronunciation features of Chinese characters into BERT language models, and a good balance between semantic constraints and Chinese characters' external feature constraints is reached via a siamese model structure. A novel and efficient graph-based method is introduced for retrieving shape features of Chinese characters. The proposed method achieves better experimental results compared to previous SOTA models SpellGCN and Soft-Masked BERT. Case studies compared to the original BERT, SpellGCN, and Soft-Masked BERT show that our model is able to search candidate characters more accurately with the constraints of shape and pronunciation. For more effective development and evaluation of CSC methods, we built an error type balanced dataset by repartitioning the auto-generated 271K CSC dataset and modifying the types of spelling errors.
As for future work, we plan to develop an end-to-end CSC system based on the glyph graph and explore models with more powerful reasoning capabilities through external features of characters, instead of simply "remembering" spelling errors that appeared in the training materials.