Chinese Text Readability Assessment Based on the Integration of Visualized Part-of-Speech Information with Linguistic Features

Chi-Yi Hsieh; Jing-Yan Lin; Chi-Wen Hsieh; Bo-Yuan Huang; Yi-Chi Huang; Yu-Xiang Chen

doi:10.3390/a18120777

,

and

¹

The Institute of Chinese Language Education, National Kaohsiung Normal University, Kaohsiung 80201, Taiwan

²

Department of Electrical Engineering, National Chiayi University, Chiayi City 600325, Taiwan

³

Department of Electrical Engineering, National Chung Cheng University, Minhsiung 621301, Taiwan

⁴

Advanced Institute of Manufacturing with High-Tech Innovations, Ans. 621301 Innovation Building R209, 168 University Road, Ming-Hsiung Township, Chia-Yi 621301, Taiwan

Algorithms2025, 18(12), 777;https://doi.org/10.3390/a18120777

This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering

Version Notes

Order Reprints

Review Reports

Abstract

The assessment of Chinese text readability plays a significant role in Chinese language education. Due to the intrinsic differences between alphabetic languages and Chinese character representations, the readability assessment becomes more challenging in terms of the language’s inherent complexity in vocabulary, syntax, and semantics. The article proposed the conceptual analogy between Chinese readability assessment and music’s rhythm and tempo patterns, in which the syntactic structures of the Chinese sentences could be transformed into an image. The Chinese Knowledge and Information Processing Tagger (CkipTagger) tool developed by Sinica-Taiwan is utilized to decompose the Chinese text into a set of tokens. These tokens are then refined through a user-defined token pool to retain meaningful units. An image with part-of-speech (POS) information will be generated by using the token versus syntax alignment. A discrete cosine transform (DCT) is then applied to extract the temporal characteristics of the text. Moreover, the study integrated four categories: linguistic features–type–token ratio, average sentence length, total word, and difficulty level of vocabulary for the readability assessment. Finally, these features were fed into the Support Vector Machine (SVM) network for the classifications. Furthermore, a bidirectional long short-term memory (Bi-LSTM) network is adopted for quantitative comparisons. In simulation, a total of 774 Chinese texts fitted with Taiwan Benchmarks for the Chinese Language were selected and graded by Chinese language experts, consisting of equal amounts of basic, intermediate, and advanced levels. The finding indicated the proposed POS with the linguistic features work well in the SVM network, and the performance matches with the more complex architectures like the Bi-LSTM network in Chinese readability assessments.

Keywords:

Chinese text readability; part-of-speech; natural language processing; machine learning

1. Introduction

Since the late 19th century, a growing concern has been paying more attention to text readability. Initially, the focus was predominantly on the literature; however, the importance of the need for clarity in communication soon became to be addressed by other industries, like insurance policies, health education materials, and instruction manuals, newspapers, as well as learning materials [1,2]. Sherman [3] L. A. was one of the early pioneers in this field. In 1893, he took the first steps towards measuring readability using a quantifiable text feature. Subsequently, many scholars [4,5,6,7,8] from both the sciences and education have collaborated to improve teaching, reading comprehension, and related research areas, enabling more effective communication with the target audience by making texts easier to understand.

Previous research strongly suggests that text readability is directly linked to its inherent structural and content characteristics [9,10,11]. Furthermore, these measurable characteristics can be used to check if a text is easy to read. That is why researchers are so keen on studying these things. Word choice, sentence structure, and meaning clarity—these common linguistic features can effectively capture the lexical, syntactic, and semantic dimensions of text. In this manner, they enable us to know whether the text is of high or low quality and even the appropriateness for its intended audience. Furthermore, De Clercq [12] reported and discussed the issues of quality and suitability of a text. Carefully examining these features allows us to gain valuable insights into how people understand a text. Thus, this analysis can ultimately aid the present research.

There are two main ways to compute text readability—via traditional formulas and through machine learning techniques [13,14]. In the case of the traditional approach, the emphasis is on the linguistic features that can be easily counted—such as word count, vocabulary length, sentence structure, vocabulary difficulty, and even the vocabulary frequency—to generate readability gradings. The Lexile, Advantage-TASA Open Standard (ATOS), and Degrees of Reading Power (DRP) metrics are frequently considered key indicators of text readability within the field owing to their ease of application [15]. However, it also limits their capacity to capture the full nuances of language, resulting in a superficial evaluation. Conversely, machine learning techniques have an advantage as an alternative solution. These computational gadgets can automatically detect hidden patterns and semantic information that traditional formulas simply miss [13,14,15,16,17,18]. They can further perceive the details of language with a complete and accurate assessment. Thanks to the latest discoveries in machine learning and deep learning, this field is improving with speed. A prime example is how NLP uses neural networks to automatically extract and assess features within a text [19,20].

Both the two thinking paths have different pros and cons. Traditional formulas are quick and easy to use and complete a quick assessment. However, they poorly represent many aspects of text, leading to the misreading of a passage. The reader may obtain a shallow understanding of the intricate parts of the text. In another case, text analysis approaches incorporating machine learning can handle a broader range of linguistic features and adapt to different text genres with impressive flexibility. They often surpass traditional formulas; therefore, the resultant readability is far more precise. Nevertheless, these powerful tools also come at a cost besides their high performance, as they require considerable computational resources, and the users should specialize in using them.

Assessing the readability of Chinese text faces another challenge due to the intrinsic properties different from other alphabetic languages, even though some attempts have been made [21,22,23]. One thing that makes the Chinese language so difficult to grasp is its complex logographic writing system, huge vocabulary, intricate grammar, and diverse tokenization [23,24]. It is not like English or other alphabetic languages; Chinese contains thousands of symbolic characters that may have different meanings due to context, which complicates it significantly.

Many readability assessment tools are primarily designed for alphabetic languages, such as English [25], and are narrowing down their ability to inspect and recognize the structure of Chinese text in a valid and proper manner. Intuitively, they might offer a superficial assessment, but they are irrelevant tools for the task. This highlights the importance of the innovation of methods and tools that cater to the unique characteristics of the Chinese language. Although it is challenging, it is necessary for the correct interpretation and meaningful score of reading levels.

This study proposed two categories of neural networks for the assessment of Chinese text grading and to quantitively analyze how text features affect Chinese text readability. One of the novel concepts is to map the POS features onto a two-dimensional image to reserve the temporal structure of the given text. By integrating statistical linguistic features, the proposed algorithm not only reserves the time-series information, but does so with traditional linguistic features for modeling, analysis, and assessment of text readability. On the other hand, the LSTM neural network is adopted due to its powerful intrinsic design, e.g., using the series gates to manipulate the different memories of text tokens for comparisons.

For model construction, two evaluation frameworks are proposed:

Support Vector Machine (SVM): Uses common linguistic features and visual spectrograms, e.g., POS, as input for classification.
Bidirectional Long Short-Term Memory (Bi-LSTM): Takes the segmented words of the text, e.g., word embedding, as input and processes them as sequential data.

By comparing the performance of these two models in assessing Chinese text readability, this study aims to build a multimodal evaluation framework that combines language and visual information, providing a new technical reference for natural Chinese language processing tasks.

2. Materials and Methods

To effectively assess the readability of Chinese text, this article proposes two architectures, including SVM and LSTM neural networks. As shown in Figure 1, the schematic framework illustrates how data flows from the input of a given Chinese text, through feature transformation and model classification, to finally output the corresponding readability level. In SVM classifier, three combinations of features are utilized: POS spectrograms, linguistic features, and a hybrid of them. In Bi-LSTM structures, the input consists of word sequences obtained through tokenization using Hugging Face tools, enabling the model to capture contextual and temporal dependencies within the text.

Figure 1. Overall architecture flowchart of the proposed Chinese text readability assessment.

2.1. Chinese Text Datasets

The evaluation indicator of this study is the difficulty level of the texts, and the evaluation criteria are based on the Chinese proficiency grading standards in Taiwan and the results of expert review. These materials were collected, and grading by Chinese language experts consists of an equal amount of texts at the basic, intermediate, and advanced levels, totaling 774 texts. To ensure consistency in evaluation, this study adopted a unified grading standard, namely the Taiwan Benchmarks for the Chinese Language (TBCL) [26], developed by the National Academy for Educational Research. The standard is based on the language proficiency and learning needs of Chinese learners and covers five aspects: phonetics, vocabulary, grammar, pragmatics, and culture. These experts first graded each text independently and then discussed them together to determine the final level collectively. Table 1 lists the number of texts in each difficulty level after the experts’ review and confirmation—there are, in total, 774 texts and, they are equally a part of the three grades.

Table 1. The number of texts in each level.

2.2. Data Preprocessing

Before being unleashed on these texts, several fundamental preprocessing steps were enacted. Different preprocessing pipelines were designed for the two models to generate the corresponding input formats.

In the SVM model, the Chinese Knowledge and Information Processing (CKIP) toolkit [27] was used to perform word segmentation and POS tagging on the given Chinese text. Note that the structure of Chinese POS is quite different from English; for example, the English word “morning” might be rewritten as “zao shang” in Chinese, but “shang” in Chinese can also mean “up” or “above” in English. That means there are a lot of possibilities of tokenization in Chinese language. Therefore, we had built a mapping table to restrict the meaningful combinations. Through this process, non-space characters were split into individual words, and each word was assigned a POS tag such as adjective, verb, noun, etc. This step can be viewed as identifying the different parts of a sentence, with the aim of clearly indicating each word’s grammatical function.

For the Bi-LSTM model, we utilized the pre-trained JinaBERT Chinese word embedding model available on the Hugging Face platform to convert the original sentences of given text into vector representations [28], then put the vector into the Bi-LSTM network for text readability assessment.

2.3. Text Feature

In the SVM classifier, the two-dimensional POS spectrogram generated after POS tagging and the selected linguistic features are detailed as follows:

2.3.1. Part-of-Speech (POS) Spectrum Feature

Basically, the characteristics of POS of Chinese texts implies the syntax and structure of the text. We found the conceptual analogy between Chinese text grading and music style classification, in which the rhythm and tempo patterns in music resemble the syntactic structures present in Chinese sentences. By employing spectrogram-like visualizations of POS tags, the study aims to capture the structural patterns inherent in Chinese texts. This feature visualizes the frequency variations in different POS categories across the text sequence in the form of an image, thereby providing supplementary information for the model in assessing text readability.

Firstly, we used the CkipTagger [27] developed by the Academia Sinica, together with the designed mapping table to perform lexical segmentation and POS annotation on the original Chinese texts. Then, the tagging results are converted into a 2D matrix image, where the x-axis represents the sequential position of characters or words in the text, and the y-axis corresponds to 26 POS labels, including 14 primary POS categories consisting of nouns, verbs, adjectives, adverbs, pronouns, determiners, prepositions, conjunctions, particles, numerals, classifiers, modal verbs, onomatopoeia, interjections, and 11 punctuation marks and a whitespace.

In Figure 2, the Chinese token “wang” and “lin xin xin”, i.e., “王” and “林心心”, represented in Chinese characters and words from the given text, as shown in Figure 2, are annotated as Nb (proper noun). And “lao shi”, “ke”, “jiao”, ”shang”, and “ji de”, i.e., “老師”, “課”, “教”, “上”, and “記得”, are represented in Chinese characters as Na, Na, VC, VC, and VK, respectively, where Na, VC, and VK are the common noun, transitive verb, and cognitive verb, respectively. However, to simplify the graph design of the POS spectrum, we mapped these detailed categories into higher-level coarse-grained categories. For example, Na and Nb are merged into N (noun), while VC, VK, and similar tags are merged into V (verb). This simplification helps reduce dimensionality and spectrum complexity.

Figure 2. Tree diagram of Chinese POS categories.

The image shown at the bottom of Figure 3 is the POS representation. Each word is represented as a row using a one-hot vector indicating its POS category. Each row is a 26-dimensional vector corresponding to the 26 predefined categories, which include POS categories, punctuation marks, and a whitespace symbol, as listed in Table 2.

Figure 3. An example Chinese text and its extracted POS tags. The different colors represent the different POS tags. The corresponding POS spectrum image visualizes the character-level tagging, where the X-axis represents the sequence of characters in the text and the Y-axis indicates 26 POS categories, including 14 major POS tags, 11 punctuation types, and 1 whitespace. To enhance visual clarity and aesthetics, dashed lines are added to divide the image into 26 equal sections along the horizontal direction, making it easier to distinguish between different POS categories.

Table 2. Index of POS, punctuation, and whitespace categories in the POS spectrum image.

Although the color blocks in Figure 3 represent the POS tags, during real processing, the binary image of POS tagging is used to extract the DCT coefficients. The colors are only used for recognition by human eyes but does not provide information for feature extraction process. And another issue is the Chinese text length. To depress the variation effect of given article lengths, all representations are normalized with respect to the longest article. For shorter articles, we extend their representations by repetitively sampling from the beginning of the text until the required length is achieved.

This representation method effectively preserves syntactic patterns and structural information while reducing vector complexity and enhancing comparability across samples. The POS spectrum image not only serves as an alternative representation of textual structure but functions as a feature map input for deep learning models, thereby improving the model’s ability to recognize syntactic patterns in readability assessment tasks. Because we want to extract the temporal information and capture the main patterns of POS, DCT coefficients of POS are used for assessment. These DCT coefficients reserved significant syntax and structural information related to Chinese text readability.

2.3.2. Selected Common Linguistic Features

Many linguistic features are commonly used in text analysis to reveal various aspects of a text’s content and structure. To evaluate the complexity of Chinese texts, we selected four types of textual features, which are described in detail below.

Type–Token Ratio (TTR): Measures the ratio of the number of non-repeated words to the total number of words within the same part of speech, used to evaluate lexical diversity. The calculation formula is as follows:

T T R = \frac{t h e n u m b e r o f d i s t i n c t w o r d s w i t h i n t h e s a m e P O S}{\sqrt{2 \times a l l w o r d s o f t h e s a m e P O S}}

This feature reflects the diversity and degree of organization in word usage within the text. A higher TTR indicates richer vocabulary, which may suggest that the text is relatively more difficult.

II.: Total Number of Words: The total number of words in the text, regardless of the related POS. This feature reflects the length and density of the text. Longer texts often contain more semantic content and structural variation, which may increase the level of difficulty in comprehension.
III.: Average Sentence Length: The average number of characters per sentence in the text. The calculation formula is as follows:

A v e r a g e s e n t e n c e l e n g t h = \frac{A l l C h a r a c t e r s}{A l l P u n c t u a t i o n}

The longer the sentence, the higher the likelihood of syntactic nesting and information density, which may increase the difficulty of comprehension.

IV.: Difficulty Level of Vocabulary: The Taiwan Benchmarks for the Chinese Language (TBCL) [26] was released by the National Academy for educational research. This feature is closely linked to the difficulty level of the text.

2.4. Text Analysis with Neural Network Classifiers

One of the primary objectives of this study is to employ appropriate machine learning techniques to address the complexity and heterogeneity inherent in textual data. The research involves the training and comparative evaluation of two distinct models—SVM and Bi-LSTM networks—for the task of text readability classification. Through systematic optimization of model parameters and refinement of classification strategies, the proposed approach aims to effectively capture subtle linguistic features and distinctions, thereby establishing a robust foundation for accurate readability level assessment.

2.4.1. Support Vector Machine (SVM)

SVM is a supervised learning method based on statistics and convex optimization theory. Its core idea is to find the optimal hyperplane in the feature space that maximizes the margin, thereby improving classification accuracy. In the study, four categories—27 common linguistic features and DCT coefficients of POS spectrum—were sent to the SVM neural network for Chinese text readability classifications.

From a mathematical perspective, the training objective of the SVM can be formulated as the following optimization problem:

\underset{w, b, ξ_{i}}{m i n} \frac{1}{2} {‖w‖}^{2} + C \sum_{i = 1}^{n} ξ_{i}

s u b j e c t t o y_{i} (w^{T} \emptyset (x_{i}) + b) \geq 1 - ξ_{i}, ξ_{i} \geq 0, \forall i = 1, \dots, n

where w is the weight vector, b is the bias term, and ϕ(x_i) represents the input data

x_{i}

that will be mapped onto a higher-dimensional feature space via the function ϕ. C is the regularization parameter, which controls the degree of tolerance for classification errors.

ξ_{i}

denotes the slack variable used to handle potential misclassified samples in the training data [29].

This study adopts the Radial Basis Function kernel to enhance the model’s ability to fit nonlinear decision boundaries. The optimal regularization parameter C was selected through cross-validation, and the model is trained to perform a three-class classification task.

2.4.2. Bidirectional Long Short-Term Memory (Bi-LSTM)

Long Short-Term Memory networks (LSTM) were first proposed by Hochreiter and Schmidhuber in 1997. As a typical structure of recurrent neural networks (RNN), LSTM effectively addresses the issues of vanishing and exploding gradients. Its core lies in introducing memory cells and gating mechanisms, which enable the model to capture long-term dependencies in sequences [30]. These gating mechanisms are responsible for regulating how information in the memory cells is updated and utilized, thereby improving the ability to retain and process long-term information in sequential data [31]. Bi-LSTM extracts features in both forward and backward directions based on the naïve LSTM architecture, allowing it to capture more complex and detailed features from sequential data [32].

Then the given text is tokenized and transferred into an encoded vector and finally fed into a Bi-LSTM neural network. Basically, Bi-LSTM can effectively assess the related semantic information of the Chinese text and use it for the final task of readability level classification. The study designs a Bi-LSTM model consisting of two layers of bidirectional LSTM, each with 128 hidden units. The model extracts semantic features from both forward and backward directions of the input sequence. The outputs will be passed through an average pooling layer, followed by a linear classification module that includes a ReLU activation function and Batch Normalization to enhance training stability and nonlinear representation capability. The final output consists of three classification categories. The entire model is implemented using the PyTorch framework [33] and trained with the Cross-Entropy loss function.

2.4.3. Application and Comparison of Bi-LSTM and SVM in Text Classification

The LSTM neural network is widely recognized for its ability to process sequential data, making it a highly successful and effective model in NLP research. As previously discussed, the Bi-LSTM model begins by transforming the input text into word embeddings, which are then fed into the network for classification. These word embeddings preserve both semantic and sequential information, enabling the Bi-LSTM to effectively capture contextual meaning through its gating mechanisms, which are designed to model both short-term and long-term dependencies within the text.

Compared to the LSTM network, the SVM neural network lacks mechanisms to capture temporal dependencies inherent in language data. However, the temporal information plays a dominant role in NLP studies. Therefore, this study employs imaged POS features combined with common linguistic features for the SVM model that allows the proposed SVM architecture to reflect the temporal features. The proposed process is shown in Figure 4. Word segmentation and the first stage of POS tagging are performed using the CKIP toolkit. Because of the intrinsic differences in tokenization between alphabet language and hieroglyph, the constraint mapping table is necessary for meaningful output for the final POS tagging. Then the words in each sentence are arranged in temporal order to construct a two-dimensional image representing the POS structure. This concept is based on the idea that textual hierarchy is related to the arrangement and complexity of POS patterns. Consequently, the periodic structural information could be extracted from the DCT coefficients of the POS.

Figure 4. Workflow for POS feature extraction via DCT.

By converting the POS into a two-dimensional image mentioned in Figure 2 and Figure 3, frequency-related features can be extracted. Typically, these frequency components of DCT represent a range-dependent association between adjacent words in a sentence. The concept is similar to Bi-LSTM manipulating short-term and long-term features through its gate mechanisms. Compared to Bi-LSTM, the SVM neural network retains competitive classification performance while offering a simpler network structure, effectively reducing computational costs during training and inference. Although the design of the SVM neural network lacks the related time-series information, the proposed algorithm tries to integrate the image-based POS with some popular linguistic features to preserve local sequential information in the syntactic structure of the Chinese text. This structurally and periodically based feature representation not only effectively captures word order patterns and latent structures in sentences but compensates for SVM’s lack of sequence modeling capability in the NLP study, allowing it to achieve a performance comparable to that of the Bi-LSTM model in text classification tasks.

From another viewpoint, the POS represented the major syntactic structure but without the semantic features of the text. Thus, one is a syntactic structure with traditional statistical linguistic features, e.g., SVM neural network, and the other is a complicated schematic feature, e.g., Bi-LSTM, because the structure of LSTM is more complex than the SVM. Clearly, the cost of the training/testing model is effectively depressed using the SVM architecture.

2.5. Hardware and Software Environment for Model Training and Evaluation

The training and testing environments for this study were configured with CUDA 11.7, cuDNN 8.7.0, PyTorch 2.0.1 and Python 3.9. The hardware specifications include a 12th Gen Intel^® Core™ i5-12400 CPU (Intel Inc., Santa Clara, CA, USA), 64 GB of system memory, and an RTX 3060-O12G GPU (Asus Inc., Taipei City, Taiwan). All experiments were conducted under this consistent environment.

2.6. Evaluation of Text Grading

To evaluate the performance of the two methods, the confusion matrix is utilized. Basically, the confusion matrix is the most popular evaluation metric. The numbers in the confusion matrix represent the counts of correct and incorrect predictions made by a classification model, categorized by the actual and predicted text readability, for example, true positives (TPs), where the elements away from the diagonal line are the classifier errors, which indicate classification errors such as false positives (FPs), where the model overestimates the difficulty; false negatives (FNs), where it underestimates it; and true negatives (TNs), where the model correctly identifies non-target class samples. With the analyis and computation of various metrics, including accuracy, precision, recall, and the F1-score, these metrics offer a detailed perspective on our classifier’s performance, aiding us in identifying its strengths and weaknesses in handling different levels of text difficulty.

3. Results and Discussion

3.1. Comparison of LSTM and SVM with Different Input Features

From Figure 5, it can be observed that when only the DCT-transformed coefficients are used as input features for the SVM model, the classification accuracy gradually increases as the number of selected coefficients grows. However, once the number of coefficients exceeds a certain threshold, the accuracy begins to plateau. This indicates that low-order frequency components play a critical role in classification performance, while introducing additional high-frequency coefficients may cause the classification to be saturated. Basically, the integration of hetero features is a complicated issue; inter/intra feature dependency and redundancy analysis are commonly required tools. However, the analysis needs more data scale to avoid an overfitted scenario. According to the tendency of curve and the classification accuracy shown in Figure 5, 46 coefficients of DCT are selected for SVM network readability assessment.

Figure 5. Scatter plot of results using only DCT coefficients as input to SVM.

Feature selection plays a critical role in text readability classification. This study employs the Bi-LSTM and SVM model to investigate the impact of different feature designs on text classification. For the SVM model, this study adopted POS spectrum features and four categories of linguistic features—type–token ratio, average sentence length, total word, and difficulty level of vocabulary. Table 3 shows the corresponding 27-dimensional features of the four categories used for linguistic features. Dimensions 1 to 14 correspond to type–token ratio features, dimension 15 represents average sentence length, dimension 16 indicates total word, and dimensions 17 to 27 correspond to the difficulty level of vocabulary features.

Table 3. Mapping of feature dimensions to linguistic categories.

The left side of Table 4 shows the approximate 83.87% accuracy of segmented semantic features fed into the Bi-LSTM. The result that the Bi-LSTM achieved confirms the effectiveness of deep learning models in handling semantic features.

Table 4. Accuracy of the Bi-LSTM and SVM classifier for different feature categories.

In the SVM classifier, the classification accuracy varies across different linguistic feature groups. Group A, B, C, D, and the POS spectrum achieved accuracies of 80.65%, 81.29%, 85.81%, 83.23%, and 72.26%, respectively. The result seems to be satisfactory and practical by referring to the proposed paper [34]. However, the used linguistic features lack the temporal characteristics of text readability. Therefore, to integrate the temporal information, DCT coefficients of POS are proposed to evaluate the performance. The resulting feature combinations are selected and denoted as A*, B*, C*, and D* and are listed in Table 4. The experimental results show that combining POS spectral features with linguistic features leads to slight improvements in classification accuracy in all cases, indicating a slightly complementary effect between periodic POS spectral features and traditional linguistic features. The accuracy improves from 80.65% to 83.23%, 81.29% to 82.58%, 85.81% to 86.45%, and 83.23% to 85.16% by combining the POS spectrum with Groups A, B, C, and D, respectively. These results suggest that POS spectral features can compensate for syntactic variations not captured by traditional linguistic features and that the combination of both improves the model’s capture of richer linguistic information.

3.2. Confusion Matrices of LSTM and SVM

Figure 6, Figure 7 and Figure 8 show the confusion matrices of the two models under their best configurations. The numbers in these matrices represent the counts of correct and incorrect predictions made by the proposed model, categorized by the actual and predicted text readability. These visual results provide a more intuitive representation of model performance and the impact of different features on the outcomes. The LSTM model shows that intermediate level text has a slight trend misclassified as basic or advanced level. This tendency is similar in the SVM neural network framework. Another interesting inspection of the unexpected jump levels, e.g., the inter-changed, which skips the intermediate level between the basic and advanced level, are few, which means the hyperplane of the features implies a sort of continuity. To compare with Figure 7 and Figure 8, the classifying correctness of the advanced and intermediate levels has slight improvement after the POS feature joined.

Figure 6. Shows the confusion matrix of the LSTM model.

Figure 7. Shows the confusion matrix of the SVM model using traditional linguistic features. From left to right, the four combinations are 1–14, 1–16, 17–27, and 1–27 dimensional features of utilized linguistic features.

Figure 8. Shows the confusion matrix of the SVM model using proposed linguistic features combined with the POS spectrum.

3.3. Execution Time Comparison Between LSTM and SVM

This study also compares the processing time of SVM and Bi-LSTM. Basically, the total processing time consists of two stages: preprocessing and model execution time. The computation loads of SVM mainly consist of the following steps: calling and running CkipTagger, generating 2D images, and extracting corresponding DCT coefficients, as well as computing traditional linguistic features. In contrast, the Bi-LSTM model only performs simple preprocessing using packages provided by Hugging Face without involving additional external network requests. Therefore, due to the reliance on internet connectivity and complex computations in SVM preprocessing, the preprocessing time of the two models cannot be fairly compared.

When comparing the execution time of SVM and Bi-LSTM, it is observed that SVM takes about 1 to 2 min, while Bi-LSTM requires 10 to 20 min. This difference mainly stems from variations in model complexity and processing methods. SVM, as a traditional machine learning method, has a simple model structure that only needs to learn a small number of support vectors and decision boundaries, resulting in high computational efficiency. In contrast, Bi-LSTM is a deep learning architecture designed to handle long-term dependencies in time-series data, involving many parameters and complex gating mechanisms, which significantly increases the time required for training and inference. As a result, when dealing with static features, SVM often completes model construction and prediction in a much shorter time.

3.4. Comparison with Similar Related Work

To discuss the difference in model accuracy, we conducted a qualitative comparison with the previous article proposed by Sung et al. [34]. The comparison shows a gap: Sung’s study achieved an accuracy of 72.92%, whereas this study reached 86.45%. This difference might mainly be attributed to variations in dataset characteristics and input feature design. The reasonable explanation is that all of the dataset was collected by Chinese language experts in our study; the decision surface of the feature hyperplane might be smoother for the classifications. Another factor is that the dataset used in Yao-Ting Sung’s study was sourced from Mandarin textbooks for grades 1 to 6 and published in 2009 by three major publishers in Taiwan, totaling 386 texts. Our study utilized three readability levels with 774 texts from Mandarin teaching materials collected from various countries, twice the amount than [34] proposed. The increased dataset size helps improve the model’s generalization ability and stability, which may be one of the reasons for the higher accuracy achieved in this study.

In addition, the design of input features significantly affects model performance. Yao-Ting Sung’s study employed 24 linguistic features across four categories: lexical, semantic, syntactic, and cohesion-related features. These features are traditional linguistic variables and are slightly different from what we proposed. Our study not only includes linguistic features such as the type–token ratio, average sentence length, total vocabulary size, and word level, but incorporates POS spectrum information. By visualizing the distribution of POS tags, this approach supported temporal information of the Chinese text. This innovative feature design may effectively enhance the model’s ability to assess text readability, thereby improving classification performance. Consequently, the expansion of dataset size and the integrated linguistic features with the POS spectrum contributed to a practical and satisfactory result in Chinese text readability grading.

4. Conclusions

In this study, we investigated how inherent linguistic features and POS spectrum can reveal the factors affecting the comprehension of a text. Two classifying methods, SVM and Bi-LSTM, were employed to classify the 774 Chinese texts into three readability levels: basic, intermediate, and advanced. The findings reveal the multifaceted nature of text comprehension and demonstrate the effectiveness of the two proposed approaches, as they were able to attain commendable accuracy. The choice of model and features depends on the specific application and the available computational resources. One of the key findings in this study is that texts with similar readability levels exhibit comparable image contours in the POS spectrum. Consequently, integrating DCT with the POS spectrum emerges as a promising method for analyzing sequential data. From these results, the POS spectrum features could capture the syntactic information of the Chinese texts, making them a complementary addition to the linguistic features.

In terms of processing time, the experimental results show that the SVM model, due to its simple structure and high computational efficiency, demonstrates a significant advantage during the model execution stage, costing approxiamtely one tenth of computation time compared with the Bi-LSTM model. Additionally, although the preprocessing workflows differ greatly, SVM involves external tools, e.g., CkipTagger and POS spectrum generation, while Bi-LSTM uses preprocessing with Hugging Face packages; this makes preprocessing time between the two models not directly comparable. Overall, SVM performs better in terms of execution efficiency and is more suitable for scenarios requiring fast responses, whereas Bi-LSTM, with its ability to process temporal information, is better suited for tasks involving time-dependent language analysis.

Another interesting viewpoint is the capability for Chinese teaching. Although the text readability assessment is excellent via deep learning technique, e.g., LSTM, understanding and comprehension are quite difficult for learners due to the complicated and nonlinear properties in the neural network. That means it has few contributions for teaching. On the contrary, the statistical linguistic features and rhythm of the text are significant indicators to identify the learning achievement of learners. With the help of linguistic texts and the POS spectrum, creating precise text analysis and readability assessment tools is possible. This can help teachers adjust their teaching scopes to the specific needs of their students, making learning more effective. Future research may explore more sophisticated feature engineering techniques, such as incorporating sentiment analysis or topic modeling, and investigate the application of these models to larger and more diverse datasets.

Author Contributions

Supervision, C.-Y.H. and C.-W.H.; Conceptualization, C.-Y.H. and C.-W.H.; Methodology, C.-Y.H. and C.-W.H.; Review and editing, C.-Y.H.; Visualization, J.-Y.L., B.-Y.H. and Y.-C.H.; Software, J.-Y.L., B.-Y.H. and Y.-X.C.; Validation, J.-Y.L., B.-Y.H., Y.-C.H. and Y.-X.C.; Original draft, C.-W.H.; Writing—original draft, Writing—original draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the first author Professor Chi-Yi Hsieh via the email address: t4136@mail.nknu.edu.tw.

Acknowledgments

We thank Chih-Yen Chen for data validation and thank the Advanced Institute of Manufacturing with High-tech Innovations, CCU, for simulation supports.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Gray, W.S. Summary of reading investigations (July 1, 1936-June 30, 1937). J. Educ. Res. 1938, 31, 401–434. [Google Scholar] [CrossRef]
Gray, C.J. Readability: A Factor in Student Research? Ref. Libr. 2012, 53, 194–205. [Google Scholar] [CrossRef]
Sherman, L.A. Analytics of Literature: A Manual for the Objective Study of English Prose and Poetry; Ginn and Company: Boston, MA, USA, 1893. [Google Scholar]
Hu, T.; Chen, Z.; Ge, J.; Yang, Z.; Xu, J. A Chinese Few-Shot Text Classification Method Utilizing Improved Prompt Learning and Unlabeled Data. Appl. Sci. 2023, 13, 3334. [Google Scholar] [CrossRef]
Liu, H.; Ye, Z.; Zhao, H.; Yang, Y. Chinese Text De-Colloquialization Technique Based on Back-Translation Strategy and End-to-End Learning. Appl. Sci. 2023, 13, 10818. [Google Scholar] [CrossRef]
Guo, S.; Huang, Y.; Huang, B.; Yang, L.; Zhou, C. CWSXLNet: A Sentiment Analysis Model Based on Chinese Word Segmentation Information Enhancement. Appl. Sci. 2023, 13, 4056. [Google Scholar] [CrossRef]
Kostadimas, D.; Kermanidis, K.L.; Andronikos, T. Exploring the Effectiveness of Shallow and L2 Learner-Suitable Textual Features for Supervised and Unsupervised Sentence-Based Readability Assessment. Appl. Sci. 2024, 14, 7997. [Google Scholar] [CrossRef]
Liu, Y.; Li, S.; Deng, Y.; Hao, S.; Wang, L. SSuieBERT: Domain Adaptation Model for Chinese Space Science Text Mining and Information Extraction. Electronics 2024, 13, 2949. [Google Scholar] [CrossRef]
Ratajczak, M. The Effects of Individual Differences and Linguistic Features on Reading Comprehension of Health-Related Texts; Lancaster University: Lancaster, UK, 2020. [Google Scholar]
McNamara, D.S.; Louwerse, M.M.; McCarthy, P.M.; Graesser, A.C. Coh-Metrix: Capturing linguistic features of cohesion. Discourse Process. 2010, 47, 292–330. [Google Scholar] [CrossRef]
Tseng, H.-C.; Chen, B.; Chang, T.-H.; Sung, Y.-T. Integrating LSA-based hierarchical conceptual space and machine learning methods for leveling the readability of domain-specific texts. Nat. Lang. Eng. 2019, 25, 331–361. [Google Scholar] [CrossRef]
De Clercq, O.; Hoste, V. All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch. Comput. Linguist. 2016, 42, 457–490. [Google Scholar] [CrossRef]
Liu, Y.; Ji, M.; Lin, S.S.; Zhao, M.; Lyv, Z. Combining readability formulas and machine learning for reader-oriented evaluation of online health resources. IEEE Access 2021, 9, 67610–67619. [Google Scholar] [CrossRef]
Maqsood, S.; Shahid, A.; Afzal, M.T.; Roman, M.; Khan, Z.; Nawaz, Z.; Aziz, M.H. Assessing English language sentences readability using machine learning models. PeerJ Comput. Sci. 2022, 8, e818. [Google Scholar] [CrossRef]
Dascalu, M. Analyzing Discourse and Text Complexity for Learning and Collaborating; Springer Nature: London, UK, 2014. [Google Scholar]
Pantula, M.; Kuppusamy, K.S. A machine learning-based model to evaluate readability and assess grade level for the web pages. Comput. J. 2022, 65, 831–842. [Google Scholar] [CrossRef]
Martinc, M.; Pollak, S.; Robnik-Šikonja, M. Supervised and unsupervised neural approaches to text readability. Comput. Linguist. 2021, 47, 141–179. [Google Scholar] [CrossRef]
Sung, Y.-T.; Chen, J.-L.; Cha, J.-H.; Tseng, H.-C.; Chang, T.-H.; Chang, K.-E. Constructing and validating readability models: The method of integrating multilevel linguistic features with machine learning. Behav. Res. Methods 2015, 47, 340–354. [Google Scholar] [CrossRef]
Balyan, R.; McCarthy, K.S.; McNamara, D.S. Applying natural language processing and hierarchical machine learning approaches to text difficulty classification. Int. J. Artif. Intell. Educ. 2020, 30, 337–370. [Google Scholar] [CrossRef]
Demner-Fushman, D.; Elhadad, N.; Friedman, C. Natural language processing for health-related texts. In Biomedical Informatics: Computer Applications in Health Care and Biomedicine; Springer International Publishing: Cham, Germany, 2021; pp. 241–272. [Google Scholar]
Curiel, A.; Gutiérrez-Soto, C.; Rojano-Cáceres, J.R. An online multi-source summarization algorithm for text readability in topic-based search. Comput. Speech Lang. 2021, 66, 101143. [Google Scholar] [CrossRef]
Zhang, C. Readability Grading Based on Multidimensional Linguistics Features for International Chinese Language Education. IEEE Access 2024, 12, 88608–88619. [Google Scholar] [CrossRef]
Zhu, S.; Song, J.; Peng, W.; Guo, D.; Wu, G. Text readability assessment for Chinese second language teaching. In Chinese Lexical Semantics: 20th Workshop, CLSW 2019, Beijing, China, 28–30 June 2019, Revised Selected Papers 20; Springer International Publishing: Cham, Switzerland, 2020; pp. 393–405. [Google Scholar]
Wang, W.; Wang, R.; Wang, L.; Wang, Z.; Ye, A. Towards a robust deep neural network against adversarial texts: A survey. IEEE Trans. Knowl. Data Eng. 2021, 35, 3159–3179. [Google Scholar] [CrossRef]
Zulqarnain, M.; Saqlain, M. Text Readability Evaluation in Higher Education Using CNNs. J. Ind. Intell. 2023, 1, 184–193. [Google Scholar] [CrossRef]
Taiwan Benchmarks for the Chinese Language. Available online: https://bcoct.naer.edu.tw/TBCL/index.md (accessed on 15 August 2025).
CkipTagger. Available online: https://ckip.iis.sinica.edu.tw/service/ckiptagger/ (accessed on 15 August 2025).
Jina, A.I. jina-embeddings-v2-base-zh. Hugging Face. Available online: https://huggingface.co/jinaai/jina-embeddings-v2-base-zh (accessed on 15 August 2025).
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
Hochreiter, S. Long Short-Term Memory; Neural Computation MIT-Press: Cambridge, MA, USA, 1997. [Google Scholar]
Rao, G.; Huang, W.; Feng, Z.; Cong, Q. LSTM with sentence representations for document-level sentiment classification. Neurocomputing 2018, 308, 49–57. [Google Scholar] [CrossRef]
Melamud, O.; Goldberger, J.; Dagan, I. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. 2016. Available online: https://aclanthology.org/K16-1006/ (accessed on 15 November 2025).
Pytorch. Available online: https://pytorch.org/ (accessed on 15 November 2025).
Sung, Y.T.; Chen, J.L.; Lee, Y.S.; Cha, J.H.; Tseng, H.C.; Lin, W.C.; Chang, T.H.; Chang, K.E. Investigating Chinese Text Readability: Linguistic Features, Modeling, and Validation. Chin. J. Psychol. 2012, 55, 75–106. [Google Scholar] [CrossRef]

Figure 1. Overall architecture flowchart of the proposed Chinese text readability assessment.

Figure 2. Tree diagram of Chinese POS categories.

Figure 3. An example Chinese text and its extracted POS tags. The different colors represent the different POS tags. The corresponding POS spectrum image visualizes the character-level tagging, where the X-axis represents the sequence of characters in the text and the Y-axis indicates 26 POS categories, including 14 major POS tags, 11 punctuation types, and 1 whitespace. To enhance visual clarity and aesthetics, dashed lines are added to divide the image into 26 equal sections along the horizontal direction, making it easier to distinguish between different POS categories.

Figure 4. Workflow for POS feature extraction via DCT.

Figure 5. Scatter plot of results using only DCT coefficients as input to SVM.

Figure 6. Shows the confusion matrix of the LSTM model.

Figure 7. Shows the confusion matrix of the SVM model using traditional linguistic features. From left to right, the four combinations are 1–14, 1–16, 17–27, and 1–27 dimensional features of utilized linguistic features.

Figure 8. Shows the confusion matrix of the SVM model using proposed linguistic features combined with the POS spectrum.

Table 1. The number of texts in each level.

Text Readability Level	Advanced	Intermediate	Basic	Total
Number of Texts	258	258	258	774

Table 2. Index of POS, punctuation, and whitespace categories in the POS spectrum image.

POS Tag	Punctuation Marks	Whitespace
A, C, POST, ADV, ASP, N, DET, M, NV, T, P, Vi, Vt, FW	COLONCATEGORY, COMMACATEGORY DASHCATEGORY, ETCCATEGORY EXCLAMATIONCATEGORY, PARENTHESISCATEGORY PAUSECATEGORY, PERIODCATEGORY QUESTIONCATEGORY, SEMICOLONCATEGORY SPCHANGECATEGORY	WHITESPACE
Index number: 1st–14th	Index number: 15th–25th	Index number: 26th

Table 3. Mapping of feature dimensions to linguistic categories.

Category	Group A	Group B	Group C	Group D
Corresponding dimension	1–14	1–16	17–27	1–27
Component	Type–token ratio	Type–token ratio/average sentence length/total word	Difficulty level of vocabulary	All features

Table 4. Accuracy of the Bi-LSTM and SVM classifier for different feature categories.

Classifier	Bi-LSTM	SVM
Feature category	Semantic features	Common linguistic feature				POS spectrum (46)
Feature category	Semantic features	A: 1–14	B: 1–16	C: 17–27	D: 1–27	POS spectrum (46)
Accuracy	83.87%	80.65%	81.29%	85.81%	83.23%	72.26%
Feature category	83.87%	Common linguistic feature + POS spectrum
Feature category		A *		B *	C *	D *
Accuracy		83.23%		82.58%	86.45%	85.16%

* Represented the added POS spectrum of common linguistic feature.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Chinese Text Readability Assessment Based on the Integration of Visualized Part-of-Speech Information with Linguistic Features

Abstract

1. Introduction

2. Materials and Methods

2.1. Chinese Text Datasets

2.2. Data Preprocessing

2.3. Text Feature

2.3.1. Part-of-Speech (POS) Spectrum Feature

2.3.2. Selected Common Linguistic Features

2.4. Text Analysis with Neural Network Classifiers

2.4.1. Support Vector Machine (SVM)

2.4.2. Bidirectional Long Short-Term Memory (Bi-LSTM)

2.4.3. Application and Comparison of Bi-LSTM and SVM in Text Classification

2.5. Hardware and Software Environment for Model Training and Evaluation

2.6. Evaluation of Text Grading

3. Results and Discussion

3.1. Comparison of LSTM and SVM with Different Input Features

3.2. Confusion Matrices of LSTM and SVM

3.3. Execution Time Comparison Between LSTM and SVM

3.4. Comparison with Similar Related Work

4. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics