DysDiTect: Dyslexia Identification Using CNN-Positional-LSTM-Attention Modeling with Chinese Dictation Task

Handwriting difficulty is a defining feature of Chinese developmental dyslexia (DD) because of the complex structure and dense information of compound characters. Although previous studies have used deep neural network models to extract handwriting features, the temporal property of writing characters in sequential order during dictation tasks has been neglected. By combining transfer learning of a convolutional neural network (CNN) and positional encoding with the temporal-sequential encoding of a long short-term memory (LSTM) network and an attention mechanism, we trained and tested the model with handwriting images of 100,000 Chinese characters from 1064 children in Grades 2–6 (DD = 483; Typically Developing [TD] = 581). Using handwriting features only, the best model reached 83.2% accuracy, 79.2% sensitivity, 86.4% specificity, and 91.2% AUC. With grade information, the best model achieved 85.0% classification accuracy, 83.3% sensitivity, 86.4% specificity, and 89.7% AUC. These findings suggest the potential of machine learning technology to identify children at risk for dyslexia at an early age.


Introduction
Developmental Dyslexia (DD) is characterized by persistent difficulties in reading and phonological abilities [1], resulting in deficient decoding and spelling skills. In nonalphabetic languages such as Chinese, the writing system contributes to the multi-deficit nature of dyslexia [2]. Unlike most alphabetic words with linear letter sequences, Chinese characters have a multi-dimensional, multi-level feature set that includes orthography, phonology, and semantics at the character and radical levels, constructed using logographemes and strokes [3]. These multifaceted features intensify the complexity of potential writing errors, such as assimilation, substitution, insertion, deletion, and transposition at the radical and component levels, as well as protrusion, retraction, blending, segmenting, insertion, and deletion at the stroke level. These errors are frequently observed in copying and dictation tasks, which are the most common practices for children learning Chinese handwriting [4]. While the copying task measures motor ability, the dictation task directly assesses competence to accurately convert the pronunciation of spoken words into written form.
Behavioral studies suggest that Chinese DD children experience difficulties and delayed development in both tasks [5,6] due to impaired motor ability, phonological skills, and orthographic knowledge. Real-time analysis of handwriting performance demonstrated that these dyslexic individuals exhibited significantly more pause and execution time, as well as differences in pen pressure and character size [6], compared with their typically developing (TD) peers. Meanwhile, a similar analysis of Chinese dictation tasks showed subtypes of handwriting difficulties and their association with lexical knowledge, perceptual-motor ability, and attention span in working memory systems [7]. Additionally, functional Magnetic Resonance Imaging (fMRI) studies clarify the neural basis for the handwriting deficit in Chinese dyslexia by showing that DD children exhibited reduced activation in sensory-motor and visual-orthography processing but increased activation in executive control as a compensation mechanism [8]. Similarly, a follow-up study confirmed these patterns in brain network connectivity, with connectivity strength associated with handwriting speed [9].
Given the robust evidence of handwriting difficulties in Chinese DD children and the rapid advancements in machine learning technology, a crucial question arises: Can machine learning be used effectively to identify children at risk for dyslexia by analyzing their handwriting errors in a dictation task? This study presents a novel approach called DysDiTect, an automated Dyslexia Dictation deTection system that uses deep learning models and Chinese handwriting images in a dictation task to effectively classify individuals and predict their dyslexia status.

Technological Advancement of Chinese Handwriting and Performance Evaluation
The challenges faced by dyslexic individuals in handwriting have prompted the development of technological analysis and solutions. One such solution is handwritten Chinese character recognition (HCCR) technology, which has evolved from hierarchical and structural analyses to statistical modeling and deep-learning approaches [10]. For instance, optical character recognition (OCR) has surpassed human performance in recognizing handwritten characters. As HCCR and related innovations mature across languages, focus on evaluating handwriting performance has increased in recent years.
One commonly used approach for assessing Chinese handwriting is statistical analysis of individual strokes [11,12]. This technique can provide detailed feedback on stroke quality, but it often fails to capture the overall representation of the character. To address this issue, previous studies have decomposed the structure of Chinese characters into quantifiable measurements and performed feature mapping against a standardized template for quality evaluation [13,14]. While feature mapping approaches have been effective, their algorithmic complexity can lead to feedback that is incomprehensible to users, limiting their usage in educational settings.
The latest handwriting evaluation method [15,16] uses deep learning techniques to dissect and encode characters into smaller units of logographemes and assess the performance of each part [17]. This approach combines both structure-based and feature-mapping techniques, resulting in higher performance and informative feedback for users. Additionally, the system can be extended to implement stroke-based evaluations.

Transforming Dyslexia Identification: Transitioning from Human-Delivered Behavioral Tests to Machine Learning-Assisted Automatic Detection
The traditional approach for diagnosing DD involves a range of behavioral tests that assess various reading-related cognitive and meta-linguistic skills, such as the Hong Kong test of specific learning difficulties in reading and writing for primary school students (HKT-P) [18]. However, despite the compelling evidence highlighting the importance of handwriting analysis, this approach is considered inadequate due to its reliance on a limited number of tasks that may not fully capture the complexities of Chinese dyslexia. Furthermore, early identification and intervention of dyslexia are crucial for preventing adverse consequences [19]. However, the practical application of these measures is hindered by the high cost of, and labor-intensive effort required by, experienced educational psychologists and clinicians.
The rise of machine learning enables unconventional techniques and introduces new possibilities for early screening and identification of dyslexia. Extensive reviews [20][21][22] have examined the application of machine learning in dyslexia research, focusing on data sources, models/algorithms, feature selection, and evaluation metrics. Machine learning techniques for dyslexia identification utilize three main categories of data sources: behavioral symptoms, eye-tracking, and biomarkers [20]. Decades of research and clinical experience have accumulated a large quantity of behavioral data related to the cognitive and language abilities of typically developing (TD) and DD children. Eye-tracking techniques have also enhanced our understanding of the underlying cognitive processes of reading difficulties through the measurement of eye fixation. Additionally, different biomedical technologies such as fMRI, electroencephalography (EEG), and electrooculography (EOG) have been employed to investigate DD as a neurodevelopmental disorder. Depending on the specific data sources, different machine-learning models are utilized [21]. Numerical (or preprocessed) data commonly use algorithms like Support Vector Machine (SVM), K Nearest Neighbors (KNN), Random Forest (RF), Decision Tree (DT), and regressions. Image data often incorporate deep learning techniques like Convolutional Neural Network (CNN).
Feature selection [22] is a critical step in machine learning aimed at identifying the most predictive features for improved prediction results and theoretical implications. Data preprocessing is sometimes employed to extract features from raw data, such as neuroimaging, brain signals, or handwriting metrics. To evaluate the performance and utility of machine learning models, various metrics are employed. Previous studies have reported accuracy rates ranging from 70% to 95% [20].

Predictions and Identification of Dyslexia Using Handwriting Features with Machine Learning Techniques
According to a recent review [22], approximately 30% of dyslexia prediction utilizing deep learning has been conducted using handwriting datasets. Previous attempts to identify dyslexia from handwriting images have focused primarily on analyzing the basic unit of the writing system, namely, letters in alphabetic languages. DD is often manifested in prevalent errors such as reversed and corrected letters [23], as well as messiness in handwriting [24,25]. To facilitate the identification procedure, Optical Character Recognition (OCR) is incorporated, particularly in languages with a limited set of letters like English [26]. It is worth noting that dysgraphia identification research [27] has focused predominantly on studying the kinematic and static data of in-process handwriting, both in alphabetic languages and Chinese [28].
However, the techniques previously developed for dyslexia identification are not fully applicable to Chinese handwriting due to the multi-dimensional, multi-level features of Chinese characters. Lee et al. [29] utilized error analysis of preprocessed dictation performance and successfully identified DD with an 80.0% accuracy rate, using stroke, grade, lexicality, and character configuration as the most predictive features. However, the labor-intensive nature of, and reliance on, knowledge-based expert coding of handwriting errors limited the practical implementation of this technique. As a result, a recent model called Dyslexia Prescreening Mobile Application for Chinese Children (DYPA) [17] utilizes deep learning encoding of multi-level features such as stroke, radical, and character to overcome these limitations and achieves an accuracy rate of 81.14% when combined with other meta-linguistic tests. It is important to highlight that DYPA was trained on a small dataset, including 39 Chinese DD children and 168 TD children in Grades 1-3. Such a small sample size may not fully reflect the variability and complexity of handwriting difficulties exhibited by Chinese DD children. More importantly, while DYPA achieved an accuracy rate of 81.14%, it is crucial to note that this was a result of combining handwriting analysis with other meta-linguistic tests. The extent to which handwriting analysis alone, without human expertise, can effectively differentiate DD and TD children remains unclear.
Thus, in this study, we advance previous research by developing DysDiTect, an automated Dyslexia Dictation deTection system that utilizes deep learning models and Chinese handwriting images in a dictation task to effectively classify individuals and predict their dyslexia status. To train and evaluate DysDiTect, we collected a large dataset comprising 100,000 Chinese characters from 1064 children in Grades 2-6, including 483 DD and 581 TD children. Notably, our study is the first to employ deep learning techniques on handwriting images for identifying dyslexia in the Chinese language. We developed a series of models to evaluate the handwriting performances and temporal-sequential dependency of TD and DD children during Chinese dictation tasks.

Chinese Word Dictation Task
Adopted from HKT-P [18], this task required participants to write down in designated boxes 96 Chinese characters (48 two-character words) read aloud by the experimenter. Testing stopped after eight consecutive incorrect responses of two-character words, i.e., 16 characters.

Data Classification
The dataset consisted of scanned images of 869 handwritten encoded responses from Lee and Tong [3] and 195 handwritten raw responses. For the encoded data, each written Chinese character was binary-coded for multi-level, multi-dimensional features. Table 1 shows a summary of encoded data accuracies. A correct response meant that all written strokes, logographemes, and radicals within the character were accurately reproduced. A wrong response indicated an incorrectly written structure within the character, or a blank, completely crossed-out, or incomprehensible response with strokes not considered an attempt at writing. Internal consistency was high (Cronbach's α = 0.979).

Data Preprocessing
Each participant's 96 handwritten Chinese character responses were color-scanned from the paper-based dictation test. Next, the images were cropped, isolated, and extracted from the designated boxes, then rescaled to a standardized size of 128 × 128 pixels, each image containing a single Chinese character, resulting in 1064 × 96 = 102,144 images. This image size was selected to reduce computational cost while maintaining the details of strokes, which was confirmed by human inspection. A binarization operation was performed on each character image, converting the background to black and handwriting strokes to white, to reduce computational cost, increase training speed, and decrease in-class variance. Notably, the experimental procedure and coding process occasionally obstructed the handwritten responses with additional tick and cross markings on some characters.
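As a rough illustration of the binarization step, the sketch below thresholds a grayscale crop so that the background becomes black and the strokes white; the threshold value and the toy array are illustrative assumptions, not the study's exact pipeline.

```python
import numpy as np

def binarize(image: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Convert a grayscale character crop (H, W) in [0, 255] to a binary
    image with a black background (0) and white strokes (255).

    Scanned handwriting is dark ink on light paper, so pixels *below*
    the threshold are treated as strokes and mapped to white.
    """
    strokes = image < threshold          # dark pixels = ink
    out = np.zeros_like(image)
    out[strokes] = 255                   # strokes -> white, background stays black
    return out

# Toy example: a 4x4 "scan" with a dark diagonal stroke on light paper.
scan = np.full((4, 4), 230, dtype=np.uint8)
np.fill_diagonal(scan, 20)
binary = binarize(scan)
```

Inverting to white-on-black also means most pixels are exactly zero, which is what makes the representation cheap for the downstream CNN.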
The preprocessing was completed using Python scripts with an automated edge detection technique; additionally, manual checking and cropping were used to facilitate the dataset construction process. The training, validation, and test datasets were split in an 8:1:1 ratio, stratified by both grade and dyslexia status, resulting in sample sizes of 851:106:107 participants.
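The stratified 8:1:1 split can be sketched as follows; the `stratified_split` helper and the (grade, status) grouping key are hypothetical, illustrating the stratification idea rather than the authors' actual code.

```python
import random
from collections import defaultdict

def stratified_split(participants, ratios=(8, 1, 1), seed=42):
    """Split participants into train/val/test in the given ratio,
    stratified by the (grade, dyslexia_status) pair of each participant.

    `participants` is a list of (participant_id, grade, status) tuples.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for pid, grade, status in participants:
        groups[(grade, status)].append(pid)

    total = sum(ratios)
    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)           # shuffle within each stratum
        n = len(members)
        n_train = round(n * ratios[0] / total)
        n_val = round(n * ratios[1] / total)
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]
    return train, val, test

# Toy usage: 40 participants across 2 grades x 2 statuses (10 per stratum).
people = [(i, i % 2, "DD" if i % 4 < 2 else "TD") for i in range(40)]
tr, va, te = stratified_split(people)
```

Stratifying on both variables keeps the DD/TD ratio and the grade distribution comparable across the three splits, which matters when per-grade accuracy is later reported.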

Model Architecture
The model was adapted from existing deep learning architectures. First, the model exploited the independent characteristics of individual written characters by applying a feature-extraction CNN module to every image and positional encoding to incorporate the sequential properties of the dictation task. Then, the temporal-sequential dependency of the dictation task was captured using a stepwise LSTM module. Next, a self-attention layer was introduced to weight the feature maps. Finally, the Classification and Prediction module was used to predict the status of participants. The model architecture is shown in Figure 1.

CNN Module with Positional Encoding
A convolutional neural network (CNN) is a type of deep learning model widely used in computer vision and image processing [30]. The CNN module used in this study consisted of convolutional and pooling layers [31]. The convolutional layers extracted features from images, adjusting the trainable weights and biases of the network to generate output feature maps for the input image [32]. The feature maps generated by convolutional layers could be connected to the next convolutional or pooling layer for further feature extraction, or to a Fully Connected (FC) layer for classification. Moreover, pooling layers downsample the feature maps, reducing computational cost while retaining the features learnt from the input image.
The intrinsic features of Chinese handwritten characters were generalized into feature-map representations. Each Chinese character image of shape (3, 128, 128) was input into the CNN module and summarized as 32 neurons. Positional encoding [33] was introduced after the CNN module to provide the positional information of the dictation sequence to the subsequent module. The positionally encoded feature map was then passed to the LSTM module.
This module was adapted from the ResNet architecture [34], which uses residual connections between convolutional layers to improve model performance. The model built in this study adopted ResNet-50, which consists of 50 layers, including convolutional and pooling layers.
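A minimal PyTorch sketch of this stage, with a toy CNN standing in for ResNet-50 and the Transformer-style sin/cos positional encoding from [33]; the 32-neuron summary and 96-character sequence follow the text, while the toy encoder's layer sizes are placeholders.

```python
import math
import torch
import torch.nn as nn

class TinyCNNEncoder(nn.Module):
    """Stand-in for the ResNet-50 backbone: maps one (3, 128, 128)
    character image to a 32-dimensional embedding."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),   # -> 64x64
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # -> 32x32
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                               # global pool
        )
        self.fc = nn.Linear(16, dim)

    def forward(self, x):                 # x: (batch, 3, 128, 128)
        h = self.features(x).flatten(1)   # (batch, 16)
        return self.fc(h)                 # (batch, dim)

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Transformer-style sin/cos positional encoding, shape (seq_len, dim)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# 96 dictation characters per child, each encoded and position-tagged.
encoder = TinyCNNEncoder(dim=32)
images = torch.randn(96, 3, 128, 128)          # one participant's sequence
embeddings = encoder(images) + sinusoidal_positions(96, 32)
```

Adding (rather than concatenating) the positional code keeps the 32-dimensional embedding size unchanged for the LSTM stage.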

Bi-LSTM Module
A Long Short-Term Memory (LSTM) network is a type of Recurrent Neural Network (RNN) used for sequential data. The technique mimics the long-term and short-term memory systems of the human brain by implementing a gate system [35] that captures features and patterns within a time-series sequence [36]. A bi-directional LSTM (Bi-LSTM) considers the input sequence in both forward and backward directions, capturing longer dependencies and features in reversed order.
In the dyslexia prediction task based on the Chinese dictation task, the handwriting of characters followed a time sequence from the first character to the last, which was well suited to an LSTM. The temporal-sequential properties of the handwritten characters were thus generalized and passed to the subsequent attention module.
The feature maps extracted from each character by the CNN module were summarized as neurons and fed as input time steps into the LSTM. A 2-layer Bi-LSTM structure was used with 128 hidden states in each LSTM cell. The input data were encoded layer by layer; in each layer, the cells were connected bi-directionally, both from the first time step to the last and from the last to the first. The final outputs of the LSTM cells in both directions were extracted and concatenated into a 256-neuron feature map for the Fully Connected (FC) layer.
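The Bi-LSTM stage described above can be sketched in PyTorch as follows; the 32-dimensional input size is assumed from the CNN summary, while the 2 layers, 128 hidden states per direction, and 256-neuron concatenated output follow the text.

```python
import torch
import torch.nn as nn

# 2-layer bidirectional LSTM over the 96 character embeddings, with 128
# hidden units per direction; concatenating both directions gives 256
# features per timestep, matching the 256-neuron FC feature map.
lstm = nn.LSTM(input_size=32, hidden_size=128, num_layers=2,
               batch_first=True, bidirectional=True)

sequence = torch.randn(1, 96, 32)      # (batch, timesteps, CNN features)
outputs, (h_n, c_n) = lstm(sequence)   # outputs: (1, 96, 2 * 128)

# Final forward and backward hidden states of the top layer, concatenated:
final = torch.cat([h_n[-2], h_n[-1]], dim=1)   # (1, 256)
```

`outputs` keeps one 256-dimensional vector per character, which is what the self-attention module consumes next, while `final` summarizes the whole sequence.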

Multi-Head Self-Attention Module
In deep learning, the attention mechanism is considered one of the most important concepts and innovations [33], allowing each individual token to "pay attention" to different parts of the input sequence. This mechanism weighs the importance of each token, enabling the model to selectively emphasize relevant information while downplaying irrelevant details, which overcomes the limitation of long-term dependency and enhances the model's ability to capture complex relationships within the sequence [37].
With the intrinsic sequential properties of the dictation task being captured by the Bi-LSTM module, the integration of multi-head self-attention introduces a sophisticated mechanism for capturing cross-item linkages among handwritten characters.The attention context vector is then passed to the next classification and prediction module.
The feature map of each timestep from the Bi-LSTM module was passed to the 4-headed self-attention module. The dimensions of the final output of the attention module were unchanged, i.e., 256 neurons for each of the 96 timesteps.
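A minimal sketch of this stage using PyTorch's built-in multi-head attention; the 4 heads, 256-dimensional features, and 96 timesteps follow the text, though the paper's exact attention implementation may differ.

```python
import torch
import torch.nn as nn

# 4-headed self-attention over the Bi-LSTM outputs; query, key, and
# value are all the same sequence, and the output keeps its shape.
attention = nn.MultiheadAttention(embed_dim=256, num_heads=4,
                                  batch_first=True)

lstm_out = torch.randn(1, 96, 256)                 # (batch, timesteps, features)
context, weights = attention(lstm_out, lstm_out, lstm_out)
```

`weights` is the (96, 96) attention map (averaged over heads by default), which is exactly the kind of matrix visualized later in Figures 3 and 4.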

Classification and Prediction with Grade Information
The output from the above modules served as the generalized representation of handwriting performance and behavior for the entire dictation task. Next, the embedding of each character was condensed into a single separate neuron and concatenated with the grade information to form the last FC layer consisting of 97 neurons. Finally, the FC layer was connected to a sigmoid activation function to predict whether the input was from TD or DD participants.
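This head can be sketched as follows; the module name `GradeAwareHead` and the grade encoding (a single normalized value) are assumptions, while the one-neuron-per-character condensation, the 97-neuron FC layer, and the sigmoid output follow the text.

```python
import torch
import torch.nn as nn

class GradeAwareHead(nn.Module):
    """Condense each of the 96 character embeddings to one neuron,
    append the grade, and classify with a 97-neuron FC + sigmoid."""
    def __init__(self, seq_len: int = 96, feat_dim: int = 256):
        super().__init__()
        self.condense = nn.Linear(feat_dim, 1)       # one neuron per character
        self.classify = nn.Linear(seq_len + 1, 1)    # 96 characters + grade = 97

    def forward(self, features, grade):
        # features: (batch, 96, 256); grade: (batch, 1), e.g. a scaled grade
        per_char = self.condense(features).squeeze(-1)   # (batch, 96)
        fused = torch.cat([per_char, grade], dim=1)      # (batch, 97)
        return torch.sigmoid(self.classify(fused))       # P(DD)

head = GradeAwareHead()
prob = head(torch.randn(2, 96, 256), torch.tensor([[0.2], [0.8]]))
```

Because each character contributes exactly one neuron, the final layer can weigh individual dictation items against the grade signal directly.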

Model Training
Transfer learning in machine learning allows researchers to use state-of-the-art pre-trained models as the starting point, adapt them to a specific problem or dataset, and enhance model performance and generalization capabilities [38]. The backbone CNN module was adapted from the ResNet-50 model, initialized with pre-trained weights, and fine-tuned on the handwriting dataset for the dyslexia prediction task. Specifically, pre-training on ImageNet was used, given its well-established performance in previous studies when fine-tuned with a small dataset for downstream tasks [39]. In our model training, layer 4 and the FC layer were unfrozen for fine-tuning on the Chinese dictation dataset.
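The freezing pattern can be sketched as below; a tiny stand-in backbone with ResNet-style child names is used here instead of an actual pre-trained torchvision ResNet-50, so the layer contents are placeholders while the unfreeze-layer4-and-fc pattern follows the text.

```python
import torch.nn as nn

# Stand-in backbone with named children mirroring ResNet-50's layout;
# in the actual study a pre-trained ResNet-50 would take this place.
backbone = nn.Sequential()
backbone.add_module("layer1", nn.Linear(8, 8))
backbone.add_module("layer4", nn.Linear(8, 8))
backbone.add_module("fc", nn.Linear(8, 2))

# Freeze everything, then unfreeze only layer4 and fc for fine-tuning.
for name, param in backbone.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
```

Only the unfrozen parameters receive gradient updates, so the early convolutional features learned on ImageNet are preserved while the deepest block adapts to handwriting.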
The models were built and trained with PyTorch and the Lightning library on a Windows desktop with an RTX 3060Ti 8 GB GPU. A minibatch size of 6 was used for model training, and the batches were reshuffled after each epoch. The Adam optimizer with binary cross entropy was used to train the models. The learning rate was initially set to 5 × 10⁻⁶ and halved every three epochs. Regularization techniques were applied to avoid overfitting: weight decay was set to 5 × 10⁻⁵, and dropout was set to 0.2 for the LSTM, attention, and condensed layers. The models were trained for a maximum of 50 epochs, with early stopping when the validation loss did not improve by more than 1 × 10⁻⁴ for three consecutive epochs. A random seed of 42 was used for all settings.
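The reported optimizer and schedule can be sketched as follows; the placeholder model, toy data, and use of `BCEWithLogitsLoss` (binary cross entropy on logits) are assumptions, while the learning rate, halving schedule, and weight decay follow the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # placeholder for DysDiTect

# Adam with the reported learning rate and weight decay; the learning
# rate is halved every three epochs via a step scheduler.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)
criterion = nn.BCEWithLogitsLoss()            # binary cross entropy on logits

for epoch in range(6):                        # 6 toy epochs
    optimizer.zero_grad()
    loss = criterion(model(torch.randn(4, 10)), torch.rand(4, 1))
    loss.backward()
    optimizer.step()
    scheduler.step()

lr_after = optimizer.param_groups[0]["lr"]    # halved at epochs 3 and 6
```

After six epochs the learning rate has been halved twice, from 5 × 10⁻⁶ to 1.25 × 10⁻⁶.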

Pilot Study
Given that the dataset was derived from Lee and Tong [3] with encoded accuracies for each individual character, the authors first attempted to replicate previous approaches [40,41] for character-based predictions using OCR/HCCR techniques. However, the imbalanced classes of character accuracy (as reflected in Table 1) limited the statistical power of the evaluation metrics. Preliminary results reached 95.7 ± 3.49% using a pre-trained model on the first 24 characters from 566 participants. Further fine-tuning would have required data augmentation techniques, which we did not pursue because they would cast doubt on and decrease the credibility of subsequent results and, as shown by previous studies, do not capture the characteristics of dyslexic handwriting.

Ablation Study
The models were labeled as DysDiTect_{P/L/A/G}, where P refers to Positional encoding, L to LSTM, A to Attention, G to Grade, and brackets {} indicate optional modules. The modules/information were selectively dropped to verify the importance and usefulness of the model design, resulting in a total of 16 models. Figure 2 shows the accuracies and losses of the training and validation sets in DysDiTect_PLA and DysDiTect_PAG. Overfitting was observed when the training and validation loss diverged significantly. After training stopped, the test dataset was evaluated from the checkpoint with the lowest validation loss.
Table 2 shows the detailed results for the testing set, including the confusion matrix by lower (G2-3) and higher (G4-6) grades, with overall accuracy, sensitivity (correct rate for DD), specificity (correct rate for TD), and AUC. The best-performing model using only handwriting features was DysDiTect_PLA, with 0.832 accuracy and 0.792 sensitivity. When grade information was included, the best-performing model was DysDiTect_PAG, with 0.850 accuracy and 0.833 sensitivity. The confusion matrices revealed that higher grades had lower accuracy and sensitivity than lower grades in all models.
With both position encoding and grade information removed, the accuracy increased for the DysDiTect_PLAG model from 0.776 to 0.822 (DysDiTect_LA); for the DysDiTect_PLG model from 0.748 to 0.804 (DysDiTect_L); and for the DysDiTect_PG model from 0.710 to 0.738 (DysDiTect_). These results suggested that including this information may introduce unnecessary complexity and hinder the model's ability to generalize effectively.
However, after the removal of either or both position encoding and grade information, the opposite effect was observed for DysDiTect_PAG, where the accuracy decreased from 0.850 to 0.738 (DysDiTect_AG), 0.738 (DysDiTect_PA), and 0.766 (DysDiTect_A). These results suggested that this information is jointly learned by the Attention module.

LSTM and Attention Modules
With the LSTM module removed, the accuracy decreased for most models, except for DysDiTect_PLAG, which increased (0.776 to 0.850) compared with DysDiTect_PAG. The results conveyed the importance of the LSTM module in most situations, but also reflected its ability to obscure the Attention module when jointly learning both positional and grade information, as mentioned above. Meanwhile, the removal of the Attention module resulted in decreased accuracies for most models, except for DysDiTect_PA, which remained at 0.738 compared with DysDiTect_P but decreased in AUC from 0.805 to 0.769. This result illustrated the importance of the Attention module, further analysis of which is discussed below.

Attention Map
The self-attention weights of DysDiTect_PAG and DysDiTect_PLA on the testing set were further evaluated. Examples are shown in Figures 3 and 4, where attention maps from the same participant are shown in the same location. The sequence runs from top to bottom, with each row showing the weights that token assigns to other tokens, and from left to right, with each column showing the weights that token receives from other tokens. The weight scale is normalized by multiplying by the sequence length of 96 characters and limiting it to (0, 2; Figure 3) and (0.8, 1.2; Figure 4) for visual representation.
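The weight rescaling used for these visualizations can be sketched as below; the helper name `scale_attention` is hypothetical, but the multiply-by-sequence-length normalization and the clipping ranges follow the text.

```python
import numpy as np

def scale_attention(weights: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Rescale a (96, 96) self-attention map for plotting: multiply by
    the sequence length (so a perfectly uniform map sits at 1.0) and
    clip to a display range such as (0, 2) or (0.8, 1.2)."""
    seq_len = weights.shape[-1]
    return np.clip(weights * seq_len, lo, hi)

uniform = np.full((96, 96), 1 / 96)        # perfectly even attention
scaled = scale_attention(uniform, 0.0, 2.0)
```

Under this normalization, values above 1 mark tokens that attract more than their uniform share of attention, which is what the red columns in the maps indicate.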
The self-attention map of DysDiTect_PAG showed high variability in the weights assigned to different tokens. In particular, some tokens' attained weights (i.e., the attention assigned by all tokens) were much higher than others, especially in later sequences for some participants (the continuous red columns on the right of the attention maps). Notably, our observations diverged from those documented in prior studies, in that token-wise self-attention along the diagonal axis was not dominant.
The Intraclass Correlation Coefficients (ICC) between character entropy and accuracy measures were calculated; the ICC estimates and their 95% confidence intervals were computed using the Pingouin statistical package version 0. During the character dictation task, the discontinuation criterion caused more wrong and blank responses in the later sequences. By focusing on those responses, the model could possibly identify the handwriting characteristics associated with dyslexia, e.g., reversed writings, radical substitutions, and stroke errors. This finding is consistent with previous studies [3,29] indicating that sublexical errors and responses are more predictive for identifying Chinese dyslexia.
In contrast, the self-attention weights of DysDiTect_PLA showed lower variation across the character sequence and were highly concentrated around the value of 1. A prevalent characteristic across all attention maps was the absence of self-attention directed toward individual tokens themselves, although regional self-attention was observed. The majority of tokens exhibited relatively equal weights, suggesting a tendency toward uniform attention distributions: attention was mostly evenly spread across multiple tokens or concentrated toward specific regions of the input sequence.
The attention map was based on the output of the LSTM module, where the intrinsic features of the character sequences were already captured. Therefore, self-attention was localized, amplifying the generalized pattern of intrinsic characteristics of dyslexic handwriting.

Discussion
The experimental results demonstrated the robustness of DysDiTect with satisfactory performance. The proposed model framework is a proof of concept for a fully automated, cost-effective dyslexia screening system. The Chinese dictation task lasted between 10 and 20 min, and the format can easily be transformed into an electronic version to speed up the preprocessing pipeline. Furthermore, the technological advancement of faster algorithms [42] and hardware allows real-time prediction to run directly on the user's device. With the proposed system, teachers and parents can conduct self-screening to identify children at risk of dyslexia. Additionally, in-process handwriting features can be incorporated to improve prediction performance.
Compared with previous studies using handwriting features for dyslexia identification via machine learning techniques, our results outperformed all evaluation metrics and were tested with an adequate sample size. The summary of results is shown in Table 4, which briefly lists the key information for evaluating performances. Notably, most results of

Figure 1. DysDiTect_PLAG model constructed for dyslexia prediction. Samples of participants' handwritten responses are reproduced by the author and include correctly and incorrectly written characters, and visual-graphic symbols.


Figure 2. (a) Training and validation accuracy. (b) Training and validation loss.


Figure 3. Examples of attention maps of DysDiTect_PAG.

Figure 4. Examples of attention maps of DysDiTect_PLA.


Table 2. Model prediction results of the testing set.


Table 3. Descriptive statistics of entropy by group and response.
Note. K = number of responses. M = mean. SD = standard deviation.