Article

Integrating Linguistic and Eye Movement Features for Arabic Text Readability Assessment Using ML and DL Models

by Ibtehal Baazeem 1,2,*, Hend Al-Khalifa 2 and Abdulmalik Al-Salman 2

1 Artificial Intelligence and Robotics Institute, King Abdulaziz City for Science and Technology, Riyadh 13523, Saudi Arabia
2 College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
* Author to whom correspondence should be addressed.
Computation 2025, 13(11), 258; https://doi.org/10.3390/computation13110258
Submission received: 26 August 2025 / Revised: 20 October 2025 / Accepted: 21 October 2025 / Published: 3 November 2025

Abstract

Evaluating text readability is crucial for supporting both language learners and native readers in selecting appropriate materials. Cognitive psychology research, leveraging behavioral data such as eye-tracking and electroencephalogram (EEG) signals, has demonstrated effectiveness in identifying cognitive activities associated with text difficulty during reading. However, the distinctive linguistic characteristics of Arabic present unique challenges for applying such data in readability assessments. While behavioral signals have been explored for this purpose, their potential for Arabic remains underutilized. This study aims to advance Arabic readability assessments by integrating eye-tracking features into computational models. It presents a series of experiments that utilize both text-based and gaze-based features within machine learning (ML) and deep learning (DL) frameworks. The gaze-based features were extracted from the AraEyebility corpus, which contains eye-tracking data collected from 15 native Arabic speakers. The experimental results show that ensemble ML models, particularly AdaBoost with linguistic and eye-tracking handcrafted features, outperform ML models using TF-IDF and DL models employing word embedding vectorization. Among the DL models, convolutional neural networks (CNNs) achieved the best performance with combined linguistic and eye-tracking features. These findings underscore the value of cognitive data and emphasize the need for further exploration to fully realize their potential in Arabic readability assessment.

1. Introduction

Reading is key to learning and accessing diverse content, both of which are essential for education and everyday life [1,2,3,4]. The need to match reader skills with text complexity has fueled demand for automated readability assessment in fields such as education and cognitive science [1,5]. Readability is central to successful text comprehension [6]. Accessible, clearly written materials encourage reading, making readability an important measure for determining a text's suitability for specific audiences, especially in education [7,8].
Text readability assessment, a challenging field since the 1920s, has gained importance in the Information Age due to the abundance of written text across various domains [6,9]. Dale and Chall define readability as the sum of all elements in printed material that affect a reader’s success in understanding, reading speed, and interest, including both typographical aspects and writing style [10]. Klare similarly describes readability as the ease of understanding that results from writing style [11]. Cognitive linguists highlight two key aspects influencing comprehension: reader-based features (background knowledge, experiences, language proficiency, and motivation) and text-based features (content, syntactical complexity, and visual aids) [12,13,14,15].
Although text readability assessment models aim to predict readability based on readers’ difficulties, they often overlook evaluations that focus on the reader’s cognitive processing and performance results [16]. Most models depend on expert judgments rather than the actual reading performance of the target audience, a method facing criticism [17,18]. Additionally, much research on Arabic readability targets second-language (L2) learners, with limited focus on first-language (L1) readers. Eye-tracking technology can address this gap by offering insights into cognitive processes and enhancing readability assessments in NLP research [19]. This technology measures cognitive effort, helping new models predict text complexity more efficiently [3,5,20]. Real-time assessments based on behavioral signals could significantly advance readability research [14] as there is a strong correlation between eye movements and cognitive processing [21], such as the eye–mind hypothesis, which suggests a direct link between fixation and processing [22]. Eye-tracking provides objective measures of reading behavior and effectively detects reading difficulties [23], with recent advancements exploring the relationship between reader behavior and reading challenges [24,25,26].
A pilot study [27] first explored eye-tracking features for assessing Arabic text readability, highlighting their potential in this under-researched area and prompting further investigation. Despite progress in automated readability assessment, several gaps remain unaddressed for Arabic. First, most Arabic studies rely on handcrafted linguistic features or expert judgments, with little validation against cognitive data and a limited focus on L1 readers. Second, no large-scale Arabic eye-tracking corpus existed until recently, which restricted systematic multimodal modeling. Third, unlike English and German, Arabic’s unique orthographic and morphological properties—such as cursive script, diacritics, bidirectionality, and root-pattern morphology—impose cognitive demands that have not been adequately examined. Fourth, existing gaze-based studies in other languages focus on traditional reading metrics (e.g., fixation duration, regressions), while the potential of experimental condition features—such as reading speed and annotation duration—remains underexplored, despite evidence that task demands shape eye-movement behavior and readability judgments. Finally, most prior work targeted sentences or entire documents, whereas paragraph-level readability, which better reflects natural reading behavior, has not been systematically studied. Accordingly, while international work has explored integrating gaze features with ML and DL models, this has not been systematically carried out for Arabic. Our study addresses these gaps by introducing a novel corpus-driven approach that combines linguistic and gaze features in ML and DL models, thus advancing both Arabic-specific and cross-linguistic readability research.
Relying on the recently released AraEyebility corpus, this study addresses a significant gap in Arabic readability assessment. While gaze-based modeling has been studied in English and other languages, Arabic’s orthographic and cognitive demands—cursive script, morphological richness, diacritics, and bidirectionality—make the direct transfer of those modeling methods insufficient. To the best of our knowledge, this is the first large-scale study to systematically integrate eye-tracking features into ML and DL models for Arabic readability. The key contributions of the study are as follows:
  • We performed the first multimodal evaluation of Arabic readability using a corpus that combines text and gaze features at the paragraph level, a unit of analysis that balances contextual meaning with experimental tractability.
  • We conducted an in-depth investigation of the relationship between cognitive data and text readability assessment, particularly in the Arabic context.
  • We identified the impact of two types of cognitive features—direct reading features and indirect experimental-condition features—on the ML and DL models, along with their significant contribution to the accurate prediction of readability scores.
  • We compared the performance of different ML and DL models in predicting text difficulty using various textual representations, including handcrafted features and feature vectorization methods. Empirical evidence showed that gaze features substantially improved both ML and DL models compared with text-only features.
  • We compared the cognitive measures of Arabic text readability with other readability measures for Arabic and other languages such as English, German, Japanese, and Portuguese.
These contributions collectively advance Arabic readability research and extend global findings into a morphologically rich right-to-left language. At the same time, the results should be interpreted in light of several limitations, as the findings are based on the AraEyebility dataset, which—despite its diversity of genres and readability levels—has a limited participant pool and an imbalance across readability levels. As such, our findings are best viewed as preliminary and proof-of-concept. Larger, more balanced datasets, refined feature engineering, and further testing with diverse text types will strengthen generalizability.
The rest of this paper is organized as follows. Section 2 covers the background, methodology, and eye-tracking in Arabic reading. Section 3 reviews eye-tracking and readability assessment across languages. Section 4 and Section 5 present the ML and DL models for readability classification, respectively. Section 6 discusses the key findings, and Section 7 concludes with a summary, future directions, and a discussion of the challenges.

2. Background

This section offers an overview of text readability, the Arabic language, and the principles of eye-tracking technology, highlighting their relevance to reading in general and specifically exploring their connection to Arabic.

2.1. Text Readability Assessment

Research on text readability began in the 1920s. The studies were primarily carried out in English and have since expanded to other languages [7,8,28]. Modern approaches often use ML, which combines annotated corpora, extracted text features, and predictive models to assign readability scores [14,29]. Unlike tasks such as topic modeling or sentiment analysis, readability assessment is more subjective and requires model interpretability to justify ratings, especially in educational contexts [14]. Automated readability tools have wide applications from supporting second-language learners and readers with dyslexia to simplifying online content and tailoring educational materials. Recent trends explore integrating behavioral signals, such as eye movements, crowdsourced annotations, and data-driven methods, to adapt to evolving language and document types [14,29].

2.2. Eye-Tracking

When surveying a scene, the human eye follows varied movement patterns called scan paths [30]. Eye-tracking captures these in real-time, estimating gaze position, motion, and focus duration [30,31]. Originating in the 18th century, eye-tracking has become a key tool for exploring visual and cognitive processes, with applications in psychology, education, marketing, and user-interface design [31,32]. Devices, such as Tobii (Tobii Technology AB, Stockholm, Sweden) [33] and EyeLink (SR Research Ltd., Ottawa, ON, Canada) [34], are widely used. Eye-tracking systems process data on pupil location, gaze points, and durations using methods such as infrared oculography (IROG) and video-oculography (VOG) [30,31]. VOG, often used in screen-based studies, detects eye presence via near-infrared reflections and analyzes this with machine learning and image processing algorithms [35,36].
Eye-tracking captures natural reading behaviors without intervention, providing direct observations of language engagement [37]. Like reaction-time measures, eye-tracking rests on two assumptions when interpreting gaze data [32]. First, fixation duration indicates cognitive effort—longer or more frequent fixations suggest deeper processing. Second, gaze location indicates the subject of focus, offering insights into cognitive processes during reading [38]. Thus, analyzing eye movements can elucidate the cognitive processes involved in reading [25].
The key reading behavior metrics used in reading studies are as follows: saccades (rapid eye movements between text areas), fixations (pauses between saccades indicating readability or difficulty), and regressions (backward movements suggesting text comprehension challenges or clarity issues) [17,32,39]. Longer saccades typically indicate easier comprehension, while frequent and prolonged fixations and regressions suggest text difficulty and comprehension issues [25,39,40].

2.3. Arabic Language

Arabic, a Semitic language, differs significantly from Latin-script languages; it is read from right to left and lacks upper-case and lower-case distinctions [41]. The language includes 28 basic consonantal letters, with cursive writing requiring connected letters that may change shape based on their position within words [41,42,43,44]. Arabic is classified into Classical Arabic (CA), which is used historically and in religious texts such as the Qur’an [45,46,47]; Modern Standard Arabic (MSA), prevalent in contemporary media and literature but distinct from CA in structure and complexity [27,46,47]; and various dialects spoken across Arabic-speaking regions [47]. In Arabic script, consonantal sounds are primarily represented by letters, while short vowel sounds are indicated by diacritics such as fat’ha, kasra, damma, shadda, and tanween [43,48]. Diacritization clarifies pronunciation and meaning and is crucial in resolving ambiguity in homographs, where undiacritized words can have multiple meanings [41,48]. For example, a basic undiacritized Arabic word such as “علم” can have different pronunciations, and each is associated with a different meaning such as flag (عَلَم), knowledge (عِلْم), teach (عَلّم), etc. [43].

2.4. Arabic Language and Eye-Tracking

Arabic has unique linguistic and visual features that create distinct reading challenges compared with Latin-based languages. Its cursive script, bidirectional reading direction, and morphological richness require stronger visual-spatial processing and influence eye-movement patterns [35,49,50]. Unlike English or German, which use non-cursive and visually simpler scripts, Arabic has received comparatively little attention; only a few studies have investigated how these script properties affect Arabic reading behavior [51]. Several script-specific features of Arabic are particularly relevant for readability and eye-tracking:
  • Informational density: Arabic text requires more time to process, making word recognition harder than in Latin scripts [51].
  • Fixation and perceptual span patterns: Arabic readers tend to fixate centrally within words and extend their perceptual span leftward, reflecting the script’s directionality [51].
  • Diacritics: While they aid in disambiguation, diacritics can also act as visual clutter, increasing fixation durations. Experienced readers often rely on context to interpret undiacritized text [42,46,51,52,53].
  • Dot density and orthographic ambiguity: Many letters differ only by dots, adding to visual load and fixation demands [44].
  • Right-to-left directionality and bidirectional reading: Text runs right-to-left, while numbers are read left-to-right; this affects saccade planning and sometimes causes inversion errors [54].
  • Cursive and shape-changing script: Letters connect and change form depending on position, which complicates word segmentation [55].

3. Related Research

Measuring text readability remains challenging [6,11,13,14,28], with efforts spanning classical formulas and data-driven approaches [13,56]. This section reviews key English and Arabic formulas, data-driven Arabic readability methods, and eye-tracking studies for readability prediction.

3.1. Classic Text Readability Assessments

Traditional readability assessments rely on lexical and syntactic features, such as word familiarity and sentence length, based on the assumption that longer sentences and unfamiliar words increase text difficulty [6,13,14,57]. In English, extensive research has produced hundreds of formulas since the early 20th century including the Flesch Reading Ease Score [58], Gunning Fog Score (Fog) [59], Simple Measure of Gobbledygook (SMOG) Grade [60], Coleman–Liau Index (CLI) [61], Flesch–Kincaid Grade Level [62], and Dale–Chall Lexicon-Based Score [63]. For Arabic, despite growing interest in language processing, few studies have developed dedicated readability indices [64]. Early work adopted formula-based approaches similar to English, later evolving to statistical and ML-based methods [65]. Notable Arabic measures include the Dawood Readability Score [66], Al-Heeti Grade Level [67], Corpus-Based Readability Formula [68], automatic Arabic readability index (AARI) [12], open source metric for measuring Arabic narratives (OSMAN) [64], and computational formulas for Arabic reading materials among non-native students [69].
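For illustration, the Flesch Reading Ease Score [58], one of the earliest and most widely used of these formulas, combines average sentence length with average syllables per word:

$$\mathrm{FRES} = 206.835 \;-\; 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) \;-\; 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)$$

Higher scores indicate easier text: roughly, scores of 90–100 correspond to material readable by young schoolchildren, while scores below 30 indicate material best suited to university graduates.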
However, traditional formulas have limitations: they assume well-formed sentences and struggle with nontraditional texts [14], rely on surface features while overlooking coherence and syntax [6,14,56,68,70], and often inflate scores for shorter sentences at the cost of readability [13].

3.2. Data-Driven Text Readability Assessment

Several tools have been developed to analyze semantic features in texts [13]. Since the early 2000s, AI and ML techniques have advanced readability assessment, moving beyond rule-based NLP systems [14,71]. Given the limits of traditional formulas, combining NLP with ML offers more accurate predictions [57]. Recently, DL has further improved accuracy and reliability in this field [6,14,70,72]. This section reviews ML and DL approaches to text readability in Arabic and other languages [73,74,75,76].

3.2.1. Arabic Text Readability Assessment Studies

Although readability prediction has been extensively studied for English and other languages, Arabic readability has only recently attracted attention within the Arab NLP community [77,78]. Cavalli-Sforza et al. [65] and Nassiri et al. [43] reviewed trends in Arabic studies, focusing on traditional ML approaches for assessing text complexity. Early contributions include Al-Ajlan et al. [28]; Al-Khalifa and Al-Ajlan [7]; Shen et al. [79]; Salesky and Shen [80]; Forsyth [81]; Cavalli-Sforza et al. [82]; Saddiki et al. [29]; Alotaibi et al. [11]; Nassiri et al. [8,83,84,85,86]; Al Aqeel et al. [87]; and Bessou and Chenni [77], among others. Several studies have adapted features from other languages, although challenges such as imbalanced datasets persist. Nassiri et al. [86] explored class clustering and SMOTE for addressing imbalance, finding SMOTE to improve readability estimation, especially for finer text complexity distinctions.
As noted by Nassiri et al. [43], Arabic readability research for both L1 and L2 learners largely relies on ML due to the scarcity of the large annotated datasets necessary for DL. Most studies have employed handcrafted linguistic features or BoW/TF–IDF representations (e.g., [77]), and few have leveraged DL. Al Jarrah [56] was the first to compare ANNs with ML models for Arabic readability. Khallaf and Sharoff [88] demonstrated the value of word embeddings (fastText, mBERT, XLM-R, and Arabic-BERT), with fine-tuned Arabic-BERT outperforming others. Similarly, Berrichi et al. [57] found that combining AraVec embeddings with linguistic features improved F scores by 8%. Later, Berrichi et al. [89] demonstrated that feature selection and the combination of embeddings (Arabic-BERT, AraBERT, and XLM-R) with linguistic features further enhanced performance. For L1 learners, Berrichi et al. [1] reported that AraBERT produced a 76.93% F score, outperforming traditional features. More recently, Ouassil et al. [90] introduced a hybrid AraBERT-BiLSTM model that integrates transformer-based contextual embeddings with BiLSTM’s sequential processing. This model achieved 89.55% accuracy, 89.65% precision, 89.55% recall, and an F1 score of 89.52%, outperforming both standalone BERT/BiLSTM models and traditional classifiers (SVM, naïve Bayes, KNN, and decision tree). These findings underscore the potential of hybrid DL architectures in tackling Arabic’s linguistic complexity.
In addition to the summaries provided by the authors of [43,91], Appendix A (Table A1) provides a comprehensive overview of data-driven Arabic readability studies for MSA targeting similar audiences. While some earlier studies (e.g., [11,28,87]) proposed systems or tools for Arabic readability, they lacked data-driven experiments and did not report quantitative results.

3.2.2. Global Text Readability Assessment Studies Using Eye-Tracking

Several studies have focused on recommending appropriate texts to audiences by considering eye movements while reading [24,26], including Chen et al. [24], Copeland et al. [25], and Vajjala et al. [16]. For example, Garain et al. [26] investigated reader-specific difficult word identification in documents. Moreover, as reader effort is a significant factor in reading and comprehension, Mishra and Bhattacharyya [92] developed an approach to quantify reading effort by analyzing the eye-movement patterns of readers. This method, known as Scanpath complexity, serves as a measure of text readability.
Given that text readability is an essential component of automatic essay grading—a method wherein machines assign a grade to an essay written on a specific topic—several studies, such as those by Mathias et al. [40,93,94], have sought to augment training datasets and enhance model performance in evaluating overall text quality. They achieved this by using readers’ gaze behavior and patterns to collect cognitive information. Additionally, some studies, such as those by the authors of [4,95,96], did not directly target readability and focused on comprehension, while others [40,93,94] aimed at the overall text quality, of which readability is only a part.
Indeed, research shows that predictable words are read faster or even skipped [22]. Surprisal, which quantifies how unexpected a word is in context, has become a key feature in modeling human sentence processing. Goodkind and Bicknell [23] examined how LM quality influences surprisal’s predictive power for readability. Aurnhammer and Frank [24] compared RNN variants in estimating surprisal using self-paced reading, eye-tracking, and EEG data. Merkx and Frank [22] extended this by contrasting GRUs with transformer-based LMs. Building on these, Gauthier et al. [25] explored how different LMs affected surprisal estimates and predicted human gaze across datasets. Additional studies include those from the authors of [26,27,28,29,30].
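Formally, the surprisal of a word $w_i$ is its negative log-probability given the preceding context, as estimated by a language model:

$$s(w_i) = -\log_2 P\left(w_i \mid w_1, \ldots, w_{i-1}\right)$$

Highly predictable words thus carry low surprisal, consistent with their shorter fixations and higher skipping rates.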
Existing studies have applied a diverse set of algorithms, such as SVM, generalized additive mixed models (GAMMs), and multi-layer perceptrons (MLPs), and have reported a similarly diverse set of performance measures, including root mean square error (RMSE), mean absolute error (MAE), mean squared error (MSE), area under the precision–recall curve (AUPRC), correlation coefficients (CCs), and quadratic weighted kappa (QWK). These measures capture different aspects of model behavior and collectively provide a comprehensive evaluation of predictive accuracy and reliability across modeling scenarios. Appendix A (Table A2) summarizes some of the existing studies.

3.3. Discussion

This review identifies key gaps in Arabic readability assessment and the use of eye-tracking to explore reading behaviors. In Arabic readability studies, most research has focused on L2 learners in educational contexts, with limited studies conducted in other domains such as health [11,87], and even fewer on L1 readers [8,85]. Progress is constrained by the lack of annotated datasets required for ML and DL models [89]. Feature selection remains a major challenge, as studies often rely on predefined linguistic features without thoroughly evaluating their relevance [56,83,88,89]. While recent work has introduced TF-IDF and word embeddings [1,77,88,89], hyperparameter tuning is rarely discussed. Statistical ML methods, particularly random forests, dominate the field, whereas ensemble methods, sequence-to-sequence models (e.g., RNNs), and transformer-based architectures remain underexplored. DL approaches are rare due to their data demands and limited Arabic resources [43,89].
In eye-tracking research, there is growing interest in applying gaze data to NLP tasks such as sentiment analysis and named-entity recognition [97,98,99,100]. Eye-tracking has the potential to be used in assessing text readability [101], but studies often focus narrowly on metrics such as the first fixation duration and total fixation time, overlooking other indicators of reading difficulty [32,92,102,103,104]. It remains unclear which eye-tracking features are most predictive of readability or how they can be integrated with ML/DL models. Combining gaze data with linguistic features and DL representations has shown promise for enhancing readability prediction [19,40,105,106,107,108].
This review underscores the potential of eye-tracking to advance Arabic text readability research and indicates new directions for integrating gaze measures with computational models. However, to date, no multimodal approaches have systematically combined linguistic and gaze features for Arabic text readability tasks, despite Arabic’s distinct orthographic and morphological characteristics, which pose unique challenges compared with other languages.

4. ML-Based Experiments and Results

For resource-limited languages, manual linguistic features and traditional ML techniques provide a baseline for readability assessment [106]. ML models are favored for small datasets due to their simplicity and lower data demands, while larger datasets allow for more complex algorithms [71]. This section examines the modeling and evaluation phases, where ML models integrating traditional text features with eye-tracking data are evaluated using various metrics [57].

4.1. Model Building and Evaluation

To systematically evaluate Arabic text readability, we designed a structured set of ML experiments using the AraEyebility corpus. The dataset was divided into 80% training and 20% testing [8,40,82,83,109]. The experimental design had five main components [85]:
  • Dataset: Arabic text readability was framed as a supervised classification task using the AraEyebility corpus [110]. The corpus comprises 587 paragraphs (57,617 words) drawn from 92 MSA and CA texts across 13 genres including grammar, literature, health, and politics. Texts were partially diacritized and segmented into coherent paragraphs of different lengths and difficulty levels. Eye-movement data were collected from 15 native Arabic speakers. Extracted eye-tracking features—fixations, saccades, regressions, and pupil metrics—serve as indicators of reading effort [4]. Additionally, the corpus includes linguistic features and subjective gold-standard readability annotations from participants, classifying Arabic texts into easy, medium, and difficult levels. The distribution is notably skewed, with approximately 61% being easy, 34% being medium, and 5% being difficult, indicating a significant data imbalance.
  • Classification using handcrafted features (Section 4.2): Four ML models—multinomial naive Bayes (MNB), logistic regression (LR), support vector machines (SVMs), and K-nearest neighbors (KNN)—and four ensemble methods—bootstrap aggregating (Bagging), RF, adaptive boosting (AdaBoost), and extreme gradient boosting (XGBoost)—were trained. Ensemble methods apply the “wisdom of the many” concept, potentially improving prediction accuracy [109]. The models were evaluated using handcrafted linguistic and cognitive features, followed by experiments that varied feature sets to assess their impact.
  • Classification using feature vectorization methods (Section 4.3): Texts were represented using TF–IDF, and the ML models were re-evaluated on these representations. Results were compared against those obtained with handcrafted features.
  • Implementation and evaluation: All models were implemented in scikit-learn on Google Colab, trained with hyperparameter tuning, and evaluated using weighted precision, recall, and F1 score, as these metrics are more robust to data imbalance than accuracy [69,89,109,111].
  • Handling class imbalance: The dataset was notably skewed (61% easy, 34% medium, 5% difficult). As noted in [71], class imbalance is a common challenge in readability datasets, particularly in automatic text readability assessment, where natural text distributions are often uneven. This issue is also frequently observed in Arabic readability studies. Oversampling, undersampling, and SMOTE were initially tested, but these approaches yielded inconsistent results and distorted class distributions. Consequently, the class imbalance was addressed through stratified sampling to divide the data into training and testing sets, ensuring that each readability class was proportionally represented. This approach is particularly effective for imbalanced datasets, as it preserves class distribution and helps reduce model bias during training and evaluation. In addition, we adopted hyperparameter fine-tuning and cost-sensitive training to further improve performance under imbalanced conditions. These steps involved adjusting model-specific parameters and employing appropriate evaluation metrics to better capture the performance across all classes and ensure robustness. A minimal sketch of this split-and-evaluate protocol is shown after this list.
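To make the protocol above concrete, the following is a minimal scikit-learn sketch of the stratified 80/20 split, cost-sensitive training, and weighted-metric evaluation. The feature matrix and labels are synthetic placeholders with the reported dimensions and class proportions, not the AraEyebility data themselves.

```python
# Minimal sketch of the split-and-evaluate protocol described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

X = np.random.rand(587, 98)  # placeholder: 587 paragraphs, 98 handcrafted features
y = np.random.choice(["easy", "medium", "difficult"], size=587, p=[0.61, 0.34, 0.05])

# Stratified 80/20 split preserves the skewed class distribution in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cost-sensitive training: class_weight="balanced" counteracts majority-class bias.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# Weighted precision/recall/F1 are more robust to imbalance than accuracy.
p, r, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="weighted", zero_division=0
)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```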

4.2. Classification Using Handcrafted Features

Initially, text readability was modeled using handcrafted features to transform paragraphs into feature vectors [8,57,86]. Despite their complexity, handcrafted features remain effective, particularly in Arabic readability studies with ML algorithms [57,71,106]. A total of 98 features categorized as text-based and gaze-based were used to evaluate the contribution of gaze data and experimental factors, such as task completion time, as indicators of confusion. Two eye-tracking feature sets—gaze-based and gaze-based (reading)—were tested for their impact on Arabic readability assessment [40]. Table 1 summarizes these features. For a detailed description, readers may refer to the AraEyebility corpus paper [110].
Figure 1 summarizes the workflow of the adopted methodology.

4.2.1. Baseline Models

Experiments
To assess the impact of eye-tracking features on Arabic text readability, we first optimized the hyperparameters using only text-based features, aligning with standard practices in readability and cognitive studies [71]. This baseline model established a benchmark for evaluating subsequent enhancements. While grid search was computationally intensive in Google Colab, Bayesian optimization offered a more efficient alternative by leveraging a probabilistic model to identify promising hyperparameters [112]. The experiments employed stratified 5-fold cross-validation to maintain class balance and mitigate overfitting [111,112].
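The paper does not name the optimization library; as one plausible realization, scikit-optimize's BayesSearchCV pairs Bayesian search with stratified 5-fold cross-validation. The SVM search space below is illustrative, not the study's exact configuration, and X_train/y_train are carried over from the split sketched in Section 4.1.

```python
# Hedged sketch of Bayesian hyperparameter search with stratified 5-fold CV.
from skopt import BayesSearchCV
from skopt.space import Real, Categorical
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

search = BayesSearchCV(
    estimator=SVC(),
    search_spaces={
        "C": Real(1e-3, 1e3, prior="log-uniform"),
        "kernel": Categorical(["linear", "rbf"]),
        "gamma": Real(1e-4, 1e1, prior="log-uniform"),
    },
    n_iter=30,                                   # number of parameter settings sampled
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_weighted",                       # weighted F1 handles class imbalance
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```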
Results
Bayesian optimization produced the best models, optimal parameters, and performance metrics, all of which were recorded for further analysis. Details of the hyperparameter values for each ML classifier are provided in Table A3 in Appendix B to support reproducibility. The optimized models—LR, SVM, KNN, and MNB—were evaluated using average weighted precision, recall, and F1 scores. LR achieved the best balance, with precision and recall at 73% and an F1 score of 71%. SVM followed closely (F1: 70%), exhibiting higher recall (73%) but lower precision (68%) [112]. KNN (F1: 67%) and MNB (F1: 65%) lagged behind, with MNB performing the worst. The high scores achieved by LR and SVM can be attributed to their capacity to handle feature interdependencies, in contrast to MNB, which does not account for such dependencies [111], and KNN, which struggled in the presence of less informative features [113].
The results indicate that key linguistic features—such as character count, word count, syllable count, sentence length, and difficult word counts—substantially improve text categorization by capturing text complexity across three levels [89]. However, features such as the average difficult word count, loanword counts, and ratios of loan/foreign words contributed less to class differentiation, resulting in misclassifications, particularly at the medium level [7,56,114]. Unlike prior studies enforcing distinct vocabularies per class [69], vocabulary overlap in this dataset further increased the classification errors. Dataset imbalance also limited model generalization. Despite these challenges, the optimized RF model outperformed both basic models (KNN, SVM, LR, and MNB) and ensemble methods (Bagging, AdaBoost, and XGBoost), likely due to its ability to randomize instances and features, reducing variance and handling noisy linguistic data [115]. Appendix C (Table A4 and Table A5) details these results.
To evaluate model robustness, repeated stratified 5-fold cross-validation was applied with 5, 10, 15, 20, 25, and 50 repetitions, focusing on weighted F1 scores. Increasing the number of repetitions balanced thorough validation with computational constraints, with no notable performance gains beyond 50 repetitions. The results showed low variance in validation scores (Figure 2), indicating model stability against minor data fluctuations. However, classifiers such as AdaBoost and Bagging exhibited F1-score variability due to randomness in cross-validation splits [116].
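A sketch of this robustness check, reusing clf, X, and y from the earlier sketch:

```python
# Repeated stratified 5-fold cross-validation with an increasing number of
# repetitions, tracking the mean and variance of the weighted F1 score.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

for n_repeats in (5, 10, 15, 20, 25, 50):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=42)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_weighted")
    print(f"repeats={n_repeats:2d}  mean F1={scores.mean():.3f}  std={scores.std():.3f}")
```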

4.2.2. Analysis of Feature Variations

This study aimed to evaluate the role of eye movements in Arabic text readability assessment. Thus, optimized baseline models using text-based features were compared with models incorporating gaze-based features to analyze the impact of feature variations across multiple ML models [47].
Experiments
Following the approach of [40], this study evaluated the effectiveness of gaze features using the optimized baseline models across classifiers. The models were retrained with three feature variations—gaze-based (reading), gaze-based, and all features (Table 1)—and evaluated to identify the best-performing configurations [11,77]. This analysis examined (1) the added value of gaze data alone (gaze-based and gaze-based (reading)) and (2) the synergistic effect of combining gaze and text features in improving readability assessment [69]. Retraining with the hyperparameters carried over from the baseline models, without further tuning for the new feature variations, ensured a controlled comparison focused solely on the impact of feature variations on model performance.
Results
For each feature variation, model performance was compared with the baseline to assess the contribution of each subset to predictive power. The LR model remained the top basic classifier, achieving 75% in precision, recall, and F1 score. SVM also improved with the all features set (F1: 74%), while KNN and MNB showed moderate gains, particularly with gaze features (MNB F1: 68%). Among the ensemble models, Bagging performed strongly with gaze features (F1: 77%), XGBoost reached 76% with all features, and AdaBoost achieved the highest performance overall (precision, recall, and F1 score of 80%) when combining text-based and gaze-based features. RF maintained stable results but declined slightly with all features, likely due to feature interdependency limitations. Overall, RF and AdaBoost performed consistently well with text and gaze features, while XGBoost and Bagging excelled with both gaze (reading) and gaze features, suggesting that in some cases, gaze-based features outperformed traditional text-based features. These findings underscore the potential effectiveness of eye-tracking data alone in enhancing model performance. Appendix C (Table A6 and Table A7) details these results.
Overall Performance and Gaze Data
Table 2 highlights the top-performing models across basic and ensemble types, showcasing the superiority of ensemble models over basic models when various feature variations are applied.
Table 3 highlights the top-performing features, demonstrating the substantial impact of eye-tracking data on Arabic readability assessment and the superiority of ensemble models over basic ones. Eye-tracking features, whether standalone or combined with text-based features, enhanced model performance by providing contextual insights beyond textual analysis. KNN improved with gaze and all features, while MNB and Bagging performed better with gaze (reading) and gaze features, respectively. SVM, LR, AdaBoost, and XGBoost benefited most from integrating gaze and text features (All Features), whereas RF performed strongly with both feature types individually. In conclusion, eye-tracking data, alone or combined with text-based features, significantly improve the model performance in text readability assessment.

4.3. Classification Using Simple Feature Vectorization Methods

Simple vectorization methods such as BoW and TF-IDF are effective for readability assessment, especially in morphologically complex languages such as Arabic [57,69]. Prior studies link text complexity to lexical richness, emphasizing n-gram features in distinguishing difficulty levels [69].
TF-IDF, widely used in NLP for its balanced text encoding, outperforms BoW in Arabic readability assessment by emphasizing informative, less frequent words [57,71,77]. Unlike BoW, which counts word occurrences and may miss critical nuances—especially with stop words—TF-IDF’s weighting reduces common word influence and highlights unique vocabulary essential for readability classification [117]. This section details the methodology using TF-IDF with and without handcrafted features for feature analysis and model evaluation (Figure 3) including the integration of eye-tracking features.
Preprocessing transformed the raw text, drawn from textbooks in .csv format, for analysis [118]. The text underwent initial cleaning: consecutive spaces were consolidated, and newline characters were removed using regex [71,83]. Tokenization used white spaces as delimiters, with manual adjustments for the incomplete word endings that occur in the first hemistich of poetry. Punctuation and diacritical marks were removed to streamline tokenization [118,119]. Stop words were retained to preserve context because the objective of this study was to collect eye-tracking data for all texts without altering them, focusing on the paragraph as a whole rather than specific word-level changes. Retaining stop words also helped maintain the natural spillover effect in reading, in line with the methodology described in [27]. A minimal sketch of these steps follows.
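The following is a hedged sketch of this pipeline. The study's exact regular expressions are not published, so the patterns below (e.g., the diacritic range U+064B–U+0652 plus the dagger alif U+0670) are illustrative.

```python
# Illustrative preprocessing: space consolidation, diacritic and punctuation
# removal, and whitespace tokenization with stop words retained.
import re

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, fat'ha, damma, kasra, shadda, sukun, dagger alif
PUNCTUATION = re.compile(r"[^\w\s]")                      # \w covers Arabic letters in Python 3

def preprocess(paragraph: str) -> list[str]:
    text = re.sub(r"\s+", " ", paragraph).strip()  # consolidate spaces, drop newlines
    text = ARABIC_DIACRITICS.sub("", text)         # remove diacritical marks
    text = PUNCTUATION.sub("", text)               # remove punctuation
    return text.split(" ")                         # whitespace tokenization; stop words kept

tokens = preprocess("قَرَأَ الطّالِبُ النَّصَّ بِسُرْعَةٍ.")
print(tokens)  # ['قرأ', 'الطالب', 'النص', 'بسرعة']
```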

4.3.1. Analysis of Feature Variations

Experiments
To evaluate the impact of combining gaze and vectorized text features in Arabic readability assessment [57,69,77,88,89], optimized baseline models were retrained using vectorized text alone and in combination with feature variations (Table 4).
Using a consistent set of hyperparameters from the baseline model without additional tuning for new features allowed for controlled analysis [69]. TF-IDF was used to derive unigram and bigram features [69,77], resulting in vocabulary sizes of 19,421 and 48,795 tokens, respectively.
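As an illustration of these representations, the sketch below builds unigram and unigram-plus-bigram TF-IDF matrices with scikit-learn and appends handcrafted features column-wise. The variable names (train_paragraphs, X_train_handcrafted) are placeholders, and whether the reported bigram vocabulary includes unigrams is not stated in the text.

```python
# TF-IDF vectorization and combination with handcrafted (text + gaze) features.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

unigram_vec = TfidfVectorizer(ngram_range=(1, 1))  # ~19,421 tokens reported
bigram_vec = TfidfVectorizer(ngram_range=(1, 2))   # ~48,795 tokens reported

X_uni = unigram_vec.fit_transform(train_paragraphs)  # raw paragraph strings
X_bi = bigram_vec.fit_transform(train_paragraphs)

# Append the handcrafted feature matrix column-wise to the sparse TF-IDF matrix.
X_combined = hstack([X_uni, X_train_handcrafted])
```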
Results
Models using unigram and bigram features [120] showed varied performance, with unigrams often outperforming bigrams in F1 scores. XGBoost was an exception, improving from 69% (unigrams) to 75% (bigrams). The high dimensionality of bigrams increased sparsity and noise, complicating optimization and affecting TF-IDF performance [57,71]. Thus, this section focuses on the TF-IDF unigram results [121]. Detailed performance metrics for basic and ensemble algorithms with TF-IDF and feature variations are provided in Appendix D, benchmarked against the text-based baseline.
For basic models, TF-IDF alone resulted in the lowest performance, with MNB achieving an F1 score of 36%, underscoring its limitations and the value of incorporating additional features. KNN reached its highest F1 score (69%) when TF-IDF was combined with gaze features, while adding gaze (reading) or all features yielded consistent improvements (F1: 68%). Enhancing TF-IDF with extra text features offered no further gains, suggesting diminishing returns. Notably, MNB showed no benefit from feature variations, contrary to the literature supporting fractional counts with TF-IDF [77,111]. SVM showed modest gains, with F1 scores increasing from 70% to 72% and 73% when all features and text features were added, respectively, along with improved precision. LR maintained an F1 score of 71% with all features but exhibited decreases in precision and recall. Overall, both models underperformed compared with others using TF-IDF and feature variations.
Among the ensemble models, TF-IDF alone yielded poor results, with AdaBoost and RF recording the lowest F1 scores (46%). Adding TF-IDF to other feature variations generally reduced performance, except for XGBoost, which only showed a slight decline with TF-IDF and all features.
Overall Performance and Gaze Data
The findings indicate that adding TF-IDF features generally lowered performance, except for SVM and KNN, which improved with specific feature variations. This decline likely stems from high dimensionality causing overfitting and reduced test accuracy, highlighting the need for dimensionality reduction [83]. Table 5 shows that with TF-IDF feature variations, basic models—especially SVM—outperformed the ensemble models. While ensemble methods struggle with high dimensionality, SVM leverages its strength in defining decision boundaries within complex feature spaces [122].
Overall, the results in Table 6 indicate that integrating text-based and gaze-based features enhances the performance over TF-IDF alone.

5. DL-Based Experiments and Results

Supervised ML methods have traditionally driven Arabic readability prediction through various classifiers and predefined features. Recent DL advancements, including word2vec, fastText, and pretrained language models, have improved semantic modeling and enhanced readability assessment when combined with linguistic features [57,70,123]. Arabic NLP studies further demonstrate that integrating word embeddings with ML models improves both performance and interpretability [43,57,89]. This section investigates whether DL models, particularly those incorporating eye-tracking features, more effectively capture human cognitive processes in readability assessment by leveraging embeddings such as fastText alongside cognitive data.

5.1. Model Building and Evaluation

The deep learning experiments were conducted using the AraEyebility corpus introduced in Section 4.1; the data split (80% training, 20% testing), evaluation protocol, and imbalance-handling strategies likewise followed the procedures described there. Accordingly, the experimental design for the DL models consisted of the following:
  • Classification with embeddings and feature variations (Section 5.2): Given the limited use of sequence-to-sequence models, such as RNNs, in Arabic readability research, seven DL models were selected based on prior studies [71,118,119,124,125,126,127,128] including long short-term memory (LSTM), bi-directional LSTM (BiLSTM), and gated recurrent units (GRUs). Additionally, convolutional neural networks (CNNs) were explored in order to leverage their effectiveness in cognitive studies and text classification tasks. Hybrid architectures (CNN-LSTM, CNN-GRU, and CNN-BiLSTM) were also explored, mirroring ensemble strategies to boost predictive accuracy. This stage evaluated the contribution of cognitive signals to DL models, covering preprocessing, embedding integration, hyperparameter tuning, and the analysis of feature variations.
  • Implementation: All models were implemented in Google Colab with TensorFlow and Keras for enhanced usability and functionality [124,125].

5.2. Classification Using Word Embeddings and DL Models

This section outlines the baseline DL model development and evaluates the impact of gaze features on Arabic readability [125], covering text preprocessing, fastText embeddings, hyperparameter tuning, and feature variation analysis. Figure 4 illustrates the methodology workflow. Text from CSV-format textbooks was preprocessed as in Section 4.3, including space consolidation, newline removal via regex [71,83], and whitespace tokenization. Adjustments addressed incomplete word endings in poetry. Punctuation and diacritics were removed [118,119], but stop words were retained for context [27]. Sequences were padded to the longest paragraph length for uniformity.

5.2.1. Baseline Models

Tokenized words were mapped to embedding vectors via an embedding matrix for neural network input, capturing linguistic nuances and analogies [118,124,129]. Pretrained Arabic fastText embeddings were used for their strong performance in Arabic NLP and robustness with limited data [71,118,124]. By modeling words as subword n-grams, fastText effectively handled out-of-vocabulary (OOV) words, with only 4.1% (796 of 19,421) OOV words in this corpus [124,130].
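A sketch of this embedding-matrix construction using the fasttext library follows. The file cc.ar.300.bin (the public Common Crawl Arabic model) is an assumption, as the paper does not name the exact pretrained file, and train_paragraphs is a placeholder.

```python
# Build an embedding matrix from pretrained Arabic fastText vectors.
import numpy as np
import fasttext
from tensorflow.keras.preprocessing.text import Tokenizer

ft = fasttext.load_model("cc.ar.300.bin")  # assumed file: 300-dim subword-aware vectors

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_paragraphs)
vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0

embedding_matrix = np.zeros((vocab_size, 300))
for word, idx in tokenizer.word_index.items():
    # Subword n-grams let fastText return vectors even for OOV words.
    embedding_matrix[idx] = ft.get_word_vector(word)
```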
Experiments
Baseline DL models (LSTM, GRU, BiLSTM, CNN, CNN-LSTM, CNN-GRU, and CNN-BiLSTM) with fastText embeddings and text-based features were first evaluated as benchmarks. KerasTuner, integrated with TensorFlow and scikit-learn, was used for hyperparameter optimization to establish optimal configurations for each model. Bayesian optimization, outperforming Hyperband, was used to maximize weighted F1 scores for imbalanced data, selecting optimal hyperparameters by averaging validation scores across trials [111,131,132].
The DL model architectures were adapted from [118,125,126], with adjustments as needed, as detailed in Appendix E. Each architecture incorporates a dropout regularization layer, which randomly excludes neurons during training to prevent overfitting and enhance generalization. After model construction, the key hyperparameters required fine-tuning [71,118,124]: (1) batch size, controlling sample processing and weight updates; (2) epochs, balancing sufficient training against overfitting risks; (3) optimizers (e.g., SGD, Adam, RMSprop, Nadam), minimizing loss; (4) dropout rate, preventing overfitting by deactivating neurons randomly; and (5) hidden layer neurons, defining model complexity and data processing depth. Table 7 summarizes the explored hyperparameters and their variations. The softmax function in the output layer enabled class probability assignment for improved interpretability [124], while categorical cross-entropy loss and a 20% validation split were consistently applied [119]. Appendix F provides the best-performing hyperparameters and model-specific optimization details for each DL classifier, supporting re-experimentation and community use.
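To illustrate the general shape of these models, the following is a minimal Keras sketch of a CNN baseline with a frozen fastText embedding layer, dropout, and a softmax output; the layer sizes and training settings are placeholders rather than the tuned values reported in Appendix F, and vocab_size/embedding_matrix come from the earlier sketch.

```python
# Minimal CNN baseline sketch (illustrative layer sizes, not tuned values).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Embedding(vocab_size, 300,
                     weights=[embedding_matrix],
                     input_length=max_paragraph_len,  # sequences padded to longest paragraph
                     trainable=False),
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),                               # regularization against overfitting
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),             # easy / medium / difficult probabilities
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_seq_train, y_train_onehot,
          epochs=20, batch_size=32, validation_split=0.2)
```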
Results
CNN outperformed all classifiers, with 77% precision, 78% recall, and a 77% F1 score, reflecting balanced identification of positive instances and low misclassification rates [111]. GRU, despite being a newer RNN variant, achieved a lower F1 score of 61%, as its recall (59%) lagged behind its precision (72%). LSTM had the lowest performance, with a precision and F1 score of 52%. Among the hybrids, CNN-LSTM and CNN-BiLSTM both achieved 70% F1 scores, with CNN-LSTM exhibiting higher precision (73%) and CNN-BiLSTM slightly better recall (70%). The CNN-LSTM results highlight the advantage of preceding LSTM with CNN for feature extraction, outperforming the standalone LSTM (F1: 52%). CNN-GRU performed the worst among the hybrids (F1: 60%). Appendix G details all single and hybrid models using average weighted precision, recall, and F1 scores on the test set. Repeated k-fold cross-validation (5–50 repetitions) confirmed model resilience to data distribution shifts, with baseline models showing consistent performance and low variance (Figure 5).
DL models displayed greater F1-score fluctuations than the ML models, which is likely due to random weight initialization and variability in cross-validation splits [116].

5.2.2. Analysis of Feature Variations

Experiments
This section evaluates the effectiveness of gaze features in optimized baseline models across all classifiers. Feature variations—gaze-based (reading), gaze-based, and all features—were fine-tuned, with top-performing models identified for each (Table 8). Retraining used consistent hyperparameters for controlled comparison, involving model reconstruction, adjustment for handcrafted features, and retraining with the new feature sets.
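One way to realize this adjustment for handcrafted features is a two-branch functional model that concatenates the pooled text representation with the gaze feature vector. This is a hypothetical sketch under that assumption, not the study's exact architecture; n_gaze_features and the data arrays are placeholder names.

```python
# Two-branch model: fastText/CNN text branch + handcrafted gaze feature branch.
from tensorflow.keras import Input, layers, models

text_in = Input(shape=(max_paragraph_len,), name="tokens")
x = layers.Embedding(vocab_size, 300, weights=[embedding_matrix],
                     trainable=False)(text_in)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)

gaze_in = Input(shape=(n_gaze_features,), name="gaze")  # e.g., fixation/saccade/regression stats
merged = layers.concatenate([x, gaze_in])
merged = layers.Dropout(0.5)(merged)
out = layers.Dense(3, activation="softmax")(merged)

model = models.Model(inputs=[text_in, gaze_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit([X_seq_train, X_gaze_train], y_train_onehot, epochs=20, batch_size=32)
```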
Results
Each model’s performance was compared with the baseline to evaluate the contributions of different feature subsets (see Appendix G, Table A14 and Table A15). CNN, which achieved the highest F1 score of 77% with text-based features, exhibited a decline to 71% when retrained with gaze features or all features. In contrast, RNN variants (LSTM, BiLSTM, and GRU) demonstrated improvements when combining text-based and gaze-based features. Notably, LSTM and GRU exhibited significant F1-score increases from 52% to 69% and from 61% to 74%, respectively, along with enhanced precision and recall. However, BiLSTM exhibited a decline with both gaze feature variations, suggesting challenges in integrating gaze data due to model complexity and the absence of clear sequential patterns. For hybrid models, CNN-LSTM improved to 72% with all features, CNN-GRU exhibited consistent gains across all gaze feature variations, and CNN-BiLSTM showed a marginal increase to 71% with gaze (reading) but declined with other feature sets, likely due to overfitting caused by its higher complexity.
Overall Performance and Gaze Data
Table 9 highlights the top-performing models across feature variations; CNN was the best overall performer, with the exception of GRU, which achieved the highest F1 score with all features.
To more clearly gauge the impact of eye-tracking features, Table 10 presents the highest-performing features for each model.
Table 10 shows that eye-tracking data, alone or combined with text-based features, enhances Arabic readability assessment by capturing cognitive processes beyond text analysis. All DL models except CNN benefited from gaze data, with LSTM and CNN-BiLSTM performing best with gaze (reading), while CNN excelled using text-based features alone. Overall, integrating eye-tracking significantly improved the predictive power of most DL models.

6. Discussion

This study examined the integration of linguistic and eye-tracking features to assess Arabic text readability, exploring text complexity, cognitive metrics, and their relationship with reading challenges. It compared ML and DL models, evaluated eye-tracking contributions to readability prediction, and contrasted Arabic measures with those in other languages, presenting a comprehensive analysis of the findings.

6.1. Impact of Eye-Tracking Features on Arabic Text Readability Assessment

This study assessed Arabic text readability using online eye-tracking and offline linguistic metrics. It compared the text-based, gaze-based, gaze-based (reading), and all features sets. The results showed that gaze patterns significantly enhanced the ML models (KNN, MNB, SVM, LR, AdaBoost, and XGBoost), while RF performed well with both feature types. Gaze features often outperformed text features alone and boosted TF-IDF effectiveness. For DL models, integrating eye-tracking with embeddings improved LSTM, CNN-BiLSTM, GRU, and CNN-GRU performance, with CNN excelling on text features alone.
These results demonstrate that integrating eye-tracking features with traditional linguistic features enhances the accuracy of both ML and DL models in assessing Arabic text readability. By capturing cognitive processes, eye-tracking data complement textual analysis, offering a more comprehensive understanding of text complexity and improving readability predictions for native learners. This approach aligns with broader research trends [3,5,17,19,24,101,133,134,135,136] and highlights promising directions for future work in Arabic readability, text simplification, and summarization.

6.2. Impact of Eye-Tracking Experimental Conditions and Reading Features on Arabic Text Readability Prediction

While reading speed reflects comprehension, relying on it alone is insufficient for predicting readability, as speed-based models often underperform [137]. Combining eye movement data with speed provides stronger correlations to text complexity and improves analysis accuracy [4,92,137]. The literature reveals a lack of standardized protocols for studying gaze patterns in relation to readability [137]. Increased fixations and review times (“answer-seeking behavior”) are linked to text difficulty [133]. This study observed varied participant engagement with readability guidelines, suggesting comprehension challenges influenced by task demands, although the role of task goals remains unclear and warrants further research [95,137]. Moreover, we investigated whether incorporating eye-tracking features beyond reading metrics improves readability assessment. The ML and DL models showed mixed results: MNB, SVM, AdaBoost, and LSTM performed better with gaze (reading) features, while Bagging, RF, XGBoost, and BiLSTM favored gaze features. KNN and LR performed similarly with respect to both. These findings underscore the value of diverse gaze data and the need to consider experimental conditions beyond traditional metrics.
These findings highlight the complexity of reading processes, shaped by task demands and text characteristics, and underscore the need for task-specific measures in readability assessment [95,137]. The variability in the effectiveness of gaze and gaze (reading) features across the ML and DL models suggests that feature selection should align with task requirements and the combination of other features used. This calls for further investigation into how task demands influence cognitive mechanisms and optimal feature integration for improved readability prediction [137].

6.3. Comparison of ML and DL Models in Text Readability Prediction

ML models with handcrafted features outperformed TF-IDF, achieving F1 scores of 68–80% vs. 47–69%, highlighting the value of domain-specific feature engineering over TF-IDF’s sparse vectors [71]. Adding eye-tracking data further improved performance, with AdaBoost reaching 80%, showcasing the strength of ensemble learning in Arabic readability. Unlike past reliance on RF and limited use of XGBoost, this study also applied sequence-to-sequence DL models, where fastText-based approaches achieved F1 scores of 69–77%. CNN outperformed the other DL models across feature sets, excelling at capturing the linguistic factors influencing text complexity. These findings align with prior studies [71,124,126] showing that CNNs outperform BiLSTMs on smaller datasets, while BiLSTMs excel with larger ones. CNNs, although originally intended for computer vision, performed well with word embeddings [126], whereas RNN variants (LSTM, BiLSTM, and GRU) required more data to fully leverage their sequential processing strengths [71,138,139]. Combining CNNs with LSTM or GRU reduced training time and enhanced accuracy [126].
Overall, ML models with handcrafted features outperformed all vectorization methods, emphasizing the value of language-specific features for resource-scarce languages [106]. While DL automated feature extraction, ML surpassed it in readability tasks with short texts and limited data, reflecting DL’s challenges with data scarcity [20,43,71,89,124,125]. Future research could improve DL models by expanding dataset sizes. Despite advances such as transformers, this study of TF-IDF and word embeddings underscores the enduring value of handcrafted features [140]. Given the challenges of collecting extensive eye-tracking data and the small datasets involved, traditional ML remains preferable in early stages [106]. Variations in performance across text representations also highlight the need to consider overfitting risks and feature-task relevance [71,96,124].

6.4. Comparison of Cognitive Measures of Arabic Text Readability with Other Readability Measures for Arabic and Other Languages

This study compared cognitive metrics for Arabic text readability with existing Arabic and cross-linguistic studies, differing in audience, evaluation metrics, text types, corpus size, features, labeling, readability levels, goals, annotation methods, and vectorization techniques [1,43,65]. These distinctions affect the generalizability and comparability of findings in Arabic text and gaze-based assessments [141].
Arabic readability studies have evolved from early ML approaches (SVM, DT, KNN, and RF) with linguistic features to DL methods using embeddings such as fastText and Arabic-BERT since 2017 [1]. Recent studies report varied outcomes owing to differences in datasets, algorithms, and learner proficiencies. An early ML study achieved an F1 score of 90.5% with 166 linguistic features on GLOSS texts [8], while a 2024 DL study reported an F1 score of 89.52% using BERT-BiLSTM [90]. Methodological differences (e.g., GLOSS vs. Aljazeera-Learning corpora) limit direct comparisons. Overall, DL methods enhance accuracy for L2 learning, but resource gaps remain for L1 readers, underscoring ongoing challenges in Arabic language processing. Eye-tracking-based readability studies, on the other hand, cover diverse audiences and languages, mainly English learners, with some work on German, Japanese, and Brazilian Portuguese. Methods have evolved from ML (SVM and RF) to advanced DL models (ANN and CNN-LSTM), leveraging embeddings and transfer learning to achieve accuracies between 62.25% and 97.5% (e.g., the 2020 Brazilian Portuguese study [101]). This shift highlights the potential of computational models in education and assessment. While not replacing linguistic features, this study augments them with cognitive signals, achieving an 80% F1 score for Arabic, a low-resource language, and demonstrating the value of gaze features in readability prediction.
Based on these results, this study proposes an innovative approach to Arabic readability assessment that integrates eye-tracking signals with traditional linguistic features, advancing the field and aligning with the global shift from ML to DL models. The results show that combining even basic linguistic features with eye-tracking data significantly improves readability prediction, highlighting promising directions for future research and further refinement of assessment methods.

7. Conclusions, Limitations, and Future Work

Text readability has long been assessed by quantifying linguistic features to automatically estimate complexity [135]. While using gaze data to study reading is not new [92], it remains underused in readability research [135]. Eye-tracking, despite being time- and resource-intensive, offers a natural method for exploring reading processes and rereading behavior [17,18]. In Arabic, this gap is even larger: current readability measures rely on simple textual features without considering readers' cognitive behaviors, limiting their accuracy [65]. Moreover, the use of ML and DL in Arabic readability research is still emerging, although these methods have shown promise in modeling semantic relationships and improving predictions. This study addressed these gaps by pioneering the integration of eye-tracking data into Arabic readability assessment. It demonstrates how eye-tracking metrics improve prediction accuracy, with AdaBoost outperforming the other models when combining linguistic and gaze features. Beyond Arabic as a first language, these findings can support readability assessments for Arabic learners [136].
This study had several limitations affecting the experimental results. First, the small Arabic dataset and limited number of participants (15) constrain diversity and generalizability, echoing challenges observed in prior gaze corpora such as Dundee [142] and ZuCo [97]. As noted in prior surveys of eye-tracking corpora and their NLP applications [21], there is often a trade-off between the number of participants and the richness or length of the textual material: some corpora achieve larger participant pools but use shorter or fewer texts, while others prioritize more diverse or extended stimuli with fewer readers. AraEyebility follows the latter approach, emphasizing a diversity of text genres and readability levels while relying on a smaller participant pool, a choice dictated by the resource-intensive nature of eye-tracking data collection; creating large, annotated gaze datasets remains costly and time-intensive [20,138]. Consequently, the findings presented here should be considered preliminary and proof-of-concept rather than fully generalizable to all readers of Arabic. A second limitation is the dataset imbalance across readability levels. To address this, we applied class weighting during training to reduce bias toward the majority classes, though other techniques (e.g., data augmentation or oversampling) should be explored [25,141]. Third, refining feature engineering and readability guidelines is essential to address classification errors, especially between medium and difficult texts. Fourth, model adaptability across diverse text types (e.g., books, news) requires further testing to understand the impact of text variability. A fifth limitation is that gaze behavior may be influenced not only by text readability but also by demographic and behavioral factors such as gender, educational background, prior knowledge, and engagement. In AraEyebility, these confounds were mitigated through diverse text domains, randomized presentation order, comprehension checks, and broad participant recruitment across gender, age, and profession, which supported data reliability. However, a formal demographic analysis was not conducted, despite prior studies highlighting its importance [108,143]; addressing this in future research will advance the development of more inclusive, user-aware, and robust readability models. Moreover, we did not apply additional outlier detection because the AraEyebility corpus already incorporates a thorough evaluation framework: the dataset underwent participant-level validation (e.g., calibration and the exclusion of low-quality recordings), engagement checks (e.g., comprehension questions and random insertions), and systematic quality controls. These measures ensured that the gaze data and readability annotations were consistent and reliable, reducing the likelihood of spurious outliers. Finally, model performance depends on both feature effectiveness and parameter configurations, emphasizing the need for ongoing optimization.
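As a brief illustration of the weighting adjustment mentioned above, the following sketch shows one standard way of computing and applying balanced class weights in scikit-learn; the label distribution is a synthetic illustration and the classifier choice is arbitrary, not the paper's exact configuration.

```python
# Minimal sketch: balanced class weighting for imbalanced readability levels.
# The label distribution below is a synthetic illustration, not the corpus's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 50 + [1] * 30 + [2] * 10)   # skewed readability levels
X = np.random.default_rng(2).random((90, 8))   # placeholder feature matrix

weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights.round(2))))  # rarer classes get larger weights

# Most scikit-learn classifiers accept the same adjustment directly:
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```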
To improve the generalizability and representativeness of the results, future work should expand the Arabic cognitive corpus by increasing both the number of participants and the diversity of texts, as well as by balancing readability levels. Such efforts would enhance model performance and allow testing with more diverse algorithms. In addition, investigating text variability (e.g., length, genre, and readability levels) and the interaction between linguistic and gaze features could yield deeper insights [144,145]. Future research should also incorporate systematic outlier detection and removal at the model-training level, complementing the corpus-level validation already embedded in AraEyebility, to enhance robustness and reliability. Transformer-based models were not employed in this study because of computational resource constraints and the relatively small size of the AraEyebility dataset, which would have increased the risk of overfitting when fine-tuning large-scale models. Future research should evaluate transformer-based architectures, such as AraBERT, in combination with gaze features (alongside transfer learning and diverse embeddings) to benchmark and enhance readability prediction [1,57,88,89]. It should also include a systematic feature importance analysis of gaze-based metrics to identify the most influential cognitive predictors of Arabic text readability. Additional directions include refining data preprocessing, exploring alternative cognitive signals such as EEG, and leveraging large language models (LLMs) to decode cognitive states [146] or to serve as aids and alternatives to human judgment [147,148,149]. Regarding preprocessing, stop words were retained so that the texts remained unaltered, as the aim was to collect eye-tracking data and assess readability at the paragraph level; this preserved the natural spillover effects critical for capturing authentic eye-movement patterns, consistent with prior methodology [27]. While we did not test the inclusion versus exclusion of stop words, we acknowledge this as a valuable direction for future research. Research could also examine how different preprocessing strategies, such as stemming, lemmatization, and diacritization, affect readability prediction [150,151]. Finally, accelerated gaze dataset collection methods [135,138] and LLM-generated synthetic eye movements [152] can support scalable, cognitively driven readability models, facilitating their broader adoption.
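For the feature-importance direction mentioned above, permutation importance is one model-agnostic option. The sketch below uses hypothetical gaze feature names and synthetic data; it illustrates the analysis pattern rather than results from this study.

```python
# Minimal sketch: permutation importance over (hypothetical) gaze metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
names = ["fixation_count", "mean_fixation_duration",
         "regression_rate", "mean_saccade_length"]   # hypothetical features
X = rng.random((120, len(names)))
y = rng.integers(0, 3, 120)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, imp in sorted(zip(names, result.importances_mean),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:+.3f}")   # drop in score when the feature is shuffled
```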

Author Contributions

Conceptualization, I.B., H.A.-K. and A.A.-S.; Methodology, I.B., H.A.-K. and A.A.-S.; Software, I.B.; Validation, I.B.; Formal analysis, I.B. and H.A.-K.; Investigation, I.B., H.A.-K. and A.A.-S.; Resources, I.B. and H.A.-K.; Data curation, I.B.; Writing—original draft preparation, I.B.; Writing—review and editing, H.A.-K. and A.A.-S.; Visualization, I.B.; Supervision, H.A.-K. and A.A.-S.; Project administration, I.B., H.A.-K. and A.A.-S.; Funding acquisition, I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.

Institutional Review Board Statement

The AraEyebility corpus used in this study was collected according to King Saud University’s Institutional Review Board rules and regulations and was approved by the Ethics Committee of King Saud University, Riyadh, Saudi Arabia (no. 21/0892/IRB, dated 19 October 2021).

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study.

Data Availability Statement

The AraEyebility corpus used in this study is publicly available at: https://doi.org/10.7910/DVN/P5WPNS.

Acknowledgments

The authors would like to thank King Abdulaziz City for Science and Technology for funding and supporting this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Arabic Readability and Eye-Tracking-Based Studies

Table A1. Summary of studies of Arabic readability assessment.
Year | Study | Audience | Algorithm(s) | Level | Reported Results (%)
ML-based methods
2010 | [7] | L1 learners | SVM, NB, DT | 3 | Accuracy: 77.77
2013 | [79] | L2 learners | SVM, MIRA | 7 | MSE: 0.198
2014 | [82] | L2 learners | DT | 4 | Accuracy (k-means): 86.95
2014 | [82] | L2 learners | DT | 5 | Accuracy (balanced): 91.3
2014 | [82] | L2 learners | DT | 25 | Accuracy: 60.8
2014 | [80] | L2 learners | MIRA, SVM | 7 | MSE: 0.198
2014 | [81] | L2 learners | KNN | 3 | F-score: 71.9
2014 | [81] | L2 learners | KNN | 5 | F-score: 51.9
2015 | [29] | L2 learners | ZeroR, OneR, DT, KNN, SVM, RF | 3 | Accuracy: 73.31, F-score: 73
2015 | [29] | L2 learners | ZeroR, OneR, DT, KNN, SVM, RF | 5 | Accuracy: 59.76, F-score: 58.9
2018 | [8] | L2 learners | ZeroR, OneR, DT, KNN, SVM, RF | 3 | Accuracy: 90.43, F-score: 90.5
2018 | [8] | L2 learners | ZeroR, OneR, DT, KNN, SVM, RF | 4 | Accuracy: 89.56, F-score: 89.5
2018 | [8] | L2 learners | ZeroR, OneR, DT, KNN, SVM, RF | 5 | Accuracy: 89.56, F-score: 89.5
2018 | [91] | L1 and L2 learners | ZeroR, OneR, DT, KNN, SVM, RF | 4 | L1 Accuracy: 94.8; L2 Accuracy: 72.4
2020 | [85] | L1 learners | DT, KNN, SVM, RF | 3 | Accuracy: 78.84
2021 | [77] | L1 learners, L1 readers | MNB, BNB, SVM, RF (BoW, TF-IDF) | 4 | Accuracy: 87.14
2022 | [86] | L1 and L2 learners | ---- | 3 | Accuracy (class clustering): 70, 75, 88, 68
2022 | [86] | L1 and L2 learners | ---- | 5 | Accuracy (SMOTE): 60, 63.04, 98.46, 67.21
2022 | [83] | L2 learners | DT, KNN, SVM, RF | 3 | Accuracy: 86.15
2022 | [83] | L2 learners | DT, KNN, SVM, RF | 5 | Accuracy: 76.92
DL-based methods
2017 | [56] | L1 learners, L1 readers | DT, NB, KNN, SVM, ANN | 3 | Accuracy: 91.41
2021 | [88] | L2 learners | SVM, RF, KNN, XGBoost (fastText, mBERT, XLM-R, Arabic-BERT) | 3 | F-score: 80
2023 | [57] | L2 learners | RF (AraVec, TF-IDF) | 3 | Accuracy: 80.63, F-score: 78.82
2023 | [89] | L2 learners | SVM, RF (ArabicBert, AraBert, XLM-R) | 3 | Accuracy: 82.67, F-score: 83.49
2024 | [1] | L1 learners | RF (AraVec, AraBert) | 3 | Accuracy: 77.5, F-score: 76.93
2024 | [90] | L2 learners | BERT-BiLSTM | 3 | Accuracy: 89.55, Precision: 89.65, Recall: 89.55, F1 score: 89.52
Table A2. Summary of eye-tracking-based studies.
Year | Study | Audience | Algorithm(s) | Level | Reported Results (%)
2012 | [5] | German L1 learners | SVM | 6 | Accuracy: 62.25
2013 | [133] | English L1 learners | FFNN | 10 | MSE: 0.38
2014 | [134] | English L1 and L2 learners | FFNN | 2, 3 | Accuracy: 79–89
2015 | [25] | English L1 and L2 readers | DT, RF, FFNN | 3 | MSE: 0.22
2015 | [24] | English L1 users | RF | 3 | Accuracy: 81.25
2016 | [17] | English L2 learners | GAMMs | 2 | Variance: 0.78
2016 | [135] | English L2 learners | LR, SVM | 3 | Accuracy: 75.21
2017 | [26] | English L2 learners | FFNN, SVM, RF | 2 | Precision: 66
2017 | [3] | Japanese L2 learners | SVM | 5 | MAE: 0.33
2017 | [38] | English L1 readers | Multi-task MLP, LR | 2 | Accuracy: 86.54
2018 | [136] | English L2 learners, L1 readers | Single-task and multi-task MLP | 2 | Accuracy: 86.62
2018 | [40] * | English L1 readers | NB, LR, RF, FFNN | 4 | QWK: 55.2
2018 | [92] | English L1 readers | Linear regression | 2 | CC: 0.59
2020 | [101] | Brazilian Portuguese L1 readers | Single-task MLP, multi-task MLP, transfer learning | 2 | Accuracy: 97.5
2021 | [19] | English L1 readers | SVM, fine-tuned ALBERT | 7 | RMSE: 0.44
2021 | [94] * | English L2 learners | CNN-LSTM | 5 | QWK: 49.8
2022 | [141] * | English L1 learners | RF | 2 | AUPRC: 81
2023 | [96] * | English L1 learners | FFNN (BERT + RoBERTa) | 4 | F1 score: 69.9
2023 | [4] * | English L1 learners | RF, linear regression | 2 | CC: 0.32
2024 | [95] * | English L1 and L2 readers | Multiple regression | 3 | Variance explained: 0.38
2024 | [153] * | English L1 learners | Logistic regression, CNN, BEyeLSTM, Eyettention, RoBERTa-QEye, MAG-QEye, PostFusion-QEye | 4 | Balanced accuracy: up to 30.9
2025 | [149] * | English L1 learners | Traditional formulas, modern readability measures, LLMs, commercial system psycholinguistic measures | 2 | Pearson r: up to 0.40
2025 | [154] * | English L2 learners | SVM, RF | 3 | Accuracy: 52.1, F1: 52.1
2025 | [155] * | English L1 readers | RoBERTa | --- | Spearman correlation: ~0.91, MAE: ~4.57
Asterisk (*) indicates studies that do not directly target readability, and (---) indicates not available.

Appendix B. Hyperparameter Optimization Results

The detailed optimization results for all selected hyperparameters of the ML models are provided below; a sketch of how such a search can be run follows Table A3.
Table A3. Hyperparameter values that resulted from the Bayesian optimization of each ML classifier.
Model | Parameters and Values
KNN | algorithm: kd_tree, leaf_size: 21, metric: chebyshev, n_neighbors: 19, p: 2, weights: distance
SVM | C: 0.1, class_weight: none, gamma: auto, kernel: linear
LR | C: 10.1, class_weight: none, max_iter: 70,000, penalty: l2, solver: sag
MNB | alpha: 50.0, fit_prior: false, force_alpha: true
Bagging | base_estimator_class_weight: balanced, base_estimator_criterion: entropy, base_estimator_max_depth: none, base_estimator_max_features: log2, base_estimator_min_samples_leaf: 1, base_estimator_min_samples_split: 10, bootstrap: false, max_features: 0.9, max_samples: 0.8, n_estimators: 500
RF | bootstrap: false, class_weight: none, criterion: gini, max_depth: 20, max_features: log2, min_samples_leaf: 1, min_samples_split: 3, n_estimators: 1500
AdaBoost | algorithm: SAMME, base_estimator_class_weight: none, base_estimator_criterion: entropy, base_estimator_max_depth: 5, base_estimator_max_features: log2, base_estimator_min_samples_leaf: 1, base_estimator_min_samples_split: 10, base_estimator_splitter: random, learning_rate: 0.05, n_estimators: 1000
XGBoost | booster: gbtree, colsample_bytree: 0.6, gamma: 1.0, learning_rate: 0.1, max_delta_step: 7, max_depth: 5, min_child_weight: 1, n_estimators: 50, reg_alpha: 1e-5, reg_lambda: 0.1, subsample: 0.773478
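The sketch below illustrates how values like those in Table A3 can be obtained. It is a minimal sketch under stated assumptions: scikit-optimize's BayesSearchCV is one library option for Bayesian search, the KNN search space is illustrative, and the data are synthetic.

```python
# Minimal sketch: Bayesian hyperparameter search for KNN (synthetic data).
# scikit-optimize (skopt) is one library offering BayesSearchCV.
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X, y = rng.random((90, 10)), rng.integers(0, 3, 90)

search = BayesSearchCV(
    KNeighborsClassifier(),
    {
        "n_neighbors": Integer(3, 25),
        "leaf_size": Integer(10, 50),
        "weights": Categorical(["uniform", "distance"]),
        "metric": Categorical(["euclidean", "manhattan", "chebyshev"]),
    },
    n_iter=25, cv=5, scoring="f1_weighted", random_state=0,
)
search.fit(X, y)
print(search.best_params_)   # values comparable in kind to the KNN row above
```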

Appendix C

Table A4. Performance of the basic models (i.e., the baseline models).
Model | Precision (%) | Recall (%) | F1 Score (%)
KNN | 66 | 69 | 67
SVM | 68 | 73 | 70
LR | 73 | 73 | 71
MNB | 68 | 64 | 65
Table A5. Performance of the ensemble models (i.e., the baseline models).
Model | Precision (%) | Recall (%) | F1 Score (%)
Bagging | 72 | 70 | 71
RF | 77 | 75 | 74
AdaBoost | 72 | 71 | 71
XGBoost | 69 | 70 | 70
Table A6. Performance of the basic models with different feature variations. The bold numbers have the highest values.
Model | Feature Variation | Precision (%) | Recall (%) | F1 Score (%)
KNN | Text (Baseline) | 66 | 69 | 67
KNN | Gaze (Reading) | 66 | 69 | 68
KNN | Gaze | 67 | 70 | 68
KNN | All Features | 66 | 70 | 68
SVM | Text (Baseline) | 68 | 73 | 70
SVM | Gaze (Reading) | 68 | 71 | 69
SVM | Gaze | 66 | 69 | 68
SVM | All Features | 74 | 75 | 74
LR | Text (Baseline) | 73 | 73 | 71
LR | Gaze (Reading) | 65 | 69 | 67
LR | Gaze | 65 | 69 | 67
LR | All Features | 75 | 75 | 75
MNB | Text (Baseline) | 68 | 64 | 65
MNB | Gaze (Reading) | 73 | 65 | 68
MNB | Gaze | 68 | 65 | 66
MNB | All Features | 68 | 66 | 67
Table A7. Performance of the ensemble models with different feature variations. The bold numbers have the highest values.
Model | Feature Variation | Precision (%) | Recall (%) | F1 Score (%)
Bagging | Text (Baseline) | 72 | 70 | 71
Bagging | Gaze (Reading) | 77 | 76 | 76
Bagging | Gaze | 78 | 77 | 77
Bagging | All Features | 77 | 76 | 76
RF | Text (Baseline) | 77 | 75 | 74
RF | Gaze (Reading) | 72 | 72 | 72
RF | Gaze | 74 | 74 | 74
RF | All Features | 72 | 75 | 73
AdaBoost | Text (Baseline) | 72 | 71 | 71
AdaBoost | Gaze (Reading) | 72 | 71 | 71
AdaBoost | Gaze | 70 | 70 | 70
AdaBoost | All Features | 80 | 81 | 80
XGBoost | Text (Baseline) | 69 | 70 | 70
XGBoost | Gaze (Reading) | 72 | 72 | 72
XGBoost | Gaze | 73 | 73 | 73
XGBoost | All Features | 76 | 76 | 76

Appendix D

Table A8. Performance of the basic models with TF-IDF and different feature variations. The bold numbers have the highest values.
Model | Feature Variation | Precision (%) | Recall (%) | F1 Score (%)
KNN | Text (Baseline) | 66 | 69 | 67
KNN | TF-IDF | 44 | 58 | 47
KNN | TF-IDF + Text | 65 | 69 | 67
KNN | TF-IDF + Gaze (Reading) | 66 | 69 | 68
KNN | TF-IDF + Gaze | 65 | 69 | 69
KNN | TF-IDF + All Features | 66 | 70 | 68
SVM | Text (Baseline) | 68 | 73 | 70
SVM | TF-IDF | 37 | 61 | 46
SVM | TF-IDF + Text | 72 | 72 | 72
SVM | TF-IDF + Gaze (Reading) | 67 | 71 | 69
SVM | TF-IDF + Gaze | 66 | 69 | 67
SVM | TF-IDF + All Features | 73 | 74 | 73
LR | Text (Baseline) | 73 | 73 | 71
LR | TF-IDF | 68 | 69 | 65
LR | TF-IDF + Text | 70 | 71 | 70
LR | TF-IDF + Gaze (Reading) | 66 | 69 | 68
LR | TF-IDF + Gaze | 64 | 68 | 66
LR | TF-IDF + All Features | 71 | 72 | 71
MNB | Text (Baseline) | 68 | 64 | 65
MNB | TF-IDF | 62 | 42 | 36
MNB | TF-IDF + Text | 38 | 61 | 47
MNB | TF-IDF + Gaze (Reading) | 37 | 61 | 46
MNB | TF-IDF + Gaze | 37 | 61 | 46
MNB | TF-IDF + All Features | 37 | 61 | 46
Table A9. Performance of the ensemble models with TF-IDF and different feature variations. The bold numbers have the highest values.
Model | Feature Variation | Precision (%) | Recall (%) | F1 Score (%)
Bagging | Text (Baseline) | 72 | 70 | 71
Bagging | TF-IDF | 50 | 62 | 49
Bagging | TF-IDF + Text | 55 | 63 | 50
Bagging | TF-IDF + Gaze (Reading) | 62 | 64 | 54
Bagging | TF-IDF + Gaze | 59 | 64 | 52
Bagging | TF-IDF + All Features | 64 | 69 | 62
RF | Text (Baseline) | 77 | 75 | 74
RF | TF-IDF | 37 | 61 | 46
RF | TF-IDF + Text | 37 | 61 | 46
RF | TF-IDF + Gaze (Reading) | 38 | 61 | 47
RF | TF-IDF + Gaze | 38 | 61 | 47
RF | TF-IDF + All Features | 55 | 63 | 50
AdaBoost | Text (Baseline) | 72 | 71 | 71
AdaBoost | TF-IDF | 37 | 61 | 46
AdaBoost | TF-IDF + Text | 60 | 65 | 57
AdaBoost | TF-IDF + Gaze (Reading) | 64 | 68 | 61
AdaBoost | TF-IDF + Gaze | 61 | 66 | 58
AdaBoost | TF-IDF + All Features | 58 | 64 | 57
XGBoost | Text (Baseline) | 69 | 70 | 70
XGBoost | TF-IDF | 64 | 69 | 65
XGBoost | TF-IDF + Text | 67 | 70 | 69
XGBoost | TF-IDF + Gaze (Reading) | 67 | 70 | 69
XGBoost | TF-IDF + Gaze | 66 | 69 | 67
XGBoost | TF-IDF + All Features | 68 | 71 | 69

Appendix E

The unique DL model architectures are outlined here; each incorporates a dropout regularization layer to prevent overfitting during training (a Keras sketch of the CNN variant follows the list):
  • LSTM: LSTMs excel in modeling sequential data by capturing contextual dependencies among words through their gated architecture, facilitating accurate predictions. The model comprises two LSTM layers followed by a dropout layer [71,125].
  • BiLSTM: This architecture utilizes a single BiLSTM layer, leveraging bidirectional processing to enhance context understanding [124]. A single-layer BiLSTM is preferred over deeper variants to mitigate overfitting, especially with smaller datasets.
  • GRU: GRUs are employed here with two layers, optimizing information flow within the network through advanced gating mechanisms [125]. The model integrates an initial GRU layer with an embedding layer, followed by another GRU layer and a dropout layer.
  • CNN: CNNs adapt well to text analysis by focusing on key n-gram features rather than processing every word individually [71]. The architecture includes multiple convolutional and max pooling layers, concluding with a dropout layer, to extract significant patterns from input texts.
  • CNN-LSTM: This hybrid model combines CNNs for local feature extraction with LSTM’s sequential processing capabilities [118,125,126]. It begins with a convolutional layer followed by max pooling, integrating these features into an LSTM layer and concluding with a dropout layer.
  • CNN-GRU: Similarly to CNN-LSTM, this model incorporates CNNs for initial feature extraction but uses GRU instead of LSTM for sequential data processing, enhancing long-term dependency handling [126].
  • CNN-BiLSTM: This variant substitutes LSTM with BiLSTM in the CNN-LSTM architecture, leveraging bidirectional processing within the sequential context established by CNNs [118,125,126].
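To make the CNN description concrete, the following Keras sketch assembles the convolution and pooling stack using the tuned values reported in Appendix F. The vocabulary size, sequence length, embedding dimension, and the exact placement of the pooling-to-dense transition are assumptions; in practice, pretrained fastText weights would be loaded into the embedding layer.

```python
# Sketch of the CNN variant (values from Tables A10/A11; shapes are assumed).
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, SEQ_LEN, EMB_DIM, NUM_CLASSES = 20000, 200, 300, 3  # assumptions

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, EMB_DIM),                      # fastText weights would go here
    layers.Conv1D(96, kernel_size=3, activation="relu"),   # conv_filters1 / kernel_size1
    layers.MaxPooling1D(pool_size=3),                      # pool_size1
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # conv_filters2 / kernel_size2
    layers.MaxPooling1D(pool_size=3),                      # pool_size2
    layers.Dropout(0.2),                                   # cnn_dropout_rate (Table A11)
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),                  # dense units (Table A10)
    layers.Dropout(0.2),                                   # dense dropout (Table A10)
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Nadam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```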

Appendix F

The following tables present the best-performing common hyperparameters for each network (Table A10) and the model-specific values selected by Bayesian optimization for each DL classifier (Table A11); a KerasTuner sketch of such a search follows Table A11.
Table A10. Common parameters of each DL classifier.
Parameter | LSTM | BiLSTM | GRU | CNN | CNN-LSTM | CNN-GRU | CNN-BiLSTM
Batch size | 32 | 16 | 16 | 16 | 16 | 32 | 16
Epoch count | 96 | 29 | 35 | 100 | 95 | 55 | 37
Optimizer | SGD | Nadam | RMSprop | Nadam | Nadam | SGD | Adam
Learning rate | 0.01 | 0.001 | 0.01 | 0.001 | 0.001 | 0.0001 | 0.0001
Dense units | 512 | 320 | 352 | 128 | 256 | 448 | 224
Dense dropout | 0.2 | 0.1 | 0.2 | 0.2 | 0.2 | 0.1 | 0
Table A11. Hyperparameter values that resulted from the Bayesian optimization of each DL classifier.
Model | Parameters and Values
LSTM | lstm_units1: 384, lstm_units2: 128, lstm_dropout_rate: 0.1, class_weight: balanced
BiLSTM | lstm_units: 128, lstm_dropout_rate: 0.2, class_weight: balanced
GRU | gru_units1: 256, gru_units2: 288, gru_dropout_rate: 0, class_weight: none
CNN | conv_filters1: 96, kernel_size1: 3, pool_size1: 3, conv_filters2: 128, kernel_size2: 3, pool_size2: 3, cnn_dropout_rate: 0.2, class_weight: balanced
CNN-LSTM | conv_filters: 32, kernel_size: 3, pool_size: 3, lstm_units: 288, lstm_dropout_rate: 0.5, class_weight: balanced
CNN-GRU | conv_filters: 96, kernel_size: 7, pool_size: 3, gru_units: 96, gru_dropout_rate: 0, class_weight: balanced
CNN-BiLSTM | conv_filters: 128, kernel_size: 7, pool_size: 2, lstm_units: 384, lstm_dropout_rate: 0, class_weight: balanced
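The sketch below shows how a search like the one summarized above could be expressed with KerasTuner [132], using the LSTM search space as an example. The data pipeline (commented out), embedding shapes, and search ranges are assumptions, not the exact configuration used in this study.

```python
# Minimal sketch: Bayesian search over LSTM hyperparameters with KerasTuner.
# Shapes, ranges, and the (commented) data pipeline are assumptions.
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential([
        keras.Input(shape=(200,)),
        layers.Embedding(20000, 300),
        layers.LSTM(hp.Int("lstm_units1", 64, 512, step=64),
                    return_sequences=True),
        layers.LSTM(hp.Int("lstm_units2", 64, 512, step=64)),
        layers.Dropout(hp.Choice("lstm_dropout_rate", [0.0, 0.1, 0.2, 0.5])),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.BayesianOptimization(build_model, objective="val_accuracy",
                                max_trials=20, overwrite=True,
                                directory="tuning", project_name="lstm")
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
# best_hps = tuner.get_best_hyperparameters(1)[0]
```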

Appendix G

Table A12. Performance of the single DL models (baseline models).
Model | Precision (%) | Recall (%) | F1 Score (%)
LSTM | 52 | 58 | 52
BiLSTM | 72 | 70 | 70
GRU | 71 | 59 | 61
CNN | 77 | 78 | 77
Table A13. Performance of the hybrid DL models (baseline models).
Model | Precision (%) | Recall (%) | F1 Score (%)
CNN-LSTM | 73 | 69 | 70
CNN-GRU | 71 | 59 | 60
CNN-BiLSTM | 70 | 70 | 70
Table A14. Performance of the single DL models with fastText and different feature variations. The bold numbers indicate the highest values.
Model | Feature Variation | Precision (%) | Recall (%) | F1 Score (%)
LSTM | Text (Baseline) | 52 | 58 | 52
LSTM | Gaze (Reading) | 74 | 68 | 69
LSTM | Gaze | 74 | 66 | 65
LSTM | All Features | 71 | 69 | 69
BiLSTM | Text (Baseline) | 72 | 70 | 70
BiLSTM | Gaze (Reading) | 75 | 52 | 58
BiLSTM | Gaze | 77 | 67 | 68
BiLSTM | All Features | 73 | 72 | 72
GRU | Text (Baseline) | 71 | 59 | 61
GRU | Gaze (Reading) | 74 | 58 | 59
GRU | Gaze | 66 | 69 | 67
GRU | All Features | 76 | 73 | 74
CNN | Text (Baseline) | 77 | 78 | 77
CNN | Gaze (Reading) | 72 | 74 | 72
CNN | Gaze | 73 | 73 | 73
CNN | All Features | 71 | 72 | 71
Table A15. Performance of the hybrid DL models with fastText and different feature variations. The bold numbers indicate the highest values.
Model | Feature Variation | Precision (%) | Recall (%) | F1 Score (%)
CNN-LSTM | Text (Baseline) | 73 | 69 | 70
CNN-LSTM | Gaze (Reading) | 70 | 64 | 63
CNN-LSTM | Gaze | 71 | 65 | 67
CNN-LSTM | All Features | 73 | 73 | 72
CNN-GRU | Text (Baseline) | 71 | 59 | 60
CNN-GRU | Gaze (Reading) | 69 | 66 | 67
CNN-GRU | Gaze | 76 | 69 | 71
CNN-GRU | All Features | 76 | 73 | 73
CNN-BiLSTM | Text (Baseline) | 70 | 70 | 70
CNN-BiLSTM | Gaze (Reading) | 78 | 71 | 71
CNN-BiLSTM | Gaze | 67 | 69 | 68
CNN-BiLSTM | All Features | 66 | 68 | 66

References

  1. Berrichi, S.; Nassiri, N.; Mazroui, A.; Lakhouaja, A. Exploring the Impact of Deep Learning Techniques on Evaluating Arabic L1 Readability. In Artificial Intelligence, Data Science and Applications; Springer Nature: Cham, Switzerland, 2024; pp. 1–7. [Google Scholar]
  2. McCarthy, K.S.; Yan, E.F. Reading Comprehension and Constructive Learning: Policy Considerations in the Age of Artificial Intelligence. Policy Insights Behav. Brain Sci. 2024, 11, 19–26. [Google Scholar] [CrossRef]
  3. Sanches, C.L.; Augereau, O.; Kise, K. Using the Eye Gaze to Predict Document Reading Subjective Understanding. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 8, pp. 28–31. [Google Scholar]
  4. Southwell, R.; Mills, C.; Caruso, M.; D’Mello, S.K. Gaze-based predictive models of deep reading comprehension. User Model. User-Adapt. Interact. 2023, 33, 687–725. [Google Scholar] [CrossRef]
  5. Biedert, R.; Dengel, A.; Elshamy, M.; Buscher, G. Towards robust gaze-based objective quality measures for text. In Proceedings of the Symposium on Eye Tracking Research and Applications, Santa Barbara, CA, USA, 28–30 March 2012; pp. 201–204. [Google Scholar]
  6. Balyan, R.; McCarthy, K.S.; McNamara, D.S. Comparing Machine Learning Classification Approaches for Predicting Expository Text Difficulty. In Proceedings of the Thirty-First International Flairs Conference, Melbourne, FL, USA, 21–23 May 2018; pp. 421–426. [Google Scholar]
  7. Al-Khalifa, H.S.; Al-Ajlan, A.A. Automatic readability measurements of the Arabic text: An exploratory study. Arab. J. Sci. Eng. 2010, 35, 103–124. [Google Scholar]
  8. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Modern Standard Arabic Readability Prediction. In Proceedings of the Arabic Language Processing: From Theory to Practice (ICALP 2017), Fez, Morocco, 11–12 October 2017; pp. 120–133. [Google Scholar]
  9. Feng, L.; Elhadad, N.M.; Huenerfauth, M. Cognitively motivated features for readability assessment. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 30 March–3 April 2009; pp. 229–237. [Google Scholar]
  10. Dale, E.; Chall, J.S. The Concept of Readability. Elem. Engl. 1949, 26, 19–26. [Google Scholar]
  11. Alotaibi, S.; Alyahya, M.; Al-Khalifa, H.; Alageel, S.; Abanmy, N. Readability of Arabic Medicine Information Leaflets: A Machine Learning Approach. Procedia Comput. Sci. 2016, 82, 122–126. [Google Scholar] [CrossRef]
  12. Al Tamimi, A.K.; Jaradat, M.; Al-Jarrah, N.; Ghanem, S. AARI: Automatic Arabic readability index. Int. Arab J. Inf. Technol. 2014, 11, 370–378. [Google Scholar]
  13. Baazeem, I. Analysing the Effects of Latent Semantic Analysis Parameters on Plain Language Visualisation. Master’s Thesis, Queensland University, Brisbane, Australia, 2015. [Google Scholar]
  14. Collins-Thompson, K. Computational assessment of text readability: A survey of current and future research. ITL Int. J. Appl. Linguist. 2014, 165, 97–135. [Google Scholar] [CrossRef]
  15. Mesgar, M.; Strube, M. Graph-based coherence modeling for assessing readability. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, CO, USA, 4–5 June 2015; pp. 309–318. [Google Scholar]
  16. Balakrishna, S.V. Analyzing Text Complexity and Text Simplification: Connecting Linguistics, Processing and Educational Applications. Ph.D. Thesis, der Eberhard Karls Universität Tübingen, Tübingen, Germany, 2015. [Google Scholar]
  17. Vajjala, S.; Meurers, D.; Eitel, A.; Scheiter, K. Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), Osaka, Japan, 11 December 2016; pp. 38–48. [Google Scholar]
  18. Vajjala, S.; Lucic, I. On understanding the relation between expert annotations of text readability and target reader comprehension. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, Florence, Italy, 2 August 2019; pp. 349–359. [Google Scholar]
  19. Sarti, G.; Brunato, D.; Dell’Orletta, F. That looks hard: Characterizing linguistic complexity in humans and language models. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Virtual, 10 June 2021; pp. 48–60. [Google Scholar]
  20. Ghosh, S.; Dhall, A.; Hayat, M.; Knibbe, J.; Ji, Q. Automatic Gaze Analysis: A Survey of Deep Learning Based Approaches. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 61–84. [Google Scholar] [CrossRef]
  21. Mathias, S.; Kanojia, D.; Mishra, A.; Bhattacharyya, P. A Survey on Using Gaze Behaviour for Natural Language Processing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Survey Track, Yokohama, Japan, 7–15 January 2021; pp. 4907–4913. [Google Scholar]
  22. Just, M.A.; Carpenter, P.A. A theory of reading: From eye fixations to comprehension. Psychol. Rev. 1980, 87, 329–354. [Google Scholar] [CrossRef]
  23. Atvars, A. Eye movement analyses for obtaining Readability Formula for Latvian texts for primary school. Procedia Comput. Sci. 2017, 104, 477–484. [Google Scholar] [CrossRef]
  24. Chen, Y.; Zhang, W.; Song, D.; Zhang, P.; Ren, Q.; Hou, Y. Inferring Document Readability by Integrating Text and Eye Movement Features. In Proceedings of the SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research, Santiago, Chile, 2 December 2015. [Google Scholar]
  25. Copeland, L.; Gedeon, T.; Caldwell, S. Effects of text difficulty and readers on predicting reading comprehension from eye movements. In Proceedings of the 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Gyor, Hungary, 19–21 October 2015; pp. 407–412. [Google Scholar]
  26. Garain, U.; Pandit, O.; Augereau, O.; Okoso, A.; Kise, K. Identification of reader specific difficult words by analyzing eye gaze and document content. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1346–1351. [Google Scholar]
  27. Baazeem, I.; Al-Khalifa, H.; Al-Salman, A. Cognitively Driven Arabic Text Readability Assessment Using Eye-Tracking. Appl. Sci. 2021, 11, 8607. [Google Scholar] [CrossRef]
  28. Al-Ajlan, A.A.; Al-Khalifa, H.S.; Al-Salman, A.S. Towards the development of an automatic readability measurements for Arabic language. In Proceedings of the Third International Conference on Digital Information Management, London, UK, 13–16 November 2008; pp. 506–511. [Google Scholar]
  29. Saddiki, H.; Bouzoubaa, K.; Cavalli-Sforza, V. Text readability for Arabic as a foreign language. In Proceedings of the 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), Marrakech, Morocco, 17–20 November 2015; pp. 1–8. [Google Scholar]
  30. Klaib, A.F.; Alsrehin, N.O.; Melhem, W.Y.; Bashtawi, H.O.; Magableh, A.A. Eye tracking algorithms, techniques, tools, and applications with an emphasis on machine learning and Internet of Things technologies. Expert Syst. Appl. 2021, 166, 114037. [Google Scholar] [CrossRef]
  31. Singh, H.; Singh, J. Human eye tracking and related issues: A review. Int. J. Sci. Res. Publ. 2012, 2, 1–9. [Google Scholar]
  32. Conklin, K.; Pellicer-Sánchez, A. Using eye-tracking in applied linguistics and second language research. Second Lang. Res. 2016, 32, 453–467. [Google Scholar] [CrossRef]
  33. Tobii. Available online: https://www.tobii.com (accessed on 20 November 2021).
  34. SR Research EyeLink. Available online: https://www.sr-research.com (accessed on 31 March 2021).
  35. Al-Edaily, A.; Al-Wabil, A.; Al-Ohali, Y. Interactive Screening for Learning Difficulties: Analyzing Visual Patterns of Reading Arabic Scripts with Eye Tracking. In Proceedings of the HCI International 2013—Posters’ Extended Abstracts, Las Vegas, NV, USA, 21–26 July 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 3–7. [Google Scholar]
  36. Tobii Technology AB. What Is Eye Tracking? Available online: https://www.tobii.com/learn-and-support/get-started/what-is-eye-tracking (accessed on 16 December 2023).
  37. Cop, U.; Dirix, N.; Drieghe, D.; Duyck, W. Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behav. Res. Methods 2017, 49, 602–615. [Google Scholar] [CrossRef]
  38. Gonzalez-Garduno, A.V.; Søgaard, A. Using gaze to predict text readability. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, Copenhagen, Denmark, 8 September 2017; pp. 438–443. [Google Scholar]
  39. Grabar, N.; Farce, E.; Sparrow, L. Study of readability of health documents with eye-tracking approaches. In Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), Tilburg, The Netherlands, 8 November 2018. [Google Scholar]
  40. Mathias, S.; Kanojia, D.; Patel, K.; Agarwal, S.; Mishra, A.; Bhattacharyya, P. Eyes are the windows to the soul: Predicting the rating of text quality using gaze behaviour. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2352–2362. [Google Scholar]
  41. Hermena, E.W.; Drieghe, D.; Hellmuth, S.; Liversedge, S.P. Processing of Arabic diacritical marks: Phonological–syntactic disambiguation of homographic verbs and visual crowding effects. J. Exp. Psychol. Hum. Percept. Perform. 2015, 41, 494–507. [Google Scholar] [CrossRef]
  42. Al-Samarraie, H.; Sarsam, S.M.; Alzahrani, A.I.; Alalwan, N. Reading text with and without diacritics alters brain activation: The case of Arabic. Curr. Psychol. 2020, 39, 1189–1198. [Google Scholar] [CrossRef]
  43. Nassiri, N.; Cavalli-Sforza, V.; Lakhouaja, A. Approaches, Methods, and Resources for Assessing the Readability of Arabic Texts. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 22, 95. [Google Scholar] [CrossRef]
  44. Paterson, K.B.; Almabruk, A.A.A.; McGowan, V.A.; White, S.J.; Jordan, T.R. Effects of word length on eye movement control: The evidence from Arabic. Psychon. Bull. Rev. 2015, 22, 1443–1450. [Google Scholar] [CrossRef]
  45. Alrabiah, M.; Alsalman, A.; Atwell, E. The design and construction of the 50 million words KSUCCA King Saud University Corpus of Classical Arabic. In Proceedings of the WACL’2 Second Workshop on Arabic Corpus Linguistics, Lancaster, UK, 22 July 2013; pp. 5–8. [Google Scholar]
  46. Alnefaie, R.; Azmi, A.M. Automatic minimal diacritization of Arabic texts. Procedia Comput. Sci. 2017, 117, 169–174. [Google Scholar] [CrossRef]
  47. El-Haj, M.; Kruschwitz, U.; Fox, C. Creating language resources for under-resourced languages: Methodologies, and experiments with Arabic. Lang. Resour. Eval. 2015, 49, 549–580. [Google Scholar] [CrossRef]
  48. Bouamor, H.; Zaghouani, W.; Diab, M.; Obeid, O.; Oflazer, K.; Ghoneim, M.; Hawwari, A. A pilot study on Arabic multi-genre corpus diacritization. In Proceedings of the Second Workshop on Arabic Natural Language Processing, Beijing, China, 30 July 2015; pp. 80–88. [Google Scholar]
  49. Al-Edaily, A.; Al-Wabil, A.; Al-Ohali, Y. Dyslexia Explorer: A Screening System for Learning Difficulties in the Arabic Language Using Eye Tracking. In Proceedings of the Human Factors in Computing and Informatics, Maribor, Slovenia, 1–3 July 2013; pp. 831–834. [Google Scholar]
  50. Al-Wabil, A.; Al-Sheaha, M. Towards an interactive screening program for developmental dyslexia: Eye movement analysis in reading Arabic texts. In Proceedings of the 12th International Conference on Computers Helping People with Special Needs, Vienna, Austria, 14–16 July 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 25–32. [Google Scholar]
  51. AlJassmi, M.A.; Hermena, E.W.; Paterson, K.B. Eye movements in Arabic reading. In Studies in Arabic Linguistics; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2021; Volume 10, pp. 85–108. [Google Scholar] [CrossRef]
  52. Hermena, E.W.; Bouamama, S.; Liversedge, S.P.; Drieghe, D. Does diacritics-based lexical disambiguation modulate word frequency, length, and predictability effects? An eye-movements investigation of processing Arabic diacritics. PLoS ONE 2021, 16, e0259987. [Google Scholar] [CrossRef]
  53. Roman, G.; Pavard, B. A comparative study: How we read in Arabic and French. In Eye Movements from Physiology to Cognition; Elsevier: Amsterdam, The Netherlands, 1987; pp. 431–440. [Google Scholar]
  54. Blanken, G.; Dorn, M.; Sinn, H. Inversion errors in Arabic number reading: Is there a nonsemantic route? Brain Cogn. 1997, 34, 404–423. [Google Scholar] [CrossRef]
  55. Naz, S.; Razzak, M.I.; Hayat, K.; Anwar, M.W.; Khan, S.Z. Challenges in baseline detection of Arabic script based languages. In Proceedings of the Intelligent Systems for Science and Information: Extended and Selected Results from the Science and Information Conference, London, UK, 7–9 October 2013; Springer: Cham, Switzerland, 2013; pp. 181–196. [Google Scholar]
  56. Al Jarrah, E.Q. Using Language Features to Enhance Measuring the Readability of Arabic Text. Master’s Thesis, Yarmouk University, Irbid, Jordan, 2017. [Google Scholar]
  57. Berrichi, S.; Nassiri, N.; Mazroui, A.; Lakhouaja, A. Impact of Feature Vectorization Methods on Arabic Text Readability Assessment. In Artificial Intelligence and Smart Environment (ICAISE 2022); Springer International Publishing: Cham, Switzerland, 2023; Volume 635, pp. 504–510. [Google Scholar]
  58. Flesch, R. A new readability yardstick. J. Appl. Psychol. 1948, 32, 221–233. [Google Scholar] [CrossRef]
  59. Gunning, R. The Technique of Clear Writing, 2nd ed.; McGraw-Hill Book Company: New York, NY, USA, 1968. [Google Scholar]
  60. Mc Laughlin, G.H. SMOG Grading—A New Readability Formula. J. Read. 1969, 12, 639–646. [Google Scholar]
  61. Coleman, M.; Liau, T.L. A computer readability formula designed for machine scoring. J. Appl. Psychol. 1975, 60, 283–284. [Google Scholar] [CrossRef]
  62. Kincaid, J.P.; Fishburne, R.P., Jr.; Rogers, R.L.; Chissom, B.S. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel; Naval Technical Training Command Millington TN Research Branch: Millington, TN, USA, 1975. [Google Scholar]
  63. Chall, J.S.; Dale, E. Readability Revisited: The New Dale-Chall Readability Formula; Brookline Books: Cambridge, MA, USA, 1995. [Google Scholar]
  64. El-Haj, M.; Rayson, P. OSMAN―A Novel Arabic Readability Metric. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 250–255. [Google Scholar]
  65. Cavalli-Sforza, V.; Saddiki, H.; Nassiri, N. Arabic Readability Research: Current State and Future Directions. Procedia Comput. Sci. 2018, 142, 38–49. [Google Scholar] [CrossRef]
  66. Dawood, B. The Relationship Between Readability and Selected Language Variables. Master’s Thesis, Baghdad University, Baghdad, Iraq, 1977. [Google Scholar]
  67. Al-Heeti, K.N. Judgment Analysis Technique Applied to Readability Prediction of Arabic Reading Material. Ph.D. Thesis, Northern Colorado University, Greeley, CO, USA, 1985. [Google Scholar]
  68. Daud, N.M.; Hassan, H.; Aziz, N.A. A corpus-based readability formula for estimate of Arabic texts reading difficulty. World Appl. Sci. J. 2013, 21, 168–173. [Google Scholar] [CrossRef]
  69. Ghani, K.A.; Noh, A.S.; Yusoff, N.M.R.N.; Hussein, N.H. Developing Readability Computational Formula for Arabic Reading Materials Among Non-native Students in Malaysia. In The Importance of New Technologies and Entrepreneurship in Business Development: In the Context of Economic Diversity in Developing Countries: The Impact of New Technologies and Entrepreneurship on Business Development; Springer: Cham, Switzerland, 2021; Volume 194, pp. 2041–2057. [Google Scholar]
  70. Mesgar, M.; Strube, M. A neural local coherence model for text quality assessment. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4328–4339. [Google Scholar]
  71. Vajjala, S.; Majumder, B.; Gupta, A.; Surana, H. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems; O’Reilly Media Inc.: Sebastopol, CA, USA, 2020. [Google Scholar]
  72. Chen, X.; Meurers, D. Characterizing text difficulty with word frequencies. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, San Diego, CA, USA, 16 June 2016; pp. 84–94. [Google Scholar]
  73. Rello, L. DysWebxia: A Text Accessibility Model for People with Dyslexia. Ph.D. Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2014. [Google Scholar]
  74. Azpiazu, I.M.; Pera, M.S. Multiattentive Recurrent Neural Network Architecture for Multilingual Readability Assessment. Trans. Assoc. Comput. Linguist. 2019, 7, 421–436. [Google Scholar] [CrossRef]
  75. Martinc, M.; Pollak, S.; Robnik-Šikonja, M. Supervised and unsupervised neural approaches to text readability. Comput. Linguist. 2021, 47, 141–179. [Google Scholar] [CrossRef]
  76. Oliveira, A.M.d.; Germano, G.D.; Capellini, S.A. Comparison of Reading Performance in Students with Developmental Dyslexia by Sex. Paidéia 2017, 27, 306–313. [Google Scholar] [CrossRef]
  77. Bessou, S.; Chenni, G. Efficient measuring of readability to improve documents accessibility for arabic language learners. J. Digit. Inf. Manag. 2021, 19, 75–82. [Google Scholar] [CrossRef]
  78. Marie-Sainte, S.L.; Alalyani, N.; Alotaibi, S.; Ghouzali, S.; Abunadi, I. Arabic Natural Language Processing and Machine Learning-Based Systems. IEEE Access 2018, 7, 7011–7020. [Google Scholar] [CrossRef]
  79. Shen, W.; Williams, J.; Marius, T.; Salesky, E. A language-independent approach to automatic text difficulty assessment for second-language learners. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, Sofia, Bulgaria, 8 August 2013; pp. 30–38. [Google Scholar]
  80. Salesky, E.; Shen, W. Exploiting Morphological, Grammatical, and Semantic Correlates for Improved Text Difficulty Assessment. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, Baltimore, MD, USA, 16 June 2014; pp. 155–162. [Google Scholar]
  81. Forsyth, J.N. Automatic Readability Detection for Modern Standard Arabic. Master’s Thesis, Brigham Young University, Provo, UT, USA, 2014. [Google Scholar]
  82. Cavalli-Sforza, V.; El Mezouar, M.; Saddiki, H. Matching an Arabic text to a learners’ curriculum. In Proceedings of the 2014 Fifth International Conference on Arabic Language Processing (CITALA 2014), Oujda, Morocco, 26–27 November 2014; pp. 79–88. [Google Scholar]
  83. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Arabic L2 readability assessment: Dimensionality reduction study. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 3789–3799. [Google Scholar] [CrossRef]
  84. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Arabic Readability Assessment for Foreign Language Learners; Springer International Publishing: Cham, Switzerland, 2018; Volume 10859, pp. 480–488. [Google Scholar]
  85. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Combining Classical and Non-classical Features to Improve Readability Measures for Arabic First Language Texts. In International Conference on Advanced Intelligent Systems for Sustainable Development; Springer: Cham, Switzerland, 2020; pp. 463–470. [Google Scholar]
  86. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Evaluating the Impact of Oversampling on Arabic L1 and L2 Readability Prediction Performances. In Networking, Intelligent Systems and Security; Springer International Publishing: Singapore, 2022; Volume 237, pp. 763–774. [Google Scholar]
  87. Al Aqeel, S.; Abanmy, N.; Aldayel, A.; Al-Khalifa, H.; Al-Yahya, M.; Diab, M. Readability of written medicine information materials in Arabic language: Expert and consumer evaluation. BMC Health Serv. Res. 2018, 18, 139. [Google Scholar] [CrossRef]
  88. Khallaf, N.; Sharoff, S. Automatic Difficulty Classification of Arabic Sentences. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP), Virtual, Kyiv, Ukraine, 19 April 2021; pp. 105–114. [Google Scholar]
  89. Berrichi, S.; Nassiri, N.; Mazroui, A.; Lakhouaja, A. Interpreting the Relevance of Readability Prediction Features. Jordanian J. Comput. Inf. Technol. 2023, 9, 36–52. [Google Scholar] [CrossRef]
  90. Ouassil, M.A.; Jebbari, M.; Rachidi, R.; Errami, M.; Cherradi, B.; Raihani, A. Enhancing Arabic Text Readability Assessment: A Combined BERT and BiLSTM Approach. In Proceedings of the 2024 International Conference on Circuit, Systems and Communication (ICCSC), Fes, Morocco, 28–29 June 2024; pp. 1–7. [Google Scholar]
  91. Saddiki, H.; Habash, N.; Cavalli-Sforza, V.; Al Khalil, M. Feature optimization for predicting readability of Arabic L1 and L2. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, Melbourne, Australia, 19 July 2018; pp. 20–29. [Google Scholar]
  92. Mishra, A.; Bhattacharyya, P. Scanpath Complexity: Modeling Reading Effort Using Gaze Information. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 77–98. [Google Scholar]
  93. Mathias, S.; Murthy, R.; Kanojia, D.; Mishra, A.; Bhattacharyya, P. Happy are those who grade without seeing: A multi-task learning approach to grade essays using gaze behaviour. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; pp. 858–872. [Google Scholar]
  94. Mathias, S.; Murthy, R.; Kanojia, D.; Bhattacharyya, P. Cognitively aided zero-shot automatic essay grading. In Proceedings of the 17th International Conference on Natural Language Processing (ICON), Patna, India, 18–21 December 2021; pp. 175–180. [Google Scholar]
  95. Mézière, D.C.; Yu, L.; McArthur, G.; Reichle, E.D.; von der Malsburg, T. Scanpath regularity as an index of Reading comprehension. Sci. Stud. Read. 2024, 28, 79–100. [Google Scholar] [CrossRef]
  96. Nicula, B.; Panaite, M.; Arner, T.; Balyan, R.; Dascalu, M.; McNamara, D.S. Automated Assessment of Comprehension Strategies from Self-explanations Using Transformers and Multi-task Learning. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky; Springer Nature: Cham, Switzerland, 2023; pp. 695–700. [Google Scholar]
  97. Hollenstein, N. Leveraging Cognitive Processing Signals for Natural Language Understanding; ETH Zurich: Zürich, Switzerland, 2021. [Google Scholar]
  98. Hollenstein, N.; Barrett, M.; Troendle, M.; Bigiolli, F.; Langer, N.; Zhang, C. Advancing NLP with cognitive language processing signals. arXiv 2019, arXiv:1904.02682. [Google Scholar] [CrossRef]
  99. Sood, E.; Tannert, S.; Müller, P.; Bulling, A. Improving natural language processing tasks with human gaze-guided neural attention. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online, 6–12 December 2020; Volume 33, pp. 6327–6341. [Google Scholar]
  100. Barrett, M. Improving Natural Language Processing with Human Data: Eye Tracking and Other Data Sources Reflecting Cognitive Text Processing. Ph.D. Thesis, University of Copenhagen, Copenhagen, Denmark, 2018. [Google Scholar]
  101. Leal, S.E.; Vieira, J.M.M.; dos Santos Rodrigues, E.; Teixeira, E.N.; Aluísio, S. Using eye-tracking data to predict the readability of Brazilian Portuguese sentences in single-task, multi-task and sequential transfer learning approaches. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 5821–5831. [Google Scholar]
  102. Clifton, C., Jr.; Staub, A. Syntactic influences on eye movements during reading. In The Oxford Handbook of Eye Movements; No. 2; Oxford University Press: Oxford, UK, 2011; Volume 3, pp. 895–910. [Google Scholar]
  103. Liversedge, S.P.; Paterson, K.B.; Pickering, M.J. Chapter 3—Eye Movements and Measures of Reading Time. In Eye Guidance in Reading and Scene Perception; Underwood, G., Ed.; Elsevier Science Ltd.: Amsterdam, The Netherlands, 1998; pp. 55–75. [Google Scholar] [CrossRef]
  104. Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372–422. [Google Scholar] [CrossRef]
  105. Wiechmann, D.; Qiao, Y.; Kerz, E.; Mattern, J. Measuring the impact of (psycho-) linguistic and readability features and their spill over effects on the prediction of eye movement patterns. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 5276–5290. [Google Scholar]
  106. Ibañez, M.; Reyes, L.L.A.; Sapinit, R.; Hussien, M.A.; Imperial, J.M. On Applicability of Neural Language Models for Readability Assessment in Filipino. In Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners’ and Doctoral Consortium, Proceedings of the 23rd International Conference, AIED 2022, Durham, UK, 27–31 July 2022; Proceedings, Part II; Springer: Cham, Switzerland, 2022; pp. 573–576. [Google Scholar]
  107. Howcroft, D.M.; Demberg, V. Psycholinguistic models of sentence processing improve sentence readability ranking. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017; Volume 1, pp. 958–968. [Google Scholar]
  108. Yancey, K.; Pintard, A.; Francois, T. Investigating readability of french as a foreign language with deep learning and cognitive and pedagogical features. Lingue Linguaggio 2021, 20, 229–258. [Google Scholar] [CrossRef]
  109. Olukoga, T.A.; Feng, Y. A Case Study on the Classification of Lost Circulation Events During Drilling using Machine Learning Techniques on an Imbalanced Large Dataset. arXiv 2022, arXiv:2209.01607. [Google Scholar] [CrossRef]
  110. Baazeem, I.; Al-Khalifa, H.; Al-Salman, A. AraEyebility: Eye-Tracking Data for Arabic Text Readability. Computation 2025, 13, 108. [Google Scholar] [CrossRef]
  111. Scikit-Learn. Scikit-Learn Machine Learning in Python. Available online: https://scikit-learn.org/stable/ (accessed on 5 October 2023).
  112. Li, D.; Kanoulas, E. Bayesian Optimization for Optimizing Retrieval Systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, Marina Del Rey, CA, USA, 5–9 February 2018; pp. 360–368. [Google Scholar] [CrossRef]
  113. Bzdok, D.; Krzywinski, M.; Altman, N. Machine learning: Supervised methods. Nat. Methods 2018, 15, 5–6. [Google Scholar] [CrossRef]
  114. Aldayel, A.; Al-Khalifa, H.; Alaqeel, S.; Abanmy, N.; Al-Yahya, M.; Diab, M. ARC-WMI: Towards Building Arabic Readability Corpus for Written Medicine Information. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, Miyazaki, Japan, 8 May 2018; p. 14. [Google Scholar]
  115. Abellán, J.; Mantas, C.J.; Castellano, J.G.; Moral-García, S. Increasing diversity in random forest learning algorithm via imprecise probabilities. Expert Syst. Appl. 2018, 97, 228–243. [Google Scholar] [CrossRef]
  116. Brownlee, J. Repeated k-Fold Cross-Validation for Model Evaluation in Python. Guiding Tech Media. Available online: https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/ (accessed on 21 December 2023).
  117. Bengfort, B.; Bilbro, R.; Ojeda, T. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning; O’Reilly Media Inc.: Sebastopol, CA, USA, 2018. [Google Scholar]
  118. Priyadarshini, I.; Cotton, C. A novel LSTM–CNN–grid search-based deep neural network for sentiment analysis. J. Supercomput. 2021, 77, 13911–13932. [Google Scholar] [CrossRef] [PubMed]
  119. Hedhli, M.; Kboubi, F. CNN-BiLSTM Model for Arabic Dialect Identification. In Advances in Computational Collective Intelligence; Nguyen, N.T., Botzheim, J., Gulyás, L., Nunez, M., Treur, J., Vossen, G., Kozierkiewicz, A., Eds.; Springer: Cham, Switzerland, 2023; pp. 213–225. [Google Scholar]
  120. Rahman, S.S.M.M.; Biplob, K.B.M.B.; Rahman, M.H.; Sarker, K.; Islam, T. An Investigation and Evaluation of N-Gram, TF-IDF and Ensemble Methods in Sentiment Classification. In Cyber Security and Computer Science (ICONCS); Springer: Cham, Switzerland, 2020; Volume 325, pp. 391–402. [Google Scholar]
  121. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
Figure 1. Adopted methodology with handcrafted features.
Figure 2. Validation scores vs. number of repetitions for ML models.
Figure 3. Adopted methodology with TF-IDF and handcrafted features.
Figure 4. Adopted methodology with DL models.
Figure 5. Validation scores vs. number of repetitions for DL models.
Table 1. Variations of the handcrafted features.

Feature Variation | Count | Description
Text-based | 69 | Features extracted to represent the selected texts and their linguistic complexity.
Gaze-based | 29 | Features extracted during the eye-tracking experiment, based on widely used metrics that reflect cognitive processing and comprehension (e.g., reading duration); includes both reading metrics and experimental-condition metrics.
Gaze-based (Reading) | 23 | Eye-tracking features restricted to reading metrics, deliberately excluding experimental-condition metrics to isolate the independent impact of reading behavior.
All Features | 98 | A combination of text-based and gaze-based features.
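For illustration, the "All Features" variation is simply a column-wise concatenation of the two feature matrices. A minimal NumPy sketch with placeholder data (array names and shapes here are hypothetical, not taken from the paper's code):

```python
import numpy as np

# Hypothetical precomputed feature matrices, one row per text:
text_feats = np.random.rand(100, 69)   # placeholder text-based (linguistic) features
gaze_feats = np.random.rand(100, 29)   # placeholder gaze-based features

# "All Features" variation: column-wise concatenation -> (n_texts, 98)
all_feats = np.hstack([text_feats, gaze_feats])
print(all_feats.shape)  # (100, 98)
```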
Table 2. Top-performing models, both basic and ensemble, with feature variations.

Feature Variation | F1 Score (%) | Model
Text | 74 | RF
Gaze (Reading) | 76 | Bagging
Gaze | 77 | Bagging
All Features | 80 | AdaBoost
Table 3. Performance of the best models and feature variations.

Model | Feature Variation | F1 Score (%)
KNN | Gaze (Reading), Gaze, All Features | 68
SVM | All Features | 74
LR | All Features | 75
MNB | Gaze (Reading) | 68
Bagging | Gaze | 77
RF | Text, Gaze | 74
AdaBoost | All Features | 80
XGBoost | All Features | 76
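As a sketch of how such scores can be obtained, the following evaluates scikit-learn's AdaBoostClassifier on the combined feature matrix with repeated stratified cross-validation and a weighted F1 score, in the spirit of the repetition analysis in Figure 2. The estimator count, fold/repetition counts, and placeholder data are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data: 98 combined features, 3 readability classes (assumed)
X = np.random.rand(100, 98)
y = np.random.randint(0, 3, size=100)

model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Repeated stratified k-fold; averaging over repetitions stabilizes the estimate
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, scoring="f1_weighted", cv=cv)
print(f"Mean weighted F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```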
Table 4. Variations in handcrafted features with TF-IDF.

Feature Variation | Description
TF-IDF | Features representing the texts using TF-IDF vectorization alone.
TF-IDF + Text | The concatenation of the TF-IDF vectors and the text-based handcrafted features.
TF-IDF + Gaze (Reading) | The concatenation of the TF-IDF vectors and the eye-tracking reading metrics, referred to as "gaze-based (reading)" handcrafted features.
TF-IDF + Gaze | The concatenation of the TF-IDF vectors and the gaze-based handcrafted features.
TF-IDF + All Features | The concatenation of the TF-IDF vectors and all text-based and gaze-based handcrafted features.
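A minimal sketch of the "TF-IDF + All Features" concatenation, assuming scikit-learn's TfidfVectorizer for the sparse text vectors; scipy.sparse.hstack avoids densifying the TF-IDF matrix. The texts and handcrafted matrix are placeholders:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first sample document", "second sample document"]   # placeholder texts
handcrafted = np.random.rand(len(texts), 98)                  # placeholder text- + gaze-based features

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)          # sparse matrix: (n_texts, vocab_size)

# Concatenate sparse TF-IDF vectors with dense handcrafted features column-wise
X_combined = hstack([X_tfidf, csr_matrix(handcrafted)])
```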
Table 5. Top-performing models, both basic and ensemble, with TF-IDF and feature variations.

Feature Variation | F1 Score (%) | Model
TF-IDF + Text | 72 | SVM
TF-IDF + Gaze (Reading) | 69 | SVM, XGBoost
TF-IDF + Gaze | 69 | KNN
TF-IDF + All Features | 73 | SVM
Table 6. Performance of the best models with TF-IDF and feature variations.

Model | Feature Variation | F1 Score (%)
KNN | TF-IDF + Gaze | 69
SVM | TF-IDF + All Features | 73
LR | TF-IDF + All Features | 71
MNB | TF-IDF + Text | 47
Bagging | TF-IDF + All Features | 62
RF | TF-IDF + All Features | 50
AdaBoost | TF-IDF + Gaze (Reading) | 61
XGBoost | TF-IDF + Text, TF-IDF + Gaze (Reading), TF-IDF + All Features | 69
Table 7. Common hyperparameters and their values.

Hyperparameter | Options
Batch size | 16, 32, 64, 128
Epoch count | 50, 100, 200
Optimizer | Adam, Nadam, RMSprop, SGD
Learning rate | 0.1, 0.01, 0.001, 0.0001
Dense units | 32 to 512, step 32
Dense dropout | 0.0, 0.1, 0.2, 0.5
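These options map naturally onto a KerasTuner search space (https://keras.io/keras_tuner/). A minimal sketch assuming a simple dense classification head; the model body, class count, and tuner settings are illustrative, not the paper's exact code:

```python
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    model = keras.Sequential([
        keras.layers.Input(shape=(98,)),  # assumed: 98 combined features
        keras.layers.Dense(
            hp.Int("dense_units", min_value=32, max_value=512, step=32),
            activation="relu"),
        keras.layers.Dropout(hp.Choice("dense_dropout", [0.0, 0.1, 0.2, 0.5])),
        keras.layers.Dense(3, activation="softmax"),  # assumed 3 readability classes
    ])
    lr = hp.Choice("learning_rate", [0.1, 0.01, 0.001, 0.0001])
    name = hp.Choice("optimizer", ["adam", "nadam", "rmsprop", "sgd"])
    optimizer = {"adam": keras.optimizers.Adam,
                 "nadam": keras.optimizers.Nadam,
                 "rmsprop": keras.optimizers.RMSprop,
                 "sgd": keras.optimizers.SGD}[name](learning_rate=lr)
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20)
# Batch size and epoch count can be searched by overriding HyperModel.fit,
# or passed directly to tuner.search(x, y, batch_size=..., epochs=...).
```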
Table 8. Feature variations with fastText word embeddings.

Feature Variation | Description
Text | The concatenation of the vectors obtained from fastText and the text-based handcrafted features.
Gaze (Reading) | The concatenation of the vectors obtained from fastText and the eye-tracking reading metrics, referred to as gaze-based (reading) handcrafted features.
Gaze | The concatenation of the vectors obtained from fastText and the gaze-based handcrafted features.
All Features | The concatenation of the vectors obtained from fastText and all text-based and gaze-based handcrafted features.
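One simple way to realize these variations is to represent each text with a fastText sentence vector and concatenate it with the handcrafted features; sequence models such as those in the next tables would instead consume token-level embeddings through an embedding layer. A sketch assuming the official fasttext package and the public pretrained Arabic model cc.ar.300 (texts and gaze features are placeholders):

```python
import fasttext
import fasttext.util
import numpy as np

# Download/load pretrained Arabic fastText vectors (cc.ar.300, 300 dimensions)
fasttext.util.download_model("ar", if_exists="ignore")
ft = fasttext.load_model("cc.ar.300.bin")

texts = ["نص قصير وسهل", "نص أطول وأكثر تعقيدًا"]                 # placeholder Arabic texts
embeddings = np.vstack([ft.get_sentence_vector(t) for t in texts])  # (n_texts, 300)

gaze_feats = np.random.rand(len(texts), 29)    # placeholder gaze-based features

# "Gaze" variation: concatenate fastText vectors with gaze-based features
X_gaze = np.hstack([embeddings, gaze_feats])   # (n_texts, 329)
```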
Table 9. Top-performing DL models with each feature variation.

Feature Variation | F1 Score (%) | Model
Text | 77 | CNN
Gaze (Reading) | 72 | CNN
Gaze | 73 | CNN
All Features | 74 | GRU
Table 10. Performance of the best DL models and feature variations.

Model | Feature Variation | F1 Score (%)
LSTM | Gaze (Reading), All Features | 69
BiLSTM | All Features | 72
GRU | All Features | 74
CNN | Text | 77
CNN-LSTM | All Features | 72
CNN-GRU | All Features | 73
CNN-BiLSTM | Gaze (Reading) | 71
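As an architectural sketch of how a CNN can combine an embedded word sequence with a handcrafted feature vector in Keras: the two-input design, layer sizes, and class count below are illustrative assumptions rather than the paper's exact architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMB_DIM, N_FEATS, N_CLASSES = 20000, 200, 300, 98, 3  # assumed

# Input 1: token-ID sequences through an embedding layer
# (which could be initialized with fastText vectors)
seq_in = keras.Input(shape=(SEQ_LEN,), name="tokens")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(seq_in)
x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)

# Input 2: handcrafted linguistic + gaze features
feat_in = keras.Input(shape=(N_FEATS,), name="features")

# Merge both views and classify into readability levels
merged = layers.concatenate([x, feat_in])
merged = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(N_CLASSES, activation="softmax")(merged)

model = keras.Model(inputs=[seq_in, feat_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```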
