Article

AraEyebility: Eye-Tracking Data for Arabic Text Readability

by Ibtehal Baazeem 1,2,*, Hend Al-Khalifa 2 and Abdulmalik Al-Salman 2
1 Artificial Intelligence and Robotics Institute, King Abdulaziz City for Science and Technology, Riyadh 13523, Saudi Arabia
2 College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
* Author to whom correspondence should be addressed.
Computation 2025, 13(5), 108; https://doi.org/10.3390/computation13050108
Submission received: 5 March 2025 / Revised: 16 April 2025 / Accepted: 28 April 2025 / Published: 5 May 2025

Abstract:
Assessing text readability is important for helping language learners and readers select texts that match their proficiency levels. Research in cognitive psychology has shown that behavioral data, such as eye-tracking and electroencephalogram signals, are effective in detecting the cognitive activities that correlate with text difficulty during reading. However, Arabic, with its distinctive linguistic characteristics, presents unique challenges for readability assessment based on cognitive data. While behavioral data have been employed in readability assessments, their full potential, particularly in Arabic contexts, remains underexplored. This paper presents the development of the first Arabic eye-tracking corpus, comprising eye movement data collected from Arabic-speaking participants over a total of 57,617 words. The corpus can subsequently be used to evaluate a broad spectrum of text-based and gaze-based features with machine learning and deep learning methods, improving Arabic readability assessment by integrating cognitive data into the assessment process.

1. Introduction

Reading is essential for acquiring knowledge and communicating; it involves processes such as word encoding and meaning assignment [1,2], and it is crucial for academic success and for accessing information across disciplines [3,4,5]. Readability, as defined by Dale and Chall [6], encompasses the elements that affect a reader’s understanding, reading speed, and interest. It depends on reader-based features, such as background knowledge and motivation, as well as text-based features, such as content and syntax [2,7,8,9]. Work on predicting readability has therefore shifted the focus toward automated text analysis [8,9], since traditional methods such as cloze tests and expert judgment are costly and subjective, highlighting the need for technological tools [8]. Moreover, current models rely on expert judgments rather than on readers’ cognitive processing, and most Arabic readability research relies on simple features that do not reflect true readability [10]. Although Arabic readability corpora exist, they focus on Arabic first-language (L1) and second-language (L2) learners, with limited corpora for assessing readability for the general Arabic-speaking public [6,11,12,13,14]. Because physiological data such as electroencephalograms (EEGs) and eye tracking can improve readability models by predicting human processing effort during reading [15,16,17], this paper aims to fill this gap by establishing an Arabic human reading corpus to support subsequent modeling of Arabic text readability [18].
The establishment of an open-source Arabic human reading corpus, parallel to those in other languages such as English [18,19], French [18], and German [12], with graded gold-standard texts annotated by both humans and computers using eye tracking, is expected to open the door to the broader use of eye tracking in research on improving Arabic readability assessment and in many Arabic natural language processing (NLP) tasks. The collected data might also support the development of theories about natural reading behavior through eye movement analysis [20]. For example, such a corpus can help enhance machine learning (ML) models for diverse NLP tasks, such as information extraction, sentiment analysis, text quality assessment, and text simplification [21,22].

2. Background

This section provides an overview of the Arabic language and outlines the principles of eye-tracking technology, emphasizing its relevance to reading generally and its connection to Arabic specifically.

2.1. Arabic Language

Arabic is a Semitic language with a right-to-left cursive script consisting of twenty-eight consonantal letters without upper- or lower-case distinctions [23]. Its letters change shape depending on their position within a word and on the letters that precede and follow them [24,25]. Arabic comprises Classical Arabic (CA), Modern Standard Arabic (MSA), and various dialects. CA, the formal language from the pre-Islamic era to the early eleventh century CE, is the foundation of MSA, which is used in modern media and literature and is easier to read. Dialects are regional spoken varieties used in informal communication [26,27,28]. Arabic script uses diacritics as short-vowel and pronunciation guides, such as fat’ha, kasra, damma, shadda, and tanween [27]. The addition of diacritics, called diacritization, clarifies pronunciation and meaning, resolving lexical ambiguity in undiacritized text. For example, the undiacritized word “علم” can mean flag (عَلَم), knowledge (عِلْم), or teach (عَلّم), depending on its pronunciation [29]. Diacritization can be full, partial, or absent. Full diacritization adds diacritics to every letter, enhancing clarity but slowing reading due to visual noise. Partial diacritization marks selected letters, improving readability for less experienced readers without much visual noise, but it is subjective and time-consuming. Absent diacritization relies on the reader’s knowledge and on context for correct pronunciation and meaning, which is challenging for less experienced readers, especially with ambiguous words and homographs [23,24,28,30,31,32,33].
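The three diacritization states can be detected mechanically because Arabic diacritics occupy a dedicated Unicode range. The following sketch illustrates this; the ratio heuristic and its interpretation thresholds are our own illustrative assumptions, not part of the study:

```python
import re

# Arabic diacritical marks (harakat): the tanween forms, fatha, damma,
# kasra, shadda, and sukun occupy Unicode code points U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Return the undiacritized form of an Arabic string."""
    return DIACRITICS.sub("", text)

def diacritization_ratio(text: str) -> float:
    """Rough degree of diacritization: diacritic marks per base Arabic
    letter. Values near 1.0 suggest full diacritization, 0.0 absent,
    and values in between partial diacritization."""
    letters = [c for c in text if "\u0621" <= c <= "\u064A"]
    marks = DIACRITICS.findall(text)
    return len(marks) / len(letters) if letters else 0.0
```

Applying `strip_diacritics` to the fully diacritized عَلَم yields the bare علم, which collapses the three distinct readings mentioned above into one ambiguous form.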

2.2. Eye Tracking

When surveying a scene, the human eye produces varied movement patterns, commonly referred to as the scan path [34]. Eye tracking is the real-time estimation of the eye’s position, to determine exactly where a person is looking (the point of gaze), how the eye moves, and for how long [34,35]. Interfacing with computers through eye tracking involves analyzing the specific region of the user’s focus, a concept dating back to the 18th century [35]. In recent years, eye tracking has gained significant popularity in assessing human visual and cognitive processes [36]. By providing insights into visual attention and user behavior, this technology is widely used today in research and practical applications in psychology, education, marketing, gaming, and user-interface design [34].

2.2.1. Eye Tracking and Reading

According to [15], there is a significant correlation between readers’ eye movements and their cognitive processing of texts. This is linked to Just and Carpenter’s eye–mind hypothesis, which posits that “there is no appreciable lag between what is fixated and what is processed” [37]. Eye tracking thus allows natural reading to be observed with minimal intervention, providing insights into language processing and serving as an alternative to traditional assessments such as cloze tests and think-aloud protocols [20,36]. Studies measure saccades (rapid movements), fixations (stops), and regressions (backward movements): longer saccades suggest easier text, shorter saccades indicate greater effort, brief fixations imply readability, longer fixations suggest difficulty, and regressions indicate comprehension challenges [13,17,36,38,39]. Fixation and saccade patterns reveal information about both the reader and the text [36,40], so eye movement data elucidate the cognitive processes involved in reading [17]. Although it is resource-intensive, eye tracking offers a natural way to study reading [13,14].
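The measures just described can be derived directly from a chronological fixation sequence. The following is a minimal sketch, assuming fixations are given as (word position, duration) pairs in reading order, which makes the computation independent of script direction:

```python
def reading_measures(fixations):
    """Summarize a fixation sequence with the measures described above.
    `fixations` is a chronological list of (word_position, duration_ms)
    pairs, where word_position counts words in reading order. Saccade
    length is approximated as the jump in word positions between
    consecutive fixations; a negative jump counts as a regression."""
    durations = [d for _, d in fixations]
    jumps = [b - a for (a, _), (b, _) in zip(fixations, fixations[1:])]
    return {
        "n_fixations": len(fixations),
        "mean_fixation_ms": sum(durations) / len(durations) if durations else 0.0,
        "mean_saccade_words": sum(abs(j) for j in jumps) / len(jumps) if jumps else 0.0,
        "n_regressions": sum(1 for j in jumps if j < 0),
    }
```

For example, the sequence word 0, word 1, word 3, word 2 contains one regression (the final backward jump), and longer mean saccades or shorter mean fixations would, per the studies cited above, point to easier text.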

2.2.2. Eye-Tracking Visualization

Eye-tracking visualizations offer detailed data from large groups, confirming hypotheses, detecting patterns, and identifying attention-grabbing or ignored elements. They align individual feedback with group metrics, providing deep insights into human behavior [41,42]. Figure 1 shows visualizations of the results from an eye tracker as participants read two Arabic texts of varying lengths.
Gaze plots show where and for how long a person’s attention is focused, the order of focus, and the reading direction (forward or backward) [43]. Circles (fixation bubbles) represent eye fixations, with size indicating fixation time, and arrows show saccades, with length indicating the scanning path [44]. Fixations are numbered [44,45]. In Arabic, which is read from right to left, less readable text results in shorter saccades, more fixations, and regressions, indicating reading difficulty [17]. Heatmaps visualize eye fixations by aggregating fixation counts or durations from all participants and highlighting the areas receiving the most attention. A color gradient indicates fixation frequency and duration: red for the most or longest fixations, green for the fewest, and intermediate colors for other levels [41].
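The aggregation step behind a duration-based heatmap can be sketched as follows; the grid resolution and screen-coordinate convention here are illustrative assumptions:

```python
def fixation_heatmap(samples, width, height, bins=(4, 4)):
    """Aggregate (x, y, duration_ms) fixation samples from all
    participants into a coarse grid of total fixation duration per
    cell. A renderer would then map the largest totals to red and the
    smallest to green, as described above."""
    bx, by = bins
    grid = [[0.0] * bx for _ in range(by)]
    for x, y, duration in samples:
        col = min(int(x / width * bx), bx - 1)   # clamp edge samples
        row = min(int(y / height * by), by - 1)
        grid[row][col] += duration
    return grid
```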

2.2.3. Eye Tracking and Arabic Language

Arabic presents unique reading challenges compared to other languages. Research indicates that reading Arabic requires stronger visual–spatial processing abilities because of its distinct cognitive demands [45,46]. In contrast to Latin-script languages with non-cursive writing, eye movement control in Arabic reading has been explored in only a few studies [30]. Al-Edaily et al. [47] note that Arabic’s complexity (including orthographic directionality, its cursive nature, letter size and form changes, the use of dots, orthographic ambiguity, and morphological richness) increases its visual–spatial processing demands [25]. These nuances call for investigating, via eye tracking, the specific attributes of Arabic that influence reading. Below are some similarities and differences in eye movements when reading Arabic compared to other languages:
  • Arabic reading’s informational density makes it more time-intensive than Latin languages, making word identification more challenging [30].
  • The direction of reading influences the perceptual span’s asymmetrical extension. In Arabic, this focus area extends more to the left, while in Latin languages, it extends more to the right, impacting visual processing during reading [30].
  • The length and familiarity of words affect eye movement decisions in both Arabic and Latin-script languages. However, word skipping is less pronounced in Arabic, with only small differences in skipping rates between low- and high-frequency words [30].
  • Studies suggest that words in Semitic languages are best understood when the focus is placed on the middle, unlike Latin languages, for which the focus should be placed on the beginning–middle. This difference is due to the morphological structure indicating core meaning [30].
  • Diacritics in Arabic text aid in disambiguation and enhance accuracy and attention but can be perceived as visual noise, leading to longer fixation durations. Experienced readers can rely on context and language knowledge without diacritics [24,28,29,30,31].
  • Arabic writing is more complex due to the mandatory dots above or below many letters, unlike Latin languages, for which only two lowercase letters have dots [25].
  • Numbers in Arabic are read from left to right, unlike text, which is read from right to left. This can cause inversion errors during reading [48].
  • Arabic text is more challenging to read than Latin text due to its cursive nature, context-dependent characters, diverse writing styles, and unique letter positioning, such as in the word “محمد” (Muhammad), for which some letters can be placed above others [49].

3. Literature Review

This section reviews eye-tracking studies in reading across multiple languages. It also examines relevant Arabic corpora, along with cognitive corpora in other languages, emphasizing the usefulness and significance of these data in clarifying the intricate relationship between eye movements and human reading.

3.1. Eye Tracking in Reading Studies

Eye-tracking studies reveal that reading involves sequences of fixations and saccades, with some words fixated on more than once and others skipped [50]. Just and Carpenter [37] explored eye movement patterns and cognitive loads in reading scientific papers, assuming that the eye lingers on each word for as long as that word is being processed. Frazier and Rayner [51] suggested that sentence misinterpretation is corrected through definable strategies, correlating fixation locations and durations with text processing. Rayner et al. [52] found that text difficulty increases fixation number and duration. Liversedge et al. [53] added regression path duration and rereading time to the key reading-time metrics, which include first fixation duration, first-pass reading time, and total reading time, in order to better understand eye movements related to textual difficulty. Advanced eye movement recording systems have enabled high-quality recordings, aiding linguistic and psycholinguistic studies [36,38,54,55,56,57,58].
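The word-level metrics named above are conventionally computed per target word from the chronological fixation record. A minimal sketch, again assuming fixations as (word position, duration) pairs:

```python
def word_measures(fixations, word):
    """First fixation duration, first-pass reading time, and total
    reading time for one target word, given chronologically ordered
    (word_position, duration_ms) fixations."""
    hits = [i for i, (w, _) in enumerate(fixations) if w == word]
    if not hits:
        return {"first_fixation": 0.0, "first_pass": 0.0, "total_time": 0.0}
    # First pass: consecutive fixations on the word, counted until the
    # eyes first leave it.
    first_pass = 0.0
    for w, d in fixations[hits[0]:]:
        if w != word:
            break
        first_pass += d
    return {
        "first_fixation": fixations[hits[0]][1],
        "first_pass": first_pass,
        "total_time": sum(d for w, d in fixations if w == word),
    }
```

In a trace such as word 0, word 1, word 1, word 2, word 1, the gap between first-pass reading time and total reading time on word 1 reflects rereading after a regression, the kind of late measure Liversedge et al. emphasized.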
Arabic reading, though crucial for many, has been less studied in terms of cognitive processes [59]. Al-Wabil and Al-Sheaha [45] and Al-Edaily et al. [46,47] have examined eye movements in Arabic readers with learning difficulties like dyslexia. AlJassmi et al. [30] reviewed eye movements in Arabic reading, noting limited understanding compared to European languages and identifying future research questions.
Additionally, visual distinctions between diacritized and undiacritized Arabic texts were investigated by Roman and Pavard [31], Bensoltana and Asselah [60], Hermena et al. [23,29], Maroun and Hanley [61], and Awadh et al. [62]. Hallberg [63] analyzed how case markers are processed in Arabic. Furthermore, some studies have investigated the effect of word spacing. For example, Leung et al. [64] investigated the difference between reading spaced and unspaced Arabic texts. Word characteristics, such as length, were investigated by Paterson et al. [25], predictability by Hermena et al. [29], and morphology by Khateb et al. [65] and Hermena et al. [66]. Additionally, Al-Khalefah and Al-Khalifa [67] studied spelling pattern consistency in elementary students’ recognition of English and Arabic words in Saudi Arabia. Moreover, Hermena and Reichle [68] emphasized re-evaluating reading models designed for European languages to better understand Arabic reading. Lahoud et al. [69] examined factors influencing eye movements in skilled Arabic readers, balancing language-specific and universal factors.

3.2. Corpora

This section reviews the existing Arabic corpora for the readability assessment task and some existing eye-tracking corpora for various languages, which provide a general understanding of the primary characteristics that must be considered when crafting a new Arabic eye-tracking corpus.

3.2.1. Arabic Readability Corpora

For Arabic, there are limited corpora available for readability assessments that target Arabic L1 and L2 readers, and these corpora are mainly in the educational domain [70]. Research into readability evaluation in other fields is scarce. A comparison of Arabic L1 and L2 readability studies revealed that many L2 studies drew from similar sources, such as the Global Language Online Support System (GLOSS) platform. In contrast, L1 studies tended to use unique corpora and, consequently, unique annotation scales, which led to different results from those of the L2 studies. This difference could be attributed to the generally larger size of the corpora used in the L2 studies. Nonetheless, it was noted that having a larger corpus size does not necessarily lead to significantly improved accuracy. Therefore, data are probably not the main reason for the observed differences in the outcomes of Arabic readability studies [59]. In terms of clarity and accessibility, [71] noted that L1 corpora are not as transparently detailed or publicly available as L2 corpora, such as Aljazeera. Arabic readability corpora include those used to formulate Arabic readability formulas, such as those in [72,73]; corpora used to design tools for, and evaluate the readability of, health-related Arabic information [74,75]; and corpora used in data-driven studies, such as [71], which was used in [76]. Details on specific Arabic corpora used for Arabic readability assessment tasks are provided in Appendix A.

3.2.2. Eye-Tracking Corpora

Extensive efforts have been made to model and understand human reading patterns. However, research often overlooks how general text properties affect eye movements in real-world reading, focusing instead on a few linguistic features. Recently, new datasets tracking eye movements across continuous text have been developed, increasing interest in naturalistic eye-tracking data. These datasets are crucial for setting benchmarks for eye movements, testing models like E-Z Reader and SWIFT, and assessing psycholinguistic theories [50,77]. Studies [15,19] have surveyed existing eye-tracking datasets in various languages, mostly in English, but also in Chinese, Dutch, German, Persian, and Russian. Some studies have recorded eye movements during normal reading, while others have documented eye movements during various NLP tasks. While some datasets are publicly accessible, others require permission for use [15].
One pioneering dataset is the Dundee Corpus created by Kennedy et al. [18] in 2003. It examines how peripheral vision influences the time spent focusing on the current word (parafoveal-on-foveal effects). The Dundee Corpus contains 56,212 words and 9776 unique word types for English, as well as 52,173 words and 11,321 unique word types for French, from newspaper articles read by ten individuals in each language [14,58]. This corpus provides substantial insights into reading under naturalistic conditions.
Another significant dataset is the Ghent Eye-Tracking Corpus (GECO) developed by Cop et al. [20] in 2017. It includes eye-tracking data from monolingual and bilingual readers navigating Agatha Christie’s The Mysterious Affair at Styles in English and Dutch. The dataset includes 54,364 words and 5012 unique word types for English, as well as 59,716 words and 5575 unique word types for Dutch. GECO is publicly available, while the Dundee Corpus is accessible only for research purposes. GECO offers forty-six pre-extracted gaze features, emphasizing word-based processing, whereas Dundee provides raw gaze data [77].
In 2017, Luke and Christianson [78] introduced the Provo Corpus, focusing on the impact of word predictability on reading. This publicly available corpus includes fifty-five brief paragraphs with 2689 words and 1197 unique word types, read by eighty-four native-English-speaking participants [50,77,79].
The Zurich Cognitive Language Processing Corpus (ZuCo 1.0), created by Hollenstein et al. [22] in 2018, integrates both eye-tracking and EEG data. Data were collected from twelve native English speakers reading 1107 English sentences, totaling 21,629 words and 7099 unique word types; the corpus includes both natural reading and task-specific reading. ZuCo 2.0 [80], built in 2020, differs from ZuCo 1.0 mainly in its experimental procedures. It encompasses 739 English sentences with 15,138 words and 4849 unique word types, read by eighteen participants. ZuCo 2.0 is the first corpus to capture eye movements and EEG simultaneously while merging standard and task-specific reading in a single session, enabling detailed analysis and comparison.
Other eye-tracking corpora are available for various languages, including Portuguese [81,82], Chinese [83], and Danish [84], providing insights into human reading behavior and aiding in understanding cognitive processes. This information is invaluable for various research fields focused on understanding human reading patterns.
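The two figures quoted for each corpus above, total words and unique word types, can be approximated with a naive whitespace tokenizer; the published counts naturally come from each corpus's own tokenization scheme:

```python
def corpus_stats(text: str):
    """Return (token_count, unique_type_count) for a text, using naive
    case-folded whitespace tokenization as a rough approximation."""
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))
```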

3.3. Discussion

Readability studies’ success hinges on large, well-graded, gold-standard datasets, which are currently scarce for Arabic. Language resources are essential for computational language analysis in NLP. Arabic readability corpora are mainly designed for educational purposes, aiding in selecting texts for different reading proficiency levels and helping educators design age-appropriate materials [70]. While Arabic readability corpora exist, they primarily focus on L1 and L2 learners, with limited resources for general public readability assessments, leading researchers to create their own corpora. While Arabic L1 readability assessments offer opportunities for resource collection and tool development, their application extends beyond academia to fields like politics, law, healthcare, and marketing, though these efforts are still emerging [59]. Additionally, although making informational texts more accessible is crucial, Arabic L1 readability corpora are not as well documented or available as L2 corpora like GLOSS or Aljazeera Learning [59,71]. Furthermore, most Arabic readability corpora are annotated by experts, which may not fully represent the target audience’s needs. Thus, creating a large, representative corpus annotated by the target audience is essential for improving model performance and accurately reflecting text readability levels.

In terms of eye-tracking corpora, there is a significant lack of human reading eye-tracking corpora for Arabic, hindering advances in cognitively driven readability assessment [10]. While various eye-tracking corpora exist, there is a notable gap for right-to-left languages like Arabic. Existing cognitive corpora often balance participant numbers against corpus size, with larger texts involving fewer participants. Some also include linguistic feature annotations alongside cognitive data. Traditionally, eye-tracking corpora have focused on short, isolated sentences. Still, new datasets in which participants read extended passages of natural text offer deeper insights into eye movements and cognitive loads during reading [79]. This shift toward naturalistic reading enhances the analysis of longer, more complex reading processes. Accordingly, this paper addresses this gap by establishing an Arabic human reading corpus akin to the Dundee Corpus [18].

4. Methodology

This section outlines all the investigative phases conducted to achieve the objectives of this study: the construction of an eye-tracking corpus. Given the detailed nature of the corpus-building steps, they are divided into three subsections: corpus preparation, data collection, and data preparation. Figure 2 provides an overview of these phases and the applied methodology.

4.1. Corpus Preparation

This section highlights the foundational tasks for gathering eye-tracking data: recruiting participants, preparing experimental texts, and defining and testing readability guidelines. This study’s corpus-building approach was user-centric, involving participants throughout the process. Figure 3 summarizes this section’s main components.

4.1.1. Identifying Participants’ Criteria

This paper centers on Arabic L1 readers, aiming to assess text readability using the cognitive signals of native speakers. To ensure consistency, participants had to meet several criteria [25,29]: they were male and female native Arabic speakers, aged 20–50 years, who were educated Arabic readers holding or pursuing degrees in Arab countries. They came from various professional backgrounds and, to avoid a bias toward expert linguists, from diverse Arabic-speaking countries. Additionally, their reading skills were assessed using the Avant Arabic Proficiency Test (Avant APT) (https://avantassessment.com). This approach aimed to reduce variability and ensure the collection of relevant data for the corpus.

4.1.2. Defining Different Aspects of the Participants’ Tasks

This study’s corpus was built through a user-driven process that involved the participants in most of the phases. This section provides an overview of the settings used in the corpus construction, covering the aspects of the participants’ tasks designed to ensure consistent and unbiased results. For example, instructions were provided both verbally and in writing to avoid behavioral variations, and texts were organized and randomized to prevent bias and maintain balanced reading experiences. Multiple participants were used in each task to reduce bias, and each task began with a survey to capture demographics and reading habits. Quality control was also crucial for ensuring work quality and participant engagement; participants were informed about these techniques beforehand to encourage attention and prevent the skipping of parts of the texts. The study used two methods for quality assurance [85]: follow-up questions and random texts. Participants answered true/false and multiple-choice questions with clear, unambiguous answers after reading texts, to assess task completion quality [27,86]. Random texts with specific labeling instructions appeared unpredictably to maintain engagement and prevent skimming, ensuring sustained reader attention without excessive questioning [27].
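The random-text technique depends on check items appearing at positions readers cannot anticipate. A minimal sketch of such interleaving (the function and its interface are illustrative, not the study's actual tooling):

```python
import random

def interleave_checks(texts, check_texts, seed=None):
    """Insert quality-control (random) texts at unpredictable positions
    in the presentation order, so readers cannot anticipate when an
    attention check will appear."""
    rng = random.Random(seed)
    order = list(texts)
    for check in check_texts:
        order.insert(rng.randrange(len(order) + 1), check)
    return order
```

A fixed seed makes a session order reproducible for later analysis while still being unpredictable to the participant.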

4.1.3. Collecting and Testing Corpus Texts

This step involved collecting texts suitable for readability assessment [87]. Based on a study of 41 Arabic-speaking participants who favored books over newspapers [26], the corpus was built from Arabic books to reflect diverse styles and knowledge sources. MSA texts were sourced from Hindawi Foundation books [88], and CA texts were sourced from the King Saud University Corpus [89], covering authors from the 8th to the 21st centuries. To capture each author’s style without requiring participants to read entire books, short representative excerpts (600–700 words), such as introductions or standalone sections, were selected. Thirteen topics, including grammar, literature, health, and politics, were chosen to ensure variety and engagement [60]. Texts were selected without predefined readability levels, and the open-source metric for measuring Arabic narratives (OSMAN), an Arabic readability tool (v3.0, 2020), was used to ensure variation in difficulty [20,80,90]. Although already published, the MSA texts required only minimal revision, whereas the CA texts required more extensive processing. Using the Qalam tool and with support from two linguists, preprocessing steps included the normalization of formatting (e.g., traditional layout and Uthmani script) [87], proofreading for spelling and punctuation without changing vocabulary [87], and applying appropriate diacritics: partial for MSA and CA texts and full for Quranic verses and sayings of Prophet Muhammad [33]. Additional adjustments were made to correct paragraph structure, clarify indentations, and standardize punctuation by referencing reliable sources, such as Al Shamela Library (https://shamela.ws) and Waqfeya Library (https://waqfeya.net).
To minimize selection bias and ensure the texts were suitable for eye tracking, 40 participants used Google Forms to evaluate the texts and confirm they were self-contained. Over two rounds, participants assessed 94 texts (62,893 words in total), marking whether each text met the condition and explaining if not. Round 1 included 30 MSA and 32 CA texts, and Round 2 focused on 32 new MSA texts. Based on feedback, adjustments were made: novels were excluded, texts were modified or replaced for clarity, more translated and topic-diverse texts were added, and some texts were shortened to suit the experiment length. The Round 1 results revealed a sharp contrast in difficulty levels, prompting a further diversification of the MSA texts. The final corpus, comprising both MSA and CA texts, totaled 58,045 words and was used in the eye-tracking experiments. Quality control was ensured through random text insertion and engagement checks, following the approach in [86]. Table 1 displays statistics for the collected texts, adjusted according to the results of the tests in Rounds 1 and 2.
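Selecting texts so that an automatic readability score varies across the set can be sketched as an even spread over the ranked scores. Here `score` stands in for any readability metric (for example, an OSMAN-style score); its interface is a hypothetical placeholder, not the tool's actual API:

```python
def spread_by_score(texts, score, k):
    """Choose k texts spread evenly across the readability-score range,
    guaranteeing difficulty variation in the selected set."""
    ranked = sorted(texts, key=score)
    if k >= len(ranked):
        return ranked
    if k <= 1:
        return ranked[:k]
    step = (len(ranked) - 1) / (k - 1)   # evenly spaced rank positions
    return [ranked[round(i * step)] for i in range(k)]
```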

4.1.4. Paragraph Segmentation

Previous Arabic readability studies have mostly focused on either individual sentences or entire documents; paragraphs have often been overlooked. However, assessing single sentences can lack context and may not reflect real-life reading behavior, especially when analyzing eye movements such as regression [50,91]. On the other hand, evaluating full documents requires a large corpus and significantly more reading time. Paragraphs strike a balance: they present complete ideas, provide enough context for readers to judge readability accurately, and are manageable for eye-tracking experiments [19]. For this reason, this study focused on paragraph-level readability. Documents were segmented into meaningful paragraphs, each expressing a single idea [39]. Texts were carefully segmented by two linguists, and any disagreements were resolved by a third expert.
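A mechanical first pass at such segmentation might split on blank lines, as sketched below; in the study itself, segmentation into single-idea paragraphs was performed manually by two linguists, with a third resolving disagreements:

```python
def segment_paragraphs(document: str):
    """Split a document into candidate paragraphs on blank lines,
    discarding empty fragments. Only a rough starting point: the
    study's final segmentation was done by human experts."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]
```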

4.1.5. Extracting and Testing Arabic Readability Guidelines

A clear gap in Arabic readability guidelines—particularly for native speakers—was identified through a review of previous studies. Most existing research has focused on specific grade levels or second-language learners [92,93], whereas this study aimed to develop general guidelines for Arabic books based on how texts are perceived by everyday readers. Rather than relying solely on expert opinions, a user-centered approach was adopted to assess readability from the reader’s perspective.
In this phase, 28 MSA texts from Round 1 were annotated by 24 participants at both the document and paragraph levels (212 paragraphs in total). Examples of features for each readability level were provided by a linguist to support the annotation process. Initially, four readability levels were tested, but feedback indicated that three levels were more effective and easier to apply. Uncertainty was frequently expressed by participants when determining readability, highlighting the need for guided annotation. Based on participant input, readability guidelines were developed to capture the key characteristics of each level; this offered a useful resource for future applications. For quality control, five random texts were embedded within the corpus. Participants assessed paragraph readability at four distinct levels: (a) an easy paragraph, (b) a medium-easy paragraph, (c) a medium-difficult paragraph, and (d) a difficult paragraph. In addition, they provided reasons for their choices.
To validate the readability guidelines, eight participants annotated 47 paragraphs from nine randomly selected MSA texts chosen using the OSMAN Arabic readability tool to avoid bias. They completed three tasks—rating readability across three, four, and five levels—using Google Forms during a two-hour session with a break. A think-aloud protocol was applied, allowing participants to explain their reasoning while reading and offering insight into how they interpreted the texts and guidelines [94]. They also read half the texts silently and the other half aloud to determine their preferred mode for future eye tracking.
The findings revealed a trade-off: while additional levels provided more detail, they also caused confusion. A three-level scale offered a good balance of clarity and usability and was chosen for the eye-tracking phase. Participant feedback led to guideline refinements, including clearer notes on pronunciation difficulties, especially with diacritics and loan words. A linguist reviewed the updated MSA guidelines (Appendix B) [23]. Regarding the reading mode, six participants preferred reading aloud for better comprehension, while others favored silent reading to stay focused. Diacritization was further enhanced for uncommon and heterophonic–homographic words. Two quality control methods were also tested: random texts and comprehension questions. While preferences varied, five participants recommended combining both: a question to prompt focus, followed by a random text to maintain engagement.
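When several readers annotate the same paragraph on the chosen three-level scale, their labels must be combined into one gold label. A common approach is a majority vote, sketched below; the tie-break to the middle level is our assumption for illustration, not the study's documented procedure:

```python
from collections import Counter

def aggregate_labels(annotations):
    """Combine several readers' three-level readability labels (e.g.
    'easy', 'medium', 'difficult') for one paragraph by majority vote,
    falling back to 'medium' on a tie (an illustrative assumption)."""
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "medium"
    return counts[0][0]
```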

4.2. Data Collection

This section outlines the process of collecting eye movement data from Arabic participants, including the design and testing of the eye-tracking experiment, data acquisition, and labeling. Matching the text difficulty to the reader’s level is crucial for learning [76]. Labeling texts with readability levels is essential for readability assessment models [87]. NLP has recently focused on annotating data based on user interactions [86]. By engaging real audiences and using eye-tracking technology, this study identified factors affecting text readability through eye movements. This section details the steps to build the first Arabic cognitive corpus using eye-tracking data, as shown in Figure 4.

4.2.1. Pilot Testing the Eye-Tracking Experiment

Based on preliminary findings on Arabic readers’ preferences and behavior, this stage involved a hands-on eye-tracking test to understand reading behavior and fine-tune the experimental procedure. Previous studies highlighted the importance of pilot experiments and participant feedback for reliable results and participant engagement [42,85,86]. The eye-tracking experiment involved three participants, each in three sessions lasting about three hours. Factors considered included instruction clarity, experiment flow, text design features, calibration, quality control, reading mode, session duration, and breaks [7]. While justified text alignment can enhance Arabic readability by adhering to traditional typographic rules, this study focused on textual components. The impact of diacritics on readability was also examined. Based on Hindawi’s suggestion and input from two linguists, partial diacritization was selected. Six native Arabic speakers read nine texts with full, partial, and no diacritics. The results showed the longest reading durations and most fixations with full diacritics, followed by partial, and then none, aligning with participant feedback on visual noise and reading speed [24]. Partial diacritics struck a balance, confirming previous findings [24,30,31,60]. To avoid redundancy, detailed procedures of this pilot step were omitted. Based on these pilot tests, the actual eye-tracking experiment was designed and conducted.

4.2.2. Designing the Eye-Tracking Experiment

Tobii Studio software Version 3.4.8 (https://www.tobii.com) was used to record, display, manipulate, and export the eye-tracking data in a Microsoft Excel (.xlsx) file format for further analysis. The design of the experiment involved deciding on its flow, selecting the reading materials, and setting up the experimental environment.

Apparatus and Setup

The eye movement data were recorded using the Tobii X120 eye tracker (Tobii Technology, Inc., Danderyd, Sweden), which has a 120 Hz sampling rate and 0.5-degree precision. Participants could move their heads freely within a 30 × 15 × 20 cm area at a 60–65 cm viewing distance [41,44,45]. Figure 5 shows the eye-tracker setup within the participants’ viewing range.
The eye tracker was placed below a Samsung T35F monitor (Samsung Electronics, Suwon, South Korea), which has a 75 Hz refresh rate and a resolution of 1920 × 1080. Adjustable chairs ensured participants’ eyes were centered for accurate calibration. Positioning parameters were entered into the X120 Configuration Tool in Tobii Studio for each participant. Adjustments to the eye-tracker angle, participant position, and seat height were made before each session. Experiments took place in a well-lit room without direct sunlight or reflective surfaces to minimize distractions [41,44].

Materials

The process began with the entire dataset, including MSA texts from both rounds in Section 4.1.3 (Collecting and Testing Corpus Texts). The texts were displayed right to left, but Tobii Studio’s issues with Arabic text alignment required using images for the experimental materials. Each text appeared as a sequence of images, with one per paragraph. Paragraphs longer than eight lines were split across multiple screens [19,91]. Each paragraph had a title and a unique identifier (e.g., GeoAndTra_T2_P4_p1). Sixteen reading sets were created: the first was a practice set, marked as a dummy and excluded from the analysis; the remaining fifteen each included four texts from different topics in varied and randomized orders to ensure diverse and balanced reading sets for each session [13,57,78,95].

4.2.3. Setting Up the Eye-Tracking Experiment

The study was conducted according to King Saud University’s Institutional Review Board rules and regulations and was approved by the Ethics Committee of King Saud University (no. 21/0892/IRB, dated 19 October 2021).
According to [42], gaze patterns vary by task, so clear communication is crucial for accurate observations. Research in [22,23,39,84] shows that aggregated non-expert assessments can match expert quality [96]. Thus, this study included subjective readability annotation in the eye-tracking experiment using guidelines for three readability levels from the previous section [92]. Participants performed two tasks. In Task 1, they read the paragraphs silently while the eye tracker captured their behavior, and then they clicked to proceed [18,19,20,78,95]. In Task 2, they rated readability on a scale of easy, medium, and difficult after each paragraph and document, following methods similar to [97,98].
Guidelines were available throughout the experiment to ensure accurate results. Combining these annotations with eye-tracking data provided a comprehensive understanding of user behavior and context. Figure 6a–c illustrate the tasks and annotation levels. This study was conducted between August 2022 and February 2023. Participants received a Google Form link in advance to collect demographic information and data on visual impairments, as well as to review the annotation procedure and guidelines. Upon arrival, participants were briefed on the texts and the experiment, with time for questions. Key instructions included displaying each paragraph only once, focusing, and limiting head and body movements for data accuracy. Participants read at their own pace with no time limit. Minor disruptions were acceptable unless calibration was lost, in which case re-recording was necessary. Breaks were provided after each reading set.
Participants provided consent for the anonymous recording and use of their data, with the option to withdraw [92]. They were seated comfortably in front of the screen and eye tracker, and the device was calibrated for accurate gaze capture [42]. Calibration involved following points on the screen in a five-point setup, ensuring both eyes were within the optimal range (60–65 cm) [99]. Satisfactory calibration was indicated by short green lines for both eyes. The experiment began with onscreen instructions, and participants used the mouse for navigation to minimize drift.

4.2.4. Conducting the Eye-Tracking Experiment

To familiarize participants with the task, a practice set with two sample texts was introduced, helping them adjust to the eye tracker [20,44]. Participants reported clear visibility of words and diacritics during this practice [23]. The main session involved sequential reading tasks (Task 1 and Task 2) for each document, recorded by the eye tracker. An observer was present initially but left quietly to minimize distractions. Each session began with a five-point calibration (sixteen per participant) to ensure gaze accuracy. Data collection spanned two days, with sessions lasting three to four hours, including regular short breaks and a longer break after four sets. Binocular tracking analyzed data from both eyes. Texts were displayed in a black Traditional Arabic font at size eighteen on a white background, with expanded character spacing and clear line separation. Margins of 0.2 cm on both sides ensured accurate eye tracking at the edges [99].

4.2.5. Participants

Data on the eye movements of fifteen healthy adults (seven male and eight female; ages 20–45 years) were collected. The primary cognitive data came from at least ten participants, following the benchmark set by the Dundee Corpus for English; this number could be increased during the experiment. None of the participants had learning disabilities, neurological issues, or reading problems [24]. Table 2 details the participants. No participants were excluded based on the APT, as these tasks were preparatory for the eye-tracking experiment, and there was no clear exclusion threshold. The median APT score of previous participants, including those from Section 4.1.3 (Collecting and Testing Corpus Texts) and Section 4.1.5 (Extracting and Testing Arabic Readability Guidelines), in addition to the two pilot tests, was used to establish a threshold of nine for inclusion in the eye-tracking experiment.
All participants satisfied the conditions outlined in Table 2, achieving a minimum APT score of 9, which indicates an advanced–low level with the potential for improvement to superior. This was expected as the participants were highly educated native Arabic speakers, ensuring they could easily interpret the materials [23]. Most participants had normal vision and did not require corrective lenses, while a few reported slight reductions in visual acuity or mild astigmatism and used eyeglasses or contact lenses accordingly. Additionally, none of the participants reported any reading problems. According to Tobii guidelines, the eye tracker effectively monitors individuals regardless of ethnicity, age, or use of eyeglasses/contact lenses, even under varying light conditions. Participants were advised to avoid wearing reflective eyeglasses and makeup to prevent tracking issues. Individuals with exclusion criteria, such as wet eyes (epiphora), were excluded.

4.2.6. Results

Sessions 1 and 2 for MSA Texts

During the eye-tracking experiment, each reading set consisted of four texts recorded in one session. The quality of these recordings was evaluated based on gaze sample percentage, calculated by dividing the number of valid gaze samples by the total number of samples and multiplying by 100. Factors like excessive blinking, inaccurate configuration, and distractions can reduce this percentage [41]. A 100% gaze sample is rare, and the initial acceptance threshold was set at 70% [44]. Analysis after the first session, which included eighty recordings, showed gaze sample percentages ranging from 84% to 98%, indicating that participants were still adjusting during the practice set. Recordings above 80% showed minimal drift and were considered acceptable. Those below this threshold were discarded [41,67].
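The gaze sample percentage and acceptance check described above can be sketched as follows (an illustrative sketch; the function names and threshold default are ours, and Tobii Studio computes this metric internally):

```python
def gaze_sample_percentage(valid_samples: int, total_samples: int) -> float:
    """Share of valid gaze samples in a recording, as a percentage."""
    if total_samples == 0:
        return 0.0
    return valid_samples / total_samples * 100


def is_acceptable(pct: float, threshold: float = 80.0) -> bool:
    """Recordings at or above the threshold were kept; lower ones discarded."""
    return pct >= threshold
```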
Among the 160 recordings from the first and second sessions, three participants had gaze sample scores below 80%. Specifically, participant P01 had two low recordings at 78% and 51%, P04 had seven recordings ranging from 33% to 76%, and P06 had one recording at 75%. To improve accuracy, these participants were asked to repeat the recordings after a one-month interval [24]. Although the percentages improved, P04 was replaced due to persistently low recordings, indicating potential eye issues. The study initially included twenty-three participants, but technical issues and personal reasons led to the exclusion of several participants, leaving eighteen with complete recordings, averaging 93% gaze point accuracy. In total, 288 recordings were made, with participants rating each paragraph and document as easy, medium, or difficult, as summarized in Table 3.
Participant observations and feedback indicated that even with annotation guidelines [92], participants often defaulted to the medium rating when uncertain about a text’s readability. They struggled to differentiate between medium and difficult levels and hesitated to use the difficult rating. Rating entire documents was especially challenging, resulting in imbalanced readability ratings due to varied paragraph difficulties within documents. To determine a common readability level, the median was chosen over the mode for its reliability with categorical ordinal data in small samples [100,101]. The mode risked misclassifying texts as easy when many were perceived as medium or difficult. The analysis of ratings shown in Table 3 suggested that MSA texts were more suitably classified into two readability levels instead of three.
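The median-based aggregation of ordinal ratings can be illustrated with a minimal sketch (the numeric coding of the three levels is an assumption for illustration; `median_low` is used so the aggregate always lands on an actual level):

```python
from statistics import median_low

LEVELS = ["easy", "medium", "difficult"]  # ordinal scale (assumed coding)
CODE = {lvl: i for i, lvl in enumerate(LEVELS, start=1)}


def aggregate_ratings(ratings: list[str]) -> str:
    """Aggregate per-participant ordinal ratings with the (low) median,
    which is more robust than the mode for small ordinal samples."""
    codes = sorted(CODE[r] for r in ratings)
    return LEVELS[median_low(codes) - 1]
```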

Session 3 for CA Texts

To address the imbalance in text readability, CA texts, known for their challenging language and vocabulary from the pre-Islamic era [26,70], were used to represent a more difficult level. Participants from Round 1 found the CA texts to be tougher than the MSA texts, aiding in clearer readability distinctions. MSA texts were anticipated to represent the first two readability levels, while CA texts would cover the third, ensuring a comprehensive range of difficulty levels. With a linguist’s help, guidelines for CA texts were adapted. The third session, focusing on CA texts, lasted three to four hours with one practice set and eight reading sets. This session involved fifteen participants who completed 135 recordings, with one needing a repeat due to technical issues. The session achieved an average gaze point score of 94%.
After the three sessions, data from all 375 recordings (240 from the first two sessions and 135 from the third) were analyzed. The results, shown in Table 4 and Table 5, indicate an under-representation of the difficult level in both documents and paragraphs, with only six documents categorized as difficult. This imbalance persisted as participants, uncertain about the correct level, frequently chose medium and hesitated to select difficult, despite struggling with the content, reflecting a tendency to avoid potential overrating. On the other hand, there was a noticeable tendency to rate paragraphs as easy, and to a lesser extent, entire documents. This could be because evaluating paragraphs in isolation might seem simpler than annotating full documents, which can contain a mix of easy, medium, and difficult passages.
This imbalance, with paragraphs often rated as easy and documents as medium, raises concerns about classification models favoring these predominant categories. Imbalanced datasets are common in real-world scenarios and pose challenges for automatic text readability assessments, as highlighted in [102]. Many Arabic readability studies have encountered issues with data imbalance. This can be addressed through various methods, such as data augmentation and cost-sensitive training. Further analysis of model-level solutions and a comparative evaluation of imbalance-handling techniques will be presented in future work. Figure 7 illustrates the distribution of documents and paragraphs across three readability levels in a bar chart.

4.2.7. Quality Control

To ensure full engagement during silent reading, participants were prompted with comprehension questions [18,19,20] and random texts [27] based on prior findings. They were aware that they would be questioned at intervals and informed about quality control measures. Each reading set included either or both measures to maintain concentration. Some documents, chosen at random, were followed by multiple-choice [19,20] or true/false questions [23,95], with the number of questions adjusted based on text complexity and length; more complex or longer texts had two questions, while shorter texts had three. Random questions accompanied detail-rich texts to avoid overwhelming readers, with their locations randomized. Periodic checks ensured thorough reading. The results are detailed in Table 6, which includes the outcomes of interactions with fifty-four quality control measures for the MSA and CA texts. The data show that participants correctly answered 77.41% of the questions, closely aligning with the 78.27% reported by GECO [20].
The discussion on using comprehension questions and setting participant thresholds in reading studies lacks clear standards. In this study, a 60% threshold on fifty-four questions was deemed sufficient, based on task complexity, participant engagement, and expert consultations in gaze behavior dataset creation and human annotation [19,39]. Factors influencing this decision included the complexity and duration of tasks potentially affecting recall, as each session lasted three to four hours with twenty-five recordings per participant. Participants’ diverse learning styles meant variability in correct responses; some easily recalled numerical data, while others struggled with names. Controlled reading confirmed the engagement, as cursor movements indicated active reading. An expert advised retaining all gaze data to create a comprehensive dataset, including all gaze behavior and comprehension scores. Notably, annotating texts across varying readability levels also enhanced focus, serving as an additional quality control measure.

4.3. Data Preparation

To effectively use the collected eye-tracking data, thorough preparation was necessary, involving tokenization, extracting eye-tracking and textual features, and preprocessing to remove irrelevant information and filter out data captured during track loss. After cleaning, the data allowed for an analysis of the correlation between eye movements and text readability levels. Tobii Studio software facilitated the display, manipulation, and analysis of the behavioral data. Figure 8 summarizes the key components of this process.

4.3.1. Tokenization

Tokenization divides texts into individual words and was crucial for this study [70]. In Tobii Studio, tokenization requires manually defining areas of interest (AOIs) around each word to capture accurate eye metrics, such as time to first fixation, saccades, and regressions [15,19,41]. Typically carried out by white-space tokenization [26,43,70], this process was manually executed for the 57,617 words (excluding book titles) in the text images. Punctuation was attached to preceding words, and stop words were retained to preserve comprehension. Manual adjustments were essential for poems or irregular text formats. Figure 9 showcases word segmentation, with the rectangles defining individual words, each associated with an ID. Colors are assigned to tokens randomly.
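A minimal sketch of white-space tokenization with punctuation attached to the preceding word might look as follows (the exact punctuation set and function name are illustrative; in practice, the AOIs were drawn manually in Tobii Studio):

```python
import re

# Punctuation to glue to the preceding word: Arabic comma, semicolon,
# and question mark, plus common Latin punctuation (assumed set).
PUNCT = re.compile(r"[\u060C\u061B\u061F.,;:!?()\"']+")


def tokenize(paragraph: str) -> list[str]:
    """White-space tokenization; stray punctuation tokens are merged into
    the preceding word, mirroring the AOI convention described above."""
    tokens: list[str] = []
    for tok in paragraph.split():
        if tokens and PUNCT.fullmatch(tok):
            tokens[-1] += tok  # attach punctuation to the previous word
        else:
            tokens.append(tok)
    return tokens
```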

4.3.2. Feature Extraction

This section discusses how text features and reader characteristics influence readability, as supported by readability studies and linguistic reading theories [7]. Features influencing readability are divided into three categories [39]:
1.
Text-based features represent the general characteristics or linguistic complexity of the selected texts.
2.
Gaze-based features, derived from eye-tracking experiments, reflect cognitive processing and comprehension through established eye-tracking metrics.
3.
The readability level feature represents the combined subjective readability ratings of paragraphs and documents, as provided by participants during the eye-tracking experiments.
Both text and gaze features are included in the Arabic cognitive corpus, providing a dual perspective on text readability from both the text’s properties and the Arabic L1 readers’ interactions. This holistic approach improves upon traditional readability corpora, which typically rely solely on expert opinions and may be less practical [103]. Each feature category and the methods for extracting these features are summarized in Figure 10.

Text-Based Features

Text-based features associated with readability vary by extraction method and the complexity levels measured [104]. As shown in Table 7, these features were categorized into two groups: general features and linguistic features. General features summarize the books used in this study, taking a user-centric approach rather than relying solely on expert judgment. In addition to standard text features like character count and average sentence length, features were selected based on Rounds 1 and 2 described in Section 4.1.3 (Collecting and Testing Corpus Texts) and Section 4.1.5 (Extracting and Testing Arabic Readability Guidelines), and their impact on participant-reported readability. Linguistic features are shallow (surface) features: basic textual and structural characteristics widely used to gauge grammatical and lexical complexity. This study includes statistical attributes of characters, words, and sentences due to their simplicity and strong correlation with text readability, as well as ongoing interest in readability research [5,7,74,105,106,107,108,109,110].
To enhance text diversity, texts with various features impacting readability were selected. Appendix C shows the details of each feature in the text-based features. All features were calculated for each document and paragraph, except for paragraph count, which was relevant only at the document level.
We aimed to keep the annotation process as automated and objective as possible. Most of the general and linguistic features were extracted using Python scripts (Python 3.12.3) and Arabic-specific tools like PyArabic and OSMAN, with linguists reviewing the results for accuracy. Only a few features—like loan words, foreign words, paragraph count, and stylistic elements—had to be assessed manually, because reliable Arabic tools for them do not yet exist. To reduce bias, two linguists worked independently, and a third stepped in to resolve any disagreements. Given the current limitations in Arabic NLP, this mix of automation and expert review was the most practical and accurate approach.
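As a rough illustration of the shallow feature extraction, a stdlib-only sketch is shown below (the study used Python scripts with PyArabic and OSMAN; the sentence-splitting pattern and feature names here are simplified assumptions):

```python
import re


def shallow_features(text: str) -> dict:
    """Stdlib-only sketch of shallow textual features; the study used
    PyArabic and OSMAN for the Arabic-specific measures."""
    words = text.split()
    # Split on Latin and Arabic sentence-final punctuation (assumed set).
    sentences = [s for s in re.split(r"[.!?\u061F\u06D4]+", text) if s.strip()]
    n_chars = sum(len(w) for w in words)
    return {
        "char_count": n_chars,
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_word_length": n_chars / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }
```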

Gaze-Based Features

Text readability focuses on meeting the target reader’s needs [103], and eye-tracking analysis offers insights into what captures visual attention during reading. This study used the velocity-threshold identification (I-VT) fixation filter in Tobii Studio to identify fixations and saccades based on eye movement speed (degrees per second, deg/s). Eye movements slower than 30 deg/s were classified as fixations, and the data included the (x, y) positions of each fixation and timestamps for each gaze point [19]. Choosing the right eye-tracking metrics is crucial and depends on a study’s goals. This study selected key eye behaviors based on past research and Tobii Studio guidelines [13,17,38,41,43,52,56,58,111]. These metrics, popular for capturing cognitive processes during reading, were evaluated for their effectiveness in assessing text readability, particularly in Arabic.
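The core decision of a velocity-threshold fixation filter can be sketched as follows (illustrative only; Tobii Studio’s filter additionally merges adjacent fixations and discards very short ones, and the function name is ours):

```python
def classify_samples(velocities_deg_s: list[float],
                     threshold: float = 30.0) -> list[str]:
    """Velocity-threshold classification: samples moving slower than the
    threshold are fixation samples; faster ones belong to saccades."""
    return ["fixation" if v < threshold else "saccade"
            for v in velocities_deg_s]
```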
After segmenting texts into tokens and organizing trials, eye-tracking measures for AOIs were extracted and saved in an Excel (.xlsx) format using Tobii Studio. Real-time gaze-based metrics were categorized into two groups: reading metrics and experimental condition metrics. Reading metrics are widely used eye-tracking metrics reflecting cognitive processing during reading. The primary analyzed measures were fixations, saccades, visits, pupils, and click metrics, as mentioned in other studies. Table 8 summarizes the eye-tracking metrics used. Some of these eye-tracking metrics were extracted directly from Tobii Studio, while others necessitated manual extraction and further calculation after extraction.
Experimental condition metrics include session duration and rating duration, which could contribute to participant fatigue and influence final readability assessments. These metrics represent aspects within the eye-tracking experiment settings distinct from the reading of the experimental texts. They include the evaluation of rating tables following each paragraph and document, as well as the time taken to complete each recording, which consisted of four texts. Appendix D shows the details of each gaze-based feature.

Readability Level Features

All the aforementioned features serve as independent variables (predictors) of the dependent (target) variable, which is the readability level. This level is the aggregated subjective readability of the paragraphs and documents, collected from the fifteen participants during the eye-tracking experiment, with three possible values: easy, medium, and difficult.

4.3.3. Data Preprocessing

Preprocessing the data points is important to enhance their reliability and ensure that the insights derived from them are precise and meaningful. This is essential in later data modeling, as it will influence the model’s performance.

Encoding of Categorical Features

Features in datasets often come as categorical variables rather than continuous values. Many ML algorithms require numeric inputs, meaning that categorical data must be converted into a numerical format [100,112,113]. Different encoding techniques were therefore applied to different features.
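For example, the ordered readability levels lend themselves to integer (ordinal) codes, while unordered features would receive one-hot vectors (a hedged sketch; the specific mappings below are illustrative, not the study’s exact encodings):

```python
# Ordinal coding for the ordered readability levels (assumed mapping).
ORDINAL = {"easy": 0, "medium": 1, "difficult": 2}


def encode_ordinal(level: str) -> int:
    """Readability levels are ordered, so an integer code preserves rank."""
    return ORDINAL[level]


def encode_one_hot(value: str, categories: list[str]) -> list[int]:
    """Nominal features (e.g., a topic label) get one-hot vectors instead."""
    return [1 if value == c else 0 for c in categories]
```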

Data Formatting

In this phase, inconsistencies in the formatting of some features were adjusted. For example, gaze data such as total recording duration were converted from milliseconds to seconds for consistency with the other duration features required in some calculations.

Data Cleaning

In this phase, errors and missing data were addressed through exploratory analyses of quantitative metrics, observing their distribution characteristics, such as normality and skewness. The data exhibited typical imperfections found in real-world datasets, and the main aim was to prepare them for subsequent modeling. The data cleaning process addressed several issues to ensure accuracy and reliability. For missing values, event types labeled as “Unclassified” by Tobii Studio due to ambiguity in identifying saccade boundaries were excluded, focusing only on definitive classifications of fixation and saccade features. For incorrect values or noise, features such as the OSMAN, Flesch, Kincaid, and Fog scores, which exhibited unexpected negative values, were handled differently: the Kincaid metric was removed due to a high percentage of negative values, while the negative values in the other metrics were treated as missing and substituted with the median of the valid values for similar texts. Duplicates, specifically in total saccade duration, were eliminated, as this feature was duplicated for each participant due to timestamp calculations. Lastly, non-informative features like book name and author, document code, and paragraph code were removed, as they lacked predictive power for determining readability levels.
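The median substitution for invalid negative scores can be sketched as follows (illustrative; the grouping by similar texts described above is omitted for brevity, and the function name is ours):

```python
from statistics import median


def impute_negative(scores: list[float]) -> list[float]:
    """Treat negative readability scores as invalid and replace them with
    the median of the valid (non-negative) scores."""
    valid = [s for s in scores if s >= 0]
    med = median(valid)
    return [s if s >= 0 else med for s in scores]
```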

4.3.4. Corpus Evaluation

Readability prediction, while similar to ML tasks like topic modeling and sentiment analysis, uses more subjective scoring [2]. Unlike tasks with benchmark annotations, readability levels are not definitively correct or incorrect, reducing the need for comparison with expert annotations. However, quality assurance remains crucial as individual participation quality can vary and affect labels [85]. After cleaning the data, the goal was to determine if there was a correlation between the gold-standard aggregated subjective readability annotation and the cognitive annotation from the eye-tracking experiment. The following subsections describe the adopted corpus evaluation approaches, as relying on a single metric can be problematic.

Visualization of Gaze Plots

The method of collecting eye movements in this study focused on normal reading during the readability annotation task. This approach was similar to the natural reading in ZuCo and the Dundee study, in which participants read texts for comprehension, sometimes revisiting words as needed. Figure 11 illustrates three participants’ fixations on a paragraph, showing continuous reading until the end.
This pattern, consistent across all texts in the corpus, resembles typical reading behaviors in standard reading corpora. Unlike task-specific readings with quick skimming, participants in this study exhibited thorough, natural reading, occasionally revisiting words when encountering difficulties.

Interpersonal Consistency in Reading Times

Consistent with GECO and ZuCo, this study analyzed eye-tracking data by aggregating participants’ eye movements, focusing on reading time metrics. Seven metrics were chosen: time to first fixation, first fixation duration, single fixation duration, total fixation duration, total saccade duration, single visit duration, and total visit duration. The histograms in Figure 12 show the distributions of these reading times.
Most reading time variables were normally or approximately normally distributed, except for total visit and fixation durations, which mainly had short readings with a few longer instances. Pearson’s moment coefficient of skewness (G) values in Table 9 further support this. Positive G values indicate skewness, with values near zero implying symmetry around the mean. The highest G values for total visit and fixation durations show the most pronounced positive skewness, while other features are closer to zero.
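The moment coefficient of skewness is g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. A minimal sketch (function name ours):

```python
def skewness_g(values: list[float]) -> float:
    """Fisher-Pearson moment coefficient of skewness, g1 = m3 / m2**1.5.
    Positive values indicate a right (positive) skew; near zero, symmetry."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5
```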
In eye tracking during reading, factors like reading ability, topic familiarity, and reading strategies influence participants’ behaviors. While participants generally exhibit brief fixations, certain words or segments may require longer fixations or rereadings, affecting total visit duration and total fixation duration. Knowing they would assess the text’s readability might have influenced participants’ behavior, leading to multiple revisits and hesitation between classifications like easy or medium, or medium or difficult [22,80]. This likely contributed to longer revisit durations as participants took extra time to decide. The findings from Figure 12 and Table 9 align with reading times from other eye-tracking corpora, such as GECO and ZuCo. Studies on these corpora have shown that even after log transformation, reading times often deviate from a normal distribution and exhibit a rightward skew, especially for total fixation duration and total visit duration.

Association Between OSMAN and Other Features

We anticipated that texts within specific readability levels would show distinct patterns for each duration metric. Following [26], we aimed to correlate validated OSMAN Scores [72], subjective participant annotations, and objective eye-tracker reading time metrics. Our goal was to determine if texts perceived as easy, medium, and difficult by participants displayed common trends in eye movements and OSMAN readability scores. Table 10 reveals a predictable pattern: more complex texts required longer reading times, shown by increased fixation and visit durations.
This trend, consistent with Rayner [114] and others [18,36], indicates that as text complexity increases, so does processing time. OSMAN scores also reflected these readability differences, with lower scores indicating more complex texts. These findings confirm our assumptions that texts can be differentiated by their difficulty based on specific eye-tracking patterns.
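The kind of per-level summary behind such a trend can be sketched as a group-wise mean over (level, duration) records (illustrative only; names and data are hypothetical):

```python
from collections import defaultdict


def mean_by_level(records: list[tuple[str, float]]) -> dict[str, float]:
    """Mean of a duration metric per aggregated readability level."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for level, duration in records:
        sums[level] += duration
        counts[level] += 1
    return {lvl: sums[lvl] / counts[lvl] for lvl in sums}
```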
Based on the observed consistency between the visualization and aggregated readability levels (Visualization of Gaze Plots Section), the interpersonal consistency in reading time metrics (Interpersonal Consistency in Reading Times Section), and the correlation between aggregated readability levels and OSMAN scores at the paragraph level (Association Between OSMAN and Other Features Section), it can be inferred that the generated readability levels and collected eye movements are trustworthy and acceptable.

5. Conclusions, Limitations, and Future Work

This paper marks a significant step toward a cognitively driven Arabic readability model and lays the groundwork for future studies. It details the preparations for the eye-tracking experiments, including text acquisition, cleaning, preprocessing into paragraphs, and creating annotation guidelines for assessing Arabic text readability at multiple levels. This process identified key reading factors and produced guidelines based on public input, providing insights into readability levels through direct engagement. The study used two annotation methods, eye tracking and human-guided annotation, to build the first Arabic cognitive corpus using cognitive sensors. The methodology aimed to authentically capture reading experiences by integrating eye-tracking and human-guided annotations. Participants classified text readability into three levels (easy, medium, or difficult), enabling meaningful aggregation of text annotations and clarifying the readability ratings.
This paper also described the dataset preparation process, starting with the extraction of general characteristics, linguistic features, and real-time metrics essential for readability assessment. This was followed by data preprocessing to clean, correct, and format the data. Various evaluations of the cognitive corpus were conducted to ensure its consistency and usability for readability assessment. With the corpus ready, models can be developed and evaluated using different ML and deep learning (DL) approaches, and their results compared. The Arabic cognitive corpus presented in this paper, which includes the experimental texts, participants' details, and the collected eye movement data for all tokens, is available to the Arabic NLP community.
When analyzing the experimental results, several limitations should be noted. The small Arabic dataset posed challenges in gaze analysis due to the complexity and resources required for extensive annotation [115,116]. The study’s fifteen participants, while comparable to benchmarks like Dundee [18] and ZuCo [19], highlighted the need for a larger, more diverse pool for robust assessments. Dataset imbalance could be addressed by adjusting training weights to preserve the diversity of human readability judgments [17,117]. Additional techniques, such as oversampling, could also be explored. Refining feature engineering and readability guidelines is crucial for improving model accuracy, especially in distinguishing between medium and difficult texts. Additionally, the models’ performance varied with different text types, indicating the need to test adaptability across various genres to understand the impact of text variability.
The reviewed literature demonstrates the value of combining linguistic attributes with eye-tracking data for various NLP tasks across multiple languages [13,16,95,97,98,118,119,120,121,122]. This suggests a promising direction for future research on Arabic NLP tasks, such as text readability, text simplification, and text summarization. Future work can leverage the Arabic cognitive corpus, which uniquely integrates reading behavior and perceived readability through combined eye-tracking and annotation data. By utilizing both gaze-based and linguistic features collected from the same participants, the potential of merging cognitive and textual signals to improve Arabic readability assessment can be explored. This approach offers opportunities to develop machine learning models that go beyond traditional handcrafted features, using eye-tracking data to gain deeper insights into reading processes and text complexity. Another possible direction for future research is to expand the Arabic cognitive corpus, exploring the impact of text variability on modeling reading difficulties, and analyzing the interaction and importance of linguistic and cognitive features. In addition, incorporating alternative cognitive signals, such as EEGs, and investigating the potential of large language models as alternatives or aids for human readability judgment are recommended for further research.

Author Contributions

Conceptualization, I.B., H.A.-K., and A.A.-S.; methodology, I.B., H.A.-K., and A.A.-S.; software, I.B.; validation, I.B.; formal analysis, I.B. and H.A.-K.; investigation, I.B., H.A.-K., and A.A.-S.; resources, I.B. and H.A.-K.; data curation, I.B.; writing—original draft preparation, I.B.; writing—review and editing, H.A.-K. and A.A.-S.; visualization, I.B.; supervision, H.A.-K. and A.A.-S.; project administration, I.B., H.A.-K., and A.A.-S.; funding acquisition, I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by King Abdulaziz City for Science and Technology.

Institutional Review Board Statement

The study was conducted according to King Saud University’s Institutional Review Board rules and regulations and was approved by the Ethics Committee of King Saud University (no. 21/0892/IRB, dated 19 October 2021).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The dataset is available at Harvard Dataverse, V1, https://doi.org/10.7910/DVN/P5WPNS (accessed on 2 January 2025).

Acknowledgments

The authors would like to thank King Abdulaziz City for Science and Technology for funding and supporting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Arabic readability assessment corpora.
Corpus | Description | Used or Compiled
Targeting Arabic L1 Learners or Readers
Saudi Curriculum Texts | 60 Arabic texts, 20 each from the 3rd and 6th grade elementary and 3rd grade intermediate levels, with each text being around 100 words. | [5]
Saudi Curriculum Texts | 150 Arabic curriculum texts, with 50 each from the elementary, intermediate, and secondary levels, totaling 57,089 tokens. | [3]
King Abdulaziz City for Science and Technology Arabic Corpus | Over 700 million words spanning a period of more than 1500 years; the materials are organized by period, geographical area, format, field, and subject, allowing for search and exploration based on these categories. | [73]
United Nations Corpus | 73,000 corresponding English and Arabic paragraph pairs sourced from the United Nations corpus. | [72]
Jordanian Curriculum and Saudi Articles Dataset | 600 Saudi news articles and 866 Jordanian curriculum lessons, totaling 1200 records and 307,238 tokens, categorized into school and advanced readability levels. | [109]
Open-Source Corpus | 75,630 Arabic web pages not tailored to language learners; a subset of 8627 longer sentences was selected. | [123]
Jordanian Curriculum Texts | 1196 Arabic texts from the Jordanian elementary curriculum, covering different subjects. | [7]
Medicine Information Leaflets | 1112 Arabic medicine information leaflets, acquired from the King Abdullah Arabic Health Encyclopedia and the Saudi Food and Drug Authority. | [74,75]
Modern Standard Arabic Readability Corpus | 644 curriculum texts from Moroccan primary books, categorized into 7 difficulty levels ranging from kindergarten (Level 0) to the 6th grade (the final primary grade). | [76,102,124]
Targeting Arabic L2 Learners
GLOSS | Created by the Defense Language Institute Foreign Language Center; offers public access to over 7000 reading and listening lessons in 40 languages and dialects, sorted into 11 difficulty levels based on the Interagency Language Roundtable proficiency scale. | [4,102,104,107,108,123,125,126,127,128,129]
Aljazeera Learning | Instructional Arabic texts on the Aljazeera website, categorized into 5 difficulty levels from beginner to advanced. | [102,104,128,129]
Malaysian Curriculum Texts | 313 reading texts sourced from 13 religious curriculum textbooks for grades 1–5 in Malaysia. | [130]
Al-Kitaab fii TaAallum al-Arabiyya | Textbook series commonly used to teach MSA as a second language. | [106,123]
Saaq al-Bambuu | An Arabic novel with an approved condensed edition for learners of Arabic as a foreign language. | [123]
Collected Web Texts | 39,792 documents manually sourced from the Web on various topics, categorized into 4 readability levels: easy, medium, difficult, and very difficult. | [70]
Targeting Both Arabic L1 and L2 Learners
A Leveled Reading Corpus of Modern Standard Arabic | Constructed from Arabic curriculum texts (grades 1–12) and adult fiction, categorized into 4 levels with a total of 22,240 documents. | [126]
Arabic Learner Corpus | Comprises Arabic texts by Saudi Arabian students, divided into non-native learners of Arabic and native speakers improving their written proficiency. | [123]

Appendix B

The following tables show the MSA and CA readability guidelines used for data collection.
Table A2. Three-level readability guidelines for MSA paragraphs.
Easy Paragraph
- The paragraph's idea is clear and straightforward.
- The paragraph uses easy and common vocabulary, such as contemporary terms relevant to the current times.
- The paragraph is short, making it enjoyable to read without causing boredom, and uses concise expressions without compromising meaning.
- Proper use of diacritical marks and punctuation promotes interpretation.
- The use of verbal links between sentences and concluding phrases aids comprehension.
Moderate Paragraph
- The paragraph covers multiple ideas, slightly hindering comprehension and requiring multiple readings to grasp the intended meaning.
- Repetition of ideas within the paragraph makes it lengthy and bores the reader.
- Some vocabulary and terminology either require multiple readings for clear understanding or are difficult to pronounce due to their foreign origin or because they are transliterated, uncommon, specialized, or inadequately marked.
- Although the text is intended for specialists, it is relatively clear for nonspecialized readers.
- The abundance of numbers, statistics, parentheses, or parenthetical sentences hinders reading and distracts the reader.
Difficult Paragraph
- The sentences in the paragraph are long and complex, making the expression of ideas ambiguous and unclear.
- The abundance of vocabulary and terminology hinders understanding due to their foreign origin or because they are transliterated, uncommon, specialized, or require clarification.
- Inaccuracies in the use of punctuation obscure the meaning.
- Excessive elaboration of multiple ideas or definitions distracts the reader and makes the paragraph lengthy, tedious, and disjointed.
- The paragraph uses indirect language, involving the extensive use of linguistic techniques, metaphors, and similes that complicate the meaning instead of clarifying and simplifying it.
Table A3. Three-level readability guidelines for MSA documents.
Easy Document
- The text's subject is easy and interesting.
- The text's language is direct and clear.
- The sequence of ideas in the text is logical.
- The text maintains brevity without unnecessary elaboration.
- Verbal links between paragraphs in the text, along with examples and evidence that clarify the idea being conveyed, make it easier to understand.
Moderate Document
- The paragraphs are fairly sequential and coherent.
- There are inaccuracies in the use of punctuation.
- The abundance of unorganized numbers and statistics in the text makes it difficult for readers to understand its main idea.
- The text's idea is not entirely clear, requiring multiple readings to understand the intended meaning.
- The text is lengthy and contains many irrelevant details, making it a dull read.
Difficult Document
- The text's language is figurative and indirect.
- The text contains numerous ambiguous concepts and terms that are not clarified.
- The paragraphs in the text are not cohesive, which disrupts comprehension.
- The ideas expressed in the text are overly branched, which does not serve the discussed topic and distracts the reader.
- The text's subject matter is complex and directed at a specific audience, not a general one.
Table A4. Three-level readability guidelines for CA paragraphs.
Easy Paragraph
- The main idea of the paragraph is clear and straightforward.
- Simple and easy-to-understand vocabulary is used.
- The paragraph is short, which makes it enjoyable to read without causing boredom, and the writer uses concise sentences without compromising the meaning.
- Attention to diacritical marks and the accurate use of punctuation facilitate understanding.
- The paragraph maintains a logical connection between sentences along with the use of conclusive phrases that aid comprehension.
Moderate Paragraph
- The paragraph contains multiple ideas, which slightly hinder comprehension and demand multiple readings for thorough understanding.
- The repetition of ideas within the paragraph makes it longer and may bore the reader.
- Some of the words used in the paragraph are difficult to pronounce due to their uncommon usage, formal literary nature, specialization, or lack of proper diacritical marks.
- The text is intended for specialists but remains relatively understandable for nonspecialist readers.
- The places and characters mentioned in the paragraph may seem unfamiliar as they are associated with a different era.
Difficult Paragraph
- The paragraph is difficult to understand because the linguistic expressions used in it are drawn from older eras and, therefore, differ from contemporary styles.
- An abundance of vocabulary and terms hampers comprehension due to semantic shifts over time, rendering some words ambiguous.
- Inaccuracies in punctuation usage distort meaning.
- Excessive elaboration of ideas or definitions makes the paragraph lengthy, tedious, and incoherent.
- The language used in the paragraph is indirect, marked by the heavy use of linguistic techniques, similes, and metaphors that complicate the meaning instead of clarifying and simplifying it.
Table A5. Three-level readability guidelines for CA documents.
Easy Document
- The subject of the text is simple and engaging.
- The language used in the text is direct and clear.
- Ideas are presented in a logical sequence.
- Conciseness is maintained without unnecessary elaboration.
- There are verbal links between paragraphs, coupled with examples and evidence that clarify the main idea, making the text easy to understand.
Moderate Document
- The paragraphs of the text are fairly sequential and connected.
- Punctuation is used inaccurately.
- Differences in cultural and civilizational contexts hinder understanding, since the social conditions referred to in the text do not exist in the present time, and the names of the places mentioned have also changed.
- The main idea of the text is not entirely clear, requiring multiple readings to grasp the meaning.
- The text is lengthy and contains many unnecessary details, making it a tedious read.
Difficult Document
- The language used in the text is figurative and indirect.
- The text includes numerous ambiguous concepts and terms that are not sufficiently explained.
- Reliance on extensive references makes the paragraphs disconnected, disrupting the overall meaning.
- Digressions (ideas diverging in a way that does not serve the main topic) divert the reader's focus.
- The complexity and specialized nature of the subject make the text challenging, conveying that it is solely directed at experts.

Appendix C

Appendix C.1. Text-Based Metrics

Appendix C.1.1. Descriptive Features

These features include metadata about the texts, such as bibliographic details (e.g., book name) and characteristics indicating the nature of the texts (e.g., topic and language). Participants noted varying difficulty levels based on the topic, with subjects like philosophy being more challenging.
Table A6. Descriptive features of the texts.
Feature | Description
Book Name and Author | The complete name of a book and its author(s).
Document Code | Includes the topic of the text from a book and a unique identifier for the text.
Paragraph Code | Includes the topic of the paragraph, a unique identifier for the text, and the sequence number of the paragraph in its source text.
Book Language | The language of the text, whether MSA or CA. Although MSA originated from CA, it has evolved, leading to differences in structure and word complexity from CA [26].
Book Topic | The topic of the book from which the text was chosen. It was assumed that texts on different topics would have different readability levels. The possible values of this feature are grammar and morphology, literature and eloquence, history, geography and travel, health and nutrition, philosophy, politics, biography, sociology, technology, psychology, commerce, and arts.
Publication Century | The century in which the book containing the text was published, which was found to influence the text's readability level [26]. The possible values for this feature are 8, 9, 10, 11, 12, 14, 20, and 21.
Authorship Type | The gender of the text's author(s), which was assumed to affect the text's writing style, perspective, and reading experience. The possible values of this feature are single-gender (male) book, single-gender (female) book, and mixed-gender book.
Translation Type | The gender of the text's translator which, like the gender of the author(s), was assumed to affect the text's readability level. The possible values of this feature are single-gender (male) translation, single-gender (female) translation, mixed-gender translation, and no translation.
Author Count | The number of authors of the book from which the text was taken, which was assumed to affect text readability because each author has a different writing style, perspective, and experience.
Text Source | The part of the book from which the text was taken. It was assumed that texts taken from the author's introduction at the start of a book would be easier to read than other book contents, which are usually deeper and more detailed. The possible values of this feature are introductory content and other book content.

Appendix C.1.2. Textual Complexity Features

These features assess lexical and structural complexities, including word length, syllable count, and the presence of foreign or loan words [125]. They gauge text difficulty in readability research [7,106,109,125,131]. This study selected these features based on Section 4.1.3 (Collecting and Testing Corpus Texts) and Section 4.1.5 (Extracting and Testing Arabic Readability Guidelines), and their impact on participant-reported readability. Participants noted that difficulty can arise from the writer’s style, coherence of ideas, place names, technical terms, and unfamiliar or complex terms, including loan or foreign words. Loan words are borrowed from one language and adapted for use in another. In Arabic, loan words retain their original meaning. An example is تلفزيون (televizyoon)—television. Two linguists initially identified these words separately, with a third resolving disagreements.
Table A7. Textual components’ features.
Feature | Description
Listing Count | The number of all lists [e.g., bullet, number, letter, and number word (e.g., first, second, etc.) lists] in a text.
Parenthesis Count | The number of all parenthesis pairs containing additional information or abbreviations in a text, including all parenthesis pairs for textual content [e.g., (العالم العربي) and (التجمُّد)] and parenthesis pairs for numerical content [i.e., numbers, dates, times, years, and percentages; e.g., (١٩٦٠م)], and excluding parenthesis pairs used in numbered lists [e.g., (1) and (2)] and lettered lists [e.g., (أ) and (ب)] because these are accounted for in the Listing Count feature.
Parenthetical Expression Count | The number of all parenthetical expressions between two dashes in a text (e.g., "- بما فيها من قوة الحياة -").
Numerical Content Count | The number of all numerical content items, such as numbers, dates, times, years, and percentages, in a text (e.g., ٢٥٠٠ عام). Numbered lists [e.g., "-١", "-٢", "(١)", and "(٢)"] are excluded because they are part of the Listing Count feature and are not considered numerical content. Sequences of numbers with attached characters are also considered (e.g., "١٦٤٨م" and "٨٩٣ م -٨٩٤").
Religious Text Count | The number of all verses ("Ayah") of the Holy Qur'an and of the Hadith, a statement of the Prophet Muhammad (peace be upon him), in the text.
Poem Verse Count | The number of verses (e.g., "وخالدٌ يَحْمَدُ أصحابُهُ ... بالحَقِّ لا يُحمَدُ بالباطلِ") of Arabic poems in the text. One verse in an Arabic poem has two hemistichs (parts), which are separated by ellipses ("…"). However, some texts contain ellipses for other purposes, such as "ومنها ما هو عارضٌ كالأديان والغَزَوات... إلخ", which required manual revision.
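Two of the component counts above can be sketched with regular expressions. This is a simplified illustration, not the authors' implementation: the exclusion of single-item list markers is approximated with a one-character heuristic, and the sample sentence is invented.

```python
# Simplified sketch of Parenthesis Count and Numerical Content Count.
# The list-marker exclusion is a rough heuristic, not the study's procedure.
import re

ARABIC_DIGITS = "\u0660-\u0669"  # Arabic-Indic digits ٠-٩

def parenthesis_count(text):
    """Count parenthesis pairs, excluding single-character list markers
    such as (١) or (أ), which belong to the Listing Count feature."""
    pairs = re.findall(r"\(([^()]*)\)", text)
    single_marker = rf"[{ARABIC_DIGITS}0-9]|[\u0621-\u064A]"  # one digit or one letter
    return sum(1 for p in pairs if not re.fullmatch(single_marker, p.strip()))

def numerical_content_count(text):
    """Count runs of Western or Arabic-Indic digits (numbers, years, etc.).
    Note: unlike the feature described above, this simplified version does
    not exclude digits that are part of numbered lists."""
    return len(re.findall(rf"[0-9{ARABIC_DIGITS}]+", text))

sample = "تأسست المدينة عام (١٩٦٠م) وبلغ عدد سكانها ٢٥٠٠ نسمة. (١) أولاً"
print(parenthesis_count(sample))       # (١٩٦٠م) counts; (١) is a list marker
print(numerical_content_count(sample))
```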

Appendix C.1.3. Structural Complexity Features

These features investigate the language-level complexity of texts by considering their organization and structure [125].
Table A8. Textual complexity features.
Feature | Description
Character Count | The number of characters in a text, excluding punctuation [7,87,109] and diacritics [72]. While certain studies [5,7,109] have linked this feature to Arabic text difficulty, other studies, such as [106], indicate that it may not substantially impact word complexity.
Word Count | The number of words in a text. This represents the text length in tokens, using white space as a token separator [105,106,109,125].
Average Word Length | The average length of a word in characters per text [3,107], calculated as follows [3,7,74,109]: Average Word Length = Character Count per Text / Word Count per Text. This feature has been used in a great deal of readability research to show the density of a text, as a denser text with longer words tends to be more difficult to read than a less dense text [3,7,125].
Syllable Count | The total number of syllables in a text. Some studies indicate that, in Arabic, words with more syllables do not significantly impact readability [26], as words with over three syllables can still be simple [3,72], contrary to studies suggesting that having more syllables in words affects readability [3,106].
Average Syllables per Word | The average number of syllables per word, calculated as follows [3,107]: Average Syllables per Word = Syllable Count per Text / Word Count per Text.
Difficult Word Count | The number of difficult words in a text [7]. Scholars continue to debate the definition of "difficult words", but several studies have defined them as words with six or more letters [7,108,109]. In this study, difficult words were defined as OSMAN Faseeh words: words that have six or more characters and end with any of the following letters: ء ,ئ ,وء ,ذ ,ظ ,وا, and ون, as indicated in [72].
Average Difficult Word Count | The average number of difficult words in a text, calculated as follows [7,109]: Average Number of Difficult Words = Difficult Word Count per Text / Word Count per Text.
Unique Loan Word Count | The number of loan words used in a text, excluding repetitions.
Total Loan Word Count | The total number of loan words used in a text, counting repetitions. This is the same as the previous feature, except that every occurrence of a loan word is counted.
Unique Foreign Word Count | The number of foreign words used in a text (e.g., herbalists, Thomas More, and apothecary), excluding repetitions.
Total Foreign Word Count | The total number of foreign words, including repetitions, in a text. This is the same as the previous feature, except that each occurrence of a word is counted.
Foreign-Words-to-Token Ratio | The percentage of foreign words in a text [104], calculated as follows: Foreign-Words-to-Token Ratio = Total Foreign Word Count per Text / Word Count per Text.
Loan-Words-to-Token Ratio | The percentage of loan words in a text [104], calculated as follows: Loan-Words-to-Token Ratio = Total Loan Word Count per Text / Word Count per Text.
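A few of the surface features above can be sketched in a few lines. The snippet below follows the descriptions in the table (whitespace tokenization, the six-character difficult word heuristic with Faseeh endings), but the diacritic stripping and punctuation handling are simplified assumptions, not the study's exact preprocessing.

```python
# Illustrative sketch of Word Count, Character Count, Average Word Length,
# and the OSMAN-Faseeh Difficult Word Count described above.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun
FASEEH_ENDINGS = ("ء", "ئ", "وء", "ذ", "ظ", "وا", "ون")

def strip_diacritics(word):
    return DIACRITICS.sub("", word)

def surface_features(text):
    tokens = text.split()  # whitespace tokenization, as in the table
    words = [strip_diacritics(t.strip("؟!.,:؛")) for t in tokens]
    char_count = sum(len(w) for w in words)
    difficult = [w for w in words
                 if len(w) >= 6 and w.endswith(FASEEH_ENDINGS)]
    return {
        "word_count": len(words),
        "char_count": char_count,
        "avg_word_length": char_count / len(words),
        "difficult_word_count": len(difficult),
    }

f = surface_features("يذهب الطلاب المجتهدون إلى المدرسة")
print(f["word_count"])  # 5 whitespace-separated tokens
```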
Table A9. Structural complexity features.
Feature | Description
Sentence Count | The number of sentences in a text. This feature suggests that sentence length and structure affect text difficulty [7,109]. For an accurate readability assessment, sentences are counted based on meaning, focusing on complete, meaningful units rather than merely punctuation [105,108].
Average Sentence Length in Words | The average number of words in a sentence [3,5,107,125], calculated as follows [3,7,74,109]: Average Sentence Length in Words = Word Count per Text / Sentence Count per Text. This feature is widely considered a key measure of readability in readability formulas and studies [3,5,7,106,125,128] due to the belief that longer sentences are harder to read and understand [5,75].
Average Sentence Length in Characters | The average number of characters per sentence in the text, calculated as follows: Average Sentence Length in Characters = Character Count per Text / Sentence Count per Text. This feature indicates the density of a text. Denser texts, or those with higher average sentence lengths in characters, tend to be more difficult to read than less dense texts [3,7].
Paragraph Count | The number of paragraphs in a text. This might affect a text's organization and how easily readers can digest the information.
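The sentence-level ratios above can be sketched as follows. Note that the study counts sentences by meaning rather than punctuation; splitting on end punctuation here is only a rough proxy, and the sample sentence is invented.

```python
# Sketch of Sentence Count and the two average-sentence-length ratios,
# using end punctuation as a rough proxy for meaning-based segmentation.
import re

def sentence_features(text):
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    words = text.split()
    chars = sum(len(w) for w in words)  # simplification: punctuation not stripped
    n = len(sentences)
    return {
        "sentence_count": n,
        "avg_sentence_len_words": len(words) / n,
        "avg_sentence_len_chars": chars / n,
    }

sf = sentence_features("هذا نص قصير. وهذه جملة ثانية أطول قليلاً؟")
print(sf["sentence_count"])  # 2
```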

Appendix C.1.4. Readability Scores

Following other studies [13,72,110] in utilizing readability formulas to assess the linguistic and structural characteristics of a text, this study employed several readability formulas similar to those used in [72]: the OSMAN score, the Läsbarhetsindex (LIX) score, the Automated Readability Index (ARI) score, the Flesch Reading Ease score (Flesch), the Flesch–Kincaid score (Kincaid), and the Gunning Fog (Fog) score. These formulas are described in detail in [72].
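For orientation, two of the scores named above have well-known closed forms. The functions below implement the classic English-language versions of Flesch Reading Ease and ARI; the Arabic-adapted variants used in [72] (including OSMAN) adjust coefficients and features for Arabic, so this is only an illustration of the general shape of such formulas, and the counts passed in are toy values.

```python
# Classic English-language readability formulas (not the Arabic-adapted
# versions from [72]), shown to illustrate the shape of such scores.

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher = easier."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def automated_readability_index(characters, words, sentences):
    """ARI: approximates the US grade level needed to read the text."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# Toy counts: 100 words, 5 sentences, 150 syllables, 480 characters.
print(flesch_reading_ease(100, 5, 150))
print(automated_readability_index(480, 100, 5))
```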

Appendix C.1.5. Stylistic Features

Inspired by [105,125], in this study, different kinds of texts were used that show different styles of expressing meanings that an Arabic reader would experience. This made the corpus texts more representative of Arabic books.
Table A10. Stylistic features.
Feature | Description
Text Style | The method of choosing and composing words to express meanings for the purpose of clarification and influence. Possible values for this feature include scientific, literary, literary scientific, and social scientific.
Script Style | The method that the text writer used to prepare, organize, and produce the text. Possible values for this feature include argumentative, expository, guideline, narrative, informative, and demonstrative.
Linguistic Style | The approach that the text writer followed in creating vocabulary and structures to express meanings. Possible values for this feature include informative, structural, and mixed informative and structural.
Writing Technique | An expression mechanism innovated by the text writer. Possible values for this feature include critic, mentor and educator, objective researcher, narrator, and subjective.

Appendix D

Gaze-Based Metrics

The following tables list the gaze-based metrics, computed per AOI, that reflect visual effort during reading. Some metrics were calculated from others and some were directly extracted; their formulas are given where applicable, and all were analyzed based on predetermined AOIs.

Appendix D.1. Fixation Metrics

Fixations, in which the eyes remain still on an AOI, are highly valued and frequently used metrics that indicate either high attention to or processing difficulty in the pertinent AOI [43,44,132].
Table A11. Fixation-derived metrics for visual effort.
Metric | Description
Time to First Fixation | The period between the onset of a trial containing an AOI and the moment a participant fixated inside the AOI. It measures how long it takes participants to notice and fixate on the AOI. A longer time to first fixation suggests a longer task completion time [41,43,132].
Fixations Before | The number of times a participant fixated on the trial before first fixating on the AOI. The fixation count begins when the medium that contains the AOI is presented for the first time and ends with the participant's first fixation on the AOI [41].
First Fixation Duration | The duration of a participant's first fixation inside an AOI [16,43,111]. In reading studies, a higher first fixation duration indicates difficulty in processing the text by reflecting both syntactic processing and low-level lexical access [16].
Single Fixation Duration | The average duration of each (single) fixation inside an AOI [19,41,77]. While there is an assumption that this duration reflects a reader's engagement in reading [97], longer fixations are believed to reflect increased cognitive effort during reading [13,132,133].
Total Fixation Duration | The total time a participant spent fixating inside an AOI in a trial, including regressions to that AOI (refixations after the AOI was left) [13,17,36,40,41,43,44,95,110,111]. Longer fixations could indicate higher interest in or perceived importance of the AOI, but conversely, they could indicate deeper processing, possibly due to confusion with the AOI [13,16,110].
Average Fixation Duration | The average duration of all the fixations inside an AOI [13,17,19,43,133], calculated as follows: Average Fixation Duration = Total Fixation Duration / Total Fixation Count. This measure can distinguish AOIs that receive more attention and correlates strongly with text difficulty. Prolonged durations on certain words may indicate their complexity for the reader [43].
Total Fixation Count | The total number of fixations on a specific AOI of a trial [43,110,111]. A higher total fixation count suggests that the AOI was either attractive to the participant or required greater visual effort [43,44,133]. Increased fixations are associated with comprehension difficulties and text complexity in reading studies [13,17,132].
Percentage Fixated | The percentage of eye-tracking recordings in which the participants fixated on an AOI at least once [41].
Average Number of Fixations per Word | When working with text, the fixation count can be adjusted for text length by calculating the normalized number of fixations for each trial [17,133], as follows [17]: Average Number of Fixations per Word = Total Fixation Count / Word Count.
Fixation Rate | The number of fixations per second. For comprehension tasks, a high fixation rate indicates either participant interest or AOI difficulty. This metric is calculated as follows [134]: Fixation Rate = Total Fixation Count / Total Fixation Duration. Measuring both the fixation count and the fixation duration is crucial, as they can vary independently. Thus, the fixation rate provides insight into how frequently an AOI is fixated on relative to the overall time spent fixating on it [133,134].
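The derived fixation metrics above can be sketched from a stream of per-fixation records. The record format (AOI label, duration in milliseconds) and the function name below are assumptions for illustration, not the corpus's actual schema.

```python
# Hypothetical sketch: per-AOI fixation metrics from (aoi, duration_ms) records.

def fixation_metrics(fixations, aoi, word_count):
    """Compute a few of the Table A11 metrics for one AOI."""
    durs = [d for a, d in fixations if a == aoi]
    total = sum(durs)
    count = len(durs)
    return {
        "first_fixation_duration": durs[0] if durs else 0,
        "total_fixation_duration": total,           # includes refixations
        "total_fixation_count": count,
        "average_fixation_duration": total / count if count else 0.0,
        "fixations_per_word": count / word_count,   # length-normalized count
        "fixation_rate": count / (total / 1000.0) if total else 0.0,  # fix/s
    }

# Toy recording: three fixations on paragraph "p1", one on "p2".
recs = [("p1", 210), ("p1", 180), ("p2", 250), ("p1", 320)]
m = fixation_metrics(recs, "p1", word_count=12)
print(m["total_fixation_count"])  # 3
```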

Appendix D.2. Saccade Metrics

Saccade metrics assess fast eye movements, involving rapid acceleration and deceleration of the eye to shift focus points [43]. Although less text processing happens during saccades than during fixations [44], saccade metrics are useful for analyzing visual search patterns and effort [132,133]. Table A12 lists the saccade-based metrics used in past studies to gauge visual effort.
Table A12. Saccadic-derived metrics.
Table A12. Saccadic-derived metrics.
MetricDescription
Total Saccade Count The number of saccades that occurred in a trial [43]. A higher count indicates increased searching and mental workload, providing insight into how eye movements are affected by material difficulty [133].
Total Saccade Duration The time taken between the start and the end of the search path [44]. Shorter saccades indicate comprehension difficulties and higher mental workload [13,44,133], whereas longer saccades are associated with more readable texts and shorter reading times [38].
Average Saccade DurationThe estimated speed of processing information. A longer duration suggests increased cognitive effort, indicating that more time is spent on comprehending the content [97,110]. This is calculated as follows:
Average Saccade Duration = Total Saccade Duration/Total Saccade Count.
Saccadic Amplitude The saccade size is measured in degrees (the angular distance) [132]. This metric represents the distance spanned by the eyes during a saccade [43]. This distance tends to decrease as task difficulty and cognitive load increase, indicating the focused, detailed visual exploration that is commonly used to deal with complex, information-rich material requiring careful analysis [132].
Saccade-to-Fixation Ratio The ratio of time spent searching for information (saccades) to time spent cognitively processing it (fixations) [44]. A higher value of this metric indicates more searching compared to processing (more saccades and fewer fixations) [133]. This is calculated as follows:
Saccade-to-Fixation Ratio = Total Saccade Duration/Total Fixation Duration.
Absolute Saccadic Direction The angle between the horizontal axis and the current fixation point, with the prior fixation location serving as the origin of the coordinate system, calculated using a unit circle. The direction of a participant’s gaze implicitly reflects the participant’s area of attention [132].
Relative Saccadic Direction The angle change between the current saccade and the prior saccade, calculated as the difference between the absolute directions of two consecutive saccades [132].
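The duration-, ratio-, and direction-based saccade metrics in Table A12 can be sketched as follows. The function names, millisecond units, and (x, y) fixation tuples are illustrative assumptions rather than the schema of any eye-tracker export.

```python
import math

def saccade_metrics(saccade_durations_ms, fixation_durations_ms):
    """Count, total/average saccade duration, and saccade-to-fixation ratio."""
    total_saccade = sum(saccade_durations_ms)
    total_fixation = sum(fixation_durations_ms)
    return {
        "total_saccade_count": len(saccade_durations_ms),
        "total_saccade_duration_ms": total_saccade,
        # Average Saccade Duration = Total Saccade Duration / Total Saccade Count
        "average_saccade_duration_ms": total_saccade / len(saccade_durations_ms),
        # Saccade-to-Fixation Ratio = Total Saccade Duration / Total Fixation Duration
        "saccade_to_fixation_ratio": total_saccade / total_fixation,
    }

def absolute_direction(prev_fix, curr_fix):
    """Angle (degrees, unit circle) of the saccade landing at curr_fix,
    measured against the horizontal axis with prev_fix as the origin."""
    dx = curr_fix[0] - prev_fix[0]
    dy = curr_fix[1] - prev_fix[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

def relative_direction(prev_saccade_deg, curr_saccade_deg):
    """Signed angle change in [-180, 180) between two consecutive saccades."""
    return (curr_saccade_deg - prev_saccade_deg + 180.0) % 360.0 - 180.0
```

Under these conventions, a purely horizontal rightward saccade has an absolute direction of 0 degrees, and two consecutive saccades in the same direction have a relative direction of 0 degrees.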

Appendix D.3. Visit Metrics

A visit is defined as the period from the initial fixation on the AOI to the end of the final fixation inside the same AOI when there were no fixations outside of the AOI. These metrics include all the data collected during a specific period, including activities outside the AOI, such as saccades, blinks, and invalid data, but they exclude entry and exit saccades [41,43]. AOI visit metrics calculate statistics related to general looking behavior and visual search [43,44]. Tobii Studio, unlike more recent Tobii software, uses these metrics for regression analysis. Table A13 details metrics related to AOI visits.
Table A13. AOI-visit-derived metrics.
Metric Description
Total Visit Count The number of visits a participant made inside an AOI, reflecting how many times the participant fixated within the AOI [41,43]. This metric helps identify areas that captured a participant’s attention frequently [132], indicating their importance or the participant’s need to revisit them for understanding or memory [13,43,44].
Single Visit Duration The average duration of each (single) visit to an AOI [41].
Total Visit Duration The duration of all visits that occurred in an AOI, starting from the first fixation in this AOI until a further fixation occurred in a subsequent AOI [41,43]. Longer durations typically indicate greater difficulty in processing the text within the AOI [16].
Average Visit Duration The average duration of all visits to an AOI, indicating the average time spent fixating on an AOI [43]. It is calculated as follows:
Average Visit Duration = Total Visit Duration/Total Visit Count.
Average Number of Visits per Word For text analysis, visit counts can be normalized as follows to account for varying word counts in different texts [17,133]:
Average Number of Visits per Word = Total Visit Count/Word Count.
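The AOI visit metrics in Table A13 can be sketched in the same style; the dictionary keys, millisecond units, and function name are illustrative assumptions.

```python
# Hedged sketch of the AOI visit metrics in Table A13.

def visit_metrics(visit_durations_ms, word_count):
    """Visit count/duration statistics for one AOI, plus the per-word
    normalization used when the AOI covers a whole text."""
    total_count = len(visit_durations_ms)
    total_duration = sum(visit_durations_ms)
    return {
        "total_visit_count": total_count,
        "total_visit_duration_ms": total_duration,
        # Average Visit Duration = Total Visit Duration / Total Visit Count
        "average_visit_duration_ms": total_duration / total_count,
        # Average Number of Visits per Word = Total Visit Count / Word Count
        "average_visits_per_word": total_count / word_count,
    }
```

For example, two visits of 400 ms and 600 ms to a 50-word text yield an average visit duration of 500 ms and 0.04 visits per word.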

Appendix D.4. Pupil Metrics

Pupil metrics track changes in pupil size [132]. They are sensitive to light levels, individual differences, and focus shifts [133]. When the lighting level is controlled, fluctuations in the eye pupil size can offer useful information about the cognitive load and emotional response [43,97], with larger pupils showing greater effort [132,133] or negative emotions [42]. The pupil size is calculated as follows:
Pupil Size = (Right Pupil + Left Pupil)/2.
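The per-sample averaging above is a one-liner; the millimeter units and function name here are illustrative assumptions.

```python
# Hedged sketch of pupil-size averaging:
# Pupil Size = (Right Pupil + Left Pupil) / 2, applied sample by sample.

def mean_pupil_size(right_mm, left_mm):
    """Average the two per-eye pupil diameter streams sample by sample."""
    return [(r + l) / 2.0 for r, l in zip(right_mm, left_mm)]
```

As the text notes, these values are only interpretable as indicators of cognitive load or emotional response when the lighting level is controlled.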

Appendix D.5. Experimental Condition Metrics

These metrics are measured on the rating-table stimuli rather than on the experimental texts.
Table A14. Experimental condition metrics.
Metrics Submetrics
Rating Fixation Rating Total Fixation Duration, Rating Total Fixation Count, Rating Percentage Fixated
Rating Visit Rating Total Visit Count, Rating Total Visit Duration
Recording Duration Recording Duration

Appendix D.6. Recording of General Information Metrics

The recording length for a set of texts can affect a reader’s engagement: boredom or fatigue during a long session may, in turn, affect measured text readability. It is therefore necessary to consider the reading context of the experiment alongside the eye-tracking data.

References

  1. Balyan, R.; McCarthy, K.S.; McNamara, D.S. Comparing Machine Learning Classification Approaches for Predicting Expository Text Difficulty. In Proceedings of the Thirty-First International FLAIRS Conference, Melbourne, FL, USA, 21–23 May 2018; pp. 421–426. [Google Scholar]
  2. Collins-Thompson, K. Computational assessment of text readability: A survey of current and future research. ITL Int. J. Appl. Linguist. 2014, 165, 97–135. [Google Scholar] [CrossRef]
  3. Al-Khalifa, H.S.; Al-Ajlan, A.A. Automatic readability measurements of the Arabic text: An exploratory study. Arab. J. Sci. Eng. 2010, 35, 103–124. [Google Scholar]
  4. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Modern Standard Arabic Readability Prediction. In Proceedings of the Arabic Language Processing: From Theory to Practice (ICALP 2017), Fez, Morocco, 11–12 October 2017; pp. 120–133. [Google Scholar]
  5. Al-Ajlan, A.A.; Al-Khalifa, H.S.; Al-Salman, A.S. Towards the development of an automatic readability measurements for Arabic language. In Proceedings of the Third International Conference on Digital Information Management, London, UK, 13–16 November 2008; pp. 506–511. [Google Scholar]
  6. Dale, E.; Chall, J.S. The Concept of Readability. Elem. Engl. 1949, 26, 19–26. [Google Scholar]
  7. Al Tamimi, A.K.; Jaradat, M.; Al-Jarrah, N.; Ghanem, S. AARI: Automatic Arabic readability index. Int. Arab J. Inf. Technol. 2014, 11, 370–378. [Google Scholar]
  8. Baazeem, I. Analysing the Effects of Latent Semantic Analysis Parameters on Plain Language Visualisation. Master’s Thesis, The University of Queensland, Brisbane, Australia, 2015. [Google Scholar]
  9. Mesgar, M.; Strube, M. Graph-based coherence modeling for assessing readability. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, CO, USA, 4–5 June 2015; pp. 309–318. [Google Scholar]
  10. Cavalli-Sforza, V.; Saddiki, H.; Nassiri, N. Arabic Readability Research: Current State and Future Directions. Procedia Comput. Sci. 2018, 142, 38–49. [Google Scholar] [CrossRef]
  11. Feng, L.; Elhadad, N.M.; Huenerfauth, M. Cognitively motivated features for readability assessment. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 30 March–3 April 2009; pp. 229–237. [Google Scholar]
  12. Balakrishna, S.V. Analyzing Text Complexity and Text Simplification: Connecting Linguistics, Processing and Educational Applications. Ph.D. Thesis, Eberhard Karls Universität Tübingen, Tübingen, Germany, 2015. [Google Scholar]
  13. Vajjala, S.; Meurers, D.; Eitel, A.; Scheiter, K. Towards grounding computational linguistic approaches to readability: Modeling reader-text interaction for easy and difficult texts. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), Osaka, Japan, 11 December 2016; pp. 38–48. [Google Scholar]
  14. Vajjala, S.; Lucic, I. On understanding the relation between expert annotations of text readability and target reader comprehension. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, Florence, Italy, 2 August 2019; pp. 349–359. [Google Scholar]
  15. Mathias, S.; Kanojia, D.; Mishra, A.; Bhattacharyya, P. A Survey on Using Gaze Behaviour for Natural Language Processing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20) Survey Track, Yokohama, Japan, 7–15 January 2021; pp. 4907–4913. [Google Scholar]
  16. Singh, A.D.; Mehta, P.; Husain, S.; Rajkumar, R. Quantifying sentence complexity based on eye-tracking measures. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), Osaka, Japan, 11 December 2016; pp. 202–212. [Google Scholar]
  17. Copeland, L.; Gedeon, T.; Caldwell, S. Effects of text difficulty and readers on predicting reading comprehension from eye movements. In Proceedings of the 6th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Gyor, Hungary, 19–21 October 2015; pp. 407–412. [Google Scholar]
  18. Kennedy, A.; Hill, R.; Pynte, J.E. The Dundee Corpus. In Proceedings of the 12th European Conference on Eye Movements, Dundee, Scotland, 20–24 August 2003. [Google Scholar]
  19. Hollenstein, N. Leveraging Cognitive Processing Signals for Natural Language Understanding. Ph.D. Thesis, ETH Zurich, Zürich, Switzerland, 2021. [Google Scholar]
  20. Cop, U.; Dirix, N.; Drieghe, D.; Duyck, W. Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behav. Res. Methods 2017, 49, 602–615. [Google Scholar] [CrossRef]
  21. Mathias, S.; Murthy, R.; Kanojia, D.; Mishra, A.; Bhattacharyya, P. Happy are those who grade without seeing: A multi-task learning approach to grade essays using gaze behaviour. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 4–7 December 2020; pp. 858–872. [Google Scholar]
  22. Hollenstein, N.; Rotsztejn, J.; Troendle, M.; Pedroni, A.; Zhang, C.; Langer, N. ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Sci. Data 2018, 5, 13. [Google Scholar] [CrossRef]
  23. Hermena, E.W.; Drieghe, D.; Hellmuth, S.; Liversedge, S.P. Processing of Arabic diacritical marks: Phonological–syntactic disambiguation of homographic verbs and visual crowding effects. J. Exp. Psychol. Hum. Percept. Perform. 2015, 41, 494. [Google Scholar] [CrossRef]
  24. Al-Samarraie, H.; Sarsam, S.M.; Alzahrani, A.I.; Alalwan, N. Reading text with and without diacritics alters brain activation: The case of Arabic. Curr. Psychol. 2020, 39, 1189–1198. [Google Scholar] [CrossRef]
  25. Paterson, K.B.; Almabruk, A.A.A.; McGowan, V.A.; White, S.J.; Jordan, T.R. Effects of word length on eye movement control: The evidence from Arabic. Psychon. Bull. Rev. 2015, 22, 1443–1450. [Google Scholar] [CrossRef]
  26. Baazeem, I.; Al-Khalifa, H.; Al-Salman, A. Cognitively Driven Arabic Text Readability Assessment Using Eye-Tracking. Appl. Sci. 2021, 11, 8607. [Google Scholar] [CrossRef]
  27. El-Haj, M.; Kruschwitz, U.; Fox, C. Creating language resources for under-resourced languages: Methodologies, and experiments with Arabic. Lang. Resour. Eval. 2015, 49, 549–580. [Google Scholar] [CrossRef]
  28. Alnefaie, R.; Azmi, A.M. Automatic minimal diacritization of Arabic texts. Procedia Comput. Sci. 2017, 117, 169–174. [Google Scholar] [CrossRef]
  29. Hermena, E.W.; Bouamama, S.; Liversedge, S.P.; Drieghe, D. Does diacritics-based lexical disambiguation modulate word frequency, length, and predictability effects? An eye-movements investigation of processing Arabic diacritics. PLoS ONE 2021, 16, e0259987. [Google Scholar] [CrossRef]
  30. AlJassmi, M.A.; Hermena, E.W.; Paterson, K.B. Eye movements in Arabic reading. Exp. Arab. Linguist. 2021, 10, 85–108. [Google Scholar] [CrossRef]
  31. Roman, G.; Pavard, B. A comparative study: How we read in Arabic and French. In Eye Movements from Physiology to Cognition; Elsevier: Amsterdam, The Netherlands, 1987; pp. 431–440. [Google Scholar]
  32. Azmi, A.M.; Alnefaie, R.M.; Aboalsamh, H.A. Light Diacritic Restoration to Disambiguate Homographs in Modern Arabic Texts. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 21, 1–14. [Google Scholar] [CrossRef]
  33. Bouamor, H.; Zaghouani, W.; Diab, M.; Obeid, O.; Oflazer, K.; Ghoneim, M.; Hawwari, A. A pilot study on Arabic multi-genre corpus diacritization. In Proceedings of the Second Workshop on Arabic Natural Language Processing, Beijing, China, 26–31 July 2015; pp. 80–88. [Google Scholar]
  34. Klaib, A.F.; Alsrehin, N.O.; Melhem, W.Y.; Bashtawi, H.O.; Magableh, A.A. Eye tracking algorithms, techniques, tools, and applications with an emphasis on machine learning and Internet of Things technologies. Expert Syst. Appl. 2021, 166, 114037. [Google Scholar] [CrossRef]
  35. Singh, H.; Singh, J. Human eye tracking and related issues: A review. Int. J. Sci. Res. Publ. 2012, 2, 1–9. [Google Scholar]
  36. Conklin, K.; Pellicer-Sánchez, A. Using eye-tracking in applied linguistics and second language research. Sec. Lang. Res. 2016, 32, 453–467. [Google Scholar] [CrossRef]
  37. Just, M.A.; Carpenter, P.A. A theory of reading: From eye fixations to comprehension. Psychol. Rev. 1980, 87, 329. [Google Scholar] [CrossRef]
  38. Grabar, N.; Farce, E.; Sparrow, L. Study of readability of health documents with eye-tracking approaches. In Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), Tilburg, The Netherlands, 8 November 2018. [Google Scholar]
  39. Mathias, S.; Kanojia, D.; Patel, K.; Agarwal, S.; Mishra, A.; Bhattacharyya, P. Eyes are the windows to the soul: Predicting the rating of text quality using gaze behaviour. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2352–2362. [Google Scholar]
  40. Gonzalez-Garduno, A.V.; Søgaard, A. Using gaze to predict text readability. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, Copenhagen, Denmark, 8 September 2017; pp. 438–443. [Google Scholar]
  41. Tobii Technology AB. Tobii Studio User’s Manual (Version 3.4.8); Tobii Technology AB: Danderyd, Sweden, 2017; p. 172. [Google Scholar]
  42. Tobii Technology AB. Fundamentals. Available online: https://developer.tobii.com/xr/learn/analytics/fundamentals/ (accessed on 7 July 2023).
  43. Tobii Technology AB. Tobii Pro Lab User’s Manual (Version 1.130); Tobii Technology AB: Danderyd, Sweden, 2019. [Google Scholar]
  44. Wu, X.; Xue, C.; Zhou, F. An Experimental Study on Visual Search Factors of Information Features in a Task Monitoring Interface. In Proceedings of the Human-Computer Interaction: Users and Contexts, Los Angeles, CA, USA, 2–7 August 2015; pp. 525–536. [Google Scholar]
  45. Al-Wabil, A.; Al-Sheaha, M. Towards an interactive screening program for developmental dyslexia: Eye movement analysis in reading Arabic texts. In Proceedings of the 12th International Conference on Computers Helping People with Special Needs, Vienna, Austria, 14–16 July 2010; pp. 25–32. [Google Scholar]
  46. Al-Edaily, A.; Al-Wabil, A.; Al-Ohali, Y. Dyslexia Explorer: A Screening System for Learning Difficulties in the Arabic Language Using Eye Tracking. In Proceedings of the Human Factors in Computing and Informatics, Maribor, Slovenia, 1–3 July 2013; pp. 831–834. [Google Scholar]
  47. Al-Edaily, A.; Al-Wabil, A.; Al-Ohali, Y. Interactive Screening for Learning Difficulties: Analyzing Visual Patterns of Reading Arabic Scripts with Eye Tracking. In Proceedings of the HCI International 2013—Posters’ Extended Abstracts, Las Vegas, NV, USA, 21–26 July 2013; pp. 3–7. [Google Scholar]
  48. Blanken, G.; Dorn, M.; Sinn, H. Inversion errors in Arabic number reading: Is there a nonsemantic route? Brain Cogn. 1997, 34, 404–423. [Google Scholar] [CrossRef] [PubMed]
  49. Naz, S.; Razzak, M.I.; Hayat, K.; Anwar, M.W.; Khan, S.Z. Challenges in baseline detection of Arabic script based languages. In Proceedings of the Intelligent Systems for Science and Information: Extended and Selected Results from the Science and Information Conference, London, UK, 7–9 October 2013; pp. 181–196. [Google Scholar]
  50. Wiechmann, D.; Qiao, Y.; Kerz, E.; Mattern, J. Measuring the impact of (psycho-) linguistic and readability features and their spill over effects on the prediction of eye movement patterns. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 5276–5290. [Google Scholar]
  51. Frazier, L.; Rayner, K. Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cogn. Psychol. 1982, 14, 178–210. [Google Scholar] [CrossRef]
  52. Rayner, K.; Chace, K.H.; Slattery, T.J.; Ashby, J. Eye movements as reflections of comprehension processes in reading. Sci. Stud. Read. 2006, 10, 241–255. [Google Scholar] [CrossRef]
  53. Liversedge, S.P.; Paterson, K.B.; Pickering, M.J. Chapter 3—Eye Movements and Measures of Reading Time. In Eye Guidance in Reading and Scene Perception; Underwood, G., Ed.; Elsevier Science Ltd.: Amsterdam, The Netherlands, 1998; pp. 55–75. [Google Scholar]
  54. Raney, G.E.; Campbell, S.J.; Bovee, J.C. Using eye movements to evaluate the cognitive processes involved in text comprehension. J. Vis. Exp. 2014, 10, 50780. [Google Scholar] [CrossRef]
  55. Schroeder, S.; Hyönä, J.; Liversedge, S.P. Developmental eye-tracking research in reading: Introduction to the special issue. J. Cogn. Psychol. 2015, 27, 500–510. [Google Scholar] [CrossRef]
  56. Atvars, A. Eye movement analyses for obtaining Readability Formula for Latvian texts for primary school. Procedia Comput. Sci. 2017, 104, 477–484. [Google Scholar] [CrossRef]
  57. Sinha, A.; Roy, D.; Chaki, R.; De, B.K.; Saha, S.K. Readability analysis based on cognitive assessment using physiological sensing. IEEE Sens. J. 2019, 19, 8127–8135. [Google Scholar] [CrossRef]
  58. Zubov, V.I.; Petrova, T.E. Lexically or grammatically adapted texts: What is easier to process for secondary school children? In Proceedings of the 24th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Petersburg, Russia, 16–18 September 2020; pp. 2117–2124. [Google Scholar]
  59. Nassiri, N.; Cavalli-Sforza, V.; Lakhouaja, A. Approaches, Methods, and Resources for Assessing the Readability of Arabic Texts. ACM Trans. Asian Low Resour. Lang. Inf. Process. 2023, 22, 1–30. [Google Scholar] [CrossRef]
  60. Bensoltana, D.; Asselah, B. Exploration of Arabic reading, in terms of the vocalization of the text form by registering the eyes movements of pupils. World J. Neurosci. 2013, 3, 263–268. [Google Scholar] [CrossRef]
  61. Maroun, M.; Hanley, J.R. Are alternative meanings of an Arabic homograph activated even when it is disambiguated by vowel diacritics? Writ. Syst. Res. 2020, 11, 203–211. [Google Scholar] [CrossRef]
  62. Awadh, F.H.; Zoubrinetzky, R.; Zaher, A.; Valdois, S. Visual attention span as a predictor of reading fluency and reading comprehension in Arabic. Front. Psychol. 2022, 13, 868530. [Google Scholar] [CrossRef] [PubMed]
  63. Hallberg, A.; Niehorster, D.C. Parsing written language with non-standard grammar: An eye-tracking study of case marking in Arabic. Read. Writ. 2021, 34, 27–48. [Google Scholar] [CrossRef]
  64. Leung, T.; Boush, F.; Chen, Q.; Al Kaabi, M. Eye movements when reading spaced and unspaced texts in Arabic. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vienna, Austria, 26–29 July 2021; pp. 439–444. [Google Scholar]
  65. Khateb, A.; Asadi, I.A.; Habashi, S.; Korinth, S.P. Role of morphology in visual word recognition: A parafoveal preview study in Arabic using eye-tracking. Theory Pract. Lang. Stud. 2022, 12, 1030–1038. [Google Scholar] [CrossRef]
  66. Hermena, E.W.; Juma, E.J.; AlJassmi, M. Parafoveal processing of orthographic, morphological, and semantic information during reading Arabic: A boundary paradigm investigation. PLoS ONE 2021, 16, e0254745. [Google Scholar] [CrossRef]
  67. Al-Khalefah, K.; Al-Khalifa, H.S. Reading Process of Arab Children: An Eye-Tracking Study on Saudi Elementary Students. Int. J. Asian Lang. Process. 2021, 31, 2150003. [Google Scholar] [CrossRef]
  68. Hermena, E.W.; Reichle, E.D. Insights from the study of Arabic reading. Lang. Linguist. Compass. 2020, 14, 1–26. [Google Scholar] [CrossRef]
  69. Lahoud, H.; Eviatar, Z.; Kreiner, H. Eye-movement patterns in skilled Arabic readers: Effects of specific features of Arabic versus universal factors. Read. Writ. 2023, 37, 1079–1108. [Google Scholar] [CrossRef]
  70. Bessou, S.; Chenni, G. Efficient measuring of readability to improve documents accessibility for arabic language learners. J. Digit. Inf. Manag. 2021, 19, 75–82. [Google Scholar] [CrossRef]
  71. Nassiri, N.; Cavalli-Sforza, V.; Lakhouaja, A. MoSAR: Modern Standard Arabic Readability Corpus for L1 Learners. In Proceedings of the 4th International Conference on Big Data and Internet of Things (BDIoT’19), Rabat, Morocco, 23–24 October 2019; pp. 1–7. [Google Scholar]
  72. El-Haj, M.; Rayson, P. OSMAN―A Novel Arabic Readability Metric. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 250–255. [Google Scholar]
  73. Daud, N.M.; Hassan, H.; Aziz, N.A. A corpus-based readability formula for estimate of Arabic texts reading difficulty. World Appl. Sci. J. 2013, 21, 168–173. [Google Scholar]
  74. Al Aqeel, S.; Abanmy, N.; Aldayel, A.; Al-Khalifa, H.; Al-Yahya, M.; Diab, M. Readability of written medicine information materials in Arabic language: Expert and consumer evaluation. BMC Health Serv. Res. 2018, 18, 139. [Google Scholar] [CrossRef]
  75. Alotaibi, S.; Alyahya, M.; Al-Khalifa, H.; Alageel, S.; Abanmy, N. Readability of Arabic Medicine Information Leaflets: A Machine Learning Approach. Procedia Comput. Sci. 2016, 82, 122–126. [Google Scholar] [CrossRef]
  76. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Combining Classical and Non-classical Features to Improve Readability Measures for Arabic First Language Texts. In Proceedings of the International Conference on Advanced Intelligent Systems for Sustainable Development, Tangier, Morocco, 21–26 December 2020; pp. 463–470. [Google Scholar]
  77. Barrett, M. Improving Natural Language Processing with Human Data: Eye Tracking and Other Data Sources Reflecting Cognitive Text Processing. Ph.D. Thesis, University of Copenhagen, Copenhagen, Denmark, 2018. [Google Scholar]
  78. Luke, S.G.; Christianson, K. The Provo Corpus: A large eye-tracking corpus with predictability norms. Behav. Res. Methods. 2018, 50, 826–833. [Google Scholar] [CrossRef] [PubMed]
  79. Salicchi, L.; Chersoni, E.; Lenci, A. A study on surprisal and semantic relatedness for eye-tracking data prediction. Front. Psychol. 2023, 14, 1112365. [Google Scholar] [CrossRef] [PubMed]
  80. Hollenstein, N.; Troendle, M.; Zhang, C.; Langer, N. ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 138–146. [Google Scholar]
  81. Leal, S.E.; Duran, M.S.; Aluísio, S. A nontrivial sentence corpus for the task of sentence readability assessment in Portuguese. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 401–413. [Google Scholar]
  82. Leal, S.E.; Lukasova, K.; Carthery-Goulart, M.T.; Aluísio, S.M. RastrOS Project: Natural Language Processing contributions to the development of an eye-tracking corpus with predictability norms for Brazilian Portuguese. Lang. Resour. Eval. 2022, 56, 1333–1372. [Google Scholar] [CrossRef]
  83. Zhang, G.; Yao, P.; Ma, G.; Wang, J.; Zhou, J.; Huang, L.; Xu, P.; Chen, L.; Chen, S.; Gu, J.; et al. The database of eye-movement measures on words in Chinese reading. Sci. Data 2022, 9, 411. [Google Scholar] [CrossRef]
  84. Hollenstein, N.; Barrett, M.; Björnsdóttir, M. The Copenhagen Corpus of Eye Tracking Recordings from Natural Reading of Danish Texts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 1712–1720. [Google Scholar]
  85. Kazai, G.; Kamps, J.; Koolen, M.; Milic-Frayling, N. Crowdsourcing for book search evaluation: Impact of hit design on comparative system ranking. In Proceedings of the 34th International ACM SIGIR Conference on Research and development in Information Retrieval, Beijing, China, 24–28 July 2011; pp. 205–214. [Google Scholar]
  86. Aker, A.; El-Haj, M.; Albakour, M.-D.; Kruschwitz, U. Assessing Crowdsourcing Quality through Objective Tasks. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 21–27 May 2012; pp. 1456–1461. [Google Scholar]
  87. Vajjala, S.; Majumder, B.; Gupta, A.; Surana, H. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems; O’Reilly Media Inc.: Sebastopol, CA, USA, 2020. [Google Scholar]
  88. Hindawi Foundation. Available online: http://www.hindawi.org/ (accessed on 27 June 2021).
  89. Alrabiah, M.; Alsalman, A.; Atwell, E. The design and construction of the 50 million words KSUCCA King Saud University Corpus of Classical Arabic. In Proceedings of the WACL’2 Second Workshop on Arabic Corpus Linguistics, Lancaster University, UK, 22 July 2013; pp. 5–8. [Google Scholar]
  90. Fouad, M.M.; Atyah, M.A. MLAR: Machine Learning based System for Measuring the Readability of Online Arabic News. Int. J. Comput. Appl. 2016, 154, 29–33. [Google Scholar] [CrossRef]
  91. Chung-Fat-Yim, A.; Peterson, J.B.; Mar, R.A. Validating self-paced sentence-by-sentence reading: Story comprehension, recall, and narrative transportation. Read. Writ. 2017, 30, 857–869. [Google Scholar] [CrossRef]
  92. Aldayel, A.; Al-Khalifa, H.; Alaqeel, S.; Abanmy, N.; Al-Yahya, M.; Diab, M. ARC-WMI: Towards Building Arabic Readability Corpus for Written Medicine Information. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, Miyazaki, Japan, 8 May 2018; p. 14. [Google Scholar]
  93. Al Khalil, M.; Habash, N.; Jiang, Z. A large-scale leveled readability lexicon for Standard Arabic. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 3053–3062. [Google Scholar]
  94. Nielsen, J. Thinking Aloud: The #1 Usability Tool. Available online: https://www.nngroup.com/articles/thinking-aloud-the-1-usability-tool/ (accessed on 15 December 2022).
  95. Leal, S.E.; Vieira, J.M.M.; dos Santos Rodrigues, E.; Teixeira, E.N.; Aluísio, S. Using eye-tracking data to predict the readability of Brazilian Portuguese sentences in single-task, multi-task and sequential transfer learning approaches. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 5821–5831. [Google Scholar]
  96. Callison-Burch, C. Fast, cheap, and creative: Evaluating translation quality using Amazon’s Mechanical Turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009; pp. 286–295. [Google Scholar]
  97. Chen, Y.; Zhang, W.; Song, D.; Zhang, P.; Ren, Q.; Hou, Y. Inferring Document Readability by Integrating Text and Eye Movement Features. In Proceedings of the SIGIR2015 Workshop on Neuro-Physiological Methods in IR Research, Santiago, Chile, 2 December 2015. [Google Scholar]
  98. Biedert, R.; Dengel, A.; Elshamy, M.; Buscher, G. Towards robust gaze-based objective quality measures for text. In Proceedings of the Symposium on Eye Tracking Research and Applications, Santa Barbara, CA, USA, 28–30 March 2012; pp. 201–204. [Google Scholar]
  99. SR Research Ltd. SR Research Experiment Builder User Manual (Version 2.3.1); SR Research Ltd.: Ottawa, ON, Canada, 2020. [Google Scholar]
  100. Bhandari, P. Central Tendency|Understanding the Mean, Median & Mode. Available online: https://www.scribbr.com/statistics/central-tendency/ (accessed on 28 December 2022).
  101. Measures of Central Tendency. Available online: https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php (accessed on 20 August 2023).
  102. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Evaluating the Impact of Oversampling on Arabic L1 and L2 Readability Prediction Performances. In Networking, Intelligent Systems and Security; Springer International Publishing: Singapore, 2022; Volume 237, pp. 763–774. [Google Scholar]
  103. Jian, L.; Xiang, H.; Le, G. English Text Readability Measurement Based on Convolutional Neural Network: A Hybrid Network Model. Comput. Intell. Neurosci. 2022, 2022, 1–9. [Google Scholar] [CrossRef]
  104. Nassiri, N.; Lakhouaja, A.; Cavalli-Sforza, V. Arabic L2 readability assessment: Dimensionality reduction study. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 3789–3799. [Google Scholar] [CrossRef]
  105. Al Khalil, M.; Saddiki, H.; Habash, N.; Alfalasi, L. A leveled reading corpus of modern standard Arabic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
  106. Cavalli-Sforza, V.; El Mezouar, M.; Saddiki, H. Matching an Arabic text to a learners’ curriculum. In Proceedings of the 2014 Fifth International Conference on Arabic Language Processing (CITALA 2014), Oujda, Morocco, 26–27 November 2014; pp. 79–88. [Google Scholar]
  107. Salesky, E.; Shen, W. Exploiting Morphological, Grammatical, and Semantic Correlates for Improved Text Difficulty Assessment. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, Baltimore, MD, USA, 16 June 2014; pp. 155–162. [Google Scholar]
  108. Saddiki, H.; Bouzoubaa, K.; Cavalli-Sforza, V. Text readability for Arabic as a foreign language. In Proceedings of the 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), Marrakech, Morocco, 17–20 November 2015; pp. 1–8. [Google Scholar]
  109. Al Jarrah, E.Q. Using Language Features to Enhance Measuring the Readability of Arabic Text. Master’s Thesis, Yarmouk University, Irbid, Jordan, 2017. [Google Scholar]
  110. Mishra, A.; Bhattacharyya, P. Scanpath Complexity: Modeling Reading Effort Using Gaze Information. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 77–98. [Google Scholar]
  111. SR Research Ltd. EyeLink Data Viewer User’s Manual (Version 3.1.97); SR Research Ltd.: Ottawa, ON, Canada, 2017. [Google Scholar]
  112. Scikit-Learn. Scikit-Learn Machine Learning in Python. Available online: https://scikit-learn.org/stable/ (accessed on 5 October 2023).
  113. Brownlee, J. Ordinal and One-Hot Encodings for Categorical Data. Available online: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/ (accessed on 20 December 2022).
  114. Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 1998, 124, 372–422. [Google Scholar] [CrossRef]
  115. Ghosh, S.; Dhall, A.; Hayat, M.; Knibbe, J.; Ji, Q. Automatic Gaze Analysis: A Survey of Deep Learning Based Approaches. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 61–84. [Google Scholar] [CrossRef] [PubMed]
  116. Makowski, S.; Jäger, L.A.; Abdelwahab, A.; Landwehr, N.; Scheffer, T. A discriminative model for identifying readers and assessing text comprehension from eye movements. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Dublin, Ireland, 10–14 September 2018; pp. 209–225. [Google Scholar]
  117. Caruso, M.; Peacock, C.E.; Southwell, R.; Zhou, G.; D'Mello, S.K. Going Deep and Far: Gaze-Based Models Predict Multiple Depths of Comprehension during and One Week Following Reading. In Proceedings of the 15th International Conference on Educational Data Mining, International Educational Data Mining Society, Durham, UK, 24–27 July 2022; pp. 145–157. [Google Scholar]
  118. Copeland, L.; Gedeon, T. Measuring reading comprehension using eye movements. In Proceedings of the IEEE 4th International Conference on Cognitive Infocommunications (CogInfoCom), Budapest, Hungary, 2–5 December 2013; pp. 791–796. [Google Scholar]
  119. Copeland, L.; Gedeon, T.; Mendis, B.S.U. Predicting reading comprehension scores from eye movements using artificial neural networks and fuzzy output error. Artif. Intell. Res. 2014, 3, 35–48. [Google Scholar] [CrossRef]
  120. Sanches, C.L.; Augereau, O.; Kise, K. Using the Eye Gaze to Predict Document Reading Subjective Understanding. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; pp. 28–31. [Google Scholar]
  121. Gonzalez-Garduno, A.; Søgaard, A. Learning to predict readability using eye-movement data from natives and learners. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 5118–5124. [Google Scholar]
  122. Sarti, G.; Brunato, D.; Dell’Orletta, F. That looks hard: Characterizing linguistic complexity in humans and language models. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Virtual, 10 June 2021; pp. 48–60. [Google Scholar]
  123. Khallaf, N.; Sharoff, S. Automatic Difficulty Classification of Arabic Sentences. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP), Virtual, Kyiv, Ukraine, 19 April 2021; pp. 105–114. [Google Scholar]
  124. Berrichi, S.; Nassiri, N.; Mazroui, A.; Lakhouaja, A. Exploring the Impact of Deep Learning Techniques on Evaluating Arabic L1 Readability. In Artificial Intelligence, Data Science and Applications; Springer: Cham, Switzerland, 2024; pp. 1–7. [Google Scholar]
  125. Shen, W.; Williams, J.; Marius, T.; Salesky, E. A language-independent approach to automatic text difficulty assessment for second-language learners. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, Sofia, Bulgaria, 8 August 2013; pp. 30–38. [Google Scholar]
  126. Saddiki, H.; Habash, N.; Cavalli-Sforza, V.; Al Khalil, M. Feature optimization for predicting readability of Arabic L1 and L2. In Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, Melbourne, Australia, 19 July 2018; pp. 20–29. [Google Scholar]
  127. Forsyth, J.N. Automatic Readability Detection for Modern Standard Arabic. Master’s Thesis, Department of Linguistics and English Language, Brigham Young University, Provo, UT, USA, 2014. [Google Scholar]
  128. Berrichi, S.; Nassiri, N.; Mazroui, A.; Lakhouaja, A. Interpreting the Relevance of Readability Prediction Features. Jordanian J. Comput. Inf. Technol. 2023, 9, 36–52. [Google Scholar] [CrossRef]
  129. Berrichi, S.; Nassiri, N.; Mazroui, A.; Lakhouaja, A. Impact of Feature Vectorization Methods on Arabic Text Readability Assessment. In Artificial Intelligence and Smart Environment (ICAISE 2022); Springer: Cham, Switzerland, 2023; Volume 635, pp. 504–510. [Google Scholar]
  130. Ghani, K.A.; Noh, A.S.; Yusoff, N.M.R.N.; Hussein, N.H. Developing Readability Computational Formula for Arabic Reading Materials Among Non-native Students in Malaysia. In The Importance of New Technologies and Entrepreneurship in Business Development: In The Context of Economic Diversity in Developing Countries: The Impact of New Technologies and Entrepreneurship on Business Development; Springer: Cham, Switzerland, 2021; Volume 194, pp. 2041–2057. [Google Scholar]
  131. Brooke, J.; Tsang, V.; Jacob, D.; Shein, F.; Hirst, G. Building readability lexicons with unannotated corpora. In Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, Montréal, QC, Canada, 7 June 2012; pp. 33–39. [Google Scholar]
  132. Mahanama, B.; Jayawardana, Y.; Rengarajan, S.; Jayawardena, G.; Chukoskie, L.; Snider, J.; Jayarathna, S. Eye movement and pupil measures: A review. Front. Comput. Sci. 2022, 3, 733531. [Google Scholar] [CrossRef]
  133. Sharafi, Z.; Shaffer, T.; Sharif, B.; Guéhéneuc, Y.-G. Eye-tracking metrics in software engineering. In Proceedings of the 2015 Asia-Pacific Software Engineering Conference (APSEC), New Delhi, India, 1–4 December 2015; pp. 96–103. [Google Scholar]
  134. Mohamad Shahimin, M.; Razali, A. An eye tracking analysis on diagnostic performance of digital fundus photography images between ophthalmologists and optometrists. Int. J. Environ. Res. Public Health 2019, 17, 30. [Google Scholar] [CrossRef]
Figure 1. Examples of visualizations demonstrating the output of an eye tracker while participants read two Arabic texts of differing lengths: (a) gaze plot and (b) heatmap.
Figure 2. Corpus construction outline.
Figure 3. Details of the preparatory tasks for building the corpus.
Figure 4. Data collection summary.
Figure 5. Experiment settings.
Figure 6. Examples of Task 1 and Task 2: a participant (a) reading, (b) rating each paragraph, and (c) rating each text.
Figure 7. Distribution of documents and paragraphs across three readability levels. (a) At the document level, the medium readability level had a noticeably higher share than the easy and difficult levels, and (b) at the paragraph level, the easy level had a noticeably higher share than the medium and difficult levels.
Figure 8. Data preparation details.
Figure 9. Tokenization example for a paragraph, with the rectangles defining words. The different colored boxes in the figure are used solely for segmentation purposes by Tobii Studio and do not carry any specific meaning associated with their individual colors.
Figure 10. Summary of the extracted features.
Figure 11. (a–c) Three participants’ reading behaviors.
Figure 12. Reading time distributions.
Table 1. Corpus statistics.
Book Language | Document Count | Paragraph Count | Sentence Count | Word Count
MSA | 60 | 442 | 2835 | 37,486
CA | 32 | 145 | 1936 | 20,559
Total | 92 | 587 | 4771 | 58,045
Table 2. Details of the eye-tracking experiment participants.
Participant Code | APT Score | Gender | Age Range | Country of Origin | Academic Level | Major
P01 | 9 | F | 25–30 | Saudi Arabia | Master’s | Biochemistry
P02 | 9 | F | 25–30 | Sudan | Bachelor’s | Business Administration
P03 | 9 | F | 40–45 | Egypt | Bachelor’s | Arabic Linguistics
P04 | 10 | M | 30–35 | Saudi Arabia | Master’s | Information Systems
P05 | 9 | F | 35–40 | Sudan | Master’s | Biotechnology
P06 | 9 | F | 35–40 | Saudi Arabia | Doctoral | Arabic Literature
P07 | 9 | M | 35–40 | Saudi Arabia | Bachelor’s | Information Systems
P08 | 9 | F | 30–35 | Jordan | Master’s | Mathematics
P09 | 10 | F | 30–35 | Sudan | Bachelor’s | Computer Engineering
P10 | 9 | M | 40–45 | Saudi Arabia | Master’s | Internet Communication
P11 | 9 | M | 25–30 | Palestine | Bachelor’s | General Management
P12 | 9 | M | 40–45 | Saudi Arabia | Bachelor’s | Computer Engineering
P13 | 9 | M | 30–35 | Syria | Bachelor’s | Healthcare Management
P14 | 9 | M | 30–35 | Saudi Arabia | Doctoral | Dentistry
P15 | 9 | F | 20–25 | Yemen | Bachelor’s | Chemistry
Table 3. Distribution of readability ratings at the document and paragraph levels with MSA texts.
Readability Level | Documents | Paragraphs
Easy | 22 | 297
Medium | 38 | 139
Difficult | 0 | 6
Total | 60 | 442
Table 4. Distribution of readability ratings of ninety-two documents by fifteen participants.
Readability Level | MSA | CA | Total | Percent (%)
Easy | 22 | 4 | 26 | 28.26
Medium | 38 | 22 | 60 | 65.22
Difficult | 0 | 6 | 6 | 6.52
Total | 60 | 32 | 92 | 100
Table 5. Distribution of readability ratings of 587 paragraphs by fifteen participants.
Readability Level | MSA | CA | Total | Percent (%)
Easy | 297 | 59 | 356 | 60.65
Medium | 139 | 60 | 199 | 33.90
Difficult | 6 | 26 | 32 | 5.45
Total | 442 | 145 | 587 | 100
Table 6. Participants’ performance regarding the quality control measures.
Participant Code | No. of Correctly Answered Questions (CA) | No. of Correctly Answered Questions (MSA) | Total | Percent (%)
P01 | 14 | 28 | 42 | 77.78
P02 | 17 | 27 | 44 | 81.48
P03 | 15 | 32 | 47 | 87.04
P04 | 12 | 25 | 37 | 68.52
P05 | 15 | 34 | 49 | 90.74
P07 | 11 | 23 | 34 | 62.96
P08 | 12 | 25 | 37 | 68.52
P09 | 14 | 25 | 39 | 72.22
P10 | 16 | 31 | 47 | 87.04
P12 | 13 | 29 | 42 | 77.78
P15 | 16 | 26 | 42 | 77.78
P16 | 14 | 28 | 42 | 77.78
P17 | 14 | 26 | 40 | 74.07
P18 | 12 | 31 | 43 | 79.63
P19 | 14 | 28 | 42 | 77.78
Table 7. Text-based features.
Features | Subfeatures
General: Descriptive | Book Name and Author, Document Code, Paragraph Code, Book Language, Book Topic, Publication Century, Authorship Type, Translation Type, Author Count, Text Source
General: Textual Components | Parenthesis Count, Parenthetical Expression Count, Numerical Content Count, Listing Count, Religious Text Count, Poem Verse Count
Linguistic: Textual Complexity | Character Count, Word Count, Average Word Length, Syllable Count, Average Syllables Per Word, Difficult Word Count, Average Difficult Words Count, Unique Loan Word Count, Total Loan Word Count, Unique Foreign Word Count, Total Foreign Word Count, Foreign-Word-to-Token Ratio, Loan-Word-to-Token Ratio
Linguistic: Structural Complexity | Sentence Count, Average Sentence Length in Words, Average Sentence Length in Characters, Paragraph Count
Linguistic: Readability Scores | OSMAN Score, Lasbarhets Index Score, Automated Readability Index Score, Flesch Reading Ease Score, Flesch–Kincaid Score, Gunning Fog Score
Stylistic | Text Style, Script Style, Linguistic Style, Writing Technique
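As a minimal illustration of how the surface-level textual complexity features above can be computed, the sketch below derives word, character, and sentence statistics from raw text. The function name and input format are illustrative assumptions, and the tokenization is a deliberate simplification (whitespace splitting and punctuation-based sentence segmentation); real Arabic text would require proper morphological tokenization.

```python
import re

def surface_features(text: str) -> dict:
    """Compute a few surface-level complexity features (hypothetical
    helper; whitespace tokenization is a simplification for Arabic)."""
    # Split sentences on Latin and Arabic terminal punctuation.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    words = text.split()
    chars = sum(len(w) for w in words)
    return {
        "Word Count": len(words),
        "Character Count": chars,
        "Average Word Length": chars / len(words) if words else 0.0,
        "Sentence Count": len(sentences),
        "Average Sentence Length in Words":
            len(words) / len(sentences) if sentences else 0.0,
    }
```

Features such as syllable counts, loan-word ratios, and the readability scores would each need dedicated resources (a syllabifier, loan-word lexicons, and formula implementations) on top of this skeleton.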
Table 8. Types of eye-tracking reading metrics used.
Metrics | Submetrics
Fixation | Time to First Fixation, Fixations Before, First Fixation Duration, Single Fixation Duration, Total Fixation Duration, Total Fixation Count, Percentage Fixated, Average Fixation Duration, Average Number of Fixations per Word, Fixation Rate
Visit | Total Visit Count, Single Visit Duration, Total Visit Duration, Average Visit Duration, Average Number of Visits per Word
Saccade | Total Saccade Count, Total Saccade Duration, Saccadic Amplitude, Absolute Saccadic Direction, Relative Saccadic Direction, Average Saccade Duration, Saccade-to-Fixation Ratio
Pupil | Pupil Size
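Several of the aggregate submetrics in this table are simple functions of the raw fixation and saccade events. The sketch below shows how a few might be derived per paragraph; the function signature and input format (a list of per-fixation durations in seconds) are illustrative assumptions, not the actual extraction pipeline used in the study.

```python
def gaze_metrics(fixation_durations, saccade_count, word_count, trial_seconds):
    """Derive aggregate gaze metrics from per-fixation durations
    (hypothetical input format; durations and trial length in seconds)."""
    total_fix = len(fixation_durations)
    total_fix_dur = sum(fixation_durations)
    return {
        "Total Fixation Count": total_fix,
        "Total Fixation Duration": total_fix_dur,
        "Average Fixation Duration":
            total_fix_dur / total_fix if total_fix else 0.0,
        "Average Number of Fixations per Word":
            total_fix / word_count if word_count else 0.0,
        # Fixations per second over the whole trial.
        "Fixation Rate": total_fix / trial_seconds if trial_seconds else 0.0,
        "Saccade-to-Fixation Ratio":
            saccade_count / total_fix if total_fix else 0.0,
    }
```

Metrics tied to spatial layout (saccadic amplitude and direction, percentage fixated) additionally require fixation coordinates and the word-level areas of interest shown in Figure 9.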
Table 9. Values of the normality and skewness statistics.
Reading Time Metrics | Coefficient of Skewness (G)
Time to First Fixation | 0.512
First Fixation Duration | 0.248
Single Fixation Duration | 0.190
Total Fixation Duration | 4.796
Total Saccade Duration | 0.676
Single Visit Duration | 0.231
Total Visit Duration | 4.881
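A coefficient of skewness of this kind is commonly computed with the adjusted Fisher–Pearson estimator (G1), the variant most statistics packages report; the implementation below is a generic sketch of that estimator, not necessarily the exact routine used to produce the table.

```python
import math

def skewness_g1(values):
    """Adjusted Fisher-Pearson coefficient of skewness (G1)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    g1 = m3 / m2 ** 1.5                            # biased sample skewness
    # Bias correction for sample size n (requires n >= 3).
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)
```

Values near 0 indicate a roughly symmetric distribution, while the large positive values for Total Fixation Duration and Total Visit Duration reflect heavy right tails.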
Table 10. Reading durations in comparison with readability levels and OSMAN scores.
Reading Time Metrics | Easy | Medium | Difficult | Maximum/Minimum
Time to First Fixation | 10.990 | 18.271 | 20.535 | 20.535
First Fixation Duration | 0.247 | 0.249 | 0.259 | 0.259
Single Fixation Duration | 0.242 | 0.244 | 0.252 | 0.252
Total Fixation Duration | 0.025 | 0.026 | 0.031 | 0.031
Total Saccade Duration | 47.263 | 79.051 | 84.421 | 84.421
Single Visit Duration | 0.295 | 0.301 | 0.315 | 0.315
Total Visit Duration | 0.026 | 0.027 | 0.032 | 0.032
OSMAN Score | 129.586 | 127.694 | 125.482 | 125.482