Automatic Correction of Arabic Dyslexic Text

This paper proposes an automatic correction system that detects and corrects dyslexic errors in Arabic text. The system uses a language model based on the Prediction by Partial Matching (PPM) text compression scheme to generate possible alternatives for each misspelled word. The generated candidate list is based on edit operations (insertion, deletion, substitution and transposition), and the correct alternative for each misspelled word is chosen on the basis of the compression codelength of the trigram. The system is compared with widely-used Arabic word processing software and the Farasa tool. The system achieved good results compared with the other tools, with a recall of 43%, precision of 89%, F1 score of 58% and accuracy of 81%.


Introduction
Dyslexia is defined as a neurobiological condition characterised by an individual's inability to read, spell, decode text and recognise words accurately or fluently [1]. Dyslexia is considered to arise from a deficiency in the phonological dimension of language. It affects about 10% of the population, with around 4% of the world's population being severely dyslexic [2]. The Kuwait Dyslexic Association [3] identified a rate of 6% among Kuwaitis, while a later study by Elbeheri et al. [4] reported a rate of 20% among young Kuwaitis. Around 18% of United Arab Emirates University students were found to have evidence of deficits consistent with dyslexia [5].
Goodwin and Thomson and Washburn et al. [6,7], among others, have found that key characteristics of dyslexic writing include inelegance, imprecision, poor spacing, missing words, suffix errors, mistaken letter ordering and inconsistent spelling. In the case of Arabic writing, it is worth recognising that short vowels are not considered to be independent graphemes; rather, they are represented as additional diacritical marks. Diacritics of this kind are small symbols that are placed above or below letters, the purpose being to indicate to the reader how the short vowels in certain words should be pronounced [8]; for example, "i + n" indicates the kasratain short vowel, as in the word " " [E: "a ball" B: "krpK" R: "krtin"] (the syntax used here is as follows: "Arabic text" [E: "English translation" B: "Buckwalter transliteration" R: "ALA-LC Romanization"]). The literature indicates that dyslexic individuals often write diacritical marks as letters [9][10][11], as in " " [B: "krptn" R: "krttn"] instead of " ". Additionally, they also tend to write text from left to right instead of the right-to-left direction used when writing in Arabic [12]. Moreover, most Arabic letters have more than one written form, depending on the letter's place in a word: the beginning, middle or end. Errors of this kind are made as a result of confusion over letter shape [9]. Rello et al. [13] remarked that spelling errors have a significant influence on the way an individual is perceived within a community; spelling errors and the frequency of such errors are often viewed as being linked to an individual's intelligence. In this context, Graham et al.'s finding that technologies (e.g., spellcheckers) can be used to aid dyslexic individuals in minimising the incidence of spelling errors is particularly relevant.

Related Work
Issues relating to spelling error correction have been investigated for several decades and remain a topic of interest to NLP researchers. The first study was carried out by Damerau in 1964, who developed a rule-based string-matching technique based on substitution, insertion, deletion and transposition [23]. Kernighan et al. [24] used the noisy channel model for spelling correction. Church and Gale [25] used probability scores (word bigram probabilities) and a probabilistic correction process based on the noisy channel model for the purpose of spellchecking. Kukich [26] divided the spelling error problem into three tasks: error detection, isolated-word correction and context-sensitive correction. In 2000, Brill and Moore published a paper describing a new error model for noisy channel spelling correction based on generic string-to-string edits [27].
There are different approaches that can be used to solve the problem of spelling error detection and correction such as the dictionary-based approach used to detect error words [28], the noisy channel model using n-grams [24,25] and the edit-distance approaches [23] used for error correction. Specific approaches to Arabic spelling correction will be reviewed in the next section.
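As a concrete illustration of the edit-distance approach mentioned above, the following sketch (not part of any of the cited systems) implements the restricted Damerau-Levenshtein distance, which counts the four edit operations Damerau identified:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: the minimum number of
    insertions, deletions, substitutions and adjacent transpositions
    needed to turn string a into string b."""
    m, n = len(a), len(b)
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # adjacent transposition
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

A transposed pair such as "teh" versus "the" is one edit under this measure, whereas plain Levenshtein distance would count two.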

Arabic Spelling Correction
Several studies have characterised and classified spelling errors in Arabic [29,30], and some studies have also looked at dyslexic spelling errors in Arabic [9][10][11].
Several attempts have been made to detect and correct spelling errors in Arabic text using various combinations of approaches, such as rule-based and statistical approaches. Mars [31] developed a system for automatic Arabic text correction based on a sequential combination of lexicon-based, rule-based and statistical approaches; the F1 score obtained in the study was 67%. Likewise, AlShenaifi et al. [32] used rule-based, statistical and lexicon-based approaches in a cascaded fashion, with an F1 score of 57%. Zerrouki et al. [33] created a list that contained misspelled words with their corrections and also used regular expressions to detect errors and give a single replacement, which resulted in an F1 score of 20%.
Another study, by Mubarak and Darwish [34], employed two correction models (a character-level model and a case-specific model) and two punctuation recovery models (a simple statistical model and a conditional random fields model). The best result was obtained using a cascaded approach involving the character-level model followed by case-based correction, resulting in an F1 score of 63%. Alkanhal et al. [35] used the Damerau-Levenshtein edit distance to generate alternatives for each misspelled word. The most applicable word was then selected on the basis of the maximum marginal probability via an A* lattice search and n-gram probability estimation. The study focused on inserting and removing spaces. For misspelled word detection, the experimental results showed an F1 score of 98%; for misspelled word correction, the system achieved an F1 score of 92%.
Conversely, Zaghouani et al. [36] used regular expression patterns based on Arabic verb forms and affixes to detect errors, and built a rule-based correction method that applied linguistic rules using existing lexicons and regular expressions to correct native and non-native text. The system achieved an F1 score of 67% for native speakers and an F1 score of 32% for non-native speakers. Similarly, Nawar and Ragheb performed two studies, in 2014 and 2015 [37,38]. The first study developed a rule-based probabilistic system, which achieved a 65% F1 score on the data [37]. In 2015, Nawar and Ragheb [38] improved the previous statistical rule-based system by using word patterns to improve error correction. They also used a statistical system based on syntactic error correction rules; the system achieved an F1 score of 72% on the dataset of Aljazeera articles written by native Arabic speakers and an F1 score of 35% on the non-native speakers' data. Mubarak et al. [39] employed a case-specific correction approach that addressed particular errors, such as substitutions and word splits, and some errors that are specific to non-native speakers, such as gender-number agreement. The best result on non-native speakers' data was an F1 score of 27%.
Some studies have adopted the noisy channel model approach. Shaalan et al. [40] detected errors by building a character-based trigram language model in order to classify words as valid or invalid. For correction, they used finite-state automata to propose candidate corrections within a specified Levenshtein edit distance from the misspelled word. After generating candidate corrections, they used the noisy channel model and knowledge-based rules to score the candidates and choose the best correction independent of context. Additionally, Noaman et al. [41] used pairs of spelling errors and their corrected forms extracted from the Qatar Arabic Language Bank (QALB) to build an error confusion matrix, then used this matrix with the noisy channel model to generate a candidate list and select a suitable candidate for each erroneous word; the overall system accuracy obtained was 85%. On the other hand, a study by Attia et al. [29] attempted to improve three main components: the dictionary, the error model and the language model. They improved the error model by analysing error types and creating an edit distance-based re-ranker, and improved the language model by analysing the level of noise in different sources. By improving the three main components, they achieved an accuracy rate of 83%.

Dyslexia Spelling Correction
There are various studies that deal with dyslexic spelling correction in different languages. Pedler [42] developed a program to detect and correct dyslexic real-word spelling errors in English. The program considered words that differed in their parts-of-speech. Decisions for words that have the same parts-of-speech were left for the second stage, which used semantic associations derived from WordNet. Rello et al. [13] used a probabilistic language model, a statistical dependency parser and Google n-grams to detect and correct real-word errors in Spanish in a system called Real Check. For the Arabic language, Alamri and Teahan [43] investigated the possibility of utilising automatic noiseless channel model encoding-based correction techniques [44] as a form of assistance for dyslexic writers of Arabic, correcting the spelling errors of dyslexic writers in Arabic texts using a PPM model, but were only able to correct single-character errors.
Apart from Alamri and Teahan [43], there is a general lack of research into automatic spelling correction for Arabic dyslexic text. Despite this, a number of studies have investigated assistive technology to help dyslexics overcome reading and writing difficulties or to strengthen skills for dyslexic learners [45][46][47].

Commercial Tools
With respect to dyslexia spelling correction software, there is software such as Global AutoCorrect (https://www.lexable.com/education/) and Ghotit (https://www.ghotit.com), but both of these are only available for English. In contrast, we were not able to find any Arabic dyslexia software online. However, there is a toolbar called ATbar (https://www.atbar.org), which is a free cross-browser toolbar that can help in reading, spellchecking and word prediction when writing. The ATbar toolbar can help people with dyslexia and poor vision.

Differences from Previous Studies
The study presented in this paper differs from previous studies in several respects. Firstly, this study examines Arabic dyslexia, while previous studies were based on data from non-native speakers learning Arabic or on news data that was not written by dyslexics. The types of errors handled in the data of previous studies were punctuation errors, grammar errors, real-word spelling errors and non-word spelling errors; this study specifically examines the spelling errors made by dyslexic writers of Arabic text. Dyslexic writers misspell with a higher frequency and severity than non-dyslexic writers [48]. Many studies have observed that individuals with dyslexia struggle more with words that have difficult structures and spellings than with words that can be easily retained through repetition or which have simpler spellings [49,50]. Examples of dyslexic errors that differ from regular errors and are unique to the Arabic language are as follows: the Arabic language has unique characteristics, such as diacritics; some letters are written cursively; and the form of Arabic letters changes in accordance with their position within words, that is, whether they are placed at the beginning, in the middle or at the end of the word. For example, the letter ' ' has three shapes: at the beginning of a word, it is ' '; in the middle, it is ' '; and at the end, it is ' '. Dyslexic individuals have difficulties in writing the correct form of letters [43]. Furthermore, Abu-Rabia and Taha [9] found that individuals with dyslexia have difficulties in differentiating between letters with similar forms and different sounds, such as the letter ' ' [R = "gh"] and the letter ' ' [R = "ayn"].
Furthermore, they have difficulties differentiating between letters with similar sounds and different forms: in the word " " [E: "sound"], the letter ' ' sounds like the letter ' '. Moreover, they have difficulty with short and long vowels; for example, if the erroneous word was " " and the intended word was " " [E: "fruits"], the author wrote ' ' instead of the diacritic [R: "ksrh"] and omitted the letter ' ' [43]. Secondly, the system used in this study combines several approaches: a statistical approach using the PPM language model, and edit operations to generate possible alternatives for each error (a candidate list). The correct alternative for each misspelled word is then selected automatically using the compression codelength of the sentence, where the codelength is the number of bits required to encode the text using the compression algorithm. The contribution of this paper is that it proposes a new system, called Sahah, for the automatic spelling correction of dyslexic Arabic text and that it empirically shows how dyslexic errors in Arabic text can be corrected. Experiments were carried out to evaluate the usefulness of the system. Firstly, the accuracy of the system was evaluated using an Arabic corpus containing errors made by dyslexic writers. Secondly, the results of the system were compared with the results obtained using word processing software and the Farasa (http://qatsdemo.cloudapp.net/farasa/) tool.

The Sahah System for the Automatic Spelling Correction of Dyslexic Arabic Text
In order to propose an efficient spelling correction system for dyslexic Arabic text, it is necessary to study and categorise the error patterns of dyslexia. Such a study was the focus of the authors' previous work [11]. Following this study, a prototype of an automatic spelling correction system called Sahah, described in this paper, was created. The Sahah system is intended to correct dyslexic text automatically by using both a statistical and an edit operation approach, via a hybrid mechanism combining these different but complementary approaches to correcting the errors.
The workflow of the proposed Sahah system starts with transliteration using the Buckwalter transformation and consists of three stages. The first stage (Stage 1) is a pre-processing stage. The second stage (Stage 2) is a detection and correction stage that contains three further sub-stages: a sub-stage for error detection using a dictionary and two sub-stages for correction. The first sub-stage (2a) uses a statistical model to correct dyslexic errors according to their context, based on the dyslexic error classification for Arabic texts (DECA) [11]. The second sub-stage (2b) is for error detection and uses the AraComLex dictionary [29]. The third sub-stage (2c) is based on edit operations to generate a candidate list; the codelength of the trigram is then calculated in order to score the candidates and choose the correct word for each error.
The final stage (Stage 3) is the post-processing stage. The Sahah system ends with the reverse of the Buckwalter transliteration back to Arabic text. More details about each stage are described below, as presented in Figure 1.
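The three-stage workflow can be sketched as a simple function pipeline. The code below is purely illustrative: every stage function is a hypothetical parameter standing in for the components described in this paper (the PPM model, the AraComLex dictionary check and the Buckwalter transliteration), not the actual implementation.

```python
def sahah_correct(arabic_text, buckwalter, preprocess, statistical_correct,
                  detect, edit_correct, unbuckwalter):
    """Hypothetical orchestration of the Sahah stages; each stage is
    injected as a function since the real implementations rely on the
    PPM model and the AraComLex dictionary."""
    text = buckwalter(arabic_text)            # transliterate to Buckwalter
    words = preprocess(text)                  # Stage 1: split words, repeats
    words = statistical_correct(words)        # Stage 2a: PPM-based correction
    corrected = [edit_correct(w) if detect(w) else w
                 for w in words]              # Stages 2b + 2c
    return unbuckwalter(" ".join(corrected))  # Stage 3: back to Arabic
```

Plugging in trivial stand-ins (identity transliteration, whitespace tokenisation) shows the control flow without any linguistic resources.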

Pre-Processing Stage
While preparing the data for the process of error detection and correction, some phenomena, such as split words and repeated characters, were identified as causing 'noise' within the data, complicating the process. Analysis of the dyslexic errors indicated that the dyslexic author sometimes divides a word into two. This division could be due to the word having a short vowel or due to the pronunciation of the word.
The analysis of dyslexic errors indicated that in some cases, a dyslexic author inserts a space after prefixes or before suffixes; for instance, " " [B: "<n hA"] to represent the word " " [E: "it" B: "<nhA" R: "annaha"], which is not acceptable, especially in Arabic texts for children. The characters " " [B: "hA"] are an Arabic suffix; thus, the way to handle the space insertion was inspired by a light stemming process, which refers to a process of removing prefixes and/or suffixes without recognising patterns, dealing with infixes or finding roots [51]. However, instead of splitting off the prefix or suffix, we concatenated the two parts together as part of the pre-processing; for example, the word " " [B: "<n"] and the suffix " " [B: "hA"] become " " [E: "it" B: "<nhA" R: "annaha"].
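The concatenation step can be sketched as follows. The affix sets here are illustrative Buckwalter-transliterated placeholders (the paper's actual lists appear in Table 1), and a known-words set stands in for the dictionary check:

```python
# Illustrative affixes in Buckwalter transliteration; the real lists
# were derived from the dyslexic corpus analysis (Table 1).
PREFIXES = {"Al", "w", "b", "<n"}
SUFFIXES = {"hA", "wn", "At", "p"}

def merge_split_tokens(tokens, known_words):
    """Concatenate a stranded prefix with the next token, or a stranded
    suffix with the previous token, when the joined form is a known word."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and tok in PREFIXES and tok + nxt in known_words:
            out.append(tok + nxt)   # e.g. "<n" + "hA" -> "<nhA"
            i += 2
        elif nxt is not None and nxt in SUFFIXES and tok + nxt in known_words:
            out.append(tok + nxt)
            i += 2
        else:
            out.append(tok)
            i += 1
    return out
```

Checking the joined form against the dictionary keeps the merge conservative: token pairs that do not form a known word are left untouched.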
The most common prefixes and suffixes, selected on the basis of the dyslexic corpus analysis, are presented in Table 1. Hassan et al. [52] removed incorrect redundant characters from words. Likewise, our pre-processing stage corrects redundant characters, but with a modification: in the Arabic language, there are some words in which a character can legitimately be repeated twice; for instance, " " [E: "Excellent" B: "mmtaz"] repeats the character ' ' [B: 'm'] twice. Therefore, characters that were repeated more than twice were reduced to just two repeated characters, because no Arabic word contains three consecutive identical characters. However, some characters cannot be repeated twice consecutively; this case is handled by reducing the repeated characters to one. Table 2 illustrates the way repeated characters were removed.
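A minimal sketch of the repeated-character reduction might look like this; the set of characters that may never appear doubled is a hypothetical placeholder for the list given in the paper:

```python
import re

# Hypothetical placeholder for the paper's list of characters that can
# never legitimately appear twice in a row.
NEVER_DOUBLED = {"A", "}"}

def collapse_repeats(word: str) -> str:
    """Reduce runs of more than two identical characters to two, and runs
    of never-doubled characters to one."""
    out = []
    for m in re.finditer(r"(.)\1*", word):  # iterate over runs of one char
        ch, run = m.group(1), m.group(0)
        if ch in NEVER_DOUBLED:
            out.append(ch)                  # any run collapses to one
        else:
            out.append(ch * min(len(run), 2))  # keep at most two
    return "".join(out)
```

For example, a word with three consecutive 'm' characters is reduced to two, while a legitimate double 'm' (as in "mmtaz") is preserved.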

However, by using the pre-processing stage, the Sahah system can correct the error to the intended word, which is " " [E: "His sky" B: "smA}h" R: "smi"ah"].
Furthermore, it was found that introducing a step prior to the error detection and correction stage would resolve the issues of split words and repeated characters, thereby enhancing the accuracy of the detection and correction stage. Therefore, the pre-processing stage includes the tackling of split words and repeated characters.

Error Detection and Correction Stage
As stated, this stage is divided into three sub-stages: a sub-stage (2a) that employs the Prediction by Partial Matching (PPM) compression-based language model; a sub-stage (2b) that employs error detection based on a dictionary; and a sub-stage (2c) that employs edit operations to generate a candidate list, then scores the candidate list based on the codelength. The workflow of Stage 2 is shown in Figure 2, and each sub-stage is explained in more detail below. Table 3 illustrates the result of using the statistical sub-stage (2a) first, followed by the edit operation correction sub-stage (2c), and vice versa. As shown in Table 3, if the statistical sub-stage (2a) is applied first, followed by the detection (2b) and edit operation (2c) sub-stages, the system can correct all the errors in the sentence. Conversely, if the detection (2b) and edit operation (2c) sub-stages are applied first, followed by the statistical sub-stage (2a), the system can correct the transposition error only.

Sub-Stage 2a: Statistical Stage
This sub-stage is based on using the Prediction by Partial Matching (PPM) language model, which applies an encoding-based correction process. PPM is a lossless text compression method that was designed by Cleary and Witten [53] in 1984. PPM is an adaptive context-sensitive statistical method of compression. A statistical model sequentially processes the symbols (typically characters) that are currently available in the input data [54]. The algorithm essentially uses the previous input data to predict a probability distribution for upcoming symbols. For the correction process, we use an encoding-based noiseless channel model approach as opposed to the decoding-based noisy channel model [44].
As reported by Teahan et al. [55], a fixed order of five is usually the most effective on English text for compression purposes. The variant usually found to be most effective for both compression and correction purposes is PPMD. The experiments conducted for this paper used the version of PPMD implemented by the Tawa Toolkit combined with the noiseless channel model [44].
The following formula can be applied to calculate the probability p of the subsequent symbol ϕ for PPMD:

p(ϕ) = (2 c_d(ϕ) − 1) / (2 T_d)    (1)

where the current coding order is represented by d, the total number of times that the current context has occurred is denoted T_d and c_d(ϕ) is the total number of times the symbol ϕ has appeared in the current context [56]. A problem occurs (called the "zero frequency problem") when the current context cannot predict the upcoming symbol. In this case, PPM "escapes" or backs off to a lower order model where the symbol has occurred in the past. If the symbol has never occurred before, then PPM will ultimately escape to what is called an order −1 context, where all symbols are equiprobable. The escape probability e for PPMD is estimated as follows:

e = t_d / (2 T_d)    (2)

where t_d represents the number of unique characters that have appeared after the current context. For example, if the next character in the string "dyslexicornotdyslexic" to be encoded is o, we must make the prediction ic→o using the maximum order, let us say an order two context. Since the character o has been seen once before in the context ic, a probability of 1/2 will be assigned using Equation (1), since c_d(o) = 1 and T_d = 1. Correspondingly, 1 bit will be required by the encoder to encode the character, because −log2(1/2) = 1. However, if the subsequent character has not previously been seen in the order two context (i.e., presuming the next letter were n instead of o, say), it will be necessary to escape, or back off, to a lower order. In this case, the escape probability will be 1/2 (calculated by Equation (2)), and a lower order of one will then be applied by the model. When this happens, the character n will also not be present after the order one context c. As a result, the model will need to encode a further escape (whose probability will also be estimated as 1/2), and there will be a reduction in the current context to order zero.
At this order, the probability applied to encode the letter n will be 1/42. The total cost of predicting this letter is therefore 1/2 × 1/2 × 1/42 = 1/168, which costs around 7.39 bits to encode (−log2(1/168) ≈ 7.39). The probability p(S), where S is the sequence of m characters c_i being encoded, is estimated by training a PPM model on Arabic text:

p(S) = ∏_{i=1}^{m} p(c_i | c_{i−5} c_{i−4} c_{i−3} c_{i−2} c_{i−1})    (3)

where the p are the probabilities estimated by the order five PPM model. The codelength can be used to estimate the cross-entropy of the text and is calculated according to the following formula:

H(S) = −log2 p(S)    (4)

where H(S) is the number of bits required to encode the text. Improvements in prediction are possible through two mechanisms: full exclusions and update exclusions. Full exclusions result in symbols predicted by higher order contexts being excluded when an escape has occurred, while Update Exclusions (UE) only update the counts for the higher orders until an order is reached where the symbol has already been encountered [56]. On the other hand, when PPM is applied Without Update Exclusions (WUE), the counts for all orders of the model are updated; the counts are incremented even if the symbol is already predicted by a higher order context.
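The PPMD estimates and the codelength arithmetic in the worked example above can be reproduced in a few lines (a sketch of the formulas only, not of the full context-trie PPM model):

```python
import math

def ppmd_symbol_prob(count: int, total: int) -> float:
    """PPMD probability of a previously seen symbol: (2c - 1) / (2T)."""
    return (2 * count - 1) / (2 * total)

def ppmd_escape_prob(unique: int, total: int) -> float:
    """PPMD escape probability: t / (2T)."""
    return unique / (2 * total)

def codelength(p: float) -> float:
    """Number of bits needed to encode an event of probability p."""
    return -math.log2(p)

# The worked example: predicting 'o' after "ic" (c = 1, T = 1) costs 1 bit;
# an unseen 'n' escapes twice (1/2 each) and is coded at order 0 with 1/42.
p_seen = ppmd_symbol_prob(1, 1)                    # 1/2
p_unseen = ppmd_escape_prob(1, 1) ** 2 * (1 / 42)  # 1/2 * 1/2 * 1/42 = 1/168
```

Evaluating `codelength` on these two probabilities recovers the 1 bit and roughly 7.39 bit costs quoted in the text.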
To perform the experiment, 10% of the Bangor Dyslexic Arabic Corpus (BDAC), created by Alamri and Teahan [11,43], was used. The BDAC comprises 28,203 words collected from Saudi Arabian schools, forms and text provided by parents. The texts were written by dyslexics aged between eight and 13 years old, and the corpus contains text written by both male and female students. The 10% sample of the BDAC used here contains different types of errors. A large text corpus was needed in order to develop a well-estimated language model. This need was met by combining three corpora: the Bangor Arabic Compression Corpus (BACC), a 31,000,000-word corpus created by Alhawiti [57] for standardising compression experiments on Arabic text; a parallel corpus developed by Alkahtani [58] that includes 27,775,663 words in Arabic, based on corpora from Al Hayat articles and the open-source online corpora database; and the King Saud University Corpus of Classical Arabic (KSUCCA), which is part of research attempting to study the meanings of words used in the holy Quran through analysis of their distributional semantics in contemporaneous texts [59]. These three corpora combined are jointly referred to here as the BSK corpus. Two models were then trained on this corpus: one with update exclusions and one without.
These models were then used in the initial statistical sub-stage (2a). The findings indicated that using update exclusions produced a precision of 92%, recall of 62%, F1 score of 74% and accuracy of 82% for detection, and a precision of 80%, recall of 22%, F1 score of 35% and accuracy of 67% for correction. Without update exclusions, precision was 93%, recall 53%, F1 score 67% and accuracy 79% for detection, and precision was 86%, recall 26%, F1 score 40% and accuracy 69% for correction. As a result of these experiments, the model without update exclusions was selected for sub-stage (2a). Subsequently, two models, with and without update exclusions, were created using the BSK corpus to see which worked better for calculating the codelength in Sub-stage 2c; the results are presented in Table 4. These experiments revealed that the language model without update exclusions performed better than the model with update exclusions, which is compatible with the findings of Al-Kazaz for cryptanalysis [60].
The Tawa toolkit facilitates the definition of transformations in the form of an 'observed→corrected' rule, which denotes the transformation from the observed state to the corrected state when the noiseless channel correction process is applied. The PPM model was applied in order to correct these errors by searching through possible alternative spellings for each character and then using a Viterbi-based algorithm to find the most compressible sequence from these possible alternatives at the character level [61].
The Viterbi algorithm guarantees that the alternative with the best compression will be found by using a trellis-based search: all possible alternative search paths are extended at the same time, and the poorer performing alternatives that lead to the same conditioning context are discarded [44].
In order to correct the erroneous word " ", the confusions in Table 5 were considered; Table 6 below shows the output of utilising the PPM language model to calculate the codelengths. Table 6. The codelength of possible alternatives for each character by using the confusions in Table 5 for the erroneous word " ". Thus, the smallest codelength was given to the word " " [B: ">Hmd"], which is the correct version of the word. The pre-processing stage and the statistical stage covered many categories from the DECA, including the Hamza, almadd, diacritics, differences and form categories, but not the common errors, which are substitution, deletion, transposition and insertion. Norvig's approach [62] was deemed appropriate for this type of error. However, it is first necessary to know whether or not a word is erroneous; hence, Sub-stage 2b is required.

Sub-Stage 2b: Error Detection
The most direct means of detecting misspelled words is to search for each word in a dictionary and report the words that are not located therein. Based on this principle, an open-source dictionary containing nine million Arabic words was used to detect errors. The words in this dictionary were generated automatically from the AraComLex open-source finite state transducer [29]; it is a free resource that has proven effective in previous studies for correcting or detecting general spelling errors [40,52] and for detecting non-native spelling errors [36].
Prior to checking whether a word is in the AraComLex dictionary, any diacritical marks have to be removed, for two reasons. The first reason is that the dyslexic corpus itself does not contain diacritical marks; although dyslexic individuals have diacritical issues, for example writing the diacritic Tanwin as the character ' ' [B: 'n'] when the teacher dictates a word containing it, they do not usually put diacritical marks in their writing. The second reason is that the AraComLex does not contain diacritical marks. Thus, if the input word was not located in the AraComLex dictionary, as illustrated in Figure 2, it was considered to contain a spelling error and was passed to the edit operation Sub-stage 2c.
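A minimal sketch of this detection step, assuming the dictionary is loaded as a Python set and using the standard Unicode range for the Arabic harakat, tanwin, shadda and sukun marks:

```python
import re

# Combining marks U+064B..U+0652 cover the common Arabic diacritics
# (fathatan through sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def is_misspelled(word: str, dictionary: set) -> bool:
    """Strip diacritics, then flag the word if it is absent from the
    dictionary (the paper uses the nine-million-word AraComLex list)."""
    bare = DIACRITICS.sub("", word)
    return bare not in dictionary
```

A word carrying diacritics thus matches its bare dictionary entry, while any word not found after stripping is passed on to Sub-stage 2c.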

Sub-Stage 2c: Edit Operation
This sub-stage is based on edit operations, applying insertion (adding a letter), deletion (removing a letter), substitution (changing one letter to another) or transposition (swapping two adjacent letters) to the misspelled word, and returns the set of all edited strings that can be reached using one or two edit operations. A set of candidate corrections is thus generated, including real and non-real words. The candidate list was filtered with reference to the open-source AraComLex dictionary, commencing with the list of known words for the first edit operation, if any existed, and proceeding to the list of known words for the second edit operation. Once the Sahah system has generated the candidate list, the PPM language model is run to calculate the codelength of each candidate trigram (previous word, candidate word and next word), and the candidate word with the lowest trigram codelength is returned. Using the previous example in Section 3.2 above, " " [B: "wtbyn AnZmh Alt$ygl llHAswb"], following Sub-stage 2a, which corrected the second word " " [B: "wtbyn >nZmp Alt$ygl llHAswb"], there was still an error in the third word " " [B: "Alt$ygl"], which was a transposition under the common category. Table 7 shows the candidate list for the error word " " [B: "Alt$ygl"]. The lowest codelength is for the candidate word " " [B: "Alt$gyl"], which required 100.821 bits to encode the surrounding trigram. Therefore, the Sahah system corrected all errors in the following sentence: " " [E: "and show the operating systems of the computer" B: "wtbyn >nZmp Alt$gyl llHAswb"].
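A sketch of the candidate generation and selection, following Norvig's approach [62] as cited. The `trigram_codelength` parameter is a stand-in for the PPM model's codelength computation, and the alphabet would in practice be the Buckwalter-transliterated Arabic letters:

```python
def edits1(word, alphabet):
    """All strings one edit operation away from word: deletion,
    transposition, substitution and insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, dictionary, alphabet):
    """Known words within one edit, falling back to two edits."""
    e1 = edits1(word, alphabet) & dictionary
    if e1:
        return e1
    return {w2 for w1 in edits1(word, alphabet)
            for w2 in edits1(w1, alphabet)} & dictionary

def best_candidate(prev, word, nxt, dictionary, alphabet, trigram_codelength):
    """Choose the candidate whose surrounding trigram compresses best,
    i.e. has the lowest codelength in bits."""
    cands = candidates(word, dictionary, alphabet)
    return min(cands, key=lambda c: trigram_codelength(prev, c, nxt),
               default=word)
```

With an English toy dictionary and a dummy codelength function, "teh" yields the edit-1 candidates "the" and "ten", and the trigram score breaks the tie.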

Post-Processing Stage
The space omission problem was tackled using word segmentation during the post-processing stage.
In order to correct the segmentation of dyslexic errors where spaces had been omitted, the order five character-based PPM model was first trained on the three corpora (BSK). Two segmentations are possible for each character: the character itself and the character followed by a space. The Viterbi-based search algorithm, via the noiseless channel model approach, was used again to find the most probable segmentation, that is, the sequence of text with spaces inserted that had the lowest compression codelength according to the PPM language model. For example, a sample incorrect sequence is " " [B: "AlTA}rgrd"], while the intended sequence is " " [E: "the bird is chirping" B: "AlTA}r grd"]. The last step in the Sahah system is the reverse transliteration of the output back into Arabic.
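The lowest-codelength segmentation search can be sketched as a dynamic program. Here a word-level cost function stands in for the character-level PPM codelengths used by the actual system:

```python
def segment(text, word_cost):
    """Dynamic-programming search for the lowest-cost segmentation.
    word_cost(w) returns a codelength in bits for word w, standing in
    for the PPM model's compression codelength; float('inf') marks an
    implausible word."""
    n = len(text)
    best = [0.0] + [float("inf")] * n  # best[i]: cost of segmenting text[:i]
    back = [0] * (n + 1)               # back[i]: start of the last word
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + word_cost(text[j:i])
            if cost < best[i]:
                best[i], back[i] = cost, j
    # Recover the word boundaries by walking the back-pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))
```

This exhaustive O(n^2) dynamic program finds the same optimum a Viterbi trellis would; the trellis formulation simply organises the search so that poorer paths sharing a conditioning context can be discarded early.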

Evaluation
This section discusses the evaluation methodology and the experiments conducted to evaluate the performance of the Sahah system as a corrector of dyslexic Arabic spelling.

Evaluation Methodology
There are five possible outcomes of the Sahah system, based on the cases proposed by Pedler [42]. However, the case where "the error was considered by the program but wrongly accepted as correct" is not applicable to the Sahah system, so it was not adopted: once the Sahah system detects an error, the word is always changed, either to the correct word or to an incorrect alternative. Errors are dealt with in the first three cases below, and correctly spelt words in the last two.

Error detection evaluation: The error detection function evaluates whether a word is detected when compared with the gold-standard manual annotation. Recall, precision, F1 score and accuracy are calculated as follows:

Recall (Rec.) = TP / (TP + FN) (5)

Precision (Prec.) = TP / (TP + FP) (6)

F1 = (2 × Prec. × Rec.) / (Prec. + Rec.) (7)

Accuracy (Acc.) = (TP + TN) / (TP + TN + FP + FN) (8)

where:
True Positive (TP): a spelling error was successfully detected.
False Negative (FN): a spelling error was not detected.
False Positive (FP): a correctly-spelled word was detected as being misspelled.
True Negative (TN): a correctly-spelled word was detected as being correct.
The total of Case I and Case II words gives the TP, while FN is the number of Case III words, FP is the number of Case V words and TN is the number of Case IV words. Table 8 gives an example of how the recall, precision, F1 score and accuracy are calculated for a text that contains 60 error words.

Error correction evaluation: The error correction evaluation determines whether a word has been successfully or unsuccessfully corrected based on the gold-standard manual annotation. Recall, precision, F1 score and accuracy are then calculated in the same way, where:
True Positive (TP): a spelling error was successfully corrected.
False Negative (FN): a spelling error was not corrected.
False Positive (FP): a correctly-spelled word was changed.
True Negative (TN): a correctly-spelled word was not changed.
Using the same example given in Table 8, TP is the number of corrected words (Case I) and TN is the number of skipped words (Case IV), while FN is the total number of incorrect alternative words and missed words (Cases II and III) and FP is the number of false alarm words (Case V). Table 9 gives an example of how recall, precision, F1 and accuracy are calculated to evaluate the error correction: Rec. = 83%, Prec. = 68%, F1 = 75%, Acc. = 68%.
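The four metrics above can be computed from the four counts with a small helper. The counts used below are hypothetical values chosen only to reproduce the percentages quoted for the correction example; they are not taken from Table 9.

```python
# Compute the evaluation metrics of Equations (5)-(8) from the four
# counts. For correction evaluation, the case-to-count mapping in the
# text is: TP = Case I, FN = Cases II + III, FP = Case V, TN = Case IV.
def metrics(tp, fn, fp, tn):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, f1, accuracy

# Hypothetical counts for a text with 60 error words (TP + FN = 60):
rec, prec, f1, acc = metrics(tp=50, fn=10, fp=23, tn=20)
print(f"Rec. = {rec:.0%}, Prec. = {prec:.0%}, F1 = {f1:.0%}, Acc. = {acc:.0%}")
# -> Rec. = 83%, Prec. = 68%, F1 = 75%, Acc. = 68%
```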

Experimental Results
The Sahah system developed for this study was evaluated in two ways: (i) using the BDAC corpus, which consists of text written by people with dyslexia; and (ii) using a comparison with commonly-used spellcheckers/tools.
(i) Experiment using the BDAC corpus: This experiment used the BDAC corpus (28,203 words). The recall, precision and F1 score for the pre-processing stage and Sub-stage 2a using the PPM language model are presented in Table 10. When all stages and sub-stages were taken into consideration, the Sahah system achieved better results (see Table 11). The F1 score for correction increased by 14% when the edit-operations Sub-stage 2c was used. It is clear that the inclusion of Sub-stages 2b and 2c led to higher recall, precision, F1 score and accuracy.
(ii) Experiment using a comparison with commonly-used spellcheckers/tools: This comparison has two parts, namely a detection comparison and a correction comparison.

Detection comparison:
For our comparison, we evaluated the Sahah system against Microsoft Office 2013 and Ayaspell 3.0 (used in OpenOffice), because they are widely-used word processing packages and a number of previous studies have used them to evaluate their approaches [13,29,31,41]. Table 12 lists recall, precision, F1 score and accuracy on the BDAC corpus. The assessment of our system's ability to detect errors is based on the F1 score: Sahah's 83% (shown in bold font) was significantly higher than that of both Ayaspell for OpenOffice (69%) and Microsoft Word (64%). It is also noteworthy that the number of false negatives for Sahah was lower than that of the other systems, while its number of true positives was higher.
Correction comparison: The Sahah system does not show a suggestion list, which means no human interaction is needed to replace erroneous words. The spellcheckers investigated in Table 12 above are therefore not directly comparable with the correction component evaluated in these experiments. For comparison purposes, the results obtained for the Sahah system in Section 3.4.2 above were compared with those obtained using the Farasa tool, a text processing toolkit for Arabic.
Farasa comprises a segmentation/tokenisation module, a part-of-speech tagger, an Arabic text diacritizer and a spellchecker. Farasa is available online and operates in a similar way to the Sahah system, i.e., it corrects text automatically without showing a suggestion list. The use of Farasa is described in two papers [34,39], as detailed in Section 2.1; both studies reported results for correcting Arabic news, native and non-native text.
The results in terms of recall, precision, F1 score and accuracy using the BDAC corpus are presented in Table 13. Compared with the Farasa tool, the Sahah system achieved higher precision and recall: the number of true positives for Sahah is higher, while the number of false negatives is smaller. Although the Sahah system produced good recall, precision and F1 scores as discussed above, it could not detect some errors (Type I) and could not correct some errors that it had detected (Type II). These errors can be categorised as follows: • Type I: In some cases, the Sahah system could not detect an error because the erroneous word matched a word in the dictionary. Furthermore, it could not detect errors falling under the word boundary error category, for example the use of " " [B: "ly Eqwlhm"] instead of " " [E: "To their minds" B: "lyEqwlhm"], where both words are valid. One possible solution is to check word pairs instead of single tokens. It is worth noting, however, that neither the widely-used word processing software (Microsoft Office 2013 and Ayaspell 3.0 used in OpenOffice) nor the Farasa tool can detect this type of error. • Type II: If more than one letter in a word is deleted or added, the word becomes hard to correct. In such cases, the Sahah system inserted an alternative word. For example, for the erroneous word " " [B: "Altr"], which is missing three letters, the Sahah system substituted " " [B: "Albr"] when the intended word was " " [B: "Altrbyp"]. When the erroneous word contained more than three types of errors, the Sahah system could easily detect the error but could not correct it; for example, " " [B: "AlylAmlAy"] was used instead of " " [B: "Al<mlA}yp"] and contained five errors, which the Sahah system detected but then exchanged for the incorrect alternative " " [B: "Al}lAm l>y"].
• Type II: An incorrect alternative also occurred when the wrong candidate was chosen on the basis of the trigram codelength from the statistical language model. For example, for " " [E: "The thieves" B: "AlSwS"], the candidate list included " " [E: "The thieves" B: "AllSwS"] (94.727 bits) and " " [E: "The voice" B: "AlSwt"] (89.462 bits). The candidate list contained the intended word, but the smallest codelength was for [B: "AlSwt"], an incorrect alternative in this case. • Type II: Added words, deleted words or incorrect synonyms written for a word during dictation, such as " " [E: "Home" B: "Albyet"] instead of " " [E: "House" B: "Almnzel"], fall outside the scope of this study, as they are not spelling errors and are very rare in the BDAC corpus.

Conclusions
This paper addressed the problem of the automatic correction of errors in Arabic text written by dyslexic writers. It introduced the Sahah system, which automatically detects and corrects error words in dyslexic text. The Sahah system has three stages: a pre-processing stage, which corrects split words and repeated characters; a second stage, which uses the character-based PPM language model to identify all possible alternatives for the erroneous words and also uses edit operations to generate a candidate list, ranking the candidates by compression codelength; and a post-processing stage, which addresses omitted spaces by using a character-based PPM method to segment text containing dyslexic errors correctly.
An Arabic corpus containing errors made by dyslexic writers was used to evaluate the performance of the Sahah system. The spelling correction presented in this study significantly outperformed the Microsoft Word and Ayaspell systems in error detection, and the Farasa tool in error correction.