This order of sub-stages arose after discovering the importance of correcting some errors, such as the Hamza error type, before checking the word with the dictionary and generating the candidate list. For example, the sentence “وتبين انظمه التشيغل للحاسوب” [B: “wtbyn AnZmh Alt$ygl llHAswb”] contains three errors. The second word “انظمه” [B: “AnZmh”] contains two errors; Alif ‘A’ needs to be replaced with Alif hamza above [B: ‘<’], and Taa marbuta [B: ‘p’] needs to be used instead of Haa [B: ‘h’]. The third word is “التشيغل” [B: “Alt$ygl”] with the transposition of “غيـ” [B: “gy”] to “يغـ” [B: “yg”].

#### 3.2.1. Sub-Stage 2a: Statistical Stage

This sub-stage is based on using the Prediction by Partial Matching (PPM) language model, which applies an encoding-based correction process. PPM is a lossless text compression method designed by Cleary and Witten [53] in 1984. It is an adaptive, context-sensitive statistical method of compression: a statistical model sequentially processes the symbols (typically characters) that are currently available in the input data [54]. The algorithm essentially uses the previous input data to predict a probability distribution for upcoming symbols. For the correction process, we use an encoding-based noiseless channel model approach as opposed to the decoding-based noisy channel model [44].

As reported by Teahan et al. [55], a fixed order of five is usually the most effective on English text for compression purposes. The variant usually found to be most effective for both compression and correction purposes is PPMD. The experiments conducted for this paper used the version of PPMD implemented by the Tawa Toolkit combined with the noiseless channel model [44].

The formula below can be applied to calculate the probability $p$ of the subsequent symbol $\phi $ for PPMD:

$${p}_{d}\left(\phi \right)=\frac{2{c}_{d}\left(\phi \right)-1}{2{T}_{d}}\tag{1}$$

where the current coding order is represented by $d$, the total number of times that the current context has occurred is shown by ${T}_{d}$ and ${c}_{d}\left(\phi \right)$ is the total number of times the symbol $\phi $ has appeared in the current context [56].

A problem occurs (called the “zero frequency problem”) when the current context cannot predict the upcoming symbol. In this case, PPM “escapes”, or backs off, to a lower order model where the symbol has occurred in the past. If the symbol has never occurred before, then PPM will ultimately escape to what is called an order −1 context, where all symbols are equiprobable. The escape probability $e$ for PPMD is estimated as follows:

$${e}_{d}=\frac{{t}_{d}}{2{T}_{d}}\tag{2}$$

where ${t}_{d}$ represents the total number of unique characters that have appeared after the current context.
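As an illustration, the two PPMD estimates (Equations (1) and (2)) can be computed directly from the context counts. The sketch below is for exposition only and is not the Tawa toolkit's implementation:

```python
# Illustrative sketch of the PPMD estimates (not the Tawa toolkit's code).
# c: count of symbol phi in the current context, T: total number of times
# the context has occurred, t: number of distinct symbols seen after it.

def ppmd_symbol_probability(c: int, T: int) -> float:
    """Equation (1): p_d(phi) = (2*c_d(phi) - 1) / (2*T_d)."""
    return (2 * c - 1) / (2 * T)

def ppmd_escape_probability(t: int, T: int) -> float:
    """Equation (2): e_d = t_d / (2*T_d)."""
    return t / (2 * T)

# The "ic -> o" example from the text: o seen once (c = 1) in a context
# seen once (T = 1), with one distinct following symbol (t = 1).
print(ppmd_symbol_probability(1, 1))  # 0.5
print(ppmd_escape_probability(1, 1))  # 0.5
```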

For example, if the next character to be encoded in the string “dyslexicornotdyslexic” is o, we must make the prediction ic→o using the maximum order, say an order two context. Since the character o has been seen once before in the context ic, a probability of $\frac{1}{2}$ will be assigned by Equation (1), since $c$ = 1. Correspondingly, 1 bit will be required by the encoder to encode the character because $-{log}_{2}\frac{1}{2}=1$.

However, if the subsequent character has not previously been seen in the order two context (presuming, say, the next letter were n instead of o), it will be necessary to escape, or back off, to a lower order. In this case, the escape probability will be $\frac{1}{2}$ (calculated by Equation (2)), and a lower order of one will then be applied by the model. When this happens, the character n will also not be present after c. As a result, the model will need to encode a further escape (whose probability will also be estimated as $\frac{1}{2}$), and the current context will be reduced to order zero. At this order, the probability applied to encode the letter n will be $\frac{1}{42}$. The total cost of predicting this letter is therefore $\frac{1}{2}\times \frac{1}{2}\times \frac{1}{42}=\frac{1}{168}$, which requires around 7.39 bits to encode ($-{log}_{2}\frac{1}{168}\approx 7.39$).
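The cost of this escape chain can be checked with a few lines of arithmetic, using the probabilities derived above:

```python
import math

# Escape from order two (1/2), escape from order one (1/2), then encode n
# at order zero with probability 1/42.
p_total = (1 / 2) * (1 / 2) * (1 / 42)
bits = -math.log2(p_total)

print(p_total)         # 1/168
print(round(bits, 2))  # 7.39
```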

The probability $p(S)$ (where $S$ is the sequence of $m$ characters ${c}_{i}$ being encoded) is estimated by training a PPM model on Arabic text:

$$p\left(S\right)=\prod_{i=1}^{m}{p}^{\prime}\left({c}_{i}\mid{c}_{i-5}\dots {c}_{i-1}\right)\tag{3}$$

where ${p}^{\prime}$ are the probabilities estimated by the order five PPM model.

The codelength can be used to estimate the cross-entropy of the text, which can be calculated according to the following formula:

$$H\left(S\right)=-{log}_{2}\,p\left(S\right)=-\sum_{i=1}^{m}{log}_{2}\,{p}^{\prime}\left({c}_{i}\mid{c}_{i-5}\dots {c}_{i-1}\right)\tag{4}$$

where $H(S)$ is the number of bits required to encode the text.
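A minimal sketch of how the codelength follows from the per-character probabilities; the probabilities below are invented placeholders, not outputs of a real PPM model:

```python
import math

def codelength_bits(char_probs):
    """H(S) = -sum over i of log2 p'(c_i): bits needed to encode S."""
    return -sum(math.log2(p) for p in char_probs)

# Hypothetical per-character probabilities from an order five model.
probs = [0.5, 0.25, 0.125]
print(codelength_bits(probs))  # 1 + 2 + 3 = 6.0 bits
```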

Improvements in prediction are possible through two mechanisms: full exclusions and Update Exclusions (UE). With full exclusions, symbols already predicted by higher order contexts are excluded from lower order predictions once an escape has occurred, while Update Exclusions (UE) only update the counts for the higher orders, down to the order at which the symbol has already been encountered [56]. On the other hand, when PPM is applied Without Update Exclusions (WUE), the counts for all orders of the model are updated; the counts are incremented even if the symbol is already predicted by a higher order context.
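The difference between the two update policies can be sketched as follows. This is a simplified illustration with ad hoc per-order tallies, not the toolkit's data structures:

```python
# Simplified illustration of the two count-update policies.
# counts[k][symbol] is the tally for the order-k model; found_order is
# the lowest order at which the symbol was predicted without escaping.

def update_with_exclusions(counts, symbol, found_order):
    # Update exclusions: only orders from found_order upwards are updated.
    for k in range(found_order, len(counts)):
        counts[k][symbol] = counts[k].get(symbol, 0) + 1

def update_without_exclusions(counts, symbol, found_order):
    # Without update exclusions: every order is updated regardless.
    for k in range(len(counts)):
        counts[k][symbol] = counts[k].get(symbol, 0) + 1

counts_ue = [dict() for _ in range(3)]   # orders 0..2
counts_wue = [dict() for _ in range(3)]
update_with_exclusions(counts_ue, "n", found_order=1)
update_without_exclusions(counts_wue, "n", found_order=1)
print(counts_ue)   # order 0 left untouched: [{}, {'n': 1}, {'n': 1}]
print(counts_wue)  # all orders updated: [{'n': 1}, {'n': 1}, {'n': 1}]
```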

To perform the experiment, 10% of the Bangor Dyslexic Arabic Corpus (BDAC), created by Alamri and Teahan [11,43], was used; this subset contains different types of errors. The BDAC comprises 28,203 words collected from Saudi Arabian schools, forms and text provided by parents; the texts were written by dyslexic students, both male and female, aged between eight and 13 years old. Firstly, two models were created to enable input of text representative of the training corpus: one with update exclusions and one without. The training corpus was the Bangor Arabic Compression Corpus (BACC), a 31,000,000-word corpus created by Alhawiti [57] for standardising compression experiments on Arabic text. Two further corpora were also used: a parallel corpus developed by Alkahtani [58] that includes 27,775,663 words in Arabic, based on corpora from Al Hayat articles and the open-source online corpora database, and the King Saud University Corpus of Classical Arabic (KSUCCA), which is part of research attempting to study the meanings of words used in the holy Quran through analysis of their distributional semantics in contemporaneous texts [59]. These three corpora combined are jointly referred to here as the BSK corpus; a large text corpus of this kind was needed in order to develop a well-estimated language model.

Then, these models were used in the initial statistical sub-stage (2a). The findings indicated that, for detection, using update exclusions produced precision of 92%, recall of 62%, an $F_1$ score of 74% and accuracy of 82%; for correction, it produced precision of 80%, recall of 22%, an $F_1$ score of 35% and accuracy of 67%. Without update exclusions, detection precision was 93%, recall 53%, the $F_1$ score 67% and accuracy 79%; for correction, precision was 86%, recall 26%, the $F_1$ score 40% and accuracy 69%. As a result of the above experiments, the model without update exclusions was selected for Sub-stage 2a. Subsequently, two models with and without update exclusions were created using the BSK corpus to see which one worked better at calculating the codelength (Sub-stage 2c). The results are presented below in Table 4:

The results of these different experiments revealed that the language model without update exclusions performed better than the model with update exclusions, which is compatible with the findings of Al-kazaz for cryptanalysis [60].

The Tawa toolkit facilitates the definition of transformations in the form of an ‘observed→corrected’ rule, which denotes the transformation from the observed state to the corrected state when the noiseless channel correction process is applied. The PPM model was applied in order to correct these errors by searching through possible alternative spellings for each character and then using a Viterbi-based algorithm to find the most compressible sequence from these possible alternatives at the character level [61].

The Viterbi algorithm guarantees that the alternative with the best compression will be found by using a trellis-based search: all possible alternative search paths are extended at the same time, and the poorer performing alternatives that lead to the same conditioning context are discarded [44].
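The trellis search can be sketched in miniature as follows. This simplified version uses an order-1 (bigram) context and invented probabilities purely for illustration; the toolkit itself searches with a full PPM model:

```python
import math

# Simplified trellis search over per-character alternatives, in the
# spirit of the Viterbi-based search described above. The bigram
# probabilities in `model` are invented for illustration.

def bigram_bits(prev, ch, model, fallback=1 / 30):
    # Codelength (in bits) of ch given the previous character.
    return -math.log2(model.get((prev, ch), fallback))

def best_sequence(alternatives, model):
    # paths maps last character -> (total bits, sequence so far); paths
    # sharing the same conditioning context keep only the cheapest one.
    paths = {"": (0.0, "")}
    for alts in alternatives:
        nxt = {}
        for last, (bits, seq) in paths.items():
            for ch in alts:
                cost = bits + bigram_bits(last, ch, model)
                if ch not in nxt or cost < nxt[ch][0]:
                    nxt[ch] = (cost, seq + ch)
        paths = nxt
    return min(paths.values())[1]

# Toy model that makes "ab" the most compressible choice.
model = {("", "a"): 0.9, ("a", "b"): 0.9, ("a", "d"): 0.05}
print(best_sequence([["a", "o"], ["b", "d"]], model))  # "ab"
```

At each position, every surviving path is extended with every alternative, and only the cheapest path per conditioning context is retained, which is what keeps the search tractable.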

In order to correct the erroneous word “احمد” [B: “AHmd”], which contains one error, ‘ا’ [B: ‘A’] must be replaced with ‘أ’ [B: ‘>’]; the correct version is “أحمد” [B: “>Hmd”]. The toolkit generated a possible alternative for each character by using the confusion table shown in Table 5. From the confusion table, the character ‘ا’ [B: ‘A’] can be ‘إ’ [B: ‘<’], ‘أ’ [B: ‘>’] or ‘ى’ [B: ‘Y’]. Probabilities for each likely error can be estimated from a large training corpus.

Table 6 below shows the output of utilising the PPM language model to calculate the codelengths.

Thus, the smallest codelength was given to the word “أحمد” [B: “>Hmd”], which is the correct version of the word.

The pre-processing stage and the statistical stage covered many categories from the DEAC, including the Hamza, almadd, diacritics, differences and form categories, but they did not cover the common errors: substitution, deletion, transposition and insertion. Norvig’s approach [62] was deemed appropriate for this type of error. However, it is first necessary to know whether or not a word is erroneous; hence, Sub-stage 2b is required.