A Holistic Technique for an Arabic OCR System

Analytical approaches in Optical Character Recognition (OCR) systems can suffer from a significant number of segmentation errors, especially when dealing with cursive languages such as Arabic, where characters frequently overlap. Holistic approaches that consider whole words as single units were introduced as an effective way to avoid such segmentation errors. Still, the main challenge for these approaches is their computational complexity, especially when dealing with large vocabulary applications. In this paper, we introduce a computationally efficient, holistic Arabic OCR system. A lexicon reduction approach based on clustering similarly shaped words is used to reduce recognition time. Using global word level Discrete Cosine Transform (DCT) based features in combination with local block based features, our proposed approach managed to generalize to new font sizes that were not included in the training data. Evaluation results for the approach using different test sets from modern and historical Arabic books are promising compared with state-of-the-art Arabic OCR systems.


Introduction
Cursive script recognition has traditionally been handled by two major paradigms: a segmentation-based analytical approach and a word-based holistic approach. In the analytical approach, the input word is treated as a sequence of units (usually characters). Each unit is then individually recognized [1][2][3][4]. This approach has several disadvantages. The segmentation of cursive words is a challenging task, and any errors in that process will increase the errors in the following recognition step. Also, many of the fonts used for cursive scripts make extensive use of ligatures, where two or more letters are joined as a single glyph, which complicates character level segmentation. Figure 1 shows some challenging samples of Arabic words. A cursively written word cannot be recognized without being segmented and cannot be segmented without being recognized [5]. This phenomenon, known as Sayre's paradox, pushes the community to search for more effective solutions to the classification problem. A more direct and efficient methodology can be provided using holistic recognition [6]. The holistic approach handles the whole word as a single unit. A global feature vector is calculated for the undivided input word sample, which is then used to classify the word against a stored lexicon of words. Holistic recognition is inspired by what is known as the word superiority effect, which states that people recognize letters presented within words better than isolated letters or letters presented within non-words [7]. Holistic paradigms are not only effective, but can also preserve certain effects that are specific to the class under operation, such as coarticulation effects [8].
Several previous research efforts have investigated the holistic approach for Arabic cursive script recognition for both printed and handwritten text. Erlandson et al. [9] reported a word-level recognition system for machine-printed Arabic. They used an image-morphology based vector of features such as dots and hamzas, the direction of segments, the junctions and endpoints, the direction of cavities, holes, descenders and intra-word gaps. All these features are computed for a query word image in the recognition phase and are matched against a pre-computed database of vectors from a lexicon of Arabic words. That system achieved a word recognition rate of 65%. This accuracy was achieved with the integration of a lexicon pruning subsystem that is based on another recognition method developed under the same project, for a training set of 8436 word images scanned at 300 dpi.
Al-Badr et al. [10] developed an Arabic holistic word recognition system based on a set of shape primitives that are detected with mathematical morphology operations. That system was trained using a single font with three types of documents: ideal (noise-free), synthetically degraded and scanned. The feature extraction operators used were very sensitive to scanning noise and to degraded low resolution documents. That system achieved a recognition rate of 99.4% for noise-free documents. For synthetically degraded documents, the system accuracy decreased to 95.6%, and to 73% for scanned documents. All these evaluations were performed using a limited lexicon that contained 4317 words [10].
Khorsheed and Clocksin [11] presented a technique for recognizing Arabic cursive words from scanned images of text by transforming each word in a given lexicon into a normalized polar image and then applying a two-dimensional Fourier transform to that polar image. Each word is represented by a template that includes a set of Fourier coefficients, and for recognition, the system used a normalized Euclidean distance that measures the distance between the word under test and those templates. That system achieved a recognition rate of 90% for a lexicon size of 145 words and used 1700 word samples for training.
To get better performance, Khorsheed [12] presented a new system based on Hidden Markov Models (HMMs). In that system, each word was represented by a single HMM. The word models were trained using the Fourier spectrum of the word samples. The experiments were conducted on four fonts, and the reported results are for the Simplified Arabic and Arabic Traditional fonts only. The system achieved a higher recognition rate than the template-based recognizer. The highest achieved results for both fonts are 90% as the first choice and 98% within the top-ten choices.
In a later work, Khorsheed [13] presented a cursive Arabic text recognition system based on HMMs. This system was also segmentation-free, with an easy-to-extract statistical feature vector of 60 elements representing three different types of features. This system was trained with a data corpus that includes Arabic text of more than 600 A4-size sheets typewritten in six different computer-generated fonts: Tahoma, Simplified Arabic, Traditional Arabic, Andalus, Naskh and Thuluth. The highest achieved results were 88.7% and 92.4% for the Andalus font in mono-model and tri-model, respectively. In another experiment, that system was trained with a multi-font data set that was selected randomly with the same sample size from all fonts and tested with a data set consisting of 200 lines from each font, and achieved an accuracy of 95% using the tri-model.
In another effort, Krayem et al. [14] presented a word level recognition system using a discrete hidden Markov classifier along with a block based discrete cosine transform. This system was trained on typewritten Arabic words in five fonts at a size of 14 points, with a lexicon size of 252 words. Vector quantization was used to map each feature vector to the closest symbol in the codebook. The multiple recognition hypotheses (N-best word lattice) of that system achieved 97.65% accuracy. The holistic approach has also been used successfully at the subword level. Nasrollahi and Ebrahimi [15] presented an approach to offline OCR for printed Persian subwords using the wavelet packet transform. The proposed technique extracted font invariant and size invariant features from different subwords of four fonts and three sizes and compressed them using Principal Component Analysis (PCA). When tested on a subset of 2000 words of printed Persian text documents, that system achieved an accuracy of 97.9%.
In a later work [16], Slimane et al. organized the ICDAR2013 competition on multi-font and multi-size digitally represented Arabic text. The main characteristic of the winning system, the Siemens system submitted by Marc-Peter Schambach et al., was the use of a neural network with three hidden layers that transforms a two-dimensional pixel plane into a sequence of class probabilities. The system was applied to a subset of the APTI dataset [17] and achieved an accuracy of over 99%.
While the holistic approach avoids the challenging segmentation task of Arabic cursive scripts, it faces another challenge: dealing with the large lexicon size of Arabic. As the number of words in the lexicon grows, the recognition task becomes more computationally expensive. Most of the previously proposed holistic Arabic OCR systems were tested with small vocabularies, but this is not practical for Arabic, a morphologically rich language with a huge vocabulary.
In this paper, we propose a computationally efficient holistic Arabic OCR system for a large vocabulary size. To keep the approach practical, a lexicon reduction technique based on clustering similarly shaped words is used to minimize the word recognition time. The proposed system utilizes a hybrid of several holistic features that combine global word level DCT-based features and local block based features. Using these types of features, the system manages to achieve omnifont performance with font and size independence. Also, the presented system has a flexible architecture for integrating language modelling constraints by using a second rescoring pass for the top n-best word hypotheses. This rescoring operation provided a significant enhancement in the recognition accuracy of the system. The rest of the paper is organized as follows. Section 2 describes the proposed holistic OCR system. The holistic DCT features used are described in Section 3. The developed lexicon reduction technique is illustrated in Section 4. Section 5 describes the language rescoring process used by the system. Section 6 presents system evaluation results and a performance comparison with state-of-the-art commercial Arabic OCR systems. The final conclusions and prospects for future work are included in Section 7.

System Description
The developed holistic OCR system consists of two modules. The first is the training module, where the holistic features are extracted from the training set of word images. The extracted features are used to build the set of clusters of similar word shapes. The generated word clusters and their extracted features represent the knowledge base that is used in the recognition phase. The second module is the recognition module. In that module, after the preprocessing operations are applied to the input image, the detected text blocks are segmented into lines and words. The features are extracted for each word image, and then the word cluster, or the best n clusters, with the minimum Euclidean distance to the test image vector is assigned. The word list generated from the selected clusters is used to construct a word lattice of the possible recognition hypotheses for the whole line. This word lattice is rescored using an n-gram language model to get the best recognition hypothesis. Figure 2 shows the block diagram of the proposed holistic OCR system.

Feature Extraction
The proposed algorithm relies on the property that the DCT coefficients of an image form a decomposition vector that uniquely represents the input image, so that the image can be correctly reconstructed at a later decompression stage. In this work, the first 100-200 2D-DCT coefficients are used as word features, which provide a good approximation of the word image information. We experimented with three features: Discrete Cosine Transform (DCT), Discrete Cosine Transform 4-Blocks (DCT_4B), and a combination of DCT and DCT_4B.

Discrete Cosine Transform (DCT)
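The DCT features in our system are extracted via the two-dimensional DCT. The two-dimensional DCT of an M × N image f(x, y) is defined as follows:

$$C(u,v) = \alpha_M(u)\,\alpha_N(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x,y)\, \cos\!\left[\frac{(2x+1)u\pi}{2M}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right]$$

where 0 ≤ u ≤ M − 1, 0 ≤ v ≤ N − 1, and, in the common orthonormal form, $\alpha_K(0) = \sqrt{1/K}$ and $\alpha_K(k) = \sqrt{2/K}$ for k > 0.

After applying the DCT to the whole word image, the features are extracted in vector form by using the most significant DCT coefficients. The steps involved in DCT feature extraction, as shown in Figure 3, are:
1. Apply the DCT to the whole word image.
2. Perform the zigzag operation on the DCT coefficients I_dct.
The zigzag matrix I_z is a row vector whose first N values hold the low-frequency, high-energy coefficients that carry most of the word information. These values form the feature vector f_dct for each word.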
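The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`dct_matrix`, `zigzag_indices`, `dct_features`), the orthonormal scaling, the JPEG-style zigzag direction and the default of 100 coefficients are all assumptions made for the example.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis (assumed scaling): row u, column x."""
    x = np.arange(n)
    basis = np.cos((2 * x[None, :] + 1) * x[:, None] * np.pi / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0, :] = np.sqrt(1.0 / n)
    return basis


def zigzag_indices(rows, cols):
    """(row, col) pairs of a rows x cols grid in JPEG-style zigzag order."""
    order = []
    for s in range(rows + cols - 1):
        # Cells on anti-diagonal s satisfy r + c == s.
        diag = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        if s % 2 == 0:
            diag.reverse()  # alternate direction on successive diagonals
        order.extend(diag)
    return order


def dct_features(word_image, n_coeffs=100):
    """First n_coeffs zigzag-ordered 2D-DCT coefficients of a word image."""
    img = np.asarray(word_image, dtype=float)
    m, n = img.shape
    coeffs = dct_matrix(m) @ img @ dct_matrix(n).T        # I_dct
    order = zigzag_indices(m, n)[:n_coeffs]
    feats = np.array([coeffs[r, c] for r, c in order])    # head of I_z
    # Zero-pad if the image has fewer than n_coeffs pixels.
    return np.pad(feats, (0, n_coeffs - feats.size))
```

Because the zigzag scan visits low spatial frequencies first, truncating it keeps the coarse shape of the word and discards fine detail, which is the reason a short vector can still discriminate between word shapes.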

Discrete Cosine Transform 4-Blocks (DCT_4B)
In this feature set, we first find the Centre of Gravity (COG) of the image and use it as the starting point. To calculate the centre of gravity, the horizontal and vertical centres are determined by the following equations:

$$C_x = \frac{M_{(1,0)}}{M_{(0,0)}}, \qquad C_y = \frac{M_{(0,1)}}{M_{(0,0)}}$$

where C_x is the horizontal centre and C_y the vertical centre of gravity, and M_(p,q) are the geometrical moments of rank p + q:

$$M_{(p,q)} = \sum_{x}\sum_{y} \left(\frac{x}{\mathrm{width}}\right)^{p} \left(\frac{y}{\mathrm{height}}\right)^{q} f(x,y)$$

Here x and y index the pixels of the word image. Dividing x and y by the width and the height of the image, respectively, normalizes the geometrical moments and makes them invariant to the size of the word [18]. This method uses the COG and the DCT together: the COG serves as an auxiliary feature that divides the image into four parts, and the DCT is then applied to each part as a whole.
This feature set is extracted and implemented as follows:
1. Calculate the COG of the word image and use it as the starting point, as explained in Equations (1)-(4).
2. Use the vertical and horizontal COG to divide the word image into four regions.
3. Apply the DCT to each part of the word image.
4. Perform the zigzag operation on the DCT coefficients of each image part to get the first N/4 values, which contain most of the word information in that word part.
5. Repeat Steps 3 and 4 sequentially for all the word parts, and then combine the results to form the feature vector of the word image.
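The steps above can be sketched as follows. This is a hedged illustration under several assumptions that the text does not fix: the word image is a 2D array with nonzero foreground (ink) pixels and at least two rows and columns, the DCT uses orthonormal scaling, and the four block vectors are concatenated in top-left, top-right, bottom-left, bottom-right order. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis (assumed scaling): row u, column x."""
    x = np.arange(n)
    basis = np.cos((2 * x[None, :] + 1) * x[:, None] * np.pi / (2 * n))
    basis *= np.sqrt(2.0 / n)
    basis[0, :] = np.sqrt(1.0 / n)
    return basis


def zigzag_dct(block, k):
    """First k zigzag-ordered 2D-DCT coefficients of one image block."""
    m, n = block.shape
    coeffs = dct_matrix(m) @ block @ dct_matrix(n).T
    cells = [(r, c) for r in range(m) for c in range(n)]
    # Sort by anti-diagonal, alternating direction (JPEG-style zigzag).
    cells.sort(key=lambda rc: (rc[0] + rc[1],
                               rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))
    vals = np.array([coeffs[r, c] for r, c in cells[:k]])
    return np.pad(vals, (0, k - vals.size))


def centre_of_gravity(img):
    """Normalized (C_x, C_y) from the size-normalized geometric moments."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    m00 = img.sum()
    if m00 == 0:                      # blank image: fall back to the middle
        return 0.5, 0.5
    return (xs / w * img).sum() / m00, (ys / h * img).sum() / m00


def dct_4b_features(word_image, n_coeffs=100):
    """Split the image at its COG and keep n_coeffs/4 coefficients per block."""
    img = np.asarray(word_image, dtype=float)
    h, w = img.shape
    cx, cy = centre_of_gravity(img)
    col = int(np.clip(round(cx * w), 1, w - 1))   # keep every block non-empty
    row = int(np.clip(round(cy * h), 1, h - 1))
    blocks = [img[:row, :col], img[:row, col:], img[row:, :col], img[row:, col:]]
    return np.concatenate([zigzag_dct(b, n_coeffs // 4) for b in blocks])
```

Splitting at the centre of gravity rather than at the geometric middle lets the four blocks follow the ink distribution of each word, so the local features stay comparable when the word is scaled.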

Hybrid DCT and DCT_4B (DCT + DCT_4B)
This feature combines the two features DCT and DCT_4B.

Lexical Reduction and Clustering
To reduce the computation time of searching the whole lexicon in the recognition phase, similarly shaped words are clustered together. The word search is performed in two steps. In the first step, the word cluster or the nearest n clusters are determined; in the second, the best matching word inside those clusters is selected as the recognition output. For word clustering, we used the LBG algorithm [19] to cluster the words in each group according to the closeness of the word shapes as measured by the features used. For the clustering process, we used the same DCT and DCT_4B features that we use for the word recognition phase.
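The two-step search can be illustrated with a small sketch. It assumes a splitting variant of LBG with an additive perturbation, a codebook size that is a power of two, and squared Euclidean distance; these choices and all names (`lbg_codebook`, `two_step_search`) are assumptions made for the example, not details taken from the system described here.

```python
import numpy as np


def sq_dists(a, b):
    """Matrix of squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)


def lbg_codebook(features, size, n_iter=20, eps=1e-3):
    """LBG by splitting: start from the global mean, double, then refine."""
    codebook = features.mean(axis=0, keepdims=True)
    while codebook.shape[0] < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = np.vstack([codebook + eps, codebook - eps])
        for _ in range(n_iter):
            labels = sq_dists(features, codebook).argmin(axis=1)
            for k in range(codebook.shape[0]):
                members = features[labels == k]
                if len(members):              # empty cells keep their codeword
                    codebook[k] = members.mean(axis=0)
    return codebook


def two_step_search(query, codebook, labels, features, words, top_n=1):
    """Step 1: pick the top_n nearest clusters. Step 2: best word inside them."""
    d_clusters = sq_dists(query[None, :], codebook)[0]
    best_clusters = np.argsort(d_clusters, kind="stable")[:top_n]
    candidates = np.flatnonzero(np.isin(labels, best_clusters))
    if candidates.size == 0:                  # selected clusters are empty
        return None
    d_words = sq_dists(query[None, :], features[candidates])[0]
    return words[candidates[d_words.argmin()]]


# Toy lexicon: eight "words" with 2-D features (purely illustrative values).
features = np.array([[0, 0], [0, 1], [10, 0], [10, 1],
                     [0, 10], [1, 10], [10, 10], [10, 11]], dtype=float)
words = ["w0", "w1", "w2", "w3", "w4", "w5", "w6", "w7"]
codebook = lbg_codebook(features, size=4)
labels = sq_dists(features, codebook).argmin(axis=1)
```

With this layout, a query is compared against the codebook and then only against the members of the selected clusters, instead of against every lexicon entry.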
To measure the accuracy of the clustering step, and thus of the lexical reduction, we used a clustering accuracy measure: the fraction of test words whose correct lexicon entry lies within the selected cluster or clusters. For a vocabulary size of around 356,000 words in Simplified Arabic font (14 pt.), we tested the clustering accuracy using a test set of 3465 words and a codebook size of 1024. Table 1 shows the clustering accuracy rate of the tested words using the three implemented features when varying the number of selected clusters from one to 10. The results of Table 1 show that the DCT+DCT_4B feature is better than the other two. This hybrid feature benefits from both the local and the global features of the DCT, so it achieved good results, especially on noisy data. Figure 4 shows the relation between codebook size and clustering accuracy rate. As shown in Figure 4, the clustering accuracy rate increases when a larger number of top-n clusters is used, which is to be expected. When a small number of clusters is used, each cluster contains a large number of words, which raises the likelihood of finding the tested word within one of the selected clusters. When the number of clusters increases, the number of words in each cluster decreases, which reduces the clustering accuracy rate; at the same time, the words within each cluster become more similar, which starts to raise the clustering accuracy rate again, up to the highest level when each cluster contains only one word.
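The clustering accuracy measure can be written down compactly. The sketch below assumes squared Euclidean distance to the cluster centroids and that the cluster index of each test word's correct lexicon entry is known; the function name and the toy numbers are hypothetical.

```python
import numpy as np


def clustering_accuracy(test_feats, true_clusters, codebook, top_n):
    """Fraction of test words whose true cluster is among the top_n nearest."""
    d = ((test_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1, kind="stable")[:, :top_n]
    hits = (nearest == np.asarray(true_clusters)[:, None]).any(axis=1)
    return float(hits.mean())


# Toy example: three centroids and three test words (illustrative values only).
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
test_feats = np.array([[1.0, 0.0], [6.0, 0.0], [0.0, 9.0]])
true_clusters = [0, 0, 2]   # the second word drifted closer to cluster 1

acc_top1 = clustering_accuracy(test_feats, true_clusters, codebook, top_n=1)
acc_top2 = clustering_accuracy(test_feats, true_clusters, codebook, top_n=2)
```

In this toy case the second test word is missed when only the nearest cluster is kept and recovered when the two nearest clusters are kept, which mirrors why the accuracy rises with the number of selected clusters.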

Language Rescoring
To enhance the recognition accuracy, the top hypotheses from the holistic recognition results are rescored using a language model. In our system, we used a 4-gram language model that was trained on a Giga-word Arabic training database [20]. The top n hypotheses for each word are combined in a lattice format, as shown in Figure 5. We then used the A* search technique to find the best scoring path in that lattice under the 4-gram language model, which selects the best matching sentence according to the Arabic language constraints [21].
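The idea of rescoring a word lattice can be shown with a deliberately simplified sketch. It swaps in a bigram model and an exhaustive dynamic-programming (Viterbi) search in place of the 4-gram model and A* search described above, and it assumes the path score is the weighted language model log-probability minus the recognition distance. The function name, the toy English words and all scores are hypothetical.

```python
def rescore_lattice(lattice, bigram_logp, lm_weight=1.0, floor=-5.0):
    """Best word sequence through a lattice of (word, distance) candidates.

    lattice: list of positions, each a list of (word, recognition_distance).
    bigram_logp: dict mapping (previous_word, word) to a log-probability;
    unseen pairs receive the `floor` value (an assumed back-off).
    """
    # best[word] = (score, path) of the best path ending in `word`.
    best = {"<s>": (0.0, [])}
    for candidates in lattice:
        new_best = {}
        for word, dist in candidates:
            for prev, (score, path) in best.items():
                lm = bigram_logp.get((prev, word), floor)
                total = score + lm_weight * lm - dist
                if word not in new_best or total > new_best[word][0]:
                    new_best[word] = (total, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy lattice: the recognizer slightly prefers the wrong word at each position.
lattice = [
    [("the", 0.1), ("she", 0.0)],
    [("cat", 0.2), ("cut", 0.1)],
    [("sat", 0.3), ("set", 0.1)],
]
bigram_logp = {("<s>", "the"): -0.5, ("the", "cat"): -0.5, ("cat", "sat"): -0.5}

with_lm = rescore_lattice(lattice, bigram_logp)
without_lm = rescore_lattice(lattice, bigram_logp, lm_weight=0.0)
```

With the language model switched off, the search simply follows the smallest recognition distances; with it switched on, the sequence supported by the language model wins even though each of its words was only the second choice of the recognizer.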

Experiment Results
To train the proposed holistic Arabic OCR system, we used a lexicon of around 356,000 words selected from the news domain, with high coverage of the Arabic language. Using this lexicon, we generated a database of images for three fonts (Simplified Arabic, Traditional Arabic and Arabic Transparent) at 300 dpi in four different sizes.
To test the system, we used three different test datasets that represent different degrees of challenge: a laser scanned text data set, a recent computerized books data set and an old un-computerized books data set. Figure 6 illustrates some examples of the scanned images. In the first experiment, we evaluated our system using the laser scanned data set. Initially, we evaluated the system on a single font. The system was trained on a single font with a single size but was tested on the same font with different sizes. We did not use the language model with this dataset, as it consists of single words. Table 2 illustrates the Word Recognition Rate (WRR) results for this experiment. From the results in Table 2, we can see that the proposed system achieved very high accuracy and managed to generalize to new font sizes that were not included in the training data, with the best WRR of 98.44% for the Simplified Arabic font and the lowest WRR of 97.33% for the Traditional Arabic font. When multiple recognition hypotheses were considered, the top-5 WRR was almost 100%.
In the second experiment, the system was evaluated as omnifont by including several fonts and sizes from the laser scanned training data set. Table 3 includes the results for that evaluation. As we can see in Table 3, the proposed system achieved almost the same WRR for the multi-font and multi-size task as for the single font one. This result shows that the presented system can provide omnifont performance.
In the third experiment, we evaluated our system using the recent and old books data sets. Table 4 shows the results of that evaluation. From the results in Table 4, we can see that our Arabic holistic OCR system achieved 77.3% WRR for recent books and 47.8% WRR for old books. Considering the top-10 hypotheses, the WRR increased to 87.7% for recent books and to 65.7% for old books. When considering the top-20 hypotheses, the WRR increased to 89% and 69% for recent and old books, respectively. An analysis of the recognition errors on the books data sets revealed several reasons that contributed to the reduction of the WRR. We found that these data sets had a high Out Of Vocabulary (OOV) rate of around 6% for recent books and 7% for old books. It is known that the effect of OOV words is cumulative, which means a single OOV word can result in recognition errors for more than one of its neighboring words. Another phenomenon that we noticed in these data sets is the high rate of use of the Kashida character, which was 4% for recent books and 6% for old books. The Kashida character alters the shapes of some characters, which caused some word recognition errors. Also, we noticed that some fonts of the old books, such as the Anglo-font, differed greatly from the fonts used in training the system, which resulted in very low WRR for some pages.
When we applied 4-gram language model rescoring to the books data sets using the top-10 hypotheses, we achieved 83% WRR for the recent books set and 53% WRR for the old books set. This is an absolute gain of about 6% in WRR for both the recent and old books data sets. This result shows that a high percentage of the system's recognition errors can be corrected using the top-n hypotheses and a language model.
In the fourth evaluation, we compared the performance of the proposed system with three commercial Arabic OCR systems, Sakhr, ABBYY and NovoDynamics, which represent the best performing Arabic OCR packages currently available. Table 5 shows these comparative results. The results in Table 5 show that, with squared Euclidean distance as the distance measure, our system achieved better performance than two systems, ABBYY and Sakhr, on the computerized books data set, and better performance than the ABBYY system on the uncomputerized books data set. When we used the absolute Euclidean distance, the recognition rate increased from 82.97% to 84.76% for the computerized books set and from 53.21% to 58.04% for the uncomputerized books set, and the proposed system outperformed the Sakhr and ABBYY systems on both datasets, although the NovoDynamics system still outperforms the proposed one. Our system is, however, faster than NovoDynamics, as the runtime evaluation below shows.
As heavy computation is one of the main drawbacks of the holistic approach, we evaluated the runtime speed of the presented system. Table 6 shows the processing times of the proposed system before and after lexical reduction versus the number of selected word clusters. These experiments were run on a Core i7 2.8 GHz machine with single thread execution. We can see from the results in Table 6 that the computation cost of our holistic system is very practical. With lexical reduction, we managed to reduce the run time by a factor of 1000, and a page with an average of 250 words can be processed in an average time of 1.2 s, compared to 1 s/page for the Sakhr system, 2.3 s/page for NovoDynamics and 3.5 s/page for the ABBYY system.

Conclusions and Future Work
Holistic approaches provide effective solutions for the challenges of cursive script recognition such as Arabic OCR. The main drawback of such approaches is their complexity and heavy computation requirements, especially for large vocabulary tasks. In this paper, we introduced a holistic Arabic OCR approach that is computationally efficient. A lexicon reduction technique based on clustering similarly shaped words is utilized to reduce the word recognition time. The presented system makes use of a hybrid of several holistic features that combine global word level DCT based features and local block based features. Using this type of features, the system achieved omnifont performance with size and font independence. Also, the suggested system has a flexible architecture to integrate language modelling constraints by using a second rescoring pass for the top n-best word hypotheses.
The proposed system has been tested using different sets of 1152 words with three different fonts and four font sizes and has achieved 99.3% WRR. It has also been tested using a set of 2730 words of text from recent computerized books and has attained about 84.8% WRR. Results of the proposed holistic system have been compared with known commercial Arabic OCR systems provided by the largest international and local companies, and the results were promising. In future work, we will investigate other holistic features such as the Wavelet Transform, Zernike Transform, Hough Transform and loci. Also, we will investigate other lexicon reduction techniques that benefit from linguistic information.

Figure 1. Some examples of Arabic words that contain ligatures with manually segmented characters.

Figure 2. Block Diagram of the Holistic OCR System.

Figure 4. Clustering accuracy rate of Simplified Arabic font vs. codebook size number using DCT+ DCT_4B feature for different top clusters.

Figure 5. An example of a rescoring lattice.

1. Laser scanned text data set: This data set is composed of 1152 single words taken from newspaper articles and printed in three fonts and four different sizes in two types of quality: clean and first copy.
2. Recent computerized books data set: A data set composed of 10 scanned pages from different recent computerized books that contain 2730 words.
3. Old un-computerized books data set: This data set consists of 10 scanned pages containing 2276 words from old books that are typewritten in fonts that are not well known.

Figure 6. Some samples of the scanned images.

Table 5. Recognition rate (percent) of recent computerized and uncomputerized books. ED stands for Euclidean distance.

Table 6. Processing time of word search and LM vs. word candidates.