Attention-Based CNN-RNN Arabic Text Recognition from Natural Scene Images

: According to statistics, there are 422 million speakers of the Arabic language. Islam is the second-largest religion in the world, and its followers constitute approximately 25% of the world’s population. Since the Holy Quran is in Arabic, nearly all Muslims understand the Arabic language per some analytical information. Many countries have Arabic as their native and ofﬁcial language as well. In recent years, the number of internet users speaking the Arabic language has been increased, but there is very little work on it due to some complications. It is challenging to build a robust recognition system (RS) for cursive nature languages such as Arabic. These challenges become more complex if there are variations in text size, fonts, colors, orientation, lighting conditions, noise within a dataset, etc. To deal with them, deep learning models show noticeable results on data modeling and can handle large datasets. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can select good features and follow the sequential data learning technique. These two neural networks offer impressive results in many research areas such as text recognition, voice recognition, several tasks of Natural Language Processing (NLP), and others. This paper presents a CNN-RNN model with an attention mechanism for Arabic image text recognition. The model takes an input image and generates feature sequences through a CNN. These sequences are transferred to a bidirectional RNN to obtain feature sequences in order. The bidirectional RNN can miss some preprocessing of text segmentation. Therefore, a bidirectional RNN with an attention mechanism is used to generate output, enabling the model to select relevant information from the feature sequences. An attention mechanism implements end-to-end training through a standard backpropagation algorithm.


Introduction
The research community has been working for many years to recognize printed and written texts [1]. Many applications to the text recovery system help blind people to navigate specific environments [2]. The text is filtered with textual contents on the images. The text has high-level scene information in natural pictures. The styles, font sizes, and text may be different. Anyone can view natural text scenes in advertising notes and banners. Text with a background of water, constructions, woods, rocks, etc., mostly has noise. The process of text recognition from natural images can be slowed down with these backgrounds [3]. This problem may be more prominent when lights are loud, blurred, and texts are directed in images of nature [3,4]. Here, this paper's primary focus is on the recognition of Arabic texts from natural scenes.
For most people, video is a crucial source of information-there are tho derstand videos better than other sources of information. Many videos on soc and TV channels are available. Storage technology can now save a large num data. Since the length and annotation of these videos are increasing, quick and recovery systems should be available to enable access to the information rel videos. The most significant information in videos is the title. Anybody can the text written on a video. The recognition of video text offers help in analyz tent of the video [5]. The text data on the videos are highly linked to the p video, and they are also linked to the events of the video.
Reading text from a printed document such as an essay differs to natural nition [6]. This problem has brought attention to the area of natural scene Scene text recognition has become an essential module for many practical due to the advancement of mobile devices such as self-driving cars, smartpho A lot of work has been carried out in recognizing printed text in the form of O acter Recognition (OCR) systems. Still, they cannot deal effectively with na with factors such as image distortion, variable fonts, background noise, etc. [8 [9] described the problems in the text recognition system. According to them are not sufficient for a complete solution; human expertise is also required. A fourth most widely spoken language globally, but the work on word-level text for the Arabic language is negligible [10]. The original image containing one w can be seen in Figure 1. Arabic text acknowledgment is an active field of research in pattern rec day. Arabic printed documents have been a topic of study for many years in community. Character recognition [11] makes up most of the previous work o script. Due to its cursive fashion and lack of annotated information, the Ara challenging [12]. Additionally, the OCR script system is also challenging in re abic because most English OCR techniques are not very practical for Arabic scripts, many researchers have worked on word segmentation, but now, peo free text segmentation recognition. Models such as the recurrent neural netwo for this sequence by sequence (RNN). The input to target labels in a series is in these models [13]. For handwritten English and Arabic text recognition, the is used in [14,15], respectively. For Arabic video text recognition, this paper us sequence to sequence model. A great deal of work has been carried out in [1 ognize video text. The availability of Arabic script datasets is minor compare Arabic text acknowledgment is an active field of research in pattern recognition today. Arabic printed documents have been a topic of study for many years in the research community. Character recognition [11] makes up most of the previous work on the Arabic script. Due to its cursive fashion and lack of annotated information, the Arabic script is challenging [12]. Additionally, the OCR script system is also challenging in regard to Arabic because most English OCR techniques are not very practical for Arabic. For Arabic scripts, many researchers have worked on word segmentation, but now, people work on free text segmentation recognition. Models such as the recurrent neural network are used for this sequence by sequence (RNN). The input to target labels in a series is transcribed in these models [13]. For handwritten English and Arabic text recognition, the RNN model is used in [14,15], respectively. For Arabic video text recognition, this paper uses the same sequence to sequence model. A great deal of work has been carried out in [16,17] to recognize video text. The availability of Arabic script datasets is minor compared to English script datasets, but for video text recognition, two datasets are available: the ACTIV dataset [18] and the ALIF dataset [19]. Due to the lack of annotated data of Arabic script, the work on word-level text recognition is very scarce [20].
DCNNs are used for various tasks such as feature extraction, classification, and objection detection. However, the input and output size of this network is not fixed [21]. If the model receives input data in a series, a sequence of labels should be predicted. RNN models resolve the problem of inputs and outputs of variable sizes. To recall previous input, RNN introduces the concept of memory [22]; the information used by RNN-based models is usually converted into sequence-function vectors [23]. A DCNN is used to obtain the functional representations of the input image in the proposed study. These features are transferred to a bidirectional RNN. These characteristics are transmitted to the two-way RNN. The mechanism of attention is applied to the RNN. The mechanism for attention gives the results of each RNN output. Relevant information will arrive at the front when necessary, via the attention mechanism [24]. The results are predicted based on this attention model.
For natural scene images, several text recognition systems are proposed. Bissacco et al. [25] recognize characters for English scripts on a single basis, and then the DCNN model recognizes the characters detected. These models' limitation is that the exactness of the word's cutting and detection is not very accurate. Due to the complications, this is harder for Arabic script. As far as our knowledge is concerned, there is less work in Arabic text script recognition at the word level. Here, we use images of natural scenes; Using images of natural scenes involves three challenges, including environmental challenges scene complexity, noise, etc. The mages may be blurred, and the content's text may have variable fonts, guidelines, and text colors. The complex properties of an Arabic script, including its semi-cursive mode, letter shape, paws (Arabic part), dots, and similar letters in different situations present other challenges. Figure 2 is a representation of Arabic natural scene text.
Forecasting 2021, 3 FOR PEER REVIEW DCNNs are used for various tasks such as feature extraction, classification jection detection. However, the input and output size of this network is not fixe the model receives input data in a series, a sequence of labels should be predict models resolve the problem of inputs and outputs of variable sizes. To recall input, RNN introduces the concept of memory [22]; the information used by RN models is usually converted into sequence-function vectors [23]. A DCNN is us tain the functional representations of the input image in the proposed study. T tures are transferred to a bidirectional RNN. These characteristics are transmitt two-way RNN. The mechanism of attention is applied to the RNN. The mecha attention gives the results of each RNN output. Relevant information will arri front when necessary, via the attention mechanism [24]. The results are predict on this attention model.
For natural scene images, several text recognition systems are proposed. Bi al. [25] recognize characters for English scripts on a single basis, and then th model recognizes the characters detected. These models' limitation is that the e of the word's cutting and detection is not very accurate. Due to the complication harder for Arabic script. As far as our knowledge is concerned, there is less work text script recognition at the word level. Here, we use images of natural scene images of natural scenes involves three challenges, including environmental ch scene complexity, noise, etc. The mages may be blurred, and the content's text m variable fonts, guidelines, and text colors. The complex properties of an Arabic s cluding its semi-cursive mode, letter shape, paws (Arabic part), dots, and simil in different situations present other challenges. Figure 2 is a representation of Ar ural scene text. Jain et al. [13] proposed a hybrid CNN-RNN model to recognize Arabic tex ral scenes and videos. In this model, an input image is converted into a feature the CNN layer. This layer splits the feature map column-wise (f1, f2, ..., ft) and tra the RNN layer. The RNN layer reads the feature sequences one by one (mean step at a time) and obtains a fixed dimension vector that contains all the inform input feature sequences. Then, a transcription layer uses this fixed vector to gen output sequence. The main issues of this approach include the fact that all the kn of these feature sequences is stored in a fixed-length vector. During the backprop the model has to visit all the information in a fixed vector. This will be more diffi have long sequences. Therefore, there should be a mechanism that deals dynamic these feature sequences (input sequences). The attention mechanism dynamica Jain et al. [13] proposed a hybrid CNN-RNN model to recognize Arabic text in natural scenes and videos. In this model, an input image is converted into a feature map by the CNN layer. This layer splits the feature map column-wise (f 1 , f 2 , ..., f t ) and transfers to the RNN layer. The RNN layer reads the feature sequences one by one (mean one time step at a time) and obtains a fixed dimension vector that contains all the information of input feature sequences. Then, a transcription layer uses this fixed vector to generate the output sequence. The main issues of this approach include the fact that all the knowledge of these feature sequences is stored in a fixed-length vector. During the backpropagation, the model has to visit all the information in a fixed vector. This will be more difficult if we have long sequences. Therefore, there should be a mechanism that deals dynamically with these feature sequences (input sequences). The attention mechanism dynamically picks the most relevant information from the feature sequence and feeds it to the context vector; for this, it soft searches where the vital information is present in the input sequences [26]. During backpropagation, the model will only visit the critical information in the context vector. This paper presents an attention-based CNN-RNN model for Arabic image text recognition.

Literature Review
The Arabic language is extremely complex, due to its many properties. It is written cursively from right to left; it has only 28 letters. Most of its letters, except six letters, are connected to the previous letter at the bottom of the line. One or more paws can be an Arabic word (part of the word). A letter can be linked to the letter or linked in the form of a paw to the letter. The letter form makes the script more difficult, and the shape of it depends on the location of the letter in the beginning, in the middle, or at the end [27]. Ligature fashion [28] is also used in Arabic scripts, which means that two letters connect and form a new letter that cannot be separated. These characteristics make the automatic recognition of Arabic scripts more difficult. A. Shahin presented a regression-based model to recognize Arabic printed text on paper [29], which could be linear or nonlinear. First, they thinned the text from the images and segmented the images into sub-words. Following that, a coding scheme was used to represent the relationship between word segments. Each form of the character and font type had a code in the coding scheme. Finally, the linear regression model validated the representations against a truth table. They tested the model on 14,000 different word samples, and the accuracy was 86 percent.
There are two recognition methods for document-and video-based OCR techniques: one is the free segmentation method, and the other is based on segmentation. Mostly, two approaches are used for recognition: the first one is the analytical approach, which segments a line or word into smaller units, and the second one is the global approach, which does not perform any segmentation; it takes the whole line or word as an input. Most of the Arabic text recognition methods are segmentation-based. For example, Ben Halima [30] uses an analytical approach for Arabic video text recognition. They perform binarization on Arabic text lines and then segmentation by projection profiles. They use the fuzzyk-nearest-neighbors technique on handcrafted features to classify characters. Lwata [31] also employs the analytical approach for Arabic text recognition on various video frames. They use a space detection technique to segment the line into words, which are then segmented into characters. Finally, they use an adapted quadratic function classifier to perform character-level recognition. With these approaches, segmentation errors can grow indefinitely and contribute to recognition performance.
A. Bodour presented a deep learning model for Arabic text recognition in their work. Their work was mainly focused on the recognition of Islamic manuscripts of history.
To enhance the quality of scanned images and for segmentation, several preprocessing techniques were used. After preprocessing, they used CNNs for recognition and achieved 88.20% validation accuracy. A deep learning model is also used in [9] for handwritten Arabic text recognition. Various preprocessing techniques were used for image noise reduction. They used a threshold method to segment the characters of words into distinct regions. Feature vectors were constructed by these distinct regions. They evaluated the model on a custom dataset gathered from different writers. The model achieved 83% accuracy.
In research works [32,33], a CNN model with three layers is presented for the recognition of handwritten characters of Arabic script. The datasets used for evaluation were AHCD [34] and AIA9k [35]. The model achieved 97.5% validation accuracy. Another deep learning model with MD-LSTM is presented in [36] for handwritten Arabic text recognition. The model used the CTC loss function. The purpose of that work was to evaluate the results of extending the dataset using data enhancing techniques and compare the performance of an extended model with other related models. The dataset used for training and evaluation is handwritten KHAT [37]. The model achieved 80.02% validation accuracy. In [13], the authors presented a hybrid deep learning model for printed Urdu text detection from scanned documents. The model used a CNN for feature extraction and bi-LSTM for sequence labeling. The model used the CTC loss function. The APTI and URDU datasets [38] were used for evaluation. The model achieved 98.80% accuracy. To illustrate, Figure 3 is a flowchart to separate words from Arabic sentence images. asting 2021, 3 FOR PEER REVIEW [38] were used for evaluation. The model achieved 98.80% accur 3 is a flowchart to separate words from Arabic sentence images. Zhai [39] proposed a bidirectional RNN-based segmentati layer of CTC for the recognition of Chinese text, which was obtain For model training, they composed almost two million news hea nels. There is a lot of work on automatic text recognition from n for English scripts, but for Arabic scripts, the research commun English scripts, Bissacco et al. [25] firstly detect the character DCNN model recognizes these detected characters. The limitatio the accuracy of cropping and detecting the character from the w be more difficult for the Arabic script due to its complications.
For English script text recognition, Jaderberg et al. [40] use but the recognition region mechanism was used. However, this Arabic script due to its complications. The number of unique w less than in Arabic script. Almazán et al. [41] had two goals to spotting and the second is word recognition on the images. The Zhai [39] proposed a bidirectional RNN-based segmentation free model with the layer of CTC for the recognition of Chinese text, which was obtained from news channels. For model training, they composed almost two million news headings from 13 TV channels. There is a lot of work on automatic text recognition from natural scenes and videos for English scripts, but for Arabic scripts, the research community is newly active. For English scripts, Bissacco et al. [25] firstly detect the characters individually, then the DCNN model recognizes these detected characters. The limitation of these models is that the accuracy of cropping and detecting the character from the word is not high. This will be more difficult for the Arabic script due to its complications.
For English script text recognition, Jaderberg et al. [40] used a DCNN for detection but the recognition region mechanism was used. However, this model is not suitable for Arabic script due to its complications. The number of unique words in English script is less than in Arabic script. Almazán et al. [41] had two goals to fulfill: the first is word spotting and the second is word recognition on the images. The authors proposed an approach where text strings and word images both emerged to a common vector space. In semantic segmentation, RNN-based models are used to handle the long-distance pixel dependencies [42]. For example, in [43], Gatta et al. unrolled the CNN to obtain the semantic connections at different time steps. Authors [44][45][46][47][48] used LSTM with four blocks to scan all directions of an image; this helped to learn long range spatial dependencies.
There has been a lot of work performed in the field of text extraction from videos. Karray et al. [49] extracted text using classification methods. Hua et al. [50] used edge detection, which is a type of text extraction. Previously, the majority of work on Arabic videos was divided into two categories: dealing with relevant feature extraction and classifying features to target labels. To recognize Chinese handwritten text, authors [51][52][53][54][55] used a bidirectional LSTM network. The attention-based model inspired our proposed model. Papers [56][57][58][59][60][61] proposed different attention-based speech recognition model. Authors [62][63][64][65][66][67][68][69][70][71] proposed various attention-based models for object detection in images as well. Fasha et al. [72] proposed a hybrid deep learning model, and used five CNN layers and two LSTM layers for Arabic text recognition. They worked on 18 fonts of the Arabic language [73]. Graves et al. [74] created Neural Turing machines using an attention-based model with convolutions (NTMs). The NTM computes the attention weights based on content. Authors [75][76][77] present a supervised convolutional neural network (CNN) model that uses batch normalization and dropout regularization settings to extract optimal features from context. When compared to traditional deep learning models, this tries to prevent overfitting and improve generalization performance. In [78][79][80], for feature extraction, the first model employs deep convolutional neural networks (CNNs), whereas word categorization is handled by a fully connected multilayer perceptron neural network (MLP). SimpleHTR, the second model, extracts information from images using CNN and recurrent neural network (RNN) layers to develop the suggested deep CNN (DCNN) architecture. To compare the outcomes, Daniyal et al. presented the Bluechet and Puchserver models. Arafat et al. [81] introduce Hijja, a novel dataset of Arabic letters produced only by youngsters aged 7-12. A total of 591 individuals contributed 47,434 characters to our dataset. They propose a convolutional neural network-based automatic handwriting recognition model (CNN). Hijja and the Arabic Handwritten Character Dataset (AHCD) are used to train our model. Ali et al. present a method for recognizing cursive caption text that uses a combination of convolutional and recurrent neural networks that are trained in an end-to-end framework. The background is segmented using text lines retrieved from video frames, which are then fed into a convolutional neural network for feature extraction. To learn sequence-to-sequence mapping, the extracted feature sequences are fed to different forms of bidirectional recurrent neural networks along with the ground truth transcription. Finally, the final transcription is created using a connectionist temporal classification layer. Alrehali et al. present a method for identifying text in photographs of the Hadeethe, Tafseer, and Akhidah manuscripts and converting it into a legible text that can be copied and saved for use in future studies. Their algorithm's major steps are as follows: (1) picture enhancement (preprocessing); (2) segmenting the manuscript image into lines and characters; (3) creating an Arabic character dataset; (4) text recognition (classification). On three produced datasets, they employ a CNN in the classification stage. To solve the Arabic Named Entity Recognition (NER) task, El Bazi et al. developed a deep learning technique. They built a bidirectional LSTM and Conditional Random Fields (CRFs) neural network architecture and tested various commonly used hyperparameters to see how they affected the overall performance of our system. The model takes two types of word information as input: pre-trained word embeddings and character-based representations, and it does not require any task-specific expertise or feature engineering. Arafat et al. proposed a method for detecting, predicting, and recognizing Urdu ligatures in outdoor photographs. For identification and localization purposes, the unique Faster-RCNN algorithm was utilized in conjunction with well-known CNNs such as Squeezenet, Googlenet, Resnet18, and Resnet50 in the first phase for images with a resolution of 320 240 pixels. A unique Regression Residual Neural Network (RRNN) is trained and evaluated on datasets including randomly oriented ligatures to predict ligature orientation. A Two Stream Deep Neural Network was used to recognize ligatures (TSDNN).
According to a review of related work, less work exists for natural scene Arabic text recognition, particularly at the word level. Our performed work represents specific steps in this research area and provides tools that can be used to expand and improve the results obtained.

Proposed CNN-RNN Attention Model
The architecture is divided into three sections: the first deals with convolutional layers, the second with recurrent layers of a bidirectional LSTM, and the third with recurrent layers of a bidirectional LSTM and an attention mechanism. The first part extracts feature representations from the input image. These representations are fed into the second part. This recurrent layer will label the sequences one by one. The connectionist temporal classification layer (CTC) is used when dealing with aligned data; it is essentially a loss function deal where time is a variable [48].
This paper uses two datasets, ACTIV and ALIF; they are firstly preprocessed, then fed to the convolutional layer. The size of all images is fixed before giving input to the first part of the architecture. Here, the height of each image is fixed. Convolutional layers will split the feature maps column-wise and create features sequence by sequence (f 1 , f 2 , ..., f t ). Due to CNN feature extraction, each feature contains important information. As CNNs cannot handle the long dependency, these features are directed as input to the second part of the architecture. This section makes predictions based on the features generated by the first section. In the second part, bidirectional Long Short-Term Memory (LSTM) networks are used. These networks deal with inputs and outputs in a recurrent fashion. The second part of the network will be unrolled based on a time variable, and the time variable will be determined solely by the input data. The disappearing gradient problem occurs in simple recurrent neural networks, but LSTM networks can learn meaningful information. They use different gates and memory cells to resolve the problem of vanishing gradient [49]. A bidirectional LSMT network is used in text recognition, as the learning pattern of text from right to left and left to right is very important [23]. The network can be made deeper by stacking these LSTM layers [50].
Now, 'v' has the information of both sides. The third part of the architecture has an LSTM network with attention, produces the results based on the information in 'v i '. The attention mechanism gives scores to each feature sequence; for this, it gives probability to the recognized text.
Pr(y t |{y 1 , y 2 , y 3 , . . . , y t−1 }, con t ) (2) con t is the context vector at time t generated from feature sequences. Probability with one LSTM layer is: Pr(y t |{y 1 , y 2 , y 3 , . . . , y t−1 }, con t ) = ny t−1 , hs t , con t ) Here, n is a non-linear function that gives the probability output of y t . Here, hs t is the hidden state of LSTM, which can be calculated as: The context vector can be calculated as: Forecasting 2021, 3

527
Each feature sequence has a weight W ti and can be calculated as: es i scores represent how well input around i related to the output at t. It can be calculated in the equation below. However, the proposed method is explained in Figure 4.
recasting 2021, 3 FOR PEER REVIEW The proposed methodology uses an attention mechanism over a bidirectional LS for this, it calculates the attention of each input sequence and then multiplies this atten to the respective input vector. The weighted sum of these attentions with their respec vectors is stored in the context vector. Attentions are the weights of input sequences.

Dataset and Preprocessing
The model was trained on two datasets, ACTIV and ALIF. There are 21,520 line ages in the ACTIV dataset, and these images are from different news channels. The A dataset is smaller than ACTIV, and has 6532 text lines images, and these are also f Arabic news channels. These images are annotated with XML files. Another dataset curated by downloading freely available images containing Arabic script from Go Images and Facebook; this dataset contains cropped images of market banners, shop b ners, etc. This Arabic dataset of 2477 images was manually annotated by experts. Tab and 2 represent the statistical details of the ACTIV dataset and ALIF dataset. Detai the Google curated dataset is given in Table 3. Segmented word images can be see Figure 5.  The proposed methodology uses an attention mechanism over a bidirectional LSTM; for this, it calculates the attention of each input sequence and then multiplies this attention to the respective input vector. The weighted sum of these attentions with their respective vectors is stored in the context vector. Attentions are the weights of input sequences.

Dataset and Preprocessing
The model was trained on two datasets, ACTIV and ALIF. There are 21,520 line images in the ACTIV dataset, and these images are from different news channels. The ALIF dataset is smaller than ACTIV, and has 6532 text lines images, and these are also from Arabic news channels. These images are annotated with XML files. Another dataset was curated by downloading freely available images containing Arabic script from Google Images and Facebook; this dataset contains cropped images of market banners, shop banners, etc. This Arabic dataset of 2477 images was manually annotated by experts. Tables 1 and 2 represent the statistical details of the ACTIV dataset and ALIF dataset. Details of the Google curated dataset is given in Table 3. Segmented word images can be seen in Figure 5.

Dataset and Preprocessing
The model was trained on two datasets, ACTIV and ALIF. There are 21,520 line images in the ACTIV dataset, and these images are from different news channels. The ALIF dataset is smaller than ACTIV, and has 6532 text lines images, and these are also from Arabic news channels. These images are annotated with XML files. Another dataset was curated by downloading freely available images containing Arabic script from Google Images and Facebook; this dataset contains cropped images of market banners, shop banners, etc. This Arabic dataset of 2477 images was manually annotated by experts. Tables 1 and 2 represent the statistical details of the ACTIV dataset and ALIF dataset. Details of the Google curated dataset is given in Table 3. Segmented word images can be seen in Figure 5.

Segmented Dataset
Here, we had two datasets, ACTIV and ALIF, and an image may have had more than one word. We worked on word-level text recognition; we needed one word in an image. For this, we performed segmentation to obtain one word in an image. The working of text recognition process is explained in Figure 6. A binarization algorithm was applied, which extracted the text from the noisy and shadow affected images.

Segmented Dataset
Here, we had two datasets, ACTIV and ALIF, and an image may have had more than one word. We worked on word-level text recognition; we needed one word in an image. For this, we performed segmentation to obtain one word in an image. The working of text recognition process is explained in Figure 6. A binarization algorithm was applied, which extracted the text from the noisy and shadow affected images.

Commonly Used Deep Learning Techniques
Deep learning is a technique of machine learning, which uses multiple layers of neural networks for classifying patterns [51]. Deep learning models use neural networks referred to as deep neural networks (DNNs). Simple neural networks only use two to three

Commonly Used Deep Learning Techniques
Deep learning is a technique of machine learning, which uses multiple layers of neural networks for classifying patterns [51]. Deep learning models use neural networks referred to as deep neural networks (DNNs). Simple neural networks only use two to three hidden layers, but in DNNs, hidden layers can be more than a hundred (deep mean numbers of hidden layers in network). A simple neural network takes the input at the input layer, and processes this input through hidden layers. Each neuron in the hidden layer is fully connected to all previous layer neurons. In a layer, neurons do not share their information; they operate independently from other neurons [52]. The last layer of the network will be fully connected. These fully connected neural networks cannot scale to larger images. Convolutional neural networks (CNNs) solve this problem; each neuron in the hidden layer is not fully connected to all previous layer neurons [53]. Only a few neurons of the previous layer are connected to each neuron. The shortcoming of CNNs is that they are unable to deal with variable size input and output; they only take fixed size input and generate fixed size output [54]. CNNs are not suitable for some applications such as speech and video processing. The algorithm for data preprocessing in our proposed method is described in Algorithm 1. Recurrent neural networks (RNNs) are designed to connect the previous information with the present task. They tackle the issue of fixed size input and output [55]. Simple RNNs cannot handle long dependencies, and are only useful when the gap between past information and the needed place is small. Simple RNNs face the problem of a vanishing gradient and exploding. Long short-term memory (LSTM) networks are a special kind of RNN that deal with this problem by having cell memory using linear and logistic units [56]. An LSTM network consists of three gates: write gate, keep the gate and read gate. Write gate writes the information on the cell, keep gate will keep the information on the cell and read gate allows one to read the information from the gate.
In deep learning, attention mechanisms attracted a lot of interest. Initially, an attention mechanism was applied on sequence for sequence models in neural machine translation, but now they are used in many problems, such as image captioning and others [57]. In attention-based LSTM, the system gives a score or attention weight to each output of LSTM. Based on these attention weights, the system will predict the output with high accuracy.

Overview of RNN
In 1960, the first RNNs were introduced. Due to their contextual modeling ability, they became more well known. RNNs are especially popular to tackle the time series problem with different patterns. Basically, with recurrent layers, an RNN is a simple multi-layer perceptron (MLP Here, a m (t) is the value of input m in time t, e n (t) and f n (t) are the input and the activation of the network to unit n, respectively, at time t. From unit m to unit n, the connection is represented as v mn . Θ h represents the activation function of hidden unit h. For the first time, RNNs were used for speech recognition by Robison [20]. For handwritten recognition, RNNs were used by Lee and Kim [40]. An input image as in Figures 7 and 8 was preprocessed to obtain the background and locate the text regions. An image background can be estimated by the pre binarization process. Firstly, a color image was converted into a gray scale image and then further converted into black and white images. Vertical profile projections were used to obtain the segmentation of words. Figures 9-11 are the visual representations of text.      A bidirectional RNN was presented by Schuter and Paliwal [18] in 1997. Ther two recurrent layers in sequence processing, one for the forward direction and the se for the backward direction. These recurrent layers are connected to the same inpu output layers. To deal with multi-dimensions such as 2D images or 3D videos, Grave presents a multi-dimension recurrent neural network (MD-RNN). For multi-dime RNN, consider a point of input series designed to have higher-layer units such as me cells. These memory cells are specially used to maintain information for a longer       A bidirectional RNN was presented by Schuter and Paliwal [18] in 1997. Ther two recurrent layers in sequence processing, one for the forward direction and the se for the backward direction. These recurrent layers are connected to the same inpu output layers. To deal with multi-dimensions such as 2D images or 3D videos, Grave presents a multi-dimension recurrent neural network (MD-RNN). For multi-dimen RNN, consider a point of input series designed to have higher-layer units such as me cells. These memory cells are specially used to maintain information for a longer       A bidirectional RNN was presented by Schuter and Paliwal [18] in 1997. Ther two recurrent layers in sequence processing, one for the forward direction and the se for the backward direction. These recurrent layers are connected to the same inpu output layers. To deal with multi-dimensions such as 2D images or 3D videos, Grave presents a multi-dimension recurrent neural network (MD-RNN). For multi-dime RNN, consider a point of input series designed to have higher-layer units such as me cells. These memory cells are specially used to maintain information for a longer       A bidirectional RNN was presented by Schuter and Paliwal [18] in 1997. Ther two recurrent layers in sequence processing, one for the forward direction and the se for the backward direction. These recurrent layers are connected to the same input output layers. To deal with multi-dimensions such as 2D images or 3D videos, Grave presents a multi-dimension recurrent neural network (MD-RNN). For multi-dimen RNN, consider a point of input series designed to have higher-layer units such as mem cells. These memory cells are specially used to maintain information for a longer       A bidirectional RNN was presented by Schuter and Paliwal [18] in 1997. Ther two recurrent layers in sequence processing, one for the forward direction and the se for the backward direction. These recurrent layers are connected to the same input output layers. To deal with multi-dimensions such as 2D images or 3D videos, Grave presents a multi-dimension recurrent neural network (MD-RNN). For multi-dimen RNN, consider a point of input series designed to have higher-layer units such as mem cells. These memory cells are specially used to maintain information for a longer Figure 11. Cropped third word.
A bidirectional RNN was presented by Schuter and Paliwal [18] in 1997. There are two recurrent layers in sequence processing, one for the forward direction and the second for the backward direction. These recurrent layers are connected to the same input and output layers. To deal with multi-dimensions such as 2D images or 3D videos, Graves [41] presents a multi-dimension recurrent neural network (MD-RNN). For multi-dimension RNN, consider a point of input series designed to have higher-layer units such as memory cells. These memory cells are specially used to maintain information for a longer time span.
In the 1D case, the input sequence is written as e(t), but here, e q is written as input for multi-dimension cases. The position of the upper index is written as q m , m ∈ 1,2,3, . . . k. In dimension g, the step back positions can be presented as Q g = q 1 . . . q d−1 , . . . q k . From m to n with dimension g, V mn d is the recurrent connection. The equation with the forward direction of g dimension MD-RNNs is represented as: The below equation calculates the backward pass as follows: where ε q n = ∂E/∂f n q is the output error of n unit at time q and δ n q = q, ∂E/∂e n represents the error after accumulation.
One-dimensional recurrence is used by standard RNNs such as one axis of an image, but the scanning of an image by MDRNNs is carried out along both axes. This will enable more contextual variations in four directions (top, bottom, left, and right). Figure 12 illustrates the scanning direction of 2D-RNN. In the figure, a q is the input sequence; the two points (m − 1,n) and (m,n − 1) always reached the end before that point (m,n). There were very few practical applications of RNNs due to the problem of vanishing gradient and long-term dependencies. In 1997, Hochreiter designed a new network, LSTM, to tackle long-term dependencies. LSTMs are a special type of RNN that are designed to have hidden-layer units such as memory cells. These memory cells are specially used to maintain information for a longer time span. ig t = σ(W iga a t + W igh h t−1 + W igmc mc t−1 + bi ig ) (14) f g t = σ(W f ga a t + W f gh h t−1 + W f gmc mc t−1 + bi mc ) (15) mc t = f g t mc t−1 ig t tanh(W amc a t + W hmc h t−1 + W ghmc mc t−1 + bi mc ) (16) og t = σ(W oga a t + W ogh h t−1 + W ogmc mc t−1 + bi og ) (17) h t = og t tan h(mc t ) (18) orecasting 2021, 3 FOR PEER REVIEW 1  =  (13) where ε q n = ∂E/∂fn q is the output error of n unit at time q and δn q = q, ∂E/∂en represents the error after accumulation. One-dimensional recurrence is used by standard RNNs such as one axis of an image but the scanning of an image by MDRNNs is carried out along both axes. This will enable more contextual variations in four directions (top, bottom, left, and right). Figure 12 illus trates the scanning direction of 2D-RNN. In the figure, a q is the input sequence; the two points (m − 1,n) and (m,n − 1) always reached the end before that point (m,n). There were very few practical applications of RNNs due to the problem of vanishing gradient and long-term dependencies. In 1997, Hochreiter designed a new network, LSTM, to tackle long-term dependencies. LSTMs are a special type of RNN that are designed to have hid den-layer units such as memory cells. These memory cells are specially used to maintain information for a longer time span.

Description of Implementation
In the proposed method, the convolutional layer follows the VGG architecture [6]. In VGG architecture, square pooling windows are used, but this architecture uses rectangu lar max-pooling windows. This convolutional layer generates wider and longer feature maps. These feature maps will be the input of RNN layers. An illustration of LSTM archi tecture is given in Figure 13. Figure 14 compares proposed architecture with different ar chitectures on character recognition rate using Alif dataset. However, for line recognition rate and word recognition rate, Figure 15 and 16 describe analytically. With respect to Activ dataset, the comparison of proposed architecture with different architecture on WRR, LRR and CRR are illustrated in Figures 17, 18 and 19 respectively.

Description of Implementation
In the proposed method, the convolutional layer follows the VGG architecture [6]. In VGG architecture, square pooling windows are used, but this architecture uses rectangular max-pooling windows. This convolutional layer generates wider and longer feature maps. These feature maps will be the input of RNN layers. An illustration of LSTM architecture is given in Figure 13. Figure 14 compares proposed architecture with different architectures on character recognition rate using Alif dataset. However, for line recognition rate and word recognition rate, Figures 15 and 16 describe analytically. With respect to Activ dataset, the comparison of proposed architecture with different architecture on WRR, LRR and CRR are illustrated in Figures 17-19 respectively.
All the scene text and video text images are resized to a fixed height and width. The input resized shape for video text is (90,300) and for a scene, the text is (110,320). We have tested this many times, and observe that the performance is not much affected by resizing the input image into the fixed width and height. As is well known, the Arabic language starts from right to left, for this, all the input images are horizontally flipped; then, these will be the input to the convolutional layer. Here, the architecture composes three parts: the convolutional part, RNN part, and attention over RNN; because of them, it will face a training problem. The batch normalization technique is used to handle this problem [27]. Stochastic gradient descent is used to train this architecture. Backpropagation algorithms are used to calculate gradients. In [28], the backpropagation through time algorithm is used for differentials of error for RNN layers. The adadelta optimization technique is used to set the learning rate parameter [29]. All the scene text and video text images are resized to a fixed height and width. T input resized shape for video text is (90,300) and for a scene, the text is (110,320). We h tested this many times, and observe that the performance is not much affected by resiz the input image into the fixed width and height. As is well known, the Arabic langu starts from right to left, for this, all the input images are horizontally flipped; then, th will be the input to the convolutional layer. Here, the architecture composes three pa the convolutional part, RNN part, and attention over RNN; because of them, it will fac

Experimental Results
The text on natural scene images and video frames can be recognized by the abovediscussed architecture; this section will elaborate on the efficient recognition of Arabic text by this architecture. Here, as we know, all the input images are cropped from the original scene image or a video frame. Each image contains only one word of Arabic text. As per our knowledge, most of the work on Arabic text is character-level classification or recognition. This paper introduces new Arabic scene text baseline results. There are two datasets available for Arabic video text, Alif and Activ [23,25]; the results are reported on these two datasets. Table 4 presents the results of recognition on the Alif dataset for video text. Table 5 shows the proposed architectural results of the AcTiV dataset. A famous Optical character recognition Tesseract (OCR-T) [47] is applied over our one-word dataset for comparison, because there is no or a lower amount of work on a word-level dataset. Three metrics are used to evaluate the performance: Character Recognition Rate (ChRR), Word Recognition Rate (WoRR), and Line Recognition Rate (LiRR).
The methods used for comparison on video text recognition have different convolutional architecture for the extraction of features from end-to-end trainable architecture. For the video text recognition task, we obtained better character-level and line-level accuracy and set the new state of the art. A Deep Belief Network (DBN) is a feed-forward neural network with one or more layers of hidden units, which are commonly referred to as feature detectors [59]. The fact that generative weights can be learned layer-by-layer is a unique feature of DBN. A pair of layers' latent variables can be learned at the same time. MLP_AE_LSTM is an Auto Encoder Long Short-Term Memory based on Multi-Layer Perceptron [59]. This is a feed-forward network that learns to provide the same output as the input. In this network, they used a three-layered neural network with fully connected hidden units. The ability to collect multi-model features is improved by using more than one hidden layer with nonlinear units in the auto-encoder. The backpropagation algorithm is used to train the network. ABBYY [59] is a famous Arabic OCR engine Fine reader 12, used for Arabic character recognition; this OCR obtains a 82.4% to 83.26% character recognition rate, but our proposed model obtains a 98.73% character recognition rate. MDLSTM [60] is a Multi-Dimension Long Short-Term Memory Network; here, MDLSTM Figure 19. Comparison of proposed architecture with different architectures on character recognition rate (Activ dataset).

Experimental Results
The text on natural scene images and video frames can be recognized by the abovediscussed architecture; this section will elaborate on the efficient recognition of Arabic text by this architecture. Here, as we know, all the input images are cropped from the original scene image or a video frame. Each image contains only one word of Arabic text. As per our knowledge, most of the work on Arabic text is character-level classification or recognition. This paper introduces new Arabic scene text baseline results. There are two datasets available for Arabic video text, Alif and Activ [23,25]; the results are reported on these two datasets. Table 4 presents the results of recognition on the Alif dataset for video text. Table 5 shows the proposed architectural results of the AcTiV dataset. A famous Optical character recognition Tesseract (OCR-T) [47] is applied over our one-word dataset for comparison, because there is no or a lower amount of work on a word-level dataset. Three metrics are used to evaluate the performance: Character Recognition Rate (Ch RR ), Word Recognition Rate (Wo RR ), and Line Recognition Rate (Li RR ).
Ch RR = n chars − ∑ Edit dist (R text , G truth ) n chars (19) Wo RR = n words − correctly − recognized n words (20) Li RR = n textimages − correctly − recognized n textimages The methods used for comparison on video text recognition have different convolutional architecture for the extraction of features from end-to-end trainable architecture. For the video text recognition task, we obtained better character-level and line-level accuracy and set the new state of the art. A Deep Belief Network (DBN) is a feed-forward neural network with one or more layers of hidden units, which are commonly referred to as feature detectors [59]. The fact that generative weights can be learned layer-by-layer is a unique feature of DBN. A pair of layers' latent variables can be learned at the same time. MLP_AE_LSTM is an Auto Encoder Long Short-Term Memory based on Multi-Layer Perceptron [59]. This is a feed-forward network that learns to provide the same output as the input. In this network, they used a three-layered neural network with fully connected hidden units. The ability to collect multi-model features is improved by using more than one hidden layer with nonlinear units in the auto-encoder. The backpropagation algorithm is used to train the network. ABBYY [59] is a famous Arabic OCR engine Fine reader 12, used for Arabic character recognition; this OCR obtains a 82.4% to 83.26% character recognition rate, but our proposed model obtains a 98.73% character recognition rate.
MDLSTM [60] is a Multi-Dimension Long Short-Term Memory Network; here, MDLSTM was used for Arabic text recognition, and the text variations on both dimensions of the input image are modeled using this architecture.  Table 5. Proposed architectural results on AcTiV dataset.

Conclusions
This paper presents a state-of-the-art deep learning attention mechanism over an RNN which gives successful results over Arabic text datasets such as Alif and Activ. In recent works, RNN's performance in sequence learning approaches has been influential, especially in text transcription and speech recognition. The attention layer over RNN enables one to obtain a focused area of the input sequence and allows fast and easier learning. In preprocessing, we made a new dataset of one word on an image from Alif and Activ, which will help to improve the reduction in line error rate. We reported it as 85% to 87%. This model gives better results from those based on a simple CNN, RNN, and hybrid CNN-RNN. This model outperformed results with any language modeling and misfit regularization techniques. Images can be transcribed directly from sequence learning techniques and the context of them can be modeled in both directions, forward and backward. Due to the availability of good feature learning algorithms, the focus of the community should move to harder problems such as natural scene recognition.