An End-to-End Scene Text Recognition for Bilingual Text
Abstract
1. Introduction
- To our knowledge, no researchers have developed an end-to-end STR system capable of integrating the tasks of localizing and recognizing bilingual Arabic and English text within a unified framework. Most existing methods treat these tasks separately.
- The most advanced studies on localizing Arabic text from natural scene images have primarily addressed horizontal text, neglecting the challenges posed by multi-oriented and curved Arabic text.
- To our knowledge, our study is the first to utilize advanced Arabic language models like AraElectra to enhance the recognition accuracy of Arabic text from natural scene images.
- To the best of our knowledge, this is the first study to apply end-to-end STR to localizing and recognizing both Arabic-only and bilingual Arabic and English text from natural scene images.
- To the best of our knowledge, this is the first study to use the EvArEST dataset, which contains multi-oriented and curved text, for localizing and recognizing Arabic as well as bilingual Arabic and English text from natural scene images.
- We employed a pretrained CNN model, EfficientNetV2, to extract features from bilingual Arabic and English texts in the images.
- We utilized BiLSTM with an attention mechanism to recognize bilingual Arabic and English text from natural scene images.
- We integrated the Arabic language model named AraElectra with our end-to-end STR model to enhance the recognition of Arabic text from natural scene images.
2. Background
2.1. Arabic Language Characteristics
2.2. Scene Text Recognition (STR)
2.2.1. Text Localization
- Connected Component-Based Methods
- Sliding Window-Based Methods
- Deep Learning-Based Methods
2.2.2. Text Recognition
- Segmentation-Based Methods
- Free-Segmentation Methods
- Image Preprocessing Stage: The image preprocessing stage aims to enhance image quality and mitigate issues caused by poor image conditions. It plays a critical role in text recognition by improving feature representation. Various image preprocessing techniques such as background removal, text image super-resolution, and rectification are employed. These methods effectively address challenges associated with low image quality, thereby significantly enhancing text recognition accuracy.
- Feature Representation Stage: Feature representation is crucial for converting raw text-instance images into a form that emphasizes essential characteristics for character recognition while minimizing the influence of irrelevant factors such as font style, color, size, and background. CNNs are widely adopted in this stage due to their efficiency and effectiveness in extracting image features.
- Sequence Modeling Stage: The sequence modeling stage establishes connections between image features and predictions, enabling the extraction of contextual information from sequences of characters. This approach is valuable for predicting characters in sequence, demonstrating improved reliability and efficiency compared to independent character analysis. BiLSTM networks are commonly utilized in sequence modeling for their capability to capture long-range dependencies accurately [34].
- Prediction Stage: In the prediction stage, the objective is to determine the correct character sequence from the features extracted from the input text-instance image. Two main techniques are employed for this purpose: Connectionist Temporal Classification (CTC) [35] and attention mechanisms [36]. Both enable accurate and efficient decoding of the sequence from the extracted features. A minimal sketch of how these stages fit together is given after this list.
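To make the interplay of these stages concrete, the following minimal PyTorch sketch chains a small convolutional feature extractor, a BiLSTM sequence model, and a CTC prediction layer. It is only an illustration of the generic pipeline, with hypothetical layer sizes and a dummy character set, not the architecture proposed in this paper.

```python
import torch
import torch.nn as nn

class RecognitionPipelineSketch(nn.Module):
    """Generic feature-extraction -> sequence-modeling -> prediction pipeline (illustrative sizes only)."""

    def __init__(self, num_classes: int = 40, hidden: int = 256):
        super().__init__()
        # Feature representation stage: convolutional layers turn the text-instance
        # image into a sequence of column-wise feature vectors.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width as the time axis
        )
        # Sequence modeling stage: a BiLSTM captures context between neighboring characters.
        self.bilstm = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        # Prediction stage: per-time-step class scores, decoded here with CTC.
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)              # (B, 128, 1, W')
        feats = feats.squeeze(2).permute(0, 2, 1)  # (B, W', 128) sequence of column features
        seq, _ = self.bilstm(feats)                # (B, W', 2*hidden)
        return self.classifier(seq).log_softmax(-1)

# Dummy training step with CTC loss (blank index 0, 5-character dummy labels).
model = RecognitionPipelineSketch()
log_probs = model(torch.randn(2, 3, 32, 128)).permute(1, 0, 2)  # CTCLoss expects (T, B, C)
targets = torch.randint(1, 40, (2, 5))
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), log_probs.size(0), dtype=torch.long),
    target_lengths=torch.full((2,), 5, dtype=torch.long),
)
loss.backward()
```

An attention-based decoder can replace the CTC layer in the prediction stage; the earlier stages remain unchanged.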
2.2.3. End-to-End System
2.3. Datasets of Arabic Scene Text
2.3.1. ARASTEC [37]
2.3.2. ARASTI [38]
2.3.3. ICDAR2017 [39]
2.3.4. EASTR-42K [40]
2.3.5. ICDAR2019 [41]
2.3.6. ASAYAR [42]
2.3.7. Real-Time Arabic Scene Text Detection [43]
2.3.8. ATTICA [44]
2.3.9. EvArEST [10]
2.3.10. Tunisia Street View Dataset [45]
3. Related Works
3.1. Text Localization from Natural Scene Images
3.1.1. Arabic Scene Text Localization
- The majority of techniques for handling Arabic datasets, such as ASAYAR and ATTICA, primarily focus on horizontal Arabic text.
- The aforementioned approaches exclusively target Arabic text and do not address bilingual text combining Arabic and English. While the method evaluated on the ASAYAR dataset in [42] locates bilingual text accurately, it is restricted to horizontal text.
- All of the above techniques employ the same pretrained CNN model, specifically VGG-16, for feature extraction from images.
3.1.2. English Scene Text Localization
3.2. Text Recognition from Natural Scene Images
3.2.1. Arabic Scene Text Recognition
- Currently, there is no research on an end-to-end system for recognizing scene text that can localize and recognize bilingual Arabic and English text. Most studies treat localization and recognition as separate processes.
- Although current studies are successful in recognizing Arabic text in horizontal forms, there are still difficulties in recognizing multi-oriented and curved Arabic text.
- A study [10] attempted to address the recognition of Arabic and English text from natural scene images but did not employ an end-to-end approach.
- Researchers have not yet investigated the use of the LSTM model with an attention mechanism specifically for Arabic text recognition.
3.2.2. English Scene Text Recognition
3.3. End-to-End Scene Text Recognition
4. Proposed Methodology
4.1. Localization Phase
4.1.1. Backbone Stage
- ResNet
- EfficientNetV2
4.1.2. Feature Pyramid Enhancement Module (FPEM) Stage
4.1.3. Detection Head
4.1.4. Pixel Aggregation
4.2. Recognition Phase
4.2.1. Masked Region of Interest (RoI)
4.2.2. Recognition Head
- Starter Stage
- Decoder Stage
- Long Short-Term Memory: LSTM is a widely used recurrent architecture that effectively addresses the vanishing and exploding gradient problems [85]. Unlike standard recurrent units that rely solely on ‘sigmoid’ or ‘tanh’ activations, LSTM introduces memory cells equipped with gates that manage the flow of information into and out of the cells. These gates regulate how information enters the hidden neurons and preserve features from earlier time steps [86]. An LSTM cell consists of input, forget, and output gates alongside a cell activation component; each receives activation signals from different sources and controls the cell state through designated multipliers. The gates also prevent other parts of the network from modifying the memory cell contents across successive time steps. Compared to vanilla RNNs, LSTMs retain signals and propagate error information over longer periods, making them highly effective for sequence learning tasks with intricate long-range dependencies [87].
- Bidirectional Long Short-Term Memory: BiLSTM, an extension of the bidirectional recurrent neural network (BiRNN) introduced in 1997 to enhance traditional RNNs [88], combines both forward and backward LSTM networks. During training, the forward LSTM network processes the input sequence in chronological order, while the backward LSTM network processes it in reverse. Both networks capture the context of the input sequence and extract crucial features [89]. The outputs of both the forward and backward LSTM networks are then combined to generate the final output of the BiLSTM network. By processing input data in both directions, BiLSTM captures additional contextual information compared to unidirectional LSTM models, enabling it to handle long-term dependencies more effectively and improve overall model accuracy [90].
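As a minimal illustration of this forward/backward combination (hypothetical dimensions, not the exact recognition head used in this work), the following PyTorch snippet shows how the two directions of a BiLSTM are concatenated before the prediction layer:

```python
import torch
import torch.nn as nn

# Illustrative BiLSTM over a sequence of character features (hypothetical sizes).
feature_dim, hidden_dim, num_classes = 128, 256, 100

bilstm = nn.LSTM(input_size=feature_dim, hidden_size=hidden_dim,
                 batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden_dim, num_classes)  # forward and backward states are concatenated

x = torch.randn(4, 20, feature_dim)   # (batch, time steps, features)
outputs, (h_n, c_n) = bilstm(x)       # outputs: (4, 20, 2 * hidden_dim)

# outputs[..., :hidden_dim] comes from the forward (left-to-right) pass and
# outputs[..., hidden_dim:] from the backward (right-to-left) pass, so every
# time step sees context from both directions before prediction.
logits = classifier(outputs)          # (4, 20, num_classes)
```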
5. Experiments
5.1. Datasets
5.1.1. Arabic Scene Text Datasets
5.1.2. English Scene Text Datasets
- ICDAR 2015 [96]: This dataset was collected over several months in Singapore and contains 1670 images with 17,548 annotated regions. It is one of the most comprehensive publicly available datasets with complete ground truth for Latin-scripted text. Out of these images, 1500 are publicly accessible, divided into a training set of 1000 images and a test set of 500 images.
- COCO-Text [97]: COCO-Text, built on images from the Microsoft COCO dataset, is one of the largest benchmarks for text localization and recognition. It includes 173,589 text instances from 63,686 images, encompassing handwritten and printed text in both clear and blurry conditions, as well as English and non-English text. The dataset is composed of 43,686 training images and 20,000 testing images.
- Total-Text [98]: The Total-Text dataset is designed for the localization and recognition of Latin text in various forms, including curved, multi-oriented, and horizontal text lines. It consists of 1255 training images and 300 testing images, annotated with polygons at the word level, primarily obtained from street billboards.
5.2. Evaluation Metrics
- Accuracy: This metric calculates the ratio of correctly predicted texts (True Positives, TP, and True Negatives, TN) to the total number of predicted texts, including both correct and incorrect predictions.
- Precision: Precision measures the proportion of correctly predicted bounding boxes (TP) relative to the total number of predicted bounding boxes (both TP and False Positives, FP).
- Recall: Recall quantifies the ratio of correctly predicted bounding boxes (TP) to the total number of expected results based on the ground truth of a dataset.
- F-score: The F-score is the harmonic mean of precision and recall, providing a balanced measure of a model’s performance. The standard formulations of all four metrics are given below.
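For reference, in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), these metrics take their standard forms; how a prediction is matched to the ground truth (e.g., by an IoU threshold for bounding boxes) follows the protocol of the respective benchmark.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}\\
\text{Precision} &= \frac{TP}{TP + FP}\\
\text{Recall}    &= \frac{TP}{TP + FN}\\
\text{F-score}   &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```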
5.3. Training Details
5.4. Implementation Details
6. Results
6.1. Scene Text Localization Results
6.2. End-to-End Scene Text Recognition Results
7. Discussion
7.1. Bilingual Scene Text Localization
7.2. End-to-End Scene Text Recognition for Bilingual Text
7.3. Effect of Text Direction
7.4. Error Analysis
7.5. Comparative Analysis
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
- Wang, W.; Xie, E.; Song, X.; Zang, Y.; Wang, W.; Lu, T.; Yu, G.; Shen, C. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8440–8449. [Google Scholar]
- Luo, C.; Jin, L.; Sun, Z. Moran: A multi-object rectified attention network for scene text recognition. Pattern Recognit. 2019, 90, 109–118. [Google Scholar] [CrossRef]
- Bayatpour, S.; Sharghi, M. A bilingual text detection in natural images using heuristic and unsupervised learning. J. AI Data Min. 2022, 10, 449–466. [Google Scholar]
- Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. Swintextspotter: Scene text spotting via better synergy between text detection and text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4593–4603. [Google Scholar]
- Khan, T.; Sarkar, R.; Mollah, A.F. Deep learning approaches to scene text detection: A comprehensive review. Artif. Intell. Rev. 2021, 54, 3239–3298. [Google Scholar] [CrossRef]
- Katper, S.H.; Gilal, A.R.; Alshanqiti, A.; Waqas, A.; Alsughayyir, A.; Jaafar, J. Deep neural networks combined with STN for multi-oriented text detection and recognition. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 178–184. [Google Scholar] [CrossRef]
- Yao, C.; Zhang, X.; Bai, X.; Liu, W.; Ma, Y.; Tu, Z. Rotation-invariant features for multi-oriented text detection in natural images. PLoS ONE 2013, 8, e70173. [Google Scholar] [CrossRef]
- Ranjitha, P.; Rajashekar, K. A Review on text detection from multi-oriented text images in different approaches. In Proceedings of the 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), IEEE, Coimbatore, India, 2–4 July 2020; pp. 240–245. [Google Scholar]
- Hassan, H.; El-Mahdy, A.; Hussein, M.E. Arabic scene text recognition in the deep learning era: Analysis on a novel dataset. IEEE Access 2021, 9, 107046–107058. [Google Scholar] [CrossRef]
- Wang, P.; Li, H.; Shen, C. Towards end-to-end text spotting in natural scenes. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7266–7281. [Google Scholar] [CrossRef]
- Ahmed, S.B.; Razzak, M.I.; Yusof, R. Cursive Script Text Recognition in Natural Scene Images; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
- Hakak, S.; Kamsin, A.; Tayan, O.; Idris, M.Y.I.; Gilkar, G.A. Approaches for preserving content integrity of sensitive online Arabic content: A survey and research challenges. Inf. Process. Manag. 2019, 56, 367–380. [Google Scholar] [CrossRef]
- Elnagar, A.; Al-Debsi, R.; Einea, O. Arabic text classification using deep learning models. Inf. Process. Manag. 2020, 57, 102121. [Google Scholar] [CrossRef]
- Alrobah, N.; Albahli, S. Arabic handwritten recognition using deep learning: A survey. Arab. J. Sci. Eng. 2022, 47, 9943–9963. [Google Scholar] [CrossRef]
- Hicham, E.M.; Akram, H.; Khalid, S. Using features of local densities, statistics and HMM toolkit (HTK) for offline Arabic handwriting text recognition. J. Electr. Syst. Inf. Technol. 2017, 4, 387–396. [Google Scholar] [CrossRef]
- Al-Saqqar, F.; AL-Shatnawi, A.M.; Al-Diabat, M.; Aloun, M. Handwritten Arabic text recognition using principal component analysis and support vector machines. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 209896493. [Google Scholar] [CrossRef]
- Eltay, M.; Zidouri, A.; Ahmad, I. Exploring deep learning approaches to recognize handwritten Arabic texts. IEEE Access 2020, 8, 89882–89898. [Google Scholar] [CrossRef]
- Mustafa, M.E.; Elbashir, M.K. A deep learning approach for handwritten Arabic names recognition. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 211029354. [Google Scholar] [CrossRef]
- Eltay, M.; Zidouri, A.; Ahmad, I.; Elarian, Y. Generative adversarial network based adaptive data augmentation for handwritten Arabic text recognition. PeerJ Comput. Sci. 2022, 8, e861. [Google Scholar] [CrossRef] [PubMed]
- Wang, W.; Xie, E.; Li, X.; Liu, X.; Liang, D.; Yang, Z.; Lu, T.; Shen, C. Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5349–5367. [Google Scholar] [CrossRef]
- Balaha, H.M.; Ali, H.A.; Badawy, M. Automatic recognition of handwritten Arabic characters: A comprehensive review. Neural Comput. Appl. 2021, 33, 3011–3034. [Google Scholar] [CrossRef]
- Chen, X.; Jin, L.; Zhu, Y.; Luo, C.; Wang, T. Text recognition in the wild: A survey. ACM Comput. Surv. 2021, 54, 42. [Google Scholar] [CrossRef]
- Lin, H.; Yang, P.; Zhang, F. Review of scene text detection and recognition. Arch. Comput. Methods Eng. 2020, 27, 433–454. [Google Scholar] [CrossRef]
- Neumann, L.; Matas, J. A method for text localization and recognition in real-world images. In Proceedings of the Computer Vision–ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010; Revised Selected Papers, Part III 10. Springer: Berlin/Heidelberg, Germany, 2011; pp. 770–783. [Google Scholar]
- Epshtein, B.; Ofek, E.; Wexler, Y. Detecting text in natural scenes with stroke width transform. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, San Francisco, CA, USA, 13–18 June 2010; pp. 2963–2970. [Google Scholar]
- Pan, Y.F.; Hou, X.; Liu, C.L. A hybrid approach to detect and localize texts in natural scene images. IEEE Trans. Image Process. 2010, 20, 800–813. [Google Scholar] [PubMed]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
- Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Liao, M.; Shi, B.; Bai, X. Textboxes++: A single-shot oriented scene text detector. IEEE Trans. Image Process. 2018, 27, 3676–3690. [Google Scholar] [CrossRef]
- Tian, Z.; Huang, W.; He, T.; He, P.; Qiao, Y. Detecting text in natural image with connectionist text proposal network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 56–72. [Google Scholar]
- Zhang, C.; Liang, B.; Huang, Z.; En, M.; Han, J.; Ding, E.; Ding, X. Look more than once: An accurate detector for text of arbitrary shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10552–10561. [Google Scholar]
- Graves, A.; Liwicki, M.; Fernandez, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 855–868. [Google Scholar] [CrossRef] [PubMed]
- Graves, A.; Fernandez, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
- Tounsi, M.; Moalla, I.; Alimi, A.M.; Lebouregois, F. Arabic characters recognition in natural scenes using sparse coding for feature representations. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, Tunis, Tunisia, 23–26 August 2015; pp. 1036–1040. [Google Scholar]
- Tounsi, M.; Moalla, I.; Alimi, A.M. ARASTI: A database for arabic scene text recognition. In Proceedings of the 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), IEEE, Nancy, France, 3–5 April 2017; pp. 140–144. [Google Scholar]
- Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), IEEE, Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454–1459. [Google Scholar]
- Ahmed, S.B.; Naz, S.; Razzak, M.I.; Yusof, R.B. A novel dataset for English-Arabic scene text recognition (EASTR)-42K and its evaluation using invariant feature extraction on detected extremal regions. IEEE Access 2019, 7, 19801–19820. [Google Scholar] [CrossRef]
- Nayef, N.; Patel, Y.; Busta, M.; Chowdhury, P.N.; Karatzas, D.; Khlif, W.; Matas, J.; Pal, U.; Burie, J.C.; Liu, C.l.; et al. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition–RRC-MLT-2019. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), IEEE, Sydney, Australia, 20–25 September 2019; pp. 1582–1587. [Google Scholar]
- Akallouch, M.; Boujemaa, K.S.; Bouhoute, A.; Fardousse, K.; Berrada, I. ASAYAR: A dataset for Arabic–Latin scene text localization in highway traffic panels. IEEE Trans. Intell. Transp. Syst. 2020, 23, 3026–3036. [Google Scholar] [CrossRef]
- Moumen, R.; Chiheb, R.; Faizi, R. Real-time Arabic scene text detection using fully convolutional neural networks. Int. J. Electr. Comput. Eng. 2021, 11, 1634–1640. [Google Scholar] [CrossRef]
- Boujemaa, K.S.; Akallouch, M.; Berrada, I.; Fardousse, K.; Bouhoute, A. ATTICA: A dataset for Arabic text-based traffic panels detection. IEEE Access 2021, 9, 93937–93947. [Google Scholar] [CrossRef]
- Boukthir, K.; Qahtani, A.M.; Almutiry, O.; Dhahri, H.; Alimi, A.M. Reduced annotation based on deep active learning for Arabic text detection in natural scene images. Pattern Recognit. Lett. 2022, 157, 42–48. [Google Scholar] [CrossRef]
- Gaddour, H.; Kanoun, S.; Vincent, N. A new method for arabic text detection in natural scene image based on the color homogeneity. In Proceedings of the Image and Signal Processing: 7th International Conference, ICISP 2016, Trois-Rivières, QC, Canada, 30 May–1 June 2016; Proceedings 7. Springer: Berlin/Heidelberg, Germany, 2016; pp. 127–136. [Google Scholar]
- Chowdhury, A.; Biswas, S.K.; Bianco, S. Active Deep Learning Reduces Annotation Burden in Automatic Cell Segmentation. In Proceedings of the Medical Imaging 2021: Digital Pathology, Online, 15–19 February 2021; SPIE: Bellingham, WA, USA, 2021; Volume 11603, pp. 94–99. [Google Scholar]
- Yang, L.; Zhang, Y.; Chen, J.; Zhang, S.; Chen, D.Z. Suggestive annotation: A deep active learning framework for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, 11–13 September 2017; Proceedings, Part III 20. Springer: Berlin/Heidelberg, Germany, 2017; pp. 399–407. [Google Scholar]
- Liao, M.; Shi, B.; Bai, X.; Wang, X.; Liu, W. Textboxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Shi, B.; Bai, X.; Belongie, S. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2550–2558. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9336–9345. [Google Scholar]
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9365–9374. [Google Scholar]
- Dai, P.; Zhang, S.; Zhang, H.; Cao, X. Progressive contour regression for arbitrary-shape scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7393–7402. [Google Scholar]
- Ye, M.; Zhang, J.; Zhao, S.; Liu, J.; Du, B.; Tao, D. Dptext-detr: Towards better scene text detection with dynamic points in transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3241–3249. [Google Scholar]
- Ahmed, S.B.; Naz, S.; Razzak, M.I.; Yousaf, R. Deep learning based isolated arabic scene character recognition. In Proceedings of the 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), IEEE, Nancy, France, 3–5 April 2017; pp. 46–51. [Google Scholar]
- Jain, M.; Mathew, M.; Jawahar, C. Unconstrained scene text and video text recognition for Arabic script. In Proceedings of the 2017 1st International Workshop on Arabic Script Analysis and Recognition (ASAR), IEEE, Nancy, France, 3–5 April 2017; pp. 26–30. [Google Scholar]
- Alsaeedi, A.; Al Mutawa, H.; Snoussi, S.; Natheer, S.; Omri, K.; Al Subhi, W. Arabic words recognition using CNN and TNN on a smartphone. In Proceedings of the 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition (ASAR), IEEE, London, UK, 12–14 March 2018; pp. 57–61. [Google Scholar]
- Ahmed, S.B.; Naz, S.; Razzak, I.; Prasad, M. Unconstrained arabic scene text analysis using concurrent invariant points. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, Glasgow, UK, 19–24 July 2020; pp. 1–6. [Google Scholar]
- Bissacco, A.; Cummins, M.; Netzer, Y.; Neven, H. Photoocr: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 2–8 December 2013; pp. 785–792. [Google Scholar]
- Liu, W.; Chen, C.; Wong, K.Y.K.; Su, Z.; Han, J. Star-net: A spatial attention residue network for scene text recognition. In Proceedings of the BMVC, York, UK, 19–22 September 2016; Volume 2, p. 7. [Google Scholar]
- Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304. [Google Scholar] [CrossRef]
- Wang, J.; Hu, X. Gated recurrent convolution neural network for ocr. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Borisyuk, F.; Gordo, A.; Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 71–79. [Google Scholar]
- Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4168–4176. [Google Scholar]
- Lee, C.Y.; Osindero, S. Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2231–2239. [Google Scholar]
- Zhan, F.; Lu, S. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2059–2068. [Google Scholar]
- Hassan, H.; Torki, M.; Hussein, M.E. SCAN: Sequence-character aware network for text recognition. In Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2021), Vienna, Austria, 8–10 February 2021; pp. 602–609. [Google Scholar]
- Cheng, C.; Wang, P.; Da, C.; Zheng, Q.; Yao, C. LISTER: Neighbor decoding for length-insensitive scene text recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 19541–19551. [Google Scholar]
- Liu, Y.; Shen, C.; Jin, L.; He, T.; Chen, P.; Liu, C.; Chen, H. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8048–8064. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X.; Su, Y.; Tripathi, S.; Tu, Z. Text spotting transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9519–9528. [Google Scholar]
- Kittenplon, Y.; Lavi, I.; Fogel, S.; Bar, Y.; Manmatha, R.; Perona, P. Towards weakly-supervised text spotting using a multi-task transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4604–4613. [Google Scholar]
- Huang, M.; Zhang, J.; Peng, D.; Lu, H.; Huang, C.; Liu, Y.; Bai, X.; Jin, L. Estextspotter: Towards better scene text spotting with explicit synergy in transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 19495–19505. [Google Scholar]
- Kil, T.; Kim, S.; Seo, S.; Kim, Y.; Kim, D. Towards unified scene text spotting based on sequence generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15223–15232. [Google Scholar]
- Ye, M.; Zhang, J.; Zhao, S.; Liu, J.; Liu, T.; Du, B.; Tao, D. Deepsolo: Let transformer decoder with explicit points solo for text spotting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19348–19357. [Google Scholar]
- Das, A.; Biswas, S.; Banerjee, A.; Lladós, J.; Pal, U.; Bhattacharya, S. Harnessing the power of multi-lingual datasets for pre-training: Towards enhancing text spotting performance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 718–728. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MNASNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2820–2828. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Sifre, L.; Mallat, S. Rigid-motion scattering for texture classification. arXiv 2014, arXiv:1403.1687. [Google Scholar]
- Gupta, S.; Tan, M. EfficientNet-EdgeTPU: Creating Accelerator-Optimized Neural Networks with AutoML; Google AI Blog: San Francisco, CA, USA, 2019; Volume 2. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Le, Q.V.; Jaitly, N.; Hinton, G.E. A simple way to initialize recurrent networks of rectified linear units. arXiv 2015, arXiv:1504.00941. [Google Scholar]
- Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent advances in recurrent neural networks. arXiv 2017, arXiv:1801.01078. [Google Scholar]
- Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
- Sun, S.; Sun, J.; Wang, Z.; Zhou, Z.; Cai, W. Prediction of battery SOH by CNN-BiLSTM network fused with attention mechanism. Energies 2022, 15, 4428. [Google Scholar] [CrossRef]
- Adil, M.; Wu, J.Z.; Chakrabortty, R.K.; Alahmadi, A.; Ansari, M.F.; Ryan, M.J. Attention-based STL-BiLSTM network to forecast tourist arrival. Processes 2021, 9, 1759. [Google Scholar] [CrossRef]
- Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. AraELECTRA: Pre-training text discriminators for Arabic language understanding. arXiv 2020, arXiv:2012.15516. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. Arabert: Transformer-based model for Arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
- Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; I Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; De Las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, IEEE, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493. [Google Scholar]
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
- Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv 2016, arXiv:1601.07140. [Google Scholar]
- Ch’ng, C.K.; Chan, C.S. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), IEEE, Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 935–942. [Google Scholar]
- Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The Pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Name | Isolated | Initial | Middle | End |
---|---|---|---|---|
Baa | ب | بـ | ـبـ | ـب |
Taa | ت | تـ | ـتـ | ـت |
Thaa | ث | ثـ | ـثـ | ـث |
Dot (Noqtah) | Name of Characters | Shape of Characters |
---|---|---|
One dot | Baa, Gem, Khaa, Zal, Zai, Dad, Zaa, Gin, Faa, and Noon | ب, ج, خ, ذ, ز, ض, ظ, غ, ف, ن |
Two dots | Taa, Qaf, and Yaa | ت, ق, ي |
Three dots | Thaa and Shin | ث, ش |
Above characters | Khaa, Zal, Zai, Dad, Zaa, Gin, Faa, Qaf, and Noon | خ, ذ, ز, ض, ظ, غ, ف, ق, ن |
Below characters | Baa and Yaa | ب, ي |
Center of characters | Gem | ج |
Dataset | Year | Task Type | Text Orientation | No. of Images | Availability |
---|---|---|---|---|---|
ARASTEC [37] | 2015 | Localization and recognition | Horizontal | 260 | Private |
ARASTI [38] | 2017 | Recognition | Horizontal | 374 | Public |
ICDAR2017 [39] | 2017 | Localization and recognition | Horizontal, multi-oriented, and curved | 18,000 | Public |
EASTR-42K [40] | 2019 | Localization and recognition | Horizontal | 2469 | Private |
ICDAR2019 [41] | 2019 | Localization and recognition | Horizontal, multi-oriented, and curved | 20,000 | Public |
ASAYAR [42] | 2020 | Localization | Horizontal | 1375 | Public |
Real-Time Arabic Scene Text Detection [43] | 2021 | Localization | Horizontal and curved | 575 | Private |
ATTICA [44] | 2021 | Localization and recognition | Horizontal | 1180 | Private |
EvArEST [10] | 2021 | Localization and recognition | Horizontal, multi-oriented, and curved | 510 | Public |
TSVD [45] | 2021 | Localization | Horizontal | 7000 | Private |
Ref. | Approach | Scale | Year | Backbone | Dataset | No. of Images | Type of Text | Precision (%) | Recall (%) | F-Score (%) | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|
[46] | NA | NA | 2016 | NA | Private dataset | 50 | Horizontal text | 78.0 | 89.0 | 83.1 | NA |
[42] | CTPN | 1920 × 1080 | 2020 | VGG | ASAYAR | 1375 | Horizontal text | 88.0 | 95.0 | 86.0 | |
EAST | 93.0 | 74.0 | 82.0 | ||||||||
TextBoxes++ | 66.0 | 52.0 | 58.0 | ||||||||
[43] | NA | NA | 2021 | VGG | Private dataset | 575 | Multi-oriented text | 65.1 | 71.4 | 68.0 | 24.3
[44] | CTPN | NA | 2021 | VGG | ATTICA | 1180 | Horizontal text | 67.0 | 85.0 | 74.9 | NA |
EAST | 71.0 | 89.0 | 78.9 | ||||||||
[45] | Deep active learning | 640 × 640 | 2022 | VGG | TSVD | 700 | Horizontal text | 82.77 | NA | NA | |
ICDAR2017 | 250 | Multi-oriented and curved | 73.26 | ||||||||
ICDAR2019 | 250 | 74.09 | |||||||||
Deep learning | 640 × 640 | TSVD | 7000 | Horizontal text | 89.86 | ||||||
ICDAR2017 | 1000 | Multi-oriented and curved | 81.55 | ||||||||
ICDAR2019 | 1000 | 81.56 |
Ref. | Approach | Year | Preprocessing | Feature Extraction | Sequence Modeling | Prediction | Dataset | Evaluation |
---|---|---|---|---|---|---|---|---|
[37] | NA | 2015 | NA | SIFT | NA | SVM | ARASTEC/260 images | (5-Arabic character) = 48.1; (15-Arabic character) = 57.5; (15-Glyphs) = 60.4
[55] | NA | 2017 | Grayscale image/fixed size | ConvNet | NA | ConvNet | EAST/250 images | Error rate = 0.15
[56] | CNN-RNN | 2017 | Scaled to fixed high resolution | VGG | BiLSTM | CTC | Image from Google/2000 words | CRR = 75.05; LRR = 39.43
[57] | NA | 2018 | Binarization image and grayscale image | CNN | NA | CNN for predicting characters and TNN for predicting words | Isolated character/300 images and signboard image/100 images | Recognition rate: Calibri = 100%; Al-Andalus = 87%; Aldhabi = 97%
[40] | NA | 2019 | Binary image and mask image | MSER and SIFT | MDLSTM | CTC | EASTR-42,000/1500 images | Recall = 89.5%; Precision = 94.1%; F-score = 97.52%
[58] | NA | 2020 | Binary image and mask image | MSER and SIFT | MDLSTM | CTC | (ASTR)/13,593 words | Accuracy = 94% |
[10] | CRNN | 2021 | NA | VGG | BiLSTM | CTC | EvArEST/5337 words | Accuracy = 86.5%
RARE | Rectification | VGG | BiLSTM | Attention-based decoder | Accuracy = 89.8% | |||
R2AM | NA | RCNN | NA | Soft-attention mechanism | Accuracy = 84% | |||
STARNET | Rectification | ResNet | BiLSTM | CTC | Accuracy = 89.6% | |||
GRCNN | NA | RCNN | BiLSTM | CTC | Accuracy = 87.4% | |||
Rosetta | NA | ResNet | NA | CTC | Accuracy = 85.4% | |||
WWSTR | Rectification | ResNet | BiLSTM | Attention-based decoder | Accuracy = 91.2% | |||
Moran | Rectification | VGG | BiLSTM | Attention-based decoder | Accuracy = 89.4% | |||
SCAN | Segmentation | VGG | BiLSTM | Self-attention mechanism | Accuracy = 88.4% |
Dataset | Language | Training Set | Testing Set | Total | Annotation |
---|---|---|---|---|---|
ICDAR2017 | Bilingual text | 800 | 200 | 1000 | Word-level |
ICDAR2019 | Bilingual text | 1000 | 200 | 1200 | Word-level |
EvArEST | Bilingual text | 377 | 133 | 510 | Word-level |
ICDAR2015 | English text | 1000 | 500 | 1500 | Word-level |
COCO-Text | English text | 43,686 | 20,000 | 63,686 | Word-level |
Total-Text | English text | 1255 | 300 | 1555 | Word-level |
Stage | Model | Batch Size | Epochs | Learning Rate | Training Strategy | Image Size |
---|---|---|---|---|---|---|
Text Localization | ResNet | 16 | 600 | 1 × 10⁻³ | Training from scratch | 736/896
 | | | 300 | 1 × 10⁻³ | Joint training | 
 | EfficientNetV2 | 16 | 600 | 2 × 10⁻³ | Training from scratch | 
 | | | 300 | 2 × 10⁻³ | Joint training | 
End-to-end | ResNet | 16 | 30 | 1 × 10⁻³ | Joint training | 896
 | EfficientNetV2 | 16 | 30 | 2 × 10⁻³ | | 
Model | Training Strategy | Scale | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2017 FPS | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | ICDAR2019 FPS | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) | EvArEST FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-18 | Training from scratch | 736 | 82.3 | 71.6 | 76.5 | 28.4 | 84.6 | 75.2 | 79.6 | 29.3 | 90.3 | 89.5 | 89.9 | 30.0 |
896 | 84.4 | 75.7 | 79.8 | 27.0 | 85.4 | 76.1 | 80.4 | 27.4 | 91.4 | 89.8 | 90.6 | 28.3 | ||
Joint training | 736 | 86.7 | 81.6 | 84.0 | 34.7 | 89.6 | 82.9 | 86.1 | 31.9 | 88.2 | 90.7 | 89.4 | 32.4 | |
896 | 89.6 | 83.4 | 86.3 | 27.6 | 90.3 | 84.6 | 87.3 | 19.5 | 86.9 | 91.8 | 89.3 | 26.9 | ||
ResNet-50 | Training from scratch | 736 | 85.2 | 76.1 | 80.3 | 22.9 | 85.3 | 76.6 | 80.7 | 21.4 | 89.9 | 89.5 | 89.7 | 28.9 |
896 | 86.5 | 77.2 | 81.5 | 22.9 | 86.1 | 78.9 | 82.3 | 20.3 | 90.5 | 89.9 | 90.2 | 16.2 | ||
Joint training | 736 | 90.8 | 82.3 | 86.3 | 26.2 | 91.1 | 85.2 | 88.0 | 25.9 | 86.1 | 90.7 | 88.4 | 26.0 | |
896 | 91.2 | 84.6 | 87.7 | 20.9 | 91.8 | 86.4 | 89.0 | 20.7 | 86.2 | 91.1 | 89.6 | 20.6 |
Model | Training Strategy | Scale | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2017 FPS | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | ICDAR2019 FPS | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) | EvArEST FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EfficientNetV2-S | Training from scratch | 736 | 82.3 | 66.4 | 73.4 | 41.8 | 83.3 | 69.6 | 75.8 | 36.1 | 90.4 | 82.4 | 86.4 | 38.2 |
896 | 80.4 | 70.8 | 75.2 | 32.3 | 84.6 | 70.8 | 77.0 | 28.0 | 90.0 | 83.3 | 86.5 | 32.1 | ||
Joint training | 736 | 82.5 | 70.9 | 76.2 | 35.2 | 80.4 | 74.8 | 77.4 | 34.8 | 88.6 | 85.5 | 87.0 | 34.0 | |
896 | 81.3 | 74.6 | 77.8 | 29.2 | 82.9 | 76.4 | 79.5 | 25.4 | 89.0 | 85.2 | 87.1 | 28.0 | ||
EfficientNetV2-M | Training from scratch | 736 | 79.9 | 64.2 | 71.1 | 32.6 | 80.2 | 66.8 | 72.8 | 30.5 | 86.5 | 82.7 | 84.5 | 34.6 |
896 | 78.4 | 67.6 | 72.6 | 26.9 | 81.5 | 68.6 | 74.4 | 27.9 | 87.2 | 83.2 | 85.1 | 26.6 | ||
Joint training | 736 | 81.0 | 70.6 | 75.4 | 30.9 | 80.1 | 73.8 | 76.8 | 30.7 | 86.9 | 85.9 | 86.4 | 31.8 | |
896 | 80.5 | 75.5 | 77.9 | 25.6 | 82.3 | 75.9 | 78.9 | 24.9 | 88.1 | 86.1 | 87.1 | 27.2 | ||
EfficientNetV2-L | Training from scratch | 736 | 81.9 | 65.9 | 73.0 | 23.6 | 80.6 | 71.1 | 75.5 | 22.7 | 88.2 | 83.4 | 84.7 | 23.3 |
896 | 80.3 | 68.9 | 74.1 | 17.4 | 81.6 | 72.5 | 76.7 | 20.9 | 88.9 | 85.1 | 86.2 | 18.1 | ||
Joint training | 736 | 82.2 | 70.1 | 75.6 | 25.3 | 83.7 | 76.7 | 80.0 | 16.4 | 87.1 | 85.4 | 86.3 | 26.3 | |
896 | 81.3 | 76.3 | 78.7 | 16.2 | 85.6 | 78.8 | 82.0 | 15.2 | 86.9 | 87.8 | 87.3 | 16.5 |
Model | Training Strategy | ICDAR2017 Accuracy (%) | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2017 FPS | ICDAR2019 Accuracy (%) | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | ICDAR2019 FPS | EvArEST Accuracy (%) | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) | EvArEST FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-18 | Joint training | 77.0 | 97.9 | 58.3 | 73.1 | 28.4 | 76.1 | 97.8 | 57.5 | 72.4 | 28.3 | 85.5 | 98.0 | 66.0 | 79.2 | 29.5 |
ResNet-50 | 77.1 | 98.1 | 58.3 | 73.2 | 24.9 | 76.3 | 98.5 | 57.5 | 72.6 | 24.8 | 85.7 | 97.7 | 66.7 | 79.3 | 25.0 | |
EfficientNetV2-S | 53.9 | 97.1 | 40.9 | 57.5 | 33.7 | 53.1 | 97.3 | 40.3 | 57.0 | 33.5 | 69.8 | 97.6 | 52.4 | 68.2 | 34.2 | |
EfficientNetV2-M | 52.2 | 97.3 | 39.7 | 56.4 | 26.5 | 51.4 | 97.2 | 39.0 | 55.6 | 26.9 | 68.9 | 97.7 | 51.7 | 67.6 | 28.1 | |
EfficientNetV2-L | 56.2 | 98.2 | 42.4 | 59.3 | 17.3 | 53.9 | 97.7 | 40.9 | 57.7 | 17.5 | 71.9 | 98.0 | 54.0 | 69.6 | 18.6 |
Model | Training Strategy | ICDAR2017 Accuracy (%) | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2017 FPS | ICDAR2019 Accuracy (%) | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | ICDAR2019 FPS | EvArEST Accuracy (%) | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) | EvArEST FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-18 | Joint training | 78.1 | 97.7 | 59.2 | 73.7 | 25.9 | 74.2 | 97.7 | 55.9 | 71.1 | 26.8 | 86.9 | 97.8 | 68.0 | 80.2 | 28.3 |
ResNet-50 | 78.6 | 97.8 | 59.7 | 74.1 | 21.3 | 77.6 | 98.9 | 58.0 | 73.1 | 23.5 | 87.1 | 97.9 | 68.3 | 80.5 | 24.6 | |
EfficientNetV2-S | 56.8 | 97.6 | 42.9 | 59.6 | 29.7 | 55.0 | 98.1 | 41.6 | 58.4 | 30.9 | 72.3 | 98.5 | 54.4 | 70.1 | 32.4 | |
EfficientNetV2-M | 54.9 | 97.4 | 40.9 | 57.6 | 24.4 | 53.0 | 97.6 | 40.2 | 56.9 | 25.5 | 70.4 | 98.1 | 53.5 | 69.2 | 27.7 | |
EfficientNetV2-L | 57.2 | 97.4 | 43.1 | 59.8 | 15.1 | 56.3 | 97.5 | 42.5 | 59.2 | 16.2 | 72.8 | 97.9 | 54.7 | 70.2 | 17.3 |
Model | ICDAR2017 Accuracy (%) | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2017 FPS | ICDAR2019 Accuracy (%) | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | ICDAR2019 FPS | EvArEST Accuracy (%) | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) | EvArEST FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 77.0 | 97.9 | 58.1 | 72.9 | 18.6 | 75.8 | 98.1 | 57.0 | 72.1 | 19.2 | 85.2 | 97.3 | 66.7 | 79.1 | 21.0 |
EfficientNetV2-L | 56.0 | 97.5 | 42.1 | 58.8 | 15.6 | 53.5 | 97.3 | 40.4 | 57.0 | 15.9 | 71.3 | 97.8 | 54.1 | 69.6 | 19.2 |
Model | ICDAR2017 Accuracy (%) | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2017 FPS | ICDAR2019 Accuracy (%) | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | ICDAR2019 FPS | EvArEST Accuracy (%) | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) | EvArEST FPS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 78.3 | 97.2 | 59.2 | 73.5 | 19.2 | 77.2 | 98.4 | 58.0 | 72.9 | 20.5 | 87.0 | 97.3 | 68.0 | 80.0 | 21.1 |
EfficientNetV2-L | 57.0 | 97.1 | 43.1 | 59.7 | 14.2 | 56.2 | 97.5 | 42.0 | 58.7 | 15.1 | 72.4 | 97.3 | 54.5 | 69.8 | 17.2 |
Backbone | Number of Parameters |
---|---|
ResNet-18 | 12,246,022 |
ResNet-50 | 24,899,782 |
EfficientNetV2-S | 12,106,462 |
EfficientNetV2-M | 20,803,678 |
EfficientNetV2-L | 176,052,766 |
Dataset | Total Number of Images | Training Images | Training Short Sentence (%) | Training Medium Sentence (%) | Testing Images | Testing Short Sentence (%) | Testing Medium Sentence (%) |
---|---|---|---|---|---|---|---|
EvArEST | 510 | 377 | 87 | 12 | 133 | 91 | 8 |
ICDAR2017 | 1000 | 800 | 98 | 1 | 200 | 97 | 2 |
ICDAR2019 | 1200 | 1000 | 99 | 1 | 200 | 100 | 0 |
Model | ICDAR2017 Accuracy (%) | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2019 Accuracy (%) | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | EvArEST Accuracy (%) | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 + LSTM | 78.3 | 98.5 | 59.9 | 74.4 | 77.5 | 97.6 | 59.5 | 73.9 | 88.1 | 97.6 | 69.4 | 81.1 |
EfficientNetV2-L + LSTM | 57.9 | 98.6 | 44.6 | 61.4 | 55.6 | 98.8 | 42.3 | 59.2 | 73.1 | 98.9 | 56.2 | 71.5 |
Model | ICDAR2017 Accuracy (%) | ICDAR2017 Precision (%) | ICDAR2017 Recall (%) | ICDAR2017 F-Score (%) | ICDAR2019 Accuracy (%) | ICDAR2019 Precision (%) | ICDAR2019 Recall (%) | ICDAR2019 F-Score (%) | EvArEST Accuracy (%) | EvArEST Precision (%) | EvArEST Recall (%) | EvArEST F-Score (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 + BiLSTM | 80.3 | 98.5 | 61.4 | 75.8 | 79.2 | 98.3 | 60.1 | 74.5 | 88.9 | 98.9 | 70.3 | 82.1 |
EfficientNetV2-L + BiLSTM | 59.3 | 98.6 | 45.2 | 61.9 | 58.6 | 97.3 | 44.6 | 61.1 | 75.7 | 98.5 | 57.9 | 72.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).