FAS-Res2net: An Improved Res2net-Based Script Identification Method for Natural Scenes
Abstract
1. Introduction
- (1) Text and image styles in natural scenes are highly variable. Text can come from many kinds of surfaces, such as outdoor billboards, traffic signs, and user instructions, and text images vary greatly in font, color, and artistic style.
- (2) Low resolution, noise, and lighting changes cause distortion, reduce image quality, and lower identification accuracy. Since objective factors such as weather, brightness, and capture equipment are difficult to eliminate when images are taken, image quality must be improved.
- (3) Languages of the same family may differ only minimally, and distinct languages may share a common subset of characters. For instance, Greek, English, and Russian share a set of characters whose arrangement is nearly identical, so distinguishing them requires identifying the characters and components unique to each language; this is a fine-grained classification challenge.
- (4) Background interference has a direct impact on identification accuracy. When the background of an image overlaps the text, the background may be mistaken for part of the text, causing the wrong script to be identified.
- (1) This paper proposes FAS-Res2net, an improved scene script identification method based on the convolutional neural network Res2Net.
- (2) A Feature Pyramid Network (FPN) is introduced to preserve both the deep semantic features and the shallow geometric features of text images.
- (3) An adaptive spatial feature fusion (ASFF) module is proposed that computes spatial weights for the feature maps at different levels and fuses them by weight, resolving the feature conflict between positive and negative samples (a minimal sketch is given after this list).
- (4) Two Swin Transformer encoding blocks are used to extract the global features of the image, enriching the feature information of the script image and aggregating the adaptively fused local features with the global features.
- (5) A fully convolutional classifier replaces the traditional linear fully connected layer to output a classification confidence for each category, from which the script category is determined; this improves classification efficiency.
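To make contribution (3) concrete, below is a minimal PyTorch sketch of ASFF-style fusion over three pyramid levels. It is an illustration under stated assumptions, not the authors' implementation: the class name `AdaptiveSpatialFusion`, the use of 1 × 1 convolutions to produce the weight maps, and the nearest-neighbor resizing are assumptions; only the idea of softmax-normalized spatial weights over levels comes from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveSpatialFusion(nn.Module):
    """ASFF-style fusion: per-pixel softmax weights over pyramid levels (sketch)."""

    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        # One 1x1 conv per level maps its features to a single-channel weight map.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        # Resize every level to the spatial size of the first (finest) map.
        target = feats[0].shape[-2:]
        resized = [
            f if f.shape[-2:] == target
            else F.interpolate(f, size=target, mode="nearest")
            for f in feats
        ]
        # Stack per-level weight logits, then softmax across levels per pixel.
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, resized)], dim=1
        )  # shape: (B, num_levels, H, W)
        weights = logits.softmax(dim=1)
        # Weighted sum of the resized levels; weights broadcast over channels.
        return sum(weights[:, i : i + 1] * resized[i] for i in range(len(resized)))


# Usage: fuse three FPN-like outputs with 256 channels each.
fuse = AdaptiveSpatialFusion(channels=256)
feats = [torch.randn(1, 256, s, s) for s in (32, 16, 8)]
print(fuse(feats).shape)  # torch.Size([1, 256, 32, 32])
```

The design point is that the softmax is taken across levels at each spatial position, so the network can favor shallow geometric features in some regions and deep semantic features in others instead of mixing all levels uniformly.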
2. Related Works
3. Methods
3.1. Overview of Network Structure
3.2. Feature Extraction Module
3.3. Adaptive Multi-Layer Feature Fusion Module
3.4. Swin Transformer-Encoding Block
3.5. Fully Convolutional Classifier
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Results of the Method in This Paper
4.3.1. Benchmark Results
4.3.2. Feature Pyramid Results
4.3.3. Adaptive Multi-Layer Feature Fusion Results
4.3.4. Transformer-Encoding Block
4.3.5. GMP Results
4.3.6. Experimental Results under Different Parameters
4.3.7. Error Analysis
4.4. Comparison with State-of-the-Art Results
4.5. Experimental Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Configuration of the fully convolutional classifier:

| Type | Configuration |
|---|---|
| Input | 512 × H × W |
| Conv2d | Kernel: 3, Stride: 1, Padding: 1, Output: 384 × H × W |
| BatchNorm | Channels: 384 |
| ReLU | |
| Conv2d | Kernel: 3, Stride: 1, Padding: 1, Output: 384 × H × W |
| BatchNorm | Channels: 384 |
| ReLU | |
| Conv2d | Kernel: 1, Stride: 1, Padding: 0, Output: C × H × W |
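The table above maps directly onto a stack of standard PyTorch layers. The following sketch reproduces that configuration; the function name is illustrative, and the final global-max-pooling step is an assumption inferred from the GMP ablation in Section 4.3.5 rather than a detail stated in the table.

```python
import torch
import torch.nn as nn


def fully_conv_classifier(num_classes: int) -> nn.Sequential:
    # Layers follow the configuration table row by row (512 -> 384 -> 384 -> C).
    return nn.Sequential(
        nn.Conv2d(512, 384, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(384),
        nn.ReLU(inplace=True),
        nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(384),
        nn.ReLU(inplace=True),
        nn.Conv2d(384, num_classes, kernel_size=1, stride=1, padding=0),
    )


# Usage: 13 classes as in SIW-13. Reducing the C x H x W score maps to
# per-class confidences via global max pooling is an assumption here.
x = torch.randn(1, 512, 8, 8)              # fused 512 x H x W feature map
score_maps = fully_conv_classifier(13)(x)  # (1, 13, 8, 8) class-confidence maps
confidences = score_maps.amax(dim=(2, 3))  # global max pooling -> (1, 13)
print(confidences.argmax(dim=1))           # predicted script category
```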
Datasets used in the experiments:

| Dataset | Category | Train | Validate | Test |
|---|---|---|---|---|
| SIW-13 | 13 | 9791 | - | 6500 |
| CVSI-2015 | 10 | 6412 | 1069 | 3207 |
Ablation study of each module (√ = module enabled):

| Method | FPN | ASFF | Transformer | GMP | Accuracy (%) |
|---|---|---|---|---|---|
| Res2Net (baseline) | | | | | 93.3 |
| 1 | √ | | | | 94.0 |
| 2 | √ | √ | | | 94.2 |
| 3 | √ | √ | √ | | 94.7 |
| 4 | √ | √ | √ | √ | 94.1 |
Accuracy with different numbers of Swin Transformer encoding blocks:

| Method | Transformer Blocks | Accuracy (%) |
|---|---|---|
| Res2Net50 + FPN + ASFF | 1 | 94.1 |
| Res2Net50 + FPN + ASFF | 2 | 94.7 |
| Res2Net50 + FPN + ASFF | 3 | 93.9 |
| Res2Net50 + FPN + ASFF | 4 | 93.9 |
Accuracy (%) under different parameter settings (column headers give the parameter value):

| Method | 16 | 32 | 64 | 128 |
|---|---|---|---|---|
| Res2Net50 + FPN + ASFF + Swin | 94.3 | 94.7 | 94.5 | 94.6 |
| Res2Net101 + FPN + ASFF + Swin | 94.1 | 94.4 | 94.4 | 94.3 |
Error analysis: predictions of each model variant on two samples (sample images omitted):

| Id | Ground Truth | Res2Net | Res2Net + FPN | Res2Net + FPN + ASFF | Res2Net + FPN + ASFF + Transformer | Res2Net + FPN + ASFF + Transformer + GMP |
|---|---|---|---|---|---|---|
| 141_2 | Arabic | English | English | Arabic | Arabic | Arabic |
| 460_1 | Japanese | Chinese | Chinese | Chinese | Japanese | Japanese |