UniText: A Unified Framework for Chinese Text Detection, Recognition, and Restoration in Ancient Document and Inscription Images
Abstract
1. Introduction
- We propose an end-to-end multi-task learning framework for character detection, recognition, and glyph restoration in page-level images of Chinese historical documents and inscriptions.
- We introduce a fine-grained, instance-mask-based supervision signal for text regions and propose a multi-task loss guided by the text mask, character category, and contextual position (a sketch of such a loss follows this list). This strategy is designed to suppress misleading gradients from non-text areas and to improve the model’s robustness to background noise.
- The proposed approach is validated on a test set of historical document and stone inscription images, demonstrating its effectiveness in improving character readability and supporting digital preservation.
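The following is a minimal PyTorch sketch of how a mask-guided multi-task loss of this kind can be composed. The head shapes, the loss weights `w_det`/`w_cls`/`w_res`, and the `ignore_index` masking trick are illustrative assumptions, not the paper's exact formulation (which is given in Section 3.3):

```python
import torch
import torch.nn.functional as F

def multitask_loss(det_logits, det_mask, cls_logits, cls_labels,
                   res_pred, res_target, w_det=1.0, w_cls=1.0, w_res=1.0):
    """Hypothetical joint loss. Shapes assumed: det_logits, det_mask,
    res_pred, res_target are (N, 1, H, W); cls_logits is (N, C, H, W);
    cls_labels is (N, H, W) with long dtype."""
    # Detection: binary text/non-text mask over the whole image.
    l_det = F.binary_cross_entropy_with_logits(det_logits, det_mask)

    # Recognition: per-cell character classification, evaluated only where
    # the ground-truth mask is positive, so non-text cells send no gradient
    # into the classifier.
    labels = cls_labels.masked_fill(det_mask.squeeze(1) < 0.5, -100)
    l_cls = F.cross_entropy(cls_logits, labels, ignore_index=-100)

    # Restoration: L1 reconstruction of the clean glyph, averaged over
    # text pixels only.
    l_res = (det_mask * (res_pred - res_target).abs()).sum() \
            / det_mask.sum().clamp(min=1)

    return w_det * l_det + w_cls * l_cls + w_res * l_res
```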
2. Related Work
3. Methodology
3.1. The Proposed Model
3.2. Fine-Grained Text Region Annotation Strategy
3.3. Joint Optimization Framework
3.3.1. Loss Components and Formulation
3.3.2. The Stroke-Aware Classification Loss Computation Strategy
4. Dataset and Evaluation Protocol
4.1. Dataset Constructed with Poisson Blending and Relief Simulation
4.2. Harmonic Evaluation Metric
5. Experiments
5.1. Implementation Details
5.2. Ablation Study
5.3. Text Detection and Recognition Performance Analysis
5.4. Glyph Restoration Effectiveness Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Qi, H.; Yang, H.; Wang, Z.; Ye, J.; Xin, Q.; Zhang, C.; Lang, Q. AncientGlyphNet: An advanced deep learning framework for detecting ancient Chinese characters in complex scene. Artif. Intell. Rev. 2025, 58, 88.
2. Shen, L.; Chen, B.; Wei, J.; Xu, H.; Tang, S.K.; Mirri, S. The challenges of recognizing offline handwritten Chinese: A technical review. Appl. Sci. 2023, 13, 3500.
3. Fang, K.; Chen, J.; Zhu, H.; Gadekallu, T.R.; Wu, X.; Wang, W. Explainable-AI-based two-stage solution for WSN object localization using zero-touch mobile transceivers. Sci. China Inf. Sci. 2024, 67, 170302.
4. Zhu, S.; Xue, H.; Nie, N.; Zhu, C.; Liu, H.; Fang, P. Reproducing the past: A dataset for benchmarking inscription restoration. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 7714–7723.
5. Zhang, P.; Li, C.; Sun, Y. Stone inscription image segmentation based on Stacked-UNets and GANs. Discov. Appl. Sci. 2024, 6, 550.
6. Xu, Y.; Zhang, X.Y.; Zhang, Z.; Liu, C.L. Large-scale continual learning for ancient Chinese character recognition. Pattern Recognit. 2024, 150, 110283.
7. Ma, W.; Zhang, H.; Jin, L.; Wu, S.; Wang, J.; Wang, Y. Joint layout analysis, character detection and recognition for historical document digitization. In Proceedings of the 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), Dortmund, Germany, 8–10 September 2020; pp. 31–36.
8. Shimoyama, H.; Yoshida, S.; Fujita, T.; Muneyasu, M. U-Net architecture for ancient handwritten Chinese character detection in Han dynasty wooden slips. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2023, 106, 1406–1415.
9. Sober, B.; Levin, D. Computer aided restoration of handwritten character strokes. Comput.-Aided Des. 2017, 89, 12–24.
10. Poddar, A.; Chakraborty, A.; Mukhopadhyay, J.; Biswas, P.K. TexRGAN: A deep adversarial framework for text restoration from deformed handwritten documents. In Proceedings of the Twelfth Indian Conference on Computer Vision, Graphics and Image Processing, Jodhpur, India, 19–22 December 2021; pp. 1–9.
11. Koch, P.; Nuñez, G.V.; Arias, E.G.; Heumann, C.; Schöffel, M.; Häberlin, A.; Aßenmacher, M. A tailored handwritten-text-recognition system for medieval Latin. arXiv 2023, arXiv:2308.09368.
12. Locaputo, A.; Portelli, B.; Colombi, E.; Serra, G. Filling the lacunae in ancient Latin inscriptions. In Proceedings of the 19th IRCDL (Conference on Information and Research Science Connecting to Digital and Library Science), Bari, Italy, 23–24 February 2023; pp. 68–76.
13. Aguilar, S.T.; Jolivet, V. Handwritten text recognition for documentary medieval manuscripts. J. Data Min. Digit. Humanit. 2023.
14. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
15. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
16. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304.
17. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
18. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
19. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
20. Shit, S.; Paetzold, J.C.; Sekuboyina, A.; Ezhov, I.; Unger, A.; Zhylka, A.; Pluim, J.P.; Bauer, U.; Menze, B.H. clDice: A novel topology-preserving loss function for tubular structure segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16560–16569.
21. Yang, H.; Jin, L.; Huang, W.; Yang, Z.; Lai, S.; Sun, J. Dense and tight detection of Chinese characters in historical documents: Datasets and a recognition guided detector. IEEE Access 2018, 6, 30174–30183.
22. Xu, Y.; Yin, F.; Wang, D.H.; Zhang, X.Y.; Zhang, Z.; Liu, C.L. CASIA-AHCDB: A large-scale Chinese ancient handwritten characters database. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 793–798.
23. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160.
24. Pérez, P.; Gangnet, M.; Blake, A. Poisson image editing. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2; Association for Computing Machinery: New York, NY, USA, 2023; pp. 577–582.
25. PaddleOCR. PaddleOCR Documentation. 2024. Available online: https://paddlepaddle.github.io/PaddleOCR (accessed on 15 November 2024).
26. Liao, M.; Wan, Z.; Yao, C.; Chen, K.; Bai, X. Real-time scene text detection with differentiable binarization. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11474–11481.
27. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. EAST: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560.
28. Zhu, Y.; Chen, J.; Liang, L.; Kuang, Z.; Jin, L.; Zhang, W. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3123–3131.
29. Wang, W.; Xie, E.; Li, X.; Hou, W.; Lu, T.; Yu, G.; Shao, S. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9336–9345.
| Level | Layer | Params | Output |
|---|---|---|---|
| Input | – | – | 512 × 512 |
| Conv_1_1 | ConvBNLayer | 3 × 3, in = 3, stride = 2 | 256 × 256 |
| Conv_1_2 | ConvBNLayer | 3 × 3, in = 32, stride = 1 | 256 × 256 |
| Conv_1_3 | ConvBNLayer | 3 × 3, in = 32, stride = 1 | 256 × 256 |
| MaxPool1 | MaxPool2d | 3 × 3, stride = 2, padding = 1 | 128 × 128 |
| Block1 | ResBlock | – | 128 × 128 |
| Block2 | ResBlock | – | 64 × 64 |
| Block3 | ResBlock | – | 32 × 32 |
| Block4 | ResBlock | – | 16 × 16 |
| DeConv1 | DeConvBNLayer | 4 × 4, in = 512, stride = 2 | 32 × 32 |
| DeConv2 | ConvBNLayer | 3 × 3, in = 384, stride = 1 | 32 × 32 |
| | DeConvBNLayer | 4 × 4, in = 128, stride = 2 | 64 × 64 |
| DeConv3 | ConvBNLayer | 3 × 3, in = 256, stride = 1 | 64 × 64 |
| | DeConvBNLayer | 4 × 4, in = 128, stride = 2 | 128 × 128 |
| DeConv4 | ConvBNLayer | 3 × 3, in = 192, stride = 1 | 128 × 128 |
| | ConvBNLayer | 3 × 3, in = 128, stride = 1 | 128 × 128 |
| UpConvBlock1 | Conv2d | 1 × 1, in = 128, stride = 1 | 128 × 128 |
| | Upsample | – | 256 × 256 |
| | Conv2d | 3 × 3, in = 32, stride = 1 | 256 × 256 |
| UpConvBlock2 | Upsample | – | 512 × 512 |
| | Conv2d | 3 × 3, in = 32, stride = 1 | 512 × 512 |
| DownConv1 | Conv2d | 3 × 3, in = 128, stride = 2 | 64 × 64 |
| DownConv2 | Conv2d | 1 × 1, in = 384, stride = 1 | 64 × 64 |
| DownConv3 | Conv2d | 3 × 3, in = 256, stride = 2 | 32 × 32 |
| DownConv4 | Conv2d | 1 × 1, in = 640, stride = 1 | 32 × 32 |
| DetConv | ConvBNLayer | 3 × 3, in = 512, stride = 1 | 32 × 32 |
| | ConvBNLayer | 3 × 3, in = 128, stride = 1 | 32 × 32 |
| ClsConv | ConvBNLayer | 3 × 3, in = 512, stride = 1 | 32 × 32 |
| | ConvBNLayer | 3 × 3, in = 512, stride = 1 | 32 × 32 |
| ResHead | Conv2d | 1 × 1, in = 128, stride = 1 | 512 × 512 |
| DetHead | ConvBNLayer | 1 × 1, in = 64, stride = 1 | 32 × 32 |
| | ConvBNLayer | 1 × 1, in = 64, stride = 1 | 32 × 32 |
| RecHead | ConvBNLayer | 1 × 1, in = 512, stride = 1 | 32 × 32 |
| | fc | in = 1024 | out = 745 |
| | softmax | in = 745 | 32 × 32 |
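To make the table above easier to read, here is a minimal PyTorch sketch of the overall topology it describes: a shared convolutional encoder feeding a detection head and a recognition head at 1/16 resolution (32 × 32), plus a restoration head back at the full 512 × 512 input resolution. Channel widths, block counts, and layer choices are simplified assumptions; only the input/output shapes and the 745-way character classifier follow the table.

```python
import torch
import torch.nn as nn

class UniTextSketch(nn.Module):
    """Simplified three-head topology; not the published configuration."""

    def __init__(self, num_classes=745):
        super().__init__()

        def conv_bn(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        # Shared encoder: 512 x 512 input down to a 32 x 32 feature map.
        self.encoder = nn.Sequential(
            conv_bn(3, 32, 2),     # 256 x 256
            conv_bn(32, 64, 2),    # 128 x 128
            conv_bn(64, 128, 2),   # 64 x 64
            conv_bn(128, 512, 2),  # 32 x 32
        )
        # DetHead: text/non-text logits on the 32 x 32 grid.
        self.det_head = nn.Conv2d(512, 1, 1)
        # RecHead: 745-way character logits per 32 x 32 cell.
        self.rec_head = nn.Conv2d(512, num_classes, 1)
        # ResHead: restored glyph map back at 512 x 512.
        self.res_head = nn.Sequential(
            nn.Conv2d(512, 32, 1),
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        f = self.encoder(x)
        return self.det_head(f), self.rec_head(f), self.res_head(f)

# Shape check for one 512 x 512 RGB page image.
det, rec, res = UniTextSketch()(torch.randn(1, 3, 512, 512))
print(det.shape, rec.shape, res.shape)  # (1,1,32,32) (1,745,32,32) (1,1,512,512)
```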
Dataset | Chars | Classes | Images | Noise | Angles | Det | Rec | Gly |
---|---|---|---|---|---|---|---|---|
Train | 430,552 | 744 | 12,000 | Yes | Yes | ✓ | ✓ | ✓ |
Test | 71,757 | 744 | 2000 | Yes | Yes | ✓ | ✓ | ✓ |
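Section 4.1 attributes the compositing step of this synthetic dataset to Poisson blending [24]. As a hypothetical illustration of that step (the file names and the near-white mask heuristic are placeholders, not the authors' pipeline), OpenCV's `seamlessClone` implements Poisson image editing:

```python
import cv2
import numpy as np

# Blend a rendered character glyph into a stone/paper background with
# Poisson image editing (Pérez et al. [24]); assumes the glyph patch is
# smaller than the background image.
background = cv2.imread("stone_texture.png")  # placeholder file name
glyph = cv2.imread("rendered_char.png")       # placeholder file name

# Mask of glyph pixels to blend: everything that is not near-white.
mask = np.where(glyph.sum(axis=2) < 720, 255, 0).astype(np.uint8)

# Paste the glyph at the center of the background.
center = (background.shape[1] // 2, background.shape[0] // 2)
composite = cv2.seamlessClone(glyph, background, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("synthetic_sample.png", composite)
```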
Exp | Models | mIoU ↑ | RMSE ↓ | Prec. ↑ | Rec. ↑ | CA ↑ | GS ↑ | Hm. ↑ | Hm.-CA ↑ | HCG ↑ |
---|---|---|---|---|---|---|---|---|---|---|
1 | UniText | 0.6691 | 0.1426 | 0.9822 | 0.9588 | 0.8805 | 0.7632 | 0.9704 | 0.9232 | 0.8629 |
2 | bbox-sel | 0.6195 | 0.1538 | 0.9746 | 0.9050 | 0.8037 | 0.7328 | 0.9385 | 0.8659 | 0.8165 |
3 | all-pixel | 0.6683 | 0.1418 | 0.9655 | 0.9236 | 0.5405 | 0.7633 | 0.9441 | 0.6875 | 0.7110 |
4 | w/o | - | - | 0.8462 | 0.9077 | 0.6102 | - | 0.8759 | 0.7193 | - |
5 | w/o | - | - | 0.9588 | 0.9479 | 0.8764 | - | 0.9533 | 0.9132 | - |
6 | w/o | 0.6237 | 0.1498 | 0.9810 | 0.9502 | - | 0.7370 | 0.9654 | - | - |
7 | w/o | 0.6657 | 0.1451 | - | - | 0.6075 | 0.7603 | - | - | - |
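Section 4.2's harmonic metric can be read directly off this table: Hm. is the harmonic mean of precision and recall, Hm.-CA the harmonic mean of Hm. and CA, and HCG the harmonic mean of Hm., CA, and GS (row 1: 3/(1/0.9704 + 1/0.8805 + 1/0.7632) ≈ 0.8629). A small Python helper reproduces the row-1 values, assuming exactly these definitions:

```python
def harmonic_mean(*vals):
    """Harmonic mean of positive scores: n / sum(1/v)."""
    return len(vals) / sum(1.0 / v for v in vals)

# Row 1 of the ablation table (UniText).
prec, rec, ca, gs = 0.9822, 0.9588, 0.8805, 0.7632
hm = harmonic_mean(prec, rec)
print(round(hm, 4))                          # 0.9704 -> Hm.
print(round(harmonic_mean(hm, ca), 4))       # 0.9232 -> Hm.-CA
print(round(harmonic_mean(hm, ca, gs), 4))   # 0.8629 -> HCG
```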
Methods | Hmean ↑ | Precision ↑ | Recall ↑ | FPS ↑ |
---|---|---|---|---|
Ours | 0.9704 | 0.9822 | 0.9588 | 74.6142 |
DB [26] | 0.7955 | 0.9068 | 0.7085 | 27.1825 |
EAST [27] | 0.9603 | 0.9892 | 0.9331 | 44.8303 |
PSENet [29] | 0.9847 | 0.9997 | 0.9702 | 35.0545 |
FCENet [28] | 0.9772 | 0.9932 | 0.9616 | 39.5791 |