Minimizing Bleed-Through Effect in Medieval Manuscripts with Machine Learning and Robust Statistics
Abstract
1. Introduction
2. State of the Art
3. The Codices Analyzed
4. Factors of Parchment Deterioration: Iron Gall Inks
5. Methods
5.1. Segmentation
5.2. Thresholding
5.3. Morphological Cleaning
5.4. Non-Local Means Denoising
The denoised value of pixel $i$ is a weighted average of the image pixels, $NL(i) = \sum_{j} w(i,j)\, v(j)$, with weights $w(i,j) \propto \exp\!\left(-\|v(P_i) - v(P_j)\|_2^2 / h^2\right)$ normalized so that $\sum_{j} w(i,j) = 1$, where:
- $v(j)$ is the intensity of the pixel $j$;
- $NL(i)$ is the weighted average intensity within the patch window of pixels centered on pixel $i$;
- $P_i$ is the image patch, i.e., a user-defined window of pixels centered on the pixel $i$;
- $w(i,j)$ is the weight representing the similarity between the patches around pixels $i$ and $j$;
- $h$ is a smoothing parameter that controls how much the similarity between patches influences the averaging. A smaller $h$ means that only very similar patches are averaged, because the dissimilar ones obtain small weights, while a larger $h$ allows more dissimilar patches to contribute to the denoised value.
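As a minimal illustration of this filter, the sketch below applies OpenCV's non-local means implementation to a grayscale page scan; the file name and the values of h, the patch size, and the search window are placeholder assumptions rather than the settings used in this work.

```python
import cv2

# Load a page scan in grayscale (placeholder path).
page = cv2.imread("page_scan.png", cv2.IMREAD_GRAYSCALE)

# Non-local means: h controls how strongly dissimilar patches are down-weighted;
# templateWindowSize is the patch size, searchWindowSize the neighbourhood in
# which similar patches are searched.
denoised = cv2.fastNlMeansDenoising(
    page,
    h=10,                   # smoothing strength (illustrative value)
    templateWindowSize=7,   # patch size in pixels
    searchWindowSize=21,    # search window in pixels
)

cv2.imwrite("page_scan_denoised.png", denoised)
```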
5.5. Biweight Estimation
- Compute $M$, the mean or an initial estimate of the central tendency of the image;
- Compute $u_i$, a weighting factor based on the scaled deviation of each data point from $M$: $u_i = \dfrac{x_i - M}{c \cdot \mathrm{MAD}}$. The following definitions are used above:
- $x_i$ is the intensity of the pixel $i$;
- MAD is the Median Absolute Deviation of the pixels;
- $c$ is an empirical tuning constant, also known as the cut-off point, which modulates how strongly image values that deviate from the central location are down-weighted. Smaller values of this constant make the biweight estimation stricter toward outliers (it rejects more of them, becoming more robust), while larger values make the biweight more tolerant of outliers. As discussed in [23], the best balance of estimation efficiency was found for $c = 6$, which makes the method able to include data up to $\pm 4\sigma$ ($\sigma$ is the standard deviation of the image pixel distribution) from the central position.
- If the squared weighting factor $u_i^2$ is greater than 1, then the weight of the pixel $i$ is not updated (it remains zero); otherwise it obtains a value depending on $u_i$: $w_i = \left(1 - u_i^2\right)^2$;
- Finally, compute the biweight location: $C_{\mathrm{BI}} = M + \dfrac{\sum_i w_i (x_i - M)}{\sum_i w_i}$.
- Calculate the standard deviation ($\sigma$) and the median ($m$) of the pixel distribution;
- Remove all pixels that are smaller than $m - \kappa\sigma$ or larger than $m + \kappa\sigma$;
- Go back to step 1, unless the selected exit criterion is reached, where the exit criteria are based on two possible conditions:
- after a user-defined number of iterations has been performed;
- when the newly measured standard deviation is within a certain tolerance of the old one (i.e., the standard deviation measured at the previous iteration). Here, the tolerance is defined by $(\sigma_{\mathrm{old}} - \sigma_{\mathrm{new}})/\sigma_{\mathrm{new}} \le \varepsilon$, where at each clipping iteration the new value of the standard deviation of the pixels is always equal to or smaller than the previous one (i.e., $\sigma_{\mathrm{new}} \le \sigma_{\mathrm{old}}$). A minimal sketch of the biweight location and of this clipping loop is given after the list.
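The following sketch implements the biweight location and the kappa-sigma clipping loop described above with NumPy. It is a minimal illustration under stated assumptions: the default values of c, kappa, max_iters, and tol are placeholders, not necessarily the settings used in this work.

```python
import numpy as np

def biweight_location(x, c=6.0):
    """Tukey biweight estimate of the central location of the pixel values."""
    x = np.asarray(x, dtype=float).ravel()
    m = np.median(x)                                  # initial central estimate
    mad = np.median(np.abs(x - m))                    # median absolute deviation
    if mad == 0:
        return m
    u = (x - m) / (c * mad)                           # scaled deviations
    w = np.zeros_like(x)
    inside = u**2 < 1                                 # pixels within the cut-off
    w[inside] = (1 - u[inside]**2) ** 2               # biweight weights
    return m + np.sum(w * (x - m)) / np.sum(w)

def kappa_sigma_clip(x, kappa=3.0, max_iters=10, tol=0.01):
    """Iteratively discard pixels farther than kappa*sigma from the median."""
    x = np.asarray(x, dtype=float).ravel()
    sigma_old = np.inf
    for _ in range(max_iters):
        m, sigma = np.median(x), np.std(x)
        if sigma == 0:
            break
        x = x[(x >= m - kappa * sigma) & (x <= m + kappa * sigma)]
        # Stop when the relative decrease of sigma falls below the tolerance.
        if np.isfinite(sigma_old) and (sigma_old - sigma) / sigma <= tol:
            break
        sigma_old = sigma
    return x
```

Well-tested library equivalents of both routines (biweight_location and sigma_clip) are also available in astropy.stats, should a ready-made implementation be preferred.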
5.6. Gaussian Mixture Models
5.7. Gaussian Blur
6. Experiments and Results
6.1. Workflow
- The text and the ornaments of the original page, as detected by the segmentation models, are represented exactly as they appear in the original image;
- The text and the ornaments of the original page that are not detected by the segmentation models are represented in a smoothed way relative to the original image;
- The bleed-through is smoothed and barely visible.
- We fit a GMM on the original images, setting the number of Gaussian components to four (front text, background, borders of the page, and other degradation patterns); a minimal fitting sketch is given after this list. Unfortunately, the model was not good enough to capture the differences between the bleed-through and the front text. The likely reason is that the distributions of the four classes, especially that of the bleed-through, vary considerably from page to page, and the thickness and saturation of the bleed-through pixels are sometimes close to those of the front text. Moreover, training a GMM on a considerable number of pages takes a very long time;
- We fit a GMM with three components (background, borders of the page, and other degradation patterns) only on the pixels that the segmentation models labelled as background/bleed-through. The idea was to distinguish the clean background from the background affected by bleed-through. This works well in simple cases, but it still requires an inpainting technique to replace the bleed-through pixels; consequently, the computation time remains significant, the results are not optimal, and an additional inpainting step is still needed to complete the task.
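As an illustration of the first variant above, the sketch below fits a four-component mixture to the colour pixels of a single page with scikit-learn. The file name, the subsampling size, and the covariance settings are assumptions made for the example, not the exact configuration evaluated here.

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

# Load one page and flatten its colour pixels (BGR in OpenCV) to an (N, 3) matrix.
page = cv2.imread("page_scan.png", cv2.IMREAD_COLOR)      # placeholder path
pixels = page.reshape(-1, 3).astype(np.float64)

# Four components: front text, background, page borders, other degradations.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)

# Fitting on every pixel is slow, so the sketch fits on a random subsample.
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), size=min(200_000, len(pixels)), replace=False)]
gmm.fit(sample)

# Assign each pixel to its most likely component and reshape to a label map.
labels = gmm.predict(pixels).reshape(page.shape[:2])
```

Even in this form, as noted above, a single global mixture struggles when the bleed-through distribution shifts from page to page, which is what motivates the segmentation-based approach adopted in this work.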
6.2. Model Selection Strategies
6.3. Qualitative and Quantitative Results
6.4. Inference Time, with GPU Comparison
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hanif, M.; Tonazzini, A.; Savino, P.; Salerno, E.; Tsagkatakis, G. Document Bleed-Through Removal Using Sparse Image Inpainting. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 281–286. [Google Scholar]
- Dubois, E.; Pathak, A. Reduction of bleed-through in scanned manuscript documents. In Proceedings of the IS&T Conference on Image Processing, Image Quality, Image Capture Systems, Montreal, QC, Canada, 22–25 April 2001; Volume 1, pp. 177–180. [Google Scholar]
- Savino, P.; Tonazzini, A.; Bedini, L. Bleed-through cancellation in non-rigidly misaligned recto–verso archival manuscripts based on local registration. Int. J. Doc. Anal. Recognit. (IJDAR) 2019, 22, 163–176. [Google Scholar] [CrossRef]
- Savino, P.; Tonazzini, A. Training a shallow NN to erase ink seepage in historical manuscripts based on a degradation model. Neural Comput. Appl. 2024, 36, 11743–11757. [Google Scholar] [CrossRef]
- Hu, X.; Lin, H.; Li, S.; Sun, B. Global and local features based classification for bleed-through removal. Sens. Imaging 2016, 17, 9. [Google Scholar] [CrossRef]
- Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
- Sun, B.; Li, S.; Zhang, X.P.; Sun, J. Blind bleed-through removal for scanned historical document image with conditional random fields. IEEE Trans. Image Process. 2016, 25, 5702–5712. [Google Scholar] [CrossRef] [PubMed]
- Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, pp. 282–289. [Google Scholar]
- da Costa Rocha, C.; Deborah, H.; Hardeberg, J.Y. Ink bleed-through removal of historical manuscripts based on hyperspectral imaging. In Proceedings of the Image and Signal Processing: 8th International Conference, ICISP 2018, Cherbourg, France, 2–4 July 2018; Proceedings 8. Springer: Cham, Switzerland, 2018; pp. 473–480. [Google Scholar]
- Hanif, M.; Tonazzini, A.; Hussain, S.F.; Khalil, A.; Habib, U. Restoration and content analysis of ancient manuscripts via color space based segmentation. PLoS ONE 2023, 18, e0282142. [Google Scholar] [CrossRef] [PubMed]
- Villegas, M.; Toselli, A.H. Bleed-Through Removal by Learning a Discriminative Color Channel. In Proceedings of the 2014 14th International Conference on Frontiers in Handwriting Recognition, Hersonissos, Greece, 1–4 September 2014; pp. 47–52. [Google Scholar] [CrossRef]
- Russo, G.; Aiosa, L.; Alfano, G.; Chianese, A.; Cornevilli, F.; Di Domenico, G.M.; Maddalena, P.; Mazzucchi, A.; Muraglia, C.; Russillo, F.; et al. MAGIC: Manuscripts of Girolamini in Cloud. IOP Conf. Ser. Mater. Sci. Eng. 2020, 949, 012081. [Google Scholar] [CrossRef]
- Conte, S.; Maddalena, P.M.; Mazzucchi, A.; Merola, L.; Russo, G.; Trombetti, G. The Role of Project MA.G.I.C. in the Context of the European Strategies for the Digitization of the Library and Archival Heritage. In Proceedings of the Eurographics Workshop on Graphics and Cultural Heritage, Lecce, Italy, 4–6 September 2023; Bucciero, A., Fanini, B., Graf, H., Pescarin, S., Rizvic, S., Eds.; The Eurographics Association: Lecce, Italy, 2023. [Google Scholar] [CrossRef]
- Conte, S.; Russo, G.; Salvatore, M.; Tortora, A. Application of the IBiSCo Data Center for cultural heritage projects. In Proceedings of the Final Workshop for the Italian PON IBiSCo Project, Naples, Italy, 18–19 April 2024. [Google Scholar]
- Conte, S.; Mazzucchi, A.; Russo, G.; Tortora, A.; Tortora, G. The organization and management of the MAGIC project for ancient manuscripts digitization: Connections between Mediterranean cultures. In Proceedings of the AIUCD 2024. MeTe Digital: Mediterranean Networks Between Texts and Contexts, XIII Convegno Nazionale, Catania, Italy, 28–30 May 2024. [Google Scholar]
- Conte, S.; Di Domenico, G.M.; Mazzei, A.; Mazzucchi, A.; Russo, G.; Salvi, A.; Tortora, A. The MAGIC project: First research results. In Proceedings of the Information and Research Science Connecting to Digital and Library Science, Bressanone-Brixen, Italy, 22–23 February 2024; pp. 87–93. [Google Scholar]
- Conte, S.; Ferrante, G.; Laccetti, L.; Mazzucchi, A.; Momtaz, Y.; Tortora, A. Content representation and analysis: The Magic Project and the Illuminated Dante Project integrated systems for multimedia information retrieval. In Proceedings of the 14th Italian Information Retrieval Workshop, Udine, Italy, 5–6 September 2024; pp. 66–69. [Google Scholar]
- Buades, A.; Coll, B.; Morel, J.M. Non-local means denoising. Image Process. Line 2011, 1, 208–212. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Hinton, G.E.; Zemel, R. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of the 7th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–2 December 1993; Volume 6. [Google Scholar]
- Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
- Kafadar, K. The efficiency of the biweight as a robust estimator of location. J. Res. Natl. Bur. Stand. 1983, 88, 105–116. [Google Scholar] [CrossRef] [PubMed]
- Mosteller, F.; Tukey, J.W. Data Analysis and Regression. A Second Course in Statistics; Addison-Wesley Series in Behavioral Science: Quantitative Methods; Addison-Wesley Publishing Company: Reading, MA, USA, 1977. [Google Scholar]
- Lehmann, G. Kappa sigma clipping. Insight J. 2006, 367, 1–3. [Google Scholar] [CrossRef]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22. [Google Scholar] [CrossRef]
- Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
- Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
- Fix, E. Discriminatory Analysis: Nonparametric Discrimination, Consistency Properties; USAF School of Aviation Medicine: Randolph Field, TX, USA, 1985; Volume 1. [Google Scholar]
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
- Thorndike, R.L. Who belongs in the family? Psychometrika 1953, 18, 267–276. [Google Scholar] [CrossRef]
- Pratikakis, I.; Zagoris, K.; Barlas, G.; Gatos, B. ICDAR2017 competition on document image binarization (DIBCO 2017). In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 1395–1403. [Google Scholar]
- Pratikakis, I.; Zagoris, K.; Karagiannis, X.; Tsochatzidis, L.; Mondal, T.; Marthot-Santaniello, I. ICDAR 2019 Competition on Document Image Binarization (DIBCO 2019). In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 1547–1556. [Google Scholar] [CrossRef]
- Tensmeyer, C.; Davis, B.; Wigington, C.; Lee, I.; Barrett, B. Pagenet: Page boundary extraction in historical handwritten documents. In Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, Kyoto, Japan, 10–11 November 2017; pp. 59–64. [Google Scholar]
- Oliveira, S.A.; Seguin, B.; Kaplan, F. dhSegment: A generic deep-learning approach for document segmentation. In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7–12. [Google Scholar]
- Wang, J.; Yue, Z.; Zhou, S.; Chan, K.C.; Loy, C.C. Exploiting diffusion prior for real-world image super-resolution. Int. J. Comput. Vis. 2024, 132, 5929–5949. [Google Scholar] [CrossRef]
- Grüning, T.; Labahn, R.; Diem, M.; Kleber, F.; Fiel, S. READ-BAD: A new dataset and evaluation scheme for baseline detection in archival documents. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 351–356. [Google Scholar]
- Ntirogiannis, K.; Gatos, B.; Pratikakis, I. Performance Evaluation Methodology for Historical Document Image Binarization. IEEE Trans. Image Process. 2013, 22, 595–609. [Google Scholar] [CrossRef] [PubMed]
- Lu, H.; Kot, A.; Shi, Y. Distance-reciprocal distortion measure for binary document images. IEEE Signal Process. Lett. 2004, 11, 228–231. [Google Scholar] [CrossRef]
| Method | F1 (%) | Pseudo F1 (%) | PSNR (dB) | DRD |
| --- | --- | --- | --- | --- |
| Rank 1 | 72.87 | 72.15 | 14.475 | 16.71 |
| Rank 5 | 62.985 | 61.01 | 14.32 | 10.84 |
| Ours | 61.83 | 60.44 | 13.58 | 12.19 |
| Rank 10 | 60.145 | 56.7 | 11.745 | 36.52 |
| Rank 20 | 46.63 | 44.06 | 13.09 | 15.57 |
| Device | GPU Time (Segmentation) (s) | Total Elapsed Time (s) |
| --- | --- | --- |
| NVIDIA H100 (GPU) | 1.00 | 4.76 |
| NVIDIA L40S (GPU) | 1.93 | 3.60 |
| NVIDIA RTX 3090 (GPU) | 2.06 | 3.65 |
| NVIDIA V100 (GPU) | 3.19 | 7.91 |
| NVIDIA RTX A4000 (GPU) | 4.10 | 8.73 |
| Intel Core i9-14900KF (CPU) | N/A | 62.70 |