Article

State-of-the-Art Document Image Binarization Using a Decision Tree Ensemble Trained on Classic Local Binarization Algorithms and Image Statistics

by Nicolae Tarbă *, Costin-Anton Boiangiu and Mihai-Lucian Voncilă
Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica Bucharest, 060042 Bucharest, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8374; https://doi.org/10.3390/app15158374
Submission received: 5 June 2025 / Revised: 13 July 2025 / Accepted: 16 July 2025 / Published: 28 July 2025
(This article belongs to the Special Issue Statistical Signal Processing: Theory, Methods and Applications)

Abstract

Image binarization algorithms reduce the original color space to only two values, black and white. They are an important preprocessing step in many computer vision applications. Image binarization is typically performed by comparing pixels against a threshold value and classifying them into two categories: below and above the threshold. Global thresholding uses a single threshold value for the entire image, whereas local thresholding uses different values for different pixels. Although slower and more complex than global thresholding, local thresholding can better classify pixels in noisy areas of an image by considering not only the pixel’s value but also its surrounding neighborhood. This study introduces a local thresholding method that uses the results of several local thresholding algorithms and other image statistics to train a decision tree ensemble. Through cross-validation, we demonstrate that the model is robust and performs well on new data. We compare the results with state-of-the-art solutions and observe significant improvements in the average F-measure across all DIBCO datasets, obtaining 95.8% versus the previous high score of 93.1%. On the DIBCO 2019 dataset in particular, the proposed solution significantly outperformed the previous state-of-the-art algorithms, again obtaining an F-measure of 95.8%, whereas the previous high score was 73.8%.

1. Introduction

Image binarization is a vital preprocessing step in document digitization and other computer vision applications. It is the process of classifying pixels into either background or foreground, representing them with either black or white. In a thresholding-based approach, pixel values are classified as black if they are below a certain threshold and as white if they are above it. If the threshold is the same for all pixels in the image, the method belongs to the global thresholding family; if the threshold can take different values based on pixel position and the surrounding neighborhood, it belongs to the local thresholding family. Global thresholding methods are faster, but they do not perform well on images with poor lighting conditions, stains, faint ink, ink bleeding through from the other side of the paper, holes, creases, or other types of noise. Local thresholding methods overcome these limitations at the cost of increased computational complexity and the risk of introducing noise into what should be uniform areas of the image.
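To make the distinction concrete, the following minimal sketch binarizes the same image both ways using OpenCV’s standard API; the file name, window size, and offset constant are illustrative choices, not values used in this study:

```python
import cv2

# Load a document image as 8-bit grayscale (the path is illustrative).
gray = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)

# Global thresholding: a single threshold (chosen here by Otsu's method)
# is applied to every pixel of the image.
_, global_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Local thresholding: each pixel is compared against a statistic of its
# own neighborhood (here the Gaussian-weighted mean of a 31x31 window,
# shifted by a constant C).
local_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, blockSize=31, C=10)
```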
Transforming the original image into a bitonal image improves the results of most optical character recognition (OCR) engines and layout analysis algorithms (including line and separator detection) to such an extent that they usually include a binarization algorithm in their processing pipeline. Several document image datasets containing low-quality images that highlight the caveats of the previously used thresholding methods are available, which has sparked the development of new thresholding methods to address these caveats. Most documents contain dark text on a bright background, which is why most datasets provide ground-truth images, where black represents the text and white represents the background.
The main contribution of this paper is that we propose a machine learning model, more specifically, a decision tree ensemble that leverages classical local thresholding algorithms and image statistics to significantly improve the classification performance of the leveraged algorithms, achieving a performance comparable to the state of the art.

Related Works

Otsu [1] introduced a classic global thresholding algorithm. Moghaddam and Cheriet [2] proposed a local version of Otsu’s method using the following formula:
$$O_A(x) = \Theta\!\left(\frac{O_G - O_L(x)}{R} - 1\right)\left(O_{Gi} - O_L(x)\right) + O_L(x),$$
where $O_G$, $O_{Gi}$, and $O_L$ are the thresholds obtained by applying Otsu’s thresholding method to the entire image, the inverted image, and a window centered on the pixel $x$, $\Theta$ is the unit step function, and $R$ is an adjustable parameter used to compensate for the deviation between the local and global thresholds. In a later study [3], they proposed a parameterless and adaptive generalization of Otsu’s method that uses the following formula:
$$O_A(x) = \epsilon + \left(O_L(x) - \epsilon\right)\,\Theta\!\left(\sigma(x) - k_\sigma\,\sigma_{EB}(x)\right),$$
where $\epsilon$ is a small value that ensures numerical stability, $\sigma(x)$ and $\sigma_{EB}(x)$ are the standard deviations of the neighborhood centered on $x$ in the original image and in the estimated background image, and $k_\sigma$ is a factor between 1 and 2 used to modify the behavior of the unit step function $\Theta$. However, applying Otsu’s algorithm to each window requires local histogram computation, making it significantly more time-consuming compared with other methods.
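As a minimal illustration of the switching rule in the second formula, the sketch below combines precomputed maps with NumPy. The function and argument names are ours, and the inputs (the local Otsu thresholds and the two standard-deviation maps) are assumed to be computed elsewhere:

```python
import numpy as np

def adotsu_combine(o_local, sigma, sigma_eb, k_sigma=1.5, eps=1e-6):
    """Per-pixel threshold following the AdOtsu switching rule: keep the
    local Otsu threshold where the local contrast sigma exceeds k_sigma
    times the estimated-background contrast sigma_eb, and fall back to
    the small constant eps elsewhere."""
    step = (sigma - k_sigma * sigma_eb >= 0).astype(float)  # unit step
    return eps + (o_local - eps) * step
```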
Bloechle et al. [4] proposed a two-stage algorithm that applies adaptive Otsu thresholding after estimating and subtracting the background. The background estimation was further refined in [5], where only global Otsu was used after subtracting the background.
Howe [6] proposed an algorithm with three steps. First, the image is binarized by minimizing a global energy function, inspired by a Markov random field model. Next, the Laplacian of the pixel intensities is used to distinguish the foreground from the background. Finally, edge discontinuities are incorporated into the smoothness term of the global energy function, and the final binarized image is generated.
Jia et al. [7] computed local thresholds for neighborhoods centered on structurally symmetric pixels and used a voting mechanism to classify pixels as either foreground or background.
Calvo-Zaragoza and Gallego [8] proposed SAE, a convolutional auto-encoder that transforms the original image into a probability map, which can then be thresholded to produce the binarized image.
He and Schomaker [9] proposed DeepOtsu, a method that uses deep learning to iteratively remove noise from input images and then applies Otsu’s method on the enhanced images. They proposed two iterative methods: recurrent refinement and stacked refinement. Recurrent refinement uses the same network in all iterations, whereas stacked refinement uses different networks in different iterations. They later proposed CT-NET [10], a chained-cascade network based on a dual task, T-shaped network that can perform both image enhancement and image binarization.
Peng et al. [11] proposed MRAM, a deep learning solution where a multi-resolution attention model generates a probability map, which is then fed into a convolutional random field to produce the binarized image.
Mondal et al. [12] proposed 2DMN, a bidimensional morphological network that uses dilation and erosion to eliminate noise in the original image. The network was trained on old handwritten documents using backpropagation.
De et al. [13] proposed DD-GAN, a dual discriminator generative adversarial network that consists of a network that processes the image globally and a network that processes the image locally. The generated image is then thresholded to produce the binarized image.
Zhao et al. [14] proposed cGANs, harnessing conditional generative adversarial networks to combine multi-scale information. The method consists of two stages: one that extracts text pixels from the original image using various scales and one that combines the multi-scale binarized images to produce the final binarized image.
Souibgui and Kessentini [15] proposed DE-GAN, an end-to-end deep learning framework that uses conditional generative adversarial networks to binarize images. The method was trained on severely degraded document images.
Suh et al. [16] proposed a two-stage deep learning approach using generative adversarial networks. The first stage consists of four color-independent networks that extract color information from the original image. The second stage consists of two independent networks that process the images generated in the first stage, producing the final binarized image.
Yang and Xu [17] proposed D2BFormer, a deep learning semantic segmentation approach that produces a two-channel feature map with a probability sum of one, one channel corresponding to the foreground and the other to the background. Then, they used a weighted combination of the binary cross-entropy loss function and the Sørensen–Dice coefficient loss function to further train the model. They used the resulting final feature map to binarize the image and argued that avoiding the conversion of pixel values into binary values within the model improves the binarization results.
Biswas et al. [18] proposed DocBinFormer, a two-level transformer network based on vision transformers. The transformer encoder operates directly on pixel patches and feeds their latent representation to the decoder, which then generates the binarized image.
Yang et al. [19] proposed GDB, a method based on gated convolutions, arguing that regular convolutions tend to extract stroke edges imprecisely. Their method was based on two sub-networks: a coarse sub-network trained to obtain more precise feature maps and a refinement sub-network cascaded to refine the output of the coarse sub-network using gated convolutions. They also presented a version of GDB with multi-scale operations (GDBMO) that combines local and global features to further improve the results.

2. Materials and Methods

In this section, we present the thresholding algorithms that we used as features in our machine-learning model, other notable approaches, the tools we used for this study and, finally, the proposed solution.

2.1. Features

As features, we used the results of ten thresholding algorithms that can be computed globally, using convolution (i.e., Gaussian blur), or per pixel in constant time, using summed area tables. Three thresholds were elementary, and seven were from the scientific literature. The elementary local thresholds are the average pixel value of the window, the average of the maximum and minimum pixel values of the window, and the Gaussian blur value, computed using the following formula:
$$T = \sum_{x=-w}^{w}\ \sum_{y=-h}^{h} e^{-\frac{1}{2}\left[\left(\frac{x}{\sigma_x}\right)^2 + \left(\frac{y}{\sigma_y}\right)^2\right]}\, I(x, y),$$
where $w$ is the half-width of the window, $h$ is the half-height of the window, and $\sigma_x = 0.3(w - 1) + 0.8$ and $\sigma_y = 0.3(h - 1) + 0.8$ are the standard deviations of the Gaussian kernel. The formulas for the standard deviations were taken from the Gaussian kernel generator function defined in OpenCV 4.13.
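The three elementary thresholds can be computed with standard OpenCV primitives. The sketch below is our illustration (erosion and dilation with a rectangular kernel yield the window minimum and maximum); the function name is ours:

```python
import cv2
import numpy as np

def elementary_thresholds(gray, w, h):
    """Per-pixel window mean, mid-range (average of the window minimum
    and maximum), and Gaussian blur for a (2w+1) x (2h+1) window."""
    ksize = (2 * w + 1, 2 * h + 1)  # (width, height)
    img = gray.astype(np.float32)

    mean = cv2.blur(img, ksize)

    kernel = np.ones((2 * h + 1, 2 * w + 1), np.uint8)  # (rows, cols)
    mid_range = (cv2.erode(gray, kernel).astype(np.float32) +
                 cv2.dilate(gray, kernel).astype(np.float32)) / 2

    # Standard deviations as in the formula above.
    sigma_x = 0.3 * (w - 1) + 0.8
    sigma_y = 0.3 * (h - 1) + 0.8
    gauss = cv2.GaussianBlur(img, ksize, sigmaX=sigma_x, sigmaY=sigma_y)
    return mean, mid_range, gauss
```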
Bernsen [20] proposed an algorithm based on the range of pixel values within a window. In their method, to reduce the number of misclassified pixels in the very dark or very bright windows, the difference between the maximum and minimum values within the window is compared with a contrast threshold. If the difference is lower than this threshold, then the binarization threshold is the average of the maximum and minimum possible pixel values. Otherwise, the result is the average of the maximum and minimum pixel values in the window.
White and Rohrer [21] proposed a dynamic threshold algorithm comprising a two-dimensional running average, z ( x ) , and a bias function, h ( x , z ( x ) ) , between z ( x ) and the pixel being compared, x . Without this bias function, noise fluctuations would heavily influence the threshold selection in areas with low variance.
Niblack’s method [22] uses the local mean $\mu$ and standard deviation $\sigma$ to compute a local threshold $T = \mu + k\sigma$, where $k$ is an adjustable parameter with a negative value. The author recommends $k = -0.2$.
Khurshid et al. [23] proposed the NICK algorithm, a modified version of Niblack’s method that shifts the binarization threshold down to improve the results for “white- and light-page” images. The NICK threshold is $T = \mu + k\sqrt{\left(\sum p_i^2 - \mu^2\right)/N}$, where $k$ is an adjustable parameter, $p_i$ are the gray-level values of the pixels within the local window, and $N$ is the number of pixels within the local window. The authors recommend $k = -0.2$.
Sauvola and Pietikäinen [24] modified Niblack’s method by introducing a new term $R$, the dynamic range of the standard deviations of all sliding windows in the image. Sauvola’s threshold is $T = \mu\left(1 + k\left(\frac{\sigma}{R} - 1\right)\right)$, where $k$ is an adjustable parameter that takes positive values. The authors recommend $k = 0.5$.
Wolf and Jolion [25] modified Sauvola’s method, shifting the binarization condition from gray levels toward contrast by introducing $M$, the minimum gray-level value of the whole image. Wolf’s threshold is $T = (1 - k)\mu + kM + k\frac{\sigma}{R}(\mu - M)$, where $k$ is an adjustable parameter that controls the uncertainty around the mean value. The authors recommend $k = 0.5$.
Phansalkar et al. [26] modified Sauvola’s method to improve the performance in low-contrast dark regions without significantly affecting the performance in high-contrast light regions. Phansalkar’s threshold is $T = \mu\left(1 + p\,e^{-q\mu} + k\left(\frac{\sigma}{R} - 1\right)\right)$, where $k$, $p$, and $q$ are adjustable parameters. The authors recommend $k = 0.25$, $p = 3$, and $q = 10$.
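The mean-and-deviation formulas above vectorize directly over precomputed local-mean and local-standard-deviation maps. The following is a sketch under our own naming, with the recommended parameter values as defaults; NICK is omitted because it additionally needs the windowed sum of squared gray levels, and Phansalkar’s formula assumes gray levels normalized to [0, 1]:

```python
import numpy as np

def classic_thresholds(img, mu, sigma, k_n=-0.2, k_s=0.5, k_w=0.5,
                       k_p=0.25, p=3.0, q=10.0):
    """Per-pixel Niblack, Sauvola, Wolf, and Phansalkar thresholds from
    the local mean mu and local standard deviation sigma (arrays with
    the same shape as img)."""
    R = sigma.max()  # dynamic range of the window standard deviations
    M = img.min()    # global minimum gray level (Wolf)

    niblack = mu + k_n * sigma
    sauvola = mu * (1 + k_s * (sigma / R - 1))
    wolf = (1 - k_w) * mu + k_w * M + k_w * (sigma / R) * (mu - M)
    phansalkar = mu * (1 + p * np.exp(-q * mu) + k_p * (sigma / R - 1))
    return niblack, sauvola, wolf, phansalkar
```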
We also used image statistics as features, including average, standard deviation, skewness, kurtosis, Sarle’s Bimodality Coefficient (BC) [27], minimum, and maximum. All these statistics were computed both on the whole image and on each window, resulting in a global and local set of image statistics for each pixel. The last feature was a global normalized histogram computed using five bins.
Computing the image statistics on each window might seem to have a high computational cost, but it can be performed in constant time ( O ( 1 ) ) using summed area tables and substituting the local minimum and maximum with Lehmer means. The following should be noted:
$$\lim_{p\to\infty} \frac{\sum_i x_i^{p+1}}{\sum_i x_i^{p}} = \max(x), \qquad \lim_{p\to-\infty} \frac{\sum_i x_i^{p+1}}{\sum_i x_i^{p}} = \min(x),$$
for any distribution x . The powers for the Lehmer means were chosen on an image-by-image basis as the highest possible value for which overflow did not occur. This ensured sufficient precision for all 256 possible values of the local minimum and maximum, introducing no approximation errors on any of the tested images. For example, on a window where the maximum value is 231 and the result of the Lehmer mean is 230.8, it would be correctly rounded to 231 without loss.

2.2. Tools

ML.NET [28] is a framework developed and maintained by Microsoft that allows users to train and evaluate a variety of machine-learning models. It allows automatic exploration of various hyperparameter configurations for these models, using one of several search strategies, such as random search, grid search, and Cost Frugal Tuner (CFT) [29]. Several binary classification model types are available in AutoML, but LightGBM [30] always produced the best results in our experiments.
LightGBM is a gradient boosting decision tree implementation that harnesses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to improve efficiency and scalability. GOSS excludes a considerable proportion of data with small gradients, thereby obtaining an accurate estimate of the information gain while reducing the data size. EFB bundles mutually exclusive features to reduce the number of features without significantly affecting the accuracy of split point determination.
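The study ran LightGBM through ML.NET’s AutoML; for readers outside the .NET ecosystem, an equivalent model can be sketched with the lightgbm Python package. The hyperparameters below are illustrative, not the tuned values from our experiments, and the toy data stands in for the real feature matrix:

```python
import numpy as np
import lightgbm as lgb

# X holds one row per pixel with the features of Section 2.1 (29
# columns in the full scenario, cf. Table 2); y marks foreground
# pixels. Toy data stands in here.
rng = np.random.default_rng(0)
X = rng.random((10_000, 29))
y = (rng.random(10_000) < 0.1).astype(int)

model = lgb.LGBMClassifier(objective="binary", n_estimators=500,
                           learning_rate=0.05, num_leaves=63)
model.fit(X, y)
foreground_prob = model.predict_proba(X)[:, 1]  # per-pixel probability
```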
Models trained and tested on the same dataset are prone to maximizing performance on that dataset at the cost of performance on new data. This problem is called overfitting and is usually addressed by splitting a dataset into training and testing subsets: the model is fitted to the training set and evaluated on the test set, yielding one set of performance metrics. According to Berrar [31], 10-fold cross-validation is one of the most widely used methods for simulating the behavior of a model on new data. To perform $k$-fold cross-validation, the dataset is split into $k$ equally sized disjoint partitions; each partition serves once as the testing subset, with the remaining partitions forming the corresponding training subset. Each test set yields its own performance metrics, which are then aggregated over all $k$ test partitions. For example, 3-fold cross-validation trains the model on partitions 1 and 2 and tests on partition 3, then trains on 1 and 3 and tests on 2, and finally trains on 2 and 3 and tests on 1. This produces three trained models, each with metrics computed on its respective test partition, which can be aggregated by computing their mean, their standard deviation, and so on.
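The 3-fold example translates into a few lines with scikit-learn. This is a generic illustration with a stand-in classifier and toy data (the study’s actual folds are the DIBCO datasets themselves, as described in Section 2.3):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] > 0.5).astype(int)  # toy labels

scores = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True,
                                 random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

# Aggregate the per-fold metrics, e.g., mean and standard deviation.
print(f"mean={np.mean(scores):.3f}  std={np.std(scores):.3f}")
```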

2.3. Proposed Solution

We trained a binary classification model using the 10 local thresholds presented in Section 2.1, local and global statistics (average, standard deviation, skewness, kurtosis, minimum and maximum values), and a global histogram computed with five bins. Four scenarios were explored, using the following feature sets:
  • Only local thresholds;
  • Only local and global image statistics;
  • All features except the global histogram;
  • All features.
In every scenario, the feature set is computed per pixel and contains pixel-specific values for the local thresholds and statistics, and image-specific values for the global statistics and histogram.
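In code, assembling this per-pixel feature matrix amounts to flattening the per-pixel maps into columns and broadcasting the image-level values across all pixels. The sketch below uses our own names and assumes the maps and scalars are computed elsewhere:

```python
import numpy as np

def build_features(threshold_maps, local_stat_maps, global_stats, histogram):
    """One row per pixel: per-pixel maps (local thresholds, local
    statistics) become columns, while image-level values (global
    statistics, histogram bins) are repeated for every pixel."""
    per_pixel = [m.ravel() for m in threshold_maps + local_stat_maps]
    n_pixels = per_pixel[0].size
    per_image = [np.full(n_pixels, v)
                 for v in list(global_stats) + list(histogram)]
    return np.column_stack(per_pixel + per_image)
```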
The DIBCO datasets [32,33,34,35,36,37,38,39,40,41] were used because they provide a wide variety of hard-to-binarize document images, and many other binarization algorithms in the scientific literature use at least one of them, allowing us to compare our results with those of other state-of-the-art methods. The dataset was split into ten partitions, one for each DIBCO dataset, and the models were trained using 10-fold cross-validation. This means that, for each DIBCO dataset, the model was trained on all other DIBCO datasets and validated on the respective DIBCO dataset.
The F-measure ( F M ), also known as the F 1 score, is used as a metric in many studies on image binarization and is defined as follows:
$$FM = \frac{2\,TP}{2\,TP + F},$$
where $TP$ is the number of correctly classified foreground pixels and $F$ is the number of incorrectly classified pixels. $FM$ takes values between 0 and 1, with 1 indicating a perfect classifier.
The ML.NET binary classification experiment was executed with a memory limit of 28 GB and a time limit of 48 h on a system with an Intel 14600K processor and 32 GB of RAM. $FM$ was used as the optimizing metric, and CFT was used as the search strategy. The 10 $FM$ values obtained from the 10-fold cross-validation were aggregated into an average value, $\mu_{FM}$, and a standard deviation, $\sigma_{FM}$. The best model was taken to be the one with the highest value of $\mu_{FM} - \sigma_{FM}$, thus favoring robust models. For example, a model that achieves 95% $FM$ on 9 partitions and 80% on the 10th is considered worse than a model that achieves 90% on all 10 partitions, because a robust model should deliver consistent results on any dataset.
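The metric and the selection rule are straightforward to express in code. The sketch below also reproduces the worked example, confirming that the consistent model ranks higher under the $\mu_{FM} - \sigma_{FM}$ criterion; the function names are ours:

```python
import numpy as np

def f_measure(pred, gt):
    """FM = 2TP / (2TP + F): TP counts correctly classified foreground
    pixels, F counts all misclassified pixels."""
    tp = np.sum((pred == 1) & (gt == 1))
    f = np.sum(pred != gt)
    return 2 * tp / (2 * tp + f)

def selection_score(fold_fms):
    """Rank candidate models by mean minus standard deviation of their
    cross-validation FM values, rewarding consistency across folds."""
    return np.mean(fold_fms) - np.std(fold_fms)

print(selection_score([0.90] * 10))          # 0.900
print(selection_score([0.95] * 9 + [0.80]))  # 0.890 -> ranked lower
```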

3. Results

After training all the model types available in ML.NET, the best models found after each training session were instances of the LightGBM model. The experiment was then limited to training only LightGBM instances. Table 1 compares the best results obtained in scenarios 1–4. $\overline{FM}_D$ denotes the mean $FM$ over all datasets, $\overline{FM}_I$ the mean $FM$ over all images, and $\overline{FM}_O$ the mean $FM$ over all images after retraining the model on all datasets. $\overline{FM}_D$ and $\overline{FM}_I$ differ because the datasets do not contain the same number of images. $\overline{FM}_I$ and $\overline{FM}_O$ differ because $\overline{FM}_I$ averages the results of the cross-validation models on their respective test datasets, whereas $\overline{FM}_O$ was obtained using a single model trained and tested on all the datasets.
The importance of each feature used in a decision tree ensemble can be determined by permuting the values of that feature in the dataset and measuring the impact on the evaluation metric [42]. Table 2 shows the impact on F M of permuting each feature used in the four scenarios. Higher absolute values correspond to more important features and values closer to 0 correspond to less important features. It may be noted that the features exclusive to scenario 4 have significantly lower importance compared with the others, which explains the very small difference between the F M values obtained in scenarios 3 and 4.
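Permutation importance in the sense of [42] can be sketched in a few lines; scikit-learn’s sklearn.inspection.permutation_importance offers a ready-made equivalent. The function name below is ours, and the F1 score coincides with $FM$ as defined above:

```python
import numpy as np
from sklearn.metrics import f1_score

def fm_permutation_importance(model, X, y, col, seed=0):
    """FM after scrambling one feature column minus FM before: more
    negative values indicate the model relies more on that feature,
    mirroring the differences reported in Table 2."""
    rng = np.random.default_rng(seed)
    base = f1_score(y, model.predict(X))
    X_perm = X.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    return f1_score(y, model.predict(X_perm)) - base
```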
Retraining the model on all the datasets significantly improved the average F M , but this might have been caused either by overfitting or by increasing the size of the training set. Therefore, it makes more sense to compare our cross-validation results with state-of-the-art solutions because they also excluded the test set from the training process.
Table 3 compares the best average $FM$ obtained using the proposed method in all four scenarios with state-of-the-art methods based on deep learning and traditional methods. It might seem counterintuitive to compute $\overline{FM}_I$ and $\overline{FM}_D$ for the winners of the DIBCO competitions, because they are entirely different solutions, but the scores for the other solutions were also computed for different model instances obtained using different training and test sets. The results for DIBCO 2019 are highlighted because previous state-of-the-art methods performed significantly worse on that dataset, most likely owing to the increased variance of its images compared with the previous DIBCO datasets. $\overline{FM}_{19}$ denotes the average $FM$ for the DIBCO 2019 dataset, $\overline{FM}_{9D}$ the average $FM$ over the nine previous DIBCO datasets, and $\overline{FM}_{9I}$ the average $FM$ over all the images from those nine datasets.
Figure 1 compares the binarization results for image 9 from the DIBCO 2019 dataset, which is difficult to binarize because of the ink bleeding from the other side of the page. Figure 2 compares the binarization results for image 13 from the DIBCO 2019 dataset, which is difficult to binarize because of the missing parts of the papyrus on which the text is written. In both figures, green signifies false positives (background pixels classified as foreground), and magenta signifies false negatives (foreground pixels classified as background).
The proposed solution is compared with a classic global thresholding method (Otsu), a classic local thresholding method (Sauvola), and a state-of-the-art deep learning method that achieved the highest average F M for the DIBCO 2019 dataset (GDB). Both figures graphically illustrate that the proposed solution performed significantly better than the classical and state-of-the-art methods for exceedingly difficult images.
Appendix A contains high-resolution versions of the proposed solution binarization for both images.

4. Discussion

The proposed solution outperformed the state-of-the-art solutions by a significant margin on average, especially for the DIBCO 2019 dataset, which proved to be exceedingly difficult for all the compared solutions.
The performance of the proposed solution on DIBCO 2019 was remarkably close to the average over all the datasets, making our solution the most robust method presented in this study.
Although D2BFormer outperformed the proposed solution on the first nine DIBCO datasets, its performance on DIBCO 2019 was so low that, on average, the proposed solution performed better in three of the four scenarios. The poor performance of previous state-of-the-art solutions on the DIBCO 2019 dataset is most likely caused by bleed-through on 6 of the 20 images (see Figure 1) and by missing paper sections on 10 of the 20 images (see Figure 2).
The proposed method uses a decision tree ensemble, which is significantly less demanding to train than any deep learning architecture, running in low-memory conditions on a CPU alone. Users can therefore quickly and easily incorporate additional datasets into the training phase, making it more accessible to tune the model to specific use cases. A deep learning architecture incorporating the same features might yield better results, but finding a suitable architecture and tuning its hyperparameters would likely take much longer; further experimentation is required.
Refitting the model to the full dataset produced significantly better results. However, this may have been caused by overfitting, so further investigation is required. In practical applications, we recommend using the model fitted on the whole dataset; if it performs poorly, one of the cross-validation models can be used instead, or all of them can be combined through a voting system, minimizing the risk of overfitting and thus promoting consistent performance on new data. The practical application we primarily recommend is using the proposed model to binarize images as part of an OCR pipeline. However, the method is not limited to this use case: it can be adapted to practical applications in other fields, such as industrial counting, medical imaging, and pattern recognition, by training on data specific to those fields.
Future work could include using more datasets or adding more features to the model, such as higher-order moments, higher-order generalized bimodality coefficients [43], other local thresholding algorithms, and global thresholding algorithms. The proposed method is not limited to document images; other uses of binarization may be tested using domain-specific datasets.
Another viable way to improve the results would be to use nested cross-validation [44] to increase the confidence that the best model found is not prone to overfitting, allowing us to retrain the model on the entire dataset for even better results. However, this would multiply the training time by the number of outer cross-validation folds, making it very time-consuming.
A more straightforward method of improving the results would be to run the experiment on a machine with more memory and a faster processor, or to increase the time limit.

Author Contributions

Conceptualization, N.T., C.-A.B. and M.-L.V.; Data curation, C.-A.B.; Formal analysis, N.T., C.-A.B. and M.-L.V.; Investigation, N.T.; Methodology, N.T. and M.-L.V.; Project administration, C.-A.B.; Resources, C.-A.B.; Software, N.T.; Supervision, C.-A.B.; Validation, N.T., C.-A.B. and M.-L.V.; Visualization, N.T., C.-A.B. and M.-L.V.; Writing—original draft, N.T.; Writing—review and editing, N.T., C.-A.B. and M.-L.V. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was supported by Politehnica University of Bucharest through the PubArt program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

References were formatted using https://www.scribbr.com/citation/generator/ (accessed on 16 July 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
OCR	Optical Character Recognition
BC	Bimodality Coefficient
CFT	Cost Frugal Tuner
GOSS	Gradient-based One-Side Sampling
EFB	Exclusive Feature Bundling
FM	F-Measure

Appendix A

Figure A1. High-resolution version of the proposed solution binarization for image 9 from the DIBCO 2019 dataset.
Figure A2. High-resolution version of the proposed solution binarization for image 13 from the DIBCO 2019 dataset.

References

  1. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  2. Moghaddam, R.F.; Cheriet, M. A Multi-Scale Framework for Adaptive Binarization of Degraded Document Images. Pattern Recognit. 2010, 43, 2186–2198. [Google Scholar] [CrossRef]
  3. Moghaddam, R.F.; Cheriet, M. AdOtsu: An Adaptive and Parameterless Generalization of Otsu’s Method for Document Image Binarization. Pattern Recognit. 2011, 45, 2419–2431. [Google Scholar] [CrossRef]
  4. Bloechle, J.-L.; Hennebert, J.; Gisler, C. YinYang, a Fast and Robust Adaptive Document Image Binarization for Optical Character Recognition. In Proceedings of the DocEng ’23: Proceedings of the ACM Symposium on Document Engineering 2023, Limerick, Ireland, 22–25 August 2023; pp. 1–4. [Google Scholar]
  5. Bloechle, J.-L.; Hennebert, J.; Gisler, C. ZigZag: A Robust Adaptive Approach to Non-Uniformly Illuminated Document Image Binarization. In Proceedings of the DocEng ’24: Proceedings of the ACM Symposium on Document Engineering 2024, San Jose, CA, USA, 20–23 August 2024; pp. 1–10. [Google Scholar]
  6. Howe, N.R. Document Binarization with Automatic Parameter Tuning. Int. J. Doc. Anal. Recognit. (IJDAR) 2012, 16, 247–258. [Google Scholar] [CrossRef]
  7. Jia, F.; Shi, C.; He, K.; Wang, C.; Xiao, B. Degraded Document Image Binarization Using Structural Symmetry of Strokes. Pattern Recognit. 2017, 74, 225–240. [Google Scholar] [CrossRef]
  8. Calvo-Zaragoza, J.; Gallego, A.-J. A Selectional Auto-Encoder Approach for Document Image Binarization. Pattern Recognit. 2018, 86, 37–47. [Google Scholar] [CrossRef]
  9. He, S.; Schomaker, L. DeepOtsu: Document Enhancement and Binarization Using Iterative Deep Learning. Pattern Recognit. 2019, 91, 379–390. [Google Scholar] [CrossRef]
  10. He, S.; Schomaker, L. CT-Net: Cascade T-Shape Deep Fusion Networks for Document Binarization. Pattern Recognit. 2021, 118, 108010. [Google Scholar] [CrossRef]
  11. Peng, X.; Wang, C.; Cao, H. Document Binarization via Multi-Resolutional Attention Model with DRD Loss. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 45–50. [Google Scholar]
  12. Mondal, R.; Chakraborty, D.; Chanda, B. Learning 2D Morphological Network for Old Document Image Binarization. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, NSW, Australia, 20–25 September 2019; pp. 65–70. [Google Scholar]
  13. De, R.; Chakraborty, A.; Sarkar, R. Document Image Binarization Using Dual Discriminator Generative Adversarial Networks. IEEE Signal Process. Lett. 2020, 27, 1090–1094. [Google Scholar] [CrossRef]
  14. Zhao, J.; Shi, C.; Jia, F.; Wang, Y.; Xiao, B. Document Image Binarization with Cascaded Generators of Conditional Generative Adversarial Networks. Pattern Recognit. 2019, 96, 106968. [Google Scholar] [CrossRef]
  15. Souibgui, M.A.; Kessentini, Y. DE-GAN: A Conditional Generative Adversarial Network for Document Enhancement. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1180–1191. [Google Scholar] [CrossRef] [PubMed]
  16. Suh, S.; Kim, J.; Lukowicz, P.; Lee, Y.O. Two-Stage Generative Adversarial Networks for Binarization of Color Document Images. Pattern Recognit. 2022, 130, 108810. [Google Scholar] [CrossRef]
  17. Yang, M.; Xu, S. A Novel Degraded Document Binarization Model through Vision Transformer Network. Inf. Fusion 2022, 93, 159–173. [Google Scholar] [CrossRef]
  18. Biswas, R.; Roy, S.K.; Wang, N.; Pal, U.; Huang, G.B. DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization. arXiv 2023, arXiv:2312.03568. [Google Scholar] [CrossRef]
  19. Yang, Z.; Liu, B.; Xiong, Y.; Wu, G. GDB: Gated Convolutions-Based Document Binarization. Pattern Recognit. 2023, 146, 109989. [Google Scholar] [CrossRef]
  20. Bernsen, J. Dynamic Thresholding of Grey-Level Images. In Proceedings of the Eighth International Conference on Pattern Recognition (ICPR), Paris, France, 27–31 October 1986. [Google Scholar]
  21. White, J.M.; Rohrer, G.D. Image Thresholding for Optical Character Recognition and Other Applications Requiring Character Image Extraction. IBM J. Res. Dev. 1983, 27, 400–411. [Google Scholar] [CrossRef]
  22. Niblack, W. An Introduction to Digital Image Processing; Prentice Hall: Hoboken, NJ, USA, 1986. [Google Scholar]
  23. Khurshid, K.; Siddiqi, I.; Faure, C.; Vincent, N. Comparison of Niblack Inspired Binarization Methods for Ancient Documents. In Proceedings of SPIE, the International Society for Optical Engineering; SPIE: Bellingham, WA, USA, 2009. [Google Scholar] [CrossRef]
  24. Sauvola, J.; Pietikäinen, M. Adaptive Document Image Binarization. Pattern Recognit. 2000, 33, 225–236. [Google Scholar] [CrossRef]
  25. Wolf, C.; Jolion, J.-M. Extraction and Recognition of Artificial Text in Multimedia Documents. Pattern Anal. Appl. 2004, 6, 309–326. [Google Scholar] [CrossRef]
  26. Phansalkar, N.; More, S.; Sabale, A.; Joshi, M. Adaptive Local Thresholding for Detection of Nuclei in Diversity Stained Cytology Images. In Proceedings of the 2011 International Conference on Communications and Signal Processing (ICCSP 2011), Kerala, India, 10–12 February 2011; pp. 218–220. [Google Scholar]
  27. Knapp, T.R. Bimodality Revisited. J. Mod. Appl. Stat. Methods 2007, 6, 8–20. [Google Scholar] [CrossRef]
  28. Ahmed, Z.; Amizadeh, S.; Bilenko, M.; Carr, R.; Chin, W.-S.; Dekel, Y.; Dupre, X.; Eksarevskiy, V.; Filipi, S.; Finley, T.; et al. Machine Learning at Microsoft with ML.NET. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2448–2458. [Google Scholar]
  29. Wu, Q.; Wang, C.; Huang, S. Frugal Optimization for Cost-Related Hyperparameters. Proc. AAAI Conf. Artif. Intell. 2021, 35, 10347–10354. [Google Scholar] [CrossRef]
  30. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Thirty-First Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  31. Berrar, D. Cross-Validation. In Elsevier eBooks; Elsevier: Amsterdam, The Netherlands, 2018; pp. 542–545. [Google Scholar]
  32. Gatos, B.; Ntirogiannis, K.; Pratikakis, I. ICDAR 2009 Document Image Binarization Contest (DIBCO 2009). In Proceedings of the 2009 10th International Conference on Document Analysis and Recognition, Barcelona, Spain, 26–29 July 2009; pp. 1375–1382. [Google Scholar]
  33. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. H-DIBCO 2010—Handwritten Document Image Binarization Competition. In Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, Kolkata, India, 16–18 November 2010; pp. 727–732. [Google Scholar]
  34. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICDAR 2011 Document Image Binarization Contest (DIBCO 2011). In Proceedings of the International Conference on Document Analysis and Recognition 2011, Beijing, China, 18–21 September 2011; pp. 1506–1510. [Google Scholar] [CrossRef]
  35. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICFHR 2012 Competition on Handwritten Document Image Binarization (H-DIBCO 2012). In Proceedings of the International Conference on Frontiers in Handwriting Recognition 2012, Bari, Italy, 18–20 September 2012; pp. 817–822. [Google Scholar] [CrossRef]
  36. Pratikakis, I.; Gatos, B.; Ntirogiannis, K. ICDAR 2013 Document Image Binarization Contest (DIBCO 2013). In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1471–1476. [Google Scholar]
  37. Ntirogiannis, K.; Gatos, B.; Pratikakis, I. ICFHR2014 Competition on Handwritten Document Image Binarization (H-DIBCO 2014). In Proceedings of the 2014 14th International Conference on Frontiers in Handwriting Recognition, Hersonissos, Greece, 1–4 September 2014; pp. 809–813. [Google Scholar]
  38. Pratikakis, I.; Zagoris, K.; Barlas, G.; Gatos, B. ICFHR2016 Handwritten Document Image Binarization Contest (H-DIBCO 2016). In Proceedings of the 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), Shenzhen, China, 23–26 October 2016; pp. 619–623. [Google Scholar]
  39. Pratikakis, I.; Zagoris, K.; Barlas, G.; Gatos, B. ICDAR2017 Competition on Document Image Binarization (DIBCO 2017). In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–12 November 2017; pp. 1395–1403. [Google Scholar]
  40. Pratikakis, I.; Zagori, K.; Kaddas, P.; Gatos, B. ICFHR 2018 Competition on Handwritten Document Image Binarization (H-DIBCO 2018). In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 489–493. [Google Scholar]
  41. Pratikakis, I.; Zagoris, K.; Karagiannis, X.; Tsochatzidis, L.; Mondal, T.; Marthot-Santaniello, I. ICDAR 2019 Competition on Document Image Binarization (DIBCO 2019). In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1547–1556. [Google Scholar]
  42. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  43. Tarbă, N.; Voncilă, M.-L.; Boiangiu, C.-A. On Generalizing Sarle’s Bimodality Coefficient as a Path towards a Newly Composite Bimodality Coefficient. Mathematics 2022, 10, 1042. [Google Scholar] [CrossRef]
  44. Bates, S.; Hastie, T.; Tibshirani, R. Cross-Validation: What Does It Estimate and How Well Does It Do It? J. Am. Stat. Assoc. 2023, 119, 1434–1445. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Comparison of binarization results. Left to right, top to bottom: original image, ground truth, Otsu binarization, Sauvola binarization, GDB binarization, proposed solution binarization.
Figure 2. Comparison of binarization results. Left to right, top to bottom: original image, ground truth, Otsu binarization, Sauvola binarization, GDB binarization, proposed solution binarization.
Table 1. Scenario comparison.

| FM (%) | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
|---|---|---|---|---|
| $\overline{FM}_D$ | 91.775 | 94.3657 | 95.846 | 95.675 |
| $\overline{FM}_I$ | 91.774 | 94.3652 | 95.845 | 95.672 |
| $\overline{FM}_O$ | 95.357 | 96.92 | 99.41 | 99.21 |
Table 2. Feature importance.

| Feature | FM Difference | Scenarios Using the Feature |
|---|---|---|
| Pixel Value | −0.76909 | 1, 2, 3, 4 |
| White | −0.02847 | 1, 3, 4 |
| Bernsen | −0.04733 | 1, 3, 4 |
| Niblack | −0.07923 | 1, 3, 4 |
| Sauvola | −0.0711 | 1, 3, 4 |
| Wolf | −0.05156 | 1, 3, 4 |
| Phansalkar | −0.07106 | 1, 3, 4 |
| Nick | −0.04388 | 1, 3, 4 |
| Gaussian Blur | −0.38731 | 1, 3, 4 |
| Average of Local Minimum and Maximum | −0.05427 | 1, 3, 4 |
| Local Average | −0.01125 | 1, 2, 3, 4 |
| Local Standard Deviation | −0.14666 | 2, 3, 4 |
| Local Skewness | −0.08837 | 2, 3, 4 |
| Local Kurtosis | −0.07781 | 2, 3, 4 |
| Local BC | −0.14658 | 2, 3, 4 |
| Local Minimum | −0.15828 | 2, 3, 4 |
| Local Maximum | −0.06222 | 2, 3, 4 |
| Global Average | −0.02384 | 2, 3, 4 |
| Global Standard Deviation | −0.02445 | 2, 3, 4 |
| Global Skewness | −0.04725 | 2, 3, 4 |
| Global Kurtosis | −0.00972 | 2, 3, 4 |
| Global BC | −0.01169 | 2, 3, 4 |
| Global Minimum | −0.00799 | 2, 3, 4 |
| Global Maximum | −0.00867 | 2, 3, 4 |
| First Bin of Global Histogram | −0.05454 | 4 |
| Second Bin of Global Histogram | −0.01896 | 4 |
| Third Bin of Global Histogram | −0.01805 | 4 |
| Fourth Bin of Global Histogram | −0.01445 | 4 |
| Fifth Bin of Global Histogram | −0.00761 | 4 |
Table 3. State-of-the-art comparison.

| Method | $\overline{FM}_{9D}$ (%) | $\overline{FM}_{9I}$ (%) | $\overline{FM}_{19}$ (%) | $\overline{FM}_D$ (%) | $\overline{FM}_I$ (%) |
|---|---|---|---|---|---|
| Otsu | 78.74 | 78.75 | 63.87 | 77.25 | 76.56 |
| Sauvola | 79.72 | 80.05 | 63.82 | 78.13 | 77.66 |
| DIBCO winners | 91.33 | 91.30 | 72.88 | 89.49 | 88.59 |
| Howe | 90.93 | 90.97 | 48.20 | 86.66 | 84.68 |
| Jia | 89.94 | 89.93 | 55.87 | 86.53 | 84.92 |
| DeepOtsu | 91.68 | 91.57 | 60.75 | 88.59 | 87.04 |
| SAE | 92.98 | 93.04 | 58.78 | 89.56 | 88.01 |
| DD-GAN | 92.71 | 92.82 | 58.01 | 89.24 | 87.70 |
| MRAM | 91.93 | 91.90 | 49.53 | 87.69 | 85.67 |
| 2DMN | 84.46 | 87.49 | 42.76 | 80.29 | 78.61 |
| cGANs | 93.19 | 93.18 | 62.29 | 90.10 | 88.64 |
| CT-Net | 94.03 | 94.05 | - | - | - |
| DE-GAN | 93.33 | 94.13 | 56.03 | 89.60 | 88.52 |
| Suh | 93.33 | 93.25 | 70.63 | 91.06 | 89.92 |
| GDB | 93.84 | 93.82 | 73.84 | 91.84 | 90.89 |
| GDBMO | 94.10 | 94.20 | 73.50 | 92.04 | 91.16 |
| D2BFormer | 96.02 | 96.05 | 66.69 | 93.08 | 91.73 |
| DocBinFormer | 95.76 | 95.87 | 60.31 | 92.22 | 90.64 |
| Scenario 1 | 91.776 | 91.777 | 91.763 | 91.775 | 91.775 |
| Scenario 2 | 94.367 | 94.368 | 94.351 | 94.366 | 94.365 |
| Scenario 3 | 95.846 | 95.845 | 95.846 | 95.846 | 95.845 |
| Scenario 4 | 95.677 | 95.675 | 95.658 | 95.675 | 95.673 |