2.1. Features
The following algorithms were used directly in the proposed solution as features for the model that we trained to predict the optimal global threshold. They operate directly on the image histogram, optimizing various metrics, and can be computed in constant time ($O(1)$) with respect to the image size. Although some of these algorithms are iterative, they converge rapidly.
Otsu [23] proposed a popular global thresholding algorithm. For a given threshold $t$, the pixels are divided into two classes: class 0, composed of pixels with intensities lower than or equal to $t$, and class 1, composed of pixels with intensities greater than $t$. The between-class variance is the weighted sum of the squared deviations of the two class means from the global mean, which can be written as
$$\sigma_B^2(t) = \omega_0(t)\,\omega_1(t)\,\big[\mu_0(t) - \mu_1(t)\big]^2 .$$
Otsu’s threshold is defined as the threshold with the largest between-class variance:
$$t^* = \arg\max_{t}\, \sigma_B^2(t),$$
where $\omega_0(t)$ is the ratio of pixels with intensities lower than or equal to $t$; $\omega_1(t)$ is the ratio of pixels with intensities greater than $t$; and $\mu_0(t)$ and $\mu_1(t)$ are the mean values of the class 0 and class 1 pixel intensities, respectively.
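As an illustration, the following is a minimal NumPy sketch of this computation on a 256-bin histogram (the function name and the histogram convention are ours, not taken from the original implementation):

```python
import numpy as np

def otsu_threshold(hist):
    """Return the threshold t that maximizes the between-class variance.

    hist: array of 256 pixel counts (index = intensity).
    """
    hist = np.asarray(hist, dtype=np.float64)
    p = hist / hist.sum()                      # intensity probabilities
    omega0 = np.cumsum(p)                      # weight of class 0 for each t
    omega1 = 1.0 - omega0                      # weight of class 1
    mu = np.cumsum(p * np.arange(256))         # cumulative first moment
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = mu / omega0                      # class 0 mean
        mu1 = (mu_total - mu) / omega1         # class 1 mean
        sigma_b2 = omega0 * omega1 * (mu0 - mu1) ** 2
    sigma_b2 = np.nan_to_num(sigma_b2)         # empty classes contribute 0
    return int(np.argmax(sigma_b2))
```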
Kittler and Illingworth [23] proposed a computationally efficient solution for minimum error thresholding. They assumed that the two pixel classes follow normal distributions, splitting the probability density function $p(g)$ into two components, $p_1(g)$ and $p_2(g)$, normally distributed with means $\mu_1(t)$ and $\mu_2(t)$, standard deviations $\sigma_1(t)$ and $\sigma_2(t)$, and a priori probabilities $P_1(t)$ and $P_2(t)$. They choose a threshold that minimizes the following criterion function:
$$J(t) = 1 + 2\big[P_1(t)\ln\sigma_1(t) + P_2(t)\ln\sigma_2(t)\big] - 2\big[P_1(t)\ln P_1(t) + P_2(t)\ln P_2(t)\big].$$
The function is easy to compute, and finding its minimum is relatively simple, as the function is smooth.
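A sketch of the criterion evaluated exhaustively over all thresholds, under the same 256-bin histogram assumption (names are ours):

```python
import numpy as np

def min_error_threshold(hist):
    """Minimum error thresholding: return the t that minimizes J(t)."""
    hist = np.asarray(hist, dtype=np.float64)
    p = hist / hist.sum()
    g = np.arange(256, dtype=np.float64)
    P1 = np.cumsum(p)                # prior of class 1 (g <= t)
    P2 = 1.0 - P1                    # prior of class 2 (g > t)
    m1 = np.cumsum(p * g)            # cumulative first moments
    m2 = np.cumsum(p * g * g)        # cumulative second moments
    best_t, best_j = 0, np.inf
    for t in range(256):
        if P1[t] <= 0 or P2[t] <= 0:
            continue
        mu1 = m1[t] / P1[t]
        mu2 = (m1[-1] - m1[t]) / P2[t]
        var1 = m2[t] / P1[t] - mu1 ** 2
        var2 = (m2[-1] - m2[t]) / P2[t] - mu2 ** 2
        if var1 <= 0 or var2 <= 0:
            continue
        j = 1 + 2 * (P1[t] * np.log(np.sqrt(var1)) + P2[t] * np.log(np.sqrt(var2))) \
              - 2 * (P1[t] * np.log(P1[t]) + P2[t] * np.log(P2[t]))
        if j < best_j:
            best_j, best_t = j, t
    return best_t
```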
Lloyd [23] proposed an iterative thresholding algorithm that minimizes the mean squared error between the original and the binarized images. In the first iteration, the average intensity is chosen as the threshold, and in further iterations, the threshold is updated via
$$t_{k+1} = \frac{\mu_0(t_k) + \mu_1(t_k)}{2} + \frac{\sigma^2}{\mu_1(t_k) - \mu_0(t_k)}\,\ln\frac{P_0(t_k)}{P_1(t_k)},$$
where $\mu_0(t_k)$ and $\mu_1(t_k)$ are the class means, $P_0(t_k)$ and $P_1(t_k)$ are the class probabilities, and $\sigma^2$ is the variance of the entire image. The algorithm converges when the final threshold is equal to the previous threshold.
Sung et al. [24] argued that the within-class variance is more useful than the between-class variance as a selection criterion for the optimal threshold. The within-class variance is defined as
$$\sigma_W^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t),$$
where $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are the variances of each class. If there is more than one threshold for which the within-class variance is minimal, then they consider the average of those thresholds to be the optimal one. Their experiments show that their method performs better than Otsu’s on their synthetic dataset.
Ridler and Calvard [23] proposed an iterative thresholding algorithm very similar to Lloyd’s. In the first iteration, the average intensity is chosen as the threshold, and in further iterations, the threshold is updated via
$$t_{k+1} = \frac{\mu_0(t_k) + \mu_1(t_k)}{2}.$$
The algorithm converges when the final threshold is equal to the previous threshold.
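A compact sketch of this iteration (our naming; the cap of 100 iterations is an added safeguard, not part of the original description):

```python
import numpy as np

def ridler_calvard_threshold(hist, max_iter=100):
    """Iterative threshold: start from the global mean intensity and repeatedly
    set the threshold to the average of the two class means."""
    hist = np.asarray(hist, dtype=np.float64)
    g = np.arange(256, dtype=np.float64)
    t = int(round((hist * g).sum() / hist.sum()))   # first iteration: mean intensity
    for _ in range(max_iter):
        w0, w1 = hist[:t + 1].sum(), hist[t + 1:].sum()
        if w0 == 0 or w1 == 0:
            break
        mu0 = (hist[:t + 1] * g[:t + 1]).sum() / w0
        mu1 = (hist[t + 1:] * g[t + 1:]).sum() / w1
        new_t = int(round((mu0 + mu1) / 2))
        if new_t == t:            # converged: threshold no longer changes
            break
        t = new_t
    return t
```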
Huang and Wang [23] proposed an algorithm based on computing a measure of fuzziness, such as Shannon’s entropy, for each possible threshold and choosing the lowest threshold for which the minimum fuzziness is reached. They use the following formula for fuzziness:
$$E(t) = \frac{1}{MN}\sum_{g} S\big(\mu_t(g)\big)\,h(g),$$
where $h(g)$ represents how many times the intensity $g$ appears in the image, $M \times N$ is the image size, $S(\mu) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$ is Shannon’s entropy, and the membership function is
$$\mu_t(g) = \begin{cases} \dfrac{1}{1 + |g - \mu_0(t)|/C}, & g \le t,\\[1ex] \dfrac{1}{1 + |g - \mu_1(t)|/C}, & g > t, \end{cases}$$
where $C$ is a constant value such that $\mu_t(g) \in [1/2, 1]$.
Ramesh et al. [23] proposed an algorithm based on a functional approximation of the histogram composed of recursive bilevel approximations that minimize an error function. They compared two error functions, and their experimental results show that the second one produces better results than the first.
Li and Lee [25] proposed an algorithm based on minimizing the cross-entropy between the original image and the binarized image. The cross-entropy is computed for each possible threshold with the following formula:
$$\eta(t) = \sum_{g=1}^{t} g\,h(g)\ln\frac{g}{\mu_0(t)} + \sum_{g=t+1}^{G} g\,h(g)\ln\frac{g}{\mu_1(t)},$$
where $h(g)$ is the number of pixels with intensity $g$ in the image, and $G$ is the maximum intensity. Li and Tam [26] proposed an iterative variation of this algorithm, boasting improved computational speed at the cost of accuracy. Brink and Pendock [27] proposed a similar algorithm, but they compute the cross-entropy with a different formula.
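A sketch of the minimum cross-entropy search based on the formula above (our naming; intensity 0 is skipped here to keep the logarithm well defined, which may differ slightly from the original formulation):

```python
import numpy as np

def cross_entropy_threshold(hist):
    """Return the threshold minimizing the cross-entropy between the original
    and binarized images, in the spirit of Li and Lee."""
    hist = np.asarray(hist, dtype=np.float64)
    g = np.arange(256, dtype=np.float64)
    best_t, best_eta = 0, np.inf
    for t in range(1, 255):
        w0, w1 = hist[1:t + 1].sum(), hist[t + 1:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (hist[1:t + 1] * g[1:t + 1]).sum() / w0   # class means, g = 0 excluded
        mu1 = (hist[t + 1:] * g[t + 1:]).sum() / w1
        eta = (hist[1:t + 1] * g[1:t + 1] * np.log(g[1:t + 1] / mu0)).sum() \
            + (hist[t + 1:] * g[t + 1:] * np.log(g[t + 1:] / mu1)).sum()
        if eta < best_eta:
            best_eta, best_t = eta, t
    return best_t
```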
Kapur et al. [26] proposed an algorithm based on maximizing an evaluation function:
$$\psi(t) = \ln\big[P(t)\,\big(1 - P(t)\big)\big] + \frac{H(t)}{P(t)} + \frac{H(G) - H(t)}{1 - P(t)},$$
where they use the classic information entropy formula:
$$H(t) = -\sum_{g=0}^{t} p_g \ln p_g,$$
$p_g$ is the probability of intensity $g$, and $P(t) = \sum_{g=0}^{t} p_g$ is the cumulative probability density function.
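Equivalently, the criterion can be evaluated as the sum of the Shannon entropies of the two normalized class distributions, as in this sketch (our naming):

```python
import numpy as np

def kapur_threshold(hist):
    """Maximum-entropy thresholding: maximize the sum of the Shannon entropies
    of the two class distributions."""
    hist = np.asarray(hist, dtype=np.float64)
    p = hist / hist.sum()
    best_t, best_psi = 0, -np.inf
    for t in range(256):
        P0, P1 = p[:t + 1].sum(), p[t + 1:].sum()
        if P0 <= 0 or P1 <= 0:
            continue
        p0 = p[:t + 1][p[:t + 1] > 0] / P0      # normalized class 0 distribution
        p1 = p[t + 1:][p[t + 1:] > 0] / P1      # normalized class 1 distribution
        psi = -(p0 * np.log(p0)).sum() - (p1 * np.log(p1)).sum()
        if psi > best_psi:
            best_psi, best_t = psi, t
    return best_t
```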
Sahoo et al. [26] proposed an algorithm based on maximizing the sum of Renyi’s entropies associated with each class:
$$H_\alpha^0(t) + H_\alpha^1(t) = \frac{1}{1-\alpha}\ln\sum_{g=0}^{t}\left[\frac{p_g}{P(t)}\right]^\alpha + \frac{1}{1-\alpha}\ln\sum_{g=t+1}^{G}\left[\frac{p_g}{1-P(t)}\right]^\alpha.$$
They observed that there are three distinct thresholds $t_1$, $t_2$, and $t_3$ that maximize the sum depending on whether $\alpha$ is lower than, equal to, or greater than 1, and use all three values to determine the optimal threshold:
$$t^* = t_{[1]}\Big[P(t_{[1]}) + \tfrac{1}{4}\,\omega\,\beta_1\Big] + \tfrac{1}{4}\,t_{[2]}\,\omega\,\beta_2 + t_{[3]}\Big[1 - P(t_{[3]}) + \tfrac{1}{4}\,\omega\,\beta_3\Big],$$
where $t_{[1]}$, $t_{[2]}$, and $t_{[3]}$ are the order statistics of the thresholds $t_1$, $t_2$, and $t_3$, respectively; $\omega = P(t_{[3]}) - P(t_{[1]})$; and $\beta_1$, $\beta_2$, and $\beta_3$ take different values based on the distances between $t_{[1]}$, $t_{[2]}$, and $t_{[3]}$.
Shanbhag [26] proposed a modified version of Kapur’s algorithm based on a different information measure, choosing the threshold that minimizes this measure.
Yen et al. [26] proposed a maximum entropy criterion based on the discrepancy between the binarized image and the original image and the number of bits required to represent the binarized image. They define an entropy measure for each possible threshold and choose the threshold with the maximum entropy.
Tsai [26] proposed an algorithm based on preserving the moments of the original image in the binarized image. They use the first four raw moments and obtain the following system:
$$p_0\,z_0^{\,j} + p_1\,z_1^{\,j} = m_j, \quad j = 0, 1, 2, 3,$$
where $m_j$ is the $j$-th raw moment of the image; $z_0$ and $z_1$ are representative gray values for the two classes of pixels; and $p_0$ and $p_1$ are the fractions of pixels in each class. Considering that $m_0 = p_0 + p_1 = 1$, solving the system is trivial, and the threshold can be determined from the resulting $p_0$ as the $p_0$-th percentile.
Sarle’s bimodality coefficient (BC) [28] uses skewness and kurtosis to ascertain how separable two populations that form a bimodal distribution are. It can also be generalized (GBC) [29] by using standardized moments of higher orders. For both BC and GBC, values range from 0 to 1, with 1 indicating no overlap between the two populations and lower values indicating more overlap. The formula for the $k$-th GBC is
$$GBC_k = \frac{m_{2k+1}^2 + 1}{m_{2k+2}},$$
where $m_j$ is the $j$-th standardized moment. Sarle’s bimodality coefficient is $BC = GBC_1$.
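A small sketch of these coefficients computed directly from pixel values, following the formula given above (our naming; Sarle’s BC is sometimes defined with a finite-sample correction, which is omitted here):

```python
import numpy as np

def standardized_moment(values, order):
    """j-th standardized moment of a sample: E[(x - mean)^j] / std^j."""
    values = np.asarray(values, dtype=np.float64)
    mu, sigma = values.mean(), values.std()
    return ((values - mu) ** order).mean() / sigma ** order

def generalized_bimodality_coefficient(values, k=1):
    """k-th generalized bimodality coefficient, (m_{2k+1}^2 + 1) / m_{2k+2};
    k = 1 gives Sarle's BC."""
    m_odd = standardized_moment(values, 2 * k + 1)
    m_even = standardized_moment(values, 2 * k + 2)
    return (m_odd ** 2 + 1.0) / m_even
```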
2.2. Tools
ML.NET [30] is a framework that allows users to train and evaluate a plethora of machine-learning models. Its AutoML component automatically explores various hyperparameter configurations for those models using one of several search strategies, such as random search, grid search, and the cost frugal tuner (CFT) [31]. Several model types are available in AutoML for regression, but LightGbm [32] always scored best in our tests. LightGbm is a gradient-boosted decision tree model that uses gradient-based one-side sampling, which excludes a significant proportion of data instances with small gradients and estimates the information gain from the remaining ones, and exclusive feature bundling, which bundles mutually exclusive features to reduce the number of features.
ML.NET does not allow for custom metrics. The metrics available for regression are the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared (R2), also known as the coefficient of determination [33]. For the first three metrics, values range from 0 to infinity, with 0 indicating a perfect fit. For R2, values range from negative infinity to 1, with 1 indicating a perfect fit. Because MSE and RMSE square the errors, they are more sensitive to large errors than MAE is.
Training a model on a dataset might cause overfitting, i.e., maximizing the performance of the model on the training dataset at the cost of lower performance on new datasets. Resampling the dataset can alleviate or even eliminate overfitting. According to Berrar [34], 10-fold cross-validation is one of the most widely used data resampling methods. k-fold cross-validation splits a dataset into k equally sized disjoint partitions and forms k training sets and k test sets: each test set consists of one of the k partitions, and its corresponding training set consists of the other k-1 partitions. The model is fitted on each training set and evaluated on its corresponding test set, resulting in k performance metrics, one for each test set, which can be averaged to obtain the performance metric of the model.
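A minimal sketch of plain k-fold cross-validation (generic fit/evaluate callables; the names are ours and unrelated to the ML.NET API):

```python
import numpy as np

def k_fold_cv(X, y, fit, evaluate, k=10, seed=0):
    """k-fold cross-validation: fit on k-1 partitions, evaluate on the held-out
    partition, and average the k resulting metrics.

    X: NumPy feature matrix; y: NumPy target vector;
    fit(X_train, y_train) -> model; evaluate(model, X_test, y_test) -> float.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)          # k disjoint, (almost) equal parts
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[test_idx], y[test_idx]))
    return float(np.mean(scores))
```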
Bates et al. [35] introduced nested cross-validation and proved that the method is more robust than cross-validation in many examples. The idea behind nested cross-validation is to split the dataset into train/test sets twice: the model is fitted on the inner training sets and evaluated on the inner test sets, then fitted on the outer training sets and evaluated on the outer test sets, and all the metric values obtained on all the test sets are used to compute a robust estimand for the metric, consisting of an average value and an estimate of its MSE.
2.3. Dataset and Metrics
Our dataset is composed of all the images from the following datasets: the DIBCO datasets [36,37,38,39,40,41,42,43,44,45], a synthetic dataset from [46], the CMATERdb dataset [47], a subset of 10 pages from the Einsiedeln Stiftsbibliothek [48], a subset of 10 pages from the Salzinnes Antiphonal Manuscript [49], a subset of 15 images from the LiveMemory project [50], the Nabuco dataset [51], the Noisy Office dataset [52], the LRDE-DBD dataset [53], the palm leaf dataset [54], the PHIBD 2012 dataset [55], the Rahul Sharma dataset [56], and a multispectral dataset [57]. To enrich the dataset, 15 variants were generated for each image by adjusting the gamma value, with values ranging from 0.5 to 2. For each image, 48 features were computed: the thresholds resulting from the 15 algorithms presented in Section 2.1; the confidence values for the 7 algorithms that provide them (Otsu, Lloyd, Ridler, Huang, Ramesh, Li and Lee, and Li and Tam), which represent the normalized values of their respective optimization functions for their selected thresholds; the normalized between-class variances associated with the computed thresholds; the pixel average, standard deviation, and normalized standardized moments up to the eighth order; Sarle’s bimodality coefficient; and the generalized bimodality coefficients up to the third order. Higher-order moments require significantly better precision than that offered by the standard 64-bit double floating-point representation. To ensure the required precision, the Boost Multiprecision C++ library [58] was employed, using 256 bits for the cpp_bin_float back-end.
All features were normalized using min-max normalization, which ensures all values are in the range $[0, 1]$ by subtracting $min$ from all values and then dividing them by $max - min$. The $min$ value was considered 0 for all features that cannot have negative values. The $max$ value was considered the maximum possible value for all features whose values, in practice, come close to their theoretical upper limit (255 for thresholds, 127.5 for the standard deviation, $127.5^2$ for all variances, and 1 for all bimodality coefficients). For the features that do not have a finite or obvious lower/upper limit (both limits for standardized moments of odd orders and upper limits for standardized moments of even orders), quantiles were used instead of the min and max values (the 0.001 quantile for the lower limit and the 0.999 quantile for the upper limit) to clamp outlier values to a more compact interval, ensuring a less sparse distribution of feature values.
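A sketch of this normalization scheme for a single feature column (our naming; the choice of which features receive explicit bounds versus quantile bounds follows the description above):

```python
import numpy as np

def normalize_feature(values, lower=None, upper=None, q_low=0.001, q_high=0.999):
    """Min-max normalization to [0, 1]. When a theoretical lower/upper bound is
    known it is passed explicitly (e.g., 0 and 255 for thresholds); otherwise
    the 0.001 and 0.999 quantiles are used and outliers are clamped to them."""
    values = np.asarray(values, dtype=np.float64)
    lo = np.quantile(values, q_low) if lower is None else lower
    hi = np.quantile(values, q_high) if upper is None else upper
    clipped = np.clip(values, lo, hi)           # clamp outliers to [lo, hi]
    return (clipped - lo) / (hi - lo)
```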
A popular metric used to evaluate binarized images is the F-measure (FM), defined as
$$FM = \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$ is the number of pixels correctly classified as pixels containing text, and $FP + FN$ is the number of pixels incorrectly classified. $FM$ takes values between 0 and 1, with 1 indicating a perfect classifier. However, because we are using global thresholding, the maximum value of $FM$ can differ from image to image. The maximum possible value of $FM$ for an image is not necessarily provided by a single threshold: if two thresholds $t_1$ and $t_2$ provide the maximum possible $FM$, then all thresholds in the interval $[t_1, t_2]$ provide the maximum possible $FM$. The ideal threshold of an image, in the global thresholding scenario, was defined as the middle of the largest interval of thresholds that provide the maximum possible FM for that image. The ideal threshold was computed for each image in the dataset and used as the column to be predicted by the ML models.
The goal of the experiment is to find a regression model instance that approximates the ideal threshold as closely as possible. Because the maximum possible FM on different images is not necessarily the same, we define the relative FM ($R_{FM}$) as
$$R_{FM} = \frac{FM_{pred}}{FM_{max}},$$
where $FM_{pred}$ is the FM provided by the predicted threshold, and $FM_{max}$ is the maximum possible FM of the image. With $R_{FM}$, the performance on different images can be compared, as the values it can take are always between 0 and 1, and the maximum value of 1 can always be reached regardless of the image.
Other popular metrics are the peak signal-to-noise ratio (PSNR) and the MSE. Both PSNR and FM are better when their values are higher, so we define the relative PSNR ($R_{PSNR}$) similarly to $R_{FM}$ as
$$R_{PSNR} = \frac{PSNR_{pred}}{PSNR_{max}},$$
where $PSNR_{pred}$ is the PSNR provided by the predicted threshold, and $PSNR_{max}$ is the maximum possible PSNR of the image. MSE is better when its values are lower, so we define the relative MSE ($R_{MSE}$) as
$$R_{MSE} = \frac{MSE_{min} + 1}{MSE_{pred} + 1},$$
where $MSE_{pred}$ is the MSE provided by the predicted threshold; $MSE_{min}$ is the minimum possible MSE of the image; and we add 1 to avoid division by 0, while still making sure that $R_{MSE}$ can take all values in $(0, 1]$.
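Given per-threshold FM, PSNR, and MSE values for an image, the relative metrics can be sketched as below (our naming; the constant 1 added in the relative MSE follows our reading of the definition above):

```python
def relative_metrics(fm, psnr, mse, predicted_t):
    """Relative FM / PSNR / MSE for a predicted threshold.

    fm, psnr, mse: sequences indexed by threshold (0..255);
    predicted_t: threshold predicted by the model.
    """
    r_fm = fm[predicted_t] / max(fm)
    r_psnr = psnr[predicted_t] / max(psnr)
    r_mse = (min(mse) + 1.0) / (mse[predicted_t] + 1.0)
    return r_fm, r_psnr, r_mse
```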
2.4. Proposed Solution
To achieve nested cross-validation, the dataset was split into 11 folds: one fold was set apart as the outer test set, and the remaining 10 folds formed the outer training set. An ML.NET AutoML regression experiment was set up with cross-validation on the outer training set, with a time limit of 60 min, using MSE as the optimizing metric, exploring all the model types available for regression, and searching for hyperparameters using an implementation of CFT for hierarchical search spaces. Lower time limits were also tested but produced slightly worse results; higher time limits were not tested due to the increased computational cost. Each of the 10 folds in the outer training set formed an inner test set, with the remaining 9 forming the corresponding inner training set. The model instances were fitted on the inner training sets and evaluated on the inner test sets, producing 10 different values for the evaluation metrics. The model instances were then refitted on the outer training set and evaluated on the outer test set. Each of the 10 folds was then switched with the one in the outer test set, and the model instances were refitted on the resulting inner training sets and evaluated on the inner test sets, resulting in a total of 110 different values for the evaluation metrics for models trained on 9 folds and evaluated on one fold, and a total of 11 different values for the evaluation metrics for models trained on 10 folds and evaluated on one fold.
We denote by $e_{in}$ the values of $R_{FM}$ obtained on images from the inner test sets and by $e_{out}$ the values of $R_{FM}$ obtained on images from the outer test sets. The nested cross-validation estimand of $R_{FM}$ is denoted as $\widehat{e}_{FM}$ and is the average of $e_{in}$ over all the inner test sets. For each outer test set $j$, two values are computed:
$$a_j = \big(\overline{e}_{in}^{(j)} - \overline{e}_{out}^{(j)}\big)^2, \qquad b_j = \frac{\operatorname{Var}\big(e_{out}^{(j)}\big)}{n_{out}^{(j)}},$$
where the bar denotes the average value of the measure, the subscript $in$ refers to all the inner test sets corresponding to the outer test set, and $n_{out}^{(j)}$ is the number of images in the $j$-th outer test set. The nested cross-validation estimand of the mean squared error of $\widehat{e}_{FM}$ is denoted as $\widehat{MSE}_{FM}$ and is the difference between the average of the $a_j$ values and the average of the $b_j$ values. Analogous estimands were computed for the PSNR and the MSE of the binarized image but were not used in the ranking of the models.
A score, computed from $\widehat{e}_{FM}$ and $\widehat{MSE}_{FM}$, was assigned to each model instance and used to determine the best one (a higher score is better). The best instance was then refitted on the whole dataset, and the average metrics were computed over the entire dataset to compare the model’s results with those of the input algorithms. Algorithm 1 illustrates, using pseudocode, how the best model was selected using NCV.
Algorithm 1. NCV experiment pseudocode. |
| Input: dataset |
| Output: ML model |
1. | partitions = split the dataset into 11 equally sized partitions |
2. | for i from 1 to 11 |
3. | outerFolds[i].testSet = partitions[i] |
4. | outerFolds[i].trainSet = all partitions except partitions[i] |
5. | innerFolds = split outerFolds[1].trainSet into 10 folds |
6. | models = run AutoML 10-fold cross-validation experiment on innerFolds |
7. | bestScore = -infinity, bestModel = null |
8. | for each model in models |
9. | esfm = empty array |
10. | for oi from 1 to 11 |
11. | innerFolds = split outerFolds[oi].trainSet into 10 folds |
12. | einfm = empty array |
13. | for ii from 1 to 10 |
14. | trainSet = innerFolds[ii].trainSet |
15. | testSet = innerFolds[ii].testSet |
16. | innerModel = model.fit(trainSet) |
17. | predicted = innerModel.transform(testSet) |
18. | for i from 1 to predicted.size |
19. | einfm.add(relative FM of predicted[i]) |
20. | trainSet = outerFolds[oi].trainSet |
21. | testSet = outerFolds[oi].testSet |
22. | predicted = model.fit(trainSet).transform(testSet) |
23. | eoutfm = empty array |
24. | for i from 1 to predicted.size |
25. | eoutfm.add(relative FM of predicted[i]) |
26. | afm[oi] = pow(average(einfm)-average(eoutfm),2) |
27. | bfm[oi] = variance(eoutfm)/eoutfm.size |
28. | esfm.concatenate(einfm) |
29. | msefm = average(afm) - average(bfm) |
30. | efm = average(esfm) |
31. | score = score computed from efm and msefm (higher is better) |
32. | if score > bestScore |
33. | bestScore = score |
34. | bestModel = model |