Document Image Binarization Using Various Machine Learning Models and Ensembles Trained on Classic Local and Global Binarization Algorithms and Image Statistics
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript presents a comparison of several binarization methods for digital documents that contain hard-to-handle defects. The methods are based on machine learning models and use different estimates from direct thresholding algorithms, such as Otsu's, as features. Hyperparameter tuning and model training are performed with the H2O and ML.NET frameworks. The experiments were conducted on the DIBCO datasets.
The manuscript needs revision to clarify the methods and results. The conclusions also need to be strengthened.
Lines 20, 23, 45, 54: all abbreviations are to be explained on first use.
Section 1.1, Related works, does not give any analysis of the enumerated methods, and no doubt more of them could be enumerated. The section should emphasize the research gap and the need for new developments.
The specification of the software (OS and framework versions) is needed.
Line 246: the datasets are to be described briefly.
Line 273: it is not clear which results in Tab. 1 are from previous works.
Line 298: it is better to state explicitly here that the proposed method performed worse than the previously developed TL, if that is correct, of course.
Several aspects of the method are not described, such as what ranges were used for the hyperparameters, what values were obtained for the best models, and whether these values were stable across runs. What was the inference time for one document? Also, the format and color model of the images should be described, along with how color information was used, if at all.
I think that, to obtain more meaningful results and strengthen the conclusions, models such as Swin Transformer, CatBoost and XGBoost should be included in the comparison.
Author Response
Reviewer 1
The manuscript presents a comparison of several binarization methods for digital documents that contain hard-to-handle defects. The methods are based on machine learning models and use different estimates from direct thresholding algorithms, such as Otsu's, as features. Hyperparameter tuning and model training are performed with the H2O and ML.NET frameworks. The experiments were conducted on the DIBCO datasets.
The manuscript needs revision to clarify the methods and results. The conclusions also need to be strengthened.
Lines 20, 23, 45, 54: all abbreviations are to be explained on first use.
- Explained all relevant abbreviations on the first use. DIBCO does not need an explanation
Section 1.1, Related works, does not give any analysis of the enumerated methods, and no doubt more of them could be enumerated. The section should emphasize the research gap and the need for new developments.
- Section 1.1 presents state of the art methods that we compare our results against. Image binarization is used in many industrial applications and any improvements in this step will translate to improvements in the final applications as well.
The specification of the software (OS and framework versions) is needed.
- We added OS and framework versions in section 2.3
line 246 The datasets are to be described briefly.
- We described them briefly
Line 273: it is not clear which results in Tab. 1 are from previous works.
- Clarified in the previous paragraph which results are from previous works
line 298 It is better to explicitly write here that the proposed method performed worse than previously developed TL, if it is correct, of course.
- Line 298 was comparing the current LightGBM (AutoML) and GBM (H2O) models, but we now explicitly write that the current models perform worse than the previously developed TL and explain why
Several aspects of the method are not described, such as what ranges were used for the hyperparameters, what values were obtained for the best models, and whether these values were stable across runs. What was the inference time for one document? Also, the format and color model of the images should be described, along with how color information was used, if at all.
- The ranges for hyperparameters are determined automatically by the AutoML frameworks; we made no manual settings. We used the default AutoML hyperparameter exploration strategy for each framework and now mention this for ML.NET in section 2. We did not find any information on the exploration strategy for H2O.
I think that, to obtain more meaningful results and strengthen the conclusions, models such as Swin Transformer, CatBoost and XGBoost should be included in the comparison.
- We did not find any articles presenting results of Swin Transformer, CatBoost and XGBoost on the DIBCO datasets, but we would love to include them in the comparison if there are any. The H2O AutoML framework includes XGBoost, but it only runs on Linux and the OS on our machine is Windows. The others are not included in either framework.
Reviewer 2 Report
Comments and Suggestions for Authors
- The paper does not clearly articulate a strong research gap. While it combines global and local thresholding with machine learning, similar hybrid approaches and learning-based binarization methods already exist, and the novelty over prior work (including the authors’ own TG and TL methods) is not convincingly established.
- The contribution is incremental rather than fundamentally new. The method mainly aggregates outputs from existing thresholding algorithms and feeds them into AutoML frameworks, which raises concerns about whether this constitutes a substantial methodological innovation.
- The methodology is overly complex and difficult to follow. The paper introduces a large number of global and local thresholding formulas and image statistics (Section 2.1), but lacks clear intuition or a structured explanation of how these features contribute to the final model.
- The feature engineering process appears excessive and insufficiently justified. The use of numerous thresholds, optimization values, histogram features, and statistical descriptors is not supported by ablation studies to demonstrate which features are actually important.
- The reliance on AutoML frameworks limits methodological insight. Since model selection and hyperparameter tuning are largely automated (ML.NET and H2O AutoML), the paper provides little understanding of why certain models (e.g., LightGBM) perform better.
- The experimental setup raises concerns about fairness and reproducibility. For example, dataset size reduction for H2O AutoML and differences in optimization metrics (FM vs AUC) may bias comparisons across models.
- There are indications of potential overfitting that are not rigorously addressed. The significant gap between cross-validation performance and retrained model performance (Table 1) is acknowledged but not thoroughly analyzed or mitigated.
- The evaluation is limited in scope. The paper primarily relies on F-measure (FM) and does not include other important metrics such as computational efficiency, inference time, or memory usage, despite emphasizing practical applicability.
- The comparison with state-of-the-art methods is not entirely convincing. Although Table 2 shows competitive performance, the paper does not critically analyze cases where the proposed method underperforms or only marginally improves over existing approaches.
- The writing and organization can be improved. The paper contains long, dense sections with heavy mathematical notation and limited explanatory text, making it difficult for readers to follow the core idea and contributions.
Author Response
Reviewer 2
The paper does not clearly articulate a strong research gap. While it combines global and local thresholding with machine learning, similar hybrid approaches and learning-based binarization methods already exist, and the novelty over prior work (including the authors’ own TG and TL methods) is not convincingly established.
- The research gap we aim to address is now articulated at the end of Section 1. Additionally, the proposed method combines features from TG and TL, obtaining a mixed local-global approach which, in theory, should obtain better results than a purely global or local approach.
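To illustrate what a mixed local-global feature vector could look like, the following is a minimal sketch only. It assumes a grayscale image stored as nested lists, uses Otsu's method as the global threshold, and uses a plain window mean as a stand-in local threshold. The names `otsu_threshold`, `local_mean` and `pixel_features` are hypothetical, and the manuscript's actual feature set (Section 2.1) is considerably larger.

```python
from typing import List

Image = List[List[int]]  # grayscale image, intensities in 0..255


def otsu_threshold(image: Image) -> int:
    """Global Otsu threshold: the t that maximizes between-class variance."""
    hist = [0] * 256
    for row in image:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with intensity <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def local_mean(image: Image, y: int, x: int, radius: int = 1) -> float:
    """Mean intensity of the window centred on (y, x), clipped at the borders."""
    h, w = len(image), len(image[0])
    vals = [image[j][i]
            for j in range(max(0, y - radius), min(h, y + radius + 1))
            for i in range(max(0, x - radius), min(w, x + radius + 1))]
    return sum(vals) / len(vals)


def pixel_features(image: Image, y: int, x: int, global_t: int) -> List[float]:
    """Hypothetical per-pixel features: [intensity, global threshold, local threshold]."""
    return [float(image[y][x]), float(global_t), local_mean(image, y, x)]


image = [[10, 10, 200],
         [10, 200, 200],
         [10, 10, 200]]
t = otsu_threshold(image)
features = pixel_features(image, 1, 1, t)
```

A classifier trained on such vectors sees both the image-wide and the neighbourhood-level estimate for every pixel, which is the intuition behind combining TG-style and TL-style features.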
The contribution is incremental rather than fundamentally new. The method mainly aggregates outputs from existing thresholding algorithms and feeds them into AutoML frameworks, which raises concerns about whether this constitutes a substantial methodological innovation.
- The proposed method is an incremental upgrade of TL. The main contribution over TL is mixing local and global thresholding algorithms. We now justify this decision in section 2.
The methodology is overly complex and difficult to follow. The paper introduces a large number of global and local thresholding formulas and image statistics (Section 2.1), but lacks clear intuition or a structured explanation of how these features contribute to the final model.
- We performed permutation feature importance (PFI) on the best model (LightGBM); we now present the results in section 3 and discuss them in section 4.
The feature engineering process appears excessive and insufficiently justified. The use of numerous thresholds, optimization values, histogram features, and statistical descriptors is not supported by ablation studies to demonstrate which features are actually important.
- We performed permutation feature importance (PFI) on the best model (LightGBM), present results in section 3 and discuss them in section 4
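For readers unfamiliar with PFI, the idea can be sketched as follows. This is a generic, framework-independent illustration, not the ML.NET routine used for the manuscript: it assumes a model exposed as a plain Python callable and scores it with accuracy, whereas the manuscript's evaluation metric is FM. The function names and the toy model are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], int]


def accuracy(model: Model, X: List[List[float]], y: List[int]) -> float:
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model: Model, X: List[List[float]], y: List[int],
                           n_repeats: int = 5, seed: int = 0) -> List[float]:
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model (an assumption for illustration): a pixel is text (1) if its
# intensity, feature 0, is below 128; feature 1 is ignored entirely.
def toy_model(row: Sequence[float]) -> int:
    return int(row[0] < 128)


X = [[float(20 + i), float(i % 3)] for i in range(10)] + \
    [[float(200 + i), float(i % 3)] for i in range(10)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model depends on lowers the score; ranking features by this drop is what separates useful features from redundant ones.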
The reliance on AutoML frameworks limits methodological insight. Since model selection and hyperparameter tuning are largely automated (ML.NET and H2O AutoML), the paper provides little understanding of why certain models (e.g., LightGBM) perform better.
- We explain in section 3 that LightGBM most likely performs better than H2O GBM due to Exclusive Feature Bundling (EFB) and Gradient-based One-Side Sampling (GOSS), and due to the reduced training dataset size used for H2O.
The experimental setup raises concerns about fairness and reproducibility. For example, dataset size reduction for H2O AutoML and differences in optimization metrics (FM vs AUC) may bias comparisons across models.
- We used the optimization metrics available in each AutoML framework (we did not choose to avoid FM for H2O; the option was not available), which makes the comparison fair and the experiment reproducible.
There are indications of potential overfitting that are not rigorously addressed. The significant gap between cross-validation performance and retrained model performance (Table 1) is acknowledged but not thoroughly analyzed or mitigated.
- We now properly address overfitting concerns in the first paragraph after Table 1. The retrained model performance is higher than the cross-validation performance, most likely because the test dataset is in the training dataset.
The evaluation is limited in scope. The paper primarily relies on F-measure (FM) and does not include other important metrics such as computational efficiency, inference time, or memory usage, despite emphasizing practical applicability.
- FM was the only metric we could compare across all of the presented state-of-the-art methods. We now present feature computation time and memory usage, and inference time for LightGBM after Table 1
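For reference, the pixel-level F-measure is the harmonic mean of precision and recall over text pixels. The sketch below assumes binary images stored as nested lists with 1 for text and 0 for background; the function name `f_measure` is hypothetical, and it returns a fraction in [0, 1], whereas published tables commonly report FM as a percentage.

```python
from typing import List

Binary = List[List[int]]  # 1 = text (foreground), 0 = background


def f_measure(predicted: Binary, ground_truth: Binary) -> float:
    """Pixel-level FM = 2 * precision * recall / (precision + recall)."""
    tp = fp = fn = 0
    for p_row, g_row in zip(predicted, ground_truth):
        for p, g in zip(p_row, g_row):
            if p == 1 and g == 1:
                tp += 1  # text pixel correctly detected
            elif p == 1 and g == 0:
                fp += 1  # background wrongly marked as text
            elif p == 0 and g == 1:
                fn += 1  # text pixel missed
    if tp == 0:
        return 0.0  # avoids division by zero when no text pixel is matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = f_measure(predicted, ground_truth)
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is the FM.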
The comparison with state-of-the-art methods is not entirely convincing. Although Table 2 shows competitive performance, the paper does not critically analyze cases where the proposed method underperforms or only marginally improves over existing approaches.
- We now critically analyze the performance of the proposed method in section 4.
The writing and organization can be improved. The paper contains long, dense sections with heavy mathematical notation and limited explanatory text, making it difficult for readers to follow the core idea and contributions.
- The core idea and contributions are clearly presented in section 2.3. Sections 2.1.1 and 2.1.2 ensure that the article is fully self-contained. All the formulas and constants are fully specified to ensure that results are reproducible. We added/moved explanatory text for better readability.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors answered all my questions and enhanced the manuscript, so I don't have any further comments.
Author Response
Thank you for your previous comments. We are glad we answered them well enough that you don't have any further comments.
Reviewer 2 Report
Comments and Suggestions for Authors
No comment.
Author Response
Thank you for your previous comments. We are glad we answered them well enough that you don't have any further comments.

