Model-Aware Automatic Benchmark Generation with Self-Error Instructions for Data-Driven Models
Abstract
1. Introduction
- (i) We formalize the task of model-aware automatic benchmark dataset generation using self-error instructions for classical machine learning models.
- (ii) We propose a two-stage pipeline for benchmark dataset generation, using a genetic algorithm to augment the bad prediction points and a generative model to approximate their distribution (see the sketch after this list).
- (iii) We conduct several experiments to demonstrate the applicability of our benchmark on both synthetic and real-world datasets for regression and classification problems.
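To make contribution (ii) concrete, a minimal sketch of the two-stage flow is given below. It is illustrative only: the helper names (`find_bad_points`, `augment_bad_points`, `fit_generative_model`), the squared-error criterion, and the threshold `tau` are assumptions made for the sketch, not the paper's exact interfaces.

```python
import numpy as np

def find_bad_points(model, X, y, tau):
    """Flag points whose squared prediction error exceeds tau (assumed criterion)."""
    residuals = (model.predict(X) - y) ** 2
    mask = residuals > tau
    return X[mask], y[mask]

def benchmark_pipeline(model, X, y, tau,
                       augment_bad_points, fit_generative_model, n_samples):
    """Two-stage benchmark generation: GA augmentation, then generative modeling."""
    # Stage 0: locate the model's own "bad" points (self-error instructions).
    X_bad, y_bad = find_bad_points(model, X, y, tau)
    # Stage 1: enlarge the bad set with a genetic algorithm (hypothetical helper).
    X_aug, y_aug = augment_bad_points(model, X_bad, y_bad, tau)
    # Stage 2: fit a generative model to the augmented bad set, sample a benchmark.
    generator = fit_generative_model(X_aug, y_aug)
    return generator.sample(n_samples)
```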
2. Related Work
3. General Problem Formulation
4. Automatic Benchmark Model
4.1. General Pipeline for Regression Problems
4.2. General Pipeline for Classification Problems
- (i) If , then bad points are those for which .
- (ii) If , then bad points are those for which (see the selection sketch below).
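The exact inequalities in the two cases above did not survive extraction. As a stand-in, the sketch below uses one common choice for a scikit-learn-style binary classifier: a point is flagged as bad when the predicted probability of its true class falls below a threshold `tau`. The branching mirrors cases (i) and (ii), but the specific conditions are our assumption.

```python
import numpy as np

def bad_points_classification(model, X, y, tau=0.5):
    """Flag 'bad' points for a binary classifier (assumed criterion).

    A point is considered bad when the predicted probability assigned
    to its true class is below the threshold tau.
    """
    proba = model.predict_proba(X)  # columns assumed ordered as classes [0, 1]
    # Case (i): true label 1 -> bad if P(y=1 | x) < tau.
    # Case (ii): true label 0 -> bad if P(y=0 | x) < tau.
    bad = np.where(y == 1, proba[:, 1] < tau, proba[:, 0] < tau)
    return X[bad], y[bad]
```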
4.3. Genetic Algorithm
4.3.1. Formulation for Regression Problems
Algorithm 1. Mutation operator.
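As an illustration of what a mutation operator for real-valued individuals typically looks like in this setting, a hedged sketch follows: Gaussian perturbation applied per gene with probability `indpb` and scale `sigma`, clipped back into the feasible box. The operator form and all parameter names are assumptions; the paper's Algorithm 1 may differ.

```python
import numpy as np

def mutate(individual, low, high, sigma=0.1, indpb=0.2, rng=None):
    """Gaussian mutation for a real-valued individual (assumed operator).

    Each gene is perturbed with probability indpb by N(0, sigma^2) noise
    and clipped back into the feasible box [low, high].
    """
    rng = np.random.default_rng() if rng is None else rng
    individual = np.asarray(individual, dtype=float).copy()
    mask = rng.random(individual.shape) < indpb            # genes to mutate
    noise = rng.normal(0.0, sigma, size=individual.shape)
    individual[mask] += noise[mask]
    return np.clip(individual, low, high)                  # respect feature bounds
```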
4.3.2. Formulation for Classification Problems
4.4. Generative Model
5. Experimental Study
5.1. Data Description
5.2. Evaluation Scores
5.3. Toy Example
5.4. Results for the Regression Model on Real-World Data
5.5. Results for the Classification Model on Real-World Data
5.6. Hyperparameters
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Quantitative Results
| Model | Percent of Bad Points, % | Data | MSE (mean ± std) | SMAPE (mean ± std) | Wasserstein Distance (mean ± std) |
|---|---|---|---|---|---|
| Baseline model | 10 | — | — | — | — |
| Baseline model | 56 | — | — | — | — |
| Baseline model | 75 | — | — | — | — |
| Model for comparison | 10 | — | — | — | — |
| Model for comparison | 56 | — | — | — | — |
| Model for comparison | 75 | — | — | — | — |
| Model | Threshold for Bad Points | Data | MSE (mean ± std) | SMAPE (mean ± std) | Wasserstein Distance (mean ± std) |
|---|---|---|---|---|---|
| Baseline model | — | — | — | — | — |
| Model for comparison | — | — | — | — | — |
| Model | Threshold for Bad Points | Data | (mean ± std) | ROC AUC (mean ± std) | Wasserstein Distance (mean ± std) |
|---|---|---|---|---|---|
| Baseline model | — | — | — | — | — |
| Model for comparison | — | — | — | — | — |
| Data | MSE | Wasserstein Distance for Target | Wasserstein Distance for Features | Residual Variance |
|---|---|---|---|---|
| Initial bad data | 0.029 | 0.153 | 0.000 | 0.006 |
| Augmented data with | 0.039 | 0.171 | 0.090 | 0.010 |
| Augmented data with | 0.039 | 0.171 | 0.089 | 0.009 |
| Augmented data with | 0.041 | 0.170 | 0.092 | 0.012 |
| Augmented data with | 0.049 | 0.171 | 0.090 | 0.020 |
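For reference, the four columns above can be reproduced with standard tooling. The sketch below assumes mean squared error from scikit-learn, SciPy's one-dimensional Wasserstein distance (computed per feature and averaged for the feature column, which is one plausible reading), and residual variance as the variance of prediction residuals; the averaging and the exact residual definition are assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.metrics import mean_squared_error

def table_metrics(model, X_ref, y_ref, X_new, y_new):
    """Metrics reported in the toy-example table (assumed definitions)."""
    y_pred = model.predict(X_new)
    mse = mean_squared_error(y_new, y_pred)
    # 1-D Wasserstein distance between the reference and new target samples.
    wd_target = wasserstein_distance(y_ref, y_new)
    # Per-feature 1-D Wasserstein distances, averaged (our assumption).
    wd_features = np.mean([
        wasserstein_distance(X_ref[:, j], X_new[:, j])
        for j in range(X_ref.shape[1])
    ])
    resid_var = np.var(y_new - y_pred)  # variance of prediction residuals
    return mse, wd_target, wd_features, resid_var
```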
| Hyperparameter | Toy Example | Regression | Classification |
|---|---|---|---|
| Number of features | 2 | 16 | 12 |
| Number of categorical features | 0 | 1 | 4 |
| The threshold | | | |
| Number of steps in GA | 20 | | |
| Batch size | 16 | | |
| Number of epochs | 100 | | |
| Optimizer | AdamW | AdamW | AdamW |
| Initial learning rate | | | |
| Gradient max norm clip | 3 | 3 | 3 |
| Scheduler | CosineAnnealingLR | CosineAnnealingLR | CosineAnnealingLR |
| Minimum learning rate | | | |
| Cross-entropy loss multiplier | | | |
| Hidden channels | 16 | 16 | 16 |
| Latent dimension | 64 | 256 | 256 |
| Categorical embedding latent dimension | 0 | 4 | 4 |
| Decoder hidden dimension | 32 | 512 | 512 |
| Activation function | SiLU | SiLU | SiLU |
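The training-related rows of this table map onto a standard PyTorch loop. The sketch below instantiates that configuration for the regression column. The `BenchmarkGenerator` class and the placeholder data are invented for illustration, and the initial and minimum learning rates are assumed since those table cells are empty; the optimizer, scheduler, epochs, batch size, gradient clipping, activation, and layer sizes follow the table.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class BenchmarkGenerator(nn.Module):
    """Hypothetical generator using the regression-column dimensions."""
    def __init__(self, n_features=16, latent_dim=256, decoder_hidden=512):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, decoder_hidden),
            nn.SiLU(),                                   # activation from the table
            nn.Linear(decoder_hidden, n_features),
        )

    def forward(self, z):
        return self.decoder(z)

# Placeholder data standing in for the augmented bad set.
data = TensorDataset(torch.randn(256, 256), torch.randn(256, 16))
loader = DataLoader(data, batch_size=16, shuffle=True)   # batch size from the table

model = BenchmarkGenerator()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)       # initial LR assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)                  # minimum LR assumed

for epoch in range(100):                                 # epochs from the table
    for z, x in loader:
        loss = nn.functional.mse_loss(model(z), x)
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)  # clip from table
        optimizer.step()
    scheduler.step()
```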