Open AccessReview
Large Language Model Benchmarks: A Taxonomy of Capabilities, Scientific Quality Assessment, and Saturation Analysis
by
Rubén Gómez, Carlos E. Miranda, Julio-Alejandro Romero-González, Diana-Margarita Córdova-Esparza, Gendry Alfonso-Francia, Edgar-Arturo Chávez-Urbiola, Alfonso Ramirez-Pedraza and Juan Terven
Mach. Learn. Knowl. Extr. 2026, 8(6), 141; https://doi.org/10.3390/make8060141 (registering DOI) - 22 May 2026
Abstract
The rapid evolution of Large Language Models (LLMs) has exposed limitations of static, accuracy-oriented benchmarks and increased the need for evaluation frameworks that distinguish among capabilities and benchmark quality. This survey analyzes 63 LLM benchmarks spanning 2012–2026 and organizes them into a taxonomy
[...] Read more.
The rapid evolution of Large Language Models (LLMs) has exposed limitations of static, accuracy-oriented benchmarks and increased the need for evaluation frameworks that distinguish among capabilities and benchmark quality. This survey analyzes 63 LLM benchmarks spanning 2012–2026 and organizes them into a taxonomy of six capability dimensions and 20 operational subcategories. We also propose the Benchmark Quality Assurance Index (BQAI), an AHP-weighted composite framework for assessing the scientific quality of benchmarks across seven dimensions related to annotation, clarity, standardization, reproducibility, robustness, coverage, and fairness. The BQAI is applied to 30 representative benchmarks, corresponding to 48% of the 63-benchmark corpus, with three-evaluator blinded scoring, formal inter-rater reliability validation
and quadratic-weighted Cohen’s
, and Monte Carlo sensitivity analysis
. In addition, we synthesize public performance results for 16 models across 10 benchmarks to examine saturation trends and reporting gaps. The analysis indicates that benchmark usefulness varies substantially across evaluation settings, that several established benchmarks are becoming less discriminative for frontier models, and that important gaps remain in safety, agentic, and cross-cultural assessment. Together, the taxonomy, BQAI, and saturation analysis provide a structured perspective on the current LLM benchmark landscape and on priorities for more rigorous evaluation.
Full article