1. Introduction
The evaluation of researchers is a central concern in academia and scientometrics because it directly influences recruitment, promotion, tenure, research funding, and the conferral of prestigious awards [1, 2, 3]. Universities, funding agencies, and professional societies therefore require assessment mechanisms that remain reliable and equitable across heterogeneous research profiles. Traditional evaluation practices rely heavily on bibliometric indicators such as publication counts and citation totals [4]. Although these measures are easy to compute and broadly available, they provide an incomplete representation of scholarly impact. In particular, primitive metrics can overemphasize quantity or popularity and fail to account for citation distribution, sustained influence, and disciplinary variation [5, 6].
Over the past two decades, many indices have been introduced to address these limitations. The h-index proposed by Hirsch [7] remains one of the most influential because it combines productivity and citation impact into a single measure. Several extensions were subsequently developed to better capture citation distribution and to distinguish researchers with similar h-scores but different influence profiles, including the g-index [8], the A-index [9], and the R-index [9]. Additional refinements incorporated temporal normalization (e.g., m-quotient, hc-index, AWCR) and author contribution adjustments (e.g., hf-index, hi-index, fractional h-index, gm-index) [10, 11, 12, 13]. More recently, composite and multi-dimensional indices have been proposed to integrate multiple bibliometric signals into unified evaluation frameworks [14].
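To make the distinction between these indices concrete, the following minimal Python sketch computes the h-, g-, A-, and R-indices from a list of per-paper citation counts, using their standard textbook definitions. The two citation profiles are hypothetical and serve only to illustrate how researchers with the same h-index can differ on the other measures; the function names are illustrative and not taken from the study's implementation.

```python
import math

def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g^2 citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g

def a_index(citations):
    """Mean number of citations of the papers in the h-core."""
    h = h_index(citations)
    if h == 0:
        return 0.0
    core = sorted(citations, reverse=True)[:h]
    return sum(core) / h

def r_index(citations):
    """Square root of the total citations received by the h-core."""
    h = h_index(citations)
    core = sorted(citations, reverse=True)[:h]
    return math.sqrt(sum(core))

# Hypothetical citation profiles (assumed for illustration only).
steady = [6, 6, 5, 5, 5, 2, 1]
skewed = [60, 20, 9, 7, 5, 2, 1]
for name, cites in [("steady", steady), ("skewed", skewed)]:
    print(name, h_index(cites), g_index(cites), a_index(cites), round(r_index(cites), 2))
```

Both hypothetical profiles have an h-index of 5, but the skewed profile has a g-index of 7 against 5 for the steady one, and a much larger A-index, which is the kind of difference the h-index extensions are meant to expose.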
Despite these advances, three critical gaps remain unresolved. First, most indices are validated using narrow or domain-specific datasets, limiting generalizability across heterogeneous disciplines. Second, comparative studies predominantly rely on descriptive statistics or incremental metric variants, which cannot capture complex nonlinear relationships among bibliometric features. Third, and most importantly, existing predictive approaches, including those employing machine learning, rarely explain which bibliometric factors drive predicted outcomes, undermining trust and impeding adoption in institutional settings. No prior study has simultaneously addressed all three gaps within a unified, cross-domain, explainable framework.
Before describing the proposed framework, it is important to clarify the scientific rationale for using awards as evaluation targets. Awards conferred by established professional societies represent community-validated recognition of sustained research excellence. Although award decisions involve multifactorial judgments, identifying the bibliometric patterns that statistically distinguish awardees from non-awardees provides an empirical basis for understanding which measurable dimensions of scholarly output are associated with peer recognition. This study therefore treats award status as a proxy for high-impact research distinction, rather than claiming that bibliometric indices fully explain award outcomes. Four disciplines (Computer Science, Neuroscience, Mathematics, and Civil Engineering) were selected because they represent contrasting citation cultures, publication norms, and collaboration intensities, thereby providing a rigorous basis for cross-domain evaluation. Award-winning researchers were identified from recognized professional societies and academic organizations including ACM, IEEE, the American Mathematical Society, and the American Society of Civil Engineers, ensuring publicly verifiable and reproducible ground-truth labels.
This study proposes a SHAP-interpretable, multi-domain supervised learning framework for predicting academic award recognition using publication count-based bibliometric indices. In contrast to approaches that emphasize citation-age adjustments or collaboration-network features, the proposed framework deliberately focuses on publication-based metrics that capture research productivity and remain uniformly computable across disciplines. Thirty-two such indices were computed and normalized to improve cross-domain comparability, forming the feature space for supervised classification. Eight classifiers were evaluated, including Logistic Regression, Ridge Regression, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, Decision Trees, AdaBoost, and Extreme Gradient Boosting, under stratified five-fold cross-validation. Performance was assessed using accuracy, precision, recall, F1-score, and ROC-AUC to ensure both threshold-dependent and threshold-independent evaluation. SHAP (SHapley Additive exPlanations) was applied to quantify feature contributions at both global and local levels, transforming the predictive model into an explainable decision-support mechanism.
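The evaluation protocol described above can be sketched in plain Python. The example below is a minimal illustration, not the study's code: it shows one way to build stratified folds and to compute the threshold-dependent metrics (accuracy, precision, recall, F1) alongside the threshold-independent ROC-AUC, here in its pairwise-ranking form. The labels, scores, the 0.5 threshold, and all function names are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal round-robin within each class
    return folds

def threshold_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = awardee)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """Probability that a random awardee outscores a random non-awardee (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores for ten authors (illustration only).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.75, 0.35, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

folds = stratified_folds(y_true, k=5)
metrics = threshold_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(folds)
print(metrics)
print(auc)
```

In a full pipeline each fold would serve once as the held-out test set, and the metrics would be averaged across the five folds; a library routine would normally replace these hand-written helpers.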
The contributions of this study are as follows:
- (i) A unified, reproducible, multi-domain predictive framework for awardee classification that combines thirty-two publication-based bibliometric indices, evaluation across four disciplines, and SHAP-based explainability within a single pipeline, extending prior predictive bibliometric work [1, 15] with cross-domain validation and model-level interpretability.
- (ii) A systematic comparison of eight supervised learning algorithms across four heterogeneous disciplines, demonstrating that margin-based and ensemble methods generalize more robustly than probabilistic or single-tree approaches, advancing beyond prior single-domain evaluations that were limited to one or two classifier families.
- (iii) SHAP-driven feature attribution at both global and local levels, providing transparent evidence that normalized and balance-oriented indices, including normalized h-index variants, -type indices, -index, and g-index, consistently distinguish awardees across domains, while domain-specific indicators reflect disciplinary recognition patterns.
- (iv) A discussion of the ethical implications of deploying bibliometric prediction frameworks in institutional evaluation contexts, including considerations of fairness, transparency, and the risk of metric gaming, alongside a class-prior sensitivity analysis demonstrating the importance of evaluating framework performance under realistic class priors before any institutional deployment.
The remainder of this paper is organized as follows. Section 2 reviews related work on bibliometric indices and predictive evaluation models. Section 3 presents the proposed methodology, including dataset construction, feature computation, normalization, model training, and interpretability analysis. Section 4 reports experimental results, cross-domain comparisons, and SHAP-based feature attribution. Section 5 discusses the findings, their ethical implications, and the limitations of the study. Section 6 concludes the study and outlines directions for future research.
3. Materials and Methods
This section presents the proposed interpretable machine learning framework for predicting academic award recognition using publication-based bibliometric indices. The framework is structured as a multi-stage pipeline integrating dataset construction, feature engineering, supervised classification, quantitative evaluation, and explainable analysis.
Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ denote the constructed dataset, where $N$ is the total number of authors under study, $\mathbf{x}_i \in \mathbb{R}^{d}$ represents the vector of $d$ publication count-based indices for the $i$-th author, and $y_i \in \{0, 1\}$ denotes the class label (0 = non-awardee, 1 = awardee). The objective is to learn a mapping function that predicts award status while preserving interpretability of feature contributions:
$$f : \mathbb{R}^{d} \rightarrow \{0, 1\}$$
The overall architecture of the framework is illustrated in Figure 1. The workflow proceeds through six stages: (1) data acquisition, (2) preprocessing and normalization, (3) computation of bibliometric indices, (4) supervised model training, (5) performance evaluation, and (6) SHAP-based interpretability analysis. This structured design ensures reproducibility, cross-domain comparability, and transparency.
5. Discussion
The empirical evaluation demonstrates that machine learning models can effectively distinguish award-winning researchers from non-awardees using publication count-based bibliometric indices. Across the four examined disciplines, predictive performance varied according to disciplinary publication and citation structures, with domain-specific F1-scores ranging from 0.70 (Computer Science) to 0.78 (Mathematics) and ROC-AUC values from 0.75 to 0.88. XGBoost achieved the strongest results in Mathematics and Neuroscience, indicating that gradient boosting is better suited to domains with complex nonlinear bibliometric feature interactions. SVM performed best in Computer Science and Civil Engineering, where the feature space exhibits stronger margin-based separability. These patterns suggest that no single classifier dominates universally; rather, the optimal model reflects the underlying citation and productivity structure of each discipline. Overall, margin-based and ensemble methods consistently outperformed probabilistic and single-tree models, confirming that publication-based features contain sufficient predictive signal to support reliable awardee classification when analyzed through supervised learning frameworks.
Comparison with prior studies further highlights the contribution of the proposed framework. Earlier scientometric research largely focused on evaluating individual indices or comparing their descriptive performance within specific disciplines [4, 29]. Studies examining h-index variants demonstrated that no single metric consistently captures all dimensions of scholarly impact across domains. More recent predictive work has made progress: Alshdadi et al. [1] achieved prediction accuracy up to 70% using deep learning, while Usman et al. [15] applied logistic regression within a single domain. The present framework advances beyond these contributions in three specific ways. First, it evaluates eight classifiers simultaneously across four heterogeneous disciplines, enabling systematic cross-domain robustness assessment that prior single-domain studies could not provide. Second, unlike the deep learning approach of Alshdadi et al., which operates as a black-box system, the present framework integrates SHAP-based explainability that identifies which specific bibliometric features drive predictions, a critical requirement for institutional adoption. Third, by engineering thirty-two publication count-based indices organized into conceptual categories (productivity, impact, and structural balance), the framework evaluates the joint predictive contribution of multiple indices rather than treating them in isolation. These combined advances represent a meaningful step toward systematic, transparent, and generalizable bibliometric evaluation.
The interpretability analysis provides further insight into the bibliometric characteristics associated with academic recognition. SHAP-based feature attribution consistently identified normalized and balance-oriented indices as influential predictors across all four domains. Indicators such as the normalized h-index, -type variants, -index, and g-index emerged repeatedly among the most important features. These indices capture balanced productivity and citation concentration rather than relying solely on total publication counts or raw citation totals. This pattern aligns with established scientometric theory: Bornmann and Daniel [5] argued that composite indices better reflect scholarly influence than primitive metrics precisely because they capture the structural distribution of citations rather than their aggregate. The present findings provide empirical, cross-domain confirmation of this argument within a predictive rather than purely descriptive framework, suggesting that the same structural properties associated with scholarly influence in theory are also those that distinguish recognized researchers in practice. In practical terms, this transparency directly informs how the framework can be deployed responsibly in institutional settings: for example, a grant committee or promotion panel could use the model to rank a large candidate pool by predicted awardee probability, surfacing the top decile of researchers for detailed human review, analogous to how abstract-screening tools are used in systematic literature reviews, rather than applying it as a binary decision system. The SHAP explanations further allow evaluators to interrogate why a candidate ranks highly, for instance, whether their score is driven by sustained productivity efficiency or by citation concentration within the h-core, enabling informed and accountable judgment rather than uncritical acceptance of model outputs.
Domain-specific variations in feature importance reveal the influence of disciplinary publication cultures. In Neuroscience, the -index and w-index exhibited stronger influence, reflecting the field’s high citation intensity and the prominence of landmark publications that accumulate citations over long periods—both of which are captured by these citation-intensive, career-normalized indices. In Mathematics, the normalized h-index and -family metrics dominated, reflecting the relatively lower publication frequency and longer citation half-lives typical of theoretical disciplines, where productivity efficiency rather than volume is the distinguishing characteristic. Civil Engineering demonstrated a mixed pattern in which productivity-based indices such as the P-index interacted with citation balance metrics, reflecting the applied research culture of the field. Notably, the descriptive statistics revealed that non-awardees in Civil Engineering had higher raw h-index and publication counts than awardees on average, reinforcing that raw volume is insufficient and that citation structure is the decisive discriminator. These differences highlight the importance of evaluating bibliometric indicators within cross-domain frameworks rather than assuming universal applicability across disciplines.
The integration of SHAP-based interpretability has important methodological implications for scientometric evaluation. Traditional machine learning models frequently operate as opaque systems, limiting their practical adoption in academic decision-making contexts where transparency is essential [1]. By decomposing model predictions into additive feature contributions, the proposed framework transforms predictive modeling into an explainable decision-support mechanism. Institutions and funding agencies can therefore not only generate predictions regarding potential award recognition but also understand the specific bibliometric characteristics underlying those predictions. This transparency supports more evidence-based and accountable assessment of scholarly impact, and allows decision-makers to interrogate and challenge model outputs rather than accepting them uncritically.
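The additive decomposition underlying SHAP can be illustrated with a brute-force computation of exact Shapley values on a toy model. The sketch below is an assumption-laden illustration rather than the study's pipeline: it does not use the SHAP library, the scoring function and feature values are hypothetical, and "absent" features are simply held at a fixed baseline. Enumerating all feature orderings is only feasible for a handful of features, which is why practical tools rely on approximations or model-specific algorithms.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: the average marginal contribution of each feature
    over all feature orderings, with absent features held at the baseline."""
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for j in order:
            current[j] = x[j]         # switch feature j from baseline to its actual value
            new = predict(current)
            contrib[j] += new - prev  # marginal contribution in this ordering
            prev = new
    return [c / len(orderings) for c in contrib]

def toy_score(features):
    """Hypothetical scoring function with one interaction term (illustration only)."""
    norm_h, g, pubs = features
    return 0.5 * norm_h + 0.02 * g + 0.001 * pubs + 0.01 * norm_h * g

x = [0.8, 20.0, 150.0]         # hypothetical candidate
baseline = [0.4, 10.0, 100.0]  # hypothetical reference profile
phi = shapley_values(toy_score, x, baseline)
gap = toy_score(x) - toy_score(baseline)
print(phi)
print(sum(phi), gap)
```

The attributions sum to the difference between the candidate's score and the baseline score, which is the additivity property that lets an evaluator read each value as that feature's share of the prediction.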
The deployment of bibliometric prediction frameworks in institutional evaluation contexts raises important ethical considerations that must be addressed alongside predictive performance. Three concerns are particularly relevant. First, dataset bias: the present study relies on award lists from specific professional societies, which may not represent the full diversity of research excellence across career stages, geographic regions, or underrepresented communities. Researchers from lower-resource institutions, early-career scholars, or those working in interdisciplinary areas that are underrepresented in major award programs may be systematically disadvantaged by frameworks trained on such data. Second, model fairness: the misclassification analysis revealed that awardees with atypical bibliometric profiles, for example, those recognized for a single landmark contribution, are more likely to be classified as non-awardees. Deploying a framework that penalizes non-standard excellence patterns would reinforce existing structural inequities rather than correct them. Third, and critically, metric gaming: if institutional evaluations were to formally incorporate bibliometric prediction scores, researchers might optimize their behavior to inflate the specific indices identified as influential by the model, such as normalized h-index or -family indices, at the expense of genuine scientific contribution. This is a well-documented risk in bibliometric policy contexts [2] and argues strongly for using such frameworks as decision-support tools rather than decision-making systems. The present framework is intended to support human judgment, not replace it.
Several limitations of the present study should be acknowledged. The analysis focuses exclusively on publication count-based bibliometric indices and does not incorporate citation-age adjustments, collaboration-network measures, or temporal career dynamics, features that may capture additional dimensions of scholarly impact relevant to award recognition. The reliance on a single data source (Google Scholar via Publish or Perish) introduces potential indexing inconsistencies across disciplines, as Google Scholar’s coverage varies in depth and completeness between fields. The non-awardee matching strategy, while designed to reduce selection bias, relied on publication volume as the primary matching criterion; other potential confounders, such as career stage, institutional affiliation, and sub-disciplinary specialization, were not controlled. Finally, although the framework was validated across four disciplines, broader coverage of research fields, particularly humanities, arts, and interdisciplinary domains, and larger, more diverse datasets would strengthen generalizability.
6. Conclusions and Future Work
This study investigated whether machine learning techniques can reliably and transparently predict academic award recognition using publication count-based bibliometric indices. An interpretable multi-domain evaluation framework was developed and applied to four disciplines, Computer Science, Neuroscience, Mathematics, and Civil Engineering, integrating structured data collection, feature computation, supervised classification, stratified cross-validation, and SHAP-based explainability. By combining predictive modeling with interpretable analysis, the study advances beyond descriptive bibliometric comparisons toward a systematic, transparent, and cross-domain evaluation methodology.
Unlike traditional approaches that evaluate individual indices independently, the proposed framework integrates thirty-two publication count-based bibliometric indicators, organized into productivity-based, impact-based, and structural balance categories, within a unified predictive modeling pipeline. Eight supervised machine learning algorithms were evaluated to examine their ability to classify award-winning and non-award-winning researchers. The integration of comprehensive feature engineering, stratified five-fold cross-validation, ROC-AUC evaluation, and SHAP-based interpretability enabled a rigorous assessment of both predictive accuracy and the relative contribution of bibliometric features, with standard deviations reported to characterize result variability.
The empirical results reveal several important findings. Predictive performance differs across disciplines, reflecting domain-specific publication and citation structures; Mathematics achieved the highest discriminative performance, while Computer Science presented the most challenging classification task. XGBoost demonstrated the strongest performance in citation-intensive domains (Mathematics and Neuroscience), while SVM was more effective in domains with stronger margin-based separability (Computer Science and Civil Engineering). SHAP-based interpretability analysis consistently identified normalized and balance-oriented indices, including the normalized h-index, variants, -index, and g-index, as the most influential predictors across domains, providing empirical cross-domain confirmation that structural citation balance rather than raw productivity volume characterizes award-worthy scholarly output.
The findings demonstrate that publication count-based bibliometric indicators, when analyzed within an interpretable machine learning framework, provide meaningful and theoretically grounded signals for understanding academic recognition patterns. The integration of SHAP transforms the predictive model from a purely statistical classifier into a transparent decision-support mechanism, essential in evaluation contexts where fairness, accountability, and transparency are critical. The proposed framework is not intended to automate academic evaluation decisions, but rather to augment human judgment with evidence-based, explainable insights. Importantly, the ethical implications of such frameworks, including dataset bias, model fairness, and the risk of metric gaming, must be carefully considered before any institutional deployment, and the framework should be applied as a transparent support tool rather than a prescriptive decision system. It is important to note that all reported metrics are evaluated under a balanced 1:1 class distribution arising from the matched dataset design, and should not be interpreted as reflecting deployment performance under real-world class priors. A class-prior sensitivity analysis (Section 4.3) demonstrates that adjusted precision falls substantially at realistic awardee prevalence rates, reinforcing that the proposed framework is designed as a ranking and screening tool to support human judgment rather than as a standalone decision-making system. Probability calibration and evaluation under natural class priors are identified as essential directions for future work before any institutional deployment.
Despite these contributions, several limitations remain. The analysis is restricted to publication count-based indices and does not incorporate citation-age adjustments, collaboration-network features, or longitudinal career dynamics. The reliance on a single data source (Google Scholar via Publish or Perish) introduces potential coverage inconsistencies across disciplines. The non-awardee matching strategy, while structured to reduce selection bias, did not control for confounders such as career stage or institutional affiliation. Finally, the framework was validated across four disciplines; broader domain coverage and larger, more diverse datasets would further strengthen generalizability.
A further limitation concerns the evaluation setting. The dataset was constructed using a matched 1:1 design, which means both the training and test sets reflect a balanced class distribution rather than the natural prevalence of awardees in any real academic population. As demonstrated in Section 4.3, precision under realistic awardee prevalence rates of 1–5% falls to between 2% and 17% across domains, substantially below the balanced test estimates. Probability calibration, threshold adjustment, and evaluation under realistic class priors are therefore necessary prerequisites before this framework is applied in any institutional setting.
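The drop in precision under realistic priors follows directly from Bayes' rule: with true-positive rate TPR, false-positive rate FPR, and awardee prevalence π, the expected precision is TPR·π / (TPR·π + FPR·(1 − π)). The short sketch below evaluates this expression at a hypothetical operating point; the TPR and FPR values are assumptions chosen for illustration and are not taken from the reported results.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision when awardees make up `prevalence` of the population."""
    true_pos = tpr * prevalence            # share of population that is a flagged awardee
    false_pos = fpr * (1.0 - prevalence)   # share that is a flagged non-awardee
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0

# Hypothetical operating point (assumed, not a result from this study).
tpr, fpr = 0.75, 0.25
for prevalence in (0.50, 0.05, 0.01):
    print(prevalence, round(precision_at_prevalence(tpr, fpr, prevalence), 3))
```

At this assumed operating point, a precision of 0.75 on balanced data corresponds to roughly 0.14 at 5% prevalence and about 0.03 at 1% prevalence, which is the same order of magnitude as the range quoted above.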
Future research can extend this framework in several directions. Incorporating citation-age-adjusted indices, collaboration-network measures, and temporal bibliometric trajectories would provide a more complete representation of scholarly influence. The inclusion of formal statistical significance tests, such as McNemar’s test for pairwise classifier comparison or Friedman tests across multiple domains, would strengthen the statistical rigor of cross-model comparisons. Fairness-aware machine learning approaches, such as adversarial debiasing or constraint-based optimization, could address the dataset bias and model fairness concerns identified in this study. Extending the framework to additional disciplines, particularly humanities, social sciences, and interdisciplinary fields, and applying longitudinal modeling of academic careers would broaden applicability. Finally, integrating the framework with real-time bibliometric APIs (e.g., Semantic Scholar, OpenAlex) would support dynamic, continuously updated evaluation pipelines. Through these extensions, predictive scientometric evaluation may evolve toward more robust, fair, transparent, and context-aware mechanisms for assessing scholarly impact.
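As an indication of what the pairwise significance testing mentioned above could look like, the following sketch implements the exact (binomial) form of McNemar's test on the discordant predictions of two classifiers. It is a minimal illustration under stated assumptions: the labels and predictions are hypothetical, and the function name is not drawn from any particular library.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers.

    Returns (only_a, only_b, p_value), where only_a counts samples that
    classifier A gets right and B gets wrong, and only_b the reverse.
    """
    triples = list(zip(y_true, pred_a, pred_b))
    only_a = sum(1 for t, a, b in triples if a == t and b != t)
    only_b = sum(1 for t, a, b in triples if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the classifiers never disagree in correctness
    # Under the null hypothesis, each discordant pair favours A with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

# Hypothetical predictions from two classifiers on ten awardees (illustration only).
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
pred_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred_b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
result = mcnemar_exact(y_true, pred_a, pred_b)
print(result)
```

Only the samples on which the two classifiers disagree in correctness carry information for this test, so a large accuracy gap on a small test fold may still fail to reach significance.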