Article

A Hybrid BWM-GRA-PROMETHEE Framework for Ranking Universities Based on Scientometric Indicators

1
Doctoral Program of Information Systems, Universitas Diponegoro, Semarang 50241, Indonesia
2
Department of Informatics Engineering, Universitas Islam Sultan Agung, Semarang 50181, Indonesia
*
Author to whom correspondence should be addressed.
Publications 2026, 14(1), 5; https://doi.org/10.3390/publications14010005
Submission received: 17 November 2025 / Revised: 13 December 2025 / Accepted: 30 December 2025 / Published: 4 January 2026

Abstract

University rankings based on scientometric indicators frequently rely on compensatory aggregation models that allow extreme values to dominate the evaluation, while also remaining sensitive to outliers and unstable weighting procedures. These issues reduce the reliability and interpretability of the resulting rankings. This study proposes a hybrid BWM–GRA–PROMETHEE (BGP) framework that combines judgement-based weighting via the Best-Worst Method (BWM), outlier-resistant normalization via Grey Relational Analysis (GRA), and a non-compensatory outranking method, the Preference Ranking Organization Method for Enrichment Evaluation (PROMETHEE II). The framework is applied to an expert-validated set of scientometric indicators to generate more stable and behaviourally grounded rankings. The results show that the proposed method maintains stability under weight and threshold variations and preserves ranking consistency even under outlier-contaminated scenarios. Comparative experiments further demonstrate that BGP is more robust than Additive Ratio Assessment (ARAS), Multi-Attributive Border Approximation Area Comparison (MABAC), and the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS), achieving the highest Spearman rank correlation. This study contributes a unified evaluation framework that jointly addresses three major methodological challenges in scientometric ranking: outlier sensitivity, compensatory effects, and instability from data-dependent weighting. By resolving these issues within a single integrated model, the proposed BGP approach offers a more reliable and methodologically rigorous foundation for researchers and policymakers seeking to evaluate and enhance research performance.

1. Introduction

Research performance has become a key indicator of the quality and success of a university. The importance of research output is reflected in the emergence of various global ranking systems that aim to measure and compare the scientific contributions of higher education institutions (Demeter et al., 2022; Maral, 2024; Szluka et al., 2023). Several ranking organizations such as the Scimago Institutions Rankings (SIR), URAP, NTU, and the Leiden Ranking assess the research performance of universities worldwide (Lukić & Tumbas, 2019). These systems typically measure and rank institutions based on scientometric indicators, including publication volume and citation counts, which form the core of their evaluation frameworks (Schlögl et al., 2025). However, their methodologies rely on statistical aggregation, combining diverse scientometric indicators into a single composite score. This approach is often unsuitable because scientometric data are inherently multidimensional and therefore require evaluation through Multi-Criteria Decision-Making (Keenan, 2024; Maral, 2024). Furthermore, statistical aggregation methods lack consistency validation mechanisms and comprehensive sensitivity analyses, undermining the reliability and reproducibility of the resulting rankings (Daraio et al., 2023; Doğan & Al, 2019).
Another challenge is that scientometric data inherently contain outliers (Bornmann, 2024; Schmoch, 2020). According to Gagolewski et al. (2022), between 1 and 5% of publications generate more than 50% of all citations, indicating a heavy-tailed distribution. Similarly, Lovakov and Teixeira da Silva (2025) report that certain universities can produce publication outputs that are 10 to 20 times higher than the median. Such extreme values are not random anomalies but structural elements of the scientific ecosystem. Empirically, this heavy-tailed nature makes normalization processes highly sensitive, potentially producing unstable rankings driven by only one or two extreme observations. The presence of outliers in scientometric data not only causes compensatory effects and distorts ranking outcomes but also leads to inaccurate, unstable, and non-robust decisions (Erbey et al., 2025; Ziemba, 2022).
Several studies have explored Multi-Criteria Decision-Making (MCDM) approaches to enhance existing university ranking systems (Ayyildiz et al., 2023; Gul & Yucesan, 2022; C. Zhang et al., 2022). However, most of these studies have not specifically addressed the outlier and compensatory effects that commonly arise in scientometric data and can substantially bias or destabilize the resulting rankings; the developed ranking models generally combine scientometric indicators with other criteria such as education, finance, or survey-based measures. A recent study by Maral (2024) implemented an MCDM university ranking based on scientometric indicators; however, the model still exhibits several fundamental limitations. The Entropy weighting method it employs is potentially less robust in the presence of outliers, as the resulting criterion weights depend heavily on data dispersion, which can be distorted by extreme values (Erbey et al., 2025). Moreover, the ranking methods employed (ARAS, TOPSIS, and MABAC) are compensatory in nature: they allow low performance on one or two criteria to be offset by high performance on another. Since scientometric data are often highly skewed and prone to outliers, this compensatory effect can mask fundamental performance imbalances, allowing institutions with skewed profiles to achieve higher overall rankings (Baydaş et al., 2024; El Gibari et al., 2018).
The challenges posed by the presence of outliers in scientometric data and the compensatory effects among criteria necessitate a more stable and methodologically grounded ranking approach. These conditions motivate the present study to develop an evaluation framework that incorporates a weighting model unaffected by data distortions, stabilizes underlying data distributions, and limits the dominance of extreme indicators so that they do not obscure weaknesses in fundamental quality dimensions.
As a response to these challenges, this study proposes a new hybrid BWM–GRA–PROMETHEE (BGP) framework for measuring and ranking universities based on scientometric indicators. The proposed framework is capable of assigning criterion weights according to their relative importance to ensure consistent and unbiased weight determination, of handling outliers to provide robustness, understood as the ability of the method to preserve ranking stability under outlier contamination and skewed data distributions, and of generating rankings that limit compensatory effects across criteria. This framework is expected to produce rankings that are more stable, reliable, and methodologically robust.

2. Literature Review

This literature review examines previous research by identifying its limitations and exploring opportunities for integrating the methods. The analysis aims to highlight existing research gaps and establish the foundation for a hybrid method that is more adaptive, stable, and robust.
Several studies have integrated Multi-Criteria Decision-Making (MCDM) approaches into university ranking systems. Gul and Yucesan (2022) proposed a Bayesian BWM–TOPSIS model using institutional indicators from TÜBİTAK; C. Zhang et al. (2022) developed a three-stage MCDM–NRSDEA framework with bootstrapping to evaluate research efficiency among Chinese universities; Ayyildiz et al. (2023) combined hierarchical clustering, IVN-AHP, and VIKOR from a student perspective; and Trung Do (2024) investigated the effect of different weighting methods on the rankings of the top ten universities in Vietnam. The PROMETHEE framework has been applied to the evaluation of higher education performance, including university strategy formulation (Živković et al., 2017) and the analysis of knowledge transfer activities in academia (Ishizaka et al., 2020). Despite their diversity, these approaches share fundamental methodological weaknesses: their focus remains on generic institutional indicators while overlooking challenges of scientometric data, such as the presence of outliers and compensatory effects. Their contributions are limited to novel applications of established methods rather than to the development of new methodologies.
The use of scientometric indicators to measure and rank universities within the MCDM framework was proposed by Maral (2024), who combined objective weighting methods (Entropy, MEREC, CRITIC) with linear models (TOPSIS, MABAC, ARAS) aggregated through the Borda count. Although innovative, this approach remains limited, as the exclusive use of purely objective weights risks neglecting the substantive relevance of criteria, potentially obscuring their relative importance (Chen et al., 2025; Ezell et al., 2021; Zoraghi et al., 2013). The Entropy method has notable limitations in determining criterion weights, particularly due to its sensitivity to outliers and lack of statistical robustness. Its probability-based formulation leads to large shifts in weight assignments even under small deviations in the data, making it unstable when applied to irregular or noisy datasets (Chen et al., 2025; Erbey et al., 2025; Z. Li & Zhang, 2023). The MEREC method also exhibits limitations when applied to data with high variability or outliers: because MEREC calculates weights based on absolute deviation from an ideal condition, extreme values in a single criterion can exert a disproportionate influence and distort the overall weighting structure (Chen et al., 2025; Nguyen & Nhieu, 2025; Pala, 2024). The CRITIC method likewise has several limitations when applied to highly variable or outlier-contaminated datasets: because CRITIC estimates weights based on variance and correlation, it is sensitive to extreme values, allowing criteria with high dispersion to receive disproportionately large weights (Alrababah & Gan, 2023; Chen et al., 2025; X. Li et al., 2022).
Compensatory aggregation approaches also suffer from outlier-driven distortions, as their additive linear structure fails to represent non-compensatory preference behaviour. The application of ARAS to datasets with non-normal or outlier-contaminated distributions requires caution, because its additive and fully compensatory structure, combined with normalization-based scoring, makes it highly sensitive to data dispersion and scale heterogeneity. As a result, extreme values (outliers) can disproportionately increase the aggregate utility of an alternative, masking deficiencies in other criteria and ultimately producing unstable or distorted ranking outcomes (N. Liu & Xu, 2021; Turskis & Keršulienė, 2024). The TOPSIS method exhibits a compensatory aggregation structure, whereby very high performance on one indicator can offset substantial weaknesses on another. In addition, TOPSIS assumes full independence among criteria and cannot account for their interactions or dependencies; as a result, nonlinear or asymmetric relationships between criteria cannot be adequately modelled. Weight assignment alone is insufficient to address these complexities, making the method less robust when applied to structurally complex datasets (Baydaş et al., 2024; Jamwal et al., 2021; Rahman et al., 2024). In the context of heavy-tailed scientometric data, this compensatory mechanism may mask imbalances in performance across indicators. MABAC is likewise a fully compensatory method. Its computational structure, based on additive deviation through the Q matrix, makes the method highly sensitive to normalization procedures and differences in scale across criteria. As a result, criteria with larger variance or wider value ranges tend to dominate the final outcome. Moreover, MABAC lacks explicit mechanisms for controlling the influence of outliers or handling highly skewed data distributions (Ayan et al., 2023; Shi et al., 2022; Torkayesh et al., 2023), making it vulnerable to distortion when applied to heavy-tailed scientometric datasets.
These challenges are further compounded by limitations in the normalization techniques commonly used in scientometric evaluation. Approaches for data normalization in the scientometric domain are still commonly based on traditional methods such as min–max scaling, z-score standardization, and logarithmic transformation, which are widely used in MCDM implementations. However, these methods are not fully capable of mitigating the influence of outliers. In min–max scaling, a single extreme value expands the entire range, causing all other values to become disproportionately compressed. Z-score standardization is highly sensitive to the mean and standard deviation, both of which can be heavily affected by outliers. Logarithmic transformation reduces scale but does not eliminate the dominance of extreme values in heavy-tailed distributions (Gagolewski et al., 2022; Lovakov & Teixeira da Silva, 2025).
To overcome these methodological limitations, a more resilient evaluation framework must address the fundamental challenges of weighting instability, outlier-sensitive normalization, and the compensatory nature of traditional ranking models. Addressing the first challenge requires a weighting mechanism that is not affected by data distortions or heavy-tailed distributions. In this regard, the Best–Worst Method (BWM) offers higher efficiency and consistency in determining criterion weights, as it requires only pairwise comparisons between the best and worst criteria while considering their relative importance (Rezaei, 2015). In contrast to objective weighting methods such as Entropy, CRITIC, or MEREC, which derive weights directly from data dispersion and are therefore highly sensitive to skewness, variance, and outliers, BWM is entirely independent of data structure. Because its weights are elicited from expert judgements rather than the statistical properties of the dataset, BWM avoids distortions caused by extreme values, maintains stability across different data conditions, and produces weights that more accurately reflect the importance of indicators (Goodarzi et al., 2022; Keenan, 2024; Raed et al., 2025). This makes BWM particularly suitable for scientometric evaluation, where heavy-tailed distributions and outlier-driven variability commonly undermine the reliability of data-dependent weighting approaches.
While BWM resolves the issue of unstable and outlier-dependent weighting, the second challenge concerns the normalization process, which must be capable of stabilizing skewed scientometric distributions. This motivates the use of Grey Relational Analysis (GRA). The GRA method offers a data pre-processing mechanism through the Grey Relational Coefficient (GRC) transformation, which reduces the dominance of extreme values and stabilizes the underlying data distribution. In addition, GRA can be integrated with various MCDM methods, providing flexibility for constructing more effective evaluation frameworks (Esangbedo & Wei, 2023; Zheng et al., 2025). Several studies have demonstrated the effectiveness of GRA across different domains, for example, contribution-driven weighted GRA (Wu et al., 2025), university sustainability assessment using a coupling-GRA model (Zhu et al., 2024), and vocational education evaluation employing GRA–TOPSIS (H. Li, 2024). A comparative study between Entropy-weighted GRA and PROMETHEE has also been reported by Elevli and Elevli (2024). Furthermore, P. Li et al. (2022) developed a GRA–DEMATEL-based PROMETHEE approach for renewable energy selection, demonstrating the integration of GRA, although it focused primarily on linguistic datasets. Grey Relational Analysis not only handles information uncertainty but also effectively dampens the influence of outlier values (Başaran & Ighagbon, 2024). GRA offers a relational normalization mechanism suitable for skewed or outlier-contaminated data, computing the grey relational coefficient as an absolute measure of closeness to the ideal reference (Başaran & Ighagbon, 2024; Zheng et al., 2025). This mechanism enables consistent comparison among alternatives and, being based on relative differences, is more resistant to distortions caused by extreme values (Esangbedo & Wei, 2023; Paradowski et al., 2025).
However, even with stable weights and robust normalization, a third challenge remains: the compensatory behaviour of traditional aggregation models and prior methods, where strong performance on one or two indicators can overshadow weaknesses in others. To address this, PROMETHEE II provides a non-compensatory outranking mechanism, in which alternatives are compared through preference relations rather than numerical aggregation. In a non-compensatory framework, high performance on one or two criteria cannot fully offset substantial deficiencies on another, while outranking refers to a preference-based comparison indicating that one alternative is sufficiently supported to be preferred over another (Deng et al., 2022; Gul et al., 2018). Through its pairwise preference functions and the explicit use of indifference and preference thresholds, PROMETHEE II inherently limits undesired compensation across criteria (Papapostolou et al., 2024; Wulf et al., 2021). In addition, PROMETHEE II offers substantial methodological flexibility and can be integrated with various analytical procedures, enabling its adoption in numerous hybrid MCDM frameworks (Ishizaka & Resce, 2021; Oubahman & Duleba, 2024; Pinochet et al., 2023; E. Pohl & Geldermann, 2024).
Based on the synthesis of existing literature, no prior approach has simultaneously addressed the three major methodological challenges inherent in scientometric evaluation: the need for consistent weighting that is not distorted by extreme or skewed data, the need for a normalization mechanism that is robust to skewness and outliers, and the need for a non-compensatory ranking method capable of preventing dominance by one or two indicators. The absence of a framework that satisfies all three requirements reveals a clear research gap in the development of more robust and theoretically grounded scientometric ranking methodologies.
To bridge this gap, this study proposes a hybrid BWM–GRA–PROMETHEE II (BGP) framework. This integrated design intentionally leverages the complementary strengths of each component method: BWM provides rigorous and consistent weighting independent of data irregularities, GRA contributes a robust preprocessing layer capable of stabilizing heavy-tailed and outlier-prone scientometric distributions, and PROMETHEE II supplies a non-compensatory outranking logic that prevents dominant indicators from overshadowing weaknesses in others. Together, these components jointly resolve issues that have not been addressed in an integrated manner within the scientometric literature; this hybrid framework is expected to produce rankings that are more robust, stable, and reliable.

3. Materials and Methods

This section describes the data and indicators as well as the procedural steps of each method (BWM–GRA–PROMETHEE). Figure 1 sequentially illustrates the integration process.

3.1. Materials

The data were obtained from the Scopus API and further enriched with additional data from the SCImago Journal Rank (SJR) portal during the period 23–26 May 2025. In this study, only journal articles were included for analysis; we excluded non-journal publications such as conference proceedings, books, and other publication types to maintain analytical consistency. The dataset covers journal publications from 100 Indonesian universities indexed in Scopus over the period 2020–2024. The indicators presented in Table 1 are commonly used in scientometric evaluation and provide a comprehensive overview of research performance.
The indicators presented in Table 1 were used as candidate criteria to be evaluated by experts through the Delphi method, using a Likert scale, in which each indicator was required to meet predefined thresholds for Aiken's V, median, and interquartile range (IQR) to ensure relevance (Anculle-Arauco et al., 2024; Beiderbeck et al., 2021; Hesselink et al., 2024; Shang, 2023). Following this screening process, the overall consistency of expert judgements was externally validated using Kendall's coefficient of concordance (Kendall's W) (Olivero et al., 2022). This procedure resulted in a validated set of criteria used in the subsequent analysis.

3.2. Methods

3.2.1. Best-Worst Method (BWM)

Criteria weighting is a fundamental part of the decision process; we use the BWM introduced by Rezaei (2015) to determine the criteria weights. In BWM, pairwise comparisons are conducted between the best, worst, and remaining criteria to generate consistent weight values (Liang et al., 2020; Moslem et al., 2020; Rezaei, 2015). The BWM computational steps are described below:
Step 1: Pairwise Comparison
The BWM procedure starts with experts performing two sets of pairwise evaluations: the Best-to-Others (BO) comparisons and the Others-to-Worst (OW) comparisons. The BO vector compares the best criterion with all remaining criteria to indicate its relative dominance, while the OW vector assesses how strongly each criterion performs relative to the worst one (Keenan, 2024; Raed et al., 2025; Yu et al., 2025). These judgements are made using a nine-point (1–9) ordinal scale (Rezaei, 2015). The relationships can be represented as follows:
BO vector: $A_B = (a_{B1}, a_{B2}, \ldots, a_{Bn})$
OW vector: $A_W = (a_{1W}, a_{2W}, \ldots, a_{nW})$
In the BO vector, $a_{Bj}$ indicates how strongly the best criterion dominates criterion $j$, while in the OW vector, $a_{jW}$ reflects how much criterion $j$ is favoured compared with the worst criterion.
Step 2: Criteria Weights Calculation
The terms $w_1, w_2, \ldots, w_n$ are the criteria weights, determined by applying a minimax optimization approach that utilizes the two comparison vectors BO and OW obtained in Step 1. The model is defined as follows:
$$\min \xi \quad \text{subject to} \quad \left|\frac{w_B}{w_j} - a_{Bj}\right| \le \xi \;\; \forall j, \qquad \left|\frac{w_j}{w_W} - a_{jW}\right| \le \xi \;\; \forall j, \qquad \sum_{j=1}^{n} w_j = 1, \qquad w_j \ge 0 \;\; \forall j$$
where $w_B$ denotes the weight of the best criterion, $w_W$ denotes the weight of the worst criterion, and $w_j$ refers to the weight of each remaining criterion. The parameters $a_{Bj}$ and $a_{jW}$ capture the pairwise preference ratios in the best-to-others and others-to-worst comparisons. The variable $\xi$ represents the maximum deviation to be minimized across the pairwise judgements. Each weight must take a non-negative value, and the constraint $\sum_{j=1}^{n} w_j = 1$ scales all weights proportionally so that they sum to one.
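For illustration, the minimax model above can be solved numerically with a general-purpose nonlinear solver. The following is a minimal sketch, not the implementation used in this study, assuming SciPy's SLSQP method and splitting each absolute-value constraint into two smooth inequalities; all function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def bwm_weights(a_bo, a_ow, best, worst):
    """Solve the ratio-based BWM model: minimize xi subject to
    |w_best/w_j - a_Bj| <= xi, |w_j/w_worst - a_jW| <= xi, sum(w) = 1, w >= 0."""
    n = len(a_bo)
    x0 = np.append(np.full(n, 1.0 / n), 0.1)          # initial weights plus xi
    cons = [{'type': 'eq', 'fun': lambda x: x[:n].sum() - 1.0}]
    for j in range(n):
        # each |.| <= xi constraint becomes two inequalities of the form fun(x) >= 0
        cons += [
            {'type': 'ineq', 'fun': lambda x, j=j: x[n] - (x[best] / x[j] - a_bo[j])},
            {'type': 'ineq', 'fun': lambda x, j=j: x[n] + (x[best] / x[j] - a_bo[j])},
            {'type': 'ineq', 'fun': lambda x, j=j: x[n] - (x[j] / x[worst] - a_ow[j])},
            {'type': 'ineq', 'fun': lambda x, j=j: x[n] + (x[j] / x[worst] - a_ow[j])},
        ]
    bounds = [(1e-6, 1.0)] * n + [(0.0, None)]
    res = minimize(lambda x: x[n], x0, method='SLSQP', bounds=bounds, constraints=cons)
    return res.x[:n], res.x[n]                        # weight vector and optimal xi*

# Illustrative 5-criterion example: criterion 0 is best, criterion 4 is worst
w, xi_star = bwm_weights(np.array([1, 2, 3, 5, 9]), np.array([9, 5, 4, 3, 1]), 0, 4)
```

The returned $\xi^*$ feeds directly into the consistency ratio of Equation (4), and the per-expert weight vectors obtained this way can be averaged as in Equation (5).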
Step 3: Consistency Assessment
Consistency is assessed to evaluate the internal coherence of the pairwise comparisons. The calculation uses the formula defined as follows:
$$CR = \frac{\xi^*}{CI}$$
The term $CR$ represents the degree of internal consistency within the expert's pairwise comparison judgements, $\xi^*$ is the optimal objective value obtained from the optimization model, and $CI$ denotes the reference consistency index, which defines the maximum allowable inconsistency for the given comparison scale. The consistency index (CI) is adopted from the BWM proposed by Rezaei (2015). Table 2 presents the Consistency Index values.
Step 4: Aggregation of Individual Priorities
After the consistency of each expert's pairwise comparisons was verified (CR ≤ 0.1), the individual priority vectors were aggregated into a unified group priority vector to obtain the final criterion weights. The aggregation process is formulated as:
$$\bar{w}_j = \frac{1}{m} \sum_{k=1}^{m} w_j^k$$
The term $\bar{w}_j$ represents the aggregated weight of criterion $j$, reflecting its average priority based on expert evaluations. The parameter $m$ refers to the number of experts who contributed to the validation, and $w_j^k$ is the individual weight assigned by the $k$-th expert to criterion $j$.

3.2.2. Grey Relational Analysis (GRA)

Introduced by Ju-Long (1982), Grey Relational Analysis (GRA) offers a problem-solving approach for relationship evaluation, forecasting, and supporting decision-making (P. Li et al., 2022; Malekpoor et al., 2018). In this study, Grey Relational Analysis (GRA) is employed as a relational transformation that converts the raw decision matrix into a per-criterion dimensionless relational coefficient matrix (Başaran & Ighagbon, 2024; Zheng et al., 2025). This transformation enables scale-free comparability among indicators and enhances robustness against outliers (Esangbedo & Wei, 2023; Paradowski et al., 2025). The computational steps of GRA are as follows:
Step 1: Normalization
This normalization step transforms the raw performance values of all criteria into a dimensionless scale, eliminating differences in units between indicators and enabling fair and consistent comparison among alternatives (Baydaş et al., 2024). It is expressed as follows:
$$\text{benefit:} \quad r_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$
$$\text{cost:} \quad r_{ij} = \frac{\max_i x_{ij} - x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$
where $x_{ij}$ denotes the raw performance value of alternative $i$ under criterion $j$, while $\min_i x_{ij}$ and $\max_i x_{ij}$ represent the minimum and maximum values of that criterion across all alternatives. Higher resulting $r_{ij}$ values reflect better performance, regardless of whether the criterion is of the benefit or cost type.
Step 2: Determination of the Reference Sequence
After normalization, all criterion values fall within the [0, 1] range, where the value 1 represents the ideal performance for each criterion. The reference sequence is defined as:
$$r_{0j} = 1 \quad \forall j$$
where $r_{0j}$ denotes the ideal normalized value for the $j$-th criterion. Since all criteria have been normalized to the [0, 1] range, each $r_{0j}$ equals 1. These values form the ideal reference vector $R_0 = (r_{01}, r_{02}, \ldots, r_{0n})$, which serves as the benchmark for all subsequent GRA calculations.
Step 3: Calculation of the Grey Relational Coefficient (GRC)
The normalized matrix is transformed into a GRC matrix that quantifies the closeness of each alternative to the ideal condition. The calculation of the GRC involves three sub-steps, as follows (a compact computational sketch is provided after these sub-steps):
  • Deviation from the ideal.
For each entry, compute the absolute deviation from the ideal value, as follows:
    $$\Delta_{ij} = \left| 1 - r_{ij} \right|$$
    where $\Delta_{ij}$ represents the absolute deviation between the normalized performance value of an alternative and its ideal condition. The value 1 denotes the maximum value on the normalized [0, 1] scale, and $r_{ij}$ is the normalized performance of alternative $i$ under criterion $j$, obtained by scaling the original data to the [0, 1] range.
  • Global deviation boundaries.
    Determine the minimum and maximum deviation values across all alternatives and criteria, as follows:
    $$\Delta_{\min} = \min_{i,j} \Delta_{ij}, \qquad \Delta_{\max} = \max_{i,j} \Delta_{ij}$$
    where $\Delta_{\min}$ denotes the smallest deviation within the entire matrix, indicating the point closest to the ideal condition, and $\Delta_{\max}$ denotes the largest deviation, indicating the point farthest from the ideal. Both serve as global reference boundaries used to standardize the deviation range across the entire dataset.
  • Grey Relational Coefficient (GRC)
    The GRC converts each deviation value into a measure of relational closeness between an alternative and the ideal reference, expressing how similar or dissimilar an alternative's performance is to the ideal condition under each criterion. It is defined as follows:
    $$\gamma_{ij} = \frac{\Delta_{\min} + \zeta \, \Delta_{\max}}{\Delta_{ij} + \zeta \, \Delta_{\max}}$$
    where $\gamma_{ij}$ denotes the grey relational coefficient of alternative $i$ under criterion $j$, $\zeta$ is the distinguishing coefficient used to adjust the contrast level (Başaran & Ighagbon, 2024; Mahmoudi et al., 2020; Malekpoor et al., 2018), $\Delta_{ij}$ is the deviation from the ideal value for criterion $j$, and $\Delta_{\min}$ and $\Delta_{\max}$ are the minimum and maximum deviations across all alternatives and criteria.
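The three sub-steps above reduce to a few vectorized operations. The sketch below is offered as an illustration under the assumption of a NumPy decision matrix with one row per university and no constant-valued criterion; it combines Equations (6)–(11), and names such as `grc_matrix` are ours, not from the source.

```python
import numpy as np

def grc_matrix(X, benefit, zeta=0.5):
    """Raw decision matrix -> grey relational coefficient matrix.
    X: (alternatives x criteria); benefit: boolean flag per criterion."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Equations (6)-(7): min-max normalization for benefit and cost criteria
    R = np.where(benefit, (X - lo) / (hi - lo), (hi - X) / (hi - lo))
    delta = np.abs(1.0 - R)                    # Equation (9): deviation from r_0j = 1
    d_min, d_max = delta.min(), delta.max()    # Equation (10): global boundaries
    return (d_min + zeta * d_max) / (delta + zeta * d_max)   # Equation (11)
```

With all-benefit indicators, as in this study, `benefit` is simply an all-True vector; setting ζ = 0.5 reproduces the lower GRC bound of roughly 0.33 noted later in Section 4.3.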

3.2.3. PROMETHEE II

Developed by Brans and Vincke (1985), this method is based on pairwise comparisons between alternatives (Watrianthos et al., 2021; Yedjour et al., 2024). PROMETHEE II is employed to limit compensatory effects and generate a complete ranking of alternatives through pairwise comparisons, utilizing the criterion-wise closeness coefficients derived from GRA and the criterion weights obtained from the Best-Worst Method.
The method is designed to limit excessive compensation through a Type-V preference function, followed by net flow computation for the final ranking. The proposed integration maintains the axiomatic integrity of PROMETHEE II while incorporating relational normalization from GRA and statistically derived preference thresholds. These enhancements aim to improve robustness and interpretability without altering the fundamental outranking logic. The PROMETHEE II steps are described below:
Step 1: Pairwise Performance Difference
PROMETHEE II conducts its pairwise comparisons using the Grey Relational Coefficient (GRC) matrix $\gamma_{ij}$ supplied by the GRA procedure. Since these GRC values are already dimensionless and directly comparable across criteria, no additional normalization is required. The performance difference between alternatives $A_i$ and $A_k$ on criterion $j$ is defined as follows:
$$d_j(i, k) = \gamma_{ij} - \gamma_{kj}$$
where $\gamma_{ij}$ and $\gamma_{kj}$ represent the GRCs of alternatives $A_i$ and $A_k$, respectively, under criterion $j$. A positive difference $d_j(i, k) > 0$ indicates that $A_i$ is closer to the ideal state and thus performs better than $A_k$ on that criterion.
Step 2: Preference Function
After calculating the pairwise performance differences, the next step transforms the differences into preference degrees using a preference function. In this study, the Type-V (linear) function is adopted, which provides a gradual transition between indifference and preference. This formulation ensures that the model reflects relatively non-compensatory behaviour and reduces the impact of insignificant performance advantages. It is defined as follows:
$$P_j(d_j(i,k)) = \begin{cases} 0, & \text{if } d_j(i,k) \le q_j \\ \dfrac{d_j(i,k) - q_j}{p_j - q_j}, & \text{if } q_j < d_j(i,k) < p_j \\ 1, & \text{if } d_j(i,k) \ge p_j \end{cases}$$
where $P_j(d_j(i,k))$ denotes the preference degree of alternative $A_i$ over $A_k$ under criterion $j$, and $d_j(i,k)$ represents the performance difference between the two alternatives for that criterion. The preference function is governed by two threshold parameters: the indifference threshold ($q_j$) and the preference threshold ($p_j$) (Coquelet et al., 2025; Wulf et al., 2021). A difference $d_j(i,k) \le q_j$ yields zero preference ($P_j(d) = 0$), a difference $q_j < d_j(i,k) < p_j$ produces a preference value that increases linearly, and $d_j(i,k) \ge p_j$ yields full preference ($P_j(d) = 1$). The threshold values $q_j$ and $p_j$ are determined objectively based on the statistical characteristics of the normalized data distribution, following established approaches (Coquelet et al., 2025; E. Pohl & Geldermann, 2024; Wątróbski, 2023; Wulf et al., 2021), and are further tested through sensitivity analysis to ensure the robustness and consistency of the ranking results.
Step 3: Aggregate Preference Index
The individual preference values obtained for each criterion are aggregated into a single measure representing the overall preference of alternative $A_i$ over $A_k$. The calculation is expressed as:
$$\pi(i, k) = \sum_{j=1}^{n} w_j \, P_j(d_j(i,k))$$
The aggregate preference index $\pi(i, k)$ reflects how strongly alternative $A_i$ is favoured over $A_k$ when all criteria are jointly evaluated; $w_j$ denotes the weight of criterion $j$ obtained from BWM, $P_j(d_j(i,k))$ denotes the preference degree for criterion $j$, and $n$ refers to the overall number of criteria used in the model.
Step 4: Leaving and Entering Flows
After the aggregate preference index is calculated, the next step measures the leaving and entering flows to evaluate the dominance relations between alternatives, formulated as:
$$\phi^+(i) = \frac{1}{m - 1} \sum_{k \ne i} \pi(i, k)$$
$$\phi^-(i) = \frac{1}{m - 1} \sum_{k \ne i} \pi(k, i)$$
where the leaving flow $\phi^+(i)$ quantifies how strongly alternative $A_i$ outranks the others, the entering flow $\phi^-(i)$ measures the extent to which $A_i$ is dominated by the others, $m$ is the total number of alternatives, and $\pi(k, i)$ is the aggregated preference index.
Step 5: Net Flow—Complete Ranking
The net flow is computed to evaluate the overall performance of each alternative. It is expressed as follows:
$$\phi(i) = \phi^+(i) - \phi^-(i)$$
The net flow $\phi(i)$ provides a comprehensive measure of the dominance of each alternative, combining both its strengths and weaknesses in pairwise comparisons. An alternative with a higher net flow is ranked superior, whereas an alternative with a lower net flow is considered less favourable.
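Steps 1–5 can likewise be expressed compactly. The following sketch, assuming per-criterion threshold vectors `q` and `p` and a GRC matrix `gamma` as produced above, is a plain Type-V PROMETHEE II and is meant only to make the computation concrete:

```python
import numpy as np

def promethee_ii(gamma, weights, q, p):
    """Net flows from a GRC matrix via Type-V (linear) preference functions."""
    m = gamma.shape[0]
    d = gamma[:, None, :] - gamma[None, :, :]      # Eq. (12): d_j(i, k) for all pairs
    P = np.clip((d - q) / (p - q), 0.0, 1.0)       # Eq. (13): 0 below q_j, 1 above p_j
    pi = (P * weights).sum(axis=2)                 # Eq. (14): aggregate preference
    np.fill_diagonal(pi, 0.0)                      # no self-comparison
    phi_plus = pi.sum(axis=1) / (m - 1)            # Eq. (15): leaving flow
    phi_minus = pi.sum(axis=0) / (m - 1)           # Eq. (16): entering flow
    return phi_plus - phi_minus                    # Eq. (17): net flow

# A complete ranking follows from sorting the net flows in descending order:
# ranking = np.argsort(-promethee_ii(gamma, weights, q, p))
```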

4. Results

This section reports the outcomes of the hybrid BWM–GRA–PROMETHEE (BGP) framework applied to the university ranking dataset, focusing on sensitivity analysis of weights and parameters, and robustness testing under outlier contamination using the Friedman–Nemenyi procedure.

4.1. Result of Formulating the Criteria

Based on Table 1, the preliminary analysis indicated that the 14 scientometric indicators required further refinement to ensure that only those truly relevant would be retained as evaluation criteria. Therefore, all indicators were assessed using a Delphi procedure (Hasson et al., 2025). The expert panel consisted of seven members who rated each indicator on a 1–9 Likert scale to evaluate its level of relevance. Their ratings were analyzed using established validity measures, including content validity (Aiken’s V ≥ 0.80), relevance (Median ≥ 7), and judgement stability (IQR ≤ 1) (Anculle-Arauco et al., 2024; Beiderbeck et al., 2021; Hesselink et al., 2024; Shang, 2023).
Following the Delphi procedure, only indicators that met all validity criteria were retained, while those that did not meet the thresholds were eliminated. Through this process, five final indicators were established as the evaluation criteria for this study: SJR Score, International Collaboration, Total Citations (Citations), Publications in Top Journals (Q1 Count), and Number of Publications (Productivity), as presented in Table 3.
To further validate the coherence of the expert panel’s responses, Kendall’s coefficient of concordance (W) was calculated (Olivero et al., 2022). The significant result (Kendall’s W = 0.730, p < 0.001) indicates a strong and statistically reliable level of agreement among the experts. This high degree of consensus provides external validation for the robustness of the panel and reinforces the credibility of the selection outcomes.
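For transparency, the screening statistics used here are straightforward to compute. The sketch below shows common formulations of Aiken's V (for a 1–9 scale) and Kendall's W (without tie correction); the ratings shown are illustrative, not the panel's actual responses.

```python
import numpy as np

def aikens_v(ratings, lo=1, hi=9):
    """Aiken's V = sum(rating - lo) / (n_raters * (hi - lo))."""
    r = np.asarray(ratings, dtype=float)
    return (r - lo).sum() / (r.size * (hi - lo))

def kendalls_w(ranks):
    """Kendall's W from an (experts x indicators) rank matrix, ignoring ties."""
    m, n = ranks.shape
    col_sums = ranks.sum(axis=0)
    s = ((col_sums - col_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Delphi screening rule for one indicator rated by a 7-member panel
ratings = np.array([8, 9, 8, 7, 9, 8, 8])          # illustrative ratings
keep = (aikens_v(ratings) >= 0.80
        and np.median(ratings) >= 7
        and np.subtract(*np.percentile(ratings, [75, 25])) <= 1)
```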

4.2. Criteria Weighting Using BWM

The process of criteria weighting based on the judgements of seven experts is presented in Table 4. The BWM procedure began with the two comparison vectors, BO and OW, constructed using Equations (1) and (2).
Based on the BO and OW comparison matrices, the individual weight vectors and consistency levels were computed using the optimization model in Equation (3) of the BWM. The consistency analysis results were obtained using Equation (4); Table 5 shows that all respondents achieved a Consistency Ratio (CR) with a maximum value of 0.0305, well below the acceptable threshold of 0.1 (Rezaei, 2016). The low value of ξ* (approaching zero) indicates a high level of consistency across all expert judgements (Rezaei, 2015).
The individual weight vectors that satisfied the consistency requirement were subsequently aggregated using the Average Individual Priority (AIP) method (Equation (5)). The use of AIP was justified by the fact that all individual judgements were consistent (CR ≤ 0.1), allowing the simple arithmetic mean to effectively represent group consensus without assigning different weights to individual experts (Liang et al., 2020). The resulting criterion weights are presented in Table 6 and the distribution of criterion weights is presented in Figure 2. SJR Score obtained the highest weight (0.33656), indicating that journal quality is considered the most dominant factor in the institutional performance evaluation. Q1 Count (0.2473) and Citations (0.2185) follow, representing the dimensions of visibility and scientific impact. Meanwhile, Productivity (0.1050) contributes moderately to the overall assessment, and International Collaboration (0.0637) has the lowest weight. These results highlight that the experts placed greater emphasis on the quality of scientific output rather than on its quantity or the extent of international collaboration.

4.3. Results of Grey Relational Analysis

The raw data of scientometric indicators were first normalized into a dimensionless range of [0, 1] using the normalization procedures in Equations (6) and (7). This transformation eliminated unit inconsistencies between indicators and allowed direct comparison across all criteria. All indicators were of the benefit type, meaning higher values indicate better performance. Consequently, all normalized values became positively oriented toward the ideal condition. Table 7 presents the normalized decision matrix derived from this process. It shows that several universities achieved normalized values close to 1 across most benefit criteria, signifying strong overall scientometric performance.
The normalization stage effectively ensured that all indicators contributed to a common interpretative scale, where higher values consistently represented better scientometric performance. Following normalization, an ideal reference sequence was defined according to Equation (8). This sequence represents the theoretical optimum against which all alternatives are compared. It forms the benchmark for determining each university's closeness to the ideal performance profile in subsequent calculations. After establishing the reference sequence, the deviation between each normalized value and the ideal reference was computed based on Equation (9). The smallest and largest deviations across the dataset were then identified using Equation (10), forming the global deviation boundaries. Using these parameters, the Grey Relational Coefficient (GRC) was derived for each criterion through Equation (11), which measures how close each alternative is to the ideal condition. The distinguishing coefficient (ζ) is defined within the interval 0 ≤ ζ ≤ 1 to control the contrast level of the relational measure (Jamshaid et al., 2025; Mahmoudi et al., 2020). Following common practice in grey system theory, and to achieve a balanced level of sensitivity and stability, we set ζ = 0.5. Lower ζ values increase contrast among alternatives, making the GRC more sensitive to small deviations and potentially amplifying rank fluctuations, whereas higher ζ values smooth the contrast, improving stability but reducing discriminative power. Thus, ζ = 0.5 provides a balanced compromise between sensitivity and robustness (Jamshaid et al., 2025; Mahmoudi et al., 2020; Malekpoor et al., 2018). Table 8 summarizes the resulting GRC matrix, where higher coefficients indicate greater relational proximity to the ideal benchmark.
From the results as shown in Table 8, UNIV-1 consistently exhibits the highest GRC values across most criteria, implying the smallest deviation from the ideal reference. In contrast, UNIV-3 and several other institutions show lower coefficients, suggesting comparatively weaker scientometric alignment and lower proximity to the ideal performance profile. The raw indicators shown in Figure 3 exhibit wide dispersion and scale heterogeneity, especially in citations (ranging from 177 to 130,951). After min-max normalization, values are confined to (0, 1), but relative distribution patterns persist. The GRC transformation further compresses the range to (0.33, 1.00), creating a more stable and comparable metric space. This lower bound of 0.33 arises from the distinguishing coefficient (ζ = 0.5) in the GRC formula, which ensures all alternatives maintain a measurable relation to the ideal reference.

4.4. Results of PROMETHEE II

Pairwise performance differences were obtained from the Grey Relational Coefficients (GRC) generated in the GRA process. As defined in Equation (12), this stage measures how much closer one alternative is to the ideal reference compared with another under each evaluation criterion. The results show that alternatives such as UNIV-1, UNIV-4, and UNIV-2 exhibit predominantly positive differences across most criteria, indicating strong dominance. Conversely, several lower-ranked universities display negative differences, revealing lower proximity to the ideal reference profile. The pairwise performance differences were transformed into preference degrees through the Type-V (linear) preference function defined in Equation (13). The threshold setting in this study is inspired by the relative, percentage-based threshold approach of Papapostolou et al. (2024) and Wulf et al. (2021), who emphasize that PROMETHEE thresholds should be defined in proportion to the scale and variability of each criterion rather than treated as fixed constants. Consistent with this principle, and given the wide dispersion observed in our scientometric indicators, the thresholds were defined relative to the observed range (max–min) of each criterion. Exploratory analysis showed that adjacent differences among mid-ranked universities typically correspond to approximately 5–10% of the criterion range, whereas the gap between mid-range and top-performing universities often exceeds 50%. Accordingly, the indifference threshold was set to q = 10% and the preference threshold to p = 60% of each criterion's range, ensuring that minor fluctuations are filtered out while substantive performance differences are appropriately captured.
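Under this range-based rule, the thresholds reduce to two scalar multiples of each criterion's observed spread. A minimal sketch follows, with a stand-in GRC matrix in place of the study's data:

```python
import numpy as np

gamma = np.random.default_rng(0).uniform(0.33, 1.0, size=(100, 5))  # stand-in GRC matrix

crit_range = gamma.max(axis=0) - gamma.min(axis=0)   # observed (max - min) per criterion
q = 0.10 * crit_range    # indifference threshold: filters ~5-10% adjacent-rank noise
p = 0.60 * crit_range    # preference threshold: captures mid-to-top performance gaps
```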
The pairwise preference degrees were aggregated using the BWM-derived weights to obtain the aggregated preference matrix via Equation (14). This matrix represents the overall dominance of one university over another across all evaluation criteria. Representative results for all alternatives are presented in Table 9 and illustrated in Figure 4: UNIV-1 exhibits the highest dominance values against almost all other institutions, followed by UNIV-4 and UNIV-2, forming the upper cluster of high-performing universities. In contrast, mid- and lower-tier universities display substantially lower aggregate preference values (π < 0.3), reflecting weak competitiveness and limited outranking capability.
Following the aggregated preference matrix presented in Table 9, the leaving flow (ϕ+) and entering flow (ϕ−) were computed to evaluate mutual dominance among alternatives using Equations (15) and (16). The leaving flow indicates how strongly an alternative outranks others, while the entering flow measures how strongly an alternative is dominated by others. The net flow (ϕ) values were then derived according to Equation (17) to determine the comprehensive performance of each university. A higher net flow indicates stronger dominance and overall superiority; Table 10 shows the top-10 and bottom-10 rankings.
Table 10 shows the top-10 and bottom-10 universities ranked according to their net flow (ϕ = ϕ+ − ϕ−) values derived from the proposed BGP framework. As a result, UNIV-1, UNIV-4, and UNIV-2 achieve the highest net flow values (0.97703, 0.782374, and 0.7292, respectively), reflecting strong dominance and consistent superiority over the remaining universities. In contrast, the bottom-tier universities (e.g., UNIV-95, UNIV-73, UNIV-94, and UNIV-100) record zero leaving flow (ϕ+ = 0) and the highest entering flow (ϕ− ≈ 0.0378), resulting in markedly negative net flows (−0.0373 to −0.0370).
Figure 5 illustrates the complete PROMETHEE II ranking curve for the 100 evaluated universities. The horizontal axis represents the ranking positions, while the vertical axis shows each university's net flow (ϕ), which reflects its overall dominance relative to all other alternatives. The curve exhibits a steep decline at the top ranks, where a small group of universities (e.g., UNIV-1, UNIV-4, and UNIV-2) achieve substantially higher ϕ values than the rest. This sharp drop indicates a clear separation between the top-tier performers and the mid-tier group. After approximately rank 10, the curve stabilizes and forms a long, nearly flat tail, suggesting that the majority of universities exhibit relatively similar performance levels.
The classification of universities into High, Medium, and Weak performance groups, as visualized in Figure 6, was conducted using a clustering procedure based exclusively on the PROMETHEE II net flow values. Following a data-structure-driven modelling approach (Rohmer, 2025; Tamak et al., 2025), the standardized net flow scores were partitioned into three clusters (k = 3) using the K-Means algorithm. Each cluster was then interpreted according to its mean net flow, with the highest-scoring group labelled High, followed by Medium and Weak (de Vico et al., 2025; Hou et al., 2022). Although clustering relies solely on the net flow measure, the dominance structure is visualized using a two-dimensional preference map involving both the positive flow (ϕ+) and the net flow (ϕ). This visualization serves interpretive purposes only and does not influence the clustering results.
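As a sketch of this procedure, assuming scikit-learn's K-Means on the standardized net flows and relabelling clusters by their mean net flow (names and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def tier_labels(net_flow, k=3, seed=0):
    """Cluster standardized net flows into k tiers labelled by mean net flow."""
    z = (net_flow - net_flow.mean()) / net_flow.std()
    cl = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(z.reshape(-1, 1))
    means = np.array([net_flow[cl == c].mean() for c in range(k)])
    names = ['High', 'Medium', 'Weak']                 # highest cluster mean -> 'High'
    tier = {c: names[r] for r, c in enumerate(np.argsort(-means))}
    return np.array([tier[c] for c in cl])
```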
The dominance patterns among universities, illustrated in Figure 6 through the scatter plot of positive flow versus net flow, provide additional insight into how the clustered institutions behave under the PROMETHEE II preference structure. Universities positioned farther to the right exhibit stronger overall dominance relative to others, while those located higher on the plot demonstrate superior aggregate performance across all criteria. The stable distribution of institutions within the preference map reflects the effectiveness of the outlier-control mechanisms and the regulation of compensatory effects in the proposed model. Universities with consistently strong performance remain within the High cluster without distorting the distribution of other institutions, whereas universities with only partial strengths are not automatically pushed upward indicating that the preference mechanism maintains a controlled non-compensatory behaviour. Weaknesses in one or two criteria cannot be fully offset by extreme superiority in others, confirming that the model effectively suppresses the influence of extreme values while limiting excessive compensation in the ranking process.

4.5. Validations

4.5.1. Sensitivity Analysis

Sensitivity analysis was conducted to assess the stability and robustness of the proposed BGP ranking results under variations in criterion weights and preference thresholds (Chen et al., 2025; Makki & Abdulaal, 2023; Mukhametzyanov & Pamucar, 2018). The sensitivity test on criterion weight variations was performed by adjusting each weight between 10% and 90% of its baseline value (Makki & Abdulaal, 2023), while proportionally recalibrating the remaining weights. Figure 7a–e show that all alternatives maintained a consistent ranking order across all weight variation scenarios, as indicated by the horizontal line patterns without significant intersections. Only minor shifts were observed for the Productivity and Q1 Count criteria around the 50% weight level; overall, weight variations had no significant influence on the aggregated ranking positions. This result indicates that the model remains stable even under extreme weight variation scenarios of up to ±90%.
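The recalibration mechanics behind these scenarios can be sketched as follows. We assume one plausible reading of the scheme, in which the tested criterion's weight is set to a target share and the remaining weights are rescaled proportionally so the total stays at one; the exact scheme of Makki and Abdulaal (2023) may differ in detail.

```python
import numpy as np

def perturb_weight(w, j, share):
    """Set criterion j's weight to `share`; rescale the others proportionally."""
    w = np.asarray(w, dtype=float)
    out = w * (1.0 - share) / (1.0 - w[j])   # remaining weights keep their ratios
    out[j] = share
    return out                               # still sums to 1

# Example: sweep the SJR Score weight (index 0) across 10%...90% of the total
base = np.array([0.3366, 0.2473, 0.2185, 0.1050, 0.0637])
scenarios = [perturb_weight(base, 0, s) for s in np.arange(0.1, 1.0, 0.1)]
```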
The sensitivity of the Type-V PROMETHEE II preference-function thresholds was further examined by systematically varying the parameters (q, p) (Coquelet et al., 2025; Wulf et al., 2021) for the Citations criterion, where q ∈ {0.10, 0.15, 0.20} and p ∈ {0.50, 0.60, 0.70}, while the remaining criteria were kept at their baseline values. The Citations criterion was selected for this threshold sensitivity test because it has the most skewed distribution, making it particularly susceptible to local preference shifts in PROMETHEE II. As shown in Figure 7f–h, the rankings of the leading universities remain unchanged under all threshold configurations, and only marginal shifts occur among lower-ranked alternatives. These results demonstrate that even substantial variations in the indifference and preference thresholds do not alter the global outranking structure. Overall, the sensitivity analysis confirms that BGP demonstrates stable and robust performance against variations in both criterion weights and preference function parameters.

4.5.2. Comparative Analysis

A comparative analysis was conducted to evaluate the proposed BGP method against ARAS, MABAC, and TOPSIS under controlled outlier-contamination scenarios. Outliers were introduced by selectively perturbing two scientometric indicators, Citations and SJR Score, which are empirically prone to extreme deviations due to their heavy-tailed and highly skewed distributions (Gagolewski et al., 2022; Lovakov & Teixeira da Silva, 2025). The contamination procedure targeted the top 10 universities on each indicator, identified separately from the complete set of alternatives; these represent elite performers, where extreme deviations are most likely to exert a disproportionate influence on ranking stability. For these selected universities, the corresponding indicator values (Citations or SJR Score) were amplified using a multiplicative factor of ×8, a magnitude chosen to generate extreme yet plausible deviations exceeding the 99th percentile of the original distribution. In robust statistical analysis, observations beyond the 99th percentile are commonly classified as extreme outliers because they lie outside the typical variability range and may disproportionately influence ranking and distance-based decision models (Gao et al., 2025; Paulillo et al., 2021). Contamination was applied independently to Citations and SJR Score within each scenario, while all other indicators for all universities remained unchanged. The entire set of alternatives was then re-ranked under each contaminated scenario using BGP, ARAS, MABAC, and TOPSIS. In total, 30 contaminated scenarios were generated, and the resulting rankings were compared with the baseline (uncontaminated) rankings to assess robustness.
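One contamination scenario therefore amounts to a single multiplicative perturbation of one indicator column. A minimal sketch, assuming a raw indicator matrix `X` with universities as rows (function and parameter names are ours):

```python
import numpy as np

def contaminate(X, col, top_k=10, factor=8.0):
    """Amplify indicator `col` for its top-k performers (one outlier scenario)."""
    Xc = np.asarray(X, dtype=float).copy()
    idx = np.argsort(Xc[:, col])[::-1][:top_k]   # top-k universities on this indicator
    Xc[idx, col] *= factor                       # x8 multiplicative extreme values
    return Xc
```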
The robustness-testing framework in this study aligns closely with contemporary methodological advances in MCDM, which have shifted from simple parameter sensitivity checks toward structured, multi-metric stress-testing under severe yet plausible data perturbations (Farnè & Vouldis, 2024; Guo et al., 2024). Reflecting this paradigm, our evaluation is conducted under controlled outlier-contamination scenarios that deliberately target elite performers, where extreme deviations are most likely to exert a disproportionate influence on ranking stability. To comprehensively assess robustness, we employ an integrated suite of complementary metrics, including Spearman’s rank correlation, the Sum of Ranking Differences (SRD), the Stability Ratio (SR@k) (Erbey et al., 2025; Gagolewski et al., 2022), Maximum Rank Change (Max ΔRank), and the Rank Inversion Ratio (RIR) (Andjelković et al., 2024). Together, these indicators capture multiple dimensions of ranking resilience, global rank-order preservation, cumulative positional deviation, stability of top-tier institutions, extreme positional shifts, and susceptibility to local pairwise inversions.
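The metrics in this suite admit simple formulations. The sketch below uses common definitions, with baseline and perturbed rank positions (1 = best), absolute rank differences for SRD and Max ΔRank, top-k overlap for SR@k, and the share of inverted pairs for RIR; the exact operationalizations in the cited sources may differ.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def robustness_metrics(base, pert, k=5):
    """Compare a contaminated ranking against the baseline (rank 1 = best)."""
    base, pert = np.asarray(base), np.asarray(pert)
    rho, _ = spearmanr(base, pert)                       # global rank-order preservation
    diffs = np.abs(base - pert)
    srd = diffs.sum()                                    # sum of ranking differences
    sr_k = len(set(np.where(base <= k)[0])
               & set(np.where(pert <= k)[0])) / k        # stability ratio SR@k
    max_dr = diffs.max()                                 # maximum rank change
    n = len(base)
    inv = sum((base[i] - base[j]) * (pert[i] - pert[j]) < 0
              for i, j in combinations(range(n), 2))
    rir = inv / (n * (n - 1) / 2)                        # rank inversion ratio
    return rho, srd, sr_k, max_dr, rir
```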
To rigorously discern performance differences among methods across the 30 contaminated scenarios, we follow established comparative standards by applying the Friedman test with Nemenyi post hoc analysis (Pulvera & Lao, 2024; Wachowicz & Roszkowska, 2025). This multi-faceted evaluation framework enables a robust comparison of how effectively each method maintains ranking integrity when exposed to targeted, outlier-induced distortions. The full results of this robustness analysis are reported in Table 11.
Based on the robustness results summarized in Table 11, the proposed BGP framework consistently demonstrates superior robustness compared to ARAS, MABAC, and TOPSIS under controlled outlier-contamination scenarios. BGP achieves the highest Spearman’s rank correlation (0.9985), indicating the strongest preservation of the global ranking order, along with the lowest Sum of Ranking Differences (SRD = 98.7), reflecting minimal cumulative positional deviation across scenarios. In addition, BGP exhibits the highest SR@5 value (0.99), confirming superior stability among top-tier universities, as well as the smallest maximum rank change (Max ΔRank = 5.73) and the lowest rank inversion ratio (RIR = 0.52).
By contrast, ARAS shows moderate robustness but experiences larger rank deviations and instability among leading institutions, as evidenced by higher SRD and Max ΔRank values. MABAC and TOPSIS exhibit greater sensitivity to outlier-induced perturbations, reflected in their lower Spearman’s correlations, higher cumulative rank deviations, and increased rank inversion ratios. These results indicate that the proposed BGP framework maintains ranking integrity more effectively than the benchmark methods when exposed to targeted, outlier-driven distortions.
The results of the Friedman test, which was applied to evaluate whether the performance differences among the compared MCDM methods are statistically significant across multiple contamination scenarios are presented in Table 12.
The obtained Friedman test statistic (χ2 = 82.03, df = 3, p = 1.13 × 10−17), as shown in Table 12, indicates a highly significant difference (p < 0.001) among the compared methods, confirming that their robustness performances are not statistically equivalent. With 30 blocks (contaminated scenarios) and a critical difference at the 5% significance level (CD0.05 = 1.211), as widely adopted in Friedman–Nemenyi analyses for method comparison (Pulvera & Lao, 2024; Wachowicz & Roszkowska, 2025), the results justify the application of the post hoc Nemenyi test to identify specific pairwise differences between methods.
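The testing logic can be reproduced with standard tools. The sketch below, using SciPy's Friedman test on an illustrative scenario-by-method score matrix and the usual Nemenyi critical-difference formula, is indicative only; the stand-in data and the tabulated q value for k = 4 groups are assumptions, and packages such as scikit-posthocs provide ready-made pairwise Nemenyi comparisons.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# rows = contamination scenarios (blocks), columns = methods; stand-in scores
scores = np.random.default_rng(1).random((30, 4))
chi2, p = friedmanchisquare(*(scores[:, m] for m in range(scores.shape[1])))

# Average rank of each method within every block (higher score = rank 1 here)
avg_ranks = np.mean([np.argsort(np.argsort(-row)) + 1 for row in scores], axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
k, n_blocks = scores.shape[1], scores.shape[0]
cd = 2.569 * np.sqrt(k * (k + 1) / (6 * n_blocks))   # q_0.05 ~ 2.569 for k = 4
```

Pairs of methods whose average ranks differ by more than the critical difference are declared significantly different at the chosen level.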
Following the significant Friedman test, a post hoc Nemenyi analysis was conducted to identify which specific pairs of methods exhibit statistically significant differences. The results of these pairwise comparisons are presented in Table 13.
The post hoc Nemenyi results are presented in Table 13, which reports the average rank differences between each pair of methods and compares them against the critical difference (CD0.05 = 1.211). All pairwise comparisons involving BGP exceed this threshold, indicating statistically significant differences (p < 0.05). In particular, BGP outperforms ARAS, MABAC, and TOPSIS with average rank differences of 1.612, 1.700, and 2.967, respectively.
These comparisons demonstrate that the robustness performance of the proposed BGP framework over ARAS, MABAC, and TOPSIS is statistically substantiated by the Friedman–Nemenyi analysis and is not attributable to random variation.
The proposed BGP framework demonstrates the strongest overall robustness performance among the evaluated methods, as summarized in Table 11 and illustrated in Figure 8. BGP achieves the highest average rank correlation (Spearman’s ρ = 0.9985), indicating superior global rank-order preservation under outlier-contamination scenarios. It also records the lowest Sum of Ranking Differences (SRD = 98.7), reflecting minimal cumulative positional deviations from the baseline ranking.
In terms of top-tier stability, BGP attains the highest Stability Ratio (SR@5 = 0.99), confirming that the leading universities remain largely unaffected by data perturbations. Furthermore, BGP exhibits the smallest maximum rank displacement (Max ΔRank = 5.73) and the lowest Rank Inversion Rate (RIR = 0.52), indicating strong resistance to extreme positional shifts and local pairwise rank reversals.
By comparison, ARAS shows moderate robustness but experiences larger ranking deviations (SRD = 134.6) and higher maximum rank changes (Max ΔRank = 9.70), suggesting increased sensitivity to outlier-induced distortions. MABAC performs competitively in terms of rank correlation (Spearman’s ρ = 0.9973) and SR@5 (0.97), yet exhibits higher cumulative deviations and inversion rates than BGP. TOPSIS displays the weakest robustness overall, with the lowest rank correlation (Spearman’s ρ = 0.9950), the highest SRD (183.1), and the largest inversion rate (RIR = 0.70), indicating pronounced vulnerability to data perturbations.
Figure 8 presents a heatmap illustrating the dominant ranking method across the 30 outlier-contamination scenarios based on the Stability Ratio at the top 10 positions (SR@10). Each row corresponds to a contamination scenario (S01–S30), while each column represents a competing method. A highlighted cell indicates the method achieving the highest SR@10 value in the corresponding scenario, thereby identifying the most stable method in preserving top-ranked institutions under that perturbation.
The heatmap reveals a clear and consistent dominance of the proposed BGP framework, which achieves the highest SR@10 score in all evaluated scenarios. This uniform pattern demonstrates that BGP most effectively preserves the stability of leading universities despite severe perturbations in key indicators. In contrast, ARAS, MABAC, and TOPSIS do not emerge as dominant in any scenario, indicating comparatively weaker resilience in maintaining top-tier rankings under outlier contamination.
Overall, the results confirm that the integration of BWM, GRA, and PROMETHEE within the proposed BGP framework effectively enhances ranking stability across multiple robustness dimensions. The consistent dominance observed in the SR@10 heatmap (Figure 8) provides strong visual evidence that complements the metric-based results reported in Table 11, jointly demonstrating that BGP more reliably suppresses distortions caused by extreme outliers and outperforms the benchmark methods.

5. Discussion

When developing a ranking framework, it is essential to ensure that the applied weighting procedure is not distorted by outliers. Scientometric data naturally contain outliers (Bornmann, 2024; Gagolewski et al., 2022; Lovakov & Teixeira da Silva, 2025; Schmoch, 2020), making objective, variance-based weighting methods unsuitable because their outcomes can be heavily influenced by extreme values (Erbey et al., 2025). The BGP framework therefore uses the Best–Worst Method (BWM), whose weighting mechanism relies solely on expert judgement rather than the statistical properties of the dataset (Kheybari et al., 2021; Rezaei, 2015), ensuring that the weighting process remains stable even when scientometric indicators exhibit heavy-tailed distributions or contain extreme outliers. BWM stabilizes the criterion weights by ensuring that each weight reflects the relative importance of expert-validated scientometric indicators rather than the variance of the data. The resulting weights indicate that SJR Score (0.3656), Q1 Count (0.2473), and Citations (0.2185) are perceived by experts as the most influential indicators of research performance, followed by Productivity (0.1050) and International Collaboration (0.0637). The high level of consistency (CR ≤ 0.1) shown in Table 5 confirms that the criterion weights reported in Table 6 are valid.
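For illustration, the linear form of the BWM (Rezaei, 2016) can be solved as a small linear programme. The sketch below, written in Python with SciPy, uses Expert 1's Best-to-Others and Others-to-Worst vectors from Table 4; it is a minimal reconstruction rather than the authors' implementation, but it recovers that expert's weights in Table 6 (≈0.432, 0.048, 0.160, 0.240, 0.120) together with ξ* ≈ 0.0480 and CR ≈ 0.0107 as reported in Table 5, up to solver tolerance.

```python
import numpy as np
from scipy.optimize import linprog

# Criteria order: SJR Score, Intl Collab, Citations, Q1 Count, Productivity.
# Expert 1 judgements from Table 4 (Best = SJR Score, Worst = Intl Collab).
a_B = [1, 8, 3, 2, 4]   # Best-to-Others comparisons
a_W = [8, 1, 4, 6, 3]   # Others-to-Worst comparisons
best, worst, n = 0, 1, 5

def abs_le_xi(i, coef, j):
    """Two rows encoding |w_i - coef * w_j| <= xi (xi is the last variable)."""
    r1 = np.zeros(n + 1); r1[i] += 1.0; r1[j] -= coef; r1[-1] = -1.0
    r2 = np.zeros(n + 1); r2[i] -= 1.0; r2[j] += coef; r2[-1] = -1.0
    return [r1, r2]

A_ub = []
for j in range(n):
    if j != best:
        A_ub += abs_le_xi(best, a_B[j], j)   # |w_B - a_Bj * w_j| <= xi
    if j != worst:
        A_ub += abs_le_xi(j, a_W[j], worst)  # |w_j - a_jW * w_W| <= xi

c = np.zeros(n + 1); c[-1] = 1.0             # objective: minimise xi
A_eq = [[1.0] * n + [0.0]]                   # weights sum to one
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.zeros(len(A_ub)),
              A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * (n + 1),
              method="highs")

w, xi = res.x[:n], res.x[-1]
CI = 4.47                                    # consistency index for a_BW = 8 (Table 2)
print("weights:", np.round(w, 4))            # ~ [0.432, 0.048, 0.160, 0.240, 0.120]
print("xi*:", round(xi, 4), "CR:", round(xi / CI, 4))  # ~ 0.048, 0.0107
```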
The relatively low weight assigned to International Collaboration suggests that experts do not view collaborative breadth as a direct proxy for research quality or scientific impact. This perspective aligns with characteristics of the Indonesian research ecosystem, where patterns of international co-authorship are frequently shaped by mobility programmes, institutional agreements, or funding-driven partnerships (Yudhoyono et al., 2025), rather than by strong thematic or methodological alignment. While collaboration may increase publication volume, it does not consistently lead to higher citation impact or publication in high-prestige journals. This indicates that scientometric evaluation in the Indonesian context is driven predominantly by quality-oriented indicators rather than network-based measures.
Handling outliers in raw scientometric data cannot be achieved effectively with traditional normalization methods such as min–max scaling, z-score standardization, or logarithmic transformation, all of which are highly susceptible to distortion by extreme values. In min–max scaling, a single extreme value can drastically expand the range, causing all other observations to become disproportionately compressed. Z-score standardization is highly sensitive to the mean and standard deviation, both of which are easily influenced by outliers. Logarithmic transformation may reduce the scale but cannot eliminate the dominance of extreme values in heavy-tailed distributions (Gagolewski et al., 2022; Lovakov & Teixeira da Silva, 2025). To handle outliers, the BGP framework instead applies GRA with the distinguishing coefficient ζ, which yields the Grey Relational Coefficient (GRC) transformation (Jamshaid et al., 2025; Mahmoudi et al., 2020; Malekpoor et al., 2018). This transformation compresses extreme values into a more stable and proportional range (Başaran & Ighagbon, 2024; Zheng et al., 2025), effectively reducing the influence of outliers and preventing them from distorting the resulting rankings. As shown in Table 8, GRA mitigates the impact of extreme values without eliminating meaningful differences among universities. Furthermore, Figure 3 illustrates how the GRC transformation compresses the raw data into the range (0.33, 1.00), creating a more stable and comparable metric space in which all alternatives maintain a measurable relationship to the ideal reference.
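The GRC transformation itself is compact. The following sketch applies the standard coefficient formula, GRC = (Δmin + ζΔmax)/(Δ + ζΔmax) with ζ = 0.5, to the normalized rows of Table 7. It is a simplified reconstruction under the assumption of an all-ones ideal reference series, and it reproduces the corresponding Table 8 values to within rounding; the `zeta` parameter also makes the ζ-sensitivity sweep suggested later in the paper straightforward.

```python
import numpy as np

# Benefit-normalised scores from Table 7
# (rows: UNIV-1, UNIV-4, UNIV-2, UNIV-3, UNIV-97; columns: SJR Score,
#  Citations, Productivity, Q1 Count, Intl Collab).
X = np.array([
    [1.000, 1.000, 0.940, 1.000, 0.758],
    [0.748, 0.878, 1.000, 0.687, 1.000],
    [0.777, 0.572, 0.841, 0.828, 0.666],
    [0.545, 0.445, 0.473, 0.696, 0.456],
    [0.000, 0.0047, 0.0009, 0.0000, 0.0046],
])

def grey_relational_coefficient(X, zeta=0.5):
    """GRC against the ideal reference series (all ones after normalisation)."""
    delta = np.abs(1.0 - X)                # deviation from the ideal point
    dmin, dmax = delta.min(), delta.max()  # global extreme deviations
    return (dmin + zeta * dmax) / (delta + zeta * dmax)

# First row ~ [1.000, 1.000, 0.893, 1.000, 0.674], matching Table 8;
# with zeta = 0.5 every coefficient is bounded within [1/3, 1].
print(np.round(grey_relational_coefficient(X), 3))
```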
We also need to address compensatory effects, which occur when a small number of dominant criteria offset weaknesses in others, potentially leading to unstable rankings. The BGP framework mitigates this risk by reinforcing a non-compensatory paradigm through the integration of PROMETHEE II's outranking logic. This mechanism prevents excessive dominance of one or two indicators from overshadowing deficiencies in others, in line with the principle of non-substitutability among criteria, whereby superiority in one aspect should not compensate for substantial shortcomings in another (Erbey et al., 2025; Ziemba, 2022). As a result, the construct validity of the ranking process is strengthened, and interpretive distortions commonly observed in compensatory methods are reduced (N. Liu & Xu, 2021; Turskis & Keršulienė, 2024). In BGP, the indifference (q) and preference (p) thresholds of the PROMETHEE II Type-V preference function create a gradual transition between alternatives, limiting compensatory effects among criteria and yielding a more robust ranking structure. As demonstrated by the sensitivity test on criterion weight variations between 10% and 90% of their baseline values (Makki & Abdulaal, 2023), with proportional recalibration of the remaining weights, all alternatives maintained a consistent ranking order across all scenarios (Figure 7a–e). Similarly, the sensitivity analysis of preference function threshold variations (q, p) (Coquelet et al., 2025; Wulf et al., 2021) for the Citations criterion, where q ∈ {0.10, 0.15, 0.20} and p ∈ {0.50, 0.60, 0.70}, showed that the top-ranked universities remained unchanged across all (q, p) combinations (Figure 7f–h).
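To make the non-compensatory step concrete, the following sketch implements PROMETHEE II with the Type-V preference function. It assumes, following the Conclusions, that q and p are set to 10% and 60% of the common GRC value range, and it applies the Table 6 weights to the Table 8 excerpt; under these assumptions it reproduces, for example, the aggregated preference π(UNIV-1, UNIV-4) ≈ 0.6165 reported in Table 9 to within input rounding, although the net flows of the full ranking would of course require all 100 alternatives.

```python
import numpy as np

def type_v(d, q, p):
    """Type-V preference: 0 below q, linear between q and p, 1 above p."""
    return np.clip((d - q) / (p - q), 0.0, 1.0)

def promethee_ii(G, w, q, p):
    """Net outranking flows for a GRC matrix G (alternatives x criteria)."""
    n = G.shape[0]
    pi = np.zeros((n, n))                    # aggregated preference matrix
    for a in range(n):
        for b in range(n):
            if a != b:
                d = G[a] - G[b]              # per-criterion differences
                pi[a, b] = np.dot(w, type_v(d, q, p))
    phi_plus = pi.sum(axis=1) / (n - 1)      # leaving flow
    phi_minus = pi.sum(axis=0) / (n - 1)     # entering flow
    return pi, phi_plus - phi_minus          # net flow phi

# BWM weights from Table 6, in Table 8 column order:
# SJR Score, Citations, Productivity, Q1 Count, Intl Collab.
w = np.array([0.3656, 0.2185, 0.1050, 0.2473, 0.0637])
G = np.array([                               # Table 8 excerpt (5 of 100)
    [1.000, 1.000, 0.893, 1.000, 0.674],     # UNIV-1
    [0.664, 0.803, 1.000, 0.615, 1.000],     # UNIV-4
    [0.692, 0.539, 0.759, 0.744, 0.600],     # UNIV-2
    [0.524, 0.474, 0.487, 0.622, 0.479],     # UNIV-3
    [0.3333, 0.3344, 0.3335, 0.3333, 0.3344],# UNIV-97
])
rng = 1.0 - 1.0 / 3.0                        # GRC value range (0.33, 1.00)
pi, phi = promethee_ii(G, w, q=0.10 * rng, p=0.60 * rng)
print(round(pi[0, 1], 4))                    # ~0.6165 = pi(UNIV-1, UNIV-4)
```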
The BGP framework provides a more robust ranking of scientometric performance under outlier-contamination scenarios. The robustness validation results presented in Table 11, Table 12 and Table 13 and Figure 8 provide strong empirical evidence that the integration of BWM, GRA, and PROMETHEE is well suited for scientometric-based evaluation. Across all contamination scenarios, BGP consistently achieves the highest average rank correlation (Spearman’s ρ = 0.9985), the lowest cumulative rank deviation (SRD = 98.7), the strongest top-tier stability (SR@5 = 0.99), and the smallest maximum rank displacement (Max ΔRank = 5.73) and rank inversion rate (RIR = 0.52), as reported in Table 11.
Compared with ARAS, MABAC, and TOPSIS, BGP demonstrates superior robustness across all evaluated metrics, indicating a substantially lower sensitivity to extreme deviations in key scientometric indicators. This robustness advantage is further reinforced by the SR@10 heatmap in Figure 8, which shows that BGP consistently emerges as the dominant method across all 30 contaminated scenarios, providing clear visual evidence of its ability to preserve elite-rank stability under severe perturbations.
Importantly, the observed performance differences are not attributable to random variation. The Friedman test confirms statistically significant differences among the compared methods, and the subsequent Nemenyi post hoc analysis substantiates the robustness superiority of the BGP framework relative to the benchmark approaches.
These findings demonstrate that the BGP framework offers integrated theoretical and empirical advancements for hybrid MCDM in scientometrics. Theoretically, it provides a structured response to core methodological challenges in the field. First, by employing BWM for expert-driven weighting, it decouples criterion importance from data variance, ensuring that weights reflect strategic priorities rather than statistical artefacts of heavy-tailed distributions. Second, the use of GRA with the distinguishing coefficient (ζ) serves as a robust normalization scheme, compressing outliers while preserving meaningful ordinal relationships, a critical advancement over variance-sensitive traditional methods. Third, it strengthens the application of non-compensatory logic: by applying PROMETHEE II's outranking mechanism only after securing stable weights and normalized data, the framework guarantees that criterion non-substitutability is not distorted by extreme values. Collectively, this sequenced integration (priority weighting, outlier-resilient transformation, non-compensatory ranking) extends MCDM theory by offering a principled model for constructing evaluative frameworks in data-noisy contexts.

Empirically, this theoretical robustness translates into interpretable stability for stakeholders, giving the validation metrics direct practical meaning. The high Spearman rank correlation provides researchers with confidence in longitudinal tracking, indicating that university performance trajectories remain consistent despite normal data fluctuations. For university administrators and policymakers, the low Sum of Ranking Differences (SRD) and minimal Rank Inversion Rate (RIR) are crucial: they signal that institutional rankings do not experience volatile shifts when indicators vary, protecting against misleading benchmarks and enabling stable strategic planning. Furthermore, the Stability Ratio (SR@k) offers decision-makers a reliable identification of elite performers, ensuring that funding or policy focus targets genuinely leading institutions. Thus, BGP moves beyond statistical validation to deliver a transparent, decision-ready framework in which theoretical soundness underpins practical reliability in research performance evaluation.

6. Conclusions

The implementation of the Best–Worst Method (BWM) for weighting produced consistency ratios below the threshold (CR ≤ 0.1), indicating that the pairwise judgements among criteria are consistent and that the resulting weights are valid. This outcome ensures that the structure of importance across the indicators comprising SJR Score, Citations, Productivity, Q1 Count, and International Collaboration accurately reflects rational priorities consistent with the characteristics of scientometric data. The integration of Grey Relational Analysis (GRA) within the BGP framework proved effective in reducing the influence of extreme values and scale heterogeneity among indicators. The GRA transformation produced a relational coefficient matrix with a stable range (0.33–1.00), resulting in a more balanced evaluation process that is not distorted by outliers. PROMETHEE II with the Type-V linear preference function, using indifference and preference thresholds set to 10% and 60% of each criterion's value range, limits compensatory effects and produces a more proportional ranking outcome that constrains overcompensation and reflects a genuine balance of scientific performance. As discussed in the Discussion section, the robustness analysis reveals statistically significant performance differences among the evaluated methods. These findings provide the empirical basis for interpreting the structural robustness and non-compensatory behaviour of the proposed BGP framework.
Although the BGP framework results demonstrate stable and promising performance, this study has several limitations. The publication dataset is derived solely from Scopus and ScimagoJR, which restricts the diversity of indicators to the characteristics of these two databases and may limit the representativeness of broader scientometric patterns. The university sample is also confined to Indonesian institutions, meaning that the findings primarily reflect the characteristics of the national research landscape and may not be fully generalizable to other regional or international contexts. Furthermore, although the GRA stage in this study adopts the commonly used distinguishing coefficient (ζ = 0.5), no dedicated sensitivity analysis on this parameter was conducted. Most GRA applications in prior studies employ ζ = 0.5 as a standard setting without examining its variation (Jamshaid et al., 2025; Zheng et al., 2025). However, the extent to which ζ affects ranking behaviour under different data distributions remains an open methodological question and warrants further investigation.
Future research may expand the dataset by incorporating indicators from additional bibliographic databases such as Web of Science, Dimensions, or OpenAlex to obtain a more comprehensive scientometric representation. Including universities from multiple countries is essential for assessing cross-regional generalizability, capturing variations in research performance, and further validating the robustness of the proposed BGP framework in diverse international contexts. Beyond expanding data sources, integrating cost-type integrity criteria, such as the number of publications in Scopus-discontinued journals or retracted articles, could strengthen the evaluative scope of the framework; these indicators may be incorporated through cost-oriented normalization so that higher values penalize overall performance in a systematic way. Future work may also examine the effect of varying the distinguishing coefficient (ζ) to evaluate whether alternative parameter settings improve the robustness of the GRC transformation, particularly under extreme or heavy-tailed distributions. Moreover, the applicability of the BGP framework should be examined in other decision-making domains to further assess its methodological robustness and generalizability. Comparative analyses involving alternative MCDM techniques may also provide additional insight into methodological refinement and integration.
Overall, the BGP framework provides a more stable and robust ranking. The model has the potential to be applied in scientometric-based university ranking systems and can support researchers, university leaders, and policymakers in formulating strategies to enhance research performance objectively and transparently.

Author Contributions

Conceptualization, D.K., R.G. and B.S.; methodology, D.K., R.G. and B.S.; software, D.K.; validation, D.K., R.G. and B.S.; formal analysis, D.K. and R.G.; investigation, R.G. and B.S.; resources, D.K.; data curation, D.K.; writing—original draft preparation, D.K., R.G. and B.S.; writing—review and editing, D.K.; visualization, D.K.; supervision, R.G. and B.S.; project administration, D.K., R.G. and B.S.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study involved a non-interventional Delphi process with expert participants and did not collect sensitive personal data; therefore, IRB approval was not required for this type of research.

Data Availability Statement

The datasets used in this study are available upon request from the corresponding author.

Acknowledgments

The authors express sincere gratitude to Universitas Diponegoro and Universitas Islam Sultan Agung for their support of this research through the provision of facilities and experimental resources.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Albadayneh, B. A., Alrawashdeh, A., Obeidat, N., Al-Dekah, A. M., Zghool, A. W., & Abdelrahman, M. (2024). Medical magnetic resonance imaging publications in Arab countries: A 25-year bibliometric analysis. Heliyon, 10(7), e28512. [Google Scholar] [CrossRef] [PubMed]
  2. Alrababah, S. A. A., & Gan, K. H. (2023). Effects of the hybrid CRITIC–VIKOR method on product aspect ranking in customer reviews. Applied Sciences, 13(16), 9176. [Google Scholar] [CrossRef]
  3. Anand, L., Gayathri, P., Karuveettil, V., & Anjali, M. (2024). Top 100 most cited economic evaluation papers of preventive oral health programmes: A bibliometric analysis. Journal of Oral Biology and Craniofacial Research, 14(6), 802–807. [Google Scholar] [CrossRef]
  4. Anculle-Arauco, V., Krüger-Malpartida, H., Arevalo-Flores, M., Correa-Cedeño, L., Mass, R., Hoppe, W., & Pedraz-Petrozzi, B. (2024). Content validation using Aiken methodology through expert judgment of the first Spanish version of the Eppendorf schizophrenia inventory (ESI) in Peru: A brief qualitative report. Spanish Journal of Psychiatry and Mental Health, 17(2), 110–113. [Google Scholar] [CrossRef] [PubMed]
  5. Andjelković, D., Stojić, G., Nikolić, N., Das, D. K., Subotić, M., & Stević, Ž. (2024). A novel data-envelopment analysis interval-valued fuzzy-rough-number multi-criteria decision-making (DEA-IFRN MCDM) model for determining the efficiency of road sections based on headway analysis. Mathematics, 12(7), 976. [Google Scholar] [CrossRef]
  6. Ayan, B., Abacıoğlu, S., & Basilio, M. P. (2023). A comprehensive review of the novel weighting methods for multi-criteria decision-making. Information, 14(5), 285. [Google Scholar] [CrossRef]
  7. Ayyildiz, E., Murat, M., Imamoglu, G., & Kose, Y. (2023). A novel hybrid MCDM approach to evaluate universities based on student perspective. Scientometrics, 128(1), 55–86. [Google Scholar] [CrossRef]
  8. Başaran, S., & Ighagbon, O. A. (2024). Enhanced FMEA methodology for evaluating mobile learning platforms using grey relational analysis and fuzzy AHP. Applied Sciences, 14(19), 8844. [Google Scholar] [CrossRef]
  9. Baydaş, M., Yılmaz, M., Jović, Ž., Stević, Ž., Özuyar, S. E. G., & Özçil, A. (2024). A comprehensive MCDM assessment for economic data: Success analysis of maximum normalization, CODAS, and fuzzy approaches. Financial Innovation, 10(1), 105. [Google Scholar] [CrossRef]
  10. Beiderbeck, D., Frevel, N., von der Gracht, H. A., Schmidt, S. L., & Schweitzer, V. M. (2021). The impact of COVID-19 on the European football ecosystem—A Delphi-based scenario analysis. Technological Forecasting and Social Change, 165, 120577. [Google Scholar] [CrossRef]
  11. Bornmann, L. (2024). Skewed distributions of scientists’ productivity: A research program for the empirical analysis. Scientometrics, 129(4), 2455–2468. [Google Scholar] [CrossRef]
  12. Bornmann, L., & Williams, R. (2020). An evaluation of percentile measures of citation impact, and a proposal for making them better. Scientometrics, 124(2), 1457–1478. [Google Scholar] [CrossRef]
  13. Brans, J. P., & Vincke, P. (1985). Note—A preference ranking organisation method. Management Science, 31(6), 647–656. [Google Scholar] [CrossRef]
  14. Chen, F., Bulgarova, B. A., & Kumar, R. (2025). Prioritizing generative artificial intelligence co-writing tools in newsrooms: A hybrid MCDM framework for transparency, stability, and editorial integrity. Mathematics, 13(23), 3791. [Google Scholar] [CrossRef]
  15. Clermont, M., Krolak, J., & Tunger, D. (2021). Does the citation period have any effect on the informative value of selected citation indicators in research evaluations? Scientometrics, 126(2), 1019–1047. [Google Scholar] [CrossRef]
  16. Coquelet, B., Dejaegere, G., & De Smet, Y. (2025). Evaluating the Promethee ii ranking quality. Algorithms, 18(10), 597. [Google Scholar] [CrossRef]
  17. Daraio, C., Di Leo, S., & Leydesdorff, L. (2023). A heuristic approach based on Leiden rankings to identify outliers: Evidence from Italian universities in the European landscape. Scientometrics, 128(1), 483–510. [Google Scholar] [CrossRef]
  18. Demeter, M., Jele, A., & Major, Z. B. (2022). The model of maximum productivity for research universities SciVal author ranks, productivity, university rankings, and their implications. Scientometrics, 127(8), 4335–4361. [Google Scholar] [CrossRef]
  19. Deng, J., Zhan, J., & Wu, W. Z. (2022). A ranking method with a preference relation based on the PROMETHEE method in incomplete multi-scale information systems. Information Sciences, 608, 1261–1282. [Google Scholar] [CrossRef]
  20. de Vico, G., Alves, C. J. R., Sellitto, M. A., & da Silva, D. O. (2025). Data-driven performance evaluation and behavior alignment in port operations: A multivariate analysis of strategic indicators. Administrative Sciences, 15(9), 345. [Google Scholar] [CrossRef]
  21. Doğan, G., & Al, U. (2019). Is it possible to rank universities using fewer indicators? A study on five international university rankings. Aslib Journal of Information Management, 71(1), 18–37. [Google Scholar] [CrossRef]
  22. Elevli, S., & Elevli, B. (2024). A study of entrepreneur and innovative university index by entropy-based grey relational analysis and PROMETHEE. Scientometrics, 129(6), 3193–3223. [Google Scholar] [CrossRef]
  23. El Gibari, S., Gómez, T., & Ruiz, F. (2018). Evaluating university performance using reference point based composite indicators. Journal of Informetrics, 12(4), 1235–1250. [Google Scholar] [CrossRef]
  24. Erbey, A., Fidan, Ü., & Gündüz, C. (2025). A robust hybrid weighting scheme based on IQRBOW and entropy for MCDM: Stability and advantage criteria in the VIKOR framework. Entropy, 27(8), 867. [Google Scholar] [CrossRef]
  25. Esangbedo, M. O., & Wei, J. (2023). Grey hybrid normalization with period based entropy weighting and relational analysis for cities rankings. Scientific Reports, 13(1), 13797. [Google Scholar] [CrossRef]
  26. Ezell, B., Lynch, C. J., & Hester, P. T. (2021). Methods for weighting decisions to assist modelers and decision analysists: A review of ratio assignment and approximate techniques. Applied Sciences, 11(21), 10397. [Google Scholar] [CrossRef]
  27. Farnè, M., & Vouldis, A. (2024). ROBOUT: A conditional outlier detection methodology for high-dimensional data. Statistical Papers, 65(4), 2489–2525. [Google Scholar] [CrossRef]
  28. Gagolewski, M., Żogała-Siudem, B., Siudem, G., & Cena, A. (2022). Ockham’s index of citation impact. Scientometrics, 127(5), 2829–2845. [Google Scholar] [CrossRef]
  29. Gao, L., Tian, T., & Wen, L. (2025). Study on outlier detection algorithm based on tightest neighbors. Expert Systems with Applications, 290, 128385. [Google Scholar] [CrossRef]
  30. Goodarzi, F., Abdollahzadeh, V., & Zeinalnezhad, M. (2022). An integrated multi-criteria decision-making and multi-objective optimization framework for green supplier evaluation and optimal order allocation under uncertainty. Decision Analytics Journal, 4, 100087. [Google Scholar] [CrossRef]
  31. Gul, M., Celik, E., Gumus, A. T., & Guneri, A. F. (2018). A fuzzy logic based PROMETHEE method for material selection problems. Beni-Suef University Journal of Basic and Applied Sciences, 7(1), 68–79. [Google Scholar] [CrossRef]
  32. Gul, M., & Yucesan, M. (2022). Performance evaluation of Turkish universities by an integrated Bayesian BWM-TOPSIS model. Socio-Economic Planning Sciences, 80, 101173. [Google Scholar] [CrossRef]
  33. Guo, Z., Liu, J., Liu, X., Meng, Z., Pu, M., Wu, H., Yan, X., Yang, G., Zhang, X., Chen, C., & Chen, F. (2024). An integrated MCDM model with enhanced decision support in transport safety using machine learning optimization. Knowledge-Based Systems, 301, 112286. [Google Scholar] [CrossRef]
  34. Hasson, F., Keeney, S., & McKenna, H. (2025). Revisiting the Delphi technique—Research thinking and practice: A discussion paper. International Journal of Nursing Studies, 168, 105119. [Google Scholar] [CrossRef] [PubMed]
  35. Hesselink, G., Verhage, R., van der Horst, I. C. C., van der Hoeven, H., & Zegers, M. (2024). Consensus-based indicators for evaluating and improving the quality of regional collaborative networks of intensive care units: Results of a nationwide Delphi study. Journal of Critical Care, 79, 154440. [Google Scholar] [CrossRef]
  36. Hou, Z., Yan, R., & Wang, S. (2022). On the k-means clustering model for performance enhancement of port state control. Journal of Marine Science and Engineering, 10(11), 1608. [Google Scholar] [CrossRef]
  37. Ishizaka, A., Pickernell, D., Huang, S., & Senyard, J. M. (2020). Examining knowledge transfer activities in UK universities: Advocating a PROMETHEE-based approach. International Journal of Entrepreneurial Behaviour and Research, 26(6), 1389–1409. [Google Scholar] [CrossRef]
  38. Ishizaka, A., & Resce, G. (2021). Best-Worst PROMETHEE method for evaluating school performance in the OECD’s PISA project. Socio-Economic Planning Sciences, 73, 100799. [Google Scholar] [CrossRef]
  39. Jamshaid, H., Khan, A. A., Mishra, R. K., Ahmad, N., Chandan, V., Kolář, V., & Müller, M. (2025). Taguchi grey relational analysis (GRA) based multi response optimization of flammability, comfort and mechanical properties in station suits. Heliyon, 11(4), e42508. [Google Scholar] [CrossRef]
  40. Jamwal, A., Agrawal, R., Sharma, M., & Kumar, V. (2021). Review on multi-criteria decision analysis in sustainable manufacturing decision making. International Journal of Sustainable Engineering, 14(3), 202–225. [Google Scholar] [CrossRef]
  41. Ju-Long, D. (1982). Control problems of grey systems. Systems & Control Letters, 1(5), 288–294. [Google Scholar] [CrossRef]
  42. Keenan, P. (2024). A scientometric analysis of multicriteria decision-making research. Journal of Decision Systems, 33(sup1), 78–88. [Google Scholar] [CrossRef]
  43. Kheybari, S., Javdanmehr, M., Rezaie, F. M., & Rezaei, J. (2021). Corn cultivation location selection for bioethanol production: An application of BWM and extended PROMETHEE II. Energy, 228, 120593. [Google Scholar] [CrossRef]
  44. Li, H. (2024). Evaluation system of vocational education construction in China based on linked TOPSIS analysis. Heliyon, 10(21), e39369. [Google Scholar] [CrossRef]
  45. Li, P., Xu, Z., Wei, C., Bai, Q., & Liu, J. (2022). A novel PROMETHEE method based on GRA-DEMATEL for PLTSs and its application in selecting renewable energies. Information Sciences, 589, 142–161. [Google Scholar] [CrossRef]
  46. Li, X., Han, Z., Yazdi, M., & Chen, G. (2022). A CRITIC-VIKOR based robust approach to support risk management of subsea pipelines. Applied Ocean Research, 124, 103187. [Google Scholar] [CrossRef]
  47. Li, Z., & Zhang, L. (2023). An ensemble outlier detection method based on information entropy-weighted subspaces for high-dimensional data. Entropy, 25(8), 1185. [Google Scholar] [CrossRef]
  48. Liang, F., Brunelli, M., & Rezaei, J. (2020). Consistency issues in the best worst method: Measurements and thresholds. Omega, 96, 102175. [Google Scholar] [CrossRef]
  49. Limaymanta, C. H., Quiroz-de-García, R., Rivas-Villena, J. A., Rojas-Arroyo, A., & Gregorio-Chaviano, O. (2022). Relationship between collaboration and normalized scientific impact in South American public universities. Scientometrics, 127(11), 6391–6411. [Google Scholar] [CrossRef]
  50. Liu, N., & Xu, Z. (2021). An overview of ARAS method: Theory development, application extension, and future challenge. International Journal of Intelligent Systems, 36(7), 3524–3565. [Google Scholar] [CrossRef]
  51. Liu, X. Z., & Fang, H. (2020). A comparison among citation-based journal indicators and their relative changes with time. Journal of Informetrics, 14(1), 101007. [Google Scholar] [CrossRef]
  52. Lovakov, A., & Teixeira da Silva, J. A. (2025). Scientometric indicators in research evaluation and research misconduct: Analysis of the Russian university excellence initiative. Scientometrics, 130(3), 1813–1829. [Google Scholar] [CrossRef]
  53. Lukić, N., & Tumbas, P. (2019). Indicators of global university rankings: The theoretical issues. Strategic Management, 24(3), 43–54. [Google Scholar] [CrossRef]
  54. Mahmoudi, A., Javed, S. A., Liu, S., & Deng, X. (2020). Distinguishing coefficient driven sensitivity analysis of gra model for intelligent decisions: Application in project management. Technological and Economic Development of Economy, 26(3), 621–641. [Google Scholar] [CrossRef]
  55. Makki, A. A., & Abdulaal, R. M. S. (2023). A hybrid MCDM approach based on fuzzy MEREC-G and fuzzy RATMI. Mathematics, 11(17), 3773. [Google Scholar] [CrossRef]
  56. Malekpoor, H., Chalvatzis, K., Mishra, N., Mehlawat, M. K., Zafirakis, D., & Song, M. (2018). Integrated grey relational analysis and multi objective grey linear programming for sustainable electricity generation planning. Annals of Operations Research, 269(1–2), 475–503. [Google Scholar] [CrossRef]
  57. Maral, M. (2024). Examining the research performance of universities with multi-criteria decision-making methods. SAGE Open, 14(4), 21582440241300542. [Google Scholar] [CrossRef]
  58. Mitrović, I., Mišić, M., & Protić, J. (2023). Exploring high scientific productivity in international co-authorship of a small developing country based on collaboration patterns. Journal of Big Data, 10(1), 64. [Google Scholar] [CrossRef] [PubMed]
  59. Moslem, S., Farooq, D., Ghorbanzadeh, O., & Blaschke, T. (2020). Application of the AHP-BWM model for evaluating driver behavior factors related to road safety: A case study for Budapest. Symmetry, 12(2), 243. [Google Scholar] [CrossRef]
  60. Mukhametzyanov, I., & Pamucar, D. (2018). A sensitivity analysisin mcdm problems: A statistical approach. Decision Making: Applications in Management and Engineering, 1(2), 51–80. [Google Scholar] [CrossRef]
  61. Nguyen, H. K., & Nhieu, N. L. (2025). Comparative sustainability efficiency of G7 and BRICS economies: A DNMEREC-DNMARCOS approach. Mathematics, 13(22), 3640. [Google Scholar] [CrossRef]
  62. Olivero, M. A., Bertolino, A., Dominguez-Mayo, F. J., Matteucci, I., & Escalona, M. J. (2022). A Delphi study to recognize and assess systems of systems vulnerabilities. Information and Software Technology, 146, 106874. [Google Scholar] [CrossRef]
  63. Oubahman, L., & Duleba, S. (2024). Fuzzy PROMETHEE model for public transport mode choice analysis. Evolving Systems, 15(2), 285–302. [Google Scholar] [CrossRef] [PubMed]
  64. Pala, O. (2024). Assessment of the social progress on European Union by logarithmic decomposition of criteria importance. Expert Systems with Applications, 238, 121846. [Google Scholar] [CrossRef]
  65. Papapostolou, A., Karakosta, C., Mexis, F. D., Andreoulaki, I., & Psarras, J. (2024). A fuzzy PROMETHEE method for evaluating strategies towards a cross-country renewable energy cooperation: The cases of Egypt and Morocco. Energies, 17(19), 4904. [Google Scholar] [CrossRef]
  66. Paradowski, B., Wątróbski, J., & Sałabun, W. (2025). Novel coefficients for improved robustness in multi-criteria decision analysis. Artificial Intelligence Review, 58(10), 298. [Google Scholar] [CrossRef]
  67. Paul, B., & Saha, I. (2018). Research rating: Some technicalities. Medical Journal Armed Forces India, 78, S24–S30. [Google Scholar] [CrossRef]
  68. Paulillo, A., Kim, A., Mutel, C., Striolo, A., Bauer, C., & Lettieri, P. (2021). Influential parameters for estimating the environmental impacts of geothermal power: A global sensitivity analysis study. Cleaner Environmental Systems, 3, 100054. [Google Scholar] [CrossRef]
  69. Pinochet, L. H. C., Moreira, M. Â. L., Fávero, L. P., dos Santos, M., & Pardim, V. I. (2023). Collaborative work alternatives with ChatGPT based on evaluation criteria for its use in higher education: Application of the PROMETHEE-SAPEVO-M1 method. Procedia Computer Science, 221, 177–184. [Google Scholar] [CrossRef]
  70. Pohl, E., & Geldermann, J. (2024). PROMETHEE-Cloud: A web app to support multi-criteria decisions. EURO Journal on Decision Processes, 12, 100053. [Google Scholar] [CrossRef]
  71. Pohl, H. (2024). Using citation-based indicators to compare bilateral research collaborations. Scientometrics, 129(8), 4751–4770. [Google Scholar] [CrossRef]
  72. Potter, R. W. K., Szomszor, M., & Adams, J. (2022). Comparing standard, collaboration and fractional CNCI at the institutional level: Consequences for performance evaluation. Scientometrics, 127(12), 7435–7448. [Google Scholar] [CrossRef]
  73. Pulvera, E. V. J., & Lao, D. M. (2024, November 21–23). Enhancing deep learning-based breast cancer classification in mammograms: A multi-convolutional neural network with feature concatenation, and an applied comparison of best-worst multi-attribute decision-making and mutual information feature selections. 2024 9th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS) (pp. 511–518), Okinawa, Japan. [Google Scholar] [CrossRef]
  74. Raed, L., Mahdi, I., Ibrahim, H. M. H., Tolba, E. R., & Ebid, A. M. (2025). Innovative BWM–TOPSIS-based approach to determine the optimum delivery method for offshore projects. Scientific Reports, 15(1), 13340. [Google Scholar] [CrossRef] [PubMed]
  75. Rahman, S., Alali, A. S., Baro, N., Ali, S., & Kakati, P. (2024). A novel TOPSIS framework for multi-criteria decision making with random hypergraphs: Enhancing decision processes. Symmetry, 16(12), 1602. [Google Scholar] [CrossRef]
  76. Rezaei, J. (2015). Best-worst multi-criteria decision-making method. Omega, 53, 49–57. [Google Scholar] [CrossRef]
  77. Rezaei, J. (2016). Best-worst multi-criteria decision-making method: Some properties and a linear model. Omega, 64, 126–130. [Google Scholar] [CrossRef]
  78. Rohmer, J. (2025). Importance ranking of data and model uncertainties in quantile regression forest-based spatial predictions when data are sparse, imprecise and clustered. Ecological Informatics, 92, 103459. [Google Scholar] [CrossRef]
  79. Schlögl, C., Stock, W. G., & Reichmann, G. (2025). Scientometric evaluation of research institutions: Identifying the appropriate dimensions and attributes for assessment. Journal of Information Science Theory and Practice, 13(2), 49–68. [Google Scholar] [CrossRef]
  80. Schmoch, U. (2020). Mean values of skewed distributions in the bibliometric assessment of research units. Scientometrics, 125(2), 925–935. [Google Scholar] [CrossRef]
  81. Shang, Z. (2023). Use of Delphi in health sciences research: A narrative review. Medicine, 102(7), E32829. [Google Scholar] [CrossRef]
  82. Shi, H., Huang, L., Li, K., Wang, X. H., & Liu, H. C. (2022). An extended multi-attributive border approximation area comparison method for emergency decision making with complex linguistic information. Mathematics, 10(19), 3437. [Google Scholar] [CrossRef]
  83. Szluka, P., Csajbók, E., & Győrffy, B. (2023). Relationship between bibliometric indicators and university ranking positions. Scientific Reports, 13(1), 14193. [Google Scholar] [CrossRef] [PubMed]
  84. Tamak, S., Eslami, Y., & Da Cunha, C. (2025). Validation of multidimensional performance assessment models using hierarchical clustering. Expert Systems with Applications, 290, 128446. [Google Scholar] [CrossRef]
  85. Torkayesh, A. E., Tirkolaee, E. B., Bahrini, A., Pamucar, D., & Khakbaz, A. (2023). A systematic literature review of MABAC method and applications: An outlook for sustainability and circularity. Informatica, 34(2), 415–448. [Google Scholar] [CrossRef]
  86. Tóth, B., Motahari-Nezhad, H., Horseman, N., Berek, L., Kovács, L., Hölgyesi, Á., Péntek, M., Mirjalili, S., Gulácsi, L., & Zrubka, Z. (2024). Ranking resilience: Assessing the impact of scientific performance and the expansion of the times higher education word university rankings on the position of Czech, Hungarian, Polish, and Slovak universities. Scientometrics, 129(3), 1739–1770. [Google Scholar] [CrossRef]
  87. Triggle, C. R., MacDonald, R., Triggle, D. J., & Grierson, D. (2022). Requiem for impact factors and high publication charges. Accountability in Research, 29(3), 133–164. [Google Scholar] [CrossRef] [PubMed]
  88. Trung Do, D. (2024). Assessing the impact of criterion weights on the ranking of the top ten universities in vietnam. Engineering, Technology and Applied Science Research, 14(4), 14899–14903. [Google Scholar] [CrossRef]
  89. Turskis, Z., & Keršulienė, V. (2024). SHARDA–ARAS: A methodology for prioritising project managers in sustainable development. Mathematics, 12(2), 219. [Google Scholar] [CrossRef]
  90. Wachowicz, T., & Roszkowska, E. (2025). Enhancing TOPSIS to evaluate negotiation offers with subjectively defined reference points. Group Decision and Negotiation, 34, 715–749. [Google Scholar] [CrossRef]
  91. Watrianthos, R., Ritonga, W. A., Rengganis, A., Wanto, A., & Isa Indrawan, M. (2021). Implementation of PROMETHEE-GAIA method for lecturer performance evaluation. Journal of Physics: Conference Series, 1933(1), 012067. [Google Scholar] [CrossRef]
  92. Wątróbski, J. (2023). Temporal PROMETHEE II—New multi-criteria approach to sustainable management of alternative fuels consumption. Journal of Cleaner Production, 413, 137445. [Google Scholar] [CrossRef]
  93. Wu, H., Han, X., Yang, Y., Hu, A., & Li, Y. (2025). A contribution-driven weighted grey relational analysis model and its application in identifying the drivers of carbon emissions. Expert Systems with Applications, 287, 128039. [Google Scholar] [CrossRef]
  94. Wulf, C., Zapp, P., Schreiber, A., & Kuckshinrichs, W. (2021). Setting thresholds to define indifferences and preferences in promethee for life cycle sustainability assessment of European hydrogen production. Sustainability, 13(13), 7009. [Google Scholar] [CrossRef]
  95. Yedjour, D., Yedjour, H., Amri, M. B., & Senouci, A. (2024). Rule extraction based on PROMETHEE-assisted multi-objective genetic algorithm for generating interpretable neural networks. Applied Soft Computing, 151, 111160. [Google Scholar] [CrossRef]
  96. Yu, T., Tang, Y., Cui, H., & Kang, B. (2025). A novel BWM-based conflict management method for interval-valued belief structure. Information Sciences, 721, 122617. [Google Scholar] [CrossRef]
  97. Yudhoyono, A. H., Sukoco, B. M., Maharani, I. A. K., Putra, I. K., & Suhariadi, F. (2025). Bridging the gap: Indonesia’s research trajectory and national development through a scientometric analysis using SciVal. Journal of Open Innovation: Technology, Market, and Complexity, 11(1), 100505. [Google Scholar] [CrossRef]
  98. Zhang, C., Jiang, N., Su, T., Chen, J., Streimikiene, D., & Balezentis, T. (2022). Spreading knowledge and technology: Research efficiency at universities based on the three-stage MCDM-NRSDEA method with bootstrapping. Technology in Society, 68, 101915. [Google Scholar] [CrossRef]
  99. Zhang, T., Shi, J., & Situ, L. (2021). The correlation between author-editorial cooperation and the author’s publications in journals. Journal of Informetrics, 15(1), 101123. [Google Scholar] [CrossRef]
  100. Zheng, K., Fang, J., Li, J., Shi, H., Xu, Y., Li, R., Xie, R., & Cai, G. (2025). Robust grey relational analysis-based accuracy evaluation method. Applied Sciences, 15(9), 4926. [Google Scholar] [CrossRef]
  101. Zhu, B., Wang, T., Liu, G., & Zhou, C. (2024). Revealing dynamic goals for university’s sustainable development with a coupling exploration of SDGs. Scientific Reports, 14(1), 22799. [Google Scholar] [CrossRef]
  102. Ziemba, P. (2022). Application framework of multi-criteria methods in sustainability assessment. Energies, 15(23), 9201. [Google Scholar] [CrossRef]
  103. Zoraghi, N., Amiri, M., Talebi, G., & Zowghi, M. (2013). A fuzzy MCDM model with objective and subjective weights for evaluating service quality in hotel industries. Journal of Industrial Engineering International, 9, 38. [Google Scholar] [CrossRef]
  104. Živković, Ž., Nikolić, D., Savić, M., Djordjević, P., & Mihajlović, I. (2017). Prioritizing strategic goals in higher education organizations by using a SWOT–PROMETHEE/GAIA–GDSS model. Group Decision and Negotiation, 26(4), 829–846. [Google Scholar] [CrossRef]
Figure 1. Proposed BGP methodology.
Figure 2. Criteria weights distribution.
Figure 3. Raw data transformation to grey relational coefficient.
Figure 4. Aggregated preference matrix.
Figure 5. Full ranking of the BGP framework.
Figure 6. Alternatives performance cluster.
Figure 7. Sensitivity analysis results.
Figure 8. Results of comparative analysis across methods (↑ Higher is better; ↓ Lower is better).
Table 1. Scientometric indicators.
No | Indicator | Category | Definition
1 | Number of Publications (Productivity) | Productivity | Total documents published (T. Zhang et al., 2021).
2 | Number of Citations (Citations) | Citations | Total citations received by all publications (T. Zhang et al., 2021; Clermont et al., 2021).
3 | International Collaboration | Networking | Publications authored by researchers from two or more different countries (Mitrović et al., 2023).
4 | h-index | Citations | The h-index value, where h publications have received ≥ h citations each (Paul & Saha, 2018).
5 | g-index | Citations | Assesses research impact with emphasis on highly cited articles (Paul & Saha, 2018).
6 | Category Normalized Citation Impact (CNCI) | Citations | Compares a paper's citation impact to the global average in the same field and year (Potter et al., 2022).
7 | Field-Weighted Citation Impact (FWCI) | Citations | Indicates citation impact normalized to global field performance (H. Pohl, 2024).
8 | Publications in Top Journal (Q1 Count) | Quality | Total publications in Q1 journals (Tóth et al., 2024).
9 | Publications in Top Cited (top 10%) | Citations | The percentage of publications ranked within the top 10% most cited in their field (Bornmann & Williams, 2020).
10 | Number of Cited Publications | Citations | Number of publications with ≥1 citation (Albadayneh et al., 2024).
11 | Percentage of Cited Publications | Citations | The proportion of papers within the global top 10% by citations (Anand et al., 2024).
12 | Journal Impact Factor (JIF) | Impact Metrics | The average two-year citation rate per journal article (Triggle et al., 2022).
13 | SCImago Journal Rank (SJR) Score | Impact Metrics | A journal prestige index weighted by citation network connectivity (Limaymanta et al., 2022).
14 | Source Normalized Impact per Paper (SNIP) | Impact Metrics | Field-normalized citation impact based on disciplinary citation patterns (X. Z. Liu & Fang, 2020).
Table 2. Consistency index.
a_BW | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
CI | 0.00 | 0.44 | 1.00 | 1.63 | 2.30 | 3.00 | 3.73 | 4.47 | 5.23
Source: (Rezaei, 2015).
Table 3. Criteria for evaluation.
Criteria | Median | IQR | Aiken’s V | Validity
Productivity | 8 | 0.5 | 0.839 | Valid
Citations | 8 | 1 | 0.821 | Valid
International Collaboration | 8 | 1 | 0.929 | Valid
Q1 Count | 9 | 1 | 0.946 | Valid
SJR Score | 9 | 0.5 | 0.964 | Valid
Table 4. BO and OW comparison matrices.
Expert | Best Criteria | Worst Criteria | SJR Score a | Intl Collab a | Citations a | Q1 Count a | Productivity a
1 | SJR Score | Intl Collab | 1/8 | 8/1 | 3/4 | 2/6 | 4/3
2 | SJR Score | Intl Collab | 1/7 | 7/1 | 4/4 | 3/5 | 5/3
3 | Q1 Count | Intl Collab | 2/5 | 7/1 | 3/4 | 1/6 | 4/3
4 | SJR Score | Intl Collab | 1/9 | 9/1 | 5/4 | 4/5 | 6/3
5 | Citations | Intl Collab | 3/4 | 8/1 | 1/7 | 4/3 | 5/4
6 | SJR Score | Productivity | 1/7 | 5/3 | 3/4 | 4/5 | 7/1
7 | Q1 Count | Intl Collab | 3/4 | 8/1 | 2/5 | 1/6 | 5/3
a Expressed as BO/OW.
Table 5. Consistency analysis results.
Expert | Best Criteria | Worst Criteria | a_BW | CI_Max | ξ* | CR | Remark
1 | SJR Score | Intl Collab | 8 | 4.47 | 0.0480 | 0.0107 | Consistent
2 | SJR Score | Intl Collab | 7 | 3.73 | 0.0910 | 0.0244 | Consistent
3 | Q1 Count | Intl Collab | 7 | 3.73 | 0.0694 | 0.0186 | Consistent
4 | SJR Score | Intl Collab | 9 | 5.23 | 0.0933 | 0.0178 | Consistent
5 | Citations | Intl Collab | 8 | 4.47 | 0.1018 | 0.0228 | Consistent
6 | SJR Score | Productivity | 7 | 3.73 | 0.1138 | 0.0305 | Consistent
7 | Q1 Count | Intl Collab | 8 | 4.47 | 0.0750 | 0.0168 | Consistent
Table 6. Criteria weights.
No | Criteria | Exp1 | Exp2 | Exp3 | Exp4 | Exp5 | Exp6 | Exp7 | Weight
1 | SJR Score | 0.4320 | 0.4889 | 0.2428 | 0.5515 | 0.1957 | 0.4813 | 0.1667 | 0.3656
2 | International Collaboration | 0.0480 | 0.0569 | 0.0578 | 0.0509 | 0.0548 | 0.1190 | 0.0583 | 0.0637
3 | Citations | 0.1600 | 0.1450 | 0.1618 | 0.1290 | 0.4853 | 0.1984 | 0.2500 | 0.2185
4 | Q1 Count | 0.2400 | 0.1933 | 0.4162 | 0.1612 | 0.1468 | 0.1488 | 0.4250 | 0.2473
5 | Productivity | 0.1200 | 0.1160 | 0.1214 | 0.1075 | 0.1174 | 0.0525 | 0.1000 | 0.1050
Table 7. Normalized performance matrix.
Alternative | SJR Score | Citations | Productivity | Q1 Count | Intl Collab
UNIV-1 | 1.000 | 1.000 | 0.940 | 1.000 | 0.758
UNIV-4 | 0.748 | 0.878 | 1.000 | 0.687 | 1.000
UNIV-2 | 0.777 | 0.572 | 0.841 | 0.828 | 0.666
UNIV-3 | 0.545 | 0.445 | 0.473 | 0.696 | 0.456
UNIV-97 | 0.000 | 0.0047 | 0.0009 | 0.0000 | 0.0046
Table 8. Grey relational coefficient.
Alternative | SJR Score | Citations | Productivity | Q1 Count | Intl Collab
UNIV-1 | 1.000 | 1.000 | 0.893 | 1.000 | 0.674
UNIV-4 | 0.664 | 0.803 | 1.000 | 0.615 | 1.000
UNIV-2 | 0.692 | 0.539 | 0.759 | 0.744 | 0.600
UNIV-3 | 0.524 | 0.474 | 0.487 | 0.622 | 0.479
UNIV-97 | 0.3333 | 0.3344 | 0.3335 | 0.3333 | 0.3344
Table 9. Representative aggregated preference matrix.
Aᵢ (compared to) | U-1 | U-4 | U-2 | U-9 | U-49 | U-42 | U-51 | U-89 | U-97 | U-100
UNIV-1 | - | 0.6165 | 0.6467 | 0.9458 | 0.9871 | 0.9873 | 0.9872 | 0.9886 | 0.9884 | 0.9886
UNIV-4 | 0.0622 | - | 0.2484 | 0.4600 | 0.8238 | 0.8239 | 0.8250 | 0.8361 | 0.8365 | 0.8356
UNIV-2 | 0.0000 | 0.0465 | - | 0.3016 | 0.7884 | 0.7889 | 0.7876 | 0.8006 | 0.8003 | 0.8010
UNIV-9 | 0.0000 | 0.0000 | 0.0000 | - | 0.2938 | 0.2940 | 0.2940 | 0.3145 | 0.3145 | 0.3146
UNIV-49 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | - | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
UNIV-42 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | - | 0.0000 | 0.0000 | 0.0000 | 0.0000
UNIV-51 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | - | 0.0000 | 0.0000 | 0.0000
UNIV-89 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | - | 0.0000 | 0.0000
UNIV-97 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | - | 0.0000
UNIV-100 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -
Table 10. Top-10 and bottom-10 ranking.
Ranked Top-10 of 100:
Rank | University | ϕ+ | ϕ− | ϕ
1 | UNIV-1 | 0.977659 | 0.000629 | 0.977039
2 | UNIV-4 | 0.789071 | 0.006698 | 0.782374
3 | UNIV-2 | 0.738242 | 0.009042 | 0.7292
4 | UNIV-3 | 0.33234 | 0.016097 | 0.316244
5 | UNIV-9 | 0.259778 | 0.017457 | 0.24232
6 | UNIV-5 | 0.086428 | 0.02222 | 0.064207
7 | UNIV-10 | 0.038418 | 0.02434 | 0.014078
8 | UNIV-8 | 0.033586 | 0.024709 | 0.008877
9 | UNIV-12 | 0.027195 | 0.025454 | 0.001742
10 | UNIV-6 | 0.022283 | 0.025071 | −0.00279
Ranked Bottom-10 of 100:
Rank | University | ϕ+ | ϕ− | ϕ
91 | UNIV-95 | 0 | 0.037027 | −0.03703
92 | UNIV-73 | 0 | 0.037032 | −0.03703
93 | UNIV-94 | 0 | 0.037051 | −0.03705
94 | UNIV-99 | 0 | 0.037066 | −0.03707
95 | UNIV-87 | 0 | 0.037169 | −0.03717
96 | UNIV-98 | 0 | 0.037233 | −0.03723
97 | UNIV-89 | 0 | 0.037292 | −0.03729
98 | UNIV-97 | 0 | 0.037298 | −0.0373
99 | UNIV-96 | 0 | 0.037301 | −0.0373
100 | UNIV-100 | 0 | 0.037312 | −0.03731
Table 11. Robustness performance.
Method | Spearman’s ↑ | SRD ↓ | SR@5 ↑ | Max ΔRank ↓ | RIR ↓
BGP | 0.9985 | 98.7 | 0.99 | 5.73 | 0.52
ARAS | 0.9966 | 134.6 | 0.94 | 9.70 | 0.60
MABAC | 0.9973 | 142.0 | 0.97 | 7.23 | 0.67
TOPSIS | 0.9950 | 183.1 | 0.91 | 9.60 | 0.70
↑ Higher is better, ↓ Lower is better.
Table 12. Friedman test.
Statistic | χ2 | df | p-Value | N (Blocks) | CD0.05 | Result
Value | 82.03 | 3 | 1.13 × 10⁻¹⁷ | 30 | 1.211 | Significant
Table 13. Post hoc Nemenyi.
Method i | Method j | Avg Rank Difference | CD0.05 | Result
BGP | ARAS | 1.612 | 1.211 | Significant
BGP | MABAC | 1.700 | 1.211 | Significant
BGP | TOPSIS | 2.967 | 1.211 | Significant
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
