WISCA: A Consensus-Based Approach to Harmonizing Interpretability in Tabular Datasets

Banegas-Luna, Antonio Jesús; Pérez-Sánchez, Horacio; Martínez-Cortés, Carlos

doi:10.3390/make8040097


by Antonio Jesús Banegas-Luna ¹,*, Horacio Pérez-Sánchez ² and Carlos Martínez-Cortés ²

¹ Departamento de Tecnologías de la Información y Telecomunicaciones, Centro Universitario de la Defensa, Academia General del Aire, Universidad Politécnica de Cartagena, C/Coronel López Peña s/n, 30729 San Javier, Spain

² Structural Bioinformatics and High Performance Computing (BIO-HPC), Universidad Católica de Murcia (UCAM), Avd. de los Jerónimos, 30107 Murcia, Spain

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(4), 97; https://doi.org/10.3390/make8040097

Submission received: 19 February 2026 / Revised: 31 March 2026 / Accepted: 9 April 2026 / Published: 10 April 2026


Abstract

While predictive accuracy is often prioritized in machine learning (ML) models, interpretability remains essential in scientific and high-stakes domains. However, diverse interpretability algorithms frequently yield conflicting explanations, highlighting the need for consensus to harmonize results. In this study, six ML models were trained on six synthetic datasets with known ground truths, utilizing various model-agnostic interpretability techniques, as well as gradient-based and counterfactual-based explainers. Consensus explanations were generated using established methods and a novel approach: WISCA (Weighted Scaled Consensus Attributions), which integrates class probability and normalized attributions. WISCA consistently aligned with the most reliable individual method, underscoring the value of robust consensus strategies in improving explanation reliability.

Keywords:

explainable artificial intelligence; interpretability; consensus function; model explanation; synthetic datasets


1. Introduction

Machine learning (ML), a key area of artificial intelligence, enables computers to learn autonomously from data. The robust statistical foundation of ML has facilitated its adoption for intricate tasks in various scientific domains, from biology [1,2] to drug discovery [3,4,5], meteorology [6,7,8], and, notably, medicine [9,10,11], due to its precision in unveiling patterns within complex datasets.

Although ML models are typically evaluated on predictive accuracy, transparency in their decision-making is increasingly demanded. Model opacity is particularly problematic in sectors where understanding the model's logic is as critical as the outcomes themselves, such as healthcare [12,13,14] and financial services [15,16,17]. This need for clarity has driven the emergence of eXplainable Artificial Intelligence (XAI) [18,19], an initiative to reveal the rationale behind model predictions. In addition, the ability to explain predictions is becoming a legal and ethical requirement in some jurisdictions, particularly in high-stakes domains such as healthcare and credit scoring [20].

Interpretability algorithms, designed to elucidate ML models, are classified as global or local according to their explanatory scope, and as model-specific or model-agnostic according to the range of models they support [21,22]. Just as model accuracy is typically assessed using metrics such as AUC, Precision, Recall, R², MSE, and MAE [23,24], the quality of explanations is evaluated through criteria such as correctness, consistency, coherence, and confidence [25]. These criteria determine the effectiveness and reliability of the explanations provided by AI systems. Researchers have developed different scales and metrics to measure the quality of explanations, such as the Explanation Satisfaction Scale, the Explanation Goodness Checklist, and the System Causability Scale [26]. The Explanation Satisfaction Scale focuses on users’ evaluations of explanations, whereas researchers use the Explanation Goodness Checklist to independently assess their quality. The System Causability Scale, in turn, measures the quality of explanations in explainable AI systems, considering aspects such as timeliness, detail, completeness, understandability, and learnability of the explanation [27].

Despite being widely adopted by the community, these metrics usually refer to subjective concepts that are difficult to quantify, although progress is being made in developing objective metrics to assess the quality of explanations generated by interpretability algorithms [28,29]. Moreover, the utility of XAI is often undercut by the inconsistent explanations these algorithms generate, leading to confusion and diminishing trust [24,30,31,32]. This disagreement is particularly pronounced when different interpretability methods are applied to the same prediction and produce contradictory results, leaving end users without a clear understanding of the behavior of the model [31]. The challenge, therefore, lies in synthesizing these divergent explanations into a coherent consensus that accurately reflects the collective output of multiple algorithms. Yet, there is no established framework for systematically comparing or integrating these explanations in a reliable way [33].

To this end, consensus functions have emerged as a solution to harmonize explanations, with methods ranging from the arithmetic mean to other statistical measures and feature occurrence-based approaches [34,35,36,37,38,39,40]. Yet, these functions often fall short by not accounting for factors such as model accuracy, which can significantly influence the reliability of the derived explanations. Incorporating model accuracy into the consensus functions can enhance the robustness and trustworthiness of the explanations provided by AI systems. By considering model accuracy, the resulting explanations can better reflect the true behavior and decision-making processes of the underlying AI models, leading to more reliable and informative insights for users. This holistic approach to explanation harmonization can contribute to improving the overall transparency and interpretability of AI systems, fostering greater trust and understanding among users and facilitating more effective decision-making processes. Moreover, better consensus mechanisms could directly benefit clinical and industrial stakeholders by reducing uncertainty and enabling auditable explanation workflows [41].

This study explores several consensus functions applied to ML models, identifies limitations of traditional approaches, and introduces a new method, WISCA, which incorporates class probabilities and attribution scaling for a more robust explanation process. Our goal is to bridge the gap between model performance and explanation quality by prioritizing both in the consensus process. The subsequent sections detail the models, datasets, interpretability algorithms, and consensus functions utilized in our study, with Section 2 presenting the methodology, Section 3 presenting the findings, and Section 4 delving into a comprehensive evaluation of the consensus functions. Ultimately, we consolidate our key insights and propose future research trajectories.

2. Materials and Methods

This section describes the synthetic datasets used for the experiments, the set of ML models, the interpretability algorithms used to explain the models’ internal workings, and the consensus functions assessed. Finally, a novel consensus function is proposed to mitigate the identified challenges.

2.1. Datasets

In order to analyze disagreements in classification and regression problems, six synthetic datasets were created: two for binary classification problems, two for multiclass classification problems, and two for regression problems. The main reason for choosing synthetic over real-world datasets is that they can be constructed using predefined rules based on specific input features. This facilitates the evaluation of whether the resulting explanations correctly identify the expected features, since the features expected to explain the model are known in advance.

In all datasets, the target feature was calculated based on 3 or 4 input features by applying different linear and non-linear functions. In addition, several irrelevant features that simulate noise were added to the datasets. This was done to verify whether the interpretability algorithms focused on the most relevant features or whether they were disturbed by the noise instead.

The datasets contained between 20 and 75 features. The number of features in each dataset was chosen randomly so that the variables not involved in the target computation were sufficient to prevent the model from learning the target feature too easily. As a result, each dataset includes 16 to 72 noise variables. Furthermore, to avoid time-consuming calculations, since each model was trained 10 times, each dataset was created with a random number of samples ranging from 1500 to 2500. With this number of samples, the models have enough data to learn effectively and can be trained in a few minutes or hours. All features were generated using a uniform distribution between 0 and 1. In this way, collections of normalized datasets were quickly simulated.
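The generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the datasets: the function name `make_synthetic_dataset` and the target rule (label 1 when the relevant features average above 0.5) are hypothetical stand-ins for the formulas listed in Table 2.

```python
import random

def make_synthetic_dataset(n_samples, n_features, relevant, seed=0):
    """Build a toy binary-classification dataset with uniform features.

    `relevant` holds the indices of the features that drive the target;
    every other column acts as noise. The target rule is a hypothetical
    placeholder, not one of the formulas used in the study.
    """
    rng = random.Random(seed)
    # All features are drawn from a uniform distribution on [0, 1).
    X = [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]
    # Hypothetical rule: label 1 when the relevant features average above 0.5.
    y = [int(sum(row[j] for j in relevant) / len(relevant) > 0.5) for row in X]
    return X, y

# 20 features, of which only the first three determine the target.
X, y = make_synthetic_dataset(n_samples=2000, n_features=20, relevant=[0, 1, 2])
print(len(X), len(X[0]), sum(y) / len(y))
```

Because the relevant indices are fixed in advance, any explanation of a model trained on `X` and `y` can be checked against them.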

Table 1 summarizes the main properties of each dataset and the combination of variables implemented to calculate the target. For each dataset, its name, the type of task it solves, the number of samples, and the features are indicated. The “Expected Explanation” column lists the features from which the target variable was calculated. Therefore, these are the variables that the interpretability algorithms were expected to identify.

Concerning the formulas used to model the target variables, a representative collection of linear and non-linear functions has been used to generate different types of datasets. Table 2 shows the formula implemented to calculate the output values. The synthetic datasets are available as Supplementary Materials.

2.2. Machine Learning Models

The algorithms analyzed were used to interpret the predictions of different ML models. The selected models include k-nearest neighbors (KNNs), random forests (RFs), support vector machines (SVMs), extreme gradient boosting machines (XGBs), and artificial neural networks (ANNs). They form a representative collection that includes non-linear, ensemble, and black-box models. In addition, a linear (logistic) regression model (LR) was added as a baseline. In this way, the study covers a diverse set of model architectures.

2.3. Interpretability Algorithms

A representative set of interpretability algorithms was chosen for this work. Both global and local approaches were considered. In addition, the selected explainers include methods with different methodological foundations. Global methods summarize the contribution of each feature to the model output through a single numerical value, commonly referred to as “attribution”. By contrast, local methods provide feature-wise explanations at the individual sample level.

In this work, we used Random Forests [42] to estimate the importance of each input feature in the model’s decision-making process. Specifically, we relied on impurity-based feature importance, also known as mean decrease in impurity (MDI), as implemented in the Scikit-learn library. This method calculates the importance of a feature by measuring the total decrease in node impurity (e.g., Gini impurity) brought about by splits involving that feature, averaged over all trees in the forest. A higher decrease indicates a more informative feature. Furthermore, we used permutation-based feature importance [43] as a complementary method. This approach evaluates the decrease in model performance when the values of a single feature are randomly permuted, effectively breaking its relationship to the target. Together, these methods provide a global view of the relative contribution of each feature to the predictive performance of the model.
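To make the permutation-based idea concrete, the following from-scratch sketch shuffles one feature column at a time and records the drop in accuracy. It is only an illustration under assumptions: `toy_model` is a hypothetical stand-in for a fitted model, accuracy is used as the performance measure, and the code is not the implementation used in the study.

```python
import random

def accuracy(model, X, y):
    """Fraction of samples for which the model predicts the right label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: only feature 0 determines the label; features 1 and 2 are noise.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]

def toy_model(row):
    # Hypothetical "fitted" model that has learned the true rule.
    return int(row[0] > 0.5)

imp = permutation_importance(toy_model, X, y)
print(imp)
```

Features the model ignores obtain an importance of zero, because shuffling them leaves every prediction unchanged.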

On the other hand, Local Interpretable Model-agnostic Explanations (LIME) [44], SHAP [45,46], Integrated Gradients [47], and counterfactual explanations [48] were selected as local approaches. Among them, LIME and SHAP were considered model-agnostic explainers, whereas Integrated Gradients was treated as a gradient-based attribution method. Counterfactual explanations were handled separately, since they provide instance-level explanations based on the changes required to obtain an alternative prediction. In this work, the Python 3.8 implementations provided by LIME (https://lime-ml.readthedocs.io/en/latest/ (accessed on 15 March 2026)), SHAP (https://github.com/shap/shap (accessed on 15 March 2026)), alibi (https://github.com/SeldonIO/alibi (accessed on 15 March 2026)), and DiCE (https://github.com/interpretml/DiCE (accessed on 15 March 2026)) were adopted, respectively. For counterfactual explanations, the feature-wise scores used in the analysis were derived from the feature-importance estimates returned by the DiCE framework from the generated counterfactual instances.

Integrated Gradients (IG) is a gradient-based attribution method that requires access to model derivatives with respect to the input features. Therefore, it is formally applicable only to differentiable models. In our implementation, IG was applied exclusively to TensorFlow-based models, where gradients can be computed directly. For non-differentiable models, such as KNN, Random Forest, XGBoost, and sklearn-based SVM, IG is not theoretically well-defined and was therefore not considered in the final analysis.

This design choice does not affect the applicability of the proposed WISCA framework, since it is intended to integrate heterogeneous explainers and does not require all interpretability methods to be available for all model classes.

2.4. Consensus Functions

This work assessed different consensus functions previously published in the literature. Those functions will be the basis for developing and evaluating a novel consensus approach. The consensus functions evaluated in this study are described below.

2.4.1. Arithmetic Mean

The arithmetic mean (Equation (1)) is a simple approach to average the importance of all the features. It is frequently used because of its simplicity of calculation. In addition, it is a fair approach that gives the same importance to all the explanations. However, the main problem with this function is that it weights equally the explanations of very precise and randomly classified samples. Furthermore, since interpretability algorithms may produce attributions on different scales, applying the arithmetic mean to unnormalized attributions can bias the consensus.

A_{\mathrm{mean}} = \frac{\sum_{i=1}^{n} x_{i}}{n}

(1)

2.4.2. Harmonic Mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values (Equation (2)). It is typically used to average rates or ratios, for example, to compute an average velocity over a route travelled at different speeds. It cannot handle zero or negative values, but it is robust to extreme positive outliers, which mitigates their influence when the attribution scales vary.

H_{\mathrm{mean}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_{i}}}

(2)

2.4.3. Geometric Mean

For two values, the geometric mean can be expressed in terms of the arithmetic and the harmonic means (Equation (3)); in general, it is formulated in terms of the individual attributions of each feature (Equation (4)). The geometric mean is often used to analyze time series and growth rates, where values are multiplicatively related to one another. It is less sensitive to extreme values than the arithmetic mean, but its calculation is more complex. Additionally, like the harmonic mean, it does not handle non-positive values correctly. Despite these limitations, the geometric mean of positive values lies between the harmonic and arithmetic means, so it is worth testing its performance as a consensus function.

G_{\mathrm{mean}} = \sqrt{A_{\mathrm{mean}} \cdot H_{\mathrm{mean}}}

(3)

G_{\mathrm{mean}} = \sqrt[n]{\prod_{i=1}^{n} x_{i}}

(4)
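As a quick numerical check of Equations (1), (2) and (4), the sketch below computes the three means for the attributions that several explainers might assign to one feature. The attribution values are hypothetical, and the functions raise an error on non-positive inputs, which is one possible way (an assumption here) of handling the limitation noted above.

```python
import math

def arithmetic_mean(xs):
    # Equation (1)
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    # Equation (2); undefined for zero or negative values
    if any(x <= 0 for x in xs):
        raise ValueError("harmonic mean needs strictly positive values")
    return len(xs) / sum(1.0 / x for x in xs)

def geometric_mean(xs):
    # Equation (4), computed through logarithms for numerical stability
    if any(x <= 0 for x in xs):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical attributions of one feature from three explainers.
attrs = [0.9, 0.6, 0.05]
a = arithmetic_mean(attrs)
g = geometric_mean(attrs)
h = harmonic_mean(attrs)
print(a, g, h)
```

A single low attribution (0.05) pulls the harmonic and geometric means down much more than the arithmetic mean, which illustrates how strongly the choice of mean can change the consensus.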

2.4.4. Voting

A limitation of using attribution scores is that not all interpretability algorithms handle the same range of values, making it difficult to compare the explanations of different algorithms. Inspired by the majority voting principle in random forests, a voting approach is proposed. This function sorts the features by attribution in descending order and counts how often each feature is present among the N most attributed features. Finally, the features appearing most frequently in the top N are assumed to be the most important. The selection of N is a crucial but difficult task. If it is too small, it may be difficult to find a pattern among the top-ranked features, but if it is too large, features with very low attribution will be taken into account. In this work, a value of N = 5 was established. Note that this decision is conditional on prior knowledge of the synthetic datasets, for which the number of variables that explain them is known in advance. The main limitation of voting is that it omits the attributions assigned to the features and does not consider either the contribution sign or the model accuracy.
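A minimal sketch of the voting function is given below. It assumes that each explanation is available as a mapping from feature name to attribution and that features are ranked on the raw attribution value; the feature names and numbers are hypothetical, and a small `top_n` is used only to keep the example short.

```python
from collections import Counter

def voting_consensus(explanations, top_n=5):
    """Count how often each feature appears among the top_n features.

    `explanations` is a list of dicts mapping feature name -> attribution,
    one dict per explanation (assumed input format).
    """
    votes = Counter()
    for attribution in explanations:
        # Sort features by attribution, highest first, and keep the top_n.
        ranked = sorted(attribution, key=attribution.get, reverse=True)
        votes.update(ranked[:top_n])
    return votes

# Three hypothetical explanations over three features.
explanations = [
    {"x1": 0.7, "x2": 0.2, "x3": 0.1},
    {"x1": 0.5, "x2": 0.1, "x3": 0.4},
    {"x1": 0.6, "x2": 0.3, "x3": 0.1},
]
votes = voting_consensus(explanations, top_n=2)
print(votes.most_common())
```

Here `x1` collects three votes, `x2` two and `x3` one; the magnitudes of the attributions play no role once the ranking is fixed.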

2.4.5. Relative Position

A similar approach, based on sorting features by descending attribution, simply uses each feature’s relative position in the list. The feature positions then replace the attributions. Hence, this method ensures that the attributions are normalized in a common range: [1, number of features]. The main difference between this method and the previous ones is that here the lower the combined attribution, the more important the feature. Like voting, this function discards the actual magnitude of the feature attributions.
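The relative position function can be sketched in the same way. How positions from different explanations are combined is not detailed above, so the code below assumes they are averaged; the input format and the example values are likewise hypothetical.

```python
def relative_position_consensus(explanations):
    """Average the rank position of each feature across explanations.

    Position 1 is the most attributed feature, so lower scores mean
    more important features. Averaging positions is an assumption.
    """
    features = list(explanations[0])
    total_position = {f: 0 for f in features}
    for attribution in explanations:
        ranked = sorted(features, key=attribution.get, reverse=True)
        for position, feature in enumerate(ranked, start=1):
            total_position[feature] += position
    return {f: total_position[f] / len(explanations) for f in features}

# Three hypothetical explanations over three features.
explanations = [
    {"x1": 0.7, "x2": 0.2, "x3": 0.1},
    {"x1": 0.5, "x2": 0.1, "x3": 0.4},
    {"x1": 0.6, "x2": 0.3, "x3": 0.1},
]
positions = relative_position_consensus(explanations)
print(positions)
```

The feature ranked first by every explanation obtains the minimum possible score of 1, regardless of how large its attributions were.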

2.4.6. Other Functions

Recent studies have explored data fusion through machine learning and deep learning (DL) models [49,50,51,52]. Although relevant as evidence of current research trends, these approaches generally involve trained fusion architectures, hyperparameter optimization, and substantially higher computational cost than the consensus functions considered in this work. In contrast, the present study focuses on lightweight closed-form aggregation rules applied directly to the feature-importance outputs of XAI methods, without introducing an additional learning stage. Therefore, the comparison was restricted to standard consensus functions and the proposed method, all of them operating under the same aggregation framework. This choice ensures a homogeneous comparison among methods of the same nature.

2.5. Development of a Novel Consensus Function

The functions described in the previous section omit some crucial parameters to perform consensus accurately. To address these limitations, we developed a novel consensus function that incorporates these critical factors.

2.5.1. Identified Challenges

To overcome the aforementioned issues, the proposed approach should address the main limitations identified.

Based on attributions. Some consensus functions, such as voting or the relative position, ignore the importance of the input features in the decision. Although they may be less computationally expensive, they overlook the magnitude of the contributions provided by interpretability algorithms.
Scale attributions. The interpretability approaches used in this work do not handle attributions in the same range, making comparing them difficult and unfair. Consequently, the proposed function normalizes all the attributions to the range [0, 1] using the min-max approach (Equation (5)). To preserve the sign of the attributions, the scaled values are multiplied by the original sign of the attribution (attr_sign), represented by 1 or −1, to mark the difference between positive and negative attributions.

$\mathrm{attr}_{\mathrm{scaled}} = \frac{\mathrm{attr} - \min(\mathrm{attr})}{\max(\mathrm{attr}) - \min(\mathrm{attr})} \cdot \mathrm{attr\_sign}$

(5)
Distinguish global and local explanations. Global interpretability methods return a single attribution value per feature, which is inferred from the attributions of all individual samples. By contrast, local methods return one value per feature for every sample. Therefore, if both kinds of values were combined into one single equation, local methods would have more importance simply because they contribute more values. To overcome this problem, the attributions of the local methods have to be divided by the number of samples so that each local attribution contributes in the same proportion as the global ones.
Weight the errors. Local interpretability methods explain each sample individually. However, not all the samples are predicted with the same certainty. For example, a classification sample that is predicted with a probability of 0.5 is the result of a random decision. However, another sample, which is classified with a probability of 0.99, is a very reliable prediction whose explanation is of great interest. Similarly, the difference between the predicted value and the actual value in a regression problem is a measure of how reliable the prediction is. Hence, this can be an essential parameter when assessing local methods. The impact of class probability or regression error in model explanations should be represented by a correction factor.
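The scaling step of Equation (5) can be sketched as below. This is a literal reading of the formula under two assumptions that the text does not settle: a zero attribution is given the sign +1, and a constant attribution vector (where the maximum equals the minimum) is mapped to zeros to avoid a division by zero.

```python
def scale_attributions(attrs):
    """Min-max scale a list of attributions and re-apply the original sign."""
    lo, hi = min(attrs), max(attrs)
    span = hi - lo
    scaled = []
    for a in attrs:
        sign = -1.0 if a < 0 else 1.0                    # attr_sign; zero treated as +1 (assumption)
        magnitude = (a - lo) / span if span > 0 else 0.0  # constant vector -> 0 (assumption)
        scaled.append(magnitude * sign)
    return scaled

# Hypothetical raw attributions on an arbitrary scale.
print(scale_attributions([0.5, 2.0, 3.5]))
print(scale_attributions([-2.0, 1.0, 4.0]))
```

After this step every explainer reports values whose magnitude lies in [0, 1], so attributions produced on different scales become comparable.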

2.5.2. WISCA Formulation

WISCA (Weighted Scaled Consensus Attributions) is the new consensus function proposed to avoid the above problems. It takes into account factors such as class probability (in classification problems), the difference between the actual and predicted value (in regression problems), and the different scales used by interpretability algorithms. In addition, it weighs global and local interpretation methods fairly and equally.

To define it formally, three cases are differentiated according to the type of problem and the type of interpretability algorithm used: (i) global explanations (for classification and regression) (Equation (6)); (ii) local explanations in classification problems (Equation (7)); and (iii) local explanations in regression problems (Equation (8)).

ϕ(f) = ϕ'(f)

(6)

ϕ(f) = \sum_{s \in S} \frac{ϕ'(f)}{N} \cdot π(s, m)

(7)

ϕ(f) = \sum_{s \in S} \frac{ϕ'(f) \cdot e^{-α |y - \hat{y}|}}{N}

(8)

In Equations (6)–(8), $ϕ'(f)$ is the feature attribution scaled between 0 and 1 with the min–max algorithm; m is the model; f is the input feature; S is the set of samples in the dataset; s is the current sample; N is the total number of samples in the dataset; $π(s, m)$ is the correction factor applied to classification problems; and $e^{-α |y - \hat{y}|}$ is the correction factor used in regression problems, where y and $\hat{y}$ are the actual and predicted values of the sample.

Thus, consensus is computed feature-wise across the different explainers. Global methods directly contribute one attribution value per feature, whereas local methods contribute one aggregated attribution value per feature after combining the sample-level explanations.

In a formal way, the consensus vector can be formulated as follows.

Let $E = \{e_{1}, \dots, e_{K}\}$ be the set of explainers. For each feature f, each explainer provides an attribution $ϕ_{e}(f)$. The final WISCA consensus attribution is computed as

ϕ_{\mathrm{WISCA}}(f) = \frac{1}{K} \sum_{e \in E} ϕ_{e}(f)

(9)

where $ϕ_{e}(f)$ is computed according to Equations (6)–(8). It is worth noting that all the methods are weighted equally.
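Equations (6)–(9) can be put together in a short sketch for a single feature. Several details are assumptions made for illustration: the attributions are taken to be already scaled, the local attributions are passed as one value per sample, the classification correction factor is the quadratic function of Section 2.5.4 evaluated on the predicted class probability, and all function names and numbers are hypothetical.

```python
import math

def quadratic_factor(p):
    # Classification correction factor (quadratic form, Section 2.5.4).
    return 1.0 - 4.0 * p * (1.0 - p)

def wisca_global(scaled_attr):
    # Equation (6): a global explainer contributes its scaled attribution as is.
    return scaled_attr

def wisca_local_classification(scaled_attrs, probabilities):
    # Equation (7): per-sample attributions weighted by prediction certainty.
    n = len(scaled_attrs)
    return sum(a / n * quadratic_factor(p) for a, p in zip(scaled_attrs, probabilities))

def wisca_local_regression(scaled_attrs, y_true, y_pred, alpha=0.5):
    # Equation (8): per-sample attributions damped by the absolute error.
    n = len(scaled_attrs)
    return sum(
        a * math.exp(-alpha * abs(y - y_hat)) / n
        for a, y, y_hat in zip(scaled_attrs, y_true, y_pred)
    )

def wisca_consensus(per_explainer_values):
    # Equation (9): unweighted average over the K explainers.
    return sum(per_explainer_values) / len(per_explainer_values)

# One feature, one global explainer and one local explainer (hypothetical values).
global_value = wisca_global(0.8)
local_value = wisca_local_classification(
    scaled_attrs=[0.9, 0.7, 0.4],
    probabilities=[0.99, 0.90, 0.50],
)
consensus = wisca_consensus([global_value, local_value])
print(global_value, local_value, consensus)
```

In this example the third sample is predicted with probability 0.5, so its attribution is removed from the local aggregate, while the confidently predicted samples keep most of their weight.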

2.5.3. Correction Factor in Regression

The function chosen to model the correction factor in regression problems (Equation (8)) measures the distance, in absolute value, between the value predicted by the model and the real value of the sample. Thus, when the distance between both values is very small, the correction factor approaches 1, which keeps the attribution of the variable close to its original value. Conversely, when the distance between the two values is very large, the correction factor approaches 0 and reduces the impact of the attribution on the consensus. In this work, a default value of α = 0.5 was used for calculations.
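The behavior of the regression correction factor can be checked numerically with a few absolute errors. The function name and the error values below are illustrative only; α = 0.5 is the default stated above.

```python
import math

def regression_factor(y_true, y_pred, alpha=0.5):
    """Correction factor exp(-alpha * |y - y_hat|) from Equation (8)."""
    return math.exp(-alpha * abs(y_true - y_pred))

# Factor for a perfect prediction, a small error and a large error.
for error in (0.0, 1.0, 5.0):
    print(error, round(regression_factor(error, 0.0), 4))
```

With α = 0.5, a perfect prediction keeps the full attribution (factor 1), an absolute error of 1 keeps about 61% of it, and an absolute error of 5 keeps about 8%. Since the factor depends on the absolute error, its effect also depends on the scale of the target variable.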

2.5.4. Alternatives for the Classification Correction Factor

When implementing the correction factor in classification problems, a function of the class probability p is sought that attains its maximum at p = 0 and p = 1 and its minimum at p = 0.5. These values correspond to a curve that is maximized when the sample is predicted with absolute certainty or when the prediction is totally wrong. Conversely, when uncertainty is maximal, the correction factor shrinks the attribution until it disappears completely. There are several families of functions that exhibit the desired behavior, as shown in Figure 1.

Quadratic. This function represents a smooth and symmetrical curve. It shows a rapid drop towards the minimum at p = 0.5. It reaches its maximum at p = 0 and p = 1.

$π (p) = 1 - 4 p (1 - p)$

(10)
Power. This function generalizes the parabola through an exponent n that controls the sharpness of the curve. With n = 2, as in Equation (11), it expands to 1 − 4p(1 − p) and therefore coincides with the quadratic function, so it can be omitted.

$π(p) = {|2p - 1|}^{2}$

(11)
Cosine. The cosine function shows a smooth, wave-like behavior. It reaches its maxima at 0 and 1 and its minimum at 0.5. However, the transition is much slower than that of the parabola.

$π(p) = \frac{1 + \cos(2 π p)}{2}$

(12)
Exponential. This function has the shape of an inverted bell. The speed of the fall is adjustable by means of a parameter (a): the higher a, the steeper the fall. At probabilities close to 0 and 1, it behaves more smoothly. However, we want our function to be more constant in its curvature.

$π(p) = 1 - \exp(-a {(2p - 1)}^{2})$

(13)
Negative entropy. It behaves very similarly to a quadratic function, but the curve is slightly steeper. As a result, the factor takes longer to grow: at intermediate probabilities, within (0, 0.5) and (0.5, 1), the attributions become more similar because the correction factor varies less.

$π(p) = 1 + p \log_{2} p + (1 - p) \log_{2} (1 - p)$

(14)

Among all the families of functions evaluated, the one whose behavior best matches what WISCA requires is the parabolic (quadratic) function. Consequently, this approach has been used to model the correction factor for classification problems.
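The candidate families can be compared numerically with the sketch below. Some choices are assumptions for illustration: the cosine is written with period 1 so that it matches the stated maxima at 0 and 1 and minimum at 0.5, the exponential uses a hypothetical a = 5, and the negative entropy is set to 1 at p = 0 and p = 1, which is its limiting value.

```python
import math

def quadratic(p):
    return 1.0 - 4.0 * p * (1.0 - p)

def power(p, n=2):
    return abs(2.0 * p - 1.0) ** n

def cosine(p):
    # Period-1 cosine: maxima at p = 0 and p = 1, minimum at p = 0.5.
    return (1.0 + math.cos(2.0 * math.pi * p)) / 2.0

def exponential(p, a=5.0):
    # a = 5 is a hypothetical steepness value.
    return 1.0 - math.exp(-a * (2.0 * p - 1.0) ** 2)

def negative_entropy(p):
    if p in (0.0, 1.0):
        return 1.0  # limit of p*log2(p) as p -> 0 is 0
    return 1.0 + p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)

candidates = [quadratic, power, cosine, exponential, negative_entropy]
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, [round(f(p), 3) for f in candidates])
```

All candidates vanish at p = 0.5 and are symmetric around it. With exponent 2, the power function returns the same values as the quadratic one, and the exponential function stays slightly below 1 at the extremes for any finite a.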

3. Results

3.1. Model Training

To ensure robust results, each of the six models was trained ten times with each dataset using the SIBILA tool [53]. The instances that produced the highest AUC (for classification) and R² (for regression) were selected for evaluation. Additional metrics, including F1-score and mean absolute error (MAE), were also collected. Figure 2 displays the main metrics of the six models after training. As each model was run 10 times, the average value of each metric is shown for each model, together with the standard deviation. Thus, Figure 2 summarizes the variability across the 10 runs. For interpretability analysis, the highest-performing run was selected for each model-dataset pair. This choice was made to compare consensus functions on a reliable fitted model. Importantly, all consensus strategies were computed from the same fitted model and the same underlying attribution set within each pair, ensuring a controlled and fair comparison among aggregation methods.

3.2. Consensus

This section describes the experiments carried out to evaluate the accuracy of the consensus functions. The experiments aimed to assess how accurately each consensus function identifies the most relevant features in each dataset. Ideally, the features explaining the output should be the most attributed after consensus.

The consensus functions were tested on six synthetic datasets: two for binary classification, two for multiclass classification, and two for regression. The outputs of each dataset were generated manually using different rules to test consensus functions across a variety of cases. Therefore, the features involved in the calculation of the output were known in advance and are summarized in Table 1. It must be highlighted that gradient-based attributions are reported only for differentiable models. For non-differentiable estimators, these methods are omitted, and consensus is computed using the remaining explainers.

To enable fair comparison, we defined a custom metric—termed the hit rate metric—that calculates the precision of each function by assigning a weight to each feature that is considered relevant according to its position in the list of features ordered from highest to lowest attribution. Features that do not participate in the target calculation formula do not receive a score. According to this logic, if the N variables used to generate the target are ranked highest after consensus, the function will obtain the highest score. As the position of these variables decreases and features introduced as noise are interspersed, the score goes down. The function assigns scores between 0 and 1, with 1 being the highest score and 0 the lowest. By contrast, precision@k depends on a fixed cutoff, Kendall’s τ evaluates global rank agreement, and NDCG requires an explicit relevance model. Therefore, the proposed metric is not meant to replace standard ranking measures, but to quantify a different property that is more closely aligned with the objective of this study. Consequently, alternative ranking metrics were not considered in the present evaluation, as the aim was not to measure general ranking agreement but to assess the prioritization of the expected feature set. The mathematical formulation of our metric is

P = \frac{\sum_{i=1}^{N} \frac{h(x_{i})}{i}}{\sum_{i=1}^{\min(n, N)} \frac{1}{i}}

(15)

Here, N denotes the number of expected features and n the total number of ranked features considered in the explanation. In addition, $h(x_{i})$ is 1 when $x_{i}$ is one of the expected features (a hit), and 0 otherwise. More formally, it can be formulated as follows:

h(x_{i}) = \begin{cases} 1 & \text{if } x_{i} \in \text{Expected Features} \\ 0 & \text{if } x_{i} \notin \text{Expected Features} \end{cases}

(16)
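A literal reading of Equations (15) and (16) can be sketched as follows. The function name, the feature names and the assumption that the ranking is passed as a list ordered from highest to lowest attribution are illustrative; the expected set is assumed to be non-empty.

```python
def hit_rate(ranked_features, expected):
    """Hit rate of Equation (15) for one consensus ranking.

    `ranked_features` is ordered from highest to lowest attribution and
    `expected` is the set of features used to generate the target.
    """
    n_expected = len(expected)       # N in Equation (15)
    n_ranked = len(ranked_features)  # n in Equation (15)
    # Numerator: 1/i for every expected feature found at position i <= N.
    numerator = sum(
        1.0 / i
        for i, feature in enumerate(ranked_features[:n_expected], start=1)
        if feature in expected
    )
    # Denominator: best achievable score, i.e., all top positions are hits.
    denominator = sum(1.0 / i for i in range(1, min(n_ranked, n_expected) + 1))
    return numerator / denominator

expected = {"a", "b", "c"}
print(hit_rate(["a", "b", "c", "noise1", "noise2"], expected))
print(hit_rate(["a", "b", "noise1", "c", "noise2"], expected))
```

In the second ranking a noise feature takes the third position, so the numerator is 1 + 1/2 and the denominator 1 + 1/2 + 1/3, which gives 9/11 (about 0.82).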

Figure 3 illustrates the comparative performance of the consensus functions according to the custom hit rate metric. Higher values indicate better alignment with ground-truth relevant features. Figure 4 shows the average distance obtained by each function across the six datasets. To evaluate WISCA’s ability to recover the most relevant features, we also calculated the hit rate for the individual interpretability algorithms (Figure 5), along with the Spearman correlation (Figure 6) and the Jensen–Shannon divergence (Figure 7) between WISCA and each algorithm.

To strengthen the statistical validation of the consensus comparison, we performed paired Wilcoxon signed-rank tests across the 36 selected model-dataset instances (6 datasets × 6 models), using the hit rate metric. Mean values, medians, interquartile ranges, and 95% confidence intervals are reported in Table 3, and p-values were adjusted using the Holm–Bonferroni procedure.

WISCA achieved the highest mean hit rate across instances (0.9912), with a median of 1.0000 and a null interquartile range, indicating near-perfect performance consistently. The paired comparisons show that WISCA significantly outperformed arithmetic mean, relative position, geometric mean, and harmonic mean. Although WISCA also obtained a higher mean hit rate than voting, the difference did not reach statistical significance after paired testing. The lack of statistical significance between WISCA and voting can be attributed to a ceiling effect in the hit rate metric, since both methods achieved perfect or near-perfect scores in a large proportion of the evaluated instances.

This analysis supports a fair within-instance comparison among consensus functions, although it does not quantify the variability of consensus performance across repeated training runs, since interpretability was computed only for the selected run of each model-dataset pair.

Finally, the ability of each consensus function to clearly distinguish between expected and non-expected variables was analyzed. For this purpose, the distance between the two sets of variables, expected and non-expected, was measured as

$$\mathit{dist} = \frac{\phi(EF_{min}) - \phi(NEF_{max})}{\max(\phi(EF_{min}), \epsilon)} \cdot 100$$

(17)

where $\phi(EF_{min})$ is the minimum attribution assigned to an expected feature; $\phi(NEF_{max})$ is the maximum attribution assigned to a non-expected feature; and $\epsilon$ is a constant, set to $10 \times 10^{-6}$, to avoid divisions by zero. The calculation was performed on a percentage basis to avoid the bias introduced by the voting function. Because all local explanations vote equally, this function produces far larger raw values than the others, so without the percentage scaling it would always obtain the largest difference.
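Equation (17) can be sketched in Python as shown below. This is an illustrative implementation under stated assumptions: the function name `separation_distance`, the dictionary layout, and the attribution values are hypothetical, and only the formula and the constant come from the text.

```python
EPSILON = 10e-6  # constant from Equation (17), i.e., 10 x 10^-6

def separation_distance(attributions, expected_features, epsilon=EPSILON):
    """Percentage gap between the weakest expected and the strongest non-expected feature."""
    ef_min = min(v for f, v in attributions.items() if f in expected_features)
    nef_max = max(v for f, v in attributions.items() if f not in expected_features)
    return (ef_min - nef_max) / max(ef_min, epsilon) * 100

# Hypothetical consensus attributions, for illustration only.
consensus = {"F2": 0.40, "F3": 0.30, "F9": 0.20, "F17": 0.25, "F1": 0.05, "F8": 0.02}
expected = {"F2", "F3", "F9", "F17"}

print(separation_distance(consensus, expected))  # about 75 (%)
```

A positive value means that every expected feature receives a larger attribution than every non-expected one; a negative value means that at least one non-expected feature outranks an expected feature.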

4. Discussion

4.1. Assessment of Consensus Functions

Analyzing the functions for each group of datasets (Figure 3) shows that, in the binary classification datasets (datasets 1 and 2), voting and WISCA always place all the expected variables in the first positions. The arithmetic mean also performs very well, although slightly worse than the others in the black-box model (ANN). In the multiclass classification cases, the results are more varied. Voting still gives the expected result in dataset 5, closely followed by WISCA, whereas in dataset 6 WISCA slightly outperforms voting. Although the mean attribution is very similar for both functions, the standard deviation of WISCA is clearly lower, suggesting more compact attributions. Finally, in the regression datasets (datasets 3 and 4), WISCA is the only function that identifies all the expected features across all models. While the arithmetic mean behaves similarly in both datasets, voting clearly worsens in dataset 4. In fact, the poor performance of the voting function recurs in datasets 3, 4, and 6, precisely the datasets in which one model, namely KNN, performed poorly during training. It therefore appears that this function is very sensitive to model accuracy. In summary, although the arithmetic mean and voting perform well, they are outperformed by WISCA in the vast majority of cases. WISCA finds the expected results in 4 of the 6 datasets and remains the best (though not perfect) in another. By contrast, the geometric mean and harmonic mean are very poor choices for computing consensus, while the ranking function performs accurately in only a few cases.

It is also interesting to investigate why some functions perform so poorly in the consensus. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, so it is dominated by the smallest values: the higher the attribution assigned to a variable, the less it contributes to the result. This is exactly the opposite of what is expected and the reason why the harmonic mean is not a good alternative for implementing consensus.

For its part, the geometric mean systematically fails as a consensus function of attributions for several reasons. First, it cancels out any explanation that contains a null (or near-null) attribution, because if a single attribution is 0, the product of all of them is also 0. In the context of attributions, many variables do not contribute to the explanation; for example, noise variables usually have very low attributions. With the geometric mean, these values drag the entire consensus down to 0, masking the signal of the truly relevant variables. Second, many interpretability techniques (e.g., integrated gradients, counterfactuals) return both positive and negative attributions, signaling a favorable or unfavorable contribution to the prediction, and the geometric mean cannot handle negative values. Third, the geometric mean does not preserve the original scale, nor does it maintain additive order. In contrast, sum-based functions (arithmetic mean, WISCA, etc.) do maintain proportionality, which is key for comparing relative importance.

Finally, consensus ranking considers only the relative positions of the features in the list of variables sorted by descending attribution. Unfortunately, some algorithms (e.g., feature permutation importance) assign a weight of 0 to many variables, making the ranking of variables dependent on the sorting algorithm or a secondary tie-breaker criterion. This can clearly affect the weight given to the features in the consensus and, consequently, lead to unreliable results.
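The behavior of the three means discussed above can be checked with a small numerical sketch. The attribution values are hypothetical, and the helper functions are plain textbook definitions written for illustration, not the implementations used in this work.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Defined only for non-negative inputs; a single zero makes the product zero.
    if any(v < 0 for v in values):
        raise ValueError("geometric mean is undefined for negative attributions")
    return math.prod(values) ** (1 / len(values))

def harmonic_mean(values):
    # Reciprocal of the arithmetic mean of the reciprocals (positive inputs only).
    return len(values) / sum(1 / v for v in values)

# Hypothetical attributions given to one relevant feature by four explainers.
with_zero = [0.90, 0.80, 0.85, 0.0]
with_small = [0.90, 0.80, 0.85, 0.01]
with_negative = [0.90, -0.80, 0.85, 0.70]

print(arithmetic_mean(with_zero))   # stays high despite the single zero
print(geometric_mean(with_zero))    # 0.0: one null attribution cancels the rest
print(harmonic_mean(with_small))    # pulled towards the smallest value
try:
    geometric_mean(with_negative)
except ValueError as err:
    print(err)
```

In this toy case, three explainers agree that the feature is highly relevant, yet a single null or near-null attribution is enough to collapse the geometric and harmonic consensus, while the arithmetic mean remains above 0.6.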

4.2. Evaluation of WISCA

Figure 5 shows the scores obtained individually by the interpretability algorithms in their original explanations, i.e., without using any consensus, when applying the hit rate metric. The best results are obtained when the values are close to 1 (the outer line), as these indicate the algorithms that best interpret each model. This information will be useful to contrast with the following figures.

Spearman’s correlation indicates whether the order of attributions is preserved, i.e., whether there is a monotonic relationship between the original and consensus attributions. To measure it, we took the original averaged attributions: for global interpretability methods, these are the attributions returned directly, whereas for local methods they are calculated as the average of each variable’s attribution across all samples. These values were then correlated with the attributions calculated by WISCA. In each dataset, the algorithms with the highest Spearman correlation are expected to correspond to those with the highest hit rate (Figure 5). Figure 6 shows that in datasets 1, 2, and 5 the highest correlation is with RF, and in datasets 3, 4, and 6 with Permutation and RF. A comparison with Figure 5 shows that these algorithms are almost always the ones with the highest hit rate. This means that WISCA assigns attributions that closely resemble those of the algorithm that best explains each model.
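The procedure described above (averaging local attributions per feature, then correlating them with the consensus) can be sketched in pure Python. The function names and all numbers are hypothetical, and the rank-based formula is the standard definition of Spearman's coefficient rather than code taken from this work.

```python
def average_local(attribution_rows):
    """Average per-sample (local) attributions into one value per feature."""
    n = len(attribution_rows)
    return [sum(col) / n for col in zip(*attribution_rows)]

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.
    Undefined (division by zero) if all values in x or in y are tied."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical local attributions: 3 samples x 4 features.
local = [[0.5, 0.1, 0.0, 0.3],
         [0.7, 0.2, 0.0, 0.1],
         [0.6, 0.0, 0.0, 0.2]]
consensus = [0.55, 0.15, 0.02, 0.28]  # hypothetical consensus attributions

rho = spearman(average_local(local), consensus)
print(rho)
```

Here both vectors order the four features identically, so the coefficient is 1 even though the attribution values themselves differ.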

The Jensen–Shannon divergence was then calculated using the same data as the Spearman correlation, that is, between the consensus and the original attributions. Figure 7 shows that the Jensen–Shannon divergence between WISCA and the original attributions is usually very close to zero, meaning that the consensus and original values are distributed very similarly. This supports the conclusion that WISCA obtains attributions very similar to those of the algorithm that best explains each model.
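For reference, a minimal sketch of the Jensen–Shannon divergence between two attribution vectors follows. How attributions are converted into probability distributions is an assumption made here (absolute values normalized to sum to 1), and the function names and example values are hypothetical.

```python
import math

def to_distribution(attributions):
    """Assumption: take absolute values and normalize them to sum to 1."""
    magnitudes = [abs(a) for a in attributions]
    total = sum(magnitudes)
    return [m / total for m in magnitudes]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence, bounded in [0, 1] when base-2 logarithms are used."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical original and consensus attributions for four features.
original = to_distribution([0.60, 0.10, 0.00, 0.20])
consensus = to_distribution([0.55, 0.15, 0.02, 0.28])

print(jensen_shannon_divergence(original, consensus))   # small value: similar shapes
print(jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0: no overlap at all
```

Note that `scipy.spatial.distance.jensenshannon` returns the Jensen–Shannon distance, i.e., the square root of this divergence, so values from the two are not directly comparable.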

4.3. Linear Models vs. WISCA

It is also of particular interest to compare WISCA against a linear model (LR). As the trained linear models reached high accuracy, they were expected to yield accurate explanations. To calculate the hit rate metric for the linear model, the features were sorted in descending order of their computed coefficients, and the resulting feature importance was compared with that of WISCA. Table 4 confirms that WISCA behaves similarly to a linear model. In dataset 6, WISCA even outperforms the linear model (hit rate of 0.98 vs. 0.91), which obtained an F1 score = 91.152 and an AUC = 0.991.

This finding supports the ability of WISCA to identify the most relevant features. In addition, the explanations created with WISCA are supported by several interpretability algorithms, making them more robust and consistent.
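The ranking step used for the linear model can be illustrated as follows. This is a simplified sketch: the coefficient values are hypothetical, the features are sorted by signed coefficient as stated in the text, and `expected_in_top` is a plain top-N overlap introduced here for illustration, not the hit rate metric of this work.

```python
def rank_by_coefficient(coefficients):
    """Sort feature names by coefficient, largest first."""
    return sorted(coefficients, key=coefficients.get, reverse=True)

def expected_in_top(ranking, expected_features):
    """Fraction of expected features found in the first len(expected) positions."""
    top = ranking[:len(expected_features)]
    return sum(1 for f in top if f in expected_features) / len(expected_features)

# Hypothetical coefficients of a fitted linear model, for illustration only.
coefficients = {"F5": 2.1, "F25": 1.4, "F55": 1.9, "F7": 0.3, "F12": -0.2}
expected = {"F5", "F25", "F55"}  # expected explanation of Dataset 2 (Table 1)

ranking = rank_by_coefficient(coefficients)
print(ranking)                             # ['F5', 'F55', 'F25', 'F7', 'F12']
print(expected_in_top(ranking, expected))  # 1.0
```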

4.4. Real-World Datasets

WISCA has also been tested on real-world datasets. Three public datasets have been selected: cervical cancer risk [54] for binary classification, origin of wine classification [55] for multiclass classification, and bike rental [56] for regression. Figure 8, Figure 9 and Figure 10 show the explanations generated by WISCA across the three cases.

In the cervical cancer risk dataset, three models (LR, RF, and XGB) clearly highlighted the importance of the Schiller feature. KNN and SVM also highlighted the importance of id. Finally, the ANN model is explained by those two features, but their importance is closer to that of the rest of the features. The Schiller test is a gynecological test that uses iodine to detect cellular alterations in the cervix. It is performed during colposcopy and helps to define the boundaries between the epithelium and the lesion. It is therefore logical that the result of this test is considered an expected marker for detecting this type of cancer. On the other hand, id is the sample identifier, which, as expected, cannot be considered the sole explanation of a model.

Regarding the wine dataset, “proline” was the top-ranked feature in four of the six models. Proline is an amino acid found naturally in grapes that has recently been shown to enhance the viscosity, sweetness, and flavor of red wine. Grapes high in proline are associated with warmer climates and greater ripeness. In our view, the identification of proline as a marker of wine origin is therefore well founded. The other two models (RF and XGB) also ranked “proline” among the three most relevant features. In both cases, a small gap can be observed between “proline” and the next-ranked feature.

Finally, in the bike rental dataset, four out of six models can be clearly explained by the “registered” feature: the number of bicycle rentals increases as the number of registered users increases. On the other hand, the ANN and XGB models use the opposite reasoning to explain their behavior. They assign a strongly negative attribution to the “casual” feature, meaning that the fewer unregistered users there are, the more rentals occur. In short, WISCA was able to provide a logical explanation for the three tested datasets.

5. Conclusions

Understanding how black-box AI models work is essential. Various approaches have emerged in recent years to shed light on their decision-making processes. While these methods aim to quantify the contribution of each input feature, they employ different strategies. Addressing the disparities among these algorithms and reaching a consensus could offer a promising solution. Consensus can take different forms, such as averaging, weighting features by their relative importance, or tallying their frequency among the most relevant ones.

In this study, we evaluated five consensus functions using six synthetic datasets in different types of problems. Our findings revealed crucial limitations across all functions, such as the lack of information about the models’ accuracy, difficulties with missing or negative attributions, difficulties in attributing significantly relevant features, and a lack of a common scale for all attributions. Consequently, we proposed a novel consensus function, WISCA, that accounts for factors like class probability and attribution normalization to comprehensively explain model outcomes. WISCA outperformed others by effectively identifying the features used to generate outputs in most cases. These results underscore the importance of considering multiple factors for accurate consensus. An effective consensus function is crucial for explaining model predictions, particularly in critical fields like medicine. Nevertheless, in real-world datasets where the importance of the features is unknown, human validation is necessary to ensure the explanations are valid.

Future research will concentrate on applying consensus to more real-world datasets with unknown internal structures. In addition, WISCA could be extended to take into account the predicted class in classification problems. That way, samples accurately predicted would contribute more to the explanation than those wrongly predicted. Moreover, expanding the range of consensus functions is a promising avenue. Leveraging ML models to predict feature attributions based on existing interpretations could offer an alternative approach. This entails developing a novel model using a dataset derived from previous interpretations to effectively predict feature attributions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/make8040097/s1.

Author Contributions

Conceptualization, A.J.B.-L. and H.P.-S.; methodology, A.J.B.-L.; software, A.J.B.-L. and C.M.-C.; validation, A.J.B.-L., H.P.-S. and C.M.-C.; formal analysis, A.J.B.-L. and H.P.-S.; investigation, A.J.B.-L., H.P.-S. and C.M.-C.; resources, A.J.B.-L., H.P.-S. and C.M.-C.; data curation, A.J.B.-L., H.P.-S. and C.M.-C.; writing—original draft preparation, A.J.B.-L., H.P.-S. and C.M.-C.; writing—review and editing, A.J.B.-L., H.P.-S. and C.M.-C.; visualization, A.J.B.-L., H.P.-S. and C.M.-C.; supervision, A.J.B.-L. and H.P.-S.; project administration, H.P.-S.; funding acquisition, H.P.-S. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been funded by grants from the European Project Horizon 2020 SC1-BHC-02-2019 [REVERT, ID:848098].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code of the SIBILA tool is available at https://github.com/bio-hpc/sibila (accessed on 18 February 2026). The real-world datasets used in this work are referenced and available for download from public repositories. The synthetic datasets are provided as Supplementary Materials.

Acknowledgments

Supercomputing resources in this work were supported by the Plataforma Andaluza de Bioinformática at the University of Málaga and the supercomputing infrastructure of the NLHPC (ECM-02, Powered@NLHPC).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ANN	Artificial Neural Network
AUC	Area Under the Curve
DL	Deep Learning
EF	Expected Feature
KNN	K-Nearest Neighbors
LIME	Local Interpretable Model-agnostic Explanations
LR	Linear/Logistic Regressor
MAE	Mean Absolute Error
MDI	Mean Decrease in Impurity
ML	Machine Learning
MSE	Mean Squared Error
NEF	Non-Expected Feature
RF	Random Forest
SHAP	SHapley Additive exPlanations
SVM	Support Vector Machine
WISCA	WeIghted Scaled Consensus Attributions
XAI	eXplainable Artificial Intelligence
XGB	eXtreme Gradient Boosting

References

1. Qu, K.; Guo, F.; Liu, X.; Lin, Y.; Zou, Q. Application of machine learning in microbiology. Front. Microbiol. 2019, 10, 827.
2. Jones, D.T. Setting the standards for machine learning in biology. Nat. Rev. Mol. Cell Biol. 2019, 20, 659–660.
3. Vamathevan, J.; Clark, D.; Czodrowski, P.; Dunham, I.; Ferran, E.; Lee, G.; Li, B.; Madabhushi, A.; Shah, P.; Spitzer, M.; et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 2019, 18, 463–477.
4. Ekins, S.; Puhl, A.C.; Zorn, K.M.; Lane, T.R.; Russo, D.P.; Klein, J.J.; Hickey, A.J.; Clark, A.M. Exploiting machine learning for end-to-end drug discovery and development. Nat. Mater. 2019, 18, 435–441.
5. Elbadawi, M.; Gaisford, S.; Basit, A.W. Advanced machine-learning techniques in drug discovery. Drug Discov. Today 2021, 26, 769–777.
6. Bi, K.; Xie, L.; Zhang, H.; Chen, X.; Gu, X.; Tian, Q. Accurate medium-range global weather forecasting with 3D neural networks. Nature 2023, 619, 533–538.
7. Pathak, J.; Subramanian, S.; Berkeley, L.; Harrington, P.; Raja, S.; Chattopadhyay, A.; Mardani, M.; Kurth, T.; Hall, D.; Li, Z.; et al. FourCastNet: A data-driven model for high-resolution weather forecasts using adaptive Fourier neural operators. Ann Arbor 2022, 1001, 48109.
8. Stirnberg, R.; Cermak, J.; Kotthaus, S.; Haeffelin, M.; Andersen, H.; Fuchs, J.; Kim, M.; Petit, J.E.; Favez, O. Meteorology-driven variability of air pollution (PM 1) revealed with explainable machine learning. Atmos. Chem. Phys. 2021, 21, 3919–3948.
9. Waljee, A.K.; Higgins, P.D.R. Machine learning in medicine: A primer for physicians. Am. J. Gastroenterol. 2010, 105, 1224–1226.
10. Sidey-Gibbons, J.A.M.; Sidey-Gibbons, C.J. Machine learning in medicine: A practical introduction. BMC Med. Res. Methodol. 2019, 19, 64.
11. Rajkomar, A.; Dean, J.; Kohane, I. Machine learning in medicine. N. Engl. J. Med. 2019, 380, 1347–1358.
12. Yang, C.C. Explainable artificial intelligence for predictive modeling in healthcare. J. Healthc. Inform. Res. 2022, 6, 228–239.
13. Payrovnaziri, S.N.; Chen, Z.; Rengifo-Moreno, P.; Miller, T.; Bian, J.; Chen, J.H.; Liu, X.; He, Z. Explainable artificial intelligence models using real-world electronic health record data: A systematic scoping review. J. Am. Med. Inform. Assoc. 2020, 27, 1173–1185.
14. Zhang, Y.; Weng, Y.; Lund, J. Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics 2022, 12, 237.
15. Chen, X.Q.; Ma, C.Q.; Ren, Y.S.; Lei, Y.T.; Huynh, N.Q.A.; Narayan, S. Explainable artificial intelligence in finance: A bibliometric review. Fin. Res. Lett. 2023, 56, 104145.
16. Demajo, L.M.; Vella, V.; Dingli, A. Explainable ai for interpretable credit scoring. arXiv 2020, arXiv:2012.03749.
17. Černevičienė, J.; Kabašinskas, A. Review of multi-criteria decision-making methods in finance using explainable artificial intelligence. Front. Artif. Intell. 2022, 5, 827584.
18. Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89.
19. Bennetot, A.; Donadello, I.; El Qadi El Haouari, A.; Dragoni, M.; Frossard, T.; Wagner, B.; Sarranti, A.; Tulli, S.; Trocan, M.; Chatila, R.; et al. A Practical Tutorial on Explainable AI Techniques. ACM Comput. Surv. 2024, 57, 1–44.
20. Pavlidis, G. Unlocking the black box: Analysing the EU artificial intelligence act’s framework for explainability in AI. Law Innov. Technol. 2024, 16, 293–308.
21. Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable ai: A review of machine learning interpretability methods. Entropy 2020, 23, 18.
22. Molnar, C. Interpretable Machine Learning. 2020. Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 18 February 2026).
23. Rainio, O.; Teuho, J.; Klén, R. Evaluation metrics and statistical tests for machine learning. Sci. Rep. 2024, 14, 6086.
24. Zhou, J.; Gandomi, A.H.; Chen, F.; Holzinger, A. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics 2021, 10, 593.
25. Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; Seifert, C. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Comput. Surv. 2023, 55, 1–42.
26. Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Measures for explainable AI: Explanation goodness, user satisfaction, mental models, curiosity, trust, and human-AI performance. Front. Comput. Sci. 2023, 5, 1096257.
27. Holzinger, A.; Carrington, A.; Müller, H. Measuring the quality of explanations: The system causability scale (SCS) comparing human and machine explanations. KI-Künstl. Intell. 2020, 34, 193–198.
28. Islam, S.R.; Eberle, W.; Ghafoor, S.K. Towards quantification of explainability in explainable artificial intelligence methods. In Proceedings of the Thirty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS 2020), North Miami Beach, FL, USA, 17–20 May 2020; pp. 75–81.
29. Rosenfeld, A. Better metrics for evaluating explainable artificial intelligence. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, Virtual, 3–7 May 2021; pp. 45–50.
30. Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 2022, 16, 1–85.
31. Krishnan, M. Against interpretability: A critical examination of the interpretability problem in machine learning. Philos. Technol. 2020, 33, 487–502.
32. Vowels, M.J. Trying to outrun causality with machine learning: Limitations of model explainability techniques for identifying predictive variables. Stat 2022, 1050, 22.
33. Saeed, W.; Omlin, C. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl.-Based Syst. 2023, 263, 110273.
34. Sarlette, A.; Sepulchre, R. Consensus optimization on manifolds. SIAM J. Contr. Optim. 2009, 48, 56–76.
35. Zhang, H.; Kou, G.; Peng, Y. Soft consensus cost models for group decision making and economic interpretations. Eur. J. Oper. Res. 2019, 277, 964–980.
36. Bajusz, D.; Racz, A.; Heberger, K. Comparison of data fusion methods as consensus scores for ensemble docking. Molecules 2019, 24, 2690.
37. Burgos-Mellado, C.; Llanos, J.J.; Cárdenas, R.; Saez, D.; Olivares, D.E.; Sumner, M.; Costabeber, A. Distributed control strategy based on a consensus algorithm and on the conservative power theory for imbalance and harmonic sharing in 4-wire microgrids. IEEE Trans. Smart Grid 2019, 11, 1604–1619.
38. Ayad, H.G.; Kamel, M.S. Cumulative voting consensus method for partitions with variable number of clusters. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 160–173.
39. Ayad, H.G.; Kamel, M.S. On voting-based consensus of cluster ensembles. Pattern Recogn. 2010, 43, 1943–1953.
40. Fischman, J.B. Estimating preferences of circuit judges: A model of consensus voting. J. Law Econ. 2011, 54, 781–809.
41. Lekadir, K.; Frangi, A.F.; Porras, A.R.; Glocker, B.; Cintas, C.; Langlotz, C.P.; Weicken, E.; Asselbergs, F.W.; Prior, F.; Collins, G.S.; et al. FUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 2025, 388, e081554.
42. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
43. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347.
44. Ribeiro, M.T.; Singh, S.; Guestrin, C. Why Should I Trust You? Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 13–17 August 2016; pp. 1135–1144.
45. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
46. Strumbelj, E.; Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inform. Syst. 2014, 41, 647–665.
47. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3319–3328.
48. Mothilal, R.K.; Sharma, A.; Tan, C. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 607–617.
49. Zamani, M.G.; Nikoo, M.R.; Niknazar, F.; Al-Rawas, G.; Al-Wardy, M.; Gandomi, A.H. A multi-model data fusion methodology for reservoir water quality based on machine learning algorithms and bayesian maximum entropy. J. Clean. Prod. 2023, 416, 137885.
50. Röcken, S.; Zavadlav, J. Accurate machine learning force fields via experimental and simulation data fusion. npj Comput. Mater. 2024, 10, 69.
51. Steyaert, S.; Pizurica, M.; Nagaraj, D.; Khandelwal, P.; Hernandez-Boussard, T.; Gentles, A.J.; Gevaert, O. Multimodal data fusion for cancer biomarker discovery with deep learning. Nat. Mach. Intell. 2023, 5, 351–362.
52. Singh, A.; Gaurav, K. Deep learning and data fusion to estimate surface soil moisture from multi-sensor satellite images. Sci. Rep. 2023, 13, 2251.
53. Banegas-Luna, A.; Pérez-Sánchez, H. SIBILA: Automated Machine-Learning-Based Development of Interpretable Machine-Learning Models on High-Performance Computing Platforms. AI 2024, 5, 2353–2374.
54. Fernandes, K.; Cardoso, J.; Fernandes, J. Transfer learning with partial observability applied to cervical cancer screening. In Iberian Conference on Pattern Recognition and Image Analysis; Springer International Publishing: Cham, Switzerland, 2017; pp. 243–250.
55. Aeberhard, S.; Forina, M. Comparative analysis of statistical pattern recognition methods in high dimensional settings. Pattern Recogn. 1994, 27, 1065–1077.
56. Fanaee-T, H. Event labeling combining ensemble detectors and background knowledge. Lect. Notes Artif. Int. 2014, 2, 113–127.

Figure 1. Families of functions that can implement the classification correction factor.

Figure 2. Evaluation metrics of the models.

Figure 3. Score of the consensus functions, measured using the hit rate metric.

Figure 4. Average distance between expected and non-expected features across the datasets.

Figure 5. Score of the interpretability algorithms, measured by the hit rate metric.

Figure 6. Spearman correlation between WISCA and the interpretability algorithms.

Figure 7. Jensen–Shannon divergence between WISCA and the interpretability algorithms.

Figure 8. Consensus explanations returned by WISCA on the cervical cancer risk dataset.

Figure 9. Consensus explanations returned by WISCA on the wine dataset.

Figure 10. Consensus explanations returned by WISCA on the bike rental dataset.

Table 1. Description of the synthetic datasets used in this work.

Synthetic Datasets	Type	Number of Samples	Number of Features	Expected Explanation ¹
Dataset 1	Binary	2000	20	F2, F3, F9, F17
Dataset 2	Binary	1500	75	F5, F25, F55
Dataset 3	Regression	2500	60	F1, F56, F58, F60
Dataset 4	Regression	2000	30	F19, F21, F24, F26
Dataset 5	Multiclass	1000	10	F3, F4, F7, F10
Dataset 6	Multiclass	2500	30	F12, F16, F22, F27

¹ Features used to calculate the target.

Table 2. Formulas implemented in the datasets to calculate the target.

Dataset	Formula
Dataset 1	if $\frac{F2 \cdot F3}{F9} < F17$ then 0 else 1
Dataset 2	if $F55^{3} + F5^{2} - F25 < 0$ then 0 else 1
Dataset 3	$\sin(F60) + \cos(F58) + \tanh(F56) + F1$
Dataset 4	$F19^{4} - F21^{3} + F24^{2} - F26$
Dataset 5 ¹	if $X < -4$ then 0 else if $X \in [-4, 0)$ then 1 else if $X \geq 0$ then 2
Dataset 6 ²	if $X \leq -12$ then 0 else if $X \in (-12, -6]$ then 1 else if $X \in (-6, 0]$ then 2 else if $X \in (0, 6]$ then 3 else if $X \in (6, 14]$ then 4 else if $X > 14$ then 5

¹ $X = (F7 \cdot 2) + (F3 / 3) + F4 - F10$. ² $X = (F27 \cdot 13) - F22 + F16^{2} - (F12 \cdot 1.5)$.
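To make the rules in Table 2 concrete, the sketch below implements three of the target formulas (datasets 1, 3, and 5) in Python. Only the formulas come from the table; the function names, the dictionary-based row layout, and the uniform sampling range are assumptions made for illustration, since the feature distributions are not given here.

```python
import math
import random

def target_dataset1(f):
    """Binary target: 0 if (F2 * F3) / F9 < F17, else 1."""
    return 0 if (f["F2"] * f["F3"]) / f["F9"] < f["F17"] else 1

def target_dataset3(f):
    """Regression target: sin(F60) + cos(F58) + tanh(F56) + F1."""
    return math.sin(f["F60"]) + math.cos(f["F58"]) + math.tanh(f["F56"]) + f["F1"]

def target_dataset5(f):
    """Multiclass target from X = 2*F7 + F3/3 + F4 - F10."""
    x = f["F7"] * 2 + f["F3"] / 3 + f["F4"] - f["F10"]
    if x < -4:
        return 0
    return 1 if x < 0 else 2

# Assumption: features drawn uniformly from [-5, 5]; an F9 of exactly 0 would
# need special handling in Dataset 1 but is not expected with continuous draws.
rng = random.Random(0)
sample = {f"F{i}": rng.uniform(-5, 5) for i in range(1, 61)}
print(target_dataset1(sample), target_dataset3(sample), target_dataset5(sample))
```

All remaining features of each row play no role in the target, which is what makes them noise variables with a known ground truth.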

Table 3. Statistical comparison of consensus functions across the 36 selected model–dataset instances using the hit rate metric.

Method	Mean Hit Rate	95% CI	Median	IQR	p-Value vs. WISCA	Holm-Adjusted p
WISCA	0.991	[0.981, 1.000]	1.000	0.000	-	-
Arithmetic	0.954	[0.919, 0.989]	1.000	0.023	0.005	0.009
Voting	0.956	[0.908, 1.000]	1.000	0.000	0.053	0.053
Ranking	0.674	[0.564, 0.783]	0.716	0.480	<0.001	<0.001
Geometric	0.408	[0.290, 0.527]	0.556	0.695	<0.001	<0.001
Harmonic	0.218	[0.122, 0.315]	0.067	0.157	<0.001	<0.001

Table 4. Hit rate score of LR and WISCA.

Dataset	LR	WISCA
Dataset 1	1.00	1.00
Dataset 2	1.00	1.00
Dataset 3	1.00	1.00
Dataset 4	1.00	1.00
Dataset 5	1.00	1.00
Dataset 6	0.91	0.98

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

