Article

Benchmarking Methods for Pointwise Reliability

by Cláudio Correia 1, Simão Paredes 1,2,*, Teresa Rocha 1,2, Jorge Henriques 2 and Jorge Bernardino 1,2,*
1 Coimbra Institute of Engineering, Polytechnic University of Coimbra, Rua da Misericórdia, Lagar dos Cortiços, S. Martinho do Bispo, 3045-093 Coimbra, Portugal
2 CISUC, Center for Informatics and Systems of University of Coimbra, University of Coimbra, Pólo II, 3030-290 Coimbra, Portugal
* Authors to whom correspondence should be addressed.
Information 2025, 16(4), 327; https://doi.org/10.3390/info16040327
Submission received: 25 February 2025 / Revised: 15 April 2025 / Accepted: 16 April 2025 / Published: 20 April 2025
(This article belongs to the Special Issue Real-World Applications of Machine Learning Techniques)

Abstract

The growing interest in machine learning in a critical domain like healthcare emphasizes the need for reliable predictions, as decisions based on these outputs can have significant consequences. This study benchmarks methods for assessing pointwise reliability, focusing on data-driven techniques based on the density principle and the local fit principle. These methods evaluate the reliability of individual predictions by analyzing their similarity to the training data and by assessing the performance of the model in local regions. Aiming to establish a standardized comparison, the study introduces a benchmark framework that combines error rate evaluations across reliability intervals with t-distributed Stochastic Neighbor Embedding visualizations to further validate the results. The results demonstrate that methods combining density and local fit principles generally outperform those relying on a single principle, achieving lower error rates for high-reliability predictions. Furthermore, the study identifies challenges such as the adjustment of method parameters and clustering limitations and provides insight into their impact on reliability assessments.

1. Introduction

Machine learning (ML) has gained significant traction in healthcare, where it has the potential to improve decision-making processes, increase efficiency, and support professionals in high-stakes environments [1,2]. However, the adoption of ML models in this field remains limited due to concerns about the reliability of their predictions. In domains where decisions based on these predictions can directly impact lives, the reliability of individual predictions is not only desirable, but essential. Traditional performance metrics such as accuracy or the F1 score often fail to address this need [3]. These metrics assess the overall performance of models on datasets, but do not capture the reliability associated with individual predictions. This gap has driven the development of methods designed for assessing pointwise reliability, which is critical in contexts where prediction errors can have significant consequences.
While the concept of reliability has been used in various contexts [4,5,6,7,8], in this work, we define reliability as the probability that a prediction of a model is correct for a given new instance, expressed as a continuous value within a defined range (e.g., between 0 and 1).
Assessing the reliability of individual predictions, or pointwise reliability, aims to determine when the output of the model is reliable or when it requires additional scrutiny. This problem has been addressed through various approaches, among which two principles can be identified [4]: (1) the density principle, which evaluates prediction reliability by examining how similar a test instance is to the training data; (2) the local fit principle, which evaluates how well the model fits training examples in the local neighborhood of a test instance.
Although various studies have proposed methods based on these principles, their evaluation has often been carried out independently, with different datasets, models, and validation criteria [4,5,9]. Consequently, there is little direct comparison between techniques, and many methods are evaluated in domain-specific environments. To address this gap, we performed a benchmark of pointwise reliability assessment methods using a consistent dataset and model. This approach provides a way to directly compare these techniques and analyze their strengths and limitations under similar conditions.
This study focuses on benchmarking data-driven methods that can be deployed when the model is in its production stage to assess pointwise reliability. These methods rely primarily on the information contained in the training data and do not require additional models to be trained. A key feature of our benchmark is the integration of quantitative analysis, such as the evaluation of error rates across reliability scores, with the t-distributed Stochastic Neighbor Embedding (t-SNE) visualization technique. These visualizations provide interpretable representations of the data structure, enabling assessment of whether methods are consistent with the underlying data distribution and providing insight into the relationship between predictions and their reliability scores.
Therefore, the main contribution of this paper is a benchmark of pointwise reliability assessment methods based on density and local fit principles, evaluated under consistent conditions, providing a comparison of their strengths and limitations.
The remainder of this paper is structured as follows: Section 2 reviews related work on pointwise reliability assessment. Section 3 introduces the benchmark framework and details the implemented methods and parameter selection strategies. Section 4 discusses the results. Finally, Section 5 concludes the paper and outlines future work.

2. Related Work

There are several proposed approaches to assessing pointwise reliability that fall under the density and local fit principles. Additionally, some methods incorporate domain expertise to further improve reliability assessments.
The density principle evaluates reliability by analyzing the concentration of data points, where regions with higher density and close to the target are considered more reliable. Classical approaches, such as k-nearest neighbors [10], which calculate distances to evaluate the density of nearby points, and DBSCAN [11], which identifies dense clusters to distinguish reliable from sparse regions, are examples of methods that can be used to evaluate pointwise reliability.
Other methods have been developed specifically to determine pointwise reliability. Nicora et al. proposed a model-independent, density-based reliability measure built on a bounding box approach, which has been applied in different contexts [4,12]. Their method evaluates the proximity of a new instance to the “boundary” instances in the training data for each feature, considering both internal and external boundaries. The reliability score is calculated as the proportion of features of the new instance that fall outside the boundaries relative to the total number of features. The method, validated on clinical datasets and single-cell genomics, effectively discriminates between reliable and unreliable predictions and emphasizes that relying solely on classifier-specific metrics can be misleading.
van der Waa et al. [9] introduced a density-based framework that computes reliability by comparing test points to their k-nearest neighbors in a memory set of past cases. The process involves selecting the neighbors that are most similar to the new instance, separating them based on agreement or disagreement with the current prediction, and then computing the confidence score based on the similarity of these neighbors. Experiments on various datasets have shown that the framework provides reliable and interpretable confidence scores.
Another density-based approach was proposed by Kailkhura et al. [13] for material discovery. The authors used the Gower metric to quantify the similarity between samples, overcoming the challenge of mixed continuous and categorical features in the data. The reliability was then calculated as the ratio of the average Gower distance of a test sample to samples in its own class to the total average distance to both its own class and other classes, giving a higher score if the sample is closer to its own class.
Schulam et al. [3] presented the Resampling Uncertainty Estimation method, a model-agnostic approach that estimates the variability in the predictions as if the model had been trained on different subsets of the training data. This is achieved by using the gradients and Hessian of the model’s loss function to create a simulated ensemble of predictions based on resampling techniques.
Another method, developed by van der Waa et al. [8], evaluates the similarity of new data points to previously observed data points, for which the performance of the model is known, using a distance-based approach. The method calculates a confidence score by weighting the influence of neighboring data points based on their proximity. It was validated by comparing it with a baseline that employed stacked machine learning models.
For online reliability applications, Bahrami and Tuncel [14] proposed a density method that dynamically estimates the density of error samples to prioritize smaller errors and reduce outliers. Although primarily focused on regression in noisy environments, this online approach indirectly improves reliability by maintaining robustness during real-time operations.
The local fit principle focuses on verifying the consistency of the model. Techniques based on this principle typically rely on available data, use the model directly, or use an additional model to assess reliability.
In approaches relying on the data, Briesemeister et al. [15] proposed two methods, CONFINE and CONFIVE, to compute pointwise reliability independent of the underlying model. Both approaches aim to generate confidence scores for individual predictions by analyzing the characteristics of nearby data points, making them independent of the specific model used. The CONFINE method works by selecting the nearest neighbors of a test instance from the training data and calculating the mean square error of the prediction in that local region, essentially transferring the local error information from the neighboring instances to the new test point. On the other hand, the CONFIVE method evaluates the variance in the response values (labels) of the nearest neighbors, assuming that greater variance in this local region is correlated with lower prediction reliability. Both methods have been validated on diverse datasets, demonstrating their robustness as model-agnostic reliability measures.
Some techniques integrate information directly from the model or use a complementary model to improve reliability estimation. Adomavicius and Wang [16] proposed a model-driven approach where a primary machine learning model predicts the outcomes, and a secondary model, trained on the prediction errors from a validation dataset, estimates the expected absolute error for each prediction. This secondary model generates reliability scores, which their experiments showed to outperform data-driven approaches such as the local variance and average distance of neighbors, particularly in complex data scenarios.
Similarly, Myers [17] introduced a model-driven approach to assessing reliability in a clinical context. The method identifies unreliable predictions by constructing a separate risk metric (a new model), which is based on summary statistics derived from the same training data used to develop the original model. The model estimates the probability of adverse patient outcomes using prognostic features, and the additional metric helps to classify predictions that may be less trustworthy.
Some methods combine the density and local fit principle to create more comprehensive reliability assessments using the structure of the data and the behavior of the model in local regions. For example, Henriques et al. [5] proposed a model-agnostic method to assess pointwise reliability in the context of cardiovascular risk prediction. Their approach integrates density and local fit measures to estimate the reliability in individual predictions, using a pre-trained neural network to classify patients into survival or death categories after myocardial infarction. The reliability of each prediction is quantified using a scalar value derived from three components. The density component evaluates the proximity of similar training samples to the instance being assessed, using a maximum threshold to determine whether the region is densely populated. The data agreement component measures the consistency of the actual outputs of neighboring training samples, ensuring that they agree with the model prediction for the instance. Finally, the ML agreement component examines the stability of the model’s predictions for nearby instances, increasing confidence when consistent outputs are generated. This combined reliability measure has been validated in cardiovascular risk assessment tasks and effectively distinguishes between reliable and unreliable predictions.
Some methods also incorporate domain knowledge to improve the reliability of predictions. Valente et al. [18] presented a method for predicting mortality risk after acute coronary syndrome that incorporates a pointwise reliability score. The method generates decision rules based on risk factors, which are trained using a machine learning classifier to predict the probability of each rule being correct for individual patients. These probabilities are combined to calculate the patient’s mortality risk and the reliability of the prediction. Pointwise reliability is calculated by evaluating the difference between the average acceptance probabilities of the rules predicting positive and negative outcomes, providing an estimate of the reliability in the model’s predictions. Although this approach improves reliability, it is specific to the behavior and structure of the model, making it difficult to apply directly to different machine learning models.
Although the methods reviewed provide valuable approaches to evaluating pointwise reliability, they are often evaluated independently or within specific contexts. This work aims to address these limitations by benchmarking a selection of techniques using density and local fit under standardized conditions. To this end, we implemented techniques based on their data-driven nature and model-agnostic applicability, making them suitable for deployment when the model is in production. We started by implementing clustering-based methods [10,11], due to their recent lack of attention in the literature. Subsequently, we implemented Distance-Based Methods, specifically evaluating the approach proposed by [9], which uses a weighted distance metric, and the method proposed by [5], which combines multiple metrics and integrates density and local fit principles.

3. Methodology

This section describes the benchmark, the dataset, and the implemented methods for reliability assessment. The code used to implement these methods, along with scripts for data processing, reliability computation, and result visualization, is available in the GitHub (https://github.com/Oak10/pointwise-benchmark/releases/tag/1.1.1 accessed on 17 April 2025) repository [19].
Figure 1 illustrates the workflow used in this study. The process begins with data preparation, as described in Section 3.2, where the dataset is divided into a training set and a test set. The training set is used to train the ML model, while the test set is employed to evaluate the predictions of the model. During the prediction phase, individual instances are assessed using both the ML model and the chosen pointwise reliability method. The results, as described in Section 3.1, are then aggregated into reliability intervals for further evaluation, including error rate analysis and visualizations.

3.1. Benchmark

To evaluate the methods for assessing pointwise reliability, we propose a simple approach that groups predictions into 10% intervals based on their reliability scores (e.g., [0.00, 0.10] and ]0.90, 1.00]). For each interval, predictions are categorized as follows:
  • Correct (0): true label 0 with correct prediction.
  • Incorrect (0): true label 0 with incorrect prediction.
  • Correct (1): true label 1 with correct prediction.
  • Incorrect (1): true label 1 with incorrect prediction.
This categorization enables a detailed evaluation of the performance of the methods by assessing their ability to differentiate between high-reliability and low-reliability predictions. It also allows us to verify whether predictions in high-reliability intervals correspond to low error rates (as theoretically expected). Furthermore, it provides insight into the distribution of errors across reliability intervals, helping to identify potential inconsistencies or a lack of capability in the methods.
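To make the benchmark procedure concrete, the following minimal Python sketch illustrates how predictions could be grouped into 10% reliability intervals and counted by the four categories above. The function name and the use of NumPy and pandas are illustrative and do not necessarily reflect the released implementation [19].

```python
import numpy as np
import pandas as pd

def categorize_by_reliability(y_true, y_pred, reliability, n_bins=10):
    """Group predictions into reliability intervals and count the four
    outcome categories used in the benchmark (illustrative sketch)."""
    # Interval edges: [0.0, 0.1], ]0.1, 0.2], ..., ]0.9, 1.0]
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # right=True places 0.0 in the first bin and 1.0 in the last bin
    bins = np.clip(np.digitize(reliability, edges[1:], right=True), 0, n_bins - 1)

    rows = []
    for b in range(n_bins):
        mask = bins == b
        correct = y_true[mask] == y_pred[mask]
        rows.append({
            "interval": f"]{edges[b]:.1f}, {edges[b+1]:.1f}]" if b > 0
                        else f"[{edges[b]:.1f}, {edges[b+1]:.1f}]",
            "correct_0": int(np.sum((y_true[mask] == 0) & correct)),
            "incorrect_0": int(np.sum((y_true[mask] == 0) & ~correct)),
            "correct_1": int(np.sum((y_true[mask] == 1) & correct)),
            "incorrect_1": int(np.sum((y_true[mask] == 1) & ~correct)),
        })
    return pd.DataFrame(rows)
```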
In addition to the quantitative analysis, we propose a visual representation of the data to validate reliability classifications. Real-world datasets often have multiple features, making it challenging to represent them in two dimensions. Dimensionality reduction techniques, which map high-dimensional feature spaces to low-dimensional ones, such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection [20,21], have been used in machine learning to investigate the relationship between training and test data.
Among these alternative approaches, we present visualizations using t-SNE due to its ability to maintain consistency between visualizations when initialized with the same parameters. As shown in Figure 2, the t-SNE visualization presents the training and validation datasets, highlighting the predictions by class. When used in the benchmark, this visualization helps to evaluate the quality of the methods used. For example, instances classified as highly reliable should visually cluster closely with neighbors of the same class, while errors in low-reliability intervals should appear as outliers (or clusters with neighbors from a different class) in the visualization.
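A minimal sketch of how such a visualization could be produced with scikit-learn's t-SNE implementation is shown below; the parameter values (the fixed random_state and PCA initialization) are assumptions used here only to illustrate how a reproducible embedding of the combined training and validation data can be obtained.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(X_train, y_train, X_val, y_val, random_state=42):
    """Project training and validation data into 2-D with t-SNE and plot
    them by class (illustrative sketch; parameters are assumptions)."""
    X_all = np.vstack([X_train, X_val])
    # A fixed random_state (with identical inputs) keeps the embedding reproducible
    emb = TSNE(n_components=2, init="pca", random_state=random_state).fit_transform(X_all)
    emb_train, emb_val = emb[:len(X_train)], emb[len(X_train):]

    fig, ax = plt.subplots(figsize=(7, 6))
    for cls, marker in [(0, "s"), (1, "o")]:
        # training data: unfilled markers; validation data: filled markers
        ax.scatter(*emb_train[y_train == cls].T, marker=marker, facecolors="none",
                   edgecolors="grey", label=f"train, class {cls}")
        ax.scatter(*emb_val[y_val == cls].T, marker=marker, label=f"validation, class {cls}")
    ax.legend()
    plt.show()
```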

3.2. Dataset

The dataset used in this study is the Patient Treatment Classification Dataset [22], which is composed of health records from a private hospital in Indonesia. The data are composed of laboratory test results (features) and corresponding treatment decisions (labels), categorized as “in-care” or “out-care”. To evaluate pointwise reliability, the dataset was split into two subsets prior to training: 10% of the data (442 instances) were reserved as a validation set for the final evaluation of the reliability measures, while the remaining 90% (3970 instances) were used to train the baseline model.
In terms of class distribution, the dataset shows a moderate class imbalance, with approximately 60% of the samples classified as negative (Class 0, corresponding to “out-care”). After correlation analysis, HAEMATOCRIT and MCH were excluded from the feature set due to their high correlation with other features, which indicated redundancy. The final feature set includes numerical features (HAEMOGLOBINS, ERYTHROCYTE, LEUCOCYTE, THROMBOCYTE, MCHC, MCV, and AGE) and one categorical feature (SEX).
Table 1 presents key descriptive statistics for the numerical features of the dataset.
Since the primary focus of this study was to assess the reliability of individual predictions independent of the ML model itself, a simple logistic regression classifier was chosen as the baseline model. The model was trained on the training set, with numerical features normalized to a scale between 0 and 1, and the categorical variable SEX encoded as 1 for males and 0 for females. After training, the classifier was exported to ensure consistent predictions throughout the study. The model achieved an accuracy of 73.0% and an Area Under the Receiver Operating Characteristic curve (AUC-ROC) score of 0.76, indicating moderate discrimination capability.
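For illustration, the following sketch outlines how the baseline could be reproduced with scikit-learn; the file name, the label column name ("SOURCE"), and its encoding are assumptions about the dataset layout, and the exact preprocessing in the released code may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

df = pd.read_csv("patient_treatment.csv")           # hypothetical file name
df = df.drop(columns=["HAEMATOCRIT", "MCH"])        # removed due to high correlation
df["SEX"] = (df["SEX"] == "M").astype(int)          # 1 = male, 0 = female (assumed coding)

# label column name and encoding are assumptions about the CSV layout
y = (df["SOURCE"] == "in").astype(int)              # 1 = "in-care", 0 = "out-care"
X = df.drop(columns=["SOURCE"])

# 90% for training, 10% held out for the reliability evaluation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=0)

scaler = MinMaxScaler().fit(X_train)                # scale features to [0, 1]
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)

proba = clf.predict_proba(scaler.transform(X_val))[:, 1]
print("accuracy:", accuracy_score(y_val, (proba > 0.5).astype(int)))
print("AUC-ROC :", roc_auc_score(y_val, proba))
```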
Figure 3 presents the population distribution and error rates of the predictions made by the model using the validation dataset, which are grouped into reliability intervals. The bar plots represent the total number of predictions per class in each interval, while the line plots indicate the corresponding error rates. The reliability values were directly derived from the predicted probabilities of the model.
As visualized in Figure 3, error rates for both classes generally decrease with increasing reliability intervals. The error rate for Class 0 consistently decreases, suggesting calibration, while Class 1 shows greater variability. The population distribution reveals fewer samples in higher reliability intervals, which is expected. It should be noted that, even in the highest reliability interval, the model still shows a relatively high error rate of approximately 20% for Class 0 and 33% for Class 1.

3.3. Pointwise Reliability Methods

Pointwise reliability was assessed with methods based on the density principle and with methods combining the density and local fit principles.
These methods are designed to evaluate the reliability of individual predictions independent of the underlying model, and their score (reliability) is normalized or adjusted to output a scalar value in the range [0, 1].

3.3.1. Subtractive Clustering

The first method uses subtractive clustering [23,24], where each data point is evaluated as a potential cluster center based on the density of nearby points within a specified influence range ($r_a$). The algorithm consists of three main iterative steps. In step 1, the potential (density, $D_i$) is calculated for each data point $x_i$ as follows:
$$D_i = \sum_{j=1}^{n} \exp\left(-\frac{\|x_i - x_j\|^2}{(r_a/2)^2}\right)$$
The second step selects the point with the highest potential as a cluster center and suppresses nearby potentials. The data point with the highest potential ($D_1$) is chosen as the first cluster center ($x_{c1}$). Then, the potentials of points close to this cluster center are suppressed to avoid redundant centers, according to the following:
$$D_i = D_i - D_{c1}\,\exp\left(-\frac{\|x_i - x_{c1}\|^2}{(r_b/2)^2}\right)$$
where $D_{c1}$ is the density of the selected cluster center $x_{c1}$, and $r_b = \text{squash\_factor} \times r_a$ (the default squash factor is set to 1.25).
The third step evaluates and selects subsequent cluster centers iteratively. After density suppression, the candidate point with the highest remaining potential is selected, and the following acceptance criterion is applied: if the candidate potential $D_{ck}$ satisfies
$$\frac{d_{min}}{r_a} + \frac{D_{ck}}{D_1} \geq 1,$$
the point is accepted as a new cluster center (where $d_{min}$ is the shortest distance to previously identified cluster centers). If not, its potential is suppressed, and the search continues. This process repeats until no remaining point exceeds a predefined rejection threshold (set by a rejection ratio, default = 0.15).
The workflow for this method includes the following:
  • Parameter Optimization: A grid search was performed over the r a parameter (starting with the values 0.05 and ending with 0.5, with 18 additional values in between) to maximize the number of valid clusters, ensuring that no cluster contained fewer than a predefined minimum number (3) of members.
  • Clustering and Assignment: Using the optimal r a , the algorithm assigned each data point to the nearest cluster if it was within the range; otherwise, points were marked as unassigned.
  • Reliability Computation: Reliability for each new instance is computed based on the following:
$$R(x) = \min\left(\frac{|C|}{\text{minimum cluster size}},\ 1\right)$$
where $|C|$ is the size of the cluster to which the instance belongs. Points outside any cluster received a reliability score of 0.
The “minimum cluster size” parameter was determined as described in Section 3.4.
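A condensed Python sketch of this workflow is given below. It implements the potential computation, the suppression step, and the reliability score of Equation (4), but simplifies the acceptance criterion to the rejection threshold alone; the parameter values are illustrative rather than taken from the released code.

```python
import numpy as np

def subtractive_clustering(X, ra=0.3, squash=1.25, reject_ratio=0.15):
    """Condensed sketch of subtractive clustering (Chiu, 1994);
    the acceptance criterion is simplified to the rejection threshold only."""
    rb = squash * ra
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)    # pairwise squared distances
    D = np.exp(-d2 / (ra / 2) ** 2).sum(axis=1)                   # potentials (step 1)
    D1, centers = D.max(), []
    while True:
        c = int(np.argmax(D))
        if D[c] < reject_ratio * D1:                               # rejection threshold
            break
        centers.append(c)
        D = D - D[c] * np.exp(-d2[c] / (rb / 2) ** 2)              # suppression (step 2)
    return X[centers]

def subtractive_reliability(x_new, X_train, centers, ra, min_cluster_size):
    """Reliability score from Equation (4): cluster size over minimum size, capped at 1."""
    # assign training points to their nearest center (unassigned if farther than ra)
    d_train = np.linalg.norm(X_train[:, None, :] - centers[None, :, :], axis=-1)
    assign = np.where(d_train.min(axis=1) <= ra, d_train.argmin(axis=1), -1)
    d_new = np.linalg.norm(centers - x_new, axis=1)
    if d_new.min() > ra:
        return 0.0                                                 # outside every cluster
    cluster_size = np.sum(assign == d_new.argmin())
    return min(cluster_size / min_cluster_size, 1.0)
```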

3.3.2. DBSCAN Method

The second method uses the DBSCAN algorithm [11], which identifies clusters based on the density of points in a neighborhood defined by a distance parameter (“eps”) and a minimum number of points.
The workflow for this method includes the following:
  • Parameter Optimization: A grid search was performed on the “eps” (10 values ranging from 0.01 to 0.5 in equal intervals) and “min samples” (4 values, starting at 2 and increasing by 2 in each iteration) to maximize the number of clusters.
  • Clustering and Assignment: Using optimal parameters, DBSCAN assigned points to clusters or labeled them as noise.
  • Reliability Computation: For each new instance, the nearest core point (a dense point identified by DBSCAN) was identified using the Euclidean distance. If this core point belonged to a valid cluster and the instance was within the defined distance threshold (“eps”), the instance was considered part of the cluster. Otherwise, it was treated as unassigned. Reliability was calculated as the size of the cluster divided by a predefined threshold, “minimum cluster size”, with scores capped at 1. The calculation follows the same principles previously described (Equation (4)) for the subtractive clustering method.
The “minimum cluster size” parameter was determined as described in Section 3.4.
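The following sketch illustrates this reliability computation using scikit-learn's DBSCAN; for brevity, the clustering is fitted inside the function (in practice it would be fitted once and reused), and the parameter values shown are illustrative defaults rather than the grid-search results.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_reliability(x_new, X_train, eps=0.1, min_samples=4, min_cluster_size=9):
    """Sketch: reliability of a new instance from the cluster of its nearest
    DBSCAN core point, following Equation (4)."""
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X_train)
    core_points = X_train[db.core_sample_indices_]
    core_labels = db.labels_[db.core_sample_indices_]
    if len(core_points) == 0:
        return 0.0
    d = np.linalg.norm(core_points - x_new, axis=1)
    nearest = int(np.argmin(d))
    if d[nearest] > eps:
        return 0.0                                     # unassigned: outside every cluster
    cluster_size = int(np.sum(db.labels_ == core_labels[nearest]))
    return min(cluster_size / min_cluster_size, 1.0)
```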

3.3.3. Distance-Based Method

The third method evaluates reliability by focusing on the density of neighbors within a given distance threshold. For each new instance, the Euclidean distances to all training points are computed, and neighbors within the distance threshold are identified.
The workflow for this method is as follows:
  • Distance Calculation: The Euclidean distances from the new instance to all points in the training data are computed.
  • Neighbor Identification: Points belonging to the same predicted class as the new instance are identified, and the number of these points within the defined distance threshold is calculated.
  • Reliability Computation: Reliability is calculated as the total number of neighbors of the same class within the distance, divided by the “minimum cluster size” parameter. Reliability scores are capped at 1 for sufficiently dense regions.
The distance threshold and the “minimum cluster size” parameters were determined as described in Section 3.4.
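A minimal sketch of this computation is shown below; the default threshold values correspond to those reported in Section 3.4 but are passed here only as illustrative defaults.

```python
import numpy as np

def distance_reliability(x_new, y_pred_new, X_train, y_train,
                         dist_threshold=0.107, min_cluster_size=9):
    """Sketch of the Distance-Based Method: count same-class training
    neighbours within the distance threshold and cap the score at 1."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    same_class_neighbours = np.sum((d <= dist_threshold) & (y_train == y_pred_new))
    return min(same_class_neighbours / min_cluster_size, 1.0)
```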

3.3.4. ICM Method

The fourth method presented is an adaptation of the interpretable confidence measure (ICM) framework, based on the work of van der Waa et al. [9]. The method evaluates the reliability of a prediction by analyzing the support or opposition provided by the nearest neighbors of a new instance in the training data.
The method relies on two main steps: identifying the nearest neighbors of a new instance and using their distances to compute a weighted confidence score. Closer neighbors are given more influence, reflecting their proximity to the instance.
The workflow for this method is as follows:
  • Neighbor Selection: The k-nearest neighbors of the new instance are identified from the training data using the Euclidean distance. These neighbors are then divided into two groups: $S^+$, representing neighbors that share the predicted label, and $S^-$, representing neighbors with a label different from the predicted one.
  • Sigma Calculation: $\sigma$ is computed as the mean squared distance between the new instance $x$ and its neighbors, as shown in
    $$\sigma = \frac{1}{k} \sum_{x_i \in N(x)} \|x - x_i\|^2,$$
    where $N(x)$ is the set of all k-nearest neighbors.
  • Weight Calculation: Each neighbor is assigned a weight based on its distance to the new instance, with the closer neighbors having greater influence. The weight of a neighbor $x_i$ is computed using
    $$w(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{\sigma^2}\right)$$
  • Weighted Contributions for Support and Opposition: The average weighted contributions of the supporting ($S^+$) and opposing ($S^-$) neighbors are calculated as follows:
    $$W^+ = \frac{1}{|S^+|} \sum_{x_i \in S^+} w(x, x_i), \qquad W^- = \frac{1}{|S^-|} \sum_{x_i \in S^-} w(x, x_i),$$
    where $|S^+|$ and $|S^-|$ denote the number of elements in $S^+$ and $S^-$, respectively.
  • Reliability Computation: The confidence score, $C(x)$, is calculated as the difference between the weighted contributions of the supporting and opposing neighbors:
    $$C(x) = W^+ - W^-$$
  • The confidence score is then rescaled to the range [0, 1] using
    $$R = \begin{cases} 0, & \text{if } C(x) < 0 \\ C(x), & \text{otherwise} \end{cases}$$
The value of k used in this method was selected using the procedure described in Section 3.4.
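For illustration, the steps above can be condensed into the following sketch, which assumes NumPy arrays for the training data and labels; the default value of k is illustrative only.

```python
import numpy as np

def icm_reliability(x_new, y_pred_new, X_train, y_train, k=9):
    """Sketch of the ICM-based confidence score, rescaled to [0, 1]."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nn = np.argsort(d)[:k]                                   # k-nearest neighbours
    sigma = np.mean(d[nn] ** 2)                              # mean squared neighbour distance
    w = np.exp(-d[nn] ** 2 / sigma ** 2)                     # distance-based weights
    support = y_train[nn] == y_pred_new                      # neighbours sharing the predicted label
    w_plus = w[support].mean() if support.any() else 0.0
    w_minus = w[~support].mean() if (~support).any() else 0.0
    c = w_plus - w_minus                                     # confidence score (Equation (8))
    return max(float(c), 0.0)                                # rescaled to [0, 1]
```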

3.3.5. Density and Local Fit Method

For the last method, we made an implementation based on the method proposed by Henriques et al. [5], which combines density measures with local fit principles to assess pointwise reliability. This method calculates a reliability score for a new instance by integrating three components: density, data agreement, and machine learning agreement. These components evaluate the consistency of the new instance within the training data distribution, the agreement with its neighbors, and the consistency of the model predictions.
The workflow implemented for this method is as follows:
  • Density Component: The density is computed by counting how many training data points fall within a predefined (Euclidean) distance threshold around the new instance. The count is limited to a maximum number of neighbors (“minimum cluster units”), ensuring that the density score is normalized to a range between 0 and 1.
  • Data Agreement Component: The data agreement is calculated as the proportion of neighbors that share the same label as the predicted label of the new instance based on the distance threshold.
  • ML Agreement Component: The ML agreement measures the consistency between the model’s predictions for the neighbors and the predicted label of the new instance (based on neighbors with consistent predictions divided by total neighbors within the threshold).
  • Reliability Computation: The final reliability score is calculated by multiplying the three components. This formulation ensures that all three aspects contribute to reliability.
The predefined distance threshold and “minimum cluster units” were selected based on the strategy described in Section 3.4.
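A minimal sketch of this combined computation, assuming a scikit-learn-style classifier exposing a predict method, is given below; the default threshold values follow Section 3.4 and are illustrative rather than taken from the released code.

```python
import numpy as np

def density_local_fit_reliability(x_new, y_pred_new, X_train, y_train, model,
                                  dist_threshold=0.107, min_cluster_units=9):
    """Sketch of the combined method (after Henriques et al. [5]):
    reliability = density x data agreement x ML agreement."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    neighbours = d <= dist_threshold
    n = int(neighbours.sum())
    if n == 0:
        return 0.0
    density = min(n / min_cluster_units, 1.0)                        # density component
    data_agreement = np.mean(y_train[neighbours] == y_pred_new)      # label agreement of neighbours
    ml_agreement = np.mean(model.predict(X_train[neighbours]) == y_pred_new)  # prediction consistency
    return float(density * data_agreement * ml_agreement)
```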

3.4. Parameter Selection for Methods

For methods requiring initial parameters, such as the distance threshold or minimum cluster units, we computed different percentiles of pairwise Euclidean distances from the training data to determine the optimal values. Our approach to selecting the threshold values is based on the method proposed in [5], where the authors used a similar technique to calculate the distance threshold for their pointwise reliability method. In the study, the authors computed the thresholds based on pairwise distances between instances across the entire dataset. The authors selected the 5th percentile of these distances to define the distance threshold and the “minimum cluster unit” (a value of 11).
For the current study, we tried to find a balance between the “minimum cluster unit” and the distance threshold. Our aim was to minimize the distance threshold (ensuring a stricter criterion) while maintaining a “minimum cluster unit” value that avoided being too small or too large.
Table 2 illustrates the relationship between various percentiles of the distances between instances, the corresponding distance thresholds, and the resulting minimum cluster unit values. We opted for the 0.25 percentile because it offers a reasonable balance: it maintains a relatively small distance threshold while guaranteeing a practical “minimum cluster unit” size (this parameter is crucial, as it sets the number of similar instances that should be expected for a given unseen instance).
The selection of lower thresholds (lower percentiles), together with a stricter minimum cluster size, helps to ensure that reliability assessments are concentrated in denser regions of the training data.
This parameterization strategy allowed for consistency across methods while adapting to the characteristics of the dataset.
For the current work, we selected the 0.25 percentile (Figure 4), which corresponds to a distance threshold of 0.107 and a minimum cluster unit value of 9.
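The sketch below illustrates this selection strategy. Note that the way the “minimum cluster unit” is derived from the chosen threshold (here, the median number of neighbors within the threshold) is an assumption made for illustration; the released code may compute it differently.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def select_parameters(X_train, percentile=0.25):
    """Sketch of the parameter-selection strategy from Section 3.4.
    percentile=0.25 corresponds to the 0.25th percentile of pairwise distances."""
    pairwise = squareform(pdist(X_train))                     # full pairwise Euclidean distance matrix
    upper = pairwise[np.triu_indices_from(pairwise, k=1)]     # each pair counted once
    dist_threshold = np.percentile(upper, percentile)
    # assumed rule: minimum cluster unit = median neighbour count within the threshold
    counts = (pairwise <= dist_threshold).sum(axis=1) - 1     # exclude the point itself
    min_cluster_units = int(np.median(counts))
    return dist_threshold, min_cluster_units
```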

4. Experimental Results

The results are reported based on reliability scores grouped into 10% intervals, showing the distribution of correct and incorrect predictions for positive and negative cases. The primary focus is on analyzing the number of incorrect predictions relative to the reliability scores across intervals. Predictions in higher intervals ([80, 100]) are expected to be predominantly correct (with minimal or no incorrect predictions).

4.1. Subtractive Clustering Method

Figure 5 presents the population distribution and error rates of the predictions made using the subtractive clustering method, which are grouped into reliability intervals. The bar plots represent the total population of predictions per class in each interval, while the line plots represent the corresponding error rates. The predictions are almost entirely concentrated in two intervals: [0.00, 0.10] and ]0.90, 1.0], with the majority classified as having low reliability (most instances are not associated with any cluster).
Figure 6 complements the quantitative analysis by showing a t-SNE visualization of the training and validation datasets, which are categorized by reliability intervals and class. Training data are represented by unfilled markers, while validation data are shown with filled markers. Squares and circles represent Class 0 and Class 1, respectively, with colors indicating reliability intervals.
The method struggles to assign predictions to intermediate reliability intervals (]0.10, 0.90]), resulting in a split between two reliability intervals (despite efforts to increase the number of clusters and the resulting clusters’ size variation). While a small number of Class 1 predictions classified as highly reliable (]0.90, 1.0]) are correct, the t-SNE visualization does not show a clear relationship between these predictions and their neighbors. This suggests that the method fails to capture consistent relationships between feature spaces and reliability scores.
This behavior may be influenced by the dataset itself. The t-SNE visualization in Figure 6 shows regions where Classes 0 and 1 overlap, with no distinct boundaries. Such an overlap makes it difficult for the subtractive clustering method to form clusters that reflect reliability. As a result, most predictions are assigned a reliability score of 0, and error rates do not consistently decrease as reliability increases.
In conclusion, the subtractive clustering method appears to be unsuitable for assessing prediction reliability in the present dataset. The lack of intermediate reliability scores, the absence of distinct groups, and the inability to reflect meaningful relationships between predictions and reliability intervals limit its effectiveness.

4.2. DBSCAN Method

Figure 7 shows the performance of the DBSCAN method in assigning predictions to reliability intervals. Most predictions are concentrated in low-reliability intervals ([0.00, 0.10]), with sparse representation in the medium- and high-reliability intervals (]0.20, 1.0]).
The error rate for Class 0 predictions shows significant variation despite the reliability classification, indicating that there is no clear relationship between the assigned reliability scores and the prediction accuracy. Although the number of instances is small, no errors are observed in the interval ]0.50, 1.00] for Class 1 predictions.
Similar to the subtractive clustering method, the predictions are almost entirely concentrated in two intervals, [0.00, 0.10] and ]0.90, 1.0], with the majority classified as having low reliability (most instances are not associated with any cluster, despite efforts to maximize the number of clusters).
Figure 8 shows a t-SNE visualization of the training and validation datasets categorized by reliability intervals and separated by class.
Similar to the subtractive clustering method, there is no obvious relationship between predictions and their reliability scores in the t-SNE visualization. Some predictions that are considered unreliable appear in regions populated by instances of their own class. This observation suggests that the reliability scores assigned by DBSCAN are not consistent with the structure of the feature space or the underlying data relationships.
Table 3 presents the number of instances categorized by labels within the five largest clusters (with instances classified as noise counted as part of a cluster for this purpose). The majority of the data is identified as noise, which serves as a possible explanation for why the method classifies most new instances with zero reliability. Table 3 also highlights that DBSCAN struggles to form clusters containing instances from only one class.
These results suggest that DBSCAN alone may not be suitable for identifying the reliability of predictions, as the error rate does not consistently decrease with increasing reliability scores. However, the absence of errors for Class 1 in the [0.50, 1.00] interval suggests that, with refinements or by incorporating additional assessments (e.g., data variation or class structure), the method could potentially be adapted to more effectively assess pointwise reliability.

4.3. Distance-Based Method

Figure 9 illustrates the performance of the Distance-Based Method in assigning predictions to reliability intervals. Compared to the previously discussed methods, this approach demonstrates a more balanced distribution of predictions across intervals, with significant representation in all reliability ranges.
For Class 0, the error rate does not show a consistent proportional decrease as the reliability level increases. In fact, the intervals [0.00, 0.10] and ]0.90, 1.00] exhibit similar error rates, which is contrary to the expectation that the higher reliability intervals would contain fewer errors. Additionally, the method produces an unusually high number of predictions in the highest reliability interval (]0.90, 1.00]), which is unexpected since reliable predictions are generally fewer (or expected to have very low error levels). However, for Class 1, the error rate decreases consistently as the reliability increases. This is consistent with expectations and suggests that the method is more effective in assigning reliability levels to predictions from this class.
Figure 10 shows a t-SNE visualization of the training and validation datasets categorized by reliability intervals and separated by class. Similar to the previous methods, there is no discernible pattern in the reliable predictions. Many highly reliable predictions are located in regions containing instances from both classes, showing a lack of consistency. However, the method assigns low reliability to sparse instances (instances with fewer visible neighbors), which is reflected in its ability to determine local density.
This behavior may be due to the fact that the method relies on proximity (i.e., the number of instances of the same class within a defined distance) without considering variations in the data distribution. Consequently, predictions in dense regions may be considered highly reliable, regardless of the class overlap, while sparse instances are consistently assigned low reliability.
In conclusion, while the Distance-Based Method provides a more balanced distribution of predictions across reliability intervals, its limitations become evident in the t-SNE visualization and error rate trends. The lack of correlation between reliability scores and prediction accuracy for Class 0, combined with its insensitivity to data variation, suggests that this method requires refinement to improve its reliability assessment.

4.4. ICM-Based Method

Figure 11 illustrates the performance of the ICM-based method in assigning predictions to reliability intervals.
The method demonstrates a decreasing error rate as the reliability increases (with a slight increase in the ]0.40, 0.50] interval, which contains only one instance). Although there were few observations of instances in the ]0.50, 1.00] interval, this general trend suggests that the method can successfully identify intervals where the predictions are more reliable.
An important observation is that most instances are classified in the [0.00, 0.10] interval, with only two predictions assigned to the ]0.50, 1.00] interval. This is probably due to the strict criteria of the algorithm for assigning reliability, which makes it overly restrictive. Despite this, the method assigns high reliability to two instances of Class 1, which, as can be seen in the t-SNE visualization (Figure 12), are located in dense regions predominantly populated by instances of the same class.
The t-SNE visualization also reveals a limitation: some predictions considered unreliable by the method are located in zones populated exclusively by instances of the same class. This observation suggests that the strict reliability threshold of the method may overlook instances that could reasonably be considered reliable.
The ICM-Based Method demonstrates consistent error rate trends and has the ability to use density to assess reliability (as observed for Class 1). However, compared to other methods and as shown in the visualization, its overly restrictive algorithm prevents it from assigning higher reliability scores, even to predictions in favorable regions of the feature space. For example, if the confidence score of Equation (8) is redefined as
$$C(x) = \frac{W^+ - W^-}{W^+ + W^-},$$
where $W^+$ and $W^-$ are the average weighted contributions of the supporting and opposing neighbors, respectively, the resulting scores spread across all reliability intervals, as shown in Figure 13.

4.5. Density and Local Fit Combination Method

Figure 14 illustrates the performance of the Density and Local Fit Combination Method in assigning predictions to reliability intervals. Compared to previously discussed methods, this approach demonstrates superior performance, with a balanced distribution of predictions and good results in the higher reliability interval (]0.80, 1.0]).
For both classes, the error rate generally decreases as the reliability interval increases. For Class 0, the error rate decreases consistently (with small fluctuations) and reaches near-zero levels in ]0.80, 1.0]. Similarly, Class 1 achieves a decreasing error rate, reaching 0 in the highest reliability intervals. This trend is consistent with the expected behavior, indicating that the method effectively discriminates between reliable and unreliable predictions.
An important observation is the even distribution of predictions across the reliability intervals. Unlike the other methods, this approach does not present a high number of instances in low-reliability intervals, having predictions more evenly distributed across all reliability levels. This reflects the method’s ability to assess reliability with greater granularity.
The t-SNE visualization (Figure 15) supports these findings. High-reliability predictions are predominantly located in regions of high local density and are surrounded by training instances of the same class. In contrast, low-reliability predictions are typically located near areas where instances of both classes are present. This suggests that the method successfully uses density information to assess reliability.
Despite consistent results, the method shows some error rates in high-reliability intervals and occasional spikes in error rates within the medium-reliability intervals. Optimizing the defined thresholds (such as adjusting the distance, the minimum number of instances required, or the thresholds for each class independently) could potentially improve the performance of the method.

4.6. Aggregate Performance

To further ensure the robustness of the results, we conducted the evaluation process ten times (including the main analysis presented). Each iteration followed the methodology described in Section 3.1, apart from the t-SNE visualization, which could not be reproduced with the aggregated results. The data processing and model training steps detailed in Section 3.2 were repeated in each iteration, with the dataset randomly split in each run.
Table 4 summarizes the performance of the evaluated methods in terms of error rates and the corresponding number of instances (Cases), organized by distinct labels (Y). Each entry in the table includes the error rate expressed as mean ± standard deviation.
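A minimal sketch of this aggregation, assuming per-run interval tables such as those produced by the categorization helper sketched in Section 3.1, is shown below; the column names are illustrative.

```python
import numpy as np
import pandas as pd

def aggregate_error_rates(per_run_tables):
    """Sketch: aggregate interval error rates across repeated runs (mean and std).
    Each element of per_run_tables is assumed to be a DataFrame with the columns
    produced by categorize_by_reliability() above, one table per random split."""
    rates = []
    for t in per_run_tables:
        t = t.copy()
        t["error_rate_0"] = t["incorrect_0"] / (t["correct_0"] + t["incorrect_0"]).replace(0, np.nan)
        t["error_rate_1"] = t["incorrect_1"] / (t["correct_1"] + t["incorrect_1"]).replace(0, np.nan)
        rates.append(t.set_index("interval")[["error_rate_0", "error_rate_1"]])
    stacked = pd.concat(rates, keys=range(len(rates)))       # MultiIndex: (run, interval)
    return stacked.groupby(level=1).agg(["mean", "std"])     # mean and std per interval
```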
The results presented in Table 4 are consistent with previous experiments. For the clustering methods, most instances are concentrated in two predominant intervals ([0.0, 0.1] and ]0.9, 1.0]) with sparse representation in intermediate intervals. This pattern aligns with earlier observations, reinforcing the limited effectiveness of clustering-based methods in producing reliability results. Additionally, clustering methods exhibit a high degree of variability across intervals, confirming their tendency to generate unstable predictions (some intervals show zero error rates, yet this often coincides with intervals containing no instances). This behavior is further visualized in Figure 16, which presents the aggregated results across multiple iterations using boxplots for each reliability interval, separated by class. The boxplots highlight the instability of clustering methods (particularly DBSCAN), which fail to demonstrate a stable decrease in error rates and exhibit considerable variability.
The Distance-Based Method follows a similar pattern, with error rates remaining inconsistent across intervals. For Class 1, however, the error rate decreases consistently as the reliability intervals increase, although this tendency is accompanied by a considerable degree of variance, indicating instability (which aligns with the earlier t-SNE visualization, which revealed no clear pattern in the data distribution).
For the ICM-Based Method, the table supports the earlier conclusion that this method successfully identifies intervals with lower error rates as reliability increases, but restricts the results to the interval [0.0, 0.5] (the remaining intervals have nearly no instances).
The Density and Local Fit Combination Method consistently outperforms the other methods. The method demonstrates a constant decrease in error rates as reliability intervals increase, and the predictions are also more evenly distributed across reliability intervals (demonstrating the ability of the method to assign reliability scores with greater granularity). The stability of the density and local fit method is further confirmed by Figure 17, which shows the aggregated results using boxplots. The median error rates consistently decrease as reliability increases, with improved stability and reduced variance compared to the other methods.

4.7. Discussion of the Results

Methods that rely solely on clustering and evaluate reliability based on the density principle alone, such as the subtractive clustering and DBSCAN methods, have shown limited ability to determine prediction reliability effectively. These methods tend to concentrate predictions in low-reliability intervals ([0.00, 0.10]) and struggle to establish a consistent relationship between reliability intervals and error rates. This suggests that clustering-based methods alone may not be adequate for assessing pointwise reliability (particularly in the dataset used).
The Distance-Based Method performs better than clustering-based approaches, providing a more balanced distribution of predictions across intervals and showing a gradual decrease in error rates as reliability increases. However, inconsistencies, particularly in high-reliability intervals where error rates are unexpectedly high, highlight the need for further refinement.
The ICM-Based Method incorporates a validation of the data variation, resulting in improvements. It achieves a more balanced distribution of predictions and consistent error rates. However, the method remains overly restrictive, assigning high-reliability scores to very few predictions, even in favorable regions. This suggests the need to refine the algorithm for a better balance in reliability scores.
The combination of density and local fit principles, as implemented in the “density and local fit combination” method, has been demonstrated to be the most effective approach. This method consistently reduces error rates for both classes as reliability increases, demonstrating its potential to overcome the limitations observed in other methods.
Consistent with the findings in [5], which indicate that the “density and local fit” method effectively distinguishes between reliable and unreliable predictions (achieving only 2% error rates in the [0.8, 1.0] reliability interval), our results reveal similar trends, although with higher error rates. Specifically, considering the median of all test samples, we observed error rates of 16% and 8% in the [0.8, 0.9] and [0.9, 1.0] intervals, respectively, for the most prevalent class (Class 0). For the less prevalent class, the error rate was 0% in both intervals (it should be noted that this class has a substantially smaller number of instances in these intervals, which may influence this outcome).
These results (particularly in the density and local fit method) suggest the potential of complementing the trained model with this method to reduce errors and increase user confidence. For example, in the [0.9, 1.0] interval, the ML model showed error rates of approximately 20% for the predominant class and 33% for the minority class. Integrating the Density and Local Fit Combination Method could significantly mitigate these error rates.
Several refinements can be proposed to improve the performance of these methods. Adjusting parameters such as distance thresholds could improve the balance between reliability intervals and error rates. In addition, integrating distance thresholds with methods such as the ICM-Based Method could result in a more balanced approach. Finally, while not a significant issue in this dataset, addressing class imbalance remains a general problem across all methods and should be considered in future refinements (possibly by adjusting thresholds for each class independently).

5. Conclusions and Future Work

This study provides a benchmark of methods assessing the pointwise reliability in machine learning predictions, which were evaluated under consistent conditions. By comparing techniques based on density principles, local fit principles, and their combination, we identified both the strengths and limitations of each approach. The results indicate that more robust methods, such as those that account for data variation and those that combine density and local fit principles, generally outperform methods that rely on distance alone. The study also highlighted the limitations of clustering-based methods, which often struggle to classify instances across different reliability intervals.
In addition, the alignment between the data representation and the results of the methods suggests that visualization tools like t-SNE can support the assessment of prediction reliability by providing a visual representation of how predictions relate to their surrounding data points. This visualization complemented the quantitative results by offering an interpretable view of the neighborhood of new instances, supporting the validation of reliability classifications and verifying that the methods were working as intended.
As future work, we intend to explore additional methods based on the local fit principle, particularly those that incorporate complementary models to assess reliability. In addition, refining the evaluation process by adjusting the parameters and thresholds of the methods (such as using different thresholds for each class, given that the results show some variation) or developing approaches that consider the importance of individual features in the dataset could further enhance the effectiveness of the benchmark.
Furthermore, investigating potential ethical considerations related to the application of these reliability assessment methods will be an important aspect of future research.

Author Contributions

Conceptualization, C.C., S.P., T.R. and J.H.; methodology, C.C., S.P., T.R. and J.H.; software, C.C.; validation, C.C., S.P., T.R. and J.H.; formal analysis, C.C.; investigation, C.C.; resources, S.P., T.R., J.H. and J.B.; writing—original draft preparation, C.C., S.P., T.R., J.H. and J.B.; writing—review and editing, C.C., S.P., T.R., J.H. and J.B.; visualization, C.C., S.P., T.R., J.H. and J.B.; supervision, S.P., T.R., J.H. and J.B.; funding acquisition, J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in GitHub at https://github.com/Oak10/pointwise-benchmark/releases/tag/1.1.1 (accessed on 17 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
ICM: Interpretable confidence measure
ML: Machine learning
t-SNE: t-distributed Stochastic Neighbor Embedding

References

  1. Qayyum, A.; Qadir, J.; Bilal, M.; Al-Fuqaha, A. Secure and Robust Machine Learning for Healthcare: A Survey. IEEE Rev. Biomed. Eng. 2021, 14, 156–180. [Google Scholar] [CrossRef] [PubMed]
  2. Javaid, M.; Haleem, A.; Pratap Singh, R.; Suman, R.; Rab, S. Significance of Machine Learning in Healthcare: Features, Pillars and Applications. Int. J. Intell. Netw. 2022, 3, 58–73. [Google Scholar] [CrossRef]
  3. Schulam, P.; Saria, S. Can You Trust This Prediction? Auditing Pointwise Reliability after Learning. In Proceedings of the AISTATS 2019—22nd International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019. [Google Scholar]
  4. Nicora, G.; Rios, M.; Abu-Hanna, A.; Bellazzi, R. Evaluating Pointwise Reliability of Machine Learning Prediction. J. Biomed. Inform. 2022, 127, 103996. [Google Scholar] [CrossRef] [PubMed]
  5. Henriques, J.; Rocha, T.; Paredes, S.; Gil, P.; Loureiro, J.; Petrella, L. Pointwise Reliability of Machine Learning Models: Application to Cardiovascular Risk Assessment. In 9th European Medical and Biological Engineering Conference, Portorož, Slovenia, 9–13 June 2024; Jarm, T., Šmerc, R., Mahnič-Kalamiza, S., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 213–222. [Google Scholar]
  6. Saraygord Afshari, S.; Enayatollahi, F.; Xu, X.; Liang, X. Machine Learning-Based Methods in Structural Reliability Analysis: A Review. Reliab. Eng. Syst. Saf. 2022, 219, 108223. [Google Scholar] [CrossRef]
  7. Bud, M.A.; Moldovan, I.; Radu, L.; Nedelcu, M.; Figueiredo, E. Reliability of Probabilistic Numerical Data for Training Machine Learning Algorithms to Detect Damage in Bridges. Struct. Control Health Monit. 2022, 29, e2950. [Google Scholar] [CrossRef]
  8. van der Waa, J.; van Diggelen, J.; Neerincx, M.; Raaijmakers, S. ICM: An Intuitive Model Independent and Accurate Certainty Measure for Machine Learning. In 10th International Conference on Agents and Artificial Intelligence, Funchal, Portugal, 16–18 January 2018; SCITEPRESS—Science and Technology Publications: Setúbal, Portugal, 2018; pp. 314–321. [Google Scholar]
  9. van der Waa, J.; Schoonderwoerd, T.; van Diggelen, J.; Neerincx, M. Interpretable Confidence Measures for Decision Support Systems. Int. J. Hum. Comput. Stud. 2020, 144, 102493. [Google Scholar] [CrossRef]
  10. Hellman, M.E. The Nearest Neighbor Classification Rule with a Reject Option. IEEE Trans. Syst. Sci. Cybern. 1970, 6, 179–185. [Google Scholar] [CrossRef]
  11. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  12. Nicora, G.; Bellazzi, R. A Reliable Machine Learning Approach Applied to Single-Cell Classification in Acute Myeloid Leukemia. AMIA Annu. Symp. Proc. 2021, 2020, 925. [Google Scholar] [PubMed]
  13. Kailkhura, B.; Gallagher, B.; Kim, S.; Hiszpanski, A.; Han, T.Y.J. Reliable and Explainable Machine-Learning Methods for Accelerated Material Discovery. NPJ Comput. Mater. 2019, 5, 108. [Google Scholar] [CrossRef]
  14. Bahrami, S.; Tuncel, E. An Efficient Running Quantile Estimation Technique alongside Correntropy for Outlier Rejection in Online Regression. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; IEEE: New York, NY, USA, 2020; pp. 2813–2818. [Google Scholar]
  15. Briesemeister, S.; Rahnenführer, J.; Kohlbacher, O. No Longer Confidential: Estimating the Confidence of Individual Regression Predictions. PLoS ONE 2012, 7, e48723. [Google Scholar] [CrossRef] [PubMed]
  16. Adomavicius, G.; Wang, Y. Improving Reliability Estimation for Individual Numeric Predictions: A Machine Learning Approach. INFORMS J. Comput. 2022, 34, 503–521. [Google Scholar] [CrossRef]
  17. Myers, P.D.; Ng, K.; Severson, K.; Kartoun, U.; Dai, W.; Huang, W.; Anderson, F.A.; Stultz, C.M. Identifying Unreliable Predictions in Clinical Risk Models. NPJ Digit. Med. 2020, 3, 8. [Google Scholar] [CrossRef] [PubMed]
  18. Valente, F.; Henriques, J.; Paredes, S.; Rocha, T.; de Carvalho, P.; Morais, J. A New Approach for Interpretability and Reliability in Clinical Risk Prediction: Acute Coronary Syndrome Scenario. Artif. Intell. Med. 2021, 117, 102113. [Google Scholar] [CrossRef] [PubMed]
  19. Correia, C. Pointwise-Benchmark. Available online: https://github.com/Oak10/pointwise-benchmark/releases/tag/1.1.1 (accessed on 17 April 2025).
  20. Cervati Neto, A.; Levada, A.L.M.; Ferreira Cardia Haddad, M. Supervised T-SNE for Metric Learning With Stochastic and Geodesic Distances. IEEE Can. J. Electr. Comput. Eng. 2024, 47, 199–205. [Google Scholar] [CrossRef]
  21. Li, K.; DeCost, B.; Choudhary, K.; Greenwood, M.; Hattrick-Simpers, J. A Critical Examination of Robustness and Generalizability of Machine Learning Prediction of Materials Properties. NPJ Comput. Mater. 2023, 9, 55. [Google Scholar] [CrossRef]
  22. Sadikin, M. EHR Dataset for Patient Treatment Classification. Mendeley Data 2020, 1, 2020. [Google Scholar]
  23. Chiu, S.L. Fuzzy Model Identification Based on Cluster Estimation. J. Intell. Fuzzy Syst. 1994, 2, 267–278. [Google Scholar] [CrossRef]
  24. Chandar, S.K. Stock Market Prediction Using Subtractive Clustering for a Neuro Fuzzy Hybrid Approach. Clust. Comput. 2019, 22, 13159–13166. [Google Scholar] [CrossRef]
Figure 1. Workflow for reliability assessment, illustrating the data flow from dataset preparation to model training, prediction evaluation, and benchmark.
Figure 2. t-SNE visualization of the training and validation data, highlighting correct and incorrect predictions by class.
Figure 3. Population distribution and error rates of the predictions made by the model.
Figure 4. Distribution of pairwise distances in the dataset.
Figure 5. Subtractive clustering method: population distribution and error rate across reliability intervals, separated by predictions classified as 0 and 1.
Figure 6. t-SNE visualization of the training and validation datasets using the subtractive clustering method, categorized by reliability intervals and separated by class.
Figure 7. Population distribution and error rate across reliability intervals for the DBSCAN method.
Figure 8. t-SNE visualization of the training and validation datasets using the DBSCAN method, categorized by reliability intervals and separated by class.
Figure 9. Population distribution and error rate across reliability intervals for the Distance-Based Method.
Figure 10. t-SNE visualization of the training and validation datasets using the Distance-Based Method, categorized by reliability intervals and separated by class.
Figure 11. Population distribution and error rate across reliability intervals for the ICM-Based Method.
Figure 12. t-SNE visualization of the training and validation datasets using the ICM-Based Method, categorized by reliability intervals and separated by class.
Figure 13. Population distribution and error rates across reliability intervals for the ICM-Based Method with a redefined confidence formula.
Figure 14. Population distribution and error rate across reliability intervals for the Density and Local Fit Combination Method.
Figure 15. t-SNE visualization of the training and validation datasets using the Density and Local Fit Combination Method, categorized by reliability intervals and separated by class.
Figure 16. DBSCAN’s aggregated results across multiple iterations for each reliability interval, separated by class.
Figure 17. Density and local fit method’s aggregated results across multiple iterations for each reliability interval, separated by class. Red lines represent the median values.
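
As a reading aid for the t-SNE figures listed above (Figures 2, 6, 8, 10, 12 and 15), the sketch below shows one way such a projection can be produced: the training and validation features are embedded together in two dimensions and the validation points are coloured by whether the classifier predicted them correctly. It is a minimal illustration using scikit-learn and matplotlib; the variable names (X_train, X_val, y_val, model) are assumptions, and this is not the authors' released benchmark code [19].

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne_correctness(X_train, X_val, y_val, model, random_state=42):
    # Embed training and validation samples together so they share one 2-D space.
    X_all = np.vstack([X_train, X_val])
    emb = TSNE(n_components=2, random_state=random_state).fit_transform(X_all)
    emb_train, emb_val = emb[:len(X_train)], emb[len(X_train):]

    # y_val is assumed to be a NumPy array of the true validation labels.
    correct = model.predict(X_val) == y_val
    plt.scatter(emb_train[:, 0], emb_train[:, 1], c="lightgray", s=8, label="training")
    plt.scatter(emb_val[correct, 0], emb_val[correct, 1], c="green", s=12, label="correct prediction")
    plt.scatter(emb_val[~correct, 0], emb_val[~correct, 1], c="red", s=12, label="incorrect prediction")
    plt.legend()
    plt.show()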
Table 1. Statistics of the numerical features of the dataset, including mean, standard deviation, minimum and maximum values, and the number of unique values for each feature.
Feature | Mean (± std) | Min | Max | Unique Values
HAEMOGLOBINS | 12.7 ± 2.0 | 3.8 | 18.9 | 126
ERYTHROCYTE | 4.5 ± 0.7 | 1.4 | 7.8 | 425
LEUCOCYTE | 8.7 ± 4.9 | 1.1 | 76.6 | 269
THROMBOCYTE | 257.5 ± 114.0 | 10.0 | 1183.0 | 542
MCHC | 33.3 ± 1.2 | 26.0 | 39.0 | 103
MCV | 84.6 ± 6.8 | 54.0 | 115.6 | 401
AGE | 46.6 ± 21.7 | 1 | 98.0 | 94
Table 2. Relationship between selected percentiles of pairwise distances, their corresponding distance thresholds, and the resulting “minimum cluster unit” values (average number of neighbors within each threshold).
Percentile | Distance Threshold | Minimum Cluster Unit
0.1 | ~0.087 | 3
0.25 | ~0.107 | 9
0.5 | ~0.124 | 19
0.75 | ~0.136 | 29
5 | ~0.223 | 198
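
The quantities in Table 2 can be reproduced in outline by taking a percentile of all pairwise distances as a threshold and counting, for each point, how many neighbours fall within that threshold. The sketch below assumes a normalized NumPy feature matrix X and uses SciPy for the distance computation; it illustrates the relationship only and is not the exact procedure used in the study.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def minimum_cluster_unit(X, percentile):
    # Condensed vector of all pairwise Euclidean distances.
    dists = pdist(X)
    threshold = np.percentile(dists, percentile)
    # Average number of neighbours within the threshold, excluding the point itself
    # (the diagonal of the square distance matrix is zero).
    full = squareform(dists)
    neighbours = (full <= threshold).sum(axis=1) - 1
    return threshold, neighbours.mean()

# Example: minimum_cluster_unit(X, 0.1) corresponds to the 0.1th-percentile row of Table 2.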
Table 3. Distribution of instances by labels across the largest DBSCAN clusters.
 | Noise | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
No. Instances | 2136 | 233 | 146 | 90 | 71 | 32
Count Label 0 | 1291 | 202 | 109 | 68 | 55 | 15
Count Label 1 | 845 | 31 | 37 | 22 | 16 | 17
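
A table such as Table 3 can be derived by running DBSCAN [11] on the training features and cross-tabulating the resulting cluster assignments against the class labels; points that DBSCAN cannot assign to any cluster receive the label -1 (the “Noise” column). The snippet below, using scikit-learn and pandas, is only a sketch: the eps and min_samples values are placeholders, not the parameters tuned in the study.

import pandas as pd
from sklearn.cluster import DBSCAN

def cluster_label_distribution(X_train, y_train, eps=0.1, min_samples=5):
    # fit_predict returns -1 for noise and 0, 1, 2, ... for the discovered clusters.
    clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_train)
    # Rows: noise (-1) and each cluster; columns: class labels (0 and 1), as in Table 3.
    return pd.crosstab(pd.Series(clusters, name="cluster"),
                       pd.Series(y_train, name="label"))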
Table 4. Performance of the evaluated methods, where error rates are expressed as mean ± standard deviation. The number of instances (Cases) for each label is included to provide context on the data distribution.
Reliability interval: [0.0, 0.1] | ]0.1, 0.2] | ]0.2, 0.3] | ]0.3, 0.4] | ]0.4, 0.5] | ]0.5, 0.6] | ]0.6, 0.7] | ]0.7, 0.8] | ]0.8, 0.9] | ]0.9, 1.0]
Subtractive, Cases, Y = 0: 298.8 ± 14 | 0 ± 0 | 0 ± 0 | 0.4 ± 0.6 | 0.8 ± 0.6 | 1.8 ± 1.9 | 1 ± 1 | 1 ± 1 | 1.8 ± 2.2 | 24.8 ± 7
Subtractive, Cases, Y = 1: 108.1 ± 12 | 0 ± 0 | 0 ± 0 | 0.2 ± 0.6 | 0.4 ± 0.8 | 0.4 ± 0.8 | 0.4 ± 0.5 | 0.2 ± 0.4 | 0.5 ± 0.8 | 1.4 ± 0.9
Subtractive, Error (%), Y = 0: 30 ± 1 | 0 ± 0 | 0 ± 0 | 10 ± 31 | 5 ± 15 | 35 ± 41 | 45 ± 49 | 26 ± 43 | 21 ± 37 | 24 ± 8
Subtractive, Error (%), Y = 1: 28 ± 3 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 10 ± 21 | 0 ± 0 | 10 ± 31 | 0 ± 0 | 0 ± 0 | 0 ± 0
DBSCAN, Cases, Y = 0: 214.3 ± 18 | 0 ± 0 | 20.3 ± 5.1 | 11 ± 5.1 | 7.5 ± 3.3 | 5.1 ± 2.6 | 4.7 ± 1.7 | 2.5 ± 2.5 | 1.9 ± 1.5 | 63 ± 13.6
DBSCAN, Cases, Y = 1: 90.4 ± 9.6 | 0 ± 0 | 8.1 ± 2.5 | 5.2 ± 2.9 | 2.2 ± 1.3 | 2.6 ± 1.6 | 1.3 ± 0.9 | 0.3 ± 0.6 | 0.2 ± 0.4 | 1.3 ± 1
DBSCAN, Error (%), Y = 0: 28 ± 2 | 0 ± 0 | 39 ± 10 | 44 ± 15 | 45 ± 24 | 29 ± 25 | 14 ± 20 | 27 ± 41 | 45 ± 45 | 26 ± 5
DBSCAN, Error (%), Y = 1: 32 ± 4 | 0 ± 0 | 11 ± 10 | 9 ± 12 | 9 ± 14 | 7 ± 16 | 3 ± 10 | 0 ± 0 | 0 ± 0 | 18 ± 33
Distance, Cases, Y = 0: 55.5 ± 8.1 | 36.8 ± 7.6 | 24.6 ± 5.3 | 20.5 ± 4.8 | 14.8 ± 4.6 | 15.1 ± 3.9 | 13 ± 3.6 | 10.8 ± 3.2 | 12 ± 3.4 | 127 ± 14.9
Distance, Cases, Y = 1: 31 ± 6.4 | 18.2 ± 3.3 | 17.6 ± 5 | 13.1 ± 3.1 | 7.9 ± 3.3 | 6.5 ± 2.1 | 3.9 ± 1.6 | 4.1 ± 1.5 | 2.5 ± 1.5 | 6.8 ± 3.1
Distance, Error (%), Y = 0: 33 ± 4 | 43 ± 8 | 42 ± 7 | 42 ± 11 | 39 ± 14 | 26 ± 10 | 23 ± 13 | 23 ± 9 | 27 ± 10 | 21 ± 2
Distance, Error (%), Y = 1: 50 ± 8 | 27 ± 10 | 21 ± 7 | 23 ± 6 | 15 ± 14 | 15 ± 9 | 25 ± 33 | 7 ± 11 | 6 ± 16 | 3 ± 7
ICM, Cases, Y = 0: 211.6 ± 7.2 | 45 ± 10.2 | 22.7 ± 3.2 | 43.7 ± 6.6 | 7.1 ± 2.3 | 0.3 ± 0.4 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0
ICM, Cases, Y = 1: 55.6 ± 7.5 | 14.8 ± 4.1 | 10.8 ± 4.2 | 17.2 ± 4 | 11.5 ± 3.1 | 1.1 ± 0.8 | 0.3 ± 0.6 | 0.3 ± 0.6 | 0 ± 0 | 0 ± 0
ICM, Error (%), Y = 0: 36 ± 2 | 21 ± 5 | 23 ± 5 | 11 ± 5 | 10 ± 12 | 20 ± 42 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0
ICM, Error (%), Y = 1: 41 ± 5 | 24 ± 13 | 13 ± 13 | 11 ± 5 | 9 ± 8 | 13 ± 32 | 10 ± 31 | 0 ± 0 | 0 ± 0 | 0 ± 0
D. Local, Cases, Y = 0: 65 ± 8.1 | 36.6 ± 6.7 | 25.5 ± 4.8 | 21.4 ± 4.5 | 17.2 ± 4.9 | 22 ± 5.4 | 30.6 ± 7.6 | 44.1 ± 5.9 | 43.3 ± 6.8 | 24.7 ± 3.7
D. Local, Cases, Y = 1: 40.3 ± 7.8 | 21.4 ± 5.1 | 15.5 ± 2.9 | 12.0 ± 3.8 | 8.4 ± 3.7 | 5.3 ± 2.2 | 3.5 ± 1.5 | 2.9 ± 1.2 | 1.1 ± 1.6 | 1.2 ± 0.7
D. Local, Error (%), Y = 0: 39 ± 4 | 42 ± 10 | 38 ± 4 | 40 ± 10 | 40 ± 13 | 25 ± 8 | 26 ± 8 | 21 ± 6 | 16 ± 5 | 8 ± 3
D. Local, Error (%), Y = 1: 49 ± 6 | 23 ± 7 | 20 ± 5 | 12 ± 6 | 7 ± 9 | 8 ± 14 | 9 ± 21 | 2 ± 6 | 0 ± 0 | 0 ± 0
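
The interval-wise figures in Table 4 follow from a simple aggregation: each validation prediction is assigned to a reliability interval, and the number of cases and the percentage of wrong predictions are computed per interval and per predicted class. A minimal sketch of this aggregation is given below, assuming NumPy arrays of reliability scores, predicted labels, and true labels; it is illustrative only and not the released benchmark implementation [19].

import numpy as np

def interval_error_rates(reliability, y_pred, y_true, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # right=True yields intervals of the form ]a, b]; a score of exactly 0 falls in the first bin.
    bins = np.clip(np.digitize(reliability, edges[1:], right=True), 0, n_bins - 1)
    results = {}
    for cls in np.unique(y_pred):
        for b in range(n_bins):
            mask = (bins == b) & (y_pred == cls)
            cases = int(mask.sum())
            error = 100.0 * float(np.mean(y_pred[mask] != y_true[mask])) if cases else float("nan")
            # Keyed by (predicted class, interval index); values are (cases, error rate in %).
            results[(int(cls), b)] = (cases, error)
    return results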
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
