Estimating Calibrated Risks Using Focal Loss and Gradient-Boosted Trees for Clinical Risk Prediction
Abstract
1. Introduction
2. Related Works
2.1. GBDT in Clinical Risk Prediction
2.2. GBDT and Custom Loss Functions
2.3. Probability Calibration
3. Methods
3.1. Focal Loss for Risk Prediction
3.2. Regularized GBDT
3.3. Bayesian Hyperparameter Optimization
Algorithm 1: SMBOFocalGBDT
Input: training data, number of boosting rounds T, hyperparameter search space, focal-loss focusing parameter γ, number of cross-validation folds K. Output: fitted GBDT model with minimum cross-validated loss.
Algorithm 2: Learning and Prediction Stages
Input: training data, number of boosting rounds T, hyperparameter search space, focusing parameters γ, number of folds K, test sample. Output: prediction for the test sample.
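Focal loss down-weights the contribution of well-classified examples through the focusing parameter γ, and GBDT libraries accept it as a custom objective that supplies the gradient and Hessian of the loss with respect to the raw margin. The sketch below is an illustrative Python objective for XGBoost's `xgb.train` interface, not the paper's exact implementation: the gradient is the standard analytic form, the Hessian is approximated by a central finite difference for brevity, and the function names and the γ default are assumptions.

```python
import numpy as np
import xgboost as xgb

def make_focal_objective(gamma=2.0):
    """Binary focal-loss objective for xgb.train (illustrative sketch)."""
    def _grad(margin, y):
        p = np.clip(1.0 / (1.0 + np.exp(-margin)), 1e-12, 1.0 - 1e-12)  # sigmoid of raw score
        # d/dx of -y*(1-p)^g*log(p) - (1-y)*p^g*log(1-p); reduces to p - y when gamma = 0
        pos = gamma * p * (1.0 - p) ** gamma * np.log(p) - (1.0 - p) ** (gamma + 1.0)
        neg = -gamma * (1.0 - p) * p ** gamma * np.log(1.0 - p) + p ** (gamma + 1.0)
        return np.where(y == 1, pos, neg)

    def focal_obj(preds, dtrain):
        y = dtrain.get_label()
        grad = _grad(preds, y)
        eps = 1e-6  # central finite-difference approximation of the Hessian
        hess = (_grad(preds + eps, y) - _grad(preds - eps, y)) / (2.0 * eps)
        return grad, np.maximum(hess, 1e-16)  # keep the Hessian positive for stable splits
    return focal_obj

# usage sketch: booster = xgb.train(params, dtrain, num_boost_round=T, obj=make_focal_objective(2.0))
```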
3.4. Class-Imbalanced Classification Using Calibrated Focal-Aware GBDT
3.5. Evaluation Metrics
4. Experiments
4.1. Datasets
4.1.1. Lung Transplant Data
4.1.2. Diabetes Data
4.2. Design of Experiments
4.3. Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Feature Description
Continuous variables in the lung transplant cohort, stratified by post-transplant lymphoproliferative disorder (PTLD) status:
Variable | Total Mean (SD) | PTLD Negative Mean (SD) | PTLD Positive Mean (SD) | p-Value |
---|---|---|---|---|
REC_AGE_AT_TX | 54.81 (12.86) | 54.91 (12.80) | 51.51 (14.20) | 0 |
DON_AGE | 33.73 (13.89) | 33.79 (13.90) | 31.74 (13.35) | 0 |
Categorical variables in the lung transplant cohort, stratified by PTLD status:
Variable | Category | Total | PTLD Negative | PTLD Positive | p-Value |
---|---|---|---|---|---|
REC_BMI_CAT | Normal | 13,470 | 13,068 | 402 | 0.000 |
REC_BMI_CAT | Obese | 4826 | 4711 | 115 | 0.000 |
REC_BMI_CAT | Overweight | 11,493 | 11,171 | 322 | 0.000 |
REC_BMI_CAT | Underweight | 3015 | 2888 | 127 | 0.000 |
REC_A_MM_EQUIV_TX | 0 | 2240 | 2172 | 68 | 0.605 |
REC_A_MM_EQUIV_TX | 1 | 12,472 | 12,131 | 341 | 0.605 |
REC_A_MM_EQUIV_TX | 2 | 14,833 | 14,403 | 430 | 0.605 |
REC_B_MM_EQUIV_TX | 0 | 621 | 595 | 26 | 0.072 |
REC_B_MM_EQUIV_TX | 1 | 8085 | 7845 | 240 | 0.072 |
REC_B_MM_EQUIV_TX | 2 | 20,836 | 20,264 | 572 | 0.072 |
REC_DR_MM_EQUIV_TX | 0 | 1577 | 1535 | 42 | 0.889 |
REC_DR_MM_EQUIV_TX | 1 | 12,404 | 12,049 | 355 | 0.889 |
REC_DR_MM_EQUIV_TX | 2 | 15,476 | 15,042 | 434 | 0.889 |
REC_LU_SURG | N | 24,044 | 23,526 | 518 | 0.000 |
REC_LU_SURG | Y | 1966 | 1889 | 77 | 0.000 |
CAN_ABO | A | 13,310 | 12,889 | 421 | 0.165 |
CAN_ABO | B | 3720 | 3605 | 115 | 0.165 |
CAN_ABO | O | 14,814 | 14,410 | 404 | 0.165 |
CAN_ABO | OTHER | 1400 | 1356 | 44 | 0.165 |
CAN_DIAB_TY | NO | 26,220 | 25,503 | 717 | 0.218 |
CAN_DIAB_TY | YES | 5125 | 4969 | 156 | 0.218 |
ABO_MATCH | False | 9999 | 9673 | 326 | 0.034 |
ABO_MATCH | True | 23,245 | 22,587 | 658 | 0.034 |
CAN_GENDER | F | 14,380 | 13,987 | 393 | 0.033 |
CAN_GENDER | M | 18,864 | 18,273 | 591 | 0.033 |
CAN_MALIG | N | 28,865 | 28,050 | 815 | 0.020 |
CAN_MALIG | Y | 2174 | 2131 | 43 | 0.020 |
CAN_RACE_SRTR | BLACK | 2749 | 2710 | 39 | 0.000 |
CAN_RACE_SRTR | OTHER | 731 | 717 | 14 | 0.000 |
CAN_RACE_SRTR | WHITE | 29,760 | 28,830 | 930 | 0.000 |
CAN_ETHNICITY_SRTR | LATINO | 1993 | 1957 | 36 | 0.002 |
CAN_ETHNICITY_SRTR | NLATIN | 31,251 | 30,303 | 948 | 0.002 |
REC_CMV_STAT | N | 12,119 | 11,755 | 364 | 0.000 |
REC_CMV_STAT | P | 15,977 | 15,643 | 334 | 0.000 |
REC_EBV_STAT | N | 2824 | 2612 | 212 | 0.000 |
REC_EBV_STAT | P | 23,900 | 23,461 | 439 | 0.000 |
CMV_MATCH | False | 13,059 | 12,713 | 346 | 0.081 |
CMV_MATCH | True | 14,930 | 14,583 | 347 | 0.081 |
REC_MED_COND | HOSPITALIZED | 2608 | 2540 | 68 | 0.002 |
REC_MED_COND | INTENSIVE_CARE | 2647 | 2596 | 51 | 0.002 |
REC_MED_COND | NOT_HOSPITALIZED | 27,966 | 27,102 | 864 | 0.002 |
REC_LIFE_SUPPORT | N | 30,613 | 29,691 | 922 | 0.053 |
REC_LIFE_SUPPORT | Y | 2604 | 2543 | 61 | 0.053 |
REC_CHRONIC_STEROIDS | N | 16,944 | 16,490 | 454 | 0.094 |
REC_CHRONIC_STEROIDS | Y | 14,188 | 13,763 | 425 | 0.094 |
DON_GENDER | F | 12,866 | 12,540 | 326 | 0.000 |
DON_GENDER | M | 20,378 | 19,720 | 658 | 0.000 |
DON_ABO | A | 4854 | 4715 | 139 | 0.045 |
DON_ABO | A1 | 6417 | 6192 | 225 | 0.045 |
DON_ABO | A2 | 872 | 853 | 19 | 0.045 |
DON_ABO | B | 3594 | 3488 | 106 | 0.045 |
DON_ABO | O | 16,787 | 16,318 | 469 | 0.045 |
DON_ABO | OTHER | 720 | 694 | 26 | 0.045 |
DON_ANTI_CMV | N | 13,101 | 12,665 | 436 | 0.001 |
DON_ANTI_CMV | P | 19,975 | 19,435 | 540 | 0.001 |
DON_EBV_IGG | N | 1452 | 1408 | 44 | 0.011 |
DON_EBV_IGG | P | 20,894 | 20,467 | 427 | 0.011 |
DON_HIST_DIAB | NO | 29,775 | 28,914 | 861 | 0.063 |
DON_HIST_DIAB | YES | 1979 | 1936 | 43 | 0.063 |
DON_RACE_HISPANIC_LATINO | 0 | 28,504 | 27,633 | 871 | 0.010 |
DON_RACE_HISPANIC_LATINO | 1 | 4729 | 4617 | 112 | 0.010 |
DON_INFECT_LU | 0 | 17,397 | 16,752 | 645 | 0.000 |
DON_INFECT_LU | 1 | 15,836 | 15,498 | 338 | 0.000 |
INDUCTION_AZATHIOPRINE | 0 | 30,235 | 29,437 | 798 | 0.000 |
INDUCTION_AZATHIOPRINE | 1 | 2699 | 2526 | 173 | 0.000 |
INDUCTION_ATGAM | 0 | 31,292 | 30,405 | 887 | 0.000 |
INDUCTION_ATGAM | 1 | 1642 | 1558 | 84 | 0.000 |
INDUCTION_OKT3 | 0 | 32,691 | 31,741 | 950 | 0.000 |
INDUCTION_OKT3 | 1 | 243 | 222 | 21 | 0.000 |
INDUCTION_BASILIXIMAB | 0 | 19,225 | 18,493 | 732 | 0.000 |
INDUCTION_BASILIXIMAB | 1 | 13,709 | 13,470 | 239 | 0.000 |
INDUCTION_CYCLOSPORINE | 0 | 30,266 | 29,444 | 822 | 0.000 |
INDUCTION_CYCLOSPORINE | 1 | 2668 | 2519 | 149 | 0.000 |
INDUCTION_TACROLIMUS | 0 | 32,650 | 31,702 | 948 | 0.000 |
INDUCTION_TACROLIMUS | 1 | 284 | 261 | 23 | 0.000 |
INDUCTION_STEROIDS | 0 | 11,596 | 11,261 | 335 | 0.639 |
INDUCTION_STEROIDS | 1 | 21,338 | 20,702 | 636 | 0.639 |
ANTI_REJECTION_STEROIDS | 0 | 29,725 | 28,880 | 845 | 0.001 |
ANTI_REJECTION_STEROIDS | 1 | 3209 | 3083 | 126 | 0.001 |
HLA_A2 | 0 | 15,800 | 15,388 | 412 | 0.005 |
HLA_A2 | 1 | 14,052 | 13,610 | 442 | 0.005 |
HLA_A28 | 0 | 29,406 | 28,573 | 833 | 0.018 |
HLA_A28 | 1 | 446 | 425 | 21 | 0.018 |
HLA_A31 | 0 | 28,194 | 27,374 | 820 | 0.042 |
HLA_A31 | 1 | 1658 | 1624 | 34 | 0.042 |
HLA_A32 | 0 | 27,920 | 27,110 | 810 | 0.112 |
HLA_A32 | 1 | 1932 | 1888 | 44 | 0.112 |
HLA_B7 | 0 | 22,822 | 22,147 | 675 | 0.062 |
HLA_B7 | 1 | 7028 | 6850 | 178 | 0.062 |
HLA_B13 | 0 | 28,549 | 27,737 | 812 | 0.515 |
HLA_B13 | 1 | 1301 | 1260 | 41 | 0.515 |
HLA_B14 | 0 | 29,344 | 28,512 | 832 | 0.078 |
HLA_B14 | 1 | 506 | 485 | 21 | 0.078 |
HLA_B18 | 0 | 27,501 | 26,727 | 774 | 0.126 |
HLA_B18 | 1 | 2349 | 2270 | 79 | 0.126 |
HLA_B27 | 0 | 27,661 | 26,864 | 797 | 0.382 |
HLA_B27 | 1 | 2189 | 2133 | 56 | 0.382 |
HLA_B49 | 0 | 28,876 | 28,052 | 824 | 0.820 |
HLA_B49 | 1 | 974 | 945 | 29 | 0.820 |
HLA_B51 | 0 | 27,226 | 26,457 | 769 | 0.269 |
HLA_B51 | 1 | 2624 | 2540 | 84 | 0.269 |
HLA_B57 | 0 | 27,799 | 27,027 | 772 | 0.002 |
HLA_B57 | 1 | 2051 | 1970 | 81 | 0.002 |
HLA_B62 | 0 | 26,680 | 25,935 | 745 | 0.050 |
HLA_B62 | 1 | 3170 | 3062 | 108 | 0.050 |
HLA_B65 | 0 | 28,657 | 27,841 | 816 | 0.606 |
HLA_B65 | 1 | 1193 | 1156 | 37 | 0.606 |
HLA_DR2 | 0 | 29,371 | 28,547 | 824 | 0.004 |
HLA_DR2 | 1 | 382 | 362 | 20 | 0.004 |
HLA_DR3 | 0 | 28,822 | 28,042 | 780 | 0.000 |
HLA_DR3 | 1 | 931 | 867 | 64 | 0.000 |
HLA_DR7 | 0 | 22,751 | 22,112 | 639 | 0.600 |
HLA_DR7 | 1 | 7002 | 6797 | 205 | 0.600 |
HLA_DR14 | 0 | 27,909 | 27,104 | 805 | 0.054 |
HLA_DR14 | 1 | 1844 | 1805 | 39 | 0.054 |
HLA_DR17 | 0 | 24,279 | 23,583 | 696 | 0.512 |
HLA_DR17 | 1 | 5474 | 5326 | 148 | 0.512 |
Continuous variables in the BRFSS diabetes dataset, stratified by diabetes status:
Variable (SAS Variable Name) | Total Mean (SD) | No Diabetes Mean (SD) | Diabetes Mean (SD) | p-Value |
---|---|---|---|---|
BMI (_BMI5) | 28.69 (6.79) | 28.10 (6.50) | 31.96 (7.38) | 0 |
MentHlth (MENTHLTH) | 3.51 (7.72) | 3.33 (7.46) | 4.49 (8.97) | 0 |
PhysHlth (PHYSHLTH) | 4.68 (9.05) | 4.08 (8.44) | 8.01 (11.32) | 0 |
Categorical variables in the BRFSS diabetes dataset, stratified by diabetes status:
Variable (SAS Variable Name) | Category | Total | No Diabetes | Diabetes | p-Value |
---|---|---|---|---|---|
HighBP (_RFHYPE5) | 0 | 125,214 | 116,522 | 8692 | 0 |
HighBP (_RFHYPE5) | 1 | 104,260 | 77,855 | 26,405 | 0 |
HighChol (TOLDHI2) | 0 | 128,129 | 116,528 | 11,601 | 0 |
HighChol (TOLDHI2) | 1 | 101,345 | 77,849 | 23,496 | 0 |
CholCheck (_CHOLCHK) | 0 | 9298 | 9057 | 241 | 0 |
CholCheck (_CHOLCHK) | 1 | 220,176 | 185,320 | 34,856 | 0 |
Smoker (SMOKE100) | 0 | 122,585 | 105,711 | 16,874 | 0 |
Smoker (SMOKE100) | 1 | 106,889 | 88,666 | 18,223 | 0 |
Stroke (CVDSTRK3) | 0 | 219,190 | 187,361 | 31,829 | 0 |
Stroke (CVDSTRK3) | 1 | 10,284 | 7016 | 3268 | 0 |
HeartDiseaseorAttack (_MICHD) | 0 | 205,761 | 178,520 | 27,241 | 0 |
HeartDiseaseorAttack (_MICHD) | 1 | 23,713 | 15,857 | 7856 | 0 |
PhysActivity (_TOTINDA) | 0 | 61,260 | 48,222 | 13,038 | 0 |
PhysActivity (_TOTINDA) | 1 | 168,214 | 146,155 | 22,059 | 0 |
Fruits (_FRTLT1) | 0 | 88,881 | 74,289 | 14,592 | 0 |
Fruits (_FRTLT1) | 1 | 140,593 | 120,088 | 20,505 | 0 |
Veggies (_VEGLT1) | 0 | 47,137 | 38,535 | 8602 | 0 |
Veggies (_VEGLT1) | 1 | 182,337 | 155,842 | 26,495 | 0 |
HvyAlcoholConsump (_RFDRHV5) | 0 | 215,524 | 181,259 | 34,265 | 0 |
HvyAlcoholConsump (_RFDRHV5) | 1 | 13,950 | 13,118 | 832 | 0 |
AnyHealthcare (HLTHPLN1) | 0 | 12,389 | 10,967 | 1422 | 0 |
AnyHealthcare (HLTHPLN1) | 1 | 217,085 | 183,410 | 33,675 | 0 |
NoDocbcCost (MEDCOST) | 0 | 208,151 | 176,796 | 31,355 | 0 |
NoDocbcCost (MEDCOST) | 1 | 21,323 | 17,581 | 3742 | 0 |
GenHlth (GENHLTH) | 1 | 34,854 | 33,719 | 1135 | 0 |
GenHlth (GENHLTH) | 2 | 77,365 | 71,085 | 6280 | 0 |
GenHlth (GENHLTH) | 3 | 73,632 | 60,308 | 13,324 | 0 |
GenHlth (GENHLTH) | 4 | 31,545 | 21,764 | 9781 | 0 |
GenHlth (GENHLTH) | 5 | 12,078 | 7501 | 4577 | 0 |
DiffWalk (DIFFWALK) | 0 | 186,849 | 164,866 | 21,983 | 0 |
DiffWalk (DIFFWALK) | 1 | 42,625 | 29,511 | 13,114 | 0 |
Sex (SEX) | 0 | 128,715 | 110,370 | 18,345 | 0 |
Sex (SEX) | 1 | 100,759 | 84,007 | 16,752 | 0 |
Age (_AGEG5YR) | 1 | 5511 | 5433 | 78 | 0 |
Age (_AGEG5YR) | 2 | 7064 | 6924 | 140 | 0 |
Age (_AGEG5YR) | 3 | 10,023 | 9709 | 314 | 0 |
Age (_AGEG5YR) | 4 | 12,229 | 11,604 | 625 | 0 |
Age (_AGEG5YR) | 5 | 14,040 | 12,991 | 1049 | 0 |
Age (_AGEG5YR) | 6 | 17,280 | 15,539 | 1741 | 0 |
Age (_AGEG5YR) | 7 | 23,121 | 20,049 | 3072 | 0 |
Age (_AGEG5YR) | 8 | 27,272 | 23,031 | 4241 | 0 |
Age (_AGEG5YR) | 9 | 29,678 | 23,997 | 5681 | 0 |
Age (_AGEG5YR) | 10 | 29,093 | 22,610 | 6483 | 0 |
Age (_AGEG5YR) | 11 | 21,993 | 16,903 | 5090 | 0 |
Age (_AGEG5YR) | 12 | 15,379 | 11,996 | 3383 | 0 |
Age (_AGEG5YR) | 13 | 16,791 | 13,591 | 3200 | 0 |
Education (EDUCA) | 1 | 174 | 127 | 47 | 0 |
Education (EDUCA) | 2 | 4040 | 2857 | 1183 | 0 |
Education (EDUCA) | 3 | 9467 | 7171 | 2296 | 0 |
Education (EDUCA) | 4 | 61,124 | 50,092 | 11,032 | 0 |
Education (EDUCA) | 5 | 66,444 | 56,133 | 10,311 | 0 |
Education (EDUCA) | 6 | 88,225 | 77,997 | 10,228 | 0 |
Income (INCOME2) | 1 | 9791 | 7408 | 2383 | 0 |
Income (INCOME2) | 2 | 11,756 | 8670 | 3086 | 0 |
Income (INCOME2) | 3 | 15,920 | 12,356 | 3564 | 0 |
Income (INCOME2) | 4 | 19,953 | 15,906 | 4047 | 0 |
Income (INCOME2) | 5 | 25,326 | 20,837 | 4489 | 0 |
Income (INCOME2) | 6 | 34,957 | 29,697 | 5260 | 0 |
Income (INCOME2) | 7 | 40,131 | 34,905 | 5226 | 0 |
Income (INCOME2) | 8 | 71,640 | 64,598 | 7042 | 0 |
Appendix B. Hyperparameters
XGBoost:
- learning rate: Log-uniform distribution []
- max depth: Discrete uniform distribution
- n estimators: Discrete uniform distribution
- subsample: Uniform distribution
- colsample by tree: Uniform distribution
- colsample by level: Uniform distribution
- reg alpha: Log-uniform distribution []
- reg lambda: Log-uniform distribution []
- min child weight: Log-uniform distribution []
LightGBM:
- learning rate: Log-uniform distribution []
- max depth: Discrete uniform distribution
- n estimators: Discrete uniform distribution
- feature fraction: Uniform distribution
- subsample: Uniform distribution
- lambda l1: Log-uniform distribution []
- lambda l2: Log-uniform distribution []
- min gain to split: Log-uniform distribution []
- path smooth: Log-uniform distribution []
LASSO logistic regression:
- C: Log-uniform distribution []
- Solver: liblinear
Ridge logistic regression:
- C: Log-uniform distribution []
- Solver: liblinear
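The log-uniform and uniform priors above can be searched with scikit-optimize's `BayesSearchCV`, consistent with the package list below. The sketch that follows is a hedged illustration: the bounds, `n_iter`, and the `neg_brier_score` selection criterion are assumptions, not the paper's reported settings.

```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from xgboost import XGBClassifier

# Illustrative bounds only; the distributions mirror Appendix B's list for XGBoost.
search_spaces = {
    "learning_rate": Real(1e-3, 3e-1, prior="log-uniform"),
    "max_depth": Integer(2, 12),
    "n_estimators": Integer(50, 1000),
    "subsample": Real(0.5, 1.0),
    "colsample_bytree": Real(0.5, 1.0),
    "reg_alpha": Real(1e-8, 1e2, prior="log-uniform"),
    "reg_lambda": Real(1e-8, 1e2, prior="log-uniform"),
    "min_child_weight": Real(1e-1, 1e2, prior="log-uniform"),
}

opt = BayesSearchCV(
    XGBClassifier(tree_method="hist", eval_metric="logloss"),
    search_spaces,
    n_iter=50,                  # number of sequential model-based evaluations
    cv=5,                       # K-fold cross-validation, as in Algorithm 1
    scoring="neg_brier_score",  # assumed criterion; any proper scoring rule fits here
    random_state=0,
)
# opt.fit(X_train, y_train); opt.best_params_ then holds the selected configuration
```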
Software packages and versions:
- scikit-learn (1.1.2)
- scikit-optimize (0.10.1)
- hmeasure (0.1.6)
- xgboost (2.0.3)
- lightgbm (3.3.5)
- statsmodels (0.14.1)
Appendix C. Additional Tables for Results Section
Threshold-based classification metrics on the diabetes dataset:
Model | Threshold | Sensitivity | Specificity | Balanced Accuracy |
---|---|---|---|---|
LR | Prevalence | 0.767 | 0.704 | 0.735 |
LASSO LR | Prevalence | 0.767 | 0.704 | 0.736 |
Ridge LR | Prevalence | 0.767 | 0.704 | 0.736 |
LightGBM | Prevalence | 0.777 | 0.701 | 0.739 |
LightGBM-Focal | Optimized | 0.784 | 0.695 | 0.739 |
LightGBM-Focal () | Prevalence | 0.778 | 0.700 | 0.739 |
LightGBM-Focal (Platt) | Prevalence | 0.778 | 0.701 | 0.739 |
LightGBM-Focal (Isotonic) | Prevalence | 0.776 | 0.702 | 0.739 |
XGBoost | Prevalence | 0.780 | 0.700 | 0.740 |
XGBoost-Focal | Optimized | 0.786 | 0.695 | 0.740 |
XGBoost-Focal () | Prevalence | 0.778 | 0.702 | 0.740 |
XGBoost-Focal (Platt) | Prevalence | 0.776 | 0.704 | 0.740 |
XGBoost-Focal (Isotonic) | Prevalence | 0.784 | 0.696 | 0.740 |
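In the threshold tables of this appendix, a "Prevalence" threshold classifies a sample as positive when its predicted risk exceeds the outcome prevalence; sensitivity, specificity, and balanced accuracy then follow from the resulting confusion matrix. A small helper sketch (function and variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def threshold_metrics(y_true, y_prob, threshold=None):
    """Sensitivity, specificity, and balanced accuracy at a prevalence-based cut-off."""
    if threshold is None:
        threshold = float(np.mean(y_true))  # outcome prevalence as the decision threshold
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2.0
```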
Discrimination metrics (mean ± standard deviation) for PTLD risk prediction at each time horizon:
Model | 1-Year PTLD | 3-Year PTLD | 5-Year PTLD | 8-Year PTLD | 10-Year PTLD |
---|---|---|---|---|---|
AUROC | |||||
LR | 0.743 ± 0.013 | 0.686 ± 0.017 | 0.653 ± 0.014 | 0.651 ± 0.013 | 0.674 ± 0.011 |
LASSO LR | 0.739 ± 0.009 | 0.687 ± 0.016 | 0.654 ± 0.014 | 0.659 ± 0.015 | 0.676 ± 0.008 |
Ridge LR | 0.736 ± 0.012 | 0.686 ± 0.016 | 0.653 ± 0.013 | 0.655 ± 0.014 | 0.677 ± 0.010 |
LightGBM | 0.738 ± 0.014 | 0.695 ± 0.022 | 0.665 ± 0.011 | 0.676 ± 0.014 | 0.697 ± 0.010 |
LightGBM-Focal | 0.742 ± 0.014 | 0.696 ± 0.018 | 0.666 ± 0.013 | 0.675 ± 0.014 | 0.697 ± 0.009 |
LightGBM-Focal () | 0.742 ± 0.014 | 0.696 ± 0.018 | 0.666 ± 0.013 | 0.675 ± 0.014 | 0.697 ± 0.009 |
LightGBM-Focal (Platt) | 0.742 ± 0.014 | 0.696 ± 0.018 | 0.666 ± 0.013 | 0.675 ± 0.014 | 0.697 ± 0.009 |
LightGBM-Focal (Isotonic) | 0.739 ± 0.017 | 0.693 ± 0.020 | 0.664 ± 0.013 | 0.674 ± 0.014 | 0.695 ± 0.011 |
XGBoost | 0.736 ± 0.017 | 0.701 ± 0.019 | 0.668 ± 0.017 | 0.678 ± 0.012 | 0.697 ± 0.009 |
XGBoost-Focal | 0.755 ± 0.010 | 0.699 ± 0.015 | 0.668 ± 0.017 | 0.678 ± 0.012 | 0.700 ± 0.009 |
XGBoost-Focal () | 0.755 ± 0.010 | 0.699 ± 0.015 | 0.668 ± 0.017 | 0.678 ± 0.012 | 0.700 ± 0.009 |
XGBoost-Focal (Platt) | 0.755 ± 0.010 | 0.699 ± 0.015 | 0.668 ± 0.017 | 0.678 ± 0.012 | 0.700 ± 0.009 |
XGBoost-Focal (Isotonic) | 0.752 ± 0.011 | 0.694 ± 0.014 | 0.662 ± 0.017 | 0.675 ± 0.011 | 0.698 ± 0.009 |
H-measure | |||||
LR | 0.241 ± 0.030 | 0.144 ± 0.020 | 0.114 ± 0.019 | 0.099 ± 0.014 | 0.133 ± 0.016 |
LASSO LR | 0.247 ± 0.025 | 0.145 ± 0.025 | 0.114 ± 0.019 | 0.107 ± 0.014 | 0.134 ± 0.013 |
Ridge LR | 0.236 ± 0.024 | 0.141 ± 0.025 | 0.112 ± 0.017 | 0.104 ± 0.014 | 0.137 ± 0.017 |
LightGBM | 0.248 ± 0.023 | 0.154 ± 0.032 | 0.129 ± 0.012 | 0.134 ± 0.017 | 0.165 ± 0.019 |
LightGBM-Focal | 0.254 ± 0.025 | 0.154 ± 0.029 | 0.132 ± 0.018 | 0.134 ± 0.017 | 0.163 ± 0.019 |
LightGBM-Focal () | 0.254 ± 0.025 | 0.154 ± 0.029 | 0.132 ± 0.018 | 0.134 ± 0.017 | 0.163 ± 0.019 |
LightGBM-Focal (Platt) | 0.254 ± 0.025 | 0.154 ± 0.029 | 0.132 ± 0.018 | 0.134 ± 0.017 | 0.163 ± 0.019 |
LightGBM-Focal (Isotonic) | 0.241 ± 0.026 | 0.140 ± 0.031 | 0.120 ± 0.017 | 0.124 ± 0.015 | 0.153 ± 0.019 |
XGBoost | 0.257 ± 0.029 | 0.162 ± 0.028 | 0.136 ± 0.017 | 0.133 ± 0.014 | 0.162 ± 0.017 |
XGBoost-Focal | 0.271 ± 0.023 | 0.156 ± 0.021 | 0.136 ± 0.019 | 0.136 ± 0.014 | 0.168 ± 0.014 |
XGBoost-Focal () | 0.271 ± 0.023 | 0.156 ± 0.021 | 0.136 ± 0.019 | 0.136 ± 0.014 | 0.168 ± 0.014 |
XGBoost-Focal (Platt) | 0.271 ± 0.023 | 0.156 ± 0.021 | 0.136 ± 0.019 | 0.136 ± 0.014 | 0.168 ± 0.014 |
XGBoost-Focal (Isotonic) | 0.257 ± 0.026 | 0.145 ± 0.021 | 0.123 ± 0.019 | 0.125 ± 0.013 | 0.156 ± 0.014 |
Average precision | |||||
LR | 0.054 ± 0.012 | 0.078 ± 0.010 | 0.117 ± 0.013 | 0.206 ± 0.008 | 0.331 ± 0.022 |
LASSO LR | 0.054 ± 0.008 | 0.079 ± 0.011 | 0.115 ± 0.011 | 0.211 ± 0.008 | 0.335 ± 0.023 |
Ridge LR | 0.051 ± 0.007 | 0.078 ± 0.010 | 0.116 ± 0.011 | 0.210 ± 0.010 | 0.339 ± 0.023 |
LightGBM | 0.055 ± 0.012 | 0.086 ± 0.012 | 0.129 ± 0.010 | 0.244 ± 0.015 | 0.371 ± 0.028 |
LightGBM-Focal | 0.059 ± 0.011 | 0.088 ± 0.015 | 0.131 ± 0.011 | 0.243 ± 0.014 | 0.369 ± 0.028 |
LightGBM-Focal () | 0.059 ± 0.011 | 0.088 ± 0.015 | 0.131 ± 0.011 | 0.243 ± 0.014 | 0.369 ± 0.028 |
LightGBM-Focal (Platt) | 0.059 ± 0.011 | 0.088 ± 0.015 | 0.131 ± 0.011 | 0.243 ± 0.014 | 0.369 ± 0.028 |
LightGBM-Focal (Isotonic) | 0.049 ± 0.007 | 0.077 ± 0.012 | 0.119 ± 0.010 | 0.225 ± 0.013 | 0.348 ± 0.026 |
XGBoost | 0.053 ± 0.008 | 0.090 ± 0.012 | 0.137 ± 0.013 | 0.242 ± 0.015 | 0.362 ± 0.023 |
XGBoost-Focal | 0.060 ± 0.012 | 0.088 ± 0.011 | 0.136 ± 0.016 | 0.246 ± 0.015 | 0.369 ± 0.020 |
XGBoost-Focal () | 0.060 ± 0.012 | 0.088 ± 0.011 | 0.136 ± 0.016 | 0.246 ± 0.015 | 0.369 ± 0.020 |
XGBoost-Focal (Platt) | 0.060 ± 0.012 | 0.088 ± 0.011 | 0.136 ± 0.016 | 0.246 ± 0.015 | 0.369 ± 0.020 |
XGBoost-Focal (Isotonic) | 0.051 ± 0.006 | 0.079 ± 0.009 | 0.123 ± 0.014 | 0.227 ± 0.011 | 0.346 ± 0.015 |
Calibration (Brier score, mean ± standard deviation) for PTLD risk prediction at each time horizon:
Model | 1-Year PTLD | 3-Year PTLD | 5-Year PTLD | 8-Year PTLD | 10-Year PTLD |
---|---|---|---|---|---|
Brier score | |||||
LR | 0.0111 ± 0.0001 | 0.0279 ± 0.0002 | 0.0482 ± 0.0004 | 0.0939 ± 0.0007 | 0.1317 ± 0.0020 |
LASSO LR | 0.0111 ± 0.0001 | 0.0278 ± 0.0002 | 0.0481 ± 0.0003 | 0.0932 ± 0.0005 | 0.1309 ± 0.0017 |
Ridge LR | 0.0111 ± 0.0000 | 0.0279 ± 0.0002 | 0.0481 ± 0.0003 | 0.0932 ± 0.0006 | 0.1306 ± 0.0017 |
LightGBM | 0.0111 ± 0.0000 | 0.0277 ± 0.0002 | 0.0477 ± 0.0002 | 0.0915 ± 0.0008 | 0.1279 ± 0.0022 |
LightGBM-Focal | 0.0131 ± 0.0026 | 0.0346 ± 0.0071 | 0.0611 ± 0.0107 | 0.0989 ± 0.0073 | 0.1353 ± 0.0105 |
LightGBM-Focal () | 0.0111 ± 0.0000 | 0.0277 ± 0.0002 | 0.0477 ± 0.0003 | 0.0915 ± 0.0008 | 0.1279 ± 0.0022 |
LightGBM-Focal (Platt) | 0.0111 ± 0.0000 | 0.0277 ± 0.0003 | 0.0477 ± 0.0003 | 0.0915 ± 0.0008 | 0.1280 ± 0.0023 |
LightGBM-Focal (Isotonic) | 0.0111 ± 0.0000 | 0.0279 ± 0.0003 | 0.0478 ± 0.0004 | 0.0917 ± 0.0011 | 0.1282 ± 0.0028 |
XGBoost | 0.0111 ± 0.0000 | 0.0277 ± 0.0002 | 0.0476 ± 0.0002 | 0.0916 ± 0.0007 | 0.1285 ± 0.0017 |
XGBoost-Focal | 0.0242 ± 0.0113 | 0.0460 ± 0.0182 | 0.0749 ± 0.0155 | 0.1082 ± 0.0141 | 0.1400 ± 0.0117 |
XGBoost-Focal () | 0.0110 ± 0.0001 | 0.0277 ± 0.0001 | 0.0476 ± 0.0003 | 0.0914 ± 0.0007 | 0.1278 ± 0.0015 |
XGBoost-Focal (Platt) | 0.0110 ± 0.0001 | 0.0277 ± 0.0001 | 0.0476 ± 0.0004 | 0.0913 ± 0.0008 | 0.1278 ± 0.0016 |
XGBoost-Focal (Isotonic) | 0.0111 ± 0.0000 | 0.0278 ± 0.0002 | 0.0477 ± 0.0005 | 0.0917 ± 0.0009 | 0.1282 ± 0.0018 |
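The Brier-score rows illustrate the calibration problem the paper addresses: raw focal-loss scores (LightGBM-Focal, XGBoost-Focal) inflate the Brier score, while post hoc Platt scaling or isotonic regression fitted on held-out scores restores it to the level of the log-loss baselines. A minimal recalibration sketch with scikit-learn follows; the split and function names are assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def _logit(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return np.log(p / (1.0 - p))

# s_cal, y_cal: uncalibrated focal-model scores and labels from a held-out calibration split
def fit_platt(s_cal, y_cal):
    """Platt scaling: a logistic fit on the logit of the uncalibrated score."""
    lr = LogisticRegression(C=1e6).fit(_logit(s_cal).reshape(-1, 1), y_cal)
    return lambda s: lr.predict_proba(_logit(s).reshape(-1, 1))[:, 1]

def fit_isotonic(s_cal, y_cal):
    """Isotonic regression: a monotone, piecewise-constant recalibration map."""
    iso = IsotonicRegression(out_of_bounds="clip").fit(s_cal, y_cal)
    return iso.predict
```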
Threshold-based classification metrics (mean ± standard deviation) for PTLD risk prediction:
Model | Threshold | 1-Year PTLD | 3-Year PTLD | 5-Year PTLD | 8-Year PTLD | 10-Year PTLD |
---|---|---|---|---|---|---|
Sensitivity | ||||||
LR | Prevalence | 0.608 ± 0.034 | 0.543 ± 0.038 | 0.539 ± 0.034 | 0.558 ± 0.035 | 0.578 ± 0.029 |
LASSO LR | Prevalence | 0.597 ± 0.033 | 0.544 ± 0.036 | 0.512 ± 0.040 | 0.572 ± 0.035 | 0.591 ± 0.013 |
Ridge LR | Prevalence | 0.606 ± 0.042 | 0.558 ± 0.034 | 0.543 ± 0.038 | 0.579 ± 0.034 | 0.595 ± 0.026 |
LightGBM | Prevalence | 0.602 ± 0.033 | 0.550 ± 0.058 | 0.543 ± 0.021 | 0.584 ± 0.033 | 0.602 ± 0.021 |
LightGBM-Focal | Optimized | 0.561 ± 0.039 | 0.512 ± 0.058 | 0.482 ± 0.047 | 0.540 ± 0.037 | 0.589 ± 0.038 |
LightGBM-Focal () | Prevalence | 0.606 ± 0.039 | 0.545 ± 0.055 | 0.531 ± 0.035 | 0.578 ± 0.032 | 0.596 ± 0.021 |
LightGBM-Focal (Platt) | Prevalence | 0.618 ± 0.041 | 0.554 ± 0.068 | 0.544 ± 0.032 | 0.586 ± 0.032 | 0.602 ± 0.022 |
LightGBM-Focal (Isotonic) | Prevalence | 0.568 ± 0.037 | 0.531 ± 0.050 | 0.523 ± 0.042 | 0.531 ± 0.078 | 0.586 ± 0.059 |
XGBoost | Prevalence | 0.575 ± 0.051 | 0.561 ± 0.059 | 0.543 ± 0.042 | 0.598 ± 0.029 | 0.611 ± 0.021 |
XGBoost-Focal | Optimized | 0.576 ± 0.040 | 0.526 ± 0.031 | 0.499 ± 0.049 | 0.561 ± 0.039 | 0.576 ± 0.044 |
XGBoost-Focal () | Prevalence | 0.631 ± 0.039 | 0.556 ± 0.054 | 0.539 ± 0.042 | 0.572 ± 0.027 | 0.589 ± 0.017 |
XGBoost-Focal (Platt) | Prevalence | 0.629 ± 0.042 | 0.573 ± 0.058 | 0.532 ± 0.056 | 0.580 ± 0.029 | 0.602 ± 0.023 |
XGBoost-Focal (Isotonic) | Prevalence | 0.608 ± 0.047 | 0.516 ± 0.055 | 0.516 ± 0.055 | 0.584 ± 0.047 | 0.573 ± 0.067 |
Specificity | ||||||
LR | Prevalence | 0.762 ± 0.005 | 0.720 ± 0.013 | 0.697 ± 0.012 | 0.666 ± 0.015 | 0.668 ± 0.012 |
LASSO LR | Prevalence | 0.768 ± 0.010 | 0.726 ± 0.019 | 0.713 ± 0.018 | 0.669 ± 0.012 | 0.662 ± 0.012 |
Ridge LR | Prevalence | 0.737 ± 0.014 | 0.707 ± 0.017 | 0.682 ± 0.016 | 0.651 ± 0.015 | 0.655 ± 0.015 |
LightGBM | Prevalence | 0.774 ± 0.015 | 0.730 ± 0.020 | 0.698 ± 0.009 | 0.666 ± 0.021 | 0.682 ± 0.018 |
LightGBM-Focal | Optimized | 0.815 ± 0.021 | 0.763 ± 0.042 | 0.760 ± 0.041 | 0.708 ± 0.034 | 0.692 ± 0.038 |
LightGBM-Focal () | Prevalence | 0.768 ± 0.016 | 0.736 ± 0.024 | 0.712 ± 0.017 | 0.671 ± 0.016 | 0.689 ± 0.017 |
LightGBM-Focal (Platt) | Prevalence | 0.757 ± 0.018 | 0.729 ± 0.031 | 0.700 ± 0.014 | 0.662 ± 0.017 | 0.681 ± 0.019 |
LightGBM-Focal (Isotonic) | Prevalence | 0.799 ± 0.047 | 0.748 ± 0.033 | 0.716 ± 0.038 | 0.714 ± 0.065 | 0.694 ± 0.063 |
XGBoost | Prevalence | 0.788 ± 0.043 | 0.733 ± 0.024 | 0.693 ± 0.013 | 0.657 ± 0.019 | 0.672 ± 0.013 |
XGBoost-Focal | Optimized | 0.818 ± 0.040 | 0.762 ± 0.024 | 0.750 ± 0.035 | 0.691 ± 0.027 | 0.716 ± 0.038 |
XGBoost-Focal () | Prevalence | 0.759 ± 0.025 | 0.735 ± 0.017 | 0.701 ± 0.018 | 0.680 ± 0.018 | 0.700 ± 0.017 |
XGBoost-Focal (Platt) | Prevalence | 0.767 ± 0.031 | 0.717 ± 0.034 | 0.711 ± 0.033 | 0.671 ± 0.015 | 0.685 ± 0.024 |
XGBoost-Focal (Isotonic) | Prevalence | 0.791 ± 0.026 | 0.769 ± 0.056 | 0.724 ± 0.050 | 0.662 ± 0.048 | 0.710 ± 0.069 |
Balanced accuracy | ||||||
LR | Prevalence | 0.685 ± 0.017 | 0.632 ± 0.015 | 0.618 ± 0.016 | 0.612 ± 0.015 | 0.623 ± 0.013 |
LASSO LR | Prevalence | 0.682 ± 0.017 | 0.635 ± 0.013 | 0.613 ± 0.017 | 0.621 ± 0.016 | 0.627 ± 0.006 |
Ridge LR | Prevalence | 0.672 ± 0.021 | 0.633 ± 0.013 | 0.612 ± 0.016 | 0.615 ± 0.016 | 0.625 ± 0.009 |
LightGBM | Prevalence | 0.688 ± 0.017 | 0.640 ± 0.024 | 0.621 ± 0.012 | 0.625 ± 0.013 | 0.642 ± 0.009 |
LightGBM-Focal | Optimized | 0.688 ± 0.018 | 0.638 ± 0.019 | 0.621 ± 0.016 | 0.624 ± 0.012 | 0.640 ± 0.012 |
LightGBM-Focal () | Prevalence | 0.687 ± 0.020 | 0.641 ± 0.019 | 0.622 ± 0.016 | 0.624 ± 0.013 | 0.643 ± 0.010 |
LightGBM-Focal (Platt) | Prevalence | 0.688 ± 0.021 | 0.642 ± 0.021 | 0.622 ± 0.017 | 0.624 ± 0.013 | 0.641 ± 0.010 |
LightGBM-Focal (Isotonic) | Prevalence | 0.683 ± 0.022 | 0.639 ± 0.018 | 0.619 ± 0.014 | 0.623 ± 0.012 | 0.640 ± 0.012 |
XGBoost | Prevalence | 0.681 ± 0.012 | 0.647 ± 0.024 | 0.618 ± 0.020 | 0.628 ± 0.011 | 0.641 ± 0.010 |
XGBoost-Focal | Optimized | 0.697 ± 0.014 | 0.644 ± 0.018 | 0.625 ± 0.017 | 0.626 ± 0.013 | 0.646 ± 0.013 |
XGBoost-Focal () | Prevalence | 0.695 ± 0.014 | 0.645 ± 0.021 | 0.620 ± 0.020 | 0.626 ± 0.012 | 0.645 ± 0.010 |
XGBoost-Focal (Platt) | Prevalence | 0.698 ± 0.014 | 0.645 ± 0.020 | 0.622 ± 0.019 | 0.625 ± 0.012 | 0.643 ± 0.013 |
XGBoost-Focal (Isotonic) | Prevalence | 0.699 ± 0.015 | 0.642 ± 0.018 | 0.620 ± 0.019 | 0.623 ± 0.011 | 0.642 ± 0.011 |
Discrimination and calibration metrics on the diabetes dataset:
Model | AUROC | H-Measure | Average Precision | Brier Score |
---|---|---|---|---|
LR | 0.811 | 0.291 | 0.420 | 0.1069 |
LASSO LR | 0.811 | 0.291 | 0.420 | 0.1069 |
Ridge LR | 0.811 | 0.291 | 0.420 | 0.1069 |
LightGBM | 0.817 | 0.303 | 0.441 | 0.1052 |
LightGBM-Focal | 0.817 | 0.303 | 0.440 | 0.1172 |
LightGBM-Focal () | 0.817 | 0.303 | 0.440 | 0.1053 |
LightGBM-Focal (Platt) | 0.817 | 0.303 | 0.440 | 0.1053 |
LightGBM-Focal (Isotonic) | 0.817 | 0.301 | 0.434 | 0.1053 |
XGBoost | 0.817 | 0.304 | 0.441 | 0.1052 |
XGBoost-Focal | 0.818 | 0.304 | 0.442 | 0.1168 |
XGBoost-Focal () | 0.818 | 0.304 | 0.442 | 0.1052 |
XGBoost-Focal (Platt) | 0.818 | 0.304 | 0.442 | 0.1052 |
XGBoost-Focal (Isotonic) | 0.817 | 0.303 | 0.435 | 0.1052 |
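The discrimination and calibration columns throughout this appendix follow standard metric definitions: AUROC, average precision, and the Brier score are available directly in scikit-learn, while the H-measure comes from the hmeasure package listed in Appendix B. A brief sketch (function and variable names assumed):

```python
from sklearn.metrics import average_precision_score, brier_score_loss, roc_auc_score

def summary_metrics(y_test, p_test):
    """AUROC, average precision, and Brier score for one evaluation fold."""
    return {
        "auroc": roc_auc_score(y_test, p_test),
        "average_precision": average_precision_score(y_test, p_test),
        "brier": brier_score_loss(y_test, p_test),
        # the H-measure column is computed separately with the hmeasure package (0.1.6)
    }
```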