Article

RiskLogitboost Regression for Rare Events in Binary Response: An Econometric Approach

by Jessica Pesantez-Narvaez, Montserrat Guillen and Manuela Alcañiz *
Department of Econometrics, Riskcenter-IREA, Universitat de Barcelona, 08034 Barcelona, Spain
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(5), 579; https://doi.org/10.3390/math9050579
Submission received: 4 January 2021 / Revised: 25 February 2021 / Accepted: 5 March 2021 / Published: 9 March 2021

Abstract:
A boosting-based machine learning algorithm is presented to model a binary response with large imbalance, i.e., a rare event. The new method (i) reduces the prediction error of the rare class, and (ii) approximates an econometric model that allows interpretability. RiskLogitboost regression includes a weighting mechanism that oversamples or undersamples observations according to their misclassification likelihood and a generalized least squares bias correction strategy to reduce the prediction error. An illustration using a real French third-party liability motor insurance data set is presented. The results show that RiskLogitboost regression improves the rate of detection of rare events compared to some boosting-based and tree-based algorithms and some existing methods designed to treat imbalanced responses.

1. Introduction

Research on rare events is steadily increasing in real-world applications of risk management. Examples include fraud detection [1], credit default prediction [2], bankruptcy prediction [3], emerging markets anomalies [4], customer churn predictions [5], and accident occurrence for insurance studies [6]. We address the rare event modeling problem with a purposefully designed method to identify rare potential hazards in advance and facilitate an understanding of their causes.
Rare events are extremely uncommon patterns whose atypical behavior is difficult to predict and detect. A broad consensus [7,8,9,10] favors the definition of rare events data as binary variables with much fewer events (ones) than non-events (zeros). In other words, the degree of imbalance is more extreme in rare events than it is in class imbalanced data, such that rare events are characterized by the number of ones being hundreds to thousands of times smaller than the number of zeros.
Imbalanced data and rare events have been studied mostly as statistical problems with potential applications in diverse fields such as biology, political science, engineering, and medicine. To name a few, ref. [11] develops a computational method for evaluating extreme probabilities arising from the random initialization of nonlinear dynamical systems. Ref. [12] proposes a solution for the rare events problem with fuzzy sets. Ref. [13] proposes a resampling strategy via the gamma distribution for imbalanced data in medical diagnostics. Ref. [14] proposes a penalized maximum likelihood fixed effects estimator for binary time-series-cross-sectional data for political science applications. Ref. [15] proposes a learning-based stochastic optimization model using rare event data on U.S. federal government agencies. Ref. [16] introduces dynamic models for rare events and time-inhomogeneity in fluctuating currency markets.
The insurance literature draws heavily on discrete probability distributions, where the occurrence of few or no events is considered rare or extreme. For instance, authors like [17,18] use generalized linear models to predict insurance fraud. Ref. [19] develops an extension of the Poisson approximation of binomial distributions for rare events. Another line of work revolves around solutions reached by rare-event simulation [20]. Refs. [21,22,23] employ non-parametric methods for heavy-tailed distributions. Ref. [24] employs transaction aggregation to detect credit card fraud when ones occur much less frequently than zeros in the dependent variable. However, very few papers in this field have been devoted to studying rare events with a binary response, such as [25,26,27], and even fewer go beyond econometric methods, such as [9], which employs advanced machine learning methods.
In fact, developing algorithms that can handle rare events powered by the latest machine learning advances faces two important challenges:
(i)
Some models exhibit bias towards the majority class or underestimate the minority class. Some classifiers are suitable for balanced data [28,29] or treat the minority class as noise [30]. Moreover, some popular tree-based and boosting-based algorithms have been shown to have a high predictive performance measured only with evaluation metrics that consider all observations equally important [31].
(ii)
Unlike econometric methods, several machine learning methods are considered as black boxes in terms of interpretation. They are frequently interpreted using single metrics such as classification accuracy as unique descriptions of complex tasks [32], and they are not able to provide robust explanations for high-risk environments.
In this paper we address these two challenges in an attempt to predict and explain rare events, which will be referred to as dependent or target variables. We propose a RiskLogitboost regression, which is a Logitboost-based algorithm that leads to the convergence of coefficient estimates after some iterations, as occurs when using Iteratively Re-Weighted procedures. Moreover, bias and weighting corrections are incorporated to improve the predictive capacity of the events (ones).
More specifically, our prediction strategy consists of: (i) increasing the accuracy of minority class prediction, and (ii) building an interpretable model similar to classical econometric models. After the introduction, this paper is organized as follows. Section 2 presents the background to the three main approaches used in this research: boosting methods for imbalanced data sets, penalized regression models, and interpretable machine learning. Section 3 describes in detail the proposed RiskLogitboost regression in the rare event problem framework. Section 4 presents the illustrative data used to test the RiskLogitboost regression. Section 5 discusses the results obtained in terms of predictive capacity and interpretability. Finally, Section 6 presents the conclusions of the paper.

2. Background

To formally define the novel RiskLogitboost regression as a supervised machine learning method, this section first addresses three important notions which are the basis of our strategy. A rigorous description of boosting-based algorithms is presented since it is the core procedure of our method. We will obtain certain key expressions from penalized linear models to approximate RiskLogitboost as an econometric method. Finally, two widely recognized interpretable machine learning techniques are briefly described to gain an overview of how traditional machine learning has been interpreted so far.
Supervised machine learning methods are used to predict a response variable denoted as $Y_i$, i = 1, …, n. The data consist of a sample of n observations of the response, and the prediction is established by a set of covariates denoted as $X_{ip}$, p = 1, …, P, with P predictor variables. The model is trained by a base learner $F(X_{ip}; u)$, which is a function of the covariates $X_{ip}$ and the parameters represented by u. The predicted response is denoted as $\hat{Y}_i$.
The purpose of supervised machine learning is to minimize the learning error measured by a loss function φ using an optimization strategy like gradient descent. The loss function is the distance between the observed $Y_i$ and the predicted response $\hat{Y}_i$, denoted as $\varphi(Y_i, \hat{Y}_i)$.

2.1. Boosting Methods

Boosting methods for additive functions are developed within an iterative process through a numerical optimization technique called gradient descent. Each function minimizes a specified loss function φ. Ref. [33] applied the boosting strategy to several loss criteria for classification and regression problems (we use the term "classification problem" if $Y_i$ is qualitative and "regression problem" if $Y_i$ is quantitative; the latter does not refer to the regression models studied in econometrics, but to a predictive model), such as: the least-squares loss $(Y_i - \hat{Y}_i)^2$ for least-squares regression; the least-absolute-deviation loss $|Y_i - \hat{Y}_i|$ for least-absolute-deviation regression; the Huber loss for M-regression, $0.5(Y_i - \hat{Y}_i)^2$ if $|Y_i - \hat{Y}_i| \le \delta$ and $\delta\left(|Y_i - \hat{Y}_i| - \delta/2\right)$ otherwise; and the logistic binomial log-likelihood $\log\left(1 + e^{-2Y_i\hat{Y}_i}\right)$ for two-class Logistic classification.
The Gradient Boosting Machine shown in Algorithm 1 is the base proposal of [33]. The algorithm initializes with a prediction guess $\hat{Y}_i^{(0)}$. Then a boosting process of D iterations is carried out in four stages: the first computes the new response, denoted $\tilde{r}_i^{(d)}$, as the negative gradient of $\varphi(Y_i, \hat{Y}_i^{(d)})$ at iteration d. The second stage fits a least squares regression with the recently computed $\tilde{r}_i^{(d)}$ as the response. The third stage minimizes the loss function between the observed $Y_i$ and $\hat{Y}_i^{(d-1)} + \gamma F(X_{ip}; u^{(d)})$, and the result is delivered in γ. Finally, the last stage updates the prediction $\hat{Y}_i^{(d)}$ by summing $\hat{Y}_i^{(d-1)}$ and $\gamma^{(d)} F(X_{ip}; u^{(d)})$.
Algorithm 1. Gradient Boosting Machine
  1. Initial values: $\hat{Y}_i^{(0)} = \arg\min_{\rho} \sum_{i=1}^{n} \varphi(Y_i, \rho)$.
  2. For d = 1 to D do:
   2.1 Transformation: $\tilde{r}_i^{(d)} = -\left[ \dfrac{\partial \varphi(Y_i, \hat{Y}_i^{(d)})}{\partial \hat{Y}_i^{(d)}} \right]_{\hat{Y}_i^{(d)} = \hat{Y}_i^{(d-1)}}$.
   2.2 Fitting: $u^{(d)} = \arg\min_{u, \varpi} \sum_{i=1}^{n} \left[ \tilde{r}_i^{(d)} - \varpi F(X_{ip}; u) \right]^2$.
   2.3 Minimizing: $\gamma^{(d)} = \arg\min_{\gamma} \sum_{i=1}^{n} \varphi\left( Y_i, \hat{Y}_i^{(d-1)} + \gamma F(X_{ip}; u^{(d)}) \right)$.
   2.4 Updating: $\hat{Y}_i^{(d)} = \hat{Y}_i^{(d-1)} + \gamma^{(d)} F(X_{ip}; u^{(d)})$.
  3. End for
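To make the notation concrete, the following minimal R sketch implements Algorithm 1 for the squared-error loss $\varphi(Y_i,\hat{Y}_i) = (Y_i - \hat{Y}_i)^2/2$, for which the negative gradient in step 2.1 is simply the residual. The simulated data, the one-covariate linear base learner and all object names are illustrative choices, not part of the original paper.

```r
# Algorithm 1 sketched for the squared-error loss: the negative gradient is the residual,
# the base learner is a one-covariate linear model, and the line search has a closed form.
set.seed(1)
n <- 500
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

D     <- 50                     # number of boosting iterations
y_hat <- rep(mean(y), n)        # step 1: constant that minimises the squared-error loss
for (d in 1:D) {
  r_tilde <- y - y_hat                           # step 2.1: transformation (negative gradient)
  base    <- lm(r_tilde ~ x)                     # step 2.2: fit the base learner to r_tilde
  f_d     <- predict(base)
  gamma_d <- sum(r_tilde * f_d) / sum(f_d^2)     # step 2.3: closed-form line search for this loss
  y_hat   <- y_hat + gamma_d * f_d               # step 2.4: update the prediction
}
plot(x, y); points(x, y_hat, col = "red", pch = 16)
```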
Adaboost was one of the first boosting-based prediction algorithms [34,35]. It trains the base learner in a reweighted version by allocating more weight to misclassified observations. Many other boosting techniques have since been derived, such as RealBoost [33], which allows a probability estimate instead of a binary outcome. Logitboost [33] can be used for two-class prediction problems by optimizing an exponential criterion. Gentle Adaboost [33] builds on Real Adaboost and uses probability estimates to update functions. Madaboost [36] modifies the weighting system of Adaboost. Brownboost [37] is based on finding solutions to Brownian differential equations. Delta Boosting [38] uses a delta basis instead of the negative gradient as transformed response.
In the context of rare event and imbalanced prediction problems, various boosting-based methods have been proposed in the literature, including but not limited to RareBoost [39], which calibrates the weights depending on the accuracy of each iteration. Asymmetric Adaboost [40] is a variant of Adaboost and incorporates a cascade classifier. SMOTEBoost [41] incorporates SMOTE (synthetic minority over-sampling techniques) in a boosting procedure. DataBoost-IM [42] treats outliers and extreme observations in a separate procedure to generate synthetic examples of majority and minority classes. RUSBoost [43] trains using skewed data. MSMOTEBoost [44] rebalances the minority class and eliminates noise observations. Additional cost-sensitive methods [45,46,47,48,49,50] have been developed by introducing cost items in the boosting procedure.
Other boosting extensions include tree boosting-based methods, which have been considered a great success in the machine learning community due to their predictive capacity. The tree gradient boost [51] differs from the original gradient boost in the initial value of the first prediction $\hat{Y}_i^{(0)}$, and in the use of a Logistic loss function and a tree base learner.
The tree gradient boost shown in Algorithm 2 consists of six stages. The first states the value of the initial prediction $\hat{Y}_i^{(0)}$. The second stage obtains the new transformed response as the negative gradient of a Logistic loss function. The third maps the observations onto the J leaves of the tree at iteration d. The tree learner is $\sum_{j=1}^{J} u_j\, 1(X_{ip} \in R_j)$, with J terminal nodes known as leaves and classification rules (regions) $R_j$, j = 1, …, J. The parameter $u_j$ corresponds to the score of each leaf, which is the proportion of cases classified as events given the covariates $X_{ip}$. Gini and entropy are two metrics for choosing how to split a tree. Gini measures the likelihood of an incorrect classification of a new observation if it were randomly classified according to the distribution of class labels of the covariates; entropy measures how much information there is in a node.
Algorithm 2. Tree Gradient Boost
 1. Initial values: $\hat{Y}_i^{(0)} = \dfrac{1}{2}\log\dfrac{1 + \bar{Y}}{1 - \bar{Y}}$, where $\bar{Y}$ is the mean of $Y_i$.
 2. For d = 1 to D do:
  2.1 Transformation: $\tilde{r}_i^{(d)} = \dfrac{2Y_i}{1 + \exp\left(2Y_i \hat{Y}_i^{(d-1)}\right)}$
  2.2 Mapping: $R_j^{(d)} = j\text{-leaf scores}\left( \tilde{r}_i, X_i \right)_1^n$
  2.3 Minimizing: $\gamma_j^{(d)} = \arg\min_{\gamma} \sum_{X_i \in R_j^{(d)}} \log\left( 1 + \exp\left( -2Y_i(\hat{Y}_i^{(d-1)} + \gamma) \right) \right) \approx \dfrac{\sum_{X_i \in R_j^{(d)}} \tilde{r}_i}{\sum_{X_i \in R_j^{(d)}} |\tilde{r}_i|\left( 2 - |\tilde{r}_i| \right)}$
  2.4 Updating: $\hat{Y}_i^{(d)} = \hat{Y}_i^{(d-1)} + \sum_{j=1}^{J} \gamma_j^{(d)}\, 1\left( X_i \in R_j^{(d)} \right)$
 3. End for
The fourth stage requires minimizing the Logistic loss function $\arg\min_{\gamma} \sum_{X_i \in R_j^{(d)}} \log\left( 1 + \exp\left( -2Y_i(\hat{Y}_i^{(d-1)} + \gamma) \right) \right)$, delivered in $\gamma_j^{(d)}$. However, since there is no closed form for $\gamma_j^{(d)}$, the fifth stage computes a Newton–Raphson approximation (the ratio shown in step 2.3). Finally, the sixth stage updates the final $\hat{Y}_i^{(d)}$.
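As an illustration of Algorithm 2, the sketch below grows small rpart trees on the transformed response and applies the one-step Newton approximation of $\gamma_j^{(d)}$ leaf by leaf. The simulated two-class data (coded −1/+1), the tree depth and the number of iterations are our own illustrative choices, not the authors' implementation.

```r
# Minimal sketch of Algorithm 2 (two-class tree gradient boost) with small rpart trees.
library(rpart)

set.seed(1)
n  <- 1000
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y  <- ifelse(df$x1 + df$x2 + rnorm(n) > 1.5, 1, -1)    # imbalanced response coded -1/+1

D    <- 30
Ybar <- mean(Y)
Yhat <- rep(0.5 * log((1 + Ybar) / (1 - Ybar)), n)     # step 1: initial value
for (d in 1:D) {
  r_tilde <- 2 * Y / (1 + exp(2 * Y * Yhat))           # step 2.1: negative gradient of the logistic loss
  tree <- rpart(r ~ x1 + x2, data = cbind(df, r = r_tilde),
                control = rpart.control(maxdepth = 2, cp = 0))
  leaf <- tree$where                                   # step 2.2: map each observation to its leaf
  gamma_j <- tapply(r_tilde, leaf,                     # step 2.3: one-step Newton approximation per leaf
                    function(r) sum(r) / sum(abs(r) * (2 - abs(r))))
  Yhat <- Yhat + as.numeric(gamma_j[as.character(leaf)])   # step 2.4: update the prediction
}
prob <- 1 / (1 + exp(-2 * Yhat))                       # implied probability estimates
```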
Tree gradient boosting techniques tend to overfit, especially when data are complex or highly imbalanced [31]. Regularization is a popular strategy to penalize the complexity of the tree and allow out-of-sample reproducibility. It involves adding a shrinkage penalty or regularization term η to the loss function $\varphi(Y_i, \hat{Y}_i)$ so that the leaf scores shrink: $\sum_{i=1}^{n} \varphi(Y_i, \hat{Y}_i) + \sum_{d=1}^{D} \eta\left( \hat{Y}^{(d)} \right)$, with $\eta = \lambda \|u\|$, where λ is a regularization parameter associated with the L1-norm or L2-norm of the vector of leaf scores. Moreover, ref. [52] introduced cost-complexity pruning, which penalizes the number of terminal nodes J according to the expression $\sum_{i=1}^{n} \varphi(Y_i, \hat{Y}_i) + \sum_{d=1}^{D} \lambda J$. As a consequence, these strategies seem quite risky for analysts who want to keep the effect of the covariates even when this effect is small or not significant, because after regularization or pruning the leaf scores are arbitrarily shrunk and the corresponding less important characteristics disappear.

2.2. Penalized Regression Methods

In the econometric setting, regression models have commonly been used to describe the relationship between a response $Y_i$ and a set of covariates $X_{ip}$. Regression models are used to predict a target variable $\hat{Y}_i$, and they allow interpretability of the coefficients by measuring the effect of the covariates on the expected response.
Logistic regression models are used to model the binary variable $Y_i$. $Y_i$ follows a Bernoulli distribution, where $\pi_i$ is the probability that $Y_i$ equals 1, expressed as follows:
$\pi_i = \dfrac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}.$ (1)
Note that $X_i\beta$ is the matrix notation of $\beta_0 + \sum_{p=1}^{P} X_{ip}\beta_p$, where β is the parameter vector. $1 - \pi_i$ is the probability that $Y_i$ equals 0:
$1 - \pi_i = \dfrac{1}{1 + \exp(X_i\beta)}.$ (2)
The Logistic regression uses the logit function as the linear predictor, defined as:
$\eta_i = \log\dfrac{\pi_i}{1 - \pi_i} = \beta_0 + \sum_{p=1}^{P} X_{ip}\beta_p.$ (3)
Then, the classical likelihood function is the joint Bernoulli probability distribution of the observed values of $Y_i$, as follows:
$l(\beta_0, \ldots, \beta_P; X_i) = \prod_{i=1}^{n} \pi_i^{Y_i} \left( 1 - \pi_i \right)^{1 - Y_i}.$ (4)
Taking logarithms of (4), and replacing with Expressions (1) and (2), we obtain:
$l(\beta_0, \ldots, \beta_P; X_i) = \sum_{i=1}^{n} \left[ Y_i X_i\beta - \log\left( 1 + \exp(X_i\beta) \right) \right].$ (5)
Then the Logistic regression estimates can be found by maximizing the log-likelihood in (5) or, equivalently, minimizing the negative log-likelihood, which can be seen as a loss function. Maximization is achieved by differentiating $l(\beta_0, \ldots, \beta_P; X_i)$ with respect to each of the P + 1 parameters, obtaining a vector of P + 1 partial derivatives known as the score and denoted as $\nabla l(\beta_0, \ldots, \beta_P; X_i)$ (we use $'$ to denote the transpose of vectors and matrices):
$\nabla l(\beta_0, \ldots, \beta_P; X_i) = \left( \dfrac{\partial l}{\partial \beta_0}, \ldots, \dfrac{\partial l}{\partial \beta_P} \right)'.$ (6)
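As a quick numerical check of (5) and (6), the sketch below maximizes the log-likelihood directly with optim(), supplying the score as the gradient, and compares the result with glm(). The simulated data and all object names are illustrative.

```r
# Maximising the log-likelihood (5) with the score (6) as its gradient, checked against glm().
set.seed(1)
n <- 2000
X <- cbind(1, x1 = rnorm(n), x2 = rnorm(n))                 # design matrix with an intercept column
beta_true <- c(-3, 1, -0.5)
Y <- rbinom(n, 1, plogis(as.vector(X %*% beta_true)))       # rare-ish binary response

loglik <- function(beta) sum(Y * (X %*% beta) - log(1 + exp(X %*% beta)))           # Equation (5)
score  <- function(beta) as.vector(t(X) %*% (Y - plogis(as.vector(X %*% beta))))    # the score, cf. (11)

fit <- optim(rep(0, ncol(X)), fn = loglik, gr = score, method = "BFGS",
             control = list(fnscale = -1))                  # fnscale = -1 turns optim into a maximiser
cbind(optim = fit$par, glm = coef(glm(Y ~ X - 1, family = binomial)))
```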
However, when fitting a simple model like a Logistic regression, it is sometimes the case that many variables are not strongly associated with the response $Y_i$, which lowers the classification accuracy of the model. Ref. [53] recognized that this problem can be mitigated with alternative fitting procedures, such as constraining or shrinking the coefficients (also known as regularization), before considering non-linear models. The idea is that complex models are sometimes built with irrelevant variables, but by shrinking the coefficient estimates we manage to reduce variance, and thus the prediction error.
However, when complex models arise, the machine learning literature suggests imposing some degree of penalty on the Logistic regression so that the variables that contribute less are shrunk through a regularization procedure.
Ridge Logistic regression, shown in Algorithm 3, follows the dynamics of the Logistic regression, but the term $\lambda \sum_{p=1}^{P} \beta_p^2$, known as the regularization penalty, is added to the negative of the likelihood function in (4). Thus, covariates with a minor contribution are forced to be close to zero.
Algorithm 3. Ridge Logistic Regression
 1. Minimizing the negative likelihood function: $L = -\prod_{i=1}^{n} \pi_i^{Y_i} \left( 1 - \pi_i \right)^{1 - Y_i}$
 2. Penalizing: $L = -\prod_{i=1}^{n} \pi_i^{Y_i} \left( 1 - \pi_i \right)^{1 - Y_i} + \lambda \sum_{p=1}^{P} \beta_p^2.$
On the other hand, Lasso Logistic regression, shown in Algorithm 4, follows the dynamics of the Logistic regression, but a regularization penalty $\lambda \sum_{p=1}^{P} |\beta_p|$ is added to the negative likelihood function. In this case, less contributive covariates are forced to be exactly zero. In both cases, λ is a shrinkage parameter, so the larger it is, the smaller the magnitude of the coefficient estimates [53].
Algorithm 4. Lasso Logistic Regression
 1. Minimizing the negative likelihood function: $L = -\prod_{i=1}^{n} \pi_i^{Y_i} \left( 1 - \pi_i \right)^{1 - Y_i}$
 2. Penalizing: $L = -\prod_{i=1}^{n} \pi_i^{Y_i} \left( 1 - \pi_i \right)^{1 - Y_i} + \lambda \sum_{p=1}^{P} |\beta_p|.$
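A minimal sketch of Algorithms 3 and 4 with the glmnet package, which fits the penalized negative log-likelihood directly (alpha = 0 gives the ridge penalty, alpha = 1 the lasso penalty). The simulated data and the fixed lambda value are illustrative; in practice lambda would be chosen by cross-validation (cv.glmnet).

```r
# Ridge and lasso Logistic regressions on simulated data; lambda is an illustrative value.
library(glmnet)

set.seed(1)
n <- 2000
x <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))     # covariates (glmnet adds the intercept)
y <- rbinom(n, 1, plogis(-3 + x[, 1] - 0.5 * x[, 2]))       # x3 is irrelevant by construction

ridge <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0.05)   # Algorithm 3
lasso <- glmnet(x, y, family = "binomial", alpha = 1, lambda = 0.05)   # Algorithm 4
coef(ridge)   # small but non-zero coefficients
coef(lasso)   # weak effects are shrunk exactly to zero
```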

2.3. Interpretable Machine Learning

Unlike statistical models in econometrics, machine learning algorithms are generally not self-explanatory. For example, generalized linear models provide coefficient estimates and their standard errors give information about the effect of covariates, whereas machine learning requires alternative methods to make the models understandable. Two popular approaches are described below.
Variable importance (VI), as proposed by [52], measures the influence of the inputs on the variation of $\hat{Y}_i$. In a decision tree, the importance of a specific covariate $X_p$ is obtained by summing the improvements in the loss function over all splits on $X_p$; in other words, variable importance is calculated from the node impurity weighted by the node probability (the node probability is the number of observations contained in that node of the tree divided by the total number of observations). For ensemble techniques, the VI is averaged over all the trees that compose the ensemble.
Partial Dependence Plots (PDP), proposed by [51], show the marginal effect of a covariate $X_p$ on the prediction. The predicted function $\hat{Y}$ is evaluated at selected values of the covariate of interest $X_p$ while averaging over the observed values of all the other covariates.
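To make the PDP construction explicit, the short sketch below computes a one-dimensional partial dependence curve by hand for an illustrative logistic GLM: the covariate of interest is fixed at each grid value while the predictions are averaged over the observed values of the other covariate. The model and data are a toy example, not the paper's Boosting Tree.

```r
# Hand-made partial dependence curve for one covariate of an illustrative logistic GLM.
set.seed(1)
df   <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$y <- rbinom(500, 1, plogis(-2 + 1.5 * df$x1 - df$x2))
fit  <- glm(y ~ x1 + x2, family = binomial, data = df)

grid <- seq(-2, 2, by = 0.25)
pd   <- sapply(grid, function(v) {
  tmp <- df
  tmp$x1 <- v                                              # fix x1 at the grid value
  mean(predict(fit, newdata = tmp, type = "response"))     # average over the observed x2 values
})
plot(grid, pd, type = "l", xlab = "x1", ylab = "Partial dependence")
```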

3. The Rare Event Problem with RiskLogitboost Regression

The RiskLogitboost regression is an extension of Logitboost [33] that modifies the weighting procedure to improve the classification of rare events. It also adapts a bias correction from [54] in the boosting procedure, which is also applied to regression models such as those in [7,8,10].
To formally define the RiskLogitboost regression, we first briefly describe Logitboost, shown in Algorithm 5. It initializes with $\hat{Y}_i^{(0)} = 0$ and $\pi^{(0)}(X_i) = 0.5$. Then the boosting procedure continues with four stages. The first transforms the response; Logitboost uses the exponential loss function $e^{-Y_i \hat{Y}_i}$, whose quadratic approximation (a $\chi^2$-type criterion) yields the transformed response $z_i$ (see further details in Appendix A). The second stage calculates the weights from the variance of the transformed response, $\mathrm{Var}[z_i \mid X]$ (see further details in Appendix B). The third stage fits a least squares regression with response $z_i$. Finally, the fourth stage updates the prediction $\hat{Y}_i^{(d)}$ and $\pi(X_i)$ by computing $F(X_{ip}; u^{(d)})$, which is $X_i\beta$ in this particular case.
Algorithm 5. Logitboost
 1. Initial values: $\hat{Y}_i^{(0)} = 0$,
          $\pi^{(0)}(X_i) = 0.5$, where $\pi(X_i)$ are the probability estimates.
 2. For d = 1 to D do:
  2.1 Transformation: $z_i^{(d)} = \dfrac{Y_i^{(d-1)} - \pi(X_i)^{(d-1)}}{\pi(X_i)^{(d-1)}\left( 1 - \pi(X_i)^{(d-1)} \right)}$
  2.2 Weighting: $w_i^{(d)} = \pi(X_i)^{(d-1)}\left( 1 - \pi(X_i)^{(d-1)} \right)$
  2.3 Minimizing: $\beta^{(d)} = \arg\min_{\beta} \sum_{i=1}^{n} w_i^{(d)} \left( z_i^{(d)} - \beta_0 - \sum_{p=1}^{P} X_{ip}\beta_p \right)^2$
  2.4 Updating: $\hat{Y}_i^{(d)} = \hat{Y}_i^{(d-1)} + \dfrac{1}{2} F(X_{ip}; u^{(d)})$, and
          $\pi(X_i)^{(d)} = \dfrac{\exp\left( \hat{Y}_i^{(d)} \right)}{\exp\left( \hat{Y}_i^{(d)} \right) + \exp\left( -\hat{Y}_i^{(d)} \right)}$
 3. End for
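The following R sketch traces Algorithm 5 with a linear base learner fitted by weighted least squares. The simulated imbalanced data, the number of iterations and the small probability clipping (added only for numerical stability) are illustrative choices.

```r
# Minimal sketch of Algorithm 5 (Logitboost) with a weighted least squares base learner.
set.seed(1)
n <- 2000
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
Y <- rbinom(n, 1, plogis(-3 + X[, 1] - 0.5 * X[, 2]))       # imbalanced 0/1 response

D    <- 30
Yhat <- rep(0, n)                                           # step 1: initial values
pi_x <- rep(0.5, n)
for (d in 1:D) {
  z <- (Y - pi_x) / (pi_x * (1 - pi_x))                     # step 2.1: transformed response
  w <- pi_x * (1 - pi_x)                                    # step 2.2: weights
  fit  <- lm(z ~ X, weights = w)                            # step 2.3: weighted least squares fit
  F_d  <- as.vector(cbind(1, X) %*% coef(fit))
  Yhat <- Yhat + 0.5 * F_d                                  # step 2.4: update the prediction ...
  pi_x <- exp(Yhat) / (exp(Yhat) + exp(-Yhat))              # ... and the probability estimates
  pi_x <- pmin(pmax(pi_x, 1e-4), 1 - 1e-4)                  # clipping for stability (not in the listing)
}
```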

3.1. RiskLogitboost Regression Weighting Mechanism to Improve Rare-Class Learning

We propose a weighting mechanism that might be considered a mixed case of oversampling and undersampling. The main idea is to overweight observations whose estimated probability $\pi(X_i)$ is further from the observed value $Y_i$, in other words, observations that are more likely to be misclassified. The new majority class observations are interpolated through a threshold that determines the calibration of the weights. The proposed weighting mechanism takes the following form:
$w_i^* = \begin{cases} \pi(X_i)\left( 1 - \pi(X_i) \right)\left( 1 + |Y_i - \pi(X_i)| \right), & \text{if } |Y_i - \pi(X_i)| > \bar{Y} \\ \pi(X_i)\left( 1 - \pi(X_i) \right)\left( 1 - |Y_i - \pi(X_i)| \right), & \text{if } |Y_i - \pi(X_i)| \le \bar{Y} \end{cases}$
The original weights $w_i$ of the Logitboost are thus multiplied by a factor $1 \pm |Y_i - \pi(X_i)|$ that is related to the distance between $Y_i$ and $\pi(X_i)$.
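A compact sketch of this weighting rule, assuming the threshold is the observed event rate $\bar{Y}$ as in the expression above; the function name and the two-observation example are purely illustrative.

```r
# RiskLogitboost-style weights: the Logitboost weight pi*(1 - pi) is inflated by (1 + |Y - pi|)
# for likely-misclassified observations and deflated by (1 - |Y - pi|) otherwise.
risklogitboost_weights <- function(Y, pi_x) {
  Ybar <- mean(Y)                          # threshold: observed event rate
  dist <- abs(Y - pi_x)                    # distance between the observation and its estimated probability
  base <- pi_x * (1 - pi_x)                # original Logitboost weight
  ifelse(dist > Ybar, base * (1 + dist), base * (1 - dist))
}

# Example: a rare event (Y = 1) with a low estimated probability receives a larger weight
# than a non-event (Y = 0) with the same estimated probability.
risklogitboost_weights(Y = c(1, 0), pi_x = c(0.05, 0.05))
```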
Figure 1 shows the relationship between the weights and the estimated probabilities for the Logitboost and the RiskLogitboost regression. Logitboost gives the largest weights to observations whose estimated probability is around 0.5, and the weights then decrease gradually and symmetrically on either side. The result of the weighting mechanism in the RiskLogitboost regression is that low estimated probabilities are overweighted when $Y_i$ = 1, while high estimated probabilities are underweighted when $Y_i$ = 0. Figure 1 also shows that, once the weighting mechanism is transformed, the inverted-U shape is maintained for Y = 1 and Y = 0.
Refs. [9,26,55,56] proposed weighting mechanisms for parametric and non-parametric models to improve the predictive performance of imbalanced and rare data.

3.2. Bias Correction with Weights

Bias correction will lead to a lower root mean square error. Ref. [54] proposed a bias correction method and showed that the bias of the coefficient estimators of any generalized linear model can be computed as $(X' W X)^{-1} X' W \xi$, where W is the diagonal matrix of the $w_i$. However, we propose replacing $w_i$ with $w_i^*$, so that the bias for the RiskLogitboost is computed as $(X' W^* X)^{-1} X' W^* \xi$, where $W^*$ is the diagonal matrix of the $w_i^*$.
The factor ξ has elements $\xi_i = Q_{ii}\left( \pi^{(D)}(X_i) - 0.5 \right)$, where $Q_{ii}$ are the diagonal elements of the matrix Q defined in (16) below, which is related to the Fisher information. The matrix Q measures the amount of information that the matrix X carries about the parameters; in other words, it is built from the variance of the gradient of the log-likelihood function with respect to the parameter vector, known as the score.
$Q_{rk}$ denotes the element of the Fisher information matrix corresponding to two arbitrary generic parameters, $\beta_k$ and $\beta_r$:
$Q_{rk} = -E\left[ \dfrac{\partial^2 \ln l(\beta_0, \ldots, \beta_k, \ldots, \beta_r, \ldots, \beta_P; X_i)}{\partial \beta_r \, \partial \beta_k} \right].$ (7)
Now let us take the partial derivative of $l(\beta_0, \ldots, \beta_P; X_i)$ in (5) with respect to $\beta_k$:
$\dfrac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} \left[ Y_i \dfrac{\partial}{\partial \beta_k} X_i\beta - \dfrac{\partial}{\partial \beta_k} \log\left( 1 + \exp(X_i\beta) \right) \right],$ (8)
where
$\dfrac{\partial}{\partial \beta_k} X_i\beta = X_{ik}$ (9)
and
$\dfrac{\partial}{\partial \beta_k} \log\left( 1 + \exp(X_i\beta) \right) = \dfrac{\exp(X_i\beta)}{1 + \exp(X_i\beta)} \dfrac{\partial}{\partial \beta_k} X_i\beta = \pi_i X_{ik}.$ (10)
Considering (9) and (10), we obtain:
$\dfrac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} \left( Y_i X_{ik} - \pi_i X_{ik} \right).$ (11)
Now, let us compute the second derivative of (8) with respect to $\beta_r$:
$\dfrac{\partial^2 l}{\partial \beta_k \, \partial \beta_r} = \dfrac{\partial}{\partial \beta_r}\left( \dfrac{\partial l}{\partial \beta_k} \right)$
$\dfrac{\partial^2 l}{\partial \beta_k \, \partial \beta_r} = \sum_{i=1}^{n} X_{ik} \, \dfrac{\partial}{\partial \beta_r}\left( Y_i - \pi_i \right) = -\sum_{i=1}^{n} X_{ik} \dfrac{\partial \pi_i}{\partial \beta_r},$ (12)
and
$\dfrac{\partial \pi_i}{\partial \beta_r} = \dfrac{\exp(X_i\beta)\, \dfrac{\partial}{\partial \beta_r} X_i\beta \left( 1 + \exp(X_i\beta) \right) - \exp(X_i\beta)\exp(X_i\beta)\, \dfrac{\partial}{\partial \beta_r} X_i\beta}{\left( 1 + \exp(X_i\beta) \right)^2}$
$\dfrac{\partial \pi_i}{\partial \beta_r} = \pi_i X_{ir} \left( 1 - \pi_i \right).$ (13)
Plugging (13) into (12):
$\dfrac{\partial^2 l}{\partial \beta_k \, \partial \beta_r} = -\sum_{i=1}^{n} X_{ik} X_{ir} \, \pi_i \left( 1 - \pi_i \right).$ (14)
Recall that $\mathrm{Var}(Y_i) = \pi_i(1 - \pi_i)$, since $Y_i$ follows a Bernoulli distribution, and that this quantity coincides with the weight $w_i$ (second stage of Algorithm 5). However, the new RiskLogitboost again replaces $w_i$ with $w_i^*$ in Equation (14).
If we generalize Expression (14) to all the parameters, we obtain:
$-E\left[ \dfrac{\partial^2 l}{\partial \beta \, \partial \beta'} \right] = X' W X,$ (15)
where W is the diagonal matrix of the $w_i$. Equation (15) is the Fisher information matrix, whose inverse is the variance–covariance matrix of the coefficient estimates. Thus, Q is expressed as the n × n symmetric matrix:
$Q = X \left( X' W X \right)^{-1} X'.$ (16)
Finally, the vector of transformed parameters is computed as $\beta^{RiskLogitboost} = \beta^{(D)} - (X' W^* X)^{-1} X' W^* \xi$.
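A sketch of this correction under the stated definitions, written so that the n × n matrix Q is never formed explicitly (only its diagonal is needed); the function name and arguments are illustrative and assume the design matrix X contains an intercept column.

```r
# Bias term (X' W* X)^{-1} X' W* xi with xi_i = Q_ii (pi(X_i) - 0.5) and Q = X (X'W*X)^{-1} X'.
bias_correction <- function(X, pi_x, w_star) {
  XtWX <- crossprod(X * w_star, X)                   # X' W* X, without building diag(w_star)
  Q_ii <- rowSums((X %*% solve(XtWX)) * X)           # diagonal of Q in Equation (16)
  xi   <- Q_ii * (pi_x - 0.5)                        # factor xi_i
  as.vector(solve(XtWX, crossprod(X * w_star, xi)))  # (X' W* X)^{-1} X' W* xi
}
# Corrected coefficients: beta_star <- beta_D - bias_correction(X, pi_hat, w_star)
```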

3.3. RiskLogitboost Regression

The RiskLogitboost regression (Algorithm 6) modifies the original version of Logitboost to improve the classification of the rare events (ones). This algorithm comprises 11 stages. The first states the initial values of the prediction $\hat{Y}_i$ and of the probability $\pi(X_i)$.
The second obtains the transformed response as explained in Algorithm 5. In the third stage we compute $\hat{Y}_i^{(d)} = \frac{1}{2}\log\frac{\pi(X_i)^{(d-1)}}{1 - \pi(X_i)^{(d-1)}}$, which is the value that minimizes the negative binomial log-likelihood loss function $\log\left( 1 + \exp(-2Y_i\hat{Y}_i) \right)$ used for two-class classification and regression problems. However, $\hat{Y}_i$ also minimizes the exponential loss function $e^{-Y_i\hat{Y}_i}$ used in Logitboost [33]; therefore, the exponential loss function approximates the log-likelihood, yielding the transformed response $z_i$ explained in Algorithm 5.
The fourth stage computes the weights explained in detail in Section 3.1. The fifth stage normalizes the weights of the previous stage so as to convert them into a distribution that adds up to 1.
The sixth stage consists of fitting a weighted linear regression to $z_i^{(d)}$ and obtaining the P + 1 parameters β; the constant $\beta_0$ is computed by setting the corresponding column of X to a vector of ones. As proposed in the original Logitboost, the seventh stage updates the final prediction $\hat{Y}_i^{(d)}$ to fit the model by maximum likelihood using Newton steps, as follows.
We update the prediction to $\hat{Y}_i + F(X_{ip}; u^{(d)})$, where u corresponds to the parameters β. In a Logistic regression, the outcome of $F(X_{ip}; u^{(d)})$ would be $X_i\beta$, and $\pi_i$ in (1) is expressed in terms of $2F(X_{ip}; u^{(d)})$, as follows:
$\pi_i = \dfrac{\exp\left( 2F(X_{ip}; u^{(d)}) \right)}{1 + \exp\left( 2F(X_{ip}; u^{(d)}) \right)}$
$\pi_i = \dfrac{\exp(2X_i\beta)}{1 + \exp(2X_i\beta)}.$
Recalling $l(\beta_0, \ldots, \beta_P; X_i)$ from (5), we compute the expected log-likelihood of $\hat{Y}_i + F(X_{ip}; u^{(d)})$:
$E\left[ l\left( \hat{Y}_i + F(X_{ip}; u^{(d)}) \right) \right] = E\left[ \sum_{i=1}^{n} 2Y_i\left( \hat{Y}_i + F(X_{ip}; u^{(d)}) \right) - \log\left( 1 + \exp\left( 2\left( \hat{Y}_i + F(X_{ip}; u^{(d)}) \right) \right) \right) \right].$
The Newton method for minimizing a strictly convex function requires the first and second derivatives. Let g be the first derivative and H be the second derivative, also known as the Hessian matrix.
$g = \dfrac{\partial E\left[ l\left( \hat{Y}_i + F(X_{ip}; u^{(d)}) \right) \right]}{\partial F(X_{ip}; u^{(d)})}\Bigg|_{F = 0} = 2E\left( Y_i - \pi_i \right)$
$H = \dfrac{\partial^2 E\left[ l\left( \hat{Y}_i + F(X_{ip}; u^{(d)}) \right) \right]}{\partial F(X_{ip}; u^{(d)})^2}\Bigg|_{F = 0} = -4E\left[ \pi_i\left( 1 - \pi_i \right) \right]$
Hence,
$\hat{Y}_i = \hat{Y}_i - H^{-1} g$
$\hat{Y}_i = \hat{Y}_i + \dfrac{1}{2}\, \dfrac{E\left( Y_i - \pi_i \right)}{E\left[ \pi_i\left( 1 - \pi_i \right) \right]}.$
This result is a very close approximation of applying the iteratively reweighted least squares method (Appendix A, Equation (A2)) to the likelihood shown in (5); the key difference is the factor ½ that multiplies the expected value. The eighth stage consists of checking that the probabilities are bounded between 0 and 1, since adding δ might lead to a number larger than 1.
The ninth stage consists of inverting $\frac{1}{2}\log\frac{\pi(X_i)^{(d-1)}}{1 - \pi(X_i)^{(d-1)}}$ (explained in the third stage), which yields the probability estimates. Once the iterative process is finished, the tenth stage obtains the coefficient estimates of iteration D through the expression suggested by [57,58]. Last but not least, the eleventh stage obtains $\beta^*$ by subtracting the bias from $\beta^{(D)}$.
Algorithm 6. RiskLogitboost regression
 1. Initial values: $\hat{Y}_i^{(0)} = 0$,
          $\pi^{(0)}(X_i) = 0.5$, where $\pi(X_i)$ are the probability estimates.
 2. For d = 1 to D do:
  2.1 Transformation: $z_i^{(d)} = \dfrac{Y_i^{(d-1)} - \pi(X_i)^{(d-1)}}{\pi(X_i)^{(d-1)}\left( 1 - \pi(X_i)^{(d-1)} \right) + \delta}$, where $\delta = 0.0001$
  2.2 Population minimizer: $\hat{Y}_i^{(d)} = \dfrac{1}{2}\log\dfrac{\pi(X_i)^{(d-1)}}{1 - \pi(X_i)^{(d-1)}}$
  2.3 Weighting: $w_i^{*(d)} = \begin{cases} \pi(X_i)\left( 1 - \pi(X_i) \right)\left( 1 + |Y_i - \pi(X_i)| \right), & \text{if } |Y_i - \pi(X_i)| > \bar{Y} \\ \pi(X_i)\left( 1 - \pi(X_i) \right)\left( 1 - |Y_i - \pi(X_i)| \right), & \text{if } |Y_i - \pi(X_i)| \le \bar{Y} \end{cases}$
  2.4 Normalizing: $w_i^{*(d)} = \dfrac{w_i^{*(d)}}{\sum_{i=1}^{n} w_i^{*(d)}}$
  2.5 Minimizing: $\beta^{(d)} = \arg\min_{\beta} \sum_{i=1}^{n} w_i^{*(d)} \left( z_i^{(d)} - \beta_0 - \sum_{p=1}^{P} X_{ip}\beta_p \right)^2$
  2.6 Updating prediction: $\hat{Y}_i^{(d)} = \hat{Y}_i^{(d-1)} + \dfrac{1}{2} F(X_{ip}; u^{(d)})$
  2.7 Checking probabilities: $\pi(X_i)^{(d)} = \min\left( \dfrac{1}{1 + \exp\left( -2\hat{Y}_i^{(d)} \right)} + \delta,\ 1 \right)$
 3. End for
 4. Converting: $\pi^{(D)}(Y_i = 1 \mid X) = \dfrac{1}{1 + \exp\left( -2\hat{Y}_i^{(D)} \right)}$
        $\pi^{(D)}(Y_i = 0 \mid X) = \dfrac{1}{1 + \exp\left( 2\hat{Y}_i^{(D)} \right)}$
 5. Obtaining the P parameters: $\beta_p^{(D)} = \dfrac{\sum_{i=1}^{n} X_{ip}\, z_i^{(D)}}{\sum_{i=1}^{n} X_{ip}^2}$, p = 1, …, P.
 6. Correcting bias: $\beta^* = \beta^{(D)} - (X' W^* X)^{-1} X' W^* \xi$.
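Putting the pieces together, the sketch below is one possible reading of Algorithm 6 in R, reusing the risklogitboost_weights() and bias_correction() functions sketched earlier. Where the listing is ambiguous (the superscripts in steps 2.6–2.7 and the cap on the probabilities), explicit and clearly marked assumptions are made, so this should be read as an illustration rather than the authors' implementation.

```r
# Illustrative end-to-end sketch of Algorithm 6 (RiskLogitboost regression).
risklogitboost <- function(X, Y, D = 50, delta = 1e-4) {
  Xmat <- cbind(Intercept = 1, X)
  n    <- nrow(Xmat)
  Yhat <- rep(0, n)                                        # stage 1: initial values
  pi_x <- rep(0.5, n)
  for (d in 1:D) {
    z    <- (Y - pi_x) / (pi_x * (1 - pi_x) + delta)       # stage 2: transformed response (step 2.1)
    Yhat <- 0.5 * log(pi_x / (1 - pi_x))                   # stage 3: population minimizer (step 2.2)
    w    <- risklogitboost_weights(Y, pi_x)                # stage 4: RiskLogitboost weights (step 2.3)
    w    <- w / sum(w)                                     # stage 5: normalization (step 2.4)
    beta <- coef(lm(z ~ Xmat - 1, weights = w))            # stage 6: weighted least squares (step 2.5)
    Yhat <- Yhat + 0.5 * as.vector(Xmat %*% beta)          # stage 7: update (step 2.6)
    pi_x <- pmin(1 / (1 + exp(-2 * Yhat)) + delta,         # stage 8: checking probabilities (step 2.7);
                 1 - delta)                                #   capped just below 1 for numerical stability
  }
  pi_hat <- 1 / (1 + exp(-2 * Yhat))                       # stage 9: converting to probability estimates
  beta_D <- colSums(Xmat * z) / colSums(Xmat^2)            # stage 10: coefficients at iteration D
                                                           #   (intercept column included for simplicity)
  beta_s <- beta_D - bias_correction(Xmat, pi_hat, w)      # stage 11: bias correction
  list(prob = pi_hat, coef = beta_s)
}
# Example with simulated data (illustrative):
# set.seed(1); Xs <- cbind(x1 = rnorm(5000), x2 = rnorm(5000))
# Ys <- rbinom(5000, 1, plogis(-3.5 + Xs[, 1]))
# fit <- risklogitboost(Xs, Ys, D = 50)
```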

4. Illustrative Data

The illustrative data set used for testing the classical and alternative machine learning algorithms is a French third-party liability motor insurance data set available from [59] through the publicly available data sets in the CASdatasets library in R. It contains 413,169 observations, recorded mostly over one year, on risk factors for third-party liability motor policies.
The data set contains the following information on vehicle characteristics: the power of the car ordered by category (Power); the car brand divided into seven categories (Brand); the fuel type, either diesel or regular (Gas); and the car age measured in years (Car Age). It also includes information on the policy holder and the policy: the policy region in France based on the 1970–2015 classification (Region); the number of inhabitants per km² in the city in which the driver resides (Density); and the driver's age in years (Driver Age). Finally, the occurrence of accident claims $Y_i$ is coded as 1 if the policy holder suffered at least one accident, and as 0 otherwise. A total of 3.75% of policy holders had reported at least one accident (rare event ratio).
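For readers who wish to retrieve the data, the sketch below shows how it can be loaded in R. CASdatasets is distributed outside CRAN (see the package documentation for installation), and the exact table name, freMTPLfreq, is our assumption rather than something stated in the paper.

```r
# Loading the illustrative French MTPL data (table name assumed to be freMTPLfreq).
library(CASdatasets)
data(freMTPLfreq)
nrow(freMTPLfreq)                                 # 413,169 policies
y <- as.integer(freMTPLfreq$ClaimNb > 0)          # 1 if at least one claim was reported
mean(y)                                           # rare event ratio of roughly 3.75%
```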

5. Discussion of Results

This section first presents the predictive performance of some machine learning algorithms jointly with the RiskLogitboost regression when Y = 1 in the extreme observations; secondly, this section shows that the model is interpretable through the coefficient estimates.

5.1. Predictive Performance of Extremes

Table 1 and Table 2 show the Root Mean Square Error (RMSE) for observations with Y = 1 and Y = 0, respectively. Even though the Boosting Tree has optimized hyperparameters, it produced a larger error than all other methods when Y = 1 (the Boosting Tree is built with 10-fold cross-validation and its hyperparameters are optimized through a grid search over the number of trees (50), the maximum depth of variable interactions (1), the minimum number of observations in the terminal nodes of the trees (10), and the shrinkage (0.1), using the caret package in R; the Lasso and Ridge Logistic models had the lowest deviance among several trials of shrinkage values). This can be attributed to the fact that high predictive performance algorithms such as tree-based methods reduce the global error, which is mainly influenced by the majority class (usually coded as 0) when data are imbalanced. Thus, observations modelled using this type of method show high levels of error when Y = 1. This means that the riskiest observations (those with misclassification costs) are poorly detected, and observations whose probability is not high enough are more likely to be misclassified.
The RiskLogitboost regression had the lowest error for observations whose estimated probability was in the lower extremes. This is an important result since the proportion of cases for this set of observations usually tends to be underestimated by traditional predictive modeling techniques. Moreover, the RiskLogitboost regression perfectly predicted observations whose estimated probability was in the highest extremes, suggesting that observations that are more likely to belong to the rare event (Y = 1) will never be misclassified. From a risk analysis perspective, this is a valuable achievement since it reduces misclassification costs for this group.
Observations classified with SMOTEBoost and RUSBoost outperform those classified with Logitboost, Ridge Logistic, Lasso Logistic, and the Boosting Tree; however, their predictive performance is still below that of the RiskLogitboost regression. Even though SMOTEBoost and RUSBoost are designed to handle imbalanced data sets, RiskLogitboost seems to be more efficient at detecting rare events.
Similar performance is obtained by the Weighted Logistic Regression (WLR) [26], the Penalized Logistic regression for complex surveys (PLR) with the two weighting mechanisms PSWa and PSWb [9], and the SyntheticPL (Synthetic Penalized Logitboost) [56]. WLR and PLR with PSWa provide exactly the same result because the PLR incorporates the sampling design as well as a resampling correction, and the sampling corrections of both methods coincide when the data form a simple random sample. RiskLogitboost still outperforms these modern methods for imbalanced and rare event data. The Weighted Logistic regression for rare events (WeiLogRFL) [10] might be considered the second best. In contrast, when Y = 0 the Boosting Tree, the Ridge Logistic regression and the Lasso Logistic regression had a lower RMSE than the RiskLogitboost regression. These three methods classify the non-events (Y = 0) accurately, whereas the RiskLogitboost regression tends to underestimate their occurrence. The results obtained by the RiskLogitboost are quite close to those of the WeiLogRFL. Moreover, WLR, the Ridge Logistic regression and the Boosting Tree obtained the lowest RMSE for the highest and lowest prediction scores. SyntheticPL outperforms RUSBoost and SMOTEBoost, even though its purpose is to improve the predictive performance on imbalanced data.
The results when Y = 1 also showed that Logitboost was superior, in predictive capacity terms, to the Ridge Logistic regression, Lasso Logistic regression and Boosting Tree in the testing data set. In this particular case, the Ridge Logistic regression and Lasso Logistic performed similarly in the training data set.
Figure 2 shows the highest and lowest prediction scores for all observed responses Y. The RiskLogitboost regression started with higher levels of RMSE in the first iterations, after which they decreased until becoming stable; the RMSE did not vary from the fortieth iteration onwards. As a result, we were able to maintain the convergence process, since the proposed transformation of the weighting procedure (Section 3.1) achieved stability identical to that of the original Logitboost.

5.2. Interpretable RiskLogitboost Regression

Table 3 presents the coefficient estimates, standard errors and confidence intervals obtained by the RiskLogitboost regression. Because the RiskLogitboost regression is designed and fitted similarly to a generalized linear model (i.e., a logistic regression), as fully explained in Section 3, we may obtain the odds ratios by exponentiating the coefficient estimates.
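As a small worked example of this interpretation, the odds ratio and its 95% confidence interval follow from exponentiating a coefficient estimate and the bounds built from its standard error; the values of est and se below are hypothetical placeholders for the figures reported in Table 3.

```r
# Odds ratio and 95% confidence interval from a coefficient estimate; est and se are hypothetical.
est <- 0.80     # hypothetical coefficient estimate
se  <- 0.25     # hypothetical standard error
round(exp(c(odds_ratio = est, lower = est - 1.96 * se, upper = est + 1.96 * se)), 3)
```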
The results provided by the RiskLogitboost regression suggest that the likelihood of a policy holder having an accident increased if they had e, k, l, m, n, o type Power vehicle; in particular, drivers with o–type Power were the most likely to have an accident among all types of Power.
The policy holder was more likely to have an accident if they drove in the Regions of Haute-Normandie and Limousin, whereas driving in the Regions of Bretagne, Centre, Haute Normandie, Ile de France, Pays de la Loire, Basse Normandie, Nord Pas de Calais and Poitou Charentes did not influence the likelihood of a person having an accident.
Policy holders driving Renault, Nissan or Citroen cars were less likely to have an accident than those driving other brands of car.
As expected, the Lasso Logistic regression shrunk all coefficients to zero except the one corresponding to the intercept; in this sense, this method is not informative and is actually disadvantageous for analyzing the effects. The Ridge Logistic regression provided coefficient estimates of very small magnitude, and overall the covariates in the Ridge Logistic regression seemed to have a small effect on the final prediction, which makes sense because 96.25% of the cases had not reported an accident. However, this model risks underestimating the probability of having an accident.
All in all, the coefficients obtained by the RiskLogitboost regression are much larger than those obtained by the other regressions, since this type of algorithm tends to overestimate the probability of occurrence of the target variable in order to avoid classifying risky observations as $\hat{Y}_i$ = 0 instead of $\hat{Y}_i$ = 1.
Table 4 shows the variable importance of the six most relevant covariates according to RiskLogitboost, Boosting Tree, Ridge Logistic and Logitboost regressions. The results show no consensus between the methods; however, the Boosting Tree and Ridge Logistic regression have certain categories of Brand and Region as the most important covariates, while certain categories of Power and Region seem to be the most relevant according to Logitboost and RiskLogitboost.
As a consequence, it seems that there is no consensus in the results provided by the variable importance technique, which is risky in terms of interpretation. Analysts should consider that the results of a Boosting Tree, Ridge Logistic or Lasso Logistic regression can generate misleading inferences because they underestimate the occurrence of rare events; the covariates that appear to be most contributive will be those with more effect on non-events (Y = 0). By contrast, the variable importance technique suggests that RiskLogitboost better identifies the covariates that are the most influential in the occurrence of rare events (Y = 1).
Figure 3 shows the partial dependence plot (PDP) obtained from a Boosting Tree. Each plot shows an average model prediction for each value of the covariate of interest. The intuitive interpretation of this plot is that the magnitude on the y axis shows more or less likelihood of the occurrence of the event (Y = 1). In this particular case, drivers with m–type Power were more likely to have an accident than drivers with d–type Power. Newer vehicles were less likely to be involved in an accident than older ones. Drivers aged between approximately 30 and 80 were less likely to have an accident than very old or very young drivers. Moreover, policy holders who drove in the region of Limousin were the least likely to have an accident in comparison with other regions of France. Last but not least, it seems that Japanese (except Nissan) or Korean vehicles were more likely to be involved in an accident than other brands.

6. Conclusions

On balance, RiskLogitboost brings a key advantage to the prediction of rare events, principally when the detection of the minority class is fundamental or extremely important in the case study and the impact of false positives is of minor importance. The treatment and the interpretation of rare events are more accurate when using the RiskLogitboost, and it may contribute to the prevention of events whose occurrence would be disastrous, and whose cost policy holders are not willing to accept or able to afford.
The RiskLogitboost regression is a boosting-based machine learning algorithm shown to improve the prediction of rare events compared to certain well-known tree-based and boosting-based algorithms. It will be of most value where the cost of failing to predict the occurrence of the rare event, and when it will occur, is high. RiskLogitboost regression implements a weighting mechanism and a bias correction that lower the prediction error of such rare events by overestimating their probabilities. The results presented here show that the lowest RMSE in the upper and lower extremes occurs when Y = 1. This comes at a cost: the RiskLogitboost regression RMSE tended to increase when Y = 0 in the extreme observations, because the algorithm adjusts misclassified observations, which, in the context of rare events with a binary response, are coded as Y = 1. This cost is low when the cost of false positives is much smaller than the cost of false negatives.
While regularization procedures can be incorporated in econometric methods such as logistic regression, they have two main drawbacks. First, the resulting models may not be adequately interpretable because the shrinkage from such procedures depends on the penalty term, causing loss of the real effect of the covariates on the final prediction. Second, such procedures cannot classify rare events efficiently.
The Tree Boosting regression had the lowest RMSE in the majority class observations (Y = 0) but showed poor performance in the minority class observations. It is also more in the nature of a black box in terms of interpretability, requiring more reliance on the variable importance method and PDP. The PDP from the Tree Boosting regression is relatively informative, but all covariates are treated as significant or relevant for the final prediction, which is sometimes inconsistent with an econometric model like a regression. Moreover, while a PDP is easy to implement when there are only a few variables, with more variables interpretation is more difficult. It is often desirable to achieve both high predictive performance for rare events and interpretability. Tree-based and boosting-based methods may be unsuitable in such situations because they underestimate the probability that the rare event will occur while also underestimating the effect of the covariates that are most important to predicting the rare event rather than the majority class. RiskLogitboost delivers high predictive performance while also facilitating interpretation by identifying the covariates most important to prediction of the rare event.
The RiskLogitboost still has limitations in decreasing the false positive rate, since it focuses on efficiently reducing the error for observations with $Y_i$ = 1. However, for case studies in which the cost of false positives tends to be high, the proposed method could be redesigned so as to improve the detection of observations with $Y_i$ = 0. This is a proposal for further research.

Author Contributions

Conceptualization, J.P.-N. and M.G.; methodology, J.P.-N.; software, J.P.-N.; validation, M.G. and M.A.; formal analysis, M.G., J.P.-N. and M.A.; investigation, J.P.-N.; resources, M.G. and M.A.; data curation, J.P.-N.; writing—original draft preparation, J.P.-N. and M.G.; writing—review and editing, M.G. and J.P.-N.; visualization, J.P.-N.; supervision, M.G. and M.A.; project administration, M.G.; funding acquisition, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Spanish Ministry of Economy, FEDER grant ECO2016-76203-C2-2-P and ICREA Academy.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

A publicly available data set is used in this paper; it is described in detail in Section 4.

Acknowledgments

The authors thank the Spanish Ministry of Science and Innovation grant PID2019-105986GB-C21. M.G. thanks ICREA Academia and Fundacion BBVA Big Data grants 2018.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Computation of $z_i$ as the Transformed Response

A Taylor expansion is applied to the logit in (3), so that $\eta_i$, viewed as a function $\eta(\cdot)$ evaluated at $Y_i$, is expanded around $\pi(X_i)$:
$\eta(Y_i) \approx \eta\left( \pi(X_i) \right) + \left( Y_i - \pi(X_i) \right) \eta'\left( \pi(X_i) \right)$
$\eta(Y_i) \approx \log\dfrac{\pi(X_i)}{1 - \pi(X_i)} + \dfrac{Y_i - \pi(X_i)}{\pi(X_i)\left( 1 - \pi(X_i) \right)}$
$\eta(Y_i) \approx \eta_i + \dfrac{Y_i - \pi(X_i)}{\pi(X_i)\left( 1 - \pi(X_i) \right)}.$ (A1)
We denote $\eta(Y_i)$ as the transformed response $z_i$ shown in Algorithm 5:
$z_i \approx \eta_i + \dfrac{Y_i - \pi(X_i)}{\pi(X_i)\left( 1 - \pi(X_i) \right)}.$ (A2)

Appendix B. Computation of Weights

The weights of the Logitboost are obtained from the variance of the transformed response, $\mathrm{Var}[z_i \mid X]$; the weighted least squares step of Algorithm 5 then uses the inverse of this variance, $w_i = \pi(X_i)\left( 1 - \pi(X_i) \right)$.
$\mathrm{Var}[z_i \mid X] = \mathrm{Var}\left[ \eta_i \mid X \right] + \mathrm{Var}\left[ \left( Y_i - \pi(X_i) \right) \eta'\left( \pi(X_i) \right) \mid X \right]$
$\mathrm{Var}[z_i \mid X] = 0 + \left( \eta'\left( \pi(X_i) \right) \right)^2 \mathrm{Var}\left[ Y_i \mid X \right]$
$\mathrm{Var}[z_i \mid X] = \left( \dfrac{1}{\pi(X_i)\left( 1 - \pi(X_i) \right)} \right)^2 \left[ \pi(X_i)\left( 1 - \pi(X_i) \right) \right] = \left[ \pi(X_i)\left( 1 - \pi(X_i) \right) \right]^{-1}.$

References

  1. Wei, W.; Li, J.; Cao, L.; Ou, Y.; Chen, J. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 2013, 16, 449–475. [Google Scholar] [CrossRef]
  2. Jiang, C.; Wang, Z.; Wang, R.; Ding, Y. Loan default prediction by combining soft information extracted from descriptive text in online peer-to-peer lending. Ann. Oper. Res. 2018, 266, 511–529. [Google Scholar] [CrossRef]
  3. Barboza, F.; Kimura, H.; Altman, E. Machine learning models and bankruptcy prediction. Expert Syst. Appl. 2017, 83, 405–417. [Google Scholar] [CrossRef]
  4. Zaremba, A.; Czapkiewicz, A. Digesting anomalies in emerging European markets: A comparison of factor pricing models. Emerg. Mark. Rev. 2017, 31, 1–15. [Google Scholar] [CrossRef]
  5. Verbeke, W.; Martens, D.; Baesens, B. Social network analysis for customer churn prediction. Appl. Soft Comput. 2014, 14, 431–446. [Google Scholar] [CrossRef]
  6. Ayuso, M.; Guillen, M.; Pérez-Marín, A.M. Time and distance to first accident and driving patterns of young drivers with pay-as-you-drive insurance. Accid. Anal. Prev. 2014, 73, 125–131. [Google Scholar] [CrossRef]
  7. King, G.; Zeng, L. Logistic regression in rare events data. Political Anal. 2001, 9, 137–163. [Google Scholar] [CrossRef] [Green Version]
  8. Maalouf, M.; Trafalis, T.B. Robust weighted kernel Logistic regression in imbalanced and rare events data. Comput. Stat. Data Anal. 2011, 55, 168–183. [Google Scholar] [CrossRef] [Green Version]
  9. Pesantez-Narvaez, J.; Guillen, M. Penalized Logistic regression to improve predictive capacity of rare events in surveys. J. Intell. Fuzzy Syst. 2020, 1–11. [Google Scholar] [CrossRef]
  10. Maalouf, M.; Mohammad, S. Weighted logistic regression for large-scale imbalanced and rare events data. Knowl. Based Syst. 2014, 59, 142–148. [Google Scholar] [CrossRef]
  11. Rao, V.; Maulik, R.; Constantinescu, E.; Anitescu, M. A Machine-Learning-Based Importance Sampling Method to Compute Rare Event Probabilities. In Computational Science—ICCS 2020; Krzhizhanovskaya, V., Závodszky, G., Lees, M., Dongarra, J., Sloot, P., Brissos, S., Texeira, J., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12142. [Google Scholar] [CrossRef]
  12. Kuklev, E.A.; Shapkin, V.S.; Filippov, V.L.; Shatrakov, Y.G. Solving the Rare Events Problem with the Fuzzy Sets Method. In Aviation System Risks and Safety; Springer: Singapore, 2019. [Google Scholar] [CrossRef]
  13. Kamalov, F.; Denisov, D. Gamma distribution-based sampling for imbalanced data. Knowl. Based Syst. 2020, 207, 106368. [Google Scholar] [CrossRef]
  14. Cook, S.J.; Hays, J.C.; Franzese, R.J. Fixed effects in rare events data: A penalized maximum likelihood solution. Political Sci. Res. Methods 2020, 8, 92–105. [Google Scholar] [CrossRef]
  15. Carpenter, D.P.; Lewis, D.E. Political learning from rare events: Poisson inference, fiscal constraints, and the lifetime of bureaus. Political Anal. 2004, 201–232. [Google Scholar] [CrossRef]
  16. Bo, L.; Wang, Y.; Yang, X. Markov-modulated jump–diffusions for currency option pricing. Insur. Math. Econ. 2010, 46, 461–469. [Google Scholar] [CrossRef]
  17. Artís, M.; Ayuso, M.; Guillén, M. Detection of automobile insurance fraud with discrete choice models and misclassified claims. J. Risk Insur. 2002, 69, 325–340. [Google Scholar] [CrossRef]
  18. Wilson, J.H. An analytical approach to detecting insurance fraud using logistic regression. J. Financ. Account. 2009, 1, 1. [Google Scholar]
  19. Falk, M.; Hüsler, J.; Reiss, R.D. Laws of Small Numbers: Extremes and Rare Events; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
  20. L’Ecuyer, P.; Demers, V.; Tuffin, B. Rare events, splitting, and quasi-Monte Carlo. ACM Trans. Model. Comput. Simul. 2007, 17. [Google Scholar] [CrossRef]
  21. Buch-Larsen, T.; Nielsen, J.P.; Guillén, M.; Bolancé, C. Kernel density estimation for heavy-tailed distributions using the Champernowne transformation. Statistics 2005, 39, 503–516. [Google Scholar] [CrossRef]
  22. Bolancé, C.; Guillén, M.; Nielsen, J.P. Transformation Kernel Estimation of Insurance Claim Cost Distributions. In Mathematical and Statistical Methods for Actuarial Sciences and Finance; Corazza, M., Pizzi, C., Eds.; Springer: Milano, Italy, 2010. [Google Scholar] [CrossRef]
  23. Rached, I.; Larsson, E. Tail Distribution and Extreme Quantile Estimation Using Non-Parametric Approaches. In High-Performance Modelling and Simulation for Big Data Applications; Kołodziej, J., González-Vélez, H., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; Volume 11400. [Google Scholar] [CrossRef] [Green Version]
  24. Jha, S.; Guillen, M.; Westland, J.C. Employing transaction aggregation strategy to detect credit card fraud. Expert Syst. Appl. 2012, 39, 12650–12657. [Google Scholar] [CrossRef]
  25. Jin, Y.; Rejesus, R.M.; Little, B.B. Binary choice models for rare events data: A crop insurance fraud application. Appl. Econ. 2005, 37, 841–848. [Google Scholar] [CrossRef]
  26. Pesantez-Narvaez, J.; Guillen, M. Weighted Logistic Regression to Improve Predictive Performance in Insurance. Adv. Intell. Syst. Comput. 2020, 894, 22–34. [Google Scholar] [CrossRef]
  27. Calabrese, R.; Osmetti, S.A. Generalized extreme value regression for binary rare events data: An application to credit defaults. J. Appl. Stat. 2013, 40, 1172–1188. [Google Scholar] [CrossRef]
  28. Loyola-González, O.; Martínez-Trinidad, J.F.; Carrasco-Ochoa, J.A.; García-Borroto, M. Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing 2016, 175, 935–947. [Google Scholar] [CrossRef]
  29. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef] [Green Version]
  30. Beyan, C.; Fisher, R. Classifying imbalanced data sets using similarity based hierarchical decomposition. Pattern Recognit. 2015, 48, 1653–1672. [Google Scholar] [CrossRef] [Green Version]
  31. Pesantez-Narvaez, J.; Guillen, M.; Alcañiz, M. Predicting motor insurance claims using telematics data—XGBoost versus Logistic regression. Risks 2019, 7, 70. [Google Scholar] [CrossRef] [Green Version]
  32. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv, 2017; arXiv:1702.08608. [Google Scholar]
  33. Friedman, J.; Hastie, T.; Tibshirani, R. Additive Logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 2000, 28, 337–407. [Google Scholar] [CrossRef]
  34. Freund, Y.; Schapire, R.E. Experiments with a new boosting algorithm. ICML 1996, 96, 148–156. [Google Scholar]
  35. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of online learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef] [Green Version]
  36. Domingo, C.; Watanabe, O. MadaBoost: A modification of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT), Graz, Austria, 9–12 July 2000; pp. 180–189. [Google Scholar]
  37. Freund, Y. An adaptive version of the boost by majority algorithm. Mach. Learn. 2001, 43, 293–318. [Google Scholar] [CrossRef]
  38. Lee, S.C.; Lin, S. Delta boosting machine with application to general insurance. N. Am. Actuar. J. 2018, 22, 405–425. [Google Scholar] [CrossRef]
  39. Joshi, M.V.; Kumar, V.; Agarwal, R.C. Evaluating boosting algorithms to classify rare classes: Comparison and improvements. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; IEEE: San Jose, CA, USA, 2001; pp. 257–264. [Google Scholar] [CrossRef]
  40. Viola, P.; Jones, M. Fast and robust classification using asymmetric Adaboost and a detector cascade. Adv. Neural Inf. Process. Syst. 2001, 14, 1311–1318. [Google Scholar]
  41. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the EUROPEAN Conference on Principles of Data Mining and Knowledge Discovery, Dubrovnik, Croatia, 22–26 September 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 107–119. [Google Scholar] [CrossRef] [Green Version]
  42. Guo, H.; Viktor, H.L. Learning from imbalanced data sets with boosting and data generation: The databoost-im approach. ACM Sigkdd Explor. Newsl. 2004, 6, 30–39. [Google Scholar] [CrossRef]
  43. Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2009, 40, 185–197. [Google Scholar] [CrossRef]
  44. Hu, S.; Liang, Y.; Ma, L.; He, Y. MSMOTE: Improving classification performance when training data is imbalanced. In Proceedings of the 2009 Second International Workshop on Computer Science and Engineering, Qingdao, China, 28–30 October 2009; pp. 13–17. [Google Scholar] [CrossRef]
  45. Fan, W.; Stolfo, S.J.; Zhang, J.; Chan, P.K. AdaCost: Misclassification cost-sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning, Bled, Slovenia, 27–30 June 1999; Volume 99, pp. 97–105. [Google Scholar]
  46. Ting, K.M. A comparative study of cost-sensitive boosting algorithms. In Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000. [Google Scholar]
  47. Wang, S.; Chen, H.; Yao, X. Negative correlation learning for classification ensembles. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010. [Google Scholar] [CrossRef]
  48. Sun, Y.; Kamel, M.S.; Wang, Y. Boosting for learning multiple classes with imbalanced class distribution. In Proceedings of the Sixth IEEE International Conference on Data Mining, Hong Kong, China, 18–22 December 2006; pp. 592–602. [Google Scholar] [CrossRef] [Green Version]
  49. Sun, Y.; Kamel, M.S.; Wong, A.K.; Wang, Y. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit. 2007, 40, 3358–3378. [Google Scholar] [CrossRef]
  50. Masnadi-Shirazi, H.; Vasconcelos, N. Cost-sensitive boosting. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 294–309. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  51. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  52. Breiman, L.; Friedman, J.; Stone, C.; Olshen, R. Classification and Regression Trees; The Wadsworth and Brooks-Cole Statistics-Probability Series; Taylor and Francis: Wadsworth, OH, USA, 1984. [Google Scholar]
  53. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013. [Google Scholar] [CrossRef]
  54. McCullagh, P.; Nelder, J.A. Generalized Linear Models, 2nd ed.; Chapman and Hall: New York, NY, USA, 1989. [Google Scholar] [CrossRef]
  55. Mease, D.; Wyner, A.J.; Buja, A. Boosted classification trees and class probability/quantile estimation. J. Mach. Learn. Res. 2007, 8, 409–439. [Google Scholar]
  56. Pesantez-Narvaez, J.; Guillen, M.; Alcañiz, M. A Synthetic Penalized Logitboost to Model Mortgage Lending with Imbalanced Data. Comput. Econ. 2020, 57, 1–29. [Google Scholar] [CrossRef]
  57. Liska, G.R.; Cirillo, M.Â.; de Menezes, F.S.; Bueno Filho, J.S.D.S. Machine learning based on extended generalized linear model applied in mixture experiments. Commun. Stat. Simul. Comput. 2019, 1–15. [Google Scholar] [CrossRef]
  58. De Menezes, F.S.; Liska, G.R.; Cirillo, M.A.; Vivanco, M.J. Data classification with binary response through the Boosting algorithm and Logistic regression. Expert Syst. Appl. 2017, 69, 62–73. [Google Scholar] [CrossRef]
  59. Charpentier, A. Computational Actuarial Science with R; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Plot of weights versus estimated probabilities of the Logitboost and the RiskLogitboost regression.
Figure 2. The highest and lowest prediction scores for all observed responses Y within 50 iterations (D = 50), obtained with the RiskLogitboost regression.
Figure 3. Partial dependence plots from the Boosting Tree. Abbreviations: B-N (Basse-Normandie), Ile (Ile-de-France), N.C. (Nord-Pas-de-Calais), Pays (Pays-de-la-Loire), Poitu (Poitou-Charentes), Japanese (Japanese (except Nissan) or Korean), M/C/B (Mercedes, Chrysler or BMW), V/A/S/S (Volkswagen, Audi, Skoda or Seat), Opel (Opel, General Motors or Ford).
Table 1. Root Mean Square Error (RMSE) for observations with Y = 1.

Training Data Set (RMSE Y = 1)

| Method | Lower 0.01 | Lower 0.05 | Lower 0.10 | Lower 0.20 | Lower 0.30 | Lower 0.40 | Upper 0.01 | Upper 0.05 | Upper 0.10 | Upper 0.20 | Upper 0.30 | Upper 0.40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RiskLogitboost regression | 0.2454 | 0.1825 | 0.1496 | 0.1132 | 0.0927 | 0.0803 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Ridge Logistic | 0.9629 | 0.9629 | 0.9629 | 0.9629 | 0.9628 | 0.9628 | 0.9627 | 0.9627 | 0.9627 | 0.9627 | 0.9627 | 0.9627 |
| Lasso Logistic | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 |
| Boosting Tree | 0.9787 | 0.9747 | 0.9727 | 0.9700 | 0.9679 | 0.9665 | 0.9162 | 0.9293 | 0.9417 | 0.9495 | 0.9522 | 0.9539 |
| Logitboost | 0.9829 | 0.9799 | 0.9781 | 0.9736 | 0.9707 | 0.9688 | 0.9416 | 0.9479 | 0.9505 | 0.9530 | 0.9545 | 0.9557 |
| SMOTEBoost | 0.6963 | 0.6901 | 0.6852 | 0.6800 | 0.6761 | 0.6725 | 0.6046 | 0.6090 | 0.6117 | 0.6178 | 0.6222 | 0.6264 |
| RUSBoost | 0.5811 | 0.5742 | 0.562 | 0.5517 | 0.5447 | 0.5391 | 0.4466 | 0.4727 | 0.4853 | 0.4931 | 0.4970 | 0.5001 |
| WLR | 0.9992 | 0.9982 | 0.9973 | 0.9961 | 0.9950 | 0.9939 | 0.4788 | 0.7092 | 0.7961 | 0.8676 | 0.8996 | 0.9183 |
| PLR (PSWa) | 0.9992 | 0.9982 | 0.9973 | 0.9961 | 0.9950 | 0.9939 | 0.4788 | 0.7092 | 0.7961 | 0.8676 | 0.8996 | 0.9183 |
| PLR (PSWb) | 0.9820 | 0.9790 | 0.9771 | 0.9725 | 0.9697 | 0.9678 | 0.9407 | 0.9470 | 0.9496 | 0.9520 | 0.9536 | 0.9547 |
| SyntheticPL | 0.9830 | 0.9803 | 0.9783 | 0.9736 | 0.9708 | 0.9689 | 0.9380 | 0.9467 | 0.9497 | 0.9523 | 0.9540 | 0.9552 |
| WeiLogRFL | 0.3696 | 0.2860 | 0.2386 | 0.1826 | 0.1498 | 0.1297 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |

Testing Data Set (RMSE Y = 1)

| Method | Lower 0.01 | Lower 0.05 | Lower 0.10 | Lower 0.20 | Lower 0.30 | Lower 0.40 | Upper 0.01 | Upper 0.05 | Upper 0.10 | Upper 0.20 | Upper 0.30 | Upper 0.40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RiskLogitboost regression | 0.4690 | 0.3725 | 0.3133 | 0.2421 | 0.1991 | 0.1724 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Ridge Logistic | 0.9629 | 0.9629 | 0.9629 | 0.9629 | 0.9628 | 0.9628 | 0.9627 | 0.9627 | 0.9627 | 0.9627 | 0.9627 | 0.9627 |
| Lasso Logistic | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 | 0.9628 |
| Boosting Tree | 0.9788 | 0.9750 | 0.9731 | 0.9705 | 0.9683 | 0.9669 | 0.9156 | 0.9297 | 0.9424 | 0.9498 | 0.9525 | 0.9542 |
| Logitboost | 0.8745 | 0.8723 | 0.8710 | 0.8688 | 0.8674 | 0.8665 | 0.8558 | 0.8577 | 0.8586 | 0.8595 | 0.8601 | 0.8606 |
| SMOTEBoost | 0.6959 | 0.6901 | 0.6854 | 0.6801 | 0.6762 | 0.6727 | 0.6042 | 0.6088 | 0.6116 | 0.6180 | 0.6226 | 0.6270 |
| RUSBoost | 0.5781 | 0.5600 | 0.5515 | 0.5425 | 0.5358 | 0.5312 | 0.4434 | 0.4539 | 0.4727 | 0.4858 | 0.4913 | 0.4948 |
| WLR | 0.9993 | 0.9982 | 0.9973 | 0.9961 | 0.9950 | 0.9938 | 0.4523 | 0.7057 | 0.7959 | 0.8664 | 0.8989 | 0.9178 |
| PLR (PSWa) | 0.9993 | 0.9982 | 0.9973 | 0.9961 | 0.9950 | 0.9938 | 0.4523 | 0.7057 | 0.7959 | 0.8664 | 0.8989 | 0.9178 |
| PLR (PSWb) | 0.9822 | 0.9792 | 0.9773 | 0.9729 | 0.9700 | 0.9681 | 0.9409 | 0.9471 | 0.9497 | 0.9522 | 0.9537 | 0.9549 |
| SyntheticPL | 0.8745 | 0.8721 | 0.8708 | 0.8686 | 0.8673 | 0.8664 | 0.8559 | 0.8577 | 0.8587 | 0.8596 | 0.8602 | 0.8607 |
| WeiLogRFL | 0.4690 | 0.3725 | 0.3133 | 0.2421 | 0.1991 | 0.1724 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
The results are presented for observations that correspond to policyholders who suffered an accident (Y = 1). All results were analyzed by groups of prediction scores, also known as predicted probabilities. Each RMSE for the 1%, 5%, 10%, 20%, 30% and 40% of the lowest accumulated prediction scores is shown on the left-hand side of the table under "Lower Extreme", and each RMSE for the 1%, 5%, 10%, 20%, 30% and 40% of the highest accumulated prediction scores is shown on the right-hand side under "Upper Extreme". Abbreviations: WLR (Weighted Logistic Regression) [26]; PLR (Penalized Logistic Regression for complex surveys), with two weighting mechanisms, PSWa and PSWb [9]; SyntheticPL (Synthetic Penalized Logitboost) [56]; WeiLogRFL (Weighted Logistic regression) of [10].
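As a concrete illustration of how these extreme-score RMSE cells can be reproduced, the sketch below restricts the sample to one observed class and then evaluates the RMSE on the proportion p of those observations with the lowest (or highest) prediction scores. This is a minimal sketch based on our reading of the grouping described in the note above, not the authors' code; the function and variable names (extreme_rmse, y_true, y_score) are hypothetical.

```python
import numpy as np

def extreme_rmse(y_true, y_score, p, tail="lower", label=1):
    """Illustrative sketch: RMSE over the proportion p of observations with the
    lowest ("lower") or highest ("upper") prediction scores, among observations
    whose observed response equals `label` (e.g., Y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    mask = y_true == label                       # keep one observed class only
    scores = y_score[mask]
    if tail == "lower":
        keep = scores <= np.quantile(scores, p)          # lowest p of the scores
    else:
        keep = scores >= np.quantile(scores, 1.0 - p)    # highest p of the scores
    errors = y_true[mask][keep] - scores[keep]   # residuals on the extreme group
    return np.sqrt(np.mean(errors ** 2))

# Example: the "Lower 0.05" cell for a model with predicted scores p_hat
# rmse_low_5 = extreme_rmse(y, p_hat, p=0.05, tail="lower", label=1)
```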
Table 2. Root Mean Square Error (RMSE) for observations with Y = 0.

Training Data Set (RMSE Y = 0)

| Method | Lower 0.01 | Lower 0.05 | Lower 0.10 | Lower 0.20 | Lower 0.30 | Lower 0.40 | Upper 0.01 | Upper 0.05 | Upper 0.10 | Upper 0.20 | Upper 0.30 | Upper 0.40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RiskLogitboost regression | 0.7508 | 0.8219 | 0.8605 | 0.9062 | 0.9352 | 0.9514 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Ridge Logistic | 0.0371 | 0.0371 | 0.0371 | 0.0371 | 0.0371 | 0.0372 | 0.0373 | 0.0373 | 0.0373 | 0.0373 | 0.0373 | 0.0373 |
| Lasso Logistic | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 |
| Boosting Tree | 0.0197 | 0.0221 | 0.0247 | 0.0273 | 0.0294 | 0.0313 | 0.0773 | 0.0583 | 0.0513 | 0.0470 | 0.0451 | 0.0436 |
| Logitboost | 0.0157 | 0.0188 | 0.0202 | 0.0227 | 0.0264 | 0.0289 | 0.0574 | 0.0510 | 0.0485 | 0.0460 | 0.0445 | 0.0434 |
| SMOTEBoost | 0.2978 | 0.3070 | 0.3116 | 0.3171 | 0.3206 | 0.3240 | 0.3958 | 0.3909 | 0.3865 | 0.3797 | 0.3752 | 0.3704 |
| RUSBoost | 0.4219 | 0.4403 | 0.4488 | 0.4579 | 0.4646 | 0.4692 | 0.5566 | 0.5463 | 0.5281 | 0.5149 | 0.5094 | 0.5058 |
| WLR | 0.0008 | 0.0020 | 0.0029 | 0.0043 | 0.0055 | 0.0067 | 0.5230 | 0.3106 | 0.2356 | 0.1733 | 0.1433 | 0.1250 |
| PLR (PSWa) | 0.0008 | 0.0020 | 0.0029 | 0.0043 | 0.0055 | 0.0067 | 0.5230 | 0.3106 | 0.2356 | 0.1733 | 0.1433 | 0.1250 |
| PLR (PSWb) | 0.0166 | 0.0198 | 0.0212 | 0.0237 | 0.0274 | 0.0299 | 0.0582 | 0.0519 | 0.0495 | 0.0470 | 0.0455 | 0.0444 |
| SyntheticPL | 0.0157 | 0.0183 | 0.0198 | 0.0225 | 0.0262 | 0.0287 | 0.0601 | 0.0521 | 0.0493 | 0.0466 | 0.0450 | 0.0439 |
| WeiLogRFL | 0.6258 | 0.7202 | 0.7758 | 0.8467 | 0.8945 | 0.9212 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |

Testing Data Set (RMSE Y = 0)

| Method | Lower 0.01 | Lower 0.05 | Lower 0.10 | Lower 0.20 | Lower 0.30 | Lower 0.40 | Upper 0.01 | Upper 0.05 | Upper 0.10 | Upper 0.20 | Upper 0.30 | Upper 0.40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RiskLogitboost regression | 0.5446 | 0.6488 | 0.7134 | 0.7988 | 0.8598 | 0.8957 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Ridge Logistic | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 | 0.0374 |
| Lasso Logistic | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 | 0.0372 |
| Boosting Tree | 0.0197 | 0.0220 | 0.0246 | 0.0272 | 0.0293 | 0.0312 | 0.0774 | 0.0583 | 0.0512 | 0.0470 | 0.0451 | 0.0436 |
| Logitboost | 0.1247 | 0.1269 | 0.1279 | 0.1295 | 0.1311 | 0.1322 | 0.1440 | 0.1420 | 0.1411 | 0.1401 | 0.1395 | 0.1390 |
| SMOTEBoost | 0.2976 | 0.3069 | 0.3116 | 0.3171 | 0.3206 | 0.3240 | 0.3959 | 0.3909 | 0.3865 | 0.379 | 0.375 | 0.3705 |
| RUSBoost | 0.4189 | 0.4259 | 0.4383 | 0.4487 | 0.4558 | 0.4614 | 0.5534 | 0.5280 | 0.5154 | 0.5074 | 0.5034 | 0.5003 |
| WLR | 0.0009 | 0.0020 | 0.0029 | 0.0043 | 0.0055 | 0.0067 | 0.5294 | 0.3119 | 0.2364 | 0.1738 | 0.1438 | 0.1254 |
| PLR (PSWa) | 0.0008 | 0.0020 | 0.0029 | 0.0043 | 0.0055 | 0.0067 | 0.5230 | 0.3106 | 0.2356 | 0.1733 | 0.1433 | 0.1250 |
| PLR (PSWb) | 0.0167 | 0.0198 | 0.0212 | 0.0236 | 0.0273 | 0.0298 | 0.0582 | 0.0519 | 0.0495 | 0.0470 | 0.0455 | 0.0444 |
| SyntheticPL | 0.1247 | 0.1271 | 0.1283 | 0.1298 | 0.1313 | 0.1323 | 0.1438 | 0.1418 | 0.1409 | 0.1400 | 0.1394 | 0.1389 |
| WeiLogRFL | 0.5446 | 0.6488 | 0.7134 | 0.7988 | 0.8598 | 0.8957 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
The results are presented for observations that correspond to policyholders who did not suffer an accident (Y = 0). All results were analyzed by groups of prediction scores, also known as predicted probabilities. Each RMSE for the 1%, 5%, 10%, 20%, 30% and 40% of the lowest accumulated prediction scores is shown on the left-hand side of the table under "Lower Extreme", and each RMSE for the 1%, 5%, 10%, 20%, 30% and 40% of the highest accumulated prediction scores is shown on the right-hand side under "Upper Extreme". Abbreviations: WLR (Weighted Logistic Regression) [26]; PLR (Penalized Logistic Regression for complex surveys), with two weighting mechanisms, PSWa and PSWb [9]; SyntheticPL (Synthetic Penalized Logitboost) [56]; WeiLogRFL (Weighted Logistic regression) of [10].
Table 3. Coefficient Estimates, Standard Errors and Confidence Intervals provided by the RiskLogitboost regression.

| Variable | Category | Coefficient Estimate | Standard Error | 95% Confidence Interval |
|---|---|---|---|---|
| * Intercept |  | 20.874 | 7.4130 | (6.3445; 35.4035) |
| Power | e | −0.6527 | 3.5641 | (−7.6383; 6.3329) |
| Power | f | −1.3379 | 3.4769 | (−8.1526; 5.4768) |
| Power | g | −0.8003 | 3.4506 | (−7.5635; 5.9629) |
| Power | h | 4.9061 | 4.9344 | (−4.7653; 14.578) |
| Power | i | 7.8770 | 5.4611 | (−2.8268; 18.5808) |
| Power | j | 8.0675 | 5.5682 | (−2.8462; 18.9812) |
| Power | * k | 18.1880 | 7.1178 | (4.2371; 32.1389) |
| Power | * l | 45.3320 | 1.0540 | (43.2662; 47.3978) |
| Power | * m | 99.6840 | 1.5136 | (96.7173; 102.6507) |
| Power | * n | 144.1900 | 1.7590 | (140.7424; 147.6376) |
| Power | * o | 145.8000 | 17.6033 | (111.2975; 180.3025) |
| Brand | Japanese (except Nissan) or Korean | −7.6774 | 5.7732 | (−18.9929; 3.6381) |
| Brand | Mercedes, Chrysler or BMW | −2.0130 | 6.7667 | (−15.2757; 11.2497) |
| Brand | Opel, General Motors or Ford | −6.5298 | 5.7170 | (−17.7351; 4.6755) |
| Brand | other | 8.2048 | 7.9329 | (−7.3437; 23.7533) |
| Brand | * Renault, Nissan or Citroen | −10.3760 | 4.9954 | (−20.1669; −0.5850) |
| Brand | Volkswagen, Audi, Skoda or Seat | −5.5055 | 5.8621 | (−16.9952; 5.9842) |
| Region | Basse-Normandie | 10.279 | 7.1850 | (−3.8036; 24.3616) |
| Region | Bretagne | −3.4953 | 4.9434 | (−13.1844; 6.1938) |
| Region | Centre | −6.5749 | 4.2924 | (−14.9880; 1.8382) |
| Region | * Haute-Normandie | 27.6060 | 9.3055 | (9.3672; 45.8448) |
| Region | Ile-de-France | −4.1033 | 5.12264 | (−14.1437; 5.9371) |
| Region | * Limousin | 34.5520 | 10.0028 | (14.9465; 54.1575) |
| Region | Nord-Pas-de-Calais | 0.0897 | 5.7443 | (−11.1691; 11.3485) |
| Region | Pays-de-la-Loire | −2.7310 | 5.0910 | (−12.7094; 7.2474) |
| Region | Poitou-Charentes | 2.4523 | 5.9926 | (−9.2932; 14.1978) |
| Density |  | 0.0003 | 0.00025 | (−0.0003; 0.0009) |
| Gas | Regular | 0.0187 | 2.1895 | (−4.2727; 4.3101) |
| Car Age |  | 0.1053 | 0.1969 | (−0.2806; 0.4912) |
| Driver Age |  | 0.0217 | 0.0712 | (−0.1179; 0.1613) |
The base category is other for the covariates Power, Brand and Region, and diesel for the covariate Gas. * Indicates that the coefficient is significant at the 95% confidence level. The standard error (se) is the square root of the corresponding diagonal element of the variance–covariance matrix, computed as $\left(X^{\top} W^{(D)} X\right)^{-1}$, where $W^{(D)}$ is the diagonal matrix of the observation weights $w_i^{(D)}$ at the final iteration D. We built a 95% confidence interval for β as [β − 1.96 se; β + 1.96 se].
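The inference described in the note above translates directly into a few lines of linear algebra. The following is a minimal sketch of that computation, not the authors' original code; the names X, w and beta are hypothetical stand-ins for the design matrix, the final-iteration weights and the estimated coefficients.

```python
import numpy as np

def boosted_coef_inference(X, w, beta):
    """Illustrative sketch: standard errors as the square root of the diagonal of
    (X' W X)^(-1), with W = diag(w) holding the final-iteration weights, and
    95% confidence intervals computed as beta +/- 1.96 * se."""
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    W = np.diag(np.asarray(w, dtype=float))   # diagonal weight matrix W^(D)
    cov = np.linalg.inv(X.T @ W @ X)          # variance-covariance matrix
    se = np.sqrt(np.diag(cov))                # standard errors
    lower, upper = beta - 1.96 * se, beta + 1.96 * se
    return se, np.column_stack((lower, upper))
```

A coefficient is then flagged as significant at the 95% level whenever its interval excludes zero, which is how the asterisks in Table 3 can be read.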
Table 4. Variable importance of the six most relevant covariates according to RiskLogitboost, Boosting Tree, Ridge Logistic regression and Logitboost.

| Order | RiskLogitboost | Boosting Tree | Ridge Logistic | Logitboost |
|---|---|---|---|---|
| First | Power o | Driver Age | Brand Japanese (except Nissan) or Korean | Region Limousin |
| Second | Power n | Brand Japanese (except Nissan) or Korean | Region Haute-Normandie | Power m |
| Third | Power m | Car Age | Brand Opel, General Motors or Ford | Power l |
| Fourth | Power l | Density | Brand Volkswagen, Audi, Skoda or Seat | Power n |
| Fifth | Region Limousin | Brand Opel, General Motors or Ford | Region Nord-Pas-de-Calais | Region Haute-Normandie |
| Sixth | Region Haute-Normandie | Region Haute-Normandie | Brand Mercedes, Chrysler or BMW | Power k |
The Lasso Logistic regression has no significant coefficient estimates from which to compute variable importance.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
