RiskLogitboost Regression for Rare Events in Binary Response: An Econometric Approach

A boosting-based machine learning algorithm is presented to model a binary response with large imbalance, i.e., a rare event. The new method (i) reduces the prediction error of the rare class, and (ii) approximates an econometric model that allows interpretability. RiskLogitboost regression includes a weighting mechanism that oversamples or undersamples observations according to their misclassification likelihood and a generalized least squares bias correction strategy to reduce the prediction error. An illustration using a real French third-party liability motor insurance data set is presented. The results show that RiskLogitboost regression improves the rate of detection of rare events compared to some boosting-based and tree-based algorithms and some existing methods designed to treat imbalanced responses.


Introduction
Research on rare events is steadily increasing in real-world applications of risk management. Examples include fraud detection [1], credit default prediction [2], bankruptcy prediction [3], emerging market anomalies [4], customer churn prediction [5], and accident occurrence in insurance studies [6]. We address the rare event modeling problem with a purposefully designed method to identify rare potential hazards in advance and facilitate an understanding of their causes.
Rare events are extremely uncommon patterns whose atypical behavior is difficult to predict and detect. A broad consensus [7][8][9][10] favors defining rare events data as binary variables with far fewer events (ones) than non-events (zeros). In other words, the degree of imbalance is more extreme in rare events than in class-imbalanced data: rare events are characterized by a number of ones that is hundreds to thousands of times smaller than the number of zeros.
Imbalanced data and rare events have been studied mostly as statistical problems with potential application in diverse fields of biology, political science, engineering, and medicine. To name a few, ref. [11] develops a computational method for evaluating the extreme probabilities from random initialization of nonlinear dynamical systems. Ref. [12] proposes a solution for the rare events problem with fuzzy sets. Ref. [13] proposes a resampling strategy via the gamma distribution for imbalanced data in medical diagnostics. Ref. [14] proposes a penalized maximum likelihood fixed effects estimator for binary time-series cross-sectional data for political science applications. Ref. [15] proposes a learning-based stochastic optimization model using rare event data on U.S. federal government agencies. Ref. [16] introduces dynamic models for rare events and time-inhomogeneity in fluctuating currency markets.
The insurance literature draws heavily on discrete probability distributions, where the occurrence of few events or non-events is considered rare or extreme. For instance, authors like [17,18] use generalized linear models to predict insurance fraud. Ref. [19] develops an extension of the Poisson approximation of binomial distributions for rare events. Other work revolves around solutions reached by rare-event simulation [20].
Refs. [21][22][23] employ non-parametric methods for heavy-tailed distributions. Ref. [24] employs transaction aggregation to detect credit card fraud when the occurrence of ones in the dependent variable is much lower than that of zeros. However, very few papers in this field have been devoted to studying rare events in a binary response, such as [25][26][27], and even fewer go beyond econometric methods, such as [9], which employs advanced machine learning methods.
In fact, developing algorithms that can handle rare events powered by the latest machine learning advances faces two important challenges: (i) Some models exhibit bias towards the majority class or underestimate the minority class. Some classifiers are suitable for balanced data [28,29] or treat the minority class as noise [30]. Moreover, some popular tree-based and boosting-based algorithms have been shown to have a high predictive performance measured only with evaluation metrics that consider all observations equally important [31]. (ii) Unlike econometric methods, several machine learning methods are considered black boxes in terms of interpretation. They are frequently interpreted using single metrics such as classification accuracy as unique descriptions of complex tasks [32], and they are not able to provide robust explanations for high-risk environments.
In this paper we address these two challenges in an attempt to predict and explain rare events, which will be referred to as dependent or target variables. We propose the RiskLogitboost regression, a Logitboost-based algorithm that leads to the convergence of coefficient estimates after some iterations, as occurs when using iteratively re-weighted procedures. Moreover, bias and weighting corrections are incorporated to improve the predictive capacity for the events (ones).
More specifically, our prediction strategy consists of: (i) increasing the accuracy of minority class prediction, and (ii) building an interpretable model similar to classical econometric models. After the introduction, this paper is organized as follows. Section 2 presents the background to the three main approaches used in this research: boosting methods for imbalanced data sets, penalized regression models, and interpretable machine learning. Section 3 describes in detail the proposed RiskLogitboost regression in the rare event problem framework. Section 4 describes the illustrative data used to test the RiskLogitboost regression. Section 5 discusses the results obtained in terms of predictive capacity and interpretability. Finally, Section 6 presents the conclusions of the paper.

Background
To formally define the novel RiskLogitboost regression as a supervised machine learning method, this section first addresses three important notions which are the basis of our strategy. A rigorous description of boosting-based algorithms is presented, since boosting is the core procedure of our method. We then obtain certain key expressions from penalized linear models to approximate the RiskLogitboost as an econometric method. Finally, two widely recognized interpretable machine learning techniques are briefly described to give an overview of how traditional machine learning has been interpreted so far.
Supervised machine learning methods are used to predict a response variable denoted as $Y_i$, $i = 1, \dots, n$. The data consist of a sample of $n$ observations of the response, and the prediction is established by a set of covariates denoted as $X_{ip}$, $p = 1, \dots, P$, with $P$ predictor variables. The model is trained by a base learner $F(X_{ip}; u)$, which is a function of the covariates $X_{ip}$ and the parameters represented by $u$. The predicted response is denoted as $\hat{Y}_i$.
The purpose of supervised machine learning is to minimize the learning error measured by a loss function $\phi$ using an optimization strategy such as gradient descent. The loss function is the distance between the observed $Y_i$ and the predicted response $\hat{Y}_i$, denoted as $\phi(Y_i, \hat{Y}_i)$.

Boosting Methods
Boosting methods for additive functions are developed within an iterative process through a numerical optimization technique called gradient descent. Each function minimizes a specified loss function $\phi$. Ref. [33] applied the boosting strategy to several loss criteria for classification and regression problems (we use the term "classification problem" if $Y_i$ is qualitative, whereas if $Y_i$ is quantitative, we use the term "regression problem"; the latter does not refer to the regression models studied in econometrics, but to a predictive model), such as: least squares $(Y_i - \hat{Y}_i)^2$ for least-squares regression; least absolute deviation $|Y_i - \hat{Y}_i|$ for least-absolute-deviation regression; the Huber loss for M-regression, $0.5(Y_i - \hat{Y}_i)^2$ if $|Y_i - \hat{Y}_i| \le \delta$ and $\delta(|Y_i - \hat{Y}_i| - \delta/2)$ otherwise; and the Logistic binomial log-likelihood $\log(1 + e^{-2 Y_i \hat{Y}_i})$ for two-class Logistic classification.

The Gradient Boosting Machine shown in Algorithm 1 is the base proposal of [33]. The algorithm initializes with a prediction guess of $\hat{Y}_i^0$. Then a boosting process of $D$ iterations is carried out in four stages: the first transforms the new response, denoted as $r_i^d$, computed as the negative gradient of $\phi(Y_i, \hat{Y}_i^d)$ at iteration $d$. The second stage fits a least squares regression with the recently computed $r_i^d$ as the response. The third stage minimizes the loss function between the observed $Y_i$ and $\hat{Y}_i^d + \gamma F(X_{ip}; u^d)$, and the result is delivered in $\gamma$. Finally, the last stage updates the prediction $\hat{Y}_i^d$ by summing $\hat{Y}_i^{d-1}$ and $\gamma F(X_{ip}; u^d)$.
Adaboost was one of the first boosting-based prediction algorithms [34,35]. It trains the base learner in a reweighted version by allocating more weight to misclassified observations. Many other boosting techniques have since been derived, such as RealBoost [33], which allows a probability estimate instead of a binary outcome. Logitboost [33] can be used for two-class prediction problems by optimizing an exponential criterion. Gentle Adaboost [33] builds on Real Adaboost and uses probability estimates to update functions. Madaboost [36] modifies the weighting system of Adaboost. Brownboost [37] is based on finding solutions to Brownian differential equations. Delta Boosting [38] uses a delta basis instead of the negative gradient as the transformed response.
In the context of rare event and imbalanced prediction problems, various boosting-based methods have been proposed in the literature, including but not limited to RareBoost [39], which calibrates the weights depending on the accuracy of each iteration. Asymmetric Adaboost [40] is a variant of Adaboost that incorporates a cascade classifier. SMOTEBoost [41] incorporates SMOTE (synthetic minority over-sampling technique) in a boosting procedure. DataBoost-IM [42] treats outliers and extreme observations in a separate procedure to generate synthetic examples of the majority and minority classes. RUSBoost [43] applies random undersampling of the majority class during boosting. MSMOTEBoost [44] rebalances the minority class and eliminates noise observations. Additional cost-sensitive methods [45][46][47][48][49][50] have been developed by introducing cost items in the boosting procedure.
Other boosting extensions include the tree boosting-based methods, which have been considered a great success in the machine learning community due to their predictive capacity. The tree gradient boost [51] varies from the original gradient boost in the initial value of the first prediction $\hat{Y}_i^0$, and in the use of a Logistic loss function and a tree base learner.
A tree gradient boost, as shown in Algorithm 2, consists of six stages. The first states the values for the initial prediction, $\hat{Y}_i^0$, computed from $\bar{Y}$, the mean of $Y_i$. The second stage obtains the new transformed response with the negative gradient of a Logistic loss function. The third maps the observations onto the $J$ leaves of the tree at iteration $d$. The tree learner is $\sum_{j=1}^{J} u_j 1(X_{ip} \in R_j)$, with $J$ terminal nodes known as leaves and classification rules (regions) $R_j$, $j = 1, \dots, J$. The parameter $u_j$ corresponds to the score of each leaf, which is the proportion of cases classified as events given the covariates $X_{ip}$. Gini and entropy are two metrics for choosing how to split a tree. Gini measures the likelihood of an incorrect classification of a new observation if it were randomly classified according to the distribution of class labels of the covariates. Entropy measures how much information there is in a node. The fourth stage requires minimizing a Logistic loss function to obtain the leaf values $\gamma_{dj}$; however, since there is no closed form for $\gamma_{dj}$, a Newton-Raphson approximation is computed. Finally, the sixth stage updates the final $\hat{Y}_i^d$.

Tree gradient boosting techniques tend to overfit, especially when data are complex or highly imbalanced [31]. Regularization is a popular strategy to penalize the complexity of the tree and allow out-of-sample reproducibility. This involves adding a shrinkage penalty or regularization term to the loss function $\phi(Y_i, \hat{Y}_i)$ so that the leaf scores shrink: $\phi(Y_i, \hat{Y}_i) + \lambda \|u\|$, where $\lambda$ is a regularization parameter associated with the L1-norm or L2-norm of the scores vector $u$. Moreover, ref. [52] introduced cost-complexity pruning, which penalizes the number of terminal nodes $J$ by adding a cost-complexity term $\alpha J$ to the tree's loss. As a consequence, these strategies seem quite risky for analysts who want to keep the effect of the covariates even when this effect is small or not significant, because after applying regularization or pruning the score of the leaf is arbitrarily shrunk and the correspondingly less important characteristics disappear.
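As a minimal sketch of these ideas (not the implementation used in this paper), the shrinkage penalty on leaf scores can be explored in R with the xgboost package, whose lambda and alpha parameters correspond to L2 and L1 penalties on the scores vector $u$; the data below are simulated for illustration.

# Minimal sketch: regularized tree boosting with a Logistic loss in R.
# lambda (L2) and alpha (L1) penalize the leaf scores u_j, illustrating
# the shrinkage penalty discussed above; data are simulated.
library(xgboost)

set.seed(1)
n <- 1000; p <- 4
X <- matrix(rnorm(n * p), n, p)                    # simulated covariates
y <- rbinom(n, 1, plogis(X %*% c(1, -0.5, 0, 0)))  # simulated binary response

fit <- xgboost(
  data = X, label = y, nrounds = 50, verbose = 0,
  params = list(objective = "binary:logistic",
                max_depth = 2,  # controls the number of leaves J
                eta = 0.1,      # learning-rate shrinkage
                lambda = 1,     # L2 penalty on leaf scores
                alpha = 0))     # L1 penalty on leaf scores
head(predict(fit, X))           # estimated probabilities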

Penalized Regression Methods
In the econometric setting, regression models have commonly been used to describe the relationship between a response $Y_i$ and a set of covariates $X_{ip}$. Regression models are used to predict a target variable $\hat{Y}_i$, and they allow interpretability of the coefficients by measuring the effect of the covariates on the expected response.
Logistic regression models are used to model the binary variable $Y_i$. $Y_i$ follows a Bernoulli distribution, where $\pi_i$ is the probability that $Y_i$ equals 1, expressed as follows:

$$\pi_i = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)}. \qquad (1)$$

Note that $X_i \beta$ is the matrix notation of $\beta_0 + \sum_{p=1}^{P} X_{ip} \beta_p$, where $\beta$ is the parameter vector. $1 - \pi_i$ is the probability that $Y_i$ equals 0:

$$1 - \pi_i = \frac{1}{1 + \exp(X_i \beta)}. \qquad (2)$$

The Logistic regression uses a logit function as the linear predictor, defined as:

$$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = X_i \beta. \qquad (3)$$

Then, the classical likelihood function is the joint Bernoulli probability distribution of the observed values of $Y_i$, as follows:

$$L(\beta_0, \dots, \beta_P; X_i) = \prod_{i=1}^{n} \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}. \qquad (4)$$

Taking logarithms of (4) and replacing with Expressions (1) and (2), we obtain:

$$l(\beta_0, \dots, \beta_P; X_i) = \sum_{i=1}^{n} \left[ Y_i X_i \beta - \log\left(1 + \exp(X_i \beta)\right) \right]. \qquad (5)$$

Then the Logistic regression estimates can be found by maximizing the log-likelihood in (5) or minimizing the negative log-likelihood function, which can be seen as a loss function to be minimized. Maximization is achieved by differentiating $l(\beta_0, \dots, \beta_P; X_i)$ with respect to all the $P + 1$ parameters, obtaining a vector of $P + 1$ partial derivative equations known as the score and denoted as $l'(\beta_0, \dots, \beta_P; X_i)$ (we use $'$ to denote the transpose of vectors and matrices).
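For concreteness, the model in (1)-(5) can be fitted by maximum likelihood in R with glm(); a minimal sketch on simulated data:

# Minimal sketch: Logistic regression of (1)-(5) by maximum likelihood.
set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
pi_true <- plogis(-2 + 0.8 * x1 + 0.5 * x2)  # pi_i in (1)
y <- rbinom(n, 1, pi_true)                   # Bernoulli response Y_i

fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
summary(fit)    # coefficient estimates and standard errors
exp(coef(fit))  # odds ratios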
However, when fitting a simple model like a Logistic regression, it is sometimes the case that many variables are not strongly associated with the response $Y_i$, which lowers the classification accuracy of the model. Ref. [53] recognized that this problem can be mitigated with alternative fitting procedures such as constraining or shrinking the coefficients (also known as regularization) before considering non-linear models. The idea is that complex models are sometimes built with irrelevant variables, but by shrinking coefficient estimates we manage to reduce variance, and thus the prediction error.
Accordingly, when complex models arise, the machine learning literature suggests imposing some degree of penalty on the Logistic regression so that the variables that contribute less are shrunk through a regularization procedure.
Ridge Logistic regression, shown in Algorithm 3, follows the dynamics of the Logistic regression, but the term $\lambda \sum_{p=1}^{P} \beta_p^2$, known as the regularization penalty, is added to the negative log-likelihood function from (4). Thus, covariates with a minor contribution are forced to be close to zero.

Algorithm 3. Ridge Logistic Regression.
1. Minimize the penalized negative log-likelihood function: $-l(\beta_0, \dots, \beta_P; X_i) + \lambda \sum_{p=1}^{P} \beta_p^2$.

On the other hand, Lasso Logistic regression, shown in Algorithm 4, follows the dynamics of the Logistic regression, but a regularization penalty $\lambda \sum_{p=1}^{P} |\beta_p|$ is added to the negative log-likelihood function. In this case, less contributive covariates are forced to be exactly zero. In both cases, $\lambda$ is a shrinkage parameter, so the larger it is, the smaller the magnitude of the coefficient estimates [53].

Algorithm 4. Lasso Logistic Regression.
1. Minimize the penalized negative log-likelihood function: $-l(\beta_0, \dots, \beta_P; X_i) + \lambda \sum_{p=1}^{P} |\beta_p|$.
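A minimal sketch of Algorithms 3 and 4 with the glmnet package in R (alpha = 0 gives the Ridge penalty and alpha = 1 the Lasso penalty; data are simulated for illustration):

# Minimal sketch: Ridge and Lasso Logistic regression with glmnet.
library(glmnet)

set.seed(1)
n <- 500; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - 0.5 * X[, 2]))

ridge <- glmnet(X, y, family = "binomial", alpha = 0, lambda = 0.05)
lasso <- glmnet(X, y, family = "binomial", alpha = 1, lambda = 0.05)
coef(ridge)  # coefficients shrunk towards zero
coef(lasso)  # less contributive coefficients set exactly to zero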

Interpretable Machine Learning
Unlike statistical models in econometrics, machine learning algorithms are generally not self-explanatory. For example, generalized linear models provide coefficient estimates, and their standard errors give information about the effect of covariates, whereas machine learning requires alternative methods to make the models understandable. Two popular approaches are described below.
Variable importance (VI), as proposed by [52], measures the influence of inputs on the variation of $\hat{Y}_i$. We obtain the importance in a decision tree by summing the improvements in the loss function over all splits on a specific covariate $X_p$; in other words, variable importance is calculated as the node impurity weighted by the node probability (the node probability is the number of observations contained in that node of the tree divided by the total number of observations). For ensemble techniques, the VI is averaged over all the trees that compose the ensemble.
Partial Dependence Plots (PDP), proposed by [51], show the marginal effect of a covariate $X_p$ on the prediction. The predicted function $\hat{Y}$ is evaluated at certain values of the specific covariate $X_p$ while averaging over a range of values of all the other covariates.
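A partial dependence profile can be computed by hand; the sketch below (our illustration, not the paper's code) fixes the covariate of interest at each grid value for every observation and averages the model predictions:

# Minimal sketch: a partial dependence profile computed by hand.
pdp_profile <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    d <- data
    d[[var]] <- v  # fix X_p at the grid value for all observations
    mean(predict(model, newdata = d, type = "response"))
  })
}

set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1))
fit <- glm(y ~ x1 + x2, family = binomial, data = d)

grid <- seq(-2, 2, length.out = 20)
plot(grid, pdp_profile(fit, d, "x1", grid), type = "l",
     xlab = "x1", ylab = "average predicted probability")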

The Rare Event Problem with RiskLogitboost Regression
The RiskLogitboost regression is an extension of Logitboost [33] that modifies the weighting procedure to improve the classification of rare events. It also adapts a bias correction from [54] in the boosting procedure, which has also been applied to regression models such as those in [7,8,10].
To formally define the RiskLogitboost regression, we first briefly describe the Logitboost shown in Algorithm 5. It initializes with $\hat{Y}_i^0 = 0$ and $\pi^0(X_i) = 0.5$, where $\pi(X_i)$ are the probability estimates. Then the boosting procedure continues with four stages for $d = 1$ to $D$. The first one transforms the response: Logitboost uses the exponential loss function $e^{-Y_i \hat{Y}_i}$, which leads to a quadratic ($\chi^2$) approximation and the transformed response $z_i$ (see further details in Appendix A). The second stage involves calculating the weights by computing the variance of the transformed response, $\text{Var}[z_i | X]$ (see further details in Appendix B). The third stage fits a least squares regression with response $z_i$. Finally, the fourth stage updates the prediction $\hat{Y}_i^d$ and $\pi(X_i)$ by computing $F(X_{ip}; u^d)$, which is $X_i \beta$ for this particular case.
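A minimal sketch of these four stages in R, on simulated data (these are the standard Logitboost updates, not the RiskLogitboost weighting introduced in Section 3.1):

# Minimal sketch of the Logitboost iterations of Algorithm 5.
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))                   # design matrix with intercept
y <- rbinom(n, 1, plogis(1.5 * X[, 2]))

F_hat <- rep(0, n); p_hat <- rep(0.5, n)  # initial values
for (d in 1:50) {
  w <- pmax(p_hat * (1 - p_hat), 1e-5)    # weights: Var[z_i | X]
  z <- (y - p_hat) / w                    # transformed response z_i
  beta <- lm.wfit(X, z, w)$coefficients   # weighted least squares fit
  F_hat <- F_hat + 0.5 * as.vector(X %*% beta)  # Newton-step update
  p_hat <- 1 / (1 + exp(-2 * F_hat))      # updated probability estimates
}
head(cbind(p_hat, y))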

RiskLogitboost Regression Weighting Mechanism to Improve Rare-Class Learning
We propose a weighting mechanism that might be considered a mixed case of oversampling and undersampling. The main idea is to overweight observations whose estimated probability $\pi(X_i)$ is further from the observed value $Y_i$; in other words, observations that are more likely to be misclassified. The new majority class observations are interpolated through a threshold that determines the calibration of weights. In the proposed weighting mechanism, the original weights $w_i$ of the Logitboost are multiplied by a factor related to the distance between $Y_i$ and $\pi(X_i)$, yielding new weights $w_i^*$.
Figure 1 shows the relationship between the weights and the estimated probabilities for the Logitboost and the RiskLogitboost regression. Logitboost overweights observations whose estimated probability is around 0.5, and the weights then decrease gradually and symmetrically on either side. The weighting mechanism in the RiskLogitboost regression overweights low estimated probabilities when $Y_i = 1$ and underweights high estimated probabilities when $Y_i = 0$. Figure 1 also shows that, once the weighting mechanism is transformed, the inverted-U shape is maintained for $Y = 1$ and $Y = 0$. Refs. [9,26,55,56] proposed weighting mechanisms for parametric and non-parametric models to improve the predictive performance of imbalanced and rare data.
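The exact multiplicative factor is garbled in this extraction, so the sketch below is purely illustrative: it assumes a hypothetical factor $1 + |Y_i - \pi(X_i)|$ that overweights observations far from their estimated probability, capturing only the general idea of Section 3.1, not the authors' actual formula.

# Purely illustrative sketch; the factor 1 + |y - pi| is a hypothetical
# stand-in for the RiskLogitboost factor, which is not recoverable here.
risk_weights <- function(y, pi_hat) {
  w <- pi_hat * (1 - pi_hat)           # original Logitboost weights w_i
  w_star <- w * (1 + abs(y - pi_hat))  # overweight likely misclassifications
  w_star / sum(w_star)                 # normalize into a distribution
}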

Bias Correction with Weights
Bias correction will lead to a lower root mean square error. Ref. [54] proposed a bias correction method and showed that the bias of the coefficient estimators for any generalized linear model can be computed as $(X' W X)^{-1} X' W \aleph$, where $W$ is the diagonal matrix of $w_i$. We propose replacing $w_i$ by $w_i^*$, so that the bias for the RiskLogitboost is computed as $(X' W^* X)^{-1} X' W^* \aleph$.
The factor $\aleph$ has elements $\aleph_i = Q_{ii}\left(\pi^D(X_i) - 0.5\right)$, where $Q_{ii}$ are the diagonal elements of the Fisher information matrix denoted as $Q$. The matrix $Q$ measures the amount of information that the matrix $X$ carries about the parameters; in other words, it is the variance of the gradient of the log-likelihood function with respect to the parameter vector, known as the score.
$Q_{rk}$ is the element of the Fisher information matrix for two arbitrary generic parameters, $\beta_k$ and $\beta_r$.
Now let us take the partial derivative of $l(\beta_0, \dots, \beta_P; X_i)$ in (5) with respect to $\beta_k$:

$$\frac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} \left[ \frac{Y_i}{\pi_i} \frac{\partial \pi_i}{\partial \beta_k} - \frac{1 - Y_i}{1 - \pi_i} \frac{\partial \pi_i}{\partial \beta_k} \right], \qquad (8)$$

where

$$\frac{\partial \pi_i}{\partial \beta_k} = \pi_i (1 - \pi_i) X_{ik} \qquad (9)$$

and

$$\frac{\partial (1 - \pi_i)}{\partial \beta_k} = -\pi_i (1 - \pi_i) X_{ik}. \qquad (10)$$

Considering (9) and (10), we obtain:

$$\frac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} (Y_i - \pi_i) X_{ik}. \qquad (11)$$

Now, let us compute the second derivative of (8) with respect to $\beta_r$:

$$\frac{\partial^2 l}{\partial \beta_k \partial \beta_r} = -\sum_{i=1}^{n} \frac{\partial \pi_i}{\partial \beta_r} X_{ik}. \qquad (12)$$

And,

$$\frac{\partial \pi_i}{\partial \beta_r} = \pi_i (1 - \pi_i) X_{ir}. \qquad (13)$$

Plugging (13) into (12):

$$Q_{rk} = E\left[-\frac{\partial^2 l}{\partial \beta_k \partial \beta_r}\right] = \sum_{i=1}^{n} \pi_i (1 - \pi_i) X_{ik} X_{ir}. \qquad (14)$$

Recall that $\text{Var}(Y_i) = \pi_i (1 - \pi_i)$, since $Y_i$ follows a Bernoulli distribution, and that this coincides with the vector $w_i$ (second stage of Algorithm 5). However, the new RiskLogitboost replaces $w_i$ with $w_i^*$ again in Equation (14). If we generalize expression (14) for all $P + 1$ parameters, we obtain:

$$X' W^* X, \qquad (15)$$

where $W^*$ is the diagonal matrix of $w_i^*$. The inverse of (15) is a variance-covariance matrix. Thus, $Q$ is expressed as an $n \times n$ symmetric matrix:

$$Q = X \left(X' W^* X\right)^{-1} X'. \qquad (16)$$

Finally, each transformed parameter is computed as $\beta_p^* = \beta_p^D - \text{bias}_p$.
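A minimal sketch of the computation in R, with the matrix forms of $Q$ and the bias reconstructed from the surrounding text:

# Minimal sketch of the bias correction: Q = X (X' W* X)^{-1} X',
# aleph_i = Q_ii (pi_i - 0.5), bias = (X' W* X)^{-1} X' W* aleph.
bias_correction <- function(X, pi_hat, w_star) {
  W <- diag(w_star)                  # diagonal matrix of w*_i
  XtWX_inv <- solve(t(X) %*% W %*% X)
  Q <- X %*% XtWX_inv %*% t(X)       # n x n symmetric matrix
  aleph <- diag(Q) * (pi_hat - 0.5)  # factor aleph
  as.vector(XtWX_inv %*% t(X) %*% W %*% aleph)
}
# beta_star <- beta_D - bias_correction(X, pi_hat, w_star)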

RiskLogitboost Regression
The RiskLogitboost regression (Algorithm 6) modifies the original version of Logitboost to improve the classification of the rare events (ones). This algorithm comprises 11 stages. The first states the initial values of the prediction $\hat{Y}_i$ and the probability $\pi(X_i)$.
The second obtains the transformed response as explained in Algorithm 5. In the third stage we compute $\hat{Y}_i$, which is the value that minimizes the negative binomial log-likelihood loss function $\log\left(1 + \exp(-2 Y_i \hat{Y}_i)\right)$ used for two-class classification and regression problems. However, $\hat{Y}_i$ also minimizes the exponential loss function $e^{-Y_i \hat{Y}_i}$ used in Logitboost [33]. Therefore, the exponential loss function approximates the log-likelihood, leading to the transformed response $z_i$, as explained in Algorithm 5.
The fourth stage computes the weights explained in detail in Section 3.1. The fifth stage normalizes the weights of the previous stage so as to convert them into a distribution that adds up to 1.
The sixth stage consists of fitting a weighted linear regression to $z_i^d$ and obtaining the $P + 1$ parameters $\beta$; the constant $\beta_0$ is computed by setting $X_p$ to a vector of ones. As proposed in the original Logitboost, the seventh stage updates the prediction $\hat{Y}_i^d$ to fit the model by maximum likelihood using Newton steps, as follows. We update the prediction $\hat{Y}_i + F(X_{ip}; u^d)$, where $u$ corresponds to the parameters $\beta$. The outcome of $F(X_{ip}; u^d)$ would be $X_i \beta$ in a Logistic regression with $\pi_i$ expressed in (1), so that $\pi_i = \frac{\exp\left(2 F(X_{ip}; u^d)\right)}{1 + \exp\left(2 F(X_{ip}; u^d)\right)}$. Recalling $l(\beta_0, \dots, \beta_P; X_i)$ from (5), we compute the expected log-likelihood of $\hat{Y}_i + F(X_{ip}; u^d)$.
The Newton method for minimizing a strictly convex function requires the first and second derivatives. Let $g$ be the first derivative (the gradient) and $H$ the second derivative, also known as the Hessian matrix; the Newton update is then $\hat{Y}_i \leftarrow \hat{Y}_i - H^{-1} g$. This result is a very close approximation of the iteratively reweighted least squares method (Appendix A, Equation (A2)) applied to the likelihood shown in (5). The key difference is the factor $\frac{1}{2}$ that multiplies the expected value. The eighth stage consists of checking that the probabilities are bounded between 0 and 1, since adding a $\delta$ might lead to a number larger than 1.
The ninth stage consists of inverting the $\frac{1}{2} \log$ transformation (explained in the third stage), which yields the probability estimates. Once the iterative process is finished, we obtain the coefficient estimates of iteration $D$ in stage ten through the expression suggested by [57,58]. Last but not least, we obtain $\beta^*$ by subtracting the bias from $\beta^D$, i.e., $\beta^* = \beta^D - \text{bias}$.

Illustrative Data
The illustrative data set used for testing classical and alternative machine learning algorithms is a French third-party liability motor insurance data set available from [59] through the publicly available data sets in the CASdatasets library in R. It contains 413,169 observations, recorded mostly over one year, on risk factors for third-party liability motor policies.
This data set contains the following information about vehicle characteristics: the power of the car ordered by category (Power); the car brand divided into seven categories (Brand); and the fuel type, either diesel or regular (Gas). It also includes information about the policy holder's characteristics, such as the policy region in France based on the 1970-2015 classification (Region) and the number of inhabitants per km² in the city in which the driver resides (Density). Further information includes the car age measured in years (Car Age) and the driver's age (Driver Age). Finally, the occurrence of accident claims $Y_i$ is coded as 1 if the policy holder had suffered at least one accident, and 0 otherwise. A total of 3.75% of policy holders had reported at least one accident (rare event ratio).
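A minimal sketch of loading the data and constructing the response in R, assuming the freMTPLfreq table of CASdatasets (the column name ClaimNb is an assumption taken from that package, not from this paper):

# Minimal sketch: load the French MTPL data and build the binary response.
# CASdatasets is installed from its own repository; see the package website.
# install.packages("CASdatasets", repos = "http://cas.uqam.ca/pub/", type = "source")
library(CASdatasets)
data(freMTPLfreq)

y <- as.integer(freMTPLfreq$ClaimNb > 0)  # 1 if at least one claim was reported
mean(y)                                   # rare event ratio, about 0.0375
nrow(freMTPLfreq)                         # 413,169 policies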

Discussion of Results
This section first presents the predictive performance of several machine learning algorithms jointly with the RiskLogitboost regression for the extreme observations when Y = 1; secondly, it shows that the model is interpretable through the coefficient estimates.

Predictive Performance of Extremes
Tables 1 and 2 show the Root Mean Square Error (RMSE) for observations when Y = 1 and Y = 0, respectively. Even though the Boosting Tree has optimized hyperparameters, it produced a larger error than all other methods when Y = 1. (The Boosting Tree is built with 10-fold cross-validation and hyperparameters optimized through grid search, corresponding to the number of trees (50), the maximum depth of variable interactions (1), the minimum number of observations in the terminal nodes of the trees (10), and shrinkage (0.1), using the caret package in R; the Lasso and Ridge Logistic models had the lowest deviance among several trials with shrinkage values.) This can be attributed to the fact that high predictive performance algorithms such as tree-based methods reduce the global error, which is mainly influenced by the majority class (usually coded as 0) when data are imbalanced. Thus, observations modelled using this type of method show high levels of error when Y = 1. This means that the riskiest observations (with misclassification costs) are poorly detected, and observations whose probability is not high enough are more likely to be misclassified.
The RiskLogitboost regression had the lowest error for observations whose estimated probability was in the lower extremes. This is an important result, since the proportion of cases for this set of observations usually tends to be underestimated by traditional predictive modeling techniques. Moreover, the RiskLogitboost regression perfectly predicted observations whose estimated probability was in the highest extremes, suggesting that observations that are more likely to belong to the rare event (Y = 1) will never be misclassified. From a risk analysis perspective, this is a valuable achievement since it reduces the misclassification costs for this group.
Observations classified with SMOTEBoost and RUSBoost outperform Logitboost, Ridge Logistic, Lasso Logistic, and Boosting Tree; however, their predictive performance is still below that of the RiskLogitboost regression. Even though SMOTEBoost and RUSBoost are designed to handle imbalanced data sets, RiskLogitboost seems to be more efficient at detecting rare events.
Similar performance is obtained by the Weighted Logistic Regression (WLR) [26], the Penalized Logistic regression for complex surveys (PLR) with the two weighting mechanisms PSWa and PSWb [9], and SyntheticPL (Synthetic Penalized Logitboost) [56]. WLR and PLR with PSWa provide exactly the same result because the PLR incorporates the sampling design as well as a resampling correction, and the sampling corrections of the two methods coincide when data are simple random samples. RiskLogitboost still outperforms these modern methods for imbalanced and rare event data. The Weighted Logistic for rare events (WeiLogRFL) [10] might be considered the second best. In contrast, when Y = 0 the Boosting Tree, Ridge Logistic regression and Lasso Logistic regression had a lower RMSE than the RiskLogitboost regression. These three methods classify the non-events (Y = 0) accurately, whereas the RiskLogitboost regression tends to underestimate their occurrence. The results obtained by the RiskLogitboost are quite close to those of the WeiLogRFL. Moreover, WLR, the Ridge Logistic regression and the Boosting Tree obtained the lowest RMSE for the highest and lowest prediction scores. SyntheticPL outperforms RUSBoost and SMOTEBoost, even though the purpose of those methods is to improve the predictive performance on imbalanced data.
The results when Y = 1 also showed that Logitboost was superior, in terms of predictive capacity, to the Ridge Logistic regression, Lasso Logistic regression and Boosting Tree in the testing data set. In this particular case, the Ridge Logistic regression and Lasso Logistic regression performed similarly in the training data set.

In Tables 1 and 2, the results are presented for observations corresponding to policy holders who suffered an accident (Y = 1) and who did not suffer an accident (Y = 0), respectively. All results are analyzed by groups of prediction scores, also known as predicted probabilities. Each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the lowest accumulated prediction scores is shown on the left-hand side of each table under "Lower Extreme", and each RMSE for 1%, 5%, 10%, 20%, 30% and 40% of the highest accumulated prediction scores is shown on the right-hand side under "Upper Extreme". Abbreviations: WLR (Weighted Logistic Regression) [26]; PLR (Penalized Logistic regression for complex surveys), with two weighting mechanisms PSWa and PSWb [9]; SyntheticPL (Synthetic Penalized Logitboost) [56]; WeiLogRFL (Weighted Logistic for rare events) [10].
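As an illustration of how the extreme-score RMSE in Tables 1 and 2 can be computed (a sketch of the evaluation logic, with hypothetical vectors y of observed responses and score of prediction scores):

# Minimal sketch: RMSE over the lowest-scoring share q of the observations
# in a given class (the "Lower Extreme" columns of Tables 1 and 2).
rmse_lower_extreme <- function(y, score, q, class = 1) {
  idx <- which(y == class)
  idx <- idx[order(score[idx])]                   # sort by prediction score
  keep <- idx[seq_len(ceiling(q * length(idx)))]  # lowest share q
  sqrt(mean((y[keep] - score[keep])^2))
}
# e.g. sapply(c(.01, .05, .10, .20, .30, .40),
#             rmse_lower_extreme, y = y, score = score)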
Figure 2 shows the highest and lowest prediction scores for all observed responses Y. The RiskLogitboost regression started with higher levels of RMSE in the first iterations, which then decreased until becoming stable. The RMSE did not vary from the fortieth iteration onwards. As a result, we were able to maintain the convergence process, since the proposed transformation for the weighting procedure (Section 3.1) achieved the same stability as the original Logitboost.

Interpretable RiskLogitboost Regression
Table 3 presents the coefficient estimates, standard errors and confidence intervals obtained by the RiskLogitboost regression. Because the RiskLogitboost regression is designed and fitted similarly to generalized linear models (i.e., Logistic regression), as fully explained in Section 3, we may obtain the odds ratios by exponentiating the coefficient estimates.
The results provided by the RiskLogitboost regression suggest that the likelihood of a policy holder having an accident increased if they had a Power vehicle of type e, k, l, m, n or o; in particular, drivers with o-type Power were the most likely to have an accident among all types of Power.
The policy holder was more likely to have an accident if they drove in the regions of Haute-Normandie and Limousin, whereas driving in the regions of Bretagne, Centre, Île-de-France, Pays de la Loire, Basse-Normandie, Nord-Pas-de-Calais and Poitou-Charentes did not influence the likelihood of a person having an accident.
Policy holders driving Renault, Nissan or Citroen cars were less likely to have an accident than those driving other brands of car.
As expected, the Lasso Logistic regression shrunk all coefficients to zero except the intercept; in this sense, the method is not informative and is actually disadvantageous for analyzing effects. The Ridge Logistic regression provided coefficient estimates of very small magnitude, and overall the covariates in the Ridge Logistic regression seemed to have a small effect on the final prediction, which makes sense because 96.25% of the cases had not reported an accident. However, this model risks underestimating the probability of having an accident. The base category is "other" for the covariates Power, Brand and Region, and diesel for the covariate Gas. * indicates that the coefficient is significant at the 95% confidence level. The standard error (se) is the square root of the diagonal of the variance-covariance matrix, computed as $(X' W^D X)^{-1}$. We built a 95% confidence interval for $\beta$ as $[\beta - 1.96\,\text{se};\ \beta + 1.96\,\text{se}]$.
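A minimal sketch of these standard errors and confidence intervals in R (our illustration of the formulas just described):

# Minimal sketch: se = sqrt(diag((X' W X)^{-1})), CI = beta -/+ 1.96 se.
conf_int <- function(X, w, beta) {
  se <- sqrt(diag(solve(t(X) %*% diag(w) %*% X)))
  cbind(lower = beta - 1.96 * se, estimate = beta, upper = beta + 1.96 * se)
}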
All in all, the coefficients obtained by the RiskLogitboost regression are much larger than those obtained by the other regressions, since this type of algorithm tends to overestimate the probability of occurrence of the target variable to avoid classifying risky observations as $\hat{Y}_i = 0$ instead of $\hat{Y}_i = 1$. The Lasso Logistic regression has no significant coefficient estimates with which to compute the variable importance technique.
Table 4 shows the variable importance of the six most relevant covariates according to the RiskLogitboost, Boosting Tree, Ridge Logistic and Logitboost regressions. The results show no consensus between the methods; however, the Boosting Tree and Ridge Logistic regression have certain categories of Brand and Region as the most important covariates, while certain categories of Power and Region seem to be the most relevant according to Logitboost and RiskLogitboost.
As a consequence, it seems that there is no consensus in the results provided by the variable importance technique, which is risky in terms of interpretation. Analysts should consider that the results of a Boosting Tree, Ridge Logistic or Lasso Logistic regression can generate misleading inferences because they underestimate the occurrence of rare events; the covariates that appear to be most contributive will be those with more effect on non-events (Y = 0). By contrast, the variable importance technique suggests that RiskLogitboost better identifies the covariates that are the most influential in the occurrence of rare events (Y = 1).
Figure 3 shows the partial dependence plots (PDP) obtained from a Boosting Tree. Each plot shows an average model prediction for each value of the covariate of interest. The intuitive interpretation is that the magnitude on the y axis indicates a greater or lesser likelihood of the occurrence of the event (Y = 1). In this particular case, drivers with m-type Power were more likely to have an accident than drivers with d-type Power. Newer vehicles were less likely to be involved in an accident than older ones. Drivers aged between approximately 30 and 80 were less likely to have an accident than very old or very young drivers. Moreover, policy holders who drove in the region of Limousin were the least likely to have an accident in comparison with other regions of France. Last but not least, it seems that Japanese (except Nissan) and Korean vehicles were more likely to be involved in an accident than other brands.

Conclusions
On balance, RiskLogitboost brings a key advantage to the prediction of rare events, principally when the detection of the minority class is fundamental or extremely important in the case study and the impact of false negatives is irrelevant or barely important. The treatment and interpretation of rare events is more accurate when using the RiskLogitboost, and it may contribute to the prevention of events whose occurrence would be disastrous, and whose cost policy holders are not willing to accept or able to afford.
The RiskLogitboost regression is a boosting-based machine learning algorithm shown to improve the prediction of rare events compared to certain well-known tree-based and boosting-based algorithms. It will be of most value where the cost of failing to predict whether and when the rare event will occur is high. The RiskLogitboost regression implements a weighting mechanism and a bias correction that lower the prediction error so as to better predict such rare events by overestimating their probabilities. The results presented here show that the lowest RMSE in the upper and lower extremes occurs when Y = 1. This comes at a cost: the RiskLogitboost regression RMSE tended to increase when Y = 0 in the extreme observations, because the algorithm adjusts misclassified observations, which, in the context of rare events with a binary response, are coded as Y = 1. This cost is low when the cost of false negatives is much smaller than the cost of false positives.
While regularization procedures can be incorporated in econometric methods such as logistic regression, they have two main drawbacks.First, the resulting models may not be adequately interpretable because the shrinkage from such procedures depends on the penalty term, causing loss of the real effect of the covariates on the final prediction.Second, such procedures cannot classify rare events efficiently.
The Tree Boosting regression had the lowest RMSE in the majority class observations (Y = 0) but showed poor performance in the minority class observations. It is also more in the nature of a black box in terms of interpretability, requiring more reliance on the variable importance method and PDP. The PDP from the Tree Boosting regression is relatively informative, but all covariates are treated as significant or relevant for the final prediction, which is sometimes inconsistent with an econometric model like a regression. Moreover, while a PDP is easy to implement when there are only a few variables, with more variables interpretation is more difficult. It is often desirable to achieve both high predictive performance for rare events and interpretability. Tree-based and boosting-based methods may be unsuitable in such situations because they underestimate the probability that the rare event will occur while also underestimating the effect of the covariates that are most important for predicting the rare event rather than the majority class. RiskLogitboost delivers high predictive performance while also facilitating interpretation by identifying the covariates most important to the prediction of the rare event.
The RiskLogitboost still has limitations in decreasing the false negative rate, since it focuses on efficiently reducing the error of observations with $Y_i = 1$. However, for case studies in which the cost of the false negative rate tends to be high, the proposed method could be redesigned so as to improve the detection of observations with $Y_i = 0$. This would be a proposal for further research.

Figure 1 .
Figure 1. Plot of weights versus estimated probabilities of the Logitboost and the RiskLogitboost regression.

Figure 2 .
Figure 2. The highest and lowest prediction scores for all observed responses Y within 50 iterations (D = 50), obtained with the RiskLogitboost regression.

Table 1 .
Root Mean Square Error (RMSE) for observations with Y = 1.

Table 2 .
Root Mean Square Error (RMSE) for observations with Y = 0.

Table 3 .
Coefficient Estimates, Standard Error and Confidence Intervals provided by the RiskLogitboost regression.

Table 4 .
Variable importance of the six most relevant covariates according to RiskLogitboost, Boosting Tree, Ridge Logistic regression and Logitboost.