Article

A Double-Penalized Estimator to Combat Separation and Multicollinearity in Logistic Regression

1 School of Science, Kunming University of Science and Technology, Kunming 650500, China
2 Center for Applied Statistics, Kunming University of Science and Technology, Kunming 650500, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(20), 3824; https://doi.org/10.3390/math10203824
Submission received: 23 August 2022 / Revised: 11 October 2022 / Accepted: 13 October 2022 / Published: 16 October 2022
(This article belongs to the Section Probability and Statistics)

Abstract:

When developing prediction models for small or sparse binary data with many highly correlated covariates, logistic regression often encounters separation or multicollinearity problems, resulting in serious bias and even the nonexistence of standard maximum likelihood estimates. The combination of separation and multicollinearity makes logistic regression even more difficult, yet few studies have addressed the two problems simultaneously. In this paper, we propose a double-penalized method called lFRE to combat separation and multicollinearity in logistic regression. lFRE combines the logF-type penalty with the ridge penalty. The results indicate that, compared with other penalty methods, lFRE not only effectively removes bias from the predicted probabilities but also provides the minimum mean squared prediction error. In addition, a real dataset is employed to compare the performance of lFRE with that of several existing methods. The results show that lFRE is highly competitive and can serve as an alternative algorithm for logistic regression when separation and multicollinearity are present.

1. Introduction

In many clinical research areas, logistic regression is frequently used to develop a predictive model based upon binary data to predict the likelihood of a patient’s health status, such as healthy or diseased. In breast cancer research, for example, models can be developed to predict the likelihood of developing breast cancer. Predictions based on these models and the logistic regression framework assist physicians and patients in making joint decisions about future treatment options. Logistic regression, on the other hand, frequently suffers from severe estimation problems due to separation or multicollinearity, limiting its implementation and potentially leading to unreliable conclusions about the estimated model. Separation and multicollinearity are not negligible issues, and they become even more troublesome when they occur at the same time.
The existence, finiteness, and uniqueness of the maximum likelihood estimates, which have been extensively studied, determine whether the logistic regression model can draw valid conclusions. Silvapulle (1981) took a step in that direction by demonstrating that a certain degree of overlap is both a necessary and sufficient condition for the existence of maximum likelihood estimates for the binomial response model [1]. Albert and Anderson demonstrated the existence theorems of the maximum likelihood estimates (MLEs) for the multinomial logistic regression model by considering three possible patterns for the sample points: complete separation, quasi-complete separation, and overlap [2]. Albert and Anderson defined separation as one or more predictors having strong effects on the response and thus (almost) precisely predicting the outcome of interest (complete separation or quasi-complete separation) [2]. Allison found that separation happens frequently in small- to medium-sized datasets, in which case the regression coefficient estimate of at least one covariate becomes infinite [3]. Sparse data or separated data occur when the response variable is completely separated by a single covariate or a linear combination of covariates [4]. Additionally, an infinite estimate can also be regarded as exceedingly inaccurate, resulting in Wald confidence intervals with infinite widths [5]. Separation can occur even if the underlying model parameter has a low absolute value, according to Heinze and Schemper [6]. They also showed how the sample size, the number of dichotomous covariates, the magnitude of the odds ratios, and the degree of balance in their distribution all affect the probability of separation [6]. Furthermore, advanced but computationally costly techniques have been exploited for detecting separation [7,8]. However, according to Agresti, monitoring the variance in the iteration process or supervising the iterative convergence is sufficient to proclaim separation [9].
The majority of the papers in this field are devoted to proposing new parameter estimates that exist and have good theoretical properties when the data are known to be in complete or quasi-complete separation. Penalized approaches for logistic regression models have gained popularity as a means of overcoming the separation problem. They produce less skewed estimates and, in general, more accurate inferences. In this context, Firth’s penalized method is an alternative to the standard logistic regression approach [10]. This method eliminates the first-order term ($O(n^{-1})$) in the asymptotic bias expansion of the MLEs of the regression parameters by modifying the score equation with a penalty term known as the Jeffreys invariant prior. Heinze and Schemper applied Firth’s method to the solution of the separation problem in logistic regression [6]. However, one of the criticisms leveled at the Firth-type penalty in recent studies is that it is based on the observed covariate data, which can result in artifacts such as estimates that fall outside the prior median and the maximum likelihood estimation (MLE) range [11,12]. Firth’s logistic regression diminishes bias in the maximum likelihood coefficient estimates while producing bias in the predicted probabilities, and the more imbalanced the outcome variable, the greater this bias [13]. As an alternative, Greenland and Mansournia proposed the log F(1, 1) and log F(2, 2) priors as default priors for logistic regression [11,12]. According to the authors, the proposed log-F(m, m) priors are reasonable, transparent, and computationally straightforward for logistic regression. Ogundimu modeled the prediction of default probability using penalized regression models and found that the log-F prior methods are preferred [14]. Note that other methods imposing shrinkage on the regression coefficients can also overcome the separation issue. For example, to address the issue of point separation, Rousseeuw and Christmann presented a hidden logistic regression model [15].
Logistic regression models need to meet certain assumptions in order to produce reliable results; one assumption of binary logistic regression is that the explanatory variables should not be strongly correlated. When the number of covariates is relatively large or the covariates are highly correlated, multicollinearity is likely to arise. One way to deal with this problem in linear regression models is ridge regression, which was first introduced by Hoerl and Kennard [16,17]. The authors proved that there is a non-zero ridge parameter value for which the mean squared error (MSE) of the slope parameter of the ridge regression is smaller than the variance of the ordinary least squares (OLS) estimate of the corresponding parameter. Schaefer et al. applied the ridge parameters proposed by Hoerl and Kennard to logistic regression and created a ridge-type estimator that has a smaller total mean squared error than the maximum likelihood estimator when the independent variables are severely collinear [18]. Furthermore, the ridge parameter, whose size is determined by the number and collinearity of covariates, controls the amount of shrinkage [18]. Lee and Silvapulle advised employing the ridge trace as a diagnostic tool in logistic regression analysis, since it offers additional insight and readily draws attention to particular characteristics of the data model. In a Monte Carlo study, they observed that a ridge-type estimator is at least as good as, and often much better than, the standard logistic regression estimator in terms of the total and predicted mean squared error criteria [19]. Le Cessie and Van Houwelingen used the ridge regression method to improve the parameter estimates and decrease prediction errors in logistic regression [20]. Inan and Erdogan were the first to use the Liu-type estimator in the logistic regression model [21]. For the first time, Guoping Zeng and Emily Zeng investigated the relationship between multicollinearity and separation, proving analytically that multicollinearity implies quasi-complete separation as well as the absence of a finite solution for maximum likelihood estimation [22]. Senaviratna et al. also examined the four primary strategies for detecting multicollinearity before moving on to statistical inference, namely the tolerance, variance inflation factor (VIF), condition index, and variance proportions [23].
Separation and multicollinearity, on the other hand, are nearly never discussed together in the literature, and methods for addressing them at the same time have received little attention. Shen and Gao introduced a double-penalized likelihood estimator, which solves these two problems concurrently by combining Firth’s penalized likelihood equation with a ridge parameter in the logistic regression model [24]. Here, the separation problem is solved using Firth’s penalized likelihood equation, and the multicollinearity problem is solved using the ridge parameter. Nonetheless, Firth’s penalty has several non-negligible flaws. Because Firth’s penalty includes the intercept in the penalization, it produces a bias in the average predicted probability that is much greater than that created by logistic regression [11]. Furthermore, as Firth noted [10], it does not minimize the MSE. The logF-type penalty is a popular alternative to Firth-type penalization that performs better in many respects. Its advantages are as follows. First, the logF-type penalty avoids penalizing the intercept, so that the average predicted probability equals the proportion of observed events. Next, the logF-type penalty is independent of the data. Unlike the Jeffreys prior (Firth-type penalty), it does not induce correlations between explanatory variables. In particular, the logF-type penalty minimizes the MSE of efficient estimates while also providing more stable estimates. Thus, in this paper, we modify Shen and Gao’s approach and propose a more appealing double-penalized likelihood estimator (lFRE) for logistic regression models that combines a logF-type penalized likelihood approach with a ridge penalty term.
The rest of this paper is organized as follows. We consider logistic regression, Firth’s logistic regression [6], logistic regression with penalization by log-F(m, m) [11], and Shen and Gao’s double-penalized likelihood estimator [24] as the comparison methods; these methods are reviewed in Section 2. In Section 3, we propose the estimator that combines the logF-type penalized likelihood approach with a ridge penalty term (lFRE), and we discuss the algorithmic implementation of the coefficient estimates and the selection of the ridge parameter in this model. In Section 4, we offer two detailed simulation studies comparing the performance of lFRE with that of previous logistic regression-based approaches for both prediction and effect estimation in rare event and small dataset conditions. In Section 5, the predictive performance of the methods under investigation is demonstrated on a dataset for predicting the presence of breast cancer. Section 6 concludes with the discussion and conclusions.

2. Materials and Methods

2.1. Logistic Regression (LR)

Given the values of (a subset of) $p$ candidate predictors, we define a logistic regression model for estimating the probability of an event occurring ($Y = 1$) versus not occurring ($Y = 0$). Let $Y_i$ be a binary response (0/1) for the $i$th subject with probability $\pi_i = P(Y = 1 \mid x_i) = 1 - P(Y = 0 \mid x_i)$, $i = 1, \ldots, n$, which follows a Bernoulli distribution. Let $x_i = (1, x_{i1}, \ldots, x_{ip})^T$ be the covariate vector for the $i$th observation, which is $(p+1)$-dimensional. Furthermore, the design matrix is denoted by $X^T = (x_1, \ldots, x_n)$, so that $X$ has order $n \times (p+1)$. Then, we consider the following logistic regression model:
$$P(Y = 1 \mid x_i) = \pi_i = \frac{1}{1 + \exp(-x_i^T \beta)},$$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ is the $(p+1)$-dimensional vector of unknown parameters corresponding to the covariates. The log-likelihood for this model is written as
$$\ell(\beta) = \sum_{i=1}^{n} \left\{ Y_i \log \pi_i + (1 - Y_i) \log(1 - \pi_i) \right\}.$$
The corresponding score function is
$$U(\beta) = \sum_{i=1}^{n} (Y_i - \pi_i) x_i.$$
This method, however, is dependent on the existence of MLEs. Separation may occur in small to medium datasets. Although the likelihood converges, at least one parameter estimate is infinite in this case. MLE regression estimates produce large mean squared errors in the presence of multicollinearity. When separation occurs, Heinze and Schemper suggested Firth’s logistic regression as an alternative to the MLE [6].
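A small illustration (not from the paper) of how separation manifests itself when the model above is fit by maximum likelihood in R: with a completely separated covariate, glm() reports fitted probabilities of 0 or 1 and the slope estimate drifts toward infinity.

```r
# Toy example of complete separation: y is perfectly predicted by the sign of x,
# so the true MLE of the slope is infinite.
set.seed(1)
x <- c(rnorm(10, mean = -1), rnorm(10, mean = 1))
y <- as.integer(x > 0)

fit <- glm(y ~ x, family = binomial)   # typically warns: fitted probabilities numerically 0 or 1
coef(fit)                              # a very large slope estimate standing in for +Inf
```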

2.2. Ridge Logistic Regression (Ridge)

The penalty term employed by Le Cessie and Van Houwelingen is $\lambda \|\beta\|^2$. The ridge regression method minimizes the mean squared error of the predictions by introducing some bias into the estimation of the regression coefficients. The ridge log-likelihood is defined as
$$\ell_{Ridge}(\beta) = \ell(\beta) - \lambda \|\beta\|^2,$$
where $\|\beta\| = \big( \sum_{j=1}^{p} \beta_j^2 \big)^{1/2}$ is the norm of the parameter vector $\beta$, and $\lambda$ is a tuning parameter that controls the trade-off between the likelihood term and the penalty term and is typically chosen by a data-driven procedure such as cross-validation.
Hence, the corresponding score function is
$$U_{Ridge}(\beta) = U(\beta) - 2\lambda \beta.$$
The ridge penalty was initially created to address issues brought on by multicollinearity. Because it shrinks the regression coefficients toward zero, it is also effective at reducing overfitting in risk prediction when the predictors are correlated.
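A minimal sketch of ridge-penalized logistic regression using the glmnet package (the package used for Ridge in Section 4.1); note that glmnet parameterizes the ridge penalty as $\lambda\|\beta\|^2/2$ and scales the log-likelihood by $1/n$, so its $\lambda$ is not numerically identical to the $\lambda$ in the equations above.

```r
# Ridge logistic regression on a toy dataset with two strongly correlated covariates.
library(glmnet)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)                 # nearly collinear with x1
y  <- rbinom(n, 1, plogis(-1 + x1 - x2))
X  <- cbind(x1, x2)

cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 0)  # alpha = 0 selects the pure ridge penalty
coef(cv_fit, s = "lambda.min")                             # shrunken coefficient estimates
```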

2.3. Firth’s Logistic Regression (Firth)

Firth proposed a method for removing the well-known small-sample bias of maximum likelihood estimation [10]. Firth’s method introduces a penalty term into the standard likelihood function that vanishes as $n \to +\infty$ but offsets the existing $O(n^{-1})$ bias for small $n$. In Firth’s penalized likelihood logistic regression model, the likelihood is penalized by $|I(\beta)|^{1/2}$, where $I(\beta)$ is Fisher’s information matrix evaluated at $\beta$, $I(\beta) = X^T W X$, and $W$ is the diagonal matrix $\mathrm{diag}\{\pi_i(1 - \pi_i)\}$. The corresponding penalized log-likelihood function can be expressed as
$$\ell_{Firth}(\beta) = \ell(\beta) + \frac{1}{2} \log |I(\beta)|.$$
The corresponding score function is
$$U^{*}(\beta) = U(\beta) + \frac{1}{2}\,\mathrm{trace}\left\{ I(\beta)^{-1} \frac{\partial I(\beta)}{\partial \beta} \right\},$$
where $X$ represents the observed covariate (design) matrix, $\ell(\beta)$ is the log-likelihood function in Equation (2), $U(\beta)$ denotes the standard score equation, and $|I(\beta)|$ represents the determinant of the Fisher information matrix for $\ell(\beta)$. The penalty term $\frac{1}{2}\log|I(\beta)|$ is referred to as the Jeffreys invariant prior.
The Firth-type penalty is applicable in logistic regression. It prevents infinite estimates caused by separation by removing the $O(n^{-1})$ bias of $\hat\beta$. The Firth-type penalty does, however, have the following flaws, which have been identified in the literature. Greenland and Mansournia pointed out several serious practical flaws of the Jeffreys prior [11]. First, Firth’s penalty is data-dependent because it is proportional to $\frac{1}{2}\log|I(\beta)|$, where $I(\beta)$ is built from the design matrix of the observed covariates. As a result, the majority of design matrices will induce correlations in the Jeffreys prior and hence in the coefficient estimates. Secondly, it is unclear how the penalty can be translated into a prior for odds ratios in general. Furthermore, the mean squared error (MSE) is not minimized; therefore, when applied to small or sparse data, it can produce implausible estimates compared with potentially stronger but still contextually weak penalties such as log-F priors. Finally, as model covariates are added or removed, the marginal prior for a given $\beta$ can change in opaque ways. Additionally, Puhr et al. demonstrated that a post hoc adjustment of the intercept in Firth’s logistic regression increases bias toward one half in the predicted probabilities, and that the greater the imbalance of the outcome, the more substantial this bias [25]. Accordingly, Greenland and Mansournia proposed a logF-type penalty as an alternative to Firth’s penalty in logistic regression when separation exists, avoiding the concerns discussed above [11].
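A sketch (base R only, not the authors’ code) of Firth’s penalized log-likelihood maximized numerically with optim(); the brglm2 package used in Section 4.1 provides a full implementation via the modified score equations.

```r
# Firth-type penalized log-likelihood: binomial log-likelihood plus (1/2) log|I(beta)|.
firth_loglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  pi  <- plogis(eta)
  W   <- diag(pi * (1 - pi))                    # W = diag{pi_i (1 - pi_i)}
  ll  <- sum(y * eta - log(1 + exp(eta)))       # equals sum{y*log(pi) + (1-y)*log(1-pi)}
  ll + 0.5 * log(det(t(X) %*% W %*% X))         # Jeffreys-prior penalty
}
fit_firth <- function(X, y) {
  optim(rep(0, ncol(X)), function(b) -firth_loglik(b, X, y), method = "BFGS")$par
}

# On a completely separated toy dataset the Firth estimate stays finite where the MLE diverges.
set.seed(1)
x <- c(rnorm(10, -1), rnorm(10, 1)); y <- as.integer(x > 0)
fit_firth(cbind(1, x), y)
```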

2.4. Logistic Regression with a log-F Prior Penalty (LF)

Greenland and Mansournia developed a family of penalty functions $r(\beta) = \log(|I(\beta)^m|)$ indexed by $m \geq 0$, which yields the MLE ($m = 0$) and the Haldane–Firth estimator ($m = 1$) as special instances [11]. By transforming each coefficient penalty into a pseudo-data record, log-F penalties are simple to implement in any logistic regression package, and this avoids penalizing the intercept, ensuring that the mean predicted probability equals the proportion of events. Generally speaking, in a log-F prior, the prior degrees of freedom $m$ equal the number of observations added by the prior. The penalty imposed by the log-F($m$, $m$) prior multiplies the likelihood function corresponding to Equation (2) by $\exp(m\beta/2)/\{1 + \exp(\beta)\}^m$. As a result, the corresponding penalized log-likelihood can be written as
$$\ell_{LF}(\beta) = \ell(\beta) + m\beta/2 - m \log\{1 + \exp(\beta)\}.$$
The corresponding score function is expressed as
$$U^{**}(\beta) = U(\beta) + m/2 - m \exp(\beta)/\{1 + \exp(\beta)\}.$$
When $m = 0$, this penalized log-likelihood reduces to the log-likelihood function in Equation (2); in other words, the penalized estimator with a logF(0, 0) prior is the same as the MLE. Similarly, the logF(1, 1) prior corresponds to the Jeffreys prior in a one-parameter model, such as a matched-pair case–control analysis. The log-F prior has been recommended as a default penalty for general sparse-data circumstances [12]. Taking simplicity of implementation and interpretation as our chief criteria, we consider the logF(1, 1) and logF(2, 2) priors in this study [11,12,13].
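A sketch of the pseudo-data implementation of the log-F($m$, $m$) penalty described above, assuming a hypothetical data frame dat with outcome y and covariates x1–x5: each penalized coefficient contributes one extra record with that covariate set to 1, every other column (including the intercept) set to 0, outcome 1/2, and prior weight $m$, which multiplies the likelihood by $\exp(m\beta_j/2)/\{1 + \exp(\beta_j)\}^m$.

```r
# log-F(m, m) penalized logistic regression via data augmentation, leaving the intercept unpenalized.
lf_fit <- function(dat, m = 2) {
  X <- model.matrix(~ x1 + x2 + x3 + x4 + x5, data = dat)   # first column is the intercept
  y <- dat$y
  p <- ncol(X) - 1                                          # number of penalized coefficients
  X_aug <- rbind(X, cbind(0, diag(p)))                      # one pseudo-record per coefficient
  y_aug <- c(y, rep(0.5, p))                                # "m/2 successes out of m trials"
  w_aug <- c(rep(1, length(y)), rep(m, p))
  # glm warns about non-integer successes; that warning is expected for weighted pseudo-data
  suppressWarnings(glm(y_aug ~ X_aug - 1, family = binomial, weights = w_aug))
}
```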

2.5. Shen and Gao’s Double-Penalized Likelihood Estimator (DPLE)

For handling both separation and multicollinearity, Shen and Gao proposed a double-penalized maximum likelihood estimator for logistic regression models that combines Firth’s penalized likelihood function with a ridge parameter [24]. The corresponding double-penalized log-likelihood can then be written as
$$\ell_{DPLE}(\beta) = \ell(\beta) + \frac{1}{2}\log|I(\beta)| - \lambda \|\beta\|^2.$$
The corresponding score function is expressed as
$$U^{***}(\beta) = U(\beta) + \frac{1}{2}\,\mathrm{trace}\left\{ I(\beta)^{-1} \frac{\partial I(\beta)}{\partial \beta} \right\} - 2\lambda\beta.$$
The ridge parameter $\lambda$ determines the amount of shrinkage of the norm of $\beta$ in the log-likelihood function $\ell_{DPLE}(\beta)$ above. The value of $\lambda$ is determined by the number of covariates and their degree of multicollinearity.
In general, this method extends Firth’s logistic regression with the added ability to select $\lambda$ to mitigate multicollinearity. However, as discussed in Section 2.3, the Firth-type penalty has several serious shortcomings, and the same issues persist in Shen and Gao’s double-penalized likelihood estimator. Above all, Shen and Gao’s estimator also introduces bias in the average predicted probability, and this bias may not be negligible. By contrast, by excluding the intercept from the penalty, log-F(m, m) priors produce an average predicted probability equal to the observed proportion of events in each sample, and Greenland and Mansournia found that they outperform the Firth-type penalty [11]. Moreover, DPLE includes the Jeffreys prior (Firth’s penalty term), which is data-dependent and induces correlations between covariates; the logF-type prior included in the lFRE we propose is not data-dependent and avoids this issue. Thus, we propose modifying this double-penalized method by replacing Firth’s penalized likelihood approach with a log-F penalized approach while leaving the rest unaltered. Finally, as noted above, the Firth-type penalty does not minimize the MSE, whereas log-F priors do. Consequently, of the two penalties, the logF-type penalty avoids the drawbacks of Firth’s penalty. Consistent with intuition, DPLE may provide a larger MSE than lFRE, as demonstrated in Simulations 1 and 2. We propose that lFRE may be a better strategy than DPLE for dealing with the problems of separation and multicollinearity in terms of contextual transparency, computational simplicity, and reasonableness for logistic regression. For more information on the double-penalized likelihood estimator combining a logF-type penalty with a ridge penalty (lFRE), see Section 3 below.

3. The logF-Type Penalty with a Ridge-Based Estimator (lFRE)

3.1. The Double-Penalized Likelihood Estimator Combining a logF-Type Penalty with a Ridge Penalty

Based on the analysis in Section 2.5, an alternative approach for Shen and Gao’s double-penalized likelihood estimator that avoids these drawbacks is to replace Firth-type penalization with logF-type penalization. As a result, we propose a new double-penalized likelihood estimator that incorporates a logF-type penalty as well as a ridge penalty (lFRE). The log-likelihood function corresponding to this can be expressed as
$$\ell_{lFRE}(\beta) = \ell(\beta) + m\beta/2 - m\log\{1 + \exp(\beta)\} - \lambda\|\beta\|^2,$$
where $\ell(\beta)$ is the unpenalized log-likelihood function, $m$ is the prior degrees of freedom of the log-F prior, $\lambda$ is the ridge parameter that regulates how much the norm of $\beta$ shrinks, and $\|\cdot\|$ denotes the Euclidean norm of a parameter vector, which constrains the magnitude of the parameters in $\beta$.
A one-parameter issue without an intercept can be used to illustrate the core concept of lFRE. Multiparameter extensions will be covered in the following sections. Consider a single independent variable x i with just two possible values (zero or one), as well as a binary response Y (see Table 1).
In Table 1, $n_{00} = \sum I(y_i = 0, x_i = 0)$, $n_{01} = \sum I(y_i = 0, x_i = 1)$, $n_{10} = \sum I(y_i = 1, x_i = 0)$, and $n_{11} = \sum I(y_i = 1, x_i = 1)$, where $I(\cdot)$ is the indicator function. By writing $P(Y = 1 \mid X = x_i) = 1/(1 + e^{-x_i \beta})$, we obtain $p_1 = P(y_i = 1 \mid x_i = 1) = 1/(1 + e^{-\beta})$ and $p_0 = P(y_i = 1 \mid x_i = 0) = 1/2$. Now, let $n_1 = n_{11} + n_{01}$.
Therefore, the likelihood function can be written as
$$L = p_0^{n_{10}} (1 - p_0)^{n_{00}}\, p_1^{n_{11}} (1 - p_1)^{n_1 - n_{11}}.$$
It is obvious that $p_0^{n_{10}}(1 - p_0)^{n_{00}}$ is a constant that does not involve the parameter $\beta$; for the sake of the derivation, we write $C = p_0^{n_{10}}(1 - p_0)^{n_{00}}$. As a result, the corresponding log-likelihood function can be written as
$$\ell = \log C + n_{11}\beta + n_1 \log(1 - p_1).$$
For this logistic regression model, the maximum likelihood estimate of β is expressed as
$$\hat\beta = \log\left( \frac{n_{11}}{n_1 - n_{11}} \right).$$
Clearly, when $n_1 = n_{11}$ (i.e., there is no observation with the combination $x_i = 1$ and $y_i = 0$), $\hat\beta$ does not exist as a finite value. As a result, complete separation occurs.
Following that, we apply the log-F prior penalty to this simple logistic model. In general, penalization by log-F($m$, $m$) priors is equivalent to multiplying the likelihood function corresponding to Equation (13) by $\exp(m\beta/2)/\{1 + \exp(\beta)\}^m$. Therefore, the likelihood function for this model is written as
$$L^{*} = C\, p_1^{n_{11}} (1 - p_1)^{n_1 - n_{11}} \exp(m\beta/2)/\{1 + \exp(\beta)\}^m.$$
The corresponding log-likelihood function is expressed as
$$\ell^{*} = \log C + n_{11}\beta + n_1\log(1 - p_1) + m\beta/2 - m\log\{1 + \exp(\beta)\}.$$
The parameter estimate for the logistic model with a log-F prior penalty is equal to
$$\hat\beta^{*} = \log\left( \frac{n_{11} + m/2}{m/2 + n_1 - n_{11}} \right).$$
The double-penalized log-likelihood function we propose is written as
$$\ell^{**} = \log C + n_{11}\beta + n_1\log(1 - p_1) + m\beta/2 - m\log\{1 + \exp(\beta)\} - \lambda\beta^2.$$
Because the analytical solution $\hat\beta^{**}$ is difficult to derive, we compute it numerically with the iterative algorithm
$$\hat\beta^{**(t+1)} = \log\left( \frac{n_{11} + m/2 - 2\lambda\hat\beta^{**(t)}}{m/2 + n_1 - n_{11} + 2\lambda\hat\beta^{**(t)}} \right).$$
When $\lambda = 0$, lFRE reduces to the logF-type penalized likelihood method. From this viewpoint, lFRE is a log-F prior penalty with the added capability of selecting $\lambda$ to mitigate multicollinearity. Section 3.2 describes the algorithmic implementation of the coefficient estimates.
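A minimal sketch of the one-parameter iteration above in R; the counts, $m$, and $\lambda$ are illustrative values rather than ones taken from the paper.

```r
# Fixed-point iteration for the one-parameter lFRE estimate.
lfre_one_param <- function(n11, n1, m = 2, lambda = 0.1, tol = 1e-8, maxit = 200) {
  beta <- 0
  for (t in seq_len(maxit)) {
    beta_new <- log((n11 + m / 2 - 2 * lambda * beta) /
                    (m / 2 + n1 - n11 + 2 * lambda * beta))
    if (abs(beta_new - beta) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}

# Complete separation: every x = 1 observation has y = 1 (n1 = n11), yet the estimate stays finite.
lfre_one_param(n11 = 12, n1 = 12)
```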

3.2. Algorithm Implementation

The penalized maximum likelihood estimates of $\beta$ can be obtained by maximizing $\ell_{lFRE}(\beta)$. To put it another way, the coefficient estimates can be calculated as follows:
$$\hat\beta = \operatorname*{argmax}_{\beta}\ \Big[ \ell(\beta) + m\beta/2 - m\log\{1 + \exp(\beta)\} - \lambda\|\beta\|^2 \Big].$$
As a consequence, similar to the unpenalized MLE, the Newton–Raphson maximization algorithm may be used to solve for $\hat\beta$. The Newton–Raphson iterative algorithm is defined as
$$\hat\beta^{(t+1)} = \hat\beta^{(t)} + \big(A^{(t)}\big)^{-1} U(\hat\beta^{(t)}),$$
where $t$ stands for the iteration number, $A^{(t)}$ represents the information matrix of the double-penalized log-likelihood function described in Section 3.1, and $U(\hat\beta^{(t)})$ is the first derivative of the log-likelihood function (Equation (12)). In other words, we have
$$U(\hat\beta_j^{(t)}) = \frac{\partial \ell(\beta)}{\partial \beta_j} + \frac{\partial\,[m\beta/2 - m\log\{1 + \exp(\beta)\}]}{\partial \beta_j} - \frac{\partial\, \lambda\|\beta\|^2}{\partial \beta_j} = \sum_{i=1}^{n} (Y_i - \pi_i) x_{ij} + m/2 - m\exp(\beta_j)/\{1 + \exp(\beta_j)\} - 2\lambda\beta_j \quad (j = 0, \ldots, p),$$
and $H$ is the Hessian matrix of the log-likelihood function corresponding to Equation (12), such that
$$H = \begin{pmatrix} \sum_{i=1}^{n} \pi_i(\pi_i - 1) - \dfrac{m\exp(\beta_0)}{\{1 + \exp(\beta_0)\}^2} - 2\lambda & \cdots & \sum_{i=1}^{n} \pi_i(\pi_i - 1) x_{ip} \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^{n} \pi_i(\pi_i - 1) x_{ip} & \cdots & \sum_{i=1}^{n} \pi_i(\pi_i - 1) x_{ip}^2 - \dfrac{m\exp(\beta_p)}{\{1 + \exp(\beta_p)\}^2} - 2\lambda \end{pmatrix}.$$
The second-order partial derivatives of the log-likelihood function presented in this research are considerably easier to derive than those in Shen and Gao’s approach [24]. Because the information matrix of Shen and Gao’s double-penalized likelihood function is difficult to derive, an approximate information matrix was used there to obtain the variance estimates of the parameters at convergence. We favor lFRE over DPLE because of its interpretive and computational simplicity, as well as the aforementioned concerns. The choice of $\lambda$ is discussed in Section 3.3. We first establish the following result.
Theorem 1.
$\ell_{lFRE}(\beta)$ in Equation (12) is a concave function.
Proof.
We consider the Hessian matrix $H$ of the log-likelihood function $\ell_{lFRE}(\beta)$. For simplicity, we let $g(\beta) = \exp(\beta)/\{1 + \exp(\beta)\}^2$ and $z_i = \pi_i(\pi_i - 1)$. Here, we consider the case of $p = 2$ and have
$$H = \begin{pmatrix} \sum_{i=1}^{n} z_i - m g(\beta_0) - 2\lambda & \sum_{i=1}^{n} z_i x_{i1} & \sum_{i=1}^{n} z_i x_{i2} \\ \sum_{i=1}^{n} z_i x_{i1} & \sum_{i=1}^{n} z_i x_{i1}^2 - m g(\beta_1) - 2\lambda & \sum_{i=1}^{n} z_i x_{i1} x_{i2} \\ \sum_{i=1}^{n} z_i x_{i2} & \sum_{i=1}^{n} z_i x_{i1} x_{i2} & \sum_{i=1}^{n} z_i x_{i2}^2 - m g(\beta_2) - 2\lambda \end{pmatrix}.$$
Let $D_k$ denote the $k$th leading principal minor of $H$, where $k = 1, 2, 3$. $H$ is a negative definite matrix if $(-1)^k D_k > 0$ for $k = 1, 2, 3$. The first-order leading principal minor of $H$ can be expressed as
$$D_1 = \sum_{i=1}^{n} z_i - m g(\beta_0) - 2\lambda.$$
Since $z_i < 0$, $g(\beta_0) > 0$, and $\lambda > 0$, we have $D_1 < 0$; thus, $(-1) D_1 > 0$. Next,
$$D_2 = \begin{vmatrix} \sum_{i=1}^{n} z_i - m g(\beta_0) - 2\lambda & \sum_{i=1}^{n} z_i x_{i1} \\ \sum_{i=1}^{n} z_i x_{i1} & \sum_{i=1}^{n} z_i x_{i1}^2 - m g(\beta_1) - 2\lambda \end{vmatrix} \geq -\sum_{i=1}^{n} m z_i \left\{ g(\beta_1) + g(\beta_0) x_{i1}^2 \right\} - 2\lambda \sum_{i=1}^{n} z_i (1 + x_{i1}^2) + 2\lambda m \{ g(\beta_0) + g(\beta_1) \} + m^2 g(\beta_0) g(\beta_1) + 4\lambda^2.$$
Since $z_i < 0$, $g(\beta_0) > 0$, $g(\beta_1) > 0$, $m > 0$, and $\lambda > 0$, we find that $D_2 > 0$; thus, $(-1)^2 D_2 > 0$. Finally,
$$D_3 = \begin{vmatrix} \sum_{i=1}^{n} z_i - m g(\beta_0) - 2\lambda & \sum_{i=1}^{n} z_i x_{i1} & \sum_{i=1}^{n} z_i x_{i2} \\ \sum_{i=1}^{n} z_i x_{i1} & \sum_{i=1}^{n} z_i x_{i1}^2 - m g(\beta_1) - 2\lambda & \sum_{i=1}^{n} z_i x_{i1} x_{i2} \\ \sum_{i=1}^{n} z_i x_{i2} & \sum_{i=1}^{n} z_i x_{i1} x_{i2} & \sum_{i=1}^{n} z_i x_{i2}^2 - m g(\beta_2) - 2\lambda \end{vmatrix}.$$
For $D_3$, it is simplest to argue directly. Writing $u_i = (1, x_{i1}, x_{i2})^T$, the Hessian can be decomposed as $H = \sum_{i=1}^{n} z_i u_i u_i^T - \mathrm{diag}\{m g(\beta_0) + 2\lambda,\ m g(\beta_1) + 2\lambda,\ m g(\beta_2) + 2\lambda\}$. Because $z_i < 0$, the first term is negative semidefinite, and because $m g(\beta_j) + 2\lambda > 0$, the second (subtracted) term is positive definite; hence $v^T H v < 0$ for every $v \neq 0$. Therefore $D_3 = \det(H) < 0$, and thus $(-1)^3 D_3 > 0$.
Therefore, the Hessian matrix $H$ is negative definite, and $\ell_{lFRE}(\beta)$ is a concave function. □
According to Theorem 1, $\ell_{lFRE}(\beta)$ therefore has a unique maximum.
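A minimal sketch (not the authors’ implementation) of fitting lFRE by directly maximizing the double-penalized log-likelihood in Equation (12) with optim() instead of the Newton–Raphson update above. Following the score equation and Hessian written in this subsection, the sketch applies both penalty terms to every coefficient, including the intercept; excluding the intercept from the penalties, as discussed in Section 2.4, is a one-line change. The choices m = 2 and lambda = 0.1 are illustrative defaults.

```r
# Double-penalized (logF-type + ridge) logistic regression fitted by numerical optimization.
# X: design matrix whose first column is the intercept; y: 0/1 response vector.
fit_lfre <- function(X, y, m = 2, lambda = 0.1) {
  neg_obj <- function(beta) {
    eta <- drop(X %*% beta)
    ll  <- sum(y * eta - log(1 + exp(eta)))            # binomial log-likelihood l(beta)
    pen <- sum(m * beta / 2 - m * log(1 + exp(beta)))  # logF-type penalty on each coefficient
    -(ll + pen - lambda * sum(beta^2))                 # minus the penalized log-likelihood
  }
  optim(rep(0, ncol(X)), neg_obj, method = "BFGS")$par
}
```

By Theorem 1 the objective is concave, so the single optimum found by the optimizer is the global maximizer.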

3.3. The Choice of the Ridge Parameter

The most typical technique for finding an appropriate ridge parameter $\lambda$ is to minimize the prediction error of the logistic regression as the primary criterion. The maximizer of Equation (12) is written as $\hat\beta_\lambda$. Various techniques for defining prediction errors were discussed by Efron as well as by Van Houwelingen and Le Cessie [20,26]. Following that, Van Houwelingen and Le Cessie and van Wieringen [20,27] described three different methods for quantifying the prediction error:
(a) Classification or counting error ($CE$):
$$CE = \begin{cases} 1, & \text{if } Y_i = 1 \text{ and } \hat\pi_i < \tfrac{1}{2}, \text{ or } Y_i = 0 \text{ and } \hat\pi_i > \tfrac{1}{2}; \\ \tfrac{1}{2}, & \text{if } \hat\pi_i = \tfrac{1}{2}; \\ 0, & \text{otherwise.} \end{cases}$$
(b) Prediction error ($PE$):
$$PE = (Y_i - \hat\pi_i)^2.$$
(c) Minus log-likelihood error ($ML$):
$$ML = -\{ Y_i \log \hat\pi_i + (1 - Y_i) \log(1 - \hat\pi_i) \}.$$
The merits and disadvantages of these three measures were debated by Van Houwelingen and Le Cessie [20]. $CE$ is sensitive only to whether a prediction falls above or below $\pi = \tfrac{1}{2}$, whereas the other two indicators take the model prediction across the whole probability range into account. In this paper, we use the intuitive squared error as the measure of prediction error due to its ease of calculation. When comparing the effects of various ridge parameters, the prediction sum of squares ($PRESS$) given by Allen is a credible criterion of good prediction when calculated on the entire dataset [28]. We employed cross-validation (CV) estimation of $PRESS$ in the absence of external validation datasets, which is defined as
$$PRESS_{CV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \big( Y_i - \hat\pi_{-i}(X_i) \big)^2,$$
where $\hat\beta_{-i}(\lambda)$ represents the estimate based on all observations except $(X_i, Y_i)$, and $\hat\pi_{-i}(x)$ is the estimate of $\pi(x)$ based on $\hat\beta_{-i}(\lambda)$. The optimal $\lambda$ is then chosen by minimizing $PRESS$. The estimate $\hat\beta_\lambda$ is expected to be closer to the true value of $\beta$ than the standard MLE; that is, $PRESS(\hat\beta_\lambda) < PRESS(\hat\beta)$ for a suitable $\lambda$. For lFRE, there is always a ridge parameter $\lambda > 0$ for which the estimates have a lower $PRESS$ than the maximum likelihood estimates. In the following section, we conduct two extensive simulation studies to compare the performance of lFRE with other logistic regression-based methods (LR, Firth, Ridge, LF22, and DPLE) in various scenarios.
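A sketch of selecting the ridge parameter by minimizing a cross-validated squared-error criterion, reusing the hypothetical fit_lfre() from the sketch in Section 3.2; the grid of candidate values of lambda and the number of folds are illustrative choices rather than ones taken from the paper.

```r
# K-fold cross-validated PRESS over a grid of ridge parameters; returns the minimizing lambda.
cv_press <- function(X, y, lambda_grid = 10^seq(-3, 2, length.out = 20), k = 10, m = 2) {
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))
  press <- sapply(lambda_grid, function(lam) {
    sq_err <- sapply(seq_len(k), function(fold) {
      train  <- folds != fold
      beta   <- fit_lfre(X[train, , drop = FALSE], y[train], m = m, lambda = lam)
      pi_hat <- 1 / (1 + exp(-drop(X[!train, , drop = FALSE] %*% beta)))
      sum((y[!train] - pi_hat)^2)
    })
    sum(sq_err) / nrow(X)          # average squared prediction error over held-out observations
  })
  lambda_grid[which.min(press)]
}
```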

4. Simulations

The purpose of this section is to assess the properties of the regression coefficients of the LR, Ridge, Firth, LF22, DPLE, lFRE11, and lFRE22 methods by conducting two simulation studies. The following briefly describes how data were generated and cases varied in the simulation study.

4.1. Software and Data Generation

The simulations were programmed in R 4.0.5. The R functions glm and glmnet [29] were used to implement LR and Ridge. Firth was implemented with the brglm2 package [30]. The LF11 and LF22 models were implemented using the R function plogit [12]. To generate the mixed continuous and binary covariates in a dataset, we first generated three continuous covariates ($x_1$, $x_2$, and $x_3$) from a multivariate normal distribution with zero means, unit variances, and a specified correlation matrix. Then, we produced two binary covariates ($x_4$ and $x_5$) from a binomial distribution, followed by a binary response variable generated from a Bernoulli distribution with probability $\pi_i$ ($i = 1, \ldots, n$) calculated from the true logistic model $\mathrm{logit}(\pi) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$, where $\beta_0 = -2$, $\beta_1 = 1$, $\beta_2 = 1$, $\beta_3 = 1$, $\beta_4 = 1$, and $\beta_5 = 1$. To ensure that the results were comparable, we used the same random number seed throughout the procedure.
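A minimal sketch of the data-generating mechanism described above, assuming the MASS package for the multivariate normal draws. The success probability of the binary covariates (0.5) is an assumption, since it is not specified in the text.

```r
# Simulate one dataset with three correlated continuous covariates, two binary covariates,
# and a Bernoulli response from the true logistic model.
library(MASS)
set.seed(2022)
gen_data <- function(n = 80, r = 0.85, beta0 = -2, beta = c(1, 1, 1, 1, 1)) {
  Sigma <- matrix(r, 3, 3); diag(Sigma) <- 1           # correlation matrix of x1, x2, x3
  Xc <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)      # continuous covariates
  Xb <- matrix(rbinom(2 * n, 1, 0.5), n, 2)            # binary covariates x4, x5
  X  <- cbind(Xc, Xb); colnames(X) <- paste0("x", 1:5)
  y  <- rbinom(n, 1, plogis(beta0 + drop(X %*% beta))) # Bernoulli response
  data.frame(y = y, X)
}
dat <- gen_data()
```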

4.2. Performance Evaluation of the Methods

The average parameter estimation bias and mean squared error (MSE) were used to evaluate the performance of the penalized methods reviewed in Section 2. The average MSE for the $p$th parameter was calculated as
$$MSE = \frac{1}{100} \sum_{s=1}^{100} (\hat\beta_{s,p} - \beta_{s,p})^2,$$
where $\hat\beta_{s,p}$ is the estimate of $\beta_{s,p}$ and $\beta_{s,p}$ denotes the true value of the $p$th parameter in the $s$th simulated dataset.
In addition, the average parameter estimation bias was calculated by using
$$bias = \frac{1}{100} \sum_{s=1}^{100} (\hat\beta_{s,p} - \beta_{s,p}).$$
For the Ridge, DPLE, lFRE11, and lFRE22 penalized methods with a ridge parameter, the average MSE was calculated as follows:
$$MSE = \frac{1}{100} \sum_{s=1}^{100} \big(\hat\beta_{s,p}(\lambda) - \beta_{s,p}\big)^2.$$
Additionally, the average parameter estimation bias was computed as
$$bias = \frac{1}{100} \sum_{s=1}^{100} \big(\hat\beta_{s,p}(\lambda) - \beta_{s,p}\big),$$
where the choice of the ridge parameter $\lambda$ depends on the cross-validation criterion described in Section 3.3, and $\hat\beta_{s,p}(\lambda)$ was obtained from these estimators with the optimal $\lambda$. The tuning parameters for fitting Ridge, DPLE, lFRE11, and lFRE22 were chosen using 10-fold cross-validation. The regression coefficient estimates for each model were averaged over the simulations in which convergence was achieved.
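A small sketch of how these summaries can be computed, assuming a hypothetical matrix est whose 100 rows hold the coefficient estimates from the simulated datasets and a vector truth of true coefficients.

```r
# Per-coefficient bias and MSE over the simulation replications.
sim_summary <- function(est, truth) {
  diff <- sweep(est, 2, truth)                 # beta_hat_{s,p} - beta_{s,p} for every s and p
  list(bias = colMeans(diff), mse = colMeans(diff^2))
}
```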

4.3. Simulation 1

In this section, we carried out a simulation study to assess the properties of the regression coefficient estimates of these methods (LR, Ridge, Firth, LF22, DPLE, lFRE11, and lFRE22) with respect to the bias and mean squared error (MSE). For a fixed correlation coefficient $r = 0.85$, we considered scenarios in which three covariates were potentially strongly correlated. To study the effect of sample size in the presence of severe multicollinearity, the sample sizes ($n$) were set to 50, 80, 130, and 200, and the data were generated as described in Section 4.1. Five coefficients ($\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$) were estimated in this simulation study, with $\beta_1$, $\beta_2$, and $\beta_3$ corresponding to the continuous covariates and $\beta_4$ and $\beta_5$ to the binary covariates, omitting the intercept term.
We simulated 100 datasets for each sample size scenario. The results of the simulations are shown in Table 2. The maximum likelihood estimates (MLEs) were nonexistent in the small sample of size 50 (due to separation caused by the small sample size, the high correlation among covariates, or both). When we fit a model using conventional maximum likelihood binary logistic regression to data with a limited number of subjects relative to the number of predictors, our results demonstrated that the MLEs were nonexistent; however, as the sample size grew, finite MLEs became available. According to the findings in Table 2, the Ridge estimator generally offered the largest bias, followed by Firth and LF22, while DPLE, lFRE11, and lFRE22 offered very little bias. To some extent, all penalized approaches offered improvement. Firth and LF22 reduced the bias of the coefficient estimates compared with LR in terms of the mean bias. Some bias values obtained by DPLE, lFRE11, and lFRE22 were higher than those obtained by LR, which was due to the ridge penalty term; these results are to be expected, given that the ridge parameter in the double-penalized estimators trades some bias for a lower MSE. Firth yielded a significantly higher MSE than LF22 in every situation, because Firth does not minimize the MSE. Clearly, the statistics in Table 2 reveal that DPLE with a Firth-type penalty had a higher MSE than lFRE11 and lFRE22 with a logF-type penalty, as evidenced by our simulation findings, which are consistent with intuition. In almost all settings, lFRE22 provided slightly larger bias for the binary covariates ($\beta_4$ and $\beta_5$) but outperformed the other approaches for the continuous covariates ($\beta_1$, $\beta_2$, and $\beta_3$). However, in terms of the MSE, lFRE22 was the best, while lFRE11 was at least second best. Compared with the other penalized methods, lFRE11 and lFRE22 had a significantly lower bias and MSE. The differences were larger when the sample size was small or medium (see Table 2) and gradually decreased as the sample size grew.
There appeared to be a difference between the binary and continuous covariates in the bias and MSE of the parameter estimates for all penalized estimators in our simulation. The ridge effect appeared to be stronger for the continuous covariates than for the binary covariates, as seen in Table 2. For the binary covariates, the reduction in MSE achieved by DPLE, lFRE11, and lFRE22 relative to Firth, Ridge, and LF22 comes at the cost of a slight increase in bias. For the continuous covariates, the MSE was dramatically lowered by the three double-penalized estimators. In terms of the mean bias and MSE for the binary variables, the three double-penalized estimators (DPLE, lFRE11, and lFRE22) were indistinguishable.
The MSE results of the DPLE, lFRE11, and lFRE22 are visualized with respect to the continuous and binary variables for different sample sizes in Figure 1 for further information. It is clear that lFRE22 had the lowest MSE for both the continuous and binary covariables. Furthermore, DPLE, lFRE11, and lFRE22 provided greater improvement. When the sample size was small, this trend became more apparent. The difference between the three methods for continuous and binary covariates decreased as the sample size increased.
In the second simulation, we conducted two series of simulations to examine the effects of the correlation among covariates ($r \in \{0.65, 0.75, 0.85, 0.95\}$), the sample size ($n \in \{40, 80\}$), and the covariate type (binary and continuous). To investigate the effects of correlation and small sample size on the estimation of the regression coefficients, we used a variety of correlation coefficients and small sample sizes, with the data generated using the same mechanism as in Simulation 1. For each scenario, 100 datasets were generated, and each dataset was fitted with all of the regression approaches under consideration. Two tables present the bias and MSEs of the estimates, and two figures present the bias and MSEs obtained by DPLE, lFRE11, and lFRE22 for the different scenarios.

4.4. Simulation 2

The estimated bias and MSE values for LR, Firth, Ridge, LF22, DPLE, lFRE11, and lFRE22 are reported in Table 3 and Table 4, omitting the intercept. With the exception of Ridge and LR, the other methods were only slightly biased; Ridge produced a coefficient deviation that was even greater than that of LR and cannot be ignored. In one set of simulations, these estimators were tested with a sample size of $n = 40$; a second series tested them with a sample size of $n = 80$. As shown in Table 3, LR produced infinite estimates for the sample size of 40, which indicates the nonexistence of finite MLEs due to separation in the small sample. Nonetheless, all of the approaches evaluated had finite bias and MSEs, allowing us to reasonably conclude that they can address the small-sample separation problem. The maximum likelihood estimation converged as the sample size grew from 40 to 80, and LR then produced finite bias and MSE values. One can also conclude from Table 4 that the MSE of LR was inflated, especially when the correlation coefficient was between 0.85 and 0.95, with some MSE values exceeding 1. According to Table 3 and Table 4, all penalized methods outperformed LR, as evidenced by their smaller MSE. Furthermore, the biases of the Firth, LF11, and LF22 methods were less severe than that of the LR model. Because of the ridge penalty term in DPLE, lFRE11, and lFRE22, a slight increase in bias was required to ensure a decrease in the MSE. As the correlation coefficient increased, so did the MSEs of LR, Firth, Ridge, and LF22. However, DPLE, lFRE11, and lFRE22 performed the best in terms of reducing the MSE. It is worth noting that lFRE22 performed best because it achieved the greatest reduction in MSE in all experimental settings. The findings suggest that lFRE in logistic regression warrants further investigation. When the covariates were highly correlated, adding a ridge parameter improved the estimator’s performance.
Figure 2 and Figure 3 analyze the biases and MSEs of the DPLE, lFRE11, and lFRE22 methods over all explanatory variables, omitting the intercept, for $n = 40$ and $n = 80$, respectively. The results in Figure 2 reveal that the three methods provided almost the same level of bias, while lFRE22 provided the lowest MSE, followed by lFRE11, and DPLE provided the highest MSE, especially when the correlation coefficient was high for the sample size of 40. In practically all situations, lFRE22 performed best in terms of bias and MSE, as shown in Figure 3. DPLE also performed poorly as the collinearity increased. These results suggest that when the sample size is small and the multicollinearity is severe, lFRE22 is recommended, as it not only yields a modest absolute bias but also provides the lowest MSE compared with the other penalized approaches.
As a result, we believe that the lFRE approach is frequently the best method for small datasets with strongly correlated independent variables, as it efficiently removes bias from the predicted probabilities and provides the smallest MSE while also delivering very accurate regression coefficients. When the sample size is small and the independent variables are highly correlated, lFRE is also computationally simpler than competing penalized regression approaches.

5. Breast Cancer Study

The dataset used to evaluate the predictive performance of our proposed estimator and compare it with the other estimators under study is available from the University of California Irvine machine learning repository, and it was originally extracted from the study conducted by Patrício et al. (2018) [31], whose goal was to select effective breast cancer biomarkers for predicting the presence or absence of breast cancer in 116 subjects (64 patients with breast cancer and 52 healthy controls). The presence or absence of breast cancer was of particular interest as an outcome (1 = healthy controls; 2 = patients). There were nine quantitative predictors in total, all of which are anthropometric data or parameters that can be acquired during standard blood analysis. Age, BMI, blood glucose, insulin, HOMA, leptin, adiponectin, resistin, and MCP-1 were measured or observed for each of the 116 subjects. These clinical characteristics are detailed as follows:
  • Age (years);
  • BMI (kg/m$^2$) was calculated by dividing the weight by the square of the height;
  • Glucose (mg/dL), namely the serum glucose levels, were measured using a commercial kit and an automated analyzer;
  • Insulin ( μ U/mL), adiponectin ( μ g/mL), resistin (ng/mL), and MCP-1(pg/dL) levels in the serum were determined using commercial enzyme-linked immunosorbent test kits for leptin, adiponectin, and resistin, as well as the chemokine Monocyte Chemoattractant Protein 1 (MCP-1);
  • To assess insulin resistance, the Homeostasis Model Assessment (HOMA) index was calculated as $\mathrm{HOMA} = \log\{(I_f) \times (G_f)/22.5\}$, where $I_f$ is the fasting insulin level ($\mu$U/mL) and $G_f$ is the fasting glucose level (mmol/L).
There were nine predictors in total, and they were all continuous. The HOMA index and insulin had a strong correlation ($r = 0.93$). The results from fitting the logistic regression models using LR, Firth, LF11, LF22, Ridge, DPLE, lFRE11, and lFRE22 are summarized in Table 5. The models’ discriminating ability was quantified using the area under the receiver operating characteristic curve (AUROC), which measures how well a model distinguishes subjects with and without the event of interest: an AUROC of around 0.5 indicates no discrimination, and a value close to 1 indicates excellent discrimination. The Brier Score (BS) was used to assess the overall predictive performance; it is the mean of the squared differences between each patient’s observed outcome and the risk predicted by the model, so lower values indicate better performance and $BS = 0$ corresponds to perfect prediction. Table 5 reveals the predictive performance of the methods using these two most common measures of predictive accuracy (BS and AUROC).
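A short sketch of the two predictive-accuracy measures used in Table 5, computed in base R from a vector of observed outcomes y (0/1) and predicted probabilities p_hat; both inputs are hypothetical placeholders.

```r
# Brier score: mean squared difference between observed outcome and predicted risk.
brier_score <- function(y, p_hat) mean((y - p_hat)^2)

# AUROC via the Mann-Whitney statistic: the probability that a randomly chosen event
# receives a higher predicted risk than a randomly chosen non-event.
auroc <- function(y, p_hat) {
  r  <- rank(p_hat)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
```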
The results of the ridge parameter selection using cross-validation are provided in Figure 4; the optimal ridge parameter for this dataset was determined to be 100. According to the results, the assessment metrics BS and AUROC selected the same model, lFRE22. Table 5 indicates that all penalized estimators improved on LR in terms of AUROC. It should be noted that Firth’s performance in BS was noticeably worse than that of the LR model, which is consistent with earlier findings [13]. Furthermore, for all predictors, the coefficient estimates of the three double-penalized estimators (DPLE, lFRE11, and lFRE22) were considerably smaller in magnitude than those of the three single-penalized estimators (Firth, LF11, and LF22). It was also observed from the results in Table 5 that the three double-penalized estimators with ridge parameters outperformed the three single-penalized estimators. This is because, as previously stated, the HOMA index and insulin were highly correlated, and the ridge penalty counteracts the multicollinearity effect. Regarding the BS and AUROC, LF22 surpassed Firth and LF11, which performed similarly to each other. Likewise, lFRE11 and lFRE22 provided roughly the same degree of improvement in BS and AUROC, but DPLE gave the largest BS value and the lowest AUROC value of the three double-penalized estimators. As a result, in the setting of a small dataset with strongly correlated covariates, this analysis adequately demonstrates the superiority of lFRE22’s predictive ability.
Given the sample size of 116, finite MLEs were available for the conventional logistic regression model on this dataset; therefore, the results for the LR model are directly relevant and comparable with the simulation findings obtained in the convergent settings. In general, lFRE22 had the lowest BS and the highest AUROC compared with the other models discussed in this study, which again highlights the advantage of lFRE22.

6. Discussion and Conclusions

In this study, separation and multicollinearity problems in logistic regression were discussed. Despite the issue being a hot topic for more than 40 years, few studies have attempted to handle both types of problems at the same time. This paper reviewed some shortcomings in Shen and Gao’s method (DPLE) and investigated a new double-penalty likelihood estimation method called lFRE, which combines a logF-type penalty with a ridge penalty to effectively solve separation and multicollinearity problems in logistic regression.
The results revealed that lFRE (lFRE11 and lFRE22) was superior to Firth, LF11, LF22, and DPLE in terms of prediction performance. Separation is commonly addressed using Firth, LF11, and LF22, and as long as there is no multicollinearity in logistic regression, LF22 has been shown to be the preferred method for obtaining unbiased predicted probabilities and the minimum MSE of the coefficient estimates. However, when the separation and multicollinearity problems emerge at the same time, these methods are no longer appropriate. Considering separation and multicollinearity together, the simulation results support that lFRE is superior to DPLE, because lFRE is effective not only in solving the separation problem but also in addressing multicollinearity, whereas DPLE yields a non-negligible MSE when multicollinearity is severe.
In conclusion, Shen and Gao’s double-penalized likelihood estimator may provide a more appropriate reference point than earlier penalized estimators, given that the generated estimates remain finite even when separation occurs. However, based on our results, we suggest using lFRE instead of Shen and Gao’s method to deal with the separation and multicollinearity problems of small or sparse data with strongly correlated variables, for the following reasons. First, the proposed lFRE provides the minimum MSE in regression coefficient estimation and shows greater improvement in prediction performance than other penalized methods such as DPLE. Secondly, since the logF-type penalty releases the intercept from the penalty and accurately estimates the average predicted probability, lFRE provides unbiased predicted probabilities, whereas DPLE penalizes the intercept and introduces bias into the predicted probabilities. Thirdly, the logF-type penalty used by lFRE is not data-dependent and does not incorporate the correlation between explanatory variables, in contrast to the Firth-type penalty. Finally, lFRE performs reasonably well in the case of high correlation between covariates, a scenario in which DPLE results in a large MSE. As a result, lFRE is superior to DPLE, and between the two forms of lFRE, lFRE22 is superior to lFRE11: although both have similar prediction efficiency, lFRE22 has a smaller MSE for the regression coefficients, especially for the continuous variables.
In future research, lFRE could also be applied to other generalized linear models (GLMs).

Author Contributions

Conceptualization, Y.G. and G.-H.F.; methodology, Y.G.; software, Y.G. and G.-H.F.; validation, Y.G. and G.-H.F.; formal analysis, Y.G.; investigation, Y.G.; resources, Y.G.; data curation, Y.G. and G.-H.F.; writing—original draft preparation, Y.G. and G.-H.F.; writing—review and editing, Y.G. and G.-H.F.; visualization, Y.G. and G.-H.F.; supervision, Y.G. and G.-H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of P. R. China (grant nos. 12261052 and 11761041) and the Natural Science Foundation of Yunnan Province of P. R. China (grant no. CB22052C127A).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra.

Acknowledgments

The authors sincerely thank the anonymous reviewers for their constructive comments that led to the current improved version of the paper. This work was financially supported by the National Natural Science Foundation of P. R. China (grant nos. 12261052 and 11761041) and a grant from the Natural Science Foundation of Yunnan Province of P. R. China (grant no. CB22052C127A).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LR      Logistic regression
MLE     Maximum likelihood estimation
MLEs    Maximum likelihood estimates
Ridge   Ridge logistic regression
Firth   Firth's logistic regression
LF      Logistic regression with a log-F prior penalty
LF11    Logistic regression with a logF(1, 1) prior penalty
LF22    Logistic regression with a logF(2, 2) prior penalty
DPLE    Shen and Gao's double-penalized likelihood estimator
lFRE11  The logF(1, 1)-type penalty with a ridge-based estimator
lFRE22  The logF(2, 2)-type penalty with a ridge-based estimator
PE      Prediction error
PRESS   Predicted residual error sum of squares
MSE     Mean squared error
r       Correlation coefficient
n       Sample size
BS      Brier score
AUROC   The area under the receiver operating characteristic curve

References

  1. Silvapulle, M.J. On the existence of maximum likelihood estimators for the binomial response models. J. R. Stat. Soc. Ser. B (Methodol.) 1981, 43, 310–313. [Google Scholar] [CrossRef]
  2. Albert, A.; Anderson, J.A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984, 71, 1–10. [Google Scholar] [CrossRef]
  3. Allison, P.D. Convergence failures in logistic regression. SAS Glob Forum 2008, 360, 1–11. [Google Scholar]
  4. Zorn, C. A Solution to Separation in Binary Response Models. Political Anal. 2005, 13, 157–170. [Google Scholar] [CrossRef]
  5. Vaeth, M. On the use of Wald’s test in exponential families. Int. Stat. Rev./Rev. Int. de Stat. 1985, 53, 199–214. [Google Scholar] [CrossRef]
  6. Heinze, G.; Schemper, M. A solution to the problem of separation in logistic regression. Stat. Med. 2002, 21, 2409–2419. [Google Scholar] [CrossRef] [PubMed]
  7. Clarkson, D.B.; Jennrich, R.I. Computing extended maximum likelihood estimates for linear parameter models. J. R. Stat. Soc. Ser. B (Methodol.) 1991, 53, 417–426. [Google Scholar] [CrossRef]
  8. Konis, K.P.; Ripley, B. Linear Programming Algorithms for Detecting Separated Data in Binary Logistic Regression Models. Ph.D. Thesis, University of Oxford, Oxford, UK, 2007. [Google Scholar]
  9. Agresti, A. An introduction to categorical data analysis. Publ. Am. Stat. Assoc. 1996, 103, 1323. [Google Scholar]
10. Firth, D. Bias reduction of maximum likelihood estimates. Biometrika 1993, 80, 27–38.
11. Greenland, S.; Mansournia, M.A. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Stat. Med. 2015, 34, 3133–3143.
12. Greenland, S.; Mansournia, M.A.; Altman, D.G. Sparse data bias: A problem hiding in plain sight. BMJ 2016, 352, i1981.
13. Rahman, M.S.; Sultana, M. Performance of Firth- and logF-type penalized methods in risk prediction for small or sparse binary data. BMC Med. Res. Methodol. 2017, 17, 1–15.
14. Ogundimu, E.O. Prediction of default probability by using statistical models for rare events. J. R. Stat. Soc. Ser. A (Stat. Soc.) 2019, 182, 1143–1162.
15. Rousseeuw, P.J.; Christmann, A. Robustness against separation and outliers in logistic regression. Comput. Stat. Data Anal. 2003, 43, 315–332.
16. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55–67.
17. Hoerl, A.E.; Kennard, R.W. Ridge regression: Applications to nonorthogonal problems. Technometrics 1970, 12, 69–82.
18. Schaefer, R.; Roi, L.; Wolfe, R. A ridge logistic estimator. Commun. Stat. Theory Methods 1984, 13, 99–113.
19. Lee, A.; Silvapulle, M. Ridge estimation in logistic regression. Commun. Stat. Simul. Comput. 1988, 17, 1231–1257.
20. Le Cessie, S.; Van Houwelingen, J.C. Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1992, 41, 191–201.
21. Inan, D.; Erdogan, B.E. Liu-type logistic estimator. Commun. Stat. Simul. Comput. 2013, 42, 1578–1586.
22. Zeng, G.; Zeng, E. On the relationship between multicollinearity and separation in logistic regression. Commun. Stat. Simul. Comput. 2021, 50, 1989–1997.
23. Senaviratna, N.; Cooray, T. Diagnosing multicollinearity of logistic regression model. Asian J. Probabil. Stat. 2019, 5, 1–9.
24. Shen, J.; Gao, S. A solution to separation and multicollinearity in multiple logistic regression. J. Data Sci. 2008, 6, 515–531.
25. Puhr, R.; Heinze, G.; Nold, M.; Lusa, L.; Geroldinger, A. Firth’s logistic regression with rare events: Accurate effect estimates and predictions? Stat. Med. 2017, 36, 2302–2317.
26. Efron, B. How biased is the apparent error rate of a prediction rule? J. Am. Stat. Assoc. 1986, 81, 461–470.
27. van Wieringen, W.N. Lecture notes on ridge regression. arXiv 2015, arXiv:1509.09169.
28. Allen, D.M. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 1974, 16, 125–127.
29. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1.
30. Kosmidis, I. brglm2: Bias Reduction in Generalized Linear Models. R Package Version 0.6.2; 2020.
31. Patrício, M.; Pereira, J.; Crisóstomo, J.; Matafome, P.; Gomes, M.; Seiça, R.; Caramelo, F. Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer 2018, 18, 1–8.
Figure 1. Plots of mean squared errors across sample sizes for continuous and binary covariate types under the double-penalized maximum likelihood estimator for logistic regression models.
Figure 2. Plots of the bias and MSE obtained by DPLE, lFRE11, and lFRE22 for four different scenarios combining a fixed sample size n = 40 with varying correlation coefficients.
Figure 3. Plots of the bias and MSE obtained by DPLE, lFRE11, and lFRE22 for four different scenarios combining a fixed sample size n = 80 with varying correlation coefficients.
Figure 4. Cross-validated estimates of the mean squared errors as a function of the ridge parameter.
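To give a concrete sense of how a curve like the one in Figure 4 can be produced, the following minimal sketch computes the cross-validated mean squared prediction error of an L2-penalized (ridge) logistic regression over a grid of ridge parameters. It is illustrative only and is not the code used in this study; the scikit-learn estimator (which parameterizes the ridge penalty as C = 1/λ), the 10-fold split, the synthetic correlated design, and the grid of λ values are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def cv_mse_for_ridge(X, y, lambdas, n_splits=10, seed=1):
    """Cross-validated mean squared prediction error (Brier score) of an
    L2-penalized logistic regression, evaluated on a grid of ridge parameters.
    Returns one averaged error per value of lambda."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for lam in lambdas:
        fold_mse = []
        for train, test in kf.split(X):
            # scikit-learn's C is the inverse of the ridge parameter lambda
            model = LogisticRegression(penalty="l2", C=1.0 / lam,
                                       solver="lbfgs", max_iter=1000)
            model.fit(X[train], y[train])
            p = model.predict_proba(X[test])[:, 1]
            fold_mse.append(np.mean((y[test] - p) ** 2))
        errors.append(np.mean(fold_mse))
    return np.asarray(errors)

# Example usage on a synthetic design with highly correlated covariates
rng = np.random.default_rng(0)
cov = 0.85 * np.ones((5, 5)) + 0.15 * np.eye(5)          # pairwise correlation 0.85
X = rng.multivariate_normal(np.zeros(5), cov, size=80)
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([1.0, 1.0, 1.0, -1.0, -1.0]))))
lambdas = np.logspace(-3, 2, 30)
cv_errors = cv_mse_for_ridge(X, y, lambdas)
best_lambda = lambdas[np.argmin(cv_errors)]
```

Plotting cv_errors against lambdas (on a log scale) yields a curve of the same kind as Figure 4, with the minimizer best_lambda taken as the selected ridge parameter.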
Table 1. A 2 × 2 contingency table of a one-parameter issue without an intercept.

            x_i = 0    x_i = 1
y_i = 0     n_00       n_01
y_i = 1     n_10       n_11
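As a brief illustration of the point Table 1 summarizes (a sketch under the stated assumptions, not a passage reproduced from the paper): in the one-parameter logistic model without an intercept, P(y_i = 1 | x_i) = exp(β x_i)/(1 + exp(β x_i)), the observations with x_i = 0 contribute a factor to the likelihood that does not depend on β, so the maximum likelihood estimate is driven entirely by the x_i = 1 column of the table,

\hat{\beta} = \mathrm{logit}\!\left(\frac{n_{11}}{n_{01} + n_{11}}\right) = \log\!\left(\frac{n_{11}}{n_{01}}\right).

Whenever n_{01} = 0 or n_{11} = 0 this estimate diverges to ±∞, which is exactly the (quasi-)separation phenomenon that the penalized estimators are designed to repair.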
Table 2. Bias and MSE of all explanatory variables (omitting the intercept) for sample sizes n ∈ {50, 80, 130, 200} and a fixed correlation coefficient r = 0.85. Estimator order within each Bias and MSE block: LR, Firth, Ridge, LF22, DPLE, lFRE11, lFRE22.
n = 50
β1(1)  Bias: 0.02, −0.29, −0.04, −0.17, −0.03, −0.10;  MSE: 1.68, 0.18, 0.41, 0.46, 0.40, 0.28
β2(1)  Bias: 0.19, −0.24, 0.12, −0.02, 0.11, 0.02;  MSE: 1.83, 0.13, 0.47, 0.50, 0.43, 0.27
β3(1)  Bias: −0.09, −0.29, −0.08, −0.22, −0.08, −0.13;  MSE: 1.14, 0.17, 0.44, 0.47, 0.43, 0.32
β4(1)  Bias: 0.19, −0.39, −0.18, −0.15, −0.17, −0.30;  MSE: 1.63, 0.57, 0.50, 0.56, 0.49, 0.40
β5(1)  Bias: 0.17, −0.40, −0.18, −0.15, −0.17, −0.30;  MSE: 1.10, 0.44, 0.50, 0.50, 0.49, 0.42
n = 80
β1(1)  Bias: 0.27, 0.05, −0.24, 0.04, −0.04, 0.05, −0.01;  MSE: 0.84, 0.51, 0.12, 0.32, 0.33, 0.32, 0.25
β2(1)  Bias: 0.19, −0.01, −0.21, 0.03, −0.07, 0.03, −0.02;  MSE: 1.16, 0.73, 0.11, 0.36, 0.40, 0.35, 0.25
β3(1)  Bias: 0.20, 0.00, −0.28, 0.01, −0.07, 0.02, −0.03;  MSE: 0.65, 0.37, 0.15, 0.21, 0.22, 0.21, 0.16
β4(1)  Bias: 0.21, −0.01, −0.35, −0.20, −0.17, −0.19, −0.29;  MSE: 1.17, 0.68, 0.42, 0.44, 0.44, 0.43, 0.38
β5(1)  Bias: 0.22, 0.02, −0.29, −0.15, −0.13, −0.14, −0.23;  MSE: 0.96, 0.59, 0.29, 0.38, 0.39, 0.38, 0.33
n = 130
β1(1)  Bias: 0.03, −0.07, −0.24, −0.05, −0.10, −0.04, −0.07;  MSE: 0.45, 0.36, 0.09, 0.26, 0.28, 0.26, 0.22
β2(1)  Bias: 0.26, 0.13, −0.20, 0.13, 0.07, 0.13, 0.09;  MSE: 0.91, 0.67, 0.08, 0.42, 0.44, 0.40, 0.31
β3(1)  Bias: 0.15, 0.03, 0.25, 0.04, −0.01, 0.04, 0.01;  MSE: 0.45, 0.34, 0.10, 0.26, 0.26, 0.25, 0.21
β4(1)  Bias: 0.07, −0.03, −0.32, −0.15, −0.13, −0.14, −0.22;  MSE: 0.53, 0.41, 0.22, 0.30, 0.31, 0.30, 0.27
β5(1)  Bias: 0.17, 0.06, 0.42, −0.06, −0.05, −0.06, −0.13;  MSE: 0.58, 0.41, 0.36, 0.28, 0.28, 0.27, 0.23
n = 200
β1(1)  Bias: 0.01, −0.05, −0.25, −0.03, −0.07, −0.03, −0.05;  MSE: 0.18, 0.16, 0.08, 0.13, 0.14, 0.13, 0.12
β2(1)  Bias: 0.13, 0.07, −0.23, 0.07, 0.04, 0.07, 0.05;  MSE: 0.40, 0.33, 0.08, 0.25, 0.26, 0.25, 0.21
β3(1)  Bias: 0.06, −0.01, −0.25, 0.01, −0.02, 0.01, −0.01;  MSE: 0.19, 0.17, 0.08, 0.14, 0.14, 0.14, 0.12
β4(1)  Bias: 0.10, 0.04, −0.32, −0.03, −0.02, −0.03, −0.08;  MSE: 0.24, 0.20, 0.19, 0.17, 0.17, 0.17, 0.16
β5(1)  Bias: 0.08, 0.02, −0.36, −0.04, −0.03, −0.04, −0.09;  MSE: 0.25, 0.21, 0.24, 0.18, 0.18, 0.18, 0.17
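For reference, the Bias and MSE columns in Tables 2–4 can be read in the usual Monte Carlo sense; stated here as a standard reminder, assuming R simulation replicates (the notation below is introduced for illustration):

\mathrm{Bias}(\hat{\beta}_j) = \frac{1}{R}\sum_{r=1}^{R}\hat{\beta}_j^{(r)} - \beta_j, \qquad
\mathrm{MSE}(\hat{\beta}_j) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\beta}_j^{(r)} - \beta_j\right)^2.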
Table 3. Bias and MSE of all explanatory variables (omitting the intercept) for correlation coefficients r ∈ {0.65, 0.75, 0.85, 0.95} and a fixed sample size n = 40. Estimator order within each Bias and MSE block: LR, Firth, Ridge, LF22, DPLE, lFRE11, lFRE22.
r = 0.65
β1(1)  Bias: −0.04, −0.26, −0.02, −0.18, −0.01, −0.09;  MSE: 0.38, 0.19, 0.23, 0.22, 0.23, 0.18
β2(1)  Bias: 0.03, −0.23, 0.04, −0.14, 0.04, −0.04;  MSE: 0.72, 0.17, 0.35, 0.31, 0.34, 0.24
β3(1)  Bias: 0.12, 0.31, 0.08, −0.08, 0.08, −0.01;  MSE: 0.58, 0.21, 0.30, 0.24, 0.30, 0.21
β4(1)  Bias: 0.04, −0.34, −0.30, −0.27, −0.29, −0.41;  MSE: 1.33, 0.59, 0.60, 0.61, 0.60, 0.52
β5(1)  Bias: −0.07, −0.32, −0.37, −0.34, −0.36, −0.47;  MSE: 1.55, 0.53, 0.74, 0.77, 0.74, 0.63
r = 0.75
β1(1)  Bias: 0.13, 0.26, −0.30, −0.10, 0.07, −0.03;  MSE: 1.33, 0.18, 0.45, 0.41, 0.43, 0.29
β2(1)  Bias: 0.15, −0.29, 0.11, −0.10, 0.10, 0.01;  MSE: 2.12, 0.19, 0.60, 0.56, 0.56, 0.27
β3(1)  Bias: 0.14, −0.30, 0.03, −0.13, 0.04, −0.06;  MSE: 1.35, 0.22, 0.31, 0.32, 0.30, 0.20
β4(1)  Bias: 0.05, −0.33, −0.32, −0.28, −0.31, −0.43;  MSE: 1.09, 0.58, 0.60, 0.68, 0.67, 0.57
β5(1)  Bias: 0.34, 0.31, −0.17, −0.15, −0.16, −0.31;  MSE: 1.01, 0.39, 0.53, 0.55, 0.52, 0.42
r = 0.85
β1(1)  Bias: −0.05, −0.35, −0.04, −0.20, −0.03, −0.11;  MSE: 0.76, 0.24, 0.31, 0.34, 0.30, 0.22
β2(1)  Bias: 0.09, −0.24, 0.01, −0.15, 0.01, −0.07;  MSE: 2.20, 0.18, 0.46, 0.57, 0.43, 0.27
β3(1)  Bias: −0.10, −0.26, −0.04, −0.22, −0.04, −0.11;  MSE: 1.36, 0.16, 0.39, 0.46, 0.38, 0.26
β4(1)  Bias: −0.02, −0.49, −0.34, −0.32, −0.33, −0.45;  MSE: 1.27, 0.68, 0.61, 0.63, 0.59, 0.53
β5(1)  Bias: 0.00, −0.37, −0.30, −0.29, −0.29, −0.41;  MSE: 1.43, 0.56, 0.64, 0.64, 0.64, 0.54
r = 0.95
β1(1)  Bias: 0.31, −0.29, 0.07, −0.09, 0.07, −0.03;  MSE: 4.56, 0.17, 0.45, 0.61, 0.43, 0.25
β2(1)  Bias: −0.02, −0.27, −0.01, −0.18, 0.00, −0.08;  MSE: 6.27, 0.13, 0.37, 0.67, 0.36, 0.21
β3(1)  Bias: −0.25, −0.27, −0.12, −0.32, −0.12, −0.17;  MSE: 3.57, 0.16, 0.41, 0.64, 0.40, 0.26
β4(1)  Bias: 0.19, −0.35, −0.21, −0.18, −0.19, −0.34;  MSE: 1.53, 0.52, 0.57, 0.58, 0.56, 0.47
β5(1)  Bias: 0.05, −0.25, −0.30, −0.27, 0.29, −0.41;  MSE: 1.04, 0.48, 0.47, 0.47, 0.47, 0.43
Table 4. Bias and MSE of all explanatory variables (omitting the intercept) for correlation coefficients r ∈ {0.65, 0.75, 0.85, 0.95} and a fixed sample size n = 80. Estimator order within each Bias and MSE block: LR, Firth, Ridge, LF22, DPLE, lFRE11, lFRE22.
r = 0.65
β1(1)  Bias: 0.21, 0.02, −0.28, 0.03, −0.05, 0.03, −0.02;  MSE: 0.42, 0.24, 0.14, 0.19, 0.18, 0.19, 0.16
β2(1)  Bias: 0.23, 0.04, −0.24, 0.06, −0.03, 0.06, 0.01;  MSE: 0.64, 0.40, 0.12, 0.28, 0.27, 0.28, 0.21
β3(1)  Bias: 0.24, 0.05, −0.26, 0.05, −0.03, 0.05, 0.00;  MSE: 0.35, 0.20, 0.13, 0.16, 0.15, 0.16, 0.13
β4(1)  Bias: 0.33, 0.13, −0.36, −0.06, −0.03, −0.05, −0.16;  MSE: 0.75, 0.46, 0.37, 0.30, 0.31, 0.30, 0.25
β5(1)  Bias: 0.31, 0.11, −0.27, −0.06, −0.04, −0.05, −0.16;  MSE: 0.96, 0.61, 0.23, 0.42, 0.42, 0.42, 0.35
r = 0.75
β1(1)  Bias: 0.11, −0.07, −0.25, −0.05, −0.13, −0.05, −0.09;  MSE: 0.43, 0.28, 0.13, 0.22, 0.22, 0.22, 0.18
β2(1)  Bias: 0.32, 0.11, −0.18, 0.12, −0.03, 0.13, 0.07;  MSE: 0.85, 0.50, 0.10, 0.34, 0.32, 0.33, 0.25
β3(1)  Bias: 0.10, −0.08, −0.27, −0.06, −0.14, −0.06, −0.10;  MSE: 0.66, 0.40, 0.14, 0.28, 0.28, 0.27, 0.22
β4(1)  Bias: 0.17, −0.01, −0.36, −0.17, −0.14, −0.16, −0.25;  MSE: 0.78, 0.52, 0.32, 0.38, 0.39, 0.38, 0.34
β5(1)  Bias: 0.23, 0.04, −0.33, −0.13, −0.11, −0.12, −0.22;  MSE: 0.88, 0.53, 0.30, 0.33, 0.34, 0.33, 0.29
r = 0.85
β1(1)  Bias: 0.27, 0.05, −0.24, 0.04, −0.04, 0.05, −0.01;  MSE: 0.84, 0.51, 0.12, 0.32, 0.33, 0.32, 0.25
β2(1)  Bias: 0.19, −0.01, −0.21, 0.03, −0.07, 0.03, −0.02;  MSE: 1.16, 0.73, 0.11, 0.36, 0.40, 0.35, 0.25
β3(1)  Bias: 0.20, 0.00, −0.28, 0.01, −0.07, 0.02, −0.03;  MSE: 0.65, 0.37, 0.15, 0.21, 0.22, 0.21, 0.16
β4(1)  Bias: 0.21, −0.01, −0.35, −0.20, −0.17, −0.19, −0.29;  MSE: 1.17, 0.68, 0.42, 0.44, 0.44, 0.43, 0.38
β5(1)  Bias: 0.22, 0.02, −0.29, −0.15, −0.13, −0.14, −0.23;  MSE: 0.96, 0.59, 0.29, 0.38, 0.39, 0.38, 0.33
r = 0.95
β1(1)  Bias: −0.02, −0.18, −0.24, −0.11, −0.21, −0.11, −0.13;  MSE: 2.11, 1.38, 0.10, 0.43, 0.57, 0.41, 0.28
β2(1)  Bias: 0.24, 0.02, −0.23, 0.02, −0.07, 0.02, −0.03;  MSE: 3.72, 2.46, 0.08, 0.51, 0.76, 0.48, 0.29
β3(1)  Bias: 0.34, 0.12, −0.26, 0.09, 0.02, 0.09, 0.03;  MSE: 1.87, 1.20, 0.12, 0.40, 0.49, 0.38, 0.25
β4(1)  Bias: 0.18, −0.02, −0.35, −0.21, −0.19, −0.21, −0.30;  MSE: 0.85, 0.54, 0.35, 0.38, 0.39, 0.38, 0.36
β5(1)  Bias: 0.20, 0.11, −0.30, −0.15, −0.13, −0.14, −0.23;  MSE: 0.85, 0.56, 0.27, 0.36, 0.37, 0.36, 0.32
Table 5. Real data example: BS and AUROC obtained by fitting LR, Firth, LF11, LF22, Ridge, DPLE, lFRE11, and lFRE22 for a breast cancer study.

ID  Model    BS      AUROC
1   LR       0.2108  0.6973
2   Firth    0.2117  0.7626
3   LF11     0.2079  0.7732
4   LF22     0.2072  0.7749
5   Ridge    0.2024  0.7775
6   DPLE     0.2003  0.7762
7   lFRE11   0.1985  0.7816
8   lFRE22   0.1985  0.7816
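As a companion to Table 5, the fragment below is a minimal sketch, assuming that BS denotes the Brier score and AUROC the area under the ROC curve, of how these two prediction metrics can be computed from held-out outcomes and predicted probabilities. The scikit-learn helpers and the placeholder arrays y_test and p_hat are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# y_test: observed 0/1 outcomes; p_hat: predicted probabilities from any fitted model
y_test = np.array([0, 0, 1, 1, 0, 1])             # placeholder data
p_hat = np.array([0.2, 0.4, 0.7, 0.8, 0.3, 0.6])  # placeholder predictions

bs = brier_score_loss(y_test, p_hat)   # mean squared difference between outcome and probability
auroc = roc_auc_score(y_test, p_hat)   # probability that a case is ranked above a non-case
print(f"BS = {bs:.4f}, AUROC = {auroc:.4f}")
```

Lower BS and higher AUROC indicate better calibrated and better discriminating predictions, which is how the methods in Table 5 are compared.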