K-L Estimator: Dealing with Multicollinearity in the Logistic Regression Model

Abstract: Multicollinearity negatively affects the efficiency of the maximum likelihood estimator (MLE) in both linear and generalized linear models. The Kibria-Lukman estimator (KLE) was developed as an alternative to the MLE to handle multicollinearity in the linear regression model. In this study, we propose the logistic Kibria-Lukman estimator (LKLE) to handle multicollinearity in the logistic regression model. We theoretically establish the conditions under which this new estimator is superior to the MLE, the logistic ridge estimator (LRE), the logistic Liu estimator (LLE), the logistic Liu-type estimator (LLTE) and the logistic two-parameter estimator (LTPE) using the mean squared error criterion. The theoretical conditions were validated using a real-life dataset, and the results showed that the conditions were satisfied. Finally, both a simulation study and the real-life results showed that the new estimator outperformed the other estimators considered. However, the performance of the estimators was contingent on the adopted shrinkage parameter estimators.


Introduction
Frisch [1] coined the term "multicollinearity" to describe the problem that occurs when the explanatory variables in a model are linearly related. This problem poses a severe threat to different regression models, e.g., the linear regression model (LRM) and the logistic, Poisson and gamma regression models. The parameters in the linear and logistic regression models are popularly estimated using the ordinary least squares (OLS) estimator and the maximum likelihood estimator (MLE), respectively. However, under multicollinearity both estimators possess high standard errors, and the estimated regression coefficients occasionally exhibit the wrong signs, making conclusions doubtful [2,3]. The ridge regression estimator (RRE) and the logistic ridge estimator are notable alternatives to the OLS estimator and the MLE in the LRM and the logistic regression model, respectively [4,5]. The Liu estimator is an alternative to the ridge estimator which accounts for multicollinearity in the LRM and the logistic regression model [6,7]. The modified ridge-type estimator is a two-parameter estimator that competes favorably with the ridge and Liu estimators [8,9]. Recently, the K-L estimator emerged as another member of the class of ridge and Liu estimators with a single biasing parameter [10]. The K-L estimator is a form of the Liu-type estimator with one parameter that minimizes the residual sum of squares with respect to the L2 norm with prior information. The K-L estimator outperforms the RRE and the Liu estimator under the corresponding theoretical conditions. In this study, we develop the K-L estimator for parameter estimation in the logistic regression model, derive its statistical properties, perform a theoretical comparison with other estimators, and validate its performance through a simulation study and a real-life application.
The organization of this paper is as follows. The proposed estimator is discussed in Section 2. A theoretical comparison of the estimators is presented in Section 3. A simulation study is conducted in Section 4. Real-life data are analyzed in Section 5. Finally, some concluding remarks are given in Section 6.

Proposed Estimator
Given that y_i is a binary response variable, the logistic regression model is defined through the Bernoulli distribution y_i ∼ Be(π_i),
where

π_i = exp(x_i^T β) / (1 + exp(x_i^T β)), i = 1, 2, . . ., n,

x_i is the ith row of X, an n × (p + 1) matrix of explanatory variables, and β is a (p + 1) × 1 vector of regression coefficients. The parameters of the logistic regression model are estimated by the method of maximum likelihood. The MLE of β is

β̂_MLE = (X^T Ĝ_n X)^{-1} X^T Ĝ_n ẑ,

where Ĝ_n = diag(π̂_i(1 − π̂_i)) and ẑ_i = log(π̂_i/(1 − π̂_i)) + (y_i − π̂_i)/(π̂_i(1 − π̂_i)). Multicollinearity among the explanatory variables affects the MLE: the variance of the estimated regression parameters is inflated in its presence [11,12]. The RRE is an alternative to the MLE in linear and logistic regression models [4,5]. The logistic ridge estimator (LRE) is defined as

β̂_LRE = (X^T Ĝ_n X + kI)^{-1} X^T Ĝ_n X β̂_MLE,

where I is an identity matrix, k (k > 0) is the ridge parameter, and Ĝ_n is the estimate of G evaluated at β̂_MLE. The ridge parameter [13] is defined as k̂ = σ̂²/α̂²_max, while the logistic version [14] is k̂ = 1/α̂²_max. The Liu estimator [6] is an alternative to the ridge estimator in the linear regression model, while the logistic Liu estimator (LLE) [7] is expressed as follows:

β̂_LLE = (X^T Ĝ_n X + I)^{-1}(X^T Ĝ_n X + dI) β̂_MLE,

where d (0 < d < 1) is the Liu parameter. Following [15], we compute the Liu parameter d as

d̂ = max(0, min_j((α̂_j² − 1)/(1/λ_j + α̂_j²))),

where max and min represent the maximum and minimum operators, respectively, λ_j is the jth eigenvalue of X^T Ĝ_n X, and α̂ = Q^T β̂_MLE, where Q is the matrix of eigenvectors of X^T Ĝ_n X.
Liu [16] proposed a two-parameter estimator called the Liu-type estimator. Inan and Erdogan [17] extended this work to the logistic regression model. The logistic Liu-type estimator (LLTE) is as follows:

β̂_LLTE = (X^T Ĝ_n X + kI)^{-1}(X^T Ĝ_n X − dI) β̂_MLE,

where k (k > 0) and d (−∞ < d < ∞) are the biasing parameters of the LLTE. Ozkale and Kaciranlar [15] developed the two-parameter estimator (TPE) to mitigate multicollinearity in the LRM. Huang [18] developed the logistic TPE (LTPE), defined as follows:

β̂_LTPE = (X^T Ĝ_n X + kI)^{-1}(X^T Ĝ_n X + kdI) β̂_MLE,

where k (k > 0) and d (−∞ < d < ∞) are the biasing parameters, computed as for the LRE and the LLE above, respectively. Recently, the K-L estimator (KLE) [10] has shown better performance than the ordinary least squares estimator, the RRE and the Liu estimator for parameter estimation in the LRM. The KLE is defined as

β̂_KLE = (X^T X + kI_p)^{-1}(X^T X − kI_p) β̂_OLS,

where k (k > 0) is the KLE biasing parameter, which, as discussed in Section 3.6, is obtained by minimizing the mean squared error (MSE). In this study, we propose the logistic K-L estimator (LKLE):

β̂_LKLE = (X^T Ĝ_n X + kI_p)^{-1}(X^T Ĝ_n X − kI_p) β̂_MLE.

The bias, variance and matrix mean squared error (MMSE) of the LKLE are obtained as follows. Writing F = (X^T Ĝ_n X + kI_p)^{-1}(X^T Ĝ_n X − kI_p), the bias of the LKLE is

Bias(β̂_LKLE) = (F − I_p)β = −2k(X^T Ĝ_n X + kI_p)^{-1}β,

and its variance is

Var(β̂_LKLE) = F(X^T Ĝ_n X)^{-1}F^T.

Therefore, the MMSE and the scalar mean squared error (MSE) are, respectively,

MMSE(β̂_LKLE) = F(X^T Ĝ_n X)^{-1}F^T + Bias(β̂_LKLE)Bias(β̂_LKLE)^T

and, in canonical form,

MSE(β̂_LKLE) = Σ_j (λ_j − k)²/(λ_j(λ_j + k)²) + 4k² Σ_j α_j²/(λ_j + k)².

For each of the other estimators, the MMSE is likewise the sum of its covariance matrix and the outer product of its bias vector, and the scalar MSEs of the MLE, LRE, LLE, LLTE and LTPE are given in canonical form, respectively, as follows:

MSE(β̂_MLE) = Σ_j 1/λ_j,

MSE(β̂_LRE) = Σ_j λ_j/(λ_j + k)² + k² Σ_j α_j²/(λ_j + k)²,

MSE(β̂_LLE) = Σ_j (λ_j + d)²/(λ_j(λ_j + 1)²) + (d − 1)² Σ_j α_j²/(λ_j + 1)²,

MSE(β̂_LLTE) = Σ_j (λ_j − d)²/(λ_j(λ_j + k)²) + (k + d)² Σ_j α_j²/(λ_j + k)²,

MSE(β̂_LTPE) = Σ_j (λ_j + kd)²/(λ_j(λ_j + k)²) + k²(1 − d)² Σ_j α_j²/(λ_j + k)².

The following lemmas are needed to prove the statistical properties of the proposed estimator.

Lemma 1. Let M be a positive definite matrix, that is, M > 0, and α be some vector. Then M − αα^T ≥ 0 if and only if α^T M^{-1} α ≤ 1 [19].

Lemma 2. Let β̂_1 and β̂_2 be two estimators of β with biases b_1 and b_2, and let D = Cov(β̂_1) − Cov(β̂_2) > 0. Then MMSE(β̂_1) − MMSE(β̂_2) ≥ 0 if and only if b_2^T (D + b_1 b_1^T)^{-1} b_2 ≤ 1 [20].
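To make the estimators above concrete, the following minimal sketch (ours, not the authors' code; the function names are hypothetical) fits the logistic MLE by iteratively reweighted least squares (IRLS) and then applies the K-L shrinkage that defines the LKLE:

```python
import numpy as np

def logistic_mle_irls(X, y, tol=1e-8, max_iter=100):
    """Logistic MLE via iteratively reweighted least squares (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        w = pi * (1.0 - pi)                    # diagonal of G-hat
        z = eta + (y - pi) / w                 # working response z-hat
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, w
        beta = beta_new
    return beta, w

def lkle(X, y, k):
    """Logistic K-L estimator: (X'GX + kI)^(-1) (X'GX - kI) beta_MLE."""
    beta_mle, w = logistic_mle_irls(X, y)
    S = X.T @ (w[:, None] * X)                 # S = X' G-hat X
    I = np.eye(S.shape[0])
    return np.linalg.solve(S + k * I, (S - k * I) @ beta_mle)
```

Setting k = 0 recovers the MLE, so any change in the estimate comes entirely from the −kI shrinkage in the numerator matrix.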

Comparison among the Estimators
In this section, we will perform a theoretical comparison of the proposed estimator with the available estimators in terms of MMSEs.


Comparison between βMLE and βLKLE
Theorem 1. If k > 0, the estimator β̂_LKLE is preferable to the estimator β̂_MLE in the MMSE sense if and only if b^T D^{-1} b ≤ 1, where D = Cov(β̂_MLE) − Cov(β̂_LKLE) and b = Bias(β̂_LKLE).

Proof. The difference Cov(β̂_MLE) − Cov(β̂_LKLE) can be written in scalar form as follows:

Σ_{j=1}^{p} [1/λ_j − (λ_j − k)²/(λ_j(λ_j + k)²)].

Simplifying, we have

4k Σ_{j=1}^{p} 1/(λ_j + k)²,

which is positive for k > 0, so that D > 0. The result then follows from Lemma 1. This was practically illustrated in Section 5. (Proof completed)

Comparison between βLRE and βLKLE
Proof. The difference Cov(β̂_LRE) − Cov(β̂_LKLE) can be written in scalar form as follows:

Σ_{j=1}^{p} [λ_j/(λ_j + k)² − (λ_j − k)²/(λ_j(λ_j + k)²)].

Simplifying, we have

Σ_{j=1}^{p} k(2λ_j − k)/(λ_j(λ_j + k)²),

which is positive for 0 < k < 2λ_j, so that D > 0. The result then follows from Lemma 2. This was practically illustrated in Section 5. (Proof completed)

Comparison between βLLE and βLKLE
Theorem 3. If k > 0 and 0 < d < 1, the estimator β̂_LKLE is preferable to the estimator β̂_LLE in the MMSE sense if and only if b_2^T (D + b_1 b_1^T)^{-1} b_2 ≤ 1, where D = Cov(β̂_LLE) − Cov(β̂_LKLE), b_1 = Bias(β̂_LLE) and b_2 = Bias(β̂_LKLE).

Proof. The difference Cov(β̂_LLE) − Cov(β̂_LKLE) can be written in scalar form as follows:

Σ_{j=1}^{p} [(λ_j + d)²/(λ_j(λ_j + 1)²) − (λ_j − k)²/(λ_j(λ_j + k)²)],

which is positive whenever (λ_j + d)(λ_j + k) > (λ_j − k)(λ_j + 1) for all j, so that D > 0. The result then follows from Lemma 2. This was practically illustrated in Section 5. (Proof completed)

Comparison between βLLTE and βLKLE
where D = Cov(β̂_LLTE) − Cov(β̂_LKLE), b_1 = Bias(β̂_LLTE) and b_2 = Bias(β̂_LKLE).

Proof. The difference Cov(β̂_LLTE) − Cov(β̂_LKLE) can be written in scalar form as follows:

Σ_{j=1}^{p} [(λ_j − d)² − (λ_j − k)²]/(λ_j(λ_j + k)²) = Σ_{j=1}^{p} (k − d)(2λ_j − k − d)/(λ_j(λ_j + k)²).

Cov(β̂_LLTE) − Cov(β̂_LKLE) is non-negative (nn), since (λ_j − d)² ≥ (λ_j − k)² under the stated conditions. Hence, using Lemma 2, the result follows. This was practically illustrated in Section 5. (Proof completed)

Comparison between βLTPE and βLKLE
Theorem 5. If k > 0 and −∞ < d < ∞, the estimator β̂_LKLE is preferable to the estimator β̂_LTPE in the MMSE sense if and only if b_2^T (D + b_1 b_1^T)^{-1} b_2 ≤ 1, where D = Cov(β̂_LTPE) − Cov(β̂_LKLE), b_1 = Bias(β̂_LTPE) and b_2 = Bias(β̂_LKLE).

Proof. The difference Cov(β̂_LTPE) − Cov(β̂_LKLE) can be written in scalar form as follows:

Σ_{j=1}^{p} [(λ_j + kd)² − (λ_j − k)²]/(λ_j(λ_j + k)²),

which is positive whenever (λ_j + kd)² > (λ_j − k)² for all j, so that D > 0. The result then follows from Lemma 2. This was practically illustrated in Section 5. (Proof completed)

Selection of k
Since the shrinkage parameter plays a significant role in biased estimators such as the LRE, the LLE and the LKLE, several researchers have introduced shrinkage parameter estimation methods for different regression models [21-29]. Based on these studies, we propose some shrinkage estimators of the parameter k for the LKLE.
To estimate the parameter k, following [4], we consider the generalized version of the K-L estimator:

β̂ = (X^T Ĝ_n X + K)^{-1}(X^T Ĝ_n X − K) β̂_MLE,

where K = diag(k_1, k_2, . . ., k_p). In canonical form, the MSE of this estimator is

MSE = Σ_j (λ_j − k_j)²/(λ_j(λ_j + k_j)²) + 4 Σ_j k_j² α_j²/(λ_j + k_j)².

Differentiating with respect to k_j (all terms with index other than j vanish) and equating to zero, we have

−4(λ_j − k_j)/(λ_j + k_j)³ + 8λ_j k_j α_j²/(λ_j + k_j)³ = 0.

Simplifying further gives λ_j − k_j = 2λ_j k_j α_j², and solving for k_j (dividing through by λ_j), we obtain

k_j = λ_j/(1 + 2λ_j α_j²) = 1/(2α_j² + 1/λ_j).

Replacing α_j and λ_j with their estimates, this becomes

k̂_j = 1/(2α̂_j² + 1/λ_j).

Following Hoerl et al. [13], and based on the studies of Mansson et al. [7], Lukman and Ayinde [3] and Qasim et al. [22,30], we suggest biasing parameter estimators for the logistic regression model obtained as summary functions (such as the minimum) of the individual k̂_j; the resulting variants are denoted LKLE 1 and LKLE 2 in the tables below.
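The optimal k_j derived above can be computed directly from the eigendecomposition of X^T Ĝ_n X. A brief sketch (ours; taking the minimum of the k̂_j as the single shrinkage parameter is one plausible rule, not necessarily the authors' LKLE 1/LKLE 2 definitions):

```python
import numpy as np

def kl_parameters(S, beta_mle):
    """Per-coordinate k_j = 1 / (2*alpha_j^2 + 1/lambda_j), where
    S = X' G-hat X = Q diag(lambda) Q' and alpha = Q' beta_MLE."""
    lam, Q = np.linalg.eigh(S)      # eigenvalues (ascending), orthonormal Q
    alpha = Q.T @ beta_mle          # canonical coefficients
    return 1.0 / (2.0 * alpha**2 + 1.0 / lam)

def k_min(S, beta_mle):
    """One candidate single-parameter rule: the smallest k_j (conservative shrinkage)."""
    return float(np.min(kl_parameters(S, beta_mle)))
```

Because each k̂_j equals λ_j/(1 + 2λ_j α̂_j²), every candidate is positive and bounded above by λ_j, so the k > 0 requirement of the theorems is automatically met.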

Monte Carlo Simulation
In this section, we compare the performance of the logistic regression estimators using a simulation study. A significant number of simulation studies have been conducted to compare the performance of estimators for both linear and logistic regression models [24-35]. The MSE is a function of β, σ² and p, and is minimized subject to the constraint β^T β = 1 [36,37]. Schaefer [14] showed that the logistic regression model can be designed using an approach similar to that of the linear regression model. The correlated explanatory variables can be obtained using the simulation procedure given in [38,39]:

x_ij = (1 − ρ²)^{1/2} w_ij + ρ w_{i,p+1}, i = 1, 2, . . ., n, j = 1, 2, . . ., p,
where w_ij are independent standard normal pseudo-random numbers and ρ is the correlation between the explanatory variables. The values of ρ are chosen to be 0.9, 0.95, 0.99 and 0.999. The response variable is generated from the Bernoulli distribution, i.e., y_i ∼ Be(π_i), where π_i = exp(x_i^T β)/(1 + exp(x_i^T β)). The sample size n is varied: 50, 100, 250 or 300. The estimated MSE is calculated as

MSE(β̂) = (1/2000) Σ_{i=1}^{2000} (β̂_i − β)^T (β̂_i − β),

where β̂_i denotes the vector of estimated regression coefficients in the ith replication and β is the vector of true parameter values, chosen such that β^T β = 1. The experiment was replicated 2000 times. We present the estimated MSEs and the bias of each estimator for p = 3 in Tables 1 and 2, respectively. For p = 7, the results are provided in Tables 3 and 4, respectively. We observed that increasing the sample size decreased the MSE values in each case. The following observations were obtained from the simulation results.
The MSE values of the estimators increased as the degree of correlation and the number of explanatory variables increased. The simulation results show that the LKLE performed best at most levels of multicollinearity, sample sizes and numbers of explanatory variables, with few exceptions. The LTPE competed favorably in most cases. Upon comparing the performance of the shrinkage parameters in the LKLE, we found that LKLE 1 performed well except in a few cases. The MLE performed least well when there was multicollinearity in the data. Of the two-parameter estimators (LTPE and LLTE), the LTPE performed better. Additionally, the bias of the proposed estimator was the lowest in most cases. Generally, the LKLE is preferred over the two-parameter estimators.
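The data-generating scheme and MSE criterion described above can be sketched as follows (our illustration; the function names and seed are ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(42)

def correlated_design(n, p, rho):
    """x_ij = sqrt(1 - rho^2) * w_ij + rho * w_{i,p+1}, w ~ N(0,1) i.i.d.
    Note: the pairwise correlation between columns equals rho**2 here."""
    W = rng.standard_normal((n, p + 1))
    return np.sqrt(1.0 - rho**2) * W[:, :p] + rho * W[:, [p]]

def bernoulli_response(X, beta):
    """y_i ~ Be(pi_i) with pi_i = exp(x_i' beta) / (1 + exp(x_i' beta))."""
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return (rng.random(X.shape[0]) < pi).astype(float)

def estimated_mse(estimates, beta_true):
    """Mean of ||beta_hat_i - beta||^2 over the replications."""
    diffs = np.asarray(estimates) - beta_true
    return float(np.mean(np.sum(diffs**2, axis=1)))
```

With β chosen so that β^T β = 1 (e.g., equal components 1/√p), one would draw 2000 replications, fit each estimator per replication, and compare the resulting `estimated_mse` values, as in the tables.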

Application: Cancer Data
The performance of the LKLE and the other estimators was evaluated using a cancer remission dataset [34,40]. In the dataset, the binary response variable y_i is 1 if the patient experienced complete cancer remission and 0 otherwise. There are five explanatory variables: cell index (x_1), smear index (x_2), infil index (x_3), blast index (x_4) and temperature (x_5). There were 27 patients, of whom nine experienced complete remission. The eigenvalues of the X^T Ĝ_n X matrix were found to be λ_1 = 9.2979, λ_2 = 3.8070, λ_3 = 3.0692, λ_4 = 2.2713 and λ_5 = 0.0314. To assess the multicollinearity among the explanatory variables, we use the condition index (CI), computed as CI = √(max(λ_j)/min(λ_j)) = 17.2. There is moderate collinearity when the CI is between 10 and 30 and severe multicollinearity when it exceeds 30 [41]. Thus, the results provide evidence of moderate multicollinearity among the explanatory variables. Next, we compared the performance of the estimators using this dataset. The estimated regression coefficients and the corresponding scalar MSE values are given in Table 5. The scalar MSE of each estimator under study was obtained using the corresponding MSE expression given in Section 2. The proposed LKLE surpassed the other estimators in this study in terms of MSE.
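For instance, the reported condition index can be reproduced from the eigenvalues in a couple of lines (a check of the arithmetic, not the authors' code), taking the CI as the square root of the ratio of the extreme eigenvalues:

```python
import numpy as np

# Eigenvalues of X' G-hat X reported for the cancer data
lam = np.array([9.2979, 3.8070, 3.0692, 2.2713, 0.0314])

# Condition index = square root of the ratio of the extreme eigenvalues
ci = np.sqrt(lam.max() / lam.min())
print(round(ci, 1))   # 17.2
```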
Table 5. Regression coefficients and MSEs of the logistic regression estimators for the cancer dataset *. * Standard errors of the estimators are given in parentheses.

Moreover, we evaluated the theoretical conditions stated in Theorems 1 to 5 on the actual dataset; Table 6 lists each theorem, its condition and the computed value. As shown, all the theorem conditions hold for the cancer data, because all the inequalities in the theorems were less than one, as expected. The logistic ridge estimator competed favorably in both the simulation and the real-life application, and the real-life results agreed with the simulation study. However, the performance of the estimators in both settings was a function of the biasing parameter: for instance, LKLE 1 performed best in the simulation study, while in the real-life analysis LKLE 2 outperformed LKLE 1. Among the two-parameter estimators, the logistic two-parameter estimator (LTPE) performed best. Of the one-parameter estimators, the LKLE outperformed the ridge and Liu estimators. Generally, the LKLE dominated among both the one- and two-parameter estimators. The performance of these estimators is a function of the biasing parameters k and d. Additionally, as shown in Table 5, β̂_2 and β̂_3 did not fit well for the following estimators: MLE, LLE, LLTE, LTPE and LKLE 1.

Some Concluding Remarks
Kibria and Lukman [10] developed the K-L estimator to circumvent the multicollinearity problem for the linear regression model. In this paper, we proposed the logistic Kibria-Lukman estimator (LKLE) to address the challenge of multicollinearity for the logistic regression model. We theoretically determined the superiority of the LKLE over other existing estimators in terms of the MSE. The performance of the estimators was evaluated using a Monte Carlo simulation study in which factors such as the degree of correlation, the sample size and the number of explanatory variables were varied. The results showed that the performance of the estimators was highly dependent on these factors. Finally, to illustrate the efficiency of the proposed estimator, we analyzed a cancer dataset and observed that the results agreed with those of the simulation study to some extent. The findings of this study will be helpful for practitioners and applied researchers who use a logistic regression model with correlated explanatory variables.

Theorem 2.
If k > 0, the estimator β̂_LKLE is preferable to the estimator β̂_LRE in the MMSE sense if and only if b_2^T (D + b_1 b_1^T)^{-1} b_2 ≤ 1, where D = Cov(β̂_LRE) − Cov(β̂_LKLE), b_1 = Bias(β̂_LRE) and b_2 = Bias(β̂_LKLE).

Theorem 4.
If k > 0 and −∞ < d < ∞, the estimator β̂_LKLE is preferable to the estimator β̂_LLTE in the MMSE sense if and only if b_2^T (D + b_1 b_1^T)^{-1} b_2 ≤ 1, where D = Cov(β̂_LLTE) − Cov(β̂_LKLE), b_1 = Bias(β̂_LLTE) and b_2 = Bias(β̂_LKLE).

Table 1 .
Estimated MSEs for p = 3.

Table 2 .
Estimated bias for p = 3.

Table 3 .
Estimated MSEs for p = 7.

Table 4 .
Estimated bias for p = 7.

Table 6.
Validation of the theoretical conditions for the cancer data.