A More Accurate Estimation of Semiparametric Logistic Regression

: Growing interest in genomics research has called for new semiparametric models based on kernel machine regression for modeling health outcomes. Models containing redundant predictors often show unsatisfactory prediction performance. Thus, our task is to construct a method which can guarantee the estimation accuracy by removing redundant variables. Speciﬁcally, in this paper, based on the regularization method and an innovative class of garrotized kernel functions, we propose a novel penalized kernel machine method for a semiparametric logistic model. Our method can promise us high prediction accuracies, due to its capability of ﬂexibly describing the complicated relationship between responses and predictors and its compatibility of the interactions among the predictors. In addition, our method can also remove the redundant variables. Our numerical experiments demonstrate that our method yields higher prediction accuracies compared to competing approaches.


Introduction
Logistic regression, a well-known classification prediction model, is an extension of linear models in dealing with binary responses. However, its prediction accuracies can be decreased when redundant predictors are included. Therefore, a high demand has emerged for a flexible classification prediction model which has high prediction accuracies and can select important predictors. Recently, regularization methods have had great success in improving stability of estimation and increasing accuracies of prediction. Popular regularization methods, such as the Least Absolute Shrinkage and Selection Operator (LASSO) [1], the Smoothly Clipped Absolute Deviation (SCAD) [2], adaptive LASSO [3], and Elastic Net [4], have been widely used in logistic regression models with linear structure [5][6][7]. In addition, different types of regularization function have also been explored to be rolled in logistic regression models. For example, a 1 2 penalty and a hybrid 1 2 +2 regularization function have been used in sparse logistic regression models [8,9]. Likewise, in [10], the 0 -norm has been used in a sparse generalized linear model. By shrinking some regression coefficients to exactly zero, most of the above methods can realize model estimation and variable selection simultaneously.
However, linear structure may be insufficient to capture the complicated relationship between responses and predictors. To release the linear assumption, some nonparametric and semiparametric methods have been presented. Among these methods, a Generalized Additive Model (GAM) [11] has drawn increasing attention. Meier et al. [12] estimated a high-dimensional GAM via a sparsity-smoothness penalty. Ravikumar et al. [13] proposed a variable selection procedure for GAM and extended it to an additive logistic model. Li et al. [14] developed a SCAD-based method for simultaneous variable selection and estimation in GAM with non-polynomial dimensionality. These additive models are generally sensible. However, the additive structure may be not flexible enough for some real data. In addition, the spline based methods used in these models, which requires the specification of the smoothness condition of the unknown function, may be involved and awkward for multidimensional data.
Known as another function approximation method, a kernel machine method starts with a kernel function which implicitly determines the smoothness of the unknown function rather than decides a predetermined form or basis, and hence greatly simplifies specification of a nonparametric function especially for multidimensional data. More recently, many promising kernel machine methods were proposed. For example, the Kernel Logistic Regression (KLR) [15], which replaces the loss function of a support-vector machine with a negative log-likelihood of binomial distribution, can obtain a natural estimate of the posterior probability. Compared to nonparametric models, semiparametric models are more widely used to deal with real data. This is because the semiparametric models not only retain the flexibility of the nonparametric models but also maintain the interpretability of the parametric models. The Least-Squares Kernel Machine method (LSKM) [16] is a popular semiparametric model and has been extended to a generalized semiparametric model [17]. To overcome the limitation of LSKM for small sample cases, a Bayesian hierarchical modeling approach has been proposed by Kim et al. [18]. In addition, Freytag et al. [19,20] proposed two modified kernel functions based on LSKM to improve identification of meaningful associations. However, the prediction accuracies of these methods would be decreased when redundant predictors were contained in the models.
Variable selection based on kernel machine has attracted a lot of attention. The COmponent Selection and Smoothing Operator (COSSO) method [21] is proposed for variable selection via using an 1 penalty in nonparametric kernel models, which is designed in the framework of smoothing splines ANOVA model, and then is extended to a logistic model by Zhang and Lin [22]. Allen [23] showed that a fully nonparametric KerNel Iterative Feature Extraction (KNIFE) method could also achieve feature selection via using a linearized weighted kernel. In addition, Xu et al. [24] modified the penalty form of the KLR with 1 2 regularization to gain more sparse solutions. The above variable selection methods are based on fully nonparametric models. Moreover, the general kernel function assumes that the predictors have the same marginal effect on the response variable. This assumption is not flexible enough to describe the relationship between predictors and responses. Alternatively, considering the presence of possible gene-gene interactions, a garrote kernel machine method which uses a score test for variable selection was proposed in Maity and Lin [25]. However, it is a backwards selection method and thus not efficient for modeling. By using a regularization method, a Penalized Garrotized Kernel Machine method (PGKM) which can select important predictors and process modeling simultaneously is proposed in Rong et al. [26]. This is an efficient method for continuous outcomes. A semiparametric logistic model is a widely used model to predict health outcomes in terms of clinical information and gene expression measurements. The accuracy of modeling estimation is the core problem. Thus, our task is to construct a method which can guarantee the estimation accuracy by removing redundant variables. Specifically, in this paper, we propose a novel Penalized Logistic Garrotized Kernel Machine (PLGKM) method for binary outcomes. PLGKM can eliminate irrelevant predictors in both the parametric and nonparametric part, while allowing for a complex nonlinear relationship between the responses and predictors. The remainder of this paper is organized as follows. In Section 2, we present a novel garrotized kernel machine method for a semiparametric logistic model and investigate the model estimation and variable selection problem. In addition, we propose a "one-group-at-a-time" cyclical coordinate descent algorithm for the solution path of the tuning parameters. The performance of our proposed method is evaluated by several simulation examples in Section 3. We apply our method to a study of the breast cancer in Section 4. Section 5 considers the extension to a generalized kernel machine model. Section 6 contains concluding remarks.

Kernel Machine for Semiparametric Logistic Regression
We are interested in the relationship between predictors (clinical and demographic information, gene expression measurements) and disease outcomes. A popular technique to explore the relationship between these predictors and responses is semiparametric models which treat clinical and demographic variables in a linear part and set the gene variables in a nonparametric part. By this way, not only can we interpret the margin effect of the clinical and demographic variables on disease outcomes, but we also allow a nonlinear relationship between gene variables and outcomes.
Suppose we have n observations constructed with healthy individuals and patients. We take y i to denote the health situation of the i-th individual. Specifically, y i = 1 refers to a patient and y i = 0 refers to a healthy individual. For each individual, we collect P clinical and demographic predictors and Q gene predictors. Thus, we have clinical and demographic predictors x i ∈ R P and gene predictors z i ∈ R Q . Then, we can construct a semiparametric model as follows: where p i = prob(y i = 1|x i , z i ), β is an unknown P × 1 vector of regression coefficients, and h(·) is an unknown and possibly complicated function. The first part in the model (1) is just a linear model which can be estimated easily. Thus, in this paper, we focus on exploring the nonparametric part of (1). We assume that h(·) lies in a Reproducing Kernel Hilbert Space H K and use a linear combination of positive definite kernel functions to construct h(·). Exactly, as Mercer's theorem [27] claims, a positive definite kernel function K(·, ·) produces a Reproducing Kernel Hilbert Space H K . In addition, these positive definite kernels are also called Mercer kernels. Two popular Mercer kernels are: where c is tuning parameter. The corresponding feature vector contains all terms up to degree d.
where γ is known as the bandwidth. The Gaussian kernel function can be seen as a function of Euclidean distance. In addition, its feature map lies in an infinite dimensional space.
There are also some other commonly used kernel functions like the neural network and smoothing spline kernels [28]. One can choose an appropriate kernel function according to a real situation.

Penalized Garrotized Kernel Machine for Semiparametric Logistic Regression
In some real data analysis, the redundant predictors contained in x i and z i may decrease the estimation and prediction accuracies of the semiparametric logistic regression model. Therefore, variable selection is necessary and especially challenging in nonparametric components. In order to facilitate the variable selection in nonparametric components, Rong et al. [26] propose a "garrotized" kernel K (g) which can describe the influence of different marginal effects on the response y i . In addition, K (g) is defined by For example, the dth garrotized Polynomial Kernel is K(z 1 , z 2 ; δ) = (z T 1 ∆z 2 + c) d , where the ∆ = diag(δ 1 , ..., δ Q ) and the garrotized Gaussian kernel is The unknown δ can be estimated from data. Each δ q modulates the effect of nonparametric predictors z q on response. δ q = 0 implies that z q is not predictive of the response. Thus, the garrotized kernel formulation provides a flexible way to select variables in a semiparametric setting. In addition, with some proper K(·, ·), the nonparametric predictors' interactions will be allowed in the complicated function h(·).
To eliminate the interference of variable dimension, we first standardize the predictors x i and z i . Then, to estimate the parameters of model (1) with the garrotized kernel (2), we maximize the following popular penalized log-likelihood function: where λ 1 and λ 2 are non-negative regularization parameters, λ 3 is a trade-off parameter between goodness of fit and complexity of the model, and h H K (g) denotes the functional norm in the Reproducing Kernel Hilbert Space H K (g) generated by the garrotized kernel. The linear mapping of the predictors in the kernel function implies that the garrotized Gaussian kernel is essentially a Gaussian kernel. In addition, the garrotized Gaussian kernel is therefore positive definite. Then, with the representer theorem [29], the nonparametric function h(z i ) can be expressed as where k i (δ) = {K (g) (z i , z 1 ; δ), · · · , K (g) (z i , z n ; δ)} is a n × 1 vector, and α = (α 1 , · · · , α n ) T is an unknown parameter vector. By substituting (4) into (3), we have where K(δ) is a n × n Gram matrix whose (i, j)-th element is Equivalently, we can rewrite the maximization of (5) in matrix form as arg max T is an n-dimensional vector and · 1 is the 1 norm. The solution to (6) is called a PLGKM estimate for a semiparametric logistic regression model.
Specifically, giving α = α (t−1) and δ = δ (t−1) , problem (6) can be written as to get the solution to (7), we adapt the penalized reweighed least squares approach proposed in [1]. We denote the log-likelihood as n (α, A(α, β, δ). Taking η = X β + K(δ)α, the log-likelihood can be seen as a function of η. Thus, we can get the gradient and Hessian of n (·) with respect to η as n (η) and n (η), respectively. Given the current estimateη = X β (t−1) + K(δ (t−1) )α (t−1) , maximizing log-likelihood n (η) is equivalent to maximizing the following two-term Taylor expansion of the log-likelihood where V (η) =η − n (η) −1 n (η). Similar to [1], to reduce the difficulty of computation, we approximate the Hessian matrix with a diagonal one. Thus, we can get an approximate solution to (7) by minimizing the penalized reweighed least squares with respect to β where W (η) is a positive definite diagonal matrix which has the same diagonal elements as − n (η). Hence, we can transfer (7) to a convex optimization problem. Treating V (η) − K(δ (t−1) )α (t−1) as the working response, W (η) as weight, and the standard computational procedures for estimation of LASSO regression [30,31] can be used to estimate β. Similarly, to update the estimate of α with β = β (t) and δ = δ (t−1) , we can rewrite problem (6) as arg max in addition, problem (9) is equivalent to minimizing the following function with respect to α, which is a quadratic function of α. The stationary point of this function is the solution to the following linear equation: (11) which is straightforward to solve. When the left-hand side of the above equation is singular, a diagonal matrix with small entries such as 1 × 10 −5 , can be added to stabilize the estimate. Finally, with α = α (t) and β = β (t) , updating δ is equivalent to solving a nonlinear optimization problem under the constraints of δ q > 0, q = 1, · · · , Q. To this end, we propose to use a one-at-a-time coordinate descent algorithm. Specifically, for δ q 0 , q 0 = 1, · · · , Q, with α = α (t) , β = β (t) and δ q = δ (t−1) q , q = q 0 , q = 1, · · · , Q, (6) can be written as arg max the estimate of δ q 0 can be obtained by using the L-BFGS-B algorithm [32] on a standard univariate nonlinear constrained optimization software program. The coordinate descent algorithm for PLGKM is outlined in Algorithm 1: Algorithm 1 Coordinate descent algorithm [31] for PLGKM 1: Initialization: t = 0, α = α (0) , β = β (0) and δ = δ (0) . 2: repeat 3: Set t = t + 1.

Selection of Tuning Parameters
We use a decreasing sequence of λ = (λ 1 , λ 2 , λ 3 ) to fit models on the training set. This scheme not only provides us a path of solutions but also exploits warm starts and thus makes the algorithm more stable and faster. To select the optimal tuning parameters, the cross-entropy loss (CEL) is calculated on the validation dataset. The CEL is defined as Finally, we choose the estimated model that gives the lowest CEL on the validation set. In practice, we start with a rough search in a large range to find a relatively reasonable value for λ. In addition, then, according to this value, we launch a finer local search to get an appropriate value for λ.

Comparison with LSKM
We conduct a simulation study to compare our PLGKM method with the LSKM method of Liu et al. [17]. This simulation study is started with the generation of 500 observations from the following semiparametric logistic model: where logit(p i ) = log( p i 1−p i ). x ip and z iq are independently generated from U(−1, 1) and U(0, 1), respectively. To mimic the complicated relationship between gene expression predictors and outcomes, we employ a nonparametric function h(·) which is able to describe the nonlinearity of predictors and is compatible with the interactions among predictors.
The prediction performances of the fitted models are measured using the CEL and misclassification rates (MR) based on the test sets. The MR stands for the proportion of misclassified observations In the following Table 1, for each setting, we compare the mean prediction performances of PLGKM and LSKM based on 500 simulation experiments.  Table 1, we can see that our PLGKM method outperforms the LSKM method for these four settings. Specifically, for settings 1 and 3, our PLGKM method produces slightly smaller average MR and average CEL than the LSKM method. These small advantages are guaranteed by the flexibility of the garrotized kernel in the value of δ q . For settings 2 and 4, our PLGKM method shows significant advantages in both average MR and average CEL. This is because the PLGKM method can eliminate the irrelevant variables by shrinking the estimation of the corresponding δ q and β p to be very small. Therefore, in practice, as it is unknown whether there are irrelevant variables, we recommend the PLGKM method for modeling.
Variable selection is a byproduct of our PLGKM method. The PLGKM method, allowing complicated nonlinear h(·), can simultaneously estimate the parameters of model (14) and select the important variables. When the estimation of δ q is zero, we conclude that the corresponding predictor z q is not related to the outcome. In fact, similar to the SCAD procedure in [2], we consider δ q to be zero when its estimation is less than 10 −5 . In addition, for the clinical and demographic predictors in X, we eliminate the irrelevant ones of which the estimation is exactly zero.
In the following Table 2, we show the variable selection ability of our PLGKM method based on settings 2 and 4.  From Table 2, we can see that nearly all relevant X can be selected by PLGKM with a fairly high probability. This is reflected in the low under-selection rates of X given in Table 2. In addition, when the dimension of Z is not very high, our PLGKM method still has the opportunity to select all relevant Z. In addition, this opportunity will become less when the dimension of Z increases.

Comparison with COSSO
The COSSO method [22], a popular method for the nonparametric model, shows an impressive ability in model estimation and variable selection. Thus, in this section, we would like to compare the performance of our PLGKM method based on Gaussian kernel for a semiparametric logistic model with the performance of the COSSO method for a nonparametric model (exactly an additive ANOVA model only containing the main effects). We generate data as in Section 3.1 and process 500 replications of each setting.
We evaluate the performances of these two methods in terms of MR and CEL which are reported in the following Table 3. From Table 3, we can see that the PLGKM method always produces considerably smaller average MR and CEL than the COSSO method. Exactly, compared with the COSSO method which is limited by its additive structure, the PLGKM method is more flexible as it can describe the more complicated relationships between the predictors and the outcomes by allowing the interactions among predictors. In addition, the outperformances are therefore guaranteed.

Analysis of the Breast Cancer Dataset
To explore the performance of our PLGKM method on microarray data, we consider a breast cancer data (GSE70947) obtained from the CuMiDa database (The microarray datasets in CuMiDa database were examined from more than 30,000 microarray experiments from the Gene Expression Omnibus using a rigorous filtering criteria [33]). We have expression data on 35,981 genes from 289 observations which are composed of 143 patients with breast cancer and 146 healthy individuals.
The high dimension of this breast cancer data makes modeling difficult. Thus, we first conduct a dimension reduction using the nonparametric independence screening method proposed in [34]. After this dimension reduction, we get adjusted breast cancer data with only 12 genes. Considering that there is no clinical nor demographic predictors, we fit this adjusted breast cancer data with a nonparametric model using our PLGKM method based on a garrotized Gaussian kernel. In addition, for comparison, we also apply the LSKM method based on Gaussian kernel and the LASSO method for logistic regression to this adjusted breast cancer data.
Specifically, we randomly split the adjusted breast cancer data into a training set with 100 observations for estimation, a validation set with 100 observations for tuning parameters selection, and a test set with the remaining 89 observations for prediction. We repeat this random split for 1000 times. For each repetition, we calculate the CEL and MR for these three methods, respectively. In addition, then we compare these three methods in terms of the average values of the CEL and MR which are collected in the following Table 4. Table 4. Average prediction error of each method for 1000 replications, with their standard deviations in parentheses.  Table 4, we can see that our PLGKM outperforms the other two methods. Compared to the LSKM method, the PLGKM method benefits from the garrotized kernel and thus gives smaller average CEL and average MR. Compared to the LASSO-COX, our PLGKM method makes a significant improvement in prediction. In addition, this implies that the linear structure is not sufficient to describe the complicated relationships between the genes and the outcome and gene-gene interactions.

PLGKM Method for Outcomes Having an Exponential Family Distribution
In this paper, we introduce our PLGKM method based on binary outcomes. Exactly, when outcomes have an exponential family distribution, we can also use our PLGKM method to estimate the corresponding semiparametric model. The distribution of y i follows the exponential family [35]: where θ i = x T i β + h(z i ) is a canonical parameter, φ is a dispersion parameter, w i is a known weight, b(·), and c(·) are known functions. From (16), we can obtain µ i = E(y i ) = b (θ i ), Var(y i ) = b (θ i )φ/w i . Inspired by the generalized linear model, we can use link functions and garrotized kernel to construct a generalized semiparametric model as where g(·) is a link function. Different response distributions correspond to different link functions g(·). For example, g(µ) for binomial distribution is equal to log µ 1−µ , while g(µ) for Poisson distribution is equal to log(µ).
Analogous to (5), the unknown parameters α, β and δ can be obtained by maximizing the penalized log-likelihood function where l(·) = log(P ) is the log-likelihood function, P is the density function given in (16), and K(δ) is a n × n Gram matrix whose (i, j)-th element is K (g) (z i , z j ; δ).
The "one-group-at-a-time" cyclical coordinate descent algorithm can also be used to solve α, β, δ. This process is essentially the same as that described in Section 2.3. Specifically, we denote the log-likelihood as n (α, β, δ) = ∑ n i=1 l{y i , x i , z i ; α, β, δ} and adapt the penalized reweighted least squares approach by performing a two-term Taylor expansion on the log-likelihood. Thus, standard computational procedures for solving LASSO regression estimates, like the R package glmnet, could be used to estimate β. Then, by directly solving a linear system of equations, we can estimate α with the stationary point of a quadratic function similar to (10). Finally, with given α and β, updating δ is equivalent to solving a nonlinear optimization problem under the constraints that δ q > 0, q = 1, · · · , Q. The estimate of δ could be obtained by using the standard univariate nonlinear constrained optimization software.

Conclusions
In this paper, we have proposed a PLGKM method for semiparametric logistic model based on the LASSO method and an innovative class of garrotized kernel functions, suitable for the data measured on a strong, metric scale. In addition, we have also devised an efficient coordinate descent computational algorithm for implementation. To explore the performance of our PLGKM method, we have conducted some numerical experiments with simulation data sets and the breast cancer data (GSE70947).
In the numerical experiments, our PLGKM method consistently outperforms the other methods (LSKM, COSSO, and LASSO) as our PLGKM method always provides small average CEL and average MR. In addition, the advantage is especially significant when there are redundant predictors in the models. For breast cancer data (GSE70947), our PLGKM method obviously outperforms the LASSO method.
Our proposed method can not only capture the complicated relationships between predictors and outcomes, but also allow predictors' interactions. In addition, the variable selection ability of our PLGKM method helps to achieve high prediction accuracies. These advantages of our PLGKM method guaranteed its outperformances in the numerical experiments.
In the future, we will explore the extension of the PLGKM method to other models (such as quantile regression model), and improve the handlability of the PLGKM method for complex structured data (such as categorical data).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
Publicly available datasets were analyzed in this study. These data can be found here: https://sbcb.inf.ufrgs.br/cumida (accessed on 3 June 2020).

Conflicts of Interest:
The authors declare no conflict of interest.