On the Performance of Variable Selection and Classification via Rank-Based Classifier

Md Showaib Rahman Sarker; Michael Pokojovy; Sangjin Kim

doi:10.3390/math7050457

,

and

Department of Mathematical Sciences, The University of Texas at El Paso, El Paso, TX 79968, USA

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics2019, 7(5), 457;https://doi.org/10.3390/math7050457

This article belongs to the Special Issue Uncertainty Quantification Techniques in Statistics

Version Notes

Order Reprints

Abstract

In high-dimensional gene expression data analysis, the accuracy and reliability of cancer classification and selection of important genes play a very crucial role. To identify these important genes and predict future outcomes (tumor vs. non-tumor), various methods have been proposed in the literature. But only few of them take into account correlation patterns and grouping effects among the genes. In this article, we propose a rank-based modification of the popular penalized logistic regression procedure based on a combination of

ℓ_{1}

and

ℓ_{2}

penalties capable of handling possible correlation among genes in different groups. While the

ℓ_{1}

penalty maintains sparsity, the

ℓ_{2}

penalty induces smoothness based on the information from the Laplacian matrix, which represents the correlation pattern among genes. We combined logistic regression with the BH-FDR (Benjamini and Hochberg false discovery rate) screening procedure and a newly developed rank-based selection method to come up with an optimal model retaining the important genes. Through simulation studies and real-world application to high-dimensional colon cancer gene expression data, we demonstrated that the proposed rank-based method outperforms such currently popular methods as lasso, adaptive lasso and elastic net when applied both to gene selection and classification.

Keywords:

gene-expression data; ℓ2 ridge; ℓ1 lasso; adapative lasso; elastic net; BH-FDR; Laplacian matrix

MSC:

62F03; 62F07; 62P10

1. Introduction

Microarrays are an advanced and widely used technology in genomic research. Tens of thousands of genes can be analyzed simultaneously with this approach [1]. Identifying the genes related to cancer and building high-performance prediction models of maximal accuracy (tumor vs. non-tumor) based on gene expression levels are among central problems in genomic research [2,3,4]. Typically, in high-dimensional gene expression data analysis, the number of genes is significantly larger than the sample size, i.e.,

m ≫ n

. Hence, it is particularly challenging to identify those genes that are relevant to cancer disease and put forth prediction models. The main problem associated with high-dimensional data (

m ≫ n

) is that of overfitting or overparametrization which leads to poor generalizability from training to test data.

Therefore, various researchers apply different types of regularization methods to overcome this “curse of dimensionality” in regression and other statistical and machine learning frameworks. These regularization approaches include, for example, the

ℓ_{1}

-penalty or lasso [5], which performs continuous shrinkage and feature selection simultaneously; smoothly clipped

ℓ_{1}

-penalty or SCAD [6], which is symmetric, non-concave and has singularities at the origin to produce sparse solutions; fussed lasso [7], which imposes the

ℓ_{1}

-penalty on the absolute difference of regression coefficients in order to enforce some smoothness of coefficients; or the adaptive lasso [8], etc. Unfortunately,

ℓ_{1}

-regularization sometimes perform inconsistently when used for variable selection [8]. In some situations, it introduces a major bias in estimated parameters in the logistic regression [9,10]. In contrast, the elastic net regularization procedure [11] as a combination of

ℓ_{1}

- and

ℓ_{2}

-penalties can successfully handle the highly correlated variables which are grouped together. Among the procedures mentioned above, elastic net and fussed lasso penalized methods are appropriate for gene expression data analysis. Unfortunately, when some prior knowledge needs to be utilized, e.g., when studying complex diseases such as cancer, those methods are not appropriate [4]. To account for a regulatory relationship between the genes and a priori knowledge about these genes, network-constrained regularization [4] is known to perform very well by incorporating a Laplacian matrix into the

ℓ_{2}

-penalty from the enet procedure. This Laplacian matrix represents a graph-structure of genes which are linked with each other. To select significant genes in high-dimensional gene expression data for classification, the graph-constrained regularization method is extended to logistic regression model [12].

Using penalized logistic regression methods [12,13] and graph-constrained procedures [4,12], we would build rank-based logistic regression method with variable screening procedure to improve the power of detecting most promising variables as well as classification capability.

The rest of this article is organized as follows. In Section 2, we describe variable screening procedure with adjusted p-values and regularization procedure for grouped and correlated predictors and present the computational algorithm. Further, we state the ranking criteria of four models and summarize the result of ranking procedure. In Section 3, we compare the proposed procedure with existing cutting edge regularization methods on simulation studies. Next, we apply four penalized logistic regression methods to the high dimensional gene expression data of colon cancer carcinoma to evaluation and comparison of the performance. Finally, we present a brief discussion of results and future research direction.

2. Materials and Methods

2.1. Adjusted p-Values: Benjamini and Hochberg False Discovery Rate (BH-FDR)

Multiple hypohtesis testing methods have been playing an important role in selecting most promising features while controlling type I error in high-dimensional settings. One of the most popular methods is BH-FDR [14,15] which is concerned with the expected proportion of incorrect number of rejections among a total number of rejections. The formula is mathematically expressed as

E (\frac{V}{R} | R > 0)

, where V is the number of false positives and R is the total number of rejections. In this paper, the FDR method is used both for the purpose of prelimary variable screening both in the simulation studies and real data analysis to be presented later. The procedure of the method is as follows:

(1): Let $p_{1}, p_{2}, \dots, p_{m}$ be the p-values of m hypothesis tests and sort them with the increasing oder: $p_{(1)}, p_{(2)}, \dots, p_{(m)}$ .
(2): Let $\hat{i} = max {i | p_{(i)} \leq \frac{i q}{m}, i = 1, \dots, m}$ for a given threshold q. If $\hat{i} > 1$ , then reject the null hypotheses associated with $p_{(1)}, p_{(2)}, \dots, p_{(i)}$ . Otherwise, no hypotheses are rejected.

2.2. Regularized Logistic Regression

In the following, we present the regularized logistic regression model used in this paper (cf. [12]). Since this model is an integral part of our computational algorithm to be outlined in the section to follow, presenting the formula with all appropriate notations is necessary for our purposes.

Let the

n \times (m + 1)

matrix

X = (\begin{matrix} 1 & x_{11} & x_{12} & \dots & x_{1 j} & \dots & x_{1 m} \\ 1 & x_{21} & x_{22} & \dots & x_{2 j} & \dots & x_{2 m} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{i 1} & x_{i 2} & \dots & x_{i j} & \dots & x_{2 m} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋱ & ⋮ \\ 1 & x_{n 1} & x_{n 2} & \dots & x_{n j} & \dots & x_{n m} \end{matrix})

denote the design matrix, where n is the sample size and m is the total number of predictor variables. Without loss of generality, we assume the data are standardized with respect to each variable. This step is also performed by the pclogit R-package used in the present paper. Define the parameter vector

η = (β_{0}, β)

comprised of an intercept

β_{0}

and m “slopes”,

β_{1}, \dots, β_{m}

. The objective function then is written as

f (η) = - L (η) + p (β)

(1)

with the log-likelihood function

L (η) = \frac{1}{n} \sum_{i = 1}^{n} [y_{i} log π (x_{i}) + (1 - y_{i}) log (1 - π (x_{i}))]

and resulting probabilities

π (x_{i}) = \frac{exp (β_{0} + x_{i}^{T} β)}{1 + exp (β_{0} + x_{i}^{T} β)} .

Here,

p (β)

is the penalty function and the response variable

y_{i}

takes the value 1 for cases and 0 for controls. The i-th individual is deemed case or control based on the probability

π_{i}

. Following [4], statistical dependence among the m explanatory variables can be modeled by a graph, which, in turn, can be described by its m-dimensional Laplacian matrix

L = (L (u, v) | u, v vertices)

with the entries

L (u, v) = \{\begin{matrix} 1, & if u = v and d_{u} \neq 0, \\ - {(d_{u} d_{v})}^{- \frac{1}{2}}, & if u and v are adjacent, \\ 0, & otherwise . \end{matrix}

Here,

d_{v}

is the degree of a vertex v, i.e., the number of edges through this vertex. If there is no link in v (i.e., v is isolated), then

d_{v} = 0

. The martix L is symmetric, positive semi-definite and has 0 as the smallest eigenvalue and 2 as the largest eigenvalue. In the following, we will write

u \sim v

to refer to adjecent vertices. The penalty term in Equation (1) can is defined as

p (β) = λ_{1} {∥ β ∥}_{1} + λ_{2} β^{T} L β = λ_{1} \sum_{j = 1}^{m} | β_{j} | + λ_{2} \sum_{u = 1}^{m} \sum_{u \sim v} {(\frac{β_{u}}{\sqrt{d_{u}}} - \frac{β_{v}}{\sqrt{d_{v}}})}^{2} .

(2)

Here,

λ_{1}

and

λ_{2}

are tuning parameters meant to control the sparsity and smoothness,

{∥ β ∥}_{1}

is the

ℓ_{1}

-norm and

\sum_{u \sim v} (\dots)

denotes the summation over all adjacent vertex pairs. When

λ_{2} = 0

, the penalty reduces to that of lasso [5], and if

L

is replaced by the

m \times m

-identity matrix

I

, the penalty corresponds to that of an elastic net [11]. If

λ_{1} = 0

and

L = I

, we arrive at ridge regression. In Equation (2), the penalty consists of

ℓ_{1}

- and

ℓ_{2}

-components. The

ℓ_{2}

-penalty is a degree-scaled difference of coefficients between linked predictors. According to [4], the predictor variables with more connections have larger coefficients. That is why small change of expression in the variables can lead to large change in response. Thus, this imposes sparsity and smoothness as well as correlation and grouping effects among variables. In case-control DNA methylation data analysis, ring networks and fully connected networks (cf. Figure 1) are typically used to describe correlation pattern of CpG sites within genes [12]. The Laplacian matrix is sparse and tri-diagonal (except for two corner elements) for ring networks and has all non-zero elements for fully connected networks. Those variables with more links produce strong grouping effects and are more likely to be selected in both networks [12].

Figure 1. The ring network (left) and F.con network (right) are shown for the case there are two genes consisting of 6 and 9 CpG sites, respectively.

2.3. Computational Algorithm

Li & Li (2010) [16] developed an algorithm for graph-constrained regularization motivated by a coordinate descent algorithm from [17] for solving the unconstrained minimization problem for the objective in Equation (1). The algorithm implementation from the pclogit R-package [12,13] replaced the identity matrix by Laplacian matrix in the elastic net algorithm from the glmnet R-package [18]. According to Equation (1), the objective function is

f (η) = - L (η) + p (β),

where

p (β) = λ α \sum_{i = 1}^{m} | β_{i} | + \frac{1}{2} λ (1 - α) \sum_{u = 1}^{m} \sum_{u \sim v} {(\frac{β_{u}}{\sqrt{d_{u}}} - \frac{β_{v}}{\sqrt{d_{v}}})}^{2}

(3)

with

λ = λ_{1} + 2 λ_{2}

and

α = \frac{λ_{1}}{λ_{1} + 2 λ_{2}}

for some

λ_{1}, λ_{2} > 0

.

Following [18], we perform a second-order Taylor expansion of

L (\cdot)

around the current estimate

(β_{0}^{*}, β^{*})

to approximate the objective

L (\cdot)

in Equation (1) via

f^{*} (x) = - \frac{1}{2 n} \sum_{i = 1}^{n} q_{i} {(t_{i} - β_{0} - x_{i}^{T} β)}^{2} + p (β),

where

\begin{matrix} t_{i} & = β_{0}^{*} + x_{i}^{T} β^{*} + q_{i}^{- 1} (y_{i} - π^{*} (x_{i})), \\ q_{i} & = π^{*} (x_{i}) (1 - π^{*} (x_{i})), \\ π^{*} (x_{i}) & = 1 - {(1 + exp (β_{0} + x_{i}^{T} β^{*}))}^{- 1} . \end{matrix}

Now, if all other estimates for all

v = u

are fixed,

β_{u} = β_{u}^{*}

can be computed. To update the estimate from

β_{u}^{*}

, we have to set the gradient of

f^{*} (\cdot)

equal zero (strictly speaking, zero has to be included in the subgradient of

f^{*} (\cdot)

) and then solve for

β_{u}

to obtain

β_{u}^{*} = \frac{s (\frac{1}{n} \sum_{i = 1}^{n} q_{i} x_{i u} (t_{i} - t_{i}^{(\tilde{u})}) + λ (1 - α) g (u), λ α)}{\frac{1}{n} \sum_{i = 1}^{n} q_{i} x_{i u}^{2} + λ (1 - α)},

where

\begin{matrix} t_{i}^{(\tilde{u})} & = β_{0}^{*} + \sum_{j \neq u} x_{i j} β_{j}^{*}, \\ g (u) & = \sum_{u \sim v} \frac{β_{v}^{*}}{\sqrt{d_{u} d_{v}}} \end{matrix}

(4)

and

s (z, r)

denotes the “soft threshholding” operator given by

s (z, r) = sign (z) {(| z | - r)}_{+} = \{\begin{matrix} z - r, & if z > 0 and r < | z |, \\ z + r, & if z < 0 and r < | z |, \\ 0, & otherwise . \end{matrix}

If the u-th predictor has no links to other predictors, then

g (u)

in Equation (4) becomes zero, while Equation (3) takes the form

p (β) = λ α \sum_{i = 1}^{m} | β_{i} | + \frac{1}{2} λ (1 - α) \sum_{u = 1}^{m} β_{u}^{2} .

Thus, the regularization reduces to that of the elastic net (enet) procedure. In general, when the linkage is nontrivial, the term

λ (1 - α) g (u)

is added to the elastic net to get the desired grouping effect.

2.4. Adaptive Link-Constrained Regularization

When there is a link between two predictors but their regression coefficients have different signs, the coefficients cannot be expected to be smooth [16]—even locally. To resolve this problem, we first need to estimate the sign of the coefficients and then refit the model with estimated signs. When the number of predictor variables is smaller than that of sample points, ordinary least squares are performed, while ridge estimates are computed, otherwise. We have to modify the Laplacian matrix in the penalty function:

L^{*} (u, v) = \{\begin{matrix} 1, & if u = v and d_{u} \neq 0, \\ - s_{u} s_{v} {(d_{u} d_{v})}^{- \frac{1}{2}}, & if u and v are adjacent, \\ 0, & otherwise \end{matrix}

and then update the

g (\cdot)

-function in Equation (4) via

g^{*} (u) = \sum_{u \sim v} \frac{s_{u} s_{v} β_{v}^{*}}{\sqrt{d_{u} d_{v}}} .

2.5. Accuracy, Sensitivity, Specificity and Area under the Receiver Operating Curve (AUROC)

We evaluated four metrics of binary classification for each of lasso, adaptive lasso, elastic net and the proposed rank based logistic regression methods to compare the performance. These metrics are accuracy, sensitivity, specificity and AUROC.

Based on the notations in Table 1, we define

\begin{matrix} Accuracy = \frac{a + b}{a + b + c + d}, & Specificity = \frac{d}{n - m}, & Sensitivity = \frac{a}{m} \end{matrix}

as well as

\begin{matrix} TPR (true positive rate) = \frac{a}{k}, & FPR (false positive rate) = \frac{b}{n - m} . \end{matrix}

Table 1. Confusion table: a is the number of true positives, b the number of false positives, c the number of false negatives and d the number of true negatives.

The last metric AUROC is related to the probability that the classifier under consideration will rank a randomly selected positive case higher than a randomly selected negative case [19]. The values of all these fours metrics—accuracy, sensitivity, specificity and AUROC—range from 0 to 1. The value of 1 represents a perfect model whereas the value of 0.5 corresponds to “coin tossing”. The class prediction for each individual in binary classification is made based on a continuous random variable z. Given a threshold k as a tuning parameter, an individual is classified as “positive” if

z > k

and “negative”, otherwise. The random variable z follows a probability density

f_{1} (z)

if the individual belongs to “positives” and

f_{0} (z)

, otherwise. So, the true positive and true negative rates are given by

TPR (k) = \int_{k}^{\infty} f_{1} (z) d z and FPR (k) = \int_{k}^{\infty} f_{0} (z) d z, respectively .

Now, the AUROC statistic can be expressed as

A = \int_{0}^{1} TPR ({FPR}^{- 1} (z)) d z = \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} 1 {k^{'} > k} f_{1} (k^{'}) f_{0} (k) d k^{'} d k = P (z_{1} > z_{0}),

where

z_{1}

and

z_{0}

are the values of positive or negative instances, respectively.

2.6. Ranking and Best Model Selection

The penalty function in Equation (3) has two tuning parameters, namely,

α \in [0, 1]

and

λ > 0

. The "limiting” cases

α = 0

and

α = 1

correspond to ridge and lasso regression, respectively. For a fixed value of

α

, the model selects more variables for smaller

λ

’s and fewer variables for larger

λ

’s. Theoretically, the result continuosly depends on

α

and should not significantly change under small perturbations of the latter [12,13]. Empirically, however, we discovered that the results produced by pclogit significantly vary with

α

. In pclogit, the Laplacian matrix determines the group effects of predictors and is calculated from adjacency matrix via

L = D - A,

where

D

is the degree matrix and

A

is the adjacency matrix. The degree-scaled difference of predictors in Equation (3) is computed from the normalized Laplacian matrix

L = I - D^{- \frac{1}{2}} A D^{- \frac{1}{2}} .

We computed the adjacency matrix by using the information from the correlation matrix obtaining

A (u, v) = \{\begin{matrix} 1, & if u \neq v and | cor (u, v) | \geq ϵ, \\ 0, & if u = v or | cor (u, v) | < ϵ . \end{matrix}

Here,

ϵ \in (0, 1)

is a specific cut-off value for correlation. So,

ϵ

is another tuning parameter in our model which needs to be optimally selected. In summary, to find an optimal combination of parameters

α

and

ϵ

, we make the combination of tuning parameter

α

and

ϵ

, where the total number of combinations is given by

C = K \times L

with K and L being the number of

ϵ

and

α

values, respectively. We compared the performance for each of different combinations with T resamplings. The (negative) measure of performance for each combination is the misclassification or error rate. The pair

(α, ϵ)

producing the smallest misclassification rate is declared optimal and used in the next step. The sparse coefficient matrix with dimensions

m \times nlam

(

nlam =

number of

λ

’s) is used in pclogit (cf. [12,13]). By default,

nlam = 100

. We extracted all predictors with non-zero coefficients for each of

λ

values. Then we built 100 logistic regression models. Given estimated parameter values

β

, we have the estimated class probability for a predictor vector

x

at each of

λ

values.

π (x) = \frac{exp (x^{T} β)}{1 + exp (x^{T} β)}

Using the “naïve” Bayesian approach, we infer

y = 1

if

π \geq 0.5

and

y = 0

, otherwise. The values of accuracy, sensitivity, specificity and AUROC statistics are computed for each of 100 models and ranked in an increasing order by their values. Note that AUROC method does not use a fixed cut-off value, e.g.,

c = 0.5

, but rather describes the overall performance with all possible cut-off values in the decision rule. Let

R_{i j}

,

i = 1, 2, 3, 4

,

j = 1, 2, \dots, 100

, comprise the ranking matrix

R

. The first row, i.e.,

i = 1

, displays the ranking of models with respect to their accuracy. Similarly,

i = 2

ranks the models with respect to their sensitivity,

i = 3

, in terms of specificity and

i = 4

by AUROC. Suppose,

R_{1, 5} > R_{1, 8}

. Then in the 1st row (i.e., in terms of accuracy), model 5 outperforms model 8. We calculate the column means (

{\bar{R}}_{. j}

) of the

R

matrix. The column with the highest overall mean value of accuracy, sensitivity, specificity and AUROC will be chosen as the resulting optimal model. Note that there is a one-to-one correspondence between columns and the 100 competing models. In (the unlikely) case of two or more columns producing the same mean, the column with a smaller index j is selected since the model represented by such column is more parsimonious. Formally, suppose p and q,

p > q

, are two column indices in the

R

matrix. If

{\bar{R}}_{. p} = {\bar{R}}_{. q} = {max}_{r} {\bar{R}}_{. r}

, the q-th column will be selected and the associated model becomes our proposed rank-based penalized logistic regression model.

3. Results

3.1. Analysis of Simulated Data

We conducted extensive simulation studies to compare the performance in terms of accuracy, sensitivity, specificity and AUROC as well as the power of detecting true important variables by the proposed method with the performance of such three prominent regularized logistic regression methods as lasso, adaptive lasso and elastic net. We decided to focus on these (meanwhile) classical methods due to their popularity both in the literature and applications. Some of their very recently developed comptetitors such as [20] (R-package SelectiveInference) and [21] (R-package islasso) are currently gaining attention from the community and will be used as benchmarks in our future research.

Continuing with the description of our simulation study, all predictors

x

were generated from a multivariate normal distribution with the following probability density function

f (x) = {(\frac{1}{2 π})}^{\frac{m}{2}} \frac{1}{\sqrt{det (Σ)}} exp (- \frac{1}{2} {(x - μ)}^{T} Σ^{- 1} (x - μ))

with an m-dimensional mean vector

μ

and an

(m \times m)

-dimensional covariance matrix

Σ

. Writing out the covariance matrix

Σ = (σ_{i j}) = (\begin{matrix} σ_{11} & σ_{12} & σ_{13} & \dots & σ_{1 m} \\ σ_{21} & σ_{22} & σ_{23} & \dots & σ_{2 m} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ σ_{m 1} & σ_{m 2} & σ_{m 3} & \dots & σ_{m m} \end{matrix}) with σ_{i i} = σ_{i}^{2},

the correlation matrix

M

can be expressed as

M = (ρ_{i j}) = (\begin{matrix} ρ_{11} & ρ_{12} & ρ_{13} & \dots & ρ_{1 m} \\ ρ_{21} & ρ_{22} & ρ_{23} & \dots & ρ_{2 m} \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ ρ_{m 1} & ρ_{m 2} & ρ_{m 3} & \dots & ρ_{m m} \end{matrix}) with ρ_{i j} = \frac{σ_{i j}}{\sqrt{σ_{i i}^{2} σ_{j j}^{2}}} .

The binary response variable is generated using Bernoulli distribution with individual probability (

π

) defined as

π (x) = \frac{1}{1 + exp (- x^{T} β)},

x

is the matrix of true important variables and

β

is the associated preassigned regression coefficients. Next, we present the details of the three different simulation scenarios considered.

Under scenario 1, each of the simulated datasets has 200 observations and 1000 predictors. Here, for all $x$ vectors, we let $μ = 0$ and $Var (x_{j}) = 0.3$ . Pairwise correlation of $ρ = 0.4$ was applied to the first eight variables, while the remaining 992 variables were left uncorrelated. The $β$ -vector was chosen as

$β = (\underset{5 entries}{\underset{︸}{2, 2, 2, 2, 2}}, \underset{3 entries}{\underset{︸}{3, 3, 3}}, \underset{992 entries}{\underset{︸}{0, 0, 0, \dots, 0}}) .$

Each of the datasets was split into training and test sets with equal proportions.
The datasets under scenario 2 also have 200 observations and 1000 predictors. Again, $μ = 0$ and $Var (x_{j}) = 0.3$ . Now, the first five variables were assumed to have a correlation of $ρ = 0.4$ . The remaining 995 variables were independent. The $β$ -vector was selected as

$β = (\underset{15 entries}{\underset{︸}{2.0, 2.0, 2.0, 2.7, 2.0, 2.0, 2.5, 2.7, - 2.8, 3.0, 2.6, 3.0, 3.0, 3.0, 3.0}}, \underset{985 entries}{\underset{︸}{0, 0, 0, \dots, 0}}) .$

Each of the datasets was split into training and test sets with equal proportions.
Under the last scenario 3, each of the datasets has 150 observations and 1000 predictors. We let $μ = 0$ and $Var (x_{j}) = 0.4$ . The first five variables were assigned into a correlation value of $ρ = 0.3$ , while the variables with indices from 11 to 30 were chosen to have the correlation value of $ρ = 0.6$ . Outside of these two blocks, the variables were assumed uncorrelated. The $β$ -vector was chosen

$\begin{matrix} β = ( & \underset{15 entries}{\underset{︸}{2.0, 2.0, 2.0, 2.0, 2.0, 2.5, - 2.6, 2.7, 3.0, - 2.9, 2.0, 2.0, 2.0, 2.0, 2.0}}, \\ \underset{15 entries}{\underset{︸}{2.5, - 2.0, 2.7, 3.0, - 2.5, 2.0, 2.0, 2.0, 2.0, 2.0, 2.5, - 2.0, 2.7, 3.0, - 2.5}}, \underset{970 entries}{\underset{︸}{0, 0, 0, \dots, 0}}) . \end{matrix}$

The dataset was split into training and test sets with ratio of 70 to 30.

We compared the proposed rank-based penalized logistic regression method with lasso, adaptive lasso and elastic net methods from the glmnet R-package [11]. Algorithm 1 summarizes the procedure to calculate the average value of accuracy, sensitivity, specificity and AUROC based on a given number of iterations for each of the three simulation scenarios.

Algorithm 1 Calculation of overall mean and standard deviation on simulation studies

Step 1: Generate the data on each of the three simulation scenarios.
Step 2: Split the data into training and test sets randomly with the ratio of 70 to 30.
Step 3: Screen the variables using BH-FDR based on the training dataset.
Step 4: Plug the screened variables to each of the four methods.
Step 5: Calculate the values of Accuracy, Sensitivity, Specificity and AUROC for each of the methods.
Step 6: Repeats Step 1–5 to achieve a given number of replications.
Step 7: Calculate the means and standard deviations for each of the methods.

In Table 2, we compare the estimated mean and standard deviation of accuracy, sensitivity, specificity and AUROC values based on 200 iterations under correlation structure of

ρ = 0.4

in the simulation of scenario 1. The proposed rank-based penalized method shows the highest accuracy of 0.963 with the standard deviation of 0.02, sensitivity of 0.961 with standard deviation of 0.03, specificity of 0.965 with standard deviation of 0.03. In addition, it yields the same AUROC of 0.995 with standard deviation of 0.01 as elastic net and adaptive lasso.

Table 2. Comparison of the performance among the four methods over 200 replications under simulation scenario 1. The values in parentheseses are the standard deviations.

In Table 3, we compare estimated mean and standard deviation of accuracy, sensitivity, specificity and AUROC values using 200 iterations under correlation structure of

ρ = 0.4

in the simulation of scenario 2. The proposed rank-based method also shows highest accuracy of 0.831 with standard deviation 0.04, sensitivity of 0.833 with standard deviation of 0.06, specificity of 0.829 with standard deviation of 0.05. In addition, the proposed method produces AUROC of 0.913 with standard deviation of 0.03. This is the second highest value which is slightly lower than the AUROC value of the elastic net.

Table 3. Comparison of the performance among the four methods over 200 replications under simulation scenario 2. The values in parentheseses are the standard deviations.

In Table 4, we compare estimated mean and standard deviation of accuracy, sensitivity, specificity and AUROC values with 150 iterations under correlation structure of

ρ = 0.3

and

ρ = 0.6

in simulation of scenario 3. The proposed method shows highest accuracy of 0.916 with standard deviation of 0.04, sensitivity of 0.919 with standard deviation of 0.06, specificity of 0.912 with standard deviation of 0.06 and AUROC of 0.977 with standard deviation of 0.02.

Table 4. Comparison of the performance among the four methods over 150 replications under simulation scenario 3. The values in parentheseses are the standard deviations.

Furthermore, we compared the performance in terms of selecting the number of true important variables by each of the four methods under three different simulation scenarios. First, we performed multiple hypothesis testing with BH-FDR [15] to reduce the dimensionality of the data. After performing a screening step to retain the relevant variables, we used them as input for the proposed rank-based penalized method with the regularization step outlined in Section 2.3. We illustrate the performance of variable selection with boxplots in Figure 2, Figure 3 and Figure 4 for simulation scenarios 1, 2, and 3. Each figure displays two boxplots, which, in turn, depict the distribution of the number of variables selected (NVS) and the number of true important variables (NTIV) within the number of variables selected (NVS) with each of the four methods computed based on the given number of iterations in each of the three simulation scenarios.

Figure 2. Boxplots of total number of variables (NVS) selected and the number of true important variables (NTIV) within the number of variables selected with four different models under scenario 1 based on 200 replications.

Figure 3. Boxplots of total number of variables (NVS) selected and the number of true important variables (NTIV) within the number of variables selected with four different models on scenario 2 based on 200 replications.

Figure 4. Boxplots of total number of variables (NVS) selected and the number of true important variables (NTIV) within the number of variables selected with four different models under scenario 3 based on 150 replications.

Figure 2 reports that the proposed rank-based method has a slightly higher median number of variables selected (displayed as a thick line in the upper boxplots) than lasso, adaptive lasso and elastic net under scenario 1. The lower boxplots show that all four methods performed head-to-head for selection of true important variables under scenario 1 with 200 replications. Table 5 compares the mean and the standard deviation (in parentheseses) of the number of variables (NVS) selected and the number of true important variables (NTIV) in NVS for each of the four methods over 200 replications. The proposed rank-based method and elastic net performed head-to-head while slightly outperforming lasso and adaptive lasso.

Table 5. Estimated mean and standard deviation of number of variables (NVS) selected and the number of true important variables (NTIV) among NVS with four different models under simulated scenario 1 with 200 replications. The values in parentheseses are standard deviations.

Figure 3 suggests the proposed method has a marginally higher median number of variable selected compared to the other three methods in the upper boxplot. It is also clear that the proposed method has a slightly higher median number of true important variables in the lower boxplot on scenarios 2 computed with 200 replications. Table 6 confirms that the rank-based penalized method has the highest mean both for selecting the number of variables and important variables.

Table 6. Estimated mean and standard deviation of number of variables (NVS) selected and the number of true important variables (NTIV) among NVS in four different models under simulation scenario 2 with 200 replications. The values in parentheseses are standard deviations.

In Figure 4, the upper boxplot demonstrates that the proposed rank-based method has the highest median number of variables selected, elastic net has second highest median, lasso has third largest median and adaptive lasso has the smallest median under scenario 3 based on 150 replications. The lower boxplots also show that the proposed rank-based method has the highest median number of true important variables selected. However, adaptive lasso has a higher median number of true important variables than lasso unlike the upper boxplots. Thus, the proposed rank based-method clearly outperforms other three methods under high-correlation settings among variables.

Table 7 summarizes the number of variables selected and true important variables selected across the four methods under the high-correlation setting among variables computed from 150 replications. The proposed rank-based method has the highest mean number of overall variables selected and true important variables selected.

Table 7. Estimated mean and standard deviation of number of variables (NVS) selected and the number of true important variables (NTIV) among NVS in four different models under simulation scenario 3 with 150 replications. The values in parentheseses are standard deviations.

3.2. Real Data Example

We applied four logistic regression methods to select differentially expressed genes and assess their discrimination capability between colon cancer cases and healthy controls using high-dimensional gene expression data [22]. The colon cancer gene expression dataset is available at [23]. It contains 2000 genes with the highest minimal intensity across 62 tissues. The data were measured on 40 colon tumor samples and 22 normal colon tissue samples. We split the data set into training and testing sets with proportions 70% and 30%, respectively. To detect significantly differentially expressed genes for high-dimensional colon cancer carcinoma and measure classification prediction, we adapted two step procedures of filtering and variable selection. First, we applied BH-FDR [15] to select most promising candidates of genes as a preprocessing step and then used the screened genes as input to the proposed rank-based method and three other popular methods—lasso, adaptive lasso and elastic net. The performance in terms of accuracy, sensitivity, specificity and AUROC as well as the selection probabilities for the four methods are reported in Table 8 and Table 9, respectively.

Table 8. Estimated mean values and standard deviations for the four metrics across the four competing penalized logistic regression models computed from 100 resamplings. The values in parentheseses are standard deviations.

Table 9. List of top 5 ranked genes across rank-based, lasso, adaptive and elastic net. An extra asterix (*) sign is put next to a gene each time the gene is selected by one of four methods.

Algorithm 2 outlines above protocols the procedure of calculating the average values of accuracy, sensitivity, specificity and AUROC through 100 bootstrap iterations applied to the colon cancer gene expression data. In Table 8, the performance of all four metrics are computed based on 100 iterations of resampled subsets of individuals.

Algorithm 2 Calculation of mean with standard deviation on colon cancer data

Step 1: Split the data into training and test sets randomly with the ratio of 70 to 30.
Step 2: Screen genes with the BH-FDR method based on the training data.
Step 3: Plug the screened genes as the input to each of four methods.
Step 4: Calculate the values of Accuracy, Sensitivity, Specificity and AUROC across each of the methods on the test data.
Step 5: Repeat Steps 1 through 4 for 100 times.
Step 6: Calculate means and standard deviations for each of the methods.

The average AUROC of 0.853 with standard deviation of 0.06 in the proposed rank-based method has the highest value compared to other three methods. Also the accuracy of 0.853 with standard deviation of 0.08 are optimal among the four methods. The values of sensitivity (0.860) and specificity (0.840) are also better than those of the other three methods. In summary, it is fair to conclude that the proposed rank-based method outperforms the other three popular penalized logistic regression methods. Table 9 shows top 5 ranked genes with highest selection probabilities for the proposed rank-based method, lasso, adaptive lasso and elastic net. An expressed sequence tag (EST) of Hsa.1660 associated with colon cancer carcinoma is found by all four methods. Hsa.36689 [24,25] is shown and top ranked by the proposed method, lasso and elastic net. Hsa692 also appeared and is second ranked by the proposed method, lasso and elastic net. In addition, Hsa.37937 is shown and is third and second ranked by the proposed method and elastic net, respectively.

4. Discussion

In this paper, we proposed a new rank-based penalized logistic regression method to improve classification performance and the power of variable selection in high-dimensional data with strong correlation structure.

Our simulation studies demonstrated that the proposed method improves not only the performance of classification or class prediction but also the detection of true important variables under various correlation settings among features when compared to existing popular regularization methods such as lasso, adaptive lasso, and elastic net. As demonstrated by simulation studies, if the true important variables are not passed through the filtering method such as BH-FDR, their chance of being selected in the final model decreases signficantly, thus, leading to reduction in variable selection and classification performance. Therefore, effective filtering methods which are likely to retain as many most promising variables as possible are indispensable.

Applied to high-dimensional colon gene expression data, the proposed rank-based logistic regresson method with BH-FDR screening produced the highest average AUROC value of 0.917 with standard deviation of 0.06 and accuracy of 0.853 with standard deviation of 0.08 using 100 resampling steps. The proposed method produced a good balance between sensitivity and specificity in contrast to other methods. Elastic net demonstrated the second best peformance with an average AUROC value of 0.903 with standard deviation of 0.07. A probable reason is that elastic net accounts for group correlation effects. In addition, we compared top 5 ranked ESTs across the proposed method, lasso, adaptive lasso and elastic net [12]. They had a common EST of Hsa.1660 associated to colon cancer data. We also found that Hsa.36689 was both deemed important and top ranked by the proposed method, lasso and elastic net. This also applied to Hsa.692, which was deemed important and second top ranked by the proposed method and lasso, whereas it was only third-ranked by the elastic net. Hsa.37937 was detected by both the proposed method and the elastic net. Hence, the four ESTs mentioned appear to be promising candidate biomarkers associated with colon cancer carcinoma. The function of the genes corresponding to ESTs is summarized in Table 9.

5. Conclusions

In this study the proposed rank-based classifier demonstrated the superiority in not only classification prediction but also the power of detecting true important variables when compared to lasso, adaptive lasso, and elastic net through the extensive simulation studies. Besides, in the application of high-dimensional colon cancer gene expression data, the proposed classifier showed the best performance in terms of accuracy and AUROC among the four classifiers considered in the paper. As a future research, we would develop the methodology of variable selection and compare the performance with those of most recent competitors such as [20,21,26,27], etc.

Author Contributions

All authors have equally contributed to this work. All authors wrote, read, and approved the final manuscript.

Funding

This research received no external funding.

Acknowledgments

We would like to thank the reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

Houwelingen, H.C.V.; Bruinsma, T.; Hart, A.A.M.; Veer, L.J.V.; Wessels, L.F.A. Cross-validated Cox regression on microarray gene expression data. Stat. Med. 2006, 25, 3201–3216. [Google Scholar] [CrossRef] [PubMed]
Lofti, E.; Keshavarz, A. Gene expression microarray classification using PCA–BEL. Comput. Biol. Med. 2014, 54, 180–187. [Google Scholar]
Algamal, Z.Y.; Lee, M.H. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. Appl. 2015, 42, 9326–9332. [Google Scholar] [CrossRef]
Li, C.; Li, H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 2008, 24, 1175–1182. [Google Scholar] [CrossRef] [PubMed]
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1175–1182. [Google Scholar] [CrossRef]
Tibshirani, R.; Saunders, M. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. B 2005, 67, 91–108. [Google Scholar] [CrossRef]
Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
Meinshausen, N.; Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Stat. 2009, 37, 246–270. [Google Scholar] [CrossRef]
Huang, H.; Liu, X.Y.; Liang, Y. Feature selection and cancer classification via sparse logistic regression with the hybrid L_1/2+2 regularization. PLoS ONE 2009, 11, e0149675. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 2005, 67, 301–320. [Google Scholar] [CrossRef]
Sun, H.; Wang, S. Penalized logistic regression for high-dimensional DNA methylation data with case-control studies. Bioinformatics 2012, 28, 1368–1375. [Google Scholar] [CrossRef] [PubMed]
Sun, H.; Wang, S. Network-based regularization for matched case-control analysis of high-dimensional DNA methylation data. Stat. Med. 2012, 32, 2127–2139. [Google Scholar] [CrossRef] [PubMed]
Reiner, H.; Yekutieli, D.; Benjamin, Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003, 19, 368–375. [Google Scholar] [CrossRef]
Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B 1995, 57, 289–300. [Google Scholar] [CrossRef]
Li, C.; Li, H. Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. 2010, 4, 1498–1516. [Google Scholar] [CrossRef]
Friedman, J.; Hastie, T.; Hofling, H.; Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007, 1, 302–332. [Google Scholar] [CrossRef]
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef]
Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
Lee, J.D.; Sun, D.L.; Sun, Y.; Taylor, J.E. Exact post-selection inference, with application to the lasso. Ann. Stat. 2016, 44, 907–927. [Google Scholar] [CrossRef]
Cilluffo, G.; Sottile, G.; La Grutta, S.; Muggeo, V.M.R. The Induced Smoothed lasso: A practical framework for hypothesis testing in high dimensional regression. Stat. Methods Med. Res. 2019. [Google Scholar] [CrossRef] [PubMed]
Alon, U.; Barakai, N.; Notterman, D.A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A.J. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 1999, 96, 6745–6750. [Google Scholar] [CrossRef]
Available online: http://genomics-pubs.princeton.edu/oncology/affydata/index.html (accessed on 25 April 2019).
Ding, Y.; Wilkins, D. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003, 19, 2246–2253. [Google Scholar]
Li, Y.; Campbell, C.; Tipping, M. Bayesian automatic relevance determination algorithms for classfifying gene expression data. Bioinformatics 2002, 18, 1332–1339. [Google Scholar] [CrossRef]
Frost, H.R.; Amos, C.I. Gene set selection via LASSO penalized regression (SLPR). Nucleic Acids Res. 2017. [Google Scholar] [CrossRef]
Boulesteix, A.L.; De, B.R.; Jiang, X.; Fuchs, M. IPF-LASSO: Integrative L₁-Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data. Comput. Math. Methods Med. 2017. [Google Scholar] [CrossRef] [PubMed]

Figure 1. The ring network (left) and F.con network (right) are shown for the case there are two genes consisting of 6 and 9 CpG sites, respectively.

Figure 2. Boxplots of total number of variables (NVS) selected and the number of true important variables (NTIV) within the number of variables selected with four different models under scenario 1 based on 200 replications.

Figure 3. Boxplots of total number of variables (NVS) selected and the number of true important variables (NTIV) within the number of variables selected with four different models on scenario 2 based on 200 replications.

Figure 4. Boxplots of total number of variables (NVS) selected and the number of true important variables (NTIV) within the number of variables selected with four different models under scenario 3 based on 150 replications.

Table 1. Confusion table: a is the number of true positives, b the number of false positives, c the number of false negatives and d the number of true negatives.

Predicted Condition	True Condition
	Positive	Negative	Total
Positive	a	b	k
Nnegative	c	d	$n - k$
Total	m	$n - m$	$n = a + b + c + d$

Table 2. Comparison of the performance among the four methods over 200 replications under simulation scenario 1. The values in parentheseses are the standard deviations.

Method	Accuracy	Sensitivity	Specificity	AUROC
rank-based	0.963 (0.02)	0.961 (0.03)	0.965 (0.03)	0.995 (0.01)
lasso	0.953 (0.03)	0.952 (0.04)	0.955 (0.03)	0.993 (0.01)
alasso	0.957 (0.03)	0.955 (0.04)	0.960 (0.03)	0.995 (0.01)
enet	0.961 (0.02)	0.959 (0.04)	0.962 (0.03)	0.995 (0.01)

Table 3. Comparison of the performance among the four methods over 200 replications under simulation scenario 2. The values in parentheseses are the standard deviations.

Method	Accuracy	Sensitivity	Specificity	AUROC
rank-based	0.831 (0.04)	0.833 (0.03)	0.829 (0.05)	0.913 (0.03)
lasso	0.826 (0.05)	0.827 (0.07)	0.825 (0.09)	0.910 (0.04)
alasso	0.815 (0.04)	0.814 (0.07)	0.815 (0.07)	0.902 (0.04)
enet	0.826 (0.04)	0.828 (0.07)	0.825 (0.07)	0.915 (0.03)

Table 4. Comparison of the performance among the four methods over 150 replications under simulation scenario 3. The values in parentheseses are the standard deviations.

Method	Accuracy	Sensitivity	Specificity	AUROC
rank-based	0.916 (0.04)	0.919 (0.06)	0.912 (0.06)	0.977 (0.02)
lasso	0.888 (0.04)	0.898 (0.06)	0.880 (0.07)	0.963 (0.02)
alasso	0.866 (0.04)	0.877 (0.07)	0.855 (0.07)	0.949 (0.03)
enet	0.909 (0.04)	0.916 (0.06)	0.903 (0.06)	0.975 (0.02)

Table 5. Estimated mean and standard deviation of number of variables (NVS) selected and the number of true important variables (NTIV) among NVS with four different models under simulated scenario 1 with 200 replications. The values in parentheseses are standard deviations.

Method	NVS	NTIV
rank-based	10.465 (2.24)	7.975 (0.19)
lasso	9.885 (1.98)	7.880 (0.37)
alasso	9.475 (1.47)	7.970 (0.20)
enet	10.805 (2.37)	7.975 (0.16)

Table 6. Estimated mean and standard deviation of number of variables (NVS) selected and the number of true important variables (NTIV) among NVS in four different models under simulation scenario 2 with 200 replications. The values in parentheseses are standard deviations.

Method	NVS	NTIV
rank-based	13.675 (3.95)	9.345 (1.62)
lasso	12.750 (3.50)	8.905 (1.81)
alasso	11.965 (3.18)	8.720 (1.73)
enet	13.115 (3.92)	9.105 (1.73)

Table 7. Estimated mean and standard deviation of number of variables (NVS) selected and the number of true important variables (NTIV) among NVS in four different models under simulation scenario 3 with 150 replications. The values in parentheseses are standard deviations.

Method	NVS	NTIV
rank-based	37.830 (7.14)	20.770 (1.88)
lasso	22.010 (4.96)	11.780 (2.21)
alasso	16.430 (4.39)	11.920 (2.28)
enet	32.270 (8.55)	17.600 (3.15)

Table 8. Estimated mean values and standard deviations for the four metrics across the four competing penalized logistic regression models computed from 100 resamplings. The values in parentheseses are standard deviations.

Colon Cancer Data Analysis Based on 100 Times Resmpling
Method	Accuracy	Sensitivity	Specificity	AUROC
rank-based	0.853 (0.08)	0.860 (0.13)	0.840 (0.13)	0.917 (0.06)
lasso	0.801 (0.09)	0.911 (0.07)	0.637 (0.21)	0.897 (0.08)
adaptive lasso	0.804 (0.09)	0.869 (0.09)	0.719 (0.21)	0.877 (0.08)
elastic net	0.802 (0.09)	0.917 (0.07)	0.640 (0.22)	0.903 (0.07)

Table 9. List of top 5 ranked genes across rank-based, lasso, adaptive and elastic net. An extra asterix (*) sign is put next to a gene each time the gene is selected by one of four methods.

EST Name	Gene ID	Gene Description	Selection Probability
Rank-Based
***Hsa.36689	Z50753	H.sapiens mRNA for GCAP-II/uroguanylin precursor	1.00
**Hsa.692.2	M76378	Human cysteine-rich protein (CRP) gene, exons 5 and 6	0.99
*Hsa.37937	R87126	Myosin heavy chain, nonmuscle(Gallus gallus)	0.97
***Hsa.1660	H55916	Peptidyl-prolyl cis-trans isomerase, mitrochondrial precursor(human)	0.91
Hsa.1832	R44887	nedd5 protein (Mus musculus)	0.90
Lasso
***Hsa.36689	Z50753	H.sapiens mRNA for GCAP-II/uroguanylin precursor	0.87
Hsa.692.2	M76378	Human cysteine-rich protein (CRP) gene, exons 5 and 6	0.82
****Hsa.1660	H55916	Peptidyl-prolyl cis-trans isomerase, mitrochondrial precursor(human)	0.66
Hsa.6814	H08393	Collagen alpha 2(XI) chain(Homo sapiens)	0.52
Hsa.8147	M63391	Human desmin gene, complete cds	0.50
Adaptive Lasso
Hsa.1454	M82919	H. gamma amino butyric acid(GABAA)receptor beta3 subunit mRNA, cds	0.83
Hsa.6814	H08393	Collagen alpha 2(XI) chain(Homo sapiens)	0.77
***Hsa.1660	H55916	Peptidyl-prolyl cis-trans isomerase, mitrochondrial precursor(human)	0.77
Hsa.14069	T67077	Sodium/Potasssium-transporting atpase gamma chain(Ovis aries)	0.69
Hsa.2456	U25138	Human MaxiK potassium channel beta subunit mRNA, complete cds	0.55
Elastic Net
***Hsa.36689	Z50753	H.sapiens mRNA for GCAP-II/uroguanylin precursor	0.98
*Hsa.37937	R87126	Myosin heavy chain, nonmuscle(Gallus gallus)	0.94
**Hsa.692.2	M76378	Human cysteine-rich protein (CRP) gene, exons 5 and 6	0.94
Hsa.8147	M63391	Human desmin gene, complete cds	0.91
***Hsa.1660	H55916	Peptidyl-prolyl cis-trans isomerase, mitrochondrial precursor(human)	0.84

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

On the Performance of Variable Selection and Classification via Rank-Based Classifier

Abstract

1. Introduction

2. Materials and Methods

2.1. Adjusted p-Values: Benjamini and Hochberg False Discovery Rate (BH-FDR)

2.2. Regularized Logistic Regression

2.3. Computational Algorithm

2.4. Adaptive Link-Constrained Regularization

2.5. Accuracy, Sensitivity, Specificity and Area under the Receiver Operating Curve (AUROC)

2.6. Ranking and Best Model Selection

3. Results

3.1. Analysis of Simulated Data

3.2. Real Data Example

4. Discussion

5. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics