# Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Penalized Logistic Regression Method

#### 2.2. Variable Ranking with MMLR

Algorithm 1. Proposed two-step procedure

Step 1: Randomly sample 70% of the training samples without replacement.

Step 2: Count the selection frequency of each gene across the 100 models obtained from the λ values.

Step 3: Repeat Step 1 and Step 2 100 times.

Step 4: Calculate the selection probability of each variable based on Equation (10), and then rank the variables.

Step 5: Select the top $\lfloor n/\log(n) \rfloor$ genes with the highest frequency.

Step 6: Fit sparse logistic regression methods to the selected genes to build prognostic models.
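The resampling and ranking steps of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Equation (10) is not reproduced in this section, so the selection probability is assumed here to be the share of (resample, λ) fits in which a variable has a nonzero coefficient; a simple proximal-gradient L1 logistic solver stands in for whichever penalized fitter the paper uses; the function names `l1_logistic`, `selection_frequency`, and `top_ranked` are hypothetical; and the data are synthetic.

```python
import numpy as np

def l1_logistic(X, y, lam, n_iter=300):
    """L1-penalized logistic regression fitted by proximal gradient descent (ISTA)."""
    n, p = X.shape
    beta, intercept = np.zeros(p), 0.0
    # Step size 1/L, where L bounds the curvature of the mean logistic loss.
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        eta = np.clip(intercept + X @ beta, -30.0, 30.0)
        resid = 1.0 / (1.0 + np.exp(-eta)) - y
        intercept -= step * resid.mean()
        z = beta - step * (X.T @ resid) / n
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return beta

def selection_frequency(X, y, lambdas, n_resamples=100, frac=0.7, seed=0):
    """Steps 1-4: share of (resample, lambda) fits in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):                                 # Step 3
        idx = rng.choice(n, size=int(frac * n), replace=False)   # Step 1
        for lam in lambdas:                                      # Step 2
            counts += l1_logistic(X[idx], y[idx], lam) != 0
    return counts / (n_resamples * len(lambdas))                 # Step 4 (assumed form)

def top_ranked(freq, n):
    """Step 5: indices of the floor(n / log(n)) most frequently selected variables."""
    d = int(np.floor(n / np.log(n)))
    return np.argsort(-freq, kind="stable")[:d]

# Synthetic demonstration data (not the paper's): 3 informative variables out of 40.
rng = np.random.default_rng(1)
n, p = 80, 40
X = rng.standard_normal((n, p))
prob = 1.0 / (1.0 + np.exp(-2.0 * X[:, :3].sum(axis=1)))
y = (rng.random(n) < prob).astype(float)

# A small lambda grid below the value at which all coefficients vanish.
lam_max = np.max(np.abs(X.T @ (y - y.mean()))) / n
lambdas = lam_max * np.linspace(0.9, 0.1, 5)

freq = selection_frequency(X, y, lambdas, n_resamples=10)
keep = top_ranked(freq, n)
print(len(keep), keep[:5])
```

The variables indexed by `keep` would then be passed to the sparse logistic regression methods of Step 6. The demonstration uses 10 resamples and 5 λ values only to keep the run short; Algorithm 1 specifies 100 of each.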

#### 2.3. The Proposed Variable Ranking Method

#### 2.4. Metrics of Performance

## 3. Results

#### 3.1. Simulation Results

…^{nd} among 1000 variables, whereas the MMLR method was at 44^{th}. In the case of the higher correlation coefficients of 0.5 and 0.7, the proposed method was at 59^{th} and 62^{nd}, while MMLR was at 132^{nd} and 139^{th}.
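The ordinal positions reported above are ranks of truly associated variables among the 1000 candidates. A minimal sketch of how such ranks can be read off a vector of ranking scores is given below; the helper name `ranks_of_true` and the toy scores are illustrative assumptions, and how the paper summarizes the ranks across true variables and iterations is not recoverable from this section.

```python
import numpy as np

def ranks_of_true(scores, true_idx):
    """Rank (1 = best) of each true variable when variables are sorted by descending score."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(1, len(order) + 1)
    return rank[np.asarray(true_idx)]

# Toy scores for five variables; variables 1, 3, and 4 are treated as the true ones.
scores = np.array([0.10, 0.90, 0.40, 0.75, 0.05])
print(ranks_of_true(scores, [1, 3, 4]))  # [1 2 5]
```

A filter that pushes the true variables toward rank 1 leaves more of them inside the screening cutoff applied afterwards.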

#### 3.2. Real Data Analysis

## 4. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


**Figure 2.** Boxplots of the ranks of the true variables under the proposed filter method (PF) and the MMLR method for correlation coefficients 0.2, 0.5, and 0.7 over 100 iterations.

**Figure 3.** Comparison of the area under the receiver operating characteristic curve (AUROC) for SIS-LASSO, SIS-MCP, and SIS-SCAD after filtering with the proposed filter ranking method and with the MMLR method under three correlation settings.

**Figure 4.** Histogram and boxplot of the pairwise correlation coefficients between the expression levels of 2000 genes for the colon cancer and normal groups combined. The number of correlation coefficients is 1,999,000. The two plots show that the average pairwise correlation is 0.428 (median = 0.433) with a standard deviation of 0.203.
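The count of 1,999,000 in the Figure 4 caption is the number of distinct gene pairs, 2000 × 1999 / 2. A short sketch of how such a set of pairwise correlations can be collected is shown below, assuming Pearson correlation and a samples-by-genes matrix; the helper name `pairwise_correlations` is hypothetical and the matrix is synthetic, not the colon data.

```python
import numpy as np

def pairwise_correlations(expr):
    """Distinct off-diagonal Pearson correlations between the columns (genes) of expr."""
    corr = np.corrcoef(expr, rowvar=False)
    upper = np.triu_indices(corr.shape[0], k=1)  # each pair counted once
    return corr[upper]

rng = np.random.default_rng(0)
expr = rng.standard_normal((50, 200))  # synthetic stand-in: 50 samples, 200 genes
r = pairwise_correlations(expr)

print(r.size)             # 19900 pairs for 200 genes
print(2000 * 1999 // 2)   # 1999000 pairs for 2000 genes
print(r.mean(), np.median(r), r.std(ddof=1))
```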

**Figure 5.** Boxplots of expression levels in normal and colon cancer samples for eight genes selected by SIS-LASSO, SIS-MCP, and SIS-SCAD with the ranked data. Each boxplot reports the p-value of a two-sample t-test for the difference in mean expression between the two groups.

**Figure 6.** Boxplots of expression levels in normal and lung cancer samples for five genes selected by SIS-LASSO, SIS-MCP, and SIS-SCAD with the ranked data. Each boxplot reports the p-value of a two-sample t-test for the difference in mean expression between the two groups.

**Table 1.** Average number of true positives from the proposed PF and MMLR with SIS, and the significance level of the paired t-test for the mean difference in the number of true positives between the two methods, based on the true positives obtained over 100 iterations.

Filtering Method | Metric | Correlation Coefficient 0.2 | Correlation Coefficient 0.5 | Correlation Coefficient 0.7 |
---|---|---|---|---|
PF | Number of True Positives | 5.4 (0.765) | 4.21 (1.09) | 3.11 (1.09) |
MMLR | Number of True Positives | 4.52 (0.948) | 2.15 (1.26) | 0.29 (0.50) |
PF vs. MMLR | Two-sample t-test (p-value) | 1.204 × 10^{−11} | < 2.2 × 10^{−16} | < 2.2 × 10^{−16} |
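The comparison in Table 1 pairs the two filters iteration by iteration. A minimal sketch of the paired t statistic on the per-iteration differences follows; the helper name `paired_t` is hypothetical and the counts are made-up toy values, not the simulation output behind Table 1.

```python
import math
import numpy as np

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for testing mean(a - b) = 0."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    return t, n - 1

# Toy true-positive counts from five paired iterations (illustrative only).
tp_pf = [5, 6, 4, 5, 6]
tp_mmlr = [4, 4, 3, 5, 4]
t_stat, dof = paired_t(tp_pf, tp_mmlr)
print(round(t_stat, 4), dof)  # 3.2071 4
```

The p-value is the two-sided tail probability of a t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` returns the statistic and the p-value together.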

**Table 2.** Classification performance of the proposed filtering (PF) compared to the marginal maximum likelihood logistic regression estimator (MMLR) with SIS-LASSO, SIS-MCP, and SIS-SCAD over 100 iterations.

Correlation | Filtering | Method | Accuracy | G-mean | TP | FP | MS |
---|---|---|---|---|---|---|---|
0.2 | PF | SIS-LASSO | 0.856 (0.047) | 0.854 (0.049) | 5.25 (0.757) | 0.019 (0.002) | 24.55 (1.971) |
0.2 | PF | SIS-MCP | 0.878 (0.054) | 0.877 (0.056) | 5.03 (0.937) | 0.006 (0.003) | 11.3 (2.805) |
0.2 | PF | SIS-SCAD | 0.878 (0.053) | 0.876 (0.055) | 5.18 (0.757) | 0.012 (0.005) | 17.24 (5.053) |
0.2 | PF | average | 0.871 (0.051) | 0.869 (0.053) | 5.153 (0.817) | 0.012 (0.003) | 17.697 (3.276) |
0.2 | MMLR | SIS-LASSO | 0.847 (0.056) | 0.844 (0.06) | 4.3 (0.99) | 0.015 (0.002) | 18.73 (2.131) |
0.2 | MMLR | SIS-MCP | 0.86 (0.061) | 0.858 (0.063) | 4.21 (0.988) | 0.006 (0.003) | 10.32 (2.449) |
0.2 | MMLR | SIS-SCAD | 0.861 (0.059) | 0.858 (0.062) | 4.3 (0.99) | 0.011 (0.004) | 14.8 (3.649) |
0.2 | MMLR | average | 0.856 (0.059) | 0.853 (0.062) | 4.27 (0.989) | 0.011 (0.003) | 14.617 (2.743) |
0.5 | PF | SIS-LASSO | 0.886 (0.041) | 0.884 (0.042) | 3.65 (1.266) | 0.019 (0.003) | 22.71 (2.267) |
0.5 | PF | SIS-MCP | 0.869 (0.055) | 0.868 (0.057) | 2.93 (1.409) | 0.008 (0.003) | 10.87 (2.058) |
0.5 | PF | SIS-SCAD | 0.884 (0.048) | 0.883 (0.05) | 3.57 (1.257) | 0.017 (0.004) | 20.06 (3.92) |
0.5 | PF | average | 0.88 (0.048) | 0.878 (0.05) | 3.383 (1.311) | 0.015 (0.003) | 17.88 (2.748) |
0.5 | MMLR | SIS-LASSO | 0.865 (0.046) | 0.863 (0.047) | 1.84 (1.237) | 0.015 (0.003) | 17.02 (2.137) |
0.5 | MMLR | SIS-MCP | 0.858 (0.048) | 0.857 (0.048) | 1.66 (1.233) | 0.008 (0.002) | 9.89 (1.681) |
0.5 | MMLR | SIS-SCAD | 0.863 (0.047) | 0.861 (0.047) | 1.83 (1.28) | 0.014 (0.003) | 15.64 (2.873) |
0.5 | MMLR | average | 0.862 (0.047) | 0.86 (0.047) | 1.777 (1.25) | 0.012 (0.003) | 14.183 (2.23) |
0.7 | PF | SIS-LASSO | 0.911 (0.037) | 0.911 (0.038) | 2.74 (1.16) | 0.019 (0.003) | 21.14 (2.274) |
0.7 | PF | SIS-MCP | 0.899 (0.042) | 0.899 (0.043) | 1.82 (1.158) | 0.007 (0.002) | 8.88 (1.981) |
0.7 | PF | SIS-SCAD | 0.907 (0.038) | 0.907 (0.038) | 2.68 (1.171) | 0.016 (0.004) | 18.88 (3.699) |
0.7 | PF | average | 0.906 (0.039) | 0.906 (0.04) | 2.413 (1.163) | 0.014 (0.003) | 16.3 (2.651) |
0.7 | MMLR | SIS-LASSO | 0.887 (0.037) | 0.886 (0.037) | 0.26 (0.543) | 0.014 (0.002) | 13.72 (1.724) |
0.7 | MMLR | SIS-MCP | 0.881 (0.04) | 0.88 (0.041) | 0.21 (0.498) | 0.008 (0.002) | 7.75 (1.591) |
0.7 | MMLR | SIS-SCAD | 0.888 (0.036) | 0.888 (0.037) | 0.25 (0.52) | 0.013 (0.002) | 13.45 (2.285) |
0.7 | MMLR | average | 0.885 (0.038) | 0.885 (0.038) | 0.24 (0.52) | 0.012 (0.002) | 11.64 (1.867) |
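Two of the columns in Table 2, accuracy and G-mean, can be computed from predicted labels as sketched below. The sketch assumes the common definition of G-mean as the square root of sensitivity times specificity, since the body of Section 2.4 is not reproduced here; the helper name `accuracy_and_gmean` and the toy labels are illustrative.

```python
import numpy as np

def accuracy_and_gmean(y_true, y_pred):
    """Accuracy and G-mean = sqrt(sensitivity * specificity) for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / y_true.size
    sensitivity = tp / (tp + fn)  # assumes at least one positive sample
    specificity = tn / (tn + fp)  # assumes at least one negative sample
    return float(accuracy), float(np.sqrt(sensitivity * specificity))

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
acc, gmean = accuracy_and_gmean(y_true, y_pred)
print(round(acc, 3), round(gmean, 3))  # 0.7 0.707
```

Unlike accuracy, G-mean drops toward zero when either class is predicted poorly, which makes it informative when the two groups are unbalanced.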

**Table 3.** Classification performance of the proposed selection method with SIS-LASSO, SIS-MCP, and SIS-SCAD on both the colon and lung cancer data. Values are averages over 100 iterations.

Dataset | Method | Accuracy | AUROC | G-Mean | Model Size |
---|---|---|---|---|---|
Colon | SIS-LASSO | 0.803 (0.098) | 0.886 (0.077) | 0.745 (0.144) | 7.8 (1.47) |
Colon | SIS-MCP | 0.793 (0.097) | 0.864 (0.088) | 0.748 (0.132) | 4.14 (1.054) |
Colon | SIS-SCAD | 0.798 (0.096) | 0.874 (0.082) | 0.753 (0.13) | 6.73 (1.896) |
Lung | SIS-LASSO | 0.976 (0.017) | 0.998 (0.007) | 0.975 (0.019) | 9.53 (1.453) |
Lung | SIS-MCP | 0.952 (0.03) | 0.983 (0.017) | 0.95 (0.032) | 1.09 (0.288) |
Lung | SIS-SCAD | 0.975 (0.021) | 0.997 (0.006) | 0.973 (0.023) | 8.65 (2.222) |
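The AUROC column in Table 3 can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; the helper name `auroc` and the toy scores are illustrative, and the paper may have used a different routine.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(round(auroc(y_true, scores), 4))  # 0.7778
```

The pairwise form uses memory proportional to the number of positive-negative pairs, which is unproblematic at the sample sizes typical of gene expression studies.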

**Table 4.** Top 10 ranked genes with the highest selection frequency from the lists of ranked genes, using a 100-times resampling approach across the three methods SIS-LASSO, SIS-MCP, and SIS-SCAD on the colon cancer gene expression data.

Rank | SIS-LASSO (gene accession ID) | SIS-MCP (gene accession ID) | SIS-SCAD (gene accession ID) |
---|---|---|---|
1 | Hsa.36689 *** (G50753) | Hsa.36689 | Hsa.36689 |
2 | Hsa.692.2 *** (M76378) | Hsa.8147 | Hsa.692.2 |
3 | Hsa.6814 *** (H08393) | Hsa.6814 | Hsa.6814 |
4 | Hsa.1660 *** (H55916) | Hsa.1660 | Hsa.1660 |
5 | Hsa.8147 *** (M63391) | Hsa.692.2 | Hsa.33268 |
6 | Hsa.5392 *** (T62947) | Hsa.12241 ** (T64012) | Hsa.12241 |
7 | Hsa.37937 ** (R87126) | Hsa.33268 | Hsa.5392 |
8 | Hsa.33268 *** (R80427) | Hsa.5392 | Hsa.8147 |
9 | Hsa.3016 ** (T47377) | Hsa.8125 | Hsa.8125 |
10 | Hsa.8125 *** (T71025) | Hsa.37937 | Hsa.3016 |

**Table 5.** Top 10 ranked genes with the highest selection frequency from the lists of ranked genes, using a 100-times resampling approach across the three methods SIS-LASSO, SIS-MCP, and SIS-SCAD on the lung cancer gene expression data.

Rank | SIS-LASSO (gene accession ID) | SIS-MCP (gene accession ID) | SIS-SCAD (gene accession ID) |
---|---|---|---|
1 | 219597_s_at *** (DUOX1) | 209555_s_at | 219597_s_at |
2 | 205357_s_at ** | 209074_s_at | 205357_s_at |
3 | 209555_s_at *** (CD36) | 32625_at | 209555_s_at |
4 | 209875_s_at *** (SPP1) | 206209_s_at * | 209875_s_at |
5 | 203980_at ** | 204271_s_at * | 209074_s_at |
6 | 208982_at ** | 204396_s_at * | 219213_at |
7 | 209074_s_at *** (FAM107A) | 219213_at | 208982_at |
8 | 220170_at ** | 219597_s_at | 220170_at |
9 | 219213_at *** (JAM2) | 219719_at * | 209614_at * |
10 | 32625_at ** | 209875_s_at | 203980_at |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kim, S.; Kim, J.-M.
Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data. *Mathematics* **2019**, *7*, 493.
https://doi.org/10.3390/math7060493
