Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Structure and the Multiple Pathways
2.2. Evaluation Criteria for Binary Classification
2.3. SMOTE-Tomek Procedure for Imbalanced Data
2.4. The OGS Approach with Binary Logistic Regression for G-E Interactions
2.5. The Alternative Classification Methods
3. Results
3.1. Simulation Studies: Synthetic Imbalanced Dataset with Complex Gene Structure
3.2. Real Data Application: TCGA LUAD Data
3.3. Real Data Application: TCGA BRCA Data
3.4. Improvement in Predictive Capability for Real Data
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
BMI | Body mass index |
BP | Biological process |
BRCA | Breast invasive carcinoma |
CC | Cellular component |
CV | Cross-validation |
FN | Number of false negatives |
FP | Number of false positives |
G-E | Gene-environment |
G-G | Gene-gene |
GO | Gene ontology |
GWAS | Genome-wide association study |
IR | Imbalanced ratio |
KNNs | K-nearest neighbors |
LDA | Linear discriminant analysis |
LUAD | Lung adenocarcinoma |
MF | Molecular function |
MI | Mutual information |
ML | Machine learning |
MLRs | Multiple logistic regression models |
OGS | Overlapping group screening |
RFs | Random forests |
SKAT | Sequence kernel association test |
SMOTE | Synthetic minority oversampling technique |
SVMs | Support vector machines |
TCGA | The Cancer Genome Atlas |
TN | Number of true negatives |
TP | Number of true positives |
Appendix A. Latent Effect Approach
ML Method | R Package and Function | Hyperparameter | Procedure |
---|---|---|---|
SVM | “e1071”, svm(), tune() | Kernel: “radial” | given |
 | | cost: | CV |
 | | gamma: | CV |
RF | “randomForest”, randomForest(), train() | Kernel: “rectangular” | given |
 | | ntree: 1, 2, …, 500 | CV |
 | | mtry: 1, 2, …, 10 | CV |
KNN | “kknn”, kknn() | k: 1, 2, …, 50 | CV |
LDA | “MASS”, lda() | prior: 0.5 | given |
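The "CV" entries in the table indicate that a hyperparameter is chosen by cross-validation over the listed grid. The following is a minimal sketch of that idea for the KNN row, written in plain Python rather than with the R packages named above. The number of folds, the use of accuracy as the selection criterion, the unweighted majority vote, the tie-breaking rule (smallest k wins), and the toy data are all assumptions made for illustration, not details taken from the study.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Unweighted majority vote among the k training points closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean held-out accuracy of k-NN over n_folds folds (n_folds is assumed)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        tX = [X[i] for i in train]
        ty = [y[i] for i in train]
        hits = sum(knn_predict(tX, ty, X[i], k) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / n_folds


def tune_k(X, y, grid, n_folds=5, seed=0):
    """Return the first k in grid with the highest cross-validated accuracy."""
    return max(grid, key=lambda k: cv_accuracy(X, y, k, n_folds, seed))


# Hypothetical toy data: two well-separated clusters of 10 points each.
X = [[i * 0.1, 0.0] for i in range(10)] + [[5 + i * 0.1, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10

# The table lists k = 1, ..., 50; a shorter grid is used for this small sample.
best_k = tune_k(X, y, grid=range(1, 16))
print("selected k:", best_k)
```

The same loop applies to the other "CV" rows by swapping the classifier and the grid; rows marked "given" are fixed in advance and skip this search.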
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
60:40 | |||||
OGS_Ridge | 0.8358 | 0.8084 | 0.8194 | 0.7958 | 0.8456 |
OGS_Lasso | 0.8308 | 0.7959 | 0.8331 | 0.7961 | 0.8281 |
OGS_ALasso | 0.8636 | 0.8388 | 0.8572 | 0.8327 | 0.8665 |
OGS_SVM | 0.8248 | 0.7476 | 0.8437 | 0.7891 | 0.8163 |
OGS_LDA | 0.7985 | 0.6960 | 0.8709 | 0.7704 | 0.7551 |
OGS_KNN | 0.4941 | 0.4334 | 0.8440 | 0.5662 | 0.2678 |
OGS_RF | 0.6520 | 0.6837 | 0.2756 | 0.3700 | 0.8970 |
70:30 | |||||
OGS_Ridge | 0.7575 | 0.7057 | 0.7609 | 0.6717 | 0.7578 |
OGS_Lasso | 0.7480 | 0.7060 | 0.7525 | 0.6604 | 0.7473 |
OGS_ALasso | 0.7467 | 0.6689 | 0.8219 | 0.6867 | 0.7080 |
OGS_SVM | 0.7790 | 0.6752 | 0.7266 | 0.6867 | 0.8158 |
OGS_LDA | 0.6543 | 0.5016 | 0.7228 | 0.5879 | 0.6210 |
OGS_KNN | 0.4199 | 0.3650 | 0.9101 | 0.5173 | 0.1626 |
OGS_RF | 0.6531 | 0.5085 | 0.5771 | 0.5295 | 0.6932 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
80:20 | |||||
OGS_Ridge | 0.6640 | 0.9071 | 0.6687 | 0.7369 | 0.6529 |
OGS_Lasso | 0.6807 | 0.8994 | 0.6957 | 0.7544 | 0.6220 |
OGS_ALasso | 0.7858 | 0.8984 | 0.8425 | 0.8261 | 0.5490 |
OGS_SVM | 0.7397 | 0.8905 | 0.7767 | 0.8233 | 0.6125 |
OGS_LDA | 0.6090 | 0.8889 | 0.5791 | 0.7000 | 0.7195 |
OGS_KNN | 0.4314 | 0.8390 | 0.3641 | 0.4998 | 0.6842 |
OGS_RF | 0.7375 | 0.8044 | 0.8835 | 0.8397 | 0.2015 |
Location | Left-Lower | Left-Upper | Right-Lower | Right-Middle | Right-Upper | Other | NA |
---|---|---|---|---|---|---|---|
Number | 76 | 119 | 96 | 23 | 180 | 4 | 7 |
Location | left | left LIQ | left LOQ | left UIQ | left UOQ |
---|---|---|---|---|---|
Number | 189 | 29 | 40 | 83 | 230 |
Location | right | right LIQ | right LOQ | right UIQ | right UOQ |
---|---|---|---|---|---|
Number | 175 | 27 | 49 | 83 | 189 |
 | Positive (predicted) | Negative (predicted) |
---|---|---|
Positive (actual) | number of true positives (TP) | number of false negatives (FN) |
Negative (actual) | number of false positives (FP) | number of true negatives (TN) |
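The five criteria reported in the result tables (accuracy, precision, sensitivity, F1, specificity) can all be derived from the four cells of the confusion matrix above. The sketch below assumes the standard textbook definitions of these criteria; the function name, the zero-denominator fallback of 0.0, and the example counts are hypothetical and not taken from the study.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the five evaluation criteria from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Fallback to 0.0 when a denominator is zero (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "specificity": specificity,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, accuracy alone can look favorable while sensitivity or specificity is poor, which is why the tables report all five values side by side.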
Group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
Gene Size | 3 | 3 | 3 | 6 | 6 | 6 | 9 | 9 | 9 | 15 | 15 | 15 | 24 | 24 | 24 | 36 | 36 | 36 | 45 | 45 | 45 | 60 | 60 | 60 | 38 |
Overlapping | 1 | 1 | 0 | 2 | 2 | 0 | 3 | 3 | 0 | 5 | 5 | 0 | 8 | 8 | 0 | 12 | 12 | 0 | 15 | 15 | 0 | 20 | 20 | 0 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
60:40 | |||||
OGS_Ridge | 0.8867 (0.7819) | 0.8673 (0.7571) | 0.8571 (0.8338) | 0.8536 (0.7590) | 0.9043 (0.7456) |
OGS_Lasso | 0.8796 (0.7298) | 0.8436 (0.6772) | 0.8822 (0.8463) | 0.8515 (0.7187) | 0.8777 (0.6509) |
OGS_ALasso | 0.8695 (0.6957) | 0.8286 (0.6336) | 0.8864 (0.8466) | 0.8439 (0.6923) | 0.8581 (0.5945) |
OGS_SVM | 0.8827 (0.8184) | 0.8617 (0.8150) | 0.8281 (0.7070) | 0.8418 (0.7506) | 0.9167 (0.8928) |
OGS_LDA | 0.8737 (0.8265) | 0.8131 (0.7688) | 0.8732 (0.8074) | 0.8403 (0.7849) | 0.8738 (0.8393) |
OGS_KNN | 0.5929 (0.6554) | 0.4809 (0.7277) | 0.7226 (0.2254) | 0.5743 (0.3281) | 0.5109 (0.9405) |
OGS_RF | 0.7007 (0.6354) | 0.6764 (0.5547) | 0.4402 (0.5069) | 0.5248 (0.5187) | 0.8631 (0.7213) |
70:30 | |||||
OGS_Ridge | 0.8284 (0.6571) | 0.7940 (0.6052) | 0.7189 (0.7735) | 0.7157 (0.6109) | 0.8791 (0.6061) |
OGS_Lasso | 0.7641 (0.5780) | 0.6726 (0.5126) | 0.8473 (0.8140) | 0.7069 (0.5563) | 0.7255 (0.4746) |
OGS_ALasso | 0.7515 (0.5476) | 0.6790 (0.4868) | 0.8529 (0.8175) | 0.7102 (0.5337) | 0.7061 (0.4291) |
OGS_SVM | 0.8753 (0.8206) | 0.8531 (0.7831) | 0.7383 (0.5909) | 0.7849 (0.6659) | 0.9382 (0.9244) |
OGS_LDA | 0.8329 (0.8093) | 0.7080 (0.6607) | 0.8104 (0.7970) | 0.7537 (0.7190) | 0.8424 (0.8150) |
OGS_KNN | 0.5178 (0.7120) | 0.3858 (0.7475) | 0.8449 (0.1199) | 0.5253 (0.2089) | 0.3682 (0.9784) |
OGS_RF | 0.7273 (0.6512) | 0.6300 (0.4357) | 0.4375 (0.3801) | 0.4836 (0.3949) | 0.8587 (0.7723) |
Group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 1 |
Gene Size | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 26 | 23 |
Overlapping | 3 | 3 | 0 | 3 | 3 | 0 | 3 | 3 | 0 | 3 | 3 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 3 | 3 | 0 | 0 | 3 | 0 | 3 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
80:20 | |||||
OGS_Ridge | 0.7996 (0.5055) | 0.8870 (0.9554) | 0.8550 (0.4089) | 0.8386 (0.6226) | 0.6102 (0.8496) |
OGS_Lasso | 0.8013 (0.4993) | 0.8799 (0.9357) | 0.8668 (0.4140) | 0.8765 (0.5815) | 0.5779 (0.8036) |
OGS_ALasso | 0.8149 (0.5104) | 0.8943 (0.9202) | 0.8746 (0.4411) | 0.8535 (0.5914) | 0.6194 (0.7580) |
OGS_SVM | 0.7842 (0.8220) | 0.9055 (0.8473) | 0.8101 (0.9407) | 0.8505 (0.8905) | 0.7055 (0.4121) |
OGS_LDA | 0.6572 (0.8011) | 0.9076 (0.9095) | 0.6146 (0.8259) | 0.7307 (0.8648) | 0.7998 (0.7149) |
OGS_KNN | 0.4094 (0.7755) | 0.8024 (0.7771) | 0.3189 (0.9964) | 0.4532 (0.8726) | 0.6876 (0.0137) |
OGS_RF | 0.7343 (0.7275) | 0.7832 (0.8012) | 0.9052 (0.8638) | 0.8371 (0.8288) | 0.1990 (0.2544) |
IR | With SMOTE-Tomek: Gene | With SMOTE-Tomek: G-E Interaction | Without SMOTE-Tomek: Gene | Without SMOTE-Tomek: G-E Interaction |
---|---|---|---|---|
Original coefficients | ||||
60:40 | 0.8467 | 0.4667 | 0.8443 | 0.0046 |
70:30 | 0.8410 | 0.4489 | 0.8456 | 0 |
Weaker coefficients (all original coefficients divided by 2) | ||||
60:40 | 0.8527 | 0.4390 | 0.8417 | 0.0033 |
70:30 | 0.8406 | 0.4216 | 0.8441 | 0 |
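The table above contrasts results obtained with and without the SMOTE-Tomek process. As a rough illustration of what that resampling step does, the sketch below oversamples the minority class by interpolating between a minority sample and one of its nearest minority neighbors, then removes Tomek links (pairs of opposite-class points that are each other's nearest neighbor). It is plain Python, not the software used in the study. Balancing to equal class sizes, removing both members of each link, Euclidean distance, the value of k, a two-class label vector with at least two minority samples, and the toy data are all assumptions.

```python
import math
import random


def nearest(idx, points, candidates, k=1):
    """Indices of the k candidates closest to points[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority=1, k=3, seed=0):
    """Add synthetic minority points until the two classes have equal size."""
    rng = random.Random(seed)
    X, y = [list(row) for row in X], list(y)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = (len(y) - len(min_idx)) - len(min_idx)  # binary labels assumed
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)                      # original minority sample
        j = rng.choice(nearest(i, X, min_idx, k))    # one of its minority neighbors
        gap = rng.random()
        X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y.append(minority)
    return X, y


def tomek_links(X, y):
    """Pairs (i, j) of opposite classes that are mutual nearest neighbors."""
    everyone = range(len(X))
    nn = [nearest(i, X, everyone)[0] for i in everyone]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def smote_tomek(X, y, minority=1, k=3, seed=0):
    """SMOTE followed by a single pass of Tomek-link removal."""
    X_new, y_new = smote(X, y, minority, k, seed)
    drop = {i for pair in tomek_links(X_new, y_new) for i in pair}
    keep = [i for i in range(len(X_new)) if i not in drop]
    return [X_new[i] for i in keep], [y_new[i] for i in keep]


# Hypothetical toy data: 6 majority (label 0) and 3 minority (label 1) samples.
X = [[0.0, 0.0], [0.5, 0.2], [1.0, 0.1], [0.2, 0.9], [0.8, 0.8], [0.4, 0.5],
     [3.0, 3.0], [3.2, 2.7], [2.8, 3.3]]
y = [0] * 6 + [1] * 3

Xb, yb = smote_tomek(X, y)
print("class counts after resampling:", yb.count(0), yb.count(1))
```

Resampling of this kind is normally applied to the training portion only, so that the test data keep their original class proportions.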
Factor | Coding | Missing Status | Continuous (C)/Discrete (D) |
---|---|---|---|
Number of pack years smoked | | Yes | C |
Race | white = 1, Asian = 2, black or African American = 3 | Yes | D |
Gender | female = 0, male = 1 | No | D |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.7426 | 1.0000 | 0.7063 | 0.8278 | 1.0000 |
OGS_Lasso | 0.6832 | 1.0000 | 0.6482 | 0.7865 | 1.0000 |
OGS_ALasso | 0.6782 | 1.0000 | 0.6389 | 0.7797 | 1.0000 |
OGS_SVM | 0.8762 | 1.0000 | 0.8716 | 0.9262 | 1.0000 |
OGS_LDA | 0.6436 | 1.0000 | 0.6022 | 0.7517 | 1.0000 |
OGS_KNN | 0.8663 | 1.0000 | 0.8533 | 0.9199 | 1.0000 |
OGS_RF | 0.8861 | 1.0000 | 0.8785 | 0.9301 | 1.0000 |
GO_CC | |||||
OGS_Ridge | 0.7277 | 1.0000 | 0.6945 | 0.8197 | 1.0000 |
OGS_Lasso | 0.6832 | 1.0000 | 0.6424 | 0.7822 | 1.0000 |
OGS_ALasso | 0.6881 | 1.0000 | 0.6480 | 0.7864 | 1.0000 |
OGS_SVM | 0.8465 | 0.9939 | 0.8694 | 0.9109 | 0.9667 |
OGS_LDA | 0.6634 | 1.0000 | 0.6250 | 0.7692 | 1.0000 |
OGS_KNN | 0.8366 | 1.0000 | 0.8162 | 0.8988 | 1.0000 |
OGS_RF | 0.8762 | 1.0000 | 0.8641 | 0.9271 | 1.0000 |
GO_MF | |||||
OGS_Ridge | 0.7624 | 1.0000 | 0.7303 | 0.8441 | 1.0000 |
OGS_Lasso | 0.7376 | 1.0000 | 0.7017 | 0.8247 | 1.0000 |
OGS_ALasso | 0.7475 | 1.0000 | 0.7102 | 0.8306 | 1.0000 |
OGS_SVM | 0.8663 | 1.0000 | 0.8827 | 0.9211 | 1.0000 |
OGS_LDA | 0.7673 | 1.0000 | 0.7310 | 0.8446 | 1.0000 |
OGS_KNN | 0.8713 | 1.0000 | 0.8540 | 0.9212 | 1.0000 |
OGS_RF | 0.9059 | 1.0000 | 0.8959 | 0.9440 | 1.0000 |
Gene | Number Pack Years Smoked | Race | Gender |
---|---|---|---|
GPT2 | 0.9891 | 1.4863 | 1.2152 |
IDH2 | 0.9993 | 1.0542 | 1.0906 |
L2HGDH | 1.0143 | 1.0690 | 1.0884 |
Variable | Coding | Missing Status | Continuous (C)/Discrete (D) |
---|---|---|---|
Age at initial pathologic diagnosis (years) | | No | C |
Race | white = 1, Asian = 2, black or African American = 3 | Yes | D |
Gender | female = 0, male = 1 | No | D |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.7384 | 0.9996 | 0.7090 | 0.8290 | 0.9977 |
OGS_Lasso | 0.6909 | 1.0000 | 0.6559 | 0.7916 | 1.0000 |
OGS_ALasso | 0.6915 | 1.0000 | 0.6566 | 0.7920 | 1.0000 |
OGS_SVM | 0.8626 | 0.9710 | 0.8743 | 0.9194 | 0.7619 |
OGS_LDA | 0.6691 | 0.9999 | 0.6319 | 0.7711 | 0.9990 |
OGS_KNN | 0.8383 | 0.9928 | 0.8227 | 0.8912 | 0.9780 |
OGS_RF | 0.8811 | 0.9975 | 0.8699 | 0.9280 | 0.9813 |
GO_CC | |||||
OGS_Ridge | 0.7023 | 0.9997 | 0.6686 | 0.8002 | 0.9983 |
OGS_Lasso | 0.6932 | 0.9999 | 0.6581 | 0.7925 | 0.9996 |
OGS_ALasso | 0.7028 | 0.9999 | 0.6689 | 0.8008 | 0.9997 |
OGS_SVM | 0.8471 | 0.9666 | 0.8609 | 0.9097 | 0.7318 |
OGS_LDA | 0.7721 | 0.9990 | 0.7468 | 0.8527 | 0.9936 |
OGS_KNN | 0.7604 | 0.9874 | 0.7359 | 0.8042 | 0.9809 |
OGS_RF | 0.8213 | 0.9989 | 0.8020 | 0.8878 | 0.9923 |
GO_MF | |||||
OGS_Ridge | 0.7350 | 1.0000 | 0.7048 | 0.8256 | 1.0000 |
OGS_Lasso | 0.7166 | 1.0000 | 0.6842 | 0.8119 | 1.0000 |
OGS_ALasso | 0.7227 | 1.0000 | 0.6910 | 0.8168 | 1.0000 |
OGS_SVM | 0.8541 | 0.9605 | 0.8744 | 0.9147 | 0.6803 |
OGS_LDA | 0.7568 | 0.9999 | 0.7291 | 0.8419 | 0.9991 |
OGS_KNN | 0.7886 | 0.9887 | 0.7672 | 0.8305 | 0.9783 |
OGS_RF | 0.8456 | 0.9993 | 0.8286 | 0.9048 | 0.9959 |
Gene | Age at Initial Pathologic Diagnosis (Years) | Race | Gender |
---|---|---|---|
SPRY2 | 1.0246 | 0.7620 | 0.9996 |
CES1 | 1.0456 | 1.6377 | 1.0135 |
CTHRC1 | 1.0022 | 0.8993 | 1.0054 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.8243 | 1.0000 | 0.8008 | 0.8886 | 1.0000 |
OGS_Lasso | 0.7866 | 1.0000 | 0.7579 | 0.8606 | 1.0000 |
OGS_ALasso | 0.7650 | 1.0000 | 0.7333 | 0.8451 | 1.0000 |
OGS_SVM | 0.9716 | 0.9803 | 0.9877 | 0.9838 | 0.8750 |
OGS_LDA | 0.9738 | 0.9963 | 0.9739 | 0.9849 | 0.9779 |
OGS_KNN | 0.5223 | 0.5701 | 0.4715 | 0.4753 | 0.8577 |
OGS_RF | 0.9767 | 0.9862 | 0.9885 | 0.9872 | 0.9117 |
GO_CC | |||||
OGS_Ridge | 0.8020 | 1.0000 | 0.7727 | 0.8718 | 1.0000 |
OGS_Lasso | 0.7475 | 1.0000 | 0.7182 | 0.8360 | 1.0000 |
OGS_ALasso | 0.7327 | 1.0000 | 0.6966 | 0.8212 | 1.0000 |
OGS_SVM | 0.9802 | 0.9889 | 0.9888 | 0.9886 | 0.9129 |
OGS_LDA | 0.9703 | 1.0000 | 0.9674 | 0.9828 | 1.0000 |
OGS_KNN | 0.9604 | 0.9888 | 0.9625 | 0.9775 | 0.9045 |
OGS_RF | 0.9802 | 0.9889 | 0.9890 | 0.9889 | 0.9167 |
GO_MF | |||||
OGS_Ridge | 0.8091 | 1.0000 | 0.7848 | 0.8788 | 1.0000 |
OGS_Lasso | 0.7653 | 1.0000 | 0.7354 | 0.8459 | 1.0000 |
OGS_ALasso | 0.7423 | 1.0000 | 0.7094 | 0.8288 | 1.0000 |
OGS_SVM | 0.9752 | 0.9842 | 0.9881 | 0.9860 | 0.8911 |
OGS_LDA | 0.9684 | 0.9964 | 0.9681 | 0.9819 | 0.9760 |
OGS_KNN | 0.4974 | 0.5468 | 0.4491 | 0.4523 | 0.8466 |
OGS_RF | 0.9657 | 0.9854 | 0.9756 | 0.9782 | 0.8983 |
Method | Accuracy | Precision | Sensitivity | F1 | Specificity |
---|---|---|---|---|---|
GO_BP | |||||
OGS_Ridge | 0.8119 | 1.0000 | 0.7908 | 0.8829 | 1.0000 |
OGS_Lasso | 0.7552 | 1.0000 | 0.7277 | 0.8418 | 1.0000 |
OGS_ALasso | 0.7586 | 1.0000 | 0.7316 | 0.8446 | 1.0000 |
OGS_SVM | 0.9799 | 0.9867 | 0.9910 | 0.9888 | 0.8869 |
OGS_LDA | 0.9728 | 0.9992 | 0.9706 | 0.9846 | 0.9947 |
OGS_KNN | 0.6156 | 0.7004 | 0.5798 | 0.5835 | 0.9093 |
OGS_RF | 0.9832 | 0.9927 | 0.9887 | 0.9906 | 0.9400 |
GO_CC | |||||
OGS_Ridge | 0.8173 | 1.0000 | 0.7971 | 0.8868 | 1.0000 |
OGS_Lasso | 0.8162 | 0.9991 | 0.7967 | 0.8842 | 0.9928 |
OGS_ALasso | 0.7494 | 1.0000 | 0.7214 | 0.8365 | 1.0000 |
OGS_SVM | 0.9812 | 0.9858 | 0.9935 | 0.9896 | 0.8737 |
OGS_LDA | 0.9773 | 0.9993 | 0.9755 | 0.9872 | 0.9949 |
OGS_KNN | 0.4933 | 0.5406 | 0.4460 | 0.4479 | 0.8864 |
OGS_RF | 0.9856 | 0.9922 | 0.9918 | 0.9920 | 0.9351 |
GO_MF | |||||
OGS_Ridge | 0.8125 | 1.0000 | 0.7912 | 0.8831 | 1.0000 |
OGS_Lasso | 0.7476 | 0.9999 | 0.7189 | 0.8357 | 0.9993 |
OGS_ALasso | 0.7489 | 1.0000 | 0.7203 | 0.8368 | 1.0000 |
OGS_SVM | 0.9826 | 0.9895 | 0.9911 | 0.9903 | 0.9135 |
OGS_LDA | 0.9793 | 0.9995 | 0.9776 | 0.9884 | 0.9960 |
OGS_KNN | 0.4127 | 0.4819 | 0.3567 | 0.3593 | 0.8880 |
OGS_RF | 0.9842 | 0.9930 | 0.9894 | 0.9912 | 0.9438 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wang, J.-H.; Liu, C.-Y.; Min, Y.-R.; Wu, Z.-H.; Hou, P.-L. Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data. Mathematics 2024, 12, 2209. https://doi.org/10.3390/math12142209