Extreme Gradient Boosting Tuned with Metaheuristic Algorithms for Predicting Myeloid NGS Onco-Somatic Variant Pathogenicity
Abstract
1. Introduction
2. Materials and Methods
- Variant Allele Frequency ≥ 5%;
- Amino acid change is non-synonymous (≠ p.(=)). A synonymous variant will probably have little influence on the protein because the amino acid does not change;
- Grantham score lies in the range [5; 215] (in the case of a substitution variant): [0; 50] conservative, [51; 100] moderately conservative, [101; 150] moderately radical, and above 150 radical;
- Manual inspection in different databases and prediction tools: VarSeak, Varsome, UMD Predictor, Cancer Genome Interpreter (CGI), and OncoKB;
- We use the Integrative Genomics Viewer (IGV) to check that the aligned reads are clean and show no strand bias in the region where the variant is located, which allows us to eliminate false positives;
- We verify the presence of these pathogenic variants in our second in-house pipeline to validate them.
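The Grantham-score bands listed above translate directly into a small helper. A minimal Python sketch (the function name is ours, not part of the study's pipeline):

```python
def grantham_category(score: int) -> str:
    """Map a Grantham distance (5-215) to the conservativeness bands above."""
    if score <= 50:
        return "conservative"
    elif score <= 100:
        return "moderately conservative"
    elif score <= 150:
        return "moderately radical"
    else:
        return "radical"
```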
2.1. Programming Language
2.2. Data Selection
2.3. Problem Formulation
2.4. Data Encoding (Feature Construction)
2.5. Correlation
2.6. Principal Component Analysis (PCA)
2.7. Metaheuristic Algorithms
2.8. Defining the Fitness Function
2.9. Differential Evolution (DE)
- Creating a new agent, called the “trial agent”, by adding a weighted difference between two parents to a third parent.
- If the trial agent is better than the corresponding parent, then it is selected as the new agent in the population.
2.10. Applying DE to XGBoost
1. Initialize a population of N candidate solutions x_1, x_2, …, x_N, where each x_i is a set of parameter values.
2. Set the crossover rate (CR) and the scaling factor F.
3. Repeat the following steps until the stopping criterion is met:
   (a) For each candidate solution x_i, randomly select three other candidate solutions x_j, x_k, and x_l from the population, such that j, k, l ≠ i.
   (b) Generate a new candidate solution v_i by applying the DE mutation operator to x_j, x_k, and x_l. The mutation operator adds a scaled difference between x_k and x_l to x_j, according to the formula v_i = x_j + F · (x_k − x_l), where F is the scaling factor.
   (c) Apply the DE crossover operator to combine v_i and x_i into a trial solution u_i. The crossover operator copies each parameter value from v_i into the corresponding position of u_i with probability CR, and otherwise keeps the value from x_i.
   (d) Evaluate the fitness (i.e., accuracy) of u_i and x_i.
   (e) If the fitness of u_i is better than the fitness of x_i, replace x_i with u_i in the population.
   (f) If the fitness of u_i is worse than the fitness of x_i, keep x_i in the population.
   (g) If the fitness of u_i is equal to the fitness of x_i, choose one of them at random to keep in the population.
   (h) Repeat steps (a) to (g) for all candidate solutions in the population.
   (i) Update the best solution found so far based on the candidate solutions in the population.
- Mutation factor = 0.5.
- Crossover probability = 0.9.
- Strategy = 1.
- Population size = 10.
- Number of iterations = 10.
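The loop above, with the reported settings (F = 0.5, CR = 0.9, population size 10, 10 iterations), can be sketched in Python. This is a generic DE/rand/1/bin implementation, not the study's code: the reported "strategy = 1" may correspond to a different DE variant, the fitness callable stands in for the cross-validated XGBoost accuracy, and all names are ours:

```python
import random

def differential_evolution(fitness, bounds, pop_size=10, F=0.5, CR=0.9,
                           iters=10, seed=0):
    """DE/rand/1/bin over a box-constrained search space (maximizing fitness)."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, d: min(max(v, bounds[d][0]), bounds[d][1])
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [fitness(x) for x in pop]
    for _ in range(iters):
        for i in range(pop_size):
            # step (a): pick three distinct partners j, k, l != i
            j, k, l = rng.sample([n for n in range(pop_size) if n != i], 3)
            # step (b): mutant v_i = x_j + F * (x_k - x_l), clipped to bounds
            mutant = [clip(pop[j][d] + F * (pop[k][d] - pop[l][d]), d)
                      for d in range(dim)]
            # step (c): binomial crossover; one gene is forced from the mutant
            cross = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < CR or d == cross) else pop[i][d]
                     for d in range(dim)]
            # steps (d)-(g): replace the parent if the trial is no worse
            tf = fitness(trial)
            if tf >= fit[i]:
                pop[i], fit[i] = trial, tf
    best = max(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]
```

For example, maximizing the toy fitness `lambda x: -(x[0] - 3) ** 2` over [−10, 10] converges to x ≈ 3; in the study, each fitness evaluation would instead train and score an XGBoost model.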
2.11. Genetic Algorithm (GA)
2.12. Applying GA to XGBoost
1. Initialize a population of candidate sets of XGBoost parameter values at random within the search bounds.
2. Evaluate the fitness (i.e., accuracy) of each individual by training an XGBoost model with its parameter values.
3. Select the parents of the next generation: the best individuals are retained (elitism), and a small fraction of the remaining individuals is admitted at random (random selection) to preserve diversity.
4. Apply the crossover operator to pairs of parents to produce offspring that combine parameter values from both parents.
5. Apply the mutation operator to the offspring: each parameter value is perturbed with a probability given by the mutation rate.
6. Form the new population from the retained elites and the offspring.
7. Repeat steps 2 to 6 until the maximum number of generations is reached, and return the best set of parameter values found.
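Using the GA settings reported in the parameters table (population size 10, 10 generations, elitism 0.3, random selection 0.1, mutation rate 0.5), the loop can be sketched in Python. This is a minimal real-coded GA, not the study's implementation; the fitness callable stands in for cross-validated XGBoost accuracy, and all names are ours:

```python
import random

def genetic_algorithm(fitness, bounds, pop_size=10, generations=10,
                      elitism=0.3, random_selection=0.1, mutation_rate=0.5,
                      seed=0):
    """Simple real-coded GA (maximizing fitness) with elitism and random selection."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)          # best individuals first
        n_elite = max(1, int(elitism * pop_size))
        parents = pop[:n_elite] + [p for p in pop[n_elite:]
                                   if rng.random() < random_selection]
        children = pop[:n_elite]                     # elites survive unchanged
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            # uniform crossover: each gene comes from either parent
            child = [a[d] if rng.random() < 0.5 else b[d] for d in range(dim)]
            # per-gene Gaussian mutation, clipped to the search bounds
            for d in range(dim):
                if rng.random() < mutation_rate:
                    lo, hi = bounds[d]
                    child[d] = min(max(child[d] + rng.gauss(0, 0.1 * (hi - lo)), lo), hi)
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return best, fitness(best)
```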
2.13. Particle Swarm Optimization (PSO)
2.14. PSO Applied to XGBoost
1. Initialize the swarm of particles with random positions and velocities.
2. Evaluate the accuracy of each particle by training an XGBoost model with the corresponding parameter values and computing the accuracy on the training dataset.
3. Update the best position found by each particle so far (personal best) and the best position found by the swarm so far (global best).
4. Update the velocity and position of each particle based on its current velocity, its own personal best position, and the global best position found by the swarm.
5. Repeat steps 2 to 4 until a stopping criterion is met (the maximum number of iterations is reached).
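The steps above follow the standard global-best PSO update. A minimal Python sketch (inertia and acceleration coefficients are generic textbook values, not the study's settings; the fitness callable stands in for XGBoost accuracy):

```python
import random

def pso(fitness, bounds, n_particles=10, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over a box-constrained search space (maximizing fitness)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                       # personal bests
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]      # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # velocity: inertia + cognitive (pbest) + social (gbest) pulls
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d],
                                    bounds[d][0]), bounds[d][1])
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit
```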
2.15. Simulated Annealing (SA)
2.16. Applying SA to XGBoost
1. Initialize the current set of parameter values x_current.
2. Set an initial temperature T and a cooling schedule that reduces the temperature over time.
3. Repeat the following steps until the stopping criterion is met:
   (a) Generate a new set of parameter values x_new by perturbing the current set of values.
   (b) Compute the change in accuracy Δ = accuracy(x_new) − accuracy(x_current).
   (c) If Δ ≥ 0, accept the new set of parameter values (i.e., set x_current = x_new).
   (d) If Δ < 0, accept the new set of parameter values with probability exp(Δ/T), determined by the temperature T and the magnitude of Δ.
   (e) Update the temperature according to the cooling schedule.
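The loop above, with the Metropolis acceptance rule exp(Δ/T) for worsening moves, can be sketched in Python. The geometric cooling schedule, step size, and all names are our assumptions for illustration; in the study each fitness evaluation would be an XGBoost training run:

```python
import math
import random

def simulated_annealing(fitness, x0, step=0.5, T0=1.0, cooling=0.95,
                        iters=200, seed=0):
    """SA maximizing fitness; worse moves accepted with probability exp(delta/T)."""
    rng = random.Random(seed)
    x, fx = list(x0), fitness(x0)
    best, fbest = x[:], fx                    # track the best point seen
    T = T0
    for _ in range(iters):
        # step (a): perturb the current solution
        cand = [v + rng.gauss(0, step) for v in x]
        fc = fitness(cand)
        delta = fc - fx                       # step (b)
        # steps (c)-(d): always accept improvements; accept worse with exp(delta/T)
        if delta >= 0 or rng.random() < math.exp(delta / T):
            x, fx = cand, fc
            if fx > fbest:
                best, fbest = x[:], fx
        T *= cooling                          # step (e): geometric cooling
    return best, fbest
```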
3. Results
3.1. Model Performance
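The performance metrics reported throughout this section (accuracy, precision, specificity, sensitivity, F1-score, MCC) all derive from the confusion-matrix counts. A minimal Python sketch (function name is ours; values are returned as fractions, while the tables report percentages):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity,
            "f1": f1, "mcc": mcc}
```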
3.2. Performance on the Training Set
- nrounds: The number of boosting iterations to perform. A boosting iteration trains a new weak learner to improve the overall performance of the model.
- eta: The learning rate or step size for shrinking the contribution of each tree in the model. A smaller eta value will result in more rounds of boosting required to achieve the same level of accuracy, but each boosting iteration will have a smaller impact on the final model.
- max_depth: The maximum depth of each tree in the model. Deeper trees capture more complex relationships in the data but can also lead to overfitting if they are too deep.
- gamma: The minimum reduction in the loss function required to split a node in the tree. This parameter controls the trade-off between the complexity of the model and overfitting. A smaller value allows the model to split nodes more frequently, increasing the model’s complexity and potential for overfitting.
- colsample_bytree: The fraction of columns (features) to be used in each split. This parameter can be used to prevent overfitting by reducing the impact of noisy features.
- min_child_weight: The minimum number of samples required in a leaf node. This parameter can be used to control overfitting by increasing the minimum number of samples required in each leaf node.
- subsample: The fraction of the training set to use for each boosting iteration. A smaller value can make the model more robust to noise and prevent overfitting but may also result in a slower training time.
1. nrounds = [100, 600].
2. eta = [−5, −3] (searched as the exponent x in 10^x).
3. max_depth = [2, 6].
4. gamma = [0, 1].
5. colsample_bytree = [0.4, 1].
6. min_child_weight = [1, 3].
7. subsample = [0.5, 1].
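These bounds define the search space the metaheuristics explore. A sketch of drawing one candidate parameter set within them (dictionary layout and function name are ours; eta is sampled as an exponent of 10, matching the bounds above):

```python
import random

# Search bounds from the list above; eta is searched as an exponent of 10.
BOUNDS = {
    "nrounds":          (100, 600),
    "log10_eta":        (-5, -3),
    "max_depth":        (2, 6),
    "gamma":            (0.0, 1.0),
    "colsample_bytree": (0.4, 1.0),
    "min_child_weight": (1, 3),
    "subsample":        (0.5, 1.0),
}

def random_candidate(rng=random):
    """Draw one XGBoost parameter set uniformly within the search bounds."""
    p = {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}
    p["nrounds"] = int(round(p["nrounds"]))      # integer-valued parameters
    p["max_depth"] = int(round(p["max_depth"]))
    p["eta"] = 10 ** p.pop("log10_eta")          # decode the exponent
    return p
```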
3.3. Performance on the Test Set
3.3.1. Performance of DE on the Test Set
3.3.2. Performance of PSO on the Test Set
3.3.3. Performance of GA on the Test Set
3.3.4. Performance of SA on the Test Set
3.4. Feature Importance
3.5. Performance on a New Dataset
3.6. Test XGBoost Model Tuned with DE on Solid Tumor Hotspots
3.7. Related Work
- Random forest:
  - Strengths:
    - High accuracy of 98.50% indicates that the model performs well in classifying the data.
    - High precision of 97.06% suggests that the positive predictions (class 1) are mostly accurate.
    - High specificity of 97.01% indicates that the model has a low false positive rate.
  - Weaknesses:
    - Although the sensitivity is 1.00 (indicating perfect detection of positive cases), the lower precision and specificity imply a higher false positive rate than the best XGBoost models.
- XGBoost (DE):
  - Strengths:
    - Very high accuracy of 99.35% suggests excellent overall performance.
    - High precision of 98.70% indicates accurate positive predictions.
    - High specificity of 98.71% suggests a low false positive rate.
    - Sensitivity of 1.00 indicates perfect detection of positive cases.
  - Weaknesses:
    - No specific weaknesses identified.
- XGBoost (PSO):
  - Strengths:
    - High accuracy of 99.04% indicates good overall performance.
    - High precision of 98.10% suggests accurate positive predictions.
    - High specificity of 98.11% suggests a low false positive rate.
    - Sensitivity of 99.98% indicates excellent detection of positive cases.
  - Weaknesses:
    - No specific weaknesses identified.
- XGBoost (GA):
  - Strengths:
    - Although the accuracy is lower at 92.51%, it still indicates reasonable overall performance.
    - High precision of 90.01% suggests accurate positive predictions.
    - High specificity of 90.02% suggests a low false positive rate.
    - Sensitivity of 95.03% indicates good detection of positive cases.
  - Weaknesses:
    - Lower accuracy and sensitivity compared with the other models.
- XGBoost (SA):
  - Strengths:
    - High accuracy of 99.13% indicates excellent overall performance.
    - High precision of 98.28% suggests accurate positive predictions.
    - High specificity of 98.28% suggests a low false positive rate.
    - Sensitivity of 99.98% indicates excellent detection of positive cases.
  - Weaknesses:
    - No specific weaknesses identified.
4. State-of-the-Art Comparison
5. Discussion
6. Limitations
6.1. Dataset Limitations
6.2. Feature Selection
6.3. Domain-Specific Interpretation
6.4. Computational Resources
6.5. External Validation
6.6. Error Sequencing and Implementation Variations
7. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Pellegrino, E.; Jacques, C.; Beaufils, N.; Nanni, I.; Carlioz, A.; Metellus, P.; Ouafik, L.H. Machine learning random forest for predicting oncosomatic variant NGS analysis. Sci. Rep. 2021, 11, 21820.
- Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T. Xgboost: Extreme Gradient Boosting. Package Version 0.4-1.4. 2015. Available online: https://xgboost.ai/ (accessed on 15 May 2023).
- Patel, J.P.; Gönen, M.; Figueroa, M.E.; Fernandez, H.; Sun, Z.; Racevskis, J.; Van Vlierberghe, P.; Dolgalev, I.; Thomas, S.; Aminova, O.; et al. Prognostic relevance of integrated genetic profiling in acute myeloid leukemia. N. Engl. J. Med. 2012, 366, 1079–1089.
- Papaemmanuil, E.; Gerstung, M.; Bullinger, L.; Gaidzik, V.I.; Paschka, P.; Roberts, N.D.; Potter, N.E.; Heuser, M.; Thol, F.; Bolli, N.; et al. Genomic classification and prognosis in acute myeloid leukemia. N. Engl. J. Med. 2016, 374, 2209–2221.
- Marcucci, G.; Maharry, K.; Wu, Y.Z.; Radmacher, M.D.; Mrózek, K.; Margeson, D.; Holland, K.B.; Whitman, S.P.; Becker, H.; Schwind, S.; et al. IDH1 and IDH2 gene mutations identify novel molecular subsets within de novo cytogenetically normal acute myeloid leukemia: A Cancer and Leukemia Group B study. J. Clin. Oncol. 2010, 28, 2348–2355.
- Itzykson, R.; Kosmider, O.; Cluzeau, T.; Mas, M.D.; Dreyfus, F.; Beyne-Rauzy, O.; Quesnel, B.; Vey, N.; Gelsi-Boyer, V.; Raynaud, S.; et al. Impact of TET2 mutations on response rate to azacitidine in myelodysplastic syndromes and low blast count acute myeloid leukemias. Leukemia 2011, 25, 1147–1152.
- Bejar, R.; Lord, A.; Stevenson, K.; Bar-Natan, M.; Pérez-Ladaga, A.; Zaneveld, J.; Wang, H.; Caughey, B.; Stojanov, P.; Getz, G.; et al. TET2 mutations predict response to hypomethylating agents in myelodysplastic syndrome patients. Blood 2014, 124, 2705–2712.
- Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Science+Business Media: Berlin/Heidelberg, Germany, 2009; Chapter 7.
- Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: Berlin/Heidelberg, Germany, 2013; Chapter 7.
- Bengio, Y.; Grandvalet, Y. No unbiased estimator of the variance of k-fold cross-validation. J. Mach. Learn. Res. 2004, 5, 1089–1105.
- Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Int. Jt. Conf. Artif. Intell. 1995, 14, 1137–1143.
- Pellegrino, E.; Brunet, T.; Pissier, C.; Camilla, C.; Abbou, N.; Beaufils, N.; Nanni-Metellus, I.; Métellus, P.; Ouafik, L.H. Deep Learning Architecture Optimization with Metaheuristic Algorithms for Predicting BRCA1/BRCA2 Pathogenicity NGS Analysis. BioMedInformatics 2022, 2, 244–267.
- Dagan, T.; Talmor, Y.; Graur, D. Ratios of Radical to Conservative Amino Acid Replacement are Affected by Mutational and Compositional Factors and May Not Be Indicative of Positive Darwinian Selection. Mol. Biol. Evol. 2002, 19, 1022–1025.
- Alberts, B.; Johnson, A.; Lewis, J.; Raff, M.; Roberts, K.; Walter, P. Molecular Biology of the Cell, 6th ed.; Garland Science: New York, NY, USA, 2014.
- Richardson, J.S.; Richardson, D.C. Natural beta-sheet proteins use negative design to avoid edge-to-edge aggregation. Proc. Natl. Acad. Sci. USA 2002, 99, 2754–2759.
- Grantham, R. Amino acid difference formula to help explain protein evolution. Science 1974, 185, 862–864.
- Jolliffe, I.T. Principal Component Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2002.
- Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
- Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
- Ringnér, M. What is principal component analysis? Nat. Biotechnol. 2008, 26, 303–304.
- Yang, X.S. Nature-Inspired Optimization Algorithms; Elsevier: Amsterdam, The Netherlands, 2014.
- Blum, C.; Roli, A. Metaheuristics in combinatorial optimization: Overview and conceptual comparison. ACM Comput. Surv. 2003, 35, 268–308.
- Talbi, E.G. Metaheuristics: From Design to Implementation; John Wiley & Sons: Hoboken, NJ, USA, 2009.
- Gandomi, A.H.; Yang, X.S.; Talatahari, S.; Alavi, A.H. Metaheuristic algorithms in modeling and optimization. In Metaheuristic Applications in Structures and Infrastructures; Elsevier: Amsterdam, The Netherlands, 2013; Volume 1.
- Storn, R.; Price, K. Differential Evolution-a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 1997, 11, 341–359.
- Holland, J.H. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence; MIT Press: Cambridge, MA, USA, 1992.
- Reeves, C.R. Modern Heuristics; Springer: New York, NY, USA, 1995.
- Whitley, L.D. A Genetic Algorithm tutorial. Stat. Comput. 1994, 4, 65–85.
- Mitchell, M. An Introduction to Genetic Algorithms; MIT Press: Cambridge, MA, USA, 1998.
- Kennedy, J.; Eberhart, R. Particle Swarm Optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995.
- Adyatama, A. RPubs-Introduction to Particle Swarm Optimization. Available online: https://rpubs.com/argaadya/intro-pso (accessed on 12 October 2021).
- Leung, S.C.; Zhang, D.; Zhou, C.; Wu, T. A hybrid Simulated Annealing metaheuristic algorithm for the two-dimensional knapsack packing problem. Comput. Oper. Res. 2012, 39, 64–73.
- Dowsland, K.A.; Thompson, J. Simulated Annealing. In Handbook of Natural Computing; Springer: Berlin/Heidelberg, Germany, 2012; pp. 1623–1655.
- Bertsimas, D.; Tsitsiklis, J. Simulated Annealing. Stat. Sci. 1993, 8, 10–15.
- Geng, X.; Chen, Z.; Yang, W.; Shi, D.; Zhao, K. Solving the traveling salesman problem based on an adaptive Simulated Annealing algorithm with greedy search. Appl. Soft Comput. 2011, 11, 3680–3689.
- Wang, Z.; Geng, X.; Shao, Z. An effective Simulated Annealing algorithm for solving the traveling salesman problem. J. Comput. Theor. Nanosci. 2009, 6, 1680–1686.
- McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Med. 2012, 22, 276–282.
- Vainchenker, W.; Kralovics, R. Genetic basis of myeloproliferative neoplasms. Blood 2017, 129, 2377–2386.
- Harel, M.; Kryger, G.; Rosenblatt, D.; Sussman, J.L. The effect of Thr to Met mutations on protein structure and function. Proteins Struct. Funct. Bioinform. 2004, 56, 85–93.
- Kumar, M.D.; Bava, K.A.; Gromiha, M.M.; Prabakaran, P. Structural consequences of amino acid substitutions in protein structures. J. Biomol. Struct. Dyn. 2009, 26, 721–734.
- Kircher, M.; Witten, D.M.; Jain, P.; O’Roak, B.J.; Cooper, G.M.; Shendure, J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014, 46, 310–315.
- Yu, J.; Du, Y.; Jalil, A.; Ahmed, Z.; Mori, S.; Patel, R.; Varela, J.C.; Chang, C.C. Mutational profiling of myeloid neoplasms associated genes may aid the diagnosis of acute myeloid leukemia with myelodysplasia-related changes. Leuk. Res. 2021, 110, 106701.
- Li, Y.; Song, Y.; Ma, F. XGBoost Prediction of Infection of Leukemia Patients with Fever of Unknown Origin. In Proceedings of the 7th International Conference on Biomedical Signal and Image Processing, Suzhou, China, 19–21 August 2022; pp. 85–89.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181.
- Birattari, M. A review of metaheuristics for optimization of analytic models. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2010, 42, 1090–1102.
Row Names | Data Description | Data Coding | Data Type |
---|---|---|---|
chr | Number of the chromosome where the mutation is located | No encoding | Integer |
POS | Position in the chromosome where the mutation is located | No encoding | Big Integer |
TypeBin | Type of mutation: Single Nucleotide Variant (SNV), Multiple Nucleotide Variants (MNV), insertion/deletion (INDEL) | SNV = 0, MNV = 1, INDEL = 2 | Integer |
Exon | Exon number: 1 for the first exon in the gene, 2 for the second one, …, n | exon = {1, …, n}; exon = 0 when the variant is intronic | Integer |
Freq | Variant Allele Frequency (frequency of the mutation) | No encoding | Float |
MAFbin | Minor Allele Frequency | MAF value if it exists, −1 when there is no MAF | Float |
Coverage | Sequencing coverage (0 if <300 reads, 1 if >300 reads) | No encoding | Big integer |
Protbin | Mutation effect on the protein (amino acid change) | Intronic or splice site (p.?) = 0, same amino acid (p.(=)) = 1, amino acid change = 2 | Integer |
aarefbin | Amino acid (before the mutation) | Arg = 1, His = 2, Lys = 3, Asp = 4, Glu = 5, Ser = 6, Thr = 7, Asn = 8, Gln = 9, Trp = 10, Sec = 11, Gly = 12, Pro = 13, Ala = 14, Val = 15, Ile = 16, Leu = 17, Met = 18, Phe = 19, Tyr = 20, Cys = 21 | Integer (more detail in Table 2) |
aamutbin | Amino acid (after the mutation) | Arg = 1, His = 2, Lys = 3, Asp = 4, Glu = 5, Ser = 6, Thr = 7, Asn = 8, Gln = 9, Trp = 10, Sec = 11, Gly = 12, Pro = 13, Ala = 14, Val = 15, Ile = 16, Leu = 17, Met = 18, Phe = 19, Tyr = 20, Cys = 21, fs = 22, Ter = 22, del = 22 | Integer (more detail in Table 2) |
aarefChemical | Charge of the amino acid side chains (before the mutation) | See Table 3 or Figure 2 | Float |
aamutChemicalVal | Charge of the amino acid side chains (after the mutation) | See Table 3 or Figure 2 | Float |
Grantham | Grantham score of the mutation | Grantham value from 5 to 215; −1 when it is not applicable | Grantham |
varEffectBin | Effect of the mutation on the reading frame (frameshift, missense, …) | Frameshift = 3, missense = 1, nonsense = 2, synonymous = 0, unknown = −1 | Variant.effect |
isMut | Potential pathogenic variant: decision of the biologist on the variant | Benign/uncertain significance = 0, pathogenic = 1 | Boolean |
Chemical Properties | Amino Acids |
---|---|
Acidic | Aspartic (Asp), Glutamic (Glu) |
Aliphatic | Alanine (Ala), Glycine (Gly), Isoleucine (Ile), Leucine (Leu), Valine (Val) |
Amide | Asparagine (Asn), Glutamine (Gln) |
Aromatic | Phenylalanine (Phe), Tryptophan (Trp), Tyrosine (Tyr) |
Basic | Arginine (Arg), Histidine (His), Lysine (Lys) |
Hydroxyl | Serine (Ser), Threonine (Thr) |
Imino | Proline (Pro) |
Sulfur | Cysteine (Cys), Methionine (Met) |
Group | 1 Letter Code | 3 Letters Code | Encoded Value |
---|---|---|---|
Apolar | A | Ala | 1.1 |
Apolar | F | Phe | 1.2 |
Apolar | I | Ile | 1.3 |
Apolar | L | Leu | 1.4 |
Apolar | M | Met | 1.5 |
Apolar | P | Pro | 1.6 |
Apolar | V | Val | 1.7 |
Apolar | W | Trp | 1.8 |
Apolar | G | Gly | 1.9 |
Uncharged | C | Cys | 2.1 |
Uncharged | N | Asn | 2.3 |
Uncharged | Q | Gln | 2.4 |
Uncharged | S | Ser | 2.5 |
Uncharged | T | Thr | 2.6 |
Uncharged | Y | Tyr | 2.7 |
Negative charged | D | Asp | 3.1 |
Negative charged | E | Glu | 3.2 |
Positive charged | H | His | 4.1 |
Positive charged | K | Lys | 4.2 |
Positive charged | R | Arg | 4.3 |
NA | NA | Ter | 0.0 |
NA | NA | dup | 0.0 |
NA | NA | del | 0.0 |
NA | NA | p. ? | 0.0 |
NA | NA | p.(=) | 0.0 |
NA | NA | fs | 0.0 |
Parameters | Value |
---|---|
Elitism | 0.3 |
Population size | 10 |
Random selection | 0.1 |
Mutation rate | 0.5 |
Number of generations | 10 |
Value of Kappa | Level of Agreement | % of Data That Are Reliable |
---|---|---|
0–0.20 | None | 0–4 |
0.21–0.39 | Minimal | 4–15 |
0.40–0.59 | Weak | 15–35 |
0.60–0.79 | Moderate | 35–63 |
0.80–0.90 | Strong | 64–81 |
Above 0.90 | Almost perfect | 82–100 |
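Cohen's kappa, the agreement statistic interpreted in the table above, can be computed from a binary confusion matrix as follows (a sketch; the function name is ours):

```python
def cohens_kappa(tp, fp, tn, fn):
    """Cohen's kappa for a binary confusion matrix."""
    n = tp + fp + tn + fn
    po = (tp + tn) / n                       # observed agreement
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    pe = p_yes + p_no                        # agreement expected by chance
    return (po - pe) / (1 - pe)
```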
Method | Accuracy | Kappa | Nrounds | Eta (10^x) | Md | Gamma | Cols | Mw | Sub | Runtime |
---|---|---|---|---|---|---|---|---|---|---|
DE | 99.22 | 98.70 | 186 | −0.36 | 4.71 | 0.25 | 0.73 | 1.49 | 0.54 | 6.64 h |
PSO | 99.04 | 98.08 | 149 | −1 | 6 | 1 | 1 | 1 | 1 | 10.21 min |
GA | 93.33 | 84.63 | 120 | −2.75 | 2.49 | 0.19 | 0.66 | 2.04 | 0.78 | 2.36 h |
SA | 99.04 | 98.08 | 175 | −1.04 | 5.67 | 0.08 | 0.60 | 2.36 | 0.56 | 5.30 h |
Method | Accuracy | Precision | Specificity | Sensitivity | MCC | AUC | F1-Score | Err |
---|---|---|---|---|---|---|---|---|
DE | 99.35 | 98.70 | 98.71 | 1 | 98.70 | 99.78 | 99.35 | 0.65 |
PSO | 99.04 | 98.10 | 98.11 | 99.98 | 98.08 | 99.79 | 99.03 | 0.96 |
GA | 92.51 | 90.01 | 90.02 | 95.03 | 85.13 | 97.90 | 92.65 | 7.49 |
SA | 99.13 | 98.28 | 98.28 | 99.98 | 98.26 | 99.80 | 99.12 | 0.87 |
XGBoost Probability Score | Description |
---|---|
<0.001 | Benign |
0.001 to 0.049 | Likely benign |
0.05 to 0.949 | Uncertain |
0.95 to 0.99 | Likely pathogenic |
>0.99 | Pathogenic |
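The interpretation bands above map directly onto the model's probability output; a minimal sketch (the function name is ours):

```python
def interpret_score(p: float) -> str:
    """Map an XGBoost probability output to the interpretation bands above."""
    if p < 0.001:
        return "Benign"
    elif p < 0.05:
        return "Likely benign"
    elif p < 0.95:
        return "Uncertain"
    elif p <= 0.99:
        return "Likely pathogenic"
    else:
        return "Pathogenic"
```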
Genes | MAF | Amino Acid | Coding | isMut | Xgboost DE | UMD Predictor | Varsome | CGI |
---|---|---|---|---|---|---|---|---|
DNMT3A | 0.224 | p.Leu422= | c.1266G>A | 0 | Benign | Polymorphism | Benign | Passenger |
DNMT3A | | p.Arg366His | c.1097G>A | 1 | Likely pathogenic | Pathogenic | Uncertain | Driver |
JAK2 | | p.Val617Phe | c.1849G>T | 1 | Pathogenic | Polymorphism | Pathogenic | Passenger |
TET2 | | p.Ser1518Ter | c.4553C>A | 1 | Pathogenic | Pathogenic | Uncertain | Driver |
CEBPA | | p.Gly104del | c.295_297del | 0 | Benign | NA | Uncertain | Passenger |
GATA2 | 0.233 | p.Ala164Thr | c.490G>A | 0 | Benign | Polymorphism | Benign | Passenger |
CEBPA | | p.Ala259Thr | c.775G>A | 0 | Likely benign | Probable polymorphism | Benign | Passenger |
ASXL1 | 0.02 | p.Val751Ile | c.2251G>A | 0 | Likely benign | Polymorphism | Benign | Passenger |
CALR | | p.Lys385Thr | c.1154A>C | 0 | Pathogenic | Probably pathogenic | Benign | Passenger |
CALR | | p.Lys385IlefsTer | c.1154delinsTTTGTC | 1 | Pathogenic | NA | NA | Driver |
Genes | Amino Acid | Coding | XGBoost DE | MLRF |
---|---|---|---|---|
KRAS | p.Gly12Cys | c.34G>T | Pathogenic | NO |
KRAS | p.Gln61Lys | c.180_181delinsCA | Pathogenic | YES |
KRAS | p.Ala146Pro | c.436G>C | Pathogenic | YES |
BRAF | p.Val600Glu | c.1799T>A | Likely pathogenic | YES |
BRAF | p.Asp594Asn | c.1780G>A | Likely pathogenic | YES |
BRAF | p.Gly469Val | c.1406G>T | Likely pathogenic | YES |
EGFR | p.Glu709Lys | c.2125G>A | Likely pathogenic | YES |
EGFR | p.Thr790Met | c.2369C>T | Likely benign | YES |
EGFR | p.Leu747_Ala750delinsPro | c.2239_2248delinsC | Pathogenic | YES |
PIK3CA | p.His1047Arg | c.3140A>G | Pathogenic | YES |
EGFR | p.Thr725Met | c.2174C>T | Likely benign | YES |
Method | Accuracy | Precision | Specificity | Sensitivity |
---|---|---|---|---|
Random forest | 98.50 | 97.06 | 97.01 | 1 |
XGBoost (DE) | 99.35 | 98.70 | 98.71 | 1.00 |
XGBoost (PSO) | 99.04 | 98.10 | 98.11 | 99.98 |
XGBoost (GA) | 92.51 | 90.01 | 90.02 | 95.03 |
XGBoost (SA) | 99.13 | 98.28 | 98.28 | 99.98 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pellegrino, E.; Camilla, C.; Abbou, N.; Beaufils, N.; Pissier, C.; Gabert, J.; Nanni-Metellus, I.; Ouafik, L. Extreme Gradient Boosting Tuned with Metaheuristic Algorithms for Predicting Myeloid NGS Onco-Somatic Variant Pathogenicity. Bioengineering 2023, 10, 753. https://doi.org/10.3390/bioengineering10070753