# Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Microbiota Data

#### 2.2. Methodology

#### 2.2.1. Classifier Generation

**Naive Bayes (NB).** Naive Bayes is a probabilistic model that estimates the conditional distribution over the class variable given a new observation [34]. Based on this model, we assign a phenotype value $k$ to a new helminth observation, ${X}_{i}=({x}_{1},{x}_{2},{x}_{3},\dots ,{x}_{M})$, by choosing the class with the highest posterior probability under the assumption that the features are conditionally independent given the class.
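
In its standard form, the naive Bayes rule selects $\hat{k}=\arg\max_{k} P(k)\prod_{j=1}^{M}P({x}_{j}\mid k)$. The following is a minimal Python sketch of that rule, not the implementation used in this study. It assumes Gaussian per-feature likelihoods (the text does not say which likelihood was used), and the function names, the `var_floor` parameter, and the toy abundance values are all hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor (an assumption) avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, x):
    """Return argmax_k of log P(k) + sum_j log P(x_j | k)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


# Hypothetical relative abundances of two taxa per sample (not data from the study).
X = [[0.10, 0.80], [0.15, 0.75], [0.20, 0.70],
     [0.70, 0.20], [0.75, 0.10], [0.80, 0.15]]
y = ["non-infected"] * 3 + ["infected"] * 3
model = fit_gaussian_nb(X, y)
print(predict(model, [0.72, 0.18]))
```

Working in log space avoids numerical underflow when the product runs over hundreds of taxa.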

**Support Vector Machines (SVMs).** In the SVM model, observations are represented as points in a multidimensional feature space [35]. SVMs produce a discriminative model by learning the hyperplane that separates observations of different categories with as large a margin between the categories as possible. A new observation is classified according to which side of the learned hyperplane it falls on. A large number of features may reduce the performance of discriminative models. Because SMART-scan aims to eliminate inefficient and redundant features, combining SMART-scan with a discriminative model such as SVMs may yield a high-performing system.
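
The idea of a learned separating hyperplane can be illustrated with a small linear SVM. The sketch below minimizes the regularized hinge loss by batch sub-gradient descent. This is a simplification chosen for readability, not the solver used in this study. The hyperparameters `lam`, `lr`, and `epochs` and the toy points are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by batch sub-gradient descent.

    Labels in y must be +1 or -1.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Assign the class by the side of the hyperplane w . x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data (not data from the study).
X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
     [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.5]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print(classify(w, b, [2.2, 2.8]), classify(w, b, [-2.2, -2.8]))
```

Only points with margin below 1 contribute to the update, which is why the fitted hyperplane depends on the observations closest to the class boundary.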

**Multilayer Perceptrons (MLP).** The MLP is a discriminative model built as a network of simple neurons [36]. The neurons are arranged in one input layer, at least one hidden layer, and one output layer. The number of hidden layers and the number of neurons in each depend on the network design, while the numbers of neurons in the input and output layers are determined by the number of features and classes, respectively. The neurons of every layer except the output layer are fully connected to the neurons of the next layer by weighted edges. These weights are the parameters learned during training. Each neuron has an activation function, a mathematical function that takes the weights and the outputs of the previous layer and feeds its result to the neurons of the next layer. Presenting a new observation to the input layer of the trained network yields the predicted class at the output layer. SMART-scan may improve both the performance and the computation time of the MLP through feature reduction. Neural network classifiers may achieve high performance, but at the cost of considerable computation time and low interpretability. We use the MLP in our experiments to test whether it can be replaced by simpler and faster models with comparable classification power.
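
The layer-by-layer computation described here can be sketched as a forward pass through a network with one hidden layer. This is illustrative only: the sigmoid and softmax activations, the layer sizes, and all weight values are assumptions, not the configuration used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron computes activation(w . inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]


def softmax(zs):
    """Turn output-layer scores into class probabilities."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = dense(x, hidden_w, hidden_b, sigmoid)
    scores = dense(hidden, out_w, out_b, lambda z: z)  # identity activation
    return softmax(scores)


# Hypothetical weights: 3 input features, 2 hidden neurons, 2 output classes.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0], [-1.0, 1.0]]
out_b = [0.0, 0.0]

probs = mlp_forward([0.2, 0.7, 0.1], hidden_w, hidden_b, out_w, out_b)
predicted_class = probs.index(max(probs))
print(predicted_class)
```

The input layer has one value per feature and the output layer one value per class, so reducing the number of features shrinks the first weight matrix directly.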

**Random Forests (RF).** Random forests is an ensemble learning method: it builds a multitude of randomized decision trees and outputs the mode of the class labels they predict, which improves predictive performance and reduces overfitting [37]. A decision tree is a discriminative model whose training produces a classification tree in which the leaves are classes and the interior nodes at each level correspond to the possible values of the feature selected for that level [38]. In the training phase, the feature with the highest information gain (IG) is selected at each level [39]. In the test phase, the class label of the leaf that matches the conjunction of feature values of the new observation is returned. By using RF, we aim to benefit from ensemble learning while also testing whether selecting features by a different measure, IG, improves classification performance.
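
Two ingredients of this description, feature selection by information gain and the majority vote over trees, can be sketched in a few lines of Python. The sketch assumes discrete feature values and uses hypothetical labels and feature names. It does not build full trees and is not the implementation used in this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def majority_vote(tree_predictions):
    """Forest output: the mode of the labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical discretized abundances of two taxa (not data from the study).
labels = ["infected", "infected", "non-infected", "non-infected"]
taxon_a = ["high", "high", "low", "low"]   # separates the classes perfectly
taxon_b = ["high", "low", "high", "low"]   # carries no class information

best = max([("taxon_a", taxon_a), ("taxon_b", taxon_b)],
           key=lambda item: information_gain(item[1], labels))[0]
print(best, majority_vote(["infected", "non-infected", "infected"]))
```

A feature that splits the samples into pure groups removes all label uncertainty, so it receives the maximum gain and is chosen first.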

#### 2.2.2. Taxonomic Modeling

## 3. Results and Discussion

#### 3.1. Experimental Evaluation

#### 3.2. Discussion

## 4. Conclusions

## Supplementary Materials

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. World Health Organization. Estimated Incidence, Prevalence and TB Mortality; WHO: Geneva, Switzerland, 2004; Available online: http://www.who.int/mediacentre/factsheets/fs104/en (accessed on 29 June 2016).
2. Mendes-Soares, H.; Krishnan, V.; Settles, M.L.; Ravel, J.; Brown, C.J.; Forney, L.J. Fine-scale analysis of 16S rRNA sequences reveals a high level of taxonomic diversity among vaginal Atopobium spp. Pathog. Dis. **2015**, 73, ftv020.
3. Nistal, E.; Caminero, A.; Herrán, A.R.; Pérez Andres, J.; Vivas, S.; Ruiz de Morales, J.M.; Sáenz de Miera, L.E.; Casqueiro, J. Study of duodenal bacterial communities by 16s rrna gene analysis in adults with active celiac disease versus non-celiac disease controls. J. Appl. Microbiol. **2016**, 120, 1691–1700.
4. Wendl, M.C.; Kota, K.; Weinstock, G.M.; Mitreva, M. Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens’ theorem. J. Math. Biol. **2013**, 67, 1141–1161.
5. Jumpstart Consortium Human Microbiome Project Data Generation Working Group. Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS ONE **2012**, 7, e39315.
6. Hill, T.C.; Walsh, K.A.; Harris, J.A.; Moffett, B.F. Using ecological diversity measures with bacterial communities. FEMS Microbiol. Ecol. **2003**, 43, 1–11.
7. Zhang, Q.; Abel, H.; Wells, A.; Lenzini, P.; Gomez, F.; Province, M.A.; Templeton, A.A.; Weinstock, G.M.; Salzman, N.H.; Borecki, I.B. Selection of models for the analysis of risk-factor trees: Leveraging biological knowledge to mine large sets of risk factors with application to microbiome data. Bioinformatics **2015**, 31, 1607–1613.
8. White, J.R. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput. Biol. **2009**, 5, e1000352.
9. Segata, N.; Izard, J.; Waldron, L.; Gevers, D.; Miropolsky, L.; Garrett, W.S.; Huttenhower, C. Metagenomic biomarker discovery and explanation. Genome Biol. **2011**, 12, 1.
10. Holmes, I.; Harris, K.; Quince, C. Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE **2012**, 7, e30126.
11. La Rosa, P.S.; Brooks, J.P.; Deych, E.; Boone, E.L.; Edwards, D.J.; Wang, Q.; Sodergren, E.; Weinstock, G.; Shannon, W.D. Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLoS ONE **2012**, 7, e52078.
12. Anderson, M.J. A new method for nonparametric multivariate analysis of variance. Austral Ecol. **2001**, 26, 32–46.
13. Chen, J.; Bittinger, K.; Charlson, E.S.; Hoffmann, C.; Lewis, J.; Wu, G.D.; Collman, R.G.; Bushman, F.D.; Li, H. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics **2012**, 28, 2106–2113.
14. Mantel, N. The detection of disease clustering and a generalized regression approach. Cancer Res. **1976**, 27, 209–220.
15. Lozupone, C.; Knight, R. UniFrac: A new phylogenetic method for comparing microbial communities. Appl. Environ. Microbiol. **2005**, 71, 8228–8235.
16. Tobias, R.D. An introduction to partial least squares regression. In Proceedings of the Twentieth Annual SAS Users Group International Conference, Orlando, FL, USA, 2 April 1995; pp. 1250–1257.
17. Barker, M.; Rayens, W. Partial least squares for discrimination. J. Chemom. **2003**, 17, 166–173.
18. Nguyen, D.V.; Rocke, D.M. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics **2002**, 18, 39–50.
19. Lê Cao, K.A.; Rossouw, D.; Robert-Granié, C.; Besse, P. A sparse PLS for variable selection when integrating omics data. Stat. Appl. Genet. Mol. Biol. **2008**, 7, 1544–6115.
20. Lê Cao, K.A.; Martin, P.G.; Robert-Granié, C.; Besse, P. Sparse canonical methods for biological data integration: Application to a cross-platform study. BMC Bioinform. **2009**, 10, 34.
21. Mahana, D.; Trent, C.M.; Kurtz, Z.D.; Bokulich, N.A.; Battaglia, T.; Chung, J.; Müller, C.L.; Li, H.; Bonneau, R.A.; Blaser, M.J. Antibiotic perturbation of the murine gut microbiome enhances the adiposity, insulin resistance, and liver disease associated with high-fat diet. Genome Med. **2011**, 8, 1.
22. Lê Cao, K.A.; Boitard, S.; Besse, P. Sparse PLS discriminant analysis: Biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinform. **2011**, 12, 1.
23. Lê Cao, K.A.; Costello, M.E.; Lakis, V.A.; Bartolo, F.; Chua, X.Y.; Brazeilles, R.; Rondeau, P. mixMC: A multivariate statistical framework to gain insight into Microbial Communities. bioRxiv **2016**, 044206, doi:http://dx.doi.org/10.1101/044206.
24. Sun, Y.; Cai, Y.; Mai, V.; Farmerie, W.; Yu, F.; Li, J.; Goodison, S. Advanced computational algorithms for microbial community analysis using massive 16S rRNA sequence data. Nucleic Acids Res. **2011**.
25. Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B (Stat. Methodol.) **2011**, 73, 273–282.
26. Loh, W.Y. Classification and regression trees. Wiley Interdiscip. Rev. Data Min. Know. Dis. **2011**, 1, 14–23.
27. Ogoe, H.A.; Visweswaran, S.; Lu, X.; Gopalakrishnan, V. Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data. BMC Bioinform. **2015**, 16, 1.
28. Ordiz, M.I.; May, T.D.; Mihindukulasuriya, K.; Martin, J.; Crowley, J.; Tarr, P.I.; Ryan, K.; Mortimer, E.; Gopalsamy, G.; Maleta, K.; et al. The effect of dietary resistant starch type 2 on the microbiota and markers of gut inflammation in rural Malawi children. Microbiome **2015**, 3, 1–9.
29. Alpaydin, E. Supervised Learning. In Introduction to Machine Learning; Dietterich, T., Bishop, C., Heckerman, D., Jordan, M., Kearns, M., Eds.; The MIT Press: London, UK, 2010; pp. 32–34.
30. Cole, J.R.; Wang, Q.; Fish, J.A.; Chai, B.; McGarrell, D.M.; Sun, Y.; Brown, C.T.; Porras-Alfaro, A.; Kuske, C.R.; Tiedje, J.M. Ribosomal Database Project: Data and tools for high throughput rRNA analysis. Nucleic Acids Res. **2013**, 42, 633–642.
31. Bellman, R.E. Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1957.
32. Bermingham, M.L.; Pong-Wong, R.; Spiliopoulou, A.; Hayward, C.; Rudan, I.; Campbell, H.; Wright, A.F.; Wilson, J.F.; Agakov, F.; Navarro, P.; et al. Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Sci. Rep. **2012**, 5, 10312.
33. Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science **2000**, 290, 2323–2326.
34. Rish, I. An empirical study of the naive Bayes classifier. IJCAI **2001**, 3, 41–46.
35. Burges, C.J. A tutorial on support vector machines for pattern recognition. Data Min. Know. Dis. **1998**, 2, 121–167.
36. Panchal, G.; Ganatra, A.; Kosta, Y.P.; Panchal, D. Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers. Int. J. Comput. Theory Eng. **2011**, 3, 332–337.
37. Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32.
38. Quinlan, J.R. Induction of decision trees. Mach. Learn. **1986**, 1, 81–106.
39. Kent, J.T. Information gain and a general measure of correlation. Biometrika **1983**, 70, 163–173.
40. Russell, S.J.; Norvig, P.; Canny, J.F.; Malik, J.M.; Edwards, D.D. Informed Search Methods. In Artificial Intelligence: A Modern Approach; Pompili, M., Chavez, S., Eds.; Prentice Hall: Upper Saddle River, NJ, USA, 1995; pp. 92–118.
41. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM SIGKDD Explor. **2009**, 11, 10–18.
42. Zhang, Q. Implemented Code for SMARTscan, 2015. Available online: https://dsgweb.wustl.edu/qunyuan/software/smartscan/ (accessed on 10 July 2016).

**Figure 1.** Flowchart of baseline and taxonomic models. The phylogenetic tree is one of the outputs of the Ribosomal Database Project (RDP) [30].

**Figure 2.** The pseudocode and first three iterations of the SMART-scan algorithm, derived from the explanations provided in the main paper [7] and the R code provided by its authors. The derived pseudocode in (**a**) is meant to provide a computational interpretation and rapid lookup to support extensions of the SMART-scan method for automatic clustering of microbiota. The first three iterations of the SMART-scan algorithm in (**b**) are prepared based on the pseudocode in (**a**). In the first column, the sub-tree candidate for grouping is extracted from TreeCandidatesForSplitting. In the second column, the sub-tree is enclosed by a triangle in the phylogenetic tree; all possible cut points of the sub-tree are marked by lines cutting edges, and the taxa are named ${x}_{i}$. In the third column, the selected cut points are depicted by double lines, the selected groupings of taxa are named ${Z}_{i}$, and the newly split sub-trees, named ${T}_{1}$ and ${T}_{2}$, are pushed into TreeCandidatesForSplitting.

Dataset ID | Number of Samples | Number of Taxa | Class Distribution (Helminth-Infected/Non-Infected) | Helminth-Infected Distribution (Single/Multi-Infected) | Type of Multi-Infection Worms |
---|---|---|---|---|---|
Indonesia | 90 | 702 | 38/52 | 35/3 | 3 (ascaris + hookworm) |
Liberia | 74 | 702 | 23/51 | 19/4 | 3 (ascaris + hookworm) and 1 (ascaris + whipworm) |

**Table 2.** Baseline and taxonomic model performance on detecting helminth infection over the Indonesia dataset, using 10 runs of 10-fold cross-validation. In this table and the two tables that follow, the columns are (from left to right): the classifier generation method, the Area Under the ROC Curve (AUC) for the baseline and taxonomic models, the improvement achieved by the taxonomic model, and the p-values for the t-test and the Wilcoxon test. The best results in the Taxonomic Model AUC and Improvement columns are shown in bold.

Classifier | Baseline AUC | Taxonomic Model AUC | Improvement | p-Value for [t-Test, Wilcoxon Test] |
---|---|---|---|---|
NB | 0.69 | 0.87 | 0.18 | [0.029, 0.039] |
SVMs | 0.61 | 0.87 | 0.26 | [0.002, 0.000] |
MLP | 0.61 | 0.85 | 0.24 | [0.004, 0.001] |
RF | 0.67 | 0.81 | 0.14 | [0.039, 0.048] |
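
The evaluation protocol named in the caption, AUC under 10 runs of 10-fold cross-validation, can be sketched as follows. This is a rough illustration rather than the tooling used in this study: the folds are not stratified, the `seed` parameter is an assumption, and the scores passed to `auc` are hypothetical.

```python
import random


def repeated_kfold(n_samples, k=10, runs=10, seed=0):
    """Yield (train_idx, test_idx) pairs for `runs` independent shuffles of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(runs):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for f, fold in enumerate(folds) if f != i for j in fold]
            yield train, test


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (infected) or 0.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 90 samples, as in the Indonesia dataset; 10 runs x 10 folds = 100 splits.
splits = list(repeated_kfold(90, k=10, runs=10, seed=0))
print(len(splits))

# Hypothetical classifier scores for six test samples.
print(auc([0.9, 0.8, 0.4, 0.6, 0.3, 0.2], [1, 1, 1, 0, 0, 0]))
```

The Improvement column is then the difference between the two AUC columns, for example 0.87 - 0.61 = 0.26 for SVMs.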

**Table 3.** Baseline and taxonomic model (AUC) performance on detecting helminth infection over the Liberia dataset.

Classifier | Baseline AUC | Taxonomic Model AUC | Improvement | p-Value for [t-Test, Wilcoxon Test] |
---|---|---|---|---|
NB | 0.71 | 0.94 | 0.23 | [0.006, 0.002] |
SVMs | 0.52 | 0.82 | 0.30 | [0.002, 0.000] |
MLP | 0.78 | 0.84 | 0.06 | [0.180, 0.087] |
RF | 0.85 | 0.92 | 0.07 | [0.062, 0.073] |

**Table 4.** Baseline and taxonomic model (AUC) performance on detecting helminth infection over the Combined dataset.

Classifier | Baseline AUC | Taxonomic Model AUC | Improvement | p-Value for [t-Test, Wilcoxon Test] |
---|---|---|---|---|
NB | 0.66 | 0.89 | 0.23 | [0.002, 0.000] |
SVMs | 0.59 | 0.81 | 0.22 | [0.002, 0.000] |
MLP | 0.75 | 0.87 | 0.12 | [0.014, 0.008] |
RF | 0.72 | 0.85 | 0.13 | [0.027, 0.024] |

**Table 5.** Baseline and taxonomic model performance on detecting helminth infection over the Indonesia dataset, using 10 runs of 10-fold cross-validation. In this table and the two tables that follow, the columns are (from left to right): the classifier generation method, the Sensitivity/Specificity (Sen/Spec) for infection detection and the Balanced accuracy (Bacc) for the baseline and taxonomic models, the improvement in Bacc achieved by the taxonomic model, and the p-values for the t-test and the Wilcoxon test. The best results in the Taxonomic Model Bacc and Improvement columns are shown in bold.

Classifier | Baseline Sen/Spec | Taxonomic Model Sen/Spec | Baseline Bacc | Taxonomic Model Bacc | Improvement Bacc | p-Value for [t-Test, Wilcoxon Test] |
---|---|---|---|---|---|---|
NB | 0.58/0.67 | 0.86/0.77 | 0.62 | 0.82 | 0.20 | [0.001, 0.004] |
SVMs | 0.26/0.96 | 0.78/0.96 | 0.61 | 0.87 | 0.26 | [0.000, 0.002] |
MLP | 0.51/0.73 | 0.82/0.78 | 0.62 | 0.80 | 0.18 | [0.030, 0.040] |
RF | 0.41/0.88 | 0.57/0.85 | 0.64 | 0.71 | 0.07 | [0.006, 0.030] |
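
Balanced accuracy is conventionally the mean of sensitivity and specificity. The sketch below assumes that definition, which is consistent with, for example, the SVMs row above. The function names and the toy predictions are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(sensitivity, specificity):
    """Mean of the true-positive rate and the true-negative rate."""
    return (sensitivity + specificity) / 2


# SVMs row of the Indonesia table: Sen/Spec 0.26/0.96 (baseline), 0.78/0.96 (taxonomic).
baseline_bacc = balanced_accuracy(0.26, 0.96)
taxonomic_bacc = balanced_accuracy(0.78, 0.96)
print(round(baseline_bacc, 2), round(taxonomic_bacc, 2))

# Hypothetical predictions for eight samples (1 = infected).
sen, spec = sensitivity_specificity([1, 1, 1, 1, 0, 0, 0, 0],
                                    [1, 1, 1, 0, 0, 0, 1, 1])
print(sen, spec, balanced_accuracy(sen, spec))
```

Because it averages the two rates, balanced accuracy is not inflated by a classifier that simply predicts the majority (non-infected) class, which is the pattern seen in baseline rows with very low sensitivity and near-perfect specificity.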

**Table 6.** Baseline and taxonomic model (Sensitivity, Specificity, and Balanced accuracy) performance on detecting helminth infection over the Liberia dataset.

Classifier | Baseline Sen/Spec | Taxonomic Model Sen/Spec | Baseline Bacc | Taxonomic Model Bacc | Improvement Bacc | p-Value for [t-Test, Wilcoxon Test] |
---|---|---|---|---|---|---|
NB | 0.6/0.72 | 0.88/0.90 | 0.66 | 0.89 | 0.23 | [0.020, 0.040] |
SVMs | 0.05/1.0 | 0.72/0.92 | 0.75 | 0.82 | 0.07 | [0.000, 0.002] |
MLP | 0.63/0.82 | 0.68/0.92 | 0.72 | 0.80 | 0.08 | [0.001, 0.002] |
RF | 0.18/0.98 | 0.52/0.98 | 0.58 | 0.75 | 0.17 | [0.030, 0.040] |

**Table 7.** Baseline and taxonomic model (Sensitivity, Specificity, and Balanced accuracy) performance on detecting helminth infection over the Combined dataset.

Classifier | Baseline Sen/Spec | Taxonomic Model Sen/Spec | Baseline Bacc | Taxonomic Model Bacc | Improvement Bacc | p-Value for [t-Test, Wilcoxon Test] |
---|---|---|---|---|---|---|
NB | 0.59/0.70 | 0.87/0.79 | 0.64 | 0.83 | 0.19 | [0.001, 0.002] |
SVMs | 0.21/0.97 | 0.67/0.88 | 0.59 | 0.77 | 0.18 | [0.002, 0.006] |
MLP | 0.68/0.82 | 0.77/0.80 | 0.75 | 0.87 | 0.12 | [0.036, 0.039] |
RF | 0.29/0.94 | 0.61/0.91 | 0.61 | 0.76 | 0.15 | [0.001, 0.004] |

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Eshaghzadeh Torbati, M.; Mitreva, M.; Gopalakrishnan, V. Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations. *Data* **2016**, *1*, 19.
https://doi.org/10.3390/data1030019

**Chicago/Turabian Style**

Eshaghzadeh Torbati, Mahbaneh, Makedonka Mitreva, and Vanathi Gopalakrishnan. 2016. "Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations" *Data* 1, no. 3: 19.
https://doi.org/10.3390/data1030019