MegaD: Deep Learning for Rapid and Accurate Disease Status Prediction of Metagenomic Samples

The diversity within microbiome communities that drive biogeochemical processes influences many different phenotypes. Analyses of these communities and their diversity by countless microbiome projects have revealed an important role for metagenomics in understanding the complex relationship between microbes and their environments. This relationship can be understood in the context of the microbiome composition of specific known environments, and these compositions can then be used as templates for predicting the status of similar environments. Machine learning has been a key component of this predictive task, and several analysis tools utilizing machine learning methods for metagenomic analysis have already been published. Despite the previously proposed machine learning models, the performance of deep neural networks remains under-researched. Given the nature of metagenomic data, deep neural networks could provide a strong boost to prediction accuracy in metagenomic analysis applications. To meet this demand, we present a deep learning-based tool that utilizes a deep neural network implementation for phenotypic prediction of unknown metagenomic samples. First, our tool takes as input taxonomic profiles from 16S or WGS sequencing data. Second, given the samples, our tool builds a model based on a deep neural network by computing multi-level classification. Lastly, given the model, our tool classifies an unknown sample from its unlabeled taxonomic profile. In benchmark experiments, we found that an analysis method utilizing a deep neural network, such as our tool, shows promising improvements in prediction accuracy on several datasets compared to other machine learning models.


Introduction
Recent research on microbial communities has actively studied the topic of metagenomics of environmental sample diversity while focusing on the role of microbes in human, animal, plant, and environmental niches [1]. Analysis of metagenomic data related to human health and disease has surged in popularity due to the increasing availability of metagenomic sequencing data, itself the result of decreasing sequencing costs. Several projects, such as the MetaHIT consortium and the Human Microbiome Project, aimed to catalogue metagenomic differences between healthy and sick individuals utilizing whole genome sequencing techniques [2,3]. These large-scale metagenomics projects provide extensive publicly available datasets that allow research groups to delve much deeper into specific disease-microbe relationships. However, most of these studies rely only on statistical analysis or on simple machine learning approaches.
As the sheer volume of features in samples becomes ever larger, the two main challenges of analyzing metagenomic data are as follows. First, each sample may contain thousands of species of widely varying abundances. Second, the sequencing of metagenomic data is non-trivial due to the abundance of replicated sequences scattered across many different species. To identify species abundance within a sample, a recent study performed 16S rRNA gene amplicon analysis, which amplifies the 16S rRNA region to identify reads hailing from different species [4]. These reads are then assigned to specific taxonomies as taxonomic profiles are created. The association between read and taxonomy is defined as an operational taxonomic unit, or OTU. The main limitation of 16S rRNA is that OTUs commonly have phylum- or genus-level taxonomic resolution, making species- and strain-level analysis difficult [5]. In recent years, the decreasing cost of whole genome sequencing (WGS) has led to increased adoption of this technique for metagenomic analysis. In species- and strain-level classification, WGS provides greater extensibility and flexibility than rRNA sequencing. WGS also enables the classification of viral samples, while 16S rRNA sequencing does not [6]. At present, both sequencing techniques are in use depending on the specific application at hand [7]. On the one hand, 16S rRNA sequencing is cheaper and more applicable at a large scale; on the other hand, WGS is suited to smaller sample sizes since it provides a much higher resolution, albeit at an increased cost [8].
Another important aspect of metagenomic analysis is taxonomic classification. Taxonomic classification is the process of identifying microbial presence and abundance by comparing sequenced reads to a database of catalogued genomes. Many tools have been developed to perform this task, such as QIIME 2, which classifies 16S rRNA samples using an OTU binning method [9]. Existing tools and techniques classify WGS sequences by (1) searching reads against reference genomes [10][11][12], (2) using short genomic substrings (k-mers) [13][14][15], and (3) aligning reads to specific genetic markers [16,17]. In modern metagenomics, several approaches leverage taxonomic classification to identify linkages between diseases and microbes. For example, the DIABIMMUNE and MetaSUB projects aim at linking microbial flora to their environments [18][19][20][21]. These approaches in metagenomics are well suited to machine learning algorithms, since extracting patterns from large datasets requires deep analysis of the samples [22,23].
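The k-mer strategy in (2) can be illustrated with a toy sketch. The reads, taxa, and reference k-mer sets below are entirely hypothetical, and real classifiers such as Kraken 2 use far more sophisticated exact-match databases; this only shows the core idea of matching read k-mers against per-taxon k-mer sets:

```python
from collections import Counter

def kmers(seq, k=4):
    """Extract all overlapping substrings of length k from a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def classify_read(read, reference, k=4):
    """Assign a read to the reference taxon sharing the most k-mers.

    `reference` maps taxon name -> set of k-mers from its genome.
    Toy sketch of the k-mer idea, not a real Kraken 2 implementation.
    """
    counts = Counter()
    for km in kmers(read, k):
        for taxon, km_set in reference.items():
            if km in km_set:
                counts[taxon] += 1
    return counts.most_common(1)[0][0] if counts else "unclassified"

# Hypothetical two-taxon reference built from short genome fragments
reference = {
    "taxonA": set(kmers("ACGTACGTACGT")),
    "taxonB": set(kmers("TTGGCCTTGGCC")),
}
print(classify_read("ACGTACGT", reference))  # every k-mer hits taxonA
```

Counting shared k-mers per read is what lets these tools skip full alignment, which is why they scale to the read volumes WGS produces.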
Currently, there are several machine learning based tools to analyze these metagenomic datasets for taxonomic classification. MetAML takes advantage of several machine learning models to link taxonomic profiles to specific phenotypes [24]. MetaDprof uses spline regression to identify differential abundances of samples [25]. MegaR leverages random forest, generalized linear models, and support vector machines to predict host phenotype from taxonomic profiles generated from metagenomic data [26]. PopPhy utilizes a convolutional neural network and tree structure for its phenotypic prediction [27]. DeepMicro is another deep representation learning framework that transforms a high-dimensional microbiome profile to a low-dimensional representation [28]. Although these tools are useful for examining microbiome-phenotype associations, they have important limitations. MetAML cannot support 16S rRNA sequencing data. MetaDprof is effective for longitudinal studies but does not provide adequate performance for general sample classification. MegaR and PopPhy struggle to function properly on extremely large datasets. DeepMicro often drops important features by transforming high-dimensional profiles to low-dimensional representations.
The idea of metagenomic analysis is to identify a link between the microbial diversity of a given sample and the phenotype of the host. Previous metagenomic analysis approaches primarily utilized statistical methods, which were time consuming and incapable of processing large amounts of data [29]. More recent approaches have applied machine learning models, such as support vector machines, generalized linear models, and random forest, in order to identify microbiome-host links [26]. Although these approaches have shown respectably accurate results on many datasets, they still exhibit key limitations. First, these approaches have repeatedly failed to show appropriate performance on datasets with more than two levels of classification, significantly limiting the usefulness of these models. Second, these models are unable to classify large datasets with thousands of features and samples.
To address these limitations, we propose a deep learning-based approach that applies deep neural networks to this classification problem. We have developed a deep learning based tool, MegaD (https://github.com/BioHPC/MegaD (accessed on 27 April 2022)), which examines the links between microbes and their hosts and performs precise phenotypic classification on metagenomic samples. The main contributions made by MegaD are as follows:

1. MegaD is developed based on deep neural networks for phenotypic prediction of unknown metagenomic samples, supporting both 16S rRNA and WGS data;
2. MegaD demonstrates the robustness to handle extremely large datasets containing thousands of features and samples;
3. MegaD delivers rapid computing performance with high precision and accuracy in predicting the phenotype of unknown samples based on the trained model.
In this paper, we examined the efficacy of using a modular, multi-layer deep neural network for predicting host phenotype in comparison to other available machine learning tools. We found that our implementation provided a more robust and better-performing model on most datasets.

Datasets
In order to test the efficacy of a deep neural network for phenotypic classification of taxonomic profiles, we collected metagenomic datasets from several independent research groups. We used these datasets because MegaD should be evaluated on real-world datasets from different fields of healthcare. The first dataset was the cirrhosis dataset, consisting of 232 metagenomic samples with 114 patient samples and 118 healthy control samples [30]. All samples from this dataset were obtained from people of Han Chinese origin. The samples from this dataset were sequenced using whole genome sequencing and then converted into taxonomic abundance profiles using Kraken 2 software. There were 3302 taxonomic levels identified. The purpose of this dataset was to examine the efficacy of our tool on a small real-world dataset. The second dataset contained samples from patients with type II diabetes (T2D) and healthy controls. There were a total of 440 participants hailing from two different regions. There were 53 T2D samples and 43 control samples taken from a cohort of 70-year-old European women. Samples from 344 Chinese T2D patients and non-diabetic controls are also included [31]. The purpose of this dataset was to benchmark our tool on a more robust real-world clinical dataset. The samples of this dataset were sequenced using whole genome sequencing and then converted into taxonomic abundance profiles using Kraken 2 software. There were 606 taxonomic levels identified. The third dataset contained samples from canine specimens classified as obese or lean. This dataset was obtained from the Purina group and contained samples from 3096 individuals. Sequencing data were obtained from 1536 obese dogs and 1560 lean dogs. A taxonomic abundance profile was generated from this sequencing data using the Kraken 2 software tool. There were a total of 10,920 taxonomic levels identified.
The purpose of this dataset was to serve as our ideal real-world dataset containing a large number of samples with very high read depth. The size of each dataset, including healthy control and patient samples, and the accession numbers are shown in Table 1. The processed datasets for MegaD can also be found on the MegaD project website (https://github.com/BioHPC/MegaD (accessed on 27 April 2022)).


Methods
Figure 1 shows a general overview of MegaD's workflow. MegaD trains deep neural networks with taxonomic abundance profiles for classification of unknown sample profiles.
All raw sequence data from samples in the datasets listed above are converted to taxonomic abundance profiles using the classification tools (1) QIIME 2 for 16S data and (2) Kraken 2 or MetaPhlAn2 for WGS data. Given a sample, these classification tools identify bacterial species by comparing sequence reads with known organism genomes. All abundance profiles used are then converted into OTU table format for analysis. Finally, the last panel shows how the processed data are used in our tool to produce a predictive model for classification tasks.

For each of the datasets, metadata files are created from the sample IDs and disease status. The classifier takes as input a taxonomic table containing sample IDs together with the metadata files created. The features extracted by the classifier form a two-dimensional vector for each sample ID containing the relative abundance of each of the taxonomic levels. The inputs are then split into training, validation, and testing datasets using stratified splitting methods from the scikit-learn Python package [32]. These datasets are then used to create a model for future classification of unknown taxonomic profiles.
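The stratified splitting step can be sketched with scikit-learn's `train_test_split`. The abundance matrix and the 15% proportions below are toy values for illustration; MegaD's exact I/O and split sizes may differ:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy abundance matrix: 100 samples x 6 taxa, with binary disease labels
rng = np.random.default_rng(0)
X = rng.random((100, 6))
y = np.array([0, 1] * 50)

# First carve out a 15% test set; stratifying on the label preserves
# the healthy/patient class balance in every partition
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# Then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15, stratify=y_trainval,
    random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Stratification matters here because disease datasets are often small; an unstratified split could leave a partition with almost no patient samples.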
The machine learning model used in this project is a deep neural network class created using the PyTorch and scikit-learn Python packages. Deep neural networks utilize a series of connected layers, each containing a set of perceptrons and weights, that learn patterns within the training data through backpropagation and stochastic gradient descent [33]. The main advantage of DNNs is their ability to handle very large datasets containing thousands of features while providing very high accuracy [34]. The model uses a user-defined number of (1) neurons per layer and (2) hidden layers, as shown in Figure 2. Each neuron's activation is computed using an ELU activation function. Each linear layer is followed by a dropout layer, which can be disabled based on user input. The final layer of the model is a softmax function that returns class probabilities in an M × N array, where M is the batch size and N is the number of classes. The model also employs the Adam optimizer for gradient descent and a cross-entropy loss function. We use a default learning rate of 0.00003 and a default weight decay of 0.0. We also implement batch gradient descent with a user-defined batch size defaulting to 50. Finally, optional normalization of the data is performed using a cumulative sum scaling method. Once the DNN model is trained, the classifier computes the predictive accuracy on the training, validation, and testing datasets and outputs the results to the console. Additionally, precision, recall, and F1 score are calculated from the predicted vs. true labels for the test partition of each dataset.
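A network with the properties described above could be sketched in PyTorch as follows. This is an illustrative reconstruction under the stated defaults (ELU activations, optional dropout, Adam with learning rate 3 × 10⁻⁵, softmax probabilities at inference), not MegaD's actual source code; the class name and layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DNNClassifier(nn.Module):
    """Fully connected network: a user-defined number of hidden layers
    with ELU activations and optional dropout, producing class logits."""
    def __init__(self, n_features, n_classes, n_layers=3, n_neurons=25,
                 dropout_rate=0.5, use_dropout=True):
        super().__init__()
        layers, width = [], n_features
        for _ in range(n_layers):
            layers.append(nn.Linear(width, n_neurons))
            layers.append(nn.ELU())
            if use_dropout:
                layers.append(nn.Dropout(dropout_rate))
            width = n_neurons
        layers.append(nn.Linear(width, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax itself
        return self.net(x)

model = DNNClassifier(n_features=606, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, weight_decay=0.0)
loss_fn = nn.CrossEntropyLoss()

# At inference, softmax converts logits to an M x N probability array
batch = torch.randn(50, 606)          # default batch size of 50
probs = torch.softmax(model(batch), dim=1)
```

Note the standard PyTorch idiom: because `nn.CrossEntropyLoss` expects raw logits, the softmax is applied only at prediction time to obtain the M × N class probabilities described above.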
Since metagenomic datasets differ vastly from one another in sample size and feature count, we have implemented a randomized grid search algorithm to simplify parameter optimization for the end user. This algorithm aims at the best results for each type of dataset, whether binary or multi-level classification.
This randomized grid search runs through a user-defined number of iterations and, for each iteration, generates a random set of parameters from predefined ranges. The parameters randomized by this method and the selected ranges are as follows: number of layers (5-75), number of neurons (5-75), dropout (True, False), dropout rate (0.1-0.9), learning rate (0.000005-0.0005), and weight decay (0-0.1). Additionally, the grid search method stops training early when validation accuracy fails to improve over a user-defined number of random iterations, with a default of 10.
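The randomized search with early stopping can be sketched as follows. The parameter ranges mirror those listed above; the `evaluate` callable stands in for training and validating one model, and `toy_eval` is a hypothetical scoring function used only to make the sketch runnable:

```python
import random

PARAM_RANGES = {
    "n_layers":      lambda: random.randint(5, 75),
    "n_neurons":     lambda: random.randint(5, 75),
    "dropout":       lambda: random.choice([True, False]),
    "dropout_rate":  lambda: random.uniform(0.1, 0.9),
    "learning_rate": lambda: random.uniform(5e-6, 5e-4),
    "weight_decay":  lambda: random.uniform(0.0, 0.1),
}

def random_search(evaluate, n_iter=100, patience=10, seed=0):
    """Draw random parameter sets; stop early once validation accuracy
    has not improved for `patience` consecutive iterations."""
    random.seed(seed)
    best_params, best_acc, stale = None, float("-inf"), 0
    for _ in range(n_iter):
        params = {name: draw() for name, draw in PARAM_RANGES.items()}
        acc = evaluate(params)  # train a model, return validation accuracy
        if acc > best_acc:
            best_params, best_acc, stale = params, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_acc

# Hypothetical evaluator that favors mid-sized layers, for illustration only
def toy_eval(p):
    return 1.0 - abs(p["n_neurons"] - 40) / 70

params, acc = random_search(toy_eval)
```

Random search of this kind often finds good configurations with far fewer model trainings than an exhaustive grid, which is what makes it practical when each evaluation means training a DNN.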
Finally, MegaD offers users the ability to predict an unknown dataset using a pretrained model, or by using a novel training dataset. Once a model is selected or created, the user can load their unknown sample into the program, which will then return a class prediction based on the trained model. This feature is useful for identifying disease states of patients, providing a useful tool in precision medicine.
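Saving a trained model and reloading it later for prediction could look like the following PyTorch sketch. The file name and the stand-in architecture are illustrative assumptions, not MegaD's actual serialization format:

```python
import torch
import torch.nn as nn

# Stand-in for a trained classifier (any nn.Module works the same way)
model = nn.Sequential(nn.Linear(606, 25), nn.ELU(), nn.Linear(25, 2))

# Persist only the learned weights
torch.save(model.state_dict(), "megad_model.pt")

# Later (or on another machine): rebuild the same architecture,
# load the weights, and classify an unknown taxonomic profile
restored = nn.Sequential(nn.Linear(606, 25), nn.ELU(), nn.Linear(25, 2))
restored.load_state_dict(torch.load("megad_model.pt"))
restored.eval()

unknown_profile = torch.randn(1, 606)   # one unlabeled sample
with torch.no_grad():
    predicted_class = restored(unknown_profile).argmax(dim=1).item()
```

Persisting only the `state_dict` (rather than the whole module) is the usual PyTorch pattern, since it keeps the saved file independent of the training script's class definitions.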

Results
In order to compare the efficacy of deep neural networks, we benchmarked our tool against three other analysis tools: MegaR, MetAML, and PopPhy. We measured the cross-validation accuracy of our model on the Cirrhosis and type II diabetes (T2D) datasets in order to compare against existing tools. For these datasets, we ran our tool using the grid search function to maximize the accuracy of our model. The key selected parameters included a species-level filter with a 0.03 abundance threshold. In Table 2, our model achieves a cross-validation accuracy of 83.3% on the Cirrhosis dataset and 70% on the T2D dataset. For MegaR and MetAML, the random forest algorithm was used for the comparison since it created the best-performing model of those available. For PopPhy, the default convolutional neural network (CNN) was used for evaluation. We tested the performance of each tool on three datasets: (1) Cirrhosis, (2) type II diabetes, and (3) Purina obesity. The performance metric used to determine the efficacy of each program was prediction accuracy on a testing subset of the original dataset at a 15% stratified split. Accuracy was used since the datasets are balanced.
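The accuracy metric (along with the precision, recall, and F1 scores described in the Methods) can be computed with scikit-learn. The labels below are toy values for illustration, not results from the paper:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical true and predicted disease labels for a test partition
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} "
      f"recall={rec:.2f} f1={f1:.2f}")
```

On an imbalanced dataset, precision, recall, and F1 would be the more informative of these metrics; accuracy suffices here only because the benchmark datasets are balanced.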
On the Cirrhosis dataset, MegaD achieved its highest accuracy of 83.3% at the species taxonomic level, as shown in Table 2, using the integrated grid search function. We observed that this search function identified the best-performing parameters for each dataset. MegaR achieved its highest accuracy of 88.5% at the species taxonomic level using the random forest model, with the parameters set as follows: 0.05 threshold, 15% split, no normalization, and 1-201 variables. PopPhy and MetAML achieved 91% and 87.7% accuracy with default parameters, respectively. We then evaluated the performance of MegaD's DNN classifier on the type II diabetes dataset. The DNN model of MegaD reached 70% accuracy on this dataset, the highest among the compared analysis tools. MegaR, PopPhy, and MetAML achieved 67%, 65%, and 66.4% accuracy with default settings, respectively. These results show that our tool can predict disease status from taxonomic abundance profiles for either WGS or 16S sequencing data.
To determine the efficacy of our tool in disease status prediction from metagenomic samples, we tested our tool's performance on the Purina dataset and compared the performance with default parameters versus our grid search function. Our default parameters were: 10 layers, 25 neurons, an abundance threshold of 0.03, a learning rate of 1 × 10⁻⁵, normalization enabled, and taxonomic level set to species. The model was trained over 20 epochs. For our grid search results, all settings were kept at the default parameters with the exception of grid search set to true. In Table 3, MegaD showed 94.1% accuracy using the default parameters and 98.7% accuracy with our grid search implementation. This result is a good indicator of our tool's ability to predict the disease status of unknown samples using trained models.
In Figure 3, the performance of our DNN on the Purina dataset is shown by plotting the true positive rate against the false positive rate in an ROC curve. This graph demonstrates our model's ability to make true positive predictions with high accuracy.
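An ROC curve of this kind can be computed from the model's positive-class probabilities with scikit-learn. The labels and scores below are toy values for illustration only:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted positive-class probabilities
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

# fpr/tpr trace the curve; the area under it summarizes ranking quality
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(f"AUC = {auc:.4f}")
```

The AUC complements the single-threshold accuracy reported in Table 3, since it measures how well the model ranks positive samples above negative ones across all decision thresholds.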

Discussion
Our results indicate that MegaD provides a quantitative improvement over other available tools on datasets with medium to large sample sizes, such as the Purina and T2D datasets. The main improvement of our tool is the ability to process very large datasets, such as the Purina dataset, which was unable to be run through the other tools tested. However, it does not seem to offer an improvement over other models on the Cirrhosis dataset, which we believe is due to the very small sample size of the dataset containing a total of 25 test samples. This is most likely due to the model not having enough training samples to properly generate an accurate model, while also not having enough testing samples to provide a stable accuracy metric. In comparison, the performance of MegaD on the Purina dataset shows that when given adequate training data, our tool is able to generate a highly accurate model capable of predicting unknown samples with strong confidence. It may be possible to improve our model's ability to learn small datasets by providing an alternate option that uses a simplified network with reduced layers and nodes, which has been shown to perform better for smaller datasets by minimizing the amount of overfitting.
As a metagenomics analysis tool, MegaD provides notable improvements over currently available software tools. Most notably, MegaD offers comparable or better performance than other tools on the tested datasets, which represent a wide range of dataset types. Additionally, because MegaD can save trained models and reload them for rapid prediction, it allows much faster and easier large-scale deployment in clinical settings. Although MegaD lacks the simplified user interface of tools such as MegaR, it provides a sufficiently simple interface and a robust set of instructions that allow use by those without strong bioinformatics expertise.
Additionally, our results show that this method of phenotypic prediction may be useful as a diagnostic tool for certain ailments. Given the high accuracy in predicting obesity in dogs, we believe that other phenotypes with distinct metagenomic profiles can be diagnosed using our methods. Although we are unable to explain why the metagenomes of lean and obese dogs differ, we can show a correlation between microbiome and phenotype that should be explored further.

Conclusions
Utilizing a DNN classifier for phenotypic prediction of unlabeled taxonomic profile data showed very promising results. The main advantage of this type of classifier that we observed was its ability to function on very large datasets with many different levels of classification while maintaining relatively high accuracy. This type of machine learning model also considerably reduces the wall-clock time of prediction tasks compared to random forest and CNN models. In benchmark experiments with three testing datasets, our deep neural network performed better than or comparably to traditional models for phenotypic prediction. This work suggests that deep neural networks are a more robust model for analyzing both large and small metagenomic datasets. This paper shows strong evidence that DNN-based tools such as ours allow users to rapidly perform large-scale analysis of many different types of metagenomic data, filling an important void in the field and opening the door to further research into the relationships between microbes and their environment.