Microbial organisms are ubiquitous in virtually all the natural environments of the earth’s biosphere. They play integral and unique roles in ecosystems [1
], they are involved in the biogeochemical cycling of the earth [2
], and they have a great impact on human health. There is a dynamic equilibrium between the intestinal microflora, the host, and the environment. Once the structure, composition, and function of the microbiota cause alterations in the metabolites of the intestinal microbiota, there can be host diseases such as obesity [3
], malnutrition [4
] and diabetes [5
]; intestinal flora disorders and irritable bowel syndrome [6
], ulcerative colitis and Crohn’s disease [7
] and other chronic bowel diseases, colon cancer [9
] and gastric cancer [11
]. Thus, profiling the taxonomic composition using the microbial abundances of related communities is critical for understanding the microbial ecology of the environment and for human health.
Recent innovations in metagenomic shotgun sequencing have made it possible to characterize microbial contents in uncultured samples by yielding billions of short reads from metagenomes. Many algorithms have been merged to estimate a community’s taxonomic composition by analyzing the metagenomics sequencing data. These algorithms can be classified into two categories, alignment-based or composition-based, according to their different resolutions of taxonomic binning.
The composition-based algorithms classify metagenomic reads mostly according to their k-mer frequencies. A number of unsupervised methods of this kind have been employed for clustering the reads generated from similar taxonomies. For example, TETRA [12
] clusters reads based on Pearson’s correlation coefficients of z-scores, which evaluate the divergence between the observed and expected reads of tetra nucleotide frequencies. CompostBin [13
] and SCIMM [14
] step up by using a weighted PCA algorithm and an interpolated Markov model, respectively, to bin the reads. MetaCluster [15
] involves two phases of clustering to guarantee the accuracy of the read binning. Some supervised methods have also been proposed to consider the available genomic information and to assign taxonomic labels. Various machine learning methods such as the support vector machine (SVM) classifier, naive Bayes classifier, and Gaussian kernel function are employed in Phylopythia [16
], NBC [17
] and TACOA, respectively [18
]. In addition, RAIphy [19
] assigns reads taxonomically according to each read’s k-mer index value sum based on its Relative Abundance Index model. Other composition-based methods directly explore the k-mer sequences of the reads, such as LMAT [20
], Kraken [21
] and CLARK [22
], which assigns taxonomic labels using the lowest common ancestor (LCA) approach based on the matches of the reads’ k-mer sequences with their differently self-built k-mer databases. PhymmBL [23
] and RITA [24
] use hybrid information from BLAST results to improve their accuracy. MetaTopics [25
] can efficiently extract the latent microbial communities by using the topic model, and it reflects the intrinsic relations or interactions among several major microbes.
Old composition-based methods such as TETRA and TACOA show sharply reduced accuracy during the abundance estimation when the lengths of the reads are less than 800 bp or the community complexity of the samples is high. Although it is claimed that newly merged composition-based methods can be applied quickly and accurately to metagenomic datasets from next-generation sequencing with read lengths of less than 300 bp, these methods still do not appear to be widely used in real metagenomic analyses because there is currently an open question about how k-mer compositions are associated with microbial phylogeny. More evidence must be discovered until the compositional characteristics of DNA sequences become solid phylogenetic signals.
In the alignment-based approaches, alignment and mapping tools, such as BLAST [26
], BWA [27
] and Bowtie2 [28
], are used to find similarity hits in metagenomic reads to reference sequences and then assign the mapped reads to the most plausible microbial lineages. Early alignment-based methods simply estimate the microbial abundances according to the mapped counts. Most of these methods focus primarily on precisely filtering mapped reads and employing the lowest common ancestor algorithm. Methods such as MEGAN [29
], CARMA [30
], PaPaRa [31
] and MTR [32
] all have different strict mapping and filtering procedures or preprocessing procedures to achieve an accurate estimation. Then, the reference length normalization and probabilistic model are introduced into the metagenomic reads analysis. GAAS [33
] also improves the microbial abundance estimates by introducing similarity weighting based on alignment e-values. GRAMMy [34
] accurately estimates the microbial species abundances based on the mixture model theory and EM algorithm, while Pathoscope [35
] presents a complete framework of the metagenomic composition analysis. GASic [36
] and MetaMix [37
] add additional probabilistic procedures such as similarity correction by Least Absolute Shrinkage and Selection Operator (LASSO) or model comparison by Markov-Chain Monte Carlo (MCMC) to improve the estimation.
Another way to perform microbial abundance estimations other than introducing the probabilistic model to alignment-based approaches is to substitute traditional complete genome sequences with more condensed and informative marker sequences as references. MetaPhlAn [38
] and mOTU [39
], which employ clade-specific marker genes as reference sequences, have greatly improved the speed and accuracy of analyzing extremely large metagenomic datasets. Clade-specific marker genes can be identified in almost every clade at different microbial taxonomic levels, and thus they are much more representative than 16S rRNA in metagenomic analysis. The use of marker genes as references only accounts for approximately 4% of the sequenced microbial genes, which leads to a significant conservation in alignment time and storage. However, this approach will discard large numbers of reads in the meantime, because most reads will hardly be mapped to marker genes, which is a regretful loss of information from the metagenomic datasets. Moreover, some microbial species may have a rather small percentage of marker genes because of frequent mutations.
Here, we recommend using core genomes as the reference sequences of potential species and estimating the microbial abundances in metagenomic analysis by using the probabilistic model. According to the pan-genome concepts motivated by Tettelin [40
], a clade’s core-genome, which contains genes shared by all the strains within the clade, typically includes the genes responsible for the major phenotypic traits, which account for nearly 8% of the genetic repertoire. In involving clade-specific marker genes as a subset, the core-genome generally contains some genes that may be shared by different clades. The reasons for using core-genomes as references are as follows: First, the coexistence of different subpopulations of a microbial species may be a general feature in highly mixed habitats. Evidence has been discovered of both selected cultured isolates and wild uncultured populations, from marine environments [41
] to human body sites [42
]. The existence of multiple subpopulations or strains, the abundances of which are very likely to vary greatly, often brings extensive genomic diversity and community complexity and leads to inaccurate results when using only several complete genomes or a small percentage of marker genes as references for a clade. However, using core-genomes that consist of the genes shared by all the microbial strains that were studied as references provides us with an opportunity to estimate the relative abundance of a species with a complex substrain composition quickly and accurately. Second, variational subpopulations of a microbial species that have never been sequenced before, the strains of one species that may have high similarity with another species [43
], and some mechanisms such as horizontal gene transfer and lysogeny, which currently tend to be considered to occur more frequently in natural environments than previously thought [44
], are all obstacles to alignment accuracy when using the strains’ complete reference genomes. Otherwise, using core-genomes as references should include the general new strains of the species because most currently identified core genes should also be shared by those strains. Different microbial species with high similarity or exotic sequences would also be addressed because they would differ significantly in their core-genomes. Third, compared to using marker genes as clade references, the shared or redundant genes between different clades in the core-genomes form the basis for introducing probabilistic models, which create a balance between mapping quickly and precisely to the references and utilizing the information hidden in ambiguous reads.
In this paper, we introduce CoreProbe, a relative abundance estimation framework for microbes that employs microbial core-genomes as references for metagenomic analysis. In addition, CoreProbe takes advantage of the mixture model theory and describes the sequencing procedure for metagenomic reads as a generative model to accurately estimate microbial abundances with the Gibbs sampling algorithm [45
]. We first tested CoreProbe both in our own simulated metagenomic read sets using MetaSim [48
] and in third-party synthetic communities [49
]. From these experiments, we can observe that using core-genomes as references in CoreProbe sharply outperforms the use of ordinary complete microbial genomes, whether the specific strains of the reference genomes are “in” the metagenomic datasets or not. Compared to other methods including GRAMMy [34
], Pathoscope [35
], MetaPhlAn [38
] and Kraken [21
], CoreProbe also shows improved accuracy in abundance estimations. We then analyzed 25 real metagenomic read sets from Human Microbiome Project (HMP) (https://portal.hmpdacc.org/
), yielding new insights into microbiomes from different human body sites. Finally, we implemented CoreProbe in C++, and we accessed the Docker image with the following instruction: sudo docker pull panhongfei/coreprobe:1.0.
3.1. The CoreProbe Framework
The primary contents of this article can be summarized as shown in Figure 1
We then created a brief introduction to the specific CoreProbe process, which can be found in Figure 2
3.2. Simulation Result
To evaluate the performance of the CoreProbe framework and compare it with the existing methods, we generated 90 simulated metagenomic datasets. We chose 40 microbial species as embed microorganisms in simulated metagenomic datasets. We aimed to estimate the relative abundances of 10 species, leaving the others as “unknown” species, and we compared the precision of the estimations among the existing methods. We downloaded the current completely sequenced microbial genomes of those species in which sequences may coexist, representing different strains of one species. We then employed MetaSim [48
] to generate simulated reads of the selected genomes with preset relative abundances. Specifically, we built an Empirical Error Model based on the study of the Illumina’s sequencing technology [51
] to generate metagenomic datasets that contained 1000, 2000, 5000, 10,000, 20,000, 50,000, 100,000, 200,000, and 500,000 single-ended reads with an average length of approximately 100 bp, and each dataset had 10 replicates.
When applying CoreProbe to those simulated datasets, we first downloaded all the comprehensive non-redundant reference core gene catalogs, which contained classification information and their corresponding sequences, and they also included over 2800 microbial species. Second, we selected certain microbial species as reference species, and we extracted all the core gene sequences for each reference species in the downloaded core gene catalogue. Third, we concatenated the core gene sequences of a reference species together into one FASTA file as a reference sequence of the species. We then used BWA MEM to map the simulated metagenomic reads against the reference core-genome sequences. All parameters were set as the defaults, and the output consisted of all the alignments in a SAM file.
The output SAM files were then passed to our program to estimate the relative abundances. In addition, we applied the three widely used algorithms for an abundance estimation, including GRAMMy [34
], MetaPhlAn [38
] and Kraken [21
], and then used the microbial whole genome sequences as references for the metagenomic analysis to process the above 90 groups of metagenome simulation data and to estimate the relative abundance of 10 species of microorganisms. In addition, we compared the results using the CoreProbe algorithm to compare the accuracy and speed, to support the effectiveness of this algorithm.
It should be noted that when estimating the relative abundance of microorganisms, two types of whole genome sequenced strains were selected as the reference genome of the above algorithm, namely, (1) the selected strain used as a reference genome is contained in a sequence of strains used to generate simulated metagenomic datasets as ; and (2) the selected strain used as a reference genome is not included in the strain that is used to generate the simulated metagenomic datasets, and it is recorded as . This classification is because there are often new strains of certain microbial species in the natural environment, and the sequence has not been effectively sequenced. The use of as a reference genome is used to simulate these cases.
3.2.1. Comparison of Algorithm Accuracy
We compare the accuracy by using the relative mean square error of the relative abundance of each algorithm, and the results are shown in Figure 3
There are some common characteristics of the two graphs. With the increased capacity of the metagenome reads (i.e., the improvement of the depth of the sequencing), the mean variance of the relative abundance estimation of each algorithm decreases, indicating that the increase in the sequencing depth can improve the accuracy of each algorithm. MetaPhlAn [38
] is rather special, in that the mean variance in the relative abundance values obtained here will begin to rise and then quickly decrease with the increase in the sequencing depth. When the metagenome reading capacity is less than 10,000, the relative abundance of the estimated variance is even more than 100%, indicating that with the increased depth of the sequencing, the accuracy will decrease; and when the reading capacity continues to increase, its mean square error will decline rapidly. Especially when the capacity of reads reaches up to 50,000, MetaPhlAn [38
] is more accurate than Kraken [21
]. Thus, we can observe that the MetaPhlAn algorithm [38
] is not stable from the point of view of the sequencing depth, though it achieves high accuracy when the sequencing depth is sufficient. Kraken [21
] is rarely affected by sequencing, with a mean square deviation remaining between 60% and 70%, indicating that the accuracy of the Kraken [21
] is not high; there is a certain gap between the calculation results and the actual value. GRAMMy [34
] has the highest accuracy; its variance is below 40%, and decreases occur with the increasing sequencing depth. The CoreProbe algorithm mentioned in this paper also uses the whole genome for the reference sequence. Its accuracy is better than that of MetaPhlAn [38
] and Kraken, but there is still a gap with the GRAMMy algorithm [34
]. However, when CoreProbe uses the core-genome sequence as the reference sequence, the accuracy is significantly improved. When the metagenome reads capacity is increased to 50,000, the accuracy is very close to that of GRAMMy [34
], and the mean square error in the preset value is less than 10%.
However, by comparing Figure 3
a and Figure 3
b, the accuracy of each algorithm was found to decrease when the reference genome changes from
. Although the difference is not large, the selected reference genome sequence corresponds to the strains present in the sample environment or not, and they will have some impact on the algorithm. In this paper, CoreProbe uses the core gene sequence as the reference sequence, and thus, the variations in the strain in the real environment and the unknown strain that affect the algorithm are relatively small. Therefore, the relative abundance estimation is made using the microbial core genome as the reference. When the actual environment has not yet sequenced new strains, the algorithm estimates of the relative abundance for the accuracy can be more credible.
3.2.2. Comparison of Algorithm Speeds
The various algorithms mentioned in this paper are used in different algorithms, and during the process of dealing with data, there are different sequence mapping algorithms; for example, Kraken [21
], MetaPhlAn [38
], and GRAMMy [34
] correspond to BLAST [26
], Bowtie2 [28
], and BWA [27
]. Thus, the time spent on the calculation process (the sum of time used by the sequence alignment and the statistical algorithm) will be different, and we used the line graph to show the efficiency of each algorithm, as shown in Figure 4
The overall situation of the two figures is similar. As shown in Figure 4
a,b, with the increased capacity of the metagenome (i.e., the improvement of the depth of the sequencing), the relative abundance estimation time of each algorithm is increased. However, the Kraken [21
] is special, because the time it consumes does not increase as the depth of the sequencing increases, and the time was almost maintained at approximately ten seconds. GRAMMy [34
] is the most accurate, but it is time-consuming, and especially when the sequencing depth increases, the growth is faster than it is in other algorithms. MetaPhlAn [38
] is significantly more time-consuming than the other algorithms, and the computational accuracy is lower than that of other algorithms. When the sequencing depth increases, its required time starts to increase significantly, and its accuracy also starts to improve significantly. In addition, our method, CoreProbe, is the least time-consuming, and it only needs more time than Kraken [21
] when the number of reads goes up to 200,000. At the same time, by comparing Figure 4
a and Figure 4
b when the reference genome changes from to , the speed of each algorithm is almost unaffected. This finding shows that each algorithm is relatively stable, and when the amount of data is equivalent, the calculation process will not change the necessary time. Because the microbial core genome is the gene that is common to all the strains in the clade, the number is much smaller than that of the whole genome, and thus the computational time can be dramatically reduced and the relative abundance can be estimated more efficiently. As shown in the figure, when the core genome was used as a reference sequence, the time consumption was less than that of the metagenomes
3.3. Real Metagenomic Datasets Analysis
As an immediate application, we applied this CoreProbe pipeline to a set of actual data. These data come from the HMP, and its sub-database HMIWGS contains 764 groups of high-throughput metagenomic sample data from 16 different human tissues via the Illumina platforms. In this paper, five different human tissues were selected, including the anterior nares, buccal mucosa, posterior fornix, right retro-auricular crease, stool, and five different genomic samples that were randomly selected from each organization. The metagenome data in this paper have passed the quality control (QC) test of the HMP. We intended to use these data to estimate the relative abundance of microorganisms and to classify the metagenomic data. Thus, no additional preprocessing of the above metagenome data was performed.
In this section, 150 common microbes were selected, and the gene sequences belonging to their core genomes were selected from the MetaRef database. Their core gene sequences were grouped together to form a FASTA formatted file as the reference genome sequence of the corresponding microorganism. In this paper, we used the BWA MEM program to compare the sequence of the actual genome reading with the reference genome. This parameter was used to locate the default value and output all the matching results. Finally, the CoreProbe algorithm implemented in this paper was used to estimate the relative abundance of microbes in each sample.
On this basis, the relative abundance of the obtained microbes was analyzed by heat map. In Figure 5
, the buccal mucosa, vaginal dome and stool samples were basically clustered into one class, indicating that the relative abundance of microbes showed similarities within their respective tissues and had significant differences from other human tissues, which is similar to the results of other studies [54
]. In addition, the anterior nares and the right retro-auricular crease corresponding to the metagenomic samples have been doped, and they cannot be clustered well, indicating that the relative microbial abundance between the two human tissues is of great similarity, which may result from close contact between the anterior nares and the right retro-auricular crease and the lesser secretion of body fluid. In addition, there are more white cells relative to the posterior fornix, indicating that the microbial species of this tissue is less than that of others; a small number of cells presented a violet color, and the rest were generally shallow, indicating that the tissue often consists of a small number of microorganisms as the dominant population, and the remaining microbial content is scant. Studies have shown that this finding is due to the important position of the female vagina in the human reproductive system, and its need to inhibit the growth of microbial species. In addition, according to its dominant microbial species, they can be divided into three types of microbial community types [55
]. In the classification tree on the left side of the thermogram, it is also clear that the posterior fornix sample is divided into three sub-trees.
Through the above analysis, we can see that the CoreProbe algorithm can accurately and quickly estimate the relative abundance of microbes and analyze them by heat map, which can distinguish among the different genomic samples from different tissues, and it can be used to find the differences from and links in microbial distributions between different tissues. It is of great significance to understand the relationship between human health and microorganisms.
We developed the CoreProbe framework to estimate the relative abundance of genomes. This approach has three unique advantages. First, a complete probability model is established for the relative abundance of microbes. The distribution and structure of microbes are simulated on the basis of this finding; thus, the relative abundance of various microorganisms can be estimated more accurately.
Second, CoreProbe uses the stochastic algorithm of the Gibbs sampling algorithm as an alternative to the deterministic algorithm of statistical reasoning (such as the expectation maximization algorithm) to achieve a better fuzzy allocation of the readings, which makes the method particularly suitable for short read data sets. The distribution of ambiguous sources may be sequencing errors, genetic variation, horizontal gene transfer or closely related genomes. The Gibbs sampling algorithm is a good solution to this series of problems.
Third, due to the changes in microorganisms in the actual environment to adapt to the changes in the surrounding ecology, an abnormal mutation mechanism is evolved, and the genomic polymorphism at the strain level is not uncommon. An environment often contains different strains of a certain microorganism, and the relative abundance of these strains is inconsistent. For the strains that have never been found and sequenced, these factors have greatly affected the accuracy of the relative abundance statistics algorithm for the genome sequence of the few sequenced strains as the reference genome. The core genome is a collection of all the genes of all the strains, which maintain high stability during microbial variation [56
In using the microbial core genome as a reference genome, the effect of genomic polymorphism and strain diversity on the sequence alignment process can be eliminated to a greater extent, and the accuracy of the relative abundance estimation can be improved. In addition, the core genome can be used as a common gene set of all the strains, accounting for only approximately 8% of all the gene sets of the microbial species. Compared with the algorithm that uses the whole genome of the sequenced strain as a reference gene, the efficiency of the comparison is greatly improved.
In summary, the CoreProbe method we provide is likely to provide a more accurate and efficient estimate of the relative abundance estimates of microbes and to uncover a new direction for estimating abundance.