1. Introduction
The human microbiome is a gathering of microbes that inhabit different sites of the human body, including bacteria, archaea, and fung. The research on low-price high-throughput sequencing analyses has substantially facilitated the characterization of the compositions and diversities of the human microbiome. It provides a chance to carry out large-scale population-based microbiome studies to identify novel risk factors and to improve the risk prediction of diseases. Over the past two decades, increased research has focused on studying the structure, function, and dynamics of microbiome data, referring to all of the taxa and their genes in a well-defined environment [
1]. Technical advancements and reduced costs in sequencing have led to large-scale studies such as the human microbiome project (HMP) and the American gut project (AGP), which characterize the microbiome of largely healthy individuals [
2,
3]. Most of the early research focused on microbial diversity and taxonomic classification. More recently, research has focused on differential abundance analysis and understanding how the host environment is associated with the microbiome [
4,
5,
6,
7]. The human microbiome is known to be associated with complex diseases, such as obesity, inflammatory bowel disease, and rheumatoid arthritis [
8,
9,
10].
Despite these advances, little is known about inter-microbial interactions. Much of the lack of information on microbial interactions comes from the fact that most standard statistical methods for correlation or network analysis cannot be directly applied to such high-dimensional, compositional, and sparse data. To analyze microbiome data, there are two methods in the literature for modeling the distributions of the relative abundance data of many taxa. One method is utilizing multivariate normal distributions to model correlation structures of taxa by means of the centred log ratio transformation (CLR) or alternative log ratio transformations (ALR), and the other is using Dirichlet multinomial distribution to model taxa counts [
11,
12]. Dirichlet distributions are not quite flexible enough to fit many real data sets in microbiome studies. Moreover, the lack of dependence structure analysis hinders further statistical inferences in microbiome studies [
13].
For modeling microbial dependence structures with multivariate normal distributions, two of the most common methods, SparCC and SPIEC-EASI, are used. SparCC uses the additive log-ratio transformations and assumes that the underlying unobserved counts are correlated through log-ratio [
14]. In contrast, SPIEC-EASI uses the CLR and assumes that the network of interactions is generated from a precision matrix [
15]. While the log-ratio transformations are common in compositional data analysis, they are not particularly fitting for data with excessive zeros [
16], as is the case with microbial sequencing data. The normality assumption of such data often does not hold (see, e.g., 
Figure 1a below), and they require the use of pseudo-counts, thus forcing the assumption that the true absolute abundance (AA) for every taxa is non-zero in each sample. 
Figure 1 shows that there are large departures between the estimated contour plots of samples and the contour plots based on the estimated normal distribution with CLR or the estimated Dirichlet distribution. Furthermore, the corresponding results from such transformations are difficult to interpret and are sensitive to the choice of reference groups; as Feng et al. [
17] pointed out, the results of standard statistical tests performed on log-transformed data are often not relevant for the original, non-transformed data.
To model the high-dimensional and compositional data in microbiome studies, we propose two classes of multivariate distributions on a simplex. A new class of multivariate distributions on a simplex is composed of variable transformation of inverse gamma distributions since most datasets show that a single taxonomy can be characterized by an inverse Gamma distribution. Another new class of multivariate distributions on simplex is constructed by means of copula techniques. Copulas are functions that join univariate uniform distribution functions to form multivariate distributions and are powerful for characterizing the dependence structures of random variables [
18].
Furthermore, human microbiota can be influenced by environment changes and life-style changes. Microbiome data are compositional and sparse. A few taxa are abundant, most taxa are rare (present in <10% subjects), and there are zero counts for some taxa. This could be caused by sampling zero (low sequence depth) or structural zeros (absent taxa). To deal with sampling zeros, we could enlarge the sequencing depth. However, in our settings, we ignore sampling zeros and only deal with structural zeros by zero-inflated distributions.
The remainder of this paper is organized as follows. 
Section 2 constructs the new class of multivariate distributions on a simplex by variable transformation of inverse gamma distributions. In 
Section 3, using the copula techniques, we propose a new class of multivariate distributions on a simplex with a Gaussian copula and a Farlie–Gumbel–Morgenstern copula. The real data from the American gut project (AGP) is analyzed to demonstrate the effectiveness of the proposed new multivariate distributions in 
Section 4, and some conclusions are given in 
Section 5.
  2. Multivariate Distributions on Simplex via Inverse Gamma Distributions
Suppose there are 
 microbial taxa, and denote 
 as the true relative abundance (RA) of microbial taxa. The RA has a compositional structure with 
, and the vector 
 lies on the 
k-dimensional simplex 
,
      
To construct the distributions of 
, the inverse Gamma distributions on 
 are employed since most datasets show that a single taxonomy can be characterized by an inverse Gamma distribution (IG). Let 
 be independent random variables on 
 and have the inverse Gamma distributions 
 with its density function
      
      where 
 is the shape parameter and 
 are the scale parameters for 
. For a given subject, we assume that the 
k taxa have the same shape parameter 
, but the scale parameters 
 vary. With the following transformation,
      
The RA of microbial taxa 
 has the multivariate distribution on the 
k-dimensional simplex 
 with its density function as
      
      for 
 and 
, 
 for 
. We denote 
. The special case of this multivariate distribution when 
 was first introduced by Carlton [
19] and further discussed in Favaro et al. [
20].
For the density function (
4), the marginal density functions are
      
      for 
. 
Figure 2 shows the estimated marginal density functions by (
5) are fitting well for the sample density functions of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP.
In the studies of the human microbiome, it is more expedient to consider the average total positivity [
21] (and reverse regularity) of order two microbial taxa. Kendall’s correlation coefficient is a measure of the difference between the probabilities of concordance and discordance for two independent and identically distributed pairs of random variables. It is the difference between the probabilities of concordance and discordance for two independent and identically distributed pairs of random variables [
22,
23]. Let 
 with the density given in (
4). Because of the transformation (
3), the corresponding Kendall correlation coefficient of 
 and 
 is given by
      
      where 
 is an independent copy of 
 and 
 is an independent copy of 
 for 
, 
. Since
      
      then, the Kendall correlation coefficient of 
 and 
 is the same as that of two independent IG random variables 
 and 
, and we have that
      
      which is only dependent on the parameter 
, for 
, 
.
  2.1. Paramter Estimation
Let 
 be 
i.i.d samples from 
. The log-likelihood function of parameters 
 is given by
        
        and we obtain the score functions
        
        where 
 is the derivative function of 
. The maximum likelihood estimation (MLE) of 
 can be obtained by an iterative algorithm.
  2.2. Multivariate Goodness–of–Fit Testing
For given 
n samples 
 in the 
k-dimensional simplex 
, we can obtain the estimated parameters 
. A goodness-of-testing procedure is necessary to test the hypothesis that the samples 
 come from the distribution 
. For one-dimensional or low-dimensional data, there are a wide variety of useful and powerful goodness-of-fit and two-sample testing procedures, such as Kolmogorov–Smirnov (K-S) test. However, for high-dimensional (
) data, those tests rapidly lose statistical power because all finite samples are sparse in high-dimensional settings owing to the “curse of dimensionality” [
24]. In this paper, a machine learning method for two-sample testing is employed, which was proposed by Friedman [
24] and further developed by Gretton et al. [
25]. For the given observed samples 
, we can obtain a set of score values 
, where 
 is the cdf of 
. At same time, with the “Monte Carlo” samples 
 from the distribution 
, we can obtain another set of score values 
. Then, a univariate two-sample test, such as the K-S test, chi–squared test, or Mann–Whitney test, can be used to test the goodness of fit based on the scores. The testing procedure is as follows:
- Step 1.
- Generating  n-  samples  -  from the distribution  -  for  - . With the transform ( 3- ), we can obtain  n-  samples  -  from the distribution  -  such that  - ,  - ,  - , for  - , and  - . 
- Step 2.
- By denoting  the cumulative distribution function (cdf) of , we obtain two samples  and . 
- Step 3.
- Using the one-dimensional two-sample test, such as the K-S test, chi–squared test, or Mann–Whitney test, to test the hypothesis that the two samples  and  have the same distribution. 
- Step 4.
- Repeating N (for example, ) times of Steps 1–3, we cannot reject the null hypothesis if the percent of rejecting times of testing in Step 3 (i.e., the p-value < 0.05) is less than 5%. 
Friedman showed that the power of this testing procedurecan be highly sensitive to the learning machine methods that are employed [
24].
  2.3. Zero-Inflated Distribution with Zero Counts for Absent Taxa
The above multivariate distribution assumes that all taxa have positive proportions. However, for a given taxonomy 
j, they may be absent in some subjects. To consider the zero-inflated structures, we introduce the zero-inflated inverse Gamma distributions [
26]. Let 
 be independent random variables on 
 and 
, where 
 follows a Bernoulli distribution 
 with 
 and 
, and 
 follows a inverse Gamma distribution 
 for 
. Then, 
 has a zero-inflated inverse Gamma distribution 
 with the probability density
        
        for 
. With the transformation
        
The RA of microbial taxa 
 has the multivariate distribution on the 
k-dimensional simplex 
 with its density function as
        
        where 
 for 
 and 
, 
 for 
.
  3. Multivariate Distributions on Simplex Using Copula Dependence Structures
Since the 
 microbial taxa 
 lie on the 
k-dimensional simplex 
 defined in (
1), 
 is a 
k-dimensional random vector. To characterize the dependence structure of the 
k-dimensional random vector, copula techniques are usually employed [
18,
23]. Copulas are multivariate distributions with uniformly distributed marginals on 
. Utilizing copulas, the density function of a multivariate distribution can be decomposed into the density weighting function and the product of marginal densities [
27]. For a 
d-variate random vector 
 with joint cdf 
 and marginal strictly increasing cdf 
 (
), Sklars’s Theorem [
28] states that there a 
d-variate copula cdf 
 on 
 exists such that
      
      for all 
 and 
, where 
 is the inverse function of 
. Let 
 and 
 be the density function and the marginal density functions of 
, respectively. The corresponding copula density and the density function of 
 have the following relationships:
The distributions of individual taxonomy are easily estimated, for example, Qiu et al. [
13] proposed two-parametric simplex distribution with its density function,
      
      where 
 are shaping parameters for 
. In this paper, we propose two families of copulas, Gaussian and Farlie–Gumbel–Morgenstern copulas, to characterize the dependence structures of microbial taxa such that the corresponding multivariate distributions are good-of-fit for a given microbial data.
  3.1. Gaussian Copula
Many new multivariate distributions are constructed based on the dependence structures of multivariate normal distributions, which are called Gaussian copulas. Let 
 with mean vector 
 and the 
 covariance matrix 
R,
        
Denote the cdf of 
 by 
; the corresponding copula of 
 is given by
        
        for 
, where 
 is the cdf of standard normal distribution 
.
Suppose that the marginal distributions of the 
 microbial taxa 
 are 
 the cdf of 
 for 
, and 
, the distribution function of 
 can be constructed with the copula (
16) to be
        
Fang et al. [
18] show that Kendall’s correlation coefficient 
 of 
 and 
 is given by
        
        which depends only on 
. Thus, the estimates of 
 can be obtained by the estimators of Kendall’s correlation coefficient 
 of 
 and 
.
  3.2. Farlie–Gumbel–Morgenstern (FGM) Copula
As an alternative to a multivariate normal distribution, the FGM copula has been widely studied and applied to statistical modeling in various research fields [
29,
30,
31]. Let 
 be a random vector that follows a 
k-dimensional FGM copula with its cdf 
,
        
        where 
 and 
 are dependence parameters since 
 means that the random variables 
 are obviously independent. According to Johnson and Kotz [
30], the 
 parameters 
 have the following constraints:
        for 
. With the cdf (
19), we have the joint-density function of 
 given by
        
For the given marginal distributions 
 of the cdf of 
 for 
, the distribution function of microbial taxa 
 can be constructed with the copula (
19) to be
        
        where 
 for 
, and 
. The Kendall’s correlation coefficient 
 of 
 and 
 is given by
        
        which depends only on 
.
Thus, the estimates of 
 can be obtained by the estimators of Kendall’s correlation coefficient 
 of 
 and 
, but all of the parameters 
 should satisfy the constraints in (
20). Shih et al. [
32,
33] provided the maximum likelihood estimates of 
 with the constrains in (
20).
  3.3. Multivariate Distributions with Zero-Inflated Marginal Distributions
Let the 
 microbial taxa 
 lie on the 
k-dimensional simplex 
 defined in (
1). When there are proportions with zero counts for absent taxa, the zero and the non-zero responses are assumed to occur from two different processes; then, we could use two-part models to characterize the marginal distributions. Define that 
 and 
. Let 
 and 
 be the pdf and cdf of 
, respectively, where 
 is a parameter (or a parametric vector). Assume that 
 with 
 and 
. Then, the marginal random variable 
 has the probability density,
        
        for 
. To construct the multivariate distribution of 
, we assume that 
 have the multinomial distribution with its probability mass function
        
        and the joint cdf of 
 is given by
        
        where 
 is the density function of a copula 
. Denote 
 or 
, where 
 and 
R are the parameters in the given copula function. Thus, the joint probability function of 
 is given by
        
        where 
 for 
, 
 is the density function of 
 at 
 for 
 and 
. Note that 
 and the marginal cdf 
 are also a copula.
  3.4. Parameter Estimation and Goodness–of–Fit Testing
Let 
 be 
i.i.d samples from the multivariate distribution with its pdf in (
27). To obtain the maximum likelihood estimates of parameters 
, a two-stage approach is employed [
34]. The marginal distributions are first estimated based on partial maximum likelihood methods. Given these marginal fits, the estimated cumulative distribution functions are then obtained using the estimates of the parameters in copula. Denote 
 for 
, 
. The estimates of 
’s can be obtained by
        
        and the estimates of 
’s maximize the log-likelihood functions.
        
Then, the estimates of the parameters in copula can be obtained by maximizing the likelihood function
        
When 
 is an FGM copula, Shih et al. [
33] provides the maximum likelihood estimates of 
. If 
 is a Gaussian copula, the parameters 
 can be estimated by the corresponding estimates of Kendall’s correlation coefficient 
 based on (
23).
When we obtain the estimates of parameters 
, the machine learning method given in 
Section 2.1 can be used for goodness-of-fit testing of the proposed multivariate distributions. Shih et al. [
33] provides a procedure of generating samples from FGM copulas, and Fang et al. [
18] gives a procedure of generating samples from Gaussian copulas.
  4. Real Data Analysis
Now we apply our proposed methods to the datasets from AGP, a self-selected, open-platform cohort [
2]. The cohort consists of individuals mostly from the United States, with some from the United Kingdom and Australia, who opted into the study by providing informed consent and paying a fee to offset the cost of processing and sequencing. The data, both 16S rRNA gene sequencing and self-reported meta-data, are publicly available in The European Bioinformatics Institute repository under the accession ERP012803.
We focus on the fecal microbiome samples and restrict our samples to individuals who reported not having inflammatory bowel disease or diabetes. We filtered data that showed that OTUs were removed if present in fewer than 37% of samples. Samples were removed if total sequencing depth fell below 10,800 sequence reads [
15]. After filtering, there were 127 OTUs and 289 samples remaining in our dataset.
To illustrate our methods, the analysis results of two OTUs are shown only in this section. As shown in 
Figure 2, the sample density functions of Faecalibacterium (OTU1) and Bacteroides (OTU2) from AGP are fitted well by the marginal distributions of MSIG distribution, where the parameters 
 are estimated to be 
. The parameters for the fitted Dirichlet distribution are estimated to be 
, 
. 
Figure 3 shows the contour plots from the estimated Dirichlet distribution (DD) and MSIG, respectively. The p-values for the GOF test are 0.001 versus 0.567 between DD and MSIG, which show that the proposed MSIG is goodness-of-fitting the data. Furthermore, the Kendall’s correlation coefficient between OTU1 and OTU2 is estimated to be 
 based on the samples, but the corresponding Kendall’s 
 are calculated to be 
 with DD and 
 with MSIG, respectively.
Black lines in both panels are the estimated contour plots for samples. Red lines in the left panel are the contour plots of the estimated Dirichlet distribution. Blue lines in the right panel are the contour plots of the estimated MSIG. These figures show that MSIG is a goodness-of-fit for the data.
Some individual taxa can not be fitted by the marginal distribution of MSIG. 
Figure 4a,b shows that taxa Clostridium (OTU3) and Escherichia (OTU4) are fitted by the two-parametric simplex distributions [
13], where the corresponding parameters are estimated to be 
 and 
. The Kendall’s correlation coefficient between OTU3 and OTU4 is estimated to be 
 based on the samples. We use the Gaussian copula (GC) to fit the joint distribution of OTU3 and OTU4 with 
. The contour plots in 
Figure 4c show that the proposed multivariate distribution via Gaussian copula fitted the joint distribution of OTU3 and OTU4 well, and the 
p-value for GOF test is 0.647. However, 
Figure 1 (the right panel) shows that the contour plots of samples are a departure from that of the Dirichlet distribution, of which the corresponding parameters are estimated to be 
; the Kendall’s 
, and the 
p-value for the GOF test is 0.001.
  5. Discussions
The human microbiome has been recently shown to be associated with disease risks and has important implications in risk stratification and precision medicine. The existing methods for modeling microbiome dynamics focus on abundant taxa and ignore rare taxa. Modeling both will be crucial to provide information about microbial causal interactions and information for designing effective large-scale epidemiology. In this paper, we propose two classes of multivariate distributions to characterize the dependence structures among taxa in microbiome studies. Multivariate distributions on simplex via inverse Gamma distributions (MSIG) are useful in regression analysis due to the close-form of their density functions, but all of the Kendall’s correlation coefficients between two taxa are assumed to be the same. The copula techniques are very useful to characterize the dependence structures of random variables. We proposed two most popular copulas, Gaussian and Farlie–Gumbel–Morgenstern copulas, to construct the multivariate distributions of taxa. The data analysis shows that our methods perform well.
In this paper, we consider only the dependence structures of taxa at the fixed time points. Some new methodologies for longitudinal observations to study the dynamics of the microbiome community have to be developed, which will investigated in the future.