Estimating Gaussian Copulas with Missing Data with and without Expert Knowledge

In this work, we present a rigorous application of the Expectation-Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modeling and outline how expert knowledge on the marginals and the dependency structure can be included. A simulation study shows that the distribution learned through this algorithm is closer to the true distribution than that obtained with existing methods and that the incorporation of domain knowledge provides benefits.


Introduction
Even though the amount of data is increasing due to new technologies, big data are by no means good data. For example, missing values are ubiquitous in various fields, from the social sciences [1] to manufacturing [2]. For exploratory analysis or decision making, one is often interested in the joint distribution of a multivariate dataset, and its estimation is a central topic in statistics [3]. At the same time, there exists background knowledge in many domains that can help to compensate for the potential shortcomings of datasets. For instance, domain experts have an understanding of the causal relationships in the data generation process [4]. The scope of this paper is to unify expert knowledge and datasets with missing data in order to derive approximations of the underlying joint distribution.
To estimate the multivariate distribution, we use copulas, where the dependence structure is assumed to belong to a parametric family, while the marginals are estimated nonparametrically. Genest et al. [5] showed that for complete datasets, a two-step approach consisting of the estimation of the marginals with an empirical cumulative distribution function (ecdf) and subsequent derivation of the dependence structure is consistent. This idea is even transferable to high dimensions [6].
In the case of missing values, the situation becomes more complex. Here, nonparametric methods do not scale well with the number of dimensions [7]. On the other hand, assuming that the distribution belongs to a parametric family, it can often be derived by using the EM algorithm [8]. However, this assumption is, in general, restrictive. Due to the encouraging results for complete datasets, there have been several works that have investigated the estimation of the joint distribution under a copula model. The authors of [9,10] even discussed the estimation in a missing-not-at-random (MNAR) setting. While MNAR is less restrictive than missing at random (MAR), it demands the explicit modeling of the missing mechanism [11]. On the contrary, the authors of [12,13] provided results in cases in which data were missing completely at random (MCAR). This strong assumption is rarely fulfilled in practice. Therefore, we assume an MAR mechanism in what follows [11].
Another interesting contribution [14] assumed external covariates such that the probability of a missing value depended exclusively on them and not on the variables under investigation. They applied inverse probability weighting (IPW) and the two-step approach of [5]. While they proved a consistency result, it is unclear how this approach can be adapted to a setting without those covariates. IPW for general missing patterns is computationally demanding, and no software exists [15,16]. Thus, IPW is mostly applied with monotone missing patterns that appear, for example, in longitudinal studies [17]. The popular work of [18] proposed an EM algorithm in order to derive the joint distribution in a Gaussian copula model with data MAR [11]. However, their approach had two weaknesses:
• The presented algorithm was inexact. Among other things, it simplified by assuming that the marginals and the copula could be estimated separately (compare Equation (6) in [18] and Equation (11) in this paper).
• The description of the simulation study was incomplete, and the results were not reproducible.
The aim of this paper is to close these gaps, and our contributions are the following:
1. We give a rigorous derivation of the EM algorithm under a Gaussian copula model. Similarly to [5], it consists of two separate steps, which estimate the marginals and the copula, respectively. However, these two steps alternate.
2. We show how prior knowledge about the marginals and the dependency structure can be utilized in order to achieve better results.
3. We propose a flexible parametrization of the marginals when a priori knowledge is absent. This allows us to learn the underlying marginal distributions; see Figure 1.
4. We provide a Python library that implements the proposed algorithm.
The structure of this paper is as follows. In Section 2, we review some background information about the Gaussian copula. We proceed by presenting the method (Section 3). In Section 4, we investigate its performance and the effect of domain knowledge in simulation studies. We conclude in Section 5. All technical aspects and proofs in this paper are given in Appendices A and B.

Figure 1. Estimated distribution functions (purple line) for the marginals X_i, i = 1, 2, and the truth (F_i, green line) of a two-dimensional example dataset generated as described in Section 4.2 with N = 200, ρ = 0.5, and β = (0, 2).

Notation and Assumptions
In the following, we consider a p-dimensional dataset {x_1, . . . , x_N} ⊂ R^p of size N, where x_1, . . . , x_N are i.i.d. samples from a p-dimensional random vector X = (X_1, . . . , X_p) with a joint distribution function F and marginal distribution functions F_1, . . . , F_p. We denote the entries of x_ℓ by x_ℓ = (x_{ℓ1}, . . . , x_{ℓp}) for all ℓ = 1, . . . , N. The parameters of the marginals are represented by θ = (θ_1, . . . , θ_p), where θ_j is the parameter of F_j, so we write F_j^{θ_j}, where θ_j can be a vector itself.
For ℓ ∈ {1, . . . , N}, we define obs(ℓ) ⊂ {1, . . . , p} as the index set of the observed and mis(ℓ) ⊂ {1, . . . , p} as the index set of the missing columns of x_ℓ. Hence, mis(ℓ) ∪ obs(ℓ) = {1, . . . , p} and mis(ℓ) ∩ obs(ℓ) = ∅. R = (R_1, . . . , R_p) ∈ {0, 1}^p is a random vector for which R_i = 0 if X_i is missing and R_i = 1 if X_i can be observed. Further, we define φ to be the density function and Φ to be the distribution function of the one-dimensional standard normal distribution. Φ_{µ,Σ} stands for the distribution function of a p-variate normal distribution with covariance Σ ∈ R^{p×p} and mean µ ∈ R^p. To simplify the notation, we define Φ_Σ := Φ_{0,Σ}. For a matrix A ∈ R^{p×p}, the entry of the i-th row and the j-th column is denoted by A_{ij}, while for index sets S, T ⊂ {1, . . . , p}, A_{S,T} is the submatrix of A with the row numbers in S and column numbers in T. For a (random) vector x (X), x_S (X_S) is the subvector containing the entries with indices in S.
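The submatrix and subvector notation maps directly onto array indexing; a minimal illustration in Python (using NumPy, 0-based indices, and hypothetical values):

```python
import numpy as np

# Hypothetical 3x3 matrix standing in for A; S and T are index sets
# (0-based here, whereas the text uses 1-based indices).
A = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.0]])

S = [0, 2]               # row indices to keep
T = [1, 2]               # column indices to keep
A_ST = A[np.ix_(S, T)]   # the submatrix A_{S,T}

x = np.array([10.0, 20.0, 30.0])
x_S = x[S]               # the subvector x_S
```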
Throughout, we assume F to be strictly increasing and continuous in every component. Therefore, F_j is strictly increasing and continuous for all j ∈ {1, . . . , p}, and so is the existing inverse F_j^{-1}. This work assumes that data are Missing at Random (MAR), as defined by [11], i.e.,

P(R = r | X_{-r} = x_{-r}, X_r = x_r) = P(R = r | X_r = x_r),

where X_r := X_{{i: r_i = 1}} are the observed and X_{-r} := X_{{i: r_i = 0}} are the missing entries of X.

Properties
Sklar's theorem [25] decomposes F into its marginals F_1, . . . , F_p and its dependency structure C via

F(x_1, . . . , x_p) = C(F_1(x_1), . . . , F_p(x_p)).

Here, C is a copula, which means it is a p-dimensional distribution function with support [0, 1]^p whose marginal distributions are uniform. In this paper, we focus on Gaussian copulas, where

C_Σ(u_1, . . . , u_p) = Φ_Σ(Φ^{-1}(u_1), . . . , Φ^{-1}(u_p)),

and Σ is a covariance matrix with Σ_jj = 1 for all j ∈ {1, . . . , p}. Beyond all multivariate normal distributions, there are distributions with non-normal marginals whose copula is Gaussian. Hence, the Gaussian copula model provides an extension of the normality assumption. Consider a random vector X whose copula is C_Σ. Under the transformation

Z := (Φ^{-1}(F_1(X_1)), . . . , Φ^{-1}(F_p(X_p))),

it holds that Z ∼ N(0, Σ). Proposition 1 shows that the conditional distribution's copula is Gaussian as well. More importantly, we can derive an algorithm for sampling from the conditional distribution.

Algorithm 1: Sampling from the conditional distribution of a Gaussian copula
Input: x_S, Σ, F_1, . . . , F_p
Result: m samples of X_T | X_S = x_S
1. Calculate z_S := Φ^{-1}(F_S(x_S));
2. Calculate µ̃ and Σ̃ as in Proposition 1 using z_S and Σ;
3. Draw z̃_1, . . . , z̃_m from N(µ̃, Σ̃);
4. Return x̃_k := F_T^{-1}(Φ(z̃_k)) for k = 1, . . . , m.

The very last step follows with Proposition 1, as it holds for any measurable A ⊂ R^k that

P(X_T ∈ A | X_S = x_S) = P(F_T^{-1}(Φ(Z_T)) ∈ A | Z_S = z_S).
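Algorithm 1 can be sketched in Python as follows. The function name and the use of SciPy marginal objects are our own choices; the conditional moments attributed to Proposition 1 are the standard Gaussian conditioning formulas:

```python
import numpy as np
from scipy import stats

def sample_conditional_copula(x_S, S, T, Sigma, cdfs, ppfs, m, rng):
    """Sketch of Algorithm 1: draw m samples of X_T | X_S = x_S under a
    Gaussian copula with correlation Sigma. `cdfs`/`ppfs` hold the marginal
    distribution and quantile functions (assumed known here)."""
    # Map observed values to the latent Gaussian scale: z_S = Phi^{-1}(F_S(x_S)).
    z_S = stats.norm.ppf([cdfs[j](x_S[i]) for i, j in enumerate(S)])
    # Conditional mean and covariance of the latent normal (standard formulas).
    Sig_SS = Sigma[np.ix_(S, S)]
    Sig_TS = Sigma[np.ix_(T, S)]
    Sig_TT = Sigma[np.ix_(T, T)]
    W = Sig_TS @ np.linalg.inv(Sig_SS)
    mu = W @ z_S
    cov = Sig_TT - W @ Sig_TS.T
    # Sample latent normals and map back through the marginal quantiles.
    z_T = rng.multivariate_normal(mu, cov, size=m)
    u_T = stats.norm.cdf(z_T)
    return np.column_stack([ppfs[j](u_T[:, i]) for i, j in enumerate(T)])
```

For instance, with chi-squared marginals and a 2x2 correlation matrix, the call returns an (m, |T|) array of draws from X_T | X_S = x_S.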

The EM Algorithm
Let {y_1, . . . , y_N} ⊂ R^p be a dataset following a distribution with parameter ψ and corresponding density function g_ψ(·), where observations are MAR. The EM algorithm [8] finds a local optimum of the log-likelihood function

ℓ(ψ) = Σ_{ℓ=1}^{N} log ∫ g_ψ(y_ℓ) dy_{mis(ℓ)}.

After choosing a start value ψ_0, it does so by iterating the following two steps:

E-step: Compute Q(ψ | ψ_t) := Σ_{ℓ=1}^{N} E_{ψ_t}[log g_ψ(Y_ℓ) | y_{obs(ℓ)}].
M-step: Set ψ_{t+1} := argmax_ψ Q(ψ | ψ_t).

Applying the ECM Algorithm on the Gaussian Copula Model
As we need a full parametrization of the Gaussian copula model for the EM algorithm, we assume parametric marginal distributions F_j^{θ_j} with densities f_j^{θ_j}. By Equation (5), the joint density with respect to the parameters θ = (θ_1, . . . , θ_p) and Σ has the form

f_{θ,Σ}(x) = φ_Σ(z^θ(x)) ∏_{j=1}^{p} f_j^{θ_j}(x_j) / φ(z_j^θ(x)),

where z^θ(x) := (Φ^{-1}(F_1^{θ_1}(x_1)), . . . , Φ^{-1}(F_p^{θ_p}(x_p))). Section 3.3 will describe how we can keep the flexibility for the marginals despite the parametrization. However, first, we outline the EM algorithm for general parametric marginal distributions.

E-Step
Set K := Σ^{-1} and K_t := Σ_t^{-1}. For simplicity, we pick one summand in Equation (7). By Equations (7) and (9), it holds with ψ = (θ, Σ) and x_ℓ taking the role of y_ℓ that the expected log-likelihood of Equation (10) splits into three summands. The first and last summands depend only on Σ and θ, respectively. Thus, of special interest is the second summand, for which we obtain Equation (11) with Proposition 1. At this point, the authors of [18] neglected that, in general, Equation (11) depends not only on Σ, but also on θ. This led us to reconsider their approach, as we describe below.

M-Step
The joint optimization with respect to θ and Σ is difficult, as there is no closed form for Equation (10). We circumvent this problem by sequentially optimizing with respect to Σ and θ by applying the ECM algorithm. The maximization routine is the following.

1. Fix θ = θ_t and set Σ_{t+1} as the maximizer of Equation (10) with respect to Σ.
2. Fix Σ = Σ_{t+1} and set θ_{t+1} as the maximizer of Equation (10) with respect to θ.

This is a two-step approach consisting of estimating the copula first and the marginals second. However, both steps are executed iteratively, which is typical for the EM algorithm.
Estimating Σ

As we are maximizing Equation (10) with respect to Σ with a fixed θ = θ_t, the last summand can be neglected. By a change-of-variables argument, we show in Theorem A1 that this maximization reduces to a Gaussian log-likelihood in the latent variables. Thus, considering all observations, we search for

Σ_{t+1} = argmax_{Σ: Σ_jj = 1 ∀j} ( −(N/2) log det Σ − (N/2) tr(Σ^{-1} S) ),

which only depends on the statistic

S := (1/N) Σ_{ℓ=1}^{N} E_{ψ_t}[ z^{θ_t}(x_ℓ) z^{θ_t}(x_ℓ)^T | x_{obs(ℓ)} ].

Generally, this maximization can be formalized as a convex optimization problem that can be solved by gradient descent. However, the properties of this estimator are not well understood (for example, a scaling of S by a ∈ R_{>0} leads to a different solution; see Appendix A.3). To overcome this issue, we instead approximate the solution with the correlation matrix

Σ_{t+1} = P S P, (12)

where P ∈ R^{p×p} is the diagonal matrix with entries P_jj = 1/√S_jj for all j = 1, . . . , p. This was also proposed in [28] (Section 2.2). In cases in which there is expert knowledge on the dependency structure of the underlying distribution, one can adapt Equation (12) accordingly. We discuss this in more detail in Section 4.4.
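The renormalization to a correlation matrix in Equation (12) is a one-liner; a sketch (the function name is ours):

```python
import numpy as np

def correlation_from_statistic(S):
    """Approximate Sigma_{t+1} by rescaling the statistic S to a correlation
    matrix, Sigma = P S P with P_jj = 1 / sqrt(S_jj) (Equation (12))."""
    P = np.diag(1.0 / np.sqrt(np.diag(S)))
    return P @ S @ P
```

Note that this map is invariant under scaling S by a > 0, which is exactly the property that the exact constrained maximizer lacks.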

Estimating θ
We now focus on finding θ_{t+1}, which is the maximizer of Equation (10) with respect to θ with Σ = Σ_{t+1} fixed. As there is, in general, no closed formula for the right-hand side, we use Monte Carlo integration. Again, we start by considering a single observation x_ℓ to simplify terms. Employing Algorithm 1, we receive M samples x_{mis(ℓ),1}, . . . , x_{mis(ℓ),M} from the distribution of X_{mis(ℓ)} | X_{obs(ℓ)} = x_{obs(ℓ)} given the parameters θ_t and Σ_t. We set x_{obs(ℓ),m} = x_{obs(ℓ)} for all m = 1, . . . , M. Then, by Equation (9), the conditional expectation can be approximated by an average over these samples (Equation (13)). Hence, considering all observations, we set θ_{t+1} as the maximizer of this Monte Carlo approximation (Equation (14)). Note that we only use the Monte Carlo samples to update the parameters of the marginal distributions θ. We would also like to point out some interesting aspects of Equations (13) and (14):
• The estimations of the marginals are interdependent. Hence, in order to maximize with respect to θ_j, we have to take into account all other components of θ.
• The first summand adjusts for the dependence structure in the data. If all observations at step t + 1 are assumed to be independent, then K_{t+1} = I, and this term is 0.
• More generally, the derivative depends on θ_k if and only if (K_{t+1})_{jk} ≠ 0. This means that if Σ_{t+1} implies the conditional independence of columns j and k given all other columns (Equation (6)), the optimal θ_j can be found without considering θ_k. This is, e.g., the case if we set entries of the precision matrix to 0. Thus, the incorporation of prior knowledge reduces the complexity of the identification of the marginal distributions.
The intuition behind the derived EM algorithm is simple. Given a dataset with missing values, we estimate the dependency structure. With the identified dependency structure, we can derive likely locations of the missing values. Again, these locations help us to find a better dependency structure. This leads to the proposed cyclic approach. The framework of the EM algorithm guarantees the convergence of this procedure to a local maximum for M → ∞ in Equation (14).
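To make the cyclic idea concrete, the following is a runnable sketch of the Σ-cycle for the simplified case of fully known marginals (the baseline indexed with 0 in Section 4). All names are ours; the update renormalizes to a correlation matrix as in Equation (12):

```python
import numpy as np
from scipy import stats

def em_correlation_known_marginals(X, cdfs, n_iter=50):
    """Sketch: map observations with known marginals to the latent Gaussian
    scale and run a standard EM for a normal with missing values, rescaling
    to a correlation matrix after each iteration."""
    N, p = X.shape
    Z = np.full_like(X, np.nan, dtype=float)
    for j in range(p):
        obs = ~np.isnan(X[:, j])
        Z[obs, j] = stats.norm.ppf(cdfs[j](X[obs, j]))
    Sigma = np.eye(p)
    for _ in range(n_iter):
        S = np.zeros((p, p))
        for z in Z:
            o = np.where(~np.isnan(z))[0]   # observed latent coordinates
            m = np.where(np.isnan(z))[0]    # missing latent coordinates
            zz = np.zeros(p)
            zz[o] = z[o]
            C = np.zeros((p, p))
            if len(m) > 0 and len(o) > 0:
                # E-step: conditional mean and covariance of the missing block.
                W = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
                zz[m] = W @ z[o]
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - W @ Sigma[np.ix_(o, m)]
            elif len(o) == 0:
                C = Sigma.copy()
            S += np.outer(zz, zz) + C
        S /= N
        # M-step approximation: renormalize to a correlation matrix (Eq. (12)).
        P = np.diag(1.0 / np.sqrt(np.diag(S)))
        Sigma = P @ S @ P
    return Sigma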

Modelling with Semiparametric Marginals
In the case in which the missing mechanism is MAR, the estimation of the marginal distribution using only complete observations is biased. Even worse, any moment of the distribution can be distorted. Thus, one needs a priori knowledge in order to identify the parametric family of the marginals [19,20]. If their family is known, one can directly apply the algorithm of Section 3.2. If this is not the case, we propose the use of a mixture model parametrization of the form

F_j^{θ_j}(x) = (1/g) Σ_{k=1}^{g} Φ((x − θ_{jk}) / σ_j), with θ_{j1} ≤ . . . ≤ θ_{jg}, (15)

where σ_j is a hyperparameter and the ordering of the θ_{jk} ensures the identifiability. Using mixture models for density estimation is a well-known idea (e.g., [29-31]). As the authors of [31] noted, mixture models vary between being parametric and being non-parametric, where flexibility increases with g. It is reasonable to choose Gaussian mixture models, as their density functions are dense in the set of all density functions with respect to the L_1-norm [29] (Section 3.2). This flexibility and the provided parametrization make the mixture models a natural choice.
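Under one plausible reading of this parametrization, namely an equal-weight Gaussian location mixture with a common bandwidth σ_j per column (our assumption), the marginal cdf can be evaluated as:

```python
import numpy as np
from scipy import stats

def mixture_cdf(x, theta, sigma):
    """Semiparametric marginal cdf in the spirit of Equation (15), assuming
    an equal-weight location mixture: F(x) = (1/g) * sum_k Phi((x - theta_k) / sigma).
    The locations are sorted, mirroring the identifiability constraint."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    theta = np.sort(np.asarray(theta, dtype=float))
    return np.mean(stats.norm.cdf((x[None, :] - theta[:, None]) / sigma), axis=0)
```

The resulting cdf is strictly increasing and continuous, as required by the assumptions of Section 2.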

A Blueprint of the Algorithm
The complete algorithm is summarized in Algorithm 2. For the Monte Carlo EM algorithm, Ref. [26] proposed the stabilization of the parameters with a rather small number of samples M and to increase this number substantially in the latter steps of the algorithm. This seems to be reasonable for line 8 of Algorithm 2 as well.
If there is no a priori knowledge about the marginals, we propose following Section 3.3. We choose the initial θ_0 such that the cumulative distribution function of the mixture model fits the ecdf of the observed data points. For an empirical analysis of the role of g, see Section 4.3.3. For σ_1, . . . , σ_p, we use a rule of thumb inspired by [3] that scales with σ̂_j, the standard deviation of the observed data points in the j-th component.

Algorithm 2:
Blueprint for the EM algorithm for the Gaussian copula model

Simulation Study
We analyze the performance of the proposed estimator in two studies. First, we consider scenarios for two-dimensional datasets and check the potential of the algorithm.
In the second part, we explore how expert knowledge can be incorporated and how this affects the behavior and performance. The proposed procedure, which is indexed with EM in the figures below, is compared with:

1.
Standard COPula Estimator (SCOPE): The marginal distributions are estimated by the ecdf of the observed data points. This was proposed by [18] if the parametric family is unknown, and it is the state-of-the-art approach. Thus, we apply an EM algorithm to determine the correlation structure on the mapped data points Φ^{-1}(F̂_j(x_{ℓj})), where F̂_j is the ecdf of the observed data points in column j. Its corresponding results are indexed with SCOPE in the figures and tables.

2.
Known marginals: The distribution of the marginals is completely known. The idea is to eliminate the difficulty of finding them. Here, we apply the EM algorithm for the correlation structure on the mapped data points Φ^{-1}(F_j(x_{ℓj})), where F_j is the real marginal distribution function. Its corresponding results are indexed with a 0 in the figures and tables.

3.

Markov chain Monte Carlo (MCMC) approach [21]: The author proposed an MCMC scheme to estimate the copula in a Bayesian fashion. Therefore, Ref. [21] derived the distribution of the multivariate ranks. The marginals are treated as nuisance parameters. We employed the R package sbgcop, which is available on CRAN, as it provides not only a posterior distribution of the correlation matrix Σ, but also imputations for missing values. In order to compare the approach with the likelihood-based methods, we derived point estimates from the posterior output.

Sklar's theorem shows that the joint distribution can be decomposed into the marginals and the copula. Thus, we analyze them separately.
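SCOPE's mapping of the observed data points (item 1 above) can be sketched as follows; the rescaling of the ecdf ranks by n + 1 is our assumption, chosen to keep Φ^{-1} finite at the sample maximum:

```python
import numpy as np
from scipy import stats

def scope_latent(X):
    """Per column, replace each observed value by its rescaled ecdf rank and
    transform with Phi^{-1}; missing entries stay NaN."""
    Z = np.full_like(X, np.nan, dtype=float)
    for j in range(X.shape[1]):
        obs = ~np.isnan(X[:, j])
        x = X[obs, j]
        ranks = stats.rankdata(x)        # ranks 1..n_obs (average for ties)
        u = ranks / (len(x) + 1.0)       # rescaled ecdf values in (0, 1)
        Z[obs, j] = stats.norm.ppf(u)
    return Z
```

The EM algorithm for the correlation structure is then run on the returned latent matrix.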

Adapting the EM Algorithm
In Sections 4.3 and 4.4, we chose g = 15, for which we saw sufficient flexibility. A sensitivity analysis of the procedure with respect to g can be found in Section 4.3.3. The initial θ_0 was chosen by fitting the marginals to the existing observations, and Σ_0 was the identity matrix. For the number of Monte Carlo samples M, we observed that with M = 20, θ stabilized after around 10 steps. Cautiously, we ran 20 steps before we increased M to 1000, for which we ran another five steps. We stopped the algorithm when the condition ‖Σ_{t+1} − Σ_t‖_1 < 10^{-5} was fulfilled.

Data Generation
We considered a two-dimensional dataset (we would have liked to include the setup of the simulation study of [18]; however, the missing mechanism could neither be extracted from the paper nor did the authors provide it on request) with a priori unknown marginals F_1 and F_2, whose copula was Gaussian with the correlation parameter ρ ∈ [−1, 1]. The marginals were chosen to be χ² with six and seven degrees of freedom. The data matrix D ∈ R^{N×2} kept N (complete) observations of the random vector. We enforced the following MAR mechanism:
1. Remove every entry in D with probability 0 ≤ p_MCAR < 1. We denote the resulting data matrix (with missing entries) as D_MCAR.
2. Remove the remaining entries in the second column of D_MCAR with a probability that depends on the observed first component through β_0 + β_1 x_{ℓ1}. We call the resulting data matrix D_MAR.
The missing patterns were non-monotone. Aside from p MCAR , the parameters β 0 and β 1 controlled how many entries were absent in the final dataset. Assuming that ρ > 0, β 1 > 0, and |β 0 | was not too large, the ecdf of the observed values of X 2 was shifted to the left compared to the true distribution function (changing the signs of β 1 and/or ρ may change the direction of the shift, but the situation is analogous). This can be seen in Figure 1, where we chose N = 200, ρ = 0.5, β = (β 0 , β 1 ) = (0, 2). The marginal distribution of X 1 could be estimated well by the ecdf of the observed data.
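The data-generating process of this subsection can be sketched as follows. The logistic link in the MAR pass is an illustrative assumption (the paper does not spell out its link function), and the function name is ours:

```python
import numpy as np
from scipy import stats

def generate_dataset(N, rho, beta, p_mcar, rng):
    """Sketch of Section 4.2's generator: chi^2(6)/chi^2(7) marginals coupled
    by a Gaussian copula with correlation rho, an MCAR pass over all entries,
    and a MAR pass on X2 driven by the observed first component."""
    # Complete data via the Gaussian copula.
    Z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
    U = stats.norm.cdf(Z)
    D = np.column_stack([stats.chi2(6).ppf(U[:, 0]), stats.chi2(7).ppf(U[:, 1])])
    X = D.copy()
    # Step 1: MCAR pass over every entry.
    X[rng.random(D.shape) < p_mcar] = np.nan
    # Step 2: MAR pass on X2, applied where X1 is observed
    # (logistic link assumed here for illustration).
    x1_obs = ~np.isnan(X[:, 0])
    p_mar = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * D[:, 0])))
    X[x1_obs & (rng.random(N) < p_mar), 1] = np.nan
    return X
```

With β_1 > 0, large values of X_1 (and, through ρ > 0, large values of X_2) are more likely to trigger a missing X_2, producing the leftward shift of the observed ecdf described above.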

Results
This subsection explores how different specifications of the data-generating process presented in Section 4.2 influenced the estimation of the joint distribution. First, we investigate the influence of the share of missing values (controlled via β) and the dependency (controlled via ρ) by fixing the number of observations (denoted by N) to 100. Then, we vary N to study the behavior of the algorithms for larger sample sizes. Afterwards, we carry out a sensitivity analysis of the EM algorithm with respect to g, the number of mixtures. Finally, we study the computational demands of the algorithms.

The Effects of Dependency and Share of Missing Values
We investigate two different choices for the setup in Section 4.2 by setting the parameters to ρ = 0.1, β = (−1, 1) and ρ = 0.5, β = (0, 2). For both, we draw 1000 datasets with N = 100 each and apply the estimators. To evaluate the methods, we look at two different aspects.
First, we compare the estimators for ρ with respect to bias and standard deviation. The results are depicted in the corresponding third columns of Table 1 and are summarized as boxplots in Figure A1 in Appendix B.3. We see that no method is clearly superior. While the EM algorithm has a stronger bias for ρ = 0.5 than that of SCOPE, it also has a smaller standard deviation. The MCMC approach shows the largest bias. As even known marginals (ρ_0) do not lead to substantially better estimators compared to SCOPE (ρ_SCOPE) or the proposed approach (ρ_EM), we deduce that (at least in this setting) the influence of the marginal estimators is almost negligible. MCMC performs notably worse.
Second, we investigate the Cramer-von Mises statistics ω between the estimated and the true marginal distributions (ω_1 for the first marginal, ω_2 for the second marginal). The results are shown in Table 1 (corresponding first two columns) and are summarized as boxplots in Figure A2 in Appendix B.3. While for ρ = 0.1, the proposed estimator behaves only slightly better than SCOPE, we see that the benefit becomes larger in the case of high correlation and more missing values, especially when estimating the second marginal. This is in line with the intuition that if the correlation vanishes, the two random variables X_1 and X_2 become independent. Thus, R_2, the missing value indicator, and X_2 become independent. (Note that this differs from the case in which ρ = 0, where the missingness indicator R_2 is conditionally independent of X_2 given X_1.) In that case, we can estimate the marginal of X_2 using the ecdf of the observed data points. Hence, SCOPE's estimates of the marginals should be good for small values of ρ. An illustration can be found in Figure 2. Again, the MCMC approach performs the worst.

Table 1. Comparison of the algorithms with respect to the Cramer-von Mises distance between the estimated and the true first (ω_1) and true second marginal distributions (ω_2), as well as the correlation (ρ). Shown are the mean and standard deviation of the proposed EM algorithm (EM), the method based on known marginals (0), the Standard Copula Estimator (SCOPE), and the Markov chain-Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.3.1.
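The Cramer-von Mises distance between an estimated and the true marginal can be approximated numerically; a hypothetical helper (the paper's exact definition of ω may differ in normalization):

```python
import numpy as np
from scipy import stats

def cvm_distance(F_hat, true_ppf, n_grid=2000):
    """Approximate omega = integral (F_hat(x) - F(x))^2 dF(x) on a grid of
    true quantiles, so that dF corresponds to a uniform grid in u."""
    u = (np.arange(n_grid) + 0.5) / n_grid
    x = true_ppf(u)
    return np.mean((F_hat(x) - u) ** 2)
```

Plugging in the true cdf yields (numerically) zero, while an ecdf built from a finite sample yields a small positive distance that shrinks with the sample size.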

Figure 2. Dependency graph for X_1, X_2, and R_2. X_2 is independent of R_2 if either X_1 and X_2 are independent (ρ = 0) or if X_1 and R_2 are independent (β_1 = 0).

Varying the Sample Size N
To investigate the behavior of the methods for larger sample sizes, we repeat the experiment from Section 4.2 with ρ = 0.5 and β = (0, 2) for the sample sizes N = 100, 200, 500, 1000. The results are depicted in Table 2 and Figures A3-A5 in Appendix B.3. The biases of SCOPE and the EM algorithm for ρ seem to vanish for large N, while the MCMC approach remains biased. Studying the estimation of the true marginals, the approximation of the second marginal via MCMC and SCOPE improves only slowly and is still poor for the largest sample size, N = 1000. In contrast, the EM algorithm performs best for small sample sizes, and the mean (of ω_1 and ω_2) and standard deviations (of all three values) move towards 0 for increasing N.

Table 2. Comparison of the algorithms with respect to the Cramer-von Mises distance between the estimated and the true first (ω_1) and true second marginal distributions (ω_2), as well as the correlation (ρ). Shown are the mean and standard deviation of the proposed EM algorithm (EM), the method based on known marginals (0), the Standard Copula Estimator (SCOPE), and the Markov chain-Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2) and varying sample sizes N = 100, 200, 500, 1000.

Sensitivity Analysis with Respect to g

The proposed EM algorithm relies on the hyperparameter g, the number of mixtures in Equation (15). To analyze the behavior of the EM algorithm with respect to g, we additionally run the EM algorithm with g = 5 and g = 30 on the 1000 datasets of Section 4.2 for ρ = 0.5, β = (0, 2), and N = 100. We did not adjust the number of steps in the EM algorithm to keep the results comparable. The results can be found in Table 3. We see that the choice of g does not have a large effect on the estimation of ρ. However, an increased g leads to better estimates for X_1. This is in line with the intuition that the ecdf of the first components is an unbiased estimate for the distribution function of X_1, and setting g to the number of samples corresponds to the kernel density estimator. On the other hand, the estimator for X_2 benefits slightly from g = 5, as ω_2^EM has a lower mean and standard deviation compared to the choice g = 30. However, this effect is small and almost non-existent when we compare g = 5 with g = 15. As the choice g = 15 leads to better estimates of the first marginal compared to g = 5, we see this choice as a good compromise for our setting. For applications without prior knowledge, we recommend considering g as an additional tuning parameter (via cross-validation).

Table 3. Comparison of the proposed EM algorithm with respect to the Cramer-von Mises distance between the estimated and the true first (ω_1) and true second marginal distributions (ω_2), as well as the correlation (ρ), for different numbers of mixtures g in Equation (15). Shown are the mean and standard deviation for g = 5, 15, 30 and for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2).

Run Times

We analyze the computational demands of the different algorithms by comparing their run times in the study of Section 4.3.1 with ρ = 0.5 and β = (0, 2) (the settings ρ = 0.1 and β = (−1, 1) lead to similar results and are omitted). The run times of all presented algorithms depend not only on the dataset, but also on the parameters (e.g., the convergence criterion and Σ_0 for SCOPE). Thus, we do not aim for an extensive study, but focus on the magnitudes. We compare the proposed EM algorithm with a varying number of mixtures (g = 5, 15, 30) with MCMC and SCOPE. The results are shown in Table 4. We see that the EM algorithm has the longest run time, which depends on the number of mixtures g. The MCMC approach and the proposed EM algorithm have a higher computational demand than SCOPE, as they try to model the interaction between the copula and the marginals. As mentioned above, we could reduce the run time of the EM algorithm by going down to only 10 steps instead of 20.

Inclusion of Expert Knowledge
In the presence of prior knowledge on the dependency structure, the presented EM algorithm is highly flexible. While information on the marginals can be used to parametrize the copula model, expert knowledge on the dependency structure can be incorporated by adapting Equation (12). In the case of soft constraints on the covariance or precision matrix, one can replace Equation (12) with a penalized covariance estimation, where the penalty reflects the expert assessment [32,33]. Similarly, one can define a prior distribution on the covariance matrices and set Σ t+1 as the mode of the posterior distribution (the MAP estimate) of Σ given the statistic S of Equation (12).
Another possibility could be that we are aware of conditional independencies in the data-generating process. This is, for example, the case when causal relationships are known [4]. To exemplify the latter, we consider a three-dimensional dataset X with the Gaussian copula C_Σ and marginals X_1, X_2, X_3, which are χ² distributed with six, seven, and five degrees of freedom. The precision matrix is set to K = ∆^{1/2} K_0 ∆^{1/2}, where K_0 has zero entries in positions (2, 3) and (3, 2) and ∆^{1/2} is a diagonal matrix that ensures that the diagonal elements of Σ are 1. We see that X_2 and X_3 are conditionally independent given X_1. The missing mechanism is similar to the one in Section 4.2. The missingness of X_3 depends on X_1 and X_2, while the probability of a missing X_1 or X_2 is independent of the others. The mechanism is, again, MAR. Details can be found in Appendix B.2. We compare the proposed method with prior knowledge on the zeros in the precision matrix (indexed by KP, EM in the figures) with the EM, SCOPE, and MCMC algorithms without background knowledge. We again sample 1000 datasets with 50 observations each from the real distribution. The background knowledge on the precision is used by restricting the non-zero elements in Equation (12). Therefore, we apply the procedure presented in [34] (Chapter 17.3.1) to find Σ_{t+1}. The means and standard deviations of the estimates are presented in Table 5. First, we evaluate the estimated dependency structures by calculating the Frobenius norm of the estimation error ‖Σ̂ − Σ‖. The EM algorithm with background knowledge (KP, EM) performs best and is more stable than its competitors. Apart from MCMC, the other procedures behave similarly, which indicates again that the exact knowledge of the marginal distributions is not too relevant for identifying the dependency structure. MCMC performs the worst.

Table 5.
Comparison of the algorithms with respect to the Cramer-von Mises distance between the estimated and the true first marginal distribution (ω 1 ), true second marginal distribution (ω 2 ), and true third marginal distribution (ω 3 ), as well as the correlation (ρ). Shown are the mean and standard deviation of the proposed EM algorithm (EM), the proposed EM algorithm with prior knowledge on the conditional independencies (KP, EM), the method based on known marginals (0), the Standard Copula Estimator (SCOPE), and the Markov chain-Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.4.

Second, we see that the proposed EM estimators return marginal distributions that are closer to the truth, while the estimate with background knowledge (KP, EM) performs best. Thus, the background knowledge on the copula also transfers into better estimates for the marginal distributions, in particular for X_3. This is due to Equation (14) and the comments thereafter. The zeros in the precision structure indicate which other marginals are relevant in order to identify the parameter of a marginal. In our case, X_2 provides no additional information for X_3. This information is provided to the EM algorithm through the restriction of the precision matrix.
Finally, we compare the EM estimates of the joint distribution. The relative entropy, or Kullback-Leibler divergence, is a popular tool for quantifying the difference between two distributions [35,36], where one of them is absolutely continuous with respect to the other. A lower value indicates a higher similarity. Due to the discrete structure of the marginals of SCOPE and MCMC, we cannot calculate their relative entropy with respect to the truth. However, we would like to analyze how the estimate of the proposed procedure improves if we include expert knowledge. The results are depicted in Table 6. Again, we observe that the incorporation of extra knowledge improves the estimates. This is in line with Table 5, as the estimation of all components of the joint distribution in Equation (3) is improved by the domain knowledge.

Table 6. Comparison of the algorithms with respect to the Kullback-Leibler divergence (D_KL) between the true joint distribution (F) and the estimates. Shown are the mean and standard deviation of the proposed EM algorithm (EM) and the proposed EM algorithm with prior knowledge on the conditional independencies (KP, EM) for 1000 datasets generated as described in Section 4.4.
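Whenever the estimated density can be evaluated, the Kullback-Leibler divergence reported here can be estimated by Monte Carlo; a generic sketch (names ours):

```python
import numpy as np
from scipy import stats

def kl_divergence_mc(log_f, log_g, sample_f, n=100_000):
    """Monte Carlo estimate of D_KL(f || g) = E_f[log f(X) - log g(X)],
    using n samples drawn from f; assumes g dominates f."""
    x = sample_f(n)
    return float(np.mean(log_f(x) - log_g(x)))
```

As a sanity check, for two unit-variance normals shifted by one, the closed-form divergence is 0.5, which the estimator reproduces.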

Discussion
In this paper, we investigated the estimation of the Gaussian copula and the marginals with an incomplete dataset, for which we derived a rigorous EM algorithm. The procedure iteratively searches for the marginal distributions and the copula. It is, hence, similar to known methods for complete datasets. We saw that if the data are missing at random, a consistent estimate of a marginal distribution depends on the copula and other marginals.
The EM algorithm relies on a complete parametrization of the marginals. The parametric family of the marginals is, in general, a priori unknown and cannot be identified through the observed data points. For this case, we presented a novel idea of employing mixture models. Although this is practically always a misspecification, our simulation study revealed that the combination of our EM algorithm and marginal mixture models delivers better estimates for the joint distribution than currently used procedures do. In principle, uncertainty quantification of the parameters derived by the proposed EM algorithm can be achieved by bootstrapping [37].
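As an illustration of the mixture-model idea, the cdf of an equal-weight Gaussian mixture marginal can be evaluated as below; the component means (`means`, corresponding to the θ_jk) and the common scale (`sigma`, corresponding to σ_j) would come from the EM fit and are placeholders here:

```python
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, means, sigma):
    """CDF of an equal-weight Gaussian mixture with common scale sigma.
    Averages the component cdfs, since all weights are 1/g."""
    return np.mean([norm.cdf(x, loc=m, scale=sigma) for m in means], axis=0)
```

Because the mixture is an average of smooth Gaussian cdfs, the resulting marginal estimate is continuous, in contrast to an ecdf-based estimate.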
There are different possibilities for incorporating expert knowledge. Information on the parametric family of the marginals can be used for their parametrization. However, causal and structural understandings of the data-generating process can also be utilized [4,38,39]. For example, this can be achieved by restricting the correlation matrix or its inverse, the precision matrix. We presented how one can restrict the non-zero elements of the precision, which enforces conditional independencies. Our simulation study showed that this leads not only to an improved estimate for the dependency structure, but also to better estimates for the marginals. This translates into a lower relative entropy between the real distribution and the estimate. We also discussed how soft constraints on the dependency structure can be included.
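As a toy numerical illustration (the matrix entries are hypothetical), the following sketch shows why a zero entry in the precision matrix encodes a conditional independence, while the implied marginal dependence can still be nonzero:

```python
import numpy as np

# Precision matrix with a zero in position (1, 2): the second and third
# latent variables are conditionally independent given the first.
Omega = np.array([[2.0, 0.5, 0.5],
                  [0.5, 1.5, 0.0],
                  [0.5, 0.0, 1.5]])

# Implied covariance structure: the (1, 2) entry is generally nonzero,
# so marginal dependence persists despite conditional independence.
Sigma = np.linalg.inv(Omega)

# The partial correlation is read off directly from the precision matrix.
partial_corr = -Omega[1, 2] / np.sqrt(Omega[1, 1] * Omega[2, 2])
```

Restricting the EM updates to respect such zeros is what injects the expert knowledge on conditional independencies into the dependence estimate.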
We note that the focus of this paper is on estimating the joint distribution without precise specification of its subsequent use. Therefore, we did not discuss imputation methods (see, e.g., [40][41][42][43]). However, Gaussian copula models were employed as a device for multiple imputation (MI) with some success [22,24,44]. The resulting complete datasets can be used for inference. All approaches that we are aware of estimate the marginals by using the ecdf of the observed data points. The findings in Section 4 translate into better draws for the missing values.
Additionally, the joint distribution can be utilized for regressing a potentially multivariate Y on Z even if data are missing. By applying the EM algorithm on X := (Y, Z) and by Proposition 1, one even obtains the whole conditional distribution of Y given Z = z.
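On the latent Gaussian scale, the conditioning step reduces to the standard Gaussian conditional formulas; a minimal sketch (names and index sets are illustrative, and the mapping back to the original scale is supplied by Proposition 1) could look as follows:

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_y, idx_z, z):
    """Mean and covariance of Y | Z = z for (Y, Z) jointly Gaussian,
    where idx_y and idx_z select the Y- and Z-coordinates."""
    mu = np.asarray(mu); Sigma = np.asarray(Sigma); z = np.asarray(z)
    Syy = Sigma[np.ix_(idx_y, idx_y)]
    Syz = Sigma[np.ix_(idx_y, idx_z)]
    Szz = Sigma[np.ix_(idx_z, idx_z)]
    cond_mean = mu[idx_y] + Syz @ np.linalg.solve(Szz, z - mu[idx_z])
    cond_cov = Syy - Syz @ np.linalg.solve(Szz, Syz.T)
    return cond_mean, cond_cov
```

For a bivariate standard Gaussian with correlation 0.5, conditioning the first coordinate on the second equal to 1 gives mean 0.5 and variance 0.75.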
We have shown how to incorporate a causal understanding of the data-generating process. However, in the potential outcome framework of [45], the derivation of a causal relationship can also be interpreted as a missing data problem in which the missing patterns are "misaligned" [46]. Our algorithm is also applicable in this setting.

3. This proof is analogous to the one above. The result can be generalized to the case in which S ∪ T = {1, . . . , p}.

We now apply Proposition 1. The last integral is understood element-wise. By the first and second moments of Φ_{Σ,µ}, it follows that
\[
\int z_{\ell,\theta_t} z_{\ell,\theta_t}^{T}\, \varphi_{\Sigma,\mu}\!\left(z_{\mathrm{mis}(\ell),\theta_t}\right) \mathrm{d}z_{\mathrm{mis}(\ell),\theta_t}
= \int \begin{pmatrix} z_{\mathrm{obs}(\ell),\theta_t} \\ z_{\mathrm{mis}(\ell),\theta_t} \end{pmatrix}
\begin{pmatrix} z_{\mathrm{obs}(\ell),\theta_t} \\ z_{\mathrm{mis}(\ell),\theta_t} \end{pmatrix}^{T}
\varphi_{\Sigma,\mu}\!\left(z_{\mathrm{mis}(\ell),\theta_t}\right) \mathrm{d}z_{\mathrm{mis}(\ell),\theta_t}
= \begin{pmatrix} z_{\mathrm{obs}(\ell),\theta_t} z_{\mathrm{obs}(\ell),\theta_t}^{T} & z_{\mathrm{obs}(\ell),\theta_t}\,\mu^{T} \\ \mu\, z_{\mathrm{obs}(\ell),\theta_t}^{T} & \Sigma + \mu\mu^{T} \end{pmatrix}.
\]

In the case of SCOPE, consider the marginal observed data points, which we assume to be ordered y_1 ≤ . . . ≤ y_N. We use a linearly interpolated estimator for the percentile function.

To estimate the percentile function for the mixture models, we choose with equal probability (all Gaussians have equal weight) one component of the mixture and then draw a random number with its mean θ_jk and standard deviation σ_j, j = 1, . . . , p, k = 1, . . . , g. In this manner, we generate N samples y_1, . . . , y_N. The estimator for the percentile function is then chosen analogously to the one above. A higher N leads to a more exact result; we choose N = 10,000.

For the MAR mechanism, the missingness probability is given by
\[
\frac{1}{1+\exp\!\big(-(\beta_0 + \beta_1 \Phi^{-1}(F_1(D_1)) + \beta_2 \Phi^{-1}(F_2(D_2)))\big)}
\]
with β = (β_0, β_1, β_2). We call the resulting data matrix D_MAR. Its missing patterns are, again, non-monotone, and the data are MAR, but not MCAR. In Section 4.4, we set β = (0, 2, 2).
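The logistic MAR mechanism with β = (0, 2, 2) can be sketched as follows; the data draw, the sample size, and the choice of masking the third column are illustrative assumptions, and the unknown marginals F_1, F_2 are replaced by ecdf-type rank transforms:

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)
beta = np.array([0.0, 2.0, 2.0])                     # beta = (beta_0, beta_1, beta_2)

# Fully observed data matrix D with three columns (illustrative draw).
D = rng.multivariate_normal(np.zeros(3), np.eye(3), size=500)

# Probability integral transforms of the first two columns via ranks,
# scaled into (0, 1) so that norm.ppf stays finite.
U = rankdata(D[:, :2], axis=0) / (D.shape[0] + 1)

# Missingness probability depends only on the observed D_1, D_2,
# so the data are MAR but not MCAR.
p_mis = 1.0 / (1.0 + np.exp(-(beta[0]
                              + beta[1] * norm.ppf(U[:, 0])
                              + beta[2] * norm.ppf(U[:, 1]))))

D_mar = D.copy()
D_mar[rng.random(D.shape[0]) < p_mis, 2] = np.nan    # mask the third column
```

Because the mask depends on fully observed columns only, the resulting missingness patterns are non-monotone yet remain MAR, matching the simulation design in Section 4.4.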