1. Introduction
The era of information technology allows the collection of massive amounts of data, and association networks between variables have gained wide attention from researchers. Under the common assumption in statistics and machine learning applications that the variables are multivariate normally distributed, Gaussian graphical models have been widely applied to characterize association networks with complex interaction patterns among a large number of continuous variables. In particular, a Gaussian graphical model describes the conditional dependencies among random variables through the inverse of the covariance matrix, also known as the precision matrix or concentration matrix. When the number of observations n is larger than the number of variables p, inverting the maximum likelihood estimate of the covariance matrix provides a standard estimate of the precision matrix. However, under high-dimensional settings with n < p, the precision matrix is usually sparse with many zero elements due to conditional independencies, and the maximum likelihood estimate of the covariance matrix is not invertible, so this estimate of the precision matrix is unavailable.
Most existing methods of the sparse precision matrix estimation and structure learning can be categorized as a likelihood-based approach, where the penalized likelihood function is maximized for estimation, such as in graphical Lasso [
1,
2], or a regression-based approach, which estimates sparse regression models separately via neighborhood selection, such as [
3,
4]. Under the Bayesian framework, algorithms have been proposed through both approaches with applications of the horseshoe prior [
5], the state-of-the-art shrinkage prior, using multivariate normal scale mixtures. Wang [
6] proposed a fully Bayesian treatment with block Gibbs sampler to maximize the penalized log-likelihood through column-wise updating. Li et al. [
7] adopted the horseshoe priors into Wang’s work. Williams et al. [
8] used the horseshoe estimates of projection predictive selection given in Piironen and Vehtari [
9] to determine the precision matrix structure. Most selection methods using shrinkage priors cannot shrink small coefficients to exactly 0; selection is therefore usually implemented with a pre-specified threshold on the parameters or with cross-validation to choose a tuning parameter that maximizes the log-likelihood of either the precision matrix or the regression models.
In real-world data analysis, missing data are ubiquitous and inevitably obstruct the application of existing methods. Simply ignoring missing data is not recommended, especially for high-dimensional data with a small number of observations. In addition to the loss of information, complete-case or available-case analyses of data missing at random (MAR) do not correctly represent the whole population, as the retained cases are not a random sample [
10,
11]. Single imputation, which replaces each missing value with a single number such as the observed mean, tends to distort the data distribution and underestimate the variance, which is crucial for covariance or precision matrix estimation. Thus, an appropriate approach is needed to handle the missing values while estimating the model.
Under incomplete high-dimensional data settings, only a few methods have been proposed for precision matrix estimation and covariance selection. Employing the graphical lasso with an $\ell_1$-penalized term, missGLasso, proposed by Städler and Bühlmann [
12] is a likelihood-based method estimating the precision matrix with missing value imputed by conditional distributions. Kolar and Xing [
13] improved this work by plugging the sample covariance matrix into the penalized log-likelihood maximization problem. Following this, the cGLasso proposed by Augugliaro et al. [
14,
15,
16] allows the handling of censored data, missing data, or a combination of both. As a local optimization method, the EM algorithm used by these approaches can be slow on high-dimensional data, and its stability is not guaranteed.
To resolve this challenge, we develop an approach that estimates the precision matrix, imputes missing values, and performs covariance selection. Yang et al. [17] show that the simultaneously impute and select (SIAS) strategy outperforms the impute-then-select (ITS) strategy for fitting a linear regression model on a traditional dataset with n > p. Inheriting the advantage of SIAS, Zhang and Kim [18] proposed an approach to imputation and selection for fitting linear regression models on high-dimensional data. We extend these approaches to the problem of estimating a sparse precision matrix from high-dimensional data. As in SIAS and the approach of Zhang and Kim [18], the key feature of this paper is to nest both data imputation and shrinkage estimation of the model in one combined Gibbs sampling process, so that the imputation is optimized for model estimation without losing valuable information. We use multiple imputation for the missing values and the horseshoe shrinkage prior for precision matrix estimation. We then introduce 2-means clustering into the covariance selection by clustering shrinkage factors into signal and noise groups, which, as a post-iteration selection, is fast and efficient. This study fills the gap of sparse precision matrix estimation and covariance selection for incomplete high-dimensional data under the Bayesian framework.
The remainder of this paper is organized as follows. In
Section 2, we start with a brief review of regression-based precision matrix estimation for the Gaussian graphical model. Then, we implement the linear regression model estimation with the horseshoe prior within the precision matrix estimation and introduce the covariance selection using 2-means clustering. After that, we propose our approach to simultaneously impute missing values and estimate the precision matrix, followed by the covariance selection. In
Section 3, we conduct extensive simulation analyses to show the efficiency and accuracy of our proposed algorithm in comparison to available methods. In
Section 4, we illustrate it on real data and show the necessity of it. In
Section 5, we provide further discussion.
2. Materials and Methods
In this section, we define notation and briefly review the regression-based precision matrix estimation of the Gaussian graphical model and its covariance selection through neighborhood selection. Our approach to estimating the sparse precision matrix is then described, together with the 2-means clustering covariance selection. Finally, we propose the multiple imputation with Gaussian graphical model estimation by horseshoe (MI-GEH) algorithm.
2.1. Gaussian Graphical Model Estimation and Neighborhood Selection
Consider random variables $X = (X_1, \ldots, X_p)$ following a multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. The precision matrix $\Omega = (\omega_{ij})$ is the inverse of $\Sigma$, that is, $\Omega = \Sigma^{-1}$.
In the form of an undirected graph $G = (V, E)$ with a set of vertices $V = \{1, \ldots, p\}$ and a set of edges $E$, the Gaussian graphical model characterizes the conditional relationships of the variables $X_1, \ldots, X_p$. If two variables $X_i$ and $X_j$, $i \neq j$, are conditionally dependent given all remaining variables, which corresponds to a non-zero element of the precision matrix, $\omega_{ij} \neq 0$, then an edge in $E$ connects vertices $i$ and $j$ as neighbors in $G$. On the contrary, any unconnected pair of vertices indicates conditional independence corresponding to a zero element in the precision matrix.
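As a minimal numeric illustration of this correspondence (the matrix values below are hypothetical and chosen only for demonstration), the following Python snippet builds a small precision matrix, converts it to partial correlations, and reads off the edge set:

```python
import numpy as np

# Hypothetical 3-variable precision matrix whose (1, 3) entry is zero, so X1 and X3
# are conditionally independent given X2 and the graph has no edge between them.
omega = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.5, -0.6],
                  [ 0.0, -0.6,  1.8]])

# Partial correlation between i and j given the rest: -omega_ij / sqrt(omega_ii * omega_jj)
d = np.sqrt(np.diag(omega))
partial_corr = -omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Edges of the Gaussian graphical model are the non-zero off-diagonal entries
# (indices below are 0-based, i.e. 0 stands for X1, 2 for X3).
edges = [(i, j) for i in range(3) for j in range(i + 1, 3) if omega[i, j] != 0]
print(partial_corr)
print("edges:", edges)   # [(0, 1), (1, 2)]: edges X1-X2 and X2-X3, no edge X1-X3
```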
The neighborhood selection, a method for covariance selection proposed by Meinshausen and Bühlmann [3], aims at identifying, for each node, the smallest subset of vertices such that, given the variables indexed by the selected vertices, the node is independent of all remaining variables. This can be approached by fitting node-wise regression models. Conducting a linear regression of $X_j$ on all remaining variables gives:
$$X_j = \sum_{k \neq j} \beta_{jk} X_k + \epsilon_j, \qquad (1)$$
where $\boldsymbol{\beta}_j = (\beta_{jk})_{k \neq j}$ are the regression coefficients and the residual $\epsilon_j \sim \mathrm{N}(0, \sigma_j^2)$. Then, using the conditional normal distribution, the elements of the precision matrix $\Omega$ can be written in terms of the regression coefficients and residual variances [19] as:
$$\omega_{jj} = \frac{1}{\sigma_j^2}, \qquad \omega_{jk} = -\frac{\beta_{jk}}{\sigma_j^2}, \quad k \neq j. \qquad (2)$$
Thus, if $\beta_{jk} = 0$, in other words if variable $X_k$ is not selected into the regression of $X_j$, then $\omega_{jk} = 0$, such that $X_j$ and $X_k$ are not connected in $G$. Recognizing the potential difference between the estimates $\hat{\omega}_{jk}$ and $\hat{\omega}_{kj}$, the estimate of the precision matrix takes the average of the two.
Moreover, an additional difficulty in estimating the precision matrix is its positive definiteness constraint. Several works [
20,
21] have shown that the node-wise regression estimates of the precision matrix are asymptotically positive definite under regularity conditions.
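The following Python sketch illustrates the relationship in Equation (2): it assembles a precision matrix from node-wise regression quantities and, as a sanity check on a hypothetical known precision matrix, recovers it exactly from the population regression coefficients and residual variances. The function and matrix are illustrative, not part of the MI-GEH implementation.

```python
import numpy as np

def precision_from_nodewise(betas, sigma2):
    """Assemble a precision matrix from node-wise regression estimates.

    betas[j, k] : coefficient of X_k in the regression of X_j (betas[j, j] is ignored)
    sigma2[j]   : residual variance of the regression of X_j
    Uses omega_jj = 1 / sigma_j^2 and omega_jk = -beta_jk / sigma_j^2 (Equation (2)),
    then averages omega_jk and omega_kj to enforce symmetry.
    """
    p = len(sigma2)
    omega = np.zeros((p, p))
    for j in range(p):
        omega[j, j] = 1.0 / sigma2[j]
        for k in range(p):
            if k != j:
                omega[j, k] = -betas[j, k] / sigma2[j]
    return (omega + omega.T) / 2.0   # symmetrized estimate

# Hypothetical check: start from a known precision matrix and recover it exactly from
# the population regression quantities beta_jk = -omega_jk / omega_jj, sigma_j^2 = 1 / omega_jj.
omega_true = np.array([[ 2.0, -0.5,  0.0],
                       [-0.5,  1.5, -0.3],
                       [ 0.0, -0.3,  1.2]])
sigma2 = 1.0 / np.diag(omega_true)
betas = -omega_true / np.diag(omega_true)[:, None]
np.fill_diagonal(betas, 0.0)
print(np.allclose(precision_from_nodewise(betas, sigma2), omega_true))  # True
```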
2.2. Gaussian Graphical Model Estimation by Horseshoe Prior
Given the advantages of horseshoe priors in identifying signals in high-dimensional settings by utilizing half-Cauchy distributions for the local and global shrinkage parameters [5], Makalic [22] proposed a Gibbs sampler for the linear regression model on standardized data, applying horseshoe priors to the regression coefficients, and Zhang and Kim [18] modified it for data without standardization. We apply Zhang and Kim's sampler to our node-wise linear regressions: for variable $X_j$, the full conditionals of model (1) are:
$$
\begin{aligned}
\boldsymbol{\beta}_j \mid \cdot &\sim \mathrm{N}\!\left(A_j^{-1} X_{-j}^{\top} X_j,\; \sigma_j^2 A_j^{-1}\right), \qquad A_j = X_{-j}^{\top} X_{-j} + \Lambda_j^{-1}, \\
\sigma_j^2 \mid \cdot &\sim \mathrm{Inv\text{-}Gamma}\!\left(\frac{n + p - 1}{2},\; \frac{1}{2}\left(X_j - X_{-j}\boldsymbol{\beta}_j\right)^{\top}\left(X_j - X_{-j}\boldsymbol{\beta}_j\right) + \frac{1}{2}\boldsymbol{\beta}_j^{\top}\Lambda_j^{-1}\boldsymbol{\beta}_j\right), \\
\lambda_{jk}^2 \mid \cdot &\sim \mathrm{Inv\text{-}Gamma}\!\left(1,\; \frac{1}{\nu_{jk}} + \frac{\beta_{jk}^2}{2\tau_j^2\sigma_j^2}\right), \qquad
\tau_j^2 \mid \cdot \sim \mathrm{Inv\text{-}Gamma}\!\left(\frac{p}{2},\; \frac{1}{\xi_j} + \frac{1}{2\sigma_j^2}\sum_{k \neq j}\frac{\beta_{jk}^2}{\lambda_{jk}^2}\right), \\
\nu_{jk} \mid \cdot &\sim \mathrm{Inv\text{-}Gamma}\!\left(1,\; 1 + \frac{1}{\lambda_{jk}^2}\right), \qquad
\xi_j \mid \cdot \sim \mathrm{Inv\text{-}Gamma}\!\left(1,\; 1 + \frac{1}{\tau_j^2}\right), \qquad (3)
\end{aligned}
$$
where $\Lambda_j = \tau_j^2\,\mathrm{diag}(\lambda_{jk}^2)_{k \neq j}$. Here, $\lambda_{jk}$ is the local shrinkage parameter of variable $X_k$ with variance $\lambda_{jk}^2$, $\tau_j$ is the global shrinkage parameter in the regression on variable $X_j$, and $\nu_{jk}$ and $\xi_j$ are auxiliary variables for $\lambda_{jk}$ and $\tau_j$. Through the use of horseshoe priors, the local and global shrinkage parameters $\lambda_{jk}$ and $\tau_j$ shrink the coefficients of noise variables towards zero while allowing the coefficients of signals to remain relatively large [5,18,22].
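To make the sampler concrete, the sketch below implements one sweep of a horseshoe Gibbs update for a single node-wise regression using the standard inverse-gamma augmentation of the half-Cauchy priors. It is a simplified illustration of conditionals of the form in (3) under our own assumptions (no intercept, generic design matrix); the exact updates used by Zhang and Kim [18] and by MI-GEH may differ in details.

```python
import numpy as np

def horseshoe_gibbs_step(y, X, beta, sigma2, lam2, tau2, nu, xi, rng):
    """One sweep of a horseshoe Gibbs sampler for y = X beta + eps (illustrative sketch)."""
    n, m = X.shape
    # beta | . ~ N(A^{-1} X'y, sigma2 A^{-1}), with A = X'X + diag(1 / (tau2 * lam2))
    A = X.T @ X + np.diag(1.0 / (tau2 * lam2))
    A_inv = np.linalg.inv(A)
    beta = rng.multivariate_normal(A_inv @ X.T @ y, sigma2 * A_inv)
    # sigma2 | . ~ Inv-Gamma((n + m)/2, (RSS + beta' diag(1/(tau2*lam2)) beta) / 2)
    resid = y - X @ beta
    scale = 0.5 * (resid @ resid + np.sum(beta**2 / (tau2 * lam2)))
    sigma2 = 1.0 / rng.gamma((n + m) / 2.0, 1.0 / scale)
    # local shrinkage parameters lambda_k^2 and their auxiliaries nu_k
    lam2 = 1.0 / rng.gamma(1.0, 1.0 / (1.0 / nu + beta**2 / (2.0 * tau2 * sigma2)))
    nu = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / lam2))
    # global shrinkage parameter tau^2 and its auxiliary xi
    rate_tau = 1.0 / xi + np.sum(beta**2 / lam2) / (2.0 * sigma2)
    tau2 = 1.0 / rng.gamma((m + 1) / 2.0, 1.0 / rate_tau)
    xi = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / tau2))
    return beta, sigma2, lam2, tau2, nu, xi
```

Each inverse-gamma draw is obtained as the reciprocal of a gamma draw, which keeps the sweep vectorized over the local shrinkage parameters.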
If no thresholding is applied, implementing the above Gibbs sampler with Equation (1) produces an estimate of the precision matrix with many entries close to 0. Fan et al. [21] elaborated on the thresholds usually applied with precision matrix estimation. Given that soft-thresholding usually outperforms hard-thresholding among the more flexible regularization methods, we propose a post-iteration 2-means clustering strategy that clusters the variables by the value of their shrinkage factors. The shrinkage factor
$$\kappa_{jk} = \frac{1}{1 + \tau_j^2 \lambda_{jk}^2} \qquad (4)$$
measures the magnitude of shrinkage of each coefficient $\beta_{jk}$, which corresponds to variable $X_k$ in the regression on variable $X_j$, away from its maximum likelihood estimate. When $\kappa_{jk}$ is close to 0, the corresponding variable $X_k$ is an important variable that should be selected for the regression of $X_j$. On the contrary, when $\kappa_{jk}$ is close to 1, variable $X_k$ is a noise variable that should not be selected for the regression of $X_j$.
We select $q_j$ variables for the regression of $X_j$ through 2-means clustering with the objective to minimize:
$$\sum_{\kappa_{jk} \in S_1^{(j)}} \left(\mathrm{E}[\kappa_{jk}] - 0\right)^2 + \sum_{\kappa_{jk} \in S_2^{(j)}} \left(\mathrm{E}[\kappa_{jk}] - M_j\right)^2, \qquad (5)$$
where $S_1^{(j)}$ is the set of $\kappa_{jk}$ of the selected variables, $M_j$ is the maximum of $\mathrm{E}[\kappa_{jk}]$ in the regression of $X_j$, and $S_2^{(j)}$ is the set of $\kappa_{jk}$ of the unselected variables. The sets $S_1^{(j)}$ and $S_2^{(j)}$ are determined by the number of selected variables $q_j$: $S_1^{(j)}$ contains the $q_j$ smallest $\mathrm{E}[\kappa_{jk}]$s, and $S_2^{(j)}$ contains the rest. Through a greedy search over $q_j$ in Equation (5), we minimize the sum of the squared distances between the expected shrinkage factors and their respective cluster centers, while keeping the cluster centers fixed at 0 and $M_j$. To satisfy the symmetry constraint of the precision matrix, we consider $X_j$ and $X_k$ as conditionally dependent in the final covariance selection only if each is selected in the regression of the other variable.
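A sketch of this post-iteration selection for a single regression is given below: it performs the greedy search over the number of selected variables implied by Equation (5), with cluster centers fixed at 0 and the maximum shrinkage factor. The function name and example values are ours; the final edge set would additionally require the symmetrization rule described above.

```python
import numpy as np

def two_means_select(kappa):
    """Select variables for one node-wise regression by 2-means clustering of
    posterior-mean shrinkage factors (sketch of the greedy search in Equation (5)).

    kappa : 1-D array of E[kappa_jk] for the regression of X_j.
    Cluster centers are fixed at 0 (signal) and M_j = max(kappa) (noise); the search
    over q keeps the q smallest shrinkage factors as signals.
    """
    kappa = np.asarray(kappa, dtype=float)
    order = np.argsort(kappa)
    k_sorted = kappa[order]
    M = k_sorted[-1]
    best_q, best_obj = 0, np.inf
    for q in range(len(kappa) + 1):            # q = number of selected variables
        obj = np.sum(k_sorted[:q] ** 2) + np.sum((k_sorted[q:] - M) ** 2)
        if obj < best_obj:
            best_q, best_obj = q, obj
    selected = np.zeros(len(kappa), dtype=bool)
    selected[order[:best_q]] = True
    return selected

# Hypothetical example: small shrinkage factors are kept, those near 1 are dropped.
print(two_means_select(np.array([0.05, 0.10, 0.92, 0.97, 0.99])))
# [ True  True False False False]
```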
2.3. Multiple Imputation with Graphical Model Estimation by Horseshoe (MI-GEH)
When there are missing values, we optimize the data imputation jointly with the Gaussian graphical model estimation. To simplify notation, we use $X_{-j}$ to indicate the set $\{X_1, \ldots, X_p\} \setminus \{X_j\}$. Considering an incomplete variable $X_j$ and the remaining variables $X_{-j}$, we impute the missing entries $X_j^{\mathrm{mis}}$ following the conditional distribution:
$$X_j^{\mathrm{mis}} \mid X_{-j}, \boldsymbol{\beta}_j, \sigma_j^2 \;\sim\; \mathrm{N}\!\left(X_{-j}^{\mathrm{mis}}\,\boldsymbol{\beta}_j,\; \sigma_j^2 I\right), \qquad (6)$$
where $X_{-j}^{\mathrm{mis}}$ denotes the rows of $X_{-j}$ corresponding to the missing entries of $X_j$.
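The sketch below illustrates the imputation step implied by Equation (6) for one incomplete column, assuming the current regression parameters and a completed working copy of the data; the function signature and names are illustrative rather than the actual MI-GEH code.

```python
import numpy as np

def impute_column(X, miss_mask, j, beta_j, sigma2_j, rng):
    """Draw imputations for the missing entries of column j (sketch of Equation (6)).

    X         : (n, p) data matrix with current imputations filled in
    miss_mask : (n, p) boolean matrix, True where the original value is missing
    beta_j    : regression coefficients of X_j on the remaining variables X_{-j}
    sigma2_j  : residual variance of that regression
    """
    rows = miss_mask[:, j]
    if not rows.any():
        return X
    X_minus_j = np.delete(X[rows], j, axis=1)            # predictors for the missing rows
    mean = X_minus_j @ beta_j                             # conditional mean
    X = X.copy()
    X[rows, j] = rng.normal(mean, np.sqrt(sigma2_j))      # draw from N(mean, sigma2_j)
    return X
```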
Sweeping through all regression and imputation parameters, and computing the precision and covariance matrices within one MCMC procedure, we present the MI-GEH algorithm as follows.
In each chain, initial values may be set as follows: 0 for the regression coefficients, a random value of either 0 or 1 for each missing entry, and 1 for the variance and shrinkage parameters. Then:
- 1.
Regression Parameters.
- (a)
For each of the p variables, update the horseshoe estimates and obtain p sets of regression and shrinkage parameters $\{\boldsymbol{\beta}_j, \sigma_j^2, \boldsymbol{\lambda}_j, \tau_j\}$ following the conditional distributions in (3).
- (b)
Calculate the shrinkage factors $\kappa_{jk}$ using Equation (4).
- 2.
Precision Matrix and Covariance Matrix.
- (a)
Update the precision matrix $\Omega$ with the regression estimates $\boldsymbol{\beta}_j$ and $\sigma_j^2$, following their relationship in Equation (2).
- (b)
Symmetrize the precision matrix and calculate the covariance matrix $\Sigma = \Omega^{-1}$.
- 3.
Imputation of the data $X^{\mathrm{mis}}$. Draw random numbers following the normal distribution described in (6) to impute the data and obtain the completed dataset.
- 4.
Iteration of steps 1–3. Run the above steps to produce a Markov chain of regression and imputation parameters.
After the iterations, calculate the posterior means of the regression parameters, precision matrix, and covariance matrix, then perform covariance selection following the 2-means clustering strategy described in Section 2.2. A covariance entry that has been selected in all chains is included in the final model. Lastly, set all entries of the precision and covariance matrices that were not selected to zero.
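Putting the pieces together, the sketch below outlines one MI-GEH chain, assuming the hypothetical helper functions sketched in the previous subsections (horseshoe_gibbs_step, precision_from_nodewise, impute_column, two_means_select) are in scope. It is a schematic of the algorithm above, not the reference implementation; it omits the multi-chain combination and the final symmetrized edge selection.

```python
import numpy as np
from numpy.random import default_rng

def mi_geh_chain(X0, miss_mask, n_iter=1000, burn_in=500, seed=0):
    """One MI-GEH chain (sketch): node-wise horseshoe updates, precision matrix
    assembly, and imputation draws, accumulating posterior means after burn-in."""
    rng = default_rng(seed)
    n, p = X0.shape
    X = X0.copy()
    # simple starting imputation: random 0/1 values for the missing entries
    X[miss_mask] = rng.integers(0, 2, size=miss_mask.sum()).astype(float)
    beta = np.zeros((p, p - 1))
    sigma2 = np.ones(p)
    lam2 = np.ones((p, p - 1))
    tau2 = np.ones(p)
    nu = np.ones((p, p - 1))
    xi = np.ones(p)
    kappa_sum = np.zeros((p, p - 1))
    omega_sum = np.zeros((p, p))
    kept = 0

    for it in range(n_iter):
        # Step 1: regression parameters, one horseshoe sweep per variable
        for j in range(p):
            y = X[:, j]
            Xj = np.delete(X, j, axis=1)
            beta[j], sigma2[j], lam2[j], tau2[j], nu[j], xi[j] = horseshoe_gibbs_step(
                y, Xj, beta[j], sigma2[j], lam2[j], tau2[j], nu[j], xi[j], rng)
        # Step 2: precision matrix from the node-wise estimates (Equation (2));
        # the covariance matrix could be obtained by inversion if needed
        betas_full = np.zeros((p, p))
        for j in range(p):
            betas_full[j, np.arange(p) != j] = beta[j]
        omega = precision_from_nodewise(betas_full, sigma2)
        # Step 3: imputation of the missing entries (Equation (6))
        for j in range(p):
            X = impute_column(X, miss_mask, j, beta[j], sigma2[j], rng)
        # accumulate posterior means after burn-in
        if it >= burn_in:
            kappa_sum += 1.0 / (1.0 + tau2[:, None] * lam2)   # shrinkage factors (Equation (4))
            omega_sum += omega
            kept += 1

    kappa_mean = kappa_sum / kept
    selected = np.array([two_means_select(kappa_mean[j]) for j in range(p)])
    return omega_sum / kept, selected
```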
4. Application
We now illustrate the MI-GEH on a high-dimensional dataset of Utah residents with northern and western European ancestry (CEU). A detailed description of this dataset is given in Bhadra and Mallick [
26], and it is available on the Wellcome Sanger Institute website (ftp://ftp.sanger.ac.uk/pub/genevar, accessed on 20 February 2024) and in the R package BDgraph. This dataset is fully observed, with 60 observations and 100 gene variables. To examine the performance of MI-GEH, we generate missing values for 5 gene variables with 10% and 20% missing rates under the MCAR and MAR mechanisms. The probability of missingness under MAR follows a logistic regression model depending on a fully observed variable with a coefficient of 1. We replicate each of the 4 scenarios 100 times.
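A sketch of how such missingness patterns can be generated is given below. The MCAR case drops entries independently at the target rate, and the MAR case uses a logistic model with coefficient 1 on a fully observed driver variable, with the intercept calibrated (our choice) so that the average missing rate roughly matches the target; column indices and function names are illustrative.

```python
import numpy as np
from numpy.random import default_rng

def add_missingness(X, target_cols, rate=0.1, mechanism="MCAR", driver_col=None, seed=0):
    """Return a boolean missingness mask for the chosen columns of X (sketch)."""
    rng = default_rng(seed)
    n, _ = X.shape
    mask = np.zeros_like(X, dtype=bool)
    for j in target_cols:
        if mechanism == "MCAR":
            # each entry missing independently with probability `rate`
            prob = np.full(n, rate)
        else:
            # MAR: logistic model with coefficient 1 on a fully observed driver column;
            # grid search picks the intercept so the average missing probability ~ rate
            z = X[:, driver_col]
            grid = np.linspace(-20, 20, 4001)
            means = np.array([np.mean(1 / (1 + np.exp(-(a + z)))) for a in grid])
            a = grid[np.argmin(np.abs(means - rate))]
            prob = 1 / (1 + np.exp(-(a + z)))
        mask[:, j] = rng.random(n) < prob
    return mask
```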
In this application, we run 5 chains of 1000 iterations, including a 500-iteration burn-in period, for MI-GEH. To summarize the results, we take the mean of each entry over all estimated precision matrices. In addition, we apply the GHS approach of Li et al. [
27] and GLasso with cross-validation [
25] on the original dataset without missing values.
Figure 2 displays the inferred graphs and
Table 6 gives the numbers of vertices and edges selected. Among the three methods, MI-GEH provides the most refined neighborhood selection for most variables in the dataset, reducing the graphical structure to the smallest number of edges for 99 out of 100 vertices. Although GHS selects 30 more edges than MI-GEH, it covers 15 fewer vertices. Moreover, in a real data analysis, it is difficult to know from the dataset itself which edges are true positives. To further examine the selection, we compare the results of MI-GEH and GHS with those of GLasso, the most widely used method for Gaussian graphical models. MI-GEH and GLasso select 85 edges in common, which is 87.6% of the edges selected by MI-GEH, and GHS and GLasso select 109 edges in common, which is 85.8% of the edges selected by GHS. This shows a substantial overlap and suggests that the selection precision of MI-GEH is no lower than that of GHS.
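Overlap figures of this kind can be computed with a simple set comparison of the selected edges, as sketched below (function names are ours):

```python
import numpy as np

def edge_set(omega, tol=0.0):
    """Return the set of selected edges (i < j) from an estimated precision matrix."""
    p = omega.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(omega[i, j]) > tol}

def overlap_report(edges_a, edges_b, name_a="A", name_b="B"):
    """Report shared edges and the share of each method's edges also found by the other."""
    shared = edges_a & edges_b
    print(f"{name_a}: {len(edges_a)} edges, {name_b}: {len(edges_b)} edges, "
          f"shared: {len(shared)} "
          f"({100 * len(shared) / len(edges_a):.1f}% of {name_a}, "
          f"{100 * len(shared) / len(edges_b):.1f}% of {name_b})")

# Hypothetical usage with two estimated precision matrices omega_migeh and omega_glasso:
# overlap_report(edge_set(omega_migeh), edge_set(omega_glasso), "MI-GEH", "GLasso")
```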
5. Discussion
This study proposes a Bayesian approach to sparse precision matrix estimation and selection for high-dimensional incomplete data. The proposed MI-GEH optimizes the multiple imputation for precision matrix estimation by nesting both in one MCMC procedure. The post-iteration selection with 2-means clustering on shrinkage factors lets the data speak for themselves and offers efficient selection, as demonstrated by the simulation analyses and the genetics data analysis. The strong selection results of the simulation analyses show the advantage of the proposed method in covariance selection compared to three commonly used alternative approaches: mean-GLasso, MF-GLasso, and missGLasso. Both mean-GLasso and MF-GLasso are ITS approaches and have similar performance. MissGLasso is a useful procedure and performs well in many settings in which the mean and variance of the complete cases are representative of the data. However, in real settings, the complete cases may constitute only a small proportion of the full data, resulting in a challenging condition where they are not representative of the full data. The proposed MI-GEH outperforms the other approaches in most settings, including these challenging conditions, performing well for selection and achieving adequate accuracy in matrix estimation, as is typical of shrinkage methods. The illustration of the proposed MI-GEH on genetic data, compared with GLasso [
1] and GHS [
27] applied to the complete data, shows that the method is useful for real-world data.
In line with the work of Yang et al. [
17] and Zhang and Kim [
18], our simulation results provide further evidence that, for regression-based precision matrix estimation, simultaneous imputation and selection performs better than conducting imputation and selection sequentially. The multiple imputation employed in these methods is designed to provide valid inference for model estimation. The horseshoe priors, which scale the regression parameters unequally, allow the shrinkage factors to vary among variables, so that applying post-iteration 2-means clustering gives efficient selection with strong results.
Future research can take several directions, including the consideration of estimation accuracy: while the shrinkage estimator provides efficient model selection, it yields biased estimates. It would also be interesting to combine multiple imputation approaches with different priors applied to complex precision matrix structures. Notably, the proposed MI-GEH algorithm might also be applicable to complete datasets by omitting the imputation steps. As such, further investigation of its performance in such contexts, along with a thorough evaluation of common variable selection issues such as overfitting, may be warranted.