Article

SumVg: Total Heritability Explained by All Variants in Genome-Wide Association Studies Based on Summary Statistics with Standard Error Estimates

1 School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong, China
2 KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, Kunming Institute of Zoology and The Chinese University of Hong Kong, Shatin, Hong Kong, China
3 Department of Psychiatry, The Chinese University of Hong Kong, Shatin, Hong Kong, China
4 CUHK Shenzhen Research Institute, Shenzhen 518057, China
5 Margaret K. L. Cheung Research Centre for Management of Parkinsonism, The Chinese University of Hong Kong, Shatin, Hong Kong, China
6 Hong Kong Branch of the Chinese Academy of Sciences Center for Excellence in Animal Evolution and Genetics, The Chinese University of Hong Kong, Shatin, Hong Kong, China
7 Brain and Mind Institute, The Chinese University of Hong Kong, Shatin, Hong Kong, China
8 Department of Psychiatry, The University of Hong Kong, Pokfulam, Hong Kong, China
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2024, 25(2), 1347; https://doi.org/10.3390/ijms25021347
Submission received: 13 October 2023 / Revised: 15 January 2024 / Accepted: 16 January 2024 / Published: 22 January 2024
(This article belongs to the Collection Feature Papers in “Molecular Biology”)

Abstract

Genome-wide association studies (GWAS) are commonly employed to study the genetic basis of complex traits/diseases, and a key question is how much heritability could be explained by all single nucleotide polymorphisms (SNPs) in GWAS. One widely used approach that relies on summary statistics only is linkage disequilibrium score regression (LDSC); however, this approach requires certain assumptions about the effects of SNPs (e.g., all SNPs contribute to heritability and each SNP contributes equal variance). More flexible modeling methods may be useful. We previously developed an approach recovering the “true” effect sizes from a set of observed z-statistics with an empirical Bayes approach, using only summary statistics. However, methods for standard error (SE) estimation are not available yet, limiting the interpretation of our results and the applicability of the approach. In this study, we developed several resampling-based approaches to estimate the SE of SNP-based heritability, including two jackknife and three parametric bootstrap methods. The resampling procedures are performed at the SNP level as it is most common to estimate heritability from GWAS summary statistics alone. Simulations showed that the delete-d-jackknife and parametric bootstrap approaches provide good estimates of the SE. In particular, the parametric bootstrap approaches yield the lowest root-mean-squared-error (RMSE) of the true SE. We also explored various methods for constructing confidence intervals (CIs). In addition, we applied our method to estimate the SNP-based heritability of 12 immune-related traits (levels of cytokines and growth factors) to shed light on their genetic architecture. We also implemented the methods to compute the sum of heritability explained and the corresponding SE in an R package SumVg. 
In conclusion, SumVg may provide a useful alternative tool for calculating SNP heritability and estimating SE/CI, which does not rely on distributional assumptions of SNP effects.

1. Introduction

Genome-wide association studies (GWAS) have proven to be successful in dissecting the genetic basis of a variety of diseases. A number of new susceptibility loci have been discovered, providing novel insight into the pathophysiology of many diseases. Nevertheless, a large proportion of heritability still remains unexplained. It is natural to question the maximum variance that could be explained by all variants in a GWAS (or meta-analyses of GWAS), as we expect that many true susceptibility variants are “hidden” due to limited power.
A number of methods have been developed to estimate the total heritability explained by all measured SNPs (also known as SNP-based heritability). Regarding methods that require individual-level data, in a pioneering work, Yang et al. [1] derived a method to estimate the variance explained by all SNPs in a GWAS using a linear mixed model with random SNP effects. The approach assumes that all SNPs have non-zero and normally distributed effects (beta), with a mean effect of zero. Each SNP is assumed to contribute the same amount of explained variance (i.e., variance explained by each SNP = total heritability/number of SNPs). Other similar approaches have also been proposed. For example, LDAK [2] assumes that the heritability explained by each SNP differs, depending on the minor allele frequency (MAF), linkage disequilibrium (LD) score and imputation quality of the SNP. Methods have also been developed to estimate SNP-based heritability using summary statistics alone. (Here, summary statistics refer to GWAS results for each SNP, with the effect size (beta), standard error of beta and test statistics/p-values available, or at least two of these items available.) LD score regression (LDSC) is one of the most widely used approaches for this purpose [3]. LDSC assumes a mean effect (beta) of zero and equal variance explained by each SNP (i.e., an infinitesimal model). SumHer [4] is an alternative approach based on the LDAK assumptions. For a more detailed technical review, please refer to ref [5]. The broader problem of SNP-based heritability estimation has also been discussed in several other reviews or opinion pieces [6,7,8,9].
Prior to the development of LDSC, we developed an alternative framework (see ref [10]; referred to as “SumVg” in this paper) to achieve the same goal of estimating SNP-based heritability using summary statistics alone. Essentially, we aimed to recover the true effect sizes from a set of observed z-statistics based on formulas presented by Robbins [11] (who attributed the idea to Maurice Kenneth Tweedie), Brown [12] and Efron [13]. The corrected z-statistics are then converted to variance explained. There are several advantages to this method. Most importantly, the SumVg approach does not rely on any distributional assumptions about the effect sizes of susceptibility variants. In addition, it does not assume that an equal amount of heritability is explained by each SNP, or that all SNPs contribute to the heritability (infinitesimal model). There are also no assumptions about the relationship between allele frequencies and variance explained. The method is also computationally fast. Furthermore, since the LDSC method directly leverages LD patterns, a well-matched LD reference panel is usually required [14]. SumVg relies less on LD information, as LD is mainly used for pruning.
Our method has been applied in a number of studies (for example, see [15,16,17,18,19,20,21,22,23]). However, there are no methods available to quantify the standard error (SE) or precision of the heritability estimates from SumVg, or the corresponding confidence intervals (CIs). There is considerable technical difficulty in developing a reliable approach for estimating the SE, since usually only the summary data (instead of individual-level data) are available. If raw data are available, a standard non-parametric bootstrap could be employed by sampling individuals with replacement. When only summary statistics are available, however, this option is not open, and there are currently no methods for evaluating the SE or CI of the point estimate of heritability.
We summarize the contributions of this study below. Firstly, we proposed five resampling approaches to estimate the SE of the total heritability of all SNPs in GWAS, based on summary statistics alone. Extensive simulations were performed to compare and validate the performance of different methods. We also explored various methods for constructing CIs. Secondly, we developed an easy-to-use R program to implement the SumVg approach with different flexible modeling options, available at https://github.com/lab-hcso/Estimating-SE-of-total-heritability/ (accessed on 12 October 2023). Thirdly, we reported heritability estimates for 12 immune-related traits (levels of cytokines and growth factors) [24] based on this approach, for which LDSC was unable to provide reasonable estimates. Such cytokines/growth factors are regulators of immune responses and inflammation, and are important intermediate phenotypes for autoimmune, inflammatory and infectious diseases [25]. As such, it is of scientific and clinical importance to unravel the genetic architecture of these traits, and estimating their heritability may be considered a useful contribution in its own right.

2. Results

2.1. Overview of Methods

We estimated the total heritability (Vg) explained by all variants in a GWAS panel using Tweedie’s formula [10], which corrects selection bias in the observed z-statistics. To estimate the standard errors (SEs), we proposed five resampling methods. The first two are based on the jackknife, namely the delete-one and delete-d-jackknife (with d = n/5 observations removed each time). We also proposed three parametric bootstrap methods, where z-statistics were sampled from a normal distribution based on the ‘corrected’ z-statistics and/or the local false discovery rate (fdr) (i.e., the estimated probability that a SNP is null). We also proposed several methods for constructing confidence intervals (CIs), including normal approximation with various bootstrap bias corrections, as well as the percentile and union of CI methods. We tested the performance of the SE and CI estimation methods in simulations under different heritability and sample size scenarios. We applied our methods to estimate the SNP-based heritability and the corresponding SEs of 12 immune traits to reveal their genetic architecture.

2.2. Simulation Results for SE Estimation

Standard errors (SEs) of heritability, as estimated by the jackknife and bootstrap approaches, are listed in Table 1 and plotted in Figure 1. Bias, variance and root mean square error (RMSE) of SEs were calculated over 100 simulations (Table 2; Figure 1 and Figure 2).
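As a minimal sketch of how such summary measures can be computed for one SE estimator in one simulation scenario, the Python snippet below takes the SE estimates from repeated simulations and a reference "true" SE. The function name `se_estimator_metrics` is hypothetical, and two details are assumptions rather than statements from this paper: the reference SE is supplied by the caller (for instance, the empirical standard deviation of the Vg estimates across simulations), and the variance uses the divisor k so that RMSE² = bias² + variance holds exactly.

```python
import math


def se_estimator_metrics(se_estimates, true_se):
    """Bias, variance and RMSE of a list of SE estimates against a reference SE.

    Assumption: `true_se` is provided by the caller; this sketch does not
    define how the reference value is obtained.
    """
    k = len(se_estimates)
    mean_est = sum(se_estimates) / k
    bias = mean_est - true_se
    # Divisor k (not k - 1), so that rmse**2 == bias**2 + variance.
    variance = sum((s - mean_est) ** 2 for s in se_estimates) / k
    rmse = math.sqrt(sum((s - true_se) ** 2 for s in se_estimates) / k)
    return bias, variance, rmse
```

A positive bias indicates that the SE is over-estimated on average, which is the pattern reported for the jackknife estimators in this section.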
The delete-[n/5]-jackknife worked reasonably well when the total heritability explained was low (when heritability = 0.101), but it tended to overestimate the SE when the total heritability was higher, especially with larger sample sizes. The bias was also positive across all simulation scenarios. The standard (delete-1) jackknife approach performed the worst among all methods, producing inflated estimates of SE. The variance and RMSE of this estimator were high compared to other approaches. The SE was, in general, over-estimated at all heritability levels across all sample sizes. This may be explained by the fact that the sum of variance explained is not a very smooth parameter, which impairs the performance of delete-1-jackknife estimators.
The other methods, including the original parametric bootstrap (paraboot) and the modified versions that take the local fdr into account, performed reasonably well, with estimates closely resembling the true SE. With the exception of one simulation setting, the parametric bootstrap methods achieved the lowest (absolute) bias for the SE. For the variance and RMSE of the SE, the parametric bootstrap also performed the best. In terms of RMSE, the parametric bootstrap approaches modeling the local fdr (i.e., fdrboot1 and fdrboot2) outperformed the other methods. The RMSEs of the different estimators also decreased with increasing sample size.

2.3. Performance of Different CI Construction Methods

The full results are presented in Table 3 and Tables S1 and S2. For the standard CI (based on normal approximation), the CIs built from the SE of the delete-d-jackknife performed reasonably well (in terms of coverage) for large sample sizes, although the coverage was not always adequate for modest sample sizes, especially for N < 20,000. The coverage of CIs constructed from other types of SEs was more variable, with good coverage for some scenarios but poor coverage for others. Therefore, we primarily focus on the SE from the delete-d-jackknife when a standard CI is used. Interestingly, the bias-corrected standard CI, with bias correction based on paraboot or fdrboot2, performed better in the several cases in which the standard CI had low coverage (<50%) (assuming that the SE from the delete-d-jackknife was employed). The performance of percentile CIs was highly variable across different scenarios.
In view of the highly variable performance of different CI construction methods, we expect the union of CIs (UCI) to perform better and be more robust across different scenarios. We observed that the UCI, regardless of whether it was constructed from the standard or percentile CI estimators, generally achieved good coverage across most simulation scenarios, although in some cases the coverage was still below the desired level (95%). When we further took the union of the standard and percentile UCI estimators (i.e., Method 3 listed under ‘Union CI’ in Section 4), the coverage was adequate for almost all scenarios, except one case in which both the sample size and the sum of variance explained (Vg) were low (N = 5000, Vg = 0.101).
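The two basic building blocks discussed above can be sketched as follows. This is an illustration only, not the implementation in the SumVg package: the function names are hypothetical, the 1.96 critical value assumes a two-sided 95% interval, and the union is taken here as the smallest single interval covering all inputs (which also covers any gaps between disjoint intervals). The exact union rules used in the paper are given in Section 4.

```python
def normal_ci(estimate, se, z_crit=1.96):
    """Standard CI by normal approximation: estimate +/- z_crit * SE."""
    return estimate - z_crit * se, estimate + z_crit * se


def union_ci(intervals):
    """Smallest interval covering every (lower, upper) pair in `intervals`.

    Assumption: the union is represented by its outer bounds.
    """
    lower = min(lo for lo, _ in intervals)
    upper = max(hi for _, hi in intervals)
    return lower, upper
```

Because the union can only widen the interval, coverage cannot decrease relative to any single input CI, which matches the trade-off noted in the Discussion (better coverage at the expense of wider CIs).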

2.4. Results on Immune Traits

PLINK was used to prune the GWAS data for 12 immunological traits (Table 4) under various $r^2$ thresholds to obtain roughly independent SNPs. We only included common variants with an MAF > 0.01 for further analysis. Then, using SumVg, the “true” z-statistics of the pruned SNPs were recovered to capture the missing heritability. The jackknife and bootstrap methods were used to compute the corresponding SEs (Table 5; Figure S1).
In contrast to the comparatively low or negative heritability estimates from LDSC, the total SNP-based heritability estimates from SumVg for the selected traits were around 10–20%, based on a collection of LD-pruned SNPs. We obtained a stable (and likely conservative) estimate of heritability at $r^2$ ~ 0.01 or 0.005. Lower $r^2$ thresholds (i.e., $r^2$ < 0.0025 and $r^2$ < 0.001) had limited impact on the final estimates of heritability. The delete-one jackknife consistently produced the highest standard error, while the bootstrap and delete-d-jackknife approaches produced SEs that were more comparable to one another. Out of the 12 cytokines/growth factors studied, the highest heritability was observed for the levels of IL-4 and IL-17.

2.5. R Package Implementation

We also implemented the methods to compute the sum of heritability explained and the corresponding SEs in an R package SumVg, available at https://github.com/lab-hcso/Estimating-SE-of-total-heritability/ (accessed on 12 October 2023).
The computational speed of different resampling approaches using SumVg is presented in Table S3 (assuming 100,000 SNPs and 200 resampling iterations). The speed is generally fast and the time taken was around 2–4 min for each resampling method, using a single core (Intel Xeon Gold 6230 CPU @ 2.10 GHz).

3. Discussion

In this study, we presented an approach for estimating the SE of SNP-based heritability estimates using SumVg, and our applications to immune phenotypes demonstrate the usefulness of this approach.
Our main purpose is to provide an alternative approach for SNP-based heritability and SE estimation, since different approaches have different statistical modeling assumptions, or assumptions about the genetic architecture. In practice, it is almost impossible to know the true genetic architecture of a disease/trait, and as such, it is very difficult to verify the correctness of heritability estimates due to the lack of a ‘gold standard’. It will be more reassuring if one observes similar heritability estimates from diverse methods. SumVg may provide a useful alternative reference for heritability estimates, in conjunction with existing approaches such as LDSC. SumVg may also be useful when standard approaches are unable to give reasonable results (e.g., close to zero heritability for traits that are likely to be heritable from previous studies, or negative estimates). It will be interesting to investigate the reasons underlying negative heritability estimates for LDSC; one possibility is mis-specified model assumptions [26], but the exact reasons will require further studies.
We recommend pruning the SNPs (such that SNPs are roughly in linkage equilibrium) before applying our method of heritability estimation. One approach is to employ a series of $r^2$ thresholds (e.g., decreasing $r^2$ from 0.1 to 0.001) and consider the point at which the heritability estimate becomes stable. Our empirical applications showed that an $r^2$ threshold of ~0.01 may be sufficient. The resulting SNP-based heritability may be considered a conservative estimate (due to the possibility of removing some causal variants during LD-pruning). While not directly modeling LD is a limitation of this approach, the lower reliance on accurate LD information may be advantageous in some cases, for example when in-sample LD information is not available and only limited external reference data are present. On the other hand, we are also investigating methods to model LD in the SumVg framework. Since SumVg and LDSC are based on different modeling strategies and assumptions, and since the main focus of this study is the development of new SE/CI estimation approaches for SumVg (as well as applications to immune traits and the presentation of a new R package), we leave a detailed comparison between SumVg and LDSC (or other SNP heritability estimation methods) for future work.
We have not investigated methods for SE estimation when raw genotype data are available. When raw data are available, one potential approach is to simply resample the individuals with replacement (i.e., the standard non-parametric bootstrap). However, such an approach is computationally intensive, and its performance relative to methods based on summary statistics requires further research. The above resampling methods can also potentially be sped up by splitting the job into multiple processes to be run in parallel, although this has not been implemented in our software yet. We also wish to point out that, as the resampling methods are designed to be applied to GWAS summary data, the computation is generally fast and the speed is not affected by the GWAS sample size.
We have explored various approaches to constructing CIs, although we have not yet found a single approach that yields an optimal CI with good coverage across all scenarios. We leave the development of more sophisticated and novel methodologies for CI construction for future work. For practical purposes, the union CI appears to perform well in terms of coverage across most scenarios (at the expense of wider CIs). On the other hand, we suspect that the issue of CI construction may not be unique to the SumVg approach; other methods for estimating SNP-based heritability typically require more stringent assumptions on the distribution of effects, and/or that all SNPs contribute to heritability. The violation of such assumptions may lead to biased estimates and inadequate coverage of CIs. Here we have proposed a bootstrap correction of bias, which indeed led to improvement in CI coverage in some cases, for example the standard CI under small sample sizes. Nevertheless, bootstrap correction showed variable performance across different scenarios and did not reduce bias in all cases. The above issues may warrant further studies.
Here we further highlight several important points to note and limitations of our framework. Regarding the SumVg estimator of total SNP-based heritability, one future research direction is to further explore its asymptotic theoretical properties. We did not pursue this direction here. Of note, the key difficulty in Equation (1) (i.e., Tweedie’s formula) is to estimate $f(z)$ and $f'(z)$ accurately. We primarily employed a kernel density estimator here, although other density estimation approaches may also be attempted. Notably, the kernel density estimator has been shown to be asymptotically consistent under certain assumptions [27]. In the paper by Efron [28], the asymptotic regret (Reg) of the empirical Bayes approach (i.e., using Tweedie’s formula) was studied by comparing the Tweedie’s estimate with the Bayes estimate of the true effect size, for a fixed value of $z$ at $z_0$. It was shown that $\mathrm{Reg}(z_0)$ tends towards zero as $N$ tends towards infinity, and the regret depends on the squared error of $\hat{l}(z_0)$ as an estimator of $l(z_0)$, where $l(z) = \frac{d}{dz} \log f(z)$. Future theoretical studies of SumVg and other SNP heritability estimation methods are warranted.
In the current work, we assume that the summary statistics have been corrected for population stratification and other types of bias. If the original GWAS study suffered from bias, e.g., confounding, selection/ascertainment bias, sampling bias, bias due to missing data, etc., the resulting Vg estimate will also be affected. We suggest that the above bias should be carefully addressed at the design and/or analysis stage of the GWAS, for example by performing proper random sampling, inverse probability weighting to address selection bias [29], proper imputation of missing data, etc. As with any method, independent replication is also important.
Another limitation is that the proposed approach for calculating SNP heritability and SE/CI estimation may not work well for very small sample sizes. Since GWAS sample sizes are generally getting larger (most with N > 5000), we did not address the performance under very small sample sizes here. In such cases, both the SNP heritability and SE estimates may need to be viewed with greater caution. Meta-analysis of GWAS results across multiple studies may be recommended. Future work may also explore more innovative approaches to addressing small sample sizes, for example whether specifying a prior for the underlying effect sizes (δ) may help. (The current approach does not require any specification of the distribution of δ).
We also note that resampling methods often assume that the data points are independent of each other. In our study, prior to the analysis, we processed the data to remove strongly linked SNPs using LD pruning. The resulting SNPs are therefore roughly independent though some residual LD might remain. As a future direction, it may be useful to explore ways to fully tackle LD, for example by block bootstrap or jackknife [30]. However, external LD data from reference panels would be required, and there may be risks of LD mismatch between the studied and external samples. Further studies are required to investigate these issues.
Different resampling methods like bootstrap and jackknife may have different assumptions and applicability to different kinds of data. We have conducted relatively extensive simulations to compare performance of different methods across a range of heritability levels and sample sizes, which helps evaluate their applicability. We believe the proposed methods are generally applicable to most GWAS summary data. Note that the parametric bootstrap approaches assume that the observed data (z-statistics) are drawn from a certain specified parametric distribution. In our case, it is assumed that the δ and/or local fdr are estimated reasonably well. For small sample sizes, this assumption may not hold very well. The jackknife approaches do not require parametric assumptions; however, delete-one-jackknife has been shown to produce inconsistent variance estimators for non-smooth estimators such as the sample quantiles [31]. Delete-d-jackknife can resolve this problem, but the choice of d may not be straightforward. We suggest that multiple types of resampling methods should be performed; similar results across different methods may provide reassurance to the validity of results. Future work may include more extensive simulations for different genetic architectures and wider applications to complex traits.
There may be a concern that resampling methods may not handle extreme values or skewed distributions well. As discussed above, we recommend the GWAS should be conducted carefully in the first place. For example, skewed phenotypes may require transformation before analysis, and confounding or other kinds of bias need to be addressed. The SumVg method works on summary statistics. It is possible to perform further inverse-rank transformation to the summary statistics if the distribution is skewed or outliers are present, although this may create some bias to the Vg estimate. One may also trim the outlying z-statistics, and increasing the number of resamples may also help. The performance of these approaches will be a topic for further studies.
Importantly, we have also applied our approach to estimate the heritability of different cytokines, which play important roles in immune response and the pathogenesis of autoimmune, inflammatory and infectious diseases. Our analyses suggest that the studied cytokines are moderately heritable in general.
To summarize, SumVg is useful for triangulating evidence from different approaches to support conclusions regarding SNP-based heritability. We present novel methods for computing the SE and CI, together with easy-to-use software, which we believe will be helpful for other researchers. Our application to cytokine levels also sheds light on the genetic architecture of these clinically important immune traits.

4. Materials and Methods

4.1. Estimation of the Total Heritability Explained (Vg)

We previously proposed an approach [10] to estimate the sum of heritability explained by all variants on a GWAS panel. Our approach leverages Tweedie’s formula for estimating the true underlying effect sizes of SNPs, based on the observed GWAS summary statistics. The principles are described in detail in the work by Efron [28].

4.1.1. Estimation of Total Vg Based on Tweedie’s Formula

More specifically, assuming we have a large number of normally distributed variables (here z-statistics from a GWAS analysis), each with its own unobserved mean parameter $\delta_i$, then
$$z_i \sim \mathcal{N}(\delta_i, \sigma^2), \quad i = 1, 2, \ldots, k$$
where k is the total number of variables. Attention is often focused on the more extreme values, for example the top SNPs in high-dimensional genomics studies. As described by Efron [28], ‘selection bias’ may be at play here. Intuitively, the more extreme z-statistics might have been ‘lucky’ in that random errors pushed them further from zero; as such, they ‘stand out’ among the other z-statistics. In other words, the true underlying effect sizes of these top SNPs tend to be less extreme than the observed values. This phenomenon is also known as the ‘winner’s curse’ (for example, see [32,33,34]). As a result, if we directly used the observed z-statistics to estimate the true effect sizes, the performance may not be optimal. Some form of ‘correction’ of the observed z-statistics is required.
Efron [28] proposed an empirical Bayes approach to reduce the selection bias, which was first described by Robbins [11] who attributed the ideas to Tweedie. The method assumes that
$$\delta \sim g(\cdot) \quad \text{and} \quad z \mid \delta \sim \mathcal{N}(\delta, \sigma^2)$$
In other words, we assume that $\delta$ was sampled from a prior ‘density’ $g(\cdot)$, then $z \sim \mathcal{N}(\delta, \sigma^2)$ were observed, and the variance $\sigma^2$ was known. There are no assumptions on the form of the prior density $g$. According to the Tweedie’s formula,
$$E\{\delta \mid z\} = z + \sigma^2 l(z), \quad \text{where } l(z) = \frac{d}{dz} \log f(z)$$
In our setting of GWAS analyses, we assume $\sigma^2 = 1$, since we work with the summary z-statistics. We estimated the true or ‘corrected’ effect sizes of SNPs using
$$E(\delta \mid z) = z + \frac{f'(z)}{f(z)}$$
which is equivalent to the formula above when $\sigma^2 = 1$. Here $z$ denotes the observed z-statistic, obtained from the estimated regression coefficient divided by the estimated SE (i.e., $\hat{\beta}/\widehat{SE}$). $\delta$ is the z-statistic derived by the true effect size divided by the estimated SE of the sample ($\beta_{true}/\widehat{SE}$), which can be considered a form of the ‘standardized’ true effect size. We previously proposed to employ a kernel density estimator to compute $f(z)$ [10], which was shown to perform well in simulations. The total variance explained (Vg) can be obtained by converting the underlying effects $\delta$ to the Vg scale (see below and ref [10]).
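To make the correction concrete, the following Python sketch applies Tweedie’s formula with σ² = 1, estimating f and its derivative by a Gaussian kernel density estimate. It is a toy illustration under stated assumptions, not the SumVg implementation: the function name `tweedie_correct`, the Gaussian kernel, and the fixed, user-supplied bandwidth are all choices made here, and the double loop is quadratic in the number of SNPs, so it only suits small examples.

```python
import math


def tweedie_correct(z, bandwidth):
    """Return z_i + f'(z_i) / f(z_i) for every z_i in the list `z`.

    f is estimated by a Gaussian kernel density estimate with a fixed
    bandwidth (an illustrative assumption, not the package's choice).
    """
    corrected = []
    for zi in z:
        f = 0.0        # unnormalised density estimate at zi
        f_prime = 0.0  # unnormalised derivative of the estimate at zi
        for zj in z:
            u = (zi - zj) / bandwidth
            w = math.exp(-0.5 * u * u)
            f += w
            f_prime += -(u / bandwidth) * w
        # The normalising constant 1 / (k * h * sqrt(2 * pi)) cancels in f' / f.
        corrected.append(zi + f_prime / f)
    return corrected


# Toy usage: the extreme values are pulled towards the bulk of the data.
z_obs = [-2.5, -0.4, 0.1, 0.3, 3.2]
delta_hat = tweedie_correct(z_obs, bandwidth=1.0)
```

As a sanity check on the formula itself: if the prior were N(0, A), the marginal density of z would be N(0, A + 1), so l(z) = −z/(A + 1) and E(δ | z) = z·A/(A + 1), which is the familiar shrinkage towards zero.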

4.1.2. Conversion of z-Statistics to Vg

For continuous traits, the conversion formula followed our previous work [10], which can be derived from ANOVA table of regression,
$$V_g = \frac{E(\delta \mid z)^2}{n - 2 + E(\delta \mid z)^2}$$
For binary outcomes, it is also possible to convert the z-statistics to Vg, provided that the estimated SE (or beta) and minor allele frequencies (MAF) of the SNPs, as well as the outcome prevalence, are available. We followed the methodology described in ref [35], which described how to convert coefficients from a logistic model to the liability scale. Note that the liability is assumed to have a variance of one. We followed Equation (4) from the above paper [35] to derive the coefficient ($\tau_1$) under a liability scale. We converted $\tau_1$ to the standardized coefficient ($\tau_{standard}$) by multiplying $\tau_1$ by sqrt(2 × MAF × (1 − MAF)), which is the standard deviation (SD) of the allelic count (coded as 0, 1, 2). The total variance explained is given by the sum of the squared $\tau_{standard}$.
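The conversions in this subsection can be sketched in a few lines of Python. The function names are hypothetical; for the continuous-trait formula, n is taken to be the GWAS sample size, and the total Vg is taken as the sum over (roughly independent) SNPs. The liability-scale coefficient τ1 is treated as an input, since its derivation follows Equation (4) of ref [35] and is not reproduced here.

```python
import math


def vg_from_delta(delta, n):
    """Variance explained by one SNP for a continuous trait: d^2 / (n - 2 + d^2)."""
    d2 = delta * delta
    return d2 / (n - 2 + d2)


def total_vg(deltas, n):
    """Sum of per-SNP variance explained over roughly independent SNPs."""
    return sum(vg_from_delta(d, n) for d in deltas)


def standardize_liability_coef(tau1, maf):
    """Multiply tau1 by the SD of the allele count, sqrt(2 * MAF * (1 - MAF))."""
    return tau1 * math.sqrt(2.0 * maf * (1.0 - maf))


def total_vg_binary(tau1_list, maf_list):
    """Sum of squared standardized liability-scale coefficients."""
    return sum(standardize_liability_coef(t, m) ** 2
               for t, m in zip(tau1_list, maf_list))
```

For example, a corrected z-statistic of 2 in a sample of n = 102 gives 4/(100 + 4), i.e., about 3.8% of variance explained by that SNP.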

4.1.3. Assumptions

Regarding the assumptions of this approach, we emphasize that it does not require prior assumptions about the underlying distribution of the true effect sizes $\delta$, which is an important advantage over other SNP-heritability estimation methods. On the other hand, we assume that the summary statistics have been corrected for population stratification or other confounding factors. The z-statistics are assumed to follow normal distributions; for very small sample sizes, rare variants, highly imbalanced case-to-control ratios, or highly skewed continuous outcomes, etc., caution should be taken as to whether the test statistic $\hat{\beta}/\widehat{SE}$ follows a normal distribution. We assume full GWAS summary statistics as input; if the summary statistics have been selected based on their significance levels (e.g., some GWAS only released the top SNPs, say the top 10,000 SNPs), the proposed Tweedie’s formula may not work well. The effect sizes may be overestimated in this case, as the released SNPs have been selected for being significant.

4.1.4. An Alternative Conditional Estimator

We also proposed an alternative approach by evaluating the expected effect size conditioned on $H_1$ (i.e., $\delta \neq 0$)
$$E(\delta \mid z, H_1) = E_1(\delta \mid z) = \frac{E(\delta \mid z)}{\Pr(H_1 \mid z)} = \frac{E(\delta \mid z)}{1 - fdr(z)}$$
where fdr is the local false discovery rate described in Efron [36]. The resulting estimate of Vg can be obtained by first converting $E(\delta \mid z, H_1)$ to the Vg scale (see Section 4.1.2), then multiplying by $1 - fdr(z)$.
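The identity behind the conditional estimator follows from the law of total expectation, assuming a two-group model in which δ = 0 under the null hypothesis H0 and fdr(z) = Pr(H0 | z). A short derivation under these assumptions:

```latex
\begin{aligned}
E(\delta \mid z)
  &= \Pr(H_0 \mid z)\, E(\delta \mid z, H_0) + \Pr(H_1 \mid z)\, E(\delta \mid z, H_1) \\
  &= fdr(z) \cdot 0 + \{1 - fdr(z)\}\, E(\delta \mid z, H_1), \\
\text{hence}\quad E(\delta \mid z, H_1)
  &= \frac{E(\delta \mid z)}{1 - fdr(z)} .
\end{aligned}
```

As a numerical illustration, a SNP with E(δ | z) = 2.0 and fdr(z) = 0.6 has a conditional expectation of 2.0/0.4 = 5.0. The division by 1 − fdr(z) also shows why the estimator can become unstable when the estimated local fdr is close to one.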
The conditional estimator, however, is prone to large random variations as it involves local fdr estimation of each SNP. In many subsequent applications of our heritability estimation method [15,16,17], the unconditional estimator (Equation (1)) was primarily employed. We shall hence focus on the unconditional estimator in this paper, although the resampling approaches described below can readily be applied to other estimators in our previous work [10] as well.

4.2. Estimation of the Standard Error (SE) of Vg

4.2.1. Standard and Delete-d-Jackknife to Estimate SE

In the standard (delete-one) jackknife procedure [37], we estimate the standard error (SE) by leaving out one observation at a time. The SE is defined by
$$\widehat{se}_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)} \right)^2}$$
where n is the sample size, $\hat{\theta}_{(i)}$ is the parameter estimate from the sample with the ith observation removed and
$$\hat{\theta}_{(\cdot)} = \frac{\sum_{i=1}^{n} \hat{\theta}_{(i)}}{n}$$
In our case, the parameter is the sum of heritability from all variants.
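The delete-one jackknife can be sketched in a few lines of Python. Here `estimator` is a hypothetical placeholder for any function that maps a vector of observations to a scalar (in our setting, the full pipeline that returns the sum of Vg); it is not the interface of the SumVg software.

```python
import numpy as np

def jackknife_se(x, estimator):
    """Delete-one jackknife SE of estimator(x)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Leave-one-out estimates theta_(i), i = 1..n
    theta_i = np.array([estimator(np.delete(x, i)) for i in range(n)])
    theta_dot = theta_i.mean()
    return np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2))
```

For the sample mean, this reproduces the usual standard error (sample standard deviation divided by the square root of n), which provides a quick sanity check. The loop calls the estimator n times, so the cost grows linearly with the number of observations.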
An extension is the delete-d-jackknife [31], where we leave out d observations at a time. There are in total $N = \binom{n}{d}$ possibilities of removing d out of n observations. In practice, N is usually very large. One may simply repeat the procedure at random m times only ($m < N$) instead of exhausting all possibilities of removing d out of n observations. The standard error is given by
$$\widehat{se}_{del\text{-}d\text{-}jack} = \sqrt{\frac{n-d}{d\,m} \sum_{v=1}^{m} \left( \hat{\theta}_{S_v} - \frac{1}{m} \sum_{u=1}^{m} \hat{\theta}_{S_u} \right)^2}$$
where $\hat{\theta}_{S_v}$ denotes the parameter estimate in the vth jackknife replicate where d observations are left out. The delete-d-jackknife (when $d > 1$) works better than the standard jackknife for non-smooth parameters like the median [31].
There are no clear rules on the choice of d in the delete-d-jackknife. Chatterjee [38] suggested n/5 as a reasonable choice for d based on considerations of efficiency and likely model conditions. We followed this suggestion and set d to n/5 (=20,000) in all simulations.
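A minimal Python sketch of the randomly subsampled delete-d-jackknife is given below. As before, `estimator` is a hypothetical placeholder for any scalar-valued function of the observations, and the defaults for `m` and `seed` are illustrative only.

```python
import numpy as np

def delete_d_jackknife_se(x, estimator, d, m=200, seed=0):
    """Delete-d jackknife SE using m random subsets of size n - d."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 0 < d < n:
        raise ValueError("d must satisfy 0 < d < n")
    rng = np.random.default_rng(seed)
    theta = np.empty(m)
    for v in range(m):
        # Keep n - d observations chosen without replacement (i.e., delete d)
        keep = rng.choice(n, size=n - d, replace=False)
        theta[v] = estimator(x[keep])
    return np.sqrt((n - d) / (d * m) * np.sum((theta - theta.mean()) ** 2))
```

With d = n/5, a call would look like `delete_d_jackknife_se(z, estimator, d=len(z) // 5)`. For the sample mean, the result should be close to the usual standard error, up to Monte Carlo error from using only m random subsets.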

4.2.2. Parametric Bootstrap Approaches for Estimating SE

In the parametric bootstrap, in each replicate we simulated z-statistics based on $\hat{\delta}$, the ‘corrected’ z-statistics from the original sample (this method is referred to as ‘paraboot’). We have
$$z_{i,b} \sim N(\hat{\delta}_i, 1)$$
where $z_{i,b}$ denotes the ith z-statistic in the bth bootstrap replicate. For small effects, $\hat{\delta}$ will be shrunken towards zero.
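The ‘paraboot’ scheme can be sketched as follows. This is an illustration under stated assumptions: `estimator` is a hypothetical placeholder for the function that maps a vector of z-statistics to the estimated sum of Vg, and the SE is taken here as the standard deviation of the bootstrap replicates, which is the conventional choice.

```python
import numpy as np

def paraboot_se(delta_hat, estimator, n_boot=200, seed=0):
    """Parametric bootstrap: draw z_{i,b} ~ N(delta_hat_i, 1) and re-estimate."""
    delta_hat = np.asarray(delta_hat, dtype=float)
    rng = np.random.default_rng(seed)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        z_b = rng.normal(loc=delta_hat, scale=1.0)  # one simulated set of z-statistics
        reps[b] = estimator(z_b)
    # SE as the standard deviation across replicates (assumption of this sketch)
    return reps.std(ddof=1), reps
```

The replicates are returned alongside the SE so that they can be reused for bias correction or percentile intervals.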
We further proposed a modified approach that also considers the local fdr (i.e., the probability of being null given z) of each z-statistic. In each replicate, we simulated z-statistics according to the following scheme:
$$z_{i,b} \sim N(\hat{z}_i, 1) \quad \text{with a probability of } 1 - \widehat{\mathrm{fdr}}(z_i)$$
$$z_{i,b} \sim N(0, 1) \quad \text{with a probability of } \widehat{\mathrm{fdr}}(z_i)$$
where $\hat{z}_i$ denotes the observed z-statistic. The standard error is then computed from the simulated z-statistics. This method is referred to as ‘fdrboot1’.
Alternatively, one may employ the corrected z-statistics instead of the observed z-statistics as the mean in each simulation, i.e.,
$$z_{i,b} \sim N(\hat{\delta}_i, 1) \quad \text{with a probability of } 1 - \widehat{\mathrm{fdr}}(z_i)$$
$$z_{i,b} \sim N(0, 1) \quad \text{with a probability of } \widehat{\mathrm{fdr}}(z_i)$$
This method is referred to as ‘fdrboot2’.
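Both fdr-weighted schemes differ only in the mean used for the non-null component, so a single sampling function can cover them. The sketch below assumes that local fdr estimates are already available as an array (for example, from a separate local fdr routine); the function name and arguments are hypothetical.

```python
import numpy as np

def fdrboot_draw(center, fdr, rng):
    """Draw one bootstrap replicate of z-statistics from the fdr mixture.

    center: observed z (fdrboot1) or corrected z (fdrboot2), one value per SNP.
    fdr:    estimated local fdr for each SNP, in [0, 1].
    """
    center = np.asarray(center, dtype=float)
    fdr = np.asarray(fdr, dtype=float)
    # Each SNP is treated as null with probability fdr(z_i)
    is_null = rng.random(center.shape) < fdr
    mean = np.where(is_null, 0.0, center)
    return rng.normal(loc=mean, scale=1.0)
```

Passing the observed z-statistics as `center` gives one ‘fdrboot1’ replicate, and passing the corrected z-statistics gives one ‘fdrboot2’ replicate; repeating the draw and re-estimating Vg each time yields the replicates from which the SE is computed.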

4.3. Construction of Confidence Intervals (CIs): An Exploratory Analysis

The construction of a proper CI is a more demanding task as it requires the unbiasedness of the estimate and correct estimation of the variability of the estimate. Given the difficulty of constructing accurate CIs, here we consider CI estimation as a secondary or exploratory analysis which requires further investigation and methodological development. We have explored a few approaches as described below.

4.3.1. Normal Approximation (Standard Approach)

Firstly, we explored the standard approach for constructing the 95% CI by using normal approximation, i.e., $\widehat{V_g} \pm z_{0.975} \times \widehat{SE}(\widehat{V_g})$, where $z_{0.975}$ is the 97.5th percentile of a standard normal distribution. Assuming a polygenic model, the total heritability is the sum of variance explained contributed by many variants of small to modest effect sizes. Hence, it is reasonable to assume normality according to the central limit theorem (as is assumed by other SNP-heritability estimation tools). We examined the performance of different CIs, with SE determined by various methods. Empirically, we found that SE computed by the delete-d-jackknife performed reasonably well.
On the other hand, we also explored using the bootstrap to correct for bias in the point estimates of Vg. In brief, the bias can be estimated by [39]
$$\widehat{Bias} = \bar{\theta}^{*} - \hat{\theta}$$
where $\hat{\theta}$ denotes the observed Vg, and $\bar{\theta}^{*}$ is the mean of the bootstrapped estimates of Vg. The bias-corrected estimator of Vg is given by
$$\hat{\theta}_{\text{bias-corrected}} = \hat{\theta} - \widehat{Bias}$$
The 95% CI is then based on $\widehat{V_g}_{\text{, bias-corrected}} \pm z_{0.975} \times \widehat{SE}$. Since we proposed three bootstrap procedures, there were three bootstrap bias-corrected CIs based on normal approximation. The standard CI without bias correction was also included as another estimator.
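The normal-approximation CI, with and without bootstrap bias correction, can be sketched as below. The function name and arguments are hypothetical; `boot_reps` stands for the bootstrapped estimates of Vg from any one of the three bootstrap procedures.

```python
from statistics import NormalDist
import numpy as np

def normal_ci(theta_hat, se, boot_reps=None, level=0.95):
    """Normal-approximation CI; bias-corrected if bootstrap replicates are given."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g., z_0.975 for a 95% CI
    # Bootstrap bias estimate: mean of replicates minus the observed estimate
    bias = 0.0 if boot_reps is None else float(np.mean(boot_reps)) - theta_hat
    centre = theta_hat - bias
    return centre - z * se, centre + z * se
```

Note that the correction only shifts the centre of the interval; its width is still determined by the chosen SE estimate.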

4.3.2. Percentile Approach

Secondly, we explored the percentile CI approach, namely construction of 95% CIs based on the 2.5th and 97.5th percentiles of the bootstrapped Vg. Again, bias correction can be applied as follows
$$\text{Lower 95\% CI} = 2\hat{\theta} - \theta^{*}_{0.975}$$
$$\text{Upper 95\% CI} = 2\hat{\theta} - \theta^{*}_{0.025}$$
where $\theta^{*}_{0.025}$ and $\theta^{*}_{0.975}$ are the 2.5th and 97.5th percentiles of the bootstrapped replicates of Vg, respectively. Bias correction was based on the same bootstrap method that was used to derive the percentiles. Again, we also included the percentile CIs without bias correction.
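A short Python sketch of the percentile interval is shown below. The function name is hypothetical; when `theta_hat` is supplied, the interval is reflected around twice the observed estimate, which corresponds to the bias-corrected form above.

```python
import numpy as np

def percentile_ci(boot_reps, theta_hat=None, level=0.95):
    """Percentile CI from bootstrap replicates, optionally bias-corrected."""
    alpha = 1.0 - level
    q_lo, q_hi = np.quantile(boot_reps, [alpha / 2, 1 - alpha / 2])
    if theta_hat is None:
        # Plain percentile interval
        return float(q_lo), float(q_hi)
    # Bias-corrected: lower = 2*theta_hat - upper percentile, and vice versa
    return float(2 * theta_hat - q_hi), float(2 * theta_hat - q_lo)
```

The upper percentile determines the lower limit in the corrected version, so an upward-skewed bootstrap distribution pulls the interval downwards.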

4.3.3. Union CI

Thirdly, we explored a more robust CI estimator by taking the union of individual CIs (UCI). The union of multiple CIs is constructed by taking the minimum of the lower CIs across different methods as the final lower CI, and the maximum of different upper CIs as the final upper CI. This union approach can ensure better robustness if CI construction approaches perform differently under different scenarios. The UCI method has been widely employed in instrumental variables regression to improve robustness of results in the presence of pleiotropy [40].
In summary, the following methods were explored:
  1. Normal approximation (standard approach), without bias correction (one estimator) or with bootstrap bias correction (3 estimators), then take the union of CIs;
  2. Percentile approach, without bias correction (3 estimators) and with bias correction (3 estimators), then take the union of CIs;
  3. Union of the final CIs obtained from 1 and 2.
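The union step itself is simple and can be sketched as follows, where each interval is represented as a (lower, upper) pair. The function name and the example values are purely illustrative.

```python
def union_ci(intervals):
    """Union CI: minimum of the lower limits and maximum of the upper limits."""
    lowers, uppers = zip(*intervals)
    return min(lowers), max(uppers)

# Illustrative values only: union over three hypothetical CIs
combined = union_ci([(0.10, 0.30), (0.05, 0.25), (0.12, 0.40)])
```

Because the union can only widen the interval, it trades some precision for robustness across scenarios.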

4.4. Simulation Studies

We compared the SE estimated from the above methods with the ‘true’ SE obtained from one hundred simulations with known data-generating distributions. The simulation design follows our previous work [10]. Briefly, a gamma distribution was used to simulate three levels of variance explained (Vg = 0.101, 0.191, 0.295), which were converted to true effect sizes (δ). Z-statistics for 100,000 independent SNPs (0.5% were non-null) with different sample sizes (N = 5000, 10,000, 20,000, 50,000, 100,000, 200,000) were then simulated from the distribution N(δ,1) as input for SumVg. Two hundred replicates were run for each bootstrap or jackknife procedure. We focused on quantitative traits in our simulations, but the results should most likely apply to binary traits as well, as the only difference between the two scenarios is the formula used to convert z to variance explained (Vg). The performance of different methods for CI construction was also evaluated.
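The data-generating step can be sketched roughly as below. Several details are assumptions made only for illustration and are not taken from the original simulations: the gamma shape parameter, the rescaling of per-SNP Vg so that it sums exactly to the target, the random signs, and the quantitative-trait conversion Vg = δ²/(N − 2 + δ²), which stands in for the formula referred to in Section 4.1.2.

```python
import numpy as np

def simulate_z(n_snps=100_000, prop_nonnull=0.005, total_vg=0.191,
               n_samples=10_000, shape=1.0, seed=0):
    """Simulate z-statistics for independent SNPs with a given total Vg."""
    rng = np.random.default_rng(seed)
    n_causal = int(round(n_snps * prop_nonnull))
    # Per-SNP variance explained drawn from a gamma distribution (shape is hypothetical)
    vg = rng.gamma(shape, 1.0, size=n_causal)
    vg *= total_vg / vg.sum()  # rescale so that the per-SNP Vg sums to the target
    # Assumed conversion: Vg = delta^2 / (N - 2 + delta^2)
    delta = np.zeros(n_snps)
    signs = rng.choice([-1.0, 1.0], size=n_causal)
    delta[:n_causal] = signs * np.sqrt(vg * (n_samples - 2) / (1.0 - vg))
    z = rng.normal(loc=delta, scale=1.0)  # z ~ N(delta, 1)
    return z, delta
```

Repeating this generation step many times and re-estimating Vg on each simulated dataset gives an empirical ‘true’ SE against which the resampling-based estimates can be compared.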

4.5. Application to Immune Traits

A selected set of immune-related traits (levels of cytokines/growth factors) was included for study, based on the GWAS by Ahola-Olli et al. [24]. We selected 12 continuous immune traits with (1) sample size N > 5000 and (2) very low (≤3%) or negative SNP-based heritability estimated by LDSC. The LDSC heritability estimates were based on pre-calculated values from GWASAtlas (https://atlas.ctglab.nl/; accessed 1 May 2023). SNPs in strong LD were removed using the PLINK command “--indep-pairwise 100 25 r2” with a series of r2 thresholds (0.1, 0.05, 0.025, 0.01, 0.005, 0.002, 0.001). The 1000G Phase3 EUR sample was used as the reference panel to calculate LD among variants. Independent SNPs with MAF > 0.01 were then applied to SumVg.
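The series of pruning runs can be scripted; the sketch below only assembles the PLINK command lines without executing them. The reference-panel prefix `1000G_EUR` and the output prefix are hypothetical file names, while `--bfile`, `--indep-pairwise` and `--out` are standard PLINK options.

```python
R2_THRESHOLDS = [0.1, 0.05, 0.025, 0.01, 0.005, 0.002, 0.001]

def pruning_commands(bfile="1000G_EUR", out_prefix="pruned"):
    """Build one PLINK LD-pruning command per r2 threshold (not executed here)."""
    cmds = []
    for r2 in R2_THRESHOLDS:
        cmds.append([
            "plink", "--bfile", bfile,            # hypothetical reference-panel prefix
            "--indep-pairwise", "100", "25", str(r2),
            "--out", f"{out_prefix}_r2_{r2}",     # hypothetical output prefix
        ])
    return cmds
```

Each command list could be passed to `subprocess.run` if PLINK is installed; the resulting `.prune.in` files list the SNPs retained at each threshold.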

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms25021347/s1.

Author Contributions

Conceptualization, H.-C.S.; methodology, H.-C.S. and P.-C.S.; software, H.-C.S.; formal analysis, H.-C.S.; investigation, H.-C.S., X.X., Z.M. and P.-C.S.; data curation, H.-C.S., X.X. and Z.M.; writing—original draft preparation, H.-C.S.; writing—review and editing, H.-C.S., X.X., Z.M. and P.-C.S.; supervision, H.-C.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a Theme-based Research Grant, grant number T44-410/21-N from the Research Grants Council (RGC), an NSFC grant (grant number 81971706), and a Collaborative Research Fund (CRF), grant number C4054-17W from RGC. H.-C.S. was also supported by the KIZ-CUHK Joint Laboratory of Bioresources and Molecular Research of Common Diseases, and the Lo Kwee Seong Biomedical Research Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The GWAS summary statistics of the immune traits were downloaded from GWASAtlas (https://atlas.ctglab.nl/; accessed 1 May 2023); LDSC heritability estimates were extracted from the same website.

Acknowledgments

We would also like to thank Jinghong Qiu for help in formatting the manuscript, and Kenneth C. Y. Wong for help with the software coding and documentation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, J.; Benyamin, B.; McEvoy, B.P.; Gordon, S.; Henders, A.K.; Nyholt, D.R.; Madden, P.A.; Heath, A.C.; Martin, N.G.; Montgomery, G.W. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 2010, 42, 565–569.
  2. Speed, D.; Hemani, G.; Johnson, M.R.; Balding, D.J. Improved heritability estimation from genome-wide SNPs. Am. J. Hum. Genet. 2012, 91, 1011–1021.
  3. Bulik-Sullivan, B.K.; Loh, P.R.; Finucane, H.K.; Ripke, S.; Yang, J.; Schizophrenia Working Group of the Psychiatric Genomics Consortium; Patterson, N.; Daly, M.J.; Price, A.L.; Neale, B.M. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 2015, 47, 291–295.
  4. Speed, D.; Balding, D.J. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat. Genet. 2019, 51, 277–284.
  5. Zhu, H.; Zhou, X. Statistical methods for SNP heritability estimation and partition: A review. Comput. Struct. Biotechnol. J. 2020, 18, 1557–1568.
  6. Barry, C.-J.S.; Walker, V.M.; Cheesman, R.; Davey Smith, G.; Morris, T.T.; Davies, N.M. How to estimate heritability: A guide for genetic epidemiologists. Int. J. Epidemiol. 2023, 52, 624–632.
  7. Zuk, O.; Hechter, E.; Sunyaev, S.R.; Lander, E.S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc. Natl. Acad. Sci. USA 2012, 109, 1193–1198.
  8. Brandes, N.; Weissbrod, O.; Linial, M. Open problems in human trait genetics. Genome Biol. 2022, 23, 131.
  9. Young, A.I. Solving the missing heritability problem. PLoS Genet. 2019, 15, e1008222.
  10. So, H.C.; Li, M.; Sham, P.C. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet. Epidemiol. 2011, 35, 447–456.
  11. Robbins, H. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Cambridge, UK, 26–31 December 1954, July and August 1955; University of California Press: Berkeley, CA, USA; Los Angeles, CA, USA, 1956; Volume 1, pp. 157–163.
  12. Brown, L.D. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Stat. 1971, 42, 855–903.
  13. Efron, B. Empirical Bayes estimates for large-scale prediction problems. J. Am. Stat. Assoc. 2009, 104, 1015–1028.
  14. Zhang, Y.; Cheng, Y.; Jiang, W.; Ye, Y.; Lu, Q.; Zhao, H. Comparison of methods for estimating genetic correlation between complex traits using GWAS summary statistics. Brief. Bioinform. 2021, 22, bbaa442.
  15. Benke, K.S.; Nivard, M.G.; Velders, F.P.; Walters, R.K.; Pappa, I.; Scheet, P.A.; Xiao, X.; Ehli, E.A.; Palmer, L.J.; Whitehouse, A.J.; et al. A genome-wide association meta-analysis of preschool internalizing problems. J. Am. Acad. Child. Adolesc. Psychiatry 2014, 53, 667–676.e667.
  16. Lubke, G.H.; Hottenga, J.J.; Walters, R.; Laurin, C.; de Geus, E.J.; Willemsen, G.; Smit, J.H.; Middeldorp, C.M.; Penninx, B.W.; Vink, J.M.; et al. Estimating the genetic variance of major depressive disorder due to all single nucleotide polymorphisms. Biol. Psychiatry 2012, 72, 707–709.
  17. van Beek, J.H.; Lubke, G.H.; de Moor, M.H.; Willemsen, G.; de Geus, E.J.; Hottenga, J.J.; Walters, R.K.; Smit, J.H.; Penninx, B.W.; Boomsma, D.I. Heritability of liver enzyme levels estimated from genome-wide SNP data. Eur. J. Hum. Genet. 2014, 23, 1223–1228.
  18. Hibar, D.P.; Stein, J.L.; Renteria, M.E.; Arias-Vasquez, A.; Desrivieres, S.; Jahanshad, N.; Toro, R.; Wittfeld, K.; Abramovic, L.; Andersson, M.; et al. Common genetic variants influence human subcortical brain structures. Nature 2015, 520, 224–229.
  19. Paternoster, L.; Standl, M.; Waage, J.; Baurecht, H.; Hotze, M.; Strachan, D.P.; Curtin, J.A.; Bonnelykke, K.; Tian, C.; Takahashi, A.; et al. Multi-ancestry genome-wide association study of 21,000 cases and 95,000 controls identifies new risk loci for atopic dermatitis. Nat. Genet. 2015, 47, 1449–1456.
  20. Lo, M.T.; Hinds, D.A.; Tung, J.Y.; Franz, C.; Fan, C.C.; Wang, Y.; Smeland, O.B.; Schork, A.; Holland, D.; Kauppi, K.; et al. Genome-wide analyses for personality traits identify six genomic loci and show correlations with psychiatric disorders. Nat. Genet. 2017, 49, 152–156.
  21. Minica, C.C.; Verweij, K.J.H.; van der Most, P.J.; Mbarek, H.; Bernard, M.; van Eijk, K.R.; Lind, P.A.; Liu, M.Z.; Maciejewski, D.F.; Palviainen, T.; et al. Genome-wide association meta-analysis of age at first cannabis use. Addiction 2018, 113, 2073–2086.
  22. Ahluwalia, T.S.; Prins, B.P.; Abdollahi, M.; Armstrong, N.J.; Aslibekyan, S.; Bain, L.; Jefferis, B.; Baumert, J.; Beekman, M.; Ben-Shlomo, Y.; et al. Genome-wide association study of circulating interleukin 6 levels identifies novel loci. Hum. Mol. Genet. 2021, 30, 393–409.
  23. Shin, S.H.; Park, S.; Wright, C.; D’Astous, V.A.; Kim, G. The Role of Polygenic Score and Cognitive Activity in Cognitive Functioning Among Older Adults. Gerontologist 2021, 61, 319–329.
  24. Ahola-Olli, A.V.; Würtz, P.; Havulinna, A.S.; Aalto, K.; Pitkänen, N.; Lehtimäki, T.; Kähönen, M.; Lyytikäinen, L.P.; Raitoharju, E.; Seppälä, I.; et al. Genome-wide Association Study Identifies 27 Loci Influencing Concentrations of Circulating Cytokines and Growth Factors. Am. J. Hum. Genet. 2017, 100, 40–50.
  25. Turner, M.D.; Nedjai, B.; Hurst, T.; Pennington, D.J. Cytokines and chemokines: At the crossroads of cell signalling and inflammatory disease. Biochim. Biophys. Acta (BBA)—Mol. Cell Res. 2014, 1843, 2563–2582.
  26. Steinsaltz, D.; Dahl, A.; Wachter, K.W. On Negative Heritability and Negative Estimates of Heritability. Genetics 2020, 215, 343–357.
  27. Wied, D.; Weißbach, R. Consistency of the kernel density estimator: A survey. Stat. Pap. 2012, 53, 1–21.
  28. Efron, B. Tweedie’s formula and selection bias. J. Am. Stat. Assoc. 2011, 106, 1602.
  29. Carry, P.M.; Vanderlinden, L.A.; Dong, F.; Buckner, T.; Litkowski, E.; Vigers, T.; Norris, J.M.; Kechris, K. Inverse probability weighting is an effective method to address selection bias during the analysis of high dimensional data. Genet. Epidemiol. 2021, 45, 593–603.
  30. Horowitz, J.L. Bootstrap methods in econometrics. Annu. Rev. Econ. 2019, 11, 193–224.
  31. Shao, J.; Wu, C.J. A general theory for jackknife variance estimation. Ann. Stat. 1989, 17, 1176–1197.
  32. Zhong, H.; Prentice, R.L. Bias-reduced estimators and confidence intervals for odds ratios in genome-wide association studies. Biostatistics 2008, 9, 621–634.
  33. Sun, L.; Bull, S.B. Reduction of selection bias in genomewide studies by resampling. Genet. Epidemiol. Off. Publ. Int. Genet. Epidemiol. Soc. 2005, 28, 352–367.
  34. Zöllner, S.; Pritchard, J.K. Overcoming the winner’s curse: Estimating penetrance parameters from case-control data. Am. J. Hum. Genet. 2007, 80, 605–615.
  35. Gillett, A.C.; Vassos, E.; Lewis, C.M. Transforming summary statistics from logistic regression to the liability scale: Application to genetic and environmental risk scores. Hum. Hered. 2019, 83, 210–224.
  36. Efron, B.; Tibshirani, R.; Storey, J.D.; Tusher, V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 2001, 96, 1151–1160.
  37. Miller, R.G. The jackknife—A review. Biometrika 1974, 61, 1–15.
  38. Chatterjee, S. Another look at the jackknife: Further examples of generalized bootstrap. Stat. Probab. Lett. 1998, 40, 307–319.
  39. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994.
  40. Conley, T.G.; Hansen, C.B.; Rossi, P.E. Plausibly exogenous. Rev. Econ. Stat. 2012, 94, 260–272.
Figure 1. Boxplots of SE estimated by different approaches. Vg is the sum of variance explained, N is the sample size, and the horizontal line refers to true SE calculated by repeating the experiments 100 times based on the true data generating mechanism. jack_del_1, jack_del_d, paraboot, fdrboot1 and fdrboot2 are different SE estimation approaches as described above.
Figure 2. Bar plots of bias, variance and root mean squared error (RMSE) of SE estimated by different approaches in simulations. Vg is the sum of variance explained, and N is the sample size. jack_del_1, jack_del_d, paraboot, fdrboot1 and fdrboot2 are different SE estimation approaches as described above.
Table 1. Standard error (SE) of the sum of variance explained (Vg) estimated by different resampling approaches.
| Sum_of_Vg | Sample_Size | Mean_Est | TRUE_SE | SE: jack_del_1 | SE: jack_del_d | SE: paraboot | SE: fdrboot1 | SE: fdrboot2 |
|---|---|---|---|---|---|---|---|---|
| 0.295 | 5000 | 0.232 | 0.0482 | 0.0672 | 0.0524 | 0.0488 | 0.0519 | 0.0489 |
| | 10,000 | 0.210 | 0.0265 | 0.0353 | 0.0295 | 0.0285 | 0.0312 | 0.0287 |
| | 20,000 | 0.244 | 0.0158 | 0.0208 | 0.0185 | 0.0165 | 0.0156 | 0.0168 |
| | 50,000 | 0.283 | 0.0076 | 0.0149 | 0.0167 | 0.0063 | 0.0081 | 0.0063 |
| | 1 × 10^5 | 0.312 | 0.0063 | 0.0143 | 0.0172 | 0.0051 | 0.0055 | 0.0054 |
| | 2 × 10^5 | 0.321 | 0.0045 | 0.0134 | 0.0161 | 0.0036 | 0.0038 | 0.0041 |
| 0.191 | 5000 | 0.207 | 0.0491 | 0.0706 | 0.0523 | 0.0486 | 0.0500 | 0.0485 |
| | 10,000 | 0.147 | 0.0242 | 0.0357 | 0.0274 | 0.0263 | 0.0285 | 0.0265 |
| | 20,000 | 0.158 | 0.0159 | 0.0208 | 0.0166 | 0.0156 | 0.0162 | 0.0160 |
| | 50,000 | 0.174 | 0.0064 | 0.0113 | 0.0113 | 0.0061 | 0.0070 | 0.0061 |
| | 1 × 10^5 | 0.195 | 0.0045 | 0.0110 | 0.0131 | 0.0040 | 0.0047 | 0.0041 |
| | 2 × 10^5 | 0.207 | 0.0035 | 0.0103 | 0.0128 | 0.0031 | 0.0034 | 0.0035 |
| 0.101 | 5000 | 0.197 | 0.0521 | 0.0692 | 0.0524 | 0.0484 | 0.0496 | 0.0483 |
| | 10,000 | 0.116 | 0.0260 | 0.0345 | 0.0265 | 0.0251 | 0.0257 | 0.0251 |
| | 20,000 | 0.098 | 0.0143 | 0.0202 | 0.0159 | 0.0150 | 0.0153 | 0.0154 |
| | 50,000 | 0.091 | 0.0058 | 0.0098 | 0.0078 | 0.0063 | 0.0057 | 0.0063 |
| | 1 × 10^5 | 0.094 | 0.0032 | 0.0069 | 0.0076 | 0.0027 | 0.0036 | 0.0027 |
| | 2 × 10^5 | 0.107 | 0.0028 | 0.0072 | 0.0083 | 0.0023 | 0.0027 | 0.0025 |
The main purpose of this table is to compare the true SEs against the SEs estimated by various resampling-based methods. Sum_of_Vg, true total heritability explained (i.e., the real total heritability based on our data-generating mechanism); Sample_size refers to the sample size of the GWAS; Mean_Est, the mean estimated heritability explained based on our approach of corrected z-statistics; True_SE, the ‘true’ SE was based on repeating our simulation experiments 100 times; jack_del_1, delete-1-jackknife; jack_del_d, delete-d-jackknife with d equal to 20,000; paraboot, parametric bootstrap approach as described in the text, based on simulating from a normal distribution in which the mean was derived from the corrected z-statistics (without consideration of local false discovery rates (fdr)); fdrboot1, a “weighted” bootstrap approach with consideration of the local fdr, using the observed z-statistic as the mean in each simulation; fdrboot2, a “weighted” bootstrap approach with consideration of the local fdr, using the corrected z-statistic as the mean in each simulation.
Table 2. Bias, variance and root mean squared error (RMSE) of SE estimated by different resampling approaches.
Bias of the estimator for SE:

| Sum_Vg | N | jack_del_1 | jack_del_d | paraboot | fdrboot1 | fdrboot2 |
|---|---|---|---|---|---|---|
| 0.295 | 5000 | 1.91 × 10^−2 | 4.26 × 10^−3 | 5.92 × 10^−4* | 3.73 × 10^−3 | 7.16 × 10^−4 |
| | 10,000 | 8.81 × 10^−3 | 2.98 × 10^−3 | 2.00 × 10^−3* | 4.66 × 10^−3 | 2.25 × 10^−3 |
| | 20,000 | 5.04 × 10^−3 | 2.78 × 10^−3 | 7.37 × 10^−4 | −1.46 × 10^−4* | 1.07 × 10^−3 |
| | 50,000 | 7.25 × 10^−3 | 9.03 × 10^−3 | −1.32 × 10^−3 | 4.68 × 10^−4* | −1.30 × 10^−3 |
| | 1 × 10^5 | 7.97 × 10^−3 | 1.09 × 10^−2 | −1.20 × 10^−3 | −8.78 × 10^−4* | −9.34 × 10^−4 |
| | 2 × 10^5 | 8.92 × 10^−3 | 1.16 × 10^−2 | −8.57 × 10^−4 | −6.32 × 10^−4 | −3.72 × 10^−4* |
| 0.191 | 5000 | 2.16 × 10^−2 | 3.21 × 10^−3 | −5.02 × 10^−4* | 9.69 × 10^−4 | −5.23 × 10^−4 |
| | 10,000 | 1.15 × 10^−2 | 3.16 × 10^−3 | 2.04 × 10^−3* | 4.22 × 10^−3 | 2.29 × 10^−3 |
| | 20,000 | 4.96 × 10^−3 | 7.22 × 10^−4 | −2.41 × 10^−4 | 3.15 × 10^−4 | 9.85 × 10^−5* |
| | 50,000 | 4.90 × 10^−3 | 4.83 × 10^−3 | −2.92 × 10^−4* | 5.76 × 10^−4 | −2.97 × 10^−4 |
| | 1 × 10^5 | 6.45 × 10^−3 | 8.56 × 10^−3 | −5.40 × 10^−4 | 1.30 × 10^−4* | −4.33 × 10^−4 |
| | 2 × 10^5 | 6.85 × 10^−3 | 9.28 × 10^−3 | −3.57 × 10^−4 | −1.41 × 10^−4 | −3.31 × 10^−5* |
| 0.101 | 5000 | 1.71 × 10^−2 | 3.38 × 10^−4* | −3.72 × 10^−3 | −2.54 × 10^−3 | −3.81 × 10^−3 |
| | 10,000 | 8.45 × 10^−3 | 4.62 × 10^−4 | −8.81 × 10^−4 | −2.84 × 10^−4* | −9.03 × 10^−4 |
| | 20,000 | 5.92 × 10^−3 | 1.56 × 10^−3 | 7.21 × 10^−4* | 1.04 × 10^−3 | 1.08 × 10^−3 |
| | 50,000 | 4.03 × 10^−3 | 2.04 × 10^−3 | 4.91 × 10^−4 | −1.02 × 10^−4* | 5.85 × 10^−4 |
| | 1 × 10^5 | 3.73 × 10^−3 | 4.34 × 10^−3 | −5.10 × 10^−4 | 4.31 × 10^−4* | −5.17 × 10^−4 |
| | 2 × 10^5 | 4.38 × 10^−3 | 5.48 × 10^−3 | −5.31 × 10^−4 | −1.61 × 10^−4* | −3.81 × 10^−4 |

Variance of the estimator for SE:

| Sum_Vg | N | jack_del_1 | jack_del_d | paraboot | fdrboot1 | fdrboot2 |
|---|---|---|---|---|---|---|
| 0.295 | 5000 | 1.77 × 10^−4 | 5.14 × 10^−5 | 1.26 × 10^−5 | 1.15 × 10^−5 | 9.30 × 10^−6* |
| | 10,000 | 7.34 × 10^−5 | 1.21 × 10^−5 | 3.87 × 10^−6 | 4.99 × 10^−6 | 3.62 × 10^−6* |
| | 20,000 | 9.06 × 10^−5 | 2.27 × 10^−6 | 8.33 × 10^−7* | 1.12 × 10^−6 | 1.00 × 10^−6 |
| | 50,000 | 1.29 × 10^−4 | 1.45 × 10^−6 | 1.36 × 10^−7 | 1.93 × 10^−7 | 1.30 × 10^−7* |
| | 1 × 10^5 | 1.41 × 10^−4 | 1.52 × 10^−6 | 8.11 × 10^−8* | 3.48 × 10^−7 | 1.00 × 10^−7 |
| | 2 × 10^5 | 1.37 × 10^−4 | 8.70 × 10^−7 | 3.49 × 10^−8* | 1.30 × 10^−7 | 4.06 × 10^−8 |
| 0.191 | 5000 | 5.53 × 10^−4 | 5.41 × 10^−5 | 1.07 × 10^−5 | 1.02 × 10^−5 | 8.58 × 10^−6* |
| | 10,000 | 2.80 × 10^−4 | 1.43 × 10^−5 | 4.88 × 10^−6 | 2.98 × 10^−6* | 5.03 × 10^−6 |
| | 20,000 | 1.19 × 10^−4 | 2.70 × 10^−6 | 1.09 × 10^−6 | 1.31 × 10^−6 | 7.64 × 10^−7* |
| | 50,000 | 1.10 × 10^−4 | 8.17 × 10^−7 | 1.66 × 10^−7 | 1.50 × 10^−7 | 1.24 × 10^−7* |
| | 1 × 10^5 | 1.32 × 10^−4 | 1.68 × 10^−6 | 6.30 × 10^−8* | 1.32 × 10^−7 | 8.37 × 10^−8 |
| | 2 × 10^5 | 1.28 × 10^−4 | 1.06 × 10^−6 | 3.04 × 10^−8* | 1.54 × 10^−7 | 3.78 × 10^−8 |
| 0.101 | 5000 | 1.81 × 10^−4 | 6.26 × 10^−5 | 1.01 × 10^−5 | 1.07 × 10^−5 | 7.70 × 10^−6* |
| | 10,000 | 1.66 × 10^−4 | 1.45 × 10^−5 | 3.49 × 10^−6 | 1.66 × 10^−6* | 2.46 × 10^−6 |
| | 20,000 | 1.44 × 10^−4 | 4.20 × 10^−6 | 8.80 × 10^−7 | 1.22 × 10^−6 | 8.17 × 10^−7* |
| | 50,000 | 6.00 × 10^−5 | 4.31 × 10^−7 | 1.61 × 10^−7 | 1.08 × 10^−7* | 1.26 × 10^−7 |
| | 1 × 10^5 | 9.13 × 10^−5 | 3.03 × 10^−7 | 2.83 × 10^−8 | 4.34 × 10^−8 | 2.05 × 10^−8* |
| | 2 × 10^5 | 1.00 × 10^−4 | 3.41 × 10^−7 | 2.22 × 10^−8* | 7.06 × 10^−8 | 2.88 × 10^−8 |

RMSE of the estimator for SE:

| Sum_Vg | N | jack_del_1 | jack_del_d | paraboot | fdrboot1 | fdrboot2 |
|---|---|---|---|---|---|---|
| 0.295 | 5000 | 2.32 × 10^−2 | 8.34 × 10^−3 | 3.59 × 10^−3 | 5.04 × 10^−3 | 3.13 × 10^−3* |
| | 10,000 | 1.23 × 10^−2 | 4.58 × 10^−3 | 2.80 × 10^−3* | 5.17 × 10^−3 | 2.94 × 10^−3 |
| | 20,000 | 1.08 × 10^−2 | 3.16 × 10^−3 | 1.17 × 10^−3 | 1.07 × 10^−3* | 1.47 × 10^−3 |
| | 50,000 | 1.35 × 10^−2 | 9.11 × 10^−3 | 1.37 × 10^−3 | 6.42 × 10^−4* | 1.35 × 10^−3 |
| | 1 × 10^5 | 1.43 × 10^−2 | 1.10 × 10^−2 | 1.23 × 10^−3 | 1.06 × 10^−3 | 9.86 × 10^−4* |
| | 2 × 10^5 | 1.47 × 10^−2 | 1.16 × 10^−2 | 8.77 × 10^−4 | 7.27 × 10^−4 | 4.23 × 10^−4* |
| 0.191 | 5000 | 3.19 × 10^−2 | 8.03 × 10^−3 | 3.31 × 10^−3 | 3.34 × 10^−3 | 2.98 × 10^−3* |
| | 10,000 | 2.03 × 10^−2 | 4.93 × 10^−3 | 3.01 × 10^−3* | 4.56 × 10^−3 | 3.20 × 10^−3 |
| | 20,000 | 1.20 × 10^−2 | 1.79 × 10^−3 | 1.07 × 10^−3 | 1.19 × 10^−3 | 8.80 × 10^−4* |
| | 50,000 | 1.16 × 10^−2 | 4.91 × 10^−3 | 5.01 × 10^−4 | 6.94 × 10^−4 | 4.60 × 10^−4* |
| | 1 × 10^5 | 1.32 × 10^−2 | 8.65 × 10^−3 | 5.96 × 10^−4 | 3.85 × 10^−4* | 5.21 × 10^−4 |
| | 2 × 10^5 | 1.32 × 10^−2 | 9.34 × 10^−3 | 3.97 × 10^−4 | 4.17 × 10^−4 | 1.97 × 10^−4* |
| 0.101 | 5000 | 2.17 × 10^−2 | 7.92 × 10^−3 | 4.90 × 10^−3 | 4.14 × 10^−3* | 4.71 × 10^−3 |
| | 10,000 | 1.54 × 10^−2 | 3.83 × 10^−3 | 2.07 × 10^−3 | 1.32 × 10^−3* | 1.81 × 10^−3 |
| | 20,000 | 1.34 × 10^−2 | 2.57 × 10^−3 | 1.18 × 10^−3* | 1.52 × 10^−3 | 1.41 × 10^−3 |
| | 50,000 | 8.73 × 10^−3 | 2.15 × 10^−3 | 6.34 × 10^−4 | 3.44 × 10^−4* | 6.85 × 10^−4 |
| | 1 × 10^5 | 1.03 × 10^−2 | 4.38 × 10^−3 | 5.37 × 10^−4 | 4.79 × 10^−4* | 5.36 × 10^−4 |
| | 2 × 10^5 | 1.09 × 10^−2 | 5.51 × 10^−3 | 5.52 × 10^−4 | 3.11 × 10^−4* | 4.17 × 10^−4 |

The table shows the bias, variance and root mean squared error of SE estimated from our methods, as compared to the true SE. Sum_Vg, true total heritability explained; N, sample size. The best performing method (for estimation of SE) in each scenario is marked with an asterisk (*). For other abbreviations, please refer to Table 1.
Table 3. Coverage probabilities of different union CI (UCI) approaches for 95% CI.
| N | Union CI Type | Coverage (Vg = 0.295) | Coverage (Vg = 0.191) | Coverage (Vg = 0.101) |
|---|---|---|---|---|
| 5000 | Standard | 0.75 | 0.97 | 0.77 |
| | Percentile | 1 | 1 | 0.78 |
| | Standard + Percentile | 1 | 1 | 0.78 |
| 10,000 | Standard | 0.6 | 0.67 | 0.94 |
| | Percentile | 0.99 | 1 | 1 |
| | Standard + Percentile | 0.99 | 1 | 1 |
| 20,000 | Standard | 0.89 | 0.84 | 0.96 |
| | Percentile | 0.91 | 1 | 1 |
| | Standard + Percentile | 0.97 | 1 | 1 |
| 50,000 | Standard | 1 | 1 | 0.9 |
| | Percentile | 1 | 1 | 1 |
| | Standard + Percentile | 1 | 1 | 1 |
| 1 × 10^5 | Standard | 1 | 1 | 1 |
| | Percentile | 1 | 1 | 1 |
| | Standard + Percentile | 1 | 1 | 1 |
| 2 × 10^5 | Standard | 0.96 | 1 | 1 |
| | Percentile | 0.13 | 0.66 | 1 |
| | Standard + Percentile | 0.96 | 1 | 1 |
Notes for Table 3: The following methods were explored: 1. Normal approximation (standard approach) without bias correction (one estimator) or with bootstrap bias correction (3 estimators), and the union of CIs was taken; (“Standard”). 2. Percentile approach without bias correction (3 estimators) and with bias correction (3 estimators), and the union of CIs was taken; (“Percentile”). 3. Union of the final union CIs obtained from 1 and 2. (“Standard+Percentile”). Coverage refers to the coverage probabilities based on simulations.
Table 4. Summary of the immune traits being studied.
| Trait | Abbreviation | GWAS ID | N | SNP_h2 (LDSC) | SNP_h2_se (LDSC) |
|---|---|---|---|---|---|
| Stem cell factor | SCF | ebi-a-GCST004429 | 8290 | −0.06 | 0.055 |
| Interleukin-4 | IL4 | ebi-a-GCST004453 | 8124 | −0.0446 | 0.0595 |
| Interleukin-17 | IL17 | ebi-a-GCST004442 | 7760 | −0.0407 | 0.0623 |
| Hepatocyte growth factor | HGF | ebi-a-GCST004449 | 8292 | −0.0311 | 0.0579 |
| Basic fibroblast growth factor | FGFBasic | ebi-a-GCST004459 | 7565 | −0.0159 | 0.0597 |
| Stromal cell-derived factor-1 alpha (CXCL12) | SDF1a | ebi-a-GCST004427 | 5998 | −0.0116 | 0.0713 |
| Interleukin-6 | IL6 | ebi-a-GCST004446 | 8189 | −0.0071 | 0.0568 |
| Platelet derived growth factor BB | PDGFbb | ebi-a-GCST004432 | 8293 | −0.0043 | 0.0624 |
| TNF-related apoptosis inducing ligand | TRAIL | ebi-a-GCST004424 | 8186 | 0.0125 | 0.0613 |
| Interferon-gamma | IFNg | ebi-a-GCST004456 | 7701 | 0.0134 | 0.0624 |
| Granulocyte colony-stimulating factor | GCSF | ebi-a-GCST004458 | 7904 | 0.0173 | 0.0601 |
| Interleukin-10 | IL10 | ebi-a-GCST004444 | 7681 | 0.0186 | 0.0691 |
Trait, trait name of analyzed GWAS dataset; abbreviation, abbreviation of the trait name; GWAS ID, ID of GWAS dataset for downloading from the IEU OpenGWAS Project; N, sample size; SNP_h2 (LDSC), SNP heritability estimated by LDSC as reported in GWASAtlas; SNP_h2_se (LDSC), standard error of SNP heritability estimated by LDSC as reported in GWASAtlas.
Table 5. SE of the sum of variance explained estimated by different resampling approaches, for 12 immune traits (under different r2 pruning thresholds).
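| Trait | N | LDSC h2 | LDSC se | SumVg h2 | r2 | n_pruned_snp | se_jack1 | se_jack_del_d | se_paraboot | se_fdrboot1 | se_fdrboot2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SCF | 8290 | −0.06 | 0.055 | 0.333 | 0.1 | 428,593 | 0.0926 | 0.0822 | 0.0679 | 0.0443 | 0.0514 |
| | | | | 0.185 | 0.05 | 251,008 | 0.0526 | 0.0456 | 0.0467 | 0.0502 | 0.0517 |
| | | | | 0.105 | 0.025 | 127,908 | 0.0307 | 0.0313 | 0.0272 | 0.0397 | 0.0335 |
| | | | | 0.100 | 0.01 | 61,938 | 0.0310 | 0.0200 | 0.0220 | 0.0252 | 0.0265 |
| | | | | 0.092 | 0.005 | 51,370 | 0.0229 | 0.0169 | 0.0201 | 0.0235 | 0.0230 |
| | | | | 0.101 | 0.002 | 48,088 | 0.0319 | 0.0153 | 0.0220 | 0.0226 | 0.0198 |
| | | | | 0.102 | 0.001 | 47,108 | 0.0316 | 0.0155 | 0.0223 | 0.0216 | 0.0188 |
| IL4 | 8124 | −0.0446 | 0.0595 | 0.503 | 0.1 | 427,005 | 0.1218 | 0.1133 | 0.0616 | 0.0563 | 0.0569 |
| | | | | 0.377 | 0.05 | 249,710 | 0.1000 | 0.0823 | 0.0484 | 0.0445 | 0.0453 |
| | | | | 0.302 | 0.025 | 127,248 | 0.0650 | 0.0594 | 0.0318 | 0.0365 | 0.0336 |
| | | | | 0.235 | 0.01 | 61,685 | 0.0529 | 0.0313 | 0.0247 | 0.0240 | 0.0236 |
| | | | | 0.215 | 0.005 | 51,196 | 0.0472 | 0.0278 | 0.0227 | 0.0217 | 0.0221 |
| | | | | 0.197 | 0.002 | 47,878 | 0.0571 | 0.0273 | 0.0228 | 0.0225 | 0.0253 |
| | | | | 0.187 | 0.001 | 46,911 | 0.0482 | 0.0244 | 0.0198 | 0.0226 | 0.0242 |
| IL17 | 7760 | −0.0407 | 0.0623 | 0.352 | 0.1 | 427,226 | 0.1240 | 0.0946 | 0.0692 | 0.0625 | 0.0609 |
| | | | | 0.228 | 0.05 | 250,259 | 0.0683 | 0.0668 | 0.0499 | 0.0495 | 0.0495 |
| | | | | 0.299 | 0.025 | 127,479 | 0.0877 | 0.0568 | 0.0360 | 0.0380 | 0.0323 |
| | | | | 0.234 | 0.01 | 61,756 | 0.0485 | 0.0340 | 0.0239 | 0.0267 | 0.0256 |
| | | | | 0.196 | 0.005 | 51,215 | 0.0475 | 0.0295 | 0.0237 | 0.0190 | 0.0249 |
| | | | | 0.195 | 0.002 | 47,887 | 0.0634 | 0.0231 | 0.0231 | 0.0226 | 0.0210 |
| | | | | 0.188 | 0.001 | 46,931 | 0.0568 | 0.0242 | 0.0183 | 0.0211 | 0.0215 |
| HGF | 8292 | −0.0311 | 0.0579 | 0.366 | 0.1 | 428,318 | 0.0917 | 0.0864 | 0.0569 | 0.0642 | 0.0593 |
| | | | | 0.242 | 0.05 | 250,843 | 0.0812 | 0.0722 | 0.0483 | 0.0492 | 0.0491 |
| | | | | 0.205 | 0.025 | 127,850 | 0.0657 | 0.0488 | 0.0327 | 0.0326 | 0.0357 |
| | | | | 0.098 | 0.01 | 61,906 | 0.0379 | 0.0224 | 0.0225 | 0.0260 | 0.0242 |
| | | | | 0.115 | 0.005 | 51,301 | 0.0347 | 0.0199 | 0.0224 | 0.0230 | 0.0203 |
| | | | | 0.111 | 0.002 | 47,878 | 0.0414 | 0.0162 | 0.0189 | 0.0211 | 0.0215 |
| | | | | 0.108 | 0.001 | 46,934 | 0.0312 | 0.0171 | 0.0221 | 0.0208 | 0.0211 |
| FGFBasic | 7565 | −0.0159 | 0.0597 | 0.269 | 0.1 | 427,284 | 0.0835 | 0.0902 | 0.0656 | 0.0530 | 0.0577 |
| | | | | 0.217 | 0.05 | 249,930 | 0.0891 | 0.0604 | 0.0473 | 0.0504 | 0.0468 |
| | | | | 0.117 | 0.025 | 127,587 | 0.0452 | 0.0431 | 0.0340 | 0.0363 | 0.0358 |
| | | | | 0.133 | 0.01 | 61,911 | 0.0408 | 0.0301 | 0.0232 | 0.0239 | 0.0275 |
| | | | | 0.135 | 0.005 | 51,259 | 0.0376 | 0.0243 | 0.0242 | 0.0267 | 0.0219 |
| | | | | 0.143 | 0.002 | 47,874 | 0.0362 | 0.0218 | 0.0185 | 0.0233 | 0.0245 |
| | | | | 0.126 | 0.001 | 46,914 | 0.0392 | 0.0206 | 0.0227 | 0.0214 | 0.0208 |
| SDF1a | 5998 | −0.0116 | 0.0713 | 0.395 | 0.1 | 425,165 | 0.1120 | 0.1068 | 0.0731 | 0.0757 | 0.0870 |
| | | | | 0.256 | 0.05 | 248,727 | 0.0872 | 0.0750 | 0.0580 | 0.0565 | 0.0631 |
| | | | | 0.213 | 0.025 | 126,986 | 0.0707 | 0.0462 | 0.0431 | 0.0472 | 0.0468 |
| | | | | 0.163 | 0.01 | 61,680 | 0.0497 | 0.0380 | 0.0359 | 0.0349 | 0.0324 |
| | | | | 0.190 | 0.005 | 51,092 | 0.0708 | 0.0318 | 0.0250 | 0.0297 | 0.0270 |
| | | | | 0.165 | 0.002 | 47,702 | 0.0447 | 0.0270 | 0.0294 | 0.0304 | 0.0301 |
| | | | | 0.159 | 0.001 | 46,789 | 0.0512 | 0.0232 | 0.0279 | 0.0258 | 0.0308 |
| IL6 | 8189 | −0.0071 | 0.0568 | 0.422 | 0.1 | 427,566 | 0.0878 | 0.0896 | 0.0510 | 0.0575 | 0.0594 |
| | | | | 0.227 | 0.05 | 250,247 | 0.0620 | 0.0713 | 0.0402 | 0.0463 | 0.0468 |
| | | | | 0.158 | 0.025 | 127,503 | 0.0672 | 0.0457 | 0.0372 | 0.0300 | 0.0360 |
| | | | | 0.139 | 0.01 | 61,931 | 0.0606 | 0.0258 | 0.0220 | 0.0247 | 0.0220 |
| | | | | 0.114 | 0.005 | 51,332 | 0.0288 | 0.0176 | 0.0196 | 0.0227 | 0.0236 |
| | | | | 0.115 | 0.002 | 47,930 | 0.0302 | 0.0164 | 0.0191 | 0.0226 | 0.0202 |
| | | | | 0.117 | 0.001 | 46,944 | 0.0319 | 0.0175 | 0.0227 | 0.0209 | 0.0211 |
| PDGFbb | 8293 | −0.0043 | 0.0624 | 0.432 | 0.1 | 427,743 | 0.0907 | 0.0993 | 0.0726 | 0.0653 | 0.0676 |
| | | | | 0.341 | 0.05 | 250,325 | 0.0670 | 0.0808 | 0.0600 | 0.0496 | 0.0576 |
| | | | | 0.307 | 0.025 | 127,567 | 0.0735 | 0.0554 | 0.0370 | 0.0334 | 0.0326 |
| | | | | 0.154 | 0.01 | 61,789 | 0.0372 | 0.0250 | 0.0213 | 0.0245 | 0.0243 |
| | | | | 0.125 | 0.005 | 51,140 | 0.0310 | 0.0226 | 0.0234 | 0.0230 | 0.0221 |
| | | | | 0.120 | 0.002 | 47,822 | 0.0258 | 0.0205 | 0.0214 | 0.0208 | 0.0233 |
| | | | | 0.117 | 0.001 | 46,853 | 0.0392 | 0.0192 | 0.0201 | 0.0209 | 0.0226 |
| TRAIL | 8186 | 0.0125 | 0.0613 | 0.559 | 0.1 | 423,391 | 0.0613 | 0.1018 | 0.0785 | 0.0790 | 0.0750 |
| | | | | 0.304 | 0.05 | 247,717 | 0.0543 | 0.1190 | 0.0526 | 0.0503 | 0.0439 |
| | | | | 0.242 | 0.025 | 126,350 | 0.0607 | 0.0647 | 0.0321 | 0.0362 | 0.0370 |
| | | | | 0.128 | 0.01 | 61,114 | 0.0316 | 0.0251 | 0.0242 | 0.0229 | 0.0277 |
| | | | | 0.127 | 0.005 | 50,633 | 0.0298 | 0.0231 | 0.0255 | 0.0239 | 0.0268 |
| | | | | 0.128 | 0.002 | 47,359 | 0.0332 | 0.0216 | 0.0233 | 0.0215 | 0.0266 |
| | | | | 0.121 | 0.001 | 46,415 | 0.0358 | 0.0195 | 0.0222 | 0.0229 | 0.0256 |
| IFNg | 7701 | 0.0134 | 0.0624 | 0.393 | 0.1 | 426,740 | 0.0946 | 0.0811 | 0.0528 | 0.0590 | 0.0594 |
| | | | | 0.241 | 0.05 | 249,818 | 0.0655 | 0.0628 | 0.0553 | 0.0520 | 0.0509 |
| | | | | 0.244 | 0.025 | 127,514 | 0.0734 | 0.0582 | 0.0330 | 0.0406 | 0.0320 |
| | | | | 0.138 | 0.01 | 61,890 | 0.0289 | 0.0303 | 0.0267 | 0.0239 | 0.0257 |
| | | | | 0.138 | 0.005 | 51,314 | 0.0424 | 0.0201 | 0.0222 | 0.0248 | 0.0293 |
| | | | | 0.141 | 0.002 | 47,918 | 0.0321 | 0.0204 | 0.0251 | 0.0248 | 0.0286 |
| | | | | 0.137 | 0.001 | 46,934 | 0.0253 | 0.0183 | 0.0223 | 0.0246 | 0.0233 |
| GCSF | 7904 | 0.0173 | 0.0601 | 0.246 | 0.1 | 427,393 | 0.0707 | 0.0820 | 0.0620 | 0.0604 | 0.0580 |
| | | | | 0.198 | 0.05 | 250,222 | 0.0636 | 0.0607 | 0.0402 | 0.0436 | 0.0486 |
| | | | | 0.164 | 0.025 | 127,583 | 0.0501 | 0.0415 | 0.0302 | 0.0360 | 0.0327 |
| | | | | 0.142 | 0.01 | 61,846 | 0.0434 | 0.0257 | 0.0280 | 0.0239 | 0.0257 |
| | | | | 0.122 | 0.005 | 51,266 | 0.0379 | 0.0196 | 0.0205 | 0.0247 | 0.0238 |
| | | | | 0.120 | 0.002 | 47,919 | 0.0413 | 0.0183 | 0.0219 | 0.0236 | 0.0201 |
| | | | | 0.112 | 0.001 | 46,939 | 0.0312 | 0.0159 | 0.0234 | 0.0207 | 0.0202 |
| IL10 | 7681 | 0.0186 | 0.0691 | 0.331 | 0.1 | 427,218 | 0.0621 | 0.1019 | 0.0584 | NA | NA |
| | | | | 0.310 | 0.05 | 250,109 | 0.0670 | 0.0858 | 0.0448 | NA | NA |
| | | | | 0.198 | 0.025 | 127,543 | 0.0566 | 0.0463 | 0.0356 | 0.0382 | 0.0406 |
| | | | | 0.130 | 0.01 | 61,944 | 0.0328 | 0.0225 | 0.0251 | 0.0268 | 0.0258 |
| | | | | 0.141 | 0.005 | 51,257 | 0.0400 | 0.0220 | 0.0237 | 0.0282 | 0.0238 |
| | | | | 0.148 | 0.002 | 47,880 | 0.0433 | 0.0183 | 0.0204 | 0.0271 | 0.0231 |
| | | | | 0.142 | 0.001 | 46,898 | 0.0317 | 0.0194 | 0.0219 | 0.0261 | 0.0228 |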
This table shows the estimated total SNP-based heritability, with SEs, for 12 immune traits, and compares the estimates from LDSC and SumVg. Trait, N, and LDSC (h2, se) have the same meaning as in Table 4; h2, heritability estimated by SumVg at each r2 pruning threshold; r2, the r2 pruning threshold; n_pruned_snp, number of SNPs remaining after LD pruning at the corresponding r2 threshold; se_jack_1, se_jack_del_d, se_paraboot, se_fdrboot1 and se_fdrboot2 are SEs estimated by the different approaches described above; “NA” is shown where “locfdr” failed to estimate the local false discovery rate. The estimates at r2 = 0.01 are highlighted because, in general, the heritability estimates stabilize at r2 ~ 0.01.
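To make the two jackknife columns easier to interpret, the following is a minimal sketch of a generic delete-d jackknife standard error, where d = 1 gives the delete-1 case. It is an illustration under stated assumptions, not the SumVg implementation: the function name `jackknife_se`, the toy data, and the use of the sample mean as the estimator are all hypothetical. The exhaustive enumeration of deleted subsets is feasible only for very small inputs, so a delete-d jackknife over tens of thousands of pruned SNPs would have to use a sample of subsets instead.

```python
import itertools
import math
import statistics


def jackknife_se(values, estimator, d=1):
    """Delete-d jackknife SE of estimator(values), enumerating every subset of size d.

    Uses the standard delete-d variance formula
        var = (n - d) / (d * S) * sum_s (theta_s - theta_bar)^2,
    where S is the number of deleted subsets and theta_s is the estimate
    recomputed without subset s. Setting d = 1 gives the delete-1 jackknife.
    """
    n = len(values)
    if not 1 <= d < n:
        raise ValueError("d must satisfy 1 <= d < n")
    replicates = []
    for dropped in itertools.combinations(range(n), d):
        dropped = set(dropped)
        kept = [v for i, v in enumerate(values) if i not in dropped]
        replicates.append(estimator(kept))
    theta_bar = sum(replicates) / len(replicates)
    sum_sq = sum((t - theta_bar) ** 2 for t in replicates)
    return math.sqrt((n - d) / (d * len(replicates)) * sum_sq)


# Toy, made-up numbers for illustration only (not taken from the paper).
toy = [0.8, 1.3, 0.2, 1.9, 0.7, 1.1, 0.4, 1.6]
se_del_1 = jackknife_se(toy, statistics.mean, d=1)
se_del_2 = jackknife_se(toy, statistics.mean, d=2)
print(se_del_1, se_del_2)
```

For the sample mean, both variants reduce to the familiar s/sqrt(n), which gives a convenient sanity check. For a nonlinear estimator the delete-1 and delete-d values can differ, as they do between the se_jack_1 and se_jack_del_d columns in the table.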
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

So, H.-C.; Xue, X.; Ma, Z.; Sham, P.-C. SumVg: Total Heritability Explained by All Variants in Genome-Wide Association Studies Based on Summary Statistics with Standard Error Estimates. Int. J. Mol. Sci. 2024, 25, 1347. https://doi.org/10.3390/ijms25021347

