An Efficient Bayesian Method for Estimating the Degree of the Skewness of X Chromosome Inactivation Based on the Mixture of General Pedigrees and Unrelated Females

Skewed X chromosome inactivation (XCI-S) has been reported to be associated with some X-linked diseases. Several methods have been proposed to estimate the degree of XCI-S (denoted as γ) for quantitative and qualitative traits based on unrelated females. However, there is no method available for estimating γ based on general pedigrees. Therefore, in this paper, we propose a Bayesian method to obtain the point estimate and the credible interval of γ based on the mixture of general pedigrees and unrelated females (called mixed data for brevity), which is also suitable for only general pedigrees. We consider the truncated normal prior and the uniform prior for γ. Further, we apply the eigenvalue decomposition and Cholesky decomposition to our proposed methods to accelerate the computation speed. We conduct extensive simulation studies to compare the performances of our proposed methods and two existing Bayesian methods which are only applicable to unrelated females. The simulation results show that the incorporation of general pedigrees can improve the efficiency of the point estimation and the precision and the accuracy of the interval estimation of γ. Finally, we apply the proposed methods to the Minnesota Center for Twin and Family Research data for their practical use.


Introduction
X chromosome inactivation (XCI) is an important epigenetic phenomenon, which was described by Lyon [1] for the first time. In mammals, females have two X chromosomes, whereas males have only one X chromosome. During the early development of embryos in females, one of the two X chromosomes becomes a Barr body and remains inactivated in subsequent somatic cells to ensure the balance of transcriptional dosages on the X chromosome between females and males [2]. In general, the process of XCI is random. Specifically, in females, approximately 50% of the cells have the paternal allele at an Xchromosomal locus inactivated, and the remaining approximately 50% of the cells keep the maternal allele inactivated, which is called random XCI (XCI-R) [3]. However, there are still two other patterns of XCI: the escape from XCI (XCI-E) and the skewed XCI (XCI-S) [3]. XCI-E means that a female has a region of the X chromosome without inactivation, i.e., the alleles on both X chromosomes are kept active. In humans, approximately 15-30% of X-linked loci have been reported to undergo XCI-E [4,5]. As for XCI-S, more than 75% of for estimating the degree of XCI-S based on general pedigrees or the mixture of general pedigrees and unrelated females.
In this paper, we propose a Bayesian method to estimate the degree of XCI-S based on the mixture of general pedigrees and unrelated females for both quantitative and qualitative traits, which is also suitable for only general pedigrees. We use the kinship matrix to represent the correlation between females in general pedigrees and construct the generalized linear mixed model. The prior of γ is set to be a truncated normal distribution and a uniform distribution. The posterior distribution of γ is drawn using a Hamiltonian Monte Carlo (HMC) sampling algorithm. We regard the mode of the sample from the posterior distribution as the point estimate of γ, and consider the corresponding highest posterior density interval (HPDI) as the credible interval of γ [35]. Because the posterior sampling process of the generalized linear mixed model is very computationally intensive [36], we additionally employ the eigenvalue decomposition (EVD) and Cholesky decomposition to accelerate the computation speed. Further, we conduct extensive simulation studies to compare the performances of our proposed methods and the existing Bayesian methods. Finally, we apply our proposed methods to Minnesota Center for Twin and Family Research (MCTFR) data for their practical use.

Notations
Consider an X-chromosomal locus with alleles d and D being the normal allele and the deleterious allele, respectively. Suppose that we have collected the X-linked traits (quantitative or qualitative), the genotypes at the locus of N p pedigrees (including n p individuals, males or females), and additional n I f independent/unrelated females. Note that the individuals in the same pedigree are genetically correlated. Since XCI only exists in females, we only select n p f females in these pedigrees and additional n I f unrelated females to build the model, and we assume that n f = n p f + n I f . Let Y i be the trait of the ith female and G i = {dd, Dd, DD} indicate the genotype of the ith female i = 1, 2, . . . , n p f , n p f +1 , . . . , n f . Then, Y = Y 1 , Y 2 , . . . , Y n p f , Y n p f +1 , . . . , Y n f T is the trait vector of all the females, and G = G 1 , G 2 , . . . , G n p f , G n p f +1 , . . . , G n f T is the genotype vector of all the females. According to Wang et al. [24], we encode the genotypes G i = {dd, Dd, DD} as the genotypic values X i = {0, γ, 2}, where γ ∈ [0, 2] represents the degree of XCI-S. As such, the genotypic value vector of all the females can be expressed as X = X 1 , X 2 , . . . , X n p f , X n p f +1 , . . . , X n f T .
Considering the correlations among n p f females selected from N p pedigrees, we utilize the kinship matrix to measure the correlations of this kind. To be specific, we first use both the males and the females in the pedigrees to construct an n p × n p kinship matrix ψ, which can be obtained using the algorithm of Lange [37] through the "kinship2" package of R software [38]. Then, we select the corresponding rows and columns of n p f females in matrix ψ and obtain the n p f × n p f matrix ψ f of these n p f females. As for n I f unrelated females, the genetic relatedness matrix can be expressed as the n I f × n I f identity matrix I n I f ×n I f . Finally, the genetic relatedness matrix ϕ of Y can be denoted as the following block matrix: We build the generalized linear mixed model to describe the association between G i and Y i h(µ i ) = βX i + a T Z i + b i (1) where β is the regression coefficient of X i ; Z i = (Z i1 , Z i2 , . . . , Z im ) T is the vector of m covariates of the ith female including 1 as the first element and Z = Z 1 , Z 2 , . . . , Z n p f , Z n p f +1 , . . . , Z n f T is an n f × m covariate matrix; a = (a 1 , a 2 , . . . , a m ) T is the m × 1 vector of the regression coefficients of Z i with a 1 being the intercept; b i is a random effect, and the random variable b = b 1 , b 2 , . . . , b n p f , b n p f +1 , . . . , b n f T is generated by the multivariate normal distribution, i.e., b ∼ MV N 0, σ 2 g ϕ , where σ 2 g is the variance of the polygenic effects; h(·) is the link function; and µ i = E(Y i |X i , Z i ) is the conditional mean of Y i given X i and Z i .
To estimate γ, we decompose X i in Equation (1) into X i = γX 1i + (2 − γ)X 2i according to Wang et al. [29], where X 1i and X 2i are two indicator variables. X 1i = I {G i =DdorDD} indicates whether the ith female contains at least one deleterious allele D, and X 2i = I {G i =DD} denotes whether the ith female has two deleterious alleles. Then, we can rewrite Equation (1) as follows: For quantitative traits, h(·) is the identity function and Y i has the residual error ε i , so Equation (2) becomes a linear mixed model where ε i ∼ N 0, σ 2 e and σ 2 e is the variance of ε i . For qualitative traits, h(·) is the logit function, and Equation (2) can be written as

Building Bayesian Models
For quantitative traits, T follows a multivariate normal distribution according to Equation (3), i.e., where X 1 = X 11 , X 12 , . . . , X 1n p f , X 1(n p f +1) , . . . , X 1n f T and X 2 = (X 21 , X 22 , . . . , X 2n p f , X 2(n p f +1) , . . . , X 2n f ) T . The unknown parameters are θ 1 = β, γ, a T , σ g , σ e T , and let L 1 be the likelihood function of Y based on expression (5). So, the posterior distribution of θ 1 can be expressed as f (θ 1 |X 1 , is the joint prior of θ 1 . As for qualitative traits, Y i follows a Bernoulli distribution based on Equation (4), i.e., where p i = logit −1 (η i ) and The unknown parameters are θ 2 = β, γ, a T , σ g T , and let L 2 be the likelihood function of Y based on expression (6). The posterior distribution of θ 2 can be expressed as is the joint prior of θ 2 .

Eigenvalue Decomposition and Cholesky Decomposition for Accelerating Computation Speed
It should be noted that, due to the high-dimensional matrix ϕ, the Bayesian posterior sampling processes of f (θ 1 |X 1 , X 2 , Z, ϕ) and f (θ 2 |X 1 , X 2 , Z, ϕ) would be computationally intensive, especially when n f is large [36,39]. So, according to Runcie and Crawford [40] and Zhao et al. [36], we use the EVD and Cholesky decomposition to accelerate the sampling process for quantitative and qualitative traits, respectively. The transformed posterior distributions of θ 1 and θ 2 are denoted by f * θ 1 X * 1 , X * 2 , Z * , Σ and f * (θ 2 |X 1 , X 2 , Z, C, h), respectively, where X * 1 = QX 1 , X * 2 = QX 2 and Z * = QZ, respectively, are the transformed X 1 , X 2 and Z based on ϕ = Q T ΣQ by the EVD; C is a lower triangular matrix satisfying ϕ = CC T by Cholesky decomposition; and h follows MV N 0, I n f ×n f and satisfies σ g Ch ∼ MV N 0, σ 2 g ϕ . The details refer to Supplementary Appendices SA and SB. From Table 1, we find that using the EVD and Cholesky decomposition in the posterior sampling process can greatly reduce running time (the details can be seen in Section 3).

HMC Algorithm and Priors
Note that it is difficult to derive the closed forms of the posterior distributions f * θ 1 X * 1 , X * 2 , Z * , Σ and f * (θ 2 |X 1 , X 2 , Z, C, h), so we use the HMC algorithm [35] to sample the parameters from the approximate posterior distributions, which can be efficiently implemented through the "cmdstanr" package in R software. We choose the HMC algorithm because it can improve the independence of the samples and has higher efficiency than the other Markov-Chain Monte Carlo methods [35].
According to Yu et al. [32], we set the priors of θ 1 and θ 2 as follows: For nuisance parameters β and a, we select non-informative priors to reduce their influence on the posterior distributions. Specifically, we assume that β ∼ N 0, 10 2 and a ∼ MV N 0, diag 10 2 , 10 2 , . . . , 10 2 [41] so that β and a can be sampled from the positive and negative values with equal probabilities. For the standard deviation σ g of polygenic effects, we choose the exponential distribution with mean being 1, i.e., σ g ∼ exp(1) [35]. For θ 1 based on quantitative traits, there is an extra parameter σ e , and we also suppose that σ e ∼ exp (1). For the parameter γ of interest, by considering the constraint condition of γ ∈ [0, 2], we set two priors. One is a uniform distribution from 0 to 2, i.e., γ ∼ U(0, 2), which is a non-informative prior. The other is to assume that the more skewed values of γ have the lower probability and the probability of γ being 1 is the highest, which is consistent with the genetic background [3]. In this way, we set γ to obey a truncated normal distribution with both the parameters being fixed at 1 and the values ranging from 0 to 2. The probability density function of the prior of γ is is the probability density function of the standard normal distribution and Φ(·) is its cumulative distribution function. We assume that the unknown parameters are unrelated to each other because the HMC algorithm does not dramatically suffer from the correlated parameters in the model [35]. Therefore, the prior distributions f (θ 1 ) and f (θ 2 ) can be calculated as the product of the priors of all the parameters. Moreover, f (θ 1 ) and f (θ 2 ) can also be flexibly set according to practical background.
After we obtain the posterior samples of θ 1 and θ 2 through the HMC algorithm, we calculate the mode of the samples as the point estimate of γ, and compute the HPDI of the samples as the credible interval of γ. We denote the Bayesian methods with the truncated normal distribution and the uniform distribution for the mixture of general pedigrees and additional unrelated females as BNM and BUM, and the corresponding point estimates yielded by these two methods asγ BN M andγ BUM , respectively.

Situations When Considering General Pedigrees and Unrelated Females, Respectively
Notice that our proposed methods are also applicable to the situation with only general pedigrees and that with only unrelated females. For the situation with only general pedigrees, the genetic relatedness matrix ϕ degenerates to twice the kinship matrix of all the n p f females from the pedigrees, i.e., 2ψ f . We denote the Bayesian methods with the truncated normal distribution and the uniform distribution for general pedigrees as BNP and BUP, and the corresponding point estimates asγ BNP andγ BUP , respectively. For the situation with only unrelated females, our proposed methods still work where the genetic relatedness matrix ϕ is reduced to be the identity matrix I n I f ×n I f . However, compared with the existing BN and BU methods having the prior of γ being the truncated normal distribution and the uniform distribution, respectively [32], our proposed methods require additionally estimating the random effects b i 's, which may reduce the estimation accuracy and be time-consuming. Therefore, in practice, for unrelated females, we recommend using the existing BN and BU methods. Furthermore, just like Yu et al. [32], the point estimates of γ based on the BN and BU methods are represented asγ BN andγ BU , respectively.

Situation When There Are Missing Genotypes for Some Individuals from General Pedigrees
It should be noted that our proposed methods are also suitable for the situation where the genotypes of some individuals from some pedigrees are missing, by simply excluding the individuals with missing genotypes and deleting the corresponding rows and columns of these individuals from the genetic relatedness matrix ϕ.

Simulation Settings
To evaluate the performance of our proposed methods (BNM and BUM for the mixture of general pedigrees and additional unrelated females, and BNP and BUP for only general pedigrees) and compare them with the existing methods (BN and BU for only unrelated females) [32] when estimating the degree of the XCI-S, we conduct the following extensive simulation studies. When simulating general pedigrees, we consider three pedigree structures: (1) the nuclear family with 4 people, (2) the three-generation family with 10 people and (3) the four-generation family with 12 people, as shown in Figure 1. We fix the sex ratio at 1:1 in our simulation study. A total of 50 pedigrees under each pedigree structure are simulated, which leads to N p being 150, n p being 1300, and n p f being approximately 650. For a larger sample size, we simulate 200 pedigrees under each pedigree structure, and the corresponding N p is 600, n p is 5200, and n p f is approximately 2600. Because there are two X chromosomes in females and only one in males, we first generate the genotypes 3). Then, we simulate the genotypes of the nonfounders according to Mendelian inheritance. We consider a covariate K, which is generated from the standard normal distribution. Note that the estimation of the degree of XCI-S only needs the females. As such, let K pi be the value of K for the ith female (i = 1, 2, . . . , n p f ) and we only simulate the quantitative trait values of all the n p f females in the pedigrees, which are generated based on the following multivariate normal distribution: where Y p is the vector of the quantitative trait values of these n p f females; X p is the vector of their genotypic values with the elements being 0, γ, or 2 respectively corresponding to genotypes {dd, Dd, DD}, where the value of γ represents the degree of XCI-S and is randomly sampled from U(0, 2); K p = K p1 , K p2 , . . . , K pn p f T ; and ψ f is the kinship matrix of the n p f females and I n p f ×n p f is an n p f × n p f identity matrix. β 0 is the intercept and δ is the regression coefficient of the covariate K, which are both fixed at 0.5 [42]. According to Schifano et al. [43], we set σ 2 g = {1/3, 1} and σ 2 e = 1, which means that the values of the polygenic heritability h 2 25, 0.50}. Furthermore, we set β = 0.2 so that the heritability due to the causal SNP, h 2 c = β 2 p f 1 − p f / σ 2 g + σ 2 e , remains less than 2% for the chosen values of p f , σ 2 g , and σ 2 e mentioned above. As for a qualitative trait, we generate the corresponding values using the threshold model [44]. Specifically, once the quantitative trait values in Equation (8) are generated, they are transformed to be affected if they are less than the threshold and otherwise to be unaffected. Here, we fix the prevalence of the disease under study at 0.3, and the threshold is then taken as the 30% quantile of the distribution of the quantitative trait. In addition, to consider the situation in which the genotypes of some individuals in the pedigrees are missing, the missing rate (MR) is set to be 0 and 0.4. MR = 0 means that the genotypes of all the individuals in the pedigrees are collected and MR = 0.4 indicates that the genotype of an individual is randomly missing with probability 0.4. randomly sampled from (0, 2) ; = ( 1 , 2 , … , ) ; and is the kinship matrix of the females and × is an × identity matrix. 0 is the intercept and is the regression coefficient of the covariate , which are both fixed at 0.5 [42]. According to Schifano et al. [43], we set 2 = {1/3, 1} and 2 = 1, which means that the values of the polygenic heritability ℎ 2 = 2 /( 2 + 2 ) = {0.25, 0.50} . Furthermore, we set = 0.2 so that the heritability due to the causal SNP, ℎ 2 = 2 (1 − )/( 2 + 2 ), remains less than 2% for the chosen values of , 2 , and 2 mentioned above. As for a qualitative trait, we generate the corresponding values using the threshold model [44]. Specifically, once the quantitative trait values in Equation (8) are generated, they are transformed to be affected if they are less than the threshold and otherwise to be unaffected. Here, we fix the prevalence of the disease under study at 0.3, and the threshold is then taken as the 30% quantile of the distribution of the quantitative trait. In addition, to consider the situation in which the genotypes of some individuals in the pedigrees are missing, the missing rate ( ) is set to be 0 and 0.4. = 0 means that the genotypes of all the individuals in the pedigrees are collected and = 0.4 indicates that the genotype of an individual is randomly missing with probability 0.4. When simulating unrelated females, we directly generate their genotypes { , , } using probabilities {(1 − ) 2 , 2 (1 − ), 2 } . For comparing BNP and BUP for only general pedigrees with BN and BU for only unrelated females, respectively, we set the number of unrelated females ( ) to be 650 and 2600, which is almost equal to the number of the females in 150 and 600 pedigrees mentioned above, and we fix the variance of the residual error in the unrelated females at 2 + 2 [45], which is the same as the total variance of the quantitative trait value in the females from the general pedigrees. Other parameters and simulation settings are kept the same as those when simulating general pedigrees. Specifically, the quantitative trait values of the unrelated females are generated according to the following multivariate normal distribution: where is the vector of the quantitative trait values of the unrelated females; is the vector of their genotypic values with the elements being 0, , or 2 corresponding to genotypes { , , }; = ( 1 , 2 , … , ) is the covariate vector, where is the value of the covariate for the th female ( = 1, 2, … , ); and × is an × identity matrix. As for a qualitative trait, just like simulating the general pedigrees, we When simulating unrelated females, we directly generate their genotypes {dd, Dd, DD} For comparing BNP and BUP for only general pedigrees with BN and BU for only unrelated females, respectively, we set the number of unrelated females (n I f ) to be 650 and 2600, which is almost equal to the number of the females in 150 and 600 pedigrees mentioned above, and we fix the variance of the residual error in the unrelated females at σ 2 g + σ 2 e [45], which is the same as the total variance of the quantitative trait value in the females from the general pedigrees. Other parameters and simulation settings are kept the same as those when simulating general pedigrees. Specifically, the quantitative trait values of the n I f unrelated females are generated according to the following multivariate normal distribution: where Y I is the vector of the quantitative trait values of the n I f unrelated females; X I is the vector of their genotypic values with the elements being 0, γ, or 2 corresponding to genotypes {dd, Dd, DD}; K I = K I1 , K I2 , . . . , K In I f is the covariate vector, where K Ii is the value of the covariate K for the ith female (i = 1, 2, . . . , n I f ); and I n I f ×n I f is an n I f × n I f identity matrix. As for a qualitative trait, just like simulating the general pedigrees, we also generate the corresponding values using the threshold model [44]. By combining the females in the general pedigrees and additional unrelated females, we can obtain the mixed data. We use the BNM and BUM methods, the BNP and BUP methods, and the existing BN and BU methods to obtain the point estimates and the HPDIs of γ based on the mixed data, only general pedigrees, and only unrelated females, respectively. Ma et al. [23] claimed that the variance of the quantitative trait under study for heterozygous females (Dd) may be higher than those for homozygous females (dd and DD) due to the XCI and other factors (e.g., gene-gene interactions and gene mutation), and the increase ratio can be up to 20%. However, so far, in our model, we do not consider the heteroscedasticity of this kind because of the potential computation cost in Bayesian inference. To investigate whether our proposed methods are still robust in the presence of the heteroscedasticity, we additionally simulate the mixed data for quantitative traits with the heteroscedasticity. Specifically, we use σ 2 e0 , σ 2 e1 , and σ 2 e2 to represent the residual variance σ 2 e in females with genotypes dd, Dd, and DD, respectively. The simulation settings for the mixed data are the same as those under the homoscedasticity, except that we assume σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1.2, 1) here. Furthermore, for comparison, we utilize σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1) to represent that the variances across different genotypes are the same. We apply the BNM and BUM methods to the mixed data, and apply the BNP and BUP methods to only general pedigrees.
We conduct 500 replicates for each simulation setting. For each replicate, we set 4 chains for extracting the samples simultaneously. For each chain, we extract 3000 samples, and the first 1000 samples are used for warming up. Therefore, we finally obtain 8000 samples in each replicate. To ensure the convergence, the target acceptance rate is taken as 0.9. We assess the convergence of Markov chains by calculating the convergence diagnosticR [46]. Note that theR's of our proposed methods are all less than 1.05, which indicates good convergence and also means that drawing 8000 samples is enough. The above posterior sampling process is implemented using the "cmdstanr" package in R software (version 4.1.2, http://r-project.org, accessed on 2 February 2023). To evaluate the accuracy of the point estimates, we calculate their mean squared errors (MSEs). Here, MSE = ∑ 500 w=1 (γ w − γ w ) 2 /500, where γ w is the wth true value of γ, andγ w is the estimate of γ w (w = 1, 2, . . . , 500). We also draw scatter plots to visually display the six point estimates (γ BN M ,γ BUM ,γ BNP ,γ BUP ,γ BN , andγ BU ) against the true values of γ. To compare the performances of the interval estimation of all the six methods (BNM, BUM, BNP, BUP, BN, and BU), we calculate the coverage probability (CP) as well as the median, the mean, the interquartile range, and the standard deviation of the widths of the 95% HPDIs of γ (respectively denoted by W median , W mean , W iqr , and W sd ). Moreover, we draw scatter plots of the interval widths of all the six methods against the true values of γ.

Simulation Results under the Situations of Homoscedasticity and Allele Frequencies in Females and Males Being the Same
To assess the computation efficiency of our proposed methods based on the EVD and Cholesky decomposition, we considered the BNP method for only general pedigrees as an example. Here, N p was taken to be 150 and 600, p f = p m = 0.3, σ 2 g = 1/3, σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1) and MR = 0 (i.e., there were no missing genotypes in all the pedigrees) for both quantitative and qualitative traits. The other parameters were fixed in the same way as in the "Simulation Settings" subsection. A total of 500 replicates were conducted for each simulation setting. There were two kinds of BNP methods that we wanted to compare: (1) the BNP method with the posterior sampling process based on the EVD (for quantitative traits) or Cholesky decomposition (for qualitative traits), and (2) the BNP method with the posterior sampling process based on the posterior distribution f (θ 1 |X 1 , X 2 , Z, ϕ) (for quantitative traits) or f (θ 2 |X 1 , X 2 , Z, ϕ) (for qualitative traits), which is called the original posterior sampling process in this paper. We computed the mean running time of the BNP method based on the EVD or Cholesky decomposition for all 500 replicates. However, it is important to note that the original posterior sampling process may take up a huge amount of time. Therefore, we only calculated the mean running time of the original posterior sampling process over the first 10 replicates. All the computations were performed on a Tsinghua Tongfang Z900 personal computer (Microsoft Windows 7 Enterprise (Service Pack 1), 4 GB of RAM and 3.60 GHz Intel(R) Core(TM) i7-4790 CPU). The results of the mean running time are given in Table 1. As shown in Table 1, the EVD and Cholesky decomposition can greatly speed up the Bayesian sampling process, especially when N p is 600.
The MSEs of the six point estimates ( Table 2. We found that the MSEs of γ BN M andγ BUM based on the mixed data are the smallest under all the simulated scenarios, which means that it is more efficient to estimate the degree of XCI-S by simultaneously using general pedigrees and additional unrelated females. The MSEs ofγ BNP andγ BUP for only general pedigrees are slightly larger than those ofγ BN andγ BU for only unrelated females in all the simulated situations. This probably demonstrates that general pedigrees provide less information for estimating the degree of XCI-S than unrelated females when the total number of the females in all the pedigrees and that of the unrelated females are the same. As for the two priors of γ, the point estimates (γ BN M ,γ BNP , andγ BN ) with the truncated normal distribution have the MSEs similar to those (γ BUM ,γ BUP , andγ BU ) with the uniform distribution, withγ BN M ,γ BNP , andγ BN performing slightly better thanγ BUM , γ BUP , andγ BU , respectively. Furthermore, it can be observed from Table 2 that the MSEs of the six point estimates decrease when N p and n I f increase, p f and p m (the frequency of the deleterious allele D) increase, and σ 2 g (the variance of the polygenic effects) decreases. As expected, compared to MR = 0 (i.e., there are no missing genotypes in all the pedigrees), the MSEs of the six point estimates increase when MR = 0.4 (i.e., the genotypes of about 40% individuals in general pedigrees are missing). In addition, the six point estimates have smaller MSEs for quantitative traits than for qualitative traits. The upper side and the right side of each subplot are the distribution of the true value of γ and that of the point estimate of γ, respectively. By comparing the six subplots in the same column of each figure, we found thatγ BN M andγ BUM based on the mixed data are closer to the true value of γ thanγ BNP ,γ BUP ,γ BN , andγ BU . Moreover, noting that the distribution of the true value of γ is U(0, 2), it can be seen that the distributions ofγ BN M andγ BUM are more uniform than those of the four other point estimates. These indicate that it is necessary to combine general pedigrees with unrelated females when estimating γ. The dispersion ofγ BN M is slightly smaller than that ofγ BUM , and the dispersions of γ BN andγ BU are slightly less than those ofγ BNP andγ BUP , although differences of these kinds are not so obvious in most figures. By comparing the two subplots in the same row of each figure, it can be seen that the estimates with MR = 0.4 (in the subplot of the second column) have larger dispersion than those with MR = 0 (in the subplot of the first column), implying that the missing genotypes of some individuals in the collected pedigrees would increase the MSEs of the point estimates. Furthermore, comparing Figure 2 with Figure 3 (or comparing Supplementary Figures S1-S7 with Supplementary Figures S8-S14, respectively) shows that the six point estimates have better performance for quantitative traits than for qualitative traits. In addition, from these figures, the trend of the six point estimates with respect to N p , n I f , p f , p m , and σ 2 g is consistent with that in Table 2. Finally, it is observed from these figures that most of the point estimates can be evenly distributed on both sides of the true value of γ, except for the situations with N p = 150, n I f = 650, and p f = p m = 0.1, where the six point estimates may underestimate γ (Supplementary Figures S2, S3, S9 and S10). However, when N p = 600, n I f = 2600, and p f = p m = 0.1, we can obtain point estimates which are much more evenly distributed around the true value of γ (Supplementary Figures S6, S7, S13 and S14). This suggests that when analyzing the SNPs with low frequencies of the deleterious allele, our proposed point estimates need large sample sizes.  Table 3 describes the CPs of the six interval estimation methods (BNM, BUM, BNP, BUP, BN, and BU) under p f = p m and σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1). From Table 3, we can find that all six methods can control the CPs around 95% in all the simulated situations, which verifies their accuracy when estimating the degree of XCI-S. Table 4 and Supplementary Table S1 display the medians and the means of the widths of the 95% HPDIs (W median and W mean ), respectively, obtained by the six methods under p f = p m and σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1). From these tables, we can see that the BNM and BUM methods based on the mixed data have smaller W median and W mean than the other four methods (BNP, BUP, BN, and BU) under all the simulated scenarios, which indicates that simultaneously using general pedigrees and additional unrelated females can improve the precision of the interval estimation of the degree of XCI-S. The W median and W mean of the BNP and BUP methods for only general pedigrees are slightly larger than those of the BN and BU methods for only unrelated females, which is consistent with the findings based on the MSEs of their corresponding point estimates from Table 2. For two priors of γ, the interval estimation with the truncated normal prior (the BNM, BNP, and BN methods) and that with the uniform prior (the BUM, BUP, and BU methods) have a similar performance, whereas the BNM, BNP, and BN methods respectively obtain slightly smaller W median and W mean than the BUM, BUP, and BU methods. When N p and n I f increase, p f and p m (the frequency of the deleterious allele D) increase, σ 2 g (the variance of the polygenic effects) decreases, MR (the probability of the genotype of an individual in a pedigree being missing) changes from 0.4 to 0, or the trait changes from qualitative to quantitative, the W median and W mean of the six methods decrease.     ) with 95% probability. Table 5 and Supplementary Table S2 show the interquartile range and the standard deviation of the widths of the 95% HPDIs (W iqr and W sd ), respectively, of the six methods under p f = p m and σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1). Figures 4 and 5 display the scatter plots of the widths of the 95% HPDIs based on the six interval estimation methods (BNM, BUM, BNP, BUP, BN, and BU) against the true values of γ with N p = 150, n I f = 650, p f = p m = 0.3, σ 2 g = 1/3, σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1), and MR = {0, 0.4} for quantitative and qualitative traits, respectively. Supplementary Figures S15-S28 give the corresponding scatter plots under other simulation settings. The six rows of each figure represent the results of the six methods, and the two columns of each figure denote the corresponding results with MR = 0 and 0.4, respectively. From Table 5, we find that when the prior of γ is fixed to be the truncated normal distribution, the BNM method generally obtains smaller W iqr than the BNP and BN methods under all the simulated scenarios except for the cases of N p = 150, n I f = 650, and p f = p m = 0.1 for quantitative traits and those of N p = 150 and n I f = 650 for qualitative traits. Similarly, when the prior of γ is taken as the uniform distribution, the W iqr of the BUM method are less than those of the BUP and BU methods under all the simulated scenarios in general, except for the situations mentioned above. It can be seen in Supplementary Table S2 that the BNM (BUM) method generally derives smaller W sd than the BNP and BN (BUP and BU) methods except for the cases with p f = p m = 0.1 for both quantitative and qualitative traits and those with N p = 150, n I f = 650, and p f = p m = 0.3 for qualitative traits. This may be explained by the fact that the largest width of the 95% HPDIs of the six methods is 2, and when N p = 150 and n I f = 650 or p f = p m = 0.1, the widths of the intervals obtained by the BNP, BUP, BN, and BU methods are very close to 2 (as can be observed in Supplementary  Figures S16, S17 and S22-S24), which make the dispersion of the widths of the intervals of the BNP, BUP, BN, and BU methods smaller and cause smaller W iqr and W sd of the BNP, BUP, BN, and BU methods. It is important to note that the width of the 95% HPDI of γ does not follow the normal distribution under most of the simulated scenarios ( Supplementary  Figures S15-S17, S20-S24, S27 and S28), so the trend of the results of the W sd is not exactly the same as that of the W iqr . On the other hand, the W iqr and W sd of the BNP (BUP) method are larger than those of the BN (BU) method. In addition, the W iqr and W sd of the six methods decrease with higher p f and p m when N p = 600 and n I f = 2600, and increase when N p = 150 and n I f = 650 and other parameters are unchanged. As for two priors, the BNM, BNP, and BN methods obtain slightly smaller W iqr and W sd than the BUM, BUP, and BU methods. It is shown in some subplots of  Tables 4 and 5, and Supplementary Tables S1 and S2, and we do not discuss them here for brevity.

Simulation Results When Allele Frequencies in Females and Males Being Different
Supplementary Table S3 shows Tables 2 and 4, Supplementary Table S1, Table 5 and Supplementary Table S2, respectively), implying that our proposed methods still work when there are differences in the frequencies of the deleterious alleles between females and males.

Application to MCTFR Data
The MCTFR Genome-Wide Association Study of Behavioral Disinhibition is a familybased study of substance abuse and related psychopathology [47]. The dataset can be downloaded from the database of Genotypes and Phenotypes with the accession number phs000620.v1.p1 (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_ id=phs000620.v1.p1, accessed on 2 February 2023). This dataset contains 2183 families, 7377 participants (3831 females and 3546 males), and 527,829 SNPs. There are five quantitative traits in the dataset: the nicotine composite score, the alcohol consumption composite score, the alcohol dependence composite score (DEP), the illicit drug composite score, and the behavioral disinhibition composite score [48]. Because we only use females for measuring the degree of XCI-S, 3831 females and 12,354 SNPs on the X chromosome were selected. We filtered the data using the following quality control criteria: (1) excluding SNPs with a missing rate > 10%, (2) removing SNPs with a minor allele frequency < 5%, and (3) excluding individuals with a genotype missing rate > 10%. After quality control, 850 families, 3195 females (including 1959 females from 850 families and additional 1236 unrelated females), and 11,344 SNPs were kept to conduct the subsequent analyses.
It is important to note that estimating γ requires the SNPs on the X chromosome to be associated with the traits under study. Therefore, borrowing the idea of the GEMMA method for association analysis on autosomes based on only general pedigrees [27], we propose an improved linear mixed model to test for association on the X chromosome based on the mixed data. We made the following two main modifications: Firstly, we set the relatedness matrix as the block matrix ϕ in the Materials and Methods section so that the proposed linear mixed model is applicable to the mixture of general pedigrees and additional unrelated females. Secondly, the parameter γ is generally unknown. To consider the XCI, referring to Wang et al. [24], we utilized the grid search method and γ was taken to be {0, 0.5, 1, 1.5, 2} in the increments of 0.5. We used the improved linear mixed model to calculate the p-value for each value of γ, and then combined these five p-values using Cauchy's method [49] to obtain the final test statistic. We conducted some simulation studies and found that the proposed improved linear mixed model can control the type I error rate well (the details can be seen in Supplementary Appendix SC and Supplementary Table S15). It should be noted that the five quantitative traits in the MCTFR dataset do not follow normal distributions. Therefore, we transformed the traits using the rank-based inverse normal transformation [50] before carrying out association analysis. Furthermore, we incorporated two covariates, age and year of birth, into the improved linear mixed model. The significance level of the association tests was set to be 0.05/11344 = 4.41 × 10 −6 based on the Bonferroni correction.
The proposed linear mixed model identified three SNPs, rs10522027, rs12860832, and rs12849233, which are associated with the DEP trait at the 4.41 × 10 −6 level. The positions, alleles, minor allele frequencies, corresponding traits, p-values, and genes which the three SNPs belong to are presented in Table 6. SNP rs10522027 is found within the gene transmembrane protein 47 (TMEM47), which may be associated with the chemoresistance of breast cancer cells and hepatocellular carcinoma [51]. SNPs rs12860832 and rs12849233 are found in the gene PAS domain containing repressor 1 (PASD1), which might serve as a new target for the prognosis and the future treatment of glioma [52]. Furthermore, we calculated the point estimates (γ BN M ,γ BUM ,γ BNP ,γ BUP ,γ BN , andγ BU ) and the 95% HPDIs of γ based on the proposed BNM, BUM, BNP, and BUP methods and the existing BN and BU methods for these three SNPs, where the BNM and BUM methods use the mixed data (850 families and an additional 1236 unrelated females), the BNP and BUP methods utilize only 850 families with 1959 females, and the BN and BU methods are applied to only the additional 1236 unrelated females. The point estimates and the corresponding 95% HPDIs of γ obtained by the six methods for these SNPs are listed in Table 7. It is shown that the six point estimates of γ for the three SNPs are not far away from one, and the corresponding 95% HPDIs all contain one, which means that the XCI patterns of the three SNPs are the XCI-R or the XCI-E. In addition, we can observe the advantage of the BNM and BUM methods because they generally obtain smaller credible intervals than the other four methods, which is consistent with our simulation results. However, the BNP and BUP methods can give shorter HPDIs than the BN and BU methods, which does not coincide with our simulation results. This could be because the number of females in the 850 families is 1959, which is much larger than the number of additional unrelated females (1236).

Discussion
In this paper, we consider a generalized linear mixed model and propose two Bayesian methods (BNM and BUM) to estimate the degree of XCI-S (i.e., γ) based on the mixture of general pedigrees and additional unrelated females for both quantitative and qualitative traits, where the BNM method uses the prior of the truncated normal distribution and the BUM method utilizes the prior of the uniform distribution, which both make full use of the constraint condition of γ ∈ [0, 2]. When only general pedigrees were available, the BNM and BUM methods were reduced to the BNP and BUP methods, respectively. We do not propose the corresponding Fieller's method and the Penalized Fieller's method to estimate the degree of XCI-S based on general pedigrees in this paper, as it has been confirmed that the performance of the above two methods is worse than Bayesian methods for only unrelated females [32]. It is important to note that that the closed form of the posterior distribution of γ is not easily derived, so we applied the HMC algorithm to conduct the posterior sampling process, calculated the mode of the resulting samples as the point estimate of γ, and regarded the HPDI of γ as the credible interval of γ. However, the posterior sampling process based on general pedigrees is very computationally intensive, especially when the dimension of the relatedness matrix (i.e., ϕ in this paper) is over 1000 [36]. As such, we used the EVD and Cholesky decomposition of ϕ to speed up the posterior sampling process for quantitative and qualitative traits, respectively. On the other hand, we also considered the median and the percentile interval (the 2.5th and 97.5th percentiles) of the posterior samples as the point estimate and the credible interval of γ, respectively. However, they performed less well than the mode and the HPDI (data not shown for brevity), and then we selected the latter instead.
The simulation results demonstrate that the EVD and Cholesky decomposition can greatly speed up the posterior sampling process, which is important to allow our proposed methods to accommodate large sample sizes, and may be referenced by other Bayesian researchers. The simulation results under p f = p m and σ 2 e0 , σ 2 e1 , σ 2 e2 = (1, 1, 1) also show that the BNM and BUM methods have similar performances and are advantageous over the other four methods, which indicates that it is necessary to simultaneously analyze general pedigrees and additional unrelated females when estimating the degree of XCI-S in practice. More specifically, for the point estimation, the MSEs ofγ BN M andγ BUM are close to each other and are smaller than those of the four other point estimates. The MSE ofγ BN M is the smallest in all the simulated situations. The MSEs of the existing point estimatesγ BN andγ BU for unrelated females are slightly smaller than those ofγ BNP andγ BUP for general pedigrees when the number of females is fixed. This suggests that general pedigrees provide less information for estimating the degree of XCI-S than unrelated females when the total number of the females in all the pedigrees and that of the unrelated females are the same. For the interval estimation, all six methods (BNM, BUM, BNP, BUP, BN, and BU) control the CPs around 95%. The BNM and BUM methods perform similarly to each other and both obtain much smaller credible intervals (W median and W mean ) than the other four methods under all the simulated scenarios. The BNP and BUP methods perform slightly worse than the BN and BU methods when the number of females is fixed, which is consistent with the findings based on the point estimation (γ BNP ,γ BUP ,γ BN , andγ BU ). For two priors of γ, the performances of the BNM, BNP, and BN methods with the truncated normal prior are slightly better than those of the BUM, BUP, and BU methods with the uniform prior, whereas differences of these kinds are not so obvious in our proposed methods, suggesting that our proposed methods are not as sensitive to the choice of priors. Furthermore, our proposed methods perform better when N p and n I f increase, p f and p m (the frequency of the deleterious allele D) increase, σ 2 g (the variance of the polygenic effects) decreases, or the trait changes from qualitative to quantitative. When there are missing genotypes for some individuals in pedigrees, the SLINK software based on the peeling algorithm [53] could be used to impute these missing genotypes. However, to make the test statistics in hypothesis testing robust, or the parameter estimation accurate and precise, one may repeatedly impute the missing genotypes using the SLINK software (e.g., 50 imputations), which is very time-consuming for our proposed Bayesian methods. On the other hand, it is easy to combine 50 resulting point estimates of γ by taking the mean, median, or mode of them as the final point estimate; however, there appears to be an issue with the process of combining the 50 resulting credible intervals. Therefore, when the genotypes of some individuals in the collected pedigrees were missing, we did not impute these missing genotypes. Instead, we chose to delete the individuals with missing genotypes directly. In fact, the simulation results show that, even when the genotypes of approximately 40% of individuals in general pedigrees are missing, our proposed methods can still control the CPs well, indicating that our proposed methods are robust when there are missing genotypes in the data. The simulation results also show that our proposed methods still work when the frequency of the deleterious allele in females and that in males are different (i.e., p f , p m = (0.3, 0.1) and (0.1, 0.3)). Furthermore, when heteroscedasticity exists (i.e., 1.2, 1)), our proposed methods remain robust. The proposed methods have the following issues to be discussed: Firstly, it is well known that the prior distributions of unknown parameters are important in Bayesian inference, and the choice of them may affect the results. In this paper, we consider two priors for γ, a non-informative prior U(0, 2) which has little effect on the posterior distribution of γ, and a truncated normal distribution N(1, 1) ∈ [0, 2] based on the genetic background of XCI. We also take account of non-informative priors for regression coefficients and weak priors for variances. In practical applications, researchers can choose appropriate priors according to their research background. Secondly, the Bayesian method adopts the HMC algorithm for the posterior sampling process, which is not greatly influenced by the correlations among unknown parameters. Therefore, for computational efficiency, we assume that all unknown parameters are unrelated. However, Bayesian methods should have better performance if the correlation between parameters is considered. Thirdly, the HPDIs that contain the number one can only indicate that the SNP undergoes the XCI-R or XCI-E pattern. The process of further distinguishing the XCI-R and XCI-E patterns is a potential problem to be solved. Fourthly, Ma et al. [23] claimed that the variance of the quantitative trait under study for heterozygous females may be higher than that for homozygous females in some cases. For computational efficiency, we assumed that the variances of quantitative traits for different genotypes in females are the same in our proposed methods.
To address the issues mentioned above, we will consider the following improvements in the future: Firstly, we will take into account non-informative priors for variances, such as non-informative Gamma prior or inverse-Gamma prior [41], to improve our proposed methods. Secondly, we will use the Gibbs sampling algorithm [54] to conduct the Bayesian posterior sampling process when the parameters are correlated. Thirdly, the information from the XCI-E can be estimated using the difference of transcriptional dosage on the X chromosome between male hemizygotes and female homozygotes. Therefore, we will incorporate the information from males into our model to further distinguish the XCI-E from the XCI-R. Fourthly, although we have completed some simulation studies showing that our proposed methods are still robust in the presence of heteroscedasticity (Supplementary Tables S9-S14), we will extend our proposed methods to manage the situation of heteroscedasticity to further improve the precision and the accuracy of estimating the degree of XCI-S in the future. Finally, besides the GEMMA [27], we understand that the REGENIE method for autosomal SNPs [55] could take into account population stratification. Therefore, we will extend it to test for the association between X chromosomal SNPs and traits based on the mixed data in the future.

Conclusions
In summary, we propose a Bayesian method with two priors (the truncated normal prior and the uniform prior) to estimate the degree of XCI-S based on the mixture of general pedigrees and additional unrelated females, which are denoted by the BNM and BUM methods, respectively. We also develop the corresponding Bayesian method, which is suitable for only general pedigrees, denoted by the BNP and BUP methods. We conducted an extensive simulation study to compare the performance of our four proposed methods with the two existing BN and BU methods. The simulation results show that the BNM method obtains the smallest MSE, the shortest width of the HPDIs, and the most stable CPs, which indicates that it is more efficient in estimating the degree of XCI-S by simultaneously using general pedigrees and additional unrelated females. Finally, we applied the proposed methods to the MCTFR data, and found that three associated SNPs (rs10522027, rs12860832, and rs12849233) undergo the XCI-R or XCI-E pattern.