XCMAX4: A Robust X Chromosomal Genetic Association Test Accounting for Covariates

Although the X chromosome accounts for about 5% of the human genes, it is routinely excluded from genome-wide association studies probably due to its unique structure and complex biological patterns. While some statistical methods have been proposed for testing the association between X chromosomal markers and diseases, very a few of them can adjust for covariates. Unfortunately, those methods that can incorporate covariates either need to specify an X chromosome inactivation model or require the permutation procedure to compute the p value. In this article, we proposed a novel analytic approach based on logistic regression that allows for covariates and does not need to specify the underlying X chromosome inactivation pattern. Simulation studies showed that our proposed method controls the size well and has robust performance in power across various practical scenarios. We applied the proposed method to analyze Graves’ disease data to show its usefulness in practice.


Introduction
Many diseases exhibit a gender preference, such as autoimmune diseases, cardiovascular diseases, psychiatric diseases, and cancer, implying that genetic variants on the X chromosome play an important role in sex differences [1][2][3][4][5]. However, most genome-wide association studies (GWAS) routinely exclude the analysis of X-chromosomal variants probably because the X chromosome has a unique structure and complex biological patterns [6][7][8]. Females have one more X chromosome than males, and to balance gene expression on the X chromosome with that of males, one of the female X chromosomes is inactivated in the early embryo [9]. Usually, the process of X chromosome inactivation (XCI) is considered random (XCI-R) [10], i.e., for an X-linked gene, nearly 50% of the cells have the paternal allele active while the rest cells have the maternal allele active. However, studies have shown that skewed XCI (XCI-S) is more biologically plausible [11]. XCI-S is a non-random process, which has been defined as a significant deviation from XCI-R, for instance, the inactivation of one of the alleles in more than 75% of cells [12]. In addition, up to 25% of X-linked genes can escape from XCI (XCI-E) [9]. Both alleles in the genes under XCI-E will be active, which are similar to autosomal genes.
To account for the unique characteristics of the X chromosome, several statistical methods have been developed for testing the association between X chromosomal markers and diseases [13][14][15][16][17][18]. However, very a few of them can adjust for covariates. In large-scale GWAS, spurious associations may occur due to the influence of additional covariates, such as sex, age, and population structure [19,20]. Particularly on the X chromosome, if the sex ratios differ between cases and controls, then sex will be a confounder when the allele frequency of females is unequal to that of males. In practice, a natural way to adjust for covariates is to build a regression model, and logistic regression is generally adopted for binary traits. Based on the logistic regression framework, Gao et al. [15] integrated four tests (FM 01 , FM 02 , FM F , and FM S ) in the software toolset XWAS. In FM 01 and FM 02 , three genotypes of females are both coded by 0, 1, and 2, while two genotypes of males are coded by 0 and 1 for FM 01 and by 0 and 2 for FM 02 . In the latter, males are treated as homozygous females to reflect the dosage compensation relationship between the two sexes. Hence, FM 01 and FM 02 assume that the underlying XCI patterns are XCI-E and XCI-R, respectively. On the other hand, FM F and FM S build logistic regressions for females and males separately and then combine the two p values using Fisher's and Stouffer's methods, respectively. However, these two methods do not take any XCI patterns into consideration and thus may suffer from substantial power loss if the test marker is undergoing XCI. Wang et al. [14] proposed another approach (denoted by maxLR) that can consider four special XCI patterns simultaneously: XCI-R, XCI-E, XCI-S fully toward the normal allele (XCI-SN), and XCI-S fully toward the risk allele (XCI-SR). In their method, three genotypes of females are coded as 0, γ, and 2 under XCI, where γ ∈ (0, 2) measures the degree of skewness of XCI. For instance, γ = 0 (2) represents that all the risk (normal) alleles are inactivated in heterozygous females, which corresponds to the XCI-SN (XCI-SR) pattern. While maxLR has robust performance in power, its p value is evaluated based on the permutation procedure, which is very computationally intensive, especially in GWAS. Hence, it is still desirable to develop a robust method that can both adjust for covariates and analytically calculate the p value.
To fill this gap, this article proposed a novel statistical method to test the association between X chromosomal markers and a specific disease. Our method, which is also based on logistic regression, is robust because it does not require assigning a specific XCI pattern. Further, our method can compute the p value without the resample procedure by directly using the rhombus formula. We implemented an extensive simulation study to compare the performance of our approach with the existing ones. Simulation results showed that our method controls the size well and can maintain relatively high power across a variety of scenarios. Finally, we applied our proposed approach to the Graves's disease data to demonstrate its practical use.

Method
Consider an X-linked SNP with deleterious allele A and normal allele a. Then, there are three possible genotypes for females: aa, Aa, and AA, and two for males: a and A. We assume a binary variable D for the disease of interest with D = 1 (0) representing individuals with (without) the disease. X = x 1 , · · · , x p denotes the p covariates that need to be adjusted in the model, where x 1 ≡ 1 is the model intercept and x 2 represents the binary variable with 1 being female and 0 being male. We further assume that the relationship between the phenotype and genotype for individual i can be constructed by the following logistic regression model: where the subscript i denotes the ith individual, G is the genotypic score, α = α 1 , α 2 , · · · , α p , and β represents the regression coefficients for the covariates and the genotypic score. Note that the genotypic score depends on the underlying XCI pattern. According to the coding strategy by Wang et al. [14], G i can be written in the following uniform form where I(.) is the indicator function, and Z 1 and Z 2 are unknown parameters depending on the underlying XCI pattern. For instance, when the SNP is undergoing XCI, Z 1 and Z 2 can be assigned by γ and 2, respectively. In this coding strategy, γ is a measure of the skewness of XCI, and males are treated as homozygous females to reflect the dosage compensation. We chose the score statistic to test the null hypothesis: β = 0 because the association tests for all the SNPs share the same null model. For a total sample size of n, the score function can be derived as where Pr(D i = 1|X i ) = exp X i α 1+exp X i α is the disease probability estimated for individual i without considering the genotype (details of the derivation are given in Appendix A). The information matrix of (1) can be written as follows: where Under the null hypothesis: β = 0, we have is estimated as the variance of U(Z 1, Z 2 ). Therefore, the statistic asymptotically follows a standard normal distribution under the null hypothesis. Note that the calculation of the test statistic relies on the underlying XCI pattern. Unfortunately, this is generally unknown for a specific SNP. We thereby proposed a robust test referred to as XCMAX4 to account for the four special XCI models. The XCMAX4 statistic is defined as follows: Due to the correlation between the four score tests, XCMAX4 does not follow any classical distributions. We assume that S(0, 2), S(1, 2), S(2, 2), and S(1, 1) jointly asymptotically follow a multivariate normal distribution N(0, Σ), where 0 is a four-dimensional vector with all elements being 0, and Σ is the correlation matrix with In the above correlation matrix, ρ (z 11 ,z 21 ),(z 12 ,z 22 ) is the correlation coefficient between S(Z 11 , Z 21 ) and S(Z 12 , Z 22 ). Given Σ, we can analytically derive the p value of XCMAX4.
Particularly, let f (y, 0, Σ) be the density function of the multivariate normal distribution N(0, Σ); then, for a given z > 0, the p value of XCMAX4 is calculated by Next, we need to accurately estimate the correlation matrix Σ. To this end, we first build a new model that contains two parameters representing genetic effects as follows: log The information matrix of (2) can be expressed as follows: Under the null hypothesis β 1 = β 2 = 0, the statistic asymptotically follows a chi-square distribution with two degrees of freedom, where is the covariance matrix of U(Z 11, Z 12 ) and U(Z 21, Z 22 ). Therefore, the correlation coefficient between S(Z 11, Z 12 ) and S(Z 21, Z 22 ) can be estimated as Once Σ is estimated, we can calculate the p value of XCMAX4. Although the fourdimensional integral can be calculated in the commonly used software (e.g., the mvtnorm package in R, https://cran.r-project.org/web/packages/mvtnorm/index.html, (accessed on 10 April 2022)), the algorithm based on the Quasi-Monte-Carlo procedure needs a lot of computing resources to achieve relatively high accuracy. Hence, it would be still desirable to obtain its analytic form if possible. Fortunately, we can use the rhombus formula [13,21] to obtain the upper bound of the p-value of XCMAX4 as follows: where Φ(x) and φ(x) denote the cumulative distribution function and probability density function of the standard normal distribution, respectively, and is the correlation efficient between ith and (i + 1)th score statistics. Note the order of four test statistics S(0, 2), S(1, 2), S(2, 2), and S(1, 1) is not specified in the above formula, so 12 kinds of upper bounds can be obtained. Therefore, only the smallest bound among them is adopted as an approximation of the p value. As shown in Wang et al. [13], such approximation is very accurate for small p values, which would be quite useful in GWAS because the significance level is generally very stringent (e.g., 5 × 10 −8 ) in such studies.

Simulation Settings
We conducted comprehensive simulation studies to compare the performance of XCMAX4 with FM 01 , FM 02 , FM F , and FM S , all of which can adjust covariates. Note that we did not include the maxLR in our simulations because this method is a permutation-based approach, which would be too time-consuming for GWAS. The data are simulated from the following model: log where x 2 is the binary covariate sex, x 3 is a continuous covariate, which is sampled from the uniform distribution U(0, 1), and G is the genotype score. The ratio of males to females is assumed to be 1 : 1 in the general population, so x 2 follows a binomial distribution B(0.5). Further, we assume that the genotype of females (aa, Aa, AA) follows a trinomial distribution with probabilities q f 0 , q f 1 , q f 2 , while the genotype of males (a, A) follows a binomial distribution (1 − q m , q m ). Let q f and F be the respective risk allele frequency and the inbreeding coefficient for females. Then, we have The values of q f and q m are both set to be 0.1, 0.2, and 0.3, so there are nine combinations in total. F is assigned to be 0 and 0.05, where the former implies Hardy-Weinberg equilibrium (HWE) and the latter represents a scenario of Hardy-Weinberg disequilibrium (HWD). The intercept α 1 is fixed at −5. For the coefficients x 2 and x 3 , we consider two cases for each of them: α 2 = (0.4005, −0.4005) and α 3 = (0.5, 1.5). The genetic effect β is set to be 0, 0.1116, 0.15, and 0.1858, where β = 0 means no association between the SNP and the disease status, and the other three values of β indicate that the odds ratios of females with genotype AA are about 1.25, 1.35, and 1.45. Obviously, the case of β = 0 is used to study the size, while the empirical power is investigated in the non-zero β cases.
Note that, when studying the power, we only choose three combinations of q f and q m : (0.3, 0.3), (0.3, 0.2), and (0.2, 0.3) for convenience. The scenarios that the SNP undergoes XCI or escapes from XCI are both considered. For the former, we let γ range from 0 to 2 in increments of 0.5. As such, we have considered various XCI patterns, including XCI-SN, XCI-R, and XCI-SR. Once the XCI pattern is assumed, we can assign the corresponding value for the genotypic score G.
Given the covariates, the genotypic score, and the regression coefficients, we can generate the disease status from the binomial distribution for a large population. Then, we randomly sample 2500 cases and 2500 controls from this population. We find that when α 2 = ±0.4005, the proportions of females in cases varied from 40% to 60% in the simulated data. The size is estimated at three nominal levels: α = 1 × 10 −3 , 1 × 10 −4 , and 1 × 10 −5 based on 1,000,000 replicates, while the power is only estimated at the nominal level α = 1 × 10 −4 based on 10,000 replicates. The p value of XCMAX4 is evaluated by using the rhombus formula. Table 2 shows the estimated type I error rate at the nominal significance level α = 1 × 10 −4 when HWE holds in the female population. As expected, all the methods controlled the size well in all the scenarios. Although XCMAX4 appears slightly conservative in some scenarios, its p values are similar to the nominal level. We also simulated the scenarios of HWD (F = 0.05). However, we observed that the performances of all the tests were similar to those of Table 2, and HWD in females had little impact on the size. Therefore, the simulation results with non-zero F are presented in the Supplementary Material (Table S1). The results of type I error rates estimated at the nominal level α = 1 × 10 −3 and α = 1 × 10 −5 are also given in Supplementary Material (Tables S2-S5). As can be seen, XCMAX4 still had the correct size in general, except being slightly conservative at α = 1 × 10 −3 . observed that XCMAX4 had a better power than FM when = 0.5, while FM performed slightly better than XCMAX4 when = 1.5. In both scenarios, FM was still the most powerful method, but the differences in power between these three methods were generally very small. Notice that the results in Figures 2 and 3 are analogous to those in Figure 1, and thereby the allele frequencies of females and males did not apparently change the power profiles of all of the methods.      In Figure 1, we can see that FM 01 and FM F were generally less powerful than other methods in all situations. XCMAX4 performed best when γ = 0 (XCI-SN) and 2 (XCI-SR). However, when γ = 1 (XCI-R), FM 02 was the most powerful, followed by FM S and XCMAX4. This was expected because FM 02 is proposed exactly under XCI-R. We also observed that XCMAX4 had a better power than FM S when γ = 0.5, while FM S performed slightly better than XCMAX4 when γ = 1.5. In both scenarios, FM 02 was still the most powerful method, but the differences in power between these three methods were generally very small. Notice that the results in Figures 2 and 3 are analogous to those in Figure 1, and thereby the allele frequencies of females and males did not apparently change the power profiles of all of the methods. Figure 4 plots the powers of XCMAX4, FM 01 , FM 02 , FM F , and FM S under the XCI-E pattern with β = 0.15. Based on this figure, FM 01 was uniformly the most powerful in all scenarios as expected, followed by FM S and XCMAX4. FM 02 was generally less powerful than FM 01 , FM S , and XCMAX4, but still performed better than FM F . The power results with β = 0.15, and F = 0.05 are provided in Supplementary Material (Figures S1-S4), which are similar to those in Figures 1-4, indicating that HWD in females had little effect on the power results. The power results with β = 0.1116 and 0.1858 are generally consistent with those in Figures 1-4, implying that the properties of XCMAX4 did not vary with the magnitude of the genetic effect (see Figures S5-S20 in Supplementary Material). As expected, when the value of β increased, the powers of all methods uniformly increased.

Size
In conclusion, FM 01 and FM 02 can have high power if the underlying XCI pattern is modelled correctly but may be less powerful in other scenarios. In contrast, XCMAX4 retained a relatively good power across a variety of scenarios. Compared to XCMAX4, FM S may suffer from power loss if the SNP is undergoing XCI but will be more powerful under XCI-E. FM F had the overall worst performance and thus is not recommend. It should be noted that, FM 01 , FM 02 , FM F , and FM S adopted logistic regression, which is slightly more computationally intensive than XCMAX4 in GWAS because the implementation of the logistic regression requires additional iterations. Compared to the other four methods, testing 2000 SNPs, XCMAX4 saved half the time. The details of time comparisons are given in Supplemental Material (Table S6). erful than FM , FM , and XCMAX4, but still performed better than FM . The p sults with = 0.15, and = 0.05 are provided in Supplementary Material (Fig  S4), which are similar to those in Figures 1-4, indicating that HWD in females h effect on the power results. The power results with = 0.1116 and 0.1858 are g consistent with those in Figures 1 to 4, implying that the properties of XCMAX4 vary with the magnitude of the genetic effect (see Figures S5-S20 in Supplement terial). As expected, when the value of increased, the powers of all methods un increased. In conclusion, FM and FM can have high power if the underlying XCI is modelled correctly but may be less powerful in other scenarios. In contrast, retained a relatively good power across a variety of scenarios. Compared to X FM may suffer from power loss if the SNP is undergoing XCI but will be more p under XCI-E. FM had the overall worst performance and thus is not recomm should be noted that, FM , FM , FM , and FM adopted logistic regression, slightly more computationally intensive than XCMAX4 in GWAS because the im tation of the logistic regression requires additional iterations. Compared to the ot methods, testing 2000 SNPs, XCMAX4 saved half the time. The details of time c sons are given in Supplemental Material (Table S6).

Application to Graves' Disease Data
Graves' disease (GD) is an autoimmune disease of hyperthyroidism that is fo more common in women than in men [22,23]. Substantial studies have shown genetic background explains about four-fifths of the susceptibility to GD.
Considering the distinct gender bias, it is highly reasonable to speculate genes on the X chromosome play an important role in the development of GD. R two independent studies found that rs3827440, a non-synonymous SNP of the gene on the X chromosome, was associated with GD. A two-stage GWAS, focuse Han population in China, first reported this finding, which was further validate Caucasian cohorts. There are two alleles at rs3827400, with T being the risk alle

Application to Graves' Disease Data
Graves' disease (GD) is an autoimmune disease of hyperthyroidism that is four times more common in women than in men [22,23]. Substantial studies have shown that the genetic background explains about four-fifths of the susceptibility to GD.
Considering the distinct gender bias, it is highly reasonable to speculate that the genes on the X chromosome play an important role in the development of GD. Recently, two independent studies found that rs3827440, a non-synonymous SNP of the GRP174 gene on the X chromosome, was associated with GD. A two-stage GWAS, focused on the Han population in China, first reported this finding, which was further validated in two Caucasian cohorts. There are two alleles at rs3827400, with T being the risk allele and C being the normal one. Table 3 displays the four datasets about rs3827400 mentioned in these two studies. We applied XCMAX4, FM 01 , FM 02 , FM F , and FM S to each dataset; the results are shown in Table 4. Note that sex was included as a covariate when calculating the p values of XCMAX4, FM 01 , and FM 02 .  This table indicates that none of these methods uniformly performed the best across all four datasets. For the two datasets from the Chinese population, all methods consistently showed that rs3827400 was associated with GD at the 1 × 10 −4 significance level. Among these tests, XCMAX4 consistently had the second smallest p values. However, the p values of all the methods from both Caucasian datasets suggested no such an association at the same significance level probably because of their relatively small sample size. We also observed that XCMAX4 appeared slightly conservative in these scenarios, but this was not surprising because the rhombus formula is less accurate when the p value is greater than 0.01.
Because both the Han population and the Caucasian population contained two datasets, we also tested such association at the population level by treating the data source as an additional covariate. The corresponding results are given in Table 5, which are similar to those in Table 4.

Discussion
This paper proposed a novel robust method, XCMAX4, to test the association between the marker on the X chromosome and a specific disease for case-control design. Our method is an extension of the CMAX3 [24] test on the X chromosome, which can both incorporate the information of XCI and allow for covariates. Unlike the maxLR proposed by Wang et al., XCMAX4 is construted by using the score test, which is more efficient in GWAS because we only need to fit the null model once. Moreover, the maxLR requires permutation to calculate the p value, which makes it unappealing in GWAS. In contrast, the p value of XCMAX4 can be computed analytically by using the rhombus formula. On the other hand, although FM 01 , FM 02 , FM F , and FM S can also adjust for covariates, they do not take various XCI models into consideration and thus may suffer from substantial power loss in some scenarios. However, XCMAX4 can retain a relatively high power by accounting for four special XCI patterns simultaneously. Simulation results showed that XCMAX4 controlled the size well and had robust performance in power. Therefore, we recommend using XCMAX4 for its effectiveness, robustness, and generality. Finally, to help implement XCMAX4 in practice, we provide an R function XCMAX4, which is available at https://github.com/YoupengSU/XCMAX4.git (accessed on 12 April 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Detailed Derivation of the Score Statistic
The log-likelihood function of Model (1) can be written as representing the probability of having disease for ith individual.
By Cox et. al. [25], we can obtain the score test statistic as W S = U θ 0 I θ −1 U θ 0 = U θ 0 I θ −1 U θ 0 = U θ 0 I 11 I 12 I 21 I 22 which asymptotically follows a chi-square distribution with degrees of freedom being 1. In Model (2), β becomes a two-dimensional vector, and the proofs are similar, so the details are omitted.