Adaptively Integrative Association between Multivariate Phenotypes and Transcriptomic Data for Complex Diseases

Phenotype–gene association studies can uncover disease mechanisms for translational research. Association with multiple phenotypes or clinical variables in complex diseases has the advantage of increasing statistical power and offering a holistic view. Existing multi-variate association methods mostly focus on SNP-based genetic associations. In this paper, we extend and evaluate two adaptive Fisher’s methods, namely AFp and AFz, from the p-value combination perspective for phenotype–mRNA association analysis. The proposed method effectively aggregates heterogeneous phenotype–gene effects, allows association with different data types of phenotypes, and performs the selection of the associated phenotypes. Variability indices of the phenotype–gene effect selection are calculated by bootstrap analysis, and the resulting co-membership matrix identifies gene modules clustered by phenotype–gene effect. Extensive simulations demonstrate the superior performance of AFp compared to existing methods in terms of type I error control, statistical power and biological interpretation. Finally, the method is separately applied to three sets of transcriptomic and clinical datasets from lung disease, breast cancer, and brain aging and generates intriguing biological findings.


Introduction
Identifying genes associated with disease phenotypes in transcriptomics studies is of long-standing interest. Many methods for testing genes associated with a univariate phenotype have been labelled as "differential expression" analysis, which is named after the commonly encountered case-control design; examples include edgeR [1], limma [2], and DESeq2 [3] (also see review papers [4,5] for more details). In many complex diseases, however, patients are often characterized by multiple phenotypes that reflect different facets of the disease. For example, chronic obstructive pulmonary disease (COPD) can be characterized by FEV1 (i.e., the volume of breath exhaled with effort in one second), FVC (i.e., the full amount of air that can be exhaled with effort in a complete breath), as well as several test indices of blood. Despite active research in univariate association analysis, the problem of testing genes associated with multivariate phenotypes is understudied and presents new challenges.
A naive association is to test the association between all pairs of genes and phenotypes and interpret significant genes for each phenotype. This, however, loses the statistical power and interpretation of gene mechanisms in often correlated phenotypes. Therefore, there is great interest in aggregating the effects of a gene on multiple phenotypes and testing for the overall association. Hence in this paper, we consider the following union-intersection test (UIT) [6] or conjunction null hypothesis [7] to determine the overall association between a gene and K phenotypes of interest: where θ = (θ 1 , . . . , θ K ) and each θ k denotes the effect size for measuring the association between the gene and phenotype k. Conventional methods for the UIT test include multivariate analysis of variance (MANOVA), linear mixed models (LMMs), and generalized linear models (GLMM). However, such methods either impose strong assumptions on the data type of phenotypes (e.g., Gaussian assumption for MANOVA) or are incapable of handling multiple phenotypes of different data types (e.g., continuous, binary and categorical) at the same time. As it is common to have phenotypes of various data types in a complex disease, it is challenging to use the above methods to aggregate the effects of multiple phenotypes in practice. Another category of potential solution is regressionbased methods. O'Reilly et al. [8] proposed the MultiPhen method by regressing genotypes (SNPs in the paper but genes in our case) on phenotypes via a proportional odds logistic model, which is only applicable to ordinal outcomes in GWAS. Wu et al. [9] proposed the multi-trait gene sequence kernel association test (MSKAT), where the purpose is to test the association between a phenotype and multiple SNPs in a chromesome region. Since both regression-based methods were previously developed for SNP-based association and are difficult to extend to the transcriptomic scenario, we will not include them in the evaluation in this paper.
Borrowing ideas from meta-analysis, another approach is to combine summary statistics (e.g., p-values and test statistics) from the association tests that separately test for each phenotype. O'Brien et al. [10] combined test statistics from the individual tests on each trait weighted by inverse variance. Pan et al. [11] and Zhang et al. [12] proposed the sum of powered score tests (SPU) and adaptive SPU (aSPU) to combine the score test statistics derived from generalized estimation equations (GEE). However, the proposed methods are still incapable of dealing with phenotypes of different data types. In addition, the two methods involve tuning parameters and are not fully data-driven. Instead of directly combining test statistics, many p-value combination tests are also applicable to this problem. Examples include Fisher's method [13] (T Fisher = −2 ∑ K k=1 log p k ), or the minimum p-value method (ref. [14] denoted by minP hereafter) (T minP = minp i ), where for each k = 1, . . . , K, p k denotes the p-value of hypothesis test that tests the association between the gene and phenotype k. To account for dependency between phenotypes and thus the p-values, one can perform a permutation test by randomly shuffling sample columns in the gene expression data while keeping the phenotype data unchanged. In this way, the associations between genes and phenotypes break down while correlation structure between phenotypes is preserved. Van der Sluis et al. [15] also proposed to extend the Simes' test by exploiting the correlation structure between p-values (TATES). The p-value combination approaches can aggregate effects from multiple phenotypes of arbitrary data types, as long as the input p-values are derived from valid tests.
In addition to association tests, another challenge arises from heterogeneity across phenotypes, where only a subset of phenotypes have associations and the subsets differ for each gene. That is, in a multivariate phenotype-gene association study on a complex disease, a given gene may only have an association with a subset of phenotypes, and the association levels may be heterogeneous. The heterogeneity of associations can significantly impact the power of existing methods. For example, when the subset of phenotypes associated with the gene is small or there is only a single strong effect with many rather weak effects, the minP method will be more powerful, while when most phenotypes have associations, Fisher's method will perform better [16]. The heterogeneity can also lead to a misleading biological interpretation of existing methods. For example, suppose p 1 = (0.001, 1, 1) denotes p-values for separately testing associations between gene 1 and three phenotypes, and p 2 = (0.1, 0.1, 0.1) denotes the p-values for gene 2. Although both genes lead to the same Fisher's test statistics (T Fisher = 13.8), their biological interpretations hugely differ. p 1 indicates strong statistical significance in only the first phenotype, whereas p 2 indicates marginal statistical significance in all three. All the methods mentioned above fail to characterize the heterogeneity across the phenotypes. As a result, methods for aggregating heterogeneous effects that have a refined characterization of heterogeneity and competitive statistical power are preferred for this problem.
Another challenge is to identify the subset of phenotypes that are associated with a given gene. When the individual associations between the phenotypes and the gene are strong, it is easy to look at whether each corresponding p-value is significant or not. However, in our setting, where each individual association between the phenotype and the gene is weak (corresponding p-value is marginally significant or non-significant), it is challenging to identify the subset of associated phenotypes. To the best of our knowledge, no existing method has addressed this problem.
To address the above challenges, we extend two existing p-value combination methods to aggregate association information across phenotypes. In Li and Tseng [17] and Huo et al. [18], an adaptively weighted Fisher's statistic takes the minimal p-values among all possible combinatorial aggregations. Consequently, we denote the procedure modified from this method as AFp. Later, Song et al. [19] proposed an alternative adaptive Fisher's statistics by taking the minimal z-standardized Fisher score (i.e., z-standardization by subtraction of the mean and division of standard deviation) among all possible combinatorial aggregations, which we consequently denote as its modified method as AFz in this paper. Based on the p-value combination, both methods can accommodate multiple phenotypes of different data types at the same time. Furthermore, both methods adaptively assign a binary 0-1 weight to each input p-value to determine whether the corresponding phenotype is associated with the gene of interest. As a result, the proposed methods efficiently aggregate heterogeneous effects across phenotypes, provide a refined characterization of the heterogeneity of association patterns for better biological interpretation, and identify the subset of associated phenotypes, even when the individual associations are weak. We evaluate the two methods through comprehensive simulations. We find that both methods have comparable performance on type I error control and statistical power for detecting the overall association between a gene and multiple phenotypes. However, the AFp-based method outperforms the AFz-based method in phenotype selection. Consequently, we recommend the AFp-based method for future applications. For further downstream interpretation, we propose to evaluate the variability of the 0-1 weight estimates using variability indices constructed through bootstrapping. In addition, following Huo et al. [18], we further use the bootstrap samples to estimate a co-membership matrix of genes, followed by the tight clustering method [20] to identify gene modules of different phenotype association patterns.
The paper is structured as follows. Section 2 introduces the AFp and AFz-based methods and the accompanying downstream analysis methods. Section 3.1 performs extensive simulations to evaluate AFp, AFz, as well as other existing methods. Section 3.2 contains the results from a lung disease transcriptomic dataset, a breast cancer transcriptomic dataset, and a brain aging transcriptomic dataset, respectively. Section 4 includes final conclusion and discussion.

Methods
This section introduces our proposed methods for multivariate phenotype-gene association analysis. Section 2.1 discusses how to generate the input p-values. In Section 2.2, we propose to use the adaptive sum of log-transformed p-values to aggregate heterogeneous effects in the input p-values, which is the foundation that AFz and AFp are built upon. We also propose a permutation procedure to take the correlation structure of phenotypes into account. By extending the AFp and AFz methods, Sections 2.3 and 2.4 propose two ways to adaptively assign the binary 0-1 weights for the log-transformed p-values and formulate the final statistic for testing the overall association between multiple phenotypes and the gene of interest. Sections 2.5 and 2.6 discuss the bootstrap algorithms to estimate the variability indices of the 0-1 weight estimates and identify gene modules associated with disease phenotypes. Suppose there are n independent samples, K phenotypes, p gene features, and M covariates. Denote by Y ik , Z im , and X ij the k-th phenotype, m-th covariate, and j-th gene feature of subject i, respectively, where 1 ≤ k ≤ K, 1 ≤ m ≤ M, and 1 ≤ j ≤ p. Let . . , Z nm ) T , and − → X j = (X 1j , X 2j , . . . , X nj ) T be the vectors of the k-th phenotype, m-th covariate, and j-th gene for all the samples, respectively.

Generalized Linear Models for Generating Input p-Values
To generate the input p-values for both the AFp and AFz methods, the following generalized linear model is assumed for the k-th phenotype and the j-th gene with M covariates: where θ jk is the coefficient of the gene feature, α mkj is the coefficient of covariates, and g k () is the link function. The associated p-value p jk of θ jk can be derived by the classic Wald test or score test, which serves as the input p-values for the combination. Note that the distributional assumptions on Y ik and the choices of link function g k are not required to be consistent across the phenotypes, which allows one to accommodate phenotypes of different data types. For example, one can use g k (t) = logit(t) for binary outcomes, g k (t) = log(t) for count data, and g k (t) = t for continuous outcomes.

Aggregation of Heterogeneous and Dependent Effects in p-Values
To aggregate heterogeneous effects in the input p-values, we propose to use the following weighted sum of the log-transformed p-values: where w jk ∈ {0, 1} is the binary 0-1 weight to determine whether the k-th phenotype is associated with the j-th gene. Here, − → w j = (w j1 , w j2 , . . . , w jK ) T denotes the weight vector. The observed weighted statistic of U j ( − → w j ) is denoted by u j ( − → w j ). Let Ω = − → w j : − → w j = (w j1 , . . . , w jK ) ∈ {0, 1} K \ 0 be the searching space of all possible realizations of − → w j . In the following Sections 2.3 and 2.4, we adapt the AFp and AFp methods to aggregate U j ( − → w j )s with all possible choices of w j in Ω and adaptively choose the best − → w j to determine the subset of p-values with true signals. The above procedures require estimation of the mean, standard deviation, and p-value of U j ( − → w j ) for a given − → w j . However, those quantities are intractable when a complex dependency structure between the phenotypes is present. To account for the correlations between the phenotypes, we propose the following permutation procedure.
Recall that we want to test the conditional independence between each gene and K phenotypes given the covariates . We intend to develop a permutation procedure that breaks the associations between Y and − → X j while preserving the associations between Y and Z and between − → X j and Z. Simply permuting the genotype − → X j leads to an inflated type I error rate because it also breaks the correlations between − → X j and covariates Z. Following Potter [21] and Werft and Benner [22], we permute residuals of regressions of − → X j on Z for generalized regression models. That is, we first regress each gene feature on the covariates, then permute the residuals derived from the regression and fit the generalized linear model by regressing each phenotype on the permuted residuals.
More precisely, we denote the vector of residuals of regressing X j on Z by − → e j = (e 1j , e 2j , . . . , e nj ) T and permute it for B times. In the b-th permutation, we regress jk }. The observed weighted statistics for permuted data are calculated as jk ). Note that we break the association of each gene and each phenotype. Therefore, for a given − → w j , the null distribution of the weighted statistic

AFp
Below, we extend the adaptively weighted Fisher's method [17,18] to aggregate pvalues from each dependent phenotype. Since the statistics take the minimal p-values among all possible combinatorial aggregation, we denote the method as AFp. Under the null hypothesis that θ jk = 0, ∀k, the p-value of observed weighted statistic, p U (u j ( − → w j )), can be obtained for the j-th gene and a given weight − → w j ∈ Ω by Here, I{.} is the indicator function. We have further defined the AFp statistic as the minimal p-value among all possible weights − → w j ∈ Ω (see Li and Tseng [17] and Huo et al. [18] for more details): can be used to determine the subset of phenotypes associated with the gene j. To calculate , and the p-value of T AFp j can be calculated as In summary, p T (T AFp j ) can be used to determine whether the j-th gene is associated with K phenotypes and− → w j

AFp
can be used to determine which specific phenotypes the j-th gene is associated with.

AFz
Below, we extend another adaptive Fisher's method [19] to aggregate p-values from each dependent phenotype. Since the statistics take the minimal standardized Fisher's score (i.e., z-standardization by subtracting the mean and dividing by the standard deviation) among all possible combinatorial aggregation, we denote the method as AFz. For a given − → w j ∈ Ω, the mean and standard deviation of u j ( − → w j ) under null can be approximated , respectively. We denote by u j ( − → w j ) the z-standardized (observed) weighted statistic (see [19]), and the AFz statistic is defined as the largest standarized observed weighted statistic among all possible weights: The estimated weight vector is obtained bŷ To calculate the p-value of T AFz j , we obtain the standardized observed weighted statistic by u for each permutation b, and T In summary, p T (T AFz j ) can be used to determine whether the j-th gene is associated with K phenotypes, and− → w j AFz can be used to determine which specific phenotypes the j-th gene is associated with.
One can note that both AFp and AFz use either the minimum p-value (AFp) or maximum z-score (AFz) as the final test statistic, where each p-value (AFp) or z-score (AFz) corresponds to a subset of selected phenotypes (phenotypes with weights equal to 1), and all possible subsets are searched in the optimization. The rationale comes from the fact that the traditional Fisher's method combines all p-values, including non-signal ones. This greatly weakens statistical power. Both AFp and AFz aim to identify the most possible subset of phenotypes that are associated with the gene by an optimization procedure in the test statistics. This strategy of adaptive combination of p-values has gradually become popular in the literature, e.g., [17,19,[23][24][25][26].

Variability Index of Adaptive Weights
The 0-1 weight estimates w j = w j1 , . . . , w jK obtained by either AFp or AFz are binary and discontinuous as a function of the input p-values and thus may not be stable. Following [18], we use a bootstrap procedure to calculate an estimate of variability index U jk = 4 · Var w jk for the j-th gene and k-th phenotype, where the normalization factor 4 respectively.

Ensemble Clustering for Biomarker Categorization
As mentioned in the end of Section 2.4, a unique advantage of AFp and AFz is to estimate the 0-1 weights to identify the subset of associated phenotypes for a given gene (a weight of 1 or 0 means the phenotype is associated or independent with the gene, respectively). Consequently, the methods optimize and select from all possible subsets (i.e., 2 K − 1 combinations of the 0-1 weight values), which grow exponentially. When further considering the sign (positive/negative) of the association in each phenotype, the number of possible association patterns increase to 3 K − 1. For example, in the lung disease application (Figure 1), five phenotypes generate a total of 3 5 − 1 = 242 possible phenotype association patterns. To overcome this challenge, we performed a gene clustering procedure proposed by [18] to identify data-driven gene modules (M 1 , M 2 , · · · M q ) of major phenotype association patterns. We clustered genes by a co-membership matrix for all pairs of genes where each element of the co-membership matrix represents a similarity of signed weightv =ŵ× sign(θ) of any pair genes. Similar to Section 2.5, we bootstrapped data L times and obtained the signed weight statistics v jk AFp,(l) = w jk AFp,(l) × sign( θ jk AFp,(l) ) and v jk AFz,(l) = w jk AFz,(l) × sign( θ jk AFz,(l) ) for the j-th gene, k-th phenotype, and l-th bootstrapping data for AFp and AFz, respectively. We next calculated the co-membership matrix for l-th bootstrapping data of AFp as AFp,(l) for all k, and V AFp,(l) jj = 0 otherwise. The final co-membership matrix was calculated as V AFp = ∑ L l=1 V AFp,(l) /L and any classic clustering algorithm could be applied to obtain gene categorization. In this paper, we applied the tight clustering method [20] in real applications, which can eliminate the distractions of scattered genes and construct compact gene modules. V AFz can also be obtained in a similar way, followed by the tight clustering algorithm.

Simulation
In this section, we perform three simulations to evaluate: (1) Type I error control and power for AFp and AFz-based methods and other existing methods; (2) the accuracy of the weight estimates of the AFp and AFz-based methods. The methods evaluated include MANOVA, aSPU.ind, aSPU.ex, TATES, Fisher, minP, AFp, and AFz. In Simulations I and II, we consider continuous phenotypes without and with confounders, respectively, and all the methods above are evaluated. In Simulation III, we consider a mixture of phenotypes of count and continuous types, and we benchmark the performance of TATES, Fisher, minP, AFp, and AFz since other methods are not applicable. We have two different settings, A and B, in each of Simulations I, II, and III, where A mimics the scenarios where each phenotype-gene association has a similar effect size, and B generates the scenarios where some phenotypes have much stronger associations with genes compared to other phenotypes. In each simulation setting, we adapt a random effect model to simulate a hierarchical association structure between 10 phenotypes and 4800 genes, where phenotypes 1 to 4 are associated with genes 1 to 1600, phenotypes 5 to 9 are associated with genes 1601 to 3200, and phenotype 10 is associated with genes 3201 to 4800. The details of each simulation setting are illustrated below.

Simulation Settings
Simulation IA and IB: Simulation I simulated continuous phenotypes without confounders.
• We simulated u i1 , u i2 , u i3 ∼ N(0, σ 2 µ ) for each sample, where N() stands for a Gaussian distribution and 1 ≤ i ≤ N 1 . N 1 is the sample size. • We simulated 10 phenotypes, where x ) for 3201 ≤ j ≤ 4800. We set N 1 ∈ {100, 30}, σ x = 0.5, and σ µ ∈ {0, 0.4, 0.6} for varying correlation levels. When σ µ = 0, all the phenotypes are independent of genes, and the larger the σ µ is, the stronger association between phenotypes and genes. For σ k , we chose two values that corresponded to two different scenarios, respectively. In Simulation IA, we set σ k = 2 for 1 ≤ k ≤ 9 and σ 10 = 1, where each phenotype-gene association has a similar effect size. In Simulation IB, we set σ 1 = σ 5 = 0.05, σ 10 = 1, and σ k = 2 otherwise, where σ 1 = σ 5 = 0.05 ensures that the first phenotype has much significant association with genes 1 to 1600 compared with phenotypes 2 to 9 and the 5-th phenotype has much significant association with genes 1601 to 3200 compared with phenotypes 5 to 9. We used Simulation IB to evaluate the performance when some phenotypes have a much stronger association with genes compared with other phenotypes.
Simulation IIA and IIB: Simulation II simulated continuous phenotypes with a confounder z for genes 1 to 1600 and phenotypes 1 to 9.
Simulation IIIA and IIIB: Simulation III generated phenotypes with a mixture of count, binary, and continuoustype data.

Evaluation Benchmark
In Simulations I and II, the phenotypes were continuous, and we evaluated MANOVA, aSPU.ind, aSPU.ex, TATES, Fisher, minP, AFp, and AFz in terms of Type I error (σ µ = 0) and power (σ µ = 0.4, 0.6). We also evaluated AFp and AFz for their accuracy of weight estimation. The Type I error and power were calculated by , where S = 500 is the number of simulated data for each setting, p (s) j is the p-value of the j-th gene and the s-th simulated data of a generic method discussed in this paper, and I{.} is the indicator function. For the accuracy of the weight estimation of AFp and AFz, sensitivity (the proportion of weights estimated to be 1 when the truth is 1) and specificity (the proportion of weights estimated to be 0 when the truth is 0) were used for evaluation. We also included average weight estimates for each phenotype and genes from 1 to 1600, 1601 to 3200, and 3201 to 4800 for further inspection (see Section 3.1.3 and Tables S1, S3, and S4 for details). In Simulation III, the phenotypes were a mixture of count and continuous data, and MANOVA, aSPU.ind, and aSPU.ex are inapplicable. Therefore, we only evaluated TATES, Fisher, minP, AFp, and AFz in Simulation III. The benchmark criteria in Simulation III are the same as those in Simulations I and II. Table 1 shows the Type I error, power, sensitivity, and specificity results of Simulation I with N 1 = 100. The simulation results of N 1 = 30 are shown in Table S1 with a similar pattern. All methods control type I error well, and the AFp and AFz methods generally perform among the best in terms of power. For example, in Simulation IA, all the phenotypegene associations have similar effect sizes, and AFp (0.91) and AFz (0.9) have higher power than Fisher (0.87) and MANOVA (0.85) when σ µ = 0.6. In Simulation IB, AFp and AFz, respectively, have powers of 0.96 and 0.97 when σ µ = 0.6, which is comparable with minP (0.97) and higher than Fisher (0.89). In terms of 0-1 weight estimation, AFp has better sensitivity than AFz, and the gap is more significant in Simulation IB (0.49 and 0.78 for AFp compared with 0.29 and 0.31 for AFz). To dig further, in Table S2, we calculate the average weight estimate for AFp and AFz for each phenotype and 1600 genes in Simulation I. In σ µ = 0.6 and N = 100, AFz selects phenotype 1 over phenotypes 2, 3, and 4 with a significantly higher proportion of 0.78 for genes 1∼1600, while AFp also evenly selects phenotypes 2, 3 and 4 (with proportions 0.72, 0.71, and 0.72). For genes 1601∼3200, AFz also selects phenotype 5 with a significantly higher proportion of 0.76 (proportions of phenotypes 6∼9 are 0.29, 0.28, 0.29, and 0.30), while AFp selects the phenotype 6∼9 with probabilities of 0.68, 0.68, 0.67 and 0.67, which are much closer to the probability of phenotype 5 (0.84). This means that when a gene has different effect sizes of associations with several phenotypes, AFz will only detect the association of the phenotype with the strongest association, while AFp can detect all associated phenotypes more evenly.  Tables S3 and S4 show the Type I error, power, sensitivity, and specificity results of Simulation II for N 1 = 100and30, respectively. MANOVA cannot control Type I error well when there are confounders, while all other methods can control Type I error well. Similar to Simulation I, AFp and AFz generally perform among the best in terms of power, and AFp has better sensitivity than AFz in phenotype selection, especially when gene-phenotype association effect sizes are imbalanced (Table S5). Table 2 summarizes the results of Simulation III with N 1 = 100 when the phenotypes have count, binary, and continuous data. The simulation results of N 1 = 30 are shown in Table S6 with a similar pattern. In terms of power, AFp and AFz outperform the other three methods. For example, in Simulation IIIA with N 1 = 100, the power of TATES, minP, Fisher, AFz, and AFp are {0.43, 0.46, 0.51, 0.53, and 0.53} and {0.82, 0.79, 0.79, 0.9, and 0.91} for σ µ = 0.4 and σ µ = 0.6, respectively. Similar to Simulations I and II, AFp has much better sensitivity in terms of phenotype selection than AFz, and AFz only selects the phenotype with the strongest association (Table S7) in most cases.

Application to Complex Lung Diseases
We applied MANOVA, aSPU.ind, aSPU.ex, TATES, Fisher, minP, AFp and AFz to a lung disease transcriptomic dataset with 319 patients, where the majority of patients were diagnosed as the two most representative lung disease subtypes: chronic obstructive pulmonary disease (COPD) and interstitial lung disease (ILD). Gene expression data were collected from Gene Expression Omnibus (GEO) GSE47460, and clinical information was obtained from the Lung Genomics Research Consortium (https://topmed.nhlbi.nih.gov/ group/ltrc (accessed on 11 November 2019)). In this paper, FEV1, FVC, ratiopre, WBCDIFF1, and WBCDIFF4 were the five continuous phenotypes of interest. FEV1 (forced expiratory volume in 1 s) is the volume of air that can be forcibly blown out in the first 1 s after full inspiration. FVC (forced vital capacity) is the volume of air that can be forcibly blown out after full inspiration. Ratiopre is the ratio of FEV1 to FVC, and WBCDIFF1 and WBCDIFF4 are differential neutrophils (%) and differential eosinophils (%) in the white blood cells, respectively. Age, gender, and BMI were included as confounding covariates X in Equation (1) to calculate the input p-values for Fisher, minP, AFp, and AFz. After filtering samples with missing covariates, the final preprocessed dataset contained N = 279 samples and p = 15,966 genes. We first evaluated MANOVA, aSPU.ind, aSPU.ex, TATES, Fisher, minP, AFp, and AFz by the numbers of significant genes detected by each method, followed by gene module identification through the AFp-and AFz-based methods. Figure S1 shows violin plots of − log 10 (p-value) of all methods, where the significant genes are determined by Bonferroni correction with a cutoff of 0.05. We find that aSPU.ex has the most number of signficant genes (6092), followed by AFp (4367) and MANOVA (3973). We then focused on the significant genes of AFp-and AFz-based methods and categorized genes into gene modules. Table 3 shows the percentage of selected phenotypes for the 4367 and 3287 significant genes identified by AFp and AFz, respectively. Consistent with the simulation results, AFp has a more balanced distribution of phenotype selection, while AFz almost only selects ratiopre and unselects all the other phenotypes. For example, the percentages of genes selecting FEV1, FVC, and WBCDIFF1 are 79%, 48%, and 29% for AFp, while these numbers are 14%, 8%, and 1% for AFz. Figure S2 shows the boxplot of the − log 10 (p-value) of each phenotype for significant genes detected by both AFp and AFz methods, and it clearly indicates that the p-value of ratiopre is, on average, much smaller than that of other phenotypes. The AFz method almost only selects ratiopre and ignores others, which is consistent with the findings in Simulation IB, IIB, and IIIB (Tables S2-S4). As a result, we only performed gene categorization on the AFp results, as below. Table 3. The proportion of weight estimated to be 1 for significant genes (4367 and 3287 for AFp and AFz, respectively, determined by Bonferroni correction with cutoff 0.05) of AFp and AFz methods.
Following Section 2.6, we calculated the co-membership matrix of 4367 significant genes from AFp and utilized a tight clustering algorithm to cluster genes. A total of 1106 genes were clustered into seven clusters (C1, C2 · · · C7), where C1, C2, and C3 were closer to one another compared with other clusters, and they were categorized as module 1 (Figure 1a). Similarly, C6 and C7 were combined as module 4. Figure 1b shows the weight estimation. It again indicates that C1, C2, and C3 have similar patterns (upregulated FEV1 and ratiopre and no association with FVC and WBCDIFF1), and C6 and C7 have similar patterns (down-regulated FEV1, ratiopre, and WBCDIFF4 and up-regulated FVC and WBCDIFF1), which is also confirmed by Figure S3, the heatmap of directed − log 10 (p-values) of 1106 genes selected by the tight clustering method. To verify the weight estimations of the significant genes, we also calculated Spearman correlations between gene expressions and phenotypes in each cluster ( Figure S4), which matched the pattern in Figure 1b well. In addition, we found that phenotype ratiopre has relatively strong associations with genes in all four gene modules ( Figure S3) and is always assigned by 1 or −1 in weight estimation with high confidence (i.e., low variability index; see Figure S5). This observation justifies the common use of the FEV1/FVC ratio (ratiopre) in diagnosing obstructive and restrictive lung disease in current clinical practice [27,28].
We next conducted pathway enrichment analysis using Fisher's exact test based on the Gene Ontology (GO), KEGG, and Reactome pathway databases to assess the biological relevance of genes and show the top 10 significant pathways for each module ( Table 4). The top pathways for different modules depict distinct aspects of lung diseases. The top pathways in module 1 involve many DNA damage [29,30] and amino acid alternation/degradation pathways [31,32], which are known to be related to COPD in the literature. This set of genes is positively associated with FEV1 and ratiopre (and some with WBCDIFF4). Module 2 is enriched in many immune response pathways. The immune system needs to react promptly and adequately to potential challenges posed by microbes and particles, while at the same time avoiding extensive tissue damage. Many studies have shown the association between immune response and lung diseases, such as Toll-like receptor and NOD-like receptor [33,34] and kinase-based protein signaling cascades [35]. This set of genes is negatively associated with FEV1, ratiopre, and WBCDIFF4 and positively associated with WBCDIFF1. Module 3 clearly indicates many extracellular matrix (ECM) structure pathways, which provide structural support and stability to the lung. Changes in the ECM in the airway or parenchymal tissues are now recognized in the pathological profiles of many respiratory diseases, including COPD [36]. This gene module is positively associated with FEV1, ratiopre, and WBCDIFF4 but negatively associated with FVC (and some to WBCDIFF1). The top pathways in module 4 include pathways related to cancer and vasculature development. COPD is a risk factor for lung cancer as they have many shared driving factors and genetic effects [37]. Additionally, COPD is a risk factor for major cancers developing outside of the lung, including bladder cancer and pancreatic cancer [38,39]. Furthermore, angiogenesis (vasculature development) is a shared phenomenon for both cancer and COPD [40], which may indicate the molecular connection between COPD and cancers. Genes in this module are positively correlated with FVC, with some being positively correlated with WBCDIFF1 but negatively correlated with FEV1, ratiopre, and WBCDIFF4.

Application to Breast Cancer
We applied Fisher, minP, AFp, and AFz to a breast cancer transcriptomic dataset with 1981 patients collected from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC, https://www.cbioportal.org/study/summary?id=brca_metabric (accessed on 20 February 2023)). We considered tumor grade (binary), lymph node status (count), overall survival (survival), and tumor size (continuous) as the four phenotypes of interest, with age as the confounding covariate. We performed a pre-processing step that kept the 10,000 genes with the largest coefficients of variation. The purpose of this was to remove housekeeping genes. Patients with missing outcomes or confounders were further removed, and 1708 samples were left. As we considered a mixture of phenotypes of different data types, only Fisher, minP, AFp, and AFz can be applied. Figure S6 shows violin plots of − log 10 (p-values) of the four methods, where the significant genes are determined by Bonferroni correction with a cutoff of 0.05. We found that AFp has the most significant genes (5856), followed by EW (5194) and AFz (4590). Similar to Section 3.2.1, we then focused on the significant genes of the AFp-based method and categorized these genes into gene modules. Following Section 2.6, we calculated a co-membership matrix of 5856 significant genes from AFp and utilized the tight clustering algorithm to cluster genes. A total of 3560 genes were clustered into five gene modules (clusters) (Figure 2a). Figure 2b shows the weight estimation, which is verified by Figures S8 and S9. Specifically, we find up-regulation for all phenotypes in gene module 1 (tumor grade, lymph node status, overall survival, and tumor size) and down-regulation for all phenotypes in gene module 2. Gene modules 3 and 5 are down-regulated for lymph node status, grades, and survival, while gene module 4 is up-regulated for lymph node status and grade. Additionally, we find that tumor grade and lymph node status have relatively strong associations with genes when compared to overall survival and tumor size. Lastly, we also find that grade has highly strong associations with genes in all the five genes modules ( Figure S8) and is always assigned by 1 or −1 in weight estimation with high confidence (Figures 2b and S10).
We next conducted pathway enrichment analysis using Fisher's exact test based on the Gene Ontology (GO), KEGG, and Reactome pathway databases to assess the biological relevance of genes and show the 10 most significant pathways for each module ( Table 5). The top pathways for different modules depict distinct aspects of breast cancer. The top pathways in gene module 1 involve cell cycle processes; the top pathways in gene module 2 are related to metabolic processes; the top pathways in gene module 3 involve transport mechanisms; the top pathways in gene module 4 are closely related to the immune system; and the top pathways in gene module 5 relate to the Golgi apparatus. All five aspects play important roles in cancer [41][42][43][44][45][46].

Application to Brain Aging
We lastly applied Fisher, minP, AFp, and AFz to a brain aging dataset from [47]. The dataset contains 210 samples' transcriptomic profiles in the human prefrontal cortex (Brodmann's area 11, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE71620 (accessed on 20 February 2023)). We considered age (continuous), sex (binary), and recorded pH levels of the tissue sample (continuous) as the three phenotypes of interest, and the postmortem interval between the time of death and the time of tissue sample collection as well as the RNA integrity number for each tissue sample as confounders. Similar to the breast cancer dataset, we performed a pre-processing step that kept the 10,000 genes with the largest coefficients of variation to remove housekeeping genes. Since we considered a mixture of phenotypes of different data types, only Fisher, minP, AFp, and AFz can be applied. Figure S12 shows violin plots of − log 10 (p-values) of the four methods, where the significant genes are determined by the Benjamini-Hochberg correction with a cutoff 0.05. We find that AFp has the most significant genes (385), followed by AFz (303) and minP (272). Similar to Sections 3.2.1 and 3.2.2, we then focused on the significant genes of the AFp-based method and categorized genes into gene modules.
Following Section 2.6, we calculated a co-membership matrix of 385 significant genes from AFp and utilized a tight clustering algorithm to cluster genes. A total of 290 genes were clustered into three gene modules (clusters) ( Figure S11a). Figure S11b shows the weight estimation, which is verified by Figures S14 and S15. Specifically, we find upregulation of age in gene modules 1 and 2 and down-regulation of age in gene module 3. Sex is down-regulated by genes in the second module, and pH seems to be irrelevant for all three modules. Similar to Sections 3.2.1 and 3.2.2, we next performed a pathway enrichment analysis using Fisher's exact test based on the Gene Ontology (GO), KEGG, and Reactome pathway databases to assess the biological relevance of the genes and show the 10 most significant pathways for each module (Table S8). The top pathways in gene module 1 regulate the innate immune system; the top pathways in gene module 2 are related to immune signaling; and the top pathways in gene module 3 involve cell signaling. All three categories of pathways from the gene modules connect to the brain aging process [48][49][50][51]. Table 5. The pathway enrichment analysis of each module by GO, KEGG, and Reactome pathway databases for the breast cancer (METABRIC) dataset. The * sign indicates the p-value is significant under a false discovery rate of 0.05.

Discussion
In this paper, we modified two p-value combination methods, AFp and AFz, to multivariate phenotype-gene association studies. Compared with traditional methods targeting the UIT test between each gene and all phenotypes, AFp and AFz can efficiently combine heterogeneous effects in the input p-values. Both methods facilitate the selection of phenotypes associated with the gene of interest. A bootstrap algorithm and gene cluster approach identified gene modules with distinct phenotype association patterns. Pathway enrichment analysis for identified gene modules elucidated the disease mechanisms underlying multivariate phenotype-gene associations.
From extensive simulations and real application examples, we clearly showed that AFp and AFz have robust and competitive statistical power and Type 1 error control for accommodating an association analysis of heterogeneous phenotypes in complex diseases. Moreover, AFp has better sensitivity to phenotype selection compared with AFz, especially when heterogeneous association effect sizes exist across phenotypes. In conclusion, we recommend the AFp-based method for multivariate phenotype-gene association studies, for the subset identification of phenotypes associated with the gene, and downstream analyses, such as gene categorization and pathway enrichment analysis. Our R package to implement AFp, AFz, and all existing methods is available at https://github.com/hungching-chang/MultiPhenoAssoc, along with all data and source code used in this paper.
One limitation of our proposed method is the relatively heavy computational demand of permutation analysis, which is essential for taking the correlation structure of the phenotypes into consideration. To relieve the computational burden, we utilized the R package "Rfast" [52] to speed up and also optimize our code to put it in an affordable range for general omics applications. In the lung disease application (K = 5, N = 279, and p = 15,966), with 50 times bootstrapping using 50 computing threads, it takes approximately 2 h to implement the AFp-based method. Developing methods for further improving the computation is a future direction.