Genotype-Based Gene Expression in Colon Tissue—Prediction Accuracy and Relationship with the Prognosis of Colorectal Cancer Patients

Colorectal cancer (CRC) survival has environmental and inherited components. The expression of specific genes can be inferred from individual genotypes through so-called expression quantitative trait loci. In this study, we used the PrediXcan method to predict gene expression in normal colon tissue using individual genotype data from 91 CRC patients and examined the correlation ρ between predicted and measured gene expression levels. Out of 5434 predicted genes, 58% showed a negative ρ value and only 16% presented a ρ higher than 0.10. We subsequently investigated the association between genotype-based gene expression in colon tissue for genes with ρ > 0.10 and survival of 4436 CRC patients. We identified an inverse association between the predicted expression of ARID3B and CRC-specific survival for patients with a body mass index greater than or equal to 30 kg/m2 (HR (hazard ratio) = 0.66 for an expression higher vs. lower than the median, p = 0.005). This association was validated using genotype and clinical data from the UK Biobank (HR = 0.74, p = 0.04). In addition to the identification of ARID3B expression in normal colon tissue as a candidate prognostic biomarker for obese CRC patients, our study illustrates the challenges of genotype-based prediction of gene expression, and the advantage of reassessing the prediction accuracy in a subset of the study population using measured gene expression data.

Introduction
Colorectal cancer (CRC) is a leading cause of cancer death worldwide [1,2]. Modifiable factors of colorectal cancer patients' survival include smoking, alcohol consumption, aspirin use, and physical activity, while the effect of obesity is still controversial [3]. In addition, several studies have identified genetic polymorphisms associated with colorectal cancer prognosis [4][5][6][7][8]. The effect of prognostic genetic variants is thought to be, to a large extent, of a regulatory nature, leading to a modulation of the expression of target genes. Single nucleotide polymorphisms (SNPs) that modulate gene expression are called expression quantitative trait loci (eQTLs) and may act in cis (modulating the expression of a nearby gene) or in trans (modulating the expression of a distant gene) [9]. In recent years, immense efforts have been undertaken to map tissue-specific regulatory variants of the human genome [10,11], resulting in a large variety of tools and databases that facilitate the functional characterization of polymorphisms and their proxies identified in genetic association studies [12,13].
The information contained in such databases is the basis of PrediXcan, a method that enables the prediction of tissue-specific gene expression based on individual genotype data [14]. PrediXcan estimates the fraction of genetically determined gene expression levels and performs association analyses between predicted gene expression profiles and a phenotype of interest. This approach potentially accelerates the identification of phenotype-shaping genes. Genome-wide association studies have identified thousands of loci associated with complex traits [15]. The use of PrediXcan is increasingly common in genetic association studies [16][17][18][19][20][21][22][23][24] and recently contributed to the identification of genes associated with lipid levels and schizophrenia [19,20], cutaneous squamous cell carcinoma [23], lung cancer [24], and colorectal cancer [21]. For example, TRIM4 and PYGL, both related to cellular metabolic programming, were associated with CRC risk [21]. However, studies have also reported that the prediction accuracy may be impaired due to, for example, population stratification [25,26]. A limitation of previous studies is that they fully relied on predicted gene expression without consideration of potential differences in prediction accuracy among human populations. Furthermore, to our knowledge, no previous study has used PrediXcan to investigate colorectal cancer prognosis. We thus measured global gene expression profiles in healthy colon mucosa of 91 colorectal cancer patients from the ColoCare-"Darmkrebs: Chancen der Verhuetung durch Screening" (DACHS) study and subsequently inferred gene expression profiles based on individual genotype data using PrediXcan. 
We calculated the Spearman correlation ρ between the measured and the genetically predicted gene expression levels as a measure of prediction accuracy, and further investigated the association between the genetically-determined gene expression and survival of 4436 colorectal cancer patients for 863 well-predicted genes (ρ > 0.10).

Correlation between Measured and Genetically Predicted Gene Expression
We first examined the correlation between measured and genetically predicted gene expression levels in a subset of 91 colorectal cancer patients, for whom genome-wide genotype and gene expression data of healthy colorectal mucosa were available (Figure 1). The characteristics of this subset of the total study population are presented in Table S1.
Out of 159,506 SNPs that PrediXcan uses to predict gene expression in colon transverse tissue, 158,115 SNPs with a genotype imputation score higher than 0.99 were available in our study of 91 participants and were used for the prediction of gene expression in normal colon tissue. This translated into the estimation of gene expression levels of 6304 genes, while measured gene expression data in normal tissue was available for 5434 genes. Figure 2a shows the mean measured (x-axis) versus the mean predicted (y-axis) expression values for the 5434 investigated genes.
Int. J. Mol. Sci. 2020, 21
The correlation between measured and genetically predicted gene expression among the 91 investigated individuals was negative for 58% of the genes (displayed in red) and between 0 and 0.10 for 26% of the genes (displayed in black), and only 16% (863 genes) presented a correlation higher than 0.10 (displayed in green, listed in Table S2).
PrediXcan has been previously applied to identify novel colorectal cancer risk loci [21,22]. Since the preceding studies included a part of the study population presented here, we investigated the accuracy of the predicted gene expression levels for the previously identified genes. These previous studies reported that the expression of TRIM4 and PYGL in colon transverse tissue was associated with colorectal cancer risk [21], while PTPN2 expression in colon transverse tissue modified the association of diabetes with colorectal cancer risk [22]. The expression of PTPN2 was not measured in our study of 91 participants and could thus not be further investigated. However, we examined the correlation between measured and predicted gene expression for TRIM4 and PYGL (Figure 2b,c) and observed a correlation of ρ = 0.19 for TRIM4 and a negative correlation of ρ = −0.56 for PYGL. The SNPs as well as the corresponding regression coefficients used to predict the gene expression of TRIM4 and PYGL are provided in Table S3.

Association of Genetically Predicted Gene Expression and Colorectal Cancer Patients' Survival
We then investigated the association between the genotype-based gene expression for genes with a good prediction accuracy (863 genes with a correlation higher than ρ = 0.10) and survival of 4436 colorectal cancer patients. Characteristics of the study population for the investigated endpoints overall survival (OS) and disease-specific survival (DSS) are presented in Table 1.
During a median follow-up of 6.94 years, 1790 patients died, 1053 of them of colorectal cancer. Patients who died from any cause (overall survival) were older, were diagnosed at a higher stage of the disease, were more likely to also be affected by diabetes, and were less likely to consume alcohol or to be overweight or obese.
The genetically predicted expression of 36 genes was associated with overall survival and of 48 genes with disease-specific survival (Table S4). After adjustment for multiple testing, the smallest probability value was found for the association between the genetically predicted expression of MAP1B and disease-specific survival (raw p = 0.0002, multiplicity-adjusted p = 0.15).
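The multiplicity adjustment behind these values can be illustrated with a minimal sketch of the Holm step-down procedure, written here in plain Python as an illustration rather than as the code used for the analysis (the function name `holm_adjust` is hypothetical). With m tests, the smallest raw p-value is multiplied by m, the second smallest by m − 1, and so on, while keeping the adjusted values monotone. With 863 tested genes, even a raw p-value of the order of 10^-4 can therefore remain above 0.05 after adjustment.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below
        # the adjusted value of a smaller raw p-value.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03]  # toy values, not study results
    print(holm_adjust(raw))
```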
We further performed stratified analyses. Since body mass index (BMI) was the only variable associated with overall and disease-specific survival that was also available in the UK Biobank, we classified colorectal cancer patients into four BMI groups: underweight (BMI < 18.5 kg/m2), normal weight (BMI 18.5-24.9 kg/m2), overweight (BMI 25-29.9 kg/m2), and obese (BMI ≥ 30 kg/m2), and investigated the association between genetically determined gene expression and survival of colorectal cancer patients in each BMI category. The expression of more than thirty genes was associated with overall or disease-specific survival of colorectal cancer patients in each BMI group (Table S4). After adjustment for multiple testing, the genetically predicted expression of ARID3B in obese patients showed the strongest association with disease-specific survival (raw p = 0.00008, multiplicity-adjusted p = 0.07). Figure 3 shows the volcano plots of the survival analyses of 852 obese CRC patients in the DACHS study (Figure 3a: overall survival, Figure 3b: disease-specific survival) with the blue dots indicating the results for the gene ARID3B. The correlation between the measured and the genotype-based ARID3B expression was ρ = 0.11.
We were further able to validate this association in an independent dataset of 1115 colorectal cancer patients with a BMI higher than or equal to 30 kg/m2 in the UK Biobank (p = 0.02). The characteristics of the total study population of the validation dataset are presented in Table S5 (median follow-up of 7.02 years; 1035 deaths, 669 deaths due to CRC). Figure 3c,d depicts the disease-specific survival of obese colorectal cancer patients according to predicted expression of ARID3B in the discovery (Figure 3c, HR (hazard ratio) = 0.66; p = 0.005 for an expression higher vs. lower than the median) and validation datasets (Figure 3d, HR = 0.74; p = 0.04). In both the identification and the validation datasets, only obese CRC patients showed an association between genetically predicted ARID3B expression and disease-specific survival (Figure S1).

Discussion
In the present study, we used PrediXcan to infer gene expression in normal colon tissue from 91 colorectal cancer patients and examined the prediction accuracy by comparing the predicted and the measured gene expression levels. We then investigated the association between gene expression for well-predicted genes (ρ > 0.10) and survival of 4436 colorectal cancer patients from the DACHS study. We observed that out of 5434 genes for which measured as well as predicted gene expression data were available, 58% showed a negative correlation, while only 16% presented a correlation higher than 0.10 and were thus taken forward to perform association analyses with the survival of colorectal cancer patients. The PrediXcan prediction model uses a median number of 23 SNPs to predict gene expression (minimum = 1, maximum = 259 SNPs). The expression of PYGL was predicted based on seven SNPs, which should suffice to accurately predict gene expression. The negative correlation between measured and predicted expression of PYGL (ρ = −0.58) in our study is unlikely to be the result of an insufficient number of eQTLs, as previous studies reported that genetically regulated gene expression seems to be associated with a small number of variants rather than with multiple eQTLs [27,28]. Nevertheless, the prediction for the better-predicted gene TRIM4 (ρ = 0.19) was based on 26 genetic variants.
Negative correlations between measured and genetically predicted gene expression using PrediXcan have been reported in previous studies [25,26] and could also result from weak associations between SNPs and the expression levels of the target gene. Finally, PrediXcan prediction models were trained based on local SNPs (cis-eQTLs) within one megabase (Mb) of the start or the end of the gene, and the inclusion of potential trans-eQTLs could improve the prediction of gene expression.
Recent studies reported that gene expression prediction accuracy varies between populations [25,26]. The reference datasets, which were used to train the PrediXcan prediction models, are the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) project. The majority of the subjects in these reference datasets are of European descent [10]. Thus, it is not surprising that PrediXcan predicts gene expression in individuals of European descent more accurately than in individuals of African descent [26]. However, differences in prediction accuracy were also reported among closely related European populations [26]. Our data corroborate this observation in showing that the correlation between measured and genetically predicted gene expression within Europeans varies strongly for some genes. PYGL, which was previously associated with colorectal cancer risk, showed a correlation of 0.51 between measured and predicted gene expression based on data from GTEx [21]. In contrast, in our data we observed a strong negative correlation between measured and predicted gene expression for PYGL (ρ = −0.58), implying that subpopulation differences may also be present in Europeans and need to be considered in genetic prediction tools [26,29]. Alternatively, we propose to measure gene expression in a subset of the total study population to ensure that the correlation between predicted and measured gene expression levels of the investigated genes is greater than a preset threshold (here 0.10) and to filter for well-predicted genes.
After we filtered for well-predicted genes, we identified one gene that was associated with survival of colorectal cancer patients and was subsequently validated in an independent dataset using genotype and clinical data from the UK Biobank. An increased expression of ARID3B was associated with a better disease-specific survival of obese colorectal cancer patients. ARID3B (AT-Rich Interaction Domain 3B) encodes a DNA-binding protein and has been described as a contributor to tumor initiation and progression in cancerous diseases [30]. A recent study has described the role of ARID3B in colorectal tumor growth [31]. ARID3B has been further described as an oncoprotein and is involved in the progression of malignant neuroblastoma, ovarian cancer, and breast cancer [32][33][34][35]. However, to our knowledge no association of ARID3B with overweight or obesity has been reported, and it is unclear why this association was restricted to obese patients and not observed in other BMI groups.
The present study is based on 4436 colorectal cancer cases with detailed clinical, demographic, and genome-wide genotype data, as well as with global gene expression data for a subset of 91 patients. The availability of paired genotype and gene expression data for a subset of patients enabled us to investigate the correlation between measured and genetically predicted gene expression, which is a major strength of this study. We were thus able to filter for well-predicted genes within our population to subsequently perform association analyses. Furthermore, we had access to an independent dataset from the UK Biobank, in which we validated the association between ARID3B expression and disease-specific survival of obese colorectal cancer patients. Although our sample size was fairly large, some of the subgroup analyses were based on small strata and this hampered stratified correlation analyses between measured and predicted gene expression. Furthermore, we did not have access to measured gene expression data from the validation dataset to test for the accuracy of the predicted expression of ARID3B.
In conclusion, this study illustrates the challenges of gene expression prediction in normal tissue based on individual genotype data and underlines the importance of assessing prediction accuracy through measuring gene expression in a subset of the investigated study population. Finally, we identified ARID3B as a potential survival-modifier in obese colorectal cancer patients.

Study Population
The study population included patients with CRC who participated in a long-term follow-up study of patients of the German population-based case-control DACHS study ("Darmkrebs: Chancen der Verhuetung durch Screening") [36,37]. Patients with a primary, confirmed diagnosis of CRC had been recruited from hospitals of the Rhein-Neckar-Odenwald region since January 2003. Included were patients aged 30 years or older, German speaking, resident in the study region, and mentally and physically able to complete an in-person interview. Baseline standardized questionnaires contained demographic information and information on established or suggested CRC risk factors, as well as possible prognostic factors. Follow-up information on overall and disease-specific survival was collected at three, five, and ten years after diagnosis. Causes of death were verified by death certificates and coded based on ICD-10 classifications. Information on recurrences was collected from general practitioners and specialists as applicable. In addition, clinical data was extracted from patient records. Population controls were randomly selected from lists of residents of the population registries of the cities and counties. [38]. Missing genotypes were imputed using the Haplotype Reference Consortium as reference panel (HRC r1.1 2016).

Gene Expression Measurement
Gene expression profiles of healthy colorectal mucosa tissues from 91 participants of a subsample of the DACHS study (ColoCare-DACHS study) were measured using Illumina HumanHT-12 Expression BeadChips according to the manufacturer's instructions as described previously [39,40]. Raw gene expression data was processed prior to statistical analyses. Missing expression values were imputed using the nearest neighbor averaging method as implemented in the R package "impute". Expression data were adjusted for batch effects using the R package "sva" and subsequently transformed using the variance stabilizing transformation method and normalized using the robust spline normalization method of the R package "lumi".

Gene Expression Prediction
Out of the 4465 CRC patients of the DACHS study with available genome-wide genotype and follow-up data, we used 4436 CRC patients with available information on age, gender, stage at diagnosis and tumor site, and a minimum follow-up of 30 days. DACHS genotype data was combined and quality controlled using the info score of the program qctool (https://www.well.ox.ac.uk/~gav/qctool_v2). PrediXcan transcriptome prediction models were downloaded from the publicly available PredictDB repository (www.predictdb.org). Gene expression profiles in colon normal tissue were estimated based on genome-wide genotype data of DACHS patients using the software PrediXcan (https://github.com/hakyimlab/PrediXcan) [14]. PrediXcan provides prediction models trained by elastic net models and using reference datasets from the Genotype-Tissue Expression Project (GTEx), where the majority of subjects are of European descent. The reference dataset for normal colon tissue contained 169 colon transverse samples resulting in 159,506 variants with a prediction weight unequal to zero and used for the gene expression prediction of 6304 colon genes in normal tissue [10].
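The prediction step can be pictured as a weighted sum: for each gene, the predicted expression of an individual is the sum of that individual's SNP dosages multiplied by the corresponding model weights. The following Python sketch illustrates this idea only. It is not the PrediXcan implementation, the function name, gene label, and SNP identifiers are hypothetical, and skipping SNPs without genotype data is an assumption made for the example.

```python
def predict_expression(model_weights, dosages):
    """Predict expression per gene as a weighted sum of SNP dosages.

    model_weights: {gene: {snp_id: weight}}, non-zero prediction weights
    dosages:       {snp_id: allele dosage between 0 and 2} for one individual
    """
    predicted = {}
    for gene, snp_weights in model_weights.items():
        total = 0.0
        for snp_id, weight in snp_weights.items():
            # Assumption: SNPs missing from the genotype data are skipped.
            if snp_id in dosages:
                total += weight * dosages[snp_id]
        predicted[gene] = total
    return predicted


if __name__ == "__main__":
    # Hypothetical weights and dosages, for illustration only.
    weights = {"GENE_X": {"snp_a": 0.5, "snp_b": -0.25}}
    person = {"snp_a": 2.0, "snp_b": 1.0}
    print(predict_expression(weights, person))
```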
As the proportion of genetically determined gene expression levels may differ by population, we further tested the accuracy of predicted gene expression levels in the DACHS study. Observed gene expression profiles in healthy colorectal mucosa tissues of 91 CRC patients from the ColoCare-DACHS study were correlated with the predicted gene expression profiles.
Genes with available measured gene expression and with a median absolute deviation (MAD) of measured and predicted gene expression greater than 0 were included. For each gene, a quality metric ρ was computed as the Spearman correlation between the observed and predicted expression. Association tests were restricted to genes with a predictive ρ ≥ 0.10, i.e., ≥10% correlation between predicted and observed expression (cf. [21,22]).
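The filtering described above can be sketched in a few lines of Python. This is an illustrative reimplementation under simplifying assumptions, not the code used in the study: the helper names are hypothetical, expression values are passed as plain lists ordered by individual, and the Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def well_predicted_genes(measured, predicted, threshold=0.10):
    """Return {gene: rho} for genes passing the MAD and rho filters."""
    kept = {}
    for gene in measured:
        if gene not in predicted:
            continue
        obs, pred = measured[gene], predicted[gene]
        if mad(obs) > 0 and mad(pred) > 0:
            rho = spearman(obs, pred)
            if rho >= threshold:
                kept[gene] = rho
    return kept
```

Genes with a constant measured or predicted profile are dropped before the correlation is computed, which also avoids a division by zero in the rank-based formula when all values are tied.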

Statistical Analysis
We tested the association between genetically determined expression levels in healthy colon tissue and the survival of 4436 CRC patients from the DACHS study in 863 genes with a predictive ρ ≥ 0.10. Cox proportional hazards regression was performed to test the predicted gene expression associations with overall survival and disease-specific survival. Survival time was calculated from date of diagnosis until date of death by any cause for overall survival and from date of diagnosis until date of death by CRC for disease-specific survival. Age, sex, stage at diagnosis, and cancer site were included in the model as relevant prognostic factors. Individuals with missing entries were excluded. The proportional hazards assumption was tested according to Grambsch and Therneau for a set of variables including age (<60 years, 60-69 years, 70-79 years, ≥80 years), sex, stage at diagnosis (I, II, III, IV), cancer site (colon, rectum), chemotherapy (yes, no), diabetes (yes, no), BMI (<18.5, 18.5-24.9, 25-29.9, ≥30 kg/m2), regular use of non-steroidal anti-inflammatory drugs (NSAIDs) more than twice per week for at least one year (yes, no), regular smoking (never, former, and current) and alcohol intake (0, 0-5.6, 5.7-13.2, 13.3-28.5, and ≥28.6 g/day). Cox proportional hazards models were fitted first with the predicted gene expression levels as continuous predictor variables and subsequently with the predicted gene expression levels dichotomized at the median, to estimate hazard ratios (HRs) for overall survival and disease-specific survival, and their 95% confidence intervals (CIs). The median follow-up time and the cumulative probability of death were calculated using the Aalen-Johansen estimator. The Bonferroni-Holm correction method was used to adjust for multiple testing.
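To make the dichotomized analysis concrete, the sketch below fits a Cox model with a single binary covariate (expression above vs. at or below the median) by Newton-Raphson on the partial likelihood and returns the hazard ratio. It is a deliberately simplified illustration: it is unadjusted, whereas the models described above included age, sex, stage, and cancer site, it handles tied event times with the Breslow approximation (an assumption of this example), and the function names are hypothetical.

```python
import math
from statistics import median


def dichotomize_at_median(expression):
    """Code predicted expression as 1 (above the median) or 0 (otherwise)."""
    cut = median(expression)
    return [1 if value > cut else 0 for value in expression]


def cox_hazard_ratio(times, events, group, tol=1e-10, max_iter=50):
    """Hazard ratio for group 1 vs. group 0 from a one-covariate Cox model.

    times:  follow-up time per patient
    events: 1 if the patient died at that time, 0 if censored
    group:  0/1 covariate, e.g. from dichotomize_at_median
    """
    beta = 0.0
    for _ in range(max_iter):
        score, info = 0.0, 0.0
        for t_i, d_i, x_i in zip(times, events, group):
            if not d_i:
                continue  # censored patients contribute to risk sets only
            # Risk set: all patients still under observation at time t_i.
            s0 = s1 = 0.0
            for t_j, x_j in zip(times, group):
                if t_j >= t_i:
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
            p = s1 / s0  # weighted share of group 1 in the risk set
            score += x_i - p        # first derivative of the log partial likelihood
            info += p * (1.0 - p)   # negative second derivative (0/1 covariate)
        if info == 0.0:
            raise ValueError("no risk set contains both groups")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return math.exp(beta)


if __name__ == "__main__":
    # Toy data, not study data.
    group = dichotomize_at_median([0.9, 0.1, 0.8, 0.2])
    print(cox_hazard_ratio([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], group))
```

For real analyses, an established implementation that supports covariate adjustment and confidence intervals would be preferable to this sketch.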
The statistical analysis was carried out using SAS version 9.3 (SAS Institute, Cary, NC, USA) and R version 3.1.0 (www.r-project.org).

Validation Set
Validation of associations was performed using the UK Biobank resource. UK Biobank recruited 500,000 people from the UK aged between 40 and 69 years in 2006-2010 [41]. Genotype calling was performed by Affymetrix on the UK BiLEVE Axiom array and the UK Biobank Axiom array (Affymetrix, Santa Clara, CA, USA). Genotypes were imputed using the Haplotype Reference Consortium and UK10K haplotype resources [42].
The validation set included 4241 CRC patients from the UK Biobank resource (Table S5). Statistical analyses from our respective findings were performed as described above. Analyses regarding the registered endpoints overall and disease-specific survival were adjusted for the available variables age and sex. Subgroup analyses for the variable BMI were performed.
Supplementary Materials: Supplementary materials can be found at http://www.mdpi.com/1422-0067/21/21/8150/s1. Table S1: Characteristics of the 91 colorectal cancer patients with measured gene expression data in normal tissue, Table S2: List of 863 well-predicted genes (ρ > 0.10), Table S3: SNPs and their corresponding weights used for the prediction of the genes TRIM4 and PYGL (extracted from the PrediXcan transverse colon prediction model), Table S4a: p-values from the Cox regression for the association between the continuous predicted gene expression and overall survival (OS), Table S4b: p-values from the Cox regression for the association between the continuous predicted gene expression and disease-specific survival (DSS), Table S5: Characteristics of the 4241 colorectal cancer patients from the validation set, Figure S1: Probability of death due to CRC for patients with a low (lower than the median) and high (higher than the median) ARID3B expression based on individual genotypes. The number of CRC patients at risk is shown in the lower part of each panel; (a,b) Aalen-Johansen curves in the identification cohort in patients with (a) a BMI between 18.5-24.9 (b) a BMI between 25-29.9; (c,d) Aalen-Johansen curves in the validation cohort in patients with (c) a BMI between 18.5-24.9 (d) a BMI between 25-29.9. Aalen-Johansen curves of patients with a BMI < 18.5 are not shown due to low sample size.