MEDTEC Students against Coronavirus: Investigating the Role of Hemostatic Genes in the Predisposition to COVID-19 Severity

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the etiologic agent of the coronavirus disease 2019 (COVID-19) pandemic. Besides virus intrinsic characteristics, the host genetic makeup is predicted to account for the extreme clinical heterogeneity of the disease, which is characterized, among other manifestations, by a derangement of hemostasis associated with thromboembolic events. To date, large-scale studies confirmed that genetic predisposition plays a role in COVID-19 severity, pinpointing several susceptibility genes, often characterized by immunologic functions. With these premises, we performed an association study of common variants in 32 hemostatic genes with COVID-19 severity. We investigated 49,845 single-nucleotide polymorphism in a cohort of 332 Italian severe COVID-19 patients and 1668 controls from the general population. The study was conducted engaging a class of students attending the second year of the MEDTEC school (a six-year program, held in collaboration between Humanitas University and the Politecnico of Milan, allowing students to gain an MD in Medicine and a Bachelor’s Degree in Biomedical Engineering). Thanks to their willingness to participate in the fight against the pandemic, we evidenced several suggestive hits (p < 0.001), involving the PROC, MTHFR, MTR, ADAMTS13, and THBS2 genes (top signal in PROC: chr2:127192625:G:A, OR = 2.23, 95%CI = 1.50–3.34, p = 8.77 × 10−5). The top signals in PROC, MTHFR, MTR, ADAMTS13 were instrumental for the construction of a polygenic risk score, whose distribution was significantly different between cases and controls (p = 1.62 × 10−8 for difference in median levels). Finally, a meta-analysis performed using data from the Regeneron database confirmed the contribution of the MTHFR variant chr1:11753033:G:A to the predisposition to severe COVID-19 (pooled OR = 1.21, 95%CI = 1.09–1.33, p = 4.34 × 10−14 in the weighted analysis).


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the etiologic agent of the coronavirus disease 2019 (COVID- 19) pandemic. Since late 2019, the virus has infected more than 236 million people and claimed more than 4.8 million lives worldwide (October 8, 2021) [1]. Clinical manifestations are extremely heterogeneous, ranging from the absence of symptoms (in the vast majority of infected individuals) to pneumonia with hypoxemia in more severe patients, acute respiratory distress syndrome, sepsis, and multiorgan failure in critical cases [2]. The severe forms of the disease have been consistently associated with age, comorbidities, and male sex [3,4].
Besides intrinsic characteristics of the virus, the genetic makeup of the host likely accounts for the observed clinical heterogeneity of COVID-19 [5]. In this frame, several genome-wide association studies (GWAS) and meta-analyses have shown that genetic predisposition plays a role in COVID-19 severity and susceptibility [6][7][8][9][10][11]. The first reported genome-wide association signals were at the 3p21.31 and 9q34.2 loci, respectively pointing to a cluster of genes (coding for proteins regulating bronchial physiology, viral entry, and immune response) and to the ABO blood group gene. These associations have been repeatedly replicated in subsequent studies, which also evidenced the existence of at least 13 additional genome-wide significant associations with SARS-CoV-2 infection or COVID-19 severity [7][8][9][10][11]. The contribution of rare variants to COVID-19 just started to be elucidated: exome/genome sequencing in 659 patients with life-threatening COVID-19 pneumonia and 534 positive asymptomatics demonstrated that 3.5% of patients had genetic defects at 8 of the 13 loci involved in inborn errors of type I interferon (IFN) immunity [12]. These results were corroborated by the observation that phenocopies of inborn errors of type I IFN immunity account for severe COVID-19 pneumonia in 2.6% of women and 12.5% of men [13]. On the other hand, Povysil and colleagues [14] performed a large, multi-country sequencing study on patients with mild and severe COVID-19 and observed no enrichment of loss-of-function variants in genes in the type I IFN pathway. Hence, notwithstanding this growing body of information, the genetic host factors underpinning the acute responses in severe COVID-19 patients are still largely to be disclosed.
COVID-19 pathogenesis has been associated with a derangement of hemostasis and with thromboembolic events. Specifically, in the early phases of the disease course, increased D-dimer levels, which are indicators of fibrinolysis and coagulopathy, are a predictive biomarker of both severe disease and lethality in COVID-19 patients [15]. Low platelet count (an indicator of abnormally increased activation of the procoagulant pathway) has been registered in 20% of patients who died in hospital compared to 1% of those who recovered [15]. It also seems that systemic activity of coagulation factors V (FV), VIII (FVIII), and X (FX) are abnormally increased in severe COVID-19 patients [16]. More in general, COVID-19 patients face an elevated risk of pulmonary embolism and venous, arterial, and microvascular thrombosis [17][18][19][20]. The molecular mechanisms linking coronavirus infection and deregulation of hemostasis are not yet well understood, but plausible hypotheses have been suggested, especially considering the connection between hemostasis and inflammation. Indeed, acute lung injury deriving from viral cytopathic effects, the activation of the complement cascade, and the COVID-19-associated cytokine storm have all been proposed to activate the coagulation cascade [21,22].
With these premises, in this study we analyzed the association of common variants in hemostatic genes with COVID-19 severity. Considering the tremendous impact that the pandemic had not only on patients but also on the entire society and the consequent pervasive desire to participate, at any level, in the fight against the disease, we decided to engage our students attending the second year of the medical school in this research project. In particular, a class of students following the MEDTEC school (an integrated six-year training program that allows students to obtain an MD in Medicine as well as a Bachelor's Degree in Biomedical Engineering) were asked to participate to the analyses in the frame of the practical activities of the course in "Molecular and Computational Biology and Medical Genetics".

Patient Cohorts for Genetic Analyses
For association analyses, we investigated: (i) 347 patients with severe COVID-19, which was defined as hospitalization with respiratory failure and a confirmed SARS-CoV-2 viral RNA PCR test from nasopharyngeal swabs. Patients were recruited from intensive care units and general wards at two hospitals in the Milan area, i.e., the Humanitas Clinical and Research Center, IRCCS, in Rozzano (140 patients); and the San Gerardo Hospital, in Monza (192 patients); (ii) 1696 controls from the general Italian population with unknown COVID-19 status.
Details on DNA extraction, array genotyping and quality checks are reported elsewhere [6,23].

Imputation
Genetic coverage was increased in the cohort by performing single-nucleotide polymorphism (SNP) imputation on the genome build GRCh38 by using the Michigan Imputation Server [24] and haplotypes generated by the Trans-Omics for Precision Medicine (TOPMed) program (freeze 5) [25]. In the imputation step, we applied the server options to filter by an imputation of R 2 > 0.3 and to perform a QC frequency check against the TopMed panel. In the post-imputation stage, we only retained those SNPs with R 2 ≥ 0.6 and minor allele frequency (MAF) ≥ 1%. Next, we accurately checked cases and controls for solving within-Italian relationships and for excluding the possible existence of population stratification across and within batches: to do this, we performed principal component analysis (PCA), using a LD-pruned subset of SNPs across the entire genome and the PLINK v.1.9 package [26]. The final set of variants was composed of 8.6 million SNPs shared across cases and controls.

SNP Selection
A total of 32 hemostatic genes were selected from the literature (Table 1), based on previous evidence of implication in coronary artery disease/myocardial infarction and/or venous thromboembolism susceptibility. For each gene, we considered a genomic region comprising 250 kb upstream and 250 kb downstream the transcriptional unit and extracted from the dataset all the SNPs by using PLINK v.1.9. The final set of variants to be submitted to association analyses comprised 49,845 SNPs. Genomic coordinates are based on the genome build GRCh38. FGA, FGB, and FGG, coding for the three subunits of the fibrinogen molecule, are present in cluster on chromosome 4; in this case, we considered for the analysis those SNPs mapping from 250 kb upstream to 250 kb downstream the cluster region. Chr = Chromosome.

Statistical Analysis
Case-control allele-dose association tests were performed using the PLINK v.1.9 software. Analyses without the inclusion of covariates in the model (uncorrected) were performed using the -assoc and -ci 0.95 commands. Analyses including the covariates in the model (corrected) made use of a logistic-regression framework for dosage data with the commands -logistic, -covar and -ci 0.95. Age, sex, age*age, sex*age, and the first 10 principal components from PCA were comprised as covariates. For SNPs mapping on the X chromosome, we assumed uniform and complete X-inactivation in females and a similar effect size between males and females (i.e., females are considered to have 0, 1, or 2 copies of an allele, whereas males are considered to have 0 or 2 copies of the same allele). For X chromosome analysis only, we analyzed each sex separately (cases vs. controls). Female-only and male-only p values were then combined using the Stouffer's method [27], which accounts for potential differential effect size and direction between males and females, and also weights the two test statistics (by using the square-root of the male/female sample size). This analysis was performed with XWAS v.3.0 using the commands -xwas, -strat-sex, -stouffers, -logistic, -xchr-model 2, and -ci 0.95 [28].
All analyses were conducted referring to the minor allele. p-values were not corrected for multiple testing unless otherwise specified and are reported together with odds ratios (OR) and their 95% confidence intervals (CI). We considered loci with p < 0.001 as suggestive of association, following the guidelines of Lander and Kruglyak for large-scan association studies [29].

Polygenic Risk Score (PRS)
The PRS was calculated following the classic "clumping + thresholding" (C + T) method, where SNP clumping and a GWAS p-value thresholding are performed to control for linkage disequilibrium (LD) and adjust GWAS estimated effect sizes, respectively [30]. The sum of risk alleles of an individual was calculated using the risk allele effect size estimates from an independent GWAS study on the same phenotype. The summary statistics from the COVID-19 Host Genetic Initiative (HGI) [31] GWAS meta-analyses Release 5 dataset were used as reference data to obtain the effect size estimates. The selected dataset (COVID19_HGI_A2_ALL_eur_leave_ukbb_23andme_20210107.txt) contains the results relative to the "Very severe respiratory confirmed covid vs. population" phenotype.
Based on the results obtained in our association analysis, we calculated the PRS using the top variants for each candidate gene filtered with a threshold of association of p < 0.001 (corrected values), for a total of 5 variants. Through this step the effect size estimate of all the excluded variants is zero, while the effect size estimate of the retained variants will not be shrunken. The independence of the variants was confirmed through clumping using PLINK 1.9. The PRS was then calculated using PLINK 1.9 -score command and -sum option, assuming an additive model of independent variants. All the combinations of the 5 selected variants were tested: the PRS explaining the highest phenotypic variance was identified by using the R program [32], by testing the association between the PRS and the severe COVID phenotype in our dataset using a logistic regression. Covariates and confounders were included in the model.

Meta-Analysis
For the meta-analysis step, we retrieved association data deposited in the Regeneron-Genetic Center database [33] from the GHS study (Geisinger Health System; data were available for 869 cases and 112,862 controls of European ancestry). Pooled ORs and CIs were calculated both using the Mantel-Haenszel model [34] and the Fisher weighted method [35,36], in this last case to take into account the imbalance in the number of cases and controls characterizing the two datasets.

Role of MEDTEC Students
MEDTEC students were in charge of extracting SNPs from the imputed dataset and of performing single-SNP association analyses (uncorrected and corrected). Each student was assigned one gene, but they worked in small collaborative groups, double checking the respective analyses.

Hemostatic Gene Variants and the Predisposition to COVID-19 Severity
To explore the possible role of variants mapping within or in proximity (±250 kb) of 32 hemostatic genes in the predisposition to COVID-19 severity, we performed a single-SNP association analysis between 49,845 genotyped/imputed SNPs and severe COVID-19 with respiratory failure in an Italian cohort of 332 cases and 1668 controls from the general population (post quality-checked cohort). Table 2 lists the best signals found for each analyzed region (having at least a p < 0.05), whereas Figure 1 shows the Manhattan plots visually summarizing the single-gene association analysis. In detail, SNP association analysis revealed several suggestive hits (p < 0.001), with the top signal in the PROC region (chr2:127192625:G:A, OR = 2.23, 95%CI = 1.50-3.34, p = 8.77 × 10 −5 in the corrected analysis) and additional suggestive associations involving MTHFR, MTR, ADAMTS13, and THBS2 (Table 2; Figure 1). Table 3 shows association results for genetic variants known to have a functional impact on the corresponding gene product and to be associated with venous/arterial thrombosis, i.e.,: (i) the Leiden mutation in the F5 gene (rs6025; p.Arg506Gln); (ii) the variant located in the 3 untranslated region of the prothrombin gene (known as G20210A, rs1799963); and (iii) the missense variant p.Val34Leu in the F13A1 gene (rs5985) [37]. We did find only an unexpected weak protective signal with the prothrombin variant (OR = 0.23, 95%CI = 0.070-0.75, p = 0.015; Table 3).

Set up of a Poligenic Risk Score Based on Hemostatic Gene Variants
Though above the conservative threshold for multiple testing (Bonferroni threshold of significance: p = 1 × 10 −6 ), single-gene association results encouraged us to evaluate the joint effect of the 5 top variants (p < 0.001) mapping in the PROC, MTHFR, MTR, ADAMTS13, and THBS2 regions by calculating a weighted PRS, based on risk allele effect size estimates taken from the COVID-19 HGI dataset (Release 5) [31]. The most significant combination of SNPs was the one ignoring the chr6:169195156:A:T polymorphism (located in THBS2), which was hence excluded from further calculations. Figure 2a shows the overall PRS content in severe COVID-19 patients vs. controls: we observed significantly higher PRS values in the case cohort, with median PRS values of 0.017 in cases vs. 0 in controls (p = 1.62 × 10 −8 , Wilcoxon rank sum test). Categorizing the PRS of cases and controls into four equally distributed strata, the resulting normalized distributions showed opposite shapes, with decreasing percentages of control individuals in correspondence of higher PRS values, and vice versa for COVID-19 cases (Figure 2b). In particular, the risk of severe COVID-19 increased across the strata of the PRS, with individuals in the second stratum having an OR of 2.12 (95%CI = 1.35-3.35, p = 8.9 × 10 −4 ), compared with those of the first one (Figure 2c,d). Higher OR values could be observed also for persons belonging to the third and fourth strata (up to OR = 8. 19), but the overall number of individuals populating these strata is low, making less confident OR estimates.   Table 2 are presented by ordering variants according to the significance of association (p values) in the corrected analysis. Age, sex, age × age, sex × age, and the first 10 principal components from the principal component analysis have been used as covariates in the corrected analysis. As for F7 and F10, their top signals coincide (F7 and F10 genomic regions partially overlap). The major/minor alleles are indicated in the name of the variant (chromosome:position:major allele:minor allele). For variants mapping on the X chromosome, males (M) and females (F) were analyzed separately, and the combined p value of association has been calculated using the Stouffer's method (see Materials and Methods). SNPs showing a suggestive association with COVID-19 severity (p < 0.001) are highlighted in grey. Bonferroni threshold of significance: p = 1 × 10 −6 . Chr = chromosome; CI = confidence interval; MAF = minor allele frequency; OR = odds ratio; SNP = single-nucleotide polymorphism.    Table 3 shows association results for genetic variants known to have a functional impact on the corresponding gene product and to be associated with venous/arterial thrombosis, i.e.,: (i) the Leiden mutation in the F5 gene (rs6025; p.Arg506Gln); (ii) the variant located in the 3′ untranslated region of the prothrombin gene (known as G20210A, rs1799963); and (iii) the missense variant p.Val34Leu in the F13A1 gene (rs5985) [37]. We did find only an unexpected weak protective signal with the prothrombin variant (OR = 0.23, 95%CI = 0.070-0.75, p = 0.015; Table 3).  F5, F2, and F13A1.  5 genes (b). The horizontal lines represent the suggestive p = 0.001 (blue) and the Bonferroni-corrected p = 1.02 × 10 −6 (red) significance levels. Genomic regions belonging to the same chromosome are all represented with the same color (either gray or black). Plots were produced using R.

Meta-Analysis Confirms the Involvement of MTHFR in Severe COVID-19 Predisposition
Association data for the four leading SNPs mapping in the PROC, MTHFR, MTR, and ADAMTS13 genes were retrieved from the Regeneron-Genetic Center database. Data referred to a number of genotyped/imputed individuals varying from ≈430 thousands up to ≈898 thousands, depending on the analyzed variant (Table 4). In the meta-analysis, we confirmed the contribution of the MTHFR variant chr1:11753033:G:A to the predisposition to severe COVID-19, with a pooled OR of 1.21 (95%CI = 1.09-1.33) and the notable p = 4.34 × 10 −14 in the weighted analysis ( as the total number of cases and controls populating that stratum). All plots were produced using R.

Meta-Analysis Confirms the Involvement of MTHFR in Severe COVID-19 Predisposition
Association data for the four leading SNPs mapping in the PROC, MTHFR, MTR, and ADAMTS13 genes were retrieved from the Regeneron-Genetic Center database. Data referred to a number of genotyped/imputed individuals varying from ≈430 thousands up to ≈898 thousands, depending on the analyzed variant (Table 4). In the meta-analysis, we confirmed the contribution of the MTHFR variant chr1:11753033:G:A to the predisposition In the boxplot, boxes define the interquartile range; thick lines refer to the median. Samples were divided into 4 strata (dividing the entire range of PRS values into 4 evenly spaced intervals). Dotted grey lines represent boundaries between each PRS stratum. The p value was calculated using the Wilcoxon rank sum test. (b) Proportion of cases and controls in each PRS stratum (setting as 100% the overall amount of individuals for each stratum). (c) Risk of severe COVID-19, expressed as OR ± 95%CI by strata. ORs were calculated comparing the lowest stratum as reference against all the others. p-values from Chi-square test are reported for 2nd and 3rd PRS strata (not indicated for highest stratum due to lower number of individuals). (d) Percentage of case samples within each PRS stratum (with 100% as the total number of cases and controls populating that stratum). All plots were produced using R.

Discussion
In this work, we aimed at evaluating the impact of genetic variation in genes of the hemostatic pathway on the predisposition to severe COVID-19 by analyzing almost 50 thousand common variants distributed across 32 genes. This was made possible also thanks to the collaboration and the great enthusiasm of the group of students of the 2nd year of the MEDTEC program (Humanitas University and Politecnico di Milano), who were eager to give their contribution to the fight against the pandemic, in any possible way. Each one of them was hence made responsible for the analysis of one gene and for double-checking the analyses of their peers.
All genes (with the only exception of F8) were nominally associated (p < 0.05) with severe COVID-19, even after classic covariate correction. The most convincing evidence for association with severe COVID-19 was found for five loci: two folate metabolic genes (MTHFR and MTR), two hemostasis inhibitor genes (PROC and ADAMTS13), and a thrombospondin gene (THBS2). None of these signals, though suggestive, survived the Bonferroni threshold for multiple testing. In this frame, we have to underline that such signals emerged notwithstanding the fact that 4 out of our 5 top variants are characterized by a low frequency in the Italian general population (MAF < 4%), and that simple regression models are underpowered for such kind of variants. In addition, the Bonferroni threshold is based on independence of performed tests, but this assumption is not respected in our analyses, given the correlation between tests for most SNPs due to the presence of LD. A straightforward method to take into account the multiple testing issue in large-scale studies is to rely on the haplotype structure of the genome [38,39]. Just as an example, the 550-kb-long region containing the fibrinogen cluster comprised a total of 1606 SNPs (corresponding to a multiple-testing threshold of 3.11 × 10 −5 ); this region is indeed characterized by the presence of only three huge -and strong-LD blocks (as it can be observed by using the locus browser page of the GTEx portal) [40], making the calculated Bonferroni threshold definitively too conservative.
In any case, our best associations were corroborated from one hand by the PRS, and from the other by the meta-analysis results. In particular, the PRS distribution, built using the risk allele effect size estimates of the HGI GWAS meta-analyses Release 5 dataset, was significantly different between cases and controls, with subjects in the second stratum having a 2.12-fold increased risk for severe COVID-19 compared to those in the lowest stratum. Concerning the meta-analysis, performed by exploiting the high statistical power available through the Regeneron data, it was possible to specifically highlight a strong signal for the MTHFR gene, in correspondence of the chr1:11753033:G:A variant (also known as rs17875978, OR = 1.21, 95%CI = 1.09-1.33, p = 4.34 × 10 −14 ). The functional impact of the rs17875978 variant is unknown, however, some speculation can be put forward. This variant is located 32 kb downstream from the 3 UTR of the MTHFR gene and it tags a region of LD (chr1:11692083-11764076) including a series of whole-blood specific expression quantitative trait loci (eQTLs) [40]. In addition, this region is characterized by the presence of two distal cis enhancer regions (namely, EH38E1319076-EH38E1319075) potentially bound by transcription factors ZNF384, ZNF460, ZNF135, and GFI1B and targeting MTHFR. Overall, these observations point to this region as potentially implicated in the regulation of the MTHFR gene. More interestingly, a phenome-wide association study (PheWAS) using the broad list of phenotypes from the NHGRI GWAS Catalog [41] while revealing no known trait to be significantly associated with the rs17875978 lead SNP, highlighted a total of nine genome-wide significant (p < 5.0 × 10 −8 ) associations with the MTHFR gene [42][43][44][45][46]. These comprised phenotypes potentially related to COVID-19 pathophysiology, such as high blood pressure, agents acting on the renin-angiotensin system, and mean corpuscular hemoglobin. Indeed, the MTHFR gene codes for the 5,10-methylenetetrahydrofolate reductase involved in folate metabolism, essential for the synthesis of nucleotides and single-carbon groups. The MTHFR enzyme converts 5,10-methylenetetrahydrofolate to 5-methyltetrahydrofolate, which produces methyl donors to convert homocysteine to me-thionine. It has been recently shown that SARS-CoV-2 remodels both the host folate and the one-carbon metabolism at the post-transcriptional level to meet the demand for viral subgenomic RNA replication, bypassing viral shutoff of host translation [47]. Since folate is depleted in SARS-CoV-2-infected cells [47], homocysteine levels should increase and eventually lead to hyper-homocysteinemia, a known risk factor for a variety of complex disorders including cardiovascular and neurological diseases [48], possibly contributing to a severe course of COVID-19 [49]. Interestingly, MTHFR variants and homocysteine levels have been suggested as modulators of the risk of COVID-19 incidence and severity [50,51]. Moreover, elevated homocysteine levels may in turn increase tissue factor activity [52,53], thus promoting a prothrombotic state.
In addition to increased levels of pro-coagulant factors, thrombotic complications reported in COVID-19 patients may also arise for impaired endogenous anticoagulants, including Protein C and ADAMTS13. Decreased levels of protein C, a main inhibitor of coagulation, were in fact associated with disease worsening and mortality [54,55], though conflicting data were also reported in smaller patient cohorts [56][57][58]. In-vitro evidence suggested that synthesis and secretion of ADAMTS13 are inhibited by inflammatory cytokines released during systemic inflammation [59]. This might also stimulate the release and inhibit the cleavage of the von Willebrand Factor (vWF) [60], the substrate of the ADAMTS13 protease and the key protein in primary hemostasis (platelet aggregation). Several authors reported hemostatic alterations in vWF/ADAMTS13 axis that were strongly associated with disease severity in COVID-19 patients, including increased levels of vWF [61][62][63][64] accompanied by low normal or mildly decreased ADAMTS13 activity [65][66][67][68][69][70]. Based on these observations, the vWF/ADAMTS13 imbalance has been proposed to be causally related both to the prothrombotic tendency and to the microvascular pulmonary thrombosis, which subtend to a form of thrombotic microangiopathy (TMA) observed in severe COVID-19 patients. The severe deficiency of ADAMTS13 has also gained attention due to the development of a rare TMA, called immune-mediated thrombotic thrombocytopenic purpura [71,72], caused by the production of anti-ADAMTS13 antibodies, in a limited number of subjects after COVID-19 vaccination. Interestingly, vWF significantly increases with age and in non-0 blood type individuals [73], both risk factors for COVID-19.

Conclusions
In conclusion, though several mechanisms can explain the relationship between viral infection and thrombosis, our study suggests a combined effect of some risk variants mapping in the proximity of genes specifically involved in the regulation of hemostasis on the etiopathology of COVID-19.  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The dataset for cases was submitted to the European Bioinformatics Institute (www.ebi.ac.uk/gwas (accessed on 1 May 2021)) under accession numbers GCST90000255 and GCST90000256, whereas the dataset for controls is deposited in the Genotypes and Phenotypes database (dbGaP; https://www.ncbi.nlm.nih.gov/gap/ (accessed on 1 May 2021)), under the phs000294.v1.p1 accession code.

Conflicts of Interest:
Pietro Invernizzi has been funded by INTERCEPT research grants and ABBVIE research grants. All other authors declare no conflict of interest.