Regulatory Noncoding and Predicted Pathogenic Coding Variants of CCR5 Predispose to Severe COVID-19

Genome-wide association studies (GWAS) found locus 3p21.31 associated with severe COVID-19. CCR5 resides at the same locus and, given its known biological role in other infection diseases, we investigated if common noncoding and rare coding variants, affecting CCR5, can predispose to severe COVID-19. We combined single nucleotide polymorphisms (SNPs) that met the suggestive significance level (P ≤ 1 × 10−5) at the 3p21.31 locus in public GWAS datasets (6406 COVID-19 hospitalized patients and 902,088 controls) with gene expression data from 208 lung tissues, Hi-C, and Chip-seq data. Through whole exome sequencing (WES), we explored rare coding variants in 147 severe COVID-19 patients. We identified three SNPs (rs9845542, rs12639314, and rs35951367) associated with severe COVID-19 whose risk alleles correlated with low CCR5 expression in lung tissues. The rs35951367 resided in a CTFC binding site that interacts with CCR5 gene in lung tissues and was confirmed to be associated with severe COVID-19 in two independent datasets. We also identified a rare coding variant (rs34418657) associated with the risk of developing severe COVID-19. Our results suggest a biological role of CCR5 in the progression of COVID-19 as common and rare genetic variants can increase the risk of developing severe COVID-19 by affecting the functions of CCR5.


Introduction
On 11 March 2020, World Health Organization (WHO) declared COVID-19 a pandemic. From the first case registered in December 2019 in Wuhan, over 160 million cases were

CCR5 Is Highly Expressed in Lung and Bronchus
Starting from the hypothesis that CCR5 has an important role in lung infection, we have analyzed CCR5 expression in different tissues. CCR5 was shown to be highly expressed in lung in two independent gene expression datasets of normal tissues profiled by GTEx consortium using RNA-Seq ( Figure 1A) and by Roth et al. [19] using microarray ( Figure 1B) technology. We also observed CCR5 as highly expressed in bronchus ( Figure 1B). mechanism of SARS-CoV-2 infection in patients with severe clinically manifestation of disease.

CCR5 is Highly Expressed in Lung and Bronchus
Starting from the hypothesis that CCR5 has an important role in lung infection, we have analyzed CCR5 expression in different tissues. CCR5 was shown to be highly expressed in lung in two independent gene expression datasets of normal tissues profiled by GTEx consortium using RNA-Seq ( Figure 1A) and by Roth et al. [19] using microarray ( Figure 1B) technology. We also observed CCR5 as highly expressed in bronchus ( Figure  1B).
To further investigate the potential role of the selected SNPs in regulating CCR5 e pression, we used Hi-C data from 3DIV database [20] to explore the genomic interactio between the selected SNPs and our gene of interest. We compared Hi-C data of the thr SNPs among all tissues present in the database (Figure 3 A-C, Figure S3). In particular, the frequency of minor allele of rs35951367 varies considerably according to the analyzed population: the minor allele C is more frequent in the European populations (non-Finnish: 16.1%) than other populations of non-European origin such as African, where the frequency is 10.1% or East and South Asian, where the allele frequency drops from 4.9% to 0% ( Figure S2). However, all the minor alleles were associated with increased risk of developing severe COVID-19 (rs9845542 P = 4.2 × 10 −20 , OR = 1.33, 95% CI: 1.25-1.41; rs12639314 P = 7.1 × 10 −9 , OR = 1.23, 95% CI: 1.14-1.31; rs35951367 P = 3.2 × 10 −19 , OR = 1.32, 95% CI: 1.24-1.40) (Supplementary Table S1) and correlated with lower expression levels of CCR5 in lung ( Figure 2B).
To further investigate the potential role of the selected SNPs in regulating CCR5 expression, we used Hi-C data from 3DIV database [20] to explore the genomic interaction between the selected SNPs and our gene of interest. We compared Hi-C data of the three SNPs among all tissues present in the database ( Figure 3A-C, Figure S3). The rs35951367 SNP region showed the highest level of interaction with CCR5 than the regions of the other three SNPs with a Distance Normalized Interaction Frequency (DNFI) of 7.16 in lung ( Figure 3C, Figure S3, Supplementary Table S2). Notably, in lung, the value of interaction frequency of rs35951367-associated region was the highest value observed ( Figure 3, Supplementary Table S2). When we repeated the same analysis using normal lung fibroblast cell line (IMR90), we again observed the strongest interaction between rs35951367-associated region and CCR5 (Supplementary Table S3). Based on our results, we focused our attention on rs35951367, which was likely to be a regulatory SNP of CCR5.
Chip data from the ENCODE catalogue of two lung cell lines (IMR90 and NHLF) showed that rs35951367 falls in a CTCF binding region localized in the intron of XCR1 gene ( Figure 4A,B).   Table S2). Notably, in lung, the value of interaction frequency of rs35951367-associated region was the highest value observed ( Figure 3, Supplementary Table S2). When we repeated the same analysis using normal lung fibroblast cell line (IMR90), we again observed the strongest interaction between rs35951367-associated region and CCR5 (Supplementary Table S3). Based on our results, we focused our attention on rs35951367, which was likely to be a regulatory SNP of CCR5.
Chip data from the ENCODE catalogue of two lung cell lines (IMR90 and NHLF) showed that rs35951367 falls in a CTCF binding region localized in the intron of XCR1 gene ( Figure 4A,B).  The rs35951367 SNP region showed the highest level of interaction with CCR5 than the regions of the other three SNPs with a Distance Normalized Interaction Frequency (DNFI) of 7.16 in lung ( Figure 3C, Figure S3, Supplementary Table S2). Notably, in lung, the value of interaction frequency of rs35951367-associated region was the highest value observed ( Figure 3, Supplementary Table S2). When we repeated the same analysis using normal lung fibroblast cell line (IMR90), we again observed the strongest interaction between rs35951367-associated region and CCR5 (Supplementary Table S3). Based on our results, we focused our attention on rs35951367, which was likely to be a regulatory SNP of CCR5.
Chip data from the ENCODE catalogue of two lung cell lines (IMR90 and NHLF) showed that rs35951367 falls in a CTCF binding region localized in the intron of XCR1 gene ( Figure 4A,B).  This SNP is an eQTL only for CCR5 in lung and not for XCR1 in lung or other tissues (Supplementary Table S4). By in-depth inspection analysis, closely located to rs35951367, we found the rs71327021 SNP that was previously excluded from our initial QTL analysis, as it did not pass the stringent levels of statistical correction. However, we verified that the rs71327021 SNP correlated with CCR5 expression with a nominal level of significance ( Figure S4A). Interestingly, this SNP is in LD with our candidate rs35951367 SNP (r 2 = 0.76) and in addition, it falls in binding sites of CTFC ( Figure 4B). However, analysis of Hi-C data showed the strongest interaction between rs71327021-associated region and CCR5 was observed in spleen tissue and not in lung (Supplementary Table S5, Figure S4B).

Replication Studies
To confirm the association found in GWAS data for rs35951367, we conducted two replication studies. The first replication study was carried out using the genetic data from the 23andMe study [21]. In this cohort, including 1131 hospitalized COVID-19 patients, the minor allele C of rs35951367 was confirmed to be a risk factor for COVID-19 severe (OR = 1.41, P = 3.79 × 10 −6 ); indeed, it was more frequent in cases (20%) than controls (15%). The second replication study was carried out using our Italian cohort of 202 hospitalized COVID-19 patients and 929 controls. Again, we confirmed the minor allele of rs35951367 as a risk factor for the more severe form of disease (OR = 1.307, P = 0.043) ( Table 1). As rs35951367 showed a different allelic frequency in different populations ( Figure S2), we used GWAS data of 2244 critically ill patients with COVID-19 and UK biobank controls from the GenoMICC study [7] to validate the association of rs35951367 with severe COVID-19 phenotype in non-European populations. The association was not confirmed in Africans (P = 0.43, OR = 1.24, 95% CI 0.59-1.89), East Asians (P = 0.75, OR = 1.14, 95% CI: 0.22-2.06), and South Asians (P = 0.18, OR = 1.18, 95% CI: 0.89-1.47).

Coding Variants in CCR5
In parallel with investigation of CCR5 locus in GWAS data, we conducted WES on 147 hospitalized COVID-19 patients of Southern Italy. In this WES dataset, we selected rare missense and INDEL variants (MAF < 0.01%) of CCR5. We found four variants in six patients (Table 2), and the rs34418657 variant was found in three patients (3/147, 2.04%). The minor allele of rs34418657 showed a frequency of 0.03% in European (non-Finnish) population and was predicted to be pathogenic by seven different tools (Supplementary Table S6). Using the DUET tool [22], rs34418657 was predicted to destabilize the protein (∆∆G = −0.893 Kcal/mol), suggesting its potential role as loss of function.

Discussion
The clinical manifestations of COVID-19 are very heterogeneous; indeed, they can range from completely asymptomatic subjects up to patients who are hospitalized and undergo mechanical ventilation [1]. Numerous studies have focused on discovering the causes of this spectrum of phenotypes and on the evaluation of the presence of genetic differences in the predisposition to this pathology [24][25][26].
For example, two recent GWAS studies have demonstrated a genetic association between severe COVID-19 and the blood groups, where group O is a protective factor, and there are common variants at 3p21.31, 12q24. 13, 19p13.2, 19p13.3 loci [6,7]. The 3p21.31 has been the first locus to be associated with the risk of developing aggressive forms of COVID-19 [6] and to be replicated in a different cohort of cases and controls [7]. The severe COVID-19 3p21.31 risk locus harbors several common variants and contains 17 known protein-coding genes including LC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, CCR1, CCR3, CCR2, and CCR5. However, so far, causative variants and genes at this risk locus, influencing COVID-19 pathogenesis, remain to be investigated. Here, in order to study the potential biological role of the 3p21.31 risk locus, we focused our attention on the CCR5 gene based on the motivations described below.
CCR5 is a chemokine receptor and is known in the literature for its role in infection pathogenesis, especially in HIV type 1 (HIV-1), where a deletion of 32 bp, present with an allele frequency of 12% in the northern Europe population, confers resistance at the HIV-1 infection in homozygous [13,[27][28][29]. A study conducted on public genomes from Neanderthals to modern humans identified a total of 262 polymorphisms in the CCR5 gene, with an SNP frequency per individual that ranged from 14 to 24 [30,31]. This high allelic heterogeneity is attributable to its role in the immune system as a result of co-evolution between humans and pathogens and is a peculiarity of only humans and not of other primates [31].
In leukocytes, CCR5 is a key receptor for chemotaxis from the bloodstream to the site of inflammation [32,33]; in particular, CCR5 is important for the rapid recruitment of memory CD8 + T cells to the lung airways during virus infection for limiting virus replication [34]. However, in addition to leukocytes, CCR5 is variably expressed in many tissues [32,[35][36][37][38]. Our in silico analysis shows that it is highly expressed in the lung from healthy individuals, suggesting its key biological role in the pathogenesis of COVID-19. Additional studies to verify the expression of CCR5 in tissues from subjects with severe COVID-19 might further support its role in disease initiation and progression.
Starting from this evidence, we decided to investigate the CCR5 role in new SARS-CoV-2 infection using a dualistic approach: we first investigated the SNPs in noncoding regions, exploiting the freely available GWAS data, and then evaluated the coding variants obtained by WES of our cohort of severe cases.
We hypothesized that common variants at 3p21.31 may act as risk factors of severe COVID-19 by affecting the expression of CCR5. Combining public GWAS data with eQTLs data from the GTEx project, we have demonstrated that the minor allele of the rs35951367 SNP confers risk of severe COVID-19 and correlates with a lower expression of CCR5 in lung. In accord with our results, a preprint paper reports a transcription-wide association study that found CCR5 expression to be regulated by variants at 3p21.31 only in lung tissue [39].
Interestingly, Hi-C data obtained in lung tissue and cell lines showed that the rs35951367 SNP is located in a DNA region interacting with CCR5 and marked by CTCF binding in two lung cell lines (IMR90 and NHLF). CTCF can function as a transcriptional activator, a repressor, or an insulator protein, blocking the communication between enhancers and promoters [40]. It plays a relevant role in contributing to genome organization and gene expression [40,41]. Therefore, our data suggest that the rs35951367 SNP may functionally act by altering topological domains and thus the normal transcriptional program of the CCR5 gene. In vitro experimental studies are needed to confirm these findings and, in addition, to further genetic studies of variants already found to be associated with the different expression of CCR5 [42][43][44].
Very close to rs35951367, we found the rs71327021 SNP, which also falls into the CTCF-binding region in lung cell lines, but it is not an eQTL for CCR5 in the lungs. However, the rs71327021 SNP may be a good candidate to be tested in future genetic and functional studies.
Replication study is the golden standard for validating genome-wide association findings; thus, we used the genetic data from the 23andMe study and typed the rs35951367 SNP by TaqMan assay in cases and controls from Southern Italy; in both cohorts, the genetic association was confirmed. However, the same variant was not found to be associated with severity of COVID-19 illness in non-European subjects such as Africans and Asians, which was likely because of the lower frequency of the variant observed in these populations with respect to Europeans. So, we can speculate that this genetic risk factor may be one of the causes linked to the different rate of mortality observed across the different populations. Future studies will help to better define the effect of genetic variants at the CCR5 locus on the clinical subgroups of COVID-19 disease-for instance, performing association analyses on patients stratified by disease aggressiveness.
Together, our findings provide evidence that common functional SNPs can predispose to clinically severe COVID-19 by long-range interaction with CCR5 and that this gene can play a relevant role in the pathogenesis of SARS-CoV-2 infection.
In accordance with our findings, it was observed that the deficiency of CCR5 in West Nile virus infection, which shares with COVID-19 the multitude of phenotypes ranging from the subclinical to the most severe cases, represents a risk factor to develop severe symptoms [45]. Furthermore, in vivo studies indicate that mice CCR5 −/− , infected with West Nile virus, have an elevated mortality rate and viral loads compared with wild-type mice [46].
In addition, the lung appears to be a particularly sensitive tissue to low CCR5 expression as demonstrated in studies of influenza A virus infection, where CCR5 deficiency is associated with poor outcome and with an increased mortality [18,47]. Notably, null mice of CCR5 show an accelerated macrophage accumulation in lungs with an increased expression of RANTES, MCP-1, and IP-10 [18]. The latter is one of three pro-inflammatory cytokines that correlate with rapid disease progression in hospitalized COVID-19 patients [48]. The above-mentioned scientific evidence is in agreement with our findings according to which a reduced expression of CCR5 in the lungs can promote COVID-19 progression.
Since CCR5 is known in the literature for its numerous polymorphisms, we decided to also investigate if there are rare pathogenic coding variants associated with the severe COVID-19 phenotype. In our cohort of hospitalized patients profiled by WES, we found an association of a rare variant in exon 3 (rs34418657) predicted to destabilize the protein product. This variant has been already described in association study with unexplained recurrent pregnancy loss. This association was explained, as CCR5 is necessary for maternal-fetal tolerance, and rs34418657 was predicted to destabilize the protein [49]. Our observations further suggest that genetic alterations that reduce the CCR5 function predispose to COVID-19 progression and highlight once again the key role of CCR5 in SARS-CoV-2 infection.
Conversely, the CCR5-Delta32 (rs333) common coding variants, known to protect from HIV infection [50], seem to be not associated with a minor risk of developing severe forms of COVID-19, as indicated by both our WES and GWAS results. These data provide evidence that rare and common coding variants of CCR5 can have different and independent risk effects on the onset and progression of infectious diseases.
In conclusion, we demonstrate that common noncoding regulatory variants at the previously identified 3p21.31 risk locus and rare pathogenic coding variants in CCR5 can contribute to severe clinical manifestations of COVID-19 by affecting CCR5 functions. Therefore, these findings suggest that CCR5 is a good candidate to be experimentally studied for unravelling the biological basis of COVID-19 progression and for developing novel therapeutic interventions.

GWAS Data
Summary statistics (P-value, and effect size) were obtained from the GWAS dataset "B2_ALL_eur_leave_23andme", deposited in the fourth round of GWAS meta-analysis of COVID-19 Host Genetics Initiative website (www.COVID19hg.org, accessed on 10 May 2021) [51], available since 20 October 2020. The GWAS dataset includes 6406 hospitalized COVID-19 patients and 902,088 members of the general population with European genetic ancestry and does not include 23andMe cohort. The COVID-19 Host Genetics Initiative was a global initiative to elucidate the role of host genetic factors in the susceptibility and severity of the SARS-CoV-2 virus pandemic.

Replication Study of rs35951367 SNP
To validate the association between rs35951367 and a severe form of COVID-19 disease, we use two independent cohorts of cases and controls.
The first cohort data were obtained from '23andMe COVID-19 Study' [21]. This study includes 1131 COVID-19 patients reported to have a positive test and hospitalization following SARS-CoV-2 infection. Participants answered web questions about symptoms of cold or flu-like illnesses, COVID-19 diagnosis and testing, hospitalization, and severity of illness. Participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent Review Services (E&I Review). Participants were included in the analysis on the basis of consent status as checked at the time that data analyses were initiated. The full GWAS summary statistics for the 23andMe discovery dataset will be made available through 23andMe to qualified researchers under an agreement with 23andMe that protects the privacy of the 23andMe participants. Please visit https://research.23andme.com/collaborate/#dataset-access/ (accessed on 10 May 2021) for more information and to apply to access the data.
The second cohort included 929 healthy controls and 212 hospitalized COVID-19 collected from our research group that was genotyped by TaqMan ® SNP Genotyping Assay for rs34418657 SNP (Applied Biosystems by Thermo Fisher Scientific, Waltham, MA, USA). Twenty DNA samples failed the genotyping. The healthy controls cohort was composed of 39 subjects that resulted to be positive for anti-SARS-CoV2-IgG antibodies but without any symptoms and 890 healthy controls recruited before the COVID-19 pandemic. Hardy-Weinberg equilibrium was evaluated using the goodness-of-fit chi-square test in control and cases subjects (P > 0.05). Clinical characteristics of cases and controls are reported in Supplementary Table S7.
Exomes of 1095 Italian individuals were retrieved from the web database Network for Italian Genomes (NIG) (http://nigdb.cineca.it/index.php, accessed on 10 May 2021) [23] for evaluating the association of rs333 in our cohort of patients.

Whole Exome Sequencing
We collected COVID-19 samples from several hospitals in Southern Italy: Azienda ospedaliera specialistica dei colli, Monaldi-Cotugno; Azienda Ospedaliera di Rilievo Nazionale Antonio Cardarelli, P.O. Ospedale Boscotrecase, Villa dei Fiori S.r.l.-Acerra. The genomic DNA of COVID-19 patients was extracted from peripheral blood using a Maxwell ® RSC Blood DNA Kit (Promega, Madison, WI, USA), and DNA concentration and purity were evaluated using a NanoDrop™ 8000 Spectrophotometer. For library preparation, we measured the DNA concentration of 147 COVID-19 patients with a Qubit ® DNA Assay Kit in Qubit ® 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA), and 1.0 µg of DNA was used for library preparation. Genomic DNA was sonicated to generate 180-250 bp fragments and then end-polished, A-tailed, and ligated with the full-length adapter with further PCR amplification. PCR products were purified using AMPure XP system (Beckman Coulter, Brea, CA, USA) and quantified using the Agilent high sensitivity DNA assay on the Agilent Bioanalyzer 2100 system. The whole exome was captured with Agilent SureSelect Human All Exome V6 (Agilent Tecnologies, Santa Clara, CA, USA), and the sequencing was performed on an Illumina NovaSeq 6000 Systems.

Bioinformatic Analysis of Sequencing Data
The paired end (2 × 150 bp) sequencing returned an average of 42 million raw reads per sample. After removing sequencing artifacts, we retained, on average, the 98.35% of raw reads. The percentage of bases with quality scores above 20 and above 30 (Q20 and Q30) was 97.7% and 93.8%, respectively. The cleaned reads were aligned versus the reference genome (GRCh37/hg19) and mapping BAM files were obtained with BWA-mem (version 0.7.17) and SAMTools (version 1.8). On average, 99.89% of reads were mapped and 20.47% of duplicate reads were removed with Picard (version 2.18.9). We covered 99.63% of the target regions, and the average sequencing depth on target was 149.06x. Most (95.08%) of the target regions were covered with at least 20 reads, which were sufficient for reliable variant calls. SNVs and small insertions and deletions (INDELs) were detected with GATK HaplotypeCaller, and the functional annotation of variants was performed with ANNOVAR. We obtained 189,625 raw SNVs and 32,089 raw INDELs per sample. We excluded off-target variants (e.g., intergenic, intronic, etc.) and single nucleotide polymorphisms (SNPs) with allele frequencies greater than 1% in non-Finnish European populations of the 1000 Genomes Project, ExAC (v3) and GnomAD (v2.1.1) databases. To remove possible false positives, we also discarded variants falling in genomic duplicated regions. Then, the set of exonic variants was filtered to remove synonymous SNVs.

Association Study of the Coding Rare Variant rs34418657
In order to assess the association of the rare variant rs34418657, identified through WES, we genotyped by TaqMan ® SNP Genotyping Assay (Applied Biosystems by Thermo Fisher Scientific, Waltham, MA, USA) 1084 healthy controls and an additional 74 COVID-19 hospitalized patients not analyzed by WES. Fifty controls and one COVID-19 patient failed the genotyping. Healthy controls cohort was composed of 379 people that were shown to be positive for anti-SARS-CoV2-IgG antibodies but without any symptoms and 705 healthy controls recruited before the COVID-19 pandemic. A higher number of healthy controls with respect to common variant rs35951367 was typed in order to gain more statistical power to detect significant genetic differences since the rs34418657 variant is very rare in the general population. Hardy-Weinberg equilibrium was evaluated using the goodnessof-fit chi-square test in control and case subjects (P > 0.05). Clinical characteristics of this large cohort of controls are also reported in Supplementary Table S7. DNA samples from patients were obtained after they signed informed consent and according to the Declaration of Helsinki. Local university ethical committees approved the study.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/ijms22105372/s1. Figure S1: Top CCR5 eQTLs SNPs in lung are not in LD in African and AMR populations. Figure S2: Minor allele frequencies of the selected SNPs in different populations. Figure S3: rs35951367 shows the highest level of interaction with CCR5 in lung. Figure S4: Minor allele of rs71327021 was associated with low expression of CCR5 in lung, but this last one is not tissue, where interaction with CCR5 is stronger. Table S1: eQTL SNPs for CCR5 in lung. Table S2: Results of analysis of Hi-C data of rs35951367 region in lung. Table S3: Results of analysis of Hi-C data for rs35951367 region in lung cell line (IMR90). Table S4: Results of eQTLs analysis of rs35951367 in all tissues. Table S5: Results of the analysis of Hi-C data for the rs71327021 region in different normal tissues. Table S6: Rare coding variants of the CCR5 gene in 147 COVID-19 hospitalized cases. Table S7: Clinical characteristics of COVID-19 hospitalized patients and healthy controls. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The GWAS dataset 'B2_ALL_eur_leave_23andme' is accessible at www.COVID19hg.org (accessed on 10 May 2021). The full GWAS summary statistics for the 23andMe discovery dataset will be made available through 23andMe to qualified researchers under an agreement with 23andMe that protects the privacy of the 23andMe participants at https://research.23 andme.com/collaborate/#dataset-access/ (accessed on 10 May 2021) for more information and to apply to access the data. Exomes of healthy Italian individuals are accessible via the database Network for Italian Genomes (NIG) (http://nigdb.cineca.it/index.php, accessed on 20 March 2021).