From Croatian Roma to 1000 Genomes: The Story of the CYP2D6 Gene Promoter and Enhancer SNPs

The CYP2D6 gene encodes an enzyme responsible for the metabolism of ~20% of clinically prescribed drugs. In this study, 18 SNPs from the enhancer and promoter regions of CYP2D6 in 323 Roma from Croatia were genotyped, to find out whether the demographic history of Roma affected the distribution of the studied SNPs and their linkage disequilibrium (LD) values, with the major SNPs defining the CYP2D6 star alleles. No differences were found between the three Roma groups in allele and genotype frequencies. The distribution of LD values of Roma was compared with LD values of European and Asian populations. Regulatory CYP2D6 SNPs (rs5758550, rs28624811, rs1080985 and rs1080983) showed similar distribution and the highest LDs with rs16947 from the gene-coding region in all populations. In the promoter region, a complete LD between rs1080989 and rs28588594, and between rs1080983 and rs28624811, was found in Croatian Roma and investigated populations from 1000 genomes. A high LD was also found between rs1080985 from the promoter and rs5758550 from the enhancer region. SNP rs28735595 from the gene promoter region had the highest LD, with two gene region SNPs, rs1058164 and rs1135840. To conclude, the Croatian Roma population shows an LD pattern of the CYP2D6 gene region similar to the 1000 Genomes European and Asian populations.


Introduction
The CYP2D6 gene encodes a homonymous drug-metabolizing cytochrome P450 enzyme, responsible for eliminating more than 21% of clinically used drugs [1]. This gene, located on chromosome 22q13.1, is highly polymorphic and its genetic variations greatly contribute to the inter-individual variability of CYP2D6 enzyme activity, which is divided into four categories: poor metabolizer, intermediate metabolizer, normal metabolizer and ultrarapid metabolizer [2][3][4][5]. The Clinical Pharmacogenetics Implementation Consortium (CPIC) offers guidelines for assigning activity scores of CYP2D6 variant alleles, and subsequent translations of diplotypes into phenotypes-this method was proposed and established by Gaedigk and colleagues to standardize genotype-to-phenotype translations [6]. However, recent studies have shown that even in individuals with the same genotype, CYP2D6 enzyme activity can vary up to several times [7][8][9][10], and differential regulation of CYP2D6 transcription may partly explain the variability in CYP2D6-mediated drug metabolism [11].
According to the GeneCards database [12], there are 87 loci in enhancer and promoter regions related to the expression of the CYP2D6 gene, spanning from less than 1000 to almost 300,000 base pairs away from the transcription starting site (TSS). Among these, eight are active in hepatocytes (https://epd.epfl.ch/, https://www.encodeproject.org/, and https://www.ncbi.nlm.nih.gov/refseq/ (accessed on 15 March 2022)). The association of enhancer/promoter activity and variations in the CYP2D6 gene with overall drug metabolizing is not extensively studied. So far, only a few regulatory variants of the CYP2D6 gene have been studied. Mostly, the two completely linked single-nucleotide polymorphisms (SNPs), rs133333 (G > A) and rs5758550 (G > A), located~116 kb downstream of the gene and identified as enhancers [13]. These SNPs are located within the 2.4 kb-long enhancer GH22J042015, the binding site for the transcription factor ZNF512 [12]. In addition to the aforementioned enhancer SNPs, the other most studied SNPs are located within the promoter/enhancer GH22J042130, which is 1.6 kb in size and 0.7 kb away from the TSS of the CYP2D6 gene. According to the Pharmacogene Variation (PharmVar) Consortium, some of the SNPs from this region are part of haplotypes that define the CYP2D6 star alleles [14][15][16].
Considering all cell types, GH22J042130 is the biding site of 15 transcription factors and affects the transcription of 11 genes [12]. In transcription related to drug metabolizing activity, this promoter is induced by the binding of hepatocyte nuclear factor 4 alpha, Kruppel-like factor 9 and peroxisome proliferator-activated receptor alpha, and is suppressed by nitric oxide and estrogen [11]. Since studies have shown that haplotypes containing enhancer or promoter loci allow the determination of CYP2D6 enzyme activity in vivo, their inclusion in genotyping panels could allow more accurate prediction of CYP2D6 activity [12]. SNPs from enhancers and promoter regions may be in linkage disequilibrium (LD) with star allele-defining SNPs from the CYP2D6 gene region, which may influence the metabolizing effect [17]. LD differs among populations, especially if they are isolated, have a different ancestry from the surrounding majority population and are susceptible to genetic drift.
An example of such a population is the Roma (Gypsy) population, a transnational minority present in many countries around the world. They originated in India and arrived in Europe around the 11th century via Central Asia (Afghanistan and Persia), the Middle East and present-day Turkey. It is estimated that the Roma population is numbering around 15 million people worldwide, of whom 12 million reside in Europe. Roma in Croatia belong to two socio-culturally and linguistically different groups: Vlax Roma, descendants of the Roma who crossed the Danube River between the 13th and 15th century and arrived in Wallachia and Transylvania (both in present-day Romania), and Moldavia, where they were forced to work in the mines for the next 500 years. During that time, they were forbidden to use their own language, so their descendants are now recognized by a specific archaic Romanian language-ljimb'd bayash. The second group is Balkan Roma, descendants of the Roma who arrived in the Balkans in the 11th century, and they speak dialects of the romani chib language. Socio-cultural characteristics of the Roma population, such as strict rules of endogamy, in addition to the founder and the bottleneck effects, have caused the genetic structure of Roma to differ compared to other populations [18][19][20], which has been shown to affect ADME genes' variations as well [21].
The main objective of this study was to estimate the variation in enhancers and promoter regions of the CYP2D6 gene among: (a) three socio-culturally and geographically distinct Croatian Roma groups, and (b) Croatian Roma and European and Asian 1000 Genomes populations; in particular, to find out whether the specific history of the Roma population influenced the distribution of the studied SNPs and their LDs with the main CYP2D6 star allele-defining SNPs. The knowledge of LD between CYP2D6 star alleledefining SNPs and SNPs in promoter regions and/or enhancers can enable prediction of CYP2D6 activity with greater accuracy.

Materials and Methods
We analyzed 323 DNA samples, all collected during field studies of the ongoing multidisciplinary anthropological, molecular-genetic and epidemiological investigations of Roma populations in Croatia. Samples belong to members of the three socio-culturally different Roma subpopulations: the Vlax Roma, who are divided into two subpopulations according to the geographical regions of Croatia they inhabit: Baranja and Medjimurje, and the Balkan Roma from the city of Zagreb. All Roma participated in the study voluntarily, and with the help of Roma volunteers, were informed about the goals, methods and expectations of the study. The Scientific Board and the Ethics Committee of the Institute for Anthropological Research in Zagreb, Croatia, approved the study protocol.
DNA was extracted from peripheral blood using the salting-out method [22]. The genotyping of 16 SNPs in the promoter region of the CYP2D6 gene and two enhancer SNPs on chromosome 22 was carried out using the Kompetitive Allele-Specific PCR method (KASP) in a commercial facility. The KASP genotyping assay is a form of competitive allelespecific PCR combined with a homogeneous fluorescent SNP genotyping system, which determines the alleles at a specific locus within genomic DNA [23]. Data for the CYP2D6 star allele-defining SNPs (rs1135840, rs16947, rs28371725, rs3892097, rs1058164, rs1065852 and rs769258) in Croatian Roma were taken from a paper by Stojanović Marković et al. [24].
Allele and genotype frequencies were calculated by direct counting. Hardy-Weinberg equilibrium (HWE) was assessed using the software Arlequin 3.5 [25]. Genotype and allele frequency differences between the three Roma groups were tested using the Chi-square test. The analyses were performed using R with statistical significance set at p < 0.05 [26]. Linkage disequilibrium (LD) analyses in the Roma groups have been performed using the software Haploview [27]. Only r 2 values of LD were calculated since it is considered more robust than D' and correlates better among different population samples [28,29]. Haploview software was also used for drawing plots. Data from the 1000 Genomes database were . LDs for European, East Asian and South Asian populations were calculated using the LD calculator implemented in the Ensembl genome browser [30]. Spearman's correlation was used to compare LD values [26]. Spearman's correlation results were used as input for multidimensional scaling (MDS), and plots were drawn using ggplot2 [31].

Results
Allele and genotype frequencies of studied polymorphic sites determined in three Croatian Roma subpopulations are shown in Table 1. Eight out of the eighteen investigated SNPs in our sample were monomorphic (rs1080993, rs34894147, rs1376235338, rs1224722684, rs1409156443, rs536645539, rs1080990 and rs58188898). All polymorphic sites except for rs13333 in the Baranja Roma subpopulation were in Hardy-Weinberg equilibrium. None of the SNPs showed significant differences in genotype or allele frequencies between the three Roma groups (Table 1). In Table 2, the linkage disequilibrium (LD) values for the three Croatian Roma subpopulations (r 2 values) are shown for pairs of two enhancer (rs133333 and rs5758550) and six polymorphic promoter SNPs (rs28624811, rs1080989, rs28735595, rs28588594, rs1080985 and rs1080983), as well as between pairs of the latter and SNPs that define different CYP2D6 gene star alleles (rs1135840, rs16947, rs28371725, rs3892097, rs1058164, rs1065852 and rs769258). Two SNPs in the CYP2D6 gene promoter region, rs35046171 and rs34167214, were not included in the LD calculation due to the extremely low prevalence of minor alleles in these SNPs.  In the studied world populations, four SNPs in the CYP2D6 regulatory regions (rs5758550 in the enhancer region, and rs28624811, rs1080985 and rs1080983 in the promoter region) showed similar distributions and the highest LD with rs16947 from the CYP2D6 gene region (Figure 1). Since these four regulatory region SNPs have the same LD pattern with the SNPs in the gene region, we calculated their pairwise LDs and found that promoter regions rs1080983 and rs28624811 are in complete LD not only in the European and Asian populations (data not shown), but also in the Croatian Roma population ( Table 2). A high LD (r 2 > 0.8) was found between rs1080985, from the promoter region, and rs5758550, from the enhancer region, both in world populations and Croatian Roma groups (Figure 1). Other SNP pairs have LD values ranging from 0.4 to 0.8, with the highest values in the Finnish population. Promoter region SNPs rs1080989 and rs28588594 also showed nearly identical distribution of LD values in the studied populations, and the highest LDs with rs1065852 from the CYP2D6 gene-coding region. We tested LD values between rs1080989 and rs28588594 from the promoter region and found that they were in complete LD in all studied populations. SNP rs28735595, from the CYP2D6 gene promoter region, had the highest LD with two SNPs in the gene region, rs1058164 and rs1135840 (r 2 > 0. 8 Figures S1-S3, which show nine LD plots for each of the Croatian Roma subpopulations.    To reveal the pattern of LD correlations, the MDS plots ( Figure 2) were constructed as described in the Materials and Methods Section. Most MDS plots show separation of East Asian populations from other populations. Considering the Croatian Roma population, the plots also suggest a slightly remote position of the Balkan Roma subpopulation from others, while the Baranja Roma subpopulation is almost always positioned close to some of the South Asian populations. The Roma subpopulation from Medjimurje is positioned either close to South Asian populations (MDS plots for rs5758550, rs28624811 and rs1080985), or closer to European populations (MDS plots for rs28735595, rs28588594 and rs1080989).

Discussion
Population pharmacogenomics is a growing area driven by increasing population data on genes responsible for absorption, distribution, metabolism and excretion (ADME genes). Population ancestry may affect the diversity of genetic polymorphisms, leading to population-specific differences in drug responses [32]. Within population pharmacogenomics, special attention should be given to the study of indigenous and/or minority populations which, due to their genetic history, show a specific distribution of alleles that can alter drug metabolism and lead to adverse drug reactions (ADR).
Previous analyses showed that the three socio-culturally different Croatian Roma groups show significant differences in allele distribution within the CYP2D6 gene [24], and therefore we continued to investigate promoter and enhancer SNPs associated with this gene. In general, diversity in regulatory elements has an impact on gene expression, so understanding it could help to elucidate the unexplained variability in gene activity [11]. Contrary to the differences found among Croatian Roma groups in the CYP2D6 gene region, the regulatory elements studied here showed no difference among the same Roma subpopulations.
To clarify the relationships of SNPs in the promoter/enhancer region with star alleledefining SNPs from the CYP2D6 gene region in the Croatian Roma population, we determined their LDs. Significant LDs between SNPs in regulatory and gene regions may affect CYP2D6 transcription and consequently drug metabolism, and so far, the most studied example of this interaction is rs5758550 [44]. Using the reporter gene assay, Wang et al. [45] found that the constructs containing minor allele G had higher activity independently of other SNPs which were part of the construct (rs133333 and rs4822082), while deletion of the region surrounding rs5758550 decreased CYP2D6 mRNA levels. Rs133333 and rs5758550 are in complete LD, but chromatin immunoprecipitation with the P300 antibody showed that deletion of 156 bp surrounding rs133333 did not decrease the level of transcription of CYP2D6 [45]. SNPs rs5758550 and rs133333, genotyped in Croatian Roma subpopulations, were also in high LD. The LDs of enhancers rs5758550 and rs133333 with SNPs from the CYP2D6 promoter region were also calculated. Only rs1080985 was in LD with the two enhancer SNPs. Raimundo et al. [46] and Zanger et al. [47] linked rs1080985 with increased CYP2D6 expression in the human liver, but this was not supported by reporter gene assays [13]. Today, it is considered that this SNP has no functional consequences (https://www.ncbi.nlm.nih.gov/clinvar/ (accessed on 15 March 2022)). Wang et al. [13] suggested that higher levels of CYP2D6 mRNA expression, previously thought to be associated with this SNP, may be explained by LD between rs5758550/rs133333 enhancer SNPs and rs1080985. Haplotypes reconstructed in the studied Croatian Roma population have the rs5758550 allele G and the rs1080985 allele C on more than 20% of chromosomes.
Furthermore, we investigated LDs between enhancer/promotor loci and major star allele-defining SNPs. An r 2 LD value higher than 0.8, indicating a significant association, was found between rs1080983, which is part of the CTCF binding site, and SNP rs16947, which defines allele *2. This SNP is also in high LD with rs28624811 from the promoter/enhancer GH22J042130. SNPs from the same regulatory element, rs1080989 and rs28588594, are in high LD with allele *4 (rs3892097), but an LD value higher than 0.8 was found only in Roma from Balkan, while this high LD value has been observed in all Croatian Roma groups for allele *10 (rs1065852). CYP2D6*10 is a decreased-function allele predominantly found in East and South Asian populations, where its prevalence ranges from 9% to 44%. Its frequency in African populations is between 4% and 6%, among Europeans, <2%, and in the Croatian Roma, 6% [24,48,49]. The non-function CYP2D6*4 allele, which is predominantly found in European populations (18%), had the highest frequency in the Balkan and Baranja Roma groups, even higher than in European populations. The prevalence of the allele *4 in the Medjimurje Roma group is lower than in the European population, but still higher than in other world populations [24,48,49]. Since both these star alleles have an impaired function, an additional analysis such as the reporter gene assay could help to untangle the effect of these promoter SNPs on CYP2D6 functionality. A high LD was also noticed between the promoter region SNP rs28735595, and SNPs rs1058164 and rs1135840, which are present in all the main CYP2D6 star alleles. Since the Roma population has a specific genetic history, we were interested to find out whether their CYP2D6 LD distribution is similar to other world populations. Combinations of the CYP2D6 regulatory and gene region SNPs with high LD values in the Croatian Roma are also present in the majority of world populations, but fine differences can be noticed among Roma groups. This is especially evident for rs3892097, which defines the CYP2D6*4 allele, when in LD with rs1080989 and rs28588594 from the promoter region, and this LD is the lowest in the Medjimurje Roma group compared to the other two groups. The LD correlation matrices presented in the MDS plots mostly distinguish the populations of East Asia from other studied populations. Such separation is evident in many studies related to SNPs of the ADME genes [21,50]. Population-specific differences based on r 2 LD values were also found by Ahsan et al. [28], but between drug response-related SNPs.

Conclusions
Although the studied Croatian Roma groups showed significant variability of the CYP2D6 gene variants determined so far, the prevalence of alleles in SNPs from regulatory regions did not differ between these same groups. However, linkage disequilibrium values between these regulatory regions' loci and the CYP2D6 gene region loci differed between the Croatian Roma groups, and the population of Medjimurje showed the lowest LD values.
Higher LD values between the studied SNPs of the promoter region and the SNPs defining impaired-function star alleles *2 and *4 of the CYP2D6 gene could be used in Roma to improve genotyping efficiency if further studies demonstrate that these promoter SNPs affect the functionality of the CYP2D6 enzyme. An overall comparison of the analyzed LD values revealed that while there was greater variety in the populations of East Asia, they were uniform in populations of Europe and South Asia and distinct in their distribution. In the future, our goal is to sequence the promoter region of the CYP2D6 gene in Croatian Roma samples, as this would help to further elucidate the structure and frequencies of common overlapping haplotypes of the CYP2D6 gene, as well as those specific to the Roma population.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki and approved by the Scientific Committee and the Ethics Committee of the Institute for Anthropological Research, in Zagreb, Croatia (RN 1.14-1611/14).

Informed Consent Statement:
All Roma participated in the study voluntarily and were informed about the goals, methods and expectations of the study with the help of linguistically and culturally competent and trained Roma volunteers, after which they gave their informed consent.

Data Availability Statement:
All data analyzed in this study are available at: http://roma.inantro. hr/en/. In case of using this database for further analyses, please cite this publication. If further clarification is required, contact the corresponding author.