Crohn’s Disease Susceptibility and Onset Are Strongly Related to Three NOD2 Gene Haplotypes

The genetic background and the determinants influencing the disease form, course, and onset of inflammatory bowel disease (IBD) remain unresolved. We aimed to determine the NOD2 gene haplotypes and their relationship with IBD occurrence, clinical presentation, and onset, analyzing a cohort of 578 patients with IBD, including children, and 888 controls. Imaging or endoscopy with histopathological confirmation was used to diagnose IBD. Genotyping was performed to assess the differences in genotypic and allelic frequencies. Linkage disequilibrium was analyzed, and associations between haplotypes and clinical data were evaluated. We observed a higher prevalence of risk alleles at all analyzed loci in patients with Crohn's disease (CD). Interestingly, c.2722G>C and c.3019_3020insC alleles were also overrepresented in ulcerative colitis (UC). T-C-G-C-insC, T-C-G-T-insC, and T-T-G-T-wt haplotypes were correlated with the late-onset form of CD (OR = 23.01, 5.09, and 17.71, respectively), while T-T-G-T-wt and C-C-G-T-wt were prevalent only in CD children (OR = 29.36 and 12.93, respectively; p-value = 0.001). In conclusion, the presence of c.3019_3020insC along with c.802C>T emerged as the main contributing diplotype in the late-onset form of CD, while in CD children, the allele shared by all predisposing haplotypes was c.2798+158T. Identifying the unique, high-impact haplotypes supports further studies of the NOD2 gene, including haplotypic backgrounds.


Introduction
The incidence of inflammatory bowel disease (IBD) in Western European populations has been estimated at 100-300 per 100,000 and, most alarmingly, it is still increasing [1].
Despite extensive studies over the past decade [2][3][4][5][6][7], the etiology and pathogenesis of IBD are still not fully understood. Furthermore, the factors influencing the early onset of symptoms of the disease have also not yet been determined. Still, it is now widely accepted that intestinal microbiota, environmental (e.g., diet, smoking, and physiological stress), immunological factors, and genetic susceptibility play crucial roles [6]. Hugot et al., in 1996 [8], identified the 16q12 region of the human genome as the first disease-associated locus (IBD1). The discovery of the NOD2 gene in this locus led to the subsequent research and identification of IBD-related vulnerability loci. To date, genomewide association studies (GWASs) and subsequent replication studies have provided further insights on IBD's pathogenesis by determining over 200 genetic risk loci and over 30 non-conservative mutations in the NOD2 gene [6,[9][10][11][12][13][14]. However, the role of novel IBD-associated loci, of which around 30 are shared between Crohn's disease (CD) and ulcerative colitis (UC), is still less known than IBD1, within which DNA sequence variations remain the most influential genetic disease-triggering factors.
The NOD2 gene (dbGene 64127) encodes the nucleotide-binding oligomerization domain-containing protein 2 (NOD2), also known as caspase recruitment domain-containing protein 15 (CARD15). Being responsible for bacterial pathogen recognition and inflammasome initiation by binding to muramyl dipeptide (MDP), the NOD2 protein plays a vital role in the immune system. Changes in the NOD2 gene sequence that affect the function of the protein product, and thereby disturb the balance of the inflammatory response, have been reported especially in CD patients [6,15]. Three main SNPs (single nucleotide polymorphisms) within the NOD2 gene are most often described as being strongly associated with a higher incidence of CD, but not UC: SNP8 located in exon 4 (c.2104C>T, p.Arg702Trp), SNP12 in exon 8 (c.2722G>C, p.Gly908Arg), and SNP13 in exon 11 (c.3019_3020insC, p.Leu1007fs) [16]. A common polymorphism in exon 4 of the NOD2 gene (SNP5), in which cytosine is replaced with thymine (c.802C>T, p.Pro268Ser), is also frequently studied and described in the literature. The prevalence of the p.268Ser allele reaches 49.5% in CD patients compared to 18.6% in the global population (according to the gnomAD database), and determines the higher probability of parenteral symptoms and earlier onset of disease in homozygotes [17]. Some reports indicate a strong link between the c.2798+158C>T (IVS8+158) variant, localized in intron 8, and CD development [18,19]. Initially, the IVS8+158 variant was identified as part of a haplotype in which none of the four above-described mutations occurs [18]. Genetic studies carried out on the Ashkenazi Jewish (AJ) population showed a higher frequency of NOD2 gene mutations in patients from multiplex families with CD, and a clear link between the incidence of CD and the occurrence of the IVS8+158 variant [20]. However, an additional genetic factor within this haplotype, predisposing to the development of the disease, was proposed [18].
These suggestions can be explained by the fact that CD is particularly frequent in the AJ population, characterized by severe bottlenecks and the cultural rule of endogamy, causing a higher incidence of certain genetically determined diseases than other ethnic groups. Some of the described loci associated with Crohn's disease in other populations are also related to CD in AJs [21]; nevertheless, the increased prevalence of Crohn's disease in the Jewish population is unexplained, suggesting potentially rare, AJ-specific genetic variants. In their latest report, Rivas et al. indicated that it is unlikely that the incidence of multigenic diseases will change significantly due to the bottleneck itself. The observed difference in the prevalence of Crohn's disease combined with the systematic enrichment of risk-increasing alleles is unlikely to have occurred by chance, and suggests an unintended selection of alleles in the AJ population. These authors suggest that a subset of Crohn's risk alleles may contribute to a typical biological process (e.g., a specific immunological response) or phenotype that has been positively selected in AJ. They admitted that the recent population bottleneck could reveal alleles with a significant increase in frequency, which may consequently contribute to interpopulation differences in genetic susceptibility factors. While the NOD2 gene variants are significantly associated with the genetic risk of CD, other genes with causal alleles that have not passed through the bottleneck are neglected [22].
The fact that genetic influences and inflows remain small or insignificant within the AJ population is undisputed. Still, we do not currently know how many admixtures of Jewish origin exist in European communities, but must bear in mind the history of Jews settling in Europe before the Second World War, particularly in the Polish population.
Based on this hypothesis and the recent study by Horowitz et al., showing recessive inheritance of rare NOD2 variants in 7-10% of CD cases and indicating NOD2 as a Mendelian disease gene for early-onset CD [23], we aimed to estimate the frequency of the c.802C>T, c.2104C>T, c.2722G>C, c.2798+158C>T, and c.3019_3020insC variants in a cohort of Polish IBD patients, including children. Thus, despite several studies considering NOD2 variants, we are the first to determine the distribution of NOD2 gene haplotypes in Polish IBD patients as well as those haplotypes' relationships with the disease occurrence, including its subtype and onset.

Study Group
The research group comprised IBD patients (adults and children) diagnosed with Crohn's disease or ulcerative colitis. Patients aged over 16 were classified as adults, and patients below 16 as pediatric cases (according to the Montreal criteria) [24]. Patients were diagnosed and managed in hospitals and clinics all over Poland, and prospectively included in the study. The inclusion criteria were as follows: diagnosis of IBD based on cross-sectional imaging or endoscopy (or both) with histopathological confirmation, disease duration over one year, and lack of any other autoimmune condition (e.g., rheumatoid arthritis, chronic renal failure). Indeterminate IBD cases were excluded from this research. Healthy unrelated adult individuals attending paternity testing and randomly selected from the Polish population (courtesy of the Laboratory of Molecular Genetics, Poznan, Poland) constituted the control group. The size of the study groups differed depending on the analyzed polymorphism: the IBD group consisted of 348-578 patients, and the control group comprised 231-888 individuals.

Genotyping
Genomic DNA was extracted from peripheral blood leukocytes, following the standard phenol-chloroform procedure, and stored in AE buffer (0.5M Tris-HCl with 0.1M EDTA). We performed pyrosequencing and/or high-resolution melting analysis (HRMA) followed by Sanger sequencing to conduct the linkage study. Primers were designed using PyroMark Assay Design Software 2.0 (Biotage, Uppsala, Sweden) and Primer 3 plus software (Free Software Foundation, Inc., Boston, MA, USA) [25]. Templates for pyrosequencing were amplified by PCR as follows: a 5 min initial denaturation at 94 °C, then 50 cycles of 94 °C for 30 s, annealing for 30 s, followed by a 5 min final elongation at 72 °C. According to the manufacturer's recommendations, pyrosequencing reactions were performed on the PSQ96 system using PyroMark Gold Q96 reagents (Qiagen, Hilden, Germany). For HRM, polymerase chain reactions were carried out using a commercially available Type-it HRM kit (Qiagen, Hilden, Germany). The PCR program started with initial denaturation (5 min at 95 °C) followed by 40 cycles of denaturation (10 s at 95 °C), annealing (30 s at 55-64 °C), and elongation (10 s at 72 °C) with 10 min extension incubation at 72 °C. The amplified samples were melted between 70 and 90 °C, raising the temperature by 0.1 °C at each step. For all HRMA data evaluations, we used the Rotor-Gene Q Series Software (Qiagen, Hilden, Germany). Genotypes were evaluated in comparison to control samples determined by Sanger sequencing. Samples with an altered melting profile were confirmed by sequencing samples chosen randomly (Supplementary Materials Figure S1).
For NOD2 association analysis, we genotyped all IBD patients and control samples for c.802C>T, c.2104C>T, c.2722G>C, c.2798+158C>T, and c.3019_3020insC. Primers for pyrosequencing and HRM used for screening the NOD2 gene are listed in the Supplementary Materials (Table S1).

Statistical Analyses
We applied the χ2 or Fisher's exact test to assess differences in genotypic and allelic frequencies between the studied groups. The odds ratios (ORs) with 95% confidence intervals (CIs) were computed using calculators on the websites http://ihg.gsf.de/ihg/snps.html (accessed on 2 November 2020) and https://www.medcalc.org/calc/ (accessed on 8 December 2020). Calculated p-values below 0.05 were considered statistically significant.
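As an illustration of the quantities named above, the following minimal sketch computes an odds ratio with a 95% CI and an uncorrected χ2 test for a 2×2 table of allele counts. It assumes the logit (Woolf) interval and a χ2 test with one degree of freedom; the online calculators used in the study may apply different methods. The function names and the example counts are hypothetical and are not taken from the study data.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (logit) confidence interval.

    a, b: risk / non-risk allele counts in patients
    c, d: risk / non-risk allele counts in controls
    All four counts must be non-zero for this formula.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value, 1 df."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical allele counts, for illustration only.
or_, lo, hi = odds_ratio_ci(20, 80, 10, 90)
chi2, p = chi2_2x2(20, 80, 10, 90)
print(f"OR = {or_:.2f}, 95% CI = ({lo:.2f}-{hi:.2f}), chi2 = {chi2:.2f}, p = {p:.3f}")
```

For small expected counts, Fisher's exact test would replace the χ2 test, as stated in the text; it is omitted here to keep the sketch short.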

Linkage Disequilibrium Analysis
The linkage disequilibrium analysis for the polymorphisms under study and the associations between the haplotypes and the clinical data were conducted using Haploview v.4.2 software (Broad Institute of MIT and Harvard, Cambridge, MA, USA) [26].
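The pairwise measures that linkage disequilibrium software typically reports can be sketched as follows. This is a simplified illustration that assumes the haplotype frequency is already known; Haploview itself estimates haplotype frequencies from unphased genotypes, which is not reproduced here. The function name and the example frequencies are hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b
    # D is normalized by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Hypothetical frequencies: allele A always occurs together with allele B.
d, d_prime, r2 = pairwise_ld(p_ab=0.2, p_a=0.2, p_b=0.5)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r2 = {r2:.2f}")
```

In this example D' equals 1 because one of the four possible haplotypes is absent, while r2 stays below 1 because the two allele frequencies differ.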

Results
We assembled a large group of Polish IBD patients, including adults and children. The following loci localized in the NOD2 gene were genotyped: c.802C>T (in 556 patients and 598 controls), c.2104C>T (in 575 patients and 539 controls), c.2722G>C (in 578 patients and 715 controls), c.2798+158C>T (in 348 patients and 231 controls), and c.3019_3020insC (in 573 patients and 888 controls) (Tables 1-5).
We analyzed patients with CD and UC independently, comparing them with population controls regarding differences in disease form. In the next step, we set CD and UC against one another. Moreover, we divided the CD and UC groups into adults and pediatric patients, and within these groups, we also considered gender.
In patients with CD, all risk alleles of analyzed loci were much more prevalent than in the general population or in UC patients, and all observations were statistically significant.
We did not observe differences between subgroups regarding adult patients and children independently in CD patients. However, looking at UC patients, we noticed relevant differences between pediatric and adult patients: in children, the c.802T allele was more frequent than in adults (OR = 1.69, CI = (1.092-2.604), p-value = 0.018, data not shown in tables). Considering gender, we did not observe differences in the adult patient groups. However, the c.802T allele seems to be significantly (p-value = 0.026) more prevalent in UC boys in comparison to girls (41.9% vs. 23.2%, OR = 2.39, CI = (1.100-5.168)), as well as in CD children (47.2% vs. 41.5%, respectively), where it did not meet the assumed level of statistical significance (p-value = 0.411) (results not included in tables).
In the case of the c.2104T allele, we also observed its higher impact in CD patients (OR = 3.61, CI = (1.84-7.09), p-value < 0.001), and no significant effect on UC susceptibility (p-value = 0.670). We did not observe any homozygotes in this locus. The differences between adults and children and between genders were noticeable in the UC group only (3.1% vs. 1.2%), but statistically insignificant.
In UC children, in contrast, the c.2722C allele was underrepresented in comparison to the general population and UC adults (0.8% vs. 1.1% and 3.8%, respectively); however, these discrepancies did not meet the assumed level of statistical significance. We did not notice significant differences in allele distribution between genders either. Being homozygous in the rs2066845 locus raised CD risk 10-fold (OR = 10.38, CI = (0.42-255.72)), but this result was not significant since we detected only one homozygous female CD patient in the tested group.
The haplotype comprising only wild-type alleles, C-C-G-C-wt, was the most common in patients and in the control group. Surprisingly, however, it was present in patients with even higher frequency than in the general population (OR = 3.40, CI = (2.43-4.78) and OR = 1.77, CI = (1.27-48), respectively), and this difference was statistically significant (p-value < 0.001). We found five allele sets in the assessed control group that did not appear in IBD patients, and three rare haplotypes present only in the IBD group. In the next step, we analyzed the association of haplotypes with disease entity, age at first symptoms, and sex (Table 6). We noticed that the haplotypes T-C-G-T-insC (E haplotype according to Tukel et al. [20]), T-C-G-C-insC (previously unreported), and T-T-G-T-wt (B haplotype according to Tukel et al. [20]) correlated with the late-onset form of CD (ORs were 5.09, 23.01, and 17.71, respectively), and these observations were statistically significant (p-values were < 0.001, 0.003, and 0.007, respectively). Interestingly, three rare, previously unreported allele combinations were present exclusively in IBD patients: T-C-G-C-insC was characteristic only of adult CD and pediatric UC patients. In contrast, T-T-G-T-wt was more prevalent in CD children than in adults, with a frequency 30 times higher than in the general population (OR = 29.36, CI = (3.69-233.68), p-value = 0.001). Furthermore, the haplotype C-C-G-T-wt was present exclusively in CD pediatric cases, making it predisposing to early-onset CD (OR = 12.93, CI = (2.72-61.58), p-value = 0.001).
The haplotype T-C-G-T-wt was approximately twice as frequent in both IBD groups as in the population. It met the statistical significance level in UC patients (adults and children) and CD children, while it was borderline significant in CD adults. The haplotypes C-C-C-C-wt, C-C-C-T-wt, C-C-C-C-insC, T-C-C-T-insC, and T-C-C-T-wt (all unreported by Tukel et al. [20]) were absent in IBD patients. However, they were present in the tested population group, and C-C-C-C-wt was the second most frequent (OR calculations were possible using Deek's correction [27]).
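When a haplotype is absent from one group, the plain odds ratio is zero or undefined, which is why a correction is needed. The sketch below assumes the correction amounts to adding 0.5 to every cell of the 2×2 table when any cell is zero (a common continuity correction); the exact procedure described in [27] may differ. The function name and the counts are hypothetical.

```python
def corrected_odds_ratio(a, b, c, d, k=0.5):
    """Odds ratio with a continuity correction applied only when a cell is zero.

    a, b: haplotype present / absent in patients
    c, d: haplotype present / absent in controls
    k:    value added to every cell if any cell equals zero (assumed 0.5)
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + k for x in (a, b, c, d))
    return (a * d) / (b * c)

# Hypothetical counts: the haplotype is absent in patients but present in controls.
print(round(corrected_odds_ratio(0, 100, 5, 95), 4))
```

Adding the constant to all four cells, rather than only to the empty one, keeps the estimate finite without favoring either group.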

Discussion
Allelic variants of the NOD2 gene (c.802C>T, c.2104C>T, c.2722G>C, c.2798+158C>T, c.3019_3020insC) are generally overrepresented in the Polish population in comparison with other ethnic groups (Table 7). Still, the risk alleles are even more prevalent in CD patients. Moreover, their frequencies differ between patients with CD and UC. This observation led us to conclude that the NOD2 gene and its sequence variations remain among the most critical genetic backgrounds of IBD, contributing particularly to susceptibility to Crohn's disease.
Two decades of research and numerous scientific reports have not yet unraveled the molecular basis of IBD. They have shown that CD and UC are distinct diseases conditioned by multiple genetic factors [21][22][23][24]; nevertheless, the NOD2 gene and its sequence variations remain the most influential genetic disease-triggering factors. To date, ~2404 variants of the NOD2 gene have been described together with specific phenotypes. Earlier studies have shown that, among the European and North American populations, the most frequent changes in the NOD2 gene associated with CD are 1007fs, R702W, and G908R [19,28], but also the c.802C>T variant [16,29]. In general, mutations in the NOD2 gene are indicated as being CD-related in several Caucasian populations [30]. On the other hand, association studies among Indian IBD patients have shown a weak relationship of the NOD2 gene mutations with UC, but not with CD [31]. In turn, the study of NOD2 gene variants in patients with UC in the Portuguese population did not correlate them with increased risk of the disease; however, a tendency for a more aggressive course of the disease was observed among carriers of rare NOD2 variants [32].
In this research, we highlighted the prevalence of risk alleles in all analyzed loci in CD patients; however, the c.2722G>C and c.3019_3020insC alleles were unexpectedly overrepresented in UC patients as well. Moreover, gathering of pediatric IBD cases and linkage disequilibrium analysis in this particular group of patients enabled us to observe possible relationships of specific haplotypes with early disease onset.
It is widely known that analyzing haplotypes enables the observation of correlations that may not be apparent for single markers. However, we did not find many papers describing NOD2 haplotypes in the context of IBD susceptibility and course. Many of the articles describe variants in the NOD2 gene considered individually. In the newest study performed on a large cohort of CD patients, Horowitz et al. reported that individuals carrying any one of the main three NOD2 risk alleles (p.R702W, p.G908R, or p.L1007fs) have up to 4-fold increased risk for developing CD, while carriers of two or more of the same NOD2 variants have 15-40-fold increased risk. Moreover, these authors highlighted a subset of IBD cases with the recessive inheritance of NOD2 alleles and a substantially higher risk of early CD onset; their analyses unequivocally showed more significant effects for NOD2 homozygotes and compound heterozygotes than carriers of single NOD2 genetic variants only, and indicated that the genetic contribution of NOD2 alleles, in a subset of Crohn's disease patients, suggests a recessive disease model [23]. The present results entirely support the conclusions of Horowitz et al.
In 2004, Tukel et al. described NOD2 haplotypes detected in AJ and Sephardi/Oriental Jewish (SOJ) populations; they also presented possible evolution of haplotypes, and confirmed their theory by analyzing flanking STR markers [20]. In our study, we detected all allele combinations described in AJs and SOJs; nevertheless, we also identified new haplotypes, indicating possible recombination events in this region (Table 6).
To date, the highest recorded incidence of IBD has been reported in non-Hispanic whites (inhabiting Central and Western Europe and North America), three times higher than in other ethnic groups [33]. In this population, a strong association between mutant NOD2 and CD risk was reported for R702W, G908R, and 1007insC when they were considered separately (p-values: < 0.001, 0.002, and < 0.001, respectively) [34]. G908R and L1007fs were associated with Crohn's disease susceptibility in the Dutch population, and carrying at least one of these mutations was associated with more severe and penetrating disease [35]. In Hungarian patients, 1007finsC in adult cases, and 1007finsC and G908R in pediatric cases, were significantly associated with increased disease risk [36]. In Greek CD patients, the 1007finsC mutation was significantly more frequent in the childhood-onset than in the adult-onset form [37]. Other research carried out in Greece pointed out the association of R702W, G908R, and 1007insC with ileitis or ileocolitis in the clinical picture of CD [38,39]. Similarly, in an Italian multicenter study, Ferraris et al. demonstrated that NOD2 polymorphisms were associated with susceptibility to early-onset CD, and with ileal involvement. For the first time, they also reported an association with severe, early-onset UC [40]. Regarding the group of UC children presented in this study, we obtained similar results, with a significantly higher frequency of the c.3019_3020insC allele (OR = 2.64) than in the general population, and this was visible in UC adults as well (OR = 1.92). However, regarding haplotypes, the presence of the insertion mutation alone was not sufficient to increase IBD susceptibility. A definite positive effect was only observed when c.3019_3020insC was accompanied by the c.802C>T variant; nevertheless, in UC children, it did not reach the statistical significance level.
In UC adult and pediatric patients, the T-C-G-T-wt haplotype seemed to be predisposing in our study cohort (OR = 2.00 and 2.61, respectively). Since the haplotype including only wt alleles (C-C-G-C-wt) was in the majority of our UC group, this suggests that other variants in the NOD2 gene may play a substantial role in these patients.
In CD patients, the c.3019_3020insC allele present in haplotypes containing wt alleles at the remaining loci seemed not to play a crucial role in disease susceptibility. In contrast, the presence of c.3019_3020insC along with c.802C>T emerged as the main contributing diplotype in late-onset CD. The picture appeared more complex in CD children, in whom the allele shared by all predisposing haplotypes was c.2798+158T. Based on these observations, we are convinced that further studies of NOD2 gene variants should include haplotypic backgrounds.
Crohn's disease occurs with the highest frequency in AJs of Central European origin, which is 2-4-fold higher than in non-Jewish ethnic groups [41,42]. AJs are individuals of Jewish ancestry with a recent origin in Central and Eastern Europe. Tukel et al. established that minor alleles of the NOD2 gene, particularly G908R and 1007fs, are twice as frequent in AJ CD patients from Central Europe as in AJ patients from Eastern Europe. Surprisingly, in SOJ patients, NOD2 mutations were also overrepresented. Moreover, this study's haplotype analysis revealed that the p.702W allele was associated with the p.268P and p.268S alleles [20]; this contrasts with the findings reported in [42], in which 702W, 908R, and 1007fs occurred exclusively with the 268S variant. The study of Lesage et al. (2002) indicated that 49% of the patients with familial CD had one (32%) or two (17%) mutant NOD2 alleles; in their research, the majority of detected mutations were localized in the distal part of the gene, and the three common mutations (R702W, G908R, and 1007fs) accounted for 81% of identified alterations [16]. In a different study, 32% of AJ families and 30% of SOJ families with CD did not carry an NOD2 mutation, indicating the heterogeneity of the predisposing genes causing CD among Jewish patients [20]. Considering the examples mentioned above, we can state that the higher incidence of several diseases, including IBD, in AJs is likely due to genetic drift following a bottleneck; the AJ population is much larger and experienced a more severe bottleneck than other founder populations [43].
We also reflect on the probability of the NOD2 gene mutations of similar origin in Polish and Jewish populations. We believe that the NOD2 variants determining the occurrence of Crohn's disease came from common ancestors, resulting from mutual history. In the 13th century, Ashkenazi communities emerged in Poland and multiplied until the 20th century, reaching millions in size and a wide geographic spread across Europe [44,45]. Assessing genetic distance, Atzmon et al. showed that the AJs are more closely related to some host Europeans than to the ancestral Levantines [44]. Hue et al. suggested a model of at least two events of European admixture: The first of them slightly pre-dated a late medieval founder event, was probably of Southern European origin, and was estimated to be 25 ± 50 generations ago. The inferred subsequent admixture was hypothesized to have appeared approximately 30 generations ago, and most likely occurred in Eastern Europe. However, multiple lines of evidence suggest that it represents an average over two or more events, pre-dating and post-dating the founder event experienced by AJs in the late Middle Ages [45]. Before World War II, approximately three million Jews lived in Poland. After 1945, the most significant number of Jews was recorded in July 1946, and amounted to ~220,000. In the structure of this decimated population, men predominated, causing an increase in mixed marriages, which was one ground for the postwar conversion to Christianity. Moreover, the postwar anti-Semitic moods and the policies of the communist authorities (and in many cases also the adoption of Polish names and surnames) were among the reasons for the decision to change personal details [46,47]. This historical background may explain the higher frequency of minor NOD2 gene alleles in the Polish population and in Polish patients with CD.
The conclusion from these scientific reports is quite apparent, and widely known: populations of different ethnic origins or living in specific regions show diversity in the distribution of genotypes of a given variant. Most racial and ethnic research on IBD includes analysis of Caucasian populations. Unfortunately, these are usually small groups of subjects characterized by exceptionally high social and economic heterogeneity.
Ng et al. [1] demonstrated that since 1990, incidence rates have changed in Western countries, showing a stable or decreasing incidence. Still, the burden of disease remains high in most European countries, North America, and Oceania. On the other hand, the countries of Africa, the Middle East, Asia, and South America, whose societies are becoming more western and urbanized, reflect the progress in inflammatory bowel diseases in the Western world since the 1900s, which indicates the significant role of environmental externalities in the pathogenesis of the disease.
New locus mapping in GWAS studies leads to the identification of an increasing number of polymorphisms and haplotypes between different populations, which underlines the role of genetic variability in analyzing the molecular background of multigenic diseases. The phenomenon of higher prevalence of minor NOD2 gene alleles in the Polish population most likely results from historical conditions. The identification of unique, high-impact haplotypes supports population genetics as being fundamental to unravelling IBD's molecular background. Our results indicate more significant effects of homozygous and compound heterozygous NOD2 mutations than single NOD2 genetic variants in conditioning Crohn's disease. Moreover, extended NOD2 haplotype analysis suggests the existence of additional genetic factors remaining in linkage disequilibrium, which may be related to IBD susceptibility and onset in a population-specific manner. NOD2 gene sequence variants and haplotypes play a crucial role, and should not be underestimated in IBD diagnosis.

Informed Consent Statement:
All individuals were informed about the terms of participating in the study, and gave their written consent to testing. All procedures performed in studies of human participants followed the institutional and national research committee's ethical standards and the 1964 Declaration of Helsinki and its later amendments, or comparable ethical standards.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.