Exome Array Analysis of 9721 Ischemic Stroke Cases from the SiGN Consortium

Recent genome wide association studies have identified 89 common genetic variants robustly associated with ischemic stroke and primarily located in non-coding regions. To evaluate the contribution of coding variants, which are mostly rare, we performed an exome array analysis on 106,101 SNPs for 9721 ischemic stroke cases from the SiGN Consortium, and 12,345 subjects with no history of stroke from the Health Retirement Study and SiGN consortium. We identified 15 coding variants significantly associated with all ischemic stroke at array-wide threshold (i.e., p < 4.7 × 10−7), including two common SNPs in ABO that have previously been associated with stroke. Twelve of the remaining 13 variants were extremely rare in European Caucasians (MAF < 0.1%) and the associations were driven by African American samples. There was no evidence for replication of these associations in either TOPMed Stroke samples (n = 5613 cases) or UK Biobank (n = 5874 stroke cases), although power to replicate was very low given the low allele frequencies of the associated variants and a shortage of samples from diverse ancestries. Our study highlights the need for acquiring large, well-powered diverse cohorts to study rare variants, and the technical challenges using array-based genotyping technologies for rare variant genotyping.


Introduction
Stroke is the second leading cause of disability and death worldwide, accounting for over 6 million deaths in 2019 [1]. The etiology of ischemic stroke(IS), the predominant form of stroke, is multifactorial and includes both genetic and nongenetic causes. Genomewide association studies (GWAS) have identified 89 stroke-associated loci to date [2,3], although these loci account for only a very small proportion of stroke heritability. A major limitation of current genome-wide approaches, which rely predominantly on genotyping arrays, is that they typically interrogate only common variation throughout the genome (e.g., SNPs with minor allele frequency > 1-5%) and generally do not cover the coding regions of the genome. Protein-coding variants are generally rare and are poorly captured by conventional GWAS arrays. Identifying the contribution of protein coding variation to stroke etiology is important. Even if exonic variation accounts for only a small proportion of stroke burden, identification of variants in novel genes may provide new insights into stroke biology.
The potential contribution of rare protein-coding variation to the etiology of ischemic stroke has not been systematically studied. Several small pilot exome-array association studies have been published based on relatively small numbers of subjects [4]. In 2015, Auer et al., published an exome-wide association analysis based on 365 ischemic stroke cases with small-and large-vessel subtypes (plus additional controls) who underwent whole exome sequencing through the NHLBI Exome Sequencing Project [5]. This study identified two protein-coding variants associated at exome-wide levels of significance, one a common variant (in PDE4DIP), and a second a rare variant (in ACOT4), although neither association has been replicated in subsequent studies. Using whole genome sequencing data from the TOPMed Consortium, Hu et al., recently performed a genome-wide analysis of 5616 ischemic stroke cases and >27,000 controls, from which they identified 2 variants significantly associated with IS and a 3rd variant associated with IS due to large-artery atherosclerosis (n = 352 cases). The lead variants at all loci were low-frequency and more common in non-European populations. None of the variants were exonic, and none of these associations have been replicated in independent data sets, although the minor allele frequencies of these variants were low and the power to replicate limited.
To expand these efforts, we have performed an exome-wide array analysis of 9721 stroke cases from the SiGN Network and 12,345 controls to evaluate the impact of rare coding variation on stroke risk.

Samples and Genotyping
This study includes 9721 ischemic stroke cases from the Stroke Genetics Network (SiGN) (dbGap Accession phs000615.v1.p1) and 12,345 non-stroke controls (1303 from SiGN and 11,042 from the Health and Retirement Study (HRS). SiGN is an international collaboration of 31 studies across North America, Europe, and Australia to identify genetic determinants of ischemic stroke [6]. The analysis presented in this manuscript includes subjects (mostly stroke cases) recruited from multiple sites in the United States and Europe (UK, Poland, Belgium, Spain, Austria, and Sweden). The HRS is a representative sample of people in the U.S. over the age of 50 residing in households with an oversample of African American (AA) and Hispanic populations [7]. HRS exome chip data is available with an approved HRS Restricted Data Agreement (RDA). Access information can be found at https://hrs.isr.umich.edu/data-products/genetic-data/products (accessed on 8 April 2022). Although exome array genotyping was successfully performed on 15,561 HRS subjects, only the subset of 11,042 HRS subjects who had no stroke history and had genome-wide array data available were included in this study so that alignment of the genome-wide genotyping data could be used for estimation of ancestry.
Genotyping for both studies was performed at the Center for Inherited Disease Research (CIDR), SiGN cases on the Illumina HumanOmni5Exome-4v1_A array and HRS controls on the Illumina HumanExome-12v1-1 array. Both studies used calling algorithms implemented in GenomeStudio version 2011.1, Genotyping Module 1.9.4, and GenTrain version 1.0.

Genotype Quality Control
A challenging feature of our study design is the use of cases and controls genotyped on slightly different arrays and at different times. We have previously performed a high quality GWAS of stroke in SIGN using external controls [6], which focused on common variants (MAF > 1%) as opposed to exome content enriched for low frequency variants. Rare variants are more challenging to call using array technologies because arrays rely on clustering due to genotype intensity to make genotype calls. To minimize the potential for bias arising from differential quality of genotyping calling between the two genotyping platforms, we therefore implemented a very stringent quality control procedure to identify poor quality SNPs and SNPs showing evidence for differential genotyping calling between the two arrays. In Stage 1 of our genotype quality control procedure, performed prior to association analysis, we utilized a large set of variant filters to identify and exclude SNPs of poor quality or differential quality between the two arrays. All remaining SNPs then underwent association analysis, after which we performed a Stage 2 quality control assessment that consisted of manual inspection of the genotype intensity plots of all associated SNPs from both the SiGN and HRS arrays to further exclude SNPs showing evidence of poor clustering on one or both genotype intensity plots. Manual inspection of genotype intensity plots for all SNPs prior to analysis was considered too labor-consuming and not feasible.

Population Structure Analysis Using Admixture and PCA
GWAS array genotypes from SIGN (Illumina 5MplusExome array) and HRS (Illumina Human Omni-2.5 Quad beadchip) were used for genetic ancestry analysis following genotype data cleaning as previously [6]. Only directly genotyped autosomal variants with minor allele frequency (MAF) > 5% were used. The variants were further pruned to keep independent variants not in linkage disequilibrium (LD). Principal component (PC) analysis of genotypes was carried out in PLINK on unrelated samples and then the related samples were projected to the established PC space. Up to 10 PCs were included as covariates for association testing. Cases and controls had comparable distribution on the PC space. Particularly for "AFR" samples, there was no statistically significant difference in EUR or AFR component between cases and controls.
We additionally estimated the percentage of genetic ancestry (Europe, Africa, Native America, Eastern/South Asia) in individuals using the ADMIXTURE software program [8] and the Human Genome Diversity Project (HGDP) reference genomes [9]. Samples estimated as having genetic ancestry of (European + Central Asia) > 70% were classified as "EUR", and samples estimated as African ancestry > 50% and Native American ancestry < 5% and Asian ancestry < 5% were classified as "AFR". The remaining samples (mostly Latinx) were classified as "Other." This classification was used to facilitate genotype cleaning and filtering as applied to a particular genetic ancestry, e.g., comparing MAF between EUR samples from SiGN and EUR samples from HRS. Only "EUR" and "AFR" samples were used for this purpose.

Association Analysis
We performed association analysis using the mixed model as implemented in SAIGE [10]. Genetic relationship matrix was modeled as a random effect. Covariates in the logistic regression model included sex and the first 10 principal components to account for ancestry. Power calculations indicated that our sample provided 80% power to detect odds ratios ranging from 1.09 to 1.20 for genetic variants with minor allele frequencies (MAF) = 0.5% and 1%, respectively, at the exome-wide threshold for significance, i.e., 4.7 × 10 −7 for 106,101 variants.

Replication
We sought to replicate associations in the TOPMed Consortium, through a look-up from the TOPMed Stroke Working Group. TOPMed Stroke included 5613 ischemic stroke cases and 27,106 controls who underwent whole genome sequencing [11]. Among the stroke cases were 4305 cases of European ancestry and 884 cases of African ancestry. We additionally attempted replication of associated variants with stroke in the UK Biobank. For these analyses, we extracted ischemic stroke cases using an ICD code algorithm previously published in the "Definitions of Stroke for UK Biobank Phase 1 Outcomes Adjudication" [12]. Ischemic stroke was defined using ICD 10 codes 163.X (cerebral infarction) and 164.X1 (stroke not specified as haemorrhage or infarction) and analyses performed in 5874 stroke cases and 117,442 controls (i.e., 20 controls/case). All data were downloaded from the UK Biobank Resource under Application Number 49852. We performed logistic regression in PLINK using age, sex and 5 principal components for ancestry.

Results
Study subject characteristics. Our analysis was based on 9721 cases and 12,345 controls. SiGN cases were recruited from 22 sites across the U.S. and Europe (Table S1). SiGN cases (all sites) plus SiGN controls from Belgium and Poland were genotyped on the Illumina HumanOmni5Exome-4v1_A array and all HRS controls were genotyped on the Illumina HumanExome-12v1-1 array. Characteristics of study subjects are shown in Table 1. The mean age of stroke onset in cases was 67 years (range: 14-104 years). 81% of cases and 80% of controls were genetically defined as European ancestry and 11% of cases and controls were genetically defined as African ancestry. TOAST subtype classification was unavailable in 64% of all stroke cases.

Merging of variants from the SiGN and HRS arrays.
A total of 4,278,837 GWAS plus exome content SNPs were released for analysis in SiGN and 228,088 exome array SNPs for HRS following completion of array-specific initial quality control procedures by the genotyping center (CIDR) and subsequent in-depth quality assurance/quality control (QA/QC) analysis by the genotyping analysis core at the University of Washington (SiGN) or the University of Michigan (HRS). The sample level and variant level filters recommended by the genotyping analysis cores were applied before data merging and the two-stage variant QC described below for current study. After removing variants that failed strand or allele alignment, there were a total of 198,811 overlapping SNPs between the SiGN 5MPlusExome array. We then removed 10,413 of these aligned SNPs because they were peripheral to relevant exome content, as they were included as ancestry-informative variants or for QC or method development purposes. We then conducted detailed quality control analyses on the remaining set of 188,398 SNPs merged from SiGN and HRS to identify SNPs whose genotyping quality potentially differed across platforms.
Stage 1 variant quality control filtering. We implemented multiple quality control checks to exclude potentially problematic SNPs from analysis. Criteria for excluding SNPs included: (1) excessive deviation from HWE in either EUR or AFR controls, or extreme deviation in cases; (2) AT/GC SNPs with high allele frequency; (3) discordant genotype calls between samples genotyped on both platforms; (4) excessive differences in AF between EUR controls genotyped in SIGN and HRS; (5) low minor allele count; (6) high genotype missingness calls; (7) possible under-calling of genotypes, especially in SiGN (8) non-autosomal SNPs, (9) SNPs marked in HRS as technical failures; (10) large AF differences between controls and gnomAD; and (11) SNPs with duplicate probes but discordant genotypes (Figure 1). A total of 82,297 (43.7%) of SNPs were excluded, although the bulk of these (n = 78,725) were excluded due to low MAC in both cases and controls. A detailed description of the variant exclusion criteria, including filters for used for each criterion, is provided in the Supplemental Materials, and the specific filtering thresholds used at each step in QC are provided in Table 2. For HRS, we applied the filters recommended in the Quality Control Report recommended by HRS Analysis team at the University of Michigan. Genetic association results. Following variant quality control filtering, we performed genetic association studies on 106,101 variants with minor allele count ≥ 10 in cases or controls. We identified 36 variants significantly associated with all ischemic stroke meeting array-wide threshold for statistical significance (i.e., p < 4.7 × 10 −7 ). Upon manual review of these variants, we considered the SiGN or HRS genotype intensity plots for 21 of these SNPs to be of poor or questionable clustering quality, leaving 15 variants that we regarded as being robustly associated with stroke. Association results for these variants are shown for the overall meta-analysis in Table 3, and the genotype intensity plots for these 15 associated SNPs from both the SiGN and HRS arrays are provided in Supplemental Figure S1. Among the 15 associated variants were two common SNPs in ABO, rs507666 and rs635634 that are in near perfect linkage disequilibrium with each other (r 2 = 0.99) and that have previously been associated with stroke [2]. Twelve of the remaining 13 variants were extremely rare in European Caucasians (MAC ≤ 10 combined for each in cases and controls) and the associations were driven by substantially higher allele frequencies in African American cases than in African American controls ( Table 4). Four of these variants (TPTE rs143510517, MEP1A rs62619974, DDX31 rs142792732, and PATL1 rs79336999) were associated with stroke at p < 1 × 10 −10 . A variant in PRIM2, rs199585353, was present exclusively in European Caucasians, in whom the minor allele frequency was 0.25% in cases (MAC = 40) and 0% in controls (p = 8.28 × 10 −8 ).  Table S1 for specific filter thresholds at each QC step).   Table S1 for specific filtering thresholds at each QC step).
Of the 13 non-ABO variants, all have MAF < 0.1% in European ancestry populations and ≤4% in African ancestry populations as indicated in gnomAD. Seven of the 13 are annotated through ANNOVAR using RefSeq gene annotation [13] as missense variants, 2 as stoploss variants, and the remaining 4 as function unknown. One SNP rs149905649 in gene DOK7 was included in ClinVar but was annotated as "benign/likely benign". None of the remaining SNPs were included in ClinVar.
Replication. We sought to replicate the associations of these 15 variants in TOPMed Stroke and UKB. In TOPMed Stroke, 10 of the 15 variants were polymorphic in TOPMed, none of which showed evidence for association (Table S2). Associations of the two ABO variants were directionally consistent with the discovery analysis in both the European and African ancestry populations. Among European ancestry cases and controls combined, only 2 of the remaining 13 (non-ABO) variants had minor allele counts > 8; among African ancestry cases alone, 7 of these 13 non-ABO variants had minor allele counts > 6 (range 190-631). variants showed substantial call rate differences between a particular site and the remaining samples (differential missingness p value < 5.0 × 10− 7 ) 12,691

Number of SNPs remaining 106,101
* the excluded SNPs are not exclusive among rows; e.g., 3534 of these SNPs also excluded by other criteria.
In the UKB, we attempted to replicate only the PRIM2 rs199585353 and DOK7 rs149905649 associations observed in the European ancestry group as the number of African ancestry individuals in UKB with stroke was small (n = 101). The imputation quality of PRIM2 rs199585353 did not meet our info score quality control threshold of 0.7 (info = 0.51) and was therefore not analyzed. The MAF for DOK7 rs149905649 was 0.00051 in cases (6/(5874 × 2)) and 0.00068 in controls (159/(117,439 × 2)); p = 0.46).

Discussion
We combined exome array genotypes from a large collection of stroke cases with exome array data from publicly available controls to carry out a large-scale association study to identify rare protein-coding variants associated with ischemic stroke. With 9721 well-phenotyped stroke cases, this study represents the largest effort to date to identify rare protein-coding variants associated with ischemic stroke at an exome-wide level. Although we identified 13 rare variants meeting exome-wide thresholds for association, none replicated in the 2 replication datasets. Of the 13 rare variants none have known biology function to be compelling candidate genes responsible for the pathogenesis of ischemic stroke; 12 of the 13 associations were driven by allele frequency differences in the African American (AA) population, in whom the sample sizes in both the Discovery and Replication data sets were much smaller. We also identified two common variants (MAF~0.20 in European ancestry) in ABO that are in high linkage disequilibrium with each other and that have been previously associated with ischemic stroke [2].
Our results lend further support for advocating the inclusion of diverse populations in genomic studies. The sample sizes of African American samples were small in our discovery data (1044 AA cases). However, most of the identified exome variants in our EWAS analysis (12 out of 13 variants) reached statistical significance due to their presence or relatively higher minor allele frequencies in African American (MAF 0.03%~4.11%), while being absent in Europeans. Nonetheless, at n =~1000 of cases, the power was too low for a reliable discovery and replication, especially given the genetic diversity among African American across study cohorts. It is thus not surprising that these AA-specific coding signals did not replicate in the TOPMed cohort, in whom the AA sample size was also limited. Our efforts highlight the need to expand genomics research in non-European populations.
Up to 81.5% of our discovery samples were of European ancestry genetically. Yet, we identified only 2 exome-wide associated loci driven by European samples, one previously identified at ABO and the second a novel locus in PRIM2, that was not replicated in either TOPMed or UKB. There are several possible reasons for our failure to identify more robust associations in exomes of Europeans. First, the coverage of the exome in our analysis was low due to our stringent filtering criteria and some 'causally' variants may have failed QC and not been tested. In fact, only 46.5% of the variants of the exome array content were actually tested. Whole exome sequencing would have provided much greater coverage across genomes. Second, the overall EWAS power to detect and replicate 'significant' associations for rare variants in our dataset, even with 9721 cases in the Discovery set, was low. Although the two ABO variants are very common with MAF up to 20% in European, the variant in PRIM2 only has MAF 0.04% in GNOMAD European samples. Third, it is also possible that rare protein-coding variants do not play a large role in the etiology of ischemic stroke.
Our study highlights a major challenge in accruing large sample sizes for rare variant analyses. The problem of accruing large samples sizes for analysis of common variants has been addressed successfully by combining study-specific genome-wide association results through meta-analysis. Within contributing studies, established protocols are typically used that include a thorough assessment of genotype intensity clusters for evaluation of cluster separation and genotype calls. Such assessments rely on sufficient numbers of each genotype to establish genotype separation boundaries. If sufficient numbers of each genotype are not available, as when the minor allele count is very small, establishing the genotype cluster separation boundaries can be difficult, making the genotype call unreliable. This problem is magnified if cases and controls are genotyped in different labs or different batches, because any 'batch' effect will mimic a difference between cases and controls. We attempted to address this problem by implementing a very stringent set of quality control measures that considered both within study (SiGN and HRS) as well as between study measures. We further attempted to minimize the possibility of identifying false results by manually inspecting the genotype intensity clusters of all variants we reported to be exome-wide associated with stroke. However, the cost of our implementing this stringent procedure was removing a large proportion (46.6%) of variants from analysis. The rigid variant QC procedures coupled with the necessity of manually inspecting all genotype intensity plots for all 'reportable' associations also makes bin-based analyses much less attractive since many bins will be incompletely covered due to variant filtering and all variants would require the laborious task of evaluation of genotype intensity plots.
Future efforts to identify rare protein-coding variants associated with stroke would be wise to pay heed to these lessons by using studies that rely either on non-array genotyping technologies, such as sequencing, for variant detection, or to employ very large samples for array-based studies in which cases and controls are genotyped together with careful effort made to minimize batch effects. For example, a recently published study from the TOPMed Stroke Working Group was based on whole genome sequencing (WGS), although this study included only 5616 ischemic stroke cases and 27,116 non-stroke controls [11]. This WGS study identified five novel variants associated with stroke. However, only 2 of these variants were present in SiGN and neither provided evidence for replication.
Our study includes other limitations. Even with 9721 stroke cases, our sample is powered only to detect those rare variants having relatively large effect sizes. Stroke subtype information was available for only 35.6% of our cases, even further limiting power for stroke subtype-specific analyses.

Conclusions
We have conducted the largest effort to date to identify rare protein-coding variants associated with ischemic stroke at an exome-wide level. We identified 13 rare strokeassociated variants as well as one additional association with 2 common variants at a previously known locus in ABO. Our study highlights the multiple challenges in using publicly available controls for large-scale rare variant array-based studies and the importance of expanding the inclusion of diverse non-European samples in the genetic study of ischemic stroke.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes14010061/s1, Detailed quality control procedure; Table  S1: Cases (n = 9721) and controls (n = 12,345) by site and ancestry; Table S2: Replication of stroke associated variants in TOPMed Stroke and UK Biobank; Figure S1: Genotyping intensity plots for 15 exome-wide significant SNPs. but other investigators are encouraged to access the data resources through collaborations with the SiGN consortium.

Conflicts of Interest:
Grant funding for research but no other competing interest. All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (accessed on 11 January 2022) and declare: The submitted work was supported by National Institutes of Health grants R01 NS105150, R01 NS100178, and R01 NS114045; Department of Veterans Affairs grant BX004672-01A1; and AHA Award 19CDA34760258. No financial relationships with any organizations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.