Whole Exome Sequencing of 23 Multigeneration Idiopathic Scoliosis Families Reveals Enrichments in Cytoskeletal Variants, Suggests Highly Polygenic Disease

Adolescent idiopathic scoliosis (AIS) is a lateral spinal curvature >10° with rotation that affects 2–3% of healthy children across populations. AIS is known to have a significant genetic component, and despite a handful of risk loci identified in unrelated individuals by GWAS and next-generation sequencing methods, the underlying etiology of the condition remains largely unknown. In this study, we performed exome sequencing of affected individuals within 23 multigenerational families, with the hypothesis that the occurrence of rare, low frequency, disease-causing variants will co-occur in distantly related, affected individuals. Bioinformatic filtering of uncommon, potentially damaging variants shared by all sequenced family members revealed 1448 variants in 1160 genes across the 23 families, with 132 genes shared by two or more families. Ten genes were shared by >4 families, and no genes were shared by all. Gene enrichment analysis showed an enrichment of variants in cytoskeletal and extracellular matrix related processes. These data support a model that AIS is a highly polygenic disease, with few variant-containing genes shared between affected individuals across different family lineages. This work presents a novel resource for further exploration in familial AIS genetic research.


Introduction
Adolescent idiopathic scoliosis (AIS) is a structural lateral spinal curvature ≥10 • that affects 2-3% of healthy children [1], with females at the greatest risk for severe progression [2][3][4]. In individuals with severe progressive curvatures, life-long problems of cosmetic deformity, respiratory compromise, back pain, and degenerative disease often arise and in many cases require surgical intervention, placing a significant economic burden upon our healthcare system [5]. AIS is known to be highly heritable, however, our knowledge of its etiology is severely limited, both in terms of the individuals at risk for curve initiation and those likely to experience curve progression. Understanding key risk variants or genetic pathways leading to AIS holds the potential to improve patient care by way of targeted clinical treatments or prognostics for detecting AIS risk or risk of curvature progression.
Disease susceptibility variants for complex diseases may collectively be common in the general population, but specific variants may be rare within families or affected individuals. Family studies present unique advantages over case-control studies, as they may reveal rare disease-associated variants enriched within the family that are amenable to targeted sequencing. Sequencing studies of multiple large families can reveal multiple variants within a gene or molecular pathway that contribute to the disease phenotype [44,45].
Within our laboratory, a pilot exome sequencing study of five AIS families identified an enrichment of damaging uncommon variants in cilia, extracellular matrix, and cytoskeletal genes, although no specific variants or genes were found to be present across all families [46]. This led us to propose the hypothesis that the development of AIS may be due to damaging variants within a specific set of pathways or molecular classes, rather than being driven by just a few select 'AIS genes'. In this study, we expand upon our previous work and present exome sequencing of affected individuals from 23 AIS families, interpreted using gene enrichment analyses to identify overrepresented functional categories. We then investigate specific genes containing variants in multiple families via genotyping confirmation in additional affected and unaffected members of the family to assess how closely these variants track with the AIS phenotype.

Materials and Methods
An overview of the methodology for this study-including subject enrollment, sample collection, sequencing, and bioinformatic filtering strategies-are provided in Figure 1.

Subjects
Study subjects were enrolled as previously described [29,38,46]. Inclusion in the sequencing pool required a standing anteroposterior spinal radiograph showing ≥10 • curvature by the Cobb method with pedicle rotation, and no evidence of congenital deformity or other co-existing genetic disorders [47][48][49].
Blood samples were collected from all study subjects as described previously [38]. DNA was then extracted from whole blood using standard phenol chloroform protocols or the QIA-GEN Gentra PureGene Blood Kit. DNA quality was verified by Qubit (Invitrogen, Waltham, MA, USA), agarose gel electrophoresis and Nanodrop (Thermo Scientific, Waltham, MA, USA).

Family Selection for Exome Sequencing
Twenty-three large families were selected through a tiered process wherein each pedigree was reviewed and evaluated by five project experts. Selection was based on the number of affected individuals in the family, the severity of their scoliosis curvatures, and the estimated genetic relationship between enrolled, affected individuals. Families with other musculoskeletal conditions or AIS from multiple sides of the family were excluded.
Three to five affected individuals were selected for sequencing per family based on the degree of hypothesized genetic distance between them, availability of high-quality DNA, and severity of spinal curvature. Pedigrees for these families and a summary of clinical information for all sequenced individuals are provided in Supplementary Files S1 and S2. On the pedigrees, degree of spinal curvature as measured by Cobb angle is indicated in numbers below affected individuals (i.e., 20D). Individuals with known double curves have both listed (i.e., 26/30D), as are any triple curvatures. All individuals with listed curvatures were examined radiographically, and any found to be unaffected on physical exam are labeled as "confirmed negative". Proband for each family is indicated with an arrow.

Whole Exome Sequencing
Exome capture was completed using 1 µg of genomic DNA from 86 individuals across 23 families using the Agilent SureSelect Human V5 (51 Mb) exon capture kit. Samples were sequenced with a 2 × 100 bp run on an Illumina HiSeq 2500 at the Otogenetics Corporation facility in Atlanta, GA, with a minimum average coverage of 50X guaranteed per sample.

Bioinformatic Filtering
Whole exome reads were aligned to GRCh38 and variants were identified with Free-Bayes as previously described [38,46]. Candidate variants were filtered by SnpEff (version 4.1g) [50] along with custom scripts to retain only non-synonymous SNPs, coding indels, and variants affecting splice sites. Known artifacts and variants whose frequency was greater than 0.05 in the ExAC database (r0.3) [51] were also stripped. If the variant was annotated in the dbNSFP database (version 3.0) [52,53], it was retained only if at least one of the included variant prediction algorithms (SIFT, Polyphen2, LRT, MutationTaster) scored it as "damaging", signifying that the resulting change to the encoded protein had a predicted functional consequence. Variants that were not shared by all sequenced members of the family were not retained. Variants with a Minor Allele Frequency (MAF) < 0.05 that remained after the above filters were applied were retained for further analysis in separate gene sets. This MAF threshold was intentionally set higher than typical rare variant thresholds to account for the prevalence of the disease (2-3% of the general population), working under the hypothesis that low frequency variants may contribute to the high prevalence of this disease.

Genotyping
Variants appearing in multiple families that were present in the GO functional categories for "cytoskeleton" or "extracellular matrix" (or related terms) were prioritized for genotyping. Additional enrolled affected and unaffected family members were sequenced at the variant site by the Sanger method to establish whether the variant segregated with the AIS phenotype.
Pedigrees for each sequenced family are provided in Supplementary Files S1 and were created using PedigreeXP software (PC Pal, https://www.pedigreexp.com (accessed on 16 June 2021)).

Gene Set Overrepresentation Analyses
Gene set overrepresentation analyses (GOA) were performed on MAF < 0.05 gene lists with duplicate genes removed (n = 1160) from the combined list from each family. Both DAVID and EnrichR websites were used, as described below.
DAVID: DAVID (Database for Annotation, Visualization and Integrated Discovery) v6.8 was used to identify the significant GO terms and clusters in each gene list (https: //david.ncifcrf.gov, accessed on 20 April 2021) [54,55]. The input gene list of 1160 genes resulted in 1146 DAVID IDs. Functional annotation clustering was used on our dataset with default settings and the GOTERM_All and GOTERM_CC_Direct annotation categories.
Enrichr: The same input gene lists as used for DAVID were used in Enrichr, a gene enrichment software developed by the Ma'ayan laboratory [56,57] (https://maayanlab. cloud/Enrichr/, accessed on 27 April 2021). Additionally, family-specific gene lists were separately inputted into EnrichR. We report results from the 2018 GO term Cellular Component and KEGG pathways. Volcano plots and charts were generated with Appyter (https://appyters.maayanlab.cloud/#/, accessed on 27 April 2021).

Results
To identify rare and low frequency variants associated with familial AIS, we performed whole exome sequencing on 23 multigenerational IS families (3-5 individuals per family, 86 individuals in total). Pedigrees for all families are provided in Supplementary Files S1, and clinical information for sequenced individuals is provided in Supplementary Files S2. Whole exome sequencing was performed using DNA extracted from whole blood, as described in the Methods. Illumina HiSeq reads were mapped to the human reference genome (hg39), and a minimum of 50X average coverage was obtained for each sample.
We next searched for genes containing variants in multiple families, with the hypothesis that shared genes would be more likely to contribute to the IS phenotype. Most variant-containing genes (n = 1028 of 1160 total, 88.6%) were specific to only one family. 132 genes were shared by at least 2 families, 38 were shared between 3+ families, and 10 were shared by 4+ families. No genes were shared among all families. Table 1 provides a list of genes found within multiple families. Of this list, four genes had a previously observed association with either idiopathic, degenerative, or infantile scoliosis (DLL3 [58,59], AHNAK [60], TTN [61], ANKRD11 [62]).
Although few genes were shared across families, we hypothesized that AIS families may share an enrichment of variants in specific functional categories. To identify categories of genes which were overrepresented with damaging variants within the AIS families, we performed gene set overrepresentation analysis (GOA) of our resulting filtered gene lists. We entered the MAF < 0.05 gene lists from individual families and combined data from all families into DAVID to identify overrepresented gene ontologies (GO Terms) ( Table 2). The most overrepresented GO Cell Component categories across all families using DAVID were "microtubule" (p = 1.62 × 10 4 , 1.98 fold enrichment), "slit diaphragm" (p = 8.88 × 10 4 , 10.66 fold enrichment), and "intermediate filament" (p = 8.88 × 10 4 , 2.57 fold enrichment). Categories within extracellular matrix (ECM) GO terms, including "proteinaceous extracellular matrix" and "collagen trimer" also showed mild enrichments (Table 2) and the top enriched KEGG term was "ECM receptor activation" ( Figure 2E). A full list of all GO functional categories is provided in Supplementary Files S2. We also used GOA to analyze an overrepresentation of cytogenic bands, as specific chromosomal regions have been linked with IS development by linkage analyses [63][64][65][66][67][68][69][70][71]. The most overrepresented band was 16p13.3 (p < 0.001), as described in Supplementary Files S2. This region was previously identified as significant in familial linkage analyses of a large AIS familial cohort. A neighboring region, distal chromosome 16p11.2 duplication, has been recently identified as a significant risk factor for severe AIS.   Next, we performed GOA on each family individually, to determine whether the cytoskeletal functional categories seen in the combined gene list were driven by a single family or many families. Overall, a cytoskeletal GO term was detected among the significantly enriched terms in every family except Family H, J, and O (Supplementary Files S2). Every family possessed at least one variant in a cytoskeletal gene.
We next selected variants of interest for genotyping to determine whether the variant segregated with the AIS phenotype within each relevant family. Specific variants were genotyped if they were present in functional categories of interest (cytoskeletal or extracellular matrix GO terms), and if additional affected and unaffected members of the family were enrolled in our study. Ten genes (ANKRD11, COL21A1, COL6A5, FGD1, NPHP4, OBSCN, TNXB, CTNNA3, NTRK1, and PDE4DIP) containing multiple variants across families were genotyped. Of these, variants within TNXB, CTNNA3, NTRK1, and PDE4DIP showed segregation of the variant with the IS phenotype in additional affected and unaffected family members (Supplementary Files S2). TNXB (tenascin XB) localizes to the major histocompatibility region on chromosome 6 and encodes an ECM glycoprotein and has been associated to classic Ehlers-Danlos syndrome (EDS), a connective tissue disorder with scoliosis as one characteristic of the phenotype [72]. The TNXB variants within our dataset (Family G: TNXB:NM_019105:exon24:c.C8192G:p.P2731R; Family S: TNXB:NM_019105:exon12:c.G4444A: p.V1482M) do not appear in the EDS variant database to date [73][74][75]. CTNNA3 (Catenin α 3) encodes a protein within the vinculin family of cell-cell adhesion proteins, and has been linked with certain types of cardiomyopathy [76]. NTRK1 (Neurotrophic Receptor Tyrosine Kinase 1) encodes a protein that binds neurotrophin peptide and participates within neuronal cell differentiation and specification of neuronal cell subtypes [77]. PDE4DIP encodes myomegalin, which anchors phosphodiesterase 4D to the Golgi/centrosome regions. An isoform of myomegalin was recently shown to form a complex with AKAP9 and CDK5RAP to link the pericentrosomal complex to the microtubule-nucleating complex [78]. The remaining variants showed no segregation or only partial segregation of the variant with the AIS phenotype.

Discussion
In this study, we report an enrichment of predicted damaging variants in cytoskeletal and extracellular matrix (ECM) functional categories within adolescent idiopathic scoliosis (AIS) families. Additionally, this cohort showed minimal overlap in specific genes across families. Our results support and add to the growing evidence that AIS is a highly polygenic disorder [7,8,28,79,80] in which multiple variants of variable effect size, potentially in combination with epigenetic and environmental phenomenon [81], contribute to the disease phenotype.
The ECM is a dynamic molecular network of proteoglycans, glycoproteins, minerals, and related proteins that plays a critical role within musculoskeletal tissues [82]. The ECM provides critical structural networks and scaffolding to tissues, and contributes to cellular signaling, growth, and repair [83]. Studies of genes encoding the "matrisome", the proteins composing and relating to the ECM, have identified dozens of causal ECM mutations for connective tissue and musculoskeletal disorders [83]. A polygenic burden of rare variants in musculoskeletal collagen genes was linked to AIS risk in a large casecontrol cohort [28], and our group previously showed mild variant enrichments in our pilot exome sequencing study of five AIS families [46]. Candidate ECM genes including HSPG2 [29] and FBN2 [30] have also been linked to AIS in specific cohorts. Several collagen genes (COL8A2, COL4A3, COL6A5, COL27A1, COL7A1, COL21A1, COL9A2, COL9A3, COL4A6) and perlecan (HSPG2) appeared in our families (Supplementary Files S3). Candidate studies of ECM genes in AIS [63,[84][85][86] were launched, in part, because scoliosis is a common phenotype of monogenic connective tissue disorders including Marfan and Ehlers-Danlos syndromes (EDS) [72,74]. TNXB (tenascin XB) variants, which have been associated with classic EDS, were found in two families in our current cohort as well as one family in our previous study [46]. The TNXB variant within Family G (TNXB:NM_019105:exon24:c.C8192G:p.P2731R) appeared in the Leiden Open Variation Database for EDS but was predicted as "benign" (https://databases.lovd.nl/shared/ variants/0000313458#00021614, accessed on 4 May 2021). Functional data will be required to determine the pathogenicity of these variants. In a recent review, Wise et al. proposed that genetic variants affecting the cartilage matrisome, specifically of the intravertebral disc, may be involved in a subset of AIS cases [33]. Our results provide support to polygenic variants in ECM genes as a contributing factor to AIS in some family lineages.
The cytoskeleton comprises actin, microtubule, and intermediate filament networks and is responsible for a multitude of functions including molecular transport, cellular stability, molecular signaling, cell migration, and cell division [87]. The cytoskeleton and ECM are intricately connected within musculoskeletal tissues and play important roles in mechanotransduction, tissue stability and response to biomechanical loading [88,89]. Every family within our dataset had at least one cytoskeletal genetic variant, and 20/23 families showed an enriched cytoskeletal GO term (Supplementary Files S2). Variants in the centriolar protein POC5 have been associated with scoliosis through human and animal studies [90][91][92]. The cytoskeletal kinesin kif6 was shown to be necessary for proper spine development in zebrafish [34], and mutations in kif6 also appeared in a zebrafish ENU genetic screen for scoliosis [93].
Defects in cilia, microtubule-based projections critical for cell signaling and fluid flow, have been linked to scoliosis in animal models [31,34,35,37,94,95]. A variant in the ciliary kinesin KIF7 was found within one AIS family and specific KIF7 mutations produced scoliosis in zebrafish [38]. Additionally, an overrepresentation of cilia variants was observed in our pilot exome sequencing cohort [46]. However, functional roles of the cytoskeleton and ECM in the development of IS have not yet been demonstrated in humans.
This study contains several limitations. Our familial cohort is relatively small and our ability to understand the impact of rare or low frequency variants in relation to the expression of a complex genetic disease, such as AIS, is limited [96]. Low frequency variants frequently lie outside the scope of large statistical association studies, and therefore may contribute to the "missing heritability" that accompanies many complex traits [97]. The majority of our discovered variants are heterozygous, and we do not know without functional testing whether these variants have a true dominant-negative effect on the resulting protein. To confirm our observed functional pathway enrichments, validate the presence of specific genetic variants, and obtain adequate statistical power, a larger cohort of AIS families must be sequenced. Additional functional studies of specific variants, with particular focus on those most likely to be damaging (i.e., stopgain, frameshift mutations) will be required to demonstrate causality [93].
Collectively, our results provide support to the hypothesis of ECM and cytoskeletal involvement in AIS etiology through what is, to our knowledge, the largest sequencing study of AIS families to date. These results suggest that there are many specific genes that can collectively increase disease risk, although there may be affected pathways that are shared across families. We hypothesize that (1) individual AIS families harbor low frequency mutations in different functional categories, resulting in different subsets of AIS, and/or (2) individuals with AIS require mutations in multiple gene categories (i.e., the cytoskeleton and ECM), resulting in mild dysfunction across several molecular pathways, to cause disease. Specifically, these results suggest that mild mutations in cytoskeletal or ECM genes may play a role in AIS etiology. Ultimately, this work will assist in the ability to predict the onset of adolescent idiopathic scoliosis and the risk of progressive disease and, thus, lead to the development of more personalized treatments for individuals with AIS.

Conclusions
This work presents a novel set of candidate variants found in affected individuals from 23 AIS families. In agreement with our previous WES study of five AIS families, we observed an overrepresentation of variants in cytoskeletal and ECM functional categories, with few specific genes shared across families. Overall, this work paints a picture of AIS genetic etiology as highly polygenic and specific to individual family lineages. An analytical approach that integrates data from family-based sequencing with genetic association studies, an understanding of study population variation, population stratification and genetic heterogeneity, and advances in clinical phenotyping will enhance our ability to define the genetic complexity of this disorder.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/genes12060922/s1: File S1: Family Pedigrees. Degree of spinal curvature is indicated with a D (degrees). Individuals marked "confirmed positive" were confirmed as affected with AIS but were unable to provide us with an X-ray for measurement. Exome sequenced individuals are indicated. File S2: Table S1: Clinical information from all exome sequenced individuals; Table S2: Enriched cytobands from MAF < 0.05 gene list; Table S3: Genotyping of genes in multiple families and in functional categories of interest; Table S4: Full Gene Ontology (GO) Term enrichment results (p < 0.01) for all functional categories using DAVID; File S3: Detail of all filtered variants, sorted by family. All filtered variants had a minor allele frequency (MAF) < 0.05, were present in all sequenced members of the family, and were predicted to be damaging to the resulting protein by at least one algorithm. Details of the exome sequencing strategy and bioinformatic filtering process are provided in the Methods. Funding: Funding for NHM's laboratory and this study was provided by NIH/NIAMS R01AR068292. Additional funds were provided by Children's Hospital Colorado.
Institutional Review Board Statement: Written informed consent was obtained from study subjects who were enrolled in accordance with protocols approved by the Johns Hopkins School of Medicine Institutional Review Board and the University of Colorado Anschutz Medical Campus Institutional Review Board (Colorado Multiple Institutional Review Board, Studies #06-1161 and #07-0417). All procedures involving human participants were performed in accordance with the ethical standards of these institutional review boards, the 1964 Declaration of Helsinki and its later amendments, or comparable ethical standards.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The authors affirm that all data necessary for confirming the conclusions of this article are represented fully within the article and its supplementary files, including the complete lists of filtered variants for each family.