Structural Variations Contribute to the Genetic Etiology of Autism Spectrum Disorder and Language Impairments

Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by restrictive interests and/or repetitive behaviors and deficits in social interaction and communication. ASD is a multifactorial disease with a complex polygenic genetic architecture. Its genetic contributing factors are not yet fully understood, especially large structural variations (SVs). In this study, we aimed to assess the contribution of SVs, including copy number variants (CNVs), insertions, deletions, duplications, and mobile element insertions, to ASD and related language impairments in the New Jersey Language and Autism Genetics Study (NJLAGS) cohort. Within the cohort, ~77% of the families contain SVs that followed expected segregation or de novo patterns and passed our filtering criteria. These SVs affected 344 brain-expressed genes and can potentially contribute to the genetic etiology of the disorders. Gene Ontology and protein–protein interaction network analysis suggested several clusters of genes in different functional categories, such as neuronal development and histone modification machinery. Genes and biological processes identified in this study contribute to the understanding of ASD and related neurodevelopment disorders.


Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental disorder (NDD) that can cause difficulties in the everyday life of affected individuals, specifically in areas of social interaction, communication, and restrictive behavior. The Diagnostic and Statistical Manual-Fifth Edition (DSM-5) outlines the criteria for ASD diagnosis as "persistent deficits in (a) social communication and social interaction across multiple contexts, as manifested by deficits in socialemotional reciprocity, nonverbal communicative behaviors and developing, maintaining, and understanding relationships, and (b) manifested by at least two listed symptoms of restricted, repetitive patterns of behavior, interests, or activities" [1]. Studies across Asia, Europe and North America estimate that ASD affects 1-2% of the world's population. The disorder prevails in individuals across racial, ethnic, and socioeconomic groups, and the risk of ASD is about 4.5 times higher in boys than girls [2]. ASD is variable in its presentation, and has been linked to both environmental and genetic risk factors [3][4][5]. The impact of the environment on ASD development is outside the scope of this project, and we will focus on the genetic components of the ASD etiology.
Researchers estimate the heritability of ASD varies between 50% and 90% [3][4][5][6]. Multiple twin studies showed variable concordance rates for a broad ASD phenotype in monozygotic (70-90%) and dizygotic (0-30%) twins [5,7]. Current ASD research indicates that the genetics of ASD are highly heterogeneous [8]: multiple genetic factors work in concert to produce a cumulative effect in conferring ASD susceptibility, rather than any single mutation being responsible for the ASD. For example, at least four types of rare genetic risk factors have been identified: genetic syndromes (~10% cases), chromosomal abnormalities (~5% cases), copy number variations (5-10% cases), and highly penetrant gene mutations (~5% cases) [9]. In addition, ASD is polygenic, with many common variants contributing to ASD susceptibility [10]. However, the majority of ASD cases are classified as idiopathic [11] and the underlying genetic causes are unknown, especially with regard to structural variations.
Structural variations (SVs) are large changes in the genome and contribute to many disorders [12]. Depending on the size and the discovery technology, SVs are often divided into larger copy number variations (CNVs) identified by genotyping array and smaller SVs identified by sequencing, referred to as "genomic SVs" or gSVs in the following text (the term "genomic SV" is simply a label used to describe this specific dataset for the convenience of reading). CNVs are primarily more than 10 kb in length and the genotypes are reported as the copy number of a genomic region. In contrast, gSVs are usually more than 50 bp in length and can include insertions, deletions, duplications, inversions, and mobile element insertions (MEIs). Both CNVs and gSVs are known risk factors for ASD [13][14][15][16][17]. Rare CNVs are prevalent in 5-10% of all ASD cases, and typically impact at least one but potentially several genes. There is an increased global burden for rare CNVs in ASD-affected individuals when compared to controls [18], and ASD probands inherit more rare CNVs than their unaffected siblings [16]. In particular, de novo CNVs attribute to an increased risk of ASD [19][20][21]. Furthermore, gSVs, especially de novo gSVs, also showed higher burden in ASD cases than controls and contribute to ASD genetic etiology [17]. Although gSVs are usually smaller than CNVs, they are more prevalent in populations and could have major functional impact on genes, both through direct gene disruption and affecting gene expression [22][23][24]. Compared to CNVs, the contributions of gSVs to ASD have not been studied extensively.
The New Jersey Language and Autism Genetics Study (NJLAGS) collected a cohort focused on understanding the shared genetic etiology for language impairment in ASD with more common language impairments, including specific and nonspecific language impairment disorders in a family setting [25]. Specific language impairment (SLI) is a prominent language disorder that affects approximately 7-8% of English-speaking 5-year-olds in the United States [26]. The disorder manifests with difficulties in language skills with no effect on psychology, neurology, or signs of intellectual disabilities [27]. ASD has a strong language component and genetic mapping studies showed shared genetic risks between ASD and SLI [28][29][30]. Linkage analyses of the NJLAGS cohort also identified peaks for SLI in families affected by both ASD and SLI [25,31]. Therefore, examining families with both ASD and SLI individuals could identify genes that contribute to both phenotypes and potentially explain the manifestation of SLI in some family members as opposed to ASD in others [25].
In this study, we aimed to specifically assess the contribution of SVs to ASD and SLI in the NJLAGS cohort. We analyzed SVs across the size spectrum, including CNVs identified using the genotyping array and gSVs identified using whole-genome sequencing (WGS). Across the cohort, we identified variants and genes contributing to both ASD and language impairment.

Family Samples and Project Overview
The NJLAGS cohort recruited families that had at least one ASD patient. Many families also have additional individuals with language impairments (LI or RI). In this study, we focused on three phenotypes: ASD, LI* (the union of individuals diagnosed with either ASD or LI), and RI* (the union of individuals diagnosed with either ASD or RI) (see Methods Section 4.1 for phenotype definition details). These phenotype definitions allow us to examine the genetic variants contributing to ASD, as well as related language impairments. Microarray genotyping and WGS were conducted for 524 and 272 individuals from 109 and 73 families, respectively (Table 1). For this study, CNVs were identified using the microarray data, MEIs were identified using the WGS data, and gSVs were obtained from a previous study using the WGS data [32]. After merging the MEIs and gSVs into a gSV/MEI call set, the CNV and gSV/MEI call sets were annotated, filtered, and prioritized to obtain candidate variants and candidate genes. The final candidate genes from the two pipelines were combined into one gene set, which was then subjected to enrichment and pathway analysis.

Candidate CNV Identification
First, we identified candidate CNVs and associated candidate genes. The overall workflow is outlined in Figure 1. The initial genotyping was carried out in three batches. To improve the accuracy of CNV identification, we applied two CNV calling algorithms: PennCNV and QuantiSNP (see Methods Section 4.3 for detail). After merging CNVs from PennCNV and QuantiSNP with >70% reciprocal overlap, we excluded 21 individuals who had an excess number of CNVs (>19, cohort median = 7). After QC, 524 individuals from 109 families were included in the CNV analysis. In total, 2528 CNVs passed QC and filtration (see Methods Section 4.4 for details). The sizes of CNVs range from 10 kb to 6309 kb, with a median size around 45 kb. The copy number of CNVs ranges from 0 (homozygous deletion) to 4 (homozygous duplication), with CN1 (heterozygous deletion) being the most common genotype (1641 CNVs, Figure S1A). No candidate CNV had more than four copies in any individual. ASD and unaffected individuals had the same median count of CNVs (7) as did LI* and RI* individuals (6) ( Figure S1B).
ants. The top candidate CNVs (StrVCTVRE score > 0.5 and SvAnna score >1) and thei associated genes are listed in Table 3. Several CNVs are known to cause congenital neuro developmental disorders (e.g., cerebellar atrophy, mental retardation, etc.). Finally, gene overlapping candidate CNVs were extracted and filtered on brain expression patterns. The candidate genes were later combined with candidate genes from the gSV/MEI analysis fo joint analysis (see Candidate Gene Analysis section 2.5 below).  Next, we prioritized the 2528 CNVs according to the absence of benign annotation, overlap with coding region, fit of an inheritance pattern, and cohort allele frequency (see Methods Section 4.5 for details). We prioritized 174 CNVs for ASD, 196 for LI*, and 151 for RI* ( Figure 1, Table 2). The majority of candidates have both StrVCTVRE and SvAnna predicted pathogenicity scores >0, suggesting potential functional impact of these variants. The top candidate CNVs (StrVCTVRE score > 0.5 and SvAnna score > 1) and their associated genes are listed in Table 3. Several CNVs are known to cause congenital neurodevelopmental disorders (e.g., cerebellar atrophy, mental retardation, etc.). Finally, genes overlapping candidate CNVs were extracted and filtered on brain expression patterns. The candidate genes were later combined with candidate genes from the gSV/MEI analysis for joint analysis (see Candidate Gene Analysis Section 2.5 below).  We compared candidate CNVs in our patients with CNVs implicated in syndromic ASDs. One ASD patient (FAM23-003), was determined to have two CNVs overlapping the 1q21.1 duplication syndrome region, which is associated with ASD [13,14]. Both CNVs within the region are de novo heterozygous duplications (CN3): one was identified by Pen-nCNV (chr1:146497779-147826789; DUP_1 in Figure 2) that had a >50% reciprocal overlap with the 1q21.1 locus. A second CNV was identified by QuantiSNP (chr1:146089404-146769831; DUP_2 in Figure 2), which shows partial overlap with the DUP_1. In combination these two CNVs covered 84.1% of the1q21.1 duplication syndrome locus and are likely to be one CNV that each program provided a partial definition of. Medical records for this patient indicated overlapping symptoms with the 1q21.1 duplication syndrome (see Discussion for details). Therefore, patient FAM23-003's ASD diagnosis might be explained by this de novo CNV.

A Candidate Syndromic CNV in Individual FAM23-003
We compared candidate CNVs in our patients with CNVs implicated in syndromic ASDs. One ASD patient (FAM23-003), was determined to have two CNVs overlapping the 1q21.1 duplication syndrome region, which is associated with ASD [13,14]. Both CNVs within the region are de novo heterozygous duplications (CN3): one was identified by PennCNV (chr1:146497779-147826789; DUP_1 in Figure 2) that had a >50% reciprocal overlap with the 1q21.1 locus. A second CNV was identified by QuantiSNP (chr1:146089404-146769831; DUP_2 in Figure 2), which shows partial overlap with the DUP_1. In combination these two CNVs covered 84.1% of the1q21.1 duplication syndrome locus and are likely to be one CNV that each program provided a partial definition of. Medical records for this patient indicated overlapping symptoms with the 1q21.1 duplication syndrome (see Discussion for details). Therefore, patient FAM23-003′s ASD diagnosis might be explained by this de novo CNV.

Candidate gSV/MEI Identification
Next, we determined the contribution of gSVs and MEIs to the phenotypes. The overall workflow is described in Figure 3. For gSV analysis, 272 individuals from 73 families were included, including 83, 117, and 134 ASD, LI*, and RI* patients, respectively (Table  1). We identified 32,243 high-quality gSVs in the WGS data set previously [32]. Because standard gSV detection programs have low sensitivities for MEIs, we applied the program MELT to identify MEIs using the WGS data. Overall, 12,587 MEIs in 259 individuals passed QC. As expected, Alu is the dominant polymorphic MEIs in the cohort, followed by LINE-1 and SVA. The size distribution of the MEIs is shown in Figure S2. When combined, 42,950 gSV/MEI loci were included in the downstream analysis (see Methods sections 4.7 for QC and and section 4.8 for merging details). After applying the inheritance pattern and known benign variant filtering, 1816 variants were identified.

Candidate gSV/MEI Identification
Next, we determined the contribution of gSVs and MEIs to the phenotypes. The overall workflow is described in Figure 3. For gSV analysis, 272 individuals from 73 families were included, including 83, 117, and 134 ASD, LI*, and RI* patients, respectively (Table 1). We identified 32,243 high-quality gSVs in the WGS data set previously [32]. Because standard gSV detection programs have low sensitivities for MEIs, we applied the program MELT to identify MEIs using the WGS data. Overall, 12,587 MEIs in 259 individuals passed QC. As expected, Alu is the dominant polymorphic MEIs in the cohort, followed by LINE-1 and SVA. The size distribution of the MEIs is shown in Figure S2. When combined, 42,950 gSV/MEI loci were included in the downstream analysis (see Methods Section 4.7 for QC and and Section 4.8 for merging details). After applying the inheritance pattern and known benign variant filtering, 1816 variants were identified.
To identify variants contributing to each phenotype, we applied two pipelines (Figure 3), the first of which was an "AF-focused" pipeline that is focused on rare variants within the genic regions (i.e., exonic and intronic variants). The pipeline identified five exonic variants, including two for ASD (Table S4). Because of the large number of intronic variants, we applied additional filters on the overlapping genes (see Methods Section 4.9 for details). After the filtering, 32 intronic variants in 30 genes were identified for the three phenotypes.
Second, an "eQTL-focused" pipeline that identifies intronic and intergenic eQTL variants in brain tissues was applied. We identified gSVs and MEIs in our dataset that are considered causal eQTL variants in brain tissues in the GTEx project samples. A total of 26 intronic variants and 10 intergenic variants met these criteria and 29 and 12 eQTL genes were identified, respectively. The exonic variants and eQTL variants that are associated with gene expression in at least three tissues are listed in Table 4. To identify variants contributing to each phenotype, we applied two pipelines (Figure 3), the first of which was an "AF-focused" pipeline that is focused on rare variants within the genic regions (i.e., exonic and intronic variants). The pipeline identified five exonic variants, including two for ASD (Table S4). Because of the large number of intronic

Candidate Gene Analysis
The CNV and gSV/MEI pipelines each prioritized a list of candidate genes across three phenotypes. These genes were then combined into a single gene list. Table 5 shows the counts of these genes, as well as the counts of families where the candidate genes were identified from. A total of 274 candidate genes were prioritized from the CNV data and 75 genes from the gSV/MEI data, including 344 total unique genes (Table S5). Among the candidate genes, 46 were reported previously in the SFARI database, and 13 additional genes were reported in the ADHDgene database or other NDD studies (Table S5). The majority of the families (86 of 112 families,~77%) contained at least one candidate gene for each phenotype. Five genes were identified in both CNV pipeline and gSV/MEI pipelines: DIP2C, SH3D19, WWC1, DNAH17, and FN3K. DIP2C, SH3D19, WWC1, and FN3K are candidate genes for all three phenotypes, while DNAH17 is a candidate gene for ASD. Variants in DIP2C (disco interacting protein 2 homolog C) and DNAH17 (dynein axonemal heavy chain 17) have been previously linked to ASD [33,34], and WWC1 was identified as a risk gene for Tourette syndrome, another NDD known to be comorbid with ASD [35].
From the gSV pipeline, five genes were prioritized in the exonic AF pipeline, with two genes for ASD: EXOC6 and RNF213. EXOC6 (exocyst complex component 6) contains a LINE1 insertion that overlaps its exon 16 and meets the recessive segregation pattern for all three phenotypes. EXOC6 was reported in the SFARI database as a strong candidate for ASD, and GO analysis links it to the axon development. EXOC6 knockout mice demonstrate a lack of response to tactile stimuli and hyperactivity, indicating a disruption of nervous signaling. RNF213 (ring finger protein 213) contains an Alu insertion that overlaps its exon 58 and meets the recessive segregation pattern for all three phenotypes. RNF213 was reported as a candidate gene for Tourette's syndrome [35]. Besides exonic variants, many genes identified from the intronic AF pipeline are strong candidates for ASD, such as RIMS2 (regulating synaptic membrane exocytosis 2, RI*), and ANK3 (ankyrin 3, RI*).
From the eQTL focused pipeline, we identified 40 candidate genes (Table S4). For example, an Alu insertion on chromosome 18 (chr18:73953939-73954219) was identified as a causal eQTL variant for TSHZ1 (teashirt zinc finger homeobox 1) in the cerebellum samples in GTEx. The insertion segregates in two NJLAGS families for all three phenotypes (Table S4). TSHZ1 is a high-confidence ASD gene in the SFARI database and it is important for motor neuron development in mouse [36].

GO Term and Pathway Enrichment Analysis
Using the final gene list, we performed GO term (Table S6) and pathway (Table S7) enrichment analysis. We found enrichment in several GO terms related to neuronal development, including axon part, which comprises 15 candidate genes, such as the gSV/MEI exonic candidate EXOC6 (see candidate genes analysis) and a high-confidence ASD gene ANK3. Another candidate gene in the term is NRSN1 (neurensin 1). NRSN1 is identified overlapping a CNV (chr6: 24139942-24151395) in patient FAM44-005 with an RI* phenotype. NRSN1 plays a role in neurite extension and has been identified as a risk gene for reading impairment [37]. Related terms, such as "dendrite extension," "distal axon," and "nervous system development," were all likewise enriched. Another enriched GO term is "histone demethylase activity," including genes KDM4C, KDM4D, and JMJD1C, all of which are known to contribute to susceptibility for NDDs [38][39][40]. In our cohort, KDM4C and JMJD1C were prioritized across all three phenotypes, and KDM4D for the RI* phenotype.
Along with the GO term analysis, we also looked for enrichment in pathways for NDD association. The most significant pathway (q = 0.0023) was the aforementioned "1q21.1 copy number variation syndrome", discussed above in the syndromic CNV section. Likewise represented is the "complement and coagulation cascades" pathway, chiefly related to thrombin formation and blood cell development, but with some annotation in genes known to be associated with Tourette's syndrome, epilepsy, and intellectual disability.

Protein-Protein Interaction Network Analysis
Finally, we constructed a PPI network for the candidate genes to examine the shared etiology among families and phenotypes. Out of the 344 candidate genes, 116 genes connected into a single network (Figure 4, Table S8), including 29 known NDD genes.
We identified one cluster centered on hub genes DLGAP2 (DLG associated protein 2), RBFOX1 (RNA-binding protein, fox-1 homolog (C. elegans) 1), and RIMS2. DLGAP2 is a known ASD risk gene associated with dysfunction of social cognition in knockout mice [41]. RBFOX1 is a neuronal RNA-binding protein previously demonstrated to be a regulator of cytoplasmic RNA metabolism necessary for cortical development, and has also been linked to ASD [42]. RIMS2 is also a known ASD risk gene that has been discussed in the candidate genes section above. In addition to known NDD genes, the networks include genes that were not included in our NDD gene lists but showed close association with the known genes. For example, the WNT signaling pathway genes WNT3 and WNT9B have been shown to be involved in neurodevelopment and could contribute to ASD etiology [43].
been linked to ASD [42]. RIMS2 is also a known ASD risk gene that has been discussed in the candidate genes section above. In addition to known NDD genes, the networks include genes that were not included in our NDD gene lists but showed close association with the known genes. For example, the WNT signaling pathway genes WNT3 and WNT9B have been shown to be involved in neurodevelopment and could contribute to ASD etiology [43].
The network also contains many genes involved in the histone modification process. For example, genes that overlap the GO term for "histone demethylase activity" (KDM4C; JMJD1C; KDM4D) form a cluster linked by KAT2B (lysine acetyltransferase 2B). KAT2B and KAT6A (lysine acetyltransferase 6A) are lysine acetyltransferases and have been identified as NDD genes. Specifically, KAT6A is specific to the RI* phenotype and has been linked to impaired speech development [44].  The network also contains many genes involved in the histone modification process. For example, genes that overlap the GO term for "histone demethylase activity" (KDM4C; JMJD1C; KDM4D) form a cluster linked by KAT2B (lysine acetyltransferase 2B). KAT2B and KAT6A (lysine acetyltransferase 6A) are lysine acetyltransferases and have been identified as NDD genes. Specifically, KAT6A is specific to the RI* phenotype and has been linked to impaired speech development [44].

Discussion
ASD is a complex disorder with a highly heterogeneous genetic background. The current SFARI Gene database listed 1128 genes that have been associated with ASD.
Large SVs, especially de novo CNVs and gSVs, have been consistently shown to have a higher burden in ASD cases than normal controls [13][14][15][16][17]. By systematically examining the association of different types of SVs with ASD and language impairments in the NJLAGS cohort families, we identified variants and candidate genes that are associated with ASD/language impairments.
Among the ASD patients, we identified one patient who has de novo CNVs that overlap the 1q21.1 duplication syndrome region. Duplications in this region were linked to macrocephaly/microcephaly and other NDDs, including ASD [45]. The patient with the CNVs was nonverbal and unable to take any of the diagnostic batteries. The patient also has a history of erratic moods, self-injurious behavior, and obsessive-compulsive disorder. These observations are consistent with some of the 1q21.1 duplication syndrome symptoms, including impaired communication skills. The patient does not have macrocephaly or microcephaly, another common feature of 1q21.1 duplication syndrome. A clinical test would be needed to confirm the presence of the CNV and provide a clinical diagnosis for the patient. In addition to the candidate syndromic CNVs, we identified more than 250 candidate CNVs that could contribute to the phenotypes in the cohort. Consistent with previous studies, most candidate CNVs are de novo CNVs (Table S3). Because of their large sizes and potential severe impact on affected genes, these CNVs are likely to be under strong purifying selection and will not be passed down from generation to generation.
Besides large CNVs, we also examined the contribution of other SVs, including insertion, deletion, and MEIs in the cohort using WGS data. gSVs are important contributors to ASD, although their contribution has not as been extensively studied as CNVs [17]. In addition to assessing the direct impact of gSV/MEIs at the integration sites, we also identified gSV/MEIs that are potential causal brain eQTL loci and affect nearby gene expressions. gSVs have been shown to affect nearby gene expression and typically have a bigger effect than single-nucleotide variants (SNVs) [22,23]. In our cohort, we identified 36 intronic/intergenic gSV/MEIs that passed our filtering criteria and were causal variants for brain eQTLs in the GTEx project. These variants could affect the expression of the associated brain-expressed genes in the cohort and contribute to the genetic etiology of the disorders.
Combining the candidate genes from both CNVs and gSV/MEIs, we identified 344 candidate genes, including more than 200 genes that are not included in the current database. Many new genes showed enrichment in GO terms and showed connection with known NDD genes in the PPI network. The GO enrichment and PPI network analyses pointed to several biological processes that are important for ASD etiology, such as neuronal development and histone modification. Several of the terms (e.g., "histone modification", "actin binding") also showed enrichment in the cohort in our previous studies of SNVs [32,46]. These genes are strong candidates for future study of their contribution to ASD.
Our study has a few limitations. One is the moderate sample size of the cohort. The family setup of the NJLAGS cohort allowed us to define and select segregating and de novo variants, which vastly reduced the number of false-positive variants. However, the comparatively small number of unaffected individuals in the sample may have hidden the signal of enrichment for CNVs in ASD probands. Furthermore, our sample was collected in two separate waves: the first with diagnoses made under DSM-IV and the second with diagnoses under DSM-5. Individuals meeting DSM-IV criteria for Asperger's disorder would not have been considered affected in our first collection wave, having neither DSM-IV autistic disorder nor language impairment. Similar individuals would have been diagnosed with ASD under our second collection wave, due to the change in clinical diagnostic guidelines, potentially leading to a small degree of clinical heterogeneity across the two study waves. Additionally, the number of high-confidence candidate genes are limited by the sample size. Given the large number of genes involved in ASD, future study with a larger sample size could further increase the statistical power for candidate variant/gene identification.
Another limitation is that the variant discovery was based on a combination of genotyping array and short-read sequencing technologies. Using both technologies allowed us to cover most of the size spectrum and types of SVs, but neither technology is ideal for SV detection [47,48]. In particular, microarray genotyping is unable to detect inversions with neutral copy number changes. Previous studies have identified potentially causal inversions and balanced chromosomal rearrangements in certain cases of ASD [49,50]. Similar cases in our cohort would have been invisible using our current methodology. In the future, long-read sequencing technologies could provide a more complete and accurate SV profile and improve the accuracy of variant discovery [51]. Lastly, we limited our analysis to interactions among SVs. Several types of data, including coding SNVs, and microRNA variants have been identified in the NJLAGS cohort [32,46]. Integrative analysis of candidate genes from multiple types of variants could further improve the power of the analysis.
In conclusion, our multimodal analysis of SVs in the NJLAGS cohort has resulted in a list of high-confidence candidate genes that are likely to be involved in the etiology of ASD and SLI. Drawn from CNV, gSV, and MEI results, we prioritized a set of genes and biological processes that emphasize the effect of neuron growth and histone modification in ASD patients. Experimental study of the function and effect of these candidate genes could further clarify their role in the underlying mechanisms of ASD and its related phenotypes.

Family Selection and Phenotyping
The samples were collected by the NJLAGS [25] and the study was approved by the Institutional Review Board at Rutgers, The State University of New Jersey (IRB: 13-112Mc). NJLAGS gathered data from individuals with ASD and their family members to conduct genetic analysis. To assess their condition, each person was evaluated for ASD, oral language impairment (LI), written language impairment or reading impairment (RI), and social responsiveness during recruitment. The diagnostic process and evaluation criteria have been described in detail previously [25]. Briefly, ASD diagnoses were based on a combination of three resources: (1) the Autism Diagnostic Interview (ADI-R), (2) the Autism Diagnostic Observation Schedule (ADOS), and (3) either the Diagnostic and Statistical Manual of Mental Disorders IV (DSM-IV) or DSM-5, depending on date of the assessment (see [25] for details). In this study, LI is defined as either (1) receiving a score of either less than 85 on the Clinical Evaluation of Langue Fundamentals-4 (CLEF-4) test or (2) a history of language/reading difficulties with at least a score one standard deviation (SD) below their peers on at least 60% of oral language subtests. RI is defined as a score of one SD below the mean of 60% on all reading tests. Ascertainment of families in wave 1 (N = 79 families) required one person with DSM-IV autistic disorder and an additional family member with SLI, while wave 2 (N = 32 families) required one person with DSM-5 ASD and an additional family member with LI or RI [32]. To determine any shared genetic causes of ASD and LI, three phenotypes were used to define affected individuals, similar to a previous study [25]: ASD, the union of individuals diagnosed with either ASD or LI (termed LI* in the following text), and the union of individuals diagnosed with either ASD or RI (termed RI* in the following text). Because of our interest in shared genes between these conditions, the union phenotypes allow for the analysis of inherited genes that contribute to different diagnoses across generations. For example, a gene that contributes to the LI phenotype in a parent might contribute to the language phenotype in their child with ASD. These combined categories are useful for prioritizing shared genes as well as identifying de novo variants.

Microarray Genotyping and Quality Control
The genetic material of the NJLAGS cohort was housed and maintained at the National Institute of Mental Health (NIMH) Repository and Genomics Resource (NRGR), with the genetic and clinical data incorporated into the NRGR Autism Collection and the NIMH Data Archive (NDA). For microarray genotyping, samples were genotyped using two types of genotyping microarrays: Affymetrix Axiom 1.0 Genome-Wide CEU 1 (Affymetrix, CA, USA) and Illumina Infinium PsychArray (Illumina, CA, USA). The Illumina genotyping was performed in two batches (Table S1). The microarray genotyping was performed at Rutgers University Cell and DNA Repository (RUCDR, Piscataway, NJ, USA) following the manufacturer's protocols. The single-nucleotide polymorphism (SNP) genotyping and quality matrix (e.g., log R ratio and B allele frequency) calculations were performed with the Axiom Analysis Suite (v4.0.3.3) and GenomeStudio Software (v2.0.4) for the Affymetrix and Illumina microarrays, respectively.
At the sample level, for the Axiom dataset, samples that had a call rate <0.95 were removed. For the PsychArray dataset, the protocol outlined in [52] for Illumina genotyping arrays was followed and all samples passed quality control (QC). At the SNP level, the Axiom samples were processed using "Best practices workflow" of Axiom Analysis Suite (v4.0.3.3) software [53], and probes that were not labeled as "best and recommended" by the software were removed. PLINK [54] was then used to filter out SNPs with <98% genotype rate (geno 0.02) or failing the Hardy-Weinberg equilibrium test at a p-value of 0.001 (hwe 0.001). The PsychArray samples were processed using the "Best practices workflow" in [52]. Probes with a GenTrain score less than 0.7 were excluded. Similarly to the Axiom samples, PLINK [54] was then used to filter out SNPs with <98% genotype rate or failing the Hardy-Weinberg equilibrium test at a p-value of 0.001.

CNV Identification and Quality Control
To improve the CNV identification accuracy using the microarray data, two CNV identification algorithms were used: PennCNV [55] and QuantiSNP [56]. For PennCNV, a PFB file containing the population frequency of B alleles and SNP genome coordinates was created for SNPs that passed genotyping QC. Following instructions in the PennCNV protocol, a new HMM file (Hidden Markov Model) was trained using the first 100 samples in each batch, and the resulting model file was used for CNV calling. For QuantiSNP, library files and GC files included in the program were used.
Raw CNVs were then filtered to remove CNVs that were too small to be reliably called (<10 kb), too large that they were most likely to be a chromosomal aberration or cell-line anomalies (>7.5 Mb), or called by too few probes (<5 probes). For QuantiSNP calls, CNVs with a max log Bayes factor score <10 were removed. Samples with CNV counts that were significantly greater than the rest of the samples (>median + 1.5 interquartile range) were considered outliers and removed. Sex chromosomes were excluded from the downstream analysis. Raw genotyping data and the CNV call set were deposited to the NDA under collections C1932 and C2933 and NRGR under study 39.

CNV Merging, Annotation, and Segregation Analysis
The following CNV analyses were carried out in a customized pipeline (https:// github.com/JXing-Lab/NJLAGS_SV/tree/master/CNV_Pipeline, 20 August 2023). First, the two CNV call sets for each of the three batches were combined. The merging was performed by the custom script CNV_Builder.py. Positions were considered overlapping and merged if they shared at least a 70% reciprocal overlap with each other, with their outermost breakpoints used to define the region. Nonreciprocal overlaps, such as one variant being entirely encompassed by another but accounting for <70% of the second variant, were considered separate and were not merged. The merged CNV calls were converted into a standard variant call format (VCF) file.
To annotate CNVs for their inheritance patterns, the GEMINI (GEnome MINIng) framework was used. GEMINI is an SQL-based framework that allows querying allele information, including allele segregation in families [59]. Because GEMINI was not designed to work with CNVs, a Python program (Geminesque_2023.py) was written to apply segregation logic similar to GEMINI. The segregation pattern for each CNV was then annotated and filtered using Geminesque_2023.py and CNVs were assigned to the following inheritance patterns: de novo, autosomal recessive, and autosomal dominant. Criteria were similar to the "strict" setting in GEMINI: for de novo, parents cannot be affected, the affected child must be heterozygous for the variant, and no unaffected individuals can have the alternative allele. For autosomal recessive, all known parents must be unaffected and heterozygous for the alternative allele, while the affected children must be homozygous for the alternative allele. For autosomal dominant, affected children must have at least one affected parent and unaffected individuals cannot be homozygous or heterozygous for the alternative allele.

CNV Filtration and Prioritization
Subsequent steps in the filtration and prioritization processes were conducted using a custom Python script-CNV_Prioritizer.py. Candidate CNVs were filtered according to the following criteria: present in at least one affected individual; overlap with proteincoding regions; and segregated with a defined inheritance pattern. Variants with AnnotSV annotation for the benign variants (i.e., annotations in "B_gain_source," "B_loss_source," "B_ins_source," or "B_inv_source") were filtered out. For CNVs identified as a de novo variant, variants present in more than two individuals in the cohort were filtered out.
Given the neurological basis of ASD and SLI, candidate genes were filtered based on expression in brain tissues. A candidate CNV needed to contain at least one gene that is expressed in brain tissue with >5 TPM (transcripts per million). The gene expression in brain tissues was obtained from three databases: Gene Tissue Expression Project (GTEx) [60,61], the BrainSpan Atlas of the Developing Human Brain [62], and the Human Developmental Biology Resource (HDBR) [63].

Whole-Genome Sequencing Data and the gSV Call Set
WGS was performed on 272 individuals across 73 families in four batches by three vendors (Table S1), as described previously [32]. The gSV call set was generated using MetaSV [64]. For details of sequencing and gSV identification, see [32]. Raw sequencing data and the gSV call set were deposited to the NDA under collections C1932 and C2933 and NRGR under study 39.

Mobile Element Insertion Identification and Filtering
Three types of MEIs-Alu, LINE1, and SVA-were identified from the WGS data using MELT (V2.1.5) [65], as described previously [22]. MEIs that are present in the sequenced individual but not the reference genome are defined as "MEI Insertions." MEIs that are present in the reference genome but not the individual are defined as "MEI Deletions." The "MELT-SPLIT" and the "MELT-Deletion" modes were used under default settings to identify MEI Insertions and Deletions, respectively. The six MEI VCF files (three insertions, three deletions) were concatenated into one composite VCF using bcftools [66].
A series of filters were then applied to the combined MEI call set to reduce false positives. Thirteen individuals from the Knome sequencing batch (Table S1) had high no-call rates (average 43.8%) compared to the rest of the batches (average 0.09%), indicating poor variant calling quality. Therefore, these individuals were removed from the dataset. Next, loci with a missing rate >5% or a Hardy-Weinberg equilibrium test p-value < 1 × 10 −20 were removed. MEI Insertions were further filtered for loci with MELT ASSESS score ≥3, VCF "FILTER" column with "PASS" or "rSD," and split reads >2. CrossMap (v. 0.2.7) [67] was then used to lift over the genomic coordinates of the loci from the human reference genome version GRCh38 to GRCh37/hg19.

Merging gSV and MEI Call Sets
The following gSV and MEI analyses were carried out in a customized pipeline (https: //github.com/JXing-Lab/NJLAGS_SV/tree/master/gSV_Pipeline, 20 August 2023). SUR-VIVOR [68] was used to merge the gSV and the MEI call sets in two steps. First, gSVs and MEIs from the same individual were merged using the following parameters: (1) breakpoint distance ≤ 100 bp, (2) supported by at least one caller, (3) agree on SV types, (4) agree on strand, and (5) no minimum length requirement. Once one gSV/MEI file was generated for each individual, all individual gSV/MEI files were merged using the same parameters to generate a combined gSV/MEI call set. This combined call set will be referred to as "gSV/MEI call set" in the following text.

gSV/MEI Annotation, Segregation Analysis, and Filtering
AnnotSV [57] was used to annotate the merged gSV/MEI call set using the default parameters. Variants located in known benign regions of the genome, as defined by AnnotSV, were filtered out. GEMINI v0.20.1 [59] was used to identify the inheritance patterns of gSVs, including autosomal recessive, autosomal dominant, de novo, X-linked recessive, X-linked dominant, and X-linked de novo. Variants that met any of these segregation patterns were included in the candidate prioritization.
Population allele frequency (AF) data was obtained from two sources. For MEIs, AF were extracted from the previous study of GTEx samples [22], allowing a max of 100 bp distance between the GTEx MEIs and our MEIs. For all variants that had no match in the GTEx MEI data set, the gnomAD SVs 2.1 database [69] was used to extract population AF, allowing a max of 100 bp distance between the gnomAD variants and our gSV/MEIs. A list of known NDD genes was obtained from a previous study [32] and SFARI Gene (https://gene.sfari.org//wp-content/themes/sfari-gene/utilities/download-csv. php?api-endpoint=genes, accessed on 19 May 2023) (Table S2). Gene biotype annotations were retrieved from GENCODE (v14).

gSV/MEI Candidate Prioritization
Filtered variants were partitioned based on their locations in the genome: exonic, intronic, and intergenic using the AnnotSV annotation. Intergenic variants were identified as variants with a null value in their "Gene_name" field. Intronic variants were identified as variants that had an annotated gene name and satisfied two criteria: (1) the "Location" field was not null, and (2) the start and end locations were within the same intron. Exonic variants were identified as variants that had an annotated gene name and (1) "Location" fields not null and (2) start and end locations in an exon or across multiple introns/exons. Based on the variant location, two types of candidate prioritization pipelines were performed: AF-focused and eQTL-focused. AF-focused filtering identifies rare variants (in our sample and in the general population) that affect the gene they are located in. eQTL-focused filtering finds variants that are known to have a causal impact on a gene expressed in brain tissues (i.e., a brain eQTL variant in GTEx samples).
Exonic variants are likely to disrupt the function of the overlapped gene; therefore; the AF-focused pipeline was applied. Intergenic variants do not overlap any gene; so only the eQTL-focused pipeline was applied. For intronic variants, both AF and eQTL-focused pipelines were applied.
The AF-focused prioritization pipeline applied three filters: (1) TPM, (2) Sample AF, and (3) Population AF. For the TPM filter, the expression levels of a gene in brain tissues were obtained from three brain expression databases as described in the CNV section. A gene is considered expressed if any of the three TPM values was >5. Variants overlapping only genes that are not expressed in brain tissues were excluded. For the "Sample AF" filter, variants whose sample AF is higher than 5% were excluded. Similarly, the "Population AF" filter excluded variants whose population AF was higher than 5%. Due to the large number of intronic AF-focused candidates, two additional filters were applied: False DeNovo and NDD. The "False DeNovo" filter excluded variants that had a de novo inheritance pattern across multiple affected individuals. Under the "NDD" filtering, variants overlapped genes that were not in a known NDD gene list (Table S2 in [70] and SFARI Gene) were excluded.
The eQTL-focused prioritization pipeline applies two filters: (1) Brain eQTL, and (2) Brain eQTL gene. The "Brain eQTL" filter selects variants that are considered brain eQTL variants in GTEx samples (see "Brain eQTL" section above for detail). The "Brain eQTL gene" filter removes variants whose eQTL gene is not expressed in brain tissues, as defined in the TPM filter above.

Pathogenicity Prediction
Two pathogenicity predictors were used to determine the potential pathogenicity of the candidate variants: StrVCTVRE [71] and SvAnna [72]. StrVCTVRE (Structural Variant Classifier Trained on Variants Rare and Exonic) is a random-forest classifier designed to distinguish between benign and pathogenic SVs [71]. SvAnna is a pathogenicity predictor built off human phenotype ontology (HPO) terms and provides a pathogenicity of structural variation (pSV) score based on sequence deleteriousness and phenotype similarity [72]. Both StrVCTVRE and SvAnna were run with default parameters.

Gene Ontology and Pathway Enrichment Analyses, Protein-Protein Interaction Network Analysis
Gene Ontology (GO) and pathway enrichment analyses were performed using Consen-susPathDB [73]. GO terms and pathways with a false-discovery rate <0.1 were considered enriched and GO terms with <500 total genes within the group were selected to increase the specificity of the enrichment results.
Protein-protein interaction (PPI) networks were built for the final candidate genes using Python package NetWorkX [74]. ConsensusPathDB [73], STRING [75], and GI-ANT_v2 [76,77] were used to generate a list of known gene interactions. Detailed data processing procedures have been described previously [46,70]. Funding: This study was supported by NIMH grants R01 MH-070366 and RC1 MH-088288 to LMB and New Jersey Governor's Council for Medical Research and Treatment of Autism grant CAUT12APS006 to LMB and CAUT19APL028 to JX and LMB. Biomaterial processing and storage was provided by the NIMH Repository and Resource, supported by cooperative agreement U24 MH068457.
Institutional Review Board Statement: The study was approved by the Institutional Review Board at Rutgers, The State University of New Jersey (IRB: 13-112Mc).
Informed Consent Statement: Informed written consent or assent was obtained from all subjects involved in the study.

Data Availability Statement:
The raw sequencing reads, variants, and genotypes for all samples are available in the National Institute of Mental Health Data Archive (NDA) under collections C1932 and C2933 and NRGR under study 39.

Conflicts of Interest:
The authors declare no conflict of interest.