Whole Exome Sequencing in Females with Autism Implicates Novel and Candidate Genes

Classical autism or autistic disorder belongs to a group of genetically heterogeneous conditions known as Autism Spectrum Disorders (ASD). Heritability for ASD is estimated to be as high as 90%, and a recently reported compilation lists 629 clinically relevant candidate and known genes. We undertook a descriptive next generation whole exome sequencing case study of 30 well-characterized Caucasian females with autism (average age, 7.7 ± 2.6 years; age range, 5 to 16 years) from multiplex families. Genomic DNA was used for whole exome sequencing via a paired-end next generation sequencing approach and to determine X chromosome inactivation status. The list of putative disease causing genes was developed from primary selection criteria using a machine learning-derived classification score and other predictive parameters (GERP2, PolyPhen2, and SIFT). We narrowed the variant list to 10 to 20 genes and screened for biological significance including neural development, function and known neurological disorders. Seventy-eight genes met the selection criteria, with 1 to 9 filtered variants per female. Five females presented with functional variants of X-linked genes (IL1RAPL1, PIR, GABRQ, GPRASP2, SYTL4). The cadherin, protocadherin and ankyrin repeat gene families were most commonly altered (e.g., CDH6, FAT2, PCDH8, CTNNA3, ANKRD11). Other genes related to neurogenesis and neuronal migration (e.g., SEMA3F, MIDN) were also identified.

biospecimens from over 2000 families (www.AGRE.autismspeaks.org). Most families had two or more affected children with autism. Identification of causative mutations (e.g., serotonin-related gene mutations and disturbed biology) could be important to guide selection of treatment options and medication use as well as to manage medical co-morbidities such as seizures, developmental regression (e.g., MECP2 gene) or cancer (e.g., PTEN gene). A cursory autism database search revealed a large body of publications, particularly since 2008, linking autism to a wide range of genetic and environmental factors. Only 3 (0.48%) clinically relevant ASD genes were found to be located on the Y chromosome, while 68 (10.81%) were recognized on the X chromosome [12].
The preponderance of males with ASD may be attributable to the single X chromosome in males, which, because of the XY sex chromosome constitution, deprives them of the normal allelic pair of X-linked genes. Sex chromosomes thus represent the most obvious genetic difference between males and females. All female mammals have two X chromosomes and achieve balanced X chromosome gene expression with males by inactivating one of their X chromosomes, a process known as X chromosome inactivation (XCI) [13]. This process occurs randomly and very early in embryonic development. Once an X chromosome is "selected" for inactivation within a cell, the same X chromosome remains inactivated in each subsequent daughter cell. Therefore, females have a mixture of cells with random expression of genes on a single X chromosome. Occasionally, XCI shows a nonrandom pattern or high skewness, which is usually defined by at least 80% preferential inactivation of one of the two X chromosomes [14,15]. Skewed XCI appears to play a role in the increased incidence and presentation of diseases in females with known X-linked gene involvement such as Rett syndrome [16], X-linked intellectual disability [14], X-linked adrenoleukodystrophy [17] and possibly autism [15]. Skewed XCI may also reflect an early disruption in the developing embryo causing cell death followed by a small number of dividing cells repopulating the embryo [18,19,20]. XCI can be measured using the androgen receptor (AR) gene located at chromosome Xq13, which contains a highly polymorphic CAG repeat region and is normally inactivated on one of the X chromosomes in females. The AR gene is used most often to determine XCI status [21,22]. Disorders with an imbalance in the number of affected males versus females, as in autism, are therefore likely conditions in which to study XCI and its impact on X-linked genes in females.
We suspect that additional autosomal and/or sex chromosome based ASD gene(s) either overwhelm the protective normal alleles of the second X chromosome among females or preferentially silence them through plausible skewed X chromosome inactivation among neural cell lineages. Therefore, the whole exomes of females with ASD from multiplex families (those having more than one affected family member) were analyzed in our study with next generation sequencing and bioinformatics, and a novel machine-learning classification engine was applied to identify the contributing ASD genes and the proportion of disease contributing alleles from the X chromosomes versus autosomes. The inactivation status of the two X chromosomes was determined to identify significant deviation from randomness, taking into consideration whether the X chromosome gene is known to escape inactivation or to act as a dominant gene.

Results
Whole exome sequencing of DNA from 30 females with autism spectrum disorders from AGRE identified between 100 and 300 genes per subject showing genomic variants of novel or candidate genes for autism with an initial classification score >0.5. To further narrow the list of putative disease causing genes, we increased the stringency of the selection criteria by adding cutoff levels for other predictive parameters (GERP2, PolyPhen2 and SIFT) and by raising the classification score cutoff from >0.5 to >0.7. For example, the initial list of genes and genomic variants for Subject HI2898 consisted of 245 genes when using a classification score >0.5 alone and was reduced to 22 genes when the cutoff levels of the other predictive parameters (GERP2, PolyPhen2 and SIFT) were included at the chosen final selection criteria level. Finally, the classification score cutoff was raised to >0.7 to generate the master list of 14 genes. For the 30 females with autism, the number of genes identified based on the final selection criteria ranged from 8 to 23, with 10 females presenting with an X-linked gene in the list of selected genes meeting the final selection criteria. The putative candidate genes and genomic variants were then subjected to further screening for biological significance and relevance for neural development, function and causation of known neurological disorders based on published medical literature. In so doing, we identified two potentially causative genes for autism for Subject HI2898. See Table 1 for the list of genes meeting or exceeding the established final selection criteria level and screening for biological significance.
Identical genomic variants were recognized in several subjects even after the final selection criteria were applied to reduce the number of candidate genes, and these were considered artifactual findings. The genes with identical genomic variants included KCNJ12, MLL3, OR9G1 and PCMTD1; different mutations in these genes have been reported in other disease states, and the shared variants appeared to reflect artifacts in the present analysis of autism.
We report genomic coordinates, previously recognized SNPs (and their ID numbers), reference base to alternate base sequences (e.g., G→A), variant call format (VCF) information, SNP effects (synonymous vs non-synonymous), SNP codons, and GERP2, PolyPhen2, SIFT, Blosum, Blosum62 and the classification scores for each genomic variant identified per affected female (Table 1). Our analysis identified a total of 79 different genes containing variants (both known and novel) consisting of 12 (15%) known candidate genes for ASD, 24 (30%) paralogues of known candidate ASD genes and 43 (54%) novel genes with functional roles in nerve cell growth, adhesion and neurodevelopment or function as putative candidate ASD genes (See Table 2). The identified candidate ASD genes and paralogues were distributed across 11 (37%) and 16 (53%) of the females with autism, respectively.
Four females (HI0555, HI0890, HI1402 and HI2126) possessed a single variant of significance in clinically relevant candidate or known ASD genes considered conclusive for causality. Seventy-seven percent of subjects (N = 23) possessed multiple variants with possible influences on causation of ASD. Two (9%) of the 23 females (7% of the total sample population) possessed more than five variants/mutations of known candidate ASD genes, paralogues or other functionally relevant genes with presumed influence over the development of ASD. Three (10%) of the total females (HI1143, HI1884 and HI2278) possessed a single gene variant with possible functional relevance toward the development of ASD, but no known candidate ASD genes or paralogues were identified that could be considered conclusive. Six females (HI0605, HI2898, HI1739, HI2172, HI2244 and HI2879) possessed a single variant of a known candidate ASD gene meeting our final selection criteria for pathogenicity in addition to one or more paralogues of ASD genes or gene variants with functional influence over cell growth and neurodevelopment. Fourteen females (HI0714, HI0751, HI0765, HI0855, HI0868, HI1157, HI1228, HI1305, HI1375, HI1422, HI1433, HI1954, HI2215 and HI2843) possessed 2 to 4 variants of ASD gene paralogues or variants with functional roles of influence over ASD but without a conclusive or definable causal mutation. Variants of multiple putative candidate genes for ASD were identified for Subjects HI0405, HI0558 and HI0793.
Using XCI data, we found that eight of the 30 females had high skewness (XCI > 80%:20%) and an additional six females showed moderate skewness (XCI 65%:35% to 80%:20%). Two of the eight females with high skewness and two females with moderate skewness possessed an X-linked gene variant meeting the final selection criteria. When considering the 78 total genes identified meeting our final selection criteria, X-linked gene variants were disproportionately more likely to be found among the females with autism spectrum disorder who exhibited moderate to high level XCI skewness than among females without XCI skewness (OR = 6.4, 95% CI: 0.68, 60.5, p = 0.1), but this relationship did not achieve statistical significance. Similarly, females who exhibited XCI skewness were more likely to possess an X-linked gene variant than females without XCI skewness (OR = 6.0, 95% CI: 0.58, 61.8, p = 0.132).
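The per-female odds ratio of 6.0 and its confidence interval can be reproduced from counts given in the text under two assumptions that the paper does not state explicitly. First, the 2 x 2 table is reconstructed: four of the 14 females with skewness carried an X-linked variant, and one of the 16 females without skewness is assumed to have carried one, inferred from the five females with X-linked variants reported in the abstract. Second, a Wald interval on the log odds ratio is assumed because it matches the reported bounds. The function name is illustrative only.

```python
import math

def odds_ratio_wald_ci(a, b, c, d, z=1.959964):
    """Odds ratio and Wald 95% CI for the 2 x 2 table [[a, b], [c, d]].

    a, b: exposed group with / without the outcome
    c, d: unexposed group with / without the outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper

# Assumed table (reconstructed from the text, not reported as a table):
#   skewed XCI:  4 with an X-linked variant, 10 without
#   random XCI:  1 with an X-linked variant, 15 without
or_value, ci_low, ci_high = odds_ratio_wald_ci(4, 10, 1, 15)
print(f"OR = {or_value:.1f}, 95% CI: {ci_low:.2f}, {ci_high:.1f}")
# prints: OR = 6.0, 95% CI: 0.58, 61.8
```

The wide interval, which spans 1, reflects the single female in the non-skewed group with an X-linked variant and is consistent with the lack of statistical significance.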

Discussion
We chose to study females with ASD aged 5 to 16 years from multiplex families, with more than one affected family member, in order to increase the likelihood of identifying a single gene as a causative factor using a descriptive case study approach. The females were selected from the AGRE, a biorepository of specimens and clinical data from well-characterized families with children with ASD enrolled for genetic research studies. Females with ASD from multiplex families were also examined, using whole exome sequencing and XCI assays, to determine whether an overabundance of X-linked genes contributing to ASD could be identified and whether X-linked genes were more frequently disturbed in those females with XCI skewness.

Single Gene Variants
Our investigation identified four females (HI0555, HI0890, HI1402 and HI2126) with single gene variants of clinically relevant candidate or known genes for ASD with a high likelihood of causality. Subject HI0555 possessed a non-synonymous, missense mutation of the SETD2 gene located on chromosome 3 with a high likelihood of a deleterious effect on gene expression and/or function. The SETD2 gene encodes Huntingtin-interacting protein B, which is related to Huntington disease and is known to be expressed in the brain [27,28]. SETD2 is a reported autism susceptibility gene [12] and was the highest rated and most likely candidate gene in this case. Subject HI0890 possessed a non-synonymous, missense mutation of the BTAF1 gene located on chromosome 10, which is implicated in helicase activity and sequence-specific binding transcription factor activity [63]. Subject HI2126 possessed a non-synonymous, missense mutation in a known candidate gene for ASD, ANKRD11, which is located on chromosome 16 and is likely to be causative in this case. The ANKRD11 gene encodes an ankyrin repeat domain-containing protein which inhibits ligand-dependent activation of transcription; mutations cause KBG syndrome, characterized by craniofacial features, short stature, skeletal anomalies, seizures and intellectual disability [90,91]. No physical examination or cognitive data were available from AGRE on this 12 year old female diagnosed with autism.

Involvement of Skewness of X Chromosome Inactivation and Putative Disease Causing Genes
High XCI skewness (>80%:20%) was observed for eight of our females with autism, and two of these females (HI1402 and HI2898) possessed putative variants of X-linked genes which may contribute to the phenotype. Skewing of X-chromosome inactivation may increase expression of recessive disease causing variants on the X-chromosome. Subject HI1402, with an XCI status of 85%:15% indicating high skewness, also possessed a non-synonymous, missense mutation of an X-linked gene (IL1RAPL1) which is a recognized candidate gene for ASD [12]. IL1RAPL1 had the highest selection criteria score of the four genes meeting initial selection criteria and was the most likely cause of ASD in this female. The IL1RAPL1 gene encodes a protein with the highest expression in brain neurons that participates in the regulation of neurite outgrowth via interaction with neuronal calcium sensors, thereby regulating synaptic formation and modulation of synaptic transmission [74].
Subject HI2898, with an XCI status of 85%:15% showing high skewness, possessed a variant of the X-linked gene GABRQ, which was found to be in the top two genes in the selection process. The GABRQ (gamma-aminobutyric acid A receptor theta) gene encodes a receptor protein for GABAergic neurotransmission in the mammalian central nervous system [107]. GABA is a major inhibitory neurotransmitter. GABRQ expression is distributed in the amygdala, hippocampus, anterior hypothalamus and cortex [107]. Subject HI2278, with an XCI status of 8%:92%, possessed a variant of an X-linked gene (PAK3) in the top 25 genes, but it was excluded due to strict filtering requirements (i.e., PolyPhen2) and a correct classification score of >0.50. However, the DCHS1 gene was identified, which encodes a transmembrane cell adhesion molecule that belongs to the protocadherin superfamily and acts as a ligand for FAT4, another protocadherin protein. DCHS1 and FAT4 form an apically located adhesive complex in the developing brain [39].
The five remaining females (HI0751, HI1143, HI1375, HI1433 and HI2879) with high XCI skewness ranging from 93%:7% to 82%:18% showed no X-linked gene variants meeting the final selection criteria for pathogenicity. Other variations of possible clinical relevance in autosomal chromosomes were observed for these subjects. Subject HI1433, with an XCI status of 93%:7% indicating extreme skewness, possessed a variant of the FAT2 gene (cadherin-related FAT tumor suppressor homolog 2, Drosophila) located on chromosome 5 [77,78,79]. Subject HI2879, with an XCI status of 18%:82% showing high skewness, possessed a variant of the SOX7 (SRY-Box 7) gene, which encodes a SOX protein acting as a transcription factor regulating diverse developmental processes [104]. Based upon our investigation, XCI skewness in these females did not appear to contribute to the presentation of autism.
Moderate skewness (65%:35% to 80%:20%) was observed for six females (HI0765, HI1305, HI1739, HI1954, HI2244 and HI2843). Two of these females possessed putative variants of X-linked genes that may have contributed to their phenotype. Subject HI1305, with an XCI status of 73%:27%, possessed a variant of PIR, a highly conserved X-linked gene encoding an iron-binding nuclear protein which functions as a transcriptional co-regulator, contributes to regulation of cellular processes and may promote apoptosis when overexpressed [70]. An associated disorder for this gene when disturbed is extratemporal epilepsy. Three genes were identified for Subject HI0765, who had an XCI status of 77%:23% with moderate skewness, including GPRASP2, located on the X chromosome, which encodes a protein that may regulate a variety of G-protein coupled receptors associated with autism spectrum disorders and schizophrenia [48]. As an X-linked gene in a female approaching high skewness, GPRASP2 becomes an important candidate for causation of ASD in this female. The KCNC2 and ASPM genes were the highest rated and became likely candidates

Other Autosomal Putative Disease Causing Genes
The cadherin, protocadherin and ankyrin repeat gene families were the most commonly altered putative disease causing gene variants identified in our study. Subject HI0558 possessed a variant of PCDH8, a member of the cadherin superfamily of genes which functions in cell adhesion in a CNS-specific manner, possibly playing a role in activity-induced synaptic reorganization underlying memory with down-regulation of dendritic spines, primarily in the hippocampal area [35]. In addition, this affected female possessed a variant of the CTNNA3 gene, which is located on chromosome 10 and encodes alpha-T-catenin, a protein that plays a role in functional cadherin-mediated cell adhesion. A possible association of this gene with Alzheimer disease has been proposed [36]. Subject HI0855 possessed a variant of CDH6, located on chromosome 5, encoding a known cadherin playing a role in cell-cell adhesion and implicated in autism [60]. Subject HI0605 possessed a non-synonymous, missense mutation of CCDC64 (coiled-coil domain containing, 64), a recognized ASD gene on chromosome 12 [12,38]. CCDC64 is a component of the secretory vesicle machinery in developing neurons that acts as a regulator of neurite outgrowth in the early phase of neuronal differentiation and which, when disturbed, is associated with neuronitis [38]. Subject HI1739 possessed a non-synonymous missense mutation of the ASB3 gene, located on chromosome 2 and related to the ankyrin repeat gene family known to play a role in brain development and function including autism [88]. GRM4 (glutamate receptor, metabotropic, 4) is a known ASD gene involved with glutamate, a major excitatory neurotransmitter in the central nervous system [86].
Subject HI2172 possessed a non-synonymous, missense mutation of the CYFIP1 (cytoplasmic FMRP-interacting protein 1) gene located on chromosome 15, which encodes a protein that interacts with the fragile X mental retardation protein (FMRP); disturbance of FMRP causes fragile X syndrome by impacting the development and maintenance of neuronal structures [108,109]. Individuals with the 15q11.2 BP1-BP2 microdeletion or Burnside-Butler syndrome, which involves the CYFIP1 gene, are known to have developmental and speech delay with an increased rate of aberrant behavior and autism [95].

Samples from Females with Autism
Thirty Caucasian females (average age, 7.7 ± 2.6 years; age range, 5 to 16 years) with a confirmed diagnosis of autism were selected. We chose to undertake whole exome sequencing of well-characterized females with autism and a positive family history (e.g., affected brothers), classified as multiplex families with autism and having a high probability of causation due to gene disturbances. We selected affected females from the Autism Genetic Research Exchange (AGRE) repository (www.agre.autismspeaks.org) for a descriptive next generation whole exome sequencing case study. The family members with autism in the AGRE were recruited and enrolled after screening and diagnostic assessments were performed with autism-related testing instruments (e.g., ADOS). Medical examinations and neurology evaluations were undertaken to collect family, pregnancy and medical history data along with physical and anthropometric measures and neurological recordings. Blood was also collected at AGRE for lymphoblastoid cell line development, plasma storage and DNA isolation. Chromosome analysis (karyotype) and fragile X syndrome DNA testing had been completed previously, with normal results, in all female subjects selected from AGRE. DNA from parents or other family members was not analyzed as a component of our study.

Exome Sequencing Methods
Genomic DNA (5 µg) samples were sent to Silicon Valley Biosystems, Foster City, CA, USA for whole exome sequencing via a paired-end next generation sequencing (NGS) approach using standard protocols with the Illumina HiSeq2000 platform (http://www.service.Illumina.com) and Agilent SureSelect Human All Exon v4-51Mb (http://www.genomics.agilent.com). The primary raw sequence data produced by the sequencing phase of our study were aligned to the reference genome, the variants were called, and the functional significance of each variant was determined and rank-ordered into a list of functional variants (in order of decreasing pathology) that were correlated with or causative for the autism phenotype. The raw data were in the form of fastq files containing the sequence reads and quality scores of the NGS sequencing runs. The individual reads were initially aligned by mapping to the reference genome using the Burrows-Wheeler aligner (BWA), a software program for mapping low-divergent genomic sequences of up to 100 base pairs against a large reference genome (http://www.bio-bwa.sourceforge.net). Variant calling procedures then identified genomic sequence variants from the NGS sequencing runs, or the absence of variants, in the sample of interest relative to a reference sequence. The platform used to produce variant calls was an amalgamation of tools and methods uniquely combined and specifically calibrated for enhanced performance, incorporating the Genome Analysis Toolkit (GATK) V2.39 (https://www.broadinstitute.org/gatk/). A number of intermediate steps were applied to improve the quality and accuracy of the alignment and variant calling results prior to the creation of the final list of variants. Sequence reads that were likely to represent duplicates were marked to be ignored by the downstream variant-calling tools. Additionally, realignment around known insertions and deletions was performed, and quality scores were recalibrated based upon the number of reads.
The list of variants was produced in the form of a .vcf file annotated to reflect the results of quality control filters. This established and calibrated alignment and variant calling platform performs well and can reliably predict calls for single insertion and deletion events (up to 50 bp with between 97% and 99.9% sensitivity and specificity). The average number of reads per generated fragment was 64 for each of the 30 females with autism.
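As an illustration of the fastq input described above, the following minimal sketch reads four-line fastq records and converts the quality string to numeric Phred scores. It assumes unwrapped four-line records and the Phred+33 quality encoding, neither of which is stated in the text; the function name and the example read are hypothetical and are not part of the pipeline used in our study.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: score = ASCII code minus 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Hypothetical single-read example
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for read_id, sequence, scores in read_fastq(example):
    print(read_id, sequence, scores)
# prints: read1 ACGT [40, 40, 20, 0]
```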

Data Analysis
The resulting .vcf file contained annotated variants with scientific features, including sequence-, gene-, variant-, and transcript-level annotations. Annotations include characteristics of the observed variant such as the genomic coordinates or the variant's genomic region (e.g., exonic, upstream, etc.), predicted effects on the various transcripts (e.g., missense, frame-shift, etc.), observed frequencies of the variant in control populations and affected individuals, curated knowledge from both proprietary and public databases, and pathogenicity predictions (e.g., PolyPhen and SIFT). In generating the classification score, the machine learning tool utilized both the April 2012 1000 Genomes release, reflecting frequency at the position and variant levels, and the European Exome Sequencing Project (ESP6500) (https://esp.gs.washington.edu/drupal/) at the variant level, which reflects the frequency of variants as observed in the ESP6500 dataset of healthy individuals. The regulatory significance of variants was derived from ENCODE data. A classification engine developed by Silicon Valley Biosystems was used to create a pathogenicity score based on a machine learning approach. The model was trained and validated using 150,000 gene polymorphisms and mutations from a large set of variants with known pathogenicity. The classification accuracy exceeds 99% and thus the score provides the probability that a particular variant has an effect on gene expression or protein structure. The classifier was trained using known disease-causing mutations and known benign polymorphisms, each of which was annotated with attributes such as location relative to splice junctions, type of non-synonymous amino acid substitution and allele frequency in the population. The classifier was validated on a similarly sized set of variants and was shown to perform with 93% sensitivity and 97% specificity.
Prioritization of putative functionally relevant variants based on the classification score was performed and these variants were additionally classified with other means of determining pathogenicity such as SIFT, PolyPhen2, Blosum and Blosum62. The variants that were the most highly predicted to be functionally significant were examined across the study cohort for clustering within and around genes hypothesized as relevant to the phenotype.

Gene Filtration/Selection Parameters
A high filter system was utilized whereby the indels and quality disqualified (QD) parameters were identified and removed based on the bioinformatics outlined in our study approach. The classification score (range 0 to 1) was generated as our primary delimiting factor in identifying novel or candidate genes for autism in the affected females. The classifier uses machine learning algorithms to predict a variant's functional impact on a gene, with higher scores for genomic variants considered more pathogenic or disease-causing at the gene level. Thus, we applied an initial classification cutoff of >0.5 with a final selection criterion of >0.7. Genomic Evolutionary Rate Profiling (GERP2), an indicator of the prevalence of the genomic variants in the general population, was assigned a cutoff value of <0.01 and remained at <0.01 as the final selection criterion. GERP2 identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element was under functional constraint. For Polymorphism Phenotyping v2 (PolyPhen2), the initial cutoff value was >0.85 and the final selection criterion was >0.9; for Sorting Intolerant From Tolerant (SIFT), the initial cutoff value was <0.05 and the final selection criterion was <0.03. Both served as predictors of pathogenicity or disease causing changes. The Blosum (BLOcks SUbstitution Matrix) score reports the ratio of the frequency of an amino acid substitution in the Human Gene Mutation Database (HGMD) to the frequency of this amino acid substitution in known polymorphism data sets. Numbers greater than 1 indicate that the variant was observed more in diseased individuals than in controls and is less likely to be tolerated. We also used the Blosum62 matrix to score alignments between evolutionarily divergent protein sequences based on local alignments.
Blosum62 is a log-odds score for each possible substitution of the 20 standard amino acids. Higher Blosum62 numbers represent more closely related species comparisons and provide a general indication of frequency of the substitution. Negative numbers indicate less frequent substitutions and a lower likelihood to be tolerated (e.g., negative values = unexpected or rare substitutions).
The list of disease causing genes was developed using the primary selection criterion (i.e., classification score) and then adding cutoff levels for other predictive parameters (GERP2, PolyPhen2, and SIFT). The initial list of genes and genomic variants used the classification score >0.5 alone, and then the cutoff levels of the other predictive parameters (GERP2, PolyPhen2 and SIFT) were included. The classification score cutoff was then raised to >0.7 to generate the final list of genes. These putative candidate genes and genomic variants were then subjected to further screening for biological significance for neural development, function and known neurological disorders using Online Mendelian Inheritance in Man (OMIM, http://www.ncbi.nlm.nih.gov/omim), GeneCards (http://www.genecards.org/), PubMed and other online websites and databases.
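The three-stage selection described in this section can be sketched as follows. The cutoff values are the final selection criteria given in the text (classification >0.5 then >0.7, GERP2 <0.01, PolyPhen2 >0.9, SIFT <0.03); the record layout, function names and example variants are hypothetical and do not reproduce the proprietary classification engine or any subject data.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    classification: float  # machine learning score, range 0 to 1
    gerp2: float
    polyphen2: float
    sift: float

def passes_predictors(v):
    """Final cutoffs for the secondary predictors as stated in the text."""
    return v.gerp2 < 0.01 and v.polyphen2 > 0.9 and v.sift < 0.03

def filter_variants(variants):
    """Return the variant lists after each of the three selection stages."""
    initial = [v for v in variants if v.classification > 0.5]
    with_predictors = [v for v in initial if passes_predictors(v)]
    final = [v for v in with_predictors if v.classification > 0.7]
    return initial, with_predictors, final

# Hypothetical variants for illustration only
example_variants = [
    Variant("GENE_A", 0.95, 0.001, 0.99, 0.00),  # passes every stage
    Variant("GENE_B", 0.62, 0.005, 0.95, 0.01),  # removed at the >0.7 stage
    Variant("GENE_C", 0.80, 0.002, 0.70, 0.02),  # removed by PolyPhen2
    Variant("GENE_D", 0.40, 0.001, 0.99, 0.00),  # removed at the >0.5 stage
]
initial, with_predictors, final = filter_variants(example_variants)
print(len(initial), len(with_predictors), len(final))  # prints: 3 2 1
```

The same staged narrowing is what reduced the list for Subject HI2898 from 245 genes to 22 and then to 14.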

Clinical Relevance to ASD
Following the recommendations of the American College of Medical Genetics and Genomics (ACMG) for reporting clinical exome and genome sequencing data, we analyzed in depth only the primary findings, defined as pathogenic alterations in a gene or genes relevant to the diagnostic indication for the sequencing study, i.e., causation of autism in females. Only clinically relevant candidate or known genes for autism were included for extended analysis. Incidental findings are pathogenic or likely pathogenic alterations in genes that are disease associated or disease causing but not apparently relevant to the diagnostic indication for which sequencing was undertaken (i.e., autism in our study [12]); these were not further investigated. We did not include mutations of recognized autosomal recessive genes causing human disorders (e.g., GJB2 or GJB6 gene mutations causing hearing loss). It is estimated that about 1% of sequencing reports would include an incidental variant.

X Chromosome Inactivation in Females with Autism
Genomic DNA isolated from blood was used as a template for polymerase chain reaction (PCR) amplification to identify the CAG polymorphic region of the AR gene. Prior to PCR amplification, 200 ng of genomic DNA was digested with the methylation-sensitive restriction enzyme HpaII [21]. Approximately 50 ng of digested or undigested genomic DNA was used as a template for PCR amplification to determine the peak height of the polymorphic PCR fragment of the AR gene using the following primers: forward 5' TCCAGAATCTGTTCCAGAGCGTGC 3' and reverse 5' GCTGTGAAGGTTGCTGTTCCTCAT 3', with the forward primer fluorescently labeled with 6-FAM. The lengths and peak heights of the resulting PCR fragments were determined by capillary electrophoresis and fragment separation software with the ABI 3100 DNA sequencer (Applied Biosystems, Carlsbad, CA, USA) using established protocols [18,22].
The digestion process preferentially degrades active (unmethylated) over inactive (methylated) DNA. Undigested DNA is preferentially amplified and produces larger peak heights. The peak height values for digested DNA were normalized using the peak height values for the undigested DNA for each subject. The percentage of XCI for each AR allele was then calculated using the following formula: (d1/u1)/[(d1/u1) + (d2/u2)], where d1 = peak height of digested DNA from the first allele, u1 = peak height of undigested DNA from the first allele, d2 = peak height of digested DNA from the second allele and u2 = peak height of undigested DNA from the second allele. Highly skewed XCI is defined as a calculated ratio of >80% for either one of the AR gene alleles in the digested DNA sample. To ensure the reproducibility of XCI results and equal amplification of both alleles, the digestion, PCR amplification, and genotyping were repeated up to three times in several samples.
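The formula above can be written as a short function. This is a minimal sketch: the function name and the peak height values are hypothetical and are chosen only to show how normalization by the undigested peak corrects for unequal amplification of the two alleles.

```python
def xci_percentages(d1, u1, d2, u2):
    """Percent inactivation of each AR allele from peak heights.

    d1, d2: peak heights of digested DNA for alleles 1 and 2
    u1, u2: peak heights of undigested DNA for alleles 1 and 2
    Implements (d1/u1) / [(d1/u1) + (d2/u2)] for allele 1.
    """
    n1 = d1 / u1  # normalized digested signal, allele 1
    n2 = d2 / u2  # normalized digested signal, allele 2
    allele1 = 100 * n1 / (n1 + n2)
    return allele1, 100 - allele1

# Hypothetical peak heights: allele 1 amplifies twice as strongly as allele 2
allele1, allele2 = xci_percentages(d1=1700, u1=2000, d2=150, u2=1000)
print(f"{allele1:.0f}%:{allele2:.0f}%")  # prints: 85%:15%
```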
XCI status for each subject was assigned to one of three mutually exclusive categories: (i) randomly selected inactivation of either allele from each X chromosome (XCI = 50%:50% to 64%:36%); (ii) moderately skewed inactivation favoring one allele of one of the X chromosomes (XCI = 65%:35% to 80%:20%); and (iii) highly skewed inactivation of a single allele representing one X chromosome (XCI > 80%:20%). The relative frequency of random, moderate, and highly skewed XCI categories was determined for the females with autism. The binomial frequency distribution of X-linked gene variants (present or absent) among females with autism spectrum disorder exhibiting moderate and high levels of XCI skewness was approximated using Yates' chi-square test, and odds ratios were calculated with 95% confidence intervals.
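The three categories can be expressed as a simple classification on the larger of the two allele percentages. The sketch below is illustrative: the function name is hypothetical, and because the stated ranges are given in whole percentages, values falling between 64% and 65% are assumed here to be random.

```python
def xci_category(percent_allele):
    """Assign an XCI category from the percent inactivation of either allele."""
    major = max(percent_allele, 100 - percent_allele)
    if major > 80:
        return "high"
    if major >= 65:
        return "moderate"
    return "random"  # 50%:50% up to, by assumption, just under 65%:35%

# XCI values quoted in the text for individual subjects
for value in (85, 8, 73, 77):
    print(value, xci_category(value))
# prints: 85 high / 8 high / 73 moderate / 77 moderate (one pair per line)
```

Taking the larger percentage makes the result independent of which allele is listed first, so 8%:92% and 92%:8% fall in the same category.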
The top genes chosen were based on the final selection criteria identified for each female and evaluated for biologic function and whether each gene was previously reported as a clinically relevant candidate or known gene for autism [12]. The X chromosome inactivation data and XCI status on each female were used to support whether an X-linked gene played a role in the causation of autism in the affected female if XCI skewness was found.

Conclusions
In summary, we shared our experience using a descriptive next generation whole exome sequencing case study approach examining 30 well-characterized Caucasian females with autism between 5 and 16 years of age recruited from multiplex families enrolled at the AGRE for research purposes.
Interpretation of the DNA findings is limited by the inability to study other affected and non-affected family members to determine whether the gene variants were de novo or inherited in origin for correlation with clinical phenotypes, and by the lack of Sanger sequencing confirmation. Using strict selection criteria, four females (13%) were found to possess single gene variants of known candidate genes for ASD with a high likelihood of causality. We also identified multiple plausible candidate genes [some known (e.g., CCDC64 in subject HI0605) and some novel (e.g., CHAC1 in subject HI0855)] in 77% of the remaining females. In most females, we found more than one gene that could contribute to the causation of autism. This compares with a recent report of molecular findings among 2000 consecutive patients (primarily of pediatric age) referred for clinical whole exome sequencing for evaluation of suspected genetic disorders at a large USA medical center, in which 25% received a molecular diagnosis. About 60% of the diagnostic mutations were not previously reported [110]. By phenotypic category, 36% of those presenting with neurological involvement were found to have a molecular diagnosis, while 20% of those from the non-neurological group had a diagnosis. We not only performed whole exome sequencing in our females with autism but also performed X chromosome inactivation (XCI) studies for evidence of X-linked genes in the females showing non-random XCI skewness. Moderate and high XCI skewness was seen in 14 of the 30 females, and those with skewness were 4 times more likely to show a gene variant on the X chromosome.