Estimating the Prevalence of De Novo Monogenic Neurodevelopmental Disorders from Large Cohort Studies

Rare diseases impact up to 400 million individuals globally. Of the thousands of known rare diseases, many are rare neurodevelopmental disorders (RNDDs) impacting children. RNDDs have proven difficult to assess epidemiologically for several reasons: their rarity makes them hard to observe in the population; clinical overlap among many disorders makes prevalence difficult to assess without genetic testing; and data providing accurate case counts have not been available. Here, we utilized large sequencing cohorts of individuals with rare, de novo monogenic disorders to estimate the prevalence of variation in over 11,000 genes among cohorts with developmental delay, autism spectrum disorder, and/or epilepsy. We found that the prevalence of many RNDDs is positively correlated with the previously estimated incidence. We identified the most often mutated genes among neurodevelopmental disorders broadly, as well as in developmental delay and autism spectrum disorder independently. Finally, we assessed whether social media group member numbers may be a valuable way to estimate prevalence. These data are critical for individuals and families impacted by these RNDDs, for clinicians and geneticists in their understanding of how common diseases are, and for researchers to potentially prioritize research into particular genes or gene sets.


Introduction
Rare diseases, in particular rare neurodevelopmental disorders (RNDDs), have proven to be challenging to understand epidemiologically. There are several definitions of "rare disease" that vary globally [1,2]. The current definition of a rare disease in the United States is a disease that impacts fewer than 200,000 individuals, or approximately 86 per 100,000 individuals at the time the American Orphan Drug Act was passed in 1983. Other global definitions range from 5 to 76 per 100,000 individuals. Overall, an estimated 3.5-5.9% of the global population has a rare disease, many of which are RNDDs mostly diagnosed in early childhood [3].
Neurodevelopmental disorders (NDDs), impacting up to 17% of the population, are a clinically and genetically heterogeneous group of diagnoses [4]. NDDs as a whole are not rare, but each individual RNDD with a known genetic cause accounts for only 1% or less of NDD cases. Many studies have shown that de novo variants (DNVs) are key contributors to such disorders [5][6][7][8][9]. The prevalence of these disorders is key for families looking for community, for researchers and clinicians, and for pharmaceutical development [10].
Due to the scarcity of these RNDDs, with variable expressivity and incomplete penetrance, traditional epidemiological methods are challenging to apply. Additionally, many monogenic RNDDs share clinical features or lack pathognomonic features, making it difficult to identify them without genetic testing. Barriers to genetic testing pose another challenge: they result in underdiagnosis of many RNDDs, leaving patients uncounted.
Multiple approaches have been taken to understand the prevalence and/or incidence of RNDDs. Clinical data have been utilized for deletion/duplication syndromes mediated by nonallelic homologous recombination [11]. The number of published articles has also been used as a potential metric for prevalence [12]. For monogenic disorders, Nguengang Wakap et al. (2020) utilized point prevalence (number of cases in the population at one time/total population at the same time point), although not by gene but by inheritance pattern [3]. The incidence of de novo monogenic RNDDs has been elegantly estimated by López-Rivera et al. (2020), utilizing mutational constraint and probability of mutation to estimate based on mutation rate of individual genes [13,14]. Additionally, several resources report estimated prevalence, such as Orphanet, the National Organization for Rare Disorders (NORD), and others, although it is not always clear how these numbers are determined.
In order to assess the prevalence of autosomal dominant de novo monogenic RNDDs, we utilized the DNV data from multiple large cohorts of individuals with NDDs, specifically developmental delay/intellectual disability (DD/ID), autism spectrum disorder (ASD), and epilepsy. Cohorts include the Deciphering Developmental Disorders studies, Autism Sequencing Consortium, Simons Simplex Collection, SPARK, and MSSNG [5][6][7][15][16][17][18][19][20][21][22][23][24][25][26][27][28]. It is likely that these large studies of DNV provide the most comprehensive counts available of individuals with specific neurodevelopmental-related genetic alterations. From these cohorts (n = 50,377), we estimated the prevalence in the general population of variation in over 11,000 genes with reported variation in NDDs, and found that it is positively correlated with the previously estimated incidence. We also identified the most often mutated genes among NDDs broadly, as well as in DD/ID and ASD independently. Finally, we determined that social media group member numbers may be a valuable way to estimate prevalence. These data are critical for individuals and families impacted by these rare disorders, for clinicians and geneticists in their understanding of how common diseases are, and for researchers to potentially prioritize research into particular genes or gene sets.

Prevalence Estimation
Prevalence information for ASD, DD, and epilepsy was taken from Zablotsky and Black (2020) to comprise our NDD prevalence (Table 1) [4]. While NDDs as a whole affect 17% of 3- to 17-year-old children in the US, we focused on those that were well represented in our de novo NDD cohort. Coding, nonsynonymous variant counts were computed from each study (Tables S3-S6). The number of variants in each gene was normalized by the observed/expected values for each type of variant obtained from gnomAD v2.1.1 (Table S3). Genes with negative values resulting from normalization or with no constraint metrics available were excluded. The proportion of cases in our combined cohort was multiplied by the estimated prevalence of RNDDs and extrapolated to the prevalence in 100,000 individuals. Estimates were performed for NDDs, DD/ID, and ASD independently. The number of probands for epilepsy was dramatically lower than for DD/ID and ASD; thus, epilepsy was not calculated separately, as the cohort would not accurately represent cases of epilepsy. Variants were also separated by variant type: all DNVs, de novo likely gene disrupting (dnLGD) variants, de novo missense (dnMIS) variants, and de novo severe missense variants with a CADD score greater than or equal to 30 (dnMIS30). Candidate NDD genes were assessed separately and determined by combining statistically significant genes from multiple large cohort studies (n = 468) [7][8][9][26] (Table S2).
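The extrapolation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, gene labels, cohort size, and phenotype prevalence in the example are hypothetical, and the gnomAD-based normalization is assumed to have already been applied to the input counts.

```python
def prevalence_per_100k(normalized_count, cohort_size, phenotype_prevalence):
    """Extrapolate a gene's share of a cohort to the general population.

    normalized_count: per-gene case count after normalization (assumed input).
    cohort_size: number of probands in the cohort.
    phenotype_prevalence: population prevalence of the phenotype, as a fraction.
    """
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    if not 0.0 <= phenotype_prevalence <= 1.0:
        raise ValueError("phenotype_prevalence must be a fraction between 0 and 1")
    proportion_of_cases = normalized_count / cohort_size
    return proportion_of_cases * phenotype_prevalence * 100_000


def filter_genes(normalized_counts):
    """Drop genes whose normalized count is missing (None) or negative."""
    return {
        gene: count
        for gene, count in normalized_counts.items()
        if count is not None and count >= 0
    }


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
counts = filter_genes({"GENE_A": 100.0, "GENE_B": -2.5, "GENE_C": None})
estimates = {
    gene: prevalence_per_100k(count, cohort_size=50_000, phenotype_prevalence=0.10)
    for gene, count in counts.items()
}
```

With these illustrative inputs, only GENE_A survives filtering, and its estimate is about 20 per 100,000 (100/50,000 of cases, times a 10% phenotype prevalence).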

Comparison to Previous Incidence Estimates
Our estimates were compared to birth incidence rates from López-Rivera et al. (2020) using Pearson's correlations in RStudio (2022.02.2-485, R version 4.2.0). Correlation analyses were performed in R at the gene level and the cohort level. For the gene level, the number of DNVs was rounded to the nearest integer. Then, Fisher exact tests between genes that were reported in both our cohort and in that of López-Rivera et al. (2020) were performed in R with Bonferroni correction accounting for all genes (n = 20,000) and number of probands tested (n = 50,377). The 11,461 genes analyzed all had DNVs in our cohort, while the remaining genes in the genome had none in the data used. As previous estimates were not calculated by phenotype, our analysis was only performed for the total NDD cohort.
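The three statistical ingredients named above (Pearson's correlation, Fisher's exact test, and Bonferroni correction) can be illustrated with a short standard-library sketch. The analysis itself was run in R, where `cor`, `fisher.test`, and `p.adjust` provide these operations; Python is used here for illustration only. The layout of the 2x2 table and all example values are assumptions, since the exact table construction is not spelled out in the text.

```python
from math import comb, sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_observed * (1 + 1e-7)  # small tolerance for float ties
    )
    return min(1.0, total)


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical example: one gene, counts with/without a DNV in two groups.
p_raw = fisher_exact_two_sided(3, 1, 1, 3)
p_adjusted = bonferroni(p_raw, n_tests=20_000)
```

Under this sketch, a gene is reported as significantly different only if its adjusted p-value stays below the chosen threshold after multiplying by the number of tests, which is a stringent criterion when 20,000 genes are considered.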

Comparison to Social Media Group Numbers for Top NDD Genes
We searched Facebook for each gene name and/or known disorder for the top 500 genes as well as any gene that had an OMIM disease entry (n = 294 genes with Facebook groups, Table S3). The number of members in each group was compared to the estimated prevalence using Pearson's correlations.

Prevalence Estimates for All NDDs
We assessed the number of cases in our total NDD cohort for each gene with at least one variant. The number of variants was normalized by observed/expected counts obtained from gnomAD v2.1.1 for dnLGD and dnMIS variants. The dnLGD and dnMIS variants were summed to estimate all DNV prevalence. Utilizing the prevalence estimates from Zablotsky and Black (2020), we calculated prevalence among individuals with NDDs and the prevalence in the general population.
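The variant categories and the summation into an all-DNV count can be sketched as below. This is an illustration only: the consequence labels, the tuple layout of the input, and the choice of which consequences count as likely gene disrupting are assumptions, dnMIS30 is treated as a subset of dnMIS, and the observed/expected normalization is omitted for brevity.

```python
from collections import Counter

# Assumed consequence labels for likely gene disrupting variants (hypothetical).
LGD_CONSEQUENCES = {"nonsense", "frameshift", "splice_site"}


def categorize(consequence, cadd=None):
    """Return the set of categories a single de novo variant falls into."""
    if consequence in LGD_CONSEQUENCES:
        return {"dnLGD"}
    if consequence == "missense":
        categories = {"dnMIS"}
        if cadd is not None and cadd >= 30:
            categories.add("dnMIS30")  # severe missense: CADD >= 30
        return categories
    return set()  # e.g. synonymous variants are not counted


def count_by_gene(variants):
    """Tally categories per gene from (gene, consequence, cadd) tuples."""
    counts = {}
    for gene, consequence, cadd in variants:
        gene_counts = counts.setdefault(gene, Counter())
        for category in categorize(consequence, cadd):
            gene_counts[category] += 1
    for gene_counts in counts.values():
        # All DNVs = dnLGD + dnMIS; dnMIS30 is already included in dnMIS.
        gene_counts["all_DNV"] = gene_counts["dnLGD"] + gene_counts["dnMIS"]
    return counts


# Hypothetical variant calls.
example = [
    ("GENE_A", "frameshift", None),
    ("GENE_A", "missense", 31.0),
    ("GENE_A", "missense", 12.0),
    ("GENE_B", "synonymous", None),
]
tallies = count_by_gene(example)
```

Keeping dnMIS30 as a flag on missense variants, rather than a separate bucket, avoids double counting when the dnLGD and dnMIS totals are summed.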

Prevalence Estimates for DD/ID
Cohorts in which the primary diagnosis was DD or ID were analyzed separately. In general, the pattern was similar to the entire NDD cohort, likely due to the larger DD sample size. The gene most often mutated in DD was ARID1B, accounting for 0.38% (dnLGD = 0.3%, dnMIS = 0.04%, dnMIS30 = 0.02%) of all DNVs (Figure 2A, Tables 4 and S4). ARID1B was also the most frequently mutated gene for dnLGD variants (Figure 2B). This resulted in a frequency of ARID1B-related disease with DD of 4/100,000 individuals (95% CI: 3.7-4.7/100,000), or 1/24,816 individuals (95% CI: 1/27,013-1/21,225). Values for all genes analyzed and their 95% confidence intervals are in Table S5.
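The two ways of reporting the ARID1B frequency (per 100,000 individuals and as one in N individuals) are reciprocal, which the following small sketch checks against the reported values. The function names are hypothetical, and the sketch only converts units; it does not reproduce how the confidence intervals were derived.

```python
def one_in_n_to_per_100k(n):
    """Convert a '1 in n individuals' frequency to cases per 100,000."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 100_000 / n


def per_100k_to_one_in_n(rate_per_100k):
    """Convert cases per 100,000 to the n of a '1 in n individuals' frequency."""
    if rate_per_100k <= 0:
        raise ValueError("rate must be positive")
    return 100_000 / rate_per_100k


# Reported values for ARID1B-related disease with DD.
point = one_in_n_to_per_100k(24_816)   # point estimate
lower = one_in_n_to_per_100k(27_013)   # rarer bound -> lower rate
upper = one_in_n_to_per_100k(21_225)   # more common bound -> higher rate
```

Note that the bounds swap order between the two scales: the larger denominator (1/27,013) corresponds to the lower per-100,000 bound.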

Comparison to Previous Estimates
López-Rivera et al. (2020) estimated the incidence for 100 known monogenic disorders as well as the mutation incidence of over 1000 variation-intolerant genes. We compared our estimates to theirs using correlation analysis. All variant categories' prevalence was significantly positively correlated with previous incidence estimates (Figure 4, Tables S3-S6; see also Figure 4B,C). For all DNVs and dnMIS variants, NDD candidate genes' prevalence was significantly correlated with previous incidence estimates (Figure S1, Table S6). No genes had a significantly different prevalence of mutation when using Bonferroni or FDR correction.
Most genes (n = 6681) had a higher prevalence than previous incidence estimates, as expected, since prevalence counts all existing cases whereas incidence counts only new cases in a year. However, some of these genes may also have been over-ascertained in our cohort (n = 468 NDD candidate genes). Genes with a lower prevalence than incidence (n = 1056; 249 NDD candidate genes) may have had lethal phenotypes or have been under-ascertained in our cohort. The proportion of NDD candidate genes among genes with lower prevalence than incidence (19%) was significantly higher than among genes with higher prevalence than incidence (1.5%, chi-squared test, p = 0.0005, Figure S2). No genes showed significantly different mutation prevalence after Bonferroni correction.

Comparison to Social Media Groups
One potential estimate of how many individuals and families may be affected by these monogenic disorders is through their social media groups, i.e., how many members does a group have. This likely represents parents of children with rare disorders, and mostly mothers [10]. Over 4000 pediatric rare diseases have Facebook support groups. While membership is limited by computer and internet access, as well as interest in connecting with other families, this may be a reasonable metric for prevalence of these disorders.
To assess this, we found Facebook groups for the top 500 genes and any genes that had a named disorder (n = 294 genes with Facebook groups) (Table S3). Foundation pages were not included, and the group with the highest number of members was used. Gene and syndrome names were used to identify Facebook groups.
The number of Facebook group members was positively correlated with prevalence (PCC = 0.31) (Figure S1D). This moderate correlation suggests that there is an underdiagnosis for many of these monogenic de novo disorders. Interestingly, 66 of the 293 genes analyzed were not significantly enriched among NDD meta-analyses.

Discussion
The prevalence of most monogenic RNDDs has yet to be determined, and those with estimates are often anecdotal. An accurate estimate of the prevalence is important in understanding each disorder, which also has an impact on research funding and focus. Additionally, there is value in individuals being counted in rare disease [64]. Recently, it has been suggested that there are over 11,000 individual rare diseases, a number that is likely to increase. By identifying individuals with each disorder and determining their prevalence, we can better contribute to our knowledge of rare disease. In combination with cohort-based estimates, incidence estimates from mutation rates, and social media analysis, we hope to have a more comprehensive understanding of the prevalence of these rare disorders.
Utilizing the collection of probands from large sequencing studies that best represent multiple NDD-affected populations to date, we showed the prevalence of de novo variation among NDDs broadly, which, in our cohort, included DD/ID, ASD, epilepsy, and other diagnoses (Figure 1). Our results showed that while most monogenic RNDDs are likely underdiagnosed based on prevalence estimates, they also each account for fewer individuals with NDDs than previously thought. Often, it is reported that each NDD candidate gene accounts for less than 1% of the individuals diagnosed. Here, we showed that each gene accounts for even fewer individuals, with the highest percentage being 0.3% of individuals with NDDs for ARID1B (Figure 1A, Tables 2 and S3). The GeneReviews for Coffin-Siris syndrome (CSS), of which ~37% of cases are due to ARID1B variants, reports that fewer than 200 individuals with CSS have been identified, although a literature and social media review suggests that this number is higher [65][66][67]. Our results suggest there is a considerable underdiagnosis of this syndrome, and this pattern is likely the same for other genetic RNDDs.
Previous studies have tried to use novel methods to determine the prevalence, including using mutation rates and number of papers published [12]. In a similar vein, we compared number of members in social media groups with prevalence estimates (Figure S1D). While not significant, there is a positive correlation between number of Facebook group members of a rare disease group and the prevalence of that rare disease. Those with higher estimated prevalence but lower numbers of Facebook group members may represent underdiagnosed or misdiagnosed disorders.
While positively correlated, there are notable differences between our prevalence estimates and previous prevalence or incidence estimates. To an extent, we expect prevalence to be higher than incidence, as incidence is the number of new cases per year, and this is the case for many genes. Several genes are overrepresented compared to their estimated incidence, suggesting possible ascertainment bias. In contrast, many genes have markedly decreased prevalence compared to the estimated incidence, which could be due to a range of factors. We only focused on DNVs, and some of these monogenic disorders have carrier parents, affected or unaffected. Given our DNV-only focus, our cohort likely will have higher accuracy for more severe conditions. We also assumed 100% penetrance for our calculations. It is likely that there are variants that are not fully penetrant or result in subclinical features; thus, those probands may not have been included in our cohort. We also did not consider mortality, which may decrease the prevalence, although most of these disorders are not perinatal lethal. However, the few disorders that are perinatal lethal, such as MECP2 variants in males, combined with the decreased lifespan of individuals with NDDs (average age of ~60 years), may affect the prevalence and are absent from our calculations [68]. Additionally, we only discerned dnLGD and dnMIS or dnMIS30 variants. This leads to some inaccuracy, as there are syndromes that are caused by neither of these variant types but were analyzed in our cohort. Some genes appear to have had a much higher prevalence in our cohort versus the incidence in López-Rivera et al.'s analysis but are skewed due to mutation mechanisms, such as PPM1D or ADNP, both of which are causative for disease by nonsense and frameshift variants in the penultimate exon that result in truncated proteins escaping nonsense-mediated decay.
Additionally, there are genes in both the López-Rivera et al. (2020) estimates and ours that are not pathogenic, such as TTN, that may skew our correlations, although our normalization with constraint measures aimed to avoid such issues. Furthermore, there are genes that we know to be pathogenic that may have better estimates based on mutation rate than on our cohort due to the rarity of these syndromes. Such genes highlight our ascertainment bias, with disorders that have a higher frequency of ASD and/or DD/ID having better estimates. These include disorders such as Schaaf-Yang syndrome (MAGEL2), which had only one variant in our cohort, or HNRNPH2-related NDD, which had no variants in our cohort. Additionally, barriers to genetic testing likely impacted our cohort composition. Finally, we made the assumption that NDDs have similar prevalence globally, which is difficult to assess. While our estimates may reflect some ascertainment bias, these are still the most accurate estimates to date.
In addition to providing novel information for many RNDDs, this work also shows the value of exome or genome sequencing over panel analysis. While it is feasible to choose the top genes from our work for a panel, it is important to know that each of these affects 0.29% or less of individuals with NDDs. Thus, even with the top 100 genes, only 8.8% of potential RNDD diagnoses would be made. Even a panel of the top 500 genes would only have a diagnostic yield of <20%. In contrast, exome sequencing has an approximately 36% diagnostic yield and a higher yield for NDDs with comorbid conditions [69]. Our study supports the use of exome sequencing as a first-tier clinical diagnostic test for individuals with NDDs.
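The panel-yield argument above amounts to a cumulative sum over genes ranked by their share of individuals with NDDs. A minimal sketch follows; the gene labels and percentages are hypothetical (only the 0.29% ceiling echoes the text), and it assumes each diagnosed individual is attributed to a single gene so that per-gene percentages can be added.

```python
def cumulative_panel_yield(percent_by_gene, top_n):
    """Sum the per-gene percentages of the top_n most frequently mutated genes.

    percent_by_gene: mapping of gene -> percent of individuals with NDDs.
    Assumes percentages are additive (one causal gene per individual).
    """
    if top_n < 0:
        raise ValueError("top_n must be non-negative")
    ranked = sorted(percent_by_gene.values(), reverse=True)
    return sum(ranked[:top_n])


# Hypothetical per-gene percentages, for illustration only.
percent_by_gene = {"GENE_A": 0.29, "GENE_B": 0.20, "GENE_C": 0.10, "GENE_D": 0.05}
top2_yield = cumulative_panel_yield(percent_by_gene, top_n=2)
full_yield = cumulative_panel_yield(percent_by_gene, top_n=len(percent_by_gene))
```

Because every gene contributes well under 1%, the cumulative yield grows slowly with panel size, which is the pattern behind the limited yields quoted for 100-gene and 500-gene panels.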
With this new approach to prevalence estimates, we hope that valuable information can be provided to families, clinicians, and groups developing potential therapeutics. Additionally, we show the value of large cohort studies in disease and emphasize the need for international collaboration. While these numbers are inherently in flux, we provide the most accurate prevalence estimates for many disorders to date.

Supplementary Materials:
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/biomedicines10112865/s1: Figure S1: Prevalence of DNVs in candidate NDD genes versus incidence estimates from [14], along with comparison to social media estimates; Figure S2: Fold difference between our prevalence estimates and López-Rivera et al.'s incidence estimates; Table S1: Cohorts and samples in study; Table S2: NDD genes (n = 468) as determined by multiple meta-analysis studies;

Data Availability Statement:
The data presented in this study are available in the supplemental materials Tables S1-S6.

Acknowledgments:
We thank all the families participating in the multiple studies from which we used data. We are grateful to all of the families at the participating SSC sites, as well as the principal Potter, and P. Farrar). We appreciate obtaining access to phenotypic data on SFARI Base for both SSC and SPARK samples, as well as SPARK exome data from the SPARK Consortium. Approved researchers can obtain the SSC population dataset described in this study (https://www.sfari.org/resource/resources/simons-simplex-collection/) by applying at https://base.sfari.org (accessed on 5 January 2022). We thank the DDD study, which presents independent research commissioned by the Health Innovation Challenge Fund (grant number HICF-1009-003), a parallel funding partnership between the Wellcome Trust and the Department of Health, and the Wellcome Trust Sanger Institute (grant number WT098051). The views expressed in this publication are those of the authors and not necessarily those of the Wellcome Trust or the Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83 granted by the Cambridge South REC and GEN/284/12 granted by the Republic of Ireland REC). The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network. We thank the researchers who generated data for all the other cohorts utilized. We thank the Autism Intervention Research Network on Physical Health (AIR-P) for early feedback on this work. E.E.E. is an investigator of the Howard Hughes Medical Institute. We also thank T. Brown for assistance in editing this manuscript. This article is subject to HHMI's Open Access to Publications policy. HHMI lab heads have previously granted a nonexclusive CC BY 4.0 license to the public and a sublicensable license to HHMI in their research articles. 
Pursuant to those licenses, the author-accepted manuscript of this article can be made freely available under a CC BY 4.0 license immediately upon publication.