Aggregated Genomic Data as Cohort-Specific Allelic Frequencies can Boost Variants and Genes Prioritization in Non-Solved Cases of Inherited Retinal Dystrophies

The introduction of NGS in genetic diagnosis has increased the repertoire of variants and genes involved and the amount of genomic information produced. We built an allelic-frequency (AF) database for a heterogeneous cohort of genetic diseases to explore the aggregated genomic information and boost diagnosis in inherited retinal dystrophies (IRD). We retrospectively selected 5683 index-cases with clinical exome sequencing tests available, 1766 with IRD and the rest with diverse genetic diseases. We calculated a subcohort’s IRD-specific AF and compared it with suitable pseudocontrols. For non-solved IRD cases, we prioritized variants with a significant increment of frequencies, with eight variants that may help to explain the phenotype, and 10/11 of uncertain significance that were reclassified as probably pathogenic according to ACMG. Moreover, we developed a method to highlight genes with more frequent pathogenic variants in IRD cases than in pseudocontrols weighted by the increment of benign variants in the same comparison. We identified 18 genes for further studies that provided new insights in five cases. This resource can also help one to calculate the carrier frequency in IRD genes. A cohort-specific AF database assists with variants and genes prioritization and operates as an engine that provides a new hypothesis in non-solved cases, augmenting the diagnosis rate.


Introduction
Rare diseases are chronically debilitating or life-threatening and have a prevalence in Europe of less than 1 in every 2000 people [1]. Inherited retinal dystrophies (IRD) are a group of rare diseases with a degenerative and progressive course and are caused by the primary affection of photoreceptors and retinal pigmentary epithelial [2]. All together, they affect 1 in every 3000-4000 people in the western world [3]. They are clinically heterogeneous, covering several syndromes (e.g., Usher, Bardet-Biedl -BBS-, or Joubert) [4][5][6], as well as non-syndromic forms such as retinitis pigmentosa [2] and macular dystrophies [7]. They have overlapped phenotypes and display any form of inherited patterns.
During the last two decades, next-generation sequencing (NGS) techniques have transformed research on genetic rare diseases, causing a substantial increase in the volume of available genomic data and knowledge generated [8,9]. Although several sequencing tests are available, a widespread approach in genetic diagnosis is to sequence the coding region of known clinically relevant genes (~4500), the so-called clinical exome (CE). A CE test detects several thousand variants, which need to be filtered and prioritized in order to highlight those responsible for the phenotype [10]. Regarding the task of filtering, in low-prevalence diseases, apart from quality filters, a small population frequency is one of the first requirements to purge non-causal variants [11]. Several global genomic initiatives [12,13] provide allele frequencies in large populations, although frequencies from local cohorts provide a better estimation of real variant prevalence [14,15] and have been proven to identify rare pathogenic variants [15,16].
In the absence of genomic information of a priori healthy people, Mendelian diseases can provide a good estimate of allele frequencies in the general population as pseudocontrols (PC) for other non-related diseases [17,18]. In the same terms, PC can also be applied to the calculation of carrier frequencies (CF) [16,19] of causal variants of non-related diseases in genes with a recessive inheritance pattern, as well as to the analysis of trios [20,21]. In more complex scenarios, where modifying and risk/protective variants may tune the effect of causal variants, the mutational landscape of a disease may help identify: (i) the genetic pleiotropy, together with causal variants in recessive forms [22]; (ii) digenic inheritance [23]; or (iii) disease-associated triallelic sites as in BBS [24].
With all these premises, we hypothesize that a database of variant allele frequencies calculated over a heterogeneous cohort of genetic diseases enriched in IRD cases can help to improve the detection of previously unnoticed, underrated, or unknown causal variants and gene-disease associations. Additionally, it can help one to uncover disease cases with overlapping phenotypes, as well as to describe the carrier frequencies of recessive variants. Thus, we built a database with the genomic data of a large cohort with various genetic diseases and developed methods to compare IRD-specific and PC frequencies. This tool is used as a global reanalysis platform to study frequent variants and over-mutated genes in IRD non-solved cases.

A Multi-Disease Cohort Database of Variant Frequencies to Study the Aggregated Signal in Ird Genomic Landscape
We compiled a heterogeneous cohort of 5683 patients with genetic diseases referred to the Genetics Department of the UH-FJD and with a clinical exome sequencing test available (see Section 4). The cases were distributed into three groups of diseases as inherited retinal dystrophies (IRDs), with 1766 cases; other eye-related diseases (OERDs), with 386 cases; and non-eye-related diseases (NRD), with 3531 cases ( Figure 1A). Additionally, IRD cases were classified according to their diagnostic status as solved (N = 955 cases, 54%) and non-solved (N = 811 cases, 46%). Within the cohort of non-solved IRD patients, we had 447 cases with no candidate variants (25% of the total), that is, excluding cases with a pathogenic/likely pathogenic variant reported in recessive cases (named partially solved cases) and cases with 1-2 VUS reported in dominant/recessive cases, respectively (named VUS cases) ( Figure 1B). CEs were reanalyzed and variants detected and extracted for each case. We performed a quality filter and removed variants with a potential sub-population bias between IRD and other cases ( Figure 1C and Section 4). Thus, for the 5683 samples together, around 560 K unique variants in 5046 genes were left for further analyses.
From a cohort perspective, the number of pathogenic (including likely pathogenic) variants and VUS are unequally distributed over IRD-associated genes when comparing solved-IRD and non-solved-IRD cases ( Figure 1D). The top 5 genes with more pathogenic variants (including likely pathogenic) in solved-cases have a higher ratio of pathogenic variants/VUS for IRD solved cases than for IRD non-solved cases.

IRD-Specific Highly Frequent Variants
In the diagnosis of rare diseases using NGS data, variants must be prioritized in order to facilitate the detection of the causal mutations within the large amount of variation found in a single experiment. Here, we aimed to test if the comparison between the frequency of variants in cases and controls was able to extract deleterious variants. Thus, we focused on the IRD subcohort and treated solved and non-solved cases separately. Partially solved and VUS cases were excluded from the non-solved subcohort in order to work with a subcohort with no candidate variants. As controls for both subcohorts, we used the set of patients with non-related diseases, from now on called pseudocontrols (PC) (see Section 4). Allele frequencies (AF), allele numbers (AN), and allele counts (AC) were calculated for solved-IRD cases, non-solved-IRD cases (free of candidate variants), and PC cases.
Next, for both solved and non-solved IRD cases independently, we compared each variant's AF (solved-AF or non-solved-AF) with its AF in the PC subcohort (PC-AF), see Section 4. We defined the "most frequent variants" in a IRD subcohort (IRD-MFVs) as those within the top 10% with the highest log2 of the fold change, log2(FC) values in the comparison performed. The cut offs for the log2(FC) were 3.12 and 3.92 in solved and nonsolved IRD cases, respectively (Figure 2A,B). The distribution of log2(FC) values in solved and non-solved IRD cases is shown in the Supplementary Figure S1. MFVs are prioritized for a posterior reevaluation of cases. Non-prioritized variants are defined as those with a higher frequency in IRD (FC > 0) but below the significant threshold. Classifying IRD-MFVs according to their clinical relevance and removing those not informative (see Section 4), we found IRD-MFVs enriched in deleterious variants in both solved and non-solved cases compared to non-prioritized variants (Fisher's exact test, p-values = 4.77 × 10 −56 and 1.69 × 10 −32 , respectively; Figure 2C,D and Supplementary Tables S1 and S2).
Focusing on the type of genes where the IRD-MFVs are located, we divided IRD-MFVs as present in IRD-associated genes, OERD-related genes, and NRD-associated genes. In IRD-solved cases, regarding IRD-MFVs in IRD-genes only, the ratio deleterious/benign is 78% (177/256), which is significantly higher than the same proportion in non-prioritized variants (14%, 593/1926, Figure 2E, Supplementary Table S1). A different trend was observed in non-solved IRD cases, where we found no significant differences in the percentages of deleterious/benign variants in IRD-genes between IRD-MFVs and non-prioritized variants ( Figure 2F, Supplementary Table S2). In IRD-MFVs in OERD-genes, we also found more deleterious variants in our prioritized set in both solved and non-solved cases (Fisher's exact test, p-value = 1.71 × 10 −6 and p-value = 3.25 × 10 −6 , respectively; Figure 2E,F, Supplementary Tables S1 and S2). Finally, we also observed an enrichment of deleterious variants in the IRD-MFVs located in NRD-genes in solved and non-solved IRD cases (p-value = 2.75 × 10 −8 and p-value = 1.90 × 10 −28 , respectively; Figure 2E,F, Supplementary Tables S1 and S2).
Furthermore, solved and non-solved cases were divided into disease sub-categories as syndromic, non-syndromic, and macular dystrophy forms (Supplementary Figure S2), and the analysis was repeated for each sub-group. Thus, as for all solved cases considered as a whole, syndromic, macular dystrophy, and non-syndromic forms behave very similarly, with more deleterious variants in the prioritized sets for IRD genes (Supplementary Figure S3A Next, we performed a reevaluation of non-solved IRD cases carrying an IRD-MFV and found eight variants that provide additional insights in 8 cases (Table 1). In this reevaluation, the diagnosis status of the cases is classified as (1) "solved", when the case has a conclusive diagnosis thanks to the variant(s) identified; or (2) "non-solved", in any other case. Non-solved cases were annotated according to findings that may help when making a future diagnosis as (2.1) "partially solved", if a heterozygous pathogenic or likely pathogenic variant is found within a recessive gene that fits the observed phenotype; and (2.2) "with evidence", when the variants identified in the analysis are: (2.2.1) pathogenic/likely pathogenic variants in genes not yet associated with the disease but with some evidence published in the literature or with overlapping phenotypes, or (2.2.2) the VUS in a gene associated with the phenotype when one VUS in dominant or 1-2 VUS in recessive genes were found. . Proportion of deleterious and benign variants in both solved (C) and non-solved IRD cases (D) and the p-values representing the enrichment of deleterious variants in IRD-MFVs. Enrichment analyses are also performed dividing the IRD-MFVs according to the genes in which they are located; they are grouped into IRD genes, other eye-related diseases (OERD genes), and other non-eye-related diseases (NRD genes) (E,F). Total number of deleterious variants in each group is noted at the top of the red bars. Non-significant p-values are marked as "ns".
Another direct application of the IRD relative variant frequencies is the reevaluation of the clinical significance of VUS. We extracted the FC of IRD-AF compared to PC-AF for a set of manually curated VUS whose reclassification could contribute to a conclusive diagnosis of an IRD case in our cohort and that is present in the final dataset (N = 63). Of them, six VUS are IRD-MFVs (log2(FC) ≥ 2.48, see Section 4) and 11 VUS are more frequent in IRD cases than in PCs with a log2(FC)FC ≥ 1.5. ACMG classification was performed for the 11 VUS adding the ACMG criteria PS4 ("The prevalence of the variant in affected individuals is significantly increased compared with the prevalence in controls") as true. Of them, 10 (91%) were reclassified as likely pathogenic/pathogenic ( Table 2). This analysis was able to solve two cases: one with the variant in the dominant gene COL11A1, and one with the variant in OFD1 that has X-linked inheritance ( Table 2). Table 1. Identification of most frequent variants in IRD (IRD-MFV) allowed for the reevaluation of non-solved IRD cases. A total of 8 cases (one per row) gained knowledge in their reassessment. The gene panel column refers to the type of diseases that the gene has been associated with, inherited retinal dystrophies (IRD), other eye-related diseases (OERD), and other non-eye-related diseases (NRD). In the phenotype, abbreviations are RCD: rod-cone dystrophy, RP: retinitis pigmentosa, and MD: macular dystrophy. In variant column: HGVSc, HGVSp, and ACMG classification are included. In "Other Variants", for those biallelic cases, the second variant found after reviewing the case is included. Genotype column has information about the genotype of the variant(s) in the sample. Diagnostic status column has the new diagnostic status of the case after the reanalysis, as (1) "solved", when the case has a conclusive diagnosis thanks to the variant(s) identified; or (2) "non-solved", in any other case. Non-solved cases were annotated according to findings that may help in a future diagnosis as (2.1) "partially solved", if a heterozygous pathogenic or likely pathogenic variant is found within a recessive gene that fits the observed phenotype; and (2.2) "with evidence", when the variants identified in the analysis are: (2.2.1) pathogenic/likely pathogenic variants in genes not yet associated with the disease but with some evidence published in the literature or with overlapping phenotypes, or (2.2.2) VUS in a gene associated with the phenotype when one VUS in dominant or 1-2 VUS in recessive genes were found.

Prioritization of Candidate Genes Based on Weighted Cohort-Specific Frequency of Pathogenic and Benign Variants in Non-Solved Ird Cases
In order to detect genes with an accumulated high pathogenicity in solved and nonsolved IRD cases as good candidates to be involved in IRD phenotypes, for each IRD subcohort, we extracted deleterious and benign variants and calculated the FC for their AF compared to the AF in PCs. For every subcohort and gene, the distributions of log2(FCs) in deleterious and benign variants were compared using a Wilcoxon rank sum test. Genes with a significant higher frequency in IRD cases of deleterious variants compared to benign variants (p-value < 0.05) were selected. This revealed a number of genes with an accumulated pathogenicity in solved and non-solved IRD cases that we divided, as before, into three groups: IRD-genes, OERD-genes, and NRD-genes. Actionable genes defined by ACMG were removed from the analysis (Supplementary Table S5). Thus, in IRD solved cases, we found 56 genes enriched in deleterious mutations. Of them, 22 (39%) were IRD genes, and the top 5 genes with more deleterious variants were ABCA4, USH2A, MYO7A, EYS, and ADGRV1, (Supplementary Table S6, and Figure 3A). In addition, 24 (43%) were OERD genes; the top 5 are highlighted here with more deleterious variants NEB, PAH, DNAH11, DNAH5, and ATM (Supplementary Table S6 and Figure 3A), and 10 (18%) were NRD genes, including the top 5 genes OBSCN, DYSF, SPTBN5, OTOF, and SPTB. Regarding non-solved cases, we found 18 genes with an accumulated pathogenicity, with IRD-genes less represented (22%) and OERD-genes more present (55%) than in solved cases. Finally, 22% of the prioritized genes were NRD-genes ( Figure 3B, and Table 3). Interestingly, there was a high overlap between IRD-genes and OERD-genes prioritized in solved and non-solved cases (75% and 70% of the smallest group (non-solved cases), respectively). However, the overlap in NRD-genes was smaller, with only one gene (25% of genes in non-solved cases) found in both IRD subcohorts (Supplementary Figure S4). Table 2. Reclassification of VUS, with information added about the difference in their frequency in cases and controls. The ACMG classification of VUS with log2 fold change (FC) higher than 1.5 is reevaluated. Gene code, variants in nucleotide code, and protein code are provided. The inheritance mode is annotated as autosomal recessive (AR), autosomal dominant (AD), and x-linked (XL). In the "Varsome" column, we add the automatic ACMG classification of the variants provided by the Varsome database at the time of writing. Previous ACMG criteria fulfilled by the cases are annotated in the "Criteria" column. Column "PS4" provides the new ACMG classification reached after including PS4 criteria. In column "Status", reclassified variants are marked.   Table 3. Genes with more frequent deleterious variants in IRD non-solved cases than in pseudocontrols. The genes are sorted by the gene panel where they are present (IRD, OERD, and NRD) and by number of deleterious variants. The gene panel column refers to the type of diseases that the gene has been associated with, inherited retinal dystrophies (IRD), other eye-related diseases (OERD), and other non-eye-related diseases (NRD). Number of deleterious and benign variants in IRD non-solved cases is shown. FDR stands for false discovery rate of the Wilcoxon rank sum test performed. Inheritance mode associated with each gene is annotated in the "Inheritance" column as recessive, dominant, or R/D (either recessive or dominant).  In a reevaluation of the non-solved cases with pathogenic, likely pathogenic, or VUS variants in the prioritized genes, we found five cases carrying mutations, possibly associated with the phenotype in two IRD associated genes, and one OERD gene (Table 4). Variants found in the gene MYO7A explained this case phenotype and helped to fully characterize it. Regarding OERD genes, the two pathogenic variants found in the gene ADAMTSL4 helped to solve lens luxation phenotype of this case. Table 4. New diagnostic status produced by the reassessment of non-solved IRD cases with deleterious variants in the genes prioritized. Details about five IRD non-solved cases with variants adding knowledge to the phenotype. The gene panel column refers to the type of diseases that the gene has been associated with, inherited retinal dystrophies (IRD), and other eye-related diseases (OERD). In the phenotype, abbreviations are MD: macular dystrophy. In variant column, HGVSc, HGVSp, and ACMG classification are included. Genotype column has information about the genotype of the variant(s) in the sample. Diagnostic status column has the new diagnostic status of the case after the reanalysis, as (1) "solved", when the case has a conclusive diagnosis thanks to the variant(s) identified; or (2) "non-solved", in any other case. Non-solved cases were annotated according to findings that may help in a future diagnosis as (2.1) "partially solved", if a heterozygous pathogenic or likely pathogenic variant is found within a recessive gene that fits the observed phenotype; and (2.2) "with evidence", when the variants identified in the analysis are: (2.2.1) pathogenic/likely pathogenic variants in genes not yet associated with the disease but with some evidence published in the literature or with overlapping phenotypes, or (2.2.2) VUS in a gene associated with the phenotype when one VUS in dominant or 1-2 VUS in recessive genes were found.

Gene
Gene

Carrier's Frequency of Recessive Diseases from a Multi-Disease Cohort
We calculated the carrier frequency (CF) for 69 genes involved in recessive nonsyndromic IRDs using the frequency of pathogenic heterozygous variants located in IRD genes in the pseudocontrol subcohort. We found three genes with a CF ≥ 1%, which represents a total of~4% of the total analyzed (Supplementary Table S12). We highlight ABCA4 and USH2A with a CF of~7% and~3%, respectively (Figure 4), with the first being responsible of Stargardt disease and the later causing Usher syndrome. In the case of ABCA4, if hypomorphic variants are excluded as in a previous work performed by Hanany and collaborators [19], CF is reduced to~4%. These genes are also the two most frequent in our IRD subcohort, found causally in 21% and 15% of the cases, respectively (data not shown). The most frequent variants for ABCA4 in the IRD subcohort and PC subcohort were NM_000350.3: c.3386G>T and NM_000350.3: c.3113C>T, respectively, while USH2A had variant NM_007123.5: c.2276G>T as the most frequent in the two subcohorts. The CF calculated using deleterious heterozygous/hemizygous variants on dominant/X-linked genes is available in Supplementary Table S12. The carrier frequency (CF) was calculated for all IRD genes with a recessive inheritance pattern, and at least one solved case in our cohort. Green represents the CF of the genes and orange the frequency in our IRD subcohort. The pseudocontrol subcohort was composed of 3531 cases, and the IRD subcohort had 1766 cases. The top 10 genes with higher CF are ordered decreasingly.

Discussion
The recent application of NGS techniques has considerably increased our competence to study and diagnose rare diseases. Regarding IRDs, although new associated genes are still being discovered (see RetNet, https://sph.uth.edu/retnet, accessed on 21 June 2022), the diagnostic ratios need to be boosted since up to~46% remain unsolved (data from our own IRD cohort at the time of writing). At the same time, at the diagnostic setting, the genomic information from patients is being accumulated as databases or annotation systems, which can also be privative if only commercial solutions are used. Initiatives such as gnomAD [17] or the collaborative Spanish variability server [18] are acting as a crowdsource to recover this data from the community and offer it back as aggregated allele frequencies. Allelic frequency calculations have been proven to be useful in studying the prevalence of rare diseases [25]; integrated in the diagnostic analysis routines, as an additional annotation source to detect technical biases [26]; or used to interpret and classify variants [27]. These resources have also been used to study the CF of deleterious variants in recessive genes in a particular population [16]. In this study, we propose to develop a framework of methods that are able to prioritize variants and genes in order to explore new associations with a specific disease. Thus, the reuse of the genomic data may provide new discovery capabilities to a heterogeneous cohort of genetic diseases, an extra value to the resources invested previously, as well as allowing the patients to contribute to their own or others' future diagnosis. The main hypothesis behind this work is that, aside from a few causal mutations, the genome of patients with Mendelian diseases behaves similarly to those of the healthy population, so patients with non-related diseases can act as pseudocontrols between each other. The fact of focusing on a heterogeneous but single-center cohort has two main advantages against using controls from larger genomic population databases [12,13,17]: first they can be technically more similar to the sequencing produced on the patients of interest; second, the geographical origin bias can be better controlled [15]; last, the phenotyping of patients can be fine-tuned by experts in order to create subcohorts of interest. A concept introduced here is the significantly different frequent variants, called MFVs, which are variants more present in cases than in pseudocontrols, with statistical support. As a proof of concept, we found that considering the whole IRD cohort, the IRD-MFVs are enriched in deleterious mutations, suggesting that the prioritization is effective there. Indeed, variants selected on solved IRD cases in IRD genes are mostly pathogenic (>78%) compared to benign variants, due to the detection of prevalent causal variants. In contrast, non-solved IRD cases display no differences in the proportion of deleterious/benign variants in prioritized and non-prioritized variants in IRD-related genes, indicating that there are not many described pathogenic variants left out during diagnosis. In spite of this, our approach was able to solve or partially solve five cases with prioritized variants in IRD genes. In this paper, we also highlight the significant accumulation of deleterious variants in those prioritized located in OERD and NRD genes, in both IRD solved and non-solved cases. The rationale behind it might be different, though. While frequent pathogenic mutations in OERD and NRD genes in solved cases may suggest a more complex genotype scenario for Mendelian diseases [28,29], in non-solved cases we have to add the possibility of needing a disease re-evaluation, and the causal mutation being present in a not yet associated IRD gene. The exploration of syndromic and nonsyndromic forms provides the same signal but with lower p-values for syndromic cases. Up to eight non-solved cases gained new insights due to the reevaluation of our prioritized variants. Our variant prioritization approach was also applied for VUS reclassification, a major challenge to unlock the diagnosis of rare pathologies [10,[30][31][32]. Indeed, in our IRD subcohort, the top five genes with more deleterious variants in IRD solved cases present a higher degree of uncertainty in variant annotation (the proportion of VUS and deleterious variants) in unsolved IRD cases. For an initial list of 63 VUS whose reclassification may solve a case from the IRD cohort, we found 11 VUS more frequently in IRD cases than in pseudocontrols, and 10 of them (~91%) changed their classification to likely pathogenic or pathogenic by the application of the ACMG PS4 criteria. This reclassification was able to solve two cases and provide new evidence to the other 10 patients.
In addition to the revised cases reported in this work, the disease-specific AF and its comparison to pseudocontrols have been implemented in the variant annotation task of our reanalysis pipeline [33] that performs periodic reanalysis of non-solved cases, as well as of WES and WGS analysis. Thus, the database is expected to contribute to the diagnosis of more patients over time. Furthermore, and in order to facilitate the implementation of the database and adjust it to the cohorts of other clinical settings, we have code and instructions to build an in-house database available at https://github.com/TBLabFJD/ DbofAFs, accessed on 21 June 2022.
In parallel, we also aimed to highlight genes besides variants. Our method extracts genes having more frequent deleterious variants in IRD cases than in pseudocontrols, weighted by the relative frequency of benign variants in order to increase the disease association signal. Several discovery scenarios may fit into the results of this proposal: first, finding underrated IRD genes with a role in IRD cases, such as DYNC2H1 and MYO7A genes that may carry not previously inspected pathogenic variants (thus, we were able provide new evidence in four cases); next, by providing extra findings in complex cases, either syndromic cases, gene modifiers, or dual diagnosis. For instance, we made a dual diagnosis in a previously non-solved syndromic IRD case with rod-cone dystrophy (HP:0000510) and lens luxation (HP:0012019), among other systemic findings. This case includes a partially solved IRD phenotype with a heterozygous variant in gene CDH23 (Table 1), and a solved Lens Luxation phenotype with two pathogenic variants in compound heterozygosis in gene ADAMTSL4. In non-classical Mendelian scenarios, the exploration of the mutational landscape of IRD may help, for instance, to identify more complex cases as (i) genetic pleiotropy together with causal variants in recessive forms [22], (ii) digenic inheritance [23], or (iii) triallelic sites associated with BBS [24]. Last, there is also the possibility of detecting genes not yet associated with IRD, that is, candidate genes that may become IRD genes if further analyses are performed.
As a result of the reevaluation of IRD non-solved cases using the methods described in this work, we were able to solved four cases (one of them with a dual diagnosis still unsolved) and provide new insights in 20 cases, 15 of them providing a single variant for a recessive case that fits the phenotype (partially solved) and five with candidate variants in genes not yet fully associated with the phenotype (marked here as "with evidence"). Further studies are needed to solve cases annotated as partially solved and with evidence.
An additional interesting use of an internal database of allele frequencies is to have a cohort-specific CF estimation. This analysis can provide a better understanding on how deleterious variants are distributed in a general population and is relevant for their use in a public health strategy for genetic counselling. For instance, in the case of IRD, the gene with a higher carrier frequency is ABCA4, with carrier variants in~7% of the population, which is in line with previous estimations [34]. Considering the curated set of pathogenic variants used by Hanany et al. [19] for ABCA4, we obtained a similar CF (~6%). Nevertheless, excluding hypomorphic variants, as recommended in this study, CF drops to~4%. The high CF obtained for this gene in our cohort can be explained partially by these variants.
It is reasonable to state that although pseudocontrols are suitable for providing a good estimation of general allele frequencies, the lack of healthy controls can be seen as a limitation. However, availing of such a control sample set is not always feasible for clinical settings and should be provided under the umbrella of national plans. The major constraint in the discovery capability of our database is that we are restricted to the~5000 genes targeted in the clinical exome approaches, and thus implementation using data from whole exomes would be optimal. Our intention is to maintain and expand the database in a number of cases but also in genomic regions. We should mention that the IRD non-solved cases presented in the cohort can also have causal variants in non-coding regions, which are not covered with the clinical exome approach.
Although in this work we focus on IRD as the larger group of diseases in our cohort, the same methodology can be applied to other genetic rare diseases in our cohort as well as in other settings. In addition, the disease-specific subcohort that is the subject of analysis can also be tuned in its composition in order to provide different capabilities to the discovery process. For instance, although our subcohort of IRD unsolved cases did not include suspected recessive cases with only one detected candidate variant in heterozygosity (the so-called monoallelic cases), they could be included in this discovery subcohort, so these candidate variants are also evaluated.
In conclusion, our cohort-specific database of allele frequencies has proven to be able to diagnose non-solved IRD cases, reclassify VUS, propose candidate genes, and calculate CF on genes of interest. We believe that the results shown here can highlight the importance of the reuse of genomic data produced in clinical settings, where the phenotyping is usually exhaustive and the patients waiting for a diagnosis or a genetic counselling can be directly benefited.

Ethics Approval and Consent to Participate
The project was reviewed and approved by the Research Ethics Committee of UH-FJD (Ref. 2016/59) and fulfills the principles of the Declaration of Helsinki and subsequent reviews. All patients signed an informed consent before participating. All samples included in this work were pseudonymized, and genomic data were only treated in aggregation.

Cohort Description
We retrospectively selected all index cases (N = 5683) with a clinical exome test performed as a first-tier approach at the Genetics and Genomics Department of the University Hospital Fundación Jiménez Díaz (UH-FJD, Madrid, Spain) from September 2015 to May 2021. The cohort included patients suffering from genetic diseases classified in 14 categories, with the largest disease group being IRD (N = 1766) (Supplementary Table S7). The rest of the diseases were grouped in "other eye-related diseases" (OERD) and "non-related diseases" (NRD). Based on the diagnostic status set by the molecular geneticists after CE inspection, all IRD cases were classified as solved and non-solved. From the non-solved cases, we extract those annotated as "partially solved" or "VUS-cases".

Detection and Removal of Sample Duplicates and Cryptic Relatedness
All known sample duplicates and relatives were removed prior to frequency calculation. In order to detect other possible sample duplicates and relatives, a PLINK whole-genome association analysis toolkit [55] was used to calculate inbreeding coefficients (identity-by-descent, IBD). First, single nucleotide polymorphisms (SNPs) pruning was performed removing SNPs covered in less than 95% of the samples (PLINK parameter: geno 0.05), with less than 5% allelic frequency (PLINK parameter: maf 0.05), and in linkage disequilibrium (PLINK parameters: indep-pairwise 50 5 0.5). With the resulting SNPs, the IBD was calculated for all sample combinations (PLINK parameter: genome). All samples with a PI_HAT score higher than 0.35 were removed.

Variant Frequency Calculation for IRD Patients and Pseudocontrols
After identifying and removing sample duplicates and relatedness (N = 5683), variants (in vcf-format) from the index-cases were processed together and merged into a multi-vcf file. Sequencing coverage was also calculated for each sample to distinguish between non-covered and non-mutated sites.
We developed in-house routines for allelic frequency calculation based on the Hail python library for genomics data exploration and analysis (https://hail.is, accessed on 1 March 2021). The allele frequency (AF), allele-number (AN), allele-count (AC), and homozygotes-count were obtained for the general cohort and for several subcohorts composed of IRD cases: (1) all IRD cases, (2) non-solved IRD cases, (3) solved-IRD cases, (4) syndromic-IRD cases, (5) non-syndromic IRD cases, and (6) macular dystrophy cases. To define the subcohort of IRD PC for all IRD subcohorts, we took samples from the subcohort NRD (N = 3531). AF, AN, and AC were calculated from this PC subcohort (PC-AF, PC-AN, and PC-AC).

Definition of Genes Associated with IRD, OERD, and NRD
Three disease-specific gene panels were used in the inspection of variants and genes: (i) the IRD gene panel (244 genes, including 136 genes for syndromic-IRD and 108 genes for non-syndromic IRD) as the virtual gene panel used in the diagnosis of IRD cases in the Genetics and Genomics Department of the UH-FJD, and extracted using RetNet, HGMD, and literature searches (Supplementary Table S8); (ii) OERD genes, including non-IRD genes with ocular phenotype (all genes linked with the HPO term "Eye Disease"-HP:0000478, N = 1542 genes, Supplementary Table S9); and (iii) NRD genes (the rest of the genes included in TSO/CES panels, not related to eye diseases, N = 3260, Supplementary Table S10). Genes that ACMG recommends reporting in case of secondary findings [56] (Supplementary Table S5) were excluded from the gene panels and analyses, except for the gene RPE65 that belongs to the IRD panel.

Variants Discarded for Analysis
Variants detected in the 5683 samples from our general cohort were further filtered out using two criteria: (i) quality filtering-we removed 5% of variants with lowest AN; and (ii) population filtering-in order to discard a population origin bias in our IRD subcohort compared to the rest of the cohort, we keep variants present in IRD solved or non-solved cases, and assuming no differences in population origin between IRD solved and nonsolved cases, having a fold change between non-solved-AF and solved-AF > 90% percentile, from them we rescue those having a non-solved-AF and solved-AF < 0.1.

Determination of Differentially Frequent Variants in IRD Subcohorts Compared to Pseudocontrols
We define differentially frequent variants as those that have a higher frequency in a subcohort compared with a control subcohort. In order to extract variants that are differentially frequent in the IRD subcohorts (solved-IRD, non-solved IRD, syndromic-IRD, non-syndromic IRD, and macular dystrophies) compared to the IRD PC subcohort, we calculated the FC of the AF in the IRD subcohort compared to the PC-AF for each of the variants. Based on the distribution of the log2 of FCs (log2(FC)) of all variants, we selected those above the 90% as the significant differential frequent variants in a subcohort, tagged as IRD-MFV (IRD most frequent variants) for any IRD subcohort. Variants annotated by ClinVar as "pathogenic" or "likely pathogenic" or with a CADD_PHRED ≥ 30 (top 0.1% most deleterious variants according to CADD) were classified as deleterious, and variants annotated by ClinVar as "benign" or "likely benign" were classified as benign. We compared the proportion of deleterious variants in the IRD-MFV group with the rest of the variants (non-prioritized variants) for solved-IRD and non-solved-IRD cases. Furthermore, we performed this comparison grouping IRD cases as syndromic forms, nonsyndromic forms, and macular dystrophies. A Fisher's exact test was applied to compare the proportion of deleterious variants in these groups, and a p-value < 0.05 was taken as significant.

VUS Reclassification
We selected VUS whose reclassification can determine the diagnosis of an IRD case in our cohort. These VUS are reported in the diagnostic process at the Genetics Department of the UH-FJD if no pathogenic or likely-pathogenic variant is found to be associated with the phenotype. Variants are classified using ACMG guides. In the IRD subcohort, 100 VUS were reported to fulfil these criteria [10] (Supplementary Table S11). Of these, 63 fulfilled criteria to be within the database generated in this work and were still classified as VUS according to ACMG (information taken from VarSome at the time of the analysis). For these VUS, we annotated IRD-AF (general IRD cohort) and PC-AF frequencies, calculated the FC for these two frequencies, and selected two sets: 1) VUS with a log2(FC) ≥ 1.5 (N = 11), and VUS with a log2 (FC) ≥ 2.48 (value of the 90th percentile of the distributions of the log2(FC), N = 6). For all the selected VUS (N = 11), we marked the specific ACMG criterion PS4 for which "the prevalence of the variant in affected individuals is significantly increased compared with the prevalence in controls" and applied previous evidence to obtain a new ACMG classification.

Gene Prioritization for IRD Association
To prioritize genes in non-solved IRD cases, we selected for each gene included in the database (i) deleterious variants annotated by ClinVar as "pathogenic" or "likelypathogenic" or with a CADD_PHRED ≥ 30; and (ii) benign variants annotated by ClinVar as "benign" or "likely benign". Genes with at least five deleterious and five benign variants were selected for further analysis. For each selected variant, a log2 (FC) was calculated between non-solved IRD-AF and PC-AF. Finally, we applied the Wilcoxon rank sum test to the distribution of log2 (FC) for deleterious and benign variants in each gene. P-values were adjusted using FDR, and genes with an adjusted p-value < 0.05 were considered significant. This list of significant genes was classified into three different gene panels according to the relation degree with IRDs: (i) IRD gene panel, (ii) OERD gene panel, and (iii) NRD gene panel. This analysis was also performed for solved-IRD cases (Supplementary Figure S6).

Carrier Frequency Calculation
Carrier frequency (CF) was calculated for genes in the non-syndromic IRD gene panel with at least three solved cases in our cohort. Genes were classified as having autosomal recessive or dominant inheritance patterns using the software DOMINO [57] and OMIM database [58]. In genes annotated as recessive, CF was calculated including the variants classified: (i) "pathogenic" or "likely pathogenic" in ClinVar; (ii) "pathogenic" or "likely pathogenic" in LOVD database; (iii) with a CADD_PHRED score ≥ 30; (iv) or frameshift/stop-gain variants. The total AC of the variants selected was divided by the AN, and the result was multiplied by 2 (two alleles) and by 100 to represent the result as a percentage (0-100%), according to Equation (1).
For the gene ABCA4, the CF was also calculated excluding hypomorphic variants, as described in Hanany et al. [19]. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data underlying this article will be shared on reasonable request to the corresponding author.