1. Introduction
Inherited retinal diseases (IRDs) are a clinically and genetically heterogeneous group of visually debilitating diseases caused by pathogenic variation in proteins critical for retinal function [
1]. Over 300 causative genes have been implicated in IRDs, and these genes encode a spectrum of proteins including structural and transmembrane proteins, phototransduction proteins, and other regulatory proteins involved in the visual cycle [
2]. Intrinsically disordered regions (IDRs) are protein segments that are depleted in hydrophobic amino acids and do not form a hydrophobic core or stable 3D structure. They lack stable structures in isolation and instead exist as an ensemble of conformations that varies depending on interactions with other proteins. IDRs play crucial roles in many biological processes, including cell signaling and regulation [
3]. Therefore, it is not surprising that they have been linked to the pathogenesis of neurodegenerative diseases.
Recently, Lee et al. showed that IRD proteins that exhibit a high degree of disorder tolerate a higher degree of missense variation and that these proteins also exhibit a lower proportion of pathogenic missense variants relative to total missense variants [
4]. These findings were observed in 14 IRD genes, which were categorized into four groups: (1) proteins with overrepresentation of missense variants (SAMD11, ALMS1, WFS1, RP1L1, KCNV2, and ADAMTS18), (2) transmembrane transport proteins (CNGB1, CNGA1, TRPM1, ABCA4, BEST1, and KCNV2), (3) internal or structural proteins of the photoreceptors that are essential for visual function (RHO and RPE65), and (4) secreted proteins (TIMP3 and ADAMTS18) [
4]. IDRs are known to affect channel protein function [
5].
Though these initial investigations yielded interesting results, we wanted to further evaluate these findings in a larger number of proteins. We therefore examined all proteins in the RetNet Database, which provides a list of 324 genes and loci implicated in IRDs. We investigated the degree of disorder in each protein and compared it to the number of missense variants. Through this analysis, we identified outlier proteins that do not exhibit the predicted relationship between disorder and missense variation. Further investigation of these genes may be warranted to examine how their defining characteristics relate to protein function and disease.
3. Results
Of the 336 genes and loci included in the RetNet Database, 7 mitochondrial genes were excluded, as were 42 additional entries that either (1) lacked relevant data in gnomAD or (2) were historical linkage regions synonymous with genes that have since been identified, leaving 287 available for inclusion (
Supplementary Table S1).
The mean percent disorder in all proteins was 26.3 ± 26.0 (range 0–100), mean percent pathogenicity was 5.2 ± 11.0 (range 0–87), mean number of total missense variants was 424.4 ± 450.0 (range 10–3130), and mean percent total missense was 50.0 ± 13.4 (range 3.1–100).
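The summary metrics above can be derived from per-protein counts. The following is a minimal sketch rather than the authors' pipeline: the record fields (`length`, `disordered_residues`, `missense`, `pathogenic_missense`) and the toy values are hypothetical, percent pathogenicity is taken as pathogenic missense variants over total missense variants, and percent total missense is assumed here to mean missense variants per amino acid.

```python
from statistics import mean, stdev

# Hypothetical per-protein records; field names and values are illustrative only.
proteins = [
    {"gene": "GENE_A", "length": 400, "disordered_residues": 20,
     "missense": 180, "pathogenic_missense": 18},
    {"gene": "GENE_B", "length": 1000, "disordered_residues": 500,
     "missense": 600, "pathogenic_missense": 12},
    {"gene": "GENE_C", "length": 250, "disordered_residues": 250,
     "missense": 100, "pathogenic_missense": 0},
]

def percent(part, whole):
    """Return part/whole as a percentage, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0

def metrics(p):
    """Compute the three per-protein percentages used in the summary."""
    return {
        "gene": p["gene"],
        "pct_disorder": percent(p["disordered_residues"], p["length"]),
        "pct_pathogenicity": percent(p["pathogenic_missense"], p["missense"]),
        # Assumed definition: missense variants per amino acid.
        "pct_total_missense": percent(p["missense"], p["length"]),
    }

rows = [metrics(p) for p in proteins]

def summarize(values):
    """Mean, sample standard deviation, minimum, and maximum."""
    return mean(values), stdev(values), min(values), max(values)

m, sd, lo, hi = summarize([r["pct_disorder"] for r in rows])
print(f"percent disorder: {m:.1f} ± {sd:.1f} (range {lo:.0f}–{hi:.0f})")
```

The same `summarize` helper can be applied to the other columns to produce the remaining mean ± SD and range values.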
The percentile distribution of IDRs in all proteins is shown in
Figure 1, which plots the number of proteins with respect to the percent disorder organized by decile. The highest number of occurrences was in the 0 to 10th disorder decile, and the lowest distribution was in the 90th to 100th disorder decile. We previously showed that proteins with a high degree of disorder tolerated the greatest amount of missense variation [
4], so we plotted the percent disorder of all proteins in the RetNet Database with respect to the total number of missense variants (
Figure 2). Regression analysis showed R = 0.079 and p = 0.18 between percent disorder and total missense variants, R = 0.024 and p = 0.68 between percent disorder and percent pathogenicity, and R = 0.062 and p = 0.301 between percent disorder and percent total missense.
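To make the relationship between a correlation coefficient and its p-value concrete, the sketch below computes Pearson's r on toy data and converts a given r and sample size into an approximate two-sided p-value. It is illustrative only: the toy values are hypothetical, the sample size n = 287 assumes that every included protein entered the regression, and a standard normal distribution stands in for Student's t, which is a reasonable simplification only for large n.

```python
from math import sqrt
from statistics import NormalDist, mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def approx_p_value(r, n):
    """Approximate two-sided p-value for the null hypothesis of no correlation.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and, as a simplification,
    a standard normal in place of Student's t (adequate for large n).
    """
    t = r * sqrt(n - 2) / sqrt(1 - r * r)
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Toy data (hypothetical): percent disorder vs. total missense variants.
disorder = [2.0, 5.0, 10.0, 40.0, 75.0, 100.0]
missense = [300, 150, 420, 380, 200, 510]
r = pearson_r(disorder, missense)

# Consistency check against the reported R = 0.079, assuming n = 287.
p = approx_p_value(0.079, 287)
print(f"toy r = {r:.3f}; p for R = 0.079 with n = 287 is about {p:.2f}")
```

Under these assumptions the approximation lands close to the reported p = 0.18, which suggests that the weak correlation, not a small sample, explains the lack of significance.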
Figure 3 displays a scatterplot of percent disorder with respect to percent total missense. To interrogate the differences between proteins at the 0 to 10th disorder decile and 90th to 100th disorder decile, we plotted the percent disorder with respect to the total number of missense variants and the percent total missense for each group (
Figure 4a–d). Outliers in the 0 to 10th disorder decile included HMCN1, ADGRV1, USH2A, DYNC2H1, and LAMA1 for total missense variants and SLC38A8 for percent total missense. Removing the five total missense outliers from the 0 to 10th disorder decile revealed mean total missense variants of 336.5.
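The decile grouping and the effect of excluding named outliers on a decile mean can be sketched as follows. This is an illustration under stated assumptions: the records are hypothetical, a protein with exactly 10% disorder is placed in the second bin, and the outlier list is supplied by name because the text does not state the criterion used to flag outliers.

```python
from collections import Counter
from statistics import mean

def disorder_decile(pct_disorder):
    """Map percent disorder (0-100) to a bin index 0..9; 100 falls in the top bin."""
    return min(int(pct_disorder // 10), 9)

# Hypothetical (gene, percent disorder, total missense variants) records.
records = [
    ("GENE_A", 3.0, 250),
    ("GENE_B", 7.5, 310),
    ("GENE_C", 5.9, 3000),  # ordered protein with an unusually high count
    ("GENE_D", 42.0, 500),
    ("GENE_E", 96.0, 120),
    ("GENE_F", 100.0, 80),
]

# Number of proteins per disorder decile.
counts = Counter(disorder_decile(d) for _, d, _ in records)

def decile_mean(records, decile, exclude=()):
    """Mean total missense in one decile, skipping any gene named in `exclude`."""
    values = [m for g, d, m in records
              if disorder_decile(d) == decile and g not in exclude]
    return mean(values)

with_outlier = decile_mean(records, 0)
without_outlier = decile_mean(records, 0, exclude={"GENE_C"})
print(counts[0], with_outlier, without_outlier)
```

A single large, ordered protein can dominate the mean of its decile, which is why the decile means are reported both with and without the outliers.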
Percent disorder with respect to percent pathogenicity for all proteins is shown in
Figure 5. Proteins in the 0 to 10th disorder decile exhibited a mean percent pathogenicity of 5.2. When the outliers were excluded, proteins in the 0 to 10th disorder decile exhibited a mean percent pathogenicity of 8.6. Proteins in the 90th to 100th disorder decile exhibited a mean percent pathogenicity of 13.7. When NDP was excluded, the mean percent pathogenicity for proteins in the 90th to 100th disorder decile fell to 10%. Outlier proteins in the 0 to 10th disorder decile included PRPS1, RPE65, RDH12, MFN2, MVK, PPT1, OAT, CRB1, and GUCA1A. These proteins all exhibited percent pathogenicity above 15% (
Figure 6).
In the 0 to 10th disorder decile, regression analysis for all proteins showed R = 0.059 and p = 0.6 between percent disorder and total missense variants, R = 0.007 and p = 0.3 between percent disorder and percent pathogenicity, and R = 0.069 and p = 0.508 between percent disorder and percent total missense. These values did not change significantly with the outliers removed. In the 90th to 100th disorder decile, regression analysis showed R = 0.306 and p = 0.504 between percent disorder and total missense variants, R = 0.225 and p = 0.627 between percent disorder and percent pathogenicity, and R = 0.151 and p = 0.747 between percent disorder and percent total missense.
After the five outlier proteins in the 0 to 10th disorder decile were removed, the regression between percent total missense and percent disorder showed R = 0.238 and p = 0.024.
4. Discussion
We show that in the cohort of IRD proteins listed in RetNet, the average percent disorder was approximately 26%, with most proteins concentrated in the 0 to 10th disorder decile. Our previously published work showed that internal or structural proteins of photoreceptors were the most ordered and had a low ratio of pathogenic to total missense variants. In the 0 to 10th disorder decile, where the majority of our proteins were concentrated, HMCN1, DYNC2H1, ADGRV1, LAMA1, and USH2A exhibited low disorder but had an unexpectedly high number of total missense variants. We deemed these five proteins “outlier proteins”, four of which are internal or structural proteins. HMCN1 is an extracellular matrix protein that shares similarities with fibulins and has been linked to retinitis pigmentosa. DYNC2H1 is a dynein protein involved in retrograde transport in cilia, and its variants are known to cause Leber congenital amaurosis [
6]. USH2A and LAMA1 are both extracellular matrix proteins. When the outlier proteins were removed from the main analysis, the regression between percent disorder and total missense variation showed a stronger positive correlation and became significant (
p = 0.024). Consistent with previously published reports [
7] and our previous work, this suggests that regions with more disorder do tolerate a greater degree of missense variation.
SLC38A8, the one protein in the 0 to 10th disorder decile that was an outlier for percent total missense, encodes a proton antiporter and has been implicated in foveal hypoplasia. It had a higher number of missense variants relative to its number of amino acids than would be expected for its percent disorder. However, its overall percent pathogenicity is low at 1.48%, which may mean that the functional effect of many of these missense variants is benign [
8].
With regard to the five proteins in the 0 to 10th disorder decile with an outlier number of total missense variants, USH2A mutations are the most frequent cause of inherited retinal dystrophy, including Usher Syndrome type II and non-syndromic retinitis pigmentosa [
9].
USH2A encodes usherin, a transmembrane protein of 5202 amino acids that contains laminin EGF motifs, a transmembrane domain, and fibronectin repeats [
10]. It is predominantly expressed in photoreceptors. With only 5.9% disorder, one would expect it to have a low tolerance for missense variation, yet it has the second highest number of total missense variants (3076). It is possible that the functional effect of a large portion of these missense variants is benign, as the percent pathogenicity was relatively low at 4.8%.
ADGRV1 is also a cause of Usher Syndrome type II and shares many characteristics of functional and structural impairment with
USH2A [
9]. Both
USH2A and
ADGRV1 belong to the same protein complex, are located at the periciliary region between the inner and outer segment of photoreceptors, and are thought to be important for stabilizing the connecting cilium [
11]. ADGRV1 and USH2A retinopathy produce indistinguishable clinical profiles, suggesting that their loss of function produces similar effects in the retina [
10].
Lastly, there are 10 proteins with greater pathogenicity than expected relative to their percent disorder. First, NDP has a relatively high percent pathogenicity (87.1%) despite having a high degree of disorder (100%). The reasons for this are unclear, though it is interesting to note that only exons 2 and 3 are translated [
12]. PRPS1, RPE65, and RDH12 have 0 amino acids in IDRs, so using IDRs to inform diagnostics surrounding mutations affecting these proteins may be less helpful. However, it makes sense that these proteins do not tolerate mutations well given that their percent disorder is minimal. Therefore, they do follow our hypothesis.
There are several limitations to this study. Metapredict utilizes a machine learning approach known as knowledge distillation, whereby a computationally cheap model is trained with data generated with computationally expensive models, which can have limitations [
13]. For instance, our analysis demonstrated that genes in the first quartile by size exhibited significantly higher disorder (24.3% versus 6.3%; p = 0.002). Smaller proteins may be more difficult to predict, and thus, analyzing the relationship between protein size and disorder prediction may be an area for future investigation.
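A size-stratified comparison of this kind can be sketched by splitting proteins into length quartiles and averaging percent disorder within each. The example below is hypothetical throughout: the data are invented, the quartile cut points come from Python's default `statistics.quantiles` method, and no significance test is shown because the text does not specify the comparison group or the test behind the reported p-value.

```python
from statistics import mean, quantiles

# Hypothetical (protein length in amino acids, percent disorder) pairs.
data = [(120, 60.0), (180, 35.0), (300, 20.0), (450, 12.0),
        (700, 8.0), (900, 15.0), (1500, 5.0), (5000, 6.0)]

# Three cut points dividing the lengths into four size quartiles.
q1, q2, q3 = quantiles([length for length, _ in data], n=4)

def size_quartile(length):
    """Return 1 (smallest) through 4 (largest) for a protein length."""
    return 1 + sum(length > cut for cut in (q1, q2, q3))

# Collect percent disorder values by size quartile.
by_quartile = {q: [] for q in (1, 2, 3, 4)}
for length, disorder in data:
    by_quartile[size_quartile(length)].append(disorder)

mean_disorder = {q: mean(v) for q, v in by_quartile.items() if v}
print(mean_disorder)
```

If predicted disorder falls steadily with increasing size in real data, that pattern would be worth separating from any size-dependent bias in the predictor itself.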