Dark Proteome Database: Studies on Dark Proteins

The dark proteome, as we define it, is the part of the proteome for which no 3D structure has been observed, either by experimental characterization or by homology modeling. Of the 550,116 proteins available in Swiss-Prot (as of July 2016), 43.2% of the eukaryotic universe and 49.2% of the viral universe are part of the dark proteome. In bacteria and archaea, the dark proteome is considerably smaller, at 12.6% and 13.3%, respectively. In this work, we present a necessary step toward completing the dark proteome picture by introducing the map of the dark proteome in human and in other model organisms of special importance to mankind. The most significant result is that around 40% to 50% of the proteome of these organisms is still in the dark, with the higher percentages belonging to higher eukaryotes (mouse and human). Because more than 50% of the human proteome is dark, deeper studies were carried out, including the identification of 'dark' genes that are responsible for the production of so-called dark proteins, as well as the identification of the 'dark' tissues where dark proteins are over-represented, namely the heart, cervical mucosa, and natural killer cells. This is a step toward a deeper knowledge of the human dark proteome.


Introduction
Many key insights and discoveries in the life sciences have been derived from atomic-scale 3D structures of proteins. Thanks to steady improvements in experimental structure determination methods, the Protein Data Bank (PDB) [1], which stores these structures, recently passed 125,000 entries, a landmark in our understanding of the molecular processes of life. This still lags far behind the growth of protein sequence information, with less than 0.1% of UniProt [2] proteins linked to a PDB [1] structure. However, the understanding that evolution conserves structure more than sequence has led to large-scale efforts to compute structural models [3]. Previously, we contributed to Aquaria [4], a resource built upon a systematic all-against-all comparison of Swiss-Prot [2] and PDB sequences, resulting in a large number of template models [4]; this provides a depth of sequence-to-structure information for the visible proteome that is currently not available from other resources. However, there is a dark side of the proteome, which we were the first to categorize [5], i.e., regions of protein sequences, or whole sequences, that remain inaccessible to either experimental structure determination or modelling approaches such as Aquaria. This knowledge has been updated and is maintained in the independent Dark Proteome Database (DPD) [6], where "dark proteome" is used as a synonym for the structurally unknown part of the protein sequence universe, i.e., full sequences and/or regions of sequences for which the structure is currently undetermined [5,6]. Intrinsically disordered proteins (IDPs) and/or regions (IDRs), more recently also referred to as dark proteome [7], are intrinsically unstructured proteins and/or regions whose structure determination by conventional methods such as X-ray crystallography, nuclear magnetic resonance (NMR), and electron microscopy (EM) is arduous, due to […] are impossible in humans (technically and ethically).
Some invasive procedures associated with experiments are extremely time-consuming or cost-intensive. C. elegans presents itself as a reliable alternative because of its short generation time (four days) and because its complete anatomy is known. The adult hermaphrodite has exactly 959 cells, while the adult male has exactly 1031 cells, and both are transparent. This allows researchers to relate behaviors to particular cells, to trace the effects of genetic mutations, and to gain relevant insights into the mechanisms of development and ageing. Therefore, owing to the evolutionary conservation of gene function, C. elegans is an ideal model organism for tracing the basic genetic mechanisms of human development and disease, such as cancer and neurodegenerative diseases. C. elegans was the first multicellular eukaryotic organism to have its whole genome sequenced.

Escherichia coli (Bacteria)
Escherichia coli (E. coli) is found in the intestine of warm-blooded organisms. This bacterium is a good model, firstly, because it is a simple organism and, secondly, because it can be cultured and grown easily and inexpensively in a laboratory. However, the main reason why it is so heavily studied, even though it is a bacterium, is that its basic biochemical mechanisms are shared with the human organism. Another important aspect is that E. coli provided the first understanding of the transcription factors that activate and deactivate genes in the presence of a virus. From that point on, E. coli has been used as a host in genetic engineering, especially in health, where it produces several types of proteins, many of them encoded by human genes, that are applied as medicinal drugs. There are several variants of E. coli, most of them harmless. However, some can be lethal and are responsible for product recalls due to food contamination or food poisoning. This bacterium was the first prokaryotic model organism to have its genome sequenced (K12 strain) and has a single circular chromosome with 4.6 million base pairs.

Saccharomyces cerevisiae (Yeast)
Saccharomyces cerevisiae has been recognized for centuries as the key agent in brewing. It is considered a model organism mainly because it is easy to culture, grows fast, and is inexpensive to produce in a laboratory. Being a eukaryote, it shares the complex internal cell structure of plants and animals, but without the high percentage of junk DNA present in more complex eukaryotes, which facilitates research. Since S. cerevisiae is biochemically very similar to the human organism, it has made possible many studies of the molecular processes involved in the cell cycle, meiosis, recombination, DNA repair, ageing, and other fundamental areas of biology. S. cerevisiae was the first eukaryote to have its genome completely sequenced. The genome is composed of 16 chromosomes containing approximately 12.2 million base pairs.

Mus musculus (Mouse)
Mus musculus (mouse) is the most famous model organism because it is the most widely used mammal in the medical and biological scientific communities. It is such a good model, first, because it is a mammal and therefore has organs and developmental processes that are very similar to those of the human organism; second, because mice breed easily, grow fast, and are very easy to maintain and manipulate in a laboratory; and finally, because mice suffer from most of the diseases and calamities that affect mankind. Therefore, mice have an extremely important role in the development of new pharmaceuticals for humans. The mouse genome consists of 40 chromosomes with 2.63 billion base pairs.
In our previous work [5], we analyzed the four domains of life and concluded that the dark proteome is mostly not disordered, mostly not compositionally biased, and mostly not transmembrane; more importantly, and unexpectedly, it consists mostly of "Unknown Unknowns" [5]. In eukaryota, the portion of dark proteins that are "Unknown Unknowns" is almost 50%. This portion is composed of ordered, globular proteins with low compositional bias. In bacteria this percentage is over 50%, and in archaea it reaches almost 70%. Finally, in viruses it reaches almost 75% [5]. Several questions raised at that time are still valid today, such as: Can we describe the location and environment of dark proteins in more detail? Can we describe their functions in more detail?

Materials and Methods
Dataset: The set of protein sequences selected for this work was taken from the Swiss-Prot release of July 2016 [6]. The protein structures were extracted from the PDB in July 2016. Predictions from PSSH2 [4], PMP [8], and PredictProtein [9] are the versions from July 2016. Finally, protein-protein interaction (PPI) information was taken from STRING [10], also from July 2016. The Swiss-Prot dataset is composed of 550,116 proteins divided into four kingdoms: 19 […]
Mapping Darkness: For each Swiss-Prot protein, each residue was categorized as "non-dark" if it met either of the following criteria: the residue was aligned onto the "ATOM" record of any PDB entry [1] in the corresponding Aquaria matching-structures entry (criterion A), or the residue was aligned onto a PDB entry in the corresponding UniProt entry (criterion B). All other residues were categorized as "dark." We then calculated a "darkness" score (D) as defined in Reference [5]. If D = 0, the protein is a PDB or white protein; if D = 1, it is a dark protein; and if 0 < D < 1, it is a grey protein, with grey regions containing dark regions [5].
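The residue-level classification above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in this work: it assumes that D is the fraction of residues left uncovered by any alignment (consistent with the D = 0 and D = 1 cases above, but the exact definition is in Reference [5]), and the function names and the (start, end) range format are hypothetical.

```python
def darkness_score(length, aligned_ranges):
    """Fraction of residues not covered by any structure alignment.

    Assumption: D is the fraction of dark residues in the sequence.
    length: number of residues in the protein.
    aligned_ranges: iterable of (start, end) 1-based inclusive residue
        ranges aligned to a PDB entry (criterion A or criterion B).
    """
    if length <= 0:
        raise ValueError("length must be positive")
    covered = [False] * length
    for start, end in aligned_ranges:
        # Clamp each range to the sequence before marking it non-dark.
        for pos in range(max(start, 1), min(end, length) + 1):
            covered[pos - 1] = True
    return covered.count(False) / length


def classify(d):
    """Map a darkness score to the three classes used in the text."""
    if d == 0:
        return "white"  # PDB (non-dark) protein
    if d == 1:
        return "dark"
    return "grey"
```

For example, a 10-residue protein with residues 1 to 5 aligned to a structure would score D = 0.5 under this assumption and be classified as grey.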
Dark and non-Dark Percentages: The percentages displayed for "dark" proteins, "dark" regions, grey regions, and PDB regions in the above sets (domains of life and model organisms) are obtained as follows. First, both the "dark" and the PDB proteins in each set are identified. Next, the "dark" and the "non-dark" regions are mapped, subtracting the "dark" proteins from the former and the PDB proteins from the latter, which gives the cardinality of the "dark" regions and of the "non-dark" regions. Dividing these cardinalities by the total number of dark and non-dark regions gives the percentages presented in Figures 1 and 2.
Annotation Enrichment: The functional analysis compares annotations between dark and non-dark proteins by applying annotation enrichment to the 'Description' (DE) field, now extended with the 'Features' (FT) field, of the Swiss-Prot proteins. We used Fisher's exact test [11,12] followed by the Benjamini-Hochberg false discovery correction [13] with α, the fraction of false positives considered acceptable, set to 1%, and accepted only annotations with an adjusted p value of ≤ 1%, calculated via $p_{\mathrm{adj}} = p \cdot n / k$, where p is from Fisher's test, n is the total number of annotations in the set, and k is the rank of the largest p value that satisfies the false discovery criteria, as in Reference [6]. This approach was then applied repeatedly to compare dark and non-dark proteins across various sets of organisms.
Tree Maps. From the 'Description' enrichment analysis results, we selected the 21 (of 25) subcategories judged to be most significant and visualized them using a tree map [14]. For the 'Features' enrichment analysis results, we selected 36 (of 39) subcategories. The removed subcategories included those with relatively few results, or results with relatively high adjusted p values, as well as subcategories such as "Similarity," which only give information about groups of very similar proteins and the specific functions they perform; although interesting, these specific annotations do not reveal more general properties of dark proteins. In Figures 3-9, the results are displayed using the D3 zoomable tree map library (bost.ocks.org/mike/treemap); some annotation terms have also been reworded to improve readability.
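The enrichment step described under Annotation Enrichment can be illustrated with a small, dependency-free Python sketch. It is not the code used in this work. It assumes a one-sided Fisher's exact test for over-representation (the under-represented tail would be tested analogously) and uses the standard Benjamini-Hochberg adjustment; the function names and the 2x2 table layout are hypothetical.

```python
from math import comb


def fisher_over(a, b, c, d):
    """One-sided Fisher exact p value for over-representation.

    2x2 table (assumed layout):
        a = dark proteins with the annotation, b = dark without,
        c = non-dark with the annotation,     d = non-dark without.
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)
    tail = sum(
        comb(col1, x) * comb(total - col1, row1 - x)
        for x in range(a, upper + 1)
    )
    return tail / denom


def benjamini_hochberg(pvalues, alpha=0.01):
    """Return (adjusted p values, accepted flags) in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted, [q <= alpha for q in adjusted]
```

In use, one p value would be computed per annotation term and the whole list passed to `benjamini_hochberg` with `alpha=0.01`, matching the 1% threshold stated above.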
Mapping Autonomy per protein. For every human protein we evaluated its autonomy, i.e., using STRING [10] we counted how many other proteins it interacts with. The STRING scheme classifies its functional link confidence into three score ranges [15]: low (< 400), medium (400 to 700), and high (> 700), which measure the confidence in the pair-wise functional interactions of the networks produced. Even assuming that the sequence data are accurate, computational tools can introduce noise when generating sequence similarity data. Taking this noise into account, it is advisable to set a cut-off score above which an interaction is highly probable. In terms of functional classification accuracy, what matters is a high confidence score of 700 or higher [16]; however, low and medium confidence cut-offs were also analyzed for comparison purposes (results not shown). Therefore, for each Swiss-Prot protein, we categorized its autonomy using m(N), the number of matches that occur at a link score of N. If m(0) equals zero, the protein is fully autonomous, because even at the lowest quality cut-off score no interactions occur between it and other proteins. On the other hand, if interactions with other proteins still exist at the highest cut-off score (i.e., m(900) is not zero), it can be concluded that the protein is completely non-autonomous.
Dark Genes. For each chromosome in Homo sapiens, we constructed a list of dark proteins sorted by the position of the central nucleotide of the corresponding gene, determined using UCSC assembly hg19 [17]. In some cases, due to gene duplication, multiple copies of the same dark protein were annotated as arising from multiple genes in the same chromosome; in such cases, we considered only the first occurrence and removed all other copies from the list. For each chromosome, we then calculated the longest run of dark proteins and assigned a p value by counting how many times a run with the same number of dark proteins or more occurred by chance in 1,000 random re-orderings of proteins along the chromosome. Note that the cluster results are very conservative, with the chance of a false positive being 1/1000 on a per-chromosome basis; thus, there are probably more such 'dark' gene clusters.
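The longest-run test for 'dark' gene clusters can be sketched as a simple permutation test. The Python below is an illustrative reading of the description above, not the original implementation: it assumes each chromosome is represented as a list of booleans in gene order (True for a dark protein), and the function names, the fixed seed, and this input format are hypothetical.

```python
import random


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


def run_p_value(flags, n_shuffles=1000, seed=0):
    """Empirical p value for the observed longest run of dark proteins.

    flags: dark/non-dark status of each protein, in chromosomal order.
    Returns (observed longest run, fraction of random re-orderings whose
    longest run is at least as long as the observed one).
    """
    rng = random.Random(seed)
    observed = longest_run(flags)
    shuffled = list(flags)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if longest_run(shuffled) >= observed:
            hits += 1
    return observed, hits / n_shuffles
```

With 1,000 re-orderings the smallest non-zero p value this sketch can report is 1/1000, which is the per-chromosome resolution mentioned above.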
Dark Tissues. Finally, we used ProteomicsDB [18], which contains mass-spectrometry data from protein expression measurements in 16,857 liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments involving human tissues, cell lines, and body fluids, including data from PTM studies and affinity purifications. To obtain the normalized intensity values for each protein, the ProteomicsDB protein expression API was used. These values measure the relative abundance of the peptides of each protein in a specific sample on a logarithmic scale. As we did not find any mass-spectrometry data for 1,391 dark proteins and 2,762 non-dark proteins, we treated these empty entries as 0.

Dark Proteome Database Status
This work tries to answer the questions formulated in the Introduction, starting by presenting the status of DPD [6] in July 2016 (Figure 1), i.e., the percentages of dark proteins, dark regions, grey regions, and PDB regions, as defined in Reference [5], for the four domains of life plus the six model organisms described above. Using the more stringent definition of darkness given in Reference [5], we can also observe the status of DPD (Figure 2) including the PMP (Protein Model Portal) [8] predictions for the same four domains of life and six model organisms. Comparing the current version of DPD without PMP (Figure 1) and with PMP (Figure 2), we observe only marginal differences, both in the domains of life and in the model organisms studied in this work. Therefore, henceforth, this study focuses only on the DPD version without PMP.
Comparing the initial version of DPD [5] (Figure S1) with the current version of DPD [6], and starting with the domains of life, we can observe that dark proteins (from: E: 15 […] (Figure 1). Focusing now on the model organisms and applying exactly the same reasoning, we observe the same tendency (Figure S1), i.e., dark proteins (from: Ar: 13 […] (Figure 1). From these results it can be concluded that the general knowledge concerning the four domains of life has increased, since the percentages of dark regions and dark proteins decreased while the percentages of grey and white regions increased. The same conclusion applies to the model organisms, but less clearly, since some dark-region and dark-protein areas expanded while grey and white areas shrank, even if marginally.
However, if we look at the overall picture (Figure 1), even with this increase in PDB and grey regions, we conclude that for the four domains of life the percentage of the dark proteome is still very high in eukaryotes (43.2%) and in viruses (49.2%) and very low in archaea (13.3%) and in bacteria (12.6%). Considering PMP, the scenario does not improve much, with 38.6% of darkness present in eukaryotes, 47.7% in viruses, 10 […]
[…] and 'Venom Duct', as well as in 'Skin (including Dorsal) Glands' and 'Testis', being under-represented in only two "Tissues": 'Red blood cells' and 'Ubiquitous'. Dark proteins were under-represented in many "Catalytic site" and "Pathway" annotations, where inference often requires similarity to a PDB structure. Dark proteins in archaea have several "Functions", such as 'Responsible for cell division', 'Proton extrusion', and 'Transport of potassium', among others. These and the following results can be verified at Reference [19] by applying the indicated cut-off values with the slider button.
Figure 3. Functional annotations over- or under-represented in dark proteins (details in Table 1 and dataset S1). Pooling annotations for all proteins, we used enrichment analysis to find biological functions associated with dark proteins. The tree map shows all over- and under-represented annotations (dark and blue, respectively) in 21 functional categories; cell area indicates annotation significance (scaled to -log10(p), using the adjusted p value from Fisher's exact test; see Methods). (A) Archaea; (B) Bacteria; (C) Eukaryota; (D) Viruses. A cut-off value of -log10(p) = 50 was applied for figure readability.
Mitochondrial membrane ATP synthase (F(1)F(0) ATP synthase or Complex V) produces ATP from ADP in the presence of a proton gradient across the membrane which is generated by electron transport complexes of the respiratory chain. F-type ATPases consist of two structural domains, F(1), containing the extramembraneous catalytic core, and F(0), containing the membrane proton channel, linked together by a central stalk and a peripheral stalk. During catalysis, ATP synthesis in the catalytic domain of F(1) is coupled via a rotary mechanism of the central stalk subunits to proton translocation. Part of the complex F(0) domain. Minor subunit located with subunit a in the membrane (By similarity).
Bacteria dark proteins (Figure 3B), like those of archaea, are over-represented in "Subcellular Location" annotations such as 'Cell membrane', 'Cell outer membrane', 'Cell inner membrane', 'Lipid anchors', and 'Secreted'. They are under-represented in 'Cytoplasm'. Dark proteins were also under-represented in many "Catalytic site", "Pathway", and "Subunit" (namely 'Ribosomal') annotations. Dark proteins in bacteria have several "Functions", such as 'Responsible for cell division' and 'Transport of potassium', as well as being 'Catalyzers' and being involved in 'Initiation control of chromosome replication', among others.
Eukaryota dark proteins (Figure 3C) are over-represented in "Subcellular Location" annotations covering many specific secretory tissues and the exterior environment, such as 'Venom Gland', 'Venom Duct', 'Skin (including Dorsal) Glands', 'Testis', and 'Milk', and are under-represented in the same two "Tissues" annotations as in archaea. Dark proteins were also over-represented in 'Cysteine' domains and 'Disulfide bonds'. Additionally, eukaryotic dark proteins were over-represented in 'Cleavage' and other post-translational modifications known to prepare proteins for harsh environments. As in archaea, dark proteins were under-represented in many "Catalytic activity" and "Pathway" annotations.
To detail the dark proteome further, we used the 'Features' (FT) field, which is a subsection of the 'Description' (DE) field, per domain of life, again visualized as a TreeMap (with 36 functional categories). Figure 4 allows us to conclude the following. Archaea dark proteins (Figure 4A) are over-represented in "Transmembrane" ('Helical') and in "Topological Domains" ('Extracellular' and 'Cytoplasmic'). They are also over-represented in "Carbohyd", in "Compositional bias" ('Poly-Glu'), and in "Non-Standard" amino acids (selenocysteine and pyrrolysine). Dark proteins are under-represented in "Active Sites", "Helixes", "Metal", "NP-bind", and "Binding".

Model Organisms
Arabidopsis has 41.2% of its proteome in the dark (Figure 1). From the corresponding TreeMap of 'Descriptions' (DE field of Swiss-Prot files; Figure 5A), we can infer that dark proteins are over-represented in "Subcellular Location" annotations such as 'Endoplasmic reticulum membrane', either 'Single-pass' or 'Multi-pass'. These dark proteins are also 'Secreted' in the 'Extracellular space' or through the 'Cell wall'. The most evident "Functions" of dark proteins are related to 'Transcription factors', to 'Regulation of cell fate', and even to 'Regulation of the plant stress, growth and development'. Dark proteins are under-represented in 'Chloroplast' and in 'Cytoplasm' (not shown). From the corresponding TreeMap of 'Features' (FT field of Swiss-Prot files; Figure 5B), we can observe that dark proteins are located mostly along "Transmembrane" regions whose "Topological domain" is 'Cytoplasmic' or 'Extracellular' (not shown). Dark proteins are common in "Signal" sequences (prepeptides), are related to transcription factors, and are also associated with "Disulfides". They are under-represented in "Helix", "Binding", and "NP-bind" (Figure 10A).
C. elegans has 45.8% of its proteome in the dark (Figure 1). The TreeMap of 'Descriptions' (Figure 6A) shows that dark proteins are mostly located in the 'Membrane', especially 'Multi-pass' or 'Single-pass', and in the 'Cell membrane'. The main "Functions" of the dark proteins are 'Structural on the gap junctions'; they are also present in 'Neuropeptides'. Turning to the TreeMap of 'Features' (Figure 6B), we can deduce that dark proteins are located essentially along 'Transmembrane' regions ('Helical') and, as in Arabidopsis, are also found in 'Signal' sequences. They are under-represented in "Disulfides", "Helix", "Binding", and "NP-bind" (Figure 10B).
E. coli has 32.5% of its proteome in the dark. According to the TreeMap of 'Descriptions' (Figure 7A), dark proteins are over-represented mostly at the 'Cell outer membrane', mainly as a 'Lipid anchor'. However, they are also over-represented, to a slightly lesser extent, at the 'Cell inner membrane', where they are 'Secreted'. These dark proteins also interact with themselves to form ligaments. The functions associated with them, not surprisingly, include 'Conjugative DNA transfer (CDT)', which is the unidirectional transfer of ssDNA plasmid from a donor to a recipient cell and is the central mechanism by which antibiotic resistance and virulence factors are propagated in bacterial populations. Dark proteins can also be associated with 'Lysis' proteins. According to the 'Features' TreeMap (Figure 7B), dark proteins are over-represented in "Transmembrane (Helical)" and are associated with "Lipid" bonds. They are under-represented in "Helix" (Figure 10C).
[…] Dark proteins are under-represented in 'Cell membrane'. They have "Tissue Specificity" in the 'Lower and middle cortical regions of the hair shaft in both developing and cycling hair'. Dark proteins also 'Interact with hair keratin', with the function of keeping hair strong. In the 'Hair cortex, hair keratin intermediate filaments are embedded in an interfilamentous matrix, consisting of hair keratin-associated proteins (KRTAP), which are essential for the formation of a rigid and resistant hair shaft through their extensive disulfide bond cross-linking with abundant cysteine residues of hair keratins. The matrix proteins include the high-sulfur and high-glycine-tyrosine keratins'. Focusing on 'Features' (Figure 8B), dark proteins are over-represented in "Transmembrane", "Coiled", and "Compositional Bias". The "Topological Domains" of dark proteins are 'Cytoplasmic', 'Lumenal', and 'Extracellular'. Dark proteins are under-represented in "Disulfide", "Binding", "Metal", and "Active Sites" (Figure 10D).
The last organism studied in this work was Homo sapiens (human). We found that over half of its proteome (51.7%) was dark (Figure 1). The human enrichment analysis of 'Descriptions' (Figure 9A) shows dark proteins over-represented in "Subcellular Location" annotations such as 'Membrane' ('Multi-pass' and 'Single-pass') and 'Shuttles between nucleolus and cytoplasm', as well as 'Secreted'. Regarding "Tissue Specificity", they are over-represented in 'Testis', 'Testis (tumor tissues)', 'Melanoma', and 'Carcinoma (bladder and lungs)'. Concerning "Functions", dark proteins are directly linked with 'Tumorigenesis', 'Tumor antigens', and 'Retroviral replication'. Under "Caution", as stated in Reference [5], although using Swiss-Prot partly addresses the possibility that dark proteins may actually be unrecognized long noncoding RNAs or may arise from pseudogenes, evidence for this occurs in a small number of cases. For the human enrichment analysis of 'Features' (Figure 9B), the significant results are in "Transmembrane", "Coiled", "Compositional bias", and 'Cleavage'. Dark proteins are under-represented in 'Disulfide' (Figure 10E). In conclusion, we can add for the human case that less was known about the function and subcellular location of dark proteins: the 'CC' field was 56% shorter, and location data were missing for 56% of dark proteins compared with 22% of non-dark proteins (Figure 10F).
Figure 9. (A) […] (details in Table 3 and dataset S1). (B) TreeMap showing all annotations (features) over-represented in dark proteins for organism Human (details in Table 4 and dataset S2). A cut-off value of -log10(p) = 0 was applied for figure readability.
Retroviral replication requires the nuclear export and translation of unspliced, singly-spliced and multiply-spliced derivatives of the initial genomic transcript. Rec interacts with a highly structured RNA element (RcRE) present in the viral 3'LTR and recruits the cellular nuclear export machinery. This permits export to the cytoplasm of unspliced genomic or incompletely spliced subgenomic viral transcripts (By similarity).

Autonomy
Concerning autonomy in Arabidopsis (Figure 11A), dark proteins have far fewer interactions than non-dark proteins, for which the number of interactions is quite high. Note the small peaks near 110 interactions, which are a consequence of ribosomal proteins; these peaks are present in all the organisms below (except E. coli). The autonomy for C. elegans follows the same pattern as Arabidopsis (Figure 11B): dark proteins have far fewer interactions than non-dark proteins, whose interaction counts are lower by comparison but still quite high. The autonomy for E. coli gives a curious result compared with the previous cases (Figure 11C): dark and non-dark proteins have the same number of interactions at the high-quality cut-off (700). The results for yeast are similar to those for E. coli. For mouse (Figure 11D), dark proteins have fewer interactions than non-dark proteins, for which the number of interactions is substantial. Autonomy in human follows the same pattern as in mouse (Figure 11E): dark proteins have far fewer interactions than non-dark proteins, although fewer than in mouse.
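The per-protein interaction counts compared in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code behind Figure 11: the `links` input is a hypothetical in-memory list of (protein, protein, score) tuples standing in for a STRING links table, and the default cut-off of 700 is the high-confidence threshold used in the text.

```python
from collections import defaultdict


def interaction_counts(links, cutoff=700):
    """Count distinct interaction partners per protein at a score cut-off.

    links: iterable of (protein_a, protein_b, score) tuples
        (hypothetical format; scores on the 0-1000 STRING scale).
    cutoff: minimum score for a link to count as an interaction.
    """
    partners = defaultdict(set)
    for a, b, score in links:
        if score >= cutoff and a != b:
            # Store both directions so duplicated or reversed rows
            # are counted only once per partner.
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(found) for protein, found in partners.items()}
```

Proteins that never appear in the returned dictionary have no partners at the chosen cut-off; in the terminology of the Methods, a protein with no partners even at a cut-off of 0 would be fully autonomous.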

Dark Genes
We also determined which dark proteins came from sequential genes, finding seven 'dark' gene clusters. Each protein was mapped to the gene that encodes it, and each gene to its chromosomal location (see Methods); proteins from these clusters had many of the features described above as typical for dark proteins (Table 5).

Figure 11 caption (start lost in extraction): ... (B) Protein-protein interactions (high quality = 700) for C. elegans using STRING; (C) protein-protein interactions (high quality = 700) for E. coli using STRING; (D) protein-protein interactions (high quality = 700) for Mouse using STRING; (E) protein-protein interactions (high quality = 700) for Human using STRING.

In Table 5, 'Length' indicates the number of amino acids; 'Binds' indicates the number of known binding partners in the same cluster from STRING [10]; 'Bias' indicates the largest single amino acid composition (e.g., a value of '42%' indicates that one amino acid accounts for 42% of the entire sequence). The most frequently occurring amino acids are given for each cluster (e.g., 'CS-rich' indicates Cys is the most common, followed by Ser). The proteins arising from these gene clusters exhibit typical characteristics of dark proteins: they tend to be short, have few known interactions, have atypical amino acid composition, and are often secreted, transmembrane, or skin-associated. The 1q21.3 cluster arises from gene duplication [20]; it contains many skin proteins with significant compositional bias. The 4q13.3 cluster does not appear to have been previously characterized; it contains proteins related to the mouth, salivary glands, and secretion, implying that these genes share related functions. The 11q12 cluster arises from gene duplication during vertebrate evolution [18]; it contains proteins that all have a 4-pass membrane-spanning region and are components of multimeric receptor complexes. The 17q21.2 and 21q22.11 clusters have also been previously identified [21,22]; they contain hair-associated proteins. The Xp11.23 and Xp11.22 clusters are both very recent evolutionary developments [23]; they contain proteins that are expressed only in testis and in cancer, and some are also unique to human.
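The idea of looking for sequential dark genes can be sketched as follows. This is only an illustration under stated assumptions: each gene is represented by a hypothetical (chromosome, start position, name, is_dark) tuple, and a "cluster" is taken to be a run of at least `min_size` neighbouring genes that all encode dark proteins. The actual criterion is described in the Methods and may differ.

```python
from itertools import groupby


def dark_gene_clusters(genes, min_size=3):
    """Return (chromosome, [gene names]) for runs of adjacent dark genes.

    genes: iterable of (chromosome, start_position, gene_name, is_dark).
    min_size: assumed minimum run length; not taken from the paper.
    """
    clusters = []
    ordered = sorted(genes, key=lambda g: (g[0], g[1]))
    for chrom, chrom_genes in groupby(ordered, key=lambda g: g[0]):
        # Within one chromosome, split the ordered genes into dark/non-dark runs.
        for is_dark, run in groupby(chrom_genes, key=lambda g: g[3]):
            names = [g[2] for g in run]
            if is_dark and len(names) >= min_size:
                clusters.append((chrom, names))
    return clusters


# Toy input, deliberately unsorted (hypothetical gene names and positions).
genes = [
    ("chr1", 300, "C", True),
    ("chr1", 100, "A", True),
    ("chr2", 150, "G", True),
    ("chr1", 400, "D", False),
    ("chr1", 200, "B", True),
    ("chr1", 500, "E", True),
    ("chr2", 50, "F", True),
]

clusters = dark_gene_clusters(genes)
```

Sorting by chromosome and position first means that adjacency is defined by gene order rather than by physical distance; a distance-based definition would need an additional gap threshold.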

Dark Tissues
Finally, using ProteomicsDB [18], we looked at the proteins expressed in 69 tissues, where every tissue has a list of expressed proteins, each with a level of abundance. We inspected each tissue (for instance, the brain), observed which proteins were highly expressed, and determined which fraction of those proteins were dark, yielding a darkness value not for each protein but for each tissue. The tissue with the highest level of darkness is the heart (Table 6), which is very interesting, since heart disease is one of the main causes of death in humans.
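A per-tissue darkness value of the kind described above could be computed along the following lines. This is a minimal sketch, assuming that "highly expressed" means the `top_n` most abundant proteins in a tissue; this cut-off, the function name, and the toy abundances are all assumptions for illustration and are not taken from ProteomicsDB or from the analysis behind Table 6.

```python
def tissue_darkness(expression, dark_proteins, top_n=2):
    """Fraction of the top_n most abundant proteins in each tissue that are dark.

    expression: {tissue: {protein: abundance}}
    dark_proteins: set of protein identifiers classified as dark
    """
    result = {}
    for tissue, abundances in expression.items():
        top = sorted(abundances, key=abundances.get, reverse=True)[:top_n]
        if top:  # skip tissues without any expressed protein
            result[tissue] = sum(p in dark_proteins for p in top) / len(top)
    return result


# Toy data (hypothetical identifiers and abundance values).
expression = {
    "heart": {"D1": 9.0, "D2": 8.0, "P1": 1.0},
    "brain": {"P1": 7.0, "D1": 6.5, "P2": 6.0},
}
dark_proteins = {"D1", "D2"}

darkness = tissue_darkness(expression, dark_proteins)
darkest_tissue = max(darkness, key=darkness.get)
```

Ranking tissues by this value gives one darkness score per tissue, so tissues can be compared directly even when they express different numbers of proteins.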

Discussion
Our previous work [5] did not point out solutions, but it opened a new field to be explored. This study complements [6] by delivering new information on the darkness present in Homo sapiens and in model organisms pharmacologically related to it. We conclude that the amount of dark proteome in all of them is still high, and that in higher eukaryotes such as mouse and human it is around 50%. The results presented above are consistent with previous works [6,20]: Arabidopsis dark proteins are mainly located in the extracellular space, cellular membrane, and endoplasmic reticulum membranes. C. elegans dark proteins are again present in the cell membrane, where they are secreted. In the E. coli case, it is reconfirmed that they are present in the inner and outer membranes. Finally, in higher eukaryotic organisms such as mouse, we observed that dark proteins are located in the endoplasmic reticulum and in the mitochondrion membrane, which is consistent with previous results stating that dark proteins are mostly over-represented in specific secretory tissues and exterior environments, and are also related to cancer endogenous retroviral proteins in the human organism [6]. Therefore, dark proteins are not uniformly distributed throughout the different areas of the cell; their presence is more common in some regions than in others. Many are found in membranes, cell membranes, or associated with transmembrane regions and cleavage, but they are less common in the cytoplasm, where many globular proteins perform their activity.
Concerning functions, the results confirmed that dark proteins perform a wide spectrum of functions depending on the organism in question, being more focused in simpler organisms and wider in higher organisms [6]. Again, a vast number of them were found to be destined to live outside the cell: many are associated with secretion (through secretory glands and ducts) or with extracellular areas in tissues, an indication that they may act as defensive agents against external threats such as bacteria and/or viruses. We also observed that some of these dark proteins are subject to post-translational modifications, i.e., they are chemically modified after translation.
Concerning autonomy, up to now there is no comprehensive map of all functionally relevant protein-protein interactions (PPIs) in simple or complex organisms. Such a map is of crucial importance for understanding cellular behavior. Several databases have begun to flourish, helping in the construction of this global protein interaction map. Some databases are dedicated to registering interaction experiments, such as the detection of physical binding among proteins [24][25][26][27]; others are centered on specific model organisms [28,29]. However, there are two difficulties. The first is the "tsunami" of genome and proteome sequencing information that must be processed, which puts the above map on standby. The second is the way proteins interact, i.e., they also interact through indirect associations such as shared pathways, which are not registered in interaction databases but in pathway databases [30][31][32]. This work is our contribution to the above map, especially to its dark side. The results show clear evidence that, independently of the organism evaluated, dark proteins have significantly fewer interactions with other proteins than non-dark proteins do. In general, we can conclude that dark proteins are more independent and autonomous than non-dark proteins. Therefore, the DPD is at present a map of the dark proteome in which the model organisms described are already available together with their functional analysis, augmenting the knowledge about them; work is in progress for all the remaining organisms.
A point that we want to bring to discussion is the relationship between intrinsically disordered proteins (IDPs) and the Dark Proteome (DP) as we coined it in 2015 [5], where we concluded, using the predictor IUPred [33], that the Dark Proteome is mostly not disordered. A predictor was needed because only 62 proteins in Swiss-Prot (data from 2014) had 'disordered' annotations, out of a total of 546,000 proteins. In our subsequent work [6] (data from 2016), the same 62 proteins remained, but now among a set of 550,116 Swiss-Prot proteins. For this work, considering 'disordered' annotations for the Human organism, we would have (data from 2016 [6]) only one protein to work with: Q8WYP5 (the same as for the 2014 data [5]). Hence, we ask: if we use another predictor, do we get a different result? The answer is yes, but no. The disorder values shown in Figure 2 and Figure S3 of Perdigão et al. [5] were calculated using IUPred [33] because it is one of the most widely used methods for predicting disorder.
Residues were defined as disordered if they had an IUPred score ≥0.5 [5]. In this study, we also calculated a second set of disorder values using MD (META-Disorder) [34], a machine-learning method that calculates a consensus disorder from several orthogonal methods. Re-plotting the density plots and scatterplots of Figure 2 and Figure S3 of Reference [5] using MD disorder gave a similar overall pattern, although some differences were apparent (Figure 12). MD includes as one of its input methods DISOPRED2 [35], which is one of several available methods that are optimized to predict residues missing from PDB structures. For a small fraction of proteins no MD predictions were available; to balance the comparisons, these proteins were removed from the density plots and scatterplots in Figure 2 and Figure S3 of Reference [5], thus reducing the number of proteins to 175,646 in archaea, 18,999 in bacteria, 326,945 in eukaryotes and 16,316 in viruses, respectively.
High-Throughput 2019, 8 FOR PEER REVIEW 3

Overall, the results are mostly similar to those obtained using only IUPred (Figure 2 and Figure S3 of Reference [5]). For eukaryotes, however, using MD results in a larger fraction of proteins occurring close to the diagonal, resulting in an approximately linear relationship between disorder and darkness (H), in contrast to the upper triangular region seen with IUPred (Figure 2C of Reference [5]).
However, as previously, most proteins do not show this trend. Indeed, the presence of almost as many proteins below this region as above indicates that disorder is essentially unrelated to darkness. For viruses (J), the pattern associated with disordered linear motifs is even more pronounced (Figure S3C of Reference [5]). The density plots (B, D, I, and K) show that MD disorder is more evenly distributed than IUPred disorder (Figure 2 and Figure S3 of Reference [5], respectively).
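The per-residue threshold and the comparison around the diagonal can be made concrete with a short Python sketch. It assumes that per-residue disorder scores are already available as a list of numbers, and that per-protein disorder and darkness are both expressed as fractions between 0 and 1. The function names and toy values are hypothetical; the sketch does not call IUPred or MD.

```python
def disorder_fraction(scores, threshold=0.5):
    """Fraction of residues whose disorder score is >= threshold."""
    if not scores:
        raise ValueError("scores must contain at least one residue")
    return sum(s >= threshold for s in scores) / len(scores)


def diagonal_split(points):
    """Count proteins above and below the diagonal darkness == disorder.

    points: iterable of (disorder, darkness) pairs, one per protein.
    'Above' is taken here to mean darkness > disorder (an assumed orientation).
    """
    above = sum(darkness > disorder for disorder, darkness in points)
    below = sum(darkness < disorder for disorder, darkness in points)
    return above, below


# Toy per-residue scores for one protein (hypothetical values).
scores = [0.1, 0.5, 0.9, 0.3]
fraction = disorder_fraction(scores)

# Toy (disorder, darkness) pairs for three proteins (hypothetical values).
points = [(0.5, 1.0), (0.2, 0.0), (0.8, 0.8)]
above, below = diagonal_split(points)
```

Similar counts above and below the diagonal are what one would expect if disorder carried little information about darkness, which is the argument made in the text.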
Intrinsic disorder in proteins is a complex and poorly understood phenomenon; in addition to IUPred, many other prediction methods have been developed, focusing on a range of different aspects of disorder [36]. It would certainly be of interest to compare darkness with disorder predictions from a range of methods; however, the output from these algorithms is difficult to interpret due to the lack of metrics or references to compare with [37]. The DPD aims to help in the near future by introducing three new disorder predictors applied to the Swiss-Prot universe.

Conclusions
Five hundred years ago, very little of the Earth was known. People suspected that it was a sphere, with land and water, and they had roughly mapped out Europe, but that was it. Knowing what they didn't know gave Portuguese explorers like Pedro Álvares Cabral, Vasco da Gama, and Fernão de Magalhães a direction in which to head; the same principle applies to science and discovery today. We have been able to identify regions within each protein that are different to any region where the structure has been determined experimentally. This unknown area is called the 'dark proteome' and actually accounts for nearly half the proteins in viruses and in eukaryotes, which includes humans. It will provide insight into protein-based illnesses like cancer, type 2 diabetes, and many neurodegenerative diseases, such as Parkinson's and Alzheimer's. Just like the early Portuguese explorers that discovered Africa, America, and Asia, knowing what we don't know has provided us with a roadmap upon which to focus our future research and agendas. Knowing that the Dark Proteome is mostly not disordered, mostly not compositionally biased, mostly not transmembrane, and that dark proteins are mostly Unknown Unknowns, the purpose of this study was a detailed characterization of the dark proteins belonging to the human and model organisms under the pharmacological umbrella. Because we have already seen many unexpected surprises, the next step for the DPD is the IDPs: who knows what secrets are still hidden in the dark proteome? As far as we are concerned, we are only interested in the truth, regardless of how unexpected, difficult or amazing it will be.