Overlooked Short Toxin-Like Proteins: A Shortcut to Drug Design

Short stable peptides have huge potential for novel therapies and biosimilars. Cysteine-rich short proteins are characterized by multiple disulfide bridges in a compact structure. Many of these metazoan proteins are processed, folded, and secreted as soluble stable folds. These properties are shared by both marine and terrestrial animal toxins. These stable short proteins are promising sources for new drug development. We developed ClanTox (classifier of animal toxins) to identify toxin-like proteins (TOLIPs) using machine learning models trained on a large-scale proteomic database. Insect proteomes provide a rich source for protein innovations. Therefore, we seek overlooked toxin-like proteins from insects (coined iTOLIPs). Out of 4180 short (<75 amino acids) secreted proteins, 379 were predicted as iTOLIPs with high confidence, with as many as 30% of the genes marked as uncharacterized. Based on bioinformatics, structure modeling, and data-mining methods, we found that the most significant group of predicted iTOLIPs carry antimicrobial activity. Among the top predicted sequences were 120 termicin genes from termites with antifungal properties. Structural variations of insect antimicrobial peptides illustrate the similarity to a short version of the defensin fold with antifungal specificity. We also identified 9 proteins that strongly resemble ion channel inhibitors from scorpion and Conus toxins. Furthermore, we assigned a functional fold to numerous uncharacterized iTOLIPs. We conclude that a systematic approach for finding iTOLIPs provides a rich source of peptides for drug design and innovative therapeutic discoveries.


Introduction
Short proteins are strong candidates for peptide-based therapy and drug development [1][2][3]. The search for peptide-based drugs is driven by the urge to improve specificity and affinity over classical drugs [4]. At present, the search for new leads for peptide therapy is mostly restricted to known peptides that act as hormones, neuropeptides, and growth factors [5][6][7].
Venomous proteins are found in diverse taxonomical branches including scorpions, snakes, spiders, and marine cone snails [8]. Venomous animals have developed a sophisticated array of delivery systems for defense and offense. Evolutionary studies suggest that venomous toxins often reuse common folds that are abundant in the animal phyla (e.g., lipases [9]). Sequences of short proteins that are characterized by having numerous cysteines often fold into compact, stable structural folds. The resulting folds are often found in proteins that carry diverse functions (e.g., lectins, proteases, and protease inhibitors [10]). Venomous organisms are sporadically scattered within the phylogenetic tree of life. Venomous proteins represent cases of both divergent and convergent evolution, as well as repeated use of several existing, successful and abundant folds. However, the pool of bioactive short peptides resembling animal toxins is larger than anticipated [11]. The toxins' innovation is exemplified by their high degree of sequence variation and broad specificities, with only minimal alterations in the structural scaffolds [12]. In recent years, additional bioactive peptides were identified via systematic searches in the transcriptomes and proteomes of venomous animals [13,14]. Secreted short proteins from venomous glands may include hundreds of poorly studied bioactive peptides [6]. Approximately 2000 toxins out of an estimated >70,000 bioactive peptides have been identified in the genus Conus to date [15]. An evolutionary perspective based on the huge sequence diversity among toxins provides a rich source for rational protein design [16,17].
Toxins are extremely varied in their functions and mode of action. The potency of toxins' function is associated with an extremely broad collection of ion channel inhibitors (ICIs), phospholipases, protease inhibitors, disintegrins, membrane pore inducers, and more [18]. Some animal toxins affect the most basic cellular properties [19]. Examples include the non-reversible effect of amphipathic peptides on the membrane integrity [20] from spider venom [21] to marine hydrozoan toxins [22]. These toxins may cause non-specific hemolysis [23]. However, most toxin proteins act via highly specific binding to their cognate molecular target, making them attractive for drug design. The neuronal [24] and immune systems [25] are often affected by toxin-target molecular recognition. A well-studied example for reuse of a fold that acts on numerous receptors of the cholinergic system was described by Gibbons et al. [26]. The three-finger proteins (TFP) fold is found in numerous mammalian proteins acting in the innate immune system [27], and was also identified as Elapidae α-neurotoxins [28,29].
Two striking examples of human toxin-like proteins are Lynx1 [30] and SLURP-1 [31]. These are human proteins that possess similarity to snake α-neurotoxins and, like the snake α-neurotoxins, modulate nicotinic acetylcholine receptors (nAChR). The identification of SLURP-1 as a neuromodulator has contributed to the understanding of the genetic basis of Mal de Meleda, a skin disease that results from overactivation of TNF-alpha [31].
Many short bioactive molecules are ion channel inhibitors (ICIs) or toxins with antimicrobial activity [32]. ICIs constitute the most widely studied group of toxins. A large group of ICIs whose evolution has been studied are the K+ ICIs [33]. It is estimated that more than 10 different structural folds and 40 structural families represent this extremely diverse (structurally and evolutionarily) group [34]. In spite of that, two amino acid residues are critical for the function of all K+ ICIs: a Lys and a Tyr/Phe, known as the functional dyad [35]. Surprisingly, even though these residues appear in very different positions along the sequences of K+ ICIs, the solved structures show that they are similarly aligned in space relative to each other [36]. The same principle of sequence plasticity and structural rigidity applies to ICIs that affect other channels (e.g., [37][38][39][40]). Different ICIs targeting the same channel can vary in both sequence and structural fold [41].
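The spatial alignment of the functional dyad can be made concrete with a small calculation. The following Python sketch is illustrative only: it assumes that the dyad geometry is summarized by the distance between the Lys side-chain nitrogen and the centroid of the Tyr/Phe aromatic ring, which is one possible measure and is not specified in the text. The function names and the coordinates are hypothetical.

```python
import math

def centroid(points):
    """Return the geometric center of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def dyad_distance(lys_nz, ring_atoms):
    """Distance (same units as the input) between the Lys NZ atom and the
    centroid of the aromatic ring atoms. The choice of these two reference
    points is an assumption made for this sketch."""
    return math.dist(lys_nz, centroid(ring_atoms))

# Hypothetical toy coordinates (in angstroms), not taken from any structure.
toy_lys_nz = (0.0, 0.0, 0.0)
toy_ring = [
    (7.4, 0.0, 0.0), (4.6, 0.0, 0.0),
    (6.7, 1.2, 0.0), (5.3, 1.2, 0.0),
    (6.7, -1.2, 0.0), (5.3, -1.2, 0.0),
]
toy_distance = dyad_distance(toy_lys_nz, toy_ring)
print(f"dyad distance: {toy_distance:.1f}")
```

Computing such a distance for toxins with unrelated sequences would be one way to compare how similarly the two residues are positioned in space.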
The evolutionary mechanisms underlying the extreme diversity of toxins have been investigated [42]. Direct approaches for assessing the rapid mutation rate of a variety of toxins sharing the same fold have been reported (e.g., for phospholipases A2 [43]). TFP topology is also a strong example of the accelerated evolution and functional diversification reported for many snake toxins [44]. 3D complexes of short toxins and their cognate channels provide the best lead for the design of toxin-based pharmaceutical agents (e.g., [45]). A number of short toxins are already being used in the clinic for pain management [46], antiviral and antibacterial applications [47].
A common ICI design principle is conserved spacing, and the number of cysteines that form a stable scaffold in a few disulfide bridges [11]. In many cases, the core elements of the fold remain untouched by the preservation of at least two cysteine bridges, while the surfaces of the toxins undergo a natural dynamic adaptive evolution process. The extreme stability of the cysteine knot motif in peptide toxins makes these folds attractive for molecular engineering and drug design [48].
Based on the observation that many short animal toxins are rich in cysteines [49,50], we focused on a subset of short proteins (<75 amino acids) that can be used for discoveries towards peptide therapy [51]. The goal of our study is to present a systematic approach for identifying toxin-like proteins (TOLIPs) from insects (iTOLIPs). We analyzed a large number of published proteomes [52]. A rich catalogue of short bioactive proteins will have the potential to benefit the pharma and medical communities that seek new leads for drugs [53].
Insects represent one of the most diversified metazoan phyla. Many insect species evolved in unique ecological niches (e.g., parasitoid wasps) [54], and exhibit complex social behavior with rapidly evolving genomes [55,56]. In this study, we show that despite limited sequence similarity between short sequences, many toxin-like candidate sequences can be revealed via a machine learning predictor (ClanTox [57]). For identifying TOLIPs, ClanTox was trained only on features extracted from ion channel inhibitors (ICIs) from venomous proteins. Using a rigorous bioinformatics and structural modeling scheme, we assigned a potential functional relevance to numerous iTOLIPs. We present dozens of new candidates for peptide-based therapy and discuss their potential for drug design.

Thousands of Toxin-Like Secreted Short Proteins in Insects
UniProtKB is the largest existing proteomic database (about 90 million sequences, August 2017) and is the main source of new templates for drug development. In recent years many new genomes have been sequenced, including >30 insects. Despite a tsunami of genome sequences, only a few model organisms (e.g., Drosophila melanogaster) have high-quality, manually annotated proteomes. While DNA sequencing quality has improved dramatically, current gene finding methodologies are still geared towards finding transcripts based on length (usually >100 amino acids, AA). Inference of gene function from a transcribed genome remains an unsolved challenge [58]. Short proteins often have missing or faulty annotations (e.g., [59]).
We focused our discovery platform on short proteins. For the rest of the analyses we considered two thresholds on the proteins' length: (i) proteins of length <100 AA (Figure 1); (ii) a subset of shorter proteins, length <75 AA, that are attractive for drug development.

Figure 1. (A) Each step shows the number of proteins (left) and the resulting protein set (right). The dashed bar marks the fraction of the data that is excluded from the following step. Sequences marked as "fragments" by UniProtKB were excluded. The final set used in this study includes proteins from Insecta with a "signal peptide" sequence annotation keyword, a restricted length of 10-100 AA and a further selection for protein length of 10-75 AA. (B) A partition of the main orders of insects and their representation from the set of about 11,000 proteins.
We started with all proteins shorter than 100 AA (after removing all fragmented proteins), restricted to the insects' taxon, which resulted in ~117,600 proteins. Of these, ~11,000 proteins were predicted to be secreted, and thus function in the extracellular space (Figure 1A).
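The filtering steps described here (insect taxon, no fragments, a signal peptide annotation, and a length window) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `ProteinRecord` class, its field names, and the keyword string are hypothetical, and the inclusive 10-100 AA window follows the Figure 1 caption, whereas the text itself states the thresholds as <100 AA and <75 AA.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProteinRecord:
    """Hypothetical, simplified view of a protein database entry."""
    accession: str
    sequence: str
    taxon_lineage: tuple
    keywords: frozenset = frozenset()
    is_fragment: bool = False

def is_short_secreted_insect(rec, min_len=10, max_len=100,
                             secretion_keyword="signal peptide"):
    """Apply the four filters named in the text to one record."""
    return (
        "Insecta" in rec.taxon_lineage          # restricted to insects
        and not rec.is_fragment                 # fragments are excluded
        and secretion_keyword in rec.keywords   # predicted to be secreted
        and min_len <= len(rec.sequence) <= max_len
    )

def filter_records(records, max_len=100):
    """Keep the records that pass all filters for a given length cutoff."""
    return [r for r in records if is_short_secreted_insect(r, max_len=max_len)]

# Toy records with made-up sequences, used only to exercise the filters.
sp = frozenset({"signal peptide"})
demo = [
    ProteinRecord("A", "A" * 40, ("Arthropoda", "Insecta"), sp),
    ProteinRecord("B", "A" * 90, ("Arthropoda", "Insecta"), sp),
    ProteinRecord("C", "A" * 40, ("Arthropoda", "Insecta")),
    ProteinRecord("D", "A" * 40, ("Arthropoda", "Insecta"), sp, True),
    ProteinRecord("E", "A" * 40, ("Arthropoda", "Arachnida"), sp),
    ProteinRecord("F", "A" * 5, ("Arthropoda", "Insecta"), sp),
]
short_set = filter_records(demo)                  # length window 10-100 AA
shorter_set = filter_records(demo, max_len=75)    # length window 10-75 AA
```

Calling the same function with two cutoffs mirrors the two length thresholds considered in the analysis.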
Analysis of the origins of the ~11,000 proteins shows that the representation of the major orders of insects is biased towards the previously sequenced genomes (Figure 1B). Diptera, which includes mosquitoes and flies, dominates the collection (68%). The rest of the candidate short proteins belong to Hymenoptera (mostly bees, wasps, and ants, 10%), Ditrysia (including moth, bumblebee, and butterfly, 9%) and smaller numbers from Hemiptera (e.g., aphids), Coleoptera (mostly beetles) and Blattodea (mostly termites).
While most insects are not venomous [19], some bees, ants, and wasps developed mechanisms to release their venomous proteins and toxic peptides. Many of the short proteins are uncharacterized (see discussion in [56]). Moreover, annotations of genes from fast evolving organisms are often missing. Owing to this fast evolutionary innovation in many insects, we anticipate a rich repertoire of overlooked bioactive peptides [60] and iTOLIPs [61].
We used ClanTox [57] to investigate the abundance of iTOLIPs among the ~11,000 short, secreted proteins (<100 AA). To this end, we divided the proteins according to the major orders of insects, and further investigated the ClanTox predictions according to the confidence level of the predictor (marked as P1-P3, see Methods). We have previously shown that many valid TOLIPs are identified at all confidence levels, including the least confident one (P1, see Methods, [57]). ClanTox was trained only on ICIs from venomous animals for seeking TOLIPs from all organisms. While it was trained on a limited function, its predictions are associated with a much broader spectrum of functions that specify known toxins and proteins with no known homologues in venoms [11]. Figure 2 shows the ClanTox predictions of iTOLIPs for the two largest orders of insects, Diptera (Figure 2A) and Hymenoptera (Figure 2B). A bias in the prediction towards model organisms is evident. The iTOLIPs from Drosophilae (fruit flies) account for 44% of the predicted sequences. Still, >1000 sequences are detected in less studied organisms, such as the tsetse fly, Aedes, blowfly, and more (Figure 2A). The fraction of iTOLIPs among the cysteine-rich short proteins from Hymenoptera (wasps, bees, and ants) is 24%. The high number of iTOLIPs from ant proteomes is a reflection of the many recently sequenced ant genomes (Figure 2C) [56]. Note that the number of predictions from Nasonia vitripennis (parasitic wasp) is disproportionally high. Of the 145 short proteins of Nasonia vitripennis, 57 (39%) were predicted as iTOLIPs (Figure 2C).
From a therapeutic perspective, the shorter the protein, the easier it often is to produce synthetically and to introduce into laboratory and clinical trials. We therefore restricted the search to the 4181 sequences that are shorter than 75 AA (Figure 1A). Figure S1 shows the distribution of the 4181 sequences according to ClanTox's prediction confidence (N, P1-P3, see Methods). Note that most proteins (76%) are predicted as negative, and do not comply with the definition of iTOLIPs (ClanTox's label N stands for "not toxin-like"). The high confidence predictions (P3, top prediction for toxin-like) include 379 proteins (9%, Figure S1). The rest of the analyses focus on these high confidence predicted iTOLIPs (P3). Table 1 shows the partition of the top predicted iTOLIPs among the major orders of insects. The most outstanding observation is the abundance of iTOLIPs in termites (52%), and the low discovery of top prediction iTOLIPs among Ditrysia (5%). A list of the 379 predicted sequences is available (Table S1).
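The percentages quoted for the prediction sets follow directly from the reported counts. The short Python sketch below simply recomputes them as a consistency check; the counts are taken from the text (including the termicin and "uncharacterized" counts that are quoted elsewhere in the article), while the dictionary labels and the helper name are our own.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

# (part, whole) pairs as reported in the text.
reported_counts = {
    "P3 iTOLIPs among secreted proteins <75 AA": (379, 4181),
    "iTOLIPs among Nasonia vitripennis short proteins": (57, 145),
    "termicins among P3 iTOLIPs": (121, 379),
    "uncharacterized among P3 iTOLIPs": (110, 379),
}

for label, (part, whole) in reported_counts.items():
    print(f"{label}: {part}/{whole} = {percent(part, whole):.1f}%")
```

The first two ratios round to the 9% and 39% given in the text, and the last one is close to the "as many as 30%" of uncharacterized genes mentioned in the abstract.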

Most iTOLIP Mini-Proteins Resemble Antibacterial and Antifungal Peptides
Antimicrobial peptides (AMPs) are very abundant among insects [62]. At present, >150 insect AMPs have been identified [63]. A total of 121 peptides out of the 379 iTOLIPs are from the Blattodea order, and are named by UniProtKB as "termicin". Among the top predicted iTOLIPs, these proteins comprise the largest group. Termicins are restricted to the order Blattodea (termites and cockroaches). They are a collection of secreted AMP mini-proteins (25-40 AA) that share moderate sequence similarity. A termicin-like peptide (25 AA) from the cockroach Eupolyphaga sinensis exhibits antifungal activity and a weak activity against bacteria [63]. We hypothesize that other sequences among the iTOLIPs resemble antimicrobial proteins and potentially act as such.
Structurally, termicin is characterized by three disulfide bridges forming a rigid fold. The tertiary structure of termicin contains an α-helical segment and a two-stranded antiparallel β-sheet (called cysteine-stabilized α-helix/β-sheet, CSαβ, Figure 3A). The CSαβ structural motif is similar to that of short insect defensins. The cysteine positions and pairing suggest that, despite a minimal sequence similarity with insect defensins, the structure is shared by all defensins [64]. Extending the analysis of the ClanTox top predictions suggests that the AMP and defensin-like fold could be subjected to a design approach aiming to improve peptide specificity in the current post-antibiotic era (Figure 3A). The insect defensin protein is a shorter version of the human defensin-2 (Figure 3B). Furthermore, the human defensin's N-terminal helix is completely missing in the firefly protein. It is plausible that functionality as an AMP comes from the core folded structure (31 AA) of the firefly version of the defensin, and therefore the N-terminal helix is redundant (Figure 3B, light green shade). Structural variations of insect antimicrobial peptides illustrate the resemblance to a short version of the defensin fold. The diversity of AMPs in view of scorpion toxins has been extensively studied [65,66]. Defensins were also found among sponge, platypus, and scorpion toxins [67]. The assumption is that short specific structural motifs are used as templates by animal toxins [68]. Note that many additional versions of insect defensin genes are longer than 75 AA, and thus will not be further discussed [69,70].
The other major shared function among the top predicted iTOLIPs (Table S1) is the antifungal activity associated with the many Drosomycin genes, including two large sets of DRO and DRS genes [71]. Drosomycins (DRS) are inducible antifungal peptides that were isolated from the hemolymph of immune-challenged Drosophilae. A similar antifungal specificity applies to the DRO1-DRO6 cassette, which responds to injury and microbial infection [72]. The DRS scaffold is a typical cysteine-stabilized α-helix/β-sheet (CSαβ) that specifies many of the known defensins (Figure 4). The hallmark of DRS is its extra stability, which is gained by clamping the N- and C-termini with an additional disulfide bond. This solution for extreme stability was also found in the spider toxin ω-hexatoxin-Hv1a. This innovation in protein stability is beneficial for a protein design approach that seeks a biochemically stable scaffold [48]. Short versions of the AMP, with three disulfide bonds resembling defensin, were identified in marine sponges [73] and jellyfish [74]. In jellyfish, the similarity to defensin extends also to the K+ ICIs of sea anemones. Multiple functionalities have been experimentally validated for the short CSαβ scaffold of DRS and for the truncated scorpion toxin. Both peptides are effective as ion channel modulators (on the D. melanogaster voltage-gated sodium channel) and exhibit antifungal activity [75].

iTOLIPs as Ion Channel Inhibitors
We analyzed proteins whose structural similarity to toxins has been identified. Table 2 lists nine instances in which a toxin-related function is revealed. All 9 proteins exhibit similarity to blockers of various channels [76]. Interestingly, two sequences, from Apis mellifera (honeybee) and Aphidius ervi (aphid parasite), show a clear homology to ω-conotoxins MVIIC and GVIA, potent Conus peptides that effectively block Ca2+ channels. OCLP1 was initially identified using ClanTox, and its function as an ICI has been validated [11]. We retested the OCLP1 structural model in view of the doubling of the number of proteins with 3D structures in the last decade. The most likely structural model for OCLP1 benefited from structural relatedness (Figure 4). The similarity in the location of the cysteines along the sequence, and in the cysteines that contribute to the disulfide bridges, applies to ω-conotoxin MVIIC (1cnn.1, 1omn.1), Ptu-1 (1i26.1), Toxin Ado1 (1lmr.1), SVIB (1mvj.1), ω-conotoxin GVIA (1omc.1, 1tr6.1, 1ttl.1, 2cco.1), Robustoxin (1qdp.1), Hainantoxin-3 (2jtb.1), Spiderine-1a (2n86.1), and more. Importantly, the OCLP1 model indicates a comparable sequence similarity to a large number of ICIs. The related sequences exhibiting ICI function block Na+, K+, and all major types of Ca2+ channels (L-, N-, and P/Q-types, Figure 4). As such, these sequences are attractive templates for drug development that seeks the feature determinants dictating a detailed specificity. In fact, the specificity is not restricted to the selected ion but extends to the exact version of the ion channel. For example, the protein µ-theraphotoxin-Pn3a, which was isolated from the venom of the tarantula Pamphobeteus nigricolor, is a potent inhibitor of Nav1.7, a subtype of the sodium ion channel (Nav). Its specificity for the other Nav subtypes is lower by 2-3 orders of magnitude [77].
Toxins 2017, 9, 350
A detailed report for the five top templates that are used for construction of a structural model for each of the 9 proteins (Table 2) is available (Table S2).
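The statement that specificity for other Nav subtypes is lower by 2-3 orders of magnitude can be expressed as a base-10 logarithm of a potency ratio. The Python sketch below assumes that potency is reported as a half-maximal inhibitory concentration (IC50), which the text does not state explicitly, and the numerical values are hypothetical placeholders rather than measurements.

```python
import math

def selectivity_orders(ic50_target, ic50_off_target):
    """Orders of magnitude by which a peptide is less potent on an
    off-target channel than on its target (both in the same units)."""
    return math.log10(ic50_off_target / ic50_target)

# Hypothetical IC50 values in nM, chosen only to illustrate the arithmetic.
toy_target_ic50 = 1.0
toy_off_target_ic50 = {"subtype_x": 100.0, "subtype_y": 1000.0}

toy_orders = {
    name: selectivity_orders(toy_target_ic50, value)
    for name, value in toy_off_target_ic50.items()
}
for name, orders in toy_orders.items():
    print(f"{name}: {orders:.1f} orders of magnitude less sensitive")
```

A value between 2 and 3 on this scale corresponds to a 100-fold to 1000-fold difference in the concentration needed for inhibition.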

Uncharacterized iTOLIPs Reveal New Cysteine-Rich Patterns
Among the identified mini-proteins are 110 sequences that are annotated as "uncharacterized" (with genes named by their genomic index). About 65% of them are from Diptera (55 from Drosophilae, and 16 from Anopheles). Inspecting the spacing and number of the cysteines among the "uncharacterized" mini-proteins shows numerous recurring patterns (Figure 5). A recurring pattern is illustrated by B3M6X8_DROAN (Drosophila ananassae). This pattern is also identified in Drosophila erecta and Drosophila yakuba, and appears in 20 proteins (with small variations, Figure 5, Pattern E). Using structural modeling, we found that the strongest sequence similarity is to PDB: 1myn.1 (Drosomycin). Yet, another set of toxins, such as the α-like toxin Lqh3 and BmαTX47 toxins from Old and New World scorpions [78], seems to share the structural fold (Figure 6A). All these neurotoxins are specific to different Nav subtypes [79]. The stiff structure is visible mainly through the α-helix and the antiparallel β-sheets (Figure 6A). However, the substantial variations in the loops indicate the potential site for the specificity of AMP activity and of K+ and Na+ ion channel blocking. The overlap of B3M6X8 relative to the 7 protein representatives that contributed to the model is shown along their multiple sequence alignment (Figure 6A, bottom).
A systematic search for a model for the uncharacterized proteins showed that for A0A182S0S6_ANOFN (Anopheles funestus, Figure 5, Pattern A), the best model is similar to gamma 1-P thionins from barley and wheat endosperm (PDB: 1gps). These proteins are common motifs among toxic arthropod proteins and defensins. Still, the most likely defensin that was associated with the Anopheles funestus protein is of plant origin (PDB: 5nce.1).
Modeling the structure of the uncharacterized W5JVP1_ANODA (Figure 5, Pattern F) revealed a strong and highly conserved structure similar to a "non-classical" Kazal-type inhibitor (Figure 6B). All six structure representatives are aligned, and support its function as a protease inhibitor. The Kazal protease inhibitor fold was identified in some snakes, sea anemones, and the skin of tree frogs. However, most protease inhibitors from toxins are associated with the Kunitz fold, which displays a broader taxonomical coverage and a robust protease inhibition [80]. Other proteins predicted by structural modeling to have the Kazal protease inhibitor fold include A0A182RZB0_ANOFN, A0A0J9TLN1_DROSI, Q29LL5_DROPS, K7J9G8_NASVI, B3MVF1_DROAN, and B4GPS1_DROPE (Figure 5, Pattern F).
Testing other uncharacterized proteins from the list (Figure 5) resulted in poor or no supportive models. Note that some cysteine-based patterns appear with multiple examples in the list. For example, B4PF50_DROYA and B4PF53_DROYA share the same pattern in terms of their cysteine number and spacing (Figure 5, Pattern B). Additional proteins appear to adopt structurally novel shapes that could not be modeled to a satisfactory level (e.g., A0A0P9C2V6_DROAN). These findings suggest that the uncharacterized proteins provide a rich, yet unexplored, scaffold for future drug design.
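The grouping of mini-proteins by cysteine number and spacing can be illustrated with a minimal Python sketch. It is not the procedure used to build Figure 5. The signature format (gap lengths between consecutive cysteines), the function names, and the toy sequences are assumptions introduced here for illustration only.

```python
from collections import defaultdict


def cysteine_pattern(seq: str) -> str:
    """Return a cysteine signature such as 'C-2-C-2-C'.

    Each number is the count of non-cysteine residues between two
    consecutive cysteines. Sequences without cysteines give ''.
    """
    positions = [i for i, aa in enumerate(seq.upper()) if aa == "C"]
    if not positions:
        return ""
    parts = ["C"]
    for prev, cur in zip(positions, positions[1:]):
        parts.append(str(cur - prev - 1))  # residues between the two cysteines
        parts.append("C")
    return "-".join(parts)


def group_by_pattern(seqs: dict[str, str]) -> dict[str, list[str]]:
    """Group sequence names that share an identical cysteine signature."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[cysteine_pattern(seq)].append(name)
    return dict(groups)


# Toy sequences (hypothetical, not taken from the study).
toy = {
    "toy_a": "ACDECGGC",
    "toy_b": "KCAACLLCW",
    "toy_c": "CC",
}
groups = group_by_pattern(toy)
```

In this sketch, toy_a and toy_b fall into the same group because their cysteines are spaced identically, although their remaining residues differ. Exact matching is the strictest choice; tolerating the "small variations" mentioned in the text would require a looser comparison of the gap lengths.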

Protein Databases
We used datasets from UniProtKB release Aug_2017 [81], which includes 90 million protein sequences and combines the SwissProt and TrEMBL datasets [82]. We used the current data from the RCSB Protein Data Bank [83], which holds structural information for about 124,000 proteins.

Bioinformatics Analysis Tools
SignalP 4.0 was used to predict signal peptides [84]. This stand-alone predictive tool is also provided as an annotation in UniProtKB [KW-0732]. The average length of the signal sequence in mammals is about 25 AA. We therefore set the protein length threshold at 75 AA to account for a mature protein of about 50 AA. EBI's ClustalW and alignment viewer tools were used. Swiss-Model [82] was applied with default parameters for building a model according to the templates from the RCSB database. In the automated mode, both BLAST and HHblits (profile-profile search) are used. HHpred and HHblits [85] provide sensitive structural prediction by HMM-HMM comparison. HHblits builds an HMM from a query sequence and compares it with a library of HMMs representing all known structures from the PDB [83]. All structural predictions obtained from Swiss-Model and HHblits were compared to test the quality of the results.
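The length cutoff described above can be expressed as a short Python sketch. The record layout (name, sequence, and a boolean flag standing in for a SignalP prediction or the UniProtKB signal keyword) and the function names are hypothetical; only the 25 AA and 50 AA figures come from the text.

```python
SIGNAL_PEPTIDE_LEN = 25   # approximate signal peptide length stated in the text
MATURE_PROTEIN_LEN = 50   # approximate mature mini-protein length stated in the text
MAX_PRECURSOR_LEN = SIGNAL_PEPTIDE_LEN + MATURE_PROTEIN_LEN  # 75 AA cutoff


def is_short_secreted(seq: str, has_signal_peptide: bool,
                      max_len: int = MAX_PRECURSOR_LEN) -> bool:
    """True for precursors shorter than max_len that carry a signal peptide."""
    return has_signal_peptide and len(seq) < max_len


def select_candidates(records):
    """Keep the names of records that pass the length and secretion filter.

    Each record is a (name, sequence, has_signal_peptide) tuple; this
    layout is an assumption made for the example.
    """
    return [name for name, seq, signal in records
            if is_short_secreted(seq, signal)]


# Hypothetical records with placeholder sequences.
records = [
    ("short_secreted", "M" * 60, True),
    ("short_not_secreted", "M" * 60, False),
    ("long_secreted", "M" * 120, True),
]
selected = select_candidates(records)
```

The strict comparison mirrors the "<75 amino acids" criterion; whether a precursor of exactly 75 AA should pass is a detail this sketch settles by assumption.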
Template quality is estimated during model building to maximize the quality and coverage of the model. In some cases, more than one model is presented to reflect the structural diversity. The quality of the models is estimated using calculated statistical parameters of the model (GMQE and QMEAN). These values are determined with respect to experimental parameters of proteins with a similar length [82]. Only models with sufficiently supported quality are presented. The visualization tools used are embedded in Swiss-Model. A sequence similarity map shows the proteins that were used as templates and contributed to the final model from a set of non-redundant structurally solved proteins.

ClanTox Prediction and Scoring
ClanTox (classifier of animal toxins) is a machine learning classifier ensemble for ranking protein sequences according to their toxin-like properties. ClanTox provides characterization for these mostly uncharacterized proteins. ClanTox represents each sequence as a vector of about 600 numerical sequence-derived features, including the stability and the spacing of the cysteine residues [57]. However, the features are not restricted to cysteine-related ones. ClanTox was trained on a few hundred ICIs from a broad range of animal toxins. The test set performance of ClanTox in cross-validation is very high, with a mean area under the curve (AUC) of >0.99 [86].
The sequences from the selected subset of insect proteomes downloaded from UniProtKB were used as input for ClanTox. The classifier outputs four labels: N for a negative prediction, and P1-P3, reflecting three levels of positive predictions for toxin-like proteins (TOLIPs). The most significant predictions (labeled P3) account for proteins with a mean score >0.2 and a coefficient of variation (CV) <0.5. The negative predictions (N, predicted as non-toxin) account for all sequences with a mean score <−0.2. The confidence of the prediction indirectly considers the robustness of the prediction. Formally, P3 are predictions with a mean score >0.2 and a mean score >2×SD; P2 are predictions with a mean score >0.2 and a mean score between SD and 2×SD; and P1 are predictions with a mean score >−0.2 or a mean score <SD [57].
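The labeling rule can be sketched in Python as follows. This is an illustrative reading of the thresholds above, not the ClanTox implementation. Two assumptions are made here: the conditions for P3 and P2 are treated as conjunctive, and the labels are assigned as a cascade so that every sequence that is neither N, P3, nor P2 falls into P1. For a positive mean score, CV <0.5 (SD/mean <0.5) is the same condition as mean >2×SD.

```python
def clantox_label(mean_score: float, sd: float) -> str:
    """Assign N, P1, P2, or P3 from the ensemble mean score and its SD.

    Assumed cascade (not confirmed against the ClanTox source code):
      N  : mean < -0.2
      P3 : mean > 0.2 and mean > 2 * SD   (equivalently CV < 0.5)
      P2 : mean > 0.2 and mean > SD
      P1 : everything else
    """
    if mean_score < -0.2:
        return "N"
    if mean_score > 0.2 and mean_score > 2 * sd:
        return "P3"
    if mean_score > 0.2 and mean_score > sd:
        return "P2"
    return "P1"


# Hypothetical (mean, SD) pairs chosen to hit each label.
examples = {
    "confident": (0.9, 0.1),
    "moderate": (0.5, 0.3),
    "weak": (0.1, 0.05),
    "noisy": (0.3, 0.5),
    "negative": (-0.5, 0.1),
}
labels = {name: clantox_label(m, s) for name, (m, s) in examples.items()}
```

The "noisy" case shows why the SD matters: its mean score exceeds 0.2, yet the disagreement among the classifiers keeps it at the lowest positive level under this reading.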

Conclusions
From the evolutionary perspective, toxins that possess similar functions (e.g., ICIs) may appear in unrelated venomous species, which is in accord with accelerated evolution and innovation among toxins. Detecting endogenous toxin-like proteins from insects (iTOLIPs) confirmed that much of the innovation associated with bioactive peptides and mini-proteins is linked to defense against microbes, mainly fungi, and to the modulation of ion channels. Potentially, these functions are not mutually exclusive, and short proteins may carry more than one function. The rich collection identified in insects is instrumental in searching for particular amino acids that can enhance the specificity of AMPs towards specific fungi or bacteria. In this study, we discussed a collection of top predictions from ClanTox. Note that hundreds of additional iTOLIPs are reported at somewhat lower predicted confidence. We conclude that the overlooked iTOLIPs, characterized by structural stability and enhanced specificity, are attractive templates for drug design.
Supplementary Materials: The following are available online at www.mdpi.com/2072-6651/9/11/350/s1, Table S1: Top predictions of ClanTox (P3) for insect proteins <75 AA, with 379 iTOLIPs. Table S2: Top 5 templates selected by Swiss-Model for constructing the structural models of nine mature iTOLIP mini-proteins. Figure S1: Scoring of ClanTox predictions for insects' secreted mini-proteins. Distribution of ClanTox predictions for 4180 insects' secreted proteins shorter than 75 AA. The top-scoring iTOLIPs are marked P3 (dark red), the intermediate confidence level is P2, and P1 marks the least confident predictions. Gray marks the bulk of the sequences (76%), which have a negative prediction (i.e., not TOLIPs). Altogether, 379 proteins are scored as P3 (Table S1).
Author Contributions: M.L., N.R., and D.O. analyzed the data and wrote the paper. N.R. is part of the development team of the ClanTox webtool (www.clantox.cs.huji.ac.il), which was used throughout this study.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
AMP: antimicrobial peptides
CSαβ: cysteine-stabilized α-helical and β-sheet
ClanTox: classifier of animal toxins
CRISP: cysteine rich short proteins
ICI: ion channel inhibitor
DRS: Drosomycin
nAChR: nicotinic acetylcholine receptors
OCLP: omega conotoxin-like protein
TFP: three-finger proteins
iTOLIP: insect toxin-like proteins