Discovery of Novel Conotoxin Candidates Using Machine Learning

Cone snails (genus Conus) are venomous marine snails that inject prey with a lethal cocktail of conotoxins, small, secreted, and cysteine-rich peptides. Given the diversity and often high affinity for their molecular targets, consisting of ion channels, receptors or transporters, many conotoxins have become invaluable pharmacological probes, drug leads, and therapeutics. Transcriptome sequencing of Conus venom glands followed by de novo assembly and homology-based toxin identification and annotation is currently the state-of-the-art for discovery of new conotoxins. However, homology-based search techniques, by definition, can only detect novel toxins that are homologous to previously reported conotoxins. To overcome these obstacles for discovery, we have created ConusPipe, a machine learning tool that utilizes prominent chemical characters of conotoxins to predict whether a certain transcript in a Conus transcriptome, which has no otherwise detectable homologs in current reference databases, is a putative conotoxin. By using ConusPipe on RNASeq data of 10 species, we report 5148 new putative conotoxin transcripts that have no homologues in current reference databases. 896 of these were identified by at least three out of four models used. These data significantly expand current publicly available conotoxin datasets and our approach provides a new computational avenue for the discovery of novel toxin families.


Introduction
Predatory marine cone snails (genus Conus) have attracted the attention of biologists and pharmacologists for the great neuropharmacological potential of their venom toxins [1][2][3]. It is estimated that each of the~750 extant Conus species produces~100-400 distinct venom toxins (conotoxins) with almost no overlap in the toxin repertoire between the~750 species, not even between sister species [4]. Despite the tremendous diversity and drug discovery potential of Conus venoms, only~5000 nucleotide sequences of conotoxin-encoding transcripts have been reported from 100 Conus The nature of the Conus training sets and input data also needs to be considered. For example, the true-positive (labeled) training data is limited in scale and is incomplete, as many conotoxins presumably remain uncatalogued [29]. Moreover, the input data is very large, with RNA-seq datasets typically exceeding five million reads and tens of thousands of assembled transcripts. Given these features, logistic regression provides a well-established base-line approach, whereas semi-supervised learning (labelspreading) provides a means to further leverage unlabeled true positives during training. Finally, the Perceptron model scales well to very large training sets and is widely used for pattern recognition with good results [30][31][32][33]. Both the logit and perceptron models are supervised and model-based, in contrast, the label spreading model is semi-supervised and instance-based. Given these machine learning models have intrinsic strengths and weaknesses and are complementary to one another, using an ensemble of these methods is likely to provide best discovery outcomes. Additionally, since using these machine learning models are based on the hypothesis that conotoxins will retain all three traits in sequence evolution, we added cross-species Blastp to search for similar unknown sequences (if they contain a signal sequence) between different Conus species to rescue potential conotoxins that only have one trait (signal sequence) based on the knowledge that the signal peptide sequences of conotoxins from the same superfamilies are highly similar to each other even if they are from different Conus species [4].
By employing three different machine learning models plus Blastp, ConusPipe allows users to take an ensemble approach for discovery to maximize the prediction power. All four methods were applied to 12 RNAseq datasets from 10 different species of Conus. The pipeline discovered 5148 new conotoxin candidates that provide a unique dataset for future pharmacotherapeutic exploration.

The ConusPipe Toolkit
ConusPipe is implemented in Perl and Python as a complete conotoxin discovery package. It is available at https://github.com/Yandell-Lab/ConusPipe. ConusPipe takes six-frame-translated peptide sequences from nucleotide sequences which have no hit in current reference database and extracts conotoxin sequence features to train datasets ( Figure 1A). In addition to three machine learning models, cross-species Blastp is used as the fourth method to retrieve putative toxin candidates that have a signal sequence but may not have all features used in machine learning ( Figure 1B). ConusPipe then generates different combinations (single method, union or overlap) of the four methods to predict candidate conotoxins ( Figure 2). Users can change the settings in sample.config file to use different cut-off options for transcript per million (tpm) values, blast e-values, signalP Dvalues, and provide paths to input and output fasta files and databases used. It took 2 h 34 min 20 s on a single CPU core to run ConusPipe for 757,932 Conus transcripts from 12 samples of 10 species. ConusPipe then generates different combinations (single method, union or overlap) of the four methods to predict candidate conotoxins ( Figure 2). Users can change the settings in sample.config file to use different cut-off options for transcript per million (tpm) values, blast e-values, signalP D-values, and provide paths to input and output fasta files and databases used. It took 2 h 34 min 20 s on a single CPU core to run ConusPipe for 757,932 Conus transcripts from 12 samples of 10 species. Venn diagram illustrates the different combinations of methodologies used (single method, overlap, or union of methods) and the total number of putative toxin candidates identified by each method (unique number of toxin candidates are shown in parentheses). A union of methods means that a conotoxin is predicted by one or more methods, for example, Union.4methods = predicted by perceptron or logit or label spreading or blast. An overlap of methods means that the conotoxin is predicted by all the applied methods, for example, Overlap.4methods = predicted by perceptron and logit and label spreading and blast. Abbreviations used: blast-B; logit-L; labelspreading-S; perceptron-P.

Building and Cross-Validation of the Machine Learning Models
4950 known conotoxin sequences (from the ConoServer [5] and Uniprot databases [34], and 52,613 randomly selected non-conotoxin Conus transcripts were used to build the machine learning models. In order to assess the performance of the models, 10-fold cross validation was applied to the same dataset [35]. The main measures of performance were sensitivity, specificity and accuracy under different regularization parameters. Sensitivity was defined as the fraction of known conotoxins predicted as conotoxin divided by the number of known conotoxins in the test dataset. Specificity was defined as the fraction of known non-conotoxins predicted as non-conotoxin divided by the number of known non-conotoxins in the test dataset. Accuracy was defined as the fraction of known sequences (conotoxin/non-conotoxin) predicted divided by the total number of sequences in the test dataset. The regularized parameter settings were chosen by plotting accuracy vs. parameter settings for each model to make sure the trained model has the best accuracy with minimum overfitting/under fitting. Since the prevalence of conotoxins in the training dataset is only 9.97%, sensitivity is an important performance measure in consideration when choosing regularization parameter settings. The sensitivity, specificity, and accuracy for the chosen regularization parameter settings for each Venn diagram illustrates the different combinations of methodologies used (single method, overlap, or union of methods) and the total number of putative toxin candidates identified by each method (unique number of toxin candidates are shown in parentheses). A union of methods means that a conotoxin is predicted by one or more methods, for example, Union.4methods = predicted by perceptron or logit or label spreading or blast. An overlap of methods means that the conotoxin is predicted by all the applied methods, for example, Overlap.4methods = predicted by perceptron and logit and label spreading and blast. Abbreviations used: blast-B; logit-L; labelspreading-S; perceptron-P.

Building and Cross-Validation of the Machine Learning Models
4950 known conotoxin sequences (from the ConoServer [5] and Uniprot databases [34], and 52,613 randomly selected non-conotoxin Conus transcripts were used to build the machine learning models. In order to assess the performance of the models, 10-fold cross validation was applied to the same dataset [35]. The main measures of performance were sensitivity, specificity and accuracy under different regularization parameters. Sensitivity was defined as the fraction of known conotoxins predicted as conotoxin divided by the number of known conotoxins in the test dataset. Specificity was defined as the fraction of known non-conotoxins predicted as non-conotoxin divided by the number of known non-conotoxins in the test dataset. Accuracy was defined as the fraction of known sequences (conotoxin/non-conotoxin) predicted divided by the total number of sequences in the test dataset. The regularized parameter settings were chosen by plotting accuracy vs. parameter settings for each model to make sure the trained model has the best accuracy with minimum overfitting/under fitting. Since the prevalence of conotoxins in the training dataset is only 9.97%, sensitivity is an important performance measure in consideration when choosing regularization parameter settings. The sensitivity, specificity, and accuracy for the chosen regularization parameter settings for each model is shown in Table 1. The highest overall testing accuracy and sensitivity were 98.2% and 90.93%, respectively, achieved by the label spreading model.

Benchmark by Identifying Known Superfamilies
To assess the sensitivity of ConusPipe in identifying conotoxin transcripts, we performed a benchmark analysis for sequences belonging to known conotoxin gene superfamilies [6]. For these analyses, we deleted an entire superfamily from the training set and then tried to re-discover this superfamily as a putative new superfamily using ConusPipe. The specificity of ConusPipe (i.e., the ability to distinguish between known conotoxins and non-conotoxin proteins from various organisms) was assessed by screening hits against the entire UniProtKB/Swiss-Prot database.
We found that the sensitivity varied among different superfamilies and combinations of models used. The highest sensitivity was achieved by the union of the four methods (logit, labelspreading, perceptron and blastp, mean sensitivity = 95.7%, SD = 0.11) and the union of three methods (logit, perceptron and blastp or labelspreading, perceptron and Blastp, mean sensitivity = 95.7%, SD = 0.11) across all superfamilies. A union of methods means that a conotoxin predicted by one or more methods is a putative positive. A graphical overview of the different groups is provided in Figure 2. Results are shown in Figure S1 and Tables 1 and 2. Table 2. Specificity of the individual machine learning methods and their unions/combinations when searching results against the Uniprot/Swissprot non-conotoxin database and mean sensitivity for recovering all known conotoxin superfamilies. Methods are ordered as follows: Overlap between different methods, single methods, and union of methods. Methods with the highest sensitivities (≥99.7%) and specificities (≥99.9%) are shown in bold. Abbreviations used: blast-B; logit-L; label spreading-S; perceptron-P.

Mean Sensitivity Specificity
Overlap In addition to machine learning, using cross-species Blastp to search candidate sequences from different Conus species against each other provides better performance for conotoxin superfamilies that contain sequences which do not satisfy the hypothesis of having all three traits (signal sequence, propeptide, and mature toxin at the C-terminus), such as the SF-mi2 (Superfamily-2 Conus miles), I4, B4 and Prohormone gene families. However, this approach is less powerful for superfamilies that are limited to a small number of species, such as conorfamides, DivMTFLLLLVSV (Diverse MTFLLLLVSV), teretoxins, and conocaps. The sensitivity for recovering all known conotoxin superfamilies by different combinations of methods is provided in Supplementary Table 1.
The ability of ConusPipe to distinguish between conotoxins and other proteins from various organisms was assessed by screening the entire UniProtKB/Swiss-Prot database. Using the version released on June 2013 we examined a total of 540,261 protein sequences isolated from diverse organisms. The overall highest specificity was achieved using an overlap of the four methods (logit, labelspreading, perceptron, and blastp, specificity = 99.92%) and the overlap of three methods (labelspreading, perceptron, and blastp, specificity = 99.92%). These results are shown in Table 2.

Identification of New Conotoxin Candidates
To identify new conotoxin candidates, we assembled Conus transcripts from RNA-seq datasets derived from venom glands of 10 different Conus species using our previously published methods [4] (see Table 3 for species used in this study). The resulting transcripts were prescreened for conotoxin homology against the Uniprot and ConoServer databases as previously published [4] and described under the methods section. The remaining transcripts were then used as inputs to ConusPipe. New conotoxin candidates were defined as those which were predicted as conotoxins by at least one of the four models in ConusPipe, but lacked significant homology to known conotoxins using Blast against the Uniprot/ConoServer database [17].
Since conotoxin gene superfamilies are generally found across multiple Conus species (hence, the term superfamily) we considered those sequences that lacked significant homology to known conotoxins but had high homology (blastp e-value < 1 × 10 −10 to sequences from at least two other Conus species examined here, as members of a new putative superfamily. 5479 transcripts passed these criteria. In order to validate our predictions, we used NCBI-Blastp to search the 5479 transcripts against the NCBI non-redundant (NR) protein database (August 2018 Version), which includes recently published conotoxin sequences that were not yet available in Uniprot/ConoServer at the time of original analysis (and even now) and also includes large numbers of uncharacterized molluscan sequences not available in Uniprot. Out of 5479 transcripts, 331 had significant blastp hits (e-values < 1 × 10 −4 ) against the NCBI-NR database. 99 transcripts had hits against other molluscan transcripts. As the majority of conotoxins are not found outside of the genus Conus and these transcripts could encode endogenous signaling/housekeeping polypeptides rather than polypeptides used for envenomation, these were removed from our final datasets. 198 sequences were identified as conotoxins. These were also removed from the final machine learning dataset and are provided in Supplementary File 1. Finally, 34 sequences had blastp hits against non-molluscan species such as fish, tardigrade, sea anemone, worm, plant, and bird. These were removed from the final dataset. A total of 5148 sequences were left in our final dataset as new conotoxin candidates, 187 of these were identified by all four models, 709 of these were identified by three models, 1666 of these were identified by two models, 2586 of these were identified by one model (see Figure 1 for toxin candidates identified by each model and their overlap and unions). The logit model did not uniquely identify any toxin candidates; all its 1396 newly discovered putative toxins are overlapped with other models. The labelspreading model, perceptron model, and blastp uniquely discovered 56, 1480, and 1089 new potential toxins, respectively. These results highlight that using an ensemble of different models takes advantage of the complementary aspects of each model to maximize discovery. All of the 5148 transcripts are provided in Supplementary File 2. Using single linkage cluster analysis with a Jaccard Index of 0.5 [36,37] grouped 299 of sequences identified by at least three models into 114 clusters that are likely to represent novel conotoxin gene superfamilies (Supplementary File 3).
The highest number of putative new sequences were identified in C. striatus. Blastp against Uniprot/ConoServer identified 34 toxin transcripts in C. striatus and 25 of these were also retrieved using the machine learning method (i.e., 9 of these were missed by machine learning, see discussion section below). 1079 additional putative toxins were subsequently identified by machine learning, 10 of these were confirmed as toxins by Blastp against the NCBI-NR database ( Table 3). All toxins identified per species are provided in Supplementary File 4 ("sp.all.tar.gz"), including toxins with Blast hits against NCBI/Uniprot/ConoServer and toxin candidates identified by ConusPipe.

Transcripts Identified Using Three or Four out of Four Methods
In this study, we provide a list of all new conotoxin candidates from the venom gland transcriptomes of several cone snail species. We emphasize that these sequences require further experimental validation to confirm their designation as genuine conotoxins. This is discussed in more detail below. However, we propose that many of the transcripts identified by three out of four methods (709 sequences) or by a combination of all four methods (187 sequences) are likely to represent genuine conotoxin sequences since they were independently identified using different approaches. Several of these sequences exhibit clear hallmarks of conotoxins (presence of propeptides, found in multiple species, similar but distinct sequences found in different species, multiple cysteines in mature toxin region). An alignment of some of these is shown in Figure 3. Several sequences that were identified by three or more models but seem unlikely to represent genuine conotoxins are also shown. These typically exhibit a long series of cysteine repeats, several vicinal cysteines, and/or series of methionines (M) in the signal sequence. In the future, our model could be further refined to exclude such sequences as candidates.
found in multiple species, similar but distinct sequences found in different species, multiple cysteines in mature toxin region). An alignment of some of these is shown in Figure 3. Several sequences that were identified by three or more models but seem unlikely to represent genuine conotoxins are also shown. These typically exhibit a long series of cysteine repeats, several vicinal cysteines, and/or series of methionines (M) in the signal sequence. In the future, our model could be further refined to exclude such sequences as candidates.

Discussion
Previous approaches for toxin discovery have been alignment based, using regular expressions, blast, and/or HMMER to identify new members of known conotoxin superfamilies [5,[15][16][17]. Because conotoxins are hyperdiverse, these approaches are intrinsically limited. In an attempt to cast a wider net for discovery, we have created a machine learning based pipeline, ConusPipe, that utilizes

Discussion
Previous approaches for toxin discovery have been alignment based, using regular expressions, blast, and/or HMMER to identify new members of known conotoxin superfamilies [5,[15][16][17]. Because conotoxins are hyperdiverse, these approaches are intrinsically limited. In an attempt to cast a wider net for discovery, we have created a machine learning based pipeline, ConusPipe, that utilizes functional characteristics of conotoxins to identify new conotoxin candidates that have no significant sequence homology to conotoxin sequences currently available in reference databases.
By using more than one machine learning model, we expected to see that an ensemble of different models can maximize the prediction power. Indeed, as determined by benchmark analyses, the highest sensitivity is achieved by the union of three or more methods and the highest specificity is achieved in the overlap of three or more methods.
ConusPipe allows users to choose different combinations of methods according to their requirement on discovery specificity and sensitivity (Table 1). In addition to developing machine learning, we demonstrate that using blast to search candidate sequences from different Conus species against each other provides better performance for conotoxin superfamilies that contain sequences that do not satisfy the hypothesis of having all three traits. However, this approach is less powerful for superfamilies that are limited to a small number of species and only works if more than one transcriptome is to be analyzed.
We would like to note that, for best performance, ConusPipe should be used in combination with currently used homology-based search algorithm, such as blast, as ConusPipe relies on the presence of full-length sequences (containing N-terminal signal sequences) and will not work for truncated contigs. Furthermore, several conotoxin gene superfamilies reported in Uniprot/ConoServer have low SignalP values (D value of <0.45). These would also be missed by our pipeline. Table 3 provides information on how many conotoxins which have significant homology to sequences in Uniprot/ConoServer database could also be retrieved by ConusPipe (average value for recovery: 65%, ranging from 57% for C. virgo to 82% for C. imperialis).
We provide a large set of new conotoxin candidates and a bioinformatic pipeline that is freely available and can be applied to any newly sequenced Conus venom gland. No doubt our approach also identifies false positives, particularly when only a single or a combination of two methods is employed. Thus, using our method without further validation is not suitable for defining the venom composition of a species. What it provides is a new tool to identify candidate toxin transcripts that are not able to be detected by homology-based methods. Generating comprehensive databases of all putative toxin candidates expressed in a venom gland will empower current mass spectrometric toxin sequencing approaches [7]. Using mass spectrometry, candidate transcripts can then be verified and subjected to functional characterizations.

Conclusions
Using ConusPipe, we identified 5148 new conotoxin candidates from 757,932 transcripts derived from venom gland transcriptomes of 10 Conus species. None of these candidate conotoxins has significant homology to any known conotoxin in the Uniprot/ConoServer database, although like known conotoxins, most candidates have an N-terminal signal sequence, a characteristic propeptide spacer region, and a single copy of a mature peptide at the C-terminus. Moreover, we have shown that several of these candidates share high homology to newly published conotoxins in the NCBI-NR molluscan database. In conclusion, our approach opens new avenues for the discovery of novel conotoxin transcripts from cone snails and other venomous animals with similar venom repertoires.

Transcriptome Sequencing
Specimen were collected in the central Philippines during several collection expeditions in 2011-2015. Specimen identification was initially performed by morphological examination and later verified by sequence analysis of the cytochrome oxidase c subunit 1 (COI) gene. Venom glands were dissected and stored in RNAlater at −80 • C until further processing. Total RNA was isolated from venom glands using TRIzol ® Reagent (Invitrogen, ThermoFisher Scientific, Waltham, MA, USA) or the RNeasy kit (Qiagen, Germantown, MD, USA) following the manufacturers' instructions. RNA integrity, quantity, and purity were determined on a 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). cDNA libraries were prepared and sequenced on an Illumina HiSeq 2000 instrument (Sanger/Illumina 1.9 reads, 101 bp or 125 bp paired-end, Illumina, San Diego, CA, USA).

Development of ConusPipe
ConusPipe proceeds by first extracting 16 features (see feature explanation below) from known conotoxin sequences (from the ConoServer [6] and Uniprot databases) and non-conotoxin Conus transcripts to make the training dataset. Sequences are required to contain a signal sequence as determined by SignalP [41]. Next, the 16 features extracted from Conus transcripts of 10 species, which have no homologues in current reference database (the combined ConoServer and UniProtKB database) are used as real test data to run the machine learning methods to predict whether a certain input transcript is conotoxin. All the transcripts predicted by any of the four methods are output by the pipeline as putative new conotoxins.

Feature Selection and Classifiers
We employed 16 features for the machine learning models: signalP D value for signal sequence; cysteine percentage; molecular weight; percentage of positively/negatively charged amino acids; and isoelectric point for all three regions of the precursor sequence (signal sequence, propeptide, and mature toxin). All features are continuous re-normalized to lie between 0 and 1 ( Figure 1A). The features are mainly chemical characters of amino acids in the three different parts of conotoxin sequence. The motivating hypothesis is that even though conotoxins evolve very rapidly they must still share similar chemical characters in amino acid composition, since they carry out similar functions, e.g., bind transporters and receptors. For example, the three parts of conotoxin sequence-signal sequence, propeptide, and mature toxin carry out different functions in conotoxin secretion process in the cell, so their charge distributions are stereotypical and different. The signal sequence is mainly hydrophobic, while the amino acids in propeptide are mainly charged, and the mature toxin is somewhat intermediate as regards charge distribution. To train the models we first ran signalP on a training dataset consisting of known conotoxin/nonconotoxin sequences to get the signalP D value for each known sequence [41]. Next, we calculated the cysteine percentage, molecular weight, percentage of positively/negatively charged amino acids and isoelectric point for signal sequence, propeptide and mature toxin, respectively. The pipeline uses the 16 extracted features in training dataset to train the logistic regression model, labelspreading model, and perceptron model using the Python scikit learn package [42]. The accuracy under different regularization parameters of the three models were tested by cross validation with training dataset from known conotoxin and nonconotoxin sequences.

Cross Validation
4950 known conotoxin sequences (from ConoServer and Uniprot/Swissprot) and 52,613 non-conotoxin Conus transcripts with matched sequence length were first split into 10 equal bins, and then sequentially one-tenth of the data was taken as the test set and the remaining other nine-tenths were used for the training set. The three models were trained under different regularization parameters and evaluated with the test set in 10 iterations. For the logistic regression model, the regularization parameter we tested is slack number C, which is the inverse of regularization strength. We tested this parameter from 0.001 to 1 × 10 10 . For the labelspreading model, we chose knn as the kernel function, so the regularization parameter we tested is n_neighbors, which was tested from 1 to 15. For the perceptron model, the regularization parameter we tested is n_iter, the number of passes over the training data, which we tested from 5 to 70. The plots of accuracy versus regularization parameter of different models are shown in Figure S2.

Discovering New Putative Conotoxins and Conotoxin Gene Families
Paired-end RNAseq data from 10 Conus species were generated by Illumina HiSeq 2000 platform (Table 4). RNAseq reads were assembled using best practice Trinity settings, annotated with Blastx against our reference dataset, and all the Conus transcripts which do not have homologous sequences in the current reference databases were selected and six-frame-translated into peptide sequences. These peptide sequences were then used as the input dataset for ConusPipe to drive machine learning models built in previous steps to predict whether they are conotoxins. Cross-species Blastp is also used on all putative peptide sequences that have a signal sequence as an independent method to predict putative conotoxin candidates. The pipeline output all input transcripts which were predicted as conotoxins. We then used NCBI-Blastp to search all putative toxin transcripts against the NCBI non-redundant (NR) protein database (August 2018 version), which includes recently published conotoxin sequences that were not yet available in Uniprot/ConoServer at the time of original analysis. Sequences with significant Blastp hits were excluded from final datasets (e-values < 1 × 10 −4 ).
Then all by all Blastp and single linkage cluster analysis were conducted among the new conotoxins, and the new conotoxins that shared at least 50% hit connections with one another were designated as to be in the same superfamily.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-6651/10/12/ 503/s1, Figure S1: Box plot shows that the sensitivity varied among different combinations (single method, overlap or union of methods) of methods used. Figure S2: Plot of accuracy vs. regularization parameter settings in each machine learning model. Supplementary