Conotoxin Prediction: New Features to Increase Prediction Accuracy
Toxins 2023, 15, 641

Conotoxins are toxic, disulfide-bond-rich peptides from cone snail venom that target a wide range of receptors and ion channels with multiple pathophysiological effects. Conotoxins have extraordinary potential for medical therapeutics that include cancer, microbial infections, epilepsy, autoimmune diseases, neurological conditions, and cardiovascular disorders. Despite the potential for these compounds in novel therapeutic treatment development, the process of identifying and characterizing the toxicities of conotoxins is difficult, costly, and time-consuming. This challenge requires a series of diverse, complex, and labor-intensive biological, toxicological, and analytical techniques for effective characterization. While recent attempts, using machine learning based solely on primary amino acid sequences to predict biological toxins (e.g., conotoxins and animal venoms), have improved toxin identification, these methods are limited due to peptide conformational flexibility and the high frequency of cysteines present in toxin sequences. This results in an enumerable set of disulfide-bridged foldamers with different conformations of the same primary amino acid sequence that affect function and toxicity levels. Consequently, a given peptide may be toxic when its cysteine residues form a particular disulfide-bond pattern, while alternative bonding patterns (isoforms) or its reduced form (free cysteines with no disulfide bridges) may have little or no toxicological effects. Similarly, the same disulfide-bond pattern may be possible for other peptide sequences and result in different conformations that all exhibit varying toxicities to the same receptor or to different receptors. We present here new features, when combined with primary sequence features to train machine learning algorithms to predict conotoxins, that significantly increase prediction accuracy.


Introduction
Conotoxins are peptides found in the venom of carnivorous aquatic mollusks known as cone snails that hunt by paralyzing their prey [1]. This happens because conotoxins interfere with the normal function of various ion channels and signal receptors, ultimately leading to paralysis and suffocation [1]. Despite the risks to human health posed by these kinds of toxins, there are limited effective anti-toxins available. The rational development of novel therapeutics requires topological knowledge of the receptors, binding sites, and interacting residues for a given toxin [2]. There is increasing interest in peptide-based toxins for medical use as treatments for cancers [3,4], microbial infections [5], epilepsy, autoimmune diseases, neurological conditions, and cardiovascular disorders [4,6]. As an example, the drug Ziconotide (Prialt), used for chronic pain relief, is a synthetic version of the ω-conotoxin MVIIA from the cone snail, Conus magus [7].
Despite the recognized importance of these dual-use compounds as both dangerous and potentially therapeutic, identifying, characterizing, and determining the toxicities of conotoxins remains difficult, costly, and time-consuming. Overcoming this challenge requires a series of diverse, complex, and labor-intensive biological, toxicological, and analytical techniques for effective characterization [8]. Furthermore, with the thousands of new peptide sequences that are being obtained by transcriptomics and proteomics, traditional toxicological measurements are too slow. In many cases, the experimental determination of an individual toxin's function has become unfeasible and/or impossible because of the time-consuming nature and high cost of the experiments. Recent attempts using deep learning or machine learning (ML) (i.e., TOXIFY [9], ToxClassifier [10], ClanTox [11], ToxinPred [12], PredCSF [13]) to predict the toxicity of peptides (e.g., conotoxins and animal venoms, etc.) based on primary amino acid sequences have improved the toxicity identification process. However, these methods are limited due to the inherent conformational heterogeneity exhibited by peptides, which comes from two primary sources: (1) the innate flexibility of peptides or the peptide backbone and (2) the proportionally high numbers of cysteines. High cysteine counts allow peptides to adopt multiple conformational permutations that are stabilized by the disulfide bonds formed between the cysteine pairs. Peptide sequences locked into different conformations lead to shifts in physiological behavior by providing different interaction surfaces (i.e., topology). A given peptide may be toxic when its cysteine residues form a particular disulfide-bond pattern, resulting in a specific conformation. Alternative disulfide-bonding patterns, such as when the conotoxin peptide is an isoform (containing alternative cysteine-cysteine disulfide bridges) or is in its reduced form (absence of any disulfide bridges), may yield little or no toxicological effects for the peptide or may be highly toxic (Figure 1). For example, the conotoxin AuIB has an IC50 value of 1.2 nM in its native helical form (Figure 1a), but the IC50 decreases by a factor of 10 (to 0.1 nM) when AuIB is converted to its ribbon (Figure 1b) isoform [14]. In contrast, the conotoxin GI in its native form shows a 10-fold greater IC50 compared to its two ribbon isoforms [15]. Consequently, similar disulfide-bond patterns result in different conformations for different peptide sequences, which all exhibit varying toxicities from altered binding to the same receptor or binding to different receptors. Furthermore, while the majority of conotoxins contain post-translationally modified (PTM) amino acids (e.g., hydroxyproline, pyroglutamic acid, etc.), all current prediction methods and models exclude these unique residues, categorizing them as their unmodified residues. This exclusion results in a decrease in the potential size of a unique dataset and tremendously reduces the effectiveness and accuracy of any predictions [13,16,17].
Figure 1 (caption, panels b to f). (b) Alpha conotoxin AuIB in its ribbon (isoform) conformation with a disulfide bond pattern of Cys2-Cys15 and Cys3-Cys8 (PDB: 1MXP [18]). (c) Mu conotoxin KIIIA with a disulfide bond pattern of Cys1-Cys9, Cys2-Cys15, and Cys4-Cys16 (PDB: 7SAV [19]). (d) Mu conotoxin KIIIA with a disulfide bond pattern of Cys1-Cys16, Cys2-Cys9, and Cys4-Cys15 (PDB: 7SAW [19]). (e) Kappa conotoxin PVIIA with a disulfide bond pattern of Cys1-Cys16, Cys8-Cys20, and Cys15-Cys26 (PDB: 1AV3 [20]). (f) Omega conotoxin MVIIA with a disulfide bond pattern of Cys1-Cys16, Cys15-Cys25, and Cys8-Cys20 (PDB: 1DW4 [21]).
To augment the predictive capability of ML approaches for conotoxin prediction, we integrated a variety of physiochemical and structural features, including physiochemical surface properties, secondary structure characteristics, and collisional cross sections (CCSs). A CCS is an experimental value obtained from ion mobility-mass spectrometry (IM-MS) experiments or from a computational calculator such as the High-Performance Collision Cross Section (HPCCS) [22] software. The experimental or computational CCS value is a function of the size, shape, charge, and polarizability of a molecule. Here, we determine how these additional features improve conotoxin prediction accuracy and how to include them in building an effective ML-based platform to accurately predict if an unknown toxin molecule is a conotoxin. Such a platform will not only accelerate the identification of novel biochemical threat agents but also benefit the development of biological prophylactics and therapeutics, detection reagents, and medical countermeasures.

Construction of Datasets
To evaluate how new features impact conotoxin prediction accuracy, negative datasets were separated into easy-negative and hard-negative datasets. The easy-negative dataset contains peptides (from more than 100 species, including humans, yeast, zebrafish, mice, eels, chickens, and cattle) that are confirmed to be non-toxic, while the hard-negative dataset contains toxic peptides from spiders, scorpions, snakes, beetles, frogs, wasps, and ants, as well as conantokins and contryphans from cone snails. Toxic peptides were categorized as part of the hard-negative dataset with the expectation that they may contain similar amino acid compositions (regions) and also share similar binding sites with conotoxins; thus, they would be more difficult to distinguish from the conotoxins.
In general, three datasets (a positive, an easy-negative, and a hard-negative dataset) were constructed for the training and testing of the ML approach (see the methods section). These three datasets were initially collected from the Protein Data Bank (PDB) using the keywords indicated in the methods section. The positive datasets include conotoxins obtained from the Conoserver, PDB, and the Biological Magnetic Resonance Bank websites. These conotoxins are from twelve (12) distinct classes, including the alpha, delta, mu, and omega classes, that target nicotinic acetylcholine receptors (nAChRs), GABA receptors, and potassium (K+), calcium (Ca2+), and sodium (Na+) ion channel receptors. A distribution of these classes is shown in Figure S1. We initially constructed small-sized datasets that include a positive dataset containing 154 conotoxins, an easy-negative dataset containing 180 non-toxic peptides, and a hard-negative dataset containing 178 peptides. To test how consistently the new features affect conotoxin prediction accuracy, we expanded our datasets by adding more entries into these small datasets. The extended datasets include a positive dataset containing a total of 184 conotoxins, an easy-negative dataset containing 317 non-toxic peptides, and a hard-negative dataset containing 560 peptides. All the entries in these datasets are peptides with lengths equal to or less than 80 residues and with known three-dimensional structures. Sequences with more than a 90% sequence identity were removed from the negative datasets. A summary of all the datasets collected and used for the ML experiments is shown in Table 1. A full list of these datasets along with all the features extracted is available in File S1.

Feature Extraction and Selection
Thirteen features belonging to three general categories (compositional, physiochemical, and structural) were divided into four groups (P, P2, SS, and CCS) (Figure 2). The compositional features consist of the peptide amino acid sequence and the frequency of amino acid occurrence, both of which have been used for standard biomolecular classifications [23][24][25], and the number of post-translational modifications (PTMs). Since conotoxins show high concentrations of PTMs, this is an important feature that has not yet been considered for improving prediction accuracy. New parameters for some common PTMs are found in Table S1. The physiochemical features include protein charge, mass, size, relative polarity, and hydrophobicity as shown in Table S2. The structural features inform peptide folds and include secondary structure identities, the radius of gyration, disulfide bond counts, and solvent-accessible surface areas (SASA). Because conotoxin function depends on surface topological interactions, we hypothesize that by characterizing the chemical surface of conotoxins, we should see an improvement in classification. Therefore, the SASA of each residue on the peptide was quantified. In addition, to test if an unknown conotoxin can be quickly and accurately predicted using an experimental parameter such as the CCS value obtained from an IM-MS experiment, we added the computationally calculated CCS values to the list of features for ML prediction. Additional details on the features are included in Supplemental Tables S2 and S3, and a visual aid for the surface characteristics is shown in Figure S2. The effect of each of the four feature groups (P, P2, SS, and CCS) on conotoxin prediction accuracy was then evaluated.
To determine the performance of the feature groups on conotoxin prediction, each feature group was evaluated for efficacy, either individually or in combination with other feature groups. The features were split into four groups, identified as P, P2, SS, and CCS, as shown in Figure 2. The feature group P contains the counts for 11 sequence-level features (aliphatic, aromatic, polar, hydrophobic, charged, positively charged, negatively charged, tiny, small, large, and total), as well as total charge, mass, dipeptide 0 gap, and dipeptide 1 gap. Dipeptide 0 and dipeptide 1 are the frequencies of co-occurring residues in the sequence as adjacent neighbors or with one residue separating them, respectively. Most of the current ML models use only the P features to train ML algorithms [26]. The feature group SS contains the residue counts for each of the defined secondary structures extracted from the Define Secondary Structure of Proteins (DSSP) [27,28] program (Table S3), as well as the disulfide-bond count, the radius of gyration, and the SASA of the residue types, including the PTMs. The feature group P2 contains the counts of PTMs and the frequency of the dipeptide 2 gap, which is the frequency of residues appearing as neighbors with two residues separating them. The feature group CCS contains the CCS values of all the entries calculated by the HPCCS program using the corresponding structure files from the PDB. A list of all feature sets is included in Table S4.
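As an illustration of one of the structural (SS) features, the radius of gyration can be sketched as the root-mean-square distance of a set of atom positions from their geometric centre. The snippet below is a minimal sketch, not the authors' code: the function name `radius_of_gyration`, the unweighted (mass-free) definition, and the toy coordinates are all assumptions made for illustration.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points.

    Assumption: every point has equal weight, as would be the case
    when only one atom per residue (e.g. C-alpha) is used.
    """
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    # Geometric centre of the points.
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    # Mean squared distance from the centre, then its square root.
    sq = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
             for p in coords)
    return math.sqrt(sq / n)

# Hypothetical toy coordinates (in angstroms), not taken from any PDB entry.
toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(radius_of_gyration(toy_coords))
```

In practice the coordinates would be read from the structure file of each entry; parsing is omitted here to keep the sketch self-contained.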

Effect of Features on Prediction Performance
One common challenge with training ML models on biological samples is that the dataset sizes are usually small because the available experimental biological data are very limited. This is especially true for the positive datasets. In contrast, the negative datasets have considerably more entries due to their diversified sequences and the availability of more experimental data.

Improved Predicting Power from New Features across All Datasets
To determine how the PTMs, structural features, and CCS features affect classification performance, these features were tested on small and extended datasets (Table 1) using four different ML classifiers: a Penalized Logistic Regression (PLR) [29], a Support Vector Machine (SVM) [30], a Random Forest (RF) [31], and XGBoost [32]. Due to the small positive dataset size, the models were tested using leave-one-out cross-validation, in which the models were trained using all but one entry and then tested with the entry that was left out; this is then repeated, leaving a different entry out each time [29]. The best average accuracy (AA) and f1 scores obtained from these four classifiers are shown in Table 2. The addition of the CCS, P2, and SS features was shown to consistently increase the classifiers' AA performance by between 0.08 and 2.8%, with f1 scores increasing by up to 0.0253 across small and extended datasets, with more impact on the conotoxins vs. easy + hard-negative datasets, indicating that these features are highly beneficial for predicting conotoxins.
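The leave-one-out procedure described above can be sketched in a few lines. This is only an illustrative sketch: the helper names (`leave_one_out`, `fit_centroids`, `predict_centroid`) are hypothetical, and the toy nearest-centroid classifier on a single feature is a stand-in for the PLR, SVM, RF, and XGBoost classifiers actually used in the study.

```python
def leave_one_out(X, y, fit, predict):
    """Train on all entries but one, predict the held-out entry, repeat."""
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]   # every entry except entry i
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        predictions.append(predict(model, X[i]))
    return predictions

# Stand-in classifier (assumption): nearest class centroid on one feature.
def fit_centroids(X, y):
    sums = {}
    for x, label in zip(X, y):
        s, n = sums.get(label, (0.0, 0))
        sums[label] = (s + x, n + 1)
    return {label: s / n for label, (s, n) in sums.items()}

def predict_centroid(model, x):
    return min(model, key=lambda label: abs(x - model[label]))

# Hypothetical toy data: label 1 = conotoxin, label 0 = negative.
X = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
y = [1, 1, 1, 0, 0, 0]
print(leave_one_out(X, y, fit_centroids, predict_centroid))
```

Because each held-out entry is predicted by a model that never saw it, the collected predictions can be scored with any of the metrics used in the paper.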

Conotoxin Prediction Accuracy
Additional detailed testing was performed using the extended datasets to reveal the effect of individual feature sets and different feature set combinations on ML classification performance. Three metrics (overall accuracy (OA), average accuracy (AA), and f1 score (f1)) were used to evaluate the classification performance, as indicated in the methods section. Higher values for these metrics indicate a better performance of the classifier.
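For readers who want to reproduce the three metrics, a minimal sketch for a binary problem follows. The definitions used here are assumptions, since the exact formulas are given in a methods section not reproduced in this text: OA is taken as the fraction of correct predictions, AA as the mean of the per-class recalls, and f1 as the harmonic mean of precision and recall for the positive class. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return OA, AA, and f1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    oa = (tp + tn) / len(pairs)                       # overall accuracy
    recall_pos = tp / (tp + fn) if tp + fn else 0.0   # recall on positives
    recall_neg = tn / (tn + fp) if tn + fp else 0.0   # recall on negatives
    aa = (recall_pos + recall_neg) / 2                # assumed AA definition
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall_pos
    f1 = 2 * precision * recall_pos / denom if denom else 0.0
    return {"OA": oa, "AA": aa, "f1": f1}

# Hypothetical labels for an imbalanced toy set (2 positives, 6 negatives).
print(classification_metrics([1, 1, 0, 0, 0, 0, 0, 0],
                             [1, 0, 0, 0, 0, 0, 0, 1]))
```

On imbalanced data the OA can stay high even when half of the positives are missed, which is why AA and f1 are reported alongside it.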
Table 3 shows the performance of the PLR and SVM classifiers (Top) and RF and XGBoost classifiers (Bottom) for predicting conotoxins against the easy-negative dataset (non-toxic peptides). The results show that the SS features alone, or in combination with the CCS feature, did not increase the prediction accuracy and f1 score compared to P features alone. Similarly, the addition of the CCS feature on top of the P features (P + CCS), the addition of SS features on top of the P2 features (SS + P2), or the addition of the CCS and SS features on top of the P2 features (CCS + SS + P2) did not significantly affect the performance of any of the four classifiers. However, adding the SS features on top of the P feature set (P + SS) increased all three metrics for the PLR, SVM, and XGBoost classifiers, while almost no change was observed for the RF classifier. In particular, the OA was increased by 1.34%, the AA by 2.52%, and the f1 score by 0.0241 for the PLR classifier; the OA by 1.75%, the AA by 1.69%, and the f1 score by 0.0417 for the SVM classifier; and the OA by 1.08%, the AA by 1.37%, and the f1 score by 0.0241 for the XGBoost classifier. Interestingly, adding the P2 features on top of the P features (P + P2) increased all three metrics for all four classifiers: PLR, SVM, RF, and XGBoost. Specifically, for the PLR classifier, adding the P2 feature (P + P2) increased the OA by 2.15%, the AA by 4.1%, and the f1 score by 0.0397, and these numbers are quite similar for the SVM classifier (the OA increased by 2.15%, the AA increased by 3.1%, and the f1 score increased by 0.0473). For the XGBoost classifier, smaller increases were observed, for which the OA increased by 1.21%, the AA increased by 1.81%, and the f1 score increased by 0.0263, while these numbers are slightly lower for the RF classifier, for which the OA increased by 0.53%, the AA increased by 0.59%, and the f1 score increased by 0.0123.
When all the features were combined (P + SS + CCS + P2), the best performance was obtained across all four classifiers and over all three metrics, with converged numbers of an OA of ~96%, an AA of ~95%, and an f1 score of ~0.92. The combination of the features significantly improved the performance of the PLR and SVM classifiers, with an increase in the OA by 3.36%, the AA by 5.58%, and the f1 score by 0.0668 for the PLR classifier, while the SVM classifier saw an OA increase of 2.55%, an AA increase of 3.1%, and an f1 score increase of 0.0579. These increases are slightly smaller for the XGBoost classifier, with an OA increase of 1.35%, an AA increase of 2.1%, and an f1 score increase of 0.029, while for the RF classifier, only a slight improvement was observed, with an OA increase of 0.27%, an AA increase of 0.39%, and an f1 score increase of 0.0059. This result is consistent with the addition of P2 features on top of the P features (P + P2), where larger improvements were made for the PLR and SVM classifiers and only a slight improvement was obtained for the XGBoost classifier, with the least improvement observed for the RF classifier.
Similar to the predictions of conotoxins against the easy-negative dataset, the addition of the P2 features on top of the P features (P + P2) increases the overall performance across the PLR, SVM, RF, and XGBoost classifiers in predicting conotoxins against the hard-negative dataset (other toxic peptides), as shown in Table 4. When all the features are combined (P + SS + CCS + P2), the performance again improves over all three metrics and across all four classifiers when compared to just using P as the only feature. Both the PLR and SVM classifiers show similar increases of 1.55% for the OA, 2.21% for the AA, and 0.0255 for the f1 score; and 1.51% for the OA, 2.31% for the AA, and 0.0262 for the f1 score, respectively. The RF classifier shows slight improvement (the OA increases by 1.2%, the AA by 1.19%, and the f1 score by 0.017), while the XGBoost classifier shows the least improvement (the OA increases by 0.84%, the AA by 0.89%, and the f1 score by 0.0127). Notably, the combination of all the features (P + SS + CCS + P2) shows the best performance for three classifiers: the PLR, SVM, and RF. For the XGBoost classifier, (P + P2) and (P + SS + CCS + P2) show similar performance. Interestingly, (P + SS) and (P + SS + CCS) show similar performance for all four (PLR, SVM, RF, and XGBoost) classifiers.
When the easy-negative and hard-negative extended datasets are mixed and tested together, the addition of the P2 features on top of the P features (P + P2) and the combination of all the features (P + SS + CCS + P2) improve the predictive performance for all four classifiers over all three metrics (the OA, AA, and f1 score), as shown in Table 5. Overall, (P + P2) and (P + SS + CCS + P2) show the best performance across all the classifiers. When all the features are combined (P + SS + CCS + P2), the OA is increased by 1.64%, the AA by 1.56%, and the f1 score by 0.0421 for the PLR classifier. For the SVM classifier, the increases are 1% for the OA, 0.72% for the AA, and 0.0245 for the f1 score. Similar increases are obtained for the RF and XGBoost classifiers, with an increase in the OA by 0.62%, the AA by 1.14%, and the f1 score by 0.0211, and the OA by 0.72%, the AA by 1.05%, and the f1 score by 0.0218, respectively.

A Comparison of Our Model Performance to Previously Published Models
Overall, the RF classifier has the best performance in predicting conotoxins from non-toxic or other toxic peptides across multiple datasets. Table 6 shows how our model performance compares to previously published models (i.e., TOXIFY [9], ToxClassifier [10], ClanTox [11], ToxinPred [12], PredCSF [13]). When the primary sequence is used as the only feature, our model outperforms the best-performing published model, ToxinPred, by 1.74% in OA and 0.1% in recall. When adding P2 on top of the primary amino acid sequence feature, our model outperforms ToxinPred by 2.27% in OA and 0.78% in recall.
Table 6 (partial; only the recoverable rows are listed). A preceding row carries the note "Used to predict conotoxins from non-toxic peptides". TOXIFY [9]: Swiss-Prot-derived dataset; 0.8600; 0.7600; sequence features; used to predict if a peptide is toxic. ToxClassifier [10]: Swiss-Prot-derived dataset; 0.7700; 0.5600; sequence features; used to predict if a peptide is toxic. ClanTox [11]: Swiss ...
Table notes: The Swiss-Prot-derived dataset consists of Swiss-Prot entries, as described in [9]. The composite dataset consists of small toxins from several different databases, with entries having more than 35 residues and any non-natural amino acids removed.

Discussion
We have demonstrated that, contrary to the current practice of using only the primary sequence (P) feature, the inclusion of PTM information as well as CCS values, when coupled with additional structural features, improves the prediction accuracy of conotoxins against non-toxic and other toxic peptides across varied datasets and across four different commonly used ML classifiers (PLR, SVM, RF, and XGBoost). In particular, the addition of these new features improved the PLR classifier significantly, with the overall accuracy increasing from ~93% to ~97%, the average accuracy from ~90% to 95%, and the f1 score from 0.8603 to 0.9271 when predicting conotoxins from non-toxic samples (the extended easy-negative dataset). The fact that all four classifiers converge to similar final accuracies and f1 scores indicates that the addition of new features increases both prediction accuracy and confidence when predicting conotoxins from non-toxic peptides. Furthermore, the performance of the RF and XGBoost classifiers is slightly better than that of the other two classifiers (PLR and SVM, which have similar performance) across different datasets, suggesting that either the RF or XGBoost classifier can be used successfully to build the final model for conotoxin prediction. However, the RF classifier seems to be a better choice due to its consistently higher performance across various datasets.
Our findings also suggest that there are conserved chemical and structural signatures across conotoxins that distinguish them from non-toxic peptides and other kinds of toxins. The acquisition of new, additional experimental data on isomer conformations of conotoxins would be helpful to expand the training datasets and to bolster the impact of CCS and structural features. Our results also imply the existence of similar chemical and structural signatures in other toxin families and that an ML platform that predicts different kinds of toxins and their toxicity is feasible. Additionally, traditional structure-function relationships suggest that such features can also be used for the prediction of receptor binding partners.

Construction of Datasets
Conotoxin data were extracted from the Conoserver [33], Protein Data Bank [34], and Biological Magnetic Resonance Bank [35] websites. Both easy- and hard-negative datasets were constructed using peptide samples with lengths equal to or less than 80 residues and with known three-dimensional structures. The easy-negative dataset includes samples that are not toxic, while the hard-negative dataset consists of toxic peptides, including spider, scorpion, and snake toxins. Entries containing post-translational modifications were included in all the datasets. Negative datasets were obtained from the Protein Data Bank using keyword searches. The easy-negative dataset was built using the search term "NOT toxic", and the hard-negative dataset was constructed using the search terms "toxic" and "species NOT conus", and entries were limited to peptides with a maximum length of 80 amino acids. Entries with the keywords "Synthetic", "Unknown function", or "De Novo" were all removed from the datasets. Entries that were conantokins, prions, antifungal, or antimicrobial were placed in the hard-negative dataset regardless of other identifying tags.

Feature Extraction
The general workflow for feature extraction from PDB/structure files is shown in Figure 3. Basically, the amino acid sequence of an entry is extracted from the PDB/structure file, and the corresponding structural features are then extracted from the PDB/structure file using the DSSP [28] program.
The thirteen features extracted from the datasets, belonging to three general categories, were divided into four groups (P, P2, SS, and CCS), as shown in Figure 2. Within the compositional category, primary sequence features include amino acid composition, such as the frequency of occurrence, and the g-gap dipeptide feature to reflect the position of each amino acid in the sequences. G-gap dipeptide features count the frequencies of co-occurring residues in the sequence as neighbors with g residues separating them. We used g = 0, 1, 2 to extract amino acid positional information for adjacent amino acids or for amino acids with one or two residues separating them, respectively. Physiochemical features include the charge, mass, relative size, and relative polarity (aliphatic, aromatic, polar, hydrophobic, positively charged, negatively charged) of each residue, as previously described [24]. For non-standard amino acids, the physiochemical features were manually assigned based on their modifications, i.e., non-polar sidechains modified by the addition of an alcohol were reassigned as polar, as with hydroxyproline. The modified residues were then analyzed in the same way as the standard amino acids.
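The g-gap dipeptide feature can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `g_gap_dipeptide_freq` is hypothetical, frequencies are normalized by the number of residue pairs available at that gap (the normalization used in the study is not specified in this text), and the sequence is a made-up toy string rather than a real conotoxin.

```python
from collections import Counter

def g_gap_dipeptide_freq(seq, g):
    """Relative frequency of residue pairs separated by g residues.

    g = 0 pairs adjacent residues, g = 1 skips one residue, and so on.
    """
    # Pair residue i with residue i + g + 1 for every valid position i.
    pairs = [seq[i] + seq[i + g + 1] for i in range(len(seq) - g - 1)]
    total = len(pairs)
    if total == 0:
        return {}
    counts = Counter(pairs)
    return {pair: n / total for pair, n in counts.items()}

toy_seq = "CCAC"  # hypothetical sequence for illustration only
for g in (0, 1, 2):
    print(g, g_gap_dipeptide_freq(toy_seq, g))
```

With one-letter codes for the 20 standard residues, each gap value yields up to 400 possible pair features; modified residues would need their own symbols, which is left out of this sketch.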
Structural features include secondary structure information, the area exposed to solvent within each physiochemical class, the radius of gyration, and CCS values. The DSSP software (version 3.0.0) [27,28,36,37] was used to calculate the secondary structure type and solvent-exposed surface area for each amino acid. The software package HPCCS (version 1.0) [22] was used to calculate CCS values based on the masses and partial charges determined by pdb2pqr (version 2.1.1) [37,38] for each atom in every input PDB file. For PTMs, custom parameters were employed based on the amber force field [39]. The Cα positions were used to calculate the radius of gyration.
The goal of the feature selection step is to find the best feature subsets to maximize the robustness and the performance of the classifiers, and this will be discussed below in more detail. Here we use the F-score, a variance-based analysis, which measures the classifying power of the features, where the larger the F-score, the higher its classifying power. For each feature, the F-score is calculated as follows:
$$F = \frac{\text{variance between classes}}{\text{variance within classes}}$$
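A per-feature F-score of this kind can be sketched with the standard library. The snippet assumes the one-way ANOVA form of the ratio (between-class mean square over within-class mean square); the text only states "variance between classes over variance within classes", so the degrees-of-freedom normalization here is an assumption, as are the function name `f_score` and the toy values.

```python
def f_score(values, labels):
    """Ratio of between-class to within-class variance for one feature."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and n > number of classes")
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Between-class variance: spread of the class means around the grand mean.
    between = sum(len(g) * (means[y] - grand_mean) ** 2
                  for y, g in groups.items()) / (k - 1)
    # Within-class variance: spread of the samples around their class mean.
    within = sum((v - means[y]) ** 2
                 for y, g in groups.items() for v in g) / (n - k)
    return between / within if within > 0 else float("inf")

# Hypothetical feature values for three positives and three negatives.
print(f_score([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1]))
```

A feature whose class means are far apart relative to the scatter inside each class gets a large score, and a feature with identical class means scores zero.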

Elimination of Highly Correlated Features
Before applying the feature selection protocol to the data, highly correlated features were removed from the dataset. If this was not carried out, the feature selection method would fail and select only a group of highly correlated features, which would negatively affect the classifier's performance. To address this issue, Pearson correlation coefficients are computed between features to measure redundancy. If the correlation coefficient between two features is larger than a preset threshold, the one with the smaller F-score is removed. The preset threshold is a hyperparameter, whose value is highly dependent on the datasets. This process produces a smaller, but more independent, set of features, which improves the classifier's performance.
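The elimination step can be sketched as a greedy filter. This is an illustrative sketch only: the names `pearson` and `drop_correlated`, the threshold of 0.9, the use of the absolute value of the correlation, and the greedy visiting order are assumptions; the text specifies only that, for a pair above the threshold, the feature with the smaller F-score is removed.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(features, scores, threshold=0.9):
    """features: name -> list of values; scores: name -> F-score.

    Features are visited from highest to lowest F-score, so whenever two
    features are correlated above the threshold the lower-scoring one is
    the one that gets dropped.
    """
    kept = []
    for name in sorted(features, key=lambda f: scores[f], reverse=True):
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy features: "size" is perfectly correlated with "mass".
features = {"mass": [1, 2, 3, 4], "size": [2, 4, 6, 8], "charge": [1, -1, 1, -1]}
scores = {"mass": 5.0, "size": 3.0, "charge": 1.0}
print(drop_correlated(features, scores))
```

Since the threshold is a dataset-dependent hyperparameter, it would normally be tuned rather than fixed at the default used here.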

Incremental Feature Selection
The incremental feature selection framework [40] was employed to select the optimal feature subset used to build the classifiers. All features were ranked using the F-score as described above (Section 4.3.1), and redundant features were removed. Subsets are formed from a range of the ranked features, and an ML method is used to measure the performance of each subset with cross-validation. A simple linear SVM is used since the dataset size is small. The optimum number of features is the one that produces the highest balanced accuracy. Since our datasets are imbalanced, with more entries in the negative dataset compared to the positive dataset, the balanced accuracy, which is the average accuracy weighted by the classes' size, is a better metric to evaluate the classifier's performance than the overall accuracy.
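The selection loop can be sketched independently of any particular classifier. In the sketch below, `incremental_feature_selection` and the `evaluate` callback are hypothetical names; `evaluate` stands in for "train a linear SVM on this subset and return its cross-validated balanced accuracy", and growing the subset one ranked feature at a time is an assumption about how the subsets are formed.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Pick the top-k prefix of a ranked feature list with the best score.

    ranked_features: feature names ordered by decreasing F-score.
    evaluate: callable mapping a list of names to a score, for example a
              cross-validated balanced accuracy (placeholder here).
    """
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)
        # Strict comparison keeps the smaller subset when scores tie.
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical scores by subset size, standing in for real cross-validation.
toy_scores = {1: 0.80, 2: 0.90, 3: 0.90, 4: 0.85}
subset, score = incremental_feature_selection(
    ["f1", "f2", "f3", "f4"], lambda names: toy_scores[len(names)])
print(subset, score)
```

Passing the scoring step in as a callback keeps the loop reusable with any classifier and any cross-validation scheme.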

Classifiers
In this study, four common ML classifiers (PLR [29], SVM [30], RF [31], and XGBoost [32]) were employed to evaluate how new features would affect the prediction accuracy for the conotoxins. Given that our datasets are small, the aim was to keep our models at low complexity to make the models more robust [41]. A linear kernel was used for the SVM. As shown in Figure 4, the classifiers were trained using the training data, meaning their parameters are optimized for the classifiers to fit the training data. The trained classifier's performance was then tested on the testing dataset. The trained classifier, with its optimized parameters, was then used to make predictions on data with unknown labels.

Using Geometric SMOTE to Handle Imbalanced Datasets
When classifying conotoxins against the mixed extended negative dataset (easy-negative + hard-negative, or non-toxic + other toxic peptides), the data were highly imbalanced because the sample size of the negative dataset (877 entries) was much larger than the sample size of known conotoxins (184 entries). In addition to using the average accuracy metric for better evaluation, an oversampling technique called Geometric Synthetic Minority Oversampling Technique (GSMOTE) [42] was used to correct for this issue. Oversampling techniques are general approaches that address imbalanced datasets by generating artificial data for the minority class. The SMOTE [43] class of methods generates synthetic samples along the line segments that connect minority class samples, which fills in the gaps between minority class samples and densifies the minority clusters. GSMOTE is a variant of SMOTE that generates synthetic data specifically within a geometric region around the minority samples. In this way, noisy samples from the minority class are added to the data, which increases the sample size without duplicating samples in the classes.
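
To make the line-segment interpolation concrete, the sketch below generates synthetic minority samples in the manner of basic SMOTE. It is not GSMOTE and not the implementation used in this study: GSMOTE samples from a geometric region around each minority sample rather than only along the connecting segment. The function name, the neighbourhood size `k`, and the toy data are assumptions made for illustration.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    minority : list of feature vectors (lists of floats), at least two samples
    k        : number of nearest minority neighbours considered (assumed value)
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=10, k=2)
```

Every generated point lies on a segment between two existing minority samples, so no sample is duplicated exactly. For the class sizes reported above, fully matching the 877 negative entries would require 877 - 184 = 693 synthetic conotoxin samples, although the oversampling ratio actually used is not stated here.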

Performance Evaluation
To maintain consistency across multiple datasets, a leave-one-out cross-validation protocol was used to measure the performance of the classifiers. The metrics used to measure the classification performance were the overall accuracy (OA), average accuracy (AA), precision (Pr), recall (Re), and f1 score [44], which are defined as follows:

$$\mathrm{OA} = \frac{TP_0 + TP_1}{N}$$

where $N$ is the total number of samples.
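
As an illustration of how these metrics can be computed from predicted labels, the sketch below uses the commonly accepted definitions of each quantity. Only the OA expression is taken from the text above; the expressions for AA, Pr, Re, and the f1 score, the function name, and the choice of label 1 for the positive (conotoxin) class are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute OA, AA, Pr, Re, and f1 for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # TP_1
    tn = sum(t != positive and p != positive for t, p in pairs)  # TP_0
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    oa = (tp + tn) / len(pairs)                    # overall accuracy
    re = tp / (tp + fn) if tp + fn else 0.0        # recall on the positive class
    neg_acc = tn / (tn + fp) if tn + fp else 0.0   # accuracy on the negative class
    aa = (re + neg_acc) / 2                        # average (balanced) accuracy
    pr = tp / (tp + fp) if tp + fp else 0.0        # precision
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return {"OA": oa, "AA": aa, "Pr": pr, "Re": re, "f1": f1}


# Hypothetical labels: 2 positive and 4 negative samples.
metrics = classification_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

In this toy case the overall accuracy (4/6) is higher than the average accuracy (0.625), because the majority class is classified more accurately than the minority class. This is the situation in which the average accuracy is the more informative metric.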

Figure 2. Features were divided into four groups (P, P2, SS, and CCS), and the effect of each feature group was evaluated with regard to conotoxin prediction accuracy.

4.2. Feature Extraction
The general workflow for feature extraction from PDB/structure files is shown in Figure 3. The amino acid sequence of an entry is extracted from the PDB/structure file, and the corresponding structural features are then extracted from the same file using the DSSP [28] program.

Figure 4. Overall ML pipeline describing the process of using a dataset to train and cross-validate a classifier.

Table 1. Number of samples in small and extended datasets used in the machine learning experiments.

Table 2. The average accuracy and f1 scores of classification performance across different datasets.

Table 3. The classification performance of different feature sets for the extended conotoxins versus the extended easy-negative dataset (non-toxic peptides). Performance of the PLR and SVM classifiers (top table) and the RF and XGBoost classifiers (bottom table) is shown.

Table 4. The classification performance of different feature sets for the extended conotoxins versus the extended hard-negative dataset (other toxic peptides). Performance of the PLR and SVM classifiers (top table) and the RF and XGBoost classifiers (bottom table) is shown.

Table 5. The classification performance of different feature sets for the extended conotoxins versus the mixed extended negative dataset (easy-negative + hard-negative, or non-toxic + other toxic peptides). Performance of the PLR and SVM classifiers (top table) and the RF and XGBoost classifiers (bottom table) is shown.

Table 6. A comparison of our model performance to previously published models.
Comparative table of prediction results from multiple models showing the accuracy and recall of different methods. Training sets used by the above methods include an S2 dataset, a Swiss-Prot-derived dataset, and a composite dataset. The S2 dataset is a superfamily training set containing 261 entries with four superfamilies: A (63 samples), M (48 samples), O (95 samples), and T (55 samples).