4. Discussion
In this work, we present a new KEGG-based dataset for the machine learning task of predicting the pathway involvement of metabolites. It significantly improves on the KEGG-SMILES dataset used in previous publications on metabolic pathway prediction, which contained duplicate entries and lacked both the code used to create it and a description of how it was created. The lack of code precludes updating the dataset as more metabolites are discovered, the duplicate entries invalidate the prior analyses, and the lack of a description of its creation makes the dataset suspect in general. Huckvale et al. outlined the requirements of a benchmark dataset for this machine learning task, specifying that it must be reproducible, valid, accessible, and complete [15]. We use the highest standards of computational reproducibility in our dataset [37,38], providing a thoroughly detailed description of how it was created. This includes merging duplicate entries discovered in KEGG, filtering out entries whose limited chemical information reduces classification reliability, and semi-automated manual inspection of a small subset of entries with a range of potential issues (Figure 1). We provide the raw data and code for complete reproducibility when re-generating the dataset, as well as the original scripts for obtaining the raw data in the first place, including instructions for adding to the raw data as KEGG releases updates. Our final dataset of 5683 entries exceeds the de-duplicated KEGG-SMILES dataset of 4929 entries by 754 entries. Beyond being reproducible and malleable, the dataset is also valid, since it contains no duplicate metabolites and the description of its creation is transparent. The dataset is also complete according to the most up-to-date KEGG data as of 3 July 2023 (with the caveat of filtering entries by non-hydrogen atom count). Finally, this new KEGG-based dataset is maintainable as KEGG changes. We recommend that future research in metabolic pathway prediction use our dataset, build on our dataset, or otherwise adopt the same standards of scientific computational reproducibility, data validation, accessibility, and completeness.
We present strong evidence that the number of non-hydrogen atoms in a metabolite correlates with how reliably that metabolite can be classified. This trend is expected, since low information content generally hampers classification. Considering the clear drop in misclassification rate until reaching a non-hydrogen atom count of seven (Figure 3), we recommend that future work using our methods predict only on metabolites with at least seven non-hydrogen atoms, as our models were trained on a dataset with metabolites that meet this restriction. This is pragmatic, since the prediction of pathway involvement should primarily focus on molecules whose pathway involvement is not already known, and the pathway involvement of most molecules with fewer than seven non-hydrogen atoms is better known. If predicting on metabolites with fewer than seven non-hydrogen atoms is necessary, one will need either to accept lower reliability or to produce models more capable of predicting on such metabolites. When performing this machine learning task using different models or different datasets, we recommend being cautious of non-hydrogen atom count and monitoring the misclassification rates of the metabolites.
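The seven-atom cutoff can be applied as a simple pre-filter before prediction. The following is a minimal sketch, not the code used to build the dataset. It assumes that structures are available as V2000 molfile text with the element symbol in columns 32-34 of each atom line. The function names and the `make_toy_molfile` helper are hypothetical, and R-group placeholders are simply counted as non-hydrogen atoms.

```python
MIN_NON_HYDROGEN_ATOMS = 7  # cutoff recommended in the text above


def count_non_hydrogen_atoms(molfile_text):
    """Count atoms whose element symbol is not 'H' in a V2000 molfile.

    Assumed layout: three header lines, a counts line whose first three
    characters hold the atom count, then one line per atom with the
    element symbol in columns 32-34. Every non-'H' symbol is counted,
    including R-group placeholders; how those should be handled is a
    choice this sketch does not make.
    """
    lines = molfile_text.splitlines()
    n_atoms = int(lines[3][0:3])
    symbols = [line[31:34].strip() for line in lines[4:4 + n_atoms]]
    return sum(1 for symbol in symbols if symbol != "H")


def meets_atom_cutoff(molfile_text, minimum=MIN_NON_HYDROGEN_ATOMS):
    """True when the metabolite has at least `minimum` non-hydrogen atoms."""
    return count_non_hydrogen_atoms(molfile_text) >= minimum


def make_toy_molfile(symbols):
    """Hypothetical helper: build a bond-free V2000 molfile for the demo."""
    counts = f"{len(symbols):3d}{0:3d}  0  0  0  0  0  0  0  0999 V2000"
    atoms = [
        f"{0.0:10.4f}{0.0:10.4f}{0.0:10.4f} {s:<3} 0  0  0  0  0  0"
        for s in symbols
    ]
    return "\n".join(["toy", "  sketch", "", counts, *atoms, "M  END"])


# Toy inputs: three heavy atoms with explicit hydrogens, and eight heavy atoms.
ethanol_like = make_toy_molfile(["C", "C", "O"] + ["H"] * 6)
heptanol_like = make_toy_molfile(["C"] * 7 + ["O"])
print(meets_atom_cutoff(ethanol_like), meets_atom_cutoff(heptanol_like))
```

Metabolites that fail the check can still be passed to a model, but their predictions should be treated as less reliable, in line with the recommendation above.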
We defined ambiguous metabolites as those containing R groups or repeat sequences specified in their molfile such that underlying chemical information is obfuscated. We expected and demonstrated that such metabolites would be more difficult to predict correctly for most pathway categories. However, for some pathway categories, i.e., ‘Glycan biosynthesis and metabolism’ and ‘Lipid metabolism’, ambiguous metabolites surprisingly outperformed the non-ambiguous entries (Figure S2).
By generating features from the atom colors, we make features out of the molecular substructures of the metabolites. Measuring the importance of these features enables biochemists to determine which substructures are associated with the pathway involvement of the corresponding metabolites. Some substructures are associated with a metabolite being present in a pathway category, while other substructures indicate that a metabolite is absent from said category (Table 7). The ability to quantify the importance of metabolite substructures, together with whether their association is positive or negative, provides insight into which substructures mark a metabolite as belonging to a pathway category and which mark it as excluded from that category. For example, the C-C-C-C atom color highlighted in Figure 10 would not be readily thought of as an identifying feature for ‘Glycan biosynthesis and metabolism’; however, this feature helps identify metabolites used in lipopolysaccharide biosynthesis along with the presence of other identifying features.
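One simple way to illustrate the difference between a positive and a negative association is the phi coefficient between a binary feature (substructure present or absent) and the binary pathway label. The sketch below is an assumption-laden illustration and not necessarily how the associations in Table 7 were computed. The toy vectors are made up, and the function name is hypothetical.

```python
import math


def phi_coefficient(feature_present, in_pathway):
    """Phi coefficient between two equal-length sequences of 0/1 values.

    Positive values mean the feature tends to co-occur with pathway
    membership; negative values mean it tends to co-occur with absence.
    """
    if len(feature_present) != len(in_pathway):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(feature_present, in_pathway))
    n11 = sum(1 for f, y in pairs if f and y)          # feature, in pathway
    n10 = sum(1 for f, y in pairs if f and not y)      # feature, not in pathway
    n01 = sum(1 for f, y in pairs if not f and y)      # no feature, in pathway
    n00 = sum(1 for f, y in pairs if not f and not y)  # neither
    denominator = math.sqrt(
        (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    )
    if denominator == 0:
        return 0.0  # undefined when a row or column is empty; report 0
    return (n11 * n00 - n10 * n01) / denominator


# Made-up toy data: six metabolites, two of which are in the category.
label = [1, 1, 0, 0, 0, 0]
feature_a = [1, 1, 1, 0, 0, 0]  # mostly present in members
feature_b = [0, 0, 1, 1, 1, 1]  # present only in non-members

phi_a = phi_coefficient(feature_a, label)
phi_b = phi_coefficient(feature_b, label)
print(phi_a, phi_b)
```

The sign of each coefficient indicates the direction of the association, while a model-derived importance score indicates how much the classifier relies on the feature; the two answer different questions and are best read together.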
Next, when comparing the performance of the three machine learning models, XGBoost unsurprisingly performed better overall than the Random Forest model, while the MLP deep learning method did not improve on the tree-based methods. This is incongruent with deep learning-based methods exceeding the performance of tree-based methods in past publications (albeit on an invalid dataset). However, those models were more sophisticated than a simple MLP, and such deep learning methods may surpass the performance of XGBoost trained on our atom color features. That said, the atom color features do provide information on the molecular substructure of the metabolites similar to the graph-based models, albeit in a linearized fashion, and they include information not just on atom configuration but also on stereochemistry and bond order. It is still an open question whether models capable of processing more complex data structures can improve upon the performance of XGBoost trained on a tabular dataset. To our knowledge, such models have yet to incorporate additional information beyond simple backbone molecular structure, such as atom stereochemistry, bond stereochemistry, and bond order.
We recommend using separate classifiers per pathway category. Depending on the pathway category that a classifier is being trained to predict, different hyperparameter values will result from the hyperparameter tuning. We also see that the importance of the features used is highly dependent on the pathway category being predicted (Figure S3), while the majority of features have little to no importance (Figure 9). If future work uses our atom coloring method to generate features, one may consider selecting features based on importance. However, one should be mindful of the pathway class being predicted, since different target classes will require different selected features. It is possible that the important features will change further if models are trained to predict more specific pathway classes, and we recommend using separate binary classifiers for the more specific pathway classes as well, and perhaps a hierarchical classification method.
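Importance-based feature selection carried out separately for each pathway category can be sketched as follows. This is a minimal illustration under stated assumptions: the importance scores are taken to be available as a dictionary per category, the category and feature names and all values are made-up placeholders, and the threshold of 0.01 is arbitrary rather than a recommended setting.

```python
def select_features_per_category(importances, min_importance=0.01):
    """For each category, keep features whose importance meets the threshold.

    `importances` maps category name -> {feature name: importance score}.
    Returned feature lists are ordered from most to least important.
    """
    selected = {}
    for category, scores in importances.items():
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        selected[category] = [
            name for name, score in ranked if score >= min_importance
        ]
    return selected


# Placeholder importances for two hypothetical categories.
toy_importances = {
    "Category A": {"color_1": 0.40, "color_2": 0.002, "color_3": 0.05},
    "Category B": {"color_1": 0.001, "color_2": 0.30, "color_3": 0.0},
}

selected_features = select_features_per_category(toy_importances)
for category, features in selected_features.items():
    print(category, features)
```

Because each category keeps its own feature list, a binary classifier for one category can be trained on a different, smaller feature table than the classifier for another category.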
While the weighted average MCC of XGBoost trained on our final dataset (full feature set, full test set) was 0.7677 with a weighted standard deviation of 0.1540 (Table 6), these weighted aggregates include ‘Chemical structure transformation maps’, the worst-performing pathway category (Figure 8). This category was excluded from previous publications on this machine learning task, including the most recent model for metabolic pathway prediction proposed by Du et al. called the MLGL-MP [14]. Huckvale et al. re-ran the MLGL-MP on a de-duplicated version of the KEGG-SMILES dataset [15], making it more comparable to our own dataset (though even the de-duplicated version is suspect).
Table 8 shows that the MCC improves significantly when ‘Chemical structure transformation maps’ is removed from the weighted average and weighted standard deviation calculations. We also see from Table 8 that the F1 score of XGBoost trained on our dataset is comparable to that of the MLGL-MP trained on the de-duplicated version of the KEGG-SMILES dataset, keeping in mind that theirs was not a weighted average since their model predicted all the pathway categories at once rather than using an isolated classifier per pathway class. It should also be noted that the MLGL-MP was originally evaluated using the test set in each training epoch and choosing the highest scores from multiple evaluations, thus using the test set for model selection [15], while we instead followed the best practice of training the models completely and evaluating on the test set only once per CV fold. We do not compare to the standard deviation of the MLGL-MP, since the MLGL-MP was only evaluated on 10 unique folds, which does not provide a reasonable estimate of the actual model performance variation.
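The effect of dropping one category from a weighted aggregate can be illustrated with a short sketch. The weighting scheme is an assumption here: each category is weighted by a count such as its number of entries, and the weighted standard deviation is computed without a bias correction. The category names, scores, and weights are made-up placeholders, not values from Table 6 or Table 8.

```python
import math


def weighted_mean_and_std(values, weights):
    """Weighted mean and (uncorrected) weighted standard deviation."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(w * v for v, w in zip(values, weights)) / total_weight
    variance = (
        sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total_weight
    )
    return mean, math.sqrt(variance)


# Placeholder per-category scores: name -> (MCC, weight).
toy_scores = {
    "Category A": (0.80, 100),
    "Category B": (0.70, 300),
    "Category C": (0.30, 100),  # stands in for a poorly performing category
}


def aggregate(scores, exclude=()):
    """Aggregate the scores, skipping any category named in `exclude`."""
    kept = [pair for name, pair in scores.items() if name not in exclude]
    return weighted_mean_and_std([v for v, _ in kept], [w for _, w in kept])


mean_all, std_all = aggregate(toy_scores)
mean_without_c, std_without_c = aggregate(toy_scores, exclude=("Category C",))
print(mean_all, std_all)
print(mean_without_c, std_without_c)
```

In this toy case, removing the low-scoring category raises the weighted mean and shrinks the weighted standard deviation, which is the same direction of change described for Table 8.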
In all of these cross-validation analyses and performance evaluations, KEGG is treated as a gold standard, making the assumption that KEGG’s description of pathway involvement is complete and without error. This assumption is necessary for training and evaluation and should be somewhat reasonable for central metabolism; however, it is clearly not strictly true, since KEGG is growing. Thus, there are implications of using a “gold standard” that is evolving. Besides improved machine learning methods and models, metabolic pathway prediction may also improve with more positive entries, i.e., metabolites that are involved in particular pathway classes (positive) as compared to not being involved in said pathway class (negative). A larger number of positive entries was simulated by duplicating already extant positive entries in the KEGG-SMILES dataset, as shown by Huckvale et al. [15], which of course resulted in impressive scores for this machine learning task in past publications but rendered the dataset invalid. However, the higher scores from duplicated entries do provide evidence that having more non-duplicate (real/valid) positive entries can greatly improve model performance, including in currently poorly performing categories like ‘Energy metabolism’. For low-performing pathways, more positive entries may be added to KEGG over time. However, additional positive metabolites may already be available in other data sources such as MetaCyc and PubChem.
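Duplicated entries invalidate an evaluation because copies of the same metabolite can land in both the training and the test split. The sketch below shows one way to merge duplicates and to check a split for shared entries. It assumes each entry carries a structure key that is identical for duplicate metabolites; the key values, the label sets, and the function names are hypothetical, and this is not the merging procedure used to build the dataset.

```python
def merge_duplicates(entries):
    """Merge entries that share a structure key, taking the union of labels.

    `entries` is an iterable of (structure_key, set_of_pathway_categories).
    Returns a dict mapping each unique key to its combined label set.
    """
    merged = {}
    for structure_key, categories in entries:
        merged.setdefault(structure_key, set()).update(categories)
    return merged


def shared_between_splits(train_keys, test_keys):
    """Structure keys that occur in both the training and the test split."""
    return set(train_keys) & set(test_keys)


# Placeholder entries: "structure_1" appears twice with different labels.
toy_entries = [
    ("structure_1", {"Lipid metabolism"}),
    ("structure_2", {"Energy metabolism"}),
    ("structure_1", {"Carbohydrate metabolism"}),
]

merged_entries = merge_duplicates(toy_entries)
leaked = shared_between_splits(
    ["structure_1", "structure_2"], ["structure_3", "structure_1"]
)
print(len(toy_entries), len(merged_entries), leaked)
```

Running the overlap check after every split, on the merged entries, gives a cheap guard against the kind of leakage that inflated scores in earlier work.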
Since the overall performance as well as the variance in performance are greatly dependent on the pathway category being predicted, certain use-cases may need to exclude certain pathway categories. As illustrated with the violin plots of MCC scores per pathway category in Figure 8 as well as Table S3, predictions for the ‘Chemical structure transformation maps’, ‘Energy metabolism’, and ‘Metabolism of other amino acids’ pathway categories fall below a median MCC of 0.6, and these categories should likely be excluded from many practical applications.
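A median-based screen of this kind is straightforward to automate. The following sketch keeps only the categories whose median MCC across cross-validation folds reaches a chosen threshold. The per-fold scores and category names are made-up placeholders rather than values from Table S3, and the 0.6 default simply mirrors the threshold mentioned above.

```python
from statistics import median


def categories_meeting_threshold(mcc_by_category, threshold=0.6):
    """Return the categories whose median MCC is at least `threshold`.

    `mcc_by_category` maps category name -> list of per-fold MCC scores.
    Categories whose median falls below the threshold are left out.
    """
    return sorted(
        category
        for category, scores in mcc_by_category.items()
        if median(scores) >= threshold
    )


# Placeholder per-fold scores for three hypothetical categories.
toy_mcc = {
    "Category A": [0.82, 0.79, 0.85],
    "Category B": [0.55, 0.61, 0.58],
    "Category C": [0.60, 0.64, 0.59],
}

usable_categories = categories_meeting_threshold(toy_mcc)
print(usable_categories)
```

The appropriate threshold depends on the application, so it is exposed as a parameter rather than fixed.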