1. Introduction
Pathways are networks of interconnected chemical reactions and interacting biomacromolecules within cells and organisms. If a chemical compound is involved as a product, reactant, or other small-molecule participant in a chemical reaction, it is de facto associated with that reaction. If a particular reaction takes place in a “pathway”, the compounds associated with that reaction are considered to be associated with that “pathway” [1,2,3]. In this context, “pathway” can be a metabolic pathway, signaling pathway, biological process, disease process, or other biological concept with a graph-like representation of molecular interactions. When researchers detect various compounds in the biological samples of their experiments, it is highly useful to know which pathways the detected compounds are involved in, since such information provides insight into the biological functions of the compounds. This facilitates drug discovery, provides insight into the causes and treatment of disease, and aids biological research overall. Because of this interpretative utility, the pathway associations of compounds are annotated in knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [4,5,6], Reactome [7], and MetaCyc [8]. However, these knowledgebases are grossly (at least 50%) incomplete, as there are many compounds without any pathway annotations, and determining the pathway involvement experimentally is time-consuming and costly. A simple comparison of the compound and pathway entries in the three knowledgebases (see Table 1) clearly demonstrates that no single knowledgebase contains a majority of both compound and pathway entries. Likewise, the reactions in KEGG and MetaCyc represent a minority of known enzymatic reactions in other knowledgebases such as BRENDA: KEGG + MetaCyc represent roughly 20,000 reactions, compared to over 40,000 reactions in BRENDA, according to a comparison provided by MetAMDB [9]. At the compound level, KEGG and MetaCyc have roughly 9000 and 14,600 compounds, respectively, while BRENDA has over 111,000 compounds [9]. When enzyme promiscuity is considered, KEGG, MetaCyc, and Reactome, even together, likely represent a minority of compounds and reactions in cellular metabolism, and thus, the current definitions of pathways in these knowledgebases are incomplete. Given that the current human-defined pathways are incomplete, with the surrounding chemical reactions missing from their definitions, and considering that discovering the pathway involvement of metabolites is time-consuming and costly, the field of metabolomics has faced a persistent need to fill in the missing pathway annotations in an efficient and cost-effective manner, thereby expanding the pathway definitions to become more complete and more descriptive of (cellular) metabolism.
To increase the number of pathway annotations available for interpretation of experimentally measured metabolites, several prior studies have prototyped machine learning models that predict pathway involvement based on a compound’s chemical structure representation, with varying levels of performance. Attempts to predict the pathway involvement of compounds most notably began with the work of Hu et al., in which chemical interaction data were used to predict 11 level 2 metabolic pathway categories found in KEGG [10]. Building on the work of Hu et al., Baranwal et al. created a dataset representing compounds in SMILES format [11] along with their mappings to one or more of the 11 KEGG level 2 metabolic pathway categories. Baranwal et al. trained a multi-output graph neural network [12] with 11 outputs, one for each pathway category, where compounds were represented as graphs and information about their molecular structure was used to predict their pathway involvement. Yang et al. [13] and Du et al. [14] later proposed different variants of graph neural networks to predict these same pathway categories using the same dataset. Huckvale and Moseley discovered that the results of the models trained on this initial dataset were invalid [15] due to exact duplicates within the dataset, leading to data leakage and an overoptimistic estimate of model performance [16]. As a result, Baranwal et al. published a corrected version of the paper with the duplicate samples removed from the dataset [17]. All studies published before our 7 May 2024 publication [18] used either a multi-classifier or a set of binary classifiers implementing a one-vs-rest classification approach [19] and only predicted 11 or 12 level 2 metabolic pathways defined in KEGG, which have very limited practical application, especially for pathway enrichment analyses. Since Baranwal et al. met the proper standards of scientific computational reproducibility by providing their code and data, we were able to train their model over 50 CV iterations and calculate MCC, resulting in a mean MCC of 0.7642 and a standard deviation of 0.0137 (Table S1), providing representative performance of models generated prior to 7 May 2024. Moreover, it is important to report model performance in MCC due to the high imbalance in the training and testing datasets. With high imbalance, MCC has an advantage over the F1 score, which ignores true negatives, and a major advantage over accuracy, which ignores false positives and false negatives within the numerator [20,21].
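The difference between these metrics can be illustrated with a minimal sketch. The confusion-matrix counts below are hypothetical and chosen only to mimic a highly imbalanced dataset; they are not results from any of the cited studies, and the function names are ours.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score; true negatives do not appear anywhere in the formula."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Accuracy; false positives and false negatives only enter the denominator."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical case 1: 10 positives among 1000 entries, everything predicted negative.
acc_all_negative = accuracy(tp=0, fp=0, fn=10, tn=990)
mcc_all_negative = mcc(tp=0, fp=0, fn=10, tn=990)

# Hypothetical case 2: identical tp/fp/fn, but very different numbers of true negatives.
f1_many_tn = f1(tp=5, fp=5, fn=5)
f1_few_tn = f1(tp=5, fp=5, fn=5)
mcc_many_tn = mcc(tp=5, fp=5, fn=5, tn=985)
mcc_few_tn = mcc(tp=5, fp=5, fn=5, tn=5)

print(f"all-negative predictor: accuracy={acc_all_negative:.2f}, MCC={mcc_all_negative:.2f}")
print(f"many true negatives:    F1={f1_many_tn:.2f}, MCC={mcc_many_tn:.4f}")
print(f"few true negatives:     F1={f1_few_tn:.2f}, MCC={mcc_few_tn:.4f}")
```

In the first case, accuracy looks excellent although no positive entry is recovered, while MCC is zero. In the second case, the F1 score cannot distinguish the two situations because it never sees the true negatives, whereas MCC does.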
KEGG pathways are organized in a hierarchical fashion where there are seven top-level (level 1) pathway categories within which there are second-level pathway categories, and at the third level, we see individual pathways [22]. The 11 outputs of these past models were specifically predicting the second-level pathway categories under the ‘Metabolism’ top-level category. While these initial models were instrumental in demonstrating the ability to predict involvement based on information about a compound’s molecular structure, the reality is that there are far more pathways that are of biological interest. KEGG alone has over 500 pathways defined [22]. Meanwhile, Reactome and MetaCyc both have thousands of pathways defined [23,24]. Therefore, this is not a simple multi-output problem but rather an extreme classification problem [25,26] with thousands of different classes. One could train a multi-output model with thousands of outputs, but it is well known that as the number of classes increases while the dataset size remains the same, it becomes more challenging to accurately predict the increasing number of classes [27]. Alternatively, a separate binary classifier could be trained for each class, but in the case of pathway prediction, there are several small pathways with very few associated compounds. This results in many more negative entries than positive entries, and the severe class imbalance greatly reduces model performance [28].
Huckvale and Moseley resolved the extreme classification problem in metabolic pathway prediction by developing a multitask classification approach that cross-joins compound features with features representing a pathway, training a single binary classifier to predict whether the given compound is associated (i.e., involved) with the given pathway [18]. In this context, classifying a given compound as belonging to a given pathway represents a distinct classification task. With this technique, rather than the limited dataset size (number of compounds only) needing to be shared amongst thousands of classes in a multi-output model, the dataset size increases, being multiplied by the number of classes (pathways), and only a single output is necessary. This is because, rather than a dataset entry being defined as a compound, it is defined as a compound–pathway pair. This reformulation of the metabolic pathway prediction problem demonstrated that a model can be trained in a computationally practical manner while predicting an indefinite number of pathways with sufficient performance. First, Huckvale and Moseley demonstrated that not only can 12 level 2 metabolic pathways be effectively predicted (including the poorly performing pathway that everyone else left out) [18], but also that 172 level 3 pathways can be predicted [29]. This was followed by predicting all 502 pathways defined in KEGG using a dataset of 6485 compounds [30]. Going beyond KEGG, Huckvale and Moseley later demonstrated that models can effectively be trained to predict all 3985 Reactome pathways [31] and all 4055 MetaCyc pathways [32]. In addition, these studies demonstrated that training on all the pathways together with a single multitask classification model resulted in significant transfer learning across pathway-specific classification tasks, greatly improving pathway prediction compared to training a separate model for each pathway class in traditional one-vs-rest approaches.
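The cross-join reformulation can be sketched in a few lines. This is a toy illustration under our own assumptions, not the published implementation: the identifiers, feature values, and the `cross_join` helper are hypothetical, and feature vectors are plain Python lists.

```python
from itertools import product


def cross_join(compound_features, pathway_features, associations):
    """Build one labelled entry per compound-pathway pair.

    compound_features / pathway_features: dict mapping an identifier to a feature list.
    associations: set of (compound_id, pathway_id) pairs annotated as associated.
    Returns a list of (concatenated_feature_vector, binary_label) tuples.
    """
    entries = []
    for (c_id, c_vec), (p_id, p_vec) in product(
        compound_features.items(), pathway_features.items()
    ):
        label = 1 if (c_id, p_id) in associations else 0
        entries.append((c_vec + p_vec, label))  # list concatenation
    return entries


# Hypothetical toy data: 3 compounds, 2 pathways, 3 known associations.
compounds = {"C1": [1, 0, 2], "C2": [0, 3, 0], "C3": [1, 1, 1]}
pathways = {"P1": [1, 3, 2], "P2": [1, 1, 1]}
known = {("C1", "P1"), ("C2", "P1"), ("C3", "P2")}

entries = cross_join(compounds, pathways, known)
print(len(entries))                        # number of compound-pathway entries
print(sum(label for _, label in entries))  # number of positive entries

# The single-knowledgebase entry counts listed in Table 1 equal
# (# unique compounds) x (# unique pathways).
kegg_entries = 6485 * 502
reactome_entries = 1976 * 3985
metacyc_entries = 9847 * 4055
```

The dataset grows from 3 entries (compounds) to 6 entries (pairs), and a single binary output replaces one output per pathway. The entry count of the combined dataset in Table 1 is not a simple product of its compound and pathway counts, so it is not reproduced here.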
Table 1.
Description of the combined KEGG + Reactome + MetaCyc dataset compared to that of prior studies. The “#” symbol denotes “number of”.
| Dataset | # Compound Features | # Pathway Features | # Unique Compounds | # Unique Pathways | # Entries | Reference |
|---|---|---|---|---|---|---|
| KEGG + Reactome + MetaCyc | 34,474 | 27,208 | 16,640 | 8195 | 50,127,958 | Current study |
| KEGG | 16,509 | 11,321 | 6485 | 502 | 3,255,470 | [30] |
| Reactome | 6187 | 5386 | 1976 | 3985 | 7,874,360 | [31] |
| MetaCyc | 19,081 | 15,349 | 9847 | 4055 | 39,929,585 | [32] |
To handle a high number of pathways, Huckvale and Moseley entirely reformulated the problem to handle extreme classification by concatenating a compound feature vector with a pathway feature vector. The compound feature vector representation was made possible by the work of Jin et al. [33,34,35], who developed a graph-based atom coloring technique where the atoms of the compound are “colored” by the chemical substructure surrounding each atom. The atom coloring features for a compound are the counts of the atom colors (i.e., specific chemical subgraphs) present in the compound. This full enumeration of all chemical subgraphs of certain sizes present in each compound in a dataset creates an input neural network layer that is similar to the latent space produced by a graph convolutional neural network. Also, the resulting compound feature vectors can be viewed as feature vectors for chemical substructure tokens. The pathway features are likewise constructed by aggregating the compound features of the compounds associated with the pathway [18]. Multi-layer perceptron [36] layers are then trained using the combined compound-pathway feature vector as input. This approach is more practical than the previously used graph neural network methods, since many of the early (preprocessing) steps performed by graph neural networks have already been performed by atom coloring. Also, the introduction of pathway features, which cannot feasibly be represented as single definite graphs, prevents the direct use of most graph neural network methods.
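The feature construction described above can be sketched as follows. This is only an illustration under stated assumptions: the atom color labels are hypothetical opaque strings (the actual atom coloring method derives them from the chemical subgraph around each atom), summation is assumed as the aggregation used for pathway features, and all function names are ours.

```python
from collections import Counter


def compound_counts(atom_colors):
    """Count how often each atom color occurs in one compound."""
    return Counter(atom_colors)


def pathway_counts(member_compounds):
    """Aggregate compound counts into pathway counts (ASSUMPTION: summation)."""
    total = Counter()
    for counts in member_compounds:
        total.update(counts)
    return total


def to_vector(counts, vocabulary):
    """Project a count mapping onto a fixed, ordered color vocabulary."""
    return [counts.get(color, 0) for color in vocabulary]


# Hypothetical atom colors for two compounds that share one pathway.
compound_a = compound_counts(["C(C,H,H,H)", "C(C,O,H,H)", "O(C,H)"])
compound_b = compound_counts(["C(C,H,H,H)", "C(C,H,H,H)"])
pathway = pathway_counts([compound_a, compound_b])

vocabulary = sorted(set(compound_a) | set(compound_b))
vec_a = to_vector(compound_a, vocabulary)
vec_b = to_vector(compound_b, vocabulary)
vec_pathway = to_vector(pathway, vocabulary)

# Input to the multi-layer perceptron for the pair (compound A, pathway):
mlp_input = vec_a + vec_pathway
print(vocabulary)
print(vec_a, vec_b, vec_pathway)
print(mlp_input)
```

Because compounds and pathways are both expressed as fixed-length count vectors, they can be concatenated directly, which is what makes the cross-join with a single classifier possible.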
With models being able to effectively predict the pathways annotated in these three major knowledgebases, an intuitive hypothesis is that the mean model performance and model robustness can be further improved by training a model on a dataset constructed from compounds and pathways in KEGG, Reactome, and MetaCyc combined. We will refer to this as the KEGG + Reactome + MetaCyc dataset. However, the challenge with combining knowledgebases is that their molfiles [37] have inconsistent chemical structure representations. This impacts both the way that the compounds are represented in compound features as well as the pathway features, which are derived from the compound features. We demonstrate in this work that these chemical structure representation inconsistencies confuse the model. By standardizing with InChI canonicalization [38,39,40], we make the chemical structure representations, and therefore the input features, consistent, further improving the predictive performance of all pathways across all three knowledgebases. This is similar to the standardization methods used by PubChem; however, PubChem has different tautomeric preferences than InChI canonicalization [41]. We also demonstrate that the InChI-based standardization greatly improves the generalizability of the model, enabling better predictions of pathway involvement of novel chemical structure representations.
4. Discussion
We combined the KEGG, Reactome, and MetaCyc knowledgebases together to create a single dataset comprising 13,902 unique compound feature vectors, 8056 unique pathway feature vectors, and 49,919,875 compound-pathway entries (Table S3). With the new combined dataset, the robustness of the resulting models improved to a mean MCC of 0.9036 ± 0.0033, with the standard deviation less than one-third of that reported in all prior published results. These are the best results published so far and are far better than older multi-classifiers or one-vs-rest binary classifiers; the best-performing multi-classifier has a mean MCC of 0.7642, and the best-performing one-vs-rest binary classifier has an average MCC of 0.7677 [19]. Moreover, all models prior to our 7 May 2024 publication [18] predicted only 11 or 12 level 2 KEGG metabolic pathways, compared with 22,265 pathways (8056 with unique representations) predicted by the extreme classification model presented here. The high level of performance presented here is due to four major innovations. One innovation is the cross-join of metabolite and pathway features, which allows the use of a single multitask classification model for this problem. The second innovation is the generation of metabolite atom coloring chemical subgraph features that can be combined to create pathway atom coloring features, which makes the cross-join possible. Also, the enumeration of all chemical subgraphs reproduces a latent space similar to what is generated from a graph convolutional neural network. The third innovation is the integration of the KEGG, MetaCyc, and Reactome knowledgebases using InChI canonicalization into a single large dataset with 49,919,875 entries, the largest dataset created for this purpose so far in the field. Do not forget that “Data is King!” in machine learning. The fourth innovation is the use of a custom data loader that performs the cross-join in GPU RAM, which speeds up model training by roughly 20-fold, making model training and hyperparameter tuning pragmatically possible.
The extreme classification model performance when predicting Reactome pathways and MetaCyc pathways additionally improved, indicating transfer learning across the knowledgebases. More precisely, the multitask classification approach demonstrates transfer learning between classification tasks, where Reactome pathway prediction represents one task and MetaCyc pathway prediction represents another task. However, KEGG pathway prediction performance decreased. The lower performance and robustness of KEGG pathways when trained along with the other two knowledgebases were caused by confusion introduced by inconsistent chemical structure representations between the knowledgebases. Prior to standardizing chemical structure representations, one might conclude that it was advisable to use a model trained on KEGG pathways only when predicting KEGG pathways. However, standardizing the chemical structure representations with InChI canonicalization evidently corrected and/or compensated for this discrepancy when training a model on all three knowledgebases, with KEGG pathway prediction performance improving. Therefore, we recommend training a single model to predict pathways from all three knowledgebases, as long as its training dataset is appropriately standardized. Chemical structure representation standardization further improved Reactome and MetaCyc pathway prediction performance as well. Also, the superior prediction performance for Reactome pathways versus KEGG and MetaCyc pathways implies that Reactome pathway definitions may be of higher quality than those in KEGG and MetaCyc. These results, taken together, indicate that standardizing the chemical structure representation of compounds significantly improves both model performance and robustness by enabling additional transfer learning between knowledgebase pathway classification tasks and/or preventing confusion, depending on one’s perspective.
Moreover, our cross-reference analyses demonstrated high inconsistency in chemical structure representations across knowledgebases with only 1 out of 9193 cross-reference pairs having identical atom coloring feature vectors. After chemical structure representation standardization, consistency across knowledgebases increased dramatically to 7234 out of 9193 ≈ 78.7%. By removing these inconsistencies in chemical structure representation, the drop in MCC for the cross-references decreased from 0.2687 without standardization to 0.0384 with standardization. Also, the standardized cross-reference MCC of 0.9239 represented the highest performance. Thus, the resulting models are more generalizable when predicting on compound entries outside of the training data while also maintaining high prediction performance. Therefore, it is essential to standardize the data prior to predicting metabolic pathway involvement. To our knowledge, investigations into data-engineering techniques to maximize model generalizability across different knowledgebases with different chemical structure representations have not been previously published.
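The consistency measure used in this cross-reference analysis can be expressed as a short sketch. The toy feature vectors and the `consistency` helper are hypothetical; only the counts 7234 and 9193 are taken from the text above.

```python
def consistency(pairs):
    """Fraction of cross-reference pairs whose two feature vectors are identical."""
    identical = sum(1 for left, right in pairs if left == right)
    return identical / len(pairs)


# Hypothetical cross-reference pairs: the same compound as represented in two
# knowledgebases, each side given as an atom color count vector.
toy_pairs = [
    ([1, 0, 2], [1, 0, 2]),  # identical representation
    ([1, 0, 2], [1, 1, 1]),  # e.g., a different tautomer or ionization state
    ([0, 3, 0], [0, 3, 0]),  # identical representation
    ([2, 2, 0], [2, 1, 0]),  # differing representation
]
toy_rate = consistency(toy_pairs)

# Consistency after standardization, from the counts reported in the text.
reported_rate = 7234 / 9193
print(f"toy consistency: {toy_rate:.1%}")
print(f"reported consistency after standardization: {reported_rate:.1%}")
```

A pair counts as consistent only when every feature matches, so a single differing atom color is enough to make two representations of the same compound look like different inputs to the model.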
Also, the method of standardization matters. Using SMILES for standardization was less useful than using InChI (Table S5). Also, the three knowledgebases have their own standardizations. However, different standardizations can have different tautomeric, resonance, and ionization preferences in chemical structure representation, which is illustrated by the poor performance when the three knowledgebases were combined without a separate standardization step. Likewise, PubChem’s standardization has a 60% inconsistency with InChI canonicalization [41]. Again, this all supports the use of a single chemical structure representation standardization method prior to training and predicting metabolic pathway involvement.
While the multitask classification models presented here have significantly higher performance than all prior published results, there are still limitations. While these models generalize to novel chemical structure representations, they do not generalize well to novel pathways. This is evident from the poor performance when building models trained on one knowledgebase and then predicting the pathways of another knowledgebase. When predicting metabolic pathways for novel compounds, we recommend predicting only pathways that the model was trained on. Further research is required to determine ways to generalize to novel pathways.
The addition of multi-layer perceptron (MLP) neural network layers complements the cross-join technique for extreme classification since input features in a graph format cannot be cross-joined with vectorized pathway features. If a graph2vec approach is used to vectorize the graph features prior to cross-joining with the pathway feature vectors, GPU memory limitations still arise for a dataset of this size, which we have directly tested. If the batch size is reduced to process a smaller number of compounds and prevent the graph neural network from outstripping GPU memory, it would be too small for the model to train in a reasonable amount of time. One could batch the compounds alone, perform graph2vec, and then cross-join with the pathway features, but current batching techniques, as provided by deep learning libraries such as PyTorch, are performed on the CPU side with multiprocessing, where additional time is needed for transferring data between processes. Therefore, we needed to create our own batching mechanisms performed entirely on the GPU side and in the same process, in order to practically train a model on a dataset of this size. Our custom batching method (data-loading method) improved GPU utilization by 20-fold, making the current model training, testing, and evaluation practical on a dataset with 49,919,875 entries. Special batching techniques would need to be developed to allow the use of a graph neural network followed by a cross-join of the vectorized compound representations. Such batching techniques are non-trivial to implement for graph data. However, here we demonstrate excellent performance using vector representations with atom color chemical subgraph features that input into MLP layers. Results may be improved if a batching technique with graph representations is implemented and efficiently performed on the GPU side, making the batching practical for a dataset of this size.
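One way to picture batching that performs the cross-join on the fly is to keep only the compound and pathway feature tables in memory and to materialize compound–pathway pairs per batch from flat pair indices. The sketch below is our own simplified, CPU-only illustration in plain Python; it is an assumption about the general idea, not the custom data loader described above, and all names are hypothetical. In a GPU setting, the two tables would reside in device memory and the lookups would be tensor indexing operations.

```python
def pair_index_batches(n_compounds, n_pathways, batch_size):
    """Yield batches of (compound_index, pathway_index) covering the full cross-join.

    A flat pair index i maps to compound i // n_pathways and pathway i % n_pathways,
    so the full pair table never has to be stored.
    """
    total = n_compounds * n_pathways
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        yield [(i // n_pathways, i % n_pathways) for i in range(start, stop)]


def materialize(batch, compound_table, pathway_table):
    """Concatenate compound and pathway features only for the pairs in this batch."""
    return [compound_table[c] + pathway_table[p] for c, p in batch]


# Hypothetical toy tables: 3 compounds and 2 pathways with 2 features each.
compound_table = [[1, 0], [0, 3], [1, 1]]
pathway_table = [[1, 3], [1, 1]]

batches = list(pair_index_batches(len(compound_table), len(pathway_table), batch_size=4))
first_inputs = materialize(batches[0], compound_table, pathway_table)
print(batches)
print(first_inputs)
```

Memory use then scales with the number of compounds plus the number of pathways rather than with their product. A training loop would normally also shuffle the pair indices, which is omitted here for brevity.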