Figure 1.
The framework of DLERKm. (A) The E. coli and H. sapiens data used in this study were obtained from the SABIO-RK database, in which each sample includes an enzymatic reaction ID, a UniProtKB accession (AC) number, and a Km value. After processing, we obtained 2962 enzyme reaction samples from E. coli and 7160 from H. sapiens, each containing substrates, products, an enzyme sequence, and a Km value. (B) The pre-trained model ESM-2 is used to extract features from enzyme sequences. (C) The pre-trained model RXNFP is used to extract features from the enzymatic reaction SMILES strings. (D) Molecular set features are extracted using molecular fingerprints and logical operations. (E) The extracted features are refined using a channel attention mechanism and fed into a feedforward neural network to predict Km values.
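Panel (D) combines molecular fingerprints with logical operations. The sketch below illustrates one way such set-level features could be formed. It is not the authors' implementation: the fingerprints are toy 8-bit integers rather than real molecular fingerprints, and the names `union_fingerprint`, `to_bits`, `set_features`, and the choice of AND / AND-NOT operations are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

N_BITS = 8  # toy length (assumption); real fingerprints are much longer


def union_fingerprint(fps: Iterable[int]) -> int:
    """Bitwise OR over the fingerprints of every molecule in a set."""
    combined = 0
    for fp in fps:
        combined |= fp
    return combined


def to_bits(fp: int, n_bits: int = N_BITS) -> List[int]:
    """Expand an integer fingerprint into a 0/1 list, lowest bit first."""
    return [(fp >> i) & 1 for i in range(n_bits)]


def set_features(substrate_fps: Iterable[int],
                 product_fps: Iterable[int]) -> Dict[str, List[int]]:
    """Compare the substrate set and the product set with logical operations."""
    s = union_fingerprint(substrate_fps)
    p = union_fingerprint(product_fps)
    return {
        "shared": to_bits(s & p),           # bits present on both sides
        "substrate_only": to_bits(s & ~p),  # bits lost in the reaction
        "product_only": to_bits(p & ~s),    # bits gained in the reaction
    }


# Two hypothetical substrates and two hypothetical products.
features = set_features([0b0011, 0b0110], [0b0100, 0b1000])
print(features)
```

With a real cheminformatics toolkit the integers would be replaced by fingerprint bit vectors, but the logical operations would work the same way.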
Figure 2.
Data processing steps for E. coli and H. sapiens.
Figure 3.
(A) The structure of the BERT-based pre-trained language model, which consists of multi-head attention layers, normalization layers, and feed-forward layers. (B) The structure of the multi-head attention layer, in which the input matrix is linearly transformed into query, key, and value matrices, and attention scores are then computed using dot-product attention to update the feature matrix. (C) The feed-forward layer, which is composed of linear layers and activation functions. (D) The channel attention mechanism, which calculates attention scores along the channel dimension of the input tensor and refines the features using broadcasting and residual connections.
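The channel attention step in panel (D) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the DLERKm layer itself: each channel is summarized by mean pooling, the score comes from a single hypothetical scalar weight and bias per channel passed through a sigmoid, and the function name `channel_attention` is invented here.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features: List[List[float]],
                      weights: List[float],
                      biases: List[float]) -> List[List[float]]:
    """Refine features[c][i] (channel c, position i) with per-channel scores."""
    refined = []
    for channel, w, b in zip(features, weights, biases):
        pooled = sum(channel) / len(channel)  # one summary value per channel
        score = sigmoid(w * pooled + b)       # attention score in (0, 1)
        # Broadcast the scalar score over the channel, then add the
        # original values back as a residual connection.
        refined.append([x + score * x for x in channel])
    return refined


# With zero weights and biases every score is 0.5, so each value is scaled by 1.5.
print(channel_attention([[1.0, 2.0], [4.0, 0.0]], [0.0, 0.0], [0.0, 0.0]))
```

In a trained model the scores would typically come from learned layers that mix information across channels. The per-channel scalar used here only keeps the example short.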
Figure 4.
Prediction results and performance of DLERKm and its two variants. (A) Scatter plot of prediction results for DLERKm. (B) Scatter plot of prediction results for DLERKm-v1. (C) Scatter plot of prediction results for DLERKm-v2. (D) Violin plot of prediction results for DLERKm and its two variants.
Figure 5.
(A) Visualization of channel attention scores for input features for all the test samples. (B) Visualization of mean channel attention scores of various types of input features.
Figure 6.
Km value prediction results of DLERKm and UniKP. (A,B) Scatter plots and prediction value distributions of DLERKm and UniKP. (C–F) Distributions of performance metric values for DLERKm and UniKP on different test subsets. (G–J) Comparison of DLERKm and UniKP prediction metric values across different ranges of actual values.
Figure 7.
Evaluation metric values of DLERKm and UniKP on different types of enzymatic reactions, shown separately for wild-type enzymes and for mutant enzymes.
Figure 8.
Prediction metric values (PCC, R2) of DLERKm for different enzyme sequence similarity ranges.
Figure 9.
Visualization of DLERKm prediction results for different types of enzymes: (A) wild-type enzymes; (B) mutant enzymes; (C) E. coli enzymes; (D) H. sapiens enzymes.
Figure 10.
Visualization of DLERKm prediction results for different types of enzymatic reactions. The bar chart on the left shows the four prediction metrics (RMSE, MAE, R2, PCC) of DLERKm on the different datasets. For each metric, the bars from left to right give the DLERKm metric values for wild-type enzymatic reactions, mutant enzymatic reactions, E. coli enzymatic reactions, and H. sapiens enzymatic reactions. The violin plot on the right shows the distribution of DLERKm predicted values across the four datasets. In each violin, the dashed lines mark the interquartile range of the predicted values, and the top and bottom of the violin mark the maximum and minimum predicted values, respectively.
Figure 11.
Visualization of DLERKm prediction results for different types of enzymes: (A) wild-type enzymes of H. sapiens; (B) mutant enzymes of H. sapiens; (C) wild-type enzymes of E. coli; (D) mutant enzymes of E. coli.
Figure 12.
Visualization of DLERKm prediction results for wild-type and mutant enzymatic reactions of E. coli and H. sapiens. For each metric, the bars in sequence give the DLERKm values for E. coli (wild-type and mutant) and H. sapiens (wild-type and mutant). The violin plot on the right shows the distribution of DLERKm predicted values across the four datasets. In each violin, the dashed lines denote the interquartile range, and the top and bottom of the violin correspond to the maximum and minimum predicted values, respectively.
Table 2.
Number of data samples.
| Organism | Downloaded Samples | Processed Samples | Training Samples | Test Samples |
|---|---|---|---|---|
| H. sapiens | 10,749 | 7160 | 5728 | 1432 |
| E. coli | 5573 | 2962 | 2369 | 593 |
| Total | 16,322 | 10,122 | 8097 | 2025 |
Table 3.
Metric values of prediction results for DLERKm, DLERKm-v1, and DLERKm-v2.
| Model | RMSE | MAE | R2 | PCC |
|---|---|---|---|---|
| DLERKm-v1 | 0.8067 | 0.5969 | 0.5520 | 0.7430 |
| DLERKm-v2 | 0.7728 | 0.5438 | 0.5892 | 0.7676 |
| DLERKm | 0.7711 | 0.5454 | 0.5938 | 0.7706 |
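For readers who want to recompute the four metrics reported in these tables, the following is a minimal sketch using their standard definitions. The function name `regression_metrics` is hypothetical, and the scale of the values (for example, whether Km is log-transformed before scoring) is left to the caller.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float],
                       y_pred: Sequence[float]) -> Dict[str, float]:
    """Return RMSE, MAE, R2 (coefficient of determination), and PCC."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
        "PCC": cov / math.sqrt(ss_tot * var_p),
    }


# Toy example: every prediction is off by exactly one unit.
print(regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))
```

In the toy example the predictions are perfectly correlated with the targets but carry a constant offset, so PCC stays at 1 while R2 is penalized. This is why the two columns are reported separately.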
Table 4.
DLERKm and UniKP prediction metric values for RMSE, MAE, R2, and PCC.
| Model | RMSE | MAE | R2 | PCC |
|---|---|---|---|---|
| UniKP | 0.9218 | 0.6529 | 0.4294 | 0.6553 |
| DLERKm | 0.7711 | 0.5454 | 0.5838 | 0.7706 |
Table 5.
DLERKm and UniKP evaluation metric values on wild-type and mutant enzymes.
| Dataset | RMSE | MAE | R2 | PCC |
|---|---|---|---|---|
| Wild-type (UniKP) | 0.9015 | 0.6454 | 0.4355 | 0.6599 |
| Wild-type (DLERKm) | 0.7569 | 0.5400 | 0.5997 | 0.7744 |
| Mutant (UniKP) | 0.9852 | 0.6772 | 0.4030 | 0.6348 |
| Mutant (DLERKm) | 0.8159 | 0.5633 | 0.5691 | 0.7544 |
Table 6.
The distribution of test set samples after removing the top-K most similar enzyme sequences from the training set, where top-K is determined by selecting the K most similar training samples for each test sequence. The column names in the table (e.g., “0.40–0.80”) represent the similarity range between test samples and training samples, while the values in the table indicate the number of test samples falling within the corresponding similarity range.
| Top-K | 0.40–0.80 | 0.80–0.95 | 0.95–1.00 |
|---|---|---|---|
| Top-5 | 98 | 100 | 1627 |
| Top-10 | 214 | 431 | 1380 |
| Top-15 | 317 | 570 | 1138 |
| Top-20 | 411 | 682 | 932 |
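The top-K filtering described in this caption can be sketched as follows. This is a sketch under assumptions, not the authors' pipeline: `difflib.SequenceMatcher` stands in for whatever sequence similarity measure was actually used, and the names `similarity` and `remove_top_k_similar` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Set


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real study would use an alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def remove_top_k_similar(train: List[str], test: List[str], k: int) -> List[str]:
    """Drop, for every test sequence, its k most similar training sequences."""
    to_remove: Set[int] = set()
    for query in test:
        ranked = sorted(range(len(train)),
                        key=lambda i: similarity(query, train[i]),
                        reverse=True)
        to_remove.update(ranked[:k])  # indices of the k nearest training samples
    return [seq for i, seq in enumerate(train) if i not in to_remove]


# Toy sequences (not real enzymes).
train_seqs = ["MKTAYIAK", "MKTAYIAR", "GGGGSSSS", "PLPLPLPL"]
test_seqs = ["MKTAYIAK"]
print(remove_top_k_similar(train_seqs, test_seqs, k=2))
```

Raising `k` removes more near-duplicates of the test sequences from the training set, which is the mechanism behind the shift of test samples toward lower similarity ranges from Top-5 to Top-20.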
Table 7.
Evaluation metric values (RMSE, MAE, R2, PCC) of DLERKm on the test set after training on the different training subsets.
| Top-K | RMSE | MAE | R2 | PCC |
|---|---|---|---|---|
| Top-5 | 0.8994 | 0.6511 | 0.4374 | 0.6734 |
| Top-10 | 0.9517 | 0.7111 | 0.3701 | 0.6176 |
| Top-15 | 1.0275 | 0.7649 | 0.2657 | 0.5640 |
| Top-20 | 1.0595 | 0.8009 | 0.2194 | 0.5229 |
Table 8.
DLERKm evaluation metric values (RMSE, MAE, R2, PCC) on enzymes (wild-type and mutant) and species (H. sapiens and E. coli).
| Dataset | RMSE | MAE | R2 | PCC | Sample Number |
|---|---|---|---|---|---|
| Wild-type | 0.7569 | 0.5400 | 0.5997 | 0.7744 | 1550 |
| Mutant | 0.8159 | 0.5633 | 0.5691 | 0.7544 | 475 |
| E. coli | 0.8280 | 0.5984 | 0.5296 | 0.7277 | 593 |
| H. sapiens | 0.7463 | 0.5235 | 0.6116 | 0.7820 | 1432 |
Table 9.
DLERKm evaluation metric values (RMSE, MAE, R2, PCC) for species (H. sapiens and E. coli) of different enzymatic reactions (wild-type and mutant).
| Dataset | RMSE | MAE | R2 | PCC | Sample Number |
|---|---|---|---|---|---|
| H. sapiens (wild-type) | 0.7323 | 0.5176 | 0.6291 | 0.7932 | 1164 |
| H. sapiens (mutant) | 0.8041 | 0.5494 | 0.5349 | 0.7314 | 268 |
| E. coli (wild-type) | 0.8264 | 0.6075 | 0.4884 | 0.6989 | 386 |
| E. coli (mutant) | 0.8310 | 0.5813 | 0.5853 | 0.7650 | 207 |