This study provides significant insights into the structural dynamics of BRAF variants and their relationship to drug resistance in colorectal cancer, using molecular dynamics simulations and machine learning techniques. REST2 simulations generated high-resolution structural data, revealing key dihedral angles in the BRAF protein variants that correlate with resistance to dabrafenib and vemurafenib. The application of machine learning, particularly the random forest algorithm, allowed us to identify critical features from the structural data, which were instrumental in distinguishing between drug-resistant and sensitive variants.
3.1. Structure–Function Insights
Figure 4A shows the difference in the average phi dihedral angle between drug-resistant and drug-sensitive variants for each residue in the dabrafenib study, and Figure 4B shows the corresponding difference in the average psi dihedral angle. Large differences are found in residues 450–470 and 575–635, with smaller regional differences around residues 510 and 660. Previous studies have identified that the active site of the BRAF variant V600E for vemurafenib and related drugs consists of residues Ile-463, Gly-464, and Val-471 of the P-loop; Leu-505 and Thr-508 of the αC-helix; Arg-575, Asp-576, Lys-578, Asn-580, and Asn-581 of the catalytic site; Ile-527, Thr-529, Gln-530, Trp-531, Cys-532, Gly-534, Ser-535, Ser-536, and His-539 of the N-lobe; and Phe-583, Gly-593, Glu-600, Lys-601, Arg-603, Trp-604, and Ser-605 of the C-lobe [17]. These are shown in Figure 4C in blue. The residues that participate in the intermolecular interaction between the drug and BRAF, and that are deemed essential for the activity of this class of inhibitors, are Val-471, Ala-481, Lys-483, Leu-505, Leu-514, Thr-529, Trp-531, Cys-532, Asp-594, Phe-595, and Gly-596; they are shown in orange [17]. The PDB structure 4RZV shows residues 463, 471, 472, 481, 482, 483, 505, 514–516, 527, 529–532, 535, 536, 539, 583, and 593–596 as the vemurafenib binding site (dark green). In PDB structure 5CSW, the dabrafenib binding site consists of residues 463–466, 468, 471, 481, 483, 505, 508, 513–516, 527, 529–532, 580, 581, 583, and 592–595 (light blue). They are all part of the CR3 domain, which is the kinase domain. The feature residues 484 and 518 lie adjacent to contact residues in the drug binding site, and residues 450 and 495 are close enough to likely affect drug binding, as shown in Figure 5.
Machine learning using the dihedral angle data as features found that residues 450, 484, 495, 518, and 622 (purple) were sufficient to classify the vemurafenib resistance or sensitivity of variants. Similarly, residues 494, 600, 644, 663, 675, and 677 (light green) were sufficient to characterize the dabrafenib response. Finally, a t-test comparing the average dihedral angle values of the sensitive and resistant variants yielded many statistically significant differences. From these, we chose the residues with more than a 15-degree difference, which corresponds to the standard deviation over residues in wild-type BRAF. For phi, the residues are 455, 466, 467, 469, 486, 489, 510, 576, 597, 602, 608, 631, 661, and 684 (dark blue), and for psi the residues are 450, 454, 463, 464, 468, 469, 485, 509, 575, 586, 594, 597, 598, 604, 606, 607, 608, 610, 613, 628, 657, 660, and 664 (brown).
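The statistical filter described above can be illustrated with a minimal sketch. It is not the analysis code used in the study: it assumes one average angle per variant and residue, assumes the angles do not straddle the ±180° wraparound (so plain means are meaningful), and uses Welch's form of the t statistic. The function names, the `min_abs_t` cutoff (a hypothetical stand-in for a p-value threshold), and all angle values are invented for illustration.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def select_residues(sensitive, resistant, min_diff_deg=15.0, min_abs_t=2.0):
    """Keep residues whose class means differ by more than min_diff_deg and
    whose |t| exceeds min_abs_t (hypothetical stand-in for a p-value cutoff)."""
    selected = []
    for residue in sorted(sensitive):
        diff = mean(sensitive[residue]) - mean(resistant[residue])
        t = welch_t(sensitive[residue], resistant[residue])
        if abs(diff) > min_diff_deg and abs(t) > min_abs_t:
            selected.append(residue)
    return selected


# Invented toy values: per-variant average phi angles (degrees) for three residues.
sensitive_phi = {
    466: [-60.0, -62.0, -58.0, -61.0],
    510: [-75.0, -70.0, -80.0, -74.0],
    600: [-65.0, -40.0, -90.0, -55.0],
}
resistant_phi = {
    466: [-95.0, -92.0, -98.0, -94.0],
    510: [-72.0, -78.0, -69.0, -76.0],
    600: [-80.0, -45.0, -100.0, -60.0],
}

selected = select_residues(sensitive_phi, resistant_phi)
print(selected)
```

In this toy data only the first residue passes both criteria; the other two have class means that differ by less than 15 degrees.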
A comparison between these sets of residues shows that the machine learning and statistical analysis of the molecular simulations identified many residues around the binding sites, suggesting that these residues are important for understanding drug resistance. The residues found by machine learning lie in the CR3 domain except for one in the CR2 domain phosphorylation site for 14-3-3 proteins (residues ~280–457). Interestingly, several residues in the range of 620–685 had not previously been implicated as important for binding. This region is part of the kinase domain (CR3), which is responsible for the catalytic activity of the BRAF protein, namely phosphorylating and activating downstream proteins in the RAS/MAPK signaling pathway [18].
Figure 5 shows an image of the BRAF CR3 domain with the different functional regions identified. Most of the amino acid residues identified by the machine learning approach are in the N-lobe of CR3, which is responsible for ATP binding and the regulation of BRAF kinase activity [19].
The model also correctly identified several sensitive variants, including V600E and L597R. These results confirm that the structural variations observed through REST2 simulations can be used effectively to predict drug response. The robustness of the approach is illustrated by the vemurafenib model, in which decision trees identified the dihedral angles of residues such as M484, W450, L495, and P622 as playing crucial roles in determining variant resistance. For dabrafenib, the decision trees relied on the dihedral angles of residues such as Q494, D663, S675, and D677, which provided strong predictive power for the majority of variants. The residues identified by machine learning as important for determining drug resistance either overlap with or are adjacent to those identified in the scientific literature.
3.2. Machine Learning Prediction of Drug Resistant Variants
For vemurafenib, the machine learning model achieved 100% accuracy, correctly predicting the resistance or sensitivity of all variants, including well-known resistant mutations such as G469A and V600E+L505H. For dabrafenib, the model achieved an accuracy of 91.67%. The model correctly classified the majority of resistant and sensitive variants, but its prediction of the resistance status of the K601E variant deviated from the clinical classification. This suggests that further refinement may be necessary to improve the model’s prediction accuracy for certain mutations.
In this study, we chose to use the clinically relevant classifications of resistant and sensitive. The experimental data on the IC50 of variants is limited and has been shown to be inconsistent with clinical findings. Table 6 shows the variants with measured IC50 values for dabrafenib. There is a clear separation into two classes, <5 nM and >1 µM, which justifies the use of two classes. However, the IC50 values do not always agree with the clinical response: for example, the variants L597V (clinically sensitive to dabrafenib) and G469A (clinically resistant to dabrafenib) both showed an IC50 of >1 µM in cell lines [16]. We therefore chose to rely on the clinical classification.
Other methods have been developed to predict the pathogenicity of variants. These include online servers such as AlphaMissense and PredictSNP, whose results are shown in Table 7. These servers predict whether variants are likely pathogenic or deleterious but are not designed to address drug resistance or sensitivity specifically. Other methods have been developed to predict the functional consequences of variants. In one such study, an XGBoost-based model used position, frequency, consequence for the canonical BRCA2 transcript, and deleteriousness prediction scores from several tools as features to obtain high-accuracy predictions for BRCA variants [20]. The software 3Cnet uses recurrent neural networks to analyze the amino acid context of human variants to predict pathogenicity [21]. Another study used gene-specific rather than disease-specific machine learning to predict the pathogenicity of BRCA variants [22]. Deep learning has also been applied to predict pathogenicity in the Missense Variant Pathogenicity prediction method for variants in hereditary cancer [23]. However, none of these methods specifically addresses the drug resistance or sensitivity of variants.
In this study, a small set of selected features (dihedral angles) yielded better results than using all the dihedral angles. In machine learning, feature selection can improve model results. In a model with a large feature set, some of the features might not correlate well with the different classes. For example, a feature might change little across classes, or its large variance might act as “noise”. By eliminating such features and keeping only those with relevant information content, the machine learning model can perform better with the selected features than with the full feature set.
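The effect of discarding uninformative features can be sketched with a simple filter. This is an illustration only: it ranks features by a Fisher-style score (squared difference of class means divided by the summed within-class variances), which is a swapped-in technique and not the random forest procedure used in the study. The function names, feature names, and values are all hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Between-class separation divided by within-class spread for one feature."""
    group0 = [v for v, y in zip(values, labels) if y == 0]
    group1 = [v for v, y in zip(values, labels) if y == 1]
    between = (mean(group0) - mean(group1)) ** 2
    within = pvariance(group0) + pvariance(group1)
    return between / within if within > 0 else float("inf")


def top_k_features(features, labels, k):
    """Return the names of the k features with the highest Fisher score."""
    scores = {name: fisher_score(values, labels) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented toy data: 0 = sensitive, 1 = resistant; one list of angles per feature.
labels = [0, 0, 0, 1, 1, 1]
features = {
    # Separates the two classes cleanly.
    "informative": [-60.0, -62.0, -58.0, -95.0, -92.0, -98.0],
    # Changes little across classes.
    "flat": [-70.0, -70.5, -69.5, -70.2, -69.8, -70.1],
    # Large variance that is unrelated to the class label.
    "noisy": [-20.0, -150.0, 40.0, -100.0, 10.0, -60.0],
}

kept = top_k_features(features, labels, k=1)
print(kept)
```

The "flat" and "noisy" features correspond to the two cases described above: one barely changes between classes, and the other varies widely without tracking the class label, so both score far below the informative feature.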
Previous studies from our research group and others have applied machine learning to features derived from molecular simulations. In one study, we used macroscopic measures, such as the twist and length of molecular features in calmodulin, together with the dihedral angles to differentiate between the diseases caused by different variants of the same protein [24]. We have also made predictions about the venetoclax resistance of BCL-2 variants [25]. Tam and co-workers applied deep learning to Ramachandran plots of the dihedral angles to functionally classify genetic variants of proteins [26].
In conclusion, the combination of molecular dynamics and machine learning is a powerful tool to understand how changes in protein structure caused by genetic variants can lead to changes in function. This method can be leveraged to develop a classification model for predicting drug resistance/sensitivity for previously unclassified variants. These predictions could be used to determine which variants should be the subject of further experimental and clinical studies.