Next Article in Journal
From Sea to Therapy: Marine Biomaterials for Drug Delivery and Wound Healing
Previous Article in Journal
Correction: Holmes, T.R.; Paller, A.S. Gene Regulation Using Spherical Nucleic Acids to Treat Skin Disorders. Pharmaceuticals 2020, 13, 360
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Predicting EGFRL858R/T790M/C797S Inhibitory Effect of Osimertinib Derivatives by Mixed Kernel SVM Enhanced with CLPSO

1
College of Computer Science and Technology, Qingdao University, Qingdao 266071, China
2
School of Economics, Qingdao University, Qingdao 266071, China
*
Author to whom correspondence should be addressed.
Pharmaceuticals 2025, 18(8), 1092; https://doi.org/10.3390/ph18081092
Submission received: 20 June 2025 / Revised: 21 July 2025 / Accepted: 22 July 2025 / Published: 23 July 2025
(This article belongs to the Special Issue QSAR and Chemoinformatics in Drug Design and Discovery)

Abstract

Background/Objectives: The resistance mutations EGFRL858R/T790M/C797S in epidermal growth factor receptor (EGFR) are key factors in the reduced efficacy of Osimertinib. Predicting the inhibitory effects of Osimertinib derivatives against these mutations is crucial for the development of more effective inhibitors. This study aims to predict the inhibitory effects of Osimertinib derivatives against EGFRL858R/T790M/C797S mutations. Methods: Six models were established using heuristic method (HM), random forest (RF), gene expression programming (GEP), gradient boosting decision tree (GBDT), polynomial kernel function support vector machine (SVM), and mixed kernel function SVM (MIX-SVM). The descriptors for these models were selected by the heuristic method or XGBoost. Comprehensive learning particle swarm optimizer was adopted to optimize hyperparameters. Additionally, the internal and external validation were performed by leave-one-out cross-validation ( Q L O O 2 ), 5-fold cross validation ( Q 5 f o l d 2 ) and concordance correlation coefficient (CCC), Q F 1 2 , and Q F 2 2 . The properties of novel EGFR inhibitors were explored through molecular docking analysis. Results: The model established by MIX-SVM whose kernel function is a convex combination of three regular kernel functions is best: R 2 and RMSE for training set and test set are 0.9445, 0.1659 and 0.9490, 0.1814, respectively; Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.9107, 0.8621, 0.9835, 0.9689, and 0.9680. Based on these results, the IC50 values of 162 newly designed compounds were predicted using the HM model, and the top four candidates with the most favorable physicochemical properties were subsequently validated through PEA. Conclusions: The MIX-SVM method will provide useful guidance for the design and screening of novel EGFRL858R/T790M/C797S inhibitors.

Graphical Abstract

1. Introduction

Lung cancer has long been the leading cause of cancer-related deaths worldwide, with an estimated 1.8 million deaths each year [1]. Moreover, its morbidity is so high that about 2.3 million new cases are diagnosed each year. Non-small cell lung cancer (NSCLC) accounts for almost 85% of these diagnoses. Unfortunately, since mild early symptoms tend to result in delayed diagnosis, lung cancer is often diagnosed at advanced stages when there are few therapeutic alternatives [2].
Major carcinogenic factors of NSCLC are epidermal growth factor receptor (EGFR) activating mutations, which can be effectively treated with EGFR tyrosine kinase inhibitors (TKIs) [3]. First-generation TKIs target mutated EGFR with point mutations in exon 21 (L858R) and exon 19 deletion, while second-generation TKIs work by covalently interacting with the cysteine residues (Cys797) in EGFR [4,5,6,7]. However, the T790M gatekeeper mutation, which causes acquired resistance to first-generation TKIs, is immune to second-generation inhibitors [8,9]. Osimertinib, a third-generation EGFR TKI designed to overcome the T790M mutation, has been approved by the FDA and EU due to its excellent performance in clinical trials [10]. Nevertheless, a tertiary mutation in EGFR C797S caused inevitable Osimertinib resistance, which significantly reduced the therapeutic effect during treatment of NSCLC and resulted in the requirement of developing novel inhibitors that target EGFRC797S [11].
EGFRL858R/T790M/C797S represents a triple mutation combination that is a major cause of acquired resistance to Osimertinib, the current third-generation EGFR tyrosine kinase inhibitor (TKI) used to treat non-small cell lung cancer (NSCLC) patients harboring the T790M mutation. While Osimertinib was specifically developed to overcome the T790M resistance mutation, the subsequent emergence of the C797S mutation severely limits its efficacy and poses a considerable challenge in clinical oncology.
Given the lack of effective therapies against this tertiary mutation, there is an urgent need to identify novel inhibitors that are effective against EGFRL858R/T790M/C797S. As such, we focused our modeling efforts on this triple-mutant EGFR variant to support the rational design and virtual screening of fourth-generation inhibitors that can address this critical resistance mechanism.
Previous research discovered a set of novel inhibitors, and their half maximal inhibitory concentration (IC50) values were measured to evaluate their inhibitory activity against EGFRL858R/T790M/C797S [12,13]. Since traditional experimental assessment of inhibitory activity is costly and inefficient, a novel method is urgently needed to accelerate the development of new drugs. Quantitative structure-activity relationship (QSAR) is a computational molecular modeling methodology that utilizes mathematical and statistical methods to explore the relationship between chemical structures and biological activities or properties of compounds [14]. On the basis of carefully selected descriptors, several economical and time-efficient QSAR prediction models could be established to assist in drug design.
In this study, the linear QSAR model was established based on descriptors selected by the heuristic method (HM) while the nonlinear QSAR models were established based on descriptors selected by extreme gradient boosting (XGBoost). Since nonlinear models constructed by random forest (RF), gene expression programming (GEP), gradient boosting decision tree (GBDT), polynomial kernel function support vector machine (Poly-SVM), and mixed kernel function support vector machine (MIX-SVM) have numerous complex hyperparameters to set, the comprehensive learning particle swarm optimizer (CLPSO) was applied for hyperparameter optimization, and the dataset was randomly split multiple times to ensure the stability and consistency of the selected parameters. The prediction and validation results indicate that the MIX-SVM model, whose kernel is a convex combination of trigonometric kernel function, polynomial kernel function, and linear kernel function, is robust and powerful in prediction. Furthermore, the applicability domain of the MIX-SVM model was defined to ensure the reliability of predictions. Unlike previous studies that have applied MIX-SVM- or PSO-based models to other biological targets, this study focuses for the first time on the challenging EGFRL858R/T790M/C797S mutations, using MIX-SVM enhanced by CLPSO to optimize predictive performance [3,4,6,7]. The design and property prediction of the new EGFR inhibitors were carried out, as well as the molecular docking of the best-performing compounds. In general, this study can provide useful guidance for the development of novel EGFR inhibitors.

2. Results

2.1. Results of HM

607 descriptors were calculated by CODESSA. By gradually increasing the number of descriptors from one to eight, a set of linear models were established by HM. The recorded R 2 and R c v 2 demonstrate the influence of the number of descriptors on the performance of linear models. As illustrated in Figure 1a, both R 2 and R c v 2 no longer increase sharply with the addition of descriptors once the number of descriptors exceeds three. Consequently, three descriptors were selected and their physical–chemical meanings are shown in Table 1.
For the training set, R 2 = 0.8094, RMSE = 0.5497; for the test set, R 2 = 0.7892, RMSE = 0.6215. Besides, Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.7848, 0.6973, 0.8693, 0.7930, and 0.7905, respectively. The mathematical expression of the HM model is given in Equation (1):
lg I C 50 = 7.3731 40.041 × R S + 0.48875 × N R + 0.80855 × M E

2.2. Descriptors Selected by XGBoost

After removing all non-generic descriptors, XGBoost calculated the feature importance values of the remaining descriptors and ranked them in a non-incremental order. Figure 1b presents the top five descriptors and their corresponding feature importance. The obvious gap between the fourth and fifth descriptors in feature importance values suggests that the fifth descriptor is less relevant to the activities of the compounds. Moreover, the addition of the fifth descriptor may not significantly enhance the model but increases the risk of overfitting, which reduces the prediction performance of the model on unseen datasets [15]. Finally, four descriptors were chosen as the basis for building nonlinear models; their physical-– meanings are shown in Table 2.
To further validate the selected descriptors, SHAP (SHapley Additive exPlanations) analysis was applied to the XGBoost model. SHAP values offer consistent and locally accurate estimates of feature importance by quantifying each feature’s contribution to individual predictions.
Figure 2 shows that the four descriptors chosen by XGBoost have the highest SHAP values, confirming their relevance and alignment with the feature importance ranking.
This interpretability approach supports the robustness of the feature selection process and highlights the meaningful contribution of each descriptor to model performance.
Although the descriptors selected by HM and XGBoost do not overlap, the correlation coefficients between two sets of descriptors shown in Figure 1c indicate that RS is strongly correlated with DDC and FFH. Additionally, the correlation coefficient between any two descriptors selected by XGBoost is less than 0.8, confirming the validity of the descriptors selected by XGBoost.

2.3. Results of RF

Based on descriptors nonlinearly selected by XGBoost, the RF model was constructed, and its four key parameters were optimized by CLPSO. The RF model produced relatively good results when the parameters including the number of trees, the maximum depth of the tree, the minimum samples per split, and the minimum samples per leaf were set at 575, 6, 3, and 3, respectively. In this case, R 2 for the training set and test set are 0.8940 and 0.8779, and their corresponding RMSEs are 0.2188 and 0.3163. Additionally, Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.8428, 0.7163, 0.9315, 0.8778, and 0.8777, respectively.

2.4. Results of GEP

The descriptors of selected by XGBoost were imported into APS to build the GEP nonlinear model. R 2 for the training set and test set are 0.7631 and 0.7036, and their corresponding RMSEs are 0.3812 and 0.4546. Additionally, Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.5023, 0.4107, 0.8261, 0.7698, and 0.7701, respectively. The nonlinear model established by GEP is expressed as Equation (2):
lg I C 50 = sin a b s exp sin exp m o d cos f l o o r d 1 , d 2 + cos m o d a b s d 1 , cos p o w d 2 , d 3 d 1 × p o w d 2 , d 3 × s q r t d 0 + c o s ( a b s ( c o s ( d 2 × s i n ( c e i l ( d 3 ) ) ) ) ) ) × f l o o r ( d 1 ) + d 1 × e x p ( d 1 ) × p o w ( d 0 , d 1 ) × s q r t ( d 2 ) + s i n ( s i n ( s i n ( d 1 ) ) ) d 2 d 3 × d 1

2.5. Results of GBDT

The four descriptors selected by XGBoost were used to establish the GBDT nonlinear model. After CLPSO optimization, when the m = 11, the θ m = 0.08, and the maximum depth of the tree is 8, the best effect of the model is obtained. R 2 for the training set and test set are 0.7780 and 0.6478, and their corresponding RMSEs are 0.3319 and 0.4769. Additionally, Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.7091, 0.6384, 0.9126, 0.9314, and 0.9278, respectively.

2.6. Results of Poly-SVM

The establishment of the Poly-SVM model involves the setting of many parameters that determine its performance. Specifically, C relates to the regularization degree; if C is too large, the model tends to overfit, while if it is too small, the model tends to underfit. ε is employed to fit the training data by adjusting the ε-insensitive zone’s width. d determines the complexity of the decision boundary. Higher values of d enable the SVM to capture more intricate relationships in the data, but may also increase the risk of overfitting. γ affects the shape of the decision boundary. The model may overfit if γ is set too high, and it may underfit if γ is set too low. r balances the influence of the constant term in the polynomial kernel function.
After CLPSO optimization, the values of C , ε, d , γ , and r are 4003.28, 0.36, 3, 0.57, and 4.47, respectively. R 2 = 0.8242, RMSE = 0.3026 for the training set and R 2 = 0.8956, RMSE = 0.2240 for the test set. In addition, Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.8417, 0.8092, 0.9555, 0.9085, and 0.9049, respectively.

2.7. Results of MIX-SVM

As shown in Equation (11), aside from the parameters contained in the polynomial and trigonometric kernel functions, the mixed kernel function has three weight coefficients that balance learning and generalization abilities by convex combination. An efficient parameter optimization tool is necessary since the interdependence of multiple parameters increases the difficulty and complexity of parameter adjustment.
After optimizing parameters by CLPSO, the optimal C = 1.06, ε = 0.01, σ = 800.13, γ = 98.69, r = 0.01, α = 0.83, β = 0.12. At this point, R 2 for the training set and test set are 0.9445 and 0.9490, and their RMSEs are 0.1659 and 0.1814. Additionally, Q L O O 2 , Q 5 f o l d 2 , CCC, Q F 1 2 , and Q F 2 2 are 0.9107, 0.8621, 0.9835, 0.9689, and 0.9680, respectively.

2.8. Design of New EGFR Inhibitors

The determinants of EGFR inhibitor IC50 values were examined through the analysis of molecular descriptors employed in the HM model. As shown in Table 1, the raw (non-standardized) regression coefficients represent the rate of change in IC50 corresponding to unit changes in each descriptor. These coefficients illustrate the individual contribution of each variable to the predicted outcome. Based on their relative influence, the five descriptors were prioritized in the following order:
RS > NR > ME.
  • “RS” denotes the relative amount of S in the compound. Increasing its value significantly decreases the value of IC50 [16].
  • “NR” indicates the number of rings in the compound, and its coefficient indicates that decreasing its value slightly lowers the value of IC50 [17].
  • “ME” represents the maximum exchange energy of the C-N bond in the compound. Lowering its value slightly reduces the value of the IC50 [16].
In conclusion, HM model and analysis of molecular descriptors have identified several key factors that influence compound activity. Compound 45 is the most potent compound reported in the literature, as it exhibits the lowest IC50 value. Its structural composition can be modified based on these factors. To guide the design of novel and effective EGFR inhibitors, increasing the sulfur content may contribute to improved inhibitory potency. Figure 3 illustrates the primary positions where the molecular structure has been adjusted.
To reduce the polar interactions between atoms and enhance the distribution of different charges, functional groups such as halogens, mercaptans, thioethers, and sulfones are introduced at positions R1 to R3 and combined in a random manner. Using the descriptor analysis from the HM model, a total of 162 molecules were designed.
The IC50 values of the compounds were predicted using the HM model. If the predicted IC50 of a compound is lower than that of compound 45, it is selected for further evaluation and docking through the Property Explorer Applet (PEA) (https://www.organic-chemistry.org accessed on 10 July 2025). Ultimately, the IC50 values for four compounds were found to be lower than that of compound 71. Table 3 presents the predicted IC50 values and total docking scores for the newly designed EGFR inhibitors.

2.9. Property Prediction and Molecular Docking of New EGFR Inhibitors

PEA was used to assess the properties of the new compounds. The applet provides real-time predictions of physicochemical characteristics and evaluates potential toxicity risks for user-defined chemical structures. It analyzes various compound properties, including partition coefficient, water solubility, topological polar surface area (TPSA), drug-likeness, and others.
The partition factor P represents the ratio of solute concentrations between two solvents, and LogP is the logarithm of this ratio. LogP indicates the ratio of a compound’s solubility in n-octanol to water and is a key measure of hydrophilicity. Compounds with LogP values between –2.0 and 5.0 tend to exhibit favorable absorption properties [18].
Water solubility is vital for determining the intestinal uptake and cellular distribution of compounds. Increased solubility in water typically results in better absorption of the designed molecules [19].
TPSA is the total surface area of polar atoms in a compound [20]. A higher TPSA reduces the chances of molecules crossing the membrane. Drug-likeness evaluates bioavailability potential, while the drug score combines these factors into one value, serving as a key measure of a compound’s drug candidate potential. Table 4 shows the calculated properties of the newly designed compounds from PEA and their predicted lg(IC50) values from the HM model.
In the molecular docking analysis, the newly designed compounds were assessed as potential ligands for their interactions with the target protein (PDB ID: 7zyn). Among them, compound 45c showed the strongest binding affinity, with a total docking score of 5.0483—significantly higher than that of compound 45. This suggests a more stable and stronger interaction between compound 45c and the protein’s binding site. As shown in Figure 4, compound 45c forms two key hydrogen bonds with specific residues.
The predicted docking conformation of compound 45c revealed several key hydrogen bonds between the ligand and active site residues of the target protein. Notably, oxygen atoms in the core structure of 45c form hydrogen bonds with GLU-931 and ASP-916, contributing to a more stable and favorable binding pose. This strong molecular interaction suggests that compound 45c holds promise as a potential lead inhibitor of this protease.

2.10. Applicability Domain Analysis

The Williams plot was used to visualize the AD of the MIX-SVM model in this study. The vertical line represents the warning leverage h * while the horizontal lines represent the standardized residual threshold of ± 3 σ . When applying a QSAR model, only predicted data for chemicals within the chemical domain of the training set should be considered. In the case where the compound is outside the AD, the prediction result must be used with great care since it can be extrapolated from the model [21]. However, there is an exception. If a compound from the training set has a higher leverage value than h * and a standardized residual within ± 3 σ , it is conducive to reinforcing the model and will be considered a structurally influential chemical [22]. Figure 1d illustrates that all compounds in the test set fell within the AD and four compounds in the training set were identified as influential chemicals. This demonstrated the reliability of the MIX-SVM model’s predictions.

3. Discussion

Intuitive comparison of the prediction results is shown in Figure 5 and Figure 6. Due to the limitations of linear methods, the HM model did not produce satisfactory results. Overfitting has occurred with the GEP and GBDT models, which have average fitting effects. In contrast, the RF model not only avoided overfitting but also achieved high prediction scores, demonstrating its excellence in the domain of machine learning applications. In terms of models built by SVM, the Poly-SVM model failed to achieve high performance on the training set due to limited learning capability, and its leave-one-out cross-validation result was also affected. Benefitting from the complementary effect of the convex combination, the mixed kernel function reinforces its learning and generalization abilities by combining the advantages of the trigonometric, polynomial, and linear kernel functions. The results demonstrate that the MIX-SVM model fits the data best and exhibits excellent robustness. In addition, to evaluate the reproducibility of the model parameters, the dataset was randomly split multiple times, and similar performance was observed in the run. This consistency confirms the robustness of the hyperparameters chosen by CLPSO to changes in data partitioning. And the y-randomization test results shown in Table 5 confirm the absence of any chance correlation. In summary, a series of nonlinear models established on the basis of descriptors selected by XGBoost perform better. This further validates the effectiveness of XGBoost as a nonlinear feature selection technique.

4. Experimental Section

4.1. Data Preparation

All the compounds were from the literature [12,13] and their EGFRL858R/T790M/C797S IC50 values were measured under the same experimental environment. All target compounds synthesized have a purity of over 95%. The purity and mass spectra of the compounds were analyzed by HPLC-MS (SHIMADZU LCMS-2020), and NMR spectra were run on Bruker NMR Inova 300, 400, or 600 spectrometers to ensure their purity and structural correctness. Their structures and corresponding IC50 values are listed in Table 6. The dataset was divided at random in a 4:1 ratio: the training set which comprised 73 compounds was used to construct models, while the remaining 19 compounds constituted the test set for evaluation.

4.2. Descriptor Calculation

Calculation of molecular descriptors, which is the primary task in the process of building QSAR models, was executed by the following steps: First, ChemDraw8.0 (https://revvitysignals.com accessed on 10 May 2025) acted as a tool for drawing the structures of all compounds. Second, pre-optimization in the MM+ molecular mechanical force field and more accurate optimization by the semi-empirical method were performed by HyperChem4.0 (http://hypercubeusa.com accessed on 11 May 2025) [23] according to the principle of minimum energy [24]. Then, software MOPAC6.9 (Stewart Computational Chemistry MOPAC Home Page accessed on 11 May 2025) [25] converted the results into MNO files, which were the input for CODESSA2.64 (https://compudrug.com accessed on 12 May 2025). Finally, five classes of molecular descriptors, including constitutional, electrostatic, geometrical, topological, and quantum chemical descriptors, were obtained from CODESSA [26].

4.3. Linear Model by HM

Descriptor pre-selection is a prerequisite for building accurate and efficient models. When two or more descriptors in a regression model are highly correlated, multicollinearity occurs, making it difficult to isolate the individual effect of each descriptor on the dependent variable. This can lead to unstable regression coefficients, reduced model interpretability, and overfitting. During this procedure, descriptors with constant values, F-test values under 1.0 in the one-parameter correlation, correlation coefficients exceeding 0.8, and t-values below the user-specified value, as well as non-universal descriptors, were discarded [27]. Then linear regression models can be built based on the rest of the descriptors. HM is a universal approach for selecting descriptors and constructing linear models, and prior research has demonstrated its effectiveness [28,29,30]. The number of descriptors included in a linear model has a significant impact on model performance; the more descriptors a model contains, the stronger its prediction ability.

4.4. Nonlinearly Selecting Descriptors by XGBoost

Nonlinear dimensionality reduction, which can be done by mapping a low-dimensional surface to a high-dimensional space to find nonlinear relationships among features, is conducive to improving the performance of nonlinear models [31]. To enhance the performance of the nonlinear models constructed in this study, the optimized gradient tree boosting algorithm XGBoost was applied to nonlinearly compute the feature importance of each descriptor to obtain a minimized and optimal set of descriptors. Feature importance of XGBoost could be calculated based on different metrics, and the split gain method was adopted in this study because of its advantages in capturing more subtle relationships among features [32]. By calculating the information gain when splitting the decision trees, the significance of a feature could be expressed numerically.
w σ 2 T = i = 1 N τ i 2
Equation (3) was proposed to measure the importance of each feature X σ in a single decision tree T [33]. N represents the number of internal nodes in the decision tree, and feature X σ divides the area into left and right subareas at every node i . The feature that maximizes the estimated improvement τ i 2 in the squared error risk will be chosen [34]. The squared importance of the feature X σ is the accumulation of squared improvement over the N nodes. XGBoost feature importance over M decision trees is calculated as Equation (4).
w σ 2 T = 1 M m = 1 M τ i 2 T m

4.5. Nonlinear Model by RF

As one of the most popular bagging algorithms, RF is particularly adept at handling large-scale datasets with missing values and is widely used for classification and regression [35]. Based on a large number of uncorrelated decision trees, the result can be obtained by the trees’ voting for classification or averaging for regression [36]. By adopting the bagging strategy and aggregating decision trees, RF reduces the risk of overfitting and improves prediction accuracy [37].
There are four crucial parameters that should be determined in the process of building an RF model. The number of trees which controls the number of decision trees in the forest is positively correlated with the computational complexity. The maximum depth of the tree controls the depth of the tree and bigger values will lead to overfitting with higher possibility. The minimum samples per split controls the minimum number of samples required for a node split, and an excessive value may cause underfitting. The minimum samples per leaf, which is relative to the pruning, controls the minimum number of samples that a node must hold after being split.

4.6. Nonlinear Model by GEP

A nonlinear model was developed using gene expression programming (GEP) to predict IC50 of compounds, given the non-linear and complex relationship between factors affecting the ability of EGFR inhibitors [38].
The discovery of biological gene structure and function forms the basis of GEP, a new type of adaptive evolution algorithm. GEP was developed on the basis of genetic algorithms (GA) and genetic programming (GP) that absorb the advantages of both but overcome the disadvantages of both, and its distinctive feature is that it can solve complex problems with simple coding [39]. The process of GEP is as follows. GEP first decodes the problem into fixed-length chromosomes, whose head has functions and terminals and tail only has terminals. The tail length is determined by the length of the head to ensure legitimacy. Then, the population is iteratively evolved through selection, crossover, mutation, and other operations, and the individuals are retained according to the fitness assessment. Evolution leads to changes in chromosome structure, such as function combinations or terminal substitutions. Ultimately, the optimum chromosome is encoded into an expression tree and translated into a mathematical model or solution for the task.
GEP is widely used in the scientific field, as it is integrated into automatic problem solver2.9 (APS) (http://www.gepsoft.com accessed on 2 June 2025), a commercial software developed by Ferreira. The software is capable of encoding the molecular descriptors chosen by XGBoost that are most relevant to inhibitor activity, and developing nonlinear models to estimate IC50 values of EGFR inhibitors.

4.7. Nonlinear Model by GBDT

The gradient boosting decision tree (GBDT), which is an ensemble learning method, predicts target variables by constructing many decision trees in a sequential fashion. To improve accuracy, the prediction of the model is refined in every GBDT iteration. With each iteration, GBDT builds a new decision tree to reduce the total prediction error by combining predictions from previous iterations and focusing on the residuals that are the difference between projected and actual values [40]. A complex model is able to be constructed by GBDT through multiple iterations, which enables it to capture nonlinear relationships in data and make more accurate predictions. The principle of GBDT is shown in Equation (5).
f m x = f m 1 + T x , θ m
m represents m–th iterations and has m decision improvement trees. f m ( x ) is the predicted value of the model. T x , θ m   represents the m–th decision tree. θ m is the learning rate that controls the contribution of each tree. The difference between the boosted trees of GBDT in different problems is the difference in the loss function, and the squared error loss is used in the regression problem [41]. The final optimization objective function is shown in Equation (6).
L y i , f m 1 x + T x , θ m = r T x , θ m 2
y i represents the true value. r denotes the residual. The m–th decision T ( x , θ m ) tree fits the residuals.

4.8. Nonlinear Model by Poly-SVM

The support vector machine (SVM) proposed by Vapnik to solve the binary classification problems initially is a powerful supervised machine learning algorithm [42]. When SVM is used for regression, it is also known as support vector regression (SVR). After mapping the training data into a high-dimensional feature space, SVR aims to identify the optimal hyperplane that minimizes the training error of the data based on the ε-insensitive loss function [43]. To further regulate the SVM restrictions, slack variables ξ i and ξ i * are added to deal with non-perfectly separable cases, and a penalty parameter C is introduced to control the trade-off between the margin’s size and the penalty associated with the slack variables. The final optimization objective function is shown in Equation (7).
m i n w , ξ i , ξ i * = 1 2 w 2 + C i = 1 n ( ξ i + ξ i * )           s . t .   y i w · x i b ε + ξ i , w · x i + b y i ε + ξ i * , ξ i · ξ i * 0 .
where w , x , and b represent the weight vector, input feature vector, and bias, respectively. Besides, the kernel function K is devised to transform the data into a higher-dimensional space to separate linearly inseparable data with low computational complexity [44]. After introducing the Lagrange multipliers α i and α i * , the function can be simplified as Equation (8).
f x = j = 1 n α j α j * K x i , x j + b
Commonly used kernel functions include linear kernel function, Gaussian kernel function, polynomial kernel function, and sigmoid kernel function. In Poly-SVM, the polynomial kernel function is shown in Equation (9).
K x i , x j = γ x i T x j + r d ,     γ > 0
where d is the degree, γ is the kernel coefficient, and r is an offset coefficient.

4.9. Nonlinear Model by MIX-SVM

The polynomial kernel function is a powerful global kernel function in terms of generalization ability, but its learning ability is limited. And the linear kernel function has the same limitation. On the other hand, the trigonometric function was proposed which can be expressed as Equation (10) [45].
K x i , x j = sin π 2 + σ x i x j 2
where σ is a positive real parameter. The parameter σ directly affects the model’s performance. Small σ values are suitable for sparse datasets, while large σ values are suitable for compact datasets. Results show that models built by the trigonometric kernel function SVM have excellent learning ability but relatively poor generalization ability [45].
Considering the characteristics and complementarity of various kernel functions, a new mixed kernel function, which was a convex combination of the trigonometric kernel function, polynomial kernel function, and linear kernel function, was proposed according to Mercer’s theorem [46,47]. The expression of the mixed kernel function is expressed as Equation (11).
K M I X = α · K T r i g + β · K P o l y + ( 1 α β ) · K L i n e a r
where K T r i g , K P o l y , and K L i n e a r represent the trigonometric kernel function, polynomial kernel function, and linear kernel function, respectively. α and β are weight coefficients ranging from 0 to 1 used for balancing the learning capability and generalization capability of the mixed kernel function.

4.10. Optimization by CLPSO

As the process of building nonlinear models involves many important hyperparameters, an efficient optimization algorithm is necessary to improve the efficiency and convergence speed of hyperparameter search. Compared to traditional methods like grid search or Bayesian optimization, CLPSO offers stronger global optimization ability with fewer assumptions about the objective function. Its ability to maintain population diversity helps avoid local optima, making it well-suited for tuning multiple interdependent hyperparameters in complex nonlinear models like MIX-SVM.
The particle swarm optimizer (PSO) is a population-based optimization algorithm that simulates the behavior of animal swarms that forage in a cooperative way. In PSO, each member of the swarm is called a particle, identified by velocity V and position X , and represents a possible solution in the search space. A particle updates its position and velocity as Equations (12) and (13):
V ω × V + c 1 × r a n d 1 × p b e s t X + c 2 × r a n d 2 × g b e s t X
X X + V
where p b e s t and g b e s t represent the best previous position of each particle and the best position found by the whole population, respectively. c 1 and c 2 are acceleration constants; r a n d 1 and r a n d 2 are random numbers from [0, 1]. The inertia weight ω can balance the local and global search abilities.
As shown in Equation (12), both p b e s t and g b e s t have an impact on the update of all particles, even though the distance between the global optima and the current g b e s t is large. When there are many local optimal solutions in the search space, particles can easily become stuck in local optima and fail to find the global optimum [48]. The CLPSO which adopted a better learning strategy to overcome the PSO’s drawback of converging to local optimum easily was proposed [49], and it updates the velocity as Equation (14).
V ω × V + c × r a n d × p b e s t i X
where i determines which particles’ p b e s t s should be followed. This novel strategy uses information from all particles to update a particle’s velocity, which prevents premature convergence by preserving population diversity and makes CLPSO stand out among PSO variants [48].

4.11. Evaluation and Validation

The coefficient of determination ( R 2 ) ranging from 0.0 to 1.0 is a common index used to evaluate models and its higher value indicates better prediction performance of a model. It quantifies the proportion of variance explained by a regression model, which measures the success of a model constructed with independent variables in predicting the dependent variable [50]. The root mean square error (RMSE) was also used to assess model performance. RMSE values are always non-negative; a lower bound of zero indicates a perfect fit, whereas values for underperforming models increase indefinitely [51].
Validation which tests the reliability of a model in predicting the activities of compounds is an indispensable part of QSAR model development. Rigorous validation consists of three parts: the internal validation aims to verify the reproducibility of a model, the external validation aims to verify its generalizability, and the y-randomization test aims to detect chance correlations. This study performed internal validation by Q L O O 2 of leave-one-out cross-validation (LOO-CV) and 5-fold cross validation (5-fold). Additionally, statistics such as the concordance correlation coefficient (CCC), Q F 1 2 , and Q F 2 2 were used in external validation to measure the generalizability of the model. Additionally, the y-randomization test was performed ten times to obtain the R y r a n d 2 and Q y r a n d 2 of each random model.
Based on empirical values, the acceptable thresholds for a model are: R 2 > 0.7, Q L O O 2 > 0.6, Q 5 f o l d 2 > 0.55, CCC > 0.85, and Q F n 2 > 0.7 [52]. And a QSAR model without chance correlations should have low R y r a n d 2 and Q y r a n d 2 values.

4.12. Applicability Domain

As reliable predictions are mostly limited to chemicals that are structurally similar to the compounds in the training set, it is imperative to evaluate the similarity of the new molecule and limit the applicability domain (AD) of the model before applying a QSAR model to screen compounds [53]. The widely used leverage method was adopted to define AD in this study. The leverage of a compound is defined as Equation (15).
h i = x i T X T X 1 x i
where X is the descriptor matrix of the training set compounds and x i is the descriptor vector of the query compound. The warning leverage h * is defined as 3 × p / n . Specifically, p is the number of model parameters and n is the number of training compounds [54]. Any compound with a greater leverage value than h * will be regarded as outside the AD.

4.13. Property Prediction and Molecular Docking

In drug discovery, achieving strong binding affinity to the target is essential for identifying potential drug candidates. Additionally, crucial drug-like properties such as pharmacokinetics and toxicity profiles play a vital role in determining whether a compound advances to clinical development. The QSAR model, built on a set of carefully selected molecular descriptors that correlate with the activity of a compound, is invaluable in this process. It aids in the analysis of data to predict the behavior of new pharmaceuticals [55].
Molecular docking is a central tool in structural molecular biology and computer-aided drug design [56]. The goal of ligand–protein docking is to predict the primary binding mode between a ligand and a known three-dimensional protein structure. In this study, Sybyl-X2.1 software (https://www.scientificcomputing.com accessed on 15 July 2025) was employed to explore the potential interactions between the newly designed EGFR inhibitors and the target protein at the binding site. PyMol3.2 software (http://www.pymol.org/pymol accessed on 15 July 2025) is used to visualize the docking results.

5. Conclusions

This study reveals that the nonlinearly selected descriptors, including DPSA-3 Difference in CPSAs (PPSA3-PNSA3) [Quantum-Chemical PC], FHASA Fractional HASA (HASA/TMSA) [Quantum-Chemical PC], RPCS Relative positive charged SA (SAMPOS*RPCG) [Zefirov’s PC], and RPCG Relative positive charge (QMPOS/QTPLUS) [Zefirov’s PC], have a significant impact on the IC50 values of the novel inhibitors against EGFRL858R/T790M/C797S. More importantly, the nonlinear regression model built by MIX-SVM exhibits excellent prediction performance and robustness, proving the MIX-SVM to be a promising modeling method in the research of the new EGFRL858R/T790M/C797S inhibitors. In addition, four novel EGFR inhibitors were designed using the descriptors identified by the HM model, followed by property prediction and molecular docking analysis. Among them, compound 45c, which achieved a high drug score in PEA, also showed strong performance in molecular docking with a total score of 5.0483. To sum up, this research could offer helpful guidance for the development of novel EGFRL858R/T790M/C797S inhibitors and the treatment of NSCLC.

Author Contributions

S.L.: writing—original draft, formal analysis. W.D.: writing—original draft, data curation. A.Q.: software, methodology. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

First: the authors would like to thank Yasheng Zhu, Haojie Dong, etc. from China Pharmaceutical University for providing the experimental data set. Second, the authors is grateful for the software applications used in this research. Third, the authors would like to thank his brother Guanlin Chen from Rutgers University for his help and encouragement. Last but not least, the authors would like to express the most sincerely gratitude to the mentor, Aili Qu, who is the corresponding author of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 2021, 71, 209–249. [Google Scholar] [CrossRef] [PubMed]
  2. World Health Organization: WHO. Lung Cancer. 26 June 2023. Available online: https://www.who.int/news-room/fact-sheets/detail/lung-cancer (accessed on 22 March 2025).
  3. Shah, R.; Lester, J.F. Tyrosine kinase inhibitors for the treatment of EGFR mutation-positive non–small-cell lung cancer: A clash of the generations. Clin. Lung Cancer 2020, 21, e216–e228. [Google Scholar] [CrossRef] [PubMed]
  4. Paez, J.G.; Janne, P.A.; Lee, J.C.; Tracy, S.; Greulich, H.; Gabriel, S.; Herman, P.; Kaye, F.J.; Lindeman, N.; Boggon, T.J.; et al. EGFR mutations in lung cancer: Correlation with clinical response to gefitinib therapy. Science 2004, 304, 1497–1500. [Google Scholar] [CrossRef]
  5. Dowell, J.; Minna, J.D.; Kirkpatrick, P. Erlotinib hydrochloride. Nat. Rev. Drug Discov. 2005, 4, 13–14. [Google Scholar] [CrossRef]
  6. Wu, Y.L.; Cheng, Y.; Zhou, X.; Lee, K.H.; Nakagawa, K.; Niho, S.; Tsuji, F.; Linke, R.; Rosell, R.; Corral, J.; et al. Dacomitinib versus gefitinib as first-line treatment for patients with EGFR-mutation-positive non-small-cell lung cancer (ARCHER 1050): A randomised, open-label, phase 3 trial. Lancet Oncol. 2017, 18, 1454–1466. [Google Scholar] [CrossRef]
  7. Nelson, V.; Ziehr, J.; Agulnik, M.; Johnson, M. Afatinib: Emerging next-generation tyrosine kinase inhibitor for NSCLC. OncoTargets Ther. 2013, 6, 135–143. [Google Scholar] [CrossRef]
  8. Kobayashi, S.; Boggon, T.J.; Dayaram, T.; Jänne, P.A.; Kocher, O.; Meyerson, M.; Johnson, B.E.; Eck, M.J.; Tenen, D.G.; Halmos, B. EGFR mutation and resistance of non–small-cell lung cancer to gefitinib. New Engl. J. Med. 2005, 352, 786–792. [Google Scholar] [CrossRef]
  9. Zhou, W.; Ercan, D.; Chen, L.; Yun, C.H.; Li, D.; Capelletti, M.; Cortot, A.B.; Chirieac, L.; Iacob, R.E.; Padera, R.; et al. Novel mutant-selective EGFR kinase inhibitors against EGFR T790M. Nature 2009, 462, 1070–1074. [Google Scholar] [CrossRef]
  10. Ricciuti, B.; Baglivo, S.; Paglialunga, L.; De Giglio, A.; Bellezza, G.; Chiari, R.; Crinò, L.; Metro, G. Osimertinib in patients with advanced epidermal growth factor receptor T790M mutation-positive non-small cell lung cancer: Rationale, evidence and place in therapy. Ther. Adv. Med Oncol. 2017, 9, 387–404. [Google Scholar] [CrossRef]
  11. Schmid, S.; Li, J.J.; Leighl, N.B. Mechanisms of osimertinib resistance and emerging treatment options. Lung Cancer 2020, 147, 123–129. [Google Scholar] [CrossRef]
  12. Dong, H.; Ye, X.; Zhu, Y.; Shen, H.; Shen, H.; Chen, W.; Ji, M.; Zheng, M.; Wang, K.; Cai, Z.; et al. Discovery of Potent and Wild-Type-Sparing Fourth-Generation EGFR Inhibitors for Treatment of Osimertinib-Resistance NSCLC. J. Med. Chem. 2023, 66, 6849–6868. [Google Scholar] [CrossRef]
  13. Zhu, Y.; Ye, X.; Shen, H.; Li, J.; Cai, Z.; Min, W.; Hou, Y.; Dong, H.; Wu, Y.; Wang, L.; et al. Discovery of Novel Fourth-Generation EGFR Inhibitors to Overcome C797S-Mediated Resistance. J. Med. Chem. 2023, 66, 14633–14652. [Google Scholar] [CrossRef]
  14. Tropsha, A. Best practices for QSAR model development, validation, and exploitation. Mol. Inform. 2010, 29, 476–488. [Google Scholar] [CrossRef]
  15. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. CSUR 2017, 50, 1–45. [Google Scholar] [CrossRef]
  16. Bader, R.F. Atoms in molecules. Acc. Chem. Res. 1985, 18, 9–15. [Google Scholar] [CrossRef]
  17. Lipkin, M.R.; Martin, C.C. Calculation of Weight Per Cent Ring and Number of Rings per Molecule for Aromatics. Anal. Chem. 1947, 19, 183–189. [Google Scholar] [CrossRef]
  18. Kwon, Y. Handbook of Essential Pharmacokinetics, Pharmacodynamics and Drug Metabolism for Industrial Scientists; Springer Science & Business Media: Cham, Switzerland, 2001. [Google Scholar]
  19. Ertl, P.; Rohde, B.; Selzer, P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 2000, 43, 3714–3717. [Google Scholar] [CrossRef]
  20. Smith, G.F. Designing drugs to avoid toxicity. Prog. Med. Chem. 2011, 50, 1–47. [Google Scholar]
  21. Eriksson, L.; Jaworska, J.; Worth, A.P.; Cronin, M.T.; McDowell, R.M.; Gramatica, P. Methods for reliability and uncertainty assessment and for applicability evaluations of classification-and regression-based QSARs. Environ. Health Perspect. 2003, 111, 1361–1375. [Google Scholar] [CrossRef]
  22. Sosnowska, A.; Grzonkowska, M.; Puzyn, T. Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: The predictive ability. J. Mol. Liq. 2017, 231, 333–340. [Google Scholar] [CrossRef]
  23. Hyperchem; Version 4.0; Hypercube Inc.: Waterloo, ON, Canada, 1994.
  24. Loomis, R.A.; McGuire, B.A.; Shingledecker, C.; Johnson, C.H.; Blair, S.; Robertson, A.; Remijan, A.J. Investigating the minimum energy principle in searches for new molecular species—The case of H2C3O isomers. Astrophys. J. 2015, 799, 34. [Google Scholar] [CrossRef]
  25. Stewart, J.J. MOPAC: A semiempirical molecular orbital program. J. Comput.-Aided Mol. Des. 1990, 4, 1–103. [Google Scholar] [CrossRef] [PubMed]
  26. Katritzky, A.R.; Kulshyn, O.V.; Stoyanova-Slavova, I.; Dobchev, D.A.; Kuanar, M.; Fara, D.C.; Karelson, M. Antimalarial activity: A QSAR modeling using CODESSA PRO software. Bioorganic Med. Chem. 2006, 14, 2333–2357. [Google Scholar] [CrossRef]
  27. Gao, Z.; Xia, R.; Zhang, P. Prediction of anti-proliferation effect of [1, 2, 3] triazolo [4, 5-d] pyrimidine derivatives by random forest and mix-kernel function SVM with PSO. Chem. Pharm. Bull. 2022, 70, 684–693. [Google Scholar] [CrossRef]
  28. Wang, Y.; Liu, Z.; Qu, A.; Zhang, P.; Si, H.; Zhai, H. Study of Tacrine Derivatives for Acetylcholinesterase Inhibitors Based on Artificial Intelligence. Lat. Am. J. Pharm. 2020, 39, 1159–1170. [Google Scholar]
  29. LI, G.; Wang, X.; LI, A.; Zhang, P. QSAR Study on the IC50 of Thiosemicarbazone Derivatives as PC-3 Inhibitors Based on Mixed Kernel Function Support Vector Machine. Lat. Am. J. Pharm. 2023, 42, 543–553. [Google Scholar]
  30. Wang, Y.; Zhang, P. Prediction of histone deacetylase inhibition by triazole compounds based on artificial intelligence. Front. Pharmacol. 2023, 14, 1260349. [Google Scholar] [CrossRef]
  31. Hira, Z.M.; Gillies, D.F. A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform. 2015, 2015, 198363. [Google Scholar] [CrossRef]
  32. Yang, X.; Qiu, H.; Zhang, Y.; Zhang, P. Quantitative Structure-Activity Relationship Study of Amide Derivatives as Xanthine Oxidase Inhibitors using machine learning. Front. Pharmacol. 2023, 14, 1227536. [Google Scholar] [CrossRef]
  33. Breiman, L. Classification and Regression Trees; Routledge: London, UK, 2017. [Google Scholar]
  34. Zheng, H.; Yuan, J.; Chen, L. Short-term load forecasting using EMD-LSTM neural networks with a Xgboost algorithm for feature importance evaluation. Energies 2017, 10, 1168. [Google Scholar] [CrossRef]
  35. Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227. [Google Scholar] [CrossRef]
  36. Rigatti, S.J. Random forest. J. Insur. Med. 2017, 47, 31–39. [Google Scholar] [CrossRef] [PubMed]
  37. Ali, J.; Khan, R.; Ahmad, N.; Maqsood, I. Random forests and decision trees. Int. J. Comput. Sci. Issues 2012, 9, 272. [Google Scholar]
  38. Ferreira, C. Gene expression programming: A new adaptive algorithm for solving problems. arXiv 2001, arXiv:cs/0102027. [Google Scholar] [CrossRef]
  39. Ferreira, C. Gene expression programming in problem solving. In Soft Computing and Industry: Recent Applications; Springer: London, UK, 2002; pp. 635–653. [Google Scholar]
  40. Zhang, C.; Zhang, Y.; Shi, X.; Almpanidis, G.; Fan, G.; Shen, X. On incremental learning for gradient boosting decision trees. Neural Process. Lett. 2019, 50, 957–987. [Google Scholar] [CrossRef]
  41. Sun, R.; Wang, G.; Zhang, W.; Hsu, L.T.; Ochieng, W.Y. A gradient boosting decision tree based GPS signal reception classification algorithm. Appl. Soft Comput. 2020, 86, 105942. [Google Scholar] [CrossRef]
  42. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  43. Chen, Y.; Xu, P.; Chu, Y.; Li, W.; Wu, Y.; Ni, L.; Bao, Y.; Wang, K. Short-term electrical load forecasting using the Support Vector Regression (SVR) model to calculate the demand response baseline for office buildings. Appl. Energy 2017, 195, 659–670. [Google Scholar] [CrossRef]
  44. Cyganek, B.; Krawczyk, B.; Woźniak, M. Multidimensional data classification with chordal distance based kernel and support vector machines. Eng. Appl. Artif. Intell. 2015, 46, 10–22. [Google Scholar] [CrossRef]
  45. Fathi Hafshejani, S.; Moaberfard, Z. A new trigonometric kernel function for support vector machine. Iran J. Comput. Sci. 2023, 6, 137–145. [Google Scholar] [CrossRef]
  46. Chen, D.G.; Wang, H.Y.; Tsang, E.C. Generalized Mercer theorem and its application to feature space related to indefinite kernels. In Proceedings of the 2008 International Conference on Machine Learning and Cybernetics, Kunming, China, 12–15 July 2008; Volume 2, pp. 774–777. [Google Scholar]
  47. Hu, M.; Chen, Y.; Kwok, J.T.Y. Building sparse multiple-kernel SVM classifiers. IEEE Trans. Neural Netw. 2009, 20, 827–839. [Google Scholar]
  48. Liang, J.J.; Qin, A.K.; Suganthan, P.N.; Baskar, S. Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE Trans. Evol. Comput. 2006, 10, 281–295. [Google Scholar] [CrossRef]
  49. Liang, J.J.; Qin, A.K.; Suganthan, P.M.; Baskar, S. Particle swarm optimization algorithms with novel learning strategies. In Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), The Hague, The Netherlands, 10–13 October 2004; Volume 4, pp. 3659–3664. [Google Scholar]
  50. Nagelkerke, N.J. A note on a general definition of the coefficient of determination. Biometrika 1991, 78, 691–692. [Google Scholar] [CrossRef]
  51. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef] [PubMed]
  52. Chirico, N.; Gramatica, P. Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection. J. Chem. Inf. Model. 2012, 52, 2044–2058. [Google Scholar] [CrossRef] [PubMed]
  53. Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 2012, 17, 4791–4810. [Google Scholar] [CrossRef]
  54. Tropsha, A.; Gramatica, P.; Gombar, V.K. The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 2003, 22, 69–77. [Google Scholar] [CrossRef]
  55. Di, L.; Kerns, E.H.; Carter, G.T. Drug-like property concepts in pharmaceutical design. Curr. Pharm. Des. 2009, 15, 2184–2194. [Google Scholar] [CrossRef]
  56. Morris, G.M.; Lim-Wilby, M. Molecular docking. Methods Mol. Biol. 2008, 443, 365–382. [Google Scholar]
Figure 1. (a) Influence of descriptor numbers on R 2 and R c v 2 . (b) Feature importance ranking of descriptors selected by XGBoost. (c) Correlation heatmap of descriptors. (d) Williams plot of standardized residual versus leverage.
Figure 1. (a) Influence of descriptor numbers on R 2 and R c v 2 . (b) Feature importance ranking of descriptors selected by XGBoost. (c) Correlation heatmap of descriptors. (d) Williams plot of standardized residual versus leverage.
Pharmaceuticals 18 01092 g001
Figure 2. SHAP plot of descriptor importance in the XGBoost model.
Figure 2. SHAP plot of descriptor importance in the XGBoost model.
Pharmaceuticals 18 01092 g002
Figure 3. The design approach targeted the R region of compound 45.
Figure 3. The design approach targeted the R region of compound 45.
Pharmaceuticals 18 01092 g003
Figure 4. Docking analysis of compound 45c to target protein (PDB ID: 7zyn).
Figure 4. Docking analysis of compound 45c to target protein (PDB ID: 7zyn).
Pharmaceuticals 18 01092 g004
Figure 5. Plot of measured and predicted lg (IC50) of the GEP model (a), GBDT model (b).
Figure 5. Plot of measured and predicted lg (IC50) of the GEP model (a), GBDT model (b).
Pharmaceuticals 18 01092 g005
Figure 6. Plot of measured and predicted lg (IC50) of the HM model (a), RF model (b), Poly-SVM model (c), and MIX-SVM model (d).
Figure 6. Plot of measured and predicted lg (IC50) of the HM model (a), RF model (b), Poly-SVM model (c), and MIX-SVM model (d).
Pharmaceuticals 18 01092 g006
Table 1. Physical–chemical meaning and coefficient of HM selected descriptors.
Table 1. Physical–chemical meaning and coefficient of HM selected descriptors.
Physical–Chemical MeaningSymbolCoefficient
Relative number of S atomsRS−40.041
Number of ringsNR0.48875
Max exchange energy for a C-N bondME0.80855
Table 2. Physical–chemical meanings of descriptors selected by XGBoost.
Table 2. Physical–chemical meanings of descriptors selected by XGBoost.
Physical–Chemical MeaningSymbol
DPSA-3 Difference in CPSAs (PPSA3-PNSA3) [Quantum-Chemical PC]DDC
FHASA Fractional HASA (HASA/TMSA) [Quantum-Chemical PC]FFH
RPCS Relative positive charged SA (SAMPOS*RPCG) [Zefirov’s PC]RRPCSA
RPCG Relative positive charge (QMPOS/QTPLUS) [Zefirov’s PC]RRPC
HA dependent HDSA-1 [Zefirov’s PC]HDH
Table 3. Predicted lg(IC50) by HM and docking total score of new EGFR inhibitors.
Table 3. Predicted lg(IC50) by HM and docking total score of new EGFR inhibitors.
No.EGFR Inhibitors Predicted lg (IC50) Total Score
45Pharmaceuticals 18 01092 i0010.644.8087
45aPharmaceuticals 18 01092 i0020.624.8125
45bPharmaceuticals 18 01092 i0030.634.8103
45cPharmaceuticals 18 01092 i0040.565.0483
45dPharmaceuticals 18 01092 i0050.604.8394
Table 4. Predicted lg(IC50) by HM and properties by PEA of newly designed compounds.
Table 4. Predicted lg(IC50) by HM and properties by PEA of newly designed compounds.
No.Pre.lg (IC50)LogPSolubilityMol WeightTPSADrug LikenessDrug Score
450.643.58−7.51556.093.43−2.850.17
45a0.624.06−8.36602.0118.7−0.910.18
45b0.634.06−8.36602.0118.7−0.850.18
45c0.564.06−8.03618.9116.7−0.670.19
45d0.603.68−7.83574.093.43−0.910.20
Table 5. The y-randomization test results of the MIX-SVM model.
Table 5. The y-randomization test results of the MIX-SVM model.
R y r a n d 2 Q y r a n d 2
random 10.0208−0.1850
random 20.0207−0.1943
random 30.0016−0.3308
random 40.0663−0.1514
random 50.0152−0.0859
random 60.0440−0.4301
random 70.0003−0.4650
random 80.0010−0.3564
random 90.0597−0.1834
random 100.0922−0.2965
Table 6. Measured and predicted lg (IC50) of EGFRL858R/T790M/C797S inhibitors.
Table 6. Measured and predicted lg (IC50) of EGFRL858R/T790M/C797S inhibitors.
Pharmaceuticals 18 01092 i006
CompoundRMeasured IC50 (nM)Measured lg (IC50) Predicted Ig (IC50)
HMRFGEPGBDTPoly-SVMMIX-SVM
1 *Pharmaceuticals 18 01092 i0071772.252.392.312.052.062.312.26
2 *Pharmaceuticals 18 01092 i0081802.262.392.332.132.062.282.34
3Pharmaceuticals 18 01092 i0093782.582.432.392.162.062.182.32
4Pharmaceuticals 18 01092 i0102172.342.402.382.111.942.362.35
5 *Pharmaceuticals 18 01092 i0112422.382.372.392.081.922.422.37
6Pharmaceuticals 18 01092 i0125722.762.442.382.152.162.452.75
Pharmaceuticals 18 01092 i013
CompoundR1R2R3Measured IC50 (nM)Measured lg (IC50) Predicted Ig (IC50)
HMRFGEPGBDTPoly-SVMMIX-SVM
7 *Pharmaceuticals 18 01092 i014Pharmaceuticals 18 01092 i015Pharmaceuticals 18 01092 i016311.491.321.431.141.341.261.33
8Pharmaceuticals 18 01092 i017Pharmaceuticals 18 01092 i018Pharmaceuticals 18 01092 i019221.341.171.421.401.411.271.35
9Pharmaceuticals 18 01092 i020Pharmaceuticals 18 01092 i021Pharmaceuticals 18 01092 i022291.461.341.341.201.451.381.45
10Pharmaceuticals 18 01092 i023Pharmaceuticals 18 01092 i024Pharmaceuticals 18 01092 i025271.431.421.51.271.451.421.44
11Pharmaceuticals 18 01092 i026Pharmaceuticals 18 01092 i027Pharmaceuticals 18 01092 i028111.041.381.231.501.211.461.05
12Pharmaceuticals 18 01092 i029Pharmaceuticals 18 01092 i030Pharmaceuticals 18 01092 i031371.571.341.341.261.511.231.56
13Pharmaceuticals 18 01092 i032Pharmaceuticals 18 01092 i033Pharmaceuticals 18 01092 i034181.261.871.471.131.351.381.27
14Pharmaceuticals 18 01092 i035Pharmaceuticals 18 01092 i036Pharmaceuticals 18 01092 i0374482.652.272.52.242.072.312.64
15 *Pharmaceuticals 18 01092 i038Pharmaceuticals 18 01092 i039Pharmaceuticals 18 01092 i0402492.402.262.52.361.992.342.29
16Pharmaceuticals 18 01092 i041Pharmaceuticals 18 01092 i042Pharmaceuticals 18 01092 i0432062.312.432.382.091.982.182.32
17Pharmaceuticals 18 01092 i044Pharmaceuticals 18 01092 i045Pharmaceuticals 18 01092 i0462562.412.362.472.051.992.812.42
18Pharmaceuticals 18 01092 i047Pharmaceuticals 18 01092 i048Pharmaceuticals 18 01092 i0491582.202.352.392.201.892.572.21
19 *Pharmaceuticals 18 01092 i050Pharmaceuticals 18 01092 i051Pharmaceuticals 18 01092 i0523412.532.262.52.272.012.752.57
20 *Pharmaceuticals 18 01092 i053Pharmaceuticals 18 01092 i054Pharmaceuticals 18 01092 i0552612.422.952.412.182.002.272.41
21Pharmaceuticals 18 01092 i056Pharmaceuticals 18 01092 i057Pharmaceuticals 18 01092 i058421.621.381.471.431.551.361.23
22Pharmaceuticals 18 01092 i059Pharmaceuticals 18 01092 i060Pharmaceuticals 18 01092 i061191.281.231.181.371.361.441.29
23Pharmaceuticals 18 01092 i062Pharmaceuticals 18 01092 i063Pharmaceuticals 18 01092 i064231.361.381.271.201.401.391.40
24Pharmaceuticals 18 01092 i065Pharmaceuticals 18 01092 i066Pharmaceuticals 18 01092 i067211.321.201.291.211.391.531.31
25Pharmaceuticals 18 01092 i068Pharmaceuticals 18 01092 i069Pharmaceuticals 18 01092 i070131.111.171.181.101.261.471.12
26 *Pharmaceuticals 18 01092 i071Pharmaceuticals 18 01092 i072Pharmaceuticals 18 01092 i073601.781.251.731.211.671.61.44
27Pharmaceuticals 18 01092 i074Pharmaceuticals 18 01092 i075Pharmaceuticals 18 01092 i076261.411.741.451.021.411.331.42
Pharmaceuticals 18 01092 i077
CompoundXR1R2R3R4Measured IC50 (nM)Measured lg (IC50) Predicted Ig (IC50)
HMRFGEPGBDTPoly-SVMMIX-SVM
28 *CH-CH3HPharmaceuticals 18 01092 i0783752.572.302.22.362.462.222.22
29CH-CH3HPharmaceuticals 18 01092 i07915713.202.842.542.262.423.053.19
30CCI-CH3HPharmaceuticals 18 01092 i0801702.232.292.232.471.892.422.22
31CCI-CH3HPharmaceuticals 18 01092 i08119633.292.832.422.442.472.933.28
32CHPharmaceuticals 18 01092 i082HPharmaceuticals 18 01092 i0837.60.881.220.731.081.100.620.87
33CHPharmaceuticals 18 01092 i084HPharmaceuticals 18 01092 i085561.751.741.811.301.631.361.76
34CCIPharmaceuticals 18 01092 i086HPharmaceuticals 18 01092 i0871752.241.771.841.421.811.441.89
35CHPharmaceuticals 18 01092 i088HPharmaceuticals 18 01092 i089111.041.320.911.171.240.930.62
36CCIPharmaceuticals 18 01092 i090HPharmaceuticals 18 01092 i091241.381.281.41.211.411.111.37
37CHPharmaceuticals 18 01092 i092HPharmaceuticals 18 01092 i093381.581.351.151.081.501.231.57
38NH-CH3CIPharmaceuticals 18 01092 i094711.852.281.91.301.672.321.86
39NH-CH3-OCH3Pharmaceuticals 18 01092 i0952042.312.292.031.421.932.452.30
40NHPharmaceuticals 18 01092 i096CIPharmaceuticals 18 01092 i0974.60.661.270.921.171.1010.67
41NCI-CH3-NO2Pharmaceuticals 18 01092 i0981002.001.802.081.211.762.332.01
42CHPharmaceuticals 18 01092 i099HPharmaceuticals 18 01092 i1001.90.280.420.571.080.820.510.29
43CHPharmaceuticals 18 01092 i101CIPharmaceuticals 18 01092 i1022.80.450.510.872.440.870.910.71
44NCI-CH3-CH3Pharmaceuticals 18 01092 i103921.962.281.872.111.761.631.97
45 *CHPharmaceuticals 18 01092 i104-CF3Pharmaceuticals 18 01092 i105141.150.641.081.221.071.51.15
46CH-CH3-NO2Pharmaceuticals 18 01092 i1061002.001.711.740.831.762.331.99
47N-NH2HHPharmaceuticals 18 01092 i107301.481.571.782.401.491.761.53
Pharmaceuticals 18 01092 i108
CompoundR1R2Measured IC50 (nM)Measured lg (IC50) Predicted Ig (IC50)
HMRFGEPGBDTPoly-SVMMIX-SVM
484-FPharmaceuticals 18 01092 i109171.231.831.650.661.361.561.48
494-CIPharmaceuticals 18 01092 i110861.931.831.770.891.721.761.92
504-OMePharmaceuticals 18 01092 i1112892.461.862.182.172.012.122.45
513-FPharmaceuticals 18 01092 i112901.951.861.782.331.711.61.52
524-FPharmaceuticals 18 01092 i113631.801.831.881.961.681.911.81
534-FPharmaceuticals 18 01092 i1141272.101.831.942.101.811.952.09
544-FPharmaceuticals 18 01092 i115281.452.341.652.101.431.841.46
Pharmaceuticals 18 01092 i116
CompoundR1R2Measured IC50 (nM)Measured lg (IC50) Predicted Ig (IC50)
HMRFGEPGBDTPoly-SVMMIX-SVM
55-NH2Pharmaceuticals 18 01092 i1172.10.321.000.992.271.080.660.33
56-NHCH3Pharmaceuticals 18 01092 i118191.281.021.072.071.160.971.27
57-CH3Pharmaceuticals 18 01092 i119111.041.341.041.891.290.871.05
58-CH(CH3)2Pharmaceuticals 18 01092 i120201.301.441.082.311.290.971.29
59Pharmaceuticals 18 01092 i121Pharmaceuticals 18 01092 i122261.411.701.391.921.320.991.40
60-NHCOCH3Pharmaceuticals 18 01092 i1233.20.510.930.791.061.030.730.52
61-NHSO2MePharmaceuticals 18 01092 i1242.50.400.460.481.100.880.490.39
62-NHSO2EtPharmaceuticals 18 01092 i1250.9−0.050.520.471.080.740.340.18
63-NHSO2iPrPharmaceuticals 18 01092 i1263.40.530.550.421.000.950.270.52
64-NHSO2nBuPharmaceuticals 18 01092 i1274.30.630.610.540.980.990.450.64
65 *Pharmaceuticals 18 01092 i128Pharmaceuticals 18 01092 i129151.181.040.90.831.001.170.84
66Pharmaceuticals 18 01092 i130Pharmaceuticals 18 01092 i131201.301.051.111.021.390.971.29
67 *-NHSO2EtPharmaceuticals 18 01092 i1324.10.610.620.550.941.080.560.41
68-NHSO2EtPharmaceuticals 18 01092 i1336.30.801.090.670.491.061.030.81
69 *-NHSO2EtPharmaceuticals 18 01092 i1341.80.260.440.40.761.110.120.44
70 *-NHSO2EtPharmaceuticals 18 01092 i1352.20.340.530.410.900.990.360.53
71-NHSO2EtPharmaceuticals 18 01092 i1363.70.570.580.570.610.970.530.48
72-NHSO2EtPharmaceuticals 18 01092 i1372.10.320.500.410.530.880.010.33
73-NHSO2EtPharmaceuticals 18 01092 i1383.60.560.530.410.670.990.20.51
74 *-NHSO2EtPharmaceuticals 18 01092 i1392.50.400.550.590.971.020.740.67
75-NHSO2EtPharmaceuticals 18 01092 i140101.001.271.20.671.191.471.01
76 *-NHSO2EtPharmaceuticals 18 01092 i1413.70.570.400.41.191.000.190.46
77 *-NHSO2EtPharmaceuticals 18 01092 i1428.50.930.480.71.651.330.930.75
78-NHSO2EtPharmaceuticals 18 01092 i1434.10.610.390.570.731.00.650.62
79-NHSO2EtPharmaceuticals 18 01092 i14470.11.851.131.361.241.671.621.84
80Pharmaceuticals 18 01092 i145Pharmaceuticals 18 01092 i1465.10.711.391.230.521.031.391.28
81Pharmaceuticals 18 01092 i147Pharmaceuticals 18 01092 i148341.531.461.60.971.501.471.52
82 *Pharmaceuticals 18 01092 i149Pharmaceuticals 18 01092 i150711.851.931.680.530.851.521.85
83Pharmaceuticals 18 01092 i151Pharmaceuticals 18 01092 i1527.90.901.331.160.781.071.361.21
84Pharmaceuticals 18 01092 i153Pharmaceuticals 18 01092 i154181.261.371.191.121.351.371.27
85Pharmaceuticals 18 01092 i155Pharmaceuticals 18 01092 i156261.411.411.541.161.431.411.42
86Pharmaceuticals 18 01092 i157Pharmaceuticals 18 01092 i158491.691.441.481.421.591.421.44
87Pharmaceuticals 18 01092 i159Pharmaceuticals 18 01092 i160151.181.391.530.951.311.381.21
88 *Pharmaceuticals 18 01092 i161Pharmaceuticals 18 01092 i162361.561.411.541.021.251.521.53
89Pharmaceuticals 18 01092 i163Pharmaceuticals 18 01092 i164711.851.431.721.951.671.651.84
90Pharmaceuticals 18 01092 i165Pharmaceuticals 18 01092 i166581.761.321.561.721.621.391.37
91Pharmaceuticals 18 01092 i167Pharmaceuticals 18 01092 i168711.851.381.731.521.671.731.84
92Pharmaceuticals 18 01092 i169Pharmaceuticals 18 01092 i170771.891.311.581.821.701.381.21
* test set.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, S.; Dong, W.; Qu, A. Predicting EGFRL858R/T790M/C797S Inhibitory Effect of Osimertinib Derivatives by Mixed Kernel SVM Enhanced with CLPSO. Pharmaceuticals 2025, 18, 1092. https://doi.org/10.3390/ph18081092

AMA Style

Li S, Dong W, Qu A. Predicting EGFRL858R/T790M/C797S Inhibitory Effect of Osimertinib Derivatives by Mixed Kernel SVM Enhanced with CLPSO. Pharmaceuticals. 2025; 18(8):1092. https://doi.org/10.3390/ph18081092

Chicago/Turabian Style

Li, Shaokang, Wenzhe Dong, and Aili Qu. 2025. "Predicting EGFRL858R/T790M/C797S Inhibitory Effect of Osimertinib Derivatives by Mixed Kernel SVM Enhanced with CLPSO" Pharmaceuticals 18, no. 8: 1092. https://doi.org/10.3390/ph18081092

APA Style

Li, S., Dong, W., & Qu, A. (2025). Predicting EGFRL858R/T790M/C797S Inhibitory Effect of Osimertinib Derivatives by Mixed Kernel SVM Enhanced with CLPSO. Pharmaceuticals, 18(8), 1092. https://doi.org/10.3390/ph18081092

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop