Quantitative Predictive Studies of Multiple Biological Activities of TRPV1 Modulators

TRPV1 channel agonists and antagonists, which have powerful analgesic effects without the addictive qualities associated with traditional analgesics, have become a focus area for the development of novel analgesics. In this study, quantitative structure-activity relationship (QSAR) models for three bioactivity endpoints (Ki, IC50, and EC50) were constructed using four machine learning algorithms: SVM, Bagging, GBDT, and XGBoost. The models were based on 2922 TRPV1 modulators and incorporated four types of molecular descriptors: Daylight, E-state, ECFP4, and MACCS. After rigorous five-fold cross-validation and external test set validation, the optimal model for each endpoint was obtained. For the Ki endpoint, the Bagging-ECFP4 model achieved Q2 = 0.778 and R2 = 0.780; for the IC50 endpoint, the XGBoost-ECFP4 model achieved Q2 = 0.806 and R2 = 0.784; and for the EC50 endpoint, the SVM-Daylight model achieved Q2 = 0.784 and R2 = 0.809. These results demonstrate that the constructed models exhibit good predictive performance. In addition, based on feature importance analysis of the models, the relationship between substructures and biological activity was explored, which can provide important theoretical guidance for the efficient virtual screening and structural optimization of novel TRPV1 analgesics. Subsequent studies on novel TRPV1 modulators will build on the feature substructures identified for the three endpoints.


Introduction
TRPV1 channels are nociceptors found on C and Aδ fibers [1]. They detect various noxious stimuli, such as high temperatures (>42 °C), acidity (H+), and a range of endogenous and exogenous ligands [2]. These channels are crucial in pain management. TRPV1 modulators, including agonists and antagonists, have demonstrated significant efficacy in the treatment of neuropathic pain, osteoarthritis, and cancer pain [3]. Among them, TRPV1 agonists produce long-lasting and reversible analgesia through calcium-dependent desensitization, rendering TRPV1-expressing nerve fibers unresponsive to noxious stimuli [4]. TRPV1 antagonists, on the other hand, reduce nociceptive hypersensitivity by inhibiting TRPV1 channels, thus suppressing the generation of noxious sensations. Traditional analgesics, such as opioid narcotic analgesics and nonsteroidal anti-inflammatory analgesics, while initially providing temporary or partial pain relief, are associated with dose-limiting side effects, poor tolerability, and decreased efficacy over time, particularly impacting the treatment of chronic pain in the elderly [5]. However, existing TRPV1 modulators have side effects such as strong irritation and hyperthermia, limiting their long-term clinical application [6,7]. Therefore, it is still necessary to develop novel TRPV1 modulators.
The agonistic or inhibitory activity of TRPV1 modulators is generally quantified using the half-maximal effective concentration (EC50) or the half-maximal inhibitory concentration (IC50). Ki is the inhibition constant, a more precise indicator than IC50. Experimental approaches to detect the effect of TRPV1 modulators on the opening (agonism) or closing (antagonism) of TRPV1 channels commonly involve FLIPR assays [8] and electrophysiological patch-clamp recordings [9]. Although intuitively clear, these experimental methods require the compound to be tested to be synthesized first and later assayed on TRPV1-expressing cells. Moreover, the speed of drug discovery is limited by these experimental methods, even though high-throughput screening and combinatorial chemistry have been developed. Both types of in vitro experiment, especially electrophysiological assays, require significant time and high investment costs.
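The activity endpoints above are usually compared on a negative log molar scale (pKi, pIC50, pEC50). A minimal sketch of that conversion, assuming the standard convention (the function name and unit labels here are illustrative, not from the paper):

```python
import math

def p_activity(value, unit="nM"):
    """Convert an activity value (Ki, IC50, or EC50) to its negative
    log10 molar form (pKi, pIC50, or pEC50)."""
    scale = {"M": 1.0, "uM": 1e-6, "nM": 1e-9}[unit]
    return -math.log10(value * scale)

# A 100 nM IC50 corresponds to pIC50 = 7.0
print(p_activity(100, "nM"))  # → 7.0
```

On this scale, each unit corresponds to a ten-fold change in potency, which is why a prediction error of 0.3-0.4 (discussed later) stays within half an order of magnitude.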
The discovery of hits is a mandatory step on the path to novel TRPV1 modulators. However, high-throughput screening based on wet assays, combinatorial chemistry, and fragment-based drug design requires significant labor, material, and time costs and wastes considerable effort on inactive compounds [10,11]. In recent years, computer-aided drug design, represented by quantitative structure-activity relationship (QSAR) modeling, has developed rapidly with the rise of artificial intelligence and big data [11]. QSAR modeling is a mathematical or statistical methodology that establishes a quantitative mapping between molecular structure and biological activity, which can be used to predict the biological activity of new compounds against specific targets [11]. This methodology has been widely used in the discovery of drug hits. There have been some structural modification studies of TRPV1 modulators based on 3D-QSAR models. For instance, Kristam et al. [12] constructed a 3D-QSAR model using 62 piperazine-aryl-derived TRPV1 compounds with good predictive performance (Q2 = 0.9, R2 = 0.75). They then used the Topomer CoMFA method to construct a new 3D-QSAR model [13]; however, its predictive performance did not improve. Similarly, Wang et al. [14] constructed a 3D-QSAR model with good predictive performance (Q2 = 0.522, R2 = 0.839) using the CoMSIA method based on 236 TRPV1 antagonists. Although these 3D-QSAR models showed promising results, they require the superposition of molecular 3D conformations. Unfortunately, the conformational alignment step of 3D-QSAR methods can seriously affect the robustness of the models. Furthermore, 3D-QSAR models are often limited to predicting the properties of structurally similar compounds and thus have poor generalization ability [15].
To address the above problems and build models with good generalization ability and stability, this study constructed QSAR models based on multiple machine learning algorithms for three activity endpoints (EC50 of TRPV1 agonists, IC50 of TRPV1 antagonists, and Ki). Internal and external validation showed that the models have good predictive performance and generalization ability and can provide high-quality virtual screening models for the development of novel TRPV1 modulators.

Chemical Space and Scaffold Analysis
To construct the QSAR model, it is important for the dataset to encompass a wide range of activity. The Ki dataset ranges from 5.76 to 10.00 in pKi values, the EC50 dataset from 3.95 to 8.72 in pEC50 values, and the IC50 dataset from 4.04 to 9.40 in pIC50 values. The datasets for the three activity endpoints therefore cover a broad span of activity, from µM to nM. The activity distributions of the training and test sets for the three endpoints, as shown by the histograms (Figure 1A-C), closely resemble those of the total dataset. This suggests that the division of the dataset is reasonable with respect to activity distribution. In addition, principal component analysis (PCA) was used to represent the scaffold distribution of the compounds in the training and test sets (Figure 1). Notably, the compound scaffolds of the test set for each endpoint were mainly distributed within the scaffolds of the corresponding training set, and no outliers appeared. Hence, the scaffold division of the test set proved suitable for evaluating the predictive performance and generalization capability of the QSAR model.
Table 1 lists the ten most frequent carbon scaffolds, the vast majority of which include an isobutane carbon scaffold corresponding to the neck group of the TRPV1 modulator, typically an amide, ureido, or thiourea group. The analysis of compound structures showed 77 carbon scaffolds in the Ki dataset, 275 in the IC50 dataset, and 97 in the EC50 dataset, indicating significant structural diversity across the three endpoints. In contrast, the head and tail of the backbone are mostly cyclic structures, corresponding to the tail moiety that forms hydrophobic interactions in TRPV1 modulators and the head moiety that mostly contains an aromatic ring. It is noteworthy that some special carbon scaffolds appear among the EC50 scaffolds. First, the cyclohexane carbon scaffold, highly ranked in terms of content, is quite different from the generic structure of TRPV1 modulators and can be designed as a head group to provide ideas for fragment-based drug design. Second, the carbon scaffolds ranked fifth and sixth in terms of content both contain a bridged ring structure and can be designed as a tail moiety that provides strong van der Waals interactions and helps increase binding affinity [16].
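The scaffold PCA in Figure 1 can be sketched as follows. The fingerprint matrices here are random stand-ins for the actual scaffold fingerprints, which are not part of this text; the key point is that the PCA is fit on the training-set scaffolds and the test set is projected into the same space:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for scaffold fingerprints: rows are compounds, columns
# are fingerprint bits (random here, for illustration only).
rng = np.random.default_rng(0)
train_fp = rng.integers(0, 2, size=(50, 64))
test_fp = rng.integers(0, 2, size=(10, 64))

# Fit PCA on the training-set scaffolds, then project the test set
# into the same 2D space to check that it falls inside the training
# distribution, as in Figure 1.
pca = PCA(n_components=2).fit(train_fp)
train_2d = pca.transform(train_fp)
test_2d = pca.transform(test_fp)
print(train_2d.shape, test_2d.shape)  # (50, 2) (10, 2)
```

Fitting the projection on the training set only avoids leaking test-set structure into the visualization.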

Feature Selection
To enhance the interpretability and accuracy of the model while minimizing training costs, a method known as recursive feature elimination based on random forest (RFE-RF) is employed for feature selection. Initially, RFE-RF utilizes the complete set of features from a descriptor or molecular fingerprint for modeling. Subsequently, it iteratively eliminates the least significant feature, employing the remaining features for subsequent modeling steps. Finally, it selects the combination of features with the lowest RMSECV.
As shown in Figure 2, the performance of the model gradually improves as the number of features increases, eventually reaching a plateau. The number of features selected varies across the three activity endpoints for the different descriptors and molecular fingerprints. The red dots identified by orange dashed lines in each subplot of Figure 2 indicate the selected feature combinations. For Ki, the optimal numbers of features for Daylight, E-state, ECFP4, and MACCS were 53 (2.6% of the original features), 30 (27.3%), 82 (8.0%), and 50 (30.1%), respectively; for IC50, they were 183 (8.9%), 33 (30.0%), 293 (28.6%), and 68 (41.0%), respectively; and for EC50, they were 168 (8.2%), 25 (22.7%), 55 (5.4%), and 22 (13.3%), respectively. These 12 feature sets were used as independent variables to construct 48 activity prediction models using the four machine learning algorithms, i.e., 16 models per endpoint.
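A minimal RFE-RF sketch using scikit-learn's RFECV with a random-forest base learner and cross-validated RMSE scoring. The synthetic data and hyperparameters are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in: 100 compounds x 20 descriptor features, where
# only the first 5 features carry signal (illustrative data).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=100)

# Recursive feature elimination with a random-forest base learner,
# scored by cross-validated RMSE: features are dropped one at a time
# and the subset with the best CV score is kept.
selector = RFECV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    step=1,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
selector.fit(X, y)
print(selector.n_features_)  # number of retained features
```

`selector.support_` then gives the boolean mask of the retained feature combination, analogous to the red dots in Figure 2.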


Evaluation of Ki Activity Prediction Models
The evaluation results of the 16 Ki activity prediction models are presented in Table 2. It can be observed that the internal validation results of different algorithms under the same descriptor are similar. Furthermore, there is a consistent trend in the internal validation results of different descriptors under the same algorithm, with ECFP4 showing the highest performance, followed by Daylight, MACCS, and E-state. The model constructed using the Bagging algorithm and ECFP4 descriptors demonstrates the highest performance (Q2 = 0.778, R2 = 0.780), while the model constructed using SVM and E-state descriptors exhibits the lowest performance (Q2 = 0.502, R2 = 0.536). The MAECV and MAET of the vast majority of the models were in the range of 0.3-0.4, indicating that the difference between the predicted results and the experimental values was not more than half an order of magnitude. Thus, the predicted values of these models are of practical significance. Furthermore, the external validation results of the 16 models align with the internal validation results, reaffirming their good generalization ability and reliable prediction capability for the Ki activity values of new chemical entities. Figure 3 displays the scatter plot of the predicted values of the optimal model against the experimental values; the green dots represent the training set, while the orange dots represent the test set.
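The internal (Q2, MAECV) and external (R2) metrics used throughout these tables can be computed as in the following sketch. The Bagging regressor and synthetic pKi-like data are illustrative assumptions; only the metric definitions mirror the text:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic pKi-like data (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
y = X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=120)

model = BaggingRegressor(n_estimators=50, random_state=0)

# Q2: coefficient of determination over five-fold cross-validated
# predictions; MAE_CV: average absolute error in log units.
y_cv = cross_val_predict(model, X, y, cv=5)
q2 = r2_score(y, y_cv)
mae_cv = mean_absolute_error(y, y_cv)
print(round(q2, 3), round(mae_cv, 3))
```

R2 and MAET on the external test set use the same two metric functions, applied to predictions from a model refit on the full training set.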

Evaluation of IC50 Activity Prediction Models
Table 3 presents the evaluation results of the 16 IC50 activity prediction models. For SVM, Bagging, and XGBoost, the prediction performance of the four descriptors is ranked as follows: ECFP4 > Daylight > MACCS > E-state. However, for GBDT, the prediction performance of Daylight is stronger than that of ECFP4. Compared with the Ki dataset, the internal validation results of the four algorithms on the IC50 dataset differ significantly; in particular, the internal validation performance of GBDT is significantly lower than that of the other three algorithms. Although the SVM model using E-state descriptors has the worst internal validation performance (Q2 = 0.487 ± 0.008, RMSECV = 0.424 ± 0.005, and MAECV = 0.338 ± 0.004) among all the models, the model constructed with XGBoost and ECFP4 is considered the optimal model for IC50 activity prediction. This model performs best in both internal and external validation, demonstrating good generalization performance. Furthermore, the MAE of the IC50 model is comparable to that of the Ki prediction model, with errors within half an order of magnitude. Figure 4 illustrates the scatterplot of predicted versus experimental values for the optimal model, with green dots indicating the training set and orange dots representing the test set. The green solid line represents the trend line for the training set, while the orange solid line corresponds to the trend line for the test set. Notably, the trend lines of the training and test sets resemble those of the Ki prediction model, further confirming the model's strong generalization ability.
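The paper's optimal IC50 model pairs XGBoost with ECFP4. As a sketch of the same gradient-boosting family without the external `xgboost` dependency, scikit-learn's `GradientBoostingRegressor` (a GBDT implementation, which the paper also benchmarks) can illustrate the train/external-test workflow; all data here is synthetic and illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic pIC50-like data over binary fingerprint bits
# (illustration only; the paper uses ECFP4 bits of real antagonists).
rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(300, 50)).astype(float)
w = rng.normal(size=50)
y = X @ w + 0.3 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
gbdt = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# External-test R2, analogous to the paper's R2 on the held-out set.
r2_test = r2_score(y_te, gbdt.predict(X_te))
print(round(r2_test, 3))
```

Swapping in `xgboost.XGBRegressor` would follow the same fit/predict pattern.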

Evaluation of EC50 Activity Prediction Models
Table 4 presents the performance evaluation results for the 16 EC50 activity prediction models. It is evident that the internal validation of the same descriptor varies little across different algorithms, although the performance of the E-state descriptor in the SVM model is noticeably inferior to that in the other three algorithms. Additionally, the performance of the four descriptors in the Bagging, GBDT, and XGBoost algorithms follows the order ECFP4 > Daylight > MACCS > E-state. Conversely, in the SVM algorithm, the internal validation of Daylight outperforms that of ECFP4, and the SVM-Daylight model is the optimal EC50 activity prediction model, with an external validation R2 exceeding 0.8, highlighting its strong predictive capability. Figure 5 illustrates the scatter plots comparing the predicted and experimental values for the SVM-Daylight model.
The internal and external validation results for our models show a much smaller difference (less than 0.03) between Q2 and R2 than the previous 3D-QSAR models [12-14] (Table 5), which exhibited differences of more than 0.25. This indicates a significant improvement in the generalization ability of the models. The enhanced performance can be attributed primarily to the larger dataset utilized in this study, as well as the robust stability of the machine learning algorithms employed. Notably, the model proposed by Kristam was developed using only 62 molecules, making it challenging to ensure generalizability.



Y-Randomization Test
Feature selection involves selecting the best-performing feature combinations from high-dimensional descriptors and molecular fingerprints to build models that are highly fitted to the experimental values. However, it is possible to obtain such models by chance, without any real correlation between the descriptors and the experimental values. To assess the chance correlation of the models, we applied the Y-randomization test. During the Y-randomization test, the experimental values of pKi, pIC50, and pEC50 are randomly shuffled, destroying the original relationship between the descriptors or molecular fingerprints and the activity values while leaving the distribution of the activity values unchanged [17]. We then re-modeled the shuffled data using the algorithms and molecular fingerprints of the three optimal models, repeating the process 1000 times. The results of evaluating the 1000 randomized models using Q2 are shown in Figure 6. In this figure, the horizontal axis represents Q2 and the vertical axis the frequency. The green bars on the left side of the three subfigures are the histograms of the Q2 distribution for the 1000 randomized models, while the orange vertical lines indicate the Q2 of the original models. From Figure 6, it is evident that all Q2 values of the randomized models fall between −1 and 0, indicating no correlation between the shuffled activity values and the descriptors or molecular fingerprints. According to the paired-sample t-test, the difference between the randomized models and the original models is significant at the 99% confidence level (p < 0.001). Therefore, in the three optimal activity prediction models constructed in this paper, there exists a real correlation, rather than a chance correlation, between the modeled molecular fingerprints and Ki, IC50, or EC50.
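The Y-randomization procedure can be sketched as follows. A Ridge regressor, a small dataset, and 20 shuffles stand in for the paper's optimal models and 1000 repeats (all illustrative assumptions); the point is that Q2 collapses once the labels are shuffled:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score

# Illustrative data with a real structure-activity relationship.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
y = X[:, 0] + 0.1 * rng.normal(size=100)

def q2(X, y):
    """Five-fold cross-validated Q2."""
    return r2_score(y, cross_val_predict(Ridge(), X, y, cv=5))

q2_real = q2(X, y)
# Shuffle the labels (distribution unchanged) and refit repeatedly.
q2_rand = [q2(X, rng.permutation(y)) for _ in range(20)]
print(round(q2_real, 3), round(max(q2_rand), 3))
```

The original model's Q2 should sit far to the right of the randomized distribution, as the orange lines do in Figure 6.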

Model Interpretation
In this paper, we aim to interpret the three optimal models by ranking the importance of features. To select the features, we employ the RFE-RF method, which calculates the Gini index for each feature to indicate its significance. Figure 7 displays the five features with the highest importance in each of the three optimal models. It is worth mentioning that the optimal models for Ki and IC50 use the ECFP4 fingerprint, whereas the optimal model for EC50 uses the Daylight fingerprint. Notably, the sum of the importance of all features in each model is equal to 1.

The first five features of Ki had a cumulative importance of 0.493. In the dataset, 349 compounds contained these features, which, in descending order of importance, are bits 200, 667, 573, 316, and 997; these compounds account for 52.80% of the total Ki dataset. Figure 8A displays the histogram of the pKi distribution for compounds containing the top five features. It is evident that the pKi of these compounds is shifted one unit to the right compared with the other compounds. The structures of the first five features are shown in Figure 9A. The structure of TRPV1 modulators typically comprises three parts: the head, the neck, and the tail. Generally, the head serves as a hydrogen bond acceptor, while the neck acts as a hydrogen bond acceptor and is commonly an amide, urea, or thiourea; the tail is a hydrophobic group [18,19]. For these compounds, the first five features correspond to the head (bits 667, 316, 997) and neck (bits 200, 573), with the head being a methylsulfonamide attached to a benzene ring and the neck an amide group. Referring to the activity distribution in Figure 8A, it is reasonable to assume that compounds containing such a structure tend to exhibit high Ki activity and hold potential for modification.
The first three features of IC50 were found to be significantly more important than the last two, with the order of importance being bit 672 > bit 128 > bit 378. Therefore, the first three features are selected for model interpretation, as illustrated in Figure 7. In the IC50 dataset, 273 compounds (14.41% of the dataset) contained the first three features, whose cumulative importance amounted to 0.225. Figure 8B highlights that compounds possessing the first three features exhibited a rightward shift in pIC50 compared with the other compounds, suggesting a higher level of antagonistic activity. As illustrated in Figure 9B, bits 672 and 128 correspond to the aromatic ring of the head and the urea group of the neck, respectively, while bit 378 signifies the indole ring of the head. The indole in the head acts as both a hydrogen bond donor and acceptor, facilitating specific interactions for antagonist binding to TRPV1 channels. Therefore, compounds that incorporate 1H-indole in the head may possess highly active antagonistic properties.
The Daylight fingerprint differs from ECFP4 in that it represents the molecular structure as linear paths of atoms and thus has no central atom. The importance of the bit-67 feature is 0.311, much higher than that of the other features, and 205 compounds carry this feature, accounting for 55.86% of the EC50 dataset; this feature was therefore used to interpret the model. In Figure 8C, the compounds with bit 67 show a significant rightward shift in pEC50 compared with the other compounds, with EC50 activities differing by nearly two orders of magnitude. As can be seen in Figure 9C, the bit-67 feature indicates the ureido group in the neck and the benzene ring in the head. This indicates that most highly active TRPV1 agonists bear a phenylurea spanning the neck and head, and thus compounds containing phenylurea are potentially highly active TRPV1 agonists.

Data Collection and Processing
Data on the Ki, IC50, and EC50 activities of human TRPV1 channels were collected from the ChEMBL [20] and PubChem [21] databases. The data were processed according to the following steps: 1. compounds without a clear activity type and activity value were removed; 2. activity values in nanomoles (nM) or micromoles (µM) were converted to molar (M) and the negative base-10 logarithms were taken (i.e., pKi, pIC50, and pEC50); 3. compounds with multiple activity values were deduplicated, following the rule that if the maximum difference of the negative logarithms is less than or equal to 1, the mean is taken as the activity value of the compound, and the compound is discarded otherwise; 4. salt ions and metal ions were removed from the dataset. After processing, three activity datasets were obtained, consisting of 661 Ki, 1894 IC50, and 367 EC50 values. Based on these datasets, three QSAR models were constructed.
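The unit conversion and deduplication rules in steps 2 and 3 can be sketched in plain Python (a minimal sketch; the function names are illustrative, not from the study):

```python
import math

def to_p_value(value, unit):
    """Convert an activity value in nM or uM to molar and take -log10 (step 2)."""
    factor = {"nM": 1e-9, "uM": 1e-6}[unit]
    return -math.log10(value * factor)

def merge_duplicates(p_values):
    """Deduplication rule from step 3: if the spread of the -log10 values is
    <= 1 log unit, keep the mean; otherwise discard the compound (None)."""
    if max(p_values) - min(p_values) <= 1.0:
        return sum(p_values) / len(p_values)
    return None

print(to_p_value(100, "nM"))         # 100 nM -> 1e-7 M -> approximately 7.0
print(merge_duplicates([7.0, 7.4]))  # spread 0.4 <= 1, so the mean is kept
print(merge_duplicates([5.0, 7.5]))  # spread 2.5 > 1, so the compound is dropped
```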

Descriptor Generation
This paper utilizes four different methods to extract features and build QSAR models: Daylight fingerprints, molecular access system (MACCS) fingerprints [22], electrotopological state (E-state) molecular descriptors [23], and extended-connectivity fingerprints (ECFPs) [24]. MACCS fingerprints are substructure-based molecular fingerprints; this paper selects the commonly used 166-bit version. Daylight fingerprints, also known as path-based molecular fingerprints, characterize molecules through different atomic paths encoded in a total of 2048 bits. E-state molecular descriptors simultaneously characterize the molecular structure and its electronic characteristics with a total of 110 features. The ECFP fingerprint is a circular topological fingerprint based on the Morgan algorithm; this study uses ECFP4, with a diameter of 4, encoded in 1024 bits. These descriptors were calculated with the Scopy [25] and RDKit [26] toolkits.
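Three of the four descriptor sets can be generated with RDKit, as sketched below (the E-state descriptors, computed with Scopy in the study, are omitted; the example molecule and parameter choices are illustrative, mirroring the bit counts stated above):

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys, AllChem

# Capsaicin, a prototypical TRPV1 agonist, as an example molecule
mol = Chem.MolFromSmiles("COc1cc(CNC(=O)CCCC/C=C/C(C)C)ccc1O")

# MACCS keys: RDKit returns 167 bits (bit 0 is unused, giving the 166 keys)
maccs = MACCSkeys.GenMACCSKeys(mol)

# ECFP4: Morgan fingerprint with radius 2 (diameter 4), folded to 1024 bits
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)

# Daylight-style path-based fingerprint, folded to 2048 bits
daylight = Chem.RDKFingerprint(mol, fpSize=2048)
```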

Data Set Segmentation
In order to avoid training bias or overfitting and to keep the structural distribution of compounds similar across subsets, this paper divides each dataset into training and test sets according to the carbon scaffold. The carbon scaffold is determined by removing all R groups from the molecule and retaining only the ring systems and the connecting groups between them, while converting heteroatoms to carbon atoms and all bonds to single bonds. The Scopy toolkit [25] is employed to calculate the carbon scaffolds of the compounds in this study. If fewer than five molecules share the same carbon scaffold, one molecule is randomly assigned to the training set and the remaining molecules are assigned to the test set. If five or more molecules share the same carbon scaffold, 80% of them are randomly assigned to the training set while the remaining 20% are assigned to the test set.
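This split can be approximated with RDKit's generic Murcko scaffold (heteroatoms converted to carbon, all bonds made single), standing in for the Scopy routine used in the study; the grouping thresholds follow the text:

```python
import random
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def carbon_scaffold(smiles):
    """Approximate the carbon scaffold: Murcko framework with heteroatoms
    turned into carbons and all bonds into single bonds."""
    mol = Chem.MolFromSmiles(smiles)
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)
    return Chem.MolToSmiles(generic)

def scaffold_split(smiles_list, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for smi in smiles_list:
        groups[carbon_scaffold(smi)].append(smi)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        if len(members) < 5:
            # one molecule to the training set, the rest to the test set
            train.append(members[0])
            test.extend(members[1:])
        else:
            # 80/20 split within the scaffold group
            cut = int(round(0.8 * len(members)))
            train.extend(members[:cut])
            test.extend(members[cut:])
    return train, test
```

Note that phenol, aniline, and cyclohexane all collapse to the same all-carbon six-membered ring under this definition, so they land in the same scaffold group.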

Machine Learning Methods
In this study, four machine learning algorithms are used to construct the QSAR models: support vector machine (SVM), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and Bagging. The models of the four algorithms were implemented via the scikit-learn toolkit [27].
SVM [28] is a statistical learning algorithm based on Vapnik's principle of structural risk minimization. Originally developed for classification problems, SVM can also be extended to regression tasks by introducing slack variables. In the regression setting, the objective is to find a hyperplane with a small norm while minimizing the sum of the distances from the data points to the hyperplane [29]. Its high degree of generalization ability has contributed to its increasing popularity in the QSAR/QSPR field.
GBDT [30] is a machine learning algorithm based on the Boosting ensemble idea. GBDT updates the strong learner by decreasing the loss function, using the negative gradient of the loss function to fit an approximation of the loss in each iteration round. A disadvantage of GBDT is that it is difficult to train in parallel and is therefore less efficient.
XGBoost was developed by Tianqi Chen et al. [31]. Like other Boosting algorithms, it builds its model in a sequential, stage-wise manner. However, XGBoost enables parallel computation and also handles missing values better than GBDT. In addition, XGBoost is highly resistant to overfitting due to the inclusion of regularization terms.
Unlike Boosting, each base learner in Bagging [32] is independent and can be trained in parallel. Bagging draws n sample sets by bootstrap sampling and trains a base learner on each sample set, after which the learners are combined. Because each bootstrap sample is drawn with replacement, approximately 36.8% of the samples in the initial dataset do not appear in a given sample set. These out-of-bag samples can be used as a validation set to test the training performance and generalization ability of the model. The Bagging algorithm focuses on the reduction of variance and is a simple and efficient ensemble method.
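The 36.8% out-of-bag figure follows directly from bootstrap sampling: the chance that a given sample is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e as n grows:

```python
import math

def oob_fraction(n):
    """Probability that a given sample is absent from a bootstrap
    sample of size n drawn with replacement from n samples."""
    return (1 - 1 / n) ** n

print(oob_fraction(10))      # ~0.349 for a small dataset
print(oob_fraction(100000))  # ~0.3679, already close to the limit
print(1 / math.e)            # limit value, ~0.36788 (i.e., ~36.8%)
```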
The grid search in scikit-learn was used for parameter tuning. The key parameters of SVM are C (the penalty coefficient) and gamma (the kernel coefficient). In the grid search, C was set to 0.01, 0.1, 1, 10, 100, and 1000; gamma was set to 0.0001, 0.001, and 0.01; and the RBF kernel was chosen. For Bagging, XGBoost, and GBDT, the tuned parameter was the number of decision trees, ranging from 100 to 1000 with a step size of 50.
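The tuning protocol above can be sketched with scikit-learn's GridSearchCV for the SVM case (a minimal sketch; the toy data stands in for the fingerprint matrix and activity vector):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Toy regression data standing in for fingerprints and pIC50 values
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(60)

# Grid from the text: 6 values of C x 3 values of gamma, RBF kernel
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100, 1000],
    "gamma": [0.0001, 0.001, 0.01],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)  # best (C, gamma) pair under 5-fold CV RMSE
```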

Performance Evaluation Indicators
To ensure the good generalization ability of the QSAR models in predicting the biological activity of new chemical entities, both internal and external validation were conducted. The models were internally validated using five-fold cross-validation (CV) and externally validated on independent test sets. In five-fold CV, the training set was divided into five equal parts, with four parts used to construct the model and one part used for validation. This process was repeated five times, allowing each part of the data to serve as the validation set once. Six main statistical parameters were employed to evaluate model performance: the coefficient of determination (Q2), the root mean square error (RMSECV), and the mean absolute error (MAECV) for CV, and the coefficient of determination (R2), the root mean square error (RMSEtest), and the mean absolute error (MAEtest) for the test set.
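These metrics can be computed directly from experimental and predicted values (a minimal sketch; Q2 uses the same formula as R2 but is computed on cross-validated predictions):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R2 (or Q2 when y_pred comes from CV), RMSE, and MAE."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot          # coefficient of determination
    rmse = math.sqrt(ss_res / n)      # root mean square error
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return r2, rmse, mae

# Example with three hypothetical pKi values and predictions
r2, rmse, mae = regression_metrics([6.0, 7.0, 8.0], [6.1, 6.9, 8.2])
```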

Figure 1.
Figure 1. Distribution histograms (A-C) and principal component analysis plots (D-F) for the EC50, IC50, and Ki data sets.

Figure 2.
Figure 2. Feature selection results based on RFE-RF. The red dot indicates the point with the lowest RMSE, and the horizontal and vertical dotted lines refer to the RMSE and the number of features corresponding to this point, respectively.

Figure 3.
Figure 3. Scatter plot of predicted values versus experimental values of the Ki prediction model based on Bagging and ECFP4. The green line indicates the trend line of the training set and the orange line indicates the trend line of the test set.

Figure 4.
Figure 4. Scatter plot of predicted values versus experimental values of the IC50 prediction model based on XGBoost and ECFP4. The green line indicates the trend line of the training set and the orange line indicates the trend line of the test set.

Figure 5.
Figure 5. Scatter plot of predicted values versus experimental values of the EC50 prediction model based on SVM and Daylight. The green line indicates the trend line of the training set and the orange line indicates the trend line of the test set.

Figure 6.
Figure 6. The distribution of Q2 of the randomization models of the three activity endpoints. The left-side green bars represent the randomized Q2 distribution, and the orange vertical line on the right side represents the Q2 of the original model.

Figure 7.
Figure 7. Descriptor importance of the top 5 features of the 3 optimal models. The vertical coordinate represents the bit encoding of the molecular fingerprint, while the horizontal coordinate corresponds to the feature importance.


Figure 8.
Figure 8. (A) Histogram of the activity distribution of compounds with and without the feature structures for the pKi endpoint; (B) histogram of the activity distribution of compounds with and without the feature structures for the pIC50 endpoint; (C) histogram of the activity distribution of compounds with and without the feature structures for the pEC50 endpoint.

Figure 9.
Figure 9. Feature structures and their positions in the molecules. The blue structure on the left side of the molecule is the sum of the substructures corresponding to the features. On the right side, the feature structures are depicted, with the central atoms marked by purple dots. The atoms within the bonding radius are shown in black, while green indicates the environments of the feature structures within the molecule. An asterisk indicates an unspecified atom, which could be carbon, nitrogen, or another element. (A) The representative compound in the Ki data set is shown on the left, and the 5 most important substructures are shown on the right. (B) The representative compound in the IC50 data set and the 3 most important substructures. (C) The representative compound in the EC50 data set and the single most important substructure.


Table 1.
Top 10 carbon scaffolds and corresponding numbers for the Ki, IC50, and EC50 data sets.

Table 2.
The results of internal and external validation of the Ki prediction models.


Table 3.
The results of internal and external validation of the IC50 prediction models.

Table 4.
The results of internal and external validation of the EC50 prediction models.


Table 5.
The results of internal and external validation of the prediction models and previous studies.