A Machine Learning Approach for Predicting Caco-2 Cell Permeability in Natural Products from the Biodiversity in Peru

Background: Peru is one of the most biodiverse countries in the world, which is reflected in its wealth of knowledge about medicinal plants. However, there is a lack of information regarding intestinal absorption and the permeability of natural products. The human colon adenocarcinoma cell line (Caco-2) is an in vitro assay used to measure apparent permeability. This study aims to develop a quantitative structure–property relationship (QSPR) model using machine learning algorithms to predict the apparent permeability of the Caco-2 cell in natural products from Peru. Methods: A dataset of 1817 compounds, including experimental log Papp values and molecular descriptors, was utilized. Six QSPR models were constructed: a multiple linear regression (MLR) model, a partial least squares regression (PLS) model, a support vector machine regression (SVM) model, a random forest (RF) model, a gradient boosting machine (GBM) model, and an SVM–RF–GBM model. Results: An evaluation of the testing set revealed that the MLR and PLS models exhibited an RMSE = 0.47 and R2 = 0.63. In contrast, the SVM, RF, and GBM models showcased an RMSE = 0.39–0.40 and R2 = 0.73–0.74. Notably, the SVM–RF–GBM model demonstrated superior performance, with an RMSE = 0.38 and R2 = 0.76. The model predicted log Papp values for 502 natural products falling within the applicability domain, with 68.9% (n = 346) showing high permeability, suggesting the potential for intestinal absorption. Additionally, we categorized the natural products into six metabolic pathways and assessed their drug-likeness. Conclusions: Our results provide insights into the potential intestinal absorption of natural products in Peru, thus facilitating drug development and pharmaceutical discovery efforts.


Introduction
Peru is one of the most biodiverse countries in the world, because it possesses over 70% of the planet's biodiversity, which is represented in a variety of ecosystems [1].The richness of this diversity is reflected in the country's wealth of knowledge about medicinal plants, which has been passed down by several generations for treating various conditions [2].Some examples include Uncaria tomentosa ("Uña de gato"), which is used for treating various health problems, such as gastric ulcers, viral infections, arthritis, and other inflammatory problems [3]; Plukenetia volubilis ("Sacha inchi"), which is traditionally used for skin care to soften skin, heal wounds, and treat insect bites and skin infections [4]; and Hypericum laricifolium ("Hierba de la fortuna"), which used for inflammation, infections, and musculoskeletal and bone pain [5].However, there is a lack of information regarding the pharmacological properties, safety, and efficacy of natural products present in natural resources used in oral administration [6,7].
Natural products are organic compounds obtained from natural resources, such as plants, animals, and microorganisms that exhibit biological activities [8,9].Historically, they have been used as the primary source of compounds for medicines, cosmetics, and food products [10].They can be classified into primary and secondary metabolites [11].
Secondary metabolites are compounds that may provide an evolutionary advantage due to the organism adapting to its environment [8,11], and they are derived from biosynthetic intermediates, such as acetyl coenzyme A, shikimic acid, mevalonic acid, and 1deoxyxylulose-5-phosphate [8,9].Natural products can be categorized into six metabolic pathways: terpenoids, polyketides, fatty acids, alkaloids, shikimates and phenylpropanoids, and amino acids and peptides [12].
The gastrointestinal tract is a channel lined with epithelium that extends from the mouth to the anus and is responsible for food digestion, nutrient absorption, waste product excretion, and water reabsorption [13].The absorptive region, which includes the stomach, small intestine, and part of the large intestine, is composed of a layer of epithelial cells called enterocytes [13].Enterocytes represent the largest population of intestinal cells, presenting an apical membrane facing the gastrointestinal lumen and a basolateral membrane facing the serosal side [13,14].The intestinal barrier separates the intestinal lumen from the internal host, facilitating the exchange of molecules and nutrient absorption, while preventing water and electrolyte loss, and the entry of antigens and microorganisms into the body [15,16].
The efficacy of oral drug administration relies on intestinal absorption of the compound [17].According to the Biopharmaceutics Classification System, aqueous solubility and permeability are properties that influence oral absorption [18].Intestinal permeability is a property that describes the facility with which molecules pass through the intestinal epithelium via simple passive diffusion [13].It is classified into effective permeability or the human jejunal permeability rate, which is determined from intestinal perfusion studies [19], and the apparent permeability (Papp), which is the amount of compound transported per unit of time.
The human colon adenocarcinoma cell line (Caco-2) is commonly used as an in vitro assay to measure apparent permeability, due to its physiological and morphological properties, similar to enterocytes [20].However, this assay is time consuming due to the long culture periods between 21 and 24 days, making it challenging to perform high-throughput screening [21].An alternative approach for predicting apparent permeability in the Caco-2 cell line is quantitative structure-property relationship (QSPR) modeling.It is a mathematical model that correlates experimental values with chemical structure features to predict the properties of new chemical entities [22].Chemical structure features are represented by numerical physicochemical parameters called molecular descriptors.Molecular descriptors are determined experimentally or computationally, and are utilized in the development QSPR models [22].
Machine learning (ML) is a technique that uses algorithms to analyze data, learn from it, and then make predictions about new datasets [23].ML is widely used in all stages of drug discovery, including ligand identification, target validation, and identification of prognostic biomarkers [23].ML is divided into supervised learning methods, which are used to develop training models to predict future values, and unsupervised learning methods, which enable the clustering of data [23].ML models have been utilized in QSPR studies to predict the apparent permeability of Caco-2 cells.One of the pioneering efforts was that by Norinder et al. [24], who developed a PLS model using 17 compounds.Since this initial study, the availability of experimental data has increased, leading to a growth in QSPR studies aimed at predicting Caco-2 cell permeability, involving varying in data sizes, descriptor types and quantities, and modeling methods.Various types of descriptors, encompassing both 2D and 3D molecular descriptors, have been utilized, and a range of regression algorithms have been employed, including multiple linear regression (MLR), partial least squares (PLS), support vector machine (SVM), random forest (RF) [20,21,[25][26][27], and gradient boosting [28].
There is a need to obtain more pharmacological information on natural products, such as apparent permeability, which is highly correlated with intestinal absorption and determines the efficacy of orally administered compounds.This study focuses on the development of QSPR models using ML algorithms to correlate experimental values on the apparent permeability in the Caco-2 cell line with calculated molecular descriptors.Then, the best-performing QSPR model is utilized to predict the apparent permeability of natural products from the biodiversity in Peru.These findings will allow the pharmacokinetic and pharmacodynamic properties of natural products to be inferred, such as absorption and potential bioavailability, providing insights into the compounds responsible for the pharmacological activity of natural resources.

Analysis of the Datasets
In this study, we employed a dataset source from the literature [28] 1a.Additionally, principal component analysis (PCA) was performed using 41 selected descriptors.The first 40 principal components (PCs), which explained 99.8% of the cumulative variance, were plotted in pairs along the x-axis and y-axis (Figure S3).Furthermore, the first four PCs are plotted in Figure 1b,c.This analysis validated the suitability of the partitioning of our datasets, as the chemical space for both the training and testing sets exhibited a similarity.Consequently, the testing set, utilized for external validation, adequately represented the distribution of the log Papp values and chemical space present in the training set.
Pharmaceuticals 2024, 17, x FOR PEER REVIEW 3 determines the efficacy of orally administered compounds.This study focuses on th velopment of QSPR models using ML algorithms to correlate experimental values o apparent permeability in the Caco-2 cell line with calculated molecular descriptors.the best-performing QSPR model is utilized to predict the apparent permeability of ral products from the biodiversity in Peru.These findings will allow the pharmacok and pharmacodynamic properties of natural products to be inferred, such as absor and potential bioavailability, providing insights into the compounds responsible fo pharmacological activity of natural resources.

Analysis of the Datasets
In this study, we employed a dataset source from the literature [28] comprisi 1817 unique compounds.The diversity of the dataset was evaluated based on five p cochemical properties: molecular weight (MW) (437.95 ± 152), logP (1.56 ± 2.0), numb hydrogen acceptors (7.24 ± 3.8), number of hydrogen donors (2.81 ± 2.6), topologica face area (TPSA) (106.51 ± 59.3), and number of rotatable bonds (6.23 ± 4.3).The log values exhibited a mean of −5.34, ranging from a minimum of −7.70 to a maximu −3.78.Both the training and testing sets exhibited a comparable distribution of log values, as illustrated in Figure 1a.Additionally, principal component analysis (PCA performed using 41 selected descriptors.The first 40 principal components (PCs), w explained 99.8% of the cumulative variance, were plotted in pairs along the x-axis a axis (Figure S3).Furthermore, the first four PCs are plotted in Figure 1b,c.This an validated the suitability of the partitioning of our datasets, as the chemical space for the training and testing sets exhibited a similarity.Consequently, the testing set, ut for external validation, adequately represented the distribution of the log Papp value chemical space present in the training set.

Feature Selection and Development of QSPR Models
To evaluate the impact of our feature selection approach, we developed six Q models: MLR, PLS, SVM, RF, GBM, and SVM-RF-GBM.These models were constr using three different datasets: a dataset without feature selection (523 predictors), taset with RFE (60 predictors), and a dataset with both RFE and GA (41 predictors evaluated the performance of each model using RMSE and R 2 metrics for the trainin cross-validation, and testing set to evaluate the performance of each model (Tables S  and 1).Superior model performance is indicated by low RMSE values and high R 2 va while inferior performance is characterized by high RMSE values and low R 2 values results indicated that the performance metrics across the three datasets were quite si for each model.This suggests that our feature selection approach effectively reduce model complexity by eliminating less important predictors, while preserving

Feature Selection and Development of QSPR Models
To evaluate the impact of our feature selection approach, we developed six QSPR models: MLR, PLS, SVM, RF, GBM, and SVM-RF-GBM.These models were constructed using three different datasets: a dataset without feature selection (523 predictors), a dataset with RFE (60 predictors), and a dataset with both RFE and GA (41 predictors).We evaluated the performance of each model using RMSE and R 2 metrics for the training set, cross-validation, and testing set to evaluate the performance of each model (Tables S5, S6 and 1).Superior model performance is indicated by low RMSE values and high R 2 values, while inferior performance is characterized by high RMSE values and low R 2 values.Our results indicated that the performance metrics across the three datasets were quite similar for each model.This suggests that our feature selection approach effectively reduced the model complexity by eliminating less important predictors, while preserving the predictive performance.Consequently, we selected the RFE-GA approach, which identified 41 molecular descriptors for subsequent analysis.
The MLR and PLS models exhibited similar predictive capabilities, as observed from their RMSE and R 2 values across the training, testing, and cross-validation sets.Both models demonstrated the lowest performance, with the highest RMSE and lowest R 2 values (RMSE Test = 0.47 and R 2 Test = 0.63).The SVM, RF, and GBM models showed enhanced predicted performance (RMSETest = 0.40, 0.39, and 0.39, respectively, R 2 = 0.73, 0.74, and 0.74, respectively).To further improve the predictive capacity, an ensemble of SVM, RF, and GBM was utilized to create the SVM-RF-GBM model, which exhibited superior performance (RMSE Test = 0.38 and R 2 = 0.76) and was selected for subsequent analysis (Figure 2 and Table S7).These results suggest that the apparent permeability in Caco-2 cells is influenced by multiple variables with non-linear relationships.The SVM-RF-GBM model demonstrates a predictive capability comparable to that reported by Wang and Chen [28], with similar R 2 values and a marginally lower RMSE in this study (Table 1), despite employing different modeling approaches.These findings imply that there is a multivariate explanation for apparent permeability, as it is influenced by 41  The MLR and PLS models exhibited similar predictive capabilities, as observed from their RMSE and R 2 values across the training, testing, and cross-validation sets.Both models demonstrated the lowest performance, with the highest RMSE and lowest R 2 values (RMSETest = 0.47 and R 2 Test = 0.63).The SVM, RF, and GBM models showed enhanced predicted performance (RMSETest = 0.40, 0.39, and 0.39, respectively, R 2 = 0.73, 0.74, and 0.74, respectively).To further improve the predictive capacity, an ensemble of SVM, RF, and GBM was utilized to create the SVM-RF-GBM model, which exhibited superior performance (RMSETest = 0.38 and R 2 = 0.76) and was selected for subsequent analysis (Figure 2 and Table S7).These results suggest that the apparent permeability in Caco-2 cells is influenced by multiple variables with non-linear relationships.The SVM-RF-GBM model demonstrates a predictive capability comparable to that reported by Wang and Chen [28], with similar R 2 values and a marginally lower RMSE in this study (Table 1), despite employing different modeling approaches.These findings imply that there is a multivariate explanation for apparent permeability, as it is influenced by 41 molecular descriptors.

Mechanism Interpretation
In accordance with the OECD guidelines, a QSPR study should include mechanism interpretation whenever possible.Apparent permeability quantifies the transport rate of a compound across Caco-2 cells over time, facilitating the inference of pharmacological parameters, such as absorption and bioavailability [29].
Table 2 reveals that the top ten descriptors that are predominantly associated with hydrogen bonds, including maxHBint7, maxHBint5, maxHBint9, maxHBint3, Eta_D_epsiD, and maxHBd, consistent with the findings by Wang and Chen [28].Additionally, descriptors related to lipophilicity, such as AlogP and electrotopological state indices (E-state) [30], offer electronic and topological insights into the molecular structure.E-state indices are classified into four groups: E-state sums, Atom-type counts, the E-state minimum, and the E-state maximum [30,31].According to the QSPR SVM-RF-GBM model, the top ten molecular descriptors exhibit moderate to low linear correlation (Table 2) with the experimental log Papp values, as indicated by the Pearson correlation coefficient (r).The importance of the variables for each model are detailed in the supplementary information (Figures S4 and S5).Furthermore, the 41 molecular descriptors demonstrated moderate, low, or no linear correlation with the experimental log Papp values, as depicted in the scatter plot (Figures S6 and S7).Additionally, outliers are observed.

Applicability Domain of the SVM-RF-GBM Model
The applicability domain determines the compounds for which the log Papp predictions may not be reliable, in accordance with the third criterion in the OECD principles [32].Leverage values for both the training and testing sets were obtained by calculating the hat matrix of the training set.The Williams plot exhibited x-axis leverages and y-axis standardized residuals.The horizontal dashed lines indicate a standard residual of ±3, while the vertical dashed line denotes the warning leverage, h* = 0.0866, indicating the limits of the applicability domain.According to the Williams plot, 38 outliers were identified outside the applicability domain, 31 (2.13%) outliers of which were from the training set and 7 (1.93%) of which were from the testing set (Figure 3).
the hat matrix of the training set.The Williams plot exhibited x-axis leverages and y-axis standardized residuals.The horizontal dashed lines indicate a standard residual of ±3, while the vertical dashed line denotes the warning leverage, h* = 0.0866, indicating the limits of the applicability domain.According to the Williams plot, 38 outliers were identified outside the applicability domain, 31 (2.13%) outliers of which were from the training set and 7 (1.93%) of which were from the testing set (Figure 3).

Comparative Analysis of the SVM-RF-GBM Model and the Other Models
We compared our results with those of previous QSPR investigations targeting the prediction of log Papp values in Caco-2 cells (Table 3).This comparison encompassed details such as the methodology employed, molecular descriptors utilized, number size, and RMSETest and R 2 Test values.Our methodology used a combination of 2D and 3D descriptors sourced from PaDEL-Descriptor and alvaDesc, drawing from a substantial dataset comprising 1817 compounds.In contrast, the study conducted by Wang and Chen [28] employed a dual-RBF neural network model utilizing only 2D descriptors, achieving a similar performance, with an RMSETest of 0.39 and an R 2 Test of 0.76.Other investigations featured QSPR models constructed with smaller sample sizes, thereby limiting the diversity of the compounds amenable to prediction.

Comparative Analysis of the SVM-RF-GBM Model and the Other Models
We compared our results with those of previous QSPR investigations targeting the prediction of log Papp values in Caco-2 cells (Table 3).This comparison encompassed details such as the methodology employed, molecular descriptors utilized, number size, and RMSE Test and R 2  Test values.Our methodology used a combination of 2D and 3D descriptors sourced from PaDEL-Descriptor and alvaDesc, drawing from a substantial dataset comprising 1817 compounds.In contrast, the study conducted by Wang and Chen [28] employed a dual-RBF neural network model utilizing only 2D descriptors, achieving a similar performance, with an RMSE Test of 0.39 and an R 2 Test of 0.76.Other investigations featured QSPR models constructed with smaller sample sizes, thereby limiting the diversity of the compounds amenable to prediction.

Prediction of Log Papp Values for Natural Products from Peru
The SVM-RF-GBM model predicted the log Papp values of 516 natural products sourced from the biodiversity in Peru.However, 14 compounds were excluded due to falling outside the applicability domain based on the warning leverage, h* = 0.0866, which defines the boundary.One duplicate compound (gallic acid) was removed as it appeared in two different articles.Subsequently, the remaining 13 compounds outside the applicability domain were removed (Figure 4): SCH 644343, SCH 644342, hexahydroxydiphenic acid (HHDP), cinchonain Ib, gallic acid, 2,4,6-trihydroxybenzaldehyde (THBAL), 2,4,6-trihydroxybenzoic acid (THBA), ariakemycin A, ariakemycin B, pumilacidin C, pumilacidin E, cubene, and crofelemer.The final dataset, used for further analysis, comprised 502 compounds.
The natural products were classified into six metabolic pathways: alkaloids, fatty acids, polyketides, terpenoids, shikimates and phenylpropanoids, and amino acids and peptides (Figure 5b).The natural products showed a diverse distribution of the log Papp values in each pathway, encompassing high, medium, and low apparent permeability.For a clearer overview of the compounds characterized by high permeability, each metabolic pathway was further subdivided into chemical groups, with the exception of polyketides and amino acids and peptides.Upon analyzing these chemical groups, 11 groups primarily exhibited high apparent permeability (log Papp > −5) [33] (Figure 5c).These groups included apocarotenoids, fatty esters, monoterpenoids, diterpenoids, sesquiterpenoids, fatty acyls, lysine alkaloids and pseudoalkaloids, and polyketides, indicating the potential for high intestinal absorption via passive transport.Additionally, eight groups showed a distribution of log Papp values primarily in the medium to high permeability range, encompassing phenylpropanoids, stilbenoids, phenolic acids, coumarins and diarylheptanoids, lignans, isoflavonoids, fatty acids and conjugates, and fatty amides.The remaining eight groups demonstrated a mixed distribution, with compounds falling into low, medium, and high log Papp categories, including tryptophan alkaloids, tyrosine alkaloids, meroterpenoids and triterpenoids, steroids, ornithine alkaloids, flavonoids, amino acids and peptides, and xanthones.5a).
The natural products were classified into six metabolic pathways: alkaloids, fatty acids, polyketides, terpenoids, shikimates and phenylpropanoids, and amino acids and peptides (Figure 5b).The natural products showed a diverse distribution of the log Papp values in each pathway, encompassing high, medium, and low apparent permeability.For a clearer overview of the compounds characterized by high permeability, each metabolic pathway was further subdivided into chemical groups, with the exception of polyketides and amino acids and peptides.Upon analyzing these chemical groups, 11 groups primarily exhibited high apparent permeability (log Papp > −5) [33] (Figure 5c).These groups included apocarotenoids, fatty esters, monoterpenoids, diterpenoids, sesquiterpenoids, fatty acyls, lysine alkaloids and pseudoalkaloids, and polyketides, indicating the potential for high intestinal absorption via passive transport.Additionally, eight groups showed a distribution of log Papp values primarily in the medium to high permeability range, encompassing phenylpropanoids, stilbenoids, phenolic acids, coumarins and diarylheptanoids, lignans, isoflavonoids, fatty acids and conjugates, and fatty amides.The remaining eight groups demonstrated a mixed distribution, with compounds falling into low, medium, and high log Papp categories, including tryptophan alkaloids, tyrosine alkaloids, meroterpenoids and triterpenoids, steroids, ornithine alkaloids, flavonoids, amino acids and peptides, and xanthones.
In the terpenoid pathway, apocarotenoids, meroterpenoids, diterpenoids, and sesquiterpenoids primarily exhibited high apparent permeability.In the fatty acid pathway, fatty esters and fatty acyls showed high permeability.Phenolic compounds, which constitute the largest group of secondary metabolites in plants, are primarily derived from the shikimate and phenylpropanoid pathway [34].These compounds exhibited distributions of low, medium, and high apparent permeability, aligning with the chemical diversity of this group.Furthermore, we identified 50 compounds containing glycosidic bonds, with the majority falling into the low permeability category (n = 28), followed by the medium permeability category (n = 9), and the high permeability category (n = 13).
tute the largest group of secondary metabolites in plants, are primarily derived from the shikimate and phenylpropanoid pathway [34].These compounds exhibited distributions of low, medium, and high apparent permeability, aligning with the chemical diversity of this group.Furthermore, we identified 50 compounds containing glycosidic bonds, with the majority falling into the low permeability category (n = 28), followed by the medium permeability category (n = 9), and the high permeability category (n = 13).For a compound to cross the intestinal barrier, it must possess a certain degree of lipophilicity, while maintaining solubility in water.Compounds need to be dissolved in an aqueous medium to pass through the lipid bilayer of enterocytes.Some examples of natural products displaying high apparent permeability include α-ionone, R-(−)-Linalool, and methylcativate.These compounds are characterized by a predominance of hydrophobic regions, such as methyls, cyclohexanes, alkanes, and alkenes, facilitating the crossing of the lipid bilayer.Additionally, they contain few polar groups like carbonyl, hydroxyl, and ester groups, enabling the formation of hydrogen bonds with water molecules and providing solubility in an aqueous medium.Conversely, compounds with low apparent permeability, such as betacyanin, luteolin-7-O-glucoside, and genistin, are marked by the presence of numerous polar groups, such as carboxyl, hydroxyl, ether, and carbonyl, alongside fewer hydrophobic regions, leading to low lipophilicity and, subsequently, reducing the ability to cross the lipid bilayer [35].

Evaluation of Drug-Likeness of Natural Products
Assessing the drug-likeness of natural products is a pivotal step for predicting their potential as orally administered drug.The apparent permeability in Caco-2 cell lines serves as an indicator of intestinal absorption, a critical factor in the efficacy of orally administrated medications.Permeability is characterized by various physicochemical properties, including the molecular weight (MW), the octanol-water partition coefficient (logP), the number of hydrogen bond donors (HBDs), the number of hydrogen bond acceptors (HBAs), the topological surface area (TPSA), and the number of rotatable bonds (RBNs), all of which are known to influence intestinal permeability [36,37].The Lipinski rule of five (Ro5) [36] is a well-established set of criteria utilized for assessing permeability in oral drugs, encompassing parameters such as MW ≤ 500, ClogP ≤ 5 (or MLogP ≤ 4.15), HBD ≤ 5, and HBA ≤ 10.Veber et al. [37] expanded upon these criteria, including TPSA ≤ 140 and RBN ≤ 10.
In this study, we characterized our natural product dataset using these six physicochemical descriptors (Figure 6).About 75.5% (n = 374) of the natural products exhibited an MlogP < 5, 90.0% (n = 452) had an MW ≤ 500, 90.4% (n = 454) met the criteria for a HBA ≤ 10, 88.2% (n = 443) satisfied a HBD ≤ 5, 79.4% (n = 399) possessed an RBN ≤ 10, and 86.4% (n = 434) adhered to a TPSA ≤ 140.Moreover, 90% (n = 450) of the natural products in our dataset conformed to the Ro5 requirement for oral bioavailability, while 10% (n = 52) fell outside the Ro5 boundary in regard to at least one molecular property (Figure 7a).These results suggest that the majority of natural products in our dataset are suitable for oral administration.Additionally, 99.4% (n = 344) of the natural products with high apparent permeability (log Papp > −5) conformed to the Ro5 requirement (Figure 7b), further supporting their suitability for oral administration and validating the predicted log Papp values generated by our SVM-RF-GBM model for crossing the intestinal barrier.To gain a deeper understating of compounds well suited for oral administration based on the predicted log Papp values of our natural product dataset, we employed eight drug-likeness scores (DLSs).These indices are defined as the ratio between the number of satisfied rules (nRules) and the total number of rules (tRules) (Equation ( 4)) [38], as follows: • DLS_01 is derived from a modified version of Ro5 [36], comprising four rules, according to Equation (4); Additionally, 99.4% (n = 344) of the natural products with high apparent permeability (log Papp > −5) conformed to the Ro5 requirement (Figure 7b), further supporting their suitability for oral administration and validating the predicted log Papp values generated by our SVM-RF-GBM model for crossing the intestinal barrier.
To gain a deeper understating of compounds well suited for oral administration based on the predicted log Papp values of our natural product dataset, we employed eight drug-likeness scores (DLSs).These indices are defined as the ratio between the number of satisfied rules (nRules) and the total number of rules (tRules) (Equation ( 1)) [38], as follows: • DLS_01 is derived from a modified version of Ro5 [36], comprising four rules, according to Equation (1); • DLS_02 is defined by six rules, including a HBD ≤ 5, a HBA in the range 1-8, an MW in the range 200-450, an MlogP in the range −2.0-4.5, an RBN in the range from 1-9, and the number of rings ≤ 5 [39]; • DLS_03 incorporates criteria such as a HBD ≤ 5, a HBA ≤ 10, an MW from 200 to 500, an MlogP in the range −5.0-5.0, an RBN ≤ 8, and a formal charge in the range −2-2 [40]; • DLS_04 is defined by parameters including a HBD ≤ 5, a HBA in the range 2-10, an MW in the range 78-500, an MlogP in the range −0.5-5.0, the ratio of the number of C sp3 atoms to the total number of non-halogen atoms in the range 0.15-0.8, the ratio of the number of hydrogen atoms to the total number of non-halogen atoms in the range 0.6-1.6, and the ratio of molecular unsaturation (Unsat-p) in the range 0.10-0.45[41]; • DLS_05 relies on criteria such as the ratio of the total number of oxygen and nitrogen atoms and the number of C sp3 atoms in the range 0.10-1.80,and a descriptor Unsat-p ≤ 0.43 [42]; • DLS_06 is based on a HBD ≤ 5, a HBA ≤ 10, an MW ≤ 500, an MlogP ≤ 5, an RBN ≤ 10, and a TPSA ≤ 140 [43]; • DLS_07 is based on the criteria of an RBN ≤ 10, and a TPSA ≤ 140 [37]; • DLS_cons represents the average drug-likeness score obtained from the previously described criteria.
We found that the majority of natural products with high apparent permeability (log Papp > −5) exhibited drug-likeness scores within the range of 0.75 to 1, indicating their potential as drugs (Figure 8a-h).Specifically, DLS_cons revealed that 82.7% (n = 286) of the predicted high apparent permeability of the natural products achieved scores within this range.These different approaches to assessing drug-likeness based on various criteria of molecular properties expressed in scores are consistent with the predicted log Papp values generated by the SVM-RF-GBM model, reaffirming its robust predictive capability.
We found that the majority of natural products with high apparent permeability (log Papp > −5) exhibited drug-likeness scores within the range of 0.75 to 1, indicating their potential as drugs (Figure 8a-h).Specifically, DLS_cons revealed that 82.7% (n = 286) of the predicted high apparent permeability of the natural products achieved scores within this range.These different approaches to assessing drug-likeness based on various criteria of molecular properties expressed in scores are consistent with the predicted log Papp values generated by the SVM-RF-GBM model, reaffirming its robust predictive capability.

Materials and Methods
A workflow illustrating the development of the QSPR models and the predictions involving natural products sourced from the biodiversity in Peru is shown in Figure 9.

Materials and Methods
A workflow illustrating the development of the QSPR models and the predictions involving natural products sourced from the biodiversity in Peru is shown in Figure 9.

Data Collection
A dataset was compiled consisting of molecules with a measured logarithm of apparent permeability (log Papp) in Caco-2 cells and their simplified molecular input line entry specification (SMILES), as reported by Wang and Chen [28].
Apparent permeability is measured in vitro in cell lines and is calculated using Equation (1) [19,29].
where Papp is measured in cm s −1 ; dQ/dt is the drug permeation rate through the cells or the amount of substance transferred across the membrane per unit of time (µmol s −1 ); C0 is the initial concentration of the donor compartment (µM); and A is the area of the monolayer (cm 2 ) [29].Initially, a total of 1827 molecules were obtained.Duplicated molecules were removed based on their SMILES codes, resulting in a refined dataset containing 1817

Data Collection
A dataset was compiled consisting of molecules with a measured logarithm of apparent permeability (log Papp) in Caco-2 cells and their simplified molecular input line entry specification (SMILES), as reported by Wang and Chen [28].
Apparent permeability is measured in vitro in cell lines and is calculated using Equation (2) [19,29].
where Papp is measured in cm s −1 ; dQ/dt is the drug permeation rate through the cells or the amount of substance transferred across the membrane per unit of time (µmol s −1 ); C 0 is the initial concentration of the donor compartment (µM); and A is the area of the monolayer (cm 2 ) [29].Initially, a total of 1827 molecules were obtained.Duplicated molecules were removed based on their SMILES codes, resulting in a refined dataset containing 1817 unique molecules (Table S1).
Natural products sourced from the biodiversity in Peru were gathered from the literature up to 2019, encompassing 48 articles [3][4][5].The collected data encompassed the SMILES codes, chemical names, organism names, common organism names, and digital object identifiers (DOIs).The pathway classification of natural products was conducted using NPClassifier [12], based on 3 levels, namely the pathway, group, and subgroup, and considered its presence on the glycosidic bond, wherein the SMILES code was used to classified them.

The Calculation and Optimization of the 3D Structure
The generation of 3D coordinates from the SMILES format was performed using the Molconvert software version 24.1.2.Subsequently, hydrogens were added under the condition of pH = 7.4.The 3D structures were then optimized using the steepest descent method with the Merck molecular force field 94 (MMFF 94) [89].This optimization process was executed using 5000 steps, with the OpenBabel software 3.1.1[90].In cases where certain molecules could not be optimized using MMFF 94, the Generalized Amber Force Field (GAFF) [91] method was employed.

Molecular Descriptor Calculation
A total of 7110 molecular descriptors were calculated, encompassing both 2D and 3D molecular descriptors, using the SMILES codes of the compounds.The calculation was carried out using PaDEL-Descriptor 2.21 [92], which contributed 1785 descriptors, and alvaDesc v2.0.16 [38], which provided 5235 descriptors.

Splitting of the Dataset
In order to prevent overfitting, the dataset was split into a training set and a testing set, maintaining an 80:20 ratio.This partitioning resulted in 1455 molecules in the training set and 362 molecules in the testing set.The training set was employed for model development, while the testing set was reserved for external validation.

Preprocessing
The dataset initially contained missing values, which was addressed by imputing them using the median value.Subsequently, to eliminate uninformative variables and mitigate redundancy, the descriptors with low variance and those displaying high correlations (indicated by a Pearson correlation coefficient absolute value > 0.70) were removed.The remaining dataset contained 521 descriptors.Finally, the descriptors were centered and scaled.

Feature Selection
Feature selection is a set of methods for reducing dimensionality by selecting a subset of variables from the original feature set while eliminating irrelevant, redundant, or noisy features.Consequently, it can lead to enhanced learning performance, reduced computational expenses, and improved model interpretability.
To achieve this, recursive feature elimination (RFE) was employed to select the most informative molecular descriptors.This method involved an iterative process, where subsets of descriptors ranging from 1 to 200 were evaluated.At each iteration, a random forest model was trained on the selected subset of descriptors, and the performance was assessed using 5-fold cross-validation (CV), which was repeated 20 times to ensure its robustness.RFE involves fitting a random forest model to all the predictors, initially, and then ranking each predictor based on its importance to the model.Subsequently, at each iteration of the feature selection, the least important predictor is removed, and the model is retrained.This iterative process of removing predictors, one by one, continues until the optimal subset of descriptors is determined based on the root mean square error (RMSE) performance metric.This iterative process led to the selection of 60 molecular descriptors (Table S2 and Figure S1).Subsequently, a genetic algorithm (GA) was integrated with a random forest algorithm to further refine the selection.The GA employed a systematic approach inspired by principles of natural selection for modeling.The GA parameters were configured as follows: maximum generations = 100, population per generation = 50, crossover probability = 0.8, mutation probability = 0.1, and elitism = 0.Moreover, 5-fold cross-validation was employed for the model evaluation.The GA process began by initializing a population of potential solutions, where each solution represented a subset of molecular descriptors.Then, the fitness of each solution was evaluated using a fitness function.In this case, the fitness function aimed to minimize the RMSE metric.Through selection, crossover, and mutation operations, the population iteratively evolved over 100 generations (Figure S2).The process terminated after reaching the maximum generation limit, culminating in the identification of a subset of 41 molecular descriptors optimized for modeling.Additionally, principal component analysis (PCA) was conducted to visualize the distribution of the 41 selected descriptors in chemical space.

Modeling
In this study, six QSPR models were constructed using the training set, incorporating the 41 selected molecular descriptors as independent variables and the log Papp value as the dependent variable (training set and testing set are given in Tables S3 and S4).The models employed in this study are described below.
Multiple linear regression (MLR), a linear model that leverages multiple independent variables to predict one dependent variable, was used [22].It assumes a linear relationship between the independent variables and the dependent variable.Partial least squares (PLS), a model that transforms original independent variables into principal components or latent variables represented as linear combinations of independent variables, was used [22].This method is particularly useful when the predictors are highly collinear or when there are more predictors than observations.Support vector machine (SVM), an algorithm suitable for regression and classification problems, was used.During the regression, the input data is transformed into a feature space through a kernel function, finding a hyperplane that best fits the feature space [93].For this study, a radial basis function kernel was selected.Decision trees are non-parametric algorithms that partition data into smaller regions using a set of splitting rules.These models produce simple rules that are easy to interpret; however, they typically lack predictive performance capabilities compared to more complex algorithms [93].Random forest (RF) regression, which combines a set of decision trees with good predictive ability and averages their results to enhance the model's predictive performance, was used [93,94].Gradient boosting machine (GBM), an algorithm that combines shallow decision trees to facilitate learning and improve prediction accuracy, was also used [93].Unlike RF models, GBM builds trees sequentially, with each tree correcting the error of its predecessor.Ensemble methods, which integrate the best-performing models to create a new model or meta-model, achieving a predictive performance surpassing that of individual models, were used [93,94].In this study, a linear model was trained using predicted values from the SVM, RF, and GBM models using the ensemble method, denoted as SVM-RF-GBM.
During model training, the hyperparameters were adjusted to control the model's complexity.To assess the impact of the model tuning parameters on the performance and determine the optimal hyperparameter settings, we utilized 5-fold cross-validation, which was repeated five times.The hyperparameters for each model were as follows: MLR, no hyperparameters; PLS, the number of components (ncomp); SVM, sigma (sigma) and cost (C); RF, the number of randomly selected predictors (mtry); GBM, the number of boost-ing iterations (n.trees) and the max tree depth (interaction.depth);and SVM-RF-GBM, no hyperparameters.The following parameters were employed for model building: MLR, no additional parameters were employed; PLS, ncomp = 40; SVM, epsilon = 0.1, C = 2, and sigma = 0.015; RF, mtry = 18 and number of trees = 500; GBM, n.trees = 100, interaction.depth= 16, shrinkage = 0.1, and the minimum number of observations in the terminal nodes of the trees (n.minobsinnode) = 10.

Model Validation
For model validation, we employed 5-fold cross-validation repeated five times for internal validation, while the testing set was used for external validation.The performance of each model was evaluated using six statistical parameters: the root mean square error of the training set (RMSE Train ), the coefficient of determination of the training set (R 2 Train ), the root mean square error of the cross-validation (RMSE CV ), the coefficient of determination of the cross-validation (R 2 CV ), the root mean square error of the testing set (RMSE Test ), and the coefficient of determination of the testing set (R 2 Test ).

Applicability Domain
The applicability domain of the QSPR model ensures that the models can reasonably and accurately predict uncertain compounds [95].The leverage values or hat values measure the distance of a data point from the center of the distribution of the training set and are employed to assess whether a compound falls within the domain.The hat values are calculated from the diagonal values in the hat matrix (H), which are defined by Equation ( 3).
In a training set matrix with molecular descriptors of n rows and p columns, X n×p , the hat matrix (H) is calculated using Equation (3), where X ′ is the transpose matrix and (X ′ X) −1 is the inverse matrix of (X ′ X).
The applicability domain can be determined by analyzing the leverage and standardize residuals to create a Williams plot.Leverages greater than the warning leverage (h*) indicate that the predicted values are unreliable [95,96].The warning leverage (h*) is computed as 3 (p + 1)/n, where n is the number of compounds and p is the number of predictors.Furthermore, standardized residuals were computed for both the training and testing sets [96].However, this approach does not allow for calculating the leverage values for new datasets.Therefore, for a new dataset denoted as a matrix (u), the hat matrix is calculated using Equation (4).
where X is the matrix of the training set, u is the matrix of the unknown dataset, and u ′ is the transpose matrix.Finally, the leverage values of the new sample, h, are obtained from the diagonal of the matrix.
In accordance with the OECD guidelines [32], a QSPR study should encompass an applicability domain.The leverage values for the training and testing sets were calculated using the training set of the hat matrix, and standardized residuals were computed for both sets.Subsequently, a Williams plot was constructed using the leverages and standardized residuals [96].

Computational Processing
The computational experiments were conducted at the Center for High Computational Performance of the Peruvian Amazon, affiliated with the Research Institute of the Peruvian Amazon (IIAP) [97].To perform various essential tasks, the following software tools were employed: OpenBabel version 3.1.1was utilized for the addition of hydrogens, structure minimization, and conformer search; Molconvert was used to create 3D structures from the SMILES format; Marvin version 21.20.0;and the R programming language version 4.1.1[98] was used for data analysis, modeling, validation, and visualization.
comprising of 1817 unique compounds.The diversity of the dataset was evaluated based on five physicochemical properties: molecular weight (MW) (437.95 ± 152), logP (1.56 ± 2.0), number of hydrogen acceptors (7.24 ± 3.8), number of hydrogen donors (2.81 ± 2.6), topological surface area (TPSA) (106.51 ± 59.3), and number of rotatable bonds (6.23 ± 4.3).The log Papp values exhibited a mean of −5.34, ranging from a minimum of −7.70 to a maximum of −3.78.Both the training and testing sets exhibited a comparable distribution of log Papp values, as illustrated in Figure

Figure 1 .
Figure 1.Analysis of modeling dataset: (a) distribution of log Papp values for the training an ing sets; (b) PCA plot for PC1 and PC2 based on 41 selected descriptors; (c) PCA plot for PC PC4 based on 41 selected descriptors.Training set (red) and testing set (blue).

Figure 1 .
Figure 1.Analysis of modeling dataset: (a) distribution of log Papp values for the training and testing sets; (b) PCA plot for PC1 and PC2 based on 41 selected descriptors; (c) PCA plot for PC3 and PC4 based on 41 selected descriptors.Training set (red) and testing set (blue).

Figure 3 .
Figure 3. Williams plot.Red circles represent the training set; blue circles represent the testing set.h*: warning leverage.

Figure 3 .
Figure 3. Williams plot.Red circles represent the training set; blue circles represent the testing set.h*: warning leverage.

Figure 4 .
Figure 4. Leverage plot of natural product database from Peru.Horizontal dashed line: warning leverage, h* = 0.0866.The distribution of the log Papp values revealed that 68.93% (n = 346) of the compounds exhibited values greater than −5, classifying them as having high intestinal absorption [33].Another 21.51% (n = 108) fell within the range between −6 to −5, indicating medium intestinal absorption.Finally, 9.56% (n = 48) of the compounds had Log Papp values below −6, suggesting low intestinal absorption.The log Papp value distribution showed a mean log Papp value of −4.91, with values ranging from −6.59 (minimum) to −3.99 (maximum) (Figure5a).The natural products were classified into six metabolic pathways: alkaloids, fatty acids, polyketides, terpenoids, shikimates and phenylpropanoids, and amino acids and peptides (Figure5b).The natural products showed a diverse distribution of the log Papp values in each pathway, encompassing high, medium, and low apparent permeability.For a clearer overview of the compounds characterized by high permeability, each metabolic pathway was further subdivided into chemical groups, with the exception of polyketides and amino acids and peptides.Upon analyzing these chemical groups, 11 groups primarily exhibited high apparent permeability (log Papp > −5)[33] (Figure5c).These groups included apocarotenoids, fatty esters, monoterpenoids, diterpenoids, sesquiterpenoids, fatty acyls, lysine alkaloids and pseudoalkaloids, and polyketides, indicating the potential for high intestinal absorption via passive transport.Additionally, eight groups showed a distribution of log Papp values primarily in the medium to high permeability range, encompassing phenylpropanoids, stilbenoids, phenolic acids, coumarins and diarylheptanoids, lignans, isoflavonoids, fatty acids and conjugates, and fatty amides.The remaining eight groups demonstrated a mixed distribution, with compounds falling into low, medium, and high log Papp categories, including tryptophan alkaloids, tyrosine alkaloids, meroterpenoids and triterpenoids, steroids, ornithine alkaloids, flavonoids, amino acids and peptides, and xanthones.

Figure 4 .
Figure 4. Leverage plot of natural product database from Peru.Horizontal dashed line: warning leverage, h* = 0.0866.The distribution of the log Papp values revealed that 68.93% (n = 346) of the compounds exhibited values greater than −5, classifying them as having high intestinal absorption [33].Another 21.51% (n = 108) fell within the range between −6 to −5, indicating medium intestinal absorption.Finally, 9.56% (n = 48) of the compounds had Log Papp values below −6, suggesting low intestinal absorption.The log Papp value distribution showed a mean log Papp value of −4.91, with values ranging from −6.59 (minimum) to −3.99 (maximum) (Figure5a).The natural products were classified into six metabolic pathways: alkaloids, fatty acids, polyketides, terpenoids, shikimates and phenylpropanoids, and amino acids and peptides (Figure5b).The natural products showed a diverse distribution of the log Papp values in each pathway, encompassing high, medium, and low apparent permeability.For a clearer overview of the compounds characterized by high permeability, each metabolic pathway was further subdivided into chemical groups, with the exception of polyketides and amino acids and peptides.Upon analyzing these chemical groups, 11 groups primarily exhibited high apparent permeability (log Papp > −5)[33] (Figure5c).These groups included apocarotenoids, fatty esters, monoterpenoids, diterpenoids, sesquiterpenoids, fatty acyls, lysine alkaloids and pseudoalkaloids, and polyketides, indicating the potential for high intestinal absorption via passive transport.Additionally, eight groups showed a distribution of log Papp values primarily in the medium to high permeability range, encompassing phenylpropanoids, stilbenoids, phenolic acids, coumarins and diarylheptanoids, lignans, isoflavonoids, fatty acids and conjugates, and fatty amides.The remaining eight groups demonstrated a mixed distribution, with compounds falling into low, medium, and high log Papp categories, including tryptophan alkaloids, tyrosine alkaloids, meroterpenoids and triterpenoids, steroids, ornithine alkaloids, flavonoids, amino acids and peptides, and xanthones.In the terpenoid pathway, apocarotenoids, meroterpenoids, diterpenoids, and sesquiterpenoids primarily exhibited high apparent permeability.In the fatty acid pathway, fatty esters and fatty acyls showed high permeability.Phenolic compounds, which constitute the largest group of secondary metabolites in plants, are primarily derived from the shikimate and phenylpropanoid pathway[34].These compounds exhibited distributions of low, medium, and high apparent permeability, aligning with the chemical diversity of this group.Furthermore, we identified 50 compounds containing glycosidic bonds, with the majority falling into the low permeability category (n = 28), followed by the medium permeability category (n = 9), and the high permeability category (n = 13).

Figure 5 .
Figure 5. (a) Distribution of predicted log Papp values in natural products; (b) distribution of predicted log Papp values by chemical groups; (c) distribution of predicted log Papp values by metabolic pathway.Red circles represent the presence of a glycosidic bond; black circles indicate the absence of a glycosidic bond.

Pharmaceuticals 2024 , 10 Figure 6 .
Figure 6.Histogram of six physicochemical properties for a set of 504 natural products source biodiversity in Peru: (a) MlogP; (b) MW; (c) HBA; (d) HBD; (e) RBN; and (f) TPSA.The com Lipinski and Veber areas are shown in green.Additionally, 99.4% (n = 344) of the natural products with high apparent permea (log Papp > −5) conformed to the Ro5 requirement (Figure 7b), further supporting suitability for oral administration and validating the predicted log Papp values gene by our SVM-RF-GBM model for crossing the intestinal barrier.

Figure 6 .
Figure 6.Histogram of six physicochemical properties for a set of 504 natural products sourced the biodiversity in Peru: (a) MlogP; (b) MW; (c) HBA; (d) HBD; (e) RBN; and (f) TPSA.The compliant Lipinski and Veber areas are shown in green.
Lipinski and Veber areas are shown in green.

Figure 7 .
Figure 7. (a) Bar chat depicting the compliance of natural products with Ro5; (b) distribution of predicted log Papp values among natural products categorized according to their compliance with Ro5.

Figure 7 .
Figure 7. (a) Bar chat depicting the compliance of natural products with Ro5; (b) distribution of predicted log Papp values among natural products categorized according to their compliance with Ro5.

Figure 9 .
Figure 9. Overview of the workflow of the QSPR models for predicting Caco-2 cell apparent permeability in natural products from Peru.*: The training set was exclusively utilized in these steps.

Figure 9 .
Figure 9. Overview of the workflow of the QSPR models for predicting Caco-2 cell apparent permeability in natural products from Peru.*: The training set was exclusively utilized in these steps.

Table 1 .
Metrics for training set, testing set, and cross-validation, along with hyperparameters, employing RFE and GA for feature selection.

Table 2 .
The ten best molecular descriptors according to the SVM-RF-GBM model.

Table 3 .
Comparative results of QSPR studies.

Table 3 .
Comparative results of QSPR studies.