Machine Learning for Evaluating the Cytotoxicity of Mixtures of Nano-TiO2 and Heavy Metals: QSAR Model Apply Random Forest Algorithm after Clustering Analysis

With the development and application of nanomaterials, their impact on the environment and organisms has attracted attention. As a common nanomaterial, nano-titanium dioxide (nano-TiO2) can adsorb heavy metals in the environment. Quantitative structure-activity relationship (QSAR) modeling is often used to predict the cytotoxicity of a single substance. However, there is little research on the toxicity arising from interactions between nanomaterials and other substances. In this study, we exposed human renal cortex proximal tubule epithelial (HK-2) cells to mixtures of eight heavy metals with nano-TiO2, measured absorbance values with the CCK-8 assay, and calculated cell viability. Partial least squares (PLS) and two ensemble learning algorithms were used to build multiple QSAR models for the data sets; compared with PLS, the ensemble algorithms increased the test set R2 from 0.38 to 0.78 and 0.85 and decreased the RMSE from 0.18 to 0.12 and 0.10. After the better-performing random forest algorithm was selected, the K-means clustering algorithm was used to further optimize the model, increasing the test set R2 to 0.95 and decreasing the RMSE to 0.08 and 0.06. As a reliable machine learning algorithm, random forest can be used to predict the toxicity of mixtures of nano-metal oxides and heavy metals. Cluster analysis can effectively improve the stability and predictive ability of the model and provides a new idea for future cytotoxicity prediction models.


Introduction
Nanotechnology is continually developing, and nanomaterials are gradually being applied in various fields. With these developments, interest has arisen in determining the possible risks in the production and use of nanomaterials. Because of their small particle size and the increasing probability of contact with humans, nanoparticles can easily enter the human body and cause adverse effects [1]. Therefore, cytotoxicity research on nanomaterials is essential. Nano-metal oxides are widely used in industrial and daily fields [2], and the market is developing rapidly. As a common nano-metal oxide, nano-titanium dioxide (nano-TiO2) is continuously evaluated. Many studies have recorded its cytotoxic characteristics, such as weathering resistance [3], photocatalytic performance [4], and composite materials [5]. In addition to the well-known characteristic functions of nano-TiO2, including high specific surface area, stability, anti-corrosion, and photocatalysis [6], it can also provide sterilization [7], UV protection [8], and self-cleaning functions [9]. Therefore, it has been mass-produced and widely used, and its prospects for development are promising.
Due to their properties, such as small particle size, high specific surface area, and high active surface [10], nanoparticles are highly susceptible to adsorption with other [...]
[...] large descriptor gaps, removing them from the model would inevitably decrease the model's accuracy. After eliminating this situation, we tried to classify the selected mixtures [38]. The K-means algorithm [39] is an unsupervised clustering algorithm that is widely used because of its simple implementation and good clustering performance, and it can be used in the QSAR modeling process. In this study, cytotoxicity data for human renal cortex proximal tubule epithelial cells were measured in the laboratory, a QSAR model was constructed using the random forest algorithm after cluster analysis, and the validation indexes were compared, providing a reference for further study of cytotoxicity models of metal oxide and heavy metal mixtures. A flowchart of the work is shown in Figure 1.

Experimental Results
The results of the CCK-8 assay showed that the toxicity of 25 μmol/L nano-TiO2 alone was low, with a cell survival rate close to 1, and that toxicity increased when it was mixed with heavy metals. HK-2 cells exposed to the eight mixtures for 24 h showed different degrees of apoptosis, and the cell survival rate decreased as the concentration of the heavy metal compound increased. The mixture of ZnCl2 and 25 μmol/L nano-TiO2 decreased the survival rate of HK-2 cells significantly when the concentration of ZnCl2 increased from 240 μmol/L to 270 μmol/L. Compared with the other six heavy metal compounds, Pb(NO3)2 and SbCl3 showed higher cytotoxicity in mixtures with 25 μmol/L nano-TiO2. The cell survival rates for the eight mixtures are shown in Supplementary Material (Table S1), where the serial number corresponds to the serial number of the concentration in Table 1.

QSAR Model Calculation Results
Partial least squares (PLS) [40], a numerical algorithm that has been widely used in QSAR modeling in recent years, was the first method used to build the model. We take R2 > 0.6 as the first criterion for judging model quality. With PLS, however, no matter how the data set was divided, the test set R2 of the optimal model was only 0.38, far below the 0.6 threshold, so the predicted values of the model are not credible. The test set RMSE is 0.18. The R2 and RMSE of the training set are 0.08 and 0.27, respectively. The comparison between the predicted and observed values of the PLS model is shown in Supplementary Material (Figure S1), and the application domain is shown in Supplementary Material (Figure S2). We then used two typical ensemble learning methods, random forest and AdaBoost, to model the 72 samples.
All descriptors were put into the model, and the appropriate descriptors were screened out to fit the model. The selection of descriptors is consistent with the modeling method. "AdaBoostRegressor" and "RandomForestRegressor" from scikit-learn's ensemble module (sklearn.ensemble) were used, and the parameters "n_estimators" and "random_state" were each varied from 1 to 100 for fitting. The random state and the number of decision trees that maximize R2 were determined by grid search. The importance of each descriptor was calculated from the fitted model, and the descriptors whose "feature_importances_" value is greater than 0.1 were selected to re-fit the model. The importance of the optimized descriptors of the AdaBoost and random forest (RF) models is shown in Table 2, from which it can be seen that absolute hardness and adsorption energy have high importance in both models. The predicted values of cell survival rate are shown in Supplementary Material (Table S1). The comparison between the predicted and observed values of the two models is shown in Figure 2. The dispersion of the scattered points in the figure gives an intuitive view of prediction accuracy. The prediction results of the random forest model are less dispersed than those of the AdaBoost model, which indicates that the model based on the random forest algorithm is more accurate than the one based on the AdaBoost algorithm.
To explore the influence of cluster analysis on QSAR modeling, the random forest algorithm was selected to continue the study. The K-means clustering algorithm was used to classify all data into Category 1 and Category 2. Category 1 included CdCl2, ZnCl2, MnCl2, and CoCl2. Category 2 included CuSO4, NiCl2, Pb(NO3)2, and SbCl3. Model A is a random forest model built from Category 1. Model B is a random forest model built from Category 2.
The parameters were cycled from 1 to 100, the values of "n_estimators" and "random_state" that maximize R2 were determined, and the descriptors whose "feature_importances_" value is greater than 0.1 were selected to re-fit the model. Models A and B were built on the Category 1 and Category 2 data sets separately to observe the influence of the different categories on the predicted activity. After screening and re-fitting, the importance of the optimized descriptors of Models A and B is shown in Table 2. Across the four models, absolute hardness and adsorption energy are the two descriptors whose importance is always high. In addition, the importance of the lowest orbital energy in Model A cannot be ignored.
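The search over "n_estimators" and "random_state" followed by importance-based screening might look roughly like the sketch below. It assumes scikit-learn's RandomForestRegressor, uses synthetic stand-in data rather than the descriptor table in Table S1, scores candidates by test set R2 as the text describes, and searches a much coarser grid than 1 to 100 to keep the run short. The helper name search_rf is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 72 samples, 8 descriptors (NOT the paper's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(72, 8))
y = 0.6 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(scale=0.1, size=72)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def search_rf(X_tr, y_tr, X_te, y_te, n_range, seed_range):
    """Hypothetical helper: return (best R2, n_estimators, random_state, model)."""
    best = (-np.inf, None, None, None)
    for n in n_range:
        for seed in seed_range:
            model = RandomForestRegressor(n_estimators=n, random_state=seed)
            model.fit(X_tr, y_tr)
            r2 = r2_score(y_te, model.predict(X_te))
            if r2 > best[0]:
                best = (r2, n, seed, model)
    return best

# The text cycles both parameters from 1 to 100; a coarse grid is assumed here.
n_grid, seed_grid = range(10, 101, 30), range(1, 4)
r2_full, n_best, seed_best, rf = search_rf(X_tr, y_tr, X_te, y_te, n_grid, seed_grid)

# Keep descriptors whose importance exceeds 0.1, then repeat the search.
keep = rf.feature_importances_ > 0.1
r2_sel, n_sel, seed_sel, rf_sel = search_rf(
    X_tr[:, keep], y_tr, X_te[:, keep], y_te, n_grid, seed_grid
)
print(n_best, seed_best, int(keep.sum()), n_sel, seed_sel)
```

Because the importances sum to 1, at least one of eight descriptors always exceeds 0.1, so the screened model is never empty. Choosing parameters by test set R2 follows the text; holding out a separate validation split would be a more conservative variant.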
The comparison between the predicted values and the observed values of the two cluster analysis models is shown in Figure 3. The cluster analysis model based on random forest algorithm shows a low degree of dispersion and a more accurate prediction ability.
For Models C and D, we used cross-prediction between the training and test sets of the two categories. Model C used the random forest algorithm to model the training set of Category 1 and predict the test set of Category 2. Model D used the random forest algorithm to model the training set of Category 2 and predict the test set of Category 1. The parameters were cycled from 1 to 100, and the values of "n_estimators" and "random_state" that maximize R2 were used to build each model. The results show that the test set R2 is 0.31 for Model C and 0.35 for Model D, both far lower than 0.6, indicating that the reliability of the predicted values is low and the generalization ability of the models is insufficient. Because the results are so poor, screening descriptors to optimize these models is not meaningful. The comparison between the predicted and observed values of the two cross models is shown in Figure 3. The dispersion of Models C and D is much higher than that of Models A and B. Evidently, the model fitted to Category 1 samples is not suitable for the test set of Category 2, and the model fitted to Category 2 samples is not suitable for the test set of Category 1. This shows that building the models separately after cluster analysis is worthwhile: it can in principle improve the accuracy of the model, as further supported by the model validation results.

Model Validation Results
The QSAR model must be validated to provide a reasonable explanation for the data analysis. A larger squared correlation coefficient (R2), a smaller root mean square error (RMSE), and a larger leave-one-out (LOO) cross-validation correlation coefficient (Q2LOO) usually mean that the model has better prediction ability and robustness. RMSE is more reliable than R2 [41]. The Y-randomization correlation coefficient (R2yrand) and the Y-randomization Q2 (Q2yrand) both had low values. These indexes belong to the internal validation of the model. The larger Q2F1, Q2F2, Q2F3 and the concordance correlation coefficient (CCC), the better the generalization ability of the model. The specific model validation criteria were as follows: Q2LOO > 0.5 [42], R2yrand < 0.5, Q2yrand < 0.5 [43], RMSE < 0.2, R2 > 0.6, Q2F1 > 0.5 [44], Q2F2 > 0.5 [45], Q2F3 > 0.5 [46], CCC > 0.85 [47]. The number of samples in the training and test sets, the optimal "n_estimators" and "random_state" parameters of the AdaBoost, random forest, A, B, C and D models, and their corresponding validation results are shown in Table 3. Comparing the validation parameters of the random forest and AdaBoost algorithms, we find that for the same sample set, all the validation parameters of the random forest algorithm are far better than those of the AdaBoost algorithm. The Q2F3 value of the model based on the AdaBoost algorithm is lower than the evaluation standard, which shows that the generalization ability of this model is insufficient. Because the AdaBoost algorithm is much more sensitive to noise than the random forest algorithm, the model built with AdaBoost is inferior to the random forest model in generalization ability, robustness and prediction ability. The validation parameters of Models A and B meet the model evaluation standard. Most of the validation parameters of Models C and D do not meet the model evaluation criteria.
Comparing the validation parameters of the RF, A and B models in Table 3, it can be clearly seen that all the validation parameters of Models A and B are better than those of the random forest model, which shows that the cluster analysis models have better prediction ability, robustness and generalization ability in this case.
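The acceptance criteria listed above can be checked mechanically. The following is a minimal sketch, not the authors' code: the dictionary keys (for example "Q2_LOO") are hypothetical labels for the indexes in the text, and the example values passed in are the PLS test set figures quoted in this section.

```python
# Thresholds as stated in the text; key names are hypothetical labels.
CRITERIA = {
    "R2": lambda v: v > 0.6,
    "RMSE": lambda v: v < 0.2,
    "Q2_LOO": lambda v: v > 0.5,
    "R2_yrand": lambda v: v < 0.5,
    "Q2_yrand": lambda v: v < 0.5,
    "Q2_F1": lambda v: v > 0.5,
    "Q2_F2": lambda v: v > 0.5,
    "Q2_F3": lambda v: v > 0.5,
    "CCC": lambda v: v > 0.85,
}

def failed_criteria(metrics):
    """Return the names of supplied metrics that miss their threshold."""
    return [
        name
        for name, passes in CRITERIA.items()
        if name in metrics and not passes(metrics[name])
    ]

# PLS test set values quoted in the text: R2 = 0.38, RMSE = 0.18.
pls_failures = failed_criteria({"R2": 0.38, "RMSE": 0.18})
print(pls_failures)
```

Only metrics that are actually supplied are checked, so a partial set of results can be screened without inventing the missing values.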

Application domain analysis
Only

Research Results of the Toxicity Mechanisms
Comparing the descriptor screening results from the different samples, we found that three characteristics, namely lowest orbital energy, absolute hardness and adsorption energy, have an important influence on the survival rate of HK-2 cells, and that absolute hardness and adsorption energy are of high importance in all models. To a certain extent, the lowest orbital energy reflects the oxidizing ability of a substance. The smaller the lowest orbital energy, the more readily the orbital accepts electrons, and the stronger the resulting oxidizing ability, the more directly it influences the concentration of reactive oxygen species (ROS) in cells. Absolute hardness represents the difference between ionization potential and electron affinity, which essentially reflects the difficulty of gaining or losing electrons; this feature therefore also represents the oxidizing behavior of the material. Adsorption energy indicates the amount of energy required for different particles to adsorb to one another. The larger the adsorption energy, the easier it is for unstable atoms to separate from the material surface. Based on this mechanism analysis, we think that the mixture of nanomaterials and heavy metals damages HK-2 cells mainly by promoting the generation of free radicals.
We measured the ROS of some mixtures to explore the cytotoxicity mechanisms of mixtures of nano-TiO2 and heavy metals [48]. The results showed that this type of mixture increases intracellular hydroxyl radicals (·OH) when it leads to cell apoptosis, which is an oxidative stress reaction [49]. As the concentration of heavy metals in the mixture increased, the concentration of ROS generally increased (Figure S3), which indicates that the mixture of nanomaterials and heavy metals induced cell death by increasing the concentration of ROS in HK-2 cells.

Cell Experiments
HK-2 cells were cultured in Hyclone DMEM medium supplemented with 10% fetal bovine serum (FBS) and 100 units/mL penicillin/streptomycin and incubated at 37°C and 5% CO2. Nano-TiO2 and heavy metal powders were mixed in the prepared medium. In  Table 1.
HK-2 cells were cultured in 96-well plates at 37°C and 5% CO2 for 24 h. After the cells reached 70% confluence, the prepared mixture of nano-TiO2 and heavy metals was added. A blank control group of cells without the test mixture was set up. After 24 h of culture at 37°C and 5% CO2, the supernatant was removed, and 100 µL of Cell Counting Kit-8 (CCK-8) solution was added to each well. After 30 min, the absorbance of each well was measured with a microplate reader at a wavelength of 450 nm. Cell viability was calculated with Equation (1):
$$S = \frac{A_{exp} - A_{blank}}{A_{control} - A_{blank}} \quad (1)$$
where S is the cell survival rate, A_exp is the absorbance value of the experimental group, A_control is the absorbance value of the control group, and A_blank is the absorbance value of the blank control group.
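As a worked illustration of the viability calculation, the sketch below assumes Equation (1) takes the usual blank-corrected form S = (A_exp - A_blank) / (A_control - A_blank). The function name cell_viability and the absorbance readings are hypothetical, chosen only to show the arithmetic.

```python
def cell_viability(a_exp, a_control, a_blank):
    """Blank-corrected survival rate S (assumed form of Equation (1))."""
    denominator = a_control - a_blank
    if denominator == 0:
        raise ValueError("control and blank absorbance must differ")
    return (a_exp - a_blank) / denominator

# Hypothetical absorbance readings at 450 nm (not measured data).
s = cell_viability(a_exp=0.8, a_control=1.0, a_blank=0.2)
print(round(s, 2))
```

A value of S close to 1 means the treated wells absorb about as much as the untreated control, which matches how the text describes nano-TiO2 alone.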

Selection and Calculation of Descriptors
Based on the mixture's characteristics, we used quantum mechanical descriptors [50,51] to establish the QSAR model of the cytotoxicity of mixtures of nano-TiO2 with heavy metals. These descriptors included highest orbital energy, lowest orbital energy, ionization potential, electron affinity, absolute electronegativity, absolute hardness, molecular energy, and adsorption energy [52]. Based on density functional theory (DFT), the B3LYP functional and the LANL2DZ basis set were used in Gaussian [53] to optimize the structures and calculate the highest orbital energy, lowest orbital energy and molecular energy. After the quantum mechanical (QM) characteristics of each material were obtained, the remaining descriptors were derived using specific formulas, which are given in Supplementary Material (Table S2).
Generally, there are two mathematical models for calculating the toxicity of mixtures: concentration addition (CA) and independent action (IA) [54]. Apart from the adsorption of heavy metals on nano-TiO2, there is almost no other interaction or reaction between the two components, so the other descriptors of the mixture were calculated with the CA model. The calculation is given by Equation (2), in which D_mix is the descriptor of the mixture, D_i is the descriptor of component i, and x_i is the molar concentration of component i.
All final descriptor data are given in Supplementary Material (Table S1). During model construction, we found that not all molecular descriptors are suitable for building models, so only the descriptors that strongly influence the results were retained. Descriptors whose importance is less than 0.1 are usually considered unimportant and were screened out because they have little influence on the prediction results. Some closely related descriptors may arise from the same structural factors [55]; in such cases only one was included in the model.
Methods for selecting descriptors include the heuristic method (HM), genetic algorithm (GA), random forests (RF), etc. In the random forest algorithm, feature importance is obtained by computing the contribution of each feature in each tree of the forest, averaging over the trees, and comparing the averaged contributions between features. The Gini index [56] or the out-of-bag (OOB) error [57] can usually be used as the evaluation index. In this paper, we used the "feature_importances_" attribute of the fitted scikit-learn model and stopped the filtering when the R2 values of the training set and test set reached their maximum. That is, Gini importance is used to rank the features, and scikit-learn normalizes the Gini importances so that they sum to 1 to obtain the final "feature_importances_" output.
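The normalization and ranking of "feature_importances_" can be illustrated with a short sketch. It assumes scikit-learn's RandomForestRegressor and uses synthetic data with generic descriptor labels ("descriptor_0" and so on), so the ranking it prints says nothing about the real descriptors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with generic labels (NOT the paper's descriptors).
names = [f"descriptor_{i}" for i in range(8)]
rng = np.random.default_rng(1)
X = rng.normal(size=(72, len(names)))
y = X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.1, size=72)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_  # impurity-based, normalized to sum to 1

# Rank descriptors and keep those above the 0.1 threshold used in the text.
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
selected = [name for name, value in ranked if value > 0.1]
print(round(float(importances.sum()), 6), selected)
```

Because the values are normalized, the 0.1 cut-off can be read as "at least a tenth of the total importance", whatever the number of trees.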

Classification of Mixture Types
For the eight heavy metals included in this study, we used Python to classify the mixture types. We randomly selected two samples as the initial cluster centers. Using the descriptors, we calculated the Euclidean distance [58] from each point to each cluster center and assigned each point to the nearest center. Every time a sample was assigned, the cluster center was recalculated from the objects currently in the cluster, and finally the heavy metal sample set was divided into two clusters. According to the mixture descriptors, the mixtures were divided into two categories. Category 1 included CdCl2, ZnCl2, MnCl2, and CoCl2. Category 2 included CuSO4, NiCl2, Pb(NO3)2, and SbCl3.
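A two-cluster K-means split of this kind can be sketched as follows, assuming scikit-learn's KMeans. Note that this implementation updates the centers in batches after each full assignment pass rather than after every single sample, so it is a close relative of the procedure described, not a line-by-line copy. The eight rows are synthetic stand-ins with generic labels, deliberately generated as two well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic descriptor rows for eight samples (NOT the paper's values).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(4, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(4, 3))
X = np.vstack([group_a, group_b])
samples = [f"sample_{i}" for i in range(len(X))]

# Two clusters, Euclidean distance, several random initializations.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

categories = {
    c: [s for s, lab in zip(samples, labels) if lab == c] for c in (0, 1)
}
print(categories)
```

With real descriptors on very different scales, standardizing the columns before clustering is a common precaution, since Euclidean distance is dominated by the largest-valued descriptor.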

Data Set Division
An essential step in QSAR research is dividing the data set into a training set and a test set. We performed this division in a 3:1 ratio by the random sampling (RS) method. In this study, eight heavy metals were investigated. Nine concentrations of each heavy metal were selected and mixed with 25 µmol/L nano-TiO2, giving data for 72 samples. Therefore, 54 samples were randomly selected as the training set for model development, and the remaining 18 were placed in the test set to evaluate the model's predictive ability. For the classified samples, each category contains 36 samples: 27 were randomly selected as the training set, and the remaining 9 were placed in the test set.
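The 3:1 random split can be reproduced in outline with scikit-learn's train_test_split, as in the sketch below. The arrays are placeholders with the same sample counts as the study (72 overall, 36 per category); the seed is arbitrary and is an assumption, not the one used by the authors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the study's sample counts (values are not real data).
X_all = np.arange(72 * 8, dtype=float).reshape(72, 8)
y_all = np.linspace(0.0, 1.0, 72)

# 3:1 split of the full data set: 54 training and 18 test samples.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

# The same ratio applied to one 36-sample category: 27 training and 9 test.
X_cat, y_cat = X_all[:36], y_all[:36]
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_cat, y_cat, test_size=0.25, random_state=0
)
print(len(X_tr), len(X_te), len(Xc_tr), len(Xc_te))
```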

Algorithm Application
In the random forest algorithm, each input sample is passed to every tree in the forest for classification. Firstly, the algorithm randomly samples the data set N times with replacement to obtain a subset of the training set as a new training set (this sampling method is called bootstrap sampling) [59]. Secondly, in the new training set, K attributes are randomly drawn from the attribute set of characteristic variables as an attribute subset, and the optimal attribute in this subset is selected for node splitting. Finally, after M decision trees have been generated, the classification result is decided by the vote of the M trees, which together form the random forest.
The AdaBoost algorithm obtains a series of weak classifiers through repeated learning and then combines them into a strong classifier. Firstly, the same weight is given to each of the N training samples. Secondly, when the next training set is constructed, the weight of the training samples that were correctly classified by the weak classifier decreases, while the weight of misclassified samples increases. After the weights are updated, a new weak classifier is added in every round of iteration. Finally, the weight of weak classifiers with a low classification error rate is increased and the weight of those with a high classification error rate is reduced, so that the weak classifiers with low error rates play the decisive role when all weak classifiers are combined into the final strong classifier.

Model Validation
The internal validation of the QSAR model mainly validates the fitting ability and robustness of the model; the external validation mainly addresses the model's prediction ability. In addition to the commonly used squared correlation coefficient (R2) [60] and root mean square error (RMSE) [61] for internal validation, we also used the leave-one-out (LOO) cross-validation method [62] to evaluate the internal prediction ability of the model by calculating the correlation coefficient (Q2LOO) between the predicted value (ŷ_i) and the observed value (y_i), which reduces the probability of model overfitting. The Y-randomization correlation coefficient (R2yrand) and the Y-randomization Q2 (Q2yrand) generated by Y-randomization validation [43] test the model's robustness and guard against chance correlation. For the external validation, we assessed the model's generalization ability with the indexes Q2F1, Q2F2, Q2F3 and the concordance correlation coefficient (CCC). All of these evaluation indexes are calculated with specific formulas, which are given in Supplementary Material (Table S3).
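Since the formulas of Table S3 are not reproduced here, the sketch below uses the definitions of RMSE, Q2F1, Q2F2, Q2F3 and CCC that are commonly cited in the QSAR literature; treating them as the ones used in this study is an assumption. Function names and the toy numbers are hypothetical.

```python
import numpy as np

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def external_q2(y_train, y_test, y_pred):
    """Return (Q2F1, Q2F2, Q2F3) under the commonly cited definitions."""
    y_train, y_test, y_pred = (
        np.asarray(a, float) for a in (y_train, y_test, y_pred)
    )
    press = np.sum((y_test - y_pred) ** 2)  # prediction error sum of squares
    q2_f1 = 1 - press / np.sum((y_test - y_train.mean()) ** 2)
    q2_f2 = 1 - press / np.sum((y_test - y_test.mean()) ** 2)
    tss_train = np.sum((y_train - y_train.mean()) ** 2)
    q2_f3 = 1 - (press / len(y_test)) / (tss_train / len(y_train))
    return float(q2_f1), float(q2_f2), float(q2_f3)

def ccc(y, y_hat):
    """Concordance correlation coefficient between observed and predicted."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    num = 2 * np.sum((y - y.mean()) * (y_hat - y_hat.mean()))
    den = (
        np.sum((y - y.mean()) ** 2)
        + np.sum((y_hat - y_hat.mean()) ** 2)
        + len(y) * (y.mean() - y_hat.mean()) ** 2
    )
    return float(num / den)

# Toy numbers: perfect predictions score 1 on every index.
perfect = external_q2([0, 1, 2, 3], [1, 2], [1, 2])
swapped = external_q2([0, 1, 2, 3], [1, 2], [2, 1])
print(perfect, swapped, ccc([1, 2], [2, 1]))
```

The three Q2 variants differ only in the reference variance: Q2F1 centers the test observations on the training mean, Q2F2 on the test mean, and Q2F3 scales by the training set variance, which makes it less sensitive to how the test set happens to be distributed.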

Application Domain of the Model
The application domain (AD) is a spatial region defined by the descriptors of the training set samples and the studied properties. Only predictions for compounds within this space can be considered reliable; for samples beyond the application domain, the accuracy of the results cannot be guaranteed. We used a Williams plot to analyze the application domain of the QSAR model, in which the ordinate represents the standardized residual and the abscissa represents the sample leverage value. The leverage value is calculated with Equation (3):
$$h_i = x_i^{T} (X^{T} X)^{-1} x_i \quad (3)$$
where x_i represents the descriptor vector of sample i, and X represents the matrix composed of the descriptors of all training set samples.
We take ±2.5 standardized residuals as the critical values. When the standardized residual of a predicted value exceeds the critical values, or the leverage value h_i of the sample to be tested is greater than the critical leverage h*, the prediction quality is low. The critical leverage is calculated with Equation (4):
$$h^{*} = \frac{3(p + 1)}{n} \quad (4)$$
where p represents the number of descriptors, and n represents the number of samples in the training set.
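The application domain check can be sketched in a few lines, assuming the standard Williams plot definitions: leverage h_i = x_i^T (X^T X)^-1 x_i and critical leverage h* = 3(p + 1)/n. The function names, the synthetic descriptor matrix and the query point are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i for every row x_i of X_query."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def critical_leverage(n_train, n_descriptors):
    """h* = 3(p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_train

def outside_domain(h, std_residual, h_star, residual_limit=2.5):
    """Flag samples with high leverage or large standardized residuals."""
    return (h > h_star) | (np.abs(std_residual) > residual_limit)

# Synthetic training descriptors: 54 samples, 3 descriptors (not real data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(54, 3))
h_train = leverages(X_train, X_train)
h_star = critical_leverage(n_train=54, n_descriptors=3)

# A query point far from the training cloud gets a leverage above h*.
far_point = np.array([[10.0, 10.0, 10.0]])
flag = outside_domain(leverages(X_train, far_point), np.array([0.0]), h_star)
print(round(h_star, 4), bool(flag[0]))
```

The pseudo-inverse is used only as a numerical safeguard for nearly collinear descriptors; with a full-rank matrix it coincides with the ordinary inverse in Equation (3).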

Conclusions
We determined the cytotoxicity of mixtures of eight heavy metal compounds and nano-TiO2 on HK-2 cells and explored the toxicity mechanism. We first used partial least squares (PLS) to establish a QSAR model. The test set R2 and RMSE of this model are 0.38 and 0.18, respectively, indicating that the modeling effect is poor. PLS is widely used to model cytotoxicity, but the method has limitations. We then introduced the AdaBoost and random forest algorithms for QSAR modeling. These two ensemble learning methods can be used to evaluate the cell survival rate, and they improved the test set R2 to 0.78 and 0.85 and decreased the RMSE to 0.12 and 0.10, respectively. Of the two, the random forest model has better prediction ability, robustness and generalization ability, and the feasibility of the model was demonstrated. Therefore, when a standard regression method (such as PLS) performs poorly, ensemble learning methods such as the random forest algorithm are worth trying. In addition, after classification with the K-means clustering algorithm, the R2 improved to 0.95 and the RMSE decreased to 0.08 and 0.06, which shows the advantage of modeling after cluster analysis.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules27186125/s1. Model details (including descriptor data, observed and predicted values of cell survival rate, data set division) (Table S1) (EXCEL); Specific formulas for deducing descriptors (Table S2); Specific formulas for model validation (Table S3)