Anti-Cancer Drug Solubility Development within a Green Solvent: Design of Novel and Robust Mathematical Models Based on Artificial Intelligence

Supercritical CO2 (SC-CO2) is regarded as a promising alternative to problematic organic solvents in the pharmaceutical industry. Mathematical prediction and validation of drug solubility in SC-CO2 systems using artificial intelligence (AI) approaches has attracted considerable interest. This work aims to evaluate the solubility of tamoxifen, a chemotherapeutic drug, in SC-CO2 via machine learning (ML). Three distinct base models are boosted with the AdaBoost method: K-nearest Neighbor (KNN), Theil-Sen Regression (TSR), and Gaussian Process Regression (GPR). The two inputs are pressure and temperature, and the single output, Y, is solubility. ADA-KNN, ADA-GPR, and ADA-TSR achieve R2 values of 0.996, 0.967, and 0.883, respectively, and MAE values of 1.98 × 10−6, 1.33 × 10−6, and 2.33 × 10−6, respectively. Based on these metrics and on visual analysis, ADA-KNN was selected as the best model and used to obtain the optimum values, which can be represented as the vector (X1 = 329, X2 = 318.0, Y = 6.004 × 10−5).


Introduction
The discovery of novel drug molecules, followed by their introduction into clinical trials, is considered the main goal of the drug development industry for increasing the efficiency and reducing the side effects of drugs [1][2][3][4]. Solubility is one of the main parameters that influence drug efficiency [5,6], and low solubility is considered the most important challenge in the formulation of novel chemical entities [7]. Various techniques can be used to improve drug solubility, such as physical modification (e.g., nanosuspension), chemical modification (e.g., complexation and salt formation), and miscellaneous procedures (e.g., supercritical fluid (SCF) processes and solubilizers) [8][9][10][11].
SCFs, especially supercritical CO2 (SC-CO2), have recently been identified as a promising alternative to problematic organic solvents. Advantages such as cost-effectiveness, inertness, environmental friendliness, excellent chemical affinity with almost all organic solvents, safety of application, and non-toxicity have increased researchers' interest in applying them in pharmacology [12][13][14]. Additionally, two important properties of CO2, density and solvent power, can be tuned by adjusting the operating pressure and temperature and by properly controlling the process kinetics [15][16][17][18].
The development of predictive models to estimate the solubility of various types of drugs under real conditions has been a topic of interest. The artificial intelligence (AI) approach is known as a robust and efficient way to mathematically predict results in various scientific fields, such as nanotechnology, separation, extraction, chemical reactors, and transport phenomena [19][20][21][22][23].
Machine learning (ML) is a set of techniques and tools that use data to create a mathematical model for making predictions or performing analysis, and it is a critical part of artificial intelligence [24,25]. ML approaches are progressively replacing computational methods in scientific domains. ML models can now be applied to any problem with several input features and at least one target. These models extract input-output relationships using various strategies [26][27][28].
Boosting is a family of ensemble techniques that combine the outputs of several weak estimators to build a robust estimator. Boosting applies the weak estimators sequentially, so that the results of each weak estimator influence the next one. AdaBoost [29], in particular, is a representative boosting method that generates weak estimators step by step using reweighted training data.
In recent years, Gaussian process regression (GPR) has gained popularity as a data-driven modeling tool. Its popularity stems in part from its theoretical connection to Bayesian nonparametric statistics, infinite neural networks, kernel approaches in machine learning, and spatial statistics [30,31].
If the target data are numeric and continuous, neighbors-based regression such as KNN can be used. A query point's label is determined by averaging the labels of its nearest neighbors [32].
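The averaging rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the configuration used in this study: the function name `knn_regress`, the Euclidean distance, and the toy values are all hypothetical choices made for the example.

```python
from math import dist


def knn_regress(train_X, train_y, query, k):
    """Predict a target for `query` as the mean target of its k nearest training points."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Rank training points by Euclidean distance to the query (an assumed metric).
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest_targets = [y for _, y in ranked[:k]]
    return sum(nearest_targets) / k


# Toy (pressure, temperature) inputs with made-up targets, for illustration only.
toy_X = [(120.0, 308.0), (160.0, 308.0), (200.0, 318.0), (400.0, 338.0)]
toy_y = [1.0, 2.0, 4.0, 10.0]
prediction = knn_regress(toy_X, toy_y, query=(150.0, 308.0), k=2)
print(prediction)  # mean of the targets of the two closest points
```

Because pressure and temperature span different numeric ranges, a practical implementation would usually rescale the inputs before computing distances; that step is omitted here for brevity.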
Theil-Sen Regression is the other weak estimator used here. Compared with Ordinary Least Squares (OLS), the Theil-Sen regressor has comparable asymptotic efficiency and is an unbiased estimator. Unlike OLS, Theil-Sen is non-parametric, since it makes no assumptions about the underlying distribution of the data, and it withstands outliers more effectively [33,34].
The main novelty of this paper is the prediction of the optimized value of tamoxifen solubility in an SC-CO2 system via the ML approach. To achieve this, three ML-based predictive models, K-nearest Neighbor (KNN), Theil-Sen Regression (TSR), and Gaussian Process Regression (GPR), were developed. The comparison of the models showed that ADA-KNN is the most accurate and general model, since its predicted points lie closer to the actual data line for both the test and training data and its R2 value is higher.

Dataset
In this research, a small dataset was used with two inputs, X1 = P (bar) and X2 = T (K), and a single output, Y = solubility. The dataset contains only 32 data points taken from the literature [35], covering pressures of 120-400 bar and temperatures of 308-338 K. The entire dataset is displayed in Table 1.

Base Models
The first base model is Gaussian process regression (GPR), a kernel-based, non-parametric method grounded in statistical learning theory and Bayesian modeling. Together with the mean function, a kernel specifies the covariance function of the Gaussian process. One of the most significant advantages of GPR is its capacity to generalize well, particularly on small datasets [36][37][38]. When constructing a GPR model, the output Y is assumed to satisfy:

$$Y = f(X) + \xi$$

where f(X) is the underlying function, X is the input of the training data, X* is the test subset, and ξ ~ N(0, σ^2) is the error. The error variance σ^2 is calculated based on the input vector. The prior joint distribution of the actual target Y and the expected target y is [39,40]:

$$\begin{bmatrix} Y \\ y \end{bmatrix} \sim N\left(0, \begin{bmatrix} K + \sigma^2 I & K_*^T \\ K_* & K_{**} \end{bmatrix}\right)$$

where K = (k_ij) is the covariance kernel matrix of the training subset, whose elements measure the relation between X_i and X_j through the kernel k, K* is the covariance matrix between the test and training subsets, and K** is the covariance matrix of the test subset [36,41]. The posterior distribution of y (which, in Bayesian analysis, reflects information about uncertain quantities) is given by Equations (4)-(6):

$$y \mid Y \sim N(\hat{y}, \hat{\sigma}^2) \tag{4}$$

$$\hat{y} = K_* (K + \sigma^2 I)^{-1} Y \tag{5}$$

$$\hat{\sigma}^2 = K_{**} - K_* (K + \sigma^2 I)^{-1} K_*^T \tag{6}$$

The second base model is K-nearest neighbor (KNN) regression. The KNN regressor learns by comparing test examples with the training set [42]. Let T = {(x_i, y_i)}, i = 1, ..., N, denote the training set, where x_i = (x_i1, x_i2, ..., x_im) represents the i-th sample with m input features, y_i is its target output, and N is the number of examples. For a test instance x, the distance d_i between x and every sample x_i ∈ T is calculated, and the distances are sorted by value. If d_i is in the i-th place, the corresponding sample is called the i-th nearest neighbor, NN_i(x), and its target is denoted y_i(x). Lastly, the estimate ŷ for the input instance x is the average of the targets of its k nearest neighbors [43]:

$$\hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i(x)$$

The third base model is Theil-Sen regression (TSR). In Theil-Sen regression, the model is estimated by computing the slopes and intercepts of a subset of all possible combinations of p subsample points. When an intercept is fitted, p must be greater than or equal to the number of features + 1. The spatial median of these slopes and intercepts is then taken as the final slope and intercept.
The trend slopes were estimated using the Theil-Sen (TSR) estimator [44], which was chosen because it outperforms ordinary linear regression in estimating trend slopes in the presence of outliers [45].
The first step in calculating the TSR estimator is to determine the Q_i values for N pairs of data [44]:

$$Q_i = \frac{x_j - x_k}{j - k}, \quad i = 1, \ldots, N, \quad j > k$$

where x_j and x_k are the data point vectors.
If there is only one datum per vector, then N = n(n-1)/2, where n is the number of vectors. If there are multiple observations in several vectors, then N < n(n-1)/2, where n is the number of observed vectors.
The TSR estimator is then calculated as the median Q_med of the N values of Q_i, sorted from smallest to largest [44]:

$$Q_{med} = \begin{cases} Q_{[(N+1)/2]}, & \text{when } N \text{ is odd} \\ \dfrac{Q_{[N/2]} + Q_{[(N+2)/2]}}{2}, & \text{when } N \text{ is even} \end{cases}$$

The sign of Q_med indicates the direction of the trend, and its value indicates the magnitude of the trend.
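The median-of-pairwise-slopes idea described above can be illustrated for a single input variable. The sketch below is an assumption-laden example rather than the procedure used in this study: the function name `theil_sen_fit` is hypothetical, the intercept is taken as the median of the offsets y - slope·x (one common convention), and the toy data contain a deliberate outlier.

```python
from itertools import combinations
from statistics import median


def theil_sen_fit(x, y):
    """Return (slope, intercept) based on the median of all pairwise slopes."""
    slopes = [
        (y[j] - y[k]) / (x[j] - x[k])
        for k, j in combinations(range(len(x)), 2)
        if x[j] != x[k]  # skip pairs with identical x, whose slope is undefined
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    # statistics.median averages the two middle values when the count is even.
    slope = median(slopes)
    # Assumed convention: intercept as the median of the per-point offsets.
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept


# Toy series following y = 2x + 1, except for one large outlier at the end.
toy_x = [0, 1, 2, 3, 4]
toy_y = [1, 3, 5, 7, 100]
slope, intercept = theil_sen_fit(toy_x, toy_y)
print(slope, intercept)
```

In this toy case six of the ten pairwise slopes equal 2, so the median ignores the four slopes inflated by the outlier, which is the robustness property the text attributes to the estimator.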

AdaBoost
AdaBoost [46] is the most well-known boosting model, and it was initially employed for classification. Freund [29] then presented AdaBoost.R to handle real-valued regression problems. Subsequently, Drucker [47] addressed the regression problem using the updated AdaBoost.R2 model, with strong results.
The sample weights are initialized to equal values. The first iteration trains a weak learner, and the instance weights are adjusted based on the training outcomes. The adjusted weights are then used to train the next weak learner. In each iteration, the weights of the instances estimated poorly (with a high error) in the previous iteration are increased, while the weights of the instances estimated well (near the expected value) are decreased. The influence of hard-to-predict instances therefore becomes increasingly substantial as the number of iterations grows; after each iteration, the next weak learner concentrates more on samples that were previously estimated poorly. The final prediction is established by a weighted vote of the weak learners. Any machine learning regression technique may be chosen as the weak learner in AdaBoost regression [48][49][50]. In this study, each of the three models of the previous section was used separately as the weak learner.
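The reweighting step described above can be made concrete with a single boosting round. The sketch below follows the commonly cited AdaBoost.R2 update with a linear loss; it is a hedged illustration, not the implementation used in this study, and the function name `adaboost_r2_update` and the toy numbers are hypothetical.

```python
def adaboost_r2_update(weights, observed, predicted):
    """One AdaBoost.R2-style reweighting step with a linear loss.

    Returns (new_weights, beta); a smaller beta indicates a more accurate learner.
    """
    abs_errors = [abs(p - o) for p, o in zip(predicted, observed)]
    max_error = max(abs_errors)
    if max_error == 0:
        return list(weights), 0.0  # perfect fit: nothing to reweight
    # Per-sample loss scaled to the interval [0, 1].
    losses = [e / max_error for e in abs_errors]
    total = sum(weights)
    avg_loss = sum(w / total * l for w, l in zip(weights, losses))
    if avg_loss >= 0.5:
        raise ValueError("weak learner too inaccurate; boosting would stop here")
    beta = avg_loss / (1.0 - avg_loss)
    # Well-predicted samples are shrunk by a factor close to beta (< 1);
    # the worst-predicted sample keeps its raw weight.
    raw = [w * beta ** (1.0 - l) for w, l in zip(weights, losses)]
    norm = sum(raw)
    return [r / norm for r in raw], beta


n = 4
start = [1.0 / n] * n  # equal initial weights
new_weights, beta = adaboost_r2_update(
    start, observed=[1.0, 2.0, 3.0, 4.0], predicted=[1.0, 2.0, 3.0, 6.0]
)
print(new_weights, beta)
```

After normalization, the one badly predicted sample carries a larger share of the total weight than before, so the next weak learner is trained with more emphasis on it.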

Results
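We employed grid search to find the optimal hyper-parameters of these models and obtained the final configuration of each model. Two metrics, MAE and R2, were used to evaluate model performance; they were calculated using Equations (10) and (11) [51,52]:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{x}_i^{t+1} - x_i^{t+1} \right| \tag{10}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( x_i^{t+1} - \hat{x}_i^{t+1} \right)^2}{\sum_{i=1}^{n} \left( x_i^{t+1} - \bar{x} \right)^2} \tag{11}$$

In these equations, \hat{x}_i^{t+1} is the estimated value, x_i^{t+1} is the observed value, \bar{x} is the mean of the observed values, and n is the number of examples.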
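For readers who want to reproduce the two metrics, the sketch below computes them in their commonly used forms: MAE as the mean absolute difference between estimated and observed values, and R2 as one minus the ratio of the residual to the total sum of squares. The function names and toy values are hypothetical, and the R2 form is an assumption about the definition intended here.

```python
def mean_absolute_error(observed, estimated):
    """Mean of the absolute differences between estimated and observed values."""
    return sum(abs(e - o) for o, e in zip(observed, estimated)) / len(observed)


def r2_score(observed, estimated):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Made-up observed and estimated values, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0]
est = [1.0, 2.0, 3.0, 5.0]
mae_value = mean_absolute_error(obs, est)
r2_value = r2_score(obs, est)
print(mae_value, r2_value)
```

Note that MAE carries the units of the target, so values on the order of 10−6 must be read against solubilities on the order of 10−5, whereas R2 is dimensionless.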
The accuracy of the final models is presented in Table 2. Additionally, the comparison of expected and estimated values of tamoxifen solubility in the SC-CO2 system for the ADA-KNN, ADA-GPR, and ADA-TSR models is shown in Figures 1-3. In these diagrams, the green line represents the actual data, and the points are predicted values, blue for the training subset and red for the test subset. Comparing the three charts indicates that ADA-KNN is the most general and appropriate model, since its points lie closest to the actual data line for both the test and training subsets.
Figure 4 shows a three-dimensional projection of the final ADA-KNN model, illustrating the simultaneous impact of the input parameters (pressure and temperature) on drug solubility. Two-dimensional depictions that individually show the effects of pressure and temperature on tamoxifen solubility in the SC-CO2 system are given in Figures 5 and 6. The figures show that pressure has a positive effect on drug solubility in the SC-CO2 fluid, because increasing pressure raises the density of the SCF through molecular compaction. As density increases, the solvating capability of the solvent increases significantly, and so does the solubility of the drug in SC-CO2. The effect of temperature on drug solubility is twofold. On the one hand, increasing temperature raises the sublimation pressure of the solute, which favors the solubility of the drug in SCFs. On the other hand, increasing temperature reduces the density of the solvent, which considerably weakens its solvating power and consequently the drug solubility. The net effect of sublimation pressure and density therefore determines whether temperature plays a favorable or unfavorable role in solubility.
The figures reveal a cross-over pressure in the isotherms. At pressures above the cross-over pressure, an increase in temperature improves drug solubility because the effect of sublimation pressure outweighs that of density. At pressures below the cross-over pressure, the decrease in solvent density with increasing temperature outweighs the effect of sublimation pressure and, as a result, decreases the tamoxifen solubility in the SC-CO2 fluid [35]. Based on the results presented in Table 3, a pressure of 329 bar and a temperature of 318 K were identified as the optimum conditions for reaching the maximum tamoxifen solubility.

Conclusions
In this research work, three new machine learning models were compared to estimate and validate the solubility of tamoxifen in supercritical CO2. The AdaBoost method was applied to improve three different base models, KNN, GPR, and TSR, and the results are promising. According to the analysis, the R2 values of the ADA-KNN, ADA-GPR, and ADA-TSR models were 0.996, 0.967, and 0.883, respectively. The MAE values for these three models were 1.98 × 10−6, 1.33 × 10−6, and 2.33 × 10−6, respectively. ADA-KNN was selected as the best model based on these metrics and some visual analysis, and it was applied to obtain the optimum values (X1 = 329, X2 = 318.0, Y = 6.004 × 10−5).
Author Contributions: B.H.: writing, editing, data analysis, methodology, conceptualization, validation, resources, supervision, data curation; A.A.: review and editing, validation, resources, visualization, software. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Not applicable.
Data Availability Statement: All data are available within the published paper.

Conflicts of Interest:
The authors declare no conflict of interest.
