Prediction of Antimalarial Drug-Decorated Nanoparticle Delivery Systems with Random Forest Models

Drug-decorated nanoparticles (DDNPs) have important medical applications. The current work combined Perturbation Theory with Machine Learning and Information Fusion (PTMLIF). Thus, PTMLIF models were proposed to predict the probability of nanoparticle–compound/drug complexes having antimalarial activity (against Plasmodium). The aim is to save experimental resources and time by using a virtual screening for DDNPs. The raw data was obtained by the fusion of experimental data for nanoparticles with compound chemical assays from the ChEMBL database. The inputs for the eight Machine Learning classifiers were transformed features of drugs/compounds and nanoparticles as perturbations of molecular descriptors in specific experimental conditions (experiment-centered features). The resulting dataset contains 107 input features and 249,992 examples. The best classification model was provided by Random Forest, with 27 selected features of drugs/compounds and nanoparticles in all experimental conditions considered. The high performance of the model was demonstrated by the mean Area Under the Receiver Operating Characteristics (AUC) in a test subset with a value of 0.9921 ± 0.000244 (10-fold cross-validation). The results demonstrated the power of information fusion of the experimental-centered features of drugs/compounds and nanoparticles for the prediction of nanoparticle–compound antimalarial activity. The scripts and dataset for this project are available in the open GitHub repository.


Introduction
Drug-decorated nanoparticles (DDNPs) are among the most interesting nanomaterials, with a broad range of medical applications. Many of them are used in drug delivery systems for different types of chemical compounds. These systems have numerous advantages, since there are countless combinations of drugs and nanoparticles that can be effective in treating different conditions. At the same time, they have some weaknesses. For example, the synthesis of nanoparticles can sometimes be expensive, or it can involve a lot of time that can increase with the number of samples. For this reason, in order to improve the possibility of forming effective pairs, there is a need for computational models.
Recently, some researches have been focusing on finding DDNPs that show antimalarial properties. For instance, silver and gold nanoparticles, that were synthesized from leaf and bark extracts of Myrtaceae, exhibited an effective antiplasmodial activity [1], and exopolysaccharide coated ZnO nanoparticles (EPS-ZnO NPs) presented functional effects against malaria vectors [2]. Therefore, this study aims to design a useful computational model that allows a good prediction of the antimalarial activity of varied drug-nanoparticle pairs.
Moreover, a brand new method for data fusion in nanotechnology, bio-molecular sciences, chemistry and big data analysis has been proposed in different works: it integrates Perturbation Theory (PT) and Machine Learning (ML) [3][4][5][6][7][8][9][10][11][12][13], using distinct PT operators to analyze changes in the varied non-structural and structural conditions of a test at once (PTML). A few of these PT operators represent the generalization of a classic cheminformatics approach introduced by Corwin Hansch [14]. He noticed the significant potential of using predictive methodologies to resolve multivariate questions in medicinal chemistry. Hansch's classic approach allows one to search for models with multiple physicochemical conditions so as to foretell the biological activity of compounds, and these models possibly include quadratic and/or linear terms. In this process, which is a Linear Free Energy Relationship (LFER) model, most of the terms are physicochemical parameters linked with the free energy of drug ionization, binding, transport, etc. In addition, because we are fusing the information (IF) of drugs and nanoparticles, the model becomes a PTMLIF (PTML + IF).
As an illustration, the logarithmic term (logP) of the octanol/water partition coefficient (P) is presented as an estimate of the free energy of drug transport and molecular lipophilicity [15]. We can approximate logP values via chemical fragment methods (such as CLogP), or via atomic methods (such as ALogP or XLogP) [16,17]. The logarithmic terms of acidity constants (pKa) are connected to the free energy of drug ionization. Additionally, to account for more molecular properties, we can use different parameters like Polar Surface Area (PSA). Generally, for a given molecule m i , we can utilize as input for the model several types of molecular properties, taking into account measures of molecular polarizability, lipophilicity, electronegativity, etc. [17,18]. We can define these models as: where ε i is the biological activity of the molecule m i , f (ε i ) is a function of the variable ε i , D k (m i ) are the molecular descriptors of m i , and a k and b k are the coefficients. This classical model works to account for changes in the chemical structure of the drug/compound using the molecular descriptors, but it does not take into account the result regarding the drug activity of perturbations in multiple experimental conditions (c j ). These include assay conditions or changes in drug chemical structure, such as c 0 = the biological parameter used (CC 50 : the ratio of the 50% cytotoxic concentration, IC 50 : inhibition concentration, etc.), c 1 = organism, c 2 = cell name, c 3 = assay organism, etc. An example is the large datasets found in the public database ChEMBL [19][20][21][22][23][24][25]. We used PTML methods to analyze a large set of over 50,000 preclinical assays of drugs. These assays incorporate drugs targeting Plasmodium. The PTML model proceeds from the classic LFER approach for drug activity. We combined the use of eight Machine Learning methods with feature selection in order to obtain a more accurate classifier for our task. Figure 1 presents the methodology used to build the PTMLIF classifier for the antimalarial activity of DDNPs. The methodology flow contains the following steps: (1) Get 23 properties of nanoparticle and anti-malaria drugs/compounds from the literature and public databases as initial molecular descriptors; (2) Fuse information about experimental conditions and the properties of anti-malaria drugs/compounds and nanoparticles, using an experimental-centered transformation of the original features (Box-Jenkins Moving Average operators); (3) Integrate drug/compound and nanoparticle data into the study dataset; (4) Build the baseline PTMLIF models using default parameters of the ML methods; (5) Improve the performance of the best classifier by using only the most important features (feature selection).

Materials and Methods
parameters of the ML methods; (5) Improve the performance of the best classifier by using only the most important features (feature selection).

ChEMBL Data Pre-Processing
We got the results of several preclinical assays from ChEMBL. The experimental measure, ε ij (d), used to quantify the biological activity of the i-th molecule (m i ) over the j-th objective, represents the outcome of every assay. The values of ε ij (d) rely on the structure of the compound, and also on certain limit conditions that mark off the properties of the assay c j (d) = (c 0 (d), c 1 (d), c 2 (d), . . . , c n (d)). The first c j (d) is c 0 (d) = the biological activity; we used drugs with CC 50 , EC 50 and IC 50 . Other conditions are c 1 (d) = organism, c 2 (d) = cell name, c 3 = assay organism, c 4 (d) = assay strain, etc. (see Table 1). We used classification techniques because the values ε ij (d) are not exact numbers in some cases. Furthermore, we discretized the values in this way: for IC 50 and EC 50 f(v ij (d))obs = 1 when v ij < cutoff and desirability of the biological activity parameter D(c 0 (d)) = −1 (see Table 2); for CC 50 , f(v ij (d))obs = 1 when v ij > cutoff and desirability D(c 0 (d)) = 1, if not f(v ij (d))obs = 0. The desirability D(c 0 (d)) = 1 or −1 denotes that the parameter measured decreases or increases directly with a biological effect, which can be desired or not. Finally, we calculated the deviations of each condition for all drugs/compounds.  a Parameters <LogP> and <PSA> = Average value of LogP and PSA for all drugs m i with value reported in ChEMBL dataset. These parameters are needed for the moving average calculation. Other parameters: n 0 = number of compounds that shown each different activity, n 1 = number of compounds considered as positive, p 1 = n 1 /n 0 probability of a compound being considered positive, d = −1, 0, 1 is the desirability of the parameter, cutoff = limit for the compound being treated as active or not.

Nanoparticle Data Pre-Processing
From the literature, we collected the outcomes of many nanoparticles, and the measure ε ij expresses the result of each of them. The values of ε ij (np) depend on different properties of the nanoparticle, and also on some boundary conditions that delimit the characteristics of the assay c j (np) = (c 0 (np), c 1 (np), c 2 (np), . . . , c n (np)) (see Table 3). Again, the first c j (d) = the biological activity, and we only used nanoparticles with CC 50 , EC 50 and IC 50 , so that they could match with the biological activities of the drugs/compounds. Other conditions are c 1 (np) = cell name, c 2 (np) = nanoparticle shape, c 3 (np) = nanoparticle medium and c 5 = surface coating. Additionally, we discretized the values in the same way that we did with drugs/compounds (see Table 4). In the end, we determined the deviations of every c j for all nanoparticles.    a Parameters <LogP> and <PSA> = Average value of LogP and PSA for all nanoparticles m i . These parameters are needed for the moving average calculation. Other parameters: n 0 = number of decorated nanoparticles that shown each different activity, n 1 = number of nanoparticles considered as positive, p 1 = n 1 /n 0 probability of a nanoparticle being considered positive, d = −1, 0, 1 is the desirability of the parameter, cutoff = limit for the DNPs being treated as active or not.

Combine Data PRE-Processing
Once both databases were done, we combined them by doing pairs with the same experimental conditions, for example, a CC50 with a CC50 nanoparticle. In addition, we used the same method to discretize each formed pair. Thus, a dataset of 107 input features and 249,992 examples will be used to build ML classification models. The positive (1) and negative control cases (0) were assigned as follows: if desirability function d(c0) = −1, then c ij = 1 when ε ij < 100 nM or ε ij < average <ε ij > for properties not measured in nM. In addition, if d(c 0 ) = 1, 0, then c ij = 1 when ε ij > average value <ε ij >. An extra input feature (prob = probability) was created as the probability of c0 for compound-nanoparticle pairs (count of the number of compound-nanoparticle pairs for each c0 activity type/total number of pairs

Machine Learning Methods
The study is done using eight Machine Learning scikit-learn classifiers to find the best classifier able to predict the probability of a nanoparticle-compound pair highly express antimalarial activity:

1.
KNeighborsClassifier = KNN-k-nearest neighbors: one of the most well-known non-parametric classifiers in the ML field. It assigns an unclassified sample to the same class as the nearest of the k samples in the training set [26]; 2. SVC(linear) = SVM linear-support vector classifier with linear kernels: the input data is non-linearly mapped to a higher dimensionality space, where a linear decision surface can be established [27]; 3. SVC = SVM-support vector classifier with non-linear RBF kernels: the real problems tend not to have a linear solution, and SVM can handle this limitation by using nonlinear kernel functions such as Gaussian radial basis (RBF) [28]; 4.
LogisticRegression = LR-Logistic regression [29] is a linear model which can estimate the probability of a binary response using different factors; 5.
LinearDiscriminantAnalysis = LDA-linear discriminant analysis [30]: a statistical supervised method that is based on the projection of data to a lower dimension to maximize the scatter between classes versus the scatter within each class. This projection facilitates the separation of the data; 6.
DecisionTreeClassifier = DT-Decision Tree uses a series of decision rules inferred from the features as a tree of rules. Thus, the paths from root to leaf represent classification rules [31]; 7.
RandomForestClassifier = RF-Random forest [32] is an ensemble method that aggregates several decision trees (parallel trees). Each tree is generated using a bootstrap sample drawn randomly from the original dataset using a classification or regression tree (CART) method and the Decrease Gini Impurity (DGI) as the splitting criterion [33]. RF is characterized by low bias and low correlation between individual trees, and high variance; 8.

ML Workflow
The features were standardized by removing the mean and scaling to unit variance, using the standard scaler in scikit-learn. A stratified 10-fold cross-validation was performed, preserving the percentages of samples for each class. As the dataset samples were not balanced, class weights were computed for each class using N/(k * n i ), where N is the total number of samples, k the number of classes and n i the number of samples belonging to the class i. This results in weights of 0.63778 for class 1 and 2.31448 for class 2. The model's performance was measured using Area Under the Receiver Operating Characteristics (AUC).
Given the results obtained in the baseline, the workflow has continued only with the best model. From this point on, a feature selection was done using the mean impurity decrease, which is already implemented in sklearn. This metric is calculated using the weighted gini impurity decreases for all nodes, averaged over all trees [33]. Thus, a feature selection was done using ExtraTreesClassifier [36] with n_estimators = 100, class weights and 10-fold CV (see Feature-Selection.ipynb [37]). We chose this tree-based method to select the most important features because extra trees (sometimes named extreme random trees) offer a higher performance in the presence of noisy features [38]. Our custom feature selection algorithm keeps at least one feature for each experimental condition for drugs/compounds and nanoparticles, and the probability feature (if the automatic selector eliminates them).
The simplest PTML linear models will be the first classifiers to test for complex datasets with multiple BD characteristics [39,40]. We can approximate function values f (v ij (d) and v ij (np)) calc for the i-th drug-nanoparticle pair in the j-th preclinical assay with multiple conditions of assay c j . As input, we used PT operators that can also be Box-Jenkins Moving Average (MA) operators [41,42]. PTML linear models have the following generic form: , v ij (np)) expt + k max ,j max k=1, j=0 a k j ·∆D k d j , c j + k max ,j max k=1, j=0 b k j ·∆D k (np j , c j ) (2) Biology 2020, 9, 198 8 of 15 Additional results have been provided in order to explain the predictions with the best model using Shapley values [43] (SHAP_test.ipynb). All the scripts for baseline AUCs, feature selection and the final model are available as an open repository at https://github.com/d-bcarrue/NanoDrugsMalaria [37].

Results and Discussion
In the present work, we created a PTML model to predict the activity of organic compounds assembled of some nanoparticles used against malaria disease. In doing so, we expanded the idea behind Hansch's analysis and searched models with applications to nanomedicine. As a proof-of-concept test, we investigated a huge number dataset of drugs downloaded from ChEMBL, and another dataset of nanoparticles. Those datasets contain (see materials and methods) the outcomes of many experimental pharmacological assays.
The model supposes that the changes in drug-nanoparticle binding occur thanks to perturbations in the input boundary conditions of both nanoparticles and drugs. We focused only on a nanoparticle-drug/compound binding pseudo-constant (v ij(d) ,v ij(np) ), defined by us, to quantify the probability of a nanoparticle-drug/compound pair highly expressed against malarial activity. This PTML model begins with a reference value, f v ij d , v ij np ) expt , and then adds the effects of perturbations in the structure of the compound or conditions of the assay, and the properties of the nanoparticle and its coating. Other input terms used here are the perturbation terms ∆LogP Using eight ML classifiers, the AUC values have been calculated (10-fold CV). The results are presented in Table 5. The best model was obtained with RF, and the AUC is 0.9844 ± 0.0007. Figure 2 represents the box-plot for the baseline AUC values of the ML methods (10-fold CV). The AUC values for the 10 splits have short ranges, especially RF. This suggests that the AUCs for all ML methods are stable within each fold. In addition, the high difference between the RF and the other methods (box-plots are far from overlapping) demonstrated that it is statistically significant. determined the more relevant perturbations under different experimental conditions, cj, related to the antimalarial property by using a RF. Casually, in this model most of the used operators are of PSA and ALOGP type. Therefore, they measure only perturbations in the value of ALOGP with respect to other subsets, cj, of drugs and nanoparticles. ALOGP is a relevant parameter used in medicinal chemistry because it is related to lipophilicity and the capacity to cross biological membranes.  Figure 3 shows the mean impurity reduction for each of the features in both the original order and the sorted. This mean impurity decrease was obtained using an Extra Trees classifier with 100 trees and weighted classes, and this model was applied in a stratified 10-fold CV. The horizontal dashed line indicates the threshold used as a selection filter. After checking that the probability feature is present, since this is strictly necessary in Perturbation Theory, we check that all experimental conditions are reflected in the selected subset. After filtering, the experimental conditions, c2 and c4, of the nanoparticles were not included. Therefore, we selected the characteristic with the highest mean impurity decrease for both experimental conditions, and added it to the previous selection, marking them in pink. The unsorted plot presents the features on the x-axis in the order they were presented into the dataset. For a better comparison of the selected feature mean impunities (the ones above the cutoff), the ordered version of the plot was presented too. In the next step, we reduced the number of features in order to improve the AUC of the RF model. Thus, a feature selection was done using ExtraTreesClassifier. The 27 features were selected from an initial 107: -One np-compound pair feature: prob; -5 np features using 5 experimental conditions (c0-c4): np_DVnpu(c0), np_DUccoat(c1), np_DVnpu(c2), np_DPnpu(c3), np_DPnpu(c4); -21 drug/compound features using 7 experimental conditions (c0-c6): d_DMw(c0), d_DALOGP(c0), d_DPSA(c0), d_DMw(c1), d_DALOGP(c1), d_DPSA(c1), d_DMw(c2), d_DALOGP(c2), d_DPSA(c2), d_DMw(c3), d_DALOGP(c3), d_DPSA(c3), d_DMw(c4), d_DALOGP(c4), d_DPSA(c4), d_DMw(c5), d_DALOGP(c5), d_DPSA(c5), d_DMw(c6), d_DALOGP(c6) and d_DPSA(c6).
Remarkably, this is the first model combining both Perturbation Theory and MAs in a QSBR study of relevant nanoparticle-drug/compound pairs used as an antimalarial delivery system. We determined the more relevant perturbations under different experimental conditions, c j , related to the antimalarial property by using a RF. Casually, in this model most of the used operators are of PSA and ALOGP type. Therefore, they measure only perturbations in the value of ALOGP with respect to other subsets, c j , of drugs and nanoparticles. ALOGP is a relevant parameter used in medicinal chemistry because it is related to lipophilicity and the capacity to cross biological membranes. Figure 3 shows the mean impurity reduction for each of the features in both the original order and the sorted. This mean impurity decrease was obtained using an Extra Trees classifier with 100 trees and weighted classes, and this model was applied in a stratified 10-fold CV. The horizontal dashed line indicates the threshold used as a selection filter. After checking that the probability feature is present, since this is strictly necessary in Perturbation Theory, we check that all experimental conditions are reflected in the selected subset. After filtering, the experimental conditions, c2 and c4, of the nanoparticles were not included. Therefore, we selected the characteristic with the highest mean impurity decrease for both experimental conditions, and added it to the previous selection, marking them in pink. The unsorted plot presents the features on the x-axis in the order they were presented into the dataset. For a better comparison of the selected feature mean impunities (the ones above the cutoff), the ordered version of the plot was presented too. Therefore, with only 27 selected features (from an initial 107), the mean test AUC for the RF classifier increased to 0.9921 ± 0.000244 (from 0.9844 ± 0.0007). This model shows a very good performance for a PTML model.
The feature selection showed that the current classifier prefers perturbations (MAs) of the logarithmic term of the octanol/water partition coefficient (ALOGP), polar surface area (PSA) and molecular weight (Mw) for compounds/drugs under all experimental conditions, such as activity type (c0), organism (c1), cell name (c2), assay organism (c3), assay strain (c4), type of curation (c5) and assay type (c6). In the case of the nanoparticles, the model selected the perturbations of the average of atomic Van der Waal volume for all atoms in the np (Vnpu) with activity type (c0) and with shape (c2), unsaturation count (Uccoat) in cell line (c1), atomic polarizability (Pnpu) with assay medium (c3) and surface coating (c4). Thus, we can conclude that the perturbations of the following molecular descriptors under different experimental conditions are important for anti-malaria drugdecorated nanoparticles: polarity of both components/drugs and nanoparticles, mass of compounds/drugs, volume, shape and coating unsaturation of nanoparticles.
A linear model is easily interpreted, but it is not always the most accurate model. Therefore, the complex models use different tools in order to avoid a "black box" model.  Therefore, with only 27 selected features (from an initial 107), the mean test AUC for the RF classifier increased to 0.9921 ± 0.000244 (from 0.9844 ± 0.0007). This model shows a very good performance for a PTML model.
The feature selection showed that the current classifier prefers perturbations (MAs) of the logarithmic term of the octanol/water partition coefficient (ALOGP), polar surface area (PSA) and molecular weight (Mw) for compounds/drugs under all experimental conditions, such as activity type (c0), organism (c1), cell name (c2), assay organism (c3), assay strain (c4), type of curation (c5) and assay type (c6). In the case of the nanoparticles, the model selected the perturbations of the average of atomic Van der Waal volume for all atoms in the np (Vnpu) with activity type (c0) and with shape (c2), unsaturation count (Uccoat) in cell line (c1), atomic polarizability (Pnpu) with assay medium (c3) and surface coating (c4). Thus, we can conclude that the perturbations of the following molecular descriptors under different experimental conditions are important for anti-malaria drug-decorated nanoparticles: polarity of both components/drugs and nanoparticles, mass of compounds/drugs, volume, shape and coating unsaturation of nanoparticles.
A linear model is easily interpreted, but it is not always the most accurate model. Therefore, the complex models use different tools in order to avoid a "black box" model. Shapley values and SHAP (Shapley Additive explanations) values are the proposed solution for the best RF model. The Shapley values represent the average of the marginal contribution across all permutations, a method for quantifying the contribution of the features to the final model. Thus, the SHAP method is able to explain the output of a machine learning model by: global interpretability: how much each feature contributes, either positively or negatively, to the output variable; -local interpretability: each case/instance gets its own SHAP values in order to explain why a case has a specific prediction, and the contribution of the features to this instance.
The global interpretability is presented by the correlation of the features with the output variable or the positive/negative impact using SHAP values (Figure 4). The ordered average impact of the features on the model output for each class, and the local interpretability for different instances/cases, are included in the GitHub repository with the new script (SHAP_test.ipynb).
Biology 2020, 9, x FOR PEER REVIEW 11 of 14 -global interpretability: how much each feature contributes, either positively or negatively, to the output variable; -local interpretability: each case/instance gets its own SHAP values in order to explain why a case has a specific prediction, and the contribution of the features to this instance.
The global interpretability is presented by the correlation of the features with the output variable or the positive/negative impact using SHAP values (Figure 4). The ordered average impact of the features on the model output for each class, and the local interpretability for different instances/cases, are included in the GitHub repository with the new script (SHAP_test.ipynb). Color shows whether that variable has a positive (in red) or a negative (in blue) impact on the output variable.
Thus, we can observe that the nanoparticle perturbation of molecular descriptors under experimental conditions has a high impact on the model prediction for anti-malaria drug/compounds carriers. These include perturbation of atomic polarizability (Pnpu), the average of atomic Van der Waal volume for all atoms (Vnpu) and unsaturation of coating (Uccoat). For the compounds/drugs, ALOGP has more impact than mass weight and PSA. Thus, we confirm that the molecular properties linked to polarity have the highest impact on the anti-malaria drug/compound-nanoparticle carriers. Atomic polarizability of nanoparticles has a more positive impact on the model output, and the volume of nanoparticles has only a negative impact: the optimal anti-malaria drug-np carriers should consider nanoparticles with high atomic polarizability but small volume. In addition, the compounds/drugs should have higher polar surface areas (PSA) with a positive impact, and smaller weight mass with a negative impact, on the model output. Color shows whether that variable has a positive (in red) or a negative (in blue) impact on the output variable.
Thus, we can observe that the nanoparticle perturbation of molecular descriptors under experimental conditions has a high impact on the model prediction for anti-malaria drug/compounds carriers. These include perturbation of atomic polarizability (Pnpu), the average of atomic Van der Waal volume for all atoms (Vnpu) and unsaturation of coating (Uccoat). For the compounds/drugs, ALOGP has more impact than mass weight and PSA. Thus, we confirm that the molecular properties linked to polarity have the highest impact on the anti-malaria drug/compound-nanoparticle carriers. Atomic polarizability of nanoparticles has a more positive impact on the model output, and the volume of nanoparticles has only a negative impact: the optimal anti-malaria drug-np carriers should consider nanoparticles with high atomic polarizability but small volume. In addition, the compounds/drugs should have higher polar surface areas (PSA) with a positive impact, and smaller weight mass with a negative impact, on the model output.

Conclusions
By combining Perturbation Theory ideas with Hansch's QSAR analysis and information fusion, we developed a multi-target PTMLIF model that is useful in classifying drugs based on their constant binding to many different nanoparticles and their capacity to act against plasmodium, which is the cause of malaria in humans. This model can help us to save experimental resources and time, since it allows the determination of which drug-decorated nanoparticles would be useful and which would not. In this way, we can prove only those with the highest probability of being active. The transformed features of drugs and nanoparticles have been used as input for eight Machine Learning methods. The best classification model has been obtained using Random Forest with only 27 selected features of drugs and nanoparticles in all the experimental conditions considered. The mean test AUC was 0.9921 ± 0.000244 (10-fold CV). The performance of the RF model demonstrated the power of the information fusion of the experimental-based features of drugs and nanoparticles for the prediction of probability, related to nanoparticle-drug/compound antimalarial activity. All the calculations can be reproduced using the scripts and dataset included in an open GitHub repository at https://github.com/d-bcarrue/NanoDrugsMalaria.