kESVR: An Ensemble Model for Drug Response Prediction in Precision Medicine Using Cancer Cell Lines Gene Expression

Background: Cancer cell lines are frequently used in research as in-vitro tumor models. Genomic data and large-scale drug screening have accelerated the right drug selection for cancer patients. Accuracy in drug response prediction is crucial for success. Due to data-type diversity and big data volume, few methods can integrative and efficiently find the principal low-dimensional manifold of the high-dimensional cancer multi-omics data to predict drug response in precision medicine. Method: A novelty k-means Ensemble Support Vector Regression (kESVR) is developed to predict each drug response values for single patient based on cell-line gene expression data. The kESVR is a blend of supervised and unsupervised learning methods and is entirely data driven. It utilizes embedded clustering (Principal Component Analysis and k-means clustering) and local regression (Support Vector Regression) to predict drug response and obtain the global pattern while overcoming missing data and outliers’ noise. Results: We compared the efficiency and accuracy of kESVR to 4 standard machine learning regression models: (1) simple linear regression, (2) support vector regression (3) random forest (quantile regression forest) and (4) back propagation neural network. Our results, which based on drug response across 610 cancer cells from Cancer Cell Line Encyclopedia (CCLE) and Cancer Therapeutics Response Portal (CTRP v2), proved to have the highest accuracy (smallest mean squared error (MSE) measure). We next compared kESVR with existing 17 drug response prediction models based a varied range of methods such as regression, Bayesian inference, matrix factorization and deep learning. After ranking the 18 models based on their accuracy of prediction, kESVR ranks first (best performing) in majority (74%) of the time. As for the remaining (26%) cases, kESVR still ranked in the top five performing models. Conclusion: In this paper we introduce a novel model (kESVR) for drug response prediction using high dimensional cell-line gene expression data. This model outperforms current existing prediction models in terms of prediction accuracy and speed and overcomes overfitting. This can be used in future to develop a robust drug response prediction system for cancer patients using the cancer cell-lines guidance and multi-omics data.


Introduction
Precision medicine aims to provide individually tailored cancer treatment by considering an individual's genetic makeup, genomic makeup and clinical information. Emerging Next Generation Sequencing (NGS) techniques and large-scale cancer screening data helps in achieving this goal [1,2]. Databases such as the Cancer Cell Line Encyclopedia (CCLE) [2] provides public access to genomic data over 1000 cancer cell lines by RNA sequencing (RNA-seq; 1019 cell lines), whole-exome sequencing (326 cell lines), whole-genome sequencing (329 cell lines), and reverse-phase protein array (RPPA; 899 cell lines). The Cancer Therapeutics Response Portal (CTRP; http://portals.broadinstitute.org/ctrp/, accessed on 1 June 2019) [3,4] quantitatively measured the sensitivity of 481 small-molecule probes and drugs. An important step of this process is to use cancer cell line models to simulate mixed tissue and predict his/her drug response [5].
Accuracy in drug response prediction is of utmost importance in this regard. Over the years various models have been developed for this purpose [6][7][8][9]. Contemporary models are based on a varied range of techniques such as regression methods, Bayesian inference methods, matrix factorization methods and deep learning methods. Some of these methods use only gene expression data while some use combination of other omics data such as mutation, copy number variation, methylation and so on for response prediction. Prediction models by gene expression profiles show the best performance in all kinds of omics analysis [10]. The detailed analysis and comparison of different methods can be found in a paper by Chen and Zhang [11].
These models, however, only focus on a single model [12] trained over large datasets. Without recognition of the weekly predictive local embedding data contribution, will cause them to make incorrect decisions in facing to outliers and errors. Outliers' and errors' in turn, causes these models to be incapable of capturing the dataset's true variance, thus distorting model complexity [13]. To deal with high-dimensional genomics data, a promising strategy is to find an effective low-dimensional subspace of the original data and cluster samples in the reduced subspace [14], and then do a localize regression.
However, it is hard to identify the homology features which one contribution to drug response prediction for increasingly heterogeneous datasets comprised of multi-omics data collected from overlapping latent low-dimensional subpopulations. Principal Component Analysis (PCA) can generate statistically uncorrelated principal components (PC) while retaining as much as possible of the variation present in the original data set. PCA has been used previously to delineate homogenous regions by PCs regression and applying in all kinds of fields such as temperature [15], hydrology [16], risk [17], and animal health [18], but have not applied in drug response prediction.
Machine learning nonlinear regression Support Vector Regression (SVR) was first introduced by Vapnik [19] and has been a highly effective and suitable method for regression [20]; the problem of regression is to find a function surface in high dimension that approximates mapping from an input domain (low dimension) to real values based on a training sample. However, most existing SVR learning algorithms are limited to the parameters selection and slow learning for high-throughput features and large samples [21,22].
To address these challenges, a novelty k-means Ensemble Support Vector Regression (kESVR) is developed to predict each drug response values for single patient based on cell-line gene expression data. kESVR's origin stems from previous work interval SVR [22], where we separated a global nonlinear SVR predictor into interval subspaces and ran a SVR in each interval subspace. However, kESVR is different to the interval SVR in that it constructs a local SVR regression in each principal component embedding subspaces, where the K-means algorithm clusters these homogenous regions and then predict associated drug response. The prediction process function by repeatedly running the local SVR learning algorithm on various distributions' clustering over the whole training data, then comparing the regression value produced by these local SVR learners to obtain a single regression value with the best performance in accuracy and output. The last step used a boosting strategy as literature [23] mentioned to obtain the high accuracy of any local SVR learning algorithm.
The kESVR is a blend of supervised and unsupervised learning methods while being an entirely data driven model. It utilizes embedded clustering (PCA and k-means clustering) and local regression (Support Vector Regression) to predict drug response and obtain the global optimal value with the smallest mean squared error (MSE) while overcoming missing data and outliers' noise. In contrast to classical 17 machine learning models [11] that estimate a single, complex model (or only a few complex models), our results show that kESVR model with PCA-compressed features make the training and validation more efficient in both model accuracy and computational costs, outperforming other machine learning methods in generalization performance.

Materials
Molecular mRNA data from database Cancer Cell Line Encyclopedia (CCLE; https: //portals.broadinstitute.org/ccle, accessed on 1 June 2019) [2], which include 610 gene expression profiles with 20,531 genes [24][25][26]  In this section we provide a brief overview of the different steps comprising our model. Figure 1 summarizes the 4 steps of our model. that estimate a single, complex model (or only a few complex models), our results show that kESVR model with PCA-compressed features make the training and validation more efficient in both model accuracy and computational costs, outperforming other machine learning methods in generalization performance.

Overview
k-means Ensemble Support Vector Regression (kESVR) model consists of 4 distinct steps. In this section we provide a brief overview of the different steps comprising our model. Figure 1 summarizes the 4 steps of our model.

1.
Dimensional reduction: In the first step, we convert high dimensional gene expression data to low dimensional data for better handling and visualization in the subsequent steps.
2. Embedded Clustering: In the second step, we split the lower dimensional data into distinct clusters based on the their labeled/given drug response value. This is the done so that data points (cell-lines) that have similar or close drug response values are grouped together.
3. Local regression and ensemble value selection: In the third step, we train different instances of a machine learning (ML) model on each of the different clusters of data points obtained in the previous step. If there are clusters, we train instances of the ML model. For a given/new input, we now have candidate ML prediction outputs to select from. We use a score-based approach to select the best output. We base our scoring system on the similarity of gene expression profiles between the input and the training data to get the best prediction result.

4.
Optimal drug response value prediction: In the final step, we optimize the number of clusters to get our model kESVR that gives the best performance (minimum Mean Square Error).
We define the problem of drug response prediction using cell-lines gene expression as follows: Given gene expression data ( : number of cell lines, : number of genes) and drug response data ( : number of drugs, : number of cell lines), create

1.
Dimensional reduction: In the first step, we convert high dimensional gene expression data to low dimensional data for better handling and visualization in the subsequent steps.

2.
Embedded Clustering: In the second step, we split the lower dimensional data into distinct clusters based on the their labeled/given drug response value. This is the done so that data points (cell-lines) that have similar or close drug response values are grouped together.

3.
Local regression and ensemble value selection: In the third step, we train different instances of a machine learning (ML) model on each of the different clusters of data points obtained in the previous step. If there are k clusters, we train k instances of the ML model. For a given/new input, we now have k candidate ML prediction outputs to select from. We use a score-based approach to select the best output. We base our scoring system on the similarity of gene expression profiles between the input and the training data to get the best prediction result.

4.
Optimal drug response value prediction: In the final step, we optimize the number of clusters k to get our model kESVR that gives the best performance (minimum Mean Square Error).
We define the problem of drug response prediction using cell-lines gene expression as follows: Given gene expression data GE g×n (n: number of cell lines, g: number of genes) and drug response data R d×n (d: number of drugs, n: number of cell lines), create a prediction model that will accurately predict the response of each of those d drugs for both, the known n cell-lines and unknown cell-line data.

Generalized Description Dimensional Reduction
Let CL = {CL 1 , CL 2 , . . . , CL N } represent N cancer cell-lines and X = {X 1 , X 2 , . . . , X N } denote the gene expression data of CL. Let Y = {Y 1 , Y 2 , . . . , Y N } represent the set of drug response AUC values of CL for a drug D. So, X i is the gene expression data of cell-line CL i and Y i is the drug response of CL i . Each X i represents the expression of G genes in the whole genome for CL i We use PCA [27] to perform dimensional reduction of X. Let φ p (X) denote the pth Principal Component of X. We select the value of p in such a manner that results in minimum loss of information (variation of the data) and use that pth principal component in our subsequent steps. The selection of the principal component can be written as a minimization problem: min Using the pth principal component we create the reduced data, Figure 2a shows a 2-D plot of reduced dataset Z where the X-axis is the drug AUC value and the Y-axis is the pth principal component value (in this figure p = 1). Each point represents a cell-line. All subsequent figures will use p = 1.

Embedded Clustering
We create a set of labeled data Q using gene expression data X and drug response data Y: Instead of the full set of genes (G) in the whole genome, we use a subset of target genes to create our dataset Q. The elements of Q will be used subsequently in training of support vector regressions (SVR) [19]. We train SVR S, on 75% (random selection) of the labeled data Q. We then use the trained SVR S, to predict the drug response of all the N cell-line gene expressions and calculate the corresponding predicted error values. From this we create the 2-tupled dataset: where e i denotes the prediction-error obtained from S for input gene expressionX i . The tuple (Y i , e i ), Z i and Q i are all different representations of the same cell-line CL i . We partition the dataset Z into K clusters G 1 . . . G K by applying K-means clustering [28] on Ψ. Figure 2b shows the plot of Ψ with drug AUC as X-axis and prediction error as Yaxis. Each point represents a cell-lines. Figure 2c shows Ψ dataset clustered into K = 8 clusters after applying K-means clustering. Each cluster is represented by a different color. Figure 2d shows the set of clusters G on a 2-D plane. Each color signifies a different cluster. Figure 2d uses the same color coding as Figure 2c, to show that each color in both figures represent the same cluster of cell-lines.

Local Regression and Ensemble Global Value Selection
Let the data point Z j ∈ Z belong to the cluster G k . For each cluster G k we model a SVR S k : where w k and β k represent the classification hyperplane, δ k represents the slack variable and α k represents the hyper parameter used for S k . Let Y jk represent the output predicted by S k . So, we have shows a 2-D rendering of the newly predicted value Y jk produced by input X j when plotted as a point Y jk , φ p X j along with the rest of the data-points in Z. The new value is denoted by the triangle.
Let ψ K j denote the global ensemble optimized value returned by our model kESVR from among the K candidate prediction values generated by each SVR S k . Let η r k (e, f ) represents the set of data points in cluster G k that fall within a circle of radius r centered at point (e, f ). That is η r k () indicates the set of neighbors of point (e, f ) within a circle of radius r. So, the neighbors of predicted data point Y jk , φ p X j can be represented as: Figure 2f illustrates the concept of η r k (). The triangle represents the point Y jk , φ p X j and the black circle of radius r is drawn around it. All data points within the circle represent the set η r k Y jk , φ p X j . In the example shown in Figure 2f, there are 9 points within the radius r, so 9 neighbors.
Let SP Z i , Z j denote the Spearman Correlation value between the gene expression valuesX i andX j of the cell-lines CL i and CL j represented by Z i and Z j respectively. We define the β() score of the output Y jk of SVR S k as follows: The β() score essentially indicates how similar a data point is to its neighbors based on their gene expression profile. If η r k Y jk , φ p X j = 0 then β Y jk = 0. We select ψ K j as the abscissa value of the prediction data-point, that has the highest β Y jk value among all K points. Thus

Optimal Drug Response Prediction
We return the best value of drug response prediction ψ K j by optimizing the parameters p and K in the following objective function:

Steps for Creating kESVR Model for a Specific Drug D
We execute the following steps to develop kESVR model, kESVR D for a particular drug D, using mRNA gene expression data X and drug response data Y.

1.
Perform PCA on the mRNA gene expression data X. Use the first principle component (p = 1) and create the reduce dataset Create the labeled data Q from X and Y. Use 321 target genes ( X i = 321) instead of the whole genome data for creating Q.
Train SVR S, on Q (75% training 25% testing data) and record the predicted value errors. From this create the 2-tupled dataset where e i denotes the prediction error obtained from S for input gene expressionX i . Next apply K-means clustering to partition Ψ into K(=12) clusters that can then be used to partition Z into K(=12) clusters G 1 . . . G 12 .
Given an inputX j , let Y jk represent the output predicted by S k . Calculate the k predicted values from the k SVRs. Then calculate β Y jk score for each of the k Y jk . c.
Select the prediction value ψ K j returned by kESVR to inputX j as the value Y jk with the highest β Y jk score. Retain the model created using the optimal value of k (obtained in step 4) as model kESVR D for drug D.
In case of a new gene-expression input (or from X), use the newly created kESVR D to generate predicted drug response value of that input for drug D.

Simulation
We demonstrate the steps of creation of kESVR model using zebularine drug response from CTRP on 610 cancer cells from CCLE. Figure 3 illustrates the steps for creating the k SVRs in our model development of kESVR for drug zebularine. Figure 3a shows the reduced dataset Z for drug zebularine. Figure 3b plots the data-points in Ψ on a 2-D plane with real/recorded drug AUC values as the abscissa and the errors in the predicted AUC values as the ordinate. Figure 3b is the result of applying K-means clustering on set Ψ that produces 8 clusters. These have been color coded for ease of visualization. Each point in both Figure 3a,b represent the cell lines. We use this clustering information to cluster set Z into the same eight clusters (depicted by the same colors). After clustering of the data points, we train 8 SVRs, one on each cluster.    Figure 3d shows how kESVR predicts drug response value for a given gene expression input sayX j . Each of the eight SVRs, returns a predicted value. These predicted values together with the first principal component value ofX j can be plotted as points on the same 2-D plan containing the clustered data Z. In Figure 3d, these are represented by the triangular points. Note that the color of each triangle matches the color of its generating SVR/cluster. We calculate the β() score for each point. Our kESVR model returns the value (the triangle) with the highest β() score. The idea of the β() score is quite intuitive. While the ordinate value (φ 1 X j ) remains the same for all 8 triangular point, it is the predicted value of each SVR (abscissa value) that determines the location of the triangular point and hence the number of neighbors. To temper the impact of the size of the training set of an SVR (overall density of a cluster), we use the average Spearman correlation value which makes sure that higher similarity ofX j with the genetic profile of a cluster favors the selection of the corresponding predicted value. In other words, if a cluster G i (cell-lines profiles) is more similar to the inputX j than others, then that similarity will be reflected in its β() score.

Implementation
The entire model is implemented using R. We use the caret package [29] to invoke the individual Support Vector Regression models with radial basis function kernel.

Comparison with Standard ML Models
We use 610 cancer cell lines from CCLE and 481 drug response of those cell lines from CTRP for testing our kESVR model. We test the performance of kESVR on 5 random drugs (zebularine, azacytidine, myricetin, BRDK64610608 and nelarabine) out of the 481 drugs from CTRP. We compare kESVR with 4 other machine learning models: (i) Linear regression (LR), (ii) Back Propagation Neural network (BPNN), (iii) Support Vector Regression (SVR) and (iv) Quantile Regression Forest (QRF). For each model the labeled dataset Q of 610 celllines is split into training and testing (75/25) set. We use the average Mean Square Error (MSE) for both the training and testing dataset are our performance metric. Our kESVR uses 5-fold cross-validation to return the average MSE value for a drug. Table 1, shows the optimal value of k that kESVR uses for each of the drugs. Table 2 compares the average MSE value of each model for the 5 drugs. It is evident that kESVR has the lowest MSE value, making it the best performing model among all the models. We also compare the model setup time across all the models for each drug. Table 3 shows kESVR takes longer time than LR, SVR or QRF for the model to be setup but its prediction accuracy makes up for the extra time taken in setup. We next compare the performance of kESVR against the same 4 models for 5 drugs that have the highest variance in drug AUC values amongst the 481 drugs. These 5 drugs are SB743921, paclitaxel, daporinad, neopeltolide and docetaxel. The idea is to see how well kESVR handles large variance of AUC value for a particular drug across the 610 cell-lines. We use the same metric for evaluation. Table 4 shows that except for drug neopeltolide, our model kESVR performs the best again. Now for 3 drugs, kESVR and SVR seem to have the same MSE value. The reason for this can be found in Table 1.  Table 1 shows the optimal k-value that kESVR uses for each drug. It can be seen that for 3 drugs the value of k is 1. That is, for those drugs after trying out different value of k, kESVR finds k = 1 to be the optimal value. A value of k = 1 essentially means a single SVR, hence the similarity of MSE values in Table 4. Table 5 like Table 3 shows that kESVR model takes more time than 3 out of the 4 drugs but has higher accuracy than them.  Tables 6 and 7 gives a glimpse of the performance of our model for drug Zebularine. Table 6 shows that k = 8 produces the lowest average MSE value which is why it is selected as the optimal k-value. Table 7 illustrates that our model does not suffer from over-fitting. Detailed results for all 10 drugs can be found in the Supplementary File: Tables S1-S10. We further compare the goodness of fit of our kESVR model with the 4 models using the R-squared metric. For each drug, we treat the entire data of that drug as our test dataset and use the trained 5 models to generate predicted values for the test data. We then calculate the R-squared value (R 2 = 1 − (residual sum of squares)/(total sum of squares)) for each drug and model. It can be seen from Table 8, that for 7 out of the 10 drugs, kESVR has the best fit (>=0.7) among all 5 models. For 3 drugs, the optimal value of k used by kESVR is 1 (Table 6), which is why the R-squared value of kESVR and SVR is the same.

Comparison with Existing Drug Response Prediction Models
We next compare our kESVR model to contemporary drug response prediction computational methods. A paper by Chen and Zhang [11] evaluates and compares the performance of 17 representative methods for drug response prediction developed in the past five years. These include models based on regression and their generalizations, Bayesian inference methods, matrix-factorization methods, random forest, kernel rank learning and deep convolutional neural networks. They assess the performances of these 17 methods using four large public datasets in nine metrics. We use one of the four datasets (CCLE) and one of the metrics (RMSE) used in their paper [11] to compare the methods in terms of prediction accuracy. To ensure fairness of comparison, we use the exact same curated data and the exact same steps that their paper use to generate the Root Mean Square Error (RMSE) values. This curated data contains expression profiles for 385 cancer cell lines and drug response AUC values for 23 drugs. To generate the RMSE value for each drug d, we use five-fold cross-validation; split the data into 5 parts, train the model with 4 parts of data and use the remaining 1 part as test data. After 5-folds, we get the RMSE value for the whole data. We repeat the process for 10 iterations to get the average RMSE value for the drug d. In this manner, we generate the RMSE values for all 23 drugs and compare them with those generated by the other 17 methods. The complete comparison table can be found in the Supplementary File: Table S11. We rank the performance of the 18 methods in terms of RMSE (lower RMSE means better rank) for all 23 drugs in Supplementary File: Table S12. It is evident from the performance rank table that kESVR places in the top 5 positions for all 23 drugs and holds the first rank for 17 out of 23 drugs. That is, for 17 drugs kESVR out-performs all other methods and for the remaining 6 drugs it performs better than at least 12 methods. In their paper, Chen and Zheng conclude that 4 methods: DualNets [30], Kernelized rank recommendation (KRR) [31], pairwise multiple kernel learning (pairwiseMKL) [32] and similarity-regularized matrix factorization (SRMF) [33] are the best among the 17 methods in terms of prediction accuracy. These 4 methods consistently out-perform the other 13 methods. So, we inspect the performance of kESVR against these 4 methods. The full rank table for kESVR and the 4 methods is provided in Supplementary File: Table S13. We plot the RMSE values of kESVR along with those 4 methods in Figure 4. From the plot and the rank table we see that kESVR is the best performing model for 17 out of 23 drugs. In case of the 6 drugs, where it does not rank first, the lowest rank it gets is third and consistently out-performs DualNets and KRR.

Discussion
kESVR is a data driven model. That is, it does not require any external input of parameter values. kESVR calculates all of its parameters ( , , parameters of individual SVRs) from the input data directly. This makes this model highly robust. kESVR uses PCA in its first step. This leads to the creation of the reduced dataset (Equation (2)). At this step, there are several principal components to choose from, for our model creation. The decision to use the first principal component for this step stems from the fact that the first principal component retains the maximum percentage of variation in the reduced data set. Graphically this means, it gives a better visualization/separation (Figure 2(a)) of the distinct data points on the 2-D plot that is used in the subsequent steps.
The embedded clustering step employs K-means clustering on the dataset Ψ (Equation (4)) in order to cluster the cell-lines into groups. We use K-means for two reasons,

Discussion
kESVR is a data driven model. That is, it does not require any external input of parameter values. kESVR calculates all of its parameters (k, r, parameters of individual SVRs) from the input data directly. This makes this model highly robust. kESVR uses PCA in its first step. This leads to the creation of the reduced dataset Z (Equation (2)). At this step, there are several principal components to choose from, for our model creation. The decision to use the first principal component for this step stems from the fact that the first principal component retains the maximum percentage of variation in the reduced data set. Graphically this means, it gives a better visualization/separation (Figure 2a) of the distinct data points on the 2-D plot that is used in the subsequent steps.
The embedded clustering step employs K-means clustering on the dataset Ψ (Equation (4)) in order to cluster the cell-lines into groups. We use K-means for two reasons, firstly we plan to cluster cell-lines that produce similar prediction errors and secondly since Ψ is a set of 2-tuple data. These clusters then train separate SVRs to reduce the prediction errors. Typically, K-means method itself is prone to producing different clusters each time due to the randomized nature of its initiation. However, in this particular scenario, we observe that even after running multiple times, with this data K-means always produce the same set of centroids. Traditionally, optimal value of k is determined by using metrics such as Silhouette value [34] or Calinski-Harabasz index [35]. However, such methods are not applicable in case of kESVR as our main objective is not to determine how well the data-points are clustered. Clustering is an intermediate step, that plays an important role in the performance of kESVR. That is, the performance metric MSE is dependent on the choice of k. In that respect the choice of value of k, is data driven. As our final objective is to minimize the value of MSE (Equation (8)), kESVR loops through different values of k, and selects the optimal one that gives the lowest MSE value. Figure 5 illustrates how the optimal value of k varies according to metric used for evaluation. Here we use the model for drug zebularine. Figure 5a uses the Calinski-Harabasz index, Figure 5b uses the average Silhouette value and Figure 5c uses the average (Training+Testing) MSE values to get optimal k for drug zebularine. Accordingly, the suggested optimal value of k turn out to be 11, 3 and 8 respectively. From Table 6, we known that only k = 8 gives the lowest average MSE value (best kESVR performance). So k = 8 is selected as the optimal value. Computation of neighbors for any data-point using () (Equation (6)) is dependent on the value of . We use the clustering information of the reduced data to calculate the best value of . Our model kESVR is data-driven. Being an ensemble method, its performance is dependent on the size/volume of the training data being used on each individual SVR. That is, some clusters in can be denser than others. Depending on the value Computation of neighbors for any data-point using η r k () (Equation (6)) is dependent on the value of r. We use the clustering information of the reduced data Z to calculate the best value of r. Our model kESVR is data-driven. Being an ensemble method, its performance is dependent on the size/volume of the training data being used on each individual SVR. That is, some clusters in Z can be denser than others. Depending on the value of r, that density of a cluster can influence the overall performance of kESVR. Figure 6 shows how the performance of kESVR varies with r for drug zebularine. We can see from Figure 6a that as the value of increases the performance of our model deteriorates. This observation is intuitive: if a cluster is very dense (more training instances) and the value of is sufficiently large, then that cluster can end up dominating the prediction value selection process and hence the overall performance. The best performance is given when is small. This is shown is Figure 6b. Keeping these things in mind we calculate as follows: We calculate the x-axis distance (AUC values) between the two farthest points for each cluster. We sort these distances in ascending order and select to be smallest value. This is done in order to be fair in comparing the number of neighbors for each cluster. If in a particular prediction instance, a value of results in zero neighbors in all clusters, we simply select the next higher value in the sorted list, and redo the neighbor selection process for that instance.
kESVR uses an ensemble approach to return the final prediction value from potential predicted values. It uses a maximum () score (Equation (7)) approach wherein the potential predicted value with the highest () score is selected as a final prediction value. We empirically test a total of five different approaches before settling on the () score one. The four other approaches are: (i) select the prediction value whose data-point has the maximum average Spearman correlation value with all cell-lines in the training set of its parent SVR.
(ii) return a weighted average of the prediction values using the () scores of the corresponding data-points as weights.
(iii) select the prediction value whose data-point has the maximum number of neighbors (using ()).
(iv) return a weighted average of the prediction data-points using the number of neighbors (from ()) of the corresponding data-points as weights.
Our experimental comparisons show that the maximum () score approach gives the best accuracy performance i.e. lowest MSE values.
Lastly feature selection plays an important role in the performance of our model kESVR. Intuitively, features indicate those genes that play a crucial role in the functioning of a drug. Expression (high/low) of these feature genes, affects how a cell-line/patient responds to the drug. We use genes that are known to be target genes for the drugs (available from public databases) as our feature genes. This contributes to the improved performance of kESVR over other drug response prediction methods. We can see from Figure 6a that as the value of r increases the performance of our model deteriorates. This observation is intuitive: if a cluster is very dense (more training instances) and the value of r is sufficiently large, then that cluster can end up dominating the prediction value selection process and hence the overall performance. The best performance is given when r is small. This is shown is Figure 6b. Keeping these things in mind we calculate r as follows: We calculate the x-axis distance (AUC values) between the two farthest points for each cluster. We sort these distances in ascending order and select r to be smallest value. This is done in order to be fair in comparing the number of neighbors for each cluster. If in a particular prediction instance, a value of r results in zero neighbors in all clusters, we simply select the next higher value in the sorted list, and redo the neighbor selection process for that instance.

Conclusions
kESVR uses an ensemble approach to return the final prediction value from k potential predicted values. It uses a maximum β() score (Equation (7)) approach wherein the potential predicted value with the highest β() score is selected as a final prediction value. We empirically test a total of five different approaches before settling on the β() score one. The four other approaches are: (i) select the prediction value whose data-point has the maximum average Spearman correlation value with all cell-lines in the training set of its parent SVR.
(ii) return a weighted average of the k prediction values using the β() scores of the corresponding data-points as weights.
(iii) select the prediction value whose data-point has the maximum number of neighbors (using η r k ()). (iv) return a weighted average of the k prediction data-points using the number of neighbors (from η r k ()) of the corresponding data-points as weights. Our experimental comparisons show that the maximum β() score approach gives the best accuracy performance i.e. lowest MSE values.
Lastly feature selection plays an important role in the performance of our model kESVR. Intuitively, features indicate those genes that play a crucial role in the functioning of a drug. Expression (high/low) of these feature genes, affects how a cell-line/patient responds to the drug. We use genes that are known to be target genes for the drugs (available from public databases) as our feature genes. This contributes to the improved performance of kESVR over other drug response prediction methods.

Conclusions
In this paper we introduce a novel computational model for drug sensitivity prediction. We show that our model kESVR consistently outperforms existing traditional ML tools and at least 12 recently developed drug response prediction methods in terms of prediction accuracy. Even among the 4 good methods deemed best performing by Chen and Zhang [11] it is the best performing model in most cases. It is a robust model that is completely data driven and combines both supervised and unsupervised learning concepts in its functionality.
As of now kESVR only uses gene expression data. However, as future work we plan to extend our kESVR model, to incorporate other kinds of omics data such as mutation, copy number variation and methylation data. Currently kESVR does not allow for feature selection. This is something that we will be working on, so that future users can use kESVR to get bio-markers for the drugs as well as their response prediction results. Finally, we plan to use kESVR to create a precision medicine drug recommendation system for patients wherein we use cancer lines and other omics data to perform drug response prediction and drug ranking recommendation efficiently.   Data Availability Statement: Functional genomics data in this manuscript can be found in: Gene expression of breast cancer cell lines are from Cancer Cell Line Encyclopedia (CCLE) (http://www. broadinstitute.org/ccle); Large scale of drug screening in cancer cells is from the Cancer Therapeutics Response Portal (CTRP, https://portals.broadinstitute.org/ctrp). Curated data used in method comparison can be found in the GitHub link provided by the paper by Chen and Zhang [11]. Users can access kESVR model for free at https://github.com/abhishekmaj08/kESVR.git.

Conflicts of Interest:
The authors declare no conflict of interest.