# kESVR: An Ensemble Model for Drug Response Prediction in Precision Medicine Using Cancer Cell Lines Gene Expression


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Materials

#### 2.2. Method

#### 2.2.1. Overview

- Dimensional reduction: In the first step, we convert the high-dimensional gene expression data to low-dimensional data that is easier to handle and visualize in the subsequent steps.
- Embedded clustering: In the second step, we split the lower-dimensional data into distinct clusters based on their labeled (given) drug response values, so that data points (cell lines) with similar drug response values are grouped together.
- Local regression and ensemble value selection: In the third step, we train a separate instance of a machine learning (ML) model on each cluster of data points obtained in the previous step. If there are $k$ clusters, we train $k$ instances of the ML model. For a new input, we therefore have $k$ candidate predictions to choose from. We select the best one with a score-based approach, where the score is based on the similarity of gene expression profiles between the input and the training data.
- Optimal drug response value prediction: In the final step, we optimize the number of clusters $k$ to obtain the kESVR model that gives the best performance (minimum Mean Square Error).
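
The score-based selection in the third step can be illustrated with a small sketch. This section does not define the scoring function, so the neighbour-count score below (`neighbour_score`, with a radius `r`) is a hypothetical stand-in, and measuring similarity in the reduced (response, PC1) plane is likewise an assumption. The toy clusters and candidate values are invented for illustration; the sketch only shows the shape of "$k$ candidates in, best-scoring candidate out".

```python
import math

def neighbour_score(point, cluster_points, r):
    """Hypothetical score: number of training points within distance r of `point`."""
    return sum(1 for q in cluster_points if math.dist(point, q) <= r)

def select_prediction(pc1_input, candidates, clusters, r=1.0):
    """Pick the candidate prediction whose cluster supports it best.

    candidates[i] is the prediction of the i-th local model;
    clusters[i] holds that model's training points as (response, PC1) pairs.
    """
    scores = [
        neighbour_score((y_hat, pc1_input), cluster, r)
        for y_hat, cluster in zip(candidates, clusters)
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, candidates[best]

# Toy data (invented): two clusters of (drug response, PC1) pairs.
clusters = [
    [(1.0, 0.1), (1.2, 0.3), (0.9, 0.2)],
    [(5.0, 0.2), (5.3, 0.1)],
]
candidates = [1.1, 4.0]  # one candidate prediction per local model

best, value = select_prediction(0.2, candidates, clusters, r=0.5)
print(best, value)  # 0 1.1
```

Here the first candidate lands among three training points of its own cluster, while the second has no neighbours within `r`, so the first is returned.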

#### 2.2.2. Generalized Description

#### Dimensional Reduction

#### Embedded Clustering

#### Local Regression and Ensemble Global Value Selection

#### Optimal Drug Response Prediction

#### 2.2.3. Steps for Creating kESVR Model for a Specific Drug $\mathrm{D}$

1. Perform PCA on the mRNA gene expression data $X$. Use the first principal component ($p=1$) to create the reduced dataset $Z=\left\{{Z}_{1},{Z}_{2},\dots ,{Z}_{N}\right\}$, where ${Z}_{i}=\left({Y}_{i},{\varphi}_{1}\left({X}_{i}\right)\right)$.
2. Create the labeled data $Q$ from $X$ and $Y$, using 321 target genes ($\left|{\widehat{X}}_{i}\right|=321$) instead of the whole-genome data:

   $$Q=\left\{{Q}_{1},{Q}_{2},\dots ,{Q}_{N}\right\}\ \text{where}\ {Q}_{i}=\left\{\left({\widehat{X}}_{i},{Y}_{i}\right)\right\},\ {\widehat{X}}_{i}\subset {X}_{i}\ \text{and}\ \left|{\widehat{X}}_{i}\right|=321.$$

   Train an SVR $S$ on $Q$ (75% training, 25% testing data) and record the prediction errors. From these, create the 2-tuple dataset $\Psi =\left\{\left({Y}_{1},{e}_{1}\right),\left({Y}_{2},{e}_{2}\right),\dots ,\left({Y}_{N},{e}_{N}\right)\right\}$, where ${e}_{i}$ denotes the prediction error obtained from $S$ for the input gene expression ${\widehat{X}}_{i}$. Next, apply K-means clustering to partition $\Psi$ into $K$ (=12) clusters, which are then used to partition $Z$ into $K$ (=12) clusters ${G}_{1}\dots {G}_{12}$.
3. Repeat for $k=1$ to 12:
   - Train $k$ SVRs ${S}_{1}\dots {S}_{12}$ on the clusters ${G}_{1}\dots {G}_{12}$ (75% training, 25% testing).
   - Given an input ${\widehat{X}}_{j}$, let ${\overline{Y}}_{jk}$ represent the output predicted by ${S}_{k}$. Calculate the $k$ predicted values from the $k$ SVRs. Then calculate the score $\beta \left({\overline{Y}}_{jk}\right)$ for each of the $k$ candidates ${\overline{Y}}_{jk}$.
   - For the input ${\widehat{X}}_{j}$, kESVR returns as its prediction ${\psi}_{j}^{K}$ the candidate ${\overline{Y}}_{jk}$ with the highest score $\beta \left({\overline{Y}}_{jk}\right)$.
   - Calculate the squared error as ${\left|{Y}_{j}-{\psi}_{j}^{K}\right|}^{2}$.
   - Calculate the average Mean Square Error ($MS{E}_{k}$) over both training and testing data ($N$ cell lines).
4. Select the value of $k$ with the lowest $MS{E}_{k}$ among $MS{E}_{1},\dots ,MS{E}_{12}$ as the ideal number of clusters for $kESV{R}_{D}$.
5. Retain the model created using the optimal value of $k$ (obtained in step 4) as the model $kESV{R}_{D}$ for drug $D$.
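
The clustering step (K-means on $\Psi$, then carrying the cluster labels over to $Z$) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `kmeans` is a plain Lloyd's algorithm with a deterministic initialisation chosen for reproducibility, and the small `psi` and `z` lists are invented toy data with $K=2$ rather than 12.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    # Deterministic initialisation (an assumption): k points spread over the input order.
    centroids = [points[(i * len(points)) // k] for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Toy data (invented): psi holds (Y_i, e_i), z holds (Y_i, PC1_i) for the same cell lines.
psi = [(0.5, 0.10), (0.6, 0.20), (0.4, 0.15), (3.0, 1.0), (3.2, 1.1), (2.9, 0.9)]
z = [(0.5, -1.2), (0.6, -0.8), (0.4, -1.0), (3.0, 0.9), (3.2, 1.3), (2.9, 1.1)]

labels = kmeans(psi, k=2)

# The labels found on psi are reused to partition z, index by index.
groups = {}
for label, z_i in zip(labels, z):
    groups.setdefault(label, []).append(z_i)

print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because $\Psi$ and $Z$ are indexed by the same cell lines, no second clustering run is needed: each $Z_i$ simply inherits the label of $\left(Y_i, e_i\right)$.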

#### 2.2.4. Simulation

#### 2.2.5. Implementation

## 3. Results

#### 3.1. Comparison with Standard ML Models

#### 3.2. Comparison with Existing Drug Response Prediction Models

## 4. Discussion

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations


**Figure 2.** (**a**) 2-D plot of the reduced dataset $Z$. (**b**) 2-D representation of the prediction error dataset $\Psi$. (**c**) Clustering of the $\Psi$ dataset. (**d**) 2-D plot of the clustered $Z$. (**e**) 2-D plot of the data points in $Z$ with the newly predicted value. (**f**) 2-D plot of a predicted data point $\left({\overline{Y}}_{jk},{\varphi}_{p}\left({X}_{j}\right)\right)$ and its neighbors.

**Figure 3.** The steps of kESVR model creation for predicting the response (AUC) to the drug zebularine on 610 cancer cells from CCLE. (**a**) Dimension reduction of the gene expression profiles and mapping of the latent variable PC1 and the zebularine drug response onto 2-D space. (**b**) Identifying local clusters for regression with the k-means algorithm. (**c**) Construction of local SVR models after clustering the reduced dataset $Z$ into 8 clusters. (**d**) Multiple prediction candidates from the 8 SVRs trained on the clusters.

**Figure 4.** Comparison of kESVR with the DualNets, KRR, pairwiseMKL and SRMF models in terms of Root Mean Square Error (RMSE) over 23 drugs. kESVR is the best-performing (lowest RMSE) model for 17 of the 23 drugs. For the remaining 6 drugs, kESVR ranks among the top 3 of the 5 models.

**Figure 5.** Selection of the optimal value of $k$ for the drug zebularine. (**a**) The Calinski-Harabasz index gives an optimal $k=11$. (**b**) The average Silhouette value gives an optimal $k=3$. (**c**) The average (Train+Test) MSE value gives an optimal $k=8$.

**Figure 6.** Variation of the performance of kESVR with respect to $r$ for the drug zebularine. (**a**) $r$ is varied from 0 to 500. (**b**) Detail of (**a**) for $r$ varied from 0 to 1.

| Drug | Optimal $k$ |
|---|---|
| zebularine | 8 |
| azacitidine | 7 |
| myricetin | 8 |
| BRDK64610608 | 8 |
| nelarabine | 12 |
| SB743921 | 1 |
| paclitaxel | 8 |
| daporinad | 8 |
| neopeltolide | 1 |
| docetaxel | 1 |

**Avg. (Training + Testing) MSE**

| Drug | LR | BPNN | SVR | QRF | kESVR |
|---|---|---|---|---|---|
| zebularine | 36.490 | 1.078 | 1.039 | 1.001 | 0.336 |
| azacitidine | 188.773 | 0.983 | 1.028 | 1.001 | 0.307 |
| myricetin | 117.890 | 0.902 | 0.905 | 0.984 | 0.301 |
| BRDK64610608 | 49.670 | 0.987 | 1.018 | 1.078 | 0.350 |
| nelarabine | 42.137 | 1.010 | 1.093 | 1.090 | 0.450 |

**Model Setup Time (in sec)**

| Drug | LR | BPNN | SVR | QRF | kESVR |
|---|---|---|---|---|---|
| zebularine | 2.390 | 9417.563 | 30.328 | 3342.786 | 10,934.530 |
| azacitidine | 2.419 | 8830.332 | 29.412 | 3603.740 | 7787.4456 |
| myricetin | 2.487 | 15,179.391 | 30.088 | 3375.857 | 8259.927 |
| BRDK64610608 | 2.493 | 13,580.990 | 29.556 | 3329.557 | 8442.622 |
| nelarabine | 2.335 | 14,683.006 | 27.803 | 3608.870 | 7334.934 |

**Avg. (Training + Testing) MSE**

| Drug | LR | NN | SVR | QRF | kESVR |
|---|---|---|---|---|---|
| SB743921 | 674.143 | 3.496 | 3.095 | 3.238 | 3.095 |
| paclitaxel | 98.3038 | 3.442 | 3.067 | 3.070 | 2.472 |
| daporinad | 137.176 | 3.458 | 3.206 | 3.140 | 2.082 |
| neopeltolide | 118.476 | 3.256 | 3.358 | 3.443 | 3.358 |
| docetaxel | 110.085 | 3.360 | 2.856 | 3.074 | 2.856 |

**Model Setup Time (in sec)**

| Drug | LR | BPNN | SVR | QRF | kESVR |
|---|---|---|---|---|---|
| SB743921 | 2.271 | 10,361.546 | 27.447 | 3563.221 | 8918.989 |
| paclitaxel | 2.088 | 9439.555 | 27.139 | 3497.768 | 8575.110 |
| daporinad | 2.316 | 11,495.089 | 26.572 | 2391.248 | 7637.791 |
| neopeltolide | 1.617 | 2132.497 | 7.937 | 570.632 | 2738.013 |
| docetaxel | 1.650 | 2849.202 | 11.419 | 1139.292 | 4266.580 |

**Drug: zebularine**

| $k$ | Avg. (Training + Testing) MSE |
|---|---|
| 1 | 1.039 |
| 2 | 0.815 |
| 3 | 0.749 |
| 4 | 0.568 |
| 5 | 0.621 |
| 6 | 0.483 |
| 7 | 0.423 |
| 8 | 0.336 |
| 9 | 0.357 |
| 10 | 0.351 |
| 11 | 0.405 |
| 12 | 0.579 |
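
The choice of $k$ for zebularine follows directly from these values: the model keeps the $k$ with the lowest average MSE. A short check, with the numbers copied from the table above (the variable names are illustrative only):

```python
# Avg. (Training + Testing) MSE for zebularine, one entry per number of clusters k.
mse_by_k = {
    1: 1.039, 2: 0.815, 3: 0.749, 4: 0.568,
    5: 0.621, 6: 0.483, 7: 0.423, 8: 0.336,
    9: 0.357, 10: 0.351, 11: 0.405, 12: 0.579,
}

# Pick the k whose MSE is smallest.
best_k = min(mse_by_k, key=mse_by_k.get)
print(best_k, mse_by_k[best_k])  # 8 0.336

# Relative reduction in MSE compared with a single global model (k = 1).
reduction = 1 - mse_by_k[best_k] / mse_by_k[1]
```

Note that the curve is not monotone (for example, $k=5$ is worse than $k=4$), so the minimum has to be found by evaluating every $k$ rather than by stopping at the first increase.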

**Drug: zebularine**

| Fold | Training Set MSE | Testing Set MSE | Avg. (Training + Testing) MSE |
|---|---|---|---|
| 1 | 0.203 | 0.335 | 0.269 |
| 2 | 0.063 | 0.706 | 0.384 |
| 3 | 0.207 | 0.618 | 0.413 |
| 4 | 0.227 | 0.276 | 0.252 |
| 5 | 0.316 | 0.404 | 0.360 |
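
As a quick arithmetic check on the fold table, the last column is the plain mean of the training and testing MSE, up to rounding in the third decimal. Averaging that column over the five folds gives a value that appears consistent with the 0.336 reported for zebularine at $k=8$, although this section does not state that the figure was obtained this way. The numbers below are copied from the table; the variable names are illustrative.

```python
folds = [
    # (training MSE, testing MSE, reported average)
    (0.203, 0.335, 0.269),
    (0.063, 0.706, 0.384),
    (0.207, 0.618, 0.413),
    (0.227, 0.276, 0.252),
    (0.316, 0.404, 0.360),
]

# Largest gap between (train + test) / 2 and the reported average;
# it stays within rounding of the third decimal.
max_gap = max(abs((train + test) / 2 - reported) for train, test, reported in folds)

# Mean of the reported per-fold averages.
mean_over_folds = sum(reported for _, _, reported in folds) / len(folds)
print(round(mean_over_folds, 3))  # 0.336
```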

**R-Squared Value**

| Drug | LR | BPNN | SVR | QRF | kESVR |
|---|---|---|---|---|---|
| zebularine | −0.094 | −0.00047 | 0.284 | 0.693 | 0.778 |
| azacitidine | −0.267 | $-4.5\times {10}^{-5}$ | 0.139 | 0.691 | 0.789 |
| myricetin | −0.25 | −0.000681 | 0.082 | 0.698 | 0.772 |
| BRDK64610608 | 0.072 | −0.000809 | 0.093 | 0.698 | 0.802 |
| nelarabine | −0.152 | $-4.2\times {10}^{-5}$ | 0.119 | 0.627 | 0.700 |
| SB743921 | 0.263 | −0.002453 | 0.66 | 0.795 | 0.66 |
| paclitaxel | 0.093 | $-6.0\times {10}^{-6}$ | 0.415 | 0.764 | 0.853 |
| daporinad | −0.262 | −0.000608 | 0.382 | 0.768 | 0.903 |
| neopeltolide | −11789.886 | −0.00368 | 0.385 | 0.781 | 0.385 |
| docetaxel | −301.018 | −0.001727 | 0.644 | 0.788 | 0.644 |
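
Several entries in the R-squared table are negative. Assuming the table uses the standard coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$ (the section does not state the formula), a negative value means the model's squared error is larger than that of always predicting the mean response. The sketch below uses invented toy numbers to show the three regimes.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]  # toy responses (invented)

print(r_squared(y, y))                     # 1.0  (perfect prediction)
print(r_squared(y, [2.5] * 4))             # 0.0  (always predicting the mean)
print(r_squared(y, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean)
```

Under this reading, values very close to zero, such as those in the BPNN column, correspond to predictions that are barely distinguishable from the mean response.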

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Majumdar, A.; Liu, Y.; Lu, Y.; Wu, S.; Cheng, L. kESVR: An Ensemble Model for Drug Response Prediction in Precision Medicine Using Cancer Cell Lines Gene Expression. *Genes* **2021**, *12*, 844.
https://doi.org/10.3390/genes12060844
