1. Introduction
Cancer has been a major threat to human health throughout history. According to the World Health Organization, breast cancer has become the most commonly diagnosed cancer worldwide, with 2,261,419 cases diagnosed in 2020. Using data published by GLOBOCAN, Siegel et al. estimated that 297,790 new breast cancer cases will be diagnosed among U.S. women in 2023, accounting for 31% of all new cancers in women, along with an estimated 43,170 deaths, accounting for 15% of all cancer deaths [1].
Figure 1 shows their projections for the top-10 female cancer cases and deaths in 2023. Giaquinto et al. studied breast cancer data and found that the number of diagnoses has been increasing for most of the past 40 years, with the rate rising by 0.5% per year in recent years [2]. Thus, the prevention and treatment of breast cancer is of great practical significance.
In order to effectively prevent and treat breast cancer, researchers in biomedicine and genetics have carried out many experiments and studies, and they have found that the incidence of breast cancer is closely related to the estrogen receptor alpha (ERα) subtype. ERα is rarely expressed in normal breast cells (in less than 10% of them), whereas it is overexpressed in approximately 50–80% of breast tumor cells [3]. Experiments in ERα-null mice showed that ERα promoted mammary tumor formation during mammary gland development [4]. In other words, increased ERα activity in benign breast epithelial cells indicates an increased risk of breast cancer. Therefore, ERα is often a significant target in breast cancer drug development, and it is critical to find compounds capable of inhibiting ERα expression [5].
Drug development is characterized by long cycles, high costs, and high failure rates. To save time and money, most pharmaceutical companies develop quantitative structure–activity relationship (QSAR) models to detect on-target and off-target drug activities [6]. Although good biological activity can effectively ensure the effectiveness of drugs against tumor cells, optimizing pharmacokinetics and minimizing toxicity are also crucial to the development of viable new drugs [7]. The pharmacokinetics and toxicity of compounds are comprehensively described by their absorption, distribution, metabolism, excretion, and toxicity, namely ADMET [8]. The ADMET properties of a compound serve as a key indicator of whether it can be used as a drug [9]. Molecular descriptors comprise a variety of characteristic parameters describing the properties and structures of compounds; they have been applied to ADMET prediction and proved to be effective [10]. Therefore, investigating the association between molecular descriptors and ADMET would be beneficial for the discovery of drugs that can antagonize ERα activity.
At present, traditional studies of ADMET properties usually require a large number of animal experiments as well as considerable time and money. The skyrocketing cost of drug development poses a threat to the sustainability of disease treatment [11]. Machine learning has powerful feature extraction and pattern recognition capabilities and has broad application prospects in breast cancer treatment research. It can help researchers better understand the relationship between drugs and breast cancer, improve the therapeutic effect and safety of drugs, diagnose breast cancer quickly and accurately, and promote the development of breast cancer treatment. Dong et al. built an ADMET attribute prediction model utilizing Naive Bayes and Decision Tree (DT) classifiers on multiple databases [12]. Ogura et al. developed a prediction model of hERG inhibitory activity using a support vector machine (SVM) and the Non-Dominated Sorting Genetic Algorithm-II and achieved good results [13]. Jiang et al. constructed data sets to verify the effectiveness of deep neural networks (DNNs), stochastic gradient boosting, and eXtreme Gradient Boosting (XGBoost) in predicting ADMET properties in drug discovery [9]. Peng et al. improved the graph neural network to better predict the ADMET properties of compounds; more meaningful molecular structures were obtained by linking the molecular bond features and node features together and adjusting the domain weight of the central node [14]. Park et al. used the attention mechanism to improve the graph convolutional network (GCN) and achieved good performance on a drug–drug interaction extraction corpus [15]. Venkatraman trained and constructed a fingerprint-based ADMET prediction model using the Random Forest (RF) algorithm [16]. Shi et al. used the principle of stacked generalization to integrate multiple machine learning algorithms and proposed a Two-Level Stacking Algorithm (TLSA), which can be well applied to the classification and screening of breast cancer drugs [17]. Yan et al. utilized K-Nearest Neighbor (KNN), bagging, and eigenvalue classification to design an ensemble classifier model that can diagnose whether a patient has breast cancer based on a mammogram [18].
Based on the above studies, we integrated nine machine learning classification models with a GCN to propose a Stacking Algorithm based on Graph Convolutional Network (SA-GCN) for the classification of the ADMET properties of compounds. The proposed algorithm first performs feature screening on the original data set, then trains on the selected subset to classify the ADMET attributes of compounds, and finally is compared with seven classical algorithms as well as the recently proposed TLSA-SVM and TLSA-LR algorithms. The main contributions are as follows: (1) To better predict ADMET properties and meet the requirements of real-world breast cancer drug development, this study evolved the traditional classification modeling based on molecular descriptors into a two-layer stacked generalization that combines the GCN with ensemble learning to classify the ADMET properties of compounds. (2) To improve model efficiency and performance, the Variance Threshold and Gradient Boosting Decision Tree (GBDT) algorithms were used to select an important feature subset for training the SA-GCN model. (3) The proposed SA-GCN model considers and combines the advantages of nine classification algorithms; in numerical experiments on the ADMET properties of compounds, it showed better accuracy and robustness. (4) Comprehensive numerical experiments comparing the performance of multiple algorithms demonstrated that the SA-GCN has the best classification performance and could assist the development of anti-breast cancer drugs. The rest of this paper is organized as follows.
Section 2 introduces the data composition and sources of this experiment. Section 3 describes the feature screening and classical machine learning methods relevant to this study, as well as the detailed procedure of the proposed algorithm. Section 4 is devoted to the analysis and discussion of the experimental results. The conclusions are given in Section 5.
2. Database Description
The biological activity data of 1974 compounds against the breast cancer therapeutic target ERα used in this study were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/, accessed on 1 December 2022). First, the dataset provided the structural expressions of the compounds, represented by the one-dimensional linear Simplified Molecular Input Line Entry System (SMILES) notation. Second, the dataset specified the biological activity values of the compounds against ERα. Finally, the molecular descriptors of the 1974 compounds were calculated using the PaDEL descriptors provided by ChemDes (http://www.scbdd.com/chemdes/, accessed on 1 December 2022).
The experiments in this paper should not only focus on the biological activities of the compounds but also consider their ADMET properties. The dataset contains 729 molecular descriptors and five class labels corresponding to the ADMET properties of each compound. The five ADMET properties considered in this experiment are as follows: small intestinal epithelial cell permeability (Caco-2), a measure of the ability of a compound to be absorbed by the body; the 3A4 isoform of cytochrome P450 (CYP3A4), a major metabolic enzyme in the human body, which measures the metabolic stability of a compound; cardiac safety evaluation (human Ether-a-go-go Related Gene, hERG), which measures the cardiotoxicity of a compound; Human Oral Bioavailability (HOB), which measures the proportion of an orally administered dose that is absorbed into the blood circulation; and the micronucleus (MN) test, a method to detect whether a compound is genotoxic. Each ADMET property was then encoded symbolically as a binary label; see Table 1. Our proposed SA-GCN model classifies compound properties according to these 1/0 labels.
3. Methodology
In previous screening methods for breast cancer drugs, each compound was often considered independently. However, these compounds share similarities in both structure and properties. Therefore, we constructed a graph (topology) of the compounds according to their property similarity, and the SA-GCN algorithm was developed based on this graph. The overall design flowchart of our algorithm is shown in Figure 2. The main goal of predicting the ADMET properties of compounds is to construct classification models for the five attributes Caco-2, CYP3A4, hERG, HOB, and MN, whose prediction results are 0 or 1. For this binary classification problem, we proposed the classification algorithm SA-GCN. The classification task was divided into two steps. Step 1: select an important subset of features from the original dataset. Step 2: predict and classify the ADMET properties of compounds based on the selected most significant feature subset.
The detailed SA-GCN classification process consists of a data preprocessing stage, a classification stage, and a classification model evaluation stage. In the data preprocessing stage, Variance Thresholding was applied first; this low-variance filtering quickly excluded 225 non-critical features, resulting in a feature subset of 504 descriptors. On this subset, we used the GBDT for feature selection, screening the 40 molecular descriptors with the greatest impact on the ADMET properties. The data were then divided into a training set and a testing set at a ratio of 7:3. Because the GCN requires a graph adjacency matrix, which was not provided in the original data, we constructed the adjacency matrix of the compounds according to whether they behave the same on the five ADMET properties (a sketch of this construction is given below). In the classification stage, the results of feature selection were fed into 10 classification algorithms. The SA-GCN contains two levels: Level 1 uses RF, Extra Trees (ET), GBDT, Decision Tree, Bagging, Adaptive Boosting (AdaBoost), KNN, Logistic Regression (LR), and SVM, and the GCN is used at Level 2. In the model evaluation stage, classification accuracy, recall, F1-score, precision, receiver operating characteristic (ROC) curves, and the area under the ROC curve (AUC) were used to compare and analyze these algorithms.
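As a concrete illustration of the graph construction mentioned above, the following minimal Python sketch connects two compounds when their five binary ADMET labels coincide. This is only one reading of the construction described in the text (a per-property graph is also conceivable); all names and values are illustrative, not taken from the paper.

```python
import numpy as np

def build_compound_adjacency(admet_labels: np.ndarray) -> np.ndarray:
    """Connect two compounds when their five binary ADMET labels are identical.

    admet_labels: array of shape (n_compounds, 5) with 0/1 entries.
    Returns a symmetric adjacency matrix of shape (n_compounds, n_compounds).
    """
    n = admet_labels.shape[0]
    adj = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            if np.array_equal(admet_labels[i], admet_labels[j]):
                adj[i, j] = adj[j, i] = 1.0
    return adj

# Illustrative use with random labels for six hypothetical compounds.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(6, 5))
A = build_compound_adjacency(labels)
print(A)
```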
3.1. Feature Subset Selection
Feature selection is the process of selecting valuable features and removing redundant ones from the original feature set so that the data occupy a smaller dimensional space [19]. The goal of feature selection is to make predictive models more efficient, cost-effective, and accurate, as well as to better understand the underlying processes that generated the data [20]. The variance threshold is a filtering method based on feature variance, which can quickly filter data to obtain important features [21]. The GBDT assigns scores to features by evaluating how much each feature contributes to reducing the loss function when building a decision tree and then ranks and selects features by averaging or accumulating these scores across multiple trees [22].
There are 729 molecular descriptors in the original data set, and such high dimensionality affects the convergence of the model to a certain extent. Features relevant to the ADMET attributes were therefore selected using the GBDT to facilitate subsequent research, and the result of the GBDT feature selection was used as the training set of our model. See Figure 3 for the detailed feature selection process.
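The two-stage screening can be sketched with scikit-learn as follows. The variance threshold value used here is an assumption (the text reports only that 225 low-variance descriptors were removed), and the GBDT hyperparameters are library defaults rather than the paper's settings.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier

def select_features(X: np.ndarray, y: np.ndarray, n_keep: int = 40) -> np.ndarray:
    """Two-stage screening: drop low-variance descriptors, then keep the
    n_keep descriptors with the highest GBDT importance scores."""
    # Stage 1: low-variance filtering (the threshold value is an assumption).
    vt = VarianceThreshold(threshold=0.1)
    X_var = vt.fit_transform(X)
    kept_idx = np.flatnonzero(vt.get_support())

    # Stage 2: rank the surviving descriptors by GBDT feature importance.
    gbdt = GradientBoostingClassifier(random_state=0)
    gbdt.fit(X_var, y)
    order = np.argsort(gbdt.feature_importances_)[::-1][:n_keep]
    return kept_idx[order]  # indices into the original 729-descriptor set
```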
3.2. Classification Methods
Gori et al. first introduced the concept of the graph neural network, extending neural network methods to the field of graph data [23]. Graph neural networks were then studied further, with the Banach fixed-point theorem serving as their theoretical cornerstone [24,25]. Kipf and Welling proposed the graph convolutional network [26]. It constructs the convolutional architecture of samples via a localized first-order approximation of spectral graph convolutions; the original features and the convolutional architecture information of the samples are then fused by the convolution process, and more accurate classification results can be obtained faster [27].
Dasarathy and Sheela were the first to propose the architecture of an ensemble system, using linear and nearest-neighbor classifiers to form a composite system that illustrated the concept [28]. Subsequently, Hansen and Salamon proposed ensemble learning based on neural networks, which is divided into two categories: Boosting and Bagging [29]. Boosting can combine weak classifiers into a strong classifier; its appearance made ensemble learning develop rapidly, and many novel ideas and models have since appeared [30]. AdaBoost, developed by Freund and Schapire, is one of the most widely used [31]. Friedman then proposed the GBDT, which differs from AdaBoost's iterative idea of changing the data distribution [32]. Considering the independence of the data, Breiman developed Bagging to combine base learners from a random perspective [33]. He subsequently proposed the random forest algorithm, which introduces random features on the basis of Bagging to further improve the independence between base learners [34]. Geurts et al. proposed Extremely Randomized Trees, which can work better than random forests in some settings because both the candidate features and the split points are chosen at random [35].
The decision tree is a typical classification algorithm that classifies instances through a sequence of feature-based tests [36]. Cortes and Vapnik proposed the SVM as a typical binary classification algorithm [37]; its goal is to compute the maximum-margin separating hyperplane to obtain an efficient binary classification of the data. Although LR is formulated as a regression model, it is a discriminative probabilistic classifier [38]. KNN is a classification algorithm that is easy to understand and use; because of its simplicity and efficiency, it is widely used in various fields [39].
3.3. Proposed Stacking Algorithm Based on GCN (SA-GCN)
Stacked generalization is a kind of ensemble learning. In traditional ensemble learning, an objective function is sometimes approximated by combining multiple models; generally, the results of the individual models are combined by voting (majority wins), weighted voting (some classifiers are more authoritative than others), or averaging [40]. Compared with a single model, the generalization ability of the system can be significantly improved. Stacked generalization was first proposed by Wolpert, who regarded it as similar to a winner-takes-all, cross-validation-based ensemble approach [41]. The basic idea of stacked generalization is that there are two layers of learners along the vertical direction of the data partition: the initial data set trains the base learners at level 1, and their outputs are used as the training set for the learner at level 2, also known as the meta-learner. The labels at training time are the same for both layers of learners.
For the proposed SA-GCN algorithm, the base learners at level 1 are composed of nine classification algorithms and validated through 5-fold cross-validation. The GCN model is used as the meta-learner at level 2. For the GCN, we defined a two-layer network whose input dimension is 9 (the output dimension of the base learners), whose hidden layer has dimension 5, and whose output layer has dimension 2 (the number of classes). Our two-layer GCN forward model takes the simple form
$$Z = f(X, A) = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}\big),$$
where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the adjacency matrix normalized after adding self-connections, $\tilde{A} = A + I_N$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$ with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(0)}$ is the input-to-hidden weight matrix, and $W^{(1)}$ is the hidden-to-output weight matrix. The activation function of the hidden layer is $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. The activation function of the output layer is defined as $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$, which is applied row-wise. The loss function of the model is chosen as the cross-entropy error:
$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{j} Y_{lj} \ln Z_{lj}.$$
Here, since this is a binary classification problem, $Y_l$ is a two-dimensional label vector, and the values of $Y_{lj}$ and $j$ are 0 or 1.
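To make the forward model concrete, the following NumPy sketch normalizes the adjacency matrix and evaluates the two-layer propagation with random placeholder weights. It illustrates the equations above rather than reproducing the trained model; the small graph and the weights are synthetic.

```python
import numpy as np

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(X, A, W0, W1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1), applied row-wise."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)            # hidden layer with ReLU
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

# Dimensions from the text: 9 inputs (base-learner outputs), 5 hidden units,
# 2 output classes; the graph and weights below are random placeholders.
rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 9))
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T
Z = gcn_forward(X, A, rng.normal(size=(9, 5)), rng.normal(size=(5, 2)))
print(Z.sum(axis=1))  # each row of Z sums to 1
```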
The weight matrices $W^{(0)}$ and $W^{(1)}$ are trained using gradient descent. The other hyperparameters of the GCN were chosen as follows: the model was trained for 200 epochs (training iterations) with a learning rate of 0.01, and an L2 regularization factor was also applied. See Figure 4 for the variation in loss value and accuracy with the training iterations.
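A hedged PyTorch sketch of this training setup is given below. The layer sizes, epoch count, and learning rate follow the text; the use of plain SGD (as a stand-in for "gradient descent") and the weight-decay value are assumptions, since the exact L2 factor is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Minimal dense two-layer GCN: A_hat ReLU(A_hat X W0) W1 (logits)."""
    def __init__(self, in_dim: int = 9, hid_dim: int = 5, out_dim: int = 2):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x, a_hat):
        h = F.relu(a_hat @ self.w0(x))
        return a_hat @ self.w1(h)  # raw logits; softmax is folded into the loss

def train_gcn(model, x, a_hat, y, train_mask, epochs=200, lr=0.01, l2=5e-4):
    # l2 (weight decay) is an assumed value, not the paper's exact factor.
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=l2)
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(x, a_hat)
        # Cross-entropy error on the labeled (training) nodes only.
        loss = F.cross_entropy(logits[train_mask], y[train_mask])
        loss.backward()
        opt.step()
    return model
```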
The algorithm proceeds in three steps. Step 1: select nine base models and train them on the whole training set $T$ to obtain the trained base learners $h_1, \dots, h_9$. Step 2: pass the outputs of the nine trained learners on the training set $T$ to the meta-learner as its training examples. Step 3: train the meta-learner on the transmitted training examples to obtain the final model $H$. The pseudo-code of the proposed SA-GCN algorithm is shown in Algorithm 1.
Algorithm 1: Stacking Algorithm based on Graph Convolutional Network
Input: training data $T = \{(x_i, y_i)\}_{i=1}^{m}$; primary learning algorithms RF, ET, GBDT, DT, Bagging, AdaBoost, KNN, SVM, LR; secondary learning algorithm GCN.
Output: ensemble classifier $H$
1: Step 1: learn base-level classifiers
2: for $t = 1, \dots, 9$ do
3:   learn $h_t$ based on $T$ by the $t$-th primary learning algorithm
4: end for
5: Step 2: construct a new data set of predictions
6: for $i = 1, \dots, m$ do
7:   $T' = \{(x_i', y_i)\}$, where $x_i' = (h_1(x_i), \dots, h_9(x_i))$
8: end for
9: Step 3: learn a meta-classifier
10: learn $H$ based on $T'$ by the secondary learning algorithm
11: return $H$
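Step 2 of Algorithm 1 can be sketched in Python with scikit-learn as follows: out-of-fold predictions of the nine level-1 learners form the 9-dimensional meta-features that are then handed to the level-2 GCN (see the GCN sketch above). The base-learner hyperparameters here are library defaults, not the paper's settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, BaggingClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

BASE_LEARNERS = [
    RandomForestClassifier(), ExtraTreesClassifier(), GradientBoostingClassifier(),
    DecisionTreeClassifier(), BaggingClassifier(), AdaBoostClassifier(),
    KNeighborsClassifier(), SVC(), LogisticRegression(max_iter=1000),
]

def level1_meta_features(X, y, cv=5):
    """Build the 9-dimensional meta-feature matrix from 5-fold out-of-fold
    predictions of the nine base learners (Step 2 of Algorithm 1)."""
    cols = [cross_val_predict(clf, X, y, cv=cv) for clf in BASE_LEARNERS]
    return np.column_stack(cols)  # shape (n_samples, 9)

# The meta-features (together with the compound graph) are then passed to the
# level-2 GCN; the base learners are also refit on the full training set so
# they can produce level-1 features for unseen test compounds.
```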
4. Results and Discussion
In order to better display and evaluate our model, the classification results of the SA-GCN model and the evaluation metrics are given here. These metrics were used to compare the numerical classification results of the SA-GCN with seven classical classification models and the TLSA model recently proposed by Shi et al. [17]. Our proposed classification model SA-GCN was implemented in Python 3.10 on a laptop with an Intel Core i5 processor.
4.1. Performance Evaluation Metrics
This paper studies a binary classification problem and proposes the SA-GCN classification algorithm. To illustrate the classification performance and feasibility of our model, we compared it with nine other classification models. The selected evaluation metrics include the accuracy, recall, F1-score, precision, ROC curve, AUC, and confusion matrix (see Table 2). The formulas of these metrics are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Here, the meanings of TP, FP, TN, and FN are shown in Table 2. The ROC curve plots the true positive rate (TPR) on the ordinate against the false positive rate (FPR) on the abscissa. TPR, also called Sensitivity, is equal to recall, and FPR is equal to 1 − Specificity. The farther the ROC curve is from the diagonal, i.e., the larger the area under the curve, the more discriminative the model and the better its performance.
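For reference, these metrics can be computed directly with scikit-learn as in the following sketch; y_score denotes the predicted probability of the positive class and is only needed for the AUC. The function name and dictionary layout are illustrative.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_score):
    """Compute the evaluation metrics used in this comparison.
    y_score is the predicted probability of the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":  accuracy_score(y_true, y_pred),   # (TP+TN)/(TP+TN+FP+FN)
        "precision": precision_score(y_true, y_pred),  # TP/(TP+FP)
        "recall":    recall_score(y_true, y_pred),     # TP/(TP+FN)
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),
        "confusion": {"TP": tp, "FP": fp, "TN": tn, "FN": fn},
    }
```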
4.2. Results of Classification
In this subsection, to scientifically and effectively reflect the superiority of the SA-GCN model, its classification performance is compared horizontally with that of nine other classification models. The inputs to these ten models are the results of the same feature selection. For the Caco-2 (A) property of the compounds, Table 3 shows the evaluation metrics of the classification performance of the SA-GCN and the other nine algorithms. Table 4 shows the results for the CYP3A4 (D) attribute, Table 5 for the hERG (M) attribute, Table 6 for the HOB (E) attribute, and Table 7 for the MN (T) attribute. Through the above comparison, it can be seen that, averaged over the five ADMET attributes, the accuracy of the SA-GCN algorithm was improved by 4.1821%, 4.5531%, 5.6324%, 8.7015%, 6.3069%, 6.7117%, 8.8027%, 4.6880%, and 4.4182% compared with the other nine algorithms, respectively. In addition, the SA-GCN was also better than the other nine methods on the remaining performance metrics. The ROC curve comparisons between the SA-GCN and RF on the ADMET properties are shown in Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9. In an ROC plot, the curvature reflects the sensitivity index, and the diagonal line represents a line with discrimination equal to 0, also known as the pure chance line. The farther the ROC curve is from the pure chance line, the more discriminative the algorithm. The ROC curve of the SA-GCN is farther from the pure chance line than that of the RF, so it is more discriminative. The ROC curve comparisons between the SA-GCN model and the other eight individual models can be seen in the Supplementary Material.
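An ROC comparison of the kind shown in Figures 5–9 can be reproduced with a short scikit-learn/matplotlib sketch such as the one below; the model names and score arrays are placeholders, not the paper's stored results.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(y_true, scores_by_model):
    """scores_by_model: dict mapping a model name to its positive-class scores."""
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="pure chance line")
    plt.xlabel("False positive rate (1 - Specificity)")
    plt.ylabel("True positive rate (Sensitivity)")
    plt.legend()
    plt.show()
```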
4.3. Discussion
Drug discovery is a very complex process: from the research of a new drug to the release of the finished product, it requires trial and error across multiple disciplines, takes 12–15 years, and costs more than one billion dollars [42,43]. We know that the ADMET properties of compounds are an important factor in drug design, and measuring and evaluating them experimentally is expensive, time-consuming, and resource-intensive. However, with the rapid progress of computer technology, machine learning can model the ADMET data of known drugs and use these models to predict the ADMET characteristics of new drugs, providing guidance and optimization for drug screening. Therefore, the purpose of this paper is to predict the ADMET properties of drugs, mine tacit knowledge, and establish a prediction model that reduces the time, monetary cost, and resource loss in the drug discovery process.
This study aims to screen out compounds that are easily absorbed, have stable metabolic rates, and are non-toxic, that is, anti-breast cancer compounds with good ADMET properties. We used machine learning methods to assist the classification of compound properties, which greatly saves time and money and reduces the loss of resources. Firstly, for the bioactivity data of ERα antagonists, a two-stage feature selection was implemented using the variance threshold and the GBDT algorithm; the characteristics of the 1974 compounds were evaluated, and the 40 most important features were obtained. Secondly, the SA-GCN model was proposed based on 10 algorithms to predict the ADMET attributes. Finally, to verify the effectiveness of the SA-GCN, its prediction results were horizontally compared with those of the other nine classification models.
To ensure the reliability and validity of the compared classifier performance, the same data were used for training and testing all classifiers in this study. Based on the horizontal comparison of the evaluation metrics of the classifier prediction results, the SA-GCN model outperformed the other algorithms. The improvement in prediction performance is mainly due to the combination of multiple algorithms in the first layer of the SA-GCN model, which extracts the hidden features of the data from the perspective of different data structures and data spaces. Building on the combined advantages of the first-layer models, the second layer can effectively extract spatial features from the topological associations of the data. The horizontal comparison experiments show that the SA-GCN model has better prediction performance and generalization ability. Thus, in drug development, it can help to judge the potency and safety of candidate compounds, screen out potential active compounds, and reduce the failure rate of drug development.
5. Conclusions
In this paper, we proposed SA-GCN, a compound ADMET property classification algorithm based on the GCN and ensemble learning. Compared with the other nine algorithms, its prediction accuracy is improved. The highest prediction accuracy values of the proposed method for the five ADMET properties were 97.6391%, 98.1450%, 94.4351%, 96.4587%, and 97.9764%, respectively. Therefore, the SA-GCN model achieved good performance in classifying the ADMET attributes of breast cancer drug candidates. We hope that this model can help to enhance tacit knowledge discovery in the drug screening process, avoid trial redundancy, and provide accurate prediction services for the discovery of potential drugs. At present, ensemble learning and deep learning have been successfully applied in the biomedical field. In the future, they will continue to provide significant help for the development of drugs. With the help of recent advances in computer technology and artificial intelligence, cancer treatment will evolve with the times.
Our research can be improved in the following respects in the future. The algorithm extracts features from the molecular descriptors of the compounds, but it could also exploit feature information from the molecular structures. In addition, this study addressed a binary classification problem, and the algorithm could be extended to multi-class classification problems.