1. Introduction
Cancer has been a major threat to human health throughout history. According to the World Health Organization, breast cancer has become the most commonly diagnosed cancer worldwide, with 2,261,419 cases diagnosed in 2020. Using data published by GLOBOCAN, Siegel et al. estimated that 297,790 new breast cancer cases will be diagnosed among U.S. women in 2023, accounting for 31% of all new cancers in women, along with an estimated 43,170 deaths, accounting for 15% of all cancer deaths [1].
Figure 1 shows their projections for the top-10 female cancer cases and deaths in 2023. Giaquinto et al. studied breast cancer data and found that the number of diagnoses has been increasing for most of the past 40 years, with the rate rising by 0.5% per year in recent years [2]. Thus, the prevention and treatment of breast cancer is of great practical significance.
In order to effectively prevent and treat breast cancer, researchers in biomedicine and genetics have carried out many experiments and studies, and they have found that the incidence of breast cancer is closely related to the estrogen receptor alpha (ERα) subtype. ERα is rarely expressed in normal breast cells (in less than 10% of them), whereas it is overexpressed in approximately 50–80% of breast tumor cells [3]. Experiments in ERα-null mice showed that ERα promoted mammary tumor formation during mammary gland development [4]. In other words, increased ERα activity in benign breast epithelial cells indicates an increased risk of breast cancer. Therefore, ERα is often a significant target in breast cancer drug development, and it is critical to find compounds capable of inhibiting ERα expression [5].
Drug development is characterized by long cycles, high costs, and high failure rates. To save time and money, most pharmaceutical companies develop quantitative structure–activity relationship (QSAR) models to detect on-target and off-target drug activities [6]. Although good biological activity can effectively ensure the effectiveness of drugs against tumor cells, optimizing pharmacokinetics and minimizing toxicity are also crucial to the development of viable new drugs [7]. The pharmacokinetics and toxicity of compounds are comprehensively described by their absorption, distribution, metabolism, excretion, and toxicity, namely ADMET [8]. The ADMET properties of a compound serve as a key indicator of whether it can be used as a drug [9]. Molecular descriptors comprise a variety of characteristic parameters describing the properties and structures of compounds; they have been applied to ADMET prediction and proved to be effective [10]. Therefore, investigating the association between molecular descriptors and ADMET would be beneficial for the discovery of drugs that can antagonize ERα activity.
At present, traditional studies of ADMET properties usually require a large number of animal experiments as well as considerable time and money. The skyrocketing cost of drug development poses a threat to the sustainability of disease treatment [11]. Machine learning has powerful feature extraction and pattern recognition capabilities and has broad application prospects in breast cancer treatment research. It can help researchers better understand the relationship between drugs and breast cancer, improve the therapeutic effect and safety of drugs, diagnose breast cancer quickly and accurately, and promote the development of breast cancer treatment. Dong et al. built an ADMET attribute prediction model utilizing Naive Bayes and Decision Tree (DT) classifiers on multiple databases [12]. Ogura et al. developed a prediction model of hERG inhibitory activity using a support vector machine (SVM) and the Non-Dominated Sorting Genetic Algorithm-II and achieved good results [13]. Jiang et al. constructed data sets to verify the effectiveness of deep neural networks (DNNs), stochastic gradient boosting, and eXtreme Gradient Boosting (XGBoost) in predicting ADMET properties in drug discovery [9]. Peng et al. improved the graph neural network to better predict the ADMET properties of compounds; more meaningful molecular structures were obtained by linking the molecular bond features and node features together and adjusting the domain weight of the central node [14]. Park et al. used the attention mechanism to improve the graph convolutional network (GCN) and achieved good performance on a drug–drug interaction extraction corpus [15]. Venkatraman trained and constructed a fingerprint-based ADMET prediction model using the Random Forest (RF) algorithm [16]. Shi et al. used the principle of stacked generalization to integrate multiple machine learning algorithms and proposed a Two-Level Stacking Algorithm (TLSA), which can be well applied to the classification and screening of breast cancer drugs [17]. Yan et al. utilized K-Nearest Neighbor (KNN), bagging, and eigenvalue classification to design an ensemble classifier model that can diagnose whether a patient has breast cancer based on a mammogram [18].
Based on the above studies, we integrated nine machine learning classification models with a GCN to propose a Stacking Algorithm based on Graph Convolutional Network (SA-GCN) for the classification of the ADMET properties of compounds. The proposed algorithm first performs feature screening on the original data set, then trains on the selected subset to classify the ADMET attributes of compounds, and finally is compared with seven classical algorithms as well as the recently proposed TLSA-SVM and TLSA-LR algorithms. The main contributions are as follows: (1) To better predict ADMET properties and meet the requirements of real-world breast cancer drug development, this study evolved the traditional classification modeling based on molecular descriptors into a two-layer stacked generalization that combines the GCN with ensemble learning to classify the ADMET properties of compounds. (2) To improve model efficiency and performance, the Variance Threshold and Gradient Boosting Decision Tree (GBDT) algorithms were used to select an important feature subset for training the SA-GCN model. (3) The proposed SA-GCN model considers and combines the advantages of nine classification algorithms; in numerical experiments on the ADMET properties of compounds, it showed better accuracy and robustness. (4) Comprehensive numerical experiments comparing the performance of multiple algorithms demonstrated that the SA-GCN has the best classification performance and could assist the development of anti-breast cancer drugs. The rest of this paper is organized as follows.
Section 2 introduces the data composition and sources of this experiment. Section 3 describes the feature screening and classical machine learning methods relevant to this study, as well as the detailed procedure of the proposed algorithm. Section 4 is devoted to the analysis and discussion of the experimental results. The conclusions are given in Section 5.
2. Database Description
The biological activity data of 1974 compounds against the breast cancer therapeutic target ERα used in this study were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/, accessed on 1 December 2022). First, the dataset provided the structural expressions of the compounds, represented by the one-dimensional linear Simplified Molecular Input Line Entry System (SMILES) notation. Second, the dataset specified the biological activity values of the compounds against ERα. Finally, the molecular descriptors of the 1974 compounds were calculated using the PaDEL descriptors provided by ChemDes (http://www.scbdd.com/chemdes/, accessed on 1 December 2022).
The experiments in this paper should not only focus on the biological activities of the compounds but also consider their ADMET properties. The dataset contains 729 molecular descriptors and five class labels corresponding to the ADMET properties of each compound. The five ADMET properties considered in this experiment are as follows: small intestinal epithelial cell permeability (Caco-2), a measure of the ability of a compound to be absorbed by the body; the 3A4 isoform of cytochrome P450 (CYP3A4), a major metabolic enzyme in the human body, which measures the metabolic stability of a compound; cardiac safety evaluation (human Ether-a-go-go Related Gene, hERG), which measures the cardiotoxicity of a compound; Human Oral Bioavailability (HOB), which measures the proportion of an orally administered dose that is absorbed into the blood circulation; and the micronucleus (MN) test, a method to detect whether a compound is genotoxic. Each ADMET property was then encoded symbolically as a binary label; see Table 1. Our proposed SA-GCN model classifies compound properties according to these 1/0 labels.
3. Methodology
In previous screening methods for breast cancer drugs, each compound was often considered independently. However, these compounds share similarities in both structure and properties. Therefore, we constructed a graph (topology) of the compounds according to their property similarity, and the SA-GCN algorithm was developed based on this graph. The overall design flowchart of our algorithm is shown in Figure 2. The main goal of predicting the ADMET properties of compounds is to construct classification models for the five attributes Caco-2, CYP3A4, hERG, HOB, and MN, whose prediction results are 0 or 1. For this binary classification problem, we proposed the classification algorithm SA-GCN. The classification task was divided into two steps. Step 1: select an important subset of features from the original dataset. Step 2: predict and classify the ADMET properties of compounds based on the selected most significant feature subset.
The detailed SA-GCN classification process consists of a data preprocessing stage, a classification stage, and a classification model evaluation stage. In the data preprocessing stage, Variance Thresholding was applied first; this low-variance filtering quickly excluded 225 non-critical features, resulting in a feature subset of 504 descriptors. On this subset, we used the GBDT for feature selection, screening the 40 molecular descriptors with the greatest impact on the ADMET properties. The data were then divided into a training set and a testing set at a ratio of 7:3. Because the GCN requires a graph adjacency matrix, which was not provided in the original data, we constructed the adjacency matrix of the compounds according to whether they behave the same on the five ADMET properties (a sketch of this construction is given below). In the classification stage, the results of feature selection were fed into 10 classification algorithms. The SA-GCN contains two levels: Level 1 uses RF, Extra Trees (ET), GBDT, Decision Tree, Bagging, Adaptive Boosting (AdaBoost), KNN, Logistic Regression (LR), and SVM, and the GCN is used at Level 2. In the model evaluation stage, classification accuracy, recall, F1-score, precision, receiver operating characteristic (ROC) curves, and the area under the ROC curve (AUC) were used to compare and analyze these algorithms.
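As a concrete illustration of the graph construction mentioned above, the following minimal Python sketch connects two compounds when their five binary ADMET labels coincide. This is only one reading of the construction described in the text (a per-property graph is also conceivable); all names and values are illustrative, not taken from the paper.

```python
import numpy as np

def build_compound_adjacency(admet_labels: np.ndarray) -> np.ndarray:
    """Connect two compounds when their five binary ADMET labels are identical.

    admet_labels: array of shape (n_compounds, 5) with 0/1 entries.
    Returns a symmetric adjacency matrix of shape (n_compounds, n_compounds).
    """
    n = admet_labels.shape[0]
    adj = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            if np.array_equal(admet_labels[i], admet_labels[j]):
                adj[i, j] = adj[j, i] = 1.0
    return adj

# Illustrative use with random labels for six hypothetical compounds.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(6, 5))
A = build_compound_adjacency(labels)
print(A)
```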
3.1. Feature Subset Selection
Feature selection is the process of selecting valuable features and removing redundant ones from the original feature set so that the data occupy a smaller dimensional space [19]. The goal of feature selection is to make predictive models more efficient, cost-effective, and accurate, as well as to better understand the underlying processes that generated the data [20]. The variance threshold is a filtering method based on feature variance, which can quickly filter data to obtain important features [21]. The GBDT assigns scores to features by evaluating how much each feature contributes to reducing the loss function when building a decision tree and then ranks and selects features by averaging or accumulating these scores across multiple trees [22].
There are 729 molecular descriptors in the original data set, and such high dimensionality affects the convergence of the model to a certain extent. Features relevant to the ADMET attributes were therefore selected using the GBDT to facilitate subsequent research, and the result of the GBDT feature selection was used as the training set of our model. See Figure 3 for the detailed feature selection process.
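The two-stage screening can be sketched with scikit-learn as follows. The variance threshold value used here is an assumption (the text reports only that 225 low-variance descriptors were removed), and the GBDT hyperparameters are library defaults rather than the paper's settings.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier

def select_features(X: np.ndarray, y: np.ndarray, n_keep: int = 40) -> np.ndarray:
    """Two-stage screening: drop low-variance descriptors, then keep the
    n_keep descriptors with the highest GBDT importance scores."""
    # Stage 1: low-variance filtering (the threshold value is an assumption).
    vt = VarianceThreshold(threshold=0.1)
    X_var = vt.fit_transform(X)
    kept_idx = np.flatnonzero(vt.get_support())

    # Stage 2: rank the surviving descriptors by GBDT feature importance.
    gbdt = GradientBoostingClassifier(random_state=0)
    gbdt.fit(X_var, y)
    order = np.argsort(gbdt.feature_importances_)[::-1][:n_keep]
    return kept_idx[order]  # indices into the original 729-descriptor set
```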
3.2. Classification Methods
Gori et al. first introduced the concept of the graph neural network, extending neural network methods to the field of graph data [23]. Graph neural networks were then studied further, with the Banach fixed-point theorem serving as their theoretical cornerstone [24,25]. Kipf and Welling proposed the graph convolutional network [26]. It constructs the convolutional architecture of samples via a localized first-order approximation of spectral graph convolutions; the original features and the convolutional architecture information of the samples are then fused by the convolution process, and more accurate classification results can be obtained faster [27].
Dasarathy and Sheela were the first to propose the architecture of an ensemble system, using linear and nearest-neighbor classifiers to form a composite system that illustrated the concept [28]. Subsequently, Hansen and Salamon proposed ensemble learning based on neural networks, which is divided into two categories: Boosting and Bagging [29]. Boosting can combine weak classifiers into a strong classifier; its appearance made ensemble learning develop rapidly, and many novel ideas and models have since appeared [30]. AdaBoost, developed by Freund and Schapire, is one of the most widely used [31]. Friedman then proposed the GBDT, which differs from AdaBoost's iterative idea of changing the data distribution [32]. Considering the independence of the data, Breiman developed Bagging to combine base learners from a random perspective [33]. He subsequently proposed the random forest algorithm, which introduces random features on the basis of Bagging to further improve the independence between base learners [34]. Geurts et al. proposed Extremely Randomized Trees, which can work better than random forests in some settings because both the candidate features and the split points are chosen at random [35].
The decision tree is a typical classification algorithm that classifies instances through a sequence of feature-based tests [36]. Cortes and Vapnik proposed the SVM as a typical binary classification algorithm [37]; its goal is to compute the maximum-margin separating hyperplane to obtain an efficient binary classification of the data. Although LR is formulated as a regression model, it is a discriminative probabilistic classifier [38]. KNN is a classification algorithm that is easy to understand and use; because of its simplicity and efficiency, it is widely used in various fields [39].
3.3. Proposed Stacking Algorithm Based on GCN (SA-GCN)
Stacked generalization is a kind of ensemble learning. In traditional ensemble learning, an objective function is sometimes approximated by combining multiple models; generally, the results of the individual models are combined by voting (majority wins), weighted voting (some classifiers are more authoritative than others), or averaging [40]. Compared with a single model, the generalization ability of the system can be significantly improved. Stacked generalization was first proposed by Wolpert, who regarded it as similar to a winner-takes-all, cross-validation-based ensemble approach [41]. The basic idea of stacked generalization is that there are two layers of learners along the vertical direction of the data partition: the initial data set trains the base learners at level 1, and their outputs are used as the training set for the learner at level 2, also known as the meta-learner. The labels at training time are the same for both layers of learners.
For the proposed SA-GCN algorithm, the base learners at level 1 are composed of nine classification algorithms and validated through 5-fold cross-validation. The GCN model is used as the meta-learner at level 2. For the GCN, we defined a two-layer network whose input dimension is 9 (the output dimension of the base learners), whose hidden layer has dimension 5, and whose output layer has dimension 2 (the number of classes). Our two-layer GCN forward model takes the simple form
$$Z = f(X, A) = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W^{(0)})\, W^{(1)}\big),$$
where $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the adjacency matrix normalized after adding self-connections, $\tilde{A} = A + I_N$, and $\tilde{D}$ is the degree matrix of $\tilde{A}$ with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $W^{(0)}$ is the input-to-hidden weight matrix, and $W^{(1)}$ is the hidden-to-output weight matrix. The activation function of the hidden layer is $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. The activation function of the output layer is defined as $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$, which is applied row-wise. The loss function of the model is chosen as the cross-entropy error:
$$\mathcal{L} = -\sum_{l \in \mathcal{Y}_L} \sum_{j} Y_{lj} \ln Z_{lj}.$$
Here, since this is a binary classification problem, $Y_l$ is a two-dimensional label vector, and the values of $Y_{lj}$ and $j$ are 0 or 1.
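To make the forward model concrete, the following NumPy sketch normalizes the adjacency matrix and evaluates the two-layer propagation with random placeholder weights. It illustrates the equations above rather than reproducing the trained model; the small graph and the weights are synthetic.

```python
import numpy as np

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Compute A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_forward(X, A, W0, W1):
    """Two-layer GCN: softmax(A_hat ReLU(A_hat X W0) W1), applied row-wise."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)            # hidden layer with ReLU
    logits = A_hat @ H @ W1
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax

# Dimensions from the text: 9 inputs (base-learner outputs), 5 hidden units,
# 2 output classes; the graph and weights below are random placeholders.
rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 9))
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T
Z = gcn_forward(X, A, rng.normal(size=(9, 5)), rng.normal(size=(5, 2)))
print(Z.sum(axis=1))  # each row of Z sums to 1
```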
The weight matrices $W^{(0)}$ and $W^{(1)}$ are trained using gradient descent. The other hyperparameters of the GCN were chosen as follows: the model was trained for 200 epochs (training iterations) with a learning rate of 0.01, and an L2 regularization factor was also applied. See Figure 4 for the variation in loss value and accuracy with the training iterations.
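A hedged PyTorch sketch of this training setup is given below. The layer sizes, epoch count, and learning rate follow the text; the use of plain SGD (as a stand-in for "gradient descent") and the weight-decay value are assumptions, since the exact L2 factor is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    """Minimal dense two-layer GCN: A_hat ReLU(A_hat X W0) W1 (logits)."""
    def __init__(self, in_dim: int = 9, hid_dim: int = 5, out_dim: int = 2):
        super().__init__()
        self.w0 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w1 = nn.Linear(hid_dim, out_dim, bias=False)

    def forward(self, x, a_hat):
        h = F.relu(a_hat @ self.w0(x))
        return a_hat @ self.w1(h)  # raw logits; softmax is folded into the loss

def train_gcn(model, x, a_hat, y, train_mask, epochs=200, lr=0.01, l2=5e-4):
    # l2 (weight decay) is an assumed value, not the paper's exact factor.
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=l2)
    for _ in range(epochs):
        opt.zero_grad()
        logits = model(x, a_hat)
        # Cross-entropy error on the labeled (training) nodes only.
        loss = F.cross_entropy(logits[train_mask], y[train_mask])
        loss.backward()
        opt.step()
    return model
```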
The algorithm proceeds in three steps. Step 1: select nine base models and train them on the whole training set $T$ to obtain the trained base learners $h_1, \dots, h_9$. Step 2: pass the outputs of the nine trained learners on the training set $T$ to the meta-learner as its training examples. Step 3: train the meta-learner on the transmitted training examples to obtain the final model $H$. The pseudo-code of the proposed SA-GCN algorithm is shown in Algorithm 1.
Algorithm 1: Stacking Algorithm based on Graph Convolutional Network
Input: training data $T = \{(x_i, y_i)\}_{i=1}^{m}$; primary learning algorithms RF, ET, GBDT, DT, Bagging, AdaBoost, KNN, SVM, LR; secondary learning algorithm GCN.
Output: ensemble classifier $H$
1: Step 1: learn base-level classifiers
2: for $t = 1, \dots, 9$ do
3:   learn $h_t$ based on $T$ by the $t$-th primary learning algorithm
4: end for
5: Step 2: construct a new data set of predictions
6: for $i = 1, \dots, m$ do
7:   $T' = \{(x_i', y_i)\}$, where $x_i' = (h_1(x_i), \dots, h_9(x_i))$
8: end for
9: Step 3: learn a meta-classifier
10: learn $H$ based on $T'$ by the secondary learning algorithm
11: return $H$
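Step 2 of Algorithm 1 can be sketched in Python with scikit-learn as follows: out-of-fold predictions of the nine level-1 learners form the 9-dimensional meta-features that are then handed to the level-2 GCN (see the GCN sketch above). The base-learner hyperparameters here are library defaults, not the paper's settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, BaggingClassifier,
                              AdaBoostClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

BASE_LEARNERS = [
    RandomForestClassifier(), ExtraTreesClassifier(), GradientBoostingClassifier(),
    DecisionTreeClassifier(), BaggingClassifier(), AdaBoostClassifier(),
    KNeighborsClassifier(), SVC(), LogisticRegression(max_iter=1000),
]

def level1_meta_features(X, y, cv=5):
    """Build the 9-dimensional meta-feature matrix from 5-fold out-of-fold
    predictions of the nine base learners (Step 2 of Algorithm 1)."""
    cols = [cross_val_predict(clf, X, y, cv=cv) for clf in BASE_LEARNERS]
    return np.column_stack(cols)  # shape (n_samples, 9)

# The meta-features (together with the compound graph) are then passed to the
# level-2 GCN; the base learners are also refit on the full training set so
# they can produce level-1 features for unseen test compounds.
```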
4. Results and Discussion
In order to better display and evaluate our model, the classification results of the SA-GCN model and the evaluation metrics are given here. These metrics were used to compare the numerical classification results of the SA-GCN with seven classical classification models and the TLSA model recently proposed by Shi et al. [17]. Our proposed classification model SA-GCN was implemented in Python 3.10 on a laptop with an Intel Core i5 processor.
4.1. Performance Evaluation Metrics
This paper studies a binary classification problem and proposes the SA-GCN classification algorithm. To illustrate the classification performance and feasibility of our model, we compared it with nine other classification models. The selected evaluation metrics include the accuracy, recall, F1-score, precision, ROC curve, AUC, and confusion matrix (see Table 2). The formulas of these metrics are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Here, the meanings of TP, FP, TN, and FN are shown in Table 2. The ROC curve plots the true positive rate (TPR) on the ordinate against the false positive rate (FPR) on the abscissa. TPR, also called Sensitivity, is equal to recall, and FPR is equal to 1 − Specificity. The farther the ROC curve is from the diagonal, i.e., the larger the area under the curve, the more discriminative the model and the better its performance.
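For reference, these metrics can be computed directly with scikit-learn as in the following sketch; y_score denotes the predicted probability of the positive class and is only needed for the AUC. The function name and dictionary layout are illustrative.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_score):
    """Compute the evaluation metrics used in this comparison.
    y_score is the predicted probability of the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":  accuracy_score(y_true, y_pred),   # (TP+TN)/(TP+TN+FP+FN)
        "precision": precision_score(y_true, y_pred),  # TP/(TP+FP)
        "recall":    recall_score(y_true, y_pred),     # TP/(TP+FN)
        "f1":        f1_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),
        "confusion": {"TP": tp, "FP": fp, "TN": tn, "FN": fn},
    }
```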
4.2. Results of Classification
In this subsection, to scientifically and effectively reflect the superiority of the SA-GCN model, its classification performance is compared horizontally with that of nine other classification models. The inputs to these ten models are the results of the same feature selection. For the Caco-2 (A) property of the compounds, Table 3 shows the evaluation metrics of the classification performance of the SA-GCN and the other nine algorithms. Table 4 shows the results for the CYP3A4 (D) attribute, Table 5 for the hERG (M) attribute, Table 6 for the HOB (E) attribute, and Table 7 for the MN (T) attribute. Through the above comparison, it can be seen that, averaged over the five ADMET attributes, the accuracy of the SA-GCN algorithm was improved by 4.1821%, 4.5531%, 5.6324%, 8.7015%, 6.3069%, 6.7117%, 8.8027%, 4.6880%, and 4.4182% compared with the other nine algorithms, respectively. In addition, the SA-GCN was also better than the other nine methods on the remaining performance metrics. The ROC curve comparisons between the SA-GCN and RF on the ADMET properties are shown in Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9. In an ROC plot, the curvature reflects the sensitivity index, and the diagonal line represents a line with discrimination equal to 0, also known as the pure chance line. The farther the ROC curve is from the pure chance line, the more discriminative the algorithm. The ROC curve of the SA-GCN is farther from the pure chance line than that of the RF, so it is more discriminative. The ROC curve comparisons between the SA-GCN model and the other eight individual models can be seen in the Supplementary Material.
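An ROC comparison of the kind shown in Figures 5–9 can be reproduced with a short scikit-learn/matplotlib sketch such as the one below; the model names and score arrays are placeholders, not the paper's stored results.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(y_true, scores_by_model):
    """scores_by_model: dict mapping a model name to its positive-class scores."""
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="pure chance line")
    plt.xlabel("False positive rate (1 - Specificity)")
    plt.ylabel("True positive rate (Sensitivity)")
    plt.legend()
    plt.show()
```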
4.3. Discussion
Drug discovery is a very complex process: from the research of a new drug to the release of the finished product, it requires trial and error across multiple disciplines, takes 12–15 years, and costs more than one billion dollars [42,43]. We know that the ADMET properties of compounds are an important factor in drug design, and measuring and evaluating them experimentally is expensive, time-consuming, and resource-intensive. However, with the rapid progress of computer technology, machine learning can model the ADMET data of known drugs and use these models to predict the ADMET characteristics of new drugs, providing guidance and optimization for drug screening. Therefore, the purpose of this paper is to predict the ADMET properties of drugs, mine tacit knowledge, and establish a prediction model that reduces the time, monetary cost, and resource loss in the drug discovery process.
This study aims to screen out compounds that are easily absorbed, have stable metabolic rates, and are non-toxic, that is, anti-breast cancer compounds with good ADMET properties. We used machine learning methods to assist the classification of compound properties, which greatly saves time and money and reduces the loss of resources. Firstly, for the bioactivity data of ERα antagonists, a two-stage feature selection was implemented using the variance threshold and the GBDT algorithm; the characteristics of the 1974 compounds were evaluated, and the 40 most important features were obtained. Secondly, the SA-GCN model was proposed based on 10 algorithms to predict the ADMET attributes. Finally, to verify the effectiveness of the SA-GCN, its prediction results were horizontally compared with those of the other nine classification models.
To ensure the reliability and validity of the compared classifier performance, the same data were used for training and testing all classifiers in this study. Based on the horizontal comparison of the evaluation metrics of the classifier prediction results, the SA-GCN model outperformed the other algorithms. The improvement in prediction performance is mainly due to the combination of multiple algorithms in the first layer of the SA-GCN model, which extracts the hidden features of the data from the perspective of different data structures and data spaces. Building on the combined advantages of the first-layer models, the second layer can effectively extract spatial features from the topological associations of the data. The horizontal comparison experiments show that the SA-GCN model has better prediction performance and generalization ability. Thus, in drug development, it can help to judge the potency and safety of candidate compounds, screen out potential active compounds, and reduce the failure rate of drug development.
5. Conclusions
In this paper, we proposed SA-GCN, a compound ADMET property classification algorithm based on the GCN and ensemble learning. Compared with the other nine algorithms, its prediction accuracy is improved. The highest prediction accuracy values of the proposed method for the five ADMET properties were 97.6391%, 98.1450%, 94.4351%, 96.4587%, and 97.9764%, respectively. Therefore, the SA-GCN model achieved good performance in classifying the ADMET attributes of breast cancer drug candidates. We hope that this model can help to enhance tacit knowledge discovery in the drug screening process, avoid trial redundancy, and provide accurate prediction services for the discovery of potential drugs. At present, ensemble learning and deep learning have been successfully applied in the biomedical field. In the future, they will continue to provide significant help for the development of drugs. With the help of recent advances in computer technology and artificial intelligence, cancer treatment will evolve with the times.
Our research can be improved in the following respects in the future. The algorithm extracts features from the molecular descriptors of the compounds, but it could also exploit feature information from the molecular structures. In addition, this study addressed a binary classification problem, and the algorithm could be extended to multi-class classification problems.