Article

Small Molecular Drug Screening Based on Clinical Therapeutic Effect

College of Chemical Engineering, Beijing University of Chemical Technology, Beijing 100029, China
*
Author to whom correspondence should be addressed.
Molecules 2022, 27(15), 4807; https://doi.org/10.3390/molecules27154807
Submission received: 26 June 2022 / Revised: 22 July 2022 / Accepted: 25 July 2022 / Published: 27 July 2022
(This article belongs to the Special Issue Virtual Screening in Modern Medicinal Chemistry)

Abstract

Virtual screening can save considerable experimental time and cost in early drug discovery. Drug multi-classification can speed up virtual screening by quickly predicting the most likely class for a drug. In this study, 1019 drug molecules with actual therapeutic effects are collected from multiple databases and documents, and molecular sets are grouped according to therapeutic effect and mechanism of action. Molecular descriptors and molecular fingerprints are computed from SMILES to quantify molecular structures. After the data set is divided with the Kennard–Stone method, the results of five classification algorithms and a fusion method are compared to identify the better combination. Furthermore, for a specific data set, the model with the best performance is used to predict the validation data set. On the test set, prediction accuracy reaches 0.862 and the kappa coefficient reaches 0.808. The highest classification accuracy on the validation set is 0.873. A more reliable molecular set has been identified, which could be used to predict potential attributes of unknown drug compounds and even to discover new uses for old drugs. We hope this research can provide a reference for the simultaneous virtual screening of multiple classes of drugs in the future.

Graphical Abstract

1. Introduction

The emergence of new diseases (such as COVID-19) and the rise of drug resistance are constantly forcing researchers to discover and develop new drugs with better therapeutic effects and fewer side effects. As an important way to find new drugs, drug development is receiving great attention [1]. It includes a series of procedures, such as the determination of the lead compound, clinical trials, and the final review by the National Medical Products Administration [1]. As the key early step, the determination of the lead compound is also called drug screening, by which possible candidate drugs to relieve and cure various diseases [2] are discovered and serve as the object of subsequent research. In traditional drug development, this step is conducted through constant experimentation and testing of compounds from small molecule databases, which requires a significant amount of time and money [3]. However, in the last two decades, virtual screening (VS) has become increasingly popular [3]. Currently, it plays a significant role in the discovery of small molecule drugs with certain activity [4]. According to the starting point for identifying desirable drugs, VS can be classified as structure-based virtual screening (SBVS) or ligand-based virtual screening (LBVS). Based on these forms of screening, many drugs with therapeutic effects, such as scopoletin and aliskiren, have been successfully discovered and brought to market with shorter development periods and less investment [2,3,4,5].
SBVS identifies a target drug by means of the docking between a target protein and small molecular drugs. In terms of drugs with certain effects, if the three-dimensional structure of a target protein is unknown, the search for corresponding drug molecules cannot be realized based on SBVS [6]. Differing from SBVS, LBVS pays attention to data mining on small molecular drugs based on the assumption that compounds with similar structure have similar properties [7]. All kinds of drug databases have been well established in the history of drug development, which include a huge number of drugs and their structure information [8,9]. With the development of structure digitalization, LBVS methods can accomplish faster computation and predict as many potential candidates as possible. Currently, machine learning technologies further facilitate SBVS and LBVS research [1,10], especially LBVS.
There have been many studies employing various machine learning methods to screen drugs with specific characteristics [11]. Müller et al. used kernel-based classification methods to reduce the error rate of distinguishing drugs and non-drugs [12]. Focusing on clinical trial failure and withdrawal caused by drug-induced liver injury, Li et al. proposed a support vector machine model to identify drugs harmful to the liver [13]. To decrease the failure rate of drug candidates that bind to the androgen receptor, Gupta et al. developed an efficient model to predict their toxicity, particularly focusing on liver injury [14]. To enhance the hit rate of drugs that are able to treat various diseases by inhibiting the S100A9 target, Lee et al. established predictive models to classify them by applying several classifiers and molecular descriptors [15]. In order to recognize drugs that cause cardiotoxicity by blocking the Kv11.1 channel, Kim et al. applied ensemble models to predict blockers and non-blockers [16]. In these studies, molecular characteristics have been described and classified from different perspectives, i.e., drugs for certain diseases, drugs for certain protein targets, or toxicity induced by certain drugs. In addition, in the study by Lotsch et al., a functional genomics-based criterion is applied to classify drugs from a pharmacological perspective, which is suitable for drug classification and provides a phenotypic path for drug discovery and repurposing [17]. Kim et al. constructed a prediction model to provide new indications for herb compounds for certain diseases [18]. It can be seen that most studies have focused on specific diseases or side effects, which are just properties of drugs. As is well known, drugs can also be classified by what they treat, such as analgesics, antibacterial agents, and antitumor drugs, and common features can be extracted from the same drug category [19].
In the early stage of VS, if drug actual clinical therapeutic effect is taken into account, potential candidates could be screened more quickly and efficiently by direct drug multi-classification [20].
The Anatomical Therapeutic Chemical (ATC) Classification System is formed by hierarchically classifying drugs according to their anatomical, therapeutic, and chemical properties, and many models have been developed to enrich the system [21,22]. Although the studies on the ATC system have included clinical therapeutic effect, the number of drugs in the system is much lower than the actual number [21], so the existing drug information has not been fully utilized. Moreover, most prediction models have realized multi-classification for the ATC system with only an individual algorithm. At present, no individual classifier shows good classification performance for all data sets. To obtain a more reliable multi-class prediction model, a fusion method can be used to integrate the information of multiple classifiers [23]. According to the objects being fused, fusion methods can be divided into class label fusion, support function fusion, etc. [23]. Compared with label fusion, the fusion of support functions is more interpretable when similar performance is achieved. As a support function fusion method, Dempster–Shafer (DS) evidence theory has been successfully applied to classification [24]. When Kim et al. proposed new repositioning for existing drugs [18], their prediction model identified unknown compounds and discovered new possibilities for marketed drugs by providing prediction probabilities. In the same way, the DS fusion method can merge the predicted probabilities of multiple classifiers to recognize unknown drugs or even discover new effects of old drugs.
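The merging of predicted probabilities can be illustrated with Dempster's rule of combination. The following is only a minimal sketch, assuming that each classifier's class probabilities are treated as masses placed on single classes (the exact mass assignment used in this work is described in Section 3.2.2 and may differ); the function names and the probability values are hypothetical.

```python
from functools import reduce

def ds_combine(m1, m2):
    """Dempster's rule for two mass vectors whose focal elements are single classes."""
    joint = [a * b for a, b in zip(m1, m2)]
    agreement = sum(joint)  # equals 1 - K, where K is the conflict mass
    if agreement == 0.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    return [j / agreement for j in joint]

def ds_fuse(prob_vectors):
    """Fuse any number of classifiers by applying the rule pairwise."""
    return reduce(ds_combine, prob_vectors)

# Made-up probabilities for three classes from two classifiers.
rf_probs = [0.60, 0.30, 0.10]
svm_probs = [0.50, 0.40, 0.10]

fused = ds_fuse([rf_probs, svm_probs])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

Under this assumption the rule reduces to a normalised element-wise product, so a class that both classifiers support gains weight while a class that either classifier rejects is suppressed.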
In order to promote drug multi-classification development, seven categories of popular drugs are chosen as examples to obtain data for research. As much as possible, drug molecules with corresponding therapeutic effects are collected to ensure the stability of the model. Differing from simple binary classification, classification for the collected drug molecules is a multi-class prediction issue. There are multi-classification algorithms and multi-classification models combining multiple binary classifiers that can be used to solve the issue. Not only has previous research proved the advantages of binary classifiers, but the applicability of the two-classification strategy to achieve multi-classifications has also been demonstrated by Galar et al. [25].
In this work, a multi-classification method based on DS evidence reasoning theory is proposed to predict possible clinical therapeutic effects of unknown drugs. First, random forest (RF), adaptive boosting trees (ABT), support vector machine (SVM), logistic regression (LR), and linear discriminant analysis (LDA) are selected as the separate classifiers for fusion due to their wide applicability [5,15,16]. Then, based on the DS fusion method, predicted probabilities of the five classifiers are fused into final discriminant probability. The Kennard–Stone (KS) division method [26] is used to divide the data set. The final statistical mean of the test set is used to pick out the model with good performance, which provides more reliable information for potential effects of unknown drugs. Furthermore, the reliability of the obtained model is verified by an external validation set.
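The Kennard–Stone division can be sketched as follows. This is a generic illustration of the algorithm using Euclidean distance on toy points, not the authors' implementation; the function name, the distance choice, and the tie-breaking rule are assumptions.

```python
import math

def kennard_stone(points, n_train):
    """Return indices of n_train samples chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(points)")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed the training set with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = {i: min(dist[i][first], dist[i][second]) for i in remaining}
    while len(selected) < n_train:
        # Pick the sample farthest from the selected set (lowest index on ties).
        nxt = max(remaining, key=lambda i: (min_dist[i], -i))
        selected.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            min_dist[i] = min(min_dist[i], dist[i][nxt])
    return selected

toy_points = [(0, 0), (1, 0), (2, 0), (10, 0), (5, 0)]
train_idx = kennard_stone(toy_points, n_train=3)
test_idx = [i for i in range(len(toy_points)) if i not in train_idx]
print(train_idx, test_idx)
```

Because the selected samples spread over the whole descriptor space, the training set covers the extremes of the data and the remaining samples form the test set.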
The rest of the paper is organized as follows: In Section 2, based on different data sets, the better prediction model is obtained by comparing classification results, and the prediction ability of the model for multi-action drugs is also analyzed through external validation. Section 3 includes data collection and method application. The source of drug molecules, five description data sets for drug molecular structure (represented by the descriptors), and the acquisition process of four molecular sets are introduced in the section on data collection. In method application, five single classifiers, DS fusion principle, measurement method, and indicators for classification performance evaluation are presented. Finally, the conclusion is stated in Section 4.

2. Results and Discussion

An appropriate drug molecular set can provide the basis for reliable prediction of unknown drugs. According to comparison of classification results, the suitable molecular set, descriptor set, and classification algorithm are determined. Furthermore, based on the external validation set, the performance of twelve classification models is further compared and verified in Section 2.2.

2.1. The Comparison of Different Molecular Sets, Description Sets, and Classification Methods

Based on 240 groups of results obtained from four molecular sets (the acquisition process of four molecular sets is introduced in Section 3.1.3), five descriptor sets, and twelve classification methods, firstly, molecular sets that are suitable to discover the correlation between molecular structure and therapeutic effect are determined. Based on the determined molecular sets, two classification methods are utilized to make a comparison among ten groups of results to discover descriptor sets that are beneficial to characterize the molecular structure. Comparing twelve classification results from a suitable molecular set and descriptor set, the method with good classification performance is obtained.
The classification results of the five descriptor sets are shown in detail in Tables A2–A6 of Appendix A. These results are obtained by applying five individual classifiers and seven fusion methods to the four molecular sets.

2.1.1. The Impact of Molecular Sets

Comparing all results from different molecular sets, the highest Q value, 0.862, is achieved in molecular set S1. The highest kappa coefficient, 0.81, is obtained in molecular set S3. For the overall results in the five tables, better classification results appear more frequently in molecular sets S1 and S4. The results of the RF method on Mordred descriptors and Morgan fingerprints are shown in Figure 1, with (a) classification performance on the Mordred descriptor set and (b) classification performance on the Morgan fingerprint set. With Mordred descriptors, both the highest Q value and the largest kappa coefficient are obtained on S1. With Morgan fingerprints, the RF method performs better on S4, the set with the fewest drug molecules, as shown in Figure 1.
It can be noticed that the classification performance of RF differs across the studied data sets. The root cause of this difference is that the content of each data set is not identical from the very beginning, as can be seen in Section 3.1.3. Each data set is constructed by checking multifaceted drug information, such as therapeutic effect, mechanism of action, and phase, so the four data sets include different drug molecules. Not all features can be captured effectively by a given descriptor set, which leads to different performance on the data sets. In addition, the numbers of molecules in each drug category are uneven, which could be another reason for the performance difference. This suggests that a better descriptor is important for the structural characterization of drugs. It is also noticeable that the molecular set with the better result is not the same for the two descriptor sets, which will be discussed in the next section.
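For reference, the two indicators compared above can be computed as in the sketch below. It assumes that Q denotes overall accuracy, as the abstract suggests, and that the kappa coefficient is Cohen's kappa; the precise definitions used in this work are given in the methods section. The labels are made up for illustration.

```python
from collections import Counter

def accuracy_and_kappa(y_true, y_pred):
    """Return (overall accuracy, Cohen's kappa) for two label sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance from the marginal class frequencies.
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if p_expected == 1.0:
        return p_observed, float("nan")  # kappa is undefined in this case
    return p_observed, (p_observed - p_expected) / (1.0 - p_expected)

y_true = ["analgesic", "analgesic", "antiviral", "antiviral", "antifungal", "antifungal"]
y_pred = ["analgesic", "analgesic", "antiviral", "antifungal", "antifungal", "antifungal"]
q_value, kappa = accuracy_and_kappa(y_true, y_pred)
print(q_value, kappa)
```

Unlike accuracy, kappa discounts the agreement that uneven class sizes would produce by chance, which is why the two indicators can rank molecular sets differently.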

2.1.2. The Impact of Descriptor Sets

Comparing the results from different descriptor sets, better classification performance can be achieved by combinatorial descriptors in most cases. Molecular set S4 is investigated by single ABT and the classification model fused from RF, SVM, and LR. Corresponding results are shown in Figure 2, where Q and kappa coefficient based on five different descriptor sets can be observed in Figure 2a,b, respectively. Furthermore, aiming at six types of descriptor groups, descriptor information included in five descriptor sets is displayed in Figure 3. Among these groups, “atom-type counts” and “substructure fragments” are representation for zero-dimensional and one-dimensional structure information, respectively, and the remaining four types of descriptor groups are all representation for two-dimensional structure information. Values on the axis for each descriptor represent the number of this type of descriptor in the descriptor set.
Comparing three descriptor sets with only binary value, i.e., MACCS, topological, and Morgan fingerprint data, the results obtained by combinatorial and Mordred descriptors are better, which is attributed to different representation of molecular structure. As shown in Figure 3, there are distinct differences between the two descriptor sets and the other three fingerprint sets in terms of six major types of descriptor groups. The three fingerprint sets are calculated by different principles but all come from digital transformation of molecular one-dimensional structure information, such as substructure fragments. In contrast, the two types of descriptor data composed of real numbers not only contain one-dimensional structure information but also include two-dimensional information, such as topological and connectivity descriptors.
Contrasting classification results of combinatorial and Mordred descriptor sets, although the number of descriptors contained in the former is 755 and less than that of the latter, 1127, the Q value and kappa coefficient achieved by the former are both higher than those of the latter. That is because the former is acquired by combining descriptors in ChemoPy and RDKit databases, including one-dimensional information in addition to two-dimensional information, compared with the latter. On the other hand, it is possible for the Mordred descriptor set that there is redundant information among descriptors, which influences extraction of structure information so as to degrade classification performance. In a word, comparing with other descriptor sets, combinatorial descriptor set is more appropriate to extract information from the collected drug molecular structure to obtain better classification performance. The detailed computation procedure for five descriptor sets is presented in Section 3.1.2.

2.1.3. The Impact of Classification Models

According to the results in Tables A2–A6 of Appendix A, the fusion method obtained from RF and SVM performs better than single classifiers and other fusion models. The implementation principles and types of fusion methods are introduced in Section 3.2.2. The classification results of five single classifiers and seven fusion methods based on combinatorial descriptor data from molecular set S4 are displayed in Figure 4. The Q value and kappa coefficient are represented by different colored bars in the graph. Among the single classifiers, the best classification performance is achieved by SVM. In particular, the classification performance of LDA is far poorer than that of the other classifiers, indicating that the correlation is not a simple linear relationship. Among the fusion results, the DS12 method, fusing RF and SVM, performs better than the other fusion methods and outperforms individual SVM. However, there are four fusion methods whose results are not better than that of single SVM, such as the fusion method based on SVM and LR, whose Q value and kappa coefficient are both lower than those of SVM by nearly 0.01.
In theory, the more classifiers are fused, the better the classification results should be. In practice, the result obtained by a fusion method is limited by the performance of each classifier and, because of evidence conflict, can even be worse than that of every individual classifier. The result achieved by DS12345, fusing five classifiers, is far poorer than that of DS1235, fusing four classifiers, owing to the poor performance of LDA. Comparing the results of DS12, DS123, and DS1235, the result obtained by DS123 is worse than that of DS12 because of the poor performance of LR, but the result obtained by DS1235 is not as bad as that of DS123, although the performance of ABT is even poorer. This may be explained by the fact that there are conflicts between the result of LR and the two results of RF and SVM, while there is no conflict between the result of ABT and the three results of LR, RF, and SVM. As mentioned in Reference [27], the relationship between individual classifiers has not been considered in the fusion method, while in Reference [28], only great conflicts between single classifiers are processed according to an improved method. Additionally, in certain cases where LR performs better than RF, as shown in Table A3 of Appendix A, the results still show that the method fusing RF and SVM outperforms that fusing LR and SVM. This indicates that there is conflict between the result of LR and the result of SVM. Moreover, it demonstrates that RF is suited to uneven data sets, which makes the results of RF and SVM complement each other to achieve better classification performance.
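The notion of evidence conflict can be made concrete with the conflict mass K of Dempster's rule. The sketch below assumes that class probabilities are treated as masses on single classes; the three probability vectors are invented to mimic one dissenting classifier and are not results from this study.

```python
from itertools import combinations

def conflict(m1, m2):
    """Conflict mass K between two mass vectors defined on single classes."""
    return 1.0 - sum(a * b for a, b in zip(m1, m2))

# Made-up outputs for one molecule and three classes.
outputs = {
    "RF": [0.70, 0.20, 0.10],
    "SVM": [0.60, 0.30, 0.10],
    "LR": [0.10, 0.80, 0.10],  # disagrees with the other two
}

pairwise_conflict = {
    (a, b): conflict(outputs[a], outputs[b]) for a, b in combinations(outputs, 2)
}
for pair, k in pairwise_conflict.items():
    print(pair, round(k, 2))
```

In this toy case the pairs that involve the dissenting classifier have the larger K, so adding it to the fusion pulls the combined belief away from the class favoured by the other two.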
Based on the discussion in Section 2.1, the highest Q value and kappa coefficient, 0.862 and 0.81, respectively, are obtained with combinatorial descriptors. The standard deviation of the kappa coefficient is also the smallest, 0.028, as shown in Table A2 of Appendix A. This result illustrates the applicability and reliability of combinatorial descriptors in predicting unknown compounds. Although a reliable prediction model has been obtained with the KS method, the imbalance in the molecular sets does have an influence, so the classification results based on different molecular sets cannot be measured and compared fully objectively. Hence, according to the obtained results, the prediction performance of single SVM and two fusion models is further verified with the external validation set.

2.2. The Analysis on External Validation Set

In order to evaluate models more objectively, the diversity of the external validation set should be ensured. Aiming at different categories of drug data, partial random data are added into the external validation data set from S1, S2, S3, and S4. Seven classes of single-role drugs, whose order is consistent with that of the collected molecular sets, and all multi-role drugs are also contained in the final validation set, and their amounts are 7, 21, 26, 10, 5, 6, 2, and 10, respectively. The statistical mean obtained by running the procedure three times is used as the prediction result for the external validation set. Based on five descriptor data from four molecular sets, prediction results for external validation molecules are shown in Table 1, which are all obtained by four classification methods that perform better in Section 2.1.3. The prediction for the validation molecules is shown in Appendix A Table A7.
In terms of combinatorial descriptors, better prediction for validation molecules is achieved with molecular set S3. For Mordred descriptor data, good classification for the external validation set is implemented in molecular set S2. With regard to MACCS fingerprints, the highest correct prediction number is obtained with molecular sets S2 and S3. For topological fingerprints, the better prediction result is achieved with molecular sets S1 and S3. When Morgan fingerprints are used as descriptor data, better classification results are obtained with molecular set S2. From these results, it can be found that the greatest correct prediction number is achieved based on combinatorial descriptors. Moreover, the result obtained based on descriptor data from molecular set S3 performs well in most cases, and the prediction achieved by topological fingerprint data only performs well from S1.
To further conduct comparative analysis, better results, obtained by different descriptor data and molecular sets, are displayed in Table 2. For both Mordred descriptor set and topological fingerprint set, the results from molecular set S3 are listed in Table 2. Four descriptor data calculated from molecular set S4 are all trained by DS12 method to predict molecules in the validation set. The detailed prediction results for each class are shown in Table 2. The correct prediction number of single-role drugs for different classes and the correct prediction number of multi-role drugs are also included. By using combinatorial descriptor data from molecular set S3 to train DS12 method, the highest correct prediction number is obtained. However, it can be found that the correct prediction number of DS12 for antiarrhythmics is zero, indicating that prediction performance of DS12 is not balanced. Comparing carefully the correct prediction rate of each class of single-role drugs, most results are acceptable. From the aspect of whole prediction performance, four models with better performance are bolded in the first column of Table 2. Based on them, two multi-role drug molecules are taken as examples to further verify the four models, whose prediction probabilities for each class are listed in Table 3.
These two drug molecules are rifampicin and celecoxib. Rifampicin, an antibiotic, also has antitumor activity [29]. It is consistently and correctly predicted as an antibiotic by the four models. Meanwhile, the prediction of its second possible activity is also consistent with reality for all four models. Celecoxib was first used as an analgesic and was later used as an antitumor agent because of its favorable antineoplastic properties. Furthermore, studies have verified the antiviral efficacy of celecoxib since 2013 [30,31,32]. From the prediction probabilities in Table 3, it can be found that the analgesic and antineoplastic activities of celecoxib cannot be predicted correctly by a single SVM. For the two models based on topological fingerprints, only the antiviral efficacy of celecoxib is predicted incorrectly. It is worth noting that, based on the result of the combinatorial descriptor set, celecoxib is considered more likely to be an antidiabetic drug than an antitumor or antiviral drug, which needs to be further confirmed.
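Reading a first and a second possible activity from a vector of class probabilities amounts to ranking the classes, as in the short sketch below. The probabilities are invented for illustration and are not the values reported in Table 3; the function name is hypothetical.

```python
def rank_classes(probabilities, top=2):
    """Return the `top` class names ordered by decreasing predicted probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)[:top]

# Made-up fused probabilities for one multi-role molecule.
example_probs = {
    "analgesic": 0.03,
    "antineoplastic": 0.27,
    "antibacterial": 0.55,
    "antiviral": 0.06,
    "antifungal": 0.05,
    "antidiabetic": 0.03,
    "antiarrhythmic": 0.01,
}

likely_roles = rank_classes(example_probs)
print(likely_roles)
```

A multi-role drug is then counted as well predicted when its known therapeutic uses appear among the highest-ranked classes.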
In summary, drug information included in molecular sets S4 and S3 is more helpful to establish correlation between drug molecular structure and clinical therapeutic effect, which provides more reliable prediction for unknown drugs. The combinatorial descriptor and topological fingerprint are favorable for extracting structure information, which facilitates the mining of the correlation. Furthermore, compared with single classifier, higher Q value and kappa coefficient are obtained by fusion method, which is more suitable for predicting the potential clinical therapeutic effect of unknown drugs.

3. Materials and Methods

Analgesics, antineoplastic drugs, antibacterial drugs, antiviral drugs, antifungals, antidiabetic drugs, and anti-arrhythmic drugs are taken as examples to conduct multi-classification research on drugs. Section 3.1 details the collection of drug molecules and the acquisition of drug data. The study procedure of classification, the basic theory, and the fusion method that classification models depend on are introduced in Section 3.2.

3.1. Drug Collection and Corresponding Descriptor Data Set

After the drug molecular information is collected, each drug molecular structure is converted into a machine-readable form with the Python package PubChemPy (https://pypi.python.org/pypi/PubChemPy, accessed on 10 March 2022). Based on this form of structure, five descriptor sets are calculated with the ChemoPy [33], Mordred [34], and RDKit [35] packages in Python.
As mentioned above, all drugs can be considered as certain molecules or a collection of molecules with certain structures. There are several available databases containing known drug information, such as commercial names, molecules, basic physical-chemical properties, and structure descriptors such as the simplified molecular-input line-entry system (SMILES). Depending on the area and focus of their developers, these databases may cover different drugs and corresponding attributes, which are usually presented in different data formats. To include drugs from more categories that are described by more universal attributes, drugs from three databases are collected and grouped into seven classes according to their actual clinical therapeutic effects.
Many properties are quantitative expressions of molecular structural information, which can be calculated by software or even just web applications [30,31,32]. Here, five different types of description data are chosen to select the best one. The SMILES is often used as input form for computational programs to calculate the descriptor data, which include general information on molecular structure. Drug data analyzed in this work are detailed in the following section.

3.1.1. Drug Molecules

Drug molecules in seven categories are initially collected from World Health Organization official website and then are checked according to the drug information in KEGG [8], DrugBank [9], and PubChem [36]. During the collection of drugs, all compounds with therapeutic effects were included, such as prodrugs and active metabolites. The comprehensive coverage of drug molecular sets is ensured, so that more reliable prediction ability is obtained. According to actual clinical therapeutic information, drug molecules belonging to one class may be classified as another category or removed from a drug molecular set. It can be found that drugs with the same molecular structure in different databases may be named differently. In this case, the repeated ones are removed from the database.
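Removing entries that share a structure but carry different names can be sketched as below. The sketch assumes that every record already holds a structure string canonicalised by one and the same tool, so that identical structures yield identical strings; the record layout and function name are hypothetical.

```python
def drop_duplicate_structures(records):
    """Keep the first record seen for each canonical structure string."""
    seen = set()
    unique = []
    for name, canonical_smiles in records:
        if canonical_smiles not in seen:
            seen.add(canonical_smiles)
            unique.append((name, canonical_smiles))
    return unique

collected = [
    ("aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O"),
    ("acetylsalicylic acid", "CC(=O)OC1=CC=CC=C1C(=O)O"),  # same molecule, other name
    ("paracetamol", "CC(=O)NC1=CC=C(C=C1)O"),
]

deduplicated = drop_duplicate_structures(collected)
print([name for name, _ in deduplicated])
```

Comparing raw, non-canonical SMILES would miss duplicates, because one molecule can be written as many different valid strings.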

3.1.2. Different Descriptors

Molecular digital representation is varied, including both experimental and computational properties. The computer-acquired descriptive properties are widely applied due to their convenience and usability, such as molecular descriptors and molecular fingerprints [37,38], and they are used to facilitate drug development [13,16]. Although there are still many types of data for quantifying molecular structure, such as three-dimensional descriptors and pharmacophore fingerprints, there are limitations to them, such as high computational complexity and slow computational speed, which have a key impact on drug screening. In order to achieve better representation of drug molecular structure as soon as possible, five data sets quantifying molecular structure information from various aspects are calculated and used as feature data to classify drugs to determine better description for drug structure. They can be calculated by programming software, as detailed in the following.
Two types of molecular descriptor sets, whose data are real numbers, are selected as description data for drug molecules. Various descriptor groups are formed based on multiple descriptors obtained by different calculation methods. The computation of multiple descriptor groups has been implemented, and the relationships among them have also been clearly shown [39]. Here, the combination of ChemoPy [33] and RDKit [35] descriptors is chosen as one descriptor set. After repeated descriptors are removed, the combinatorial descriptor set is obtained from these two packages. It contains 632 descriptors from ChemoPy and 123 descriptors from RDKit. In addition, the Mordred descriptor group proposed by Moriwaki et al. [34], which includes multiple sets of descriptors, is used as another descriptor set. To ensure that the data can be processed, a total of 1127 Mordred descriptors are retained, and the remaining descriptors are discarded because of missing values. The two descriptor sets are calculated with the ChemoPy, Mordred, and RDKit packages in Python.
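The two clean-up steps described above, removing repeated descriptors when combining two sources and discarding descriptors with missing values, can be sketched in plain Python. How repeated descriptors were identified is not stated here, so matching by name is an assumption; the descriptor names and values are hypothetical and are not real ChemoPy, RDKit, or Mordred output.

```python
import math

def merge_descriptors(primary, secondary):
    """Combine two descriptor dicts, keeping the primary value for repeated names."""
    merged = dict(primary)
    for name, value in secondary.items():
        merged.setdefault(name, value)
    return merged

def drop_incomplete_columns(rows):
    """Keep only descriptors that have a finite numeric value for every molecule."""
    def usable(value):
        return isinstance(value, (int, float)) and math.isfinite(value)
    keep = [name for name in rows[0] if all(usable(row.get(name)) for row in rows)]
    return [{name: row[name] for name in keep} for row in rows]

# One molecule described by two hypothetical sources with one shared descriptor.
source_a = {"weight": 180.16, "ring_count": 1}
source_b = {"weight": 180.16, "polar_area": 63.6}
combined = merge_descriptors(source_a, source_b)

# Two molecules; descriptor "d2" is missing for the first one.
table = [{"d1": 1.0, "d2": float("nan")}, {"d1": 2.0, "d2": 3.0}]
cleaned = drop_incomplete_columns(table)
print(list(combined), cleaned)
```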
Data for molecular fingerprints are all binary. Molecular fingerprints are diverse [40,41], and each fingerprint is a binary vector with a certain dimension. Three types of fingerprints with fixed dimensions, namely Morgan fingerprints, MACCS fingerprints, and topological fingerprints, are selected as the other three descriptor sets to acquire data for drug molecules. For Morgan fingerprints in particular, when the selected circle radius is different, the vector dimension of the fingerprint is also different. Here, the radius is set to 4, and a 1024-dimensional Morgan fingerprint is obtained. All three are generated with the RDKit package in Python.

3.1.3. Final Molecular Set

When calculating the above descriptor data, it is found that partial molecular structure information cannot be converted into complete and processable data, such as mixtures, ionic compounds, and biological macromolecules. Molecules with molar mass more than 1800 are unsuitable to train models with other drug molecules, because their values are much higher than others. Drug molecules that match the above conditions are removed.
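A pre-filter in the spirit of these removal rules can be sketched as follows. It relies on the fact that a dot in a SMILES string separates disconnected components, which flags salts and mixtures, and it takes the molar mass as a precomputed input; the function name and the example masses are illustrative assumptions, and this is not the authors' screening code.

```python
MAX_MOLAR_MASS = 1800.0  # threshold stated in the text

def is_processable(smiles, molar_mass):
    """Reject multi-component entries and molecules above the mass threshold."""
    if "." in smiles:  # disconnected components: salts, mixtures
        return False
    return molar_mass <= MAX_MOLAR_MASS

candidates = [
    ("aspirin", "CC(=O)OC1=CC=CC=C1C(=O)O", 180.16),
    ("sodium chloride", "[Na+].[Cl-]", 58.44),
    ("hypothetical macromolecule", "C", 2500.0),  # placeholder structure string
]

kept = [name for name, smiles, mass in candidates if is_processable(smiles, mass)]
print(kept)
```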
Although the five types of descriptor sets cannot distinguish isomers, they have been widely used in drug screening [13,14,15,16]. Furthermore, they do not perform any worse than data describing three-dimensional structure [42], whose computation procedure is complex and whose computation time is long. Therefore, it is suitable to classify drugs through the calculated data, and isomer pairs are removed during the checking phase of drug molecule collection.
Additionally, some drugs for the treatment of certain diseases also have other therapeutic uses, as shown in References [43,44]. To obtain more explicit classification results, only molecules with a single therapeutic purpose are kept in the analysis data set, and drugs with multiple therapeutic effects are collected into an external validation set. After removing some drugs in the early stage, an original molecular set S1 containing 1019 drug molecules is obtained.
In order to collect as many drug molecules as possible with a certain therapeutic effect, namely the analgesic effect, molecular set S1 contains analgesics together with several anesthetics and antipsychotics that achieve analgesic effects only through local anesthesia or muscle relaxation. When therapeutic use and mechanism are carefully checked, it is found that several drugs are merely healthcare products or adjuvants, such as radiation adjuvants. The therapeutic effects of certain drugs also cannot be confirmed from articles on the PubMed website [45]. Additionally, it is not yet possible to determine which molecular set is appropriate as the basic data set for identifying unknown drugs. It is therefore necessary to check drug-relevant information and remove some drug molecules to obtain a molecular set that is more beneficial to discovering the relationship between structure and properties. Database information and literature information from PubMed are collected and checked, including ATC code, research phase, target, therapeutic effect, mechanism of action, and applicable subjects. The original data set is screened layer by layer according to concrete characteristics of the drug molecules, and four molecular sets containing different numbers of drugs are obtained. The detailed process of acquiring the four molecular sets is as follows.
After 19 drugs with weak pain relief and sedative or auxiliary effects, 2 antineoplastic drugs with only a healthcare effect and inhibition of DNA repair, and 2 antiviral drugs are removed from S1, the molecular set S2 containing 996 drugs is obtained. Based on S2, drugs with only potential therapeutic uses are removed, and the molecular set S3 containing 921 drugs is obtained. Based on S3, drugs without a known mechanism, drugs still used as veterinary drugs, and drugs still in the clinical trial phase are removed, and the molecular set S4 with 844 drugs is obtained. The acquisition process of the four molecular sets is illustrated in Figure 5. While the seven types of drugs are collected, drug molecules with two or even three therapeutic uses are identified and stored. The number of drugs included in the four molecular sets is summarized in Table 4. Additionally, an external validation molecular set containing 37 drugs is obtained. The names and categories of all collected molecules can be seen in Supplementary Tables S1 and S2.

3.2. Methods for Selection, Combination, and Evaluation

The prediction model is established based on a classification method. The performance of the final model depends on the applied classifier and its parameter settings. To construct models with better classification performance, multiple classifiers are applied, and their outputs are combined by the DS fusion method. For objective comparison and selection, suitable methods and indicators should be used to evaluate the different models. Numerical descriptors are generated as input to the machine learning algorithms.
The whole classification procedure is implemented in Python. Five different types of classifiers are used to extract drug information with the scikit-learn package [46]. They have been described in detail in Reference [47].

3.2.1. Classification Algorithms

In order to achieve multi-classification and ensure interpretability of results, a multi-classification strategy based on binary classifiers is applied. There are two ways to proceed according to the strategy, namely “one vs. rest” (“OvR”) and “one vs. one” (“OvO”). Considering the computational cost and subsequent fusion processing issues, “OvR” is adopted here. The five algorithms are introduced as follows.
RF has been widely used in classification since the beginning of the 21st century. The number of decision trees, as a key parameter of RF, has a strong impact on the performance of the algorithm. To speed up optimization and computation, this parameter is varied from 10 to 100 in increments of 10. The split criterion, another parameter to be tuned, can be set to Gini or entropy.
ABT, similar to RF, is also a tree-based classifier. Unlike RF, which directly sums and averages the results of a large number of trees, ABT is trained by gradually increasing the weight of the samples that are difficult to separate, based on the classification of each tree. The range of the number of base classifiers is the same as that in RF.
SVM is a classifier defined in feature space that has been widely used in statistical classification and regression analysis. SVM constructs a hyperplane to separate the training samples. It handles linear and non-linear data through the kernel function, and its classification accuracy is closely related to the choice of kernel function. The penalty parameter C is an important parameter, and its range is 1 to 10. Preliminary attempts showed that the polynomial and radial basis function kernels are more suitable for the obtained data set, so the grid search varies the kernel function between these two.
As a standard binary classification method, LR performs classification in a regression-like manner, i.e., the conditional probability computed from each sample feature vector is compared with a set threshold to determine the sample class. Its performance depends on whether the data conform to the predetermined model. Since the “OvR” strategy is applied to achieve multi-classification, “liblinear” is chosen as the solver according to preliminary attempts. As the key parameter, the penalty factor C takes the values 0.0001, 0.001, 0.01, 0.1, 1, and 10.
LDA is an extension of Fisher discriminant analysis. It is a weighted linear combination of a set of feature variables. This combination is obtained from the training data under the condition that the variance within classes is as small as possible and the variance between classes is as large as possible. It is used as a decision function to recognize the category to which a sample belongs. Default parameters for LDA are adopted in this work [46].
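The search ranges described above can be written down as parameter grids. The following is a minimal sketch, not the authors' code: the key names mirror scikit-learn parameter names, the integer step for the SVM penalty C is an assumption (the text only gives the range 1 to 10), and `expand_grid` is a hypothetical helper that enumerates the combinations a grid search would visit.

```python
from itertools import product

# Search ranges as stated in the text. Key names follow scikit-learn
# conventions; the integer step for the SVM penalty C is an assumption.
PARAM_GRIDS = {
    "RF": {"n_estimators": list(range(10, 101, 10)), "criterion": ["gini", "entropy"]},
    "ABT": {"n_estimators": list(range(10, 101, 10))},
    "SVM": {"C": list(range(1, 11)), "kernel": ["poly", "rbf"]},
    "LR": {"C": [0.0001, 0.001, 0.01, 0.1, 1, 10], "solver": ["liblinear"]},
    "LDA": {},  # default parameters, nothing to tune
}


def expand_grid(grid):
    """Return every parameter combination of a grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# Number of candidate settings per classifier.
grid_sizes = {name: len(expand_grid(grid)) for name, grid in PARAM_GRIDS.items()}
print(grid_sizes)
```

Dictionaries of this shape could be passed as `param_grid` to scikit-learn's `GridSearchCV` with `cv=5`, which would match the five-fold cross-validated grid search described later in the text.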

3.2.2. Fusion Methods

After the model framework is established, the best classification model should be selected by validation methods. The division of data sets affects the prediction results of the model for the test set and will eventually affect the choice of the best model. As is well known, a single classifier is usually suitable only within a certain scope, so multiple classifiers are applied to the obtained data set. Moreover, the collected data set is multi-category and unbalanced, which makes it more difficult to obtain a proper division of the data set. To achieve better classification performance, many fusion methods have been proposed. Classifier fusion methods differ in architecture and in the way of fusion, and the DS fusion method is applied in this study. The DS fusion method was introduced and developed in the work of Dempster and Shafer [48,49]. This theory has been applied in many fields, such as fault diagnosis, and there are also examples in the field of classification [27,28].
The categories of the collected sample data can take multiple situations, which are represented by a finite nonempty set $\Theta$. Enumerating the possible categories of a data sample gives $\Theta = \{\theta_1, \theta_2, \ldots, \theta_c\}$, where $c$ represents the number of hypotheses. In DS theory, $\Theta$ is called the discrimination frame, and $2^{\Theta}$ represents the power set containing $2^c$ cases, namely $2^{\Theta} = \{\emptyset, \{\theta_1\}, \ldots, \{\theta_c\}, \{\theta_1, \theta_2\}, \ldots, \Theta\}$. A corresponding mass function $m$ can be obtained for each case in the set $2^{\Theta}$; it is also called the basic probability assignment (BPA) and ranges from 0 to 1. Under this assumption, the BPA should meet the following two conditions:
$$m(\emptyset) = 0 \tag{1}$$
$$\sum_{A \in 2^{\Theta}} m(A) = 1 \tag{2}$$
where $m(A)$ is the BPA of a certain situation $A$ and is also the confidence of its occurrence. Therefore, a BPA of 0 means that the confidence of the situation is 0, and a BPA of 1 means that the confidence is 1. According to Dempster’s combination rule in [39], the confidence level (in some case $A$) obtained by fusion of two pieces of evidence is calculated as follows:
$$m(A) = \begin{cases} 0, & A = \emptyset \\ \dfrac{1}{1 - K_c} \sum\limits_{B \cap C = A} m_1(B)\, m_2(C), & A \neq \emptyset \end{cases} \tag{3}$$
where $K_c$ represents the conflict between the confidences of the two pieces of evidence. The calculation formula is in Equation (4):
$$K_c = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C) \tag{4}$$
where $B$ and $C$ indicate possible situations under the evidence system.
When the number of pieces of evidence exceeds 2, the fusion is achieved according to Dempster’s combination rule, and the calculation rule is as follows:
$$m = m_1 \oplus m_2 \oplus \cdots \oplus m_n = \left(\left(\left(m_1 \oplus m_2\right) \oplus \cdots\right) \oplus m_n\right) \tag{5}$$
where $\oplus$ represents an operation that can fuse two classification results. As explained in References [27,28], DS evidence fusion is not applicable when the pieces of evidence conflict. Since the $K_c$ value reflects the degree of conflict between the pieces of evidence, DS fusion should be replaced by other methods when $K_c$ is too large. In Reference [28], a different fusion method is proposed for the case of $K_c > 0.95$. For this situation, the method mentioned in Reference [28] is also applied in this paper.
All fusion models applied in Section 3 are shown in Table A1 of the Appendix A.
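Dempster's combination rule for two pieces of evidence can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, the class probabilities are made-up numbers, and each classifier output is assumed to place its mass only on single classes, which is one common way of turning predicted probabilities into a BPA.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Fuse two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class labels (a focal element)
    to its basic probability assignment.
    Returns the fused masses and the conflict K_c.
    """
    fused = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            fused[intersection] = fused.get(intersection, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # empty intersection adds to K_c
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: v / (1.0 - conflict) for a, v in fused.items()}, conflict


def from_probabilities(probs):
    """Treat per-class probabilities as masses on singleton hypotheses."""
    return {frozenset([label]): p for label, p in probs.items()}


# Hypothetical outputs of two classifiers for one molecule.
rf = from_probabilities({"analgesic": 0.6, "antiviral": 0.3, "antifungal": 0.1})
svm = from_probabilities({"analgesic": 0.5, "antiviral": 0.4, "antifungal": 0.1})

fused, k_c = dempster_combine(rf, svm)
best = max(fused, key=fused.get)  # class with the highest fused mass
```

For more than two classifiers, the fused masses can be combined with the next classifier's masses in turn. A caller could also compare `k_c` with the 0.95 threshold mentioned above and switch to the alternative method of Reference [28], which is not sketched here.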

3.2.3. The Evaluation of Classification Performance

An appropriate model evaluation method is the key to discovering models with good classification performance. It compares the performance of different methods based on test sets obtained by dividing data sets. Many methods for evaluating models exist, such as the common leave-one-out and bootstrapping methods. However, these methods are more suitable for small and balanced data sets, which is inconsistent with the characteristics of the collected data sets. For the data set used here, leave-one-out has a high computing cost, and bootstrapping does not make full use of the data information.
To obtain objective statistical results despite the influence of random sampling on a data set, the KS method, first proposed by Kennard and Stone [26] and applied and compared by Martin et al. [50], is selected as the division method of data sets in this study. The training set and the test set are obtained based on the differences between samples, so that the model selected by means of a test set with diverse samples can achieve a more reasonable and reliable classification prediction for unknown drugs.
Because the difference is measured by the distance between samples in the KS method, the data first need to be standardized. Based on the normalized training set, key parameters for each classifier are determined by a grid search implemented with five-fold cross-validation. Then, molecules in the test set are classified by the classification methods with the optimized parameters. For comparability of methods and objectivity of results, KS division is repeated 100 times to obtain statistical results. Moreover, the model for better and more robust classification can be obtained by comparing the results of different classification methods.
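The standardization and distance-based selection described above can be sketched as follows. This is an illustrative version of the classic Kennard–Stone rule (start from the two most distant samples, then repeatedly add the sample farthest from those already selected), not the authors' code. The function names, the toy data, and the choice of Euclidean distance and training-set size are assumptions.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def kennard_stone(rows, n_train):
    """Return the indices of n_train samples chosen by the Kennard-Stone rule."""
    n = len(rows)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    dist = [[math.dist(a, b) for b in rows] for a in rows]
    # Start from the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train:
        # Add the candidate whose nearest selected sample is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy descriptor matrix: five samples, two features.
data = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0], [5.0, 5.0]]
train_idx = kennard_stone(standardize(data), n_train=3)
test_idx = sorted(set(range(len(data))) - set(train_idx))
```

In this toy case the two extreme samples and the midpoint form the training set, and the samples closest to already selected points are left for the test set.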
In addition, it is also important for model evaluation to select appropriate evaluation indicators, which will determine the reliability of a model. As one of the evaluation indicators for classification models, prediction accuracy (Q) is most commonly used and most intuitive. It can be calculated by Equation (6).
$$Q = \frac{\text{the number of samples predicted correctly}}{\text{the total number of samples}} \tag{6}$$
Compared with the various evaluation indicators for two-class classification, fewer indicators can be used for multi-class classification. One such metric, Cohen’s kappa coefficient (K), has been widely used to evaluate the performance of both two-class and multi-class models. K is calculated as follows:
$$K = \frac{Q - p_e}{1 - p_e} \tag{7}$$
where $p_e$ is the quotient obtained by dividing the sum of the products for each class by the square of the number of samples, and the product for a class is the number of samples that actually belong to the class times the number of samples predicted to be in the class. The range of K is from −1 to 1. The larger its value, the more consistent the predicted result is with the real situation.
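The two indicators can be computed directly from their definitions. The sketch below is illustrative only: the function name and the toy labels are made up, and $p_e$ is computed exactly as described above, as the sum over classes of (actual count × predicted count) divided by the squared number of samples.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return prediction accuracy Q and Cohen's kappa K for two label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    q = sum(t == p for t, p in zip(y_true, y_pred)) / n
    actual = Counter(y_true)
    predicted = Counter(y_pred)
    # p_e: sum over classes of (actual count * predicted count), divided by n^2.
    p_e = sum(actual[c] * predicted[c] for c in actual) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when p_e equals 1")
    return q, (q - p_e) / (1 - p_e)


# Toy example with three classes and ten samples.
y_true = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "a"]
q, kappa = accuracy_and_kappa(y_true, y_pred)
```

Here 7 of 10 samples are correct, so Q = 0.7. The class counts give $p_e = (4 \cdot 4 + 3 \cdot 3 + 3 \cdot 3)/100 = 0.34$, and therefore K = 0.36/0.66 ≈ 0.545.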

3.2.4. Study Process of Classifying Drugs

The overall flow of this study is shown in Figure 6. The data sets compared in Figure 6a come from pairing the different molecular sets with the different descriptor sets in the study. The different types of molecular sets and descriptor sets and all applied classification models can be found in Table A1 of Appendix A, and the details of the classification process are as follows.
An example for model selection is obtained by randomly choosing a molecular set and a descriptor set from the first two columns of Table A1. Their combination with all the classification models in the third column of Table A1 is used to implement the classification procedure shown in Figure 6a. The molecular set is the input for calculating the descriptor set. The calculated descriptor set is divided into a training set and a test set by the KS method. The training set is used to tune the parameters of the classification algorithms and fit the classification models. Then, the test set is classified by the fitted models, and the results are evaluated and saved. This process, from grouping to evaluation, is run 100 times to obtain a statistical average result for the different models. By comparing classification indicators, the better model for the chosen molecular set and descriptor set is identified. In this way, the better models for each combination of molecular set and descriptor set are selected to identify the molecules of the external validation set.
The whole prediction procedure for external validation is displayed in Figure 6b. Similarly, an example for external validation is obtained by randomly choosing a molecular set and a descriptor set in Table A1. Classification models are selected from those that perform well in Figure 6a. The chosen molecular set and the molecular set for external validation are used together as the input for calculating the descriptor set. In order to ensure the class diversity of the final validation molecules, descriptor data of some molecules are randomly selected from the descriptor set calculated for the chosen molecular set and added to the external validation descriptor set. Afterwards, the rest of the descriptor set is used to tune the parameters of the classification algorithms and fit the classification models. The renewed external validation descriptor set is predicted by the fitted models, and the prediction probabilities of the different models are saved. Unlike the study flow in Figure 6a, this procedure is run 3 times to obtain a final statistical predicted probability result.

4. Conclusions

Compared with the traditional virtual screening of a single class of drugs, virtual screening based on multiple classes of drugs not only enhances the screening efficiency but also discovers multiple possible uses of a drug at one time. In this study, seven classes of drugs are taken as examples, and drug structure information is obtained from various databases. Structural information on drug molecules can be converted into feature information such as descriptors and fingerprints based on SMILES. Then, a DS fusion model based on five classifiers is proposed to make full use of descriptor data for good prediction performance. Subsequently, by comparing the classification results, better methods for classifying multi-class drugs are found, including the drug structure description method and the machine learning method. Based on the above results and discussions, it is found that combinatorial descriptor data are more appropriate for extracting drug information and obtaining better classification performance. Among the single classification methods, SVM performed better in multi-class drug classification, indicating that there is a nonlinear correlation between drug molecular structure and treatment effect. Additionally, the established fusion methods outperform the single machine learning methods, especially the DS12 method fusing RF and SVM. The final results suggest that the combination of combinatorial descriptor data and the DS12 classification method can make a better prediction for multi-class drugs, which is also verified by the results from the external validation set. This study provides a methodological basis for simultaneous screening of multi-class drugs and a new direction for speeding up virtual screening.
Although a good classification result is obtained, the study is only focused on discovering correlation between a drug’s therapeutic effect and its two-dimensional structure, ignoring the effect of drug isomers on therapeutic effects, which needs further research. In addition, the current classification of drug therapeutic uses is rough. In fact, there are far more than seven classes of real drugs, and each class can be further subdivided. These problems can be gradually explored and solved through in-depth research, such as adding a non-drug class and further expanding the drug classes.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/molecules27154807/s1, Table S1: drug molecular information, including name and category, and Table S2: drug molecular information in external validation set.

Author Contributions

Conceptualization, C.Z., J.A., Y.Y., F.M. and W.S.; Data curation, C.Z. and W.S.; Formal analysis, C.Z., J.A., Y.Y., F.M. and W.S.; Investigation, C.Z., Y.Y., F.M. and W.S.; Methodology, C.Z., Y.Y., F.M. and W.S.; Software, C.Z.; Supervision, J.A. and W.S.; Validation, C.Z.; Writing—original draft, C.Z.; Writing—review & editing, C.Z., J.A., Y.Y., F.M. and W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 21878012).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All drug molecules utilized in this work are acquired from the KEGG database (https://www.kegg.jp/, accessed on 5 January 2022), the DrugBank database (https://go.drugbank.com/, accessed on 15 April 2022), and the PubChem database (https://pubchem.ncbi.nlm.nih.gov/, accessed on 15 April 2022).

Acknowledgments

The authors are grateful to the reviewers for their attention and comments on our work, which made our revision clearer and more complete.

Conflicts of Interest

The authors declare no conflict of interest.

Sample Availability

Not applicable.

Appendix A

The drug molecular sets, descriptor sets, and classification methods used in this study are listed in Table A1. For each of the five descriptor sets, the classification results obtained by the twelve methods on the four molecular sets are shown in Table A2, Table A3, Table A4, Table A5 and Table A6, respectively.
Table A1. Different classes of molecule sets, descriptor sets, and classification methods.

| Molecule Sets | Descriptor Sets | Classification Methods |
| --- | --- | --- |
| S1 with 1019 molecules | Combinatorial descriptors (F1) | RF (1) |
| S2 with 996 molecules | Mordred descriptors (F2) | SVM (2) |
| S3 with 921 molecules | MACCS fingerprints (F3) | LR (3) |
| S4 with 844 molecules | Topological fingerprints (F4) | LDA (4) |
| | Morgan fingerprints (F5) | ABT (5) |
| | | Fusion of RF and SVM (DS12) |
| | | Fusion of RF and LR (DS13) |
| | | Fusion of SVM and LR (DS23) |
| | | Fusion of RF, SVM, and LR (DS123) |
| | | Fusion of RF, SVM, LR, and LDA (DS1234) |
| | | Fusion of RF, SVM, LR, and ABT (DS1235) |
| | | Fusion of five single classifiers (DS12345) |
Table A2. Performance based on combinatorial descriptor data. The results are displayed as the mean plus or minus the standard deviation. The same is true for Table A3, Table A4, Table A5 and Table A6.

| Indicators | Algorithms | S4 | S3 | S2 | S1 |
| --- | --- | --- | --- | --- | --- |
| Q | RF | 0.824 ± 0.03 | 0.834 ± 0.023 | 0.828 ± 0.026 | 0.832 ± 0.027 |
| Q | SVM | 0.849 ± 0.025 | 0.856 ± 0.022 | 0.852 ± 0.021 | 0.857 ± 0.024 |
| Q | LR | 0.808 ± 0.031 | 0.811 ± 0.025 | 0.808 ± 0.025 | 0.811 ± 0.027 |
| Q | LDA | 0.693 ± 0.029 | 0.736 ± 0.03 | 0.751 ± 0.027 | 0.758 ± 0.029 |
| Q | ABT | 0.803 ± 0.03 | 0.812 ± 0.03 | 0.807 ± 0.027 | 0.808 ± 0.027 |
| Q | DS12 | 0.854 ± 0.025 | 0.861 ± 0.021 | 0.858 ± 0.023 | 0.862 ± 0.025 |
| Q | DS13 | 0.833 ± 0.025 | 0.843 ± 0.023 | 0.837 ± 0.023 | 0.842 ± 0.023 |
| Q | DS23 | 0.843 ± 0.026 | 0.851 ± 0.021 | 0.845 ± 0.025 | 0.847 ± 0.024 |
| Q | DS123 | 0.851 ± 0.024 | 0.858 ± 0.021 | 0.853 ± 0.022 | 0.858 ± 0.024 |
| Q | DS1234 | 0.794 ± 0.028 | 0.808 ± 0.024 | 0.813 ± 0.024 | 0.823 ± 0.027 |
| Q | DS1235 | 0.851 ± 0.024 | 0.859 ± 0.021 | 0.853 ± 0.022 | 0.858 ± 0.024 |
| Q | DS12345 | 0.834 ± 0.024 | 0.84 ± 0.022 | 0.84 ± 0.022 | 0.847 ± 0.023 |
| K | RF | 0.761 ± 0.04 | 0.772 ± 0.03 | 0.761 ± 0.035 | 0.763 ± 0.037 |
| K | SVM | 0.797 ± 0.032 | 0.804 ± 0.029 | 0.798 ± 0.029 | 0.802 ± 0.033 |
| K | LR | 0.743 ± 0.04 | 0.744 ± 0.032 | 0.739 ± 0.033 | 0.74 ± 0.036 |
| K | LDA | 0.599 ± 0.037 | 0.649 ± 0.038 | 0.667 ± 0.035 | 0.673 ± 0.039 |
| K | ABT | 0.736 ± 0.039 | 0.746 ± 0.039 | 0.737 ± 0.035 | 0.736 ± 0.036 |
| K | DS12 | 0.803 ± 0.033 | 0.81 ± 0.028 | 0.805 ± 0.031 | 0.808 ± 0.035 |
| K | DS13 | 0.775 ± 0.032 | 0.786 ± 0.03 | 0.776 ± 0.032 | 0.78 ± 0.032 |
| K | DS23 | 0.789 ± 0.033 | 0.797 ± 0.028 | 0.789 ± 0.033 | 0.789 ± 0.033 |
| K | DS123 | 0.799 ± 0.032 | 0.807 ± 0.028 | 0.799 ± 0.03 | 0.802 ± 0.033 |
| K | DS1234 | 0.725 ± 0.035 | 0.741 ± 0.031 | 0.746 ± 0.032 | 0.756 ± 0.036 |
| K | DS1235 | 0.799 ± 0.031 | 0.808 ± 0.029 | 0.799 ± 0.03 | 0.803 ± 0.033 |
| K | DS12345 | 0.777 ± 0.031 | 0.782 ± 0.029 | 0.781 ± 0.029 | 0.787 ± 0.032 |
Table A3. Performance based on Mordred descriptor data.

| Indicators | Algorithms | S4 | S3 | S2 | S1 |
| --- | --- | --- | --- | --- | --- |
| Q | RF | 0.807 ± 0.027 | 0.818 ± 0.025 | 0.816 ± 0.03 | 0.823 ± 0.025 |
| Q | SVM | 0.841 ± 0.024 | 0.852 ± 0.023 | 0.847 ± 0.025 | 0.853 ± 0.021 |
| Q | LR | 0.811 ± 0.024 | 0.82 ± 0.024 | 0.811 ± 0.026 | 0.818 ± 0.024 |
| Q | LDA | 0.555 ± 0.045 | 0.609 ± 0.043 | 0.637 ± 0.033 | 0.656 ± 0.031 |
| Q | ABT | 0.775 ± 0.032 | 0.781 ± 0.03 | 0.777 ± 0.028 | 0.784 ± 0.029 |
| Q | DS12 | 0.843 ± 0.025 | 0.855 ± 0.024 | 0.85 ± 0.025 | 0.857 ± 0.023 |
| Q | DS13 | 0.831 ± 0.024 | 0.841 ± 0.025 | 0.834 ± 0.026 | 0.845 ± 0.024 |
| Q | DS23 | 0.834 ± 0.021 | 0.844 ± 0.023 | 0.839 ± 0.025 | 0.846 ± 0.022 |
| Q | DS123 | 0.843 ± 0.023 | 0.851 ± 0.024 | 0.845 ± 0.027 | 0.855 ± 0.022 |
| Q | DS1234 | 0.769 ± 0.031 | 0.779 ± 0.033 | 0.781 ± 0.031 | 0.791 ± 0.028 |
| Q | DS1235 | 0.844 ± 0.023 | 0.851 ± 0.023 | 0.846 ± 0.027 | 0.855 ± 0.022 |
| Q | DS12345 | 0.828 ± 0.025 | 0.833 ± 0.027 | 0.83 ± 0.027 | 0.841 ± 0.024 |
| K | RF | 0.74 ± 0.035 | 0.752 ± 0.033 | 0.747 ± 0.039 | 0.754 ± 0.034 |
| K | SVM | 0.787 ± 0.031 | 0.8 ± 0.03 | 0.791 ± 0.033 | 0.796 ± 0.029 |
| K | LR | 0.748 ± 0.031 | 0.757 ± 0.032 | 0.744 ± 0.033 | 0.751 ± 0.031 |
| K | LDA | 0.435 ± 0.054 | 0.495 ± 0.052 | 0.527 ± 0.039 | 0.548 ± 0.039 |
| K | ABT | 0.701 ± 0.041 | 0.706 ± 0.04 | 0.699 ± 0.036 | 0.705 ± 0.038 |
| K | DS12 | 0.789 ± 0.033 | 0.804 ± 0.031 | 0.794 ± 0.033 | 0.801 ± 0.03 |
| K | DS13 | 0.774 ± 0.031 | 0.784 ± 0.033 | 0.774 ± 0.034 | 0.786 ± 0.032 |
| K | DS23 | 0.779 ± 0.027 | 0.789 ± 0.031 | 0.781 ± 0.033 | 0.789 ± 0.03 |
| K | DS123 | 0.79 ± 0.03 | 0.798 ± 0.031 | 0.789 ± 0.035 | 0.799 ± 0.029 |
| K | DS1234 | 0.694 ± 0.039 | 0.705 ± 0.043 | 0.705 ± 0.04 | 0.716 ± 0.036 |
| K | DS1235 | 0.791 ± 0.03 | 0.798 ± 0.031 | 0.789 ± 0.036 | 0.8 ± 0.029 |
| K | DS12345 | 0.77 ± 0.033 | 0.775 ± 0.037 | 0.77 ± 0.035 | 0.781 ± 0.033 |
Table A4. Performance based on MACCS fingerprint data.

| Indicators | Algorithms | S4 | S3 | S2 | S1 |
| --- | --- | --- | --- | --- | --- |
| Q | RF | 0.812 ± 0.026 | 0.816 ± 0.027 | 0.81 ± 0.023 | 0.819 ± 0.022 |
| Q | SVM | 0.807 ± 0.027 | 0.808 ± 0.025 | 0.806 ± 0.026 | 0.815 ± 0.023 |
| Q | LR | 0.716 ± 0.031 | 0.722 ± 0.029 | 0.718 ± 0.026 | 0.727 ± 0.027 |
| Q | LDA | 0.711 ± 0.033 | 0.719 ± 0.029 | 0.715 ± 0.025 | 0.727 ± 0.027 |
| Q | ABT | 0.691 ± 0.034 | 0.689 ± 0.034 | 0.673 ± 0.027 | 0.686 ± 0.028 |
| Q | DS12 | 0.819 ± 0.029 | 0.819 ± 0.026 | 0.819 ± 0.023 | 0.822 ± 0.022 |
| Q | DS13 | 0.78 ± 0.029 | 0.787 ± 0.027 | 0.785 ± 0.026 | 0.793 ± 0.025 |
| Q | DS23 | 0.791 ± 0.032 | 0.791 ± 0.027 | 0.789 ± 0.027 | 0.8 ± 0.024 |
| Q | DS123 | 0.806 ± 0.03 | 0.807 ± 0.027 | 0.806 ± 0.027 | 0.812 ± 0.022 |
| Q | DS1234 | 0.774 ± 0.03 | 0.78 ± 0.025 | 0.78 ± 0.023 | 0.79 ± 0.025 |
| Q | DS1235 | 0.806 ± 0.029 | 0.806 ± 0.027 | 0.805 ± 0.027 | 0.811 ± 0.022 |
| Q | DS12345 | 0.774 ± 0.03 | 0.78 ± 0.025 | 0.78 ± 0.024 | 0.79 ± 0.025 |
| K | RF | 0.75 ± 0.035 | 0.753 ± 0.036 | 0.742 ± 0.031 | 0.751 ± 0.029 |
| K | SVM | 0.745 ± 0.035 | 0.745 ± 0.032 | 0.739 ± 0.034 | 0.749 ± 0.03 |
| K | LR | 0.627 ± 0.04 | 0.632 ± 0.037 | 0.622 ± 0.034 | 0.631 ± 0.035 |
| K | LDA | 0.623 ± 0.041 | 0.63 ± 0.037 | 0.621 ± 0.033 | 0.633 ± 0.035 |
| K | ABT | 0.596 ± 0.042 | 0.59 ± 0.042 | 0.566 ± 0.035 | 0.579 ± 0.037 |
| K | DS12 | 0.761 ± 0.038 | 0.759 ± 0.033 | 0.755 ± 0.031 | 0.758 ± 0.028 |
| K | DS13 | 0.709 ± 0.038 | 0.716 ± 0.035 | 0.709 ± 0.034 | 0.717 ± 0.032 |
| K | DS23 | 0.725 ± 0.041 | 0.723 ± 0.035 | 0.717 ± 0.036 | 0.728 ± 0.031 |
| K | DS123 | 0.744 ± 0.038 | 0.742 ± 0.035 | 0.737 ± 0.035 | 0.743 ± 0.029 |
| K | DS1234 | 0.703 ± 0.039 | 0.708 ± 0.032 | 0.703 ± 0.031 | 0.714 ± 0.032 |
| K | DS1235 | 0.744 ± 0.038 | 0.742 ± 0.035 | 0.737 ± 0.035 | 0.743 ± 0.029 |
| K | DS12345 | 0.703 ± 0.039 | 0.708 ± 0.032 | 0.703 ± 0.031 | 0.714 ± 0.032 |
Table A5. Performance based on topological fingerprint data.

| Indicators | Algorithms | S4 | S3 | S2 | S1 |
| --- | --- | --- | --- | --- | --- |
| Q | RF | 0.813 ± 0.029 | 0.821 ± 0.029 | 0.819 ± 0.024 | 0.821 ± 0.029 |
| Q | SVM | 0.837 ± 0.026 | 0.835 ± 0.029 | 0.827 ± 0.028 | 0.828 ± 0.028 |
| Q | LR | 0.796 ± 0.027 | 0.804 ± 0.028 | 0.795 ± 0.025 | 0.797 ± 0.024 |
| Q | LDA | 0.599 ± 0.064 | 0.588 ± 0.065 | 0.571 ± 0.059 | 0.577 ± 0.059 |
| Q | ABT | 0.759 ± 0.034 | 0.761 ± 0.031 | 0.753 ± 0.03 | 0.755 ± 0.032 |
| Q | DS12 | 0.832 ± 0.027 | 0.834 ± 0.026 | 0.833 ± 0.025 | 0.834 ± 0.024 |
| Q | DS13 | 0.816 ± 0.027 | 0.825 ± 0.028 | 0.825 ± 0.025 | 0.827 ± 0.025 |
| Q | DS23 | 0.828 ± 0.029 | 0.825 ± 0.027 | 0.818 ± 0.026 | 0.822 ± 0.025 |
| Q | DS123 | 0.832 ± 0.026 | 0.829 ± 0.027 | 0.828 ± 0.025 | 0.829 ± 0.025 |
| Q | DS1234 | 0.774 ± 0.031 | 0.771 ± 0.032 | 0.751 ± 0.029 | 0.752 ± 0.029 |
| Q | DS1235 | 0.832 ± 0.027 | 0.829 ± 0.028 | 0.828 ± 0.025 | 0.829 ± 0.025 |
| Q | DS12345 | 0.794 ± 0.029 | 0.796 ± 0.03 | 0.777 ± 0.026 | 0.779 ± 0.027 |
| K | RF | 0.753 ± 0.037 | 0.761 ± 0.038 | 0.756 ± 0.033 | 0.759 ± 0.039 |
| K | SVM | 0.786 ± 0.034 | 0.781 ± 0.036 | 0.77 ± 0.037 | 0.771 ± 0.036 |
| K | LR | 0.736 ± 0.034 | 0.745 ± 0.035 | 0.731 ± 0.033 | 0.735 ± 0.032 |
| K | LDA | 0.5 ± 0.07 | 0.483 ± 0.072 | 0.459 ± 0.066 | 0.464 ± 0.067 |
| K | ABT | 0.687 ± 0.043 | 0.686 ± 0.039 | 0.676 ± 0.039 | 0.678 ± 0.04 |
| K | DS12 | 0.78 ± 0.034 | 0.781 ± 0.034 | 0.778 ± 0.034 | 0.779 ± 0.032 |
| K | DS13 | 0.76 ± 0.034 | 0.769 ± 0.036 | 0.768 ± 0.033 | 0.77 ± 0.032 |
| K | DS23 | 0.776 ± 0.036 | 0.77 ± 0.034 | 0.76 ± 0.035 | 0.765 ± 0.032 |
| K | DS123 | 0.78 ± 0.034 | 0.775 ± 0.034 | 0.772 ± 0.033 | 0.774 ± 0.032 |
| K | DS1234 | 0.707 ± 0.039 | 0.7 ± 0.04 | 0.673 ± 0.038 | 0.674 ± 0.038 |
| K | DS1235 | 0.78 ± 0.034 | 0.775 ± 0.035 | 0.772 ± 0.033 | 0.774 ± 0.032 |
| K | DS12345 | 0.732 ± 0.036 | 0.733 ± 0.037 | 0.706 ± 0.035 | 0.709 ± 0.036 |
Table A6. Performance based on Morgan fingerprint data.

| Indicators | Algorithms | S4 | S3 | S2 | S1 |
| --- | --- | --- | --- | --- | --- |
| Q | RF | 0.781 ± 0.028 | 0.767 ± 0.026 | 0.765 ± 0.028 | 0.772 ± 0.029 |
| Q | SVM | 0.775 ± 0.034 | 0.766 ± 0.029 | 0.763 ± 0.029 | 0.772 ± 0.031 |
| Q | LR | 0.753 ± 0.03 | 0.73 ± 0.031 | 0.725 ± 0.026 | 0.732 ± 0.031 |
| Q | LDA | 0.586 ± 0.051 | 0.563 ± 0.038 | 0.505 ± 0.041 | 0.513 ± 0.042 |
| Q | ABT | 0.652 ± 0.035 | 0.645 ± 0.036 | 0.642 ± 0.034 | 0.646 ± 0.033 |
| Q | DS12 | 0.788 ± 0.029 | 0.774 ± 0.028 | 0.773 ± 0.028 | 0.78 ± 0.028 |
| Q | DS13 | 0.785 ± 0.026 | 0.767 ± 0.027 | 0.763 ± 0.026 | 0.773 ± 0.03 |
| Q | DS23 | 0.771 ± 0.033 | 0.76 ± 0.028 | 0.757 ± 0.029 | 0.764 ± 0.029 |
| Q | DS123 | 0.786 ± 0.03 | 0.771 ± 0.027 | 0.771 ± 0.027 | 0.778 ± 0.029 |
| Q | DS1234 | 0.696 ± 0.034 | 0.681 ± 0.032 | 0.66 ± 0.028 | 0.666 ± 0.035 |
| Q | DS1235 | 0.785 ± 0.03 | 0.77 ± 0.026 | 0.771 ± 0.028 | 0.777 ± 0.028 |
| Q | DS12345 | 0.725 ± 0.033 | 0.714 ± 0.032 | 0.699 ± 0.027 | 0.705 ± 0.03 |
| K | RF | 0.705 ± 0.036 | 0.683 ± 0.034 | 0.672 ± 0.038 | 0.676 ± 0.04 |
| K | SVM | 0.698 ± 0.044 | 0.683 ± 0.037 | 0.672 ± 0.038 | 0.679 ± 0.041 |
| K | LR | 0.671 ± 0.039 | 0.636 ± 0.04 | 0.622 ± 0.036 | 0.623 ± 0.044 |
| K | LDA | 0.466 ± 0.061 | 0.433 ± 0.047 | 0.363 ± 0.046 | 0.371 ± 0.049 |
| K | ABT | 0.545 ± 0.044 | 0.531 ± 0.043 | 0.522 ± 0.044 | 0.522 ± 0.044 |
| K | DS12 | 0.716 ± 0.038 | 0.695 ± 0.036 | 0.687 ± 0.037 | 0.69 ± 0.038 |
| K | DS13 | 0.711 ± 0.033 | 0.682 ± 0.036 | 0.67 ± 0.035 | 0.677 ± 0.041 |
| K | DS23 | 0.697 ± 0.043 | 0.679 ± 0.036 | 0.668 ± 0.038 | 0.673 ± 0.04 |
| K | DS123 | 0.713 ± 0.039 | 0.691 ± 0.035 | 0.683 ± 0.037 | 0.688 ± 0.038 |
| K | DS1234 | 0.599 ± 0.043 | 0.576 ± 0.041 | 0.542 ± 0.034 | 0.547 ± 0.046 |
| K | DS1235 | 0.713 ± 0.039 | 0.69 ± 0.034 | 0.684 ± 0.037 | 0.687 ± 0.038 |
| K | DS12345 | 0.635 ± 0.041 | 0.618 ± 0.041 | 0.591 ± 0.033 | 0.596 ± 0.04 |
Table A7. Prediction results of the “F1—DS12 on S3” model for the external validation set.

| Drugs | True Categories | Predicted Categories |
| --- | --- | --- |
| Oliceridine | analgesics | antineoplastic drugs |
| Cyproheptadine | analgesics | analgesics |
| Methylergometrine | analgesics | analgesics |
| Ubrogepant | analgesics | antineoplastic drugs |
| Lasmiditan | analgesics | antineoplastic drugs |
| Talaporfin | antineoplastic drugs | antineoplastic drugs |
| Avapritinib | antineoplastic drugs | antineoplastic drugs |
| Tazemetostat | antineoplastic drugs | antineoplastic drugs |
| Capmatinib | antineoplastic drugs | antineoplastic drugs |
| Lurbinectedin | antineoplastic drugs | antineoplastic drugs |
| Abiraterone acetate | antineoplastic drugs | antineoplastic drugs |
| Sotorasib | antineoplastic drugs | antineoplastic drugs |
| Tamoxifen | antineoplastic drugs | analgesics |
| Fulvestrant | antineoplastic drugs | antineoplastic drugs |
| Anastrozole | antineoplastic drugs | antiviral drugs |
| Letrozole | antineoplastic drugs | antifungals |
| Exemestane | antineoplastic drugs | antineoplastic drugs |
| Zanubrutinib | antineoplastic drugs | antineoplastic drugs |
| Apalutamide | antineoplastic drugs | antineoplastic drugs |
| Darolutamide | antineoplastic drugs | antineoplastic drugs |
| Glasdegib | antineoplastic drugs | antineoplastic drugs |
| Duvelisib | antineoplastic drugs | antineoplastic drugs |
| Tofacitinib | antineoplastic drugs | antineoplastic drugs |
| Enzalutamide | antineoplastic drugs | antineoplastic drugs |
| Berzosertib | antineoplastic drugs | antineoplastic drugs |
| Mobocertinib | antineoplastic drugs | antineoplastic drugs |
| Vebicorvir | antiviral drugs | antineoplastic drugs |
| Rifampicin | antineoplastic drugs, antibacterial drugs | antibacterial drugs |
| Cytarabine | antineoplastic drugs, antiviral drugs | antineoplastic drugs |
| Seliciclib | antineoplastic drugs, antiviral drugs | antineoplastic drugs |
| Celecoxib | analgesics, antineoplastic drugs | antidiabetic drugs |
| Pomalidomide | analgesics, antineoplastic drugs | analgesics |
| Acetylcysteine | analgesics, antineoplastic drugs, antiviral drugs | antineoplastic drugs |
| Salicylic acid | analgesics, antineoplastic drugs, antifungals | analgesics |
| Suxibuzone | analgesics, antineoplastic drugs | analgesics |
| Promethazine | analgesics, antiviral drugs | analgesics |

References

1. Chan, H.C.S.; Shan, H.; Dahoun, T.; Vogel, H.; Yuan, S. Advancing drug discovery via artificial intelligence. Trends Pharmacol. Sci. 2019, 40, 592–604.
2. Kumar, A.; Zhang, K.Y.J. Hierarchical virtual screening approaches in small molecule drug discovery. Methods 2015, 71, 26–37.
3. Mak, K.K.; Pichika, M.R. Artificial intelligence in drug development: Present status and future prospects. Drug Discov. Today 2019, 24, 773–780.
4. Ekins, S.; Mestres, J.; Testa, B. In silico pharmacology for drug discovery: Applications to targets and beyond. Br. J. Pharmacol. 2007, 152, 21–37.
5. Andricopulo, A.D.; Guido, R.V.C.; Oliva, G. Virtual screening and its integration with modern drug design technologies. Curr. Med. Chem. 2008, 15, 37–46.
6. Yuriev, E. Challenges and advances in structure-based virtual screening. Future Med. Chem. 2014, 6, 5–7.
7. Scior, T.; Bender, A.; Tresadern, G.; Medina-Franco, J.L.; Martínez-Mayorga, K.; Langer, T.; Cuanalo-Contreras, K.; Agrafiotis, D.K. Recognizing pitfalls in virtual screening: A critical review. J. Chem. Inf. Modeling 2012, 52, 867–881.
8. Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017, 45, D353–D361.
9. Wishart, D.S.; Knox, C.; Guo, A.C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–D906.
10. Talevi, A.; Morales, J.F.; Hather, G.; Podichetty, J.T.; Kim, S.; Bloomingdale, P.C.; Kim, S.; Burton, J.; Brown, J.D.; Winterstein, A.G.; et al. Machine Learning in Drug Discovery and Development Part 1: A Primer. CPT Pharmacomet. Syst. Pharmacol. 2020, 9, 129–142.
11. Heikamp, K.; Bajorath, J. Support vector machines for drug discovery. Expert Opin. Drug Discov. 2014, 9, 93–104.
12. Müller, K.R.; Rätsch, G.; Sonnenburg, S.; Mika, S.; Grimm, M.; Heinrich, N. Classifying ‘drug-likeness’ with kernel-based learning methods. J. Chem. Inf. Modeling 2005, 45, 249–253.
13. Li, X.; Chen, Y.; Song, X.; Zhang, Y.; Li, H.; Zhao, Y. The development and application of in silico models for drug induced liver injury. RSC Adv. 2018, 8, 8101–8111.
14. Gupta, V.K.; Rana, P.S. Toxicity prediction of small drug molecules of androgen receptor using multilevel ensemble model. J. Bioinform. Comput. Biol. 2019, 17, 1950033.
15. Lee, J.; Kumar, S.; Lee, S.Y.; Park, S.J.; Kim, M. Development of predictive models for identifying potential S100A9 inhibitors based on machine learning methods. Front. Chem. 2019, 7, 779.
16. Liu, M.; Zhang, L.; Li, S.; Yang, T.; Liu, L.; Zhao, J.; Liu, K. Prediction of hERG potassium channel blockage using ensemble learning methods and molecular fingerprints. Toxicol. Lett. 2020, 332, 88–96.
17. Loetsch, J.; Ultsch, A. A machine-learned computational functional genomics-based approach to drug classification. Eur. J. Clin. Pharmacol. 2016, 72, 1449–1461.
18. Kim, E.; Choi, A.; Nam, H. Drug repositioning of herbal compounds via a machine-learning approach. BMC Bioinform. 2019, 20, 33–43.
19. Liang, X.; Zhang, P.; Yan, L.; Fu, Y.; Peng, F.; Qu, L.; Shao, M.; Chen, Y.; Chen, Z. LRSSL: Predict and interpret drug–disease associations based on data integration using sparse subspace learning. Bioinformatics 2017, 33, 1187–1196.
20. Hurle, M.R.; Yang, L.; Xie, Q.; Rajpal, D.K.; Sanseau, P.; Agarwal, P. Computational drug repositioning: From data to therapeutics. Clin. Pharmacol. Ther. 2013, 93, 335–341.
  21. Liu, Z.; Guo, F.; Gu, J.; Wang, Y.; Li, Y.; Wang, Y.; Lu, L.; Li, D.; He, F. Similarity-based prediction for anatomical therapeutic chemical classification of drugs by integrating multiple data sources. Bioinformatics 2015, 31, 1788–1795. [Google Scholar] [CrossRef] [PubMed]
  22. Wang, X.; Wang, Y.; Xu, Z.; Wang, Y.; Li, Y.; Wang, D.; Lu, L.; Li, D.; He, F. ATC-NLSP: Prediction of the classes of anatomical therapeutic chemicals using a network-based label space partition method. Front. Pharmacol. 2019, 10, 971.
  23. Woźniak, M.; Grana, M.; Corchado, E. A survey of multiple classifier systems as hybrid systems. Inf. Fusion 2014, 16, 3–17.
  24. Xiao, Z.; Yang, X.; Pang, Y.; Dang, X. The prediction for listed companies’ financial distress by using multiple prediction methods with rough set and Dempster–Shafer evidence theory. Knowl.-Based Syst. 2012, 26, 196–206.
  25. Galar, M.; Fernández, A.; Barrenechea, E.; Bustince, H.; Herrera, F. An overview of ensemble methods for binary classifiers in multi-class problems: Experimental study on one-vs-one and one-vs-all schemes. Pattern Recognit. 2011, 44, 1761–1776.
  26. Kennard, R.W.; Stone, L.A. Computer aided design of experiments. Technometrics 1969, 11, 137–148.
  27. Jiang, W.; Xie, C.; Zhuang, M.; Tang, Y.C. Failure mode and effects analysis based on a novel fuzzy evidential method. Appl. Soft Comput. 2017, 57, 672–683.
  28. Pan, Y.; Zhang, L.; Wu, X.; Skibniewski, M.J. Multi-classifier information fusion in risk analysis. Inf. Fusion 2020, 60, 121–136.
  29. Chakraborty, A.; Panda, A.K.; Ghosh, R.; Roy, R.; Biswas, A. Depicting the DNA binding and photo-nuclease ability of anti-mycobacterial drug rifampicin: A biophysical and molecular docking perspective. Int. J. Biol. Macromol. 2019, 127, 187–196.
  30. Küçükgüzel, Ş.G.; Coşkun, İ.; Aydın, S.; Aktay, G.; Gürsoy, Ş.; Çevik, Ö.; Özakpınar, Ö.B.; Özsavcı, D.; Şener, A.; Kaushik-Basu, N.; et al. Synthesis and characterization of celecoxib derivatives as possible anti-inflammatory, analgesic, antioxidant, anticancer and anti-HCV agents. Molecules 2013, 18, 3595–3614.
  31. Chen, J.; Jiang, L.; Lan, K.; Chen, X. Celecoxib inhibits the lytic activation of Kaposi’s Sarcoma-Associated Herpesvirus through down-regulation of RTA expression by inhibiting the activation of p38 MAPK. Viruses 2015, 7, 2268–2287.
  32. Risner, K.; Ahmed, A.; Bakovic, A.; Kortchak, S.; Bhalla, N.; Narayanan, A. Efficacy of FDA-approved anti-inflammatory drugs against Venezuelan equine encephalitis virus infection. Viruses 2019, 11, 1151.
  33. Cao, D.S.; Xu, Q.S.; Hu, Q.N.; Liang, Y.Z. ChemoPy: Freely available python package for computational biology and chemoinformatics. Bioinformatics 2013, 29, 1092–1094.
  34. Moriwaki, H.; Tian, Y.S.; Kawashita, N.; Takagi, T. Mordred: A molecular descriptor calculator. J. Cheminform. 2018, 10, 4.
  35. Landrum, G. RDKit: A Software Suite for Cheminformatics, Computational Chemistry, and Predictive Modeling. 2013. Available online: https://www.rdkit.org/RDKit_Overview.pdf (accessed on 10 March 2022).
  36. Kim, S.; Thiessen, P.A.; Bolton, E.E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B.A.; et al. PubChem substance and compound databases. Nucleic Acids Res. 2016, 44, D1202–D1213.
  37. Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; Volume I: Alphabetical Listing/volume II: Appendices, References; John Wiley & Sons: Hoboken, NJ, USA, 2009.
  38. Xu, J.; Hagler, A. Chemoinformatics and drug discovery. Molecules 2002, 7, 566–600.
  39. Dong, J.; Cao, D.S.; Miao, H.Y.; Liu, S.; Deng, B.C.; Yun, Y.H.; Wang, N.N.; Lu, A.P.; Zeng, W.B.; Chen, A.F. ChemDes: An integrated web-based platform for molecular descriptor and fingerprint computation. J. Cheminform. 2015, 7, 60.
  40. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Modeling 2010, 50, 742–754.
  41. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
  42. Roy, P.P.; Roy, K. QSAR studies of CYP2D6 inhibitor aryloxypropanolamines using 2D and 3D descriptors. Chem. Biol. Drug Des. 2009, 73, 442–455.
  43. Ricchi, P.; Zarrilli, R.; Di Palma, A.; Acquaviva, A.M. Nonsteroidal anti-inflammatory drugs in colorectal cancer: From prevention to therapy. Br. J. Cancer 2003, 88, 803–807.
  44. Tołoczko-Iwaniuk, N.; Dziemiańczyk-Pakieła, D.; Nowaszewska, B.K.; Celińska-Janowicz, K.; Miltyk, W. Celecoxib in cancer therapy and prevention—review. Curr. Drug Targets 2019, 20, 302–315.
  45. Canese, K.; Weis, S. PubMed: The Bibliographic Database; The NCBI Handbook: Bethesda, MD, USA, 2013; Volume 2, p. 1.
  46. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  47. Shalev-Shwartz, S.; Ben-David, S. Understanding Machine Learning: From Theory to Algorithms; Cambridge University Press: Cambridge, MA, USA, 2014.
  48. Dempster, A.P. Upper and Lower Probabilities Induced by a Multivalued Mapping. Classic Works of the Dempster-Shafer Theory of Belief Functions; Springer: Berlin/Heidelberg, Germany, 2008; pp. 57–72.
  49. Shafer, G. A Mathematical Theory of Evidence; Princeton University Press: Princeton, NJ, USA, 1976.
  50. Martin, T.M.; Harten, P.; Young, D.M.; Muratov, E.N.; Golbraikh, A.; Zhu, H.; Tropsha, A. Does rational selection of training and test sets improve the outcome of QSAR modeling? J. Chem. Inf. Modeling 2012, 52, 2570–2578.
Figure 1. Classification results obtained by RF. (a) Results from Mordred descriptor set; (b) Results from Morgan fingerprint set.
Figure 2. Classification results based on molecular set S4. (a) Q values; (b) Kappa coefficient.
Figure 3. Six types of descriptor information included in five data sets.
Figure 4. Classification results based on molecular set S4 and combinatorial descriptors.
Figure 5. The acquisition of molecular sets. S2 is obtained by checking drug therapeutic mechanism. S3 is obtained by checking other potential therapeutic effects. S4 is obtained by checking applicable objects and experimental stage.
Figure 6. The whole study process, where (a) is the flow for comparing classification results based on different data sets and (b) is the flow for further validation.
Table 1. The number of correct predictions by fusion method in the external validation set.
| Descriptor Sets | S4 | S3 | S2 | S1 |
|---|---|---|---|---|
| Combinatorial descriptor | 73 | 76 | 73 | 70 |
| Mordred descriptor | 74 | 74 | 75 | 72 |
| MACCS fingerprint | 70 | 74 | 74 | 71 |
| Topological fingerprint | 74 | 75 | 74 | 75 |
| Morgan fingerprint | 73 | 69 | 75 | 70 |
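The counts in Table 1 can be turned into accuracies by dividing by the size of the external validation set, which the Table 2 header gives as 87 molecules. The following is a minimal sketch of that arithmetic; the names `TABLE_1`, `accuracy_table`, and `best_entry` are illustrative and are not taken from the study's code.

```python
# Correct predictions by the fusion method (Table 1), keyed by descriptor set.
# Values are listed in the table's column order: S4, S3, S2, S1.
N_VALIDATION = 87  # validation set size, from the "Total/87" column of Table 2
MOLECULAR_SETS = ["S4", "S3", "S2", "S1"]

TABLE_1 = {
    "Combinatorial descriptor": [73, 76, 73, 70],
    "Mordred descriptor": [74, 74, 75, 72],
    "MACCS fingerprint": [70, 74, 74, 71],
    "Topological fingerprint": [74, 75, 74, 75],
    "Morgan fingerprint": [73, 69, 75, 70],
}


def accuracy_table(counts, total):
    """Return {descriptor: {molecular_set: accuracy}} from raw hit counts."""
    return {
        name: {s: c / total for s, c in zip(MOLECULAR_SETS, row)}
        for name, row in counts.items()
    }


def best_entry(counts, total):
    """Return (descriptor, molecular_set, accuracy) for the highest accuracy."""
    acc = accuracy_table(counts, total)
    return max(
        ((name, s, a) for name, row in acc.items() for s, a in row.items()),
        key=lambda entry: entry[2],
    )


if __name__ == "__main__":
    name, mol_set, acc = best_entry(TABLE_1, N_VALIDATION)
    print(f"{name} on {mol_set}: {acc:.4f}")
```

The largest count, 76 of 87 for the combinatorial descriptor on S3, gives 76/87 ≈ 0.8736, which corresponds to the highest validation accuracy of 0.873 quoted in the abstract.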
Table 2. Number of correct predictions for each class of drugs by several models. In the first column, F1–F5 denote different descriptor data; detailed information is given in Table A1 of Appendix A. The seven classes of drugs are represented by C1, C2, C3, C4, C5, C6, and C7 in order. The same is true for Table 4.
| Models | C1/7 | C2/21 | C3/26 | C4/10 | C5/5 | C6/6 | C7/2 | Single-role total/77 | Multi-role/10 | Total/87 |
|---|---|---|---|---|---|---|---|---|---|---|
| F1—DS12 on S3 | 4 | 18 | 26 | 8 | 5 | 6 | 0 | 67 | 9 | 76 |
| F1—DS12 on S4 | 4 | 16 | 25 | 8 | 4 | 5 | 2 | 64 | 7 | 71 |
| F2—SVM on S2 | 4 | 16 | 26 | 6 | 5 | 6 | 2 | 65 | 10 | 75 |
| F2—DS12 on S3 | 4 | 17 | 26 | 7 | 4 | 3 | 1 | 62 | 10 | 72 |
| F2—DS12 on S4 | 4 | 14 | 25 | 8 | 5 | 5 | 2 | 63 | 10 | 73 |
| F3—SVM on S3 | 4 | 15 | 26 | 7 | 5 | 5 | 2 | 64 | 10 | 74 |
| F3—DS12 on S4 | 3 | 15 | 26 | 5 | 5 | 4 | 2 | 60 | 8 | 68 |
| F4—DS123 on S3 | 5 | 16 | 25 | 6 | 5 | 6 | 2 | 65 | 10 | 75 |
| F4—DS12 on S3 | 5 | 17 | 25 | 5 | 5 | 5 | 2 | 64 | 10 | 74 |
| F4—DS12 on S4 | 4 | 16 | 26 | 5 | 5 | 5 | 2 | 63 | 10 | 73 |
| F5—SVM on S2 | 6 | 13 | 26 | 4 | 3 | 4 | 0 | 56 | 9 | 65 |
| F5—DS12 on S4 | 5 | 15 | 26 | 7 | 3 | 6 | 2 | 64 | 9 | 73 |
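Each row of Table 2 can be read as per-class hit counts against the class sizes given in the header. As a minimal sketch, the snippet below computes per-class hit rates and the row totals for the row "F1—DS12 on S3"; the function and variable names are illustrative assumptions, not part of the original study.

```python
# Class sizes in the external validation set, from the Table 2 header.
CLASS_SIZES = {"C1": 7, "C2": 21, "C3": 26, "C4": 10, "C5": 5, "C6": 6, "C7": 2}

# Row "F1-DS12 on S3" of Table 2: correct single-role predictions per class,
# followed by the number of correct multi-role predictions (out of 10).
single_role_hits = {"C1": 4, "C2": 18, "C3": 26, "C4": 8, "C5": 5, "C6": 6, "C7": 0}
multi_role_hits = 9


def per_class_rate(hits, sizes):
    """Fraction of correctly predicted molecules in each class."""
    return {c: hits[c] / sizes[c] for c in sizes}


def row_totals(hits, multi):
    """Return (single-role total, overall total) for one table row."""
    single = sum(hits.values())
    return single, single + multi


if __name__ == "__main__":
    for cls, rate in per_class_rate(single_role_hits, CLASS_SIZES).items():
        print(f"{cls}: {rate:.2f}")
    print(row_totals(single_role_hits, multi_role_hits))
```

For this row the totals reproduce the tabulated 67 and 76. The per-class view also shows that this model predicts none of the two C7 molecules correctly, even though it has the highest overall count.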
Table 3. The predicted probabilities for the four drugs.
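| Drugs | Models | C1 | C2 | C3 | C4 | C5 | C6 | C7 |
|---|---|---|---|---|---|---|---|---|
| Rifampicin | F1—DS12 on S3 | 0 | 0.005 | 0.994 | 0 | 0 | 0 | 0 |
| Rifampicin | F2—SVM on S2 | 0.028 | 0.079 | 0.833 | 0.012 | 0.023 | 0.008 | 0.017 |
| Rifampicin | F4—DS123 on S3 | 0.002 | 0.005 | 0.991 | 0.001 | 0 | 0 | 0 |
| Rifampicin | F4—DS12 on S3 | 0 | 0.001 | 0.998 | 0 | 0 | 0 | 0 |
| Celecoxib | F1—DS12 on S3 | 0.395 | 0.1 | 0.003 | 0.046 | 0.008 | 0.447 | 0 |
| Celecoxib | F2—SVM on S2 | 0.387 | 0.08 | 0.15 | 0.117 | 0.025 | 0.236 | 0.005 |
| Celecoxib | F4—DS123 on S3 | 0.535 | 0.321 | 0.053 | 0.038 | 0.009 | 0.041 | 0.002 |
| Celecoxib | F4—DS12 on S3 | 0.609 | 0.318 | 0.021 | 0.012 | 0.004 | 0.035 | 0.001 |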
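The probabilities in Table 3 can be reduced to a predicted class by taking the largest value in each row. The sketch below does this for every row; it assumes that C1 to C7 follow the row order of the drug classes in Table 4, as the Table 2 caption suggests, and the names `CLASSES`, `TABLE_3`, and `top_class` are illustrative only.

```python
# Assumed class order C1..C7, taken from the row order of Table 4.
CLASSES = [
    "Analgesics", "Antineoplastic", "Antibacterial drugs", "Antiviral drugs",
    "Antifungals", "Antidiabetic drugs", "Antiarrhythmics",
]

# Table 3 rows keyed by (drug, descriptor data, model, molecular set).
TABLE_3 = {
    ("Rifampicin", "F1", "DS12", "S3"): [0, 0.005, 0.994, 0, 0, 0, 0],
    ("Rifampicin", "F2", "SVM", "S2"): [0.028, 0.079, 0.833, 0.012, 0.023, 0.008, 0.017],
    ("Rifampicin", "F4", "DS123", "S3"): [0.002, 0.005, 0.991, 0.001, 0, 0, 0],
    ("Rifampicin", "F4", "DS12", "S3"): [0, 0.001, 0.998, 0, 0, 0, 0],
    ("Celecoxib", "F1", "DS12", "S3"): [0.395, 0.1, 0.003, 0.046, 0.008, 0.447, 0],
    ("Celecoxib", "F2", "SVM", "S2"): [0.387, 0.08, 0.15, 0.117, 0.025, 0.236, 0.005],
    ("Celecoxib", "F4", "DS123", "S3"): [0.535, 0.321, 0.053, 0.038, 0.009, 0.041, 0.002],
    ("Celecoxib", "F4", "DS12", "S3"): [0.609, 0.318, 0.021, 0.012, 0.004, 0.035, 0.001],
}


def top_class(probs):
    """Return (class name, probability) for the most probable class."""
    idx = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[idx], probs[idx]


if __name__ == "__main__":
    for key, probs in TABLE_3.items():
        name, p = top_class(probs)
        print(" / ".join(key), "->", name, p)
```

Under this reading, all four models assign rifampicin to C3, while for celecoxib three models favor C1 and one (F1, DS12 on S3) favors C6 by a narrow margin over C1.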
Table 4. Drug molecules included in four molecular sets.
| Drug Classes | Molecular Set S1 | Molecular Set S2 | Molecular Set S3 | Molecular Set S4 |
|---|---|---|---|---|
| Analgesics | 228 | 209 | 183 | 164 |
| Antineoplastic | 211 | 209 | 189 | 165 |
| Antibacterial drugs | 296 | 294 | 285 | 261 |
| Antiviral drugs | 108 | 108 | 102 | 99 |
| Antifungals | 64 | 64 | 57 | 54 |
| Antidiabetic drugs | 70 | 70 | 66 | 63 |
| Antiarrhythmics | 42 | 42 | 39 | 38 |
| Total | 1019 | 996 | 921 | 844 |
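The effect of the successive filtering steps from S1 to S4 on each class can be summarized from Table 4. The sketch below recomputes the column totals and the fraction of each class retained in S4; the names `TABLE_4`, `column_totals`, and `retained_fraction` are illustrative assumptions.

```python
# Molecule counts per drug class (Table 4), in column order S1, S2, S3, S4.
TABLE_4 = {
    "Analgesics": [228, 209, 183, 164],
    "Antineoplastic": [211, 209, 189, 165],
    "Antibacterial drugs": [296, 294, 285, 261],
    "Antiviral drugs": [108, 108, 102, 99],
    "Antifungals": [64, 64, 57, 54],
    "Antidiabetic drugs": [70, 70, 66, 63],
    "Antiarrhythmics": [42, 42, 39, 38],
}


def column_totals(table):
    """Sum each molecular set (column) over all drug classes."""
    return [sum(col) for col in zip(*table.values())]


def retained_fraction(table):
    """Fraction of each class in S1 that is still present in S4."""
    return {name: row[-1] / row[0] for name, row in table.items()}


if __name__ == "__main__":
    print(column_totals(table=TABLE_4))
    for name, frac in retained_fraction(TABLE_4).items():
        print(f"{name}: {frac:.2f}")
```

The recomputed totals match the "Total" row of the table (1019, 996, 921, 844), and the retained fractions show which classes lose the most molecules between S1 and S4.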
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Zhong, C.; Ai, J.; Yang, Y.; Ma, F.; Sun, W. Small Molecular Drug Screening Based on Clinical Therapeutic Effect. Molecules 2022, 27, 4807. https://doi.org/10.3390/molecules27154807
