Integrated Multi-Class Classification and Prediction of GPCR Allosteric Modulators by Machine Learning Intelligence

G-protein-coupled receptors (GPCRs) are the largest and most diverse group of cell surface receptors that respond to various extracellular signals. The allosteric modulation of GPCRs has emerged in recent years as a promising approach for developing target-selective therapies. Moreover, the discovery of new GPCR allosteric modulators can greatly benefit the further understanding of GPCR cell signaling mechanisms. It is critical but also challenging to accurately distinguish modulators for different GPCR groups in an efficient and effective manner. In this study, we focus on an 11-class classification task with 10 GPCR subtype classes and a random compounds class. We used a dataset containing 34,434 compounds, comprising allosteric modulators collected from the classical GPCR families A, B, and C as well as random drug-like compounds. Six types of machine learning models, including support vector machine, naïve Bayes, decision tree, random forest, logistic regression, and multilayer perceptron, were trained using different combinations of features, including molecular descriptors, Atom-pair fingerprints, MACCS fingerprints, and ECFP6 fingerprints. The performances of the trained machine learning models with different feature combinations were closely investigated and discussed. To the best of our knowledge, this is the first work on the multi-class classification of GPCR allosteric modulators. We believe that the classification models developed in this study can be used as simple and accurate tools for the discovery and development of GPCR allosteric modulators.


Introduction
G-protein-coupled receptors (GPCRs) are the largest family of membrane proteins in the human genome and regulate a variety of extracellular signal transduction pathways triggered by photons, ions, hormones, neurotransmitters, odorants, and other stimuli [1,2]. Based on the similarity and diversity of their amino acid sequences and functions, GPCRs can be categorized into three main subfamilies, termed A, B, and C. Because GPCRs are therapeutic targets for a broad spectrum of diseases, they have been of long-standing interest in drug development and account for ~34% of the global market share of therapeutic drugs [3,4]. Most GPCR-targeted drugs are functionally active by binding to the orthosteric site of the receptor, the pocket bound by the endogenous activating ligand [5]. However, an increasing number of drugs targeting orthosteric sites have been withdrawn from the market due to low efficacy and undesired side effects [6,7]. The major issue with orthosteric ligands is that their binding sites are often highly conserved across a single GPCR subfamily, making it difficult to achieve high selectivity for specific GPCR subtypes [8]. In recent years, great attention has been devoted to the discovery of drugs targeting GPCRs as allosteric modulators [9][10][11]. These small molecules bind to a site (the allosteric site) that is topographically distinct from the orthosteric site of the GPCR protein and thus do not compete with orthosteric ligands [12]. Compared to the highly conserved orthosteric sites, allosteric binding pockets are more diverse across the same subfamily of GPCRs, a mechanism that allows allosteric modulators to confer subtype selectivity. Meanwhile, allosteric modulators also show a preferable safety profile due to the 'ceiling' effect [5,13,14]. In addition, Wu et al. noted that distinct pathways and allosteric pockets may enable allosteric modulators' cooperativities among different protein subtypes [15].
Therefore, it is important to develop allosteric modulators as both therapeutic agents and research tools to bring new opportunities to drug discovery towards GPCRs and have an in-depth understanding of receptor modulation mechanisms.
Drug discovery is expensive: the process requires molecule design, lead optimization, and in vitro and in vivo data analysis [16]. High-throughput screening (HTS) is a modern technique that is commonly used to facilitate the discovery of allosteric modulators of GPCRs [17]. However, it is costly and has often been plagued by high false-positive rates [18]. Alternatively, computational approaches, including homology modeling, molecular docking, and molecular dynamics simulation, have been applied to aid the discovery of novel allosteric modulators [19][20][21][22]. Yet, developing in silico screening methods that attain high accuracy remains challenging. At present, there is still an urgent demand for computational tools that can identify allosteric drugs from inactive random compounds and increase the chances of success in the development of allosteric modulators as lead clinical compounds.
While there are many traditional cheminformatic tools to assist the development of allosteric modulators, there have been few examples using machine learning (ML). ML has emerged as a promising pillar to promote data-driven decision-making, facilitate the process, and reduce the failure rates in drug discovery and development [23][24][25][26][27][28][29][30]. Kumar et al. developed multiple in silico models (support vector machines (SVM), k-nearest neighbor algorithms, partial least squares (PLS), etc.) to predict human intestinal absorption of diverse chemicals [31]. Jacob and Vert [32] used tensor-product-based features and applied SVM to predict protein-ligand interactions. Remarkably, in 2020, DeepMind's AlphaFold made a scientific breakthrough with its astonishing performance in predicting protein 3D structures using deep learning approaches [33,34].
We have previously reported an application of ML-based classification models for the prediction of orthosteric and allosteric regulation of cannabinoid receptors [13]. To expand the application to a broader scope of GPCR families and to handle a more diversified chemical space, in this paper we propose an 11-class classification task to simultaneously discriminate allosteric modulators among different subtypes of the GPCR A, B, and C subfamilies and inactive compounds. Diverse types of molecular features and multiple machine learning algorithms were applied for model training. The combinations of different molecular feature types and ML algorithms were carefully investigated to determine which set of features works best for a specific classifier. The performance of the trained ML models was systematically evaluated using different metrics. This research gives the first report on the multi-class classification of GPCR allosteric modulators. The study can be of value for facilitating in silico screening and providing guidance for the future discovery and development of GPCR allosteric modulators.

Data Collection and Preparation
In this study, the Allosteric Database (ASD) [35] was used to collect GPCR allosteric modulators, with the classical GPCR subfamilies A, B, and C selected as targets. To date, the number of discovered allosteric modulators varies widely across GPCR subtypes. To construct a dataset that mimics this nature, we selected common subtypes with distinct numbers of allosteric modulators from each GPCR subfamily. The number of collected allosteric modulators from each subtype is shown in Table 1. We also collected drug-like random compounds from the ZINC database [36] to serve as inactive decoy compounds: more than eight thousand drug-like compounds were randomly collected and integrated into the GPCR allosteric modulator datasets. The allosteric modulators collected from the 10 GPCR subtypes were combined with the inactive compounds to make up the final dataset containing 34,434 compounds (Table 1).

Molecular Fingerprint and Descriptor Calculation
Both molecular descriptors and molecular fingerprints were used as molecular representations for all compounds in the datasets. A total of 119 molecular descriptors (ExactMW, SlogP, TPSA, NumHBD, NumHBA, etc.), which characterize the physicochemical properties of the studied compounds, were calculated using RDKit (http://www.rdkit.org/, accessed on 29 October 2020). Three different types of molecular fingerprints, Atom-pair fingerprints [37], MACCS fingerprints [38], and ECFP6 fingerprints [39], were calculated with a CDK toolkit [40]. Atom-pair fingerprints were encoded as standard bit vectors of length 1024 based on the atomic environments and shortest-path separations of every atom pair in the molecule. MACCS fingerprints are 166-bit fingerprints representing the presence or absence of 166 substructural keys. ECFP6 fingerprints are circular topological fingerprints that represent circular atom neighborhoods in 1024-bit vectors.
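As an illustrative sketch, the four feature types described above can be generated as follows. The paper computed the fingerprints with a CDK toolkit; the RDKit equivalents shown here are an assumption for illustration, and the example molecule (aspirin) is arbitrary.

```python
# Sketch of the feature calculation: a handful of the 119 RDKit descriptors
# plus the three fingerprint types (RDKit equivalents assumed; the paper used CDK).
from rdkit import Chem
from rdkit.Chem import Descriptors, MACCSkeys
from rdkit.Chem.rdMolDescriptors import (
    GetMorganFingerprintAsBitVect,
    GetHashedAtomPairFingerprintAsBitVect,
)

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example

# A few of the physicochemical descriptors (ExactMW, SlogP, TPSA, NumHBD, NumHBA)
descriptors = {
    "ExactMW": Descriptors.ExactMolWt(mol),
    "SlogP": Descriptors.MolLogP(mol),
    "TPSA": Descriptors.TPSA(mol),
    "NumHBD": Descriptors.NumHDonors(mol),
    "NumHBA": Descriptors.NumHAcceptors(mol),
}

# ECFP6: circular fingerprint, radius 3 (= diameter 6), hashed to 1024 bits
ecfp6 = GetMorganFingerprintAsBitVect(mol, 3, nBits=1024)
# Atom-pair fingerprint hashed to a 1024-bit vector
atom_pair = GetHashedAtomPairFingerprintAsBitVect(mol, nBits=1024)
# MACCS keys: 166 substructural keys (RDKit returns 167 bits; bit 0 is unused)
maccs = MACCSkeys.GenMACCSKeys(mol)
```

Each fingerprint can then be converted to a plain bit list and concatenated with the descriptor values to build the combined feature sets used later in the study.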

Model Building
Here, six ML algorithms were employed to develop classification models to discriminate between the different subtypes of GPCR allosteric modulators and inactive drugs: support vector machine (SVM), neural network/multilayer perceptron (MLP), decision tree (DT), random forest (RF), naïve Bayes (NB), and logistic regression (LR). The open-source scikit-learn (http://scikitlearn.org/, accessed on 14 November 2020) was used for model building, tuning, validation, and result interpretation. Support vector machine (SVM) [41] is a kernel-based algorithm widely used for binary classification and regression tasks. Each chemical structure was described as a binary string and served as a feature vector for the SVM. These vectors were used to train the SVM algorithm, which yields a decision function for classification. The svm.SVC() method with three kernel functions (linear, rbf, poly) from scikit-learn was applied. The penalty parameter C, and the parameter γ for the rbf and poly kernels, were tuned on the training set by five-fold cross-validation using a grid search strategy.
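The SVM tuning described above can be sketched with scikit-learn's GridSearchCV. The synthetic data and the specific grid values are illustrative assumptions standing in for the fingerprint matrices and the paper's actual search ranges.

```python
# Sketch of five-fold grid search over C and gamma for svm.SVC(), as described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic multi-class data standing in for fingerprint features (assumption)
X, y = make_classification(n_samples=300, n_features=64, n_informative=20,
                           n_classes=4, random_state=0)

# gamma only applies to the rbf and poly kernels, so the grid is split in two
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)
```

Scoring with `f1_macro` mirrors the macro-averaged evaluation used later in the paper, so hyperparameters are chosen for balanced per-class performance rather than raw accuracy.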
Multilayer perceptron (MLP) [42] is a type of fully connected, feed-forward artificial neural network (ANN) consisting of three types of layers: the input layer, hidden layers, and output layer. The hidden layers between the input and output layers are the true computational engine of the MLP, and each hidden layer can be assigned a different number of hidden neurons. The MLP computes the value of each hidden neuron in the current layer as an activation applied to the weighted sum of the outputs of the neurons in the previous layer. The weights of the neuron connections are initially random and are then adjusted through the backpropagation learning algorithm. Similar to SVM, we used grid search to optimize the hyperparameters of the MLPClassifier() method in scikit-learn. We searched for the optimal number of layers, number of hidden units, activation function (identity, logistic, tanh, ReLU), regularization parameter (0.0001, 0.001, 0.01, 0.1), and learning rate (0.1, 0.01, 0.001, 0.0001).
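A minimal sketch of this MLP hyperparameter search, again on synthetic stand-in data with a deliberately small grid (the paper's full grid is larger):

```python
# Grid search over MLPClassifier hyperparameters (layer sizes, activation,
# L2 regularization alpha, initial learning rate), as described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=32, n_informative=16,
                           n_classes=3, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(64,), (64, 32)],   # one vs. two hidden layers
    "activation": ["relu", "tanh"],
    "alpha": [0.0001, 0.01],                   # regularization parameter
    "learning_rate_init": [0.001, 0.01],
}
search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0),
                      param_grid, cv=3, scoring="f1_macro")
search.fit(X, y)
best_mlp = search.best_estimator_
```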
Decision tree (DT) [43] is a simple, classic supervised learning method used to solve classification and regression problems. At each node of the tree, the attribute that gives the greatest information gain is chosen to make the decision, dividing the data in the most homogeneous way; the process is then repeated recursively on each smaller subset. DecisionTreeClassifier() was applied to generate the models, with tuning on max_depth.
Random forest (RF) [44] is an ensemble method that leverages the power of a multitude of decision trees. The final output is obtained by aggregating the outputs of trees grown on bootstrap samples (majority vote for classification, averaging for regression). RandomForestClassifier() was applied, and the best model was saved after tuning n_estimators (10, 50, 100) and max_depth (2, 3, 4, 5).
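The RF tuning above can be sketched directly, since the grid values (n_estimators and max_depth) are stated in the text; the synthetic data is again an illustrative assumption.

```python
# Grid search over the stated RF grid: n_estimators (10, 50, 100), max_depth (2-5).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=32, n_informative=12,
                           n_classes=4, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50, 100], "max_depth": [2, 3, 4, 5]},
    cv=5, scoring="f1_macro")
search.fit(X, y)
best_model = search.best_estimator_  # the best model is kept after tuning
```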
Naïve Bayes (NB) [45] is a simple probabilistic classifier based on Bayes' theorem for conditional probability. This algorithm assumes that the attributes in a dataset are independent of each other. In other words, the NB classifier ignores the possible dependencies among the inputs and reduces a multivariate problem to a group of univariate problems.
Logistic regression (LR) [46] is a classification algorithm used to predict the outcome of a categorical dependent variable from a set of predictor or independent variables. It is mainly used for prediction and also provides the probability of class membership. LogisticRegression() was applied and tuned with the penalty parameter (l1, l2, elasticnet).

Chemical Space Analysis
The classification tasks in this study are implemented entirely by machine intelligence, without using any chemistry or pharmacy domain knowledge. To investigate the possible relationships between the molecular properties or fingerprints and the classification tasks, we visualized the chemical space distribution of each class using the t-distributed stochastic neighbor embedding (t-SNE) method. Atom-pair, ECFP6, and MACCS fingerprints and molecular descriptors were used to represent the molecules and were decomposed into two dimensions by t-SNE. scikit-learn was applied for the t-SNE analysis, and the Matplotlib library was used for plotting. t-SNE [47] is a nonlinear embedding technique developed by van der Maaten and Hinton in 2008. It is a dimension reduction method well suited for embedding high-dimensional data into a low-dimensional space of two or three dimensions [48]. t-SNE calculates similarity measures between pairs of instances in both the high-dimensional and the low-dimensional space, converting the pairwise similarities of points into joint probabilities. In this process, the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data is minimized. This neighbor-embedding property makes t-SNE effective at identifying local clusters in the data.
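A minimal sketch of this projection step, using random bits in place of real fingerprint matrices (an illustrative assumption):

```python
# Embed a high-dimensional fingerprint matrix into two dimensions with t-SNE,
# as done for the chemical space visualizations.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# 120 "molecules" x 1024 fingerprint bits (random stand-in data)
fingerprints = rng.integers(0, 2, size=(120, 1024)).astype(float)

embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(fingerprints)
# `embedding` has shape (120, 2); each row can be scattered with Matplotlib,
# colored by its class label, to reproduce plots like Figures 2-4.
```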

Model Evaluation
In this study, the dataset was randomly divided into a training set (80%) and a test set (20%) using a stratified sampling method. Then, 80% of the training set was randomly selected and used for training, with the remaining compounds used as a validation set. Various evaluation metrics were calculated to evaluate the performance of the different ML models, including accuracy (ACC), balanced accuracy (Bal_ACC), precision, recall, f1-score, the area under the receiver operating characteristic (ROC) curve (AUC) score, Cohen's κ (CK) [49], and the Matthews correlation coefficient (MCC) [50].
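The nested 80/20 splitting scheme above can be sketched with two stratified calls to train_test_split (synthetic data as a stand-in):

```python
# Stratified 80/20 train/test split, then a further 80/20 split of the
# training portion to hold out a validation set, as described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=5, n_informative=10,
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.20, stratify=y_train, random_state=0)

# Resulting sizes: 640 training, 160 validation, 200 test compounds
print(len(X_tr), len(X_val), len(X_test))
```

Passing `stratify=` preserves the class proportions in every split, which matters here because some GPCR classes have fewer than a hundred samples.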
Micro-average and macro-average are two strategies used for multiclass classification tasks. Here, we applied the macro-average method to calculate the precision, recall, and f1-score. Macro-averaging calculates each metric for each label separately and returns their unweighted mean. An ML model with good performance on macro-averaged metrics can recognize each class well, even the small classes, which is the suitable case for our study. Micro-averaging, in contrast, aggregates the contributions of all classes to compute the average metric, which emphasizes the performance of the majority classes. The macro-averaged metric is defined as:

M_macro = (1/C) Σ_{i=1}^{C} M_i

where M is the current metric, C is the total number of classes for the classification task, and i denotes the ith class.
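A toy example makes the macro/micro distinction concrete: on imbalanced labels, macro averages the per-class scores equally while micro pools all predictions, so the two diverge.

```python
# Macro- vs micro-averaged precision on a small imbalanced label set.
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

# Per-class precision: class 0 -> 4/5 = 0.8, class 1 -> 1/1 = 1.0
macro = precision_score(y_true, y_pred, average="macro")  # (0.8 + 1.0) / 2 = 0.9
micro = precision_score(y_true, y_pred, average="micro")  # 5 correct of 6 predicted
```

The minority class (label 1) contributes half of the macro score but only a sixth of the micro score, which is why macro-averaging is the appropriate choice for the small GPCR classes in this study.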
The following abbreviations are used for metric definitions: the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), and the number of false negatives (FN). The AUC score is one of the most widely used metrics that measures the overall performance of a classification model. It ranges between 0.5 and 1. A model with an AUC of 1 means perfect separation whereas an AUC of 0.5 means the model has no class separation capacity.
ACC measures the overall correctness of a classifier: the fraction of correctly predicted data points out of all data points.

ACC = (TP + TN) / (TP + TN + FP + FN)
Recall is also called the true positive rate or sensitivity, which measures the ability of a classifier to find all of the positive samples.

Recall = TP / (TP + FN)
Precision is known as the positive predictive value, which is the proportion of true positives among all instances predicted as positive.

Precision = TP / (TP + FP)
The f1-score is the harmonic mean of precision and recall and takes both false positives and false negatives into account.

f1 = 2 × (Precision × Recall) / (Precision + Recall)
CK is used to estimate overall model performance by measuring the proximity of the predicted classes to the actual classes when compared to a random classification.

CK = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement between predictions and labels and p_e is the agreement expected by chance.
When the dataset is imbalanced, Bal_ACC can be used to evaluate the general performance of an algorithm. It avoids inflated performance estimates and is computed as the average of sensitivity and specificity in the binary case or, in the multiclass case, as the mean of the per-class recalls:

Bal_ACC = (1/n) Σ_{i=1}^{n} Recall_i

where n represents the total number of classes. MCC is another useful metric when the dataset has varying class sizes and is imbalanced. It is a correlation coefficient between the observed and predicted classes and ranges between −1 and 1. A value of −1 indicates a completely wrong prediction, while a coefficient of 1 indicates a perfect prediction.
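These imbalance-aware metrics are available directly in scikit-learn; a short sketch on arbitrary toy labels:

```python
# Balanced accuracy (mean of per-class recalls), MCC, and Cohen's kappa
# computed on a small toy multiclass example.
from sklearn.metrics import (balanced_accuracy_score, cohen_kappa_score,
                             matthews_corrcoef)

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0]

# Per-class recalls: class 0 -> 2/3, class 1 -> 2/2, class 2 -> 3/4
bal_acc = balanced_accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)      # in [-1, 1]
kappa = cohen_kappa_score(y_true, y_pred)    # agreement beyond chance
```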

Overall Workflow
In this study, four subtypes from class A GPCRs, three subtypes from class B GPCRs, and three subtypes from class C GPCRs were collected from the ASD database. Inactive compounds were randomly selected from the ZINC database to cover a large drug-like chemical space. The allosteric modulators from the 10 GPCR subtypes and one random drug-like compounds class (the decoy class) were used to construct the dataset for our 11-class classification task (Table 1, Figure 1A). Atom-pair, ECFP6, and MACCS fingerprints and molecular descriptors were calculated from the constructed dataset to represent four types of features. Different feature types characterize the properties of the compounds from diverse aspects, which may affect the performance of the ML models. Therefore, besides using one feature type at a time for model training, we also paired each fingerprint with the molecular descriptors, as well as combining all four feature types, to cover the best possible combinations (Figure 1A). Eight new datasets were generated from the different feature combinations. The 11 classes were labeled 0 to 10, respectively. Six supervised ML algorithms were applied to build classifiers for each dataset, and five-fold cross-validation was conducted on each dataset to select the best-performing model (Figure 1B). The ML models' performance was evaluated by ACC, precision, recall, f1-score, MCC, CK, Bal_ACC, and AUC values.

Data Set Analysis
To investigate the possible relationships between the physicochemical properties or fingerprints and the classification tasks, the chemical space distribution of each class was analyzed. t-SNE was used to decompose the molecular descriptors and fingerprints into two dimensions for visualization. Figures 2-4 show the chemical space distributions of the compounds in the dataset, represented by the three types of fingerprints and the molecular descriptors, respectively. As shown in Figures 2-4, the blue dots represent drug-like molecules that define the background of the overall chemical property space. All 10 allosteric modulator subtypes fell within the defined chemical spaces (based on molecular fingerprints and properties), indicating that both similar and distinctive drug-like molecules were involved in the classification tasks for target-specific allosteric modulators and were present in the model training and validation processes. Similar molecules challenge the robustness of the model training process, while distinctive molecules expose the vast chemical space to the classifiers. Each allosteric modulator subtype also occupies several specific regions distinct from those of the other subtypes, which may reflect their subtype selectivity. The distinct chemical space distributions of the different allosteric subtypes show the feasibility of applying our machine intelligence models.


Performance Evaluation on Different Feature Types
The ACC, precision, recall, f1-score, MCC, CK, Bal_ACC, and AUC values of all machine learning (ML) models on the validation sets are summarized in Tables S1-S3. Radar charts are also plotted to visualize all the above metrics for both training sets and test sets (Figures S1-S4). The results on the test sets are shown in Tables 2-4. According to these results, the SVM trained on ECFP6 (SVM-ECFP6) outperformed the other ML models, with the highest scores on AUC, ACC, f1-score, CK, and MCC. It also showed satisfactory results on Bal_ACC (0.950), precision (0.978), and recall (0.950). The good result on Bal_ACC is encouraging for an imbalanced dataset. The high f1-score of SVM-ECFP6 is also expected because it is the harmonic mean of precision and recall. MCC is a reliable and comprehensive assessment of model performance: high MCC values mean that the ML model correctly predicted both the positive and the negative data instances, indicating an outstanding discriminant capability. A confusion matrix summarizing the classification results of SVM-ECFP6 on the test set is plotted in Figure 5; the results of all other models are shown in the Supplementary Data (Figures S5-S16). As shown in the confusion matrix, SVM-ECFP6 separates all 11 classes well, with most classes being correctly predicted. The model also shows generalization capability even on small classes. Notably, all 17 test cases from the seventh class (PTHrP) were correctly classified.
In the confusion matrix, the element m(i, j) is the number of times an observation of the ith true class was predicted to be of the jth class, and each colored cell of the chart corresponds to one element of the matrix. Drug-like molecules, CB1, FFA2, mAchR M1, S1P3, GLP1-R, GCGR, PTHrP, mGlu2, mGlu4, and mGlu5 were labeled 0 to 10, respectively.
Comparing the different ML algorithms, SVM performed well when trained on a single fingerprint type, but its performance was significantly reduced when trained with molecular descriptors alone or in combination with other feature types. Compared to SVM, MLP showed more stable performance on all datasets. When trained on datasets with two or more feature types, MLP showed superior overall performance, with AUC values above 0.95. It also showed the best performance on the molecular descriptor dataset, with an AUC value of 0.943. MLP can be considered a subset of deep neural networks (DNNs), and its neural network architecture gives competitive performance on high-dimensional datasets, which may explain its good performance here. RF showed balanced performance on most of the datasets. It generally achieved good AUC scores, but compared to SVM and MLP, its Bal_ACC and recall values were lower, indicating that this classifier is more likely to be affected by the imbalanced dataset. LR showed similar results to RF. NB and DT performed the worst among all ML models, which is reasonable since they are simpler models and are often used as baselines for comparison.
The selection of different features also affects the ML models' performance. As summarized in Table 5, models trained on the ECFP6 feature showed the best overall results, whereas the molecular descriptors feature performed poorly compared with all other feature types or feature combinations. The results also showed that models trained on the MACCS fingerprint underperformed those trained on ECFP6 and Atom-pair fingerprints. The underperformance of MACCS could be due to the fact that MACCS contains less information (166 bits) than ECFP6 (1024 bits) and Atom-pair (1024 bits), which would result in a less robust model. However, a fingerprint containing a higher information density does not necessarily achieve a better outcome. Compared to ECFP6, which only contains structural information, Atom-pair is usually considered a hybrid type of fingerprint that contains both atomic and structural information. Nevertheless, in our case, ECFP6 still showed the best outcome across all GPCR classes. Since 10 GPCR protein targets and one random decoy class are involved in this study, the good result on ECFP6 means that ML models trained on this fingerprint perform well on the majority of the selected GPCR protein target classes.
To obtain a deeper understanding of the predictive models, we further extracted SlogP, molecular weight (M.W.), hydrogen bond acceptor (HBA), and hydrogen bond donor (HBD) counts from the molecular descriptors and conducted pair-wise distribution comparisons of each GPCR class with the drug-like molecules, shown in Figures S17-S26 in the Supplementary Data. The comparison plots showed that most classes (all except PTHrP) generally follow Lipinski's rule of five, since the distributions of the drug-like molecules overlap with those of the target GPCR classes. However, the physicochemical properties of allosteric modulators from some of the GPCR classes do form distributions distinguishable from those of the drug-like molecules.
For example, the distribution of CB1 on M.W. is more concentrated than that of the drug-like molecules, with a mode around 500, whereas the mode for the drug-like molecules is around 400. The distributions of GLP1-R and GCGR on SlogP center around 5, which also differs from the mode of SlogP for the drug-like molecules, around 4. It is worth noting that, compared to their performance on datasets combining two feature types, MLP, LR, and RF all showed enhanced performance when trained on the dataset combining all four feature types, possibly because the four feature types provide complementary information that is favored by these ML algorithms.

Performance Evaluation on Individual GPCR Classes
The f1-score takes the imbalanced data distribution into account and is the harmonic mean of precision and recall. Here we selected the f1-score as a representative metric to evaluate the ML models' performance on one feature type (Table 6), two feature types (Table 7), and four feature types (Table 8) for each GPCR class. The f1-scores across different GPCR families were also compared and are shown in the supporting information (Table S4). From Table 6, we can see that the best model, SVM-ECFP6, generally has a high f1-score across all GPCR classes. It has an f1-score over 0.98 on CB1, mAchR M1, S1P3, PTHrP, mGlu2, and mGlu5. The f1-scores of SVM-ECFP6 on FFA2, GLP1-R, and GCGR are all over 0.90 but below 0.95; their small sample sizes could impede the training of robust models. PTHrP has a similar sample size (88 samples in total) to FFA2 (89 samples in total). Surprisingly, it has an f1-score of 1, indicating a precise and robust classification. The 88 samples from PTHrP are all polypeptides and share a core scaffold.
They are very focused chemicals with small modifications to the core scaffold. This highly distinguishable pattern may explain the good performance of the ML models on this class. While a large sample size is desirable for building ML-based classification models, the current challenge remains the limited number of available GPCR allosteric modulators. In this study, SVM-ECFP6 gave a satisfying performance across all 11 classes (including small classes with around 90 samples), demonstrating that our method can generalize to broader GPCR families.
Consistent with the previous observation, SVM and MLP performed better than the other ML models on most classes when trained with one feature type (except the molecular descriptors feature). When trained with two or more feature types, the SVM's performance was significantly reduced, while MLP, LR, and RF showed better performance in each class (Tables 6-8). Moreover, similar to the result of SVM-ECFP6, many other models did not perform well on FFA2, GLP1-R, and GCGR, but most models had a high f1-score on PTHrP. In general, ML models trained on GPCR families A and C showed better f1-scores than those trained on GPCR family B (Table S4), as all datasets for class B GPCRs have small numbers of instances.

Conclusions
In this study, four types of features, Atom-pair, ECFP6, and MACCS fingerprints and molecular descriptors, were used to construct a series of datasets. The chemical space analysis of all four features demonstrated that the 10 allosteric subtypes form spatial patterns that are distinguishable from each other. Six ML models were built and trained on datasets with different feature combinations. SVM-ECFP6 showed the best results (AUC = 0.974, ACC = 0.976, Bal_ACC = 0.950, f1-score = 0.963, CK = 0.971, MCC = 0.971, precision = 0.978, recall = 0.950). MLP had the most stable performance across different feature combinations; in particular, it outperformed the other ML models on datasets constructed with two or more features. By comparing the ML models' performance on different features, we found that when only one feature type is used for training, ECFP6 is the best choice, given its good performance with most ML models. Mixed effects were observed across different feature combinations and ML models. In the field of drug discovery, we frequently need to deal with imbalanced datasets [51]. The models developed in our study show good generalization capability on an imbalanced dataset. To the best of our knowledge, this study is the first work on the multi-class classification of GPCR allosteric modulators. The developed multi-class classifiers provide an alternative option for virtual screening besides the conventional structure-based and ligand-based methods. Besides benefiting potential hit identification campaigns on GPCR allosteric modulators, this study also demonstrates the possibility of adapting machine learning to the broad area of drug discovery.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/biom11060870/s1: collected raw chemical datasets; results of the validation set on each dataset with different feature types (Tables S1-S3); the f1-score among different GPCR families on each dataset (Table S4).