Representation Learning for Class C G Protein-Coupled Receptors Classification

G protein-coupled receptors (GPCRs) are integral cell membrane proteins of relevance for pharmacology. The complete tertiary structure including both extracellular and transmembrane domains has not been determined for any member of class C GPCRs. An alternative way to work on GPCR structural models is the investigation of their functionality through the analysis of their primary structure. For this, sequence representation is a key factor for the GPCRs’ classification context, where usually, feature engineering is carried out. In this paper, we propose the use of representation learning to acquire the features that best represent the class C GPCR sequences and at the same time to obtain a model for classification automatically. Deep learning methods in conjunction with amino acid physicochemical property indices are then used for this purpose. Experimental results assessed by the classification accuracy, Matthews’ correlation coefficient and the balanced error rate show that using a hydrophobicity index and a restricted Boltzmann machine (RBM) can achieve performance results (accuracy of 92.9%) similar to those reported in the literature. As a second proposal, we combine two or more physicochemical property indices instead of only one as the input for a deep architecture in order to add information from the sequences. Experimental results show that using three hydrophobicity-related index combinations helps to improve the classification performance (accuracy of 94.1%) of an RBM better than those reported in the literature for class C GPCRs without using feature selection methods.


Introduction
G protein coupled receptors (GPCRs) are integral cell membrane proteins responsible for translating the molecular signals encoded in the chemical structure of hormones and neurotransmitters from outside to inside the cell. GPCRs share a common structure consisting of seven transmembrane helices (7TM), which are linked by three extracellular and three intracellular loops [1]. The binding of endogenous or synthetic agonists causes the activation of the receptor, which results in conformational changes that allow the allosteric coupling of accessory proteins such as G protein or β-arrestin at the intracellular part of the receptor [2,3]. Activation of these accessory proteins triggers the series of steps that constitute the signal transduction mechanism, which eventually lead to the observed physiological responses. The human GPCRs have been classified into five main families or classes (glutamate or class C, rhodopsin or class A, adhesion, frizzled or class F/taste2 and secretin or class B) by phylogenetic analysis [4]. Crystallographic determinations of a number of ligand-GPCR complexes have provided insights into the recognition determinants that discriminate between agonists (activators) and antagonists (inhibitors) [5], whereas other techniques such as nuclear magnetic resonance (NMR) [6], fluorescence approaches [7] and molecular dynamics (MD) [8] have led to mechanistic proposals for receptor activation and the allosteric transmission of the signal from the ligand binding site to the G protein or β-arrestin binding sites of the receptor.
GPCRs are at the center of current drug discovery programs. As of November 2017, approximately 35% of approved drugs in the United States or European Union target GPCRs [9]. There are different criteria for therapeutic drug design. One is selectivity, as it seems appropriate that drugs act selectively through specific receptors. Another is the concept of receptor polypharmacology in which a drug exerts a combination of positive effects by binding to different receptors [10]. Notwithstanding the approach that is followed, the correct classification of receptors in public databases is fundamental for virtual screening studies and in the examination of receptor functionality in general. To this aim, machine learning methods have proven to be useful [11][12][13][14][15][16][17]. For this, the standard procedure follows a feature extraction stage, where many ad hoc representations designed by specific domain experts can be used, and then, a classification stage is utilized. For the first stage, there are two main approaches to analyze GPCR sequences in order to extract the inherent features of the original sequences: multiple alignment and alignment-free representations. Many methods of both techniques have been developed in the literature achieving good representations, which are confirmed by the corresponding classification results [11][12][13][14][18][19][20]. However, the obtained/extracted representations are domain-dependent, which considers only certain factors (as frequency, order, etc.) of the original sequences.
In recent years, the representation learning field has arisen as an alternative resource for learning representations of the data that makes it easier to extract useful information when building classifiers [21]. That is, the main idea is to extract the relevant features (explanatory factors) from the observed data without using feature engineering methods. Following this idea and the good results presented in [22][23][24][25][26], in this paper, we aim to use a deep architecture in order to implicitly represent the explanatory factors of the protein sequences as much as possible and at the same time to obtain a model for classification. To this aim, we propose to use aligned GPCR sequences, which are translated into a numeric form by using an amino acid property index [27]. In the first stage, a hydrophobicity-related index is selected (because of its importance in determining the structure and function of GPCRs [14]) as the input for several deep architectures in order to choose one of them and find its parameters. After that, the preprocessed amino acid index (AAindex) database [19] is used as the input for training the selected deep architecture in order to implicitly represent the explanatory factors of the protein sequences. Experimental results assessed by the classification accuracy, Matthews' correlation coefficient (MCC) and balanced error rate (BER) show that using the hydrophobicity index number 531 and a restricted Boltzmann machine (RBM) can achieve performance results (accuracy of 92.9%) similar to those reported in the literature [12,20].
As a second proposal, we hypothesize that using two or more physicochemical property index combinations instead of only one might add information from the sequences that a deep architecture can extract and classify in a better way. Experimental results show that using three hydrophobicityrelated index combinations helps to improve the classification performance (accuracy of 94.1%) of an RBM better than those reported in the literature for class C GPCRs without using feature selection methods. The class C subfamily has been chosen for the present study due to structural, functional and therapeutic reasons [28].

Datasets
The current study focuses on class C GPCRs, which have become an increasingly important target for new therapies, particularly in areas such as fragile-X syndrome, schizophrenia, Alzheimer's disease, Parkinson's disease, epilepsy, L-DOPA-induced dyskinesias, generalized anxiety disorder, migraine, chronic pain, gastroesophageal reflux disorder, hyperparathyroidism, osteoporosis and drug addiction [29].
Because of its specificity, data were taken from GPCRdb (http://gpcrdb.org/) [30], which is defined as a molecular-class information system that collects, combines, validates and disseminates large amounts of heterogeneous data on GPCRs [31]. GPCRdb divides the GPCR superfamily into 5 families: the class A Rhodopsin like, the class B secretin like, the class C metabotropic glutamate/ pheromone, vomeronasal receptors (V1R and V3R) and taste receptors (T2R).
Class C GPCRs were selected for analysis because of (i) their structural complexity, (ii) high sequence length variability and (iii) therapeutic relevance. Briefly, (i) whereas all GPCRs are characterized by sharing a common seven-transmembrane (7TM) domain, responsible for G protein/β-arrestin activation, most class C GPCRs include, in addition, an extracellular large domain, the Venus flytrap (VFT) and a cysteine rich domain (CRD) connecting both [28]. It was till 2014 that the crystal structures of the 7TM domains of two class C receptors had been solved [32,33]. (ii) The full or partial presence of the whole domain structure confers a high sequence length variability to this family. (iii) The involvement of class C GPCRs in many neurological disorders, as previously mentioned, makes this class an attractive target for drug discovery and development.
Class C is, in turn, subdivided into seven types: metabotropic glutamate (mG), calcium sensing (Cs), GABA B (gB), vomeronasal (Vn), pheromone (Ph), odorant (Od) and taste (Ta). The investigated dataset is available in two forms: unaligned and aligned versions, which can be downloaded as Supplementary Material files. The former and the latter are distributed as shown in Tables 1 and 2, respectively. The unaligned version is used for experimentation with alignment-free transformations, while the aligned one is used for experimentation with representation learning methods. When the aligned version is used, each sequence is converted to a basic and numeric form by using an amino acid physicochemical property index taken from the amino acid index (AAindex) database [27]. This database contains three sections: AAindex1, AAindex2 and AAindex3 (Version 9), where AAindex1 contains 544 indices. For our experimentation, we used a preprocessed version of AAindex1, which contains 531 indices. All of them are available as Supporting Information in [19].

GPCR Representations
There are two main approaches to analyzing GPCR sequences through machine learning methods in order to capture the inherent features of the original sequences: (a) multiple alignment and (b) alignment-free representations. Both of them have been extensively utilized depending on the final application or use. Many methods of both techniques have been developed in the literature achieving good representations, which are confirmed by the corresponding classification results [11][12][13][14][18][19][20]. However, most of them are manually designed ad hoc by specific domain experts as a pre-processing step, which produces the fixed-length inputs for the classification methods. Therefore, the obtained/extracted representations are domain-dependent, which considers only certain factors (such as frequency, order, etc.) of the original sequence. Consequently, the extracted features can be relevant or not when they are used for different applications.

Multiple Sequence Alignment and Alignment-Free Representations
A very common preprocessing step for protein classification is multiple sequence alignment (MSA). The outputs of MSA are sequences of the same length using the one-letter code of the amino acids. Several methods and tools of MSA have been developed for studies of homology and evolutionary relationships between the sequences [34][35][36]. In addition, MSA output can be used as input for machine learning methods applied to classification tasks. Usually, the MSA output is directly used with natural language processing (as n-grams) or similarity matrix-related techniques. When MSA is used, the protein classification results strongly depend on the characteristics of the information provided by the alignment.
On the other hand, alignment-free protein representations have been defined in the literature in order to capture as much relevant information that might be conveyed by an amino acid sequence as possible. Among these, some rely on transformations based on the amino acid physicochemical characteristics, such as the auto-cross-covariance transformation [37,38].
In this paper, we consider a basic and three advanced alignment-free data transformations to obtain fixed-length vectors as input data for supervised classification algorithms. The corresponding transformed resulting datasets are available as Supplementary Material files. The first and most simple one reflects the amino acid composition (AAcomp) of the primary sequence: the relative frequencies of the occurrence of the 20 amino acids are calculated for each sequence resulting in a N × 20 matrix, where N is the number of sequences in the dataset. This transformation does not take into account the relative position of amino acids in the sequence.
The second and third are extensions of the AAcomp, which include sequence-order information. The second is known as pseudo-amino acid composition (PseAA) [39], while the third is formed by a hybrid feature vector, which combines multiscale energy (MSE) and PseAA representations. Both representations have shown a better GPCR classification performance than AAcomp [14,16].
For a GPCR sequence S = R 1 , R 2 , . . . , R L where R i represents the amino acid at position i in the sequence S of length L, the PseAA is defined as: where Λ = 20 + n × λ (λ = 0, 1, . . . , m is the number of levels used to compute the correlation factors of the amino acids in the sequence, and n is the number of physicochemical properties used as relevant information for the GPCR sequences). Following [14,40,41], we set λ = 21 as the maximum level and n = 2 physicochemical properties (hydrophobicity and hydrophilicity). That is, the PseAA feature vector length is 62, where the first 20 elements are the relative frequencies of occurrence of the 20 amino acids (as AAcomp), and the remaining elements are the first-level to λ-level correlation factors of amino acid sequences for each physicochemical property. In our case, the PseAA transformation of the class C GPCRs was obtained by using the PseAAC server [42]. Now, the wavelet-based MSE representation of a sequence is defined as: where k = 1, 2, . . . , N (N is the total number of GPCRs); d k i is the root mean square energy of wavelet detail coefficients in the corresponding i-th scale; and a k m is the root mean square energy of wavelet approximation coefficients in the m-th scale. For this transformation, the GPCR sequences are first converted into a numeric form by using hydrophobicity values taken from the FHscale [43]. The resulting numeric form takes the role of a digital signal in which the wavelet (Haar) transformation is applied. That is, the approximation (a k m ) and detailed (d k i ) coefficients are computed, where the maximum decomposition level (scale) m of a sequence is taken as log 2 (L).
Finally, the MSE and PseAA are concatenated to form a hybrid feature vector as follows: Major details for computing PseAA and MSE can be found in [14,16,40]. The fourth representation, related by the descriptors obtained in [44], is the ACC transformation [37,38]. Here, time series models are applied to the protein sequences in order to extract their sequential patterns, and consequently, the extracted information is sequence-order dependent. This representation was originally developed in [38] and then applied and modified in [15,37].
The ACC transformation can be described as follows: each sequence is first translated into physicochemical descriptors by representing each amino acid with the five z-scales derived in [44], where these scales are in turn obtained from 26 physicochemical properties. The auto-covariance (AC) and cross-covariance (CC) variables are then computed from the transformed sequences. The AC measures the correlation of the same descriptor, d, between two residues separated by a lag, l, along the sequence, and it can be calculated as: The CC variable measures the correlation of two different descriptors between two residues separated by a lag along the sequence, and it can be computed as: where l = 1, . . . , Lag and Lag is the maximal lag, which must be lesser than the length of the shortest sequence in the dataset; n is the total number of amino acids in the sequence; v d,i is the value of descriptor d = 1, . . . , D (D = 5) of an amino acid in a sequence at position i;v d is the mean value of descriptor d across all positions; and p is the degree of normalization. From these, the ACC fixed-length vectors are obtained: first, the AC and CC terms from Equations (4) and (5) are concatenated for each lag (C(l) = [AC(l) CC(l)]), and then, the ACC is obtained for a maximum lag Lag by concatenating the C(l) terms, that is, Here, the length of an ACC feature vector is length(AC) × length(CC) × Lag = 25 × Lag. Details of this procedure can be found in [15,37].

GPCR Feature Learning Proposal through the Deep Approach
In recent years, the representation learning field has arisen as an alternative resource for learning representations of the data that makes it easier to extract useful information when building classifiers [21]. That is, the main idea is to extract the relevant features (explanatory factors) from the observed data without using feature engineering methods.
When representation learning methods are applied to GPCR sequences, a fixed-length and as unprocessed as possible representation of them is needed as the input for these methods. For this reason, we take the aligned version (see Table 2) with 259 fixed-length sequences of the GPCR database described in [30].
In our first proposal, each aligned sequence is converted to a basic and numeric form by using an amino acid property index taken from the preprocessed AAindex1 database [19,27]. That is, the sequence S = R 1 , R 2 , . . . , R L of length L is now represented by: where I k i indicates the corresponding numeric value of the amino acid R i using the k-th amino acid property index. In the case that a gap is presented in a sequence, it is replaced by a zero value. From Equation 7, it is observed that neither occurrence frequency, nor order information from S are included in S .
In this way, for the class C GPCRs dataset, we form k = 1, 2, . . . K input datasets, where K = 531 is the total number of indices of the preprocessed [19] amino acid properties index database [27]. Each k-th dataset is used as input for training a deep architecture in order to implicitly represent the explanatory factors of the protein sequences as much as possible and at the same time to obtain a model for classification. For illustration, Figure 1 shows how a sequence is used for training a deep architecture. It is observed from this figure that we can use different kinds of deep architectures to represent a dataset. In this paper, we experiment with basic and functional architectures, namely: (a) autoencoders, (b) convolutional neural networks (CNN) and (c) restricted Boltzmann machine in the first stage in order to select the architecture that best represents the original dataset. In this stage, a hydrophobicity-related index is selected because of its importance in determining the structure and function of GPCRs [14].
After the selection of a deep model, we proceed to find the right number of hidden layers and the number of neurons in each hidden layer by using a grid search in the range of [1, 2, . . . , 10] and [100, 200, 300, 500, 800], respectively. The number of neurons in a hidden layer is selected to be lesser or greater than the number of input neurons in order to allow the codification or magnification of the information from the inputs.
Once the number of hidden layers and the number of neurons in each hidden layer is found, we look into a neighborhood of the number of neurons in a hidden layer in order to refine and confirm the results. Next, we use the best setting (deep model, number of hidden layers and number of neurons in a hidden layer) to train a model using each one of the 531 indices from the AAindex database. This process will help to select the physicochemical index that in conjunction with the selected deep architecture represents the explanatory factors of the GPCR sequences. Now, we hypothesize (as a second proposal) that using two or more physicochemical property indices instead of only one might add information from the sequences that a deep architecture can extract and classify in a better way. This is carried out by combining the physicochemical indices for each amino acid in a sequence. That is, if a GPCR sequence has a length of L (259), after combination, its length is L × n, where n is the number of indices. For n = 2, the sequence is represented by: where I j i , I k i indicates the combination of the corresponding numeric value of the amino acid R i using the j-th and k-th amino acid property indices.

Deep and Conventional Supervised Learning Methods
In this paper, we experiment with basic and functional deep architectures, namely: autoencoders, convolutional neural networks and restricted Boltzmann machine in the first stage in order to select the architecture that best represents the original dataset. These architectures help to discover complex structures in datasets, which are used to compute the representations in each layer. These distributed representations lead to improved generalization for different tasks.
Convolutional neural networks have been widely applied to the recognition of objects in digital images. The architecture of a typical deep CNN is structured as a series of convolutional layers and pooling (subsampling) layers. The role of a convolutional layer is to detect local conjunctions of features from the previous layer, whereas the role of a pooling layer is to merge semantically similar features into one [45].
A stacked autoencoder is used mainly to encode the inputs into some representation so that the inputs can be reconstructed from that representation. In practice, the output representation can also be used to initialize a deep neural network for multi-class classification. In this paper, a stochastic version of the autoencoder is used, namely the denoising autoencoder, which avoids learning the identity function [46,47].
A stacked RBM is a particular type of energy-based model with hidden variables, which has the restriction that its neurons must form a bipartite graph. An RBM is formed by a visible input layer and a hidden layer and connections between them, but not within a layer. Usually, the contrastive divergence algorithm is utilized as the unsupervised training procedure to detect features from the inputs [46,48].
For classification tasks, a deep belief network or simply a deep neural network can be constructed by stacking RBMs or autoencoders where the top layer (n) is used as the classifier's output. In the first stage, a deep network of this kind is trained without supervision using n − 1 layers to detect the main features of the inputs. After this, the n-th layer is added to the network and trained with supervision through the error backpropagation algorithm to perform classification.
On the other hand, we also compare the obtained RBM results with some conventional classifiers such as k-nearest neighbor (k-NN), decision tree (DT), multilayer perceptron (MLP) and support vector machine (SVM).
k-NN is one of the simplest classifiers. It finds the k points in the training set that are nearest to the test input, then counts how many members of each class are present in the corresponding neighborhood (formed by the k points) and returns a class label belonging to the most common class in such a neighborhood.
Another basic classifier is a DT or classification tree. It partitions the feature space into hyperrectangles with sides parallel to the axes and then fits a simple model in each one. That is, the sequence of decisions is applied to individual features. In the resulting (tree-like) structure, an internal node represents a test on a variable or attribute, and a leaf node represents a class label.
MLP is a sophisticated feedforward neural network architecture, which can be trained in a supervised manner through the error backpropagation algorithm. The network contains layers of hidden neurons, which extract meaningful features from the input vectors. Each neuron in the network uses a nonlinear activation function, which helps to model non-linearities in its input-output relation.
A more sophisticated and widely-applied nonlinear classifier is SVM. It separates the input data points by mapping them into a high-dimensional feature space where a hyper-plane is constructed. This hyper-plane creates a decision surface, which has a maximum distance to the nearest points in the feature space. That is, two key concepts are involved in the design of an SVM: large margin separation and kernel functions. The former concept means that the constructed hyper-plane should be placed as far as possible away from the points in different classes. The latter concept helps to calculate the similarity between points in the corresponding feature space, which allows an SVM to generate nonlinear decision boundaries [49,50]. In this paper, a radial basis function was used as the kernel (due to the good results presented in [12,20]), where a grid search was carried out to find the regularization parameter C and the kernel width parameter σ.

Performance Assessment Measures
The performance measures used in the experiments are classification accuracy, MCC and BER. Accuracy is widely known and used as the proportion of correctly-classified cases. MCC and BER are commonly used as performance measures when the analyzed datasets are class-unbalanced. All of them can be naturally extended from the binary to the multi-class context [51].
Let us assume a classification problem with S samples and G classes and two functions defined as tc, pc : S → {1, . . . , G}, where tc(s) and pc(s) return the true and the predicted class of s, respectively. The corresponding square confusion matrix C is: in which the ij-th entry of C is the number of cases of true class i that have been assigned to class j by the classifier. Then, the confusion matrix notation can be used to define the accuracy, MCC and BER as: BER is the average of the errors on each class, which takes values in the interval [0, 1]. Then, 0 means perfect classification where no error contribution per class was found, and 1 means an extreme misclassification case where items for each class are misclassified.
MCC is commonly used in the bioinformatics field and takes values in the interval [−1, 1], where 1 means complete correlation (perfect classification), 0 means no correlation (all samples have been classified to be of only one class) and −1 indicates a negative correlation (extreme misclassification case). MCC is recommended as an optimal tool for practical tasks, since it presents a good trade-off among discriminatory ability, consistency and coherent behavior with a varying number of classes, unbalanced datasets and randomization [52].

Results and Discussion
The experimental results reported in this section aim to assess the ability of representation learning methods to extract the explanatory factors from the observed class C GPCR sequences without using feature engineering methods. For this purpose, two kinds of experimentation are designed.
Firstly, unaligned amino acid sequences are transformed according to the alignment-free transformations described in Section 2.2 in order to extract the relevant features that will help to gauge the classification performance using conventional supervised methods. Secondly, aligned amino acid sequences converted to a basic and numeric form are used as input for deep learning methods in order to implicitly represent the explanatory factors of the protein sequences. These models have the characteristic that at the same time the representation is extracted, a classification model is also obtained, which is assessed through classification performance.

Class C GPCRs Classification Using Alignment-Free Representations
The goal of the experiments in this subsection is two-fold. Firstly, we aimed to gauge the ability of the alignment-free amino acid sequence transformations to capture the inherent relevant features of class C GPCR subfamilies through supervised classification models. Secondly, we aimed to compare the performance of four conventional supervised models in terms of classification performance.
For the first set of experiments, the alignment-free transformations described in Section 2.2 are used in order to obtain the fixed-length feature vectors of the class C GPCRs unaligned dataset (see Table 1). This means that the AAcomp, PseAA, PseAA-MSE and ACC transformations are computed to obtain the corresponding four datasets as input for classification algorithms.
Following Section 2.2, a feature vector of the AAcomp dataset has a length of 20; for the PseAA dataset, the length is 62; and for the PseAA-MSE, the length is 74; taking a maximum decomposition level of m = 11 (log 2 (max{L 1 , L 2 , . . . , L N }) ≈ 11, where L i is the length of the sequence i). In the case of the ACC transformation, it uses two parameters that must be set to adequate values prior to classification: the maximum Lag and the degree of normalization p. In this study, we set both as Lag = 13 and p = 0.5, since the unaligned dataset is almost the same as in [11,12]. Then, the length of an ACC-transformed feature vector is 25 × Lag = 325.
For the second set of experiments, we selected two baseline and two sophisticated (non-linear) classifiers. Here, the k-nearest neighbor, decision tree, multilayer perceptron trained with the backpropagation algorithm and support vector machines were used. For k-NN, different neighborhoods were tried in the range k = 1, . . . , 10. Different settings for the number of hidden layers (hl) and number of neurons in a hidden layer (nhl) for MLP were used as hl = [1,2,3,4,5] and nhl = [10,20,30,40,50,60,70,80,90,100]. In the case of SVM classifier, the radial basis function kernel was used, which utilizes two parameters that must be identified in order to accurately predict unknown data: C and γ. For this, a grid search is carried out in the ranges C = [1,16] and γ = [2 −10 , 2 5 ] as in [12,20]. For these classifiers, only the parameters that lead to the best classification performance are reported.
For all conventional classifiers, the corresponding implementation available in the Weka (Version 3.6) toolbox [53] was used. It also allows data preprocessing where data normalization in the range [0, 1] was carried out using the min − max normalization technique. In order to estimate the average classification performance, 10-fold cross-validation is used.
The average classification accuracy results using alignment-free representation datasets with the above described classifiers are shown in Table 3. From these results, SVM is shown to outperform the rest of the classifiers in terms of accuracy, which is similar to that reported in the literature [12,20]. On the other hand, the alignment-free transformation that best captures relevant features through classifiers is ACC, except for decision trees. This is followed by PseAA and PseAA-MSE transformation, which indicates the importance of adding sequence-order information in transformed feature vectors [11,14,16].

Class C GPCRs Classification Using Representation Learning
The goal of the experiments in this subsection is two-fold. Firstly, we aimed to gauge the ability of representation learning methods to extract the explanatory factors directly from the observed data sequences through deep learning approaches. Secondly, we aimed to compare the performance of deep and conventional learning models in terms of classification performance.
As stated in the first proposal of Section 2.3, the aligned dataset (see Table 2) is converted to a numeric form by using an amino acid property index taken from AAindex database [19,27]. From the 531 indices, in the first stage, we selected a hydrophobicity-related index, because of its importance in determining the structure and function of GPCRs [14], then the hydrophobicity index 2 is chosen. The resulting dataset is named AAhydro.
Three common deep architectures were selected for experimentation: autoencoders, restricted Boltzmann machines and convolutional neural networks. In order to select the best architecture, which will be tuned in a posterior step, a basic configuration for each was used: two hidden layers and 700 neurons for each layer. To estimate the classification performance of the deep models, stratified 10-fold cross-validation was carried out.
The corresponding implementation of deep architectures was taken from [54,55]. The average classification accuracy results of the different deep architectures are shown in Table 4. Here, it is observed that RBM outperforms the other deep architectures in terms of classification accuracy. It is worth noting that RBM is modeled through a Gaussian-Bernoulli distribution, which naturally allows real-valued inputs. Although it is widely-known that CNNs have good performance for image pattern recognition (where large datasets are used), this is not the case for class C GPCR classification where the amount of analyzed data is not large enough. From here on, RBM is selected as the deep architecture trained with the backpropagation algorithm where gradient descent is accelerated by Nesterov's method [56]. In order to find the right configuration for the number of hidden layers and the number of neurons for layer of RBM, an ad hoc and coarse grid search was carried out in the ranges [1, 2, . . . , 10] and [100, 200, 300, 500, 800], respectively. The corresponding average classification accuracy results of this search are progressively shown in Tables 5-8. From Tables 5 and 6, it is observed that the right number of neurons for the first and second layer is around 500. Then, the number of neurons for the third and fourth layer is around 500. Tables 7 and 8 show that no improvement is achieved when we add more hidden layers. We also tried five hidden layers, but the results are worse than previous tables; therefore, they are not reported.  Since the best results are obtained using 500 neurons for two hidden layers, we proceed with a fine grid search of around 500 neurons for each layer. Then, we tried the range [400, 450, 500, 550, 600] for each layer. The average classification accuracy results for this search are shown in Table 9. Again, the best results are obtained with 500 neurons for the first and second layers. Table 9. Average classification accuracy results for an RBM with two hidden layers using a fine grid search around 500 neurons and the AAhydro set. From previously-obtained results, we selected two hidden layers and 500 neurons for each layer as the right configuration for RBM. Now, we train the selected RBM architecture using each one of the 531 indices from the preprocessed AAindex database [19]. This process will help us to select the amino acid physicochemical property index that in conjunction with RBM represents the explanatory factors of the class C GPCR sequences.
The average classification results of the 12 amino acid physicochemical property indices with the highest classification accuracy are shown in Table 10. Since the resulting datasets are unbalanced (see Table 2), the MCC and BER measures are also presented in order to compare them with accuracy results. It is observed from Table 10 that the amino acid property index number 531 in conjunction with RBM represents in a better way the explanatory factors of the class C GPCR sequences than the initial hydrophobicity index number 2. Although indices 531 and 485 have similar accuracy results, the BER measure is in favor of the hydrophobicity index number 531, which indicates a minimum mean misclassification for each class. Furthermore, this result is similar to that reported in the literature using feature engineering methods with SVM classifier (Konig 2013, 2014), but in contrast, an RBM learns representations directly from the observed data sequences.
In order to find out to what extent each of the seven class C GPCR types described in Section 2.1 can be discriminated from the rest and how each of them influences the overall classification performance, the four highest accuracy results represented by their corresponding amino acid property indices are presented in Figure 2 for all these types. Here, it is clear that the overall pattern of supervised classification is quite stable across amino acid property indices, except for index 166. The tendency is that the odorant and pheromone subfamilies are those that contribute less to the overall classification, which is a pattern similarly obtained in [11,12] with a difference (in favor of RBM) in the vomeronasal subfamily results. In this figure, five out of seven subfamilies (including vomeronasal) have high classification performance. The exception for this pattern is the results from index 166, which indicate that RBM cannot extract the explanatory factors of calcium sensing receptors, but in contrast, it has the highest recognition rate for the most difficult subfamily (odorant) to discriminate. Results from Figure 2 suggest that if feeding an RBM with information of two or more amino acid property indices instead of one, it probably could extract and represent more inherent and hidden information from GPCR sequences and consequently improve classification performance. Therefore, as a second proposal, we can combine two or more amino acid property indices as inputs for the RBM architecture previously selected.
Providing two amino acid property indices to an RBM means that the input sequence is first converted to a numeric form as I  Table 10 in order to reduce the search space.
The average classification results of the five amino acid property index combinations with the highest classification accuracy are shown in Table 11. From this table, it is observed that a combination of indices 65 and 205 in conjunction with RBM can represent the class C GPCRs types in a better way than using only one index, then classification performance is improved and confirmed by all the performance measures. This result outperformed the one obtained in [12] using feature engineering methods and is similar to [20], obtained by feature selection methods, with the difference being that we did not resort to these kinds of methods.  28 11.55 As in the previous experiment, we investigate the class-specific contribution to overall classification for class C GPCR types, and this is shown in Figure 3. The tendency and pattern described by these results are very similar to the ones described by using only one amino acid property index, but this, time the recognition rate of the most difficult subfamily to discriminate is improved.  Now, we proceed with combinations of three and more amino acid property index combinations. Since the results were not improved using four or more combinations, we only present the performance results with three index combinations in Table 12 and Figure 4. The classification results in Table 12 slightly improve the highest obtained using a combination of two indices. In particular, the combination of indices 485, 247 and 193 is better than the combination of 65 and 205 in terms of MCC and BER measures, but the rest of the combinations are not better than those shown in Table 11.  From Figure 4, it is observed that the same pattern described in Figures 2 and 3 is found, including the recognition rate improvement of the odorant type.
A summary of the highest classification performance using one, two and three amino acid property index combinations is presented in Figure 5. Here, it is observed that the highest results are addressed by the ability of the recognition (discrimination) of odorant and pheromone subfamilies. According to [11,12], subfamilies related to the odor function, such as vomeronasal, pheromone and odorant, are the most difficult to discriminate. However, Figure 5 shows that an RBM using one, two or three amino acid property index combinations can perfectly discriminate the vomeronasal type from the rest. Moreover, an RBM using the 485-247-193 index combination can also highly recognize the pheromone and odorant subfamilies. These results reveal the important contribution of hydrophobicity-related index combinations to correct amino acid sequence classification. This is not an unexpected result considering that GPCRs are membrane proteins, and thus, hydrophobic residues are highly present along the sequence and important both for receptor structure and function [14]. Finally, we compare the performance obtained with the highest classification accuracy results of RBM using one, two and three amino acid property index combinations with conventional supervised classification methods. For this purpose, the datasets obtained with one, two and three amino acid property index combinations are used as input for classification methods, such as SVM, k-NN and DT. The corresponding parameters of SVM and k-NN were as in Section 3.1, and the best average classification accuracy results are reported in Table 13. From Table 13, it can be observed that RBM can extract and represent the inherent and hidden information of class C GPCRs in a better way than conventional classification methods, which is confirmed by the accuracy, MCC and BER measure results. These results outperformed those reported in the literature [11,12,20] for class C GPCR classification without using feature selection methods.

Conclusions
Given the interest in class C receptors in pharmacology and in the absence of much knowledge regarding their complete 3D crystal structures, the investigation of their functionality can be approached through the analysis of their primary structure in the form of amino acid sequences. For this, many works reported in the literature [11,13,14,16,19,20,37] have coincided with the fact that sequence representation is a key factor for the GPCR classification task. Following this idea and opposite to the standard procedure of applying feature engineering methods for sequence representation, the use of the representation learning approach for automatically acquiring the features that best represent the class C GPCR sequences is proposed in this paper. That is, the AAindex database is used as the input for training a stacked RBM in order to implicitly represent the explanatory factors of the protein sequences. Experimental results assessed by classification accuracy, MCC and BER show that using the hydrophobicity index number 531 in conjunction with an RBM can achieve performance results similar to those reported in the literature. Furthermore, it is also shown that using three hydrophobicity-related index combinations helps to improve the classification performance of an RBM better than those reported in the literature for class C GPCRs without using feature selection methods.
Besides, type-specific classification results have shown that the discriminative and representative ability of the stacked RBM for each type varies according to the provided amino acid property index combinations, but keeping, in general, a stable and consistent classification pattern across all index combinations. Moreover, and importantly for the problem of recognizing the subfamilies related to the odor function, the experimental results indicate that RBM in conjunction with any amino acid physicochemical property index combinations can quite accurately represent and discriminate the vomeronasal type, and specifically using the 485-247-193 index combination, it can also highly recognize the pheromone and odorant subfamilies.
Motivated by the fact that relevant features of two class C GPCR subfamilies (related to the odor function) are difficult to represent and classifiers confuse them, a multi-label learning approach that allows an instance to belong to different classes is considered as future work. Furthermore, a pertinent evaluation of the three hydrophobicity-related index combinations found in this work should be carried out at a biochemical level.