1. Introduction
Labeling data according to classes in a hierarchy is known as hierarchical classification. Much real-world data is naturally organized hierarchically, in areas as disparate as diseases, text, plant species, protein functions, websites, documentation, music genres, and images. Manually labeling such data within a hierarchical structure is challenging and complex, and the problem becomes more difficult as the amount of data grows over time. Although many classification methods have been developed to automate this task, most are inefficient and ineffective for hierarchical classification because they ignore the relationship information between classes in the hierarchy [1]. Various methods have been introduced in past studies to overcome these weaknesses, yet there is still room to improve existing methods through further experiments on hierarchical classification.
One way to address the hierarchical classification problem is to consider the feature representation of each class in the hierarchy. Each feature has a different level of relevance for representing each class [2], and the use of inappropriate features may degrade classification performance [3,4]. Many feature representation techniques have been used to represent protein sequences. The well-known amino acid composition (AAC) method represents protein classes based on the frequency of amino acids in the protein sequence; however, it discards the sequence-order information [3]. The pseudo amino acid composition (PseAAC) method overcomes this problem by encoding both the amino acid frequencies and information on the order of amino acids in the protein sequence [5]. Additional protein feature representation methods include N-grams [6,7], the position-specific scoring matrix (PSSM) [8,9], the z-value [10], and combinations of several features [11]. According to [12,13], most studies use global features computed from the entire protein sequence. Nevertheless, such global features cannot extract hidden information in protein sequences [13] and often contain overlapping and unnecessary information [11,14].
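The limitation of AAC noted above can be made concrete with a short sketch (illustrative only, not the code used in this study): AAC counts residue frequencies and therefore discards sequence order, which is exactly the information PseAAC reintroduces.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac_features(sequence):
    """Amino acid composition: per-residue frequency, order-free."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]

# Two sequences with the same composition but different residue order
# map to an identical AAC vector; PseAAC adds order information on top.
print(aac_features("ACDA") == aac_features("DACA"))  # True
```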
The discrete wavelet transform (DWT) is a feature representation method capable of analysis at multiple resolutions [15,16,17] and is well suited to representing features of biological data [13,16,17,18,19]. Through multiresolution analysis, the DWT can capture protein sequence-order information more effectively and allows biological signals to be analyzed in both the frequency domain and the time domain [17,19]. This differs from signal processing methods such as the Fourier transform, which can only study signals in the frequency domain [13,15,20,21]. The DWT can therefore provide more information than other feature representation methods [22,23,24,25]. It produces both global and local features at various decomposition levels and yields features that do not overlap [26]. Global features are obtained from the DWT approximation coefficients, while local features are obtained from the detail coefficients [17].
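The split into approximation (global) and detail (local) coefficients can be illustrated with a minimal Haar DWT in NumPy. This is a sketch for intuition only; the study selects among many wavelet families, not just Haar.

```python
import numpy as np

def haar_step(x):
    """One Haar DWT level: scaled pairwise sums give the approximation
    (global trend) band, pairwise differences the detail (local) band."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:                       # pad odd-length signals
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def wavedec(x, level):
    """Multi-level decomposition: keep re-decomposing the approximation."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(level):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

signal = np.sin(np.linspace(0, 8, 170))       # stand-in for a 170-d feature vector
approx, details = wavedec(signal, level=3)
print(len(approx), [len(d) for d in details])  # 22 [85, 43, 22]
```

One approximation band plus one detail band per level: the approximation carries the coarse, global shape, while each detail band isolates local variation at a different scale.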
Despite the many advantages of the DWT, the main difficulty lies in selecting the appropriate wavelet family and decomposition level to represent each class in the GPCR protein hierarchy, since many family types and decomposition levels are available. Selecting them appropriately is important in data analysis because an accurate representation preserves the data's essential characteristics [27,28,29] and assists in understanding the organization and complexity of the data [16,30]. The selected wavelet family should match the studied data in terms of orthogonality, symmetry, and shape similarity [31,32]. The type of application influences the choice of wavelet, and out-of-range decomposition levels produce useless features that harm the analysis [33]. Furthermore, selecting the appropriate family type and decomposition level for the feature representation of a class is important because both parameters affect classification performance [34,35,36]. However, previous studies have mostly selected the wavelet family and decomposition level manually, based on experience [4,30], rather than finding the optimal choices for the data at hand. Well-known DWT families include Haar, Daubechies, Coiflets, Symlets, Discrete Meyer, and Biorthogonal; they differ in properties such as support width, number of vanishing moments, symmetry, and orthogonality [37]. The decomposition level determines the number of global and local features produced by the DWT [34]. Using a high decomposition level on a short sequence creates overlapping information, while using a low decomposition level on a long sequence ignores much of the information.
Some studies have used metaheuristic methods to optimize the selection of the wavelet family and decomposition level, including genetic algorithms [27,38], particle swarm optimization [27,34,39], the whale optimization algorithm [28], and evolutionary quantum crowding [29]. However, no research has addressed choosing the optimal wavelet family and decomposition level for protein feature representation in hierarchical classes. This study uses a hybrid of the particle swarm optimization algorithm and the firefly algorithm (FAPSO) to select the DWT family and decomposition level. This hybridization overcomes the shortcomings of the individual particle swarm optimization (PSO) and firefly algorithm (FA) methods [40]. PSO suffers from premature convergence, entrapment in local optima, and a low convergence rate during exploitation [41,42]. The FA has advantages over PSO: it has no personal-best and global-best parameters that could cause it to become stuck in local minima or converge prematurely, and it has no velocity parameter, avoiding the problem of slow particle velocities. In this study, the FA and PSO are combined to prevent the premature convergence of either algorithm, avoid local optima, and balance the exploitation and exploration processes [43].
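To make the hybridization idea tangible, the following is a minimal, hypothetical sketch of one way FA and PSO moves can be combined: particles that score better than average follow a standard PSO velocity update, while poorer particles take a firefly-style attraction step toward the best solution. The update rules, parameter values, and the toy objective are illustrative assumptions, not the exact FAPSO of the cited work, which optimizes classification performance rather than a test function.

```python
import numpy as np

rng = np.random.default_rng(42)

def sphere(x):
    """Toy objective to minimise; the paper's objective is classification
    performance over wavelet family and decomposition level choices."""
    return float(np.sum(x ** 2))

def fapso(obj, dim=2, n=20, iters=100, w=0.7, c1=1.5, c2=1.5,
          beta0=1.0, gamma=1.0, alpha=0.2):
    pos = rng.uniform(-5, 5, (n, dim))
    vel = np.zeros((n, dim))
    pbest = pos.copy()
    pbest_f = np.array([obj(p) for p in pos])
    g = pbest[np.argmin(pbest_f)].copy()
    for _ in range(iters):
        f = np.array([obj(p) for p in pos])
        mean_f = f.mean()
        for i in range(n):
            if f[i] <= mean_f:
                # Better-than-average particle: standard PSO velocity update.
                r1, r2 = rng.random(dim), rng.random(dim)
                vel[i] = (w * vel[i] + c1 * r1 * (pbest[i] - pos[i])
                          + c2 * r2 * (g - pos[i]))
                pos[i] += vel[i]
            else:
                # Poorer particle: firefly attraction toward the global best,
                # with brightness decaying over squared distance plus a small
                # random walk for exploration.
                sq_dist = np.sum((g - pos[i]) ** 2)
                beta = beta0 * np.exp(-gamma * sq_dist)
                pos[i] += beta * (g - pos[i]) + alpha * rng.normal(size=dim)
            fi = obj(pos[i])
            if fi < pbest_f[i]:
                pbest_f[i], pbest[i] = fi, pos[i].copy()
        g = pbest[np.argmin(pbest_f)].copy()
    return g, float(obj(g))

best, best_f = fapso(sphere)
print(best_f)  # typically very close to 0 for this toy objective
```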
The other problem relates to error propagation in hierarchical classification: a classification error at a node near the top of the hierarchy propagates down to the lower levels and degrades the final results, and it cannot be corrected at the bottom of the hierarchy [44,45]. Various methods have been proposed to overcome this problem [22,45,46,47]. This study uses the virtual class method introduced by [48], which was also applied in [49] to address error propagation in hierarchical text classification. In this method, each parent node except the root node and the terminal nodes is given one additional child node, known as the virtual class node, which contains training data belonging to the parent node only. During the training phase, the classification model at each parent node learns to distinguish between its child classes and the virtual class node. During the testing phase, the models generated at the parent nodes are applied using a top-down classification strategy. A study by [49] found that the virtual class method avoids propagated errors by stopping the classification at the correct hierarchical level, and that it produces good experimental results compared to standard hierarchical classification methods.
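A minimal sketch of this top-down scheme may help. The node names, the tiny nearest-centroid classifier, and the toy data below are all illustrative assumptions; the point is only the control flow: each parent's model can predict a "VC" label, which stops the descent at that node instead of forcing a (possibly wrong) leaf prediction.

```python
import numpy as np

class NearestCentroid:
    """Toy per-node classifier standing in for the real models."""
    def fit(self, X, y):
        self.labels = sorted(set(y))
        self.centroids = {c: np.mean([x for x, t in zip(X, y) if t == c], axis=0)
                          for c in self.labels}
        return self
    def predict_one(self, x):
        return min(self.labels,
                   key=lambda c: np.linalg.norm(x - self.centroids[c]))

def predict_top_down(models, x, node="root"):
    """Descend the hierarchy; predicting 'VC' stops at the current node."""
    while node in models:
        label = models[node].predict_one(x)
        if label == "VC":
            return node          # x belongs to this node itself; stopping
        node = label             # here avoids propagating an error downward
    return node

# Root chooses between families A and B; node A chooses between its
# children A.1, A.2, and its virtual class VC (parent-only samples).
Xr = np.array([[0, 0], [0, 1], [10, 10], [10, 11]]); yr = ["A", "A", "B", "B"]
XA = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [-5, -5]])
yA = ["A.1", "A.1", "A.2", "A.2", "VC"]
models = {"root": NearestCentroid().fit(Xr, yr),
          "A": NearestCentroid().fit(XA, yA)}

print(predict_top_down(models, np.array([0.0, 0.5])))    # A.1
print(predict_top_down(models, np.array([-5.0, -4.0])))  # A (stopped by VC)
```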
Therefore, this study analyzes the suitable wavelet family and decomposition level using the FAPSO optimization method and utilizes virtual classes for hierarchical GPCR protein classification.
2. G-Protein Coupled Receptor
G protein-coupled receptors (GPCRs) are found on the surface of cells [50]. They generate signals inside cells that regulate key physiological processes such as hormone signalling, neurotransmission, cognition, vision, taste, and pain perception. A GPCR is also known as a seven-transmembrane-domain (7TM) receptor because its seven transmembrane segments form three extracellular loops and three intracellular loops, with the N-terminus located outside the cell and the C-terminus inside it. The GPCR hierarchy consists of three levels: family, subfamily, and sub-subfamily. The family level contains five classes (families A, B, C, D, and E), the subfamily level contains 38 classes, and the sub-subfamily level contains 87 classes. GPCR classification has proven difficult because of the very complex relationships between classes [51]. In addition, many protein sequences in one family share homology with sequences in other families, which further increases the difficulty of classification [52]. GPCR classification depends on sequence order and involves structural, functional, and evolutionary characteristics, as well as chemical and pharmacological factors [53]. GPCRs are thus among the most challenging datasets to classify.
Various feature representation methods have been used for GPCR proteins in past research. Ref. [3] used the amino acid composition (AAC), pseudo amino acid composition (PseAAC), and dipeptide composition (DC) methods for the classification of GPCR proteins. Ref. [54] used several indices related to the hydrophobicity of amino acids to implicitly describe the properties of protein sequences, showing that combining three hydrophobicity indices with a restricted Boltzmann machine (RBM) achieved high performance for class C GPCR classification, with 94% accuracy. The N-gram representation method was used by [6,55] to represent GPCR proteins.
The local descriptor feature representation method was used by [52]. It groups amino acids into three classes (hydrophobic, neutral, and polar) and yielded 87% to 90% accuracy at the family level of the GPCR protein hierarchy.
Deep learning methods have been used for GPCR protein classification by [51,55,56,57], with feature representations and network designs varying between studies. Ref. [51] combined the n-gram feature representation method, term frequency-inverse document frequency (TF-IDF), and deep learning, obtaining accuracies of 98.5% at the family level, 98.1% at the subfamily level, and 96.4% at the sub-subfamily level. Ref. [57] used deep learning with one-hot-encoded features and achieved accuracies of 97.17%, 86.82%, and 81.17% at the family, subfamily, and sub-subfamily levels, respectively. Ref. [55] used 1-2-3-gram feature representation with deep learning, achieving accuracies of 99.2%, 99.2%, and 98.15% at the three levels, while Ref. [56] used 2-3-4-gram features with deep learning and obtained 97.40%, 87.78%, and 81.13%, respectively. However, the performance of these studies cannot be compared directly because they used different GPCR datasets.
Ref. [58] used the DWT and selected the Coiflet 4 family with three decomposition levels to represent GPCR protein features for families A, B, and C, subfamily A, and the Amine sub-subfamily. That study found that the Coiflet 4 family produced the highest accuracies among the DWT families considered: 99.67% at the family level, 97.64% for subfamily A, and 99.20% for the Amine sub-subfamily. Ref. [59] used the Fourier transform for GPCR protein classification, achieving 96.1% accuracy at the family level.
Several feature representation methods have been combined to represent GPCR proteins. Ref. [7] combined the 400D feature representation method with the parallel-correlation pseudo amino acid composition (PC-PseAAC). Ref. [60] produced 1497 features by combining AAC, PseAAC, dipeptide composition, correlation features, composition, transition, distribution, and sequence arrangement. Ref. [61] combined the PseAAC, AAC, and dipeptide composition methods to obtain high classification performance. Ref. [62] hybridized PseAAC features with the energy of the DWT approximation and detail coefficients to represent protein classes at the three levels of the GPCR protein hierarchy. These studies found that combining several feature representation methods can produce better classification performance than using individual methods.
4. Results
This paper proposes FAPSO as a feature extraction method and an SVM as the hierarchical GPCR protein classification model. The proposed model differs from traditional classification models in that it combines (1) a hybrid of the firefly algorithm and particle swarm optimization to select the suitable wavelet family and decomposition level, and (2) a hierarchical classification strategy (LCPN with VC) that stops incorrect classifications, which occur under non-mandatory leaf-node prediction (NMLNP), at internal hierarchy levels before they reach the leaf nodes. The results are analyzed according to the three hierarchy levels: family (5 classes), subfamily (38 classes), and sub-subfamily (87 classes).
The PseAAC feature vector for one of the class A GPCR proteins is shown in Figure 4a. Using the Coiflet 4 wavelet family and three levels of decomposition, the DWT converts the PseAAC features into approximation coefficients and detail coefficients. The detail coefficients for decomposition levels 1, 2, and 3, which contain the local protein features, are shown in Figure 4b–d, while Figure 4d shows the approximation coefficients carrying the global features, or coarse characteristics, of the protein. The number of global and local features produced depends on the number of decomposition levels. Decomposition can be performed up to level log2(N), where N is the length of the studied data; since the PseAAC feature vector used here has 170 elements, the maximum allowed decomposition is seven levels. Here we see the advantage of the wavelet's multiresolution analysis, which enables protein sequences to be studied at multiple decomposition levels.
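The log2(N) bound described above can be checked directly with a small illustrative computation (in practice the admissible depth also depends on the chosen wavelet's filter length):

```python
import math

def max_dwt_level(n):
    """Upper bound on decomposition depth used in the text: floor(log2(N))."""
    return int(math.floor(math.log2(n)))

n = 170                   # length of the PseAAC feature vector
print(max_dwt_level(n))   # 7

length = n
for level in range(1, max_dwt_level(n) + 1):
    length = math.ceil(length / 2)   # approximation band roughly halves
    print(level, length)             # 85, 43, 22, 11, 6, 3, 2
```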
In this paper, the results are separated into two sections. The first section describes the chosen wavelet family and decomposition level, while the second section describes the algorithm’s performance.
4.1. Selection of Wavelet Family and Decomposition Level
Table 2 shows the wavelet families selected by the FAPSO algorithm for each of the five folds. Each fold contains eleven wavelet families, one for each of the eleven parent nodes in the GPCR protein hierarchy: the root, family A, family B, family C, subfamily A Amine, subfamily A Hormone, subfamily A Nucleotide, subfamily A Peptide, subfamily A Thyro, subfamily Prostanoid, and subfamily C CalcSense. Across the five folds of the dataset, the total number of selected wavelet families is therefore 55. The wavelet family and decomposition level selected by the FAPSO method were examined at these eleven nodes of the hierarchy.
Table 2 clearly shows that the wavelet family and decomposition level chosen at different nodes differ significantly. This supports the use of FAPSO, which determines the wavelet family and decomposition level at each class node automatically.
Figure 5 shows how often each wavelet family was chosen across the GPCR hierarchy. The Biorthogonal family was selected 15 times, the Reverse Biorthogonal family 14 times, the Daubechies family 9 times, the Fejer-Korovkin family 5 times, the Symlet family 4 times, the Dmeyer family 3 times, the Coiflet family 3 times, and the Haar family twice. Most protein classes are effectively represented by the Biorthogonal wavelet family, since most of its scaling and wavelet functions exhibit abrupt changes in shape; this matches the rough shape and many sharp transitions of the PseAAC features shown in Figure 4a. Furthermore, [76] noted that the Biorthogonal wavelet can remove redundant information from protein sequences, minimise feature-information leakage and aliasing, let feature vectors represent the original sequence information, and improve prediction performance. FAPSO also selected the Daubechies and Fejer-Korovkin wavelets at least five times each.
Figure 6 depicts how frequently each decomposition level was used to represent protein classes. Decomposition level 1 was the most frequently chosen, used by 15 of the 55 selected wavelet families. Level 1 yields 85 approximation and 85 detail coefficients, each accounting for half of the 170 PseAAC features, and thus preserves both global and local information. Level 3 was used by 14 wavelet families, level 2 by 11, level 4 by four, level 5 by six, and level 6 by four, while level 7 was chosen only once. A higher decomposition level increases the number of detail coefficients while decreasing the number of approximation coefficients, meaning that a class represented at a high level requires more local information than global information.
4.2. GPCR Classification Performance without Virtual Class
Table 3 shows the classification performance of the SVM at the family level. The results show that FAPSO is the best feature extraction strategy at this level, achieving 97.9% accuracy, precision, and recall. This illustrates that both PseAAC and FAPSO can classify GPCRs at the family level with high accuracy.
Table 4 shows the classification performance of the SVM at the subfamily level. At this level, PseAAC is the best overall feature extraction strategy, achieving 85.3% accuracy, 87.7% precision, 87.7% recall, and an F-score of 0.887. This indicates that PseAAC characterizes GPCRs well at the subfamily level. Nevertheless, the FAPSO algorithm achieved a higher precision of 88.9%, meaning that the DWT features selected by FAPSO produced a lower false positive rate than PseAAC.
Table 5 displays the classification performance of the SVM at the sub-subfamily level. PseAAC has the highest accuracy at 76.9%, while FAPSO has the highest precision, recall, and F-score at 87.5%, 81.2%, and 84.4%, respectively.
Tables 4 and 5 demonstrate that the precision of the FAPSO classification results at the subfamily and sub-subfamily levels is greater than the accuracy. This is a good sign, because high accuracy can also be attained simply by correctly identifying the dominant negative class. The GPCR protein dataset is known to be imbalanced across classes [63,77], so obtaining a high precision value is more important [78]; indeed, accuracy is not a suitable statistic for imbalanced datasets [78]. The recall value was also higher than the accuracy value, as seen in Table 5. This suggests that FAPSO worked effectively despite the imbalance in the GPCR protein dataset.
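The point about accuracy on imbalanced data can be made concrete with a toy example (illustrative numbers only, unrelated to the GPCR results): a degenerate classifier that always predicts the majority class scores high accuracy yet zero precision on the minority class.

```python
# 95 negatives, 5 positives; the classifier always predicts "neg".
y_true = ["neg"] * 95 + ["pos"] * 5
y_pred = ["neg"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision for the minority ("pos") class: TP / predicted positives.
tp = sum(t == p == "pos" for t, p in zip(y_true, y_pred))
predicted_pos = y_pred.count("pos")
precision = tp / predicted_pos if predicted_pos else 0.0

print(accuracy, precision)  # 0.95 0.0
```

High accuracy here reflects only the dominant class, which is why precision (and recall) are the more informative measures for imbalanced hierarchies like the GPCR dataset.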
From these observations, the performance of PseAAC and FAPSO decreased as the hierarchy level became deeper, owing to the large number of classes at the subfamily and sub-subfamily levels (38 and 87, respectively).
4.3. GPCR Classification Performance with Virtual Class
Table 6 shows the classification performance of the SVM with virtual class (VC) implementation at the subfamily level. The results differ considerably between PseAAC+VC and FAPSO+VC, with FAPSO+VC performing better. Comparing PseAAC with PseAAC+VC, the latter has higher accuracy, precision, recall, and F-score values, at 85.4%, 88.3%, 88.3%, and 0.833, respectively. Likewise, FAPSO+VC outperforms FAPSO, achieving 86.9% accuracy, 94.1% precision, 90.1% recall, and an F-score of 0.921. At the subfamily level, the number of false positives decreased for the FAPSO algorithm compared to the PseAAC algorithm after VC implementation, because these data were classified at the VCs of classes A, B, C, D, or E, thereby avoiding error propagation. There was also an increase in the number of true positives for classes A and C, whose data were classified more accurately after VC implementation.
As for the performance at the sub-subfamily level shown in Table 7, PseAAC+VC again has better accuracy, precision, recall, and F-score values than PseAAC, at 78.8%, 89.6%, 82.6%, and 0.86, respectively. FAPSO+VC also performs better than FAPSO, achieving 81.3% accuracy, 92.1% precision, 85.1% recall, and an F-score of 0.844. At the sub-subfamily level, false positives also decreased, especially for the sub-subfamilies under subfamilies A Amine, A Hormone, A Peptide, and C CalcSense, while the number of true positives increased, especially for subfamilies A Amine, A Hormone, and C CalcSense.
It can be seen that virtual class consistently increased performance in all experiments, since it increases the likelihood of avoiding error propagation in the hierarchy by stopping the classification at an appropriate level.
Table 8 compares the GPCR protein classification accuracy obtained in this study with earlier research that used the same GDS dataset. The accuracy at the family level in this study was 97.9%, remarkably similar to the accuracies reported in [57,61,63,64,66]. At the second level of the hierarchy, this study achieved an accuracy of 86.9%, although [63] obtained a higher value of 89.2%. At the third level, this study achieved 81.3%, while [63] again produced the highest accuracy for this level at 90.9%.
Although the accuracy of this study is lower than that of [63] at the subfamily and sub-subfamily levels, that research focused on optimizing the GPCR protein classifiers, whereas this study attempted to optimize the representation of GPCR proteins at the feature level. Compared with other studies, the results here are comparable to those of [56,57], which used currently popular deep learning methods.
5. Discussion
In this study, the FAPSO method identified the wavelet family and decomposition level appropriate for representing each GPCR protein class. There were 82 wavelet families to select from, including the Daubechies, Coiflet, Meyer, Biorthogonal, and Reverse Biorthogonal families. Each family has characteristics such as orthogonality, the number of vanishing moments, and the type of symmetry of its scaling and wavelet functions, which can be matched to the characteristics of the studied protein sequences. The convolution of the PseAAC features with the wavelet filters produces low-frequency (approximation) and high-frequency (detail) coefficients: the approximation coefficients contain coarse, global information about the protein sequence, while the detail coefficients contain important local information. In this study, the Biorthogonal wavelet was the family most often selected by FAPSO to represent the GPCR protein classes; its symmetry and compact support appear well suited to these proteins.
Multiresolution analysis allows protein sequences to be studied at multiple decomposition levels, up to level log2(N), where N is the length of the studied data; since the PseAAC feature vector used here has 170 elements, the maximum allowed decomposition is seven levels. Different levels produce different numbers of approximation (global) and detail (local) coefficients. FAPSO identified that the root, family A, subfamily A Amine, and subfamily A Peptide nodes required decomposition level 1 to represent their proteins, while family C and subfamily C CalcSense required levels 6 and 3, respectively, because class C proteins have long, variable sequences in addition to a complex structure [54]. The selection of the decomposition level therefore also plays a vital role in determining classification performance. This demonstrates that each GPCR protein class requires a different wavelet family and decomposition level for feature representation.
Nevertheless, it can be concluded that the FAPSO optimization algorithm alone cannot improve the hierarchical GPCR protein classification performance over the PseAAC baseline. In this study, the FAPSO algorithm was run for 100 iterations at each parent node of the GPCR protein hierarchy; classification performance may improve if the number of iterations is increased or if the FAPSO parameters are tuned to more appropriate values.
The virtual class has been proposed to overcome the propagation error problem in top-down hierarchical classification. As a result, it can improve the classification results on the nodes in the hierarchy, especially when combined with the FAPSO feature representation algorithm.
6. Conclusions and Future Works
In this paper, we studied the selection of an optimized wavelet family and decomposition level by the FAPSO algorithm as the feature representation method, together with the virtual class method as the hierarchical classification method for GPCR proteins. Based on the results, the wavelet family and decomposition level most often chosen by FAPSO to represent the GPCR classes were the Biorthogonal wavelets and decomposition level 1. Choosing an adequate feature extraction strategy for protein classification is an important problem because it influences the performance measurements of the machine learning algorithm used. The virtual class consistently increased performance in all experiments, since it increases the likelihood of avoiding error propagation in the hierarchy by stopping the classification at an appropriate level.
This area of research still has much room for growth and advancement. Data mining frequently encounters class imbalance, and the GPCR dataset is similarly affected, with about 60% of known GPCR protein sequences belonging to family A. To address this, sampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) should be considered. SMOTE is an oversampling method that creates artificial samples of the minority classes; after obtaining a synthetically balanced training set for each class, the classifier is trained.
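The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified, hypothetical illustration of the core mechanism (random interpolation between a minority point and one of its nearest minority neighbours), not the full published algorithm or a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=3):
    """SMOTE-style oversampling sketch: each synthetic sample lies on the
    line segment between a minority point and one of its k nearest
    minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation fraction
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

minority = rng.normal(loc=5.0, size=(10, 4))  # stand-in for a rare class
new_samples = smote_like(minority, n_new=20)
print(new_samples.shape)  # (20, 4)
```

The synthetic points stay within the region spanned by the minority samples, so the balanced training set does not drift away from the original class distribution.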
Future work should further investigate the best wavelet family and decomposition level for representing the GPCR protein classes. The experiments in this study used FAPSO parameter values taken from prior work, and the suitability of these parameters was not examined here. A thorough investigation is therefore needed to establish the correct parameter values for the FAPSO algorithm when it is used to choose the wavelet family and decomposition level.
There are various implementation strategies for hierarchical classification. These include global hierarchical classification, which does not suffer from the error propagation issue encountered in the present study. In global classification, a single, more sophisticated model is built from the training set that takes the entire class structure into account, and each test instance is categorised by this model during the testing phase. This global hierarchical classification approach is expected to enhance the performance of hierarchical GPCR protein classification.