AoP-LSE: Antioxidant Proteins Classification Using Deep Latent Space Encoding of Sequence Features

It is of utmost importance to develop a computational method for the accurate prediction of antioxidants, as they play a vital role in the prevention of several diseases caused by oxidative stress. In this correspondence, we present an effective computational methodology based on the notion of deep latent space encoding. A deep neural network classifier fused with an auto-encoder learns class labels in a pruned latent space. This strategy eliminates the need to develop the classifier and the feature selection model separately, allowing the standalone model to effectively harness the discriminating feature space and make improved predictions. A thorough analytical study is presented along with PCA/tSNE visualization and PCA-GCNR scores to show the discriminating power of the proposed method. The proposed method showed a high MCC value of 0.43 and a balanced accuracy of 76.2%, which is superior to the existing models. The model was also evaluated on an independent dataset, on which it outperformed the contemporary methods by correctly identifying the novel proteins with an accuracy of 95%.


Introduction
Antioxidants, which are naturally produced in the human body, prevent the oxidative damage caused by free radicals [1]. Small antioxidant molecules scavenge free radicals by neutralizing them, whereas large antioxidant molecules prevent the potential damage by absorbing the free radicals [2]. However, on some occasions the naturally produced antioxidants are outnumbered by the free radicals, and their procurement from external sources, including vegetables and fruits, becomes indispensable [3,4]. The macro-molecules (proteins) that offer these nutrients must be effectively identified, since this would contribute towards the development of novel methods to prevent the diseases caused by free radicals. Conventionally, antioxidant proteins are identified experimentally using biochemical techniques, which are arduous and costly. The availability of protein sequence databases has provided an opportunity for computer-based antioxidant protein inference [5]. In an early study [6], a naive Bayes method was developed for the prediction of antioxidant proteins. Amino acid compositions and di-peptide compositions of the protein sequences were utilized as features, which were then subjected to a feature selection algorithm. The resultant features with reduced dimension were used as the training dataset. Later, AODPred was presented by the same authors [7], in which g-peptide composition was adopted to extract features from the protein sequences, and analysis of variance (ANOVA) [8] was employed to obtain the optimal feature set to train a support vector machine (SVM).
Recently, deep learning (DL) based classification methods have been proposed to perform prediction in almost every field. In particular, many bioinformatics applications have taken advantage of the outstanding capabilities of deep neural networks (DNN) [9].
Deep learning algorithms have also been proposed for the prediction of protein structure and function, and they have yielded highly accurate results [10,11], confirming the potential scope of DL methods in this field. Motivated by the recent success of deep latent-space based learning methods [12][13][14], we propose herein a deep learning based classification model which is essentially an auto-encoder with an embedded classifier.
The major advantage of the proposed approach is its ability to learn a non-linear embedding of the features whilst maintaining the properties of the original features through reconstruction. This consistency helps filter out fluctuating data points by evaluating their distance from the class centers. As a result, the model can be an important tool for designing a classifier with noisy annotations. Another advantage of such a configuration is that it eliminates the crucial step of feature engineering, which involves the selection of dominant features for training the classifier. The built model has been evaluated using widely known statistical parameters, and its performance is compared with the existing methods. The results show that the proposed method offers superior performance on the benchmark dataset.
The rest of the paper is organized as follows: In Section 2, the details of the benchmark dataset, features and the classification model are discussed. Section 3 presents the results of the performance evaluation of the proposed method and its comparison with contemporary approaches. The conclusion of the paper is presented in Section 5.

Evaluation Parameters
To evaluate the performance of the proposed method, we used metrics suited to imbalanced class samples, including sensitivity (Sn) or recall, specificity (Sp), precision, conventional accuracy (ACC), Matthews correlation coefficient (MCC), balanced accuracy (BACC), Youden's index (YI), F1 score, and Cohen's kappa (κ). These parameters can be evaluated using the following equations:

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \qquad (6)$$

Here TP, FP, TN, and FN represent true positive (correctly classified AoPs), false positive (incorrect classification of non-AoP as AoP), true negative (correctly classified non-AoPs), and false negative (incorrect classification of AoP as non-AoP), respectively.
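As an illustration of how the listed metrics relate to TP, FP, TN, and FN, the following minimal sketch computes them from a confusion matrix. It uses the standard textbook definitions, which are assumed (not confirmed) to match the ones used in this study; the function name `imbalance_metrics` and the example counts are hypothetical.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics using standard textbook definitions.

    Assumption: these definitions match the ones used in the paper.
    No guard against empty classes (zero denominators) is included.
    """
    n = tp + fp + tn + fn
    sn = tp / (tp + fn)                       # sensitivity (recall)
    sp = tn / (tn + fp)                       # specificity
    precision = tp / (tp + fp)
    acc = (tp + tn) / n                       # conventional accuracy
    bacc = (sn + sp) / 2                      # balanced accuracy, Equation (6)
    yi = sn + sp - 1                          # Youden's index
    f1 = 2 * precision * sn / (precision + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (acc - p_chance) / (1 - p_chance)
    return {
        "Sn": sn, "Sp": sp, "Precision": precision, "ACC": acc,
        "BACC": bacc, "YI": yi, "F1": f1, "MCC": mcc, "Kappa": kappa,
    }

# Hypothetical counts, for illustration only.
metrics = imbalance_metrics(tp=8, fp=10, tn=80, fn=2)
```

Note that Youden's index and balanced accuracy carry the same information under these definitions, since YI = 2 × BACC − 1.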

Dataset
The benchmark dataset was obtained from Feng et al. [6]. The dataset was formed by selecting sequences from UniProtKB that were annotated and reviewed as antioxidant in the molecular function category of the gene ontology. The sequences were manually inspected to remove any containing nonstandard letters other than the 20 standard amino acid letters. Furthermore, the sequence similarity was reduced to ≤60% using CD-HIT. The final dataset contained 253 antioxidant protein sequences and 1552 non-antioxidant proteins. To compare the performance of the proposed method fairly, the sizes of the training and test datasets were set equal to those used by the contemporary approaches, i.e., 200 antioxidant and 1240 non-antioxidant proteins were used as the training subset, while the remaining 53 antioxidant and 312 non-antioxidant proteins were used as the test dataset.

Features and Latent Space Encoding
The protein features are made compatible with machine learning algorithms by encoding them numerically. Several encoding schemes have been utilized by researchers in accordance with the adopted machine learning method [13,15,16,17]. In this study, we use a well-known feature encoding method called the composition of k-spaced amino acid pairs (CKSAAP). It is a comprehensive encoding scheme and has shown significant performance in a variety of protein prediction tasks [18][19][20]. In CKSAAP, k denotes the gap or space between the two residues of an amino acid pair. For k = 2, an intermediate feature vector is obtained for each gap j = 0, ..., k, and these vectors are stacked together to form the final feature vector. An illustration of the feature vector for k = 2 obtained from the CKSAAP method is shown in Figure 1.
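The stacking of per-gap pair compositions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cksaap` is hypothetical, and normalising each pair count by the number of windows for that gap is an assumption, since the text does not state the normalisation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

def cksaap(sequence, k):
    """Return a CKSAAP vector of length 400 * (k + 1).

    For each gap j = 0..k, count residue pairs separated by j positions,
    then stack the (k + 1) blocks of 400 pair frequencies.
    """
    features = []
    for gap in range(k + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_windows = len(sequence) - gap - 1
        for i in range(max(n_windows, 0)):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:               # skip nonstandard letters
                counts[pair] += 1
        # Assumed normalisation: frequency over the windows for this gap.
        denom = max(n_windows, 1)
        features.extend(counts[p] / denom for p in PAIRS)
    return features

vector = cksaap("ACAC", k=2)                  # toy sequence, 1200 features
```

With k = 2 the vector has 400 × 3 = 1200 entries, which is consistent with the input size of 400 × (k + 1) used for the encoder.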
As observed in Figure 1, the dimension of the resultant CKSAAP feature vector can be very high. This high-resolution feature vector has multilevel granularity, which intensifies the relationship between amino acid fragments. However, the inadequate number of positive samples leads to the "large p, small n" problem, and therefore feature engineering such as component analysis [21] and information gain [15] must be employed. Feature engineering tends to work well when the features and class labels have a linear relationship; in deep learning applications, however, this relationship is mostly non-linear, which complicates the selection of efficacious features. Therefore, in this study we employ an auto-encoder, which is commonly used to obtain a compressed representation of the input data [22]. The compressed input exists in the bottleneck of the auto-encoder, called the latent space. The latent space contains sufficient information for the decoder unit to reconstruct an approximation of the original input. In the proposed method, the latent space also serves as the input to the classifier module, which is a fully connected neural network, the details of which are presented in Section 2.4.

Neural Network Architecture
The proposed Deep Latent Space Encoding (DeepLSE) model comprises three modules: (1) encoder, (2) decoder and (3) classifier. The input layer of the encoder of the baseline model consists of 400 × (k + 1) neurons; the first and second hidden layers have N × 10 and N × 5 neurons, respectively, while the final hidden layer has N × 2 neurons. To improve generalization and to avoid over-fitting, a batch-normalization layer and a dropout of 30% are employed between each pair of dense layers. The decoder is the complement of the encoder and has a mirror-symmetric configuration. For classification, a multi-layer perceptron is implemented with two fully connected hidden layers. Each hidden layer has 10 neurons, while the output layer has 2 neurons representing the antioxidant and non-antioxidant protein class labels. For N = 5, the architecture of the proposed model is shown in Figure 2.
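To make the layer arithmetic concrete, the sketch below lists the layer widths implied by the description above for given k, N, and number of latent variables LV. The helper `deeplse_layer_sizes` is hypothetical, and placing the LV-unit latent layer directly after the N × 2 hidden layer is an assumption; batch normalization and dropout are omitted because they do not change the widths.

```python
def deeplse_layer_sizes(k, n, lv):
    """Layer widths of the three DeepLSE modules (illustrative only).

    Assumption: the latent layer of `lv` units follows the N x 2 layer.
    """
    input_dim = 400 * (k + 1)                 # CKSAAP feature length
    encoder = [input_dim, n * 10, n * 5, n * 2, lv]
    decoder = encoder[::-1]                   # mirror-symmetric to encoder
    classifier = [lv, 10, 10, 2]              # two hidden layers, 2 outputs
    return {"encoder": encoder, "decoder": decoder, "classifier": classifier}

sizes = deeplse_layer_sizes(k=6, n=5, lv=6)
# encoder widths: [2800, 50, 25, 10, 6]
```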

Training Configurations
The model is trained using Python on the Tensorflow-Keras [23] platform for different configurations of latent variables (LV) and gap values (k). The output layers of the encoder/decoder and the classification network use sigmoid and soft-max activation functions, respectively, while the hidden layers of all modules use the rectified linear unit (ReLU) activation function to avoid vanishing gradients. Two loss functions, mean squared error for the auto-encoder and binary cross-entropy for the classifier, are minimized using the RMSprop optimizer with its default learning rate for 1000 epochs, with an early stopping tolerance of 100 epochs. The two losses are merged through a convex combination controlled by a mixing weight λ, which was set to 0.99.
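A convex combination of a reconstruction loss and a classification loss can be sketched in plain Python as below. Which term λ multiplies is not stated here; this sketch assumes that λ weights the reconstruction (mean squared error) term and 1 − λ the classification term, which is consistent with the later remark that the classification weight is low. All function names are hypothetical.

```python
import math

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

def combined_loss(x, x_hat, y_true, y_prob, lam=0.99):
    """Convex combination: lam * reconstruction + (1 - lam) * classification.

    Assumption: lam weights the reconstruction term.
    """
    return lam * mse(x, x_hat) + (1 - lam) * binary_cross_entropy(y_true, y_prob)

loss = combined_loss([0.2, 0.8], [0.25, 0.7], [1, 0], [0.9, 0.2])
```

In Keras, a comparable weighting is usually expressed through the `loss_weights` argument of `Model.compile` for a two-output model.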

Finding Best Latent-Space Encoding (LSE) Scheme
The workflow of the proposed study aims to obtain the best classification model based on two variable parameters: the gap between the amino acid pairs, which is set through the value of k during encoding, and the number of latent variables (LV) in the latent space. An ablation study was designed to acquire the latent-space encoding (LSE) for varying values of these two parameters according to the workflow depicted in Figure 3a. The proteins are distributed into train and test datasets as discussed in Section 2.2, and the resultant datasets are processed for CKSAAP encoding. For the model configuration (N = 5) shown in Figure 2, nine different subsets are constructed by setting k = 1 to 9. For each value of k, the model is trained with the number of latent variables (LV) ranging from 2 to 10. For each configuration, 20 independent trials are performed and the obtained test results are averaged. The same procedure is followed for the next configuration by incrementing the gap value (k) between the amino acid pairs. The model with the best average results is finally selected as the base model to perform prediction and is named AoP-LSE. The summary of the ablation study is provided in visual and tabular form. The results for BACC, MCC and PSNR are provided as surface plots in Figure 4. These metrics are suggested to be effective evaluation parameters for imbalanced classification problems. From the results it can be seen that, over a large range of parameters, the model achieves comparable accuracy, and the variation in both BACC and MCC is under 10%. However, the PSNR performance is highly sensitive to the parameter choice, as the variation in PSNR is around 20 dB. This clearly indicates a discrepancy between the most descriptive and the most discriminating features.
In other words, embedded features that are useful for the reconstruction of the original feature space are not necessarily the most discriminating features for designing the classifier, while the most discriminating embedded features are not necessarily the most descriptive features (those indicative of oxidation properties). Therefore, it is important to find a balanced feature embedding that helps classification while preserving the features indicative of oxidation properties. The same issue is further discussed in Section 4.1, where the learned latent spaces are compared. Table 1 presents the results for the balanced accuracy, which is the average of the recall of each class and is suggested to be an effective evaluation parameter for imbalanced classification problems. The consistent values of the mean and standard deviation over 20 random trials manifest the stability of the model. The best average balanced accuracy (i.e., 76.2%) was obtained when both k and LV were equal to 6. Therefore, the model with k = 6 and LV = 6 was selected for further testing and evaluation.
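The search over k and LV described above amounts to a grid search with repeated trials. The following sketch shows only the control flow; `train_and_score` is a hypothetical callable standing in for training one DeepLSE model and returning its test balanced accuracy, and the function name `select_best_configuration` is likewise an assumption.

```python
from statistics import mean

def select_best_configuration(train_and_score,
                              k_values=range(1, 10),
                              lv_values=range(2, 11),
                              n_trials=20):
    """Average the score over independent trials for every (k, LV) pair
    and return the pair with the highest mean score."""
    results = {}
    for k in k_values:
        for lv in lv_values:
            scores = [train_and_score(k, lv, trial) for trial in range(n_trials)]
            results[(k, lv)] = mean(scores)
    best = max(results, key=results.get)
    return best, results

# Stand-in scorer for demonstration only; a real run would train a model.
def dummy_scorer(k, lv, trial):
    return 0.76 - 0.01 * (abs(k - 6) + abs(lv - 6))

best, table = select_best_configuration(dummy_scorer)
```

With the default ranges the grid contains 9 × 9 = 81 configurations, so 1620 training runs are needed at 20 trials each.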

Finding Best-Configuration for DeepLSE Architecture
For the aforementioned combination of k = 6 and LV = 6, three configurations of model parameters were evaluated. The number of neurons in each configuration was incremented as shown in Table 2, and its effect on the classification performance was evaluated using metrics including Youden's index, MCC, the area under the receiver operating characteristic curve (ROC AUC), the area under the precision-recall curve (PR AUC) and the mean reconstruction error. Ten independent trials were performed for each configuration, and the one producing the best results in terms of the aforementioned performance metrics was selected as the baseline model in this study. Table 3 shows the results of the ablation study.
The analysis above shows that a shallow model (N = 1) failed to learn the discriminating features, whereas a highly complex model (N = 10) also showed mediocre performance. Interestingly, the moderately sized model (N = 5) worked best. Although this ablation study is not exhaustive, it gives a sufficient indication that the selected parameters are a suitable choice among the configurations tested.

Comparison with the Contemporary Methods
Once the best model has been identified from the ablation study, it is selected for further testing and evaluation according to the workflow depicted in Figure 3b. For comparative analysis, the model is compared with Naive Bayes [6] and AODPred(SVM) [7]. Table 4 compares the evaluation metrics of the proposed AoP-LSE on the test dataset, including accuracy (ACC), sensitivity (Sn), specificity (Sp), and Youden's index (YI), with those of Naive Bayes [6] and AODPred(SVM) [7]. Although the sensitivity of the proposed method is relatively lower, the proposed AoP-LSE method achieved a Youden's index that is 0.14 and 0.03 units higher than those of the Naive Bayes and AODPred(SVM) methods, respectively. Since, for a highly skewed test dataset, a higher value of a balanced metric such as Youden's index is more desirable than a higher value of an individual class metric, the proposed AoP-LSE can be considered a better classification approach.

Verification on Independent-Dataset of Antioxidant Proteins
For an objective evaluation of the performance of the proposed method, we utilized the reviewed antioxidant proteins from UniProtKB/Swiss-Prot [24]. UniProtKB is a high-quality, manually annotated and non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions. These sequences were totally independent and were not present in the positive or negative datasets of the proposed study. A comparison was performed among AODPred [7], Vote9 [25] and the proposed AoP-LSE method. The results presented in Table 5 show that the proposed method successfully predicts 21 out of 22 independent antioxidant proteins, achieving an accuracy of 95.4%. This superior performance of AoP-LSE indicates that it can be utilized as a useful tool for the annotation of unknown antioxidants.

Comparison of Feature and Latent-Space Discrimination Capability
The latent space learned by the proposed method is visualized in Figure 5. For a fair comparison, two feature-encoding/dimension-reduction methods are utilized: PCA [26] for linear embedding and tSNE [27] for non-linear embedding. Both methods were provided with the original feature space and the deep encoded latent-space data. From Figure 5 it can be clearly seen that the proposed deep latent space encoding method learns to separate the two classes, which are not distinct in the original feature space, as can be seen in Figure 5A. The tSNE results in Figure 5B show interesting patterns, indicating the possibility of noisy labels, which are not as prominent in the original feature space. Another important characteristic of the proposed method is that, even with the linear encoding method of PCA, the Deep-LSE encoded features look linearly separable. This is an added advantage, as in many bioinformatics problems simplicity and explainability are more important than a gain in classification scores.
Both the classification and reconstruction modules are trained simultaneously, with the training emphasis placed on minimizing the reconstruction loss, so the model has more freedom to represent individual samples with their variability. Because the classification weight is set to a low value, learning is driven mainly by the decoder's loss. This creates an unequal tug of war between the two objectives and results in the high-dimensional data being clustered in a compact, distinguishable latent space.

Comparison of Proposed DeepLSE and Conventional Auto-Encoder-Based Encoding Schemes
Although the architecture of the proposed model is similar to an auto-encoder (AE) + multi-layer perceptron (MLP) neural network, the way it is configured and trained, i.e., the combined training of AE + MLP, makes it different from the conventional approach, in which the AE is trained separately for dimension reduction to obtain the most representative (low-ranked) features in the latent space (LS) for signal reconstruction. The proposed DeepLSE, on the other hand, not only learns the most representative features but also imposes a classification constraint on the latent variables, thus creating a tug of war between the decoder and the classifier that forces the encoder to generate useful features. To emphasize this point, and for a fair comparison with existing techniques, an evaluation of three different models is presented.
We trained the auto-encoder network with the same configuration (N = 5, k = 6, and LV = 6) as in DeepLSE and extracted the latent space encoding for the entire dataset. The encoder latent variables were then used for the design and evaluation of the auto-encoder + multi-layer perceptron (AE + MLP), auto-encoder + support vector machine (AE + SVM) and auto-encoder + naive Bayes (AE + NB) methods. The models were evaluated on several evaluation parameters, and the results are reported in Table 6. For the AE + MLP model, the same MLP configuration was used as in DeepLSE. For AE + SVM, a linear SVM [28,29] with a Euclidean-distance-based radial-basis-function [30] kernel, balanced class weights and an l2 penalty of C = 1000 was used. For the AE + NB method, Gaussian naive Bayes [31] with priors of (0.5, 0.5) was used. Both the SVM and NB methods were implemented using the scikit-learn package [32] in Python 3.
As anticipated, the conventional AE fails to generate a useful embedding; hence, the performance of all the aforementioned models is poor. To further highlight this point, we provide the visualization of the AE and DeepLSE latent variables in Figure 6, and compare the discrimination potential of the original feature-space of DeepLSE's and AE's latent-variables using one-dimensional PCA with the absolute-GSSMD [33] method. The GSSMD is derived from the generalized contrast-to-noise ratio (GCNR) [34] metric, in which the overlap between two distributions is compared. For an ideal classification, a GCNR score of 1 is obtained, which indicates that the two distributions are distinct with no overlap.
In the worst case, the two distributions fully overlap, resulting in a GCNR score of 0. It can be observed in Figure 6a that the conventional AE, with only the dimension reduction constraint, obtained training and testing MSE values of 46.21 dB and 46.37 dB, respectively, with a 1D-PCA-GCNR score of 0.22, and does not distinguish the two classes. With a comparable MSE, the proposed method, depicted in Figure 6b, presents superior learning capabilities (1D-PCA-GCNR score = 0.91) and maps the AoP and non-AoP samples to separate regions.
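The overlap-based score discussed above can be illustrated with a histogram estimate of GCNR, computed as one minus the overlap of the two class distributions along a single axis. This is a minimal sketch and not the absolute-GSSMD procedure of [33]: the function name `gcnr`, the shared-bin histogram estimator, and the default of 50 bins are assumptions, and the one-dimensional PCA projection is assumed to have been computed beforehand.

```python
def gcnr(class_a, class_b, n_bins=50):
    """Histogram estimate of GCNR = 1 - overlap of two 1D distributions.

    Returns 1.0 for non-overlapping histograms and 0.0 for identical ones.
    """
    lo = min(min(class_a), min(class_b))
    hi = max(max(class_a), max(class_b))
    width = (hi - lo) / n_bins or 1.0        # avoid zero width if all equal

    def normalised_histogram(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)  # clamp top edge
            counts[idx] += 1
        return [c / len(values) for c in counts]

    hist_a = normalised_histogram(class_a)
    hist_b = normalised_histogram(class_b)
    overlap = sum(min(p, q) for p, q in zip(hist_a, hist_b))
    return 1.0 - overlap

# Toy 1D projections (hypothetical values, not data from the study).
separated = gcnr([0.0, 0.1, 0.2], [10.0, 10.1, 10.2])
identical = gcnr([0.0, 0.1, 0.2], [0.0, 0.1, 0.2])
```

The estimate depends on the bin count, so scores from different binnings should not be compared directly.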

Analysis of Decoder and Residual Error
Herein, we analyze the discriminating power of the input, decoded and residual signals. This provides another interesting characteristic of the proposed DeepLSE. Figure 7 shows the 2D visualization of the original feature space X, the decoded output of the proposed DeepLSE, X̂ = Dec(Enc(X)), and the residual between the original input and the decoded output, X − X̂. From Figure 7a it can be seen that the original feature space X, with a 1D-PCA-GCNR score of 0.14 units, is highly overlapping/non-linear and has almost no discriminating capability in linear space. In contrast, the proposed DeepLSE method acts as an optimal transport (OT) [35,36] and shifts the data distributions in such a way that the reconstruction error of the original signal remains comparable to that of the conventional AE while the classification power is greatly improved. Its feature space, shown in Figure 7b, results in a 1D-PCA-GCNR score of 0.77 units, which is 0.63 units higher than the original discriminating power. The 2D projection of the residual error signal X − X̂ in Figure 7c looks identical to the original feature space X; however, the class separability in the residual error drops by 0.02 units. This is understandable, as the ideal residual error should not contain any useful information that may improve the classification.

Conclusions
In this study, we proposed a deep latent space encoding method for the classification of antioxidant proteins using sequence-derived features. In particular, the composition of k-spaced amino acid pairs (CKSAAP) and a densely connected multi-layer perceptron neural network, which are trained in tandem, are used. The proposed method can be used to extract a non-redundant, compact feature space, which is shown to outperform the conventional antioxidant protein classification approaches. Furthermore, the effect of varying the number of gaps in CKSAAP and the number of latent variables in Deep-LSE was analyzed, which suggests that a sufficiently separable encoding can be learned by keeping a moderate number of neurons in the architecture with k = 6 and LV = 6. The proposed approach effectively learns a non-linear embedding of the features from the original feature space and thus filters out non-relevant data on the basis of their distance from the class centers. This property supports the development of a classifier for antioxidant proteins with noisy annotations. For assessment, the proposed method was evaluated on an independent dataset, on which it showed superior classification performance compared to the contemporary methods. We hope that AoP-LSE will serve as an effective method for the identification of unknown antioxidants.