Classical and Deep Learning Paradigms for Detection and Validation of Key Genes of Risky Outcomes of HCV

Hepatitis C virus (HCV) is one of the most dangerous viruses worldwide. It is the foremost cause of hepatic cirrhosis and hepatocellular carcinoma (HCC). Detecting new key genes that play a role in the growth of HCC in HCV patients using machine learning techniques paves the way for producing accurate antivirals. This work has two phases: detecting the up/downregulated genes using classical univariate and multivariate feature selection methods, and validating the retrieved list of genes using in silico classifiers. However, classification algorithms in the medical domain frequently suffer from a deficiency of training cases. Therefore, a deep neural network approach is proposed here to validate the significance of the retrieved genes in distinguishing the HCV-infected samples from the disinfected ones. The validation model is based on the artificial generation of new examples from the retrieved genes' expressions using sparse autoencoders. Subsequently, the generated gene expression data are used to train conventional classifiers. In the first phase, Principal Component Analysis (PCA), a multivariate approach, yielded a better retrieval of significant genes: the list of genes retrieved using PCA contained a higher number of HCC biomarkers than the lists retrieved by the univariate methods. In the second phase, the classification accuracy can reveal the relevance of the extracted key genes in classifying the HCV-infected and disinfected samples.


Introduction
Hepatitis C virus (HCV) infection is one of the most dangerous infectious diseases worldwide. The replication of hepatitis C in an infected patient eventually causes cirrhosis of the liver or hepatocellular carcinoma (HCC) [1,2], which is ranked 12th among the principal causes of death [3]. Current antivirals for HCV do not target every virus protein required during its life cycle, due to the lack of knowledge about the key genes responsible for its replication phase [4].
The microarray is an effective technology for studying the molecular biology of tissues and for measuring gene expression across the entire genome. High-density oligonucleotide array technology, the Affymetrix GeneChip, is widely used in many areas of biomedical research for estimating gene expression values [5]. Affymetrix microarrays quantify the expression of thousands of genes in a single test, which has paved the way to understanding and analyzing gene behavior under different conditions [6]. However, the prediction of new significant genes from the huge volume of data produced by large-scale Affymetrix microarrays may require the use of statistical and machine learning techniques [7].
Simple statistical approaches for predicting informative genes from microarrays such as T-test and F-test can indicate the variance in gene expression in different data sets [8]. However, univariate and multivariate machine learning techniques are more advanced methods [9]. Univariate gene selection approaches can measure the significance of each gene individually. Multivariate approaches are optimized to handle multiple variables (or features) simultaneously [10]. Another use of the multivariate approach is to expose the inherent structure of variables through the application of various statistical methods.
One of the most commonly used approaches for validating the significance of detected key genes is the use of in silico classifiers. However, classification algorithms in the medical domain frequently suffer from a deficiency of training cases [11]. Classifiers are likely to yield worse prognostic performance when trained on such a small number of samples [12]. Typically, high dimensionality is one of the significant challenges facing the interpretation and analysis of gene expression data measured using microarray technology [13]. In microarray technology, thousands of gene expressions are measured for only a few condition samples. The inadequate number of condition samples yields poor generalization and imprecise classification models [14]. Data augmentation, the synthetic generation of additional training samples, can help in resolving the imbalance in data [15,16]. The deep autoencoder is one of the most commonly employed paradigms in the field of data augmentation [17]. It is a feed-forward deep neural network that generates output data X' similar to the input data X using a set of low-dimensional hidden layers [18].
In this work, the variations in gene expression during the different stages of the HCV replication cycle are analyzed, which may help in discovering new key genes that could be targeted by a more effective HCV antiviral or vaccine. This study uses univariate and multivariate gene selection methods to extract key genes that play a role in worsening the progression of the hepatitis C cycle. The effectiveness of each approach, univariate and multivariate, is investigated via a biological interpretation of the retrieved features and a proposed deep autoencoder validation model. The model generates an adequately sized dataset using a feed-forward autoencoder. The synthetic dataset is generated from the expression values of the extracted key genes and is then used to train a set of classical classifiers.

Literature Review
The non-invasive detection of new significant genes for the replication of HCV and its outcomes, such as hepatocellular carcinoma, using machine learning techniques has recently been addressed in a diverse array of studies. The early stages of HCV infection and the detection of host genes involved in the HCV life cycle were studied in [8], using the same dataset applied in this research. A simple statistical method, Analysis of Variance (ANOVA), was applied. An average of the triplicate values was used to calculate the fold change, and each value was assessed for its statistical significance. Host genes with a p-value less than 0.05 were considered significant. The retrieved genes are those whose expression increased or decreased by at least 2-fold.
The work done in [1] investigated the potential of alpha-fetoprotein (AFP), a well-known biomarker for HCC [19,20]. An lncRNA (long non-coding RNA) microarray was used. Significant genes were selected using two univariate methods (the chi-square test and the t-test). The t-test was used to compare differences in the lncRNA expression of plasma tissue between normal and abnormal samples. The authors detected three genes that might be potential biomarkers for tumorigenesis prediction and two genes for metastasis prediction in the future.
Another study [10] compared univariate and multivariate gene selection methods via a range of classifiers on several types of cancer. The authors concluded that univariate gene selection paradigms yielded better results than the multivariate ones in five out of seven datasets. For the univariate methods, they applied Pearson correlation and the t-statistic or signal-to-noise ratio (SNR); for the multivariate methods, they applied base pair selection, forward selection, and recursive feature elimination.
In the study done in [21], the author proposed a new technique called Stable Gene Selection (SGS), which selects significant genes for training a Support Vector Machine (SVM) classifier [22]. Key genes are selected using Bayesian [23] and Lasso [24] methods. Then, the selected genes are used to train the SVM classification algorithm to build a prediction model. The proposed method (SGS) was applied to four datasets and outperformed the existing gene selection methods.
Another study [25] utilized visualization tools to predict the up/downregulated genes in microarray samples. The authors proposed Kernel PCA (KPCA) [26] and the Biplot [27] to plot gene expression profiles. They applied the proposed method to three cancer datasets: lymphoma, colon tumor, and leukemia. The proposed procedure starts with the SVD of the preprocessed gene expression input matrix, then takes the rows of the resulting matrices as a set of observations to compute the kernel matrix. The nonlinear features are calculated using PCA on the kernel matrix.
Similar work using machine learning in the same medical domain, HCV, was done in [28]. The authors applied a hybrid machine learning paradigm for diagnosing hepatitis disease. Four stages are used in that work: dimension reduction, clustering, feature selection, and classification. The dimension of the data was reduced using non-linear iterative partial least squares, a self-organizing map was applied to cluster similar data points, Classification and Regression Trees (CART) were used to select the significant features, and an ensemble classifier was used to predict the class (live or die).
The work done in [29] investigated serum miR-218 and its expression in patients with HCC, and analyzed its potential in the diagnosis and prognosis of HCC. The authors compared the serum expression level of miR-218 in healthy liver and HCC tissues to assess the relationship between its expression in normal and tumor samples. The diagnostic value of serum miR-218 in HCC was additionally examined. This study gave valuable evidence for the identification of serum miR-218 as a prognostic biomarker for HCC.
This study is an extension of our previous work [30], in which we introduced a hybrid algorithm for the detection of differentially expressed genes, specifically upregulated ones, as candidate biomarkers for HCC. We applied univariate methods including Pearson's correlation coefficient, the Cosine coefficient, Euclidean distance, mutual information, and entropy. The experimental results yielded six genes that are well-known biomarkers for HCC using Pearson's correlation coefficient and the Cosine coefficient. A lower number of well-known biomarkers was obtained by the other methods (four genes using mutual information, three genes using Euclidean distance, and only one gene using entropy). In this work, we compare the significance of the univariate and multivariate approaches in detecting key genes associated with the replication cycle of the hepatitis C virus and its outcomes. Furthermore, we propose a novel deep learning approach in which sparse autoencoders are used in the validation model of the retrieved significant genes.

Proposed framework
The proposed framework, as shown in Figure 1, consists of two phases: extracting the key genes that play a role in the occurrence of risky outcomes of HCV, and validating their significance for such diseases. The extraction of key genes has been done using both univariate and multivariate gene selection methods. The significance of each tested method in retrieving powerful key genes for the risky outcomes of HCV is assessed in three ways: mining the biological literature, the NCBI Entrez system, and KEGG pathways for such genes; inspecting their P-values and profiling their expression in both HCV-infected and HCV-disinfected samples; and assessing their ability to classify the HCV-infected and HCV-disinfected samples.

Key Genes Extraction
The profiles of genes in HCV-infected and disinfected samples are represented by a gene expression matrix. The entries of this matrix are the expression values of all genes, that is, the amount of their RNA, measured in infected and control samples. In this work, twenty-four samples were taken from the Gene Expression Omnibus (GEO) [31]. The detailed description of the data used in this work is included in the Supplementary Materials section. After preprocessing of the data samples, a gene expression matrix of 54675 genes is retrieved for each period of post-infection.

Ideal Up/Down Regulated Key Genes
An ideal key gene can be defined as a gene that has a variation in its values in infected and disinfected samples [30]. In this work, two ideal key genes have been proposed, as shown in Figure  2. Upregulated key genes are used as a vector, with two different sets (-1, 1) of values in HCV-


Univariate gene selection methods
Univariate gene selection is a methodology that uses a criterion to assess the information carried by each gene individually. The T-test and F-test, Pearson correlation, and Euclidean and Cosine distances have been applied here to detect the significant genes. The similarity between the genes under investigation and the ideal key genes has been calculated using the Pearson coefficient, Cosine coefficient, and Euclidean distance, with the same criteria as in [30]. Euclidean distance has been applied to measure similarities between the ideal key genes and all genes in the gene expression matrix; the similarity between two vectors can be determined by measuring the distance between them in space. The Cosine Coefficient (CC) measures the dependence between two vectors representing the genes. If the cosine coefficient is zero, the vectors are independent; if it is one, they point in the same direction. Key genes should have a Pearson coefficient close to +/-1. Typically, the retrieved key genes have a Pearson coefficient with an absolute value of at least 0.7.
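The three similarity measures can be illustrated with a minimal Python sketch. It assumes that the ideal upregulated key gene takes the value -1 in the disinfected samples and +1 in the infected ones; the gene names and expression values are made up for illustration, and standardizing profiles before taking the Euclidean distance is an assumption of this sketch, not a step stated in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine(x, y):
    """Cosine coefficient: 0 for orthogonal vectors, 1 for the same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def standardize(values):
    """Z-score a profile (assumed preprocessing, not stated in the paper)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Assumed ideal upregulated gene: -1 in disinfected, +1 in infected samples.
ideal_up = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]

# Hypothetical profiles: 3 disinfected samples followed by 3 infected samples.
genes = {
    "gene_up": [2.0, 2.1, 1.9, 5.0, 5.2, 4.9],
    "gene_down": [6.0, 6.2, 5.9, 3.0, 3.1, 2.8],
    "gene_flat": [4.0, 4.2, 3.9, 4.1, 3.8, 4.0],
}

THRESHOLD = 0.7  # minimum |Pearson coefficient| quoted in the text

scores = {}
for name, profile in genes.items():
    scores[name] = {
        "pearson": pearson(profile, ideal_up),
        # Computed on the raw profile, so it depends on the mean expression level.
        "cosine": cosine(profile, ideal_up),
        "euclidean": euclidean(standardize(profile), ideal_up),
    }

# Keep genes whose profile is strongly correlated (up) or anti-correlated (down).
selected = [name for name, s in scores.items() if abs(s["pearson"]) >= THRESHOLD]
print(selected)
```

With these toy values, the upregulated and downregulated profiles pass the threshold while the flat profile does not. A strongly negative Pearson coefficient against the upregulated ideal gene corresponds to a downregulated gene.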

Multivariate gene selection method:
The multivariate feature selection approach is optimized to handle multiple features (or genes) simultaneously [10]. Principal Component Analysis (PCA) is a statistical multivariate paradigm for dimensionality reduction. It applies an orthogonal transformation that converts a set of correlated features into a set of uncorrelated features called principal components. In this work, PCA is used because it is the simplest multivariate analysis method and is widely applied as a tool for exploring and describing the variance of features within a dataset [33].
PCA is applied to summarize the information in a dataset described by numerous variables. It reduces the dimensionality of data containing a large set of variables by transforming the initial variables into a smaller set of new variables without losing the most important information in the original dataset. The fundamental objectives of PCA are identifying hidden patterns in a dataset, reducing the dimensionality of the data by removing noise and redundancy, and identifying correlated variables.
PCA is applied to an input data table, X, with I rows (individuals) and K columns (quantitative variables). X is transformed via an orthogonal linear transformation, as follows: assuming $i'$ is a new individual, its coordinate on the axis $s$ can be written as shown in Equation 1,

$$F_s(i') = \frac{1}{\sqrt{\lambda_s}} \sum_{k=1}^{K} x_{i'k}\, m_k\, G_s(k) \qquad (1)$$

$$G_s(k) = \frac{1}{\sqrt{\lambda_s}} \sum_{i=1}^{I} x_{ik}\, p_i\, F_s(i) \qquad (2)$$

where $G_s(k)$, shown in Equation 2, is the coordinate of the variable $k$ on the axis $s$, $m_k$ is the weight associated with the variable $k$, $\lambda_s$ is the eigenvalue associated with the axis $s$, $p_i$ is the weight associated with the individual $i$, and $x_{ik}$ is the entry of the data table in row $i$ and column $k$.
In this study, the multivariate analysis has been implemented in the R language. Two R packages, namely FactoMineR [34] and factoextra [35], have been applied. FactoMineR has been used for performing the multivariate exploratory data analysis. The factoextra package has been used for computing the variances of the retrieved principal components. We have visualized the individuals used during the principal component analysis, which appear as Affy IDs, ranked from the smallest p-value to the largest. Then, we validated the data to extract significant genes that affect the replication cycle of the hepatitis C virus.
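As a rough illustration of what the principal component analysis computes, the following sketch extracts the first principal component of a small made-up gene expression matrix in plain Python. It does not call FactoMineR or factoextra; the power-iteration routine, the gene names, the data values, and the choice to center but not scale the columns are all assumptions of this sketch.

```python
import math

# Hypothetical matrix: rows are genes (individuals), columns are samples
# (2 disinfected followed by 2 infected). Values are made up.
names = ["g1", "g2", "g3", "g4"]
X = [
    [1.0, 1.2, 5.0, 4.8],
    [5.0, 5.1, 1.0, 0.9],
    [3.0, 2.9, 3.1, 3.0],
    [2.0, 2.2, 4.0, 4.1],
]

def center_columns(rows):
    """Subtract each column mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
    return [[v - m for v, m in zip(r, means)] for r in rows]

def covariance(rows):
    """Covariance matrix of the (already centered) columns."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / n for b in range(p)] for a in range(p)]

def leading_eigenpair(C, iters=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(C)
    # Start from the first basis vector; a start orthogonal to the leading
    # eigenvector would not converge, which does not occur for this toy data.
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(p)) for a in range(p))
    return lam, v

Xc = center_columns(X)
C = covariance(Xc)
eigenvalue, axis = leading_eigenpair(C)

# Share of the total variance carried by the first principal component.
explained_ratio = eigenvalue / sum(C[j][j] for j in range(len(C)))

# Coordinate of each gene on the first axis; the sign of an axis is arbitrary,
# so genes are ranked by absolute coordinate.
pc1 = {name: sum(x * a for x, a in zip(row, axis)) for name, row in zip(names, Xc)}
ranking = sorted(pc1, key=lambda g: abs(pc1[g]), reverse=True)
print(round(explained_ratio, 3), ranking)
```

In this toy matrix the first axis mostly captures the contrast between the disinfected and infected columns, so genes with a large absolute coordinate are those whose expression changes most between the two groups.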

Validating the extracted Key Genes
Mining the biological databases and literature, and examining the gene signal profiles of the top-ranked extracted key genes, have been extensively carried out in this work, as illustrated in the results section. However, the classification of HCV-infected and disinfected samples using conventional classification algorithms suffers from a deficiency in the number of samples. Twenty-four infected and disinfected samples are too few to be split into training and testing sets, and classification algorithms are likely to yield worse predictive performance after being trained and tested on such a low number of samples. Therefore, in this work, we propose data augmentation of the expressions of the extracted key genes in the twenty-four samples to generate additional samples, as shown in Figure 3. A sparse autoencoder has been applied here, as it is one of the most widely used methods in the field of data augmentation [17]. It is a feed-forward deep neural network that generates output data X' similar to the input data X using a set of low-dimensional hidden layers [18]. A sparse autoencoder is an unsupervised neural network learning approach that tries to predict an output that is very close to its input. The input data are passed to an encoder, which compresses and encodes the data. The encoded data, in turn, are decompressed via a decoder. The weights of the final hidden layer form the compressed representation of the input, from which an approximate version of the original data can be regenerated. For data reconstruction, the input layer and the output layer have the same number of nodes. In our experiment, a gene expression matrix of the most significant retrieved genes, those with a p-value less than 0.005, is the input data, X, to the autoencoder. Ten autoencoders have been trained in an unsupervised manner, with no labels attached to the input examples; each regenerates the twenty-four samples, so the number of generated samples is 240.
Each autoencoder consists of an encoder, hidden layers, and a decoder. Satlin and purelin (defined by Equations 3 and 4, respectively) have been used as the transfer functions of the encoder and the decoder, respectively. The learning model tries to minimize the difference between the original and generated data (X and X'), so the cost function of the training model has been set to the mean squared error between X and X'. The learning model is trained for 1000 epochs, with 0.04 as the coefficient of the L2 regularization term and 4 as the coefficient of the sparsity regularization term.
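The following plain-Python sketch shows how the pieces described above could fit together in a single forward pass: a saturating linear (satlin) encoder, a linear (purelin) decoder, and a cost that adds L2 and sparsity penalties to the mean squared reconstruction error. The Kullback-Leibler form of the sparsity penalty, the target activation `rho = 0.05`, the layer sizes, the weights, and the data are assumptions chosen for illustration; only the coefficients 0.04 and 4 and the sample counts come from the text. No training loop is included.

```python
import math

def satlin(z):
    """Saturating linear transfer function: 0 below 0, z on [0, 1], 1 above 1."""
    return min(1.0, max(0.0, z))

def purelin(z):
    """Linear transfer function: returns its input unchanged."""
    return z

def forward(x, W1, b1, W2, b2):
    """Encode x with satlin, then decode the hidden code with purelin."""
    hidden = [satlin(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    x_hat = [purelin(sum(w * h for w, h in zip(row, hidden)) + b) for row, b in zip(W2, b2)]
    return hidden, x_hat

def kl_divergence(rho, rho_hat, eps=1e-8):
    """KL divergence between target and observed mean activation of one unit."""
    rho_hat = min(max(rho_hat, eps), 1.0 - eps)
    return rho * math.log(rho / rho_hat) + (1 - rho) * math.log((1 - rho) / (1 - rho_hat))

def cost(X, W1, b1, W2, b2, l2=0.04, beta=4.0, rho=0.05):
    """MSE + l2 * weight penalty + beta * sparsity penalty (rho is assumed)."""
    n = len(X)
    hidden_all, mse = [], 0.0
    for x in X:
        hidden, x_hat = forward(x, W1, b1, W2, b2)
        hidden_all.append(hidden)
        mse += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    mse /= n
    weight_penalty = 0.5 * sum(w * w for W in (W1, W2) for row in W for w in row)
    mean_activation = [sum(h[j] for h in hidden_all) / n for j in range(len(W1))]
    sparsity_penalty = sum(kl_divergence(rho, r) for r in mean_activation)
    return mse + l2 * weight_penalty + beta * sparsity_penalty

# Hypothetical data: 2 samples, 3 gene expression values each, scaled to [0, 1].
X = [[0.2, 0.8, 0.5], [0.9, 0.1, 0.4]]
# Hypothetical weights: 3 inputs -> 2 hidden units -> 3 outputs.
W1 = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.2]]
b1 = [0.0, 0.1]
W2 = [[0.6, 0.1], [-0.2, 0.7], [0.3, 0.3]]
b2 = [0.1, 0.0, 0.2]

total = cost(X, W1, b1, W2, b2)

# Sample count from the text: ten autoencoders, twenty-four samples each.
n_generated = 10 * 24
print(total, n_generated)
```

Training would adjust the weights and biases to reduce `total`, and the decoder outputs of each trained autoencoder would then serve as one synthetic copy of the input samples.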

Biological validation of extracted Key Genes
By mining the KEGG pathways [36] and the NCBI Entrez system, the biological interpretations of the extracted key genes are listed in Tables 1-4. Each table contains the following details about the extracted key genes: the Affy ID, gene symbol, Entrez ID, oncology, and the gene pathway.

Signal profiles and P-values of extracted Key Genes
The gene signal profiles of the top-ranked key genes retrieved using each feature selection method are shown in Figures 4-7. Each gene signal is plotted across the HCV-infected and disinfected samples, and its P-value is shown along with its Affy ID. The signal profile illustrates whether a gene is up- or downregulated. Each figure plots the gene expression value in the disinfected and infected samples. The X axis represents the samples, 12 disinfected samples and 12 infected ones. The Y axis represents the gene expression value.

Discussing the relevance of extracted Key Genes based on their Biological examination and signal profiles
The significance of the genes listed above, and their contributions to cellular functions and to malignancies that may occur as risky outcomes of HCV, are discussed in this section. An identical list was retrieved using the T-test and F-test. It can be inferred that the means of the normal and infected samples differ substantially. Additionally, the gene expression values in the infected samples differ considerably from the mean gene expression value in the disinfected samples. This validates the importance of the retrieved genes as key genes for the risky outcomes of HCV. TXNIP, thioredoxin interacting protein, has been detected as a downregulated gene, as shown in Figure 4, in all periods of post-infection (12, 18, 24, and 48 hours) using the T-test and F-test. TXNIP, also known as vitamin D3 upregulated protein 1, is involved in a wide range of cellular processes, including apoptosis, proliferation, and lipid and glucose metabolism, and may also be involved in the metastasis of a range of tumors [37]. Stanniocalcin 2 (STC2) has been detected as a downregulated gene in three periods of post-infection. The protein encoded by STC2 is significant in the regulation of renal and intestinal calcium. Variations in the expression of STC2 may contribute to the appearance of breast cancer, colorectal cancer, and HCC, as discussed in [38]. Asparagine synthetase (ASNS) and DNA damage-inducible transcript 4 (DDIT4) have been detected as downregulated in all studied periods of post-infection with HCV. The ASNS gene is highly regulated in stress, liver development, and HepG2 human hepatocellular carcinoma [39]. DDIT4 has been detected as a downregulated gene, and it encodes a protein that is well known as a biomarker for the prognosis of different types of cancer, including liver cancer [40].
In addition, DDIT4 is associated with the TP53 pathway, which is a significant pathway for HCC according to the biological literature [40]. INHBE, inhibin subunit beta E, is detected as a downregulated gene after 24 and 48 hours of post-infection. This gene is involved in the regulation of cell proliferation, immune response, apoptosis, and hormone secretion [41]. Cystathionine gamma-lyase, CTH, has been detected as downregulated. CTH encodes an enzyme involved in the cellular processes of the liver and kidney, and it is a prognostic biomarker for bladder cancer [42]. CHAC1 has only been detected in the first and second periods of post-infection. CHAC1 encodes a protein involved in ATF4 signaling. However, these genes are recommended as novel key genes in the replication cycle of HCV, as they have not been addressed before in the literature for HCV, liver cirrhosis, and HCC. COL1A1 has been detected as upregulated after 12 hours of post-infection. COL1A1 has been reported recently as a highly upregulated biomarker in HCC tissues. COL1A1 can suppress the clonogenicity of HCC cells, help in the early survival of HCC, and play a great role in the targeted therapy of HCC [43].
The univariate methods, Pearson correlation, cosine coefficient, and Euclidean distance, have also yielded similar lists of key genes. KLF2 is the most significant key gene detected using these methods. Its signal profile was that of a downregulated gene in the infected tissues. The protein expression of KLF2 was increased in HCC cells [44]. ASNS, TXNIP, and PCK2 have also been detected as downregulated, with significant p-values, according to Figure 5. The PCK2 gene has been reported as downregulated in primary HCC, and forced expression of PCK2 suppressed HCC tumorigenesis in an experiment on mice [45]. OARD1, SPRY1, and ZNF691 have been detected as downregulated genes. The role of OARD1 has been investigated in [46]. SPRY1 is related to the Sprouty protein and has been investigated in [47]; that study revealed that it is overexpressed in HCC. The role of the ZNF691 gene in HCC tumorigenesis has not been investigated yet. CYP1A1, from cytochrome P450 family 1, has been detected as downregulated in the first period of post-infection. A study on the contribution of CYP1A1 to the risky outcomes of HCV-infected patients was done in [48]. It investigated the impact of polymorphisms of the CYP family of genes on the progression of liver diseases and showed that polymorphic modifications of CYP family genes could result in the development of liver infection and in HCC risk. MAFF and CCDC86 were insignificant according to their P-values, shown in Figure 5. MAFF has been detected in the first and second periods of post-infection. MAFF regulates a variety of target genes, including genes responsible for platelet production and genes encoding antioxidant/xenobiotic enzymes. MAFF has also been implicated in the regulation of the oxytocin gene. However, the involvement of MAFF in the regulation of genes and proteins significant for HCV and HCC has not been examined to date [49]. CCDC86 contributes to the formation of HCV [50].
Using the Euclidean distance gene selection method, a lower number of significant genes was retrieved, according to the P-values shown in Figure 6. The CYP1A1, MAFF, and PCK2 genes have been retrieved. CTH, BCAT1, EREG, PHLDA1, and AGR2 have been retrieved as downregulated genes with significant P-values. CTH has been differentially expressed in normal and tumor HCC tissues [51]. BCAT1 has a significantly higher expression in HCC samples than in normal samples, as stated in [52]; EREG contributes to hepatocarcinogenesis, as reported in [53]; PHLDA1 has been detected as a novel biomarker of HCC, as it was differentially expressed in the experiment done in [54]; and AGR2 has high expression values in metastatic hepatocellular carcinoma samples, as found in [55]. Other retrieved genes, including SLC19A1, CCDC127, and MAFF, were not differentially expressed according to their P-values.
Using PCA, several HCC biomarkers (EEF1A1, ATP6, and FTL) have been detected in the four periods of post-infection with the hepatitis C virus, as shown in Table 4. EEF1A1 is a well-known HCC biomarker and is considered one of the top 20 genes related to human hepatocarcinoma cell lines. EEF1A1 has been reported as a novel prognostic biomarker for liver cancer using a multivariate analysis done in [56]. In our experiment, EEF1A1 was retrieved as a downregulated gene with a significant p-value, as illustrated by Figure 7. However, ATP6 and FTL were not differentially expressed, with p-values > 0.005. RPS2 was downregulated with a significant p-value. RPS2 was identified in [57] as a significant key gene for HCC. The ACTB gene has been detected as upregulated with a significant p-value. ACTB was differentially expressed in the study done in [58]. Although the other retrieved genes, including COX1, COX2, KRT18, PPIA, UBC, and MTND4, have been related to liver diseases (HCV and HCC) in the biological literature [59][60][61], they were not differentially expressed in this study.

Examining the relevance of extracted Key Genes using Conventional classification & Data Augmentation
The key gene expression matrix has been augmented using the sparse autoencoder to generate more samples, as the number of available samples is insufficient. The augmented data have been applied to the classification of HCV-infected and disinfected samples. Each autoencoder consists of an encoder/decoder module with one hidden layer in each module. The training procedure optimizes the cost function, which calculates the error between the input data, X, and its regenerated output data, X', on each iteration. The mean squared reconstruction error of the autoencoder has been calculated for the generated samples, as illustrated by Table 5. The effectiveness of the generated feature matrices has been investigated by comparing their performance in classifying the HCV-infected and disinfected samples. The following conventional supervised classifiers have been extensively tested: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machines (SVM), and K-Nearest Neighbor (KNN). In KNN, three values of K have been tested: 1, 3, and 5. In SVM, three kernel functions have been employed: linear, polynomial, and the Radial Basis Function (RBF) kernel. Two polynomial orders (2 and 3) have been applied for the polynomial kernel function. Standardized and optimized RBF have been employed. All classifiers have been trained using 10-fold cross-validation to prevent over-fitting. During each fold, the learning model has been trained on nine divisions and validated on the tenth. The confusion matrix has been calculated in each fold, and a summarized one was used to calculate the accuracy. Table 6 illustrates the classification accuracy for all feature selection methods discussed here. The highest accuracies are highlighted in grey. The key genes retrieved using the T&F test have yielded the highest classification accuracy, 95.83%, using the QDA classifier.
The key genes extracted using PCA have returned a classification accuracy of 93.75% using the QDA classifier. On the other hand, the key genes retrieved using PCA and Euclidean distance have returned a 91.67% classification accuracy using the SVM classifier.
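The evaluation protocol described above, 10-fold cross-validation with one summarized confusion matrix, can be sketched in plain Python. The sketch uses a small K-nearest-neighbor classifier with K = 3 as a stand-in for the full set of classifiers, and the two-dimensional data are randomly generated for illustration; neither the data nor the helper names come from the study.

```python
import math
import random

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def cross_validate(X, y, n_folds=10, k=3, seed=0):
    """Train on nine divisions, test on the tenth, and pool the confusion counts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    # Keys are (true label, predicted label); 0 = disinfected, 1 = infected.
    confusion = {(t, p): 0 for t in (0, 1) for p in (0, 1)}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in held_out:
            confusion[(y[i], knn_predict(train_X, train_y, X[i], k))] += 1
    accuracy = (confusion[(0, 0)] + confusion[(1, 1)]) / len(X)
    return confusion, accuracy

# Hypothetical, well-separated toy data: 20 samples per class, 2 features each.
rng = random.Random(0)
X = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(20)]
X += [[rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)] for _ in range(20)]
y = [0] * 20 + [1] * 20

confusion, accuracy = cross_validate(X, y)
print(confusion, accuracy)
```

Pooling the per-fold counts before computing the accuracy means every sample is scored exactly once, by a model that never saw it during training.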

Conclusions
In this work, we have used classical feature selection techniques, both univariate and multivariate methods, to detect up/downregulated genes that help in identifying and characterizing the HCV replication cycle. This study has yielded 15 downregulated key genes (TXNIP, STC2, ASNS, DDIT4, CTH, CHAC1, INHBE, KLF2, PCK2, OARD1, SPRY1, ZNF691, CYP1A1, EEF1A1, and RPS2) for studying the outcomes of HCV infection. Only two upregulated key genes (COL1A1 and ACTB) were detected. In addition, a deep neural network approach has been proposed to augment the insufficient number of samples. The augmented data have been used as a training set for conventional classification algorithms. The classification accuracy can reveal the relevance of the extracted key genes in classifying the HCV-infected and disinfected samples.