Ensemble Consensus-Guided Unsupervised Feature Selection to Identify Huntington’s Disease-Associated Genes

Due to the complexity of the pathological mechanisms of neurodegenerative diseases, traditional differentially-expressed gene selection methods cannot detect disease-associated genes accurately. Recent studies have shown that consensus-guided unsupervised feature selection (CGUFS) performs well in feature selection for identifying disease-associated genes. Since the random initialization of the feature selection matrix in CGUFS results in instability of the final disease-associated gene set, for the purposes of this study we proposed an ensemble method based on CGUFS—namely, ensemble consensus-guided unsupervised feature selection (ECGUFS) in order to further improve the accuracy of disease-associated genes and the stability of feature gene sets. We also proposed a bagging integration strategy to integrate the results of CGUFS. Lastly, we conducted experiments with Huntington’s disease RNA sequencing (RNA-Seq) data and obtained the final feature gene set, where we detected 287 disease-associated genes. Enrichment analysis on these genes has shown that postsynaptic density and the postsynaptic membrane, synapse, and cell junction are all affected during the disease’s progression. However, ECGUFS greatly improved the accuracy of disease-associated gene prediction and the stability of the disease-associated gene set. We conducted a classification of samples with labels based on the linear support vector machine with 10-fold cross-validation. The average accuracy is 0.9, which suggests the effectiveness of the feature gene set.


Introduction
Neurodegenerative diseases seriously affect human health and quality of life. It is reported that half of the population aged seventy and over suffer from Alzheimer's disease [1]. Neurodegenerative diseases are usually the result of one or more genes combined with one or more environmental factors. They are a kind of chronic disease characterized with complex symptoms. Alzheimer's disease (AD) [2,3], Parkinson's disease (PD) [4], and Huntington's disease (HD) are some of the most common neurodegenerative diseases. Due to the complexity of neurodegenerative diseases, there are still many unknown parts of molecular mechanisms. It has been shown that the pathogenic gene of HD is IT15, and although it is widely expressed in various tissues within the body, only the neurons of specific tissues are damaged. Additionally, HD results from the misfolding of the protein Htt, and the symptoms of this disease usually develop after the age of 40 [5,6]. Due to the complexity of these diseases [7], identifying disease-associated genes is helpful for revealing the pathogenesis of the diseases. (1) , H (2) , · · · H (r) are the r basic clustering results of X in consensus clustering.
Part of CGUFS is the design of the following objective function, Equation (1). When the objective function is minimized to get the feature selection matrix Z and the consensus indicator matrix H * , the l 2 -norm of each row of the feature selection matrix Z is used as the weight of each feature gene. In order to identify the disease-associated genes, the genes need to be sorted into descending order according to weight. The highly-ranked genes are the disease-associated genes.
where H * is the consensus indicator matrix of the consensus clustering, U c is a function that measures the similarity of two basic clustering results to obtain the consensus clustering result [21]. Z ∈ R g× K is the feature selection matrix, G is a mapping matrix between X T Z and H * , and both α and β are parameters that control consensus clustering and sparse learning. Specifically, CGUFS is an unsupervised feature selection method that does not require label information. As CGUFS adopts consensus clustering for pseudo-label learning, it can greatly improve clustering accuracy. CGUFS performs sparse regularization constraint on the feature selection matrix, which reduces the model's complexity and improves its operation rate. The optimization solution of the objective function is as follows [15]: Firstly, when given Z, update H * and G. A part of Equation (1) can be converted to Equation (2) in order to simplify the optimization process [22]: where B = [H (1) , H (2) , · · · H (r) ] is a matrix of the r basic clustering results of consensus clustering, and C = [C (1) , C (2) , · · · C (r) ] is the center of B. Thus, H * and G can be updated through the optimization of Equation (3).
Secondly, an optimization approach similar to K-means clustering is used to update H * and G [15].
Update H * , G, and C in the following way.
where m k is the k − th center of matrix U, and G is the last K row of center U. Finally, update Z by giving H * and G. Since Z only appears in the last two terms of Equation (1), update Z by optimizing the last two items. Let L = X T Z − H * G 2 F + β Z 2,1 , and let the derivative of L to Z be 0. The updated equation of Z is: where ). In our proposed method, the weight of the gene can be calculated by using Equation (6). w i represents the weight of gene i, and W represents a vector of the weights for all genes.

Evaluation
We used the true positive rate (TPR), false positive rate (FPR), precision (P), and recall (R) to assess the accuracy of disease-associated gene prediction. The ROC curve was plotted using TPR and FPR, the precision-recall (PR) curve was plotted using P and R, and the area under the ROC curve (AUC) and the area under the PR curve (AUPR) were used as measures of prediction accuracy [23].

Ensemble Consensus-Guided Unsupervised Feature Selection
Since the random initialization of the feature selection matrix in CGUFS caused instability in the final gene ranking, this work proposed a bagging integration strategy to integrate the results of CGUFS, or namely, ECGUFS, to obtain a more unified disease-associated gene set and also to improve the accuracy of the final gene set.
Firstly, we used bootstrap aggregation to generate b bags from X = [x ij ] g×s . Each bag had c samples, where c is generally equal to the number of samples. For the i − th bag, the gene weights were calculated based on CGUFS and denoted as W i . The gene-ranked list was obtained through W i . Secondly, we calculated the area under the ROC of the gene-ranked list, which is denoted by a i . Finally, we used Equation (8) to calculate the ensemble gene weights.
The genes were sorted in descending order according to W f inal . Highly-ranked genes indicated that they played an important role in the discrimination between disease samples and normal ones. Figure 1 shows the flow chart of ECGUFS. After these processes, the final feature gene set is obtained. In our proposed method, the weight of the gene can be calculated by using Equation (6). i w represents the weight of gene i , and W represents a vector of the weights for all genes.

Evaluation
We used the true positive rate (TPR), false positive rate (FPR), precision (P), and recall (R) to assess the accuracy of disease-associated gene prediction. The ROC curve was plotted using TPR and FPR, the precision-recall (PR) curve was plotted using P and R, and the area under the ROC curve (AUC) and the area under the PR curve (AUPR) were used as measures of prediction accuracy [23].

Ensemble Consensus-Guided Unsupervised Feature Selection
Since the random initialization of the feature selection matrix in CGUFS caused instability in the final gene ranking, this work proposed a bagging integration strategy to integrate the results of CGUFS, or namely, ECGUFS, to obtain a more unified disease-associated gene set and also to improve the accuracy of the final gene set.
Firstly, we used bootstrap aggregation to generate b bags from Each bag had c samples, where c is generally equal to the number of samples. For the i th − bag, the gene weights were calculated based on CGUFS and denoted as i W . The gene-ranked list was obtained through i W . Secondly, we calculated the area under the ROC of the gene-ranked list, which is denoted by i a .
Finally, we used Equation (8) to calculate the ensemble gene weights.
The genes were sorted in descending order according to final W . Highly-ranked genes indicated that they played an important role in the discrimination between disease samples and normal ones. Figure 1 shows the flow chart of ECGUFS. After these processes, the final feature gene set is obtained.  The algorithm of ECGUFS can be described as follows in (Table 1).
Initialize H * ,Z,F; 2: repeat; 3: build run K-means on U to update H * and G; until the value of the objective function remains unchanged. 8: Obtain the gene weights W i according to Equation (7); sort genes according to W i to get the gene-ranked list; 9: get the area under the ROC of the gene-ranked list a i ; 10: End 11: Calculate the ensemble gene weights according to Equation (8).
Note: Initialize the elements of Z between 0 and 1,

Gene Expression Data
The data used in this study was obtained from HDinHD (http://www.hdinhd.org), which is the gene expression data of HD mice obtained by RNA-Seq technology. The dataset contained mouse liver tissue, cortex tissue, and striatum tissue, and the genotypes of the mice were poly Q20, poly Q80, poly Q92, poly Q111, poly Q140, and poly Q175. There were 8 samples for each genotype [24]. The mice whose genotype was poly Q20 were normal mice, whereas mice with all other genotypes were diseased mice. The longer the poly Q, the heavier was the phenotype of the diseased mice. There were 23,351 genes in the dataset, and most of the calculation methods used for data analysis were based on differential expression genes to identify disease-associated genes. As it was difficult to identify the genes whose expression levels did not change during the disease's progression, we preprocessed the gene expression data in order to reduce computational complexity. Firstly, the gene whose expression value was zero in any sample was deleted according to the l 0 -norm. Then, we normalized the gene expression data for each sample. We ranked the genes into descending order according to the gene variance in different samples, and only the top 4000 genes were selected. We then got the union set of 4000 genes in the three tissues and added the modifier gene set [24]. Finally, 6723 genes were selected from the entire genome. As it has been reported that striatum tissue can be seriously affected by the length of poly Q, we used striatum tissue data as experimental data in this study.
The modifier genes are from [24], including 520 genes, 89 of which are disease-associated genes and the rest of which are non-disease-associated genes. It should be noted that we normalized the gene expression data by each gene to ensure the convergence of the ECGUFS.

Parameter Setting
In ECGUFS, we set the number of clusters to K = 6 as the number of mouse genotypes was six. We set the number of basic clustering results to r = 100 to ensure the robustness of consensus clustering [15]. We set the parameter that controlled the consensus clustering to α = 10 4 , and the parameter that controlled the sparse regularization to β = 1. The higher α and β were, the better the performance of ECGUFS. When β was larger than 1, the results stabilized and were unaffected by α, according to [15]. The number of bags was set to b = 20, and the number of samples in a bag to c = 48.

Performance Comparison between Ensemble Consensus-Guided Unsupervised Feature Selection and Other Methods
To further verify the prediction accuracy of ECGUFS, we conducted comparative experiments using statistic-based methods, machine learning methods, and network-based methods on the above data set. The t-test method [8], FC [8], edgeR tool [25], limma [26], FNMF [9], the joint non-negative matrix factorization meta-analysis method (jNMFMA) [27], LP [12], and CGUFS [15] were used as comparison methods. For non-parametric methods, such as the t-test, FC, edgeR, limma, and LP, only one experiment was conducted. The experimental results of parametric methods, such as CGUFS, FNMF, jNMFMA, and ECGUFS were unstable due to the random initialization. Consequently, this work conducted 10 experiments for each parametric method. The mean and standard deviation of the prediction accuracy of the 10 experiments were calculated to evaluate the performance of the corresponding method.
The experimental results of FNMF, jNMFMA, CGUFS, and ECGUFS are shown in Table 2. From Table 2, we can see that the average AUC and AUPR of ECGUFS are better than FNMF, jNMFMA, and CGUFS, which shows that ECGUFS improved the accuracy of the disease-associated gene prediction. To analyze the performance of the nine methods in detail, a set of best-performing experiments were selected for further comparison.  Figure 2 shows the ROC curves for the prediction accuracy of the t-test, FC, edgeR, limma, LP, FNMF, jNMFMA, CGUFS, and ECGUFS. It can be seen from Figure 2 that the AUC of ECGUFS is better than the other eight methods. However, the above methods could not effectively distinguish the disease-associated genes from the non-disease-associated genes in the modifier gene set. Possible reasons for this may be that the expression levels of the disease-associated genes did not change significantly during the disease's progression, or that a large number of gene expression levels had changed during the disease's progression, thereby making it difficult to identify the disease-associated genes [28]. Figure 3 shows the PR curves of the nine methods. It can be seen from Figure 3 that the AUPR of ECGUFS is better than the other eight methods. As ECGUFS has a higher prediction accuracy for highly-ranked genes, it indicates that ECGUFS can better distinguish disease-associated genes from non-disease-associated genes for highly-ranked genes.
Briefly, it can be seen from Table 2, and Figures 2 and 3 that the performance of ECGUFS is better than CGUFS. The AUC and AURP standard deviation of the 10 experiments by ECGUFS is small, indicating the stability of ECGUFS. Experimental results show that ECGUFS is more suitable for identifying disease-associated genes than the other eight methods.
Briefly, it can be seen from Table 2, and Figures 2 and 3 that the performance of ECGUFS is better than CGUFS. The AUC and AURP standard deviation of the 10 experiments by ECGUFS is small, indicating the stability of ECGUFS. Experimental results show that ECGUFS is more suitable for identifying disease-associated genes than the other eight methods.  From Figure 3, we can see that when the recall rate is less than 0.2, ECGUFS has high prediction accuracy. Since there are 89 disease-associated genes in the modifier gene set, the top 18 ( 0.2 89 ) disease-associated genes have higher prediction accuracy. As the last of the 18 genes ranked roughly at about 1000 in the overall ranking, this work further calculates the coincidence degree of the top 1000 genes in the ranked lists of any two experiments. The results are shown in Table 3, where we can see that the coincidence degree is greater than 60%. The results also show that when ECGUFS is used to identify disease-associated genes, many overlapped genes can be identified under the condition of changes in sample size. Through the integration analysis we found that of the top 1000 genes, there were 287 overlapped ones of the ten experiment-ranked lists. In addition, we also calculated the coincidence degree of the top 2000 genes in different ranked lists. The results are shown Briefly, it can be seen from Table 2, and Figures 2 and 3 that the performance of ECGUFS is better than CGUFS. The AUC and AURP standard deviation of the 10 experiments by ECGUFS is small, indicating the stability of ECGUFS. Experimental results show that ECGUFS is more suitable for identifying disease-associated genes than the other eight methods.  From Figure 3, we can see that when the recall rate is less than 0.2, ECGUFS has high prediction accuracy. Since there are 89 disease-associated genes in the modifier gene set, the top 18 ( 0.2 89 ) disease-associated genes have higher prediction accuracy. As the last of the 18 genes ranked roughly at about 1000 in the overall ranking, this work further calculates the coincidence degree of the top 1000 genes in the ranked lists of any two experiments. The results are shown in Table 3, where we can see that the coincidence degree is greater than 60%. The results also show that when ECGUFS is used to identify disease-associated genes, many overlapped genes can be identified under the condition of changes in sample size. Through the integration analysis we found that of the top 1000 genes, there were 287 overlapped ones of the ten experiment-ranked lists. In addition, we also calculated the coincidence degree of the top 2000 genes in different ranked lists. The results are shown From Figure 3, we can see that when the recall rate is less than 0.2, ECGUFS has high prediction accuracy. Since there are 89 disease-associated genes in the modifier gene set, the top 18 (0.2 × 89) disease-associated genes have higher prediction accuracy. As the last of the 18 genes ranked roughly at about 1000 in the overall ranking, this work further calculates the coincidence degree of the top 1000 genes in the ranked lists of any two experiments. The results are shown in Table 3, where we can see that the coincidence degree is greater than 60%. The results also show that when ECGUFS is used to identify disease-associated genes, many overlapped genes can be identified under the condition of changes in sample size. Through the integration analysis we found that of the top 1000 genes, there were 287 overlapped ones of the ten experiment-ranked lists. In addition, we also calculated the coincidence degree of the top 2000 genes in different ranked lists. The results are shown in Table 4. There are 962 overlapped genes in the top 2000 genes. The high coincidence degree indicates that ECGUFS can improve the stability of a disease-associated gene set.  To verify the effectiveness of the overlapped 287 genes, we conducted a classification of disease samples from normal samples based on the support vector machine (SVM) [29], using ten-fold cross-validation. The average accuracy is 0.9, suggesting the effectiveness of the 287 feature genes.

Enrichment Analysis
We used the functional clustering tool DAVID [30] to perform enrichment analysis on 287 overlapped genes to further understand the functional annotation of these genes. Table 5 shows the functional annotation results. In the first clustering module, the cellular component (CC) annotations include postsynaptic density and the postsynaptic membrane, synapse, and cell junctions, indicating that the morphology of the cells has undergone major changes during the progression of HD. In fact, the connection between neurons of the striatum tissue gradually became sparse and the nerve cells slowly died during the progression of the disease [31,32]. In the second clustering module, the biological process (BP) annotations include a fatty acid metabolic process and a fatty acid biosynthetic process. The Molecular Function (MF) annotation includes transferase activity, transferring acyl groups other than amino-acyl groups. The Kyoto encyclopedia of genes and genomes (KEGG) pathway annotation includes fatty acid metabolism. In the third clustering module, the MF annotations include transferase activity, kinase activity, nucleotide binding, ATP (adenosine triphosphate) binding, protein kinase activity, and protein serine/threonine kinase activity. The BP annotations include phosphorylation and protein phosphorylation. In the fourth clustering module, the BP annotations include learning or memory, regulation of synaptic plasticity, and embryo development. Studies have shown that HD is related to mental disorders and cognitive decline. In the fifth clustering module, the MF annotation includes cadherin binding involved in cell-cell adhesion. The BP annotation includes cell-cell adhesion. The above molecular functions and biological processes are most likely to be affected during the disease's progression. Studies have shown that Huntington's disease is caused by excessive repetition of the CAG sequence of the huntingtin gene on the fourth chromosome. It leads to a misfolding of the Htt protein, which affects protein transport, gene regulation, and other biological processes. It eventually leads to sparse cell connections, neuronal apoptosis, and the formation of amyloid plaques in the striatum of the brain [33][34][35].

Discussion
In this work we proposed ECGUFS based on CGUFS. The main goal is that we proposed a bagging integration strategy to integrate the results of CGUFS. ECGUFS can effectively select disease-associated genes when the labels are unknown. Experimental results verify the better feasibility and stability of ECGUFS. In addition, we conducted an enrichment analysis of the overlapped 287 genes to further explore the molecular pathogenesis of HD, as well as to provide guidance for subsequent biological validation.