Graph Based Feature Selection for Reduction of Dimensionality in Next-Generation RNA Sequencing Datasets

Abstract: Analysis of high-dimensional data, with more features (p) than observations (N) (p > N), places significant demands on computational cost and memory. Feature selection can be used to reduce the dimensionality of the data. We used a graph-based approach, principal component analysis (PCA) and recursive feature elimination to select features for classification from two lung cancer RNA-Seq datasets. The selected features were discretized for association rule mining, where support and lift were used to generate informative rules. Our results show that graph-based feature selection improved the performance of the sequential minimal optimization (SMO) and multilayer perceptron (MLP) classifiers in both datasets. In association rule mining, features selected using the graph-based approach outperformed the other two feature-selection techniques at a support of 0.5 and lift of 2. The non-redundant rules reflect the inherent relationships between features. Biological features are usually related to functions in living systems, a relationship that cannot be deduced by feature selection and classification alone. Therefore, the graph-based feature-selection approach combined with rule mining is a suitable way of selecting and finding associations between features in high-dimensional RNA-Seq data.


Introduction
Analysis of high-dimensional data, with more features (p) than observations (N) (p >> N), places significant demands on computational cost and memory [1]. Dimensionality reduction techniques that minimize the features in the original data without losing important information have been used to address this challenge [2]. This reduces storage space and computational time and removes redundant, noisy and irrelevant data while improving algorithm efficiency as well as accuracy [3]. Feature selection and feature extraction constitute common approaches used to reduce dimensionality, with the former identifying a subset of sufficiently informative features that define the data. Major feature-selection techniques include filters, wrappers and embedded/hybrid techniques [4]. Wrappers follow a greedy search approach that wraps the feature selection around a learning algorithm and subsequently uses the accuracy of performance or error rate of the classification process as the criterion of feature evaluation. However, this approach is slower for a large feature space because every candidate feature subset must be evaluated using a trained classifier. The filter method evaluates features based on their intrinsic characteristics (information, dependency, consistency and distance) prior to the learning tasks [5], which requires fewer computational resources. Filters are thus faster but less effective in classification relative to wrapper methods. Hybrid/embedded techniques combine the strengths of the filter and wrapper approaches.

Feature Selection and Classification
Feature-selection methods have been widely used in the biological domain [24,25]. In the following section, we highlight studies where various feature-selection methods combined with classification have been applied to answer biological questions. Mazumder and Veilumuthu [19] proposed a feature-selection approach using Joe's normalized mutual information on seven benchmark microarray cancer datasets. They compared five classifiers and reported an average increase in the prediction accuracy of 5.1% when feature selection was performed before classification. Ray et al. [26] used a microarray leukemia dataset and proposed a three-step approach that involved data preprocessing and normalization, feature selection using the mutual information method and classification using support vector machine (SVM) and regression analysis. The authors reported improved computation time and efficiency for both classifiers, although SVM performed better than logistic regression in terms of accuracy. Lokeswari and Jacob [27] performed classification before and after application of feature selection on a microarray pediatric tumor dataset. They reported that application of feature selection before classification improved the accuracy of both SVM and logistic regression. However, SVM achieved an accuracy of 75%, compared to 63% accuracy obtained by logistic regression. Peralta et al. [28] used MapReduce for feature selection on a protein structure prediction dataset [29], followed by SVM, logistic regression and Naïve Bayes with and without feature selection. Analysis of the performance and running time showed that SVM outperformed the other classifiers. Alghunaim and Al-Baity [30] used SVM, decision tree and random forest algorithms to analyze gene expression and DNA methylation datasets in order to predict breast cancer.
The experiment conducted in WEKA showed differences in accuracy for the two datasets: for gene expression data, SVM achieved 98.03%, decision tree 95.09% and random forest 96.07%. Accuracies on the methylation dataset were 98.03% for SVM, 88.23% for decision tree and 95.09% for random forest. This shows that SVM achieved the highest accuracy on both datasets. Turgut et al. [31] used recursive feature elimination (RFE) and randomized logistic regression (RLR) on a cancer dataset [32] for feature selection. They thereafter applied eight classification models on the selected features. SVM gave an accuracy of 99.23% using both RFE and RLR, as compared to 98.49% before feature selection. Morovvat and Osareh [33] used symmetric uncertainty (SU) filter methods and then applied CFS, FCBF, GSNR, ReliefF and MRMR feature selection to further reduce the number of attributes. Thereafter, SVM, J48 decision tree and Naïve Bayes were used for classification. SVM gave the best results as compared to the other classifiers.
A graph is an effective way of modeling a combinatorial relationship between features in a dataset [34]. Data features are the vertices, and edges represent the inter-feature relationships. Each edge has a weight corresponding to the mutual information (MI) between the features connected by that edge. Dominant set clustering allows selection of a highly coherent set of features, which are further selected based on a newer measure called multidimensional interaction information (MII). The advantage of MII is that it can consider third- or higher-order feature interactions [35]. Therefore, graphs provide a visual relationship between objects and assist users in making useful inferences based on the connected features on the graph. Schroeder et al. [36] compared graph-based feature selection to related methods and found that the approach outperforms other feature-selection methods on many datasets or shows similar qualitative results using a smaller number of features. Roffo et al. [37] developed infinite feature selection, which is a graph-based feature filtering approach, and compared it to other existing feature-selection methods on 11 select publicly available benchmark datasets. Their findings show that infinite feature selection operated on neural features, improving relevance and diminishing redundancy. Rana et al. [38] introduced a new feature-selection method that uses a dynamic threshold algorithm (DTA) as described by Nguyen et al. [39]. The algorithm selects important, non-redundant and relevant features by maximizing the similarity between each patient pair using an approximate k-cover algorithm. The k-cover problem in a graph G = (V, E) is an NP-complete problem, which seeks a set of k nodes that cover the maximum number of edges. The new algorithm showed improved performance over existing feature-selection techniques for disease classification and subtype detection problems.

Discretization and Association Rule Mining
Liu et al. [23] combined k-means clustering discretization and the Apriori algorithm in the analysis of water supply associations and found that there was a strong association between features and water supply based on the rules generated. Lu et al. [40] used the quantile-based discretization method of QUBIC to generate a qualitative representing matrix for the RNA-Seq expression matrix. In a study by Chiclana et al. [41], animal migration optimization was proposed to reduce the number of association rules using a new fitness function that incorporated frequent rules. The authors reported a reduction in the computation time and memory required for rule generation due to the filtering of weakly supported rules. Hybrid temporal association rule mining was proposed by Wen et al. [42] to predict traffic congestion. In their study, the authors analyzed the rules by means of classification, such that classifiers were built to predict congestion levels of traffic. Their experimental results show that their approach could predict traffic congestion with better accuracy. Shui and Cho [43] proposed rank-based weighted association rule mining for gene expression analysis. The authors reported a reduced number of rules generated by this approach, thus reducing execution time. The rules generated using this approach were validated using gene ontologies. Another study on association rule mining in gene expression data was conducted by Agapito et al. [44], where they electronically inferred annotations of the association rules by assigning different weights to different types of annotation.

Naïve Bayes Algorithm
This is an efficient supervised learning method suitable for binary and multiclass classification. The algorithm is based on Bayes' theorem [45,46], which calculates the posterior probability P(c|x) from P(x|c), P(c) and P(x), as shown in Equation (1):

P(c|x) = P(x|c) P(c) / P(x)    (1)

where P(c|x) is the posterior probability of the class, P(c) is the prior probability of the class, P(x|c) is the likelihood, i.e., the probability of the predictor given the class, and P(x) is the prior probability of the predictor.
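As a minimal illustration, the posterior in Equation (1) can be computed directly from the three quantities above (the probability values below are hypothetical):

```python
def bayes_posterior(p_x_given_c, p_c, p_x):
    """Posterior P(c|x) = P(x|c) * P(c) / P(x), as in Equation (1)."""
    return p_x_given_c * p_c / p_x

# Hypothetical values: likelihood 0.8, class prior 0.3, predictor prior 0.5
posterior = bayes_posterior(0.8, 0.3, 0.5)  # 0.48
```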

Sequential Minimal Optimization (SMO)
This is a supervised machine learning algorithm that belongs to the family of SVM classifiers [47]. The SVM algorithm works by building a hyperplane that separates different instances into their specific classes [48]. Thereafter, a pairwise multiclass classification scheme is performed. Even when p > N, SVM is functional without any alteration. The SVM hyperplane is defined as shown in Equation (2):

w · x + b = 0    (2)

where w is the weight vector normal to the hyperplane and b is the bias term.
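A small sketch of how a separating hyperplane assigns classes; here w and b are assumed toy values, not parameters fitted by SMO:

```python
import numpy as np

# Toy hyperplane parameters (assumptions for illustration, not fitted by SMO)
w = np.array([1.0, -1.0])  # normal vector of the hyperplane
b = 0.5                    # bias term

def classify(x):
    """Assign a class by the side of the hyperplane w.x + b = 0 the point lies on."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```

A fitted SVM chooses w and b to maximize the margin between the two classes; the decision rule itself is just this sign test.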

Multilayer Perceptron
Multilayer perceptron (MLP) belongs to a class of feedforward artificial neural networks that can recognize complex patterns which are difficult to extract by hand. An MLP has an input layer (attributes), an output layer (classes) and hidden layer(s) interlinked by neurons. The interconnection weights are optimized by the backpropagation algorithm using training instances of the dataset [49]. The weight update rule of the backpropagation algorithm is defined in Equation (3).
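The weight optimization can be illustrated with one delta-rule update step. This is a generic sketch: the learning rate, layer outputs and error term below are hypothetical, and Equation (3) may use a different exact form (e.g., with a momentum term):

```python
import numpy as np

def backprop_update(w, o_prev, delta, eta=0.1):
    """One backpropagation step: w_ij <- w_ij + eta * delta_j * o_i (delta rule)."""
    return w + eta * delta * o_prev

w = np.array([0.5, -0.2])      # incoming weights of one neuron (toy values)
o_prev = np.array([1.0, 2.0])  # outputs of the previous layer (toy values)
new_w = backprop_update(w, o_prev, delta=0.3)
```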

Measures for Performance Evaluation Classification Accuracy, Time, Kappa Statistic, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
Classification accuracy measures how well a test is able to predict different categories; it is the percentage of samples that are correctly classified into their respective classes. Classification time is the total CPU time required to build a classification model, including the training time required to predict the output of the test data. The kappa statistic (KS) is calculated to evaluate measurement accuracy. The closer K is to 1, the more reliable the classification: K = 1 indicates perfect agreement, while K = 0 indicates that the classification is no better than chance [50]. Mean absolute error (MAE) is used to evaluate the performance of the algorithm and is calculated by taking the average of all absolute errors [51]. Finally, root mean squared error (RMSE) is a popular measurement for performance evaluation; because the errors are squared before the root is taken, positive and negative errors do not cancel out [52].
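These four quantities can be computed from predicted and true labels as follows (a plain-Python sketch for binary 0/1 labels):

```python
import math

def evaluate(y_true, y_pred):
    """Return accuracy, MAE, RMSE and Cohen's kappa for two label lists."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    # kappa compares observed agreement (acc) with chance agreement (pe)
    classes = set(y_true) | set(y_pred)
    pe = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in classes)
    kappa = (acc - pe) / (1 - pe)
    return acc, mae, rmse, kappa
```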

Generating a Graph
A fully connected undirected graph G = (V, E) is comprised of nodes and edges, where V represents a set of n features (nodes) and E models the relationships among features (edges). G is represented by an adjacency matrix A, where each element A(i, j), 1 ≤ i, j ≤ n, models the confidence that features f_i and f_j (nodes v_i, v_j) are both potential candidates for selection according to a weight function g. The edge weight is defined here to represent the redundancy of the connected features via a similarity function s : V × V → R.

Data Source and Data Type
Two RNA-Seq datasets were used in this study (Table 1). The first dataset, referred to as small-cell lung cancer (SCLC), had 86 samples in two classes: 79 cancer cells and 7 normal cells. The second dataset, denoted non-small-cell lung cancer (NSCLC), had a total of 218 samples, of which 199 were non-small-cell lung cancer and 19 were normal cells. Both datasets were highly imbalanced, with the cancer samples forming the majority class and the normal samples the minority class. We used the synthetic minority oversampling technique (SMOTE) algorithm to balance the datasets: the minority classes were increased based on the k = 5 nearest neighbors until the classes were nearly equal. After addressing class imbalance, we split the data into 70:30 training/test subsets, performed 10-fold cross validation and recorded the accuracy and F-measure. This was repeated 20 times for each feature-selection method. We then used the Kruskal-Wallis H-statistic to test whether there was a significant difference in the mean ranks of the groups.
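The oversampling step can be sketched as follows: a simplified SMOTE-style interpolation between a minority sample and one of its k nearest minority neighbors. The data below are toy values; the study used the full SMOTE algorithm:

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating a randomly picked
    minority point toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dist = np.linalg.norm(minority - x, axis=1)
        nbrs = np.argsort(dist)[1:k + 1]                # skip the point itself
        nbr = minority[rng.choice(nbrs)]
        synthetic.append(x + rng.random() * (nbr - x))  # point on segment x -> nbr
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                     [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
new_samples = smote_like(minority, n_new=10)
```

Because each synthetic point lies on a segment between two minority samples, the oversampled class stays inside the convex hull of the original minority data.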

Data Preprocessing
The raw count data were preprocessed to filter out any features with zero counts. This was followed by normalization using the upper-quartile method implemented in the edgeR package [55]. After removing features with zero counts, a scaling factor equal to the 75th percentile of the counts was calculated for each sample using Equation (4), where UQ(X) is the upper quartile of the normalized counts of the jth sample X and K_gj > 0.
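A simplified stand-in for the upper-quartile step might look like the sketch below (not the exact edgeR implementation, which also rescales the per-sample factors):

```python
import numpy as np

def upper_quartile_normalize(counts):
    """Drop all-zero features (rows), then divide each sample (column) by the
    75th percentile of its remaining counts -- a sketch of UQ normalization."""
    counts = counts[counts.sum(axis=1) > 0]      # remove zero-count features
    uq = np.percentile(counts, 75, axis=0)       # per-sample upper quartile
    return counts / uq

raw = np.array([[10.0, 20.0],
                [0.0, 0.0],    # zero-count feature, removed before scaling
                [5.0, 8.0],
                [2.0, 4.0],
                [7.0, 6.0]])
normalized = upper_quartile_normalize(raw)
```

After this scaling, the 75th percentile of every sample's normalized counts equals 1, which makes expression levels comparable across samples with different sequencing depths.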

Feature Selection
After normalization, we performed feature selection on both datasets as a means of dimensionality reduction. We used principal component analysis (PCA), recursive feature elimination (RFE) and a graph-based approach.

Principal Component Analysis (PCA)
Assume a dataset x^(1), x^(2), ..., x^(m) with inputs of n dimensions; these n-dimensional data must be reduced to k dimensions (k ≪ n) using PCA. The first step in PCA is raw data standardization, whereby the raw data are scaled to have unit variance and zero mean, as defined in Equation (5).
In the second step, a covariance matrix of the raw data is calculated as shown in Equation (6).
The third stage is calculation of eigenvector and eigenvalue of the covariance using Equation (7).
In the fourth step, raw data are projected into a k-dimensional subspace, and this is followed by choosing the top k eigenvector of covariance matrix. The corresponding vector is calculated as shown in Equation (8).
The raw data with n dimensionality is reduced to a new k dimensional representation of data.
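The four steps can be condensed into a short numpy sketch (illustrative; it assumes every feature has nonzero variance):

```python
import numpy as np

def pca_reduce(X, k):
    """Steps 1-4: standardize, covariance, eigendecomposition, project to k dims."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)     # zero mean, unit variance
    C = np.cov(Z, rowvar=False)                  # covariance matrix
    vals, vecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    top_k = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
    return Z @ top_k                             # k-dimensional projection
```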

Recursive Feature Elimination
The second feature-selection approach used was recursive feature elimination (RFE). This is a recursive process where features are ranked based on their importance [56]. RFE employs machine learning models in computing feature-relevant scores. RFE first trains the model using all features and then computes the relevance score of every feature in the dataset. All the features with lowest relevance score are ignored, and this is followed by model retraining for computation of new relevant feature scores. This process is repeated until the final desired features are obtained [57].
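The loop can be sketched as below. Here the relevance score is the absolute correlation of each feature with the target, a simple stand-in for the model-derived weights that a real RFE would retrain at every step:

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Repeatedly drop the feature with the lowest relevance score until
    n_keep features remain (one feature removed per iteration)."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in remaining]
        remaining.pop(int(np.argmin(scores)))   # discard the weakest feature
    return remaining
```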

Proposed Graph-Based Approach
The following steps were used in graph-based feature selection.
ii. Network construction: calculate the PCC of the normalized features (Equation (10)).
iv. Construct a topological overlap matrix TOM based on the adjacency a_ij (Equation (12)):

TOM_ij = (Σ_u a_ui a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij)    (12)

v. Filter the resulting network using maximal cliques:
a. Apply the Bron-Kerbosch algorithm to find all cliques within the filtered graph, where a clique is a complete subgraph C ⊆ G.
b. Determine the set C of maximal cliques C_i^max, where a maximal clique is a complete subgraph C_i ⊆ G that is not a subset of another complete subgraph. Whenever |C| > 1, a rating function r : C_i^max → R, ∀ C_i^max ∈ C, is applied and the maximal clique with the highest score is selected.
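Step v.a (clique enumeration) can be sketched with the basic Bron-Kerbosch recursion on a toy graph:

```python
def bron_kerbosch(R, P, X, adj, cliques):
    """Enumerate maximal cliques: R is the clique being grown, P the candidate
    vertices, X the already-explored vertices, adj maps vertex -> neighbor set."""
    if not P and not X:
        cliques.append(frozenset(R))  # R cannot be extended: maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P.remove(v)
        X.add(v)

# Toy graph: a triangle {0, 1, 2} plus the pendant edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = {v: set() for v in range(4)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
# cliques now holds the maximal cliques {0, 1, 2} and {2, 3}
```

On the feature networks in this study, the vertices would be genes and the edges the filtered co-expression links; the rating function of step v.b then scores the maximal cliques found here.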

Classification
Experiments were performed in the WEKA framework and RStudio on a PC with an Intel i3 CPU (2 cores) and 8 GB of RAM. Three classifiers, namely Naïve Bayes, sequential minimal optimization (SMO) and multilayer perceptron, were used to compare the classification accuracy of the different feature-selection methods. The features were the independent variables while the class was the dependent variable: the expression levels (counts) of the features (genes) were used to predict the tissue type (diseased/normal). Classification accuracy, root mean squared error (RMSE), mean absolute error (MAE), kappa statistic (KS) and the time taken to build the model for every classifier were recorded before and after feature selection using the various approaches.

Discretization
To discretize the data, we used equal-width interval discretization. This algorithm divides the range of values for a feature into equally sized bins, where the number of bins k is provided by the user. Algorithm 1 finds the minimum and maximum observed values by sorting the continuous feature A = {a_0, a_1, ..., a_n}, where a_min = a_0 and a_max = a_n. To compute the interval, the range of observed values for the variable is divided into k equally sized bins as described in Equations (13) and (14) below:

Step 1: Sort the values of A in ascending order.
Step 2: Calculate the interval width using Equation (13).
Step 3: Divide the data into 3 bins.
Step 4: Place each value into the bin whose boundaries contain it.
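The steps above can be sketched as:

```python
def equal_width_bins(values, k):
    """Equal-width discretization: width = (a_max - a_min) / k, and each value
    maps to bin floor((v - a_min) / width), clamped to the last bin."""
    ordered = sorted(values)               # Step 1: sort ascending
    a_min, a_max = ordered[0], ordered[-1]
    width = (a_max - a_min) / k            # Step 2: interval width
    return [min(int((v - a_min) / width), k - 1) for v in values]  # Steps 3-4

bins = equal_width_bins([1, 2, 3, 4, 5, 6, 7, 8, 9], k=3)
# bins -> [0, 0, 0, 1, 1, 1, 2, 2, 2]
```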

Association Rule Mining (ARM)
ARM is a process of determining possible association rules amongst items in a large database or dataset. Let I = {l_1, l_2, ..., l_k} be a set of k features, T be a transaction that contains items such that T ⊆ I, and D be a database of transaction records. An association rule is of the form X (antecedent) → Y (consequent), where X and Y are feature sets such that X ∩ Y = ∅.
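The rule measures used later (support, confidence and lift) follow directly from these definitions. The transactions below are hypothetical item sets, not data from the study:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """How often the rule X -> Y holds among transactions containing X."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Confidence of X -> Y relative to the baseline frequency of Y."""
    return confidence(X, Y, transactions) / support(Y, transactions)

# Hypothetical discretized transactions (sets of items)
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
```

A lift above 1 means the antecedent and consequent occur together more often than expected by chance, which is why lift is used below to rank rule importance.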



Data Preprocessing
The datasets used in this study had 28,089 initial features as summarized in Table 2. Preprocessing involved elimination of non-differentially expressed genes and normalization which resulted in 12.2% features for small-cell lung cancer and 43.2% features for non-small-cell lung cancer after preprocessing.

Feature Selection
After preprocessing, we used three feature-selection approaches to filter the features further to retain only informative features. RFE retained the highest number of features in both datasets, followed by PCA, as shown in Table 2. The graph-based feature-selection approach retained 80 features for the SCLC dataset and 134 features for the NSCLC dataset. This is because the graph considers only the connected features, and the filtering step using maximal cliques retains only the features with the highest maximal clique score. Figure 2 presents the networks for the two datasets before and after filtering using maximal cliques.


Classification
In the next step, we compared the performance of three classifiers with the features selected in the previous step as input while the raw features were our baseline. Table 3 summarizes the performance of the various classifiers before and after feature selection. The accuracy value, root mean squared error (RMSE), mean absolute error (MAE), kappa statistic (KS) and the time taken to build the model for every classifier, arising from the 10-fold cross validation are given. Overall, accuracy levels after selection ranged between 94.186 and 100% depending on the classification method used and the dataset (Table 3).
NB performed better on features selected using PCA and the graph-based approach, where accuracy, MAE, kappa and time taken improved compared to the unfiltered features and the RFE-selected features, for which there was no difference. This can be attributed to the working principle of RFE, where the optimal number of features is not known a priori (in advance) [58]. A Kruskal-Wallis test showed that there is no significant difference between the mean ranks of the groups (p > 0.05), i.e., 20 iterations for each of the feature-selection methods. A study by Furat and Ibrikci [59] used five tumor types of gene expression cancer RNA-Seq data and, using Naïve Bayes with 10-fold cross validation, achieved an accuracy of 98.7516%. This shows that NB accuracy levels vary with the dataset being analyzed. In dataset GSE81089, which had a larger sample size of selected features, SMO and MLP achieved 100% accuracy when feature selection was performed prior to classification; in fact, the PCA-selected features could be classified at 100% accuracy by all three classifiers. In the smaller SCLC dataset, accuracy levels were also lower. Notably, the graph-based feature-selection approach gave the best classification results in the two datasets and also took the least time to execute. The time required to build the model improved after feature selection across the three classifiers, though MLP required the longest duration and the graph-based approach the shortest (Table 3).

Association Rule Mining
Features from PCA, RFE and graph-based selection methods were discretized and analyzed to find possible associations using Apriori. The resulting number of rules, maximum confidence, support and lift values are summarized in Table 4.
The graph-based feature-selection approach gave 15 and 36 non-redundant rules, respectively, from the two datasets at a support of 0.5, a confidence of 0.9 and a lift of 2. The other feature-selection methods did not generate any rules at a support of 0.5. Features selected by RFE had the lowest maximum support and lift, which led to the generation of too many redundant rules. For the PCA-based feature selection, support ranged between 0.405 and 0.425, with a total of 38 rules for the first dataset and 36 rules for the second dataset (Table 4). The top 10 rules are shown in Table 5. Association rules are represented as X => Y, where X and Y are itemsets contained within a dataset/database and X ∩ Y = ∅. X is the antecedent and Y is the consequent (Table 5): whenever the antecedent X is present, the consequent Y will also be present. Support indicates the frequency of the itemset's appearance in the dataset, and confidence indicates how often a rule has been found to be true. A support value of 0.5 means that 50% of the transactions contain the items (genes), and a confidence of 0.9 means that 90% of the rule's occurrences are true. A lower support means that most of the items are not frequently found together. The lift value measures the importance of a rule: the lift of greater than 2 achieved by the graph-based feature-selection approach indicates that the antecedent and consequent strongly depend on each other, and hence that these rules are useful for predicting the consequent.
External factors that play a critical role in the success of this type of study include the choice of the technology used in generating the data since it determines the volume and quality of the data across the replicates. The cost of generating the data also limits the number of samples as well as the volume of data. In the biological space, what can be defined as control/normal samples is a gray area, and therefore this may have a bearing on downstream analysis while at the same time bringing about class imbalance. Internal threats to this kind of study include experimental noise in the data as well as the assumption that gene expression level is uniform across cells. Co-regulation and co-expression between the features can also lead to redundancy in the datasets. Features without an assigned biological function may also not be informative unless in vitro experiments are designed to validate the function.

Conclusions
In this study, we used three different feature-selection methods to select informative features from two different cancer datasets. Most existing feature-selection techniques assume that features are independent of one another. However, this assumption ignores the fact that biological features are usually related because of their function in living systems. We evaluated the performance of three classifiers on the selected features. Features selected using a graph-based approach with maximal clique could be classified with high accuracy when compared to PCA and RFE. These features also gave informative rules with a higher support and lift as compared to those selected using PCA and RFE. Therefore, the proposed graph-based feature-selection approach combined with rule mining is a suitable way of selecting and finding associations between features in high-dimensional RNA-Seq data.

Conflicts of Interest:
The authors declare no conflict of interest.