A Machine Learning Approach for the Classification of Kidney Cancer Subtypes Using miRNA Genome Data

Abstract: Kidney cancer is one of the deadliest diseases and its diagnosis and subtype classification are crucial for patients' survival. Thus, developing automated tools that can accurately determine kidney cancer subtypes is an urgent challenge. It has been confirmed by researchers in the biomedical field that miRNA dysregulation can cause cancer. In this paper, we propose a machine learning approach for the classification of kidney cancer subtypes using miRNA genome data. Through empirical studies we found 35 miRNAs that possess distinct key features that aid in kidney cancer subtype diagnosis. In the proposed method, Neighbourhood Component Analysis (NCA) is employed to extract discriminative features from miRNAs and Long Short-Term Memory (LSTM), a type of Recurrent Neural Network, is adopted to classify a given miRNA sample into kidney cancer subtypes. In the literature, only a couple of kidney subtypes have been considered for classification. In the experimental study, we used the miRNA quantitative read counts data provided by The Cancer Genome Atlas (TCGA) data repository. The NCA procedure selected 35 of the most discriminative miRNAs. With this subset of miRNAs, the LSTM algorithm was able to group kidney cancer miRNAs into five subtypes with an average accuracy of around 95% and a Matthews Correlation Coefficient value of around 0.92 under 10 runs of randomly grouped 5-fold cross-validation, which were very close to the average performance of using all miRNAs for classification.


Introduction
Kidney cancer is one of the deadliest diseases and unfortunately it is hard to detect early through normal clinical means [1]. Despite being one of the top-ten killer-cancers, there is a lack of research effort on kidney cancer. It has been overshadowed by other cancer types in the medical community, which has hindered the development of new techniques to detect and treat it. For decades, patients with kidney cancer have had limited options of treatment beyond surgery, and in most cases, life expectancy is less than one year. Thus, it is crucial to detect the disease early. Besides traditional clinical techniques, the study of various biomarkers brings researchers closer to understanding the onset of kidney cancer, making accurate diagnoses, and removing the ambiguity surrounding this disease. However, even with all the genomic understanding and technological progress, there are many unclear answers and undiscovered paths of research. Researchers need to explore new techniques to detect this disease and diagnose both the stage and sub-type accurately, thus assisting physicians in prescribing the right

The RNA Sequence and Kidney Cancer
MiRNA is a small non-coding RNA, approximately 19-25 nucleotides in length. It has considerable biological importance at the molecular level and acts to regulate mRNAs post-transcriptionally in plants, animals, and some viruses. In the human genome, there are thousands of classified miRNAs, and they are responsible for targeting about 60% of protein-coding genes [7,8].
MiRNA plays an important role in basic biological processes inside our bodies such as proliferation, cell cycle control, apoptosis, differentiation, migration, and metabolism. Although many aspects of miRNA biogenesis are still unclear, the key processes have been characterized. MiRNA dysregulation can result in numerous types of cancers. Microarray expression data collected from a wide range of cancers have since shown that aberrant miRNA expression is an initial factor that influences cancer [9][10][11][12][13]. Recent studies on mouse strains have shown that individual miRNAs or miRNA clusters are responsible for a range of diseases. It has been stated that low levels of mature miRNAs are linked to cancerous diseases [10], and miRNAs have been shown to be among the most promising biomarkers for providing information about cancer sub-type differentiation and prognosis [14]. For more details on the correlation between miRNA and cancer, please refer to [9].
Researchers are seeking to identify the significant miRNA bio-factors and the exact miRNAs with the greatest discriminative power to serve as biomarkers on which precise diagnosis depends, especially for the task of identifying kidney cancer sub-types and staging. The following summarizes some of the research efforts in this field to help understand the relationship between miRNA and the identification of kidney cancer sub-types and staging. White et al. [15] performed combinatorial analysis on previously reported dysregulated miRNAs and identified 62 miRNAs out of the 133 miRNAs previously reported by other researchers. In [16], 35 miRNAs were found to distinguish between clear cell Renal Cell Carcinoma (ccRCC) and patient-matched normal kidney tissue. Among this group of 35 miRNAs, nine were up-regulated and 26 were down-regulated, and miRNA 106b was selected as the endogenous control. Samaan et al. [17] reported that miR-210 was over-expressed in ccRCC, with higher levels of expression in metastatic tumors, where high expression was considered an independent biomarker of poor prognosis in ccRCC. Higher levels were noticed in the clear cell and papillary sub-types compared with chromophobe renal cell carcinoma and oncocytoma. In [18], the authors stated that significantly higher expression levels of exosomal miR-210 and miR-1233 were found in ccRCC patients than in healthy samples. A combined expression level of miR-21 and miR-126 can be used to predict cancer-specific survival in two independent RCC groups, depending on the sensitivity of the up-regulation of miR-21 and the down-regulation of miR-126 [19]. Studies reported in [20] showed that the up-regulation of miR-21 was correlated with clinical characteristics of renal cancer that led to shorter kidney cancer survival time.
From these studies, the reader can deduce that there is a level of uncertainty among the researchers' claims that a specific group of miRNAs is the best candidate to differentiate among the sub-types or has a significant role in detection. Wach et al. [21] performed micro-array analysis and stated that RCC was a variety of different entities, each having a distinguishable molecular pattern. High similarity in miRNA expression was observed between matched tumor and normal tissue taken from the same RCC case. Recent studies tried to understand the linkage between RCC tumors or tumor entities and miRNAs. White et al. [22] used 35 miRNAs in a cluster analysis to discriminate between the corresponding pairs of ccRCC and normal samples. In [23], the authors identified several pairs of miRNAs that can differentiate between RCC cases of different entities and a normal sample; they achieved a distinguishing sensitivity of 97% using a vote-counting strategy. However, in [21], over 86% accuracy was achieved using only four sets of miRNAs that can be used to distinguish between tumor samples of different RCC entities and normal ones. In [24], a group of nine primary transcripts and 18 mature miRNAs were listed as 27 differentially expressed miRNAs, while the authors of [21] claim that only 10 of the 18 mature miRNAs are differentially expressed according to their analysis. By using a vote-counting strategy, 28 miRNAs were able to classify tumor samples into ccRCC, chromophobe RCC, pRCC, or oncocytoma with 87% accuracy. To classify between ccRCC and pRCC, the authors presented a binary classification system using only 11 miRNAs [23].
It can be seen that most researchers focused on limited groups of miRNAs, or on a specific subtype of kidney cancer. In our research, we consider the entire set of 1881 miRNAs, excluding only the null quantities to obtain a final subset of 1627 miRNAs. We selected the 35 isolated miRNAs using the feature selection tool which will be discussed in Section 3.1. In addition, we will consider five sub-types as categorized by the TCGA and TARGET kidney cancer projects.

Machine Learning
A typical machine learning algorithm starts with feature selection, though deep learning algorithms can also be designed to handle raw data [25,26]. With regard to feature selection, it was demonstrated in [3] that NCA is an effective method for selecting significant features for high-dimensional data. This method is a nearest neighbor-based feature weighting algorithm. As a feature selection tool, the NCA method was successfully tested on several microarray datasets for various cancers, such as colon cancer, brain tumor, leukemia, lung cancer, and prostate cancer [3]. In this research, we adopt the NCA algorithm for selecting high-rank features from miRNA data.
Another important tool in automated cancer subtype classification is an effective classifier. In the literature [27], various deep learning techniques were used for this purpose. For instance, LSTM networks were used for Pulmonary Nodule Detection given CT Images, illustrating a significant discriminative capability [28]. Similarly, a three-layered 1D LSTM network was trained for extracting prognostic information of colorectal cancer from tissue images [29]. In [30], a segmentation algorithm based on deep-learning was presented for the identification of pathological kidneys in CT images. In this paper, we will explore the efficacy of LSTM networks for kidney cancer subtype classification.
Brief descriptions of the NCA method and LSTM network are given in the following subsections.

Neighborhood Component Analysis
Let us consider kidney sub-type classification, a multi-class classification problem. Let $c$ be the number of subtype classes and $n$ the number of observations (patients). Then a given training set can be described as follows [3]:
$$S = \{(x_i, l_i),\ i = 1, 2, \ldots, n\}, \tag{1}$$
where $x_i \in \mathbb{R}^p$ are the feature vectors and $l_i \in \{1, 2, \ldots, c\}$ are the class labels. Let $f : \mathbb{R}^p \to \{1, 2, \ldots, c\}$ be the classifier to be trained. Consider a randomized classifier that picks a reference point $\mathrm{Ref}(x)$ at random from $S$ and then labels $x$ using the label of that reference point. By choosing the reference point to be the nearest neighbor of the given point $x$, one makes the algorithm similar to the Nearest Neighbor Classifier. However, in the NCA algorithm, the choice of the reference point is based on a probability, which is called the selection probability. The probability $P(\mathrm{Ref}(x) = x_j \mid S)$ is higher if the reference point $x_j$ is closer to $x$, as measured by the distance function
$$d_w(x_i, x_j) = \sum_{r=1}^{p} w_r^2 \, |x_{ir} - x_{jr}|, \tag{2}$$
where $w_r$, for $r = 1, 2, \ldots, p$, are the feature weights. Assume that the selection probability is directly proportional to $k(d_w(x, x_j))$, where $k$ is a kernel or similarity function that produces large values when $d_w(x, x_j)$ is small. Since the reference point is chosen from the set $S$, the probabilities $P(\mathrm{Ref}(x) = x_j \mid S)$ must sum to 1 over all $j$ [3]. Thus, we can consider the following probability:
$$P(\mathrm{Ref}(x) = x_j \mid S) = \frac{k(d_w(x, x_j))}{\sum_{m=1}^{n} k(d_w(x, x_m))}. \tag{3}$$
Now consider this randomized classifier under the leave-one-out strategy, in which the label of $x_i$ is predicted using $S^{-i}$, the training set with the point $(x_i, l_i)$ removed. The probability that point $x_j$ is picked as the reference point for $x_i$ is
$$p_{ij} = P(\mathrm{Ref}(x_i) = x_j \mid S^{-i}) = \frac{k(d_w(x_i, x_j))}{\sum_{m \neq i} k(d_w(x_i, x_m))}, \tag{4}$$
and the leave-one-out probability of correct classification of observation $i$ using $S^{-i}$ is
$$p_i = \sum_{j \neq i} p_{ij}\, l_{ij}, \qquad l_{ij} = \begin{cases} 1 & \text{if } l_i = l_j, \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$
With a regularization term added, the average leave-one-out probability of correct classification by the randomized classifier can be expressed as
$$F(w) = \frac{1}{n} \sum_{i=1}^{n} p_i - \lambda \sum_{r=1}^{p} w_r^2, \tag{6}$$
where $\lambda$ is the regularization parameter and $F(w)$ depends on the weight vector $w$. The Neighborhood Component Analysis procedure tries to maximize $F(w)$ with respect to $w$.
Many of the weights in $w$ will vanish because of the regularization. Given $\lambda$, we can find the vector $w$ by minimizing
$$f(w) = -F(w), \qquad \hat{w} = \arg\min_{w} f(w). \tag{7}$$
For more details about the regularized objective function, please refer to [3].
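To make the objective more concrete, the following is a minimal sketch of how the regularized leave-one-out objective F(w) could be evaluated for a given weight vector. It assumes a weighted L1 distance with squared weights and the kernel k(z) = exp(-z/σ); the function name `nca_objective`, the parameter `sigma`, and the toy data in the usage note are illustrative assumptions, not details taken from this study.

```python
import math

def nca_objective(X, labels, w, lam, sigma=1.0):
    """Regularized leave-one-out objective F(w) for NCA feature weighting.

    Assumptions of this sketch: d_w(x_i, x_j) = sum_r w_r^2 * |x_ir - x_jr|
    and the kernel k(z) = exp(-z / sigma).
    """
    n = len(X)
    total = 0.0
    for i in range(n):
        # Kernel similarity of x_i to every other point (x_i itself is left out).
        sims = []
        for j in range(n):
            if j == i:
                sims.append(0.0)
                continue
            d = sum(wr * wr * abs(a - b) for wr, a, b in zip(w, X[i], X[j]))
            sims.append(math.exp(-d / sigma))
        norm = sum(sims)
        # p_i: probability that the randomly chosen reference point shares x_i's label.
        p_i = sum(s for s, l in zip(sims, labels) if l == labels[i]) / norm
        total += p_i
    # Average leave-one-out accuracy minus the penalty on the squared weights.
    return total / n - lam * sum(wr * wr for wr in w)
```

For example, on the toy set `X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [1.1, 0.8]]` with labels `[0, 0, 1, 1]`, only the first feature separates the classes, so `w = [1.0, 0.0]` scores higher than `w = [0.0, 1.0]`. A full feature selection run would additionally maximize this objective over w, which the sketch does not attempt.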

LSTM
The LSTM algorithm is one type of Recurrent Neural Network that deals mostly with sequential input data. The cell state is the key to LSTMs; it is the direct path from $C_{t-1}$ to $C_t$ shown in the upper part of Figure 1. The cell state is similar to a production chain: the information flows straight through, interacting only with a few linear operations such as addition and multiplication. The state depends on these interactions, and if there are no interactions, it flows along unchanged. The LSTM block removes information from or adds information to the cell state through gates, which are structures that optionally let information through. These gates can be implemented by sigmoid functions. The sigmoid function outputs a value between 0 and 1: a value near 0 blocks the information flow and a value near 1 lets it through. In this way, the gates control how much information flows through. Three such gates are available in an LSTM cell, and together they determine the final cell state. The neuron shown in Figure 1 is described, in the standard formulation, by the following functions:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f),$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i),$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C),$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$
$$Q_t = \sigma(W_Q [h_{t-1}, x_t] + b_Q),$$
$$h_t = Q_t \odot \tanh(C_t),$$
where $f_t$ is the activation vector of the forget gate, $\sigma$ is the sigmoid function, the $W$ terms are weight matrices to be learned during training, $x_t$ is the input vector to the LSTM unit, the $b$ terms are bias vectors to be learned during training, $i_t$ is the activation vector of the input gate, $\tilde{C}_t$ is the vector of candidate cell state values, $C_t$ is the cell state vector, $Q_t$ is the activation vector of the output gate, $h_t$ is the output vector of the LSTM unit, and $\odot$ denotes element-wise multiplication.
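The gate mechanics described above can be illustrated with a single forward step of a standard LSTM cell. This is a sketch under stated assumptions rather than the network used in this study: each gate is assumed to act on the concatenation of the previous output and the current input, and the names `lstm_step`, `affine`, and the `params` layout are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a standard LSTM cell.

    params maps 'f', 'i', 'c', 'o' (forget gate, input gate, candidate state,
    output gate) to a (W, b) pair acting on the concatenation [h_prev, x_t].
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = gate("f", sigmoid)      # how much of the old cell state to keep
    i_t = gate("i", sigmoid)      # how much of the candidate state to add
    c_hat = gate("c", math.tanh)  # candidate cell state values
    o_t = gate("o", sigmoid)      # how much of the cell state to expose
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With the forget gate saturated near 1 and the input gate near 0, the cell state passes through the step essentially unchanged, which is the "flows along without changes" behaviour described in the text.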

Data Preparation and Results
In this research, we used kidney cancer RNA-sequence data represented by the miRNA expression that is publicly available on The Cancer Genome Atlas (TCGA) database website. For kidney cancer, three TCGA and two TARGET projects defined the most relevant kidney cancer types as High-Risk Wilms Tumor, Kidney Renal Clear Cell Carcinoma, Kidney Renal Papillary Cell Carcinoma, Kidney Chromophobe, Rhabdoid Tumor and Clear Cell Sarcoma of the Kidney. Figure 2 shows the project name of each sub-type and the percentage of cases for each project that are available in the TCGA data repository. From Table 1, one can see that the miRNA data is associated with the kidney cancer sub-types. The column "No. of Files" represents the available miRNA files for the cases. Please note that for some cases, more than one file is present because multiple readings were taken for these cases at the time of diagnosis.

Data Preparation and Categorization
We considered all kidney cancer cases for which miRNA information was provided. These cases represent the samples taken from patients who had kidney cancer belonging to one of five different cancer sub-types. Some individual cases had more than one miRNA sequence data file, represented by both isoform expression quantification and miRNA expression quantification. In our study, we only considered the miRNA expression quantification data because it is tabulated in a balanced way, i.e., all cases had the same number of miRNAs. Figure 3 shows the schematic diagram for the data preparation procedure. The data was downloaded from the TCGA server and then categorized using a MATLAB program. First, the information related to each case was matched with its corresponding miRNA quantification files using the file ID. The reads-per-million value of each miRNA was used in our experiment. The miRNA files were then matched with those in the clinical data, stored in a JavaScript file, using the case ID. This clinical data provides the record of cancer sub-types and other patient clinical information such as age, sex, and demographics. The above procedure of preparing the cancer data facilitated automatic classification of kidney cancer sub-types based on the miRNA expression quantification information of the patients.
With the procedure given above, we obtained all of the miRNAs provided. However, we noticed that many miRNAs had null readings. By removing those fields, we ended up with a total of 1627 out of 1881 miRNAs and 1221 cases, which were grouped into the five subtypes as shown in Figure 2 and Table 1.
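The removal of null miRNAs can be sketched as a simple column filter. The sketch assumes that a "null" miRNA is one whose reading is zero in every case; the text does not state the exact criterion, and the function name `drop_null_mirnas` and the list-of-rows layout are illustrative choices.

```python
def drop_null_mirnas(names, matrix):
    """Keep only the miRNA columns with at least one non-zero reading.

    names:  list of miRNA identifiers, one per column.
    matrix: list of per-case rows of reads-per-million values.
    Assumption: a miRNA counts as null when it is zero in every case.
    """
    keep = [j for j in range(len(names)) if any(row[j] != 0 for row in matrix)]
    kept_names = [names[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_names, kept_matrix
```

Applied to the full table of 1881 miRNAs, a filter of this kind would yield the reduced feature set that is passed on to normalization and feature selection.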

Results and Discussions
After we categorized and prepared the data from the TCGA repository, we checked the data points and found that the data variation was too large; therefore, the data needed to be normalized. For this purpose, a logarithmic kernel was first applied to reduce the variance of the data points, and then a normalization procedure was applied to ensure that the data points had zero mean and unit variance.
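A minimal sketch of this two-step normalization is given below. The exact logarithmic kernel is not specified in the text, so the sketch assumes log2(x + 1) followed by per-miRNA standardization with the population standard deviation; the function name `log_zscore` is hypothetical.

```python
import math
import statistics

def log_zscore(matrix):
    """Log-transform the readings, then standardize each miRNA column.

    Assumptions of this sketch: the logarithmic kernel is log2(x + 1), and
    zero mean / unit variance is enforced per column (per miRNA).
    """
    logged = [[math.log2(v + 1.0) for v in row] for row in matrix]
    columns = list(zip(*logged))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    # Constant columns have zero spread, so they are mapped to 0.0.
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in logged
    ]
```

The log step compresses the range of the reads-per-million values, so that a few highly expressed miRNAs do not dominate the distances used later in feature selection.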
Once the data was ready, we fed the labeled data files to a feature selection algorithm. We hypothesized that using all the miRNAs for cancer subtype classification would produce the best possible outcomes. We tested this hypothesis with the following experiment. First, we used all the data points without dimension reduction, and then we used the NCA algorithm without regularization for the classification of the five kidney cancer subtypes. Indeed, we found that the former in general produced better results. However, one would not know which miRNAs are more important for the classification if all the miRNAs are used. It is therefore important to select the miRNAs that have high discriminative capability.
To find the most discriminative miRNAs, the value of λ in the NCA algorithm needs to be tuned. For this, a five-fold cross-validation test was performed. For each fold, a randomly selected 80% of the data was used as the training set and the remaining 20% as the test set. To produce reproducible results, the procedure was repeated 10 times. Figure 4 shows the average loss values of the five-fold validation versus the λ values.
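The repeated cross-validation protocol can be sketched as a generator of index splits. The shuffling scheme, the seed, and the function name `repeated_kfold` are assumptions of this sketch; only the 10 repetitions, the five folds, and the 80%/20% split come from the text.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train, test) index lists for repeated randomized k-fold validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random grouping for every repetition
        folds = [order[k::n_folds] for k in range(n_folds)]
        for k in range(n_folds):
            test = folds[k]
            train = [i for m, fold in enumerate(folds) if m != k for i in fold]
            yield train, test
```

For the 1221 cases in this study, the generator produces 50 splits, each holding out about one fifth of the cases. Tuning λ would then amount to averaging the validation loss over these splits for each candidate value and keeping the value with the lowest average loss.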
Using the tuned λ value, NCA was applied to find the features with the largest weights, i.e., the miRNAs that have the greatest discriminative power among the kidney cancer subtypes. Figure 5 shows the selected miRNAs according to their NCA feature weights. A higher feature weight corresponds to better discriminative power for subtype classification. Table 2 lists the selected feature weights with the corresponding miRNA names and indices, where the indices are those at which the miRNAs appear in the kidney cancer quantified RNA sequence files in TCGA.
In the classification phase, we adopted an LSTM network with two LSTM layers. The first hidden layer had 500 neurons and the second had 250 neurons. The hardware platform was an NVIDIA TITAN X GPU. Ten runs of randomized five-fold validation were adopted for data analysis. The procedure largely followed the Data Analysis Protocol (DAP) defined by the US-FDA MAQC-II initiative [31]. Both the selected miRNA subset and the complete miRNA dataset were used to train the LSTM network. In each five-fold validation, all the data points were randomized before being split into training and testing sets, and the results were averaged to compute the total confusion matrices given in Figures 6 and 7, where the class indices are as follows: class 1 = WT, class 2 = KICH, class 3 = KIRC, class 4 = KIRP, and class 5 = RT. It can be observed in this experimental study that using only the 35 selected miRNAs as features performs competitively with using all the available miRNAs.
One issue with the results presented in Figures 6 and 7 was that the dataset was not balanced. For instance, the number of Kidney Renal Clear Cell Carcinoma cases (class 3) is far greater than that of the Rhabdoid Tumor cases (class 5); refer to Table 1. Therefore, the training set for each class needs to be balanced to obtain unbiased classification results. For this purpose, a data augmentation procedure, given in [32], was applied. Small, random Gaussian noise with zero mean and a variance of 0.02 was added to the data points of those classes with fewer cases, resulting in a balanced dataset. It is essential that the test set samples are removed before data augmentation. One-fifth of each subtype dataset was randomly selected and reserved for this purpose in a five-fold validation process, and then the training set was augmented to have 475 training samples for each subtype set. This process was repeated 10 times to achieve the 10 × 5 cross-validation, and the results were averaged to construct the confusion matrices shown in Figures 8 and 9. To further assess the classification performance, the Matthews Correlation Coefficient (MCC) [6] was adopted. The MCC is given by the following equation:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}.$$
In order to compute the MCC, the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts were extracted as shown in Tables 3 and 4. The multiclass generalization of the MCC [33], computed from the cells of the confusion matrix C, is given in Table 5. The proposed method was able to achieve an average classification accuracy of 97.2% and an MCC of 0.949 if all the available miRNAs were used and the dataset was balanced. If the dataset was not balanced, the accuracy was reduced to 92.9% and the MCC value to 0.887. With the 35 selected miRNAs, the accuracy was 95.4% in both cases and the MCC values were 0.92 and 0.924, respectively (Figures 6-9 and Table 5).
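The augmentation step can be sketched as follows for the training samples of a single subtype, after the test fold has been set aside. The text gives the noise parameters (zero mean, variance 0.02) and the target of 475 training samples; how the base sample for each noisy copy is chosen, and what happens to classes that already reach the target, are not stated, so the random choice of a base sample and the pass-through of larger classes are assumptions of this sketch, as is the name `augment_class`.

```python
import math
import random

def augment_class(samples, target=475, variance=0.02, seed=0):
    """Pad one class's training samples up to `target` with noisy copies.

    Each added sample is a randomly chosen original plus zero-mean Gaussian
    noise of the given variance. Classes already at or above `target` are
    returned unchanged (an assumption of this sketch).
    """
    rng = random.Random(seed)
    sd = math.sqrt(variance)  # random.gauss expects a standard deviation
    out = [list(s) for s in samples]
    while len(out) < target:
        base = rng.choice(samples)
        out.append([v + rng.gauss(0.0, sd) for v in base])
    return out
```

Because the function only ever sees training samples, the held-out test fold contains no synthetic points, in line with the requirement that test samples be removed before augmentation.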
It is clear that, using the selected set of miRNAs, one can achieve a more consistent performance in terms of both classification accuracy and MCC [34]. The results also show that, without data augmentation, the 35 selected miRNAs performed better than all of the available miRNAs (95.4% versus 92.9% accuracy; Table 5).
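The MCC values discussed above are derived from confusion matrices. As a sketch, the per-class counts and the binary MCC can be computed as shown below. The orientation of the confusion matrix (rows are true classes, columns are predicted classes) and the function names are assumptions; the multiclass generalization of [33] is not reproduced here.

```python
import math

def one_vs_rest_counts(conf, k):
    """Return (TP, TN, FP, FN) for class k.

    conf[i][j] counts samples of true class i predicted as class j
    (this orientation is an assumption of the sketch).
    """
    total = sum(sum(row) for row in conf)
    tp = conf[k][k]
    fn = sum(conf[k]) - tp                  # class k predicted as something else
    fp = sum(row[k] for row in conf) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Binary Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike accuracy, the MCC uses all four counts, which makes it more informative when the classes are as unevenly populated as the kidney cancer subtypes considered here.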

Conclusions
In this paper, we reported a machine learning approach for the classification of five subtypes of kidney cancer. In this approach, the NCA procedure was applied to select the most discriminative miRNAs as features, and the LSTM neural network was designed to classify the given patient data files into five subtypes of kidney cancer. The Data Analysis Protocol was largely adopted to control the experiments, and the Matthews Correlation Coefficient, together with accuracy, was used to assess the classification performance.
We demonstrated that with all of the available miRNAs, the proposed method produced an accuracy of 97.2% and an MCC value of 0.949 using an augmented dataset. We further demonstrated that with a subset of 35 miRNAs, the method achieved a more consistent classification performance for both balanced and unbalanced datasets in terms of both accuracy and MCC values. This demonstrates the importance of the most discriminative miRNAs in cancer subtype diagnosis and classification.
We hope that the proposed method can be a step forward in the direction of early diagnosis of kidney cancers, which in turn will allow physicians to have better options in treating kidney cancer patients. We also hope that the miRNAs identified in this study can be used as biomarker candidates for kidney cancer subtype classification, though we understand that the effectiveness of these selected miRNAs must be validated by wet-lab experiments and further clinical studies.