RCCC_Pred: A Novel Method for Sequence-Based Identification of Renal Clear Cell Carcinoma Genes through DNA Mutations and a Blend of Features

Hassan, Arfa; Alkhalifah, Tamim; Alturise, Fahad; Khan, Yaser Daanial

doi:10.3390/diagnostics12123036

by Arfa Hassan ¹, Tamim Alkhalifah ²,*, Fahad Alturise ² and Yaser Daanial Khan ¹

¹ Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan

² Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass 58892, Qassim, Saudi Arabia

* Author to whom correspondence should be addressed.

Diagnostics 2022, 12(12), 3036; https://doi.org/10.3390/diagnostics12123036

Submission received: 27 October 2022 / Revised: 24 November 2022 / Accepted: 30 November 2022 / Published: 3 December 2022

(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Early diagnosis is crucial to saving lives from cancer. One route to early diagnosis lies in the identification of cancer driver genes and their mutations, which can substantially reduce the mortality rate of this deadly disease. However, identifying cancer driver gene mutations through experimental mechanisms can be expensive, slow, and laborious. Computational strategies that can predict cancer growth early, effectively, and accurately are therefore highly needed. Herein, we aim to predict clear cell renal carcinoma (RCCC) at the level of the genes, using genomic sequences. The mutation data were taken from the IntOgen Cancer Mutations Browser, and the standard DNA sequences of all genes were taken from the NCBI database. Using the cancer-associated mutation information from IntOgen, the benchmark dataset was generated by applying the mutations to the original sequences. After extensive feature extraction, the dataset was used to train an ANN + Hist Gradient Boosting model to classify RCCC genes, other cancer-associated genes, and non-cancerous/unknown (non-tumor driver) genes. In an independent dataset test, the observed accuracy was 83%, whereas 10-fold cross-validation and Jackknife validation yielded 98% and 100% accuracy, respectively. The proposed predictor, RCCC_Pred, is able to identify RCCC genes with high accuracy and efficiency and can help scientists and researchers predict and diagnose cancer at its early stages.

Keywords: clear cell renal carcinoma; cancer driver mutations; machine learning; prediction; statistical moments

1. Introduction

A change in a human DNA sequence is called a mutation [1,2]. Mutations can be beneficial or harmful and are used by scientists to study human health, cell development, and related processes. Health scientists have already done considerable work to identify mutations in humans [3], because this identification could serve as a foundation for personalized medicine [4]. Furthermore, genetic engineering can play an important role in preventing disease and in enabling early diagnosis to reduce the death rate [5,6].

Gene mutations play a vital role in the generation of cancerous cells [7], while genetic engineering is capable of predicting deadly diseases such as cancer even before symptoms appear in the human body [8,9]. Many techniques are available for this purpose, but the traditional method is lab evaluation, which is time-consuming as well as expensive [10,11]. Cancer is a disease that can affect all body systems and tissues in human beings. There are more than 100 types of cancer and, in 2010 alone, cancer was responsible for 1 out of 8 deaths worldwide [12]. Kidney malignancy is not a single disease; it comprises many types, for example, clear cell, type 1 papillary, type 2 papillary, chromophobe, TFE3, TFEB, and oncocytoma [13].

Renal cell carcinoma (RCC), or kidney cancer, occurs all over the world, but most cases are reported in North America. Gender and age play an important role in the occurrence of the disease: males above 65 years old are considered more prone to it, and obesity is also one of its main causes [14,15]. It initially starts from the outer walls of the kidney [16]. Based on specific characteristics, RCC is divided into three main types. Clear cell RCCs are the most common type of renal cell carcinoma. Researchers call them clear cells because, most of the time, the cells within the tumor are clear. Papillary cell RCCs are the second most common type of kidney cancer. Papillary cells are distinguished by small, rounded protuberances on their surface. These tumors present as either Type 1 or the more aggressive Type 2 forms. Chromophobe cell RCCs are the third most common form of kidney malignancy. Scientists call them chromophobe because these cells do not acquire colored stains easily [17].

Various previous studies have targeted the prediction of cancer driver gene mutations. Luo et al., in 2019, proposed deepDriver, which uses a deep convolutional neural network for cancer driver mutation prediction. The method can predict breast and colorectal cancer, with AUC scores of 0.98 and 0.97, respectively [18]. In 2018, Wand et al. proposed a Bayesian hierarchical modeling algorithm for cancer driver mutation prediction named rDriver. They examined 3080 samples of 8 different kinds of cancer, in which rDriver predicted 1389 affected samples. The rDriver predictions were evaluated using engineered cell line models, giving a positive predictive value of 0.94 for the PIK3CA gene [19].

In 2017, Pi-Jung et al. proposed a CNV method for cancer driver mutation prediction. For simulation and results, they used four TCGA datasets (BRCA, HNSC, KIRC, and THCA), covering breast, head and neck, thyroid, and kidney cancer genes. They also discovered rare driver genes in their work, which relied on gene sequence length [20]. A driver mutation provides a growth advantage to the affected tumor cell. In 2013, Mao et al. introduced the CanDrA tool for the prediction of cancer driver mutations. The proposed method was based on a set of 95 structural and evolutionary features obtained by using 10 functional prediction algorithms such as CHASM, SIFT, and Mutation Assessor. They used two mutation datasets, GBM and OVC [21].

Most work on identifying RCCC driver genes uses experimental lab procedures, and studies that used computational approaches, such as Kocak et al. [22], performed poorly because of limited data. Their study focused only on the PBRM1 gene, and the training sample size was small. To address this problem, the present study proposes a prediction model for renal clear cell carcinoma mutations in gene sequences using machine learning algorithms. We curated a comparatively large dataset from the IntOgen and NCBI databases and, after thorough feature extraction, trained different machine learning classifiers. After an exhaustive evaluation, the best-performing classifier was selected as the final model and compared with existing methods. The proposed method is easy to use, efficient, and accurate: scientists and researchers need only sequence information and can avoid laborious experimentation. The intended clinical use is that scientists, researchers, and others in the clinical community can use a system built on the proposed method and obtain results by inputting a gene sequence, which could come from human biological samples. The system will help them classify the sequence as a potential RCCC gene or not.

2. Materials and Methods

A method named RCCC_Pred is proposed in the present study for the identification of renal clear cell carcinoma driver genes and associated mutations. The detailed graphical representation of the proposed method is shown in Figure 1. Creating a benchmark dataset from raw data is a great challenge for this work, considering the importance of human life and health.

2.1. Dataset Collection and Pre-Processing

As human health is a sensitive issue, datasets for human-health-related applications need to be accurate. For this research work, a raw dataset was collected from intogen_driver_mutations_catalog-2016.5 (https://www.intogen.org, accessed on 1 November 2021). This raw dataset contains basic information such as the gene name, ID, position (the position in the gene at which the mutation occurs), which nucleotide is mutated, and which nucleotide replaces it. The standard nucleotide sequence of each gene was taken from NCBI (https://www.ncbi.nlm.nih.gov/, accessed on 10 November 2021). To construct the benchmark dataset, the system reads each mutation from a CSV file, updates the sequence accordingly, and saves it in .txt format in a specific folder. At the end of the process, the folder had approximately 11,000 samples, and every sample was a unique entry. This raw dataset contained 26,000 unique entries of 11,000 different kinds of genes, out of which 4529 samples were selected for feature extraction after redundancy removal through CD-HIT [23], using a threshold of 0.7. The independent test dataset in the present study was created with a train–test split: the original data were split in a 70:30 ratio, i.e., 70% of the original data were used for training the model, and the remaining 30% were used for testing.
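The mutation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single-nucleotide substitutions, 1-based positions, and hypothetical column names (`gene`, `position`, `ref`, `alt`) that are not taken from the IntOgen catalog.

```python
def apply_substitution(sequence, position, ref, alt):
    """Return a copy of `sequence` with the base at 1-based `position` changed from `ref` to `alt`."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[index] != ref:
        # Guard against applying a mutation to the wrong reference sequence.
        raise ValueError(f"expected {ref!r} at position {position}, found {sequence[index]!r}")
    return sequence[:index] + alt + sequence[index + 1:]


def build_mutated_samples(reference_sequences, mutation_rows):
    """Create one mutated sequence per mutation row.

    `reference_sequences` maps a gene name to its standard sequence;
    `mutation_rows` is an iterable of dicts with the hypothetical keys
    'gene', 'position', 'ref' and 'alt' (for example from csv.DictReader).
    """
    samples = []
    for row in mutation_rows:
        reference = reference_sequences[row["gene"]]
        mutated = apply_substitution(reference, int(row["position"]), row["ref"], row["alt"])
        samples.append((row["gene"], mutated))
    return samples
```

Each returned pair could then be written to its own text file, as the section describes; file handling is omitted here to keep the sketch short.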

2.2. Feature Extraction

After obtaining the raw data, thorough feature extraction was performed on the gene sequences to obtain a quantitative description of the data [24]. The system performs feature extraction in two layers using statistical moments [25]. Gene sequence data are position-sensitive, so the first layer of feature extraction uses position-sensitive raw, Hahn, and central moments [26]. The results are saved into separate files. In the second layer of feature extraction, the first four statistical moments and the maximum and minimum values are computed and stored in a CSV file. The mathematical equation of the raw moment is [27]

R_{a b} = \sum_{x = 1}^{n} \sum_{y = 1}^{n} x^{a} y^{b} β_{x y}

(1)

To calculate the raw moment, the sequence is arranged into an n × n matrix X’, where β_{x y} denotes the element of X’ at row x and column y. The mathematical expression for the central moment is shown in Equation (2).

C_{a b} = \sum_{x = 1}^{n} \sum_{y = 1}^{n} {(x - \bar{e})}^{a} {(y - \bar{d})}^{b} β_{x y}

(2)
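As a rough illustration of Equations (1) and (2), the sketch below computes raw and central moments of a sequence arranged into a square matrix. Several details are assumptions rather than facts from the text: the numeric encoding of nucleotides (A=1, C=2, G=3, T=4), zero-padding to fill the matrix, and taking the centroid (\bar{e}, \bar{d}) as R_{10}/R_{00} and R_{01}/R_{00}.

```python
import math

# Assumed integer encoding of nucleotides; the paper does not state the one it uses.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def sequence_to_matrix(sequence):
    """Arrange a sequence row by row into the smallest n x n matrix, padding with zeros."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = math.isqrt(len(sequence) - 1) + 1  # smallest n with n * n >= len(sequence)
    values = [ENCODING.get(base, 0) for base in sequence.upper()]
    values += [0] * (n * n - len(values))
    return [values[row * n:(row + 1) * n] for row in range(n)]


def raw_moment(matrix, a, b):
    """R_ab = sum over x, y of x^a * y^b * beta_xy, with 1-based indices."""
    return sum(
        (x ** a) * (y ** b) * beta
        for x, row in enumerate(matrix, start=1)
        for y, beta in enumerate(row, start=1)
    )


def central_moment(matrix, a, b):
    """C_ab = sum over x, y of (x - e_bar)^a * (y - d_bar)^b * beta_xy."""
    mass = raw_moment(matrix, 0, 0)
    e_bar = raw_moment(matrix, 1, 0) / mass  # assumed definition of the row centroid
    d_bar = raw_moment(matrix, 0, 1) / mass  # assumed definition of the column centroid
    return sum(
        ((x - e_bar) ** a) * ((y - d_bar) ** b) * beta
        for x, row in enumerate(matrix, start=1)
        for y, beta in enumerate(row, start=1)
    )
```

With this centroid definition, the first-order central moments vanish, which is a convenient sanity check on an implementation.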

The Hahn moment requires a square matrix for calculation as input, so the system uses the following equation to calculate the Hahn of n order polynomial [28,29,30].

Ha_{n}^{x, y} (b, N) = {(N + y - 1)}_{n} {(N - 1)}_{n} \sum_{Q = 0}^{n} {(- 1)}^{Q} \frac{{(- n)}_{Q} {(- b)}_{Q} {(2 N + x + y - n - 1)}_{Q}}{{(N + y - 1)}_{Q} {(N - 1)}_{Q}} \frac{1}{Q!}

(3)

The machine learning algorithm requires a low-dimensional dataset for classification. In this low-dimensional dataset, every row represents a unique entity. For this purpose, the proposed system converts each unique file into a 1D vector using statistical moments and stores the vectors in CSV files. After all the calculations, the final dataset contains 6 features for each entity, which are provided in Supplementary Information S1. The statistical moment is computed as [31]

moment_{p q} = \frac{1}{n} \sum_{j = 1}^{n} {(ha_{j} - \bar{ha})}^{p q}

(4)
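The second-layer reduction to six features per entity might look like the sketch below. It assumes that the "first four moments" are the mean followed by the central moments of order 2, 3, and 4, in the spirit of Equation (4); the paper does not spell out this choice, so treat it as illustrative.

```python
def summary_features(values):
    """Reduce a list of first-layer moment values to six numbers:
    [M1, M2, M3, M4, Min, Max].

    Assumption: M1 is the mean and M2..M4 are central moments of order 2..4.
    """
    if not values:
        raise ValueError("values must not be empty")
    n = len(values)
    mean = sum(values) / n

    def central(order):
        # (1 / n) * sum of (value - mean) ** order
        return sum((v - mean) ** order for v in values) / n

    return [mean, central(2), central(3), central(4), min(values), max(values)]
```

One such six-element row per mutated sequence would give the kind of low-dimensional table that the classifier is trained on.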

2.3. Classification

Using the feature vectors extracted in the previous phase, a classifier is trained [32]. After feature extraction, 4529 unique entities are found, which are divided into three classes: “kindey_tumor_driver”, “other_tumer_driver”, and “Unknown (non-tumour driver)”. A feed-forward artificial neural network (ANN) with backpropagation empowered by the hist gradient descent algorithm is used for prediction purposes [33]. ANN is an AI algorithm that is inspired by the human brain [34]. An ANN takes inputs, computes their weighted sum, passes the result through an activation function, and proceeds toward the output. The representation of the ANN is explained in Figure 2.

Here, M1 to M4 represent the first 4 moments, while Min represents the minimum value of the computed moment and Max contains the maximum value of the computed moment for each sample. The mathematical equation of ANN is [35]

ANN_{Q} = \sum_{j = 1}^{n} W_{Q j} y_{j}

(5)
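Equation (5) is the weighted sum computed by a single unit Q before its activation is applied. The sketch below shows that sum and one layer built from it; the sigmoid activation and the example weights are assumptions chosen for illustration, since the text does not specify them.

```python
import math


def weighted_sum(weights, inputs):
    """ANN_Q = sum over j of W_Qj * y_j for one unit Q (Equation (5))."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * y for w, y in zip(weights, inputs))


def sigmoid(value):
    """Assumed activation function; the paper does not name the one it uses."""
    return 1.0 / (1.0 + math.exp(-value))


def layer_forward(weight_matrix, inputs):
    """Apply every unit of a layer to the same input vector."""
    return [sigmoid(weighted_sum(row, inputs)) for row in weight_matrix]
```

For the six inputs described above (M1 to M4, Min, and Max), each row of `weight_matrix` would hold six weights.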

Gradient descent (GD) and adaptive learning methods are used for training the operational algorithm. GD is an optimization algorithm that is used to minimize the cost function and is very useful for MCC calculation. GD shows the best results in the analysis of data [24,36]. If the cost function of GD is [37]

f (c, d) = \frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - (c z_{i} + d))}^{2}

(6)

then the mathematical equation of GD is [38]

f^{'} (c, d) = \left[ \begin{matrix} \frac{\partial f}{\partial c} \\ \frac{\partial f}{\partial d} \end{matrix} \right] = \left[ \begin{matrix} \frac{1}{N} \sum_{i = 1}^{N} - 2 z_{i} (y_{i} - (c z_{i} + d)) \\ \frac{1}{N} \sum_{i = 1}^{N} - 2 (y_{i} - (c z_{i} + d)) \end{matrix} \right]

(7)
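To make the cost function and its gradient concrete, the following sketch runs plain gradient descent on the mean squared error of Equation (6), updating the slope c and intercept d with the partial derivatives of Equation (7). The learning rate, step count, and starting point are arbitrary illustrative choices, and this is the textbook linear case rather than the training procedure of the proposed model.

```python
def gradient(c, d, z, y):
    """Partial derivatives of f(c, d) = (1/N) * sum((y_i - (c * z_i + d)) ** 2)."""
    n = len(z)
    residuals = [yi - (c * zi + d) for zi, yi in zip(z, y)]
    df_dc = sum(-2 * zi * r for zi, r in zip(z, residuals)) / n
    df_dd = sum(-2 * r for r in residuals) / n
    return df_dc, df_dd


def gradient_descent(z, y, learning_rate=0.05, steps=2000):
    """Minimise the cost by repeatedly stepping against the gradient."""
    c, d = 0.0, 0.0  # arbitrary starting point
    for _ in range(steps):
        df_dc, df_dd = gradient(c, d, z, y)
        c -= learning_rate * df_dc
        d -= learning_rate * df_dd
    return c, d
```

On data that lie exactly on a line, the procedure should recover that line's slope and intercept, and the gradient should be zero there.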

2.4. Model Evaluation

The most important part of any prediction model development is the accurate assessment of the model [39]. This evaluation is based on different factors. In this work, various metrics were computed to assess the stability and accuracy of the model, namely the specificity, sensitivity, accuracy, and Matthews correlation coefficient (MCC) [33,40,41]. These measures can be mathematically denoted as

Specificity = \frac{TN}{TN + FP}

(8)

Sensitivity = \frac{TP}{TP + FN}

(9)

Accuracy = \frac{TP + TN}{TP + FN + TN + FP}

(10)

MCC = \frac{(TP * TN) - (FP * FN)}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}

(11)

In Equations (8)–(11), as described in [42], TP represents the true positives, i.e., the number of data samples whose class label is positive and which the system predicts as positive. TN represents the true negatives, which are the samples whose label is negative and which the system also predicts as negative. FP represents the false positives, i.e., negative samples incorrectly predicted as positive by the predictor. Lastly, FN represents the false negatives, i.e., positive samples incorrectly predicted as negative by the predictor [43]. As this description shows, these measures are usually used for binary classification problems. However, in the present study, we had three different classes, i.e., RCCC, other tumors, and Unknown (non-tumor driver). Thus, to map our problem onto these measures, we used the scheme shown in Table 1.
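One common way to apply Equations (8)–(11) to a three-class problem is a one-vs-rest mapping, in which one class is treated as positive and the other two as negative. The sketch below uses that mapping as an assumption; it is not necessarily the scheme of Table 1, and the class labels in the usage example are placeholders.

```python
import math


def one_vs_rest_counts(true_labels, predicted_labels, positive_class):
    """Count TP, TN, FP, FN when `positive_class` is positive and all other classes are negative."""
    tp = tn = fp = fn = 0
    for truth, prediction in zip(true_labels, predicted_labels):
        if truth == positive_class:
            if prediction == positive_class:
                tp += 1
            else:
                fn += 1
        elif prediction == positive_class:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def evaluation_metrics(tp, tn, fp, fn):
    """Specificity, sensitivity, accuracy and MCC as in Equations (8)-(11).

    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "specificity": ratio(tn, tn + fp),
        "sensitivity": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

A short usage example with placeholder labels:

```python
truth = ["RCCC", "RCCC", "other", "unknown", "other", "unknown"]
predicted = ["RCCC", "other", "other", "unknown", "RCCC", "unknown"]
counts = one_vs_rest_counts(truth, predicted, positive_class="RCCC")
scores = evaluation_metrics(*counts)
```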

3. Results

In the proposed method, after data collection, the first step in the pre-processing layer was to perform the mutation and save the results. Later on, for all sequential data, features were computed and fed to machine learning classifiers for evaluation. For evaluating the training accuracy, self-consistency testing was performed in which the same training and testing data were used. To evaluate and validate the outcome of the prediction model, the proposed system was tested by using three different techniques, i.e., independent testing, K-cross-validation testing, and Jackknife testing.

3.1. Training Accuracy

The training accuracy was evaluated using self-consistency testing [44]. For this purpose, the same training and testing data were used. The evaluation scores are shown in Table 2.

Table 2 illustrates that the model was trained accurately, identifying all positive/negative samples correctly.

3.2. Validation of the Model through 10-Fold Cross-Validation and Jackknife Testing

The validation of results was performed using various evaluation metrics such as accuracy, specificity, sensitivity, and the Matthews correlation coefficient. The test methods adopted were independent testing, k-fold cross-validation testing, and Jackknife testing [3].

When a separate test dataset is unavailable, the best approach to test a predictive model is k-fold cross-validation [45]. In k-fold cross-validation, the dataset is split into k disjoint folds, and the model is validated k times. In each iteration, k − 1 folds are used for training the model, while the remaining fold is used for testing. A different fold is used for testing in each iteration [46]. For the evaluation of the proposed model, the value of k was chosen as 10. The overall mean accuracy score is 0.98, and the scores of the remaining measures, as well as those for all folds, are shown in Table 3. The box plot is shown in Figure 3, while the ROC curves for all classes using 10-fold cross-validation are shown in Figure 4.
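The fold construction described above can be sketched in a few lines. The shuffling, the fixed seed, and the round-robin assignment of samples to folds are illustrative choices, not details taken from the study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k disjoint folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps the fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold. Setting `k` equal to the number of samples leaves a single sample in every test fold, which is the leave-one-out case.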

To further elaborate on the 10-fold cross-validation, training and validation loss for each fold is shown in Figure 5.

In Figure 5, the curves are plotted between the number of iterations and the losses. The neural network was trained until convergence by using an early stopping criterion, while the maximum number of iterations was set to 3000. However, it was observed that in all 10 folds, the model converged within 20–30 epochs on average. This is why, after that many epochs, the loss curves converge and flatten into straight lines.

For further exhaustive validation of the model, the Jackknife test was used. The Jackknife test is also referred to as leave-one-out cross-validation, and works on the same principle as k-fold cross-validation, with k equal to the number of samples in the dataset [2]. It is known to be the least arbitrary method, as it yields a unique output for a given benchmark. After testing, the accuracy metrics were computed to evaluate the quality of the proposed algorithm. The results for Jackknife validation are shown in Table 4, while the ROC curve is shown in Figure 6.

Using the scores of Jackknife validation, RCCC_Pred was compared with a few existing methods. The results are shown in Table 5.

Herein, we compared the Jackknife testing results of the present study with the existing method by Kocak et al. [22]. The proposed method outperformed the existing method in terms of all accuracy measures.

3.3. Independent Dataset Validation

For any new predictor, it is of great importance to test its ability to predict against unknown data. Here, unknown data are referred to as the data which the model has not seen or observed during the training process [1,2,39]. Therefore, keeping in view the importance of this test, it was performed for the evaluation of the model proposed in the present study. As the whole dataset was created manually in this study, all possible data samples available at that time were already retrieved. Here, we trained the model from scratch using 70% of the data as described in the Methods while testing it for the remaining 30% of samples. The scores were computed along with the ROC curve and are represented in Table 6 and Figure 7, respectively.
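A 70:30 split of the kind described here can be sketched as below. Whether the original split was shuffled or stratified by class is not stated, so the random shuffle and fixed seed are assumptions.

```python
import random


def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and hold out `test_fraction` of them for independent testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The held-out part must not be touched during training.
    return shuffled[n_test:], shuffled[:n_test]
```

The model would then be trained from scratch on the first returned list and scored only on the second.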

3.4. Comparison with Other Classifiers

Besides comparing the proposed method with previously existing methods, we also ran experiments to compare the performance of our method with other classifiers. The benchmark dataset of the proposed method was also used to train other machine learning algorithms, but Hist Gradient boosting showed the best results. The details of the other models are shown in Table 7, which gives the accuracy scores of five classifiers for three different kinds of tests: Jackknife, independent, and cross-validation. To further elaborate on performance, the ROC curves of all five classifiers for the independent dataset testing are shown in Figure 8.

The decision tree showed better performance in independent dataset testing; overall, however, Hist Gradient boosting performed better than the other classifiers. Based on these results, Hist Gradient boosting was chosen as the final model for the proposed RCCC_Pred classifier.

4. Discussion

Herein, we proposed a prediction model named RCCC_Pred for renal clear cell carcinoma mutations in gene sequences using machine learning algorithms. We curated a dataset from the IntOgen and NCBI databases and, after meticulous feature extraction, we trained different machine learning classifiers. After a thorough evaluation, the best-performing classifier was selected as the final model and was compared with existing methods. The decision tree showed better performance in independent dataset testing; overall, however, Hist Gradient boosting outperformed the other classifiers. Based on these results, Hist Gradient boosting was chosen as the final model for the proposed RCCC_Pred classifier.

Previously, a few studies were proposed for renal clear cell carcinoma or other cancers [47]. A few machine-learning-based and experimental approaches have been proposed to study multicellular complexity and tissue specificity [48], as well as to study molecular interactions in cancer [49]. The majority of the work conducted for the identification of RCCC driver genes uses experimental lab procedures. A few research studies used AI-based approaches to predict cancer driver mutations, but their methods used a very limited amount of data. In a previous study reported by Kocak et al. [22], the researchers proposed a machine-learning-based algorithm for kidney cancer prediction at the level of the gene. The sample set for the machine learning consisted of 161 labeled examples of augmented data, of which 74 mutations were recorded in the PBRM1 gene and the other 87 occurred outside the PBRM1 gene. The study focused on only one gene, PBRM1, and the sample size for training was small. With such a tiny dataset, the possibility of overfitting is very high, which affects the accuracy of the system.

In 2020, Kocak et al. [50] further extended their work by considering BAP1 mutation in clear cell renal cell carcinoma. However, the dataset was again limited, comprising only 65 samples. The authors used a Random Forest classifier and classified samples with 84.6% accuracy and an area under the curve of only 0.897. For similar BAP1 mutation status with 54 samples, Feng et al. [51] reported an accuracy of 83% for Jackknife testing using Random Forest. Using image data for clear cell renal cell carcinoma, Acosta et al. proposed a deep-learning-based method for analyzing intertumoral heterogeneity and considered the three most frequently mutated genes, which were BAP1, PBRM1, and SETD2. Overall, the authors achieved an area under the receiver operating characteristic curve of around 0.89.

By considering the importance of clear cell renal cell carcinoma, Chen et al. [52] proposed a deep learning algorithm for the prediction of prognosis and immunotherapeutic response. The authors used data from 3 different cohorts, and samples were around 730 after performing pre-processing. After training the deep learning model for 100 epochs, the authors achieved a sensitivity of 0.71 and a specificity of 0.68.

The proposed method of the present study is trained on 10,706 genes and 27,685 instances, of which 1513 are other tumor drivers, 1272 are RCCC tumor drivers, and the rest are passenger gene mutation instances. The proposed system covers all the kidney genes and a large number of tumor driver mutations, so it is expected to give more accurate and reliable results after deployment.

5. Conclusions

The kidney is one of the vital organs in the human body, as it is responsible for cleaning the blood, removing waste and poisonous substances from it, and balancing the electrolytes in the body. Kidney cancer is highly prevalent in developing countries for many reasons, one of which is heavy alcohol consumption. In this research work, an efficient, automated, machine-learning-based method is introduced for predicting kidney cancer before the disease develops. For this purpose, the proposed approach records cancer driver mutations in gene sequences and computes position-sensitive statistical moments from them. The approach is then validated with different types of testing techniques, namely Jackknife, independent dataset, and cross-validation testing. The Jackknife, cross-validation, and independent test accuracies of the system are 100%, 98%, and 83%, respectively.

6. Limitations and Future Work

The proposed RCCC_Pred system could be improved further in the future, and this work can be extended to keep the system up to date, with the aim of improving human health and saving more lives. In the future, this work can also help develop AI systems for other human diseases, especially different kinds of cancers in other organs and tissues. Moreover, by increasing the number of data samples and employing deep neural networks, the performance of the method could be improved further.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics12123036/s1.

Author Contributions

Conceptualization, F.A. and Y.D.K.; methodology, A.H.; software, T.A.; validation, T.A., F.A. and Y.D.K.; formal analysis, A.H.; investigation, F.A.; resources, T.A.; data curation, A.H.; writing—original draft preparation, A.H.; writing—review and editing, A.H., T.A., F.A. and Y.D.K.; visualization, A.H.; supervision, T.A. and Y.D.K.; project administration, F.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is provided as Supporting Information S1.

Acknowledgments

The researchers would like to thank the Deanship of Scientific Research, Qassim University for funding the publication of this project.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lv, H.; Dao, F.Y.; Zhang, D.; Yang, H.; Lin, H. Advances in mapping the epigenetic modifications of 5-methylcytosine (5mC), N6-methyladenine (6mA), and N4-methylcytosine (4mC). Biotechnol. Bioeng. 2021, 118, 4204–4216.
2. Zhang, L.; Yang, Y.; Chai, L.; Li, Q.; Liu, J.; Lin, H.; Liu, L. A deep learning model to identify gene expression level using cobinding transcription factor signals. Brief. Bioinform. 2021, 23, bbab501.
3. Dao, F.-Y.; Lv, H.; Su, W.; Sun, Z.-J.; Huang, Q.-L.; Lin, H. iDHS-deep: An integrated tool for predicting DNase I hypersensitive sites by deep neural network. Brief. Bioinform. 2021, 22, bbab047.
4. Althubaiti, S.; Karwath, A.; Dallol, A.; Noor, A.; Alkhayyat, S.S.; Alwassia, R.; Mineta, K.; Gojobori, T.; Beggs, A.D.; Schofield, P.N. Ontology-Based Prediction of Cancer Driver Genes. Sci. Rep. 2019, 9, 17405.
5. Mustafa, M.F.; Fakurazi, S.; Abdullah, M.A.; Maniam, S. Pathogenic mitochondria DNA mutations: Current detection tools and interventions. Genes 2020, 11, 192.
6. Malebary, S.J.; Khan, Y.D. Evaluating machine learning methodologies for identification of cancer driver genes. Sci. Rep. 2021, 11, 12281.
7. Brazhnik, K.; Sun, S.; Alani, O.; Kinkhabwala, M.; Wolkoff, A.W.; Maslov, A.Y.; Dong, X.; Vijg, J. Single-cell analysis reveals different age-related somatic mutation profiles between stem and differentiated cells in human liver. Sci. Adv. 2020, 6, eaax2659.
8. Luo, L.; Lin, L.; Zhang, X.; Cai, Q.; Zhao, H.; Xu, C.; Cong, Q. Next-Generation Sequencing Panel Analysis of Clinically Relevant Mutations in Circulating Cell-Free DNA from Patients with Gestational Trophoblastic Neoplasia: A Pilot Study. BioMed Res. Int. 2020, 2020, 1314967.
9. Liu, X.; Lang, J.; Li, S.; Wang, Y.; Peng, L.; Wang, W.; Han, Y.; Qi, C.; Song, L.; Yang, S. Fragment enrichment of circulating tumor DNA with low-frequency mutations. Front. Genet. 2020, 11, 147.
10. Seyhan, A.A.; Carini, C. Are innovation and new technologies in precision medicine paving a new era in patients centric care? J. Transl. Med. 2019, 17, 114.
11. Grant, A.D.; Vail, P.; Padi, M.; Witkiewicz, A.K.; Knudsen, E.S. Interrogating Mutant Allele Expression via Customized Reference Genomes to Define Influential Cancer Mutations. Sci. Rep. 2019, 9, 12766.
12. Elmekharam, N. Radioimmunoconjugate for Cancer Molecular Imaging. 2021. Available online: https://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=7785&context=etd (accessed on 28 November 2022).
13. Tian, K.; Rubadue, C.A.; Lin, D.I.; Veta, M.; Pyle, M.E.; Irshad, H.; Heng, Y.J. Automated clear cell renal carcinoma grade classification with prognostic significance. PLoS ONE 2019, 14, e0222641.
14. Di Martino, S.; De Luca, G.; Grassi, L.; Federici, G.; Alfonsi, R.; Signore, M.; Addario, A.; De Salvo, L.; Francescangeli, F.; Sanchez, M. Renal cancer: New models and approach for personalizing therapy. J. Exp. Clin. Cancer Res. 2018, 37, 217.
15. Tabibu, S.; Vinod, P.; Jawahar, C. Pan-Renal Cell Carcinoma classification and survival prediction from histopathology images using deep learning. Sci. Rep. 2019, 9, 10509.
16. Perazella, M.A.; Dreicer, R.; Rosner, M.H. Renal cell carcinoma for the nephrologist. Kidney Int. 2018, 94, 471–483.
17. Wu, H.; Fan, L.; Liu, H.; Guan, B.; Hu, B.; Liu, F.; Hocher, B.; Yin, L. Identification of key genes and prognostic analysis between chromophobe renal cell carcinoma and renal oncocytoma by bioinformatic analysis. BioMed Res. Int. 2020, 2020, 4030915.
18. Luo, P.; Ding, Y.; Lei, X.; Wu, F.-X. deepDriver: Predicting cancer driver genes based on somatic mutations using deep convolutional neural networks. Front. Genet. 2019, 10, 13.
19. Azuaje, F.; Kim, S.-Y.; Perez Hernandez, D.; Dittmar, G. Connecting histopathology imaging and proteomics in kidney cancer through machine learning. J. Clin. Med. 2019, 8, 1535.
20. Pray, L.A. Discovery of DNA Double Helix: Watson and Crick. Nat. Educ. 2008, 1, 100.
21. Mao, Y.; Chen, H.; Liang, H.; Meric-Bernstam, F.; Mills, G.B.; Chen, K. CanDrA: Cancer-specific driver missense mutation annotation with optimized features. PLoS ONE 2013, 8, e77945.
22. Kocak, B.; Durmaz, E.S.; Ates, E.; Ulusan, M.B. Radiogenomics in clear cell renal cell carcinoma: Machine learning–based high-dimensional quantitative CT texture analysis in predicting PBRM1 mutation status. Am. J. Roentgenol. 2019, 212, W55–W63.
23. Fu, L.; Niu, B.; Zhu, Z.; Wu, S.; Li, W. CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics 2012, 28, 3150–3152.
24. Suleman, M.T.; Alkhalifah, T.; Alturise, F.; Khan, Y.D. DHU-Pred: Accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ 2022, 10, e14104.
25. Alghamdi, W.; Alzahrani, E.; Ullah, M.Z.; Khan, Y.D. 4mC-RF: Improving the prediction of 4mC sites using composition and position relative features and statistical moment. Anal. Biochem. 2021, 633, 114385.
26. Suleman, M.T.; Khan, Y.D. m1A-pred: Prediction of Modified 1-methyladenosine Sites in RNA Sequences through Artificial Intelligence. Comb. Chem. High Throughput Screen. 2022, 25, 2473–2484.
27. Akmal, M.A.; Rasool, N.; Khan, Y.D. Prediction of N-linked glycosylation sites using position relative features and statistical moments. PLoS ONE 2017, 12, e0181966.
28. Almagrabi, A.O.; Khan, Y.D.; Khan, S.A. iPhosD-PseAAC: Identification of phosphoaspartate sites in proteins using statistical moments and PseAAC. Biocell 2021, 45, 1287.
29. Khan, Y.D.; Khan, N.S.; Naseer, S.; Butt, A.H. iSUMOK-PseAAC: Prediction of lysine sumoylation sites using statistical moments and Chou’s PseAAC. PeerJ 2021, 9, e11581.
30. Allehaibi, K.; Daanial Khan, Y.; Khan, S.A. iTAGPred: A Two-Level Prediction Model for Identification of Angiogenesis and Tumor Angiogenesis Biomarkers. Appl. Bionics Biomech. 2021, 2021, 2803147.
31. Hussain, W.; Khan, Y.D.; Rasool, N.; Khan, S.A.; Chou, K.-C. SPalmitoylC-PseAAC: A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins. Anal. Biochem. 2019, 568, 14–23.
32. Naseer, S.; Hussain, W.; Khan, Y.D.; Rasool, N. Optimization of serine phosphorylation prediction in proteins by comparing human engineered features and deep representations. Anal. Biochem. 2021, 615, 114069.
Malebary, S.J.; Khan, R.; Khan, Y.D. ProtoPred: Advancing oncological research through identification of proto-oncogene proteins. IEEE Access 2021, 9, 68788–68797. [Google Scholar] [CrossRef]
Awais, M.; Hussain, W.; Rasool, N.; Khan, Y.D. iTSP-PseAAC: Identifying tumor suppressor proteins by using fully connected neural network and PseAAC. Curr. Bioinform. 2021, 16, 700–709. [Google Scholar]
Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
Arif, M.; Ahmed, S.; Ge, F.; Kabir, M.; Khan, Y.D.; Yu, D.-J.; Thafar, M. StackACPred: Prediction of anticancer peptides by integrating optimized multiple feature descriptors with stacked ensemble approach. Chemom. Intell. Lab. Syst. 2021, 220, 104458. [Google Scholar] [CrossRef]
Hochreiter, S.; Younger, A.S.; Conwell, P.R. Learning to Learn Using Gradient Descent. In International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2001. [Google Scholar]
Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M.W.; Pfau, D.; Schaul, T.; Shillingford, B.; De Freitas, N. Learning to learn by gradient descent by gradient descent. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar] [CrossRef]
Cui, T.; Dou, Y.; Tan, P.; Ni, Z.; Liu, T.; Wang, D.; Huang, Y.; Cai, K.; Zhao, X.; Xu, D. RNALocate v2. 0: An updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res. 2021, 50, D333–D339. [Google Scholar] [CrossRef]
Malebary, S.J.; Khan, Y.D. Identification of Antimicrobial Peptides Using Chou’s 5 Step Rule. CMC-Comput. Mater. Contin. 2021, 67, 2863–2881. [Google Scholar] [CrossRef]
Alzahrani, E.; Alghamdi, W.; Ullah, M.Z.; Khan, Y.D. Identification of stress response proteins through fusion of machine learning models and statistical paradigms. Sci. Rep. 2021, 11, 21767. [Google Scholar] [CrossRef]
Khan, Y.D.; Rasool, N.; Hussain, W.; Khan, S.A.; Chou, K.-C. iPhosT-PseAAC: Identify phosphothreonine sites by incorporating sequence statistical moments into PseAAC. Anal. Biochem. 2018, 550, 109–116. [Google Scholar] [CrossRef]
Liu, K.; Chen, W.; Lin, H. XG-PseU: An eXtreme Gradient Boosting based method for identifying pseudouridine sites. Mol. Genet. Genom. 2020, 295, 13–21. [Google Scholar] [CrossRef] [PubMed]
Lv, H.; Zhang, Y.; Wang, J.-S.; Yuan, S.-S.; Sun, Z.-J.; Dao, F.-Y.; Guan, Z.-X.; Lin, H.; Deng, K.-J. iRice-MS: An integrated XGBoost model for detecting multitype post-translational modification sites in rice. Brief. Bioinform. 2021, 23, bbab486. [Google Scholar] [CrossRef] [PubMed]
Wang, D.; Zhang, Z.; Jiang, Y.; Mao, Z.; Wang, D.; Lin, H.; Xu, D. DM3Loc: Multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res. 2021, 49, e46. [Google Scholar] [CrossRef] [PubMed]
Lv, H.; Dao, F.-Y.; Guan, Z.-X.; Yang, H.; Li, Y.-W.; Lin, H. Deep-Kcr: Accurate detection of lysine crotonylation sites using deep learning method. Brief. Bioinform. 2021, 22, bbaa255. [Google Scholar] [CrossRef] [PubMed]
Butt, A.H.; Khan, Y.D. CanLect-Pred: A cancer therapeutics tool for prediction of target cancerlectins using experiential annotated proteomic sequences. IEEE Access 2019, 8, 9520–9531. [Google Scholar] [CrossRef]
Sealfon, R.S.; Wong, A.K.; Troyanskaya, O.G. Machine learning methods to model multicellular complexity and tissue specificity. Nat. Rev. Mater. 2021, 6, 717–729. [Google Scholar] [CrossRef]
Shaath, H.; Vishnubalaji, R.; Elango, R.; Kardousha, A.; Islam, Z.; Qureshi, R.; Alam, T.; Kolatkar, P.R.; Alajez, N.M. Long Non-Coding RNA and RNA-Binding Protein Interactions in Cancer: Experimental and Machine Learning Approaches. In Seminars in Cancer Biology; Elsevier: Amsterdam, The Netherlands, 2022. [Google Scholar]
Kocak, B.; Durmaz, E.S.; Kaya, O.K.; Kilickesmez, O. Machine learning-based unenhanced CT texture analysis for predicting BAP1 mutation status of clear cell renal cell carcinomas. Acta Radiol. 2020, 61, 856–864. [Google Scholar] [CrossRef]
Feng, Z.; Zhang, L.; Qi, Z.; Shen, Q.; Hu, Z.; Chen, F. Identifying BAP1 mutations in clear-cell renal cell carcinoma by CT radiomics: Preliminary findings. Front. Oncol. 2020, 10, 279. [Google Scholar] [CrossRef]
Chen, S.; Zhang, E.; Jiang, L.; Wang, T.; Guo, T.; Gao, F.; Zhang, N.; Wang, X.; Zheng, J. Robust Prediction of Prognosis and Immunotherapeutic Response for Clear Cell Renal Cell Carcinoma Through Deep Learning Algorithm. Front. Immunol. 2022, 13, 798471. [Google Scholar] [CrossRef]

Figure 1. Architecture diagram of the proposed model.

Figure 2. Architecture of the ANN model.

Figure 3. Accuracy scores of all 10 folds for 10-fold cross-validation. The circles represent outliers.

Figure 4. Mean ROC of 10-fold cross-validation.

Figure 5. Training and validation curves for 10-fold cross-validation. Orange curves are for training, while blue curves are for validation.

Figure 6. ROC curve for Jackknife validation.

Figure 7. ROC curve of independent dataset test.

Figure 8. ROC curve of independent dataset test for all 5 classifiers.

Table 1. Positive and negative data for each class.

Class	Positive Data	Negative Data
RCCC (kidney_tumour_driver)	RCCC (kidney_tumour_driver)	Other Tumor (other_tumour_driver) + Unknown (non-tumour driver)
Other Tumor (other_tumour_driver)	Other Tumor (other_tumour_driver)	RCCC (kidney_tumour_driver) + Unknown (non-tumour driver)
Unknown (non-tumour driver)	Unknown (non-tumour driver)	RCCC (kidney_tumour_driver) + Other Tumor (other_tumour_driver)
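
Table 1 describes a one-vs-rest arrangement: for each class, its own samples are the positives and the samples of the other two classes together are the negatives. The following is a minimal sketch of that partitioning, not the authors' code. The function name `one_vs_rest_split`, the short class names, and the toy sequences are all hypothetical and chosen only for illustration.

```python
# Short class names used for illustration only (assumption, not the dataset's labels).
CLASSES = ("RCCC", "Other Tumor", "Unknown")


def one_vs_rest_split(samples, positive_class):
    """Split (sequence, label) pairs into positives and negatives for one class."""
    positives = [seq for seq, label in samples if label == positive_class]
    negatives = [seq for seq, label in samples if label != positive_class]
    return positives, negatives


# Hypothetical toy records; real inputs would be mutated gene sequences.
samples = [
    ("ATGGCC", "RCCC"),
    ("ATGTTT", "Other Tumor"),
    ("ATGAAA", "Unknown"),
    ("ATGCCG", "RCCC"),
]

# One positive/negative partition per class, mirroring the three rows of Table 1.
splits = {cls: one_vs_rest_split(samples, cls) for cls in CLASSES}

for cls, (pos, neg) in splits.items():
    print(f"{cls}: {len(pos)} positive, {len(neg)} negative")
```

Every sample appears in exactly one of the two sets for each class, so the three partitions cover the same data from three different viewpoints.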

Table 2. Training accuracy results through self-consistency test.

Specificity	Sensitivity	Accuracy	MCC Stability
1.00	1.00	1.00	1.00
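
The four columns reported in the result tables can be derived from the counts of a binary confusion matrix. The sketch below uses the standard definitions and assumes that the MCC column denotes the Matthews correlation coefficient; the helper name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return specificity, sensitivity, accuracy and MCC from confusion counts.

    Assumes at least one positive and one negative sample, so the
    sensitivity and specificity denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
m = binary_metrics(tp=45, tn=40, fp=10, fn=5)
print({name: round(value, 2) for name, value in m.items()})

# A classifier with no errors scores 1.00 on all four measures.
perfect = binary_metrics(tp=50, tn=50, fp=0, fn=0)
```

In a three-class setting such as this one, these counts would typically be computed per class in a one-vs-rest fashion and then averaged, although the averaging scheme used by the authors is not stated in this section.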

Table 3. Mean scores for 10-fold cross-validation results.

Specificity	Sensitivity	Accuracy	MCC Stability
0.99	0.97	0.98	0.96

Table 4. Mean scores for Jackknife validation results.

Specificity	Sensitivity	Accuracy	MCC Stability
0.98	0.99	0.99	0.99

Table 5. Comparison of the proposed method with other techniques.

Methods	Classifiers	Specificity	Sensitivity	Accuracy	MCC Stability
Kocak et al., 2019 [22]	ANN	87.8%	87.8%	88.2%	0.763
Kocak et al., 2019 [22]	Random Forest	94.6%	94.6%	95%	0.900
Proposed method	ANN + Hist Gradient boosting	98.3%	99.4%	98.87%	0.990

Table 6. Scores for Independent dataset testing of the proposed method.

Specificity	Sensitivity	Accuracy	MCC Stability
0.84	0.82	0.83	0.80

Table 7. Comparison of accuracy score for the proposed model with other ML algorithms.

Classifiers	Independent Test	Cross-Validation	Jackknife Test
Random Forest	0.82	0.97	0.86
Decision Tree	0.84	0.96	0.93
Naive Bayes	0.36	0.48	0.40
Gradient Boosting	0.80	0.96	1.00
Hist Gradient Boosting	0.83	0.98	1.00
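
The cross-validation and Jackknife columns of Table 7 differ only in how the data are partitioned. The sketch below assumes, as is common in this literature, that the Jackknife test is leave-one-out cross-validation, i.e. k-fold with k equal to the number of samples. The function names and the sample count of 25 are hypothetical, and a real pipeline would usually shuffle or stratify the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def jackknife_indices(n_samples):
    """Leave-one-out: every sample is held out exactly once."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(25, 10))   # 10-fold cross-validation
loo = list(jackknife_indices(25))      # Jackknife (leave-one-out)

print(len(folds), "folds with test sizes", [len(test) for _, test in folds])
print(len(loo), "jackknife rounds, each testing on 1 sample")
```

Because each Jackknife round trains on all samples but one, it needs as many model fits as there are samples, which is why it is far more expensive than 10-fold cross-validation on larger datasets.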

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

