Computational Analysis Identifies Novel Biomarkers for High-Risk Bladder Cancer Patients

In the case of bladder cancer, carcinoma in situ (CIS) is known to have poor diagnosis. However, there are not enough studies that examine the biomarkers relevant to CIS development. Omics experiments generate data with tens of thousands of descriptive variables, e.g., gene expression levels. Often, many of these descriptive variables are identified as somehow relevant, resulting in hundreds or thousands of relevant variables for building models or for further data analysis. We analyze one such dataset describing patients with bladder cancer, mostly non-muscle-invasive (NMIBC), and propose a novel approach to feature selection. This approach returns high-quality features for prediction and yet allows interpretability as well as a certain level of insight into the analyzed data. As a result, we obtain a small set of seven of the most-useful biomarkers for diagnostics. They can also be used to build tests that avoid the costly and time-consuming existing methods. We summarize the current biological knowledge of the chosen biomarkers and contrast it with our findings.


Introduction
According to the World Cancer Research Fund International, bladder cancer is the 10th most common cancer in the world [1]. It is diagnosed mostly in people over 55 in highly developed countries of southern and western Europe, as well as in North America. Men are more than four times more likely to develop bladder cancer than women. The most commonly mentioned urinary bladder cancer risks, other than being male, are smoking cigarettes, exposure to certain chemicals (such as aromatic amines, polycyclic aromatic hydrocarbons, and chlorinated hydrocarbons and alcohol), having a red meat-rich diet, and being genetically predisposed (reviewed in [1]). Urothelial carcinoma of the bladder is divided into two major groups on the basis of clinical staging with different clinical outcomes and therapy options: non-muscle-invasive bladder cancer (NMIBC) and muscleinvasive bladder cancer (MIBC). MIBCs are aggressive tumors, characterized by a five-year survival rate of less than 50% [2]. Up to 15% of MIBCs are initially diagnosed as NMIBCs that progressed into MIBCs [3]. NMIBC is considered a tumor with a relatively good prognosis since the five-year overall survival rate is about 90% [4]. Unfortunately, NMIBCs are a very heterogeneous tumor group with a high rate of recurrence (up to 70%) and risk of progression to MIBC (up to 20%), despite significant improvement in the adjuvant therapies' efficacy (reviewed in [5][6][7]). Carcinoma in situ (CIS) belongs to this group and can be diagnosed as a primary or a recurrent tumor. CIS is associated with a poorer prognosis, a higher grade, as well as an elevated risk of recurrence and progression to MIBC [8]. The recurrence rate for CIS is 63-92%, and the progression to MIBC is 50-75%,

Materials and Methods Dataset
The E-MTAB-4321 dataset, used in the study, contains clinical and RNA-seq data from 476 patients with early stage urothelial carcinoma, of whom 74 have developed CIS at a certain point of the disease course, whereas 402 were free of CIS during the study period. There are 43,204 genetic markers in this dataset, out of which 4800 have 0 variance, resulting in 38,404 markers actually carrying any information. A summary of the dataset characteristics is present in Table 1 and in Appendix A. For details on data collection, please refer to the original paper by Hedegaard et al. [17].  4 16 The analytical protocol is based on supervised feature selection (FS) and supervised classification. In our analysis, we focus on finding markers for predicting the appearance of CIS in disease course.
The following base feature-selection protocol is used. We first identify all informative variables and, therefore, reduce the dimensionality of the problem. Then, we further decrease the dimensionality by clustering similar variables. Finally, we use clusters' rep-resentatives to build machine learning models for the prediction of CIS-DC. Each step is described in detail in the following paragraphs.
In the first step, the variables that carry information about future development of CIS in disease course are identified. To this end, we use the multidimensional feature selection (MDFS) filter, which is based on the information entropy and is available as a library in R [22,23]. The informative variables are identified by computing information entropy conditioned on the knowledge of the descriptive variables and comparing it with the null distribution of information entropy conditioned on the non-informative variables. This metric is called information gain (IG). In this case, we use single-dimensional analysis, which computes maximum IG over multiple (30) random discretizations of continuous variables. The relevance is determined by a p-value threshold of 0.05 after applying Holm's correction [24].
Unlike minimal-optimal approaches to feature selection, all-relevant feature selection does not have a goal of producing the best set of features for model building. On the contrary-the goal is to preserve the information about all relevant variables so that they and their structure can be studied at will. However, this leads to higher complexity for model building and more uncertainty for tooling to discover such structures. To counter this, we used feature clustering to group similar features together.
Similarity is a concept rooted in clustering (and data analysis in general) and is a broad category. For our purpose, we use a correlation coefficient as our similarity metric; precisely, we use the Pearson's product-moment correlation coefficient ρ. However, for the purpose of applying clustering algorithms, we need a function that can be used as a proper metric-that monotonically describes the similar-dissimilar relation and outputs the penalty associated with dissimilarity. Thus, we apply the following transformation to obtain the function d: which satisfies the properties of a proper metric and describes dissimilarity as a penalty due to lack of correlation. The function d is called the dissimilarity function. We choose hierarchical clustering as our clustering algorithm due to its property of revealing the internal clustering structure. As a method of hierarchical clustering, we evaluated Ward's minimum variance method as well as the complete linkage method. Of note here is that we applied clustering only to features-not objects nor both objects and features-unlike how clustering and biclustering algorithms are usually used.
We evaluate two ways to choose the representatives to build the classification models. The first is the most commonly applied procedure of working directly with the ranking of features as they are available from the feature-selection method: choosing top-n features with the lowest p-values. Secondly, we evaluate the effect of hierarchical clustering to N clusters and then, analogously, use the ranking to choose one representative from each cluster, basically the top-1 representative from each group.
To evaluate the marker set, we used the Random Forest [25] (RF) implementation available in R's randomForest package [26] as our target classifier. No tweaks to the default parameters were applied. We used the area under the ROC (receiver operating characteristic) curve, also known as AUROC or even AUC (area under curve), to describe the performance of each built classifier.
While evaluating the stability and generality of the above base protocol, we developed an extended procedure that we present here. We propose the use of cross validation as part of the feature-selection protocol. The entire above-mentioned procedure was run in a stratified 5-fold cross validation with 30 repeats, the direct results of which are presented in Figure 1. Essentially, we have obtained a new ranking from cross validation that allows us to apply the top-n procedure while using the count of repetitions as the quality metric. For an overview, see Figure 2. Furthermore, to estimate the mean and error of our evaluation metric (AUC), we have applied (independently) both external resampling and external cross validation. The resampling procedure consisted of 100 repeats of random sampling with replacement. The omitted objects, called out-of-bag (OOB) objects, were used for verification of the performance of the built models, i.e., for the calculation of the AUC. The cross-validation procedure, on the other hand, was conducted using a stratified 10-fold approach with 30 repeats (independent of the CV inside the procedure). The internal procedure was adjusted to use 10-fold CV as well, to gather enough objects for the MDFS statistic to work well.
Assuming validation with resampling, the full analytical protocol is, thus, as follows (with an overview in Figure 3): 1.
repeat 30 times: split data randomly in 5 equal bins (i.e., run 30 repetitions of 5-fold CV) and for each (i-th) bin: (a) set aside the i-th bin as the test set and create a training set from the 4 remaining bins; (b) identify informative variables in the training set; (c) cluster informative variables using the hierarchical approach and select representatives of each cluster on each clustering level between 2 and 15, utilizing the usual procedure of choosing the most informative one, and: i. build an RF model of "CIS in disease course" using those representatives; ii.
test the quality of the built model on the test dataset; 2. find cluster representatives at each level that appear most often in the above 150 iterations (30 times 5 iterations), at each level of clustering between 2 and 15; 3.
use those representatives for building the final model on the entire dataset: (a) estimate the confidence intervals of the final models at each number of representatives (between 2 and 15) using the bootstrap approach-repeat 100 times: i. draw with replacement N patients from the original data, build RF models using the 2 to 15 representative variables; ii.
measure the performance of each model using OOB objects; (b) compute the aggregate performance of each model; (c) use the results of the above procedure to propose the relevant markers.
Apart from the above selection of methods, we have verified the final marker set using naive Bayes [27] and logistic regression classifiers, estimating the achievable diagnostic metrics with such simpler classifiers. The details of the naive Bayes classifier are presented in the Appendices A and B, as it is used as an example simple classifier that is useful for diagnostic personnel.

Results
The final set of markers was selected after inspecting both the lists of representatives and plots of the AUC in predictive models as a function of the number of markers used. The quality of the predictive models improves with the increasing number of markers used in the model, until it saturates with about seven-nine markers; see Figure 4. An additional argument for selecting seven markers is the relative stability of the positions of the first seven markers in the list of markers consistently selected in the cross validation; see Figure 5. The first 7 markers appear in the set of the top-7 most-often selected cluster representatives in 150 repeats of the feature-selection procedure, whereas positions of other markers do not rise to the top-7.  Figure 5. Most-representative markers at different clustering levels in 150 repeats of hierarchical clustering procedure. The first 3 columns show the order in which markers are included in the representative set, when the number of representatives is increased by 1-from 2 to 15. The Ensemble code of each marker, with 5 leading zeros removed, is shown in column 2, and the gene name corresponding to the marker is shown in column 3. In the remaining columns, the markers that are most often selected as representatives in 150 repeats are shown, and their positions within the column corresponds to the frequency of selection of a given marker as the representative (higher position-higher frequency).
The chosen markers maximize the AUC in resampling (see Figure 6) and do not exhibit strong correlations among themselves (as expected from the protocol); see Figure 7. For completeness, we also present the internal quality metric of the protocol in Figure 1 and the details of the selected seven markers in Table 2.  The markers exhibit different directionalities of expression levels between high-and low-risk classes-ADAM28 and TMEM32 expression levels are higher in the low-risk class, while for the other markers, we observe the reverse; see Figure 8.
The diagnostic properties of the models built with the chosen markers are presented in Figure 4. Properties of models from the external cross validation are presented in Table 3. The details of the naive Bayes classifier built on the entire set are reported in Appendix A.  Table 3. Externally cross-validated results for the Random Forest classifiers. The first column defines the threshold set in the classifiers, separating low-and high-risk groups. The second column displays the fraction of all patients assigned to a high-risk class (HRC) at a given threshold. Analogously, the third column presents the fraction of all CIS-DC cases assigned to the high-risk class. Two next columns present the fraction of CIS-DC in a low-and high-risk class (LRC and HRC), respectively. Similarly, columns six and seven present the odds of CIS-DC in an LRC and HRC, respectively. Finally, the DOR column displays the diagnostic odds ratio between the HRC and LRC, and RR displays the risk ratio between these classes. The comment on the cutoff and share of patients in the HRC from  Table 4. Cross-validated results for the three classifiers, built using the seven selected markers. Stratified 5-fold cross validation, repeated 30 times, was used. Column headings are as in Table 3. Please note that the observed inconsistency between the cutoff and the share of patients in the HRC stems from the application of cross validation-the share is averaged over all folds from all iterations, while the cutoff is established using the entire dataset a priori.

Discussion
Bladder cancer is one of the most-common cancers in the world [1]; thus, there is a need to develop sensitive methods for the early diagnosis of non-advanced lesions or poor prognosis predictors. Currently, there is a limited number of commercially available tests for bladder cancer diagnosis. The NMP22BC test allows for the diagnosis of non-muscle-invasive bladder cancer and low-grade bladder cancer in urine samples [28]. Recently published data shows that HPLC (high-performance liquid chromatography) of urine could distinguish bladder cancer patients from non-malignant hematuria patients based on chromatographic absorptions and fluorescence peaks [29]. Similarly, fluorescence urine analysis using concentration matrices of synchronous spectra could be useful in bladder cancer diagnosis, allowing to distinguish between cancer patients and heumaturia patients [30]. The new diagnostic strategy could include a label-free optical sensing platform based on DNA strand displacement. Currently, there are no data on bladder cancer detection using this method [31]. Metabolomic analysis is a very promising and useful tool for the identification of biomarkers; it allows for analyses of urine, blood, and tissue samples. The results enable distinguishing between MIBC and NMIBC patients [32]. The aforementioned techniques are aimed at the sensitive and early detection of urinary bladder cancer or at discriminations between MIBC and NMIBC. However, markers allowing for the identification of the risk of CIS development have still not been identified.
CIS of the urinary bladder represents the tumors with high risk of progression to MIBC and metastatic disease [8]. Some data indicate that primary CIS is diagnosed in about 1-3% of newly diagnosed bladder cancers, but some papers report about 20% primary CIS case diagnoses [33,34]. Secondary CIS (detected during follow-up) are diagnosed in about 20% of NMIBC cases [33,35]. Our method allowed for the identification of seven markers related to an increased risk of CIS-DC of urinary bladder cancers. Some of these markers are well-known molecules involved in cancer biology, but some of them are quite unique, with very limited information on their involvement in cancer development and their relationship with tumors.
We identified two markers that are characterized by limited information: DPY19L3-DT (DPY19L3 Divergent Transcript, ENSG00000267213) and E9PMD0 (ENSG00000258472). DPY19L3-DT belongs to the lncRNA class, but there is no information on the function of this molecule in normal and pathological cells and tissues, while the function of E9PMD0 is linked to the cell division and regulation of the attachment of spindle microtubules to kinetochore [36].
ADAM28 belongs to the disintegrin and metalloprotease domain (ADAM) family. Its role in cancers is ambivalent: it promotes cancer cells' proliferation, survival, migration, and metastasis by affecting neoangiogenesis, epithelial-to-mesenchymal transition, and extracellular matrix degradation, but in the tumor microenvironment it shows strong protective effects against deleterious metastasis dissemination [37]. In bladder cancers, ADAM28 may represent a possible biomarker, since it is overexpressed in bladder transitional cell carcinoma patients and detected in urine [38,39]. In our model, its higher expression was found in patients with low-risk cancers.
Another marker identified by our protocol, RAC3, is involved in neuronal development and in tumor progression, by modulating the organization of the cytoskeleton, cell migration, cell proliferation, and reactive oxygen species production. Its expression was found in different cancers, and it is considered as a marker of poor prognosis, metastasis, and a target for molecular-targeted therapies in some human cancers, such as breast or lung (reviewed in [40]. In our model, the increased expression of RAC3 in high-risk cancers is in line with the existing knowledge and data published by Chen et al. [41]. It indicates that, in bladder cancer, this molecule can be a potential prognostic marker and a target for molecular medicine.
TPX2 is a microtubule-associated protein, involved in the assembly of mitotic spindles and in cell cycles, cell proliferation, and apoptosis [42,43]. TPX2 was found in in silico studies to be related to the risk of the distant metastasis of breast cancers [44]. In bladder cancer, TPX2 is involved in TPX2-mediated phosphorylation of the AURKA-PI3K-AKT axis [45]. In addition, heterogeneous nuclear ribonucleoprotein F, by regulating the TPX2 protein, promotes the cell cycle and proliferation of bladder cancer cells [46]. The proliferation of bladder cancer cells can also be regulated by the interplay between TPX2, p53, and GLIPR1 [47]. In our model, similar to Yan et al. [48], a higher expression of TPX2 was found in high-risk cancers. Thus, we conclude that TPX2 plays an important role in the progression of bladder cancers, including CIS in disease course, and represents a good potential marker for targeted therapy.
ANKRD13B is ubiquitin-binding protein that specifically recognizes and binds Lys-63linked ubiquitin and that is responsible for the internalization of ligand-activated EGFR [49]. In addition, it is involved in DNA methylation since ANKRD13B (and ANKRD13A and ANKRD13D) form a complex with RNF11 (RING finger protein 11), belonging to the Really Interesting New Gene E3 ligase family (RING) [49,50]. Based on our data, we suggest that ANKRD13B could act as a marker of high-risk bladder cancer, since its expression was significantly elevated in these cancers. It could also be a potential molecular target for anticancer therapies.
TMEM232 is a member of the transmembrane protein family (TMEMs), consisting of more than 300 proteins, being components of cellular membranes [51]. Proteins of this family have differential expression in cancers, but there is limited information on TMEM232.
Published data have linked this protein with atopic dermatitis [52,53] or with multiple sclerosis [54]. In our model, the TMEM232 expression pattern was similar to ADAM28, with higher expression in low-risk cancers.
Using externally cross-validated results for the Random Forest classifier and a 75% threshold in our model (Table 3), the fraction of all patients assigned to a high-risk group was 23.6%, and the fraction of all CIS-DC cases assigned to the high-risk group was 49.9%, while the fraction of CIS-DC in a low-risk group was 10.2%, and that in a high-risk group was 32.9%. The fraction of these patients for the 75% threshold, using cross validation and naive Bayes, logistic regression, and Random Forest classifiers are similar, with very promising diagnostic results for Random Forest. The described method could aid clinicians in identifying high-risk bladder cancer (the risk of CIS in disease course). Thus, it offers a diagnostic tool that allows for the personalization of bladder cancer surveillance, more precise treatment option determinations, and the improvement of bladder cancer prognoses.
To summarize, the identified genes can be used as markers of progression in urinary bladder cancers. Moreover, the increased expression of some identified proteins (RAC3, TPX2, ANKRD13B, and TMEM232) indicates their usefulness as potential targets in molecular-tailored therapies. Some of them require more detailed studies since their biological role, especially in cancer, is unknown, or the data are contradictory (ADAM28, TMEM232, DPY19L3-DT, and E9PMD0). We also conclude that, since we identified seven important genes, their evaluation in routine diagnostic procedures is possible using immunohistochemistry or in situ hybridization. Such a panel would not burden laboratories with high costs and labor. Finally, a ready classifier based on naive Bayes technique is presented in the Appendices A and B, along with an example calculation to enable the research and diagnostics communities to readily analyze applicable data.
Author Contributions: R.P. conceptualized the study, prepared the resources and the software, validated the methodology, performed the data curation and investigation, prepared the visualization of the results, and wrote, revised, and edited the manuscript; A.A.B. validated the methodology, performed formal analysis and investigation, and wrote, revised, and edited the manuscript; W.R.R. conceptualized the study, validated the methodology, performed the investigation, prepared the visualization of the results, wrote, revised, and edited the manuscript, and supervised and administrated the project. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript (sorted alphabetically):

Appendix A. Dataset Properties
In this short appendix, we summarize the details of the dataset properties in terms of the clinical metadata values' distribution. The details are in the following four tables.  Total  74  402  Female  18  91  Cystectomy  9  23  BCG treatment  35  53  High grade  41  151  Low grade  32  245  PUNLMP  1  6  CIS  3  0  Ta  46  299  T1  22  90  T2-4 3 13

Appendix B. Naive Bayes Classifier Using the Seven Chosen Markers
The naive Bayes classifier built using the seven chosen markers is a simple classifier with diagnostically interesting properties (as shown in the main text). Thus, in this appendix, we present the details of classification made possible by the data we have used.
The naive Bayes classifier requires knowledge of the distribution of each variable. In most cases, it is assumed that the underlying distribution is normal. This is not the case in the raw gene expression data. Nevertheless, the logarithm of gene expression is usually sufficiently close to normal distribution. To avoid numerical artifacts for cases with very low expression, a value of 0.001 has been added to each recorded value.
To make the model general, the expression levels have been normalized using values of expression levels of genes with a stable and high level of expression that are available in the dataset. The three genes selected as reference are: ENSG00000075624 (ACTB), ENSG00000166794 (PPIB), and ENSG00000149273 (RPS3).
The values of these three have been combined to create a reference gene. The weights applied are ratios of the mean to the standard deviation, as was used in the initial choice. They are the following: 14.1, 12.8, and 8.79, respectively, normalized to 0.395, 0.357, and 0.247, respectively. The reference gene's expression level is used as the denominator in the construction of normalized expression levels of diagnostic genes, while their expression levels are the respective nominators. Thus, for a particular gene and patient, we have the following formula: log v 0.395 log v r 1 + 0.357 log v r 2 + 0.247 log v r 3 (A1) where v is the patient's gene expression level that we normalize, and v r 1 , v r 2 and v r 3 are the patient's gene expression levels for reference genes mentioned above: ACTB, PPIB, and RPS3, respectively. The log function is a natural logarithm, i.e., a logarithm with base e. Let us assume that some patient has the following values of gene expression for the respective reference genes: 490, 180, 304, and we want to normalize the value of TPX2, for which this same patient has a gene expression value of 2.03. We substitute the values in the above formula: The normal distribution is described using two parameters which are real numbers: the mean µ and the standard deviation (SD) σ. The density function f µ,σ of the normal distribution with the declared parameters is expressed as follows: The parameters for the selected seven genes are presented per class in Table A5. The name-naive Bayes-stems from the underlying statistical approach. The method uses Bayes' theorem, with an assumption that the different variables are independent. This assumption is false in most real-life cases; hence, the method is called naive. Nonetheless, the naive Bayes classifiers work surprisingly well in many cases [55].
Having the distributions well defined, equipped with the independence assumption, and using the Bayes' theorem, it is possible to evaluate whether a particular sample belongs with higher probability to the low-or high-risk class. To this end, one needs to calculate the simplified nominator from the Bayes' theorem's formula: where p(C k ) is the prior probability of the class, while p(x i |C i ) is the conditional probability of sample with value x i of the i-th variable, under the assumption that the sample belongs to class C k -this is performed using the assumed distributions. The ∏ symbol means product. The class with the larger value is the more probable class. However, it is possible to use the classifier with an arbitrary threshold, which enables the classifier to be tweaked depending on the nature of the problem. For practical purposes, in the case of binary classification, it is possible to define a score as a logarithm of the ratio of the probabilities of each class, in our case low-risk (L) and high-risk (H): Since the second part is effectively a constant, and we are allowing arbitrary thresholds, the score becomes the following: Let us assume that the patient for whom we want to apply the method above has the following values of normalized expression levels: DP = −0.2, AD = 0.51, RA = 0.02, E9 = 0.08, TP = 0.18, AN = 0.0, and TM = 0.02. Then, utilizing Equations (A3) and (A6) (as we can substitute the probability function with probability density function since we have a ratio of them) and data from Table A5, we obtain the following contributions to the final score: Taking into account the thresholds computed for the three analyzed cutoffs (Table A6), the calculated score classifies the patient as low-risk, regardless of the cutoff.