Bioinformatics Analysis Identifying Key Biomarkers in Bladder Cancer

: Our goal was to ﬁnd new diagnostic and prognostic biomarkers in bladder cancer (BCa), and to predict molecular mechanisms and processes involved in BCa development and progression. Notably, the data collection is an inevitable step and time-consuming work. Furthermore, identiﬁcation of the complementary results and considerable literature retrieval were requested. Here, we provide detailed information of the used datasets, the study design, and on data mining. We analyzed di ﬀ erentially expressed genes (DEGs) in the di ﬀ erent datasets and the most important hub genes were retrieved. We report on the meta-data information of the population, such as gender, race, tumor stage, and the expression levels of the hub genes. We include comprehensive information about the gene ontology (GO) enrichment analyses and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. We also retrieved information about the up- and down-regulation of genes. All in all, the presented datasets can be used to evaluate potential biomarkers and to predict the performance of di ﬀ erent preclinical biomarkers in BCa.


Summary
Bladder cancer is one of the most common malignancies [1]. Although new treatment strategies and tools for surgical resection [2], neoadjuvant chemotherapy [3,4], and photodynamic therapy (PDT) [5] have been developed, BCa remains with a high rate of recurrence [1]. Up to now, cystoscopy and bioptic histology are still the gold standards to diagnose bladder cancer (BCa) [6]. No consent about urinary marker or non-invasive screening strategies could be found by the European Association of Urology (EUA). Furthermore, nuclear matrix protein 22 (NMP22) is only recommended by American Urological Association (AUA) under certain conditions [7]. Therefore, it remains a priority to develop reliable, safe, and non-invasive diagnostic/prognostic biomarkers and therapeutic targets for BCa, and considerable efforts are ongoing.
Bioinformatics analysis is necessary for the integration of, e.g., huge amounts of transcriptome, microarray, and RNA-sequencing data to disclose alterations in gene expression, mutational burden, transcriptome, and proteome of cancer compared to non-cancer controls [8]. Of special importance are so-called hub genes, defined as highly connected genes, which can be regarded as a representative of a distinct module in the genes network [9]. Furthermore, hub genes potentially play an important role in the progression of cancer. Therefore, they are good biomarker candidates and may even provide new therapeutic targets [8,10]. Following, we provide supplemental results from our recent investigation  [11]. For detailed information on the analytical methods and the software packages please refer to the Materials and Methods section in the cited paper.  [11]. For detailed information on the analytical methods and the software packages please refer to the Materials and Methods section in the cited paper. Molecular complex detection (MCODE) [15][16][17] analysis identified 11 relevant modules (subnetworks) and cytoHubba (based on Cytoscape software) [18] classified 376 of those 418 DEGs as hub genes, which are the gene most interconnected in the networks/modules (Table S4) [8].
To reduce the hub genes to the most promising, we defined 11 seed genes [15,16] for the most important 11 modules and the subsequent analysis yielded 14 hub genes on the basis of correlation to overall survival and degree of interaction (Table S5). The hub genes were ordered by their descending interaction degree: CDK1(98), CCNB1(92), CCNA2(84), KIF11(84), CDC20(83), UBE2C(83), MAD2L1(81), AURKA(80), KIF20A(80), KIF2C(80), KPNA2(67), TPM1(29), CASQ2 (11), and CRYAB (11). Figure 2 depicts the results of the protein-protein interaction (PPI) analysis. extracted the clinical meta-data from TCGA-BLCA for the correlation to overall survival (OS) and disease-free survival (DFS) based on the 14 hub genes (Table S5).  [11]. For detailed information on the analytical methods and the software packages please refer to the Materials and Methods section in the cited paper. In addition, we performed other subgroup analyses, such as the expression levels in different groups based on tumor stage, lymph nodal metastasis, race of patients, gender of patients, histological subtype, and molecular subtypes. We found that CKD1, CCNB1, CCNA2, KIF11, CDC20, UBE2C, MD2L1, AURKA, KIF20A, KIF2C, KPNA2, TPM1, CASQ2, and CRYAB were significantly higher expression in Caucasian and African American than in the ASI cohort. Except for TPM1, CASQ2, and CRYAB, all the genes were significantly overexpressed in both, male and female bladder cancer patients. However, no significant difference was found between males and females. All the genes were significantly higher expressed in non-papillary tumors than in papillary tumors ( Table 2) [28]. In addition, CKD1, CCNB1, CCNA2, KIF11, CDC20, UBE2C, MD2L1, AURKA, KIF20A, KIF2C, and KPNA2 were significantly upregulated in papillary tumors and non-papillary tumors than in non-cancerous tissues; in contrast, TPM1, CASQ2, and CRYAB We then performed a gene ontology (GO) enrichment analysis [26] and used the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis [27] to identify the pathways potentially related to these protein-coding genes and to predict the roles of these genes in BCa. The Go and KEGG analyses significantly enriched the 'Pathways in cancers', 'Viral carcinogenesis', and 'Cell cycle', which relate to carcinogenesis and progression of cancer. We listed all significant results of GO and KEGG analysis based on the 418 DEGs and Benjamini-Hochberg value <0.05 in Table S6. The GO and KEGG analysis results based on the 14 hub genes are available and searchable at http://www.mdpi.com/2075-4418/10/2/66/s1, Table S5. Moreover, we also listed all significant results of GO and KEGG analyses of up- (Table S7) and downregulated DEGs (Table S8). In addition, we Data 2020, 5, 38 4 of 12 extracted the clinical meta-data from TCGA-BLCA for the correlation to overall survival (OS) and disease-free survival (DFS) based on the 14 hub genes (Table S5).
In addition, we performed other subgroup analyses, such as the expression levels in different groups based on tumor stage, lymph nodal metastasis, race of patients, gender of patients, histological subtype, and molecular subtypes. We found that CKD1, CCNB1, CCNA2, KIF11, CDC20, UBE2C, MD2L1, AURKA, KIF20A, KIF2C, KPNA2, TPM1, CASQ2, and CRYAB were significantly higher expression in Caucasian and African American than in the ASI cohort. Except for TPM1, CASQ2, and CRYAB, all the genes were significantly overexpressed in both, male and female bladder cancer patients. However, no significant difference was found between males and females. All the genes were significantly higher expressed in non-papillary tumors than in papillary tumors ( Table 2) [28]. In addition, CKD1, CCNB1, CCNA2, KIF11, CDC20, UBE2C, MD2L1, AURKA, KIF20A, KIF2C, and KPNA2 were significantly upregulated in papillary tumors and non-papillary tumors than in non-cancerous tissues; in contrast, TPM1, CASQ2, and CRYAB were significantly downregulated in papillary tumors and non-papillary tumors than in non-cancerous tissues. Intriguingly, except for CRYAB, we found that TPM1 and CASQ2 were most significantly downregulated in "Luminal Papillary" tumors, while the other genes were most significantly upregulated in subtype of "Neuronal" and "Basal squamous" based on molecular subtyping ( Table 2).

Literature Research
We retrieved seven bioinformatics studies for BCa biomarkers based on public database analyses [9,[29][30][31][32][33][34], and we compared the biomarkers in the present study with the ones that were reported in the literature studies. We found that CRYAB and CASQ2 were so far unrecognized as biomarkers in previous studies. On the basis of Oncomine meta-analysis (https://www.oncomine.org/), we here present the meta-analysis of the expression levels of the hub genes described, but not been shown in our previous study ( Figure 3 and Table S9). The genes compared in the meta-analysis were CCNB1, CCNA2, KIF11, CDC20, UBE2C, MAD2L1, AURKA, KIF2C, CASQ2, CRYAB, and KIF20A. Furthermore, except for the CRYAB and CASQ2, which have been shown in our previous paper, we constructed the expression body maps of the other 12 hub genes reported in our previous study using GEPIA (http://gepia.cancer-pku.cn, accessed on Nov.11. 2019). Body maps are an impressive way to visualize the differences in gene expression between normal and tumor tissues ( Figure 4). Ultimately, our research results were roughly in line with the majority of the retrieved studies.    Furthermore, except for the CRYAB and CASQ2, which have been shown in our previous paper, we constructed the expression body maps of the other 12 hub genes reported in our previous study using GEPIA (http://gepia.cancer-pku.cn, accessed on Nov.11. 2019). Body maps are an impressive way to visualize the differences in gene expression between normal and tumor tissues ( Figure 4). Ultimately, our research results were roughly in line with the majority of the retrieved studies.

Methods
The workflow of the current study is depicted in Figure 1, which has been published before [11]. For more detailed information of the datasets, materials and methods please refer to this article.

Data Source Identification and Data Mining
The quality control of microarray data was conducted by relative log expression (RLE) box plot through R studio (version 1-1-463). Criteria were made for defining DEGs, compared the expression levels between the non-cancerous tissues and cancer samples, where |Log FC (fold change)| > 1 and a p-value < 0.05 were considered statistically significant [15,40].

Acquisition of the Hub Genes
DEGs should at least be expressed in two different GEO datasets. The overlap between DEGs in different datasets was determined by FUNRICH software (version 3.1.3) and 418 DEGs were identified. Based on the interaction degree of 418 DEGs extracted from STRING database [14], cytoHubba analysis [18,41] reported 376 hub genes, of which 135 hub genes fulfilled the criterion of degree ≥11. However, to find the most important hub genes, we focused on the top 10 hub genes, all showing a degree ≥80. Additionally, we included another four hub genes KPNA2, TPM1, CASQ2, and CRYAB, which not only significantly correlated with overall survival based on the results from Human Protein Atlas, but also showed a degree ≥11 [15,40].

Methods
The workflow of the current study is depicted in Figure 1, which has been published before [11]. For more detailed information of the datasets, materials and methods please refer to this article.

Data Source Identification and Data Mining
The quality control of microarray data was conducted by relative log expression (RLE) box plot through R studio (version 1-1-463). Criteria were made for defining DEGs, compared the expression levels between the non-cancerous tissues and cancer samples, where |Log FC (fold change)| > 1 and a p-value < 0.05 were considered statistically significant [15,40].

Acquisition of the Hub Genes
DEGs should at least be expressed in two different GEO datasets. The overlap between DEGs in different datasets was determined by FUNRICH software (version 3.1.3) and 418 DEGs were identified. Based on the interaction degree of 418 DEGs extracted from STRING database [14], cytoHubba analysis [18,41] reported 376 hub genes, of which 135 hub genes fulfilled the criterion of degree ≥11. However, to find the most important hub genes, we focused on the top 10 hub genes, all showing a degree ≥80. Additionally, we included another four hub genes KPNA2, TPM1, CASQ2, and CRYAB, which not only significantly correlated with overall survival based on the results from Human Protein Atlas, but also showed a degree ≥11 [15,40]. . Both software packages were used to annotate, visualize, and integrate the discoveries, and to extract the crucial biological information. We also used DAVID to analyze the up-and downregulated genes, separately. A Benjamini-Hochberg FDR <0.05 was considered significant in DAVID and FUNRICH analyses. In addition, we performed GO and KEGG analysis to identify the critical biological process (BP), cellular component (CC), molecular function (MP), and essential pathways potentially related to the initiation and development of BCa. p < 0.05 was considered statistically significant.
Clinical information was extracted from TCGA-BLCA using R software and, subsequently, the expression levels of 14 hub genes in subgroups were analyzed based on tumor stage, lymph nodal metastasis, race of patients, gender of patients, histological subtype, and molecular subtypes. To evaluate the prognostic value of the identified DEGs, we did Kaplan-Meier survival analyses of overall survival (OS) and disease-free survival (DFS). p < 0.05 was considered statistically significant.

Literature Retrieval and Oncomine Meta-Analysis
PubMed, EMBASE, Science Direct, and Google Scholar databases were used to search and identify published results about bioinformatics analysis in BCa, via the Google search engine. The data collection process was undertaken and ended in November 2019. The retrieval criteria were rigorous, filtering with bioinformatics analysis and BCa, only 7 full-text papers were returned [9,[29][30][31][32][33][34]. On the basis of Oncomine database, 5 studies were relevant [35] and we gathered the information of the 14 hub genes from the 5 previous studies.

User Notes
The present report describes the character of the research "Identification of key biomarkers in bladder cancer: Evidence from bioinformatics analysis" [11]. Furthermore, the present report provides a convenient way to use extended datasets for biomarker discovery and hypothesis generation.
Supplementary Materials: We provide the supplementary Tables S1-S3 listing step by step the identification process of the most promising DEGs. The DEGs listed in Table S3 were used during the following analysis steps. We also indicated up-or downregulation of those DEGs (↑; ↓) in bladder cancer. Table S4 provides the 376 hub genes defined by cytoHubba. Table S5 summarizes the clinical and gene expression data of the n = 406 TCGA-BLCA patients for the 14 hub genes, defined from their degree of network interaction and the 11 seed genes, defined from the 11 most important modules. This table is the basis for the gene expression analyses. Tables S6-S8 summarize the results of the GO and KEGG pathway analyses. Table S9 provides the data of the Oncomine meta-analysis. The data tables may be used to construct a complete data set after applying different normalization strategies as cross-platform normalization or batch effects removal. This data set can be used as a benchmarking data set for machine learning-based feature selection in data-driven biomarker research. The following supplemental data are available online at http://www.mdpi.com/2306-5729/5/2/38/s1, Table S1: DESs extracted from different dataset, Table S2: 726 DEGs extracted from 5 GEO datasets,