Open Data to Support CANCER Science—A Bioinformatics Perspective on Glioma Research

Supporting data sharing is paramount to making progress in cancer research. This includes the search for more precise targeted therapies and for novel biomarkers through cluster and classification analysis, and extends to learning details of signal transduction pathways or intra- and intercellular interactions in cancer through network analysis and network simulation. Our work aims to support and promote the use of publicly available resources in cancer research and demonstrates artificial intelligence (AI) methods to find answers to detailed questions, for example, how targeted therapies can be developed based on precision medicine or how cell-level phenomena can be investigated with the help of bioinformatic methods. In our paper, we illustrate the current state of the art with examples from glioma research, show how open data can be used for cancer research in general, and point out several resources and tools that are readily available. Presently, cancer researchers are often not aware of these important resources.


Introduction
What are the currently known biomarkers and cancer driver genes for a selected subdisease? Which genetic aberrations can be used diagnostically? Have survival-associated patterns already been identified? Which overall survival can be predicted? Are there any gender and age specifics for certain cancer subtypes? Are there any targeted drug recommendations for certain genomic variations? Numerous questions are raised in cancer research every day and, in part, data already exist that help find answers. The following topics are covered:
1. Why open data research?
2. General biomedical data providers.
3. Cancer specific data initiatives and resources.
4. Metadata for AI in cancer research.
5. Fostering exchange and use cases for glioma research.

Ad 1. Why Open Data Research?
In 1957, the International Council of Scientific Unions (ICSU) prepared the International Geophysical Year, among other reasons, to overcome the many data locks of the Cold War era [4]. Recently, the ICSU merged with the International Social Science Council (ISSC) to form the International Science Council (ISC) [5]. In the last quarter of the past century, the idea of worldwide data exchange grew, which made it necessary to standardize metadata for exchange [6]. In the 1970s, the National Aeronautics and Space Administration (NASA) had to work cooperatively with international partners to operate ground control stations, leading to the implementation of a standardized way of data exchange [7]. By now, NASA has its own open data portal [7]. In 1995, the National Academy of Sciences published the report "On the Full and Open Exchange of Scientific Data". In this report, the Committee on Geophysical and Environmental Data of the National Research Council, Washington, D.C., demanded the disclosure of data and promoted open exchange between different countries [8]. At the end of 2005, a common endeavour to collect and share genomic analyses of 33 different cancer types was launched with The Cancer Genome Atlas (TCGA), followed in 2006 by Therapeutically Applicable Research to Generate Effective Treatments (TARGET), concerned with childhood cancer research [9], and in 2008 by the International Cancer Genome Consortium (ICGC) [10]. Local initiatives followed, such as the German Cancer Consortium (DKTK) in 2012 [11], as did further global initiatives. In 2014, the Global Alliance for Genomics and Health (GA4GH) was founded to enable responsible genomic data sharing. Soon after, corporations throughout the world joined, supporting data sharing initiatives in cancer research [12].
Biomedical databases provide both open and controlled access data, depending on the data type, as is the case for the ICGC data portal [10]. Open (access) data are data that can be used by anyone without technical or legal restrictions. Use encompasses both access and reuse. Still, open data are less common than open access publications; both are among the many important stages of open(ing) science [13]. AI development requires diverse, publicly available, and annotated data with regard to quality, validation, and reproducibility. This aspect becomes more and more important as the amount of data produced every day increases. The recent year has proven that open science can save lives [14]. Besides open data, the FAIR principles were developed as a concept to ensure the reproducibility and quality of research. FAIR applies not only to data but also to tools and services (e.g., repositories). FAIR makes data findable (e.g., through a digital object identifier), accessible (e.g., through repositories), interoperable (e.g., through the use of open formats and technologies), and re-usable (e.g., through adequate documentation with metadata), while still protecting individuals' privacy, which is essential in the case of sensitive patient data. In order to adhere to the FAIR principles, it is crucial to have access to technological solutions (e.g., repositories) but also to have discipline-specific know-how for adequate documentation and the use of metadata standards [15,16,17].
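As a rough illustration of what the four FAIR facets can mean for a single dataset record, the following sketch checks a metadata dictionary for one indicator per facet. The field names (identifier, access_url, format, license, description), the list of open formats, and the example record are assumptions made for this illustration; they are not taken from any metadata standard or from the cited references.

```python
# Minimal sketch: map each FAIR facet to one illustrative metadata check.
# Field names and OPEN_FORMATS are hypothetical, not taken from a standard.
OPEN_FORMATS = {"csv", "tsv", "json", "vcf", "nifti"}


def fair_report(record):
    """Return a dict telling which FAIR facets the record appears to cover."""
    return {
        # Findable: a persistent identifier such as a DOI is present
        "findable": str(record.get("identifier", "")).startswith("doi:"),
        # Accessible: a repository URL is given
        "accessible": str(record.get("access_url", "")).startswith("https://"),
        # Interoperable: the data are stored in an open format
        "interoperable": str(record.get("format", "")).lower() in OPEN_FORMATS,
        # Re-usable: a license and a description document the data
        "reusable": bool(record.get("license")) and bool(record.get("description")),
    }


example = {
    "identifier": "doi:10.0000/example",  # placeholder DOI, not a real dataset
    "access_url": "https://repository.example.org/dataset/1",
    "format": "CSV",
    "license": "CC-BY-4.0",
    "description": "",  # documentation is missing in this toy record
}
report = fair_report(example)
print(report)
```

In this toy record, the missing description is what prevents the re-usable facet from being met, which mirrors the point that adequate documentation requires discipline-specific know-how in addition to a repository.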

General Biomedical Data Providers
In biomedicine, vast amounts of data are produced, and international institutions exist that provide data and tools openly, by and to the scientific community. Two big institutions provide open data for bioinformatic research, including cancer data. Known worldwide is the National Center for Biotechnology Information (NCBI), located in the United States. Another key player in providing resources for bioinformatic research is the European Bioinformatics Institute (EMBL-EBI), located in the United Kingdom. NCBI provides many national resources but also participates in international projects, including EMBL-EBI projects, and vice versa [18]. Additionally, EMBL-EBI provides many internationally curated, high-quality data resources, including data from teams worldwide, following a coherent strategy [7].
Many resources are available online, from old and outdated ones to highly curated, disease-specific data repositories, offering full open access, semi-free access, or access granted only upon data request. One of the most famous open access data providers is PubMed Central (PMC), which provides abstracts and free full-text publications, as well as links to publishers with restrictions. PubChem is a freely accessible chemical information database with information about chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations, and more. Gene Expression Omnibus (GEO) is a functional genomics data repository with querying tools and download options for array- and sequence-based data. PMC, PubChem, and GEO, among others, are services from the aforementioned NCBI [18]. Ensembl, UniProt, the Protein Data Bank in Europe (PDBe), ChEMBL, ArrayExpress (currently being migrated to BioStudies), and the Expression Atlas, but also the larger content provider Europe PMC, are some of the more famous data resources provided by EMBL-EBI [19,20]. Ensembl currently supports data from more than 50,000 genomes across its different websites. UniProt is a comprehensive, high-quality database of protein sequences and functional information. PDBe is the European descendant of the worldwide Protein Data Bank (PDB) [21], collecting, organising, and disseminating data on biological molecular structures. ChEMBL combines chemical, genomic, and bioactivity data of drug-like molecules. ArrayExpress collects data from high-throughput functional genomics experiments. Expression Atlas makes use of ArrayExpress data.
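The querying tools mentioned for GEO can also be reached programmatically. The sketch below only builds a search URL for the NCBI E-utilities ESearch service against the GEO DataSets index (Entrez database name gds) and does not send a request. The helper name geo_search_url and the example query are assumptions made for this illustration, and the parameters should be checked against the current E-utilities documentation before use.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# NCBI E-utilities ESearch endpoint; "gds" is the Entrez database
# that indexes GEO DataSets.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, retmax=20):
    """Build an ESearch URL for GEO DataSets (hypothetical helper; no request is sent)."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


# Illustrative query: glioma records restricted to human samples
url = geo_search_url('glioma AND "Homo sapiens"[Organism]')
print(url)

# Round trip: decode the query string again to inspect what would be sent
print(parse_qs(urlparse(url).query)["term"][0])
```

Fetching the URL, for instance with urllib.request.urlopen, should return a JSON document in which the matching record identifiers are listed; NCBI asks users to respect its request rate limits.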
There are also joint repositories next to the worldwide PDB, such as the Consensus Coding Sequence Database (CCDS) [22] or the GLOBOCAN cancer statistics, provided by the International Agency for Research on Cancer (IARC), a specialized cancer agency of the World Health Organization (WHO) [23]. Smaller, local, and more specific resources are also available, such as the Chinese Glioma Genome Atlas (CGGA) [24]. PDB provides access to structural data for biological molecules. CCDS collects high-quality annotated protein coding regions in human and mouse genomes. GLOBOCAN provides global cancer statistics for cancer control and research. CGGA is a resource with functional genomic data from Chinese gliomas. Most of these resources provide information on which data are available but also on how to contribute to the projects. For instance, for BioStudies, which data to submit and how to submit them is described at https://www.ebi.ac.uk/biostudies/submit (accessed on 12 December 2021). There are also several imaging data repositories from EMBL-EBI, providing images at different scales, ranging from macro-molecular subcellular structures up to large tissue masses: EMPIAR, Cell-IDR, Tissue-IDR, BioImage Archive, and many more [25]. In the area of life science, one can find comprehensive lists for research data management practice, for instance at https://github.com/elixir-europe/rdmkit (accessed on 12 December 2021). A table of data resources with causal information in biological databases can be found in [26]. However, the availability of disease-specific data, in particular for certain cancer types, varies. The next subsection describes cancer-specific resources. We summarize the most important resources in Table 1 and relate them to specific use case examples in Table 2.

Cancer Specific Data Initiatives and Resources
Regarding the topic of cancer research, there are also some disease-specific resources provided by the US National Cancer Institute (NCI). To name some of the most important ones, TCGA is available via the Genomic Data Commons Portal at https://portal.gdc.cancer.gov/ (accessed on 12 December 2021). The Cancer Imaging Archive (TCIA), also sponsored by NCI, provides radiomics data [50] via https://www.cancerimagingarchive.net/ (accessed on 12 December 2021). Radiomics data can be submitted to TCIA, following the guide in https://www.cancerimagingarchive.net/primary-data/ (accessed on 12 December 2021). The Pan Cancer Analysis of Whole Genomes (PCAWG) is one of the ICGC initiatives that provides common patterns of mutation among different cancer types. PCAWG data is available via several databases, such as the ICGC data portal but also the Expression Atlas and the University of California Santa Cruz (UCSC)'s Xena Functional Genomics Explorer [51]. For instance, differential network analysis can be applied using the Expression Atlas and PCAWG data [34]. The cBio Cancer Genomics Portal (cBioPortal) is another collaborative effort that provides open genomic data, including TCGA pan-cancer studies, as well as open source software for local instances [52]. Data from cBioPortal, and its pediatric-specific instance, pedcBioPortal, can be used for clustering and classification analysis [35]. The multi-institutional systems biology center Cancer Cell Map Initiative (CCMI) supports NDEx, providing data commons for biological networks [53]. To overcome the lack of data from young patients, the Pediatric Cancer Genome Project (PCGP) provides data via https://pecan.stjude.cloud/pcgp-explore (accessed on 12 December 2021) [36]. The Catalogue of Somatic Mutations in Cancer (COSMIC) can be accessed via https://cancer.sanger.ac.uk/cosmic (accessed on 12 December 2021). COSMIC is provided by Wellcome Sanger Institute (WSI), located in the United Kingdom.
COSMIC uses data from ICGC, TCGA, and others. Several other resources can be found and are discussed elsewhere [54]. Glioma-specific web resources, partly making use of data provided by these initiatives, are further described in Section 2.6. A notable example of an in silico resource that supports the scientific community is Kipoi, a repository of reusable predictive genomic models, where researchers can contribute models as well as reuse and compare them [55]. Additionally, the number of datasets dedicated to finding suitable AI methods is growing [42]. Generally, data sharing is named as one key limitation in AI research [56].
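Besides the web portal, TCGA data hosted by the Genomic Data Commons can be queried through its REST API. The sketch below builds, without sending it, a request URL for the cases endpoint that is restricted to the two TCGA glioma projects (TCGA-GBM and TCGA-LGG). The helper name gdc_cases_url and the selected fields are assumptions made for this illustration; the endpoint and parameter names reflect our understanding of the public GDC API and should be verified against its documentation.

```python
import json
from urllib.parse import urlencode, urlparse, parse_qs

GDC_CASES = "https://api.gdc.cancer.gov/cases"  # GDC REST endpoint for cases


def gdc_cases_url(project_ids, fields=("submitter_id", "project.project_id"), size=10):
    """Build a GDC cases query for the given projects (hypothetical helper; no request is sent)."""
    # GDC filters are nested JSON objects: an operator plus its content
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": list(project_ids)},
    }
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": size,
    }
    return GDC_CASES + "?" + urlencode(params)


# The two TCGA glioma cohorts: glioblastoma and lower grade glioma
gdc_url = gdc_cases_url(["TCGA-GBM", "TCGA-LGG"])
print(gdc_url)
```

Only open-access metadata can be retrieved this way without authentication; controlled-access data require an approved data access request, in line with the distinction between open and controlled access described above.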

Metadata for AI in Cancer Research
Reports on machine learning applications in medical science often lack accessibility or reproducibility and describe only selected aspects of the models; yet trust in biomedical applications is of particular importance in medical science [57]. The clinical utility of AI applications would require evaluation on external cohorts and documentation in online repositories [58]. Next to the challenge of finding sufficiently large, diverse, and well-annotated datasets for AI training, there is the issue of data privacy and ownership, which significantly hampers model development in medicine. This makes transfer learning and, moreover, federated learning approaches, which distribute model training to the data owners, more and more prominent [59,60]. Additionally, the EU recently published a regulatory framework on AI that proposes a list of high-risk applications, sets requirements, and defines specific obligations for AI users and providers of high-risk applications [61].
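The idea of distributing model training to the data owners can be sketched with federated averaging on a toy problem. In the example below, three hypothetical sites each fit a one-dimensional linear model on their own data, and only the model parameters are sent to a central server, which averages them weighted by sample count. The data, learning rate, and number of rounds are illustrative assumptions; this is a minimal sketch of the principle, not the method of the cited works, and it omits secure aggregation and other privacy safeguards.

```python
def local_update(w, b, data, lr=0.05, epochs=20):
    """Run gradient descent on one site's private (x, y) pairs."""
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error of the model y = w * x + b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def federated_average(updates, sizes):
    """Average the site models, weighted by the number of local samples."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return w, b


# Toy data: three sites hold samples of y = 2x + 1; raw data never leave a site
sites = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (0.5, 2.0), (1.5, 4.0)],
    [(-1.0, -1.0), (3.0, 7.0)],
]

w, b = 0.0, 0.0
for _ in range(50):  # communication rounds between server and sites
    updates = [local_update(w, b, d) for d in sites]
    w, b = federated_average(updates, [len(d) for d in sites])

print(round(w, 3), round(b, 3))
```

Because all sites sample the same underlying relationship, the averaged model approaches slope 2 and intercept 1. With heterogeneous site data, as is typical for clinical cohorts, convergence is slower and the averaged model may fit some sites worse than others.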

Explainability and Causability
Although explainable AI (xAI) has only recently become popular as a field, the problem of explainability is among the oldest in science and is well anchored in the philosophy of science [62]. The current problem has arisen from the great successes of statistical machine learning and of non-linear models, such as complex neural networks (deep learning), which make it practically impossible to track all steps leading to a result. However, this traceability is now necessary for legal reasons, and xAI is developing a series of post-hoc models that allow the results of so-called black-box models to be understood and interpreted by end users [63,64]. These methods can be very useful in biology, medicine, and the life sciences, e.g., [65,66]. However, in certain domains, especially in the medical domain, there is a need for causability, which refers to a human model, instead of the technical approach of explainability [65,67]. Causability, introduced in reference to usability, is the measurable extent to which an explanation produced by an xAI method for a human expert reaches a certain level of causal understanding, measured with the System Causability Scale [68]; causal is meant here in Judea Pearl's sense, as the relationship between cause and effect [69]. Understanding can be reached if explainability is mapped to causability, which requires new human-AI interfaces that allow domain experts to interactively ask questions and pose counterfactuals to gain insights into the underlying explanatory factors of an outcome [70], likewise supporting reproducibility [71].

Fostering Exchange and Use Cases for Glioma Research
Brain tumor modeling studies exist that allow for the simulation of tumor growth [29] and resection [30]; they make use of open data and provide open source implementations to reproduce and further refine model parameters. Moreover, using open data for cancer research can support biomarker prediction [72]. In a pan-cancer analysis of TCGA data, the evaluation of the mRNA levels of traditionally used reference genes revealed novel ones for specific cancer types [38]. Brain tumor subtype classification has been based on TCGA brain cancer multi-omics data [37].
Network analysis and clustering benefit from several open cancer resources [34,35]. The combination of various data sets and types can further lead to novel findings on signal transduction events, leading to new therapy possibilities. This has been done, for example, using publicly available gene expression data from GEO, transcriptomic sequencing data from CGGA, and RNA-sequencing data from TCGA [40]. Another notable example for targeting cancer studies is the immune landscape of cancer [49]. An exemplary glioma-specific web resource uses raw and annotated data from several sources, including TCGA, GEO, COSMIC, ClinVar, and FDA, for network visualizations [33]. Another example is a web resource on metabolomic data [73,74]. Metabolic data have likewise been used for molecular classification and biomarker discovery in glioma research [75]. Additionally, several metabolic alterations have been highlighted in glioma patients [32]. A further use case combines metabolic profiles with transcriptomic and proteomic data [39].
Radiomics is the discipline of medical image analysis that harnesses radiomic features, i.e., quantitative metrics extracted using methods such as feature calculation, selection, dimensionality reduction, and data processing [76]. Noninvasive imaging is readily used for monitoring tumor mass and treatment resistance and can be included in patient-specific models of tumor growth and response to chemoradiation [28]. Open access tools and medical image repositories already exist to support radiomic approaches [77]. Moreover, open data is used for solving brain tumor segmentation challenges [41,42]. Classification can be based on various data types, including radiomics [44]. The combined use of medical images and genomic features, described by radiogenomics, can be used for clinical outcome prediction and guiding therapy [78].
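To make the notion of feature calculation concrete, the following sketch computes three simple first-order features (mean, variance, and histogram entropy) from the pixels inside a region of interest. The function name first_order_features, the bin width, and the toy 2D image and mask are assumptions made for this illustration. Real radiomics pipelines operate on 3D volumes with standardized preprocessing and typically rely on dedicated open source tools rather than hand-written code.

```python
import math
from collections import Counter


def first_order_features(image, mask, bin_width=10):
    """Compute simple first-order features from pixels where the mask is 1."""
    voxels = [
        v
        for img_row, mask_row in zip(image, mask)
        for v, m in zip(img_row, mask_row)
        if m
    ]
    n = len(voxels)
    mean = sum(voxels) / n
    variance = sum((v - mean) ** 2 for v in voxels) / n
    # Discretise intensities into fixed-width bins before computing entropy
    counts = Counter(int(v // bin_width) for v in voxels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"mean": mean, "variance": variance, "entropy": entropy}


# Toy 2D "image" with a bright region on the right; the mask marks the region of interest
image = [
    [10, 12, 50, 52],
    [11, 13, 51, 53],
    [0, 0, 90, 95],
]
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
features = first_order_features(image, mask)
print(features)
```

The entropy depends on the chosen bin width, which is one reason why feature definitions and preprocessing settings need to be documented for radiomic studies to be reproducible.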
In both public and scientific communication, the goal is to foster understanding [79], for example by communicating cancer inequities [23] or by facing challenges such as uncertainties [80]. Mapping and visualization help to address varying informational needs [48,81]. Glioma-specific gene expression can be analyzed visually with the Glioblastoma Bio Discovery Portal (GBM-BioDP) [48], next to other, more general TCGA visualization tools [72]. Prognostic markers, as well as genetic risk factors, can be reviewed with the help of molecular epidemiology [47]. The Surveillance, Epidemiology, and End Results (SEER) data can be used to study risks that may occur after radiation therapy of pediatric low-grade glioma (LGG) [82]. Bibliometrics can show trends for specific research topics. Figure 1 shows the growing number of published documents on open data related to cancer, as well as the open access share of publications. Bibliometric analyses related to glioma exist, which make use of Scopus and rank both open access and closed access publications [45,46]. Data from the past years were used to report estimations on new cases and deaths globally for the upcoming year. Challenges arise for cancer registries in exchanging incidence data because of national regulations concerned with data privacy [23]. Examples such as the proportional increase in open access publications on glioma, illustrated in Figure 1, show a small but recognizable trend towards opening science.

Conclusions
Given all the opportunities that come with open data as a part of open science, it remains essential to publish more data with free access. Among the top limitations are data privacy laws, technology, and lack of expertise [83]. Challenges regarding data privacy include re-identification risks [10]. In contrast to legal issues with privacy, computer science methods are concerned with data protection, which brings us to technological challenges and, up to now, certain limitations. "The more, the better" is not always true. Regarding the application of ML models in cancer, large datasets may result in overfitting and/or bias; therefore, training data sets should be diverse as well as representative [58]. Thus, next to quantity, data quality is of particular importance, since data sets used for AI approaches require thorough curation and processing [84]. This includes several factors, such as expert labeling [85], completeness, harmonization, and standardization [86], as well as validation [87].
There is a discipline-specific tendency to share data openly: it is common among biology researchers but less so among medical or pharmaceutical scientists. This tendency is based on several drivers and inhibitors for sharing and using open research data, including the researcher's background and experience, intrinsic motivation, trust, facilitating conditions, social influence and affiliation, expected performance, effort, requirements and formal obligations, legislation and regulation, and data characteristics [2].
The growth of image repositories is expected to have a great, clinically relevant impact on AI in the future [56]. Unfortunately, many examples in radiomics lack openness, both in data and source code, and therefore reproducibility. As radiomics becomes more interdisciplinary, including not only medicine but also computer science, reports are emerging that already include accessible links to source code within the publication, such as radiomic studies [41,42,43] as well as other related cancer research [29,35,52]. Another issue concerns long-term financing. Examples such as GliomaDB [33] show that small projects with limited funding can only offer temporary solutions. To sustain such solutions, it is essential to think beyond distribution and maintenance.
More openness across institutes will help us to exchange research with others and foster novel outreach and engagement activities. Therefore, we propose to share and reuse research output towards decoding diseases, such as cancer, together [8,9,14,25,50,72,88].

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Abbreviations
The following abbreviations are used in this manuscript: