1. Introduction
High-throughput sequencing technology has enabled a revolution in the field of genetics over the past few decades. In brief, this family of technologies allows the genome to be broken into small components and sequenced in a massively parallel way; an entire genome can be sequenced in a day instead of the decade it took to generate the first draft [
1]. Scientists have used these technologies to examine functional components of a whole genome: most commonly the exome, but also promoter and enhancer regions and epigenetic markers [
2]. Since the cost scales with the size of the genomic regions examined, scientists often focused on the much smaller, but more information-dense, exome region of the genome. Research has, however, shown that the non-coding region of the genome holds variants with explanatory power due to regulatory effects [
3,
4,
5,
6]. Now that costs have decreased dramatically [
7] and laboratory consortia have formed to pool resources, whole genome sequencing has become increasingly economically viable and increasingly common in research projects.
The true cost of high-throughput sequencing includes both the raw material costs and the bioinformatics costs. Bioinformatics costs cover computational resources needed to store, analyze, and perform specialized tasks such as quality control and annotation on the data [
8]. The costs associated with the analysis can now easily eclipse the raw material cost of a sequencing run due to the specialized training and computational resources required [
9]. With these overhead requirements in mind, it is advantageous for researchers to share their processed data with each other to facilitate more rapid research in the field. Organizations have built infrastructures to facilitate the sharing and re-use of data. The National Cancer Institute (NCI), for example, has invested resources in its Cancer Research Data Commons (CRDC) to help drive research innovation [
10,
11,
12].
In this spirit, we generated 1342 variant call datasets based on The Cancer Genome Atlas (TCGA) whole genome sequence (WGS) and alignment data. The dataset was constructed after analyzing the TCGA dataset for suitable whole genome sequencing experiments. The entries varied in their composition, so we focused on the unambiguous normal–tumor pairs, which left 2207 potential samples to analyze as shown in
Table S1.
3. Results
In the original analysis [
13], five different TCGA cancer types were covered over 154 samples. This paper aimed to extend this to 1386 samples covering 18 different cancer types in the TCGA dataset (see
Table 1 for cancer code definitions used by TCGA and within this paper). Of this number, 1342 passed validation (see Validation section,
Supplementary Table S2 for successful samples, and
Supplementary Table S3 for failed samples), leaving 44 entries that failed quality control.
Final data generation revealed 157,313,519 pooled (non-unique) cancer-associated single-nucleotide variations (SNVs) across all samples. This was an average of 117,223 SNVs per sample with a range from 1111 to 775,470 and a standard deviation of 163,273, illustrating the variation in sample preparation and experimental design decisions made by the participating laboratories.
Figure 3 shows the distribution of variant counts within each cancer type. It is important to note that the results from this cohort are unique to this study and should not be viewed as summaries of each cancer type in general. Due to differences in read depth between cancer types, as well as the different research goals set before sequencing, the numbers represent only a snapshot of this particular set and not the cancers as a whole.
The makeup of the dataset ranges from Acute Myeloid Leukemia (LAML) at 0.89% of the samples to Uterine Corpus Endometrial Carcinoma (UCEC) at 9.17%. While this range is large, the average of each of these 18 cancer types was 5.56%, the median value was 6.04%, and the standard deviation for the ratios was 2.54. The sample distribution by cancer type is relatively balanced, as seen in
Figure 4, which will lend itself to many different types of analysis.
When looking at the distribution of number of variants across cancers, a slightly different picture emerges, as there are a few outliers.
Table 2 shows, for each cancer type, the mean and median number of variants per sample, the standard deviation within the cancer type, the minimum and maximum variant counts, and the sample counts. Both Acute Myeloid Leukemia (LAML) and Sarcoma (SARC) are interesting in that they have high minimum values (i.e., even the sample with the fewest variations in that cancer type has a high count) and low standard deviations. While the standard deviation scores could be largely due to the limited sample sizes, the high minimum cancer-associated variation counts for these samples are intriguing and could suggest something biologically distinct about them.
Figure 5 illustrates this by plotting, for each cancer type, the minimum sample variant count against the maximum, with bubble size representing the average variant count. LAML and SARC appear to be outliers relative to the rest of the cancers. The general trend shows the minimum counts to be independent of the maximum counts, and the average SNV count is generally not related to the maximum (or the minimum) count.
Looking next at the non-coding region of the genome, we see in
Figure 6 that most cancer types analyzed in this study have widely different amounts of variation found in the non-coding region versus the coding. The percentages of coding SNVs were calculated by taking the raw counts of variants found in CDS regions [
19,
20,
21] and dividing by the number of variants found in total (excluding mitochondrial variants since consensus mitochondrial coding regions are not included in the dataset used). The wide variety of ratios found through the samples in most cancers is in stark contrast to SARC, LAML, and KIRP. These three cancers exhibit, within this study, the property that all of their samples have nearly the same ratio of coding to non-coding and also have nearly the same number of non-coding and coding variations.
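The coding-percentage calculation described above amounts to a simple ratio; a minimal sketch in Python (the function name and arguments are ours, for illustration):

```python
def coding_fraction(cds_count, total_count, mito_count=0):
    """Fraction of SNVs falling within CDS regions.

    Mitochondrial variants are subtracted from the denominator, since
    consensus mitochondrial coding regions are not in the CDS dataset used.
    """
    denom = total_count - mito_count
    return cds_count / denom if denom else 0.0
```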
The SNV calls provide valuable information on cancer-associated changes in the genome outside of the original research driven by them [
13]. In order to facilitate contributing this information back to the scientific community, we have deposited the processed output with the NCI Cancer Data Service (CDS). The submission is registered with the Database of Genotypes and Phenotypes (dbGaP) and is accessible in CDS to authorized researchers through their cloud service (
https://datacommons.cancer.gov/repository/cancer-data-service, accessed on 1 June 2022). Searchable tables for metadata related to the samples are available via the Institute for Systems Biology (ISB)’s Cancer Gateway in the Cloud (ISB-CGC) (
https://isb-cgc.org, accessed on 1 June 2022), one of NCI’s Cloud Resources (
https://datacommons.cancer.gov/analytical-resource/isb-cancer-gateway-cloud, accessed on 1 June 2022) [
22].
3.1. Data Records
The pipeline described in the Methods section was run on normal–tumor paired samples from TCGA. Several record types were generated through this pipeline, with the high-confidence somatic single-nucleotide variation (SNV) calls being deposited to the CDS cloud repository.
These high-confidence somatic SNVs were generated for each of the processed samples as a standard VCF file and ingested into BigQuery as a single table across all samples. Raw VCF files can be regenerated from the BigQuery table after ingestion if needed (see Use Case 3 in this paper).
The SNV dataset identifies individual positions in the genome where a nucleotide in the cancer tissue differs from the reference genome (hg19) but the paired normal tissue does not. Both the normal and tumor samples were also required to have adequate coverage; in the current pipeline, this translates to a read depth of at least 8 for normal tissue reads and 6 for tumor tissue reads at the location of the SNV call. VarScan2 calls SNVs by using a heuristic method and performing a statistical test that considers the number of aligned reads for each allele [
20,
21,
22].
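The coverage requirement can be sketched as a simple filter; the depth thresholds come from the pipeline description above, while the field names are hypothetical:

```python
# Minimal sketch of the coverage filter applied to candidate SNV calls.
# The 'normal_dp'/'tumor_dp' keys are illustrative names, not the
# pipeline's actual field names.

MIN_NORMAL_DEPTH = 8  # minimum read depth in the normal tissue
MIN_TUMOR_DEPTH = 6   # minimum read depth in the tumor tissue

def passes_coverage(normal_depth, tumor_depth):
    """Return True when both tissues meet the minimum read depth."""
    return normal_depth >= MIN_NORMAL_DEPTH and tumor_depth >= MIN_TUMOR_DEPTH

def filter_calls(calls):
    """Keep only candidate SNV calls with adequate coverage in both tissues."""
    return [c for c in calls if passes_coverage(c["normal_dp"], c["tumor_dp"])]
```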
3.2. Technical Validation
We pursued several strategies to assess the quality of the data and to correct for errors that are common in highly parallelized pipelines. After the variant calling pipeline finished, the output data were a compressed archive of different sets of information including SNVs, indels, germline mutations, error logs, etc. If a computation failed, the output archive was only approximately 10 kb in size, which allowed quick screening for failed computations that needed to be rerun or investigated further. Such failures were semi-frequent for several reasons; for example, the controlled-access data sometimes required more time to transfer than our access window allowed, occasionally causing transfer issues.
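The size-based screen can be sketched as follows; the exact cutoff is not critical, so the value below is an illustrative assumption rather than the pipeline's actual threshold:

```python
# Failed computations produced archives of roughly 10 kb, far smaller
# than any successful run, so a simple size cutoff separates them.
FAILURE_CUTOFF_BYTES = 100 * 1024  # assumed cutoff for illustration

def likely_failed(archive_sizes):
    """Return names of output archives small enough to indicate failure.

    `archive_sizes` maps archive name -> size in bytes.
    """
    return sorted(name for name, size in archive_sizes.items()
                  if size < FAILURE_CUTOFF_BYTES)
```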
Additionally, as each step of the computation ran, we reported progress and command output values to a database running in the cloud cluster. This database accepted a wide variety of values, including error codes of interest and computational metrics for future regression analysis. Examining the database was useful for diagnosing computational problems as they occurred and allowed for on-the-spot corrections, which most often meant re-running a sample due to I/O issues or file corruption.
Variations were mapped to 10,000 base windows to bucket the variations for visualization purposes. These windows were plotted in a Manhattan plot-type graph with region number and chromosome on the x-axis, while the y-axis was the number of variations found in the region. Computations that failed to produce expected results across the genome would have entire chromosomes missing, which indicated a computational problem. Samples which failed this test were re-run and all were successfully recovered.
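The windowing and missing-chromosome check can be sketched as follows (the hg19 chromosome list is standard; the function names are ours):

```python
from collections import Counter

WINDOW = 10_000  # bases per window, as used for the validation plots

def window_counts(variants):
    """Bucket variant positions into 10,000-base windows per chromosome.

    `variants` is an iterable of (chromosome, position) tuples; returns
    a Counter keyed by (chromosome, window index).
    """
    return Counter((chrom, pos // WINDOW) for chrom, pos in variants)

def missing_chromosomes(counts):
    """Chromosomes with no windows at all point to a computational failure."""
    expected = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    seen = {chrom for chrom, _ in counts}
    return [c for c in expected if c not in seen]
```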
To shield against outliers, we flagged samples whose total variant counts fell more than one standard deviation above the mean across all samples. Since an alignment cannot produce fewer than zero variations, the data are right-skewed, and the validation checked only the right side, allowing the left side to be arbitrarily low. This is because an arbitrarily low count of variations is not necessarily an outlier, since only a handful of variations could conceivably lead to cancer; on the other hand, it is difficult to imagine that millions of variations would be required for oncogenesis, and such counts more likely reflect mismatched samples (or some other error). In any case, with the average total variation count (not restricted to high-confidence somatic calls, in order to capture a broader set of outliers) being 1,651,974 and the standard deviation being 4,090,131, no left-side values would have failed the test even if they were not excluded. All samples whose variant counts fell above 5,742,105 were discarded as outliers using this methodology and are reported in
Table S3. This was a total of 44 out of the 1386 finished computations.
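This right-tail check can be sketched as follows; whether the pipeline used the population or sample standard deviation is not stated, so `pstdev` below is an assumption:

```python
from statistics import mean, pstdev

def right_tail_cutoff(total_counts):
    """Mean plus one standard deviation of the total variant counts."""
    return mean(total_counts) + pstdev(total_counts)

def flag_outliers(sample_counts):
    """Return sample IDs whose total variant count exceeds the cutoff.

    `sample_counts` maps sample ID -> total variant count; only the
    right tail is checked, per the rationale above.
    """
    cutoff = right_tail_cutoff(list(sample_counts.values()))
    return sorted(s for s, n in sample_counts.items() if n > cutoff)
```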
An additional quality control strategy was that all downloads were checked against their md5sum hashes (checksums that change if even a single bit of the data changes) to verify that no download-related corruption had occurred. Several samples were only partially transferred during download, and a simple retransfer corrected this.
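A minimal sketch of the checksum verification using Python's standard library:

```python
import hashlib

def md5_of(payload):
    """Hex MD5 digest of a downloaded payload (bytes)."""
    return hashlib.md5(payload).hexdigest()

def download_intact(payload, expected_md5):
    """True when the payload matches the checksum published for the file."""
    return md5_of(payload) == expected_md5.lower()
```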
A final validation was performed after the BigQuery tables were ingested. The tables were checked to confirm that they contained the expected number of samples and that the variant counts matched those seen in the raw VCF files.
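A minimal sketch of this post-ingest check, assuming per-sample variant counts have already been tallied from both the table and the raw VCF files (function and variable names are illustrative):

```python
def validate_ingest(expected_samples, table_counts, vcf_counts):
    """Compare the ingested BigQuery table against the raw VCF files.

    `table_counts` and `vcf_counts` map sample ID -> variant count;
    returns a list of discrepancy messages (empty means the check passed).
    """
    problems = []
    if set(table_counts) != set(expected_samples):
        problems.append("sample set mismatch")
    for sample in sorted(set(table_counts) & set(vcf_counts)):
        if table_counts[sample] != vcf_counts[sample]:
            problems.append(f"count mismatch for {sample}")
    return problems
```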
4. Discussion
Here, we present a dataset which covers 1342 WGS cancer-associated SNV calls across 18 different cancer types as defined by TCGA. These data cover an assortment of variants within the whole genome and offer an opportunity for researchers to deeply dive into differences between normal and cancerous tissue within the same patient.
The pipeline itself, as illustrated in
Figure 1, can be used by researchers as a blueprint to run their own analysis on normal–tumor paired data either from the TCGA or other studies. The software provided in the associated GitHub repositories can be modified to support non-TCGA content and accelerate the computational component to many research questions by lowering the barrier to performing these computations.
Currently available datasets through TCGA provide only whole exome sequencing (WXS)-level data. While there is a high number of exome sequences for normal–tumor paired samples as a result of the TCGA project, whole genome sequences have remained in raw format. This publication presents the cancer-associated SNVs for many of the WGS samples published by TCGA. With this dataset in hand, researchers can supplement their current research with these enriched data without bearing the high computational and financial cost of determining the variants themselves.
From this study, SARC and LAML data within the TCGA project are interesting outliers. All the samples within these cancers have an unexpectedly high minimum number of variants compared with the other cancers in the set. The fact that they appear to be outliers warrants a closer look at the specifics of the data collection and sequencing for these cancers before using them in further studies, as they may be inappropriate datasets for some contexts. Other cancer types have many samples with strikingly low numbers of variants. These samples tend to be ones collected early in progression, which matches expectations, as younger cancers would have had less time to accumulate mutations. An upper ceiling on variations would not be expected, since these samples were taken at various stages of cancer growth and therefore could have had significant time to continue to mutate. This finding provides confidence in using those specific cancer-associated variations.
These two cancer types, along with KIRP, also show some interesting differences compared with the other cancer groups with respect to non-coding vs. coding variation counts. Within this study, the samples in these cancer types have very similar percentages of coding variants out of the entire variant set, and nearly the same ratio of non-coding to coding variations. This implies that the methodology used for these cancers in the TCGA project may have introduced some bias; alternatively, it could imply that there is something particular about these types of cancers that distributes variations in this way. Whether non-coding variations have a special meaning for these cancers is an interesting research topic.
The ability to easily use this dataset was one of the authors' priorities during this study. Several example ways to use the variant datasets are described below.
4.1. Use Case 1: Determination of New Entries to a Cancer Database
Cross-mapping between annotation and variant databases is a common use case to increase the value of the variant database. In addition to annotation and variant cross-mappings, a similar approach is used to enrich a variant database with additional entries. The first use case was to take an existing cancer variant database and determine, quickly, how many variants are within the new dataset that are not represented in the existing dataset. We used the BioMuta database [
23], which primarily focuses on exome region variants. We therefore expected that many new variants would be found in the high-confidence somatic dataset published in this paper.
We first ingested BioMuta into BigQuery as a custom dataset, following the BigQuery instructions. From the BigQuery web interface, we constructed a SQL command to map the high-confidence somatic variations against the BioMuta database and determine how many variants found in this project are not yet represented there. While constructing such a command can seem daunting, SQL is relatively easy to learn, and many scientists are already familiar with it.
An example SQL command is provided here, although there would need to be slight modifications depending on the details of the BioMuta ingestion.
SELECT t.*
FROM `isb-cgc-04-0026.fs_scratch.tcga_variants` AS t
WHERE NOT EXISTS(SELECT 1
FROM biomuta.v4_0 AS v4_0
WHERE t.CHROM = CONCAT("chr", CAST(v4_0.chr_id AS STRING)) AND t.POS = v4_0.chr_pos);
This command retrieves from the high-confidence somatic variant BigQuery table all of the variants that do not appear in the BioMuta version 4 dataset. The number of rows in the resulting table is the number of variants not represented in BioMuta; we found 7,630,735 such (non-unique) variants. The BigQuery search creating this mapping took 18.3 s.
4.2. Use Case 2: Generation of Summary Statistics of the Dataset
Summary statistics of the dataset can easily be generated either through the BigQuery API via Python (or another language) or directly in SQL within BigQuery. A simple example is generating the counts of high-confidence somatic variations for each cancer type.
SELECT COUNT(CHROM), project_short_name
FROM `isb-cgc-04-0026.fs_scratch.tcga_variants`
GROUP BY (project_short_name);
This SQL command will generate a table of each of the cancer types as well as the variation counts found in the somatic high-confidence table. It performs this task by reading the project name and counting the number of hits for each, and then presenting the results as the output from the command. The project counts from this use case are included in
Table 3.
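The same aggregation can also be reproduced client-side on rows fetched through the BigQuery API, as mentioned above. A minimal sketch in Python (the `project_short_name` key mirrors the table column; the rows themselves are hypothetical):

```python
from collections import Counter

def counts_by_project(rows):
    """Client-side equivalent of the GROUP BY query.

    `rows` is an iterable of row dicts carrying a 'project_short_name'
    key, e.g. rows fetched through the BigQuery API.
    """
    return Counter(row["project_short_name"] for row in rows)
```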
4.3. Use Case 3: Regenerate a VCF File from the BigQuery Tables
While the data as published in BigQuery tables are useful for cross-table investigation, it is often required to have a VCF-formatted file for a specific pipeline where the software is expecting that format. A standard formatted VCF file can be directly generated from the BigQuery tables, as needed.
Unlike the other use cases, this case requires the output from the BigQuery table as the input and then processes it into a standard VCF-format file. The table output can be retrieved either through the BigQuery interface by running a general fetch command focused on the sample of interest (shown below) and saving the output table in comma-separated value (CSV) format through the BigQuery tools, or through the BigQuery API in Python or another language (not shown).
SELECT * FROM `isb-cgc-04-0026.TCGA_WGS_HG19_VCF.somatic_hc_variants` WHERE project_short_name = 'TCGA-44-2656'
Once the input data have been generated, a simple script can be used to convert these data into a VCF file (see Code Availability below) by generating the VCF header text and then looping through each of the entries and outputting into the appropriate VCF formatting. This script accepts a TCGA ID as a required parameter and can also be given a specific chromosome and a limit to the number of SNVs returned, if desired. Specific instructions on running the script are provided in the repository.
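The conversion step can be sketched as follows; the column names mirror standard VCF fields and are assumptions here, and the published script in the repository remains the authoritative version:

```python
# Build a minimal VCF header, then emit one record per exported row.
VCF_HEADER = "\n".join([
    "##fileformat=VCFv4.2",
    "##reference=hg19",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
])

def rows_to_vcf(rows):
    """Render exported BigQuery rows (dicts) as a VCF-formatted string."""
    lines = [VCF_HEADER]
    for row in rows:
        lines.append("\t".join([
            str(row["CHROM"]), str(row["POS"]), ".",
            row["REF"], row["ALT"], ".", "PASS", ".",
        ]))
    return "\n".join(lines) + "\n"
```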
4.4. Study Limitations
While this study presents a large, cancer-associated SNV dataset, there are several limitations that should be noted. The first is sample size. Even 1342 whole genome cancer variant sets are likely to be insufficient to untangle cancer comprehensively. Cancer is an umbrella term capturing diseases of many different tissues; further, even within these categories, each individual cancer can be caused by different mechanisms of cellular dysregulation. Therefore, it is unlikely that even a dataset of this size will have enough explanatory power to answer all biological questions related to even a single type of cancer. The hope is that this dataset, along with many others that are produced, can help drive understanding of this disease when supplementing ongoing research.
The dataset is not a comprehensive processing of all of the TCGA sequencing data. There are around 2200 whole genome normal–tumor pairs within the consortium's data, meaning this dataset covers roughly 60% of the available pairs. For several reasons, including computational and logistical constraints and a conservative approach to data that were in question, processing the entire set was outside the scope of this experiment. This represents both a limitation of this dataset and an opportunity for additional data to supplement more specific research questions.
Additionally, TCGA data represent a single coherent study and may not represent all cancer data. The consortium study had various standards for data collection and analysis which are incredibly useful for comparing data between the different laboratories, but also run the risk of biasing the entire dataset in some way. Care will need to be taken when combining this dataset with others to make sure they are compatible.
Any insights gleaned from the dataset would need to be validated against real-world samples. A purely data-driven approach can only point us in the right direction for research but cannot currently replace validation-level research that would be required in a clinical setting.
The TCGA project utilized short-read, next-generation sequencing which, while revolutionary, does not offer a comprehensive genetic profile of the genome. Longer-read technologies help in examining copy number variation; other techniques are used for 3D mapping of the genome, non-coding rearrangements, and other block rearrangements; and many more methods are under development to see beyond the nucleotide-level sequence. All of these techniques, and more, would be useful in a full examination of the cancer genome. This study reports on the short-read results and is therefore limited to the information captured by those techniques.
Finally, despite the high-confidence estimates used in the study, there are likely to remain false positive variants in this set. It is possible that many of these have no or very low impact, which may be difficult to deduce without a large quantity of cancer genomic data, far beyond this study.
Even with these limitations, the dataset of 1342 whole genome, cancer-associated, high-confidence SNV calls provides an exciting opportunity for researchers to supplement their current and future studies.