Mitochondrial RNA Expression and Single Nucleotide Variants in Association with Clinical Parameters in Primary Breast Cancers

The human mitochondrial DNA (mtDNA) encodes 37 genes, including thirteen proteins essential for the respiratory chain, and RNAs functioning in the mitochondrial translation apparatus. The total number of mtDNA molecules per cell (mtDNA content) is variable between tissue types and also between tumors and their normal counterparts. For breast cancer, tumors tend to be depleted in their mtDNA content compared to adjacent normal mammary tissue. Various studies have shown that primary breast tumors harbor somatic mtDNA variants. A decrease in mtDNA content or the presence of somatic variants could indicate a reduced mitochondrial function within breast cancer. In this explorative study we aimed to further understand genomic changes and expression of the mitochondrial genome within breast cancer, by analyzing RNA sequencing data of primary breast tumor specimens of 344 cases. We demonstrate that somatic variants detected at the mtRNA level are representative for somatic variants in the mtDNA. Also, the number of somatic variants within the mitochondrial transcriptome is not associated with mutational processes impacting the nuclear genome, but is positively associated with age at diagnosis. Finally, we observe that mitochondrial expression is related to ER status. We conclude that there is a large heterogeneity in somatic mutations of the mitochondrial genome within primary breast tumors, and differences in mitochondrial expression among breast cancer subtypes. The exact impact on metabolic differences and clinical relevance deserves further study.


Introduction
Mitochondria are small organelles involved in multiple cellular processes. They are most renowned for their role in energy production, since they contain their own circular genomic entity encoding proteins essential for the respiratory chain and thereby for generating cellular ATP via oxidative phosphorylation. The human mitochondrial DNA (mtDNA) is gene-dense consisting of 16569 base pairs encoding 37 genes: thirteen proteins, and two rRNAs and twenty-two tRNAs functioning in the mitochondrial translation apparatus. Polycistronic transcription of mtDNA is initiated at the non-coding D-loop region, and the resultant precursor transcripts are processed by excision of the tRNA genes ("tRNA punctuation model" [1]) generating individual mitochondrial tRNA, rRNA and mRNA transcripts. The total number of mtDNA molecules per cell (mtDNA content) is variable between tissue types, and interestingly also between tumors and their normal counterparts [2]. For breast cancer specifically, tumors tend to have reduced mtDNA content compared

Results
We evaluated RNA sequencing data of 344 primary breast tumor specimens. After mapping of sequencing reads against the human reference genome, median 15% (Interquartile range (IQR) 10-23%) of the uniquely mapped reads were assigned to the mitochondrial contig, resulting in median 9889× read depth (IQR 5333) of mtDNA.

Somatic Variants in mtRNA
Variant calling resulted in a total of 9063 single nucleotide variants (SNVs) on 1600 positions and 84 small insertions or deletions (INDELs) on 38 positions of the mitochondrial genome within the 344 cases ( Figure 1). Since INDELs were only a minority, our focus was on the SNVs only. We defined SNVs as somatically acquired tumor variants when not associated with the individual's haplotype (n = 7235 excluded, 80%) or with heteroplasmic allele frequency of ≤95% (n = 917 excluded, 10%). Also, we defined the variants at position 2617 (r.2617a>u and r.2617a>g, present in respectively n = 340 and n = 101 cases) as not tumor-specific because (1) they have been described previously as RNA-DNA differences in blood cells of non-cancer patients [18,19] and (2) we confirmed their presence in a transcriptomic dataset of normal specimens of various tissue types including breast tissue [20] (Supplementary Materials Table S1). After these exclusions, a total of 470 somatic variants on 429 positions were identified.
Our dataset has overlapping cases (n = 165) with the dataset published by Ju et al. [15] concerning somatic mitochondrial variants in tumor and matched normal specimens at the DNA level. This allowed us to directly compare called variants between the two datasets (see also Appendix A) to evaluate presence, classification and allele frequency of variants. Since variants at position 2617 are known RNA-DNA differences (see above) and indeed not called in the DNA dataset, these were not included in this comparison. A total of respectively 3997 and 4009 SNVs were called at the RNA and DNA level within the primary tumor specimens of the 165 cases. The majority of the variants were called at both the RNA and DNA level (n = 3889, respectively 97.3% and 97.0%), whereas a small fraction was only called at either the RNA or the DNA level (respectively n = 108 (2.7%) and n = 120 (3.0%) variants) ( Figure 2). Of the variants detected at both the RNA and DNA level, only a few (n = 10, 0.3%) had a discrepancy in classification as either 'somatic' or 'germline' (Figure 2). Also, good consistency was observed in allele frequency at the RNA and DNA level (linear fit coefficient of 0.92 for all variants and 0.96 for somatic tumor variants). From this we concluded that presence,     We then continued to further decipher the somatic mtRNA variants in our dataset (n = 470 in n = 344 cases). The variant allele frequency of the somatic variants was distributed with a peak at the lower and at the upper end of allele frequencies (Supplementary Materials Figure S1). There was no correlation between the variant allele frequency and the percentage of invasive tumor cells in the evaluated specimen (Spearman correlation coefficient rho = 0.03, p = 0.5). The detected somatic variants were distributed along the entire mitochondrial genome (Figure 1), with 40 (8.5%) variants located in the tRNA genes, 69 (14.7%) in rRNA genes, 85 (18.1%) in the D-loop, 1 (0.2%) in the non-coding regions, and 275 (58.5%) in the mRNA genes of which 212 (77.1%) had a nonsynonymous effect on the coding amino acid (Figure 3). However, relative to their genomic size (9.0% tRNA genes, 15.1% rRNA genes, 6.8% D-loop, 0.4% non-coding and 68.7% mRNA genes) more variants were present in the D-loop and fewer in the mRNA genes (Fisher exact p < 0.001). Also in comparison to the germline variants (variants that were associated with the haplogroup of that individual or with an allele frequency > 95%, n = 8152) there was a difference in genomic distribution (Fisher's exact p < 0.001) with fewer somatic variants in the D-loop but more in the tRNA and mRNA genes, and an enrichment for somatic nonsynonymous mRNA variants (Figure 3). The positions of somatic variants were much more conserved among species compared to the germline variants (Mann-Whitney test p < 0.001), as measured by the fraction of species that harbor the reference sequence at that position (Conservation Index of respectively median (IQR) 0.93 (0.36) and 0.76 (0.69)). A total of 69 (15%) somatic variants were recurrent and positioned on 28 mitochondrial positions. The majority of the somatic variants (95%) represented the typical replication-coupled mtDNA substitution pattern with predominantly C > T and T > C transitions as described previously [15,16,21] in a nucleotide context similar to the germline variants ( Figure 4). However, compared to the detected germline variants the ratio between C > T and T > C variants is shifted (Fisher exact p < 0.001) with an increased number of C > T transitions among the somatic variants ( Figure 4).
In the entire cohort, there are 112 (33%) cases with 0 somatic variants, 97 (28%) with 1 somatic variant, and 135 (39%) with more than 1 somatic variant (range 2 to 7). Of the cases with more than 1 somatic variant, 82 (61%) had a difference > 20% allele frequency between variants, indicative for (sub-)clonality. We then continued to further decipher the somatic mtRNA variants in our dataset (n = 470 in n = 344 cases). The variant allele frequency of the somatic variants was distributed with a peak at the lower and at the upper end of allele frequencies (Supplementary Materials Figure S1). There was no correlation between the variant allele frequency and the percentage of invasive tumor cells in the evaluated specimen (Spearman correlation coefficient rho = 0.03, p = 0.5). The detected somatic variants were distributed along the entire mitochondrial genome (Figure 1), with 40 (8.5%) variants located in the tRNA genes, 69 (14.7%) in rRNA genes, 85 (18.1%) in the D-loop, 1 (0.2%) in the non-coding regions, and 275 (58.5%) in the mRNA genes of which 212 (77.1%) had a nonsynonymous effect on the coding amino acid ( Figure 3). However, relative to their genomic size (9.0% tRNA genes, 15.1% rRNA genes, 6.8% D-loop, 0.4% non-coding and 68.7% mRNA genes) more variants were present in the D-loop and fewer in the mRNA genes (Fisher exact p < 0.001). Also in comparison to the germline variants (variants that were associated with the haplogroup of that individual or with an allele frequency > 95%, n = 8152) there was a difference in genomic distribution (Fisher's exact p < 0.001) with fewer somatic variants in the D-loop but more in the tRNA and mRNA genes, and an enrichment for somatic nonsynonymous mRNA variants ( Figure 3). The positions of somatic variants were much more conserved among species compared to the germline variants (Mann-Whitney test p < 0.001), as measured by the fraction of species that harbor the reference sequence at that position (Conservation Index of respectively median (IQR) 0.93 (0.36) and 0.76 (0.69)). A total of 69 (15%) somatic variants were recurrent and positioned on 28 mitochondrial positions. The majority of the somatic variants (95%) represented the typical replication-coupled mtDNA substitution pattern with predominantly C > T and T > C transitions as described previously [15,16,21] in a nucleotide context similar to the germline variants ( Figure 4). However, compared to the detected germline variants the ratio between C > T and T > C variants is shifted (Fisher exact p < 0.001) with an increased number of C > T transitions among the somatic variants ( Figure 4).
In the entire cohort, there are 112 (33%) cases with 0 somatic variants, 97 (28%) with 1 somatic variant, and 135 (39%) with more than 1 somatic variant (range 2 to 7). Of the cases with more than 1 somatic variant, 82 (61%) had a difference > 20% allele frequency between variants, indicative for (sub-)clonality. Genomic distribution is depicted for somatic (left) or germline (right) variants in either non-coding (purple), the D-loop (orange), tRNA (red), rRNA (blue) or mRNA (green) regions of the mitochondrial genome. The percentage of total is indicated at the top of the bars. The percentage of substitutions in the mRNA regions with either a synonymous or nonsynonymous effect is indicated within the mRNA bar (light green). Note that variants at position 2617 (known RNA-DNA differences) are not included. Genomic distribution is depicted for somatic (left) or germline (right) variants in either non-coding (purple), the D-loop (orange), tRNA (red), rRNA (blue) or mRNA (green) regions of the mitochondrial genome. The percentage of total is indicated at the top of the bars. The percentage of substitutions in the mRNA regions with either a synonymous or nonsynonymous effect is indicated within the mRNA bar (light green). Note that variants at position 2617 (known RNA-DNA differences) are not included.

Somatic Mitochondrial Variants in Relation to Somatic Variants in the Nuclear Genome
Next, to gain more insight into the relation between the mutational processes shaping mtDNA and nDNA, we associated the amount of somatic mtRNA variants with the number of somatic variants induced by the known major mutational patterns shaping the nDNA. For this purpose, we obtained for the overlapping cases (n = 268) the number of nDNA variants as published by Nik-Zainal et al. [17]. There was no statistically significant association between the number of somatic mtRNA variants and the total number of somatic variants in the nuclear DNA (Spearman correlation coefficient rho = 0.01, p = 0.8). Next, we combined per case the number of variants in nDNA associated with the mutational processes as described by Nik-Zainal et al. [17]: age-related (signatures 1 and 5), APOBEC-related (signatures 2 and 13) and homologous-recombination deficiency-related (signatures 3 and 8) processes. No statistically significant associations were observed between the number of somatic mtRNA variants and any of these three mutational processes (all Kruskal-Wallis p > 0.2). Note that only two samples within the dataset contained variants associated with mismatch-repair deficiency (signatures 6, 20 and 26), and none of samples contained variants associated with the signatures of unknown etiology (signatures 17, 18 and 30), as a consequence of which these specific subgroups could not be evaluated.

Mitochondrial Gene Expression
To estimate the expression and transcript processing of the mitochondrial genome for each case, transcripts per million (TPM, log2-transformed) were calculated for the entire mtDNA and each mitochondrial-encoded gene individually. Expression of the entire mtDNA-normalized against the nuclear genome and thus evaluated as driven by mtDNA content and transcription rate-was high and showed minor variability among the 344 cases (median 19.9210 TPM, IQR 0.0045). Within the 37 mitochondrial-encoded genes-normalized within the mitochondrial genome and thus evaluated as driven by processing of the polycistronic transcripts-the levels for genes encoding tRNAs were lowest (median 12.52 TPM, IQR 1.32), followed by mRNAs (median 15.37 TPM, IQR 0.31) and rRNAs (median 16.83 TPM, IQR 0.48). Most variability was observed in levels of tRNAs. Also, distinct correlation clusters were observed between the expression levels of the genes encoding mRNAs, tRNAs and rRNAs, where among genes a positive correlation was present per gene-type, but between different gene-types a negative correlation was present ( Figure 5). No correlation was

Somatic Mitochondrial Variants in Relation to Somatic Variants in the Nuclear Genome
Next, to gain more insight into the relation between the mutational processes shaping mtDNA and nDNA, we associated the amount of somatic mtRNA variants with the number of somatic variants induced by the known major mutational patterns shaping the nDNA. For this purpose, we obtained for the overlapping cases (n = 268) the number of nDNA variants as published by Nik-Zainal et al. [17]. There was no statistically significant association between the number of somatic mtRNA variants and the total number of somatic variants in the nuclear DNA (Spearman correlation coefficient rho = 0.01, p = 0.8). Next, we combined per case the number of variants in nDNA associated with the mutational processes as described by Nik-Zainal et al. [17]: age-related (signatures 1 and 5), APOBEC-related (signatures 2 and 13) and homologous-recombination deficiency-related (signatures 3 and 8) processes. No statistically significant associations were observed between the number of somatic mtRNA variants and any of these three mutational processes (all Kruskal-Wallis p > 0.2). Note that only two samples within the dataset contained variants associated with mismatch-repair deficiency (signatures 6, 20 and 26), and none of samples contained variants associated with the signatures of unknown etiology (signatures 17, 18 and 30), as a consequence of which these specific subgroups could not be evaluated.

Mitochondrial Gene Expression
To estimate the expression and transcript processing of the mitochondrial genome for each case, transcripts per million (TPM, log2-transformed) were calculated for the entire mtDNA and each mitochondrial-encoded gene individually. Expression of the entire mtDNA-normalized against the nuclear genome and thus evaluated as driven by mtDNA content and transcription rate-was high and showed minor variability among the 344 cases (median 19.9210 TPM, IQR 0.0045). Within the 37 mitochondrial-encoded genes-normalized within the mitochondrial genome and thus evaluated as driven by processing of the polycistronic transcripts-the levels for genes encoding tRNAs were lowest (median 12.52 TPM, IQR 1.32), followed by mRNAs (median 15.37 TPM, IQR 0.31) and rRNAs (median 16.83 TPM, IQR 0.48). Most variability was observed in levels of tRNAs. Also, distinct correlation clusters were observed between the expression levels of the genes encoding mRNAs, tRNAs and rRNAs, where among genes a positive correlation was present per gene-type, but between different gene-types a negative correlation was present (

Association with Clinicopathological Parameters
Lastly, we explored how these findings correlate with relevant clinical parameters. We analysed the number of somatic mtRNA variants (grouped variable as 0 variants, 1 variant and >1 variant per tumor, Table 1) and the expression of the entire mitochondrial contig (continuous variable, Table 1) in relation to traditional clinicopathological variables including age at diagnosis (n = 291 cases), tumor size (T-stage) (n = 216 cases), pathological grade (n = 282 cases), estrogen receptor (ER) status (n = 291 cases) and progesterone receptor (PR) status (n = 288 cases). Due to the low numbers of patients with HER2-amplified (n = 2 cases) and presenting with metastases at primary diagnosis (n = 3 cases), these clinicopathological variables were not evaluated. Age at diagnosis was statistically

Association with Clinicopathological Parameters
Lastly, we explored how these findings correlate with relevant clinical parameters. We analysed the number of somatic mtRNA variants (grouped variable as 0 variants, 1 variant and >1 variant per tumor, Table 1) and the expression of the entire mitochondrial contig (continuous variable, Table 1) in relation to traditional clinicopathological variables including age at diagnosis (n = 291 cases), tumor size (T-stage) (n = 216 cases), pathological grade (n = 282 cases), estrogen receptor (ER) status (n = 291 cases) and progesterone receptor (PR) status (n = 288 cases). Due to the low numbers of patients with HER2-amplified (n = 2 cases) and presenting with metastases at primary diagnosis (n = 3 cases), these clinicopathological variables were not evaluated. Age at diagnosis was statistically significant associated with both the number of somatic mtRNA variants (Kruskal-Wallis p = 0.022) and expression of the entire mtDNA (Spearman correlation coefficient rho = 0.11, p = 0.049), where a higher age corresponded to more somatic mtRNA variants and higher expression of the entire mtDNA. Also, a highly statistically significant association was observed between expression of the entire mtDNA and hormone receptor status (as evaluated at the protein level), with increased mtDNA expression in the ER-positive and in the PR-positive tumors (respectively Mann-Whitney U test p < 0.001 and p = 0.006). In fact, also a significant correlation was observed between expression of the entire mtDNA and RNA expression of ESR1 or PGR (respectively Spearman correlation coefficient rho = 0.19 p < 0.001 and rho = 0.17 p = 0.001, n = 344 and n = 342 cases). For each subgroup within the clinicopathological variable, the number of cases and either the fraction of patients within the mtRNA somatic variant groups (0, 1 or more than 1) or the mtRNA expression (TPM, log2 transformed) is indicated. a Kruskal-Wallis (multiple groups) or Mann-Whitney (two groups) p-value. b Fisher exact p-value. c Spearman correlation coefficient. d Spearman correlation p-value.

Discussion
In this work, we explored genomic changes in and expression of the mitochondrial genome within primary breast tumors, and their correlation with clinicopathological variables.
Within our breast tumor dataset, the fraction of reads mapping to the mitochondrial contig of the reference genome (median 15%) is in line with previous findings in non-tumorous breast samples: within the Illumina Body Tissue Atlas~15% of the sequencing reads mapped to the mitochondrial genome (n = 1) [22], and within the Genotype-Tissue Expression (GTEx) Consortium~15-20% of the transcriptional output was of mitochondrial origin (n = 27) [23]. This is in line with the requirement for functional mitochondria within cancer cells [24]. This also indicates that although the expression of the mitochondrial genome has been shown to be decreased in breast tumors compared to tumor-adjacent normal mammary tissue [11], the extent to which this occurs is less extreme than observed among tissue types (e.g., a much lower fraction of mitochondrial reads in blood (<5%) or much higher fraction in kidney (>50%) [23]). Nevertheless, we observed an association between expression of the entire mtDNA and ER status (measured at protein-level), with marginally higher expression in ER-positive tumors and a similar observation for PR status (protein-level) ( Table 1). In addition, also RNA expression of ESR1 and PGR was positively correlated with expression of the entire mitochondrial contig. The relation between expression of mtDNA and clinicopathological parameters has not been evaluated by others, but when we associated the data reported by Reznik et al. [11] on mtRNA expression within the TCGA-BRCA dataset (n = 656 cases) we observe a similar correlation for ER status (Kruskal-Wallis p = 0.006, Supplementary Materials Table S2 and none for the other clinicopathological variables (all p > 0.05 Supplementary Materials Table S2). In pre-clinical models, there appears to be a link between ER and mitochondrial activity: exposure to estrogens increases mitochondrial expression and oxygen consumption in ER-positive [25,26] but not in ER-negative breast cancer cells [26]. Similarly, ER-negative breast cancer cell lines show lower mitochondrial respiration and a stronger dependency on glycolysis in comparison to ER-positive breast cancer cells [27]. Unfortunately, measurements on mitochondrial activity comparing ER-positive and ER-negative clinical specimens are to our knowledge not reported in the literature, and thus the effect of differences in ESR1 levels on mitochondrial activity in primary breast tumors remains currently unknown. Interestingly, uptake values of fluorodeoxyglucose (FDG) in positron emission tomography (PET)-a visualization of glucose uptake reflecting the increased rate of glycolysis in the tumor-appears to be higher in ER-negative cases [28][29][30][31][32][33][34], indicative that indeed metabolic differences are present between the subtypes. Additional studies should be performed to identify if there are differences in mitochondrial (oxidative phosphorylation) function among breast cancer subtypes and the potential clinical relevance of these findings, such as predictive and prognostic potential.
We also observed distinctive clustering of tRNA genes, which is in line with the tRNA punctuation model: when processing the polycistronic transcripts, tRNA genes are excised and due their small size (<75 base pairs) tRNAs are more likely to be lost during the RNA extraction and/or library preparation procedures, whereas the mRNA and rRNA genes are retained (>200 base pairs). Notably, we did not observe differences in this distinct pattern between the ER-positive and the ER-negative cases (Supplementary Materials Figures S2 and S3), and thus the processing of the polycistronic transcripts does not seem to differ between these two subtypes.
Our findings on the number, genomic distribution, and substitution pattern of mtDNA variants within the mitochondrial transcriptome are in line with previous studies on variants within the mitochondrial genome in other cancer types [8,15,16,21,35,36] (see also Appendix A). We observe an increased number of somatic variants in the D-loop and fewer in mRNA genes than expected by genomic size, which might be explained by the gene-dense constitution of mtDNA: variants in the D-loop potentially have less destructive effects whereas variants in the mRNA genes might have detrimental effects on the function of the oxidative phosphorylation system, and thus will be selected against. Also, the structural conformation of the D-loop (a triple-stranded structure) could make it more prone to damage. However, compared to germline variants in our dataset there are fewer variants in the D-loop and more in the tRNA and mRNA genes, and enrichment for nonsynonymous variants. This might be explained by the typical mutation pattern shaping mtDNA, which has been shaping the germline variants and thus the trivial positions have already been altered, as suggested by Ju et al. [15]. In line with this, the conservation of variants among species-the fraction of species that harbor the reference sequence at that position-was much higher for somatic variants than for the germline variants, which can be explained by the same hypothesis. Adding to this, compared to the detected germline variants there is an increased number of C > T transitions among the somatic variants ( Figure 4). Note that the functional effect of somatic mtDNA variants on mitochondrial function is dependent on the actual position (e.g., protein-coding regions) and consequence (e.g., stop-gain or nonsynonymous) of the variant in combination with their heteroplasmy level within the tumor cell, rather than merely the number of somatic variants observed.Adjusting variant allele frequency to account for sample purity (percentage of tumor cells within the specimen) is often applied for nuclear-encoded genes to obtain information on the allele frequency of variants in the tumor cells. However, this is not possible for mtDNA variants in tumor tissue specimens: the number of mtDNA molecules per cell largely varies among cell types and thus the non-tumor cells present in the specimen do not have the regular two copies as the nuclear genome would have, but contain multiple mtDNA copies of an unknown number. As a result, whereas allele frequency of variants could give information on possible constraints on variants, we did not perform analysis on it since it is impossible to estimate the actual allele frequency of variants in the mitochondria of tumor cells. Nevertheless, we show that majority of the samples with more than 1 somatic variant harbor a difference in variant allele frequency between variants, indicative for (sub-)clonality. This corresponds to the hypothesis that mtDNA variants are either expanded or lost [37] and that the mutations occur separated in time [15].
Also noteworthy is that with the current methodologies applied by us and by others-namely the use of non-micro dissected tumor specimens and blood as matched normal DNA-we cannot be completely sure that the detected somatic mtDNA mutations are tumor-specific. First, tumor tissue specimens consist of multiple cell types, including the tumor cells but also non-neoplastic cells such as immune cells and cells from the mammary epithelium, all with variable mtDNA content. Secondly, (somatic) mtDNA variant heteroplasmy patterns can differ within an individual across tissues [38][39][40][41]. Thus, the somatic variants were either acquired in the tumor, the normal somatic epithelium, or even in other cell types present within the specimen.
We did not observe associations between the number of somatic mtRNA variants and the three major mutational processes shaping the nDNA within breast tumors. This is in line with the hypothesis that mutations within the mitochondrial genome are mainly due to fidelity of the mitochondrial polymerase [42] and thereby hardly due to exogenous factors [15]. Accordingly, in our evaluation of associations with clinicopathological parameters we observed a statistically significant association between the number of mtRNA somatic variants and age at diagnosis. Previous work on somatic variants at the DNA level also revealed a correlation with older age of diagnosis (n = 381 [15] and n = 58 cases [35]). Previous work in a small cohort also showed associations between number of somatic variants in mtDNA and higher TNM and higher histological grade (n = 58 cases [35]), which we did not observe. Please note that there are differences in the composition of the cohorts; our dataset does not exactly represent the breast cancer population as seen in daily practice, with an underrepresentation of ERBB2-amplified cases (Supplementary Materials Table S3).
By using data at the RNA level, we intended to minimize the interference of NUMTs with evaluation of mtDNA expression and variant calling, since their expression in the nucleus is negligibly low [11,43]. Especially in defining heteroplasmic mtDNA variants in DNA data, NUMTs have been shown to be a complicating issue with non-identical positions misinterpreted as heteroplasmic variants [44][45][46][47][48]. Note that we do observe a few heteroplasmic variants at the DNA-only level (Appendix A). However, using data at the RNA level comes with the trade-off that only variants in expressed regions are detected and thus variants in non-expressed regions are missed. Since mtDNA is a gene-dense entity, we estimate that the number of missed variants should be low. Indeed, in our direct comparison of samples with variants at the RNA and DNA level, we show that this is maximallỹ 3% of the variants (DNA-only variants). Similar to these findings, the comparison by Stewart et al. [16] on somatic variants at the RNA and DNA level showed 7 of the 130 variants (5%) detected at only the DNA level within their set of 100 breast cancer specimens. Another trade-off using RNA is the additional step to generate cDNA, which might induce false positive calls by mistakes of the reverse transcriptase. Again based on our direct comparison of samples with variants at the RNA and DNA level, the number of false positives is maximally 3% of the detected variants (RNA-only variants). Though, besides false positives, these RNA-only variants might actually be RNA-DNA differences for example caused by RNA-editing [49], or true variants not called at the DNA level.

Data
We studied all patients with RNA sequencing data within the ICGC BASIS consortium, of which the cohort has been described previously [17] and data deposited in the European-Genome Phenome Archive (accession code EGAS00001001178). Briefly, for a total of 348 primary breast tumors we generated duplex-specific nuclease-based RNA sequencing data. Four samples were excluded from analyses due to potential cross-contamination (see below). We did not apply a threshold on tumor cell percentage within the specimen for inclusion in this study. Clinicopathological data and the nuclear somatic mutation catalogue were obtained from the Supplementary Tables as provided by Nik-Zainal et al. [17]. Expression levels of ESR1, PGR (quantile normalized FPKM, log2 transformed) were obtained as described previously [50]. A complete dataset on all variables used in our analyses is provided in Supplementary Materials Table S3. In addition, we used publically available RNA sequencing data of twelve human tissue specimens obtained via a similar sequencing approach [20], that has been deposited in NCBI's Gene Expression Omnibus (GEO) (accession code GSE45326). Also, we used the mtDNA variants called by Ju et al. [15] from whole-genome or whole-exome sequencing data of DNA from the primary breast tumor specimens and matched normal tissue specimens as provided in their Supplementary Tables.

Bioinformatics
Sequencing reads were aligned using STAR v2.4.2.a [51] against the Genome Reference Consortium Human Build 38 (GRCh38, GenBank assembly GCA_000001405.15), which contains as the mitochondrial contig the revised Cambridge Reference Sequence (rCRS). Only non-duplicated uniquely mapped reads on mtDNA were used for further analysis, to avoid the potential use of improper assigned nuclear insertions of mitochondrial origin (NUMTs, mitochondrial pseudogenes). Note that RNA expression of NUMTs has been shown to be absent or negligibly low [11,43]. Total read depth was estimated based on the read length (75 nucleotides) and mtDNA size (16,569 nucleotides). FeatureCounts v 1.4.6 [52] was used to count mapped reads using mtDNA as the meta-feature and each genomic region (13 mRNAs, 22 tRNAs, 2 rRNAs) as the features, allowing multi-overlapping reads (-O) because of the polycistronic nature of mitochondrial RNA transcripts. We normalized read counts to transcripts per million (TPM) for the entire mitochondrial contig (mtDNA read counts versus total read counts assigned to genes in GRCh38, defined as entire mtDNA levels) and for each mitochondrial-encoded gene (gene read counts versus total mtDNA read counts, defined as <gene> levels). In this way, the TPM for the entire mtDNA represents the total amount of mtRNA influenced by both mtDNA content, transcription rate and transcript stability, whereas the TPM for each mitochondrial-encoded gene represents the variation in gene expression driven by processing of the polycistronic transcripts and transcript stability [53]. A complete dataset of all expression levels is provided in Supplementary Materials Table S4. Variants alternative to rCRS were called using GATK HaplotypeCaller 3.4-46-gbc02625 [54] using default settings (including downsampling_type = BY_SAMPLE, downsample_to_coverage = 500, standard_min_confidence_ threshold_for_calling = 20). In this way, maximum depth of coverage is controlled at each locus, resulting in a more even coverage of variants between the samples. Hard-filtering was applied to the called variants for quality by depth (QD > 2), alternative depth (AD of ALT > 10) and strand odds ratio (variants with allele frequency ≤ 95% i.e., heteroplasmic variants: SOD < 4 for SNVs and SOD < 10 for INDELs; variants with allele frequency > 95% i.e., (near) homoplasmic: no filtering). In this way, the allele frequency of detected variants was high and confident enough to be a true variant and likely no sequencing errors or PCR mistakes. In addition, after visual inspection of variants (Integrative Genomics Viewer [55,56]), potential false positive calls in challenging regions were excluded: positions surrounding the homopolymer region 301-315 ("D310"), positions 512-513 due to a repetitive sequence, alternative C calls at positions 16,182-16,183 and 16,189 due to polyC sequences, and alternative A at positions 4264, 5513 and 12,138-12,139 due to polyA sequences. A complete dataset of all remaining variants is provided in Supplementary Materials Table S5. All remaining single nucleotide variants were used in a nucleotide BLAST against the human reference sequence (NCBI's nucleotide web blast, https://blast.ncbi.nlm.nih.gov) with the surrounding reference sequence (30 bases 5 and 30 bases 3 ) to uncover potential NUMT events, but none were recovered. The conservation index (45 species conservation) for the protein-coding genes, tRNAs and rRNAs were obtained via SNV Query in Mitomaster [57]. The haplotype of each case was estimated by using the heteroplasmic and homoplasmic variants in HaploGrep v2 [58]. Sample cross-contamination was estimated using only the heteroplasmic variants (allele frequency ≤ 95%) in haplotype assignment. This identified four samples with heteroplasmic contamination of another haplotype, therefore these samples were excluded from analyses. Sample mismatch between cases with variants called in both RNA (our dataset) and DNA (dataset Ju et al. [15]) sequencing data (n = 168) was estimated by haplotyping based on all near-homoplasmic variants (allele frequency > 95%), and comparison of the obtained haplogroup. Mismatch was observed for 13 patients, but after manual inspection specificity could be confirmed for 10 patients by the presence of private variants. Two patients with a clear mismatch, and one patient ambiguous in mismatch, were excluded from the RNA-DNA comparison analyses (n = 165 remaining).

Statistics
Performed statistical tests are reported in the results section. All statistical tests were two-sided, and P values smaller than 0.05 were considered as statistically significant. Outliers data points in boxplots are defined as Q1−1.5*IQR or Q3+1.5*IQR. Analyses were performed in R version 3.3.2 (https://cran.r-project.org). Data analyses included usage of the following packages: the set of tidyverse, ggcorplot, SomaticSignatures [59] and VennDiagram [60].

Conclusions
To conclude, in this explorative study on the role of mtRNA in breast cancer, we found that somatic variants at the DNA level are reflected at the RNA level with no hotspot mutations and great heterogeneity across tumors. We confirm that the number of somatic variants within the mitochondrial transcriptome is not associated with the mutational processes shaping the nuclear genome but instead, is associated with age of diagnosis. Furthermore, we show that mitochondrial expression is related to ER status. The exact consequence of the observed differences in mtRNA expression and the detected somatic variants on cancer metabolism and clinical outcome warrants further study.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2072-6694/10/12/ 500/s1, Figure S1: Histogram of variant allele frequency of the somatic mitochondrial RNA variants of 344 primary breast tumor cases, Figure S2: Correlation matrix of expression of all 37 mitochondrial-encoded genes of 81 ER-negative primary breast tumor cases, Figure S3: Correlation matrix of expression of all 37 mitochondrial-encoded genes of 210 ER-positive primary breast tumor cases, Table S1: Position 2617 in normal tissue specimens, Table S2: Association between mtRNA expression and clinicopathological variables in TCGA-BRCA, Table S3: Clinicopathological and other variables per sample (n = 344 cases), Table S4: Expression levels of mtRNA per sample (n = 344 cases), Table S5: Variants in mtRNA per sample (n = 344 cases). includes a total of n = 381 breast tumor specimens, of which n = 165 cases are overlapping with our dataset.

A.1. Comparison of Mitochondrial Variants at the RNA and DNA Level
By using the dataset published by Ju et al. [15] concerning somatic mitochondrial variants in tumor and matched normal specimens at the DNA level, we intended to compare mitochondrial variants called in primary breast tumor tissue specimens at the RNA and the DNA level. Their dataset includes a total of n = 381 breast tumor specimens, of which n = 165 cases are overlapping with our dataset.
The DNA dataset contains 8892 single nucleotide variants (SNVs) on 1744 positions within the 381 cases ( Figure A1), of which 589 variants classified as somatic (using VarScan2, see Ju et al. [15]). The variant allele frequency of these somatic variants was distributed with a peak at the lower end of allele frequencies ( Figure A2). The detected somatic variants were distributed along the entire mitochondrial genome ( Figure A1 The variant allele frequency of these somatic variants was distributed with a peak at the lower end of allele frequencies ( Figure A2). The detected somatic variants were distributed along the entire mitochondrial genome ( Figure A1), with 50 (8.5%) variants located in the tRNA genes, 103 (17.5%) in rRNA genes, 80 (13.6%) in the D-loop, 6 (1.0%) in the non-coding regions, and 350 (59.4%) in the mRNA genes of which 285 (81.4%) had a nonsynonymous effect on the coding amino acid ( Figure A3). Relative to their genomic size (9.0% tRNA genes, 15.1% rRNA genes, 6.8% D-loop, 0.4% non-coding, 68.7% mRNA genes) more variants were present in the D-loop and fewer in the mRNA genes (Fisher exact p < 0.001). Also compared to the germline variants, there is a difference in genomic distribution: fewer somatic variants in the D-loop but more in the tRNA and mRNA genes, and an enrichment for somatic nonsynonymous mRNA variants (Fisher exact p < 0.001) ( Figure A3). The positions of somatic variants were much more conserved among species compared to the germline variants (Mann-Whitney test p < 0.001), as measured by the fraction of species that harbor the reference sequence at that position (Conservation Index of respectively median (IQR) 0.96 (0.36) and 0.76 (0.69)). A3). Relative to their genomic size (9.0% tRNA genes, 15.1% rRNA genes, 6.8% D-loop, 0.4% non-coding, 68.7% mRNA genes) more variants were present in the D-loop and fewer in the mRNA genes (Fisher exact p < 0.001). Also compared to the germline variants, there is a difference in genomic distribution: fewer somatic variants in the D-loop but more in the tRNA and mRNA genes, and an enrichment for somatic nonsynonymous mRNA variants (Fisher exact p < 0.001) ( Figure A3). The positions of somatic variants were much more conserved among species compared to the germline variants (Mann-Whitney test p < 0.001), as measured by the fraction of species that harbor the reference sequence at that position (Conservation Index of respectively median (IQR) 0.96 (0.36) and 0.76 (0.69)).  A total of 74 (12.6%) somatic variants were recurrent and positioned on 34 mitochondrial positions. Also, majority of the somatic variants (89.5%) represented the typical replication-coupled mtDNA substitution pattern with predominantly C > T and T > C transitions in a nucleotide context similar to the germline variants ( Figure A4). Compared to the detected germline variants the ratio non-coding, 68.7% mRNA genes) more variants were present in the D-loop and fewer in the mRNA genes (Fisher exact p < 0.001). Also compared to the germline variants, there is a difference in genomic distribution: fewer somatic variants in the D-loop but more in the tRNA and mRNA genes, and an enrichment for somatic nonsynonymous mRNA variants (Fisher exact p < 0.001) ( Figure A3). The positions of somatic variants were much more conserved among species compared to the germline variants (Mann-Whitney test p < 0.001), as measured by the fraction of species that harbor the reference sequence at that position (Conservation Index of respectively median (IQR) 0.96 (0.36) and 0.76 (0.69)).  A total of 74 (12.6%) somatic variants were recurrent and positioned on 34 mitochondrial positions. Also, majority of the somatic variants (89.5%) represented the typical replication-coupled mtDNA substitution pattern with predominantly C > T and T > C transitions in a nucleotide context similar to the germline variants ( Figure A4). Compared to the detected germline variants the ratio A total of 74 (12.6%) somatic variants were recurrent and positioned on 34 mitochondrial positions. Also, majority of the somatic variants (89.5%) represented the typical replication-coupled mtDNA substitution pattern with predominantly C > T and T > C transitions in a nucleotide context similar to the germline variants ( Figure A4). Compared to the detected germline variants the ratio between C > T and T > C variants is shifted (Fisher exact p < 0.001) with an increased number of C > T transitions among the somatic variants ( Figure A4). In the cohort, there are 101 (26%) cases with 0 somatic variants, 117 (31%) with 1 somatic variant, and 163 (43%) with more than 1 somatic variant (range 2 to 7). Of the cases with more than 1 somatic variant, 103 (63%) had a difference >20% allele frequency between variants, indicative for (sub-)clonality.
When comparing these findings at the DNA level with our findings at the RNA level, no differences were observed in genomic distribution of the somatic variants (Fisher's exact p = 0.1), the conservation index of somatic variants (Mann-Whitney test p = 0.4), and the recurrence rate (Fisher's exact p = 0.3). However, the substitutional pattern differed, with a higher fraction of C > A substitutions at the DNA level (4.8%) compared to the RNA level (1.1%) (Fisher's exact p < 0.001).
Cancers 2018, 10, x 14 of 26 between C > T and T > C variants is shifted (Fisher exact p < 0.001) with an increased number of C > T transitions among the somatic variants ( Figure A4). In the cohort, there are 101 (26%) cases with 0 somatic variants, 117 (31%) with 1 somatic variant, and 163 (43%) with more than 1 somatic variant (range 2 to 7). Of the cases with more than 1 somatic variant, 103 (63%) had a difference >20% allele frequency between variants, indicative for (sub-)clonality. When comparing these findings at the DNA level with our findings at the RNA level, no differences were observed in genomic distribution of the somatic variants (Fisher's exact p = 0.1), the conservation index of somatic variants (Mann-Whitney test p = 0.4), and the recurrence rate (Fisher's exact p = 0.3). However, the substitutional pattern differed, with a higher fraction of C > A substitutions at the DNA level (4.8%) compared to the RNA level (1.1%) (Fisher's exact p < 0.001).

A.2. Direct Comparison of Mitochondrial Variants at the RNA and DNA Level of Overlapping Cases
We next focused on the variants called within the overlapping cases within our RNA and the published DNA dataset (n = 165 cases). As stated in the main manuscript text, of the variants detected at both the RNA and DNA level only a few (n = 10, 0.3%) had a discrepancy in classification as either 'somatic' or 'germline' (resp. n = 4 and n = 6, Table A1). These were misclassifications at the RNA level, mainly due to the absence of information on the matched normal tissue: variants misclassified as 'germline' at the RNA level had allele frequencies > 95%, indicative for germline origin, but were not detected in the matched normal DNA whereas they were present in the matched tumor DNA and thus of somatic origin. Also, variants misclassified as 'somatic' at the RNA level had allele frequencies between 85% and 95% allele frequency, but were detected in the matched normal DNA as well as the matched tumor DNA and thus of germline origin. Also, of the variants detected at both the RNA and DNA level, only a few variants (n = 7, 0.2%) showed a strong deviation in variant frequency (>30% difference) (n = 3 germline and n = 4 somatic) (Table A2). In contrast to previous observations that mainly variants in tRNAs have allelic imbalances [9], none of them occurred at tRNA sites.
We continued evaluating the variants called within the overlapping cases present at either only the DNA or only the RNA level. A total of 120 variants (3.0%) were only called at the DNA level (103 somatic and 17 germline) (Table A2) and 108 variants (2.7%) were present at only the RNA level (47 somatic and 61 germline) (Table A3). Within the aligned reads of the RNA data (BAM file) we inspected if variants called at the DNA-only level were truly not present at the RNA level or just not called (Table A2). Majority of the called DNA variants were present in the RNA alignment data but We next focused on the variants called within the overlapping cases within our RNA and the published DNA dataset (n = 165 cases). As stated in the main manuscript text, of the variants detected at both the RNA and DNA level only a few (n = 10, 0.3%) had a discrepancy in classification as either 'somatic' or 'germline' (resp. n = 4 and n = 6, Table A1). These were misclassifications at the RNA level, mainly due to the absence of information on the matched normal tissue: variants misclassified as 'germline' at the RNA level had allele frequencies > 95%, indicative for germline origin, but were not detected in the matched normal DNA whereas they were present in the matched tumor DNA and thus of somatic origin. Also, variants misclassified as 'somatic' at the RNA level had allele frequencies between 85% and 95% allele frequency, but were detected in the matched normal DNA as well as the matched tumor DNA and thus of germline origin. Also, of the variants detected at both the RNA and DNA level, only a few variants (n = 7, 0.2%) showed a strong deviation in variant frequency (>30% difference) (n = 3 germline and n = 4 somatic) (Table A2). In contrast to previous observations that mainly variants in tRNAs have allelic imbalances [9], none of them occurred at tRNA sites.
We continued evaluating the variants called within the overlapping cases present at either only the DNA or only the RNA level. A total of 120 variants (3.0%) were only called at the DNA level (103 somatic and 17 germline) (Table A2) and 108 variants (2.7%) were present at only the RNA level (47 somatic and 61 germline) (Table A3). Within the aligned reads of the RNA data (BAM file) we inspected if variants called at the DNA-only level were truly not present at the RNA level or just not called (Table A2). Majority of the called DNA variants were present in the RNA alignment data but not called by our used algorithm (n = 108, 90%), a few variants were not (sufficiently) covered (n = 5, 4%), and some were sufficiently covered but truly not present as alternative allele (n = 7, 6%). Unfortunately, we were unable to visually inspect variants in the DNA alignment data (not available) and thus the relevance of variants present at only the RNA level was not evaluable. Interestingly, when evaluating the substitution pattern of variants detected at both the RNA and DNA level ( Figure A5), and at either the RNA or DNA level, the higher fraction of C > A substitutions at the DNA level compared to the RNA level appeared mainly due to variants called at only the DNA level.
Given these results, the differences we observe in called variants at the RNA and DNA level is likely an effect of differences in either the expression at the RNA level (biological) the calling algorithms used (technical).
Cancers 2018, 10, x 15 of 26 not called by our used algorithm (n = 108, 90%), a few variants were not (sufficiently) covered (n = 5, 4%), and some were sufficiently covered but truly not present as alternative allele (n = 7, 6%). Unfortunately, we were unable to visually inspect variants in the DNA alignment data (not available) and thus the relevance of variants present at only the RNA level was not evaluable. Interestingly, when evaluating the substitution pattern of variants detected at both the RNA and DNA level ( Figure A5), and at either the RNA or DNA level, the higher fraction of C > A substitutions at the DNA level compared to the RNA level appeared mainly due to variants called at only the DNA level. Given these results, the differences we observe in called variants at the RNA and DNA level is likely an effect of differences in either the expression at the RNA level (biological) the calling algorithms used (technical).