Defining Signatures of Arm-Wise Copy Number Change and Their Associated Drivers in Kidney Cancers

Using pan-cancer data from The Cancer Genome Atlas (TCGA), we investigated how patterns in copy number alterations in cancer cells vary both by tissue type and as a function of genetic alteration. We find that patterns in both chromosomal ploidy and individual arm copy number are dependent on tumour type. We highlight for example, the significant losses in chromosome arm 3p and the gain of ploidy in 5q in kidney clear cell renal cell carcinoma tissue samples. We find that specific gene mutations are associated with genome-wide copy number changes. Using signatures derived from non-negative matrix factorisation (NMF), we also find gene mutations that are associated with particular patterns of ploidy change. Finally, utilising a set of machine learning classifiers, we successfully predicted the presence of mutated genes in a sample using arm-wise copy number patterns as features. This demonstrates that mutations in specific genes are correlated and may lead to specific patterns of ploidy loss and gain across chromosome arms. Using these same classifiers, we highlight which arms are most predictive of commonly mutated genes in kidney renal clear cell carcinoma (KIRC).


Introduction
One of the most striking features of the cancer cell genome is the frequently observed abnormal karyotype. Many factors can lead to abnormal structural rearrangements of chromosomes including errors in cell division such as spindle assembly checkpoint defects [1] and missegregation due to issues such as telomerase insufficiency [2]. The inaccurate repair of DNA double-strand breaks can also result in translocations, duplications, deletions and inversions of DNA leading to genome instability [3]. Recurrent translocations are frequently observed in haematological malignancies where the resulting fusion genes drive tumourigenesis [4]. Loss of heterozygosity (LOH) can also contribute to the loss of function of tumour suppressor genes [5], and large-scale copy number changes can lead to oncogene amplification [6,7].
In addition, defects in the fidelity of chromosome segregation can lead to the gain or loss of entire chromosomes. The chromosoma can vary greatly within tumours [8] and changes in ploidy are known contributors to tumourigenesis and the progression of certain cancer types [9]. Indeed some individual gene variations such as alterations in BRAF [10], TP53 [11,12], PTEN [13] and VHL [14] have been linked directly with genome stability and changes in the aneuploidy state of the cell. We first compared overall chromosomal copy number change across all tissue types. We processed the raw CNV data to calculate mean chromosomal copy number (CCN) for all samples (excluding sex chromosomes) and grouped these samples by cancer subtype (Figure 1). Across tumours from all tissue types present in this study, we see a slight overall gain of genetic material with a mean CCN of 2.046. There is much variance with a bottom quartile of CCN 1.932 and a top quartile of CCN 2.128. The lowest CCN recorded in our samples is 0.375, from a breast cancer sample, and the highest 8.308 from an ovarian cancer sample.
Mean chromosome copy number did not fall below two for any of the tissue types included in this study. KICH exhibited most gain with a mean CCN of 2.105 (s.d. 0.450) as well as the lowest 25 th percentile at CCN 1.605 and the highest 75th percentile at CCN 2.488. Ovarian cancer samples showed the most variance of CCN with a standard deviation of 0.499. The tissue types with the lowest variance of CCN included prostate adenocarcinoma and KIRC with standard deviations of 0.175 and 0.219 respectively.
These results show that the extent of genomic ploidy change seems to vary by cancer subtype type rather than by the primary site of the sample. For example, the three classes of kidney tissue exhibited notably different distributions.

Different Genes Are Associated with Changes in Overall Aneuploidy in Different Tissues
We next investigated how gene mutations were associated with overall ploidy change. We identified the 250 most commonly mutated genes in our kidney cancers. Genetic mutations in VHL, PBRM1, SETD2, BAP1, MTOR, KDM5C, PCLO, KMT2C, ARID1A and SPEN were most commonly seen in KIRC and MET, KMT2C, AR, FAT1, PCLO and NF1 mutations were most commonly found in KIRP tissue [27]. Mutational data was not available for KICH tissue.
Within each tissue type, we compared the distributions of copy number change between samples with a specific mutation and those without a mutation to calculate a significance score for each gene. Across tumours from all tissue types present in this study, we see a slight overall gain of genetic material with a mean CCN of 2.046. There is much variance with a bottom quartile of CCN 1.932 and a top quartile of CCN 2.128. The lowest CCN recorded in our samples is 0.375, from a breast cancer sample, and the highest 8.308 from an ovarian cancer sample.
Mean chromosome copy number did not fall below two for any of the tissue types included in this study. KICH exhibited most gain with a mean CCN of 2.105 (STD = 0.450) as well as the lowest 25th percentile at CCN 1.605 and the highest 75th percentile at CCN 2.488. Ovarian cancer samples showed the most variance of CCN with a standard deviation of 0.499. The tissue types with the lowest variance of CCN included prostate adenocarcinoma and KIRC with standard deviations of 0.175 and 0.219 respectively.
These results show that the extent of genomic ploidy change seems to vary by cancer subtype type rather than by the primary site of the sample. For example, the three classes of kidney tissue exhibited notably different distributions.

Different Genes Are Associated with Changes in Overall Aneuploidy in Different Tissues
We next investigated how gene mutations were associated with overall ploidy change. We identified the 250 most commonly mutated genes in our kidney cancers. Genetic mutations in VHL, PBRM1, SETD2, BAP1, MTOR, KDM5C, PCLO, KMT2C, ARID1A and SPEN were most commonly seen in KIRC and MET, KMT2C, AR, FAT1, PCLO and NF1 mutations were most commonly found in KIRP tissue [27]. Mutational data was not available for KICH tissue.
Within each tissue type, we compared the distributions of copy number change between samples with a specific mutation and those without a mutation to calculate a significance score for each gene. After applying Bonferroni correction to correct for false discovery error, we ranked these genes by the reported corrected significance score.
In KIRC tissue samples, those with POLE mutations were reported to show the most significant loss in ploidy (p = 1.53 × 10 −13 ) while samples with mutations in TP53 exhibited the most significant gains (p = 1.33 × 10 −12 ). Other prominent gene mutations associated with ploidy change in KIRC included GRM8 (associated with loss, p = 3.44 × 10 −12 ), SYNE1 (associated with gain, p = 2.36 × 10 −11 ) and ASTN1 (associated with gain, p = 2.66 × 10 −11 ) all of which are associated with brain function and have not previously been associated with ploidy change or KIRC tissue. Mutations in VHL and SETD2 genes, traditionally associated with KIRC tissue, also result in a significant change in ploidy in our KIRC tissue samples (p = 2.00 × 10 −3 and p = 6.00 × 10 −3 respectively) though mutations in PBRM1 did not.
In KIRP tissue, we found that patients with KRAS mutations showed the most significant loss in copy number (p = 2.2 × 10 −16 ) followed by THSD7B (p = 2.48 × 10 −36 ) and CHD4 (p = 3.31 × 10 −24 ). Patients with mutations in EP400 (p = 8.99 × 10 −16 ) and PCDH11X (p = 2.79 × 10 −9 ) exhibited the most significant levels of ploidy gain. Surprisingly, we found that mutations in TP53 did not appear to result in any significant ploidy change in the available KIRP tissue samples.

Chromosome Arm Ploidy Patterns Vary by Tissue
We next investigated copy number data stratified by chromosome arm to ascertain whether certain chromosome arms were preferentially gained or lost in the kidney cell lines compared to other cancers ( Figure 2). All p-values below were calculated using segment mean values to allow direct comparison. The apparent stability of 8q in KIRC and KIRP tissues compared to our pan-cancer set, may be Chromosome Arm-wise copy number. By grouping chromosome arm-wise copy number data across our kidney tissues individually, we develop a clearer picture of the pattern of chromosome arm copy number that occurs specifically in each kidney tissue compared to our pan-cancer dataset.
Across our kidney tissue samples, we found that each tissue type exhibited a distinct pattern of copy number variance. This suggests that arm-wise copy number change profiles depend on tissue type more than on the primary cancer site.
As expected, based on previous studies 5q showed the highest ploidy gain on average in our KIRC tissue samples (CCN The KICH arm-wise copy number data exhibit much larger copy number changes across every arm with only 21p appearing relatively diploid. On average each of the other arms seems to have either gained or lost roughly half of its genetic material. These results support existing research into the karyotype of KIRC tissue, including loss in 3p and gain in 5q. Across the pan-cancer samples, a set of 10 cancer types excluding the three kidney cancers (see methods), the chromosomes that exhibited the most notable copy number change included 8q and 17p. We note that the patterns found in our pan-cancer samples were not shared with our kidney tissue samples.
Chromosome arm 8q was shown to exhibit the highest average copy number gain and variance in our pan-cancer data (CCN The apparent stability of 8q in KIRC and KIRP tissues compared to our pan-cancer set, may be a possible avenue for further research.

Patterns of Arm Ploidy Correlation Are Found Within Tissue Types
A loss in chromosome arm 3p paired with a gain in chromosome arm 5q is a common and well described trait of KIRC tissue. Next, we investigated how changes of ploidy in other pairs of chromosome arms relate to each other. We measured the correlation coefficient for each pair of arms within tissue type ( Figure 3). Significance scores were measured using permutation tests as discussed in methods.
The most prominent positive correlation scores occurred between arms of the same chromosome (For example chromosome arms 1p and 1q commonly exhibited a similar amount of gain or loss together), however these scores were not uniform across all tissue types. In KIRC, for example, the 9p-9q arm pair was reported to show the highest positive correlation coefficient (r =~0.73, p = 4.08 × 10 −72 ) while in KIRP it was the 20p-20q arm pair (r =~0.78, p = 7.27 × 10 −53 ). Conversely the 9p-9q arm pair in KIRP exhibits a much lower correlation score (r =~0.31, p = 2.69 × 10 −18 ) and the 20p-20q arm pair in KIRC shows a slightly negative correlation (r =~−0.02, p = 4.17 × 10 −55 ).
In terms of negative correlation, where one arm's gain of genetic material was commonly associated with a loss in the others, we found the 3p-5q pair featured prominently for KIRP tissue (r =~−0.63, p = 1.37 × 10 −07 ) but perhaps surprisingly not as prominent as for KIRC (r =~−0.13, p = 2.21 × 10 −05 ) although both are reported to be significant. In KIRC tissue the 4q-6q exhibits the strongest negative correlation ploidy (r =~−0.58, p = 3.64 × 10 −05 ) (Figures S1 and S2). Taking the Pearson correlation coefficient as our measure, we visualise the correlation of gain and loss of copy number for each chromosome arm pair in each of our kidney datasets and in our pan-cancer data. Bright yellow tiles denote highly positive correlation and dark red tiles denote highly negative correlations. The results along the diagonal (which report, for example, of the correlation of c.1p-c.1p) are normalised to r = 0 to improve overall contrast).
The most prominent positive correlation scores occurred between arms of the same chromosome (For example chromosome arms 1p and 1q commonly exhibited a similar amount of gain or loss together), however these scores were not uniform across all tissue types. In KIRC, for example, the 9p-9q arm pair was reported to show the highest positive correlation coefficient (r = ~0.73, p = 4.08 × 10 −72 ) while in KIRP it was the 20p-20q arm pair (r = ~0.78, p = 7.27 × 10 −53 ). Conversely the 9p-9q arm pair in KIRP exhibits a much lower correlation score (r = ~0.313, p = 2.69 × 10 −18 ) and the 20p-20q arm pair in KIRC shows a slightly negative correlation (r = −0.02, p = 4.17 × 10 −55 ).
In terms of negative correlation, where one arm's gain of genetic material was commonly associated with a loss in the others, we found the 3p-5q pair featured prominently for KIRP tissue (r = −0.6348595, p = 1.369225 × 10 −07 ) but perhaps surprisingly not as prominent as for KIRC (r = −0.1250279, p = 2.212398 × 10 −05 ) although both are reported to be significant. In KIRC tissue the 4q-6q exhibits the strongest negative correlation ploidy (r = −0.5838217, p = 3.641906 × 10 −05 ) (Figures S1 and S2). Taking the Pearson correlation coefficient as our measure, we visualise the correlation of gain and loss of copy number for each chromosome arm pair in each of our kidney datasets and in our pan-cancer data. Bright yellow tiles denote highly positive correlation and dark red tiles denote highly negative correlations. The results along the diagonal (which report, for example, of the correlation of c.1p-c.1p) are normalised to r = 0 to improve overall contrast).

Arm Ploidy Signatures Are Somewhat Enriched for Tissue Types
The patterns of arm ploidy correlation found above suggest that the underlying mechanisms that occur in specific tissues may give rise to different profiles of loss and gain of ploidy in different chromosome arms. In an attempt to capture these profiles more fully, we employed non-negative matrix factorisation (NMF) to generate a number of arm ploidy signatures across our kidney tissue data.
NMF is a multivariate analysis tool commonly used for easily interpretable decomposition. NMF was chosen because it provides both dimensionality reduction and clustering. Essentially NMF decomposes a feature matrix into two descriptive matrices, a basis, which describes the feature composition of each component and a coefficient, which describes the component composition of each sample in the original feature matrix.
NMF generates a specified number of components (used as signatures). To identify the optimum number of components for each dataset, we ran a number of trials for decompositions with an increasing value of required components. Each of these trials provided a cophenetic score, a measure of cluster stability, i.e., how well the clusters obtained by NMF preserved the pairwise distances between the original data points. For each dataset, we selected the lowest component count associated with a local minimum cophenetic score.
Our kidney tissue ploidy data was decomposed into six component signatures ( Figure 4). We found that some of these signatures where highly enriched for just one of the three tissue types within the kidney group, KIRC, KIRP and KICH whilst others were more mixed. KIRC samples were most prominent in signature 1 and 4, KIRP samples were dominant in signatures 2, 3 and 6, and KICH was found most prominently in signature 5.
NMF generates a specified number of components (used as signatures). To identify the optimum number of components for each dataset, we ran a number of trials for decompositions with an increasing value of required components. Each of these trials provided a cophenetic score, a measure of cluster stability, i.e., how well the clusters obtained by NMF preserved the pairwise distances between the original data points. For each dataset, we selected the lowest component count associated with a local minimum cophenetic score.
Our kidney tissue ploidy data was decomposed into six component signatures (Figure 4). We found that some of these signatures where highly enriched for just one of the three tissue types within the kidney group, KIRC, KIRP and KICH whilst others were more mixed. KIRC samples were most prominent in signature 1 and 4, KIRP samples were dominant in signatures 2, 3 and 6, and KICH was found most prominently in signature 5.
In terms of composition, we found that the feature composition of signature 2, associated with KIRP tissue, exhibited the most notable deficiency of 3p paired with gains in 5q which reflected our correlation analysis. This was paired with large gains in 6p, 15q and 16p.  We found that signature 1, predominantly associated with KIRC, featured gains in a number of arms most notably in 10p, 18q and 19p. Signature 5, enriched for KICH samples, featured gains in a number of chromosome arms that are not present in the other signatures and notably exhibited losses in arms 6p, 6q and 7p. Signature 6, the other signature associated with KIRP tissue, exhibited gains in 3p and 3q paired with some loss in 5p and 5q. 19q and 20p also showed large gains.

The Amount of Signature Composition Change Varies by Gene Mutation
We next used cosine similarity, the cosine of the angle between two signatures represented as vectors projected in a multi-dimensional space, to measure the effect of specific mutations on overall In terms of composition, we found that the feature composition of signature 2, associated with KIRP tissue, exhibited the most notable deficiency of 3p paired with gains in 5q which reflected our correlation analysis. This was paired with large gains in 6p, 15q and 16p.
We found that signature 1, predominantly associated with KIRC, featured gains in a number of arms most notably in 10p, 18q and 19p. Signature 5, enriched for KICH samples, featured gains in a number of chromosome arms that are not present in the other signatures and notably exhibited losses in arms 6p, 6q and 7p. Signature 6, the other signature associated with KIRP tissue, exhibited gains in 3p and 3q paired with some loss in 5p and 5q. 19q and 20p also showed large gains.

The Amount of Signature Composition Change Varies by Gene Mutation
We next used cosine similarity, the cosine of the angle between two signatures represented as vectors projected in a multi-dimensional space, to measure the effect of specific mutations on overall signature composition change.
By taking the median signature composition of patients with and without specific gene mutations and measuring the cosine similarities of these compositions, we were able to rank genes by the distance of signature composition between these groups.
In our kidney cancer data, we found a number of gene mutations associated with an increased distance in signature composition. The genes with the highest cosine similarity (all with cosine similarity scores of~0.678) in our ranked list are included in Table 1. In our pan-cancer data, we found a different set of genes with some similar functionality listed in Table 1. We found overrepresentation of gene associated with the innate immune system as well as KDM6B, INCENP and H2AFJ which are all related to the chromosome organisation. Table 1. Gene mutations associated with an increased distance in signature composition between wild type and mutated kidney cancer samples/mutated pan-cancer samples.

PON2
Oxidative stress protection Pathogenic bacteria protection

Some Gene Mutations Are Enriched in Our Kidney Ploidy Signatures
To find gene mutation enrichment in our six kidney cancer signatures, we first categorised samples into signature groups and counted the frequency of mutations occurring in each patients' group. We then used permutation tests to measure if the frequency of mutation was significantly higher than would be expected in a random sample of patients. Below, we highlight some of the notable gene mutations that were reported to be significantly enriched in each signature.

Patterns of Ploidy of Chromosome Arms Are Associated with Specific Mutated Genes
To investigate to what extent mutations in specific genes lead to specific patterns of arm-wise copy number change, we trialled a set of four machine learning classifiers designed to predict gene mutations based on a patient's chromosome arm ploidy profile (see methods). Using an AUC ROC score of 0.70 as a cut-off, we found that the patterns of ploidy in chromosome arms had varied predictive power for mutations in specific genes.
We trialled four classifiers; Bernoulli naive Bayes, support vector machine, logistic regression and random forest. Due to consistently better performance when compared to the other classifiers all ROC AUC scores below are based on the results of the random forest classifier. In our pan-cancer dataset patterns of ploidy changes in chromosome arms were strongly predictive of mutations in VHL (ROC 0.91), TP53 (ROC 0.74), and PBRM1 (ROC 0.71), with BAP1 (ROC 0.67) narrowly missing the cut-off score.
To better understand how different features contributed to the predictive power of each classifier, we ranked the importance of each feature for each model. Feature importance was calculated by measuring the mean decrease in classifier accuracy when systematically holding out each variable across all tree permutations in a random forest. Feature importance was reported as a mean decrease in accuracy.
When analysing the important features in the predictive model, losses in 3p featured heavily in patients with BAP1, PBRM1 and VHL mutations. Gains in 5p were an important feature for both PBRM1 and VHL mutated genes.
In general, KIRC tissue arm-wise CCN data, which exhibited less variance than KIRP tissue, were less predictive of specific gene mutations. The highest scoring genes, in terms of ROC AUC score, were PTEN (ROC 0.68), PBRM1 (ROC 0.61) and AKAP9 (ROC 0.61) ( Figure 5). 18q previously noted for its overall low ploidy change in KIRC tissue was found to be the most important feature for both AKAP9 (0.15 Mean decrease in accuracy) and PTEN (0.081 Mean decrease in accuracy) (Supplementary Table S1).
Chromosome arm 3p, on which both PBRM1 and BAP1 are located, commonly exhibits loss in KIRC tissues. However, we found that 3p reported a relatively low overall importance score in our KIRC tissue predictive models. Rather than 3p the highest ranking feature for PBRM1 was 10p (0.063 Mean decrease in accuracy). This may be due to the all but uniform loss of these genes along with 3p in KIRC tissue (94% of patients with overall loss in 3p and 71% with less than CCN 1.8) leading to less variation and, as such, less signal.
In general, KIRC tissue arm-wise CCN data, which exhibited less variance than KIRP tissue, were less predictive of specific gene mutations. The highest scoring genes, in terms of ROC AUC score, were PTEN (ROC 0.68), PBRM1 (ROC 0.61) and AKAP9 (ROC 0.61) ( Figure 5). 18q previously noted for its overall low ploidy change in KIRC tissue was found to be the most important feature for both AKAP9 (0.15 Mean decrease in accuracy) and PTEN (0.081 Mean decrease in accuracy) (Supplementary Table 1). In models derived from our pan-cancer data, features based on patterns of arm-wise CCN performed relatively well when predicting TP53 and PTEN mutations with AUC ROC scores of 0.74 and 0.73 respectively. TP53 has been implicated in genetic instability in many tissue types [28,29]. BAP1 also performed fairly well with a ROC AUC score of 0.66; with important features being a loss of 3p (0.076 Mean decrease in accuracy) and of 3q (0.15 Mean decrease in accuracy). Our pan-cancer tissues models performed poorly when predicting PBRM1 with a ROC AUC score of 0.47.
Once again both PBRM1 and VHL were found to be associated with KIRC tissue with losses in 3p and gains in 5p featuring as important predictors for both PBRM1 and VHL mutated genes. Chromosome arm ploidy proved a strong predictor of mutations in VHL, TP53 and PBRM1 in our pan-cancer dataset and the insight drawn from our feature importance scores again matched our expected results.

Discussion
The goal of this study was to investigate how genome-wide and chromosome arm-wise ploidy varies by tumour type, how these ploidy patterns are associated with genetic mutations and how suitable ploidy data is as a predictor of specific mutations.
As described above, we observed significant variation in both genome-wide and chromosome arm-wise ploidy between samples from different tissue types. This can be expected given that different tumours are driven via gains or losses of specific genes located on various chromosome arms. While we observed a small number of generalisations regarding the pattern of genetic material, e.g., 20p shows gains in CCN across all tumour types, beyond these similarities, there is significant variation in arm ploidy profiles between tumour types.
Our focus on KIRC and kidney tissue was due to its distinct and well-described karyotype. Throughout this study, we compared our observation with previous experimental studies of KIRC to cross-validate our analysis. Turajlic et al. found that KIRC is generally characterised by recurrent copy-number variants in arms including, but not exclusive to, 3p, 5q, 7q, 8p, 9p, and 14q. SQSTM1, a gene which resides on 5q, an arm which shows relatively large gains in CCN in this study has been postulated to represent an alternative mechanism for activation of mTORC1 [30]. PBRM1, VHL, BAP1 and SETD2 all reside on 3p, a copy of which is known to be often lost at the outset of KIRC tumourigenesis. As such, we expected changes in 3p ploidy to feature prominently in our results. The association between 3p and KIRC was supported by the findings in this study, both whilst investigating the chromosome arm ploidy profile of KIRC compared to other tissues and while calculating feature importance as part of our classification, where we clearly found 3p performing as an important indicator for both PBRM1 and VHL mutations. We also found some association between 3p and 5q in KIRP tissue samples both in our correlation and signature analysis.
Of further interest is the prominence of 5q in our results which often sees a gain in ploidy. This increased importance may be the result of a number of factors. This gain in ploidy at 5q might be a direct or indirect result of the loss of heterozygosity and resulting vulnerability to loss of gene function at 3p [26]. Our random forest importance scores suggest that 5p is highly associated with both PBRM1 (0.088 Mean decrease in accuracy) and VHL (0.13) mutations across all tissues. Chen et al. [31] describe the gain of 5q as a common trait of papillary-enriched KIRC subtype which often includes mutation or amplification of MET.
VHL, a gene often found mutated early during the progression of KIRC and found associated with changes in ploidy in this study, has been shown to drive aneuploidy [32]. Loss of a functioning VHL gene (located on chromosome arm 3p) has previously been associated with spindle misorientation, chromosome instability and aneuploidy [14]. Similarly, PBRM1, a gene that encodes BAF180 and also located on chromosome 3p, has been shown to be important for the establishment or maintenance of cohesin on chromatin at centromeres. Loss of functioning PBRM1 has recently been reported as a driver of chromosomal instability and aneuploidy [33].
The observations in this study suggest that our analysis has been sensitive to known karyotypic patterns and may be useful for the detection of additional patterns.
As well as these commonly cited kidney tissue gene mutations, we identified a number of other mutations significantly associated to changes in kidney tissue karyotypes. We identified a number of genes associated with increased overall ploidy change, such as POLE in KIRC tissue and KRAS in KIRP. We also found several gene mutations associated with the innate immune system and chromosome organisation such as KDM6B, INCENP and H2AFJ that appear to contribute to broad changes in patterns ploidy patterns. Further, we evidenced genes significantly associated with the ploidy pattern signatures that we generated in this study, which in some cases described karotypes shared between tissue types.

Data Acquisition
In total, we analysed 3,559,315 samples (356,069 in our kidney group, 3,203,246 in our pan-cancer group) across 5756 patients (888 kidney, 4868 pan-cancer). Our source data consisted of somatic mutation data (BCM Curated or Automatic Somatic Mutation Calling) and copy number variant data (BI Genome-Wide SNP6) downloaded from the GDC data portal [16]. To compare the aneuploidy exhibited by different tumour types, we used the segment mean values in the CNV data provided by TCGA/GDC. The segment mean value is the log base 2 geometric mean of the ratio a sample's copy number over the wild-type copy number. Throughout this study, our CCN values were calculated directly from these segment mean values using the formula.
(2ˆSegment Mean) *2 Sex chromosomes were excluded from the study, thus the chromosomal copy number for healthy tissue is 2.
To ensure the data represented large-scale changes in ploidy rather than smaller, linear duplication CNVs, we sorted the sample data by length and removed the shortest samples observed in the bottom 25th percentile of data. Samples with a probe number of less than 10 were also filtered out to provide high confidence in the CNV data as per Laddha et al. [34].
All data preprocessing and analysis was completed using R 3.4.4 [35] and the machine learning component of the study was completed using python pandas version 0.19 [36].

Describing Chromosome Arm-Wise Ploidy Patterns
To prepare the data for modelling, we segmented CNV data for each patient by chromosome arm. Centromere position data were obtained from genome.ucsc.edu [37]. These data were organised with each patient as an observation and each of their chromosome arm's average CCN as a variable.
Samples with deleterious mutations that were likely to change the protein product or the functioning of the protein such as missense mutations, frame-shift insertion and deletions, nonsense mutations and in-frame insertions and deletions were labelled as mutations in the data whilst all other samples were labelled as not-mutated.
To calculate correlation between each mutation and overall genome ploidy, we took an absolute, positive value for each sample's segment mean to better measure and compare overall average ploidy change. This data was grouped by individual patients resulting in a list of all patients each with an average absolute genome-wide segment mean.
A list of patients featuring mutations in our chosen genes was then extracted from the somatic mutation data and used to filter our preprocessed CNV data to calculate the probability of each chosen gene mutation being associated with a general change in ploidy. For each gene, the cohort of patients with a mutation in that particular gene was sampled against a control group of all other available patients.
We used t-tests to calculate the probability that patients with mutations in each gene would exhibit a significant change in overall ploidy. This was repeated in our pan-cancer, KIRC and KIRP tissue datasets for comparison. All p-values were adjusted through the use of the Bonferroni procedure to correct for false discovery.

Arm Ploidy Correlations
To measure correlation between chromosome arms, we first stratified patients by tissue type and then measured correlational coefficients between each arm pair for all the samples within those groups.
Significance values were calculated using a permutation test where a mock distribution was produced using the same data for each arm pair but with one arm's data randomly permeated for 1000 samples. The real value was then compared to this mock distribution using a student t-test.

Generating Arm-Wise Ploidy Signatures
Arm-wise ploidy signatures were generated from the arm-wise ploidy data (as above) using non-negative matrix factorisation (NMF). We used a cophenetic correlation coefficient score via the R NMF library [38] to measure the stability of our models and to select the most stable component count. When associating a sample with a signature, we chose the highest scoring component in the coefficient matrix for that sample.

Gene Mutation and Pattern Change
To measure the distance between the arm-ploidy pattern of groups of patients with and without gene mutations, we took the median values of each arm for each group and measured distance using cosine similarity. We took the top 250 most frequently mutated genes in kidney cancers to analyse. Similar to the arm ploidy correlation analysis to measure significance for each of these gene mutations, we used permutation tests this time randomising the labelling of patients in the two group.

Finding Gene Mutations Enriched Within Signatures
To find the most frequently mutated genes in each signature, we stratified our samples by the most prominent signature (as above), counted the frequency of gene mutations in these groups and ranked by their frequency. To determine whether the frequency of mutations was significantly increased compared to that expected in the case of random association, we once again used permutation tests using randomised sampling of all patients from all signatures to create our mock distribution. In each case the number of randomly sampled patients was equal to the number of samples found in the respective signature.

Using Chromosome Arm Ploidy Patterns to Predict Gene Mutations
A boolean, stating whether the patient suffered a mutation in the respective gene or not, served as the label for each observation. Data was split randomly into training and testing groups with a test size of 0.2 and a training set size of 0.8 of all samples.
To measure the predictive power of chromosome arm-wise segment mean for specific gene mutations, receiver operating characteristic area under the curve (ROC AUC) scores were calculated. We initially trialled four different machine learning classifiers; Bernoulli Naive Bayes, Support Vector Machine, logistic regression and random forest. Hyper-parameters for each classifier were optimised using 5 fold cross validation.
Ultimately it was found that a random forest classifier with 1000 estimators and a minimum sample leaf size of 30 performed consistently better when compared to the other classifiers and so ROC AUC scores given in this study were all a result of this classifier For each model, feature importance was calculated by measuring the mean decrease in classifier accuracy with the removal of each feature across all trees in the random forest. This metric was reported as the mean decrease in accuracy given the removal of a feature.
Using this measure of the mean decrease in accuracy, we ranked and identified chromosome arms that were most commonly lost or gained when specific genes are mutated. This analysis was applied to KIRC tissue data alone and finally to all tissue types excluding KIRC for comparison (Supplementary Table S1).
The source code and data for this study is available via bitbucket at https://bitbucket.org/ bioinformatics_lab_sussex/ploidy_nmf. The source data required for running this analysis, including copy number and mutation data is available at: https://bitbucket.org/bioinformatics_lab_sussex/ploidy_ nmf/downloads/.